<h1 align=center>Project: Titanic - Machine Learning from Disaster</h1>
<hr>
<h2 align=center>Testing ML models</h2>

**Import libraries.**

In [2]:
import pandas as pd
import sklearn as sk
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler, LabelEncoder
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier
import pickle
import xgboost as xgb


<h3 align=center>1.-Read the data and store it in the corresponding dataframes.</h3>

In [3]:
data_test = pd.read_csv("../datasets/test.csv")
data_train = pd.read_csv("../datasets/train.csv")

In [198]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
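# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas.
data_test["Fare"] = data_test["Fare"].fillna(data_test["Fare"].median())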

In [5]:

# Bin age into ranges and one-hot encode the bins (columns '0' to '6').
# pd.cut uses right-closed intervals by default, so bin '0' covers ages in (0, 10].
# Rows with a missing Age fall in no bin and get zeros in every dummy column.
bins = [0, 10, 20, 30, 40, 50, 60, 100]
names = ['0', '1', '2', '3', '4', '5', '6']
data_train['Age'] = pd.cut(data_train['Age'], bins, labels=names)
data_test['Age'] = pd.cut(data_test['Age'], bins, labels=names)
data_train = pd.concat([data_train, pd.get_dummies(data_train['Age'])], axis=1)
data_test = pd.concat([data_test, pd.get_dummies(data_test['Age'])], axis=1)
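
The binning step can be checked on a handful of values before applying it to the full dataframes. The sketch below reuses the same `bins` and `names`; the sample ages are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 100]
names = ['0', '1', '2', '3', '4', '5', '6']

# Hypothetical ages, including a boundary value (10.0) and a missing value.
ages = pd.Series([4.0, 10.0, 10.5, 35.0, None, 80.0])

# Intervals are right-closed: 10.0 lands in bin '0', 10.5 in bin '1'.
binned = pd.cut(ages, bins, labels=names)

# One indicator column per bin; the row with the missing age is all zeros.
dummies = pd.get_dummies(binned)

print(binned.tolist())
print(dummies.astype(int))
```

Only the column named `'0'` (ages up to 10) is used as a feature later in the notebook, so it acts as a "child" indicator.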

<h3 align=center>2.-Build the pipelines.</h3>

Set the model to use. To switch models, change only the following line of code; the pipelines do not need to be modified.

In [120]:
#model = DecisionTreeClassifier(criterion='gini', max_depth=8, min_samples_leaf=1, min_samples_split=2)  # an integer min_samples_split must be >= 2 in current scikit-learn
#model = KNeighborsClassifier(algorithm='auto', leaf_size=10, n_neighbors=9)
# Note: scikit-learn 1.2 renamed BaggingClassifier's `base_estimator` argument to `estimator`.
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini', max_depth=6), n_estimators=10, max_samples=1.0, max_features=0.7)
#model = RandomForestClassifier(criterion='gini', max_depth=6, min_samples_split=6, n_estimators=110,min_samples_leaf=1)#
#model = SVC(C= 10, gamma= 0.1, kernel= 'rbf')
#model = xgb.XGBClassifier()
#model = xgb.XGBClassifier(colsample_bytree= 0.8,gamma= 1, max_depth= 3, min_child_weight= 1, subsample= 1.0)

Split the data into training and test sets.

In [7]:
x= data_train[['Pclass', 'Sex', 'Fare','Embarked','0','SibSp']]
y= data_train['Survived']


In [8]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=60, stratify=y)
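
Passing `stratify=y` keeps the class proportions of `Survived` the same in both splits. The sketch below illustrates this with stand-in labels only. The class counts (549 non-survivors, 342 survivors) are an assumption about the 891-row training set and are not shown anywhere in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Assumed class counts for the 891 training rows: 549 zeros and 342 ones.
labels = [0] * 549 + [1] * 342
rows = list(range(len(labels)))  # stand-in for the feature rows

_, rows_test, _, labels_test = train_test_split(
    rows, labels, test_size=0.20, random_state=60, stratify=labels
)

# With stratification the test split mirrors the overall class ratio.
test_counts = Counter(labels_test)
print(len(labels_test), test_counts)
```

Under that assumption the test split holds 179 rows, which is the support total the classification reports in section 3 show.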

The following variables determine which transformation is applied to each column. To change the assignment, edit the contents of these variables; the pipelines themselves do not need to change.

In [9]:
nominal =['Sex','Embarked','0' ]
ordinal =['Pclass' ]
numerical = ['Fare','SibSp']

Define the pipelines that transform the data.

In [121]:
# Pipeline for numerical data.
numerical_pipeline = Pipeline([('escaler', StandardScaler())])
# Pipeline for ordinal data.
ordinal_pipeline = Pipeline([("encoder", OrdinalEncoder())])
# Pipeline for nominal data.
nominal_pipeline = Pipeline([("encoder2", OneHotEncoder())])

# Combine the three pipelines in a single ColumnTransformer.
preprocessin_pipeline = ColumnTransformer([("ordinal_preprocesor", ordinal_pipeline, ordinal),
                                            ("nominal_preprocessor", nominal_pipeline, nominal),
                                            ("numerical_preprocessor",numerical_pipeline, numerical) ])
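
To see how a `ColumnTransformer` routes each group of columns to its own encoder, here is a minimal sketch on a made-up four-row frame. The toy values and the variable names `toy` and `toy_transformer` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical rows with one ordinal, one nominal and one numerical column.
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["female", "male", "male", "female"],
    "Fare": [71.28, 7.25, 13.0, 8.05],
})

toy_transformer = ColumnTransformer([
    ("ordinal", OrdinalEncoder(), ["Pclass"]),   # 1 output column
    ("nominal", OneHotEncoder(), ["Sex"]),       # 1 column per category
    ("numerical", StandardScaler(), ["Fare"]),   # 1 output column
])

encoded = toy_transformer.fit_transform(toy)
print(encoded.shape)
```

Columns that are not listed in any transformer are dropped by default (`remainder='drop'`), so the output contains only the encoded features.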

Build the complete pipeline.

In [122]:
complete_pipeline= Pipeline ([("preprocessor", preprocessin_pipeline), ("estimator", model)])
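
Because the final step is named `"estimator"`, its hyperparameters are addressed as `estimator__<parameter>`, with a double underscore. A quick way to list the valid names is `get_params()`. The sketch below uses a simplified stand-in pipeline (`demo_pipeline`, an assumption for illustration) rather than the full one.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in pipeline with the same step names as the notebook's pipeline.
demo_pipeline = Pipeline([("preprocessor", StandardScaler()), ("estimator", SVC())])

# Every key returned here is a valid name for set_params or a search grid.
valid_names = sorted(demo_pipeline.get_params().keys())
print([name for name in valid_names if name.startswith("estimator__")])

# Double underscore: <step name>__<parameter name>.
demo_pipeline.set_params(estimator__C=10, estimator__gamma=0.1)
```

A name with a single underscore, such as `estimator_C`, is not a valid parameter and makes `set_params` and the search classes raise an error.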

<h3 align=center>3.-Test the models.</h3>

**A.-Use GridSearch to guide the choice of hyperparameters.**

In [59]:
params_xgb = {'estimator__n_estimators':range(20,220,10),
        'estimator__seed':range(4, 60, 4),
        'estimator__min_child_weight': [1, 5, 10],
        'estimator__gamma': [0.5, 1, 1.5, 2, 5],
        'estimator__subsample': [0.6, 0.8, 1.0],
        'estimator__colsample_bytree': [0.6, 0.8, 1.0],
        'estimator__max_depth': [3, 4, 5]
        }



In [36]:
# DecisionTreeClassifier parameters

params_tree = {
             'estimator__criterion' : ['gini', 'entropy'],
             'estimator__min_samples_split':range(2,10),  # integer values must be >= 2
             'estimator__min_samples_leaf':range(1,10),
             'estimator__max_depth':range(1,20) 
}


In [41]:

params_RandomTree = {
             'estimator__criterion' : ['gini', 'entropy'],
             'estimator__n_estimators': range(20,220,10),
             'estimator__min_samples_split':range(2,10),  # integer values must be >= 2
             'estimator__min_samples_leaf':range(1,10),
             'estimator__max_depth':range(1,20) 
}

In [None]:
params_knn = {'estimator__n_neighbors': np.arange(1,20),
            'estimator__weights': ['uniform', 'distance'],
            'estimator__leaf_size': [1,3,5,7,10],
            'estimator__algorithm':['auto','kd_tree']}

In [None]:
# Pipeline parameters use a double underscore: <step name>__<parameter>.
param_svc = {'estimator__C':[1,10,100,1000],'estimator__gamma':[1,0.1,0.001,0.0001], 'estimator__kernel':['linear','rbf']}

In [None]:
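# Keys carry the pipeline step name ("estimator"). The nested "base_estimator" is the
# BaggingClassifier argument, which is named "estimator" from scikit-learn 1.2 onward.
params_bagging = {
          "estimator__base_estimator__max_depth": [3,5,10,20],
          "estimator__base_estimator__max_features": [None, "sqrt"],  # "sqrt" is what "auto" meant for classifiers
          "estimator__base_estimator__min_samples_leaf": [1, 3, 5, 7, 10],
          "estimator__base_estimator__min_samples_split": [2, 5, 7],
          'estimator__bootstrap_features': [False, True],
          'estimator__max_features': [0.5, 0.7, 1.0],
          'estimator__max_samples': [0.5, 0.7, 1.0],
          'estimator__n_estimators': [2, 5, 10, 20],
}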

In [42]:
model = GridSearchCV(complete_pipeline, param_grid=params_RandomTree, cv=5)#, scoring = ['accuracy', 'recall'], refit = 'accuracy' )
model.fit(X_train, y_train)


KeyboardInterrupt: 
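
The interrupted run is consistent with the size of the search: an exhaustive grid fits one model per parameter combination per fold. The sketch below counts the fits for a grid shaped like the random-forest one. It assumes `min_samples_split` starts at 2, and `cv_folds` mirrors the `cv=5` argument.

```python
from math import prod

# Grid shaped like the random-forest grid (min_samples_split from 2 is an assumption).
grid = {
    "estimator__criterion": ["gini", "entropy"],
    "estimator__n_estimators": range(20, 220, 10),
    "estimator__min_samples_split": range(2, 10),
    "estimator__min_samples_leaf": range(1, 10),
    "estimator__max_depth": range(1, 20),
}
cv_folds = 5

# Number of candidates is the product of the option counts per parameter.
n_candidates = prod(len(values) for values in grid.values())
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 54720 273600
```

For a search this large, `RandomizedSearchCV` (already imported at the top) evaluates only a fixed number of sampled candidates, set through its `n_iter` parameter, and narrowing the ranges also reduces the cost.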

In [None]:
print("Best hyperparameters: "+str(model.best_params_))
print("Best score: "+str(model.best_score_)+'\n')

scores = pd.DataFrame(model.cv_results_)
scores
#'estimator__criterion': 'gini', 'estimator__max_depth': 8, 'estimator__min_samples_leaf': 1, 'estimator__min_samples_split': 1

Best hyperparameters: {'estimator__colsample_bytree': 0.8, 'estimator__gamma': 1, 'estimator__max_depth': 3, 'estimator__min_child_weight': 1, 'estimator__subsample': 1.0}
Best score: 0.8286910272825766



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_estimator__colsample_bytree,param_estimator__gamma,param_estimator__max_depth,param_estimator__min_child_weight,param_estimator__subsample,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.072960,0.029063,0.011927,0.002440,0.6,0.5,3,1,0.6,"{'estimator__colsample_bytree': 0.6, 'estimato...",0.790210,0.825175,0.802817,0.823944,0.823944,0.813218,0.014217,116
1,0.061137,0.005331,0.008614,0.000823,0.6,0.5,3,1,0.8,"{'estimator__colsample_bytree': 0.6, 'estimato...",0.783217,0.832168,0.795775,0.809859,0.845070,0.813218,0.022757,116
2,0.075342,0.027963,0.014982,0.004318,0.6,0.5,3,1,1.0,"{'estimator__colsample_bytree': 0.6, 'estimato...",0.783217,0.832168,0.802817,0.823944,0.830986,0.814626,0.018911,95
3,0.133262,0.010979,0.020282,0.002789,0.6,0.5,3,5,0.6,"{'estimator__colsample_bytree': 0.6, 'estimato...",0.783217,0.790210,0.795775,0.838028,0.809859,0.803418,0.019392,236
4,0.131177,0.019999,0.018459,0.001610,0.6,0.5,3,5,0.8,"{'estimator__colsample_bytree': 0.6, 'estimato...",0.776224,0.811189,0.795775,0.845070,0.823944,0.810440,0.023529,148
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
400,0.077087,0.003097,0.007812,0.002304,1.0,5,5,5,0.8,"{'estimator__colsample_bytree': 1.0, 'estimato...",0.776224,0.797203,0.774648,0.830986,0.830986,0.802009,0.024964,257
401,0.071896,0.001958,0.011937,0.002542,1.0,5,5,5,1.0,"{'estimator__colsample_bytree': 1.0, 'estimato...",0.783217,0.804196,0.809859,0.852113,0.830986,0.816074,0.023582,81
402,0.063945,0.002106,0.009393,0.001385,1.0,5,5,10,0.6,"{'estimator__colsample_bytree': 1.0, 'estimato...",0.776224,0.783217,0.774648,0.816901,0.781690,0.786536,0.015519,396
403,0.067480,0.003774,0.009203,0.003207,1.0,5,5,10,0.8,"{'estimator__colsample_bytree': 1.0, 'estimato...",0.783217,0.797203,0.802817,0.823944,0.809859,0.803408,0.013490,240


In [46]:
prediction = model.predict(X_test)
report = classification_report(y_test,prediction)
print("Classification report:")
print(report)

Classification report:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       110
           1       0.79      0.70      0.74        69

    accuracy                           0.81       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179



**B.-Test models using the settings suggested by GridSearch.**

In [123]:
model =complete_pipeline.fit(X_train,y_train)



Evaluate the model's performance.

In [124]:
prediction = complete_pipeline.predict(X_test)


In [125]:
cm = confusion_matrix(y_test,prediction)
print("Confusion matrix:")
print(cm)

Confusion matrix:
[[100  10]
 [ 17  52]]
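
The per-class figures that `classification_report` prints can be derived by hand from this matrix, which is a useful sanity check. The sketch below copies the four counts from the output above; `per_class_metrics` is a helper written for this illustration, not a scikit-learn function.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = [[100, 10],
      [17, 52]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum: everything predicted as cls
    actual = sum(matrix[cls])                    # row sum: everything that really is cls
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal (correct predictions) over all samples.
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for cls in (0, 1):
    p, r, f = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For class 1, for example, precision is 52 / (10 + 52) and recall is 52 / (17 + 52), so the model misses roughly a quarter of the actual survivors in the test split.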


In [126]:

report = classification_report(y_test,prediction)
print("Classification report:")
print(report)

Classification report:
              precision    recall  f1-score   support

           0       0.85      0.91      0.88       110
           1       0.84      0.75      0.79        69

    accuracy                           0.85       179
   macro avg       0.85      0.83      0.84       179
weighted avg       0.85      0.85      0.85       179



**C.-With the model chosen, train it on the full training dataset and submit the predictions for the test data to Kaggle in .csv format.**

In [97]:
model =complete_pipeline.fit(x,y)



In [98]:
prediction = pd.DataFrame()


In [99]:
# Select the same feature columns the pipeline was trained on.
x_test_completo = data_test[['Pclass', 'Sex', 'Fare','Embarked','0','SibSp']]
# Kaggle's submission format names the ID column 'PassengerId'.
prediction['PassengerId'] = data_test['PassengerId']
prediction['Survived'] = complete_pipeline.predict(x_test_completo)

In [258]:
prediction.set_index('PassengerId')

PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
...,...
1305,0
1306,1
1307,0
1308,0


In [100]:
prediction.to_csv(path_or_buf='resultados/bagging.csv' , index=False)

In [None]:
"""
To save and reload a model with pickle:
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
    pickled_model = pickle.load(f)
pickled_model.predict(X_test)
"""