#**Classification Algorithms Workshop**
###Team members:
- Daniel Esteban Méndez Díaz
- Juan David Godoy Valencia
- Johan Santiago Ramos Duarte
----

# Titanic Survivors

On April 15, 1912, the Titanic sank after striking an iceberg. Because there were not enough lifeboats, 1502 of the 2224 passengers and crew died.

Although luck played a part in who survived, some groups may have had a better chance of surviving than others.

The goal is to build an ML model that answers the question: what kind of person was most likely to survive? The available information includes name, age, sex, ticket class, and port of embarkation, among other fields.

## Import libraries


In [None]:
import numpy as np
import pandas as pd
import requests
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

## Load the dataset

SibSp: Number of siblings/spouses aboard.

Parch: Number of parents/children aboard.

Embarked: Port of embarkation.


In [None]:
url = "https://www.dropbox.com/s/g19rqwd53co5dh1/titanic.csv?dl=1"
response = requests.get(url)
filename = 'titanic.csv'
with open(filename, 'wb') as file:
    file.write(response.content)
data = pd.read_csv(filename, sep=';')
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,712.833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30,C148,C


## Dataset description


In [None]:
print(data.columns)
data.describe()

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch
count,891.0,891.0,891.0,714.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057
min,1.0,0.0,1.0,0.42,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0
50%,446.0,0.0,3.0,28.0,0.0,0.0
75%,668.5,1.0,3.0,38.0,1.0,0.0
max,891.0,1.0,3.0,80.0,8.0,6.0


## Filter and transform features and labels

#### Identify and remove irrelevant columns
Which columns are irrelevant? Remove them.

In [None]:
print(data['PassengerId'].nunique())  # Irrelevant
print(data['Ticket'].nunique())  # Irrelevant
print(data['Name'].nunique())  # Irrelevant
print(data['Pclass'].nunique())
print(data['SibSp'].nunique())
print(data['Parch'].nunique())
print(data['Fare'].nunique())
print(data['Cabin'].nunique())  # Irrelevant
# Number of missing values in the Cabin column
print(len(data[data['Cabin'].isna()]))

# Drop the irrelevant columns
del data['PassengerId']
del data['Ticket']
del data['Name']
del data['Cabin']

data

891
681
891
3
7
7
247
147
687


Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,712.833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S
...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13,S
887,1,1,female,19.0,0,0,30,S
888,0,3,female,,1,2,23.45,S
889,1,1,male,26.0,0,0,30,C
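
The counts printed above can be turned into simple ratios that make the case for dropping these columns. The sketch below is plain Python and hard-codes the numbers from the output. The variable names and the 0.75 threshold are illustrative assumptions, not part of the original notebook.

```python
# Numbers copied from the output above (891 rows in total).
n_rows = 891
unique_counts = {"PassengerId": 891, "Ticket": 681, "Name": 891,
                 "Pclass": 3, "SibSp": 7, "Parch": 7, "Fare": 247, "Cabin": 147}
cabin_missing = 687

# Share of distinct values per column: a ratio near 1 means the column
# behaves like an identifier and carries little generalizable signal.
unique_ratio = {col: n / n_rows for col, n in unique_counts.items()}

# Hypothetical threshold, chosen only for illustration.
id_like = [col for col, ratio in unique_ratio.items() if ratio > 0.75]

print(id_like)
print(f"Cabin missing: {cabin_missing / n_rows:.1%}")
```

Under this threshold `PassengerId`, `Ticket` and `Name` are flagged as identifier-like, while `Cabin` stands out because most of its values are missing rather than because of its cardinality.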


#### Process the remaining variables


In [None]:
# Convert sex to numeric (assignment instead of chained inplace replace)
data['Sex'] = data['Sex'].map({'male':0, 'female':1})
# Normalize Pclass
data['Pclass'] = data['Pclass'] / data['Pclass'].max()
# Normalize SibSp (number of siblings/spouses aboard)
data['SibSp'] = data['SibSp'] / data['SibSp'].max()
# Normalize Parch (number of parents/children aboard)
data['Parch'] = data['Parch'] / data['Parch'].max()
# Convert Fare (ticket price) to a number by stripping every '.' (e.g. '7.25' -> 725.0)
data['Fare'] = data['Fare'].apply(lambda x:str(x).replace('.', '')).astype(float)
# Normalize Fare (ticket price)
data['Fare'] = data['Fare'] / data['Fare'].max()
# One-hot encode the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
data['E_C'] = (data['Embarked']=='C').astype(int)
data['E_Q'] = (data['Embarked']=='Q').astype(int)
data['E_S'] = (data['Embarked']=='S').astype(int)
del data['Embarked']
# Convert age to numeric
data['Age'] = data['Age'].astype(float)
# Impute missing ages with the mean
data['Age'] = data['Age'].fillna(data['Age'].mean())
# Normalize Age
data['Age'] = data['Age'] / data['Age'].max()
data.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,E_C,E_Q,E_S
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,0.769547,0.352413,0.371239,0.065376,0.063599,0.024918,0.188552,0.08642,0.722783
std,0.486592,0.27869,0.47799,0.162525,0.137843,0.134343,0.080246,0.391372,0.281141,0.447876
min,0.0,0.333333,0.0,0.00525,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.666667,0.0,0.275,0.0,0.0,3.1e-05,0.0,0.0,0.0
50%,0.0,1.0,0.0,0.371239,0.0,0.0,0.000512,0.0,0.0,1.0
75%,1.0,1.0,1.0,0.4375,0.125,0.0,0.015412,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
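
To make the `Fare` conversion concrete, the sketch below applies the same string transformation to a few values shown in the table above. It assumes those values are stored as strings exactly as displayed. The list `raw_fares` and the other names are illustrative only.

```python
# Sample Fare values as they appear in the table above (assumed to be strings).
raw_fares = ["7.25", "712.833", "53.1", "13", "30"]

# Same transformation as the notebook: drop every '.' and cast to float.
converted = [float(str(x).replace(".", "")) for x in raw_fares]

# Divide by the maximum of this small sample (the notebook uses the column maximum).
scaled = [value / max(converted) for value in converted]

for raw, conv in zip(raw_fares, converted):
    print(raw, "->", conv)
```

Because the dots are removed rather than read as decimal points, `'7.25'` ends up larger than `'13'` after conversion. It may be worth checking how the raw file encodes decimals before relying on this feature.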


## Separate features from labels
Split the features and labels into X and y. Count the number of examples in each class.



In [None]:
y = data['Survived']
del data['Survived']
X = data
print(y.shape)
print(X.shape)
print(y.describe())
print('Class 0:', y[y==0].shape, len(y[y==0])/len(y))
print('Class 1:', y[y==1].shape, len(y[y==1])/len(y))

(891,)
(891, 9)
count    891.000000
mean       0.383838
std        0.486592
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64
Class 0: (549,) 0.6161616161616161
Class 1: (342,) 0.3838383838383838
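
The class counts above can also be obtained in one pass. The following is a minimal standard-library sketch that rebuilds a label vector from the printed counts, so the list `labels` is an assumption standing in for `y`. In pandas, `y.value_counts(normalize=True)` reports the same proportions.

```python
from collections import Counter

# Label vector rebuilt from the counts printed above (549 zeros, 342 ones).
labels = [0] * 549 + [1] * 342

counts = Counter(labels)
proportions = {cls: n / len(labels) for cls, n in counts.items()}

for cls in sorted(counts):
    print(f"Class {cls}: {counts[cls]} examples ({proportions[cls]:.1%})")
```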


## Create the training and test sets
80% training, 20% test


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=0)
print('Training: ', y_train.shape)
print('Test: ', y_test.shape)

Training:  (712,)
Test:  (179,)
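
With imbalanced classes it can help to keep the class proportions equal in both subsets. `train_test_split` accepts a `stratify=y` argument for this purpose. The hand-rolled sketch below only illustrates the idea and is not scikit-learn's implementation. The function name `stratified_split` and the rebuilt `labels` list are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # test share taken per class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Label vector rebuilt from the class counts of the dataset (549 / 342).
labels = [0] * 549 + [1] * 342
train_idx, test_idx = stratified_split(labels)
test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_positive_share, 3))
```

Rounding is applied per class, so the subset sizes can differ by one example from those produced by scikit-learn.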


## Baseline
To put the performance of the classification models in context, we define a baseline classifier that labels every example in the test set as the majority class.

In [None]:
y_baseline = pd.Series([0]*len(y_test))
print(y_baseline)

0      0
1      0
2      0
3      0
4      0
      ..
174    0
175    0
176    0
177    0
178    0
Length: 179, dtype: int64
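
The baseline above hard-codes class 0. A slightly more general sketch derives the majority class from the training labels instead. The helper `majority_baseline` and the toy labels are illustrative assumptions. scikit-learn offers the same idea through `DummyClassifier(strategy="most_frequent")`.

```python
from collections import Counter

def majority_baseline(train_labels, n_predictions):
    """Predict the most frequent training label for every example."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n_predictions

# Toy labels for illustration only (not the notebook's actual split).
toy_train = [0, 0, 1, 0, 1, 0]
predictions = majority_baseline(toy_train, n_predictions=4)
print(predictions)  # [0, 0, 0, 0]
```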


## Evaluate the baseline's performance


In [None]:
print(confusion_matrix(y_test, y_baseline))
print(classification_report(y_test, y_baseline))

[[110   0]
 [ 69   0]]
              precision    recall  f1-score   support

           0       0.61      1.00      0.76       110
           1       0.00      0.00      0.00        69

    accuracy                           0.61       179
   macro avg       0.31      0.50      0.38       179
weighted avg       0.38      0.61      0.47       179
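
The figures in this report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, recall and F1 from the matrix printed above. The function `class_metrics` is an illustrative helper, and reporting undefined ratios as 0.0 is an assumption chosen to match the report.

```python
def class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a square confusion matrix.

    cm[i][j] counts examples of true class i predicted as class j.
    Undefined ratios (0/0) are reported as 0.0.
    """
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column total
    actual = sum(cm[cls])                    # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

baseline_cm = [[110, 0], [69, 0]]
accuracy = sum(baseline_cm[i][i] for i in range(2)) / sum(map(sum, baseline_cm))
print(round(accuracy, 2))
for cls in (0, 1):
    print(cls, [round(v, 2) for v in class_metrics(baseline_cm, cls)])
```

Class 1 is never predicted, so its precision is a 0/0 ratio. That is why the baseline scores 0.00 on every metric for that class.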





## Logistic Regression
Train a logistic regression model and evaluate its performance.

In [None]:
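model = LogisticRegression(n_jobs=-1)
model.fit(X_train, y_train)
# Stack the intercept (w_0) and the feature weights into a single vector
w = np.hstack([np.array([model.intercept_[0]]), model.coef_[0]])
print('w_0',model.intercept_[0])
# One row per feature: learned weight and feature name
coefs = pd.DataFrame(w[1:], columns=['w'])
coefs['feat_name'] = X.columns
coefs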

w_0 1.6372395051564106


Unnamed: 0,w,feat_name
0,-2.588974,Pclass
1,2.474846,Sex
2,-2.09457,Age
3,-1.500314,SibSp
4,-0.29962,Parch
5,0.867363,Fare
6,0.004408,E_C
7,0.070268,E_Q
8,-0.332903,E_S
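
One common way to read logistic regression weights is as odds ratios: `exp(w)` is the factor by which the odds of survival change when a feature increases by one unit with the others held fixed. The sketch below applies this to the coefficients printed above. The dictionary is copied by hand and the variable names are illustrative.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "Pclass": -2.588974, "Sex": 2.474846, "Age": -2.09457,
    "SibSp": -1.500314, "Parch": -0.29962, "Fare": 0.867363,
    "E_C": 0.004408, "E_Q": 0.070268, "E_S": -0.332903,
}

# exp(w): multiplicative change in the odds of survival per unit increase.
odds_ratios = {name: math.exp(w) for name, w in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>6}: {ratio:.2f}")
```

Because the features were divided by their maximum, a one-unit change is a large step. For `Sex` it is the difference between male (0) and female (1), which multiplies the estimated odds of survival by roughly 12.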


## Decision tree
Train a decision tree and evaluate its performance.

In [None]:
algo = DecisionTreeClassifier(random_state=0)

params = {'criterion':['gini', 'entropy'],
          'splitter':['best', 'random'],
          'max_depth':[None, 5, 10],
          'min_samples_split':[2, 10, 50],
          'min_impurity_decrease':[0.0, 1e-4, 1e-3, 1e-2]}

best_model = GridSearchCV(algo, params, verbose=3, n_jobs=-1, cv=5, return_train_score=True, scoring='accuracy')
best_model.fit(X_train, y_train)

tree = best_model.best_estimator_

print(best_model.best_score_)
print(best_model.best_params_)

y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 144 candidates, totalling 720 fits
0.8118191667487442
{'criterion': 'gini', 'max_depth': None, 'min_impurity_decrease': 0.001, 'min_samples_split': 10, 'splitter': 'best'}
[[99 11]
 [21 48]]
              precision    recall  f1-score   support

           0       0.82      0.90      0.86       110
           1       0.81      0.70      0.75        69

    accuracy                           0.82       179
   macro avg       0.82      0.80      0.81       179
weighted avg       0.82      0.82      0.82       179
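
The "144 candidates, totalling 720 fits" message follows from the size of the parameter grid: the search tries every combination of values and fits each one once per fold. A small standard-library sketch of that arithmetic is shown below. The grid is copied from the cell above and the variable names are illustrative.

```python
from itertools import product
from math import prod

params = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 10, 50],
    "min_impurity_decrease": [0.0, 1e-4, 1e-3, 1e-2],
}
cv = 5

n_candidates = prod(len(values) for values in params.values())
n_fits = n_candidates * cv
print(n_candidates, n_fits)  # 144 720

# The candidates themselves are the Cartesian product of the value lists.
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]
```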



## Random Forest
Train a <i>random forest</i> model and evaluate its performance.

In [None]:
algo = RandomForestClassifier(n_estimators=500, max_features='sqrt', bootstrap=True, n_jobs=-1, random_state=10)
params = {'criterion':['gini', 'entropy'],
          'max_depth': [5, 10, None],
          'min_samples_split':[10, 50],
          'min_impurity_decrease':[0.0, 1e-2]}

#best_model = RandomizedSearchCV(algo, params, n_iter=50, verbose=3, n_jobs=-1, cv=5, return_train_score=True, scoring='accuracy')
best_model = GridSearchCV(algo, params, verbose=3, n_jobs=-1, cv=5, return_train_score=True, scoring='accuracy')
best_model.fit(X_train, y_train)

rf = best_model.best_estimator_

print(best_model.best_score_)
print(best_model.best_params_)

y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 24 candidates, totalling 120 fits
0.8258150300403821
{'criterion': 'gini', 'max_depth': 10, 'min_impurity_decrease': 0.0, 'min_samples_split': 10}
[[103   7]
 [ 21  48]]
              precision    recall  f1-score   support

           0       0.83      0.94      0.88       110
           1       0.87      0.70      0.77        69

    accuracy                           0.84       179
   macro avg       0.85      0.82      0.83       179
weighted avg       0.85      0.84      0.84       179
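
To get a feel for the cost of this search, the following sketch works out how many features each split considers under `max_features='sqrt'` and how many trees are trained overall. The numbers come from the cell above. The variable names and the integer truncation of the square root are stated here as assumptions of this sketch.

```python
import math

n_features = 9             # columns in X, see X.shape above
n_estimators = 500
cv = 5
grid_sizes = [2, 3, 2, 2]  # criterion, max_depth, min_samples_split, min_impurity_decrease

# max_features='sqrt' considers about sqrt(n_features) features per split.
features_per_split = max(1, int(math.sqrt(n_features)))

n_candidates = math.prod(grid_sizes)
n_fits = n_candidates * cv
trees_trained = n_fits * n_estimators  # excludes the final refit on all training data
print(features_per_split, n_candidates, n_fits, trees_trained)
```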



## Support vector machine
Train a support vector machine and evaluate its performance.

In [None]:
algo = SVC(kernel='rbf', gamma='scale', random_state=10)
params = {'C': np.logspace(-3, 3, 7)}

best_model = GridSearchCV(algo, params, verbose=3, n_jobs=-1, cv=5, return_train_score=True, scoring='accuracy')
best_model.fit(X_train, y_train)

svc = best_model.best_estimator_

print(best_model.best_score_)
print(best_model.best_params_)

y_pred = svc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 7 candidates, totalling 35 fits
0.8159657244164287
{'C': 100.0}
[[100  10]
 [ 24  45]]
              precision    recall  f1-score   support

           0       0.81      0.91      0.85       110
           1       0.82      0.65      0.73        69

    accuracy                           0.81       179
   macro avg       0.81      0.78      0.79       179
weighted avg       0.81      0.81      0.81       179
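
The `rbf` kernel measures similarity as `exp(-gamma * ||x - z||^2)`, and with `gamma='scale'` scikit-learn sets `gamma` to `1 / (n_features * X.var())`. The plain-Python sketch below illustrates both formulas on a toy matrix. `X_toy` and the helper names are assumptions for illustration, not the notebook's data.

```python
import math

def scale_gamma(X):
    """gamma='scale': 1 / (n_features * variance of all entries of X)."""
    values = [v for row in X for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(X[0]) * variance)

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Toy feature matrix for illustration only.
X_toy = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
gamma = scale_gamma(X_toy)
print(rbf_kernel(X_toy[0], X_toy[0], gamma))
print(rbf_kernel(X_toy[0], X_toy[1], gamma))
```

Identical points have similarity 1, and the value decays toward 0 as points move apart. The grid over `C` then controls how strongly misclassified training points are penalized.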



## Neural network
Train a neural network and evaluate its performance.

In [None]:
layer_size = len(X.columns)

algo = MLPClassifier(max_iter=500, learning_rate='adaptive', warm_start=True, early_stopping=True, random_state=0)

params = {'hidden_layer_sizes':[(layer_size,), (layer_size,layer_size), (layer_size, layer_size, layer_size), (layer_size, layer_size, layer_size, layer_size), (layer_size, layer_size, layer_size, layer_size, layer_size), (layer_size, layer_size, layer_size, layer_size, layer_size, layer_size)],
          'activation': ['tanh', 'relu', 'logistic'],
          'alpha': np.logspace(-3, 3, 7)}

best_model = GridSearchCV(algo, params, verbose=3, n_jobs=-1, cv=5, return_train_score=True, scoring='accuracy')
best_model.fit(X_train, y_train)

mlp = best_model.best_estimator_

print(best_model.best_score_)
print(best_model.best_params_)

y_pred = mlp.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 126 candidates, totalling 630 fits
0.7359795134443022
{'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': (9,)}
[[98 12]
 [25 44]]
              precision    recall  f1-score   support

           0       0.80      0.89      0.84       110
           1       0.79      0.64      0.70        69

    accuracy                           0.79       179
   macro avg       0.79      0.76      0.77       179
weighted avg       0.79      0.79      0.79       179
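
The list of architectures in the grid above (one to six hidden layers of equal width) can be generated rather than typed out. The sketch below shows one way to do so and recomputes the number of candidates. It assumes `layer_size = 9`, the number of feature columns reported earlier.

```python
layer_size = 9  # len(X.columns) in the notebook

# One to six hidden layers, each with `layer_size` units.
hidden_layer_sizes = [(layer_size,) * depth for depth in range(1, 7)]
print(hidden_layer_sizes[0], hidden_layer_sizes[-1])

activations = ["tanh", "relu", "logistic"]
n_alphas = 7  # np.logspace(-3, 3, 7) yields 7 values
n_candidates = len(hidden_layer_sizes) * len(activations) * n_alphas
print(n_candidates)  # 126
```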



## K-nearest neighbors
Solve the classification problem using the k-nearest neighbors algorithm.

In [None]:
algo = KNeighborsClassifier()
params = {'n_neighbors':[1, 5, 125],
          'weights':['uniform', 'distance'],
          'p':range(1, 4)}

best_model = GridSearchCV(algo, params, verbose=3, n_jobs=-1, cv=5, return_train_score=True, scoring='accuracy')
best_model.fit(X_train, y_train)

knn = best_model.best_estimator_

print(best_model.best_score_)
print(best_model.best_params_)

y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 18 candidates, totalling 90 fits
0.8075839653304442
{'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}
[[100  10]
 [ 23  46]]
              precision    recall  f1-score   support

           0       0.81      0.91      0.86       110
           1       0.82      0.67      0.74        69

    accuracy                           0.82       179
   macro avg       0.82      0.79      0.80       179
weighted avg       0.82      0.82      0.81       179
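
The `p` parameter searched above is the order of the Minkowski distance used to find neighbors: `p=1` is the Manhattan distance and `p=2` the Euclidean distance. A minimal sketch of the formula, with an illustrative helper name and toy points:

```python
def minkowski(x, z, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
for p in range(1, 4):
    print(p, round(minkowski(a, b, p), 3))
```

For these two points the distance shrinks from 7 (`p=1`) to 5 (`p=2`) and keeps decreasing as `p` grows, which changes which training examples count as nearest.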



#**Conclusion**
The random forest achieves the best accuracy on the test set (84%), followed by the decision tree (82%). This suggests that tree-based models are the best classifiers for this problem among the algorithms compared.
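
The comparison can be reproduced from the test-set confusion matrices reported in each section. The sketch below copies those matrices by hand and ranks the models by accuracy. Logistic regression is omitted because the notebook does not report its test predictions, and the dictionary keys are illustrative labels.

```python
# Test-set confusion matrices copied from the outputs above.
confusion = {
    "Baseline": [[110, 0], [69, 0]],
    "Decision tree": [[99, 11], [21, 48]],
    "Random forest": [[103, 7], [21, 48]],
    "SVM": [[100, 10], [24, 45]],
    "Neural network": [[98, 12], [25, 44]],
    "k-NN": [[100, 10], [23, 46]],
}

def accuracy(cm):
    """Fraction of examples on the diagonal of a confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

ranking = sorted(confusion, key=lambda name: accuracy(confusion[name]), reverse=True)
for name in ranking:
    cm = confusion[name]
    print(f"{name:<15} {accuracy(cm):.3f}  ({cm[0][0] + cm[1][1]} of 179 correct)")
```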