# Notebook objective

The main purpose of this notebook is to develop the classification model. The model is trained on the processed training set so that it can learn which features influence the probability of surviving the Titanic disaster.

# Import the required libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
# Load each dataset into a DataFrame
dataset_train__route = "../data/processed/train.csv"
dataset_test__route = "../data/processed/test.csv"
real_predictions__route = "../data/raw/gender_submission.csv"

train_df = pd.read_csv(dataset_train__route)
test_df = pd.read_csv(dataset_test__route)
real_pred_df = pd.read_csv(real_predictions__route)

In [3]:
# Show the first 5 rows of the training dataset
train_df.head(5)

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass__1,Pclass__2,Sex__male,Embarked__C,Embarked__S,Survived,Cabin__A,Cabin__B,Cabin__C,Cabin__D,Cabin__E,Cabin__F,Cabin__G,Cabin__T
0,0.271174,0.125,0.0,0.014151,0.0,0.0,1.0,0.0,1.0,0,0,0,0,0,0,1,0,0
1,0.472229,0.125,0.0,0.139136,1.0,0.0,0.0,1.0,0.0,1,0,0,1,0,0,0,0,0
2,0.321438,0.0,0.0,0.015469,0.0,0.0,0.0,0.0,1.0,1,0,0,0,0,0,1,0,0
3,0.434531,0.125,0.0,0.103644,1.0,0.0,0.0,0.0,1.0,1,0,0,1,0,0,0,0,0
4,0.434531,0.0,0.0,0.015713,0.0,0.0,1.0,0.0,1.0,0,0,0,0,0,0,1,0,0


In [4]:
# Show the first 5 rows of the test dataset
test_df.head(5)

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass__1,Pclass__2,Sex__male,Embarked__C,Embarked__S,Cabin__A,Cabin__B,Cabin__C,Cabin__D,Cabin__E,Cabin__F,Cabin__G,Cabin__T
0,0.452723,0.0,0.0,0.015282,0.0,0.0,1.0,0.0,0.0,0,0,0,0,1,0,0,0
1,0.617566,0.125,0.0,0.013663,0.0,0.0,0.0,0.0,1.0,0,0,0,0,1,0,0,0
2,0.815377,0.0,0.0,0.018909,0.0,1.0,1.0,0.0,0.0,0,0,0,0,1,0,0,0
3,0.353818,0.0,0.0,0.016908,0.0,0.0,1.0,0.0,1.0,0,0,0,0,0,1,0,0
4,0.287881,0.125,0.111111,0.023984,0.0,0.0,0.0,0.0,1.0,0,0,0,0,1,0,0,0


# Splitting the dataset

In this case I already have a training set and a test set, so I only need to separate the feature matrix from the label in the training set (the test set does not include the label).

In [5]:
X_train = train_df.drop(columns = ['Survived'])
y_train = train_df['Survived']

In [6]:
y_test = real_pred_df['Survived']
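
The `head()` outputs above show an `Unnamed: 0` column, which looks like a row index that was written to the CSV in an earlier step (an assumption, since that step is not part of this notebook). Because `X_train` only drops `Survived`, that column stays in the feature matrix, and unlike the other features it is not scaled to the [0, 1] range, so it may dominate the distances that a nearest-neighbors model computes. Below is a minimal sketch of one way to exclude it. The `csv_text` sample is a hypothetical stand-in built from the first two training rows with only a few columns, and the results reported in this notebook were obtained with the column included.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../data/processed/train.csv. An empty first header
# cell is what a saved index looks like, and pandas names it "Unnamed: 0".
csv_text = (
    ",Age,Fare,Survived\n"
    "0,0.271174,0.014151,0\n"
    "1,0.472229,0.139136,1\n"
)

# Default read: the saved index comes back as a regular feature column
with_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0 makes pandas use the first column as the index instead
demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Same split as in the cells above, now without the index among the features
X_demo = demo_df.drop(columns=["Survived"])
y_demo = demo_df["Survived"]

print(list(with_index.columns))
print(list(X_demo.columns))
```

If this approach were adopted, the test set would need the same treatment so that both feature matrices keep identical columns.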

## Desarrollo del modelo

In [7]:
model = KNeighborsClassifier()

parameters = {
    'n_neighbors': [3, 5, 7, 10],  # Number of neighbors to consider
    'weights': ['uniform', 'distance'],  # Weight function used in prediction
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],  # Algorithm used to compute the nearest neighbors
    'leaf_size': [20, 30, 40],  # Leaf size passed to BallTree or KDTree
    'p': [1, 2],  # Power parameter for the Minkowski metric
}

# Configure GridSearchCV (5-fold cross-validation, using all available cores)
grid_cv = GridSearchCV(estimator=model, param_grid=parameters, cv=5, n_jobs=-1, verbose=2)
grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 192 candidates, totalling 960 fits
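
The figures in the log line above follow directly from the grid: the number of candidates is the product of the lengths of the value lists, and each candidate is fitted once per fold. A small sketch of that arithmetic, reusing the same `parameters` dictionary (`cv_folds`, `n_candidates`, and `n_fits` are names introduced here only for illustration):

```python
import math

# Same grid as in the cell above
parameters = {
    'n_neighbors': [3, 5, 7, 10],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [20, 30, 40],
    'p': [1, 2],
}
cv_folds = 5  # matches cv=5 in GridSearchCV

# One candidate per combination of values: 4 * 2 * 4 * 3 * 2
n_candidates = math.prod(len(values) for values in parameters.values())

# Every candidate is trained once per cross-validation fold
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

Since `leaf_size` is only passed to BallTree or KDTree, candidates with `algorithm='brute'` that differ only in `leaf_size` are likely redundant, so the grid could probably be trimmed without changing the outcome.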


In [8]:
# Show the hyperparameters of the best model
best_model_params = grid_cv.best_params_
print("Hyperparameters:", best_model_params)

# Keep a reference to the best estimator, which GridSearchCV refits on the whole training set by default
best_model = grid_cv.best_estimator_

Hyperparameters: {'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 3, 'p': 1, 'weights': 'uniform'}
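
The selected value `p = 1` means the search preferred the Manhattan distance over the Euclidean distance (`p = 2`). To make the difference concrete, here is a minimal pure-Python sketch of the Minkowski distance. The helper `minkowski_distance` and the two sample points are illustrative assumptions and are not part of the notebook or of scikit-learn.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


# Two made-up points that differ by 3 on one axis and 4 on the other
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # sum of absolute differences
euclidean = minkowski_distance(point_a, point_b, p=2)  # straight-line distance

print(manhattan, euclidean)
```

With `p = 1` every feature difference contributes linearly, whereas `p = 2` squares the differences and so gives more weight to the largest ones.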


In [9]:
# Show the accuracy of the best model on the test set
# (for a classifier, score() returns the mean accuracy, not the R2 coefficient)
score = best_model.score(test_df, y_test)
print("Accuracy of the best model on the test set:", score)

Accuracy of the best model on the test set: 0.7822966507177034


In [10]:
# Show a confusion matrix and several metrics computed from the model's predictions
# Predictions for the test set
y_pred = best_model.predict(test_df)

conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(conf_matrix)
print(class_report)

[[200  66]
 [ 25 127]]
              precision    recall  f1-score   support

           0       0.89      0.75      0.81       266
           1       0.66      0.84      0.74       152

    accuracy                           0.78       418
   macro avg       0.77      0.79      0.78       418
weighted avg       0.80      0.78      0.79       418



The model reaches an overall accuracy of 0.78. These results are similar to those obtained with the Random Forest, and even slightly worse. For now, this is not the model I will use to generate the predictions I will submit to the Kaggle competition.
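
The values in the classification report can be recomputed by hand from the confusion matrix printed above, where rows are the true classes and columns are the predicted classes. The sketch below does that for the accuracy and for class 1. The variable names are introduced here only for illustration.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class
conf_matrix = [[200, 66],
               [25, 127]]

(tn, fp), (fn, tp) = conf_matrix
total = tn + fp + fn + tp

# Share of all passengers that were classified correctly
accuracy = (tp + tn) / total

# Class 1: how many predicted positives were right, and how many true positives were found
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The accuracy computed this way, 327 / 418, is the same quantity that `best_model.score` returned earlier.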

## Saving the best model


In [11]:
import joblib
import os

dataset__route = "../model/"

# Create the output directory if it does not exist yet
os.makedirs(dataset__route, exist_ok=True)

joblib.dump(best_model, os.path.join(dataset__route, 'model.joblib'))
print("The model has been saved successfully.")

The model has been saved successfully.
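
To reuse the saved file later, the counterpart of `joblib.dump` is `joblib.load`. The following is a minimal round-trip sketch under stated assumptions: `stand_in_model` is a hypothetical placeholder for `best_model`, and a temporary directory is used instead of `../model/` so that the example does not touch the real file.

```python
import os
import tempfile

import joblib

# Hypothetical stand-in for best_model; any picklable object round-trips the same way
stand_in_model = {"n_neighbors": 3, "p": 1, "weights": "uniform"}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")

    # Save and immediately reload the object
    joblib.dump(stand_in_model, model_path)
    restored_model = joblib.load(model_path)

print(restored_model == stand_in_model)
```

When loading a real estimator, it is generally safest to use the same scikit-learn version that was used to save it.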
