# Actividades Lección 5: Fundamentos de Big Data 

# Actividad 1

Vamos a examinar de nuevo el dataset de Titanic y vamos a hacer dos cosas:

1.- Vamos a tratar de eliminar los valores nulos del dataset.
2.- Vamos a descartar aquellas columnas de datos no categóricos

Empezamos como siempre importanto las librerías que vayamos a utilizar y cargando el dataset con pandas.

In [379]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [380]:
df = pd.read_csv('./train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Comprobamos que columnas contienen valores nulos

In [381]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Vemos que las columnas Age, Cabin y Embarked tienen valores nulos. Procederemos de la siguiente manera:

- Los valores nulos de Age se rellenarán con el valor medio del resto de valores no nulos.
- Los valores nulos de Embarked se rellenarán con una 'S'.
- Los valores nulos de Cabin no se tendrán en cuenta pues se va a eliminar la columna.

In [382]:
df.Age = df.Age.fillna(df.Age.mean())
df.Age.isnull().sum()

0

In [383]:
df.Embarked = df.Embarked.fillna('S')
df.Embarked.isnull().sum()

0

In [384]:
df = df.drop('Cabin', axis = 1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


Una vez hecho esto, vamos a quedarnos únicamente con los datos categóricos, es decir:

- Columnas con string hombre/mujer
- Columnas con strings con 3 opciones (como la clase)

Por tanto, vamos a prescindir también de las columnas de nombre y ticket.

In [385]:
df = df.drop(['Ticket', 'Name'], axis = 1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,22.0,1,0,7.25,S
1,2,1,1,female,38.0,1,0,71.2833,C
2,3,1,3,female,26.0,0,0,7.925,S
3,4,1,1,female,35.0,1,0,53.1,S
4,5,0,3,male,35.0,0,0,8.05,S


In [386]:
df = pd.get_dummies(df, columns = ['Sex', 'Pclass', 'Embarked'], 
                    drop_first = True)
df.head()

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S
0,1,0,22.0,1,0,7.25,1,0,1,0,1
1,2,1,38.0,1,0,71.2833,0,0,0,0,0
2,3,1,26.0,0,0,7.925,0,0,1,0,1
3,4,1,35.0,1,0,53.1,0,0,0,0,1
4,5,0,35.0,0,0,8.05,1,0,1,0,1


A continuación vamos a escalar los datos para evitar que el modelo de predicción de mayor peso a valores más altos.

In [387]:
# x - mean(x) / std(x)

df.Age = (df.Age - np.mean(df.Age, axis = 0)) / np.std(df.Age,  axis = 0)
df.Fare = (df.Fare - np.mean(df.Fare, axis = 0)) / np.std(df.Fare, axis = 0)
df.head()

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S
0,1,0,-0.592481,1,0,-0.502445,1,0,1,0,1
1,2,1,0.638789,1,0,0.786845,0,0,0,0,0
2,3,1,-0.284663,0,0,-0.488854,0,0,1,0,1
3,4,1,0.407926,1,0,0.42073,0,0,0,0,1
4,5,0,0.407926,0,0,-0.486337,1,0,1,0,1


A continuación vamos a obtener el conjunto de datos X, y el conjunto de etiquetas y. 

In [388]:
X = df.drop(['Survived', 'PassengerId'], axis = 1)
X.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S
0,-0.592481,1,0,-0.502445,1,0,1,0,1
1,0.638789,1,0,0.786845,0,0,0,0,0
2,-0.284663,0,0,-0.488854,0,0,1,0,1
3,0.407926,1,0,0.42073,0,0,0,0,1
4,0.407926,0,0,-0.486337,1,0,1,0,1


In [389]:
y = df.Survived
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [390]:
X = X.values
y = y.values

Hacemos la partición 80/20 en conjunto de entrenamiento y de test

In [391]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                   test_size = 0.2, random_state = 42)

Vamos a proceder a realizar las tareas de entrenamiento y medida de precisión de cada modelo. 

In [392]:
# Clasificador de K-Vecinos
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc_KN = accuracy_score(y_test, y_pred)
acc_KN

0.8212290502793296

In [393]:
# Clasificador Decision Tree
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc_DT = accuracy_score(y_test, y_pred)
acc_DT

0.7821229050279329

In [394]:
# Random Forest Classifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc_RF = accuracy_score(y_test, y_pred)
acc_RF

0.8212290502793296

In [395]:
# Gaussian NB
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc_NB = accuracy_score(y_test, y_pred)
acc_NB

0.7653631284916201

In [396]:
# SVC
clf = SVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
acc_SVC = accuracy_score(y_test, y_pred)
acc_SVC

0.8156424581005587

En mi caso, el mejor predictor parece ser el K-Neighbor Classifier. Voy a usarlo para el test.csv

In [397]:
test = pd.read_csv('./test.csv')
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [398]:
test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [399]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [400]:
test.Age = test.Age.fillna(test.Age.mean())
test.Fare = test.Fare.fillna(test.Fare.mean())

test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64

In [401]:
test = test.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis = 1)
test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,34.5,0,0,7.8292,Q
1,3,female,47.0,1,0,7.0,S
2,2,male,62.0,0,0,9.6875,Q
3,3,male,27.0,0,0,8.6625,S
4,3,female,22.0,1,1,12.2875,S


In [402]:
test.Age = (test.Age - np.mean(test.Age, axis = 0)) / np.std(test.Age, axis = 0)
test.Fare = (test.Fare - np.mean(test.Fare, axis = 0)) / np.std(test.Fare, axis = 0)
test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,male,0.334993,0,0,-0.498407,Q
1,3,female,1.32553,1,0,-0.513274,S
2,2,male,2.514175,0,0,-0.465088,Q
3,3,male,-0.25933,0,0,-0.483466,S
4,3,female,-0.655545,1,1,-0.418471,S


In [403]:
test = pd.get_dummies(test, columns = ['Sex', 'Pclass', 'Embarked'], 
                    drop_first = True)
test.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S
0,0.334993,0,0,-0.498407,1,0,1,1,0
1,1.32553,1,0,-0.513274,0,0,1,0,1
2,2.514175,0,0,-0.465088,1,1,0,1,0
3,-0.25933,0,0,-0.483466,1,0,1,0,1
4,-0.655545,1,1,-0.418471,0,0,1,0,1


Una vez aplicado al conjunto de test el mismo tratamiento que al conjunto de entrenamiento, vamos a proceder a clasificar el conjunto de test y medir la exactitud.

In [404]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(test)
y_pred



array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

Creamos el archivo a enviar

In [405]:
df_submission = pd.read_csv('./gender_submission.csv')
df_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [406]:
df_submission.Survived = y_pred
df_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0


In [407]:
df_submission.to_csv('res.csv', index = False)

# Actividad 2

A continuación, voy a plantear mi propia metodología para entrenar un clasificador y tratar de conseguir una alta precisión en las predicciones. 

In [408]:
df = pd.read_csv('./train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


En primer lugar, vamos a eliminar las columnas "Ticket" y "Cabin" del DataFrame

In [409]:
df = df.drop(['Ticket', 'Cabin'], axis = 1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


Antes de proceder con algunas técnicas de escalado que explicaré más adelante, es conveniente hacer un manejo diligente de los valores nulos. Vamos a empezar comprobando que columnas presentan este tipo de valores.

In [410]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Embarked         2
dtype: int64

Teniendo en cuenta que hemos descartado 'Cabin', debemos plantearnos que hacer con los valores nulos de 'Age' y 'Embarked'. Por una parte, vamos a rellenar los valores faltantes de 'Age' utilizando un método de interpolación. En concreto, voy a hacer uso de una interpolación polinómica de grado 11, pues después de varias pruebas, parece proporcionar una aproximación de las edades que mejora la precisión de las predicciones. 

In [411]:
df_age = df.Age
df_age = df_age.interpolate(method = 'polynomial', order = 11, axis = 0)
df.Age = df_age
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       2
dtype: int64

Por otra parte, vamos a rellenar los valores nulos de 'Embarked' con una 'S'.

In [412]:
df.Embarked = df.Embarked.fillna('S')
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

Otro posible procedimiento sería el extraer el título de cada persona en el barco. Los nombres de cada persona toman la forma "Apellido, Título. Nombre". A partir de esto, podemos extraer los títulos de cada individuo y crear una codificación one-hot para cada título. Finalmente eliminaríamos la columna "Name" y utilizaríamos get_dummies() para incluir la codificación one-hot de los títulos. Vamos a ver como hacerlo.

In [413]:
titles = set()
for name in df.Name:
    titles.add(name.split(',')[1].split('.')[0].strip())
titles

{'Capt',
 'Col',
 'Don',
 'Dr',
 'Jonkheer',
 'Lady',
 'Major',
 'Master',
 'Miss',
 'Mlle',
 'Mme',
 'Mr',
 'Mrs',
 'Ms',
 'Rev',
 'Sir',
 'the Countess'}

En la salida anterior hemos obtenido todos los títulos posibles, vamos a generar ahora una correspondencia entre los nombres y sus títulos, y a añadirlo al dataframe. Este código que incluyo es algo ineficiente, cuando tenga un mejor manejo de Pandas volveré para mejorarlo :D

In [414]:
title_column = []
for name in df.Name:
    for title in titles:
        if title + "." in name:
            title_column.append(title)
            break
title_column
df['Title'] = title_column
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,Mr


Creamos ahora las columnas one-hot para PClass, Sex, Embarked y Title y eliminamos Name.

In [415]:
df = pd.get_dummies(df, columns = ['Sex', 'Pclass', 'Embarked', 'Title'], 
                    drop_first = True)
df = df.drop('Name', axis = 1)
df.head()

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,...,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess
0,1,0,22.0,1,0,7.25,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,2,1,38.0,1,0,71.2833,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,3,1,26.0,0,0,7.925,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
3,4,1,35.0,1,0,53.1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,0,35.0,0,0,8.05,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0


Existen algoritmos de clasificación que son sensibles a la "escala" de los datos. Por ejemplo, los algoritmo basados en distáncias como KNN son sensibles a la escala de los datos pues miden distancias entre puntos situados en espacios métricos (o pseudo-métricos) de $n$ dimensiones (tantas como características tengan nuestros datos). 

Para evitar comportamientos anómalos en este sentido, deberíamos empezar por escalar los datos. Vamos a probar dos aproximaciones distintas, en primer lugar, vamos a utilizar una técnica de normalizado de los datos, y en segundo vamos a utilizar una técnica de estandarizado.

Para la técnica de normalización, usaremos Min-Max. Es decir, sea $X$ una característica de nuestro conjunto de datos, su normalización se obtiene con la siguiente expresión.

$ X = \frac{X - X_{min}}{X_{max} - X_{min}} $

Por otra parte, la estandarización se obtiene centrando los datos alrededor de la media con una desviación estándar, es decir

$ X = \frac{X - \mu}{\sigma} $

Como vemos, las características que pueden someterse a este escalado son "Age" y "Fare". 

In [416]:
df_norm = df.copy(deep = True)
df_norm.Age = (df_norm.Age - df_norm.Age.min()) / (df_norm.Age.max() - df_norm.Age.min())
df_norm.Fare = (df_norm.Fare - df_norm.Fare.min()) / (df_norm.Fare.max() - df_norm.Fare.min())
df_norm.head()

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,...,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess
0,1,0,0.23846,1,0,0.014151,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,2,1,0.253877,1,0,0.139136,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,3,1,0.242314,0,0,0.015469,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
3,4,1,0.250986,1,0,0.103644,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,0,0.250986,0,0,0.015713,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0


In [417]:
df_stand = df.copy(deep = True)
df_stand.Age = (df_stand.Age - df_stand.Age.mean()) / df_stand.Age.std()
df_stand.Fare = (df_stand.Fare - df_stand.Fare.mean()) / df_stand.Fare.std()
df_stand.head()

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,...,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess
0,1,0,-0.216916,1,0,-0.502163,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,2,1,0.191254,1,0,0.786404,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,3,1,-0.114874,0,0,-0.48858,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
3,4,1,0.114722,1,0,0.420494,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,5,0,0.114722,0,0,-0.486064,1,0,1,0,...,0,0,0,0,1,0,0,0,0,0


Vamos a proceder a entrenar los clasificadores. Empezamos generando los conjuntos de test y entrenamiento.

In [418]:
X_n = df_norm.drop(['Survived', 'PassengerId'], axis = 1)
y_n = df_norm.Survived
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_n, y_n, 
                                                   test_size = 0.2, random_state = 42)
X_s = df_stand.drop(['Survived', 'PassengerId'], axis = 1)
y_s = df_stand.Survived
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_s, y_s, 
                                                   test_size = 0.2, random_state = 42)

Vamos a comenzar generando probando los clasificadores de K-Neighbors. Para ello, vamos a entrenar los clasificadores con el conjunto de datos normalizado y con el estandarizado, además de probar con cantidades distintas de los K vecinos.

In [419]:
clfs_n = []; accs_n = []
for k in range(1, 16):
    clf = KNeighborsClassifier(k)
    clf.fit(X_train_n, y_train_n)
    clfs_n.append(clf)
    y_pred = clf.predict(X_test_n)
    acc = accuracy_score(y_test_n, y_pred)
    accs_n.append(acc)
accs_n

[0.7877094972067039,
 0.8324022346368715,
 0.8435754189944135,
 0.8435754189944135,
 0.8491620111731844,
 0.8324022346368715,
 0.8491620111731844,
 0.8268156424581006,
 0.8435754189944135,
 0.8324022346368715,
 0.8379888268156425,
 0.8435754189944135,
 0.8491620111731844,
 0.8491620111731844,
 0.8379888268156425]

In [420]:
best_k = np.argmax(accs_n) + 1
best_k

5

In [421]:
clfs_s = []; accs_s = []
for k in range(1, 16):
    clf = KNeighborsClassifier(k)
    clf.fit(X_train_s, y_train_s)
    clfs_s.append(clf)
    y_pred = clf.predict(X_test_s)
    acc = accuracy_score(y_test_s, y_pred)
    accs_s.append(acc)
accs_s

[0.7877094972067039,
 0.7988826815642458,
 0.8324022346368715,
 0.8212290502793296,
 0.8324022346368715,
 0.8379888268156425,
 0.8435754189944135,
 0.8212290502793296,
 0.8435754189944135,
 0.8268156424581006,
 0.8212290502793296,
 0.8268156424581006,
 0.8324022346368715,
 0.8379888268156425,
 0.8379888268156425]

In [422]:
best_k = np.argmax(accs_s) + 1
best_k

7

In [423]:
np.max(accs_n) > np.max(accs_s)

True

In [424]:
best_k = np.argmax(accs_n) + 1
clf_knn_n = clfs_n[best_k - 1]
np.max(accs_n)

0.8491620111731844

De todo esto vemos que el conjunto normalizado con k = 7 vecinos ofrece la precisión máxima de $\approx 84.916 \% $. Vamos a probar con el resto de clasificadores que ya hemos visto.

In [425]:
# Clasificador Decision Tree Normalized Set
clf_dt_n = DecisionTreeClassifier()
clf_dt_n.fit(X_train_n, y_train_n)
y_pred = clf_dt_n.predict(X_test_n)
acc_DT_n = accuracy_score(y_test_n, y_pred)
acc_DT_n

0.7486033519553073

In [426]:
# Clasificador Decision Tree Standarized Set
clf_dt_s = DecisionTreeClassifier()
clf_dt_s.fit(X_train_s, y_train_s)
y_pred = clf_dt_s.predict(X_test_s)
acc_DT_s = accuracy_score(y_test_s, y_pred)
acc_DT_s

0.7486033519553073

In [427]:
# Random Forest Classifier
clf_rf_n = RandomForestClassifier()
clf_rf_n.fit(X_train_n, y_train_n)
y_pred = clf_rf_n.predict(X_test_n)
acc_RF_n = accuracy_score(y_test_n, y_pred)
acc_RF_n

0.8379888268156425

In [428]:
# Random Forest Classifier
clf_rf_s = RandomForestClassifier()
clf_rf_s.fit(X_train_s, y_train_s)
y_pred = clf_rf_s.predict(X_test_s)
acc_RF_s = accuracy_score(y_test_s, y_pred)
acc_RF_s

0.8435754189944135

En este caso, el conjunto de datos estandarizado parece ofrecer un rendimiento ligeramente mejor que el normalizado.

In [429]:
# Gaussian NB with normalized dataset
clf_g_n = GaussianNB()
clf_g_n.fit(X_train_n, y_train_n)
y_pred = clf_g_n.predict(X_test_n)
acc_NB_n = accuracy_score(y_test_n, y_pred)
acc_NB_n

0.5921787709497207

In [430]:
# Gaussian NB with standarized dataset
clf_g_s = GaussianNB()
clf_g_s.fit(X_train_s, y_train_s)
y_pred = clf_g_s.predict(X_test_s)
acc_NB_s = accuracy_score(y_test_s, y_pred)
acc_NB_s

0.5921787709497207

De nuevo, ambas aproximaciones parecen ofrecer el mismo rendimiento.

In [431]:
# SVC with normalized dataset
clf_svc_n = SVC()
clf_svc_n.fit(X_train_n, y_train_n)
y_pred = clf_svc_n.predict(X_test_n)
acc_SVC_n = accuracy_score(y_test_n, y_pred)
acc_SVC_n

0.8156424581005587

In [432]:
# SVC with standarized dataset
clf_svc_s = SVC()
clf_svc_s.fit(X_train_s, y_train_s)
y_pred = clf_svc_s.predict(X_test_s)
acc_SVC_s = accuracy_score(y_test_s, y_pred)
acc_SVC_s

0.8156424581005587

Obtenemos mejor rendimiento con el conjunto estandarizado. Vamos a hacer un estudio un tanto más profundo de los RandomForestClassifier haciendo uso de un mayor numero de estimadores y el conjunto de datos estandarizados.

In [433]:
accs_rf = []
estimators = range(100, 1000, 100)
for estimator in estimators:
    # Random Forest Classifier
    clf = RandomForestClassifier()
    clf.fit(X_train_s, y_train_s)
    y_pred = clf.predict(X_test_s)
    accs_rf.append(accuracy_score(y_test_s, y_pred))
accs_rf

[0.8435754189944135,
 0.8268156424581006,
 0.8324022346368715,
 0.8268156424581006,
 0.8435754189944135,
 0.8379888268156425,
 0.8491620111731844,
 0.8268156424581006,
 0.8379888268156425]

In [434]:
index = np.argmax(accs_rf)
best_estimator = estimators[index]
best_estimator

700

En cualquier caso, al tratarse de un clasificador con cierto grado de aleatoriedad, podemos esperar que el mejor estimador pueda cambiar entre ejecuciones. Una técnica que puede aplicarse es la de $bagging$, que consiste en dividir los grupos de entrenamiento en subgrupos y aplicar el algoritmo de clasificación en cada uno. 

In [435]:
# Logistic Regression with Normalized Set
clf = LogisticRegression()
clf.fit(X_train_n, y_train_n)
y_pred = clf.predict(X_test_n)
acc_LR_n = accuracy_score(y_test_n, y_pred)
acc_LR_n

0.8044692737430168

In [436]:
# Logistic Regression with Standarized Set
clf = LogisticRegression()
clf.fit(X_train_s, y_train_s)
y_pred = clf.predict(X_test_s)
acc_LR_s = accuracy_score(y_test_s, y_pred)
acc_LR_s

0.7988826815642458

El clasificador por regresión Logística parece obtener mejor precisión con el conjunto de datos Normalizado.

In [437]:
clf_b_knn = BaggingClassifier(KNeighborsClassifier(best_k))
clf_b_svc = BaggingClassifier(SVC())
clf_b_dtc = BaggingClassifier(DecisionTreeClassifier())
clf_b_rfc = BaggingClassifier(RandomForestClassifier())
clf_b_lrc = BaggingClassifier(LogisticRegression())

In [438]:
clf_b_knn.fit(X_train_n, y_train_n)
y_pred = clf_b_knn.predict(X_test_n)
acc_b_knn = accuracy_score(y_test_n, y_pred)
acc_b_knn

0.8547486033519553

In [439]:
clf_b_svc.fit(X_train_s, y_train_s)
y_pred = clf_b_svc.predict(X_test_s)
acc_b_svc = accuracy_score(y_test_s, y_pred)
acc_b_svc

0.8100558659217877

In [440]:
clf_b_dtc.fit(X_train_s, y_train_s)
y_pred = clf_b_dtc.predict(X_test_s)
acc_b_dtc = accuracy_score(y_test_s, y_pred)
acc_b_dtc

0.8044692737430168

In [441]:
clf_b_rfc.fit(X_train_s, y_train_s)
y_pred = clf_b_rfc.predict(X_test_s)
acc_b_rfc = accuracy_score(y_test_s, y_pred)
acc_b_rfc

0.8435754189944135

In [442]:
clf_b_lrc.fit(X_train_n, y_train_n)
y_pred = clf_b_lrc.predict(X_test_n)
acc_b_lrc = accuracy_score(y_test_n, y_pred)
acc_b_lrc

0.8100558659217877

Teniendo en cuenta las medidas de precisión obtenidas, pensamos que es conveniente llevar a cabo pruebas con el conjunto de test con los siguientes clasificadores.

- Clasificador K-Neighbors con el dataset normalizado
- Classificador Bagging K-Neighbors con el dataset normalizado
- Clasificador SVC con el dataset estandarizado
- Clasificador Bagging SVC con el dataset estandarizado
- Classificador RandomForestClassifier
- Classificador Bagging RandomForestClassifier

Nos hemos quedado con estos clasificadores porque ofrecen una precisión >80% con el conjunto de entrenamiento. Vamos a cargar el conjunto de test, eliminar sus valores nulos, normalizarlo y estandarizarlo y luego llevar a cabo las pruebas.

In [443]:
classifiers = [clf_knn_n, clf_b_knn, clf_svc_s, clf_b_svc, clf_rf_s, clf_rf_n, clf_b_rfc]
test = pd.read_csv('./test.csv')
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [444]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [445]:
test = test.drop(['Cabin', 'Ticket', 'PassengerId'], axis = 1)
test.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,"Kelly, Mr. James",male,34.5,0,0,7.8292,Q
1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,7.0,S
2,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,9.6875,Q
3,3,"Wirz, Mr. Albert",male,27.0,0,0,8.6625,S
4,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,12.2875,S


In [446]:
test_age = test.Age
test_age = test_age.interpolate(method = 'polynomial', order = 11, axis = 0)
test.Age = test_age
test.Age = test.Age.fillna(test.Age.mean())
test.isnull().sum()

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        1
Embarked    0
dtype: int64

In [447]:
test.Fare = test.Fare.fillna(test.Fare.mean())
test.isnull().sum()

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [448]:
title_column = []
for name in test.Name:
    added = False
    for title in titles:
        if title + "." in name:
            title_column.append(title)
            added = True
            break
    if not added:
        title_column.append('Mr')
title_column
test['Title'] = title_column
test.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,3,"Kelly, Mr. James",male,34.5,0,0,7.8292,Q,Mr
1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,7.0,S,Mrs
2,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,9.6875,Q,Mr
3,3,"Wirz, Mr. Albert",male,27.0,0,0,8.6625,S,Mr
4,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,12.2875,S,Mrs


In [449]:
to_append = []
for title in titles:
    if title not in test.Title and f'Title_{title}' in df.columns:
        to_append.append(f'Title_{title}')

test = pd.get_dummies(test, columns = ['Sex', 'Pclass', 'Embarked', 'Title'],  drop_first = True)
aux_arr = np.zeros(len(test), dtype = 'int64')
for title in to_append:
        test[title] = aux_arr
    
test = test.drop('Name', axis = 1)
test.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S,Title_Dr,...,Title_Rev,Title_Major,Title_Jonkheer,Title_Mme,Title_Mlle,Title_Col,Title_Don,Title_Lady,Title_Sir,Title_the Countess
0,34.5,0,0,7.8292,1,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,47.0,1,0,7.0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,62.0,0,0,9.6875,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,27.0,0,0,8.6625,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,22.0,1,1,12.2875,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [450]:
test_norm = test.copy(deep = True)
test_norm.Age = (test_norm.Age - test_norm.Age.min()) / (test_norm.Age.max() - test_norm.Age.min())
test_norm.Fare = (test_norm.Fare - test_norm.Fare.min()) / (test_norm.Fare.max() - test_norm.Fare.min())
test_norm = test_norm[list(df_norm.columns)[2:]]
test_norm.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S,Title_Col,...,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess
0,0.407965,0,0,0.015282,1,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0.424952,1,0,0.013663,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0.445337,0,0,0.018909,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0.397773,0,0,0.016908,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0.390978,1,1,0.023984,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [451]:
test_stand = test.copy(deep = True)
test_stand.Age = (test_stand.Age - test_stand.Age.mean()) / test_stand.Age.std()
test_stand.Fare = (test_stand.Fare - test_stand.Fare.mean()) / test_stand.Fare.std()
test_stand = test_stand[list(df_stand.columns)[2:]]
test_stand.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_male,Pclass_2,Pclass_3,Embarked_Q,Embarked_S,Title_Col,...,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess
0,0.102556,0,0,-0.497811,1,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0.371678,1,0,-0.51266,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0.694625,0,0,-0.464532,1,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,-0.058918,0,0,-0.482888,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,-0.166567,1,1,-0.417971,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0


Una vez el conjunto de test ha recibido el mismo tratamiento que el de entrenamiento, podemos proceder a las pruebas.

In [452]:
predictions = []
tests = [test_norm]*2 + [test_stand]*3 + [test_norm]*2
for (classifier, test_f) in zip(classifiers, tests):
    predictions.append(classifier.predict(test_f))    

In [453]:
df_submission = pd.read_csv('./gender_submission.csv')
df_submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


In [454]:
for i in range(len(predictions)):
    df_submission.Survived = predictions[i]
    df_submission.to_csv(f'./res_{i}.csv', index = False)

# RESULTADOS

Con las técnicas aplicadas y pasando las pruebas de Kaggle, se obtuvo la predicción más exacta usando el clasificador SVC con el dataset estandarizado. En este caso, obteníamos una precisión del 77.511%, lo que nos situaba en el puesto 3935 en el ránking.

![Ránking](captura.png)