# Exercise: Titanic

Using the Titanic passenger dataset (https://www.kaggle.com/c/titanic/data) as a starting point, build a decision tree model that predicts whether each passenger survived.

Additional reference: https://medium.com/@merijoanna/learn-decision-trees-with-kaggle-example-cc03b1dbc6fa

## 1. Importing packages and the dataset

In [48]:
import pandas as pd
from sklearn.tree import export_graphviz
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


titanic = pd.read_csv("Notebooks_data_files/titanic_train.csv", index_col='PassengerId')
titanic.head(5)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## 2. Preprocessing

### Dropping unneeded columns

In [49]:
to_drop = ['Name','Ticket','Cabin','Fare','Embarked']
titanic.drop(to_drop,axis=1,inplace=True)
titanic.head(5)

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch
1,0,3,male,22.0,1,0
2,1,1,female,38.0,1,0
3,1,3,female,26.0,0,0
4,1,1,female,35.0,1,0
5,0,3,male,35.0,0,0


### Replacing nulls in the 'Age' column with its mean


In [50]:
titanic.isnull().sum() # 177 nulls in 'Age'
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
titanic.isnull().sum()


Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
dtype: int64
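
To illustrate the imputation step, here is a minimal sketch on a toy `Series` (the values are made up, not taken from the dataset). `Series.mean()` skips missing values by default, so the mean is computed over the known ages only and then used as the fill value.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (illustrative values only)
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 36.0], name="Age")

# NaN entries are ignored by mean(): (22 + 26 + 36) / 3 = 28.0
mean_age = ages.mean()

# Replace every missing age with that mean
filled = ages.fillna(mean_age)

print(mean_age)
print(filled.isnull().sum())
```

Note that computing the mean over the whole dataset, as the cell above does, lets information from rows that later end up in the test split influence the fill value. Computing it on the training rows only would be a stricter alternative.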

### Encoding categorical data

In [51]:
label_encoder = LabelEncoder()
titanic['Sex'] = label_encoder.fit_transform(titanic['Sex'].values)
titanic.head(5)

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch
1,0,3,1,22.0,1,0
2,1,1,0,38.0,1,0
3,1,3,0,26.0,0,0
4,1,1,0,35.0,1,0
5,0,3,1,35.0,0,0
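
The output above shows `male` encoded as 1 and `female` as 0. The sketch below, run on a few made-up values rather than the real column, shows how that mapping can be read back from the encoder: `LabelEncoder` sorts the labels it sees and stores them in `classes_`, so each label's position is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'Sex' column (illustrative only)
sex_values = ["male", "female", "female", "male"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sex_values)

# classes_ holds the sorted original labels; the index of each is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(encoded.tolist())
```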


### Splitting the dataset

In [52]:
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
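
As a small sketch of what the two arguments do, the example below splits a made-up 10-row frame (the column names `feature` and `target` are hypothetical). `test_size=0.30` holds out 30% of the rows, and fixing `random_state` makes the shuffle reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows (made-up values, for illustration only)
toy = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})
X_toy = toy.drop("target", axis=1)
y_toy = toy["target"]

X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)
print(len(X_a), len(X_b))

# Repeating the call with the same random_state reproduces the same split
X_a2, X_b2, _, _ = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)
same_split = X_b.index.equals(X_b2.index)
print(same_split)
```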


## 3. Creating and training the decision tree

### Creating the tree

In [53]:
tree_clf = DecisionTreeClassifier(max_depth=4) 

### Training

In [54]:
tree_clf.fit(X_train, y_train) # fit on the training split only, so X_test stays unseen
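
The notebook stops at training, so here is a hedged sketch of how predictions on held-out rows could be obtained and scored. It runs on synthetic data that reuses the notebook's column names. The rows and the rule that generates `Survived` are invented for illustration, so the resulting accuracy says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed frame (NOT the real Titanic data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 80, size=n).round(1),
    "SibSp": rng.integers(0, 4, size=n),
    "Parch": rng.integers(0, 3, size=n),
})
# Invented target rule: survival depends on Sex alone (assumption for the demo)
toy["Survived"] = (toy["Sex"] == 0).astype(int)

X_toy = toy.drop("Survived", axis=1)
y_toy = toy["Survived"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42
)

# Fit on the training rows, then predict the rows the tree has not seen
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

test_accuracy = accuracy_score(y_te, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
```

In the notebook itself, the equivalent calls would be `tree_clf.predict(X_test)` followed by `accuracy_score(y_test, ...)`.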

### Visualizing the tree

In [55]:
export_graphviz(tree_clf,
                out_file='./img/titanic_tree.dot',
                feature_names=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch'],
                # class_names must follow ascending class order: 0 = died, 1 = survived
                class_names=['Died', 'Survived'],
                rounded=True,
                filled=True)

In [56]:
! dot -Tpng ./img/titanic_tree.dot -o ./img/titanic_tree.png # Convert .dot to .png
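
Where the Graphviz `dot` command is not installed, `sklearn.tree.export_text` offers a plain-text view of the same kind of tree. The sketch below uses a tiny made-up frame, not the Titanic data, in which the target depends on `Sex` alone, so the fitted tree needs only one split.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up frame: the target is fully determined by 'Sex'
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [30.0, 25.0, 41.0, 19.0, 30.0, 25.0, 41.0, 19.0],
})
target = [1, 1, 1, 1, 0, 0, 0, 0]

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(toy, target)

# Render the learned splits as indented text, one line per node
report = export_text(clf, feature_names=list(toy.columns))
print(report)
```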