<a href="https://colab.research.google.com/github/JCaballerot/Taller_decisionTrees/blob/main/Titanic_lab/Lab_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<h1 align=center><font size = 5>Titanic - Machine Learning from Disaster</font></h1>

---

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item31">Introduction</a>  
2. <a href="#item32">Download and Clean the Dataset</a>  
3. <a href="#item33">Titanic Problem</a>  
4. <a href="#item34">Variable Analysis and Treatment</a>  
5. <a href="#item34">Decision Tree</a>  
6. <a href="#item34">Data Analysis with PyCaret</a>  

</font>
</div>

## Introduction


In this lab, you will learn how to use Python to build a decision tree classification model.


<h3>Objectives of this Notebook</h3>    
<h5> 1. Build and interpret a decision tree model.</h5>
<h5> 2. Download and clean a dataset.</h5>
<h5> 3. Carry out the preparation steps that precede modeling.</h5>
<h5> 4. Train and test the model.</h5>     

## Download and Clean the Dataset


First, let's import some modules that we will need for the analysis and for building the model.

In [5]:

# Scikit-Learn
import sklearn

# Common imports
import pandas as pd
import numpy as np


# Matplotlib setup: inline rendering and label sizes
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Load seaborn; one sns.set call is used because a second call
# would reset the style back to seaborn's default
import seaborn as sns
sns.set(style="whitegrid", color_codes=True, rc={'figure.figsize': (10, 6)})



## This Is the Legendary Titanic ML Competition



The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the death of 1502 of the 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems that some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: "What sorts of people were more likely to survive?" using passenger data (i.e., name, age, sex, socio-economic class, etc.).

<img src="https://storage.googleapis.com/kaggle-media/welcome/video_thumbnail.jpg" alt="HTML5 Icon" style="width: 600px; height: 450px;">
<div style="text-align: center">What sorts of people were more likely to survive? </div>


<b>Data description</b>

The Titanic data frame has 891 rows and 12 columns.

<b>The data frame contains the following columns:</b>

---

* <b>PassengerId : </b>  Unique passenger identifier
* <b>Survived : </b>  Survival (0 = No, 1 = Yes)
* <b>Pclass : </b>  Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* <b>Name : </b>  Passenger name
* <b>Sex : </b>  Sex
* <b>Age : </b>  Age in years
* <b>SibSp : </b>  # of siblings / spouses aboard the Titanic
* <b>Parch : </b>  # of parents / children aboard the Titanic
* <b>Ticket : </b>  Ticket number
* <b>Fare : </b>  Passenger fare
* <b>Cabin : </b>  Cabin number
* <b>Embarked : </b>  Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)



---


<strong>See this [link](https://www.kaggle.com/c/titanic/overview) to read more about the source of the Titanic data.</strong>


## Download and Clean the Dataset


In [None]:
# Load the data
data = pd.read_csv("train_titanic.csv")
data.head()

In [None]:
data.shape

In [None]:
data.describe().transpose()

In [None]:
# Inspect the target variable
sns.countplot(x='Survived', data = data, palette = 'hls')

## Categorical Variable Analysis

In [None]:
# Inspect a categorical variable
sns.countplot(x='Sex', data = data, palette = 'hls')

In [None]:
data.groupby(['Sex']).agg({"PassengerId":"count",
                           "Survived" :"mean"}).reset_index()

In [None]:

table = pd.crosstab(data.Sex,data.Survived)
table.div(table.sum(1).astype(float),axis=0).plot(kind='bar',stacked=True)
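
The stacked bar chart above plots row-normalized proportions. As a minimal sketch on a small made-up sample (the `toy` frame and its values are illustrative assumptions, not the Titanic data), `pd.crosstab` can produce the same row shares directly through its `normalize='index'` argument:

```python
import pandas as pd

# Hypothetical toy sample standing in for the Sex and Survived columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Manual normalization, as in the cell above: divide each row by its total
counts = pd.crosstab(toy.Sex, toy.Survived)
manual = counts.div(counts.sum(axis=1).astype(float), axis=0)

# Built-in normalization over rows gives the same proportions
built_in = pd.crosstab(toy.Sex, toy.Survived, normalize="index")

print(built_in)
```

Each row of `built_in` sums to 1, so the values can be read as the survival rate per category.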

## Numeric Variable Analysis

In [None]:
# Inspect a numeric variable
sns.displot(data, x="Age",kind="kde", fill=True)

In [None]:
sns.displot(data, x="Age", hue='Survived', kind="kde", fill=True)

In [None]:
ax = sns.boxplot(x="Survived", y="Age", data=data, palette = 'hls')

In [None]:
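# Preview of mean imputation; the result is displayed, not assigned back to data
data['Fare'].fillna(data['Fare'].mean())

In [None]:
# Preview of a two-level split of Fare (1 if below 100, else 2); not stored
data['Fare'].apply(lambda x: 1 if x < 100 else 2)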




In [None]:
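# Discretize the variable into 20 quantile-based bins

from sklearn.preprocessing import KBinsDiscretizer

# fit_transform returns a 2D array, so it is flattened before assignment
data['Fare_cat'] = KBinsDiscretizer(n_bins = 20, 
                                   encode = 'ordinal',
                                   strategy = "quantile").fit_transform(data[['Fare']]).ravel()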
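
To make the quantile strategy more concrete, here is a minimal sketch on made-up values (the `fares` array and the choice of 4 bins are assumptions for illustration). With quantile binning, each bin should receive roughly the same number of rows, which the counts below let you check:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical fares: 20 distinct values from 1 to 20, as a single column
fares = np.arange(1, 21, dtype=float).reshape(-1, 1)

# Quantile strategy: bin edges are placed at quantiles of the data
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
fare_bins = binner.fit_transform(fares).ravel()

# Count how many values fall in each bin
bin_ids, bin_counts = np.unique(fare_bins, return_counts=True)
print(dict(zip(bin_ids.astype(int), bin_counts)))

# Edges learned for the single input column
print(binner.bin_edges_[0])
```

On real data with many repeated values, such as fares, several quantiles can coincide, so the number of distinct bins may end up smaller than `n_bins`.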


In [None]:
sns.displot(data['Fare_cat'], palette = 'hls', discrete=True)

In [None]:
data[['Fare', 'Fare_cat', 'Survived']]

In [None]:
# Named aggregation: repeating the 'Fare' key in a dict would keep only the last entry
res = data.groupby('Fare_cat').agg(Survived=('Survived', 'mean'),
                                   Fare_min=('Fare', 'min'),
                                   Fare_max=('Fare', 'max')).reset_index()
res

In [None]:
# Event rate per bin of the numeric variable
sns.lineplot(data=data, x="Fare_cat", y="Survived", palette = 'hls')

## Decision Tree

In [7]:
data = pd.read_csv("train_titanic.csv")

In [9]:

numFeatures = ['Age','Fare','SibSp','Parch']
catFeatures = ['Pclass','Sex','Embarked']

# Flag missing categorical values with an explicit 'missing' label
for c in catFeatures:
    data[c] = data[c].replace(np.nan, 'missing')

In [10]:
# Impute missing numeric values with the column median and store them in new '_t' columns
data[[x + '_t' for x in numFeatures]] = data[numFeatures].fillna(data[numFeatures].median())


In [None]:
# Install category_encoders
!pip install category_encoders

In [None]:
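# Target encoding
# Note: the encoder is fit on the full dataset before the train/test split,
# so the encoded values carry target information from rows later used for testing.
from category_encoders import TargetEncoder
encoder = TargetEncoder()

data[[x + '_num' for x in catFeatures]] = encoder.fit_transform(data[catFeatures], data['Survived'])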
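
As a rough sketch of the idea behind target encoding, each category can be replaced by the mean of the target observed for that category. The snippet below does this by hand with pandas on made-up data (the `train` and `test` frames and their values are illustrative assumptions). It learns the mapping on the training rows only and then applies it to the test rows. `TargetEncoder` from `category_encoders` additionally smooths these means toward the global mean, so its values will generally differ from this plain version.

```python
import pandas as pd

# Hypothetical training and test rows
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})
test = pd.DataFrame({"Sex": ["female", "male", "unknown"]})

# Learn the per-category mean of the target on the training rows only
category_means = train.groupby("Sex")["Survived"].mean()
global_mean = train["Survived"].mean()

# Apply the mapping; categories unseen in training fall back to the global mean
train["Sex_num"] = train["Sex"].map(category_means)
test["Sex_num"] = test["Sex"].map(category_means).fillna(global_mean)

print(test)
```

Fitting the mapping on the training rows alone keeps the test rows out of the encoding, which is one way to avoid leaking the target into the evaluation.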


In [14]:
# Model variables

numFeatures = ['Age_t','Fare_t','SibSp_t','Parch_t']
catFeatures = ['Pclass_num','Sex_num','Embarked_num']

X = data[numFeatures + catFeatures]
y = data.Survived

In [15]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    random_state = 123)
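
One optional variation, not used in the cell above, is a stratified split, which keeps the share of positives the same in both partitions. A minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo asks for the same class ratio in train and test
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=123)

# Share of positives in each partition
print(yd_train.mean(), yd_test.mean())
```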

In [17]:
# Configure the model

from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(max_depth = 5,
                               min_samples_leaf = 30,
                               random_state = 123)


In [18]:
# Train the model

dtree = dtree.fit(X_train, y_train)
    
dtree

DecisionTreeClassifier(max_depth=5, min_samples_leaf=30, random_state=123)

In [19]:
from sklearn.tree import export_graphviz
from pydotplus import graph_from_dot_data

dot_data = export_graphviz(dtree,
                           feature_names = numFeatures + catFeatures,
                           filled = True,
                           rounded = True,
                           special_characters = True)

graph = graph_from_dot_data(dot_data)
graph.write_png('tree.png')
print(graph)
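
If Graphviz or `pydotplus` is not available, scikit-learn's `export_text` offers a plain-text view of the same kind of split rules. A minimal sketch on a tiny made-up dataset (the single feature named `Fare_t` and its values are assumptions for illustration, not the fitted `dtree`):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-feature dataset that a single split separates
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [0, 0, 1, 1]

# A depth-1 tree (a stump) learns one threshold on the feature
stump = DecisionTreeClassifier(max_depth=1, random_state=123).fit(X_demo, y_demo)

# Text rendering of the learned rules
rules = export_text(stump, feature_names=["Fare_t"])
print(rules)
```

The same call should work on the tree trained above by passing `dtree` together with the list `numFeatures + catFeatures` as `feature_names`.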


<pydotplus.graphviz.Dot object at 0x7fe4548a3690>


In [20]:
# Use the model to predict

# Work on copies so the new columns do not trigger pandas' SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

X_train['probability'] = dtree.predict_proba(X_train[numFeatures + catFeatures])[:,1]
X_test['probability']  = dtree.predict_proba(X_test[numFeatures + catFeatures])[:,1]

X_train['prediction'] = dtree.predict(X_train[numFeatures + catFeatures])
X_test['prediction']  = dtree.predict(X_test[numFeatures + catFeatures])

X_train['Survived'] = y_train
X_test['Survived'] = y_test

In [None]:
# Summary of all model metrics
from sklearn.metrics import (roc_auc_score, accuracy_score, precision_score,
                             recall_score, f1_score)

metricsDtree = pd.DataFrame({'metric':['AUC','Gini','Accuracy','Precision','Recall','F1-score'],
                                'dTree_train':[roc_auc_score(y_train, X_train.probability),
                                        (roc_auc_score(y_train, X_train.probability)*2-1),
                                        accuracy_score(y_train, X_train.prediction),
                                        precision_score(y_train, X_train.prediction),
                                        recall_score(y_train, X_train.prediction),
                                        f1_score(y_train, X_train.prediction)],

                                'dTree_test':[roc_auc_score(y_test, X_test.probability),
                                        (roc_auc_score(y_test, X_test.probability)*2-1),
                                        accuracy_score(y_test, X_test.prediction),
                                        precision_score(y_test, X_test.prediction),
                                        recall_score(y_test, X_test.prediction),
                                        f1_score(y_test, X_test.prediction)]})


metricsDtree                                 
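
The Gini row in the table above is derived from the AUC as Gini = 2 * AUC - 1, so a random ranking (AUC of 0.5) gives 0 and a perfect ranking gives 1. A small worked check on made-up scores (`y_true` and `y_score` are illustrative values, not model output):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC: share of (positive, negative) pairs ranked in the right order
auc = roc_auc_score(y_true, y_score)

# Gini coefficient rescales AUC from [0.5, 1] to [0, 1]
gini = 2 * auc - 1
print(auc, gini)
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75 and the Gini is 0.5.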

## Data Analysis with PyCaret

In [None]:
!pip install pycaret

In [None]:
!pip install pycaret[full]


In [None]:
from pycaret.utils import version
version()

'2.3.10'

In [None]:
import numpy as np
from pycaret.utils import enable_colab
enable_colab()

Colab mode enabled.


In [None]:
base_train = data.sample(frac = 0.8, random_state=1)
base_train.shape

In [None]:
from pycaret.classification import setup
# The target is the Titanic label column 'Survived'
experimento = setup(data = base_train, target = 'Survived', session_id=123)

In [None]:
from pycaret.classification import compare_models
modelos = compare_models(sort = 'Accuracy', fold = 10)

In [None]:
from pycaret.classification import create_model
dt = create_model('dt')

### Thank you for completing this lab!

---

