# Project

***

### Libraries

The required libraries are imported:

In [370]:
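# Note: the TensorFlow code in this notebook uses the 1.x graph API (tf.placeholder, tf.Session)
import tensorflow as tf, numpy as np, matplotlib.pyplot as plt, pandas as pd, sklearn as sk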
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn import tree
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from IPython.display import Image
from sklearn.tree import export_graphviz
import joblib  # sklearn.externals.joblib is no longer shipped with scikit-learn
import graphviz

***
### Exploratory data analysis

In [46]:
data_set = pd.read_csv("data_titanic_proyecto.csv")
data_set.head(5)

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,Lower,M,N


In [47]:
data_shape = data_set.shape
print(data_shape)

(891, 12)


In [48]:
col_name = data_set.columns
print(col_name)

Index(['PassengerId', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked', 'passenger_class', 'passenger_sex',
       'passenger_survived'],
      dtype='object')


***

### NaN

Looking for NaN values in the features:

In [49]:
data_set.isnull().sum()

PassengerId             0
Name                    0
Age                   177
SibSp                   0
Parch                   0
Ticket                  0
Fare                    0
Cabin                 687
Embarked                2
passenger_class         0
passenger_sex           0
passenger_survived      0
dtype: int64

The Cabin variable (687 NaN) is excluded from the features because more than 70% of its values are missing (687 of 891, about 77%).
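
The share of missing values per column can be checked directly instead of counting by hand. A minimal sketch on a small hypothetical frame (the `toy` data is made up for illustration; the 70% threshold is the one mentioned above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_set
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull().mean() gives the fraction of missing values per column
missing_ratio = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold used in the text
threshold = 0.70
cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
print(missing_ratio)
print(cols_to_drop)
```

Applied to `data_set`, the same expression would flag any column above the chosen threshold, which makes the cut-off explicit and easy to change.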

In [50]:
data_set = data_set.drop('Cabin', axis = 1)
col_name = col_name.drop('Cabin')
print(col_name)
data_set.head(5)

Index(['PassengerId', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Embarked', 'passenger_class', 'passenger_sex', 'passenger_survived'],
      dtype='object')


Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,S,Lower,M,N


Embarked has two missing values, which are replaced with the most frequent value in that column.

In [51]:
data_set.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [52]:
data_set["Embarked"] = data_set["Embarked"].fillna("S")

For Age (177 NaN), the missing values are filled with the median age.

In [53]:
median_edad = data_set["Age"].median()
data_set["Age"] = data_set["Age"].fillna(median_edad)
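
The same two imputations can be written without hard-coding the fill value. A sketch on hypothetical data (`toy` is made up), using `Series.mode()` for the categorical column and `Series.median()` for the numeric one:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing category and two missing ages
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q"],
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
})

# Most frequent category; mode() returns a Series, so take its first entry
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

# median() ignores NaN by default
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```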

We verify that none of the features contain NaN values:

In [54]:
data_set.isnull().sum()

PassengerId           0
Name                  0
Age                   0
SibSp                 0
Parch                 0
Ticket                0
Fare                  0
Embarked              0
passenger_class       0
passenger_sex         0
passenger_survived    0
dtype: int64

***
### Categorical data

The following categorical variables are encoded:
* passenger_sex (one-hot encoding)
* passenger_survived (target, label-encoded as 0/1)
* Embarked (one-hot encoding)

In [55]:
data_x = data_set.iloc[:,:-1]
data_x.head(2)

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,passenger_class,passenger_sex
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,S,Lower,M
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C,Upper,F


In [56]:
data_y = data_set.iloc[:,-1]

In [57]:
labelencoder = LabelEncoder()
# Label-encode y (N -> 0, Y -> 1)
data_y = labelencoder.fit_transform(data_y)
#one_hot = np.eye(len(set(data_y)))[categorias]
#data_y = one_hot
print(data_y[:2])

[0 1]


In [59]:
labels_one_hot = list(("Embarked", "passenger_sex"))
data_encoded = pd.get_dummies(data_x[labels_one_hot])
data_encoded.head(5)

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S,passenger_sex_F,passenger_sex_M
0,0,0,1,0,1
1,1,0,0,1,0
2,0,0,1,1,0
3,0,0,1,1,0
4,0,0,1,0,1


Adding the one-hot encoded data to our feature matrix:

In [60]:
data_x = data_x.join(data_encoded)

In [61]:
data_x.head(2)

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,passenger_class,passenger_sex,Embarked_C,Embarked_Q,Embarked_S,passenger_sex_F,passenger_sex_M
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,S,Lower,M,0,0,1,0,1
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C,Upper,F,1,0,0,1,0


***

The following categorical variable is ordinal:
* passenger_class

This means its values have a natural order, so they are mapped as follows:
- Lower = 1
- Middle = 2
- Upper = 3

In [62]:
data_x["passenger_class"] = data_x["passenger_class"].map({"Lower" : 1, "Middle" : 2, "Upper" : 3})

In [63]:
data_x.head(2)

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,passenger_class,passenger_sex,Embarked_C,Embarked_Q,Embarked_S,passenger_sex_F,passenger_sex_M
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,S,1,M,0,0,1,0,1
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C,3,F,1,0,0,1,0


***
### Dropping features

Before splitting the data set, some very specific features that at first glance do not add relevant information to the problem are dropped:
* PassengerId
* Name
* Ticket

The original columns that were one-hot encoded are removed as well. Note that the cell below also drops passenger_class, even though it was mapped to ordinal values above, so it is not part of the final feature matrix.

In [64]:
col_name = data_x.columns

In [65]:
col_name = col_name.drop(list(("PassengerId", "Name", "Ticket", "Embarked", "passenger_class", "passenger_sex")))
print(col_name)

Index(['Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q',
       'Embarked_S', 'passenger_sex_F', 'passenger_sex_M'],
      dtype='object')


In [66]:
data_x = data_x[col_name]
data_x.head(2)

Unnamed: 0,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,passenger_sex_F,passenger_sex_M
0,22.0,1,0,7.25,0,0,1,0,1
1,38.0,1,0,71.2833,1,0,0,1,0


***
### Splitting the data

The data set is split into:
* Training
* Validation
* Test

In [67]:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size = 0.2, random_state = 0)

In [68]:
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size = 0.2, random_state = 0)

In [69]:
print("------------------")
print("x_train "+str(x_train.shape))
print("y_train "+str(y_train.shape))
print("------------------")
print("x_val "+str(x_val.shape))
print("y_val "+str(y_val.shape))
print("------------------")
print("x_test "+str(x_test.shape))
print("y_test "+str(y_test.shape))
print("------------------")

------------------
x_train (569, 9)
y_train (569,)
------------------
x_val (143, 9)
y_val (143,)
------------------
x_test (179, 9)
y_test (179,)
------------------
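
The shapes above follow from applying `test_size = 0.2` twice. A quick check of the arithmetic, assuming (as scikit-learn's `train_test_split` does for a float `test_size`) that the held-out size is rounded up:

```python
import math

n_total = 891

# First split: 20% held out as the test set
n_test = math.ceil(0.2 * n_total)
n_remaining = n_total - n_test

# Second split: 20% of the remainder held out as the validation set
n_val = math.ceil(0.2 * n_remaining)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
```

The effective proportions are therefore roughly 64% training, 16% validation and 20% test.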


***
### Normalizing the data

So that the model can converge faster, the following features are rescaled to the [0, 1] range (min-max scaling) in the training set:
* Age
* Fare

In [70]:
x_train["Age"]=(x_train["Age"]-x_train["Age"].min())/(x_train["Age"].max()-x_train["Age"].min())
x_train["Fare"]=(x_train["Fare"]-x_train["Fare"].min())/(x_train["Fare"].max()-x_train["Fare"].min())
x_train.head(2)

Unnamed: 0,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,passenger_sex_F,passenger_sex_M
486,0.468158,1,0,0.175668,0,0,1,1,0
108,0.509069,0,0,0.015412,0,0,1,0,1
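
Only the training set is rescaled in the cell above. If the validation and test sets are evaluated later, they would normally be transformed with the minimum and maximum learned from the training set. A sketch of that idea on hypothetical toy frames (`train` and `val` are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for x_train and x_val
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 50.0]})
val = pd.DataFrame({"Age": [25.0, 50.0], "Fare": [30.0, 5.0]})

cols = ["Age", "Fare"]

# Learn the statistics from the training data only, before overwriting it
col_min = train[cols].min()
col_max = train[cols].max()

train[cols] = (train[cols] - col_min) / (col_max - col_min)
# Reuse the training statistics; values outside the training range land outside [0, 1]
val[cols] = (val[cols] - col_min) / (col_max - col_min)
print(val)
```

Reusing the training statistics keeps information from the held-out sets out of the preprocessing step.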


***
## Decision tree function

In [372]:
def train_dt(x, y, depth):
    class_dt = DecisionTreeClassifier(max_depth = depth)
    class_dt = class_dt.fit(x, y)
    y_hat = class_dt.predict(x)
    accuracy = metrics.accuracy_score(y, y_hat)
    error = metrics.mean_squared_error(y, y_hat)
    precision = metrics.precision_score(y, y_hat)
    recall = metrics.recall_score(y, y_hat)
    f1_score = metrics.f1_score(y, y_hat)
    return(class_dt, (accuracy, precision, recall, f1_score))

In [373]:
class_dt, resultado_dt = train_dt(x_train, y_train, 1)

### Grid search

In [418]:
grid_dt = np.arange(1,10, step = 1)
val_dt = []
met_dt = []
acc_dt = []
for x in grid_dt:
    m_dt, r_dt = train_dt(x_train, y_train, x)
    val_dt.append(m_dt)
    met_dt.append(r_dt)
    
for z in range(0,len(met_dt)):
    acc_dt.append(met_dt[z][0])

mejor_dt = val_dt[np.argmax(acc_dt)]

In [419]:
joblib.dump(mejor_dt, "dt.pkl")

['dt.pkl']
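
Since `train_dt` scores each tree on the same data it was trained on, training accuracy tends to grow with depth, so `np.argmax` will usually favor the deepest tree. A sketch of selecting the depth on a held-out validation set instead, using synthetic data from `make_classification` (the data and variable names here are illustrative, not the project's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 9 features (illustrative only)
X, y = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

depths = np.arange(1, 10)
models, val_acc = [], []
for depth in depths:
    model = DecisionTreeClassifier(max_depth=int(depth), random_state=0).fit(X_tr, y_tr)
    models.append(model)
    val_acc.append(model.score(X_va, y_va))  # accuracy on data not used for fitting

# Keep the tree with the highest validation accuracy
best_dt = models[int(np.argmax(val_acc))]
print(best_dt.get_depth(), max(val_acc))
```

In the notebook, the same idea would amount to fitting on `x_train` and scoring on `x_val`, provided `x_val` receives the same preprocessing as `x_train`.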

***
## SVM function

In [363]:
def train_SVM(x, y, c, g):
    class_svm = svm.SVC(kernel="rbf", gamma = g, C = c) # Gaussian Kernel
    class_svm = class_svm.fit(x, y)
    y_hat = class_svm.predict(x)
    accuracy = metrics.accuracy_score(y, y_hat)
    error = metrics.mean_squared_error(y, y_hat)
    precision = metrics.precision_score(y, y_hat)
    recall = metrics.recall_score(y, y_hat)
    f1_score = metrics.f1_score(y, y_hat)
    return(class_svm, (accuracy, error, precision, recall, f1_score))

In [364]:
class_svm, resultado_svm = train_SVM(x_train, y_train, 1, 10)
print(class_svm)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=10, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)


In [426]:
c_svm = np.arange(1,10, step = 0.5)
g_svm = np.arange(1,10, step = 1)
val_svm = []
met_svm = []
acc_svm = []
for x in c_svm:
    for z in g_svm:
        m_svm, r_svm = train_SVM(x_train, y_train, x, z)
        val_svm.append(m_svm)
        met_svm.append(r_svm)
    
for i in range(0,len(met_svm)):
    acc_svm.append(met_svm[i][0])

mejor_svm = val_svm[np.argmax(acc_svm)]
print(mejor_svm)

SVC(C=8.5, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=9, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)


In [427]:
joblib.dump(mejor_svm, "svm.pkl")

['svm.pkl']
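
The nested loops above can also be expressed with scikit-learn's `GridSearchCV`, which scores each `(C, gamma)` pair by cross-validation rather than by training accuracy. A sketch on synthetic data (the data is illustrative; the grid mirrors the ranges used above):

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data with 9 features (illustrative only)
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

param_grid = {
    "C": np.arange(1.0, 10.0, 0.5),
    "gamma": np.arange(1.0, 10.0, 1.0),
}

# Each combination is evaluated with 5-fold cross-validation
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_svm = search.best_estimator_
print(search.best_params_, search.best_score_)
```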

***
## Naive Bayes function

In [32]:
def train_nb(x, y):
    total = len(y)
    total_yes = len(y[y==1])
    total_no = len(y[y==0])
    
    
    
    #accuracy = metrics.accuracy_score(y, y_hat)
    #error = metrics.mean_squared_error(y, y_hat)
    #precision = metrics.precision_score(y, y_hat)
    #recall = metrics.recall_score(y, y_hat)
    #f1_score = metrics.f1_score(y, y_hat)
    #return((accuracy, precision, recall, f1_score))
    return

In [33]:
train_nb(x_train, y_train)

In [34]:
len(x_train['Age'].unique())

80
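
`train_nb` above only counts the classes and is left unfinished. For reference, a sketch of how the same train-and-score pattern could look with scikit-learn's `GaussianNB`. This is a swapped-in library model on synthetic data, not the author's implementation, and `train_nb_sketch` is a hypothetical name:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

def train_nb_sketch(x, y):
    """Fit Gaussian Naive Bayes and return the model with its training metrics."""
    model = GaussianNB().fit(x, y)
    y_hat = model.predict(x)
    scores = (
        metrics.accuracy_score(y, y_hat),
        metrics.precision_score(y, y_hat),
        metrics.recall_score(y, y_hat),
        metrics.f1_score(y, y_hat),
    )
    return model, scores

# Synthetic binary classification data with 9 features (illustrative only)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
model, scores = train_nb_sketch(X, y)
print(scores)
```

`GaussianNB` treats every feature as continuous, which is a simplification for the one-hot columns in this data set.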

***
## Regularized logistic regression function

In [319]:
def train_rl(x, y, lr, lp, t_epoch, batch_size):
    
    # Build the graph
    tf.reset_default_graph()
    grafo = tf.Graph()
    with grafo.as_default() as g:
        X = tf.placeholder("float", shape = [None, 9], name = "X_train")
        Y = tf.placeholder("float", shape = [None, 1], name = "Y_train")
        W = tf.Variable(tf.ones([9, 1]), name = "W", dtype = "float")
        b = tf.Variable(tf.ones([1,1]), name = "b", dtype = "float")
        learning_rate = tf.placeholder("float", name = "learning_rate")
        lambda_p = tf.placeholder("float", name = "lambda")
        with tf.name_scope('Mult') as scope:
            mul = tf.add(tf.matmul(X, W), b)
        with tf.name_scope('Sigmoid') as scope:
            sig = tf.nn.sigmoid(mul, name = "Sigmoid")
        with tf.name_scope('Costo') as scope:
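            # L2 (ridge) penalty: 0.5 * lambda * sum(W^2)
            l2_penalty = tf.multiply(tf.multiply(0.5, lambda_p), tf.reduce_sum(tf.pow(W,2)))
            # NOTE: sigmoid_cross_entropy_with_logits applies the sigmoid itself, so it expects the raw
            # logits `mul`; passing `sig` applies the sigmoid twice. The results shown below were obtained with `sig`.
            cross = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=Y, logits=sig))
            cost = tf.add(cross, l2_penalty)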
            optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
        init = tf.global_variables_initializer()
    
    # Define the batches
    num_train = y.shape[0]
    t_batch = num_train//batch_size
        
    with tf.Session(graph = grafo) as sess:
        sess.run(init)
        for epoch in range(t_epoch):
            resultado_costo = 0
            for bt in range(t_batch):
                batch_x = x[bt*batch_size:bt*batch_size+batch_size]
                batch_y = y[bt*batch_size:bt*batch_size+batch_size]
                sess.run(optimizer, feed_dict = {X: batch_x, Y: batch_y.reshape((batch_y.shape[0],1)), learning_rate: lr, lambda_p: lp})
                c, weight, bias = sess.run([cost, W, b], feed_dict = {X: batch_x, Y: batch_y.reshape((batch_y.shape[0],1)), learning_rate: lr, lambda_p: lp})
    return(weight, bias)

In [325]:
w_t, b_t = train_rl(x_train, y_train, 0.015, 0.001, 1500, 50)

print(w_t)

[[ 0.02734548]
 [-0.4074739 ]
 [-0.22921725]
 [ 0.8940383 ]
 [ 0.5203464 ]
 [ 0.4739687 ]
 [-0.37591222]
 [ 2.3235385 ]
 [-2.448175  ]]
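
Written out, and assuming the cross-entropy is applied to the raw logits $z_i = x_i W + b$, the regularized cost that this kind of model conventionally minimizes over a batch of $m$ examples is:

$$
J(W, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\Big] + \frac{\lambda}{2}\sum_{j=1}^{9} W_j^2,
\qquad \hat{y}_i = \sigma(x_i W + b)
$$

In `train_rl`, `lp` plays the role of $\lambda$ and `lr` is the gradient descent step size. The penalty is quadratic in $W$, so it shrinks the weights without forcing them exactly to zero.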


In [323]:
# Make the prediction
def prediccion(x, y, w, b):
    reg_log = np.matmul(x.to_numpy(), w)+b
    y_hat = 1/(1+np.exp(-reg_log))
    y_hat = 1*(y_hat > 0.5)
    accuracy = metrics.accuracy_score(y, y_hat)
    error = metrics.mean_squared_error(y, y_hat)
    precision = metrics.precision_score(y, y_hat)
    recall = metrics.recall_score(y, y_hat)
    f1_score = metrics.f1_score(y, y_hat)
    return((accuracy, error, precision, recall, f1_score))

In [326]:
prediccion(x_train, y_train, w_t, b_t)


(0.8084358523725835,
 0.19156414762741653,
 0.7954545454545454,
 0.6572769953051644,
 0.7197943444730077)
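
Two relationships in the tuple above are worth noting: for 0/1 labels the mean squared error equals the misclassification rate (1 - accuracy), and F1 is the harmonic mean of precision and recall. A quick check with the reported values:

```python
# Values reported above: (accuracy, error, precision, recall, f1_score)
accuracy, error, precision, recall, f1 = (
    0.8084358523725835,
    0.19156414762741653,
    0.7954545454545454,
    0.6572769953051644,
    0.7197943444730077,
)

# With labels in {0, 1}, each squared error is 0 or 1, so MSE = 1 - accuracy
mse_matches = abs((1 - accuracy) - error) < 1e-9
print(mse_matches)

# F1 is the harmonic mean of precision and recall
f1_from_pr = 2 * precision * recall / (precision + recall)
print(abs(f1_from_pr - f1) < 1e-9)
```

Because of the first identity, the `error` entry carries no information beyond the accuracy for this binary problem.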

***
## Decision tree model

In [355]:
dot_data = StringIO()
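# class_names must be a list of labels in ascending class order (0 = N, 1 = Y), not str(y_train)
data = tree.export_graphviz(class_dt, out_file = None, filled = True, rounded = True, special_characters = True, feature_names = list(x_train.columns), class_names = ["N", "Y"])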
graph = graphviz.Source(data)
graph.render()

'Source.gv.pdf'

<img src = "./class_dt.png">

***
## K-fold cross-validation

K-fold cross-validation is a technique that evaluates a model on several partitions of the data to check that it is not overfitting and to help choose the simplest model that is still accurate.

The data is split into k subsets (folds); each model is trained on k-1 folds and evaluated on the remaining one, rotating until every fold has served as the evaluation set.

In this project it could have been applied to each model to re-evaluate it with different feature combinations and keep those with the best metrics.
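
As an illustration of the procedure, a minimal sketch with scikit-learn's `KFold` and `cross_val_score` on synthetic data (the data, the number of folds and the depths compared are assumptions for the example, not results from this project):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 9 features (illustrative only)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# 5 folds: each model is trained on 4/5 of the data and scored on the remaining 1/5
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for depth in (1, 3, 5):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[depth] = scores
    print(depth, scores.mean(), scores.std())
```

Comparing the mean and spread of the fold scores gives a more stable basis for choosing between models than a single train/validation split.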

***
## CONCLUSIONS

* The main challenge in this project was selecting and processing the data properly so that each of the models presented in this document could use it. This step is fundamental because each model required decisions such as normalizing the data and handling categorical variables.
* Time was definitely an important factor and the biggest obstacle that kept me from completing the project as I had hoped. I invested time in reading documentation, understanding the algorithms (in theory, even though not all of them are implemented), and doing feature selection, so I neglected other parts of the project, such as putting it into production. I will definitely keep working on this project in the future to finish it the way I would have liked to deliver it.
* It is advisable to use or master several tools for this kind of work (numpy, pandas, scikit-learn, tensorflow). Combining them and exploring the advantages each offers can save several hours of work (for example, using dummy variables or building models in scikit-learn).