In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.externals import joblib

In [2]:
data = pd.read_csv("data_titanic_proyecto.csv") 
data.head()

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,Lower,M,N


In [3]:
data.dtypes

PassengerId             int64
Name                   object
Age                   float64
SibSp                   int64
Parch                   int64
Ticket                 object
Fare                  float64
Cabin                  object
Embarked               object
passenger_class        object
passenger_sex          object
passenger_survived     object
dtype: object

In [4]:
data['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [5]:
data.shape[0]

891

### Creating Frequency tables

# Encode variables to numbers for statistical learning model implementations

First we will check if the dataset contains missing values. If some numeric column contains missing values, they will be replaced with mean imputation of the column.

Numerical variables will be scaled using *min max normalization* under the hypothesis that this will help models to converge more quickly and to solve for better coefficients.

In [6]:
data.isnull().sum()

PassengerId             0
Name                    0
Age                   177
SibSp                   0
Parch                   0
Ticket                  0
Fare                    0
Cabin                 687
Embarked                2
passenger_class         0
passenger_sex           0
passenger_survived      0
dtype: int64

## Imputations for missing values variables 

The variable *Cabin* will be dropped as it is not used in the model. 

Mean of *Age* is imputed in missing values.

In [4]:
df =data.drop(columns=['Cabin'])

In [5]:
df.shape

(891, 11)

In [34]:
data['Age'].mean()
#data['Age'].median()

29.69911764705882

In [7]:
df['Age'].fillna((df['Age'].mean()), inplace=True)

In [8]:
df.isnull().sum()

PassengerId           0
Name                  0
Age                   0
SibSp                 0
Parch                 0
Ticket                0
Fare                  0
Embarked              2
passenger_class       0
passenger_sex         0
passenger_survived    0
dtype: int64

In [9]:
df = df.dropna()

In [10]:
df.shape

(889, 11)

In [35]:
from sklearn import preprocessing

#Creating the label encoder
encoder = preprocessing.LabelEncoder()

#Converting string labels into numbers
df['Embarked_encoded'] = encoder.fit_transform(np.array(df['Embarked']))
df['class_encoded'] = encoder.fit_transform(np.array(df['passenger_class']))
df['sex_encoded'] = encoder.fit_transform(np.array(df['passenger_sex']))
df['passenger_survived_encoded'] = encoder.fit_transform(np.array(df['passenger_survived']))
#df['fare'] = preprocessing.normalize(np.array(df['Fare']).reshape(1,-1))
df['Age'] = preprocessing.normalize(np.array(df['Age']).reshape(1, -1))[0]
df['Fare'] = preprocessing.normalize(np.array(df['Fare']).reshape(1, -1))[0]


In [36]:
df.head()

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,passenger_class,passenger_sex,passenger_survived,Embarked_encoded,class_encoded,sex_encoded,passenger_survived_encoded
0,1,"Braund, Mr. Owen Harris",0.0228,1,0,A/5 21171,0.004112,S,Lower,M,N,2,0,1,0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.039382,1,0,PC 17599,0.040427,C,Upper,F,Y,0,2,0,1
2,3,"Heikkinen, Miss. Laina",0.026945,0,0,STON/O2. 3101282,0.004495,S,Lower,F,Y,2,0,0,1
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.036273,1,0,113803,0.030115,S,Upper,F,Y,2,2,0,1
4,5,"Allen, Mr. William Henry",0.036273,0,0,373450,0.004565,S,Lower,M,N,2,0,1,0


In [37]:
data.head()

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,passenger_class,passenger_sex,passenger_survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,Lower,M,N


In [38]:
labels = df['passenger_survived_encoded']
labels.head()
df_features = df[['Age','SibSp','Parch','Fare','Embarked_encoded','class_encoded','sex_encoded']]
df_features.head()

Unnamed: 0,Age,SibSp,Parch,Fare,Embarked_encoded,class_encoded,sex_encoded
0,0.0228,1,0,0.004112,2,0,1
1,0.039382,1,0,0.040427,0,2,0
2,0.026945,0,0,0.004495,2,0,0
3,0.036273,1,0,0.030115,2,2,0
4,0.036273,0,0,0.004565,2,0,1


In [39]:
df_features.shape[0]

889

In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test_validation, y_train, y_test_validation = train_test_split(df_features, labels,
                                                    stratify=labels, 
                                                    test_size=0.4)

In [41]:
X_train.shape
#X_test.shape

(533, 7)

In [42]:
from sklearn.model_selection import train_test_split
X_validation, X_test, y_validation, y_test = train_test_split(X_test_validation, y_test_validation,
                                                    stratify=y_test_validation, 
                                                    test_size=0.5)

In [43]:
print("Training dataset:", X_train.shape," Validation dataset:",X_validation.shape, "Test dataset:", X_test.shape)

Training dataset: (533, 7)  Validation dataset: (178, 7) Test dataset: (178, 7)


In [44]:
pd.DataFrame(y_train).groupby('passenger_survived_encoded').agg({'passenger_survived_encoded':'count'})

Unnamed: 0_level_0,passenger_survived_encoded
passenger_survived_encoded,Unnamed: 1_level_1
0,329
1,204


In [45]:
pd.DataFrame(y_validation).groupby('passenger_survived_encoded').agg({'passenger_survived_encoded':'count'})

Unnamed: 0_level_0,passenger_survived_encoded
passenger_survived_encoded,Unnamed: 1_level_1
0,110
1,68


In [46]:
pd.DataFrame(y_test).groupby('passenger_survived_encoded').agg({'passenger_survived_encoded':'count'})

Unnamed: 0_level_0,passenger_survived_encoded
passenger_survived_encoded,Unnamed: 1_level_1
0,110
1,68


# Creating a Logbook for all experiments

In [47]:
chronicle = pd.DataFrame(columns=["Model","Accuracy","Recall","Precision","F1_Score"])

# Creating Models

Models will be trained and then stored so they can be called in ensemble learning model. Each model training will have a description that states which variables where used in the training and a run declaring the number of that type of model trained. 

Models will be selected as weak learners for the ensemble learning model. 

# Naive Bayes Model

In [75]:
train_bayes = pd.concat([X_train,y_train ], axis=1)
train_bayes = train_bayes[['sex_encoded','Embarked_encoded','passenger_survived_encoded']]
#train_bayes.columns = ['sex','']

#X_train[['sex_encoded','Embarked_encoded']].head()
print(X_train.columns.values)

['Age' 'SibSp' 'Parch' 'Fare' 'Embarked_encoded' 'class_encoded'
 'sex_encoded']


In [54]:
from itertools import product
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [110]:
def naive_bayes(x_train,y_train,x_test,y_test,threshold):
    
    data =  pd.concat([x_train,y_train ], axis=1)
    l = data.drop('passenger_survived_encoded', axis = 1).values.T.tolist()
    key = pd.DataFrame(list(product(*l)), 
                       columns=data.columns[0:len(data.columns)-1])
    key = key.drop_duplicates()
    key = key.sort_values(by = data.columns[0])
    origin_len = len(key.columns)
    
    for i in range(0, origin_len):
        p = pd.DataFrame(np.array(data.groupby(key.columns[i]).size()/len(data)))
        p.reset_index(level = None, inplace = True)
        p.columns = [key.columns[i], 'p_'+key.columns[i]]
        key = pd.merge(key, p, how = 'left')

        ct = pd.crosstab(data['passenger_survived_encoded'], data[key.columns[i]]).apply(lambda r: r/r.sum(), axis=1).T
        ct.reset_index(level = None, inplace = True)
        ct.columns = [key.columns[i], 'pc_'+key.columns[i]+'_0', 'pc_'+key.columns[i]+'_1']
        key = pd.merge(key, ct, how = 'left')

    key['p_survived_0'] = np.array(data.groupby('passenger_survived_encoded').size()/len(data))[0]
    key['p_survived_1'] = np.array(data.groupby('passenger_survived_encoded').size()/len(data))[1]
    
    prob_0 = 1
    prob_1 = 1
    for i in range(0, origin_len):
        prob_0 *= ((key['pc_'+str(key.columns[i])+'_0']*key['p_survived_0'])/key['p_'+str(key.columns[i])])
        prob_1 *= ((key['pc_'+str(key.columns[i])+'_1']*key['p_survived_1'])/key['p_'+str(key.columns[i])])

    key['p_0'] = prob_0/(prob_0+prob_1)
    key['p_1'] = prob_1/(prob_0+prob_1)
    
  
    pred = pd.merge(x_test, key, how = 'left')
    pred_train = pd.merge(data, key, how = 'left')
    pred['survived_code_pred'] = np.where(pred['p_1'] > threshold, 1, 0)
    pred_train['survived_code_pred'] = np.where(pred_train['p_1'] > threshold, 1, 0)
    pred = np.array(pred['survived_code_pred'])
    pred_train = np.array(pred_train['survived_code_pred'])
    accuracy = accuracy_score(pred,y_test)
    accuracy_train = accuracy_score(pred_train,y_train)
    print("Validation Accuracy: ",accuracy)
    print("Training Accuracy: ",accuracy_train)
    
    
    #y_pred = decision_tree.predict(x_test)
    #accuracy = accuracy_score(y_pred,y_test)
    error = 1 - accuracy
    recall = recall_score(pred,y_test)
    precision = precision_score(pred, y_test)
    f1_sc = f1_score(pred,y_test)

    dic = dict()
    dic['Model']='naive_bayes'
    dic['Accuracy']=accuracy
    dic['Recall']= recall
    dic['Precision']=precision
    dic['F1_Score']=f1_sc
    description = "variables:"+ str(data.columns.values)+ "_accuracy_"+str(accuracy)+"_threshold_"+str(threshold)+".pkl"
    dic['Description'] = description

   
    
    #joblib.dump(description)
    #return(dic)
    
    
    return(pred)

In [115]:
bayes = naive_bayes(X_train[['sex_encoded','Embarked_encoded']],
                    y_train,
                    X_validation[['sex_encoded','Embarked_encoded']],
                    y_validation,
                    threshold=0.5)
bayes

Validation Accuracy:  0.8370786516853933
Training Accuracy:  0.7598499061913696


array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0])

The Naive Bayes Model will be saved as a .py file and will be called in the ensemble learning part of the project. This decision has been made as Naive Bayes does not have any hyperparameters to tune, as the procedure is the same for all posible inputs. Only inputs vary, so might as well define a function that can be trained for any combination of variables.

# Decision Tree Model

In [53]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [130]:
def decision_tree_model(x_train,y_train,x_test,y_test, run,crit="entropy"):
    model="decision_tree_"+str(run)
    decision_tree = DecisionTreeClassifier(criterion=crit)
    decision_tree.fit(x_train,y_train)

    y_pred = decision_tree.predict(x_test)
    accuracy = accuracy_score(y_pred,y_test)
    error = 1 - accuracy
    recall = recall_score(y_pred,y_test)
    precision = precision_score(y_pred, y_test)
    f1_sc = f1_score(y_pred,y_test)

    dic = dict()
    dic['Model']=model
    dic['Accuracy']=accuracy
    dic['Recall']= recall
    dic['Precision']=precision
    dic['F1_Score']=f1_sc
    description = "criterion_"+ str(crit)+ "_accuracy_"+str(accuracy)+".pkl"
    dic['Description'] = description

   
    
    joblib.dump(decision_tree, description)
    return(dic)

In [142]:
dtm = decision_tree_model(X_train,y_train,X_validation,y_validation, run=2)
dtm

{'Model': 'decision_tree_2',
 'Accuracy': 0.8146067415730337,
 'Recall': 0.7966101694915254,
 'Precision': 0.6911764705882353,
 'F1_Score': 0.7401574803149606,
 'Description': 'criterion_entropy_accuracy_0.8146067415730337.pkl'}

In [143]:
chronicle = chronicle.append(dtm,ignore_index=True)
chronicle

Unnamed: 0,Model,Accuracy,Recall,Precision,F1_Score,Description
0,decision_tree_1,0.842697,0.8125,0.764706,0.787879,criterion_entropy_accuracy_0.8426966292134831.pkl
1,svm_1,0.859551,0.890909,0.720588,0.796748,C_0.5_gamma_1_kernel_rbf.pkl
2,decision_tree_1,0.820225,0.8,0.705882,0.75,criterion_entropy_accuracy_0.8202247191011236.pkl
3,decision_tree_2,0.814607,0.79661,0.691176,0.740157,criterion_entropy_accuracy_0.8146067415730337.pkl


In [249]:
#tree_test = joblib.load("test.pkl")
#tree_test.predict(X_validation)

joblib.load(np.array(chronicle[chronicle['Model']=='decision_tree_1']['Description'])[0])

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## Support Vector Machine Model 

In [136]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [263]:
def svm_model(x_train,y_train,x_test,y_test,run):
    model="svm_"+str(run)
    svm = SVC()
    
    
    #svm.fit(x_train, y_train)
    parameters = {'kernel':('linear', 'rbf'), 
                  'C':(1,0.25,0.5,0.75),
                  'gamma': (1,2,3,'auto'),
                  'decision_function_shape':('ovo','ovr'),
                  'shrinking':(True,False)}

    grid = GridSearchCV(svm, parameters)
    grid.fit(x_train,y_train)
    
    
    y_pred = grid.predict(x_test)
    accuracy = accuracy_score(y_pred,y_test)
    error = 1 - accuracy
    recall = recall_score(y_pred,y_test)
    precision = precision_score(y_pred, y_test)
    f1_sc = f1_score(y_pred,y_test)
    
    dic = dict()
    dic['Model']=model
    dic['Accuracy']=accuracy
    dic['Recall']= recall
    dic['Precision']=precision
    dic['F1_Score']=f1_sc
    description = "C_"+str(grid.best_params_['C'])+ "_gamma_"+str(grid.best_params_['gamma'])+"_kernel_"+str(grid.best_params_['kernel'])+".pkl"
    dic['Description'] = description
    joblib.dump(grid,description)
    
    print(description)
    print(dic)
    return(dic)

    
    

In [264]:
svm = svm_model(X_train,y_train,X_validation,y_validation, run=2)



C_0.5_gamma_1_kernel_rbf.pkl
{'Model': 'svm_2', 'Accuracy': 0.8595505617977528, 'Recall': 0.8909090909090909, 'Precision': 0.7205882352941176, 'F1_Score': 0.7967479674796749, 'Description': 'C_0.5_gamma_1_kernel_rbf.pkl'}


In [139]:
chronicle = chronicle.append(svm,ignore_index=True)
chronicle

Unnamed: 0,Model,Accuracy,Recall,Precision,F1_Score,Description
0,decision_tree_1,0.842697,0.8125,0.764706,0.787879,criterion_entropy_accuracy_0.8426966292134831.pkl
1,svm_1,0.859551,0.890909,0.720588,0.796748,C_0.5_gamma_1_kernel_rbf.pkl


In [33]:
yt= y_train.values.reshape((y_train.size, 1))
yt.shape

(533, 1)

In [34]:
train_labels_model_encoded = pd.DataFrame(y_train)
train_labels_model_hot= pd.get_dummies(y_train)
train_labels_model = train_labels_model_hot
np.shape(train_labels_model)

(533, 2)

## Logistic Regression Model 

In [313]:
def multinomial_model(epoch_num,lr,batch_size,x_training,y_training, x_validation, y_validation,beta) :
    

    
    
    ###one hot encoding ###
    y_training_encoded = pd.DataFrame(y_training)
    y_training_hot= pd.get_dummies(y_training)
    y_training = y_training_hot
    
   # y_validation = y_validation.values.reshape(y_validation.size,1)
    y_validation_encoded = pd.DataFrame(y_validation)
    y_validation_hot = pd.get_dummies(y_validation)
    y_validation = y_validation_hot
    
    
    tf.reset_default_graph()
    ##Hyperparameters
    batch = batch_size
    #y_training = y_training.values.reshape((y_training.size,1))
    m = np.shape(x_training)[1]
    n = np.shape(y_training)[1]
    training_epochs = epoch_num
    learning_rate = lr

    x_train = tf.placeholder(tf.float64, shape =[None,m], name="x_train")
    y_train = tf.placeholder(tf.float64, shape=[None,n], name="y_train")

    #W = tf.Variable(np.random.randn(m,n), name = "W") 
    #b = tf.Variable(np.random.randn(n), name = "b")

    W = tf.Variable(np.zeros([m,n]), name = "W") 
    b = tf.Variable(np.zeros(n), name = "b")

    with tf.name_scope("Hypotesis"):
        logits = tf.matmul(x_train,W) + b
        y_pred = tf.nn.softmax(logits, name="Softmax")

    with tf.name_scope("Cross_Entropy"):
        cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_train * tf.log(y_pred), reduction_indices=[1]))
        #cross_entropy = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=x_training, labels=y_training))
    
    with tf.name_scope("Regularization"):
        regularizer = tf.nn.l2_loss(W)
        cross_entropy = tf.reduce_mean(cross_entropy + beta*regularizer)
        
        
    with tf.name_scope("Optimizer"):
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)

    with tf.name_scope("Accuracy"):
        correct_prediction = tf.equal(tf.argmax(y_pred,1), tf.argmax(y_train,1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    init = tf.global_variables_initializer()

    with tf.name_scope("Disturbance"):
        disturbance = tf.summary.scalar(name = "Costfunction", tensor = cross_entropy)
        
    with tf.name_scope("Accuracy"):
        ac = tf.summary.scalar(name = "Costfunction", tensor = accuracy)
        
    #summaries = tf.summary.merge_all()
    
    with tf.Session() as sess:
    
        sess.run(init)
        
        for epoch in range(training_epochs+1):

            batch_num = (epoch * batch ) % (len(x_training) -batch)


            sess.run(optimizer, feed_dict = {x_train: x_training[batch_num:(batch_num+batch)], 
                                             y_train : y_training[batch_num:(batch_num+batch)]})



            if (epoch + 1) % 100 == 0:
                c = sess.run(disturbance, feed_dict = {x_train: x_training[batch_num:(batch_num+batch)], 
                                             y_train : y_training[batch_num:(batch_num+batch)]})
                
                a = sess.run(ac, feed_dict = {x_train: x_training[batch_num:(batch_num+batch)], 
                                             y_train : y_training[batch_num:(batch_num+batch)]})
                print("Epoch: " +str(epoch))
                
                print("train accuracy: " + 
                  str(sess.run(accuracy, feed_dict = {x_train: x_training[batch_num:(batch_num+batch)],
                                                      y_train : y_training[batch_num:(batch_num+batch)]})))
                
                #print("validation accuracy: " + 
                 # str(sess.run(accuracy, feed_dict = {x_train: x_validation,
                  #                                    y_train : y_validation})))
                
                print("")
                
                
    




               # writer.add_summary(c,epoch)
                #writer.add_summary(a,epoch)
                
        #weights = W.eval()
        #bias = b.eval()

      
        

            description = "epochs_"+str(epoch_num)+ "_batch_"+str(batch)+"_lr_"+str(lr)+"_beta_"+str(beta)+".pkl"

        
            coefficients = (W.eval(),b.eval())
        
            pickle_out = open(description, 'wb')
            pickle.dump(coefficients, pickle_out)
            pickle_out.close()
        

        
        return((W.eval(),b.eval()))
        
        sess.close()
        #writer.close()




In [314]:
test = multinomial_model(epoch_num=1500, lr=0.01, batch_size=32, x_training=X_train, y_training=y_train,x_validation=X_validation,
                  y_validation=y_validation,beta = 1)


Epoch: 99
train accuracy: 0.65625
validation accuracy: 0.66292137

Epoch: 199
train accuracy: 0.65625
validation accuracy: 0.66292137

Epoch: 299
train accuracy: 0.71875
validation accuracy: 0.66292137

Epoch: 399
train accuracy: 0.65625
validation accuracy: 0.66292137

Epoch: 499
train accuracy: 0.75
validation accuracy: 0.66292137

Epoch: 599
train accuracy: 0.6875
validation accuracy: 0.66292137

Epoch: 699
train accuracy: 0.40625
validation accuracy: 0.66292137

Epoch: 799
train accuracy: 0.625
validation accuracy: 0.66292137

Epoch: 899
train accuracy: 0.625
validation accuracy: 0.66292137

Epoch: 999
train accuracy: 0.6875
validation accuracy: 0.66292137

Epoch: 1099
train accuracy: 0.5625
validation accuracy: 0.66292137

Epoch: 1199
train accuracy: 0.625
validation accuracy: 0.66292137

Epoch: 1299
train accuracy: 0.6875
validation accuracy: 0.66292137

Epoch: 1399
train accuracy: 0.75
validation accuracy: 0.66292137

Epoch: 1499
train accuracy: 0.5625
validation accuracy: 0.662

In [248]:
pickle_in = open("epochs_1299_batch_32_lr_0.01_beta_0.01.pkl","rb")
log_model = pickle.load(pickle_in)
log_model

(array([[ 0.00835892, -0.00835892],
        [ 0.06978582, -0.06978582],
        [-0.06297449,  0.06297449],
        [-0.01187224,  0.01187224],
        [ 0.09361684, -0.09361684],
        [-0.37452618,  0.37452618],
        [ 0.6956248 , -0.6956248 ]]), array([-0.04447041,  0.04447041]))

# Ensemble  Models 

In [251]:
chronicle

Unnamed: 0,Model,Accuracy,Recall,Precision,F1_Score,Description
0,decision_tree_1,0.842697,0.8125,0.764706,0.787879,criterion_entropy_accuracy_0.8426966292134831.pkl
1,svm_1,0.859551,0.890909,0.720588,0.796748,C_0.5_gamma_1_kernel_rbf.pkl
2,decision_tree_1,0.820225,0.8,0.705882,0.75,criterion_entropy_accuracy_0.8202247191011236.pkl
3,decision_tree_2,0.814607,0.79661,0.691176,0.740157,criterion_entropy_accuracy_0.8146067415730337.pkl


## loading models from joblib dumps

In [325]:
decision_tree =joblib.load(np.array(chronicle[chronicle['Model']=='decision_tree_1']['Description'])[0])
support_vector_m = joblib.load(np.array(chronicle[chronicle['Model']=='svm_1']['Description'])[0])


## Importing Naive Bayes from .py 

In [327]:
import Naive_Bayes as NB

bayes = NB.naive_bayes(X_train[['sex_encoded','Embarked_encoded']],
                    y_train,
                    X_validation[['sex_encoded','Embarked_encoded']],
                    y_validation,
                    threshold=0.5)
bayes

Validation Accuracy:  0.8370786516853933
Training Accuracy:  0.7598499061913696


array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0])

In [266]:

support_vector_m.predict(X_test)

array([1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       0, 0])

In [323]:
pickle_in = open("epochs_1299_batch_32_lr_0.01_beta_0.01.pkl","rb")
log_model = pickle.load(pickle_in)
logits = np.matmul(X_test.values,log_model[0]) + log_model[1]
y_pred = tf.nn.softmax(logits, name="Softmax")
y_pred
with tf.Session() as sess:
        result = np.where(y_pred.eval()[:,0] <= 0.5, 1,0)

result

array([0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 1])

In [384]:
def ensemble_model(x_test,y_test):
    ##Support Vector Machine
    svm_pred = support_vector_m.predict(x_test)
    
    ##Logistic Model using tensorflow
    
    pickle_in = open("epochs_1299_batch_32_lr_0.01_beta_0.01.pkl","rb")
    log_model = pickle.load(pickle_in)
    
    logits = np.matmul(X_test.values,log_model[0]) + log_model[1]
    logit_pred = tf.nn.softmax(logits, name="Softmax")
    logit_pred
    with tf.Session() as sess:
        logistic_model_pred = np.where(logit_pred.eval()[:,0] <= 0.5, 1,0)
        
    
    bayes = NB.naive_bayes(X_train[['sex_encoded','Embarked_encoded']],
                    y_train,
                    x_test[['sex_encoded','Embarked_encoded']],
                    y_test,
                    threshold=0.5)
    
    dt_pred = decision_tree.predict(x_test)
    
    predictions =  pd.DataFrame()
    predictions['SVM'] = svm_pred
    predictions['Logistic_Model'] = logistic_model_pred
    predictions['Naive_Bayes'] = bayes
    predictions['Decision_Tree'] = dt_pred
    predictions['Combined_predictions']= np.array(predictions.mode(axis=1).iloc[:,0])
    predictions['Y_test'] = np.array(y_test)
    accuracy = accuracy_score(predictions['Y_test'],predictions['Combined_predictions'])
    print("")
    print("Combined Accuracy:", accuracy)

    
    return(predictions.head())

In [386]:
ensemble_model(x_test=X_test,y_test=y_test)

Validation Accuracy:  0.8146067415730337
Training Accuracy:  0.7598499061913696

Combined Accuracy: 0.8033707865168539


Unnamed: 0,SVM,Logistic_Model,Naive_Bayes,Decision_Tree,Combined_predictions,Y_test
0,1,0,0,0,0.0,1
1,1,1,1,1,1.0,1
2,0,0,0,0,0.0,0
3,0,0,0,0,0.0,0
4,0,0,0,1,0.0,0


# Comments and Conclusions 

#### Feature Engineering

Feature engineering was crucial for model training. Scaling numeric variables **Age** and **Fare** using ***min max normalization*** impacted in a positive way the training accuracy. This could be do to the fact that when applying L2 regularization, this coefficients are not as strongly penalized as when they are not normalized.

Imputations for the **Age** variable also affected the accuracy of the models. Mean imputation resulted in a slightly higher accuracy.

#### Training the Logistic Regression Model
Training the logistic regression makes necesarry to apply ***One Hot Encoding*** because a ***SoftMax*** function is being used as a cost function. This is done in order to have a more generalizable model for future predictions of different datasets. In order to perform a prediction, it is necessary to apply the softmax function to the results of the matmul in order to obtain probabilities.

#### Training the Support Vector Machine Model 

Applying a gridsearch to the SVM training resulted in the optimal accuracy model. Otherwise, getting such value would have required a huge amount of time.

#### Training the Naive Bayes Model 

Naive Bayes was the most difficult model to develop. Code is not optimal as performance is severely affected after passing more than three variables. However, the model is fairly easy to train because there is no need to tune hyperparameters to obtain decent results. 

### Regarding Data: Sampling and Training 

Comparing the accuracy results of the models to other Data Scientist models, we noticed that sampling had an effect in performance. A way to control for this is to apply **knn folds** to the data. This technique ensures that all data is used in the training process and in the validation process. This way you can analyze if the accuracy of your model is close to the average of all the folds or it was obtained just for pure luck.

Another approach could be applying **bootstrapping**. When data is scarce, bootstrapping allows to create different samples from the same distribution, to perform several models. Furthermore, if data is unbalanced, bootstrapping allows to undersample or to oversample observations to have a a better train model. This could be used in this dataset if a lesser amount of people would have survived.