# [Replication and Extension] Titanic Project Example Walk Through 
In this notebook, I hope to show how a data scientist would work through a problem. The goal is to correctly predict whether a passenger survived the Titanic shipwreck. I thought it would be fun to see how well I could do in this competition with and without deep learning.

## Overview 
### 1) Understand the shape of the data (Histograms, box plots, etc.)

### 2) Data Cleaning 

### 3) Data Exploration

### 4) Feature Engineering and Selection

### 5) Data Preprocessing for Model

### 6) Basic Model Building 

### 7) Model Tuning 

### 8) Ensemble Model Building 

### 9) Deep Learning and its Hyper-param tuning

### 10) Results 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
import matplotlib.pyplot as plt
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
        
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Here we import the data. For this analysis, we will be exclusively working with the Training set. We will be validating based on data from the training set as well. For our final submissions, we will make predictions based on the test set. 

In [None]:
trainset = pd.read_csv('/kaggle/input/titanic/train.csv')
testset = pd.read_csv('/kaggle/input/titanic/test.csv')

# add bool variable to differentiate two datasets
trainset['train_test'] = 1
testset['train_test'] = 0
testset['Survived'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
all_data = pd.concat([trainset,testset], axis=0)

%matplotlib inline
all_data.columns

## Project Planning
When starting any project, I like to outline the steps that I plan to take. Below is the rough outline that I created for this project using commented cells. 

In [None]:
# Understand nature of the data .info() .describe()
# Histograms and boxplots 
# Value counts 
# Missing data 
# Correlation between the metrics 
# Explore interesting themes 
    # Wealthy survive? 
    # By location 
    # Age scatterplot with ticket price 
    # Young and wealthy Variable? 
    # Total spent? 
# Feature engineering 
# Preprocess data together or use a transformer? 
    # use label for train and test   
# Scaling?

# Model Baseline 
# Model comparison with CV 

# OO-style Perceptron implementation
# OO-style FFN implementation

## Gentle Data Exploration
### 1) For numeric data 
* Made histograms to understand distributions 
* Corr-plot 
* Pivot table comparing survival rate across numeric variables 


### 2) For Categorical Data 
* Made bar charts to understand balance (or not) of classes 
* Made pivot tables to understand relationship with survival 

In [None]:
#quick look at our data types & null counts 
trainset.info()

In [None]:
# to better understand the numeric data, we want to use the .describe() method. This gives us an understanding of the central tendencies of the data 
trainset.describe()

In [None]:
#quick way to separate numeric columns using the describe() method
trainset.describe().columns

In [None]:
# look at numeric and categorical values separately 
df_num = trainset[['Age','SibSp','Parch','Fare']]
df_cat = trainset[['Survived','Pclass','Sex','Ticket','Cabin','Embarked']]

In [None]:
#distributions for all numeric variables 
for col_name in df_num.columns:
    plt.hist(df_num[col_name])
    plt.title(col_name)
    plt.show()

Perhaps we should take the non-normal distributions and consider normalizing them?
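
One way to act on this, sketched under assumptions: a log transform often pulls a right-skewed variable closer to symmetric. The values below are synthetic (drawn from a log-normal distribution), not the Titanic data, and `fare_like` is a hypothetical name standing in for a column such as Fare.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column such as Fare (assumption).
rng = np.random.default_rng(0)
fare_like = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# log1p computes log(1 + x), so zero values stay well defined.
log_fare = np.log1p(fare_like)

# Sample skewness near 0 indicates a roughly symmetric distribution.
skew_before = fare_like.skew()
skew_after = log_fare.skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Comparing the skewness before and after is a quick check on whether the transform is worth keeping for a given column.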

In [None]:
print(df_num.corr())
sns.heatmap(df_num.corr())

In [None]:
# compare survival rate across Age, SibSp, Parch, and Fare; use the Survived column as the index and the numeric features as the values
pd.pivot_table(trainset, index = 'Survived', values = df_num.columns)

Hypothesis to check against the table above: do younger passengers with a higher fare, a lower Parch, and a higher SibSp have a higher likelihood of survival?

In [None]:
df_cat['Pclass'].value_counts()

In [None]:
for col_name in df_cat.columns:
    counts = df_cat[col_name].value_counts()
    # recent seaborn versions require x and y to be passed as keyword arguments
    sns.barplot(x=counts.index, y=counts.values).set_title(col_name)
    plt.show()
    

Cabin and ticket graphs are very messy. This is an area where we may want to do some feature engineering! 

In [None]:
# Comparing survival and each of these categorical variables 
print(pd.pivot_table(trainset, index = 'Survived', columns = 'Pclass', values = 'Ticket' ,aggfunc ='count'), '\n')
print(pd.pivot_table(trainset, index = 'Survived', columns = 'Sex', values = 'Ticket' ,aggfunc ='count'), '\n')
print(pd.pivot_table(trainset, index = 'Survived', columns = 'Embarked', values = 'Ticket' ,aggfunc ='count'))

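Higher passenger classes show a higher proportion of survivors.  
Women survived at a much higher rate than men ("ladies first").  
What exactly does Embarked mean? In the Kaggle data dictionary it is the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).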
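
The counts in the pivot tables above are easier to compare as proportions. A minimal sketch, assuming a tiny made-up frame (`toy` is hypothetical and not the Titanic data): the mean of a 0/1 `Survived` column within each group is that group's survival rate.

```python
import pandas as pd

# Made-up rows for illustration only (assumption, not the real data).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Raw counts, comparable to the pivot tables with aggfunc='count'.
counts = pd.crosstab(toy["Survived"], toy["Sex"])

# Survival rate per group: the mean of a 0/1 column is the share of ones.
rates = toy.groupby("Sex")["Survived"].mean()

print(counts)
print(rates)
```

The same `groupby(...)["Survived"].mean()` pattern could be applied to `Pclass` or `Embarked` on the training set.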

## Feature Engineering and Selection
### 1) Cabin - Simplify cabins (evaluated if cabin letter (cabin_adv) or the purchase of tickets across multiple cabins (cabin_multiple) impacted survival)

### 2) Tickets - Do different ticket types impact survival rates?

### 3) Does a person's title relate to survival rates? 
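
Before running these transformations on the full data, it can help to see what they return on a couple of rows. The sketch below applies the same string logic used in the following cells to two illustrative rows written in the style of the dataset; the `sample` frame and the `ticket_prefix` helper are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Two illustrative rows in the style of the dataset (assumption).
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Ticket": ["A/5 21171", "113803"],
    "Cabin": ["C23 C25 C27", np.nan],
})

def ticket_prefix(ticket):
    """Return the cleaned, lower-cased ticket prefix, or 0 if there is none."""
    parts = ticket.split(" ")[:-1]
    if not parts:
        return 0
    return "".join(parts).replace(".", "").replace("/", "").lower()

# Number of cabins listed (0 when the cabin is missing).
sample["cabin_multiple"] = sample["Cabin"].apply(lambda x: 0 if pd.isna(x) else len(x.split(" ")))
# First character of the cabin; str(nan) is 'nan', so missing cabins become 'n'.
sample["cabin_adv"] = sample["Cabin"].apply(lambda x: str(x)[0])
sample["numeric_ticket"] = sample["Ticket"].apply(lambda x: 1 if x.isnumeric() else 0)
sample["ticket_letters"] = sample["Ticket"].apply(ticket_prefix)
# Title sits between the comma and the first period of the name.
sample["name_title"] = sample["Name"].apply(lambda x: x.split(",")[1].split(".")[0].strip())

print(sample[["cabin_multiple", "cabin_adv", "numeric_ticket", "ticket_letters", "name_title"]])
```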

In [None]:
df_cat.Cabin

In [None]:
trainset['cabin_multiple'] = trainset.Cabin.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
# after looking at this, we may want to look at cabin by letter or by number.

In [None]:
trainset['cabin_multiple'].value_counts()

In [None]:
pd.pivot_table(trainset, index = 'Survived', columns = 'cabin_multiple', values = 'Ticket' ,aggfunc ='count')

In [None]:
#creates categories based on the cabin letter (n stands for null)
#in this case we will treat null values as their own category

trainset['cabin_adv'] = trainset.Cabin.apply(lambda x: str(x)[0])
trainset.cabin_adv

In [None]:
#comparing survival rate by cabin
print(trainset.cabin_adv.value_counts())
pd.pivot_table(trainset, index='Survived',columns='cabin_adv', values = 'Name', aggfunc='count')

In [None]:
#understand ticket values better 
#numeric vs non numeric 
trainset['numeric_ticket'] = trainset.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
trainset['ticket_letters'] = trainset.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) >0 else 0)


In [None]:
trainset['numeric_ticket'].value_counts()

In [None]:
#lets us view all rows in dataframe through scrolling. This is for convenience 
pd.set_option("display.max_rows", None)  # the full option name is required in recent pandas versions
trainset['ticket_letters'].value_counts()


In [None]:
#difference in numeric vs non-numeric tickets in survival rate 
pd.pivot_table(trainset, index='Survived', columns='numeric_ticket', values = 'Ticket', aggfunc='count')

In [None]:
#survival rate across different ticket types 
pd.pivot_table(trainset, index='Survived', columns='ticket_letters', values = 'Ticket', aggfunc='count')

In [None]:
#feature engineering on person's title 
trainset.Name.head(50)

In [None]:
trainset['name_title'] = trainset.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
#mr., ms., master. etc

In [None]:
trainset['name_title'].value_counts()

## Data Preprocessing for Model 
### 1) Drop null values from Embarked (only 2) 

### 2) Include only relevant variables (Since we have limited data, I wanted to exclude things like name and passenger ID so that we could have a reasonable number of features for our models to deal with) 
Variables:  'Pclass', 'Sex','Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'cabin_adv', 'cabin_multiple', 'numeric_ticket', 'name_title'

### 3) Do categorical transforms on all data. Usually we would use a transformer, but with this approach we can ensure that our training and test data have the same columns. We may also be able to infer something about the shape of the test data through this method. I will stress that this is generally not recommended outside of a competition (use a one-hot encoder fitted on the training data instead). 

### 4) Impute missing fare and age values with the training-set median (the mean was also tried) 

### 5) Log-transformed fare to bring its distribution closer to normal 

### 6) Standardized the continuous features (zero mean, unit variance) with StandardScaler 
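
For comparison, here is a minimal sketch of the transformer-based route mentioned in step 3, assuming scikit-learn's `ColumnTransformer` and two tiny made-up frames (`train` and `test` below are hypothetical, not the competition files). The encoder and scaler are fitted on the training rows only, and `handle_unknown="ignore"` keeps the column layout identical when the test rows contain a category that was never seen in training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only (assumption).
train = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", "S", "S"],
})
test = pd.DataFrame({
    "Age": [34.5, None],
    "Fare": [7.83, 7.0],
    "Embarked": ["Q", "S"],  # "Q" never appears in the toy training rows
})

# Numeric columns: median imputation, then zero mean and unit variance.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode; unseen categories become all zeros.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer(
    [("num", numeric, ["Age", "Fare"]),
     ("cat", categorical, ["Embarked"])],
    sparse_threshold=0,  # always return a dense array
)

X_train_toy = preprocess.fit_transform(train)  # statistics learned from train only
X_test_toy = preprocess.transform(test)        # same columns, no refitting
print(X_train_toy.shape, X_test_toy.shape)
```

The trade-off is that this route cannot use the test rows to learn categories or scaling statistics, which is exactly why it is the safer choice outside of a competition.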


In [None]:
#create all categorical variables that we did above for both training and test sets 
all_data['cabin_multiple'] = all_data.Cabin.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
all_data['cabin_adv'] = all_data.Cabin.apply(lambda x: str(x)[0])
all_data['numeric_ticket'] = all_data.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
all_data['ticket_letters'] = all_data.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) >0 else 0)
all_data['name_title'] = all_data.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())

#impute nulls for continuous data, using the info from train set
#all_data.Age = all_data.Age.fillna(training.Age.mean())
all_data.Age = all_data.Age.fillna(trainset.Age.median())
#all_data.Fare = all_data.Fare.fillna(training.Fare.mean())
all_data.Fare = all_data.Fare.fillna(trainset.Fare.median())

#drop null 'embarked' rows. Only 2 instances of this in training and 0 in test 
all_data.dropna( subset = ['Embarked'] , inplace = True )

# log transform of SibSp (stored as norm_sibsp and included in the model features below)
# add 1 before taking the log so that a value of 0 maps to log(1) = 0
all_data['norm_sibsp'] = np.log(all_data.SibSp + 1)
plt.figure()
all_data['norm_sibsp'].hist()

# log norm of fare (used)
all_data['norm_fare'] = np.log( all_data.Fare + 1)
plt.figure()
all_data['norm_fare'].hist()


In [None]:
# cast to string so that pd.get_dummies() treats these columns as categorical
all_data.Pclass = all_data.Pclass.astype(str)
all_data.numeric_ticket = all_data.numeric_ticket.astype(str)

#created dummy variables from categories (also can use OneHotEncoder)
all_dummies = pd.get_dummies(all_data[['Pclass','Sex','Age','Parch','Embarked','train_test','cabin_adv','numeric_ticket','name_title','norm_sibsp','norm_fare']], drop_first=True)

#Split to train test again
X_train = all_dummies[all_dummies.train_test == 1].drop(['train_test'], axis =1)
X_test = all_dummies[all_dummies.train_test == 0].drop(['train_test'], axis =1)


y_train = all_data[all_data.train_test==1].Survived
y_train.shape

In [None]:
X_train.columns

In [None]:
# Scale data 
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
all_dummies_scaled = all_dummies.copy()
all_dummies_scaled[['Age','norm_sibsp','Parch','norm_fare']] = scale.fit_transform(all_dummies_scaled[['Age','norm_sibsp','Parch','norm_fare']])
all_dummies_scaled

X_train_scaled = all_dummies_scaled[all_dummies_scaled.train_test == 1].drop(['train_test'], axis =1)
X_test_scaled = all_dummies_scaled[all_dummies_scaled.train_test == 0].drop(['train_test'], axis =1)

y_train = all_data[all_data.train_test==1].Survived


## Model Building (Baseline Validation Performance)
Before going further, I like to see how various models perform with default parameters. I tried the following models using 5-fold cross-validation to get a baseline. With a validation-set baseline, we can see how much tuning improves each of the models. Just because a model has a high baseline on this validation set doesn't mean that it will actually do better on the eventual test set. 

- Naive Bayes (72%)
- Logistic Regression (82%)
- Decision Tree (78%)
- K Nearest Neighbor (82%)
- Random Forest (81%)
- Support Vector Classifier (83%)
- Xtreme Gradient Boosting (82%)
- Soft Voting Classifier - All Models (83%)
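
The baseline procedure can be sketched on synthetic data as follows. This is an illustration under assumptions, not the notebook's exact setup: `make_classification` stands in for the Titanic features, and the scaler is placed inside a pipeline so that it is refitted on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption, not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

baselines = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

baseline_scores = {}
for name, model in baselines.items():
    # cv=5 returns one accuracy value per fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    baseline_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds next to the mean gives a sense of how much of a difference between two models is just fold-to-fold noise.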

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [None]:
from sklearn.feature_selection import RFE
RFE_estimator = LogisticRegression(max_iter = 1000)
selector = RFE(RFE_estimator, n_features_to_select=15, step=1)
selector = selector.fit(X_train, y_train)

In [None]:
selector.support_

In [None]:
selector.ranking_

In [None]:
X_train = X_train[list(np.array(X_train.columns)[selector.support_])]

In [None]:
#I usually use Naive Bayes as a baseline for my classification tasks 
gnb = GaussianNB()
cv = cross_val_score(gnb, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
lr = LogisticRegression(max_iter = 1000)
cv = cross_val_score(lr, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
dt = tree.DecisionTreeClassifier(random_state = 1)
cv = cross_val_score(dt, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
knn = KNeighborsClassifier()
cv = cross_val_score(knn, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
svc = SVC(probability = True)
cv = cross_val_score(svc, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier(random_state =1)
cv = cross_val_score(xgb, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
# A voting classifier combines the predictions of all its input models. For a "hard" voting classifier, each model gets one vote ("yes" or "no") and the result is a simple majority vote; for this you generally want an odd number of models
# A "soft" voting classifier averages the predicted probabilities of the models. If the average probability of class 1 is > 50%, the sample is classified as 1
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators = [('lr',lr),('knn',knn),('rf',rf),('gnb',gnb),('svc',svc),('xgb',xgb)], voting='soft')

In [None]:
cv = cross_val_score(voting_clf, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
voting_clf.fit(X_train_scaled, y_train)
y_hat_base_vc = voting_clf.predict(X_test_scaled).astype(int)
basic_submission = {'PassengerId': testset.PassengerId, 'Survived': y_hat_base_vc}
base_submission = pd.DataFrame(data=basic_submission)
base_submission.to_csv('base_submission.csv', index=False)

## Model Tuned Performance 
After getting the baselines, let's see if we can improve on the individual model results! I mainly used grid search to tune the models. I also used randomized search for the Random Forest and XGBoost models to reduce search time. 

|Model|Baseline|Tuned Performance|
|-----|--------|-----------------|
|Naive Bayes| 72%| NA|
|Logistic Regression| 82%| 82%|
|Decision Tree| 78%| NA|
|K Nearest Neighbor| 82%| 82%|
|Random Forest| 81%| 83% w/ RndSearch|
|Support Vector Classifier| 83%| 83%|
|Xtreme Gradient Boosting| 82%| 85% w/ RndSearch|
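
The coarse-to-fine pattern used for the Random Forest and XGBoost models can be sketched as below. This is a simplified illustration on synthetic data with a deliberately small search space; the variable names and parameter values are assumptions chosen to run quickly, and `random_state` is set on the randomized search so that the sketch is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data and a small forest so the sketch runs in seconds (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=25, random_state=1)

# Step 1: coarse randomized search over a wide space.
coarse = RandomizedSearchCV(
    rf_demo,
    param_distributions={"max_depth": [2, 5, 10, None],
                         "min_samples_leaf": [1, 2, 4, 10]},
    n_iter=6, cv=3, random_state=1,
)
coarse.fit(X_demo, y_demo)

# Step 2: fine grid search in the neighbourhood of the coarse optimum.
best_leaf = coarse.best_params_["min_samples_leaf"]
fine_grid = {
    "max_depth": [coarse.best_params_["max_depth"]],
    "min_samples_leaf": sorted({max(1, best_leaf - 1), best_leaf, best_leaf + 1}),
}
fine = GridSearchCV(rf_demo, param_grid=fine_grid, cv=3)
fine.fit(X_demo, y_demo)

print("coarse:", coarse.best_params_, round(coarse.best_score_, 3))
print("fine:  ", fine.best_params_, round(fine.best_score_, 3))
```

Because the fine grid contains the coarse optimum and both searches use the same folds, the fine search here cannot score worse than the coarse one; any gain is still measured on the same validation folds, so it may not carry over to the test set.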

In [None]:
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import RandomizedSearchCV 

In [None]:
#simple performance reporting function
def clf_performance(classifier, model_name):
    # input: a fitted GridSearchCV or RandomizedSearchCV object
    print(model_name)
    print('Best Score: ' + str(classifier.best_score_))
    print('Best Parameters: ' + str(classifier.best_params_))

In [None]:
lr = LogisticRegression()
param_grid = {'max_iter' : [2000],
              'penalty' : ['l1', 'l2'],
              'C' : np.logspace(-4, 4, 20),
              'solver' : ['liblinear']}

clf_lr = GridSearchCV(lr, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
clf_lr.fit(X_train_scaled, y_train)
clf_performance(clf_lr, 'Logistic Regression')

In [None]:
knn = KNeighborsClassifier()
param_grid = {'n_neighbors' : [3, 5, 7, 9],
              'weights' : ['uniform', 'distance'],
              'algorithm' : ['auto', 'ball_tree','kd_tree'],
              'p' : [1,2]}
clf_knn = GridSearchCV(knn, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
clf_knn.fit(X_train_scaled, y_train)
clf_performance(clf_knn,'KNN')

In [None]:
svc = SVC(probability = True)
param_grid = tuned_parameters = [{'kernel': ['rbf'], 'gamma': [0.1,0.5,1,2,5],
                                  'C': [.1, 1, 10, 100, 1000]},
                                 {'kernel': ['linear'], 'C': [.1, 1, 10, 100, 1000]},
                                 {'kernel': ['poly'], 'degree' : [2,3,4,5], 'C': [.1, 1, 10, 100, 1000]}]
clf_svc = GridSearchCV(svc, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
clf_svc.fit(X_train_scaled, y_train)
clf_performance(clf_svc,'SVC')

In [None]:
#Because the total parameter space is so large, I used a randomized search to narrow down the parameters for the model. I took the best model from this and did a more granular search around the best parameter region
rf = RandomForestClassifier(random_state = 1)
param_grid =  {'n_estimators': [100,500,1000], 
               'criterion': ['gini','entropy'],
               'bootstrap': [True,False],
               'max_depth': [3,5,10,20,50,75,100,None],
               # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
               'max_features': ['sqrt'],
               'min_samples_leaf': [1,2,4,10],
               'min_samples_split': [2,5,10]}
                                  
clf_rf_rnd = RandomizedSearchCV(rf, param_distributions = param_grid, n_iter = 30, cv = 5, verbose = True, n_jobs = -1)
clf_rf_rnd.fit(X_train_scaled, y_train)
clf_performance(clf_rf_rnd, 'Random Forest')

In [None]:
param_grid =  {'n_estimators': [1000],
               'criterion':['entropy'],
                                  'bootstrap': [False],
                                  'max_depth': [None],
                                  'max_features': ['sqrt'],
                                  'min_samples_leaf': [2,4,6],
                                  'min_samples_split': [8,10,12]}
                                  
clf_rf = GridSearchCV(rf, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
clf_rf.fit(X_train_scaled, y_train)
clf_performance(clf_rf, 'Random Forest')

In [None]:
best_rf = clf_rf.best_estimator_.fit(X_train_scaled, y_train)
feat_importances = pd.Series(best_rf.feature_importances_, index=X_train_scaled.columns)
feat_importances.nlargest(10).plot(kind='barh')

In [None]:
xgb = XGBClassifier(random_state = 1)

param_grid = {
    'n_estimators': [20, 50, 100, 250, 500,1000],
    'colsample_bytree': [0.2, 0.5, 0.7, 0.8, 1],
    'max_depth': [2, 5, 10, 15, 20, 25, None],
    'reg_alpha': [0, 0.5, 1],
    'reg_lambda': [1, 1.5, 2],
    'subsample': [0.5,0.6,0.7, 0.8, 0.9],
    'learning_rate':[.01,0.1,0.2,0.3,0.5, 0.7, 0.9],
    'gamma':[0,.01,.1,1,10,100],
    'min_child_weight':[0,.01,0.1,1,10,100],
    'sampling_method': ['uniform', 'gradient_based']
}

clf_xgb_rnd = RandomizedSearchCV(xgb, param_distributions = param_grid, n_iter = 30, cv = 5, verbose = True, n_jobs = -1)
clf_xgb_rnd.fit(X_train_scaled,y_train)
clf_performance(clf_xgb_rnd,'XGB')

In [None]:
param_grid = {
    'n_estimators': [1000],
    'colsample_bytree': [0.7,0.8,0.9],
    'max_depth': [15,20,25],
    'reg_alpha': [0.5],
    'reg_lambda': [0.9,1,1.1],
    'subsample': [0.7,0.9,1.0],  # subsample must lie in (0, 1], so the invalid value 1.1 is replaced by 1.0
    'learning_rate':[.01],
    'gamma':[0.5,1,1.5],
    'min_child_weight':[0.01],
    'sampling_method': ['uniform']
}

                                  
clf_xgb = GridSearchCV(xgb, param_grid = param_grid, cv = 5, verbose = True, n_jobs = -1)
clf_xgb.fit(X_train_scaled, y_train)
clf_performance(clf_xgb,'XGB')

In [None]:
y_hat_xgb = clf_xgb.best_estimator_.predict(X_test_scaled).astype(int)
xgb_submission = {'PassengerId': testset.PassengerId, 'Survived': y_hat_xgb}
submission_xgb = pd.DataFrame(data=xgb_submission)
submission_xgb.to_csv('xgb_submission3.csv', index=False)

## Model Additional Ensemble Approaches 
1) Experimented with a hard voting classifier of three estimators (KNN, SVM, RF)

2) Experimented with a soft voting classifier of three estimators (KNN, SVM, RF) (82.3%)

3) Experimented with soft voting on all estimators performing better than 80% except xgb (KNN, RF, LR, SVC)

4) Experimented with soft voting on all estimators including XGB (KNN, SVM, RF, LR, XGB)
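
To make the difference between the two voting schemes concrete, here is a small worked example in plain NumPy. The probabilities are hypothetical numbers chosen for illustration, not model outputs from this notebook.

```python
import numpy as np

# Hypothetical P(survived) from three models (rows) for two passengers (columns).
probs = np.array([
    [0.90, 0.30],
    [0.40, 0.45],
    [0.40, 0.60],
])

# Hard voting: each model casts one 0/1 vote and the majority wins.
votes = (probs > 0.5).astype(int)
hard = (votes.sum(axis=0) > probs.shape[0] / 2).astype(int)

# Soft voting: average the probabilities first, then apply the threshold.
soft = (probs.mean(axis=0) > 0.5).astype(int)

print("hard:", hard)  # → [0 0]
print("soft:", soft)  # → [1 0]
```

For the first passenger a single very confident model outweighs two mildly negative ones under soft voting, while hard voting ignores confidence and follows the majority.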

In [None]:
best_lr = clf_lr.best_estimator_
best_knn = clf_knn.best_estimator_
best_svc = clf_svc.best_estimator_
best_rf = clf_rf.best_estimator_
best_xgb = clf_xgb.best_estimator_

In [None]:
voting_clf_hard = VotingClassifier(estimators = [('knn',best_knn),('rf',best_rf),('svc',best_svc)], voting = 'hard') 
voting_clf_soft = VotingClassifier(estimators = [('knn',best_knn),('rf',best_rf),('svc',best_svc)], voting = 'soft') 
voting_clf_all = VotingClassifier(estimators = [('knn',best_knn),('rf',best_rf),('svc',best_svc), ('lr', best_lr)], voting = 'soft') 
voting_clf_xgb = VotingClassifier(estimators = [('knn',best_knn),('rf',best_rf),('svc',best_svc), ('xgb', best_xgb),('lr', best_lr)], voting = 'soft')


In [None]:
#print('voting_clf_hard :',cross_val_score(voting_clf_hard,X_train,y_train,cv=5))
print('voting_clf_hard mean :',cross_val_score(voting_clf_hard,X_train,y_train,cv=5).mean())

#print('voting_clf_soft :',cross_val_score(voting_clf_soft,X_train,y_train,cv=5))
print('voting_clf_soft mean :',cross_val_score(voting_clf_soft,X_train,y_train,cv=5).mean())

#print('voting_clf_all :',cross_val_score(voting_clf_all,X_train,y_train,cv=5))
print('voting_clf_all mean :',cross_val_score(voting_clf_all,X_train,y_train,cv=5).mean())

#print('voting_clf_xgb :',cross_val_score(voting_clf_xgb,X_train,y_train,cv=5))
print('voting_clf_xgb mean :',cross_val_score(voting_clf_xgb,X_train,y_train,cv=5).mean())


In [None]:
#in a soft voting classifier you can weight some models more than others. I used a grid search to explore different weightings
#no new results here
params = {'weights' : [[1,1,1],[1,2,1],[1,1,2],[2,1,1]]}

vote_weight = GridSearchCV(voting_clf_soft, param_grid = params, cv = 5, verbose = True, n_jobs = -1)
vote_weight.fit(X_train_scaled, y_train)
clf_performance(vote_weight, 'VC Weights')
voting_clf_sub = vote_weight.best_estimator_.predict(X_test_scaled)

In [None]:
#Make Predictions 
voting_clf_hard.fit(X_train_scaled, y_train)
voting_clf_soft.fit(X_train_scaled, y_train)
voting_clf_all.fit(X_train_scaled, y_train)
voting_clf_xgb.fit(X_train_scaled, y_train)
best_rf.fit(X_train_scaled, y_train)

y_hat_vc_hard = voting_clf_hard.predict(X_test_scaled).astype(int)
y_hat_rf = best_rf.predict(X_test_scaled).astype(int)
y_hat_vc_soft =  voting_clf_soft.predict(X_test_scaled).astype(int)
y_hat_vc_all = voting_clf_all.predict(X_test_scaled).astype(int)
y_hat_vc_xgb = voting_clf_xgb.predict(X_test_scaled).astype(int)

In [None]:
#convert output to dataframe 
final_data = {'PassengerId': testset.PassengerId, 'Survived': y_hat_rf}
submission = pd.DataFrame(data=final_data)

final_data_2 = {'PassengerId': testset.PassengerId, 'Survived': y_hat_vc_hard}
submission_2 = pd.DataFrame(data=final_data_2)

final_data_3 = {'PassengerId': testset.PassengerId, 'Survived': y_hat_vc_soft}
submission_3 = pd.DataFrame(data=final_data_3)

final_data_4 = {'PassengerId': testset.PassengerId, 'Survived': y_hat_vc_all}
submission_4 = pd.DataFrame(data=final_data_4)

final_data_5 = {'PassengerId': testset.PassengerId, 'Survived': y_hat_vc_xgb}
submission_5 = pd.DataFrame(data=final_data_5)

final_data_comp = {'PassengerId': testset.PassengerId, 'Survived_vc_hard': y_hat_vc_hard, 'Survived_rf': y_hat_rf, 'Survived_vc_soft' : y_hat_vc_soft, 'Survived_vc_all' : y_hat_vc_all,  'Survived_vc_xgb' : y_hat_vc_xgb}
comparison = pd.DataFrame(data=final_data_comp)

In [None]:
#track differences between outputs 
# find those data points with different predictions from various models
comparison['difference_rf_vc_hard'] = comparison.apply(lambda x: 1 if x.Survived_vc_hard != x.Survived_rf else 0, axis =1)
comparison['difference_soft_hard'] = comparison.apply(lambda x: 1 if x.Survived_vc_hard != x.Survived_vc_soft else 0, axis =1)
comparison['difference_hard_all'] = comparison.apply(lambda x: 1 if x.Survived_vc_all != x.Survived_vc_hard else 0, axis =1)


In [None]:
comparison.difference_rf_vc_hard.value_counts()

In [None]:
#prepare submission files 
submission.to_csv('submission_rf.csv', index =False)
submission_2.to_csv('submission_vc_hard.csv',index=False)
submission_3.to_csv('submission_vc_soft.csv', index=False)
submission_4.to_csv('submission_vc_all.csv', index=False)
submission_5.to_csv('submission_vc_xgb2.csv', index=False)

## DL and its Hyper-param Tuning

### OO-style Perceptron implementation
The perceptron is a linear classifier.
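
The class below accumulates `w_sum` and `b_sum` weighted by a step counter, which appears to follow the running-sum trick for an averaged perceptron. The sketch here checks that trick in isolation, under assumptions: random vectors stand in for the per-step weight updates, and all names are hypothetical. It compares the naive average of every intermediate weight vector with the value obtained from the running sums.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 3
deltas = rng.normal(size=(T, d))  # stand-ins for the per-step weight updates

# Naive averaging: store every weight vector w_0 .. w_T and take the mean.
w = np.zeros(d)
history = [w.copy()]
for delta in deltas:
    w = w + delta
    history.append(w.copy())
naive_avg = np.mean(history, axis=0)

# Running-sum trick: keep only w, a counter-weighted sum u, and a counter from 1.
w, u, cnt = np.zeros(d), np.zeros(d), 1
for delta in deltas:
    w = w + delta
    u = u + cnt * delta
    cnt += 1
trick_avg = w - u / cnt

print(np.allclose(naive_avg, trick_avg))  # → True
```

The two agree because the update made at step t is present in T + 1 - t of the T + 1 stored vectors, which is the weight that `w - u / cnt` assigns to it.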


In [None]:
class Eval:
    def __init__(self, pred, gold):
        self.pred = np.squeeze(pred)
        self.gold = np.squeeze(gold)
        
    def Accuracy(self):
        return np.sum(np.equal(self.pred, self.gold)) / float(len(self.gold))

class Perceptron:
    def __init__(self, X, Y, N_ITERATIONS):
        #TODO: Initalize parameters
        self.lr = 1e-4
        self.N_epochs = N_ITERATIONS
        self.weights = np.zeros((X.shape[1],1)) # num_feats by 1
        self.w_sum = np.zeros((X.shape[1],1)) # num_feats by 1
        self.bias = 0
        self.b_sum = 0
        self.cnt = 1 # step counter; each update is weighted by it in w_sum and b_sum so that ComputeAverageParameters can recover the averaged parameters
        self.Train(X,Y)

    def ComputeAverageParameters(self):
        #TODO: Compute average parameters (do this part last)
        self.weights = self.weights - (self.w_sum / float(self.cnt))
        self.bias = self.bias - (self.b_sum / float(self.cnt))
        return

    def Train(self, X, Y):
        #TODO: Estimate perceptron parameters
        for _ in range(self.N_epochs):
            for inputs, label in zip(X, Y):
                prediction = self.Predict(inputs.reshape(1,-1))
                self.weights += self.lr * ((label - prediction) * inputs).reshape(-1,1)
                self.w_sum += self.cnt * self.lr * ((label - prediction) * inputs).reshape(-1,1)
                self.bias += self.lr * (label - prediction)
                self.b_sum += self.cnt * self.lr * (label - prediction)
            
                self.cnt += 1

        return

    def Predict(self, X):
        # Linear decision rule; returns 0/1 so that predictions match the 0/1 Survived labels
        out = np.dot(X, self.weights) + self.bias
        return np.asarray([1 if out[i] >= 0.0 else 0 for i in range(X.shape[0])])

    def SavePredictions(self, data, outFile):
        # expects an object with attributes X (features) and XfileList (row identifiers)
        Y_pred = self.Predict(data.X)
        with open(outFile, 'w') as fOut:
            for i in range(len(data.XfileList)):
                fOut.write(f"{data.XfileList[i]}\t{Y_pred[i]}\n")

    def Eval(self, X_test, Y_test):
        Y_pred = self.Predict(X_test)
        ev = Eval(Y_pred, Y_test)
        return ev.Accuracy()


In [None]:
ptron = Perceptron(X_train_scaled.values, y_train.values, 5) # train, just simply

In [None]:
print('in-sample accuracy', ptron.Eval(X_train_scaled.values, y_train.values))

In [None]:
ptron.ComputeAverageParameters() # replace the final weights and bias with their averages over all training steps (averaged perceptron)

In [None]:
print('in-sample accuracy', ptron.Eval(X_train_scaled.values, y_train.values)) # compare with the accuracy before averaging

In [None]:
from sklearn.linear_model import Perceptron as SkPerceptron  # the alias avoids shadowing the Perceptron class defined above
clf_pcep = SkPerceptron(tol=1e-3, random_state=0)
cv = cross_val_score(clf_pcep, X_train_scaled, y_train, cv=5)
print(cv)
print(cv.mean())

In [None]:
clf_pcep.fit(X_train_scaled, y_train)
clf_pcep.score(X_train_scaled, y_train)

### OO-style FFN implementation
Keep it as simple as possible, and do a hyper-parameter search. Use skorch for the two-step, coarse-to-fine hyper-parameter search, then use the self-built pipeline to train and test the network.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import random
if torch.cuda.is_available():
    device = torch.device('cuda',0)
else:
    device = torch.device('cpu')

In [None]:
#Define the computation graph; one layer hidden network
class FFNN(nn.Module):
    def __init__(self, dim_i, dim_h, dim_o):
        super(FFNN, self).__init__()
        self.V = nn.Linear(dim_i, dim_h)
        self.g = nn.Tanh()
        self.W = nn.Linear(dim_h, dim_o)
        self.logSoftmax = nn.LogSoftmax(dim=-1) # the class dimension is the last one; for a single 1-D sample this is the same as dim=0

    def forward(self, x):
        out = self.W(self.g(self.V(x)))
        out = self.logSoftmax(out)
        return out

train_X = X_train_scaled.values.astype(np.float32)  # one numeric dtype, so boolean dummy columns do not produce an object array
train_Y = y_train.values

num_classes  = 2
num_hidden   = 10
num_features = train_X.shape[1]

ffnn = FFNN(num_features, num_hidden, num_classes).to(device)
optimizer = optim.Adam(ffnn.parameters(), lr=1e-3)

for epoch in range(100):
    total_loss = 0.0
    #Randomly shuffle examples in each epoch
    shuffled_i = list(range(0,len(train_Y)))
    random.shuffle(shuffled_i)
    for i in shuffled_i:
        x        = torch.from_numpy(train_X[i]).float().to(device)
        y_onehot = torch.zeros(num_classes, device=device)
        y_onehot[int(train_Y[i])] = 1

        logProbs = ffnn(x)

        # negative log-likelihood of the true class
        loss = torch.neg(logProbs).dot(y_onehot)
        total_loss += loss.item()  # .item() accumulates a plain float instead of keeping every graph alive
        
        ffnn.zero_grad()
        loss.backward()
        optimizer.step()
    if epoch % 10 == 0:    
        print("loss on epoch %i: %f" % (epoch, total_loss))



In [None]:
#Evaluate on the training set:
num_errors = 0
with torch.no_grad():  # no gradients are needed for evaluation
    for i in range(len(train_Y)):
        x = torch.from_numpy(train_X[i]).float().to(device)
        y = train_Y[i]
        logProbs = ffnn(x)
        prediction = torch.argmax(logProbs).item()
        if y != prediction:
            num_errors += 1
print("number of errors: %d" % num_errors)
print("mis_ratio: %.2f" % (num_errors/len(train_Y)))

In [None]:
# Create Data Loaders

training_set = torch.utils.data.TensorDataset(torch.Tensor(train_X), torch.Tensor(train_Y))
training_loader = torch.utils.data.DataLoader(training_set, batch_size=64, shuffle=True)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
def train(model, device, training_loader, optimizer, epoch):
    # epoch is the index of the current epoch, not the total number of epochs
    model.train() # training mode: enables the training-time behaviour of layers such as dropout and batch norm
    total_loss = 0
    for idx, data in enumerate(training_loader, 0):
        inputs, targets = data # a batch of data
        inputs = inputs.to(device)
        
        # One-hot encode the 0/1 targets for MSELoss. A fixed num_classes keeps the shape (batch, 2) even when a batch contains only one class.
        targets = nn.functional.one_hot(targets.long(), num_classes=num_classes).float().to(device)
        
        # 1. Forward
        outputs = model(inputs)
        
        #print(outputs, targets)
        # 2. loss calculation, same dtype into the nn.MSELOSS
        loss = criterion(outputs, targets)
        
        # 3. Zero the parameter gradients
        optimizer.zero_grad()
        
        # 4. Comp grad
        loss.backward()
        
        # 5. One step forward
        optimizer.step()
        
        
        total_loss += loss.item()
    
    print("Train Epoch: {}, Loss per batch: {}".format(epoch, round(total_loss/len(training_loader), 4)))
    train_loss_hist.append(total_loss/len(training_loader))

In [None]:
def test(model, device, testing_loader):
    # Expects a loader that yields (inputs, targets) batches like training_loader,
    # and a val_loss_hist list created beforehand (alongside train_loss_hist).
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for idx, data in enumerate(testing_loader, 0):
            inputs, targets = data
            inputs = inputs.to(device)
            targets = nn.functional.one_hot(targets.long(), num_classes=num_classes).float().to(device)
    
            # Forward
            outputs = model(inputs)
            
            # Loss
            loss = criterion(outputs, targets)
            
            total_loss += loss.item() 

    print("Test loss per batch: {}".format(round(total_loss/len(testing_loader),4)))
    val_loss_hist.append(total_loss/len(testing_loader))

In [None]:
class FFNN(nn.Module):
    def __init__(self, dim_i, dim_h, dim_o):
        super(FFNN, self).__init__()
        self.V = nn.Linear(dim_i, dim_h)
        self.g = nn.Tanh()
        self.W = nn.Linear(dim_h, dim_o)
        self.softmax = nn.Softmax(dim=-1) # plain softmax (not log-softmax) over the class dimension, which is the last one

    def forward(self, x):
        out = self.W(self.g(self.V(x)))
        out = self.softmax(out)
        return out

In [None]:
model = FFNN(num_features, num_hidden, num_classes).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
train_loss_hist = []
for epoch in range(10):
    train(model, device, training_loader, optimizer, epoch)

In [None]:
plt.figure()
l1, = plt.plot(train_loss_hist)
#l2, = plt.plot(val_loss_hist)
plt.legend(handles=[l1], labels=['train'], loc='best') # only the training curve is plotted

In [None]:
model.eval() # inference mode: disables dropout / batch-norm updates
with torch.no_grad(): # gradients are not needed for prediction
    nn_predictions = model(torch.Tensor(train_X).to(device))

In [None]:
nn_eval = Eval(torch.argmax(nn_predictions, dim=1).cpu().numpy(), train_Y)

In [None]:
nn_eval.Accuracy()

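Next, we swap the loss function for cross-entropy. Pay attention to the inputs: `nn.CrossEntropyLoss` takes the raw logits and the integer class labels, so the model no longer applies a softmax and the targets are no longer one-hot encoded.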
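
The difference between the two loss setups can be sketched for a single sample without PyTorch. This is a minimal illustration, not the notebook's training code: the helper names (`softmax`, `cross_entropy_from_logits`, `mse_on_probs`) and the example logits are assumptions made up for the demonstration. It shows that cross-entropy is computed straight from the logits and an integer label, while the MSE variant first needs softmax probabilities and a one-hot target.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label]), with an integer label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mse_on_probs(logits, label):
    """Mean squared error between softmax probabilities and a one-hot target."""
    probs = softmax(logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(logits))]
    return sum((p - t) ** 2 for p, t in zip(probs, one_hot)) / len(probs)

# Hypothetical logits for one passenger; class 1 is taken to be the true label.
logits = [2.0, -1.0]
label = 1
print("cross-entropy:", cross_entropy_from_logits(logits, label))
print("MSE on probabilities:", mse_on_probs(logits, label))
```

Because the cross-entropy form works on the logits directly, the network in this variant returns them without a softmax layer.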

In [None]:
class FFNN_ce(nn.Module):
    def __init__(self, dim_i, dim_h, dim_o):
        super(FFNN_ce, self).__init__()
        self.V = nn.Linear(dim_i, dim_h)
        self.g = nn.Tanh()
        self.W = nn.Linear(dim_h, dim_o)
        # no softmax layer here: nn.CrossEntropyLoss applies log-softmax internally

    def forward(self, x):
        out = self.W(self.g(self.V(x)))
        return out # raw logits

In [None]:
def train(model, device, training_loader, optimizer, epoch):
    # `epoch` is the index of the current epoch, not the total number of epochs
    model.train() # training mode: enables dropout / batch-norm updates, if the model has any
    total_loss = 0
    for idx, data in enumerate(training_loader, 0):
        inputs, targets = data # a batch of data
        inputs, targets = inputs.to(device), targets.to(device)
        
        # No one-hot encoding here: nn.CrossEntropyLoss takes integer class labels directly
        
        # 1. Forward
        outputs = model(inputs) # just the logits
        
        # 2. Cross-entropy on the logits and the original integer labels
        loss = criterion(outputs, targets.long())
        
        # 3. Zero the parameter gradients
        optimizer.zero_grad()
        
        # 4. Comp grad
        loss.backward()
        
        # 5. One step forward
        optimizer.step()
        
        
        total_loss += loss.item()
    
    print("Train Epoch: {}, Loss per batch: {}".format(epoch, round(total_loss/len(training_loader), 4)))
    train_loss_hist.append(total_loss/len(training_loader))

In [None]:
model_ce = FFNN_ce(num_features, num_hidden, num_classes).to(device)
optimizer = optim.Adam(model_ce.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
train_loss_hist = []
for epoch in range(10):
    train(model_ce, device, training_loader, optimizer, epoch)

In [None]:
plt.figure()
l1, = plt.plot(train_loss_hist)
#l2, = plt.plot(val_loss_hist)
plt.legend(handles=[l1], labels=['train'], loc='best') # only the training curve is plotted

In [None]:
model_ce.eval() # inference mode: disables dropout / batch-norm updates
with torch.no_grad(): # gradients are not needed for prediction
    nn_predictions = model_ce(torch.Tensor(train_X).to(device))

In [None]:
nn_eval = Eval(torch.argmax(nn_predictions, dim=1).cpu().numpy(), train_Y)

In [None]:
nn_eval.Accuracy()

### Hyper-param search

In [None]:
import torch
from torch import nn
import torch.nn.functional as F

In [None]:
# use the GPU if one is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device('cuda', 0)
else:
    device = torch.device('cpu')

In [None]:
class ClassifierModule(nn.Module):
    def __init__(
            self,
            num_units=10,
            nonlin=F.relu,
            dropout=0.5,
    ):
        super(ClassifierModule, self).__init__()
        self.num_units = num_units
        self.nonlin = nonlin

        self.dense0 = nn.Linear(20, num_units) # 20 input features; this must match X.shape[1]
        self.dropout = nn.Dropout(dropout)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, 2)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = F.relu(self.dense1(X))
        X = F.softmax(self.output(X), dim=-1) # the last layers are a linear output and a softmax
        return X

In [None]:
!pip install -U skorch

In [None]:
import skorch
from skorch import NeuralNetClassifier
from skorch import NeuralNetRegressor

In [None]:
net = NeuralNetClassifier(
    ClassifierModule,
    max_epochs=20,
    lr=0.1,
    device=device,
)
# the custom ClassifierModule is passed to the skorch wrapper as its module

In [None]:
net.fit(X, y)

In [None]:
y_pred = net.predict(X[:10])
y_proba = net.predict_proba(X[:10])

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
net = NeuralNetClassifier(
    ClassifierModule,
    max_epochs = 20,
    lr = 0.1,
    optimizer__momentum = 0.9,
    verbose = 0,
    train_split = False,
)

In [None]:
params = {
    'lr': [0.001, 0.01, 0.1],
    'max_epochs': [10, 20, 30],
    'module__num_units': [10, 20, 40],
    'module__dropout': [0, 0.5],
    'optimizer__nesterov': [False, True],
}

In [None]:
rs = RandomizedSearchCV(net, params, refit=False, cv=3, scoring='f1', verbose=2, n_iter = 20)

In [None]:
rs.fit(X, y)

In [None]:
print(rs.best_score_, rs.best_params_)
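
To get a feel for how much of the search space the randomized search covers, the sketch below counts the combinations in the `params` dict and draws 20 of them. It is only an illustration under stated assumptions: `grid_size` and `sample_candidates` are hypothetical helpers written for this note, and the sampling uses Python's `random` module rather than scikit-learn's own sampler. When every parameter is given as a list, as it is here, `RandomizedSearchCV` also samples without replacement.

```python
import itertools
import random

# Search space copied from the `params` dict used for the randomized search.
params = {
    'lr': [0.001, 0.01, 0.1],
    'max_epochs': [10, 20, 30],
    'module__num_units': [10, 20, 40],
    'module__dropout': [0, 0.5],
    'optimizer__nesterov': [False, True],
}

def grid_size(space):
    """Number of distinct combinations in a dict of value lists."""
    n = 1
    for values in space.values():
        n *= len(values)
    return n

def sample_candidates(space, n_iter, seed=0):
    """Draw n_iter distinct combinations without replacement (illustrative only)."""
    keys = list(space)
    all_combos = list(itertools.product(*(space[k] for k in keys)))
    chosen = random.Random(seed).sample(all_combos, n_iter)
    return [dict(zip(keys, combo)) for combo in chosen]

n_total = grid_size(params)                        # size of the full grid
candidates = sample_candidates(params, n_iter=20)  # n_iter=20, as in the search
n_fits = len(candidates) * 3                       # cv=3 folds per candidate
print(n_total, len(candidates), n_fits)            # prints: 108 20 60
```

With 20 of 108 combinations tried, the randomized search sees a bit under a fifth of the grid, which is why a narrower exhaustive search around the best region is a natural follow-up.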

In [None]:
params = {
    'lr': [0.1],
    'max_epochs': [15, 20, 25],
    'module__num_units': [30, 40, 50],
    'module__dropout': [0],
    'optimizer__nesterov': [True],
}

In [None]:
gs = GridSearchCV(net, params, refit=True, cv=3, scoring='f1', verbose=2) # refit=True retrains the best configuration on all the data and exposes it as best_estimator_

In [None]:
gs.fit(X, y)

In [None]:
print(gs.best_score_, gs.best_params_)

In [None]:
gs.best_estimator_

In [None]:
gs.predict(X[:10])

In [None]:
gs.predict_proba(X[:10])