<a href="https://www.kaggle.com/code/jhynes/titanic-standard-classifier-test?scriptVersionId=144561339" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
import seaborn as sns
import os

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import sklearn
import mpl_toolkits

print(f"Found TF-DF {tfdf.__version__}")

!pwd; 

## Overview:

### V1.0 Baseline Submission
Baseline test using basic classifiers (SVM, decision tree, KNN) and minimal feature engineering.

### Next Updates
V2.0 will focus on vector-based feature engineering and node embedding encoding-decoding models. 




## 1.0) Load Data & Inspect:
* Load data from CSV files, typically test/train
* Make copy of data
* Examine Variables using info(), describe(), head(), isnull().sum(), list()
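
The inspection steps listed above can be sketched on a small stand-in frame. The toy values and the `inspect_frame` helper are illustrative assumptions, not part of the Titanic data or of this notebook's code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training frame (values are made up).
raw_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

def inspect_frame(df):
    """Hypothetical helper: collect the basic checks in one place."""
    return {
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isnull().sum().to_dict(),
        "summary": df.describe(),
    }

work_df = raw_df.copy()   # work on a copy so the raw data stays untouched
report = inspect_frame(work_df)
print(work_df.head())
print(report["missing"])
```

Keeping the checks in one helper makes it easy to rerun them after each cleaning step.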


In [None]:
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")


#train_df = train_df.copy()

#test_df = test_df.copy()


example_submission_df = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")


### 2.1) Data cleaning
 


* Convert categorical data to numerical (float)
* Missing Data: 
    * Replace missing values or NaNs with values that are both statistically and pragmatically appropriate.
    * Replace nans with mean, mode, 0, etc. 
    * Age: the distribution of ages is non-Gaussian, indicating that we should probably fill the ages in a stratified way, but for now we will just assume a normal distribution and use a simple fill method.
    
* Ignore for now:
    * Cabin Number: We are ignoring "Cabin number" for now as there are many missing values. Cabin letter/number coding likely carries information that is redundant with the 1st, 2nd, and 3rd class tickets (lower-class ticket holders were likely in cabins on the lower decks). That said, individuals in 2nd/3rd class cabins that were furthest from the hull breach would have had an increased chance of survival, so cabin location may come in useful in the future.

* Feature Selection:
For now, let's just examine the feature columns that are easy to work with. Here are our columns/features: 
* survival: Survival
* pclass: Ticket class
* sex: Sex
* Age: Age in years
* sibsp: # of siblings / spouses aboard the Titanic
* parch: # of parents / children aboard the Titanic
* ticket: Ticket number
* fare: Passenger fare
* embarked: Port of embarkation
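
The stratified age fill mentioned above is not implemented in this notebook. A minimal sketch of one way to do it follows, assuming the groups are defined by `Pclass` and `Sex` and that the group median (rather than the mean) is used because it is less sensitive to skew. The toy values and the `fill_age_by_group` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; column names follow the Titanic CSV, values are made up.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, np.nan, 40.0, 20.0, 24.0, np.nan],
})

def fill_age_by_group(df, group_cols=("Pclass", "Sex")):
    """Fill missing Age with the median Age of each (Pclass, Sex) group."""
    out = df.copy()
    group_median = out.groupby(list(group_cols))["Age"].transform("median")
    # Fall back to the overall median for groups with no known ages.
    out["Age"] = out["Age"].fillna(group_median).fillna(out["Age"].median())
    return out

filled_df = fill_age_by_group(toy_df)
print(filled_df)
```

To avoid leaking test information, the group medians would normally be computed on the training data and then reused for the test data.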


In [None]:
print(test_df.isnull().sum()) # inspect data types: any missing or null data 
print(train_df.isnull().sum()) # inspect data types: any missing or null data 


#### Data Interpretation:
The plots below suggest that class, sex, age, and number of siblings are predictive of an individual's survival.

Although there are many confounding relationships among the variables that warrant caution when making an interpretation, some notable trends and possible interpretations include:
- Women in first class were the most likely to survive.
- Women in 3rd class had a notably poorer survival rate than women in 1st/2nd class.
- Men in 3rd class had poorer survival overall.
- Very young males and females in 1st/2nd class had a better chance of survival. This was not the case in 3rd class.
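
Trends like these can also be checked numerically. The sketch below computes survival rates per sex and class with a pivot table. It uses made-up toy rows, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Toy data; values are made up to illustrate the calculation only.
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "female", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 0, 0, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_table = toy_df.pivot_table(index="Sex", columns="Pclass",
                                values="Survived", aggfunc="mean")
# Group sizes, so that rates from very small groups are not over-read.
count_table = toy_df.pivot_table(index="Sex", columns="Pclass",
                                 values="Survived", aggfunc="count")
print(rate_table)
print(count_table)
```

Reading the rate table next to the count table helps separate real effects from small-sample noise.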



In [None]:
fig, ([ax1, ax2], [ax3, ax4]) = plt.subplots(2,2, figsize=(12,5))

sns.countplot(data=train_df, x="Sex", hue="Survived", ax=ax1)
sns.countplot(data=train_df, x="Pclass", hue="Survived", ax=ax2)
sns.countplot(data=train_df, x="Embarked", hue="Survived", ax=ax3)
sns.countplot(data=train_df, x="Parch", hue="Survived", ax=ax4)
fig.tight_layout()

# catplot is a figure-level function: it creates its own figure and does not accept `ax`
sns.catplot(data=train_df, y="Age", x="Pclass", col="Sex", hue="Survived")

In [None]:
# Encode the categorical columns as integers.
# Missing (unknown) Embarked values are filled with 'S' before encoding.
embarked_codes = {'Q': 1, 'S': 2, 'C': 3}
sex_codes = {'male': 0, 'female': 1}

# Assign the result back instead of using inplace=True on a column,
# which is unreliable in recent pandas versions.
for df in (train_df, test_df):
    df['Embarked'] = df['Embarked'].fillna('S').map(embarked_codes)
    df['Sex'] = df['Sex'].map(sex_codes)


# Fill the missing test fare with the mean test fare.
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())
#train_df['Fare'].fillna(test_df['Fare'].mode(),inplace=True)



# Train: drop the rows with a missing Age.
train_df.dropna(subset=["Age"], inplace=True)
# Test: forward-fill missing ages (test rows cannot be dropped, each needs a prediction).
test_df["Age"] = test_df["Age"].ffill()


print(test_df.isnull().sum()) # inspect data types: any missing or null data 
print(train_df.isnull().sum()) # inspect data types: any missing or null data  

In [None]:
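# UNCOMMENT THE FILL LINES BELOW TO FILL MISSING AGES

# Snapshot taken before any age fill. Rows with a missing Age were already
# dropped from train_df in the previous cell, so this copy holds no NaN ages.
train_df_age_unfilled = train_df.copy(deep = True) 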

mean = train_df['Age'].mean() 
std = train_df['Age'].std()
mode_nsurvived= train_df['Age'].mode()
mode_nsurvived_test= test_df['Age'].mode()

 
num_null = train_df['Age'].isnull().sum()
list_null = train_df['Age'].isnull()

# sample from distribution of ages but ignore nan
#resample_index_train = train_df_age_unfilled['Age'].isnull()==0 
#resample_index_test = test_df_age_unfilled['Age'].isnull()==0 

#print(train_df_age_unfilled['Age'])

#filled_resample_train = train_df_age_unfilled['Age'].sample( n=891, replace=False, weights = resample_index_train.astype(int))
#filled_resample_test = test_df_age_unfilled['Age'].sample( n=len(resample_index_test), replace=False, weights = resample_index_test.astype(int)) 


#train_df["Age"] = train_df["Age"].fillna(mean)  # fill missing ages with the training mean
#test_df["Age"] = test_df["Age"].fillna(mean)    # fill missing ages with the training mean

 # print(train_df_features_survivedId["Age"])
print('AGE DATA CLEANING ...') # inspect data types: any missing or null data 
print(test_df.isnull().sum()) # inspect data types: any missing or null data 
print(train_df.isnull().sum()) # inspect data types: any missing or null data 

### Inspect Consequences of Data Cleaning:

1) How does the predictive value of age change after filling the data?
Does it decrease, increase, or stay the same? Ideally it should match the predictive value 
of the original data. Try different fill methods to check this. Feature scores for Age by fill method:
*     filling with zeros reduces predictive power = 0.12
*     filling with mean increases predictive power = 1.43
*     filling with mode = 0.93
*     backfilling = 1.12

2) Does the distribution of the data change?
The age distribution is non-Gaussian. If we fill the NaNs with mean-based values, 
are we underestimating the impact of age on survival for a particular demographic (e.g., the lower tail)? 
Alternatively, if the fill adds more outliers, this can cause data crowding during normalization or dimensionality reduction. 
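
Both questions can be explored side by side. The sketch below is an illustration on synthetic data, not a reproduction of the scores listed above: it fills a toy `Age` column in several ways and reports, for each, the `-log10(p)` score from scikit-learn's `f_classif` and the standard deviation of the filled column. The data generator, the `age_score` helper, and the chosen fill methods are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in: skewed ages with a mild link to survival, about 20% set to NaN.
n = 400
age = rng.gamma(shape=4.0, scale=7.5, size=n)
survived = (rng.random(n) < 1.0 / (1.0 + np.exp((age - 30.0) / 20.0))).astype(int)
toy_df = pd.DataFrame({"Age": age, "Survived": survived})
toy_df.loc[rng.random(n) < 0.2, "Age"] = np.nan

def age_score(age_series, target):
    """-log10(p) of the ANOVA F-test for Age, computed on non-missing rows."""
    mask = age_series.notnull()
    _, pvalues = f_classif(age_series[mask].to_frame(), target[mask])
    return float(-np.log10(pvalues[0]))

fills = {
    "dropped": toy_df["Age"],                           # NaN rows are skipped
    "zero": toy_df["Age"].fillna(0),
    "mean": toy_df["Age"].fillna(toy_df["Age"].mean()),
    "ffill": toy_df["Age"].ffill().bfill(),             # bfill covers a leading NaN
}
for name, series in fills.items():
    print(f"{name:8s} score={age_score(series, toy_df['Survived']):.2f} "
          f"std={series.std():.2f}")
```

Mean filling always shrinks the standard deviation, because it adds points exactly at the center of the distribution. That is one concrete way the fill changes the distribution.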

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Checking best features
columns_to_be_explored = ['Sex', 'Pclass' ,'Embarked' ,'Parch' , 'Age', 'SibSp']

selector = SelectKBest(f_classif, k='all')

selector.fit(train_df[columns_to_be_explored],train_df['Survived'])
scores = -np.log10(selector.pvalues_)
indices = np.argsort(scores)[::-1]

print('Features importance:')
for i in range(len(scores)):
    print('%.2f %s' % (scores[indices[i]], columns_to_be_explored[indices[i]]))

In [None]:
# Note: the distribution of ages is non-Gaussian,
# indicating that we should probably fill the ages in a stratified way,
# but for now we use a simple fill method.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3)

# Plot the age distributions before and after cleaning
sns.histplot(train_df_age_unfilled , x = 'Age', ax=ax1 ).set(title='Train Age - snapshot before fill')
sns.histplot(train_df, x = 'Age',ax=ax2, color='m').set(title='Train Age - after cleaning')
sns.histplot(test_df , x = 'Age', ax=ax3 ).set(title='Test Age - after forward fill')
#sns.histplot(test_df, x = 'Age',ax=ax4, color='m').set(title='Test Age - after bfill nans')

# Default figure size for figures created after this cell
sns.set(rc={"figure.figsize":(15, 6)})

# Summary statistics (displayed because it is the last expression in the cell)
train_df.describe()

## 3.0) Exploratory Data Analysis: select most informative features
 
* Univariate Analysis (feature decoding)
* Bivariate Analysis (relational plot) 
* Latent representation learning/inspection (PCA, tSNE, etc)
* Feature Engineering (normalization)



In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Checking best features
 

selector = SelectKBest(f_classif, k='all')

selector.fit(train_df[columns_to_be_explored], train_df['Survived'])
scores = -np.log10(selector.pvalues_)
indices = np.argsort(scores)[::-1]

print('Features importance:')
for i in range(len(scores)):
    print('%.2f %s' % (scores[indices[i]], columns_to_be_explored[indices[i]]))

In [None]:
columns_to_be_added_as_features = ['Sex', 'Pclass' ,'Embarked' ,'Parch' ]

train_df_features = train_df[columns_to_be_added_as_features + ['Survived']]
test_df_features = test_df[columns_to_be_added_as_features] 

# For submission scoring (i.e., don't normalize the 'PassengerId' feature during subsequent feature engineering steps)
test_df_features_Match = test_df[columns_to_be_added_as_features +  ['PassengerId']] 

#### Normalize Features in Train & Test Data
Don't normalize `PassengerId`!
Min-max normalization (0-1) does not change the `Survived` labels, as they are already 0s and 1s.
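
A common variant of min-max scaling takes the minimum and range from the training data only and reuses them for the test data, so that both frames share one scale. The sketch below shows that variant. It is not what this notebook does, and the helper names `fit_min_max` and `apply_min_max` and the toy frames are hypothetical.

```python
import pandas as pd

def fit_min_max(train_df):
    """Record the per-column minimum and range from the training frame."""
    col_min = train_df.min()
    col_range = train_df.max() - col_min
    # Constant columns would divide by zero; use a range of 1 so they map to 0.
    col_range = col_range.replace(0, 1)
    return col_min, col_range

def apply_min_max(df, col_min, col_range):
    """Scale any frame with statistics taken from the training frame."""
    return (df - col_min) / col_range

toy_train = pd.DataFrame({"Pclass": [1, 2, 3], "Parch": [0, 2, 4], "Const": [5, 5, 5]})
toy_test = pd.DataFrame({"Pclass": [3, 1], "Parch": [1, 6], "Const": [5, 5]})

col_min, col_range = fit_min_max(toy_train)
toy_train_norm = apply_min_max(toy_train, col_min, col_range)
toy_test_norm = apply_min_max(toy_test, col_min, col_range)
print(toy_train_norm)
print(toy_test_norm)
```

With this approach, test values outside the training range fall outside 0-1, which is expected rather than an error.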

In [None]:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result

# New df
train_df_features_norm = normalize(train_df_features)
test_df_features_norm = normalize(test_df_features)


print(train_df_features_norm.columns.values)
print(test_df_features_norm.columns.values)
train_df_features_norm.head()

#### Data visualization, latent factors, feature engineering

Visualizing the structure of the data and the relationship of the variables is
useful for deciding which ML model to use. 
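
As one way to inspect latent factors, the sketch below runs scikit-learn's `PCA` on synthetic data and checks two properties that help when reading the output: each row of `components_` has unit length, and the cumulative explained variance reaches 1 when all components are kept. The synthetic features are assumptions and stand in for the four normalized features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for four normalized features (values are made up).
base = rng.normal(size=(200, 2))
toy_X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),   # nearly duplicates column 0
    base[:, 1],
    rng.normal(size=200),
])

pca = PCA(n_components=4)
scores = pca.fit_transform(toy_X)   # fit and project in one call

# Each row of components_ is a unit-length eigenvector of the covariance matrix.
row_norms = np.linalg.norm(pca.components_, axis=1)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("row norms:", row_norms)
print("cumulative explained variance:", cumulative)
```

A first component that explains much more variance than the rest, as the near-duplicate column produces here, is a hint that some features are redundant.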





In [None]:
# Relational plot
g = sns.pairplot(train_df_features_norm, hue = 'Survived', corner=True)


In [None]:
from sklearn.decomposition import PCA

num_comp = 4

# PCA with scikit-learn: fit_transform fits the model and projects the data in one call
pca_out = PCA(n_components=num_comp)
pca_out_data = pca_out.fit_transform(train_df_features_norm[columns_to_be_added_as_features])

print(f'explained variance = {pca_out.explained_variance_ratio_} ')
print(f'cumulative variance  = {np.cumsum(pca_out.explained_variance_ratio_)}' )
 

In [None]:

loadings = pca_out.components_ 
var = pca_out.explained_variance_ratio_
pca_cumSum = np.cumsum(var)

pc_list = ["PC"+str(i) for i in list(range(1, num_comp+1))]

# pca_out.components_ holds one unit-length eigenvector per component,
# so the squared weights within each PC always sum to 1.
# These weights are often called loadings; they equal variable-component correlations
# only after scaling by sqrt(eigenvalue) and standardizing the variables.


loadings_pca_df = pd.DataFrame.from_dict(dict(zip(pc_list, loadings)))
loadings_pca_df['variable'] = train_df_features_norm[columns_to_be_added_as_features].columns.values
loadings_pca_df = loadings_pca_df.set_index('variable')

pca_cumSum_df = pd.DataFrame({'var':var,'cumsum_var':pca_cumSum, 'PC':pc_list})


pca_plot_df = pd.DataFrame({'PC1': pca_out_data[:,0],
                   'PC2': pca_out_data[:,1],
                   'PC3': pca_out_data[:,2],
                   'PC4': pca_out_data[:,3],  
                   'Survived': train_df_features_norm['Survived']
                  })

In [None]:
figA, (ax1, ax2) = plt.subplots(1,2, figsize=(10, 3))

# Variance explained
sns.barplot(x='PC',y="var", data=pca_cumSum_df, color="c", ax=ax1).set(title='PCA Explained Variance');
sns.barplot(x='PC',y="cumsum_var", data=pca_cumSum_df, color="c", ax=ax2).set(title='PCA Cumulative Explained Variance');

sns.pairplot(pca_plot_df, hue="Survived" )

#figB2, (axB1, axB2, axB3) = plt.subplots(1,3, figsize=(10, 4))

#sns.scatterplot(data=pca_plot_df, x='PC1',y="PC2",  hue="Survived" ,ax=axB1).set(title='PCA Plot');
#sns.scatterplot(data=pca_plot_df, x='PC1',y="PC3",  hue="Survived", ax=axB2).set(title='PCA Plot');
#sns.scatterplot(data=pca_plot_df, x='PC1',y="PC6",  hue="Survived", ax=axB3).set(title='PCA Plot');


## 4.0) Model Training

After examining the structure of the input data space, we see that it has some
structure (e.g., clustered target data) but that it is complex,
e.g., non-Gaussian, nested hierarchies, outliers, non-linearities. 

An SVM, KNN, or decision tree may get us 70% of the way there, but they may run into problems with
overgeneralizing or overfitting. Some perturbation, sub-ensembling, or penalties may need to be introduced.  
More complex feature engineering could also help. 

* Support Vector Machine 
* Decision Tree
* K-Nearest Neighbours

(not included yet...)
- Bagging Decision Tree (Ensemble Learning I)
- Boosted Decision Tree (Ensemble Learning II)
- Random Forest (Ensemble Learning III)
- Naive Bayes
- Logistic Regression
- Voting Classification (Ensemble Learning IV)
- Neural Network (Deep Learning)




### Splitting Training Data into Train/Validation Sets
The validation ratio is set to 0.2, so 20% of the training data is held out for validation. 
The split uses scikit-learn's `train_test_split` function. 
The last four lines of the cell below then separate the features from the labels.
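
The notebook's split is random and unseeded. As a hedged variant, the sketch below adds `stratify` and `random_state` to `train_test_split` so that the class ratio is preserved and the split is repeatable. The synthetic frame and the seed value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the normalized training frame (values are made up).
toy_df = pd.DataFrame({
    "Sex": rng.integers(0, 2, size=100),
    "Pclass": rng.integers(1, 4, size=100),
    "Survived": np.repeat([0, 1], [60, 40]),
})
feature_cols = ["Sex", "Pclass"]

toy_train, toy_val = train_test_split(
    toy_df,
    test_size=0.20,
    shuffle=True,                  # train_test_split shuffles by default
    stratify=toy_df["Survived"],   # keep the class ratio equal in both parts
    random_state=42,               # fixed seed so the split is repeatable
)

toy_train_X, toy_train_Y = toy_train[feature_cols], toy_train["Survived"]
toy_val_X, toy_val_Y = toy_val[feature_cols], toy_val["Survived"]
print(len(toy_train), len(toy_val), toy_val_Y.mean())
```

Because `train_test_split` already shuffles, a separate shuffle of the frame beforehand is optional.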

In [None]:
from sklearn.model_selection import train_test_split

 ## SWAP OUT REG DATA FOR PCA ###
#train_df_norm = pca_plot_df[['PC1', 'PC2','PC3','PC4', 'Survived']]

#columns_to_be_added_as_features = ['PC1', 'PC2',  'PC3', 'PC4']
#### SWAP OUT REG DATA FOR PCA  ###
#train_df_features_norm = pca_plot_df
    
train_df_features_norm = train_df_features_norm.sample(frac=1).reset_index(drop=True)
 

validation_set_ratio = 0.20  # 20
validation_set_size = int(len(train_df_features_norm)*validation_set_ratio)
training_set_size = len(train_df_features_norm) - validation_set_size

print("Total set size: {}".format(len(train_df_features_norm)))
print("Training set size: {}".format(training_set_size))
print("Validation set size: {}".format(validation_set_size))


# train vs validation: 80/20 split
train, val = train_test_split(train_df_features_norm, test_size=validation_set_ratio)

train_X = train[columns_to_be_added_as_features]
train_Y = train['Survived']

val_X = val[columns_to_be_added_as_features]
val_Y = val['Survived']

### SVM Model: training, predictions, and visualizing

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import metrics


grid = {"C":np.arange(1,10,1),'gamma':[ 0.00001, 0.00005, 0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5]}
svm0 = SVC(random_state=42)
svm_cv = GridSearchCV(svm0, grid, cv=10)
svm_cv.fit(train_X, train_Y.values.ravel())

print("Parameter Selection SVM:",svm_cv.best_params_)

svm = SVC(C=svm_cv.best_params_["C"], gamma=svm_cv.best_params_["gamma"],random_state=42)
svm.fit(train_X,train_Y)
print("SVC Accuracy :",svm.score(val_X,val_Y))


In [None]:
y_pred_svm = svm.predict(val_X)


accuracy = metrics.accuracy_score(val_Y, y_pred_svm)
precision = metrics.precision_score(val_Y, y_pred_svm)
recall = metrics.recall_score(val_Y, y_pred_svm)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

cv_score = cross_val_score(svm, train_df_features_norm[columns_to_be_added_as_features], train_df_features_norm['Survived'], cv = 6  )
print("Cross Val Alt", cv_score)

### Decision Tree Model: training, predictions, and visualizing

In [None]:
from sklearn import tree

# Decision tree

decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(train_X, train_Y.values.ravel())


In [None]:

y_pred_tree = decision_tree.predict(val_X)

#print("Accuracy:",metrics.accuracy_score(val_X, y_pred_tree))


accuracy = metrics.accuracy_score(val_Y, y_pred_tree)
precision = metrics.precision_score(val_Y, y_pred_tree)
recall = metrics.recall_score(val_Y, y_pred_tree)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)


# Alt Cross Val Method
cv_score = cross_val_score(decision_tree, train_df_features_norm[columns_to_be_added_as_features], train_df_features_norm['Survived'], cv = 6  )
print("Cross Val Alt", cv_score)


### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(train_X,train_Y)
rf.score(val_X,val_Y)

### KNN Model: training, predictions, and visualizing

A for loop can be used to search for the best value of k (the size of the local neighbourhood).
I found that k = 3 and k = 9 worked equally well over the tested range k = 1, 3, ..., 15.
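
The for-loop search described above could look like the sketch below. It runs on synthetic data from `make_classification`, so the best k it prints is not the result reported for the Titanic data. The 5-fold setting is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four normalized features (not the Titanic data).
toy_X, toy_y = make_classification(n_samples=300, n_features=4, n_informative=3,
                                   n_redundant=1, random_state=0)

k_values = list(range(1, 16, 2))   # odd k from 1 to 15
mean_scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validated accuracy for this k
    mean_scores[k] = cross_val_score(model, toy_X, toy_y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

Odd values of k avoid tied votes in a two-class problem, which is presumably why the tested range steps by 2.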

In [None]:
from sklearn.neighbors import KNeighborsClassifier
 
knn0 = KNeighborsClassifier()
knn_cv = GridSearchCV(knn0, {"n_neighbors": np.arange(3,50)}, cv=10)
knn_cv.fit(train_X,train_Y)
print("Best parameters of KNN :",knn_cv.best_params_)

knn = KNeighborsClassifier(n_neighbors=knn_cv.best_params_["n_neighbors"])
knn.fit(train_X,train_Y.values.ravel())

In [None]:
y_pred_knn = knn.predict(val_X)

accuracy = metrics.accuracy_score(val_Y, y_pred_knn)
precision = metrics.precision_score(val_Y, y_pred_knn)
recall = metrics.recall_score(val_Y, y_pred_knn)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

print("KNN Accuracy :",knn.score(val_X,val_Y))


# Alt Cross Val Method
cv_score = cross_val_score(knn, train_df_features_norm[columns_to_be_added_as_features], train_df_features_norm['Survived'], cv = 5  )
print("Cross Val Alt", cv_score)


## 5.0) Winning Model: Selection, Testing, Submission

Generate predictions for competition test dataset with unknown ground-truth labels. 
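
Before writing the CSV, it can help to check that the submission frame has the expected shape and types. The sketch below is one possible set of checks. The `check_submission` helper and the toy IDs are hypothetical, and the rules are inferred from the two columns this notebook submits.

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_ids):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if list(submission.columns) != ["PassengerId", "Survived"]:
        problems.append("unexpected columns")
        return problems
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test set")
    elif not (submission["PassengerId"].to_numpy() == np.asarray(expected_ids)).all():
        problems.append("PassengerId values differ from the test set")
    if not pd.api.types.is_integer_dtype(submission["Survived"]):
        problems.append("Survived is not an integer column")
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived contains values other than 0 and 1")
    return problems

toy_ids = [892, 893, 894]   # made-up IDs for illustration
good = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": toy_ids, "Survived": [0.0, 1.0, 2.0]})
print(check_submission(good, toy_ids))
print(check_submission(bad, toy_ids))
```

Returning a list of problems rather than raising on the first one shows every issue in a single pass.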



In [None]:
y_target_predict = knn.predict(test_df_features_norm[columns_to_be_added_as_features]) 
#y_target_predict_svm = svm.predict(test_df_features_norm[columns_to_be_added_as_features])  

print('Predicted result: ', y_target_predict)
print(len(y_target_predict))


In [None]:
submission = pd.DataFrame({'PassengerId':test_df_features_Match.PassengerId.values,'Survived':y_target_predict})
submission.Survived = submission.Survived.astype(int)

print(submission.head()) # make sure we are submitting integers and not floats
print(submission.shape)

filename = 'Titanic Predictions_pub.csv'
submission.to_csv(filename, index=False)
print('Saved file: ' + filename)


