# Titanic
---
In this notebook, I work with the Titanic dataset from Kaggle, https://www.kaggle.com/competitions/titanic/overview. The main objective of this project is to use various machine learning techniques to predict survival for the passengers on board. The models are trained on data from train.csv, and predictions are made for test.csv. The predictions from the most accurate model are written to a CSV file called gender_submission.csv.

The variables in the dataset are defined as follows:
- survival: Survival; 0 = no, 1 = yes
- pclass: Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
- sex: Sex
- age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation; C = Cherbourg, Q = Queenstown, S = Southampton

In data preprocessing, the categorical variables embarked and sex are converted to integer codes.

In modeling, I will use logistic regression, SVM, stochastic gradient descent, a decision tree classifier, and a random forest classifier.

## Modeling
### Data Preprocessing
#### Importing the data

In [22]:
import numpy as np
import pandas as pd

# import train and test csv data from the file and set the index as PassengerId
train = pd.read_csv('train.csv', index_col='PassengerId')
test = pd.read_csv('test.csv', index_col='PassengerId')

# Concatenate the train and test sets so missing values can be imputed jointly
# The test set has no Survived column, so add a placeholder column of zeros
# (these zeros are not real labels and must not be used for evaluation)
surv_col = np.zeros(len(test))
# Insert surv_col as the first column of the test dataset
test.insert(loc=0, column='Survived', value=surv_col)
# Concatenate train and test data
data = pd.concat([train, test])
print(data.isna().sum())

Survived       0
Pclass         0
Name           0
Sex            0
Age          263
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
dtype: int64
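
One possible alternative to the zero placeholder is to let pandas fill the missing Survived column with NaN and to tag each part with a key, so the two parts can be recovered by name rather than by row position. The sketch below uses small made-up frames (toy_train, toy_test) in place of train.csv and test.csv; these names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up for illustration)
toy_train = pd.DataFrame(
    {"Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Age": [22.0, 38.0, np.nan]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)
toy_test = pd.DataFrame(
    {"Pclass": [3, 2], "Age": [34.5, np.nan]},
    index=pd.Index([892, 893], name="PassengerId"),
)

# Concatenate with keys; Survived is NaN for the test rows because the column is absent there
combined = pd.concat([toy_train, toy_test], keys=["train", "test"])

# Recover each part by its key instead of by row position
train_part = combined.loc["train"]
test_part = combined.loc["test"].drop(columns="Survived")
print(combined)
```

With this approach the NaN labels of the test rows show up in isna().sum(), which makes it harder to mistake them for real labels.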


#### Age
For the Age feature, I will impute each missing value with the average age of the passenger's ticket class.

In [23]:
'''
# Use Random number generator for different classes
mean_1 = data.loc[data.Pclass == 1, 'Age'].mean()
std_1 = data.loc[data.Pclass == 1, 'Age'].std()
mean_2 = data.loc[data.Pclass == 2, 'Age'].mean()
std_2 = data.loc[data.Pclass == 2, 'Age'].std()
mean_3 = data.loc[data.Pclass == 3, 'Age'].mean()
std_3 = data.loc[data.Pclass == 3, 'Age'].std()

'''
# Calculate the average age for different Pclass
age_1 = data.loc[data.Pclass == 1, 'Age'].mean()
age_2 = data.loc[data.Pclass == 2, 'Age'].mean()
age_3 = data.loc[data.Pclass == 3, 'Age'].mean()

# Impute the missing ages with the average age of each ticket class
data.loc[data['Pclass'] == 1,'Age'] = data.loc[data['Pclass'] == 1,'Age'].fillna(age_1)
data.loc[data['Pclass'] == 2,'Age'] = data.loc[data['Pclass'] == 2,'Age'].fillna(age_2)
data.loc[data['Pclass'] == 3,'Age'] = data.loc[data['Pclass'] == 3,'Age'].fillna(age_3)
print(data.isna().sum())

Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           1
Cabin       1014
Embarked       2
dtype: int64
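
The per-class imputation above can also be written without repeating one line per class. The following is a minimal sketch using groupby and transform on a made-up frame (toy); the same pattern would apply to Fare.

```python
import numpy as np
import pandas as pd

# Made-up data: three ticket classes, one missing age in each
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 30.0, np.nan, 20.0, 24.0, np.nan],
})

# Mean age within each ticket class, broadcast back to every row
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages with the mean of their own class
toy["Age"] = toy["Age"].fillna(class_mean_age)
print(toy)
```

Because transform returns a Series aligned with the original rows, the fill works for any number of classes without hard-coding them.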


#### Embarked
Since only two values of Embarked are missing, I will impute them with the mode of Embarked.

In [24]:
# Embarked
# Use value_counts to see which port has the most passengers
print(data.Embarked.value_counts())
# Fill the missing values with the mode; assigning the result back avoids
# relying on inplace=True through attribute access
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

Embarked
S    914
C    270
Q    123
Name: count, dtype: int64


#### Fare
Only one Fare value is missing, so I will impute it with the average fare of the corresponding ticket class.

In [25]:
# Fare
# Identify the one whose fare is missing
print(data[data.Fare.isna()])
# Calculate the average fare from Pclass 3
fare_3 = data.loc[data.Pclass==3, 'Fare'].mean()
# impute the missing value with fare_3
data.loc[data.Pclass==3, 'Fare'] = data.loc[data.Pclass==3, 'Fare'].fillna(fare_3)
# Final check for missing values in the data
print(data.isna().sum())

             Survived  Pclass                Name   Sex   Age  SibSp  Parch  \
PassengerId                                                                   
1044              0.0       3  Storey, Mr. Thomas  male  60.5      0      0   

            Ticket  Fare Cabin Embarked  
PassengerId                              
1044          3701   NaN   NaN        S  
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin       1014
Embarked       0
dtype: int64


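#### Encoding categorical features
The data has two categorical features, Embarked and Sex. I map each category to an integer code with a dictionary and replace. Note that this is integer (label) encoding rather than one-hot encoding: Sex becomes 0/1, while Embarked becomes 0/1/2, which imposes an arbitrary order on the ports.

In [26]:
# Integer encoding: map each category to a numeric code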
ports = {"S": 0, "C": 1, "Q": 2}
sex = {'male':0, 'female':1}
data['Embarked'] = data['Embarked'].replace(ports)
data['Sex'] = data['Sex'].replace(sex)
print(data.head())
print(data.columns)

             Survived  Pclass  \
PassengerId                     
1                 0.0       3   
2                 1.0       1   
3                 1.0       3   
4                 1.0       1   
5                 0.0       3   

                                                          Name  Sex   Age  \
PassengerId                                                                 
1                                      Braund, Mr. Owen Harris    0  22.0   
2            Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0   
3                                       Heikkinen, Miss. Laina    1  26.0   
4                 Futrelle, Mrs. Jacques Heath (Lily May Peel)    1  35.0   
5                                     Allen, Mr. William Henry    0  35.0   

             SibSp  Parch            Ticket     Fare Cabin  Embarked  
PassengerId                                                           
1                1      0         A/5 21171   7.2500   NaN         0  
2              
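
For comparison, a one-hot version using pandas get_dummies could look like the sketch below; the cell above instead maps categories to integer codes with replace. The frame (toy) and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows with the two categorical features and one numeric feature
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.92],
})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying that one port is "larger" than another, which matters mostly for linear models and SVMs; tree-based models are usually less sensitive to the choice.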

#### Dropping unnecessary columns and splitting the data into train and test

In [27]:
# Drop columns
data = data.drop(['Name', 'Ticket', 'Cabin'], axis=1)
# Split the data
train_clean = data.iloc[:len(train), :]
test_clean = data.iloc[len(train):, :]
# Separate X and y
# For train data
X_tr = train_clean.drop(['Survived'], axis=1)
y_tr = train_clean['Survived']
# For test data
X_ts = test_clean.drop(['Survived'], axis=1)
# Note: Survived in the test rows is the zero placeholder added earlier, not a real label
y_ts = test_clean['Survived']

#### Scaling the data
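Since Fare has a much higher mean and standard deviation than the other features, I use StandardScaler. The active code below standardizes all features rather than Fare alone: the scaler is fitted on the training data and the same transformation is applied to the test data.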

In [28]:
# Scaling data

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
'''
# Scaling for train data
fare_tr_scaled = ss.fit_transform(X_tr[['Fare']])
X_tr = X_tr.drop('Fare', axis=1)
X_tr['Fare_scale'] = fare_tr_scaled
# Scaling for test data
fare_ts_scaled = ss.fit_transform(X_ts[['Fare']])
X_ts = X_ts.drop('Fare', axis=1)
X_ts['Fare_scale'] = fare_ts_scaled
'''
X_tr = ss.fit_transform(X_tr)
X_ts = ss.transform(X_ts)
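
To make the fit/transform distinction concrete, the sketch below checks on made-up arrays (toy_train, toy_test) that transform applies z = (x - mean) / std using the mean and population standard deviation learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices standing in for X_tr and X_ts (made-up values)
toy_train = np.array([[7.25, 22.0], [71.28, 38.0], [8.05, 35.0], [53.10, 26.0]])
toy_test = np.array([[7.83, 34.5], [9.69, 62.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns mean_ and scale_ from train only
test_scaled = scaler.transform(toy_test)        # reuses the training statistics

# Manual equivalent: z = (x - mean) / std, with the population std (ddof=0)
manual_test = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(np.allclose(test_scaled, manual_test))
```

Calling fit_transform on the test data instead would rescale it with its own statistics, so train and test features would no longer be on the same scale.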


#### Train Test Split for Train Dataset

In [29]:
# train test split for train data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tr, y_tr, test_size=0.2, random_state=1111, stratify=y_tr)
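
Note that in this notebook the scaler is fitted on all of X_tr before the split, so the held-out 20% also contributes to the scaling statistics. The effect is probably small here, but one way to avoid it is to put the scaler and the model in a single pipeline, so that scaling is refitted on the training part of each fold. The sketch below assumes synthetic data from make_classification (X_demo, y_demo) in place of the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the unscaled training features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=1111)

# The scaler is refitted inside every fold, so validation rows never influence it
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1111)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(scores.mean())
```

Averaging over five folds also gives a less noisy estimate than a single 80/20 split on a dataset of this size.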

### Logistic Regression

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Specify the Logistic Regression
lr = LogisticRegression()
# Fit the model using train sets
lr.fit(X_train, y_train)
# Calculate Accuracy and AUC ROC score
lr_acc = lr.score(X_test, y_test)
lr_aucroc = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
# Print the results
print('Logistic Regression Classification Accuracy Score is %f' % lr_acc)
print('Logistic Regression Classification AUC ROC Score is %f' % lr_aucroc)

Logistic Regression Classification Accuracy Score is 0.787709
Logistic Regression Classification AUC ROC Score is 0.793017


### Support Vector Machine

In [44]:
# Import SVC; probability=True enables predict_proba, other settings are the defaults
from sklearn.svm import SVC
# Instantiate SVC
svm = SVC(probability=True)
# Fit the model using train sets
svm.fit(X_train, y_train)
# Calculate Accuracy and AUC ROC score
svc_acc = svm.score(X_test, y_test)
svc_aucroc = roc_auc_score(y_test, svm.predict_proba(X_test)[:,1])
# Print the results
print('Support Vector Machine Accuracy Score is %f' % svc_acc)
print('Support Vector Machine AUC ROC Score is %f' % svc_aucroc)

Support Vector Machine Accuracy Score is 0.782123
Support Vector Machine AUC ROC Score is 0.780237


### Stochastic Gradient Descent (SGD)

In [46]:
from sklearn.linear_model import SGDClassifier
# set random_state=1111 for reproducibility
lr_classifier = SGDClassifier(loss='log_loss', random_state=1111)

# fit the model using train sets
lr_classifier.fit(X_train, y_train)
# Calculate Accuracy and AUC ROC score
sgd_acc = lr_classifier.score(X_test, y_test)
sgd_aucroc = roc_auc_score(y_test, lr_classifier.predict_proba(X_test)[:,1])
# Print the results
print('Stochastic Gradient Descent Accuracy Score is %f' % sgd_acc)
print('Stochastic Gradient Descent AUC ROC Score is %f' % sgd_aucroc)

Stochastic Gradient Descent Accuracy Score is 0.754190
Stochastic Gradient Descent AUC ROC Score is 0.754941


### Decision Tree Classifier

In [47]:
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Instantiate Decision Tree Classifier
dt = DecisionTreeClassifier()

# fit the model using train sets
dt.fit(X_train, y_train)

# Calculate Accuracy and AUC ROC score
dt_acc = dt.score(X_test, y_test)
dt_aucroc = roc_auc_score(y_test, dt.predict_proba(X_test)[:,1])
# Print the results
print('Decision Tree Classifier Accuracy Score is %f' % dt_acc)
print('Decision Tree Classifier AUC ROC Score is %f' % dt_aucroc)

Decision Tree Classifier Accuracy Score is 0.731844
Decision Tree Classifier AUC ROC Score is 0.712055


### Random Forest Classifier

In [48]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate Random Forest Classifier
rfc = RandomForestClassifier()

# fit the model using train sets
rfc.fit(X_train, y_train)

# Calculate Accuracy and AUC ROC score
rfc_acc = rfc.score(X_test, y_test)
rfc_aucroc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:,1])
# Print the results
print('Random Forest Classifier Accuracy Score is %f' % rfc_acc)
print('Random Forest Classifier AUC ROC Score is %f' % rfc_aucroc)

Random Forest Classifier Accuracy Score is 0.793296
Random Forest Classifier AUC ROC Score is 0.856126


## Model Selection

In [54]:
models = ['Logistic Regression', 'Support Vector Machine', 'Stochastic Gradient Descent', 'Decision Tree', 'Random Forest']
accuracy_list = [lr_acc, svc_acc, sgd_acc, dt_acc, rfc_acc]
auc_roc_list = [lr_aucroc, svc_aucroc, sgd_aucroc, dt_aucroc, rfc_aucroc]
scoring = {'name':models, 'accuracy': accuracy_list, 'auc_roc':auc_roc_list}
print(pd.DataFrame.from_dict(scoring))

                          name  accuracy   auc_roc
0          Logistic Regression  0.787709  0.793017
1       Support Vector Machine  0.782123  0.780237
2  Stochastic Gradient Descent  0.754190  0.754941
3                Decision Tree  0.731844  0.712055
4                Random Forest  0.793296  0.856126


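Based on these scores, the Random Forest classifier has the best accuracy (0.793) and the best AUC ROC (0.856) of all the models built, although its accuracy is only marginally higher than that of Logistic Regression (0.788).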
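
The fit, score, and print steps repeated for each model above could be collected into one helper. The sketch below is one way to do that; the function name evaluate and the synthetic data (X_demo, y_demo) are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate(models, X_fit, y_fit, X_eval, y_eval):
    """Fit each model and return a table of accuracy and ROC AUC."""
    rows = []
    for name, model in models.items():
        model.fit(X_fit, y_fit)
        rows.append({
            "name": name,
            "accuracy": accuracy_score(y_eval, model.predict(X_eval)),
            "auc_roc": roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1]),
        })
    return pd.DataFrame(rows)


# Synthetic data standing in for the scaled Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=1111)
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1111, stratify=y_demo
)

demo_models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=1111),
}
results = evaluate(demo_models, X_a, y_a, X_b, y_b)
print(results)
```

Any classifier that implements predict_proba can be added to the dictionary, which keeps the comparison table and the individual scores in sync.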

### Hyperparameter tuning for Random Forest Classifier

In [57]:
from sklearn.model_selection import GridSearchCV
# Set the parameters
params = {"criterion" : ["gini", "entropy"], "min_samples_leaf" : [1, 2, 5, 10, 25, 50, 70], "min_samples_split" : [2, 4, 10, 12, 16, 18, 25, 35], "n_estimators": [100, 400, 700, 1000, 1500]}
searcher_rfc = GridSearchCV(rfc, params, scoring='accuracy', cv=5, n_jobs=-1)
# fit the model using train sets
searcher_rfc.fit(X_train, y_train)

# print the best parameter results
print(searcher_rfc.best_params_)
# print the best accuracy score from the most fit parameter settings
print(searcher_rfc.best_score_)

{'criterion': 'entropy', 'min_samples_leaf': 2, 'min_samples_split': 12, 'n_estimators': 1000}
0.8412882891756132

In [59]:
# print the accuracy score from test sets
print(searcher_rfc.score(X_test, y_test))
print(roc_auc_score(y_test, searcher_rfc.predict_proba(X_test)[:,1]))

0.770949720670391
0.8416337285902503


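From the grid search results above, the tuned model reaches a cross-validated accuracy of 0.841 on the training split, but only 0.771 accuracy and 0.842 AUC ROC on the held-out split, compared with 0.793 and 0.856 for the default random forest classifier. Thus, the default-setting RFC will be used.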
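
The grid above contains 2 x 7 x 8 x 5 = 560 parameter combinations, which with cv=5 means 2,800 model fits. If that is too slow, a randomized search over the same kind of space is a common alternative. The sketch below uses synthetic data and a trimmed search space so that it runs quickly; the budget n_iter=8 and the reduced value lists are assumptions, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=1111)

# Same style of search space as the grid above, trimmed to keep the demo quick
param_distributions = {
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 2, 5, 10],
    "min_samples_split": [2, 4, 10, 16],
    "n_estimators": [20, 50],
}

# n_iter caps the number of sampled settings (8 here, a hypothetical budget)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1111),
    param_distributions,
    n_iter=8,
    scoring="accuracy",
    cv=3,
    random_state=1111,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

As with GridSearchCV, best_score_ is a cross-validated score on the data passed to fit, so it should still be compared against a held-out score before choosing a model.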

In [65]:
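# Predict labels and survival probabilities for the Kaggle test set with the default RFC
predict = rfc.predict(X_ts)
predict_prob = rfc.predict_proba(X_ts)[:,1]
# Load the sample submission file to reuse its PassengerId index
submission = pd.read_csv('gender_submission.csv', index_col='PassengerId')
# Replace the sample labels with the model predictions, stored as integers
submission['Survived'] = predict
submission['Survived'] = submission['Survived'].astype('int64')
print(submission.head())
# Overwrite the file with the model predictions
submission.to_csv('gender_submission.csv')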

             Survived
PassengerId          
892                 0
893                 0
894                 0
895                 1
896                 0
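
The submission could also be built without reading the sample file first, since the test set index already holds the PassengerId values. A minimal sketch, assuming made-up stand-ins (test_index, demo_predictions) for test.index and the model output:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: the PassengerId index of the test set and model predictions
test_index = pd.Index([892, 893, 894], name="PassengerId")
demo_predictions = np.array([0.0, 1.0, 0.0])

# Build the submission directly from the index and the predictions
demo_submission = pd.DataFrame(
    {"Survived": demo_predictions.astype("int64")}, index=test_index
)

# With no path, to_csv returns the CSV text; pass a file path to write a file instead
csv_text = demo_submission.to_csv()
print(csv_text)
```

Writing to a differently named file would also keep the original sample gender_submission.csv intact.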


In [66]:
solution = pd.read_csv('solution.csv', index_col = 'PassengerId')
# accuracy score
print(float(np.sum(predict == solution.Survived)/solution['Survived'].shape[0]))
# f1 score
#print(f1_score(solution.Survived, predict))
# auc roc score
print(roc_auc_score(solution.Survived, predict_prob))

0.7511961722488039
0.792261512555824
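
The manual accuracy calculation above matches scikit-learn's accuracy_score, and the commented-out F1 score needs f1_score imported from sklearn.metrics. The sketch below shows the three metrics side by side on made-up labels and probabilities standing in for solution.Survived, predict, and predict_prob.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up labels and scores for illustration only
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
y_prob = np.array([0.2, 0.9, 0.4, 0.1, 0.7, 0.6])

manual_acc = float(np.sum(y_pred == y_true) / y_true.shape[0])
acc = accuracy_score(y_true, y_pred)   # same quantity as manual_acc
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)    # uses probabilities, not hard labels
print(manual_acc, acc, f1, auc)
```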
