## Titanic Shipwreck

### Objective: Use Machine Learning to create a model that can predict which passengers survived the Titanic shipwreck.
### This is a ML competition running on Kaggle. For more info: https://www.kaggle.com/c/titanic
### Huge thanks to Aurelien Geron, whom I consider my mentor. I have used many ideas from his teachings.

#### Loading necessary data files

In [1]:
import pandas as pd
import numpy as np

df_train = pd.read_csv('titanic_training_data')
df_test = pd.read_csv('titanic_test_data')

#### We'll make a copy of training DF so that we don't accidentally make changes to original DF

In [2]:
train_df = df_train.copy()
test_df = df_test

#### Let's explore the data

In [3]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [5]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### A few points to notice:
#### 1. The 'PassengerId' column simply mirrors the index (index + 1), so we can either drop this column or set it as the index.
#### 2. There are missing values in the 'Age' (20%), 'Cabin' (77%) and 'Embarked' (0.2%) columns. With so many missing values, we will need to decide how to deal with these columns. Will they provide any value? Which is the better option: dropping the columns with missing values or replacing the missing values?
#### 3. There are 4 numerical columns ('Age', 'SibSp', 'Parch', 'Fare') and 6 categorical columns ('Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'). We will make pipelines to deal with them.
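
#### The missing-value percentages can be computed directly instead of being read off the `info()` output. The sketch below uses a tiny made-up frame (`toy_df`, a hypothetical stand-in for `train_df`) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for train_df (values are illustrative only)
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values
missing_pct = (toy_df.isna().mean() * 100).round(1)
print(missing_pct)
```

#### Applied to `train_df`, the same expression should reproduce the figures implied by the `info()` counts; for example, 714 non-null ages out of 891 rows means about 20% are missing.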

#### But before that, let's quickly check which features are important. We drop the 'PassengerId', 'Name', 'Ticket' and 'Cabin' columns, convert the categorical columns to numbers and fill in the missing values.

In [6]:
train_df_feat_imp = train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# Encode the categorical columns as integers
train_df_feat_imp['Sex'] = train_df_feat_imp['Sex'].map({'female': 0, 'male': 1})
train_df_feat_imp['Embarked'] = train_df_feat_imp['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Fill missing ages with the median; missing ports become 0, the code used for 'S' above
train_df_feat_imp['Age'] = train_df_feat_imp['Age'].fillna(train_df_feat_imp['Age'].median())
train_df_feat_imp['Embarked'] = train_df_feat_imp['Embarked'].fillna(0)

In [7]:
train_df_feat_imp.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,0.0
1,1,1,0,38.0,1,0,71.2833,1.0
2,1,3,0,26.0,0,0,7.925,0.0
3,1,1,0,35.0,1,0,53.1,0.0
4,0,3,1,35.0,0,0,8.05,0.0


In [8]:
X_train = train_df_feat_imp.drop(['Survived'], axis=1)
y_train = train_df_feat_imp['Survived']

In [9]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(X_train, y_train)

for name, score in zip(X_train.columns, rnd_clf.feature_importances_):
    print(name, round(score, 2))

Pclass 0.09
Sex 0.26
Age 0.26
SibSp 0.05
Parch 0.04
Fare 0.27
Embarked 0.03


#### It seems that the most important features are 'Sex', 'Age' and 'Fare'. We can test different ML models with these 3 attributes and check how good the results are.

#### Going back to our dataset from before, we will now split the dataset into training and testing sets.

In [10]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### Since we are keeping only 3 attributes, namely 'Age', 'Sex' and 'Fare', we will drop the rest.

In [11]:
train_df = df_train.copy()

In [12]:
train_df['Sex'] = train_df['Sex'].map({'female': 0, 'male': 1})
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())

In [13]:
X_3_features = train_df[['Sex', 'Age', 'Fare']]
y_3_features = train_df['Survived']

In [14]:
# Hold out the last 191 rows for testing (the rows are not shuffled)
X_train_3_features, X_test_3_features = X_3_features[:700], X_3_features[700:]
y_train_3_features, y_test_3_features = y_3_features[:700], y_3_features[700:]
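
#### Slicing by position keeps the rows in file order. An alternative, shown here only as a sketch on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins, not the Titanic frame), is scikit-learn's `train_test_split`, which shuffles the rows and can keep the class ratio the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_3_features / y_3_features (not the real Titanic data)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'Sex': rng.integers(0, 2, size=100),
    'Age': rng.uniform(1, 80, size=100),
    'Fare': rng.uniform(5, 100, size=100),
})
y_demo = pd.Series([0] * 62 + [1] * 38, name='Survived')

# Shuffle before splitting and keep the class ratio roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```

#### The scores reported in this notebook come from the positional split, so switching to a shuffled split would change them.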

In [15]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,0.647587,29.361582,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,0.47799,13.019697,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,0.0,22.0,0.0,0.0,7.9104
50%,446.0,0.0,3.0,1.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,1.0,35.0,1.0,0.0,31.0
max,891.0,1.0,3.0,1.0,80.0,8.0,6.0,512.3292


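#### Only 38% of the passengers in the training data survived, so the classes are moderately imbalanced. Accuracy remains a reasonable evaluation metric, but always predicting 'did not survive' would already score about 62%, so we will also report precision, recall and the F1 score.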
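
#### As a reference point for the accuracy figures that follow, a majority-class baseline can be computed with scikit-learn's `DummyClassifier`. This is a minimal sketch: `y_demo` is a made-up label vector that only mirrors the class balance reported by `describe()` (342 survivors out of 891 rows).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class balance as the training data: 342 survivors out of 891
y_demo = np.array([1] * 342 + [0] * 549)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class (here: 0, did not survive)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(round(baseline_acc, 3))
```

#### Any model worth keeping should beat this baseline by a clear margin.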

### Testing different ML Models

#### Let's start with Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

log_reg_3_features = LogisticRegression()
log_reg_3_features.fit(X_train_3_features, y_train_3_features)

cross_val_score(log_reg_3_features, X_train_3_features, y_train_3_features, cv=3, scoring='accuracy')

array([0.79487179, 0.78540773, 0.75536481])

In [17]:
y_pred_3_features = cross_val_predict(log_reg_3_features, X_train_3_features, y_train_3_features, cv=3)
confusion_matrix(y_train_3_features, y_pred_3_features)

array([[358,  71],
       [ 84, 187]], dtype=int64)

In [18]:
print('accuracy ', accuracy_score(y_train_3_features, y_pred_3_features))
print('precision ', precision_score(y_train_3_features, y_pred_3_features))
print('recall ', recall_score(y_train_3_features, y_pred_3_features))
print('f1 score ', f1_score(y_train_3_features, y_pred_3_features))

accuracy  0.7785714285714286
precision  0.7248062015503876
recall  0.6900369003690037
f1 score  0.7069943289224953
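
#### The four metrics above follow directly from the confusion matrix. The sketch below recomputes them by hand from the matrix printed earlier, assuming scikit-learn's layout (rows are true classes, columns are predictions).

```python
import numpy as np

# Confusion matrix from the cell above: rows are true classes, columns are predictions
cm = np.array([[358, 71],
               [84, 187]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # correct predictions / all predictions
precision = tp / (tp + fp)            # predicted survivors who really survived
recall = tp / (tp + fn)               # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```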


#### We are achieving an accuracy of about 78%. Now let's see how other ML models perform.

In [19]:
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

voting_clf = VotingClassifier(estimators= [('log_clf', LogisticRegression(random_state=42)),
                                           ('sgd_clf', SGDClassifier(random_state=42)),
                                           ('svc_clf', SVC(random_state=42)), 
                                           ('tree_clf', DecisionTreeClassifier(max_depth=2, random_state=42)),
                                           ('rnd_clf', RandomForestClassifier(random_state=42))
                                          ], voting = 'hard')

voting_clf.fit(X_train_3_features, y_train_3_features)

for name, clf in voting_clf.named_estimators_.items():
    print(name, '=', round(clf.score(X_test_3_features, y_test_3_features), 2))

log_clf = 0.8
sgd_clf = 0.72
svc_clf = 0.71
tree_clf = 0.79
rnd_clf = 0.81


#### It seems our best bet with the 3 selected features is the Random Forest Classifier. Let's check the performance of voting_clf.

In [20]:
round(voting_clf.score(X_test_3_features, y_test_3_features), 2)

0.79

#### It looks like voting_clf is not performing as well as we hoped. How about other methods, like bagging?

In [21]:
from sklearn.ensemble import BaggingClassifier

bag_clf = BaggingClassifier(RandomForestClassifier(), n_estimators=500, max_samples=100, 
                            n_jobs=-1, random_state=42, bootstrap=True)

bag_clf.fit(X_train_3_features, y_train_3_features)
round(bag_clf.score(X_test_3_features, y_test_3_features), 2)

0.81

#### The highest score we have achieved is still 81%, even with the bagging method. What if we use a different classifier inside the bagging ensemble? Let's try a Decision Tree and see if we get a better result.

In [22]:
from sklearn.ensemble import BaggingClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, 
                            n_jobs=-1, random_state=42, bootstrap=True)

bag_clf.fit(X_train_3_features, y_train_3_features)
round(bag_clf.score(X_test_3_features, y_test_3_features), 2)

0.82

#### The bagging method with a Decision Tree classifier gives a slightly better result. Does the AdaBoost Classifier perform any better?

In [23]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=30,
                            learning_rate=0.5, random_state=42)

ada_clf.fit(X_train_3_features, y_train_3_features)
round(ada_clf.score(X_test_3_features, y_test_3_features), 2)

0.81

#### So far our best option is the Bagging Classifier with Decision Trees. It's giving us a score of 82%. So we select the Bagging Classifier as our preferred model and see if we can improve the score any further.

In [24]:
cross_val_score(bag_clf, X_train_3_features, y_train_3_features, cv=3, scoring='accuracy')

array([0.78632479, 0.80257511, 0.75107296])

In [25]:
y_pred_bag = cross_val_predict(bag_clf, X_train_3_features, y_train_3_features, cv=3)

confusion_matrix(y_train_3_features, y_pred_bag)

array([[358,  71],
       [ 83, 188]], dtype=int64)

In [26]:
print('Accuracy = ', round(accuracy_score(y_train_3_features, y_pred_bag), 2))
print('Precision = ', round(precision_score(y_train_3_features, y_pred_bag), 2))
print('Recall = ', round(recall_score(y_train_3_features, y_pred_bag), 2))
print('f1 score = ', round(f1_score(y_train_3_features, y_pred_bag), 2))

Accuracy =  0.78
Precision =  0.73
Recall =  0.69
f1 score =  0.71


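#### The cross-validated accuracy (78%) is lower than the 82% measured on the hold-out set. This gap is not by itself evidence of under-fitting: each cross-validation fold trains on only two thirds of the 700 training rows, and a hold-out set of 191 rows is small enough for its score to vary by a few points. Let's change our approach and start afresh: use more features, create pipelines and apply different ML models to see if the results are any better.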

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler', StandardScaler())                    
                    ])

cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('one hot', OneHotEncoder())
                    ])

In [28]:
from sklearn.compose import ColumnTransformer

num_attribs = ['Age', 'SibSp', 'Parch', 'Fare']
cat_attribs = ['Pclass', 'Sex', 'Embarked']

final_pipe = ColumnTransformer([('num', num_pipe, num_attribs),
                                ('cat', cat_pipe, cat_attribs)
                               ])

In [29]:
train_df_pipe = df_train.copy()

X_train_pipe = final_pipe.fit_transform(train_df_pipe[num_attribs + cat_attribs])
y_train_pipe = train_df_pipe['Survived']
X_test = final_pipe.transform(test_df[num_attribs + cat_attribs])

X_train_pipe

array([[-0.56573646,  0.43279337, -0.47367361, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.66386103,  0.43279337, -0.47367361, ...,  1.        ,
         0.        ,  0.        ],
       [-0.25833709, -0.4745452 , -0.47367361, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [-0.1046374 ,  0.43279337,  2.00893337, ...,  0.        ,
         0.        ,  1.        ],
       [-0.25833709, -0.4745452 , -0.47367361, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.20276197, -0.4745452 , -0.47367361, ...,  0.        ,
         1.        ,  0.        ]])
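
#### One caveat: here the imputer and scaler are fitted on the whole training set before any cross-validation, so each validation fold has already influenced the preprocessing. A common alternative is to chain the preprocessing and the model in a single `Pipeline`, so that every fold refits both. The sketch below illustrates this on synthetic data; `demo_df`, `y_demo`, `preprocess` and `model` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic frame (made-up values, for illustration only)
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=n),
    'Fare': rng.uniform(5, 100, size=n),
    'Sex': rng.choice(['male', 'female'], size=n),
})
demo_df.loc[::10, 'Age'] = np.nan  # inject some missing ages

# Label depends on 'Sex', with about 20% of the rows flipped as noise
noise = rng.random(n) < 0.2
y_demo = ((demo_df['Sex'] == 'female').to_numpy() ^ noise).astype(int)

preprocess = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['Age', 'Fare']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('one_hot', OneHotEncoder(handle_unknown='ignore'))]), ['Sex']),
])

# Preprocessing and model in one estimator: each CV fold refits the imputer and scaler
model = Pipeline([('prep', preprocess), ('clf', LogisticRegression())])
scores = cross_val_score(model, demo_df, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```

#### With simple median imputation and scaling the effect on the scores is usually small, but the combined pipeline keeps the evaluation clean and can be reused as-is on the test frame.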

#### Final pipeline is ready. Let's start testing different ML models

In [30]:
log_reg = LogisticRegression()
log_reg.fit(X_train_pipe, y_train_pipe)

log_reg_score = cross_val_score(log_reg, X_train_pipe, y_train_pipe, cv=10)
log_reg_score.mean()

0.7991260923845193

In [31]:
voting_clf = VotingClassifier(estimators= [('log_clf', LogisticRegression(random_state=42)),
                                           ('sgd_clf', SGDClassifier(random_state=42)),
                                           ('svc_clf', SVC(random_state=42)), 
                                           ('tree_clf', DecisionTreeClassifier(max_depth=2, random_state=42)),
                                           ('rnd_clf', RandomForestClassifier(random_state=42))
                                          ], voting = 'hard')

voting_clf.fit(X_train_pipe, y_train_pipe)

for name, clf in voting_clf.named_estimators_.items():
    print(name, '=', round(clf.score(X_train_pipe, y_train_pipe), 2))

log_clf = 0.81
sgd_clf = 0.81
svc_clf = 0.84
tree_clf = 0.8
rnd_clf = 0.98


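#### 98% with the Random Forest Classifier? These scores were computed on the same data the models were trained on, so such a high value may simply reflect over-fitting. Let's check its performance on its own, with cross-validation.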

In [32]:
rnd_clf_pipe = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf_pipe.fit(X_train_pipe, y_train_pipe)

rnd_clf_score = cross_val_score(rnd_clf_pipe, X_train_pipe, y_train_pipe, cv=10)
rnd_clf_score.mean()

0.8036828963795255

#### An 80% cross-validated score seems more realistic. Let's find the scores of the other models independently, rather than as part of an ensemble.
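
#### The gap between a score measured on the training rows and a cross-validated score can be reproduced on synthetic data. This is only a sketch: `X_demo` and `y_demo` come from `make_classification`, not from the Titanic set, and the exact numbers will differ from the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic noisy data; flip_y assigns a random class to about 20% of the rows
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4,
                                     flip_y=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Scored on rows the forest has already seen during training
train_acc = forest.score(X_demo, y_demo)
# Scored on held-out folds the forest has not seen
cv_acc = cross_val_score(forest, X_demo, y_demo, cv=5).mean()
print(round(train_acc, 2), round(cv_acc, 2))
```

#### Fully grown trees can memorise noisy labels, which is why the training score is the more optimistic of the two.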

In [33]:
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train_pipe, y_train_pipe)
sgd_clf_score = cross_val_score(sgd_clf, X_train_pipe, y_train_pipe, cv=10)
sgd_clf_score.mean()

0.7856554307116105

In [36]:
y_pred_sgd_clf = cross_val_predict(sgd_clf, X_train_pipe, y_train_pipe, cv=3)

# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
cm_sgd_clf = confusion_matrix(y_train_pipe, y_pred_sgd_clf)
print(cm_sgd_clf)
print('accuracy =', round(accuracy_score(y_train_pipe, y_pred_sgd_clf), 2))

[[456  93]
 [117 225]]
accuracy = 0.76


#### The SGD Classifier's scores are lower than those of the Random Forest Classifier.

In [37]:
svc_clf = SVC(gamma='auto', random_state=42)
svc_clf.fit(X_train_pipe, y_train_pipe)
svc_clf_score = cross_val_score(svc_clf, X_train_pipe, y_train_pipe, cv=10)
svc_clf_score.mean()

0.8249313358302123

In [38]:
y_pred_svc_clf = cross_val_predict(svc_clf, X_train_pipe, y_train_pipe, cv=3)

# confusion_matrix expects (y_true, y_pred): rows are true classes, columns are predictions
cm_svc_clf = confusion_matrix(y_train_pipe, y_pred_svc_clf)
print(cm_svc_clf)
print('accuracy =', round(accuracy_score(y_train_pipe, y_pred_svc_clf), 2))

[[491  58]
 [ 98 244]]
accuracy = 0.82


#### Support Vector Classifier seems like a better model with 82% accuracy and only 58 False Positives.

In [39]:
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_train_pipe, y_train_pipe)
tree_clf_score = cross_val_score(tree_clf, X_train_pipe, y_train_pipe, cv=10)
tree_clf_score.mean()

0.7688764044943821

In [40]:
y_pred_tree_clf = cross_val_predict(tree_clf, X_train_pipe, y_train_pipe, cv=3)

cm_tree_clf = confusion_matrix(y_train_pipe, y_pred_tree_clf)
print(cm_tree_clf)
print('accuracy =', round(accuracy_score(y_pred_tree_clf, y_train_pipe), 2))

[[488  61]
 [136 206]]
accuracy = 0.78


#### After testing various ML models, our choice is the Support Vector Classifier, which has the highest score and the fewest False Positives. We will use this model to predict on the test data.

In [41]:
y_pred_svc_clf = svc_clf.predict(X_test)

#### These predictions can now be saved to a CSV file and uploaded to the Kaggle platform. Later on we will try deep learning models to see if we can achieve even better results.
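
#### A minimal sketch of building the submission file, assuming the Kaggle format of two columns, 'PassengerId' and 'Survived'. The arrays below are made-up stand-ins; in this notebook they would be `test_df['PassengerId']` and `y_pred_svc_clf`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_df['PassengerId'] and y_pred_svc_clf
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({'PassengerId': passenger_ids, 'Survived': predictions})

# With a file name, e.g. submission.to_csv('submission.csv', index=False), this writes to disk;
# without one, to_csv returns the CSV text, which is convenient for a quick check
csv_text = submission.to_csv(index=False)
print(csv_text)
```

#### Passing `index=False` matters here: otherwise pandas would write the row index as an extra, unnamed first column.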