## Kaggle Competition: Titanic Machine Learning

1. Load Data, Combine train & test set
2. Data Structure
3. Data Analysis
4. Feature Engineering
5. Preprocessing before modelling
6. Modelling & Evaluation
7. Models Summary
8. Kaggle Submission

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

### 1. Load Data, Combine train & test set

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
train.shape

(891, 12)

In [4]:
test.shape

(418, 11)

In [5]:
all_data = pd.concat([train, test],axis=0)

In [6]:
all_data.shape

(1309, 12)

### 2. Data Structure

In [7]:
train.columns.values

array(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

In [8]:
test.columns.values

array(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype=object)

**'Survived' is the label**

In [9]:
all_data.to_csv('check1.csv',index=False)

In [10]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB


**Try to fix some missing values in 'Age', 'Fare' and 'Embarked'. <br>Drop 'Cabin' due to high missing values.**

In [11]:
all_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,1309.0,891.0,1309.0,1046.0,1309.0,1309.0,1308.0
mean,655.0,0.383838,2.294882,29.881138,0.498854,0.385027,33.295479
std,378.020061,0.486592,0.837836,14.413493,1.041658,0.86556,51.758668
min,1.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,328.0,0.0,2.0,21.0,0.0,0.0,7.8958
50%,655.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,982.0,1.0,3.0,39.0,1.0,0.0,31.275
max,1309.0,1.0,3.0,80.0,8.0,9.0,512.3292


### 3. Data Analysis

In [12]:
pd.crosstab(train.Sex, train.Survived, normalize='index')

Survived,0,1
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
female,0.257962,0.742038
male,0.811092,0.188908


**Most women survived, sadly most men died, only one third of the survivals are men.**

In [13]:
pd.crosstab(train.Pclass, train.Survived, normalize='index')

Survived,0,1
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.37037,0.62963
2,0.527174,0.472826
3,0.757637,0.242363


**Class 1 has better survival rate**

In [14]:
group_fare = train.groupby('Pclass')
group_fare['Fare'].sum() / group_fare['Fare'].count()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

**Class 1 is much more expensive than Class 2 and Class 3**

In [15]:
train['Pclass_Sex'] =  train['Pclass'].astype(str) +'_' + train['Sex'].astype(str)

In [16]:
train['Pclass_Sex'].value_counts()

3_male      347
3_female    144
1_male      122
2_male      108
1_female     94
2_female     76
Name: Pclass_Sex, dtype: int64

In [17]:
pd.crosstab(train.Pclass_Sex, train.Survived, normalize='index')

Survived,0,1
Pclass_Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
1_female,0.031915,0.968085
1_male,0.631148,0.368852
2_female,0.078947,0.921053
2_male,0.842593,0.157407
3_female,0.5,0.5
3_male,0.864553,0.135447


**97% women in Class 1 survived, 92% women in Class 2 survived, this can be a strong indicator**

In [18]:
train['Family_Size'] = train['SibSp'] + train['Parch'] + 1
pd.crosstab(train.Family_Size, train.Survived, normalize='index')

Survived,0,1
Family_Size,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.696462,0.303538
2,0.447205,0.552795
3,0.421569,0.578431
4,0.275862,0.724138
5,0.8,0.2
6,0.863636,0.136364
7,0.666667,0.333333
8,1.0,0.0
11,1.0,0.0


**Family Size of 4 has the highest survival rate.**

### 4. Feature Engineering

##### 4.1 Extract 'Title' from 'Name' & encode

In [19]:
all_data['Title'] = train.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
all_data['Title'].value_counts()

Mr              745
Miss            283
Mrs             183
Master           63
Dr               10
Rev               9
Col               2
Don               2
Mme               2
Major             2
Mlle              2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Jonkheer          1
Name: Title, dtype: int64

In [20]:
 other_titles = [title for title in all_data["Title"]
                if title not in ["Mr", "Miss", "Mrs", "Master"]]

In [21]:
all_data['Title'] = all_data['Title'].replace(other_titles, 'Other')
all_data['Title'].value_counts()

Mr        745
Miss      283
Mrs       183
Master     63
Other      35
Name: Title, dtype: int64

In [22]:
all_data['en_Title'] = all_data['Title'].map({"Mr":0, "Miss":1, "Mrs" : 2 , "Master":3, "Other":4})

In [23]:
all_data['en_Title'].value_counts()

0    745
1    283
2    183
3     63
4     35
Name: en_Title, dtype: int64

##### 4.2 Encode 'Sex'

In [24]:
all_data['Sex'].value_counts()

male      843
female    466
Name: Sex, dtype: int64

In [25]:
all_data['en_Sex'] = all_data['Sex'].map({"female":0, "male":1})

In [26]:
all_data['en_Sex'].value_counts()

1    843
0    466
Name: en_Sex, dtype: int64

##### 4.3 'Pclass'+'Sex': create column & encode

In [27]:
all_data['Pclass_Sex'] =  all_data['Pclass'].astype(str) + all_data['en_Sex'].astype(str)
all_data['Pclass_Sex'] = all_data['Pclass_Sex'].astype(int)
all_data['Pclass_Sex'].value_counts()

31    493
30    216
11    179
21    171
10    144
20    106
Name: Pclass_Sex, dtype: int64

##### 4.4 'Age': Fill missing values, binning, encode

In [28]:
all_data['Age'].describe()

count    1046.000000
mean       29.881138
std        14.413493
min         0.170000
25%        21.000000
50%        28.000000
75%        39.000000
max        80.000000
Name: Age, dtype: float64

In [29]:
# fill missing age by ticket class median
all_data['Age'] = all_data['Age'].fillna(train['Age'].median())
all_data['Age'].describe()

# What to do with missing age?? more research

count    1309.000000
mean       29.503186
std        12.905241
min         0.170000
25%        22.000000
50%        28.000000
75%        35.000000
max        80.000000
Name: Age, dtype: float64

In [30]:
all_data['Age'].isna().sum()

0

In [31]:
# binning age into age group
all_data["en_Age"] =  pd.cut(all_data["Age"], bins=[0,5,12,21,65,100], labels=[0,1,2,3,4]).astype("int64")
all_data["en_Age"].value_counts()

3    1009
2     196
0      56
1      38
4      10
Name: en_Age, dtype: int64

##### 4.5 'Family Size' = 'SibSp' + 'Parch' + 1

In [32]:
all_data['Family_Size'] = all_data['SibSp'] + all_data['Parch'] + 1
all_data['Family_Size'].value_counts()

1     790
2     235
3     159
4      43
6      25
5      22
7      16
11     11
8       8
Name: Family_Size, dtype: int64

##### 4.6 'Fare': Fill missing values, rounding?

In [33]:
all_data['Fare'].isna().sum()

1

In [34]:
all_data['Fare'] = all_data['Fare'].fillna(all_data['Fare'].median())

In [35]:
all_data['Fare'].isna().sum()

0

In [36]:
all_data['Fare'] = round(all_data['Fare'].astype(int))

In [37]:
all_data['Fare'].head()

0     7
1    71
2     7
3    53
4     8
Name: Fare, dtype: int32

##### 4.7 'Ticket': categorize into numeric or non-numeric, encode

In [38]:
all_data['Ticket_is_numeric'] = all_data.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
all_data['Ticket_is_numeric'].value_counts()

1    957
0    352
Name: Ticket_is_numeric, dtype: int64

##### 4.8 'Embarked': Fill missing values, encode

In [39]:
all_data['Embarked'].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

In [40]:
# the most frequent embarked station
all_data['Embarked'].mode()

0    S
Name: Embarked, dtype: object

In [41]:
all_data['Embarked'] = all_data['Embarked'].fillna('S')

In [42]:
all_data['Embarked'].value_counts()

S    916
C    270
Q    123
Name: Embarked, dtype: int64

In [43]:
all_data['en_Embarked'] = all_data['Embarked'].map({"S":1, "C":2, "Q":3})
all_data['en_Embarked'].value_counts()

1    916
2    270
3    123
Name: en_Embarked, dtype: int64

##### 4.9 Feature Selections

In [44]:
all_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title', 'en_Title',
       'en_Sex', 'Pclass_Sex', 'en_Age', 'Family_Size', 'Ticket_is_numeric',
       'en_Embarked'],
      dtype='object')

In [45]:
final_data = all_data[['Survived', 'Fare', 'en_Title', 'en_Sex', 'Pclass_Sex', 'Family_Size', 'Ticket_is_numeric', 'en_Embarked', 'en_Age']]
final_data.head()

Unnamed: 0,Survived,Fare,en_Title,en_Sex,Pclass_Sex,Family_Size,Ticket_is_numeric,en_Embarked,en_Age
0,0.0,7,0,1,31,2,0,1,3
1,1.0,71,2,0,10,2,0,2,3
2,1.0,7,1,0,30,1,0,1,3
3,1.0,53,2,0,10,2,1,1,3
4,0.0,8,0,1,31,1,1,1,3


### 5. Preprocessing before modelling

In [46]:
df_train=final_data.iloc[:891,:]
df_test=final_data.iloc[891:,:]

In [47]:
df_train.shape

(891, 9)

In [48]:
df_train.head()

Unnamed: 0,Survived,Fare,en_Title,en_Sex,Pclass_Sex,Family_Size,Ticket_is_numeric,en_Embarked,en_Age
0,0.0,7,0,1,31,2,0,1,3
1,1.0,71,2,0,10,2,0,2,3
2,1.0,7,1,0,30,1,0,1,3
3,1.0,53,2,0,10,2,1,1,3
4,0.0,8,0,1,31,1,1,1,3


In [49]:
df_train.tail()

Unnamed: 0,Survived,Fare,en_Title,en_Sex,Pclass_Sex,Family_Size,Ticket_is_numeric,en_Embarked,en_Age
886,0.0,13,4,1,21,1,1,1,3
887,1.0,30,1,0,10,1,1,1,2
888,0.0,23,1,0,30,4,0,1,3
889,1.0,30,0,1,11,1,1,2,3
890,0.0,7,0,1,31,1,1,3,3


In [50]:
df_test.head()

Unnamed: 0,Survived,Fare,en_Title,en_Sex,Pclass_Sex,Family_Size,Ticket_is_numeric,en_Embarked,en_Age
0,,7,0,1,31,1,1,3,3
1,,7,2,0,30,2,1,1,3
2,,9,1,1,21,1,1,3,3
3,,8,2,1,31,1,1,1,3
4,,12,0,0,30,3,1,1,3


In [51]:
df_test.drop(['Survived'],axis=1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test.drop(['Survived'],axis=1,inplace=True)


In [52]:
df_test.shape

(418, 8)

### 6. Modelling & Evaluation

In [53]:
X = df_train[['Fare', 'en_Title', 'en_Sex', 'Pclass_Sex', 'Family_Size', 'Ticket_is_numeric', 'en_Embarked', 'en_Age']]
y = df_train['Survived']

In [54]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [55]:
print(len(X_train))
print(len(X_test))

623
268


##### 6.1 Logistic Regression (Kaggle: 72%)

In [56]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(X_train, y_train)

# Use logreg to predict instances from the test set and store it
y_pred_logreg = logreg.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [57]:
from sklearn.metrics import confusion_matrix

# Print the confusion matrix of the logreg model
confusion_matrix(y_test, y_pred_logreg)

array([[136,  18],
       [ 37,  77]], dtype=int64)

In [58]:
from sklearn.metrics import roc_auc_score

# AUC score for tpot model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.8597


In [59]:
from sklearn.metrics import accuracy_score
acc_logreg_score = round(accuracy_score(y_test, y_pred_logreg)*100, 2)
print(f'\nLogistic Regression Accuracy score: {acc_logreg_score:.4f}')


Logistic Regression Accuracy score: 79.4800


In [60]:
y_pred_logreg_sub = logreg.predict(df_test).astype(int)

##### 6.2 Decision Tree (Kaggle: 73%)

In [61]:
# Import Decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Create decision tree
dt = DecisionTreeClassifier(random_state = 1)
cv_score = cross_val_score(dt,X_train,y_train,cv=5)
print(cv_score)
print(cv_score.mean())

[0.744     0.776     0.736     0.7983871 0.7983871]
0.7705548387096774


In [62]:
dt.fit(X_train, y_train)
y_pred_tree = dt.predict(X_test)
acc_tree_score = round(accuracy_score(y_test, y_pred_tree)*100,2)
print(f'\nDecision Tree Accuracy score: {acc_tree_score:.4f}')


Decision Tree Accuracy score: 79.8500


In [63]:
y_pred_tree_sub = dt.predict(df_test).astype(int)

##### 6.3 Compare Decision Tree & Logistic Regression

In [64]:
# Create the classification report for both models
from sklearn.metrics import classification_report
class_rep_tree = classification_report(y_test, y_pred_tree)
class_rep_log = classification_report(y_test, y_pred_logreg)

print("Decision Tree: \n", class_rep_tree)
print("Logistic Regression: \n", class_rep_log)

Decision Tree: 
               precision    recall  f1-score   support

         0.0       0.77      0.92      0.84       154
         1.0       0.85      0.64      0.73       114

    accuracy                           0.80       268
   macro avg       0.81      0.78      0.78       268
weighted avg       0.81      0.80      0.79       268

Logistic Regression: 
               precision    recall  f1-score   support

         0.0       0.79      0.88      0.83       154
         1.0       0.81      0.68      0.74       114

    accuracy                           0.79       268
   macro avg       0.80      0.78      0.78       268
weighted avg       0.80      0.79      0.79       268



##### 6.4 Support Vector Machine (Kaggle: 77%)

In [65]:
from sklearn.svm import SVC

# Define parameters:
params = {
    "kernel": "linear",
    "C": 1, 
    "gamma": 0.0001, 
    "degree": 3,
    "random_state": 123,
}

# Create a svm.SVC with the parameters above
svm = SVC(**params)

# Train the SVM classifer on the train set
svm = svm.fit(X_train, y_train)

# Predict the outcomes on the test set
y_pred_svm = svm.predict(X_test)

# Evaluate accuracy
acc_svm_score = round(accuracy_score(y_test, y_pred_svm)*100,2)
print(f'\nSupport Vector Machine Accuracy score: {acc_svm_score:.4f}')


Support Vector Machine Accuracy score: 77.2400


In [66]:
y_pred_svm_sub = svm.predict(df_test).astype(int)

##### 6.5 Random Forest (Kaggle: 73%)

In [67]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=25)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluate accuracy
acc_rf_score = round(accuracy_score(y_test, y_pred_rf)*100,2)
print(f'\nRandom Forest Accuracy score: {acc_rf_score:.4f}')


Random Forest Accuracy score: 79.1000


In [68]:
y_pred_rf_sub = rf.predict(df_test).astype(int)

##### 6.5 XG Boost (Kaggle: 74%)

In [82]:
import xgboost as xgb

# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)

# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)

# Predict the labels of the test set: preds
y_pred_xgb = xg_cl.predict(X_test)

# Compute the accuracy: accuracy
acc_xgb_score = round(accuracy_score(y_test, y_pred_xgb)*100,2)
# acc_xgb_score = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (acc_xgb_score))

accuracy: 80.970000


In [77]:
y_pred_xgb_sub = xg_cl.predict(df_test).astype(int)

### 7. Models Summary 

In [85]:
models = ['Logistic Regression', 'Decision Tree', 'Support Vector Machines', 'Random Forest', 'XG Boost']
scores = np.array([acc_logreg_score, acc_tree_score, acc_svm_score, acc_rf_score, acc_xgb_score])
kaggle = np.array([72.25, 72.97, 77.03, 72.73, 74.16])
difference = kaggle - scores

models_df = pd.DataFrame({
    'Model': models,
    'Accuracy Score': scores,
    'Kaggle Score': kaggle,
    'Difference': difference,
})

models_df.sort_values(by='Kaggle Score', ascending=False)

Unnamed: 0,Model,Accuracy Score,Kaggle Score,Difference
2,Support Vector Machines,77.24,77.03,-0.21
4,XG Boost,80.97,74.16,-6.81
1,Decision Tree,79.85,72.97,-6.88
3,Random Forest,79.1,72.73,-6.37
0,Logistic Regression,79.48,72.25,-7.23


### 8. Kaggle Submission

In [70]:
## Update Sample Submission file: Logistic Regression
pred = pd.DataFrame(y_pred_logreg_sub)
sub_df = pd.read_csv('gender_submission.csv')
datasets = pd.concat([sub_df['PassengerId'],pred],axis=1)
datasets.columns = ['PassengerId','Survived']
datasets.to_csv('gender_submission_logreg.csv',index=False)

In [71]:
## Update Sample Submission file: Decision Tree
pred = pd.DataFrame(y_pred_tree_sub)
sub_df = pd.read_csv('gender_submission.csv')
datasets = pd.concat([sub_df['PassengerId'],pred],axis=1)
datasets.columns = ['PassengerId','Survived']
datasets.to_csv('gender_submission_tree.csv',index=False)

In [72]:
## Update Sample Submission file: Support Vector Machine
pred = pd.DataFrame(y_pred_svm_sub)
sub_df = pd.read_csv('gender_submission.csv')
datasets = pd.concat([sub_df['PassengerId'],pred],axis=1)
datasets.columns = ['PassengerId','Survived']
datasets.to_csv('gender_submission_sv.csv',index=False)

In [73]:
## Update Sample Submission file: Random Forest
pred = pd.DataFrame(y_pred_rf_sub)
sub_df = pd.read_csv('gender_submission.csv')
datasets = pd.concat([sub_df['PassengerId'],pred],axis=1)
datasets.columns = ['PassengerId','Survived']
datasets.to_csv('gender_submission_rf.csv',index=False)

In [78]:
## Update Sample Submission file: XGBoost
pred = pd.DataFrame(y_pred_xgb_sub)
sub_df = pd.read_csv('gender_submission.csv')
datasets = pd.concat([sub_df['PassengerId'],pred],axis=1)
datasets.columns = ['PassengerId','Survived']
datasets.to_csv('gender_submission_xgb.csv',index=False)

### 8. Questions

- How to improve prediction result?