## Kaggle competition: Titanic - Machine Learning from Disaster

### Michael Leung


In [None]:
# import tools for analysis
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:
# read the csv files
train_df = pd.read_csv('data/train.csv', index_col=0)
test_df = pd.read_csv('data/test.csv', index_col=0)

Let's see what's inside the two data frames.

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
# see the size of train dataset
train_df.shape

In [None]:
# see the data type of each column
train_df.info()

In [None]:
# see how many unique values in the Sex column
print(train_df['Sex'].unique())
print(test_df['Sex'].unique())

The `Sex` column can be turned into a binary column (1 for male, 0 for female).

In [None]:
train_df['Sex'] = np.where(train_df['Sex'] == 'male', 1,0)
test_df['Sex'] = np.where(test_df['Sex'] == 'male', 1,0)
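
One thing to keep in mind with `np.where` is that any value other than `'male'` (a typo, a different capitalization, a missing value) silently becomes 0. As a hedged alternative, the sketch below uses an explicit mapping so that unexpected values turn into `NaN` and can be counted. The toy series and the `sex_map` name are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

# toy data standing in for the Sex column (illustrative values only)
sex = pd.Series(['male', 'female', 'female', 'Male'])

# explicit mapping: anything not listed becomes NaN instead of 0
sex_map = {'male': 1, 'female': 0}
encoded = sex.map(sex_map)

# count the values that could not be encoded
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```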

Let's see how many unique values `Pclass` has.

In [None]:
print(train_df['Pclass'].unique())
print(test_df['Pclass'].unique())

We can turn the `Pclass` column into dummy variables for modeling.

In [None]:
# turn Pclass into dummy variables
pclass_dum_train = pd.get_dummies(train_df['Pclass'])
pclass_dum_test = pd.get_dummies(test_df['Pclass'])

In [None]:
# concat the dummy variables to the original data frame
train_df = pd.concat([train_df,pclass_dum_train],axis=1)
test_df = pd.concat([test_df,pclass_dum_test],axis=1)

Let's see the unique values of the `Embarked` column.

In [None]:
print(train_df['Embarked'].unique())
print(test_df['Embarked'].unique())

In [None]:
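# turn NaN values into an 'unknown' category (N)
# assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions no longer apply to the original data frame
train_df['Embarked'] = train_df['Embarked'].fillna('N')
test_df['Embarked'] = test_df['Embarked'].fillna('N')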

As with `Pclass`, the `Embarked` column can also be turned into dummy variables.

In [None]:
# turn Embarked into dummy variables
embarked_dum_train = pd.get_dummies(train_df['Embarked'])
embarked_dum_test = pd.get_dummies(test_df['Embarked'])
# concat data frames
train_df = pd.concat([train_df,embarked_dum_train],axis=1)
test_df = pd.concat([test_df,embarked_dum_test],axis=1)
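
Because the dummy columns are built separately for each data frame, a category that appears in only one of them (such as the unknown port `N`) produces a column in one frame but not the other. The sketch below shows one possible way to line the columns up with `reindex`. It uses toy series rather than the real data, and the variable names are hypothetical.

```python
import pandas as pd

# toy data: the train series has an unknown port 'N', the test series does not
train_port = pd.Series(['S', 'C', 'N', 'Q'], name='Embarked')
test_port = pd.Series(['S', 'Q', 'S'], name='Embarked')

train_dum = pd.get_dummies(train_port)
test_dum = pd.get_dummies(test_port)

# give the test dummies the same columns, in the same order, as train;
# categories missing from test become all-zero columns
test_dum = test_dum.reindex(columns=train_dum.columns, fill_value=0)

print(list(train_dum.columns), list(test_dum.columns))
```

With matching columns, the same set of dummy columns can be dropped from both data frames later on.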

In [None]:
# let's see the unique values of ticket column
train_df['Ticket'].unique()

In [None]:
# let's see the unique values of cabin column
train_df['Cabin'].unique()

In [None]:
# see whether each column has missing values
train_df.isna().sum()

In [None]:
test_df.isna().sum()

In [None]:
# see the row with missing value on Fare
test_df[test_df['Fare'].isna()]

The Age and Cabin columns have missing values in both the train and test sets; I may try to fill those in during the EDA. The test set also has 1 missing value in the Fare column, which I will fill in with the average fare of the third class.

In [None]:
# fill the missing fare with the mean fare of third class passengers
test_df['Fare'] = test_df['Fare'].fillna(test_df.loc[test_df['Pclass']==3, 'Fare'].mean())
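
If more than one class had missing fares, the same idea could be applied to every class at once. The following is a minimal sketch on a toy data frame (the values are invented), using `groupby(...).transform('mean')` to fill each missing fare with the mean of its own class.

```python
import numpy as np
import pandas as pd

# toy data frame with one missing fare in third class (invented values)
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Fare': [80.0, 60.0, 8.0, np.nan, 10.0],
})

# mean fare of each class, broadcast back to every row
class_mean = df.groupby('Pclass')['Fare'].transform('mean')

# fill each missing fare with the mean of its own class
df['Fare'] = df['Fare'].fillna(class_mean)
print(df)
```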

### EDA

In [None]:
# see how many passenger survived
train_df.groupby('Survived')['Survived'].count()

Let's look at the relationship between each column and the Survived column, starting with the Pclass column.

In [None]:
# plot the survival rate of different classes
plt.subplots(3,1,figsize=(10,10))

plt.subplot(3,1,1)
# take the labels from the index of the counts so each label matches its slice
counts = train_df[train_df['Pclass']==1].groupby('Survived')['Survived'].count()
plt.pie(counts,labels=counts.index,autopct='%.2f')
plt.title('Survival percentage of first class')

plt.subplot(3,1,2)
counts = train_df[train_df['Pclass']==2].groupby('Survived')['Survived'].count()
plt.pie(counts,labels=counts.index,autopct='%.2f')
plt.title('Survival percentage of second class')

plt.subplot(3,1,3)
counts = train_df[train_df['Pclass']==3].groupby('Survived')['Survived'].count()
plt.pie(counts,labels=counts.index,autopct='%.2f')
plt.title('Survival percentage of third class')

plt.show()

It seems that the higher the class, the higher the survival rate in the dataset.
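
The pie charts can be complemented by a single table of rates. The sketch below uses `pd.crosstab` with `normalize='index'`, so each row (class) sums to 1. It runs on a small invented data frame, not on the real data.

```python
import pandas as pd

# toy data (invented): 2 first class, 2 second class, 4 third class passengers
df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})

# share of dead (0) and survived (1) passengers within each class
table = pd.crosstab(df['Pclass'], df['Survived'], normalize='index')
print(table)
```

The same call on the notebook's data frame would be `pd.crosstab(train_df['Pclass'], train_df['Survived'], normalize='index')`.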

Let's plot some pie charts to see the survival rate by gender.

In [None]:
plt.subplots(1,2)

plt.subplot(1,2, 1)
# take the labels from the index of the counts so each label matches its slice
counts = train_df[train_df['Sex']==1].groupby('Survived')['Survived'].count()
plt.pie(counts,labels=counts.index,autopct='%.2f')
plt.title('Survival percentage of male')

plt.subplot(1, 2, 2)
counts = train_df[train_df['Sex']==0].groupby('Survived')['Survived'].count()
plt.pie(counts,labels=counts.index,autopct='%.2f')
plt.title('Survival percentage of female')
plt.show()

Over 80% of the male passengers died, compared with around a quarter of the female passengers. Female passengers seem to have had a higher survival rate in the accident.

Let's plot a graph for Age

In [None]:
# plot two histograms to see if there is a difference between the ages of the survived and dead passengers
plt.subplots(1,2, figsize=(15,5))
plt.subplot(1,2,1)

sns.histplot(x=train_df[train_df['Survived']==1]['Age'],bins=80)
plt.axvline(train_df[train_df['Survived']==1]['Age'].mean(),color='red', label= 'average')
plt.title("Survived passengers' age")
plt.legend()

plt.subplot(1,2,2)
sns.histplot(x=train_df[train_df['Survived']==0]['Age'],bins=80)
plt.axvline(train_df[train_df['Survived']==0]['Age'].mean(),color='red',label= 'average')
plt.title("Dead passengers' age")
plt.legend()

plt.show()

The age distribution of the survived passengers is bimodal (the first peak is near age 0 and the second is around 25), while the age distribution of the dead passengers looks roughly normal. The average ages of the survived and dead passengers differ by only around 1 to 2 years.

Next, I will plot a graph for SibSp

In [None]:
plt.subplots(1,2, figsize=(15,5))
plt.subplot(1,2,1)

# count passengers per SibSp value; the index of the result holds the SibSp values
sibsp_survived = train_df[train_df['Survived']==1].groupby('SibSp')['SibSp'].count()
plt.bar(x=sibsp_survived.index, height=sibsp_survived)
plt.title("Survived passengers' number of siblings / spouses aboard the Titanic")


plt.subplot(1,2,2)
sibsp_dead = train_df[train_df['Survived']==0].groupby('SibSp')['SibSp'].count()
plt.bar(x=sibsp_dead.index, height=sibsp_dead)
plt.title("Dead passengers' number of siblings / spouses aboard the Titanic")

plt.show()

The distributions of the number of siblings / spouses aboard the Titanic look similar for survived and dead passengers: most passengers in both groups have 0 or 1 siblings / spouses aboard.

Next, I will plot a graph for Parch

In [None]:
plt.subplots(1,2, figsize=(15,5))
plt.subplot(1,2,1)

# count passengers per Parch value; the index of the result holds the Parch values
parch_survived = train_df[train_df['Survived']==1].groupby('Parch')['Parch'].count()
plt.bar(x=parch_survived.index, height=parch_survived)
plt.title("Survived passengers' number of parents / children aboard the Titanic")


plt.subplot(1,2,2)
parch_dead = train_df[train_df['Survived']==0].groupby('Parch')['Parch'].count()
plt.bar(x=parch_dead.index, height=parch_dead)
plt.title("Dead passengers' number of parents / children aboard the Titanic")


plt.show()

The distributions of the number of parents / children aboard the Titanic look similar for survived and dead passengers: most passengers in both groups have 0 to 2 parents / children aboard.

Let's look at the Fare column.

In [None]:
plt.figure()

plt.scatter(x=train_df['Fare'], y = train_df['Survived'])
plt.scatter(train_df[train_df['Survived']==1]['Fare'].mean(),1, label='average fare of survived passengers', color= 'yellow')
plt.scatter(train_df[train_df['Survived']==0]['Fare'].mean(),0, label='average fare of dead passengers',color= 'red')
plt.title("Fare of survived and dead passengers")
plt.xlabel('Fare')
plt.ylabel('Survived')
plt.legend()
plt.show()

In [None]:
train_df[train_df['Survived']==1]['Fare'].mean()

In [None]:
train_df[train_df['Survived']==0]['Fare'].mean()

On average, the survived passengers seem to have paid a higher fare.
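
A mean can be pulled up by a few very expensive tickets, so it may be worth looking at the median next to it. The sketch below shows one way to get both per group with `groupby(...).agg(...)`. The data frame is invented for illustration.

```python
import pandas as pd

# toy data (invented): one very expensive ticket among the survived passengers
df = pd.DataFrame({
    'Survived': [0, 0, 0, 1, 1, 1],
    'Fare': [7.0, 8.0, 9.0, 10.0, 20.0, 300.0],
})

# mean, median and group size of the fare for each value of Survived
summary = df.groupby('Survived')['Fare'].agg(['mean', 'median', 'count'])
print(summary)
```

In this toy example the mean of the survived group is far above its median, which is the kind of gap that a single outlier produces.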

Let's look at the survival rate for each port of embarkation.

In [None]:
plt.subplots(2,2, figsize=(15,10))

# groupby sorts the counts by Survived (0, 1) while unique() follows the order of
# appearance, so the labels are taken from the index of the counts instead
plt.subplot(2,2,1)
counts = train_df[train_df['Embarked']=='C'].groupby('Survived')['Survived'].count()
plt.pie(counts, labels=counts.index,autopct='%.2f')
plt.title('Passengers embarked from C')

plt.subplot(2,2,2)
counts = train_df[train_df['Embarked']=='S'].groupby('Survived')['Survived'].count()
plt.pie(counts, labels=counts.index,autopct='%.2f')
plt.title('Passengers embarked from S')

plt.subplot(2,2,3)
counts = train_df[train_df['Embarked']=='Q'].groupby('Survived')['Survived'].count()
plt.pie(counts, labels=counts.index,autopct='%.2f')
plt.title('Passengers embarked from Q')

plt.subplot(2,2,4)
counts = train_df[train_df['Embarked']=='N'].groupby('Survived')['Survived'].count()
plt.pie(counts, labels=counts.index,autopct='%.2f')
plt.title('Passengers embarked from unknown')

plt.show()

Passengers who embarked at port C seem to have a higher survival rate than those from ports S and Q. (The unknown port has only a few passengers, so the sample size is too small to determine a survival rate.)
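
To judge whether a group is large enough to trust its survival rate, the rate can be shown next to the group size. The following is a minimal sketch on invented data, using named aggregation. The column names `n` and `survival_rate` are choices made for this example.

```python
import pandas as pd

# toy data (invented): the unknown port 'N' has a single passenger
df = pd.DataFrame({
    'Embarked': ['C', 'C', 'S', 'S', 'S', 'Q', 'N'],
    'Survived': [1, 0, 0, 0, 1, 0, 1],
})

# group size and survival rate per port; Survived is 0/1, so its mean is the rate
summary = df.groupby('Embarked')['Survived'].agg(n='count', survival_rate='mean')
print(summary)
```

A rate of 1.0 for `N` in this toy table comes from one passenger only, which illustrates why a tiny group should not be read as evidence.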

### Deal with missing data

Before modeling, I have to deal with the missing data.

In [None]:
train_df.isna().sum()

In [None]:
test_df.isna().sum()

The Age and Cabin columns have missing data. I will try to fill in the Age column with reference to the other columns. (Cabin is a non-numeric column with too many missing values, so it is hard to fill in.)

Let's find out which columns are highly correlated with the Age column.

In [None]:
numerics = ['uint8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']

# select numeric columns only
X_age = train_df.select_dtypes(include=numerics).drop('Age',axis=1)
y_age = train_df['Age']

In [None]:
plt.subplots(7,2, figsize=(20,50))

count = 1

for col in X_age.columns:
    plt.subplot(7,2,count)
    plt.scatter(X_age[col],y_age)
    plt.title(col)
    
    count += 1
    
plt.tight_layout()
plt.show()
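
As a numeric complement to the scatter plots above, the correlation of every numeric column with Age can be listed directly. The sketch below runs on a small invented data frame. `DataFrame.corr` computes pairwise Pearson correlations and skips rows where a value is missing.

```python
import numpy as np
import pandas as pd

# toy data (invented), including one missing age
df = pd.DataFrame({
    'Age':   [2.0, 4.0, 30.0, 40.0, np.nan, 50.0],
    'SibSp': [4, 3, 1, 0, 0, 0],
    'Fare':  [20.0, 25.0, 8.0, 50.0, 7.0, 30.0],
})

# correlation of each column with Age, most negative first
age_corr = df.corr()['Age'].drop('Age').sort_values()
print(age_corr)
```

A correlation coefficient only captures linear relationships, so it is a supplement to the plots rather than a replacement.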

SibSp seems to have some relationship with Age, while the other columns show no visible relationship with it. However, the only pattern is that the maximum number of siblings / spouses aboard the Titanic decreases as age increases, which is not a good reference for filling in the missing values of the Age column. In this case, the Age column will be dropped before modeling.

In [None]:
train_df.drop('Age',axis=1,inplace=True)
test_df.drop('Age',axis=1,inplace=True)

### Modeling

Before modeling, we need to drop the non-numeric columns and one column from each set of dummy variables to reduce multicollinearity.

In [None]:
# drop non-numeric columns
train_df.drop(['Name', 'Ticket','Cabin'],axis=1,inplace=True)
test_df.drop(['Name', 'Ticket','Cabin'],axis=1,inplace=True)

In [None]:
# drop the columns that were already turned to dummy variables
train_df.drop(['Pclass','Embarked'], axis=1, inplace=True)
test_df.drop(['Pclass','Embarked'], axis=1, inplace=True)

In [None]:
# drop 1 column from each type of dummy variables
train_df.drop([2,'N'],axis=1, inplace=True)
test_df.drop([2],axis=1, inplace=True)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
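# newer scikit-learn versions require all column names to be strings,
# but the Pclass dummy columns are named with integers
X = train_df.drop('Survived',axis=1).rename(columns=str)
y = train_df['Survived']
X_test = test_df.rename(columns=str)
# the Kaggle test set has no Survived column, so there is no y_test to score against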

In [None]:
# Correlation of the variables in a heatmap
plt.figure(figsize=(30,20))
matrix = np.triu(X.corr())
sns.heatmap(X.corr(), mask=matrix, cmap='coolwarm', vmin = -1, vmax = 1, annot = True)
plt.show()

Some pairs of columns have high positive or negative correlations, such as:
- First class and fare(positive)
- Third class and fare(negative)
- First class and third class(negative)
- Embarked at S and embarked at C(negative)
- Embarked at S and embarked at Q(negative)
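
Pairs like these can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The following is a hedged sketch on an invented data frame. The threshold of 0.7 and the column names `first` and `third` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

# toy features (invented): 'first' and 'third' are mutually exclusive dummies
X_demo = pd.DataFrame({
    'Fare':  [80.0, 70.0, 10.0, 8.0, 9.0, 75.0],
    'first': [1, 1, 0, 0, 0, 1],
    'third': [0, 0, 1, 1, 1, 0],
    'SibSp': [0, 1, 0, 1, 0, 1],
})

corr = X_demo.corr()

# keep each pair once: blank out the diagonal and the upper triangle
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))
pairs = lower.stack()

# pairs whose absolute correlation exceeds the (arbitrary) threshold
strong = pairs[pairs.abs() > 0.7].sort_values()
print(strong)
```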

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# split the train data into train and validation datasets
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
from sklearn.linear_model import LogisticRegression

lin_model = LogisticRegression()

lin_model.fit(X_train,y_train)

print(f"Train: {lin_model.score(X_train,y_train)}")
print(f"Validation: {lin_model.score(X_validation,y_validation)}")
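
A single 80/20 split can give a noisy estimate on a dataset of this size. As a possible cross-check, the sketch below runs 5-fold cross-validation with `cross_val_score`. It uses synthetic data from `make_classification` as a stand-in, so `X_demo` and `y_demo` are hypothetical and the scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the features and labels (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

model = LogisticRegression(max_iter=1000)

# accuracy on each of the 5 held-out folds
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```

In the notebook, `X` and `y` would take the place of `X_demo` and `y_demo`.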

Overfitting does not seem to be a serious problem for this model. Next, I will remove some columns that are highly correlated with other columns (Fare and S) to see how the accuracy changes.

In [None]:
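# rename(columns=str) makes all column names strings, as newer scikit-learn versions require
X_2 = train_df.drop(['Survived','Fare', 'S'],axis=1).rename(columns=str)
y = train_df['Survived']
X_test_2 = test_df.drop(['Fare', 'S'],axis=1).rename(columns=str)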

In [None]:
X_train_2, X_validation_2, y_train_2, y_validation_2 = train_test_split(X_2, y, test_size=0.2, random_state=1)

lin_model_2 = LogisticRegression()

lin_model_2.fit(X_train_2,y_train_2)

print(f"Train: {lin_model_2.score(X_train_2,y_train_2)}")
print(f"Validation: {lin_model_2.score(X_validation_2,y_validation_2)}")

The model accuracy drops slightly. I will fit a model with all numeric columns on the full training set and check its accuracy.

In [None]:
lin_model = LogisticRegression()

lin_model.fit(X,y)

print(lin_model.score(X,y))

The accuracy of the logistic regression model is 0.799, which is quite high for such a basic model. Note that this score is computed on the same data the model was fitted on. I will build a decision tree classifier to see how it compares.

In [None]:
from sklearn.tree import DecisionTreeClassifier

X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=1)

tree_model = DecisionTreeClassifier(random_state=1)

tree_model.fit(X_train,y_train)

print(f"Train: {tree_model.score(X_train,y_train)}")
print(f"Validation: {tree_model.score(X_validation,y_validation)}")

The decision tree classifier seems to overfit. I will try to find the optimal max_depth to reduce the overfitting.

In [None]:
# build two lists to record the accuracy scores for train and validation
tree_train = []
tree_validation = []

tree_depth = range(1,11)

# loop over the depths to get the accuracy scores of the model for each max_depth
for depth in tree_depth:
    tree_model = DecisionTreeClassifier(max_depth= depth, random_state= 1)
    
    tree_model.fit(X_train, y_train)
    
    tree_train.append(tree_model.score(X_train, y_train))
    tree_validation.append(tree_model.score(X_validation, y_validation))

In [None]:
# visualize the accuracy with different max_depth
plt.figure(figsize=(15,10))
plt.plot(tree_depth,tree_train,label = 'train')
plt.plot(tree_depth,tree_validation,label = 'validation')
plt.xticks(tree_depth, fontsize=15)
plt.xlabel("max depth", fontsize = 15)
plt.ylabel("Accuracy", fontsize = 15)
plt.title("Model accuracy vs. max_depth", fontsize= 25)
plt.yticks(fontsize=15)
plt.legend()
plt.show()
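
The manual loop over `max_depth` can also be written as a cross-validated grid search. The sketch below is one possible version with `GridSearchCV`. It runs on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins), so the depth it picks is not a recommendation for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the features and labels (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# try max_depth 1 to 10 and score each with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={'max_depth': list(range(1, 11))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

best_depth = search.best_params_['max_depth']
print(best_depth, search.best_score_)
```

Compared with a single validation split, each depth is scored on five held-out folds, which makes the choice less dependent on one particular split.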

The model seems to overfit less when max_depth is lower than 4. I will build the model with max_depth = 3.

In [None]:
tree_model = DecisionTreeClassifier(random_state=1,max_depth=3)

tree_model.fit(X,y)

print(f"Train: {tree_model.score(X,y)}")

The decision tree classifier model scores better than the logistic regression model, so I will submit its results to the competition.

In [None]:
test_df['Survived'] = tree_model.predict(X_test)

In [None]:
test_df['Survived'].to_csv('data/Submission.csv')
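
Before uploading, it can help to read the file back and confirm that it has the two columns the competition expects, `PassengerId` and `Survived`. The sketch below does this round trip in memory on a toy series. The ids and predictions are invented.

```python
import io

import pandas as pd

# toy predictions indexed by PassengerId (invented ids and values)
pred = pd.Series(
    [0, 1, 1],
    index=pd.Index([892, 893, 894], name='PassengerId'),
    name='Survived',
)

# write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
pred.to_csv(buffer)  # the index and the header are written by default
buffer.seek(0)
check = pd.read_csv(buffer)

print(check.columns.tolist())
```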