<h3>Import data</h3>

In [1]:
# Import basic libraries
import os
import pandas as pd
import numpy as np

# Import data
data_path = ['data']
filepath_train = os.sep.join(data_path + ['train.csv'])
filepath_test = os.sep.join(data_path + ['test.csv'])
filepath_y_test = os.sep.join(data_path + ['gender_submission.csv'])
train = pd.read_csv(filepath_train)
test = pd.read_csv(filepath_test)
y_test = pd.read_csv(filepath_y_test)

<h3>Initial data exploration</h3>

In [2]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
# Training set info
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [4]:
# Test set info
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


The training data has 12 columns. Test data has 11 columns. Column 'Survived' is a target, integer dtype, binary encoded (1 - survived, 0 - not survived). It  Among the features, 6 columns are numeric dtype and 5 object dtype.

In [5]:
# Check missing values in training set
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In training set, there is 687/891 missing values in column 'Cabin', 177/891 in column 'Age' and 2/891 in column 'Embarked'

In [6]:
# Check missing values in test set
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In test set, there is 327/418 missing values in 'Cabin' column, 86/418 in 'Age' column and 1 missing value in 'Fare' column

In [7]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [8]:
test.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In both datasets, minimum value in 'Age' column is less than 1. This feature needs further exploration.

In [9]:
# Assign target data and get feature names
y_train = train['Survived']
features = [x for x in train.columns if x != 'Survived']


In [10]:
y_test = y_test.drop('PassengerId', axis=1)

In [11]:
y_test.head()

Unnamed: 0,Survived
0,0
1,1
2,0
3,0
4,1


<h2>Data preparation</h2>

<h3>Cabin</h3>

In [12]:
# Examine cabin values
train['Cabin'].value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 147, dtype: int64

Since 'Cabin' column has a lot of unique categorical values and a lot of missing values both in train and test set, I drop this column.

In [13]:
train.drop('Cabin', axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True);

In [14]:
features.remove('Cabin')

<h3>Ticket</h3>

Similarly, 'Ticket' column is unlikely to provide any useful information, thus I drop it as well.

In [15]:
train.drop('Ticket', axis=1, inplace=True)
test.drop('Ticket', axis=1, inplace=True)

In [16]:
features.remove('Ticket')

<h3>Embarked</h3>

There are only 2 missing values of this feature in train set, I fill it.

In [17]:
train['Embarked'].value_counts(normalize=True)

S    0.724409
C    0.188976
Q    0.086614
Name: Embarked, dtype: float64

Most frequent value is 'S', so i fill missing values with it.

In [18]:
train = train.fillna({'Embarked': 'S'})

<h3>Age</h3>

There is a lot of missing values in 'Age' columns which should be filled. But first I check the values less than 1

In [19]:
train[train['Age'] < 1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,29.0,S
305,306,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,151.55,S
469,470,1,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,19.2583,C
644,645,1,3,"Baclini, Miss. Eugenie",female,0.75,2,1,19.2583,C
755,756,1,2,"Hamalainen, Master. Viljo",male,0.67,1,1,14.5,S
803,804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,8.5167,C
831,832,1,2,"Richards, Master. George Sibley",male,0.83,1,1,18.75,S


In [20]:
test[test['Age'] < 1]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
201,1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,14.4,S
250,1142,2,"West, Miss. Barbara J",female,0.92,1,2,27.75,S
281,1173,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,13.775,S
307,1199,3,"Aks, Master. Philip Frank",male,0.83,0,1,9.35,S
354,1246,3,"Dean, Miss. Elizabeth Gladys Millvina""""",female,0.17,1,2,20.575,S


Age values above seems to be errors. I assign NaN to them, and later they will be filled.

In [21]:
train_age_idxs = train[train['Age'] < 1].index
test_age_idxs = test[test['Age'] < 1].index
train.loc[train_age_idxs, 'Age'] = np.nan
test.loc[test_age_idxs, 'Age'] = np.nan

Fill missing age values

In [22]:
# Check age correlation to another features
age_correlations_train = train[features].corrwith(train['Age']).abs().sort_values(ascending=False)
age_correlations_train

Age            1.000000
Pclass         0.376786
SibSp          0.305371
Parch          0.173905
Fare           0.099329
PassengerId    0.046289
dtype: float64

In [23]:
age_correlations_test = test[features].corrwith(test['Age']).abs().sort_values(ascending=False)
age_correlations_test

Age            1.000000
Pclass         0.486803
Fare           0.337405
SibSp          0.090576
PassengerId    0.016256
Parch          0.014313
dtype: float64

Since there's a highest correlation between Age and Pclass, missing values will be filled with median value of age according to each pclass.

In [24]:
# Fill missing values with median age for each of Pclass in both training and test set
train['Age'] = train.groupby(['Pclass'])['Age'].apply(lambda x: x.fillna(x.median()))
test['Age'] = test.groupby(['Pclass'])['Age'].apply(lambda x: x.fillna(x.median()))

<h3>Fare</h3>

In [25]:
test[test['Fare'].isnull()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,,S


In [26]:
# Check age correlation to another features
fare_correlations_train = train[features].corrwith(train['Fare']).abs().sort_values(ascending=False)
fare_correlations_train

Fare           1.000000
Pclass         0.549500
Parch          0.216225
SibSp          0.159651
Age            0.129549
PassengerId    0.012658
dtype: float64

In [27]:
fare_correlations_test = test[features].corrwith(test['Fare']).abs().sort_values(ascending=False)
fare_correlations_test

Fare           1.000000
Pclass         0.577147
Age            0.355370
Parch          0.230046
SibSp          0.171539
PassengerId    0.008211
dtype: float64

I fill this value with median value of fare for corresponding Pclass, because it is the feature of highest correlation with fare

In [28]:
test['Fare'] = test.groupby(['Pclass'])['Fare'].apply(lambda x: x.fillna(x.median()))

<h3>Name</h3>

Name doesn't provide any useful information, I drop it.

In [29]:
train.drop('Name', axis=1, inplace=True)
test.drop('Name', axis=1, inplace=True)
features.remove('Name')

<h3>Embarked and sex</h3>

One hot encode these columns

In [30]:
one_hot_encode_cols = ['Embarked', 'Sex']
train = pd.get_dummies(train, columns=one_hot_encode_cols)
test = pd.get_dummies(test, columns=one_hot_encode_cols)
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked_C,Embarked_Q,Embarked_S,Sex_female,Sex_male
0,1,0,3,22.0,1,0,7.25,0,0,1,0,1
1,2,1,1,38.0,1,0,71.2833,1,0,0,1,0
2,3,1,3,26.0,0,0,7.925,0,0,1,1,0
3,4,1,1,35.0,1,0,53.1,0,0,1,1,0
4,5,0,3,35.0,0,0,8.05,0,0,1,0,1


<h3>Age and fare</h3>

Scale these features using minmax scaler

In [31]:
# Pop passenger id from data
train_id = train.pop('PassengerId')
test_id = test.pop('PassengerId')

In [32]:
y_train = train.pop('Survived')


In [33]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(train)
X_test = scaler.fit_transform(test)

<h2>Model</h2>

In [39]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import accuracy_score

<h3>KNN</h3>

In [38]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = score(y_test, y_pred, average='weighted')
metrics_knn = pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy}, name='KNN')
metrics_knn

precision    0.858661
recall       0.858852
fscore       0.858751
accuracy     0.858852
Name: KNN, dtype: float64

<h3>Logistic Regression</h3>

In [54]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = score(y_test, y_pred, average='weighted')
metrics_lr = pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy}, name='LR')
metrics_lr

# Save y_pred for later, calculate probability of each output
y_pred_lr = y_pred
proba = lr.predict_proba(X_test)
proba

array([[0.89408879, 0.10591121],
       [0.57798643, 0.42201357],
       [0.85848478, 0.14151522],
       [0.90306021, 0.09693979],
       [0.43357795, 0.56642205],
       [0.87093191, 0.12906809],
       [0.36125187, 0.63874813],
       [0.80196974, 0.19803026],
       [0.26120778, 0.73879222],
       [0.91869273, 0.08130727],
       [0.89642294, 0.10357706],
       [0.66079612, 0.33920388],
       [0.08565033, 0.91434967],
       [0.90683185, 0.09316815],
       [0.14749575, 0.85250425],
       [0.14909594, 0.85090406],
       [0.75614274, 0.24385726],
       [0.83568605, 0.16431395],
       [0.45471012, 0.54528988],
       [0.40829906, 0.59170094],
       [0.64090746, 0.35909254],
       [0.86212807, 0.13787193],
       [0.11616513, 0.88383487],
       [0.40126125, 0.59873875],
       [0.09588463, 0.90411537],
       [0.95137954, 0.04862046],
       [0.04889903, 0.95110097],
       [0.84072386, 0.15927614],
       [0.63156174, 0.36843826],
       [0.88527177, 0.11472823],
       [0.

<h3>SGD</h3>

In [46]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = score(y_test, y_pred, average='weighted')
metrics_sgd = pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy}, name='SGD')
metrics_sgd

precision    0.915013
recall       0.901914
fscore       0.897778
accuracy     0.901914
Name: SGD, dtype: float64

<h3>SVM</h3>

In [36]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = score(y_test, y_pred, average='weighted')
metrics_svc = pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy}, name='SVM')
metrics_svc

precision    0.915013
recall       0.901914
fscore       0.897778
accuracy     0.901914
Name: SVM, dtype: float64

<h3>Decision tree</h3>

In [44]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = score(y_test, y_pred, average='weighted')
metrics_dt = pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy}, name='DT')
metrics_dt

precision    0.800204
recall       0.801435
fscore       0.800700
accuracy     0.801435
Name: DT, dtype: float64

<h3>Random forest</h3>

In [45]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = score(y_test, y_pred, average='weighted')
metrics_rf = pd.Series({'precision':precision, 'recall':recall, 
                              'fscore':fscore, 'accuracy':accuracy}, name='RF')
metrics_rf

precision    0.848718
recall       0.849282
fscore       0.848954
accuracy     0.849282
Name: RF, dtype: float64

In [50]:
metrics = pd.concat([metrics_knn, metrics_lr, metrics_sgd, metrics_svc, metrics_dt, metrics_rf], axis=1)
metrics

Unnamed: 0,KNN,LR,SGD,SVM,DT,RF
precision,0.858661,0.957357,0.915013,0.915013,0.800204,0.848718
recall,0.858852,0.956938,0.901914,0.901914,0.801435,0.849282
fscore,0.858751,0.957054,0.897778,0.897778,0.8007,0.848954
accuracy,0.858852,0.956938,0.901914,0.901914,0.801435,0.849282


Most accurate model in this case is Logistic Regression

<h3>Answer for the question</h3>

Answer for the question, what is every passenger's probability of survival, according to Logistic Regression model

In [60]:
proba.T[0]
submission = pd.DataFrame(data = proba.T[0], index = test_id, columns=['Probability of survival'])
submission

Unnamed: 0_level_0,Probability of survival
PassengerId,Unnamed: 1_level_1
892,0.894089
893,0.577986
894,0.858485
895,0.903060
896,0.433578
...,...
1305,0.896410
1306,0.067129
1307,0.925380
1308,0.896410
