# IBM Machine Learning
## Course 3: Supervised Learning: Regression
## Topic: Titanic

### 1. Introduction

This is a legendary competition in machine learning. In this challenge, participants are asked to build a predictive model that predicts which passengers survived, using passenger data (i.e., name, age, gender, socio-economic class, etc.).
The dataset can be found on Kaggle: https://www.kaggle.com/c/titanic

### 2. Exploratory Data Analysis

In [1]:
## import the library and dataset
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

train_df=pd.read_csv("data/train.csv")
test_df=pd.read_csv("data/test.csv")

#### The datatype of each column

In [2]:
## The size of training set and test set
print("The size of training samples is: ", train_df.shape)
print("The size of test samples is: ",test_df.shape)

The size of training samples is:  (891, 12)
The size of test samples is:  (418, 11)


In [3]:
## The column names
train_df.columns.tolist()

['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [4]:
## Datatype of each column
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


As shown above, four columns contain missing values: Age, Fare, Cabin and Embarked. Embarked and Fare have only one or two missing values each, which are easy to impute. Cabin is missing for most passengers (only 204 of the 891 training rows have a value), so we will drop that column. For Age, we will try KNN imputation.
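The missing-value counts read off `info()` above can also be tabulated directly. The following is a minimal sketch on a small made-up frame (`toy_df` and the helper name `missing_summary` are introduced here for illustration and are not part of the notebook); the same helper could be called on `train_df` or `test_df`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and fraction of missing values, for columns with at least one NaN."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"n_missing": counts, "frac_missing": counts / len(df)})
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Hypothetical toy data for illustration only (not the Titanic dataset)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})
print(missing_summary(toy_df))
```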

### 3. Feature Engineering

In [6]:
## Add a "Survived" column filled with NaN to the test set
test_df['Survived']= np.nan

In [7]:
test_df.Survived

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
413   NaN
414   NaN
415   NaN
416   NaN
417   NaN
Name: Survived, Length: 418, dtype: float64

In [8]:
## Row-bind the two datasets for feature engineering (DataFrame.append was removed in pandas 2.0)
combine=pd.concat([train_df, test_df])

In [9]:
combine.index=range(combine.shape[0])

In [10]:
combine.index

RangeIndex(start=0, stop=1309, step=1)

In [11]:
## description of combined dataset
combine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB


#### 3.1 PassengerId column

In [12]:
combine.PassengerId

0          1
1          2
2          3
3          4
4          5
        ... 
1304    1305
1305    1306
1306    1307
1307    1308
1308    1309
Name: PassengerId, Length: 1309, dtype: int64

The passenger ID cannot be used for prediction. We can safely drop the column.

In [13]:
combine=combine.drop('PassengerId', axis=1)

#### 3.2 Pclass column

In [14]:
combine.Pclass.value_counts()

3    709
1    323
2    277
Name: Pclass, dtype: int64

Pclass has only three levels and no missing values, so it can be used directly as a predictor.
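One way to check how informative `Pclass` is would be to compare the survival rate across its levels. This is a minimal sketch, assuming a hypothetical helper `survival_rate_by` and a made-up toy frame; the printed values are illustrative and are not results from the Titanic data.

```python
import pandas as pd

def survival_rate_by(df, column, target="Survived"):
    """Mean of the binary target within each level of `column`; rows with a missing target are skipped."""
    labelled = df.dropna(subset=[target])
    return labelled.groupby(column)[target].mean()

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})
print(survival_rate_by(toy_df, "Pclass"))
```

On the notebook's data, the equivalent call would be `survival_rate_by(train_df, "Pclass")`.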

#### 3.3 Name 

In [15]:
combine.Name

0                                 Braund, Mr. Owen Harris
1       Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                  Heikkinen, Miss. Laina
3            Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                                Allen, Mr. William Henry
                              ...                        
1304                                   Spector, Mr. Woolf
1305                         Oliva y Ocana, Dona. Fermina
1306                         Saether, Mr. Simon Sivertsen
1307                                  Ware, Mr. Frederick
1308                             Peter, Master. Michael J
Name: Name, Length: 1309, dtype: object

We can extract the title from the Name column.

In [16]:
combine['Title']=combine.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

In [17]:
combine.Title.value_counts()

Mr          757
Miss        260
Mrs         197
Master       61
Dr            8
Rev           8
Col           4
Mlle          2
Major         2
Ms            2
Mme           1
Capt          1
Sir           1
Don           1
Lady          1
Jonkheer      1
Countess      1
Dona          1
Name: Title, dtype: int64

In [18]:
## Some titles have variants (e.g. Miss, Ms, Mlle). We need to standardize these titles
combine['Title']=combine['Title'].replace('Mlle', 'Miss')
combine['Title']=combine['Title'].replace('Ms', 'Miss')
combine['Title']=combine['Title'].replace('Mme', 'Mrs')

In [19]:
## Select the infrequent titles (fewer than 10 passengers) to pool together
rare_titles=combine.groupby('Title').filter(lambda x: len(x)<10).Title.unique().tolist()

In [20]:
## Replace the rare titles with a single 'Rare' level
combine['Title']=combine['Title'].replace(rare_titles, 'Rare')

In [21]:
## Double-check the titles after processing
combine.Title.value_counts()

Mr        757
Miss      264
Mrs       198
Master     61
Rare       29
Name: Title, dtype: int64
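The title steps above (extract, merge variants, pool rare titles) can be collected into one function. This is a sketch under stated assumptions: `extract_titles` and its `min_count` parameter are names introduced here, and the last sample name is made up to show a variant being merged.

```python
import pandas as pd

def extract_titles(names, min_count=10):
    """Extract titles, merge spelling variants, and pool titles seen fewer than `min_count` times."""
    # The title is the word between a space and the first following period
    titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
    # Merge variants of the same title
    titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
    # Pool infrequent titles into a single 'Rare' level
    counts = titles.value_counts()
    rare = counts[counts < min_count].index
    return titles.where(~titles.isin(rare), "Rare")

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William Henry",
    "Oliva y Ocana, Dona. Fermina",
    "Doe, Ms. Jane",  # hypothetical name, not from the dataset
])
print(extract_titles(names, min_count=2).tolist())
```

A lower `min_count` is used here only because the sample is tiny; the notebook's threshold is 10.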

In [22]:
## Drop name column
combine=combine.drop('Name', axis=1)

#### 3.4&5 Sex and Age

In [23]:
## The ratio of each gender
combine.Sex.value_counts(normalize=True)

male      0.644003
female    0.355997
Name: Sex, dtype: float64

In [24]:
## The ratio of missing values in Age
combine.Age.isna().value_counts(normalize=True)

False    0.799083
True     0.200917
Name: Age, dtype: float64

About 20% of the values in the Age column are missing. We will come back to address this later.

#### 3.6&7 Parch and SibSp

In [25]:
## Distribution of the number of parents/children aboard
combine.Parch.value_counts()

0    1002
1     170
2     113
3       8
5       6
4       6
9       2
6       2
Name: Parch, dtype: int64

In [26]:
## Distribution of the number of siblings/spouses aboard
combine.SibSp.value_counts()

0    891
1    319
2     42
4     22
3     20
8      9
5      6
Name: SibSp, dtype: int64

In [27]:
## Create a new column, FamilySize: siblings/spouses + parents/children + the passenger
combine['FamilySize'] = combine['SibSp'] + combine['Parch'] + 1

In [28]:
combine.FamilySize.value_counts()

1     790
2     235
3     159
4      43
6      25
5      22
7      16
11     11
8       8
Name: FamilySize, dtype: int64

In [29]:
## Drop two columns after new column is created
combine=combine.drop(['SibSp','Parch'], axis=1)

#### 3.8 Ticket

In [30]:
len(combine.Ticket.unique())

929

As with PassengerId, the Ticket column has too many distinct levels (929 across 1309 rows) to be useful as a categorical predictor, so we drop it.
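The "too many levels" argument can be made concrete by comparing the number of distinct values to the number of rows. This is a minimal sketch, assuming a hypothetical helper `cardinality_ratio` and made-up ticket labels.

```python
import pandas as pd

def cardinality_ratio(series):
    """Fraction of non-missing entries that are distinct; 1.0 means every value is unique."""
    non_missing = series.dropna()
    return non_missing.nunique() / len(non_missing)

# Hypothetical ticket labels for illustration only
toy_tickets = pd.Series(["T1", "T2", "T3", "T3"])
print(cardinality_ratio(toy_tickets))  # 3 distinct values over 4 rows
```

For the Ticket column above, 929 distinct values over 1309 rows gives a ratio of about 0.71, so most tickets identify only one or two passengers.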

In [31]:
combine=combine.drop('Ticket', axis=1)

#### 3.9 Fare

In [32]:
## Fare is a float column, so we can check its distribution
combine.Fare.describe()

count    1308.000000
mean       33.295479
std        51.758668
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200
Name: Fare, dtype: float64

In [33]:
np.where(combine.Fare.isna())[0][0]

1043

In [34]:
## Use the mean value to impute the missing Fare (.loc avoids chained assignment)
combine.loc[1043, 'Fare']=combine.Fare.mean()

In [35]:
combine.Fare[[1043]]

1043    33.295479
Name: Fare, dtype: float64

#### 3.10 Cabin

In [36]:
## The ratio of missing values in the Cabin column
combine.Cabin.isna().value_counts(normalize=True)

True     0.774637
False    0.225363
Name: Cabin, dtype: float64

As mentioned, too many values are missing in this column (about 77%), so we will drop it.

In [37]:
combine=combine.drop('Cabin',axis=1)

#### 3.11 Embarked

In [38]:
combine.Embarked.value_counts(sort=True)

S    914
C    270
Q    123
Name: Embarked, dtype: int64

In [39]:
combine.Embarked.isna().value_counts()

False    1307
True        2
Name: Embarked, dtype: int64

In [40]:
np.where(combine.Embarked.isna())

(array([ 61, 829], dtype=int64),)

In [41]:
## Use the most frequent value to impute (.loc avoids chained assignment)
combine.loc[[61, 829], 'Embarked']=combine.Embarked.value_counts(sort=True).index[0]
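The two simple imputations used so far (mean for `Fare`, most frequent value for `Embarked`) can be sketched as one helper. This is an illustrative sketch on made-up data; `impute_simple` is a hypothetical name, and `fillna` is used so that missing rows do not have to be located by position.

```python
import numpy as np
import pandas as pd

def impute_simple(df, mean_cols=(), mode_cols=()):
    """Return a copy with NaNs filled by the column mean or by the most frequent value."""
    out = df.copy()
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    for col in mode_cols:
        # mode() ignores NaN; take the first value if there is a tie
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "Fare": [10.0, np.nan, 30.0],
    "Embarked": ["S", "S", np.nan],
})
filled = impute_simple(toy_df, mean_cols=["Fare"], mode_cols=["Embarked"])
print(filled)
```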

#### Age imputation

In [42]:
## check the value for each column
combine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    float64
 1   Pclass      1309 non-null   int64  
 2   Sex         1309 non-null   object 
 3   Age         1046 non-null   float64
 4   Fare        1309 non-null   float64
 5   Embarked    1309 non-null   object 
 6   Title       1309 non-null   object 
 7   FamilySize  1309 non-null   int64  
dtypes: float64(3), int64(2), object(3)
memory usage: 81.9+ KB


In [63]:
X=combine.iloc[:,1:8]

In [65]:
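from fancyimpute import KNN  ## assumed source of KNN; it was never imported in the original cell, hence a NameError
## KNN needs numeric input, so one-hot encode the categorical columns first
X = pd.get_dummies(X, columns=['Sex', 'Embarked', 'Title'], dtype=float)
X = pd.DataFrame(KNN(k = 10).fit_transform(X), columns = X.columns)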
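As a sketch of KNN-based imputation that does not depend on the `KNN` class referenced above, scikit-learn's `KNNImputer` can be used. This swaps in a different library rather than reproducing the notebook's intended call. The toy frame and `n_neighbors=2` are assumptions for illustration. Categorical columns are one-hot encoded first because the imputer only accepts numeric input.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data for illustration only (not the Titanic dataset)
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Sex": ["female", "male", "male", "female", "male"],
    "Age": [38.0, np.nan, 22.0, 26.0, np.nan],
    "Fare": [71.3, 53.1, 7.3, 7.9, 8.1],
})

# One-hot encode categoricals so every column is numeric
encoded = pd.get_dummies(toy_df, columns=["Sex"], dtype=float)

# Each missing Age becomes the mean Age of the 2 nearest rows that have an observed Age
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)
print(imputed["Age"])
```

The features are not scaled in this sketch, so columns with a large range such as `Fare` dominate the distance; standardizing the columns before imputing may be worth considering.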