Home Project

Given: Titanic Dataset

Need to prepare:

 Jupyter notebook with following pipeline:
 1. Data loading
 2. Statistical analysis
 3. Data preprocessing (categorical data, NaN, etc)
 4. Feature engineering
 5. Data preparation for model (scaling, train/test split, etc)
 6. Baseline model
 7. Model selection (try different models)
 8. Model’s hyperparameters tuning

In [0]:
# Data analysis modules
import pandas as pd
# numpy is a great library for doing mathematical operations.
import numpy as np
# Visualization libraries
import matplotlib as mpl
from matplotlib import pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import os

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split #for splitting the data
from sklearn.metrics import accuracy_score  #for accuracy_score
from sklearn.model_selection import cross_val_score #score evaluation
from sklearn.model_selection import cross_val_predict #prediction
from sklearn.metrics import confusion_matrix #for confusion matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [0]:
## Exploratory data analysis

# 1. Data loading

We start with the pandas package, which loads the training and test datasets into DataFrames.

In [20]:
test = pd.read_csv('gdrive/My Drive/test.csv')
train = pd.read_csv('gdrive/My Drive/train.csv')
alldata = [train,test]

FileNotFoundError: ignored

# 2. Statistical analysis

In [0]:
test.head()

In [0]:
train.head()

### Below is brief information about each column of the dataset:
1.	PassengerId: A unique index for passenger rows. It starts from 1 for the first row and increments by 1 for every new row.
2.	Survived: Shows whether the passenger survived. 1 stands for survived and 0 stands for not survived.
3.	Pclass: Ticket class. 1 stands for a first class ticket, 2 for a second class ticket, and 3 for a third class ticket.
4.	Name: Passenger's name. The name also contains a title: "Mr" for a man, "Mrs" for a married woman, "Miss" for an unmarried woman or girl, "Master" for a boy.
5.	Sex: Passenger's sex. It is either male or female.
6.	Age: Passenger's age. A "NaN" value in this column indicates that the age of that passenger was not recorded.
7.	SibSp: Number of siblings or spouses travelling with each passenger.
8.	Parch: Number of parents or children travelling with each passenger.
9.	Ticket: Ticket number.
10.	Fare: How much money the passenger paid for the journey.
11.	Cabin: Cabin number of the passenger. A "NaN" value in this column indicates that the cabin number of that passenger was not recorded.
12.	Embarked: Port where the passenger boarded.


### Which features are categorical, numerical and mixed data types?
•	Categorical: Survived, Sex, Embarked, and Pclass.

•	Numerical: Age, Fare, SibSp, Parch.

•	Mixed types: Ticket and Cabin.
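
A first pass at this grouping can be automated from the column dtypes. The sketch below uses a small made-up frame (the rows are hypothetical, not taken from the real files) and `select_dtypes`. Note that dtype alone is not enough: Survived and Pclass are stored as integers but are categorical in meaning, so the result still needs a manual review.

```python
import pandas as pd

# Hypothetical three-row sample that mimics a few Titanic columns.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "Fare": [7.25, 71.28, 7.92],
    "Ticket": ["A/5 21171", "PC 17599", "113803"],
})

# Columns stored as numbers (int or float).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (here: text columns).
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```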
    


### Errors or typos
The "Name" feature may contain errors and typos because titles, short names, etc. can be written in several ways.

In [0]:
test.info()
print('*'*40)
train.info()

### NaN values 
Cabin and Age are incomplete in both datasets, with Cabin missing more values than Age.

The Embarked feature contains a few NaN values in the training dataset.
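
The `info()` output shows non-null counts, so the missing values have to be read off by subtraction. A more direct view, sketched here on a made-up frame (the values are hypothetical), counts NaNs per column and expresses them as a percentage:

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with gaps in Age, Cabin and Embarked.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values per column, most incomplete column first.
missing = sample.isna().sum().sort_values(ascending=False)
# The same information as a percentage of all rows.
missing_pct = (sample.isna().mean() * 100).round(1)

print(missing)
print(missing_pct)
```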


### Data types
Five features are integers in the training dataset, including "Survived".

Two features are float.

### Categorical features

Categorical:
Survived is a categorical feature with binary values. Pclass is an ordinal feature.

Sex 

Embarked

Pclass

### Numerical features

Age

Fare

SibSp

Parch

We want to know how well each feature correlates with survival. We want to do this early in the project and compare these quick correlations with the modelled correlations later on.

We may want to complete Sex, Pclass, Age, and Embarked.

The Ticket feature may be dropped from our analysis as it contains a high ratio of duplicates (22%) and there may not be a correlation between Ticket and survival.

Name feature is relatively non-standard, may .

Age: create an Age Band feature.

SibSp & Parch: create a new feature.

The Cabin feature may be dropped as it is highly incomplete, with many null values in both the training and test datasets.

PassengerId may be dropped from the training dataset as it does not contribute to survival.
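
Two of the decisions above can be checked or prototyped in a few lines. The sketch below is only an illustration on made-up values: it shows one possible way to measure a duplicate ratio for Ticket (the notebook does not say how its 22% figure was computed) and one way to build an age band with `pd.cut`. The bin edges are an assumption, not a choice made in this notebook.

```python
import pandas as pd

# Hypothetical values, not taken from the real dataset.
sample = pd.DataFrame({
    "Ticket": ["113803", "113803", "PC 17599", "347082", "347082", "347082"],
    "Age": [2, 15, 22, 38, 64, 80],
})

# Share of rows whose ticket number already appeared in an earlier row.
duplicate_ratio = sample["Ticket"].duplicated().mean()

# Age band: integer code of the interval each age falls into.
# Intervals are right-inclusive: (0, 16], (16, 32], ... (64, 80].
sample["AgeBand"] = pd.cut(sample["Age"], bins=[0, 16, 32, 48, 64, 80], labels=False)

print(duplicate_ratio)
print(sample[["Age", "AgeBand"]])
```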



#### SURVIVED

In [0]:
f,ax=plt.subplots(1,2,figsize=(18,8))
train['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
sns.countplot(x='Survived',data=train,ax=ax[1])  # recent seaborn versions require the column as a keyword argument
ax[1].set_title('Survived')
plt.show()

##### SEX

In [0]:
train.groupby(by='Sex')[['PassengerId']].count()

In [0]:
g = sns.barplot(x="Sex",y="Survived",data=train, ci=None)
g = g.set_ylabel("Survival Probability")

##### PCLASS

In [0]:
train.groupby('Pclass').Survived.value_counts()

In [0]:
g = sns.barplot(x="Pclass",y="Survived",data=train, ci=None)
g = g.set_ylabel("Survival Probability")

##### EMBARKED

In [0]:
train.groupby(by='Embarked')[['PassengerId']].count()

In [0]:
g = sns.barplot(x="Embarked",y="Survived",data=train, ci=None)
g = g.set_ylabel("Survival Probability")

In [0]:

fig = plt.figure(figsize=(15,5))
sns.violinplot(x="Embarked", y="Pclass", hue="Survived", data=train, split=True)

In [0]:
train["Embarked"].isna().sum()

#### AGE

In [0]:
fig = plt.figure(figsize=(15,5))
#ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)


#sns.violinplot(x="Embarked", y="Age", hue="Survived", data=train, split=True, ax=ax1)
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train, split=True, ax=ax2)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train, split=True, ax=ax3)


##### Pclass plot:

1) 1st class has very few children(0-18)

2) All children in 2nd class survived

3) Most children in 3rd class survived

#### Sex plot:

1) Most male children survived

2) 20-38 females have better survival chance

#### FARE

In [0]:
train.groupby(by='Embarked')[['Fare']].sum()

In [0]:
train.groupby(by='Survived')[['Fare']].sum()

In [0]:
ax = plt.figure(figsize=(15,10))
ax = sns.boxplot(x="Embarked", y="Fare", hue="Survived", data=train, palette="Set3")

#### NAME


In [0]:
train["Name"].head()

In [0]:
train["Name"].describe()

### SibSp&Parch


In [0]:
train.groupby(by='SibSp')[['PassengerId']].count()

In [0]:
train.groupby(by='Parch')[['PassengerId']].count()

In [0]:
g = sns.barplot(x="Parch",y="Survived",data=train, ci=None)
g = g.set_ylabel("Survival Probability")

In [0]:
g = sns.barplot(x="SibSp",y="Survived",data=train, ci=None)
g = g.set_ylabel("Survival Probability")

In [0]:
train.groupby('Parch').Survived.value_counts()

In [0]:
train.groupby('SibSp').Survived.value_counts()

In [0]:
ax = plt.figure(figsize=(15,10))
ax = sns.boxplot(x="Parch", y="Fare", hue="Survived", data=train, palette="Set3")


In [0]:
ax = plt.figure(figsize=(15,10))
ax = sns.boxplot(x="SibSp", y="Fare", hue="Survived", data=train, palette="Set3")

## Ticket

In [0]:
train.groupby(by='Ticket')[['Name']].count()

In [0]:
plt.figure(figsize=(15,6))
# numeric_only=True (pandas >= 1.5) skips text columns such as Name and Ticket
sns.heatmap(train.drop('PassengerId',axis=1).corr(numeric_only=True), vmax=0.6, square=True, annot=True)

# 3. Data preprocessing

Categorical to numerical

In [0]:
train.info()
test.info()

##### Sex

In [0]:
for dataset in alldata:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train.head()


##### Fare

The Fare and Embarked features have a few NaN values.

In [0]:
test['Fare'].isna().sum()

In [0]:
# Assign the result back instead of using inplace=True on a column selection
test['Fare'] = test['Fare'].fillna(test['Fare'].median())


##### Embarked


In [0]:
train['Embarked'].isna().sum()

In [0]:
train.groupby(by='Embarked')[['Embarked']].count()

In [0]:
train['Embarked']=train['Embarked'].fillna('S')

##### Age

Let's replace the NaN values in Age with the median age of the passengers who share the same "Pclass" and "Sex" values.

In [0]:
print("Train Age missing %: " + str(train.Age.isnull().sum()*100/len(train.Age)))

In [0]:
guess_ages = np.zeros((2,3))
guess_ages

In [0]:
for dataset in alldata:
    for i in range(0, 2):
        for j in range(0, 3):
            guess = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j+1)]['Age'].dropna()

            age_guess = guess.median()

            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1), 'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)
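
The same idea, filling each missing age with the median of its Sex/Pclass group, can also be written without explicit loops. This is a sketch on a made-up frame, not a drop-in replacement: unlike the loop above it does not round the median to the nearest 0.5 and does not cast Age to int.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two Sex/Pclass groups, each with one missing age.
sample = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, aligned row by row with the frame.
group_median = sample.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the NaN entries are replaced; recorded ages stay untouched.
sample["Age"] = sample["Age"].fillna(group_median)

print(sample)
```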
    


In [0]:
train.head(12)

# 4. Feature Engineering

#### Name

In [0]:
for dataset in alldata:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)  # raw string avoids an invalid-escape warning

pd.crosstab(train['Title'], train['Sex'])

In [0]:
for dataset in alldata:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\
 	'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
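
To see what the regular expression and the replacements above do, here is a small standalone sketch. The names are invented for illustration. The pattern captures the first run of letters that follows a space and ends with a period, which is where the title sits in the "Surname, Title. Given names" layout.

```python
import pandas as pd

# Made-up names following the "Surname, Title. Given names" layout.
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mlle. Anne",
    "Poe, Dr. Alex",
    "Loe, Master. Tim",
])

# Capture the word between a space and the first period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Normalise French/alternative spellings, then group uncommon titles.
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
titles = titles.replace(["Dr", "Rev", "Col"], "Rare")

print(titles.tolist())
```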

In [0]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in alldata:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

test.head()

In [0]:
test.head()

#### SibSp&Parch

In [0]:
SibSp = train['SibSp']
Parch = train['Parch']

In [0]:
SibParch = pd.concat([SibSp,Parch],axis=0)
SibParch = SibParch.values

In [0]:
rank = np.ndim([SibParch])  # np.rank was removed from NumPy; np.ndim is its replacement
rank

In [0]:
train.info()

In [0]:
test.info()

In [0]:
for dataset in alldata:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [0]:
train = train.drop(['SibSp','Parch'], axis=1)
test = test.drop(['SibSp','Parch'], axis=1)

In [0]:
train.info()

In [0]:
test.info()

In [0]:
train.head()

In [0]:
test.head()

In [0]:
train.shape, test.shape

In [0]:
total_train = train.isnull().sum().sort_values(ascending = False)
total_train

In [0]:
total_test = test.isnull().sum().sort_values(ascending = False)
total_test

##### Cabin, Ticket

In [0]:
train = train.drop(['Cabin','Ticket', 'PassengerId', 'Name'], axis=1)
test = test.drop(['Cabin','Ticket', 'PassengerId', 'Name'], axis=1)

In [0]:
train.info()
print('*'*40)
test.info()

### OneHotEncoding for Embarked, Pclass

In [0]:
def one_hot_encoding(dtf, columns):
    for column in columns:
        dm = pd.get_dummies(dtf[column], prefix=column)
        dtf.drop(column, axis=1, inplace=True)
        dtf = pd.concat([dtf, dm], axis=1)
    return dtf

In [0]:
categorial_columns = ['Embarked', 'Pclass']

In [0]:
train = one_hot_encoding(train, categorial_columns)
test = one_hot_encoding(test, categorial_columns)
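
One thing to watch when encoding train and test separately: if a category appears in only one of them, the two frames end up with different columns. The sketch below, on made-up frames, uses the `columns=` argument of `pd.get_dummies` as a shorter alternative to the helper above and then aligns the test columns to the training columns with `reindex`. Whether this situation occurs in the real files is not checked here.

```python
import pandas as pd

# Hypothetical frames: the test sample has no "C" or "Q" port and no class 2.
train_sample = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Pclass": [1, 2, 3]})
test_sample = pd.DataFrame({"Embarked": ["S", "S"], "Pclass": [3, 1]})

# Encode both columns in one call; the originals are dropped automatically.
train_enc = pd.get_dummies(train_sample, columns=["Embarked", "Pclass"])
test_enc = pd.get_dummies(test_sample, columns=["Embarked", "Pclass"])

# Give the test frame exactly the training columns; unseen categories become 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```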

In [0]:
test.head()

In [0]:
train.head()

In [0]:
train.shape, test.shape

In [0]:
train.head()

In [0]:
test.head()

# 5. Data preparation for model

In [0]:
all_features = train.drop("Survived", axis=1)
Targeted_feature = train["Survived"]
X_train,X_test,y_train,y_test = train_test_split(all_features,Targeted_feature,test_size=0.3,random_state=42)
X_train.shape,X_test.shape,y_train.shape,y_test.shape
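
The pipeline outline lists scaling as part of this step, but no scaler is applied above. Distance-based models such as KNN are sensitive to feature scale, so one option is to wrap the scaler and the model in a scikit-learn pipeline, which fits the scaler on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and target.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The scaler is fitted on the training split only, then reused for the test split.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_tr, y_tr)
score = scaled_knn.score(X_te, y_te)

print(round(score, 3))
```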

In [0]:
test.head()

# 6. Baseline model

In [0]:
#Logistic_Regression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)
logreg_score = logreg.score(X_test,y_test)
logreg_score

# 7. Model selection

#### Decision Tree Classifier

In [0]:
# Decision Tree Classifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
decision_tree_score = decision_tree.score(X_test, y_test)
decision_tree_score

#### KNN

In [0]:
#KNN

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_score = knn.score(X_test, y_test)
knn_score

#### RandomForestClassifier

In [0]:
#RFC
random_forest = RandomForestClassifier()
random_forest.fit(X_train, y_train)
rfc_score = random_forest.score(X_test, y_test)
rfc_score

## **Results**

In [0]:
fig, ax1 = plt.subplots(1, 1, figsize=(8,5))
fig = sns.barplot(y=[rfc_score, knn_score, decision_tree_score, logreg_score], x=["RandomForestClassifier", "KNN", "DecisionTreeClassifier", "LogisticRegression"], ax=ax1)
fig.set(xlabel="Models", ylabel="Score")
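
The scores compared above come from a single train/test split, so small differences between models may be due to chance. A cross-validated comparison is one way to get a steadier picture. The following is a minimal sketch on synthetic data; the model list and `max_iter` value are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=1),
}

# Mean accuracy over 5 folds for each candidate model.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

for name, value in mean_scores.items():
    print(name, round(value, 3))
```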

# 8. Model’s hyperparameters tuning

#### RandomForestClassifier

In [0]:
rfc=RandomForestClassifier()

In [0]:
param_grid = { 
    'n_estimators': [75, 150, 200, 300],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; it was the same as 'sqrt' for classifiers
    'max_depth' : [6,7,8,9,10],
    'criterion' :['gini', 'entropy'],
    'min_samples_split' :[2,3],
    'min_samples_leaf' :[1,2,3]
}

In [0]:
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(X_train, y_train)

In [0]:
CV_rfc.best_params_

In [0]:
CV_rfc.best_score_
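
Instead of copying the best parameters by hand into a new model, the fitted search object already holds a model refitted with them in `best_estimator_` (this is the default behaviour, `refit=True`). A small sketch on synthetic data, with a deliberately tiny grid so that it runs quickly; the grid values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and target.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Tiny illustrative grid: 4 combinations x 3 folds.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# Model refitted on the whole training split with the best parameters.
best_model = search.best_estimator_
test_score = best_model.score(X_te, y_te)

print(search.best_params_)
print(round(test_score, 3))
```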

In [0]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(criterion='gini',max_depth=7, n_estimators=150, max_features='sqrt',min_samples_leaf=1,min_samples_split=3)
model.fit(X_train,y_train)
prediction_rm=model.predict(X_test)
print('--------------The Accuracy of the model----------------------------')
print('The accuracy of the Random Forest Classifier is',round(accuracy_score(prediction_rm,y_test)*100,2))
result_rm=cross_val_score(model,all_features,Targeted_feature,cv=5,scoring='accuracy')
print('The cross validated score for Random Forest Classifier is:',round(result_rm.mean()*100,2))
y_pred_RFC = cross_val_predict(model,all_features,Targeted_feature,cv=5)
sns.heatmap(confusion_matrix(Targeted_feature,y_pred_RFC),annot=True,fmt='3.0f',cmap="summer")
plt.title('Confusion_matrix', y=1.05, size=15)

#### Logistic Regression

In [0]:
logreg = LogisticRegression()

In [0]:
param_grid = { 
   # 'penalty': ['l1', 'l2'],
     'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000], 
    'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

In [0]:
CV_logreg = GridSearchCV(estimator=logreg, param_grid=param_grid, cv= 5)
CV_logreg.fit(X_train, y_train)

In [0]:
CV_logreg.best_params_

In [0]:
CV_logreg.best_score_

In [0]:
model = LogisticRegression(C = 0.1, solver = 'liblinear')
model.fit(X_train,y_train)
prediction_lr=model.predict(X_test)
print('--------------The Accuracy of the model----------------------------')
print('The accuracy of the Logistic Regression is',round(accuracy_score(prediction_lr,y_test)*100,2))
result_lr=cross_val_score(model,all_features,Targeted_feature,cv=5,scoring='accuracy')
print('The cross validated score for Logistic Regression is:',round(result_lr.mean()*100,2))
y_pred_LOGREG = cross_val_predict(model,all_features,Targeted_feature,cv=5)
sns.heatmap(confusion_matrix(Targeted_feature,y_pred_LOGREG),annot=True,fmt='3.0f',cmap="summer")
plt.title('Confusion_matrix', y=1.05, size=15)

#### DecisionTreeClassifier

In [0]:
dtc = DecisionTreeClassifier()

In [0]:
params_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth':[5,6,7,8],
    'min_samples_split': [2,3,4], 
    'min_samples_leaf': [1,2,3],
    'min_weight_fraction_leaf': [0.0,0.2,0.3,0.4,0.5],
    'max_features': ['sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3; it was the same as 'sqrt' for classifiers
}

In [0]:
CV_dtc = GridSearchCV(estimator=dtc, param_grid=params_grid, cv= 5)
CV_dtc.fit(X_train,y_train)

In [0]:
CV_dtc.best_params_

In [0]:
CV_dtc.best_score_

In [0]:
from sklearn.tree import DecisionTreeClassifier
model= DecisionTreeClassifier(criterion='gini', max_depth=5,
                             min_samples_split=4,min_samples_leaf=2, min_weight_fraction_leaf=0.0, splitter='best',
                             max_features='sqrt')  # 'auto' (removed in scikit-learn 1.3) was equivalent to 'sqrt'
model.fit(X_train,y_train)
prediction_tree=model.predict(X_test)
print('--------------The Accuracy of the model----------------------------')
print('The accuracy of the DecisionTree Classifier is',round(accuracy_score(prediction_tree,y_test)*100,2))
result_tree=cross_val_score(model,all_features,Targeted_feature,cv=5,scoring='accuracy')
print('The cross validated score for Decision Tree classifier is:',round(result_tree.mean()*100,2))
y_pred_DTC = cross_val_predict(model,all_features,Targeted_feature,cv=5)
sns.heatmap(confusion_matrix(Targeted_feature,y_pred_DTC),annot=True,fmt='3.0f',cmap="summer")
plt.title('Confusion_matrix', y=1.05, size=15)

#### KNN

In [0]:
knn = KNeighborsClassifier()

In [0]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

In [0]:
paramms_grid = {
    'n_neighbors': [2,3,4,5,6],
    'weights': ['uniform', 'distance'],
    'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [28,29,30,31,32], 
    'p': [1,2]
}

In [0]:
CV_knn = GridSearchCV(estimator=knn, param_grid=paramms_grid, cv= 5)
CV_knn.fit(X_train,y_train)

In [0]:
CV_knn.best_params_

In [0]:
CV_knn.best_score_

In [0]:
model= KNeighborsClassifier(algorithm='brute', leaf_size=28, n_neighbors=6, p=1, weights='uniform')
model.fit(X_train,y_train)
prediction_knn=model.predict(X_test)
print('--------------The Accuracy of the model----------------------------')
print('The accuracy of the KNN Classifier is',round(accuracy_score(prediction_knn,y_test)*100,2))
result_knn=cross_val_score(model,all_features,Targeted_feature,cv=5,scoring='accuracy')
print('The cross validated score for KNN classifier is:',round(result_knn.mean()*100,2))
y_pred_KNN = cross_val_predict(model,all_features,Targeted_feature,cv=5)
sns.heatmap(confusion_matrix(Targeted_feature,y_pred_KNN),annot=True,fmt='3.0f',cmap="summer")
plt.title('Confusion_matrix', y=1.05, size=15)