This notebook applies machine learning tools to predict which passengers survived the sinking of the Titanic.
Kaggle provides the data split into two groups:
* training set (train.csv)
* test set (test.csv)

The training set is used to build the machine learning models. It includes the outcome (the ground truth) for each passenger.

The models are based on features such as passengers' gender and class, and on new features built using feature engineering.

The test set is used to see how well the models perform on unseen data. The ground truth is not provided for the test set; for each passenger in it, the model predicts whether that passenger survived the sinking of the Titanic.

**Importing the Libraries**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Algorithms
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

# Any results you write to the current directory are saved as output.

**Data Dictionary and Variable Notes**

**PassengerId:** A unique index for passenger rows. It starts at 1 for the first row and increments by 1 for each new row.

**Survived:** Shows if the passenger survived or not. 1 stands for survived and 0 stands for not survived.

**Pclass:** Ticket class: 1 = First class ticket, 2 = Second class ticket, 3 = Third class ticket. It is a proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

**Name:** Passenger's name. The name also contains a title: "Mr" for a man, "Mrs" for a married woman, "Miss" for an unmarried woman or girl, "Master" for a boy.

**Sex:** Passenger's sex, either male or female.

**Age:** Age in years. "NaN" values in this column indicate that the passenger's age was not recorded. Age is fractional if it is less than 1; an estimated age has the form xx.5.

**SibSp:** Number of siblings or spouses travelling with each passenger. The dataset defines family relations this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored).

**Parch:** Number of parents or children travelling with each passenger. The dataset defines family relations this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, so Parch = 0 for them.

**Ticket:** Ticket number.

**Fare:** The amount the passenger paid for the journey.

**Cabin:** Cabin number of the passenger. "NaN" (not a number) values in this column indicate that the cabin number was not recorded. These missing values are filled in during feature engineering.

**Embarked:** Port where the passenger boarded, given by the first character of the port name: C = Cherbourg, Q = Queenstown, S = Southampton.

**Loading the Dataset**

In [None]:
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

**Data Exploration/Analysis**

In [None]:
train.describe()

To see what the training set looks like, we print its first 8 rows. train is our data frame for the training set, and the pandas head function returns the first 8 rows.

In [None]:
train.head(8)

In [None]:
test.head()

In [None]:
train.shape

In [None]:
test.shape

The Survived column is not present in the test data. We have to train our classifier on the training data and generate predictions (Survived) for the test data. The test set contains the other 11 columns; the only missing field is Survived, which we will predict.

**Missing Value:**

In [None]:
train.info()

We can see that Age value is missing for many rows.

Out of 891 rows, the Age value is present only in 714 rows.

Similarly, Cabin values are also missing in many rows. Only 204 out of 891 rows have Cabin values.

Calling isnull().sum() gives the number of null values in each column.

In [None]:
train.isnull().sum()

There are 177 rows with missing Age, 687 rows with missing Cabin and 2 rows with missing Embarked information.
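
The counts are easier to compare when shown next to the share of rows they represent. The sketch below is a minimal illustration on a made-up frame; `missing_summary` and the `toy` data are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # keep only columns that have missing values, largest count first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# toy data for illustration only (not the Kaggle file)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

On the real data the same helper could be called as missing_summary(train).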

**Relationship between Features and Survival**

We analyze the relationship between different features and survival, using different kinds of plots to see how the chance of survival varies with each feature's values.

In [None]:
survived = train[train['Survived'] == 1]
not_survived = train[train['Survived'] == 0]

print(f"Survived: {len(survived)} ({len(survived) / len(train) * 100:.1f}%)")
print(f"Not Survived: {len(not_survived)} ({len(not_survived) / len(train) * 100:.1f}%)")
print(f"Total: {len(train)}")

**Pclass vs. Survival**

Higher class passengers have better survival chance.

Total number of passengers in each passenger class.

In [None]:
train.Pclass.value_counts()

Number of passengers who survived and who did not survive in each passenger class.

In [None]:
train.groupby('Pclass').Survived.value_counts()

In [None]:
sns.barplot(x='Pclass', y='Survived', data=train)

**Sex vs. Survival**

Females have better survival chance.

Total number of female and male passengers

In [None]:
train.Sex.value_counts()

In [None]:

train.groupby('Sex').Survived.value_counts()

In [None]:

train[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean()

In [None]:
sns.barplot(x='Sex', y='Survived', data=train)

In [None]:
women = train.loc[train.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", round(rate_women * 100, 1))

In [None]:
men = train.loc[train.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", round(rate_men * 100, 1))

**Pclass & Sex vs. Survival**

The table shows the number of males and females in each Pclass.
The diagram shows that males make up the largest share among the 3rd Pclass passengers.

In [None]:
tab = pd.crosstab(train['Pclass'], train['Sex'])
print (tab)

tab.div(tab.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('Pclass')
plt.ylabel('Percentage')

In [None]:
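# factorplot was renamed to catplot (and size to height) in newer seaborn releases
sns.catplot(x='Sex', y='Survived', hue='Pclass', kind='point', height=4, aspect=2, data=train)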

Above plot shows that:

Women from 1st and 2nd Pclass have almost 100% survival chance. 

Men from 2nd and 3rd Pclass have only around 10% survival chance.
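
The rates read off the plot can also be tabulated directly. The following is a small sketch on hand-made rows (the `toy` frame is illustrative, not the Kaggle data) that computes the mean of Survived for every Sex and Pclass combination with a pivot table.

```python
import pandas as pd

# toy rows for illustration only
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# the mean of a 0/1 column is the survival rate of each group
rates = toy.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
print(rates)
```

Replacing toy with train would give the survival rate for each of the six Sex and Pclass groups in the training set.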

**Pclass, Sex & Embarked vs. Survival**

In [None]:
# factorplot was renamed to catplot in newer seaborn releases
sns.catplot(x='Pclass', y='Survived', hue='Sex', col='Embarked', kind='point', data=train)

From the above plot, it can be seen that:

Almost all females from Pclass 1 and 2 survived.

Females dying were mostly from 3rd Pclass.

Males from Pclass 1 only have slightly higher survival chance than Pclass 2 and 3

**Embarked vs. Survived**

In [None]:
train.Embarked.value_counts()

In [None]:
train.groupby('Embarked').Survived.value_counts()

In [None]:
train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean()

In [None]:
sns.barplot(x='Embarked', y='Survived', data=train)

**Parch vs. Survival**

In [None]:
train.Parch.value_counts()

In [None]:
train.groupby('Parch').Survived.value_counts()

In [None]:
train[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean()

In [1]:
sns.barplot(x='Parch', y='Survived', ci=None, data=train)


**SibSp vs. Survival**

In [None]:
train.SibSp.value_counts()

In [None]:
train.groupby('SibSp').Survived.value_counts()

In [None]:
train[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean()

In [None]:
sns.barplot(x='SibSp', y='Survived', ci=None, data=train) 

**Required Data and Feature Engineering:**

Several steps are needed before the machine learning algorithms can make good use of the features (columns):

We need to convert a lot of features into numeric ones.

Features have different ranges, so we put them on the same scale.

Some features contain missing values (NaN = not a number), which we need to fill.

**Data Preprocessing and Feature Selection**

Unnecessary columns/features are dropped and only the useful ones are kept. The PassengerId column is dropped only from the training set, because PassengerId is needed in the test set for the submission.

In [None]:
train = train.drop(['PassengerId'], axis=1)

**Cabin:**
We extract the deck from the cabin number to create a new feature, Deck, and then convert it to a numeric value.

In [None]:
import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train, test]

for dataset in data:
    dataset['Cabin'] = dataset['Cabin'].fillna("U0")
    dataset['Deck'] = dataset['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    dataset['Deck'] = dataset['Deck'].map(deck)
    dataset['Deck'] = dataset['Deck'].fillna(0)
    dataset['Deck'] = dataset['Deck'].astype(int)

train = train.drop(['Cabin'], axis=1)
test = test.drop(['Cabin'], axis=1)
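
To make the mapping concrete, here is a small sketch of the same idea on a handful of made-up cabin strings. It uses the vectorised `str.extract` in place of the lambda, which is a stylistic substitution and not the notebook's own code; `deck_codes` and `cabins` are illustrative names.

```python
import pandas as pd

deck_codes = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}

# toy cabin values: a missing one, an unmapped deck letter, and a multi-part entry
cabins = pd.Series(["C85", None, "E46", "T", "F G73"])

# missing cabins become "U0" (unknown), then the first run of letters is the deck
deck_letter = cabins.fillna("U0").str.extract(r"([A-Za-z]+)", expand=False)

# letters that are not in the mapping fall back to 0
deck = deck_letter.map(deck_codes).fillna(0).astype(int)
print(deck.tolist())
```

A missing cabin ends up as 8 (the code for "U"), while a letter outside the mapping ends up as 0.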

**Age:**

We fill the Null values of age with a random number between (mean_age-std_age) and (mean_age+std_age).

In [None]:
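data = [train, test]

# use the training-set statistics for both sets so they are imputed the same way
mean = train["Age"].mean()
std = train["Age"].std()

for dataset in data:
    is_null = dataset["Age"].isnull().sum()

    # random integer ages between (mean - std) and (mean + std)
    rand_age = np.random.randint(int(mean - std), int(mean + std), size=is_null)

    # fill only the missing ages, then store the column back as integers
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice.astype(int)

train["Age"].isnull().sum()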
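
Because the fill values are random, the imputed ages change from run to run. A minimal sketch of a reproducible variant follows, assuming a seeded NumPy generator; `impute_age`, the seed and the toy ages are hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd


def impute_age(ages, mean, std, rng):
    """Fill missing ages with random integers in [mean - std, mean + std)."""
    filled = ages.copy()
    missing = filled.isnull()
    filled[missing] = rng.integers(int(mean - std), int(mean + std), size=missing.sum())
    return filled.astype(int)


rng = np.random.default_rng(0)  # fixed seed, so reruns give the same ages
toy_ages = pd.Series([22.0, np.nan, 38.0, np.nan, 54.0])
imputed = impute_age(toy_ages, toy_ages.mean(), toy_ages.std(), rng)
print(imputed.tolist())
```

Observed ages are left unchanged; only the missing entries receive a drawn value.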

**Embarked:**

There are 2 empty values for Embarked column.

In [None]:
train['Embarked'].unique()

Number of passengers for each Embarked category

In [None]:
train.Embarked.value_counts()

In [None]:
train['Embarked'].describe()

Category "S" has the maximum number of passengers, so we replace the "NaN" values with "S".

In [None]:
common_value = 'S'
data = [train, test]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)
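
The most common value does not have to be typed in by hand. As a hedged sketch on made-up values (`ports` is an illustrative series, not the notebook's column), pandas can look it up with `mode()`:

```python
import pandas as pd

# toy Embarked values with one missing entry
ports = pd.Series(["S", "C", None, "S", "Q"])

# mode() ignores missing values and returns the most frequent entries
common_value = ports.mode()[0]
filled = ports.fillna(common_value)
print(common_value, filled.tolist())
```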

**Converting Features:**

In [None]:
train.info()

**Fare:**

We convert "Fare" from float to int64 using the "astype()" function pandas provides, after filling any missing values with 0:

In [None]:
data = [train, test]

for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)

**Name:**

We extract Titles from the Name.

In [None]:
data = [train, test]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in data:
   
    # the title is the word that ends with a period, e.g. "Mr."
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

    # group uncommon titles together and merge spelling variants
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                                 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

    # convert titles to numbers; unknown titles become 0
    dataset['Title'] = dataset['Title'].map(titles)
    dataset['Title'] = dataset['Title'].fillna(0)
train = train.drop(['Name'], axis=1)
test = test.drop(['Name'], axis=1)
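
The title pipeline can be traced on a few invented names (they follow the "Surname, Title. Given names" layout but are not rows from the dataset). This sketch condenses the replacements into a dictionary; `synonyms` and `names` are illustrative.

```python
import pandas as pd

title_codes = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
rare = ["Lady", "Countess", "Capt", "Col", "Don", "Dr",
        "Major", "Rev", "Sir", "Jonkheer", "Dona"]
synonyms = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}

# invented names for illustration only
names = pd.Series([
    "Smith, Mr. John", "Doe, Mlle. Anne", "Brown, Dr. Alice",
    "Lee, Master. Tom", "Grey, Mme. Marie", "Kim, Prof. Jin",
])

raw_title = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
clean_title = raw_title.replace(rare, "Rare").replace(synonyms)

# titles missing from title_codes (here "Prof") become 0
title = clean_title.map(title_codes).fillna(0).astype(int)
print(title.tolist())
```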

**Sex:**

Here Sex value is converted to 0 and 1.

In [None]:
genders = {"male": 0, "female": 1}
data = [train, test]

for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

**Ticket:**

In [None]:
train['Ticket'].describe()

We drop Ticket from the dataset.

In [None]:
train = train.drop(['Ticket'], axis=1)
test = test.drop(['Ticket'], axis=1)

**Embarked:**

Here we convert the 'Embarked' feature into a numeric one.

In [None]:
ports = {"S": 0, "C": 1, "Q": 2}
data = [train, test]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)

**Feature Categories:**

**Age**:

We convert Age from float to integer and then replace each age with the number of its age group.

In [None]:
data = [train, test]
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[(dataset['Age'] > 40) & (dataset['Age'] <= 66), 'Age'] = 6
    dataset.loc[ dataset['Age'] > 66, 'Age'] = 6

train['Age'].value_counts()
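
The same grouping can be written more compactly with `pd.cut`. This is a sketch of an equivalent alternative, not the notebook's code; the bin edges are copied from the conditions above, and the last bin is open-ended because every age above 40 is assigned to group 6.

```python
import numpy as np
import pandas as pd

# toy ages chosen to sit on and around the bin edges
ages = pd.Series([5, 11, 12, 18, 19, 25, 30, 35, 50, 70])

# right-closed intervals: (-inf, 11], (11, 18], ..., (40, inf)
bins = [-np.inf, 11, 18, 22, 27, 33, 40, np.inf]
age_group = pd.cut(ages, bins=bins, labels=False)
print(age_group.tolist())
```

Unlike the chain of `.loc` assignments, this does not overwrite the column step by step, so the result does not depend on the order of the conditions.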

**Fare**
Here we put Fare into groups.

In [None]:
train.head(10)

In [None]:
data = [train, test]

for dataset in data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare']   = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare']   = 4
    dataset.loc[ dataset['Fare'] > 250, 'Fare'] = 5
    dataset['Fare'] = dataset['Fare'].astype(int)

**Creating new Features**

**Age_Class**

In [None]:
data = [train, test]
for dataset in data:
    dataset['Age_Class']= dataset['Age']* dataset['Pclass']

**Fare per Person**

In [None]:
data = [train, test]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    # not_alone is 1 when the passenger travels with at least one relative, else 0
    dataset['not_alone'] = (dataset['relatives'] > 0).astype(int)
train['not_alone'].value_counts()

In [None]:
for dataset in data:
    dataset['Fare_Per_Person'] = dataset['Fare']/(dataset['relatives']+1)
    dataset['Fare_Per_Person'] = dataset['Fare_Per_Person'].astype(int)
train.head(10)
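
A short worked example of the two derived columns, on made-up rows (the `toy` frame is illustrative). Fare here is the grouped fare code, as in the notebook, and the division result is truncated to an integer.

```python
import pandas as pd

# toy rows: Fare holds the fare group (0 to 5), not the ticket price
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2], "Fare": [2, 0, 5]})

toy["relatives"] = toy["SibSp"] + toy["Parch"]
# the passenger counts as one person, hence relatives + 1
toy["Fare_Per_Person"] = (toy["Fare"] / (toy["relatives"] + 1)).astype(int)
print(toy)
```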

**Building Machine Learning Models**
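We build multiple classifiers and compare their results. Because the test set has no ground truth, we compare the algorithms by their accuracy on the training set. These scores are optimistic, since each model is evaluated on the same data it was fitted to.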

In [None]:
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test  = test.drop("PassengerId", axis=1).copy()

**Stochastic Gradient Descent (SGD):**

In [None]:
sgd = linear_model.SGDClassifier(max_iter=5, tol=None)
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)

sgd.score(X_train, Y_train)

acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)

**Random Forest:**

In [None]:
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

Y_prediction = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)

**Logistic Regression:**

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

Y_pred = logreg.predict(X_test)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)

**K Nearest Neighbor:**

In [None]:
knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, Y_train)  
Y_pred = knn.predict(X_test) 
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)

**Gaussian Naive Bayes:**

In [None]:
gaussian = GaussianNB() 
gaussian.fit(X_train, Y_train)  
Y_pred = gaussian.predict(X_test)  
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)

**Perceptron:**

In [None]:
perceptron = Perceptron(max_iter=5)
perceptron.fit(X_train, Y_train)

Y_pred = perceptron.predict(X_test)

acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)

**Linear Support Vector Machine:**

In [None]:
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)

Y_pred = linear_svc.predict(X_test)

acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)

**Decision Tree**

In [None]:
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)  
Y_pred = decision_tree.predict(X_test)  
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)

**Comparing accuracy rate of models**

In [None]:
results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 
              'Decision Tree'],
    'Score': [acc_linear_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_decision_tree]})
result = results.sort_values(by='Score', ascending=False)
result = result.set_index('Score')
result.head(9)

**K-Fold Cross Validation:**

In [None]:
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, Y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())
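
Cross-validation matters because accuracy measured on the training data tends to be too high. The sketch below illustrates the gap on synthetic data from `make_classification`; the dataset, its parameters and the seeds are assumptions made for the example and say nothing about the Titanic scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic binary classification problem with some label noise
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# each of the 10 folds is scored on rows the model did not see while fitting
cv_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")

# accuracy on the very rows the model was fitted to
train_acc = model.fit(X, y).score(X, y)

print("Training accuracy:", round(train_acc, 3))
print("Cross-validated mean:", round(cv_scores.mean(), 3))
```

The cross-validated mean is the more realistic estimate of how the model would do on unseen passengers.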

**Random Forest**

**Feature Importance**

Now we compare importance of each feature by looking at how much the tree nodes reduce impurity on average across all trees in the forest.

In [None]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)
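
To show what these numbers mean, here is a minimal sketch on synthetic data in which only one column carries information. The column names (`signal`, `noise_a`, `noise_b`) and the data are invented for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])

# the label depends only on the "signal" column
y = (X["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# impurity-based importances are normalised, so they sum to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

The informative column dominates, while the noise columns receive only a small share.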

not_alone and Parch are less important so we drop them.

In [None]:
train  = train.drop("not_alone", axis=1)
test  = test.drop("not_alone", axis=1)

train  = train.drop("Parch", axis=1)
test  = test.drop("Parch", axis=1)

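**Retraining random forest:**

In [None]:
# rebuild the feature matrices so that the dropped columns are actually removed
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test = test.drop("PassengerId", axis=1).copy()

random_forest = RandomForestClassifier(n_estimators=100, oob_score=True)
random_forest.fit(X_train, Y_train)
Y_prediction = random_forest.predict(X_test)

acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
print(acc_random_forest, "%")

# out-of-bag accuracy: estimated from the samples each tree did not see
print("oob score:", round(random_forest.oob_score_ * 100, 2), "%")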

**Submission**

In [None]:
submission = pd.DataFrame({
    "PassengerId":test["PassengerId"],
    "Survived":Y_prediction
})
submission.to_csv('submission.csv',index=False)
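
Before uploading, it can help to check the file's shape. The helper below is a hypothetical sketch (`check_submission` is not a Kaggle or pandas function) that verifies the two expected columns, the 0/1 values and the passenger ids, shown here on toy frames.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        return ["columns must be PassengerId and Survived"]
    problems = []
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived must be 0 or 1")
    if sorted(submission["PassengerId"]) != sorted(expected_ids):
        problems.append("PassengerId values do not match the test set")
    return problems


# toy frames for illustration only
good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 2, 0]})

print(check_submission(good, [892, 893, 894]))
print(check_submission(bad, [892, 893, 894]))
```

In the notebook this could be called as check_submission(submission, test["PassengerId"]).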

**Result and Conclusions**

In this project, the data was first explored, the missing data processed, and the important features identified. During data preprocessing, missing values were imputed, features were converted into numeric ones, values were grouped into categories, and new features were created.
Different methods were tested on this dataset, and a comparison of the results showed that Decision Tree and Random Forest had the highest accuracy on the training set.
In the end, Random Forest was chosen because it limits overfitting better than a Decision Tree classifier, and cross-validation was applied to it.
