# Titanic: Machine Learning from Disaster
![alt text](stower_titanic.jpg)

## Introduction
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. This data science project gives an introduction to using Python to apply various machine learning techniques to the RMS Titanic dataset and predict which passengers survived the tragedy.

### The Data

The dataset consists of 11 predictor variables and a binary target variable, `Survived`. The features are:

* PassengerId
* Pclass
* Name
* Sex
* Age
* SibSp
* Parch
* Ticket
* Fare
* Cabin
* Embarked

There are 891 passenger records: 342 passengers survived and 549 did not.
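These counts also give a useful reference point: a model that always predicts "did not survive" would already be right for the majority class. A quick arithmetic sketch, using only the counts quoted above (the variable names are illustrative):

```python
# Class counts quoted in the text above
survived = 342
perished = 549
total = survived + perished

survival_rate = survived / total       # share of passengers who survived
baseline_accuracy = perished / total   # accuracy of always predicting "did not survive"

print(f"Total passengers: {total}")
print(f"Survival rate: {survival_rate:.3f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained later should beat this baseline to be considered useful.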


In [1]:
import pandas as pd
from pandas import Series,DataFrame 

import matplotlib.pyplot as plt
%pylab inline

import seaborn as sns; sns.set()


Populating the interactive namespace from numpy and matplotlib


In [2]:
#Load the dataset
titanic_df=pd.read_csv("train.csv")
# Show the first 10 rows
titanic_df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


### Analyze
* We know that women and children were more likely to survive, so Age and Sex are probably good predictors.
* It is also logical to think that passenger class might affect the outcome, as first-class passengers were closer to the deck of the ship.
* Fare is closely tied to passenger class and will probably be highly correlated with it, but it might add some additional information.
* The number of siblings/spouses (SibSp) and parents/children (Parch) will probably be correlated with survival one way or the other, as there are either more people to help you, or more people to think about and try to save.
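A quick way to probe hunches like these is to compare survival rates across groups. The sketch below uses only the ten passengers shown in the `head(10)` output earlier, typed in by hand, so the rates are illustrative and not representative of all 891 records:

```python
import pandas as pd

# Sex, Pclass and Survived for the ten passengers shown in the head(10) output
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female", "female"],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the survival rate within each group
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same `groupby` on `titanic_df` would give the rates for the full training set.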



### Statistical summary 

In [3]:
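# Summary statistics for every numeric column
# A statistical summary of PassengerId is not meaningful
# Pclass should be treated as a categorical variable
# Age is important: the mean age is about 29.7 and the minimum is 0.42
# Age has only 714 non-null values, so it needs imputation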
titanic_df.describe()


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


### Imputation


In [4]:
# Age: fill missing values with the median age
age_median = titanic_df["Age"].median()
titanic_df["Age"] = titanic_df["Age"].fillna(age_median)
# Embarked: fill missing values with "S", then encode as integers
# C = Cherbourg, Q = Queenstown, S = Southampton
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
titanic_df.loc[titanic_df["Embarked"] == "S", "Embarked"] = 0
titanic_df.loc[titanic_df["Embarked"] == "C", "Embarked"] = 1
titanic_df.loc[titanic_df["Embarked"] == "Q", "Embarked"] = 2
# Sex: encode male as 0 and female as 1
titanic_df.loc[titanic_df["Sex"] == "male", "Sex"] = 0
titanic_df.loc[titanic_df["Sex"] == "female", "Sex"] = 1

In [None]:
# Check the encoded values of Embarked
# ("S" is converted to 0, "C" to 1 and "Q" to 2)

titanic_df.Embarked.unique()
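The imputation and encoding steps can also be written with `fillna` and `map`. This is a minimal sketch on a hypothetical four-row frame standing in for `train.csv`; the frame `toy` and its values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values, standing in for train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": ["S", "C", None, "Q"],
    "Sex": ["male", "female", "female", "male"],
})

# Fill missing ages with the median of the observed ages
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Fill missing ports with "S", then map each category to an integer code
toy["Embarked"] = toy["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

print(toy)
```

Using `map` with an explicit dictionary keeps the category-to-code mapping in one place and yields numeric columns directly.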

In [None]:
# Columns used to predict the target:
# Pclass, Sex, Age, SibSp, Parch, Fare and Embarked
x = titanic_df[["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]].values
y = titanic_df["Survived"].values


In [None]:
#Histogram
fig, ax = plt.subplots(1, 2, figsize=(22,4))
ax[0].hist(titanic_df.loc[titanic_df['Survived'] == 1, "Age"], color = 'chartreuse', edgecolor='black')
ax[0].set_title('Survived ', fontsize = 18)
ax[0].set_xlabel('Age', fontsize = 18)
ax[1].hist(titanic_df.loc[titanic_df['Survived'] == 0, "Age"], color = 'crimson', edgecolor='black')
ax[1].set_title('Perished ', fontsize = 18)
ax[1].set_xlabel('Age', fontsize = 18)

# Density plot (fill=True replaces the deprecated shade=True)
fig, ax = plt.subplots(figsize=(22,4))
sns.kdeplot(titanic_df.loc[titanic_df['Survived'] == 1, "Age"], fill=True, color=sns.xkcd_rgb["green"], label="Survived", ax=ax)
sns.kdeplot(titanic_df.loc[titanic_df['Survived'] == 0, "Age"], fill=True, color=sns.xkcd_rgb["red"], label="Perished", ax=ax)
ax.set_title('Age Distribution: Survived vs. Perished', fontsize = 18)
ax.set_ylabel('Density', fontsize = 18)
ax.set_xlabel('Age', fontsize = 18)
plt.legend(fontsize=24)
plt.show()

# Train/Test split


In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into train (70%) and test (30%) sets using all features
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.3)
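To see what the split produces without loading the data, the sketch below runs the same call on placeholder arrays with the shape of the Titanic feature matrix (891 rows, 7 columns). The names `x_demo`, `y_demo`, `X_tr` and `X_te` are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the Titanic features and target
x_demo = np.zeros((891, 7))
y_demo = np.zeros(891)

# Same arguments as the cell above: 30% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(x_demo, y_demo, random_state=1, test_size=0.3)

print("train:", X_tr.shape, "test:", X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.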

# Scale/Standardize

In [None]:
from sklearn.preprocessing import StandardScaler

# scale/standardize features
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test) 
X_train_std[:5]
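Standardizing subtracts each feature's training mean and divides by its training standard deviation; the test set is transformed with those same training statistics. A small sketch on hypothetical toy arrays that checks the scaler against the formula computed by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrices (rows = passengers, columns = features)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

scaler = StandardScaler()
train_std = scaler.fit_transform(train)   # learn mean/std from the training data only
test_std = scaler.transform(test)         # reuse the training mean/std on the test data

# The same result computed by hand: z = (x - mean) / std
mean = train.mean(axis=0)
std = train.std(axis=0)                   # population standard deviation (ddof=0)
manual_test_std = (test - mean) / std

print(test_std)
print(manual_test_std)
```

Fitting the scaler on the training data only keeps information about the test set from leaking into the model.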

# Essentials of Modeling (Overfitting and Cross Validation)

The aim of machine learning is generalization.

* We want to train on different data than we make predictions on. This is critical if we want to avoid overfitting.

OVERFITTING: The model fits itself to "noise" rather than signal.
* Every dataset has its own quirks that don't exist in the full population.
* Suppose we are asked to predict the top speed of a car from its horsepower and other characteristics. If our dataset happened to contain mostly cars with unusually high top speeds, we would create a model that overstates speed.
* The way to figure out whether our model is doing this is to evaluate its performance on data that it hasn't used for training.
* Every machine learning algorithm can overfit, although some, such as linear regression, are much less prone to it.
* If we evaluate our algorithm on the same dataset it was trained on, we cannot tell whether it is overfitting, because it has already seen those examples.

CROSS VALIDATION
* A simple way to detect and limit overfitting
* To cross validate, we split the data into a number of parts (folds)

Ex: Split the data into 3 parts
* Combine the first 2 parts, train a model, and make predictions on the third
* Combine the first and third parts, train a model, and make predictions on the second
* Combine the second and third parts, train a model, and make predictions on the first
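The three-part scheme above can be sketched with scikit-learn's `KFold`. The nine sample indices are hypothetical, and `KFold` holds out the parts in order (first, second, third), which is the same idea as the example with the rounds visited in a different order:

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine hypothetical sample indices, split into 3 folds without shuffling
samples = np.arange(9)
kfold = KFold(n_splits=3)

folds = []
for train_idx, test_idx in kfold.split(samples):
    # Train on two parts, hold out the remaining part for prediction
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx, "test:", test_idx)
```

Every sample is used for testing exactly once and for training in the other two rounds.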


# scikit-learn
* Estimators are grouped into modules by model family, and we import them from there
* Declaring an estimator object exposes its methods to us
* These include `fit`, `transform` and `predict`
* Parameters are defined when declaring the estimator object

Note: We do not have to change our code to use different models.

* `model.fit`: trains the model
* `model.predict`: makes predictions with the trained model, for example on the test set
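The shared interface can be illustrated with two unrelated model families. This is a sketch on synthetic data from `make_classification`, not the Titanic dataset, and the names `X_demo`, `y_demo` and `predictions` are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data used only for illustration (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

# Parameters are set when the estimator object is declared
models = [LogisticRegression(C=1), KNeighborsClassifier(n_neighbors=5)]

predictions = {}
for model in models:
    model.fit(X_demo, y_demo)                    # train
    name = type(model).__name__
    predictions[name] = model.predict(X_demo)    # same call for every model
    print(name, predictions[name][:5])
```

Swapping in another classifier only requires changing the entries of the `models` list.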

# LOGISTIC REGRESSION
Logistic regression takes the output of a linear regression and maps it so that it lies between 0 and 1.
This mapping is done with the logistic (sigmoid) function, which is the inverse of the logit function.
Passing any value to the logistic function maps it to a value between 0 and 1 by squeezing extreme values.

We can use a scikit-learn helper function to do cross validation and evaluation for us.
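A minimal sketch of this squeezing in plain Python. The function names `logistic` and `logit` are chosen here for illustration; the first maps any real number into the interval (0, 1), and the second undoes that mapping:

```python
import math

def logistic(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function: map a probability to its log-odds."""
    return math.log(p / (1.0 - p))

# Extreme inputs are squeezed towards 0 or 1; an input of 0 maps to 0.5
for z in (-10, -1, 0, 1, 10):
    print(z, round(logistic(z), 4))
```

The output of the logistic function can be read as the predicted probability of the positive class, here survival.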

# ISSUE WITH LINEAR REGRESSION
* We use linear regression to predict continuous values such as salaries or housing prices
* When the dependent variable is categorical, we should not use it

  Ex: Will a customer default? Will a customer buy?
  Such an outcome can be coded as a number (0 or 1), but it is binary rather than continuous.

  Note: In these cases we use logistic regression. In our case the outcome is also binary and categorical in nature.

Alternatives to logistic regression for binary classification include decision trees, support vector machines, etc.
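One way to see the problem is to fit both models to a binary outcome and query them outside the training range. The sketch below uses a hypothetical single-feature toy dataset; linear regression returns values below 0 and above 1, while logistic regression returns valid probabilities:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: a single feature and a binary outcome
X_demo = np.arange(10).reshape(-1, 1)
y_demo = (X_demo.ravel() >= 5).astype(int)

# New points well outside the range seen during training
X_new = np.array([[-5], [15]])

linear_pred = LinearRegression().fit(X_demo, y_demo).predict(X_new)
logistic_prob = LogisticRegression().fit(X_demo, y_demo).predict_proba(X_new)[:, 1]

print("Linear regression output:", linear_pred)
print("Logistic regression P(y=1):", logistic_prob)
```

Because the linear output is unbounded, it cannot be interpreted as a probability, which is why a classifier is used for the survival target.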


# Models

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C = 1)

# k-Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)

# Support Vector Machine
from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1)

# Random Forest (ensemble of Decision Trees); n_estimators=1 means a single tree is used here
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=1, random_state=0)

# Neural Network
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier()

# Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(log_reg, X_train_std, y_train, scoring='accuracy', cv=10)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())               # mean accuracy across the folds
    print("Standard deviation:", scores.std())  # spread of the accuracy across the folds
    
display_scores(scores)

Each model family takes its own hyperparameters:
* Support vector machine: `kernel`, `C`, `gamma`
* Linear classifiers: `alpha`, `penalty`
* Random forest: `n_estimators`

Note: When we move across algorithms, the syntax does not change. Once we define the estimator object, the same few methods become available to us.
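As a sketch of how these hyperparameters are passed, the estimators below are declared with explicit values. `SGDClassifier` is used here as one example of a linear classifier that exposes `alpha` and `penalty`; that choice, the variable names and all the values are illustrative assumptions, not tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# The values below are arbitrary illustrations, not tuned settings
svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")
linear_clf = SGDClassifier(alpha=0.0001, penalty="l2")
forest_100 = RandomForestClassifier(n_estimators=100, random_state=0)

# get_params() returns the hyperparameters an estimator was declared with
for est in (svm_rbf, linear_clf, forest_100):
    print(type(est).__name__, est.get_params())
```

Only the constructor arguments differ between models; `fit` and `predict` are called the same way afterwards.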


# 10 Fold Cross validation

In [None]:
# Collect all the classifiers in a list
classifiers = [log_reg, knn, svm, forest, nn]

# Run 10-fold cross validation for each classifier and keep the fold scores
model_scores = []
for clf in classifiers:
    model_scores.append(cross_val_score(clf, X_train_std, y_train, scoring='accuracy', cv=10))

In [None]:
# Combine the fold scores into one DataFrame (one row per model, one column per fold)
combination_df=pd.DataFrame(model_scores,columns=[1,2,3,4,5,6,7,8,9,10],index=["LR","KNN","SVM","Forest","NN"])

In [None]:
# Mean accuracy of each model across the 10 folds
combination_df["Mean"]=combination_df.mean(axis=1)
combination_df



# Boxplot

In [None]:
# Boxplot comparing the cross-validation accuracy of the models
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(18, 8))
# rectangular box plot
bplot_models = axes.boxplot(model_scores, vert=True, patch_artist=True)

# fill with colors - Models
colors_d = ["lightgreen", "lightyellow", "lime", "yellow", "yellowgreen"]
for patch, color in zip(bplot_models['boxes'], colors_d):
    patch.set_facecolor(color)

# add axis labels and name each box after its model
axes.yaxis.grid(True)
axes.set_xticks([y+1 for y in range(len(model_scores))])
axes.set_xticklabels(["LR", "KNN", "SVM", "Forest", "NN"])
axes.set_xlabel('Classification Models', fontsize=18)
axes.set_ylabel('Accuracy', fontsize=18)
axes.set_ylim((.4, 1.1))
axes.set_title('Classification Accuracy using All Features', fontsize = 18)

In [None]:
# Random Forest Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Fit the random forest on the standardized training data, then predict on the standardized test data
forest.fit(X_train_std, y_train)
y_pred = forest.predict(X_test_std)
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)

fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')

# Precision, Recall, and F1 scores
from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_true=y_test, y_pred=y_pred)
recall = recall_score(y_true=y_test, y_pred=y_pred)
f1 = f1_score(y_true=y_test, y_pred=y_pred)

#print('Precision: {:.3f}, Recall: {:.3f}, F1: {:.3f}'.format(precision, recall, f1))
print(classification_report(y_test, y_pred, target_names=["Not Survived", "Survived"]))
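To make the reported metrics concrete, the sketch below computes precision, recall and F1 by hand from the four confusion-matrix counts. The labels are hypothetical and are not the model's actual predictions:

```python
# Hypothetical true and predicted labels (1 = survived, 0 = did not survive)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # survivors predicted as survivors
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-survivors predicted as survivors
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # survivors predicted as non-survivors
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-survivors predicted as non-survivors

precision_demo = tp / (tp + fp)   # of those predicted to survive, how many did
recall_demo = tp / (tp + fn)      # of those who survived, how many were found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Precision: {precision_demo:.3f}, Recall: {recall_demo:.3f}, F1: {f1_demo:.3f}")
```

These are the same quantities that `classification_report` lists for the "Survived" class.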