# Predict Survival on Titanic Datasets Using Classification Methods (Part 2)

## 3. Apply Machine Learning Algorithms
In module 1, I started with exploratory data analysis (EDA) to understand the problem and uncover the information hidden in each feature. I then re-engineered the existing features and created new ones so that the dataset is better described for the machine learning models.

In this module, our goal is to identify the relationship between survival (the target variable) and the other features. The table of contents is as follows:

### Table of Contents
* 3.0 Import Packages
* 3.1 Read Dataset
* 3.2 Dummy Variables Encoding
* 3.3 Feature Normalization (Feature Scaling)
* 3.4 Split Dataset
* 3.5 Introduction of Evaluation Metrics
* 3.6 Machine Learning Methods
* 3.7 Model Comparison
* 3.8 Conclusion

###  3.0 Import Packages

In [37]:
import pandas as pd
from pandas import get_dummies
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_curve

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%matplotlib inline

### 3.1 Read Dataset 

In [38]:
Data = pd.read_csv("Titanic_Data_Preparation.csv")
# The 'Index' column of TestDataIndex.csv holds the PassengerId values of the test rows
TestIndex = pd.read_csv("TestDataIndex.csv", header=None, names=['0', 'Index'])
Data.shape  # 1309 x 9
Data.head()

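| Unnamed: 0 | PassengerId | Age | Embarked | Fare | Pclass | Sex | Survived | Title | NumFamily |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | S | 7.25 | 3 | male | 0 | Mr | 1 |
| 1 | 2 | 38.0 | C | 71.2833 | 1 | female | 1 | Mrs | 1 |
| 2 | 3 | 26.0 | S | 7.925 | 3 | female | 1 | Miss | 0 |
| 3 | 4 | 35.0 | S | 53.1 | 1 | female | 1 | Mrs | 1 |
| 4 | 5 | 35.0 | S | 8.05 | 3 | male | 0 | Mr | 0 |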


### 3.2 Dummy Variables Encoding
As we can see above, several categorical variables are stored as text values: 'Sex' (male, female), 'Embarked' (S, C, Q) and 'Title' (Mr, Mrs, Miss, Master, Others). Whatever these values represent, the challenge is deciding how to use them in the analysis. Some machine learning algorithms can handle categorical variables without any manipulation, but many others cannot. How do we turn these text attributes into numeric values for further processing? Read http://pbpython.com/categorical-encoding.html if you want to learn about more methods for dealing with categorical variables.

At first, we might think of converting the categorical variables into ordinal numbers. However, these categories have no natural order, while ordinal numbers imply that some levels rank higher than others. I think this method would bias the algorithms.

To keep the data consistent and make the ML results easy to compare, we will create dummy variables: each category value becomes a new column that holds 1 or 0 (True/False). This has the benefit of not weighting a value improperly, but the downside of adding more columns to the dataset. To avoid multicollinearity (the dummy variable trap), I remove the first dummy variable of each categorical variable: 'Sex_female', 'Title_Master' and 'Embarked_C'.

In [39]:
Data0 = pd.get_dummies(Data,columns = ['Sex','Title','Embarked'])
Data1 = Data0.drop(['Sex_female', 'Title_Master', 'Embarked_C'],axis = 1)
Data1.head()

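| Unnamed: 0 | PassengerId | Age | Fare | Pclass | Survived | NumFamily | Sex_male | Title_Miss | Title_Mr | Title_Mrs | Title_Others | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 7.25 | 3 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 38.0 | 71.2833 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 26.0 | 7.925 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 35.0 | 53.1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 5 | 35.0 | 8.05 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |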
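As a side note, pandas can drop the first level of each encoded column directly through the `drop_first` parameter of `get_dummies`. The sketch below uses a small made-up frame (the `toy` rows are hypothetical, not taken from the Titanic file) to show that it removes the same kind of columns as the manual `drop` above. It is only an illustration, not the code used in this project.

```python
import pandas as pd

# Hypothetical toy rows, used only to illustrate the encoding.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True removes the first level (in sorted order) of each column,
# here 'Sex_female' and 'Embarked_C'. dtype=int stores 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

A row with all dummy columns equal to 0 (the second row here) then stands for the dropped baseline levels, female and C.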


### 3.3 Feature Normalization (Feature Scaling)
We also notice that some features are on different scales, such as 'Age' and 'Fare', which are both numerical variables. Methods such as KNN, SVM and neural networks are sensitive to the scale of their inputs: a feature with a larger range can dominate the others and distort the result. So it is necessary to transform the input features onto the same scale. Here we use mean-std normalization
<img src = "normalized.png" width = 280>
instead of min-max normalization
<img src = "minmax.png" width = 340>
as follows:

In [40]:
Data2 = Data1.copy()
Data2.Age=(Data2.Age-Data2.Age.mean())/Data2.Age.std()
Data2.Fare=(Data2.Fare-Data2.Fare.mean())/Data2.Fare.std()
Data2.head()

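| Unnamed: 0 | PassengerId | Age | Fare | Pclass | Survived | NumFamily | Sex_male | Title_Miss | Title_Mr | Title_Mrs | Title_Others | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.575596 | -0.503099 | 3 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 2 | 0.613562 | 0.734463 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | -0.278307 | -0.490053 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 4 | 0.390594 | 0.383037 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 5 | 0.390594 | -0.487637 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |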
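To make the difference between the two normalizations concrete, here is a minimal sketch that applies both to the five 'Fare' values shown in the first table. The variable names `z_scaled` and `minmax_scaled` are my own for illustration. Because it uses only five rows, the numbers will not match the table above, which was scaled with the mean and standard deviation of all 1309 rows.

```python
import pandas as pd

# The five 'Fare' values from the head() output in section 3.1.
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])

# Mean-std normalization: zero mean and unit standard deviation.
# Note: pandas' .std() is the sample standard deviation (ddof=1).
z_scaled = (fare - fare.mean()) / fare.std()

# Min-max normalization: values squeezed into the range [0, 1].
minmax_scaled = (fare - fare.min()) / (fare.max() - fare.min())

print(pd.DataFrame({"fare": fare, "z": z_scaled, "minmax": minmax_scaled}))
```

If you use scikit-learn's `StandardScaler` instead, expect slightly different values: it divides by the population standard deviation (ddof=0) rather than the sample one.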


### 3.4 Split Dataset

In module 1, we did exploratory data analysis on the whole dataset. Now that the dataset is ready, we split it into a train set and a test set, train a number of machine learning algorithms on the train set, and then evaluate and compare the models on the test set to find the best algorithm based on the evaluation metric scores.

In [41]:
Testlist = TestIndex.Index.tolist()
# Rows whose PassengerId is listed in TestIndex form the test set; the rest form the train set
Train = Data2.loc[~Data2.PassengerId.isin(Testlist)]
Test = Data2.loc[Data2.PassengerId.isin(Testlist)]
print('Dimension of trainset is', Train.shape)  # 891 x 13
print('Dimension of testset is: ', Test.shape)  # 418 x 13

# Separate the target column (Survived) from the feature columns
Train_y = Train.Survived
Train_X = Train[['PassengerId', 'Age', 'Fare', 'Pclass', 'NumFamily', 'Sex_male', 'Title_Miss','Title_Mr', 'Title_Mrs', 'Title_Others','Embarked_Q', 'Embarked_S']]
Test_y = Test.Survived
Test_X = Test[['PassengerId', 'Age', 'Fare', 'Pclass', 'NumFamily', 'Sex_male', 'Title_Miss','Title_Mr', 'Title_Mrs', 'Title_Others','Embarked_Q', 'Embarked_S']]

Dimension of trainset is (891, 13)
Dimension of testset is:  (418, 13)
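The split above relies on a boolean mask built with `isin`. The following sketch reproduces the idea on a tiny made-up frame (the rows and the `test_ids` list are hypothetical) and checks that the two parts are disjoint and together cover every row.

```python
import pandas as pd

# Hypothetical miniature dataset and test-ID list, for illustration only.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
})
test_ids = [2, 5]

# True for rows that belong to the test set, False otherwise.
is_test = df["PassengerId"].isin(test_ids)

train = df.loc[~is_test]
test = df.loc[is_test]

print("train rows:", train["PassengerId"].tolist())
print("test rows:", test["PassengerId"].tolist())
```

Computing the mask once and negating it keeps the two selections consistent with each other.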


### 3.5 Introduction of Evaluation Metrics
Before using multiple ML algorithms, let's talk about evaluation metrics. How do we decide whether an algorithm is a good one? If that is hard to answer, think about a simpler question: how do we tell whether a restaurant is a good one? Do we judge it on its hotpot? Sushi? Ramen? Service? Or even the in-store decoration? Different criteria give different results. This is what evaluation metrics do for ML algorithms.

Machine learning works on a constructive feedback principle: we apply an algorithm, get feedback from the metrics, make improvements (for example, try other tuning parameters) and repeat until we reach a desirable criterion. We use evaluation metrics to describe the performance of an algorithm. The House Sale in King County project is a regression problem, so we used RMSE as the metric. The Titanic Survival project is a binary classification problem, so we will use other types of metrics, such as **Confusion Matrix**, **Accuracy**, **Specificity**, **Sensitivity**, **Precision-Recall**, **F1-score**, **ROC-AUC**, etc.

**Confusion Matrix**

In a classification problem, after prediction, we can compare the predicted labels with the true labels to see whether our predictions are accurate.

<img src = "confusionmatrix.png" width = 550>
* a = True Positive
* b = False Positive 
* c = False Negative
* d = True Negative
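The four cells can be counted by hand from a list of true labels and a list of predicted labels. The sketch below does this in plain Python. The function name `confusion_counts` and the example labels are my own, chosen only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (a, b, c, d) = (TP, FP, FN, TN) for a binary problem."""
    a = b = c = d = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            a += 1  # true positive
        elif pred == positive:
            b += 1  # false positive: predicted positive, actually negative
        elif truth == positive:
            c += 1  # false negative: predicted negative, actually positive
        else:
            d += 1  # true negative
    return a, b, c, d


# Hypothetical labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

a, b, c, d = confusion_counts(y_true, y_pred)
print("TP, FP, FN, TN =", (a, b, c, d))
```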

**False Positive Rate** (Type I Error): b/(b+d)

**False Negative Rate** (Type II Error): c/(a+c)

**Specificity** (True Negative Rate): Number of items correctly identified as negative out of total actual negatives

Specificity = d/(b+d)

**Sensitivity** (True Positive Rate/Recall): Number of items correctly identified as positive out of total actual positives

Sensitivity = a/(a+c)

**Accuracy**: Percentage of total items classified correctly

Accuracy = (a+d)/(a+b+c+d)

**Precision**:  Number of items correctly identified as positive out of total items identified as positive

Precision = a/(a+b)

**Recall**: Number of items correctly identified as positive out of total actual positives (the same quantity as Sensitivity)

Recall = a/(a+c)

**F1-score**: A harmonic mean of precision and recall

F1 = 2*Precision*Recall/(Precision + Recall)
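The formulas above translate directly into code. Here is a minimal sketch that computes them from the four counts a, b, c and d. The helper names `safe_div` and `classification_metrics` and the example counts are assumptions made for illustration. Returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(a, b, c, d):
    """Metrics from counts a=TP, b=FP, c=FN, d=TN, following the formulas above."""
    precision = safe_div(a, a + b)
    recall = safe_div(a, a + c)  # identical to sensitivity
    return {
        "accuracy": safe_div(a + d, a + b + c + d),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(d, b + d),
        "false_positive_rate": safe_div(b, b + d),
        "false_negative_rate": safe_div(c, a + c),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts, chosen so that the metrics differ from each other.
metrics = classification_metrics(a=40, b=10, c=20, d=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When reading counts from scikit-learn instead, note that `confusion_matrix` with labels 0 and 1 returns `[[TN, FP], [FN, TP]]` (rows are true labels, columns are predicted labels), so its layout may differ from the figure above.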



This is a classification problem:
<img src = "Classification.png" width = 500>

