# Titanic data analysis and machine learning
## By Jérémy P. Schneider, consultant at IBM interactive

As someone new to this field, I decided to take on my first challenge with the Titanic dataset from Kaggle (https://www.kaggle.com/c/titanic).

## My setup
For this work I used a computer with:
* Windows 7
* Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
* 16 GB RAM (4 x 4 GB)
* NVIDIA GeForce GTX 1050 Ti

This notebook is split into the following parts:

## [Setting up our environment:](#ENV)

### [Libraries Import](#lib)
### [Personalized functions and tools](#func)
### [Data import](#import)

## [Data cleaning:](#CLEAN)
### [Data analysis and handling missing data](#analysis)
### [Converting all data to numbers](#convert)


## [Test and train of a model](#test_train)

## [Conclusion and trial over the test sample](#conclusion)

<a id="ENV"></a>
# Setting up our environment
<a id="lib"></a>
## Libraries Import

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
import numpy as np
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

<a id="func"></a>
## Personalized functions and tools

Since the names in the Titanic dataset include a title, I borrowed Ertuğrul Demir's idea of using a dictionary to map each title to a broader category (https://www.kaggle.com/datafan07/titanic-eda-and-several-modelling-approaches).

In [2]:
dict_titre = {
    'Capt': 'Dr/Clerc/Mil',
    'Col': 'Dr/Clerc/Mil',
    'Major': 'Dr/Clerc/Mil',
    'Jonkheer': 'Honor',
    'Don': 'Honor',
    'Dona': 'Honor',
    'Sir': 'Honor',
    'Dr': 'Dr/Clerc/Mil',
    'Rev': 'Dr/Clerc/Mil',
    'the Countess': 'Honor',
    'Mme': 'Mrs',
    'Mlle': 'Miss',
    'Ms': 'Mrs',
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Master': 'Master',
    'Lady': 'Honor'
}
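As a rough illustration of how such a dictionary can be applied, the sketch below pulls the title out of a few names and maps it to its category. It assumes the `Surname, Title. First names` layout of the `Name` column. The names `title_map`, `sample`, `Title` and `TitleGroup` are hypothetical, and `title_map` is only a shortened copy of `dict_titre` so that the snippet runs on its own.

```python
import pandas as pd

# Shortened copy of dict_titre, repeated here so the sketch is self-contained.
title_map = {
    'Mr': 'Mr',
    'Mrs': 'Mrs',
    'Miss': 'Miss',
    'Mlle': 'Miss',
    'Rev': 'Dr/Clerc/Mil',
    'the Countess': 'Honor',
}

# A few illustrative names in the "Surname, Title. First names" layout.
sample = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Byles, Rev. Thomas Roussel Davids",
]})

# The title is the text between the first comma and the next period.
sample["Title"] = (
    sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

# Map each raw title to its broader category; unknown titles become NaN.
sample["TitleGroup"] = sample["Title"].map(title_map)
print(sample[["Title", "TitleGroup"]])
```

Any title missing from the dictionary comes out as `NaN`, which makes it easy to spot values the mapping does not cover yet.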

<a id="import"></a>
## Data import

In [3]:
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")

<a id="CLEAN"></a>
# Data cleaning

<a id="analysis"></a>
## Data analysis and handling missing data
Kaggle's description tells us how the data is organized and defines our goal:

The data has been split into two groups:

* training set (train.csv)
* test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

### Data Dictionary

|Variable       |Definition                                  |Key                                            |
| ------------- |:------------------------------------------:| ---------------------------------------------:|
|survival       |Survival                                    |0 = No, 1 = Yes                                |
|pclass         |Ticket class                                |1 = 1st, 2 = 2nd, 3 = 3rd                      |
|sex            |Sex                                         |                                               |
|Age            |Age in years                                |                                               |
|sibsp          |# of siblings / spouses aboard the Titanic  |                                               |
|parch          |# of parents / children aboard the Titanic  |                                               |
|ticket         |Ticket number                               |                                               |
|fare           |Passenger fare                              |                                               |
|cabin          |Cabin number                                |                                               |
|embarked       |Port of Embarkation                         |C = Cherbourg, Q = Queenstown, S = Southampton |

### Variable Notes

pclass: A proxy for socio-economic status (SES)
* 1st = Upper
* 2nd = Middle
* 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way:
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

### Checking data

If we look at how the data is stored in our files, we see that it matches this description.

* First, the train data

In [4]:
df_train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [14]:
print("In the train data we have {} rows and {} columns".format(df_train.shape[0], df_train.shape[1]))

In the train data we have 891 rows and 12 columns


* Then, the test data

In [5]:
df_test.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [15]:
print("In the test data we have {} rows and {} columns".format(df_test.shape[0], df_test.shape[1]))

In the test data we have 418 rows and 11 columns


As we can see, the two sets share the same structure (the test set only lacks the `Survived` column, which is what we have to predict):
* int and float for numerical values.
* object for text data.

We'll need to transform the text data into numerical values in order to feed our machine learning model.
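To give an idea of what this transformation can look like, here is a minimal sketch on a tiny made-up frame. It is not necessarily the encoding used later in this notebook. The frame `sample` and the 0/1 coding of `Sex` are assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up frame with the same kinds of columns as train.csv.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Columns that are not numeric are the ones that still need encoding.
text_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]
print(text_columns)

# A two-valued column can simply be mapped to 0/1 (arbitrary choice of coding).
sample["Sex"] = sample["Sex"].map({"male": 0, "female": 1})

# Unordered categories get one indicator column per value.
sample = pd.get_dummies(sample, columns=["Embarked"])
print(sample.columns.tolist())
```

Indicator columns avoid suggesting an order between ports, which a plain 0/1/2 coding of `Embarked` would imply.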

After checking the data types, we need to find where values are missing. A simple way is to compute the fraction of missing values in each column:

In [22]:
df_train.isna().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

In [23]:
df_test.isna().mean()

PassengerId    0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.205742
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.002392
Cabin          0.782297
Embarked       0.000000
dtype: float64
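Fractions are easy to compare between columns, but for the nearly complete ones the raw count is easier to read. As a small sketch on a made-up frame (the names `sample` and `missing` are hypothetical), both can be shown side by side:

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for df_train or df_test.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", None, "C", "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# Count and fraction of missing values per column.
missing = pd.DataFrame({
    "count": sample.isna().sum(),
    "fraction": sample.isna().mean(),
})

# Keep only the columns that actually have gaps, worst first.
missing = missing[missing["count"] > 0].sort_values("fraction", ascending=False)
print(missing)
```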

From this step we know:
* __Cabin__ is missing in more than 75% of the rows of both sets, so we won't use it.
* __Age__ is missing in about 20% of the rows of both sets, so we'll try to replace the missing values.
* __Embarked__ is missing for 2 passengers in the train set (0.002245 × 891 ≈ 2).
* __Fare__ is missing for 1 passenger in the test set (0.002392 × 418 ≈ 1).
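One common way to act on these observations is sketched below on a small made-up frame. Filling with the median and the most frequent value is an assumption made for illustration, not necessarily the strategy used in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Small made-up frame with the same kinds of gaps as the real data.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Cabin is mostly empty, so the column is dropped.
sample = sample.drop(columns=["Cabin"])

# Numeric gaps: fill with the median, which is robust to extreme values.
sample["Age"] = sample["Age"].fillna(sample["Age"].median())
sample["Fare"] = sample["Fare"].fillna(sample["Fare"].median())

# Categorical gap: fill with the most frequent value.
sample["Embarked"] = sample["Embarked"].fillna(sample["Embarked"].mode()[0])

print(sample.isna().sum())
```

With separate train and test sets, the fill values would normally be computed on the train set and reused on the test set, so that no information from the test set leaks into the model.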