# Titanic: Machine Learning from Disaster

## Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

## Objective

Predict what sorts of people were likely to survive.

Inspiration:
   - https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic
   - https://www.kaggle.com/poonaml/titanic/titanic-survival-prediction-end-to-end-ml-pipeline
   - https://www.kaggle.com/helgejo/titanic/an-interactive-data-science-tutorial
   - https://www.kaggle.com/arthurlu/titanic/exploratory-tutorial-titanic

## Table of contents


- [Description of the data set](#Description-of-the-data-set)
- [First look at the data](#First-look-at-the-data)
    - [Import Libraries](#Import-Libraries)
    - [Load Data](#Load-Data)
    - [Brief summaries](#Brief-summaries)
- [Visualization](#Visualization)
    - [Basic insight of the data](#Basic-insight-of-the-data)
    - [Focus on the mean of survival](#Focus-on-the-mean-of-survival)
- [Missing Values](#Missing-Values)
    - [Embarked](#Embarked)
    - [Fare](#Fare)
    - [Age with Median](#Age-with-median)
- [Features engineering](#Features-engineering)
    - [Name](#Name)
    - [Family](#Family)
    - [Name](#Name)
- [Visualization new Features](#Visualization-new-features)
    - [Visualization Name](#Visualization-Name)
    - [Visualization Family](#Visualization-Family)
- [Features Encoding](#Features-Encoding)
    - [Categorial features encoding](#Categorial features encoding)
        - [Label Encoding](#Label-Encoding)
        - [One Hot Encoding](#One-Hot-Encodingn)   
    - [Feature Scalling](#Feature-Scalling)
    - [Data Preparation](#Data-Preparation)
- [Features Importance](#Features-Importance)
    - [Correlation - Numerical label](#Correlation---Numerical-label)
    - [Correlation - One Hot Encoder](#Correlation---One-Hot-Encoder)
    - [LDA](#LDA)
    - [Select K Best](#Select-K-Best)
- [Model Selection](#Model-Selection)
    - [Helper function](#Helper-function)
    - [Gradient Boosting Classifier](#Gradient-Boosting-Classifier)
    - [Random Forest Classifier](#Random-Forest-Classifier)
    - [Adaboost](#Adaboost)
    - [SVC](#SVC)
    - [Logistic Regression](#Logistic-Regression)
    - [Voting Classifier](#Voting-Classifier)
- [Submission](#Submission)



- [Feature Selection](#Feature-selection)
- [Feature Selection](#Feature-selection)

### Description of the data set

## First look at the data

[[back to top](#Table-of-contents)]

### Import Libraries

In [None]:
# Dataframe
import pandas as pd

# Visualization
from IPython.display import display, HTML
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pprint import pprint
sns.set_style('whitegrid')
%matplotlib inline

# Sklearn
import sklearn as sk
# Pipeline
from sklearn.pipeline import Pipeline
# Preprocessing
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder,StandardScaler
# Features and model selection
from sklearn.cross_validation import KFold
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
# Metric
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier, 
                              AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier)

from sklearn.metrics import f1_score,accuracy_score

# Warning
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
warnings.filterwarnings("ignore", category=RuntimeWarning)

### Load data

In [None]:
# Load data directly into a dataframe
df_train=pd.read_csv("Data/Titanic/train.csv")
df_test=pd.read_csv("Data/Titanic/test.csv")

# Get a look at the first rows
df_train.head()

In [None]:
df_test.head()

Variable Description
    - Survived: Survived (1) or died (0)
    - Pclass: Passenger's class
    - Name: Passenger's name
    - Sex: Passenger's sex
    - Age: Passenger's age
    - SibSp: Number of siblings/spouses aboard
    - Parch: Number of parents/children aboard
    - Ticket: Ticket number
    - Fare: Fare
    - Cabin: Cabin
    - Embarked: Port of embarkation
    
    Source of information : https://www.kaggle.com/c/titanic/data

### Brief summaries

In [None]:
print("----------------------------------Informations for the training set----------------------------------\n")
df_train.info()
print('\n',df_train.isnull().sum())
print("\n----------------------------------Informations for the testing set ----------------------------------\n")
df_test.info()
print('\n',df_test.isnull().sum())

Note that:
    - No Survived feature on the testing set
    - Cabin feature is mostly null --> Will be dropped
    - Embarked feature has a few missing values
    - Some Ages are missing --> Will need to be completed or drop the missing rows
    - Survived and Pclass should be treated as object because they are qualitative

In [None]:
# Dropping Cabin, Ticket and PassengerId
df_train=df_train.drop(['Cabin','PassengerId','Ticket'], axis=1)

df_test=df_test.drop(['Cabin','Ticket'], axis=1)
PassengerId = df_test['PassengerId']

In [None]:
# Changing the type of Pclass and Survived 
df_train['Pclass']=df_train['Pclass'].astype(object)
df_train['Survived']=df_train['Survived'].astype(object)

df_test['Pclass']=df_test['Pclass'].astype(object)

In [None]:
# Basic statistical information about quantitative and qualitative columns

print("----------------------------------Informations for the training set----------------------------------\n")
# Quantitative
display(df_train.describe())
# Qualitative
display(df_train.describe(include=['object']))
print("----------------------------------Informations for the testing set----------------------------------\n")
# Quantitative
display(df_test.describe())
# Qualitative
display(df_test.describe(include=['object']))

###  Visualization

[[back to top](#Table-of-contents)]

#### Basic insight of the data

In [None]:
# Qualitative Data : [Survived, Sex, Embarked, Pclass] 
fig, (axis1,axis2,axis3,axis4) = plt.subplots(1,4,figsize=(15,5))
sns.countplot(x='Survived', data=df_train, ax=axis1)
sns.countplot(x='Sex', data=df_train, ax=axis2)
sns.countplot(x='Embarked', data=df_train, ax=axis3)
sns.countplot(x='Pclass', data=df_train, ax=axis4)
fig.suptitle("Basic representation of Qualitative data with count")

In [None]:
# Discrete Quantitative Data : [SibSp, Parch] 
fig2, (axis1,axis2) = plt.subplots(1,2,figsize=(15,5))
sns.countplot(df_train['SibSp'],ax=axis1)
sns.countplot(df_train['Parch'],ax=axis2)
fig2.suptitle("Basic representation of Discrete Quantitative data with count")

In [None]:
# Continuous Quantitative Data : [Age, Fare]
fig3, (axis1,axis2) = plt.subplots(2,1,figsize=(15,10))
sns.distplot(df_train['Age'].dropna(), bins=80, ax=axis1)
sns.distplot(df_train['Fare'], ax=axis2)
fig3.suptitle("Basic representation of Continuous Quantitative data with probability")

In [None]:
# Age distribution within Sex and Pclass
fig3, ((axis1),(axis2),(axis3)) = plt.subplots(3,1,figsize=(15,15))

# Age distribution
df_train.Age.plot(kind='kde',ax=axis1)
axis1.set_xlabel("Age")    
axis1.set_title("Age Distribution")

# Age distribution within Sex
df_train.Age[df_train.Sex == 'male'].plot(kind='kde',ax=axis2,)    
df_train.Age[df_train.Sex == 'female'].plot(kind='kde',ax=axis2)
axis2.set_xlabel("Age")    
axis2.set_title("Age Distribution within Sex")
axis2.legend(('Male', 'Female'))

# Age distribution within Pclass
df_train.Age[df_train.Pclass == 1].plot(kind='kde',ax=axis3)    
df_train.Age[df_train.Pclass == 2].plot(kind='kde',ax=axis3)
df_train.Age[df_train.Pclass == 3].plot(kind='kde',ax=axis3)
axis3.set_xlabel("Age")    
axis3.set_title("Age Distribution within Classes")
axis3.legend(('1st Class', '2nd Class','3rd Class'))

#### Focus on the mean of survival

In [None]:
# [Sex, Pclass, Embarked] by mean of survival
fig4, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(15,5))
sns.barplot(x='Sex',y='Survived', data=df_train, ax=axis1)
sns.barplot(x='Embarked',y='Survived', data=df_train, ax=axis2)
sns.barplot(x='Pclass',y='Survived', data=df_train, ax=axis3)
fig4.suptitle("Representation of features linked to the target : Survived ")

In [None]:
# [SibSp, Parch] by mean of survival
fig6, (axis1,axis2) = plt.subplots(1,2,figsize=(15,5))
sns.barplot(x='SibSp',y='Survived', data=df_train, ax=axis1)
sns.barplot(x='Parch',y='Survived', data=df_train, ax=axis2)
fig6.suptitle("Representation of features linked to the target : Survived ")

In [None]:
# Cross relation betwen [Sex, Pclass, Embarked] by mean of survival
fig5, ((axis1,axis2),(axis3,axis4),(axis5,axis6)) = plt.subplots(3,2,figsize=(15,15))
sns.barplot(x='Sex',y='Survived',hue='Pclass', data=df_train, ax=axis1)
sns.barplot(x='Sex',y='Survived',hue='Embarked', data=df_train, ax=axis2)

sns.barplot(x='Pclass',y='Survived',hue='Sex', data=df_train, ax=axis3)
sns.barplot(x='Pclass',y='Survived',hue='Embarked', data=df_train, ax=axis4)

sns.barplot(x='Embarked',y='Survived',hue='Sex', data=df_train, ax=axis5)
sns.barplot(x='Embarked',y='Survived',hue='Pclass', data=df_train, ax=axis6)

fig5.suptitle("Cross Representation of the features linked to the target : Survived")

In [None]:
# Age by mean of survival
fig=sns.barplot(x='Age', y='Survived', data=df_train)

In [None]:
# Age

# Kernel density of survivor and non survivor by Age
g1 = sns.FacetGrid( df_train , hue='Survived' , aspect=4)
g1.map( sns.kdeplot , 'Age' , shade= True )
g1.add_legend()

# Kernel density of survivor and non survivor by Age and Sex 
g2 = sns.FacetGrid( df_train , hue='Survived' , aspect=4 , row = 'Sex')
g2.map( sns.kdeplot , 'Age' , shade= True )
g2.add_legend()

# Kernel density of survivor and non survivor by Age and Pclass
g3 = sns.FacetGrid( df_train , hue='Survived' , aspect=4 , row = 'Pclass')
g3.map( sns.kdeplot , 'Age' , shade= True )
g3.add_legend()

In [None]:
# Fare

# Scatterplot Fare & Age
g = sns.FacetGrid(df_train, hue="Survived", size=6)
g=g.map(plt.scatter, "Fare", "Age",edgecolor="w").add_legend()
g.add_legend()
g.set(xlim=(0, 550))

# Scatterplot Fare & Age by Sex
g = sns.FacetGrid(df_train, hue="Survived", col="Sex", margin_titles=True,
                palette="Set1",hue_kws=dict(marker=["^", "v"]),size=6)
g=g.map(plt.scatter, "Fare", "Age",edgecolor="w").add_legend()
g.add_legend()
g.set(xlim=(0, 300))

# Scatterplot Fare & Age by Pclass
g = sns.FacetGrid(df_train, col="Pclass", hue="Survived", size=4)
g=g.map(plt.scatter, "Fare", "Age",edgecolor="w").add_legend()
g.add_legend()
g.set(xlim=(0, 300))

# Scatterplot Fare & Age by Pclass & Sex
g = sns.FacetGrid(df_train, hue="Survived", col="Pclass", row="Sex" ,margin_titles=True,
                  palette={1:"red", 0:"grey"},size=5)
g=g.map(plt.scatter, "Fare", "Age",edgecolor="w").add_legend()
g.set(xlim=(0, 300))


### Features engineering

[[back to top](#Table-of-contents)]

#### Combined DataFrame

In [None]:
# Creation of a dataframe with train and test for Feature Engineering
def get_combined_data():
    # reading train data
    train = pd.read_csv("Data/Titanic/train.csv")
    
    # reading test data
    test = pd.read_csv("Data/Titanic/test.csv")

    # extracting and then removing the targets from the training data 
    targets = train.Survived
    #train.drop('Survived',axis=1,inplace=True)

    # merging train data and test data for future feature engineering
    combined = train.append(test)
    combined.reset_index(inplace=True)
    combined.drop('index',axis=1,inplace=True)
    
    return combined

def recover_train_test_target(combined):
    
    train0 = pd.read_csv("Data/Titanic/train.csv")
    
    targets = train0.Survived
    train = combined.ix[0:890]
    test = combined.ix[891:]
    
    return train,test,targets

In [None]:
combined = get_combined_data()

#### Name

In [None]:
# Name

#Create feature for the length of name 
combined["Name_Length"] = combined["Name"].apply(lambda x: len(x))

# Create a categorical feature Name_Size
combined['Name_Size']=pd.cut(combined['Name_Length']
                            ,bins=[0,20,40,60,90]
                            ,labels=["Short","Medium","Long","Very Long"])

# Extract the title from each name
combined['Title'] = combined['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())

# Map for aggregated titles
Title_Dictionary = {
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "Master" :    "Master",
                    "Lady" :      "Royalty",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dona":       "Royalty",
                    "the Countess":"Royalty"
                    }
    
# Mapping
combined['Title_aggr'] = combined.Title.map(Title_Dictionary)

#### Family

In [None]:
# Family

# Creation of a feature Number_of_relatives = SibSp + Parch
combined['Number_of_relatives']=combined['SibSp']+combined['Parch']

# Creation of a categorical feature Size_Family
combined.loc[combined['Number_of_relatives'] == 0, 'Size_Family'] = 'Alone'
combined.loc[ (combined['Number_of_relatives'] > 0) 
            & (combined['Number_of_relatives'] < 4), 'Size_Family'] = 'Small'
combined.loc[combined['Number_of_relatives'] > 3, 'Size_Family'] = 'Big'

We can create 3 categories : 
    - Alone = 0
    - Small = [1,2,3]
    - Big = > 3

#### Cabin

In [None]:
combined.shape

In [None]:
# Mostly NaN values
combined.Cabin.isnull().sum()

In [None]:
# Create a category Unknown
combined['Cabin'] = combined.Cabin.fillna( 'U' )

In [None]:
# Get the first letter
combined["Deck"]=combined.Cabin.str[0]
Sort_Deck=combined.groupby('Deck').size()
Sort_Deck.sort_values(ascending=False,inplace=True)
display(Sort_Deck)

In [None]:
sns.barplot(x='Deck', y='Survived', data=combined)

In [None]:
# Creation of 4 categories
Deck_Dictionary = {
                    "E":        "C1",
                    "D":       "C1",
                    "B":         "C1",
                    "C" :        "C2",
                    "F" :       "C2",
                    "G" :      "C3",
                    "A":       "C3",
                    "U":        "U",
                    "T":        "U"
                    }

# Mapping
combined['Deck_aggr'] = combined.Deck.map(Deck_Dictionary)

sns.barplot(x='Deck_aggr', y='Survived', data=combined)

#### Ticket

In [None]:
combined.Ticket.isnull().sum()

In [None]:
Sort_Ticket=combined.groupby('Ticket').size()
Sort_Ticket.sort_values(ascending=False,inplace=True)
display(Sort_Ticket)

In [None]:
# a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
def cleanTicket( ticket ):
    ticket = ticket.replace( '.' , '' )
    ticket = ticket.replace( '/' , '' )
    ticket = ticket.split()
    ticket = map( lambda t : t.strip() , ticket )
    ticket = list(filter( lambda t : not t.isdigit() , ticket ))
    if len( ticket ) > 0:
        return ticket[0]
    else: 
        return 'XXX'

In [None]:
combined[ 'Ticket_clean' ] = combined[ 'Ticket' ].map( cleanTicket )

Sort_Clean_Ticket=combined.groupby('Ticket_clean').size()
Sort_Clean_Ticket.sort_values(ascending=False,inplace=True)
display(Sort_Clean_Ticket)

In [None]:
g = plt.figure(figsize=(20,10)) 
g = sns.barplot(x='Ticket_clean',data=combined, y='Survived',order=['XXX','PC','CA','A5','SOTONOQ','WC','SCPARIS','STONO','A4','FCC','C','SOC','SOPP','STONO2','SCParis','SCAH','PP','LINE','WEP','FC','SOTONO2','PPP','SC','SWPP','SCA4','AS','AQ4','AQ3','SCA3','CASOTON','Fa','LP','SCOW','SOP','SP','STONOQ','A'])

In [None]:
Ticket_Dictionary = {
                    "XXX":        "T1",
                    "PC":      "T2",
                    "CA":         "T3",
                    "A5" :        "T4",
                    "SOTONOQ" :       "T4",
                    "WC" :      "T4",
                    "SCPARIS":       "T5",
                    "STONO":        "T5",
                    "A4":        "T1",
                    "FCC":       "T1",
                    "C":         "T1",
                    "SOC" :        "T1",
                    "SOPP" :       "T1",
                    "STONO2" :      "T1",
                    "SCParis":       "T1",
                    "SCAH":     "T1",
                    "PP":        "T1",
                    "LINE":        "T1",
                    "WEP":       "T1",
                    "FC":         "T1",
                    "SOTONO2" :        "T1",
                    "PPP" :       "T1",
                    "SC" :      "T1",
                    "SWPP":       "T1",
                    "SCA4":        "T1",
                    "AS":        "T1",
                    "AQ4":        "T1",
                    "AQ3":       "T1",
                    "SCA3":         "T1",
                    "CASOTON" :        "T1",
                    "Fa" :       "T1",
                    "LP" :      "T1",
                    "SCOW":       "T1",
                    "SOP":        "T1",
                    "SP" :       "T1",
                    "STONOQ" :      "T1",
                    "A":       "T1"
                    }

# Mapping
combined['Ticket_aggr'] = combined.Ticket_clean.map(Ticket_Dictionary)

sns.barplot(x='Ticket_aggr',y='Survived', data=combined)

### Missing Values

[[back to top](#Table-of-contents)]

#### Embarked

In [None]:
# Embarked

# Get the null rows where Embarked is null
display(combined[combined.Embarked.isnull()][['Fare', 'Pclass', 'Embarked']])

# Embarked missing values
combined.boxplot(column='Fare', by=['Embarked','Pclass'], figsize=(8,6))
plt.axhline(y=80, color='blue')

# Remplace null values by C because most people who are Pclass 1 and Fare 80 has Embarked from C
combined = combined.set_value(combined.Embarked.isnull(), 'Embarked', 'C')

#### Fare

In [None]:
# Fare

# Visualization of the fare which is missing
combined[combined.Fare.isnull()][['Pclass', 'Fare', 'Embarked']]
#df_test[(df_test.Pclass==3)&(df_test.Embarked=='S')].Fare.hist(bins=100)
combined.loc[(combined['Pclass']==3) & (combined['Embarked']=='S')].Fare.hist(bins=100,figsize=(8,6))

# Get and affect the median to the missing value
Fare_median=combined[(combined.Pclass==3) & (combined.Embarked=='S')].Fare.median()
#df_test = df_test.set_value(df_test.Fare.isnull(), 'Fare', Fare_median)
combined["Fare"].fillna(Fare_median, inplace=True)

#### Age with median

In [None]:
# Simply fill the nan values with median using Sex, Pclass and Title
grouped = combined.groupby(['Sex','Pclass','Title_aggr'])
age_median = grouped['Age'].median()
display(age_median)
combined["Age"] = combined.groupby(['Sex','Pclass','Title_aggr'])['Age'].transform(lambda x: x.fillna(x.median()))

#### Verification of missing values

In [None]:
combined.isnull().sum()

#### Split for visualization

In [None]:
# Split
df_train, df_test, targets = recover_train_test_target(combined)

# Dropping Cabin, Ticket and PassengerId
df_train = df_train.drop(['Cabin','PassengerId','Ticket'], axis=1)
df_test = df_test.drop(['Cabin','Ticket'], axis=1)

In [None]:
df_train.shape

In [None]:
df_train.columns

### Visualization new Features

[[back to top](#Table-of-contents)]

#### Visualization Name

In [None]:
# Vrack v1 a faire marcher
fig5, ((axis1,axis2),(axis3,axis4),(axis5,axis6),(axis7,axis8)) = plt.subplots(4,2,figsize=(15,15))

# Plot Name_Length
sns.countplot(x='Name_Length', data=df_train, ax=axis1)

# Plot Name_Length by mean of survival
sns.barplot(x='Name_Length', y='Survived', data=df_train, ax=axis2)

# Plot Name_Size by mean of survival
sns.barplot(x='Name_Size', data=df_train, order=["Short","Medium","Long","Very Long"], ax=axis3)

# Plot Name_Size by mean of survival
sns.barplot(x='Name_Size',y='Survived', data=df_train, order=["Short","Medium","Long","Very Long"], ax=axis4)

# Plot Title aggregate
sns.barplot(x='Title_aggr', data=df_train, ax=axis5)

# Display Title aggregate by mean of survival
sns.barplot(x='Title_aggr',y='Survived', data=df_train, ax=axis6)

# Display Title aggregate and Name Size by mean of survival
sns.barplot(x='Title_aggr',y='Survived', hue='Name_Size', data=df_train, ax=axis7)

# Display Title aggregate and Name Size by mean of survival
sns.barplot(x='Name_Size',y='Survived', hue='Title_aggr', data=df_train, ax=axis8)

In [None]:
# Vrack v2 a faire marcher
fig = plt.figure(figsize=(15, 5))
fig=sns.countplot(x='Name_Length', data=df_train)

fig=sns.barplot(x='Name_Length', y='Survived', data=df_train)

sns.barplot(x='Name_Size',y='Survived', data=df_train, order=["Short","Medium","Long","Very Long"])

# Display aggregate title by survived probability
fig1 = plt.figure(figsize=(15, 5))
fig1=sns.barplot(x='Title_aggr',y='Survived', data=df_train)

# Display aggregate title and Name Size by survived probability
fig2 = plt.figure(figsize=(15, 5))
fig2=sns.barplot(x='Title_aggr',y='Survived', hue='Name_Size', data=df_train)

# Display aggregate title and Name Size by survived probability
fig3 = plt.figure(figsize=(15, 5))
fig3=sns.barplot(x='Name_Size',y='Survived', hue='Title_aggr', data=df_train)

#### Visualization Family

In [None]:
# A afficher correctrement
fig1 = plt.figure(figsize=(15, 5))
fig1 = sns.countplot(x='Number_of_relatives', data=df_train)

fig2 = plt.figure(figsize=(15, 5))
fig2 = sns.barplot(x='Number_of_relatives',y='Survived', data=df_train)

sns.barplot(x='Size_Family',y='Survived', data=df_train, order=['Alone', 'Small', 'Big'])

# Number_of_relatives with Pclass, Sex, Embarked by mean of survival
fig7, ((axis1),(axis2),(axis3)) = plt.subplots(3,1,figsize=(20,20))
sns.barplot(x='Number_of_relatives',y='Survived',hue='Pclass', data=df_train, ax=axis1)
sns.barplot(x='Number_of_relatives',y='Survived',hue='Sex', data=df_train, ax=axis2)
sns.barplot(x='Number_of_relatives',y='Survived',hue='Embarked', data=df_train, ax=axis3)
fig7.suptitle("Cross Representation of the features linked to the target : Survived")

# Size_Family with Pclass, Sex, Embarked by mean of survival
fig8, ((axis1),(axis2),(axis3)) = plt.subplots(3,1,figsize=(20,20))
sns.barplot(x='Size_Family',y='Survived',hue='Pclass', data=df_train, order=['Alone', 'Small', 'Big'], ax=axis1,)
sns.barplot(x='Size_Family',y='Survived',hue='Sex', data=df_train, order=['Alone', 'Small', 'Big'], ax=axis2)
sns.barplot(x='Size_Family',y='Survived',hue='Embarked', data=df_train, order=['Alone', 'Small', 'Big'], ax=axis3)
fig8.suptitle("Cross Representation of the features linked to the target : Survived")

### Features Encoding

[[back to top](#Table-of-contents)]

In [None]:
display(combined.columns)
display(combined.isnull().sum())
display(combined.shape)
display(combined[["Embarked","Sex","Title_aggr","Size_Family","Name_Size","Pclass"]].head())

#### Categorial features encoding

##### Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Dataframe with numerical categorical feature
combined_num_cat = pd.DataFrame()

# LabelEncoder
labelEnc = LabelEncoder()

# Columns to apply
cat_vars=["Embarked","Sex","Title_aggr","Size_Family","Name_Size"]

for col in cat_vars:
    labelEnc.fit(np.unique(list(combined[col].values)))
    combined_num_cat[col]=labelEnc.transform(combined[col].astype('str'))
    
labelEnc.fit(np.unique(list(combined["Pclass"].values)))
combined_num_cat["Pclass"]=labelEnc.transform(combined["Pclass"].astype('int'))

In [None]:
combined_num_cat.head()

##### One Hot  Encoding

In [None]:
def one_hot(df_in, cols):
    df_out = pd.DataFrame()
    for each in cols:
        dummies = pd.get_dummies(df_in[each], prefix=each, drop_first=False)
        df_out = pd.concat([df_out, dummies], axis=1)
    return df_out

In [None]:
# Dataframe with binary categorical feature

# Columns to apply
cat_vars=['Embarked','Sex',"Title_aggr","Size_Family","Name_Size","Pclass"]
combined_One_Hot_Cat = one_hot(combined,cat_vars)

In [None]:
combined_One_Hot_Cat.head()

### Feature Scalling

[[back to top](#Table-of-contents)]

In [None]:
combined[['Fare','Age','Name_Length','Number_of_relatives']].head()

In [None]:
from sklearn import preprocessing

std_columns = ['Fare','Age']

combined_num_std = pd.DataFrame(combined[std_columns])

# StandardScaller process
std_scale = preprocessing.StandardScaler()
combined_num_std[std_columns] = std_scale.fit_transform(combined[std_columns].astype(float))

combined_num_std[std_columns].head()

### Data Preparation

[[back to top](#Table-of-contents)]

In [None]:
# Concat
combined_OH_Std = pd.concat([combined_num_std,combined_One_Hot_Cat],axis=1)
combined_Num_Std = pd.concat([combined_num_std,combined_num_cat],axis=1)

In [None]:
# Display shape
display(combined_OH_Std.shape)
display(combined_Num_Std.shape)

In [None]:
# Split into Train and Eval
Train_OH_Std, Eval_OH_Std, Target_OH_Std = recover_train_test_target(combined_OH_Std)
Train_Num_Std, Eval_Num_Std, Target_Num_Std = recover_train_test_target(combined_Num_Std)

In [None]:
# Display shape
display(Train_OH_Std.shape)
display(Eval_OH_Std.shape)
display(Target_OH_Std.shape)

In [None]:
# Select Data
data = Train_OH_Std
test_data = Eval_OH_Std
target = Target_OH_Std
columns_name = list(Train_OH_Std)

# Train & Test
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.20, random_state = 42,stratify=target)

# Dataframe of prediction
Prediction = pd.DataFrame()

pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 1000)

### Features Importance


[[back to top](#Table-of-contents)]

#### Correlation - Numerical label

In [None]:
# Concant data and target
Features_with_target = pd.concat([Train_Num_Std,target],axis=1)
# Correlation with target
Corr = pd.DataFrame()
Corr['Corr'] = Features_with_target.corr()["Survived"]
Corr.sort_values('Corr',ascending=False,inplace=True)
display(Corr)
# A réfléchir au sens en regardant la formule de corr pour les catégories le numerical label ordinales et non ordinales

#### Correlation - One Hot Encoder

In [None]:
# Concant data and target
Features_with_target = pd.concat([Train_OH_Std,target],axis=1)
# Correlation with target
Corr = pd.DataFrame()
Corr['Corr'] = Features_with_target.corr()["Survived"]
Corr.sort_values('Corr',ascending=False,inplace=True)
display(Corr)

#### LDA

In [None]:
# LDA
# LDA with n = 2 solver svd --> score : 0.78947
lda = LinearDiscriminantAnalysis(n_components = 2, solver='svd')
lda.fit(X_train, y_train)

y_train_pred_lda = lda.predict(X_train)
y_test_pred_lda = lda.predict(X_test)

lda_acc = accuracy_score(y_test, y_test_pred_lda)
lda_cr= classification_report(y_test, y_test_pred_lda)
lda_cm = confusion_matrix(y_test, y_test_pred_lda)

lda_acc_train = accuracy_score(y_train, y_train_pred_lda)
lda_cr_train = classification_report(y_train, y_train_pred_lda)
lda_cm_train = confusion_matrix(y_train, y_train_pred_lda)

print("Training set")
print( "LDA Accuracy :", lda_acc_train)
print(lda_cr_train)
print("Confusion Matrix :\n",lda_cm_train)
#print('Explained variance ratio :',lda.explained_variance_ratio_)
print('Balance of classes',lda.priors_)

print("-----------------------------------------------------------------------------")
print("Testing set")
print( "LDA Accuracy :", lda_acc)
print(lda_cr)
print("Confusion Matrix :\n",lda_cm)
#print('Explained variance ratio :',lda.explained_variance_ratio_)
print('Balance of classes',lda.priors_)



Prediction['LDA'] = lda.predict(test_data)


Coef = pd.DataFrame()
Coef['Features'] = list(X_train.columns)
Coef['Coef'] = lda.coef_.transpose()
Coef.sort_values('Coef',ascending=False,inplace=True)
Coef

#### Features importance


In [None]:
list(data.columns)

#### Anova

In [None]:
# Select K Best 
from sklearn.feature_selection import SelectKBest,f_classif,chi2,SelectFpr,SelectFdr

# Perform feature selection
selector_anova = SelectKBest(f_classif, k="all")
#selector_chi2 = SelectKBest(chi2, k="all")

selector_anova.fit(data, target)
#selector_chi2.fit(data, target)

In [None]:
# Get and display result
result_selector = pd.DataFrame()
result_selector['feature'] = list(data.columns)
result_selector['Anova score'] = selector_anova.scores_
result_selector['Anova pval'] = selector_anova.pvalues_
result_selector.sort_values('Anova score',ascending=False,inplace=True)
result_selector

### Features Selection

[[back to top](#Table-of-contents)]

In [None]:
data.drop(['Title_aggr_Royalty','Title_aggr_Officer','Name_Size_Medium','Embarked_Q'],axis=1,inplace=True)
test_data.drop(['Title_aggr_Royalty','Title_aggr_Officer','Name_Size_Medium','Embarked_Q'],axis=1,inplace=True)

In [None]:
display(data.shape)
test_data.shape

In [None]:
# Train & Test
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.20 ,random_state =42)

### Models Selection

[[back to top](#Table-of-contents)]

#### Helper function

In [None]:
# Helper function to analyse and get result

# Plot the confusion matrice 
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


# Grid Score into a Pandas Dataframe
def cv_results_to_df(cv_results):
    """
    Convert a sklearn.model_selection.GridSearchCV.cv_results_ attribute to a tidy
    pandas DataFrame where the output is filtered with only mean std and params.
    """
    df=pd.DataFrame.from_dict(cv_results)
    df=df[['mean_test_score', 'std_test_score', 'mean_train_score', 'std_train_score', 'params']]
    df.sort_values('mean_test_score',ascending=False,inplace=True)
    return df


# Helper function for gridseach
def grid_search_global(dict_pip, dict_param, class_names):

    dict_of_res={}
    dict_of_best={}
    df_results_global=pd.DataFrame()
    
    print ("Starting Gridsearch")
    
    for key in dict_param.keys():
        gs = GridSearchCV(dict_pip[key], dict_param[key], verbose=0, refit=True, n_jobs=-1, cv=5)
        gs = gs.fit(X_train, y_train)
        dict_of_res[key]=gs.grid_scores_
        
        print('\n-------------------------------------------------------------------------------------------------------')
        print ("Gridsearch for %s \n" % dict_param[key])
        print ("Best score :", gs.best_score_)
        print ("Best params :",gs.best_params_)
        dict_of_best[key]=[gs.best_score_,gs.best_params_]
        
        y_test_pred=gs.predict(X_test)
        validation_acc = accuracy_score(y_test,y_test_pred)
        validation_Fscore = f1_score(y_test,y_test_pred)
        
        
        
        # Obtention des résultats avec selection et réarrangement des attributs puis stockage
        df_results=cv_results_to_df(gs.cv_results_)
        df_results['Algo']=key
        df_results['Diff_train_test']=  np.absolute (df_results['mean_train_score'] - df_results['mean_test_score'])
        df_results['Val_Acc'] = validation_acc
        df_results['Diff_test_val']= np.absolute(df_results['mean_test_score'] - df_results['Val_Acc'])
        df_results['Val_F_score'] = validation_Fscore
        df_results=df_results[['Algo','Val_Acc','Diff_test_val','Diff_train_test','Val_F_score','mean_test_score', 'std_test_score', 'mean_train_score', 'std_train_score', 'params']]
        df_results_global=df_results_global.append(df_results)
        print("\nGrid Score #cv_results_ alégé")
        display(df_results)
        

        
    # Transformation de dict_of_best en dataframe
    df_best=pd.DataFrame.from_dict(dict_of_best,'index')
    df_best.columns=['Scores','Parameters']
    df_best.sort_values('Scores',ascending=False,inplace=True)  

    df_results_global.sort_values('Val_Acc',ascending=False,inplace=True)
    print('\n -------------------------------------------------------------------------------------------------------')
    print('\nList of best score and parameters by pipeline')
    display(df_best)
    print('\nSummary')
    display(df_results_global)
    print ("Gridsearch Finished")
    print('\n -------------------------------------------------------------------------------------------------------')    
    return df_best, dict_of_best, df_results_global



"""
========================
Plotting Learning Curves
========================
On the left side the learning curve of a naive Bayes classifier is shown for
the digits dataset. Note that the training score and the cross-validation score
are both not very good at the end. However, the shape of the curve can be found
in more complex datasets very often: the training score is very high at the
beginning and decreases and the cross-validation score is very low at the
beginning and increases. On the right side we see the learning curve of an SVM
with RBF kernel. We can see clearly that the training score is still around
the maximum and the validation score could be increased with more training
samples.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.
    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.
    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.
        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(0.7, 1)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import validation_curve

def plot_validation_curve(estimator, estimator_name, param_name, param_range, X, y, cv,
    scoring='accuracy', scale='classic' , n_jobs=-1):
    
    train_scores, test_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range,
        cv=cv, scoring=scoring, n_jobs=n_jobs)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    title_fig='Validation Curve with %s' % estimator_name
    plt.title(title_fig)
    plt.xlabel(param_name)
    plt.ylabel("Score : %s" % scoring)
    plt.ylim(0.7, 1)
    lw = 2
    
    if (scale=='semilog'):
        plt.semilogx(param_range, train_scores_mean, label="Training score",
                     color="darkorange", lw=lw)
        plt.fill_between(param_range, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.2,
                         color="darkorange", lw=lw)
        plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
                     color="navy", lw=lw)
        plt.fill_between(param_range, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.2,
                         color="navy", lw=lw)
    else :
        plt.plot(param_range, train_scores_mean, label="Training score",
                     color="darkorange", lw=lw)
        plt.fill_between(param_range, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.2,
                         color="darkorange", lw=lw)
        plt.plot(param_range, test_scores_mean, label="Cross-validation score",
                     color="navy", lw=lw)
        plt.fill_between(param_range, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.2,
                         color="navy", lw=lw) 

    plt.legend(loc="best")
    plt.show()
    




### Multipes Algo

In [None]:
# Pipeline setup
models = { 
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'SVC': SVC()
}

# Parameters setup
params = {
    # Il faut mettre 'ExtraTreesClassifier__n_estimators' dans 'ExtraTreesClassifier' car on est sur un pipeline 
    # il est donc possible de préciser des parametres pour chacune des étapes
    'RandomForestClassifier': { 'n_estimators': [5, 10, 15, 20, 25, 30, 35] },
    'GradientBoostingClassifier': { 'n_estimators': [5, 10, 15, 20, 25, 30, 35], 'learning_rate': [0.8, 1.0] },
    'SVC': [
        {'kernel': ['linear'], 'C': [1, 10]},
        {'kernel': ['rbf'], 'C': [1, 10], 'gamma': [0.001, 0.0001]},
    ]
}

# Lancer la grid search
df_best, dic_best, d_res =grid_search_global('clas',models,params,class_names=columns_name)

### Gradient Boosting Classifier

[[back to top](#Table-of-contents)]

In [None]:
X_train.shape

In [None]:
Res_gbc  =pd.DataFrame()
Res_rf  =pd.DataFrame()

for i in range(1,100) :
    print('Step',i)
    RS = np.random.randint(10000)

    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.20,random_state = RS,stratify=target)

    # Pipeline setup
    model_gbc = { 
        'GradientBoostingClassifier': GradientBoostingClassifier(),
    }

    # Parameters setup
    params_gbc = {
        'GradientBoostingClassifier': { 'n_estimators': [5,10,15,20], 
                                       'learning_rate': [0.05,0.1,0.2],
                                      'loss' : ['deviance','exponential'],
                                      'max_depth' : [3],
                                       'min_samples_split': [3,5,7],
                                       'min_samples_leaf' : [3,5,7],
                                       'max_features' : [2,4,6,8,10]
                                      }
    }

    # Lancer la grid search
    df_best_gbc, dic_best_gbc, d_res_gbc =grid_search_global(model_gbc,params_gbc,class_names=columns_name)

    model_rf = { 
    'RandomForestClassifier': RandomForestClassifier(),
    }

    # Parameters setup
    params_rf = {
    # Il faut mettre 'ExtraTreesClassifier__n_estimators' dans 'ExtraTreesClassifier' car on est sur un pipeline 
    # il est donc possible de préciser des parametres pour chacune des étapes
    'RandomForestClassifier': { 'n_estimators': [5,10,15,20,30],
                               'criterion' : ['gini','entropy'],
                               'max_depth' : [3,9,15,30],
                               'max_features':[2,4,6,8],
                               'min_samples_split': [3,5,7],
                               'min_samples_leaf':  [3,5,7]
                              }
    }

    # Lancer la grid search
    df_best_rf, dic_best_rf, d_res_rf =grid_search_global(model_rf,params_rf,class_names=columns_name)

    # Results dataframe
    d_res_gbc['Random_state'] = RS
    d_res_gbc.sort_values(by=['Val_Acc','std_test_score'],ascending=[False,True],inplace=True)
    d_res_gbc_sort = d_res_gbc.loc[(d_res_gbc['Val_Acc'] > 0.80) 
                  & (d_res_gbc['Diff_test_val'] < 0.01) 
                  & (d_res_gbc['Diff_train_test'] < 0.01)  
                  & (d_res_gbc['std_test_score'] < 0.01) 
                  & (d_res_gbc['std_train_score'] < 0.01)]
    Res_gbc = Res_gbc.append(d_res_gbc_sort)
    
    # Results dataframe
    d_res_rf['Random_state'] = RS
    d_res_rf.sort_values(by=['Val_Acc','std_test_score'],ascending=[False,True],inplace=True)
    d_res_rf_sort = d_res_rf.loc[(d_res_rf['Val_Acc'] > 0.80) 
                  & (d_res_rf['Diff_test_val'] < 0.01) 
                  & (d_res_rf['Diff_train_test'] < 0.01)  
                  & (d_res_rf['std_test_score'] < 0.01) 
                  & (d_res_rf['std_train_score'] < 0.01)]
    Res_rf = Res_rf.append(d_res_rf_sort)   


In [108]:
Res

Unnamed: 0,Algo,Val_Acc,Diff_test_val,Diff_train_test,Val_F_score,mean_test_score,std_test_score,mean_train_score,std_train_score,params,Random_state
246,GradientBoostingClassifier,0.815642,0.000369,0.006665,0.740157,0.816011,0.004392,0.822676,0.009988,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 4, 'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 15}",0
488,GradientBoostingClassifier,0.815642,0.008058,0.008767,0.740157,0.807584,0.007447,0.816352,0.0104,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 5, 'min_samples_split': 7, 'n_estimators': 5}",0
133,GradientBoostingClassifier,0.815642,0.012272,0.009473,0.740157,0.803371,0.007779,0.812843,0.009864,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 10}",0
301,GradientBoostingClassifier,0.815642,0.010867,0.002098,0.740157,0.804775,0.008407,0.806873,0.009793,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 10}",0
141,GradientBoostingClassifier,0.815642,0.006654,0.000702,0.740157,0.808989,0.008569,0.808286,0.007403,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 7, 'n_estimators': 10}",0
77,GradientBoostingClassifier,0.815642,0.010867,0.004558,0.740157,0.804775,0.009107,0.809333,0.010155,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 10}",0
137,GradientBoostingClassifier,0.815642,0.013676,0.004206,0.740157,0.801966,0.010263,0.806172,0.008801,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 10}",0
329,GradientBoostingClassifier,0.815642,0.006654,0.002095,0.740157,0.808989,0.01136,0.811084,0.011163,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 10}",0
153,GradientBoostingClassifier,0.815642,0.003845,0.004202,0.740157,0.811798,0.012086,0.816,0.010926,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 10}",0
468,GradientBoostingClassifier,0.815642,0.013676,0.012986,0.740157,0.801966,0.012361,0.814952,0.013651,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 5}",0


In [115]:
# Train & Test
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.20,random_state = 6725,stratify=target)

In [116]:
# Pipeline setup
models = { 
    'GradientBoostingClassifier': GradientBoostingClassifier(),
}

# Parameters setup
params = {
    'GradientBoostingClassifier': { 'n_estimators': [5,10,15,20,50,100,250,500], 
                                   'learning_rate': [0.01,0.05,0.1,0.2],
                                  'loss' : ['deviance','exponential'],
                                    'max_depth' : [3,5,7,10],
                                   'min_samples_split': [3,5,7],
                                   'min_samples_leaf' : [3,5,7],
                                   'max_features' : [2,4,6,8,10]
                                  }
}

# Lancer la grid search
df_best_gbc, dic_best_gbc, d_res_gbc =grid_search_global(models,params,class_names=columns_name)

Starting Gridsearch

-------------------------------------------------------------------------------------------------------
Gridsearch for {'n_estimators': [5, 10, 15, 20, 50, 100, 250, 500], 'learning_rate': [0.01, 0.05, 0.1, 0.2], 'loss': ['deviance', 'exponential'], 'max_depth': [3, 5, 7, 10], 'min_samples_split': [3, 5, 7], 'min_samples_leaf': [3, 5, 7], 'max_features': [2, 4, 6, 8, 10]} 

Best score : 0.842696629213
Best params : {'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 20}

Grid Score #cv_results_ alégé


Unnamed: 0,Algo,Val_Acc,Diff_test_val,Diff_train_test,Val_F_score,mean_test_score,std_test_score,mean_train_score,std_train_score,params
4467,GradientBoostingClassifier,0.810056,0.032641,2.816694e-03,0.721311,0.842697,0.017260,0.845513,0.010422,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 20}"
181,GradientBoostingClassifier,0.810056,0.031236,6.322761e-03,0.721311,0.841292,0.010766,0.847615,0.004405,"{'learning_rate': 0.01, 'loss': 'deviance', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 100}"
4475,GradientBoostingClassifier,0.810056,0.031236,5.267665e-03,0.721311,0.841292,0.016857,0.846560,0.004885,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 20}"
5994,GradientBoostingClassifier,0.810056,0.029832,8.428393e-03,0.721311,0.839888,0.026079,0.848316,0.002769,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 15}"
1669,GradientBoostingClassifier,0.810056,0.029832,8.081211e-03,0.721311,0.839888,0.012895,0.847969,0.005467,"{'learning_rate': 0.01, 'loss': 'exponential', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 100}"
7948,GradientBoostingClassifier,0.810056,0.029832,6.671848e-02,0.721311,0.839888,0.036457,0.906606,0.006904,"{'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 7, 'max_features': 2, 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 50}"
3147,GradientBoostingClassifier,0.810056,0.028427,9.489397e-03,0.721311,0.838483,0.016562,0.847973,0.007554,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 20}"
5993,GradientBoostingClassifier,0.810056,0.028427,8.080353e-03,0.721311,0.838483,0.021770,0.846563,0.010202,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 10}"
4850,GradientBoostingClassifier,0.810056,0.028427,1.439738e-02,0.721311,0.838483,0.018575,0.852881,0.007702,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 5, 'max_features': 6, 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 15}"
5893,GradientBoostingClassifier,0.810056,0.028427,4.425157e-02,0.721311,0.838483,0.026372,0.882735,0.010852,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 4, 'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 100}"



 -------------------------------------------------------------------------------------------------------

List of best score and parameters by pipeline


Unnamed: 0,Scores,Parameters
GradientBoostingClassifier,0.842697,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 20}"



Summary


Unnamed: 0,Algo,Val_Acc,Diff_test_val,Diff_train_test,Val_F_score,mean_test_score,std_test_score,mean_train_score,std_train_score,params
4467,GradientBoostingClassifier,0.810056,0.032641,2.816694e-03,0.721311,0.842697,0.017260,0.845513,0.010422,"{'learning_rate': 0.05, 'loss': 'exponential', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 20}"
6400,GradientBoostingClassifier,0.810056,0.001067,3.441312e-02,0.721311,0.808989,0.029559,0.843402,0.006799,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 5, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 7, 'n_estimators': 5}"
7799,GradientBoostingClassifier,0.810056,0.001067,1.811817e-01,0.721311,0.808989,0.050340,0.990170,0.002846,"{'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 5, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 500}"
7241,GradientBoostingClassifier,0.810056,0.001067,1.438969e-02,0.721311,0.808989,0.030936,0.823378,0.011342,"{'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 3, 'max_features': 2, 'min_samples_leaf': 5, 'min_samples_split': 7, 'n_estimators': 10}"
4014,GradientBoostingClassifier,0.810056,0.001067,1.523906e-01,0.721311,0.808989,0.046550,0.961379,0.003976,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 10, 'max_features': 2, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 250}"
10138,GradientBoostingClassifier,0.810056,0.001067,2.950084e-02,0.721311,0.808989,0.031129,0.838490,0.008550,"{'learning_rate': 0.2, 'loss': 'exponential', 'max_depth': 3, 'max_features': 2, 'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 15}"
9349,GradientBoostingClassifier,0.810056,0.001067,1.688905e-01,0.721311,0.808989,0.043324,0.977879,0.001404,"{'learning_rate': 0.2, 'loss': 'deviance', 'max_depth': 5, 'max_features': 10, 'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 100}"
6534,GradientBoostingClassifier,0.810056,0.001067,1.643279e-01,0.721311,0.808989,0.043495,0.973317,0.004486,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 7, 'max_features': 2, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 250}"
6390,GradientBoostingClassifier,0.810056,0.001067,1.724030e-01,0.721311,0.808989,0.039131,0.981392,0.001777,"{'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 5, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 250}"
4269,GradientBoostingClassifier,0.810056,0.001067,1.766154e-01,0.721311,0.808989,0.039672,0.985604,0.002807,"{'learning_rate': 0.05, 'loss': 'deviance', 'max_depth': 10, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 100}"


Gridsearch Finished

 -------------------------------------------------------------------------------------------------------


In [None]:
# Results dataframe
d_res_gbc.sort_values(by=['Val_Acc','std_test_score'],ascending=[False,True],inplace=True)
d_res_gbc_sort = d_res_gbc.loc[(d_res_gbc['Val_Acc'] > 0.80) 
              & (d_res_gbc['Diff_test_val'] < 0.015) 
              & (d_res_gbc['Diff_train_test'] < 0.015)  
              & (d_res_gbc['std_test_score'] < 0.015) 
              & (d_res_gbc['std_train_score'] < 0.015)]
d_res_gbc_sort

In [None]:
# Pipeline setup
models = { 
    'GradientBoostingClassifier': GradientBoostingClassifier(),
}

# Parameters setup
params = {
    'GradientBoostingClassifier': { 'n_estimators': range(1,20,1), 
                                   'learning_rate': [0.05,0.08,0.10,0.12,0.14,0.16],
                                  'loss' : ['deviance','exponential'],
                                  'max_depth' : range(2,4),
                                   'min_samples_split': range(2,10,1),
                                   'min_samples_leaf' : range(2,10,1),
                                   'max_features' : range(1,15,1)
                                  }
}

# Lancer la grid search
df_best_gbc, dic_best_gbc, d_res_gbc =grid_search_global('clas',models,params,class_names=columns_name)

In [None]:
# Selection of the parameters to study
index_selection = [177,130,106,63,87,86]
df_study = d_res.loc[index_selection,['Algo','params','mean_test_score','mean_train_score']]
df_study

In [None]:
# Learning curve
index = list(df_study.index)
for ind in index :
    algo = df_study.loc[ind, 'Algo']
    params = df_study.loc[ind, 'params']
    mean_test = df_study.loc[ind, 'mean_test_score']
    mean_train = df_study.loc[ind, 'mean_train_score']
    title = "Learning Curves for the estimator %s which results are \n mean test:%s \n mean train %s \n with parameters %s)" % (algo,mean_test,mean_train,params)
    cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
    estimator = models[algo]
    estimator.set_params(**params)
    plot_learning_curve(estimator, title, data, target, (0.3, 1.01), cv=cv, n_jobs=-1)
    plt.show()

In [None]:
# Validation parameters setup
dict_Validation = { 
    'GradientBoostingClassifier':  { 'n_estimators': range(5,100,5), 
                                   'learning_rate': [0.01,0.03,0.1,0.2,0.3,0.4,0.5,0.6,0.7],
                                  'max_depth' : range(2,10,1),
                                   'min_samples_split': range(2,10,1),
                                   'min_samples_leaf' : range(2,10,1),
                                   'max_features' : range(2,19,1)
                                  } 
}

for ind in index :
    algo = df_study.loc[ind, 'Algo']
    params = df_study.loc[ind, 'params']
    mean_test = df_study.loc[ind, 'mean_test_score']
    mean_train = df_study.loc[ind, 'mean_train_score']
    print("Model: %s\nMean test: %s\nMean train: %s\nParams: %s" % (algo,mean_test,mean_train,params))
    estimator = models[algo]
    estimator.set_params(**params)
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    for key, value in dict_Validation[algo].items():
        plot_validation_curve(estimator, clef, key, value, X_train, y_train, scoring='accuracy', cv=cv)
        plt.show()

In [None]:
# Prediction 
GBC_params =  {'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 3, 'max_features': 4, 'min_samples_leaf': 7, 'min_samples_split': 7, 'n_estimators': 10}
gbc = GradientBoostingClassifier(**GBC_params)
gbc.fit(X_train,y_train)
gbc_pred = gbc.predict(test_data)
Prediction['GBC'] = gbc_pred

### Random Forest Classifier

[[back to top](#Table-of-contents)]

In [114]:
# Pipeline setup
# data.drop(['Title_aggr_Royalty','Title_aggr_Officer','Name_Size_Medium','Embarked_Q'],axis=1,inplace=True)
# {'criterion': 'gini', 'max_features': 'sqrt', 'min_samples_leaf': 8, 'min_samples_split': 2, 'n_estimators': 30}
# Score 80.803 leader board, train with X_train or all data 

models = { 
    'RandomForestClassifier': RandomForestClassifier(),
}

# Parameters setup
params = {
    # Il faut mettre 'ExtraTreesClassifier__n_estimators' dans 'ExtraTreesClassifier' car on est sur un pipeline 
    # il est donc possible de préciser des parametres pour chacune des étapes
    'RandomForestClassifier': { 'n_estimators': [5,10,15,20,30,100],
                               'criterion' : ['gini','entropy'],
                               'max_depth' : [3,5,8,10,15,20],
                               'max_features':[2,4,6,8],
                               'min_samples_split': [3,5,7],
                               'min_samples_leaf':  [3,5,7]
                              }
}

# Lancer la grid search
df_best_rf, dic_best_rf, d_res_rf =grid_search_global(models,params,class_names=columns_name)

Starting Gridsearch

-------------------------------------------------------------------------------------------------------
Gridsearch for {'n_estimators': [5, 10, 15, 20, 30, 100], 'criterion': ['gini', 'entropy'], 'max_depth': [3, 5, 8, 10, 15, 20], 'max_features': [2, 4, 6, 8], 'min_samples_split': [3, 5, 7], 'min_samples_leaf': [3, 5, 7]} 

Best score : 0.838483146067
Best params : {'criterion': 'gini', 'max_depth': 10, 'max_features': 6, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 10}

Grid Score #cv_results_ alégé


Unnamed: 0,Algo,Val_Acc,Diff_test_val,Diff_train_test,Val_F_score,mean_test_score,std_test_score,mean_train_score,std_train_score,params
793,RandomForestClassifier,0.804469,0.034014,0.019668,0.724409,0.838483,0.027523,0.858151,0.014499,"{'criterion': 'gini', 'max_depth': 10, 'max_features': 6, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 10}"
2326,RandomForestClassifier,0.804469,0.032609,0.069521,0.724409,0.837079,0.029544,0.906600,0.007654,"{'criterion': 'entropy', 'max_depth': 15, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 30}"
1059,RandomForestClassifier,0.804469,0.031205,0.044247,0.724409,0.835674,0.029476,0.879922,0.007732,"{'criterion': 'gini', 'max_depth': 15, 'max_features': 8, 'min_samples_leaf': 5, 'min_samples_split': 7, 'n_estimators': 20}"
510,RandomForestClassifier,0.804469,0.031205,0.015449,0.724409,0.835674,0.031337,0.851123,0.007070,"{'criterion': 'gini', 'max_depth': 8, 'max_features': 4, 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 5}"
2485,RandomForestClassifier,0.804469,0.031205,0.053019,0.724409,0.835674,0.022323,0.888693,0.012685,"{'criterion': 'entropy', 'max_depth': 20, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 10}"
2047,RandomForestClassifier,0.804469,0.029800,0.014392,0.724409,0.834270,0.022200,0.848662,0.006595,"{'criterion': 'entropy', 'max_depth': 10, 'max_features': 4, 'min_samples_leaf': 7, 'min_samples_split': 7, 'n_estimators': 10}"
1788,RandomForestClassifier,0.804469,0.029800,0.030542,0.724409,0.834270,0.033843,0.864812,0.009568,"{'criterion': 'entropy', 'max_depth': 8, 'max_features': 4, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 5}"
2074,RandomForestClassifier,0.804469,0.029800,0.031959,0.724409,0.834270,0.032246,0.866229,0.008514,"{'criterion': 'entropy', 'max_depth': 10, 'max_features': 6, 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 30}"
2279,RandomForestClassifier,0.804469,0.029800,0.065665,0.724409,0.834270,0.028472,0.899935,0.008833,"{'criterion': 'entropy', 'max_depth': 15, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 100}"
1599,RandomForestClassifier,0.804469,0.029800,0.012985,0.724409,0.834270,0.018122,0.847255,0.008572,"{'criterion': 'entropy', 'max_depth': 5, 'max_features': 4, 'min_samples_leaf': 5, 'min_samples_split': 7, 'n_estimators': 20}"



 -------------------------------------------------------------------------------------------------------

List of best score and parameters by pipeline


Unnamed: 0,Scores,Parameters
RandomForestClassifier,0.838483,"{'criterion': 'gini', 'max_depth': 10, 'max_features': 6, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 10}"



Summary


Unnamed: 0,Algo,Val_Acc,Diff_test_val,Diff_train_test,Val_F_score,mean_test_score,std_test_score,mean_train_score,std_train_score,params
793,RandomForestClassifier,0.804469,0.034014,0.019668,0.724409,0.838483,0.027523,0.858151,0.014499,"{'criterion': 'gini', 'max_depth': 10, 'max_features': 6, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 10}"
506,RandomForestClassifier,0.804469,0.012946,0.040382,0.724409,0.817416,0.033376,0.857798,0.005477,"{'criterion': 'gini', 'max_depth': 8, 'max_features': 4, 'min_samples_leaf': 5, 'min_samples_split': 3, 'n_estimators': 15}"
661,RandomForestClassifier,0.804469,0.012946,0.045297,0.724409,0.817416,0.029930,0.862713,0.007439,"{'criterion': 'gini', 'max_depth': 10, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 10}"
1985,RandomForestClassifier,0.804469,0.012946,0.025629,0.724409,0.817416,0.024262,0.843044,0.007142,"{'criterion': 'entropy', 'max_depth': 10, 'max_features': 2, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 100}"
1982,RandomForestClassifier,0.804469,0.012946,0.019665,0.724409,0.817416,0.022831,0.837081,0.007986,"{'criterion': 'entropy', 'max_depth': 10, 'max_features': 2, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 15}"
2080,RandomForestClassifier,0.804469,0.012946,0.056541,0.724409,0.817416,0.016898,0.873957,0.010315,"{'criterion': 'entropy', 'max_depth': 10, 'max_features': 6, 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 30}"
575,RandomForestClassifier,0.804469,0.012946,0.051616,0.724409,0.817416,0.033049,0.869032,0.001699,"{'criterion': 'gini', 'max_depth': 8, 'max_features': 6, 'min_samples_leaf': 5, 'min_samples_split': 7, 'n_estimators': 100}"
1921,RandomForestClassifier,0.804469,0.012946,0.060397,0.724409,0.817416,0.025849,0.877813,0.006822,"{'criterion': 'entropy', 'max_depth': 8, 'max_features': 8, 'min_samples_leaf': 5, 'min_samples_split': 7, 'n_estimators': 10}"
2221,RandomForestClassifier,0.804469,0.012946,0.060745,0.724409,0.817416,0.032971,0.878161,0.007064,"{'criterion': 'entropy', 'max_depth': 15, 'max_features': 4, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 10}"
2361,RandomForestClassifier,0.804469,0.012946,0.049515,0.724409,0.817416,0.033445,0.866931,0.006069,"{'criterion': 'entropy', 'max_depth': 15, 'max_features': 8, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 20}"


Gridsearch Finished

 -------------------------------------------------------------------------------------------------------


In [113]:
# Results dataframe
d_res_rf

Unnamed: 0,Algo,Val_Acc,Diff_test_val,Diff_train_test,Val_F_score,mean_test_score,std_test_score,mean_train_score,std_train_score,params
1820,RandomForestClassifier,0.804469,0.032609,0.011934,0.72,0.837079,0.024975,0.849012,0.007122,"{'criterion': 'entropy', 'max_depth': 8, 'max_features': 4, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 15}"
321,RandomForestClassifier,0.804469,0.012946,0.023172,0.72,0.817416,0.025906,0.840588,0.011195,"{'criterion': 'gini', 'max_depth': 5, 'max_features': 4, 'min_samples_leaf': 7, 'min_samples_split': 7, 'n_estimators': 20}"
1088,RandomForestClassifier,0.804469,0.012946,0.048796,0.72,0.817416,0.032513,0.866211,0.011195,"{'criterion': 'gini', 'max_depth': 20, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 15}"
384,RandomForestClassifier,0.804469,0.012946,0.037573,0.72,0.817416,0.037391,0.854989,0.011538,"{'criterion': 'gini', 'max_depth': 5, 'max_features': 8, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 5}"
1092,RandomForestClassifier,0.804469,0.012946,0.037561,0.72,0.817416,0.024262,0.854977,0.009327,"{'criterion': 'gini', 'max_depth': 20, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 7, 'n_estimators': 5}"
1623,RandomForestClassifier,0.804469,0.012946,0.041782,0.72,0.817416,0.030410,0.859198,0.004393,"{'criterion': 'entropy', 'max_depth': 5, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 20}"
865,RandomForestClassifier,0.804469,0.012946,0.056887,0.72,0.817416,0.036241,0.874303,0.006799,"{'criterion': 'gini', 'max_depth': 15, 'max_features': 2, 'min_samples_leaf': 3, 'min_samples_split': 3, 'n_estimators': 10}"
1628,RandomForestClassifier,0.804469,0.012946,0.038274,0.72,0.817416,0.025345,0.855690,0.011164,"{'criterion': 'entropy', 'max_depth': 5, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 15}"
1629,RandomForestClassifier,0.804469,0.012946,0.036511,0.72,0.817416,0.029951,0.853927,0.008997,"{'criterion': 'entropy', 'max_depth': 5, 'max_features': 6, 'min_samples_leaf': 3, 'min_samples_split': 5, 'n_estimators': 20}"
254,RandomForestClassifier,0.804469,0.012946,0.017550,0.72,0.817416,0.016369,0.834966,0.012224,"{'criterion': 'gini', 'max_depth': 5, 'max_features': 2, 'min_samples_leaf': 7, 'min_samples_split': 3, 'n_estimators': 15}"


In [None]:
# Pipeline setup
# data.drop(['Title_aggr_Royalty','Title_aggr_Officer','Name_Size_Medium','Embarked_Q'],axis=1,inplace=True)
# {'criterion': 'gini', 'max_features': 'sqrt', 'min_samples_leaf': 8, 'min_samples_split': 2, 'n_estimators': 30}
# Score 80.803 leader board, train with X_train or all data 

models = { 
    'RandomForestClassifier': RandomForestClassifier(),
}

# Parameters setup
params = {
    # Il faut mettre 'ExtraTreesClassifier__n_estimators' dans 'ExtraTreesClassifier' car on est sur un pipeline 
    # il est donc possible de préciser des parametres pour chacune des étapes
    'RandomForestClassifier': { 'n_estimators': range(1,20,1),
                               'criterion' : ['gini','entropy'],
                               'max_features':range(1,20,1),
                               'min_samples_split': range(2,10,1),
                               'min_samples_leaf':  range(1,10,1)
                              }
}

# Lancer la grid search
df_best_rf, dic_best_rf, d_res_rf =grid_search_global('clas',models,params,class_names=columns_name)

In [None]:
# Selection of the parameters to study
index_selection = [707,102,199,1031]
df_study_rf = d_res_rf.loc[index_selection,['Algo','params','mean_test_score','mean_train_score']]
df_study_rf

In [None]:
# Learning curve
index = list(df_study_rf.index)
for ind in index :
    algo = df_study_rf.loc[ind, 'Algo']
    params = df_study_rf.loc[ind, 'params']
    mean_test = df_study_rf.loc[ind, 'mean_test_score']
    mean_train = df_study_rf.loc[ind, 'mean_train_score']
    title = "Learning Curves for the estimator %s which results are \n mean test:%s \n mean train %s \n with parameters %s)" % (algo,mean_test,mean_train,params)
    cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
    estimator = models[algo]
    estimator.set_params(**params)
    plot_learning_curve(estimator, title, data, target, (0.3, 1.01), cv=cv, n_jobs=-1)
    plt.show()

In [None]:
# Validation parameters setup
dict_Validation = { 
    'RandomForestClassifier': {'n_estimators': range(1,50,2),
                               'max_features':range(1,20,1),
                                  'min_samples_split': range(2,10,1),
                                  'min_samples_leaf': range(2,10,1)}  
}

for ind in index :
    algo = df_study_rf.loc[ind, 'Algo']
    params = df_study_rf.loc[ind, 'params']
    mean_test = df_study_rf.loc[ind, 'mean_test_score']
    mean_train = df_study_rf.loc[ind, 'mean_train_score']
    print("Model: %s\nMean test: %s\nMean train: %s\nParams: %s" % (algo,mean_test,mean_train,params))
    estimator = models[algo]
    estimator.set_params(**params)
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    for key, value in dict_Validation[algo].items():
        plot_validation_curve(estimator, algo, key, value, X_train, y_train, scoring='accuracy', cv=cv)
        plt.show()

In [None]:
# Prediction 
RF_params =  {'criterion': 'entropy', 'max_features': 12, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 10}
rf = RandomForestClassifier(**RF_params)
rf.fit(X_train,y_train)
rf_pred = rf.predict(test_data)
Prediction['RF'] = rf_pred

In [None]:
# Get and display result
Rf_feat_imp = pd.DataFrame()
Rf_feat_imp['Feature'] = list(data.columns)
Rf_feat_imp['Importance'] = rf.feature_importances_
Rf_feat_imp.sort_values('Importance',ascending=False,inplace=True)
Rf_feat_imp

In [None]:
# From Random Forest
importances=rf.feature_importances_
plt.figure(figsize=(10,10))
plt.title("Feature Importances By Random Forest Model")
plt.bar(range(np.size(importances)), importances, align="center")
plt.xticks(range(np.size(importances)),list(data.columns),rotation='vertical')
plt.show()

### Adaboost

[[back to top](#Table-of-contents)]

In [None]:
# Pipeline setup
models = { 
    'AdaBoostClassifier': AdaBoostClassifier(),
}

# Parameters setup
params = {
    # Il faut mettre 'ExtraTreesClassifier__n_estimators' dans 'ExtraTreesClassifier' car on est sur un pipeline 
    # il est donc possible de préciser des parametres pour chacune des étapes
    'AdaBoostClassifier': { 'n_estimators': [5, 10, 20, 30, 40, 50 ,100, 500 ],
                               'learning_rate' : [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
                              }
}

# Lancer la grid search
df_best, dic_best, d_res =grid_search_global('clas',models,params,class_names=columns_name)

In [None]:
# Prediction
ab_params = {'learning_rate': 0.3, 'n_estimators': 20}
ab = AdaBoostClassifier(**ab_params)
ab.fit(X_train,y_train)
ab_pred = ab.predict(test_data)
Prediction['AB'] = ab_pred

### SVC

[[back to top](#Table-of-contents)]

In [None]:
# Pipeline setup
models = { 
    'SVC': SVC()
}

# Parameters setup
params = {
    'SVC': [
        {'kernel': ['poly'], 'C': [0.3,0.4,0.5,0.6,0.7,0.8,0.9,1], 
                            'gamma': [0.08,0.09,0.1,0.11,0.12],
                            'degree': [2,3]}]}

# Lancer la grid search
df_best_svc , dic_best_svc, d_res_svc = grid_search_global('clas',models,params,class_names=columns_name)

In [None]:
# Results dataframe
d_res_svc

In [None]:
# Selection of the parameters to study
index_selection = [14,32,41,50]
svc_study = d_res_svc.loc[index_selection,['Algo','params','mean_test_score','mean_train_score']]
svc_study

In [None]:
# Learning curve
df_study = svc_study
index = list(df_study.index)
for ind in index :
    algo = df_study.loc[ind, 'Algo']
    params = df_study.loc[ind, 'params']
    mean_test = df_study.loc[ind, 'mean_test_score']
    mean_train = df_study.loc[ind, 'mean_train_score']
    title = "Learning Curves for the estimator %s which results are \n mean test:%s \n mean train %s \n with parameters %s)" % (algo,mean_test,mean_train,params)
    cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
    estimator = models[algo]
    estimator.set_params(**params)
    plot_learning_curve(estimator, title, data, target, (0.3, 1.01), cv=cv, n_jobs=-1)
    plt.show()

In [None]:
# Validation parameters setup
dict_Validation = { 
    'SVC': { 'C': np.linspace(0.01,1,20),
              'gamma': np.linspace(0.01,0.4,20),
              'degree': [2,3]}  
    }

for ind in index :
    algo = df_study.loc[ind, 'Algo']
    params = df_study.loc[ind, 'params']
    mean_test = df_study.loc[ind, 'mean_test_score']
    mean_train = df_study.loc[ind, 'mean_train_score']
    print("Model: %s\nMean test: %s\nMean train: %s\nParams: %s" % (algo,mean_test,mean_train,params))
    estimator = models[algo]
    estimator.set_params(**params)
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    for key, value in dict_Validation[algo].items():
        plot_validation_curve(estimator, algo, key, value, X_train, y_train, scoring='accuracy', cv=cv)
        plt.show()

In [None]:
# Prediction
svc_params = {'C': 0.8, 'degree': 2, 'gamma': 0.08, 'kernel': 'poly'}
svc = SVC(**svc_params)
svc.fit(data,target)
svc_pred = svc.predict(test_data)
Prediction['SVC'] = svc_pred

### Logistic Regression

[[back to top](#Table-of-contents)]

In [None]:
from sklearn.linear_model import LogisticRegression
# Pipeline setup
models = { 
    'LogisticRegression': LogisticRegression()
}

# Parameters setup
params = {'LogisticRegression' : { 'C': [0.1,0.2,0.3,0.4,0.5,0.6,0.8,0.9,1],
           'penalty': ['l2'],
           'solver': ['newton-cg','lbfgs','liblinear','sag'],
           'max_iter': [100,250,500],
           'tol':  [1e-4,3e-4,7e-4,1e-3,3e-3],
          }
         }

# Lancer la grid search
df_best_svc , dic_best_svc, d_res_svc = grid_search_global('clas',models,params,class_names=columns_name)

In [None]:
# Results dataframe
d_res

In [None]:
# Selection of the parameters to study
index_selection = [177,130,106,63,87,86]
df_study = d_res.loc[index_selection,['Algo','params','mean_test_score','mean_train_score']]
df_study

In [None]:
# Learning curve
index = list(df_study.index)
for ind in index :
    algo = df_study.loc[ind, 'Algo']
    params = df_study.loc[ind, 'params']
    mean_test = df_study.loc[ind, 'mean_test_score']
    mean_train = df_study.loc[ind, 'mean_train_score']
    title = "Learning Curves for the estimator %s which results are \n mean test:%s \n mean train %s \n with parameters %s)" % (algo,mean_test,mean_train,params)
    cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
    estimator = models[algo]
    estimator.set_params(**params)
    plot_learning_curve(estimator, title, data, target, (0.3, 1.01), cv=cv, n_jobs=-1)
    plt.show()

In [None]:
# Validation parameters setup
dict_Validation = { 
    'GradientBoostingClassifier': {'n_estimators': range(1,50,2),
                                  'learning_rate': np.linspace(0.01,0.5,50),
                                  'max_depth': range(1,5,1)}  
}

for ind in index :
    algo = df_study.loc[ind, 'Algo']
    params = df_study.loc[ind, 'params']
    mean_test = df_study.loc[ind, 'mean_test_score']
    mean_train = df_study.loc[ind, 'mean_train_score']
    print("Model: %s\nMean test: %s\nMean train: %s\nParams: %s" % (algo,mean_test,mean_train,params))
    estimator = models[algo]
    estimator.set_params(**params)
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    for key, value in dict_Validation[algo].items():
        plot_validation_curve(estimator, clef, key, value, X_train, y_train, scoring='accuracy', cv=cv)
        plt.show()

In [None]:
# Prediction 
lr_parm = {'C': 1, 'max_iter': 100, 'penalty': 'l2', 'solver': 'newton-cg', 'tol': 0.0001}
lr = LogisticRegression(**lr_parm)
lr.fit(X_train,y_train)
lr_pred = lr.predict(test_data)
Prediction['LR'] = lr_pred

### Voting Classifier

[[back to top](#Table-of-contents)]

In [None]:
# Train & Test
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.20,random_state =123)

In [None]:
# Classifier default parameters
abc = AdaBoostClassifier(learning_rate = 0.5, n_estimators = 20)        
lda = LinearDiscriminantAnalysis(n_components = 2, solver='svd')
lr = LogisticRegression()
svc = SVC(probability=True)
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

ab_params = {'learning_rate': 0.3, 'n_estimators': 30}
GBC_params = {'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 7, 'min_samples_split': 5, 'n_estimators': 10}
RF_params =  {'criterion': 'entropy', 'max_features': 4, 'min_samples_leaf': 8, 'min_samples_split': 9, 'n_estimators': 11}

# Classifier Tunep parameters
abc_tp = AdaBoostClassifier(**ab_params)  
lda_tp = LinearDiscriminantAnalysis(n_components = 2, solver='svd')
lr_tp = LogisticRegression(C = 0.2, max_iter = 10, n_jobs = -1, penalty = 'l2', solver = 'sag', tol = 0.001)
svc_tp = SVC(C=0.5, gamma=0.10, kernel='poly', degree=3, probability= True)
rfc_tp = RandomForestClassifier(**RF_params)
gbc_tp = GradientBoostingClassifier(**GBC_params)

models = { 
    'VotingClassifier': VotingClassifier(estimators= [('abc_tp', abc_tp),('rfc_tp', rfc_tp), ('gbc_tp', gbc_tp)])
}

# Parameters setup
params = {'VotingClassifier' : {'voting': ['soft','hard']}
         }

# Lancer la grid search
df_best_vc , dic_best_vc, d_res_vc = grid_search_global('clas',models,params,class_names=columns_name)

In [None]:
# Classifier Tunep parameters
# Score 0.77

GBC_params = {'learning_rate': 0.2, 'loss': 'deviance', 'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 7, 'min_samples_split': 6, 'n_estimators': 10}
RF_params =  {'criterion': 'entropy', 'max_features': 10, 'min_samples_leaf': 10, 'min_samples_split': 3, 'n_estimators': 30}

rfc_tp = RandomForestClassifier(**RF_params)
gbc_tp = GradientBoostingClassifier(**GBC_params)

# [ ('lr_tp', lr_tp),('lda_tp', lda_tp), ('svc_tp', svc_tp), ('rfc_tp', rfc_tp), ('gbc_tp', gbc_tp)]
VC_tp_soft = VotingClassifier(estimators=[('rfc_tp', rfc_tp), ('gbc_tp', gbc_tp)],   
                                         voting='soft')
VC_tp_soft.fit(X_train,y_train)
Prediction['VC_tp_soft'] = VC_tp_soft.predict(Eval_OH_Std)

In [None]:
# Classifier Not Tunep parameters
# Score 0.77

abc = AdaBoostClassifier(learning_rate = 0.5, n_estimators = 20)        
lda = LinearDiscriminantAnalysis(n_components = 2, solver='svd')
lr = LogisticRegression()
svc = SVC(probability=True)
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()



VC_nottp_soft = VotingClassifier(estimators=[ ('lr_tp', lr_tp),('lda_tp', lda_tp), ('svc_tp', svc_tp), ('rfc_tp', rfc_tp), ('gbc_tp', gbc_tp)],
                                         voting='soft')
VC_nottp_soft.fit(data,target)
Prediction['VC_nottp_soft'] = VC_nottp_soft.predict(Eval_OH_Std)

In [None]:
pd.set_option('display.max_colwidth', -1)

In [None]:
Eval_OH_Std.shape

In [None]:
Prediction.shape

#### Save best params

In [None]:
# 0.79-0.81 with PassId_0.81 and without ticket and cabin and dropping 'Title_aggr_Royalty','Title_aggr_Officer','Name_Size_Medium','Embarked_Q'
GBC_params = {'learning_rate': 0.2, 'loss': 'deviance', 'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 7, 'min_samples_split': 6, 'n_estimators': 10}
# 0.80 same conditions
RF_params =  {'criterion': 'entropy', 'max_features': 10, 'min_samples_leaf': 10, 'min_samples_split': 3, 'n_estimators': 30}
#0.79
VotingClassifier(estimators=[('rfc_tp', rfc_tp), ('gbc_tp', gbc_tp)],voting='soft')

#### Save X_train and y_train that generalize well

In [None]:
Pass_Id = list(X_train.index)
Pass_Id

In [None]:
PasId = pd.DataFrame({
        "PassengerId": Pass_Id
    })
PasId.to_csv('PassId_0.81.csv', index=False)

#### Save Number of random state that are good

### Submission

[[back to top](#Table-of-contents)]

In [None]:
estim = #  Remplir ici
Prediction = estim.predict(Eval_OH_Std)

In [None]:
Predic = Prediction['GBC']

In [None]:
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": Predic
    })
submission.to_csv('titanic_all_data_R999_GBC2_pred.csv', index=False)