# Modelling

## Preprocessing

In this part, we use what we know about the passengers, encoded in the features we created, to build a statistical model.

There is a wide variety of models to use, from logistic regression to decision trees and more sophisticated ones such as random forests and gradient boosted trees.

Back to our problem, we now have to:

1. Split the combined dataset into a train set and a test set.
2. Use the train set to build a predictive model.
3. Evaluate the model using the train set.
4. Test the model using the test set and generate an output file for the submission.

Keep in mind that we'll have to iterate on steps 2 and 3 until an acceptable evaluation score is achieved.

Let's start by importing the useful libraries.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
# Import from the public sklearn.ensemble namespace; the gradient_boosting submodule is private
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import plot_roc_curve

import pickle

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

from datetime import datetime



In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

To evaluate our model, we'll use 5-fold cross-validation scored by accuracy, since that is the metric the competition uses on its leaderboard.

To do that, we'll define a small scoring function.

In [33]:
def compute_score(clf, X, y, scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5, scoring=scoring, verbose=0)
    return np.mean(xval)

Recovering the train set and the test set from the combined dataset is an easy task.

In [3]:
train_with_labels = pd.read_csv("data/new_train.csv")
targets = train_with_labels["Target"]
train = train_with_labels.drop("Target", axis=1)
test =  pd.read_csv("data/new_test.csv")

In [4]:
train

Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,...,Ticket_STONO2,Ticket_STONOQ,Ticket_SWPP,Ticket_WC,Ticket_WEP,Ticket_XXX,FamilySize,Singleton,SmallFamily,LargeFamily
0,1,22.0,1,0,7.2500,0,0,1,0,0,...,0,0,0,0,0,0,2,0,1,0
1,0,38.0,1,0,71.2833,0,0,0,1,0,...,0,0,0,0,0,0,2,0,1,0
2,0,26.0,0,0,7.9250,0,1,0,0,0,...,1,0,0,0,0,0,1,1,0,0
3,0,35.0,1,0,53.1000,0,0,0,1,0,...,0,0,0,0,0,1,2,0,1,0
4,1,35.0,0,0,8.0500,0,0,1,0,0,...,0,0,0,0,0,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,1,27.0,0,0,13.0000,0,0,0,0,1,...,0,0,0,0,0,1,1,1,0,0
887,0,19.0,0,0,30.0000,0,1,0,0,0,...,0,0,0,0,0,1,1,1,0,0
888,0,18.0,1,2,23.4500,0,1,0,0,0,...,0,0,0,1,0,0,4,0,1,0
889,1,26.0,0,0,30.0000,0,0,1,0,0,...,0,0,0,0,0,1,1,1,0,0


In [5]:
test

Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Officer,...,Ticket_STONO2,Ticket_STONOQ,Ticket_SWPP,Ticket_WC,Ticket_WEP,Ticket_XXX,FamilySize,Singleton,SmallFamily,LargeFamily
0,1,34.5,0,0,7.8292,0,0,1,0,0,...,0,0,0,0,0,1,1,1,0,0
1,0,47.0,1,0,7.0000,0,0,0,1,0,...,0,0,0,0,0,1,2,0,1,0
2,1,62.0,0,0,9.6875,0,0,1,0,0,...,0,0,0,0,0,1,1,1,0,0
3,1,27.0,0,0,8.6625,0,0,1,0,0,...,0,0,0,0,0,1,1,1,0,0
4,0,22.0,1,1,12.2875,0,0,0,1,0,...,0,0,0,0,0,1,3,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1,26.0,0,0,8.0500,0,0,1,0,0,...,0,0,0,0,0,0,1,1,0,0
414,0,39.0,0,0,108.9000,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
415,1,38.5,0,0,7.2500,0,0,1,0,0,...,0,0,0,0,0,0,1,1,0,0
416,1,26.0,0,0,8.0500,0,0,1,0,0,...,0,0,0,0,0,1,1,1,0,0


## Author - Feature selection

We've ended up with 67 features so far. This number is quite large.

Once feature engineering is done, we usually reduce the dimensionality by selecting the "right" number of features, those that capture the essential information.

In fact, feature selection comes with many benefits:

* It decreases redundancy among the data
* It speeds up the training process
* It reduces overfitting

Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.

In [None]:
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
clf = clf.fit(train, targets)

Let's have a look at the importance of each feature.

In [None]:
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

features.plot(kind='barh', figsize=(25, 25));

As you may notice, there is a great importance linked to `Title_Mr`, `Age`, `Fare`, and `Sex`.

There is also an important correlation with the `Passenger_Id`.

Let's now transform our train set and test set into more compact datasets.

In [None]:
model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
print(train_reduced.shape, end=", ")

test_reduced = model.transform(test)
print(test_reduced.shape)

## Author - First models

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_reduced, targets, test_size=0.15)

In [None]:
logreg = LogisticRegression()
logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()

models = [logreg, logreg_cv, rf, gboost]

for model in models:
    np.random.seed(42)
    print('Cross-validation of : {0}'.format(model.__class__))
    # Cross-validate on the training split first, then on the held-out split
    score = compute_score(clf=model, X=X_train, y=y_train, scoring='accuracy')
    print('CV Train score = {0}'.format(score))
    score = compute_score(clf=model, X=X_test, y=y_test, scoring='accuracy')
    print('CV Val score = {0}'.format(score))
    print('****')

## Author - Hyperparameters tuning

As mentioned at the beginning of the Modelling part, we will be using a Random Forest model. It may not be the best model for this task, but we'll show how to tune it. The same approach can be applied to other models.

Random forests are quite handy. They do, however, come with some parameters to tweak in order to get an optimal model for the prediction task.

Additionally, we'll use the full train set.

In [None]:
# turn run_gs to True if you want to run the gridsearch again.
run_gs = False

if run_gs:
    parameter_grid = {
                 'max_depth' : [4, 6, 8],
                 'n_estimators': [50, 10],
                 'max_features': ['sqrt', 'auto', 'log2'],
                 'min_samples_split': [2, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(n_splits=5)

    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation,
                               verbose=1
                              )

    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_

    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
    
else: 
    parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50, 
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
    
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)

Now that the model is built by scanning several combinations of the hyperparameters, we can generate an output file to submit on Kaggle.

In [None]:
output = model.predict(test).astype(int)
df_output = pd.DataFrame()
aux = pd.read_csv('./data/test.csv')
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('submission/author_rand_forest.csv', index=False)

Final result: the accuracy obtained is 78.947%.



# My experiments

## Validation Dataset

Now, using this updated dataset, I'll try different approaches to see whether I can improve the result obtained on the Kaggle website, as I did before. At this point I have:
* `train`: train dataset with 67 attributes, due to the dummy columns created with pandas
* `targets`: train target (survived or not)
* `test`: test dataset with the same 67 attributes as `train`, without targets (they are on Kaggle)

I now hold out 25% of the train set as a validation dataset through scikit-learn's `train_test_split()`, obtaining:
* `X_train, y_train`: the reduced train dataset and its targets
* `X_test, y_test`: the validation dataset and its targets
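
As a side note, a split like this can be made reproducible and class-balanced by passing `random_state` and `stratify` to `train_test_split()`. The sketch below is only an illustration: it uses a synthetic dataset in place of `train` and `targets`, and the names `X_demo`, `y_demo`, `X_tr` and `X_val` are placeholders that do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, weights=[0.6, 0.4], random_state=0
)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_val.shape)      # 150 training rows, 50 validation rows
print(y_tr.mean(), y_val.mean())    # share of the positive class in each part
```

With a fixed `random_state`, re-running the cell returns the same rows in each part, which makes later scores comparable across runs.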

In [9]:
train.shape, targets.shape, test.shape

((891, 67), (891,), (418, 67))

In [10]:
X_train, X_test, y_train, y_test = train_test_split(train, targets, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((668, 67), (223, 67), (668,), (223,))

## Feature Selection

In [11]:
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt')
clf = clf.fit(train, targets)
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=False, inplace=True)
features.head()

Unnamed: 0,feature,importance
1,Age,0.188281
4,Fare,0.178803
0,Sex,0.106771
7,Title_Mr,0.102058
8,Title_Mrs,0.041124


As said before, 67 attributes are too many: they make the machine learning algorithm overfit, so we need a more compact set of features in order to make better predictions.

In [12]:
features.head(13).importance.sum()

0.8212357822352765

In [13]:
features.head(1).importance.sum(), features.head(20).importance.sum(), features.head(25).importance.sum()

(0.18828130583255157, 0.9052482197558591, 0.9427400480806225)

We can see that:

* `Age` and `Fare` together account for about 37% of the total feature importance; `Age` alone accounts for almost 19%;
* the top 13 of the 67 features, the number chosen in the previous case, account for 82% of the total feature importance;
* the top 20 account for about 90%, and the top 25 for about 94%.

In this case, we keep the top 20, i.e. about 90% of the feature importance. They are:
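
The cumulative shares listed above can also be read off programmatically instead of calling `head(n)` by hand. The sketch below is a minimal illustration on synthetic data, not the Titanic features; `n_features_for` is a hypothetical helper introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the engineered features (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(20)])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Rank the features by importance and accumulate their shares.
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns)
ranked = ranked.sort_values(ascending=False)
cumulative = ranked.cumsum()

def n_features_for(threshold):
    """Smallest number of top-ranked features whose importances sum to at least `threshold`."""
    position = int(np.searchsorted(cumulative.to_numpy(), threshold))
    return min(position + 1, len(cumulative))

for threshold in (0.80, 0.90, 0.95):
    print(threshold, n_features_for(threshold))
```

The selected columns are then `list(ranked.index[:n_features_for(0.90)])`, which plays the same role as `feature_list` in the notebook.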

In [14]:
feature_list = list(features.head(20)["feature"])
feature_list

['Age',
 'Fare',
 'Sex',
 'Title_Mr',
 'Title_Mrs',
 'Title_Miss',
 'FamilySize',
 'Pclass_3',
 'Cabin_U',
 'SibSp',
 'Pclass_1',
 'SmallFamily',
 'Parch',
 'Ticket_XXX',
 'LargeFamily',
 'Pclass_2',
 'Embarked_S',
 'Embarked_C',
 'Title_Master',
 'Cabin_E']

In [15]:
new_train = train[feature_list] 
new_test = test[feature_list]

In [16]:
new_train

Unnamed: 0,Age,Fare,Sex,Title_Mr,Title_Mrs,Title_Miss,FamilySize,Pclass_3,Cabin_U,SibSp,Pclass_1,SmallFamily,Parch,Ticket_XXX,LargeFamily,Pclass_2,Embarked_S,Embarked_C,Title_Master,Cabin_E
0,22.0,7.2500,1,1,0,0,2,1,1,1,0,1,0,0,0,0,1,0,0,0
1,38.0,71.2833,0,0,1,0,2,0,0,1,1,1,0,0,0,0,0,1,0,0
2,26.0,7.9250,0,0,0,1,1,1,1,0,0,0,0,0,0,0,1,0,0,0
3,35.0,53.1000,0,0,1,0,2,0,0,1,1,1,0,1,0,0,1,0,0,0
4,35.0,8.0500,1,1,0,0,1,1,1,0,0,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,27.0,13.0000,1,0,0,0,1,0,1,0,0,0,0,1,0,1,1,0,0,0
887,19.0,30.0000,0,0,0,1,1,0,0,0,1,0,0,1,0,0,1,0,0,0
888,18.0,23.4500,0,0,0,1,4,1,1,1,0,1,2,0,0,0,1,0,0,0
889,26.0,30.0000,1,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0


In [17]:
# Make sure the test index runs from 0 to 417
new_test = new_test.reset_index(drop=True)
new_test

Unnamed: 0,Age,Fare,Sex,Title_Mr,Title_Mrs,Title_Miss,FamilySize,Pclass_3,Cabin_U,SibSp,Pclass_1,SmallFamily,Parch,Ticket_XXX,LargeFamily,Pclass_2,Embarked_S,Embarked_C,Title_Master,Cabin_E
0,34.5,7.8292,1,1,0,0,1,1,1,0,0,0,0,1,0,0,0,0,0,0
1,47.0,7.0000,0,0,1,0,2,1,1,1,0,1,0,1,0,0,1,0,0,0
2,62.0,9.6875,1,1,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,0
3,27.0,8.6625,1,1,0,0,1,1,1,0,0,0,0,1,0,0,1,0,0,0
4,22.0,12.2875,0,0,1,0,3,1,1,1,0,1,1,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,26.0,8.0500,1,1,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0
414,39.0,108.9000,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0
415,38.5,7.2500,1,1,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0
416,26.0,8.0500,1,1,0,0,1,1,1,0,0,0,0,1,0,0,1,0,0,0


In [18]:
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(new_train, targets, test_size=0.15)

## First Try with different methods

Now I'll try several models on this reduced dataset, in order to get a baseline to work from. The models are:
* `LogisticRegression()`
* `LogisticRegressionCV()`
* `RandomForestClassifier()`
* `GradientBoostingClassifier()`

We'll build a baseline of our models.

In [None]:
logreg = LogisticRegression()
logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()

models = [logreg, logreg_cv, rf, gboost]

for model in models:
    np.random.seed(42) # I need the same starting point
    print('Cross-validation of : {0}'.format(model.__class__))
    model.fit(X_train, y_train)
    score = compute_score(clf=model, X=X_train, y=y_train, scoring='accuracy')
    print('CV Train score = {0}'.format(score))
    score = compute_score(clf=model, X=X_test, y=y_test, scoring='accuracy')
    print('CV Validation score = {0}'.format(score))
    print('****')

As a baseline result, `RandomForestClassifier()` is the best at the first attempt with the new dataset, as before, even though the scores obtained are slightly different from the earlier ones. We'll now dig a little deeper into the problem.

## RandomizedSearchCV

We'll now narrow down the models we proposed. After looking up the meaning of the main hyperparameters online, in this section we try some combinations in order to understand which model we should focus on. We now define a different grid of hyperparameters for each algorithm, to be applied later.
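
As a minimal illustration of what a randomized search does, the sketch below samples a handful of `LogisticRegression` settings and keeps the one with the best cross-validated accuracy. It is an assumption-laden example rather than the notebook's setup: the data are synthetic, and `C` is drawn from a continuous `scipy.stats.loguniform` distribution instead of a long `np.logspace` list.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Lists are sampled uniformly; objects with an rvs() method are sampled as distributions.
param_distributions = {
    "C": loguniform(1e-3, 1e3),
    "fit_intercept": [True, False],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=20,            # number of sampled combinations
    cv=5,
    scoring="accuracy",
    random_state=42,      # makes the sampled combinations reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Passing `random_state` directly to the search is an alternative to seeding NumPy globally before the call.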

In [20]:
def rand_grid_search(model_funct, grid, name="Model", crossval=5, randomiz=True, n_iterat=1000, rand_seed=42, scoring_choose='accuracy', verb=True, X_tra=X_train, y_tra=y_train):
    """
    Function that tries a model and finds its hyperparameters.
    """
    np.random.seed(rand_seed) 
    if randomiz:
        model = RandomizedSearchCV(model_funct, 
                                   param_distributions=grid,
                                   scoring=scoring_choose,
                                   n_iter=n_iterat,
                                   cv=crossval,
                                   verbose=verb, 
                                   n_jobs=-1)
        print(name + " found with RandomizedSearchCV")
    else:
        model = GridSearchCV(model_funct, 
                             param_grid=grid,
                             scoring=scoring_choose,
                             cv=crossval,
                             verbose=verb, 
                             n_jobs=-1)
        print(name + " found with GridSearchCV")
    
    model.fit(X_tra, y_tra)
    return model 

def compute_accuracy(model, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, seed=42):
    """
    Print and return the train and validation scores of a fitted model.
    The scores are computed with precision_score, even though the printed labels say "Accuracy".
    """
    print('*************************')
    np.random.seed(seed)
    print("Accuracy on Train Dataset", end="\t")
    train_acc = precision_score(y_train, model.predict(X_train))
    print(train_acc)
    print("Accuracy on Test Dataset", end="\t")
    test_acc = precision_score(y_test, model.predict(X_test))
    print(test_acc)
    return [train_acc, test_acc]
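
Since `precision_score` and accuracy are not the same quantity, it helps to see them side by side. The following sketch uses made-up labels (not the Titanic data) and `accuracy_score`, which the notebook does not import, to contrast the two metrics.

```python
from sklearn.metrics import accuracy_score, precision_score

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

# Accuracy: share of all predictions that are correct (5 of 8 here).
accuracy = accuracy_score(y_true, y_pred)

# Precision: share of predicted positives that are truly positive (2 of 3 here).
precision = precision_score(y_true, y_pred)

print(accuracy)   # 0.625
print(precision)  # 0.666...
```

Kaggle scores this competition on accuracy, so `accuracy_score` is the metric that is directly comparable with the leaderboard.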

### Logistic Regression 

In [None]:
run_randomized_logistic = False

if run_randomized_logistic:
    # Create an hyperparameter grid for Logistic Regression
    log_reg_grid = {"C":np.logspace(-10, 10, 1000), 
                    "dual":[True, False],
                    "fit_intercept":[True, False],
                    "penalty" : ['none', 'l1', 'l2', 'elasticnet'],
                    "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

    log_reg_rancv = rand_grid_search(LogisticRegression(max_iter=1000), 
                                     log_reg_grid, 
                                     name="Logistic Regression", 
                                     n_iterat=10000)

    accur_log_reg_rancv = compute_accuracy(log_reg_rancv)
    
    now = datetime.now()
    dt_string = now.strftime("%d-%m-%Y_%H-%M-%S")
    
    pickle.dump(log_reg_rancv, open("model/new_version/logistic_randomizedsearch_" + dt_string + ".pkl", "wb"))
    print('*************************')
    print('File name')
    print('*************************')
    print("model/new_version/logistic_randomizedsearch_" + dt_string + ".pkl")

Last Result Found in:

`model/new_version/logistic_randomizedsearch_20-08-2020_11-44-27.pkl`

In [None]:
model_path = "model/new_version/logistic_randomizedsearch_20-08-2020_11-44-27.pkl"
log_reg_rancv = pickle.load(open(model_path, "rb"))

In [None]:
log_reg_rancv.best_params_
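
The save-and-reload pattern used in these cells can also be written with context managers, so that the file handles are closed as soon as the dump or load finishes. This is a minimal sketch under stated assumptions: it pickles a small `LogisticRegression` fitted on synthetic data and writes to a temporary folder rather than to `model/new_version/`; the file name `demo_model_...pkl` is hypothetical.

```python
import pickle
import tempfile
from datetime import datetime
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small fitted model to serialise (synthetic data, assumption).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
fitted = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Temporary folder used only for this illustration.
out_dir = Path(tempfile.mkdtemp())
stamp = datetime.now().strftime("%d-%m-%Y_%H-%M-%S")
model_file = out_dir / f"demo_model_{stamp}.pkl"

# Save the model; the file is closed when the block ends.
with open(model_file, "wb") as fh:
    pickle.dump(fitted, fh)

# Load it back and check that it predicts the same labels.
with open(model_file, "rb") as fh:
    restored = pickle.load(fh)

same = bool((restored.predict(X_demo) == fitted.predict(X_demo)).all())
print(model_file.name, same)
```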

### Logistic Regression CV

In [None]:
run_randomized_logistic_CV = False

if run_randomized_logistic_CV:
    # Create an hyperparameter grid for Logistic Regression CV 
    log_reg_CV_grid = {"Cs":[1, 3, 5, 7, 10, 12], 
                       "fit_intercept":[True, False],
                       "dual":[True, False],
                       "penalty" : ['l1', 'l2', 'elasticnet'],
                       "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

    log_reg_CV_rancv = rand_grid_search(LogisticRegressionCV(max_iter=1000), 
                                        log_reg_CV_grid, 
                                        name="Logistic Regression CV", 
                                        n_iterat=5000)

    accur_log_reg_CV_rancv = compute_accuracy(log_reg_CV_rancv)
    
    now = datetime.now()
    dt_string = now.strftime("%d-%m-%Y_%H-%M-%S")
    
    pickle.dump(log_reg_CV_rancv, open("model/new_version/logistic_CV_randomizedsearch_" + dt_string + ".pkl", "wb"))
    print('*************************')
    print('File name')
    print('*************************')
    print("model/new_version/logistic_CV_randomizedsearch_" + dt_string + ".pkl")

Last Result Found in:

`model/new_version/logistic_CV_randomizedsearch_20-08-2020_12-11-13.pkl`

In [None]:
model_path = "model/new_version/logistic_CV_randomizedsearch_20-08-2020_12-11-13.pkl"
log_reg_CV_rancv = pickle.load(open(model_path, "rb"))

In [None]:
log_reg_CV_rancv.best_params_

### Random Forest

In [None]:
run_randomized_rand_fores = False

if run_randomized_rand_fores:
    # Create an hyperparameter grid for Random Forest Classifier
    rf_grid = {"n_estimators":np.arange(10, 1000, 100), 
               "max_depth":[None, 3, 5, 7], 
               "bootstrap":[True, False],
               "max_features": ['sqrt', 'auto', 'log2'],
               "min_samples_split":np.arange(2, 20, 1), 
               "min_samples_leaf":np.arange(1, 20, 1)}

    rand_fores = rand_grid_search(RandomForestClassifier(), 
                                  rf_grid, 
                                  name="Random Forest Classifier", 
                                  n_iterat=500)

    accur_rand_fores = compute_accuracy(rand_fores)
    
    now = datetime.now()
    dt_string = now.strftime("%d-%m-%Y_%H-%M-%S")
    
    pickle.dump(rand_fores, open("model/new_version/rand_forest_randomizedsearch_" + dt_string + ".pkl", "wb"))
    print('*************************')
    print('File name')
    print('*************************')
    print("model/new_version/rand_forest_randomizedsearch_" + dt_string + ".pkl")

Last Result Found in:

`model/new_version/rand_forest_randomizedsearch.pkl`

In [None]:
model_path = "model/new_version/rand_forest_randomizedsearch.pkl"
rand_fores = pickle.load(open(model_path, "rb"))

In [None]:
rand_fores.best_params_

### Gradient Boosting

In [None]:
run_randomized_rs_gra_boo = False

if run_randomized_rs_gra_boo:

    # Create a hyperparameter grid for Gradient Boosting Classifier
    gb_grid = {"loss":['exponential', 'deviance'],
               "learning_rate":[0.1, 0.05], 
               "max_depth":[None, 1, 3], 
               "n_estimators":np.arange(10, 1000, 20), 
               "min_samples_split":np.arange(2, 10, 1), 
               "min_samples_leaf":np.arange(1, 5, 1)}

    rs_gra_boo = rand_grid_search(GradientBoostingClassifier(), 
                                  gb_grid, 
                                  name="Gradient Boosting Classifier", 
                                  n_iterat=500)

    accur_rs_gra_boo = compute_accuracy(rs_gra_boo)
    
    now = datetime.now()
    dt_string = now.strftime("%d-%m-%Y_%H-%M-%S")
    
    pickle.dump(rs_gra_boo, open("model/new_version/grad_boost_randomizedsearch_" + dt_string + ".pkl", "wb"))
    print('*************************')
    print('File name')
    print('*************************')
    print("model/new_version/grad_boost_randomizedsearch_" + dt_string + ".pkl")

Last Result Found in:

`model/new_version/grad_boost_randomizedsearch.pkl`

In [None]:
model_path = "model/new_version/grad_boost_randomizedsearch.pkl"
rs_gra_boo = pickle.load(open(model_path, "rb"))

In [None]:
rs_gra_boo.best_params_

In [None]:
accur_rs_gra_boo = compute_accuracy(rs_gra_boo)

Among the four models, we chose two, since they seem to behave better.

Between the two logistic regression models, we select the cross-validated one, since it gives a more realistic result even though they score the same precision.

The random forest model is the best of our approaches, since it has the highest accuracy with the least overfitting.

Regarding the gradient boosting model, the overfitting is considerable: the accuracy difference between train and validation is 12%. We therefore chose not to pursue this model.
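
A train/validation gap like the one described here can be measured for a single model with `cross_validate` and `return_train_score=True`. The sketch below is an illustration on synthetic data, so its numbers say nothing about the Titanic models; the variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

scores = cross_validate(
    GradientBoostingClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # also score each fold's model on its own training part
)

train_mean = scores["train_score"].mean()
val_mean = scores["test_score"].mean()
gap = train_mean - val_mean  # a large positive gap suggests overfitting

print(f"train={train_mean:.3f}  validation={val_mean:.3f}  gap={gap:.3f}")
```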

## GridSearchCV

Now we'll focus, as already said, on a deeper search of the hyperparameters for the two chosen models, `LogisticRegressionCV()` and `RandomForestClassifier()`, through `GridSearchCV()`.

We now update our grids.
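
Because an exhaustive search fits every combination, it is worth counting the combinations before launching it. The sketch below uses `ParameterGrid` on a grid that mirrors the random forest values defined in the next cell (written with plain lists); `rf_grid_demo` and `n_folds` are names introduced for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the random forest grid in the notebook, as plain lists.
rf_grid_demo = {
    "n_estimators": list(range(10, 150, 10)),   # 14 values
    "max_depth": [4, 6, 8, 10],                 # 4 values
    "max_features": ["auto", "sqrt", "log2"],   # 3 values
    "min_samples_split": [6, 8],                # 2 values
    "min_samples_leaf": [6, 8],                 # 2 values
}

n_combinations = len(ParameterGrid(rf_grid_demo))
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations)  # 672 = 14 * 4 * 3 * 2 * 2
print(n_fits)          # 3360 model fits with 5-fold cross-validation
```

The count is simply the product of the list lengths, so adding one more value to any list multiplies the running time accordingly.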

### Grids

In [None]:
# Create an hyperparameter grid for Logistic Regression - GridSearchCV Version
log_reg_grid_gscv = {"C":np.logspace(-5, 5, 200), 
                "dual":[True, False],
                "fit_intercept":[True, False],
                "penalty" : ['l1', 'l2', 'elasticnet'],
                "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

# Create an hyperparameter grid for Logistic Regression CV - GridSearchCV Version
log_reg_CV_grid_gscv = {"Cs":[np.logspace(-1, 1, 30).tolist(), 1, 3, 5, 7, 10, 12], 
                        "fit_intercept":[True, False],
                        "dual":[True, False],
                        "penalty" : ['l1', 'l2', 'elasticnet'],
                        "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

# Create an hyperparameter grid for Random Forest Classifier
rf_grid_gscv = {"n_estimators":np.arange(10, 150, 10), 
                "max_depth":[4, 6, 8, 10], 
                'max_features': ['auto', 'sqrt', 'log2'],
                "min_samples_split":np.arange(6, 10, 2), 
                "min_samples_leaf":np.arange(6, 10, 2)}

### Logistic Regression 

In [None]:
grid_search_logistic = False

if grid_search_logistic:

    log_reg_gridsearch = rand_grid_search(LogisticRegression(), 
                                          log_reg_grid_gscv,
                                          randomiz=False,  
                                          name="Logistic Regression")

    # Score the model found by the grid search, not the earlier randomized search
    accur_log_reg_gridsearch = compute_accuracy(log_reg_gridsearch)
    
    pickle.dump(log_reg_gridsearch, open("model/new_version/logistic_gridsearch.pkl", "wb"))

In [None]:
log_reg_gridsearch = pickle.load(open("model/new_version/logistic_gridsearch.pkl", "rb"))

In [None]:
log_reg_gridsearch.best_params_

In [None]:
accur_log_reg_gridsearch = compute_accuracy(log_reg_gridsearch)

### Logistic Regression CV

In [None]:
grid_search_logistic_CV = False

if grid_search_logistic_CV:

    log_reg_CV_gricv = rand_grid_search(LogisticRegressionCV(), 
                                        log_reg_CV_grid_gscv, 
                                        randomiz=False,  
                                        name="Logistic Regression CV")

    accur_log_reg_CV_rancv = compute_accuracy(log_reg_CV_gricv)
    
    pickle.dump(log_reg_CV_gricv, open("model/new_version/logistic_CV_gridsearch.pkl", "wb"))

In [None]:
log_reg_CV_gricv = pickle.load(open("model/new_version/logistic_CV_gridsearch.pkl", "rb"))

In [None]:
log_reg_CV_gricv.best_params_

### Random Forest

In [None]:
grid_Search_rand_fores = False

if grid_Search_rand_fores:

    rand_fores_gcv = rand_grid_search(RandomForestClassifier(), 
                                      rf_grid_gscv, 
                                      randomiz=False,  
                                      name="Random Forest Classifier")

    accur_rand_fores_gcv = compute_accuracy(rand_fores_gcv)
    
    pickle.dump(rand_fores_gcv, open("model/new_version/rand_forest_gridsearch.pkl", "wb"))

In [6]:
rand_fores_gcv = pickle.load(open("model/new_version/rand_forest_gridsearch.pkl", "rb"))

In [7]:
rand_fores_gcv.best_params_

{'max_depth': 10,
 'max_features': 'auto',
 'min_samples_leaf': 6,
 'min_samples_split': 6,
 'n_estimators': 30}

OK, let's compare the best result obtained here with the author's model.

In [32]:
parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50, 
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
np.random.seed(42)
author_model = RandomForestClassifier(**parameters)
author_model.fit(new_train, targets)

print('Author Random Forest')
accur_rand_fores_author = compute_accuracy(author_model)
print('Updated Random Forest')
# Parameters taken from rand_fores_gcv.best_params_
parameters = {'bootstrap': True, 'min_samples_leaf': 6, 'n_estimators': 30, 
                  'min_samples_split': 6, 'max_features': 'auto', 'max_depth': 10}
anot = RandomForestClassifier(**parameters)
anot.fit(new_train, targets)
np.random.seed(42)
accur_rand_fores_gcv = compute_accuracy(anot)

Author Random Forest
*************************
Accuracy on Train Dataset	0.8571428571428571
Accuracy on Test Dataset	0.8490566037735849
Updated Random Forest
*************************
Accuracy on Train Dataset	0.8554216867469879
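Accuracy on Test Dataset	0.8461538461538461


> The two models score almost the same on the validation dataset: 0.849 for the author's model and 0.846 for the updated one. Both were fitted on the full `new_train`, which includes the validation rows, so these scores are optimistic.

Let's see what the result is on Kaggle!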

In [None]:
output = anot.predict(new_test).astype(int)
df_output = pd.DataFrame()
aux = pd.read_csv('data/test.csv')
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('submission/my_try.csv', index=False)

In [34]:
print(compute_score(author_model, new_train, targets, scoring='accuracy'))
print(compute_score(anot, new_train, targets, scoring='accuracy'))

0.8293955181721172
0.8215366267026551


## Deeper in GridSearchCV

We have already used `GridSearchCV()`, but we need to understand it better. At the end of the process it returns, among the combinations we tried, the one with the best cross-validated score on the train dataset we gave it. This puts us at risk of overfitting, so we need to examine every result as carefully as possible. Remember:
* `log_reg_gridsearch` is the GridSearchCV for Logistic Regression
* `log_reg_CV_gricv` is the GridSearchCV for Logistic Regression CV
* `rand_fores_gcv` is the GridSearchCV for Random Forest Classifier
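
One way to inspect every combination, rather than only the winner, is to ask `GridSearchCV` for training scores as well and load `cv_results_` into a DataFrame. The sketch below is a small illustration on synthetic data with a deliberately tiny grid; it is not the search stored in `rand_fores_gcv`, and the `gap` column is a name chosen for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 6], "n_estimators": [10, 30]},
    scoring="accuracy",
    cv=5,
    return_train_score=True,  # adds mean_train_score to cv_results_
)
search.fit(X_demo, y_demo)

columns = ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns]

# Difference between the fit on the training folds and the held-out folds.
results["gap"] = results["mean_train_score"] - results["mean_test_score"]

print(results.sort_values("rank_test_score"))
```

Here `mean_test_score` is the mean score on the held-out folds of the cross-validation, and `rank_test_score` equal to 1 marks the combination that `best_params_` reports.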

In [22]:
rand_fores_gcv.best_params_

{'max_depth': 10,
 'max_features': 'auto',
 'min_samples_leaf': 6,
 'min_samples_split': 6,
 'n_estimators': 30}

In [23]:
df_res2 = pd.concat([pd.DataFrame(rand_fores_gcv.cv_results_["params"]),
                    pd.DataFrame(rand_fores_gcv.cv_results_["mean_test_score"], columns=["TrainAccuracy"])], axis=1)
print(df_res2.shape)
df_res = df_res2.where(pd.notnull(df_res2), None)

(672, 6)


In [24]:
df_res

Unnamed: 0,max_depth,max_features,min_samples_leaf,min_samples_split,n_estimators,TrainAccuracy
0,4,auto,6,6,10,0.812426
1,4,auto,6,6,20,0.819005
2,4,auto,6,6,30,0.816373
3,4,auto,6,6,40,0.825593
4,4,auto,6,6,50,0.820329
...,...,...,...,...,...,...
667,10,log2,8,8,100,0.820338
668,10,log2,8,8,110,0.824303
669,10,log2,8,8,120,0.820338
670,10,log2,8,8,130,0.824294


In [26]:
# Also compute the accuracy on the validation set
acc_test = list()
acc_tota = list()
count = 0
for max_de, max_fea, min_sa_le, min_sa_spl, n_est in zip(df_res.max_depth, df_res.max_features, df_res.min_samples_leaf, df_res.min_samples_split, df_res.n_estimators):
    parameters = {'max_depth': max_de,
                  'max_features': max_fea,
                  'min_samples_leaf': min_sa_le,
                  'min_samples_split': min_sa_spl,
                  'n_estimators': n_est
                 }
    np.random.seed(42)
    last_model = RandomForestClassifier(**parameters, n_jobs=-1)
    last_model.fit(X_train, y_train)
    all_accuracies2 = np.mean(cross_val_score(last_model, X_test, y_test, cv = 5, scoring='accuracy', verbose=0))
    acc_test.append(np.mean(all_accuracies2))
    acc_tota.append(np.mean(cross_val_score(last_model, new_train, targets, cv = 5, scoring='accuracy', verbose=0)))
    count += 1
    if count % 100 == 0:
        print(count)

100
200
300
400
500
600
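
One detail is worth keeping in mind when reading the scores produced by the loop above: `cross_val_score` clones the estimator and refits the clones on folds of whatever data it receives, so calling it on `X_test` scores models trained on parts of the validation set, not the model fitted on `X_train`. The sketch below, on synthetic data with placeholder names, contrasts that with a plain held-out score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for new_train and targets (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

rf = RandomForestClassifier(n_estimators=30, random_state=42)

# cross_val_score fits clones of rf on folds of X_val; rf itself is left untouched.
cv_on_val = cross_val_score(rf, X_val, y_val, cv=5, scoring="accuracy").mean()
still_unfitted = not hasattr(rf, "estimators_")

# Held-out accuracy of a model fitted on the training split only.
rf.fit(X_tr, y_tr)
holdout_acc = rf.score(X_val, y_val)

print(still_unfitted)  # True: the cross-validation never fitted rf itself
print(cv_on_val, holdout_acc)
```

The held-out score answers the question "how does the model trained on the train split do on unseen rows", which is usually what a validation column is meant to show.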


In [27]:
len(acc_test)

672

In [28]:
df_res["TestAccuracy"] = acc_test
df_res["TotalAccuracy"] = acc_tota
df_res

Unnamed: 0,max_depth,max_features,min_samples_leaf,min_samples_split,n_estimators,TrainAccuracy,TestAccuracy,TotalAccuracy
0,4,auto,6,6,10,0.812426,0.790598,0.822648
1,4,auto,6,6,20,0.819005,0.805983,0.815950
2,4,auto,6,6,30,0.816373,0.783761,0.823778
3,4,auto,6,6,40,0.825593,0.791453,0.829408
4,4,auto,6,6,50,0.820329,0.783476,0.824933
...,...,...,...,...,...,...,...,...
667,10,log2,8,8,100,0.820338,0.790883,0.829396
668,10,log2,8,8,110,0.824303,0.790883,0.827142
669,10,log2,8,8,120,0.820338,0.791453,0.824889
670,10,log2,8,8,130,0.824294,0.806553,0.828272


In [29]:
df_res["difference"] = df_res["TrainAccuracy"] - df_res["TestAccuracy"]
df_res["abs_difference"] = abs(df_res["TrainAccuracy"] - df_res["TestAccuracy"])

In [30]:
df_res.sort_values('TotalAccuracy', ascending=False)[0:30]

Unnamed: 0,max_depth,max_features,min_samples_leaf,min_samples_split,n_estimators,TrainAccuracy,TestAccuracy,TotalAccuracy,difference,abs_difference
562,10,sqrt,6,6,30,0.828259,0.783761,0.832779,0.044498,0.044498
506,10,auto,6,6,30,0.836223,0.783761,0.832779,0.052463,0.052463
576,10,sqrt,6,8,30,0.828285,0.783761,0.832779,0.044524,0.044524
632,10,log2,6,8,30,0.830891,0.783761,0.832779,0.04713,0.04713
618,10,log2,6,6,30,0.821663,0.783761,0.832779,0.037902,0.037902
520,10,auto,6,8,30,0.832232,0.783761,0.832779,0.048472,0.048472
304,6,log2,6,8,110,0.829583,0.791168,0.83276,0.038415,0.038415
87,4,sqrt,8,6,40,0.826952,0.783476,0.83276,0.043476,0.043476
234,6,sqrt,6,6,110,0.824303,0.791168,0.83276,0.033135,0.033135
101,4,sqrt,8,8,40,0.820329,0.783476,0.83276,0.036854,0.036854


In [None]:
df_res2 = df_res.sort_values('TrainAccuracy', ascending=False)[0:30]
df_res2[df_res2.abs_difference == df_res2.abs_difference.min()]

In [None]:
parameters = {'max_depth': 10,
              'max_features': 'auto',
              'min_samples_leaf': 6,
              'min_samples_split': 6,
              'n_estimators': 30
             }

np.random.seed(42)
# parameters = another.best_params_
another = RandomForestClassifier(**parameters)
another.fit(new_train, targets)
print('*************************')
print(another.score(X_train, y_train))
print(another.score(X_test, y_test))

In [None]:
print(compute_score(author_model, new_train, targets, scoring='accuracy'))
print(compute_score(another, new_train, targets, scoring='accuracy'))

In [None]:
output = another.predict(new_test).astype(int)
df_output = pd.DataFrame()
aux = pd.read_csv('data/test.csv')
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('submission/my_try.csv', index=False)