## First Kaggle Competition 

Requirements: 
Predict survival on the Titanic ship based on the Titanic Dataset downloaded from Kaggle

### Collect the data from Kaggle

In [2]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Importing the Titanic Training set from Kaggle
titanic = pd.read_csv("Data\Titanic_train.csv")
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


**As seen, the data is structured.**

### Establishing the feature space and target variable
* The target variable is: **Survived** `(numerical)` feature which indicates the status: 0->death, 1->survived


* The feature matrix has:
    1. **PassengerId** `(numerical)` = the number or id of passenger in thid dataset
    2. **Pclass** `(numerical)` = Ticket class (1, 2, 3)
    3. **Name** `(categorical)` = Name of the passenger.
    4. **Sex** `(categorical)` = male or female 
    5. **Age** `(numerical)` = age of the passenger
    6. **Parch** `(numerical)` = # of parents / children aboard the Titanic
    7. **Ticket** `(categorica)` = ticket number
    8. **Fare** `(numerical)` = Passenger fare
    9. **Cabin** `(categorical)` = Cabin number
    10. **Embarked** `(categorical)` = Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

### Establishing The Problem. 
* Since the target variable is a binary feature, it means that we face a **Binary Classification** problem where we have to predict wether a passenger survived or not. The targer variable is known, hence we deal with Supervised Learning.
What are the features which predict wether a passenger survived or not?


### Is the dataset balanced with respect to the target variable?

In [3]:
survived = titanic[titanic["Survived"] == 0]
print(f"From the {len(titanic)} datapoints, {len(survived)} survived and {len(titanic) - len(survived)} did not make it ")

From the 891 datapoints, 549 survived and 342 did not make it 


As seen above, the our data is not balanced. There is an offset of 207 between the survived and non survived data points.

### Does our Titanic dataset need feature engineering. Encoding from categorical to numerical values.

In [4]:
# The dtypes of each column of the Titanic dataset
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Since there are some categorical feature, our data needs feature engineering.

### Does our Titanic dataset need imputation? Dealing with missing values.

In [5]:
# Calculating the sum of the missing values
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

There are missing values on **age**, **Cabin** and **Embarked** features in the training set, hence our data needs imputation.

### Importing the Titanic test Dataset

In [6]:
# Importing the Titanic test dataset
titanic_test = pd.read_csv("Data\Titanic_test.csv")
titanic_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


The Titanic test Dataset does not contain the target variable **"Survived"**

### Does our Titanic test set need imputation? Dealing with missing values. The test set needs feature encoding.

In [7]:
titanic_test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

There are missing values on **Age**, **Cabin** and **Fare** features in the test set, hence the data needs imputation.

### Preprocessing the data.  1) Splitting the dataset into the features space and target space.
* There is no need for splitting the data into the training and test set, since there are 2 separate datasets.
* <b>Let's suppose some features from the features space are not relevant such as: Name, Ticket and PassengerId.
    Our machine learning model can not find patterns through this kind of features because there are possibly 890 different values
    for each of these features. </b>

Finally, our feature space and target space look like this:

In [8]:
# The feature and target variable spaces for train and test dataset
X_train = titanic.drop(columns = ["Name", "Ticket", "PassengerId", "Survived"], axis = 1)
y_train = titanic["Survived"]
X_test = titanic_test.drop(columns = ["Name", "Ticket", "PassengerId"], axis = 1)
#y_test = titanic["Survived"]

# Viewing the feature space and target space
X_train, y_train

(     Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
 0         3    male  22.0      1      0   7.2500   NaN        S
 1         1  female  38.0      1      0  71.2833   C85        C
 2         3  female  26.0      0      0   7.9250   NaN        S
 3         1  female  35.0      1      0  53.1000  C123        S
 4         3    male  35.0      0      0   8.0500   NaN        S
 ..      ...     ...   ...    ...    ...      ...   ...      ...
 886       2    male  27.0      0      0  13.0000   NaN        S
 887       1  female  19.0      0      0  30.0000   B42        S
 888       3  female   NaN      1      2  23.4500   NaN        S
 889       1    male  26.0      0      0  30.0000  C148        C
 890       3    male  32.0      0      0   7.7500   NaN        Q
 
 [891 rows x 8 columns], 0      0
 1      1
 2      1
 3      1
 4      0
       ..
 886    0
 887    1
 888    0
 889    1
 890    0
 Name: Survived, Length: 891, dtype: int64)

### 2) Imputation

In [9]:
# Imputation
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
    
# Only age, Cabin and Embarked features have missing values. Since the cabin has a very high rate of missing values,
# and has a very wide range of possible values, the best strategy is to assign the fill_value = "missing".

# Establishing the features that require imputation
categorical_features = ["Cabin", "Embarked"]
numerical_features_train = ["Age"]
numerical_features_test = ["Fare"]

# Instantiating the SimpleImputer class for each feature
categorical_imputer = SimpleImputer(strategy = "constant", fill_value = "missing")                           
age_imputer = SimpleImputer(strategy = "mean")
fare_imputer = SimpleImputer(strategy = "mean")

# Applying ColumnTransformer to the categories
imputed_transform_train = ColumnTransformer([("categorical_imputer", categorical_imputer, categorical_features),
                                     ("age_imputer", age_imputer, numerical_features_train)])


imputed_transform_test = ColumnTransformer([("categorical_imputer", categorical_imputer, categorical_features),
                                     ("age_imputer", age_imputer, numerical_features_train),
                                     ("fare_imputer", fare_imputer, numerical_features_test)])


filled_X_train = imputed_transform_train.fit_transform(X_train)
filled_X_test = imputed_transform_test.fit_transform(X_test)

# Making df1 and df2 dataframes which hold the imputed values for the below features
df1 = pd.DataFrame(filled_X_train, columns = ["Cabin", "Embarked", "Age"])
df2 = pd.DataFrame(filled_X_test, columns = ["Cabin", "Embarked", "Age", "Fare"])

titanic_filled_train = X_train
titanic_filled_train["Cabin"], titanic_filled_train["Embarked"], titanic_filled_train["Age"] = df1["Cabin"], df1["Embarked"], df1["Age"]
titanic_filled_test = X_test
titanic_filled_test["Cabin"], titanic_filled_test["Embarked"], titanic_filled_test["Age"], titanic_filled_test["Fare"] = df2["Cabin"], df2["Embarked"], df2["Age"], df2["Fare"]
# titanic_filled_train and test are the imputed final versions of the data

### 3) Feature Encoding

In [10]:
# Making a list of the categories to be encoded
from sklearn.preprocessing import OneHotEncoder
categories = ["Pclass", "Sex", "Parch", "Cabin", "Embarked"]
# Instantiate the OneHotEncoder class
encoder = OneHotEncoder()

# Applying ColumnTransformer to the categories
transformed_train = ColumnTransformer(transformers = [("encoder", encoder, categories)],
                               remainder = "passthrough")

transformed_test = ColumnTransformer(transformers = [("encoder", encoder, categories)],
                               remainder = "passthrough")

encoded_X_train = transformed_train.fit_transform(titanic_filled_train)
encoded_X_test = transformed_test.fit_transform(titanic_filled_test)

In [11]:
# Viewing the Preprocessed data
encoded_X_train, encoded_X_test

(<891x167 sparse matrix of type '<class 'numpy.float64'>'
 	with 6505 stored elements in Compressed Sparse Row format>,
 <418x96 sparse matrix of type '<class 'numpy.float64'>'
 	with 3059 stored elements in Compressed Sparse Row format>)

In total there are 891 data points with the feature space of 167 features. 5/Parch, 3/Pclass, 3/Embarked, 2/Sex, 154/Cabin 

## Choosing a machine learning model

In [12]:
from sklearn.svm import LinearSVC
linear_svc = LinearSVC()
linear_svc.fit(encoded_X_train, y_train);
#linear_svc.score(encoded_X_test, y_test)
linear_svc.score(encoded_X_test, y_test)



NameError: name 'y_test' is not defined

###  The model is not working because the Test and Train datasets have different shapes because of Encoding of Cabin feature.  The training set has 167 columns and and the training set has 96. There is a mismatch of features.

### Note! We should combine X_train and X_test in a single dataset, apply imputation and feature encoding, and then split into train and test

In [13]:
# Set a random seed to get the same output
np.random.seed(42)

# Importing the training and test sets again
titanic_test = pd.read_csv("Data\Titanic_test.csv")
titanic = pd.read_csv("Data\Titanic_train.csv")

# Phase one of splitting into Feature Space and target variable space
feature_space_train = titanic.drop(columns = ["Name", "Ticket", "PassengerId", "Survived"], axis = 1)
y = titanic["Survived"]
feature_space_test = titanic_test.drop(columns = ["Name", "Ticket", "PassengerId"], axis = 1)

# Checking the datasets
feature_space_train, feature_space_test

(     Pclass     Sex   Age  SibSp  Parch     Fare Cabin Embarked
 0         3    male  22.0      1      0   7.2500   NaN        S
 1         1  female  38.0      1      0  71.2833   C85        C
 2         3  female  26.0      0      0   7.9250   NaN        S
 3         1  female  35.0      1      0  53.1000  C123        S
 4         3    male  35.0      0      0   8.0500   NaN        S
 ..      ...     ...   ...    ...    ...      ...   ...      ...
 886       2    male  27.0      0      0  13.0000   NaN        S
 887       1  female  19.0      0      0  30.0000   B42        S
 888       3  female   NaN      1      2  23.4500   NaN        S
 889       1    male  26.0      0      0  30.0000  C148        C
 890       3    male  32.0      0      0   7.7500   NaN        Q
 
 [891 rows x 8 columns],
      Pclass     Sex   Age  SibSp  Parch      Fare Cabin Embarked
 0         3    male  34.5      0      0    7.8292   NaN        Q
 1         3  female  47.0      1      0    7.0000   NaN     

In [14]:
# X = Combined feature_space_test and feature_space_train. feature_space_test is appended to the feature_space_train. Hence, the feature_space_test index starts at 891 instead of 0
feature_space_test.index = range(891, 1309)
X = feature_space_train.append(feature_space_test)
X

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,male,22.0,1,0,7.2500,,S
1,1,female,38.0,1,0,71.2833,C85,C
2,3,female,26.0,0,0,7.9250,,S
3,1,female,35.0,1,0,53.1000,C123,S
4,3,male,35.0,0,0,8.0500,,S
...,...,...,...,...,...,...,...,...
1304,3,male,,0,0,8.0500,,S
1305,1,female,39.0,0,0,108.9000,C105,C
1306,3,male,38.5,0,0,7.2500,,S
1307,3,male,,0,0,8.0500,,S


In [15]:
# Checking where the data needs imputation
X.isna().sum()

Pclass         0
Sex            0
Age          263
SibSp          0
Parch          0
Fare           1
Cabin       1014
Embarked       2
dtype: int64

### Apply again feature encoding and imputation to the new X feature space

In [16]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
    
# Only age, Cabin and Embarked features have missing values. 

# Establishing the features that require imputation
categorical_features = ["Cabin", "Embarked"]
# numerical feature of the train set to be encoded
numerical_features_train = ["Age"]
# numerical feature of the test set to be encoded
numerical_features_test = ["Fare"]

# Instantiating the SimpleImputer class for each feature
categorical_imputer = SimpleImputer(strategy = "constant", fill_value = "missing")                           
age_imputer = SimpleImputer(strategy = "mean")
fare_imputer = SimpleImputer(strategy = "mean")

# Applying ColumnTransformer to the categories
imputed_transform = ColumnTransformer([("categorical_imputer", categorical_imputer, categorical_features),
                                     ("age_imputer", age_imputer, numerical_features_train),
                                     ("fare_imputer", fare_imputer, numerical_features_test)])

# filled_X represents the imputed array with missing or mean depending on which feature it is addressed
filled_X = imputed_transform.fit_transform(X)

# Transforming the filled_X array into the titanic_filled dataset (Dataset with no missing values)
df_X = pd.DataFrame(filled_X, columns = ["Cabin", "Embarked", "Age", "Fare"])
titanic_filled = X
titanic_filled["Cabin"], titanic_filled["Embarked"], titanic_filled["Age"], titanic_filled["Fare"] = df_X["Cabin"], df_X["Embarked"], df_X["Age"], df_X["Fare"]
titanic_filled

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,male,22,1,0,7.25,missing,S
1,1,female,38,1,0,71.2833,C85,C
2,3,female,26,0,0,7.925,missing,S
3,1,female,35,1,0,53.1,C123,S
4,3,male,35,0,0,8.05,missing,S
...,...,...,...,...,...,...,...,...
1304,3,male,29.8811,0,0,8.05,missing,S
1305,1,female,39,0,0,108.9,C105,C
1306,3,male,38.5,0,0,7.25,missing,S
1307,3,male,29.8811,0,0,8.05,missing,S


In [17]:
# Checking the integrity of the imputation process above.
X.isna().sum()

Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Cabin       0
Embarked    0
dtype: int64

In [18]:
# Making a list of the categories to be encoded
from sklearn.preprocessing import OneHotEncoder
# Pclass and Parch are chosen to be encoded even if they are numeric.
categories = ["Pclass", "Sex", "Parch", "Cabin", "Embarked"]

# Instantiate the OneHotEncoder class
encoder = OneHotEncoder()

# Applying ColumnTransformer to the categories
transformed = ColumnTransformer(transformers = [("encoder", encoder, categories)],
                               remainder = "passthrough")

# encoded_X = encoded array  
encoded_X = transformed.fit_transform(titanic_filled)
encoded_X

<1309x207 sparse matrix of type '<class 'numpy.float64'>'
	with 9564 stored elements in Compressed Sparse Row format>

### Splitting the data into train and test sets
* Due to the fact that the test set from kaggle does not have the true target variable values, we need to split the training set from kaggle into another training and test sets and then to fit a model to the training set, perform evaluation, hyperperameters tuning and then after achieving the desired model to predict the target variable on the kaggle test data. 

### Splitting the kaggle training dataset into training and test sets. Even if training and test sets from kaggle are separate, the test set does not containt the target variable values. It is required to split the training set into another training and test sets, to fit a model to the data, to evaluate, and then to apply the model to the kaggle Train set to evaluate the target variable.

In [19]:
from sklearn.model_selection import train_test_split
# Set a random seed
np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(encoded_X[:891], y, test_size = 0.2)
# Above it is used encoded_X[:891] because the training set from kaggle has the first 891 datapoints in the encoded_X
X_train, X_test, len(y_train), len(y_test)

(<712x207 sparse matrix of type '<class 'numpy.float64'>'
 	with 5199 stored elements in Compressed Sparse Row format>,
 <179x207 sparse matrix of type '<class 'numpy.float64'>'
 	with 1306 stored elements in Compressed Sparse Row format>,
 712,
 179)

In [20]:
# All the features are float64
X_train.dtype, X_test.dtype, y_train.dtype, y_test.dtype

(dtype('float64'), dtype('float64'), dtype('int64'), dtype('int64'))

### Choosing a machine learning model
* According to the https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html, First we try LinearSVC

In [21]:
# 1st model
# Baseline LinearSVC Model
np.random.seed(42)

# Import the LinearSVC
from sklearn.svm import LinearSVC

# Instantiating the LinearSVC class
svc_baseline = LinearSVC()

# Fitting the data to the model
svc_baseline.fit(X_train, y_train);
y_svc_pred_baseline = svc_baseline.predict(X_test);
print(f"The LinearSVC baseline model has: {svc_baseline.score(X_test, y_test)*100:.2f}% score");
print(f"The predicted values of the baseline LinearSVC for the target variable are: \n {y_svc_pred_baseline}")

The LinearSVC baseline model has: 79.89% score
The predicted values of the baseline LinearSVC for the target variable are: 
 [0 0 0 1 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 0 1 1 1 0 1
 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0]




# The theory for metrics
* Confusion Matrix is a good way to visualize TP, TN, FP, FN
* Accuracy score (TP+TN/TP+TN+FP+FN) = % how many samples from the total samples were correctly predicted (correct predictions/ total population)
* Recall (TP/TP+FN) = % how many observed cases are corectly predicted 
* Precision (TP/TP+FP) = how likely the predicted value is correct.(how many of the positive predicted values are actually corect)
* f1 score is a combination between precision and recall
* auc (TPR/FPR) = It tells how much model is capable of distinguishing between classes. Higher the AUC, better the model is at predicting 0s as 0s and 1s as 1s.

## Due to the fact that the data is unbalanced 549 survived and 342 did not survived, it means that accuracy is not the quite the right metric
* Focus on higher auc_score , and higer TN, higher f1(which is a combination between the 2). Overall there should be balance when predicting both true positives and negatives.

In [22]:
# Metric Evaluation
# Making a function to evaluate our model performance in estimating the target variable
def evaluation_metrics(y_test, y_pred):
    from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
    confusion = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred)
    print(f"Confusion matrix: \n{confusion}")
    print(f"Classification report: \n {report}")
    print(f"auc = {auc}")

### Evaluate the SVC baseline

In [23]:
kn_baseline_evaluation = evaluation_metrics(y_test, y_svc_pred_baseline)

Confusion matrix: 
[[100   5]
 [ 31  43]]
Classification report: 
               precision    recall  f1-score   support

           0       0.76      0.95      0.85       105
           1       0.90      0.58      0.70        74

    accuracy                           0.80       179
   macro avg       0.83      0.77      0.78       179
weighted avg       0.82      0.80      0.79       179

auc = 0.7667310167310166


In [24]:
# Tuning Hyperparameters
np.random.seed(42)
from sklearn.model_selection import GridSearchCV
hyperparameters_svc = {"penalty": ["l2"],
                  "loss": ["hinge", "squared_hinge"],
                  #"dual": ["False"],
                  "max_iter": [1000,1200,1400,1600,1800,2000]}

svc_model = GridSearchCV(estimator = linear_svc,
                    param_grid = hyperparameters_svc,
                   verbose = 2,
                    cv = 5)

svc_model.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] loss=hinge, max_iter=1000, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1000, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1000, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1000, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1000, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1000, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1000, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1000, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1000, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1000, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1200, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1200, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1200, penalty=l2 ...........................
[CV] ...........

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV] ............ loss=hinge, max_iter=1200, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1200, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1200, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1200, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1200, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1400, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1400, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1400, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1400, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1400, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1400, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1400, penalty=l2 ...........................
[CV] ............ loss=hinge, max_iter=1400, penalty=l2, total=   0.0s
[CV] loss=hinge, max_iter=1400, penalty=l2 ...........................
[CV] .

[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed:    2.2s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='warn', n_jobs=None,
             param_grid={'loss': ['hinge', 'squared_hinge'],
                         'max_iter': [1000, 1200, 1400, 1600, 1800, 2000],
                         'penalty': ['l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)

In [25]:
svc_model.best_params_

{'loss': 'hinge', 'max_iter': 1800, 'penalty': 'l2'}

In [26]:
svc_model.best_estimator_

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='hinge', max_iter=1800, multi_class='ovr',
          penalty='l2', random_state=None, tol=0.0001, verbose=0)

In [29]:
y_svc_pred = svc_model.predict(X_test)
svc_model.score(X_test, y_test)

0.770949720670391

### Evaluate the SVC model

In [30]:
svc_model_evaluation = evaluation_metrics(y_test, y_svc_pred)

Confusion matrix: 
[[78 27]
 [14 60]]
Classification report: 
               precision    recall  f1-score   support

           0       0.85      0.74      0.79       105
           1       0.69      0.81      0.75        74

    accuracy                           0.77       179
   macro avg       0.77      0.78      0.77       179
weighted avg       0.78      0.77      0.77       179

auc = 0.7768339768339769


### Compare it with the baseline model

In [31]:
svc_baseline_evaluation = evaluation_metrics(y_test, y_svc_pred_baseline)

Confusion matrix: 
[[100   5]
 [ 31  43]]
Classification report: 
               precision    recall  f1-score   support

           0       0.76      0.95      0.85       105
           1       0.90      0.58      0.70        74

    accuracy                           0.80       179
   macro avg       0.83      0.77      0.78       179
weighted avg       0.82      0.80      0.79       179

auc = 0.7667310167310166


**The svc_model is better because it has higher auc, higher TN, and balanced f1-score**

### Choose another model. Nearest Neighbors model

In [32]:
# Import the model
np.random.seed(42)
from sklearn.neighbors import KNeighborsClassifier
kn_baseline = KNeighborsClassifier()
kn_baseline.fit(X_train, y_train)

y_kn_baseline_pred = kn_baseline.predict(X_test)
print(f"The KNeighborsClassifier baseline model has: {kn_baseline.score(X_test, y_test)*100:.2f}% score");
print(f"The predicted values of the baseline KNeighborsClassifier for the target variable are: \n {y_kn_baseline_pred}")

The KNeighborsClassifier baseline model has: 71.51% score
The predicted values of the baseline KNeighborsClassifier for the target variable are: 
 [0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0
 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1
 0 0 1 1 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0
 0 1 0 1 0 0 0 1 0 0 1 1 1 0 0 1 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1 1 0
 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1]


In [33]:
# The probability estimates for the X_test. 0th columns is 0 and the first column = 1 
kn_baseline_predict_proba = pd.DataFrame(data = kn_baseline.predict_proba(X_test), columns = ["0 Probability", "1 Probability"])
kn_baseline_predict_proba

Unnamed: 0,0 Probability,1 Probability
0,1.0,0.0
1,0.8,0.2
2,1.0,0.0
3,0.6,0.4
4,0.6,0.4
...,...,...
174,0.8,0.2
175,1.0,0.0
176,1.0,0.0
177,0.6,0.4


In [34]:
# See how many predictions have 0.4,0.6 probabilities or 0.6,0.4
# count is the dataframe with a["0 Probability"] >= 0.4 and ["0 Probability"] <= 0.6
count_kn_baseline = kn_baseline_predict_proba[kn_baseline_predict_proba["0 Probability"] >= 0.4]
count_kn_baseline = count_kn_baseline[count_kn_baseline["0 Probability"] <= 0.6]
# middle_predictions = the length of the count
middle_predictions_kn_baseline = len(count_kn_baseline)
count_kn_baseline

Unnamed: 0,0 Probability,1 Probability
3,0.6,0.4
4,0.6,0.4
8,0.4,0.6
14,0.4,0.6
15,0.6,0.4
...,...,...
159,0.4,0.6
160,0.6,0.4
162,0.6,0.4
165,0.6,0.4


### Evaluating the baseline KNeighborsClassifier model

In [35]:
kn_baseline_evaluation = evaluation_metrics(y_test, y_kn_baseline_pred)

Confusion matrix: 
[[88 17]
 [34 40]]
Classification report: 
               precision    recall  f1-score   support

           0       0.72      0.84      0.78       105
           1       0.70      0.54      0.61        74

    accuracy                           0.72       179
   macro avg       0.71      0.69      0.69       179
weighted avg       0.71      0.72      0.71       179

auc = 0.6893178893178893


### Hyperparameter tunig for KNeighborsClassifier

In [36]:
from sklearn.model_selection import GridSearchCV
np.random.seed(42)
hyperparameters_kn = {"leaf_size": [30, 40, 50],
                     "weights": ["uniform", "distance"],
                     "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
                      "p": [1,2],
                      "n_jobs": [1]}

kn_model = GridSearchCV(estimator = kn_baseline,
                       param_grid = hyperparameters_kn,
                       verbose = 2,
                       cv = 5)

kn_model.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform ....
[CV]  algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform ....
[CV]  algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform ....
[CV]  algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform ....
[CV]  algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform ....
[CV]  algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=distance ...
[CV]  algorithm=auto, leaf_size=30, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=auto

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV]  algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=uniform ....
[CV]  algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=uniform ....
[CV]  algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance ...
[CV]  algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance ...
[CV]  algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance ...
[CV]  algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance ...
[CV]  algorithm=auto, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0



[CV]  algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=uniform 
[CV]  algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=uniform 
[CV]  algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=uniform 
[CV]  algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=distance, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=distance, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=30, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=ball_tree, leaf_size=30, 




[CV] algorithm=ball_tree, leaf_size=40, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=ball_tree, leaf_size=40, n_jobs=1, p=2, weights=distance, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=40, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=ball_tree, leaf_size=40, n_jobs=1, p=2, weights=distance, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=40, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=ball_tree, leaf_size=40, n_jobs=1, p=2, weights=distance, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=50, n_jobs=1, p=1, weights=uniform 
[CV]  algorithm=ball_tree, leaf_size=50, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=50, n_jobs=1, p=1, weights=uniform 
[CV]  algorithm=ball_tree, leaf_size=50, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=50, n_jobs=1, p=1, weights=uniform 
[CV]  algorithm=ball_tree, leaf_size=50, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=ball_tree, leaf_size=50,




[CV]  algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=uniform .
[CV]  algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=30, n_jobs=1, p=1, weigh



[CV]  algorithm=kd_tree, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=40, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=40, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=40, n_jobs=1, p=1, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=40, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=40, n_jobs=1, p=2, weights=uniform .
[CV]  algorithm=kd_tree, leaf_size=40, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=40, n_jobs=1, p=2, weights=uniform .
[CV]  algorithm=kd_tree, leaf_size=40, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=40, n_jobs=1, p=2, weights=uniform .
[CV]  algorithm=kd_tree, leaf_size=40, n_jobs=1, p=2, weight



[CV]  algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=uniform .
[CV]  algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=uniform .
[CV]  algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=uniform .
[CV]  algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=uniform, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=distance, total=   0.0s
[CV] algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=distance 
[CV]  algorithm=kd_tree, leaf_size=50, n_jobs=1, p=2, weights=

[CV]  algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=uniform, total=   0.0s
[CV] algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance ..
[CV]  algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance ..
[CV]  algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance ..
[CV]  algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance ..
[CV]  algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance ..
[CV]  algorithm=brute, leaf_size=50, n_jobs=1, p=1, weights=distance, total=   0.0s
[CV] algorithm=brute, leaf_size=50, n_jobs=1, p=2, weights=uniform ...
[CV]  algorithm=brute, leaf_size=50, n_jobs=1, p=2, weights=uniform, to

[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:    1.7s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid={'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
                         'leaf_size': [30, 40, 50], 'n_jobs': [1], 'p': [1, 2],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)

In [37]:
kn_model.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=1, n_neighbors=5, p=1,
                     weights='distance')

In [38]:
kn_model.best_score_

0.7429775280898876

In [39]:
kn_model.score(X_test, y_test)

0.7374301675977654

In [40]:
# The probability estimates for the X_test. 0th columns is 0 and the first column = 1 
kn_model_predict_proba = pd.DataFrame(data = kn_model.predict_proba(X_test), columns = ["0 Probability", "1 Probability"])
# See how many predictions have 0.4,0.6 probabilities or 0.6,0.4
# count is the dataframe with a ["0 Probability"] >= 0.4 and ["0 Probability"] <= 0.6
count_kn_model = kn_model_predict_proba[kn_model_predict_proba["0 Probability"] >= 0.4]
count_kn_model = count_kn_model[count_kn_model["0 Probability"] <= 0.6]
# middle_predictions = the length of the count 
middle_predictions_kn_model = len(count_kn_model)
count_kn_model

Unnamed: 0,0 Probability,1 Probability
4,0.422351,0.577649
15,0.477091,0.522909
16,0.504005,0.495995
21,0.598935,0.401065
26,0.421539,0.578461
29,0.551188,0.448812
36,0.548685,0.451315
49,0.545336,0.454664
57,0.585493,0.414507
60,0.578408,0.421592


### Evaluate KNeighborsClassifier model

In [41]:
y_kn_model_pred = kn_model.predict(X_test)
kn_model_evaluation = evaluation_metrics(y_test, y_kn_model_pred)

Confusion matrix: 
[[90 15]
 [32 42]]
Classification report: 
               precision    recall  f1-score   support

           0       0.74      0.86      0.79       105
           1       0.74      0.57      0.64        74

    accuracy                           0.74       179
   macro avg       0.74      0.71      0.72       179
weighted avg       0.74      0.74      0.73       179

auc = 0.7123552123552124


### Compare it with the baseline model

In [42]:
kn_baseline_evaluation = evaluation_metrics(y_test, y_kn_baseline_pred)

Confusion matrix: 
[[88 17]
 [34 40]]
Classification report: 
               precision    recall  f1-score   support

           0       0.72      0.84      0.78       105
           1       0.70      0.54      0.61        74

    accuracy                           0.72       179
   macro avg       0.71      0.69      0.69       179
weighted avg       0.71      0.72      0.71       179

auc = 0.6893178893178893


### Choose another model. RandomForestClassifier

**The kn model is better becuase it has higher TN, higher auc and higher f1-score**

In [43]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(42)
rf_baseline = RandomForestClassifier(n_jobs = 1)
rf_baseline.fit(X_train, y_train)
y_rf_baseline_pred = rf_baseline.predict(X_test)
print(f"The RandomForestClassifier baseline model has: {rf_baseline.score(X_test, y_test)*100:.2f}% score");
print(f"The predicted values of the baseline RandomForestClassifier for the target variable are: \n {y_rf_baseline_pred}")

The RandomForestClassifier baseline model has: 78.77% score
The predicted values of the baseline RandomForestClassifier for the target variable are: 
 [1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0
 1 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 1 1 1
 0 0 1 1 1 1 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1
 0 1 0 0 0 0 1 1 0 0 1 1 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 1
 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0]




In [44]:
# The probability estimates for the X_test. 0th columns is 0 and the first column = 1 
rf_baseline_predict_proba = pd.DataFrame(data = rf_baseline.predict_proba(X_test), columns = ["0 Probability", "1 Probability"])
# See how many predictions have 0.4,0.6 probabilities or 0.6,0.4
# count is the dataframe with a ["0 Probability"] >= 0.4 and ["0 Probability"] <= 0.6
count_rf_baseline = rf_baseline_predict_proba[rf_baseline_predict_proba["0 Probability"] >= 0.4]
count_rf_baseline = count_rf_baseline[count_rf_baseline["0 Probability"] <= 0.6]
# middle_predictions = the length of the count
middle_predictions_rf_baseline = len(count_rf_baseline) 
middle_predictions_rf_baseline

30

### Evalute the baseline RandomForestClassifier

In [54]:
rf_baseline_evaluation = evaluation_metrics(y_test, y_rf_baseline_pred)

Confusion matrix: 
[[91 14]
 [24 50]]
Classification report: 
               precision    recall  f1-score   support

           0       0.79      0.87      0.83       105
           1       0.78      0.68      0.72        74

    accuracy                           0.79       179
   macro avg       0.79      0.77      0.78       179
weighted avg       0.79      0.79      0.78       179

auc = 0.7711711711711712


### Hyperparameter tunig for RandomForestClassifier

In [56]:
from sklearn.model_selection import GridSearchCV
np.random.seed(42)

hyperparameters_rf = {"n_estimators": [100, 300, 600],
                     "max_depth": [None, 5, 10, 20, 30],
                     "max_features": ["auto", "sqrt"],
                     "min_samples_split": [2, 4, 6],
                     "min_samples_leaf": [1, 2, 4]}

rf_model = GridSearchCV(estimator = rf_baseline,
                       param_grid= hyperparameters_rf,
                       cv = 5,
                       verbose = 2)

rf_model.fit(X_train, y_train)

Fitting 5 folds for each of 270 candidates, totalling 1350 fits
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.2s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.2s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.2s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.4s
[CV] max_depth=None, max_features=

[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.8s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.8s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.9s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.8s
[CV] max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.8s
[CV] max_depth=None, max_features=

[CV]  max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.3s
[CV] max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=None, max_features=

[CV]  max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100 
[CV]  max_depth=None, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=

[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=300, total=   0.4s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=300 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=300, total=   0.5s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600, total=   0.9s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600, total=   0.8s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600, total=   0.8s
[CV] max_depth=None, max_features=

[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=100 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=100 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=300 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=300 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=None, max_features=

[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=100 
[CV]  max_depth=None, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=None, max_features=

[CV]  max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=1, min_

[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=2, min_

[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_

[CV]  max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=600 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=600 
[CV]  max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=auto, min_samples_leaf=4, min_

[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=300 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=300 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=1, min_

[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=100 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=100 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=2, min_

[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=600 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=600 
[CV]  max_depth=5, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=5, max_features=sqrt, min_samples_leaf=4, min_

[CV]  max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=100 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=100 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=300 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=300, total=   0.3s
[CV] max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=300 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=300, total=   0.3s
[CV] max_depth=10, max_features=auto, min_samples_le

[CV]  max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=600 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=600 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100 
[CV]  max_depth=10, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=auto, min_samples_le

[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.3s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.3s
[CV] max_depth=10, max_features=sqrt, min_samples_le

[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.6s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=600 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.6s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=600 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.6s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=sqrt, min_samples_le

[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=300 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=300 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=600 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=600 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=10, max_features=sqrt, min_samples_le

[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=100 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=10, max_features=sqrt, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=10, max_features=sqrt, min_samples_le

[CV]  max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=600, total=   0.8s
[CV] max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=600 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=4, n_estimators=600, total=   0.8s
[CV] max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=auto, min_samples_le

[CV]  max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=300 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.5s
[CV] max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=20, max_features=auto, min_samples_le

[CV]  max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=20, max_features=auto, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=auto, min_samples_le

[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=100 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=100 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=100 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=300 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=300, total=   0.4s
[CV] max_depth=20, max_features=sqrt, min_samples_le

[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=600, total=   0.5s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=600 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=600, total=   0.5s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=600 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=600 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=600, total=   0.5s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=100 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=20, max_features=sqrt, min_samples_le

[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=300 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=300 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=300, total=   0.2s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600 
[CV]  max_depth=20, max_features=sqrt, min_samples_leaf=4, min_samples_split=2, n_estimators=600, total=   0.4s
[CV] max_depth=20, max_features=sqrt, min_samples_le

[CV]  max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=100, total=   0.2s
[CV] max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.5s
[CV] max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.5s
[CV] max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=300, total=   0.5s
[CV] max_depth=30, max_features=auto, min_samples_le

[CV]  max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.8s
[CV] max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=600, total=   0.8s
[CV] max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=2, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=auto, min_samples_le

[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.5s
[CV] max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=600, total=   0.4s
[CV] max_depth=30, max_features=auto, min_samples_le

[CV]  max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300 
[CV]  max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=6, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=auto, min_samples_le

[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600, total=   0.8s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=4, n_estimators=600, total=   0.8s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=1, min_samples_split=6, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=sqrt, min_samples_le

[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=300 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.4s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.5s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=2, min_samples_split=4, n_estimators=600, total=   0.5s
[CV] max_depth=30, max_features=sqrt, min_samples_le

[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=100 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=100, total=   0.1s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300 
[CV]  max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=4, n_estimators=300, total=   0.2s
[CV] max_depth=30, max_features=sqrt, min_samples_le

[Parallel(n_jobs=1)]: Done 1350 out of 1350 | elapsed:  6.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=10, n_jobs=1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='warn'

In [57]:
rf_model.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=6,
                       min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [58]:
rf_model.best_params_

{'max_depth': 10,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 6,
 'n_estimators': 100}

In [59]:
# The probability estimates for the X_test. 0th columns is 0 and the first column = 1 
rf_model_predict_proba = pd.DataFrame(data = rf_model.predict_proba(X_test), columns = ["0 Probability", "1 Probability"])
# See how many predictions have 0.4,0.6 probabilities or 0.6,0.4
# count1 is the dataframe with a["0 Probability"] >= 0.4 and ["0 Probability"] <=0.6
count_rf_model = rf_model_predict_proba[rf_model_predict_proba["0 Probability"] >= 0.4]
count_rf_model = count_rf_model[count_rf_model["0 Probability"] <=0.6]
# middle_predictions = the length of the count
middle_predictions_rf_model = len(count_rf_model)
middle_predictions_rf_model

20

### Evaluate the RandomForestClassifier model

In [60]:
y_rf_model_pred = rf_model.predict(X_test)
rf_model_evaluation = evaluation_metrics(y_test, y_rf_model_pred)

Confusion matrix: 
[[96  9]
 [24 50]]
Classification report: 
               precision    recall  f1-score   support

           0       0.80      0.91      0.85       105
           1       0.85      0.68      0.75        74

    accuracy                           0.82       179
   macro avg       0.82      0.79      0.80       179
weighted avg       0.82      0.82      0.81       179

auc = 0.794980694980695


### Compare it with the RandomForestClassifier Baseline

In [61]:
rf_baseline_evaluation = evaluation_metrics(y_test, y_rf_baseline_pred)

Confusion matrix: 
[[91 14]
 [24 50]]
Classification report: 
               precision    recall  f1-score   support

           0       0.79      0.87      0.83       105
           1       0.78      0.68      0.72        74

    accuracy                           0.79       179
   macro avg       0.79      0.77      0.78       179
weighted avg       0.79      0.79      0.78       179

auc = 0.7711711711711712


**The RandonForestClassifier model is better because it has higher TN, higher f1, and higer auc**

# Finally, the 3 selected models are the models for svc, kn and rf classifier. Comparing all 3.

In [62]:
svc_evaluation = evaluation_metrics(y_test, y_svc_pred)

Confusion matrix: 
[[78 27]
 [14 60]]
Classification report: 
               precision    recall  f1-score   support

           0       0.85      0.74      0.79       105
           1       0.69      0.81      0.75        74

    accuracy                           0.77       179
   macro avg       0.77      0.78      0.77       179
weighted avg       0.78      0.77      0.77       179

auc = 0.7768339768339769


In [63]:
kn_evaluation = evaluation_metrics(y_test, y_kn_model_pred)

Confusion matrix: 
[[90 15]
 [32 42]]
Classification report: 
               precision    recall  f1-score   support

           0       0.74      0.86      0.79       105
           1       0.74      0.57      0.64        74

    accuracy                           0.74       179
   macro avg       0.74      0.71      0.72       179
weighted avg       0.74      0.74      0.73       179

auc = 0.7123552123552124


In [64]:
rf_model_evaluation = evaluation_metrics(y_test, y_rf_model_pred)

Confusion matrix: 
[[96  9]
 [24 50]]
Classification report: 
               precision    recall  f1-score   support

           0       0.80      0.91      0.85       105
           1       0.85      0.68      0.75        74

    accuracy                           0.82       179
   macro avg       0.82      0.79      0.80       179
weighted avg       0.82      0.82      0.81       179

auc = 0.794980694980695


# The best model is the RandomForestClassifier because it has the highest auc, high TN, and high f1-score. f1 is balanced and predict_proba of the RandonForestClassifier which gives the probability of the sample to be part of a class or another gives in total 20 samples which have probability between 40 and 60 to be in one class or another. Around this values of probability the classifier does not differentiate the classes very well.

### The final step! The model is ready to train the Kaggle train set and  test to the Kaggle test set!

In [65]:
# Train on the Kaggle train set
X_train_final = encoded_X[:891]
X_test_final = encoded_X[890:-1]
X_train_final, X_test_final

(<891x207 sparse matrix of type '<class 'numpy.float64'>'
 	with 6505 stored elements in Compressed Sparse Row format>,
 <418x207 sparse matrix of type '<class 'numpy.float64'>'
 	with 3058 stored elements in Compressed Sparse Row format>)

In [74]:
from sklearn.ensemble import RandomForestClassifier
# Fitting the best model tuned with hyperparameters to the Final training data.
titanic_model = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=6,
                       min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

titanic_model.fit(X_train_final, y)
# Predicting the target variable for the kaggle test dataset
y_pred = titanic_model.predict(X_test_final)
y_pred

array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,

In [79]:
# The predicted proba
titanic_model.predict_proba(X_test_final)

array([[0.90289372, 0.09710628],
       [0.8894897 , 0.1105103 ],
       [0.68338594, 0.31661406],
       [0.8621119 , 0.1378881 ],
       [0.86702979, 0.13297021],
       [0.48680268, 0.51319732],
       [0.8417366 , 0.1582634 ],
       [0.49138394, 0.50861606],
       [0.76463108, 0.23536892],
       [0.28867211, 0.71132789],
       [0.87895408, 0.12104592],
       [0.8786618 , 0.1213382 ],
       [0.74210345, 0.25789655],
       [0.12247699, 0.87752301],
       [0.84865058, 0.15134942],
       [0.1781688 , 0.8218312 ],
       [0.151418  , 0.848582  ],
       [0.8688038 , 0.1311962 ],
       [0.69584947, 0.30415053],
       [0.59455351, 0.40544649],
       [0.43134158, 0.56865842],
       [0.67978863, 0.32021137],
       [0.58534022, 0.41465978],
       [0.21070776, 0.78929224],
       [0.63712554, 0.36287446],
       [0.07825232, 0.92174768],
       [0.87212933, 0.12787067],
       [0.09075189, 0.90924811],
       [0.71885007, 0.28114993],
       [0.60402458, 0.39597542],
       [0.

In [94]:
# The probability estimates for the test set from Kaggle. 0th columns is 0 and the first column = 1 
titanic_model_predict_proba = pd.DataFrame(data = titanic_model.predict_proba(X_test_final), columns = ["0 Probability", "1 Probability"])
# See how many predictions have 0.4,0.6 probabilities or 0.6,0.4
# count is the dataframe with a["0 Probability"] >= 0.4 and ["0 Probability"] <=0.6
count_titanic_model = titanic_model_predict_proba[titanic_model_predict_proba["0 Probability"] >= 0.4]
count_titanic_model = count_titanic_model[count_titanic_model["0 Probability"] <=0.6]
len(count_titanic_model)

74

## Hence, for at least 74 cases the classifier has difficutlties in differenciating between the 2 classes.

### Saving the Model and results!!