https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#Titanic-challenge-part-2

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
import os
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

%matplotlib inline
rcParams['figure.figsize'] = 10,8
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (15,10)})


In [None]:
# print(os.listdir("./data/titanic-cleaned-data.csv"))

# Load data as Pandas dataframe
train = pd.read_csv('./data/train_clean.csv')
test = pd.read_csv('./data/test_clean.csv')
df = pd.concat([train, test], axis=0, sort=True)

In [None]:
df.head()

In [None]:
from IPython.display import display

def display_all(df):
    """Display a DataFrame without truncating rows or columns."""
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)


display_all(df.describe(include='all').T)

In [None]:
df['Survived'].value_counts()

<h1 id="2.-Encode-categorical-variables">2. Encode categorical variables<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#2.-Encode-categorical-variables" target="_self" rel=" noreferrer nofollow">¶</a></h1>
<p>We need to convert all categorical variables into numeric format. The categorical variables we will be keeping are <code>Embarked</code>, <code>Sex</code> and <code>Title</code>.</p>
<p>The <code>Sex</code> variable can be encoded into a single 1-or-0 column, but the other variables will need to be <a href="https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f" rel=" noreferrer nofollow">one-hot encoded</a>. Regular label encoding assigns some category labels higher numerical values than others, which implies an ordering that does not exist (Embarked = 1 is not <strong>more</strong> than Embarked = 0, it's just <em>different</em>). One-hot encoding avoids this problem.</p>
<p>We will assume that there is some ordinality in the <code>Pclass</code> variable, so we will leave that as a single column.</p>
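<p>To make the difference concrete, here is a minimal sketch on a made-up four-row column (the toy values are an assumption for illustration, not taken from the Titanic data). It contrasts the integer codes produced by label encoding with the indicator columns produced by <code>pd.get_dummies</code>.</p>

```python
import pandas as pd

# Toy column standing in for a categorical feature such as Embarked
# (the values below are made up for illustration).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Label encoding: one integer per category, which implies an ordering.
label_codes = toy["Embarked"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, no implied ordering.
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(label_codes.tolist())
print(one_hot.columns.tolist())
```

<p>Each row of the one-hot frame has exactly one active column, so no category is numerically "larger" than another.</p>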

In [None]:
sns.countplot(x='Pclass', data=df, palette='hls', hue='Survived')
# plt.figure(figsize=[1.4,1.4])
plt.xticks(rotation=45)
plt.show()

In [None]:
sns.countplot(x='Embarked', data=df, palette='hls', hue='Survived')
plt.xticks(rotation=45)
plt.show()

In [None]:
# convert to category dtype
df['Sex'] = df['Sex'].astype('category')


# convert to category codes
df['Sex'] = df['Sex'].cat.codes

In [None]:
df['Sex']

In [None]:
# type(df['Sex'])
# df['Sex']

In [None]:
# subset all categorical variables which need to be encoded
categorical = ['Embarked', 'Title']

for var in categorical:
    df = pd.concat([df, 
                    pd.get_dummies(df[var], prefix=var)], axis=1)
    del df[var]

In [None]:
df

In [None]:
# drop the variables we won't be using
df.drop(['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

In [None]:
df.head()

<h1 id="3.-Random-Forest">3. Random Forest<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#3.-Random-Forest" target="_self" rel=" noreferrer nofollow">¶</a></h1>
<p>Now all that is left is to feed our cleaned, encoded and scaled data to a random forest.<br>
<a id="train-test"></a></p>
<h2 id="3.1.-Train/test-split">3.1. Train/test split<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#3.1.-Train/test-split" target="_self" rel=" noreferrer nofollow">¶</a></h2>
<p>But first, we need to separate <code>df</code> back into <em>train</em> and <em>test</em> sets.</p>

In [None]:
train = df[pd.notnull(df['Survived'])]
X_test = df[pd.isnull(df['Survived'])].drop(['Survived'], axis=1)

<h3 id="Validation-set">Validation set<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#Validation-set" target="_self" rel=" noreferrer nofollow">¶</a></h3>
<p>Since we can't use our test set to assess our model (it doesn't have any labels), we will create a separate 'validation set'. We will use this set to test how our model generalises to unseen data.</p>

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    train.drop(['Survived'], axis=1),
    train['Survived'],
    test_size=0.2, random_state=42)

In [None]:
for i in [X_train, X_val, X_test]:
    # print(i.to_string)
    print(i.shape)
    # print(i.head(2))

<h3 id="Create-Random-Forest-model">Create Random Forest model<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#Create-Random-Forest-model" target="_self" rel=" noreferrer nofollow">¶</a></h3>
<p>We will first make a random forest model, using all of the default parameters.</p>
<blockquote><p>Note: set the <code>random_state</code> to 42 for reproducibility</p>
</blockquote>

In [None]:
rf = RandomForestClassifier(random_state=42)

<h3 id="Train-model">Train model<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#Train-model" target="_self" rel=" noreferrer nofollow">¶</a></h3>
<p>Now, let's train the model on our training set.</p>

In [None]:
rf.fit(X_train, y_train)

In [None]:
accuracy_score(y_val, rf.predict(X_val))
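<p>Accuracy is only one of the metrics imported at the top of the notebook. As a small sketch of how the others are called, the block below scores a made-up set of predictions (the labels are an assumption for illustration, not model output).</p>

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for illustration: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

<p>The same calls accept <code>y_val</code> and <code>rf.predict(X_val)</code> in place of the toy lists.</p>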

<h2 id="3.2.-Cross-validation">3.2. Cross-validation<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#3.2.-Cross-validation" target="_self" rel=" noreferrer nofollow">¶</a></h2>
<p>Keeping a separate validation set means that we have less data on which to train our model. Cross-validation allows us to train our model on <em>all</em> of the data, while still assessing its performance on unseen data.</p>
<p>K-fold cross-validation is the process of creating <em>k</em> different train/validate splits in the data and training the model <em>k</em> times.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg" alt="CV"></p>
<p>In the image above, k=4. This means that the model will be trained 4 times, each time using 1/4 of the data for validation. In this way, each of the four 'folds' takes one turn sitting out from training and is used as the validation set.</p>
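<p>The rotation of folds described above can be sketched in plain Python. This is a simplified illustration (contiguous folds, no shuffling or stratification), and <code>k_fold_indices</code> is a hypothetical helper written for this sketch, not part of scikit-learn.</p>

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds (no shuffling)."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# With 8 samples and k=4, each fold of 2 samples sits out of training once.
folds = list(k_fold_indices(n_samples=8, k=4))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

<p>Every sample appears in exactly one validation fold, which is why the model ends up being assessed on all of the data.</p>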
<p>Let's combine our train and validation sets back into one training set, and then use cross-validation to assess our model:</p>

In [None]:
X_train = pd.concat([X_train, X_val])
y_train = pd.concat([y_train, y_val])

In [None]:
X_train.shape

<p>Now we have all of the training data again. Let's fit a model to it, and assess its accuracy using 5-fold cross-validation:</p>

In [None]:
rf = RandomForestClassifier(n_estimators=10, random_state=42)
cross_val_score(rf, X_train, y_train, cv=5)

In [None]:
cross_val_score(rf, X_train, y_train, cv=5).mean()

<p>Here, our CV score is slightly lower than our previous single validation score. Taking a look at the scores for each of the folds, the score does seem to vary slightly.</p>
<p>Cross-validation has the added advantage of being a more robust measure of model accuracy than single validation.</p>
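<blockquote><p>Note: the method we used initially, a single train/validation split, is often called hold-out validation. It is comparable to evaluating only one of the folds in k-fold cross-validation.</p>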
</blockquote>

<h2 id="3.3.-Hyperparameter-tuning">3.3. Hyperparameter tuning<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#3.3.-Hyperparameter-tuning" target="_self" rel=" noreferrer nofollow">¶</a></h2>
<p>Our first model didn't do too badly! It scored over 80% on the CV score. However, we didn't put any thought into our choice of hyperparameters; we simply went with the defaults.</p>
<p>Take a look at the various parameters by using the <code>help()</code> function:</p>

In [None]:
help(RandomForestClassifier)

<p>It is hard to know the best values for each of these hyperparameters without first <em>trying</em> them out. If we wanted to know the best value for the <code>n_estimators</code> parameter, we could fit a few models, each with a different value, and see which one tests the best.</p>
<p><strong>Grid search</strong> allows us to do this for multiple parameters simultaneously. We will select a few different parameters that we want to tune, and for each one we will provide a few different values to try out. Then grid search will fit models to every possible combination of these parameter values and use <strong>cross-validation</strong> to assess the performance in each case.</p>
<p>Furthermore, since we are using CV, we don't need to keep a separate validation set.</p>




<h3 id="3.2.1.-Number-of-estimators-and-max-depth">3.3.1. Number of estimators and max depth<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#3.2.1.-Number-of-estimators-and-max-depth" target="_self" rel=" noreferrer nofollow">¶</a></h3>
<p>We will start by tuning the <code>n_estimators</code> (number of trees in the forest) and the <code>max_depth</code> (how deep each tree grows) parameters.</p>
<p>The first step is to define the grid of parameters over which to search:</p>




In [None]:
n_estimators = [10, 100, 1000, 2000]
max_depth = [None, 5, 10, 20]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)

<p>We have set out a total of $4 \times 4 = 16$ models over which to search. Grid search uses cross-validation on each of the models, so if we use 3-fold cross-validation, that will leave us with 48 different fits to try out.
(You can see how the number of fits can grow pretty quickly as we increase the number of parameters!)</p>
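<p>The count of candidates and fits can be checked with a short sketch that enumerates the grid by hand. It uses only the standard library and the same parameter lists as above; the variable names <code>cv_folds</code>, <code>candidates</code> and <code>total_fits</code> are introduced here for illustration.</p>

```python
from itertools import product

# Same parameter lists as in the grid defined above.
n_estimators = [10, 100, 1000, 2000]
max_depth = [None, 5, 10, 20]
cv_folds = 3

# Every combination of the two lists is one candidate model.
candidates = [
    {"n_estimators": n, "max_depth": d}
    for n, d in product(n_estimators, max_depth)
]

# Each candidate is fitted once per fold.
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

<p>Adding a third parameter with four values would multiply both numbers by four. With the default <code>refit=True</code>, scikit-learn's grid search also performs one extra fit of the best candidate on the whole training set.</p>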

<p>The good news is that scikit-learn's grid search allows us to run the job in parallel. Including the <code>n_jobs=-1</code> argument below lets grid search run on all of the available cores on the host machine.</p>

In [None]:
# create the default model
rf = RandomForestClassifier(random_state=42)

# search the grid
grid = GridSearchCV(estimator=rf, 
                    param_grid=param_grid,
                    cv=3,
                    verbose=2,
                    n_jobs=-1)

grid_result = grid.fit(X_train, y_train)

<pre>Fitting 3 folds for each of 16 candidates, totalling 48 fits
</pre>
<pre>[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   21.6s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   31.0s finished
</pre>
<p>Now let's take a look at the results of the grid search.</p>
<p>We can get the best performing model directly from <code>grid_result</code>:</p>

In [None]:
grid_result.best_estimator_


<pre>RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=2000, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)</pre>
  <p>Or just the best parameters:</p>

In [None]:
grid_result.best_params_

In [None]:
grid_result.best_score_

<p>But let's take a look at all of the models so we can make a more informed decision:</p>

In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
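<p>The same results can also be viewed as a table, which is easier to sort and filter than printed lines. The sketch below is self-contained and therefore runs a tiny grid search on synthetic data from <code>make_classification</code> (an assumption for illustration, not the Titanic data); with the notebook's own <code>grid_result</code>, only the last three lines would be needed.</p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50], "max_depth": [None, 5]},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame.
results = pd.DataFrame(demo_grid.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").to_string(index=False))
```

<p>Sorting by <code>rank_test_score</code> puts the best candidate first and makes it easy to see how close the runners-up are.</p>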

<h3 id="3.2.2.-Leaf-size">3.3.2. Leaf size<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#3.2.2.-Leaf-size" target="_self" rel=" noreferrer nofollow">¶</a></h3>
<p>The <code>min_samples_leaf</code> argument controls the size of the leaves in the trees.</p>
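<p>As a rough illustration of what this parameter does, the sketch below fits single decision trees to synthetic data (generated with <code>make_classification</code>, an assumption for illustration) and counts their leaves. Larger values of <code>min_samples_leaf</code> force each leaf to hold more samples, which caps how many leaves a tree can have.</p>

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs on its own (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

leaf_counts = {}
for min_leaf in [1, 5, 20]:
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=42)
    tree.fit(X_demo, y_demo)
    # With 200 samples, a tree can have at most 200 / min_leaf leaves.
    leaf_counts[min_leaf] = tree.get_n_leaves()

print(leaf_counts)
```

<p>Fewer, larger leaves give smoother predictions and can reduce overfitting, at the cost of some flexibility.</p>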
<p>We will set out the grid in a similar manner as before, only this time we will use the <code>max_depth</code> and <code>n_estimators</code> parameters that we found above.</p>

In [None]:
# create the grid of candidate leaf sizes
# (assumed values: six candidates x 3 folds matches the 18 fits logged below)
min_samples_leaf = [1, 2, 3, 4, 5, 6]
param_grid = dict(min_samples_leaf=min_samples_leaf)

# start from the best model found so far (tuned n_estimators and max_depth)
rf = grid_result.best_estimator_

# search the grid
grid = GridSearchCV(estimator=rf, 
                    param_grid=param_grid,
                    cv=3,
                    verbose=2,
                    n_jobs=-1)

grid_result = grid.fit(X_train, y_train)

<pre>Fitting 3 folds for each of 6 candidates, totalling 18 fits
</pre>
<pre>[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:   19.0s finished
</pre>


In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

<h3 id="3.2.3.-To-bag-or-not-to-bag">3.3.3. To bag or not to bag<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#3.2.3.-To-bag-or-not-to-bag" target="_self" rel=" noreferrer nofollow">¶</a></h3>
<p>Bootstrap aggregating (or bagging) is a special case of the random forest in which each tree is trained on a bootstrap sample: we draw n observations with replacement from the n training observations to create a new training set of size n for each tree. Furthermore, each tree considers all variables when making each split.</p>
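<p>Sampling with replacement means some observations are drawn several times and others not at all. The sketch below, using only the standard library and a made-up set of 1000 observation indices, shows how much of the original data a single bootstrap sample typically covers.</p>

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

n = 1000
observations = list(range(n))  # stand-in for the indices of n training rows

# Draw n observations with replacement: this is one bootstrap sample.
bootstrap_sample = random.choices(observations, k=n)

# Duplicates mean that only part of the original data appears in the sample.
unique_fraction = len(set(bootstrap_sample)) / n
print(f"{unique_fraction:.1%} of the original observations appear in the sample")
```

<p>For large n the expected fraction approaches 1 - 1/e, roughly 63%, so each tree sees a noticeably different subset of the training data.</p>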

<p>We can use grid search to determine if bootstrapping will be an appropriate method to use.</p>

In [None]:
# create the grid
max_features = [5, 8, 10, 12, None]
bootstrap = [True, False]
param_grid = dict(max_features=max_features, bootstrap=bootstrap)

# create the model with new leaf size
rf = grid_result.best_estimator_

# search the grid
grid = GridSearchCV(estimator=rf, 
                    param_grid=param_grid,
                    cv=3,
                    verbose=2,
                    n_jobs=-1)

grid_result = grid.fit(X_train, y_train)

In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

<h2 id="4.-Make-Predictions-on-Test-Set">4. Make Predictions on Test Set<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#4.-Make-Predictions-on-Test-Set" target="_self" rel=" noreferrer nofollow">¶</a></h2>
<p>Finally, we can attempt to predict which passengers in the test set survived.</p>

In [None]:
rf = grid_result.best_estimator_

In [None]:
# test our CV score
cross_val_score(rf, X_train, y_train, cv=5).mean()

In [None]:
test['Survived'] = rf.predict(X_test)


In [None]:
# copy the slice so the assignment below does not write to a view of `test`
solution = test[['PassengerId', 'Survived']].copy()
solution['Survived'] = solution['Survived'].astype(int)

In [None]:
solution.head(10)

<h2 id="Output-Final-Predictions">Output Final Predictions<a class="anchor-link" href="https://www.kaggle.com/code/jamesleslie/titanic-random-forest-grid-search/notebook#Output-Final-Predictions" target="_self" rel=" noreferrer nofollow">¶</a></h2>

In [None]:
solution.to_csv("Random_Forest_Solution.csv", index=False)