# Hands-On Lab 7

This lab is the capstone of the course, consisting of three parts:

1. Evaluating features using *permutation importance*.
2. Tuning a *RandomForestClassifier* for better predictive performance.
3. Making predictions using the *RandomForestClassifier* for the *Titanic* test dataset.

**NOTE -** The code in this lab assumes a top-to-bottom execution order. If things don't look right, just execute all the code cells again starting from the top.

## Part 1 - Evaluating Features

### Step 1 - Load Data

The *titanic_train.csv* file is provided along with the lab Jupyter Notebook file for you to download from the course page. Run the following code cell to load the dataset.

In [None]:
import pandas as pd

# Load Titanic training data from CSV file
titanic_train = pd.read_csv('titanic_train.csv')
titanic_train.head(n = 10)

### Step 2 - Feature Engineering

The *RandomForestClassifier* model will leverage the features that you created in Lab #4. Run the following code to produce the results.

In [None]:
# Create the train_wrangled DataFrame
train_wrangled = (titanic_train
                    .assign(Female = lambda df_: df_['Sex'].map({'female': 1, 'male': 0}),
                            PartySize = lambda df_: df_['SibSp'] + df_['Parch'] + 1,
                            PartyFare = lambda df_: df_['Fare'] / df_['PartySize'],
                            Embarked = lambda df_: df_['Embarked'].fillna('S'),
                            CommaSplit = lambda df_: df_['Name'].str.split(', ', expand = True).loc[:, 1],
                            Title = lambda df_: df_['CommaSplit'].str.split('.', expand = True).loc[:, 0])
                 )

train_wrangled.head(n = 10)

### Step 3 - Encode Categorical Features

As you've seen in previous labs, the categorical features must be one-hot encoded. Run the following code cell to produce the results.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Designate the categorical features to use
cat_features = ['Pclass', 'Embarked', 'Title']

# Instantiate a OneHotEncoder
cat_encoder = OneHotEncoder(sparse_output = False)
cat_encoder.set_output(transform = 'pandas')

# Learn the encodings and transform data
train_cat = cat_encoder.fit_transform(train_wrangled[cat_features])
train_cat.head()

### Step 4 - Build Predictors DataFrame

The model will use both the *Female* and *Title* features to allow comparisons. Run the code cell below to produce the results.

In [None]:
# Designate numeric features
num_features = ['SibSp', 'Parch', 'Female', 'PartySize', 'PartyFare']

# Build the predictors DataFrame
titanic_X = pd.concat([train_wrangled[num_features], train_cat], axis = 1)
titanic_X.head()

### Step 5 - Train a Mighty Random Forest

The following code trains a *RandomForestClassifier* as you saw in Lab #6. The resulting model will be used with permutation importance to evaluate features. Notice that the OOB generalization estimate is slightly lower than what you saw in Lab #6. This is due to the addition of the *Female* feature. Run the code cell below to produce the results.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the object, make sure OOB score is calculated
titanic_rf = RandomForestClassifier(random_state = 12345, oob_score = True)

# Train the random forest
titanic_rf.fit(titanic_X, titanic_train['Survived'])

# What is the accuracy estimate?
print(f'Estimated accuracy with OOB data: {titanic_rf.oob_score_:.4f}')

### Step 6 - Permutation Importance

The following code evaluates each of the features used to train the *RandomForestClassifier*. The *n_repeats* parameter is set to 10, so each feature is permuted 10 separate times and the results are averaged together. More repeats give more stable estimates but take longer to process. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE -** The code below sets the *n_jobs* parameter to 2. You can increase this number to speed up the processing, but be sure to set it to a reasonable number based on how many cores your computer has.
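
As a rough sketch of what such a cell might look like, the following uses scikit-learn's `permutation_importance` function on a small synthetic dataset. The synthetic data, the `signal` and `noise` column names, and the column layout of `importance_df` are assumptions made for illustration. In the lab you would pass `titanic_rf`, `titanic_X`, and `titanic_train['Survived']` instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the lab data (an assumption for illustration only)
rng = np.random.default_rng(12345)
demo_X = pd.DataFrame({'signal': rng.integers(0, 2, size = 200),
                       'noise': rng.normal(size = 200)})
demo_y = demo_X['signal']

demo_rf = RandomForestClassifier(n_estimators = 50, random_state = 12345)
demo_rf.fit(demo_X, demo_y)

# Permute each feature 10 times and average the resulting drop in accuracy
results = permutation_importance(demo_rf, demo_X, demo_y, n_repeats = 10,
                                 random_state = 12345, n_jobs = 2)

# Collect the results, most important features first
importance_df = (pd.DataFrame({'Feature': demo_X.columns,
                               'MeanImportance': results.importances_mean,
                               'StdImportance': results.importances_std})
                   .sort_values('MeanImportance', ascending = False)
                   .reset_index(drop = True))
print(importance_df)
```

Sorting in descending order means `head()` shows the most important features and `tail()` shows the least important ones.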

In [None]:
# Type your lab code here

### Step 7 - The Top Features

The following output shows the top features based on *permutation importance*. Notice how the *PartyFare* feature was found to be the most important because, on average, the predictive accuracy of the model decreased by 12.1% when the values of *PartyFare* were permuted (i.e., randomized). Run the code cell below to produce the results.

In [None]:
# Look at the top 15 features
importance_df.head(n = 15)

### Step 8 - The Bottom Features

The following output shows the bottom features based on *permutation importance*. Notice how the *Title_Don* feature was found to be the least important because accuracy decreased by 0% on average. Another way to think about this is that the *Title_Don* feature offers no predictive information at all! Run the code cell below to produce the results.

In [None]:
# Look at the bottom 15 features
importance_df.tail(n = 15)

### Step 9 - *Title* Feature Exploration

While there are many reasons why a feature might not offer any predictive information, one of the most common reasons for categorical features is that certain values are rare. These are sometimes referred to as "long-tail values." The following output shows that the features found to be the least important are relatively rare. Run the code cell below to produce the results.

In [None]:
# Get the counts of each Title
train_wrangled['Title'].value_counts()

## Part 2 - Model Tuning

### Step 10 - Feature Cleanup

A common approach to addressing long-tail categorical values is to replace them with a designated category (e.g., *Other*). The following code performs this data wrangling on the *Title* feature long-tail values. Type the following code into the blank code cell in your lab notebook and run it to produce the results.
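
A minimal sketch of this kind of replacement is shown below on a toy *Title* column. The `important_titles` name matches the one used later in Step 15, but the titles placed in the list here are a hypothetical choice. Base the real list on what you observed in Steps 7 through 9.

```python
import pandas as pd

# Toy Title column standing in for train_wrangled['Title']
demo = pd.DataFrame({'Title': ['Mr', 'Mrs', 'Miss', 'Master', 'Mr', 'Don', 'Rev', 'Miss']})

# Hypothetical choice of titles to keep (an assumption for illustration)
important_titles = ['Mr', 'Mrs', 'Miss', 'Master']

# Overwrite every title that is not in the list with 'Other'
title_mask = demo['Title'].isin(important_titles)
demo.loc[~title_mask, 'Title'] = 'Other'

print(demo['Title'].value_counts())
```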

In [None]:
# Type your lab code here

### Step 11 - Rebuild Predictors DataFrame

Given the data wrangling of Step 10, you need to rebuild the predictors DataFrame to use the new *Other* category. Run the code cell below to produce the results.

In [None]:
# Relearn the encodings and transform data
train_cat = cat_encoder.fit_transform(train_wrangled[cat_features])
train_cat.head()

# Rebuild the predictors DataFrame
titanic_X = pd.concat([train_wrangled[num_features], train_cat], axis = 1)
titanic_X.head()

### Step 12 - Train a Tuned Model

As discussed in the lecture, random forests' use of *bagging* is analogous to *cross-validation*. While cross-validation can be used to tune random forests, relying on the *out-of-bag (OOB)* estimates is far more efficient. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE -** The following code takes a while to run. If you can, increasing the value of the *n_jobs* parameter will reduce the execution time.
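
One way OOB-based tuning could be set up is sketched below: build a list of hyperparameter combinations, train one forest per combination, and record each forest's OOB accuracy. The `param_list` name and the three hyperparameters mirror what Step 14 uses, but the candidate values, the `oob_scores` name, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in for titanic_X and titanic_train['Survived']
demo_X, demo_y = make_classification(n_samples = 300, n_features = 8, random_state = 12345)

# Hypothetical grid; these candidate values are illustrative only
param_list = list(ParameterGrid({'n_estimators': [50, 100],
                                 'max_features': [2, 4],
                                 'min_samples_leaf': [1, 5]}))

# Train one forest per combination and record its OOB accuracy
oob_scores = []
for params in param_list:
    rf = RandomForestClassifier(n_estimators = params['n_estimators'],
                                max_features = params['max_features'],
                                min_samples_leaf = params['min_samples_leaf'],
                                oob_score = True, random_state = 12345, n_jobs = 2)
    rf.fit(demo_X, demo_y)
    oob_scores.append(rf.oob_score_)

print(f'Evaluated {len(oob_scores)} hyperparameter combinations')
```

Because every forest already produces an OOB estimate as a by-product of bagging, each combination needs only one fit rather than one fit per cross-validation fold.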

In [None]:
# Type your lab code here

### Step 13 - Get the Optimal Hyperparameters

Type the following code into the blank code cell in your lab notebook and run it to produce the results.
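
Step 14 indexes `param_list` with `best_params`, which suggests that `best_params` holds the position of the best-scoring combination. A small sketch under that assumption follows. The hyperparameter combinations, the scores, and the `oob_scores` name are made up for illustration.

```python
import numpy as np

# Made-up values for illustration; in the lab these come from Step 12
param_list = [{'n_estimators': 100, 'max_features': 2, 'min_samples_leaf': 1},
              {'n_estimators': 100, 'max_features': 4, 'min_samples_leaf': 1},
              {'n_estimators': 250, 'max_features': 4, 'min_samples_leaf': 5}]
oob_scores = [0.81, 0.83, 0.82]

# Position of the combination with the highest OOB accuracy
best_params = int(np.argmax(oob_scores))

print(f'Best OOB accuracy: {oob_scores[best_params]:.4f}')
print(f'Best hyperparameters: {param_list[best_params]}')
```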

In [None]:
# Type your lab code here

### Step 14 - Evaluating Bias and Variance

Unlike single decision trees, you can't realistically understand random forests using visualizations. While it's possible to visualize individual trees in the random forest, understanding how hundreds of trees work together is just too difficult. In practice, we rely on techniques like *permutation importance*, confusion matrices, and evaluating bias and variance to develop an understanding of random forest models. Run the following code cell to produce the results.

Note the following in the output:
* The mean accuracy (i.e., *bias*) is approximately 82.5% accuracy across the 100 models
* The standard deviation (i.e., *variance*) is approximately 0.3% across the 100 models

To interpret the variance:
* We would expect the model to score within one standard deviation of the mean (roughly 82.2% to 82.8% accuracy) about 2/3 of the time.
* We would expect the model to score within two standard deviations of the mean (roughly 81.9% to 83.1% accuracy) about 95% of the time.
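
The interpretation above is the usual rule of thumb for approximately normal data: about 2/3 of values fall within one standard deviation of the mean and about 95% within two. The sketch below shows the arithmetic on a short, made-up list of accuracies. The `one_sd` and `two_sd` names are hypothetical.

```python
import numpy as np

# Made-up OOB accuracies for illustration; the lab collects 100 real values
oob_accuracy = [0.820, 0.830, 0.825, 0.828, 0.822]

mean_acc = np.average(oob_accuracy)
std_acc = np.std(oob_accuracy)

# Intervals of one and two standard deviations around the mean,
# assuming the scores are approximately normally distributed
one_sd = (mean_acc - std_acc, mean_acc + std_acc)
two_sd = (mean_acc - 2 * std_acc, mean_acc + 2 * std_acc)

print(f'Mean: {mean_acc:.4f}, Std Deviation: {std_acc:.4f}')
print(f'About 2/3 of scores: {one_sd[0]:.4f} to {one_sd[1]:.4f}')
print(f'About 95% of scores: {two_sd[0]:.4f} to {two_sd[1]:.4f}')
```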

**Bottom Line -** The random forest has improved accuracy compared to the decision tree from Lab #4 and also has far, far lower variance!

In [None]:
import numpy as np

# Grab the best hyperparameter values
params = param_list[best_params]

# Train 100 models using the best hyperparameter values
oob_accuracy = []
for value in range(0, 100):
    # Do NOT set random_state!
    rf = RandomForestClassifier(n_estimators = params['n_estimators'], max_features = params['max_features'], 
                                min_samples_leaf = params['min_samples_leaf'], oob_score = True, n_jobs = 2)
    rf.fit(titanic_X, titanic_train['Survived'])
    oob_accuracy.append(rf.oob_score_)

# Evaluate the bias and variance across the 100 models
# Due to randomness, you will likely see slightly different results
print(f'Mean OOB Accuracy: {np.average(oob_accuracy)}')
print(f'OOB Accuracy Std Deviation: {np.std(oob_accuracy)}')

## Part 3 - Model Testing

### Step 15 - Preparing the Test Dataset

Before predictions can be made on the test dataset, it needs to be wrangled using the same logic that created the predictors DataFrame. Examine the code cell output to ensure the *test_X* DataFrame has the same columns as the *titanic_X* DataFrame. Also, be sure to check for any missing values. Run the following code cell to produce the results.
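
If you prefer a programmatic check over reading the `info()` output, something along these lines could work. The toy DataFrames and the `same_columns` and `missing_counts` names are assumptions for illustration. In the lab you would compare `titanic_X` and `test_X`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for titanic_X and test_X
demo_train_X = pd.DataFrame({'PartyFare': [7.25, 35.64, 7.92], 'Female': [0, 1, 1]})
demo_test_X = pd.DataFrame({'PartyFare': [7.83, np.nan, 9.69], 'Female': [0, 1, 0]})

# Do both DataFrames have the same columns in the same order?
same_columns = list(demo_train_X.columns) == list(demo_test_X.columns)
print(f'Columns match: {same_columns}')

# Count missing values per column and show only the columns that have any
missing_counts = demo_test_X.isna().sum()
print(missing_counts[missing_counts > 0])
```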

In [None]:
# Load Titanic test data from CSV file
titanic_test = pd.read_csv('titanic_test.csv')

# Create the test_wrangled DataFrame
test_wrangled = (titanic_test
                   .assign(Female = lambda df_: df_['Sex'].map({'female': 1, 'male': 0}),
                           PartySize = lambda df_: df_['SibSp'] + df_['Parch'] + 1,
                           PartyFare = lambda df_: df_['Fare'] / df_['PartySize'],
                           Embarked = lambda df_: df_['Embarked'].fillna('S'),
                           CommaSplit = lambda df_: df_['Name'].str.split(', ', expand = True).loc[:, 1],
                           Title = lambda df_: df_['CommaSplit'].str.split('.', expand = True).loc[:, 0])
                 )


# Overwrite unimportant titles with 'Other'
test_title_mask = test_wrangled['Title'].isin(important_titles)
test_wrangled.loc[~test_title_mask, 'Title'] = 'Other'

# Reuse the OneHotEncoder object to ensure the test dataset matches the train dataset!
test_cat = cat_encoder.transform(test_wrangled[cat_features])

# Build the wrangled test dataset
test_X = pd.concat([test_wrangled[num_features], test_cat], axis = 1)
test_X.info()

### Step 16 - Impute Missing Data

The output of Step 15 shows that a single value is missing for the *PartyFare* feature. Given that *PartyFare* has skewed values, a reasonable imputation approach is to replace the single missing value with the median of *PartyFare*. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE -** Imputation values are *always* calculated from the training data and then applied to the test data to avoid information leakage!
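
A minimal sketch of median imputation that respects this rule is shown below. The toy DataFrames and the `train_median` name are assumptions for illustration. In the lab the median would come from `titanic_X` and be applied to `test_X`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test predictors
demo_train_X = pd.DataFrame({'PartyFare': [7.25, 8.05, 26.55, 71.28, 512.33]})
demo_test_X = pd.DataFrame({'PartyFare': [7.83, np.nan, 9.69]})

# Learn the median from the training data only
train_median = demo_train_X['PartyFare'].median()

# Apply the training median to the missing test values
demo_test_X['PartyFare'] = demo_test_X['PartyFare'].fillna(train_median)

print(f'Training median used for imputation: {train_median}')
print(f'Missing values remaining: {demo_test_X["PartyFare"].isna().sum()}')
```

The median is less sensitive than the mean to a few very large fares, which is why it suits skewed features.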

In [None]:
# Type your lab code here

### Step 17 - Make Predictions

To make predictions for the test set, one final *RandomForestClassifier* must be trained using the optimal hyperparameter values. Type the following code into the blank code cell in your lab notebook and run it to produce the results.
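
The sketch below illustrates the general pattern on synthetic data: train a final forest with the chosen hyperparameters, then call `predict` on the test predictors. The `test_predictions` name matches the one used in Step 18. The hyperparameter values, the `final_rf` name, and the data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins: the first 300 rows play the role of the training data
# and the last 100 rows play the role of the test predictors
all_X, all_y = make_classification(n_samples = 400, n_features = 8, random_state = 12345)
demo_train_X, demo_train_y = all_X[:300], all_y[:300]
demo_test_X = all_X[300:]

# Hypothetical optimal values; in the lab these come from param_list[best_params]
params = {'n_estimators': 100, 'max_features': 4, 'min_samples_leaf': 5}

# Train the final model on all of the training data
final_rf = RandomForestClassifier(n_estimators = params['n_estimators'],
                                  max_features = params['max_features'],
                                  min_samples_leaf = params['min_samples_leaf'],
                                  random_state = 12345, n_jobs = 2)
final_rf.fit(demo_train_X, demo_train_y)

# Predict a class label for every row of the test predictors
test_predictions = final_rf.predict(demo_test_X)
print(test_predictions[:10])
```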

In [None]:
# Type your lab code here

### Step 18 - Save Predictions to .CSV

If you're interested, the following code creates a .CSV file suitable for submission to the [**Kaggle Titanic competition**](https://www.kaggle.com/competitions/titanic). Run the following code cell to produce the results.

In [None]:
# Add a Survived column to the Titanic test dataset
titanic_test['Survived'] = test_predictions

# Save Kaggle submission .CSV
titanic_test[['PassengerId', 'Survived']].to_csv('TitanicSubmission.csv', index = False)