# Hands-On Lab 6

In this lab, you will train a mighty *RandomForestClassifier* on the *Titanic* dataset. You will then compare the predictive performance of this lab's *RandomForestClassifier* to Lab #4's *DecisionTreeClassifier*.

### Step 1 - Load Data

Run the following code cell to load the dataset.

In [None]:
import pandas as pd

# Load Titanic training data from CSV file
titanic_train = pd.read_csv('titanic_train.csv')
titanic_train.head(n = 10)

### Step 2 - Feature Engineering

The *RandomForestClassifier* model will leverage most of the features that you created in Lab #4. Run the following code to produce the results.

In [None]:
# Create the train_wrangled DataFrame
train_wrangled = (titanic_train
                    .assign(PartySize = lambda df_: df_['SibSp'] + df_['Parch'] + 1,
                            PartyFare = lambda df_: df_['Fare'] / df_['PartySize'],
                            Embarked = lambda df_: df_['Embarked'].fillna('S'),
                            CommaSplit = lambda df_: df_['Name'].str.split(', ', expand = True).loc[:, 1],
                            Title = lambda df_: df_['CommaSplit'].str.split('.', expand = True).loc[:, 0])
                 )

train_wrangled.head(n = 10)
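
To see how the two string splits above produce *Title*, the following minimal sketch applies the same logic to three names typed in by hand. The `names` and `demo` DataFrames are illustrative assumptions and are not part of the lab notebook; the sample names simply follow the dataset's "Surname, Title. Given names" layout.

```python
import pandas as pd

# Illustrative names typed in by hand (not loaded from titanic_train.csv)
names = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                               'Heikkinen, Miss. Laina',
                               'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)']})

demo = names.assign(
    # Keep the text after the first ', ' (e.g., 'Mr. Owen Harris')
    CommaSplit = lambda df_: df_['Name'].str.split(', ', expand = True).loc[:, 1],
    # Keep the text before the first '.' (e.g., 'Mr')
    Title = lambda df_: df_['CommaSplit'].str.split('.', expand = True).loc[:, 0])

print(demo[['Name', 'Title']])
```

Because `assign` evaluates its arguments in order, the *Title* lambda can read the *CommaSplit* column created just before it.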

### Step 3 - Encode Categorical Features

As you've seen in previous labs, the categorical features must be one-hot encoded. Run the following code cell to produce the results.

In [3]:
from sklearn.preprocessing import OneHotEncoder

# Designate the categorical features to use
cat_features = ['Pclass', 'Embarked', 'Title']

# Instantiate a OneHotEncoder
cat_encoder = OneHotEncoder(sparse_output = False)
cat_encoder.set_output(transform = 'pandas')

# Learn the encodings and transform data
train_cat = cat_encoder.fit_transform(train_wrangled[cat_features])
train_cat.head()

Unnamed: 0,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Title_Capt,Title_Col,Title_Don,Title_Dr,...,Title_Master,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


### Step 4 - Build Predictors DataFrame

The model will use the *Title* feature as a proxy for *Sex* during training. Run the code cell below to produce the results.

In [None]:
# Designate numeric features
num_features = ['SibSp', 'Parch', 'PartySize', 'PartyFare']

# Build the predictors DataFrame
titanic_X = pd.concat([train_wrangled[num_features], train_cat], axis = 1)
titanic_X.head()

### Step 5 - Train a Mighty Random Forest

Because the *Survived* label is already numeric (i.e., 1 == Survived, 0 == Perished), it does not need to be encoded. The following code trains a *RandomForestClassifier* using the hyperparameter defaults (e.g., 100 trees). The code also sets *oob_score* to *True* so that the random forest will calculate a generalization estimate using out-of-bag (OOB) data. The OOB estimate is comparable to the cross-validation estimates for the *DecisionTreeClassifier* models from previous labs. Type the following code into the blank code cell in your lab notebook and run it to produce the results.
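
As a rough guide to what this step involves, here is a minimal sketch of training a default *RandomForestClassifier* with the OOB estimate switched on. It is not the lab's official solution. To keep the sketch runnable on its own, it uses synthetic stand-in data from `make_classification`; in the lab you would pass `titanic_X` and `titanic_train['Survived']` instead. The `rf_model` name and the `random_state` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data so the sketch runs on its own; in the lab, use
# titanic_X as X and titanic_train['Survived'] as y instead
X, y = make_classification(n_samples = 891, n_features = 10, random_state = 12345)

# Default hyperparameters (100 trees) plus an OOB generalization estimate.
# random_state is an assumption, added only to make the run repeatable.
rf_model = RandomForestClassifier(oob_score = True, random_state = 12345)
rf_model.fit(X, y)

# Mean accuracy measured on each row's out-of-bag predictions
print(f'OOB accuracy: {rf_model.oob_score_:.3f}')
```

Each tree is trained on a bootstrap sample, so every row is left out of some trees. Scoring each row using only those trees is what makes the OOB estimate behave like a held-out estimate.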

In [None]:
# Type your lab code here

### Step 6 - Evaluate the Random Forest

The tuned *DecisionTreeClassifier* from Lab #4 had a cross-validation mean accuracy of 82.4%. Compare this to the estimated OOB accuracy of 80.7% for the *RandomForestClassifier*. A confusion matrix of the out-of-bag (OOB) predictions shows where the *RandomForestClassifier* makes its errors. In the next section of the course, you will learn about tuning random forests to improve their predictive performance. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

Note the following in the confusion matrix:
* The model's *sensitivity* (i.e., the share of actual survivors that it correctly predicts) is only 249 / (93 + 249) = 72.8%.
* The model's *specificity* (i.e., the share of passengers who perished that it correctly predicts) is 470 / (470 + 79) = 85.6%.

**Bottom Line -** The best way to improve the model's performance is to engineer features that increase *sensitivity*!
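
The sensitivity and specificity calculations above can be reproduced from a fitted forest's OOB predictions. The following is a minimal sketch, not the lab's official solution. It again uses synthetic stand-in data so that it runs on its own, so its numbers will not match the Titanic figures. The names `rf_model` and `oob_preds` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Stand-in data; in the lab, use titanic_X and titanic_train['Survived']
X, y = make_classification(n_samples = 891, n_features = 10, random_state = 12345)
rf_model = RandomForestClassifier(oob_score = True, random_state = 12345).fit(X, y)

# oob_decision_function_ holds each row's class probabilities, averaged over
# only the trees that did not see that row during training
oob_preds = rf_model.classes_[np.argmax(rf_model.oob_decision_function_, axis = 1)]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y, oob_preds).ravel()

print(f'Sensitivity: {tp / (tp + fn):.3f}')
print(f'Specificity: {tn / (tn + fp):.3f}')
print(f'Accuracy:    {(tp + tn) / (tp + tn + fp + fn):.3f}')
```

Applied to the counts quoted above (TN = 470, FP = 79, FN = 93, TP = 249), the same accuracy formula gives (249 + 470) / 891 = 80.7%, which matches the OOB accuracy reported for the lab's model.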

In [None]:
# Type your lab code here