# Hands-On Lab 5

In this lab, you will train a tuned *DecisionTreeRegressor* to predict the *Age* feature from the *Titanic* dataset. You will then use the model to replace (i.e., *impute*) missing *Age* values.

**NOTE -** As discussed in the lecture, the imputation model built in this lab suffers from *information leakage* during cross-validation.

### Step 1 - Load Data

Run the following code cell to load the dataset. In the output, note how the sixth row has a missing value (i.e., *NaN*) for the *Age* feature.

In [None]:
import pandas as pd

# Load Titanic training data from CSV file
titanic_train = pd.read_csv('titanic_train.csv')
titanic_train.head(n = 10)

### Step 2 - Create the *age_missing_mask*

A mask is a collection of boolean (i.e., *True/False*) values that can be used for filtering. Run the following code to create a mask where *True* values indicate that the *Age* feature is missing (e.g., the 6th row).

In [None]:
# Create a boolean mask for training rows with missing Age values
age_missing_mask = titanic_train['Age'].isna()
age_missing_mask[0:10]

### Step 3 - Create the *age_present_mask*

Run the following code to create a mask that is the opposite of *age_missing_mask*.

In [None]:
# Create a boolean mask for training rows with Age values present
age_present_mask = ~age_missing_mask
age_present_mask[0:10]

### Step 4 - Feature Engineering

The imputation model will leverage many of the features that you created in Lab #4. Run the following code to produce the results.

In [None]:
# Create the train_wrangled DataFrame
train_wrangled = (titanic_train
                    .assign(PartySize = lambda df_: df_['SibSp'] + df_['Parch'] + 1,
                            PartyFare = lambda df_: df_['Fare'] / df_['PartySize'],
                            Embarked = lambda df_: df_['Embarked'].fillna('S'),
                            CommaSplit = lambda df_: df_['Name'].str.split(', ', expand = True).loc[:, 1],
                            Title = lambda df_: df_['CommaSplit'].str.split('.', expand = True).loc[:, 0])
                 )

train_wrangled.head(n = 10)

### Step 5 - Encode Categorical Features

As you've seen in previous labs, the categorical features must be one-hot encoded. Run the following code cell to produce the results.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Designate the categorical features to use
cat_features = ['Pclass', 'Embarked', 'Title']

# Instantiate a OneHotEncoder
cat_encoder = OneHotEncoder(sparse_output = False)
cat_encoder.set_output(transform = 'pandas')

# Learn the encodings and transform data
train_cat = cat_encoder.fit_transform(train_wrangled[cat_features])
train_cat.head()

### Step 6 - Build Predictors DataFrame

The imputation model will use the *Title* feature as a proxy for *Sex* during training. Run the code cell below to produce the results.

In [None]:
# Designate numeric features
num_features = ['SibSp', 'Parch', 'PartySize', 'PartyFare']

# Build the predictors DataFrame
titanic_X = pd.concat([train_wrangled[num_features], train_cat], axis = 1)
titanic_X.head()

### Step 7 - Build Training DataFrame

The *DecisionTreeRegressor* imputation model will be trained on the subset of data where *Age* values are present. Note how the model will be trained on 714 of 891 rows. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
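
The lab's own cell is not reproduced here. As a minimal sketch of the row filtering this step describes, the following uses small made-up stand-ins (*toy_train* and *toy_X*) for *titanic_train* and *titanic_X*. The names *age_X* and *age_y* are hypothetical and may not match the lab's code.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the lab's titanic_train and titanic_X (made-up values)
toy_train = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})
toy_X = pd.DataFrame({'SibSp': [1, 1, 0, 1, 0, 0],
                      'PartyFare': [3.6, 35.6, 7.9, 26.6, 8.5, 51.9]})

# Same masks as Steps 2 and 3
age_missing_mask = toy_train['Age'].isna()
age_present_mask = ~age_missing_mask

# Keep only the rows where Age is known (hypothetical names)
age_X = toy_X[age_present_mask]
age_y = toy_train.loc[age_present_mask, 'Age']

print(age_X.shape, age_y.shape)
```

With the real data, the same filtering should leave one row per passenger whose *Age* is present.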


### Step 8 - Train a Tuned Model

The following code performs a grid search over the *min_samples_leaf* and *min_impurity_decrease* hyperparameters. The best model is selected using *neg_mean_squared_error* as the scoring metric, where values closer to zero are better. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
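
The lab's own cell is not reproduced here. The following is a minimal sketch of a grid search of this kind, run on synthetic data rather than the Titanic data. The grid values, the 5-fold setting, the random seed, and the names *age_X*, *age_y*, and *age_grid* are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Age-present training data (not Titanic values)
rng = np.random.default_rng(12345)
age_X = pd.DataFrame({'PartySize': rng.integers(1, 6, size=200),
                      'PartyFare': rng.uniform(5, 100, size=200)})
age_y = 20 + 0.2 * age_X['PartyFare'] + rng.normal(0, 5, size=200)

# Hypothetical grid values; the lab's actual grid may differ
param_grid = {'min_samples_leaf': [5, 10, 20],
              'min_impurity_decrease': [0.0, 0.1, 1.0]}

# Score every combination with 5-fold cross-validation
age_grid = GridSearchCV(estimator=DecisionTreeRegressor(random_state=12345),
                        param_grid=param_grid,
                        scoring='neg_mean_squared_error',
                        cv=5)
age_grid.fit(age_X, age_y)

print(age_grid.best_params_)
print(age_grid.best_score_)
```

Because the score is a negated error, *best_score_* is zero or negative, and the winning combination is the one closest to zero.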

### Step 9 - Visualize the Model

As demonstrated below, the imputation model is quite complex. Understanding its usefulness using only the visualization is a non-starter. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
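
The lab's own cell is not reproduced here. As a text-only alternative to a plotted tree, the following sketch prints a fitted tree's rules with scikit-learn's *export_text*. It uses synthetic data, and the hyperparameter value and the names *age_tree* and *tree_rules* are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the Age-present training data (not Titanic values)
rng = np.random.default_rng(12345)
age_X = pd.DataFrame({'PartySize': rng.integers(1, 6, size=200),
                      'PartyFare': rng.uniform(5, 100, size=200)})
age_y = 20 + 0.2 * age_X['PartyFare'] + rng.normal(0, 5, size=200)

# Fit a small tree (hypothetical hyperparameter value)
age_tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=12345)
age_tree.fit(age_X, age_y)

# Render the split rules as indented text
tree_rules = export_text(age_tree, feature_names=list(age_X.columns))
print(tree_rules)

# Count the leaves as a rough measure of complexity
print(age_tree.get_n_leaves())
```

Even a tree this small produces several nested rules, which hints at why a tree trained on the full feature set is hard to judge by eye.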

### Step 10 - Evaluating the Imputation Model

Evaluating *DecisionTreeRegressor* models is more complicated than evaluating *DecisionTreeClassifier* models. While many metrics can be used, one of the most intuitive is *mean absolute error* (MAE). The MAE value tells you how far off, on average, the model's predictions are. In the case of the *Age* imputation model, the predictions are, on average, 8.1 years too high or too low. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
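
The lab's own cell is not reproduced here. To make the MAE arithmetic concrete, the following sketch computes it by hand and with scikit-learn's *mean_absolute_error*. The ages are made up for illustration and are not lab results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up ages for illustration (not lab results)
actual_ages = np.array([22.0, 38.0, 26.0, 35.0])
predicted_ages = np.array([30.0, 30.0, 30.0, 30.0])

# Absolute errors are 8, 8, 4, and 5 years
abs_errors = np.abs(actual_ages - predicted_ages)

# MAE is the average of the absolute errors
mae_manual = abs_errors.mean()
mae_sklearn = mean_absolute_error(actual_ages, predicted_ages)

print(mae_manual)   # 6.25
print(mae_sklearn)  # 6.25
```

An MAE of 6.25 here means the predictions miss by 6.25 years on average, without saying whether they run high or low.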

### Step 11 - Imputing Missing *Age* Values

The only reliable way to know whether the imputation model helps or hurts is to compare models trained using imputed *Age* values to models that do not use imputed *Age* values. Here's an example of such a test:

* Evaluating the cross-validation performance of a *DecisionTreeClassifier* using the *Female* and *Age* features, but not *Title*.
* Evaluating the cross-validation performance of a *DecisionTreeClassifier* using the *Title* feature, but not *Female* and *Age*.

**NOTE -** The above assumes no support for missing values by the *DecisionTreeClassifier*.

The following code demonstrates how you can replace missing *Age* values with imputation model predictions. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
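
The lab's own cell is not reproduced here. The following is a minimal sketch of mask-based imputation on made-up data. The names *toy_train*, *toy_X*, *age_model*, and *toy_imputed* are hypothetical, and an untuned tree stands in for the tuned model from Step 8.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy stand-ins for the lab's titanic_train and titanic_X (made-up values)
toy_train = pd.DataFrame({'Age': [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 4.0, 27.0]})
toy_X = pd.DataFrame({'SibSp': [1, 1, 0, 1, 0, 0, 3, 0],
                      'PartyFare': [3.6, 35.6, 7.9, 26.6, 8.5, 51.9, 4.2, 11.1]})

# Same masks as Steps 2 and 3
age_missing_mask = toy_train['Age'].isna()
age_present_mask = ~age_missing_mask

# Train only on rows where Age is known
age_model = DecisionTreeRegressor(min_samples_leaf=2, random_state=12345)
age_model.fit(toy_X[age_present_mask], toy_train.loc[age_present_mask, 'Age'])

# Fill only the missing rows with predictions; known ages are left untouched
toy_imputed = toy_train.copy()
toy_imputed.loc[age_missing_mask, 'Age'] = age_model.predict(toy_X[age_missing_mask])

print(toy_imputed['Age'].isna().sum())  # 0
```

Working on a copy keeps the original *Age* column available for comparing models with and without imputation.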