# Hands-On Lab 3

In this lab you will train a tuned *DecisionTreeClassifier* on the *Titanic* dataset. The features used in this lab to train the decision tree model will be based on your findings from Lab 1.

### Step 1 - Load Data

Run the following code cell to load the dataset.

In [None]:
import pandas as pd

# Load Titanic training data from CSV file
titanic_train = pd.read_csv('titanic_train.csv')
titanic_train.info()

### Step 2 - Encode Categorical Features

In Lab 1, you discovered that a number of *Titanic* dataset features cannot be used directly. For example, the *Ticket* feature's cardinality is too high, and the *Embarked* feature needs missing values replaced. For this lab, you will use a subset of "complete" features for your first model. Run the following code to produce the results. 

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Designate the features to use - including categorical features
features = ['Pclass', 'Sex', 'SibSp', 'Parch']
cat_features = ['Pclass', 'Sex']

# Instantiate a OneHotEncoder
cat_encoder = OneHotEncoder(sparse_output = False)
cat_encoder.set_output(transform = 'pandas')

# Learn the encodings and transform data
train_cat = cat_encoder.fit_transform(titanic_train[cat_features])
train_cat.head()

### Step 3 - Build Predictors DataFrame

With the *Pclass* and *Sex* features one-hot encoded in the *train_cat* DataFrame, it's time to create the predictor DataFrame that will be used in training the decision tree. Run the following code to produce the results.

In [None]:
# Designate numeric features
num_features = ['SibSp', 'Parch']

# Build the predictors DataFrame
titanic_X = pd.concat([titanic_train[num_features], train_cat], axis = 1)
titanic_X.head()

### Step 4 - Train a Tuned Model

As the *Survived* label is numeric (i.e., 1 == Survived, 0 == Perished), it does not need to be encoded. The following code performs a grid search using *min_samples_leaf* and *min_impurity_decrease* hyperparameters. The best model is evaluated using *accuracy* as the measure of awesomeness. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
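
As a rough guide to the shape of such a grid search, here is a minimal sketch. It is not the lab's official solution: the grid values, the 5-fold setting, the random seed, and the synthetic *demo_X* / *demo_y* data are all assumptions chosen so the example runs on its own. In the lab notebook you would fit on *titanic_X* and *titanic_train['Survived']* instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
rng = np.random.default_rng(1234)
demo_X = pd.DataFrame({
    'SibSp': rng.integers(0, 4, size = 200),
    'Parch': rng.integers(0, 3, size = 200),
    'Sex_female': rng.integers(0, 2, size = 200),
})

# Label follows Sex_female, flipped for roughly 20% of rows as noise
flip = rng.random(200) < 0.2
demo_y = pd.Series(np.where(flip, 1 - demo_X['Sex_female'], demo_X['Sex_female']),
                   name = 'Survived')

# Hypothetical grid values for the two hyperparameters named in the lab
param_grid = {
    'min_samples_leaf': [1, 5, 10, 20],
    'min_impurity_decrease': [0.0, 0.001, 0.01],
}

# Score every combination with cross-validated accuracy
grid_cv = GridSearchCV(estimator = DecisionTreeClassifier(random_state = 1234),
                       param_grid = param_grid,
                       scoring = 'accuracy',
                       cv = 5)
grid_cv.fit(demo_X, demo_y)

print(grid_cv.best_params_)
print(round(grid_cv.best_score_, 3))
```

With the default *refit = True*, *GridSearchCV* retrains the best combination on all the supplied data and exposes it as *best_estimator_*.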

### Step 5 - Visualize the Model

When visualizing the *DecisionTreeClassifier* model, the labels *Perished* and *Survived* will be used instead of zeroes and ones. The *best_estimator_* attribute of the *grid_cv* object provides a model trained on all the data using the best hyperparameter values. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
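
One way to draw a tree with readable class labels is scikit-learn's *plot_tree* function, sketched below. This is an illustration under assumptions, not the lab's official code: the synthetic data, the *demo_tree* model, and the two feature names are placeholders. In the lab notebook you would pass *grid_cv.best_estimator_* and the columns of *titanic_X* instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data (an assumption, not the Titanic dataset)
rng = np.random.default_rng(1234)
demo_X = rng.integers(0, 2, size = (100, 2))
demo_y = demo_X[:, 0]  # label simply copies the first feature

demo_tree = DecisionTreeClassifier(max_depth = 2, random_state = 1234)
demo_tree.fit(demo_X, demo_y)

# class_names are listed in ascending class order: 0 -> Perished, 1 -> Survived
fig, ax = plt.subplots(figsize = (8, 4))
annotations = plot_tree(demo_tree,
                        feature_names = ['Sex_female', 'SibSp'],
                        class_names = ['Perished', 'Survived'],
                        filled = True,
                        ax = ax)
```

The order of *class_names* matters: it must match the sorted label values, so *Perished* (0) comes before *Survived* (1).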

### Step 6 - Evaluate Bias and Variance

The tuned *decision_tree* model above isn't particularly complex. While you could spend time poring over all the paths through the tree, a far more efficient way to understand the effectiveness of the model's predictions is to estimate its bias and variance. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

Note the following in the output:
* The model's mean accuracy (i.e., *bias*) is 79.5% across the 100 folds
* However, the standard deviation of the model's accuracy (i.e., *variance*) is 4.4% across the 100 folds

To interpret the variance:
* We would expect the model to score between 75.1% and 83.9% accuracy about 2/3 of the time.
* We would expect the model to score between 70.7% and 88.3% accuracy about 95% of the time.
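
The two ranges above are the mean plus or minus one and two standard deviations. The short calculation below reproduces them from the reported 79.5% and 4.4%. The "2/3" and "95%" figures additionally assume the fold scores are roughly normally distributed; the variable names are illustrative.

```python
# Values reported in the lab output, in percentage points
mean_accuracy = 79.5
std_accuracy = 4.4

# Mean +/- 1 standard deviation covers about 2/3 of scores
one_sd_range = (round(mean_accuracy - std_accuracy, 1),
                round(mean_accuracy + std_accuracy, 1))

# Mean +/- 2 standard deviations covers about 95% of scores
two_sd_range = (round(mean_accuracy - 2 * std_accuracy, 1),
                round(mean_accuracy + 2 * std_accuracy, 1))

print(one_sd_range)  # (75.1, 83.9)
print(two_sd_range)  # (70.7, 88.3)
```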

**Bottom Line -** This model's bias isn't great and its variance is quite high.

In [None]:
# Enter your lab code here
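
One common way to obtain 100 accuracy scores is 10-fold cross-validation repeated 10 times, sketched below. That split scheme is an assumption, as are the synthetic data, the *min_samples_leaf* value, and the random seed. In the lab notebook you would score the tuned model on *titanic_X* and *titanic_train['Survived']* instead.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic dataset)
rng = np.random.default_rng(1234)
demo_X = rng.integers(0, 3, size = (300, 3))
base = (demo_X[:, 0] > 0).astype(int)
flip = rng.random(300) < 0.2           # flip roughly 20% of labels as noise
demo_y = np.where(flip, 1 - base, base)

# 10 folds x 10 repeats = 100 held-out accuracy scores
cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 10, random_state = 1234)
scores = cross_val_score(DecisionTreeClassifier(min_samples_leaf = 5, random_state = 1234),
                         demo_X, demo_y,
                         scoring = 'accuracy',
                         cv = cv)

print(len(scores))  # 100
print(f'Mean accuracy: {scores.mean():.3f}')
print(f'Standard deviation: {scores.std():.3f}')
```

The mean of the scores is the number this lab reads as bias, and the standard deviation is the number it reads as variance.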