# Hands-On Lab 4

In this lab, you will perform feature engineering and train a tuned *DecisionTreeClassifier* on the *Titanic* dataset. You will then compare the predictive performance of this lab's *DecisionTreeClassifier* to Lab #3's *DecisionTreeClassifier*.

### Step 1 - Load Data

The *titanic_train.csv* file is provided along with the lab Jupyter Notebook file for you to download from the course page. Run the following code cell to load the dataset.

In [None]:
import pandas as pd

# Load Titanic training data from CSV file
titanic_train = pd.read_csv('titanic_train.csv')
titanic_train.head(n = 10)

### Step 2 - Engineer the *Female* Feature

Currently, the *Sex* feature is encoded using the string values of *male* and *female*. As you saw in Lab #3, one-hot encoding the *Sex* feature produces two binary features - *Sex_female* and *Sex_male*. This is not optimal for machine learning because the two columns are redundant (each is the exact complement of the other), so the *Sex* feature's information is spread across two new features. A better option is to create a single binary *Female* feature instead. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
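
As a rough illustration of the idea in Step 2, the sketch below builds a binary *Female* column on a tiny made-up DataFrame. The `toy` and `toy_wrangled` names and the four rows are hypothetical stand-ins, not the lab data, and the boolean comparison is just one of several ways to produce the encoding.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic data
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Comparing to 'female' gives True/False; casting to int gives 1/0
toy_wrangled = toy.assign(Female = lambda df_: (df_['Sex'] == 'female').astype(int))

print(toy_wrangled)
```

A single 0/1 column carries the same information as the *Sex_female* and *Sex_male* pair, so nothing is lost by dropping to one feature.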

### Step 3 - Engineer the *Party* Features

As you've learned in a previous lesson, the *Titanic* dataset offers opportunities to engineer features based on domain knowledge. Specifically, you can engineer features that capture the total number of family members traveling together and the average *Fare* paid per family member. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

NOTE - Be sure to copy and paste the code from Step 2 and extend it below.

In [None]:
# Type your lab code here

### Step 4 - Impute Missing *Embarked* Values

The simplest way to replace missing values (i.e., *imputation*) is to use simple statistics like the mean (or average), median, and mode. If you're unfamiliar, the mode is the most frequently occurring value. Running the following code outputs the value counts of the *Embarked* feature.

In [None]:
# Make sure to count missing values as well!
train_wrangled['Embarked'].value_counts(dropna = False)

The output above shows that the most common *Embarked* value (or mode) is *S*. Using this knowledge you can *impute* the missing *Embarked* values to be *S*. The code below uses the [***fillna()* method**](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html) to replace missing *Embarked* values with *S*. Run the following code to see the output.

In [None]:
# Create the train_wrangled DataFrame
train_wrangled = (titanic_train
                    .assign(Female = lambda df_: df_['Sex'].replace({'female': '1', 'male': '0'}).astype(int),
                            PartySize = lambda df_: df_['SibSp'] + df_['Parch'] + 1,
                            PartyFare = lambda df_: df_['Fare'] / df_['PartySize'],
                            Embarked = lambda df_: df_['Embarked'].fillna('S'))
                 )
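
Instead of hard-coding *S*, the mode can also be computed from the data. The following is a minimal sketch on a made-up column; the `toy`, `embarked_mode`, and `toy_imputed` names and the six values are hypothetical and are not taken from the lab dataset.

```python
import pandas as pd

# Hypothetical toy column with two missing values
toy = pd.DataFrame({'Embarked': ['S', 'C', None, 'S', 'Q', None]})

# mode() ignores missing values and returns a Series (ties are possible),
# so take the first entry
embarked_mode = toy['Embarked'].mode()[0]

# Fill the missing entries with the computed mode
toy_imputed = toy.assign(Embarked = lambda df_: df_['Embarked'].fillna(embarked_mode))

print(toy_imputed['Embarked'].value_counts(dropna = False))
```

Computing the mode keeps the imputation correct if the data changes, at the cost of being slightly less explicit than a literal value.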

### Step 5 - Extract the *Title* Feature

As covered in a previous lesson, the *Name* feature includes the titles of passengers (e.g., *Master.*). Passenger titles appear to contain information regarding both the *Sex* and *Age* of the passenger. In this regard, titles could be a proxy feature for both *Sex* and *Age*. The following code creates the *Title* feature by using the [***str.split()* method**](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html) to extract the title from the *Name* feature. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

NOTE - Be sure to copy and paste the code from Step 4 and extend it below.

In [None]:
# Type your lab code here
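
To make the splitting idea in Step 5 concrete, here is a small sketch on three made-up names. It assumes names follow a "Surname, Title. Given names" pattern; the `names` and `toy_titles` variables are hypothetical, and the two-stage split through an intermediate *CommaSplit* column is one plausible approach rather than the lab's confirmed solution.

```python
import pandas as pd

# Hypothetical names following the "Surname, Title. Given names" pattern
names = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                               'Heikkinen, Miss. Laina',
                               'Palsson, Master. Gosta Leonard']})

toy_titles = (names
                .assign(# Keep everything after the comma, e.g. 'Mr. Owen Harris'
                        CommaSplit = lambda df_: df_['Name'].str.split(', ').str[1],
                        # Keep everything before the first period, e.g. 'Mr'
                        Title = lambda df_: df_['CommaSplit'].str.split('.').str[0])
             )

print(toy_titles[['Name', 'Title']])
```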

### Step 6 - Drop Unneeded Features

At this stage, there are a number of features that will not be used to train the model. For example, *PassengerId* and *CommaSplit* are poor features to use in a machine learning model. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

NOTE - Be sure to copy and paste the code from Step 5 and extend it below.

In [None]:
# Type your lab code here

### Step 7 - Encode Categorical Features

For this lab, a few categorical features have to be one-hot encoded. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
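
One common way to one-hot encode with pandas is `pd.get_dummies()`. The sketch below applies it to a made-up frame; the choice of *Embarked* and *Title* as the columns to encode is an assumption for illustration, and `toy` and `toy_cat` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows; which columns the lab encodes is an assumption here
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S'],
                    'Title': ['Mr', 'Miss', 'Mrs', 'Mr']})

# Each distinct value becomes its own 0/1 column named <feature>_<value>
toy_cat = pd.get_dummies(toy[['Embarked', 'Title']], dtype = int)

print(toy_cat)
```

Every row has exactly one 1 per original feature, which is what "one-hot" refers to.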

### Step 8 - Build Predictors DataFrame

With the categorical features one-hot encoded in the *train_cat* DataFrame, it's time to create the predictor DataFrame that will be used in training the decision tree. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
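
Combining numeric features with one-hot encoded columns can be sketched with `pd.concat()`. The frames below are made up, and `toy_numeric`, `toy_cat`, and `toy_X` are hypothetical names; the actual columns in the lab's predictor DataFrame may differ.

```python
import pandas as pd

# Hypothetical numeric features and one-hot encoded columns
toy_numeric = pd.DataFrame({'Female': [0, 1, 1], 'PartySize': [2, 1, 3]})
toy_cat = pd.DataFrame({'Embarked_C': [0, 1, 0], 'Embarked_S': [1, 0, 1]})

# axis = 1 places the frames side by side, aligning rows on the index
toy_X = pd.concat([toy_numeric, toy_cat], axis = 1)

print(toy_X)
```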

### Step 9 - Train a Tuned Model

As the *Survived* label is numeric (i.e., 1 == Survived, 0 == Perished), it does not need to be encoded. The following code performs a grid search over the *min_samples_leaf* and *min_impurity_decrease* hyperparameters. The best model is chosen using *accuracy* as the measure of awesomeness. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
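
For reference, a grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The example runs on synthetic data from `make_classification` rather than the Titanic features, and the candidate values in `param_grid`, the 5-fold setting, and the random seeds are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the predictors and the Survived label
X, y = make_classification(n_samples = 300, n_features = 6, random_state = 42)

# Candidate hyperparameter values (illustrative, not the lab's grid)
param_grid = {'min_samples_leaf': [1, 5, 10, 25],
              'min_impurity_decrease': [0.0, 0.001, 0.01]}

# Every combination is scored by cross-validated accuracy
grid_cv = GridSearchCV(estimator = DecisionTreeClassifier(random_state = 42),
                       param_grid = param_grid,
                       scoring = 'accuracy',
                       cv = 5)
grid_cv.fit(X, y)

print(grid_cv.best_params_)
print(grid_cv.best_score_)
```

With the default `refit = True`, `grid_cv.best_estimator_` holds a tree retrained on all of `X` and `y` using the winning combination.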

### Step 10 - Visualize the Model

When visualizing the *DecisionTreeClassifier* model, the labels *Perished* and *Survived* will be used instead of zeroes and ones. The *best_estimator_* attribute of the *grid_cv* object provides a model retrained on all the data using the best hyperparameter values. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here

### Step 11 - Evaluate Bias and Variance

The tuned *decision_tree* model above isn't particularly complex. While you could spend time poring over all the paths through the tree, a far more efficient way to understand the effectiveness of the model's predictions is to estimate its bias and variance. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

Note the following in the output:
* The model's mean accuracy (i.e., *bias*) is 82.4% across the 100 folds
* The model's standard deviation (i.e., *variance*) is 4% across the 100 folds

To interpret the variance:
* We would expect the model to score between 78.4% and 86.4% accuracy about 2/3 of the time.
* We would expect the model to score between 74.4% and 90.4% accuracy about 95% of the time.

**Bottom Line -** This model's bias and variance have improved compared to the Lab #3 model! However, the variance remains high.

In [None]:
# Type your lab code here
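
A sketch of how such an estimate could be produced is shown below. It assumes the 100 folds come from 10-fold cross-validation repeated 10 times, which the lab text does not confirm, and it uses synthetic data with an arbitrary `min_samples_leaf` value, so the numbers it prints will not match the figures quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the predictors and the Survived label
X, y = make_classification(n_samples = 300, n_features = 6, random_state = 42)

# 10 folds x 10 repeats = 100 accuracy scores (assumed setup)
cv = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 10, random_state = 42)
model = DecisionTreeClassifier(min_samples_leaf = 10, random_state = 42)
scores = cross_val_score(model, X, y, scoring = 'accuracy', cv = cv)

mean_acc = scores.mean()
std_acc = scores.std()

# Roughly 2/3 of scores fall within 1 standard deviation of the mean and
# roughly 95% within 2, if the scores are approximately normally distributed
print(f'Mean accuracy: {mean_acc:.3f}')
print(f'Std deviation: {std_acc:.3f}')
print(f'~2/3 range:    {mean_acc - std_acc:.3f} to {mean_acc + std_acc:.3f}')
print(f'~95% range:    {mean_acc - 2 * std_acc:.3f} to {mean_acc + 2 * std_acc:.3f}')
```

The same arithmetic gives the ranges quoted in this step: 82.4% plus or minus 4% is 78.4% to 86.4%, and plus or minus 8% is 74.4% to 90.4%.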