# Hands-On Lab 2 - Random Forests

### Step 1 - Load Data

The *adult_train.csv* file is the training dataset. Run the following code cell to load the dataset.

In [None]:
import pandas as pd

adult_train = pd.read_csv('adult_train.csv')
adult_train.head()

### Step 2 - Prepare the Features 

As you learned in Lab 1, you must always prepare your data before training a machine learning model. For this lab, we will expand the working hypotheses to include the *Race*, *CapitalGain*, and *CapitalLoss* features. The intuition is that *Race* is associated with income in the US economy, and the *CapitalGain* and *CapitalLoss* features are also associated with income. Run the following code to prepare the dataset.

In [None]:
# Features to use to predict the labels
all_features = ['Age', 'EducationNum', 'MaritalStatus', 'Occupation', 'Race', 'Sex', 
                'CapitalGain', 'CapitalLoss', 'HoursPerWeek']

# Categorical features
cat_features = ['MaritalStatus', 'Occupation', 'Race', 'Sex']

# Select the above features and one-hot encode
adult_X = pd.get_dummies(adult_train[all_features], prefix=cat_features, columns=cat_features)
adult_X.head()

### Step 3 - Explore the Feature Count

One downside to one-hot encoding is that it can significantly increase the number of features passed to *scikit-learn* machine learning algorithms. For example, the code for this step shows that the original nine features become 33 after one-hot encoding. Always keep track of the number of features that result from your data preparation. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
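
As a rough sketch of the kind of check this step asks for (not necessarily the exact lab code), the example below one-hot encodes a small made-up DataFrame and reads the feature count from its *shape*. The *toy* data and the *toy_cat* list are hypothetical stand-ins. In the lab notebook, the same idea would presumably apply to *adult_X*.

```python
import pandas as pd

# Hypothetical toy data standing in for adult_train (NOT the real dataset)
toy = pd.DataFrame({
    'Age': [39, 50, 38, 53],
    'Sex': ['Male', 'Male', 'Female', 'Female'],
    'Race': ['White', 'Black', 'White', 'Asian-Pac-Islander'],
})

# Categorical columns to one-hot encode (assumed names for this sketch)
toy_cat = ['Sex', 'Race']
toy_X = pd.get_dummies(toy, prefix=toy_cat, columns=toy_cat)

# shape is (rows, columns); the column count is the feature count
print(toy_X.shape)
print(list(toy_X.columns))
```

Each distinct category value becomes its own column, so three input columns grow to six features here: one numeric column, two *Sex* columns, and three *Race* columns.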

### Step 4 - Prepare the Labels

As with Lab 1, you need to encode the string values in the label before training a model. Run the following code to prepare the labels.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode labels
label_encoder = LabelEncoder()
adult_y = label_encoder.fit_transform(adult_train['Label'])

print(label_encoder.classes_)
print(adult_y)

### Step 5 - Train the Random Forest

The *scikit-learn* library was designed so that the same coding patterns apply regardless of the machine learning algorithm you use. For example, the *fit()* method works the same way across the *DecisionTreeClassifier* and *RandomForestClassifier* classes. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

**NOTE** - The *n_jobs* parameter sets how many jobs run in parallel during training. You can increase it if your laptop has more CPU cores available.

In [None]:
# Enter your lab code here
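
A minimal sketch of the training pattern, assuming synthetic data from *make_classification* in place of *adult_X* and *adult_y*. The hyperparameter values (*n_estimators*, *n_jobs*, *random_state*) are illustrative assumptions, not the lab's official settings. Setting *oob_score=True* asks the forest to keep out-of-bag estimates during training.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for adult_X / adult_y (assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hyperparameter values below are illustrative, not the lab's settings
rf = RandomForestClassifier(
    n_estimators=100,   # number of trees in the forest
    oob_score=True,     # compute out-of-bag estimates while fitting
    n_jobs=2,           # number of parallel jobs; adjust for your machine
    random_state=42,    # fixed seed so results are repeatable
)

# Same fit() pattern as DecisionTreeClassifier
rf.fit(X, y)

# OOB accuracy estimated from samples each tree did not see
print(rf.oob_score_)
```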

### Step 6 - Get OOB Predictions

As discussed in lecture, think of out-of-bag (OOB) data as representing the unknowable future. Analyzing the predictive performance of a *RandomForestClassifier* on OOB data allows for estimating the quality of the model's future predictions. The first step is to build the predicted OOB labels. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
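
One possible way to build OOB labels, sketched on synthetic data rather than the lab dataset: a forest fitted with *oob_score=True* exposes *oob_decision_function_*, which holds one row of OOB class probabilities per training sample. Taking the most probable class in each row gives a predicted label. The variable names and hyperparameters here are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for adult_X / adult_y (assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# One row per training sample, one column per class
oob_proba = rf.oob_decision_function_

# Pick the most probable class for each sample and map it to its label
oob_pred = rf.classes_[np.argmax(oob_proba, axis=1)]

print(oob_pred[:10])
```

With very few trees, a sample might never be out-of-bag, in which case its row contains NaN values. Using a reasonably large *n_estimators* makes this unlikely.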

### Step 7 - Analyze the OOB Predictions

Using a *confusion matrix* to analyze OOB predictions gives you insights into how your *RandomForestClassifier* is performing. The confusion matrix allows you to answer many questions about the nature of model predictions, such as which class the model confuses most often. As you will learn later, iterating over multiple model versions is common in a machine learning project. Analyzing the OOB predictions of each model iteration is critical for crafting the most valuable models. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Enter your lab code here
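
A hedged sketch of the analysis, again on synthetic data with assumed variable names: *confusion_matrix* from *sklearn.metrics* compares the true labels with the OOB predictions. Rows correspond to true classes and columns to predicted classes, so the diagonal counts the correct predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for adult_X / adult_y (assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# Predicted OOB labels: most probable class per training sample
oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, oob_pred)
print(cm)

# Accuracy is the share of predictions on the diagonal
oob_accuracy = np.trace(cm) / cm.sum()
print(oob_accuracy)
```

The off-diagonal cells show the two kinds of mistakes separately, which a single accuracy number hides.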