# Hands-On Lab 4 - Model Testing

### Step 1 - Load the Data

The *adult_train.csv* file is the training dataset. Run the following code cell to load the dataset.

In [None]:
import pandas as pd

adult_train = pd.read_csv('adult_train.csv')
adult_train.head()

### Step 2 - Engineer the *Female* Feature

The last lab illustrated how engineering a *Female* feature likely produced a better model. Run the following code cell to produce the results.

In [None]:
# Add a new Female feature to the DataFrame
adult_train['Female'] = adult_train['Sex'].replace({'Female': 1, 'Male': 0})

# Check the results
adult_train[['Sex', 'Female']].head()

### Step 3 - Prepare the Features

This lab will use the same feature preparation as Lab 3. Run the following code cell to produce the results.

In [None]:
# Features to use to predict the labels
all_features = ['Age', 'EducationNum', 'MaritalStatus', 'Occupation', 'Race', 'Female', 
                'CapitalGain', 'CapitalLoss', 'HoursPerWeek']

# Categorical features
cat_features = ['MaritalStatus', 'Occupation', 'Race']

# Select the above features and one-hot encode
adult_X = pd.get_dummies(adult_train[all_features], prefix = cat_features, columns = cat_features)
adult_X.head()

### Step 4 - Preparing the Labels

This lab will use the same label preparation as Lab 3. Run the following code cell to produce the results.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode labels
label_encoder = LabelEncoder()
adult_y = label_encoder.fit_transform(adult_train['Label'])

print(label_encoder.classes_)
print(adult_y)

### Step 5 - Train the Random Forest

Run the following code cell to produce the results.

**NOTE** - The *n_jobs* parameter sets how many CPU cores are used in parallel. You can increase it if you have a more powerful laptop, or set it to -1 to use all available cores.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Instantiate the Random Forest object
rf_1 = RandomForestClassifier(n_estimators = 200, oob_score = True, n_jobs = 1, random_state = 12345)

# Train the RandomForestClassifier
rf_1.fit(adult_X, adult_y)

# What is the accuracy estimate?
print(f'Estimated accuracy with OOB data: {rf_1.oob_score_:.4f}')

# What is the accuracy on the training data?
print(f'Training data accuracy: {rf_1.score(adult_X, adult_y):.4f}')

### Step 6 - Load Test Data

As you learned in the lecture, the purpose of the test dataset is to provide a final estimate of the quality of future model predictions. The following code loads the *Adult Census* test dataset. Run the following code cell to produce the results.

In [None]:
# Load the test dataset
# NOTE: the file name adult_test.csv is an assumption that mirrors adult_train.csv
adult_test = pd.read_csv('adult_test.csv')
adult_test.head()

### Step 7 - Prepare the Test Data Features

A trained machine learning model requires the test *DataFrame* to have columns that match those used to train the model. Any data cleaning or feature engineering performed on the training *DataFrame* has to be applied to the test dataset. Type the following code into the blank code cell in your lab notebook and run it to produce the results. 

In [None]:
# Type your lab code here
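
The sketch below illustrates the idea on a tiny made-up dataset rather than the lab data, so it can run on its own. The `toy_*` names and the `prepare_features` helper are hypothetical and exist only for this illustration. It applies the same *Female* engineering and one-hot encoding to both DataFrames, then uses `reindex` so the test columns match the training columns even when a category never appears in the test data.

```python
import pandas as pd

# Tiny stand-ins for the real training and test DataFrames (made-up rows)
toy_train = pd.DataFrame({
    'Age': [39, 50, 28],
    'Sex': ['Male', 'Female', 'Female'],
    'Race': ['White', 'Black', 'Other'],
})
toy_test = pd.DataFrame({
    'Age': [45, 23],
    'Sex': ['Female', 'Male'],
    'Race': ['White', 'White'],  # 'Black' and 'Other' never appear here
})

def prepare_features(df):
    """Hypothetical helper: engineer Female, then one-hot encode Race."""
    df = df.copy()
    df['Female'] = df['Sex'].map({'Female': 1, 'Male': 0})
    return pd.get_dummies(df[['Age', 'Female', 'Race']], prefix = ['Race'], columns = ['Race'])

toy_train_X = prepare_features(toy_train)
toy_test_X = prepare_features(toy_test)

# Align the test columns to the training columns.
# Columns missing from the test data are added and filled with 0.
toy_test_X = toy_test_X.reindex(columns = toy_train_X.columns, fill_value = 0)

print(toy_train_X.columns.tolist())
print(toy_test_X.columns.tolist())
```

On the lab data, the same pattern would reuse the Step 2 and Step 3 code on the test *DataFrame* and reindex against `adult_X.columns`.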

### Step 8 - Preparing the Test Labels

As with the predictive features, you must prepare the test set labels to match the training dataset's labels. The easiest way to achieve this is to use the existing *LabelEncoder* object created in Step 4. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

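**NOTE** - The code uses the *transform()* method rather than *fit_transform()*, so the test labels are encoded with the mapping learned from the training labels.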

In [None]:
# Type your lab code here
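
As a rough illustration of why *transform()* matters, the sketch below fits a *LabelEncoder* on made-up training labels and then reuses it on made-up test labels. The label strings and the `toy_*` names are assumptions for this example only and may not match the values in the lab's *Label* column.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up label values standing in for the lab's Label column
toy_train_labels = ['<=50K', '>50K', '<=50K', '>50K']
toy_test_labels = ['>50K', '<=50K', '>50K']

# Fit the encoder on the training labels only
toy_encoder = LabelEncoder()
toy_train_y = toy_encoder.fit_transform(toy_train_labels)

# Reuse the fitted encoder so the test labels get the same integer codes
toy_test_y = toy_encoder.transform(toy_test_labels)

print(toy_encoder.classes_)
print(toy_test_y)
```

Because the encoder is not refit, each class keeps the integer code it received during training. Note that *transform()* raises a `ValueError` if the test data contains a label the encoder never saw.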

### Step 9 - Making Predictions

The *RandomForestClassifier* offers the *predict()* method to make predictions for a dataset. Here, you will use it to make predictions for the test dataset. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
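
The following sketch shows the general shape of a *predict()* call using randomly generated data instead of the lab data. The `toy_*` names, the data, and the smaller forest size are all assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Randomly generated stand-ins for prepared training and test features
rng = np.random.default_rng(12345)
toy_X_train = rng.normal(size = (100, 3))
toy_y_train = (toy_X_train[:, 0] > 0).astype(int)  # label depends on the first column
toy_X_test = rng.normal(size = (20, 3))

# Train a small forest on the training data only
toy_rf = RandomForestClassifier(n_estimators = 50, random_state = 12345)
toy_rf.fit(toy_X_train, toy_y_train)

# predict() returns one encoded label per row of the test features
toy_preds = toy_rf.predict(toy_X_test)
print(toy_preds[:10])
```

On the lab data, the equivalent call would pass the prepared test features from Step 7 to `rf_1.predict()`.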

### Step 10 - Evaluating the Model

As discussed during the lecture, the final step of any machine learning project is evaluating the model's predictive performance against the test data. The goal is to ascertain how likely the model is to meet business requirements in the future. Type the following code into the blank code cell in your lab notebook and run it to produce the results.

In [None]:
# Type your lab code here
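
One common way to evaluate a classifier is to compare the predictions to the true labels with accuracy and a confusion matrix. The sketch below does this with hand-written label arrays; the values and `toy_*` names are made up for illustration, and the metrics shown are an assumption about what a reasonable evaluation might include rather than the lab's official solution.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written stand-ins for the encoded test labels and the model's predictions
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Fraction of predictions that match the true labels
toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred)

print(f'Test data accuracy: {toy_accuracy:.4f}')
print(toy_cm)
```

Comparing the test accuracy to the OOB estimate from Step 5 gives a sense of how well the OOB estimate anticipated performance on unseen data.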