# Activity: Build a random forest model

## **Introduction**

Random forests are popular statistical learning algorithms for classification tasks. Their primary benefits include reducing variance, bias, and the chance of overfitting.

This activity is a continuation of the project you began modeling with decision trees for an airline. Here, you will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. It includes data points such as class, flight distance, and inflight entertainment. Your random forest model will be used to predict whether a customer will be satisfied with their flight experience.


## **Step 1: Imports** 


Import relevant Python libraries and modules, including the `numpy` and `pandas` libraries for data processing; the `pickle` package to save the model; and the `sklearn` library, containing the `ensemble`, `model_selection`, and `metrics` modules.


In [52]:
# Import relevant Python libraries
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
import sklearn.metrics as metrics
 

In [53]:
# Import the data.
flight_data = pd.read_csv("Invistico_Airline.csv")

## **Step 2: Data cleaning** 

To get a sense of the data, display the first 10 rows.

In [9]:
# Display first 10 rows.
flight_data.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,2,4,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,0,2,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,2,0,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,3,4,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,4,3,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,2,0,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,2,5,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,2,0,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,5,3,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,2,0,2,2,2,4,5,3,4,2,30,26.0


In [10]:
# Display variable names and types.
flight_data.dtypes


satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

**Question:** What is observed about the differences in data types among the variables included in the data?

- The categorical variables `satisfaction`, `Customer Type`, `Type of Travel`, and `Class` are stored as strings (`object` dtype) and must be encoded numerically before modeling.

Next, to understand the size of the dataset, identify the number of rows and the number of columns.

In [11]:
# Identify the number of rows and the number of columns.
flight_data.shape


(129880, 22)

Now, check for missing values in the rows of the data. 

In [13]:
# Get the number of rows that contain missing values.
flight_data.isna().any(axis=1).sum()


393

**Question:** How many rows of data are missing values?

- There are 393 rows containing missing values. 

Drop the rows with missing values. This is an important data cleaning step that makes the data more useful for analysis and modeling.

In [54]:
# Drop missing values and create a copy in variable `air_data_subset`.
air_data_subset = flight_data.dropna()
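As a minimal sketch of the missing-value checks used here, the toy DataFrame below (hypothetical values, not taken from the airline data) shows how `isna().any(axis=1).sum()` counts rows with at least one missing value, how `isna().sum()` counts missing cells per column, and what `dropna()` keeps.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): rows 1 and 2 each contain at least one missing value.
toy = pd.DataFrame({
    "Age": [25, 40, np.nan, 31],
    "Arrival Delay in Minutes": [0.0, np.nan, np.nan, 12.0],
})

rows_with_missing = toy.isna().any(axis=1).sum()  # rows with >= 1 missing value
missing_per_column = toy.isna().sum()             # missing cells in each column
toy_clean = toy.dropna()                          # keeps only complete rows

print(rows_with_missing)             # 2
print(missing_per_column.to_dict())  # {'Age': 1, 'Arrival Delay in Minutes': 2}
print(toy_clean.shape)               # (2, 2)
```

Note that the per-column counts can sum to more than the number of affected rows, because a single row may be missing several values.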

In [16]:
air_data_subset.head(10)

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,2,4,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,0,2,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,2,0,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,3,4,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,4,3,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,2,0,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,2,5,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,2,0,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,5,3,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,2,0,2,2,2,4,5,3,4,2,30,26.0


Confirm that it does not contain any missing values.

In [22]:
# Count of missing values.
air_data_subset.isna().sum()


satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

Next, convert the categorical features to indicator (one-hot encoded) features. 

In [55]:
# Convert categorical features to one-hot encoded features.
dummy_vars = pd.get_dummies(air_data_subset[['Customer Type', 'Type of Travel', 'Class']], drop_first=True)
air_data_subset = pd.concat([air_data_subset.drop(columns=['Customer Type', 'Type of Travel', 'Class']), dummy_vars], axis=1)

**Question:** Why is it necessary to convert categorical data into dummy variables?

- Dummy variables put categorical data into a numerical form that the model can use in its computations.

In [36]:
# Display the first 10 rows.
air_data_subset.head(10)


Unnamed: 0,satisfaction,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_disloyal Customer,Type of Travel_Personal Travel,Class_Eco,Class_Eco Plus
0,satisfied,65,265,0,0,0,2,2,4,2,3,3,0,3,5,3,2,0,0.0,0,1,1,0
1,satisfied,47,2464,0,0,0,3,0,2,2,3,4,4,4,2,3,2,310,305.0,0,1,0,0
2,satisfied,15,2138,0,0,0,3,2,0,2,2,3,3,4,4,4,2,0,0.0,0,1,1,0
3,satisfied,60,623,0,0,0,3,3,4,3,1,1,0,1,4,1,3,0,0.0,0,1,1,0
4,satisfied,70,354,0,0,0,3,4,3,4,2,2,0,2,4,2,5,0,0.0,0,1,1,0
5,satisfied,30,1894,0,0,0,3,2,0,2,2,5,4,5,5,4,2,0,0.0,0,1,1,0
6,satisfied,66,227,0,0,0,3,2,5,5,5,5,0,5,5,5,3,17,15.0,0,1,1,0
7,satisfied,10,1812,0,0,0,3,2,0,2,2,3,3,4,5,4,2,0,0.0,0,1,1,0
8,satisfied,56,73,0,0,0,3,5,3,5,4,4,0,1,5,4,4,0,0.0,0,1,0,0
9,satisfied,22,1556,0,0,0,3,2,0,2,2,2,4,5,3,4,2,30,26.0,0,1,1,0


Then, check the variables of `air_data_subset`.

In [37]:
# Display variables.
air_data_subset.dtypes


satisfaction                          object
Age                                    int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_disloyal Customer        uint8
Type of Travel_Personal Travel         uint8
Class_Eco                              uint8
Class_Eco Plus                         uint8
dtype: object

**Question:** What changes do you observe after converting the string data to dummy variables?

- Each categorical feature is converted into k-1 binary indicator columns, where k is the number of distinct categories in that feature. One category is dropped because `drop_first=True` was used.
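The k-1 behavior can be illustrated with a small sketch on a hypothetical one-column DataFrame. With three categories, `pd.get_dummies` produces three indicator columns by default and two when `drop_first=True`; the dropped category is the first one in sorted order. The dtype of the indicator columns depends on the pandas version (recent releases default to `bool` rather than `uint8`), so the sketch only inspects column names.

```python
import pandas as pd

# Toy data (hypothetical): one categorical feature with k = 3 categories.
toy = pd.DataFrame({"Class": ["Eco", "Business", "Eco Plus", "Eco"]})

full = pd.get_dummies(toy, columns=["Class"])                      # k columns
reduced = pd.get_dummies(toy, columns=["Class"], drop_first=True)  # k - 1 columns

print(list(full.columns))     # ['Class_Business', 'Class_Eco', 'Class_Eco Plus']
print(list(reduced.columns))  # ['Class_Eco', 'Class_Eco Plus']
```

A row with zeros in both remaining columns therefore represents the dropped category, `Business`.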

## **Step 3: Model building** 

The first step to building your model is separating the labels (y) from the features (X).

In [56]:
# Separate the dataset into labels (y) and features (X).
X = air_data_subset.drop(columns='satisfaction').copy()
y = air_data_subset['satisfaction']

Once separated, split the data into train, validate, and test sets. 

In [57]:
# Separate into train, validate, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=10) 
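Because the second split takes 20% of the remaining 80%, the resulting proportions are roughly 64% train, 16% validation, and 20% test. The sketch below checks this on a hypothetical array of 1,000 rows, reusing the same `test_size` and `random_state` values as the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1,000 rows, one feature, alternating labels.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First split: hold out 20% as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Second split: hold out 20% of the remaining 80% as the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=10)

print(len(X_tr), len(X_val), len(X_test))  # 640 160 200
```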

### Tune the model

Now, fit and tune a random forest model with a separate validation set. Begin by determining a set of hyperparameters for tuning the model using `GridSearchCV`.


In [47]:
# Determine set of hyperparameters.
cv_params = {'max_depth': [2, 3, 4, 5, 10, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]
            }
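To get a feel for the cost of this grid, it helps to count the candidate combinations before fitting. A small sketch using scikit-learn's `ParameterGrid` on the same dictionary:

```python
from sklearn.model_selection import ParameterGrid

cv_params = {'max_depth': [2, 3, 4, 5, 10, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]
            }

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(cv_params))
print(n_candidates)  # 6 * 3 * 3 * 3 * 4 = 648
```

A grid search fits each candidate once per validation fold, so the number of forests trained grows quickly as values are added to the grid.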


Next, create a list of split indices.

In [58]:
# Create list of split indices.
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
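In `PredefinedSplit`, an entry of `-1` means the row is always used for training, and an entry of `0` assigns the row to validation fold 0. The sketch below uses hypothetical index values to show the single train/validation split this produces, along with a vectorized way to build the same list of split indices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical indices: five training rows, two of which belong to the validation set.
train_index = pd.Index([10, 11, 12, 13, 14])
val_index = pd.Index([12, 13])

# 0 = validation fold, -1 = always in the training portion.
split_index = np.where(train_index.isin(val_index), 0, -1)
ps = PredefinedSplit(split_index)

# Positions (not labels) of the rows in each part of the split.
splits = [(train.tolist(), val.tolist()) for train, val in ps.split()]

print(split_index.tolist())  # [-1, -1, 0, 0, -1]
print(ps.get_n_splits())     # 1
print(splits)                # [([0, 1, 4], [2, 3])]
```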

Instantiate the model and use GridSearchCV to search over the specified hyperparameters.

In [48]:
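# Instantiate the model and search over the specified parameters.
# Note: no `scoring` argument is passed, so candidates are ranked by the
# estimator's default score (accuracy); `refit='f1'` then acts only as a
# truthy flag that refits the best candidate.
RF_cv = GridSearchCV(RandomForestClassifier(random_state=0), 
                     cv_params,
                     cv=custom_split,
                     refit='f1',
                     n_jobs=-1
                    )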
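If the intent behind `refit='f1'` is to pick the best candidate by F1 score, a scorer has to be supplied through `scoring`. The sketch below is one possible way to do that, not the configuration that produced the results in this notebook. It assumes string labels like those in the `satisfaction` column, so the positive class is named explicitly with `make_scorer`; the data is synthetic and the grid is deliberately tiny.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical) with string labels, mimicking the target column.
X_toy, y_num = make_classification(n_samples=300, n_features=8, random_state=0)
y_toy = np.where(y_num == 1, "satisfied", "dissatisfied")

# F1 scorer that treats 'satisfied' as the positive class.
f1_satisfied = make_scorer(f1_score, pos_label="satisfied")

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [2, None], "n_estimators": [10, 20]},
    scoring=f1_satisfied,  # candidates are now ranked by F1
    cv=3,
    refit=True,            # refit the best candidate on all of the data passed to fit
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With a single scorer, `best_score_` is the mean cross-validated F1 of the selected candidate.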


Now, fit your model.

In [49]:
%%time
# Fit the model.
RF_cv.fit(X_train, y_train)


CPU times: user 22.9 s, sys: 1.48 s, total: 24.4 s
Wall time: 23min 55s


GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
             error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight...
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, 

Finally, obtain the optimal parameters.

In [50]:
# Obtain optimal parameters.
RF_cv.best_params_


{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 1,
 'min_samples_split': 4,
 'n_estimators': 150}
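Since the grid search above took about 24 minutes of wall time, it can be worth saving the fitted object with `pickle` so it does not have to be refit. The sketch below shows the pattern on a small synthetic model; the file name `rf_model.pickle` and the temporary directory are illustrative assumptions, and in this notebook the object to save would be `RF_cv`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model (hypothetical) standing in for the fitted search object.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.pickle")  # illustrative file name

    # Save the fitted model to disk.
    with open(path, "wb") as to_write:
        pickle.dump(model, to_write)

    # Load it back later instead of refitting.
    with open(path, "rb") as to_read:
        restored = pickle.load(to_read)

same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)  # True
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.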

## **Step 4: Results and evaluation** 

Use the selected model to predict on the test data. Use the optimal parameters found via GridSearchCV.

In [94]:
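# Wrap each optimal value from `best_params_` in a list, as GridSearchCV expects.
opt_params = {'max_depth': [None], 'max_features': [4], 'min_samples_leaf': [1], 'min_samples_split': [4], 'n_estimators': [150]}

# Refit the optimal configuration with GridSearchCV, now using 10-fold cross-validation.
RF_opt = GridSearchCV(RandomForestClassifier(random_state=0), 
                      opt_params,
                      cv=10,
                      refit='f1',
                      n_jobs=-1)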

In [95]:
# Fit the optimal model.
RF_opt.fit(X_train, y_train)


GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=0,
                                  

Then, predict on the test set using the optimal model.

In [68]:
# Predict on test set.
y_pred = RF_opt.predict(X_test)


### Obtain performance scores

First, get the precision score.

In [77]:
# Get precision score.
precision = metrics.precision_score(y_test, y_pred, pos_label='satisfied')
precision

0.9665253932907119

Then, collect the recall score.

In [71]:
# Get recall score.
recall = metrics.recall_score(y_test, y_pred, pos_label="satisfied")
recall

0.9484034679636286

Next, obtain the accuracy score.

In [75]:
# Get accuracy score.
accuracy = metrics.accuracy_score(y_test, y_pred)
accuracy

0.9537416016680825

Finally, collect the F1-score.

In [79]:
# Get F1 score.
f1 = metrics.f1_score(y_test, y_pred, pos_label='satisfied')
f1

0.9573786822257009

**Question:** How is the F1-score calculated?

F1 scores are calculated using the formula: 

F1 = 2 * (precision * recall) / (precision + recall)
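As a quick check, plugging the precision and recall values obtained above into this formula reproduces the F1 score reported by `metrics.f1_score`:

```python
# Precision and recall as printed in the cells above.
precision = 0.9665253932907119
recall = 0.9484034679636286

# Harmonic mean of precision and recall.
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 6))  # 0.957379
```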


**Question:** What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

- Pro: No data has to be held out for validation, so more data is available for training. This gives the algorithm a better chance of learning patterns that are representative of the entire dataset.
- Con: Reusing the test set for model selection leaks information into the model-building process. The chosen hyperparameters end up fitted to the test set, so the test scores become an overly optimistic estimate of how well the model generalizes to unseen data.


### Evaluate the model

Now that you have the results, evaluate how well the model performs.

**Question:** What are the four basic parameters for evaluating the performance of a classification model?

1. True positives (TP): Correctly predicted positive values. Both the actual class and the predicted class are positive.

2. True negatives (TN): Correctly predicted negative values. Both the actual class and the predicted class are negative.

3. False positives (FP): The actual class is negative, but the predicted class is positive.

4. False negatives (FN): The actual class is positive, but the predicted class is negative.

**Question:**  What do the four scores demonstrate about your model, and how do you calculate them?

- Accuracy ((TP + TN) / (TP + FP + FN + TN)): The ratio of correctly predicted observations to total observations.

- Precision (TP / (TP + FP)): The ratio of correctly predicted positive observations to total predicted positive observations.

- Recall (sensitivity, TP / (TP + FN)): The ratio of correctly predicted positive observations to all observations in the actual positive class.

- F1 score: The harmonic mean of precision and recall, which takes into account both false positives and false negatives.
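The definitions above can be verified on a small hand-made example. The sketch below uses ten hypothetical labels, counts TP, TN, FP, and FN directly, and compares the resulting scores with `sklearn.metrics`. It also shows why argument order matters: the metric functions expect `y_true` first and `y_pred` second, and swapping them exchanges precision and recall.

```python
from sklearn import metrics

POS, NEG = "satisfied", "dissatisfied"

# Hypothetical labels: 5 actual positives followed by 5 actual negatives.
y_true = [POS] * 5 + [NEG] * 5
y_pred = [POS, POS, POS, POS, NEG, POS, POS, NEG, NEG, NEG]

pairs = list(zip(y_true, y_pred))
tp = sum(t == POS and p == POS for t, p in pairs)  # 4
tn = sum(t == NEG and p == NEG for t, p in pairs)  # 3
fp = sum(t == NEG and p == POS for t, p in pairs)  # 2
fn = sum(t == POS and p == NEG for t, p in pairs)  # 1

accuracy = (tp + tn) / (tp + fp + fn + tn)           # 7 / 10
precision = tp / (tp + fp)                           # 4 / 6
recall = tp / (tp + fn)                              # 4 / 5
f1 = 2 * precision * recall / (precision + recall)   # 8 / 11

# scikit-learn agrees when y_true is passed first.
sk_precision = metrics.precision_score(y_true, y_pred, pos_label=POS)
sk_recall = metrics.recall_score(y_true, y_pred, pos_label=POS)

# Swapping the arguments silently returns recall instead of precision.
swapped = metrics.precision_score(y_pred, y_true, pos_label=POS)

print(round(sk_precision, 4), round(sk_recall, 4), round(swapped, 4))  # 0.6667 0.8 0.8
```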

Calculate the scores: precision score, recall score, accuracy score, F1 score.

In [81]:
# Precision score on test data set (y_true first, then y_pred).
test_precision = metrics.precision_score(y_test, RF_opt.predict(X_test), pos_label='satisfied')
test_precision

0.9665253932907119

In [82]:
# Recall score on test data set.
test_recall = metrics.recall_score(y_test, RF_opt.predict(X_test), pos_label='satisfied')
test_recall


0.9484034679636286

In [83]:
# Accuracy score on test data set.
test_acc = metrics.accuracy_score(y_test, RF_opt.predict(X_test))
test_acc

0.9537416016680825

In [85]:
# F1 score on test data set.
test_f1 = metrics.f1_score(y_test, RF_opt.predict(X_test), pos_label='satisfied')
test_f1

0.9573786822257009

**Question:** How does this model perform based on the four scores?

The model performs well overall: all four scores are above 0.94, with precision (about 0.967) slightly higher than the other three metrics.

### Evaluate the model

Finally, create a table of results that you can use to evaluate the performance of your model.

In [96]:
# Create table of results.
table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
                        'F1':  [0.945422, test_f1],
                        'Recall': [0.935863, test_recall],
                        'Precision': [0.955197, test_precision],
                        'Accuracy': [0.940864, test_acc]
                      }
                    )
table


Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Tuned Decision Tree,0.945422,0.935863,0.955197,0.940864
1,Tuned Random Forest,0.957379,0.948403,0.966525,0.953742


**Question:** How does the random forest model compare to the decision tree model you built in the previous lab?

The tuned random forest has higher scores overall, so it is the better model. Particularly, it shows a better F1 score than the decision tree model, which indicates that the random forest model may do better at classification when taking into account false positives and false negatives. 


## **Considerations**


**What are the key takeaways from this lab? Consider important steps when building a model, most effective approaches and tools, and overall results.**

- Data exploring, cleaning, and encoding are necessary for model building.
- A separate validation set is typically used for tuning a model, rather than using the test set. This also helps avoid the evaluation becoming biased.
- F1 scores are usually more useful than accuracy scores. If the costs of false positives and false negatives are very different, it’s better to use the F1 score and combine the information from precision and recall.
- The random forest model yields a more effective performance than a decision tree model.

**What summary would you provide to stakeholders?**

* The random forest model predicted satisfaction with more than 95.3% accuracy. The precision is approximately 96.7% and the recall is approximately 94.8%.
* The random forest model outperformed the tuned decision tree (with its best hyperparameters) on all four scores. This indicates that the random forest model may perform better.
* Because stakeholders were interested in learning about the factors that are most important to customer satisfaction, these could be reported based on the tuned random forest.
* In addition, you would provide details about the precision, recall, accuracy, and F1 scores to support your findings. 
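One common way to surface the factors that matter most to a random forest is its `feature_importances_` attribute. The sketch below is an assumption about how that could be done, shown on synthetic data with hypothetical feature names. In this notebook, the equivalent would be to read `RF_opt.best_estimator_.feature_importances_` and index it by `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical) with named feature columns.
X_arr, y_toy = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = (
    pd.Series(forest.feature_importances_, index=X_toy.columns)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best presented to stakeholders as a ranking rather than as exact effect sizes.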