# Build a random forest model

## **Introduction**

This activity continues the airline project that we began modeling with decision trees. Here, we will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers, the same data used in previous labs. It includes data points such as class, flight distance, and inflight entertainment. The random forest model will be used to predict whether a customer will be satisfied with their flight experience.

Let's get started!

In [27]:
# Import relevant libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

import pickle as pkl

In [3]:
# Import dataset
air_data= pd.read_csv(r'C:\Users\user\Desktop\Course 5\Invistico_Airline.csv')

In [10]:
# Display first few rows
air_data.head()

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0


In [11]:
air_data.shape

(129880, 22)

In [12]:
air_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   satisfaction                       129880 non-null  object 
 1   Customer Type                      129880 non-null  object 
 2   Age                                129880 non-null  int64  
 3   Type of Travel                     129880 non-null  object 
 4   Class                              129880 non-null  object 
 5   Flight Distance                    129880 non-null  int64  
 6   Seat comfort                       129880 non-null  int64  
 7   Departure/Arrival time convenient  129880 non-null  int64  
 8   Food and drink                     129880 non-null  int64  
 9   Gate location                      129880 non-null  int64  
 10  Inflight wifi service              129880 non-null  int64  
 11  Inflight entertainment             1298

There are 393 rows with missing values.



Drop the rows with missing values. This is an important step in data cleaning, as it makes our data more useful for analysis and modeling. Then, save the resulting pandas DataFrame in a variable named `air_data_subset`.

In [13]:
# Drop missing values
air_data_subset= air_data.dropna(axis= 0)
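As a minimal sketch of the counting and dropping steps, the toy DataFrame below stands in for the survey data. Its values and the names `toy`, `n_missing_rows`, and `toy_subset` are made up for illustration and are not part of the lab.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (values are invented for illustration).
toy = pd.DataFrame({
    "Age": [65, 47, 15, 60],
    "Arrival Delay in Minutes": [0.0, np.nan, 0.0, np.nan],
})

# Count the rows that contain at least one missing value.
n_missing_rows = int(toy.isna().any(axis=1).sum())
print(n_missing_rows)

# Drop those rows; axis=0 means rows (not columns) are removed.
toy_subset = toy.dropna(axis=0)
print(toy_subset.shape)
```

The same two calls applied to `air_data` would report how many rows are affected before they are removed.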

Next, convert the categorical features to indicator (one-hot encoded) features. 

**Note:** The `drop_first` argument can be kept as default (`False`) during one-hot encoding for random forest models, so it does not need to be specified. Also, the target variable, `satisfaction`, does not need to be encoded and will be extracted in a later step.

In [15]:
# Convert categorical features to one-hot encoded features.
air_data_subset_dummies = pd.get_dummies(air_data_subset, 
                                         columns=['Customer Type','Type of Travel','Class'])

We need to convert categorical variables to dummy variables because the sklearn implementation of `RandomForestClassifier()` requires categorical features to be encoded as numbers, which can be done with dummy variables (one-hot encoding).
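To make the encoding concrete, here is a small sketch on a made-up DataFrame (the name `toy` and its values are assumptions for illustration). It shows the columns that `pd.get_dummies` produces with the default `drop_first=False`, and with `drop_first=True` for comparison.

```python
import pandas as pd

# Toy data: one categorical column and one numeric column (values invented).
toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Eco"],
    "Age": [65, 47, 15, 60],
})

# Default behaviour: one indicator column per category.
toy_dummies = pd.get_dummies(toy, columns=["Class"])
print(list(toy_dummies.columns))

# drop_first=True removes the first category's column (here, Business).
toy_dummies_dropped = pd.get_dummies(toy, columns=["Class"], drop_first=True)
print(list(toy_dummies_dropped.columns))

# The dtype of the indicator columns depends on the pandas version
# (uint8 in older releases, bool in newer ones).
print(toy_dummies.dtypes)
```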

In [17]:
# Display the first 5 rows.
air_data_subset_dummies.head(5)

Unnamed: 0,satisfaction,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,...,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,satisfied,65,265,0,0,0,2,2,4,2,...,2,0,0.0,1,0,0,1,0,1,0
1,satisfied,47,2464,0,0,0,3,0,2,2,...,2,310,305.0,1,0,0,1,1,0,0
2,satisfied,15,2138,0,0,0,3,2,0,2,...,2,0,0.0,1,0,0,1,0,1,0
3,satisfied,60,623,0,0,0,3,3,4,3,...,3,0,0.0,1,0,0,1,0,1,0
4,satisfied,70,354,0,0,0,3,4,3,4,...,5,0,0.0,1,0,0,1,0,1,0


In [18]:
# Display variables.
air_data_subset_dummies.dtypes

satisfaction                          object
Age                                    int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer           uint8
Customer Type_disloyal Customer        uint8
Type of Travel_Business travel         uint8
Type of Tr

The following changes can be observed. All of the new columns have data type `uint8`:

- Customer Type  -->  Customer Type_Loyal Customer and Customer Type_disloyal Customer
- Type of Travel -->  Type of Travel_Business travel and Type of Travel_Personal Travel
- Class          -->  Class_Business, Class_Eco, and Class_Eco Plus

## Model building

In [19]:
# Separate the dataset into labels (y) and features (X).
y = air_data_subset_dummies["satisfaction"]
X = air_data_subset_dummies.drop("satisfaction", axis=1)

In [20]:
# Separate into train, validate, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 0)
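The two calls above split the data twice with `test_size = 0.25`. The short arithmetic sketch below works out the resulting share of the full dataset in each set; the variable names are illustrative only.

```python
# Fractions passed to the two train_test_split calls.
test_size = 0.25  # first split: held-out test share of all rows
val_size = 0.25   # second split: validation share of the remaining rows

test_frac = test_size
val_frac = (1 - test_size) * val_size
train_frac = (1 - test_size) * (1 - val_size)

print(f"train: {train_frac:.4f}, validate: {val_frac:.4f}, test: {test_frac:.4f}")
# prints "train: 0.5625, validate: 0.1875, test: 0.2500"
```

So roughly 56% of the rows are used for fitting during tuning, 19% for validation, and 25% for the final test.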


### Tune the model

Now, we'll fit and tune a random forest model with a separate validation set. Begin by determining a set of hyperparameters for tuning the model using GridSearchCV.

In [21]:
# Determine set of hyperparameters

cv_params= {'n_estimators': [50,100],
             'max_depth': [10,50],
             'min_samples_leaf' : [0.5,1], 
             'min_samples_split' : [0.001, 0.01],
             'max_features' : ["sqrt"], 
             'max_samples' : [.5,.9]}

In [24]:
# Create a list of split indices
split_index= [0 if x in X_val.index else -1 for x in X_train.index]
custom_split= PredefinedSplit(split_index)
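The sketch below illustrates how `PredefinedSplit` interprets the list: `-1` marks rows that always stay in the training portion, and `0` marks rows that belong to validation fold 0. The index values are invented, and building the list with `Index.isin` and `np.where` is just one alternative to the list comprehension above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical row labels standing in for X_train.index and X_val.index.
train_index = pd.Index([10, 11, 12, 13, 14, 15])
val_index = pd.Index([12, 14])

# 0 where the row is in the validation set, -1 otherwise.
test_fold = np.where(train_index.isin(val_index), 0, -1)
print(test_fold)

ps = PredefinedSplit(test_fold)
print(ps.get_n_splits())

# Each split yields positional indices for the training and validation rows.
splits = list(ps.split())
train_pos, val_pos = splits[0]
print(train_pos, val_pos)
```

Because there is only one fold label (`0`), the grid search fits each candidate once, which matches the "Fitting 1 folds" message in the output.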

In [28]:
# Instantiate the model
rf= RandomForestClassifier(random_state= 0)

**Next, use GridSearchCV to search over the specified parameters.**

In [29]:
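# Search over specified parameters.
# Note: no `scoring` argument is passed, so candidates are ranked by the
# estimator's default score (accuracy); refit='f1' only acts as a truthy flag here.
rf_val= GridSearchCV(rf, cv_params, cv= custom_split, refit= 'f1', n_jobs= -1, verbose=1)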
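If the intent is to select the model by F1 score, one option is to pass an explicit `scoring` argument. The sketch below is an assumption about how that could look, not the lab's own code: it uses synthetic data from `make_classification`, a deliberately tiny grid, and a scorer built with `make_scorer` so that the string label `'satisfied'` is treated as the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with string labels, mirroring the target's format.
X_toy, y_num = make_classification(n_samples=200, n_features=8, random_state=0)
y_toy = np.where(y_num == 1, "satisfied", "dissatisfied")

# F1 scorer that knows which string label is the positive class.
f1_satisfied = make_scorer(f1_score, pos_label="satisfied")

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    scoring={"f1": f1_satisfied, "accuracy": "accuracy"},
    refit="f1",  # with multiple metrics, refit names the one used for selection
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.cv_results_["mean_test_f1"])
```

With a dictionary of scorers, `cv_results_` holds one set of columns per metric (for example `mean_test_f1` and `mean_test_accuracy`), so both can be compared after the search.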

Now, we'll fit the model.

In [30]:
%%time

# Fit the model.
rf_val.fit(X_train, y_train)

Fitting 1 folds for each of 32 candidates, totalling 32 fits
Wall time: 2min 10s


GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1, -1])),
             estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [10, 50], 'max_features': ['sqrt'],
                         'max_samples': [0.5, 0.9],
                         'min_samples_leaf': [0.5, 1],
                         'min_samples_split': [0.001, 0.01],
                         'n_estimators': [50, 100]},
             refit='f1', verbose=1)

Finally, obtain the optimal parameters.

In [31]:
# Obtain optimal parameters.
rf_val.best_params_

{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 50}

## Results and evaluation
Use the selected model to predict on the test data. Use the optimal parameters found via GridSearchCV.

In [35]:
# Use the optimal parameters found by GridSearchCV
rf_opt= RandomForestClassifier(n_estimators= 50, max_depth =50,
                              min_samples_leaf=1, min_samples_split= 0.001,
                              max_features= 'sqrt', max_samples= 0.9, random_state=0)
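As an alternative to retyping each value, the parameter dictionary can be unpacked directly into the constructor. This is only a sketch: `best_params` below is copied by hand from the output above (in the notebook it would be `rf_val.best_params_`), and `rf_opt_sketch` is an illustrative name.

```python
from sklearn.ensemble import RandomForestClassifier

# Values copied from the best_params_ output shown earlier.
best_params = {
    "max_depth": 50,
    "max_features": "sqrt",
    "max_samples": 0.9,
    "min_samples_leaf": 1,
    "min_samples_split": 0.001,
    "n_estimators": 50,
}

# ** unpacks the dictionary into keyword arguments.
rf_opt_sketch = RandomForestClassifier(**best_params, random_state=0)
print(rf_opt_sketch.get_params()["max_depth"])
```

Unpacking avoids transcription mistakes if the grid or the selected values change later.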

Once again, fit the optimal model.

In [36]:
# Fit the model
rf_opt.fit(X_train, y_train)

RandomForestClassifier(max_depth=50, max_features='sqrt', max_samples=0.9,
                       min_samples_split=0.001, n_estimators=50,
                       random_state=0)

In [37]:
# Predict on test set
y_pred= rf_opt.predict(X_test)

## Obtain performance score

In [40]:
# Get precision score

pc_test= precision_score(y_test, y_pred, pos_label= 'satisfied')
print(pc_test)
print("The precision score is {pc:.3f}".format(pc = pc_test))

0.9501276595744681
The precision score is 0.950


In [43]:
# Get recall score

rc_test = recall_score(y_test, y_pred, pos_label = "satisfied")
print(rc_test)
print("The recall score is {rc:.3f}".format(rc = rc_test))

0.9445008460236887
The recall score is 0.945


In [48]:
# Get accuracy score
ac_test= accuracy_score(y_test, y_pred)
print(ac_test)
print('The accuracy score is {ac:.3f}'.format(ac= ac_test))

0.9424502656616829
The accuracy score is 0.942


In [49]:
# Get F1 score.
f1_test = f1_score(y_test, y_pred, pos_label = "satisfied")
print(f1_test)
print("The F1 score is {f1:.3f}".format(f1 = f1_test))

0.9473058973271109
The F1 score is 0.947


Pros of performing model selection using test data instead of a separate validation dataset: <br />
* The coding workload is reduced.
* The scripts for data splitting are shorter.
* It's only necessary to evaluate test dataset performance once, instead of two evaluations (validate and test).

Cons of performing model selection using test data instead of a separate validation dataset: <br />
* If a model is evaluated using samples that were also used to build or fine-tune that model, the evaluation will likely be biased.
* Hyperparameters chosen to maximize scores on the test data can overfit to that data, so the test scores may overstate performance on new data.



- Accuracy ((TP + TN) / (TP + FP + FN + TN)): The ratio of correctly predicted observations to total observations.

- Precision (TP / (TP + FP)): The ratio of correctly predicted positive observations to total predicted positive observations.

- Recall (sensitivity, TP / (TP + FN)): The ratio of correctly predicted positive observations to all observations in the actual positive class.

- F1 score (2 × precision × recall / (precision + recall)): The harmonic mean of precision and recall, which takes into account both false positives and false negatives.
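The definitions above can be checked with a few lines of arithmetic. In this sketch the confusion-matrix counts are made up for illustration; the last part plugs in the precision and recall values printed earlier to confirm that their harmonic mean reproduces the reported F1 score.

```python
# Hypothetical confusion-matrix counts (not taken from the lab's data).
tp, fp, fn, tn = 90, 5, 10, 95

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"{accuracy:.3f} {precision:.3f} {recall:.3f} {f1:.3f}")

# Harmonic mean of the precision and recall reported for the test set.
pc_reported = 0.9501276595744681
rc_reported = 0.9445008460236887
f1_from_reported = 2 * pc_reported * rc_reported / (pc_reported + rc_reported)
print(f"{f1_from_reported:.3f}")  # prints "0.947"
```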

## Results and evaluation

**In light of the cells above:**

The precision score is 0.950 for the test set,
which means that of all positive predictions, 95.0% are true positives.

The recall score is 0.945 for the test set,
which means that of all actual positive cases in the test set, 94.5% are predicted positive.

The accuracy score is 0.942 for the test set,
which means that 94.2% of all cases in the test set are classified correctly (as true positives or true negatives).

The F1 score is 0.947 for the test set,
which means the harmonic mean of precision and recall is 94.7%.

**NOTE:**

The tuned random forest has higher scores overall, so it is the better model compared to the models built in previous labs. Particularly, it shows a better F1 score than the decision tree model, which indicates that the random forest model may do better at classification when taking into account false positives and false negatives. 



## Conclusion
* The random forest model predicted satisfaction with more than 94.2% accuracy. The precision is over 95% and the recall is approximately 94.5%. 
* The random forest model outperformed the tuned decision tree with the best hyperparameters in most of the four scores. This indicates that the random forest model may perform better.
* Because stakeholders were interested in learning about the factors that are most important to customer satisfaction, those findings would be drawn from the tuned random forest and shared with them.
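The lab does not show how those factors would be extracted. One possible approach, sketched here under stated assumptions, is the `feature_importances_` attribute of a fitted `RandomForestClassifier`. The data below is synthetic and the feature names are hypothetical; on the real model, `rf_opt` and `X_train.columns` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(5)]
X_toy = pd.DataFrame(X_arr, columns=feature_names)

rf_toy = RandomForestClassifier(n_estimators=50, random_state=0)
rf_toy.fit(X_toy, y_toy)

# Impurity-based importances, one per feature, sorted from most to least important.
importances = pd.Series(
    rf_toy.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 and describe how the forest used each feature, so they are best read as a ranking rather than as causal effects.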