# Random forest model

## **Introduction**


In this notebook, you will train, tune, and evaluate a random forest model using a spreadsheet of survey responses from 129,880 customers. The data includes fields such as class, flight distance, and inflight entertainment. Your random forest model will be used to predict whether a customer will be satisfied with their flight experience.

**Note:** Because this lab uses a real dataset, this notebook first requires exploratory data analysis, data cleaning, and other manipulations to prepare it for modeling.

In [1]:
# Imports
import numpy as np
import pandas as pd

import pickle as pkl

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

As shown in the next cell, the dataset is loaded automatically. You do not need to download the .csv file or write additional code to access it. Continue with this activity by completing the following steps.

In [2]:
#import data 
air_data = pd.read_csv("Invistico_Airline.csv")

## Data cleaning

In [3]:
air_data.head()

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0


In [8]:
air_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 22 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   satisfaction                       129880 non-null  object 
 1   Customer Type                      129880 non-null  object 
 2   Age                                129880 non-null  int64  
 3   Type of Travel                     129880 non-null  object 
 4   Class                              129880 non-null  object 
 5   Flight Distance                    129880 non-null  int64  
 6   Seat comfort                       129880 non-null  int64  
 7   Departure/Arrival time convenient  129880 non-null  int64  
 8   Food and drink                     129880 non-null  int64  
 9   Gate location                      129880 non-null  int64  
 10  Inflight wifi service              129880 non-null  int64  
 11  Inflight entertainment             1298

In [9]:
air_data.shape

(129880, 22)

In [10]:
air_data.isna().any(axis=1).sum()

393

In [11]:
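# Drop columns that contain missing values.
# Note: axis=1 drops columns, not rows; here it removes "Arrival Delay in Minutes".
air_data_subset = air_data.dropna(axis=1)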
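
The `axis` argument decides whether `dropna` removes rows or columns, which matters here because 393 rows contain at least one missing value. The following is a minimal sketch on a toy frame; the `toy` data and its column names are made up for illustration and are not taken from the airline dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only "arrival_delay" has a missing value.
toy = pd.DataFrame({
    "age": [65, 47, 15],
    "departure_delay": [0, 310, 0],
    "arrival_delay": [0.0, np.nan, 0.0],
})

rows_with_na = toy.isna().any(axis=1).sum()  # rows with at least one NaN
dropped_rows = toy.dropna(axis=0)            # removes the incomplete row
dropped_cols = toy.dropna(axis=1)            # removes the incomplete column

print(rows_with_na)        # 1
print(dropped_rows.shape)  # (2, 3)
print(dropped_cols.shape)  # (3, 2)
```

In this notebook `axis=1` is used, so the one column with missing values is removed and all 129,880 rows are kept. Using `axis=0` would instead keep every column and drop the 393 incomplete rows.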


In [12]:
air_data_subset.head()

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,4,2,3,3,0,3,5,3,2,0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,2,3,4,4,4,2,3,2,310
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,0,2,2,3,3,4,4,4,2,0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,4,3,1,1,0,1,4,1,3,0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,3,4,2,2,0,2,4,2,5,0


In [13]:
air_data_subset.isna().sum()

satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
dtype: int64

In [14]:
# One-hot encode the categorical feature columns.
air_data_subset_dummies = pd.get_dummies(air_data_subset, columns=['Customer Type', 'Type of Travel', 'Class'])
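
As a small illustration of what `pd.get_dummies` does to a categorical column, here is a sketch on a toy frame. The `toy` data is hypothetical. Passing `dtype=int` is an assumption added here so that the indicator columns are 0/1 integers regardless of the default dtype of the installed pandas version.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "Age": [65, 47, 15],
    "Class": ["Eco", "Business", "Eco Plus"],
})

# Each category of "Class" becomes its own indicator column named "Class_<value>".
encoded = pd.get_dummies(toy, columns=["Class"], dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Class_Business', 'Class_Eco', 'Class_Eco Plus']
print(encoded["Class_Eco"].tolist())
# [1, 0, 0]
```

Columns not listed in `columns` (such as the `satisfaction` target in this notebook) are left unchanged.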

In [15]:
air_data_subset_dummies.head()

Unnamed: 0,satisfaction,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,...,Cleanliness,Online boarding,Departure Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel,Class_Business,Class_Eco,Class_Eco Plus
0,satisfied,65,265,0,0,0,2,2,4,2,...,3,2,0,1,0,0,1,0,1,0
1,satisfied,47,2464,0,0,0,3,0,2,2,...,3,2,310,1,0,0,1,1,0,0
2,satisfied,15,2138,0,0,0,3,2,0,2,...,4,2,0,1,0,0,1,0,1,0
3,satisfied,60,623,0,0,0,3,3,4,3,...,1,3,0,1,0,0,1,0,1,0
4,satisfied,70,354,0,0,0,3,4,3,4,...,2,5,0,1,0,0,1,0,1,0


In [17]:
air_data_subset_dummies.dtypes

satisfaction                         object
Age                                   int64
Flight Distance                       int64
Seat comfort                          int64
Departure/Arrival time convenient     int64
Food and drink                        int64
Gate location                         int64
Inflight wifi service                 int64
Inflight entertainment                int64
Online support                        int64
Ease of Online booking                int64
On-board service                      int64
Leg room service                      int64
Baggage handling                      int64
Checkin service                       int64
Cleanliness                           int64
Online boarding                       int64
Departure Delay in Minutes            int64
Customer Type_Loyal Customer          uint8
Customer Type_disloyal Customer       uint8
Type of Travel_Business travel        uint8
Type of Travel_Personal Travel        uint8
Class_Business                  

## Model building

In [20]:
# Separate the target (y) from the features (x).
y = air_data_subset_dummies["satisfaction"]
x = air_data_subset_dummies.drop("satisfaction", axis=1)

In [21]:
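# Hold out 25% of the data for testing, then 25% of the remaining training data for validation.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.25, random_state=0)
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=.25, random_state=0)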
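
Two successive 25% splits leave 56.25% of the data for training, 18.75% for validation, and 25% for testing. The sketch below checks this arithmetic on stand-in data. `X_toy` and `y_toy` are hypothetical arrays, sized at 1,600 rows so that every split size is a whole number.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,600 rows with a binary label.
X_toy = np.arange(1600).reshape(-1, 1)
y_toy = np.arange(1600) % 2

# Same two-step split as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

n = len(X_toy)
print(len(X_tr) / n, len(X_val) / n, len(X_test) / n)
# 0.5625 0.1875 0.25
```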

### Hyperparameter tuning of the model

In [22]:
cv_parameters = {'n_estimators' : [50,100], 
              'max_depth' : [10,50],        
              'min_samples_leaf' : [0.5,1], 
              'min_samples_split' : [0.001, 0.01],
              'max_features' : ["sqrt"], 
              'max_samples' : [.5,.9]}
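
In scikit-learn, an integer for `min_samples_leaf` or `min_samples_split` is an absolute sample count, while a float is a fraction of the training samples. A float for `max_samples` is the fraction of rows drawn for each tree. As a quick check of how many candidates this grid produces, the sketch below uses `ParameterGrid`, which is not part of the original notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell.
cv_parameters = {'n_estimators': [50, 100],
                 'max_depth': [10, 50],
                 'min_samples_leaf': [0.5, 1],
                 'min_samples_split': [0.001, 0.01],
                 'max_features': ["sqrt"],
                 'max_samples': [.5, .9]}

# The number of candidates is the product of the list lengths: 2*2*2*2*1*2.
n_candidates = len(ParameterGrid(cv_parameters))
print(n_candidates)  # 32
```

This matches the 32 candidates reported when the grid search is fitted.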

In [24]:
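# Mark validation rows with 0 and training rows with -1 so PredefinedSplit uses one fixed fold.
val_index = set(x_val.index)
split_index = [0 if idx in val_index else -1 for idx in x_train.index]
custom_split = PredefinedSplit(split_index)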
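
To show how `PredefinedSplit` reads the fold labels, here is a minimal sketch on a hypothetical five-element `test_fold` array. Entries of `-1` always stay in the training part, and entries of `0` form the single validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels: positions 2 and 4 are the validation rows.
test_fold = np.array([-1, -1, 0, -1, 0])
split = PredefinedSplit(test_fold)

n_splits = split.get_n_splits()
folds = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in split.split()]

print(n_splits)  # 1
print(folds)     # [([0, 1, 3], [2, 4])]
```

Because there is only one fold, the grid search fits each candidate once rather than once per cross-validation fold.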

In [25]:
# Instantiate model.
rf = RandomForestClassifier(random_state=0)

In [26]:
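# Search over specified parameters.
# Note: no `scoring` is passed, so candidates are ranked by the classifier's default
# score (accuracy); refit='f1' only acts as a truthy refit flag here.
rf_val = GridSearchCV(rf, cv_parameters, cv=custom_split, refit='f1', n_jobs=1, verbose=1)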
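
If the intent is to select hyperparameters by F1 score, one option is to pass an explicit `scoring` argument. The sketch below is an assumption about how that could look and is not the configuration used for the results in this notebook. It uses a hypothetical synthetic dataset (`X_toy`, `y_toy`) and a deliberately tiny grid. Because the labels are strings, the F1 scorer is built with `make_scorer` and `pos_label="satisfied"`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data with string labels like those in the notebook.
X_toy, y_num = make_classification(n_samples=200, n_features=6, random_state=0)
y_toy = np.where(y_num == 1, "satisfied", "dissatisfied")

# With several scorers, `refit` names the one used to pick the best candidate.
scoring = {
    "accuracy": "accuracy",
    "f1": make_scorer(f1_score, pos_label="satisfied"),
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 10], "max_depth": [2, 4]},
    scoring=scoring,
    refit="f1",
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.cv_results_["mean_test_f1"])
```

With this setup, `cv_results_` contains one set of columns per scorer (for example `mean_test_f1` and `mean_test_accuracy`), and `best_params_` refers to the candidate ranked first on F1.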

In [28]:
%%time

rf_val.fit(x_train, y_train)

Fitting 1 folds for each of 32 candidates, totalling 32 fits
CPU times: total: 1min 27s
Wall time: 1min 42s



In [29]:
rf_val.best_params_

{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 100}
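
One way to avoid retyping these values is to unpack the dictionary straight into the classifier. The sketch below copies the values from the output above into a hypothetical `best_params` variable so that it runs on its own; in the notebook, `rf_val.best_params_` could be unpacked the same way.

```python
from sklearn.ensemble import RandomForestClassifier

# Values copied from the best_params_ output above.
best_params = {'max_depth': 50,
               'max_features': 'sqrt',
               'max_samples': 0.9,
               'min_samples_leaf': 1,
               'min_samples_split': 0.001,
               'n_estimators': 100}

# Unpacking keeps the model in sync with the search result.
rf_from_search = RandomForestClassifier(**best_params, random_state=0)
print(rf_from_search.get_params()["n_estimators"])  # 100
```

Since the search was run with refitting enabled, `rf_val.best_estimator_` should also already hold a model with these settings, refit on the data passed to `fit`.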

In [30]:
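# Build the final model from the tuned hyperparameters.
# Note: best_params_ above reports n_estimators=100, but this cell uses 50.
rf_opt = RandomForestClassifier(n_estimators = 50, max_depth = 50, 
                                min_samples_leaf = 1, min_samples_split = 0.001,
                                max_features="sqrt", max_samples = 0.9, random_state = 0)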

In [31]:
# Fitting the optimal model.
rf_opt.fit(x_train,y_train)

Predict on the test set using the optimal model.

In [32]:
y_pred = rf_opt.predict(x_test)

In [34]:
# Getting precision score.
prec_test = precision_score(y_test, y_pred, pos_label = "satisfied")
print("The precision score is {pc:.3f}".format(pc = prec_test))

The precision score is 0.950


In [35]:
# Get recall score.
rcl_test = recall_score(y_test, y_pred, pos_label = "satisfied")
print("The recall score is {rc:.3f}".format(rc = rcl_test))

The recall score is 0.945


In [36]:
# Get accuracy score.
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac = ac_test))

The accuracy score is 0.943


In [37]:
# Get F1 score.
f1_test = f1_score(y_test, y_pred, pos_label = "satisfied")
print("The F1 score is {f1:.3f}".format(f1 = f1_test))

The F1 score is 0.948
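
All four metrics can be derived from the counts in a confusion matrix. The sketch below works through the formulas on ten hypothetical predictions, with "satisfied" as the positive class; the toy labels are invented and unrelated to the test-set results above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 5 actual "satisfied" and 5 actual "dissatisfied" customers.
y_true = ["satisfied"] * 5 + ["dissatisfied"] * 5
y_pred = (["satisfied"] * 4 + ["dissatisfied"]          # 4 true positives, 1 false negative
          + ["satisfied"] * 2 + ["dissatisfied"] * 3)   # 2 false positives, 3 true negatives

# With labels ordered (negative, positive), ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(
    y_true, y_pred, labels=["dissatisfied", "satisfied"]
).ravel()

precision = tp / (tp + fp)                   # share of positive predictions that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(round(precision, 3), round(recall, 3), round(accuracy, 3), round(f1, 3))
# 0.667 0.8 0.7 0.727
```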


### Evaluation of the model

In [38]:
print("\nThe precision score is: {pc:.3f}".format(pc = prec_test), "for the test set,", "\nwhich means that of all positive predictions,", "{pc_pct:.1f}% are true positives.".format(pc_pct = prec_test * 100))


The precision score is: 0.950 for the test set, 
which means that of all positive predictions, 95.0% are true positives.


In [44]:
print("\nThe recall score is: {rc:.3f}".format(rc = rcl_test), "for the test set,", "\nwhich means that of all actual positive cases in the test set,", "{rc_pct:.1f}% are predicted positive.".format(rc_pct = rcl_test * 100))


The recall score is: 0.945 for the test set, 
which means that of all actual positive cases in the test set, 94.5% are predicted positive.


In [45]:
print("\nThe accuracy score is: {ac:.3f}".format(ac = ac_test), "for the test set,", "\nwhich means that of all cases in the test set,", "{ac_pct:.1f}% are classified correctly (true positives or true negatives).".format(ac_pct = ac_test * 100))


The accuracy score is: 0.943 for the test set, 
which means that of all cases in the test set, 94.3% are classified correctly (true positives or true negatives).


In [46]:
print("\nThe F1 score is: {f1:.3f}".format(f1 = f1_test), "for the test set,", "\nwhich means the harmonic mean of precision and recall is {f1_pct:.1f}%.".format(f1_pct = f1_test * 100))


The F1 score is: 0.948 for the test set, 
which means the harmonic mean of precision and recall is 94.8%.


In [48]:
table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
                      'F1': [0.945422, f1_test],
                      'Recall': [0.935863, rcl_test],
                      'Precision': [0.955197, prec_test],
                      'Accuracy': [0.940864, ac_test]})

The decision tree scores in this table come from a model built in a separate notebook.

In [49]:
print(table)

                 Model        F1    Recall  Precision  Accuracy
0  Tuned Decision Tree  0.945422  0.935863   0.955197  0.940864
1  Tuned Random Forest  0.947809  0.945154   0.950479  0.942778
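
To make the comparison easier to read, the per-metric difference between the two models can be computed directly. This sketch re-enters the values printed in the table above into a hypothetical `scores` frame and subtracts the decision tree row from the random forest row.

```python
import pandas as pd

# Values copied from the printed comparison table.
scores = pd.DataFrame(
    {"F1": [0.945422, 0.947809],
     "Recall": [0.935863, 0.945154],
     "Precision": [0.955197, 0.950479],
     "Accuracy": [0.940864, 0.942778]},
    index=["Tuned Decision Tree", "Tuned Random Forest"],
)

# Positive values mean the random forest scored higher on that metric.
delta = scores.loc["Tuned Random Forest"] - scores.loc["Tuned Decision Tree"]
print(delta.round(6))
```

On these numbers the random forest is higher on F1, recall, and accuracy (by roughly 0.002, 0.009, and 0.002) and slightly lower on precision (by roughly 0.005).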
