# Activity: Build a random forest model

## **Introduction**


As you're learning, random forests are popular statistical learning algorithms. Because they average the predictions of many decision trees, their primary benefits include lower variance and a reduced chance of overfitting compared with a single tree.

This activity is a continuation of the project you began modeling with decision trees for an airline. Here, you will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. It includes data points such as class, flight distance, and inflight entertainment. Your random forest model will be used to predict whether a customer will be satisfied with their flight experience.

**Note:** Because this lab uses a real dataset, this notebook first requires exploratory data analysis, data cleaning, and other manipulations to prepare it for modeling.

## **Step 1: Imports** 


Import relevant Python libraries and modules, including the `numpy` and `pandas` libraries for data processing; the `pickle` package to save the model; and the `sklearn` library, containing:
- The module `ensemble`, which has the class `RandomForestClassifier`
- The module `model_selection`, which has the function `train_test_split` and the classes `PredefinedSplit` and `GridSearchCV`
- The module `metrics`, which has the functions `f1_score`, `precision_score`, `recall_score`, and `accuracy_score`


In [2]:
# Import `numpy`, `pandas`, `pickle`, and `sklearn`.
# Import the relevant functions from `sklearn.ensemble`, `sklearn.model_selection`, and `sklearn.metrics`.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import PredefinedSplit
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

import pickle
### YOUR CODE HERE ###
 

As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [3]:
# RUN THIS CELL TO IMPORT YOUR DATA. 

### YOUR CODE HERE ###

air_data = pd.read_csv("Invistico_Airline.csv")

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

The `read_csv()` function from the `pandas` library can be helpful here.
 
</details>

Now, you're ready to begin cleaning your data. 

## **Step 2: Data cleaning** 

To get a sense of the data, display the first 10 rows.

In [4]:
# Display first 10 rows.
air_data.head(n=10)
### YOUR CODE HERE ###


Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

The `head()` function from the `pandas` library can be helpful here.
 
</details>

Now, display the variable names and their data types. 

In [5]:
# Display variable names and types.
air_data.dtypes
### YOUR CODE HERE ###


satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames have an attribute that outputs variable names and data types in one result.
 
</details>

**Question:** What do you observe about the differences in data types among the variables included in the data?

[The variables have different data types: `int64`, `float64`, and `object`. Because the model can only work with numeric inputs, the `object` columns will have to be converted to numeric values.]

Next, to understand the size of the dataset, identify the number of rows and the number of columns.

In [6]:
# Identify the number of rows and the number of columns.
air_data.shape
### YOUR CODE HERE ###


(129880, 22)

<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

DataFrames have an attribute that outputs the number of rows and the number of columns in one result.

</details>

Now, check for missing values in the rows of the data. Start with `.isna()` to get Booleans indicating whether each value in the data is missing. Then, use `.any(axis=1)` to get Booleans indicating whether there are any missing values along the columns in each row. Finally, use `.sum()` to get the number of rows that contain missing values.

In [7]:
# Get Booleans to find missing values in data.
# Get Booleans to find missing values along columns.
# Get the number of rows that contain missing values.

air_data.isna().any(axis=1).sum()  # axis=1 checks across the columns of each row
# air_data.isna().sum()  # alternative: count of missing values per column


### YOUR CODE HERE ###


393

**Question:** How many rows of data are missing values?

[393 rows of data have missing values.]
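The `.isna().any(axis=1).sum()` chain can be easier to follow on a tiny example. The sketch below uses a hypothetical four-row DataFrame (the column names are borrowed from the airline data, the values are made up) and keeps each intermediate result in its own variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the airline dataset.
toy = pd.DataFrame({
    "Age": [25, 40, np.nan, 31],
    "Arrival Delay in Minutes": [0.0, np.nan, np.nan, 12.0],
})

missing_mask = toy.isna()                      # one Boolean per cell
rows_with_missing = missing_mask.any(axis=1)   # one Boolean per row
n_rows_missing = int(rows_with_missing.sum())  # True counts as 1

# For comparison: total number of missing cells, which can exceed the row count.
n_cells_missing = int(missing_mask.sum().sum())

print(n_rows_missing, n_cells_missing)
```

The row count and the cell count differ here because one row is missing both values, which is why the activity counts rows rather than cells.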

Drop the rows with missing values. This is an important step in data cleaning, as it makes the data more useful for analysis and modeling. Then, save the resulting pandas DataFrame in a variable named `air_data_subset`.

In [8]:
# Drop missing values.
# Save the DataFrame in variable `air_data_subset`.

air_data_subset = air_data.dropna(axis = 0)
### YOUR CODE HERE ###


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

The `dropna()` function is helpful here.
</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

The axis parameter passed in to this function should be set to 0 (if you want to drop rows containing missing values) or 1 (if you want to drop columns containing missing values).
</details>
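To make the effect of the `axis` parameter concrete, here is a minimal sketch on a hypothetical toy DataFrame (made-up values, not the airline data). It compares dropping rows with dropping columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `Class` is complete, the other two columns have gaps.
toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco", "Eco Plus"],
    "Age": [25, 40, np.nan, 31],
    "Arrival Delay in Minutes": [0.0, np.nan, np.nan, 12.0],
})

rows_dropped = toy.dropna(axis=0)  # removes every row that has a missing value
cols_dropped = toy.dropna(axis=1)  # removes every column that has a missing value

print(rows_dropped.shape)
print(cols_dropped.shape)
```

Dropping columns discards entire features, so with only a small share of incomplete rows (393 out of 129,880 in this lab) dropping rows is usually the less costly choice.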

Next, display the first 10 rows to examine the data subset.

In [9]:
# Display the first 10 rows.

air_data_subset.head(n=10)
### YOUR CODE HERE ###


Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
5,satisfied,Loyal Customer,30,Personal Travel,Eco,1894,0,0,0,3,...,2,2,5,4,5,5,4,2,0,0.0
6,satisfied,Loyal Customer,66,Personal Travel,Eco,227,0,0,0,3,...,5,5,5,0,5,5,5,3,17,15.0
7,satisfied,Loyal Customer,10,Personal Travel,Eco,1812,0,0,0,3,...,2,2,3,3,4,5,4,2,0,0.0
8,satisfied,Loyal Customer,56,Personal Travel,Business,73,0,0,0,3,...,5,4,4,0,1,5,4,4,0,0.0
9,satisfied,Loyal Customer,22,Personal Travel,Eco,1556,0,0,0,3,...,2,2,2,4,5,3,4,2,30,26.0


Confirm that it does not contain any missing values.

In [10]:
# Count of missing values.
air_data_subset.isna().sum()
### YOUR CODE HERE ###


satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can use the `.isna().sum()` to get the number of missing values for each variable.

</details>

Next, convert the categorical features to indicator (one-hot encoded) features. 

**Note:** The `drop_first` argument can be kept as default (`False`) during one-hot encoding for random forest models, so it does not need to be specified. Also, the target variable, `satisfaction`, does not need to be encoded and will be extracted in a later step.

In [11]:
# Encode the target as 0/1 so that `pos_label=1` can be used with the metric functions later.
air_data_subset['satisfaction'] = air_data_subset['satisfaction'].map({'satisfied': 1, 'dissatisfied': 0})
# `Class` is ordinal, so map it to ordered integers instead of one-hot encoding it.
air_data_subset['Class'] = air_data_subset['Class'].map({'Business': 3, 'Eco Plus': 2, 'Eco': 1})
# Convert the remaining categorical features to one-hot encoded features.
air_data_subset = pd.get_dummies(air_data_subset, columns = ['Customer Type', 'Type of Travel'], drop_first=False)
air_data_subset
### YOUR CODE HERE ###


Unnamed: 0,satisfaction,Age,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,...,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel
0,1,65,1,265,0,0,0,2,2,4,...,3,5,3,2,0,0.0,1,0,0,1
1,1,47,3,2464,0,0,0,3,0,2,...,4,2,3,2,310,305.0,1,0,0,1
2,1,15,1,2138,0,0,0,3,2,0,...,4,4,4,2,0,0.0,1,0,0,1
3,1,60,1,623,0,0,0,3,3,4,...,1,4,1,3,0,0.0,1,0,0,1
4,1,70,1,354,0,0,0,3,4,3,...,2,4,2,5,0,0.0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,1,29,1,1731,5,5,5,3,2,5,...,4,4,4,2,0,0.0,0,1,0,1
129876,0,63,3,2087,2,3,2,4,2,1,...,3,1,2,1,174,172.0,0,1,0,1
129877,0,69,1,2320,3,0,3,3,3,2,...,4,2,3,2,155,163.0,0,1,0,1
129878,0,66,1,2450,3,2,3,2,3,2,...,3,2,1,2,193,205.0,0,1,0,1


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can use the `pd.get_dummies()` function to convert categorical variables to one-hot encoded variables.
</details>

**Question:** Why is it necessary to convert categorical data into dummy variables?

[It is important to convert categorical data into dummy variables because machine learning models work with numbers rather than strings. To train a model such as a random forest in scikit-learn, the categorical data has to be converted into numerical data first.]
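As a small illustration of what one-hot encoding produces, the sketch below runs `pd.get_dummies()` on a hypothetical three-row DataFrame (made-up values). Each category becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "Age": [65, 47, 29],
    "Customer Type": ["Loyal Customer", "Loyal Customer", "disloyal Customer"],
})

encoded = pd.get_dummies(toy, columns=["Customer Type"])
print(list(encoded.columns))

# Indicator columns may be bool or uint8 depending on the pandas version,
# so cast to int before summing across each row.
indicator_cols = [c for c in encoded.columns if c.startswith("Customer Type_")]
row_sums = encoded[indicator_cols].astype(int).sum(axis=1)
print(row_sums.tolist())
```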

Next, display the first 10 rows to review the encoded data (stored in `air_data_subset` in this notebook).

In [12]:
# Display the first 10 rows.
air_data_subset.head(n=10)
### YOUR CODE HERE ###


Unnamed: 0,satisfaction,Age,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,...,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel
0,1,65,1,265,0,0,0,2,2,4,...,3,5,3,2,0,0.0,1,0,0,1
1,1,47,3,2464,0,0,0,3,0,2,...,4,2,3,2,310,305.0,1,0,0,1
2,1,15,1,2138,0,0,0,3,2,0,...,4,4,4,2,0,0.0,1,0,0,1
3,1,60,1,623,0,0,0,3,3,4,...,1,4,1,3,0,0.0,1,0,0,1
4,1,70,1,354,0,0,0,3,4,3,...,2,4,2,5,0,0.0,1,0,0,1
5,1,30,1,1894,0,0,0,3,2,0,...,5,5,4,2,0,0.0,1,0,0,1
6,1,66,1,227,0,0,0,3,2,5,...,5,5,5,3,17,15.0,1,0,0,1
7,1,10,1,1812,0,0,0,3,2,0,...,4,5,4,2,0,0.0,1,0,0,1
8,1,56,3,73,0,0,0,3,5,3,...,1,5,4,4,0,0.0,1,0,0,1
9,1,22,1,1556,0,0,0,3,2,0,...,5,3,4,2,30,26.0,1,0,0,1


Then, check the variables of the encoded `air_data_subset` DataFrame.

In [13]:
# Display variables.

#air_data_subset.columns
air_data_subset.dtypes
### YOUR CODE HERE ###


satisfaction                           int64
Age                                    int64
Class                                  int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer           uint8
Customer Type_disloyal Customer        uint8
Type of Tr

**Question:** What changes do you observe after converting the string data to dummy variables?

[The number of columns increased from 22 to 24, because each of the two one-hot encoded columns (`Customer Type` and `Type of Travel`) was replaced by two indicator columns. All of the categorical variables now hold numerical values.]

## **Step 3: Model building** 

The first step to building your model is separating the labels (y) from the features (X).

In [14]:
# Separate the dataset into labels (y) and features (X).
Y = air_data_subset['satisfaction']
X = air_data_subset.drop('satisfaction', axis=1)
### YOUR CODE HERE ###


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Save the labels (the values in the `satisfaction` column) as `y`.

Save the features as `X`. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

To obtain the features, drop the `satisfaction` column from the DataFrame.

</details>

Once separated, split the data into train, validate, and test sets. 

In [15]:
# Separate into train, validate, test sets.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, stratify=Y, random_state = 0)
X_tr, X_val, Y_tr, Y_val = train_test_split(X_train, Y_train, test_size=0.2, stratify=Y_train, random_state=0)
### YOUR CODE HERE ###


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `train_test_split()` function twice to create train/validate/test sets, passing in `random_state` for reproducible results. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Split `X`, `y` to get `X_train`, `X_test`, `y_train`, `y_test`. Set the `test_size` argument to the proportion of data points you want to select for testing. 

Split `X_train`, `y_train` to get `X_tr`, `X_val`, `y_tr`, `y_val`. Set the `test_size` argument to the proportion of data points you want to select for validation. 

</details>
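Because the second split is taken from the training portion only, the final proportions are not the same as the two `test_size` values. A minimal sketch on 1,000 synthetic rows (made-up data, using the same `test_size` values as the cell above) shows the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows with perfectly balanced labels.
X = np.arange(1000).reshape(-1, 1)
y = np.tile([0, 1], 500)

# First split: hold out 25% of all rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Second split: hold out 20% of the remaining 75% for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)

for name, part in [("train", X_tr), ("validate", X_val), ("test", X_test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(X):.0%})")
```

With these settings the data ends up split roughly 60/15/25 between train, validate, and test.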

### Tune the model

Now, fit and tune a random forest model with a separate validation set. Begin by determining a set of hyperparameters for tuning the model using `GridSearchCV`.


In [16]:
# Determine set of hyperparameters.

rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [3, 7, 50],
             'min_samples_leaf': [2, 4],
             'min_samples_split': [0.001, 0.01],
             'max_features': ['sqrt'],
             'n_estimators': [50, 75],
             'max_samples': [.5, .9]
            }

# Metrics to compute for every candidate during the grid search.
scoring = ['precision', 'accuracy', 'recall', 'f1']

# The grid search will use a separate validation set rather than k-fold cross-validation.
# Instead of passing an integer to `cv`, a later cell builds a `PredefinedSplit` from a list
# of split indices, which fixes a single validation set in place of rotating folds.

### YOUR CODE HERE ###


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Create a dictionary `cv_params` that maps each hyperparameter name to a list of values. The GridSearch you conduct will set the hyperparameter to each possible value, as specified, and determine which value is optimal.

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

The main hyperparameters here include `'n_estimators', 'max_depth', 'min_samples_leaf', 'min_samples_split', 'max_features', and 'max_samples'`. These will be the keys in the dictionary `cv_params`.

</details>
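The size of the grid is the product of the list lengths in `cv_params`. The sketch below enumerates the combinations with the standard library only. It is an illustration of the counting, not how `GridSearchCV` iterates internally.

```python
from itertools import product

# Same grid as in the cell above.
cv_params = {'max_depth': [3, 7, 50],
             'min_samples_leaf': [2, 4],
             'min_samples_split': [0.001, 0.01],
             'max_features': ['sqrt'],
             'n_estimators': [50, 75],
             'max_samples': [.5, .9]}

# One dictionary per combination of hyperparameter values.
candidates = [dict(zip(cv_params, values))
              for values in product(*cv_params.values())]

print(len(candidates))  # 3 * 2 * 2 * 1 * 2 * 2
```

The count of 48 agrees with the "48 candidates" reported in the fit log later in this notebook. Every value added to any list multiplies the number of fits, so the grid grows quickly.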

Next, create a list of split indices.

In [17]:
# Create list of split indices.
from sklearn.model_selection import PredefinedSplit
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
### YOUR CODE HERE ###


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use list comprehension, iterating over the indices of `X_train`. The list can consist of 0s to indicate data points that should be treated as validation data and -1s to indicate data points that should be treated as training data.

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Use `PredefinedSplit()`, passing in `split_index`, saving the output as `custom_split`. This will serve as a custom split that will identify which data points from the train set should be treated as validation data during GridSearch.

</details>
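A minimal sketch of how the 0/-1 list drives `PredefinedSplit`, using two hypothetical index objects in place of `X_train.index` and `X_val.index`. Note that the split yields positions within the training data, not the original row labels.

```python
import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical row labels standing in for X_train.index and X_val.index.
train_index = pd.Index([10, 11, 12, 13, 14])
val_index = pd.Index([11, 13])

# 0 marks a validation row, -1 marks a row that is only ever used for training.
split_index = [0 if i in val_index else -1 for i in train_index]
custom_split = PredefinedSplit(split_index)

n_splits = custom_split.get_n_splits()
train_pos, val_pos = next(custom_split.split())

print(split_index)
print(n_splits, train_pos.tolist(), val_pos.tolist())
```

There is only one split, which is why the fit log reports "1 folds" for each candidate.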

Now, instantiate your model.

In [18]:
# Instantiate model.
rf = RandomForestClassifier(random_state=0)


### YOUR CODE HERE ### 


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `RandomForestClassifier()`, specifying the `random_state` argument for reproducible results. This will help you instantiate a random forest model, `rf`.

</details>

Next, use GridSearchCV to search over the specified parameters.

In [19]:
# Search over specified parameters.
rf_val = GridSearchCV(rf, cv_params, scoring=scoring, cv=custom_split, refit='f1', n_jobs=-1, verbose=1) 
### YOUR CODE HERE ### 

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `GridSearchCV()`, passing in `rf` and `cv_params` and specifying `cv` as `custom_split`. Additional arguments that you can specify include: `refit='f1', n_jobs = -1, verbose = 1`. 

</details>

Now, fit your model.

In [20]:

%%time

rf_val.fit(X_train, Y_train)
### YOUR CODE HERE ###


Fitting 1 folds for each of 48 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 out of  48 | elapsed:  1.4min remaining:   35.6s
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  1.6min finished


CPU times: user 6.34 s, sys: 440 ms, total: 6.78 s
Wall time: 1min 43s


GridSearchCV(cv=PredefinedSplit(test_fold=array([ 0, -1, ..., -1, -1])),
             error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weigh...
                                              oob_score=False, random_state=0,
                                              verbose=0, warm_

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `fit()` method to train the GridSearchCV model on `X_train` and `y_train`. 

</details>

<details>
<summary><h4><strong>Hint 2</strong></h4></summary>

Add the magic function `%%time` to keep track of the amount of time it takes to fit the model and display this information once execution has completed. Remember that this code must be the first line in the cell.

</details>

Finally, obtain the optimal parameters.

In [21]:
# Obtain optimal parameters.

rf_val.best_params_
### YOUR CODE HERE ###


{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 2,
 'min_samples_split': 0.001,
 'n_estimators': 75}

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `best_params_` attribute to obtain the optimal values for the hyperparameters from the GridSearchCV model.

</details>
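The whole tuning workflow can be tried quickly on synthetic data before running it on the full dataset. The sketch below is an assumption-laden miniature: the data comes from `make_classification`, the grid is much smaller than the one in this lab, and the first 40 rows are arbitrarily chosen as the validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in data, not the airline dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Mark the first 40 rows as the single validation fold.
test_fold = np.full(200, -1)
test_fold[:40] = 0

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    scoring=["accuracy", "f1"],  # several metrics are recorded per candidate
    refit="f1",                  # but F1 decides which candidate is best
    cv=PredefinedSplit(test_fold),
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))               # validation F1 of the best candidate
print(len(search.cv_results_["mean_test_f1"]))    # one entry per candidate
```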

## **Step 4: Results and evaluation** 

Use the selected model to predict on your test data. Use the optimal parameters found via GridSearchCV.

In [22]:
# Use optimal parameters found by GridSearchCV.

# The values in `best_params_` are ordinary `RandomForestClassifier` arguments,
# so they can be passed directly to a new random forest model.
rf_opt = RandomForestClassifier(max_depth=50, max_features='sqrt', max_samples=0.9, min_samples_leaf=2, min_samples_split=0.001, n_estimators=75, random_state=0)
### YOUR CODE HERE ###


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use `RandomForestClassifier()`, specifying the `random_state` argument for reproducible results and passing in the optimal hyperparameters found in the previous step. To distinguish this from the previous random forest model, consider naming this variable `rf_opt`.

</details>
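Typing the optimal values by hand works, but it is easy to mistype one. As an alternative sketch, the dictionary can be unpacked straight into the constructor. Here `best_params` is a literal copy of the values reported above; in the notebook itself, `rf_val.best_params_` could be unpacked the same way.

```python
from sklearn.ensemble import RandomForestClassifier

# Literal copy of the optimal parameters reported by the grid search above.
best_params = {'max_depth': 50,
               'max_features': 'sqrt',
               'max_samples': 0.9,
               'min_samples_leaf': 2,
               'min_samples_split': 0.001,
               'n_estimators': 75}

# ** unpacks the dictionary into keyword arguments.
rf_opt = RandomForestClassifier(**best_params, random_state=0)

params = rf_opt.get_params()
print({k: params[k] for k in best_params})
```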

Once again, fit the optimal model.

In [23]:
# Fit the optimal model.
rf_opt.fit(X_train, Y_train)
### YOUR CODE HERE ###


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=50, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=0.9,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=0.001,
                       min_weight_fraction_leaf=0.0, n_estimators=75,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Use the `fit()` method to train `rf_opt` on `X_train` and `y_train`.

</details>

And predict on the test set using the optimal model.

In [24]:
# Predict on test set.
Y_pred = rf_opt.predict(X_test)

### YOUR CODE HERE ###


<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `predict()` function to make predictions on `X_test` using `rf_opt`. Save the predictions now (for example, as `y_pred`), to use them later for comparing to the true labels. 

</details>
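Step 1 imports `pickle` to save the model, although this notebook does not go on to use it. A minimal sketch of the round trip is shown below, assuming a small model fitted on synthetic data. It serializes to bytes in memory so that nothing is written to disk.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model, for illustration only.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

payload = pickle.dumps(model)     # serialize the fitted model to bytes
restored = pickle.loads(payload)  # rebuild an equivalent model from the bytes

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

To keep the model between sessions, `pickle.dump(model, f)` with a file opened in `'wb'` mode writes the same bytes to a file of your choosing. Only unpickle files you trust, because loading a pickle can execute arbitrary code.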

### Obtain performance scores

First, get your precision score.

In [25]:
# Get precision score.
precision = precision_score(Y_test, Y_pred, pos_label = 1)
precision
### YOUR CODE HERE ###


0.9517648394052889

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `precision_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument. Use `"satisfied"` if the labels are still strings, or `1` if `satisfaction` was mapped to 0/1 as in this notebook.
</details>

Then, collect the recall score.

In [26]:
# Get recall score.
recall = recall_score(Y_test, Y_pred, pos_label = 1)
recall
### YOUR CODE HERE ###


0.946447717397438

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `recall_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument. Use `"satisfied"` if the labels are still strings, or `1` if `satisfaction` was mapped to 0/1 as in this notebook.
</details>

Next, obtain your accuracy score.

In [27]:
# Get accuracy score.
accuracy = accuracy_score(Y_test, Y_pred)
accuracy
### YOUR CODE HERE ###


0.9444272828370196

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `accuracy_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred`. Unlike the other three metric functions, `accuracy_score()` does not take a `pos_label` argument.
</details>

Finally, collect your F1-score.

In [28]:
# Get F1 score.
F1 = f1_score(Y_test, Y_pred, pos_label = 1)
F1
### YOUR CODE HERE ###


0.9490988314517727

<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

You can call the `f1_score()` function from `sklearn.metrics`, passing in `y_test` and `y_pred` and specifying the `pos_label` argument. Use `"satisfied"` if the labels are still strings, or `1` if `satisfaction` was mapped to 0/1 as in this notebook.
</details>

**Question:** How is the F1-score calculated?

[F1 score is the harmonic mean of precision and recall.]
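As a quick check of the harmonic-mean definition, the sketch below recomputes F1 from the precision and recall values reported earlier in this notebook, using plain Python rather than `f1_score()`.

```python
# Precision and recall as reported on the test set earlier in this notebook.
precision = 0.9517648394052889
recall = 0.946447717397438

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 6))
```

The result should agree with the value returned by `f1_score()` up to floating-point rounding. Because the harmonic mean is pulled toward the smaller of the two inputs, a model cannot get a high F1 by doing well on only one of precision or recall.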

**Question:** What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

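[Selecting the model on the test data is simpler and leaves more data for training, since no rows are set aside for validation. The drawback is that the test set then influences which model is chosen, so the test scores become an optimistic estimate of performance on new data. A separate validation set keeps the test data untouched until the final evaluation, at the cost of fewer training rows. Compared with k-fold cross-validation, a single validation set is also much cheaper to compute, but its scores are less stable because they come from one split.]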



### Evaluate the model

Now that you have results, evaluate the model. 

**Question:** What are the four basic parameters for evaluating the performance of a classification model?

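[The four basic parameters are the confusion-matrix counts: true positives, true negatives, false positives, and false negatives. Accuracy, precision, recall, and F1 score are all computed from them.]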

**Question:**  What do the four scores demonstrate about your model, and how do you calculate them?

[Accuracy is the proportion of all predictions that are correct. Precision is the proportion of predicted positives that are true positives. Recall is the proportion of actual positives that the model correctly predicts as positive. F1 score is the harmonic mean of precision and recall.]
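All four scores can be derived from the confusion-matrix counts. The sketch below works through the arithmetic on ten hypothetical labels and predictions (made-up values, with 1 as the positive class), using only plain Python.

```python
# Hypothetical true labels and predictions; 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)   # correct predictions out of all predictions
precision = tp / (tp + fp)          # true positives out of predicted positives
recall = tp / (tp + fn)             # true positives out of actual positives
f1 = 2 * tp / (2 * tp + fp + fn)    # equivalent to the harmonic mean of the two

print(tp, tn, fp, fn)
print(accuracy, precision, recall, f1)
```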

Calculate the scores: precision score, recall score, accuracy score, F1 score.

In [29]:
# Precision score on test data set.

print('The precision score of the model on test data is {precision:.3f}'.format(precision = precision), 'which means {precision_ps:.1f} percent of the predicted positive values are true positives '.format(precision_ps = precision*100))
### YOUR CODE HERE ###


The precision score of the model on test data is 0.952 which means 95.2 percent of the predicted positive values are true positives 


In [30]:
# Recall score on test data set.

print(f'The recall score of the model on test data is {recall:.3f}, which means the model correctly identified {recall * 100:.1f} percent of the actual positive values.')

### YOUR CODE HERE ###


The recall score of the model on test data is 0.946, which means the model correctly identified 94.6 percent of the actual positive values.


In [31]:
# Accuracy score on test data set.

print(f'The accuracy score of the model on test data is {accuracy:.3f}, which means {accuracy * 100:.1f} percent of all predictions are correct.')


### YOUR CODE HERE ###


The accuracy score of the model on test data is 0.944, which means 94.4 percent of all predictions are correct.


In [34]:
# F1 score on test data set.

print(f'The F1 score of the model on test data is {F1:.3f}, the harmonic mean of precision and recall.')

### YOUR CODE HERE ###


The F1 score of the model on test data is 0.949, the harmonic mean of precision and recall.


**Question:** How does this model perform based on the four scores?

[The model performed well on all four metrics, scoring between roughly 94% and 95% on each. Precision is slightly higher than the other three.]

### Evaluate the model

Finally, create a table of results that you can use to evaluate the performance of your model.

In [37]:
# Create table of results.

table = pd.DataFrame({'model': ['Tuned Random Forest'],
                     'precision': precision,
                      'accuracy': accuracy,
                      'recall': recall,
                      'F1': F1
                     })
table
### YOUR CODE HERE ###


Unnamed: 0,model,precision,accuracy,recall,F1
0,Tuned Random Forest,0.951765,0.944427,0.946448,0.949099



<details>
<summary><h4><strong>Hint 1</strong></h4></summary>

Build a table to compare the performance of the models. Create a DataFrame using the `pd.DataFrame()` function.

</details>

**Question:** How does the random forest model compare to the decision tree model you built in the previous lab?

[The random forest model slightly increased the F1 score, which suggests it is a more stable model than the decision tree.]



## **Considerations**


**What are the key takeaways from this lab? Consider important steps when building a model, most effective approaches and tools, and overall results.**

[Knowing how to build a random forest with both cross-validation and a separate validation set is important. For a separate validation set, a list of split indices passed through `PredefinedSplit` takes the place of the cross-validation folds.
Hyperparameter tuning is an important part of training, because it determines both the model's performance and its computational cost and time.
A random forest is more stable and has lower variance than a single decision tree, because averaging many trees reduces the chance of overfitting.
]


**What summary would you provide to stakeholders?**

[A random forest model was trained on survey data to predict airline passengers' satisfaction.
The model shows promising results and is expected to perform well on new, unseen data.
It scored roughly 94% to 95% on all four evaluation metrics (precision, recall, accuracy, and F1) on the test set.]

### References

[What is the Difference Between Test and Validation Datasets?, Jason Brownlee](https://machinelearningmastery.com/difference-test-validation-datasets/)

[Decision Trees and Random Forests, Neil Liberman](https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991)
