# <center> **Random Forest Model**
## <center> **Customer churn**

## **1. Introduction & Overview**

- Compared with a single decision tree, the primary benefits of a random forest include **`reducing variance`**, **`bias`**, and the **`chance of overfitting`**.
- This activity is a continuation of the project you began modeling with decision trees for an airline. 
- We will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from **`129,880 customers`**. 
- The dataset includes data points such as **`class`**, **`flight distance`**, and **`inflight entertainment`**. 
- Our random forest model will be used to predict whether a customer will be satisfied with their flight experience.

## **2. Imports** 

- Import relevant Python libraries and modules, including the **`numpy`** and **`pandas`** libraries for data processing; the **`pickle`** package to save the model; and the **`sklearn`** library, containing:
    - The module **`ensemble`**, which has the class **`RandomForestClassifier`**
    - The module **`model_selection`**, which has the function **`train_test_split`** and the classes **`PredefinedSplit`** and **`GridSearchCV`** 
    - The module **`metrics`**, which has the functions **`f1_score`**, **`precision_score`**, **`recall_score`**, and **`accuracy_score`**

In [1]:
# Import `numpy`, `pandas`, `pickle`, and `sklearn`.
# Import the relevant functions from `sklearn.ensemble`, `sklearn.model_selection`, and `sklearn.metrics`. 
import numpy as np
import pandas as pd

import pickle as pkl
 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

In [None]:
file = r"C:\Users\barba\OneDrive\Documents\AIO Python\Datasets\Invistico_Airline.csv" # Airline_dataset on github
invistico = pd.read_csv(file)

## **3. Data cleaning** 

In [5]:
# Display first rows.
invistico.head()
# Display variable names and types.
invistico.dtypes
# Identify the number of rows and the number of columns.
invistico.shape

(129880, 22)

In [9]:
# Get the number of rows that contain at least one missing value:
# `isna()` flags missing cells, and `any(axis=1)` flags rows with any missing cell.
invistico.isna().any(axis=1).sum()
# Drop rows with missing values, modifying `invistico` in place.
invistico.dropna(inplace=True, axis=0)
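
The missing-value check and drop can be illustrated on a tiny frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration, and only the column names are borrowed from the airline data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "Age": [65, 47, np.nan],
    "Arrival Delay in Minutes": [0.0, np.nan, np.nan],
})

# Number of rows with at least one missing value.
rows_with_missing = toy.isna().any(axis=1).sum()

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Drop incomplete rows and return a new frame instead of modifying in place.
cleaned = toy.dropna(axis=0).reset_index(drop=True)

print(rows_with_missing)
print(missing_per_column)
print(cleaned.shape)
```

Returning a new frame (rather than `inplace=True`) keeps the original data available for comparison; either approach removes the same rows.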

#### **3.1 Encoding data:** 

- Encoding is necessary because the sklearn implementation of **`RandomForestClassifier()`** requires categorical features to be encoded as numeric values,  
which can be done using dummy variables or one-hot encoding.
- The **`drop_first`** argument can be kept as default (**`False`**) during one-hot encoding for random forest models. 
- The target variable, **`satisfaction`**, does not need to be encoded and will be extracted in a later step.
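
To see what the **`drop_first`** argument changes, here is a small sketch on a made-up frame. The `toy` DataFrame is hypothetical; only the `Class` column name and its category labels mirror the airline data.

```python
import pandas as pd

# Hypothetical toy frame; the column name mirrors the airline data's `Class`.
toy = pd.DataFrame({"Class": ["Eco", "Business", "Eco Plus", "Eco"]})

# Default (drop_first=False): one indicator column per category.
full = pd.get_dummies(toy, columns=["Class"])

# drop_first=True: the first category in sorted order ("Business") is dropped.
reduced = pd.get_dummies(toy, columns=["Class"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Tree-based models split on one feature at a time, so the redundant indicator column kept by the default setting is not a problem here the way it can be for linear models.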

In [19]:
# Convert categorical features to one-hot encoded features.
invistico_dummies = pd.get_dummies(invistico,
                           columns=['Customer Type','Type of Travel','Class']
                           )

In [20]:
# Display the first 5 rows.
invistico_dummies.head()

| | satisfaction | Age | Flight Distance | Seat comfort | Departure/Arrival time convenient | Food and drink | Gate location | Inflight wifi service | Inflight entertainment | Online support | ... | Online boarding | Departure Delay in Minutes | Arrival Delay in Minutes | Customer Type_Loyal Customer | Customer Type_disloyal Customer | Type of Travel_Business travel | Type of Travel_Personal Travel | Class_Business | Class_Eco | Class_Eco Plus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | satisfied | 65 | 265 | 0 | 0 | 0 | 2 | 2 | 4 | 2 | ... | 2 | 0 | 0.0 | True | False | False | True | False | True | False |
| 1 | satisfied | 47 | 2464 | 0 | 0 | 0 | 3 | 0 | 2 | 2 | ... | 2 | 310 | 305.0 | True | False | False | True | True | False | False |
| 2 | satisfied | 15 | 2138 | 0 | 0 | 0 | 3 | 2 | 0 | 2 | ... | 2 | 0 | 0.0 | True | False | False | True | False | True | False |
| 3 | satisfied | 60 | 623 | 0 | 0 | 0 | 3 | 3 | 4 | 3 | ... | 3 | 0 | 0.0 | True | False | False | True | False | True | False |
| 4 | satisfied | 70 | 354 | 0 | 0 | 0 | 3 | 4 | 3 | 4 | ... | 5 | 0 | 0.0 | True | False | False | True | False | True | False |


## **4. Model building** 

In [None]:
# Separate the dataset into labels (y) and features (X).
y = invistico_dummies["satisfaction"]
X = invistico_dummies.drop("satisfaction", axis=1)
# Separate into train, validate, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 0)
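
Applying `test_size = 0.25` twice means the validation set is 25% of the 75% training portion, so the final proportions are roughly 56.25% train, 18.75% validation, and 25% test. A minimal sketch that checks this arithmetic on synthetic data (the arrays below are placeholders, not the airline data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with 1,600 rows so the fractions come out exact.
X = np.arange(1600).reshape(-1, 1)
y = np.arange(1600) % 2

# Same two-step split as in the notebook.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

n = len(X)
print(len(X_tr) / n, len(X_val) / n, len(X_test) / n)
```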

#### **4.1. Tune the model**

In [24]:
# Determine set of hyperparameters.
cv_params = {'n_estimators' : [50,100], 
              'max_depth' : [10,50],        
              'min_samples_leaf' : [0.5,1], 
              'min_samples_split' : [0.001, 0.01],
              'max_features' : ["sqrt"], 
              'max_samples' : [.5,.9]}

In [25]:
# Create list of split indices.
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
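
In the split list, `-1` marks a row that always stays in the training portion and `0` marks a row that belongs to validation fold 0. A small sketch with a made-up `test_fold` list shows the single train/validation split that **`PredefinedSplit`** produces:

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels: -1 = always train, 0 = validation fold 0.
test_fold = [-1, -1, 0, -1, 0, -1]
ps = PredefinedSplit(test_fold)

# Only one fold label (0) is present, so there is exactly one split.
n_splits = ps.get_n_splits()

# Collect the positional indices of the training and validation rows.
splits = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in ps.split()]

print(n_splits)
print(splits)
```

This is why the grid search later reports a single fold per candidate: each hyperparameter combination is trained once on the `-1` rows and scored once on the `0` rows.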

In [26]:
# Instantiate model.
rf = RandomForestClassifier(random_state=0)
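# Search over specified parameters.
# Note: no `scoring` argument is passed, so candidates are ranked by the estimator's
# default score (accuracy); `refit='f1'` only acts as a truthy refit flag here.
rf_val = GridSearchCV(rf, cv_params, cv=custom_split, refit='f1', n_jobs = -1, verbose = 1)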
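
As far as we understand the scikit-learn API, `refit='f1'` selects by F1 only when a matching entry exists in a `scoring` dictionary; with `scoring` omitted, the search ranks candidates by the classifier's default accuracy score. The sketch below shows one way F1 could be used for selection with string labels. It is an illustration on synthetic data, not the configuration that produced this notebook's results: the toy data, the small grid, and the `f1_satisfied` name are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with string labels like the notebook's target.
X_toy, y_num = make_classification(n_samples=200, n_features=5, random_state=0)
y_toy = np.where(y_num == 1, "satisfied", "dissatisfied")

# With string labels, the F1 scorer needs to know which label is "positive".
f1_satisfied = make_scorer(f1_score, pos_label="satisfied")

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20]},                              # small illustrative grid
    scoring={"f1": f1_satisfied, "accuracy": "accuracy"},
    refit="f1",                                              # select and refit by F1
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.cv_results_["mean_test_f1"])
```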

In [27]:
%%time
# Fit the model.
rf_val.fit(X_train, y_train)

Fitting 1 folds for each of 32 candidates, totalling 32 fits
CPU times: total: 7.05 s
Wall time: 42.4 s
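
The count of 32 candidates follows from the grid: 2 × 2 × 2 × 2 × 1 × 2 combinations. A quick check using scikit-learn's `ParameterGrid`, with the grid copied from the cell above:

```python
from sklearn.model_selection import ParameterGrid

# Same hyperparameter grid as defined in the tuning cell.
cv_params = {'n_estimators': [50, 100],
             'max_depth': [10, 50],
             'min_samples_leaf': [0.5, 1],
             'min_samples_split': [0.001, 0.01],
             'max_features': ["sqrt"],
             'max_samples': [.5, .9]}

# ParameterGrid enumerates every combination of the listed values.
n_candidates = len(ParameterGrid(cv_params))
print(n_candidates)  # 32
```

With a single predefined validation fold, the number of fits equals the number of candidates.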


In [28]:
# Obtain optimal parameters.
rf_val.best_params_

{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 50}

## **5. Results & Evaluation** 

In [30]:
# Instantiate a model with the optimal parameters found by GridSearchCV.
rf_opt = RandomForestClassifier(n_estimators = 50, max_depth = 50, 
                                min_samples_leaf = 1, min_samples_split = 0.001,
                                max_features="sqrt", max_samples = 0.9, random_state = 0)

In [31]:
# Fit the optimal model.

rf_opt.fit(X_train, y_train)

In [32]:
# Predict on test set.
y_pred = rf_opt.predict(X_test)
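
The imports mention **`pickle`** for saving the model. A minimal sketch of saving and reloading a fitted forest follows; the toy model and the file name `rf_toy_model.pickle` are hypothetical and stand in for `rf_opt` and whatever path you prefer.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Hypothetical output location in the system temp directory.
model_path = Path(tempfile.gettempdir()) / "rf_toy_model.pickle"

# Write the fitted model to disk.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Read it back and confirm it predicts the same labels.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

Pickled models should only be loaded from trusted sources and, ideally, with the same scikit-learn version that created them.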

#### **5.1. Obtain performance scores**

In [48]:
# Get precision score.
pc_test = precision_score(y_test, y_pred, pos_label = "satisfied")
print("The precision score is {pc:.3f}".format(pc = pc_test))
# Get recall score.
rc_test = recall_score(y_test, y_pred, pos_label = "satisfied")
print("The recall score is {rc:.3f}".format(rc = rc_test))
# Get accuracy score.
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac = ac_test))
# Get F1 score.
f1_test = f1_score(y_test, y_pred, pos_label = "satisfied")
print("The F1 score is {f1:.3f}".format(f1 = f1_test))

The precision score is 0.950
The recall score is 0.945
The accuracy score is 0.942
The F1 score is 0.947


#### **5.2. Evaluate the model**

1. **`True positives (TP) :`** These are correctly predicted positive values, which means both the actual class and the predicted class are positive. 
2. **`True negatives (TN) :`** These are correctly predicted negative values, which means both the actual class and the predicted class are negative.
3. **`False positives (FP) :`** This occurs when the value of the actual class is negative and the value of the predicted class is positive.
4. **`False negatives (FN) :`** This occurs when the value of the actual class is positive and the value of the predicted class is negative. 

**Reminder:** When fitting and tuning classification models, data professionals aim **`to minimize false positives and false negatives`**.

- **`Accuracy ((TP+TN)/(TP+FP+FN+TN)):`** The ratio of correctly predicted observations to total observations. 
- **`Precision (TP/(TP+FP)):`** The ratio of correctly predicted positive observations to total predicted positive observations. 
- **`Recall (Sensitivity, TP/(TP+FN)):`** The ratio of correctly predicted positive observations to all observations in the actual positive class.
- **`F1 score:`** The harmonic mean of precision and recall, which takes into account both false positives and false negatives. 
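
The formulas above can be checked by hand against sklearn on a small made-up example. The labels below are illustrative only and are chosen so that TP = 4, FN = 1, FP = 2, and TN = 3.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 5 actual "satisfied", 5 actual "dissatisfied".
y_true = ["satisfied"] * 5 + ["dissatisfied"] * 5
y_hat = (["satisfied"] * 4 + ["dissatisfied"] * 1      # 4 TP, 1 FN
         + ["satisfied"] * 2 + ["dissatisfied"] * 3)   # 2 FP, 3 TN

tp, fn, fp, tn = 4, 1, 2, 3

# Manual calculations from the confusion-matrix counts.
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The same metrics computed by sklearn, with "satisfied" as the positive class.
sk_accuracy = accuracy_score(y_true, y_hat)
sk_precision = precision_score(y_true, y_hat, pos_label="satisfied")
sk_recall = recall_score(y_true, y_hat, pos_label="satisfied")
sk_f1 = f1_score(y_true, y_hat, pos_label="satisfied")

print(accuracy, precision, recall, f1)
```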

In [None]:
# Precision score on test data set.
print("\nThe precision score is: {pc:.3f}".format(pc = pc_test), 
      "for the test set,", "\nwhich means of all positive predictions,", 
      "{pc_pct:.1f}% are true positives.".format(pc_pct = pc_test * 100))
# Recall score on test data set.
print("\nThe recall score is: {rc:.3f}".format(rc = rc_test), 
      "for the test set,", "\nwhich means of all real positive cases in the test set,", 
      "{rc_pct:.1f}% are predicted positive.".format(rc_pct = rc_test * 100))
# Accuracy score on test data set.
print("\nThe accuracy score is: {ac:.3f}".format(ac = ac_test), "for the test set,", 
      "\nwhich means of all cases in the test set,", 
      "{ac_pct:.1f}% are predicted correctly (true positive or true negative).".format(ac_pct = ac_test * 100))
# F1 score on test data set.
print("\nThe F1 score is: {f1:.3f}".format(f1 = f1_test), "for the test set,", 
      "\nwhich means the harmonic mean of precision and recall is {f1_pct:.1f}%.".format(f1_pct = f1_test * 100))


The precision score is: 0.950 for the test set, 
which means of all positive predictions, 95.0% are true positives.

The recall score is: 0.945 for the test set, 
which means of all real positive cases in the test set, 94.5% are predicted positive.

The accuracy score is: 0.942 for the test set, 
which means of all cases in the test set, 94.2% are predicted correctly (true positive or true negative).

The F1 score is: 0.947 for the test set, 
which means the harmonic mean of precision and recall is 94.7%.


#### **5.3. Compare with the tuned decision tree**

In [46]:
# Create table of results.
table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
                        'F1':  [0.944238, f1_test],
                        'Recall': [0.934426, rc_test],
                        'Precision': [0.95428, pc_test],
                        'Accuracy': [0.939587, ac_test]
                      }
                    )
table

| | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Tuned Decision Tree | 0.944238 | 0.934426 | 0.95428 | 0.939587 |
| 1 | Tuned Random Forest | 0.947306 | 0.944501 | 0.950128 | 0.94245 |


- The tuned random forest has higher F1, recall, and accuracy scores than the tuned decision tree, while its precision is slightly lower (0.950 vs. 0.954), so it is the better model overall. 
- In particular, it **`shows a better F1 score`** than the decision tree model, which indicates that  
the random forest model may do better at classification when taking into account false positives and false negatives. 

## **6. Considerations**

- #### **6.1. Key takeaways from this lab?**
    - **`Data exploring`**, **`cleaning`**, and **`encoding`** are necessary for model building.
    - A separate validation set is typically used for tuning a model, rather than using the test set. 
    - This also helps avoid the evaluation becoming biased.
    - F1 scores are usually more useful than accuracy scores, especially when the classes are imbalanced. 
    - If the costs of false positives and false negatives are very different, it’s better to use the F1 score and combine the information from precision and recall. 
    - The random forest model performed better than the decision tree model on three of the four metrics (F1, recall, and accuracy). 

- #### **6.2. Summary to provide to stakeholders?**
    - The random forest model predicted satisfaction with about **`94.2% accuracy`**. 
    - The **`precision is over 95%`** and the recall is approximately 94.5%. 
    - The random forest model outperformed the tuned decision tree with the best hyperparameters in three of the four scores (F1, recall, and accuracy); only its precision was slightly lower. 
    - This indicates that the random forest model may perform better.
    - Because stakeholders were interested in learning about the factors that are most important to customer satisfaction, this would be shared based on the tuned random forest. 
    - In addition, you would provide details about the **`precision, recall, accuracy, and F1 scores`** to support your findings. 
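
One common way to surface the factors that matter most is the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names (`feat_a` to `feat_d`); on the real data, the fitted tuned model and the columns of `X_train` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names.
X_arr, y_toy = make_classification(n_samples=300, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, normalized to sum to 1, sorted from most to least important.
importances = (pd.Series(model.feature_importances_, index=X_toy.columns)
               .sort_values(ascending=False))
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure of effect.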

## **7. References**

[**What is the Difference Between Test and Validation Datasets?, Jason Brownlee**](https://machinelearningmastery.com/difference-test-validation-datasets/)

[**Decision Trees and Random Forests, Neil Liberman**](https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991)