<a href="https://colab.research.google.com/github/JordanCWard/Statistical_Tests/blob/main/Random_forest_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Random forest model

## Overview

This analysis focuses on predicting customer satisfaction using survey responses from 129,880 airline customers. The dataset includes customer demographics, travel characteristics, and service quality ratings such as class, flight distance, and inflight entertainment. The objective is to build a classification model that accurately predicts whether a customer is satisfied with their flight experience.

A tuned Random Forest classifier was trained, validated, and evaluated, and its performance was compared with a tuned Decision Tree model. Model performance was assessed using accuracy, precision, recall, and F1 score to ensure balanced evaluation and appropriate handling of false positives and false negatives.

## Executive Summary

The tuned Random Forest model demonstrated strong predictive performance across all evaluation metrics and outperformed the tuned Decision Tree model on F1 score, recall, and accuracy, while the Decision Tree retained slightly higher precision. On the test set, the Random Forest achieved about 94.2 percent accuracy, 95.0 percent precision, 94.5 percent recall, and an F1 score of about 0.947. These results indicate both reliable positive predictions and effective identification of satisfied customers.

Precision was the strongest of the four metrics: about 95 percent of the customers predicted as satisfied were in fact satisfied. The Random Forest also achieved a higher F1 score than the Decision Tree, indicating better overall classification performance when balancing precision and recall. Based on these results, the tuned Random Forest model is the preferred approach for predicting customer satisfaction.

## Data & Methodology

The analysis used survey data consisting of 129,880 observations with a mix of variable types, including integer, floating-point, and categorical variables. The categorical variables included satisfaction, customer type, type of travel, and class.

The 393 rows containing missing values were removed to ensure data quality. Categorical variables were converted into numerical format using one-hot encoding. The dataset was split into training (56.25%), validation (18.75%), and test (25%) sets. A predefined validation split was used within GridSearchCV to tune the Random Forest hyperparameters. The search was configured with `refit='f1'` but no `scoring` argument, so candidates were ranked by scikit-learn's default classifier score (accuracy) rather than by F1.

Model performance on the test set was evaluated using accuracy, precision, recall, and F1 score. The tuned Random Forest outperformed the tuned Decision Tree on F1 score, recall, and accuracy, while the Decision Tree had slightly higher precision (0.955 versus 0.950). The higher F1 score indicates better performance when accounting for both false positives and false negatives.

### Import packages


In [3]:
# NumPy is used for numerical computations, array operations,
# and efficient handling of multi-dimensional data structures.
import numpy as np

# Pandas provides data structures and tools for data loading,
# cleaning, transformation, and exploratory data analysis.
import pandas as pd

# Pickle is used for serializing and deserializing Python objects,
# such as trained models or preprocessing artifacts.
import pickle as pkl

# RandomForestClassifier is an ensemble learning algorithm that
# combines multiple decision trees to improve predictive performance
# and reduce overfitting.
from sklearn.ensemble import RandomForestClassifier

# train_test_split is used to split datasets into training and testing subsets.
# PredefinedSplit enables custom train/validation splits.
# GridSearchCV performs exhaustive hyperparameter tuning using cross-validation.
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV

# accuracy_score measures overall prediction correctness.
# precision_score measures the proportion of true positives among predicted positives.
# recall_score measures the proportion of actual positives that are correctly identified.
# f1_score provides a balanced metric combining precision and recall.
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

### Load and preview the dataset

In [4]:
# Load the airline dataset from a CSV file into a pandas DataFrame
# for further data exploration, preprocessing, and modeling.
air_data = pd.read_csv("Invistico_Airline.csv")

# Display the first 10 rows of the dataset to perform an initial
# inspection of the data structure, column names, and sample values.
air_data.head(10)

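|   | satisfaction | Customer Type | Age | Type of Travel | Class | Flight Distance | Seat comfort | Departure/Arrival time convenient | Food and drink | Gate location | ... | Online support | Ease of Online booking | On-board service | Leg room service | Baggage handling | Checkin service | Cleanliness | Online boarding | Departure Delay in Minutes | Arrival Delay in Minutes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | satisfied | Loyal Customer | 65 | Personal Travel | Eco | 265 | 0 | 0 | 0 | 2 | ... | 2 | 3 | 3 | 0 | 3 | 5 | 3 | 2 | 0 | 0.0 |
| 1 | satisfied | Loyal Customer | 47 | Personal Travel | Business | 2464 | 0 | 0 | 0 | 3 | ... | 2 | 3 | 4 | 4 | 4 | 2 | 3 | 2 | 310 | 305.0 |
| 2 | satisfied | Loyal Customer | 15 | Personal Travel | Eco | 2138 | 0 | 0 | 0 | 3 | ... | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 2 | 0 | 0.0 |
| 3 | satisfied | Loyal Customer | 60 | Personal Travel | Eco | 623 | 0 | 0 | 0 | 3 | ... | 3 | 1 | 1 | 0 | 1 | 4 | 1 | 3 | 0 | 0.0 |
| 4 | satisfied | Loyal Customer | 70 | Personal Travel | Eco | 354 | 0 | 0 | 0 | 3 | ... | 4 | 2 | 2 | 0 | 2 | 4 | 2 | 5 | 0 | 0.0 |
| 5 | satisfied | Loyal Customer | 30 | Personal Travel | Eco | 1894 | 0 | 0 | 0 | 3 | ... | 2 | 2 | 5 | 4 | 5 | 5 | 4 | 2 | 0 | 0.0 |
| 6 | satisfied | Loyal Customer | 66 | Personal Travel | Eco | 227 | 0 | 0 | 0 | 3 | ... | 5 | 5 | 5 | 0 | 5 | 5 | 5 | 3 | 17 | 15.0 |
| 7 | satisfied | Loyal Customer | 10 | Personal Travel | Eco | 1812 | 0 | 0 | 0 | 3 | ... | 2 | 2 | 3 | 3 | 4 | 5 | 4 | 2 | 0 | 0.0 |
| 8 | satisfied | Loyal Customer | 56 | Personal Travel | Business | 73 | 0 | 0 | 0 | 3 | ... | 5 | 4 | 4 | 0 | 1 | 5 | 4 | 4 | 0 | 0.0 |
| 9 | satisfied | Loyal Customer | 22 | Personal Travel | Eco | 1556 | 0 | 0 | 0 | 3 | ... | 2 | 2 | 2 | 4 | 5 | 3 | 4 | 2 | 30 | 26.0 |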


In [5]:
# Display the data types of each column to understand how features
# are represented (e.g., numeric, categorical, boolean) and to
# identify any necessary type conversions before preprocessing.
air_data.dtypes

| Column | dtype |
| --- | --- |
| satisfaction | object |
| Customer Type | object |
| Age | int64 |
| Type of Travel | object |
| Class | object |
| Flight Distance | int64 |
| Seat comfort | int64 |
| Departure/Arrival time convenient | int64 |
| Food and drink | int64 |
| Gate location | int64 |


There are three types of variables included in the data: int64, float64, and object.

The object variables are satisfaction, customer type, type of travel, and class.

In [6]:
# Display the shape of the dataset to understand its dimensionality,
# where the output represents the number of rows (observations)
# and columns (features).
air_data.shape

(129880, 22)

### Data cleaning

In [7]:
# Check for missing values across all columns on a row-wise basis
# and count how many rows contain at least one missing value.
# This helps assess data completeness before cleaning or imputation.
air_data.isna().any(axis=1).sum()

np.int64(393)

In [8]:
# Create a cleaned subset of the dataset by removing all rows
# that contain at least one missing value, ensuring only complete
# observations are used for subsequent analysis or modeling.
air_data_subset = air_data.dropna(axis=0)

In [9]:
# Verify that the cleaned dataset no longer contains missing values
# by computing the total number of null entries in each column.
air_data_subset.isna().sum()

| Column | Missing values |
| --- | --- |
| satisfaction | 0 |
| Customer Type | 0 |
| Age | 0 |
| Type of Travel | 0 |
| Class | 0 |
| Flight Distance | 0 |
| Seat comfort | 0 |
| Departure/Arrival time convenient | 0 |
| Food and drink | 0 |
| Gate location | 0 |


In [10]:
# Convert categorical variables into numerical format using one-hot encoding.
# This creates binary indicator columns for each category in
# 'Customer Type', 'Type of Travel', and 'Class', making the data
# suitable for machine learning algorithms.
air_data_subset_dummies = pd.get_dummies(
    air_data_subset,
    columns=['Customer Type', 'Type of Travel', 'Class']
)

One-hot encoding is necessary because the scikit-learn implementation of `RandomForestClassifier()` requires numeric features, so categorical features must first be converted, for example into dummy (indicator) variables.
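
As a small illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a toy frame. The three rows are hypothetical; only the column names `Age` and `Class` mirror the dataset.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Age": [65, 47, 15],
    "Class": ["Eco", "Business", "Eco"],
})

# Each category in 'Class' becomes its own indicator column,
# named '<column>_<category>'. Columns not listed are left unchanged.
toy_dummies = pd.get_dummies(toy, columns=["Class"])
print(toy_dummies.columns.tolist())
```

The original `Class` column is dropped and replaced by one indicator column per category, which is the form the classifier can consume.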

## Model Development

In [11]:
# Define the target variable (y) as the customer satisfaction label,
# which represents the outcome the model is trained to predict.
y = air_data_subset_dummies["satisfaction"]

# Define the feature matrix (X) by removing the target variable
# from the dataset, retaining only input features for modeling.
X = air_data_subset_dummies.drop("satisfaction", axis=1)

In [12]:
# Split the full dataset into training and test sets,
# reserving 25% of the data for final model evaluation.
# The random_state ensures reproducibility of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Further split the training data into training and validation sets,
# reserving 25% of the training portion for model tuning and validation.
# This results in a three-way split: training, validation, and test.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)
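
Applying a 25% hold-out twice leaves 56.25% of the rows for training, 18.75% for validation, and 25% for testing. The sketch below checks this on hypothetical stand-in data; the `_d` suffixed names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,600 rows so that every split is a whole number.
X_demo = np.arange(1600).reshape(-1, 1)
y_demo = np.arange(1600) % 2

# First split: 75% train (including validation), 25% test.
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Second split: 25% of the training portion becomes the validation set.
X_tr_d, X_val_d, y_tr_d, y_val_d = train_test_split(
    X_train_d, y_train_d, test_size=0.25, random_state=0
)

# Report each partition as a fraction of the full dataset.
for name, part in [("train", X_tr_d), ("validation", X_val_d), ("test", X_test_d)]:
    print(f"{name}: {len(part) / len(X_demo):.4f}")
```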

### Tune the model


In [13]:
# Define the hyperparameter grid for tuning the Random Forest model.
# Each parameter and its corresponding values will be evaluated
# during cross-validation to identify the optimal configuration.
cv_params = {
    'n_estimators': [50, 100],        # Number of trees in the forest
    'max_depth': [10, 50],            # Maximum depth of each decision tree
    'min_samples_leaf': [0.5, 1],     # Minimum samples at a leaf: a float is a fraction of the samples, an int is a count
    'min_samples_split': [0.001, 0.01],  # Minimum fraction of samples required to split an internal node
    'max_features': ["sqrt"],         # Number of features considered at each split
    'max_samples': [0.5, 0.9]         # Fraction of samples used to train each tree
}
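
The size of the search can be checked before fitting. The sketch below uses `ParameterGrid` to count the candidates in the same grid and shows how scikit-learn documents float values of `min_samples_leaf`: as a fraction of the training rows, rounded up. The row count `n_rows` is a hypothetical figure chosen for illustration.

```python
import math

from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
cv_params = {
    'n_estimators': [50, 100],
    'max_depth': [10, 50],
    'min_samples_leaf': [0.5, 1],
    'min_samples_split': [0.001, 0.01],
    'max_features': ["sqrt"],
    'max_samples': [0.5, 0.9],
}

# Number of distinct hyperparameter combinations: 2 * 2 * 2 * 2 * 1 * 2.
n_candidates = len(ParameterGrid(cv_params))
print(n_candidates)

# A float min_samples_leaf is a fraction of the rows (rounded up),
# so 0.5 demands that every leaf hold at least half of the data.
n_rows = 10_000  # hypothetical number of training rows
leaf_min = math.ceil(0.5 * n_rows)
print(leaf_min)
```

With `min_samples_leaf=0.5`, a tree can make at most one split, so those candidates are effectively forests of stumps. This is consistent with the search preferring `min_samples_leaf=1`.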

In [14]:
# Create a custom split index to explicitly define training and
# validation sets for cross-validation.
# Rows corresponding to the validation set are labeled with 0,
# while all other rows are labeled with -1 (training set).
split_index = [0 if x in X_val.index else -1 for x in X_train.index]

# Initialize a PredefinedSplit object using the custom split index.
# This ensures that GridSearchCV uses the predefined training
# and validation partitions during model tuning.
custom_split = PredefinedSplit(split_index)
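
To make the labeling convention concrete, the sketch below builds a `PredefinedSplit` from a hypothetical six-row fold vector and prints the single train/validation partition it produces.

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical fold labels for six rows:
# -1 = always in the training set, 0 = member of validation fold 0.
test_fold = [-1, 0, -1, -1, 0, -1]
ps = PredefinedSplit(test_fold)

# Only one distinct non-negative label exists, so there is one split.
print(ps.get_n_splits())

# Each split is a (train_indices, validation_indices) pair of positions.
splits = list(ps.split())
for train_idx, val_idx in splits:
    print(train_idx, val_idx)
```

The indices are positional, which is why the notebook builds `split_index` in the same row order as `X_train`.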

In [15]:
# Initialize the Random Forest classifier with a fixed random state
# to ensure reproducibility of results across different runs.
rf = RandomForestClassifier(random_state=0)

In [16]:
# Set up GridSearchCV to tune the Random Forest model using the
# predefined train/validation split. Note that no `scoring` argument
# is passed, so candidates are ranked by the estimator's default
# score (accuracy); with a single metric, refit='f1' only acts as a
# truthy flag that triggers the final refit on X_train.
rf_val = GridSearchCV(
    rf,                                # Base Random Forest model to be tuned
    cv_params,                         # Dictionary of hyperparameters to search over
    cv=custom_split,                   # Use predefined train/validation split
    refit='f1',                        # Refit the best candidate (does not by itself select F1 as the metric)
    n_jobs=-1,                         # Utilize all available CPU cores for parallel processing
    verbose=1                          # Display progress messages during the search
)
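
Ranking candidates by F1 requires a `scoring` argument in addition to `refit`, and with string labels the scorer also needs to know which class is positive. The sketch below shows one way to do this with `make_scorer` on synthetic data. The data, the reduced grid, and the names `X_demo`, `y_demo`, `fold`, and `search` are assumptions for illustration and are not the notebook's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical synthetic data with string labels like the airline target.
X_demo, y_num = make_classification(n_samples=200, n_features=6, random_state=0)
y_demo = np.where(y_num == 1, "satisfied", "dissatisfied")

# The last 50 rows form the single validation fold; the rest always train.
fold = np.r_[np.full(150, -1), np.zeros(50, dtype=int)]

# F1 scorer that treats "satisfied" as the positive class.
f1_satisfied = make_scorer(f1_score, pos_label="satisfied")

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [3, 5]},  # reduced illustrative grid
    scoring=f1_satisfied,      # candidates are ranked by validation F1
    cv=PredefinedSplit(fold),
    refit=True,                # refit the best candidate on all rows passed to fit
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Here `best_score_` is the validation F1 of the winning candidate, so the selection criterion matches the metric reported afterwards.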

In [17]:
%%time
# The %%time cell magic must be the first line of the cell; it measures
# the execution time of the hyperparameter tuning process.

# Fit the GridSearchCV object on the training data. Each candidate is
# trained on the rows labeled -1 and scored on the validation rows, and
# the best candidate is then refit on all of X_train.
rf_val.fit(X_train, y_train)

Fitting 1 folds for each of 32 candidates, totalling 32 fits
CPU times: user 8.68 s, sys: 194 ms, total: 8.87 s
Wall time: 1min 46s


In [18]:
# Retrieve the set of hyperparameters that produced the best
# validation score during the GridSearchCV process.
rf_val.best_params_

{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 50}

## Model Results & Interpretation



In [19]:
# Initialize the optimized Random Forest classifier using the
# best-performing hyperparameters identified from GridSearchCV.
# A fixed random state is included to ensure reproducibility.
rf_opt = RandomForestClassifier(
    n_estimators=50,          # Number of trees in the ensemble
    max_depth=50,             # Maximum depth of each decision tree
    min_samples_leaf=1,       # Minimum number of samples required at a leaf node
    min_samples_split=0.001,  # Minimum fraction of samples required to split a node
    max_features="sqrt",      # Number of features considered at each split
    max_samples=0.9,          # Fraction of samples used to train each tree
    random_state=0            # Ensures reproducible results
)
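
Rather than retyping the selected values, the same estimator can be built by unpacking the search result. The sketch below uses a hypothetical `best_params` dictionary that copies the values reported by `rf_val.best_params_` above; in the notebook itself, `rf_val.best_params_` could be unpacked directly.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical copy of the dictionary reported by rf_val.best_params_.
best_params = {
    "max_depth": 50,
    "max_features": "sqrt",
    "max_samples": 0.9,
    "min_samples_leaf": 1,
    "min_samples_split": 0.001,
    "n_estimators": 50,
}

# Unpack the dictionary into keyword arguments and keep the fixed seed.
rf_from_search = RandomForestClassifier(**best_params, random_state=0)
print(rf_from_search.get_params()["n_estimators"])
```

Because the search was run with refitting enabled, `rf_val.best_estimator_` also holds a model with these settings that is already fitted on `X_train`.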

In [20]:
# Train the optimized Random Forest model on the training dataset
# using the selected hyperparameters.
rf_opt.fit(X_train, y_train)

In [21]:
# Generate predictions for the test dataset using the trained
# optimized Random Forest model.
y_pred = rf_opt.predict(X_test)

### Performance scores

In [22]:
# Calculate the precision score on the test set, focusing on the
# "satisfied" class as the positive label.
pc_test = precision_score(y_test, y_pred, pos_label="satisfied")
print("The precision score is {pc:.3f}".format(pc=pc_test))

# Calculate the recall score on the test set, measuring the model’s
# ability to correctly identify satisfied customers.
rc_test = recall_score(y_test, y_pred, pos_label="satisfied")
print("The recall score is {rc:.3f}".format(rc=rc_test))

# Calculate the accuracy score on the test set, representing the
# overall proportion of correctly classified observations.
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac=ac_test))

# Calculate the F1 score on the test set, providing a balanced
# measure of precision and recall for the "satisfied" class.
f1_test = f1_score(y_test, y_pred, pos_label="satisfied")
print("The F1 score is {f1:.3f}".format(f1=f1_test))

The precision score is 0.950
The recall score is 0.945
The accuracy score is 0.942
The F1 score is 0.947


### Evaluating the model

- Accuracy, `(TP + TN) / (TP + FP + FN + TN)`: The ratio of correctly predicted observations to total observations.

- Precision, `TP / (TP + FP)`: The ratio of correctly predicted positive observations to total predicted positive observations.

- Recall (sensitivity), `TP / (TP + FN)`: The ratio of correctly predicted positive observations to all observations that are actually positive.

- F1 score, `2 * precision * recall / (precision + recall)`: The harmonic mean of precision and recall, which takes into account both false positives and false negatives.
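
The four definitions can be checked by hand against scikit-learn. The sketch below derives each metric from confusion-matrix counts on eight hypothetical labels and compares the results with the library functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical true and predicted labels for eight customers.
y_true = ["satisfied", "satisfied", "satisfied", "dissatisfied",
          "dissatisfied", "satisfied", "dissatisfied", "dissatisfied"]
y_hat = ["satisfied", "satisfied", "dissatisfied", "dissatisfied",
         "satisfied", "satisfied", "dissatisfied", "dissatisfied"]

# With labels ordered [negative, positive], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(
    y_true, y_hat, labels=["dissatisfied", "satisfied"]
).ravel()

# Metrics computed directly from the counts.
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values agree with scikit-learn's functions.
print(accuracy, accuracy_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat, pos_label="satisfied"))
print(recall, recall_score(y_true, y_hat, pos_label="satisfied"))
print(f1, f1_score(y_true, y_hat, pos_label="satisfied"))
```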

In [23]:
# Print each test-set metric together with a plain-language interpretation.

# Precision: the share of predicted positives that are truly positive.
print(
    f"\nThe precision score is: {pc_test:.3f} for the test set,"
    f"\nwhich means that {pc_test * 100:.1f}% of all positive predictions are true positives."
)

# Recall: the share of actual positives that the model identifies.
print(
    f"\nThe recall score is: {rc_test:.3f} for the test set,"
    f"\nwhich means that {rc_test * 100:.1f}% of all actual positive cases are predicted positive."
)

# Accuracy: the share of all test cases that are classified correctly.
print(
    f"\nThe accuracy score is: {ac_test:.3f} for the test set,"
    f"\nwhich means that {ac_test * 100:.1f}% of all cases are classified correctly."
)

# F1: the harmonic mean of precision and recall.
print(
    f"\nThe F1 score is: {f1_test:.3f} for the test set,"
    "\nwhich is the harmonic mean of the precision and recall above."
)


The precision score is: 0.950 for the test set,
which means that 95.0% of all positive predictions are true positives.

The recall score is: 0.945 for the test set,
which means that 94.5% of all actual positive cases are predicted positive.

The accuracy score is: 0.942 for the test set,
which means that 94.2% of all cases are classified correctly.

The F1 score is: 0.947 for the test set,
which is the harmonic mean of the precision and recall above.


The model performs well on all four performance metrics, each of which is above 0.94. Its precision score (0.950) is slightly higher than the other three metrics.

In [24]:
# Create a summary table to compare model performance metrics.
# This table includes evaluation results for the tuned Decision Tree
# and the tuned Random Forest models across key classification metrics.
table = pd.DataFrame(
    {
        'Model': ["Tuned Decision Tree", "Tuned Random Forest"],  # Model names
        'F1': [0.945422, f1_test],                                # F1 scores
        'Recall': [0.935863, rc_test],                            # Recall scores
        'Precision': [0.955197, pc_test],                         # Precision scores
        'Accuracy': [0.940864, ac_test]                           # Accuracy scores
    }
)

# Display the comparison table
table

|   | Model | F1 | Recall | Precision | Accuracy |
| --- | --- | --- | --- | --- | --- |
| 0 | Tuned Decision Tree | 0.945422 | 0.935863 | 0.955197 | 0.940864 |
| 1 | Tuned Random Forest | 0.947306 | 0.944501 | 0.950128 | 0.94245 |


The tuned Random Forest scores higher on F1, recall, and accuracy, while the tuned Decision Tree has slightly higher precision (0.955 versus 0.950). Overall, the Random Forest is the better model: its higher F1 score indicates that it may do better at classification when taking into account both false positives and false negatives.

## Next Steps

Future work should focus on extracting and communicating feature importance from the tuned Random Forest model to identify the key drivers of customer satisfaction. These insights can be shared with stakeholders to support data-driven decisions aimed at improving the customer experience.
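
As a starting point for that work, the sketch below ranks impurity-based feature importances from a fitted Random Forest. It runs on synthetic data with hypothetical column names (`feature_0` to `feature_4`); on the real model, `rf_opt.feature_importances_` and `X_train.columns` would take the place of `model` and `X_demo`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data standing in for the encoded survey features.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 and describe how much each feature contributed to the splits, so they indicate association with satisfaction rather than causal effect.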

Additional steps include validating the model on new data to ensure consistent performance over time and conducting deeper analysis by customer segment, such as travel type or class, to better understand differences in satisfaction drivers. These efforts will help translate model results into actionable business insights.