# Random Forest Classifier

In this notebook, we use a **Random Forest Classifier** to predict customer churn from the pre-processed `customer_churn_processed.csv` dataset.

We also evaluate the model on accuracy, precision, recall, and F1 score, and record its training and prediction times. The results are saved to a file so they can be compared with other models in later stages of this project phase.

In [52]:
# import dependencies
import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# input file containing preprocessed data
input_csv = "../../data/customer_churn_processed.csv"
# output file to be saved containing model results
output_csv = "../model_results/random_forest_results.csv"

## Data

Load and prepare data for training and testing the model.

In [53]:
# Load the data
data = pd.read_csv(input_csv)

# Split the data into X and y
X = data.drop('Churn', axis=1)
y = data['Churn']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
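
A minimal sketch of a seeded, stratified variant of this split, in case repeatable results are wanted across runs. It assumes a small made-up DataFrame (`demo`) in place of the churn data, and the seed value `42` is arbitrary. `random_state` fixes the shuffle, and `stratify` keeps the share of churned customers the same in the training and test sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the churn data: 100 rows, 20% positive class
demo = pd.DataFrame({
    "Age": range(100),
    "Churn": [1] * 20 + [0] * 80,
})
X_demo = demo.drop("Churn", axis=1)
y_demo = demo["Churn"]

# random_state fixes the shuffle; stratify preserves the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))       # sizes of the two parts
print(y_tr.mean(), y_te.mean())   # share of churned rows in each part
```

Stratifying matters most when the positive class is rare, because an unlucky random split could otherwise leave very few churned customers in the test set.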

We define a **Random Forest Classifier** with default parameters, except for `n_jobs=-1`, which lets training and prediction use all available CPU cores.

In [54]:
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_jobs=-1)
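
Since the model relies on default parameters, it can help to print the values actually in effect, as defaults may differ between scikit-learn versions. The sketch below builds a separate, hypothetical `rf_demo` estimator. The `random_state=42` argument is an assumption added here for repeatability and is not part of the model defined above.

```python
from sklearn.ensemble import RandomForestClassifier

# Same construction as in the notebook, plus a seed (the seed is an assumption)
rf_demo = RandomForestClassifier(n_jobs=-1, random_state=42)

# get_params() returns every constructor argument, including untouched defaults
params = rf_demo.get_params()
for name in ("n_estimators", "max_depth", "max_features", "n_jobs", "random_state"):
    print(f"{name} = {params[name]!r}")
```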

## Model Training & Prediction

Let's now train the model on the training data and make predictions on the test data.

In [55]:
# Record the start time before training the model
start = time.time()

# Train the model
rf.fit(X_train, y_train)

# Record the end time after the model has been trained
end = time.time()

# Record the training time
training_time = end - start

# Record the start time before running the model
start = time.time()

# Make predictions
y_pred = rf.predict(X_test)

# Record the end time after the model has been run
end = time.time()

# Record the prediction time
prediction_time = end - start
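
The start/end pattern is repeated for training and prediction, so it could be wrapped in a small helper. The sketch below is one possible approach, not the notebook's method: `timed` and `timings` are hypothetical names, and it swaps in `time.perf_counter()`, a monotonic high-resolution clock intended for measuring short durations, whereas `time.time()` follows the system clock and can jump if that clock is adjusted.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so a timing is always recorded
        results[label] = time.perf_counter() - start


timings = {}
with timed("sleep", timings):
    time.sleep(0.05)  # stand-in for rf.fit(...) or rf.predict(...)

print(timings)
```

With this helper, the training step would read `with timed("training", timings): rf.fit(X_train, y_train)`, and the prediction step would follow the same shape.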

## Model Evaluation

Finally, we evaluate the model on accuracy, precision, recall, and F1 score.

In [56]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
fscore = f1_score(y_test, y_pred)

# Create DataFrame to store evaluation results, training time and prediction time
evaluation_results = pd.DataFrame({
    'Model': ['Random Forest'],
    'Accuracy': [accuracy],
    'Precision': [precision],
    'Recall': [recall],
    'F1 score': [fscore],
    'Training Time': [training_time],
    'Prediction Time': [prediction_time]
})

# Display the evaluation metrics
evaluation_results

| | Model | Accuracy | Precision | Recall | F1 score | Training Time | Prediction Time |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest | 0.999 | 0.998437 | 0.99688 | 0.997658 | 0.183158 | 0.028219 |


As the results above show, the **Random Forest Classifier** scores above 0.99 on every measured metric (accuracy 0.999, precision 0.998, recall 0.997, F1 score 0.998) on the held-out 30% test split, which indicates that the model performs well on our dataset.
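
To make the four metrics concrete, the sketch below derives them by hand from a confusion matrix and checks the results against the scikit-learn functions used in this notebook. The labels `y_true` and `y_hat` are made up for illustration and are not taken from the churn data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = churned, 0 = stayed
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision_manual = tp / (tp + fp)                  # predicted churners who did churn
recall_manual = tp / (tp + fn)                     # actual churners that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# The hand-computed values agree with scikit-learn's helpers
assert abs(accuracy_manual - accuracy_score(y_true, y_hat)) < 1e-12
assert abs(precision_manual - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall_manual - recall_score(y_true, y_hat)) < 1e-12
assert abs(f1_manual - f1_score(y_true, y_hat)) < 1e-12

print(accuracy_manual, precision_manual, recall_manual, f1_manual)
```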

## Feature Importance

A random forest model provides a built-in measure of feature importance through its `feature_importances_` attribute. Let's inspect it to understand which features contribute the most to the prediction of churn.

In [57]:
# Collect the importances into a single-row DataFrame, one column per feature
feature_imp = pd.DataFrame([rf.feature_importances_], columns=rf.feature_names_in_)
feature_imp

| | Complain | IsActiveMember | Age | Gender | Balance | Point Earned | Geography | NumOfProducts |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.839098 | 0.014052 | 0.064713 | 0.003364 | 0.015885 | 0.011417 | 0.005717 | 0.045754 |


## Save Results

We save the results to a file so they can be compared with other models in later stages of this project phase.

In [58]:
# save results for Random Forest Classifier model
evaluation_results.to_csv(output_csv, index=False)
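
For the later comparison step, the per-model files could be combined along the following lines. This is only a sketch: the temporary directory, the second file name, and the accuracy values are placeholders invented for illustration, not results from this project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical results directory with one CSV per model (placeholder values)
results_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"Model": ["Random Forest"], "Accuracy": [0.95]}).to_csv(
    results_dir / "random_forest_results.csv", index=False
)
pd.DataFrame({"Model": ["Logistic Regression"], "Accuracy": [0.90]}).to_csv(
    results_dir / "logistic_regression_results.csv", index=False
)

# Read every *_results.csv file and stack the rows into one comparison table
frames = [pd.read_csv(path) for path in sorted(results_dir.glob("*_results.csv"))]
comparison = pd.concat(frames, ignore_index=True).sort_values(
    "Accuracy", ascending=False, ignore_index=True
)
print(comparison)
```

Writing each file with `index=False`, as done above, keeps the row index out of the CSV, so no `Unnamed: 0` column appears when the files are read back.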