<span style="color: Blue;">**Build a Random Forest Classifier**</span>

**Description:**
Implement a Random Forest model for classification on a telecom customer churn dataset (`churn-bigml-80.csv`).

**Objective:**

Train a Random Forest model and tune hyperparameters (e.g., number of trees, max depth). **|** Evaluate the model using cross-validation and classification metrics (precision, recall, F1-score). **|** Perform feature importance analysis to identify the most important features in the dataset.

**Tools Used:**
Python, scikit-learn, pandas, matplotlib

In [None]:
# Importing required python libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing data from csv
df = pd.read_csv('churn-bigml-80.csv')

In [None]:
# Viewing the first few rows
df.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


<span style="color: Blue;">**Data Preparation for Modeling**</span>

In [None]:
# Checking for missing values
df.isnull().sum()

Unnamed: 0,0
State,0
Account length,0
Area code,0
International plan,0
Voice mail plan,0
Number vmail messages,0
Total day minutes,0
Total day calls,0
Total day charge,0
Total eve minutes,0


In [None]:
# Identifying categorical and numerical variables
cat = []
num = []
for i in df.columns:
    if df[i].dtype == 'object':
        cat.append(i)
    else:
        num.append(i)

print("Categorical Variables", cat)
print("Numerical Variables", num)

Categorical Variables ['State', 'International plan', 'Voice mail plan']
Numerical Variables ['Account length', 'Area code', 'Number vmail messages', 'Total day minutes', 'Total day calls', 'Total day charge', 'Total eve minutes', 'Total eve calls', 'Total eve charge', 'Total night minutes', 'Total night calls', 'Total night charge', 'Total intl minutes', 'Total intl calls', 'Total intl charge', 'Customer service calls', 'Churn']


In [None]:
# Unique values in State column
df['State'].unique()

array(['KS', 'OH', 'NJ', 'OK', 'AL', 'MA', 'MO', 'WV', 'RI', 'IA', 'MT',
       'ID', 'VT', 'VA', 'TX', 'FL', 'CO', 'AZ', 'NE', 'WY', 'IL', 'NH',
       'LA', 'GA', 'AK', 'MD', 'AR', 'WI', 'OR', 'DE', 'IN', 'UT', 'CA',
       'SD', 'NC', 'WA', 'MN', 'NM', 'NV', 'DC', 'NY', 'KY', 'ME', 'MS',
       'MI', 'SC', 'TN', 'PA', 'HI', 'ND', 'CT'], dtype=object)

In [None]:
# Assessing variable type
df['Churn'].dtypes

dtype('bool')

In [None]:
# One-Hot Encoding of Categorical Features
df = pd.get_dummies(df, columns=cat, drop_first=True)

In [None]:
# Random sample of dataset
df.sample(5)

Unnamed: 0,Account length,Area code,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,...,State_TX,State_UT,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,International plan_Yes,Voice mail plan_Yes
625,92,415,0,197.0,84,33.49,269.3,105,22.89,158.9,...,False,False,False,False,False,False,False,False,False,False
340,115,415,0,184.1,98,31.3,327.0,73,27.8,212.5,...,False,False,False,False,False,False,False,False,False,False
1913,114,415,0,187.8,109,31.93,154.6,97,13.14,213.9,...,False,False,False,False,False,False,False,False,False,False
1931,130,408,19,152.9,87,25.99,213.2,99,18.12,205.3,...,False,False,False,False,False,False,False,False,False,True
2288,121,510,20,211.9,110,36.02,215.1,120,18.28,238.5,...,False,False,False,False,False,False,False,False,False,True
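
The effect of `drop_first=True` is easier to see on a tiny frame. The sketch below is an illustration only: `toy` is a hypothetical three-row frame using a few values borrowed from the `df.head()` output, not the real dataset. For each encoded column, the first category (in sorted order) is dropped and acts as the implicit baseline, which is why the 51 distinct `State` values produce 50 `State_*` columns in the real data.

```python
import pandas as pd

# Hypothetical miniature frame (values borrowed from the head() output above)
toy = pd.DataFrame({
    "Account length": [128, 107, 84],
    "International plan": ["No", "No", "Yes"],
    "State": ["KS", "OH", "OH"],
})

# Same call pattern as the notebook: one indicator column per category,
# minus the first category of each encoded column
encoded = pd.get_dummies(toy, columns=["State", "International plan"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

A row with `State_OH` equal to `False` here means "KS", the dropped baseline category.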


In [None]:
# Independent and Dependent variable split
X = df.drop('Churn', axis=1)
y = df['Churn']

In [None]:
# Importing Machine Learning libraries and tools
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

In [None]:
# Splitting the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
# Base Random Forest estimator, tuned with grid search in the next section
rf = RandomForestClassifier()

<span style="color: Blue;">**Model Building and Hyperparameter Tuning**</span>

In [None]:
# Random Forest hyperparameter Grid definition
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 3, 5],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2']
}

In [None]:
# Random Forest hyperparameter tuning using Grid Search
rf_cv = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

In [None]:
# Fitting GridSearchCV for Random Forest hyperparameter optimization
rf_cv.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
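
The "48 candidates, 240 fits" figure follows directly from the grid: every combination of the listed values is tried, and each one is fitted once per fold. A minimal sketch that reproduces the arithmetic with the standard library only (it enumerates the combinations, it does not train anything):

```python
from itertools import product

# Same grid as defined in the notebook
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 3, 5],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2'],
}

# Cartesian product of all value lists = one candidate per combination
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

cv_folds = 5  # matches cv=5 in the GridSearchCV call
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # 48 240
```

Adding one more value to any list multiplies the candidate count, so the search cost grows quickly as the grid widens.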


In [None]:
# Extracting the best Random Forest estimator
best_model = rf_cv.best_estimator_
print("Best parameters:", rf_cv.best_params_)
print("Best cross-validation score:", rf_cv.best_score_)

Best parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best cross-validation score: 0.9273125089333817


<span style="color: Blue;">**Model Evaluation**</span>


In [None]:
# Generating predictions using optimal Random Forest
y_pred = best_model.predict(X_test)

# Classification performance evaluation
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.95      1.00      0.97       461
        True       0.96      0.68      0.80        73

    accuracy                           0.95       534
   macro avg       0.96      0.84      0.89       534
weighted avg       0.95      0.95      0.95       534



**Precision**

Class 0 (`False`, no churn) has a precision of 0.95: when the model predicts that a customer will not churn, it is correct 95% of the time.

Class 1 (`True`, churn) has a precision of 0.96: when the model predicts that a customer will churn, it is correct 96% of the time.

**Recall**

Class 0 has a recall that rounds to 1.00, meaning the model correctly identifies nearly all customers who did not churn.

Class 1 has a recall of 0.68, meaning the model identifies only 68% of the customers who actually churned. With 73 churners in the test set, that leaves roughly 23 false negatives, so about a third of churners go undetected.

**F1-Score**

The model performs very well on Class 0 (F1-score = 0.97). For Class 1 the F1-score is 0.80, which is lower because the model misses many actual churners, even though it is precise when it does predict churn.
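
To make the link between these metrics concrete, the sketch below recomputes them for the churn class from confusion-matrix counts. The counts are an assumption: the notebook never prints a confusion matrix, so `tp`, `fp`, `fn`, and `tn` are values inferred to be consistent with the rounded report and the supports (461 and 73). It also computes the accuracy of a trivial model that always predicts "no churn", which is a useful reference point on an imbalanced test set.

```python
# Assumed counts, inferred from the rounded classification report
# (not printed anywhere in the notebook)
tp = 50   # churners correctly flagged
fn = 23   # churners missed (tp + fn = 73, the support of class True)
fp = 2    # non-churners wrongly flagged
tn = 459  # non-churners correctly passed (tn + fp = 461)

total = tp + fn + fp + tn

precision_true = tp / (tp + fp)
recall_true = tp / (tp + fn)
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)
accuracy = (tp + tn) / total

# Accuracy of always predicting False (the majority class)
baseline_accuracy = (tn + fp) / total

print(f"precision={precision_true:.2f} recall={recall_true:.2f} f1={f1_true:.2f}")
print(f"accuracy={accuracy:.2f} baseline={baseline_accuracy:.2f}")
```

Under these assumed counts, the majority-class baseline already reaches about 0.86 accuracy, so the recall on the churn class says more about the model's usefulness than the overall accuracy does.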

<span style="color: Blue;">**Feature Importance Analysis**</span>


In [None]:
# Identifying key predictors
importances = best_model.feature_importances_

# Constructing feature importance DataFrame
feature_names = X_train.columns
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by="Importance", ascending=False)

# Displaying complete feature importance table
pd.set_option("display.max_rows", None)
print(feature_importance_df)

                   Feature  Importance
3        Total day minutes    0.116826
5         Total day charge    0.116116
15  Customer service calls    0.097578
66  International plan_Yes    0.075486
8         Total eve charge    0.054796
6        Total eve minutes    0.053091
14       Total intl charge    0.046275
13        Total intl calls    0.044311
12      Total intl minutes    0.041666
11      Total night charge    0.040298
9      Total night minutes    0.039912
0           Account length    0.035471
4          Total day calls    0.034773
10       Total night calls    0.034369
7          Total eve calls    0.031326
2    Number vmail messages    0.021481
67     Voice mail plan_Yes    0.014159
1                Area code    0.010225
58                State_TX    0.004866
46                State_NJ    0.003870
36                State_ME    0.003319
41                State_MT    0.003306
49                State_NY    0.003144
38                State_MN    0.003006
37                State_M
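
Because one-hot encoding spreads a categorical variable over many columns, its importance is split across them. One way to read the table at the level of the original variables is to sum the dummy columns back into their source column. The sketch below is a rough illustration: `importances` holds only a handful of values copied from the printed table, and `encoded_sources` and `source_column` are hypothetical helper names, so the `State` total shown is partial.

```python
from collections import defaultdict

# A few rows copied from the printed importance table (not the full table)
importances = {
    "Total day minutes": 0.116826,
    "Total day charge": 0.116116,
    "Customer service calls": 0.097578,
    "International plan_Yes": 0.075486,
    "Voice mail plan_Yes": 0.014159,
    "State_TX": 0.004866,
    "State_NJ": 0.003870,
    "State_ME": 0.003319,
}

# Columns that were one-hot encoded earlier in the notebook
encoded_sources = ["State", "International plan", "Voice mail plan"]

def source_column(feature):
    """Map a dummy column such as 'State_TX' back to 'State'."""
    for src in encoded_sources:
        if feature.startswith(src + "_"):
            return src
    return feature

# Sum importances per original variable
grouped = defaultdict(float)
for feature, value in importances.items():
    grouped[source_column(feature)] += value

ranked = sorted(grouped.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f"{name:25s} {value:.6f}")
```

With the full table, the same grouping could be applied to `feature_importance_df` to compare `State` as a whole against the numeric features.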

# *I would greatly appreciate any advice or recommendations on enhancing the model's accuracy, precision, or overall performance.*

# *Thank you for your time and feedback.*