<a href="https://colab.research.google.com/github/Chuks-hub/Capproject12/blob/main/churnassignmentphysical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Objective: build a customer churn prediction model that predicts whether a customer will unsubscribe or stop using a service.

Dataset Description

This dataset contains customer information, as described below. It can be used for data analysis and machine learning tasks such as predicting customer churn.


Churn: whether the customer cancelled service (target variable; 0: no churn, 1: churn)

AccountWeeks: number of weeks customer has had active account

ContractRenewal: if customer recently renewed contract

DataPlan: if customer has data plan

DataUsage: gigabytes of monthly data usage

CustServCalls: number of calls into customer service

DayMins: average daytime minutes per month

DayCalls: average number of daytime calls

MonthlyCharge: average monthly bill

OverageFee: largest overage fee in last 12 months

RoamMins: average number of roaming minutes

IMPORT LIBRARIES

In [12]:
import pandas as pd                  # For data manipulation and analysis using DataFrames (e.g., loading CSVs, filtering, grouping)
import numpy as np                   # For numerical operations, and mathematical functions used in data preprocessing
import seaborn as sns                # For advanced, beautiful statistical visualizations (like heatmaps, boxplots, etc.)
import matplotlib.pyplot as plt      # For creating plots and graphs (line charts, bar charts, confusion matrix visualization)
import joblib                        # For saving and loading trained models
from sklearn.model_selection import train_test_split   # For splitting the dataset into training and testing sets
from sklearn.preprocessing import LabelEncoder         # For converting categorical text labels into numerical values
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier # Import DecisionTreeClassifier


LOAD DATASET

In [17]:
df = pd.read_csv('telecom_churn.csv')  # Change to your actual path
df

Unnamed: 0,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,0,128,1,1,2.70,1,265.1,110,89.0,9.87,10.0
1,0,107,1,1,3.70,1,161.6,123,82.0,9.78,13.7
2,0,137,1,0,0.00,0,243.4,114,52.0,6.06,12.2
3,0,84,0,0,0.00,2,299.4,71,57.0,3.10,6.6
4,0,75,0,0,0.00,3,166.7,113,41.0,7.42,10.1
...,...,...,...,...,...,...,...,...,...,...,...
3328,0,192,1,1,2.67,2,156.2,77,71.7,10.78,9.9
3329,0,68,1,0,0.34,3,231.1,57,56.4,7.67,9.6
3330,0,28,1,0,0.00,2,180.8,109,56.0,14.44,14.1
3331,0,184,0,0,0.00,2,213.8,105,50.0,7.98,5.0


DATA EXPLORATION

In [None]:
df.shape

(3333, 11)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Churn            3333 non-null   int64  
 1   AccountWeeks     3333 non-null   int64  
 2   ContractRenewal  3333 non-null   int64  
 3   DataPlan         3333 non-null   int64  
 4   DataUsage        3333 non-null   float64
 5   CustServCalls    3333 non-null   int64  
 6   DayMins          3333 non-null   float64
 7   DayCalls         3333 non-null   int64  
 8   MonthlyCharge    3333 non-null   float64
 9   OverageFee       3333 non-null   float64
 10  RoamMins         3333 non-null   float64
dtypes: float64(5), int64(6)
memory usage: 286.6 KB


In [None]:
df.select_dtypes(include='number').columns

Index(['Churn', 'AccountWeeks', 'ContractRenewal', 'DataPlan', 'DataUsage',
       'CustServCalls', 'DayMins', 'DayCalls', 'MonthlyCharge', 'OverageFee',
       'RoamMins'],
      dtype='object')

In [None]:
df.isnull().sum()

Churn,0
AccountWeeks,0
ContractRenewal,0
DataPlan,0
DataUsage,0
CustServCalls,0
DayMins,0
DayCalls,0
MonthlyCharge,0
OverageFee,0


In [18]:
df['Churn'].value_counts()

Churn,count
0,2850
1,483
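
The counts above show a strongly imbalanced target. As a small sketch that uses only the two counts printed above (the variable names are illustrative), the churn rate and the accuracy of a trivial "always predict 0" baseline can be computed like this:

```python
# Counts copied from the value_counts() output above
no_churn_count = 2850
churn_count = 483
total = no_churn_count + churn_count

# Share of customers who churned
churn_rate = churn_count / total

# Accuracy of a model that always predicts "no churn"
majority_baseline_accuracy = no_churn_count / total

print(f"Churn rate: {churn_rate:.3f}")
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.3f}")
```

Since the baseline accuracy is already about 85%, accuracy alone says little here; precision and recall on class 1 are the more informative numbers.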


In [19]:
# Features and Target
X = df.drop('Churn', axis=1)
y = df['Churn']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# random_state=42 is used to control randomness in functions that involve random splitting or shuffling
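
Because only about 14% of customers churn, a plain random split can leave the train and test sets with slightly different churn rates. One hedged alternative is to pass `stratify=y` to `train_test_split`. The sketch below uses synthetic stand-in data with the same class counts as the dataset; `X_demo` and `y_demo` are assumptions, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the dataset, random features
rng = np.random.default_rng(42)
y_demo = np.array([0] * 2850 + [1] * 483)
X_demo = rng.normal(size=(len(y_demo), 10))

# stratify=y_demo keeps the churn proportion nearly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Overall churn rate:", round(y_demo.mean(), 4))
print("Train churn rate:  ", round(y_tr.mean(), 4))
print("Test churn rate:   ", round(y_te.mean(), 4))
```

The reported results in this notebook come from the unstratified split above, so switching to a stratified split would change the numbers somewhat.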

MODEL TRAINING

We train three models: Random Forest, Gradient Boosting, and Decision Tree.

In [20]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# StandardScaler standardizes each feature to mean = 0 and standard deviation = 1 (it does not rescale values to the range [0, 1]).
# This mainly helps scale-sensitive models; tree-based models are largely unaffected by feature scaling.
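
To illustrate what standardization does, the sketch below applies `StandardScaler` to the first five `AccountWeeks` and `DayMins` values shown in the dataset preview and checks the resulting column means and standard deviations. The names `X_small` and `scaler_demo` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five rows of AccountWeeks and DayMins from the preview above
X_small = np.array([
    [128.0, 265.1],
    [107.0, 161.6],
    [137.0, 243.4],
    [84.0, 299.4],
    [75.0, 166.7],
])

scaler_demo = StandardScaler()
X_small_scaled = scaler_demo.fit_transform(X_small)

# Each column now has mean ~0 and standard deviation ~1
col_means = X_small_scaled.mean(axis=0)
col_stds = X_small_scaled.std(axis=0)

print("Column means after scaling:", np.round(col_means, 6))
print("Column std devs after scaling:", np.round(col_stds, 6))
# Scaled values can be negative or greater than 1; they are not confined to [0, 1]
print("Min and max scaled value:", X_small_scaled.min(), X_small_scaled.max())
```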

In [23]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
dtclas_model = DecisionTreeClassifier(max_depth=100)


# Train models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
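# Note: unlike the two ensembles, the decision tree is fit on the scaled features,
# so any later dtclas_model.predict(...) call must receive X_test_scaled, not X_test.
dtclas_model.fit(X_train_scaled, y_train)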

In [24]:
models = {
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model,
    'Decision Tree': dtclas_model
}

for name, model in models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n{name} Results")
    print("Accuracy:", acc)
    print("Classification Report:\n", classification_report(y_test, y_pred))


Random Forest Results
Accuracy: 0.9265367316341829
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.96       566
           1       0.86      0.61      0.72       101

    accuracy                           0.93       667
   macro avg       0.90      0.80      0.84       667
weighted avg       0.92      0.93      0.92       667


Gradient Boosting Results
Accuracy: 0.9370314842578711
Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.98      0.96       566
           1       0.88      0.67      0.76       101

    accuracy                           0.94       667
   macro avg       0.91      0.83      0.86       667
weighted avg       0.93      0.94      0.93       667


Decision Tree Results
Accuracy: 0.8455772113943029
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.99      0.92       566
     



Random Forest achieved 93% accuracy. It predicts the no-churn class (0) with high precision (0.93) and recall (0.98). For the churn class (1), precision is 0.86 but recall is only 0.61, so it misses about 39% of actual churners. It is slightly less balanced than Gradient Boosting but still effective for churn prediction.

Gradient Boosting achieved 94% accuracy. It predicts the no-churn class (0) well (precision 0.94, recall 0.98) and the churn class (1) with precision 0.88 and recall 0.67. Recall on churners is still the weak point, but overall it’s the most balanced and reliable of the three models.

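Decision Tree Classifier performed worst, with about 85% accuracy (0.846), slightly below what always predicting no churn would score on this test set (566 of 667, about 85%). It did well only on the no-churn class (0), with precision 0.85 and recall 0.99, and misclassified almost all churners: the confusion matrix below shows only 1 of 101 identified, a recall of about 0.01 for class 1. A likely cause is that the tree was trained on scaled features (X_train_scaled) but evaluated on the unscaled X_test in the loop above, so this result should not be read as the model's true performance until it is re-evaluated on X_test_scaled.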
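
In the code above, the decision tree is fit on `X_train_scaled` while the evaluation loop passes the unscaled `X_test` to every model. One way to avoid this kind of mismatch is to bundle the scaler and the model in a scikit-learn pipeline, so the same preprocessing is applied at fit and predict time. The following is a minimal sketch on synthetic data generated with `make_classification` (an assumption; it is not the telecom dataset, and the resulting accuracy says nothing about the real data).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption; not the telecom dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data and reuses it at predict time
tree_pipeline = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(max_depth=100, random_state=42),
)
tree_pipeline.fit(X_tr, y_tr)

# Raw features go in; scaling happens inside the pipeline
y_pred_demo = tree_pipeline.predict(X_te)
pipeline_accuracy = accuracy_score(y_te, y_pred_demo)
print("Decision tree (pipeline) accuracy on synthetic data:", pipeline_accuracy)
```

With this design, a single evaluation loop can call `predict` on the same raw test set for every model without tracking which ones need scaled inputs.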

In [25]:
# y_pred still holds the predictions of the last model evaluated in the loop above (Decision Tree)
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

Confusion matrix:
 [[563   3]
 [100   1]]


THE ABOVE MEANS (rows: actual class, columns: predicted class)
[TN   FP]
[FN   TP]
Here TN = 563, FP = 3, FN = 100, TP = 1.
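
As a worked example, the four cells can be unpacked with `ravel()` and turned into accuracy, precision, and recall for class 1. The numbers are copied from the confusion matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = np.array([[563, 3],
               [100, 1]])

# For a 2x2 matrix, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()      # correct predictions / all predictions
precision_churn = tp / (tp + fp)     # of predicted churners, how many really churned
recall_churn = tp / (tp + fn)        # of actual churners, how many were caught

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (class 1): {precision_churn:.4f}")
print(f"Recall (class 1): {recall_churn:.4f}")
```

The accuracy computed this way matches the Decision Tree accuracy printed in the results above, which confirms that this matrix belongs to that model.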

In [None]:
# SAVING MODEL

joblib.dump(rf_model, 'rf_model.pkl')

['rf_model.pkl']
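
A saved model can be restored later with `joblib.load`. The sketch below shows the full save and load round trip on a small stand-in model trained on random data (an assumption, so that the example runs without `telecom_churn.csv`); the file name `rf_model_demo.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on random data (not the churn dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_model_demo.pkl")
    joblib.dump(demo_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.array_equal(
    demo_model.predict(X_demo), reloaded_model.predict(X_demo)
)
print("Reloaded model gives identical predictions:", same_predictions)
```

Pickled models are generally only safe to load with the same scikit-learn version that created them, and only from trusted sources.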

In [None]:
X_test

Unnamed: 0,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
438,113,1,0,0.00,1,155.0,93,55.0,16.53,13.5
2674,67,1,0,0.00,0,109.1,117,38.0,10.87,12.8
1345,98,1,0,0.00,4,0.0,0,14.0,7.98,6.8
1957,147,1,0,0.33,1,212.8,79,57.3,10.21,10.2
2148,96,1,0,0.30,1,144.0,102,47.0,11.24,10.0
...,...,...,...,...,...,...,...,...,...,...
2577,157,1,0,0.00,2,185.1,92,50.0,10.65,8.5
2763,116,1,1,2.21,3,155.7,104,65.1,9.27,8.2
3069,148,1,1,2.67,1,158.7,91,67.7,8.03,9.9
1468,75,1,1,1.13,3,117.5,102,49.3,10.34,4.2


In [None]:
y_test.head(10)

Unnamed: 0,Churn
438,0
2674,0
1345,1
1957,0
2148,0
3106,0
1786,0
321,0
3082,0
2240,0


In [None]:
predict = X_test.iloc[:10]
pred = rf_model.predict(predict)
print("Predictions:", pred)

Predictions: [0 0 1 0 0 0 0 0 0 0]


In [None]:
df1 = pd.read_csv('telecom_churn222.csv')
df1

Unnamed: 0,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,95,1,0,0.44,3,156.6,88,52.4,12.38,12.3
1,62,1,0,0.00,4,120.7,70,47.0,15.36,13.1
2,161,1,0,0.00,4,332.9,67,84.0,15.89,5.4
3,85,1,1,3.73,1,196.4,139,95.3,14.05,13.8
4,93,1,0,0.00,3,190.7,114,51.0,10.91,8.1
...,...,...,...,...,...,...,...,...,...,...
3303,192,1,1,2.67,2,156.2,77,71.7,10.78,9.9
3304,68,1,0,0.34,3,231.1,57,56.4,7.67,9.6
3305,28,1,0,0.00,2,180.8,109,56.0,14.44,14.1
3306,184,0,0,0.00,2,213.8,105,50.0,7.98,5.0


In [None]:
predict = df1
pred = rf_model.predict(predict)
print("Predictions:", pred)

Predictions: [0 1 1 ... 0 0 0]


In [None]:
dtclas_model.predict(X_test_scaled)

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,

In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
