**Objective**

The objective of this project is to build machine learning models that use customer data to predict whether a customer will churn.

**Dataset Description**

This dataset contains information about mobile service customers. We use it to build machine learning models that predict customer churn (whether a customer is likely to leave the service).

**Churn**: Target variable. Indicates whether the customer has cancelled their service (1: Customer cancelled service, 0: Customer did not cancel)

**AccountWeeks**: Number of weeks the customer has had an active service account. 

**ContractRenewal**: Whether the customer has recently renewed their contract (1: Yes, 0: No)

**DataPlan**: Whether the customer is subscribed to a data plan (1: Has data plan, 0: No data plan)

**DataUsage**: The average monthly mobile data usage (in gigabytes). 

**CustServCalls**: Number of calls the customer made to customer service. A high number may indicate that the customer is dissatisfied with the service.

**DayMins**: Average number of daytime calling minutes per month.

**DayCalls**: Average number of daytime calls per month.

**MonthlyCharge**: The average monthly bill amount for the customer.

**OverageFee**: The highest overage fee (charges for exceeding plan limits) incurred in the last 12 months for Postpaid plans.

**RoamMins**: Average number of roaming minutes per month. Indicates usage while outside the primary service area.

**Importing Libraries**

In [4]:
import pandas as pd                  
import numpy as np                   
import seaborn as sns                
import matplotlib.pyplot as plt                             
from sklearn.model_selection import train_test_split            
from sklearn.ensemble import RandomForestClassifier    
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix 

**Loading dataset**

In [6]:
df = pd.read_csv('telecom_churn.csv') 
df

,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,0,128,1,1,2.70,1,265.1,110,89.0,9.87,10.0
1,0,107,1,1,3.70,1,161.6,123,82.0,9.78,13.7
2,0,137,1,0,0.00,0,243.4,114,52.0,6.06,12.2
3,0,84,0,0,0.00,2,299.4,71,57.0,3.10,6.6
4,0,75,0,0,0.00,3,166.7,113,41.0,7.42,10.1
...,...,...,...,...,...,...,...,...,...,...,...
3328,0,192,1,1,2.67,2,156.2,77,71.7,10.78,9.9
3329,0,68,1,0,0.34,3,231.1,57,56.4,7.67,9.6
3330,0,28,1,0,0.00,2,180.8,109,56.0,14.44,14.1
3331,0,184,0,0,0.00,2,213.8,105,50.0,7.98,5.0


In [57]:
print(df.min())

Churn               0.0
AccountWeeks        1.0
ContractRenewal     0.0
DataPlan            0.0
DataUsage           0.0
CustServCalls       0.0
DayMins             0.0
DayCalls            0.0
MonthlyCharge      14.0
OverageFee          0.0
RoamMins            0.0
dtype: float64


In [59]:
print(df.max())

Churn                1.00
AccountWeeks       243.00
ContractRenewal      1.00
DataPlan             1.00
DataUsage            5.40
CustServCalls        9.00
DayMins            350.80
DayCalls           165.00
MonthlyCharge      111.30
OverageFee          18.19
RoamMins            20.00
dtype: float64


**Exploring dataset**

In [8]:
df.shape

(3333, 11)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Churn            3333 non-null   int64  
 1   AccountWeeks     3333 non-null   int64  
 2   ContractRenewal  3333 non-null   int64  
 3   DataPlan         3333 non-null   int64  
 4   DataUsage        3333 non-null   float64
 5   CustServCalls    3333 non-null   int64  
 6   DayMins          3333 non-null   float64
 7   DayCalls         3333 non-null   int64  
 8   MonthlyCharge    3333 non-null   float64
 9   OverageFee       3333 non-null   float64
 10  RoamMins         3333 non-null   float64
dtypes: float64(5), int64(6)
memory usage: 286.6 KB


In [10]:
df.head() 

,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,0,128,1,1,2.7,1,265.1,110,89.0,9.87,10.0
1,0,107,1,1,3.7,1,161.6,123,82.0,9.78,13.7
2,0,137,1,0,0.0,0,243.4,114,52.0,6.06,12.2
3,0,84,0,0,0.0,2,299.4,71,57.0,3.1,6.6
4,0,75,0,0,0.0,3,166.7,113,41.0,7.42,10.1


In [11]:
df.tail() 

,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
3328,0,192,1,1,2.67,2,156.2,77,71.7,10.78,9.9
3329,0,68,1,0,0.34,3,231.1,57,56.4,7.67,9.6
3330,0,28,1,0,0.0,2,180.8,109,56.0,14.44,14.1
3331,0,184,0,0,0.0,2,213.8,105,50.0,7.98,5.0
3332,0,74,1,1,3.7,0,234.4,113,100.0,13.3,13.7


In [12]:
df.isnull().sum() 

Churn              0
AccountWeeks       0
ContractRenewal    0
DataPlan           0
DataUsage          0
CustServCalls      0
DayMins            0
DayCalls           0
MonthlyCharge      0
OverageFee         0
RoamMins           0
dtype: int64

In [13]:
df.dtypes 

Churn                int64
AccountWeeks         int64
ContractRenewal      int64
DataPlan             int64
DataUsage          float64
CustServCalls        int64
DayMins            float64
DayCalls             int64
MonthlyCharge      float64
OverageFee         float64
RoamMins           float64
dtype: object

In [14]:
df.describe() 

,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,0.144914,101.064806,0.90309,0.276628,0.816475,1.562856,179.775098,100.435644,56.305161,10.051488,10.237294
std,0.352067,39.822106,0.295879,0.447398,1.272668,1.315491,54.467389,20.069084,16.426032,2.535712,2.79184
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,0.0,0.0
25%,0.0,74.0,1.0,0.0,0.0,1.0,143.7,87.0,45.0,8.33,8.5
50%,0.0,101.0,1.0,0.0,0.0,1.0,179.4,101.0,53.5,10.07,10.3
75%,0.0,127.0,1.0,1.0,1.78,2.0,216.4,114.0,66.2,11.77,12.1
max,1.0,243.0,1.0,1.0,5.4,9.0,350.8,165.0,111.3,18.19,20.0


In [15]:
#df.describe(include='number') 

In [16]:
df.duplicated().sum()

0

In [17]:
df['Churn'].value_counts()

Churn
0    2850
1     483
Name: count, dtype: int64
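
The counts above point to a clear class imbalance. As a quick arithmetic check (a sketch that hard-codes the two counts from the output instead of reading the CSV), the churn rate and the accuracy of a trivial model that always predicts class 0 work out as follows:

```python
# Class counts copied from the value_counts() output above.
counts = {0: 2850, 1: 483}

total = sum(counts.values())
churn_rate = counts[1] / total

# Accuracy of a trivial model that always predicts "no churn" (class 0)
# on the full dataset.
majority_baseline = counts[0] / total

print(f"Total customers: {total}")
print(f"Churn rate: {churn_rate:.1%}")
print(f"Always-predict-0 accuracy: {majority_baseline:.1%}")
```

Because roughly one customer in seven churns, accuracy on its own is a weak yardstick here; recall and precision for class 1 are more informative.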

In [18]:
for col in df.columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts(dropna=False))


Column: Churn
Churn
0    2850
1     483
Name: count, dtype: int64

Column: AccountWeeks
AccountWeeks
105    43
87     42
101    40
93     40
90     39
       ..
243     1
200     1
232     1
5       1
221     1
Name: count, Length: 212, dtype: int64

Column: ContractRenewal
ContractRenewal
1    3010
0     323
Name: count, dtype: int64

Column: DataPlan
DataPlan
0    2411
1     922
Name: count, dtype: int64

Column: DataUsage
DataUsage
0.00    1813
0.31      41
0.21      39
0.29      36
0.26      34
        ... 
5.40       1
4.40       1
4.73       1
4.64       1
0.68       1
Name: count, Length: 174, dtype: int64

Column: CustServCalls
CustServCalls
1    1181
2     759
0     697
3     429
4     166
5      66
6      22
7       9
9       2
8       2
Name: count, dtype: int64

Column: DayMins
DayMins
154.0    8
159.5    8
174.5    8
183.4    7
175.4    7
        ..
78.6     1
200.9    1
254.3    1
247.0    1
180.8    1
Name: count, Length: 1667, dtype: int64

Column: DayCalls
DayCalls
10

**Observations**
* The dataset has 3,333 rows and 11 columns.
* All columns are numeric (int64 or float64); there are no string or object columns.
* No column contains missing values.
* There are no duplicate rows.
* The target is imbalanced: 483 of the 3,333 customers (about 14.5%) churned.
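
The observations above can also be checked programmatically instead of by reading each output. The following is a minimal sketch: `check_basic_quality` is a hypothetical helper (not part of pandas), and `demo` is a three-row stand-in built from the first rows shown earlier so that the block runs without the CSV file.

```python
import pandas as pd

def check_basic_quality(frame: pd.DataFrame) -> dict:
    """Summarise shape, dtypes, missing values and duplicates (hypothetical helper)."""
    return {
        "shape": frame.shape,
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in frame.dtypes),
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Stand-in frame with three of the dataset's columns;
# in the notebook you would pass `df` instead.
demo = pd.DataFrame({
    "Churn": [0, 0, 0],
    "AccountWeeks": [128, 107, 137],
    "DayMins": [265.1, 161.6, 243.4],
})

summary = check_basic_quality(demo)
print(summary)
```

Calling `check_basic_quality(df)` on the full dataset should report a shape of (3333, 11) with no missing values or duplicates, in line with the outputs above.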

**Feature Engineering**

In [21]:
df

,Churn,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
0,0,128,1,1,2.70,1,265.1,110,89.0,9.87,10.0
1,0,107,1,1,3.70,1,161.6,123,82.0,9.78,13.7
2,0,137,1,0,0.00,0,243.4,114,52.0,6.06,12.2
3,0,84,0,0,0.00,2,299.4,71,57.0,3.10,6.6
4,0,75,0,0,0.00,3,166.7,113,41.0,7.42,10.1
...,...,...,...,...,...,...,...,...,...,...,...
3328,0,192,1,1,2.67,2,156.2,77,71.7,10.78,9.9
3329,0,68,1,0,0.34,3,231.1,57,56.4,7.67,9.6
3330,0,28,1,0,0.00,2,180.8,109,56.0,14.44,14.1
3331,0,184,0,0,0.00,2,213.8,105,50.0,7.98,5.0


In [22]:
# Features and Target
X = df.drop('Churn', axis=1)
y = df['Churn']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
# random_state=42 fixes the seed used to shuffle the rows, so the same split is produced on every run
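
The split above is purely random, so the share of churners in the test set can drift from the overall rate (the evaluation output further down shows 101 churners among 667 test rows, about 15.1%, against 14.5% overall). One common option, shown here only as a sketch and not as what this notebook ran, is to pass `stratify=y` so both parts keep the class ratio. `X_demo` and `y_demo` are synthetic stand-ins with the same class counts as the dataset; the features are random noise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the dataset's class counts
# (2850 non-churners, 483 churners); features are random noise.
rng = np.random.default_rng(42)
y_demo = np.array([0] * 2850 + [1] * 483)
X_demo = rng.normal(size=(len(y_demo), 3))

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Test rows:", len(y_te))
print("Churn rate overall:", round(y_demo.mean(), 4))
print("Churn rate in train:", round(y_tr.mean(), 4))
print("Churn rate in test:", round(y_te.mean(), 4))
```

Switching the notebook to a stratified split would change the rows in `X_test`, so the metrics reported below would no longer apply as printed.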

**Model training**


We train three tree-based classifiers: Random Forest, Gradient Boosting, and Decision Tree.

In [25]:

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)

# Train models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)
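
Each model above is fit once on a single train/test split, which gives one estimate of performance. A hedged sketch of how stratified k-fold cross-validation could give a steadier estimate of churn recall follows. It uses synthetic data from `make_classification` as a stand-in for the real features, and the choice of 5 folds is an assumption for illustration, not something taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (about 15% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=42
)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="recall" reports recall for the positive class (label 1).
recall_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")
print("Recall per fold:", recall_scores.round(2))
print("Mean recall:", round(recall_scores.mean(), 2))
```

On the real data, `X_demo` and `y_demo` would be replaced with `X_train` and `y_train`, which leaves the test set untouched for the final evaluation.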

**Model Evaluation**

In [27]:
models = {
    'Random Forest': rf_model,
    'Gradient Boosting': gb_model,
    'Decision Tree': dt_model
}

for name, model in models.items():
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n{name} Results")
    print("Accuracy:", acc)
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Random Forest Results
Accuracy: 0.9265367316341829
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.96       566
           1       0.86      0.61      0.72       101

    accuracy                           0.93       667
   macro avg       0.90      0.80      0.84       667
weighted avg       0.92      0.93      0.92       667

Confusion Matrix:
 [[556  10]
 [ 39  62]]

Gradient Boosting Results
Accuracy: 0.9370314842578711
Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.98      0.96       566
           1       0.88      0.67      0.76       101

    accuracy                           0.94       667
   macro avg       0.91      0.83      0.86       667
weighted avg       0.93      0.94      0.93       667

Confusion Matrix:
 [[557   9]
 [ 33  68]]

Decision Tree Results
Accuracy: 0.9130434782608695
Classification Report:
               precision    rec

**Random Forest** achieved 92.7% accuracy, performing very well on the majority class (0) with high precision (0.93) and recall (0.98). It is weaker on the minority class (1): precision is decent (0.86) but recall is only 0.61, meaning it misses 39 of the 101 churners in the test set. For context, always predicting class 0 would already score 84.9% accuracy (566 of 667), so accuracy alone overstates how well churners are detected.

**Gradient Boosting** achieved 93.7% accuracy, slightly outperforming Random Forest. It predicts non-churners (0) with excellent precision (0.94) and recall (0.98), and improves detection of churners (1) with better recall (0.67) and precision (0.88). It has the highest churner precision and recall of the three models, although it still misses 33 of the 101 churners.

**Decision Tree** achieved 91.3% accuracy. It performs well on non-churners (0) with high precision (0.92) and recall (0.98), but has difficulty detecting churners (1): recall is 0.51, so it finds only about half of them, despite reasonable precision (0.85). This model is less effective at identifying churners than the other two.
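
The churn-class figures quoted for Random Forest and Gradient Boosting can be recomputed directly from their confusion matrices. The sketch below hard-codes the two matrices printed above; the Decision Tree is left out because its output is cut off. `churn_metrics` is a hypothetical helper written for this illustration.

```python
# Confusion matrices copied from the output above, laid out as
# [[TN, FP], [FN, TP]] with churn (1) as the positive class.
matrices = {
    "Random Forest": [[556, 10], [39, 62]],
    "Gradient Boosting": [[557, 9], [33, 68]],
}

def churn_metrics(cm):
    """Accuracy plus precision/recall for the churn class (hypothetical helper)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "missed_churners": fn,
    }

results = {name: churn_metrics(cm) for name, cm in matrices.items()}
for name, m in results.items():
    print(name, {k: round(v, 3) for k, v in m.items()})

# Share of non-churners in the test set: the accuracy of always predicting 0.
baseline = 566 / 667
print("Majority-class baseline:", round(baseline, 3))
```

Reading the matrices this way makes the cost of each model concrete: the false-negative cell is the number of churners the model would fail to flag.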

In [29]:
X_test

,AccountWeeks,ContractRenewal,DataPlan,DataUsage,CustServCalls,DayMins,DayCalls,MonthlyCharge,OverageFee,RoamMins
438,113,1,0,0.00,1,155.0,93,55.0,16.53,13.5
2674,67,1,0,0.00,0,109.1,117,38.0,10.87,12.8
1345,98,1,0,0.00,4,0.0,0,14.0,7.98,6.8
1957,147,1,0,0.33,1,212.8,79,57.3,10.21,10.2
2148,96,1,0,0.30,1,144.0,102,47.0,11.24,10.0
...,...,...,...,...,...,...,...,...,...,...
2577,157,1,0,0.00,2,185.1,92,50.0,10.65,8.5
2763,116,1,1,2.21,3,155.7,104,65.1,9.27,8.2
3069,148,1,1,2.67,1,158.7,91,67.7,8.03,9.9
1468,75,1,1,1.13,3,117.5,102,49.3,10.34,4.2


In [30]:
y_test.head(10)

438     0
2674    0
1345    1
1957    0
2148    0
3106    0
1786    0
321     0
3082    0
2240    0
Name: Churn, dtype: int64


**Making predictions**

In [32]:
sample = X_test.iloc[:10]  # first 10 test rows, the same rows as y_test.head(10) above
pred = rf_model.predict(sample)
print("Predictions:", pred)

Predictions: [0 0 1 0 0 0 0 0 0 0]
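
`predict` returns hard 0/1 labels. Since the models miss a fair share of churners, one option is to work with `predict_proba` and flag customers at a lower probability threshold, trading precision for recall. The sketch below illustrates the idea on synthetic data from `make_classification`; the 0.3 threshold is an arbitrary value chosen for illustration, not a recommendation derived from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 15% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the estimated probability of class 1 (churn).
churn_proba = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    y_hat = (churn_proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f}, "
          f"flagged={int(y_hat.sum())}")
```

Lowering the threshold can only keep or increase the number of flagged customers, so recall cannot go down; the price is usually more false alarms among non-churners.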


In [33]:
import joblib

In [34]:
model_path = 'gb_model.pkl'
joblib.dump(gb_model, model_path)
# Print the file path, not the model object, so the message names the saved file
print(f"Model saved to '{model_path}' successfully.")

Model saved to 'gb_model.pkl' successfully.
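
A saved model is only useful if it can be loaded back and gives the same predictions. The following is a minimal round-trip sketch using `joblib.load`. It trains a small stand-in model on synthetic data and writes to a temporary directory, so the file name `gb_model_demo.pkl` is hypothetical and the notebook's own `gb_model.pkl` is not touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gb_model_demo.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Reloaded model reproduces predictions:", same_predictions)
```

In the notebook, the equivalent step would be `joblib.load('gb_model.pkl')` followed by a call to `predict` on rows with the same ten feature columns, in the same order, as `X_train`. Pickled models should only be loaded from trusted sources and with a compatible scikit-learn version.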
