# **Multi-Level Classification ML Project to Predict Churn in Telecom**

# Introduction

In the competitive Telecom industry, customer retention is a critical challenge. A churn rate ranging from 10 to 60 percent significantly impacts a company's growth.

## Understanding Churn in Telecom

- **Churn**: Represents the percentage of customers discontinuing a service within a specific period.
- **Churn Category and Reason**:
  - *Churn Category*: Broad classification of why customers leave.
  - *Churn Reason*: Specific rationale behind a customer's decision to churn.
- **Multiclass Classification**: Used to predict both churn category and reason.

## Goal of Classification

The aim is to utilize multiclass and multilevel classification for:

- **Predicting Churn**: Determining the category and specific reasons for customer churn.
- **Strategy Development**: Assisting telecom companies in creating targeted retention strategies to reduce churn, boost customer retention, and increase satisfaction.
- **Proactive Measures**: Enabling proactive steps based on churn insights to enhance customer loyalty.




### **Business Impact of Churn Classification**



**Increase revenue with customer retention**: Reduced churn means the company is not losing customers to its competitors, and a satisfied customer keeps spending money on the platform.


**Improve customer acquisition cost**: If the company can stop existing customers from leaving, it does not need to spend extra money on acquiring new customers or, in an extreme case, on offers to win old customers back. This lowers the acquisition cost per customer.


**Improve customer satisfaction**: If the company can identify which customers will churn and the reason behind it, it can fix the problem and ultimately improve customer satisfaction.

# Machine learning:

Machine learning systems combine the strength of historical data and statistical techniques to explain the most likely churn reason for individual customers of the company. For example, an ML system can tell with high confidence whether a potential customer will churn and what the reason for the churn is.

### **Assumptions**

* We assume that <b>Churn Category & Churn Reason</b> are our target variables.
* We assume that whenever a churn reason is attributed to the wrong churn category, it should be moved to the proper one.
* The primary goal is churn category identification, a multi-class classification problem.
* After that, we incorporate the churn reason and formulate the problem as a multi-label classification problem.
* <i>Churn Category = Not Applicable</i> denotes that the customer didn't churn; it is our negative label.
* Columns with a lot of null values are not meaningful, and imputation won't help for them.
* Most of the missing values are imputed (some remain, as they can be imputed according to the problem).

## **Approach**


We treat this as a supervised learning problem with multi-label classification. Every data point has multiple target variables (two in this case: churn category and churn reason), so the model can learn the dependencies between them and predict on unseen data.


In practice, this model would tell the business which churn category a user falls into and the reason behind the churn. This, in turn, would help the company proactively prevent customers from leaving the platform.


Given our assumptions about the data, we will build a prediction model based on the historical data. Simplifying, here's the logic of what we'll build:


1. We will try to understand the churn data first and do a detailed, problem-specific EDA;
2. We'll build a model to predict the churn category using multi-class classification;
3. We'll then formulate the problem as a multi-label classification problem and predict both churn category and churn reason.
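
The two modelling steps above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the feature matrix and the way the labels are generated are made up, and the sketch relies on scikit-learn's `RandomForestClassifier` accepting a two-column target for multi-output classification.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: 300 customers, 4 numeric features.
X = rng.normal(size=(300, 4))

# Hypothetical targets: the category depends on feature 0,
# the reason refines the category using feature 1.
category = np.where(X[:, 0] > 0, "Price", "Support")
reason = np.where(
    X[:, 1] > 0,
    np.where(category == "Price", "Price too high", "Attitude of support person"),
    np.where(category == "Price", "Long distance charges", "Poor expertise of phone support"),
)

# Step 2: multi-class model for the churn category only.
category_model = RandomForestClassifier(n_estimators=50, random_state=0)
category_model.fit(X, category)

# Step 3: one model with two target columns (category and reason).
Y = np.column_stack([category, reason])
joint_model = RandomForestClassifier(n_estimators=50, random_state=0)
joint_model.fit(X, Y)

category_pred = category_model.predict(X[:5])  # one label per customer
joint_pred = joint_model.predict(X[:5])        # two labels per customer
print(category_pred)
print(joint_pred)
```

The joint model returns one row per customer with the predicted category in the first column and the predicted reason in the second.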

**Supervised Machine Learning:**

In supervised machine learning, the algorithm is trained on a labeled dataset with a predefined target variable. The goal is to learn the patterns and relationships that link the features to the target variable, using algorithms such as logistic regression, decision trees, or boosted trees.

**Multi-Class Classification:**

Multiclass classification is a type of machine learning task where the goal is to classify instances into one of three or more classes. In other words, given a set of input data, the task is to predict the class label of the instance, where the instance can belong to any one of the several predefined classes.

For example, in a medical diagnosis application, the goal may be to predict whether a patient has a certain disease, where the disease can be one of several possibilities. In this case, the output of the classification model would be a probability distribution over the possible classes, with the highest probability indicating the predicted class.

There are various algorithms that can be used for multiclass classification, including logistic regression, decision trees, random forests, support vector machines, and neural networks. The choice of algorithm depends on various factors, such as the nature of the data, the number of classes, and the available computing resources.
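
As a rough illustration of the "probability distribution over classes" idea, the sketch below fits a logistic regression on a synthetic three-class dataset and picks the class with the highest predicted probability. The data and parameter values are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem (an assumption for illustration only).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row per instance, one column per class; each row sums to 1.
proba = clf.predict_proba(X[:3])
# The predicted class is the one with the highest probability.
predicted = clf.classes_[proba.argmax(axis=1)]

print(proba.round(3))
print(predicted)
```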

**Multi-Label Classification:**

Multilabel classification is a type of machine learning task where the goal is to assign one or more class labels to an instance. Unlike multiclass classification, where an instance can belong to only one class, in multilabel classification, an instance can belong to more than one class simultaneously.

For example, in a news article categorization task, an article may belong to multiple categories, such as "politics", "world news", and "entertainment". In this case, the task would be to predict a set of labels that best describe the content of the article.

Multilabel classification is used in a variety of applications, such as text classification, image annotation, and recommendation systems, where multiple labels can be assigned to an instance based on its content or characteristics.
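
The news-article example can be made concrete with scikit-learn's `MultiLabelBinarizer`, which turns each set of labels into a row of 0/1 indicators. The three articles below are hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets for three news articles.
article_labels = [
    {"politics", "world news"},
    {"entertainment"},
    {"politics", "world news", "entertainment"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(article_labels)

print(mlb.classes_)  # label order used for the columns
print(Y)             # one row per article, one 0/1 column per label
```

Each row can contain several ones, which is what distinguishes this target format from a single multiclass label.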

# **Package Load**

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import shap

In [2]:
import branca.colormap as cm
import folium
import h3
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

import numpy as np
import pandas as pd
import plotly.express as px

from sklearn import preprocessing

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score,recall_score, precision_score, average_precision_score, classification_report
from sklearn.impute import SimpleImputer
# Suppress all warnings
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import MultiLabelBinarizer
# mlp for multi-label classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import RepeatedKFold
from keras.models import Sequential
from keras.layers import Dense
from keras.metrics import CategoricalAccuracy
from sklearn.metrics import accuracy_score




In [3]:
pd.set_option('display.max_columns', 200)

This sets the maximum number of columns displayed when a DataFrame is pretty-printed.
With this limit we can see up to 200 columns at once without truncation.


In [4]:
csv_file_path = "data/Telecom__Data.csv"
df = pd.read_csv(csv_file_path)

In [5]:
df.head(10)

Unnamed: 0,Customer ID,Month,Month of Joining,zip_code,Gender,Age,Married,Dependents,Number of Dependents,Location ID,Service ID,state,county,timezone,area_codes,country,latitude,longitude,arpu,roam_ic,roam_og,loc_og_t2t,loc_og_t2m,loc_og_t2f,loc_og_t2c,std_og_t2t,std_og_t2m,std_og_t2f,std_og_t2c,isd_og,spl_og,og_others,loc_ic_t2t,loc_ic_t2m,loc_ic_t2f,std_ic_t2t,std_ic_t2m,std_ic_t2f,std_ic_t2o,spl_ic,isd_ic,ic_others,total_rech_amt,total_rech_data,vol_4g,vol_5g,arpu_5g,arpu_4g,night_pck_user,fb_user,aug_vbc_5g,Churn Value,Referred a Friend,Number of Referrals,Phone Service,Multiple Lines,Internet Service,Internet Type,Streaming Data Consumption,Online Security,Online Backup,Device Protection Plan,Premium Tech Support,Streaming TV,Streaming Movies,Streaming Music,Unlimited Data,Payment Method,Status ID,Satisfaction Score,Churn Category,Churn Reason,Customer Status,offer,age_bucket,rank,rank_x
0,hthjctifkiudi0,1,1.0,71638,Female,36.0,No,No,0.0,jeavwsrtakgq0,bfbrnsqreveeuafgps0,AR,Chicot County,America/Chicago,870.0,US,33.52,-91.43,273.07,18.88,78.59,280.32,30.97,5.71,1.79,25.71,175.56,0.47,0.0,5.11,0.65,13.99,121.51,168.4,67.61,115.69,52.22,18.71,0.0,0.26,11.53,46.42,18.0,0.0,38.3,219.25,0.0,0.0,0.0,0.0,214.99,1,Yes,9.0,Yes,Yes,Yes,DSL,27,No,No,Yes,Yes,No,Yes,Yes,Yes,Credit Card,vvhwtmkbxtvsppd52013,3,Competitor,Competitor offered higher download speeds,Churned,A,35-49,284189.5,376025.0
1,uqdtniwvxqzeu1,6,6.0,72566,Male,36.657146,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,-329.96,69.46,72.08,255.73,148.8,30.0,7.61,308.29,265.2,10.82,0.0,1.23,905.51,1.69,212.93,155.19,29.04,9.15,38.89,0.84,0.0,0.05,32.51,25.53,1183.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,No,0.0,Yes,Yes,No,,14,No,Yes,No,No,Yes,No,No,No,Bank Withdrawal,jucxaluihiluj82863,4,Not Applicable,Not Applicable,Stayed,F,35-49,480205.5,268482.5
2,uqdtniwvxqzeu1,7,6.0,72566,Male,36.605901,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,101.22,1012.6,115.26,52.95,1151.734045,103.28,15.71,244.2,15.19,61.834952,0.0,13.14,455.15,115.63,121.8,699.39,44.49,83.59,914.7,13.25,0.0,0.06,13.05,5.62,295.0,7.0,14.83,967.95,-9.4,106.3,1.0,1.0,85.87,0,Yes,6.0,Yes,No,Yes,Cable,82,No,No,Yes,No,Yes,No,No,Yes,Credit Card,vjskkxphumfai57182,3,Not Applicable,Not Applicable,Stayed,No Offer,35-49,284189.5,641205.5
3,uqdtniwvxqzeu1,8,6.0,72566,Male,36.942957,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,215.48,84.18,99.85,140.51,4006.99,280.86,6.33,346.14,103.15,183.53,0.0,33.88,495.6,14.01,658.96,195.02,144.11,50.18,2.35,623.94,0.0,0.07,69.13,10.62,354.0,1.0,264.9,268.11,-5.15,77.53,0.0,1.0,268.38,0,Yes,10.0,Yes,No,Yes,Fiber Optic,57,No,No,Yes,No,Yes,No,No,Yes,Wallet Balance,cdwbcrvylqca53109,4,Not Applicable,Not Applicable,Stayed,J,35-49,480205.5,530842.0
4,uqdtniwvxqzeu1,9,6.0,72566,Male,36.631143,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,636.55,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,138.85,201.92,19.89,15.91,23.78,16.01,0.0,0.03,64.35,36.18,0.0,0.0,52.78,370.59,0.0,0.0,0.0,0.0,399.84,0,Yes,1.0,No,No,Yes,Fiber Optic,38,No,No,No,No,No,Yes,No,Yes,Credit Card,whqrmeulitfj98550,1,Not Applicable,Not Applicable,Stayed,No Offer,35-49,52747.5,432385.5
5,uqdtniwvxqzeu1,10,6.0,72566,Male,36.601209,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,771.79,775.76,81.54,67.47,78.41,6.75,6.97,193.49,215.74,0.18,0.0,12.88,33.07,11.55,156.69,128.45,73.36,6.84,16.09,25.74,0.0,0.08,61.54,63.21,700.0,0.0,0.0,0.0,789.0,0.0,0.0,1.0,0.0,0,Yes,10.0,Yes,Yes,No,,21,No,No,No,Yes,Yes,Yes,Yes,,Bank Withdrawal,adzabvpghmbju72072,4,Not Applicable,Not Applicable,Stayed,No Offer,35-49,480205.5,340222.5
6,uqdtniwvxqzeu1,11,6.0,72566,Male,36.693916,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,565.53,61.23,75.63,281.94,1691.32,48.84,6.5,304.78,186.81,171.1,0.0,38.18,23.13,13.57,50.28,23.2,11.44,14.58,60.26,20.67,0.0,0.04,40.03,65.16,153.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,No,0.0,Yes,,No,,62,No,No,Yes,No,Yes,No,No,No,Bank Withdrawal,xjnuhhfmfgtd73026,4,Not Applicable,Not Applicable,Stayed,No Offer,35-49,480205.5,554682.0
7,uqdtniwvxqzeu1,12,6.0,72566,Male,36.621807,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,403.33,73.91,90.44,98.25,209.47,4.1,13.52,456.56,146.11,2.47,0.0,2.18,2.82,1.94,19.77,88.7,20.37,58.37,65.31,20.39,0.0,2.21,45.58,9.45,171.0,8.0,190.62,175.35,224.56,176.23,0.0,1.0,8052.18,0,Yes,0.0,Yes,No,Yes,DSL,23,No,No,Yes,Yes,Yes,No,No,Yes,Bank Withdrawal,igrkenxzyvdw27549,3,Not Applicable,Not Applicable,Stayed,No Offer,35-49,284189.5,353280.5
8,uqdtniwvxqzeu1,13,6.0,72566,Male,36.637909,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,540.14,56.2,29.64,187.39,167.55,13.48,14.41,283.25,24.86,3.88,0.0,25.21,0.61,2.26,180.3,135.53,63.44,65.78,51.91,20.82,0.0,0.07,21.05,11.1,36.0,0.0,210.04,651.67,0.0,0.0,0.0,0.0,55.69,0,Yes,6.0,Yes,Yes,Yes,Cable,0,No,No,No,Yes,No,No,No,Yes,Bank Withdrawal,srrfeoupvdnwy37904,3,Not Applicable,Not Applicable,Stayed,No Offer,35-49,284189.5,75599.5
9,uqdtniwvxqzeu1,14,6.0,72566,Male,36.459363,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,1330.04,1582.05,157.2,161.81,1827.38,39.79,1.0,1362.59,5267.31,171.81,0.0,390.32,24.94,511.23,2128.61,2896.11,54.41,100.54,585.44,162.7,0.0,0.11,10.46,1247.37,255.0,0.0,0.0,0.0,254687.0,254687.0,0.0,1.0,0.0,0,Yes,9.0,Yes,No,No,,74,No,No,Yes,No,Yes,No,No,No,Credit Card,inebwpymzwpup39698,4,Not Applicable,Not Applicable,Stayed,No Offer,35-49,480205.5,608671.5


In [6]:
df.shape

(653435, 77)

In [7]:
df['arpu'].mean()

781.2588237085557

In [8]:
df[df['Customer Status'] == 'Churned']['arpu'].mean()

893.6721309663372

In [9]:
df[df['Customer Status'] == 'Churned']['arpu'].median()

354.98

In [10]:
churn_val = df[df['Customer Status'] == 'Churned']['arpu'].sum()
churn_val

26680581.47

In [11]:
total_val=df['arpu'].sum()
total_val

510501859.4700001

In [12]:
churn_val/total_val

0.05226343641078921

In [13]:
df['arpu'].median()

348.54

In [14]:
# Checking the names of the columns
df.columns

Index(['Customer ID', 'Month', 'Month of Joining', 'zip_code', 'Gender', 'Age',
       'Married', 'Dependents', 'Number of Dependents', 'Location ID',
       'Service ID', 'state', 'county', 'timezone', 'area_codes', 'country',
       'latitude', 'longitude', 'arpu', 'roam_ic', 'roam_og', 'loc_og_t2t',
       'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
       'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others',
       'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m',
       'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others',
       'total_rech_amt', 'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g',
       'arpu_4g', 'night_pck_user', 'fb_user', 'aug_vbc_5g', 'Churn Value',
       'Referred a Friend', 'Number of Referrals', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Internet Type',
       'Streaming Data Consumption', 'Online Security', 'Online Backup',
       'Device Protection Plan', 'Premium Tech Suppo

# **Exploratory Data Analysis**

In [15]:
# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
df.describe()

Unnamed: 0,Month,Month of Joining,zip_code,Age,Number of Dependents,latitude,longitude,arpu,roam_ic,roam_og,loc_og_t2t,loc_og_t2m,loc_og_t2f,loc_og_t2c,std_og_t2t,std_og_t2m,std_og_t2f,std_og_t2c,isd_og,spl_og,og_others,loc_ic_t2t,loc_ic_t2m,loc_ic_t2f,std_ic_t2t,std_ic_t2m,std_ic_t2f,std_ic_t2o,spl_ic,isd_ic,ic_others,total_rech_amt,total_rech_data,vol_4g,vol_5g,arpu_5g,arpu_4g,night_pck_user,fb_user,aug_vbc_5g,Churn Value,Number of Referrals,Streaming Data Consumption,Satisfaction Score,rank,rank_x
count,653435.0,653435.0,653435.0,653435.0,648501.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653435.0,653048.0,653435.0,653435.0,653435.0,653435.0
mean,9.508305,5.823839,90386.334673,36.626053,1.161224,37.65268,-114.39237,781.258824,249.655375,267.540301,834.606074,678.58326,32.273571,30.330951,577.642346,441.105848,34.878435,0.0,49.614174,88.976941,98.301653,846.725928,720.762928,330.925795,369.450039,306.393199,125.712416,0.0,0.251296,251.295182,144.004695,1687.021829,3.294941,192.081365,2240.84154,6119.586352,6202.56848,0.093439,0.296233,530.598599,0.045689,4.337998,27.575505,3.131739,326718.0,326718.0
std,3.298722,2.855191,8412.169661,12.168746,2.254352,4.687393,10.537229,1807.379984,424.622504,625.589366,1589.561071,1112.518751,59.022345,67.614079,1306.949186,990.265335,57.921764,0.0,113.259631,168.72137,161.823356,1216.184636,1009.699978,483.248631,631.212783,437.627086,185.824242,0.0,0.438389,442.332364,293.132738,2979.988658,7.220396,592.039866,4582.944286,35738.095005,36034.204591,0.291046,0.456595,1402.929329,0.208811,3.769508,26.349922,1.249102,180979.54535,187449.358554
min,1.0,1.0,71601.0,19.0,0.0,31.79,-124.63,-2258.68,-25.039781,-108.56445,-30.483143,0.0,-6.931819,0.0,0.0,0.0,-2.206085,0.0,0.0,-42.624069,-12.852701,0.0,0.0,0.0,-40.914292,0.0,0.0,0.0,0.0,-49.526113,-58.174176,0.0,0.0,0.0,0.0,-16.7,-10.47,0.0,0.0,-371.673768,0.0,0.0,0.0,1.0,52747.5,75599.5
25%,7.0,3.0,88424.0,28.0,0.0,34.14,-121.65,118.96,12.08,14.7,32.695,26.26,1.46,1.61,33.12,25.55,1.19,0.0,3.25,4.94,3.43,85.565,84.17,36.12,42.46,32.19,12.46,0.0,0.04,26.98,20.33,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3.0,284189.5,168800.0
50%,10.0,6.0,93032.0,34.0,0.0,36.08,-118.39,348.54,50.56,75.1,171.33,135.46,7.8,8.18,174.61,134.8,6.34,0.0,17.19,25.58,17.83,171.5,168.39,72.06,84.47,64.76,24.98,0.0,0.08,53.7,40.54,374.0,0.0,47.01,274.14,0.0,0.0,0.0,0.0,117.36,0.0,4.0,20.0,3.0,284189.5,329966.5
75%,12.0,8.0,95552.0,43.0,1.0,38.6,-108.8,580.655,162.06,135.29,309.09,618.31,14.09,14.700068,316.24,244.51,36.64,0.0,31.14,46.19,106.79,1259.265,1090.1,496.805,126.28,448.831338,186.72,0.0,0.21,80.38,60.73,1089.0,2.0,154.91,895.855,194.63,228.4,0.0,1.0,311.76,0.0,8.0,49.0,4.0,480205.5,489750.5
max,14.0,12.0,99403.0,80.0,9.0,48.99,-89.74,9394.5,1719.43,3161.78,6431.25,4212.01,283.53,336.13,7366.16,5622.54,217.44,0.0,765.05,1020.71,609.8,4363.95,3847.110525,1872.34,2527.07,1619.68,663.93,0.0,2.33,1917.756829,1344.13,11900.0,32.0,4503.93,19876.75,254687.0,254687.0,1.0,1.0,8214.87,1.0,11.0,85.0,5.0,600175.5,652416.5


In [16]:
df['Churn Value'].describe()

count    653435.000000
mean          0.045689
std           0.208811
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: Churn Value, dtype: float64

## **Data Dictionary**



| Column name | Description |
| ----- | ----- |
| Customer ID|  Unique identifier for each customer |
| Month|  Calendar Month - 1:14; month for which the data is captured |
| Month of Joining|  Calendar Month - 1:12 |
| zip_code|  Zip Code |
| Gender|  Gender |
| Age|  Age(Years) |
| Married|  Marital Status |
| Dependents|  Dependents - Binary |
| Number of Dependents|  Number of Dependents |
| Location ID|  Location ID |
| Service ID|  Service ID |
| state|  State |
| county|  County |
| timezone|  Timezone |
| area_codes|  Area Code |
| country|  Country |
| latitude|  Latitude |
| longitude|  Longitude |
| arpu|  Average revenue per user |
| roam_ic|  Roaming incoming calls in minutes |
| roam_og|  Roaming outgoing calls in minutes |
| loc_og_t2t|  Local outgoing calls within same network in minutes |
| loc_og_t2m|  Local outgoing calls outside network in minutes(outside same + partner network) |
| loc_og_t2f|  Local outgoing calls with Partner network in minutes |
| loc_og_t2c|  Local outgoing calls with Call Center in minutes |
| std_og_t2t|  STD outgoing calls within same network in minutes |
| std_og_t2m|  STD outgoing calls outside network in minutes(outside same + partner network) |
| std_og_t2f|  STD outgoing calls with Partner network in minutes |
| std_og_t2c|  STD outgoing calls with Call Center in minutes |
| isd_og|  ISD Outgoing calls |
| spl_og|  Special Outgoing calls |
| og_others|  Other Outgoing Calls |
| loc_ic_t2t|  Local incoming calls within same network in minutes |
| loc_ic_t2m|  Local incoming calls outside network in minutes(outside same + partner network) |
| loc_ic_t2f|  Local incoming calls with Partner network in minutes |
| std_ic_t2t|  STD incoming calls within same network in minutes |
| std_ic_t2m|  STD incoming calls outside network in minutes(outside same + partner network) |
| std_ic_t2f|  STD incoming calls with Partner network in minutes |
| std_ic_t2o|  STD incoming calls operators other networks in minutes |
| spl_ic|  Special Incoming calls in minutes |
| isd_ic|  ISD Incoming calls in minutes |
| ic_others|  Other Incoming Calls |
| total_rech_amt|  Total Recharge Amount in Local Currency |
| total_rech_data|  Total Recharge Amount for Data in Local Currency |
| vol_4g|  4G Internet Used in GB |
| vol_5g|  5G Internet used in GB |
| arpu_5g|  Average revenue per user over 5G network |
| arpu_4g|  Average revenue per user over 4G network |
| night_pck_user|  Is Night Pack User(Specific Scheme) |
| fb_user|  Social Networking scheme |
| aug_vbc_5g|  Volume Based cost for 5G network (outside the scheme paid based on extra usage) |
| offer|  Offer Given to User |
| Referred a Friend|  Referred a Friend : Binary |
| Number of Referrals|  Number of Referrals |
| Phone Service|  Phone Service: Binary |
| Multiple Lines|  Multiple Lines for phone service: Binary |
| Internet Service|  Internet Service: Binary |
| Internet Type|  Internet Type |
| Streaming Data Consumption|  Streaming Data Consumption |
| Online Security|  Online Security |
| Online Backup|  Online Backup |
| Device Protection Plan|  Device Protection Plan |
| Premium Tech Support|  Premium Tech Support |
| Streaming TV|  Streaming TV |
| Streaming Movies|  Streaming Movies |
| Streaming Music|  Streaming Music |
| Unlimited Data|  Unlimited Data |
| Payment Method|  Payment Method |
| Status ID|  Status ID |
| Satisfaction Score|  Satisfaction Score |
| Churn Category|  Churn Category |
| Churn Reason|  Churn Reason |
| Customer Status|  Customer Status |
| Churn Value|  Binary Churn Value |

In [17]:
# Check the Information of the Dataframe, datatypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 653435 entries, 0 to 653434
Data columns (total 77 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Customer ID                 653435 non-null  object 
 1   Month                       653435 non-null  int64  
 2   Month of Joining            653435 non-null  float64
 3   zip_code                    653435 non-null  int64  
 4   Gender                      653435 non-null  object 
 5   Age                         653435 non-null  float64
 6   Married                     653435 non-null  object 
 7   Dependents                  653435 non-null  object 
 8   Number of Dependents        648501 non-null  float64
 9   Location ID                 653435 non-null  object 
 10  Service ID                  653435 non-null  object 
 11  state                       653435 non-null  object 
 12  county                      653435 non-null  object 
 13  timezone      

**Observation:** 
* We can see some null values in this data. We will treat them later.
* There are multiple columns relating to churn. We will have to look into them and drop the irrelevant ones; otherwise they will cause feature leakage in the model.
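
One way to spot leaking columns is to check whether a column's values map one-to-one onto the target: if they do, the column effectively encodes the answer. The sketch below uses a made-up five-row frame; only the column names come from the dataset, while the `churned` helper column and the `determines_target` function are hypothetical.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Customer Status": ["Stayed", "Churned", "Stayed", "Churned", "Stayed"],
    "Churn Value": [0, 1, 0, 1, 0],
    "Satisfaction Score": [4, 3, 3, 1, 4],
    "Churn Category": ["Not Applicable", "Price", "Not Applicable",
                       "Support", "Not Applicable"],
})

# Hypothetical helper target: did the customer churn at all?
toy["churned"] = toy["Churn Category"] != "Not Applicable"

def determines_target(frame, column, target):
    # True when each distinct value of `column` co-occurs with a single target value.
    return bool((frame.groupby(column)[target].nunique() == 1).all())

leak_report = {
    col: determines_target(toy, col, "churned")
    for col in ["Customer Status", "Churn Value", "Satisfaction Score"]
}
print(leak_report)
```

On real data a perfect mapping is a strong hint rather than proof, so flagged columns still deserve a manual look before they are dropped.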

In [18]:
df['Customer Status'].value_counts(normalize=True)

Customer Status
Stayed     0.954311
Churned    0.045689
Name: proportion, dtype: float64

In [19]:
df['Churn Reason'].value_counts(normalize=True)

Churn Reason
Not Applicable                               0.951681
Unknown                                      0.002781
43tgeh                                       0.002747
Service dissatisfaction                      0.002380
Lack of self-service on Website              0.002364
Attitude of support person                   0.002349
Moved                                        0.002266
Competitor offered more data                 0.002262
Competitor had better devices                0.002231
Poor expertise of online support             0.002211
Product dissatisfaction                      0.002167
Attitude of service provider                 0.002156
Long distance charges                        0.002126
Lack of affordable download/upload speed     0.002113
Competitor made better offer                 0.002078
Price too high                               0.002077
Competitor offered higher download speeds    0.002068
Poor expertise of phone support              0.002035
Network reliabi


* Not Applicable refers to cases where the customer didn't churn
* There are garbage values present in churn reason (for example, `43tgeh`)

In [20]:
df['Churn Category'].value_counts()

Churn Category
Not Applicable     622441
Support              7535
Dissatisfaction      6000
Competitor           5970
Price                4380
Other                4355
Unknown              1269
bcvjhdjcb            1189
Attitude              296
Name: count, dtype: int64

In [21]:
df['Customer ID'].value_counts()

Customer ID
dqxnbgqlktkrz121392    14
xisowiodronug28827     14
uaewvtjhhmzmt28415     14
loalggxcfareo127312    14
ugksshqyowamu139051    14
                       ..
smcupkvgdunti119004     1
mpthmaolhfpex66526      1
durfqlkpbxgku66533      1
lioyqidmlbqfq119002     1
hthjctifkiudi0          1
Name: count, Length: 98187, dtype: int64

* There are multiple rows for each customer, since the data is at a monthly level

In [22]:
df = df[df['Churn Reason'] != 'Moved']
df = df[df['Churn Reason'] != 'Deceased']

* We also have a number of churn reasons that we cannot influence in any way, such as relocation or death.
* We will remove them from the dataset, since there are few such rows and the company cannot act on them.

In [23]:
df[df['Churn Category'].isin(['Other', 'Unknown', 'bcvjhdjcb', 'Attitude'])].groupby(['Churn Category', 'Churn Reason'])['Customer ID'].nunique()

Churn Category  Churn Reason                             
Attitude        43tgeh                                          1
                Limited range of services                     187
                Unknown                                       108
Other           43tgeh                                         11
                Don't know                                   1284
                Limited range of services                     186
                Unknown                                       116
Unknown         43tgeh                                          1
                Attitude of service provider                    1
                Attitude of support person                      2
                Competitor had better devices                   1
                Competitor made better offer                    7
                Competitor offered higher download speeds       4
                Competitor offered more data                    3
                Do

* A few categories are irrelevant or garbled and won't add value to the model; we will clean those categories

In [24]:
def clean_churn_category(category, reason):
    # Check if reason is NaN (float type in pandas)
    if pd.isna(reason):
        return category

    if reason in ['Lack of affordable download/upload speed', 'Limited range of services', 'Network reliability'] \
        or 'dissatisfaction' in reason.lower():
        category = "Dissatisfaction"

    if "Price" in reason:
        category = "Price"
    if "Competitor" in reason:
        category = "Competitor"
    if "support" in reason.lower() or reason in ['Lack of self-service on Website']:
        category = "Support"

    if category in ["bcvjhdjcb", "Other", "Unknown", "Attitude"] or reason == 'Unknown':
        category = "Other"
    if reason in ['Attitude of service provider']:
        category = "Support"
    if reason in ['Extra data charges', 'Long distance charges']:
        category = "Price"

    return category

# Re-assign each row's churn category based on its churn reason
df['Churn Category'] = df[['Churn Category', 'Churn Reason']].apply(
    lambda x: clean_churn_category(x['Churn Category'], x['Churn Reason']), axis=1
)


In [25]:
df = df[df['Churn Reason'] != '43tgeh']
df.drop(df[(df['Churn Category'] == 'Competitor') & (df['Churn Reason'] == 'Unknown')].index , inplace=True)
df['Churn Reason'] = df[['Churn Reason', 'Churn Category']]\
                .apply(lambda x: 'Unknown' if x['Churn Category']=='Other' else x['Churn Reason'], axis=1)

In [30]:
df.drop("country", inplace=True, axis=1)

# **Data Processing & Feature engineering**

#### **Data Preprocessing and Leakage**

The data we are preprocessing will need to be reviewed to determine areas of leakage before the final product is submitted

### Dropping Irrelevant Features and IDs

In [None]:
data = df.copy()

## Drop ID columns

In [None]:
data = data.drop(["Location ID", "Service ID", "area_codes", "Status ID"], axis=1)

In [None]:
data = data.drop(['Customer ID','zip_code','state','county','latitude','longitude', 
                  'night_pck_user', 'fb_user', 'Customer Status'], axis = 1)

In [None]:
data.info()

We will treat the missing values now, although they are present in very few columns.

In [None]:
data['Internet Type'].value_counts()

In [None]:
data['total_rech_data'] = pd.to_numeric(data['total_rech_data'], errors='coerce')

# Fill missing values in 'Internet Type' with "Other"
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
data['Internet Type'] = data['Internet Type'].fillna("Other")

# Fill missing values in 'total_rech_data' with the mean of the column
data['total_rech_data'] = data['total_rech_data'].fillna(data['total_rech_data'].mean())

### Label Encoding

**Transforming Categorical Variables**


One way to transform categorical variables is label encoding, which assigns a unique integer to each category in the variable. This approach is most useful when the categories have a natural order or ranking, such as low, medium, and high.
Here we transform the categorical features into numerical labels.

**Note:** We are NOT using dummies here to minimize the explosion of columns because of the distance methods we are using.
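
A minimal sketch of label encoding with scikit-learn's `LabelEncoder`, using made-up `train` and `test` columns. The mapping is learned on the training values and then reused on the test values, so that a category gets the same integer in both.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical train/test columns sharing the same categories.
train = pd.Series(["low", "high", "medium", "low"])
test = pd.Series(["medium", "low"])

le = LabelEncoder()
train_encoded = le.fit_transform(train)  # learn the mapping on the training data
test_encoded = le.transform(test)        # reuse the same mapping on the test data

mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(train_encoded, test_encoded)
```

Note that `LabelEncoder` numbers the categories in sorted order (here `high`, `low`, `medium`), so the integers do not follow the natural low < medium < high ranking unless the order is specified some other way.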


In [None]:
x = data.drop("Churn Value", axis = 1)
y = data['Churn Value']

In [None]:
df_reason = x['Churn Reason']
y = x['Churn Category']
x = x.drop(['Churn Category', 'Churn Reason'], axis=1)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state =2, test_size = 0.2)
print(x_train.shape, x_test.shape)

In [None]:
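def encode_data(train, test):
    # Fit each encoder on the training split and reuse it on the test split,
    # so that a category maps to the same integer in both.
    # Note: transform raises an error if the test split holds a category unseen in training.
    for column in train.columns:
        if train[column].dtype == 'object':
            le = LabelEncoder()
            train[column] = le.fit_transform(train[column])
            test[column] = le.transform(test[column])
    return train, test

x_train, x_test = encode_data(x_train, x_test)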


In [None]:
class_distribution = y_train.value_counts()
print(class_distribution)


In [None]:


# Resampling targets: raise every churn class to at least 10% of the majority class
target_proportion = 0.1  # example proportion
majority_class_count = class_distribution['Not Applicable']
target_count = int(majority_class_count * target_proportion)

sampling_strategy = {class_label: max(count, target_count)
                     for class_label, count in class_distribution.items()
                     if class_label != 'Not Applicable'}
# Example undersampling strategy (8000 is a placeholder; replace it with your chosen count)
under_sampling_strategy = {'Not Applicable': 8000}

# Handling Missing Values: imputers are fitted on the training split and applied to the test split
numeric_imputer = SimpleImputer(strategy='mean')
non_numeric_imputer = SimpleImputer(strategy='most_frequent')

for col in x_train.columns:
    if x_train[col].dtype == 'object':
        x_train[col] = non_numeric_imputer.fit_transform(x_train[[col]]).ravel()
        x_test[col] = non_numeric_imputer.transform(x_test[[col]]).ravel()
    else:
        x_train[col] = numeric_imputer.fit_transform(x_train[[col]]).ravel()
        x_test[col] = numeric_imputer.transform(x_test[[col]]).ravel()

# Apply Label Encoding (a no-op if the object columns were already encoded above)
label_encoder = LabelEncoder()
for column in x_train.columns:
    if x_train[column].dtype == 'object':
        x_train[column] = label_encoder.fit_transform(x_train[column])
        x_test[column] = label_encoder.transform(x_test[column])

# Define the Pipeline
over = SMOTE(sampling_strategy=sampling_strategy)  # Oversampling strategy for minority classes
#under = RandomUnderSampler(sampling_strategy=under_sampling_strategy)  # New undersampling strategy
model = RandomForestClassifier(max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=100)

pipeline = Pipeline([('over', over),
                    ##('under', under), 
                     ('model', model)])





In [None]:
# Fit the Pipeline
pipeline.fit(x_train, y_train)

# Evaluate the Model
predictions = pipeline.predict(x_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print("Classification Report:")
print(classification_report(y_test, predictions))


In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
predictions = pipeline.predict(x_test)

# Create the confusion matrix with an explicit label order so the axis ticks match the rows and columns
labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, predictions, labels=labels)

# Plotting the confusion matrix using Seaborn
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

### Correlation of remaining variables with Churn Value

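# numeric_only=True (pandas >= 1.5) skips any non-numeric columns still present in `data`
fig = px.bar(data.corr(numeric_only=True)['Churn Value'].sort_values(ascending=False),
             color='value')
fig.show()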

In [None]:
data['Churn Value'].value_counts(normalize=True)

**Observation**
* We have a highly imbalanced dataset
* We should use oversampling and undersampling to make our dataset better suited to the ML model

## SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a technique used in machine learning to address class imbalance in a dataset. Class imbalance occurs when the number of instances in one class is much lower than the number of instances in another class, making it difficult for machine learning algorithms to learn from the data and predict the minority class accurately.

SMOTE works by creating synthetic samples from the minority class by interpolating new instances between existing instances. The new instances are created by selecting pairs of instances that are close to each other in the feature space and generating new instances along the line that connects them. The number of new instances to be generated is determined by a user-defined parameter that specifies the desired ratio of minority to majority class instances.

The synthetic instances generated by SMOTE are used to balance the classes in the dataset, allowing the machine learning algorithm to learn from a more balanced dataset and make better predictions on the minority class.
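
The interpolation step described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the `imblearn` implementation; the function name `smote_like_samples`, the parameter `k`, and the toy `minority_points` are assumptions made for this example.

```python
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy 2-D minority class (hypothetical values)
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote_like_samples(minority_points, n_new=3)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies.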

## Undersampling

Undersampling is another technique used in machine learning to address class imbalance in a dataset, that is, the situation where one class has far fewer instances than another and the model struggles to predict the minority class accurately.

Undersampling works by randomly selecting a subset of instances from the majority class so that the number of instances in the majority class is reduced to a level comparable to the number of instances in the minority class. This creates a more balanced dataset and allows the machine learning algorithm to learn from a more representative sample of the data.

Undersampling can be effective in reducing the computational cost and training time of machine learning models, as well as reducing the risk of overfitting to the majority class.
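
A minimal sketch of random undersampling in plain Python follows. It is an illustration under stated assumptions rather than the `imblearn` `RandomUnderSampler`: the helper `random_undersample`, the toy rows, and the minority label `"Competitor"` are hypothetical, while `'Not Applicable'` is the majority class used in this notebook.

```python
import random
from collections import Counter

def random_undersample(rows, labels, majority_label, target_count, seed=0):
    """Keep at most target_count randomly chosen majority rows; keep all other rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    keep_majority = set(rng.sample(majority_idx, min(target_count, len(majority_idx))))
    kept = [i for i, y in enumerate(labels) if y != majority_label or i in keep_majority]
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data: 8 majority rows and 2 minority rows
labels = ["Not Applicable"] * 8 + ["Competitor"] * 2
rows = list(range(len(labels)))
rows_small, labels_small = random_undersample(rows, labels, "Not Applicable", target_count=2)
print(Counter(labels_small))
```

The trade-off is that the discarded majority rows carry information the model no longer sees, which is why undersampling is often combined with oversampling rather than used alone.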

# **Model Building and Testing**

**Splitting the dataset into a training and test dataset:**

- Training: Part of the data used for training our supervised models
- Test: Part of the dataset used for testing our models' performance

## 1. Multi-class classification to predict Churn Category

## **Supervised learning**



Supervised learning uses a training set to teach models to yield the desired output. This training dataset includes inputs and correct outputs, which allow the model to learn over time. The algorithm measures its accuracy through the loss function, adjusting until the error has been sufficiently minimized.

Supervised learning can be separated into two types of problems when data mining—classification and regression:

1. Classification uses an algorithm to accurately assign test data into specific categories. It recognizes specific entities within the dataset and attempts to draw some conclusions on how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor, and random forest, which are described in more detail below.


2. Regression is used to understand the relationship between dependent and independent variables. It is commonly used to make projections, such as for sales revenue for a given business. Linear regression and polynomial regression are popular regression algorithms; logistic regression, despite its name, is typically used for classification.



In [None]:
def model(method, x_train, y_train, x_test, y_test):
    # Train the model
    print("Training Model......")
    method.fit(x_train, y_train)
    print("Model Trained")
    
    # Make predictions on test data
    predictions = method.predict(x_test)
    
    # Evaluate model performance and print results
    print("Model accuracy: ", '{:.2%}'.format(accuracy_score(y_test, predictions)))
    


## **Decision Trees**

**Decision Trees in Classification**

Decision trees are a type of supervised learning algorithm that can be used for classification as well as regression problems. They are widely used in machine learning because they are easy to understand and interpret, and can handle both categorical and numerical data. The idea behind decision trees is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.


**Splitting Criteria**

To build a decision tree, we need a measure that determines how to split the data at each node. The splitting criterion is chosen based on the type of data and the nature of the problem. The most common splitting criteria are:

* Gini index: measures the impurity of a set of labels. It calculates the probability of misclassifying a randomly chosen element from the set, and is used to minimize misclassification errors.
* Information gain: measures the reduction in entropy (uncertainty) after a split. It is used to maximize the information gain in each split.
* Chi-square: measures the difference between observed and expected frequencies of the classes. It is used to minimize the deviation between the observed and expected class distribution.
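
The first two criteria can be computed directly from class counts. The sketch below is a plain-Python illustration with hypothetical labels (`"churn"`, `"stay"`) and a hypothetical split; it is not the code scikit-learn uses internally.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["churn"] * 4 + ["stay"] * 4
left, right = ["churn"] * 3 + ["stay"], ["churn"] + ["stay"] * 3
print(gini(parent))                                      # 0.5
print(entropy(parent))                                   # 1.0
print(round(information_gain(parent, left, right), 3))   # 0.189
```

A perfectly mixed two-class node has the maximum Gini impurity (0.5) and entropy (1 bit); a split is useful to the extent that it lowers these values in the child nodes.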

**Ensemble Methods**

Ensemble methods are techniques that combine multiple models to improve performance and reduce overfitting, a typical issue with decision trees. The two most common ensemble methods used with decision trees are:

* Bagging

We will be using Random Forest as a bagging technique. Bagging relies on bootstrap sampling, a statistical technique that randomly samples the data with replacement to create multiple subsets. Each subset is used to train an individual decision tree, so the algorithm effectively generates multiple versions of the same dataset with slightly different distributions. This randomness in the training process helps to reduce overfitting compared with a single decision tree.

* Boosting

We will be using XGBoost as a boosting technique. The key idea behind XGBoost is that it improves upon the predictions of the weak learners by focusing on the misclassified data points. By fitting a new tree to the residuals, XGBoost can correct the errors of the previous model and improve its overall accuracy. Additionally, XGBoost uses regularization to prevent overfitting and to improve generalization performance.

Decision trees are powerful tools for classification problems that provide a clear and interpretable representation of the decision rules learned from the data. The choice of splitting criterion, stopping criterion, and ensemble method can have a significant impact on the performance and generalization of the model.
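
The bootstrap-and-vote idea behind bagging can be sketched as follows. This is a toy illustration in plain Python, assuming a one-dimensional feature and a hypothetical single-threshold weak learner (`fit_stump`); a real Random Forest grows full trees and also samples a random subset of features at each split, which this sketch omits.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Hypothetical weak learner: the threshold on x with the fewest errors,
    where the stump predicts True when x > threshold."""
    best = None
    for threshold, _ in sample:
        errors = sum((x > threshold) != y for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def bagged_predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter(x > t for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy (feature value, label) pairs; the label is True when x >= 5
data = [(x, x >= 5) for x in range(10)]
thresholds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagged_predict(thresholds, 8), bagged_predict(thresholds, 1))
```

Each stump sees a slightly different resample of the data, so individual thresholds vary, but the majority vote is stable.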

### **Gradient Boosting**

The primary idea behind this technique is to develop models sequentially, with each model attempting to reduce the mistakes of the previous one. The additive model, the loss function, and a weak learner are the three fundamental components of Gradient Boosting.

The method provides a direct interpretation of boosting as numerical optimization of the loss function using gradient descent. We employ a Gradient Boosting Regressor when the target column is continuous and a Gradient Boosting Classifier when the task is a classification problem. The loss function is the only difference between the two: for regression problems, mean squared error (MSE) is typically used, and for classification problems, log loss (the negative log-likelihood). The goal is to reduce this loss by adding weak learners, one gradient descent step at a time.

### **XG Boost**




To improve upon the first decision tree, we can use XGBoost. Here's a roadmap for it:

1. Initialize the model: We start by initializing the XGBoost model with default hyperparameters. At this point the model is only a simple starting prediction (a constant base score), before any trees have been added.

2. Make predictions: We use this model to make predictions on the training data. We compare these predictions to the true labels and calculate the residuals, which are the differences between the true labels and the predicted values.

3. Fit a new tree: We then fit a new decision tree to the residuals (more precisely, to the gradients of the loss, which equal the residuals for squared error). This tree will be a weak learner, as it is only modeling the errors of the previous model.

4. Combine the models: We add the new tree to the previous model to create a new ensemble. This new ensemble consists of the previous model plus the new tree.

5. Repeat: We repeat steps 2-4 for a specified number of iterations, adding a new tree to the ensemble each time.

6. Predictions: To make predictions on new data, we combine the predictions of all the trees in the ensemble.
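
The roadmap above can be sketched as a small boosting loop. This is plain gradient boosting on squared error with one-split regression stumps, written in plain Python as an illustration; it is not XGBoost's actual implementation, which additionally uses second-order gradient information and regularization. The names `fit_stump`, `boost`, `model_fn`, and the toy data are assumptions made for this example.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (minimises squared error)."""
    best = None
    for threshold in xs:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = sum((r - (left_mean if x <= threshold else right_mean)) ** 2
                    for x, r in zip(xs, residuals))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # initial model: a constant prediction
    trees = []
    predict = lambda x: base + learning_rate * sum(t(x) for t in trees)
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]  # errors of the current ensemble
        trees.append(fit_stump(xs, residuals))                # fit a new tree and add it
    return predict

# Toy regression data (hypothetical values)
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model_fn = boost(xs, ys)
mse = sum((y - model_fn(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(mse, 4))
```

The learning rate shrinks each tree's contribution, so the ensemble approaches the targets gradually instead of fitting them in a single step.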



## Classification Evaluation Metrics

#### The following evaluation metrics have been identified as important for addressing the business problem:

**F1 score:** Use the F1 score when the class distribution is imbalanced, and when both precision and recall are equally important.

**Recall score:** The recall score will be used because the cost of false negatives (missing customers who are likely to churn) is high. In this project, failing to flag a customer who is likely to churn is more costly than misclassifying a customer who isn't at risk of churn.

**Confusion matrix:**  The confusion matrix is a versatile tool that can be used to visualize the performance of a model across different classes. It can be useful for identifying specific areas of the model that need improvement. As this project will be used in an iterative manner, it will be important to optimize the model by analyzing previous models' failures.

**ROC AUC score:** The ROC AUC score will be used because the model's ability to distinguish between positive and negative classes is important. Ideally we would like the clearest possible picture of each customer's likelihood of churn, so that we do not waste resources on customers who will not churn and we correctly identify customers at high risk of churn.
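
To make the precision, recall, and F1 definitions concrete, here is a small plain-Python sketch that computes them per class and then averages F1 across classes (macro averaging). The helper `per_class_scores` and the toy labels are hypothetical; in the notebook itself, `classification_report` from scikit-learn reports the same quantities.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall, and F1 for each class, treating that class as 'positive'."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy labels: 4 of 6 predictions are correct, but only half the churners are caught
y_true = ["stay", "stay", "stay", "stay", "churn", "churn"]
y_pred = ["stay", "stay", "stay", "churn", "churn", "stay"]
scores = per_class_scores(y_true, y_pred)
macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
print(scores["churn"]["recall"])  # 0.5
print(round(macro_f1, 3))         # 0.625
```

Accuracy here is about 67%, yet recall on the churn class is only 50%, which is why accuracy alone can be misleading on an imbalanced dataset.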

In [None]:
rf = RandomForestClassifier(max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=100, n_jobs=3)

In [None]:
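# Feature importances from the random forest fitted inside the pipeline above
importances = pipeline.named_steps['model'].feature_importances_

# Convert the importances into a DataFrame
feature_importance = pd.DataFrame({'Feature': x_train.columns, 'Importance': importances})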

# Sort the DataFrame to see the most important features
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

# Display
print(feature_importance)


In [None]:
from sklearn.impute import SimpleImputer

# Imputation (fit on the training set only, then apply to the test set)
imputer = SimpleImputer(strategy='mean')  # or 'median' if more appropriate
x_train_imputed = pd.DataFrame(imputer.fit_transform(x_train), columns=x_train.columns)
x_test_imputed = pd.DataFrame(imputer.transform(x_test), columns=x_test.columns)

# Train the random forest defined above (`rf`)
rf.fit(x_train_imputed, y_train)

# Extracting feature importances
importances = rf.feature_importances_

# Convert the importances into a DataFrame
feature_importance = pd.DataFrame({'Feature': x_train_imputed.columns, 'Importance': importances})

# Sort the DataFrame to see the most important features
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

# Display
print(feature_importance)


In [None]:
pd.set_option('display.max_rows', 100)  # Change 100 to the number of rows you wish to display

# Assuming 'feature_importance' is your DataFrame with importances
print(feature_importance.head(100))

**To do**
* Use SHAP values
* Filter the DataFrame for rows where `y_test != y_pred`
* Go back to the interpretable ML kickoff

In [None]:
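import time

import shap
from ipywidgets import IntProgress
from IPython.display import display

# Train and evaluate the random forest
model(rf, x_train, y_train, x_test, y_test)

# Placeholder progress bar (illustrative only; it is not tied to the computation below)
max_count = 100
progress = IntProgress(min=0, max=max_count)
display(progress)

for i in range(max_count):
    time.sleep(0.1)  # Replace with your code
    progress.value = i + 1

# Explain the fitted random forest with SHAP values
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(x_train)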

**Observation**
* Accuracy of our multi-class classification model is 95.74%

## 2. Multi-label classification to predict Churn Category and Churn Reason

In [None]:
y = pd.DataFrame(y)
y['Churn Reason'] = df_reason
y.head()

In [None]:
mlb = MultiLabelBinarizer()
y_str = y.astype(str)
y_mlb = mlb.fit_transform(y_str.values)
x_mlb = x.values



In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_mlb, y_mlb, random_state =2, test_size = 0.2)
print(x_train.shape, x_test.shape)

In [None]:
num_classes = y_train.shape[1]
print("Number of classes:", num_classes)


### Deep Neural Network

In [None]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical


In [None]:
# Assuming df is your DataFrame
x = df.drop(['Churn Value'], axis=1)
y = df['Churn Value']

# Split into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)

# Drop identifier and non-relevant columns
columns_to_drop = ['Customer ID', 'Location ID', 'Service ID', 'Status ID', 'hex_id']
x_train = x_train.drop(columns=columns_to_drop)
x_test = x_test.drop(columns=columns_to_drop)

# Identify all categorical columns
categorical_columns = x_train.select_dtypes(include=['object']).columns.tolist()

# Apply one-hot encoding to these columns
x_train = pd.get_dummies(x_train, columns=categorical_columns)
x_test = pd.get_dummies(x_test, columns=categorical_columns)

# Ensure x_train and x_test have the same columns after encoding
x_test = x_test.reindex(columns=x_train.columns, fill_value=0)

# Scale the data
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# One-hot encode the target
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# Define and compile the model
model = Sequential()
n_inputs = x_train.shape[1]
n_outputs = y_train.shape[1]
model.add(Dense(50, input_dim=n_inputs, activation='relu'))
model.add(Dense(25, activation='tanh'))
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
model.fit(x_train, y_train, epochs=20)


## Categorical Accuracy

Right now there are errors in our accuracy score, so we will need to revise our Deep Neural Network model or the dataset to improve performance.

In [None]:
acc = model.evaluate(x_test, y_test, verbose=1)[1]*100.0
print('Categorical Accuracy: >%.3f' % acc)