# **Telecom Machine Learning Project to Predict Customer Churn**

## **Business Overview**

The telecommunications industry is a rapidly growing sector that is constantly evolving to meet the demands of consumers. As technology advances and user behavior changes, telecom operators face a variety of challenges that can impact their business success. In order to stay competitive and meet customer needs, it is important for telecom companies to regularly analyze their data to identify relevant problems and opportunities for improvement.

**Aim:**

The aim of this notebook is to develop a machine learning model that predicts which customers are likely to churn, that is, to stop using a service or product. Churn prediction is a critical business problem for companies that operate on a subscription or recurring-revenue model, such as telecommunications companies.

While the project will involve building a churn prediction model, an additional focus will be on the importance of monitoring and adapting to changes in the data that may affect the accuracy and effectiveness of the model over time. The project will also emphasize the need for a feedback loop that allows for continuous improvement and refinement of the model based on new data and changing business requirements. By highlighting these concepts, the project aims to help businesses understand the importance of staying agile and adaptable in their machine learning approaches, rather than solely focusing on the accuracy of a single model.

## Understanding Churn Prediction

Churn prediction involves identifying customers likely to stop using a product or service. Specifically, in telecommunications, it's about detecting customers inclined to switch providers or end their current contracts.

This prediction is vital for telecom companies due to its direct effect on revenue and profit. The telecom industry is fiercely competitive, and understanding which customers might leave is crucial. It enables providers to proactively work on retaining these customers.

## Key Challenges

- Analyzing vast, varied data sources is a primary challenge in churn prediction. Telecom companies accumulate large volumes of data from customer interactions, network operations, and billing. These sources are often stored in separate systems, which complicates comprehensive analysis.
  
- Understanding diverse customer behaviors is another hurdle. Customers leave for various reasons, like service quality or better offers elsewhere. Accurately predicting churn requires grasping these varied behaviors and pinpointing key churn indicators.

## Impact on Business

Churn prediction significantly influences telecom business operations. High churn rates can erode revenue and profit, while effective prediction models can help in retaining at-risk customers. Here are essential business impacts:

- **Safeguarding Revenue**: Identifying potential churn helps in taking actions like offering promotions or service upgrades to keep customers, thus protecting revenue and reducing new customer acquisition costs.

- **Boosting Customer Retention**: By pinpointing customer departure reasons, telecom companies can enhance their services, fostering loyalty and increasing customer lifetime value.

- **Cutting Costs**: It's cheaper to retain existing customers than acquire new ones. Churn prediction allows for more focused and efficient marketing strategies.

- **Gaining Competitive Edge**: Effective churn management can set a telecom company apart, helping to grow market share and profitability by improving customer satisfaction and loyalty.



## **Approach**

**Data exploration**

* Load the dataset and examine its structure and contents.
* Explore the distribution of the target variable (churn) and the features.
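
As a minimal sketch of the target-distribution check, the snippet below computes the share of each class with `value_counts(normalize=True)`. The toy frame is hypothetical; only the column name `Churn Value` comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data: 2 churners out of 10 customers
toy = pd.DataFrame({"Churn Value": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]})

# Share of each class; a small share for class 1 signals an imbalanced target
class_share = toy["Churn Value"].value_counts(normalize=True)
print(class_share)
```

A strongly imbalanced target is a reason to look beyond accuracy when the models are evaluated.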

**Data preprocessing**

* Handle missing values by imputing them with appropriate values.
* Handle outliers by removing or transforming them.
* Encode categorical variables using one-hot encoding.
* Scale numerical variables using `StandardScaler`.
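
The preprocessing steps listed above could be combined along the following lines. This is a sketch under stated assumptions, not the notebook's actual pipeline: the toy frame, the choice of median imputation, and the step names are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing numeric values
toy = pd.DataFrame({
    "Age": [25.0, 40.0, np.nan, 61.0],
    "arpu": [120.5, 348.5, 580.7, np.nan],
    "Internet Type": ["Cable", "DSL", "DSL", "Cable"],
})
numeric_cols = ["Age", "arpu"]
categorical_cols = ["Internet Type"]

# Numeric columns: impute with the median (an assumption), then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring categories unseen during fit
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocess.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X.shape)
```

Fitting the transformer on the training split only, and reusing it on validation and inference data, keeps statistics such as the median and the scaling parameters from leaking across splits.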


**Model training**

* Split the data into training and validation sets.
* Train logistic regression, random forest, and XGBoost models on the training set.
* Evaluate the performance of the models on the validation set using metrics such as accuracy, precision, recall, and F1 score.
* Choose the best-performing model based on the evaluation results.
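
When churners are a small minority, accuracy alone can be misleading. The sketch below computes the listed metrics from confusion-matrix counts for a hypothetical "model" that never predicts churn; the helper name `binary_metrics` and the toy labels are assumptions for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical toy labels: 1 churner out of 20 customers
y_true = [0] * 19 + [1]
always_stay = [0] * 20  # never predicts churn
metrics = binary_metrics(y_true, always_stay)
print(metrics)
```

The accuracy is high even though recall is zero, which is why recall and F1 deserve weight when choosing the best model for an imbalanced churn target.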

**Data drift monitoring**

* Use Deepchecks to monitor for data drift in the input features and the target variable.
* Check the model's performance on the validation set regularly to detect any model drift.
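
To illustrate what a feature-drift score measures, here is a library-free sketch of the Population Stability Index (PSI). This is a swapped-in technique for illustration only; it is not the statistic the Deepchecks checks necessarily compute, and the function name, bin count, and synthetic data are assumptions.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between two 1-D samples, using quantile bins of the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Bin edges come from reference quantiles; duplicates are collapsed
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, bins + 1)))
    interior = edges[1:-1]
    n_bins = max(len(edges) - 1, 1)
    # Values outside the reference range fall into the first or last bin
    ref_idx = np.searchsorted(interior, reference, side="right")
    cur_idx = np.searchsorted(interior, current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=n_bins) / len(current)
    # Avoid log(0) for empty bins
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)   # e.g. a feature at training time
same = rng.normal(0.0, 1.0, 5000)        # new data, same distribution
shifted = rng.normal(1.0, 1.0, 5000)     # new data, mean shifted by 1

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions rather than guarantees.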

**Inference pipeline**

* Build an inference pipeline to predict churn for new data.
* Handle cases where the label (churn) is not present in the input data.
* Handle cases where drift is detected by retraining the model with misclassified data.
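
One way to handle input with or without the label is to split it off only when it is present. The helper below is a hypothetical sketch; the function name and toy frames are assumptions, and only the column name `Churn Value` comes from the dataset.

```python
import pandas as pd

def split_features_and_label(frame, label_col="Churn Value"):
    """Return (features, label); label is None when the column is absent."""
    if label_col in frame.columns:
        return frame.drop(columns=[label_col]), frame[label_col]
    return frame.copy(), None

# Hypothetical inputs: one batch with the label, one without
labeled = pd.DataFrame({"Age": [30, 45], "Churn Value": [0, 1]})
unlabeled = pd.DataFrame({"Age": [52, 27]})

X_labeled, y_labeled = split_features_and_label(labeled)
X_unlabeled, y_unlabeled = split_features_and_label(unlabeled)
print(y_labeled is not None, y_unlabeled is None)
```

With this split, prediction can always run on the features, while performance checks run only when a label came back.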

**Project Summary**

* Summarize the results and draw insights from the model's predictions.
* Provide recommendations for business actions based on the model's predictions.

## **Package Requirements**

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import libraries
import pandas as pd
import numpy as np
from projectpro import preserve, save_point, model_snapshot, feedback, show_video 
import math
import sys
import traceback
from deepchecks.tabular import Dataset
from deepchecks.tabular import Suite
from deepchecks.tabular.checks import WholeDatasetDrift, DataDuplicates, NewLabelTrainTest, TrainTestFeatureDrift, TrainTestLabelDrift
from deepchecks.tabular.checks import FeatureLabelCorrelation, FeatureLabelCorrelationChange, ConflictingLabels, OutlierSampleDetection 
from deepchecks.tabular.checks import WeakSegmentsPerformance, RocReport, ConfusionMatrixReport, TrainTestPredictionDrift, CalibrationScore, BoostingOverfit
from sklearn.metrics import f1_score, recall_score, confusion_matrix, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from pickle import dump
from sklearn.impute import SimpleImputer
# import xgboost
import xgboost as xgb

preserve("fcTel2")

## **Data Reading from Different Sources**

In [3]:
np.set_printoptions(threshold=sys.maxsize)

In [4]:
pd.set_option('display.max_columns', 200)

In [5]:
df=pd.read_csv('data/Telecom__Data.csv')

## **Data Exploration**


In [6]:
# Check the shape of the Dataframe
df.shape

(653435, 77)

In [7]:
# Check the Information of the Dataframe, datatypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 653435 entries, 0 to 653434
Data columns (total 77 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Customer ID                 653435 non-null  object 
 1   Month                       653435 non-null  int64  
 2   Month of Joining            653435 non-null  float64
 3   zip_code                    653435 non-null  int64  
 4   Gender                      653435 non-null  object 
 5   Age                         653435 non-null  float64
 6   Married                     653435 non-null  object 
 7   Dependents                  653435 non-null  object 
 8   Number of Dependents        648501 non-null  float64
 9   Location ID                 653435 non-null  object 
 10  Service ID                  653435 non-null  object 
 11  state                       653435 non-null  object 
 12  county                      653435 non-null  object 
 13  timezone      

In [8]:
# Checking the names of the columns
df.columns

Index(['Customer ID', 'Month', 'Month of Joining', 'zip_code', 'Gender', 'Age',
       'Married', 'Dependents', 'Number of Dependents', 'Location ID',
       'Service ID', 'state', 'county', 'timezone', 'area_codes', 'country',
       'latitude', 'longitude', 'arpu', 'roam_ic', 'roam_og', 'loc_og_t2t',
       'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
       'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others',
       'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m',
       'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others',
       'total_rech_amt', 'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g',
       'arpu_4g', 'night_pck_user', 'fb_user', 'aug_vbc_5g', 'Churn Value',
       'Referred a Friend', 'Number of Referrals', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Internet Type',
       'Streaming Data Consumption', 'Online Security', 'Online Backup',
       'Device Protection Plan', 'Premium Tech Suppo

## **Data Dictionary (out of order)** 




| Column name	 | Description|
| ----- | ----- |
| Customer ID	 | Unique identifier for each customer |
| Month | Month for which the data is captured, 1:14 |
| Month of Joining | Calendar month of joining, 1:12 |
| zip_code |	Zip Code|
|Gender |	Gender|
| Age |	Age(Years)|
| Married |	Marital Status |
|Dependents | Dependents - Binary |
| Number of Dependents |	Number of Dependents|
|Location ID |	Location ID|
|Service ID	 |Service ID|
|state|	State|
|county	|County|
|timezone	|Timezone|
|area_codes|	Area Code|
|country	|Country|
|latitude|	Latitude|
|longitude	|Longitude|
|arpu|	Average revenue per user|
|roam_ic	|Roaming incoming calls in minutes|
|roam_og	|Roaming outgoing calls in minutes|
|loc_og_t2t|	Local outgoing calls within same network in minutes|
|loc_og_t2m	|Local outgoing calls outside network in minutes(outside same + partner network)|
|loc_og_t2f|	Local outgoing calls with Partner network in minutes|
|loc_og_t2c	|Local outgoing calls with Call Center in minutes|
|std_og_t2t|	STD outgoing calls within same network in minutes|
|std_og_t2m|	STD outgoing calls outside network in minutes(outside same + partner network)|
|std_og_t2f|	STD outgoing calls with Partner network in minutes|
|std_og_t2c	|STD outgoing calls with Call Center in minutes|
|isd_og|	ISD Outgoing calls|
|spl_og	|Special Outgoing calls|
|og_others|	Other Outgoing Calls|
|loc_ic_t2t|	Local incoming calls within same network in minutes|
|loc_ic_t2m|	Local incoming calls outside network in minutes(outside same + partner network)|
|loc_ic_t2f	|Local incoming calls with Partner network in minutes|
|std_ic_t2t	|STD incoming calls within same network in minutes|
|std_ic_t2m	|STD incoming calls outside network in minutes(outside same + partner network)|
|std_ic_t2f|	STD incoming calls with Partner network in minutes|
|std_ic_t2o|	STD incoming calls operators other networks in minutes|
|spl_ic|	Special Incoming calls in minutes|
|isd_ic|	ISD Incoming calls in minutes|
|ic_others|	Other Incoming Calls|
|total_rech_amt|	Total Recharge Amount in Local Currency|
|total_rech_data|	Total Recharge Amount for Data in Local Currency|
|vol_4g|	4G Internet Used in GB|
|vol_5g|	5G Internet used in GB|
|arpu_5g|	Average revenue per user over 5G network|
|arpu_4g|	Average revenue per user over 4G network|
|night_pck_user|	Is Night Pack User(Specific Scheme)|
|fb_user|	Social Networking scheme|
|aug_vbc_5g|	Volume Based cost for 5G network (outside the scheme paid based on extra usage)|
|offer|	Offer Given to User|
|Referred a Friend|	Referred a Friend : Binary|
|Number of Referrals|	Number of Referrals|
|Phone Service|	Phone Service: Binary|
|Multiple Lines|	Multiple Lines for phone service: Binary|
|Internet Service|	Internet Service: Binary|
|Internet Type|	Internet Type|
|Streaming Data Consumption|	Streaming Data Consumption|
|Online Security|	Online Security|
|Online Backup|	Online Backup|
|Device Protection Plan|	Device Protection Plan|
|Premium Tech Support|	Premium Tech Support|
|Streaming TV|	Streaming TV|
|Streaming Movies|	Streaming Movies|
|Streaming Music|	Streaming Music|
|Unlimited Data|	Unlimited Data|
|Payment Method|	Payment Method|
|Status ID|	Status ID|
|Satisfaction Score|	Satisfaction Score|
|Churn Category|	Churn Category|
|Churn Reason|	Churn Reason|
|Customer Status|	Customer Status|
|Churn Value|	Binary Churn Value|



In [9]:
# Null values sum
df.isna().sum()

Customer ID         0
Month               0
Month of Joining    0
zip_code            0
Gender              0
                   ..
Customer Status     0
offer               0
age_bucket          0
rank                0
rank_x              0
Length: 77, dtype: int64

In [10]:
# Null values in total recharge data
df['total_rech_data'].isna().sum()

0

In [11]:
# Null values in Internet Type
df['Internet Type'].isna().sum()

325078

In [12]:
# Missing value percentage
df['total_rech_data'].isna().sum()/df.shape[0]

0.0

**Observation:**

* In this run, `total_rech_data` has no missing values (the count and the percentage above are both 0), so the filters on missing recharge data in the next cells return empty results.

* Where recharge data is missing, it may represent customers who have not recharged their account, or customers who recharged but whose information was not recorded.

* Customers with missing recharge data may also be those who received free data service and therefore did not need to recharge. Alternatively, the missing values may be due to technical issues, such as data recording errors or system failures.

In [13]:
# Checking the value counts of Internet Service where total recharge data was null
df[df['total_rech_data'].isna()]['Internet Service'].value_counts(dropna=False)

Series([], Name: count, dtype: int64)

**Observation**:

* The result is empty because no rows have missing recharge data in this run. When such rows exist and all of them have opted for internet service, the next step is to check whether they have actually used it.

In [14]:
# Let's check unlimited data column
df[(df['total_rech_data'].isna())]['Unlimited Data'].value_counts()

Series([], Name: count, dtype: int64)

In [15]:
# Let's check average revenue for 4G and 5G
df[(df['total_rech_data'].isna())][['arpu_4g','arpu_5g']].value_counts()

Series([], Name: count, dtype: int64)

**Observation**:

* We can fill the missing values in the total_rech_data column with 0 when the arpu (Average Revenue Per User) is not applicable. This is because the arpu is a measure of the revenue generated per user, and if it is not applicable, it may indicate that the user is not generating any revenue for the company. In such cases, it is reasonable to assume that the total recharge amount is 0.
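
The fill rule described above can be sketched on a toy frame. This is an illustration under assumptions: the data is hypothetical, the ARPU columns are stored as strings so that `'Not Applicable'` can appear, and unlike the cell that follows, this version only touches entries that are actually missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; ARPU columns are strings so 'Not Applicable' can appear
toy = pd.DataFrame({
    "arpu_4g": ["Not Applicable", "63.0", "120.5", "250.0"],
    "arpu_5g": ["Not Applicable", "0.0", "40.0", "10.0"],
    "total_rech_data": [np.nan, 2.0, np.nan, 4.0],
})

# Rows where either ARPU is not applicable (same condition as the notebook)
not_applicable = (toy["arpu_4g"] == "Not Applicable") | (toy["arpu_5g"] == "Not Applicable")
missing = toy["total_rech_data"].isna()

# Missing recharge with no applicable ARPU -> 0
toy.loc[not_applicable & missing, "total_rech_data"] = 0.0

# Remaining missing values -> mean of rows that do have an ARPU
mean_with_arpu = toy.loc[~not_applicable, "total_rech_data"].mean()
toy["total_rech_data"] = toy["total_rech_data"].fillna(mean_with_arpu)
print(toy["total_rech_data"].tolist())
```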

In [16]:
# Check the value counts of ARPU 4g and 5g
df[['arpu_4g','arpu_5g']].value_counts()

arpu_4g    arpu_5g      
0.00       0.000000         379093
           63.000000         13018
63.00      0.000000          12966
254687.00  0.000000          10909
0.00       254687.000000     10707
                             ...  
250.33     8529.715094           1
250.34     41.520000             1
250.35     130.170000            1
           182.760000            1
838.75     2111.580000           1
Name: count, Length: 195759, dtype: int64

In [17]:
# Set total_rech_data to 0 where arpu_4g or arpu_5g is 'Not Applicable'
df.loc[(df['arpu_4g']=='Not Applicable') | (df['arpu_5g']=='Not Applicable'),'total_rech_data']=0

In [18]:
# Missing value percentage
df['total_rech_data'].isna().sum()/df.shape[0]

0.0

We cannot fill other values with 0 because they have some ARPU to consider.

In [19]:
# Calculate the mean of 'total_rech_data' where either 'arpu_4g' or 'arpu_5g' is not equal to 'Not Applicable'
df.loc[(df['arpu_4g']!='Not Applicable') | (df['arpu_5g']!='Not Applicable'),'total_rech_data'].mean()

3.2949413484126193

With this mean, we will fill the NaN values.

In [20]:
# Fill NaN values in 'total_rech_data' with the mean of 'total_rech_data' where either 'arpu_4g' or 'arpu_5g' is not equal to 'Not Applicable'
df['total_rech_data']=df['total_rech_data'].fillna(df.loc[(df['arpu_4g']!='Not Applicable') | (df['arpu_5g']!='Not Applicable'),'total_rech_data'].mean())

In [21]:
# Check the value counts for Internet Type
df['Internet Type'].value_counts(dropna=False)

Internet Type
NaN            325078
Fiber Optic    134929
Cable          112061
DSL             81367
Name: count, dtype: int64

In [22]:
# Check value counts for Internet Service where Internet Type is null
df[df['Internet Type'].isna()]['Internet Service'].value_counts(dropna=False)

Internet Service
No     236017
Yes     89061
Name: count, dtype: int64

Most rows with a null Internet Type (236,017 of 325,078) do not have Internet Service, although 89,061 of them do. Let's fill these null values with Not Applicable.
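
The relationship between a missing `Internet Type` and `Internet Service` can also be read from a single table with `pd.crosstab`. The toy frame below is hypothetical; in the notebook's output, 89,061 rows with a missing Internet Type still have Internet Service set to Yes, which this kind of table makes visible at a glance.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Internet Service": ["No", "No", "Yes", "Yes", "Yes"],
    "Internet Type": [np.nan, np.nan, np.nan, "Cable", "DSL"],
})

# Label each row by whether Internet Type is missing, then cross-tabulate
type_missing = toy["Internet Type"].isna().map({True: "missing", False: "present"})
table = pd.crosstab(type_missing, toy["Internet Service"])
print(table)
```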

In [23]:
# Filling Null values in Internet Type 
df['Internet Type']=df['Internet Type'].fillna('Not Applicable')

In [24]:
# Shape of the dataframe
df.shape

(653435, 77)

In [25]:
# Insert a new column named 'total_recharge' before the last column in the dataframe 
# The values of 'total_recharge' are the sum of 'total_rech_amt' and 'total_rech_data'
df.insert(loc=df.shape[1]-1,column='total_recharge',value=df['total_rech_amt']+df['total_rech_data'])

In [26]:
# Checking percent of missing values in columns
df_missing_columns = (round(((df.isnull().sum()/len(df.index))*100),2).to_frame('null')).sort_values('null', ascending=False)
df_missing_columns

Unnamed: 0,null
Multiple Lines,7.05
Unlimited Data,1.70
Number of Dependents,0.76
Number of Referrals,0.06
Customer ID,0.00
...,...
std_og_t2m,0.00
std_og_t2t,0.00
loc_og_t2c,0.00
loc_og_t2f,0.00


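Let's drop some columns that are not needed for modeling. Note that `Churn Category`, `Churn Reason`, and `Customer Status` appear to describe the churn outcome itself, so keeping them as features would risk leaking the target.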

In [27]:
# Dropping columns
df=df.drop(columns=['night_pck_user', 'fb_user','Churn Category','Churn Reason', 'Customer Status'])

In [28]:
# Checking churn %
round(100*(df['Churn Value'].mean()),2)

4.57

In [29]:
# Number of unique latitudes
df['latitude'].nunique()

1096

In [30]:
# Number of unique longitudes
df['longitude'].nunique()

1368

In [31]:
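# Impute missing 'Number of Dependents': half of the missing rows get 0, the rest get 1
nan_indices = df[df['Number of Dependents'].isna()].index

# Randomly select half of the indices (no seed is set, so the split differs between runs)
random_half = np.random.choice(nan_indices, size=len(nan_indices)//2, replace=False)

# Assign 0 to the selected half
df.loc[random_half, 'Number of Dependents'] = 0

# Assign 1 to the remaining NaN values (reassign instead of using inplace=True on a column)
df['Number of Dependents'] = df['Number of Dependents'].fillna(1)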
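
Because the random split above is not seeded, it assigns different rows on each run. A reproducible variant could use a seeded NumPy `Generator`, as sketched below. The seed value and the toy frame are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen for illustration

# Hypothetical toy data with four missing values
toy = pd.DataFrame({"Number of Dependents": [np.nan, 2.0, np.nan, np.nan, 0.0, np.nan]})

# Index labels of the missing rows
nan_idx = toy.index[toy["Number of Dependents"].isna()]

# Half of the missing rows get 0, the remainder get 1
zero_idx = rng.choice(nan_idx, size=len(nan_idx) // 2, replace=False)
toy.loc[zero_idx, "Number of Dependents"] = 0
toy["Number of Dependents"] = toy["Number of Dependents"].fillna(1)
print(toy["Number of Dependents"].tolist())
```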

Replace 'Not Applicable' with 0 in both 'arpu_4g' and 'arpu_5g'.

In [32]:
# Replace 'Not Applicable' with 0 in 'arpu_4g'
df['arpu_4g'] = df['arpu_4g'].replace('Not Applicable', 0)

# Replace 'Not Applicable' with 0 in 'arpu_5g'
df['arpu_5g'] = df['arpu_5g'].replace('Not Applicable', 0)

# Convert 'arpu_4g' to float data type
df['arpu_4g'] = df['arpu_4g'].astype(float)

# Convert 'arpu_5g' to float data type
df['arpu_5g'] = df['arpu_5g'].astype(float)


In [33]:
# Check the data types
df.dtypes

Customer ID          object
Month                 int64
Month of Joining    float64
zip_code              int64
Gender               object
                     ...   
offer                object
age_bucket           object
rank                float64
total_recharge      float64
rank_x              float64
Length: 73, dtype: object

In [34]:
# Note: We are keeping customer location-based attributes aside for now
location_att=['zip_code', 'state', 'county', 'timezone', 'area_codes', 'country','latitude','longitude']

# List of categorical columns
categorical_cols=['Gender',
       'Married', 'Dependents',
       'offer','Referred a Friend', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Internet Type',
        'Online Security', 'Online Backup',
       'Device Protection Plan', 'Premium Tech Support', 'Streaming TV',
       'Streaming Movies', 'Streaming Music', 'Unlimited Data',
       'Payment Method']

# List of continuous columns
cts_cols=['Age','Number of Dependents',
       'roam_ic', 'roam_og', 'loc_og_t2t',
       'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
       'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others',
       'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m',
       'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others',
       'total_rech_amt', 'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g',
       'arpu_4g', 'arpu', 'aug_vbc_5g', 'Number of Referrals','Satisfaction Score',
       'Streaming Data Consumption']   



## Outlier detection

By calculating quantiles for each continuous variable in the dataset, we are trying to get an idea about the spread and distribution of the data. Specifically, we are interested in identifying potential outliers in the data.

Quantiles divide a distribution into equal proportions. For instance, the 0.25 quantile is the value below which 25% of the observations fall and the 0.75 quantile is the value below which 75% of the observations fall. By calculating quantiles at various levels, we can get a better understanding of the distribution of the data and identify any observations that are too far away from the rest of the data.

These quantiles can be used as thresholds to identify potential outliers in the data. Observations with values beyond these thresholds can be considered as potential outliers and further investigation can be carried out to determine if they are true outliers or not.
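
One common way to turn quantiles into thresholds is Tukey's interquartile-range (IQR) rule, sketched below. This is an illustration, not the notebook's method: the notebook inspects the quantile table by eye, and the helper name, the factor `k=1.5`, and the toy values are assumptions.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Lower and upper outlier thresholds from the 0.25 and 0.75 quantiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy values with one extreme observation
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 500], name="toy_arpu")
lower, upper = iqr_bounds(values)

# Observations beyond the thresholds are candidates for investigation
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

Flagged values are candidates for further investigation, not automatic deletions.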

In [35]:
# Create an empty dataframe with columns as cts_cols and index as quantiles
quantile_df=pd.DataFrame(columns=cts_cols,index=[0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])

# for each column in cts_cols, calculate the corresponding quantiles and store them in the quantile_df
for col in cts_cols:
   quantile_df[col]=df[col].quantile([0.1,0.25,0.5,0.75,0.8,0.9,0.95,0.97,0.99])
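
The loop above can also be written as a single `DataFrame.quantile` call, which returns one row per quantile level and one column per variable. The sketch below checks on a hypothetical toy frame that both forms agree.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})
levels = [0.25, 0.5, 0.75]

# Column-by-column version, as in the cell above
loop_result = pd.DataFrame(columns=toy.columns, index=levels)
for col in toy.columns:
    loop_result[col] = toy[col].quantile(levels)

# Single-call version
direct_result = toy.quantile(levels)
print(direct_result)
```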

In [36]:
# Let's check out the quantiles df
quantile_df

Unnamed: 0,Age,Number of Dependents,roam_ic,roam_og,loc_og_t2t,loc_og_t2m,loc_og_t2f,loc_og_t2c,std_og_t2t,std_og_t2m,std_og_t2f,std_og_t2c,isd_og,spl_og,og_others,loc_ic_t2t,loc_ic_t2m,loc_ic_t2f,std_ic_t2t,std_ic_t2m,std_ic_t2f,std_ic_t2o,spl_ic,isd_ic,ic_others,total_rech_amt,total_rech_data,vol_4g,vol_5g,arpu_5g,arpu_4g,arpu,aug_vbc_5g,Number of Referrals,Satisfaction Score,Streaming Data Consumption
0.1,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,34.74,33.79,14.46,16.95,13.06,5.03,0.0,0.02,10.77,8.1,0.0,0.0,0.0,0.0,0.0,0.0,-256.2,0.0,0.0,1.0,0.0
0.25,28.0,0.0,12.08,14.7,32.695,26.26,1.46,1.61,33.12,25.55,1.19,0.0,3.25,4.94,3.43,85.565,84.17,36.12,42.46,32.19,12.46,0.0,0.04,26.98,20.33,72.0,0.0,0.0,0.0,0.0,0.0,118.96,0.0,0.0,3.0,2.0
0.5,34.0,0.0,50.56,75.1,171.33,135.46,7.8,8.18,174.61,134.8,6.34,0.0,17.19,25.58,17.83,171.5,168.39,72.06,84.47,64.76,24.98,0.0,0.08,53.7,40.54,374.0,0.0,47.01,274.14,0.0,0.0,348.54,117.36,4.0,3.0,20.0
0.75,43.0,1.0,162.06,135.29,309.09,618.31,14.09,14.700068,316.24,244.51,36.64,0.0,31.14,46.19,106.79,1259.265,1090.1,496.805,126.28,448.831338,186.72,0.0,0.21,80.38,60.73,1089.0,2.0,154.91,895.855,194.63,228.4,580.655,311.76,8.0,4.0,49.0
0.8,47.0,2.0,496.940126,146.82,856.884,1393.012469,43.886016,15.97,344.97,266.55,71.61,0.0,33.91,50.24,229.242068,1999.588,1471.763855,653.084534,543.225916,634.160224,275.2,0.0,0.33,384.901558,64.8,2197.0,3.0,176.36,1655.272,789.0,783.452,626.24,350.531298,8.0,4.0,56.0
0.9,55.0,4.0,969.1,689.466,3613.888,2644.684,126.618596,109.12,1547.248709,1008.117776,143.14,0.0,113.208663,372.874736,382.716,2974.37,2425.002856,1198.682,1526.041689,1030.656401,466.85,0.0,0.71,1102.839877,532.386,7013.0,14.0,219.26,9658.65,2219.842,2224.166,1901.764,789.0,10.0,5.0,69.0
0.95,61.0,7.0,1283.255,1954.446,5079.62,3479.505,183.500072,207.53,3953.643222,3108.777354,171.81,0.0,319.369422,470.183,489.7,3719.723,3166.9,1462.223,2022.082864,1360.456,569.74,0.0,1.27,1443.946433,914.291722,9369.0,23.0,663.159,14517.64,8530.666241,8678.16631,5895.142,3944.342,11.0,5.0,77.0
0.97,64.0,8.0,1494.03,2550.3656,5806.0894,3756.4498,206.75,277.3498,5344.3178,3848.737683,188.88,0.0,394.239829,518.460093,531.579909,3911.4698,3468.8698,1657.019,2145.5,1476.4198,594.01,0.0,1.75,1554.88,1212.8592,10491.98,26.0,1438.158,16578.6164,8724.403682,8842.707361,7593.6286,5949.9068,11.0,5.0,80.0
0.99,74.0,9.0,1646.89,3041.76,6191.2762,4060.3096,257.6432,311.47,6729.45,4875.056703,208.18,0.0,637.0398,836.1466,579.3866,4200.4596,3679.3932,1792.98,2434.4798,1571.7566,639.0,0.0,2.19,1601.91,1317.5066,11367.0,30.0,4289.78,18614.3528,254687.0,254687.0,8846.9728,7366.8366,11.0,5.0,83.0


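Outliers were detected in the variables vol_5g, arpu_4g, and arpu_5g. For arpu_4g and arpu_5g, the 0.99 quantile jumps to 254687, far above the 0.97 quantile of roughly 8,700 to 8,800.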

In [37]:
# Checking further
df['arpu_4g'].quantile([0.75,0.8,0.9,0.95,0.97,0.99,0.999])

0.750       228.400000
0.800       783.452000
0.900      2224.166000
0.950      8678.166310
0.970      8842.707361
0.990    254687.000000
0.999    254687.000000
Name: arpu_4g, dtype: float64

In [38]:
# Calculate the proportion of rows in the DataFrame where the value in the 'arpu_4g' column is equal to 254687
df[df['arpu_4g']==254687].shape[0]/df.shape[0]

0.01965765531384147

In [39]:
# Let's check it out
df[df['arpu_4g']==254687]

Unnamed: 0,Customer ID,Month,Month of Joining,zip_code,Gender,Age,Married,Dependents,Number of Dependents,Location ID,Service ID,state,county,timezone,area_codes,country,latitude,longitude,arpu,roam_ic,roam_og,loc_og_t2t,loc_og_t2m,loc_og_t2f,loc_og_t2c,std_og_t2t,std_og_t2m,std_og_t2f,std_og_t2c,isd_og,spl_og,og_others,loc_ic_t2t,loc_ic_t2m,loc_ic_t2f,std_ic_t2t,std_ic_t2m,std_ic_t2f,std_ic_t2o,spl_ic,isd_ic,ic_others,total_rech_amt,total_rech_data,vol_4g,vol_5g,arpu_5g,arpu_4g,aug_vbc_5g,Churn Value,Referred a Friend,Number of Referrals,Phone Service,Multiple Lines,Internet Service,Internet Type,Streaming Data Consumption,Online Security,Online Backup,Device Protection Plan,Premium Tech Support,Streaming TV,Streaming Movies,Streaming Music,Unlimited Data,Payment Method,Status ID,Satisfaction Score,offer,age_bucket,rank,total_recharge,rank_x
9,uqdtniwvxqzeu1,14,6.0,72566,Male,36.459363,No,No,0.0,qcvetdmalnkw1,tkqnsqflrdatnqapsh1,AR,Izard County,America/Chicago,870.0,US,36.22,-92.08,1330.04,1582.05,157.20,161.810000,1827.38,39.790000,1.00,1362.59,5267.31,171.81,0.0,390.32,24.940000,511.23,2128.610000,2896.11,54.41,100.540000,585.44,162.70,0.0,0.11,10.460000,1247.37,255.0,0.0,0.0,0.0,254687.0,254687.0,0.0,0,Yes,9.0,Yes,No,No,Not Applicable,74,No,No,Yes,No,Yes,No,No,No,Credit Card,inebwpymzwpup39698,4,No Offer,35-49,480205.5,255.0,608671.5
86,ucpurmfkdlnwi18,13,12.0,71747,Female,20.000000,Yes,No,0.0,rqiqguxisfoc18,dkupusivpzrazcfsdi18,AR,Union County,America/Chicago,870.0,US,33.04,-92.18,160.07,18.63,31.29,2894.610900,834.78,209.170000,9.59,177.64,116.17,120.34,0.0,14.74,439.375628,100.81,156.270000,254.19,29.68,998.952008,24.13,12.62,0.0,0.29,795.286166,5.06,8462.0,0.0,0.0,0.0,0.0,254687.0,0.0,0,Yes,6.0,Yes,Yes,No,Not Applicable,0,No,No,No,Yes,No,No,No,No,Bank Withdrawal,usfobpyxwqrkg27554,5,No Offer,18-24,600175.5,8462.0,75599.5
103,sirifvlkipkel21,13,11.0,92865,Female,40.000000,Yes,No,0.0,jobplwgowgko21,zmuwwsnfbwxxdxzuvz21,CA,Orange County,America/Los_Angeles,714.0,US,33.83,-117.85,478.77,26.04,72.49,111.050000,1.87,6.890000,4.83,11.50,134.28,6.71,0.0,31.44,6.230000,2.70,171.280000,167.16,15.18,54.880000,64.06,31.83,0.0,0.01,41.910000,61.24,417.0,0.0,0.0,0.0,0.0,254687.0,0.0,0,Yes,0.0,Yes,Yes,No,Not Applicable,56,No,Yes,Yes,No,Yes,Yes,Yes,No,Credit Card,cullucfodcpbc24549,3,No Offer,35-49,284189.5,417.0,524369.5
112,dnnrchjlmrylq24,14,9.0,91423,Female,48.000000,Yes,Yes,0.0,vxainqiqplai24,liroqcvpdnrzdyolqw24,CA,Los Angeles County,America/Los_Angeles,2.13e+17,US,34.14,-118.42,143.68,0.00,0.00,0.000000,0.00,0.000000,0.00,0.00,0.00,0.00,0.0,0.00,0.000000,0.00,149.180000,2769.19,207.23,33.720000,331.07,3.33,0.0,0.06,0.090000,2.56,0.0,0.0,0.0,0.0,0.0,254687.0,0.0,0,Yes,6.0,No,Yes,No,Not Applicable,51,No,Yes,Yes,No,No,Yes,Yes,No,Bank Withdrawal,qflywarsexbpg13676,4,G,35-49,480205.5,0.0,497565.0
145,pltaycxycbhvo31,11,7.0,95126,Other,35.000000,No,No,0.0,sjmjgqjvhvth31,xbmtjtsvypinczxnhf31,CA,Santa Clara County,America/Los_Angeles,408.0,US,37.32,-121.91,95.40,0.00,0.00,0.000000,0.00,0.000000,0.00,0.00,0.00,0.00,0.0,0.00,0.000000,0.00,3210.570000,525.28,136.57,19.790000,1.21,202.92,0.0,0.05,61.380000,52.97,0.0,0.0,0.0,0.0,0.0,254687.0,0.0,0,Yes,10.0,No,No,No,Cable,56,No,Yes,No,No,No,Yes,Yes,No,Bank Withdrawal,xayhhjriwxte83055,3,J,25-34,284189.5,0.0,524369.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
652999,tphemcbndfpem162885,5,5.0,91604,Female,23.000000,Yes,Yes,9.0,psxavglkqzny162885,lepgdnuzszymxfxefi162885,CA,Los Angeles County,America/Los_Angeles,213.0,US,34.13,-118.39,576.68,1555.64,148.54,286.060000,2640.98,11.450000,15.25,177.76,80.32,182.42,0.0,36.39,29.010000,12.66,149.150000,254.42,34.55,70.130000,866.24,21.63,0.0,0.02,8.530000,13.05,6036.0,0.0,0.0,0.0,0.0,254687.0,63.0,0,Yes,7.0,Yes,No,No,Not Applicable,8,No,No,No,No,Yes,No,No,No,Wallet Balance,unsgjstmbbczmsf47552,3,No Offer,18-24,284189.5,6036.0,224935.5
653051,umbrcxomoexlc162896,8,5.0,94939,Female,55.000000,Yes,Not Specified,0.0,uuqthlwgdxrn162896,njhcqhdfkoqrazlxxo162896,CA,Marin County,America/Los_Angeles,415.0,US,37.93,-122.53,5007.56,75.85,144.05,271.200000,30.16,10.500000,242.12,349.01,1073.75,1.19,0.0,30.40,24.610000,2.62,106.590000,145.29,86.85,83.010000,29.74,23.25,0.0,0.06,61.500000,54.30,1518.0,0.0,0.0,0.0,0.0,254687.0,0.0,0,Yes,8.0,Yes,Yes,No,Not Applicable,23,No,No,Yes,No,Yes,Yes,Yes,No,Credit Card,mospmxtxyzdy97920,1,No Offer,50-64,52747.5,1518.0,353280.5
653105,dkjfuyorfdngv162907,13,11.0,87553,Male,23.000000,Yes,Yes,2.0,huiasasztqyw162907,dscivazeqkwgxzggqx162907,NM,Taos County,America/Denver,505575.0,US,36.17,-105.69,585.54,27.80,81.79,156.780000,22.08,1.970000,16.43,121.46,424.02,0.87,0.0,19.42,73.170000,10.20,164.160000,55.41,16.81,97.510000,73.50,22.02,0.0,0.10,74.750000,870.14,229.0,0.0,0.0,0.0,0.0,254687.0,0.0,1,Yes,9.0,Yes,No,No,Not Applicable,4,No,No,No,No,Yes,No,No,No,Bank Withdrawal,rgvqptvqmqule47777,2,D,18-24,130189.0,229.0,191954.5
653218,jqvmittclvgqd162934,11,7.0,98907,Not Specified,40.000000,Yes,No,0.0,uviytafwcahi162934,whncxdyflgkmlzguym162934,WA,Yakima County,America/Los_Angeles,509.0,US,46.59,-120.52,373.84,25.65,143.85,318.780000,416.37,12.540000,13.25,317.96,182.93,10.41,0.0,11.00,31.100000,14.43,170.040000,38.57,80.97,59.360000,7.93,12.83,0.0,0.03,67.120000,13.64,244.0,0.0,0.0,0.0,0.0,254687.0,0.0,0,No,0.0,Yes,Yes,No,Not Applicable,14,No,Yes,Yes,No,Yes,No,No,No,Bank Withdrawal,qpzsyhumyefn64654,4,No Offer,35-49,480205.5,244.0,268482.5


Let's see what is the value of 'total_rech_data' for these observations.

In [40]:
# Get the value counts of 'total_rech_data' for observations where the value in the 'arpu_4g' column is equal to 254687
df[df['arpu_4g']==254687]['total_rech_data'].value_counts()

total_rech_data
0.0    12845
Name: count, dtype: int64

The data recharge amount is 0 for all of these rows, so an ARPU of 254687 is implausible and looks like a placeholder rather than real revenue. Let's replace it with 0.

In [41]:
# Replace the outlier value 254687 in the 'arpu_4g' column of the dataframe 'df' with 0.
df['arpu_4g']=df['arpu_4g'].replace(254687,0)


In [42]:
# Checking further
df['arpu_4g'].quantile([0.75,0.8,0.9,0.95,0.97,0.99,0.999])

0.750      120.600000
0.800      504.194000
0.900     1893.872000
0.950     2493.923000
0.970     8678.332415
0.990     8842.707361
0.999    87978.000000
Name: arpu_4g, dtype: float64

In [43]:
# Filter by 'arpu_4g' value of 87978 and count the values in the 'total_rech_data' column
df[df['arpu_4g']==87978]['total_rech_data'].value_counts()

total_rech_data
0.0    5006
Name: count, dtype: int64

All rows in the dataframe with an 'arpu_4g' value of 87978 have 0 value in the 'total_rech_data' column, indicating that these are likely outliers. Therefore, we have decided to replace the 'arpu_4g' value for these rows with 0.
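
An exact value that repeats many times far above the rest of a column is unlikely to be a real measurement. The hypothetical helper below flags such values by looking at how often each value at or above a high quantile occurs. The function name, thresholds, and toy series are assumptions, not part of the notebook.

```python
import pandas as pd

def suspicious_repeats(series, top_quantile=0.99, min_share=0.005):
    """Values at or above top_quantile that make up at least min_share of rows."""
    cutoff = series.quantile(top_quantile)
    shares = series[series >= cutoff].value_counts() / len(series)
    return shares[shares >= min_share]

# Hypothetical toy column: mostly zeros, a few real values, one repeated extreme
toy = pd.Series([0.0] * 90 + [50.0, 63.0, 75.0, 80.0, 120.0] + [254687.0] * 5)
flagged = suspicious_repeats(toy, top_quantile=0.95, min_share=0.03)
print(flagged)
```

A flagged value still needs a domain check, such as the `total_rech_data` comparison used here, before it is replaced.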

In [44]:
# Replace the values with 0
df['arpu_4g']=df['arpu_4g'].replace(87978,0)

In [45]:
# Checking the quantiles again
df['arpu_4g'].quantile([0.75,0.8,0.9,0.95,0.97,0.99,0.999])

0.750     107.780000
0.800     432.330000
0.900    1803.678000
0.950    2424.093000
0.970    2735.677800
0.990    8708.088152
0.999    8842.707361
Name: arpu_4g, dtype: float64

In [46]:
# Check the churn value for this ARPU
df[df['arpu_4g']>8000]['Churn Value'].value_counts()

Churn Value
0    16155
1      980
Name: count, dtype: int64

**Observation:**

* A higher ARPU means the business earns more revenue per user, which is a positive sign for profitability. However, a high ARPU can also be associated with churn. In this data, 980 of the 17,135 rows with 'arpu_4g' above 8000 (about 5.7%) are churned.

* One reason a high ARPU may be associated with churn: if a business charges a high price for its services, its customers may be more price-sensitive and more likely to switch to a competitor that offers a better deal. This could result in a higher churn rate for the business.

In [47]:
# Check the value counts of total recharge data at outlying values
df[df['arpu_5g']==254687]['total_rech_data'].value_counts()

total_rech_data
0.0    12608
Name: count, dtype: int64

In [48]:
# Check the value counts of total recharge data at outlying values
df[df['arpu_5g']==87978]['total_rech_data'].value_counts()

total_rech_data
0.0    5126
Name: count, dtype: int64

In [49]:
# Replacing the values with 0 where total recharge data is 0
df['arpu_5g']=df['arpu_5g'].replace([87978,254687],0)

In [50]:
# Check the quantiles of ARPU 5G
df['arpu_5g'].quantile([0.75,0.8,0.9,0.95,0.97,0.99,0.999])

0.750      96.525000
0.800     417.224000
0.900    1797.658000
0.950    2543.916000
0.970    2792.089400
0.990    8587.032306
0.999    8724.403682
Name: arpu_5g, dtype: float64

In [51]:
# Check the quantiles of Volume of 5G data
df['vol_5g'].quantile([0.75,0.8,0.9,0.95,0.97,0.98,0.99,0.999])

0.750      895.8550
0.800     1655.2720
0.900     9658.6500
0.950    14517.6400
0.970    16578.6164
0.980    17550.9976
0.990    18614.3528
0.999    19746.1266
Name: vol_5g, dtype: float64

In [52]:
# Let's see the recharge data value for rows with a very high 5G volume
df[df['vol_5g']>=87978]['total_rech_data'].value_counts()

Series([], Name: count, dtype: int64)

In [53]:
# Proportion of these values
df[df['vol_5g']>=87978]['total_rech_data'].value_counts()/df.shape[0]

Series([], Name: count, dtype: float64)

**Observation**:

The two cells above return an empty Series: in this run, no row has a 'vol_5g' value at or above 87978 (the 99.9th percentile is about 19,746). The replacement in the next cell therefore acts as a safeguard against the same extreme values (87978 and 254687) that appeared in 'arpu_4g' and 'arpu_5g' together with a 'total_rech_data' of 0. Possible reasons why very high usage values could appear with zero recharges:

* Data recording error: The usage or recharge values for these rows may have been recorded incorrectly. In that case the extreme value carries no real information, and replacing it with 0 is reasonable.

* Promotions or bonuses: These customers may have received promotions or bonuses that allowed them to use the service without recharging, leading to a 'total_rech_data' of 0 even with heavy usage. In that case, filling the value with 0 keeps it consistent with the absence of recharge data.

In [54]:
# Replace the outlier values
df['vol_5g']=df['vol_5g'].replace([87978,254687],0)

In [55]:
# Unique months
df['Month'].unique()

array([ 1,  6,  7,  8,  9, 10, 11, 12, 13, 14,  2,  3,  4,  5],
      dtype=int64)

In [56]:
# Unique months of joining
df['Month of Joining'].unique()

array([ 1.,  6., 11.,  9.,  8.,  7., 10.,  2., 12.,  3.,  5.,  4.])

'Month of Joining' spans months 1 to 12, so it maps to 4 quarters of joining. 'Month' spans months 1 to 14, so it maps to 5 quarters.

In [57]:
# # Save Processed data
# df.to_csv('../data/processed/processed_churn_data.csv',index=False)

### Quarterly Churn Analysis

Quarterly churn analysis involves assessing customer attrition over a three-month period. This involves calculating the churn rate, which is the percentage of customers discontinuing service in a quarter.

- **Timeliness**: Performing this analysis quarterly aids in timely evaluation of customer retention and churn. Regular assessment helps spot behavioral changes in customers, enabling prompt action.

- **Strategy Evaluation**: Quarterly analysis helps gauge the success of customer retention strategies. An increase in churn rate prompts a review of past strategies, guiding future improvements.

- **Financial Impact**: Churn significantly affects business finances. Quarterly analysis identifies revenue loss areas, aiding in taking preventive measures for financial stability and growth.

- **Customer Insights**: This analysis yields insights into customer behavior and preferences. Understanding reasons behind churn can reveal patterns, aiding in service improvement and future retention.

- **Benchmarking**: Quarterly churn analysis helps in benchmarking against industry standards and competitors, highlighting strengths and areas needing improvement for competitive edge.
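As a rough illustration of the churn-rate calculation described above, the sketch below maps months to quarters and computes the share of churned records per quarter. The function names and the toy records are hypothetical, and the sketch assumes one record per customer per month with a 0/1 churn flag, which may differ from how the notebook's data is aggregated.

```python
import math
from collections import defaultdict

def month_to_quarter(month):
    """Map a 1-based month index to a quarter (1-3 -> 1, 4-6 -> 2, ...)."""
    return math.ceil(month / 3)

def quarterly_churn_rate(records):
    """records: iterable of (month, churned) pairs, with churned being 0 or 1.
    Returns {quarter: churn rate in percent}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for month, flag in records:
        quarter = month_to_quarter(month)
        totals[quarter] += 1
        churned[quarter] += flag
    return {q: 100.0 * churned[q] / totals[q] for q in sorted(totals)}

# Hypothetical toy data: (month, churned)
toy_records = [(1, 0), (2, 1), (3, 0), (3, 0), (4, 1), (5, 1), (6, 0), (6, 0)]
rates = quarterly_churn_rate(toy_records)
print(rates)
```

Because the mapping uses a ceiling division by 3, a month index above 12 (such as 13 or 14) lands in a fifth quarter.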


In [58]:


import math

# Define a function to map a month to its corresponding quarter
def map_month_to_quarter(month):
    if math.isnan(month):  # Handle NaN values if present
        return None
    return math.ceil(month / 3)

# Insert a new column called 'Quarter of Joining' in the DataFrame 'df' and populate it with the quarter corresponding to the 'Month of Joining' column
df.insert(loc=1, column='Quarter of Joining', value=df['Month of Joining'].apply(map_month_to_quarter))

# Insert a new column called 'Quarter' in the DataFrame 'df' and populate it with the quarter corresponding to the 'Month' column
df.insert(loc=1, column='Quarter', value=df['Month'].apply(map_month_to_quarter))


In [59]:
# Remove duplicate rows in the DataFrame 'df' based on the 'Customer ID', 'Quarter', and 'Quarter of Joining' columns and keep only the last occurrence of each set of duplicates
telco=df.drop_duplicates(subset=['Customer ID','Quarter','Quarter of Joining'],keep='last')

The 'train_data' DataFrame contains the data of customers who joined in the first quarter and were active in the first quarter. This dataset is used for training the churn prediction model.

The 'test_data' DataFrame contains the data of customers who joined in the first quarter and were active in the second quarter. This dataset is used for testing the accuracy of the churn prediction model.

The 'prediction_data' DataFrame contains the data of customers who joined in the second quarter and were active in the second quarter. This dataset is used for predicting the churn of customers who joined in the second quarter.

In [60]:
# Filter the data for quarters 1 and 2
train_data=telco[(telco['Quarter of Joining']==1)&(telco['Quarter']==1)]
test_data=telco[(telco['Quarter of Joining']==1)&(telco['Quarter']==2)]
prediction_data=telco[(telco['Quarter of Joining']==2)&(telco['Quarter']==2)]
save_point("fcTel2")
# Note: a lot of the data is not used here; it is reserved for the feedback loop

### **Data Preprocessing**



In [61]:
# Row counts for each combination of quarter and quarter of joining
telco[['Quarter','Quarter of Joining']].value_counts()

Quarter  Quarter of Joining
3        3                     30901
4        3                     29111
5        3                     26086
2        2                     25618
1        1                     24309
3        2                     23514
4        2                     21476
5        2                     19200
4        4                     17359
2        1                     16706
5        4                     16048
3        1                     13724
4        1                     12299
5        1                     10944
Name: count, dtype: int64

In [62]:
# Checking the shape of the data
train_data.shape,test_data.shape

((24309, 75), (16706, 75))

In [63]:
# Normalizing value counts and checking the churn rate in 1st quarter or the training data
train_data['Churn Value'].value_counts(normalize=True)

Churn Value
0    0.687235
1    0.312765
Name: proportion, dtype: float64

In [64]:
# List of columns in Train data
train_data.columns

Index(['Customer ID', 'Quarter', 'Quarter of Joining', 'Month',
       'Month of Joining', 'zip_code', 'Gender', 'Age', 'Married',
       'Dependents', 'Number of Dependents', 'Location ID', 'Service ID',
       'state', 'county', 'timezone', 'area_codes', 'country', 'latitude',
       'longitude', 'arpu', 'roam_ic', 'roam_og', 'loc_og_t2t', 'loc_og_t2m',
       'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m', 'std_og_t2f',
       'std_og_t2c', 'isd_og', 'spl_og', 'og_others', 'loc_ic_t2t',
       'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m', 'std_ic_t2f',
       'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others', 'total_rech_amt',
       'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g', 'arpu_4g',
       'aug_vbc_5g', 'Churn Value', 'Referred a Friend', 'Number of Referrals',
       'Phone Service', 'Multiple Lines', 'Internet Service', 'Internet Type',
       'Streaming Data Consumption', 'Online Security', 'Online Backup',
       'Device Protection Plan', 'Premium Tech S

In [65]:
# Let's drop unnecessary columns
drop_cols=['Customer ID', 'Quarter', 'Quarter of Joining', 'Month',
       'Month of Joining', 'zip_code','Location ID', 'Service ID',
       'state', 'county', 'timezone', 'area_codes', 'country', 'latitude',
       'longitude','Status ID','age_bucket']

train_data=train_data.drop(columns=drop_cols)
test_data=test_data.drop(columns=drop_cols)

In [66]:
# Columns
train_data.columns

Index(['Gender', 'Age', 'Married', 'Dependents', 'Number of Dependents',
       'arpu', 'roam_ic', 'roam_og', 'loc_og_t2t', 'loc_og_t2m', 'loc_og_t2f',
       'loc_og_t2c', 'std_og_t2t', 'std_og_t2m', 'std_og_t2f', 'std_og_t2c',
       'isd_og', 'spl_og', 'og_others', 'loc_ic_t2t', 'loc_ic_t2m',
       'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m', 'std_ic_t2f', 'std_ic_t2o',
       'spl_ic', 'isd_ic', 'ic_others', 'total_rech_amt', 'total_rech_data',
       'vol_4g', 'vol_5g', 'arpu_5g', 'arpu_4g', 'aug_vbc_5g', 'Churn Value',
       'Referred a Friend', 'Number of Referrals', 'Phone Service',
       'Multiple Lines', 'Internet Service', 'Internet Type',
       'Streaming Data Consumption', 'Online Security', 'Online Backup',
       'Device Protection Plan', 'Premium Tech Support', 'Streaming TV',
       'Streaming Movies', 'Streaming Music', 'Unlimited Data',
       'Payment Method', 'Satisfaction Score', 'offer', 'rank',
       'total_recharge', 'rank_x'],
      dtype='object')

In [67]:
# Splitting the train data into features and label
X_train=train_data.drop("Churn Value", axis = 1)
y_train=train_data['Churn Value']

In [68]:
# Splitting the test data into features and label
X_test=test_data.drop("Churn Value", axis = 1)
y_test=test_data['Churn Value']

In [69]:
# % churn value
y_train.mean(),y_test.mean()

(0.31276481961413466, 0.17849874296659882)

In [71]:
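# Fit the encoder on the training data only
# (scikit-learn >= 1.2 uses `sparse_output`; older versions use `sparse`)
encoder = OneHotEncoder(sparse_output=False)
encoder.fit(X_train[categorical_cols])
encoded_features = list(encoder.get_feature_names_out(categorical_cols))

# Transform the training set
X_train[encoded_features] = encoder.transform(X_train[categorical_cols])
# Transform the test set with the same fitted encoder
X_test[encoded_features] = encoder.transform(X_test[categorical_cols])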

##### **Note**

We fit the encoder on the training set and only transform the test set. The encoder therefore learns its categories from the training data alone, which prevents **data leakage** (information from outside the training set being used to build the model).

If we encoded the features on the full dataset before splitting, we would implicitly assume that we already know every category or numeric range we are going to see in the future. Depending on the definition, this counts as data leakage, because it uses information that is not in the training set.

Note that with the encoder's default settings, a category that appears only in the test set raises an error at transform time; setting `handle_unknown='ignore'` encodes such a category as all zeros instead.
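To make the fit-on-train, transform-on-test idea concrete, here is a small library-free sketch. It is not the scikit-learn implementation; `fit_categories` and `one_hot` are hypothetical helper names, and the category values are made up. Unseen test categories are encoded as all zeros, which is one possible policy among several.

```python
def fit_categories(train_values):
    """Learn the category list from the training values only."""
    return sorted(set(train_values))

def one_hot(values, categories):
    """Encode values against the learned categories.
    A value that was not seen during fitting becomes an all-zero row."""
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return rows

# Hypothetical category values
train = ["DSL", "Fiber Optic", "DSL", "Cable"]
test = ["Cable", "Satellite"]  # "Satellite" never appears in training

categories = fit_categories(train)      # learned from the training set only
train_encoded = one_hot(train, categories)
test_encoded = one_hot(test, categories)
print(categories)
print(test_encoded)
```

The test set never influences `categories`, so the encoded columns are fixed before any test row is seen.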

In [73]:
# Shape
X_train.shape

(24309, 111)

In [74]:
# drop original features
X_train=X_train.drop(categorical_cols,axis=1)
X_test=X_test.drop(categorical_cols,axis=1)

In [75]:
# Check again
X_train.shape

(24309, 93)

In [76]:
# Instantiate scaler
scaler = StandardScaler()

# Scale Separate Columns
# train
X_train[cts_cols]  = scaler.fit_transform(X_train[cts_cols]) 
# test
X_test[cts_cols]  = scaler.transform(X_test[cts_cols])
preserve("fcTel2") 

In [77]:
# Dump the scaler to use in transforming test data
# dump(scaler, open('../data/output/scaler.pkl', 'wb'))

## **Model Training**

### **Supervised Machine Learning**

Supervised learning is a type of machine learning where the algorithm learns from labeled data. In other words, the data used to train the algorithm includes input variables and corresponding output variables. The algorithm learns to predict the output variable based on the input variables. Supervised learning can be further divided into two categories: regression and classification.

* **Regression** is a type of supervised learning where the algorithm learns to predict a continuous output variable. In other words, the output variable is a numerical value. Examples of regression problems include predicting housing prices, stock prices, or the amount of rainfall in a particular area.

* **Classification**, on the other hand, is a type of supervised learning where the algorithm learns to predict a discrete output variable. In other words, the output variable is a category or label. Examples of classification problems include predicting whether an email is spam or not, whether a tumor is malignant or benign, or whether a customer is likely to churn or not.

### Logistic Regression

Logistic regression is a machine learning algorithm used for binary classification tasks, such as predicting whether a customer will churn. It operates by examining the relationship between input variables (like customer demographics and usage) and a binary output (churn or not).

Key points about Logistic Regression:

- **Functioning**: It estimates the probability of an event (output variable) using a logistic function, which provides a value between 0 and 1.

- **Nature of Algorithm**: Despite the name, logistic regression is used for classification. It gets its name from employing a logistic function to model probability.

- **Regression to Classification**: The logistic function transforms the linear regression equation's output into a probability (between 0 and 1), enabling classification.

- **Logistic Function**: Defined as $$\text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$, where \( z \) is a linear combination of input variables and weights.

- **Probability Prediction**: The probability of the binary outcome is given by $$P(y=1|x) = \text{sigmoid}(z)$$, with \( y \) as the binary outcome and \( x \) the input variables.

- **Training the Model**: Involves minimizing the cross-entropy loss function, using labeled examples and an optimization algorithm like gradient descent.

- **Cross-Entropy Loss**: Defined as $$L(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$, with \( y_i \) being the binary outcome and \( \hat{y}_i \) the predicted probability for example \( i \).
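The sigmoid, probability prediction, and cross-entropy loss listed above can be sketched in a few lines of plain Python. This is an illustrative sketch only; the function names and the clipping constant `eps` are assumptions, not part of the notebook's pipeline.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """P(y=1 | x) for a single example, with z as a linear combination of inputs."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps clips probabilities to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.0)  # z = 0
loss = cross_entropy([1, 0], [0.5, 0.5])                       # equals ln(2)
print(p, loss)
```

A model that predicts 0.5 for every example has a loss of ln(2), about 0.693, which is a useful baseline when reading training logs.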



## Decision Trees

### Decision Trees in Classification

Decision trees are a supervised learning algorithm suitable for both classification and regression tasks. Known for their interpretability and ability to handle various data types, they model target variables using simple decision rules from data features.

- **Structure**: Begins with a root node representing the entire dataset, splitting into child nodes based on feature values. This recursive process creates a tree-like structure of decisions.
  
- **Nodes and Leaves**: Each node tests a feature, with branches as possible outcomes. Leaves represent final decisions or class labels.

### Splitting Criteria

The criterion for splitting at each node varies based on data and problem nature. Common criteria include:

- **Gini Index**: Calculates the probability of misclassifying an element, aiming to minimize errors.
- **Information Gain**: Assesses entropy reduction post-split, seeking to maximize information gain.
- **Chi-Square**: Compares observed and expected class frequencies in the child nodes; splits with a larger chi-square statistic (a greater deviation from the parent distribution) are preferred.
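As a small worked example of the first two criteria, the sketch below computes Gini impurity, entropy, and the information gain of a binary split. The function names and toy labels are illustrative assumptions.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 0, 0]
perfect_split = information_gain(parent, [1, 1], [0, 0])  # children are pure
useless_split = information_gain(parent, [1, 0], [1, 0])  # children mirror the parent
print(gini(parent), perfect_split, useless_split)
```

A split that separates the classes completely recovers all of the parent's entropy, while a split whose children have the same class mix as the parent gains nothing.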

### Overfitting in Decision Trees

Overfitting, where trees fit too closely to training data, is a key challenge. It's addressed by:

- **Pruning**: Reducing tree complexity.
- **Limiting Depth**: Restricting tree growth.
- **Ensemble Methods**: Improving performance and reducing overfitting.

### Ensemble Methods

These methods combine multiple models for enhanced results:

- **Bagging**: Trains multiple trees on data subsets, then aggregates predictions.
- **Boosting**: Sequentially trains trees, each focusing on previous misclassifications, enhancing accuracy.

In summary, decision trees are a robust tool for classification, offering clear decision-making insights. Performance hinges on the choice of splitting, stopping criteria, and whether ensemble methods are employed.


### Random Forest

Random Forest is an ensemble algorithm known for its high performance and its ability to reduce overfitting relative to a single decision tree. It operates by building numerous decision trees on varied data subsets and then aggregating their predictions.

**Algorithm of Random Forest**

The Random Forest algorithm involves these steps:

- **Bootstrap Sampling**: Randomly select a subset of the training data with replacement, called a bootstrap sample.
- **Feature Selection**: At each split, randomly choose a subset of the features as candidates.
- **Tree Building**: Construct a decision tree on the bootstrap sample, choosing the best feature among the random candidates at each node.
- **Repeat Process**: Build multiple trees following the above steps.
- **Final Prediction**: For classification, take a majority vote from all trees; for regression, average their predictions.
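The sampling and voting steps above can be sketched without any tree-building code. The helper names below are hypothetical, and the feature names are borrowed from this dataset purely for illustration; a real implementation would fit a decision tree on each bootstrap sample.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(features, k, rng):
    """Pick k distinct candidate features."""
    return rng.sample(features, k)

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(7)  # fixed seed so the sketch is repeatable
rows = list(range(10))
features = ["arpu", "vol_4g", "vol_5g", "total_rech_amt"]

sample = bootstrap_sample(rows, rng)
candidates = random_feature_subset(features, 2, rng)
vote = majority_vote([1, 0, 1, 1, 0])  # predictions from five hypothetical trees
print(sample, candidates, vote)
```

Because sampling is done with replacement, some rows typically appear more than once in a bootstrap sample and others not at all, which is what makes the trees differ from each other.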

**Mathematics Behind Random Forest**

Key mathematical concepts in Random Forest include:

- **Decision Trees**: Recursive binary partitioning to split data, maximizing information gain at each node.
- **Bootstrap Sampling**: Random sampling with replacement to create diverse data subsets, reducing overfitting.

**Difference between Bagging and Random Forest**

While both are ensemble methods using bootstrap sampling, they differ in:

- **Feature Handling**: Bagging uses the same features for all models, potentially leading to correlated predictions. Random Forest selects random feature subsets, reducing prediction correlation and enhancing performance, particularly with high-dimensional data.

### Boosting

Boosting is an ensemble machine learning algorithm that combines multiple weak models into a strong one. It aims to improve accuracy by focusing on errors from previous models.

**How Boosting Works**

- **Error Correction**: Boosting iteratively adds models to correct previous errors, focusing on misclassified data points.
- **Weight Adjustment**: Increases weights for misclassified points and decreases for correctly classified ones, making the model focus on harder cases.
- **Model Specialization**: Each new model becomes more specialized, enhancing overall ensemble accuracy.
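The weight adjustment can be illustrated with one round of an AdaBoost-style update. This is a minimal sketch under the usual discrete AdaBoost formulas; the function name and the toy weights are assumptions, and it requires the weighted error to be strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, is_correct):
    """One AdaBoost-style round. Returns (alpha, new_weights).

    weights    : current sample weights, summing to 1
    is_correct : whether the current weak model classified each sample correctly
    """
    # Weighted error of the current weak model (assumed to be in (0, 1))
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Model weight: larger when the weak model makes fewer weighted errors
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Increase weights of misclassified samples, decrease the others
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.2] * 5
is_correct = [False, True, True, True, True]
alpha, new_weights = adaboost_reweight(weights, is_correct)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak model is pushed to get it right.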

**Types of Boosting Algorithms**

- **AdaBoost (Adaptive Boosting)**
- **Gradient Boosting**
- **XGBoost (Extreme Gradient Boosting)**

Each varies in weight assignment and model building but follows the core concept of combining weak models to form a strong one.

**Difference between Bagging and Boosting**

- **Training Approach**: Bagging trains weak learners simultaneously, while Boosting does so sequentially, focusing on prior misclassifications.
- **Weight Redistribution**: Boosting redistributes weights to improve focus on challenging data points.
- **Use Cases**: Bagging is often used with high variance, low bias models (like decision trees), whereas Boosting is applied to low variance, high bias scenarios.
- **Overfitting Risk**: Boosting can be more prone to overfitting compared to Bagging. This risk can be mitigated through hyperparameter tuning and regularization.


### Gradient Boosting

Gradient Boosting is a sequential modeling technique, aiming to correct the errors of previous models. It revolves around three key components: the additive model, loss function, and a weak learner.

**Core Principles**

- **Sequential Development**: Models are built one after another, each focusing on reducing the errors made by its predecessor.
- **Numerical Optimization**: It interprets boosting as a numerical optimization of the loss function using Gradient Descent.

**Application in Regression and Classification**

- **Gradient Boosting Regressor**: Used when the target variable is continuous.
- **Gradient Boosting Classifier**: Applied in classification tasks.
- **Loss Function**: The key distinction between the regressor and classifier lies in their loss functions: Mean Squared Error (MSE) for regression and log loss (negative log-likelihood) for classification.

The objective of Gradient Boosting is to minimize the loss function by iteratively adding weak learners, tailored to the specific problem type (regression or classification).
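The idea of adding weak learners that fit the errors of the current ensemble can be sketched for regression with squared error, where the negative gradient is simply the residual. The sketch below uses one-split decision stumps on a single feature; all names, the learning rate, and the toy data are illustrative assumptions, and library implementations differ in many details.

```python
def fit_stump(xs, residuals):
    """Find the threshold on one feature that minimizes squared error.
    Returns (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical toy data with a step between x=3 and x=4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

model = gradient_boost(xs, ys)
base_mse = mse(ys, [model[0]] * len(ys))
final_mse = mse(ys, [boosted_predict(model, x) for x in xs])
print(base_mse, final_mse)
```

Each round reduces the training error of the ensemble, which is also why boosting needs a cap on rounds or some regularization to avoid fitting noise.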


### XGBoost

XGBoost stands for Extreme Gradient Boosting and is an advanced form of gradient boosting. It iteratively builds an ensemble of models, with each new model focusing on correcting errors of previous ones.

**How XGBoost Works**

- **Error Correction**: Each iteration involves fitting a new model to the residual errors made by the last model.
- **Objective Function**: Combines a loss function with a regularization term to minimize prediction errors and prevent overfitting.
- **Model Ensemble**: The algorithm adds each new model to the ensemble and repeats until reaching the desired count.

**Example Process Using XGBoost**

1. **Initialize the Model**: Begin with a simple decision tree.
2. **Make Predictions**: Use the model to predict training data, then compute residuals (differences between predictions and true labels).
3. **Fit a New Tree**: Train a new decision tree on these residuals.
4. **Combine Models**: Add this new tree to the ensemble, enhancing the overall model.
5. **Iteration**: Repeat this process for a set number of iterations, continually improving the model.
6. **Final Predictions**: Use the combined predictions of all trees for new data.

XGBoost's strength lies in its ability to focus on misclassified points and use regularization, enhancing both accuracy and generalization.


### Classification Evaluation Metrics

Evaluation metrics are crucial for assessing the performance of classification models in machine learning. Key metrics include F1 score, recall, precision, confusion matrix, and ROC AUC score.

**F1 Score**
- Combines precision and recall into a single value.
- Expressed between 0 and 1, where 1 indicates perfect precision and recall.
- Calculated as: $$ F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} $$

**Recall**
- Important when false negatives have high costs (e.g., medical diagnosis).
- Formula: $$ Recall = \frac{TP}{TP + FN} $$

**Precision**
- Key when false positives have significant consequences (e.g., fraud detection).
- Formula: $$ Precision = \frac{TP}{TP + FP} $$

**Confusion Matrix**
- Tabulates true positives, false positives, true negatives, and false negatives.

|                      | Actual Positive      | Actual Negative      |
|----------------------|----------------------|----------------------|
| Predicted Positive   | True Positive (TP)   | False Positive (FP)  |
| Predicted Negative   | False Negative (FN)  | True Negative (TN)   |

**ROC AUC Score**
- Measures a classifier's ability to differentiate between classes.
- Calculated as the area under the ROC curve, plotting true positive rate vs false positive rate.
- Formula: $$ ROC\ AUC\ Score = \int_0^1 TPR(FPR^{-1}(t)) dt $$

**Choosing the Right Metric**
- **F1 Score**: Use when precision and recall are equally important.
- **Recall**: Choose for high costs of missing positive cases.
- **Precision**: Prefer when false positives have serious implications.
- **Confusion Matrix**: Useful for a detailed performance overview.
- **ROC AUC Score**: Optimal when class distinction is crucial.
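As a worked example, the sketch below recomputes precision, recall, and F1 from the four confusion-matrix counts. The counts are taken from the XGBoost test-set confusion matrix reported later in this notebook. Note that scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, i.e. `[[TN, FP], [FN, TP]]`, which is the transpose of the table layout above. The function name is an illustrative assumption.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return (precision, recall, f1, balanced_accuracy) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)         # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f1, balanced_accuracy

# Counts in scikit-learn order: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 12742, 982, 1696, 1286
precision, recall, f1, balanced_accuracy = classification_metrics(tn, fp, fn, tp)
print(precision, recall, f1, balanced_accuracy)
```

The recall and F1 values match the reported test scores. The balanced accuracy matches the reported "Area Under Curve", because an AUC computed from hard class predictions rather than probabilities reduces to the average of the true positive rate and the true negative rate.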


### Importance with respect to the business problem:

Here we focus more on recall than on precision. As noted in published research [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10239051/#:~:text=In%20the%20telecommunication%20industry%2C%20research,an%20existing%20customer%20%5B10%5D.), acquiring a new customer costs at least 5 times more than retaining an existing one in telecoms. Missing a customer who is about to churn (a false negative) is therefore more costly than targeting a customer who would have stayed (a false positive), so precision will not carry heavy importance unless it exceeds the recall score by at least 5x.





### Model Evaluation Function: `evaluate_models()`

The `evaluate_models()` function is designed to assess the performance of various classification models on a dataset. 

**Function Features**

- **Inputs**: Accepts a model name, a fitted machine learning model, and the training and testing data.
- **Outputs**: Prints the F1 score, recall score, confusion matrix, and AUC for both the training and testing sets, and returns the F1 and recall scores as a dictionary.

**Utility**

- **Model Comparison**: Facilitates comparison across different classification models.
- **Best Model Selection**: Aids in selecting the most effective model for a specific problem.
- **Data Storage**: Outputs are stored in a pandas DataFrame for further analysis and comparison.


In [79]:
# function modelling
#Columns needed to compare metrics
comparison_columns = ['Model_Name', 'Train_F1score', 'Train_Recall', 'Test_F1score', 'Test_Recall']

comparison_df = pd.DataFrame()

def evaluate_models(model_name, model_defined_var, X_train, y_train, X_test, y_test):
  '''Generate predictions and evaluate a classification model on the training and testing sets.'''
  
  # train predictions
  y_train_pred = model_defined_var.predict(X_train)
  # train performance
  train_f1_score = f1_score(y_train,y_train_pred)
  train_recall = recall_score(y_train, y_train_pred)

  # test predictions
  y_pred = model_defined_var.predict(X_test)
  # test performance
  test_f1_score = f1_score(y_test,y_pred)
  test_recall = recall_score(y_test, y_pred)

  # Printing performance (note: the AUC below is computed from hard class predictions, not predicted probabilities)
  print("Train Results")
  print(f'F1 Score: {train_f1_score}')
  print(f'Recall Score: {train_recall}')
  print(f'Confusion Matrix: \n{confusion_matrix(y_train, y_train_pred)}')
  print(f'Area Under Curve: {roc_auc_score(y_train, y_train_pred)}')

  print(" ")

  print("Test Results")
  print(f'F1 Score: {test_f1_score}')
  print(f'Recall Score: {test_recall}')
  print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')
  print(f'Area Under Curve: {roc_auc_score(y_test, y_pred)}')

  
  #Saving our results
  global comparison_columns

  metric_scores = [model_name, train_f1_score, train_recall, test_f1_score, test_recall]
  final_dict = dict(zip(comparison_columns,metric_scores))

  return final_dict


#function to create the comparison table
final_list = []
def add_dic_to_final_df(final_dict):
  global final_list
  final_list.append(final_dict)
  global comparison_df
  comparison_df = pd.DataFrame(final_list, columns= comparison_columns)

The above code defines two functions for evaluating machine learning models for classification. The first function is evaluate_models() which takes a model name, a defined machine learning model variable, training and testing data, and evaluates the model's performance using the F1 score, recall score, confusion matrix, and area under the curve (AUC) score. It then prints the results of the model's performance on the training and testing datasets. Finally, it creates a dictionary of the evaluation metrics for the model and returns it.

The second function is add_dic_to_final_df() which takes the dictionary returned from the evaluate_models() function and appends it to a list of all models evaluated. It then rebuilds the global comparison_df DataFrame from that list; it does not return anything. The DataFrame contains the evaluation metrics for all the models evaluated so far, including the model name, training F1 score, training recall score, testing F1 score, and testing recall score.

The comparison_columns variable is a list of the column names for the comparison_df DataFrame. It is used to ensure that the DataFrame columns are always in the correct order.

In [80]:
# Churn in training data
y_train.value_counts(normalize=True)

Churn Value
0    0.687235
1    0.312765
Name: proportion, dtype: float64

In [81]:
# Churn in test data
y_test.value_counts(normalize=True)

Churn Value
0    0.821501
1    0.178499
Name: proportion, dtype: float64

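The classes are imbalanced: about 31% of the training rows and 18% of the test rows are churned. To account for this, the cells below pass the class proportions to the models as class weights during training.

To do this, we calculate the proportion of each class (for churn, the number of churned customers divided by the total number of customers) and use it as a weight in the loss function. Note that using the raw proportions gives the majority class (0) the larger weight, about 0.69 versus 0.31. Giving more importance to the minority class would require the inverse of these proportions, or `class_weight='balanced'` in scikit-learn. The results recorded below were produced with the raw proportions.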
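For comparison, a weighting scheme that gives the minority class the larger weight can be sketched as follows. It uses the heuristic `n_samples / (n_classes * class_count)`, which is the rule behind scikit-learn's `class_weight='balanced'` option. The helper name is hypothetical, and the class counts (16,706 non-churned and 7,603 churned) are the training-set totals from this notebook. This is not the weighting used for the results reported here.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Training-set class counts: 16,706 non-churned (0) and 7,603 churned (1)
labels = [0] * 16706 + [1] * 7603
weights = balanced_class_weights(labels)
print(weights)
```

With these counts the churned class receives a weight of roughly 1.6 and the non-churned class roughly 0.73, so errors on churned customers cost about twice as much during training.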

In [82]:
# Let's calculate the churn rate for data and store it as dict
w=y_train.value_counts(normalize=True).to_dict()

In [83]:
# Weights
w

{0: 0.6872351803858653, 1: 0.31276481961413466}

In [84]:
df['Number of Dependents'].value_counts(dropna=False)

Number of Dependents
0.0    437828
1.0     81616
4.0     39616
2.0     21125
3.0     16027
7.0     15458
6.0     15119
8.0     13623
9.0     13023
Name: count, dtype: int64

In [85]:


# Creating an instance of SimpleImputer with strategy as 'most_frequent'
imputer = SimpleImputer(strategy='most_frequent')

# Fitting the imputer on the X_train data
imputer.fit(X_train)

# Transforming both X_train and X_test data
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Optionally, if you want to put the imputed data back into a DataFrame:
X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train.columns)
X_test_imputed = pd.DataFrame(X_test_imputed, columns=X_test.columns)


In [86]:
# Define model
lg2 = LogisticRegression(random_state=13, class_weight=w)
# fit it
lg2.fit(X_train_imputed,y_train)

model_snapshot("fcTel2")

In [87]:
# Evaluate models
logistic_results = evaluate_models("Logistic Regression", lg2, X_train_imputed, y_train, X_test_imputed, y_test)
add_dic_to_final_df(logistic_results)

Train Results
F1 Score: 0.036627241510873716
Recall Score: 0.018939892147836382
Confusion Matrix: 
[[16590   116]
 [ 7459   144]]
Area Under Curve: 0.5059981395373445
 
Test Results
F1 Score: 0.03387663790348354
Recall Score: 0.017773306505700873
Confusion Matrix: 
[[13630    94]
 [ 2929    53]]
Area Under Curve: 0.5054619957186038


In [88]:
# define model
random_f = RandomForestClassifier(n_estimators=20, class_weight=w, random_state=7)
random_f.fit(X_train_imputed, y_train)

randomf_results = evaluate_models("Random Forest", random_f, X_train_imputed, y_train, X_test_imputed, y_test)
add_dic_to_final_df(randomf_results)

Train Results
F1 Score: 0.9939874463164851
Recall Score: 0.9893463106668421
Confusion Matrix: 
[[16696    10]
 [   81  7522]]
Area Under Curve: 0.9943738616664749
 
Test Results
F1 Score: 0.45925361766945927
Recall Score: 0.4044265593561368
Confusion Matrix: 
[[12660  1064]
 [ 1776  1206]]
Area Under Curve: 0.663449070992554


### XGBoost Training

XGBoost is a robust library for gradient boosting in supervised machine learning. Known for its efficiency and flexibility, it's a favored choice in many machine learning competitions.

**DMatrix in XGBoost**

- **Purpose**: Serves as a data structure to store and efficiently access input data during training.
- **Structure**: Acts as a wrapper around input data, which can be in the form of NumPy arrays or Pandas DataFrames.
- **Advantages**:
  - **Efficient Data Access**: Crucial for handling large datasets.
  - **Additional Features**: Handles missing values and stores labels and optional per-row weights alongside the features.
  - **Ease of Use**: Simplifies the process of feeding data into the XGBoost model.


In [89]:


# Convert training and test sets to DMatrix
dtrain = xgb.DMatrix(X_train_imputed, label=y_train)
dtest = xgb.DMatrix(X_test_imputed, label=y_test)

# Train initial model
params = {'objective': 'multi:softmax', 'num_class': 2}
num_rounds = 30
xgbmodel = xgb.train(params, dtrain, num_rounds)
model_snapshot("fcTel2")


xgb_results = evaluate_models("XGB", xgbmodel, dtrain, y_train, dtest, y_test)
add_dic_to_final_df(xgb_results)

Train Results
F1 Score: 0.8261832272349847
Recall Score: 0.7852163619623833
Confusion Matrix: 
[[15827   879]
 [ 1633  5970]]
Area Under Curve: 0.8663002676566376
 
Test Results
F1 Score: 0.4899047619047619
Recall Score: 0.4312541918175721
Confusion Matrix: 
[[12742   982]
 [ 1696  1286]]
Area Under Curve: 0.6798503544339973


In [90]:
# Let's see the comparison df
comparison_df

| | Model_Name | Train_F1score | Train_Recall | Test_F1score | Test_Recall |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.036627 | 0.01894 | 0.033877 | 0.017773 |
| 1 | Random Forest | 0.993987 | 0.989346 | 0.459254 | 0.404427 |
| 2 | XGB | 0.826183 | 0.785216 | 0.489905 | 0.431254 |


### **Future Work: Hyperparameter Tuning with XGBoost**

Perform hyperparameter tuning on the XGBoost model using a grid search approach. Hyperparameters to consider include:

* max_depth: the maximum depth of each tree in the ensemble
* learning_rate: the learning rate for gradient boosting
* n_estimators: the number of trees in the ensemble
* min_child_weight: the minimum sum of instance weights required in a child node to continue splitting
* subsample: the fraction of samples used for each tree
* colsample_bytree: the fraction of features used for each tree

Define a dictionary of hyperparameter values to search over and pass it to the param_grid parameter of the GridSearchCV function. Then evaluate the performance of the tuned model on the testing data using the sklearn.metrics module.

A different hyperparameter optimization technique, such as random search or Bayesian optimization, could also be used.

## Data Drift Monitoring

### Why Is Drift Important?

Drift in machine learning signifies changes in data or its relationship with target labels, potentially degrading model performance. Detecting drift is crucial in production environments to maintain accuracy, as it indicates the need for model adjustment or retraining.

- **Not All Changes Are Drift**: Regular periodic changes (like seasonal variations) are usually not considered drift.
- **Indication of Performance Deterioration**: When labels are unknown after prediction, performance cannot be measured directly, so detected drift serves as an early warning that the model may be degrading.

### Types of Drift

There are two primary types of drift in machine learning:

- **Data Drift**: Changes in the data distribution. For instance, socio-economic initiatives altering education levels can affect income distribution but not the correlation between education and income.

- **Concept Drift**: Shifts in the relationship between data and labels. For example, job market changes making experience more valuable than academic degrees for certain jobs, altering the education-income correlation.

Both types of drift can necessitate model retraining or updates.
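
To make the distinction concrete, the sketch below simulates the education and income example with NumPy. Everything in it is an assumption made for illustration: the `simulate` and `fitted_slope` helpers, the linear relationship, and all numeric values are invented and unrelated to the telecom data.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, edu_mean, slope):
    """Draw education years and an income that depends linearly on them."""
    education = rng.normal(edu_mean, 2.0, size=n)
    income = slope * education + rng.normal(0.0, 1.0, size=n)
    return education, income

def fitted_slope(education, income):
    """Least-squares slope of income on education."""
    return np.polyfit(education, income, deg=1)[0]

# Reference period.
ref_x, ref_y = simulate(5000, edu_mean=12.0, slope=3.0)
# Data drift: education levels rise, the education-income relationship is unchanged.
dd_x, dd_y = simulate(5000, edu_mean=14.0, slope=3.0)
# Concept drift: same education levels, but education now pays less.
cd_x, cd_y = simulate(5000, edu_mean=12.0, slope=1.5)

for name, x, y in [("reference", ref_x, ref_y),
                   ("data drift", dd_x, dd_y),
                   ("concept drift", cd_x, cd_y)]:
    print(f"{name:14s} mean education={x.mean():.2f}  slope={fitted_slope(x, y):.2f}")
```

Under data drift the feature mean moves while the fitted slope stays close to the reference; under concept drift the feature mean is stable but the slope changes, which is why feature-only monitoring can miss it.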

### Handling Drift

When facing drift:

1. **Identify the Change**: Determine if the drift is in features, labels, or predictions.
2. **Manual Data Exploration**: Investigate the root cause of changes to understand their impact.
3. **Retrain Your Model**: Address both data and concept drift by retraining with new, relevant data. This might require additional resources or be delayed if new labels are unavailable.

Retraining is essential for concept drift but can also be beneficial for data drift, especially if it affects label distribution.
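
For step 1, a per-feature drift score helps locate which columns have changed. The sketch below computes the Population Stability Index (PSI) with NumPy. PSI is swapped in here as a simple, self-contained stand-in; it is not claimed to be the statistic used by the Deepchecks checks later in this notebook, and `population_stability_index` is a hypothetical helper written for this example.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between two 1-D samples, using bin edges from reference quantiles."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    # Use only interior edges so values outside the reference range land in the end bins.
    ref_idx = np.searchsorted(edges[1:-1], reference, side="right")
    cur_idx = np.searchsorted(edges[1:-1], current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=n_bins) / len(current)
    # Floor the fractions to avoid log(0) when a bin is empty.
    eps = 1e-6
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=5000)   # e.g. a scaled training feature
same = rng.normal(0.0, 1.0, size=5000)        # new batch, same distribution
shifted = rng.normal(1.0, 1.0, size=5000)     # new batch, mean shifted by one std

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (shifted):  {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions and should be tuned to the use case.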



Reference: [Deepchecks Documentation](https://docs.deepchecks.com/stable/user-guide/general/drift_guide.html#what-is-distribution-drift)

In [93]:
# Define categorical and continuous columns
pred_cat_cols=[
       'Gender_Female', 'Gender_Male', 'Gender_Not Specified', 'Gender_Other',
       'Married_No', 'Married_Not Specified', 'Married_Yes', 'Dependents_No',
       'Dependents_Not Specified', 'Dependents_Yes', 'offer_A', 'offer_B',
       'offer_C', 'offer_D', 'offer_E', 'offer_F', 'offer_G', 'offer_H',
       'offer_I', 'offer_J', 'offer_No Offer', 'Referred a Friend_No',
       'Referred a Friend_Yes', 'Phone Service_No', 'Phone Service_Yes',
       'Multiple Lines_No', 'Multiple Lines_None', 'Multiple Lines_Yes',
       'Internet Service_No', 'Internet Service_Yes', 'Internet Type_Cable',
       'Internet Type_DSL', 'Internet Type_Fiber Optic', 'Internet Type_None',
       'Internet Type_Not Applicable', 'Online Security_No',
       'Online Security_Yes', 'Online Backup_No', 'Online Backup_Yes',
       'Device Protection Plan_No', 'Device Protection Plan_Yes',
       'Premium Tech Support_No', 'Premium Tech Support_Yes',
       'Streaming TV_No', 'Streaming TV_Yes', 'Streaming Movies_No',
       'Streaming Movies_Yes', 'Streaming Music_No', 'Streaming Music_Yes',
       'Unlimited Data_No', 'Unlimited Data_None', 'Unlimited Data_Yes',
       'Payment Method_Bank Withdrawal', 'Payment Method_Credit Card',
       'Payment Method_Wallet Balance']

pred_cts_cols=['Age', 'Number of Dependents', 'roam_ic', 'roam_og', 'loc_og_t2t',
       'loc_og_t2m', 'loc_og_t2f', 'loc_og_t2c', 'std_og_t2t', 'std_og_t2m',
       'std_og_t2f', 'std_og_t2c', 'isd_og', 'spl_og', 'og_others',
       'loc_ic_t2t', 'loc_ic_t2m', 'loc_ic_t2f', 'std_ic_t2t', 'std_ic_t2m',
       'std_ic_t2f', 'std_ic_t2o', 'spl_ic', 'isd_ic', 'ic_others',
       'total_rech_amt', 'total_rech_data', 'vol_4g', 'vol_5g', 'arpu_5g',
       'arpu_4g', 'arpu', 'aug_vbc_5g', 'Number of Referrals',
       'Streaming Data Consumption', 'Satisfaction Score', 'total_recharge']

The code below defines a function called check_data_drift() that checks for data drift between two datasets, ref_df and cur_df, based on a set of predictors. The function uses the deepchecks library to create two datasets, ref_dataset and cur_dataset, from the reference and current dataframes, respectively. The features parameter of each dataset is the intersection of predictors with that dataframe's columns (ref_features, cur_features), and the cat_features parameter is the intersection of the pred_cat_cols list defined above with those columns (ref_cat_features, cur_cat_features).

The function then creates a suite object containing two checks for data drift: WholeDatasetDrift() and TrainTestFeatureDrift(). WholeDatasetDrift() checks for overall drift across the entire dataset, while TrainTestFeatureDrift() checks for drift in individual features between the reference and current datasets. The add_condition_overall_drift_value_less_than() and add_condition_drift_score_less_than() methods both set the threshold for acceptable drift to 0.2 in the code as written.

The suite is then run using the reference and current datasets as train_dataset and test_dataset, respectively, and the results are stored in an r object. If any checks did not run or did not pass, the retrain variable is set to True, indicating that the model may need to be retrained. The block that would save the results as an HTML report, with a filename based on the job_id parameter, is currently commented out, so no report is written.

The function returns a dictionary with two keys: report, which contains the r object with the results of the data drift analysis, and retrain, which is a boolean value indicating whether the model needs to be retrained.

In [94]:
def check_data_drift(ref_df:pd.DataFrame, cur_df:pd.DataFrame, predictors:list, job_id:str):
    """
    Check for data drift between two datasets and decide whether to retrain the model.
    Saving the HTML report is currently disabled (see the commented-out block below).
    Categorical features are taken from the global ``pred_cat_cols`` list.
    :param ref_df: Reference dataset
    :param cur_df: Current dataset
    :param predictors: Predictors to check for drift
    :param job_id: Job ID, used to name the report file when saving is enabled
    :return: dict with keys "report" (suite result) and "retrain" (bool)
    """
    ref_features = [col for col in predictors if col in ref_df.columns]
    cur_features = [col for col in predictors if col in cur_df.columns]
    ref_cat_features = [col for col in pred_cat_cols if col in ref_df.columns]
    cur_cat_features = [col for col in pred_cat_cols if col in cur_df.columns]
    ref_dataset = Dataset(ref_df,  features=ref_features, cat_features=ref_cat_features)
    cur_dataset = Dataset(cur_df, features=cur_features, cat_features=cur_cat_features)
    
    suite = Suite("data drift",
        WholeDatasetDrift().add_condition_overall_drift_value_less_than(0.2), #0.2 
        TrainTestFeatureDrift().add_condition_drift_score_less_than(0.2), #0.1   
        )
    r = suite.run(train_dataset=ref_dataset, test_dataset=cur_dataset)
    retrain = (len(r.get_not_ran_checks())>0) or (len(r.get_not_passed_checks())>0)
    
    # try:
    #     r.save_as_html(f"../reports/{job_id}_data_drift_report.html")
    #     print("[INFO] Data drift report saved as {}".format(f"{job_id}_data_drift_report.html"))
    # except Exception as e:
    #     print(f"[WARNING][DRIFTS.check_DATA_DRIFT] {traceback.format_exc()}")
    return {"report": r, "retrain": retrain}


In [95]:
# Defining the preprocessing steps for test data
def preprocess_steps(data):
    df=data.copy()
    drop_cols=['Customer ID', 'Quarter', 'Quarter of Joining', 'Month',
       'Month of Joining', 'zip_code','Location ID', 'Service ID',
       'state', 'county', 'timezone', 'area_codes', 'country', 'latitude',
       'longitude','Status ID']
    df=df.drop(columns=drop_cols)
    processed_data=df.copy()
    processed_data[encoded_features] = encoder.transform(processed_data[categorical_cols])
    processed_data=processed_data.drop(categorical_cols,axis=1)
    processed_data[cts_cols]  = scaler.transform(processed_data[cts_cols]) 

    return processed_data

In [96]:
# Creating a copy of train for reference
ref_check_data=X_train_imputed.copy()

### Inference Pipeline with and without Label Availability

In machine learning, deploying models for predictions on new data involves different approaches based on label availability.

#### Case 1: Label Not Available
Without labels, model drift can't be assessed, and the model can't be retrained. The inference pipeline here includes:

- **Preprocessing**: Apply the same preprocessing steps to new data as were used for training data.
- **Prediction**: Use the trained model to predict on preprocessed new data.
- **Data Drift Check**: Monitor for any significant data drift compared to training data, which might impact prediction accuracy.

#### Case 2: Label Available
With labels, it's possible to check for both data and model drift and consider retraining. The pipeline involves:

- **Preprocessing**: Same as in Case 1, preprocess the new data.
- **Data Drift Check**: Compare new data against training data to detect significant drift.
- **Model Retraining**: If drift is detected, consider retraining the model.
- **Prediction**: Use either the existing or the retrained model for predictions.

### Importance of Drift Checks
- **Assumption Validation**: Machine learning models operate under the assumption that training and new data distributions are similar.
- **Performance Maintenance**: Identifying and addressing data or model drift ensures sustained model accuracy on new data.

The inference pipeline is key in real-world applications, ensuring that models remain accurate and effective over time.
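
The two cases above can be sketched as a single routing function. This is a minimal illustration of the control flow only, not the notebook's own pipeline: `run_inference_batch` and the callables passed to it (`preprocess`, `predict`, `drift_check`, `retrain`) are hypothetical names, and the toy "models" simply threshold a number.

```python
from typing import Any, Callable, Optional

def run_inference_batch(
    batch: Any,
    preprocess: Callable[[Any], Any],
    predict: Callable[[Any], Any],
    drift_check: Callable[[Any], bool],
    labels: Optional[Any] = None,
    retrain: Optional[Callable[[Any, Any], Callable[[Any], Any]]] = None,
) -> dict:
    """Route one batch through the label / no-label branches described above."""
    clean = preprocess(batch)
    drifted = drift_check(clean)
    retrained = False
    # Case 2: labels are present, so a drifted batch can trigger retraining first.
    if labels is not None and drifted and retrain is not None:
        predict = retrain(clean, labels)
        retrained = True
    # Case 1 (and the no-drift path of Case 2): predict with the existing model.
    return {"predictions": predict(clean), "drift": drifted, "retrained": retrained}

# Toy stand-ins for the real components.
preprocess = lambda rows: [float(r) for r in rows]
old_model = lambda rows: [int(r > 5) for r in rows]
drift_check = lambda rows: (sum(rows) / len(rows)) > 8      # hypothetical drift rule
retrain = lambda rows, labels: (lambda new_rows: [int(r > 9) for r in new_rows])

no_labels = run_inference_batch(["7", "9", "11"], preprocess, old_model, drift_check)
with_labels = run_inference_batch(["7", "9", "11"], preprocess, old_model, drift_check,
                                  labels=[0, 0, 1], retrain=retrain)
print(no_labels)
print(with_labels)
```

Without labels the drift flag is only reported alongside the predictions; with labels the same flag decides whether the retrained model replaces the existing one before predicting.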


In [97]:
def inference_pipeline(inference_data, reference_data, job_id, predictors_cols):
    """Preprocess the inference data and check it for drift against the reference data."""
    # Data preprocessing
    clean_inf_data = preprocess_steps(inference_data)

    # Data drift check
    data_drift = check_data_drift(ref_df=reference_data, cur_df=clean_inf_data, predictors=predictors_cols, job_id=job_id)
    print(f"Data Drift Retrain: {data_drift['retrain']}")

    return data_drift
    

In [98]:
model_snapshot("fcTel2")
d1_drift=inference_pipeline(inference_data=prediction_data[prediction_data.columns[:-1]],reference_data=ref_check_data,job_id='1cbhja2',predictors_cols=pred_cat_cols+pred_cts_cols)

Data Drift Retrain: False


No data drift!

In [99]:
pred_processed_data = preprocess_steps(prediction_data.drop(columns=['Churn Value', 'age_bucket']))

# Check data types of columns in pred_processed_data
print(pred_processed_data.dtypes)


Age                               float64
Number of Dependents              float64
arpu                              float64
roam_ic                           float64
roam_og                           float64
                                   ...   
Unlimited Data_Yes                float64
Unlimited Data_nan                float64
Payment Method_Bank Withdrawal    float64
Payment Method_Credit Card        float64
Payment Method_Wallet Balance     float64
Length: 93, dtype: object


In [100]:
# Converting the preprocessed data to DMatrix format for XGBoost
d_pred_processed_data = xgb.DMatrix(pred_processed_data)

# Making predictions using the XGBoost model
predictions = xgbmodel.predict(d_pred_processed_data)


Let's compare the predictions with the actual values.

In [101]:
# Saving the actual labels
pred_label=prediction_data['Churn Value']

In [102]:
print(f'Confusion Matrix: \n{confusion_matrix(pred_label, predictions)}')
print(f'Area Under Curve: {roc_auc_score(pred_label, predictions)}')

Confusion Matrix: 
[[21402  2112]
 [  615  1489]]
Area Under Curve: 0.8089403942186696


**Observation:**

* The confusion matrix shows that the model correctly predicted 21402 instances of non-churn and 1489 instances of churn. It misclassified 2112 non-churners as churners (false positives) and 615 churners as non-churners (false negatives).

* The area under the curve (AUC) is 0.8089, which indicates a moderate ability to distinguish between churn and non-churn customers. Because it is computed from hard class predictions rather than predicted probabilities, this value equals the balanced accuracy (the mean of the recall on each class).

* False negatives occur when the model predicts that a customer will not churn, but the customer does churn. The model has 615 false negatives. This is a concern because the model fails to identify some of the customers who are at risk of churning, which could result in missed opportunities to retain them.

* False positives occur when the model predicts that a customer will churn, but the customer does not churn. The model has 2112 false positives. This is also a concern because retention efforts may be directed towards customers who are not at risk of churning, which wastes resources. These errors are less concerning than false negatives, however, because spending on unnecessary retention is ultimately more cost-effective than letting customers churn in order to save on retention costs.
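
The figures discussed in these observations can be derived directly from the confusion matrix. The sketch below uses plain Python arithmetic on the four counts printed above, assuming the scikit-learn layout in which rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the prediction run above:
# rows are actual classes, columns are predicted classes.
tn, fp = 21402, 2112
fn, tp = 615, 1489

recall = tp / (tp + fn)         # share of actual churners that were caught
specificity = tn / (tn + fp)    # share of non-churners correctly left alone
precision = tp / (tp + fp)      # share of churn predictions that were real churners
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"Recall:            {recall:.3f}")
print(f"Specificity:       {specificity:.3f}")
print(f"Precision:         {precision:.3f}")
print(f"F1 score:          {f1:.3f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

The balanced accuracy reproduces the 0.8089 reported as Area Under Curve, which is what `roc_auc_score` returns when it is given class labels instead of probabilities. The precision value shows that fewer than half of the customers flagged as churners actually churned.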


## Future potential work 

We can improve the results by retraining the model.