# Telecom Churn Prediction - Starter Notebook

**Author:** Krithigha Ganesh

# 0. Problem statement

In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyze customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn, and identify the main indicators of churn.

In this competition, your goal is *to build a machine learning model that is able to predict churning customers based on the features provided for their usage.*

**Customer behaviour during churn:**

Customers usually do not decide to switch to a competitor instantly, but rather over a period of time (this is especially true of high-value customers). In churn prediction, we assume that there are three phases in the customer lifecycle:

1. <u>The ‘good’ phase:</u> In this phase, the customer is happy with the service and behaves as usual.

2. <u>The ‘action’ phase:</u> The customer experience starts to sour in this phase, e.g. the customer gets a compelling offer from a competitor, faces unjust charges, or becomes unhappy with service quality. In this phase, the customer usually behaves differently than in the ‘good’ months. It is crucial to identify high-churn-risk customers in this phase, since corrective action can still be taken at this point (such as matching the competitor’s offer or improving service quality).

3. <u>The ‘churn’ phase:</u> In this phase, the customer is said to have churned. In this case, since you are working over a four-month window, the first two months are the ‘good’ phase, the third month is the ‘action’ phase, while the fourth month (September) is the ‘churn’ phase.

In [312]:
#Data Structures
import pandas as pd
import numpy as np
import re
import os

### For installing missingno library, type this command in terminal
#pip install missingno

import missingno as msno

#Sklearn
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             log_loss, precision_score, recall_score)
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV, cross_val_score,
                                     learning_curve, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#XGBoost
import xgboost as xgb

#Plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

#Others
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [313]:
#COMMENT OUT THIS SECTION IF RUNNING THIS NOTEBOOK LOCALLY

#Checking the kaggle paths for the uploaded datasets
#import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
    #for filename in filenames:
        #print(os.path.join(dirname, filename))

In [314]:
#IF RUNNING THIS LOCALLY, PASS THE RELATIVE PATH OF THE CSV FILES BELOW
#(e.g. if the files are in the same folder as the notebook, simply write "train.csv" as the path)

df = pd.read_csv("train.csv")
data=df.copy()
unseen = pd.read_csv("test.csv")
unseen_df=unseen.copy()
sample = pd.read_csv("sample.csv")
data_dict = pd.read_csv("data_dictionary.csv")

print(data.shape)
print(unseen.shape)
print(sample.shape)
print(data_dict.shape)
data

(69999, 172)
(30000, 171)
(30000, 2)
(36, 2)


Unnamed: 0,id,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,arpu_6,arpu_7,...,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,churn_probability
0,0,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,31.277,87.009,...,0,0,,,,1958,0.00,0.00,0.00,0
1,1,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,0.000,122.787,...,0,0,,1.0,,710,0.00,0.00,0.00,0
2,2,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,60.806,103.176,...,0,0,,,,882,0.00,0.00,0.00,0
3,3,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,156.362,205.260,...,0,0,,,,982,0.00,0.00,0.00,0
4,4,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,240.708,128.191,...,1,0,1.0,1.0,1.0,647,0.00,0.00,0.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69994,69994,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,15.760,410.924,...,1,0,,1.0,1.0,221,0.00,0.00,0.00,0
69995,69995,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,160.083,289.129,...,0,0,,,,712,0.00,0.00,0.00,0
69996,69996,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,372.088,258.374,...,0,0,,,,879,0.00,0.00,0.00,0
69997,69997,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,238.575,245.414,...,0,0,1.0,1.0,1.0,277,664.25,1402.96,990.97,0


1. Let's compare the data dictionary with the churn dataset.
2. The data dictionary contains a list of abbreviations that tell you what each feature/variable in the churn dataset represents.
3. Example: 

> "arpu_7" -> Average revenue per user + KPI for the month of July
>
> "onnet_mou_6" ->  All kind of calls within the same operator network + Minutes of usage voice calls + KPI for the month of June
>
>"night_pck_user_8" -> Scheme to use during specific night hours only + Prepaid service schemes called PACKS + KPI for the month of August
>
>"max_rech_data_7" -> Maximum + Recharge + Mobile internet + KPI for the month of July

It's important to understand the definition of each feature you are working with. Take notes on which features you think might affect a user's churn, and on what analysis could help you understand each feature's distribution better.
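The naming convention in the examples above can be decoded mechanically. The following is a minimal sketch, not part of the original notebook: `decode_feature`, `ACRONYMS` and `MONTHS` are hypothetical names, the acronym descriptions are copied from the data dictionary and the examples in this notebook, and the month mapping assumes the suffixes 6, 7 and 8 stand for June, July and August.

```python
# Hypothetical helper (not in the original notebook): split a column name
# into tokens and look each token up in a small acronym table.
ACRONYMS = {
    "loc": "Local calls within same telecom circle",
    "std": "STD calls outside the calling circle",
    "ic": "Incoming calls",
    "og": "Outgoing calls",
    "t2m": "Operator T to other operator mobile",
    "mou": "Minutes of usage voice calls",
}
# Assumption: numeric suffixes denote calendar months.
MONTHS = {"6": "June", "7": "July", "8": "August"}


def decode_feature(name):
    """Return one description per underscore-separated token in `name`."""
    parts = []
    for token in name.lower().split("_"):
        if token in ACRONYMS:
            parts.append(ACRONYMS[token])
        elif token in MONTHS:
            parts.append(f"KPI for the month of {MONTHS[token]}")
        else:
            parts.append(token)  # unknown token: keep it unchanged
    return parts


decoded = decode_feature("loc_ic_t2m_mou_8")
print(" + ".join(decoded))
```

Tokens missing from the table (such as `arpu`) are passed through unchanged, so the gaps show which data dictionary entries still need to be looked up.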

In [315]:
data_dict

Unnamed: 0,Acronyms,Description
0,CIRCLE_ID,Telecom circle area to which the customer belo...
1,LOC,Local calls within same telecom circle
2,STD,STD calls outside the calling circle
3,IC,Incoming calls
4,OG,Outgoing calls
5,T2T,Operator T to T ie within same operator mobile...
6,T2M,Operator T to other operator mobile
7,T2O,Operator T to other operator fixed line
8,T2F,Operator T to fixed lines of T
9,T2C,Operator T to its own call center


# 2. Cleaning Data & EDA

In [316]:
model_1_data = df.copy()
model_1_data.shape

(69999, 172)

#### At the start of cleaning, the train data has 172 columns ####
#### Null values are counted per column, and all columns with more than 50% null values are dropped ####
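The 50% rule can also be expressed compactly with `isnull().mean()`. This is a small sketch on a made-up toy frame; the names `toy`, `threshold` and `kept` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration): column "b" is 75% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# isnull().mean() is the fraction of missing values per column.
missing_percent = toy.isnull().mean() * 100

# Keep only the columns whose missing percentage does not exceed the threshold.
threshold = 50
kept = toy.loc[:, missing_percent <= threshold]

print(missing_percent.to_dict())
print(list(kept.columns))
```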

In [317]:
null_values = model_1_data.isnull().sum()

null_values_percent=(100 * null_values )/len(data)
null_values_percent
null_values_table = pd.concat([null_values,null_values_percent] ,axis=1)
null_values_table = null_values_table.rename(columns = {0:"Missing Values", 1:"Missing Percent"})
null_values_table 
all_null_values_table = null_values_table[null_values_table["Missing Percent"] >50]
less_null_values_table = null_values_table[null_values_table["Missing Percent"] >0]
#all_null_values_table = null_values_table[(null_values_table["Missing Percent"] > 1) & (null_values_table["Missing Percent"] < 10)]

#### The shape of the test data is also checked ####

In [318]:
unseen_df.shape

(30000, 171)

In [319]:
model_1_data= model_1_data.drop(columns= all_null_values_table.index)
model_1_data.shape

(69999, 142)

In [320]:
object_columns = model_1_data.columns[model_1_data.dtypes == object]
object_columns

Index(['last_date_of_month_6', 'last_date_of_month_7', 'last_date_of_month_8',
       'date_of_last_rech_6', 'date_of_last_rech_7', 'date_of_last_rech_8'],
      dtype='object')

#### Object columns are dropped from the Train data ####

In [321]:
model_1_data.drop(columns=object_columns, inplace=True)
model_1_data.shape

(69999, 136)

In [322]:
less_null_values_table

Unnamed: 0,Missing Values,Missing Percent
loc_og_t2o_mou,702,1.002871
std_og_t2o_mou,702,1.002871
loc_ic_t2o_mou,702,1.002871
last_date_of_month_7,399,0.570008
last_date_of_month_8,733,1.047158
...,...,...
night_pck_user_7,52134,74.478207
night_pck_user_8,51582,73.689624
fb_user_6,52431,74.902499
fb_user_7,52134,74.478207


#### Rows with null values in ic_others_6, ic_others_7 or ic_others_8 are dropped from the train data ####

In [323]:
model_1_data = model_1_data.dropna(subset=['ic_others_6', 'ic_others_7', 'ic_others_8'])
model_1_data.shape

(63842, 136)

In [324]:
null_values = model_1_data.isnull().sum()

null_values_percent=(100 * null_values )/len(model_1_data)
null_values_percent
null_values_table = pd.concat([null_values,null_values_percent] ,axis=1)
null_values_table = null_values_table.rename(columns = {0:"Missing Values", 1:"Missing Percent"})
null_values_table 
all_null_values_table = null_values_table[null_values_table["Missing Percent"] >0]
all_null_values_table

Unnamed: 0,Missing Values,Missing Percent


#### The missing-value percentage is checked again; no null values remain, so we proceed to checking unique values ####

In [325]:
columndatatype = [] 
for col in model_1_data.columns:
    unique_values =model_1_data[col].unique()
    data_type =model_1_data[col].dtype
    columndatatype.append ([col, unique_values, data_type])
columndatatype_df =pd.DataFrame(columndatatype, columns = ['col','unique_values','data_type'])
columndatatype_df

Unnamed: 0,col,unique_values,data_type
0,id,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...",int64
1,circle_id,[109],int64
2,loc_og_t2o_mou,[0.0],float64
3,std_og_t2o_mou,[0.0],float64
4,loc_ic_t2o_mou,[0.0],float64
...,...,...,...
131,aon,"[1958, 710, 882, 982, 647, 698, 1083, 584, 245...",int64
132,aug_vbc_3g,"[0.0, 82.26, 1.05, 700.4, 531.77, 132.74, 383....",float64
133,jul_vbc_3g,"[0.0, 73.56, 0.86, 185.71, 40.41, 287.81, 217....",float64
134,jun_vbc_3g,"[0.0, 177.14, 18.95, 173.72, 0.26, 265.4, 254....",float64


In [326]:
model_1_data.drop(columns=['std_og_t2c_mou_6','std_og_t2c_mou_7','std_og_t2c_mou_8','std_ic_t2o_mou_6','std_ic_t2o_mou_7','std_ic_t2o_mou_8','loc_og_t2o_mou','circle_id','std_og_t2o_mou','loc_ic_t2o_mou'], inplace=True)
one_unique_value_columns = columndatatype_df[columndatatype_df['unique_values'].apply(lambda x: len(x) == 1)]

#### Columns with only one unique value are removed ####

In [327]:
columns_to_keep = model_1_data.columns.intersection(unseen_df.columns)
columns_to_keep
unseen_df = unseen_df[columns_to_keep] 
unseen_df = unseen_df.set_index('id')
unseen_df.shape
null_values = unseen_df.isnull().sum()

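# Note: the denominator is len(data) (the 69,999 train rows), so the percentages
# below are relative to the train set size, not to the 30,000 test rows
null_values_percent=(100 * null_values )/len(data)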
null_values_percent
null_values_table = pd.concat([null_values,null_values_percent] ,axis=1)
null_values_table = null_values_table.rename(columns = {0:"Missing Values", 1:"Missing Percent"})
null_values_table 
all_null_values_table = null_values_table[null_values_table["Missing Percent"] >0]
all_null_values_table

Unnamed: 0,Missing Values,Missing Percent
onnet_mou_6,1169,1.670024
onnet_mou_7,1172,1.674310
onnet_mou_8,1675,2.392891
offnet_mou_6,1169,1.670024
offnet_mou_7,1172,1.674310
...,...,...
isd_ic_mou_7,1172,1.674310
isd_ic_mou_8,1675,2.392891
ic_others_6,1169,1.670024
ic_others_7,1172,1.674310


#### Missing values in the test data are identified and imputed with 0 ####

In [328]:
missing_data_percent = unseen_df.isnull().any()
impute_cols = missing_data_percent[missing_data_percent.gt(0)].index
impute_cols
imp = SimpleImputer(strategy='constant', fill_value=0)
unseen_df[impute_cols] = imp.fit_transform(unseen_df[impute_cols])
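For a constant fill on numeric columns, pandas `fillna` gives the same values as the `SimpleImputer` call above. The sketch below uses a made-up toy frame (`toy`, `cols_with_missing` and `filled` are illustrative names). It also makes the underlying assumption explicit: filling with 0 treats a missing usage value as "no usage", which is a modelling choice rather than a fact about the data.

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration) with gaps in two usage columns.
toy = pd.DataFrame({
    "onnet_mou_6": [12.5, np.nan, 0.0],
    "ic_others_6": [np.nan, 1.2, np.nan],
    "aon": [100, 200, 300],
})

# Columns that contain at least one missing value.
cols_with_missing = toy.columns[toy.isnull().any()]

# Constant fill: every missing value becomes 0 ("no usage" assumption).
filled = toy.copy()
filled[cols_with_missing] = filled[cols_with_missing].fillna(0)

print(list(cols_with_missing))
print(filled)
```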

#### The id column is set as the index, and we verify that no columns with a single unique value remain ####

In [329]:
model_1_data = model_1_data.set_index('id')

In [330]:
columndatatype = [] 
for col in model_1_data.columns:
    unique_values =model_1_data[col].unique()
    data_type =model_1_data[col].dtype
    columndatatype.append ([col, unique_values, data_type])
columndatatype_df =pd.DataFrame(columndatatype, columns = ['col','unique_values','data_type'])
columndatatype_df
one_unique_value_columns = columndatatype_df[columndatatype_df['unique_values'].apply(lambda x: len(x) == 1)]
one_unique_value_columns

Unnamed: 0,col,unique_values,data_type


#### Features (X) and target (y) are separated, and the train data is split into train and hold-out sets ####

In [331]:
y = model_1_data['churn_probability']
X = model_1_data.drop(columns=['churn_probability'])

In [332]:
unseen_df.shape

(30000, 124)

In [333]:
X.shape

(63842, 124)

In [334]:
train, test, target_train, target_test = train_test_split(X, y, train_size= 0.75,random_state=42);
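Because churners are a small minority, a plain random split can leave the two parts with slightly different churn rates. One option, shown here as a sketch on synthetic data and not what the cell above does, is to pass `stratify` to `train_test_split` so both parts keep the same class proportions. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (made up): 100 positives out of 2,000 rows.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 3))
y_toy = np.zeros(2000, dtype=int)
y_toy[:100] = 1

# stratify=y_toy keeps the positive rate (nearly) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.75, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("hold-out positive rate:", y_te.mean())
```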

## Model 1 with XGBoost ##

In [335]:
xgb_cl = xgb.XGBClassifier(n_jobs = -1,objective = 'binary:logistic')

In [336]:
# Fit the model to our train and target
xgb_cl.fit(train, target_train)  # default 
# Get our predictions
xgb_predictions = xgb_cl.predict(test)

In [337]:
print(accuracy_score(target_test, xgb_predictions))
print("Confusion Matrix:\n", confusion_matrix(target_test, xgb_predictions))

0.9514441451036902
Confusion Matrix:
 [[14807   239]
 [  536   379]]


#### Initial Model 1 - accuracy and confusion matrix ####
#### Based on the above, hyperparameter tuning was performed ####
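Accuracy alone is hard to interpret on imbalanced data, so it helps to derive precision and recall for the churn class from the confusion matrix printed above. The short calculation below uses only those four printed counts and sklearn's layout (rows are actual classes, columns are predicted classes). It also computes the accuracy of a trivial model that always predicts "no churn", as a reference point.

```python
# Counts copied from the confusion matrix of the initial XGBoost model.
tn, fp = 14807, 239   # actual class 0: predicted 0, predicted 1
fn, tp = 536, 379     # actual class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of actual churners the model caught

# Reference point: accuracy of always predicting class 0 ("no churn").
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"always-0 baseline accuracy={majority_baseline:.4f}")
```

The accuracy of about 95% is less than one percentage point above the always-0 baseline of about 94.3%, while recall shows that fewer than half of the actual churners are identified.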

In [338]:
'''param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.75, 1.0],
    'colsample_bytree': [0.5, 0.75, 1.0],
    'min_child_weight': [1, 5, 10]
}

# Setup the grid search
grid_search = GridSearchCV(estimator=xgb_cl, param_grid=param_grid,
                           scoring='roc_auc', cv=3, n_jobs=-1,verbose=3)

# Fit the model
grid_search.fit(train, target_train)

# Best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC score: ", grid_search.best_score_)'''
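The grid search is left commented out, most likely because of its cost. As a rough sketch, the number of model fits it would need can be counted directly from the grid: the number of combinations is the product of the list lengths, and each combination is fitted once per cross-validation fold.

```python
from itertools import product

# Same grid as in the commented-out cell above.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.5, 0.75, 1.0],
    'colsample_bytree': [0.5, 0.75, 1.0],
    'min_child_weight': [1, 5, 10],
}

# Every combination of one value per parameter.
combinations = list(product(*param_grid.values()))
n_combinations = len(combinations)

cv_folds = 3
n_fits = n_combinations * cv_folds

print(n_combinations, "combinations,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the full training data.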


In [339]:
best_params = {
    'colsample_bytree': 0.5,
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_child_weight': 10,
    'n_estimators': 100,
    'subsample': 0.75
}
xgb_cl_model = xgb.XGBClassifier(**best_params)
xgb_cl_model.fit(train, target_train)
xgb_predictions_cl = xgb_cl_model.predict(test)
print(accuracy_score(target_test, xgb_predictions_cl))
print("Confusion Matrix:\n", confusion_matrix(target_test, xgb_predictions_cl))

0.9535116847315331
Confusion Matrix:
 [[14823   223]
 [  519   396]]


#### After hyperparameter tuning, the accuracy and confusion matrix of Model 1 are shown above ####

# Changes in Model
### Outliers are capped at mean ± 3 standard deviations and the data is scaled using StandardScaler ###

In [340]:
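def cap_outliers(array, k=3):
    # Cap values at mean ± k standard deviations. clip() returns a new Series,
    # so the input column is not modified in place.
    upper_limit = array.mean() + k*array.std()
    lower_limit = array.mean() - k*array.std()
    return array.clip(lower=lower_limit, upper=upper_limit)
# Note: the limits for `test` are computed from `test` itself, not from `train`
train_filtered1 = train.apply(cap_outliers, axis=0)
test_filtered1 = test.apply(cap_outliers, axis=0)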
#plt.figure(figsize=(15,8))
#plt.xticks(rotation=45)
#sns.boxplot(data = train_filtered1)
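The cell above computes the capping limits for the hold-out split from the hold-out split itself. A common alternative, sketched here on made-up toy data and not what the notebook does, is to learn the limits on the training data only and reuse them everywhere else, in the same way the scaler is fitted once and then applied. The helper names `fit_cap_limits` and `apply_cap` are hypothetical.

```python
import pandas as pd


def fit_cap_limits(df, k=3):
    """Per-column (lower, upper) limits at mean ± k standard deviations."""
    mean, std = df.mean(), df.std()
    return mean - k * std, mean + k * std


def apply_cap(df, lower, upper):
    """Clip each column to the given limits and return a new frame."""
    return df.clip(lower=lower, upper=upper, axis=1)


# Toy data (made up): the hold-out frame contains an extreme value.
toy_train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy_test = pd.DataFrame({"x": [0.0, 3.0, 100.0]})

# k=1 only so that the tiny toy sample actually triggers the cap.
lower, upper = fit_cap_limits(toy_train, k=1)
capped_train = apply_cap(toy_train, lower, upper)
capped_test = apply_cap(toy_test, lower, upper)

print(capped_test["x"].tolist())
```

With this arrangement the same limits can be applied to any later data, including an unseen submission set.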

In [341]:
train_scaled = train_filtered1.copy()
test_scaled = test_filtered1.copy()
scaler = StandardScaler()
#X_train_scaled = scaler.fit_transform(train)
#X_test_scaled = scaler.transform(test)
#X_train_scaled
binary_columns = train_filtered1.columns[(train_filtered1.nunique() == 2) & (train_filtered1.isin([0, 1]).all())]
columns_to_scale = train_filtered1.columns[~train_filtered1.columns.isin(binary_columns)]
train_scaled[columns_to_scale] = scaler.fit_transform(train_filtered1[columns_to_scale])
test_scaled[columns_to_scale] = scaler.transform(test_filtered1[columns_to_scale])
train_scaled

Unnamed: 0_level_0,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,roam_ic_mou_6,...,monthly_3g_6,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
46794,2.746128,0.054397,1.709303,-0.561227,-0.552175,-0.536823,0.390974,-0.704115,-0.151160,-0.264343,...,4.028435,3.927645,3.791285,-0.202127,-0.197654,-0.199821,2.489816,4.631135,-0.315846,0.05562
3971,-1.071516,-1.054859,0.023187,-0.561227,-0.552175,0.808078,-0.769418,-0.757548,-0.200171,-0.020844,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.585034,-0.327793,-0.315846,-0.30627
17983,1.479749,0.861721,2.986783,4.305875,-0.348534,0.351594,1.143669,3.983600,4.015958,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,0.379564,-0.327793,-0.315846,-0.30627
51532,-0.767968,0.421117,-0.140070,-0.241579,0.591331,0.116224,-0.675388,-0.178200,-0.257636,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,0.430551,-0.327793,-0.315846,-0.30627
22824,-0.192268,-0.765381,-0.767159,-0.473223,-0.534172,-0.544777,-0.674107,-0.708805,-0.750727,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-1.059528,-0.327793,-0.315846,-0.30627
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68593,-0.786347,-0.663740,-1.042692,-0.419037,-0.338522,-0.543628,-0.569372,-0.416104,-0.751667,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,0.894640,-0.327793,-0.315846,-0.30627
41836,0.219938,0.355244,-1.030718,0.024257,1.166653,-0.548134,0.927999,0.197196,-0.746643,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.333219,-0.327793,-0.315846,-0.30627
944,-0.634616,-0.432698,-1.008313,-0.075846,0.863616,-0.529696,-0.666998,-0.496475,-0.677619,-0.217715,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.506992,-0.327793,-0.315846,-0.30627
17364,-0.664731,-0.671214,-0.475013,-0.544925,-0.477381,-0.455899,-0.576687,-0.577526,-0.451230,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-1.003338,-0.327793,-0.315846,-0.30627


#### The unseen test data is scaled with the scaler fitted on the train data ####

In [342]:
unseen_df_scaled  = unseen_df.copy()
unseen_df_scaled[columns_to_scale] = scaler.transform(unseen_df[columns_to_scale])
unseen_df_scaled

Unnamed: 0_level_0,arpu_6,arpu_7,arpu_8,onnet_mou_6,onnet_mou_7,onnet_mou_8,offnet_mou_6,offnet_mou_7,offnet_mou_8,roam_ic_mou_6,...,monthly_3g_6,monthly_3g_7,monthly_3g_8,sachet_3g_6,sachet_3g_7,sachet_3g_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
69999,-0.725377,-0.813495,-0.810725,-0.412778,-0.461354,-0.441875,-0.521303,-0.628635,-0.610638,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,0.451362,-0.327793,-0.315846,-0.306270
70000,0.489668,0.849928,0.278034,-0.208507,-0.367165,-0.456910,1.190581,1.735096,0.860140,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,1.326472,-0.327793,-0.315846,-0.306270
70001,0.171760,0.551838,1.693498,-0.526007,-0.516888,-0.509511,-0.699774,-0.573919,-0.568938,0.579763,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-1.021028,2.630026,4.027425,1.229160
70002,-0.907593,-0.421652,-0.957809,-0.536424,-0.542478,-0.548134,-0.604463,0.064234,-0.654829,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.014808,-0.327793,-0.315846,-0.306270
70003,0.085086,0.446192,0.470723,1.545143,2.182030,2.673560,-0.518079,-0.661985,-0.450004,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.828524,-0.327793,-0.315846,-0.306270
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99994,1.638421,0.409136,0.444354,0.954381,0.570231,0.764168,0.016838,-0.091576,0.153647,2.983301,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.003362,-0.306916,0.289134,0.084064
99995,-0.248669,0.142431,0.329871,0.670981,1.545472,2.441731,-0.321381,-0.260053,0.022135,-0.138523,...,-0.240345,3.927645,-0.248868,-0.202127,-0.197654,-0.199821,1.040318,-0.327793,-0.315846,-0.306270
99996,-0.545955,-0.913617,-0.385580,-0.509471,-0.485327,-0.456451,-0.616035,-0.690446,-0.567957,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.851417,-0.327793,-0.315846,-0.306270
99997,3.161693,1.831025,-0.101860,0.012112,-0.326670,-0.426472,6.893346,3.607024,0.497541,-0.264343,...,-0.240345,-0.239840,-0.248868,-0.202127,-0.197654,-0.199821,-0.466410,-0.327793,-0.315846,-0.306270


### The XGBoost model is rerun with the identified hyperparameters and a tuned scale_pos_weight ###
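The cell below sets `scale_pos_weight` to a quarter of the negative-to-positive ratio instead of the full ratio. To give a feel for the magnitudes, the sketch below plugs in the hold-out class counts visible in the confusion matrices above (15046 non-churners, 915 churners). The training-split counts are not printed in the notebook, so these numbers are only illustrative, and `damped_scale_pos_weight` is a hypothetical helper name.

```python
def damped_scale_pos_weight(neg_count, pos_count, damping=4):
    """Negative-to-positive ratio divided by a damping factor."""
    return (neg_count / damping) / pos_count


# Illustrative counts taken from the hold-out split, not the training split.
neg_count, pos_count = 15046, 915

full_ratio = neg_count / pos_count
damped = damped_scale_pos_weight(neg_count, pos_count)

print(f"full ratio:   {full_ratio:.2f}")
print(f"damped (1/4): {damped:.2f}")
```

A smaller weight pushes the model less aggressively toward predicting churn, trading some recall for fewer false positives.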

In [343]:
neg_count = sum(target_train == 0)
pos_count = sum(target_train == 1)
scale_pos_weight = (neg_count/4) / pos_count
best_params = {
    'colsample_bytree': 0.5,
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_child_weight': 10,
    'n_estimators': 100,
    'subsample': 0.75,
    'scale_pos_weight':scale_pos_weight 
}

xgb_cl_model = xgb.XGBClassifier(**best_params)
xgb_cl_model.fit(train_scaled, target_train)
xgb_predictions_cl = xgb_cl_model.predict(test_scaled)
print(accuracy_score(target_test, xgb_predictions_cl))
print("Confusion Matrix:\n", confusion_matrix(target_test, xgb_predictions_cl))
class_report = classification_report(target_test, xgb_predictions_cl)
print("Classification Report:\n", class_report)

0.9427354175803521
Confusion Matrix:
 [[14478   568]
 [  346   569]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.96      0.97     15046
           1       0.50      0.62      0.55       915

    accuracy                           0.94     15961
   macro avg       0.74      0.79      0.76     15961
weighted avg       0.95      0.94      0.95     15961



## Model 1 outcome - accuracy: 94.3%; the confusion matrix and classification report are displayed above ##

# Model 2 with reduced columns (top 40 features by importance) - Random Forest

In [344]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(train_scaled, target_train)

# Step 6: Get Feature Importances
importances = rf_classifier.feature_importances_

# Step 7: Create a DataFrame for Importances
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)

# Step 8: Select Top 40 Features
top_features = feature_importances.head(40)['Feature'].tolist()
print("Top 40 features:\n", top_features)

Top 40 features:
 ['loc_ic_mou_8', 'total_ic_mou_8', 'loc_ic_t2m_mou_8', 'roam_ic_mou_8', 'total_og_mou_8', 'roam_og_mou_8', 'loc_og_mou_8', 'loc_og_t2m_mou_8', 'loc_ic_t2t_mou_8', 'arpu_8', 'loc_ic_mou_7', 'total_ic_mou_7', 'aon', 'total_rech_amt_8', 'last_day_rch_amt_8', 'arpu_7', 'offnet_mou_8', 'loc_ic_t2m_mou_7', 'roam_og_mou_7', 'loc_ic_mou_6', 'loc_og_t2t_mou_8', 'total_ic_mou_6', 'offnet_mou_7', 'arpu_6', 'loc_ic_t2m_mou_6', 'max_rech_amt_8', 'roam_ic_mou_7', 'loc_ic_t2t_mou_7', 'total_rech_num_7', 'total_og_mou_7', 'offnet_mou_6', 'loc_og_t2m_mou_7', 'loc_og_mou_7', 'onnet_mou_7', 'std_ic_mou_8', 'loc_ic_t2t_mou_6', 'onnet_mou_8', 'loc_og_t2m_mou_6', 'onnet_mou_6', 'total_rech_num_6']


## The Random Forest model is trained on the top 40 features; its accuracy and confusion matrix are shown below ##

In [345]:
# Step 9: Create a Reduced Dataset
train_reduced = train_scaled[top_features]
test_reduced = test_scaled[top_features]


# Step 10: Train the Final Random Forest Model with Reduced Features
final_rf_classifier = RandomForestClassifier(n_estimators=300,min_samples_split= 10, min_samples_leaf= 4, max_depth= 30, class_weight='balanced', random_state=42)
final_rf_classifier.fit(train_reduced, target_train)

# Step 11: Make Predictions
y_pred = final_rf_classifier.predict(test_reduced)

# Step 12: Evaluate the Model
accuracy = accuracy_score(target_test, y_pred)
conf_matrix = confusion_matrix(target_test, y_pred)
class_report = classification_report(target_test, y_pred)

# Step 13: Print the Results
print(f'Accuracy: {accuracy * 100:.2f}%')
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Accuracy: 94.99%
Confusion Matrix:
 [[14690   356]
 [  443   472]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.98      0.97     15046
           1       0.57      0.52      0.54       915

    accuracy                           0.95     15961
   macro avg       0.77      0.75      0.76     15961
weighted avg       0.95      0.95      0.95     15961



## The unseen test data is now reduced to the top 40 features ##

In [346]:
unseen_df_reduced = unseen_df_scaled[top_features]
unseen_df_reduced

Unnamed: 0_level_0,loc_ic_mou_8,total_ic_mou_8,loc_ic_t2m_mou_8,roam_ic_mou_8,total_og_mou_8,roam_og_mou_8,loc_og_mou_8,loc_og_t2m_mou_8,loc_ic_t2t_mou_8,arpu_8,...,offnet_mou_6,loc_og_t2m_mou_7,loc_og_mou_7,onnet_mou_7,std_ic_mou_8,loc_ic_t2t_mou_6,onnet_mou_8,loc_og_t2m_mou_6,onnet_mou_6,total_rech_num_6
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
69999,-0.686579,-0.734991,-0.748645,-0.227717,-0.630727,-0.247904,-0.573309,-0.663821,-0.279932,-0.810725,...,-0.521303,-0.644811,-0.495955,-0.461354,-0.404917,-0.168876,-0.441875,-0.578143,-0.412778,-0.429268
70000,0.139113,0.353537,0.637455,-0.227717,0.283236,-0.247904,1.444196,2.452408,-0.547519,0.278034,...,1.190581,4.276390,2.733655,-0.367165,-0.477541,-0.307593,-0.456910,3.171581,-0.208507,-0.429268
70001,-0.853444,-0.895831,-0.824007,1.006224,-0.778790,1.665467,-0.734795,-0.734441,-0.583574,1.693498,...,-0.699774,-0.736079,-0.738809,-0.516888,-0.477541,-0.575772,-0.509511,-0.727398,-0.526007,-0.262284
70002,1.961917,1.539961,3.184831,-0.227717,-0.717434,-0.247904,-0.609359,-0.561186,-0.265288,-0.957809,...,-0.604463,0.090365,-0.203184,-0.542478,-0.329783,0.203810,-0.548134,-0.703194,-0.536424,-0.763236
70003,-0.760188,-0.189396,-0.763195,-0.227717,1.213911,-0.247904,-0.594818,-0.642027,-0.446673,0.470723,...,-0.518079,-0.713786,-0.720123,2.182030,1.889937,-0.575772,2.673560,-0.707722,1.545143,0.572636
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99994,2.516050,2.048961,1.898609,-0.227717,0.517802,-0.247904,1.884538,0.900798,3.301615,0.444354,...,0.016838,0.589701,1.404995,0.570231,-0.015859,1.337028,0.764168,0.604286,0.954381,0.071684
99995,-0.542167,-0.532411,-0.622434,-0.227717,1.367878,-0.247904,-0.591087,-0.624372,-0.390310,0.329871,...,-0.321381,-0.454228,-0.458367,1.545472,-0.090324,-0.552563,2.441731,-0.542661,0.670981,-0.763236
99996,-0.768290,-0.805519,-0.779486,0.935726,-0.651937,0.299677,-0.555695,-0.522750,-0.439351,-0.385580,...,-0.616035,-0.667020,-0.657821,-0.485327,-0.405921,-0.542098,-0.456451,-0.538627,-0.509471,-0.429268
99997,-0.804406,-0.854756,-0.750780,-0.227717,0.071234,-0.247904,-0.587959,-0.508390,-0.583574,-0.101860,...,6.893346,-0.059905,-0.274448,-0.326670,-0.477541,-0.574430,-0.426472,-0.248597,0.012112,7.085011


In [347]:
'''param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

random_search = RandomizedSearchCV(estimator=final_rf_classifier,
                                   param_distributions=param_grid,
                                   n_iter=50,  # Number of parameter settings to try
                                   scoring='roc_auc',
                                   cv=3,
                                   n_jobs=-1,
                                   verbose=3,
                                   random_state=42)  # Ensures reproducibility

# Fit the model
random_search.fit(train_reduced, target_train)

# Best parameters and score
print("Best parameters found: ", random_search.best_params_)
print("Best AUC score: ", random_search.best_score_)'''


#### Hyperparameters for the Random Forest model are identified with the randomized search above ####

# Model 3 with AdaBoost #

#### Train and test data now use only the features outside the top 40 ####

In [348]:
train_ada_reduced = train_scaled.loc[:, ~train_scaled.columns.isin(top_features)]
test_ada_reduced = test_scaled.loc[:, ~test_scaled.columns.isin(top_features)]
from sklearn.tree import DecisionTreeClassifier

In [349]:
unseen_df_ada_reduced = unseen_df_scaled.loc[:, ~unseen_df_scaled.columns.isin(top_features)]
unseen_df_ada_reduced.shape

(30000, 84)

In [350]:
test_ada_reduced.shape

(15961, 84)

In [351]:
base_estimator = DecisionTreeClassifier(max_depth=3,class_weight={0: 1, 1: 3})

In [352]:
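# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was
# removed in 1.4); on newer versions pass estimator=base_estimator instead
adaboost =  AdaBoostClassifier(base_estimator=base_estimator,n_estimators=200,learning_rate= 0.5,random_state=1)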

In [353]:
adaboost.fit(train_ada_reduced, target_train)

In [354]:
'''param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
}

grid_search = GridSearchCV(estimator=adaboost, param_grid=param_grid,
                           scoring='roc_auc', cv=3, n_jobs=-1, verbose=3)

# Fit the model
grid_search.fit(train_ada_reduced, target_train)

# Best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best AUC score: ", grid_search.best_score_)'''


In [355]:
y_pred = adaboost.predict(test_ada_reduced)

In [356]:
print('Accuracy of the model is:  ',accuracy_score(target_test, y_pred))

Accuracy of the model is:   0.9368460622767997


In [357]:
cm = confusion_matrix(target_test, y_pred)
print('The confusion Matrix : \n',cm)

The confusion Matrix : 
 [[14541   505]
 [  503   412]]


### The accuracy and confusion matrix above are for Model 3 (with PCA), evaluated on the test split held out from the train data ###

# Meta Model

### Meta-model data frame creation with train data ###

In [358]:
predictions_train = pd.DataFrame()
predictions_train['model_3'] = adaboost.predict(train_ada_reduced)
predictions_train['model_2'] = final_rf_classifier.predict(train_reduced)
predictions_train['model_1'] = xgb_cl_model.predict(train_scaled)
predictions_train['target'] = target_train.values
predictions_train

Unnamed: 0,model_3,model_2,model_1,target
0,0,0,0,0
1,0,0,0,0
2,0,0,0,0
3,1,1,1,0
4,0,0,0,0
...,...,...,...,...
47876,0,0,0,0
47877,1,1,1,1
47878,0,1,1,1
47879,0,0,0,0


In [359]:
meta_model = xgb.XGBClassifier(n_jobs = -1,objective = 'binary:logistic')
meta_model.fit(predictions_train.drop('target', axis=1), predictions_train['target'])
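The meta model above is fit on predictions that the base models made on the same rows they were trained on, so those inputs can look more accurate than they will be on new data. A common alternative is to build the meta features from out-of-fold predictions. The sketch below illustrates the idea on synthetic data. The stand-in base models, the synthetic dataset and the helper name `out_of_fold_features` are assumptions for illustration, not the notebook's fitted models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def out_of_fold_features(models, X, y, cv=3):
    """Return one column of out-of-fold class-1 probabilities per base model."""
    columns = []
    for model in models:
        # Each row is scored by a clone fitted on folds that exclude that row
        proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
        columns.append(proba[:, 1])
    return np.column_stack(columns)


# Synthetic, imbalanced stand-in for the churn data (illustration only)
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# Hypothetical base models; the notebook uses XGBoost, Random Forest and AdaBoost
base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=42),
    RandomForestClassifier(n_estimators=50, random_state=42),
]

meta_features = out_of_fold_features(base_models, X, y)
meta_learner = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)
```

Using probabilities rather than hard 0/1 labels also gives the meta learner more to work with, since three binary columns can only take eight distinct patterns.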

#### Predicting with the meta model from the three base models' predictions on the test split held out from the train data ####

In [360]:
predictions_test = pd.DataFrame()
predictions_test['model_3'] = adaboost.predict(test_ada_reduced)
predictions_test['model_2'] = final_rf_classifier.predict(test_reduced)
predictions_test['model_1'] = xgb_cl_model.predict(test_scaled)
predictions_test

Unnamed: 0,model_3,model_2,model_1
0,0,0,0
1,0,0,0
2,0,0,0
3,0,0,0
4,0,0,0
...,...,...,...
15956,0,0,0
15957,0,0,0
15958,0,0,0
15959,0,0,0


In [361]:
meta_predictions = meta_model.predict(predictions_test)
print(accuracy_score(target_test, meta_predictions))
print("Confusion Matrix:\n", confusion_matrix(target_test, meta_predictions))

0.9499404799198046
Confusion Matrix:
 [[14690   356]
 [  443   472]]


#### The accuracy and confusion matrix of the meta model on the held-out test split are shown above ####
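Only about 6% of the customers in the held-out split churn (915 of 15,961), so accuracy alone says little about how well churners are caught. As a rough illustration, precision and recall for the churn class can be derived from the two confusion matrices printed above. The helper name `churn_metrics` is made up for this sketch.

```python
def churn_metrics(cm):
    """Compute accuracy, precision and recall for class 1 from a 2x2 confusion matrix.

    cm follows the scikit-learn layout: rows are true labels, columns are predictions.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Confusion matrices copied from the notebook outputs above
adaboost_cm = [[14541, 505], [503, 412]]
meta_cm = [[14690, 356], [443, 472]]

for name, cm in [("AdaBoost (model 3)", adaboost_cm), ("Meta model", meta_cm)]:
    m = churn_metrics(cm)
    print(f"{name}: accuracy={m['accuracy']:.4f}, "
          f"precision={m['precision']:.4f}, recall={m['recall']:.4f}")
```

On these numbers the meta model raises churn recall from about 0.45 to about 0.52 and precision from about 0.45 to about 0.57, which is a clearer picture of the gain than the change in accuracy.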

## Final model run on the unseen test data to predict churn ##

In [362]:
predictions_unseen_test = pd.DataFrame()
predictions_unseen_test['model_3'] = adaboost.predict(unseen_df_ada_reduced)
predictions_unseen_test['model_2'] = final_rf_classifier.predict(unseen_df_reduced)
predictions_unseen_test['model_1'] = xgb_cl_model.predict(unseen_df_scaled)
predictions_unseen_test

Unnamed: 0,model_3,model_2,model_1
0,0,0,0
1,0,0,0
2,1,1,1
3,0,0,0
4,0,0,0
...,...,...,...
29995,0,0,0
29996,0,0,0
29997,0,0,0
29998,0,0,0


In [363]:
meta_predictions_unseen = meta_model.predict(predictions_unseen_test)
meta_predictions_unseen

array([0, 0, 1, ..., 0, 0, 0])

## Submission File Creation ##

In [364]:
sample.head()

Unnamed: 0,id,churn_probability
0,69999,0
1,70000,0
2,70001,0
3,70002,0
4,70003,0


In [365]:
unseen.head()

Unnamed: 0,id,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,arpu_6,arpu_7,...,sachet_3g_6,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g
0,69999,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,91.882,65.33,...,0,0,0,,,,1692,0.0,0.0,0.0
1,70000,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,414.168,515.568,...,0,0,0,,,,2533,0.0,0.0,0.0
2,70001,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,329.844,434.884,...,0,0,0,,,,277,525.61,758.41,241.84
3,70002,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,43.55,171.39,...,0,0,0,,,,1244,0.0,0.0,0.0
4,70003,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,306.854,406.289,...,0,0,0,,,,462,0.0,0.0,0.0


In [366]:
submission_data = unseen.set_index('id')
submission_data.shape

(30000, 170)

In [367]:
# Reuse the meta-model predictions computed above (hard 0/1 labels, not probabilities)
unseen['churn_probability'] = meta_predictions_unseen
output = unseen[['id', 'churn_probability']]
output.head()

Unnamed: 0,id,churn_probability
0,69999,0
1,70000,0
2,70001,1
3,70002,0
4,70003,0


In [369]:
output.to_csv('submission_assignment.csv',index=False)
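Before uploading, it can help to confirm that the file matches the layout of `sample`. The check below is a minimal sketch that assumes the competition expects hard 0/1 labels in the same row order as `sample`. The helper name `check_submission` and the toy frames are illustrative, not part of the notebook.

```python
import pandas as pd


def check_submission(output, sample):
    """Raise AssertionError if `output` does not match the layout of `sample`."""
    assert list(output.columns) == list(sample.columns), "column names differ"
    assert len(output) == len(sample), "row counts differ"
    assert output["id"].is_unique, "duplicate ids"
    assert output["id"].tolist() == sample["id"].tolist(), "ids differ or are out of order"
    assert output["churn_probability"].isin([0, 1]).all(), "labels must be 0 or 1"
    return True


# Toy frames standing in for `sample` and `output` (illustration only)
toy_sample = pd.DataFrame({"id": [69999, 70000, 70001], "churn_probability": [0, 0, 0]})
toy_output = pd.DataFrame({"id": [69999, 70000, 70001], "churn_probability": [0, 0, 1]})
print(check_submission(toy_output, toy_sample))
```

In the notebook itself the call would be `check_submission(output, sample)`, run just before `to_csv`.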