# Telecom Churn Case Study

## Problem Statement

#### - In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average annual churn rate of 15-25%. Given that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has become even more important than customer acquisition.
#### - For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn.
#### - In this project, we will analyse customer-level data of a leading telecom firm, build predictive models to identify customers at high risk of churn and identify the main indicators of churn.

#### There are multiple ways to define churn, such as:
#### -- Revenue-Based Churn:
 * Focus: Customers not using revenue-generating services like calls or mobile internet.
 * Challenge: Ignores users who receive calls but don't actively spend, like rural users.

#### -- Usage-Based Churn:
 * Focus: Customers with zero activity (no calls or internet usage).
 * Challenge: May be late in predicting churn if defined based on a prolonged period of zero usage.

In this telecom churn case study, our main focus is the usage-based definition of churn.

#### Here, our main goal is:
* To predict customers on the verge of churning from a telecom operator.
* Specifically, we are interested in identifying High-Value Customers.
* Churn prediction will be based on the usage behavior during the action period, with churn period data excluded after labeling.

#### Requirements:
* Churn Prediction Model: Develop a model capable of predicting which customers are likely to churn.
* Best Predictor Variables: Identify and utilize the most influential variables for accurate predictions.

#### We will build our model by following these six steps:

#### Step-I : Data Preprocessing
* Read and comprehend the data
* Clean the data by handling missing values
* Impute missing values as needed
#### Step-II : Customer Segmentation
* Identify and filter high-value customers
#### Step-III : Target Variable Definition
* Derive the churn target variable
#### Step-IV : Feature Engineering and Data Exploration
* Create derived variables
* Conduct exploratory data analysis (EDA)
* Split the data into training and test sets
* Apply feature scaling
#### Step-V : Model Building
* Handle class imbalance in the target variable
* Utilize dimensionality reduction techniques such as PCA
* Apply various classification models for churn prediction
#### Step-VI : Model Evaluation
* Evaluate the performance of the models
* Prepare models for predictor variable selection, considering multiple models and selecting the best one
* Finally, provide a comprehensive summary and recommendations to the company based on the analysis.

### Step-I : Data Preprocessing

In [1]:
# Importing required libraries

import warnings                                        # To suppress the warnings which will be raised

warnings.filterwarnings('ignore')

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.stats.outliers_influence import variance_inflation_factor     # Importing 'variance_inflation_factor' or VIF

from sklearn.feature_selection import RFE              # Import RFE for RFE selection

import statsmodels.api as sm                           # Loading statsmodels

from sklearn.metrics import precision_recall_curve     # Loading the precision recall curve

from sklearn import metrics                            # Importing evaluation metrics from scikit-learn

from imblearn.over_sampling import SMOTE

from sklearn.decomposition import IncrementalPCA

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from imblearn.metrics import sensitivity_specificity_support
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

In [4]:
# imbalanced-learn helps balance datasets that are biased towards certain classes.
# Run this cell before the imports above if the package is not installed yet.
%pip install imbalanced-learn

In [5]:
churn_df = pd.read_csv(r"C:\Users\LENOVO\Desktop\amreeta\github\train.csv")      # Loading our dataset

In [6]:
churn_df.head()

In [7]:
# create backup of data
original = churn_df.copy()

In [8]:
#look at the last 5 rows
churn_df.tail() 

In [9]:
#check the columns of data
churn_df.columns

In [10]:
#Checking the numerical columns data distribution statistics
churn_df.describe()

In [11]:
#check dataframe for null and datatype 
churn_df.info()

In [12]:
# feature type summary
churn_df.info(verbose=1)

In [13]:
# Checking for null values
churn_df.isnull().sum()

In [14]:
# Checking the null value percentage
churn_df.isna().sum()/churn_df.isna().count()*100

In [15]:
# Checking for shape of a data set
churn_df.shape

In [16]:
# Dropping duplicate rows (if any) and re-checking the shape
churn_df.drop_duplicates(subset=None, inplace=True)
churn_df.shape

In [17]:
#check the size of data
churn_df.size

In [18]:
#check the axes of data
churn_df.axes

In [19]:
#check the dimensions of data
churn_df.ndim

In [20]:
#check the values of data
churn_df.values

In [21]:
#list of columns
pd.DataFrame(churn_df.columns)

In [22]:
# look at missing value ratio in each column
churn_df.isnull().sum()*100/churn_df.shape[0]

In [23]:
# some recharge columns have minimum value of 1 while some don't have
recharge_cols = ['total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 
                 'count_rech_2g_6', 'count_rech_2g_7', 'count_rech_2g_8', 
                 'count_rech_3g_6', 'count_rech_3g_7', 'count_rech_3g_8', 
                 'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8', 
                 'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 
                 ]

churn_df[recharge_cols].describe(include='all')

* We can create a new feature, total_rech_amt_data, from total_rech_data and av_rech_amt_data to capture the amount spent by the customer on data.
* Since the minimum value is 1, we can impute the NA values with 0, on the assumption that the customer made no recharge.

In [24]:
# It is also observed that the recharge date and the recharge value are missing together which means the customer didn't recharge
churn_df.loc[churn_df.total_rech_data_6.isnull() & churn_df.date_of_last_rech_data_6.isnull(), ["total_rech_data_6", "date_of_last_rech_data_6"]].head(20)

In the recharge variables where the minimum value is 1, we can impute missing values with zeroes, since a missing value means the customer didn't recharge their number that month.

In [25]:
# create a list of recharge columns where we will impute missing values with zeroes
zero_impute = ['total_rech_data_6', 'total_rech_data_7', 'total_rech_data_8', 
        'av_rech_amt_data_6', 'av_rech_amt_data_7', 'av_rech_amt_data_8', 
        'max_rech_data_6', 'max_rech_data_7', 'max_rech_data_8'
       ]

In [26]:
# impute missing values with 0
churn_df[zero_impute] = churn_df[zero_impute].apply(lambda x: x.fillna(0))

In [27]:
# make sure the values are imputed correctly by checking the missing value ratio (per row count)
churn_df[zero_impute].isnull().sum()*100/churn_df.shape[0]

In [28]:
# now we can check the "statistics Summary"
churn_df[zero_impute].describe(include='all')

In [29]:
# now we can create lists of column names by their types, using the description of the columns
id_cols = ['id', 'circle_id']

date_cols = ['last_date_of_month_6',
             'last_date_of_month_7',
             'last_date_of_month_8',             
             'date_of_last_rech_6',
             'date_of_last_rech_7',
             'date_of_last_rech_8',             
             'date_of_last_rech_data_6',
             'date_of_last_rech_data_7',
             'date_of_last_rech_data_8'             
            ]

cat_cols =  ['night_pck_user_6',
             'night_pck_user_7',
             'night_pck_user_8',             
             'fb_user_6',
             'fb_user_7',
             'fb_user_8'             
            ]

num_cols = [column for column in churn_df.columns if column not in id_cols + date_cols + cat_cols]

# print the number of columns in each list
print("#ID cols: %d\n#Date cols:%d\n#Numeric cols:%d\n#Category cols:%d" % (len(id_cols), len(date_cols), len(num_cols), len(cat_cols)))

# check if we have missed any column or not
print(len(id_cols) + len(date_cols) + len(num_cols) + len(cat_cols) == churn_df.shape[1])

In [30]:
# drop id and date columns
churn_df = churn_df.drop(id_cols + date_cols, axis=1)
#check the shape again
churn_df.shape

In [31]:
# replace missing values with '-1' in categorical columns
churn_df[cat_cols] = churn_df[cat_cols].apply(lambda x: x.fillna(-1))

In [32]:
# missing value ratio
churn_df[cat_cols].isnull().sum()*100/churn_df.shape[0]

Dropping variables with more than 70% missing values (we call this the threshold)

In [33]:
initial_cols = churn_df.shape[1]

MISSING_THRESHOLD = 0.7

include_cols = list(churn_df.apply(lambda column: True if column.isnull().sum()/churn_df.shape[0] < MISSING_THRESHOLD else False))

drop_missing = pd.DataFrame({'features':churn_df.columns , 'include': include_cols})
drop_missing.loc[drop_missing.include == True,:]

In [34]:
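# now we can drop the columns that exceed the missing-value threshold
churn_df = churn_df.loc[:, include_cols]

# number of columns dropped (initial count minus remaining count)
dropped_cols = initial_cols - churn_df.shape[1]
dropped_cols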

In [35]:
#rechecking the shape of a dataframe
churn_df.shape

In [36]:
# rechecking how many missing values are left
churn_df.isnull().sum()*100/churn_df.shape[0]

In [37]:
num_cols = [column for column in churn_df.columns if column not in id_cols + date_cols + cat_cols]
num_cols

In [38]:
# imputing with the median for num_cols
churn_df[num_cols] = churn_df[num_cols].apply(lambda x: x.fillna(x.median()))

In [39]:
#again checking for the missing values
churn_df.isnull().sum()*100/churn_df.shape[0]

In churn prediction, we assume three phases of the customer lifecycle: good, action and churn. Since we are working over a three-month window, the first two months are treated together as the ‘good & action’ phase and the third month is the ‘churn’ phase:

- The ‘good & action’ phase [Month 6 & 7]
- The ‘churn’ phase [Month 8]

### Step-II: Customer Segmentation
* Filter high-value customers

Here we use the good phase data (months 6 and 7) to identify high-value customers.

In [40]:
# calculate the total data recharge amount for June and July --> number of recharges * average recharge amount
churn_df['total_data_rech_6'] = churn_df.total_rech_data_6 * churn_df.av_rech_amt_data_6
churn_df['total_data_rech_7'] = churn_df.total_rech_data_7 * churn_df.av_rech_amt_data_7

Add the total data recharge and the total recharge to get the combined recharge amount for a month.

In [41]:
# calculate total recharge amount for June and July --> call recharge amount + data recharge amount
churn_df['amt_data_6'] = churn_df.total_rech_amt_6 + churn_df.total_data_rech_6
churn_df['amt_data_7'] = churn_df.total_rech_amt_7 + churn_df.total_data_rech_7

In [42]:
# calculate average recharge done by customer in June and July
churn_df['av_amt_data_6_7'] = (churn_df.amt_data_6 + churn_df.amt_data_7)/2

In [43]:
# look at the 70th percentile recharge amount
print("Recharge amount at 70th percentile: {0}".format(churn_df.av_amt_data_6_7.quantile(0.7)))


In [44]:
churn_df.head()

In [45]:
# retain only those customers who have recharged their mobiles with more than or equal to 70th percentile amount
churn_df_filtered = churn_df.loc[churn_df.av_amt_data_6_7 >= churn_df.av_amt_data_6_7.quantile(0.7), :]
churn_df_filtered = churn_df_filtered.reset_index(drop=True)


In [46]:
churn_df_filtered.shape

In [47]:
# delete variables created to filter high-value customers
churn_df_filtered = churn_df_filtered.drop(['total_data_rech_6', 'total_data_rech_7','amt_data_6', 'amt_data_7', 'av_amt_data_6_7'], axis=1)

In [48]:
churn_df_filtered.shape

We now have 21,013 rows and 149 columns after selecting customers whose average recharge amount is greater than or equal to the 70th percentile value.
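
As a small illustration of the percentile filter used above, the sketch below applies the same rule to a toy frame. The column name `av_amt` and the ten amounts are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical average recharge amounts for ten customers (illustrative values only)
toy = pd.DataFrame({"av_amt": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]})

# 70th percentile of the average recharge amount (pandas interpolates linearly by default)
cutoff = toy["av_amt"].quantile(0.7)

# Keep only customers at or above the cutoff, as in the high-value filter
high_value = toy.loc[toy["av_amt"] >= cutoff].reset_index(drop=True)

print(cutoff, len(high_value))
```

Because the cutoff is interpolated between observed values, the retained share is roughly, not exactly, the top 30% of customers.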

### Step-III : Target Variable Definition

* Derive churn

Here, deriving churn means using the month 8 (‘churn’ phase) data to obtain the target variable. Under the usage-based definition, churn is determined from the total_ic_mou_8, total_og_mou_8, vol_2g_mb_8 and vol_3g_mb_8 attributes.
The cells below inspect these columns and then use the `churn_probability` column present in the data as the target.
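
A minimal sketch of the usage-based rule, assuming a customer counts as churned when all four month-8 usage attributes are zero. The helper name `derive_churn` and the toy rows are hypothetical; the notebook itself goes on to use the `churn_probability` column.

```python
import pandas as pd

# Month-8 usage attributes named in the text above
USAGE_COLS = ["total_ic_mou_8", "total_og_mou_8", "vol_2g_mb_8", "vol_3g_mb_8"]

def derive_churn(df: pd.DataFrame) -> pd.Series:
    """Return 1 where every month-8 usage column is zero, else 0 (assumed rule)."""
    no_usage = (df[USAGE_COLS] == 0).all(axis=1)
    return no_usage.astype(int)

# Three made-up customers: no usage, calls only, data only
toy = pd.DataFrame({
    "total_ic_mou_8": [0.0, 12.5, 0.0],
    "total_og_mou_8": [0.0, 3.0, 0.0],
    "vol_2g_mb_8":    [0.0, 0.0, 40.2],
    "vol_3g_mb_8":    [0.0, 0.0, 0.0],
})
toy["churn_flag"] = derive_churn(toy)
print(toy["churn_flag"].tolist())
```

Only the first toy customer is flagged, since the other two show some call or data activity in month 8.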

In [49]:
# Selecting the columns to define churn_df variable (i.e. TARGET Variable)
churn_df_col=['total_ic_mou_8','total_og_mou_8','vol_2g_mb_8','vol_3g_mb_8']
churn_df_filtered[churn_df_col].info()

In [50]:
# let's find out the churn / non-churn percentage among the high-value customers
churn_pct = churn_df_filtered['churn_probability'].value_counts(normalize=True)*100
print(churn_pct)
churn_pct.plot(kind="pie")
plt.show()

#### ***As we can see, 90% of the customers do not churn, so there is a possibility of class imbalance***
Since churn is the target variable, all the columns relating to this variable (i.e. all columns with suffix _8) can be dropped from the dataset.


We can further clean the data by dropping a few columns relating to the good phase.

Since we derived a few good-phase columns earlier, we can drop the source columns they were built from.

In [51]:
#churn_df['total_rech_amt_data_6']=churn_df['av_rech_amt_data_6'] * churn_df['total_rech_data_6']
# churn_df['total_rech_amt_data_7']=churn_df['av_rech_amt_data_7'] * churn_df['total_rech_data_7']

# # Calculating the overall recharge amount for the months 6,7 and 8

# churn_df['overall_rech_amt_6'] = churn_df['total_rech_amt_data_6'] + churn_df['total_rech_amt_6']
# churn_df['overall_rech_amt_7'] = churn_df['total_rech_amt_data_7'] + churn_df['total_rech_amt_7']

churn_df_filtered.drop(['av_rech_amt_data_6',
                   'total_rech_data_6','total_rech_amt_6',
                  'av_rech_amt_data_7',
                   'total_rech_data_7','total_rech_amt_7'], axis=1, inplace=True)

We can also create new columns for the good phase variables and drop the separate month 6 and month 7 variables.

Before proceeding to the remaining missing value handling, let us check the collinearity of the independent variables and try to understand their dependencies.

In [52]:
# creating a list of column names for each month
mon_6_cols = [col for col in churn_df_filtered.columns if '_6' in col]
mon_7_cols = [col for col in churn_df_filtered.columns if '_7' in col]
mon_8_cols = [col for col in churn_df_filtered.columns if '_8' in col]

In [53]:
mon_7_cols

In [54]:
# let's check the correlation amongst the independent variables and list the highly correlated pairs
churn_df_corr = churn_df_filtered.corr()
churn_df_corr.loc[:,:] = np.tril(churn_df_corr, k=-1)      # keep the lower triangle so each pair appears once
churn_df_corr = churn_df_corr.stack()
churn_df_corr[(churn_df_corr > 0.80) | (churn_df_corr < -0.80)].sort_values(ascending=False)

In [55]:
col_to_drop=['fb_user_6','fb_user_7','total_ic_mou_6','total_ic_mou_7',               
               'std_og_t2t_mou_7','std_og_t2t_mou_6' ,'std_og_t2m_mou_7','std_ic_mou_7',]

# These columns can be dropped as they are highly collinear with other predictor variables.
# The criterion used is a collinearity of 85%

#  dropping these column
churn_df_filtered.drop(col_to_drop, axis=1, inplace=True)

In [56]:
# The current dimension of the dataset after dropping a few unwanted columns
churn_df_filtered.shape

### Step-IV : Feature Engineering and Data Exploration
* Create derived variables
* Conduct exploratory data analysis (EDA)
* Split the data into training and test sets
* Apply feature scaling

In [57]:
# We have a column called 'aon'

# We can derive new variables from this to explain the data w.r.t. churn.

churn_df_filtered['tenure'] = (churn_df_filtered['aon']/30).round(0)   # creating a new variable 'tenure'

churn_df_filtered.drop('aon',axis=1, inplace=True)                    # Since we derived a new column from 'aon', we can drop it

In [58]:
# Checking the distribution of the tenure variable
# (sns.distplot is deprecated, so histplot with a KDE overlay is used instead)

sns.histplot(churn_df_filtered['tenure'], bins=30, kde=True)
plt.show()

In [59]:
tn_range = [0, 6, 12, 24, 60, np.inf]      # open-ended last bin so tenures above 61 months are not turned into NaN
tn_label = [ '0-6 Months', '6-12 Months', '1-2 Yrs', '2-5 Yrs', '5 Yrs and above']
churn_df_filtered['tenure_range'] = pd.cut(churn_df_filtered['tenure'], tn_range, labels=tn_label, include_lowest=True)
churn_df_filtered['tenure_range'].head()

In [60]:
# Plotting a bar plot for tenure range
plt.figure(figsize=[12,7])
sns.barplot(x='tenure_range',y='churn_probability', data=churn_df_filtered)
plt.show()

In [61]:
churn_df_filtered["avg_arpu_6_7"]= (churn_df_filtered['arpu_6']+churn_df_filtered['arpu_7'])/2
churn_df_filtered['avg_arpu_6_7'].head()

In [62]:
# Lets drop the original columns as they are derived to a new column for better understanding of the data

churn_df_filtered.drop(['arpu_6','arpu_7'], axis=1, inplace=True)


# The current dimension of the dataset after dropping a few unwanted columns
churn_df_filtered.shape

In [63]:
# Visualizing the column created (histplot replaces the deprecated distplot)
sns.histplot(churn_df_filtered['avg_arpu_6_7'], kde=True)
plt.show()

In [64]:
# Checking correlation between the target variable (churn_probability) and the other variables in the dataset
plt.figure(figsize=(10,50))
heatmap_churn_df = sns.heatmap(churn_df_filtered.corr(numeric_only=True)[['churn_probability']].sort_values(ascending=False, by='churn_probability'),annot=True, 
                                cmap='summer')
heatmap_churn_df.set_title("Features Correlating with churn variable", fontsize=15)
plt.show()

In [65]:
churn_df_filtered.columns

- Avg outgoing calls and calls on roaming for the 6th and 7th months are positively correlated with churn.
- Avg revenue and number of recharges for the 8th month are negatively correlated with churn.

In [66]:
# lets now draw a scatter plot between total recharge and avg revenue for the 8th month
churn_df_filtered[['total_rech_num_8', 'arpu_8']].plot.scatter(x = 'total_rech_num_8',
                                                              y='arpu_8')
plt.show()

In [67]:
# Creating categories for month 8 column totalrecharge and their count
churn_df_filtered['total_rech_data_group_8']=pd.cut(churn_df_filtered['total_rech_data_8'],[-1,0,10,25,100],labels=["No_Recharge","<=10_Recharges","10-25_Recharges",">25_Recharges"])
churn_df_filtered['total_rech_num_group_8']=pd.cut(churn_df_filtered['total_rech_num_8'],[-1,0,10,25,1000],labels=["No_Recharge","<=10_Recharges","10-25_Recharges",">25_Recharges"])

In [68]:
# Plotting the results

plt.figure(figsize=[12,4])
sns.countplot(data=churn_df_filtered,x="total_rech_data_group_8",hue="churn_probability")
print("\t\t\t\t\tDistribution of total_rech_data_8 variable\n",churn_df_filtered['total_rech_data_group_8'].value_counts())
plt.show()
plt.figure(figsize=[12,4])
sns.countplot(data=churn_df_filtered,x="total_rech_num_group_8",hue="churn_probability")
print("\t\t\t\t\tDistribution of total_rech_num_8 variable\n",churn_df_filtered['total_rech_num_group_8'].value_counts())
plt.show()

As the number of recharges increases, the churn rate clearly decreases.

In [69]:
churn_df_filtered.drop(['av_rech_amt_data_8','total_rech_data_8','sachet_2g_6','sachet_2g_7','sachet_3g_6',
              'sachet_3g_7','sachet_3g_8','last_day_rch_amt_6','last_day_rch_amt_7',
              'last_day_rch_amt_8',], axis=1, inplace=True)

In [70]:
churn_df_filtered.drop(['loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou','roam_ic_mou_6', 'roam_ic_mou_7', 'roam_ic_mou_8', 
         'roam_og_mou_6', 'roam_og_mou_7', 'roam_og_mou_8', 'loc_og_t2t_mou_6', 'loc_og_t2t_mou_7', 'loc_og_t2t_mou_8',
         'loc_og_t2m_mou_6', 'loc_og_t2m_mou_7', 'loc_og_t2m_mou_8', 'loc_og_t2f_mou_6', 'loc_og_t2f_mou_7', 'loc_og_t2f_mou_8',
         'loc_og_t2c_mou_6', 'loc_og_t2c_mou_7', 'loc_og_t2c_mou_8', 'loc_og_mou_6', 'loc_og_mou_7', 'loc_og_mou_8', 
         'std_og_t2m_mou_6', 'std_og_t2f_mou_6', 'std_og_t2f_mou_7', 'std_og_t2f_mou_8', 'std_og_t2c_mou_6', 'std_og_t2c_mou_7',
         'std_og_t2c_mou_8', 'std_og_mou_6', 'std_og_mou_7', 'std_og_mou_8', 'isd_og_mou_6', 'isd_og_mou_7', 'spl_og_mou_6',
         'spl_og_mou_7', 'spl_og_mou_8','total_og_mou_6', 'loc_ic_t2t_mou_6', 'loc_ic_t2t_mou_7', 'loc_ic_t2t_mou_8', 
         'loc_ic_t2m_mou_6', 'loc_ic_t2m_mou_7', 'loc_ic_t2m_mou_8', 'loc_ic_t2f_mou_6', 'loc_ic_t2f_mou_7', 'loc_ic_t2f_mou_8',
         'loc_ic_mou_6', 'loc_ic_mou_7', 'loc_ic_mou_8', 'std_ic_t2t_mou_6', 'std_ic_t2t_mou_7', 'std_ic_t2t_mou_8', 
         'std_ic_t2m_mou_6', 'std_ic_t2m_mou_7', 'std_ic_t2m_mou_8', 'std_ic_t2f_mou_6', 'std_ic_t2f_mou_7', 'std_ic_t2f_mou_8',
         'std_ic_t2o_mou_6', 'std_ic_t2o_mou_7', 'std_ic_t2o_mou_8', 'std_ic_mou_6', 'spl_ic_mou_6', 'spl_ic_mou_7',
         'spl_ic_mou_8', 'isd_ic_mou_6', 'isd_ic_mou_7', 'isd_ic_mou_8',], axis=1, inplace=True)

In [71]:
churn_df_filtered.shape

In [72]:
plt.figure(figsize = (50, 50))
sns.heatmap(churn_df_filtered.corr(numeric_only=True))     # skip the categorical range/group columns
plt.show()

In [73]:
churn_df_filtered.info()

In [74]:
churn_df_filtered.drop(['total_rech_data_group_8','total_rech_num_group_8',] , axis=1, inplace=True)

In [75]:
churn_df_filtered.shape

In [76]:
churn_df_filtered.info()

In [77]:
churn_df_filtered.drop(['tenure_range'] , axis=1, inplace=True)

In [78]:
churn_df_filtered.info()

In [79]:
churn_df_rate = (sum(churn_df_filtered["churn_probability"])/len(churn_df_filtered["churn_probability"].index))*100
churn_df_rate

### Step-V : Model Building
* Handle class imbalance in the target variable
* Utilize dimensionality reduction techniques such as PCA
* Apply various classification models for churn prediction

#### Split Data Into Train and Test Data

In [80]:
churn_df_filtered.shape

In [81]:
# divide data into train and test
X = churn_df_filtered.drop("churn_probability", axis = 1)
y = churn_df_filtered.churn_probability
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 4, stratify = y)

In [82]:
# print shapes of train and test sets
X_train.shape

In [83]:
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

#### Perform Scaling

In [84]:
X_train.head()

In [85]:
X_train.info()

In [86]:
num_col = X_train.select_dtypes(include = ['int64','float64']).columns.tolist()

In [87]:
# apply scaling on the dataset
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train[num_col] = scaler.fit_transform(X_train[num_col])

In [88]:
X_train.head()

As there are many variables, we will start dropping variables after running RFE.

### Data modeling, model evaluation & prepare model for predictor variables selection

#### Data Imbalance Handling
Using the SMOTE method, we can balance the training data w.r.t. the churn variable and proceed further.
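
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The NumPy sketch below shows only that interpolation step on made-up points; it is not the `imblearn` implementation, and the function name `smote_like_sample` is hypothetical.

```python
import numpy as np

def smote_like_sample(x, neighbour, rng):
    """Return a random point on the segment between x and neighbour."""
    gap = rng.uniform(0.0, 1.0)              # random position along the segment
    return x + gap * (neighbour - x)

rng = np.random.default_rng(42)
x = np.array([0.2, 0.4])                     # a minority-class point (toy values)
neighbour = np.array([0.6, 0.8])             # one of its nearest minority neighbours (toy values)

synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)
```

The synthetic point always lies between the two originals, which is why resampling is applied to the training split only, as in the cell that follows.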

In [89]:
smote = SMOTE(random_state=42)
X_train_sm,y_train_sm = smote.fit_resample(X_train,y_train)

In [90]:
print("Dimension of X_train_sm Shape:", X_train_sm.shape)
print("Dimension of y_train_sm Shape:", y_train_sm.shape)

#### Logistic Regression

In [91]:
# Logistic regression model
logm1 = sm.GLM(y_train_sm,(sm.add_constant(X_train_sm)), family = sm.families.Binomial())
logm1.fit().summary()

#### Logistic Regression using Feature Selection (RFE method)

In [92]:
logreg = LogisticRegression()

from sklearn.feature_selection import RFE

# running RFE with 20 variables as output
rfe = RFE(logreg,  n_features_to_select= 20)             
rfe = rfe.fit(X_train_sm, y_train_sm)

In [93]:
rfe.support_

In [94]:
rfe_columns=X_train_sm.columns[rfe.support_]
print("The selected columns by RFE for modelling are: \n\n",rfe_columns)

In [95]:
list(zip(X_train_sm.columns, rfe.support_, rfe.ranking_))

#### Assessing the model with StatsModels

In [96]:
X_train_SM = sm.add_constant(X_train_sm[rfe_columns])
logm2 = sm.GLM(y_train_sm,X_train_SM, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

In [97]:
# Getting the predicted values on the train set
y_train_sm_pred = res.predict(X_train_SM)
y_train_sm_pred = y_train_sm_pred.values.reshape(-1)
y_train_sm_pred[:10]

In [98]:
# Creating a dataframe with the actual churn flag and the predicted probabilities
y_train_sm_pred_final = pd.DataFrame({'Converted':y_train_sm.values, 'Converted_prob':y_train_sm_pred})
y_train_sm_pred_final.head()

#### Creating new column 'churn_df_pred' with 1 if Converted_prob > 0.5 else 0

In [99]:
y_train_sm_pred_final['churn_df_pred'] = y_train_sm_pred_final.Converted_prob.map(lambda x: 1 if x > 0.5 else 0)

# Viewing the prediction results
y_train_sm_pred_final.head()

In [100]:
# Confusion matrix 
confusion = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final.churn_df_pred )
print(confusion)

Confusion matrix (rows: actual, columns: predicted)

| Actual \ Predicted | not_churn | churn |
|---|---|---|
| not_churn | 11630 | 2825 |
| churn | 2238 | 12217 |
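
As a worked check, the headline metrics can be computed directly from the four counts in the confusion matrix above. The sketch assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes) and hard-codes the counts rather than reading them from the model.

```python
# Counts read from the confusion matrix above
TN, FP = 11630, 2825     # actual not_churn row
FN, TP = 2238, 12217     # actual churn row

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)     # share of actual churners that were caught
specificity = TN / (TN + FP)     # share of actual non-churners correctly kept
precision = TP / (TP + FP)       # share of predicted churners that really churned

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f}")
```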

In [101]:
# Checking the overall accuracy.
print("The overall accuracy of the model is:",metrics.accuracy_score(y_train_sm_pred_final.Converted, y_train_sm_pred_final.churn_df_pred))

#### Finding out the VIF values of the feature variables

In [102]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_sm[rfe_columns].columns
vif['VIF'] = [variance_inflation_factor(X_train_sm[rfe_columns].values, i) for i in range(X_train_sm[rfe_columns].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
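
For reference, the VIF of feature $i$ is $1/(1 - R_i^2)$, where $R_i^2$ comes from regressing feature $i$ on the remaining features. The sketch below computes this with NumPy on synthetic data, fitting each auxiliary regression with an intercept. `vif_manual` is a hypothetical helper, not the statsmodels function, and its values can differ from the statsmodels ones when the design matrix passed to statsmodels has no constant column.

```python
import numpy as np

def vif_manual(X):
    """VIF per column: 1 / (1 - R^2) from regressing each column on the others."""
    n, k = X.shape
    vifs = []
    for i in range(k):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])       # auxiliary design matrix with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares fit
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                  # independent of a
c = a + 0.1 * rng.normal(size=500)        # nearly a copy of a, so a and c are collinear

vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs.round(2))
```

In this toy setup the two collinear columns get large VIFs while the independent column stays close to 1, which is the pattern used to decide which features to drop.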

In [103]:
#### Metrics beyond accuracy
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [104]:
# Let's see the sensitivity of our logistic regression model
print("Sensitivity = ",TP / float(TP+FN))

# Let us calculate specificity
print("Specificity = ",TN / float(TN+FP))

# Calculate false positive rate - predicting churn when the customer has not churned
print("False Positive Rate = ",FP/ float(TN+FP))

# positive predictive value 
print ("Precision = ",TP / float(TP+FP))

# Negative predictive value
print ("Negative Predictive Value = ",TN / float(TN+ FN))

#### Plotting the ROC Curve

In [105]:
# Defining a function to plot the roc curve
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None

In [106]:
# Defining the variables to plot the curve
fpr, tpr, thresholds = metrics.roc_curve( y_train_sm_pred_final.Converted, y_train_sm_pred_final.Converted_prob, drop_intermediate = False )

In [107]:
# Plotting the curve for the obtained metrics
draw_roc(y_train_sm_pred_final.Converted, y_train_sm_pred_final.Converted_prob)

#### Finding Optimal Cutoff Point


In [108]:
# Let's create columns with different probability cutoffs 
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_train_sm_pred_final[i]= y_train_sm_pred_final.Converted_prob.map(lambda x: 1 if x > i else 0)
y_train_sm_pred_final.head()

In [109]:
# let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['probability','accuracy','sensitivity','specificity'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive 
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensitivity,specificity]
print(cutoff_df)

In [110]:
# Now, plotting accuracy sensitivity and specificity for various probabilities calculated above
cutoff_df.plot.line(x='probability', y=['accuracy','sensitivity','specificity'])
plt.show()

In [111]:
numbers = [0.50,0.51,0.52,0.53,0.54,0.55,0.56,0.57,0.58,0.59]       # Creating columns with refined probability cutoffs 
for i in numbers:
    y_train_sm_pred_final[i]= y_train_sm_pred_final.Converted_prob.map(lambda x: 1 if x > i else 0)
y_train_sm_pred_final.head()

In [112]:
# Calculating the accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['probability','accuracy','sensitivity','specificity'])
from sklearn.metrics import confusion_matrix

num = [0.50,0.51,0.52,0.53,0.54,0.55,0.56,0.57,0.58,0.59]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensitivity,specificity]
print(cutoff_df)

In [113]:
# plotting accuracy sensitivity and specificity for various probabilities calculated above.
cutoff_df.plot.line(x='probability', y=['accuracy','sensitivity','specificity'])
plt.show()

**From the above graph we can conclude that the optimal probability cutoff for the predicted churn variable is `0.54`, the point where the curves converge**
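
One way to locate such a cutoff without scanning a fixed grid is to take the threshold where sensitivity and specificity are closest on the ROC curve. The sketch below does this on synthetic scores; `balanced_cutoff` and the generated data are assumptions for illustration, not the notebook's results.

```python
import numpy as np
from sklearn.metrics import roc_curve

def balanced_cutoff(actual, probs):
    """Threshold at which sensitivity (TPR) and specificity (1 - FPR) are closest."""
    fpr, tpr, thresholds = roc_curve(actual, probs)
    gap = np.abs(tpr - (1.0 - fpr))
    return thresholds[np.argmin(gap)]

rng = np.random.default_rng(1)
actual = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]
# Synthetic scores: churners (label 1) tend to score higher than non-churners
probs = np.clip(np.r_[rng.normal(0.35, 0.15, 500), rng.normal(0.65, 0.15, 500)], 0, 1)

cutoff = balanced_cutoff(actual, probs)
print(round(float(cutoff), 3))
```

On real predictions, the same helper could be called with the actual labels and predicted probabilities to cross-check the value read off the plot.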

In [114]:
# From the curve above, we take 0.54 as the optimum cutoff probability.

y_train_sm_pred_final['final_churn_df_pred'] = y_train_sm_pred_final.Converted_prob.map( lambda x: 1 if x > 0.54 else 0)

y_train_sm_pred_final.head()

In [115]:
# Calculating the overall accuracy again
print("The overall accuracy of the model now is:",metrics.accuracy_score(y_train_sm_pred_final.Converted, y_train_sm_pred_final.final_churn_df_pred))

In [116]:
confusion2 = metrics.confusion_matrix(y_train_sm_pred_final.Converted, y_train_sm_pred_final.final_churn_df_pred )
print(confusion2)

In [117]:
TP2 = confusion2[1,1] # true positive 
TN2 = confusion2[0,0] # true negatives
FP2 = confusion2[0,1] # false positives
FN2 = confusion2[1,0] # false negatives

# Let's see the sensitivity of our logistic regression model
print("Sensitivity = ",TP2 / float(TP2+FN2))

# Let us calculate specificity
print("Specificity = ",TN2 / float(TN2+FP2))

# Calculate false positive rate - predicting churn when the customer has not churned
print("False Positive Rate = ",FP2/ float(TN2+FP2))

# positive predictive value 
print ("Precision = ",TP2 / float(TP2+FP2))

# Negative predictive value
print ("Negative Predictive Value = ",TN2 / float(TN2 + FN2))

#### Precision and Recall

In [118]:
p, r, thresholds = precision_recall_curve(y_train_sm_pred_final.Converted, y_train_sm_pred_final.Converted_prob)

plt.plot(thresholds, p[:-1], "g-")               # Precision vs threshold (green)
plt.plot(thresholds, r[:-1], "r-")               # Recall vs threshold (red)
plt.show()

##### Predicting test_set
**Transforming and feature selection for test data**

In [119]:
X_test[num_col] = scaler.transform(X_test[num_col])    # Scaling test data
X_test.head()

In [120]:
X_test=X_test[rfe_columns]           # Feature selection
X_test.head()

In [121]:
# Adding constant to the test model.
X_test_SM = sm.add_constant(X_test)

#### Predicting the target variable

In [122]:
y_test_pred = res.predict(X_test_SM)
print("\nThe first ten predicted probabilities are:\n",y_test_pred[:10])

In [123]:
y_pred = pd.DataFrame(y_test_pred)
y_pred.head()

In [124]:
y_pred=y_pred.rename(columns = {0:"Conv_prob"})

In [125]:
y_test_df = pd.DataFrame(y_test)
y_test_df.head()

In [126]:
y_pred_final = pd.concat([y_test_df,y_pred],axis=1)
y_pred_final.head()

In [127]:
y_pred_final['test_churn_df_pred'] = y_pred_final.Conv_prob.map(lambda x: 1 if x>0.54 else 0)
y_pred_final.head()

In [128]:
# Checking the overall accuracy of the predicted set.
metrics.accuracy_score(y_pred_final.churn_probability, y_pred_final.test_churn_df_pred)

### Step- VI : Model Evaluation
* Evaluate the performance of the models
* Prepare models for predictor variable selection, considering multiple models and selecting the best one

**Metrics Evaluation**

In [129]:
# Confusion Matrix
confusion2_test = metrics.confusion_matrix(y_pred_final.churn_probability, y_pred_final.test_churn_df_pred)
print("Confusion Matrix\n",confusion2_test)

In [130]:
TP3 = confusion2_test[1,1] # true positive      # Calculating model validation parameters
TN3 = confusion2_test[0,0] # true negatives
FP3 = confusion2_test[0,1] # false positives
FN3 = confusion2_test[1,0] # false negatives

In [131]:
print("Sensitivity = ",TP3 / float(TP3+FN3))    # Sensitivity of our logistic regression model

print("Specificity = ",TN3 / float(TN3+FP3))    # Specificity

print("False Positive Rate = ",FP3/ float(TN3+FP3))  # False positive rate: predicting churn when the customer has not churned

print("Precision = ",TP3 / float(TP3+FP3))     # Positive predictive value

print("Negative Predictive Value = ",TN3 / float(TN3+FN3))  # Negative predictive value

Breaking down and explaining the results

In [132]:
print("The accuracy of the predicted model is: ",round(metrics.accuracy_score(y_pred_final.churn_probability, y_pred_final.test_churn_df_pred)*100,2),"%")
print("The sensitivity of the predicted model is: ",round(TP3 / float(TP3+FN3)*100,2),"%")

print("\nThe model is tuned for sensitivity, i.e. the true positive rate (customers who actually churn and are predicted to churn) is given more importance.\n") 

In [133]:
# ROC curve for the test dataset

# Here, we'll define the variables to plot the curve
fpr, tpr, thresholds = metrics.roc_curve(y_pred_final.churn_probability,y_pred_final.Conv_prob, drop_intermediate = False )
# Plotting the curve for the obtained metrics
draw_roc(y_pred_final.churn_probability,y_pred_final.Conv_prob)

* The AUC score is 0.90 on the train dataset and 0.88 on the test dataset. The small gap between the two suggests the model generalises well, so it can be considered a good model.
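
To make the AUC figures easier to interpret: AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes it directly from that definition. `auc_by_pairs` is a hypothetical helper and the inputs are toy values; the pairwise loop is only practical for small samples.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0        # pair ranked correctly
            elif pos_score == neg_score:
                wins += 0.5        # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example (illustrative only)
toy_auc = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("AUC:", toy_auc)
```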

#### Principal component analysis (PCA)

In [134]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=10)

In [135]:
X_train.shape

In [136]:
pca = PCA(random_state=42)

In [137]:
pca.fit(X_train)

In [138]:
pca.components_

#### Analysing the explained variance ratio

In [139]:
pca.explained_variance_ratio_

In [140]:
var_cumu = np.cumsum(pca.explained_variance_ratio_)

In [141]:
fig = plt.figure(figsize=[12,8])
plt.vlines(x=15, ymax=1, ymin=0, colors="r", linestyles="--")
plt.hlines(y=0.95, xmax=30, xmin=0, colors="g", linestyles="--")
plt.plot(var_cumu)
plt.ylabel("Cumulative variance explained")
plt.show()

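Note: IncrementalPCA fits the components in mini-batches, which keeps memory use manageable on large datasets. We keep 16 components, in line with the reference lines on the plot above (x = 15 on the zero-based axis, 95% cumulative variance).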
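
The number of components can also be derived from the cumulative variance rather than read by eye. This is a minimal sketch in plain Python: `components_for_variance` is a hypothetical helper and the ratios are made up. With the fitted `pca` object one would pass `pca.explained_variance_ratio_` instead.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target:
            return k
    # Target never reached: fall back to using every component
    return len(explained_variance_ratio)


# Made-up explained-variance ratios, sorted from largest to smallest
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
n_components = components_for_variance(ratios, target=0.95)
print("components needed:", n_components)
```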

In [142]:
pca_final = IncrementalPCA(n_components=16)

In [143]:
df_train_pca = pca_final.fit_transform(X_train)

In [144]:
df_train_pca.shape

In [145]:
corrmat = np.corrcoef(df_train_pca.transpose())

In [146]:
corrmat.shape

In [147]:
df_test_pca = pca_final.transform(X_test)
df_test_pca.shape

#### Implementing logistic regression on the principal components

In [148]:
learner_pca = LogisticRegression()

In [149]:
model_pca = learner_pca.fit(df_train_pca, y_train)

#### Making predictions on the test_set

In [150]:
pred_probs_test = model_pca.predict_proba(df_test_pca)

In [151]:
"{:2.2}".format(metrics.roc_auc_score(y_test, pred_probs_test[:,1]))

#### Confusion matrix, Sensitivity and Specificity

In [152]:
pred_probs_test1 = model_pca.predict(df_test_pca)

In [153]:
# Confusion matrix
confusion = metrics.confusion_matrix(y_test, pred_probs_test1)
print(confusion)

In [154]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [155]:
print("Accuracy:-",metrics.accuracy_score(y_test, pred_probs_test1))    # Accuracy

print("Sensitivity:-",TP / float(TP+FN))                                # Sensitivity

print("Specificity:-", TN / float(TN+FP))                               # Specificity

#### Making predictions on the train_set

In [156]:
pred_probs_train = model_pca.predict_proba(df_train_pca)

In [157]:
"{:2.2}".format(metrics.roc_auc_score(y_train, pred_probs_train[:,1]))

#### Confusion matrix, Sensitivity and Specificity

In [158]:
pred_probs_train1 = model_pca.predict(df_train_pca)

In [159]:
# Confusion matrix
confusion = metrics.confusion_matrix(y_train, pred_probs_train1)
print(confusion)

In [160]:
TP = confusion[1,1] # true positive 
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [161]:
# Accuracy
print("Accuracy:-",metrics.accuracy_score(y_train, pred_probs_train1))

# Sensitivity
print("Sensitivity:-",TP / float(TP+FN))

# Specificity
print("Specificity:-", TN / float(TN+FP))

#### Decision Tree with PCA

In [162]:
from sklearn.tree import DecisionTreeClassifier

In [163]:
dt = DecisionTreeClassifier(random_state=42)

In [164]:
from sklearn.model_selection import GridSearchCV

In [165]:
params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'min_samples_split': [50, 150]    # duplicate value 50 removed
}

In [166]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=dt, 
                           param_grid=params, 
                           cv=4, n_jobs=-1, verbose=1, scoring = "accuracy")
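
Before launching a grid search it can help to count how many model fits it will trigger, since every candidate is trained once per fold. The sketch below enumerates a grid with `itertools.product`. `expand_grid` is a hypothetical helper and the grid is an illustrative one, not the notebook's. Repeated values are dropped here on the assumption that evaluating the same setting twice adds cost without adding information.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one dict per unique parameter combination."""
    names = list(param_grid)
    # dict.fromkeys removes repeated values while preserving their order
    value_lists = [list(dict.fromkeys(param_grid[name])) for name in names]
    for combo in product(*value_lists):
        yield dict(zip(names, combo))


# Illustrative grid with a repeated value in the last list
grid = {
    "max_depth": [2, 3, 5],
    "min_samples_leaf": [5, 10],
    "min_samples_split": [50, 150, 50],
}
candidates = list(expand_grid(grid))
n_folds = 4   # same fold count as cv=4 in the grid search
print(len(candidates), "unique candidates,", len(candidates) * n_folds, "fits")
```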

In [167]:
grid_search.fit(df_train_pca, y_train)

In [168]:
score_df = pd.DataFrame(grid_search.cv_results_)
score_df.head()

In [169]:
score_df.nlargest(5,"mean_test_score")

In [170]:
grid_search.best_estimator_

In [171]:
dt_best = DecisionTreeClassifier( random_state = 42,
                                  max_depth=10, 
                                  min_samples_leaf=20,
                                  min_samples_split=50)

In [172]:
dt_best.fit(df_train_pca, y_train)

In [173]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [174]:
def evaluate_model(dt_classifier):
    print("Train Accuracy :", accuracy_score(y_train, dt_classifier.predict(df_train_pca)))
    print("Train Confusion Matrix:")
    print(confusion_matrix(y_train, dt_classifier.predict(df_train_pca)))
    print("-"*50)
    print("Test Accuracy :", accuracy_score(y_test, dt_classifier.predict(df_test_pca)))
    print("Test Confusion Matrix:")
    print(confusion_matrix(y_test, dt_classifier.predict(df_test_pca)))

In [175]:
evaluate_model(dt_best)

####  Random Forest with PCA

In [176]:
from sklearn.ensemble import RandomForestClassifier

In [177]:
max_features = int(round(np.sqrt(X_train.shape[1])))    # number of variables to consider to split each node
print(max_features)

In [178]:
rf = RandomForestClassifier(n_estimators=100, max_depth=4, max_features=7, random_state=100, oob_score=True, verbose=1)

In [179]:
rf.fit(df_train_pca, y_train)

In [180]:
rf.oob_score_
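
The out-of-bag score works because each tree is trained on a bootstrap sample, so some rows are left out and can be scored by that tree without leakage. The sketch below estimates what share of rows a single bootstrap draw leaves out and compares it with the closed form `(1 - 1/n)**n`, which approaches `1/e`. `oob_fraction` is a hypothetical helper used only for illustration.

```python
import random


def oob_fraction(n_samples, seed=0):
    """Share of rows never drawn in one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    # Draw n_samples row indices with replacement and keep the distinct ones
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


n = 10_000
simulated = oob_fraction(n)
theoretical = (1 - 1 / n) ** n
print("simulated:", round(simulated, 3), "theoretical:", round(theoretical, 3))
```

Roughly a third of the rows are therefore available to validate each tree, which is why the OOB score can stand in for a separate validation split.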

In [181]:
#from sklearn.metrics import plot_roc_curve
from sklearn.metrics import RocCurveDisplay

In [182]:
#plot_roc_curve(rf, df_train_pca, y_train)
RocCurveDisplay.from_estimator(rf, df_train_pca, y_train)
plt.show()

#### Hyper-parameter tuning for the Random Forest

In [183]:
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

In [184]:
params = {
    'max_depth': [2,3,5],
    'min_samples_leaf': [50,100],
    'min_samples_split': [100, 150],
    'n_estimators': [100, 200 ]
}

In [185]:
grid_search = GridSearchCV(estimator=rf, param_grid=params,cv = 4,n_jobs=-1, verbose=1, scoring="accuracy")

In [None]:
grid_search.fit(df_train_pca, y_train)

In [None]:
grid_search.best_score_ 

In [None]:
grid_search.best_params_

In [None]:
rfc_model = RandomForestClassifier(bootstrap=True,
                             max_depth=5,
                             min_samples_leaf=50, 
                             min_samples_split=100,
                             n_estimators=200,
                             random_state=42)    # fixed seed so the results are reproducible

In [None]:
rfc_model.fit(df_train_pca, y_train)

In [None]:
evaluate_model(rfc_model)

In [None]:
rfc_model.feature_importances_

#### Conclusion :
* Random Forest is observed to be the best model for predicting churn, using accuracy as the performance measure.
* Incoming calls (local calls from same-operator mobiles, other-operator mobiles and fixed lines, as well as STD and special calls) play a vital role in understanding the likelihood of churn. Hence, the operator should monitor incoming-call data and provide special offers to customers whose incoming calls are declining.
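
Since the models are ranked by accuracy, it may be worth comparing each score with the accuracy of always predicting the majority class. The sketch below shows that calculation. `majority_baseline_accuracy` is a hypothetical helper, and the 10% churn rate is a made-up figure for illustration, not the rate in this dataset.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)


# Made-up labels: 90 non-churners and 10 churners
y_example = [0] * 90 + [1] * 10
baseline = majority_baseline_accuracy(y_example)
print("baseline accuracy:", baseline)
```

A model is only adding value on this measure if its accuracy clearly exceeds the baseline; sensitivity remains the more direct check of how many churners are caught.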

#### Summary of model performance :
* The models compared are Logistic Regression with RFE, Logistic Regression with PCA, Decision Tree with PCA and Random Forest with PCA. The performance measures for each are summarised below:

***Logistic Regression***
* Train Accuracy : ~90%
* Test Accuracy : ~88%

***Logistic Regression with PCA***
* Train Accuracy : ~92%
* Test Accuracy : ~92%

***Decision Tree with PCA***
* Train Accuracy : ~94%
* Test Accuracy : ~93%

***Random Forest with PCA***
* Train Accuracy : ~92%
* Test Accuracy : ~92%

In [None]:
churn_df_test = pd.read_csv(r"C:\Users\LENOVO\Desktop\amreeta\github/test.csv")

In [None]:
churn_df_test.head()

In [None]:
churn_df_test.shape

In [None]:
churn_df_test.isnull().sum()

In [None]:
churn_df_id = churn_df_test['id']

In [None]:
churn_df_test['tenure'] = (churn_df_test['aon']/30).round(0)
churn_df_test["avg_arpu_6_7"]= (churn_df_test['arpu_6']+churn_df_test['arpu_7'])/2

churn_df_test = churn_df_test[X.columns]

In [None]:
churn_df_test.shape

In [None]:
churn_df_test_null = churn_df_test.isnull().sum().sum() / np.prod(churn_df_test.shape) * 100
churn_df_test_null

In [None]:
for col in churn_df_test.columns:
    null_col = churn_df_test[col].isnull().sum() / churn_df_test.shape[0] * 100
    print("{} : {:.2f}".format(col,null_col))

In [None]:
for col in churn_df_test.columns:
    null_col = churn_df_test[col].isnull().sum() / churn_df_test.shape[0] * 100
    if null_col > 0:
        churn_df_test[col] = churn_df_test[col].fillna(churn_df_test[col].mode()[0])
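
The cell above fills each gap with the mode of the unseen data itself. An alternative worth considering is to learn the fill values from the training data and reuse them, so that the unseen set is treated exactly like the data the model was fitted on. The sketch below illustrates that idea in plain Python. `fit_modes` and `fill_missing` are hypothetical helpers, the rows are made-up values, and `None` stands in for a missing entry.

```python
from collections import Counter


def fit_modes(rows, columns):
    """Most frequent non-missing value per column, learned from rows."""
    modes = {}
    for col in columns:
        seen = Counter(row[col] for row in rows if row[col] is not None)
        if seen:
            modes[col] = seen.most_common(1)[0][0]
    return modes


def fill_missing(rows, modes):
    """Return copies of rows with missing values replaced by the learned modes."""
    return [
        {col: (modes.get(col) if value is None else value) for col, value in row.items()}
        for row in rows
    ]


# Made-up training rows and unseen rows (illustrative only)
train_rows = [
    {"tenure": 12, "avg_arpu_6_7": 300.0},
    {"tenure": 12, "avg_arpu_6_7": 350.0},
    {"tenure": 24, "avg_arpu_6_7": 300.0},
]
unseen_rows = [
    {"tenure": None, "avg_arpu_6_7": 410.0},
    {"tenure": 36, "avg_arpu_6_7": None},
]

modes = fit_modes(train_rows, ["tenure", "avg_arpu_6_7"])
filled = fill_missing(unseen_rows, modes)
print(filled)
```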

In [None]:
churn_df_test.isnull().sum().sum()


In [None]:
churn_df_test_final = pca_final.transform(churn_df_test)

In [None]:
churn_df_test_final.shape

In [None]:
predicted_churn = rfc_model.predict(churn_df_test_final)    # class labels (0/1), not probabilities

In [None]:
predicted_churn.shape

In [None]:
len(churn_df_id)

In [None]:
final_prediction = pd.DataFrame({'id':churn_df_id,'churn_probability':predicted_churn})

In [None]:
final_prediction.to_csv('final_submission.csv',index=False)
final_prediction.head()
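
Before uploading, the submission file can be sanity-checked. The sketch below is one possible check using only the standard library. `check_submission` is a hypothetical helper; it assumes the two column names used above and integer class labels written as `0` or `1`, which may differ if the labels are stored as floats.

```python
import csv
import io


def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ["id", "churn_probability"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems

    rows = list(reader)
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    if set(ids) != {str(i) for i in expected_ids}:
        problems.append("ids do not match the expected set")

    # Assumption: predictions are written as the integer labels 0 or 1
    bad = [row["id"] for row in rows if row["churn_probability"] not in {"0", "1"}]
    if bad:
        problems.append(f"non-binary predictions for ids: {bad}")
    return problems


# Toy CSV contents (illustrative only)
good_csv = "id,churn_probability\n1,0\n2,1\n"
bad_csv = "id,churn_probability\n1,0\n1,2\n"
print(check_submission(good_csv, expected_ids=[1, 2]))
print(check_submission(bad_csv, expected_ids=[1, 2]))
```

For the real file, the text could be read with `open('final_submission.csv').read()` and the expected ids taken from `churn_df_id`.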