# CHURN
Churn rate (sometimes called attrition rate), in its broadest sense, is a measure of the number of individuals or items moving out of a collective group over a specific period. It is one of two primary factors that determine the steady-state level of customers a business will support.

Derived from the butter churn, the term is used in many contexts but most widely applied in business with respect to a contractual customer base. Examples include a subscriber-based service model as used by mobile telephone networks and pay TV operators. The term is often synonymous with turnover, for example participant turnover in peer-to-peer networks. Churn rate is an input into customer lifetime value modeling, and can be part of a simulator used to measure return on marketing investment using marketing mix modeling.
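
As a concrete illustration of the definition above, here is a minimal sketch that computes a churn rate as customers lost divided by customers at the start of the period. This is one common convention, not the only one; the function name `churn_rate` and the figures are made up for the example.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of the starting customer base that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical figures: 1,000 subscribers at the start of the month, 50 cancelled.
monthly_churn = churn_rate(1000, 50)
print(f"Monthly churn rate: {monthly_churn:.1%}")
```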

THE DATA SET

In [None]:
import pandas as pd
import numpy as np

path ='https://raw.githubusercontent.com/dsrscientist/DSData/master/Telecom_customer_churn.csv'
df= pd.read_csv(path)

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.columns


In [None]:
#Data Exploration
cat_cols=df.select_dtypes([object])

for col in cat_cols.columns:
    print(col)
    print(df[col].value_counts())
    print('******************************************')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df['MonthlyCharges'].value_counts()

In [None]:
df['MonthlyCharges'] = pd.cut(df['MonthlyCharges'], bins = [0, 25, 50, 100,150], labels = ['Low', 'Average', 'High','Very High'])
df['MonthlyCharges'].value_counts()

In [None]:
# Convert TotalCharges to numeric; values that cannot be parsed (such as blanks) become NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'].value_counts()
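
To see what `errors='coerce'` does, here is a small sketch on toy values (not taken from the dataset). Entries that cannot be parsed as numbers, such as a blank string, come back as NaN and can then be counted or filled.

```python
import pandas as pd

# Toy values: the middle entry is a blank string, as can happen in a text-typed column.
raw = pd.Series(["29.85", " ", "1889.5"])

parsed = pd.to_numeric(raw, errors="coerce")
print(parsed)
print("Missing after coercion:", parsed.isna().sum())
```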

In [None]:
df['TotalCharges'] = pd.cut(df['TotalCharges'], bins = [0, 100, 500, 1000,10000], labels = ['Low', 'Average', 'High','Very High'])
df['TotalCharges'].value_counts()

In [None]:
df['tenure'].value_counts()

In [None]:
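# include_lowest=True keeps any customer with tenure 0 in the 'Low' bin instead of turning it into NaN
df['tenure'] = pd.cut(df['tenure'], bins = [0, 25, 50, 100], labels = ['Low', 'Average', 'High'], include_lowest=True)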
df['tenure'].value_counts()
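
`pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge falls outside every bin and becomes NaN. The sketch below, on toy tenure values, compares the default behaviour with `include_lowest=True`.

```python
import pandas as pd

tenure = pd.Series([0, 1, 25, 26, 72])  # toy values, not the dataset
bins, labels = [0, 25, 50, 100], ["Low", "Average", "High"]

# Default: intervals are (0, 25], (25, 50], (50, 100], so 0 is not covered.
default_cut = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"tenure": tenure,
                    "default": default_cut,
                    "include_lowest": inclusive_cut}))
```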

In [None]:
df.describe()

In [None]:
# Dropping the irrelevant columns

df.drop(columns=["customerID"], inplace=True)

# Checking for columns containing null, blank or empty values

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull())
plt.title("Null Values")
plt.show()

In [None]:
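# TotalCharges was binned into categories above, so a mean is undefined; fill gaps with the most frequent bin
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].mode()[0])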


In [None]:
df.isnull().sum()

In [None]:
#Transforming the Data types of the Columns To Same DataTypes 
df.info()

In [None]:
df.describe()

In [None]:
from sklearn.preprocessing import LabelEncoder
le =LabelEncoder()

list1=['gender','Partner','Dependents','PhoneService','MultipleLines','InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies','Contract','PaperlessBilling','PaymentMethod','Churn','MonthlyCharges','TotalCharges','tenure']
for val in list1:
  df[val]=le.fit_transform(df[val].astype(str))
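
`LabelEncoder` assigns integer codes in the sorted order of the class labels, which is worth keeping in mind when reading the encoded plots and tables that follow. The sketch below reproduces that sorted mapping with `np.unique` on made-up values, so it illustrates the idea rather than the notebook's exact output.

```python
import numpy as np

# Toy values resembling a service column; not taken from the dataset.
values = np.array(["Yes", "No", "No internet service", "No", "Yes"])

# np.unique returns the sorted unique labels and, for each value, its index in that list.
classes, codes = np.unique(values, return_inverse=True)
mapping = {str(label): code for code, label in enumerate(classes.tolist())}

print(mapping)
print(codes.tolist())
```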

In [None]:
df.head()

# EDA

In [None]:
fig, ax = plt.subplots(figsize = (12,5))
sns.countplot(x="PaymentMethod", hue="Contract", data=df, ax=ax)

In [None]:
group = "PaymentMethod"
target = "Churn"
fig, ax = plt.subplots(figsize = (12,5))
temp_df = (df.groupby([group, target]).size()/df.groupby(group)[target].count()).reset_index().pivot(columns=target, index=group, values=0)
temp_df.plot(kind='bar', stacked=True, ax = ax, color = ["green", "darkred"])
ax.xaxis.set_tick_params(rotation=0)
ax.set_xlabel(group)
ax.set_ylabel('Churn Percentage');

In [None]:
plt.figure(figsize=(16,8))
sns.countplot(x="tenure", hue="Churn", data=df)
plt.show()

Customers on month-to-month contracts mostly pay by electronic check or mailed check. One possible reason is that cancelling a subscription is quicker with these methods than with automatic payment.

As the plot shows, the higher the tenure, the lower the churn rate. This suggests that customers become more loyal the longer they stay.
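
The tenure observation can also be checked numerically rather than read off a plot. A minimal sketch on a made-up frame (the `toy` data and its labels are assumptions, and the notebook's own columns are label-encoded by this point):

```python
import pandas as pd

# Toy data: four short-tenure and four long-tenure customers.
toy = pd.DataFrame({
    "tenure": ["Low", "Low", "Low", "Low", "High", "High", "High", "High"],
    "Churn":  ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# The mean of a boolean column is the share of True values, i.e. the churn rate per group.
churn_by_tenure = toy["Churn"].eq("Yes").groupby(toy["tenure"]).mean()
print(churn_by_tenure)
```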

In [None]:
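# stacked_plot is not defined elsewhere in this notebook; this helper wraps the
# PaymentMethod plot above so that it can be reused for any column
def stacked_plot(df, group, target):
    fig, ax = plt.subplots(figsize = (12,5))
    temp_df = (df.groupby([group, target]).size()/df.groupby(group)[target].count()).reset_index().pivot(columns=target, index=group, values=0)
    temp_df.plot(kind='bar', stacked=True, ax = ax, color = ["green", "darkred"])
    ax.xaxis.set_tick_params(rotation=0)
    ax.set_xlabel(group)
    ax.set_ylabel('Churn Percentage')

for col in ["PhoneService", "MultipleLines", "OnlineSecurity", "OnlineBackup",
            "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
            "gender", "SeniorCitizen", "Partner", "Dependents"]:
    stacked_plot(df, col, "Churn")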



Observations

As we can see, MultipleLines and PhoneService add little value to the model, since the churn rate is similar across their categories.

Customers who do not opt for internet service churn less, possibly because the service costs less. Customers who have internet service but do not opt for a specific add-on service are more likely to churn.

Gender alone does not help us predict customer churn.

A customer who is young and has a family is less likely to stop the service, as we can see below. The reason might be a busy life, more money, or other factors.

Most people without dependents choose fiber optic as their InternetService, and their churn percentage is high.
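
Observations like these can be tabulated with a row-normalised cross-tabulation, which gives the share of churners within each category. A hedged sketch on made-up rows (the `toy` frame is illustrative and does not reproduce the dataset's proportions):

```python
import pandas as pd

# Toy data: four customers per internet service type.
toy = pd.DataFrame({
    "InternetService": ["Fiber optic"] * 4 + ["DSL"] * 4 + ["No"] * 4,
    "Churn": ["Yes", "Yes", "No", "No",
              "Yes", "No", "No", "No",
              "No", "No", "No", "No"],
})

# normalize="index" makes every row sum to 1, so the "Yes" column is the churn rate.
rates = pd.crosstab(toy["InternetService"], toy["Churn"], normalize="index")
print(rates)
```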

In [None]:
# OnlineSecurity and tenure were label-encoded above. LabelEncoder sorts classes alphabetically,
# so 0 = "No", 1 = "No internet service", 2 = "Yes". histplot replaces the deprecated distplot.
for code, label in [(0, "No"), (2, "Yes"), (1, "No Internet Service")]:
    sns.histplot(df.tenure[df.OnlineSecurity == code], alpha=0.3, label=label)
plt.title("Tenure Distribution by Online Security Service Subscription")
plt.legend()
plt.show()

As we can see here, tenure is not normally distributed.

In [None]:
df.hist(figsize=(15,30),edgecolor='red',layout=(9,3),bins=15,legend=True)
plt.show()

In [None]:
sns.pairplot(df)

# Correlation between features

In [None]:
df.corr()

In [None]:
df.corr()['Churn'].sort_values()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(), annot=True, linewidths=0.5,linecolor="black", fmt='.2f')


In [None]:
##Descriptive Statistics
df.describe()

In [None]:
plt.figure(figsize=(15,7))
sns.heatmap(round(df.describe()[1:].transpose(),2), annot=True, linewidths=0.5,linecolor="black", fmt='f')


In [None]:
df.info()

In [None]:
##Checking Skewness
df.iloc[:,:-1].skew()

In [None]:
df.dtypes

In [None]:
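from sklearn.preprocessing import power_transform
feature_cols = df.columns[:-1]  # every column except the last one, Churn
x_new=power_transform(df[feature_cols],method='yeo-johnson')

# Assign by column name so the float results replace the integer columns cleanly
df[feature_cols]=x_new
df[feature_cols].skew()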

In [None]:
##Outliers Checking
import warnings
warnings.filterwarnings('ignore')
df.plot(kind='box',subplots=True, layout=(3,9), figsize=[20,8])

# Outlier Removal
Z-Score Technique
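
The z-score of a value is its distance from the column mean measured in standard deviations; rows with an absolute z-score above a threshold (3 here) are treated as outliers. A minimal NumPy sketch of the rule on made-up numbers, using the population standard deviation:

```python
import numpy as np

# Toy column: 100 evenly spaced values between 45 and 55, plus one injected extreme value.
values = np.concatenate([np.linspace(45, 55, 100), [120.0]])

# z = |x - mean| / std, using the population standard deviation (ddof=0).
z = np.abs((values - values.mean()) / values.std())

threshold = 3
outlier_idx = np.where(z > threshold)[0]
print("Outlier positions:", outlier_idx)
print("Values kept:", (z <= threshold).sum())
```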

In [None]:
from scipy.stats import zscore
import numpy as np
z=np.abs(zscore(df))
z.shape


In [None]:
threshold=3
print(np.where(z>threshold))

In [None]:
len(np.where(z>3)[0])

In [None]:
# Preview the frame without the outlier rows. The row positions are computed from z
# instead of being hardcoded; the result is not assigned, so df is unchanged here.
outlier_rows = np.unique(np.where(z>threshold)[0])
df.drop(index=df.index[outlier_rows])

In [None]:
df=df[(z<3).all(axis=1)]

# Feature Engineering (VIF)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
df.corr()


In [None]:
plt.figure(figsize=(25,22))
sns.heatmap(df.corr(),linewidths=.1,vmin=-1, vmax=1, fmt='.2g', annot = True, linecolor="black",annot_kws={'size':15},cmap="YlGnBu")
plt.yticks(rotation=0)

In [None]:
df.isnull().sum()

In [None]:
x=df.drop('Churn',axis=1)
y=df['Churn']

In [None]:
x

In [None]:
y

In [None]:
def vif_calc():
  # Print the variance inflation factor of every feature column currently in x
  vif=pd.DataFrame()
  vif["VIF Factor"]=[variance_inflation_factor(x.values,i) for i in range(x.shape[1])]
  vif["features"]=x.columns
  print(vif)

vif_calc()

In [None]:
# Dropping TotalCharges and re-checking the VIF values

x.drop(columns=["TotalCharges"], inplace=True)
vif_calc()
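
For intuition, the variance inflation factor of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it with NumPy on synthetic columns. It adds an intercept to each auxiliary regression, which is an assumption of this sketch and may not match exactly how the statsmodels call above treats `x`; the helper name `vif` is made up.

```python
import numpy as np


def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r2 = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r2)


# Synthetic features: c is almost a copy of a, while b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
for name, value in zip(["a", "b", "c"], vifs):
    print(f"{name}: {value:.2f}")
```

A value near 1 means the feature is not explained by the others, while a large value signals multicollinearity.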

In [None]:
##Scaling the Data
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x=pd.DataFrame(sc.fit_transform(x), columns=x.columns)
x

# MODELLING

Building a classification model, as the target column has only two outcomes

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print(df['Churn'].value_counts())  
plt.figure(figsize=(5,5))
sns.countplot(x=df['Churn'])
plt.show()

In [None]:
##OverSampling
from imblearn.over_sampling import SMOTE
sm = SMOTE()
x, y = sm.fit_resample(x,y)
y.value_counts()
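
The cell above resamples the full data before any train/test split. A common alternative, shown here only as a sketch, is to split first and oversample just the training portion so that no synthetic or duplicated row can end up in the test set. To stay dependency-free the example uses plain random oversampling with NumPy as a stand-in for SMOTE, and all data in it is synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)  # toy imbalanced target

# Hold out a test set first so that resampling cannot leak into it.
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:30], idx[30:]
X_train, y_train = X[train_idx], y[train_idx]

# Random oversampling: draw minority rows with replacement until both classes match.
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
keep = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_train[keep], y_train[keep]

print("Training class counts after oversampling:", np.bincount(y_bal))
print("Untouched test class counts:", np.bincount(y[test_idx]))
```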

In [None]:
# Getting the best random state
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

maxAccu=0
maxRS=0

for i in range(1,200):
    x_train,x_test, y_train, y_test=train_test_split(x,y,test_size=.30, random_state=i)
    rfc=RandomForestClassifier()
    rfc.fit(x_train,y_train)
    pred=rfc.predict(x_test)
    acc=accuracy_score(y_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print("Best accuracy is ",maxAccu*100," on Random_state ",maxRS)

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.30,random_state=maxRS)

In [None]:
##Logistic Regression
# Checking Accuracy for Logistic Regression
log = LogisticRegression()
log.fit(x_train,y_train)

#Prediction
predlog = log.predict(x_test)

print(accuracy_score(y_test, predlog)*100)
print(confusion_matrix(y_test, predlog))
print(classification_report(y_test,predlog))
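
The numbers in the classification report all derive from the four cells of the confusion matrix. A small sketch with hypothetical counts (not results from this notebook) shows the arithmetic; the helper name `metrics_from_confusion` is made up for the example.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive summary metrics from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # of the predicted churners, how many really churned
    recall = tp / (tp + fn)     # of the actual churners, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, laid out as scikit-learn prints them: [[tn, fp], [fn, tp]]
scores = metrics_from_confusion(tn=80, fp=20, fn=10, tp=90)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```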

In [None]:
##Random Forest Classifier
# Checking accuracy for Random Forest Classifier
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

# Prediction
predrf = rf.predict(x_test)

print(accuracy_score(y_test, predrf)*100)
print(confusion_matrix(y_test, predrf))
print(classification_report(y_test,predrf))

In [None]:
##Decision Tree Classifier
# Checking Accuracy for Decision Tree Classifier
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)

#Prediction
preddtc = dtc.predict(x_test)

print(accuracy_score(y_test, preddtc)*100)
print(confusion_matrix(y_test, preddtc))
print(classification_report(y_test,preddtc))

In [None]:
##Support Vector Machine Classifier
# Checking accuracy for Support Vector Machine Classifier
svc = SVC()
svc.fit(x_train,y_train)

# Prediction
predsvc = svc.predict(x_test)

print(accuracy_score(y_test, predsvc)*100)
print(confusion_matrix(y_test, predsvc))
print(classification_report(y_test,predsvc))

In [None]:
##Gradient Boosting Classifier
# Checking accuracy for Gradient Boosting Classifier
GB = GradientBoostingClassifier()
GB.fit(x_train,y_train)

# Prediction
predGB = GB.predict(x_test)

print(accuracy_score(y_test, predGB)*100)
print(confusion_matrix(y_test, predGB))
print(classification_report(y_test,predGB))

In [None]:
##Cross Validation Score
#cv score for Logistic Regression
print(cross_val_score(log,x,y,cv=5).mean()*100)

# cv score for Decision Tree Classifier
print(cross_val_score(dtc,x,y,cv=5).mean()*100)

# cv score for Random Forest Classifier
print(cross_val_score(rf,x,y,cv=5).mean()*100)

# cv score for Support Vector  Classifier
print(cross_val_score(svc,x,y,cv=5).mean()*100)

# cv score for Gradient Boosting Classifier
print(cross_val_score(GB,x,y,cv=5).mean()*100)

# Random Forest Classifier also works best with respect to the cross-validation score, so it is the model tuned next.

In [None]:
##HyperParameter Tuning for the model with best score
#Random Forest Classifier

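parameters = {'criterion':['gini'],
             'max_features':['sqrt'],    # 'auto' meant 'sqrt' for classifiers and is no longer accepted by recent scikit-learn
             'n_estimators':[100,200],   # n_estimators must be at least 1; 100 (the default) replaces the invalid 0
             'max_depth':[2,3,4,5,6,8]}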
GCV=GridSearchCV(RandomForestClassifier(),parameters,cv=5)
GCV.fit(x_train,y_train)

In [None]:
GCV.best_params_

In [None]:
Churne = RandomForestClassifier(criterion='gini', max_depth=8, max_features='sqrt', n_estimators=200)
Churne.fit(x_train, y_train)
pred = Churne.predict(x_test)
acc=accuracy_score(y_test,pred)
print(acc*100)

In [None]:
##Plotting ROC and compare AUC for the final model
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve is not available in recent scikit-learn
RocCurveDisplay.from_estimator(Churne,x_test,y_test)
plt.title("ROC AUC Plot")
plt.show()
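
The AUC reported on the plot can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. A minimal sketch of that pairwise definition, on made-up labels and scores (the helper `auc_from_scores` is hypothetical):

```python
import numpy as np


def auc_from_scores(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


# Toy labels and predicted scores, not taken from the model above.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(y_true, scores)
print(f"AUC: {auc:.2f}")
```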

# Conclusion:
The accuracy score for the churn model is 91 %

In [None]:
#Saving the model
import joblib
joblib.dump(Churne,"Customer_Churn.pkl")
['Customer_Churn.pkl']