# Introduction
This is a comprehensive notebook that may be useful to data scientists in the telecom industry. It studies customer churn, which is very common in telecommunication companies. In this project, I did EDA, predictive modelling and customer clustering. Churn analysis is the evaluation of a company’s customer loss rate, also referred to as the customer attrition rate, in order to reduce it. It's important because keeping an existing customer costs the company less than attracting a new one. Churn rate has a strong impact on customer lifetime value because it affects the length of service and the company's future revenue.

# Load libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from mlxtend.preprocessing import minmax_scaling
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import pylab as pl
from kmodes.kmodes import KModes

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Read the data

The dataset author divided the original data into two files: 80% of the rows are for training and validation, and 20% are for testing.

In [None]:
train = pd.read_csv('../input/telecom-churn-datasets/churn-bigml-80.csv')
test = pd.read_csv('../input/telecom-churn-datasets/churn-bigml-20.csv')

# Explore train and test data

In [None]:
train.sample(10)

In [None]:
train.describe()

In [None]:
test.sample(10)

In [None]:
test.describe()

In [None]:
train.info()

In [None]:
test.info()

# EDA

In [None]:
train['State'].nunique()

In [None]:
train['State'].value_counts()

In [None]:
# The Churn column is boolean, so every True value is summed as 1. I'll convert it to binary 0's and 1's later, in the data cleaning part.
print('The percentage of customers churning from the company is: {}%'.format(train['Churn'].sum() * 100 / train.shape[0]))
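Because `Churn` is boolean, its mean is already the churn rate. Here is a minimal sketch on a toy DataFrame; the values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy data (hypothetical): 2 churned customers out of 8
toy = pd.DataFrame({'Churn': [True, False, False, False, True, False, False, False]})

# True counts as 1 and False as 0, so the mean is the fraction of churners
churn_rate = toy['Churn'].mean() * 100
print('Churn rate: {:.1f}%'.format(churn_rate))

# normalize=True returns the share of each class instead of raw counts
print(toy['Churn'].value_counts(normalize=True))
```

Looking at the class shares early is useful because they show how imbalanced the target is, which matters later when we evaluate the model.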

In [None]:
plt.figure(figsize=(20,6))
sns.set_style('whitegrid')
sns.barplot(x='State',y='Churn', data=train)

In [None]:
sns.barplot(x='Churn', y='Customer service calls',data=train)

In [None]:
sns.barplot(x='Churn', y='Account length',data=train)

In [None]:
plt.hist(train['Account length'], bins=400)
plt.show()

In [None]:
churn_intl = train.groupby(['Churn','International plan']).size()
churn_intl.plot(kind='bar') # the groups are categories, so bars are easier to read than a line
plt.show()


In [None]:
churn_voicem = train.groupby(['Churn','Voice mail plan']).size()
churn_voicem.plot(kind='bar') # the groups are categories, so bars are easier to read than a line
plt.show()

In [None]:
train.head()

In [None]:
train['Total charge'] = train['Total day charge'] + train['Total eve charge'] + train['Total night charge'] + train['Total intl charge']
test['Total charge'] = test['Total day charge'] + test['Total eve charge'] + test['Total night charge'] + test['Total intl charge']

In [None]:
sns.boxplot(x='Churn',y='Total charge', data = train)

The previous analysis gives us the following insights:
* 14% of customers have churned.
* Texas has the highest number of customer churns.
* Churned customers called customer service more often than remaining customers. Maybe that means customer service in this company needs more training in retaining customers.
* Churned customers had higher charges to pay than remaining customers. Maybe that means the company needs to work on more effective plans to facilitate late payments.
* Account length (account duration) is approximately normally distributed.
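One thing to keep in mind when reading the per-state bar plot: `sns.barplot` shows the mean of `Churn` per state by default, which is a rate, not a count. The two can rank states differently. Here is a small sketch with made-up data (the states 'A' and 'B' and all values are hypothetical) that computes both:

```python
import pandas as pd

# Toy data (hypothetical): state 'A' has more churners, state 'B' has a higher churn rate
toy = pd.DataFrame({
    'State': ['A'] * 10 + ['B'] * 4,
    'Churn': [True] * 3 + [False] * 7 + [True] * 2 + [False] * 2,
})

# Number of churners, number of customers and churn rate for each state
by_state = toy.groupby('State')['Churn'].agg(churned='sum', customers='count', churn_rate='mean')
print(by_state.sort_values('churned', ascending=False))

most_churners = by_state['churned'].idxmax()    # state with the largest number of churners
highest_rate = by_state['churn_rate'].idxmax()  # state with the largest share of churners
print(most_churners, highest_rate)
```

The same `groupby` could be run on `train` to check which state leads on each measure.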


# Data cleaning

Now, we'll clean the data and prepare it for prediction.


As you noticed earlier, when we used .info() on both the train and test datasets, we didn't find any null values (luckily!). If we had found some, we would either drop the columns with missing values or impute the missing values with the mean, median or mode of the same column.
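Since this dataset has no missing values, here is a minimal sketch of both options on a toy DataFrame. The column names mirror the dataset, but the values (and the gaps in them) are made up:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one gap in a numeric column and one in a categorical column
toy = pd.DataFrame({
    'Account length': [100, 80, 120, 95],
    'Total day charge': [30.0, np.nan, 45.0, 25.0],
    'International plan': ['No', 'Yes', None, 'No'],
})

print(toy.isnull().sum())  # number of missing values per column

# Option 1: drop every column that contains a missing value
dropped = toy.dropna(axis=1)

# Option 2: impute, using the median for numeric data and the mode for categorical data
imputed = toy.copy()
imputed['Total day charge'] = imputed['Total day charge'].fillna(imputed['Total day charge'].median())
imputed['International plan'] = imputed['International plan'].fillna(imputed['International plan'].mode()[0])
print(imputed)
```

Dropping is simple but throws information away, so imputing is usually preferable when only a few values are missing.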

We still have columns with categorical values though (dtype = object), so we should deal with them because the predictive models we'll use expect numerical values.

For the 'Churn', 'International plan' and 'Voice mail plan' columns, I will use several techniques to deal with categorical values for illustration purposes, but you can use only one of them if you want, since each of the 3 columns has only 2 unique values.

But first, let's work on a copy of the original dataset. (It's always a good idea to work on copies, not on the original data.)

In [None]:
train2 = train.copy()
test2 = test.copy()
train2

In [None]:
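train2['Churn'] = train2['Churn'].map({True:1,False:0})
test2['Churn'] = test2['Churn'].map({True:1,False:0}) # encoded the same way because test2['Churn'] is used later as ytest

# assign the result back instead of calling replace(..., inplace=True) on a column, which may not modify the DataFrame in newer pandas versions
train2['International plan'] = train2['International plan'].replace(['No','Yes'],[0,1])
test2['International plan'] = test2['International plan'].replace(['No','Yes'],[0,1])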

# Now, I'll use the label encoder preprocessing technique:

encoder = LabelEncoder()
coded_voicem_train = encoder.fit_transform(train2['Voice mail plan'])
train2['Voice mail plan'] = coded_voicem_train
coded_voicem_test = encoder.transform(test2['Voice mail plan'])
test2['Voice mail plan'] = coded_voicem_test

In [None]:
train2.head()

In [None]:
test2.head()
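For a two-valued column, the different encoding techniques give the same result. Here is a small sketch on a toy Series (made-up values) comparing a dictionary `map`, `LabelEncoder` and a plain boolean comparison:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical values) with the same two categories as the plan columns
plan = pd.Series(['No', 'Yes', 'No', 'Yes', 'No'], name='plan')

# Technique 1: explicit dictionary mapping
via_map = plan.map({'No': 0, 'Yes': 1})

# Technique 2: LabelEncoder sorts the classes alphabetically, so 'No' -> 0 and 'Yes' -> 1
encoder = LabelEncoder()
via_le = encoder.fit_transform(plan)

# Technique 3: boolean comparison cast to integers
via_bool = (plan == 'Yes').astype(int)

print(list(encoder.classes_))
print(via_map.tolist(), list(via_le), via_bool.tolist())
```

The dictionary mapping is the most explicit of the three, because the meaning of 0 and 1 is written in the code rather than inferred from the sort order.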

Now, some data might need scaling. I usually delay that until I choose the features that have a higher correlation with the target (aka feature selection or dimensionality reduction), then scale whatever needs scaling among the features I chose. That leads us to the next step, which is:

# Feature Selection

Feature selection means choosing the features that strongly affect the target and are not redundant with each other.

A good method we can use to study features correlation is .corr()

In [None]:
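# 'State' is still a text column, so restrict the correlation to the numeric columns
train2.select_dtypes(include='number').corr()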

We can also plot the correlation in heatmap to make it easier for us:

In [None]:
plt.figure(figsize=(15,15))
sns.heatmap(train2.select_dtypes(include='number').corr(), annot=True) # numeric columns only, 'State' is text

We can notice that some features are highly correlated with each other.

'Total day minutes' and 'Total day charge', for example, have a correlation coefficient of 1, which means one is a linear function of the other, so we'll delete one of them. I choose to delete all the columns with the minutes count because they are redundant.
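A coefficient of 1 is what we would expect if the charge is simply the minutes multiplied by a fixed rate. Here is a quick sketch with synthetic data; the rate of 0.17 per minute is an assumption made up for the illustration, not a value read from the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = rng.uniform(0, 350, size=200)  # synthetic call minutes
rate_per_minute = 0.17                   # hypothetical flat rate, assumed for illustration

toy = pd.DataFrame({'minutes': minutes, 'charge': minutes * rate_per_minute})

# Pearson correlation between a variable and a positive multiple of itself
r = toy['minutes'].corr(toy['charge'])
print(round(r, 6))
```

When two columns are related like this, the second one carries no extra information for the model, which is why keeping only one of them is safe.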

In [None]:
train3 = train2.drop(['Total day minutes','Total eve minutes','Total night minutes', 'Total intl minutes'], axis=1)


Now, we'll select the best features that have the highest correlation with the target 'Churn'.

In [None]:
features = ['International plan','Total charge','Customer service calls']
X_init = train3[features]
y = train3['Churn']
Xtest_init = test2[features]
ytest = test2['Churn']

In [None]:
X_init.head()
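A choice like the one above can be backed by ranking the features by the absolute value of their correlation with the target. Here is a minimal sketch on synthetic data; the feature names `strong`, `weak` and `noise` are hypothetical and only illustrate the ranking step:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)

# Hypothetical binary target: driven mostly by `strong`, slightly by `weak`, not at all by `noise`
target = (2.0 * strong + 0.3 * weak + rng.normal(size=n) > 0).astype(int)

toy = pd.DataFrame({'strong': strong, 'weak': weak, 'noise': noise, 'target': target})

# Correlation of every feature with the target, ranked by absolute value
ranking = toy.corr()['target'].drop('target').abs().sort_values(ascending=False)
print(ranking)
```

On the real data, the same line could be applied to the numeric columns of `train3` with `'Churn'` as the target. Correlation only captures linear relationships, so it is a first filter rather than a final answer.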

Here comes the scaling part...
The data range in 'Total charge' is higher than in the other features, so we'll scale it.


In [None]:
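# min-max scale the data between 0 and 1
# use the training minimum and maximum for both sets, so that train and test share the same scale
col_min, col_max = X_init.min(), X_init.max()
X = (X_init - col_min) / (col_max - col_min)
Xtest = (Xtest_init - col_min) / (col_max - col_min)
Xtest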
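Min-max scaling maps each value x to (x - min) / (max - min). It is generally safer to learn the minimum and maximum from the training data only and reuse them for the test data. Here is a sketch of that idea using scikit-learn's `MinMaxScaler` on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy charges (hypothetical values): the training range is 20 to 100
train_charge = np.array([[20.0], [40.0], [60.0], [100.0]])
test_charge = np.array([[30.0], [110.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_charge)  # learns min=20 and max=100, then scales
test_scaled = scaler.transform(test_charge)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Note that a test value outside the training range ends up outside [0, 1]. That is expected, and it keeps the train and test features on the same scale.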

# Model Selection

Here comes the juicy part!
Now that our data is ready, let's build our model. But first, we'll split X into training data (80%) and validation data (20%).

In [None]:
Xtrain,Xval,ytrain,yval = train_test_split(X,y,train_size=0.8)

In [None]:
Xtrain.shape

In [None]:
Xval.shape

In [None]:
ytrain.shape

In [None]:
yval.shape

We'll make a list of tuples, each containing a model's name and an instance of that model. Then we'll evaluate each model with k-fold cross-validation, which gives a more reliable performance estimate than a single split and helps us spot over-fitting. The best model will be the one with the highest mean accuracy.

In [None]:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RF', RandomForestClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
for name,model in models:
    kfold = model_selection.KFold(n_splits=10)
    cv_result = model_selection.cross_val_score(model,Xtrain,ytrain, cv = kfold, scoring = "accuracy")
    names.append(name)
    results.append(cv_result)
for i in range(len(names)):
    print(names[i],results[i].mean())
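To make the k-fold idea concrete, here is a small sketch that shows what `KFold` does under the hood. It uses 10 toy samples and 5 folds (both numbers are picked only to keep the output short):

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
kfold_toy = KFold(n_splits=5)

# Count how many times each sample lands in a validation fold
val_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kfold_toy.split(X_toy)):
    val_counts[val_idx] += 1
    print(fold, 'train:', train_idx, 'validation:', val_idx)

print(val_counts)
```

Every sample is used for validation exactly once and for training in all the other folds. With an imbalanced target like ours, `StratifiedKFold` is an alternative worth considering because it keeps the class proportions similar in every fold.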

KNeighborsClassifier has the highest score, so we'll choose it. We'll pick the best n_neighbors parameter using GridSearchCV, a class that tries every candidate parameter value with cross-validation and reports the best one.

In [None]:
chosen_model = KNeighborsClassifier()
param = {'n_neighbors': [1,2,3,4,5,6,7]}
grid = GridSearchCV(estimator= chosen_model, param_grid=param, cv=5)
grid.fit(Xtrain,ytrain)
print(grid.best_params_)
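Besides `best_params_`, a fitted grid search also exposes the mean cross-validated score of the best candidate and a model already refitted with it. Here is a sketch on synthetic data; the dataset and the parameter grid are assumptions chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem with 3 features
X_toy, y_toy = make_classification(n_samples=300, n_features=3, n_informative=3,
                                   n_redundant=0, random_state=0)

grid_toy = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [1, 3, 5, 7]}, cv=5)
grid_toy.fit(X_toy, y_toy)

print(grid_toy.best_params_)            # winning parameter value
print(round(grid_toy.best_score_, 3))   # its mean cross-validated accuracy

# With the default refit=True, best_estimator_ is already trained on all the data passed to fit
tuned_model = grid_toy.best_estimator_
print(tuned_model.predict(X_toy[:5]))
```

Using `best_estimator_` or `best_params_` directly avoids copying the chosen value by hand into the next cell.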


In [None]:
best_model = KNeighborsClassifier(**grid.best_params_) # use the n_neighbors value found by the grid search instead of hard-coding it
best_model.fit(Xtrain,ytrain)
pred_val = best_model.predict(Xval)
pred = best_model.predict(Xtest)

Let's evaluate our model:

In [None]:
print("Accuracy score on the test data:")
print(accuracy_score(ytest, pred))
print("Accuracy score on the validation data:")
print(accuracy_score(yval, pred_val))

In [None]:
print("Classification Report:")
print(classification_report(ytest, pred))

In [None]:
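conf = confusion_matrix(ytest,pred)
label = ["0","1"]
sns.heatmap(conf, annot=True, fmt='d', xticklabels=label, yticklabels=label) # fmt='d' shows the counts as integers
plt.xlabel('Predicted') # confusion_matrix puts the predicted labels on the columns
plt.ylabel('Actual') # and the true labels on the rows
plt.show()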
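Since only about 14% of customers churn, accuracy alone can look good even when many churners are missed. Here is a small sketch with made-up labels that shows how precision and recall for the churn class are derived from the four cells of the confusion matrix:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical): 2 churners out of 10 customers
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the customers predicted to churn, how many did
recall = tp / (tp + fn)     # of the customers who churned, how many were caught

print(accuracy, precision, recall)
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

In this toy case the accuracy is 80% while only half of the churners are caught, which is why the recall of class 1 in the classification report deserves attention.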

# Customers Clustering

Telecom companies use recommendation engines to suggest the best packages for clients based on their history. Customer clustering helps a lot in segmenting customers into groups with similar characteristics. I'm not going to build a recommendation engine here; I'll only do the clustering.

I'll start with k-means clustering, which requires numerical data, so I'll work with the processed dataset.
Also, we'll make an elbow curve to determine the optimal number of clusters.

In [None]:
# train3 is fine for this.
clust_data = train3.drop(['Churn','State'], axis=1)
inertia = []
for i in range(1,11):
    clust_model = KMeans(n_clusters= i , init='k-means++', n_init=10)
    clust_model.fit(clust_data)
    inertia.append(clust_model.inertia_)

plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
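To see what an elbow looks like when the answer is known, here is a sketch on synthetic data with three well-separated groups. The blob centers are made up for the illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 clearly separated groups (hypothetical centers)
X_toy, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                      cluster_std=0.5, random_state=0)

inertia_toy = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    km.fit(X_toy)
    inertia_toy.append(km.inertia_)  # sum of squared distances to the nearest center

# Drop in inertia gained by adding one more cluster
drops = [inertia_toy[i] - inertia_toy[i + 1] for i in range(len(inertia_toy) - 1)]
for k, drop in zip(range(2, 7), drops):
    print('k =', k, 'inertia drop:', round(drop, 1))
```

The elbow is the point after which the drops become small. On real data the bend is often less sharp, so it helps to confirm the choice with another measure such as the silhouette score.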

Based on the elbow curve, the optimal number of clusters is 4.

In [None]:
clust_model = KMeans(n_clusters= 4 , init='k-means++', n_init=10)
clusters = clust_model.fit_predict(clust_data)
print(silhouette_score(clust_data, clusters))
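The silhouette score ranges from -1 to 1, and higher values mean tighter, better separated clusters. It can also be compared across several values of k as a cross-check on the elbow curve. Here is a sketch on synthetic data with three known groups (the blob centers are assumptions made up for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 clearly separated groups (hypothetical centers)
X_toy, _ = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                      cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(k, round(score, 3))
print('best k:', best_k)
```

The same loop could be run on `clust_data` to check whether 4 clusters also scores best on the real data.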


Let's add the clusters to the original train data

In [None]:
train['clusters'] = pd.Series(clusters,index=train.index)
train

Let's further inspect those clusters:

In [None]:
clust_churn = train.groupby('clusters').Churn.sum()
clust_churn

In [None]:
train['clusters'].value_counts()

In [None]:
train.head()

In [None]:
charge_clust = train.groupby('clusters')['Total charge'].mean() # mean total charge per cluster
charge_clust