# Objective
* The goal of this project is to predict customer churn with several machine learning models and to select the model that gives the most accurate predictions.
* Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers.
* Each row of the data set represents a customer; each column contains one attribute of that customer.

# Data collected
* Download the data files.
  * The data is available here: https://www.kaggle.com/blastchar/telco-customer-churn/download

In [3]:
%sh
wget -O customer_churn 'https://www.dropbox.com/s/xct68iza4c8v7z2/WA_Fn-UseC_-Telco-Customer-Churn1.csv?dl=0'

In [4]:
customer_churn = spark.read.csv('file:/databricks/driver/customer_churn', inferSchema=True, header=True, mode='DROPMALFORMED')

In [5]:
# Display the first five rows of the DataFrame
customer_churn.show(5)

In [6]:
customer_churn.printSchema()

# Data Explained
* KEY FIELDS:
  * customerID: Customer ID
  * gender: Whether the customer is a male or a female
  * SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
  * Partner: Whether the customer has a partner or not (Yes, No)
  * Dependents: Whether the customer has dependents or not (Yes, No)
  * tenure: Number of months the customer has stayed with the company
  * PhoneService: Whether the customer has a phone service or not (Yes, No)
  * MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone service)
  * InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
  * OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet service)
  * OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet service)
  * DeviceProtection: Whether the customer has device protection or not (Yes, No, No internet service)
  * TechSupport: Whether the customer has tech support or not (Yes, No, No internet service)
  * StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet service)
  * StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No internet service)
  * Contract: The contract term of the customer (Month-to-month, One year, Two year)
  * PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
  * PaymentMethod: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
  * MonthlyCharges: The amount charged to the customer monthly
  * TotalCharges: The total amount charged to the customer
* TARGET FIELD:
  * Churn: Whether the customer churned or not (Yes or No)

In [8]:
# Displaying the stats 
display(customer_churn.summary())

summary,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
count,7043,7043,7043.0,7043,7043,7043.0,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043.0,7043.0,7043
mean,,,0.1621468124378816,,,32.37114865824223,,,,,,,,,,,,,64.76169246059922,,
stddev,,,0.3686116056100135,,,24.55948102309444,,,,,,,,,,,,,30.09004709767848,,
min,0002-ORFBO,Female,0.0,No,No,0.0,No,No,DSL,No,No,No,No,No,No,Month-to-month,No,Bank transfer (automatic),18.25,18.8,No
25%,,,0.0,,,9.0,,,,,,,,,,,,,35.5,401.95,
50%,,,0.0,,,29.0,,,,,,,,,,,,,70.35,1400.55,
75%,,,1.0,,,72.0,,,,,,,,,,,,,118.75,,
max,9995-HOTOH,Male,1.0,Yes,Yes,72.0,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Two year,Yes,Mailed check,118.75,,Yes


# Data Cleaning
* Rows containing null values have been dropped to make our data suitable for applying various models.

In [10]:
#Dropping rows with NaN values
print("rows: {}".format(customer_churn.count()))
customer_churn = customer_churn.dropna()
print("rows after dropna: {}".format(customer_churn.count()))

# Data Visualization

In [12]:
# Plotting the distribution of customer Churn
import seaborn as sns
import matplotlib.pyplot as plt

from matplotlib.pyplot import figure


p_df = customer_churn.toPandas()

sns.countplot(x='Churn',data=p_df).set_title('Distribution of Churn')

fig = plt.gcf()
display(fig)

In [13]:
# The plot shows the variation of Churn with MonthlyCharges and TotalCharges
sns.scatterplot(x='MonthlyCharges',y='TotalCharges',hue='Churn',data=p_df).set_title('MonthlyCharges vs TotalCharges with respect to Churn')
fig = plt.gcf()
display(fig)

In [14]:
# The plot shows that customers on longer contracts churn less.
sns.scatterplot(x='Contract',y='TotalCharges',hue='Churn',data=p_df).set_title('Variation of Churn and Contract')
fig = plt.gcf()
display(fig)

In [15]:
#Plot shows the variation of Churn with respect to PaymentMethod
sns.swarmplot(x='PaymentMethod',y='MonthlyCharges',hue='Churn',data=p_df).set_title('variation of Churn with respect to PaymentMethod')
fig = plt.gcf()
fig.set_size_inches(9, 7)
display(fig)

# Data Modeling

In [17]:
from pyspark.ml.feature import StringIndexer,VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier, NaiveBayes
from pyspark.ml import Pipeline,Model
from pyspark.ml.tuning import ParamGridBuilder

stringIndexer_1= StringIndexer(inputCol="gender", outputCol="gender_IX")
stringIndexer_2 = StringIndexer(inputCol="Partner", outputCol="Partner_IX")
stringIndexer_3 = StringIndexer(inputCol="Dependents", outputCol="Dependents_IX")
stringIndexer_4 = StringIndexer(inputCol="PhoneService", outputCol="PhoneService_IX")
stringIndexer_5 = StringIndexer(inputCol="MultipleLines", outputCol="MultipleLines_IX")
stringIndexer_6 = StringIndexer(inputCol="InternetService", outputCol="InternetService_IX")
stringIndexer_7 = StringIndexer(inputCol="OnlineSecurity", outputCol="OnlineSecurity_IX")
stringIndexer_8 = StringIndexer(inputCol="OnlineBackup", outputCol="OnlineBackup_IX")
stringIndexer_9 = StringIndexer(inputCol="DeviceProtection", outputCol="DeviceProtection_IX")
stringIndexer_10 = StringIndexer(inputCol="TechSupport", outputCol="TechSupport_IX")
stringIndexer_11= StringIndexer(inputCol="StreamingTV", outputCol="StreamingTV_IX")
stringIndexer_12= StringIndexer(inputCol="StreamingMovies", outputCol="StreamingMovies_IX")
stringIndexer_13= StringIndexer(inputCol="Contract", outputCol="Contract_IX")
stringIndexer_14= StringIndexer(inputCol="PaperlessBilling", outputCol="PaperlessBilling_IX")
stringIndexer_15= StringIndexer(inputCol="PaymentMethod", outputCol="PaymentMethod_IX")

vectorAssembler_features = VectorAssembler(inputCols=["gender_IX", "Partner_IX","Dependents_IX","PhoneService_IX","MultipleLines_IX","InternetService_IX","OnlineSecurity_IX","OnlineBackup_IX","DeviceProtection_IX","TechSupport_IX","StreamingTV_IX","StreamingMovies_IX","Contract_IX","PaperlessBilling_IX","PaymentMethod_IX","MonthlyCharges","TotalCharges","tenure"],outputCol="features")
stringIndexer_label = StringIndexer(inputCol="Churn", outputCol="label").fit(customer_churn)

# bestmodel_features = VectorAssembler(inputCols=[ "OnlineSecurity_IX","TechSupport_IX","Contract_IX","PaperlessBilling_IX","tenure"], outputCol="features")

# lr = LogisticRegression(labelCol= "label", featuresCol = "features", maxIter = 10, regParam = 0.0001, elasticNetParam = 1.0)
# rf = RandomForestClassifier(numTrees=50)

pipeline = Pipeline(stages=[])

basePipeline = [stringIndexer_1, stringIndexer_2,stringIndexer_3, stringIndexer_4,stringIndexer_5, stringIndexer_6,stringIndexer_7, stringIndexer_8,stringIndexer_9, stringIndexer_10,stringIndexer_11, stringIndexer_12,stringIndexer_13, stringIndexer_14,stringIndexer_15,stringIndexer_label, vectorAssembler_features]

# basePipeline_bestmodel = [stringIndexer_1, stringIndexer_2,stringIndexer_3, stringIndexer_4,stringIndexer_5, stringIndexer_6,stringIndexer_7, stringIndexer_8,stringIndexer_9, stringIndexer_10,stringIndexer_11, stringIndexer_12,stringIndexer_13, stringIndexer_14,stringIndexer_15,stringIndexer_label, bestmodel_features]

lr = LogisticRegression(maxIter=10)
pl_lr = basePipeline + [lr]
pg_lr = ParamGridBuilder()\
          .baseOn({pipeline.stages: pl_lr})\
          .addGrid(lr.regParam,[0.01, .04])\
          .addGrid(lr.elasticNetParam,[0.1, 0.4])\
          .build()

rf = RandomForestClassifier(numTrees=50)
pl_rf = basePipeline + [rf]
pg_rf = ParamGridBuilder()\
      .baseOn({pipeline.stages: pl_rf})\
      .build()

gb = GBTClassifier(labelCol="label", featuresCol="features",  maxBins=32,  maxDepth=10, maxIter=15)
pl_gb = basePipeline + [gb]
pg_gb = ParamGridBuilder()\
      .baseOn({pipeline.stages: pl_gb})\
      .build()

nb = NaiveBayes()
pl_nb = basePipeline + [nb]
pg_nb = ParamGridBuilder()\
      .baseOn({pipeline.stages: pl_nb})\
      .addGrid(nb.smoothing,[0.4,1.0])\
      .build()


paramGrid = pg_lr + pg_rf + pg_gb + pg_nb
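
Each `StringIndexer` stage above turns a categorical column into numeric indices. Assuming the default `frequencyDesc` ordering, the most frequent value receives index 0.0, which would explain why the majority `Churn` value `No` becomes label 0 and `Yes` becomes label 1. The block below is a minimal pure-Python sketch of that rule, not Spark's implementation; `fit_string_index` and the toy column are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first.

    Ties are broken alphabetically. This is an illustrative stand-in for
    the assumed default ordering of StringIndexer, not Spark code.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column in which "No" is more frequent than "Yes".
churn = ["No", "No", "Yes", "No", "Yes"]
mapping = fit_string_index(churn)
encoded = [mapping[v] for v in churn]
print(mapping)  # {'No': 0.0, 'Yes': 1.0}
print(encoded)  # [0.0, 0.0, 1.0, 0.0, 1.0]
```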


In [18]:
splitted_data = customer_churn.randomSplit([0.6, 0.4], 24)   
train_data = splitted_data[0]
test_data = splitted_data[1]

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

In [19]:
%sh
pip install mlflow

In [20]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator
cv = CrossValidator()\
      .setEstimator(pipeline)\
      .setEvaluator(BinaryClassificationEvaluator())\
      .setEstimatorParamMaps(paramGrid)\
      .setNumFolds(2)

cvmodel = cv.fit(train_data)
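
To get a feel for how much work `cv.fit` does, it helps to count the candidate configurations in `paramGrid`. The sketch below re-creates the grids from the `ParamGridBuilder` calls in plain Python, assuming each builder produces the Cartesian product of its listed values; `expand_grid` and the dictionary keys are hypothetical labels, not Spark objects.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per combination of the listed parameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]

# Values copied from the addGrid calls; an empty grid means "defaults only".
grids = {
    "LogisticRegression": {"regParam": [0.01, 0.04], "elasticNetParam": [0.1, 0.4]},
    "RandomForestClassifier": {},
    "GBTClassifier": {},
    "NaiveBayes": {"smoothing": [0.4, 1.0]},
}

candidates = [(model, params)
              for model, grid in grids.items()
              for params in expand_grid(grid)]
num_folds = 2
print(len(candidates))              # 8
print(len(candidates) * num_folds)  # 16
```

Under these assumptions the cross-validator trains 8 configurations on each of 2 folds, and then refits the best one on the full training set.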

In [21]:
cvmodel.bestModel.stages

In [22]:
import pandas as pd
model = pd.DataFrame(cvmodel.bestModel.stages[-1].featureImportances.toArray(), columns=["values"])
features = ["gender_IX", "Partner_IX","Dependents_IX","PhoneService_IX","MultipleLines_IX","InternetService_IX","OnlineSecurity_IX","OnlineBackup_IX","DeviceProtection_IX","TechSupport_IX","StreamingTV_IX","StreamingMovies_IX","Contract_IX","PaperlessBilling_IX","PaymentMethod_IX","MonthlyCharges","TotalCharges","tenure"]
features_col = pd.Series(features)
model["features"] = features_col
model

Unnamed: 0,values,features
0,0.000752,gender_IX
1,0.002164,Partner_IX
2,0.002537,Dependents_IX
3,0.001144,PhoneService_IX
4,0.003478,MultipleLines_IX
5,0.072352,InternetService_IX
6,0.096543,OnlineSecurity_IX
7,0.020776,OnlineBackup_IX
8,0.019097,DeviceProtection_IX
9,0.144378,TechSupport_IX
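
One simple way to turn these importances into a reduced feature list is to keep only the features above a cutoff. The sketch below uses the ten rows shown in the table; the `threshold` value of 0.01 is an assumption chosen for illustration, not a setting taken from this notebook.

```python
# Importances copied from the ten rows of the table above.
importances = {
    "gender_IX": 0.000752,
    "Partner_IX": 0.002164,
    "Dependents_IX": 0.002537,
    "PhoneService_IX": 0.001144,
    "MultipleLines_IX": 0.003478,
    "InternetService_IX": 0.072352,
    "OnlineSecurity_IX": 0.096543,
    "OnlineBackup_IX": 0.020776,
    "DeviceProtection_IX": 0.019097,
    "TechSupport_IX": 0.144378,
}

threshold = 0.01  # hypothetical cutoff, chosen for illustration only

kept = [name for name, value in importances.items() if value >= threshold]
dropped = [name for name, value in importances.items() if value < threshold]
print(kept)
print(dropped)
```

With this cutoff, the five kept features all appear in the reduced `bestmodel_features` assembler used later in the notebook, and the five dropped ones do not.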


In [23]:
import numpy as np
cvmodel.getEstimatorParamMaps()[ np.argmax(cvmodel.avgMetrics) ]

In [24]:
import re
def paramGrid_model_name(model):
  params = [v for v in model.values() if type(v) is not list]
  name = [v[-1] for v in model.values() if type(v) is list][0]
  name = re.match(r'([a-zA-Z]*)', str(name)).groups()[0]
  return "{}{}".format(name,params)

# Pair each cross-validated metric with a description of its model:
# the metric comes from cvmodel.avgMetrics, and the model name and
# parameters come from the matching entry in paramGrid.
model_measures = zip(cvmodel.avgMetrics, [paramGrid_model_name(m) for m in paramGrid])
metrics, model_names = zip(*model_measures)

In [25]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.clf() # clear figure
fig = plt.figure( figsize=(5, 5))
plt.style.use('fivethirtyeight')
axis = fig.add_axes([0.1, 0.3, 0.8, 0.6])
# plot the metrics as Y
#plt.plot(range(len(model_names)),metrics)
plt.bar(range(len(model_names)),metrics)
# plot the model name & param as X labels
plt.xticks(range(len(model_names)), model_names, rotation=70, fontsize=6)
plt.yticks(fontsize=6)
#plt.xlabel('model',fontsize=8)
plt.ylabel('ROC AUC (better is greater)',fontsize=8)
plt.title('Model evaluations')
display(fig)

In [26]:
stringIndexer_7= StringIndexer(inputCol="OnlineSecurity", outputCol="OnlineSecurity_IX")
stringIndexer_10 = StringIndexer(inputCol="TechSupport", outputCol="TechSupport_IX")
stringIndexer_13 = StringIndexer(inputCol="Contract", outputCol="Contract_IX")
stringIndexer_15 = StringIndexer(inputCol="PaymentMethod", outputCol="PaymentMethod_IX")
stringIndexer_6 = StringIndexer(inputCol="InternetService", outputCol="InternetService_IX")
stringIndexer_8 = StringIndexer(inputCol="OnlineBackup", outputCol="OnlineBackup_IX")
stringIndexer_9 = StringIndexer(inputCol="DeviceProtection", outputCol="DeviceProtection_IX")
stringIndexer_11 = StringIndexer(inputCol="StreamingTV", outputCol="StreamingTV_IX")
stringIndexer_12 = StringIndexer(inputCol="StreamingMovies", outputCol="StreamingMovies_IX")
stringIndexer_label = StringIndexer(inputCol="Churn", outputCol="label").fit(customer_churn)
# vectorAssembler_features = VectorAssembler(inputCols=["OnlineSecurity_IX","TechSupport_IX","Contract_IX","PaperlessBilling_IX","tenure"],outputCol="features")
bestmodel_features = VectorAssembler(inputCols=[ "InternetService_IX","OnlineBackup_IX","DeviceProtection_IX","StreamingTV_IX","StreamingMovies_IX","PaymentMethod_IX","MonthlyCharges","TotalCharges","OnlineSecurity_IX","TechSupport_IX","Contract_IX","tenure"], outputCol="features")
basePipeline_bm = [stringIndexer_7, stringIndexer_10,stringIndexer_13, stringIndexer_15, stringIndexer_label,stringIndexer_6,stringIndexer_8,stringIndexer_9,stringIndexer_11,stringIndexer_12, bestmodel_features]
rf = RandomForestClassifier(numTrees=50)
# pl_rf = basePipeline_bm + [rf]
basepipe=Pipeline(stages=basePipeline_bm + [rf])
rfModel = basepipe.fit(train_data)
predictions = rfModel.transform(test_data)

# Prediction

* We predicted the churn label from the features available in the data set using Logistic Regression, RandomForestClassifier, Naive Bayes, and Gradient-Boosted Trees classifiers. The model that gave the most accurate predictions was the Random Forest Classifier.
* We further improved the Random Forest Classifier by using feature importances to eliminate a few features that did not help the prediction.
* On the test data, the model predicted 2270 labels correctly and 559 incorrectly.

In [29]:

# Model quality
predictions = rfModel.transform(test_data)


In [30]:
predictions.select('prediction','label').show()

In [31]:
correct = predictions.where("(label = prediction)").count()
incorrect = predictions.where("(label != prediction)").count()

resultDF = sqlContext.createDataFrame([['correct', correct], ['incorrect', incorrect]], ['metric', 'value'])
display(resultDF)

metric,value
correct,2270
incorrect,559
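
The two counts above translate directly into test-set accuracy. A minimal sketch, using only the numbers reported in the table:

```python
# Counts copied from the result table above.
correct, incorrect = 2270, 559

total = correct + incorrect
accuracy = correct / total
print(total)               # 2829
print(round(accuracy, 4))  # 0.8024
```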


# Model Evaluation

* To evaluate the model, we counted the predictions for the positive and negative classes and compared them with the counts of actual labels.

In [34]:
counts = [predictions.where('label=1').count(), predictions.where('prediction=1').count(),
          predictions.where('label=0').count(), predictions.where('prediction=0').count()]
names = ['actual 1', 'predicted 1', 'actual 0', 'predicted 0']
display(sqlContext.createDataFrame(zip(names,counts),['Measure','Value']))

Measure,Value
actual 1,745
predicted 1,580
actual 0,2084
predicted 0,2249
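
These marginal counts, combined with the 2270 correct predictions reported earlier, are enough to recover the full confusion matrix. The sketch below is a worked derivation from those reported numbers rather than output from the notebook; the variable names are illustrative.

```python
# Counts copied from the tables in this notebook.
actual_pos, predicted_pos = 745, 580
actual_neg, predicted_neg = 2084, 2249
correct = 2270

# TP + TN = correct, and TP - TN = actual_pos - predicted_neg
# (both of those totals include the same false negatives).
tp = (correct + actual_pos - predicted_neg) // 2
tn = correct - tp
fn = actual_pos - tp
fp = predicted_pos - tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fn, fp, tn)       # 383 362 197 1887
print(round(precision, 3))  # 0.66
print(round(recall, 3))     # 0.514
```

Read this way, the model finds roughly half of the customers who actually churned, and about two thirds of its churn predictions are right.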


In [35]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
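# Build (prediction, label) pairs from the test set. Note that 'prediction' holds
# hard 0/1 class predictions rather than probability scores.
predictionAndLabels = predictions.select('prediction','label').rdd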

# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)
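
Because the cell above passes hard 0/1 predictions rather than probabilities, the ROC curve presumably has a single operating point between (0, 0) and (1, 1). The sketch below computes the area under that three-point curve with the trapezoid rule, using counts reported elsewhere in this notebook; `trapezoid_auc` is a hypothetical helper, and the assumption that Spark builds the curve this way is not verified here.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear curve through (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Counts reported in this notebook: actual 1, predicted 1, actual 0,
# predicted 0, and the number of correct predictions.
actual_pos, predicted_pos, actual_neg, predicted_neg, correct = 745, 580, 2084, 2249, 2270

# True positives follow from TP + TN = correct and TP - TN = actual_pos - predicted_neg.
tp = (correct + actual_pos - predicted_neg) // 2
fp = predicted_pos - tp

tpr = tp / actual_pos  # true positive rate (recall)
fpr = fp / actual_neg  # false positive rate

roc = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
auc = trapezoid_auc(roc)
print(round(auc, 2))  # 0.71
```

For a single operating point this area equals `(1 + tpr - fpr) / 2`. A curve built from the model's predicted probabilities would have many thresholds and would generally give a different AUC.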

# Result Visualization

In [37]:
correct = predictions.where("(label = prediction)").count()
incorrect = predictions.where("(label != prediction)").count()

resultDF = sqlContext.createDataFrame([['correct', correct], ['incorrect', incorrect]], ['metric', 'value'])
display(resultDF)

metric,value
correct,2270
incorrect,559


In [38]:
counts = [predictions.where('label=1').count(), predictions.where('prediction=1').count(),
          predictions.where('label=0').count(), predictions.where('prediction=0').count()]
names = ['actual 1', 'predicted 1', 'actual 0', 'predicted 0']
display(sqlContext.createDataFrame(zip(names,counts),['Measure','Value']))

Measure,Value
actual 1,745
predicted 1,580
actual 0,2084
predicted 0,2249


# Conclusion
* We predicted the churn label from the features available in the data set using Logistic Regression, RandomForestClassifier, Naive Bayes, and Gradient-Boosted Trees classifiers. The model that gave the most accurate predictions was the Random Forest Classifier.
* We further improved the Random Forest Classifier by using feature importances to eliminate a few features that did not help the prediction.
* On the test data, the model predicted 2270 labels correctly and 559 incorrectly.
* For classifying customer churn, we selected the best model, the Random Forest Classifier (RFC), which had an area under the ROC curve of 0.71.