# Telco Customer Churn for ICP4D

We'll use this notebook to create a machine learning model to predict customer churn.

# 1.0 Install required packages

Run `!pip freeze` to see which packages come preinstalled on the platform for the current Jupyter kernel.

In [None]:
!pip freeze

In [None]:
!pip install --user watson-machine-learning-client --upgrade | tail -n 1

# 2.0 Load and Clean data
We'll load our data as a pandas data frame.

* Highlight the cell below by clicking it.
* Click the `10/01` "Find data" icon in the upper right of the notebook.
* To load the virtualized data created in Exercise-1, choose the `Remote` tab.
* Choose your virtualized data (e.g., User<xyz>.billingProductCustomers), click `Insert to code`, and choose `Insert Pandas DataFrame`.
* The code to bring the data into the notebook environment and create a Pandas DataFrame will be added to the cell below.
* Run the cell.


In [None]:
# Place cursor below and insert the Pandas DataFrame for the Telco churn data


We'll use the Pandas naming convention `df` for our DataFrame. Make sure the cell below assigns from the DataFrame name that the inserted code created above (e.g., `df1`, `df2`, ... `dfX`).

In [None]:
df = df1

### 2.1 Drop the `customerID` feature (column)

In [None]:
df = df.drop('customerID', axis=1)
df.head(5)

### 2.2 Examine the data types of the features

In [None]:
df.info()

### 2.3 Any NaN values should be removed or replaced to create a more accurate model. Prior examination shows NaN values for `TotalCharges`, which we replace with the column mean

In [None]:
# Check if we have any NaN values
df.isnull().values.any()
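
The check above returns a single `True`/`False` for the whole DataFrame. As a minimal sketch of how to see which columns are affected, the snippet below counts NaN values per column. The `toy` frame and its values are made up for illustration and are not the Telco data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Telco data; the values are made up for illustration.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, np.nan, 108.15, np.nan],
})

# Count missing values per column instead of returning one boolean.
nan_counts = toy.isnull().sum()
print(nan_counts)

# Keep only the names of columns that actually contain NaN values.
columns_with_nan = nan_counts[nan_counts > 0].index.tolist()
print(columns_with_nan)
```

Running the same two expressions against `df` should point at the column that needs attention.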

Set `nan_column` to the column number for TotalCharges (starting at 0).

In [None]:
nan_column = df.columns.get_loc("TotalCharges")
print(nan_column)

In [None]:
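# Handle missing values for nan_column (TotalCharges) by replacing NaN with the column mean

import numpy as np
# SimpleImputer (scikit-learn >= 0.20) replaces Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a 2-D array, so flatten it before writing the column back
df.iloc[:, nan_column] = imp.fit_transform(df.iloc[:, nan_column].values.reshape(-1, 1)).ravel()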

In [None]:
# Check if we have any NaN values
df.isnull().values.any()
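
The imputation step above replaces each missing `TotalCharges` value with the mean of the observed values. The same idea can be sketched with pandas alone. This is an alternative illustration on made-up numbers, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"TotalCharges": [10.0, np.nan, 30.0, np.nan]})

# Mean of the observed values; pandas skips NaN by default.
column_mean = toy["TotalCharges"].mean()

# Replace each NaN with that mean.
toy["TotalCharges"] = toy["TotalCharges"].fillna(column_mean)

print(toy["TotalCharges"].tolist())
print(toy.isnull().values.any())
```

Mean imputation keeps every row but pulls the column's variance down slightly, which is usually acceptable when only a few values are missing.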

In [None]:
customer_data = df

# Visualize data

In [None]:
import json
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from itertools import combinations
from sklearn.preprocessing import PolynomialFeatures, LabelEncoder, StandardScaler
import sklearn.feature_selection
from sklearn.model_selection import train_test_split
from collections import defaultdict
from sklearn import metrics
import pixiedust

In [None]:
# Plot Tenure Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="tenure", hue="Churn", data=customer_data)

In [None]:
# Plot Contract Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="Contract", hue="Churn", data=customer_data)

In [None]:
# Plot TechSupport Frequency count
sns.set(style="darkgrid")
sns.set_palette("hls", 3)
fig, ax = plt.subplots(figsize=(20,10))
ax = sns.countplot(x="TechSupport", hue="Churn", data=customer_data)


In [None]:
# Create Grid for pairwise relationships
gr = sns.PairGrid(customer_data, height=5, hue="Churn")  # `size` was renamed to `height` in seaborn 0.9
gr = gr.map_diag(plt.hist)
gr = gr.map_offdiag(plt.scatter)
gr = gr.add_legend()

In [None]:
totalCharge = df.columns.get_loc("TotalCharges")
print(totalCharge)

In [None]:
# Set up plot size
fig, ax = plt.subplots(figsize=(6,6))

# Total Charges distribution as a box plot
a = sns.boxplot(orient="v", palette="hls", data=customer_data.iloc[:, totalCharge], fliersize=14)


In [None]:
# Total Charges data distribution
histogram = sns.distplot(customer_data.iloc[:, totalCharge], hist=True)
plt.show()

In [None]:
tenure  = df.columns.get_loc("tenure")
print(tenure)

In [None]:
# Tenure data distribution
histogram = sns.distplot(customer_data.iloc[:, tenure], hist=True)
plt.show()

In [None]:

monthly = df.columns.get_loc("MonthlyCharges")
print(monthly)

In [None]:
# Monthly Charges data distribution
histogram = sns.distplot(customer_data.iloc[:, monthly], hist=True)
plt.show()


Understand Data Distribution


# 3.0 Create a model

In [None]:
from pyspark.sql import SparkSession
import pandas as pd
import json

spark = SparkSession.builder.getOrCreate()
df_data = spark.createDataFrame(df)
df_data.head()

### 3.1 Split the data into training and test sets

In [None]:
spark_df = df_data
(train_data, test_data) = spark_df.randomSplit([0.8, 0.2], 24)

print("Number of records for training: " + str(train_data.count()))
print("Number of records for evaluation: " + str(test_data.count()))
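
The counts printed above will usually not be an exact 80/20 split, because `randomSplit` assigns rows to splits at random according to the weights rather than cutting at a fixed row. The plain-Python sketch below illustrates that idea. The helper `random_split` is hypothetical and is not Spark's implementation, so its row assignments will not match Spark's for the same seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding shortfall.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=24)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across runs, which is why the notebook passes `24` as the second argument.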

### 3.2 Examine the Spark DataFrame Schema
Look at the data types to determine which columns need feature engineering. String columns must be converted to numeric indices before they can be used as model features.

In [None]:
spark_df.printSchema()

### 3.3 Use StringIndexer to encode string columns as columns of label indices

In [None]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline, Model


si_gender = StringIndexer(inputCol = 'gender', outputCol = 'gender_IX')
si_Partner = StringIndexer(inputCol = 'Partner', outputCol = 'Partner_IX')
si_Dependents = StringIndexer(inputCol = 'Dependents', outputCol = 'Dependents_IX')
si_PhoneService = StringIndexer(inputCol = 'PhoneService', outputCol = 'PhoneService_IX')
si_MultipleLines = StringIndexer(inputCol = 'MultipleLines', outputCol = 'MultipleLines_IX')
si_InternetService = StringIndexer(inputCol = 'InternetService', outputCol = 'InternetService_IX')
si_OnlineSecurity = StringIndexer(inputCol = 'OnlineSecurity', outputCol = 'OnlineSecurity_IX')
si_OnlineBackup = StringIndexer(inputCol = 'OnlineBackup', outputCol = 'OnlineBackup_IX')
si_DeviceProtection = StringIndexer(inputCol = 'DeviceProtection', outputCol = 'DeviceProtection_IX')
si_TechSupport = StringIndexer(inputCol = 'TechSupport', outputCol = 'TechSupport_IX')
si_StreamingTV = StringIndexer(inputCol = 'StreamingTV', outputCol = 'StreamingTV_IX')
si_StreamingMovies = StringIndexer(inputCol = 'StreamingMovies', outputCol = 'StreamingMovies_IX')
si_Contract = StringIndexer(inputCol = 'Contract', outputCol = 'Contract_IX')
si_PaperlessBilling = StringIndexer(inputCol = 'PaperlessBilling', outputCol = 'PaperlessBilling_IX')
si_PaymentMethod = StringIndexer(inputCol = 'PaymentMethod', outputCol = 'PaymentMethod_IX')


In [None]:
si_Label = StringIndexer(inputCol="Churn", outputCol="label").fit(spark_df)
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=si_Label.labels)
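
By default, `StringIndexer` orders labels by descending frequency, so the most frequent value gets index 0, and `IndexToString` maps indices back to the original strings. The plain-Python sketch below illustrates that mapping on made-up values. The helper `fit_string_index` is hypothetical, and its alphabetical tie-break is an assumption added to keep the example deterministic.

```python
from collections import Counter

def fit_string_index(values):
    """Return labels ordered by descending frequency (index = position)."""
    counts = Counter(values)
    # Assumption: ties are broken alphabetically so the result is deterministic.
    return sorted(counts, key=lambda label: (-counts[label], label))

# Made-up sample of the Contract column.
contracts = ["Month-to-month", "Two year", "Month-to-month",
             "One year", "Month-to-month", "Two year"]

labels = fit_string_index(contracts)
label_to_index = {label: float(i) for i, label in enumerate(labels)}

# Forward mapping, analogous to StringIndexer.
encoded = [label_to_index[value] for value in contracts]
# Reverse mapping, analogous to IndexToString.
decoded = [labels[int(index)] for index in encoded]

print(labels)
print(encoded)
```

Because the indices reflect frequency rather than any natural order, tree-based models such as random forests are a reasonable fit for features encoded this way.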

### 3.4 Create a single vector

In [None]:
va_features = VectorAssembler(inputCols=['gender_IX',  'SeniorCitizen', 'Partner_IX', 'Dependents_IX', 'PhoneService_IX', 'MultipleLines_IX', 'InternetService_IX', \
                                         'OnlineSecurity_IX', 'OnlineBackup_IX', 'DeviceProtection_IX', 'TechSupport_IX', 'StreamingTV_IX', 'StreamingMovies_IX', \
                                         'Contract_IX', 'PaperlessBilling_IX', 'PaymentMethod_IX', 'TotalCharges', 'MonthlyCharges'], outputCol="features")

### 3.5 Create a pipeline, and fit a model using RandomForestClassifier 
Assemble all the stages into a pipeline. We don't expect a clean linear relationship between the features and churn, so we'll use RandomForestClassifier, which trains an ensemble of decision trees and combines their predictions.

In [None]:
classifier = RandomForestClassifier(featuresCol="features")

pipeline = Pipeline(stages=[si_gender, si_Partner, si_Dependents, si_PhoneService, si_MultipleLines, si_InternetService, si_OnlineSecurity, si_OnlineBackup, si_DeviceProtection, \
                            si_TechSupport, si_StreamingTV, si_StreamingMovies, si_Contract, si_PaperlessBilling, si_PaymentMethod, si_Label, va_features, \
                            classifier, label_converter])

model = pipeline.fit(train_data)

In [None]:
predictions = model.transform(test_data)
evaluatorDT = BinaryClassificationEvaluator(rawPredictionCol="prediction")
area_under_curve = evaluatorDT.evaluate(predictions)

# The default evaluation metric is areaUnderROC
print("areaUnderROC = %g" % area_under_curve)
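
Note that the evaluator above is given the `prediction` column, which holds the hard 0/1 class, rather than its default `rawPrediction` column, which holds per-class scores. With hard predictions the ROC curve has a single operating point, so the reported area can differ from the area computed on scores. The plain-Python sketch below illustrates the difference on made-up numbers. The function `area_under_roc` is a hypothetical helper, not Spark's implementation.

```python
def area_under_roc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win, which matches the trapezoidal ROC area.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and churn probabilities for six customers.
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]
# Hard predictions obtained by thresholding at 0.5.
hard_predictions = [1.0 if p >= 0.5 else 0.0 for p in probabilities]

auc_scores = area_under_roc(labels, probabilities)
auc_hard = area_under_roc(labels, hard_predictions)
print("AUC on probabilities: %g" % auc_scores)
print("AUC on hard predictions: %g" % auc_hard)
```

If a threshold-independent measure is wanted, it may be worth trying the evaluator with its default `rawPredictionCol` and comparing the two numbers.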

# 4.0 Save the model and test data

Add a unique name for MODEL_NAME.

In [None]:
MODEL_NAME = "myname model"

### 4.1 Save the model to ICP4D local Watson Machine Learning

In [None]:
from dsx_ml.ml import save

save(name=MODEL_NAME, model=model, test_data=test_data, algorithm_type='Classification',
     description='This is a SparkML Model to Classify Telco Customer Churn Risk')

### 4.2 Write the test data without label to a .csv so that we can later use it for batch scoring

In [None]:
write_score_CSV=test_data.toPandas().drop(['Churn'], axis=1)
write_score_CSV.to_csv('../datasets/TelcoCustomerSparkMLBatchScore.csv', sep=',', index=False)
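
As a quick sanity check on this step, the sketch below writes a small made-up frame without its label column and reads it back to confirm that `Churn` is absent. The temporary directory and the file name `score.csv` are placeholders chosen for the example, not paths used by the platform.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the test split.
toy_test = pd.DataFrame({
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "Churn": ["No", "Yes"],
})

# Scoring input: same rows with the label column removed.
score_frame = toy_test.drop(["Churn"], axis=1)

# Write to a temporary location (placeholder path for this example).
out_dir = tempfile.mkdtemp()
score_path = os.path.join(out_dir, "score.csv")
score_frame.to_csv(score_path, sep=",", index=False)

# Read the file back to verify its columns.
round_trip = pd.read_csv(score_path)
print(list(round_trip.columns))
```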

### 4.3 Write the test data to a .csv so that we can later use it for evaluation

In [None]:
write_eval_CSV=test_data.toPandas()
write_eval_CSV.to_csv('../datasets/TelcoCustomerSparkMLEval.csv', sep=',', index=False)

## Congratulations, you have created a model based on customer churn data and saved it to Watson Machine Learning!