In [1]:
import pandas as pd
import numpy as np
from plotly import express as px


# Introduction

For this lecture we will use a customer churn dataset that contains information about the customers of a telecom company. The goal is to predict whether a customer will churn (i.e., stop using the company's services) based on the customer's demographic information, account information, and usage of various services.

We will go through the following steps:

1. Basic data exploration.
2. Data cleaning.
3. Use Weight of Evidence (WoE) to transform categorical features into numeric features that can be used for prediction.
4. Use Principal Component Analysis (PCA) to reduce the dimensionality of the dataset.
5. Use Logistic Regression to predict whether a customer will churn.
6. Build and deploy a basic neural network using TensorFlow and Keras to predict whether a customer will churn.

In [2]:
churn_df = pd.read_csv('data/telecom_customer_churn.csv')

# 1. Data Exploration

In [3]:
churn_df.head(10)

Unnamed: 0,Customer ID,Gender,Age,Married,Number of Dependents,City,Zip Code,Latitude,Longitude,Number of Referrals,...,Payment Method,Monthly Charge,Total Charges,Total Refunds,Total Extra Data Charges,Total Long Distance Charges,Total Revenue,Customer Status,Churn Category,Churn Reason
0,0002-ORFBO,Female,37,Yes,0,Frazier Park,93225,34.827662,-118.999073,2,...,Credit Card,65.6,593.3,0.0,0,381.51,974.81,Stayed,,
1,0003-MKNFE,Male,46,No,0,Glendale,91206,34.162515,-118.203869,0,...,Credit Card,-4.0,542.4,38.33,10,96.21,610.28,Stayed,,
2,0004-TLHLJ,Male,50,No,0,Costa Mesa,92627,33.645672,-117.922613,0,...,Bank Withdrawal,73.9,280.85,0.0,0,134.6,415.45,Churned,Competitor,Competitor had better devices
3,0011-IGKFF,Male,78,Yes,0,Martinez,94553,38.014457,-122.115432,1,...,Bank Withdrawal,98.0,1237.85,0.0,0,361.66,1599.51,Churned,Dissatisfaction,Product dissatisfaction
4,0013-EXCHZ,Female,75,Yes,0,Camarillo,93010,34.227846,-119.079903,3,...,Credit Card,83.9,267.4,0.0,0,22.14,289.54,Churned,Dissatisfaction,Network reliability
5,0013-MHZWF,Female,23,No,3,Midpines,95345,37.581496,-119.972762,0,...,Credit Card,69.4,571.45,0.0,0,150.93,722.38,Stayed,,
6,0013-SMEOE,Female,67,Yes,0,Lompoc,93437,34.757477,-120.550507,1,...,Bank Withdrawal,109.7,7904.25,0.0,0,707.16,8611.41,Stayed,,
7,0014-BMAQU,Male,52,Yes,0,Napa,94558,38.489789,-122.27011,8,...,Credit Card,84.65,5377.8,0.0,20,816.48,6214.28,Stayed,,
8,0015-UOCOJ,Female,68,No,0,Simi Valley,93063,34.296813,-118.685703,0,...,Bank Withdrawal,48.2,340.35,0.0,0,73.71,414.06,Stayed,,
9,0016-QLJIS,Female,43,Yes,1,Sheridan,95681,38.984756,-121.345074,3,...,Credit Card,90.45,5957.9,0.0,0,1849.9,7807.8,Stayed,,


In [4]:
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        7043 non-null   object 
 1   Gender                             7043 non-null   object 
 2   Age                                7043 non-null   int64  
 3   Married                            7043 non-null   object 
 4   Number of Dependents               7043 non-null   int64  
 5   City                               7043 non-null   object 
 6   Zip Code                           7043 non-null   int64  
 7   Latitude                           7043 non-null   float64
 8   Longitude                          7043 non-null   float64
 9   Number of Referrals                7043 non-null   int64  
 10  Tenure in Months                   7043 non-null   int64  
 11  Offer                              7043 non-null   objec

In [5]:
churn_df.isnull().sum()

Customer ID                             0
Gender                                  0
Age                                     0
Married                                 0
Number of Dependents                    0
City                                    0
Zip Code                                0
Latitude                                0
Longitude                               0
Number of Referrals                     0
Tenure in Months                        0
Offer                                   0
Phone Service                           0
Avg Monthly Long Distance Charges     682
Multiple Lines                        682
Internet Service                        0
Internet Type                        1526
Avg Monthly GB Download              1526
Online Security                      1526
Online Backup                        1526
Device Protection Plan               1526
Premium Tech Support                 1526
Streaming TV                         1526
Streaming Movies                  

In [12]:
# if a feature has type object count the number of unique values
for col in churn_df.columns:
    if churn_df[col].dtype == 'object':
        print(f'{col}: {churn_df[col].nunique()} unique Categories')

Customer ID: 7043 unique Categories
Gender: 2 unique Categories
Married: 2 unique Categories
City: 1106 unique Categories
Offer: 6 unique Categories
Phone Service: 2 unique Categories
Multiple Lines: 2 unique Categories
Internet Service: 2 unique Categories
Internet Type: 3 unique Categories
Online Security: 2 unique Categories
Online Backup: 2 unique Categories
Device Protection Plan: 2 unique Categories
Premium Tech Support: 2 unique Categories
Streaming TV: 2 unique Categories
Streaming Movies: 2 unique Categories
Streaming Music: 2 unique Categories
Unlimited Data: 2 unique Categories
Contract: 3 unique Categories
Paperless Billing: 2 unique Categories
Payment Method: 3 unique Categories
Customer Status: 3 unique Categories
Churn Category: 5 unique Categories
Churn Reason: 20 unique Categories


In [13]:
print('Number of unique cities: ',churn_df['City'].nunique())

Number of unique cities:  1106


In [16]:
# print all the categories in a feature
feat = 'Streaming TV'
print(f'Categories in the feature "{feat}": ')
for reason in churn_df[feat].unique():
    print('\t',reason)

Categories in the feature "Streaming TV": 
	 Yes
	 No
	 nan


In [None]:
# print the 10 cities with the most customers
churn_df['City'].value_counts().head(10)

In [None]:
churn_df['Customer Status'].value_counts()

In [None]:
# we want to find the customer status for each city, so we group by city and get the value counts of Customer Status
churn_df.groupby('City')['Customer Status'].value_counts().head(25)

In [None]:
# use px.density_mapbox to show the number of customers in each city, center the map on California
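# center is set explicitly (approximate centre of California) so the map opens on the state
fig = px.density_mapbox(churn_df, lat='Latitude', lon='Longitude', radius=10, zoom=5.5, center=dict(lat=36.8, lon=-119.4), mapbox_style='stamen-terrain', height=1200, width=900)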
fig.show()

# 2. Data Cleaning

Let's drop some features that are not useful for prediction. We'll also drop features that have too many missing values or too many unique values.
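
As a rough illustration of the two criteria just mentioned, the sketch below flags columns by missing fraction and by number of unique values. The helper name `screen_columns`, the thresholds `max_missing` and `max_unique`, and the toy table are all hypothetical and are not part of the lecture's pipeline.

```python
import pandas as pd

def screen_columns(df, max_missing=0.5, max_unique=50):
    """Return (columns with too many missing values, non-numeric columns with too many categories)."""
    missing_frac = df.isnull().mean()
    too_missing = missing_frac[missing_frac > max_missing].index.tolist()
    # only non-numeric columns are checked for high cardinality
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    too_many_unique = [c for c in non_numeric if df[c].nunique() > max_unique]
    return too_missing, too_many_unique

# hypothetical toy table: an ID-like column, a mostly empty column and two ordinary ones
toy = pd.DataFrame({
    'id': [f'c{i}' for i in range(100)],
    'plan': ['a', 'b'] * 50,
    'reason': ['price'] * 10 + [None] * 90,
    'age': list(range(100)),
})

too_missing, too_many_unique = screen_columns(toy)
print('too many missing values:', too_missing)
print('too many unique values:', too_many_unique)
```

Thresholds like these are a judgment call; in the cell below the columns are simply chosen by hand.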

In [None]:
# drop irrelevant features
garbage_cols = ['Gender','Customer ID','Latitude', 'Longitude', 'Zip Code', 'City',  'Churn Category', 'Churn Reason']
churn_df_clean = churn_df.drop(garbage_cols, axis=1).copy()


In [None]:
# we don't care about customers that just Joined, so let's drop them
churn_df_clean = churn_df_clean[churn_df_clean['Customer Status'] != 'Joined'].copy()

In [None]:
# map the values of the feature 'Customer Status' to numerical values and call it Churned
churn_df_clean['Churned'] = churn_df_clean['Customer Status'].map({'Stayed': 0, 'Churned':1})


# 3. Weight of Evidence (WoE) and Information Value (IV)

Weight of Evidence (WoE) is a statistical technique commonly used in credit scoring and other applications to assess the predictive power of independent variables in a logistic regression model. It involves transforming categorical variables into numeric values that can be used as inputs for predictive models. Here's how you can use WoE:

Understand the purpose: WoE is used to measure the relationship between a categorical variable and the likelihood of an event occurring (e.g., defaulting on a loan or in our case a customer churn). It calculates the relative "evidence" provided by each category in predicting the event.

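1. *Calculate the WoE*: To compute the WoE for each category of a categorical variable, follow these steps:

    a. For each category, calculate the event proportion (the share of all events, e.g., churned customers, that fall in that category) and the non-event proportion (the share of all non-events, e.g., customers who stayed, that fall in that category).

    b. Calculate the ratio of event proportion to non-event proportion for each category.

    c. Take the natural logarithm of the ratio obtained in step b.

    d. Optionally, multiply the result from step c by 100 to make the values easier to read. We skip this scaling in this lecture, because the IV thresholds listed later in this section assume unscaled WoE values.

    The formula for WoE is: WoE = ln(Event Proportion / Non-event Proportion)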

2. *Replace categorical values with WoE*: Once you have calculated the WoE for each category, you can replace the original categorical values in your dataset with their corresponding WoE values. This transformation ensures that the categorical variable retains its predictive power while being expressed numerically.

3. *Handle missing values*: If you have missing values in your categorical variable, you can assign a separate category or use a special WoE value to represent those missing values.

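4. *Interpretation*: After converting categorical values to WoE, you can interpret the magnitude and direction of the WoE values. Positive values indicate that a category is associated with a higher likelihood of the event occurring, while negative values indicate a lower likelihood. A value of zero means that the event and non-event proportions are equal. Note that the sign depends on the orientation of the ratio: the function used later in this section computes ln(Non-event Proportion / Event Proportion), the convention common in credit scoring, so its signs are flipped relative to the formula above. The IV is identical under both conventions.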
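
To make the steps above concrete, here is a minimal sketch on a made-up table of ten customers. The column names `contract` and `churned` and all counts are hypothetical. It uses the event / non-event orientation of the formula above and applies no smoothing, so it would break if a category had zero events or zero non-events.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: 10 customers, 4 churned (events) and 6 stayed (non-events)
toy = pd.DataFrame({
    'contract': ['monthly'] * 5 + ['yearly'] * 5,
    'churned':  [1, 1, 1, 0, 0,   1, 0, 0, 0, 0],
})

# events and non-events per category
events = toy.groupby('contract')['churned'].sum()
non_events = toy.groupby('contract')['churned'].count() - events

# share of all events / all non-events that fall in each category
event_share = events / events.sum()
non_event_share = non_events / non_events.sum()

# WoE per category
woe = np.log(event_share / non_event_share)
print(woe)

# replace the categories with their WoE values
toy['contract_woe'] = toy['contract'].map(woe)
```

Here `monthly` holds 3 of the 4 churned customers but only 2 of the 6 who stayed, so its WoE is ln(0.75 / 0.333) = ln(2.25), a positive value; `yearly` comes out negative.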

Information Value (IV) is a measure of the predictive power of an independent variable in a logistic regression model. It is calculated by summing the differences between the proportions of events and non-events for each category of the variable, multiplied by the WoE for that category. The higher the IV, the more predictive power the variable has. Here's how you can use IV:

1. *Calculate the IV*: To compute the IV of a categorical variable, follow these steps:

    a. For each category, calculate the proportion of events (e.g., customer churn) and non-events (e.g., stayed).

    b. Calculate the difference between the event proportion and non-event proportion for each category.

    c. Multiply the result from step b by the WoE for that category.

    d. Sum the results from step c to obtain the IV for the variable.

    The formula for IV is: IV = sum((Event Proportion - Non-event Proportion) * WoE)

2. *Interpretation*: After calculating the IV for each variable, you can interpret the predictive power of each variable. According to the literature, the following guidelines can be used to interpret the IV:

    * < 0.02: Useless for prediction
    * 0.02 to 0.1: Weak predictor
    * 0.1 to 0.3: Medium predictor
    * 0.3 to 0.5: Strong predictor
    * > 0.5: Suspicious predictor
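
The IV formula and the guideline bands above can be sketched as two small helpers. The names `information_value` and `iv_label` are hypothetical, the input shares are made-up numbers, and how the exact boundary values (0.02, 0.1, 0.3, 0.5) are assigned is an assumption, since the guideline does not specify it.

```python
import numpy as np

def information_value(event_share, non_event_share):
    """IV = sum((event share - non-event share) * WoE) over all categories."""
    event_share = np.asarray(event_share, dtype=float)
    non_event_share = np.asarray(non_event_share, dtype=float)
    woe = np.log(event_share / non_event_share)
    return float(np.sum((event_share - non_event_share) * woe))

def iv_label(iv):
    """Map an IV to the guideline bands (boundary handling is an assumption)."""
    if iv < 0.02:
        return 'useless'
    if iv < 0.1:
        return 'weak'
    if iv < 0.3:
        return 'medium'
    if iv <= 0.5:
        return 'strong'
    return 'suspicious'

# hypothetical two-category feature
event_share = [0.75, 0.25]        # share of all churned customers per category
non_event_share = [1 / 3, 2 / 3]  # share of all retained customers per category

iv = information_value(event_share, non_event_share)
print(iv, iv_label(iv))

# swapping events and non-events flips the sign of every WoE but leaves the IV unchanged
iv_swapped = information_value(non_event_share, event_share)
```

Every term in the sum is non-negative, because the difference of the shares and the logarithm of their ratio always have the same sign. That is why the IV does not depend on which class is called the event.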




In [None]:
# create a function that calculates the weight of evidence of each category in a feature
# and the information value of each feature
def calc_weight_of_evidence(df, target):
    # work on a copy so the caller's dataframe is not modified
    df = df.copy()
    # create good and bad indicator columns from the target (0 = good, 1 = bad)
    df['Good'] = np.where(df[target] == 0, 1, 0)
    df['Bad'] = np.where(df[target] == 1, 1, 0)
    total_good = df['Good'].sum()
    total_bad = df['Bad'].sum()
    iv = {}
    # after cleaning get the list of categorical features
    categorical_features = [col for col in df.columns if df[col].dtype == 'object']
    for feature in categorical_features:
        # ignore the target feature
        if feature == target:
            continue
        # treat missing values as their own category (groupby would otherwise drop them)
        df[feature] = df[feature].fillna('Missing')
        # group by each category in the feature and count the good and bad customers
        grouped = df.groupby(feature).agg({'Good': 'sum', 'Bad': 'sum'})
        # share of all good / bad customers per category; the 0.5 terms smooth the counts to avoid division by zero and log(0)
        grouped['DistributionGood'] = (grouped['Good'] + 0.5) / (total_good + 0.5 * len(grouped))
        grouped['DistributionBad'] = (grouped['Bad'] + 0.5) / (total_bad + 0.5 * len(grouped))
        # calculate the WoE (good over bad, so a positive WoE means fewer churners than average)
        grouped['WoE'] = np.log(grouped['DistributionGood'] / grouped['DistributionBad'])
        # make a woe dictionary to map each category to its corresponding WoE value
        woe_dict = grouped['WoE'].to_dict()
        df[feature] = df[feature].map(woe_dict)
        # calculate the information value of the feature and add it to the iv dictionary
        information_value = ((grouped['DistributionGood'] - grouped['DistributionBad']) * grouped['WoE']).sum()
        iv[feature] = information_value
    df = df.drop(['Good', 'Bad'], axis=1)
    return df, iv

In [None]:
# calculate the weight of evidence and IV for the churn df
churn_df_woe, churn_iv = calc_weight_of_evidence(churn_df_clean, 'Churned')

In [None]:
churn_df_woe.head(10)

In [None]:
# print the information value of each feature
for feature in churn_iv:
    print(feature, churn_iv[feature])

In [None]:
# drop features with IV <= 0.02 (useless) or IV >= 0.5 (suspicious)
bad_features = []
for feature in churn_iv:
    if churn_iv[feature] <= 0.02 or churn_iv[feature] >= 0.5:
        bad_features.append(feature)

churn_df_woe = churn_df_woe.drop(bad_features, axis=1)

In [None]:
churn_df_woe.head(10)

# 4. Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that can be used to reduce the number of features in a dataset while retaining most of the information.
It does this by creating new features that are linear combinations of the original features, and then keeping only the few new features that capture the most variation.
The new features are known as principal components, and they are orthogonal (i.e., at right angles) to each other.
The first principal component captures the largest amount of variation in the data, and each subsequent component captures the largest amount of remaining variation that is orthogonal to the previous components.
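
To illustrate the properties described above, the following sketch computes principal components with plain NumPy (a singular value decomposition of the standardized data) instead of scikit-learn, so that the orthogonality and the ordering by explained variance can be checked directly. The synthetic data and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 200 samples of 3 strongly correlated features plus a little noise
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

# standardize each feature to zero mean and unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized data: the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)

# fraction of the total variance captured by each component, largest first
explained_variance_ratio = S**2 / np.sum(S**2)
print(explained_variance_ratio)

# the principal directions are orthonormal, so this is (numerically) the identity matrix
print(np.round(Vt @ Vt.T, 6))

# keep only the first two components: project the data onto them
scores = Xs @ Vt[:2].T
print(scores.shape)
```

Because the three features are almost copies of one another, nearly all of the variance ends up in the first component, which is the situation in which dropping the remaining components loses little information.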

In [None]:
# use PCA for churn df
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# separate the features from the target
churn_features = churn_df_woe.drop('Churned', axis=1)

# PCA cannot handle missing values; filling them with 0 is an assumption
# (a missing charge or download volume is treated as "no usage", a missing WoE as neutral)
churn_features = churn_features.fillna(0)

# standardize the data
scaler = StandardScaler()
churn_scaled = scaler.fit_transform(churn_features)

# create a PCA object, fit it and transform the data in one step
pca = PCA(n_components=2)
churn_pcs = pca.fit_transform(churn_scaled)

# create a dataframe with the PCA data
churn_df_pca = pd.DataFrame(churn_pcs, columns=['PC1', 'PC2'])

# add the target feature by position; the new dataframe has a fresh index,
# so assigning the original Series directly would misalign the rows
churn_df_pca['Churned'] = churn_df_woe['Churned'].to_numpy()



# 5. Logistic Regression

Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable.
In our case, the dependent variable is whether a customer has churned or not.
Logistic regression is a popular technique for modeling customer churn because it is easy to interpret and implement, and it performs well on simple datasets.
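
The logistic function mentioned above maps any real number to a value between 0 and 1, which is read as the probability of churn. A minimal sketch, with a made-up intercept and made-up coefficients for two input features (these are not fitted values):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical model parameters and one hypothetical customer
intercept = -1.0
coefs = np.array([0.8, -0.3])
x = np.array([2.0, 1.0])

# linear score, then probability of the positive class (churn)
z = intercept + coefs @ x
p = sigmoid(z)

# classify as churn when the probability is at least 0.5
pred = int(p >= 0.5)
print(z, p, pred)
```

The coefficients are what makes the model easy to interpret: each one is the change in the log-odds of churn for a one-unit increase in the corresponding feature.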

In [None]:
# use logistic regression for churn_df_pca

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# create a copy of the churn df
churn_df_lr = churn_df_pca.copy()

# split the data into train and test sets (the target column is 'Churned')
X_train, X_test, y_train, y_test = train_test_split(churn_df_lr.drop('Churned', axis=1), churn_df_lr['Churned'], test_size=0.2, random_state=42)

# create a logistic regression object
lr = LogisticRegression()

# fit the model
lr.fit(X_train, y_train)

# make predictions
y_pred = lr.predict(X_test)





## 5.2 Model Evaluation

**Accuracy** is the proportion of correct predictions out of all predictions made. It is a good measure when the classes are balanced, but it can be misleading when there is a large class imbalance.

**Precision** is the proportion of correct positive predictions out of all positive predictions made. It is a good measure when the cost of false positives is high.

**Recall** is the proportion of correct positive predictions out of all actual positive instances. It is a good measure when the cost of false negatives is high.

**F1 score** is the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of precision and recall. It is a good measure when you want to balance precision and recall, and when there is an uneven class distribution.

**Confusion matrix** is a table that shows the number of correct and incorrect predictions made by a model. It is a good way to evaluate the performance of a model.
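
The definitions above can be checked by hand on a tiny example. The labels and predictions below are made up; the sketch counts the four cells of the confusion matrix and derives each metric from them.

```python
import numpy as np

# hypothetical true labels and predictions (1 = churned, 0 = stayed)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # true positives
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # false negatives
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # false positives
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# rows are the true classes, columns the predicted classes (0 first, then 1)
cm = np.array([[tn, fp],
               [fn, tp]])
print(cm)
print(accuracy, precision, recall, f1)
```

The matrix layout matches what `confusion_matrix` from scikit-learn returns for binary 0/1 labels.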

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

def eval_model(y_test, y_pred):
    print(f'Accuracy: {accuracy_score(y_test, y_pred)}')
    print(f'Precision: {precision_score(y_test, y_pred)}')
    print(f'Recall: {recall_score(y_test, y_pred)}')
    print(f'F1 Score: {f1_score(y_test, y_pred)}')

In [None]:
eval_model(y_test, y_pred)

In [None]:
confusion_matrix(y_test, y_pred)

# 6. Neural Networks

Neural Networks (NNs) are a class of machine learning algorithms that draw inspiration from the structure and function of the human brain. They are widely used for solving complex problems in classification and regression tasks.

NNs are composed of interconnected layers of artificial neurons. Each neuron receives inputs from the neurons in the previous layer and applies a set of weights to those inputs, along with a bias term. These weighted inputs are then transformed using an activation function to produce an output.
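
What a single layer of neurons computes can be written in a few lines of NumPy. This is only a sketch with made-up weights, biases and inputs; the helper names `relu` and `dense_forward` are hypothetical.

```python
import numpy as np

def relu(z):
    """ReLU activation: negative values become 0."""
    return np.maximum(0.0, z)

def dense_forward(x, W, b, activation):
    """Weighted sum of the inputs plus a bias, followed by the activation function."""
    return activation(x @ W + b)

# one sample with 2 inputs, feeding a layer of 3 neurons
x = np.array([[1.0, 2.0]])
W = np.array([[0.5, -1.0, 0.0],
              [0.25, 0.5, 1.0]])  # shape (inputs, neurons)
b = np.array([0.0, -0.5, 1.0])    # one bias per neuron

out = dense_forward(x, W, b, relu)
print(out)  # [[1. 0. 3.]]
```

The second neuron receives a weighted sum of -0.5, which ReLU clips to 0. Stacking several such layers, each feeding the next, gives the kind of network built later in this section.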

One of the key strengths of NNs lies in their ability to learn complex relationships within the data. This is achieved through a process called training, where the NN adjusts its weights and biases based on a training dataset. The objective is to minimize a predefined loss function that measures the disparity between the predicted outputs and the true labels in the training data.

During the training process, NNs use optimization algorithms like gradient descent to update the weights and biases iteratively. The backpropagation algorithm plays a crucial role in this process, as it efficiently calculates the gradients of the loss function with respect to the weights and biases, enabling the network to make adjustments that reduce the error.
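
The training loop described above can be sketched for the smallest possible network, a single sigmoid neuron, where the gradients can be written down by hand instead of being obtained through backpropagation. The synthetic data, the learning rate and the number of steps are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 200 points with 2 features, label 1 when the features sum to a positive number
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w = np.zeros(2)  # weights
b = 0.0          # bias
learning_rate = 0.1
losses = []

for step in range(200):
    # forward pass: predicted probabilities and loss
    p = sigmoid(X @ w + b)
    losses.append(binary_cross_entropy(y, p))
    # gradient of the mean loss with respect to the pre-activation is (p - y) / n
    grad_z = (p - y) / len(y)
    grad_w = X.T @ grad_z
    grad_b = grad_z.sum()
    # gradient descent update: move against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == (y == 1))
print(losses[0], losses[-1], accuracy)
```

With all parameters at zero the neuron predicts 0.5 everywhere, so the first loss is ln(2); each update then lowers it. In a multi-layer network the same update rule is used, and backpropagation supplies the gradients for the hidden layers.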

Choosing an appropriate loss function is critical and depends on the problem at hand. For regression tasks, common loss functions include mean squared error (MSE) and mean absolute error (MAE), while for classification tasks, cross-entropy loss is often used.
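
For reference, the three loss functions named above are short enough to write out in NumPy. The function names and the example numbers are made up for illustration; `binary_cross_entropy` is the two-class form of cross-entropy, which is what the model later in this section uses.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mean_absolute_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# regression example: errors of 1 and 2
y_reg = np.array([3.0, 5.0])
y_reg_hat = np.array([2.0, 7.0])
mse = mean_squared_error(y_reg, y_reg_hat)   # (1 + 4) / 2 = 2.5
mae = mean_absolute_error(y_reg, y_reg_hat)  # (1 + 2) / 2 = 1.5

# classification example: confident correct, undecided and confident wrong predictions
y_cls = np.array([1.0, 1.0, 1.0])
bce_good = binary_cross_entropy(y_cls[:1], np.array([0.9]))
bce_undecided = binary_cross_entropy(y_cls[:1], np.array([0.5]))
bce_bad = binary_cross_entropy(y_cls[:1], np.array([0.1]))
print(mse, mae, bce_good, bce_undecided, bce_bad)
```

MSE penalizes the larger error more heavily than MAE does, and cross-entropy grows quickly as the predicted probability of the true class approaches zero.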

For this lecture we will use TensorFlow and Keras to build a neural network model. TensorFlow is an open-source machine learning library developed by Google, and Keras is a high-level API that runs on top of TensorFlow. Keras provides a simple and intuitive interface for building neural networks, while TensorFlow provides the backend for executing the computations required by the network.

In [None]:
# run this cell if you haven't installed tensorflow yet
!pip install tensorflow

In [None]:
# Using tensorflow and keras to build a neural network


In [None]:
# Build a tensorflow model

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam

# create a copy of the churn df
churn_df_nn = churn_df_pca.copy()

# split the data into train and test sets (the target column is 'Churned')
X_train, X_test, y_train, y_test = train_test_split(churn_df_nn.drop('Churned', axis=1), churn_df_nn['Churned'], test_size=0.2, random_state=42)

# create a tensorflow model
model = Sequential([
    Dense(32, activation='relu', input_shape=(2,)),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

# compile the model
model.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])

# fit the model
model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)

# evaluate the model
model.evaluate(X_test, y_test, verbose=0)



In [None]:
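# use the custom eval_model function to evaluate the model
# predict_classes was removed from recent TensorFlow versions, so threshold the predicted probabilities at 0.5 instead
eval_model(y_test, (model.predict(X_test) > 0.5).astype(int).reshape(-1))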