# Module 02 Practicum: Let's Explore Some Telco Churn Data

## Learning Objectives:
- Practicum: Examine the following notebook and identify opportunities to apply `patch_sklearn()` to algorithms applied to tabular data.
- Apply code changes to try different classifiers optimized with Intel Extension for Scikit-learn*
- Target the Decision Tree for replacement by two classifiers of your choice
- Apply patching to PCA
- Apply patching to KMeans
- **Synthesize** your learning by applying patching to the DBSCAN algorithm

The data can be found here:

Kaggle: [Telco Customer Churn](https://www.kaggle.com/datasets/pangkw/telco-churn)

License
CC0: Public Domain

Download the data from Kaggle - look for the following download icon:

![Download from Kaggle](Assets/ChurnDownload.png)

This will download the file 'Telco Churn dataset.xlsx' to your local machine

Upload this file to DevCloud - simply drag the downloaded file to the data folder for this project on DevCloud

![Upload to DevCloud data folder](Assets/DragHere.jpg)

### Dependencies:
- pip install seaborn
- pip install openpyxl

# Practicum:

Work through each cell, looking for places to patch or unpatch as needed to maximize the performance of each cell.
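
As a reminder of the mechanics, here is a minimal sketch of patching and unpatching. It assumes the `scikit-learn-intelex` package is installed and falls back to stock scikit-learn when it is not. The array `demo` is made-up data for illustration and is not part of the churn dataset.

```python
import numpy as np

try:
    from sklearnex import patch_sklearn, unpatch_sklearn
    have_sklearnex = True
except ImportError:           # assumption: fall back to stock scikit-learn
    have_sklearnex = False

if have_sklearnex:
    patch_sklearn()           # call BEFORE importing the estimators you want accelerated

from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 4))          # made-up data: 200 rows x 4 columns
demo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(demo)
print(KMeans.__module__)                  # the module name hints at which implementation is in use
print(np.bincount(demo_labels))           # number of rows assigned to each cluster

if have_sklearnex:
    unpatch_sklearn()                     # restore stock scikit-learn ...
    from sklearn.cluster import KMeans    # ... and re-import the estimator afterwards
```

The order matters: the patch only affects estimators imported after `patch_sklearn()` runs, so a cell that should use the stock version needs an unpatch followed by a fresh import.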

# What’s in the Telco Customer Churn data set?
This data set provides info to help you predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.

A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.

The data set includes information about:
* Customers who left within the last month – the column is called Churn
* Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
* Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
* Demographic info about customers – gender, age range, and if they have partners and dependents


# Choose interactive or inline plotting

In [1]:
#%matplotlib notebook
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np

churn = pd.read_excel('Telco Churn dataset.xlsx')

churn.info()


In [None]:
churn.shape

# Handy Utilities

In [None]:
from itertools import compress
class MaskableList(list):
    def __getitem__(self, index):
        try: return super(MaskableList, self).__getitem__(index)
        except TypeError: return MaskableList(compress(self, index))
        
def DescribeItemRange(df):  # Show the number of unique values in each column
    cols = df.columns.tolist()
    vals = pd.DataFrame ( [ len(set(df[s])) for s in df.columns.tolist()] ).T
    vals.columns = cols
    return vals

def ReplaceNans(df):  # return a copy of df with NaNs in all locations replaced by zeros
    return df.fillna(0)

def SelectZeroColumns(df):  # return a list with the column names of zero-filled columns
    zero_mask = (df == 0).all().tolist()
    # MaskableList accepts a boolean mask, so this keeps only the all-zero columns
    return list(MaskableList(StringifyColumnNames(df))[zero_mask])

def SelectNonZeroColumns(df):  #return a list with columns names of nonzero columns
    return list(set(StringifyColumnNames(df))- set(SelectZeroColumns(df) ))

def StringifyColumnNames(df):   # handle unicode strings and convert to normal strings
    cols = df.columns.tolist()
    cols = [str(r) for r in cols]
    return cols

def Factorize(df):
    return  df.apply(lambda x: pd.factorize(x)[0])

def MyOneHotEncode(pdold, StaticCols, EncodeCols):
    """MyOneHotEncode performs OneHotEncoding on specified columns in our dataframe.

    pdold is the old dataframe we are passing in.
    StaticCols is a list of columns we do not wish to encode.
    EncodeCols is a list of columns we wish to encode.
    """
    # make sure that our static list is not in our encode list
    cols_OneHotEncode = sorted(list(set(EncodeCols) - set(StaticCols)))
    # create a new dataframe which represents the static columns
    pdnew = pdold[StaticCols]
    # loop through the encode list and OneHotEncode (using get_dummies) each column in the encode list
    # note - this will typically map a single column to several columns - one for each value contained in the original column
    for col in cols_OneHotEncode:
        new_names = []
        enc = pd.get_dummies(pdold[col])
        for subcol in  enc:
            name = col + str(subcol)
            new_names.append(name)
        enc.columns = new_names
        # concat the new OneHotEncoded dataframe to the original
        pdnew = pd.concat([pdnew, enc], axis=1)
    return pdnew


def my_dbscan(df, eps_):
    """Fit DBSCAN on df, store the cluster labels in df['dbsPCA'] and return them."""
    # DBSCAN is imported in a later cell, before this function is first called
    db = DBSCAN(eps=eps_, min_samples=100).fit(df)
    labels = db.labels_   # one label per row; -1 marks noise (outliers)
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    tdft = 'dbsPCA'
    df[tdft] = labels
    print('eps:{} Estimated number of clusters: {}'.format(eps_, n_clusters_))
    return df, labels, n_clusters_


# Ensure there are no NaNs or blank data

In [None]:
churn.isna().sum()

# Describe the Range of Values in each Column

In [None]:
DescribeItemRange(churn)

## Explore Churn Data
* the number of rows, 
* the number of columns,
* any descriptive statistics

## Separate Categorical from Numeric Data for Further Processing
* numeric data will, for now, be kept in a separate dataframe
* categorical data will be in its own dataframe, to be transformed into numeric values in a kind of pipeline

In [None]:
# here are the columns which are categorical
churn_cat_col = [key for key in dict(churn.dtypes) if dict(churn.dtypes)[key] in ['object']]
# here are the columns which are numeric
churn_num_col = [key for key in dict(churn.dtypes) if dict(churn.dtypes)[key] in ['float64', 'int64']]
#create a dataframe to hold categorical data
churn_cat = churn[churn_cat_col]
print("Categorical columns")
display(churn_cat.head())
#create a dataframe to hold numeric data
churn_num = churn[churn_num_col]
# display first few rows of numerical data to get a feel for them
print("Numerical columns")
display(churn_num.head())


# Convert Categories to Numbers

We apply the factorize function to all the categories as a first pass. We create a new dataframe here called **churn_cat_encode**.

The commented-out code would allow us to map back to the original form if needed.

PaymentMethod, for example, will now have numeric values like 0, 1, 2, 3, ...
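
To make the first pass concrete, here is a small sketch of what `pd.factorize` does to a single categorical column and how the codes can be mapped back. The four sample values are made up for illustration; the real column may hold different categories.

```python
import pandas as pd

# made-up sample of a categorical column
payment = pd.Series(['Mailed check', 'Credit card', 'Mailed check', 'Bank transfer'],
                    name='PaymentMethod')

# codes are assigned in order of first appearance; uniques holds the original labels
codes, uniques = pd.factorize(payment)
print(codes.tolist())     # [0, 1, 0, 2]
print(list(uniques))      # ['Mailed check', 'Credit card', 'Bank transfer']

# map the integer codes back to the original labels
restored = pd.Series(uniques.take(codes), name='PaymentMethod')
print(restored.tolist())
```

Keeping `uniques` around is what makes the encoding reversible.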


In [None]:
churn_cat.columns

In [None]:
#  Factorize the categorical churn columns
# churn_cat_encode = Factorize(churn_cat)
# churn_cat_encode.head()

# Encode the data (OneHotEncode)
The problem with the data above is that, while the data are numeric, the encoding gives arbitrarily low weights to categories encoded with 0 and higher weights to others.

This will probably skew our analysis. For example, PaymentMethod has values 0, 1, 2.  Do we really mean to imply that PaymentMethod=0 has no weight, but PaymentMethod=2 has twice as much weight as PaymentMethod=1?

We use OneHotEncode here to make sure each value that was used is encoded with the value 1, while the value will be 0 if that feature was not used.

**Note:** We don't want to encode categories with many unique values, such as customerID.

We define a function to handle the encoding for us.

Now PaymentMethod will be split into brand new columns (one for each kind of numeric value) and each cell will have either 0 or 1, kind of like a check box, indicating the feature was present or not.

We also concatenate the previous numeric columns back in to have them available for analysis after standardization.
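
The check-box idea can be sketched on a tiny made-up frame with `pd.get_dummies`, which is the same pandas call that `MyOneHotEncode` wraps. The values and the `PaymentMethod_` prefix are illustrative assumptions; the helper in this notebook joins the column name and value without a separator.

```python
import pandas as pd

# made-up rows standing in for the churn data
toy = pd.DataFrame({
    'PaymentMethod': ['Mailed check', 'Credit card', 'Mailed check', 'Bank transfer'],
    'MonthlyCharges': [29.85, 56.95, 53.85, 42.30],
})

# one 0/1 column per distinct PaymentMethod value
encoded = pd.get_dummies(toy['PaymentMethod'], prefix='PaymentMethod', dtype=int)

# put the numeric column back next to the encoded ones
toy_numeric = pd.concat([encoded, toy[['MonthlyCharges']]], axis=1)
print(toy_numeric)
```

Each row has exactly one 1 among the three new columns, so no payment method is weighted more heavily than another.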

In [None]:
# list of column names I wish to OneHotEncode
cols_OneHotEncode = [ 'gender', 'MaritalStatus', 'Dependents', 'PhoneService',
       'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
       'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
       'Contract', 'PaperlessBilling', 'PaymentMethod', 'InternationalPlan',
       'VoiceMailPlan']

#['DeviceProtection','TechSupport','OnlineBackup','MultipleLines','gender','StreamingTV','Contract','StreamingMovies','PhoneService','PaperlessBilling','OnlineSecurity','Partner','Dependents','InternetService','PaymentMethod']
# label is the column I ultimately want to predict aka Churn
label = churn_cat['Churn']
churn_encoded_numeric = MyOneHotEncode (churn_cat, [], cols_OneHotEncode)
churn_numeric = pd.concat([churn_encoded_numeric, churn_num], axis=1)  # add numeric to main df
churn_numeric.head()

# Standardize the Data

In [None]:
from sklearn import preprocessing
# scale our data to zero-center it and give each column unit variance
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(churn_numeric)   # same as (df-df.mean())/df.std(ddof=0)

# create new data frame with all values scaled
churn_scaled = pd.DataFrame(X_scaled)
churn_scaled.columns = churn_numeric.columns
churn_scaled.head()

# Correlation Plot
The plot shows which factors are correlated with which others and to what degree.

Here, red means highly correlated (correlation with Churn is bad, therefore red), blue means negatively correlated, and white means no correlation.
**Notice:** Customers with more contracts, more dependents, and more tenure are less likely to churn in this example.
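
A correlation plot with this color scheme could be produced along the following lines. This is a sketch on made-up data, not the notebook's original plotting cell: the three columns and the rule that generates `Churn` are assumptions chosen so that tenure and churn come out negatively correlated. In the notebook the same calls would be applied to the scaled dataframe (with the 0/1 churn column added) instead of `toy`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
tenure = rng.uniform(0, 72, size=500)
toy = pd.DataFrame({
    'tenure': tenure,
    'MonthlyCharges': rng.uniform(20, 120, size=500),
    # made-up rule: short-tenure customers churn more often
    'Churn': (tenure + rng.normal(0, 10, size=500) < 24).astype(int),
})

corr = toy.corr()   # Pearson correlation between every pair of columns

fig, ax = plt.subplots(figsize=(6, 5))
# 'bwr' maps -1 to blue, 0 to white and +1 to red
im = ax.imshow(corr.to_numpy(), cmap='bwr', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
ax.set_title('Correlation matrix (toy data)')
plt.show()
```

Since seaborn is listed as a dependency, `seaborn.heatmap(corr, cmap='bwr', vmin=-1, vmax=1)` would be an equivalent one-line alternative.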

In [None]:
churn_numeric.shape

Perfectly anti-correlated columns, such as the two gender columns, contain redundant information. Other yes/no columns were likewise encoded into two redundant columns by one-hot encoding.
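
The redundancy is easy to see on a made-up two-valued column. The sketch below also shows `drop_first=True`, a `pd.get_dummies` option that keeps only one of the two columns; the notebook itself does not use it, so treat it as an optional variation.

```python
import pandas as pd

# made-up sample of a yes/no style column
gender = pd.Series(['Female', 'Male', 'Male', 'Female', 'Male'], name='gender')

both = pd.get_dummies(gender, dtype=int)           # columns: Female, Male
anti = both['Female'].corr(both['Male'])           # Male is always 1 - Female
print(anti)

single = pd.get_dummies(gender, drop_first=True, dtype=int)   # drops the first category
print(single.columns.tolist())
```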

<h1><center>Principal Component Analysis To Analyze Behavior</center></h1>

![image.png](attachment:image.png)

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.
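
The definition can be checked numerically. The sketch below builds made-up data with two correlated columns, applies the orthogonal transformation via a singular value decomposition in NumPy, and confirms that the resulting components are uncorrelated. It illustrates the idea only; the notebook itself uses scikit-learn's `PCA` class.

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up data: the second column is strongly correlated with the first
x = rng.normal(size=300)
data = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=300), rng.normal(size=300)])

centered = data - data.mean(axis=0)
# the rows of vt are the orthogonal principal directions
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt.T                     # the principal components of each observation

explained_ratio = singular_values**2 / np.sum(singular_values**2)
cov_scores = np.cov(scores, rowvar=False)    # off-diagonal entries are ~0: components are uncorrelated
print(np.round(explained_ratio, 3))
print(np.round(cov_scores, 6))
```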




In [None]:

from sklearn.decomposition import PCA
# standardize the one-hot encoded columns before PCA
X_scaled = preprocessing.scale(churn_encoded_numeric, with_mean=True, with_std=True )   # same as (df-df.mean())/df.std(ddof=0)
# start with n_components principal components and see how much variance that gives me
n_components = 20
# create instance of PCA (constructor)
pca = PCA(n_components=n_components)
#pca = PCA(copy=True, iterated_power='auto', n_components=n_components, random_state=None, svd_solver='auto', tol=0.0, whiten=False) 
pca.fit(X_scaled)
X_scaled = pca.transform(X_scaled)  #get the actual principal components as vectors
print(pca.explained_variance_ratio_) 

# Copy the PCA Data to a DataFrame and Print the Top Few Rows

In [None]:
Col_headings = ['P'+str(x) for x in range(n_components)]
pcomps = pd.DataFrame(X_scaled) 
pcomps.columns = Col_headings
# concatenate the Churn column on the end so it's easy to plot by Churn
if 'Churn' not in pcomps.columns:
    pcomps = pd.concat([pcomps, label], axis=1)
pcomps.head()


# PCA Scree Plot

### Overview

A scree plot is a simple line plot that shows the fraction of total variance in the data explained by each principal component (PC). The PCs are ordered, and therefore numbered, by decreasing contribution to total variance, so the PC with the largest contribution comes first. Read left to right along the abscissa, such a plot often shows a clear separation where the 'most important' components cease and the 'least important' components begin. This point of separation is often called the 'elbow'. (In the PCA literature, the plot is called a 'scree' plot because it often looks like a scree slope, where rocks have fallen down and accumulated on the side of a mountain.) The cell below plots the cumulative fraction of variance, with dashed reference lines that make it easy to read off how many components are needed.
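
Reading the component count off the plot can also be done numerically. The following sketch uses hypothetical explained-variance ratios (not values from the churn data) and a hypothetical helper `components_needed`; in the notebook, `pca.explained_variance_ratio_` would take the place of `ratios`.

```python
import numpy as np

# hypothetical explained-variance ratios, already sorted in decreasing order
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
cum_var = np.cumsum(ratios)

def components_needed(cum_var, threshold):
    """Smallest number of leading components whose cumulative variance reaches threshold.

    Assumes the threshold is actually reached by the components provided.
    """
    return int(np.argmax(cum_var >= threshold)) + 1

for threshold in (0.67, 0.75, 0.95):
    print(threshold, components_needed(cum_var, threshold))
```

With these made-up ratios, three components are enough for the 67% and 75% thresholds and five are needed for 95%.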

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# pc_index is my x axis variable: 1..n_components
pc_index = list(range(1, n_components + 1))
# plot the cumulative variance against the index of PCA
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.plot(pc_index, cum_var)
# plot the 95%, 75% and 67% thresholds, so we can read off the count of principal components that matter
plt.plot(pc_index, [.95]*n_components, '--')
plt.plot(pc_index, [.75]*n_components, '--')
plt.plot(pc_index, [.67]*n_components, '--')
# turn on grid to make graph reading easier
plt.grid(True)
plt.rcParams.update({'font.size': 22})
plt.suptitle('PC Variance Explained (Scree Plot)')
# control number of tick marks
plt.xticks(pc_index)
plt.show()

# Plot Principal Components: Scatter Matrix

Control the **colors** by the **churn variable** or by the cluster they are in.

Get a feel for clustering behavior.

* **black = no churn**

* **red = churn**

In [None]:
pcomps

In [None]:
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
%matplotlib inline

# Figure 1
# color the plot by the Churn variable
label_num = MyOneHotEncode (churn_cat, [], ['Churn'])
pcomps['Churn'] = label_num['ChurnYes'].astype(int)  # 0/1 so it can index churn_colors
# attributes are the index of Principal components to plot against each other
attributes = ['P{}'.format(i) for i in range(4)]
#attributes = [item + 10 for item in attributes]   # method to add if you want to look at the least significant components
churn_colors=['black', 'red']
# make a scatter plot
scatter_matrix(pcomps[attributes], alpha =.3, marker='o',c=pcomps.Churn.apply(lambda x:churn_colors[x]), figsize=[16,16])
#scatter_matrix(pcomps[attributes], alpha =.3, marker='x',c=churn_cat_encode_new2.InternetService0.astype('int').apply(lambda x:churn_colors[x]), figsize=[16,16])
plt.rcParams.update({'font.size': 22})
plt.suptitle("Figure 1: Scatter Matrix of PC's")
plt.show()


Are the dots completely random?

Is there some patterning visible?

Are the Churn people (red) concentrated in certain places or scattered randomly?

## Method to Drill Down to Specific Portion of the Data
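We are keying off the graph above and noticing that one side of P0 = 0 covers a major portion of the clusters. The cell below selects the rows with P0 > 0; because the sign of a principal component is arbitrary, check the plot from your own run and flip the comparison if needed.

So what makes this part of the data tick?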

In [None]:
churn_encoded_numeric[pcomps['P0']>0].describe()


# Find KMeans Clusters in the PCA Data

Let's use an algorithm to group the data into the clusters it finds. We will label our data with the clusters we find so we can sort and filter as needed.

Here we use k-means as a quick and dirty clustering method to get a better feel for customer behavior.

Ideally I would like to identify a cluster for each natural cluster of customer behavior.

Armed with this I may decide to segment out interesting populations for prediction, or I may decide that one big prediction against all the data is best.

We will augment the rows of the dataset with the cluster information so that it can be used in our models.

In [None]:
from sklearn.cluster import KMeans
import numpy as np

def my_kmeans(pcomps, n_clusters):
    # cluster on the first four principal components
    cols = ['P{}'.format(i) for i in range(4)]
    x = pcomps[cols].to_numpy()
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(x)
    #kmeans.cluster_centers_
    # store the labels in a new column such as 'kmeans4'
    txt = 'kmeans' + str(n_clusters)
    pcomps[txt] = kmeans.labels_
    return pcomps

pcomps = my_kmeans(pcomps,2)
pcomps = my_kmeans(pcomps,3)
pcomps = my_kmeans(pcomps,4)
pcomps = my_kmeans(pcomps,5)

plt.hist( pcomps['kmeans4'])
plt.title("kmeans4")
plt.xlabel("Cluster")
plt.ylabel("Frequency")

plt.show()
#churn_cat_encode_new2[df['kmeans']==1].describe()

In [None]:
# append the kmeans2..kmeans5 label columns to both dataframes
kmeans_cols = ['kmeans{}'.format(k) for k in range(2, 6)]
churn_scaled = pd.concat([churn_scaled, pcomps[kmeans_cols]], axis=1)
churn_encoded_numeric = pd.concat([churn_encoded_numeric, pcomps[kmeans_cols]], axis=1)


# Examine Specific Cluster Like This

In [None]:
churn_encoded_numeric[pcomps['kmeans5']==1].head()

# Experiment with 3D plot: Red = Churn
This may give a better feel for the clustering behavior.

Control the colors as you wish; here we color churners red.
**red = Churn**

Switch to `%matplotlib notebook` if you want the plot to be interactive, so you can rotate it and see where churners live.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=[10,10])
ax = fig.add_subplot(111, projection='3d')
#df = pcomps[pcomps['kmeans3']==2]
df = pcomps
churn_colors = ['black', 'red']   # index 0 = no churn, index 1 = churn
ax.scatter(df['P5'], df['P0'], df['P1'], c=df['Churn'].astype(int).apply(lambda x: churn_colors[x]), alpha = .3 )

# axis labels match the plotted components
ax.set_xlabel('Princ 5')
ax.set_ylabel('Princ 0')
ax.set_zlabel('Princ 1')

plt.show()




# One Approach to Outliers (DBSCAN Visualization)

Control the **colors** by the **cluster** they are in

Get a feel for clustering behavior.

We added the cluster number to the dataframe so we can drill down as needed to examine a cluster in more detail.

When we create the model we might decide to take advantage of our feature engineering here or we might decide to let the ML/DL algorithm use all the data to make predictions.

**NOTE:** Be careful about relying on the cluster number. The clusters are established from a random initialization, so each time we run we may get a different cluster number for what humans perceive as a cluster, or k-means may find different clusters entirely. Specifying a random seed or using a different clustering method, such as a self-organizing map (SOM), may make this part more robust.

It would be great if we could isolate the different communities from each other and see if this affects the churn prediction.
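
The caution about cluster numbers can be illustrated on made-up, well-separated blobs. In the sketch below two k-means runs with different seeds may number the same groups differently, while `adjusted_rand_score` compares the groupings regardless of numbering. The blob centers and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# made-up, well-separated blobs
points, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)

labels_a = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(points)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=2).fit_predict(points)

# fraction of rows with the same cluster NUMBER; can be low if the numbering is permuted
print((labels_a == labels_b).mean())
# agreement of the GROUPINGS, ignoring how the clusters are numbered
ari = adjusted_rand_score(labels_a, labels_b)
print(ari)
```

A downstream filter such as "cluster 1" is only meaningful within a single run unless the seed is fixed.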

In [None]:
### import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
# X = pcomps.iloc[[1, 3, 5], [1, 3]]
X = pcomps
np.random.seed(42)

# try an eps of about 6.3 to 6.5, then observe that a small number of points are classified as -1 (outliers)
pcomps2, labels, n_clusters = my_dbscan(X, 6.3 )  # call my_dbscan with df=X, eps=6.3
# DBSCAN finds 2 clusters that account for most points with eps = 6.3

dbsPCA_hist = pcomps2.dbsPCA
print('Outliers: {}:   Estimated number of clusters: {}'.format(  dbsPCA_hist.tolist().count(-1),  n_clusters) )
plt.hist( dbsPCA_hist)
plt.title("dbsPCA_hist")
plt.xlabel("Cluster")
plt.ylabel("Frequency")
#plt.axis([-1,n_clusters,0,7200])
plt.show()

In [None]:
churn_encoded_numeric = pd.concat([churn_encoded_numeric, pcomps['dbsPCA']], axis=1)


In [None]:
churn_encoded_numeric.head()

In [None]:
# Figure 2
# plot each cluster a different color so we can see what cluster number is associated with each blob
df2 = pcomps
#df2 = pcomps[pcomps['Churn'] == 1]
# make a scatter plot
#attributes = [0, 1, 2]
#colors=['blue','green','yellow']

attributes = ['P{}'.format(i) for i in range(4)]

colors=['black', 'violet','green','blue', 'yellow', 'red', 'gray', 'white', 'orange']
scatter_matrix(df2[attributes],figsize=[16,16],marker='o', alpha=.3, c=df2['dbsPCA'].apply(lambda x:colors[x+1]))
plt.rcParams.update({'font.size': 22})
plt.suptitle("Outliers in black")
plt.show()

In [None]:
print('Outliers: {} '.format(  dbsPCA_hist.tolist().count(-1)) )
churn_encoded_numeric[churn_encoded_numeric['dbsPCA']<0].describe()

# Examine Outliers Statistics Like This

Compare the leading differences between outliers and the general population.

Notice that the histogram ratios for outliers versus the general population are different, pretty dramatically.

In [None]:
#churn_encoded_numeric[pcomps['dbsPCA']<0].head()
if 'Churn' not in churn_encoded_numeric.columns:
    churn_numeric = pd.concat([churn_encoded_numeric, label], axis=1)  # add numeric to main df
# compare the outliers (dbsPCA < 0) with the general population
TotalC = churn_numeric[pcomps['dbsPCA']<0]
TotalC.hist(column = 'ContractOne year')
plt.title('ContractOne year: Histogram for Outliers')

churn_numeric.hist(column = 'ContractOne year')
plt.title('ContractOne year: Histogram for General Population')


In [None]:
churn_encoded_numeric[pcomps['dbsPCA']<0].describe()

In [None]:
churn_encoded_numeric.describe()

# Modeling

## Split Data - Training/Test

Splitting the data can be a very easy thing, but it's also easy to get into trouble.

We want to ensure that in subsequent runs we never contaminate the test set from run to run. Using a fixed random seed helps, something like randseed = 42.

But if we ever drop or add rows and then use the same random selection, we will likely wind up putting a new value into our test set inadvertently.

It would be best to split the data based on some invariant, such as a hash of an identifying column or some other unique row identifier. But for now we take the easy random split approach.
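
For reference, a hash-based split could look like the sketch below. It is not used in the rest of this notebook. The helper names `in_test_set` and `split_train_test_by_id` are hypothetical, and the IDs are made up to stand in for customerID.

```python
from zlib import crc32
import pandas as pd

def in_test_set(identifier, test_ratio):
    """True when the 32-bit hash of the identifier falls in the lowest test_ratio fraction."""
    return crc32(str(identifier).encode('utf-8')) < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    in_test = data[id_column].apply(lambda id_: in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]

# made-up IDs standing in for customerID
toy = pd.DataFrame({'customerID': ['ID-{:04d}'.format(i) for i in range(1000)]})
train_toy, test_toy = split_train_test_by_id(toy, 0.2, 'customerID')

# adding new rows later never moves an existing row between the two sets
new_rows = pd.DataFrame({'customerID': ['NEW-{}'.format(i) for i in range(50)]})
bigger = pd.concat([toy, new_rows], ignore_index=True)
_, test_bigger = split_train_test_by_id(bigger, 0.2, 'customerID')
print(len(train_toy), len(test_toy), len(test_bigger))
```

Because membership depends only on each row's own identifier, the test set stays stable when rows are added or dropped, at the cost of a test fraction that is only approximately equal to `test_ratio`.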

In [None]:
# to make this notebook's output identical at every run
np.random.seed(42)
# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# note: despite its name, churn_scaled now holds the unscaled numeric data plus customerID
churn_scaled = pd.concat([churn_cat['customerID'], churn_numeric], axis=1)
churn_scaled['Churn'] = label_num['ChurnYes']
if ( len(churn_scaled.customerID.unique()) ==  len(churn_numeric)) : 
    print("Unique Identifier")

train_set, test_set = split_train_test(churn_scaled, 0.2)


# First Model - Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

X = churn_scaled.drop(u'customerID', axis=1)

clf = DecisionTreeClassifier(random_state=0)
AllCols = X.columns
StaticCols = ['Churn']

dataCols = sorted(list(set(AllCols) - set(StaticCols)))
clf.fit(train_set[dataCols], train_set['Churn'])
#accuracy = cross_val_score(clf, churn_cat_encode_new2[dataCols], churn_cat_encode_new2['Churn'], cv=10)
predicted = clf.predict(test_set[dataCols])
print ("accuracy ", accuracy_score(test_set['Churn'], predicted ) )

# EXERCISE

- Replace the custom split_train_test with a patched version of scikit-learn's train_test_split

- Try at least two optimized sklearnex algorithms

- Patch the DBSCAN portion

- Patch the kmeans portion
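
As a starting point for the first two items, the following sketch shows the general shape of a patched split plus a replacement classifier. It runs on made-up data from `make_classification` instead of the churn dataframe, assumes `scikit-learn-intelex` is installed (falling back to stock scikit-learn otherwise), and picks `RandomForestClassifier` only as one example of a candidate. Which estimators are actually accelerated depends on the installed version, so it is worth checking `sklearnex.get_patch_names()`.

```python
try:
    from sklearnex import patch_sklearn
    patch_sklearn()                      # patch first, import estimators afterwards
except ImportError:
    pass                                 # assumption: continue with stock scikit-learn

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# made-up stand-in for the encoded churn features and the Churn label
features, target = make_classification(n_samples=500, n_features=10, random_state=0)

# stratify keeps the class ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print('accuracy ', accuracy)
```

The same pattern, patch and then import, applies to the PCA, KMeans, and DBSCAN cells.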