# Cluster

## Objectives

* Fit and evaluate a cluster model to group similar customer behaviour
* Understand the profile of each cluster


## Inputs

* outputs/datasets/collection/TelcoCustomerChurn.csv
* Instructions on which variables to use for data cleaning and feature engineering. They are found in their respective notebooks.

## Outputs

* Cluster Pipeline
* Train Set
* Feature importance plot
* Clusters Description
* Cluster Silhouette

## Additional Comments | Insights | Conclusions



---

# Install and Import packages

* You may need to restart the runtime after installing packages; please check the cell output when installing a package

In [None]:
! pip install feature-engine==1.0.2
! pip install yellowbrick==1.3
! pip install scikit-learn==0.24.2

# Code for restarting the runtime, that will restart colab session
# It is a good practice after you install a package in a colab session
import os
os.kill(os.getpid(), 9)

---

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: when you select an option (GPU, TPU or None), you switch among kernels/sessions

---
* How to know if I am using the GPU?
  * Run the code below. If the output is a device name (for example `/device:GPU:0`), you are using a GPU in this session; an empty output means no GPU is attached


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 

print("=== Insert your credentials === \nType in and hit Enter")
os.environ['UserName'] = getpass('GitHub User Name: ')
os.environ['UserEmail'] = getpass('GitHub User E-mail: ')
os.environ['RepoName'] = getpass('GitHub Repository Name: ')
os.environ['UserPwd'] = getpass('GitHub Account Token: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

* **Credentials format disclaimer**: when opening Jupyter notebooks hosted on GitHub in Colab, we ask you to avoid special characters in your **password**, like @ ! " # $ % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ ` { } | ~
  * Otherwise the git push command will not work properly, since the credentials are concatenated into the remote URL (username:password@github.com/username/repo) and special characters break that URL
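
To see why special characters cause trouble, here is a minimal sketch of how they could be percent-encoded before being placed in a URL. This is only an illustration: the notebook itself does not encode the credentials, the helper `build_remote_url` is hypothetical, and the user, token and repo values are made up.

```python
from urllib.parse import quote


def build_remote_url(user, token, repo):
    """Build an HTTPS remote URL, percent-encoding the credential parts.

    Hypothetical helper for illustration; not part of this notebook.
    """
    safe_user = quote(user, safe="")    # encode every reserved character
    safe_token = quote(token, safe="")  # e.g. '@' becomes '%40'
    return f"https://{safe_user}:{safe_token}@github.com/{safe_user}/{repo}.git"


# made-up values: the token contains '@', ':' and '/', which would
# otherwise be read as URL separators
url = build_remote_url("octo-user", "p@ss:word/1", "my-repo")
print(url)
```

After encoding, the only `@` left in the URL is the one that separates the credentials from the host, which is what the URL syntax expects.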

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git
! rm -rf sample_data   # remove content/sample_data folder, since we dont need it for this project

import os
if os.path.isdir(os.environ['RepoName']):
  print("\n")
  %cd /content/{os.environ['RepoName']}
  print(f"\n\n* Current session directory is:{os.getcwd()}")
  print(f"* You may refresh the session folder to access {os.environ['RepoName']} folder.")
else:
  print(f"\n* The Repo {os.environ['UserName']}/{os.environ['RepoName']} was not cloned."
        f" Please check your Credentials: UserName and RepoName")

---

### **Connect** this Colab session to your GitHub Repo

* So if you need, you can push files generated in this session to your Repo.

In [None]:
! git config --global user.email {os.environ['UserEmail']}
! git config --global user.name {os.environ['UserName']}
! git remote rm origin
! git remote add origin https://{os.environ['UserName']}:{os.environ['UserPwd']}@github.com/{os.environ['UserName']}/{os.environ['RepoName']}.git

# the logic is: create a temporary file in the session and push it to the repo; then delete the file and push again
# If it works, it is a sign that the session is connected to the repo.
import uuid
file_name = "session_connection_test_" + str(uuid.uuid4()) # generates a unique file name
with open(f"{file_name}.txt", "w") as file: file.write("text")
print("=== Testing Session Connectivity to the Repo === \n")
! git add . ; ! git commit -m {file_name + "_added_file"} ; ! git push origin main 
print("\n\n")
os.remove(f"{file_name}.txt")
! git add . ; ! git commit -m {file_name + "_removed_file"}; ! git push origin main

# delete your Credentials (username and password)
os.environ['UserName'] = os.environ['UserPwd'] = os.environ['UserEmail'] = ""

* If the output above indicates there was a **failure in the authentication**, please insert your credentials again.

---

# Load Data for Modelling

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/TelcoCustomerChurn.csv")
      .drop(['customerID', 'TotalCharges', 'Churn', 'tenure' ],axis=1) 
)
df.info()

# Cluster Pipeline considering all data: KMeans

## ML pipeline for Data Cleaning and Feature Engineering

In [None]:
from sklearn.pipeline import Pipeline

### Feature Engineering
from feature_engine.encoding import OrdinalEncoder

### PCA
from sklearn.decomposition import PCA

### Feat Selection
from sklearn.feature_selection import SelectFromModel

### Feat Scaling
from sklearn.preprocessing import StandardScaler

### ML algorithms 
from sklearn.cluster import KMeans


def PipelineDataCleaningAndFeatureEngineering():
  pipeline_base = Pipeline(
      [
       
      ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary',
                                                  variables = [ 'gender', 'Partner', 'Dependents', 'PhoneService',
                                                               'MultipleLines', 'InternetService', 'OnlineSecurity',
                                                               'OnlineBackup','DeviceProtection', 'TechSupport', 
                                                               'StreamingTV', 'StreamingMovies','Contract', 
                                                               'PaperlessBilling', 'PaymentMethod'])
      ),
    ]
  )

  return pipeline_base

## ML Pipeline for Cluster

The values of `n_components` (PCA) and `n_clusters` (KMeans) will be updated after the PCA and Elbow analyses

In [None]:
def PipelineCluster():
  pipe = PipelineDataCleaningAndFeatureEngineering()
  pipe.steps.append(["PCA",PCA(n_components=3, random_state=0)])
  pipe.steps.append(["scaler",StandardScaler()])
  pipe.steps.append(["model",KMeans(n_clusters=4, random_state=0)])
  return pipe

PipelineCluster()

## ML Pipeline for a classifier to explain the clusters

We are considering a model that typically offers good results and whose feature importance can be assessed with `.feature_importances_`

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
def PipelineClf2ExplainClusters():
   pipe = PipelineDataCleaningAndFeatureEngineering()
   pipe.steps.append(["feat_selection",SelectFromModel(GradientBoostingClassifier(random_state=0))])
   pipe.steps.append(["scaler",StandardScaler()])
   pipe.steps.append(["model",GradientBoostingClassifier(random_state=0)])
   return pipe
  
PipelineClf2ExplainClusters()

## Principal Component Analysis (PCA)

Apply PCA separately to find the most suitable `n_components`, then update the value in the ML Pipeline for Cluster

PCA needs the dataset after data cleaning and feature engineering
  * That means you have to remove the last 3 steps of the cluster pipeline (PCA, scaler and model)
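
One common way to pick `n_components` is to look at the cumulative explained variance. The sketch below is an assumption about how that could be done, not the method used in this notebook: it runs on synthetic data, and the 90% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in for the encoded dataset: 6 columns driven by 2 signals
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
derived = base @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(200, 4))
data = np.hstack([base, derived])

# fit PCA with all components and accumulate the explained variance ratio
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90  # illustrative value, not taken from the notebook
n_components_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(100 * cumulative, 2))
print(f"components needed for {threshold:.0%}: {n_components_needed}")
```

The printed cumulative percentages play the same role as the "Explained Variance Ratio (%)" table produced in the next cells.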

In [None]:
pipeline_pca = PipelineDataCleaningAndFeatureEngineering()
df_pca = pipeline_pca.fit_transform(df)
print(df_pca.shape)
df_pca.head(3)

Apply PCA component

In [None]:
import numpy as np
from sklearn.decomposition import PCA

n_components = 3

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca) # array with transformed PCA

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,2),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

Heatmap: PCA components and variables

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

df_comp = pd.DataFrame(pca.components_, columns=df_pca.columns)
plt.figure(figsize=(20,5))
sns.heatmap(df_comp,center=0,linewidths=.5)
plt.show()

Major Variables

In [None]:
major_pca_variables = df_comp[abs(df_comp) > 0.1].dropna(axis=1, how='all').columns.to_list()
print(f"* There are {len(major_pca_variables)} major variables in the PCA: \n{major_pca_variables}")

## Elbow Analysis

Find the most suitable `n_clusters`, then update the value in the ML Pipeline for Cluster


Prepare data for analysis
  * You need to clean and feature engineer your data using the pipeline without the model

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_elbow = Pipeline(pipeline_cluster.steps[:-1])
df_elbow = pipeline_elbow.fit_transform(df)

print(df_elbow.shape,'\n')
df_elbow

Elbow Analysis

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,16))
visualizer.fit(df_elbow) 
visualizer.show() 
plt.show()
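
For intuition on what the elbow plot measures, the following sketch computes the KMeans inertia (within-cluster sum of squared distances) for several values of k by hand. It uses synthetic blobs rather than the project data, and it is only an approximation of what `KElbowVisualizer` does internally.

```python
import numpy as np
from sklearn.cluster import KMeans

# three well-separated synthetic blobs, 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [8, 8], [0, 8]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# fit KMeans for each candidate k and record the inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

The inertia drops sharply until k reaches the real number of groups and flattens afterwards; that bend is the "elbow".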

## Fit Cluster Pipeline

Quick recap of our raw dataset

In [None]:
print(df.shape)
df.head(3)

Fit Cluster pipeline

In [None]:
X = df.copy()
pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

Cluster model output is an array with clusters labels

In [None]:
pipeline_cluster['model'].labels_

Let's check its shape

In [None]:
pipeline_cluster['model'].labels_.shape

## Add cluster labels to dataset

The goal is to merge the cluster labels into the `X` DataFrame. There is one point of attention: **does the pipeline have a step for dropping rows?** 
  * If yes, before merging, we need to drop these rows from `X`. The code below can do that

```
drop_imputer = DropMissingData(variables =['place here the variables where you drop rows'])
X = drop_imputer.fit_transform(X)
```
* If no, ignore this step.

Our project doesn't need this step. You can confirm that by comparing the length of `X` with the length of the cluster label predictions
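
The sketch below illustrates the same check on a toy DataFrame where one row does get dropped. It uses `pandas.DataFrame.dropna` as a stand-in for the `DropMissingData` step, and the column names are made up.

```python
import pandas as pd
from sklearn.cluster import KMeans

# toy data with one missing value (hypothetical columns)
features = pd.DataFrame({
    "x": [0.0, 0.1, 5.0, 5.1, None],
    "y": [1.0, 1.1, 9.0, 9.2, 3.0],
})

# stand-in for a row-dropping pipeline step
features_clean = features.dropna()

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features_clean).labels_

# guard before merging: one label per remaining row
assert len(labels) == len(features_clean)
features_clean = features_clean.assign(Clusters=labels)

print(features.shape, features_clean.shape)
```

Assigning the labels to the original `features` here would fail, because it has one more row than there are labels.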


In [None]:
print(X.shape)
print(pipeline_cluster['model'].labels_.shape)

We add a column "`Clusters`" to the data and check the clusters distribution
* The clusters don't look imbalanced

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_

print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar');

This is what our data looks like now
  * Check the last column: `Clusters`
  * Quick reminder: the data is unprocessed (**no data cleaning or feature engineering applied yet**)

In [None]:
print(X.shape)
X.head(3)

Here we save the cluster predictions from this pipeline to use later. We will get back to that soon

In [None]:
cluster_anwers_with_all_variables = X['Clusters']
cluster_anwers_with_all_variables

## Evaluate Clusters silhouette

To evaluate the clusters silhouette we need:
  * the transformed data (transform the data with the pipeline without the model step)
  * the array of cluster labels
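
As a minimal sketch of these two inputs in action, the example below computes the average silhouette score and the per-sample values on synthetic blobs. The data is made up and stands in for the transformed dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# two synthetic blobs standing in for the transformed data
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(40, 2))
blob_b = rng.normal(loc=6.0, scale=0.5, size=(40, 2))
transformed = np.vstack([blob_a, blob_b])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(transformed)

# the average score is the mean of the per-sample silhouette values
average_score = silhouette_score(transformed, labels)
per_sample = silhouette_samples(transformed, labels)

print(f"average silhouette: {average_score:.2f}")
for cluster in np.unique(labels):
    print(f"cluster {cluster}: mean silhouette {per_sample[labels == cluster].mean():.2f}")
```

The per-sample values are what the silhouette plot draws as horizontal bars, grouped by cluster.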

In [None]:
pipeline_silhouette = Pipeline(pipeline_cluster.steps[:-1])
df_transformed = pipeline_silhouette.transform(df)
df_transformed

Silhouette plot function

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score
sns.set_style("whitegrid")

def EvaluateClusterSilhouette(X,n_clusters,cluster_labels):

  print(f"  * The silhouette score range is -1 to +1, where: \n"
        f"  - +1 means the clusters are dense and properly separated.\n"
        f"  - 0 means the clusters are overlapping. \n"
        f"  - A negative score means that data points from that cluster may be wrongly assigned; "
        f"they should belong to another cluster.")  
  
  print(f"\n* You should evaluate:\n"
      "  * If there are clusters with below average silhouette scores. \n"
      "  * If there is broad variation in the size of the silhouette plots across clusters. \n"
      "  * If the thickness of the silhouettes is uniform/similar in general \n")
  
  fig = plot_clusters_silhouette(X,n_clusters,cluster_labels)
  plt.show()

def plot_clusters_silhouette(X,n_clusters,cluster_labels):

  silhouette_avg = silhouette_score(X, cluster_labels,random_state=0)

  fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(15,7))
  axes.set_xlim([-0.1, 1])
  axes.set_ylim([0, len(X) + (n_clusters + 1) * 10])

  sample_silhouette_values = silhouette_samples(X, cluster_labels)
  y_lower = 10
  for i in range(n_clusters):
    ith_cluster_silhouette_values = \
      sample_silhouette_values[cluster_labels == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = cm.nipy_spectral(float(i) / n_clusters)
    axes.fill_betweenx(np.arange(y_lower, y_upper),
              0, ith_cluster_silhouette_values,
              facecolor=color, edgecolor=color, alpha=0.7)
    axes.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10

  axes.set_title("The silhouette plot for each cluster")
  axes.set_xlabel("The silhouette coefficient values")
  axes.set_ylabel("Cluster label")
  axes.text(x=silhouette_avg*1.01, y=len(X)*0.95, s=f"Silhouette Average: {round(silhouette_avg,2)}", fontsize=12, c='r')
  axes.axvline(x=silhouette_avg, color="red", linestyle="--")
  axes.set_yticks([])
  axes.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
  return fig


Evaluate Cluster Silhouette

In [None]:
EvaluateClusterSilhouette(
    X=df_transformed,
    n_clusters=X['Clusters'].nunique(),
    cluster_labels=X['Clusters'].values)

## Fit a classifier, where target is cluster labels and features remaining variables

We now have predictions from the cluster pipeline, but we don't have a meaning for them yet. 
* We seek to understand each cluster's profile

We need to find the most relevant variables, so we can define each cluster in terms of them
* Our new dataset has `Clusters`, which will be the **target for a classifier**. The most important features for this classifier will be the most relevant variables for describing the clusters!
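
The idea can be sketched on toy data: cluster first, then fit a classifier on the cluster labels and read its feature importances. The column names `separating` and `noise` are made up, and the small `n_estimators` is only there to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier

# toy data: one column separates two groups, the other is pure noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "separating": np.concatenate([rng.normal(0, 0.3, 60), rng.normal(10, 0.3, 60)]),
    "noise": rng.normal(0, 0.3, 120),
})

# step 1: cluster the data
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)

# step 2: fit a classifier where the target is the cluster label
clf = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(toy, cluster_labels)

# step 3: the most important features describe what defines the clusters
importance = (pd.Series(clf.feature_importances_, index=toy.columns)
              .sort_values(ascending=False))
print(importance)
```

The classifier relies almost entirely on the column that actually drives the grouping, which is the signal we look for in the real dataset.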

In [None]:
X.head()

We copy `X` to a DataFrame `df_clf`, just to separate the cases

In [None]:
df_clf = X.copy()
print(df_clf.shape)
df_clf.head(3)

Split Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['Clusters'],axis=1),
                                    df_clf['Clusters'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print(X_train.shape, X_test.shape)

Create pipeline

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster

Fit pipeline

In [None]:
pipeline_clf_cluster.fit(X_train, y_train)

## Evaluate classifier performance on Train and Test Sets

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

In [None]:
print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))

## Assess Most Important Features that define a cluster

In [None]:
# after data cleaning and feat engine, the feature space changes
columns_after_data_cleaning_feat_eng = (PipelineDataCleaningAndFeatureEngineering()
                                        .fit_transform(X_train)
                                        .columns)

best_features = columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support()].to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
          'Attribute': columns_after_data_cleaning_feat_eng[pipeline_clf_cluster['feat_selection'].get_support()],
          'Importance': pipeline_clf_cluster['model'].feature_importances_})
  .sort_values(by='Importance', ascending=False)
  )

best_features = df_feature_importance['Attribute'].to_list() # reassign best features in importance order

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features} \n")
df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()

We will store the best_features for future usage. We will get back to that soon

In [None]:
best_features_pipeline_all_variables = best_features
best_features_pipeline_all_variables

## Cluster Analysis

Custom Functions for Cluster Analysis

* Table with description for all Clusters

In [None]:
def Clusters_IndividualDescription(EDA_Cluster,cluster):

  ClustersDescription = pd.DataFrame(columns=EDA_Cluster.columns)
  for col in EDA_Cluster.columns:
    
    try:  # a given cluster may have only missing data for a given variable
      
      if EDA_Cluster[col].dtypes == 'object':
        
        top_frequencies = EDA_Cluster.dropna(subset=[col])[[col]].value_counts(normalize=True).nlargest(n=3)
        Description = ''
        
        for x in range(len(top_frequencies)):
          freq = top_frequencies.iloc[x]
          category = top_frequencies.index[x][0]
          CategoryPercentage = int(round(freq*100,0))
          statement =  f"'{category}': {CategoryPercentage}% , "  
          Description = Description + statement
        
        ClustersDescription.at[0,col] = Description[:-2]


      
      elif EDA_Cluster[col].dtypes in ['float', 'int']:
        DescStats = EDA_Cluster.dropna(subset=[col])[[col]].describe()
        Q1 = int(round(DescStats.iloc[4,0],0))
        Q3 = int(round(DescStats.iloc[6,0],0))
        Description = f"{Q1} -- {Q3}"
        ClustersDescription.at[0,col] = Description
    
    
    except Exception as e:
      ClustersDescription.at[0,col] = 'Not available'
      print(f"** Error Exception: {e} - cluster {cluster}, variable {col}")
  
  ClustersDescription['Cluster'] = str(cluster)
  
  return ClustersDescription


def DescriptionAllClusters(df_cluster_profile):

  DescriptionAllClusters = pd.DataFrame(columns=df_cluster_profile.drop(['Clusters'],axis=1).columns)
  for cluster in df_cluster_profile.sort_values(by='Clusters')['Clusters'].unique():
    
      EDA_ClusterSubset = df_cluster_profile.query(f"Clusters == {cluster}").drop(['Clusters'],axis=1)
      ClusterDescription = Clusters_IndividualDescription(EDA_ClusterSubset,cluster)
      # pd.concat is used because DataFrame.append was removed in pandas 2.0
      DescriptionAllClusters = pd.concat([DescriptionAllClusters, ClusterDescription])

  
  DescriptionAllClusters.set_index(['Cluster'],inplace=True)
  return DescriptionAllClusters


* Cluster distribution per Variable (absolute and relative)

In [None]:
import plotly.express as px
def cluster_distribution_per_variable(df,target):


  df_bar_plot = df.value_counts(["Clusters", target]).reset_index() 
  df_bar_plot.columns = ['Clusters',target,'Count']
  df_bar_plot[target] = df_bar_plot[target].astype('object')

  print(f"Clusters distribution across {target} levels")
  fig = px.bar(df_bar_plot, x='Clusters',y='Count',color=target,width=800, height=500)
  fig.update_layout(xaxis=dict(tickmode= 'array',tickvals= df['Clusters'].unique()))
  fig.show()


  df_relative = (df
                 .groupby(["Clusters", target])
                 .size()
                 .groupby(level=0, group_keys=False)  # avoids a duplicated 'Clusters' index level in newer pandas
                 .apply(lambda x:  100*x / x.sum())
                 .reset_index()
                 .sort_values(by=['Clusters'])
                 )
  df_relative.columns = ['Clusters',target,'Relative Percentage (%)']
 

  print(f"Relative Percentage (%) of {target} in each cluster")
  fig = px.line(df_relative, x='Clusters',y='Relative Percentage (%)',color=target,width=800, height=500)
  fig.update_layout(xaxis=dict(tickmode= 'array',tickvals= df['Clusters'].unique()))
  fig.update_traces(mode='markers+lines')
  fig.show()
 


---

We will study the profile for the main variables that define a cluster


In [None]:
df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=best_features + ['Clusters'], axis=1)
df_cluster_profile.head(3)

Load Churn levels

In [None]:
df_churn = pd.read_csv("outputs/datasets/collection/TelcoCustomerChurn.csv").filter(['Churn'])
df_churn['Churn'] = df_churn['Churn'].astype('object')
df_churn.head(3)

### Cluster profile on most important features

Considering `df_cluster_profile` and `df_churn`

In [None]:
pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(pd.concat([df_cluster_profile,df_churn], axis=1))
clusters_profile

### Clusters distribution across Churn levels & Relative Percentage of Churn in each cluster

In [None]:
df_cluster_vs_churn=  df_churn.copy()
df_cluster_vs_churn['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_churn, target='Churn')
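
The two tables behind these plots (absolute counts and relative percentage per cluster) can also be sketched with `pd.crosstab`. This is an alternative shown on made-up data, not the code used by `cluster_distribution_per_variable`.

```python
import pandas as pd

# made-up cluster labels and churn levels
toy = pd.DataFrame({
    "Clusters": [0, 0, 0, 0, 1, 1, 1, 1],
    "Churn": ["Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No"],
})

# absolute counts of each Churn level per cluster
absolute = pd.crosstab(toy["Clusters"], toy["Churn"])

# relative percentage: each row (cluster) sums to 100
relative = pd.crosstab(toy["Clusters"], toy["Churn"], normalize="index") * 100

print(absolute)
print(relative.round(1))
```

Reading the relative table row by row shows how churn is distributed inside each cluster, regardless of the cluster size.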

# Fit New Cluster Pipeline only on most important features

To reduce the feature space, we will study the trade-off between keeping the previous Cluster Pipeline (fitted with all variables) and creating a new Cluster Pipeline fitted only with the variables that were most important to define the clusters in the previous pipeline

In [None]:
best_features_pipeline_all_variables

## Define trade-off and metrics to compare new and previous Cluster Pipeline

To evaluate this trade-off we will
1. Conduct a PCA analysis, with the same number of components as the previous study, on a dataset with only `best_features_pipeline_all_variables`, and see if all variables are relevant
2. Conduct an elbow study and check if the same number of clusters is suggested
3. Fit a new cluster pipeline and check if both sets of cluster predictions are "equivalent"
4. Compare the silhouette scores
5. Fit a classifier to explain the clusters, and check if performance on Train and Test sets is similar
6. Check if the most important features for the classifier are the same
7. Check if the cluster profiles from both cases are "equivalent"

If we are happy to say **yes** to all of them, we can use a cluster pipeline with a reduced feature space!
* The **gain** is that in real time (which is the major purpose of Machine Learning) you will need fewer variables for running predictions and decision making.

## Consider the data with the most relevant variables

In [None]:
df_reduced = df.filter(best_features_pipeline_all_variables)
df_reduced.head(3)

## Rewrite ML pipeline for Data Cleaning and Feature Engineering

In [None]:
def PipelineDataCleaningAndFeatureEngineering():
  pipeline_base = Pipeline(
      [
        # we updated the pipeline, considering only the new variables      
       ("OrdinalCategoricalEncoder",OrdinalEncoder(encoding_method='arbitrary',
                                                  variables = ['PaymentMethod', 'InternetService',
                                                               'DeviceProtection','OnlineSecurity'])
      ),

    ]
  )

  return pipeline_base

## Apply PCA and compare to previous PCA

It needs the dataset after data cleaning and feature engineering

In [None]:
pipeline_pca = PipelineDataCleaningAndFeatureEngineering()
df_pca = pipeline_pca.fit_transform(df_reduced)
print(df_pca.shape)
df_pca.head(3)

Apply PCA component

In [None]:
print(f"* n_components here should be the same amount from previous study: {n_components}")

In [None]:
n_components = 3

pca = PCA(n_components=n_components).fit(df_pca)
x_PCA = pca.transform(df_pca) # array with transformed PCA

ComponentsList = ["Component " + str(number) for number in range(n_components)]
dfExplVarRatio = pd.DataFrame(
    data= np.round(100 * pca.explained_variance_ratio_ ,2),
    index=ComponentsList,
    columns=['Explained Variance Ratio (%)'])

PercentageOfDataExplained = dfExplVarRatio['Explained Variance Ratio (%)'].sum()

print(f"* The {n_components} components explain {round(PercentageOfDataExplained,2)}% of the data \n")
print(dfExplVarRatio)

Heatmap: PCA components and variables

In [None]:
df_comp = pd.DataFrame(pca.components_, columns=df_pca.columns)
plt.figure(figsize=(20,5))
sns.heatmap(df_comp, center=0, linewidths=.5)
plt.show()

Major PCA variables

In [None]:
major_pca_variables = df_comp[abs(df_comp) > 0.1].dropna(axis=1, how='all').columns.to_list()
print(f"* There are {len(major_pca_variables)} major variables in the PCA: \n{major_pca_variables}")

Note that all variables from `best_features_pipeline_all_variables` are indicated as relevant after applying PCA

## Apply Elbow analysis and compare to previous Elbow analysis

Prepare data for analysis
  * You need to clean and feature engineer your data using the pipeline without the model

In [None]:
pipeline_cluster = PipelineCluster()
pipeline_elbow = Pipeline(pipeline_cluster.steps[:-1])
df_elbow = pipeline_elbow.fit_transform(df_reduced)

print(df_elbow.shape,'\n')
df_elbow

Elbow Analysis

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

visualizer = KElbowVisualizer(KMeans(random_state=0), k=(1,16))
visualizer.fit(df_elbow) 
visualizer.show() 
plt.show()

The same number of clusters is suggested! :)

## Fit New Cluster Pipeline

Quick recap of our reduced dataset

In [None]:
print(df_reduced.shape)
df_reduced.head(3)

Fit Cluster pipeline

In [None]:
X = df_reduced.copy()
pipeline_cluster = PipelineCluster()
pipeline_cluster.fit(X)

Cluster model output is an array with clusters labels

In [None]:
pipeline_cluster['model'].labels_

In [None]:
pipeline_cluster['model'].labels_.shape

## Add cluster labels to dataset

We add a column "`Clusters`" to the data and check the clusters distribution
* The clusters don't look imbalanced

In [None]:
X['Clusters'] = pipeline_cluster['model'].labels_

print(f"* Clusters frequencies \n{ X['Clusters'].value_counts(normalize=True).to_frame().round(2)} \n\n")
X['Clusters'].value_counts().sort_values().plot(kind='bar');

## Compare current cluster labels to previous cluster labels

We just fitted a new cluster pipeline and want to check if its predictions are "equivalent" to those from the previous cluster pipeline

These are the predictions from **previous** cluster pipeline 

In [None]:
cluster_anwers_with_all_variables

And these are the predictions from **current** cluster pipeline (trained with `df_reduced`)

In [None]:
cluster_anwers_with_major_variables = X['Clusters'] 
cluster_anwers_with_major_variables

---

We use a confusion matrix to evaluate if the predictions of both pipelines are **"equivalent"**
* We say equivalent in quotes because we can't expect that cluster label 0 in the previous pipeline will have the same label in the current pipeline. Label 0 in the previous cluster pipeline may well be a different label in the current cluster pipeline
* When we reach this **equivalence**, it means both pipelines "clustered" the data in a similar way but with different labels. This is fine, since the label itself doesn't have meaning. We will look for the meaning afterwards. 
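
A small sketch of this idea on made-up labels: two labelings that group the points identically but use different label numbers. As an extra check that is not part of this notebook, `adjusted_rand_score` gives a single number that ignores label names.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# same grouping, different label numbers (made-up data)
previous_labels = np.array([0, 0, 1, 1, 2, 2])
current_labels = np.array([2, 2, 0, 0, 1, 1])

# rows: previous labels, columns: current labels
cm = confusion_matrix(previous_labels, current_labels)
print(cm)

# for each previous label, the current label that holds most of its points
mapping = cm.argmax(axis=1)
for previous, current in enumerate(mapping):
    print(f"previous label {previous} -> current label {current}")

# 1.0 means the two groupings are identical up to relabelling
ari = adjusted_rand_score(previous_labels, current_labels)
print(f"adjusted Rand index: {ari:.2f}")
```

When each row of the matrix has one dominant column, the two pipelines agree on the grouping even though the diagonal is empty.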

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(cluster_anwers_with_all_variables,
                       cluster_anwers_with_major_variables))

We see that one pipeline labelled 1526 observations as label 2 and the other labelled them as label 0. 
  * So what is label 2 in the first is label 0 in the second

We also see that one pipeline labelled 1674 observations as label 0 and the other labelled them as label 2.
  * So what is label 0 in one is label 2 in the other
  * We notice 146 data points were labelled as 0 in one and as 1 in the other. This is fine, since it is a minority compared to the 1674

When you keep comparing you will notice:
  * What is label 1 in one is label 3 in the other
  * What is label 3 in one is label 1 in the other


Conclusion
* We see that the two Cluster Pipelines do not produce 100% equivalent predictions, since a few data points are assigned to different groups by each pipeline. 
* However this is fine, and we say yes to this criterion. This is part of the trade-off we are evaluating.

## Evaluate current Clusters silhouette

* To evaluate the clusters silhouette we need:
  * the transformed data (transform the data with the pipeline without the model step)
  * the array of cluster labels

In [None]:
pipeline_silhouette = Pipeline(pipeline_cluster.steps[:-1])
df_transformed = pipeline_silhouette.transform(df_reduced)
df_transformed

Evaluate Cluster Silhouette

In [None]:
EvaluateClusterSilhouette(
    X=df_transformed,
    n_clusters=X['Clusters'].nunique(),
    cluster_labels=X['Clusters'].values)

The silhouette scores from both pipelines are similar! Now it has even increased a bit :)

## Rewrite ML Pipeline for a classifier to explain the clusters

We want again to explain the major variables for our clusters

We copy `X` to a DataFrame `df_clf`, just to separate the concerns

In [None]:
df_clf = X.copy()
print(df_clf.shape)
df_clf.head(3)

Split Train and Test sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train, y_test = train_test_split(
                                    df_clf.drop(['Clusters'],axis=1),
                                    df_clf['Clusters'],
                                    test_size=0.2,
                                    random_state=0
                                    )

print(X_train.shape, X_test.shape)

Rewrite pipeline to explain clusters

In [None]:
def PipelineClf2ExplainClusters():
   pipe = PipelineDataCleaningAndFeatureEngineering()
   # no feature selection step
   pipe.steps.append(["scaler",StandardScaler()])
   pipe.steps.append(["model",GradientBoostingClassifier(random_state=0)])
   return pipe

Create pipeline

In [None]:
pipeline_clf_cluster = PipelineClf2ExplainClusters()
pipeline_clf_cluster

## Fit a classifier, where target is cluster labels and features remaining variables

In [None]:
pipeline_clf_cluster.fit(X_train,y_train)

## Evaluate classifier performance on Train and Test Sets

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, pipeline_clf_cluster.predict(X_train)))

In [None]:
print(classification_report(y_test, pipeline_clf_cluster.predict(X_test)))

The performance on the Train and Test sets is similar to that of the previous pipeline! :)

## Assess Most Important Features

These are the features that help the most to define a cluster. Compare them with the ones from the previous pipeline


In [None]:
best_features = X_train.columns.to_list()

# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Attribute': X_train.columns,
    'Importance': pipeline_clf_cluster['model'].feature_importances_})
.sort_values(by='Importance', ascending=False)
)

best_features = df_feature_importance['Attribute'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Attribute'].to_list()}")

df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()

The most relevant variables to explain the clusters, in descending order of importance, are

In [None]:
best_features

We notice that the feature importance order is the same as in the previous cluster pipeline :)

## Cluster Analysis

We will study the profile for the main variables that define a cluster


In [None]:
df_cluster_profile = df_clf.copy()
df_cluster_profile = df_cluster_profile.filter(items=best_features + ['Clusters'], axis=1)
df_cluster_profile.head(3)

Load Churn levels

In [None]:
df_churn = pd.read_csv("outputs/datasets/collection/TelcoCustomerChurn.csv").filter(['Churn'])
df_churn['Churn'] = df_churn['Churn'].astype('object')
df_churn.head(3)

### Cluster profile on most important features

Considering `df_cluster_profile` and `df_churn`

In [None]:
pd.set_option('display.max_colwidth', None)
clusters_profile = DescriptionAllClusters(pd.concat([df_cluster_profile,df_churn], axis=1))
clusters_profile

### Clusters distribution across Churn levels & Relative Percentage of Churn in each cluster

In [None]:
df_cluster_vs_churn=  df_churn.copy()
df_cluster_vs_churn['Clusters'] = X['Clusters']
cluster_distribution_per_variable(df=df_cluster_vs_churn, target='Churn')

## Which pipeline should I keep?

Let's recap the criteria we consider to evaluate the **trade-off**
1. Conduct a PCA analysis, with the same number of components as the previous study, on a dataset with only `best_features_pipeline_all_variables`, and see if all variables are relevant
2. Conduct an elbow study and check if the same number of clusters is suggested
3. Fit a new cluster pipeline and check if both sets of cluster predictions are "equivalent"
4. Compare the silhouette scores
5. Fit a classifier to explain the clusters, and check if performance on Train and Test sets is similar
6. Check if the most important features for the classifier are the same
7. Check if the cluster profiles from both cases are "equivalent"

We are happy with all criteria above for the new Cluster Pipeline


Now we face a moment of trade-off, where there is no 100% right or wrong decision; it is more of a contextual decision
* All 7 criteria support the second pipeline. In addition, there is a great gain in having fewer variables for predicting live data
* Therefore we have positive evidence to take the pipeline with fewer variables

In [None]:
pipeline_cluster

# Push files to Repo


We will generate the following files

* Cluster Pipeline
* Train Set
* Feature importance plot
* Clusters Description
* Cluster Silhouette


In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/cluster_analysis/{version}'

try:
  os.makedirs(name=file_path)
except Exception as e:
  print(e)

## Cluster pipeline

In [None]:
pipeline_cluster

In [None]:
joblib.dump(value=pipeline_cluster ,
            filename=f"{file_path}/cluster_pipeline.pkl")
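
For reference, here is a sketch of how a saved pipeline could be loaded back and used later. It runs on a toy pipeline and a temporary folder instead of the project files, so the data and paths here are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy training data: two synthetic groups
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=2, n_init=10, random_state=0)),
]).fit(train)

# save to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cluster_pipeline.pkl")
    joblib.dump(value=toy_pipeline, filename=path)
    reloaded = joblib.load(path)

# the reloaded pipeline assigns new points to the same clusters
new_points = np.array([[0.1, -0.2], [5.2, 4.9]])
original_predictions = toy_pipeline.predict(new_points)
reloaded_predictions = reloaded.predict(new_points)
same_predictions = np.array_equal(original_predictions, reloaded_predictions)
print(same_predictions)
```

A pickled pipeline is generally expected to be loaded with the same scikit-learn version that saved it, which is one reason the versions are pinned at the top of this notebook.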

## Train Set

In [None]:
print(df_reduced.shape)
df_reduced.head(3)

In [None]:
df_reduced.to_csv(f"{file_path}/TrainSet.csv", index=False)

## Most important features plot

These are the features that define a cluster

In [None]:
df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.show()

In [None]:
df_feature_importance.plot(kind='bar',x='Attribute',y='Importance')
plt.savefig(f"{file_path}/features_define_cluster.png", bbox_inches='tight')

## Cluster Profile

In [None]:
clusters_profile

In [None]:
clusters_profile.to_csv(f"{file_path}/clusters_profile.csv")

## Cluster silhouette plot

In [None]:
plot_clusters_silhouette(
    X=df_transformed,
    n_clusters=X['Clusters'].nunique(),
    cluster_labels=X['Clusters'].values)
plt.show()

In [None]:
fig = plot_clusters_silhouette(
    X=df_transformed,
    n_clusters=X['Clusters'].nunique(),
    cluster_labels=X['Clusters'].values)

plt.savefig(f"{file_path}/clusters_silhouette.png", bbox_inches='tight')

---

## **Push** generated/new files from this Session to GitHub repo

You can now push the files to the Repo!

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "added-files-cluster-analysis"
! git add .
! git commit -m {CommitMsg}

* Git Push

In [None]:
! git push origin main

Good job! Now save the notebook in your repo and terminate the session (Runtime - Manage Session - Terminate)