# PyCaret Demo - Clustering

Using the jewellery dataset, this demo covers some of the steps used to perform clustering analysis with PyCaret. Before running the notebook, ensure you have the following packages installed.

The relevant packages are:
- Pandas
- Numpy
- Matplotlib
- Seaborn
- PyCaret
- MLFlow
- Sklearn
- Scipy
- PyCaret[Analysis]

In [None]:
# import packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# for pycaret
import pycaret
from pycaret.utils import version
from pycaret.clustering import *

In [None]:
# import dataset to be used
from pycaret.datasets import get_data
dataset = get_data('jewellery')

In [None]:
### check out the dataset

In [None]:
# review dataset info
dataset.info()

In [None]:
# check for null values
dataset.isnull().values.sum()

In [None]:
### exploratory data analysis

In [None]:
#check the summary stats
dataset.describe(include='all').T

In [None]:
# performing further exploratory data analysis on our numeric columns
num_cols = dataset.select_dtypes(include=np.number).columns.tolist()
print("Numerical Variables:")
print(num_cols)

In [None]:
#create histograms and boxplots for numerical values

for col in num_cols:
    print(col)
    plt.figure(figsize = (15,4))
    plt.subplot(1,2,1)
    dataset[col].hist(grid=False)
    plt.ylabel('count')
    plt.subplot(1,2,2)
    sns.boxplot(x=dataset[col])
    plt.show()

In [None]:
# correlation heatmap

plt.figure(figsize=(5,5))
sns.heatmap(dataset.corr().round(decimals=2), annot=True)
plt.show()

In [None]:
### Creating Unseen Data for predictions

In [None]:
# splitting off some data for predictions, pretending it's unseen data
# (drop the sampled rows before resetting the index, so the two sets do not overlap)
data = dataset.sample(frac = 0.95, random_state = 786)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
data = data.reset_index(drop=True)

# Split datasets below; we will use the unseen data later for predicting on "new" data
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data for Predictions: ' + str(data_unseen.shape))

In [None]:
### Setting up the cluster model

In [None]:
# setting up the experiment for clustering
exp = setup(data = data, session_id = 123, preprocess = True, normalize = True)

In [None]:
# use this to check all of the available models
models()
# will use kmeans and agglomerative clustering

In [None]:
### K-Means Model

In [None]:
# Creating a k-means model
kmeans = create_model('kmeans')
print(kmeans)

Looking at the info above, we can see that the number of clusters used is 4. This is the default, but you can change it by specifying the number of clusters you want. The code below shows this.

In [None]:
# Creating a k-means model with specific clusters
kmeans_model2=create_model('kmeans', num_clusters = 3)
print(kmeans_model2)

As we can see from the above, the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores changed. These scores indicate how well the model fits the data, in terms of how compact and well separated the clusters are. A higher Silhouette score (which is bounded between -1 and +1) and a higher Calinski-Harabasz score are better, while a lower Davies-Bouldin score is better. Based on the above scores, this model performs worse than the one with 4 clusters.
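To make the direction of each metric concrete, here is a minimal sketch that computes all three scores directly with scikit-learn for 3 and 4 clusters. It uses synthetic blobs as a stand-in (an assumption, not the jewellery data), and `X_demo` and `score_kmeans` are illustrative names only. It is not how PyCaret computes its score grid internally.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data (assumption): four well-separated blobs in 2-D.
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[-10, -10], [-10, 10], [10, -10], [10, 10]],
    cluster_std=1.0,
    random_state=42,
)


def score_kmeans(X, k):
    """Fit k-means with k clusters and return the three internal metrics."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),                 # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }


scores = {k: score_kmeans(X_demo, k) for k in (3, 4)}
for k, metrics in scores.items():
    print(k, metrics)
```

Because the synthetic data really contains four groups, the 4-cluster fit should win on all three metrics here. On real data the metrics can disagree, so it helps to look at them together.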

In [None]:
### agglomerative clustering model

In [None]:
# the jewellery dataset is not hierarchical in nature, so to demo agglomerative clustering we use the iris dataset with sklearn and scipy

# load packages
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load iris dataset
iris = load_iris()
X = iris.data

# create instance of agglomerative clustering
agg_clust = AgglomerativeClustering(n_clusters=4)

#Fit model to data
agg_clust.fit(X)

#Obtain predicted cluster labels
cluster_labels = agg_clust.labels_

#print labels
print(cluster_labels)

In [None]:
# get evaluation metrics
from sklearn.metrics import silhouette_score, calinski_harabasz_score

silo_score = silhouette_score(X, cluster_labels)
cali_score = calinski_harabasz_score(X, cluster_labels)

print("Silhouette Score:", silo_score)
print("Calinski-Harabasz Score:", cali_score)

In [None]:
# building the graph
from scipy.cluster.hierarchy import dendrogram, linkage

# building linkage matrix, which stores info on the hierarchical structure of the data
linkage_matrix = linkage(X, method = 'ward')

# building the plot/graph
plt.figure(figsize = (20,6))
dendrogram(linkage_matrix)
plt.title('Dendrogram')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.grid(False)
plt.show()

Each data point is represented as a leaf node, and the branches of the tree represent the clusters formed at different levels of the hierarchy. The height of the lines represents the dissimilarity between the clusters being merged: longer lines mean more dissimilarity, while shorter lines mean more similarity. Note that this graph does not tell you how many clusters you should have.
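Since the dendrogram itself does not pick a number of clusters, one option is to cut the tree yourself. The sketch below uses SciPy's `fcluster` on the same iris data, first by asking for a fixed number of flat clusters and then by cutting at a merge height. The variable names (`X_iris`, `Z`, `labels_k3`, `labels_cut`) and the choice of 3 clusters are assumptions for illustration.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

# Rebuild the Ward linkage matrix on the iris features.
X_iris = load_iris().data
Z = linkage(X_iris, method="ward")

# Option 1: cut the tree so that at most 3 flat clusters remain.
labels_k3 = fcluster(Z, t=3, criterion="maxclust")

# Option 2: cut at a chosen height. Column 2 of Z holds the merge distances,
# so cutting at the second-largest one undoes only the final merge.
cut_height = Z[-2, 2]
labels_cut = fcluster(Z, t=cut_height, criterion="distance")

print("Clusters with maxclust=3:", len(set(labels_k3)))
print("Clusters when cutting at height", round(cut_height, 2), ":", len(set(labels_cut)))
```

Cutting where the vertical lines are longest is a common heuristic, but it remains a judgment call rather than something the dendrogram decides for you.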

In [None]:
# from here onwards, we will use kmeans for plotting/analysis

In [None]:
# elbow plot
plot_model(kmeans, 'elbow')

Interpretation of the graph above: this plot helps determine the best number of clusters (k) for a k-means clustering model. It shows the relationship between k and the within-cluster sum of squares, which measures how compact the clusters are. The bend in the curve is called the elbow point, which represents a trade-off between a low within-cluster sum of squares and avoiding excessive complexity/overfitting. In other words, it is the k value that balances cluster quality against model simplicity.

Based on this, we should try 5 clusters.
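The idea behind the elbow plot can be sketched by hand: fit k-means for a range of k values and record the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. The example uses synthetic data with five groups (an assumption, not the jewellery data), and it is an illustration of the concept rather than the exact computation behind `plot_model`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): five well-separated blobs in 2-D.
X_elbow, _ = make_blobs(
    n_samples=500,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10], [-10, -10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for k = 1..9 and record the within-cluster sum of squares.
k_values = list(range(1, 10))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_elbow).inertia_
    for k in k_values
]

# The elbow is where the improvement from adding one more cluster flattens out.
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: within-cluster sum of squares = {inertia:.1f}")
```

In this synthetic case the drop from k=4 to k=5 is large and the drop from k=5 to k=6 is small, which is the elbow pattern described above.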

In [None]:
### Best kmeans based on elbow

In [None]:
# Creating a kmeans model with 5 clusters
kmeans_best = create_model('kmeans', num_clusters = 5)
print(kmeans_best)

In [None]:
### Plotting the model

In [None]:
# plot the new kmeans model
plot_model(kmeans_best, 'cluster')

In [None]:
# Silhouette plot
plot_model(kmeans_best, plot = 'silhouette')

In [None]:
# Distribution plot (how big are the clusters)
plot_model(kmeans_best, plot = 'distribution')

In [None]:
# using the evaluate method, which includes all graphs above with an interactive component
evaluate_model(kmeans_best)

In [None]:
### Assigning Clusters

In [None]:
# Assigning clusters to the data
kmeans_cluster = assign_model(kmeans_best)
kmeans_cluster

In [None]:
### Predictions

In [None]:
# Predicting on the unseen data
kmeans_pred = predict_model(kmeans_best, data=data_unseen)
kmeans_pred

#### Simple customer segmentation example

One use of clustering algorithms is building customer profiles and performing customer segmentation analysis. This is a marketing technique used to separate customers and/or products into groups based on similar features or traits. The purpose is to better understand the values, needs, and behaviours of each group, so that we can optimize our marketing techniques for them. Below is a simple example of assigning traits to each of our clusters to get better insight into who they are.

In [None]:
# looking at the average of each column for each cluster to make profiles
avg_data=kmeans_cluster.groupby(['Cluster'],as_index=True).mean()
print(avg_data)

We can apply traits to, or rename, the clusters to help us tailor marketing plans to each group. For example, we can name the clusters as follows:

- Cluster 0: "Savvy Savers" - Young with moderate income and savings. Don't spend a lot, maybe saving for a home.
- Cluster 1: "Active Spenders" - Older with moderate income. More willing to spend and save less. Might enjoy traveling.
- Cluster 2: "Wise Planners" - Elderly with low income. Thrifty and big savers, likely retired and only spend as needed.
- Cluster 3: "Upcoming IG Influencers" - Young with high income. Prioritize spending vs saving (gotta spend $$$ for followers).
- Cluster 4: "Penny Pinchers" - Elderly with high income and low spending. Maybe business owners or wealthy, hesitant to spend money unless needed.
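The names above can be attached to the data with a simple lookup. The sketch below uses a small made-up DataFrame in place of the `assign_model` output and assumes the `Cluster` column holds labels such as `"Cluster 0"`. The column name `Segment` and the variables `toy_clusters` and `segment_names` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the assign_model output (values invented for illustration).
toy_clusters = pd.DataFrame(
    {
        "Age": [25, 60, 85, 30, 88],
        "Cluster": ["Cluster 0", "Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"],
    }
)

# Lookup table from cluster label to the persona names listed above.
segment_names = {
    "Cluster 0": "Savvy Savers",
    "Cluster 1": "Active Spenders",
    "Cluster 2": "Wise Planners",
    "Cluster 3": "Upcoming IG Influencers",
    "Cluster 4": "Penny Pinchers",
}

# Add a readable segment column; unmapped labels would show up as NaN.
toy_clusters["Segment"] = toy_clusters["Cluster"].map(segment_names)
print(toy_clusters)
```

If the cluster labels in your own output are formatted differently, adjust the keys of `segment_names` to match before mapping.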

In [None]:
### Saving/loading the model

In [None]:
# Saving the model
save_model(kmeans_best, 'kmeans_pipeline')

In [None]:
# Loading the model
loaded_model = load_model('kmeans_pipeline')
print(loaded_model)