# Clustering

#### Lecture example

The example below is provided in case you need or want a more detailed look at measuring distance. Most of you can safely ignore it.

In [None]:
import pandas as pd
import numpy as np
import math as m

a = np.array([3, 4, 1])
b = np.array([1, 2, 2])
print( a-b)
edist = m.sqrt(sum((a-b)**2))
print("Distance from a to b: ", edist) 
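As a cross-check, the same Euclidean distance can be obtained with NumPy's `np.linalg.norm`. This is a minimal sketch that reuses the `a` and `b` vectors from the cell above; the names `manual` and `with_norm` are only illustrative.

```python
import math
import numpy as np

a = np.array([3, 4, 1])
b = np.array([1, 2, 2])

# Square the per-coordinate differences, add them up, take the square root
manual = math.sqrt(np.sum((a - b) ** 2))

# np.linalg.norm returns the Euclidean (L2) length of the difference vector
with_norm = np.linalg.norm(a - b)

print(manual, with_norm)  # both are 3.0, since 2**2 + 2**2 + (-1)**2 = 9
```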

## Textbook Example

In [None]:
import pandas as pd
import numpy as np
import math as m

# If you are following along in the textbook, you need to add the code below.
from sklearn.datasets import make_blobs

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  

### Blobs?

What are we doing here? This is just a cheesy but quick way to create a large number of data samples, so we can demonstrate the algorithm. That is all. 

In [None]:
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=.6, random_state=0)

plt.scatter(X[:, 0], X[:, 1], s=50);

In [None]:
X

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

The choice of how many clusters to use can get a little confusing... let's just try this with a different k and see what happens!

In [None]:
kmeans = KMeans(n_clusters=7)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

## Example using Simulated Retail Sales Data

This example uses data derived from the Online Retail dataset in the UCI Machine Learning Repository. It is a well-used dataset, intended for demonstrating clustering and classification tasks. The upside of that: lots of versions and ideas on how to cluster this dataset can be found online. The code below comes from various sources and texts, including the scikit-learn help pages. This is a fairly standard (if uninspiring) way to do k-means, and a standard (if uninspiring) well-documented example.

https://archive.ics.uci.edu/ml/datasets/Online+Retail

https://scikit-learn.org/stable/search.html?q=kmeans


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans

path = "../data/rfmfile.csv"
df = pd.read_csv(path)
df.head()

In [None]:
sns.jointplot(x= 'recency', y='frequency', data=df, kind = 'scatter')


In [None]:
df.columns

In [None]:
modelvar = df.loc[:,
    ['recency', 'frequency', 'monetary_value']]

sns.pairplot(modelvar)

In [None]:
#adjust plot size
sns.heatmap(modelvar.corr(), cmap = 'Wistia', annot = True)
plt.title('Correl. for model data', fontsize = 20)
plt.show()

## Getting ready for K means in a nutshell

(or a Python Shell anyway)

K-means has some assumptions. We won't go into a lot of detail in this overview, but the data we have probably needs some cleaning...

In [None]:
df.set_index('customer_id', inplace=True)
df.head()

In [None]:
df.describe()

In [None]:
df['recency'].plot(kind='kde', figsize=(15, 3))
plt.show()

In [None]:
df['frequency'].plot(kind='kde', figsize=(15, 3))
plt.show()

In [None]:
df['monetary_value'].plot(kind='kde', figsize=(15, 3))
plt.show()

In [None]:
#... it is also easy to do this all in one line

import matplotlib.pyplot as plt
df.plot(kind='density', subplots=True, sharex=False, figsize=(16, 10))
plt.show()

In [None]:
#standardizing

from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
rfm_std = scale.fit_transform(df)


In [None]:
rfm_std

In [None]:
df_std = pd.DataFrame(data = rfm_std, 
                            index = df.index, 
                            columns = df.columns)
df_std.describe()
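To make the scaling step less of a black box, here is a minimal sketch of what `StandardScaler` does to each column: subtract the column mean and divide by the column's population standard deviation. The `toy` DataFrame and its values are made up for illustration; only the column names mirror the RFM file.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column names mirror the RFM file, values are made up
toy = pd.DataFrame({
    "recency": [10, 20, 30, 40],
    "frequency": [1, 2, 2, 7],
    "monetary_value": [5.0, 50.0, 500.0, 45.0],
})

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
by_hand = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, by_hand.to_numpy()))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation just because of its units.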

### Let's do some K means



In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

#### Deciding on K

A popular way of determining K is through silhouette coefficients. If you'd like to read more on these, links are below. They are somewhat problematic, for reasons we won't delve into, but they are easier to understand than some of the other methods. A silhouette coefficient near 1 means that the sample is far away from the other clusters. This is a good thing, because we want our clusters to be separated. A value near -1 means the sample is very close to another cluster, so it may be mis-assigned. This is bad.

Shortcut to the above:

* Silhouette coefficients are okay to use if you have nothing better (like domain knowledge or even a graph)
* +1 = good
* -1 = bad
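For anyone who wants to see where the number comes from: for a single sample the coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. The sketch below works this out by hand on a tiny made-up dataset (`X_toy` and `labels_toy` are hypothetical) and compares it with scikit-learn's `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two tight, well-separated groups on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

# By hand for the first point (value 0.0):
# a = mean distance to the rest of its own cluster
a_dist = abs(0.0 - 1.0)
# b = mean distance to the points of the nearest other cluster
b_dist = np.mean([abs(0.0 - 10.0), abs(0.0 - 11.0)])
s_first = (b_dist - a_dist) / max(a_dist, b_dist)

print(s_first)
print(silhouette_samples(X_toy, labels_toy))
```

The first value printed by `silhouette_samples` matches the hand calculation, and all four values are close to 1 because the two groups are far apart.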

Ok... now to do the work. Don't try to memorize this code. Just copy and paste it.

##### Option 1
The first "hack" gets us a quick and dirty graphical representation of the within-cluster sum of squared errors for each K. Where the plot stops its steep decline and starts leveling out, that's your K. Yup, that sounds about as unscientific as it is! This happens to be my go-to approach, because it is just so easy.

##### Option 2
This one comes from the scikit learn documentation. I think this is a better approach-- but better in clustering is pretty subjective. It does just give you a nice "this is the score" output. 

##### Here is Option 1

We need to decide on the number of clusters. We google and find lots of solutions using "elbow plots". Here is one of them:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

This is a fairly common way to develop this elbow plot. 

In [None]:
elbow_sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k,
                random_state=1957) 
    km.fit(df_std)
    elbow_sse[k] = km.inertia_

In [None]:
sns.pointplot(x=list(elbow_sse.keys()), y=list(elbow_sse.values()))
plt.show()
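The value stored for each k in the elbow loop is `inertia_`: the sum of squared distances from every sample to its assigned cluster center. The sketch below recomputes it by hand on synthetic stand-in data (`X_demo` comes from `make_blobs`, not the RFM file) so you can see there is nothing mysterious in it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the RFM file)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=1957).fit(X_demo)

# Look up each sample's own center through its label
own_centers = km_demo.cluster_centers_[km_demo.labels_]

# Sum of squared distances from each sample to its center
sse_by_hand = np.sum((X_demo - own_centers) ** 2)

print(sse_by_hand, km_demo.inertia_)
```

Adding clusters generally drives this number down, which is why the plot is read for the point where the improvement levels off rather than for its minimum.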

##### Option 2

For this option we need to use our array data, not the pandas dataframe. 

In [None]:
#rfm_std

X = rfm_std
range_n_clusters = [2, 3, 4, 5, 6]


In [None]:
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

In [None]:
#let's start with 2
k = 2
kmeans = KMeans(n_clusters=k, random_state=1957)
kmeans.fit(df_std)

df_cluster2 = df_std.assign(Cluster=kmeans.labels_)

df_cluster2.groupby('Cluster').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary_value': ['mean', 'count'],
}).round(0)


In [None]:
# predict on df_std (the frame the model was fit on) to avoid a feature-name warning
y_kmeans = kmeans.predict(df_std)
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);

darn it!

In [None]:
k = 3
kmeans = KMeans(n_clusters=k, random_state=1957)
kmeans.fit(df_std)
df_cluster3 = df_std.assign(Cluster=kmeans.labels_)
df_cluster3.groupby('Cluster').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary_value': ['mean', 'count'],
}).round(0)

In [None]:
# predict on df_std (the frame the model was fit on) to avoid a feature-name warning
y_kmeans = kmeans.predict(df_std)
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);

super darn it... 

In [None]:
k = 4
kmeans = KMeans(n_clusters=k, random_state=1957)
kmeans.fit(df_std)
df_cluster4 = df_std.assign(Cluster=kmeans.labels_)
df_cluster4.groupby('Cluster').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary_value': ['mean', 'count'],
}).round(0)

In [None]:
# predict on df_std (the frame the model was fit on) to avoid a feature-name warning
y_kmeans = kmeans.predict(df_std)
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);

... this is getting stupid.... are any of these actually useful?

In [None]:
df_m = pd.melt(df_std.assign(Cluster=kmeans.labels_).reset_index(),
                        id_vars=['customer_id', 'Cluster'],
                        value_vars=['recency', 'frequency', 'monetary_value'],
                        var_name='rfm category', 
                        value_name='Value'
                       )

In [None]:
plt.title('Plot of variables and clusters')
sns.lineplot(data=df_m, x='rfm category', y='Value', hue='Cluster')
plt.show()

# Agglomerative/Hierarchical Clustering

## Example using Simulated Credit Card Data

This dataset comes to us from Kaggle. Check it out at the link below. One reason I really like using these types of examples is the many derivative works (code in Python and other languages) that you can find using this data. It makes it easier to learn when you can follow along with multiple examples. 

https://www.kaggle.com/datasets/arjunbhasin2013/ccdata?resource=download

Here is just one example (I am not arguing it is good or bad...but it is a nicely written example!)

https://www.kaggle.com/code/ankits29/credit-card-customer-clustering-with-explanation



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans

path = "../data/cc_data.csv"
df = pd.read_csv(path)
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isna().sum()

In [None]:
df.CUST_ID.is_unique

In [None]:
df.set_index('CUST_ID', inplace=True)
df.head()

In [None]:
df.columns= df.columns.str.lower()
df.columns

Before we do anything with the missing data, we should check to see if it is skewed. Dealing with missing data is always a little tricky... and sometimes it takes some time with a dataset to get right.

One way to check for skew is a graph (I have shown those before!). Another is the .skew method in pandas. Skew is a measure of the asymmetry of the distribution: 0 means it is not skewed, a negative value means it has a tail on the left, and a positive value means it has a tail on the right.

Would we EXPECT to see a strong negative skew in something like minimum_payments? Not really. The column is bounded by 0 at the low end, so its skew should be positive (a tail to the right).
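Here is a quick sketch of how the sign of the skew behaves, using three tiny made-up Series (the names and values are hypothetical, chosen only to make the tails obvious). It also prints the mean and median, since the gap between them is the reason skew matters when filling in missing values.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tail = pd.Series([1, 2, 3, 4, 100])   # one large value stretches the right side
left_tail = pd.Series([-100, 1, 2, 3, 4])   # one very small value stretches the left side

for name, s in [("symmetric", symmetric),
                ("right tail", right_tail),
                ("left tail", left_tail)]:
    print(name, "skew:", round(s.skew(), 2),
          "mean:", s.mean(), "median:", s.median())
```

In the skewed cases the mean gets dragged toward the tail while the median barely moves, which is why the median is a common fill value for skewed columns.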

In [None]:
df.skew()

In [None]:
sns.histplot(df.minimum_payments)

In [None]:
#fix the missing values by filling them in

#forward fill
#df.ffill(inplace=True)

#backfill
#df.bfill(inplace=True)

#median... this could also be mean or mode
df.fillna(df.median(), inplace= True)

In [None]:
df.isna().sum()

In [None]:
#adjust plot size
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), cmap = 'Wistia')
plt.title('Correl. for model data', fontsize = 20)
plt.show()

## Let's scale these values


In [None]:
#standardizing

from sklearn.preprocessing import StandardScaler

# rescale each column to mean 0 and standard deviation 1 (this does not remove skew)
scale = StandardScaler()
df_std = scale.fit_transform(df)


Here we are going to apply an approach called principal component analysis (PCA) to our data. PCA (in a nutshell) reduces the dimensions of our data while trying to preserve as much of the data's information as possible.

If you are interested in reading more on PCA, I have linked a few resources. We won't cover it more in this video. 

In [None]:
from sklearn.decomposition import PCA
import scipy.cluster.hierarchy as shc

pca = PCA(n_components = 2)
df_pca = pca.fit_transform(df_std)
df_dr = pd.DataFrame(df_pca)
df_dr.columns = ['P1', 'P2']

In [None]:
df_dr
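If you are curious how much information two components actually keep, a fitted `PCA` object exposes `explained_variance_ratio_`. The sketch below is self-contained and uses randomly generated stand-in data (the names `data`, `pca_demo`, and `reduced` are illustrative, not the credit card file), so the numbers it prints say nothing about our dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 5 correlated columns built from 2 underlying signals
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = base @ mixing + 0.1 * rng.normal(size=(200, 5))

data_std = StandardScaler().fit_transform(data)

pca_demo = PCA(n_components=2)
reduced = pca_demo.fit_transform(data_std)

# One row per sample, one column per kept component
print(reduced.shape)

# Share of the total variance captured by each component, and their sum
print(pca_demo.explained_variance_ratio_,
      pca_demo.explained_variance_ratio_.sum())
```

On the real data you could run the same two print statements against the `pca` object from the cell above; a low total would be a hint that two components throw away a lot.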

The method we are using below (Ward linkage) is a little different than the simplified lecture video. The documentation has more detail if you are interested in learning more...  but that is optional. 

In [None]:
plt.figure(figsize =(8, 8))
plt.title('Our Nice Clusters')
Dendrogram = shc.dendrogram((shc.linkage(df_dr, 
                                         method ='ward')))
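A dendrogram is only a picture; to turn it into cluster labels you can cut the tree with SciPy's `fcluster`. This is a rough sketch on synthetic stand-in points (`points` comes from `make_blobs`, not `df_dr`), and the choice of 3 clusters is arbitrary, just to show the call.

```python
import numpy as np
import scipy.cluster.hierarchy as shc
from sklearn.datasets import make_blobs

# Synthetic stand-in for df_dr: 150 two-dimensional points
points, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Same Ward linkage as the dendrogram cell, kept in a variable so it can be reused
Z = shc.linkage(points, method="ward")

# Cut the tree so that at most 3 flat clusters remain
flat_labels = shc.fcluster(Z, t=3, criterion="maxclust")

# Cluster ids (starting at 1) and how many points landed in each
print(np.unique(flat_labels, return_counts=True))
```

Changing `t` and re-running is a cheap way to compare different cuts of the same tree without recomputing the linkage.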

What is the right number of clusters? How many clusters should these transactions be binned into?

In [None]:
#this simple code should tell us!
# return(truth)   # if only it were that simple! (as real code this is a SyntaxError: 'return' outside a function)


In [None]:
from sklearn.cluster import AgglomerativeClustering

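# Ward linkage only works with Euclidean distance, which is the default
# (the 'affinity' argument was renamed to 'metric' in newer scikit-learn versions)
clustering_model = AgglomerativeClustering(n_clusters=6, linkage='ward')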
clustering_model.fit(df_dr)
segment_labels = clustering_model.labels_

In [None]:
len(segment_labels)

In [None]:
#let's add these to our data...
df['newSegments'] = segment_labels.tolist()

In [None]:
df

In [None]:
pd.pivot_table(df, 
               index='newSegments',
               aggfunc='mean')

#mean is the default agg function for pivot tables... just sharing the full(er) code here

In [None]:
df.describe()