# **Customer Segmentation Notebook Showing Modeling Using Agglomerative Clustering**

### In this notebook, as in the first modeling notebook (KMeans), we use the previously scaled version of our dataframe to examine the customer clusters extracted from the custom features we created.  Here the focus is Agglomerative Clustering, and the summary covers how its results differ.

In [11]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from yellowbrick.cluster import SilhouetteVisualizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA

#### Let's read in our saved dataframe so we can start to work with it here in this notebook again.

In [13]:
df_user_scaled = pd.read_csv('/Users/ryanm/Desktop/df-scaled.csv')


print(df_user_scaled.shape)
print(df_user_scaled.columns)
print(df_user_scaled.head(25))
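# Print the NaN counts explicitly; a bare expression mid-cell is not displayed
print(df_user_scaled.isna().sum())

# Work on a copy so later column assignments do not modify df_user_scaled
agg_scaled = df_user_scaled.copy()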

(206209, 11)
Index(['order_dow_mean', 'order_hour_of_day_mean', 'time_between_purchases',
       'purchase_frequency', 'kmeans_opt_cluster', 'agg_ward_cluster',
       'agg_complete_cluster', 'agg_average_cluster', 'agg_single_cluster',
       'dbscan_cluster', 'user_id'],
      dtype='object')
    order_dow_mean  order_hour_of_day_mean  time_between_purchases  \
0        -0.269018               -1.557111                0.764202   
1        -0.648385               -1.428500                0.110680   
2        -1.773842                1.341157               -0.396789   
3         2.174108               -0.467299               -0.188850   
4        -1.065689                1.143728               -0.266827   
5         0.970249                1.775503               -0.801114   
6        -0.959466                0.006532               -0.622056   
7         1.324325               -5.174023                1.624838   
8        -0.800132                0.354009                0.238579   
9   

#### The imported data has already been scaled using StandardScaler, which is appropriate for Agglomerative Clustering.  PCA with n_components = 2 was already applied in a previous notebook, but we repeat it here to show the process and to recreate our key pca_features variable.

In [8]:
n_components = 2
pca_fit = PCA(n_components=n_components)
pca_features = pca_fit.fit_transform(agg_scaled)

ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
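The `ValueError` above shows that the frame passed to PCA contains NaN values. The frame also carries earlier cluster-label columns and `user_id`, which are not scaled features. Below is a minimal sketch of one way to prepare the PCA input. It assumes the four feature columns listed in the output above are the intended inputs and that rows with missing feature values can be dropped. The helper name `prepare_pca_input` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Assumption: these four scaled columns are the intended PCA inputs
FEATURE_COLS = ['order_dow_mean', 'order_hour_of_day_mean',
                'time_between_purchases', 'purchase_frequency']

def prepare_pca_input(df, feature_cols = FEATURE_COLS):
    """Keep only the feature columns and drop rows with NaN in any of them."""
    return df[feature_cols].dropna().reset_index(drop = True)

# Toy frame standing in for the real data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size = (50, 4)), columns = FEATURE_COLS)
toy['agg_ward_cluster'] = np.nan             # label column with missing values
toy['user_id'] = np.arange(50)
toy.loc[3, 'purchase_frequency'] = np.nan    # one missing feature value

features = prepare_pca_input(toy)
pca_demo = PCA(n_components = 2).fit_transform(features)
print(features.shape, pca_demo.shape)
```

If any rows are dropped this way, the same filter has to be applied to the dataframe used in later cells so that row positions still line up with `pca_features`.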

#### Elbow plots are not useful here, since Agglomerative Clustering does not optimize an inertia-based criterion the way KMeans does, so we go straight to silhouette plots to estimate how many clusters the Agglomerative Clustering method finds in our dataframe.  A key difference with this method is that we compare 4 linkage methods (ward, complete, average, and single), so the same steps are repeated for each one.  As before, we use the yellowbrick library for visualization.

In [None]:
linkage_methods = ['ward', 'complete', 'average', 'single']
silhouette_scores = {method: [] for method in linkage_methods}

for method in linkage_methods:
    for n_clusters in range(2,11):
        agg = AgglomerativeClustering(n_clusters = n_clusters, linkage = method)
        visualizer = SilhouetteVisualizer(agg, colors = 'yellowbrick')
        visualizer.fit(pca_features)
        visualizer.show()
        silhouette_scores[method].append(visualizer.silhouette_score_)

silhouette_agg_df = pd.DataFrame(silhouette_scores, index = range(2,11))
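As a cross-check that does not depend on yellowbrick, the same kind of score table can be computed directly with scikit-learn's `silhouette_score`. This is a minimal sketch on synthetic data, assuming a small random subsample is acceptable for the comparison. The helper `silhouette_by_linkage` and the blob data are hypothetical stand-ins for a sample of `pca_features`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_by_linkage(X, linkage_methods, k_values):
    """Return silhouette scores: rows are k, columns are linkage methods."""
    scores = {method: [] for method in linkage_methods}
    for method in linkage_methods:
        for k in k_values:
            model = AgglomerativeClustering(n_clusters = k, linkage = method)
            labels = model.fit_predict(X)
            scores[method].append(silhouette_score(X, labels))
    return pd.DataFrame(scores, index = list(k_values))

# Synthetic 2-D data standing in for a sample of pca_features
X_demo, _ = make_blobs(n_samples = 300, centers = 3, cluster_std = 0.6, random_state = 42)
demo_scores = silhouette_by_linkage(X_demo, ['ward', 'complete', 'average', 'single'], range(2, 6))
print(demo_scores.round(3))
```

The result has the same layout as `silhouette_agg_df`, so it can be passed to the same line plot.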

In [None]:
plt.figure(figsize = (10,8))
for method in linkage_methods:
    # Pass the variable, not the string 'method', so each line gets its own legend entry
    plt.plot(silhouette_agg_df.index, silhouette_agg_df[method], marker = 'o', label = method)

plt.title('Silhouette Scores for Different Linkage Methods')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.legend()
plt.grid(True)
plt.show()
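Once `silhouette_agg_df` is filled in, the best-scoring cluster count per linkage method can be read off with `idxmax`. This small sketch assumes the highest silhouette score is the selection criterion. The scores below are made-up placeholders, not results from this dataset.

```python
import pandas as pd

# Placeholder scores for illustration only (not results from this notebook)
demo_df = pd.DataFrame(
    {'ward': [0.40, 0.45, 0.38], 'single': [0.52, 0.30, 0.21]},
    index = [2, 3, 4],
)

best_k = demo_df.idxmax()     # k with the highest score, per linkage method
best_score = demo_df.max()    # the corresponding score
summary = pd.DataFrame({'best_k': best_k, 'best_score': best_score})
print(summary)
```

A high score for single linkage can come from one very large cluster plus a few tiny ones, so the score is worth reading alongside the cluster sizes.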

#### Saved for OBS

#### Now we set our key optimal_k variable as we did in the previous notebook, create variables for each Agglomerative Clustering linkage method to use in the plots that follow, and then look at some pairplots for any notable relationships.  Note that we fit on only a small sample of the data here (15,000 rows), as Agglomerative Clustering is quite memory-intensive.

In [None]:
optimal_k = 

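def agg_clustering_mod_sample(linkage, data, df, cluster_col_name, sample_size = 15000):
    # Draw a random sample of row positions without replacement
    sample_indices = np.random.choice(len(data), size = sample_size, replace = False)
    sample_data = data[sample_indices]
    sample_df = df.iloc[sample_indices].reset_index(drop = True)
    
    # Fit the chosen linkage method on the sample only
    agg = AgglomerativeClustering(n_clusters = optimal_k, linkage = linkage)
    sample_df[cluster_col_name] = agg.fit_predict(sample_data)
    
    # Mark unsampled rows with -1, then write the labels back by position.
    # .to_numpy() avoids pandas aligning on sample_df's reset index, which
    # would leave NaN in most of the sampled rows.
    df[cluster_col_name] = -1
    df.iloc[sample_indices, df.columns.get_loc(cluster_col_name)] = sample_df[cluster_col_name].to_numpy()
    
    return df, sample_data, sample_df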

sample_size = 15000

agg_scaled, pca_features_sampled_ward, agg_scaled_sampled_ward = agg_clustering_mod_sample('ward', pca_features, agg_scaled, 'agg_ward_cluster', sample_size = sample_size)
agg_scaled, pca_features_sampled_complete, agg_scaled_sampled_complete = agg_clustering_mod_sample('complete', pca_features, agg_scaled, 'agg_complete_cluster', sample_size = sample_size)
agg_scaled, pca_features_sampled_average, agg_scaled_sampled_average = agg_clustering_mod_sample('average', pca_features, agg_scaled, 'agg_average_cluster', sample_size = sample_size)
agg_scaled, pca_features_sampled_single, agg_scaled_sampled_single = agg_clustering_mod_sample('single', pca_features, agg_scaled, 'agg_single_cluster', sample_size = sample_size)
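To make the memory concern concrete: hierarchical clustering without a connectivity constraint generally works from pairwise distances, which grow quadratically with the number of rows. Here is a rough back-of-the-envelope sketch, assuming a condensed float64 distance matrix of n(n-1)/2 entries. That is an assumption about the underlying implementation, and the helper name is hypothetical.

```python
def condensed_distance_bytes(n_rows, bytes_per_value = 8):
    """Bytes needed for n(n-1)/2 pairwise distances stored as float64."""
    return n_rows * (n_rows - 1) // 2 * bytes_per_value

full_gb = condensed_distance_bytes(206209) / 1e9     # all rows in the dataframe
sample_gb = condensed_distance_bytes(15000) / 1e9    # the 15,000-row sample
print(f'full: {full_gb:.1f} GB, sample: {sample_gb:.2f} GB')
```

Under that assumption the full dataframe would need roughly 170 GB for distances alone, while the 15,000-row sample needs under 1 GB, which is why the fits above run on a sample.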


#### With our key variables created let's look at the pairplots.

In [None]:
sns.pairplot(agg_scaled_sampled_ward, hue = 'agg_ward_cluster', palette = 'viridis', diag_kind= 'kde')
plt.suptitle('Agglomerative Clustering (Euclidean, Ward)', y = 1.02)
plt.show()

sns.pairplot(agg_scaled_sampled_complete, hue = 'agg_complete_cluster', palette = 'viridis', diag_kind = 'kde')
plt.suptitle('Agglomerative Clustering (Euclidean, Complete)', y = 1.02)
plt.show()

sns.pairplot(agg_scaled_sampled_average, hue = 'agg_average_cluster', palette = 'viridis', diag_kind= 'kde')
plt.suptitle('Agglomerative Clustering (Euclidean, Average)', y = 1.02)
plt.show()

sns.pairplot(agg_scaled_sampled_single, hue = 'agg_single_cluster', palette = 'viridis', diag_kind = 'kde')
plt.suptitle('Agglomerative Clustering (Euclidean, Single)', y = 1.02)
plt.show()

#### MKDN for OBS

#### Now let's look at some quick histplots of cluster population size for each linkage method.  We are looking for cluster sizes that are as close to equal as possible within each linkage method.

In [None]:
def agg_histplot_cluster(cluster_labels, title):
    plt.figure(figsize = (10,8))
    sns.histplot(cluster_labels, bins = len(np.unique(cluster_labels)), kde = False)
    plt.title(f'Cluster Size Distribution - {title}')
    plt.xlabel('Cluster')
    plt.ylabel('Number of Points')
    plt.show()

agg_histplot_cluster(agg_scaled_sampled_ward['agg_ward_cluster'], 'Agglomerative Clustering (Ward)')
agg_histplot_cluster(agg_scaled_sampled_complete['agg_complete_cluster'], 'Agglomerative Clustering (Complete)')
agg_histplot_cluster(agg_scaled_sampled_average['agg_average_cluster'], 'Agglomerative Clustering (Average)')
agg_histplot_cluster(agg_scaled_sampled_single['agg_single_cluster'], 'Agglomerative Clustering (Single)')
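The histograms can be backed by numbers. This minimal sketch reports each cluster's share of the sample and the ratio of the largest to the smallest cluster, assuming that a ratio close to 1 is what "similar population size" means here. The helper `cluster_balance` and the toy labels are hypothetical.

```python
import pandas as pd

def cluster_balance(labels):
    """Return per-cluster shares and the largest-to-smallest size ratio."""
    counts = pd.Series(labels).value_counts().sort_index()
    shares = counts / counts.sum()
    return shares, counts.max() / counts.min()

# Toy labels: cluster 0 has 6 points, cluster 1 has 3, cluster 2 has 1
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
shares, ratio = cluster_balance(toy_labels)
print(shares)
print(f'largest / smallest = {ratio:.1f}')
```

The same call could be made on a column such as `agg_scaled_sampled_ward['agg_ward_cluster']` to compare linkage methods side by side.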

#### MKDN for OBS (choose linkage method here)

#### Now let's look at a 3D render for our chosen optimal_k value and linkage method, and also take a look at outliers in this section.