# <center>Cluster Analysis

To perform our cluster analysis, we’ll use a comprehensive dataset from *Kaggle*, the "Credit Card Dataset for Clustering." This dataset contains various features related to credit card usage and customer behavior. Our task will be to apply clustering techniques to group customers into distinct segments based on their spending patterns, allowing us to identify different customer profiles and behaviors for targeted financial strategies.

In [2]:
# %pip install kaggle --q
# %pip install pingouin --q
# %pip install seaborn --q
# %pip install nbformat --q

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import scipy.stats as stats
from scipy.stats import zscore
from scipy.stats import chi2_contingency
from scipy.spatial.distance import pdist
import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pingouin as pg
import warnings
warnings.filterwarnings('ignore')
pio.renderers.default = 'vscode'

In [4]:
# Name of the dataset on Kaggle
dataset = 'arjunbhasin2013/ccdata'

# Directory where the dataset will be saved
download_dir = '/content/ccdata'

# Download the dataset and unzip it
!kaggle datasets download -d {dataset} -p {download_dir} --unzip

# List the downloaded files (os.listdir works on any OS, unlike the shell command `ls`)
import os
print(os.listdir(download_dir))

# Path to the CSV file inside the unzipped directory
csv_file_path = f"{download_dir}/CC GENERAL.csv"  

# Read the CSV file into a DataFrame
df = pd.read_csv(csv_file_path)

# Drop the first column (the customer ID), which is an identifier, not a feature
df = df.iloc[:, 1:]
df.head()

Dataset URL: https://www.kaggle.com/datasets/arjunbhasin2013/ccdata
License(s): CC0-1.0
Downloading ccdata.zip to /content/ccdata






Unnamed: 0,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


* What do we want?
  * We aim to group the customers into clusters so that customers within the same cluster are similar, while customers in different clusters are dissimilar.
  * The clustering is based on all the metric variables available:

In [5]:
df.info() # Metric values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 17 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   BALANCE                           8950 non-null   float64
 1   BALANCE_FREQUENCY                 8950 non-null   float64
 2   PURCHASES                         8950 non-null   float64
 3   ONEOFF_PURCHASES                  8950 non-null   float64
 4   INSTALLMENTS_PURCHASES            8950 non-null   float64
 5   CASH_ADVANCE                      8950 non-null   float64
 6   PURCHASES_FREQUENCY               8950 non-null   float64
 7   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 8   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 9   CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 10  CASH_ADVANCE_TRX                  8950 non-null   int64  
 11  PURCHASES_TRX                     8950 non-null   int64  
 12  CREDIT

* Should I standardize the data?
  * Look at the minimum and maximum values (the range of each variable).

In [6]:
df.describe().T # It is observed that these variables are on different scales/different measurement units.
                # Therefore, it is necessary to carry out standardization. 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BALANCE,8950.0,1564.474828,2081.531879,0.0,128.281915,873.385231,2054.140036,19043.13856
BALANCE_FREQUENCY,8950.0,0.877271,0.236904,0.0,0.888889,1.0,1.0,1.0
PURCHASES,8950.0,1003.204834,2136.634782,0.0,39.635,361.28,1110.13,49039.57
ONEOFF_PURCHASES,8950.0,592.437371,1659.887917,0.0,0.0,38.0,577.405,40761.25
INSTALLMENTS_PURCHASES,8950.0,411.067645,904.338115,0.0,0.0,89.0,468.6375,22500.0
CASH_ADVANCE,8950.0,978.871112,2097.163877,0.0,0.0,0.0,1113.821139,47137.21176
PURCHASES_FREQUENCY,8950.0,0.490351,0.401371,0.0,0.083333,0.5,0.916667,1.0
ONEOFF_PURCHASES_FREQUENCY,8950.0,0.202458,0.298336,0.0,0.0,0.083333,0.3,1.0
PURCHASES_INSTALLMENTS_FREQUENCY,8950.0,0.364437,0.397448,0.0,0.0,0.166667,0.75,1.0
CASH_ADVANCE_FREQUENCY,8950.0,0.135144,0.200121,0.0,0.0,0.0,0.222222,1.5


In [7]:
# Remove missing values

df = df.dropna()
df.shape

(8636, 17)

In [8]:
# Standardizing the data using z-score

df_pad = df.apply(
    zscore,
    ddof=1
)
df_pad.head()

# Variables will now have a mean of 0 and a standard deviation of 1.

Unnamed: 0,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,-0.744582,-0.370025,-0.429159,-0.359139,-0.354805,-0.468628,-0.820721,-0.68624,-0.717137,-0.681913,-0.479409,-0.517593,-0.96252,-0.54391,-0.30549,-0.537696,0.35516
1,0.764108,0.067675,-0.473181,-0.359139,-0.458812,2.568408,-1.236067,-0.68624,-0.926468,0.55699,0.099252,-0.59702,0.677165,0.796806,0.087684,0.212368,0.35516
2,0.426578,0.505375,-0.116406,0.099904,-0.458812,-0.468628,1.256004,2.646498,-0.926468,-0.681913,-0.479409,-0.12046,0.813805,-0.39948,-0.0999,-0.537696,0.35516
4,-0.373889,0.505375,-0.465798,-0.34964,-0.458812,-0.468628,-1.028396,-0.408513,-0.926468,-0.681913,-0.479409,-0.557306,-0.907864,-0.380143,-0.261115,-0.537696,0.35516
5,0.099545,0.505375,0.142054,-0.359139,0.994757,-0.468628,0.425314,-0.68624,0.538851,-0.681913,-0.479409,-0.279313,-0.743895,-0.132112,0.650326,-0.537696,0.35516
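
To make the standardization step concrete, here is a minimal sketch in plain Python (standard library only) of what a z-score with `ddof=1` computes: each value minus the column mean, divided by the sample standard deviation. The helper name `zscore_sample` and the toy values are illustrative assumptions, not part of the notebook.

```python
import statistics

def zscore_sample(values):
    """Standardize values using the sample standard deviation (like ddof=1)."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # divides by n - 1
    return [(v - mean) / std for v in values]

# Toy columns on very different scales (hypothetical values, not from the dataset)
balance = [40.9, 3202.5, 2495.1, 817.7]
frequency = [0.82, 0.91, 1.00, 1.00]

for name, column in [("balance", balance), ("frequency", frequency)]:
    z = zscore_sample(column)
    # After standardization both columns are on the same scale
    print(name, [round(v, 3) for v in z])
    print("  mean:", round(statistics.fmean(z), 6), "std:", round(statistics.stdev(z), 6))
```

Without this step, a distance-based method such as K-means would be dominated by the variables with the largest ranges.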


<center><h2>Plot</h2>

In [9]:
df_melt = df_pad.reset_index()
display(df_melt.head())
print()

vis = pd.melt(
    frame=df_melt,
    id_vars=['index'],
    value_vars=df_melt.columns[1:],
    var_name='Columns',
    value_name = 'Standardized Values'
)

px.box(
    vis, y="Columns",
    x="Standardized Values", 
    template='plotly_dark',
    title='Box Plot',
    color_discrete_sequence=px.colors.qualitative.Dark2
    )

Unnamed: 0,index,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,0,-0.744582,-0.370025,-0.429159,-0.359139,-0.354805,-0.468628,-0.820721,-0.68624,-0.717137,-0.681913,-0.479409,-0.517593,-0.96252,-0.54391,-0.30549,-0.537696,0.35516
1,1,0.764108,0.067675,-0.473181,-0.359139,-0.458812,2.568408,-1.236067,-0.68624,-0.926468,0.55699,0.099252,-0.59702,0.677165,0.796806,0.087684,0.212368,0.35516
2,2,0.426578,0.505375,-0.116406,0.099904,-0.458812,-0.468628,1.256004,2.646498,-0.926468,-0.681913,-0.479409,-0.12046,0.813805,-0.39948,-0.0999,-0.537696,0.35516
3,4,-0.373889,0.505375,-0.465798,-0.34964,-0.458812,-0.468628,-1.028396,-0.408513,-0.926468,-0.681913,-0.479409,-0.557306,-0.907864,-0.380143,-0.261115,-0.537696,0.35516
4,5,0.099545,0.505375,0.142054,-0.359139,0.994757,-0.468628,0.425314,-0.68624,0.538851,-0.681913,-0.479409,-0.279313,-0.743895,-0.132112,0.650326,-0.537696,0.35516





In [10]:
# Compute the correlation matrix once and reuse it
corr = df_pad.corr()

fig = go.Figure()
fig.add_trace(
    go.Heatmap(
        x=corr.columns,
        y=corr.index,
        z=corr.values,
        text=corr.values,
        texttemplate='%{text:.2f}',
        colorscale='Greens',
        zmin=-1,
        zmax=1
    )
)

fig.update_layout(
    template='plotly_dark',
    title='Correlation Matrix',
    xaxis_title='Variables',
    yaxis_title='Variables',
    height=1000,
)
fig.show()

This heatmap shows the pairwise correlations between the variables. Correlation is not my primary focus at this stage, but it will help when interpreting the clusters later.

<center><h2>Elbow method</h2>

<center><b>What is the Elbow method?</b></center>

The idea is that the closer the observations are to their centroid, the better the clustering. If the observations are tightly grouped within clusters, the sum of the squared distances (within-cluster sum of squares, or WCSS) is smaller. For various values of K (the number of clusters), the method calculates this total sum of squares within the clusters and plots it on a graph against the number of clusters generated.

What’s the general trend? As you increase the number of clusters, the total sum of squares tends to decrease because the observations are closer to their cluster centroids. However, if you keep increasing the number of clusters, eventually you'll have too many clusters, which complicates interpretation. Essentially, the more clusters you have, the lower the WCSS, but having too many clusters reduces the clarity of the model's results.

There’s a trade-off here. The Elbow method helps identify a suitable number of clusters by looking for a "bend" or "elbow" in the graph: the point after which adding another cluster yields only a small further drop in WCSS. Choosing K at the elbow captures most of the reduction without generating so many clusters that interpretation suffers.

In [11]:
elbow = []
K = range(1, 6) 
for k in K:
  kmeanElbow = KMeans(n_clusters=k,
                      init='random',
                      random_state=100).fit(
                          df_pad
                      )
  elbow.append(kmeanElbow.inertia_) # WCSS

fig = px.line(
  x=K,
  y=elbow,
  markers=True, 
  color_discrete_sequence=px.colors.qualitative.Dark2,
  template='plotly_dark',
  title='Elbow method'
  ).update_traces(patch={"line": {"dash": "dot"}}).update_layout(
    xaxis=dict(
        title='Nº Clusters',
        tickvals=[1, 2, 3, 4, 5],
        ticktext=['1', '2', '3', '4', '5']
        ),
    yaxis=dict(title='WCSS'),
    title_x=0.5
    )
fig.show()

* The X-axis shows the number of clusters, and the Y-axis shows the WCSS (Within-Cluster Sum of Squares).
* With a single cluster (all observations together), the WCSS is at its highest, because the observations are spread out around one overall center. As the number of clusters increases, the WCSS decreases, because each observation lies closer to its own cluster's centroid and the groups become more homogeneous.
* This method is used to help determine a suitable number of clusters.
* Creating too many clusters complicates interpretation.
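
As a rough illustration of what WCSS measures, the sketch below computes it by hand in plain Python for a toy dataset. The function name `wcss` and the data points are hypothetical; in the notebook itself the value comes from the fitted model's `inertia_` attribute.

```python
def wcss(points, labels):
    """Within-cluster sum of squares: squared distance of each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total

# Toy 2-D data with two visible groups (hypothetical values)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = wcss(points, [0, 0, 0, 0])   # everything in a single cluster
two_clusters = wcss(points, [0, 0, 1, 1])  # split into the two visible groups
print(one_cluster, two_clusters)  # 104.0 4.0
```

Going from one to two clusters removes almost all of the WCSS here, which is the kind of sharp drop the elbow plot is meant to reveal.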

<center><h2>Silhouette Method</h2>

This is another technique often used alongside K-Means. For each observation, it compares two distances: <b>a</b>, the average distance to the other observations in its own cluster, and <b>b</b>, the average distance to the observations in the nearest cluster it is not assigned to.

Ideally, an observation is close to the points in its own cluster and far from the other clusters, so <b>b</b> should be much larger than <b>a</b>. The difference between <b>b</b> and <b>a</b> is then divided by the larger of the two values:

$$ \text{Silhouette} = \frac{(b - a)}{\max(a, b)} $$

This calculation is performed for each observation, and the coefficients are then averaged. The silhouette score ranges between -1 and 1. Values near 1 indicate well-clustered data: the observation is much closer to its own cluster than to any neighboring cluster. A score near 0 means the clusters are overlapping, which may indicate that too many clusters were generated. A negative score (toward -1) suggests the observation is likely assigned to the wrong cluster.
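
The formula can be checked by hand on a tiny example. The following is a minimal sketch in plain Python, assuming Euclidean distance and at least two clusters; the helper name `silhouette_of` and the toy points are illustrative and not part of the notebook, which relies on `silhouette_score` from scikit-learn.

```python
import math

def silhouette_of(index, points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for one observation."""
    own = labels[index]
    distances = {}  # cluster label -> distances from the observation to that cluster's points
    for j, (point, label) in enumerate(zip(points, labels)):
        if j == index:
            continue
        distances.setdefault(label, []).append(math.dist(points[index], point))

    if own not in distances:
        return 0.0  # common convention when the observation is alone in its cluster
    a = sum(distances[own]) / len(distances[own])
    b = min(sum(d) / len(d) for label, d in distances.items() if label != own)
    return (b - a) / max(a, b)

# Two compact, well-separated groups (hypothetical values)
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_labels = [0, 0, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1]   # mixes the groups on purpose

good = [silhouette_of(i, points, good_labels) for i in range(len(points))]
bad = [silhouette_of(i, points, bad_labels) for i in range(len(points))]
print("good labeling:", round(sum(good) / len(good), 3))
print("bad labeling: ", round(sum(bad) / len(bad), 3))
```

The sensible labeling gives a clearly positive mean coefficient, while the deliberately mixed labeling gives a negative one, matching the interpretation of the score described above.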

In [12]:
silhueta = []

I = range(2, 7) 
for i in I:
  kmeansSil = KMeans(
      n_clusters=i,
      init='random',
      random_state=100
      ).fit(df_pad)
  silhueta.append(
      silhouette_score( 
          df_pad,
          kmeansSil.labels_)
  )

fig = px.line(
  x=I,
  y=silhueta,
  markers=True, 
  color_discrete_sequence=px.colors.qualitative.Dark2,
  template='plotly_dark',
  title='Silhouette'
  ).update_traces(patch={"line": {"dash": "dot"}}).update_layout(
    xaxis=dict(
        title='Nº Clusters',
        tickvals=[2, 3, 4, 5, 6],
        ticktext=['2', '3', '4', '5', '6']
        ),
    yaxis=dict(title='Silhouette'),
    title_x=0.5
    )
fig.add_vline(x=silhueta.index(max(silhueta))+2)
fig.show()

* We aim for the average silhouette score that is closest to 1.0.
* This serves as an indication of an optimal clustering solution.

<center><h2>K-means Method</h2>

For the non-hierarchical K-means method, you must first decide how many clusters you want before beginning the analysis. The selection of the number of clusters is crucial, as it serves as the foundation for identifying the cluster centers. The algorithm starts from K arbitrary initial centroids; with `init='random'`, as used below, scikit-learn picks K observations at random to serve as the initial centroids.

In the subsequent steps, each observation is compared with the centroids of all clusters. If it is closer to another cluster's centroid than to its own, it is reassigned, and the centroids of the affected clusters are recalculated.

K-means is an iterative process that repeats these two steps until the solution stabilizes for the chosen number of clusters. The process stops when no further reallocation improves the clustering, meaning that every observation is assigned to the cluster whose centroid is closest. Because the outcome depends on the random starting point, this is a local optimum rather than a guaranteed global one; fixing `random_state` makes the result reproducible.

K-means takes the number of clusters as an input parameter, which must be provided before running the function. Now, imagine a scenario where you haven’t yet run the hierarchical method, and you're about to start K-means. How many clusters should you request? What if you have no idea how many clusters are appropriate? In these cases, two common techniques are used: the Elbow method and the Silhouette method, both applied above.
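
The iterative procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration of the assignment and update steps, not scikit-learn's implementation; the function name `kmeans`, its parameters, and the toy points are assumptions made for the example. It starts from `k` randomly chosen observations as initial centroids.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means loop: assign to the nearest centroid, recompute centroids, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct observations as starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # no observation was reallocated, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned observations
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Two visible groups of three points each (hypothetical values)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 or 1 depends on the random start, so only the grouping itself is meaningful, not the label values.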

In [13]:
# Non-hierarchical K-means clustering

# Let's consider 3 clusters, based on previous evidence!

kmeans = KMeans(
    n_clusters=3,
    init='random',
    random_state=100
).fit(df_pad)

In [14]:
# Generating a variable to identify the clusters generated

kmeans_clusters = kmeans.labels_
df['cluster_kmeans'] = kmeans_clusters
df['cluster_kmeans'] = df['cluster_kmeans'].astype('category')

df_pad['cluster_kmeans'] = kmeans_clusters
df_pad['cluster_kmeans'] = df_pad['cluster_kmeans'].astype('category')

Another important detail regarding clustering is that, after the clusters are formed, it's useful to analyze which variables contributed to the formation of those clusters (groups). To do this, you run a one-way analysis of variance (ANOVA) and look at its F-test, applied to each variable individually. The goal is to determine whether a variable played a significant role in separating the clusters. If a variable helped differentiate the clusters, its F-statistic should reflect this.

The F-statistic aligns well with the objective of clustering. It simply calculates the variability between groups divided by the variability within groups. For a variable to be important, it should have a large F-statistic, meaning it shows high variability between groups (the groups are distinct from one another) but low variability within the groups (observations within the same group are similar). The larger the F-statistic, the more the variable contributes to the formation of at least one of the clusters. The formula is:

$$
F = \frac{\text{Between-group variability}}{\text{Within-group variability}}
$$

The larger the F value, the better.
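
As a worked illustration of the ratio, the sketch below computes the one-way ANOVA F-statistic by hand in plain Python, using the usual mean squares (sum of squares divided by degrees of freedom). The function name `f_statistic` and the toy groups are hypothetical; the notebook obtains the same statistic from `pg.anova`.

```python
def f_statistic(groups):
    """One-way ANOVA F: between-group mean square divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)

    # Variability of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Same nine values, grouped in two different ways (hypothetical data)
separated = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]]
overlapping = [[1.0, 12.0, 23.0], [2.0, 11.0, 22.0], [3.0, 13.0, 21.0]]

print(f_statistic(separated))    # large F: groups are distinct and internally tight
print(f_statistic(overlapping))  # small F: groups look alike
```

The values are identical in both cases; only the grouping changes, which is why F is a useful measure of how well a variable separates the clusters.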

In [15]:
# One-way analysis of variance (ANOVA)

# Interpretation of the output:

# cluster_kmeans MS: indicates the variability between groups.
# Within MS: indicates the variability within groups.
# F: test statistic (cluster_kmeans MS / Within MS)
# p-unc: p-value of the F-statistic.
# if p-value < 0.05: at least one cluster shows a statistically different mean from the others.

value = 0
variable = ''
values = []

for column in df_pad.columns[:-1]:

    # Run the ANOVA once per variable and reuse the result
    aov = pg.anova(
        data=df_pad,
        dv=column,  # Column variable
        between='cluster_kmeans',  # Analysis between groups
        detailed=True
    )
    f_stat = aov.loc[0, 'F']
    p_value = aov.loc[0, 'p-unc']  # p-unc is the p-value.
    values.append(f_stat)

    if p_value < 0.05:

        # If the p-value is lower than the significance level (5%),
        # I reject the null hypothesis, which indicates that this variable
        # contributes to the formation of at least one cluster.
        # If a variable is not significant, you can exclude it and rerun the algorithm.

        print(f'Variable {column} contributes to the formation of clusters.')

    if f_stat > value:

        # The variable with the highest F-statistic value is the one that most contributed to the creation of clusters.
        value = f_stat
        variable = column

print(f'{variable} and {value}')

Variable BALANCE contributes to the formation of clusters.
Variable BALANCE_FREQUENCY contributes to the formation of clusters.
Variable PURCHASES contributes to the formation of clusters.
Variable ONEOFF_PURCHASES contributes to the formation of clusters.
Variable INSTALLMENTS_PURCHASES contributes to the formation of clusters.
Variable CASH_ADVANCE contributes to the formation of clusters.
Variable PURCHASES_FREQUENCY contributes to the formation of clusters.
Variable ONEOFF_PURCHASES_FREQUENCY contributes to the formation of clusters.
Variable PURCHASES_INSTALLMENTS_FREQUENCY contributes to the formation of clusters.
Variable CASH_ADVANCE_FREQUENCY contributes to the formation of clusters.
Variable CASH_ADVANCE_TRX contributes to the formation of clusters.
Variable PURCHASES_TRX contributes to the formation of clusters.
Variable CREDIT_LIMIT contributes to the formation of clusters.
Variable PAYMENTS contributes to the formation of clusters.
Variable MINIMUM_PAYMENTS contributes to 

In [16]:
test = pd.DataFrame(np.round(values, 3), columns=['F'])
test['column'] = df_pad.columns[:-1]
test.sort_values(by='F', ascending=False)

Unnamed: 0,F,column
9,4823.489,CASH_ADVANCE_FREQUENCY
11,3631.36,PURCHASES_TRX
5,3033.266,CASH_ADVANCE
10,2769.53,CASH_ADVANCE_TRX
7,2719.236,ONEOFF_PURCHASES_FREQUENCY
2,2569.268,PURCHASES
0,2248.414,BALANCE
6,1525.13,PURCHASES_FREQUENCY
3,1517.182,ONEOFF_PURCHASES
4,1503.933,INSTALLMENTS_PURCHASES


In [17]:
# 3D plot
# Choose informative variables (here, the three with the highest F-statistics)
fig = px.scatter_3d(
    x=df['CASH_ADVANCE_FREQUENCY'],
    y=df['PURCHASES_TRX'],
    z=df['CASH_ADVANCE'],
    color=df['cluster_kmeans']
)
fig.update_layout(
    template='plotly_dark',
    title='Clustering Analysis: Cash Advance Frequency vs. Purchases vs. Total Cash Advances',
    title_x=0.5,
    legend_title_text='Cluster',
    scene=dict(
        xaxis_title='CASH_ADVANCE_FREQUENCY',
        yaxis_title='PURCHASES_TRX',
        zaxis_title='CASH_ADVANCE'
    )
)
fig.show()

In [18]:
group = df.loc[:, ['CASH_ADVANCE_FREQUENCY', 'PURCHASES_TRX', 'CASH_ADVANCE', 'cluster_kmeans']].groupby(by=['cluster_kmeans'])

# Descriptive statistics by group
tab_group = group.describe().T
tab_group

Unnamed: 0,cluster_kmeans,0,1,2
CASH_ADVANCE_FREQUENCY,count,1559.0,5864.0,1213.0
CASH_ADVANCE_FREQUENCY,mean,0.449891,0.069716,0.064431
CASH_ADVANCE_FREQUENCY,std,0.220485,0.107684,0.135734
CASH_ADVANCE_FREQUENCY,min,0.0,0.0,0.0
CASH_ADVANCE_FREQUENCY,25%,0.272727,0.0,0.0
CASH_ADVANCE_FREQUENCY,50%,0.416667,0.0,0.0
CASH_ADVANCE_FREQUENCY,75%,0.583333,0.090909,0.083333
CASH_ADVANCE_FREQUENCY,max,1.5,0.714286,1.0
PURCHASES_TRX,count,1559.0,5864.0,1213.0
PURCHASES_TRX,mean,5.559974,8.866132,57.022259


<h4>1. CASH_ADVANCE_FREQUENCY (Frequency of Cash Advances):</h4>

* <b>Cluster 0</b> shows an average of 0.450, indicating that these customers use cash advances more frequently compared to Clusters 1 and 2.
  
* <b>Cluster 1</b> and <b>Cluster 2</b> have significantly lower averages (0.070 and 0.064, respectively), suggesting that these customers rarely use cash advances.

* <b>Insight:</b> Cluster 0 represents customers who actively use cash advances, while Clusters 1 and 2 represent customers who rarely or never use this service.

---

<h4>2. PURCHASES_TRX (Number of Purchase Transactions):</h4>

* <b>Cluster 2</b> stands out with an average of 57.02 transactions, far exceeding the other clusters. This group seems to include customers with a very high volume of purchase transactions.
  
* <b>Cluster 1</b> has a moderate average of 8.87, suggesting that these customers regularly make purchases, but not as frequently as those in Cluster 2.

* <b>Cluster 0</b> has the lowest average, with 5.56 transactions, indicating that these customers make relatively few purchases.

* <b>Insight:</b> Cluster 2 represents customers with a very high number of purchase transactions, while Cluster 0 shows a much lower level of purchasing activity. Cluster 1 is in the middle, with moderate purchasing behavior.

---

<h4>3. CASH_ADVANCE (Total Value of Cash Advances):</h4>

* <b>Cluster 0</b> has a relatively high average of $3896.02, suggesting that these customers are making significant cash advances.
  
* <b>Cluster 1</b> and <b>Cluster 2</b> have much lower averages ($331.48 and $468.25, respectively), indicating that these groups rarely use large cash advances.

* <b>Insight:</b> Cluster 0 groups customers who not only use cash advances frequently but also in substantial amounts. On the other hand, Clusters 1 and 2 seem to represent customers who rarely use cash advances, and when they do, the amounts are small.

---


In [19]:
group_= df.groupby(by=['cluster_kmeans'])

# group_.max().T
# group_.min().T

<center><h4>General Insights</h4></center>

* <b>Cluster 0:</b> This group represents customers who frequently and significantly use cash advances but make fewer purchase transactions. These customers might rely on cash credit for other purposes instead of spending directly through purchases.
  
* <b>Cluster 1:</b> This group has a moderate use of purchase transactions but infrequent and low-value cash advances. They may be more focused on making purchases than using cash advances.
  
* <b>Cluster 2:</b> This is the most distinct group, with a high frequency of purchase transactions but minimal use of cash advances. These customers seem to be the most active in terms of purchases, while avoiding cash advances.