# ***AUTHOR: Agustín Rojas***
 agustinsilviorojas@outlook.com.ar


**Social media:**

[LinkedIn](https://www.linkedin.com/in/agust%C3%ADnsilviorojas/)

# Index

- [Import libraries and file](#import-libraries-and-file)
- [EDA](#eda)
  - [Analysis of averages](#analysis-of-averages)
  - [Outliers and duplicated values detection and treatment.](#out-outliers-and-duplicated-values-detection-and-treatment)
- [Analysis of the Correlation Matrix.](#analysis-of-the-correlation-matrix)
- [STANDARDIZATION OF DATA](#standarization-of-data)
- [APPLICATION OF THE K-MEANS ALGORITHM](#application-of-the-k-means-algorithm)
  - [VISUALIZATION OF THE COORDINATES OF THE CENTROIDS OF THE CLUSTERS IN A DATAFRAME](#visualization-of-the-coordinates-of-the-centroids-of-the-clusters-in-a-dataframe)
  - [Segments insights](#segments-insights)
  - [Predict](#predict)
- [PCA (Dimensionality Reduction)](#pca-dimensionality-reduction)
  - [Scatter Plot of Clusters in Principal Component Space](#scatter-plot-of-clusters-in-principal-component-space)
- [Conclusion](#conclusion)

# Import libraries and file

In [1]:

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
path = r"C:\Users\agust\OneDrive\Documentos\VScode\Bank\Bank_data.csv"
file_csv = pd.read_csv(path, sep=",")
df = pd.DataFrame(file_csv)
df.head()

Unnamed: 0,CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
0,C10001,40.900749,0.818182,95.4,0.0,95.4,0.0,0.166667,0.0,0.083333,0.0,0,2,1000.0,201.802084,139.509787,0.0,12
1,C10002,3202.467416,0.909091,0.0,0.0,0.0,6442.945483,0.0,0.0,0.0,0.25,4,0,7000.0,4103.032597,1072.340217,0.222222,12
2,C10003,2495.148862,1.0,773.17,773.17,0.0,0.0,1.0,1.0,0.0,0.0,0,12,7500.0,622.066742,627.284787,0.0,12
3,C10004,1666.670542,0.636364,1499.0,1499.0,0.0,205.788017,0.083333,0.083333,0.0,0.083333,1,1,7500.0,0.0,,0.0,12
4,C10005,817.714335,1.0,16.0,16.0,0.0,0.0,0.083333,0.083333,0.0,0.0,0,1,1200.0,678.334763,244.791237,0.0,12


# EDA

Exploratory data analysis

In [None]:

# Analyzed the general information of the DataFrame.
df.info()

In [None]:

# Checked how many missing (null) values there are in each variable.
df.isnull().sum()

In [None]:
# Created a copy of the DataFrame and replaced the missing values with the mean value of each corresponding variable.
df2 = df.copy()
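# Assign the result back instead of calling fillna(inplace=True) on a column selection, which newer pandas versions warn about.
df2['CREDIT_LIMIT'] = df2['CREDIT_LIMIT'].fillna(df2['CREDIT_LIMIT'].mean())
df2['MINIMUM_PAYMENTS'] = df2['MINIMUM_PAYMENTS'].fillna(df2['MINIMUM_PAYMENTS'].mean())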

In [None]:

# Verified that there are no longer any missing values
df2.isnull().sum()

In [None]:

# Generated the descriptive statistics of the dataset.
df2.describe().T

## Analysis of averages 

From the descriptive statistics above, we can observe the following:

The average account balance is $1,564, suggesting that customers maintain a significant balance in their accounts on average.

The average balance‐holding frequency is 0.88, indicating that most customers regularly carry a balance.

The average total purchase amount is $1,003, showing that customers routinely make purchases with their credit cards.

The average one-off purchase amount (ONEOFF_PURCHASES) is $592, which is more than half of the average total purchase amount of $1,003, so one-off purchases account for the larger share of spending on average.

The average number of purchase transactions is approximately 14, highlighting that customers engage in multiple purchases.

The average credit limit is $4,494, representing the maximum total amount customers can charge to their cards.

The average proportion of full credit‐bill payments is about 0.15, indicating that most customers do not pay off their entire balance each month.

The average relationship duration with the bank is roughly 11.5 years, reflecting a long‐term customer–bank relationship.

Additionally, we do not observe any extreme outliers in these summary statistics. However, it would be prudent to perform a deeper outlier analysis to identify any values that might significantly impact the customer segmentation.

## Out outliers and duplicated values detection and treatment.

In [None]:
# Search for duplicate values.
print("There are {} duplicated values".format(df2.duplicated().sum()))

In [None]:
# Create a new copy of the DataFrame and drop the columns that are unnecessary for the analysis.
df3 = df2.copy()
df3 = df3.drop('CUST_ID', axis=1)
df3.head()

In [None]:
# Generate subplots for each variable.
fig, axs = plt.subplots(nrows=6, ncols=3, figsize=(15, 20))
axs = axs.flatten()

# Generate a boxplot for each variable.
for i, var in enumerate(df3.columns):  # We iterated directly over the DataFrame’s columns.
    axs[i].boxplot(df3[var].dropna(), vert=False)
    axs[i].set_title(var)

plt.tight_layout()
plt.show()

In [None]:
# Detection of outlier values using the interquartile range.

stats = df3.describe()

for column in df3.select_dtypes(include = "number"):
    Q1 = stats[column]['25%']
    Q3 = stats[column]['75%']
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    
    outliers = df3.loc[(df3[column] < lower_limit) | (df3[column] > upper_limit)]
    
    print(f"Column: {column}")
    print(outliers)
    print("------------------------")
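
Printing every outlier row is hard to scan on a wide dataset. As a minimal sketch, the same IQR rule can be summarized as one count per column. The `toy` data and the helper name `iqr_outlier_counts` are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # The comparisons align on column names; NaN values compare as False.
    mask = (numeric < lower) | (numeric > upper)
    return mask.sum()

# Toy data (hypothetical values, only for illustration).
toy = pd.DataFrame({
    "BALANCE": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0],
    "PURCHASES": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Applied to `df3`, a summary like this would make it easier to see which variables carry most of the flagged values before deciding how to treat them.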

# Analysis of the Correlation Matrix.

In [None]:

correlation_matrix = df3.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix.')
plt.show()

From the correlation matrix above, we can observe the following:

The variable 'PURCHASES' is highly correlated with the variables 'PAYMENTS', 'PURCHASES_TRX', 'ONEOFF_PURCHASES_FREQUENCY', 'INSTALLMENTS_PURCHASES', and 'ONEOFF_PURCHASES'.

The variable 'CASH_ADVANCE' is correlated with the variables 'PAYMENTS', 'CASH_ADVANCE_TRX', 'CASH_ADVANCE_FREQUENCY', and 'BALANCE'.
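
Reading strong pairs off a heatmap by eye can miss some of them. The sketch below lists the pairs whose absolute correlation exceeds a cutoff. The synthetic `toy` data, the helper `top_correlations`, and the 0.6 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # Masked (NaN) entries never pass the comparison, so they are filtered out here.
    strong = pairs[pairs.abs() > threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "PURCHASES": x,
    "PURCHASES_TRX": x + rng.normal(scale=0.1, size=200),  # built to track PURCHASES
    "TENURE": rng.normal(size=200),                        # independent noise
})
strong_pairs = top_correlations(toy)
print(strong_pairs)
```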

# STANDARIZATION OF DATA

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df4 = scaler.fit_transform(df3)

df4 = pd.DataFrame(df4, columns=df3.columns)

df4.head()
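
To make the transformation concrete, the sketch below checks that `StandardScaler` computes z = (x - mean) / std for each column, using the population standard deviation. The `toy` values are rounded from the sample rows shown earlier and serve only as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two columns with values rounded from the first rows of the dataset.
toy = pd.DataFrame({
    "BALANCE": [40.9, 3202.5, 2495.1, 1666.7, 817.7],
    "CREDIT_LIMIT": [1000.0, 7000.0, 7500.0, 7500.0, 1200.0],
})

scaled = StandardScaler().fit_transform(toy)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

Because K-means relies on Euclidean distance, putting every feature on this common scale keeps large-valued variables such as the credit limit from dominating the clustering.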

# APPLICATION OF THE K-MEANS ALGORITHM

EVALUATION OF THE OPTIMAL NUMBER OF CLUSTERS USING THE ELBOW METHOD IN K-MEANS

In [None]:
from sklearn.cluster import KMeans

wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state=0)
    kmeans.fit(df4)
    wcss.append(kmeans.inertia_)
    
plt.plot(range(1,11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()


At this point, we can notice an elbow in the WCSS (inertia) plot when the number of clusters is 4. Therefore, the optimal number of clusters appears to be 4.
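
The elbow is read by eye, so a second criterion can help confirm the choice. The sketch below uses the silhouette score, a technique that is not part of this notebook, on synthetic blobs. The data and the names `scores` and `best_k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized data: 4 well-separated blobs.
blob_centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X, _ = make_blobs(n_samples=400, centers=blob_centers, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    cluster_labels = model.fit_predict(X)
    # Mean silhouette over all points; values closer to 1 mean tighter, better-separated clusters.
    scores[k] = silhouette_score(X, cluster_labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On the real data the same loop could be run on `df4`. If the silhouette peak and the elbow disagree, that is a sign the cluster structure is not clear-cut.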

In [None]:
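# Same settings as in the elbow loop; a fixed random_state keeps the cluster numbering stable between runs.
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)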
kmeans.fit(df4)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

## VISUALIZATION OF THE COORDINATES OF THE CENTROIDS OF THE CLUSTERS IN A DATAFRAME

We create a DataFrame `centroids` to store the coordinates of each centroid.

Each row of `centroids` is one cluster centroid and each column is a feature, that is, one coordinate of the centroid in the standardized multi-dimensional feature space.

In [None]:
# Pass the column Index directly; wrapping it in a list would create MultiIndex columns.
centroids = pd.DataFrame(data=kmeans.cluster_centers_, columns=df3.columns)
centroids

For easier interpretation, we convert the centroids back to the original scale of each feature. This shows where the centroid of each cluster lies in real units and how K-means has grouped the data.

In [None]:
# Convert the standardized centroids back to their original scale.
centroids_original = scaler.inverse_transform(kmeans.cluster_centers_)
# Create a DataFrame 'centroids_df' to store the coordinates of the centroids in their original scale, where each column represents a feature.
centroids_df = pd.DataFrame(data=centroids_original, columns=df3.columns)
centroids_df
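
As a rough check of what the reverse conversion does, the sketch below verifies on synthetic data that `inverse_transform` applies x = z * scale + mean. It also verifies that each back-transformed centroid matches the plain mean of its cluster in original units. All names prefixed with `toy_` and the generated values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clearly separated groups of synthetic customers.
toy = pd.DataFrame({
    "BALANCE": np.concatenate([rng.normal(500, 50, 50), rng.normal(5000, 300, 50)]),
    "PURCHASES": np.concatenate([rng.normal(200, 20, 50), rng.normal(2000, 150, 50)]),
})

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy)
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_scaled)

# inverse_transform undoes z = (x - mean) / scale, i.e. x = z * scale + mean.
by_method = toy_scaler.inverse_transform(toy_kmeans.cluster_centers_)
by_hand = toy_kmeans.cluster_centers_ * toy_scaler.scale_ + toy_scaler.mean_

# The back-transformed centroid should match the cluster mean in original units.
cluster_means = toy.groupby(toy_kmeans.labels_).mean().to_numpy()
print(np.allclose(by_method, by_hand), np.allclose(by_method, cluster_means))
```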

## Segments insights


Based on the average values of each variable for each customer group, we can derive insights about each segment:

**Group 0:**

💳 **Moderate Balance and Credit Limit:** Customers in this group maintain an average balance and a relatively moderate credit limit.

🛍️ **Frequent and Diverse Purchasing Behavior:** They show moderate average values for both one-off purchases and installment purchases. This indicates regular purchasing activity across different types, both single transactions and installment-based spending.

💸 **Low Cash Advance Usage:** The average cash advance value is low, suggesting these customers rarely rely on cash advances.

📈 **Reasonable Payment Habits:** They exhibit a relatively high average percentage of full payments, pointing to strong financial management and responsible credit usage.

**Group 1:**

💳 **High Balance and Credit Limit:** This group maintains a relatively high average balance and credit limit, indicating greater financial capacity.

🛍️ **Very High Spending:** They show very high average values for both one-off and installment purchases, reflecting a high level of consumer spending.

💸 **Low Cash Advance Usage:** The average cash advance amount is low, suggesting minimal reliance on cash advances.

📈 **Reasonable Payment Habits:** They maintain a relatively high average percentage of full payments, indicating responsible payment behavior.

**Group 2:**

💳 **Low Balance and Credit Limit:** Customers in this group have relatively low average balances and credit limits, indicating limited financial resources.

🛍️ **Low Purchasing Activity:** Both one-off and installment purchases are low on average, suggesting minimal spending behavior.

💸 **Minimal Cash Advance Usage:** The average cash advance amount is low, showing little reliance on this form of credit.

📉 **Weak Payment Habits:** They have a relatively low average percentage of full payments, which may reflect financial challenges or less disciplined payment behavior.

**Group 3:**

💳 **High Balance and Credit Limit:** This group holds relatively high average balances and credit limits, indicating strong financial capacity.

🛍️ **Moderate Purchasing Activity:** Average values for both one-off and installment purchases are moderate, suggesting balanced spending habits.

💸 **High Cash Advance Usage:** They show a high average value for cash advances, implying greater dependence on short-term credit or liquidity needs.

📉 **Weak Payment Habits:** The average percentage of full payments is relatively low, which may point to financial strain or less responsible credit management.
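
Labels such as "high" or "low" are easier to justify when each cluster mean is compared with the overall mean. The sketch below shows one way to build such a profile. The `toy` frame is a hypothetical miniature of a DataFrame with a `cluster` column and its values are made up.

```python
import pandas as pd

# Hypothetical miniature: a few features plus the assigned cluster label.
toy = pd.DataFrame({
    "BALANCE": [100.0, 150.0, 4000.0, 4200.0, 900.0, 1100.0],
    "CASH_ADVANCE": [0.0, 10.0, 3000.0, 3500.0, 0.0, 50.0],
    "PRC_FULL_PAYMENT": [0.0, 0.1, 0.0, 0.05, 0.8, 0.9],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature per cluster, next to the overall mean for reference.
profile = toy.groupby("cluster").mean()
overall = toy.drop(columns="cluster").mean()

# A ratio above 1 means the cluster sits above the overall average for that feature.
relative = profile / overall
print(relative.round(2))
```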

## Predict

In [None]:
'''
This code uses the previously trained KMeans model to predict which group each customer belongs to within the dataset.
Using the predict method, the model assigns each customer a label indicating the segment or cluster they belong to based on their features.

The group labels assigned to each customer are stored in the variable y_kmeans, allowing us to identify the category each customer falls into according to the KMeans model.
'''
y_kmeans = kmeans.predict(df4)

In [None]:

y_kmeans

In [None]:
'''
In this step, we concatenate the cluster labels (y_kmeans) predicted by the KMeans model to the original DataFrame df3.
A new DataFrame called df5 is created, containing all the original customer features along with an additional column, cluster, which indicates the segment or cluster each customer belongs to based on the KMeans prediction.

This allows us to gain a comprehensive view of how customers are grouped into different segments according to their characteristics.
'''
df5 = pd.concat([df3, pd.DataFrame({'cluster':y_kmeans})], axis=1)

In [None]:

df5.head()

In [None]:
'''
In this step, we examine the distribution of cluster labels (cluster) generated by the KMeans model in the DataFrame df5.
Using the value_counts() method, we count how many customers belong to each cluster and display the number of customers in each one.

This provides us with insights into how customers are distributed across the various segments or clusters identified by the KMeans model.
'''
df5.cluster.value_counts()

In [None]:

# Calculate the number of columns needed for the subplots.
num_clusters = 4
num_features = len(df3.columns)
num_cols = min(num_clusters, num_features)

# A histogram is created for each cluster to enhance visualization.
for feature in df3.columns:
    plt.figure(figsize=(num_cols * 8, 5))
    for cluster_label in range(num_clusters):
        plt.subplot(1, num_cols, cluster_label + 1)
        cluster_data = df5[df5['cluster'] == cluster_label]
        cluster_data[feature].hist(bins=20, alpha=0.7)
        plt.title('Cluster {}: {}'.format(cluster_label, feature))
        plt.xlabel(feature)
        plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

# PCA (Dimensionality Reduction)

In [None]:
from sklearn.decomposition import PCA

# Calculate the explained variance for each principal component.
pca = PCA()
pca.fit(df4)
explained_variance_ratio = pca.explained_variance_ratio_

# Plot the cumulative explained variance.
plt.plot(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio.cumsum(), marker='o', linestyle='-')
plt.xlabel('Number of principal components.')
plt.ylabel('Cumulative explained variance.')
plt.title('Cumulative Explained Variance Plot.')
plt.show()

In [None]:

# Print the output to observe how much of the total information is contributed by each principal component.

print(pca.explained_variance_ratio_.round(2)[:10])
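
The cumulative curve can also be turned into a number: the smallest count of components that reaches a chosen share of variance. The sketch below does this on synthetic data. The 90% threshold, the generated matrix `X`, and the names `toy_pca` and `n_needed` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 5 features driven by 2 latent factors plus a little noise.
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.05, size=(300, 5))

toy_pca = PCA().fit(X)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches the threshold.
threshold = 0.90
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3), n_needed)
```

If two components cover only a modest share of the variance on the real data, the 2D scatter plot is best read as a visual aid and not as a faithful picture of cluster separation.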

In [None]:

# PCA is applied with 2 principal components to reduce the dimensionality of the dataset while preserving as much variance as possible.
pca = PCA(n_components=2)

pca_df = pca.fit_transform(df4)

In [None]:

df_components = pd.DataFrame(data=pca_df, columns=['pc_1','pc_2'])
df_components.head()

In [None]:
df_final = pd.concat([df_components, pd.DataFrame({'cluster':labels})], axis=1)
df_final.head()


## Scatter Plot of Clusters in Principal Component Space

This visualization displays how the identified clusters are distributed across the space defined by the principal components.

It helps reveal separation, overlap, and structure among the clusters in reduced dimensions.

In [None]:

plt.figure(figsize=(10,10))
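# Plot the DataFrame built in the previous cell (df_final), which holds the two components and the cluster label.
ax = sns.scatterplot(x='pc_1', y='pc_2', hue='cluster', data=df_final, palette='Set1')
plt.show()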
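
To mark the cluster centers on such a plot, the centroids can be projected with the same fitted PCA. The sketch below shows the projection on synthetic blobs; the data and the `toy_` names are assumptions for illustration, and the plotting step is left out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized data: 3 blobs in 6 dimensions.
X, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
toy_pca = PCA(n_components=2).fit(X)

# Project both the points and the centroids with the same fitted PCA.
points_2d = toy_pca.transform(X)
centroids_2d = toy_pca.transform(toy_kmeans.cluster_centers_)

# PCA is a linear map after centering, so a projected centroid is the mean of its projected points.
projected_means = np.array(
    [points_2d[toy_kmeans.labels_ == k].mean(axis=0) for k in range(3)]
)
print(centroids_2d.round(2))
```

The rows of `centroids_2d` could then be overlaid on the scatter plot, for example with `plt.scatter`, to show where each cluster is centered in the reduced space.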

# Conclusion


By applying machine learning techniques, it is possible to achieve effective categorization of banking customers, enabling a deeper understanding of their behaviors and needs. This segmentation provides a clearer view of the customer base, making it easier to personalize marketing strategies, identify opportunities for product and service improvement, and support more informed business decision-making. In summary, customer segmentation is a powerful tool for any organization in the modern era, allowing for more agile adaptation to changing market demands and delivering a more satisfying experience for customers.