
In this checkpoint, we work with the 'Credit Card Dataset for Clustering' dataset from Kaggle.

Dataset description: This dataset was derived and simplified for learning purposes. It covers the usage behaviour of about 9,000 active credit card holders over a 6-month period. The task is to develop a customer segmentation that can inform a marketing strategy.

➡️ Dataset link

https://i.imgur.com/gAT5gVg.jpg

Column descriptions:

CUST_ID: Identifier of the credit card holder (categorical)
BALANCE_FREQUENCY: How frequently the balance is updated, as a score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES: Amount of purchases made from the account
CASH_ADVANCE: Cash in advance given by the user
CREDIT_LIMIT: Credit card limit for the user
PAYMENTS: Amount of payments made by the user

Follow These Steps
Import your data and perform a basic data exploration phase
Perform the necessary data preparation steps (corrupted and missing values handling, data encoding, outliers handling, ...)
Perform hierarchical clustering to identify the inherent groupings within your data. Then, plot the clusters. (Use only 2 features. For example, try to cluster the customer base with respect to 'PURCHASES' and 'CREDIT_LIMIT'.)
Perform partitional clustering using the K-means algorithm. Then, plot the clusters.
Find the best k value and plot the clusters again.
Interpret the results.

In [None]:
import pandas as pd

# Load the dataset from the path stored in `url`
url = "/content/Credit_card_dataset.csv"
df = pd.read_csv(url)
df

In [None]:
# Display basic information about the dataset
print(df.info())

In [None]:
# Display summary statistics
print(df.describe())

In [None]:
# Check for missing values
print(df.isnull().sum())

### Step 2: Data Preparation

In [17]:
# Check for duplicate rows and remove if any
df = df.drop_duplicates()

In [None]:
df.duplicated().sum()
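
The missing-value check above only reports the gaps; it does not fill them. One common way to handle them is median imputation, sketched below on a small made-up frame. The `toy` data and the choice of the median are assumptions for illustration, not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit card data (values are made up).
toy = pd.DataFrame({
    "PURCHASES": [95.4, 0.0, 773.17, 1499.0],
    "CREDIT_LIMIT": [1000.0, np.nan, 7500.0, 7500.0],
})

# Fill the gaps in each numeric column with that column's median.
numeric_cols = toy.select_dtypes(include="number").columns
toy_filled = toy.copy()
toy_filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Every column should now report zero missing values.
print(toy_filled.isnull().sum())
```

The median is less sensitive to extreme values than the mean, which matters for skewed monetary columns. Dropping the affected rows with `dropna()` is a reasonable alternative when only a few rows are incomplete.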

In [None]:
# Handle outliers
# winsorize the outliers in 'PURCHASES' and 'CREDIT_LIMIT'
from scipy.stats.mstats import winsorize
df['PURCHASES'] = winsorize(df['PURCHASES'], limits=[0.05, 0.05])
df['CREDIT_LIMIT'] = winsorize(df['CREDIT_LIMIT'], limits=[0.05, 0.05])
df
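
To make the effect of `winsorize` concrete, here is a minimal sketch on a made-up array of ten values. With `limits=[0.1, 0.1]`, the lowest 10% and the highest 10% of the values (one value at each end here) are replaced by their nearest remaining neighbour instead of being dropped.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Ten made-up values with one obvious outlier at the top.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Replace the lowest 10% and the highest 10% with the nearest kept value.
clipped = winsorize(values, limits=[0.1, 0.1])

# winsorize returns a masked array; convert it for a plain printout.
print(np.asarray(clipped))
```

The array keeps its length, so no customers are lost. Only the extreme values are pulled in towards the rest of the distribution.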


In [21]:
!pip install chardet



In [23]:
import chardet
# Specify the file path
file_path = '/content/Credit_card_dataset.csv'


In [24]:
# Open the file in binary mode and read its full contents for encoding detection
with open(file_path, 'rb') as f:
    result = chardet.detect(f.read())

In [None]:
# The detected encoding will be in result['encoding']
print(f"The detected encoding is: {result['encoding']}")

### Step 3: Hierarchical Clustering

In [27]:
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt


In [28]:
# Choose features for clustering
features = ['PURCHASES', 'CREDIT_LIMIT']
X = df[features]
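
Ward linkage and K-means both rely on Euclidean distance, so a feature with a larger numeric range can dominate the result. The notebook clusters on the raw values. As an optional variation, the sketch below standardises two made-up columns with scikit-learn's `StandardScaler`. The `toy` frame and the `X_scaled` name are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the two clustering features (values are made up).
toy = pd.DataFrame({
    "PURCHASES": [0.0, 95.4, 773.17, 1499.0, 3200.0],
    "CREDIT_LIMIT": [1000.0, 1200.0, 7500.0, 7500.0, 12000.0],
})

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(X_scaled.round(2))
```

If scaled features are used for clustering, plotting the original columns with the resulting labels keeps the axes readable in currency units.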

In [29]:
# Perform hierarchical clustering
linkage_matrix = linkage(X, method='ward')

In [None]:
# Plot dendrogram
plt.figure(figsize=(12, 6))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()
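
The dendrogram shows the merge hierarchy but does not assign customers to clusters. One way to get flat cluster labels is to cut the tree with SciPy's `fcluster`. The sketch below does this on synthetic points. The two blobs and the choice of two clusters are assumptions made only for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)

# Two well-separated synthetic groups in (PURCHASES, CREDIT_LIMIT) space.
low = rng.normal(loc=[200, 1500], scale=[50, 200], size=(50, 2))
high = rng.normal(loc=[3000, 9000], scale=[50, 200], size=(50, 2))
points = np.vstack([low, high])

# Build the Ward linkage, then cut the tree into at most 2 flat clusters.
Z = linkage(points, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# fcluster labels start at 1, not 0.
print(np.unique(labels, return_counts=True))
```

The resulting labels can be stored in a DataFrame column and passed as `hue` to a scatter plot, which gives the cluster plot that the step description asks for.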

### Step 4: Partitional Clustering (K-means)

In [31]:
from sklearn.cluster import KMeans
import seaborn as sns

In [32]:
# Choose the number of clusters (k)
k = 3

In [None]:
# Perform K-means clustering
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
df['Cluster'] = kmeans.fit_predict(X)

In [None]:
# Plot the clusters
sns.scatterplot(x='PURCHASES', y='CREDIT_LIMIT', hue='Cluster', data=df, palette='viridis')
plt.title('K-means Clustering')
plt.show()

### Step 5: Find the Best K-value and Plot Clusters

In [35]:
from sklearn.metrics import silhouette_score
import numpy as np

In [None]:
# Find the best k using silhouette score
# (a separate loop variable avoids overwriting the k chosen in Step 4)
silhouette_scores = []
for candidate_k in range(2, 11):
    kmeans = KMeans(n_clusters=candidate_k, random_state=42)
    cluster_labels = kmeans.fit_predict(X)
    silhouette_scores.append(silhouette_score(X, cluster_labels))


In [None]:
# Plot the silhouette scores
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Different Values of k')
plt.show()
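
The silhouette score is one criterion for choosing k. A common cross-check is the elbow method, which looks at how the within-cluster sum of squares (`inertia_` in scikit-learn) drops as k grows. The sketch below runs it on three synthetic blobs. The data and the range of k are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three synthetic, well-separated blobs of 40 points each.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in centers])

# Record the within-cluster sum of squares for each candidate k.
inertias = {}
for n_clusters in range(1, 7):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    model.fit(points)
    inertias[n_clusters] = model.inertia_

for n_clusters, inertia in inertias.items():
    print(n_clusters, round(inertia, 1))
```

On data like this, the inertia falls sharply up to the true number of groups and flattens afterwards. When the elbow and the silhouette peak agree, the choice of k is easier to defend.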

In [None]:
# Choose the best k and re-run K-means clustering
best_k = int(np.argmax(silhouette_scores)) + 2  # +2 because the candidate range starts at k=2
kmeans_best = KMeans(n_clusters=best_k, random_state=42)
df['Cluster'] = kmeans_best.fit_predict(X)

In [None]:
# Plot the clusters with the best k
sns.scatterplot(x='PURCHASES', y='CREDIT_LIMIT', hue='Cluster', data=df, palette='viridis')
plt.title(f'K-means Clustering (k={best_k})')
plt.show()

### Step 6: Interpret the Results

In [42]:
# Interpretation: one cluster groups the high-spending customers who also have a high credit limit.
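
An interpretation like this is easier to support with a per-cluster summary. The sketch below builds one with `groupby` on a small made-up frame. The values, the cluster labels, and the `profile` name are hypothetical and do not come from the actual run.

```python
import pandas as pd

# Made-up customers with already-assigned cluster labels.
toy = pd.DataFrame({
    "PURCHASES": [100.0, 300.0, 2500.0, 3000.0, 3500.0],
    "CREDIT_LIMIT": [1000.0, 2000.0, 8000.0, 9000.0, 10000.0],
    "Cluster": [0, 0, 1, 1, 1],
})

# Summarise each cluster: size and average value of both features.
profile = toy.groupby("Cluster").agg(
    customers=("PURCHASES", "size"),
    mean_purchases=("PURCHASES", "mean"),
    mean_credit_limit=("CREDIT_LIMIT", "mean"),
)

print(profile)
```

Comparing the rows of such a table shows which cluster spends more and which holds the higher limits, and the cluster sizes indicate how large each marketing segment would be.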