### What You're Aiming For

- In this checkpoint, we will work with the 'Credit Card Dataset for Clustering' dataset provided by Kaggle.

- Dataset description: This dataset was derived and simplified for learning purposes. It covers the usage behaviour of about 9,000 active credit card holders over a 6-month period. The task is to develop a customer segmentation that can be used to define a marketing strategy.

### Columns explanation

- CUST_ID: Identification of the credit card holder (categorical)
- BALANCE_FREQUENCY: How frequently the balance is updated, as a score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES: Amount of purchases made from the account
- CASH_ADVANCE: Cash in advance given by the user
- CREDIT_LIMIT: Credit card limit for the user
- PAYMENTS: Amount of payments made by the user

#### Instructions

- Import your data and perform a basic data exploration phase.
- Perform the necessary data preparation steps (handling of corrupted and missing values, data encoding, outlier handling, ...).
- Perform hierarchical clustering to identify the inherent groupings within your data, then plot the clusters. Use only 2 features; for example, try to cluster the customer base with respect to 'PURCHASES' and 'CREDIT_LIMIT'.
- Perform partitional clustering using the K-means algorithm, then plot the clusters.
- Find the best k value and plot the clusters again.
- Interpret the results.

In [None]:
import warnings
warnings.simplefilter("ignore")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans


In [None]:
# Load the data
data = pd.read_csv('Credit_card_dataset.csv')

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
data.isna().mean()*100

In [None]:
data.columns

In [None]:
# Drop the rows where CREDIT_LIMIT is missing
data.dropna(subset=['CREDIT_LIMIT'], inplace=True)

In [None]:
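# Fill missing CASH_ADVANCE values with the column median.
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about.
data['CASH_ADVANCE'] = data['CASH_ADVANCE'].fillna(data['CASH_ADVANCE'].median())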

In [None]:
# Drop the columns that are not used for clustering
data = data.drop(columns=['CUST_ID', 'BALANCE_FREQUENCY', 'PAYMENTS', 'CASH_ADVANCE'])

In [None]:
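# Remove outliers: keep only rows whose z-score is within 3 standard deviations of the column mean
from scipy import stats
data = data[np.abs(stats.zscore(data['PURCHASES'])) < 3]
# .copy() makes `data` an independent frame, so adding a column to it later does not trigger a SettingWithCopyWarning
data = data[np.abs(stats.zscore(data['CREDIT_LIMIT'])) < 3].copy()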
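The z-score filter above keeps a row only if its value lies within three standard deviations of the column mean. The following is a minimal sketch of that behaviour on a small synthetic column; the values and the names `toy` and `filtered` are made up for illustration and are not taken from the credit card dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data (hypothetical values): 50 typical purchases plus one extreme value
toy = pd.DataFrame({"PURCHASES": [100.0] * 25 + [120.0] * 25 + [50_000.0]})

# Absolute z-score of every row, computed against the column mean and standard deviation
z = np.abs(stats.zscore(toy["PURCHASES"]))

# Keep only the rows within three standard deviations of the mean
filtered = toy[z < 3]

print(len(toy), "rows before,", len(filtered), "rows after")
```

Because the extreme value also inflates the mean and the standard deviation, a z-score filter tends to be conservative; it is one reasonable choice among several, not the only way to handle outliers.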

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler

# Select the features for clustering
X = data[['PURCHASES', 'CREDIT_LIMIT']]

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform hierarchical clustering
linked = linkage(X_scaled, method='ward')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
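The dendrogram shows the merge hierarchy but does not assign customers to groups. One possible way to obtain flat cluster labels is to cut the tree with SciPy's `fcluster`. The sketch below runs on synthetic two-feature data standing in for `PURCHASES` and `CREDIT_LIMIT`; the names `X_demo`, `linked_demo`, and `hier_labels`, and the choice of 3 groups, are assumptions for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features: three well-separated groups of 100 points each
rng = np.random.default_rng(23)
centers = np.array([[500, 2000], [3000, 6000], [6000, 12000]])
X_demo = np.vstack(
    [c + rng.normal(scale=[150, 400], size=(100, 2)) for c in centers]
)

# Same pipeline as in the notebook: scale the features, then build a Ward linkage
X_demo_scaled = StandardScaler().fit_transform(X_demo)
linked_demo = linkage(X_demo_scaled, method="ward")

# Cut the tree into at most n_groups flat clusters (3 is an assumed value)
n_groups = 3
hier_labels = fcluster(linked_demo, t=n_groups, criterion="maxclust")

# Labels start at 1; count how many points fall into each cluster
print(np.bincount(hier_labels)[1:])
```

Applied to the notebook's own linkage matrix, the equivalent call would be `fcluster(linked, t=n_groups, criterion="maxclust")`, and the resulting labels could be passed as `hue` to a scatter plot of the two features.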


## Model Training

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Apply PCA to the scaled features. With two input features and n_components=2,
# this re-expresses the data along its principal axes without reducing its dimensionality.
pca = PCA(n_components=2)  # You can change the number of components based on your use case
X_red = pca.fit_transform(X_scaled)

# Fit K-means on the PCA-transformed data for k = 1..10 and record the inertia (WCSS)
kmeans_models = [KMeans(n_clusters=k, random_state=23).fit(X_red) for k in range(1, 11)]
inertia = [model.inertia_ for model in kmeans_models]

# Plot the elbow method graph
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS (Within-Cluster Sum of Squares)')
plt.xticks(range(1, 11))  # Optional: sets x-ticks to integers
plt.grid()
plt.show()

In [None]:
inertia
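The inertia values used for the elbow plot are the within-cluster sum of squares: the sum, over all points, of the squared distance to the centre of the cluster the point is assigned to. As a minimal sketch of that definition, the block below recomputes the value by hand on synthetic data from `make_blobs` and compares it with `inertia_`; the names `X_demo` and `manual_wcss`, and the explicit `n_init=10`, are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (not the credit card data)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=23)

# n_init is set explicitly because its default differs between scikit-learn versions
model = KMeans(n_clusters=3, n_init=10, random_state=23).fit(X_demo)

# Centre of the cluster each point was assigned to, one row per point
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared distances from every point to its assigned centre
manual_wcss = np.sum((X_demo - assigned_centers) ** 2)

print(manual_wcss, model.inertia_)
```

Inertia can only decrease as k grows, which is why the elbow plot is read for the point where the decrease flattens rather than for a minimum.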

In [None]:
model_3 = KMeans(n_clusters=3, random_state=23).fit(X_red)
model_3.inertia_

In [None]:
model_4 = KMeans(n_clusters=4, random_state=23).fit(X_red)
model_4.inertia_

In [None]:
model_5 = KMeans(n_clusters=5, random_state=23).fit(X_red)
model_5.inertia_

In [None]:
model_6 = KMeans(n_clusters=6, random_state=23).fit(X_red)
model_6.inertia_

In [None]:
from sklearn.metrics import silhouette_score

# kmeans_models[1:4] holds the models for k = 2, 3, 4 (the silhouette score needs at least 2 clusters)
silhouette_scores = [silhouette_score(X_red, model.labels_) for model in kmeans_models[1:4]]
plt.plot(range(2, 5), silhouette_scores, "bo-")
plt.xticks([2, 3, 4])
plt.title('Silhouette scores vs number of clusters')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score')
plt.show()

# Select 2 as our number of clusters.
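Reading the best k off the silhouette plot can also be done programmatically by taking the k with the highest score. The following is a minimal sketch on synthetic data with three well-separated groups; `X_demo`, `scores`, `best_k`, the candidate range 2 to 6, and the explicit `n_init=10` are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the credit card data)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=23,
)

# Silhouette score for each candidate k (the score requires at least 2 clusters)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=23).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The silhouette score and the elbow plot do not always agree, so it is worth checking both before fixing the number of clusters.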

In [None]:
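from sklearn.metrics import silhouette_score

# Fit the final model. Note that n_clusters=6 is used here, not the 2 selected from the silhouette plot.
kmeans = KMeans(n_clusters=6, random_state=23)
kmeans.fit(X_red)

print('Silhouette score of our model is ' + str(silhouette_score(X_red, kmeans.labels_)))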

In [None]:
# Assign the cluster labels to the dataset as the cluster index.

data['cluster_id'] = kmeans.labels_

plt.figure(figsize=(10, 6))
sns.scatterplot(x=data['CREDIT_LIMIT'], y=data['PURCHASES'], hue=data['cluster_id'], palette='Set1')
plt.title('Customer Segmentation To Define Marketing Strategy With Purchases And Credit Limits')
plt.show()
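To support the interpretation step, a per-cluster summary of the two features can complement the scatter plot. The sketch below uses a tiny hand-made frame with the same column names; the values and the names `toy` and `profile` are hypothetical and only illustrate the aggregation.

```python
import pandas as pd

# Hand-made example rows (hypothetical values), already labelled with a cluster index
toy = pd.DataFrame(
    {
        "PURCHASES": [100.0, 200.0, 3000.0, 3200.0, 150.0],
        "CREDIT_LIMIT": [1000.0, 1500.0, 9000.0, 10000.0, 1200.0],
        "cluster_id": [0, 0, 1, 1, 0],
    }
)

# Size and average feature values of each cluster
profile = toy.groupby("cluster_id").agg(
    n_customers=("PURCHASES", "size"),
    mean_purchases=("PURCHASES", "mean"),
    mean_credit_limit=("CREDIT_LIMIT", "mean"),
)

print(profile)
```

On the real data, the same aggregation over `data.groupby('cluster_id')` would give one row per segment, which can then be described in terms of spending level and credit limit.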