![credit cards](https://unsplash.com/photos/na8l3EPqpvY/download?force=true&w=640)
# Let's Cluster some Credit Cards!
In this notebook I'll try to cluster credit card holders. We'll go through preprocessing, choose a model and train it on our data, and then evaluate and interpret the results.

# Libraries
Let's begin by importing libraries. I have added comments that show what each library is used for.

In [None]:
# Essentials:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

# t-SNE visualization
from sklearn.manifold import TSNE

# imputation
from sklearn.impute import KNNImputer

# Scaling
from sklearn.preprocessing import StandardScaler

# PCA
from sklearn.decomposition import PCA

# K-means for Clustering
from sklearn.cluster import KMeans

# elbow method
from yellowbrick.cluster import KElbowVisualizer

# cluster metrics
from sklearn.metrics import davies_bouldin_score
from sklearn.metrics import silhouette_score

# Silhouette Visualizer
from yellowbrick.cluster import SilhouetteVisualizer

# Take a look at Dataset
Let's take a look at our dataset. We're going to:
- load the dataset
- understand the features
- check for missing values ...
- ... and outliers
- determine whether it's possible to cluster these datapoints (using t-SNE)

## Load the dataset:

In [None]:
cc_general = pd.read_csv('../input/ccdata/CC GENERAL.csv')
cc_general.head()

## Understand the features:

In [None]:
cc_general.describe()

`CUST_ID` : Identification of Credit Card holder (Categorical)

`BALANCE` : Balance amount left in their account to make purchases

`BALANCE_FREQUENCY` : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

`PURCHASES` : Amount of purchases made from account

`ONEOFF_PURCHASES` : Maximum purchase amount done in one-go

`INSTALLMENTS_PURCHASES` : Amount of purchase done in installment

`CASH_ADVANCE` : Cash in advance given by the user

`PURCHASES_FREQUENCY` : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

`ONEOFF_PURCHASES_FREQUENCY` : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)

`PURCHASES_INSTALLMENTS_FREQUENCY` : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)

`CASH_ADVANCE_FREQUENCY` : How frequently the cash in advance is being paid

`CASH_ADVANCE_TRX` : Number of Transactions made with "Cash in Advance"

`PURCHASES_TRX` : Number of purchase transactions made

`CREDIT_LIMIT` : Limit of Credit Card for user

`PAYMENTS` : Amount of Payment done by user

`MINIMUM_PAYMENTS` : Minimum amount of payments made by user

`PRC_FULL_PAYMENT` : Percent of full payment paid by user

`TENURE` : Tenure of credit card service for user

## Check for missing values:

In [None]:
cc_general.isna().sum()

## Check for outliers:

Using IQR, we can follow the below approach to find outliers:

- Calculate the first and third quartile (Q1 and Q3).
- Compute the interquartile range, IQR = Q3 - Q1.
- Estimate the lower bound: lower bound = Q1 - 1.5 * IQR
- Estimate the upper bound: upper bound = Q3 + 1.5 * IQR
- The data points that lie outside of the lower and the upper bound are outliers.
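
To make the IQR rule concrete, here is a minimal sketch on a handful of made-up numbers (the values are hypothetical and not taken from the dataset). It uses the same bounds as the `outlier_percent` function defined in the next cell.

```python
import numpy as np

# Toy data (hypothetical values, not taken from the dataset)
toy_values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# First and third quartiles, then the interquartile range
q1, q3 = np.percentile(toy_values, [25, 75])
iqr = q3 - q1

# Bounds are measured from the quartiles, 1.5 * IQR away on each side
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

toy_outliers = toy_values[(toy_values < lower_bound) | (toy_values > upper_bound)]
print(lower_bound, upper_bound, toy_outliers)
```

Only the value 100 falls outside the bounds here, so it is the only point flagged as an outlier.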

In [None]:
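def outlier_percent(data):
    # Quartiles and interquartile range of the column
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    # Anything beyond 1.5 * IQR from the quartiles counts as an outlier
    minimum = Q1 - (1.5 * IQR)
    maximum = Q3 + (1.5 * IQR)
    num_outliers = np.sum((data < minimum) | (data > maximum))
    # count() ignores missing values
    num_total = data.count()
    return (num_outliers / num_total) * 100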

In [None]:
non_categorical_data = cc_general.drop(['CUST_ID'], axis=1)
for column in non_categorical_data.columns:
    data = non_categorical_data[column]
    percent = str(round(outlier_percent(data), 2))
    print(f'Outliers in "{column}": {percent}%')

# Preprocessing
In this part, I'm going to:
1. Remove the outliers
2. Impute missing data
3. Scale the data
4. Reduce dimensions using PCA

## Removing the outliers
First, let's get rid of the noise. We'll set every outlier to `NaN` so that it is handled in the next stage, where we impute the missing values.

In [None]:
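# Cast to float so that integer columns can hold NaN
non_categorical_data = non_categorical_data.astype(float)

for column in non_categorical_data.columns:
    data = non_categorical_data[column]

    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    minimum = Q1 - (1.5 * IQR)
    maximum = Q3 + (1.5 * IQR)

    # Boolean mask of the rows that fall outside the IQR bounds
    outliers = ((data < minimum) | (data > maximum))
    # A single .loc call avoids chained assignment
    non_categorical_data.loc[outliers, column] = np.nan

non_categorical_data.isna().sum()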

## Imputing the missing data
I use `KNNImputer`: each sample's missing values are imputed using the mean value from the `n_neighbors` nearest neighbors found in the training set (5 neighbors by default).

In [None]:
imputer = KNNImputer()
imp_data = pd.DataFrame(imputer.fit_transform(non_categorical_data), columns=non_categorical_data.columns)
imp_data.isna().sum()
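
To see what the imputer does on a small scale, here is a sketch on a made-up 4x2 matrix (the numbers and the choice of `n_neighbors=2` are assumptions for illustration, not values used above). The missing entry is filled with the mean of that feature over the two nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix; the second row is missing its second feature
toy_X = np.array([[1.0, 2.0],
                  [2.0, np.nan],
                  [3.0, 6.0],
                  [10.0, 20.0]])

# With 2 neighbors, the rows closest to [2, nan] are [1, 2] and [3, 6]
toy_imputer = KNNImputer(n_neighbors=2)
toy_X_imputed = toy_imputer.fit_transform(toy_X)

# The missing value becomes the mean of 2 and 6
print(toy_X_imputed[1])
```

The far-away row `[10, 20]` has no influence on the result, which is why this approach tends to be more local than filling with the column mean.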

## Scale the Data
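I use `StandardScaler`, which rescales every feature to zero mean and unit variance so that no single feature dominates the distance calculations.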

In [None]:
std_imp_data = pd.DataFrame(StandardScaler().fit_transform(imp_data), columns=imp_data.columns)
std_imp_data.describe()
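
As a rough illustration of what standardization does, the sketch below applies the same idea by hand with NumPy to a made-up matrix (the numbers are hypothetical). Each column has its mean subtracted and is divided by its population standard deviation.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 2 features on very different scales
toy_features = np.array([[1.0, 1000.0],
                         [2.0, 2000.0],
                         [3.0, 3000.0],
                         [4.0, 4000.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
toy_z = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)

print(toy_z.mean(axis=0))  # approximately 0 for every column
print(toy_z.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns hold identical values even though the raw numbers differed by a factor of 1000, so neither one dominates a Euclidean distance.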

## Dimension Reduction using PCA
K-means, DBSCAN and agglomerative clustering typically rely on the Euclidean distance, which starts to lose its meaning as the number of dimensions increases. So, before using these methods, we should reduce the number of dimensions. I'm going to use PCA, one of the most popular dimensionality reduction algorithms.
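
Here is a small experiment that hints at why distances become less informative in high dimensions. It is only a sketch on uniformly random points (the point counts, dimensions and seed are arbitrary choices of mine), comparing how much farther the farthest point is than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# The contrast shrinks as the number of dimensions grows
contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, value in contrast.items():
    print(d, round(value, 2))
```

In two dimensions the farthest point is many times farther away than the nearest one, while in a thousand dimensions all points end up at nearly the same distance.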

<div class="alert alert-warning" role="alert">
  ⚠ If you are not familiar with PCA or need to learn more about it, I highly recommend you read <a href="https://github.com/HalflingWizard/MachineLearning/blob/main/4-%20Dimensionality%20Reduction/PCA.md">my notes</a> on this dimensionality reduction method, in which I cover almost everything you need to know about this algorithm.
</div>

Here I set the `n_components` parameter to 0.9, which means that PCA will automatically keep just enough principal components to preserve 90% of the variance in the dataset.

In [None]:
pca = PCA(n_components=0.9, random_state=42)
pca.fit(std_imp_data)
PC_names = ['PC'+str(x) for x in range(1,len(pca.components_)+1)]
pca_data = pd.DataFrame(pca.transform(std_imp_data), columns=PC_names)
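
The idea behind a fractional `n_components` can be sketched with a few made-up numbers. The variances below are hypothetical, and the snippet only illustrates the selection rule: keep the smallest number of components whose cumulative explained variance ratio reaches the target.

```python
import numpy as np

# Hypothetical variance captured by each principal component, largest first
toy_explained_variance = np.array([5.0, 3.0, 1.5, 0.5])
toy_ratio = toy_explained_variance / toy_explained_variance.sum()
toy_cumulative_ratio = np.cumsum(toy_ratio)

target = 0.9
# Smallest number of components whose cumulative ratio reaches the target
toy_n_components = int(np.argmax(toy_cumulative_ratio >= target)) + 1
print(toy_cumulative_ratio, toy_n_components)
```

For the fitted model above, `pca.explained_variance_ratio_` holds the real ratios, so `np.cumsum(pca.explained_variance_ratio_)` shows how much variance the kept components preserve.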

In [None]:
fig, ax = plt.subplots(figsize=(24, 16))
plt.imshow(pca.components_.T,
           cmap="Spectral",
           vmin=-1,
           vmax=1,
          )
plt.yticks(range(len(std_imp_data.columns)), std_imp_data.columns)
plt.xticks(range(len(pca_data.columns)), pca_data.columns)
plt.xlabel("Principal Component")
plt.ylabel("Feature")
plt.title("Contribution of Features to Components")
plt.colorbar()

# Train the Model
Now that we have finished preprocessing, we can perform K-means clustering on our data.

<div class="alert alert-warning" role="alert">
  ⚠ If you are not familiar with K-Means or need to learn more about it, I highly recommend you read <a href="https://github.com/HalflingWizard/MachineLearning/blob/main/3-%20Clustering/K-Means.md">my notes</a> on this clustering method, in which I cover almost everything you need to know about this algorithm.
</div>

First, we have to find good parameters for our model.

## Find the `n_clusters` parameter using the elbow method

In [None]:
model = KMeans(random_state=42)
distortion_visualizer = KElbowVisualizer(model, k=(2,10))

distortion_visualizer.fit(pca_data)       
distortion_visualizer.show()       

So, as you can see, the best value for `k` seems to be 4.
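
If you want to see the numbers behind an elbow plot, the sketch below runs K-means for several values of `k` and records `inertia_`, the sum of squared distances from each sample to its closest centroid. It uses synthetic blobs rather than the credit card data, and I am assuming that this quantity is close to the distortion score the visualizer plots by default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (not the credit card data)
toy_X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

toy_inertias = {}
for k in range(1, 7):
    toy_model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy_X)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    toy_inertias[k] = toy_model.inertia_

for k, inertia in toy_inertias.items():
    print(k, round(inertia, 1))
```

The inertia keeps falling as `k` grows, but the drop flattens out once `k` passes the true number of groups. That bend is the "elbow".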

In [None]:
km_model = KMeans(n_clusters=distortion_visualizer.elbow_value_, random_state=42)
labels = km_model.fit_predict(pca_data)

Now I add these labels to 3 dataframes:
- `cc_general`: original dataframe
- `std_imp_data`: imputed, standard dataframe
- `pca_data`: Transformed data after PCA

In [None]:
cc_general['LABELS'] = labels
std_imp_data['LABELS'] = labels
pca_data['LABELS'] = labels

Let's see how our data is distributed among these 4 clusters:

In [None]:
pca_data.LABELS.value_counts().plot.pie(autopct='%1.0f%%', pctdistance=0.7, labeldistance=1.1)

# Evaluate the Model
Let's see how good or bad our model is.

We start by calculating two metrics:
- **The Davies-Bouldin Index** is the average similarity between each cluster and its most similar cluster. Scores start at 0, and lower values indicate better clustering.
- **The Silhouette Coefficient** is a value between -1 and 1. The higher the score, the better: values near 1 indicate tight, well-separated clusters, values near 0 mean overlapping clusters, and negative values suggest points assigned to the wrong cluster.
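
To get a feel for the silhouette coefficient, here is a hand-rolled sketch on four made-up 1-D points in two clusters. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the other cluster, and the score is `(b - a) / max(a, b)`. With only two clusters, the "nearest other cluster" is simply the other one.

```python
import numpy as np

# Hypothetical 1-D points in two obvious clusters
toy_points = np.array([0.0, 1.0, 10.0, 11.0])
toy_labels = np.array([0, 0, 1, 1])

toy_scores = []
for i, (p, label) in enumerate(zip(toy_points, toy_labels)):
    same = (toy_labels == label)
    same[i] = False  # exclude the point itself
    a = np.abs(toy_points[same] - p).mean()                  # mean distance within its own cluster
    b = np.abs(toy_points[toy_labels != label] - p).mean()   # mean distance to the other cluster
    toy_scores.append((b - a) / max(a, b))

toy_silhouette = np.mean(toy_scores)
print(round(toy_silhouette, 3))
```

The clusters are tight and far apart, so the average score comes out close to 1.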

In [None]:
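# Drop the LABELS column so the labels are not used as a feature
features = pca_data.drop(['LABELS'], axis=1)
print(f'Davies-Bouldin index = {davies_bouldin_score(features, labels)}')
print(f'Silhouette Score = {silhouette_score(features, labels)}')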

Now, let's look at the silhouette plot:

In [None]:
visualizer = SilhouetteVisualizer(km_model, colors='yellowbrick')
visualizer.fit(pca_data.drop(['LABELS'],axis=1))
visualizer.show()

Here, the vertical dotted red line is the average score. Our clustering looks reasonable, since every cluster extends beyond the average and the per-cluster scores look decent.

# Interpret the results
All right, we have nice clusters, but what do they mean? Let's figure it out.

## PCs vs Labels!
Let's see which PCs have higher values in each label.

In [None]:
def spider_plot(data, title):
    # Mean of every feature per cluster (one row per label)
    means = data.groupby("LABELS").mean()
    names = means.columns
    # endpoint=False keeps the first and last features from sharing the same angle
    label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(names), endpoint=False)
    plt.figure(figsize=(10,10))
    plt.subplot(polar=True)
    for label, row in means.iterrows():
        plt.plot(label_loc, row.to_numpy(), label=f'class {label}')
    plt.title(f'Feature comparison ({title})', size=20)
    plt.thetagrids(np.degrees(label_loc), labels=names)
    plt.legend()
    plt.show()

In [None]:
spider_plot(pca_data, 'PCA Data')

Hmm... it looks like PC1, PC2 and PC3 are the most important principal components. Let's draw the same plot, this time for the original features:

## Standard, Imputed Data vs Labels!
The following plot should give us a better understanding of our clusters:

In [None]:
spider_plot(std_imp_data, 'Std, Impt Data')

Wow! Now we're talking. I know it's hard to read, but I just want you to notice these points:
- Class 0 contains customers who don't make a lot of money (look at their `BALANCE`, it is the lowest of all), but this doesn't keep them away from purchasing stuff! In terms of `PURCHASES`, they rank second. How do they do this? Take a closer look: they don't buy stuff in one go (they have the lowest `ONEOFF_PURCHASES` and `ONEOFF_PURCHASES_FREQUENCY`). Their key to success is _installments!_ It's easy: if you don't make enough money to buy stuff in one go, just pay over a period of time. (They have the highest values of `INSTALLMENTS_PURCHASES` and `PURCHASES_INSTALLMENTS_FREQUENCY`.) I refer to these people as **Dreamers** because, although they don't make much money, lack of money doesn't prevent them from reaching for their dreams!
- Class 1 shows customers who are not very rich and don't take risks. Their `BALANCE` (amount left in their account to make purchases) is below average, and they don't purchase much. (Their `PURCHASES` is below average as well, and their `PURCHASES_FREQUENCY` is very low.) I call these people **Economicals**. To them, every penny is important.
- Class 2 contains customers who have a good income (second highest `BALANCE`) and are enjoying it! They purchase a lot (highest `PURCHASES_FREQUENCY`), both in installments and in one go. Let's call them **Bourgeoisie**!
- Class 3 is mysterious. Look at them! They have the highest `BALANCE`, but the lowest `PURCHASES` of all! It seems they only use their fortunes when they want cash advances (highest `CASH_ADVANCE`, `CASH_ADVANCE_FREQUENCY` and `CASH_ADVANCE_TRX`). A cash advance is a service provided by most credit card and charge card issuers. The service allows cardholders to withdraw cash, either through an ATM or over the counter at a bank or other financial agency, up to a certain limit. For a credit card, this will be the credit limit (or some percentage of it). So, these people don't use their credit cards to buy stuff; instead, they get cash from ATMs to do so. Why? Is it because they want to buy something illegal? Let's just call them **The Mafia** for now.

## Evaluating our hypothesis
Now, I want to plot our data using only `BALANCE` and `PURCHASES`. This is my hypothesis:
- If `BALANCE` is low and `PURCHASES` is high ➡ Class 0 (Dreamers ✨)
- If `BALANCE` is low and `PURCHASES` is low ➡ Class 1 (Economicals 💲)
- If `BALANCE` is high and `PURCHASES` is high ➡ Class 2 (Bourgeoisie 🛍️)
- If `BALANCE` is high and `PURCHASES` is low ➡ Class 3 (The Mafia 🕶️)
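
The four rules above can also be written down as a tiny rule-based function. This is only a sketch: the function name `hypothesized_class` is made up, and treating "high" as "above 0 on the standardized scale" is my assumption, not something derived from the clustering.

```python
def hypothesized_class(balance, purchases, threshold=0.0):
    """Map standardized BALANCE and PURCHASES values to the hypothesized segment.

    The threshold of 0 (the mean after standardization) is an assumption
    made for this sketch, not a value derived from the clustering.
    """
    high_balance = balance > threshold
    high_purchases = purchases > threshold
    if not high_balance and high_purchases:
        return 'Dreamers'
    if not high_balance and not high_purchases:
        return 'Economicals'
    if high_balance and high_purchases:
        return 'Bourgeoisie'
    return 'The Mafia'

# Low balance with high purchases, then high balance with low purchases
print(hypothesized_class(-0.8, 1.2))
print(hypothesized_class(1.5, -0.9))
```

Comparing the output of such a rule with the K-means labels would be one way to check how much of the clustering these two features explain on their own.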

In [None]:
def colorful_scatter(data):   
    LABEL_COLOR_MAP = {0 : 'y',
                       1 : 'g',
                       2 : 'm',
                       3 : 'k'
                       }
    sns.jointplot(data=data, x="BALANCE", y="PURCHASES", hue="LABELS", palette=LABEL_COLOR_MAP, alpha=0.6, height=10)

In [None]:
colorful_scatter(cc_general)

It's hard to read... let's use our standardized dataframe:

In [None]:
colorful_scatter(std_imp_data)

The following KDE plot also helps prove my point:

In [None]:
plt.figure(figsize=(10, 10))  # kdeplot has no height parameter, so size the figure here
sns.kdeplot(data=std_imp_data, x="BALANCE", y="PURCHASES", hue="LABELS", palette={0 : 'y', 1 : 'g', 2 : 'm', 3 : 'k'}, alpha=.7)

It looks like my hypothesis was quite right.
In this plot, it is clear that:
- people in Class 0 spend a lot even though they have a low balance
- people in Class 1 have a low balance and spend less than others
- people in Class 2 have a high balance and purchase a lot
- people in Class 3 don't purchase much, although they have lots of money

We can further investigate this hypothesis using KDE plots:

In [None]:
def kde_plot(data,x):
    LABEL_COLOR_MAP = {0 : 'y',
                   1 : 'g',
                   2 : 'm',
                   3 : 'k'
                   }
    sns.kdeplot(data=data, x=x, hue="LABELS", palette=LABEL_COLOR_MAP)

In [None]:
kde_plot(cc_general, 'PURCHASES_FREQUENCY')

Looking at this plot, it is obvious that ***the Mafia*** and ***Economicals*** purchase less often than ***Dreamers*** and ***Bourgeoisie***.

In [None]:
kde_plot(cc_general, 'PURCHASES_INSTALLMENTS_FREQUENCY')

This plot shows how ***the Dreamers*** try to get whatever they pursue by buying first and paying later.

In [None]:
kde_plot(cc_general, 'CASH_ADVANCE_FREQUENCY')

And this plot shows that the infamous ***Mafia*** take cash advances more often than the other groups. Should we call the cops? 😈

# Conclusion
Congrats! We found patterns hidden in this dataset by using cool ML tools. 🥳

Let's keep learning!

<div class="alert alert-danger" role="alert" style="text-align:center;">
    I hope you enjoyed this tutorial. If you did, please consider subscribing to <b><a href="https://www.youtube.com/channel/UC34Gj0-vHuBiTNEYlP7wczg">my YouTube Channel ▶</a></b>
</div>

<center><h2><span style="font-family:cursive;"> Also, please Upvote! 😜 </span></h2></center>