# Business Data Analytics - Exercise Unsupervised Learning

This notebook demonstrates how **clustering** and **dimensionality reduction** can be applied to segment customers based on their credit card usage behaviour. The code used throughout this tutorial is inspired by [Saba Naseem Butt's notebook on Kaggle](https://www.kaggle.com/code/sabanasimbutt/clustering-visualization-of-clusters-using-pca).

The notebook follows these steps:
1. Preprocessing 
2. Clustering
3. Interpreting the clusters
4. Visualising the results with PCA

### Import libraries

In [None]:
!pip install -r requirements.txt

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from typing import List, Union

### Loading the data

In [None]:
# load data
data = pd.read_csv("data.csv")

The following variables are contained in the dataset:

- **CUST_ID**: Identification of the credit card holder (categorical)
- **BALANCE**: Balance amount left in the account to make purchases
- **BALANCE_FREQUENCY**: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- **PURCHASES**: Amount of purchases made from the account
- **ONEOFF_PURCHASES**: Maximum purchase amount made in one go
- **INSTALLMENTS_PURCHASES**: Amount of purchases made in installments
- **CASH_ADVANCE**: Cash in advance given by the user
- **PURCHASES_FREQUENCY**: How frequently purchases are made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- **ONEOFF_PURCHASES_FREQUENCY**: How frequently purchases are made in one go (1 = frequently purchased, 0 = not frequently purchased)
- **PURCHASES_INSTALLMENTS_FREQUENCY**: How frequently purchases are made in installments (1 = frequently done, 0 = not frequently done)
- **CASH_ADVANCE_FREQUENCY**: How frequently cash in advance is being paid
- **CASH_ADVANCE_TRX**: Number of transactions made with "Cash in Advance"
- **PURCHASES_TRX**: Number of purchase transactions made
- **CREDIT_LIMIT**: Credit card limit of the user
- **PAYMENTS**: Amount of payments made by the user
- **MINIMUM_PAYMENTS**: Minimum amount of payments made by the user
- **PRC_FULL_PAYMENT**: Percent of full payment paid by the user
- **TENURE**: Tenure of the credit card service for the user

### Brief Exploratory Data Analysis

In [None]:
data.shape

In [None]:
data.head()

In [None]:
# the summary statistics show that our data contains lots of outliers
data.describe()

In [None]:
# we have some missing values
data.isnull().sum().sort_values(ascending=False).head()

### Preprocessing

Before we can feed our data into a clustering algorithm, we need to preprocess it.

#### Task: Imputing missing values

For the sake of simplicity, we can replace each missing value with the mean of its column. You can either use [sklearn's SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) or do it manually.
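The manual variant can be sketched on a small toy frame. This is only an illustration under assumptions: the frame `toy` and its values are made up and stand in for the real `data`, and `fillna` with the column means is one of several possible ways to do it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit card data (values are made up).
toy = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, np.nan, 3000.0],
    "MINIMUM_PAYMENTS": [100.0, 200.0, np.nan],
})

# toy.mean() computes one mean per column, ignoring missing values;
# fillna then replaces each NaN with the mean of its own column.
toy_imputed = toy.fillna(toy.mean(numeric_only=True))
print(toy_imputed)
```

The same pattern should carry over to the full dataframe, as long as only numeric columns contain missing values.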

Use the cell below to answer this task

#### Task: Bin numeric values

To cap outliers in our distributions and to provide interpretable results, it is sometimes useful to transform our numeric features into meaningful bins.

For instance, the balance column can be divided into 7 bins, so that 
- the 1st bin represents values between -inf < x <= 0
- the 2nd bin represents values between 0 < x <= 500
- the 3rd bin represents values between 500 < x <= 1000
- the 4th bin represents values between 1000 < x <= 3000
- the 5th bin represents values between 3000 < x <= 5000
- the 6th bin represents values between 5000 < x <= 10000
- the 7th bin represents values between 10000 < x < inf

To do this for all our features, we will define a function that returns a pandas Series with the bin number of every value in a column of our dataframe. The example above illustrates the basic functionality: a balance of 256 falls into the 2nd bin (0 < x <= 500), so its bin number is 1 if we denote the first bin as 0. To keep the function flexible, it accepts two inputs: a pandas Series with the values to be binned and a list with the corresponding binning thresholds.

To get started, you can have a look at the [pandas.cut()](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) function. It provides some examples of how we can leverage this function for binning.
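As a small illustration of `pandas.cut()` on the balance thresholds listed above, consider the sketch below. The series `balances` and the name `edges` are made up for this example; `labels=False` asks for integer bin indices instead of interval labels, and the bins are right-closed by default.

```python
import numpy as np
import pandas as pd

# Made-up balances, one per bin of interest.
balances = pd.Series([0.0, 256.0, 750.0, 12000.0])
edges = [-np.inf, 0, 500, 1000, 3000, 5000, 10000, np.inf]

# labels=False returns the zero-based index of the bin each value falls into.
bin_index = pd.cut(balances, bins=edges, labels=False)
print(bin_index.tolist())  # [0, 1, 2, 6]
```

The balance of 256 ends up in bin 1, which matches the example in the text.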

Use the cell below to define the function.

In [None]:
def apply_binning(series: pd.Series, thresholds: List[Union[int, float]]) -> pd.Series:
    
    # fill this function with your code
    
    pass

In [None]:
## first batch of column transformations
thresholds = [-np.inf, 0, 500, 1000, 3000, 5000, 10000, np.inf]

for column in ['BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS']:    
    data[column] = apply_binning(series=data[column], thresholds=thresholds)

## second batch of column transformations
thresholds = [-np.inf, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, np.inf]

for column in ['BALANCE_FREQUENCY', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'PRC_FULL_PAYMENT']:
    data[column] = apply_binning(series=data[column], thresholds=thresholds)
    
## third batch of column transformations
thresholds = [-np.inf, 0, 5, 10, 15, 20, 30, 50, 100, np.inf]

for column in ['PURCHASES_TRX', 'CASH_ADVANCE_TRX']:
    data[column] = apply_binning(series=data[column], thresholds=thresholds)

In [None]:
# drop customer id since it is no longer needed 
data.drop(['CUST_ID'], axis=1, inplace=True)

#### Task: Feature Scaling

Since we will be using KMeans later on, we need to scale our input data so that each feature contributes equally to the distance measure.
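One common way to do this is standardisation with `StandardScaler`, which is already imported at the top of the notebook. The sketch below uses a made-up array `toy` instead of the real data and only shows the mechanics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
toy = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])

# fit_transform learns the mean and standard deviation of each column
# and returns the standardised values (zero mean, unit variance per column).
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)
print(toy_scaled)
```

After scaling, both columns have the same spread, so neither dominates the Euclidean distances that KMeans relies on.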

Use the cell below to answer this task.

### Clustering

In this section of the notebook, we apply [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering in two steps:

1. First, we will determine the "optimal" number of clusters for KMeans. 
2. Once we have found this number, we will reapply the KMeans algorithm on the dataset. 

#### Task: Determining the number of clusters

Use the elbow criterion to determine the "optimal" number of clusters. To do so, define a range of candidate cluster counts, fit KMeans for each of them, and save the resulting sum of squared distances (available as the `inertia_` attribute of a fitted model) in a list. Afterwards, plot the number of clusters against the sum of squared distances. The point where the curve bends and further clusters bring only small improvements (the "elbow") indicates the "optimal" number of clusters.
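The loop can be sketched as follows. The data here is synthetic (three well-separated blobs generated with NumPy), and the range of 1 to 6 clusters as well as `n_init=10` and `random_state=0` are arbitrary choices for the illustration, not recommendations for the credit card data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 50 points each in two dimensions.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    # inertia_ is the sum of squared distances of the samples to their closest centre
    inertias.append(model.inertia_)

# Plot the number of clusters against the sum of squared distances.
fig, ax = plt.subplots()
ax.plot(list(ks), inertias, marker="o")
ax.set_xlabel("number of clusters")
ax.set_ylabel("sum of squared distances")
```

On this synthetic data the curve drops sharply up to three clusters and flattens afterwards. Real data rarely shows such a clean bend, so the choice usually involves some judgement.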

Use the cells below to answer this task.

#### Task: Clustering on "optimal" number of clusters

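Use the plot from above to infer the "optimal" number of clusters, then fit KMeans with this number and store the cluster assignment of each data point in a variable called `labels`. Use the cell below to answer this task.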
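A minimal sketch of this step, again on synthetic data: the value `n_clusters=3` is a placeholder that fits the toy blobs and is not the answer for the credit card data. `fit_predict` fits the model and returns one cluster label per row.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 50 points each in two dimensions.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

# Placeholder number of clusters; replace it with the value read off your own elbow plot.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(toy)

# Number of points assigned to each cluster.
print(np.bincount(labels))
```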

In [None]:
# add the cluster label of each data point as a new column to the dataframe
# the variable 'labels' must be defined by you in the previous step
clusters = pd.concat([data, pd.DataFrame({'cluster': labels})], axis=1)
clusters.head()

### Interpretation of clusters

Now we can start interpreting the clusters. For this purpose, we plot a histogram of every feature for each of the clusters we found.

In [None]:
# one figure per feature, with one histogram panel per cluster
for c in clusters.columns.drop('cluster'):
    grid = sns.FacetGrid(clusters, col='cluster')
    grid.map(plt.hist, c)

#### Task: Describe the clusters you have found in your own words. 
Use this cell for your answer.


### Visualization of clusters

We have now derived clusters from the credit card usage behaviour. To further inspect our results, we will visualize the clusters.

#### Task: Visualize clusters
Currently, our feature space has more than 2 dimensions, which makes it difficult to plot our results. Hence, we need to project our feature space onto 2 dimensions. With PCA, we simply set the number of components to 2 and transform our dataset. Afterwards, we can plot the two components and colour each point by its cluster. To get started with PCA, you can have a look at [sklearn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). To visualize the results, you can use [matplotlib](https://matplotlib.org/).
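The projection and plot can be sketched as below. The five-dimensional data, the two clusters, and all variable names are made up for the illustration. In the notebook you would pass your scaled features and your own `labels` instead.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic data: two blobs of 40 points each in five dimensions.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 5)) for c in (0.0, 4.0)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy)

# Project the five features onto the first two principal components.
pca = PCA(n_components=2)
projected = pca.fit_transform(toy)

# Scatter plot of the projection, coloured by cluster label.
fig, ax = plt.subplots()
ax.scatter(projected[:, 0], projected[:, 1], c=labels, cmap="viridis", s=15)
ax.set_xlabel("principal component 1")
ax.set_ylabel("principal component 2")

# Share of the total variance that the two components retain.
print(pca.explained_variance_ratio_.sum())
```

Printing the retained variance is a useful check: if the two components capture only a small share, clusters that overlap in the plot may still be well separated in the full feature space.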