
# Credit Card Dataset  
### By: Fabio Pinto

This notebook is an example of using clustering for customer segmentation to define a marketing strategy. The sample dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months.

It includes the following variables:

+ CUST_ID: Identification of the credit card holder
+ BALANCE: Balance amount left in the account to make purchases
+ BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1
+ PURCHASES: Amount of purchases made from the account
+ ONEOFF_PURCHASES: Maximum purchase amount done in one go
+ INSTALLMENTS_PURCHASES: Amount of purchases done in installments
+ CASH_ADVANCE: Cash in advance given by the user
+ PURCHASES_FREQUENCY: How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
+ ONEOFF_PURCHASES_FREQUENCY: How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
+ PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
+ CASHADVANCE_FREQUENCY: How frequently cash in advance is being paid
+ CASHADVANCE_TRX: Number of transactions made with "Cash in Advance"
+ PURCHASES_TRX: Number of purchase transactions made
+ CREDIT_LIMIT: Credit card limit for the user
+ PAYMENTS: Amount of payments made by the user
+ MINIMUM_PAYMENTS: Minimum amount of payments made by the user
+ PRC_FULL_PAYMENT: Percent of full payment paid by the user
+ TENURE: Tenure of the credit card service for the user

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
# clustering utilities
from scipy.spatial import distance_matrix
from scipy.cluster.hierarchy import dendrogram, linkage

In [None]:
datacredit = pd.read_csv("../input/ccdata/CC GENERAL.csv")
datacredit.head()

In [None]:
## let's see which columns have NAs
datacredit.isna().sum()

In [None]:
## We fill the NAs in the dataset with the mean of each column
mp_mean = datacredit['MINIMUM_PAYMENTS'].mean()
cl_mean = datacredit['CREDIT_LIMIT'].mean()
datacredit['MINIMUM_PAYMENTS'] = datacredit['MINIMUM_PAYMENTS'].fillna(mp_mean)
datacredit['CREDIT_LIMIT'] = datacredit['CREDIT_LIMIT'].fillna(cl_mean)
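
Mean imputation can be sketched on a small frame to make the per-column behavior visible. This is only an illustration: the frame `toy` and its values are invented, and only the two column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "MINIMUM_PAYMENTS": [100.0, np.nan, 300.0],
    "CREDIT_LIMIT": [1000.0, 3000.0, np.nan],
})

# Fill each column's missing values with that same column's mean.
cols = ["MINIMUM_PAYMENTS", "CREDIT_LIMIT"]
toy[cols] = toy[cols].fillna(toy[cols].mean())

print(toy)
```

Passing a Series of means to `fillna` aligns it by column name, so each column receives its own mean rather than a shared value.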

In [None]:
datacredit.hist(figsize = (15,20));

In [None]:
# after the imputation 
datacredit.head()

In [None]:
datacredit.drop("CUST_ID", axis = 1, inplace = True)
datacredit.head()

In [None]:
datacredit.describe()

## Data Processing

In [None]:
## scaling
from sklearn.preprocessing import StandardScaler

escala = StandardScaler()

copiadata = escala.fit_transform(datacredit)
datacopia = pd.DataFrame(copiadata, columns= datacredit.columns)
datacopia.head()
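
To make the scaling step concrete, the sketch below compares `StandardScaler` with a manual z-score on a toy matrix. The array `X` is invented for illustration and is not part of the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

Scaling matters here because Ward linkage works on Euclidean distances, so an unscaled monetary column would otherwise dominate the frequency scores that lie between 0 and 1.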

## Model construction

In [None]:
enlaces = linkage(datacopia, method = "ward")
enlaces
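
The linkage matrix is easier to read on a tiny example. The sketch below uses four invented points (`points` is hypothetical) and prints the resulting matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight pairs of points, far apart.
points = np.array([[0.0, 0.0],
                   [0.0, 1.0],
                   [10.0, 0.0],
                   [10.0, 1.0]])

Z = linkage(points, method="ward")

# n observations produce n - 1 merges; each row is
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster].
print(Z.shape)
print(Z)
```

The third column (merge distance) is what the dendrogram draws on its vertical axis, and the last row always describes the final merge that contains every observation.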

In [None]:
dendrogram(enlaces);

# 4 clusters are identified

In [None]:
## let's see the truncated dendrogram
plt.figure(figsize=(25,10))
dend = dendrogram(enlaces, truncate_mode="lastp", p = 10, show_leaf_counts= True, show_contracted= True) 

## We will see the appropriate number of clusters

In [None]:
## automatic cut of the dendrogram 

from scipy.cluster.hierarchy import inconsistent

In [None]:
## inconsistent method 

depth = 5
incons = inconsistent(enlaces, depth)
incons[-10:]

In [None]:
## elbow method
last = enlaces[-10:,2]
last_rev = last[::-1]
print(last_rev)

idx = np.arange(1, len(last)+1)
plt.plot(idx, last_rev)

acc = np.diff(last,2)
acc_rev = acc[::-1]
print(acc_rev)
plt.plot(idx[:-2]+1, acc_rev)
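
The acceleration curve can also be turned into a suggested number of clusters by taking the position of its maximum. The sketch below applies that heuristic to synthetic data; the blobs, the seed, and the names `points` and `k` are assumptions made for illustration, not results from the credit card dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

Z = linkage(points, method="ward")

# Merge distances of the last 10 merges, most recent first.
last_rev = Z[-10:, 2][::-1]

# Second difference ("acceleration") of the merge distances.
acc_rev = np.diff(Z[-10:, 2], 2)[::-1]

# acc_rev[0] corresponds to cutting into 2 clusters, hence the +2.
k = int(acc_rev.argmax()) + 2
print("suggested number of clusters:", k)
```

This is only a heuristic: it tends to work when clusters are well separated and of similar size, so it is worth checking the suggestion against the dendrogram itself.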

In [None]:
## visualization of cluster 

from scipy.cluster.hierarchy import fcluster

## Attach the cluster labels to the observations
clusteres = fcluster(enlaces, 4, criterion="maxclust")
datacredit['Cluster'] = clusteres
datacopia['Cluster'] = clusteres
datacredit.head()
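
A quick look at the cluster sizes is a useful check after cutting the tree. The sketch below shows how `fcluster` with `criterion="maxclust"` assigns labels on invented data; `points`, `labels`, and `sizes` are hypothetical names.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: three points near the origin, two far away.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0]])

Z = linkage(points, method="ward")

# criterion="maxclust" cuts the tree into at most t flat clusters;
# labels start at 1, not 0.
labels = fcluster(Z, t=2, criterion="maxclust")

# Cluster sizes show whether any segment is too small to act on.
sizes = pd.Series(labels, name="Cluster").value_counts().sort_index()
print(sizes)
```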

## Evaluation and conclusions

In [None]:

datacopia.boxplot(figsize = (25,25), fontsize = 8, by='Cluster', rot =45, autorange = True );
plt.ylim(-5, 30)

In [None]:
datacopia.boxplot(column='PURCHASES', by='Cluster' )
plt.ylim(-1,30)

In [None]:
datacopia.boxplot(column='ONEOFF_PURCHASES', by='Cluster' )
plt.ylim(-1,30)

In [None]:
datacopia.boxplot(column='CREDIT_LIMIT', by='Cluster' )
plt.ylim(-1,30)

In [None]:
datacopia.boxplot(column='BALANCE', by='Cluster' )
plt.ylim(-1,20)

# CONCLUSION
## IDENTIFIED CLUSTERS 

+ CLUSTER 1 - PEOPLE WITH LOW LEVEL OF INCOME. Infrequent purchases.
+ CLUSTER 2 - PEOPLE WITH MEDIUM LEVEL OF INCOME. Highly frequent purchases.
+ CLUSTER 3 - PEOPLE WITH HIGH LEVEL OF INCOME. Infrequent purchases and the highest cash advance level.
+ CLUSTER 4 - PEOPLE WITH LOW LEVEL OF INCOME. Frequent purchases. 
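
Descriptions like these can be checked against a per-cluster summary table. The dataset has no income column, so income labels rest on proxies such as BALANCE and CREDIT_LIMIT. The sketch below shows one way to build such a table; the frame `toy` and its values are invented stand-ins for the labeled `datacredit`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `datacredit` after labeling;
# the values are invented for illustration.
toy = pd.DataFrame({
    "BALANCE": [100.0, 200.0, 5000.0, 6000.0],
    "PURCHASES_FREQUENCY": [0.0, 0.25, 0.75, 1.0],
    "Cluster": [1, 1, 2, 2],
})

# Medians are less sensitive than means to the heavy right tails
# that monetary columns usually have.
profile = toy.groupby("Cluster").median()
print(profile)
```

Running the same `groupby` on the unscaled `datacredit` keeps the values in their original units, which makes the segments easier to describe to a marketing audience.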