## Credit Card Clustering


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing as pp
from sklearn.cluster import KMeans

sns.set()
%matplotlib inline

# Display Options
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = None

### Preparing data
- View types;
- Summarize / Overview dataset;
- Fill NAs

In [None]:
# Import data
data = pd.read_csv("../input/CC GENERAL.csv")
data.head()

In [None]:
# Overview
data.describe()

The summary above suggests that the variables `BALANCE`, `PURCHASES`, `ONEOFF_PURCHASES`, `INSTALLMENTS_PURCHASES`, `CASH_ADVANCE`, `CASH_ADVANCE_TRX`, `PURCHASES_TRX`, `CREDIT_LIMIT`, `PAYMENTS` and `MINIMUM_PAYMENTS` have outliers (compare each `max` with its quartiles). We'll treat them with a log transformation before standardizing.
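
To illustrate why a log transformation helps with this kind of data, here is a minimal sketch on a made-up, right-skewed sample. The series `amounts` and its values are hypothetical, not taken from the dataset; `np.log1p(x)` computes `log(1 + x)`, the same quantity the notebook uses later.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: mostly small values plus one extreme value.
amounts = pd.Series([0.0, 10.0, 25.0, 40.0, 60.0, 5000.0], name="toy_amount")

# log1p(x) == log(1 + x), so a zero maps to 0 instead of -inf.
logged = np.log1p(amounts)

# Sample skewness before and after: the transform pulls the long right tail in.
skew_before = amounts.skew()
skew_after = logged.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The extreme value is still the largest after the transform, but it no longer dominates the scale of the variable.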

In [None]:
# View missing values (count)
data.isna().sum()

In [None]:
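# Fill NAs with the column mean
# (numeric_only=True skips non-numeric columns such as CUST_ID,
#  which newer pandas versions would otherwise fail on)
data = data.fillna(data.mean(numeric_only=True))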

data.isna().sum()

In [None]:
# Remove CUST_ID (not useful for clustering)
data.drop("CUST_ID", axis=1, inplace=True)

### Data exploration
- View types;
- Data visualization

In [None]:
data.dtypes

In [None]:
# Unique values for int64 types
data[['CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'TENURE']].nunique()

In [None]:
# Correlation plot
sns.heatmap(data.corr(),
            xticklabels=data.columns,
            yticklabels=data.columns
           )

In [None]:
# Pairplot - dispersion between variables
sns.pairplot(data)

In [None]:
# Distribution of int64 variables
fig, axes = plt.subplots(nrows=3, ncols=1)
ax0, ax1, ax2 = axes.flatten()

ax0.hist(data['CASH_ADVANCE_TRX'], 65, histtype='bar', stacked=True)
ax0.set_title('CASH_ADVANCE_TRX')

ax1.hist(data['PURCHASES_TRX'], 173, histtype='bar', stacked=True)
ax1.set_title('PURCHASES_TRX')

ax2.hist(data['TENURE'], 7, histtype='bar', stacked=True)
ax2.set_title('TENURE')

fig.tight_layout()
plt.show()

## Feature generation
### Techniques used:
- Log transformation;
- Standardization;
- Statistics for some variables (such as mean, median, first and third quartiles, and mode)

In [None]:
# Create a copy of data
features = data.copy()
list(features)

In [None]:
# Log-transformation

cols =  ['BALANCE',
         'PURCHASES',
         'ONEOFF_PURCHASES',
         'INSTALLMENTS_PURCHASES',
         'CASH_ADVANCE',
         'CASH_ADVANCE_TRX',
         'PURCHASES_TRX',
         'CREDIT_LIMIT',
         'PAYMENTS',
         'MINIMUM_PAYMENTS',
        ]

# Note: add 1 to each value so that zeros map to 0 instead of -inf
features[cols] = np.log(1 + features[cols])

features.head()

In [None]:
features.describe()

#### Outliers


Since this is a clustering task, I decided to test first without replacing outliers. It is still important to know where they are, so that we can later check whether outliers end up inside particular clusters.

We use the _IQR score_ to identify outliers in the dataset.
The _IQR method_ is the one boxplots use to flag possible outliers. Wikipedia defines it as:

> The interquartile range (IQR), also called the midspread or middle 50%, or technically H-spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, IQR = Q3 − Q1.
In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot on the data.
It is a measure of the dispersion similar to standard deviation or variance, but is much more robust against outliers.



**For now, we'll leave the outliers untouched, because removing or replacing them may harm the clustering.**
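
To make the rule concrete: the usual boxplot convention flags a value as a possible outlier when it lies more than 1.5 × IQR below Q1 or above Q3. The sketch below applies that rule to a small made-up array (`values` is hypothetical, not part of the dataset).

```python
import numpy as np

# Hypothetical sample with one obvious extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower_fence, upper_fence] is a possible outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
inliers = values[~is_outlier]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

Note that the inliers are simply the complement of the outlier mask, which keeps the two groups disjoint by construction.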


In [None]:
# Use a boxplot to identify possible outliers after the log transform

features.boxplot(rot=90, figsize=(30,10))

Applying the IQR method to our dataset:

In [None]:
cols = list(features)
iqr_score = {}

for c in cols:
    q1 = features[c].quantile(0.25)
    q3 = features[c].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Outliers fall outside the fences; regular values fall inside both (note the &)
    outliers = features[(features[c] < lower) | (features[c] > upper)][c]
    values = features[(features[c] >= lower) & (features[c] <= upper)][c]

    iqr_score[c] = {
        "Q1": q1,
        "Q3": q3,
        "IQR": iqr,
        "n_outliers": outliers.count(),
        "outliers_avg": outliers.mean(),
        "outliers_stdev": outliers.std(),
        "outliers_median": outliers.median(),
        "values_avg": values.mean(),
        "values_stdev": values.std(),
        "values_median": values.median(),
    }

iqr_score = pd.DataFrame.from_dict(iqr_score, orient='index')

iqr_score

#### Feature Scaling
Here we can use the `scale` function from `sklearn.preprocessing`. It puts all variables on the same scale, with _mean zero_ and _standard deviation equal to one_.
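
As a quick sanity check of that claim, the sketch below scales a small made-up matrix (`toy` is hypothetical) and verifies the resulting column means and standard deviations. It assumes the default behaviour of `scale`, which standardizes column-wise and uses the population standard deviation.

```python
import numpy as np
from sklearn import preprocessing as pp

# Hypothetical matrix: two columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

# scale() standardizes each column (axis=0 by default).
scaled = pp.scale(toy)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)  # population std (ddof=0), matching scale()

print("means:", col_means)
print("stds: ", col_stds)
```

Because `scale` accepts a 2-D array, the same call should also work on several columns at once rather than one column per loop iteration.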

In [None]:
# Scale All features

for col in cols:
    features[col] = pp.scale(np.array(features[col]))

features.head()

## Clustering using K-Means


Now we're ready to apply the clustering algorithm, using `KMeans` from `sklearn.cluster`.


First, we use the elbow method to find an adequate number of clusters:
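
The quantity plotted in an elbow curve is the inertia: the sum of squared distances from each point to its assigned cluster center. The sketch below, on made-up blob data (`X_toy`, `centers` and the choice of three blobs are assumptions for illustration only), recomputes the inertia by hand and shows how it flattens once `k` reaches the true number of groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy data: three well-separated 2-D blobs of 50 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Inertia for a range of k, as in an elbow plot.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = km.inertia_

# Recompute the inertia by hand for k=3: squared distance of every point
# to the center of the cluster it was assigned to, summed over all points.
km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)
assigned_centers = km3.cluster_centers_[km3.labels_]
manual_inertia = ((X_toy - assigned_centers) ** 2).sum()

for k, value in inertias.items():
    print(k, round(value, 1))
```

On data like this the curve drops sharply up to `k = 3` and barely moves afterwards; on real data the bend is usually less clear-cut, so the choice of `k` remains a judgment call.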

In [None]:
X = np.array(features)
Sum_of_squared_distances = []
K = range(1, 30)

for k in K:
    km = KMeans(n_clusters=k, random_state=0)
    km = km.fit(X)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()


I chose `k = 10` as the number of clusters, based on the plot above.

In [None]:
# Customers per cluster

n_clusters = 10

clustering = KMeans(n_clusters=n_clusters,
                    random_state=0
                   )

cluster_labels = clustering.fit_predict(X)

# plot cluster sizes

plt.hist(cluster_labels, bins=range(n_clusters+1))
plt.title('# Customers per Cluster')
plt.xlabel('Cluster')
plt.ylabel('# Customers')
plt.show()

# Assign cluster number to features and original dataframe
features['cluster_index'] = cluster_labels
data['cluster_index'] = cluster_labels

In [None]:
# Pairplot - dispersion between variables, coloured by cluster
sns.pairplot(features, hue='cluster_index')

In [None]:
# View Features
features

In [None]:
# View results
data

## To-Do:
- [ ] Outlier analysis for each cluster;
- [ ] Interpretation of clusters
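
One possible starting point for the interpretation item is a per-cluster profile of the original (untransformed) variables. The sketch below is only an illustration on a hypothetical miniature frame (`toy` and its values are made up); the same `groupby` call could presumably be applied to `data` once `cluster_index` has been assigned.

```python
import pandas as pd

# Hypothetical miniature of a labelled frame (values are made up).
toy = pd.DataFrame({
    "BALANCE": [100.0, 120.0, 5000.0, 5200.0, 110.0],
    "PURCHASES": [50.0, 70.0, 10.0, 0.0, 60.0],
    "cluster_index": [0, 0, 1, 1, 0],
})

# Mean and median of every variable within each cluster.
profile = toy.groupby("cluster_index").agg(["mean", "median"])

# Number of customers per cluster, ordered by cluster label.
sizes = toy["cluster_index"].value_counts().sort_index()

print(profile)
print(sizes)
```

Comparing the mean with the median inside each cluster also gives a first hint of whether outliers are concentrated in particular clusters.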