**<font size = "5" > Credit Card Dataset for Clustering Using Gaussian Mixture Model </font>**<br><br>
**<font size = "4">Problem Statement:</font>**<br><br>
<font size = "3"> The dataset contains customer-wise credit card usage data. There is one observation per customer, so each cust_id
occurs exactly once. The objectives of this analysis are:

-  Customer-segmentation.
-  Discover unusual or anomalous customers.
-  Perform more extensive credit-card related analysis with such data.</font><br>

**<font size = "4">Contents:</font>**<br>
<font size = "3"> 
- Data Preprocessing
- Plotting and Graphing of data
- Clustering of data using GMM
- Visualization of Clusters using TSNE
- Differences between anomalous and unanomalous clients
</font><br>

**<font size ="4">Import Libraries:</font>**

In [None]:
%reset -f  

import warnings
warnings.filterwarnings("ignore")

# 1.1 Data manipulation library
import pandas as pd
import numpy as np
%matplotlib inline

# 1.2 OS related package

import os

# 1.3 Modeling library
# 1.3.1 Scale data

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
# 1.4 Plotting library

import seaborn as sns
import matplotlib.pyplot as plt

# 1.5 Import GaussianMixture class

from sklearn.mixture import GaussianMixture

# 1.6 TSNE
from sklearn.manifold import TSNE


**<font size = "3">Load the Dataset:</font>**<br>

In [None]:
# Read the CSV file at the location given below into a DataFrame

cc=pd.read_csv("../input/ccdata/CC GENERAL.csv")

**<font size = "3">Displaying the first 5 rows of DataFrame:</font>**<br>

In [None]:
cc.head()

**<font size = "3">No of rows and columns in DataFrame:</font>**<br>

In [None]:
cc.shape

**<font size = "3">Summary of DataFrame:</font>**<br>

In [None]:
cc.info()

**<font size = "3">Change the column names of DataFrame to lower case:</font>**<br>

In [None]:
cc.columns = [i.lower() for i in cc.columns]
cc.columns

**<font size = "3">Drop the cust_id column of DataFrame:</font>**<br>

In [None]:
cc.drop(columns="cust_id",inplace=True)
cc.columns

**<font size ="3">Counting missing values in each column:</font>**

In [None]:
cc.isnull().sum()

**<font size ="3">Distribution plot of columns having missing value:</font>**

In [None]:
sns.distplot(cc.credit_limit,color="b")
sns.distplot(cc.minimum_payments,color="g")

**<font size ="3">Filling NaN values with column median:</font>**

In [None]:
values = {'minimum_payments' :   cc['minimum_payments'].median(),
          'credit_limit'     :   cc['credit_limit'].median()
         }

cc.fillna(value=values,inplace=True)
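
<font size ="3">The median is often preferred over the mean for skewed columns because a few very large values barely move it. Below is a minimal sketch of the same imputation on a made-up toy frame; the values are illustrative and are not taken from the dataset.</font>

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the credit-card data (values are made up).
toy = pd.DataFrame({
    "minimum_payments": [100.0, np.nan, 300.0, 5000.0],
    "credit_limit": [1000.0, 2000.0, np.nan, 3000.0],
})

# One median per column; NaNs are skipped when the median is computed.
medians = toy.median()

# Passing a Series to fillna fills each column with its own median.
filled = toy.fillna(medians)

print(filled)
```

<font size ="3">In this toy example the missing `minimum_payments` entry becomes 300, whereas the column mean would have been 1800 because of the single large value.</font>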

**<font size ="3">Counting missing values in each column again:</font>**

In [None]:
cc.isnull().sum()

<font size ="3">Now no column has missing values😊</font>

**<font size ="3">Feature standardization & observation-normalization:</font>**

In [None]:
ss =  StandardScaler()
out = ss.fit_transform(cc)
out = normalize(out)
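
<font size ="3">The two steps act along different axes: `StandardScaler` rescales each column (feature) to zero mean and unit variance, while `normalize` rescales each row (observation) to unit L2 norm by default. A small sketch on synthetic data, with made-up means and scales, makes this visible.</font>

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Synthetic data: three features on very different scales (made-up numbers).
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 50.0, 1000.0], scale=[1.0, 10.0, 300.0], size=(200, 3))

# Step 1: column-wise standardization.
X_std = StandardScaler().fit_transform(X)

# Step 2: row-wise normalization (default norm="l2").
X_unit = normalize(X_std)

col_means = X_std.mean(axis=0)               # close to 0 for every column
col_stds = X_std.std(axis=0)                 # close to 1 for every column
row_norms = np.linalg.norm(X_unit, axis=1)   # close to 1 for every row

print(col_means.round(6), col_stds.round(6), row_norms[:5].round(6))
```

<font size ="3">Note that after the row-wise step the columns are in general no longer exactly zero-mean with unit variance.</font>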

**<font size ="3">Transform numpy array out to Pandas DataFrame df_out</font>**

In [None]:
col_names=cc.columns
df_out=pd.DataFrame(out,columns=col_names)

**<font size ="3">Displaying first 5 rows of new dataset df_out:</font>**

In [None]:
df_out.head()

**<font size ="4">Graphing:</font>**<br>

**<font size ="3">Distribution plot for all features of DataFrame:</font>**

In [None]:
fig = plt.figure(figsize=(20,20))

for i in range(17):
    plt.subplot(6,3,i+1)
    sns.distplot(df_out[df_out.columns[i]])
    

**<font size ="4">Box plot for all features of DataFrame:</font>**<br>

In [None]:
fig = plt.figure(figsize=(15, 10))

sns.boxplot(data=df_out)

plt.xticks(rotation=90)

**<font size ="3">Interpretation:</font>**<br><br>
    <font size ="3">All features have outliers except purchase frequency and purchase installment frequency. This means
    that these features contain unusually small or unusually large observations. Such values can have a disproportionate
    effect on statistics such as the mean, which can lead to misleading interpretations.</font><br><br>
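
<font size ="3">By default, a seaborn box plot draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically to count outliers per column. The helper `count_iqr_outliers` and the toy data below are hypothetical and only illustrate the rule.</font>

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Count values per column outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (df < lower) | (df > upper)
    return outside.sum()


# Toy data: column "a" has one extreme value, column "b" has none.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

<font size ="3">On the notebook's data the equivalent call would be `count_iqr_outliers(df_out)`.</font>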

**<font size ="3">Joint plot between balance and credit limit feature:</font>**

In [None]:
sns.jointplot(x="balance", y="credit_limit", data=df_out,color="g")

**<font size ="3">Interpretation:</font>**<br><br>
    <font size ="3">There is a strong positive correlation between balance and credit limit: as the value of balance
    increases, the credit limit also increases.</font><br><br>

**<font size ="3">KDE joint plot between balance and credit limit feature:</font>**

In [None]:
sns.jointplot(x="balance", y="credit_limit", data=df_out,kind="kde",color="g")

**<font size ="3">Interpretation:</font>**<br><br>
    <font size ="3">The denser regions show where combinations of balance and credit limit values occur most often.</font><br><br>

**<font size ="3">Heatmap:</font>**

In [None]:
fig = plt.figure(figsize=(20, 10))

heatmap = sns.heatmap(df_out.corr(),annot = True)

heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':15})

**<font size ="3">Interpretation:</font>**<br><br>
    <font size = "3">Each square shows the correlation between the variables on the two axes. A value close
    to 0 means there is no linear relationship. A value of 1 means a perfect positive correlation: 
    when one variable increases, the other also increases. A value of -1 
    means an equally strong correlation in the opposite direction: when one variable increases, the other
    decreases.</font><br><br>
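
<font size ="3">The three cases can be reproduced with a toy frame. The column names below are made up for illustration. The last column also shows that a correlation of 0 only rules out a linear relationship, not every relationship.</font>

```python
import numpy as np
import pandas as pd

x = np.arange(10, dtype=float)

toy = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,          # increases with x: correlation +1
    "down": -3 * x + 5,       # decreases as x increases: correlation -1
    "curve": (x - 4.5) ** 2,  # depends on x, but symmetrically: correlation 0
})

# Pearson correlation matrix, the same quantity the heatmap displays.
corr = toy.corr()
print(corr.round(3))
```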

**<font size ="3">AIC and BIC measures to discover the ideal number of clusters:</font>**

In [None]:
bic = []
aic = []
for i in range(3):
    gm = GaussianMixture(
                     n_components = i+1,
                     n_init = 10,
                     max_iter = 100)
    gm.fit(df_out)
    bic.append(gm.bic(df_out))
    aic.append(gm.aic(df_out))

**<font size ="3">Plotting AIC and BIC curves:</font>**

In [None]:
fig = plt.figure()
plt.plot([1,2,3], aic,marker="o",label="aic",color="b")
plt.plot([1,2,3], bic,marker="o",label="bic",color="r")
plt.legend()
plt.show()
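
<font size ="3">Lower AIC and BIC values indicate a better trade-off between fit and model complexity, so a common heuristic is to pick the number of components at the minimum of the curve. The sketch below applies this to synthetic data with two well-separated blobs; the candidate range, `n_init` and `random_state` are assumptions chosen for the example, not settings from this notebook.</font>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two well-separated Gaussian blobs (made-up parameters).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(200, 2)),
])

candidates = list(range(1, 5))   # try 1 to 4 components
bic_scores = []
for k in candidates:
    model = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bic_scores.append(model.bic(X))

# The candidate with the lowest BIC is the heuristic choice.
best_k = candidates[int(np.argmin(bic_scores))]
print(dict(zip(candidates, np.round(bic_scores, 1))), "best:", best_k)
```

<font size ="3">If the minimum falls on the largest candidate tried, the curve may still be decreasing, and it can be worth extending the range before settling on a value.</font>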

**<font size ="3">Draw a 2-D t-SNE plot and colour points by GMM cluster labels:</font>**

In [None]:
tsne = TSNE(n_components = 2)
tsne_out = tsne.fit_transform(df_out)
plt.scatter(tsne_out[:, 0], tsne_out[:, 1],
            marker='x',
            s=20,                   # marker size
            linewidths=5,           # linewidth of marker edges
            c=gm.predict(df_out)    # Colour as per gmm
            )
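
<font size ="3">In the cell above, `gm` is the model left over from the last iteration of the AIC/BIC loop, i.e. the one with 3 components. A slightly more explicit variant fits a final model with the chosen number of components before colouring the embedding. The sketch below does this on small synthetic data; `n_clusters`, `perplexity` and `random_state` are assumed values for illustration only.</font>

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

# Small synthetic data set: two blobs in 5 dimensions (made-up parameters).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 5)),
    rng.normal(loc=8.0, scale=1.0, size=(50, 5)),
])

n_clusters = 2  # assumed choice, e.g. read off the AIC/BIC curves

# Fit the final mixture model explicitly and get one label per row.
gmm = GaussianMixture(n_components=n_clusters, n_init=3, random_state=0).fit(X)
labels = gmm.predict(X)

# Project to 2-D for plotting; perplexity must be smaller than the sample count.
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)

print(embedding.shape, np.bincount(labels))
```

<font size ="3">The result can then be drawn with `plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)`.</font>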

**<font size ="3">Anomaly detection:</font>**

In [None]:
# Anomalous points are those in low-density regions, here those whose density falls in the lowest 4%


densities = gm.score_samples(df_out)              # score_samples() returns the log-likelihood (log-density) of each sample under the fitted model
densities

density_threshold = np.percentile(densities,4)
density_threshold

anomalies = df_out[densities < density_threshold]
anomalies
anomalies.shape               
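
<font size ="3">Because the threshold is a percentile of the scores, this rule always flags roughly 4% of the rows, whether or not the data contains genuine anomalies. The sketch below shows the same steps on synthetic data in which 20 scattered points were added on purpose; all numbers are made up for the example.</font>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: a dense cluster plus 20 widely scattered points.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(480, 2))
scattered = rng.uniform(low=-10.0, high=10.0, size=(20, 2))
X = np.vstack([inliers, scattered])

gmm = GaussianMixture(n_components=1, random_state=0).fit(X)

# score_samples returns the log-density of each row under the fitted model.
log_density = gmm.score_samples(X)

# Rows below the 4th percentile of log-density are flagged as anomalous.
threshold = np.percentile(log_density, 4)
is_anomaly = log_density < threshold
n_flagged = int(is_anomaly.sum())

print(n_flagged, "of", len(X), "rows flagged")
```

<font size ="3">Applying `np.exp` to the scores gives the densities themselves; the ranking of the rows, and therefore the flagged set, stays the same.</font>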

**<font size ="3">Unanomalous Data:</font>**

In [None]:
unanomalies = df_out[densities >= density_threshold]
unanomalies
unanomalies.shape    

**<font size ="3">Transform anomalous and unanomalous data to DataFrame:</font>**

In [None]:
df_anomaly = pd.DataFrame(anomalies, columns = df_out.columns)

df_unanomaly = pd.DataFrame(unanomalies, columns =df_out.columns)

**<font size ="3">Create density plot function:</font>**

In [None]:
def densityplots(df1,df2, label1 = "Anomalous",label2 = "Normal"):
    fig, axes = plt.subplots(nrows=4, ncols=5, figsize=(15,15))
    ax = axes.flatten()
    fig.tight_layout()
    # Do not display 18th, 19th and 20th axes
    axes[3,3].set_axis_off()
    axes[3,2].set_axis_off()
    axes[3,4].set_axis_off()
    # Below 'j' is not used.
    for i,j in enumerate(df1.columns):
        sns.distplot(df1.iloc[:,i],
                     ax = ax[i],
                     kde_kws={"color": "k", "lw": 3, "label": label1},   # Density plot features
                     hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "g"}) # Histogram features
        sns.distplot(df2.iloc[:,i],
                     ax = ax[i],
                     kde_kws={"color": "red", "lw": 3, "label": label2},
                     hist_kws={"histtype": "step", "linewidth": 3,"alpha": 1, "color": "b"})

**<font size ="3">Density plot for both anomalous and unanomalous data:</font>**

In [None]:
densityplots(df_anomaly, df_unanomaly, label2 = "Unanomalous")

**<font size ="3">Thank You...😊</font>**