# Clustering

In this notebook we will study the **K-Means** algorithm, but first we will start with **Loading Data**. Before exploring the data, let us look at the data dictionary.

The data dictionary for the credit card dataset is as follows:

**CUST_ID** : Identification of Credit Card holder (Categorical) <br/>
**BALANCE** : Balance amount left in their account to make purchases <br/>
**PURCHASES** : Amount of purchases made from account <br/>
**INSTALLMENTS_PURCHASES** : Amount of purchases made in installments <br/>
**CASH_ADVANCE** : Cash in advance given by the user <br/>
**CREDIT_LIMIT** : Limit of Credit Card for user <br/>
**PAYMENTS** : Amount of Payment done by user <br/>
**MINIMUM_PAYMENTS** : Minimum amount of payments made by user <br/>
**TENURE** : Tenure of credit card service for user

## Loading Data

In [1]:
import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)
import numpy as np

from sklearn.cluster import KMeans
from sklearn import metrics 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set() 
%matplotlib inline 

**Task 1:** Read the CSV file "credit_card.csv" from your system. It is important to make a copy of the data first.

In [5]:
#write code here
data = pd.read_csv('credit_card.csv')
df= data.copy()
df.head()

Unnamed: 0,CUST_ID,BALANCE,PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,CREDIT_LIMIT,PAYMENTS,MINIMUM_PAYMENTS,TENURE
0,C10001,40.9,95.4,95.4,0.0,1000.0,201.8,139.51,12
1,C10002,3202.47,0.0,0.0,6442.95,7000.0,4103.03,1072.34,12
2,C10003,2495.15,773.17,0.0,0.0,7500.0,622.07,627.28,12
3,C10004,1666.67,1499.0,0.0,205.79,7500.0,0.0,312.34,12
4,C10005,817.71,16.0,0.0,0.0,1200.0,678.33,244.79,12


**Task 2:** Get the shape of data

In [None]:
#write code here


**Task 3:** Display first five rows

In [None]:
#write code here


**Task 4:** Display data types of Data

In [None]:
#write code here


**Task 5:** Check missing values

In [None]:
#write code here


**Task 6:** Check the statistics

In [None]:
#write code here


**Task 7:** Remove **CUST_ID**

In [None]:
#Write code here
X= None 

# KMeans

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
<br><br>
<li>The centroids of the K clusters, which can be used to label new data</li>
<li>Labels for the training data (each data point is assigned to a single cluster)</li><br>
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The "Choosing K" section below describes how the number of groups can be determined.  

Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents. 
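The two outputs described above can be seen on a small example. This is a minimal sketch on made-up 2-D points, not the credit card data; the names `toy_X`, `toy_kmeans` and `new_points` are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two well-separated groups in 2-D
toy_X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],   # group near (8, 8)
])

# n_init is set explicitly because its default differs across scikit-learn versions
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)

# Output 1: the centroids, one row of feature values per cluster
toy_centroids = toy_kmeans.cluster_centers_

# Output 2: one label per training point
toy_labels = toy_kmeans.labels_

# The centroids can label new data: each new point gets the label of its nearest centroid
new_points = np.array([[0.9, 1.0], [8.1, 8.1]])
new_labels = toy_kmeans.predict(new_points)

print(toy_centroids)
print(toy_labels)
print(new_labels)
```

Which group receives label 0 and which receives label 1 is arbitrary, so only the grouping itself should be interpreted, not the label numbers.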

In [None]:
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(X)

The `kmeans.fit` call runs the K-means algorithm on the provided dataset.

Now let's make a copy of `X` in a new variable ***pred***.
To find out which observation belongs to which cluster, use the ***labels_*** attribute. It returns an array of labels, which we assign to the new column ***kmean1***.

In [None]:
pred = X.copy()
pred['kmean1'] = kmeans.labels_
pred.head()

The **kmean1** column shows the labels assigned by the K-means algorithm. For example, row index 0 belongs to cluster 0, row 1 belongs to cluster 1, row 2 belongs to cluster 4, and so on.

In [None]:
pred['kmean1'].value_counts()

The above output shows the number of observations in each cluster.

# Scaling

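#### Why do we need scaling?
<br>Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. K-means is one of them: it relies on distances between points, so features with large ranges dominate the result.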

### Scaling using min max

Min-max scaling, also known as min-max normalization, is the simplest method: it rescales each feature to the range [0, 1]. Selecting the target range depends on the nature of the data. The general formula is given as:
<br>

*Formula*
<br>$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
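As a quick check of the formula above, it can be applied by hand with NumPy and compared against scikit-learn. This is a minimal sketch on made-up numbers; the array `toy` and the names `manual_scaled` and `sklearn_scaled` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for two features on very different ranges
toy = np.array([
    [1000.0, 6.0],
    [7000.0, 12.0],
    [4000.0, 9.0],
])

# Apply the formula column by column: (x - min) / (max - min)
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual_scaled = (toy - col_min) / (col_max - col_min)

# MinMaxScaler with its default feature_range=(0, 1) should give the same result
sklearn_scaled = MinMaxScaler().fit_transform(toy)

print(manual_scaled)
print(np.allclose(manual_scaled, sklearn_scaled))
```

After scaling, both columns span exactly [0, 1], so neither feature dominates a distance calculation purely because of its units.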

### Scaling using MinMaxScaler function

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler() 

In [None]:
new=scaler.fit_transform(X)

In [None]:
type(new)

In [None]:
new

In the above step, the scaling is done by scikit-learn's built-in `MinMaxScaler`.

In [None]:
col_names=["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE"]

In [None]:
scaled=pd.DataFrame(columns=col_names,data=new)

In [None]:
scaled.head()

Now we will use the scaled variables and see how our clusters differ

**Task 8:** Apply ***fit*** on **scaled** dataset and put the labels in the predicted data.

Also display value count

In [None]:
#Write code here
kmean2 = None
#Write code to fit


#Write code to put labels into predicted data
pred['kmean2'] = None

#View the final data set i.e top 5 rows



In [None]:
#Write code here to view value counts


From the above output you can see that now the distribution of the clusters has changed

## Choosing K

### Elbow Analysis

The elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a dataset.

### Working

One way to choose the number of clusters is the elbow method. The idea is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 14, as in the code below), and for each value of k calculate the sum of squared errors (SSE), which scikit-learn exposes as `inertia_`. Then, plot a line chart of the SSE against k. If the line chart looks like an arm, the "elbow" of the arm is the value of k to pick.

We want a small SSE, but the SSE tends to decrease toward 0 as we increase k: it is 0 when k equals the number of data points, because then each data point is its own cluster and there is no error between it and the center of its cluster. So our goal is to choose a small value of k that still has a low SSE, and the elbow usually marks where increasing k starts to give diminishing returns.
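To make the SSE concrete, it can be computed by hand and compared with the value scikit-learn reports. This is a minimal sketch on made-up 1-D data; the helper `sse` and the names `toy_X` and `toy_cost` are hypothetical and not part of the notebook's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 1-D data with three visible groups
toy_X = np.array([[1.0], [1.1], [0.9], [5.0], [5.1], [4.9], [9.0], [9.1], [8.9]])

def sse(X, labels, centers):
    """Sum of squared distances from each point to its assigned cluster center."""
    return float(((X - centers[labels]) ** 2).sum())

toy_cost = []
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X)
    manual = sse(toy_X, model.labels_, model.cluster_centers_)
    # inertia_ is scikit-learn's name for the same quantity
    toy_cost.append((k, manual, model.inertia_))

for k, manual, inertia in toy_cost:
    print(k, round(manual, 3), round(inertia, 3))
```

On this toy data the SSE drops sharply up to k = 3 and barely changes afterwards, which is the elbow shape the method looks for.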

In [None]:
cost = []
for k in range(1, 15):
    kmeanModel = KMeans(n_clusters=k, random_state=0).fit(scaled)
    cost.append([k,kmeanModel.inertia_])

In [None]:
cost

In [None]:
plt.figure(figsize=(15,6))
sns.set_context('poster')
plt.plot(pd.DataFrame(cost)[0], pd.DataFrame(cost)[1])
plt.xlabel('k')
plt.ylabel('Cost')
plt.title('The Elbow Method showing the optimal k') 
plt.show()

From the above graph we can see that the elbow forms at 3 clusters.
<br>But before proceeding, let us check the **Silhouette Score**.

### Silhouette Score

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of. The coefficient ranges from -1 to 1, and values closer to 1 indicate better-separated clusters. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1.
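The definition above can be checked by hand on a few points. This is a minimal sketch assuming Euclidean distance and made-up data with hand-assigned labels; the helper `silhouette_for_sample` is hypothetical and is compared against scikit-learn's `silhouette_samples` and `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Made-up points with hand-assigned labels: two clusters of two points each
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

def silhouette_for_sample(i, X, labels):
    """Compute (b - a) / max(a, b) for sample i using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                       # exclude the sample itself from a
    a = dists[same].mean()                # mean intra-cluster distance
    # mean distance to each other cluster; b is the smallest of those means
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_for_sample(i, toy_X, toy_labels) for i in range(len(toy_X))])

print(manual)
print(silhouette_samples(toy_X, toy_labels))
print(manual.mean(), silhouette_score(toy_X, toy_labels))
```

The overall silhouette score is the mean of the per-sample coefficients. The helper does not handle clusters with a single member, which the toy data avoids.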

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
# Compute the silhouette score for each k from 2 to 14
s_score = []
for k in range(2, 15):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(scaled)
    s_score.append([k, silhouette_score(scaled, kmeans.labels_)])

In [None]:
s_score

In [None]:
plt.figure(figsize=(15,6))
sns.set_context('poster')
plt.plot( pd.DataFrame(s_score)[0], pd.DataFrame(s_score)[1])
plt.xlabel('clusters')
plt.ylabel('score')
plt.title('The silhouette score') 
plt.show()

## Final clusters using K-Means

After checking the **Elbow plot** and the **Silhouette Score**, we can conclude that the number of clusters (k) should be 3.

**Task 9:** Apply kmeans algorithm with number of clusters = 3. Also assign values to the predicted data and check value count.

In [None]:
#Write code here
kmean3 = None

#write code to fit


#Write code to assign labels to predicted data
pred['kmean3'] = None

#Write code to display value counts



## Profiling

**Profiling and its usage**<br>
Having decided (for now) how many clusters to use, we would like to get a better understanding of what values are in those clusters and interpret them.

Data analytics is used to eventually make decisions, and that is feasible only when we are comfortable (enough) with our understanding of the analytics results, including our ability to clearly interpret them.

To this purpose, one needs to spend time visualizing and understanding the data within each of the selected clusters. For example, one can see how the summary statistics (e.g. averages, standard deviations, etc) of the profiling attributes differ across the segments.

In our case, assuming we decided to use the 3 clusters found using the K-means algorithm as outlined above, we can see how the values change across clusters. The median values of our data within each cluster are:

In [None]:
cols = ["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE"]
p_ = pred[cols + ['kmean3']]
# Select columns with a list: indexing a groupby with several bare keys fails in recent pandas
pivoted = p_.groupby('kmean3')[cols].median().reset_index()
pivoted
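Profiling need not stop at one statistic. The sketch below shows several summary statistics per cluster on made-up data; the frame `toy` and its `cluster` column are hypothetical stand-ins for the predicted data and a label column.

```python
import pandas as pd

# Made-up data: two features and a hypothetical cluster label column
toy = pd.DataFrame({
    "BALANCE":   [100.0, 200.0, 300.0, 4000.0, 5000.0],
    "PURCHASES": [50.0, 70.0, 60.0, 900.0, 1100.0],
    "cluster":   [0, 0, 0, 1, 1],
})

# Several summary statistics per cluster, one row per cluster
profile = toy.groupby("cluster")[["BALANCE", "PURCHASES"]].agg(["mean", "median", "std"])

# Cluster sizes help judge how much weight each row of the profile deserves
sizes = toy["cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

Comparing the mean with the median inside a cluster is a quick way to spot skew, and a large standard deviation suggests the cluster is not very homogeneous on that feature.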


# Radar Plot

The radar chart is a plot that consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. A line is drawn connecting the data values for each spoke. This gives the plot a star-like appearance, which is the origin of one of its popular names, the star plot.

<img src="https://upload.wikimedia.org/wikipedia/commons/0/00/Spider_Chart.svg" />
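The "relative to the maximum" scaling in the description can be sketched with pandas before any plotting. The values in `toy_profile` are made up for illustration and are not results from the credit card data.

```python
import pandas as pd

# Made-up cluster medians; the values are for illustration only
toy_profile = pd.DataFrame(
    {"BALANCE": [500.0, 2000.0, 4000.0],
     "CREDIT_LIMIT": [3000.0, 6000.0, 9000.0],
     "TENURE": [12.0, 6.0, 12.0]},
    index=["Cluster 0", "Cluster 1", "Cluster 2"],
)

# Divide each column by its maximum so every spoke runs from 0 to 1
relative = toy_profile / toy_profile.max()

print(relative)
```

When raw values share one radial axis instead, a small-range feature such as TENURE sits close to the center, so relative scaling is one option for making every spoke readable.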

In [None]:
#!pip install chart_studio

[Sign up](https://plot.ly/Auth/login/?action=signup#/) on Plotly, verify your email address, and regenerate your API key.

In [None]:
import chart_studio
#chart_studio.tools.set_credentials_file(username='Your username', api_key='Your API key')

In [None]:
import chart_studio.plotly as py
import plotly.graph_objs as go

In [None]:
radar_data = [
    go.Scatterpolar(
      r = list(pivoted.loc[0,["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE", 'BALANCE']]),
      theta = ["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE", 'BALANCE'],
      fill = None,
      fillcolor=None,
      name = 'Cluster 0'
    ),
    go.Scatterpolar(
      r = list(pivoted.loc[1,["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE", 'BALANCE']]),
      theta = ["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE", 'BALANCE'],
      fill = None,
      fillcolor=None,
      name = 'Cluster 1'
    ),
    go.Scatterpolar(
      r = list(pivoted.loc[2,["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE", 'BALANCE']]),
      theta = ["BALANCE", "PURCHASES","INSTALLMENTS_PURCHASES","CASH_ADVANCE","CREDIT_LIMIT", "PAYMENTS", "MINIMUM_PAYMENTS","TENURE", 'BALANCE'],
      fill = None,
      fillcolor=None,
      name = 'Cluster 2'
    )
]

In [None]:
radar_layout = go.Layout(polar = dict(radialaxis = dict(visible = True,range = [0, 9000])), showlegend = True)

In [None]:
fig = go.Figure(data=radar_data, layout=radar_layout)
py.iplot(fig, filename = "radar")