## Activity: Anomaly detection with k-means

You are working for a big international bank. The Credit department is reviewing its offerings and wants to get a better understanding of its current customers. You have been tasked with performing a customer segmentation analysis. You will perform a cluster analysis with k-means to identify groups of similar customers.

Students are expected to:
* Download and load the dataset into Python
* Perform data standardisation if required
* Analyse and define the optimal number of clusters
* Fit k-means with default hyperparameters
* Plot the clusters and their centroids
* Tune hyperparameters and re-train k-means
* Analyse and interpret clusters found

The dataset used has been shared by Dheeru Dua and Casey Graff from the University of California.

It is available here: [https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric](https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric)


Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Notes:
* This dataset is not stored as a comma-separated file. You can still load it using **read_csv()**, but you will need to specify the right separator characters with the parameter **sep**.
* Even though all the columns in this dataset are integers, most of them are actually categorical variables. The data in these columns are not continuous. Only 2 variables are really numeric. Find and use them for your clustering.

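One rough way to shortlist the truly numeric columns is to count the distinct values in each column: coded categorical variables usually take only a handful of values. The sketch below uses a small made-up frame (the column names and values are illustrative, not taken from the dataset), and the heuristic is only a starting point for your own inspection.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "X0": [1, 2, 4, 1, 1, 2],        # few distinct codes: likely categorical
    "X3": [12, 60, 21, 79, 49, 33],  # many distinct values: likely numeric
    "X15": [0, 0, 1, 0, 1, 0],       # binary flag
})

# Count distinct values per column, highest first.
distinct_counts = toy.nunique().sort_values(ascending=False)
print(distinct_counts)
```

Columns at the top of this ranking are the candidates worth checking against the dataset documentation.
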

1. Open a new Colab notebook. Import pandas, altair, and the KMeans and StandardScaler classes from sklearn:

In [0]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt
from sklearn.preprocessing import StandardScaler

2. Assign the link to the dataset to a variable called file_url:

In [0]:
file_url = 'https://raw.githubusercontent.com/TrainingByPackt/The-Data-Science-Workshop/master/Chapter05-Perform_Your_First_Cluster_Analysis/data/german.data-numeric'

3. Load the dataset using the read_csv() method from the pandas package with the following parameters: header=None, sep='\s\s+' and prefix='X'


In [5]:
df = pd.read_csv(file_url, header=None, sep='\s\s+', prefix='X')

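The prefix argument of read_csv() was deprecated in pandas 1.4 and is no longer accepted in pandas 2.0 and later. If you are on a recent version, a possible equivalent is to load the file without it and name the columns yourself. The sketch below reads two made-up rows from an in-memory string instead of the real URL, so the sample text is an assumption for illustration only.

```python
import io
import pandas as pd

# Made-up rows in a whitespace-separated layout (not real records).
sample = "1   6   4  12\n2  48   2  60\n"

# A raw string avoids Python's invalid-escape warning for "\s";
# engine="python" is the parser that supports regular-expression separators.
demo = pd.read_csv(io.StringIO(sample), header=None, sep=r"\s\s+", engine="python")

# Name the columns X0, X1, ... explicitly instead of relying on prefix=.
demo.columns = [f"X{i}" for i in range(demo.shape[1])]
print(demo)
```

Passing engine="python" explicitly also avoids the parser warning that pandas emits when it has to fall back to that engine for a regular-expression separator.
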


4. Display the first 5 rows of the dataframe:

In [6]:
df.head()

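|  | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | X24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 4 | 12 | 5 | 5 | 3 | 4 | 1 | 67 | 3 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1.0 |
| 1 | 2 | 48 | 2 | 60 | 1 | 3 | 2 | 2 | 1 | 22 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 2.0 |
| 2 | 4 | 12 | 4 | 21 | 1 | 4 | 3 | 3 | 1 | 49 | 3 | 1 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1.0 |
| 3 | 1 | 42 | 2 | 79 | 1 | 4 | 3 | 4 | 2 | 45 | 3 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0 |
| 4 | 1 | 24 | 3 | 49 | 1 | 3 | 3 | 4 | 4 | 53 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2.0 |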


5. Extract the columns X3 and X9 and assign them to a new variable called X:

In [0]:
X = df[['X3', 'X9']]

6. Instantiate a StandardScaler object, standardise the data, and store the result in a variable called ‘X_scaled’:

In [0]:
standard_scaler = StandardScaler()
X_scaled = standard_scaler.fit_transform(X)

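To see what the standardisation step does, the sketch below reproduces StandardScaler by hand on a few made-up rows (the values are illustrative, not the real X): each column has its mean subtracted and is divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-column data standing in for X (illustrative values only).
X_demo = np.array([[12.0, 67.0], [60.0, 22.0], [21.0, 49.0], [79.0, 45.0]])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# The same result by hand: subtract the column mean, then divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_demo_scaled, manual))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither variable dominates the Euclidean distances that k-means relies on.
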
7. Create an empty pandas dataframe called ‘clusters’ and an empty list called ‘inertia’


In [0]:
clusters = pd.DataFrame()
inertia = []

8. Create a new column called ‘cluster_range’ in the ‘clusters’ dataframe and assign it range(1, 15), which covers the values 1 to 14 because the end of a Python range is exclusive:

In [0]:
clusters['cluster_range'] = range(1, 15)

9. Using a for loop, fit a k-means with the number of clusters defined in the column ‘cluster_range’ and extract the relevant inertia value and append it to the ‘inertia’ list: 

In [0]:
for k in clusters['cluster_range']:
    kmeans = KMeans(n_clusters=k, random_state=8).fit(X_scaled)
    inertia.append(kmeans.inertia_)

10. Create a new column called ‘inertia’ in the ‘clusters’ dataframe and assign it the ‘inertia’ list:

In [0]:
clusters['inertia'] = inertia

11. Print the ‘clusters’ dataframe:

In [13]:
clusters

|  | cluster_range | inertia |
|---|---|---|
| 0 | 1 | 2000.0 |
| 1 | 2 | 1280.617612 |
| 2 | 3 | 767.694985 |
| 3 | 4 | 576.086134 |
| 4 | 5 | 443.899592 |
| 5 | 6 | 360.418261 |
| 6 | 7 | 291.398267 |
| 7 | 8 | 252.704796 |
| 8 | 9 | 219.531292 |
| 9 | 10 | 193.202261 |

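The inertia values above are the sum of squared distances between each point and the centroid of its assigned cluster. The sketch below recomputes that quantity by hand on synthetic data (random values, not the bank data) and checks it against the inertia_ attribute. It also shows that, with a single cluster on standardised data, inertia equals the number of rows times the number of columns, which would be consistent with the value of 2000.0 for one cluster if the dataset has 1000 rows and 2 columns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic standardised data (random values, for illustration only).
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 2)))

km = KMeans(n_clusters=3, n_init=10, random_state=8).fit(X_demo)

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = ((X_demo - assigned_centroids) ** 2).sum()
print(manual_inertia, km.inertia_)

# With k=1 the single centroid is the overall mean, so on standardised data
# the inertia is n_samples * n_features (here 200 * 2).
km_one = KMeans(n_clusters=1, n_init=10, random_state=8).fit(X_demo)
print(km_one.inertia_)
```
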

12. Using the altair package and its mark_line and encode methods, display the elbow plot:


In [14]:
alt.Chart(clusters).mark_line().encode(alt.X('cluster_range'), alt.Y('inertia'))

13. Looking at the elbow plot, find the optimal number of clusters and save this value into a new variable called clusters_number:

In [0]:
clusters_number = 5

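Reading the elbow from a chart is a judgement call. One rough way to support it, which is an addition here and not part of the original activity, is to look at how much each extra cluster reduces inertia in percentage terms. The sketch below uses the inertia values printed in step 11; the column name pct_drop is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Inertia values copied from the table printed in step 11.
elbow = pd.DataFrame({
    "cluster_range": range(1, 11),
    "inertia": [2000.0, 1280.617612, 767.694985, 576.086134, 443.899592,
                360.418261, 291.398267, 252.704796, 219.531292, 193.202261],
})

# Percentage reduction in inertia gained by adding one more cluster.
elbow["pct_drop"] = -elbow["inertia"].pct_change() * 100
print(elbow.round(1))
```

The point where the percentage drops start to flatten out is a reasonable candidate for the number of clusters, but it should be weighed against how interpretable the resulting groups are.
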
14. Fit k-means with k-means++ initialisation, this number of clusters, n_init=50 and max_iter=1000:


In [16]:
kmeans = KMeans(random_state=1, n_clusters=clusters_number, init='k-means++', n_init=50, max_iter=1000)
kmeans.fit(X_scaled)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=1000,
       n_clusters=5, n_init=50, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)

15. Use the method predict() from sklearn to get the assigned clusters for all data points saved in X_scaled

In [0]:
df['cluster'] = kmeans.predict(X_scaled)

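The fitted centroids are expressed in standardised units, which makes them hard to read next to the original X3 and X9 values. A possible way to inspect them is to map them back with the scaler's inverse_transform() method and list them with the size of each cluster. The sketch below runs on synthetic data (random values, not the bank data), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the X3/X9 columns (random values, not the bank data).
rng = np.random.default_rng(1)
demo = pd.DataFrame({"X3": rng.integers(3, 180, size=300),
                     "X9": rng.integers(19, 75, size=300)})

scaler = StandardScaler()
demo_scaled = scaler.fit_transform(demo[["X3", "X9"]].to_numpy())
km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(demo_scaled)
demo["cluster"] = km.predict(demo_scaled)

# Centroids live in standardised space; map them back to the original units.
centroids = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                         columns=["X3", "X9"])

# Number of points assigned to each cluster, in cluster-label order.
centroids["size"] = np.bincount(km.labels_, minlength=5)
print(centroids.round(1))
```

A frame like this can also be passed to a second altair chart if you want to draw the centroids on top of the scatter plot.
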
16. Plot the scatter plot with the package altair

In [18]:
scatter_plot = alt.Chart(df).mark_circle()
scatter_plot.encode(x='X3', y='X9',color='cluster:N')

Looking at the final plot, we can see that k-means (with k-means++ initialisation) has grouped the data into 5 different clusters. Cluster 0 corresponds to observations with a value of X3 under 45 and X9 between 20 and 40. All the data points with an X3 value under 45 and X9 over 40 have been assigned to cluster 2. Observations that have X3 over 45 belong to cluster 1. Finally, cluster 4 contains data points with very low values of X3 and X9.
Observations in clusters 0 to 2 are quite spread out along the X9 axis, and there seems to be a natural lower boundary for X3 of 20 (all of their data points are over 20). However, a few data points fall below this boundary (they all belong to cluster 3). These observations seem to be very different from the data points in clusters 0 to 2. We will classify them as anomalies and, if this were a real project, we would report these cases to the Risk department for further investigation.
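
The anomalies above were identified by reading the plot. A complementary approach, which is a different technique from the visual reading used in this activity, is to flag the points that sit unusually far from their own centroid. The sketch below uses synthetic data with three planted outliers, and the 1% cut-off is a hypothetical threshold chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: one dense blob plus three far-away points (made up).
rng = np.random.default_rng(42)
blob = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [9.0, -8.0]])
X_demo = StandardScaler().fit_transform(np.vstack([blob, outliers]))

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X_demo)

# transform() returns the distance from each point to every centroid;
# the row minimum is the distance to the nearest (assigned) centroid.
distance_to_centroid = km.transform(X_demo).min(axis=1)

# Hypothetical rule: flag the 1% of points farthest from their centroid.
threshold = np.quantile(distance_to_centroid, 0.99)
is_anomaly = distance_to_centroid > threshold
print(np.flatnonzero(is_anomaly))
```

In a real project the threshold would need to be agreed with the business, since a percentile rule always flags some points even when nothing unusual is present.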
