# K-Means Clustering

In K-Means Clustering the number of clusters 'K' has to be chosen by us before the model is fitted. A common way to choose it is to compare the Within-Cluster Sum of Squares (WSS) across several values of 'K'.
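
For reference, the WSS for a given 'K' is commonly written as shown below. The notation is introduced here for illustration only: $C_k$ is the set of observations assigned to cluster $k$ and $\mu_k$ is the centroid (mean) of that cluster.

$$
\mathrm{WSS}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

A smaller WSS means the observations sit closer to their own cluster centroid.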

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans 
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import os
os.getcwd()

In [None]:
df = pd.read_csv('Europe_Countries.csv')
df.head()

The data set imported in the last cell is an economic data set that contains various economic indicators for European countries. Most of the variable names are self-explanatory.

Let us go ahead and create a new dataframe with only the relevant variables for Clustering.

In [None]:
data = df.iloc[:,1:8] # keep the columns at positions 1 to 7 (position 8 is excluded by the slice)
data.head()

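Since we have seen how to scale the data using StandardScaler from sklearn in the codebook of Hierarchical clustering, we will do the same for this data set as well. Scaling matters here because K-Means relies on Euclidean distances, so variables measured on larger scales would otherwise dominate the clustering.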

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Standardize every column, keep the original column names, and check the scaled data.

data_scaled = pd.DataFrame(StandardScaler().fit_transform(data),columns=data.columns)
data_scaled.head()
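
As a minimal sketch of what this scaling does, assuming the default settings of StandardScaler: each column has its mean subtracted and is then divided by its population standard deviation (`ddof=0`). The toy matrix below is hypothetical and is not taken from the data set.

```python
import numpy as np

# Hypothetical toy matrix: 4 observations, 2 variables on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0],
              [4.0, 7000.0]])

# z-score: subtract the column mean, divide by the population standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(X_scaled.mean(axis=0))  # column means, approximately 0
print(X_scaled.std(axis=0))   # column standard deviations, approximately 1
```

After scaling, both columns carry the same weight in a distance calculation. Note that the pandas `.std()` method uses `ddof=1` by default, so it will report a slightly different value from the one StandardScaler divides by.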

Now that we have scaled the data, let us go ahead and perform the K-Means Clustering.

Since we do not know the value of 'K', i.e. the optimum number of clusters, we will start with 2 clusters and check the Within-Cluster Sum of Squares (WSS).

In [None]:
k_means = KMeans(n_clusters = 2)

In [None]:
clust_2 = k_means.fit_predict(data_scaled)
clust_2

In [None]:
k_means.inertia_

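The 'inertia_' attribute of the fitted KMeans object from the 'sklearn' library gives us the Within-Cluster Sum of Squares (WSS), i.e. the sum of squared distances from each observation to its closest cluster centre, for the number of clusters we passed to KMeans.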
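
To make the link between 'inertia' and the WSS concrete, here is a minimal sketch that recomputes the WSS by hand and compares it with `inertia_`. It uses synthetic data from `make_blobs` as a stand-in for the notebook's data set, and the values of `n_init` and `random_state` are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the European data set): 60 points around 3 centres
X, _ = make_blobs(n_samples=60, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centre assigned to each observation,
# then sum the squared distances to that centre
assigned_centres = km.cluster_centers_[km.labels_]
wss_manual = ((X - assigned_centres) ** 2).sum()

print(wss_manual, km.inertia_)  # the two values should agree up to rounding
```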

Let us now check the WSS for 3 clusters.

In [None]:
k_means = KMeans(n_clusters = 3)
k_means.fit(data_scaled)
k_means.inertia_

Now, we see that the WSS has decreased. This is expected, because adding more clusters generally lowers the WSS. But it is very cumbersome to fit the model manually for each value of 'K'. So, we are going to run KMeans inside a loop and collect the 'inertia' for each value automatically.

Let us define an empty list to begin the process of automating the calculation of 'inertia'.

In [None]:
wss =[] 

Now, let us create the loop.

In [None]:
for i in range(2,9): # 'i' takes the values 2 to 8 (9 is excluded)
    KM = KMeans(n_clusters=i) # the number of clusters is set to 'i'
    KM.fit(data_scaled) # fit the model to form 'i' clusters in the scaled data
    wss.append(KM.inertia_) # append the 'inertia' (WSS) for 'i' clusters to the list 'wss'

# In short, we calculate the value of 'inertia' in every step and store it in 'wss'

Now, let us print 'wss' and check the values.

In [None]:
# The below code snippet prints the WSS values. We use a loop so that each value
# is shown next to its number of clusters.

for i in range(2,9):
    print('The WSS value for',i,'clusters is',wss[i-2])

In [None]:
plt.plot(range(2,9), wss)
plt.grid()
plt.title('WSS Plot')
plt.xlabel('Number of Clusters')
plt.ylabel('WSS Value')
plt.show()

The ideal 'WSS' plot has a sharp elbow-like structure: a point after which adding more clusters reduces the WSS only slightly. The number of clusters at that elbow is considered to be the optimal choice.
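
Reading the elbow off a plot is a judgment call. As a rough aid, the sketch below looks at how much the WSS drops with each extra cluster and where the curve bends the most (the largest second difference). This is one simple heuristic and not a definitive rule. It uses synthetic data from `make_blobs` as a stand-in for the notebook's data set, and `n_init` and `random_state` are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the European data set): 200 points around 4 centres
X, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)

ks = list(range(2, 9))
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Reduction in WSS gained by each additional cluster
drops = -np.diff(wss)
for k, d in zip(ks[1:], drops):
    print(f"{k - 1} -> {k} clusters: WSS drops by {d:.1f}")

# Second difference: large where the curve bends most sharply
bend = np.diff(wss, n=2)

# bend[j] is centred on ks[j + 1], so shift the index by one
elbow_k = ks[int(np.argmax(bend)) + 1]
print("Heuristic elbow at K =", elbow_k)
```

The result should always be checked against the plot and against how interpretable the clusters are.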

Having said that, here we will go for 4 clusters.

Let us now store the cluster labels in a variable and attach that variable to the data set as a new column.

In [None]:
k_means = KMeans(n_clusters = 4, random_state = 1) # fixed seed (arbitrary value) so the cluster labels are the same on every run
k_means.fit(data_scaled)
labels = k_means.labels_
labels

In [None]:
df["Clus_kmeans"] = labels
df.head()

Now that we have created a new column of clusters, let us export this into a .csv file.

In [None]:
df.to_csv('KMeans_Output.csv', index=False) # index=False avoids writing the row index as an extra column

Now, let us compare the clusters using the average value of each variable and try to interpret them.

In [None]:
df1 = df.drop(['Country'],axis=1)
df_clust = df1.groupby('Clus_kmeans').mean()
df_clust = df_clust.reset_index()
round(df_clust,0)

Let us check the frequency of each cluster, i.e. how many observations fall into each individual cluster.

In [None]:
cluster_freq = df['Clus_kmeans'].value_counts().sort_index()
cluster_freq

In [None]:
df_clust['Frequency'] = cluster_freq.values
round(df_clust,0)
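
The cluster averages and frequencies can also be combined so that the counts are matched to the clusters by label rather than by position. A minimal sketch on a hypothetical toy frame follows. The column names 'GDP' and 'Inflation' and all the values are made up for illustration, and only 'Clus_kmeans' comes from the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "GDP": [1.0, 3.0, 10.0, 14.0, 12.0],
    "Inflation": [2.0, 4.0, 1.0, 3.0, 2.0],
    "Clus_kmeans": [1, 1, 0, 0, 0],
})

grouped = toy.groupby("Clus_kmeans")
profile = grouped.mean()               # average of each variable per cluster
profile["Frequency"] = grouped.size()  # cluster sizes, aligned on the cluster label
print(profile.reset_index())
```

Because both results are indexed by the cluster label, pandas lines the counts up with the right cluster even if the row order differs.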