#  **KMeans Clustering - Grouping Indian States based on Literacy**

**Done by -** 

**Aakash R - 20BCE1003**

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import plotly.express as px
import seaborn as sns
from sklearn.metrics import silhouette_score
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



**Reading Data**

In [None]:
df = pd.read_excel('/kaggle/input/literacy-data-india/India_literacy_data_2021_Statewise.xlsx')
df.head()

**Renaming Columns**

Renaming columns to a more understandable and readable form

In [None]:
df.columns = ['State/UT','Area','Children_Schooling_Percentage','Women_literacy','Men_literacy','Women_Schooling','Men_Schooling']

**Dataset Size**

In [None]:
df.shape

**Feature data types**

In [None]:
df.dtypes

### Data Preprocessing

**Retaining only the 'Total' literacy rate of every State/Union Territory**

In [None]:
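# .copy() makes the filtered frame independent, so later column assignments do not trigger SettingWithCopyWarning
df = df[df['Area'] == 'Total'].copy()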

**Dropping the 'Area' Feature**

In [None]:
df = df.drop('Area', axis=1)

In [None]:
df.head()

**Preprocessing**

Changing the data type of features to float

In [None]:
df['Children_Schooling_Percentage'] = df['Children_Schooling_Percentage'].astype(float)
df['Men_literacy'] = df['Men_literacy'].astype(float)
df['Men_Schooling'] = df['Men_Schooling'].astype(float)

In [None]:
df.dtypes

**Dropping the State/UT feature**

The State/UT feature is an identifier rather than a numeric measurement, so it is excluded from clustering

In [None]:
df_filtered = df.drop('State/UT', axis=1)

In [None]:
df_filtered.shape

### Elbow Method
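The elbow method is a graphical technique for choosing the number of clusters 'K' in K-means clustering. For each candidate K it computes the WCSS (Within-Cluster Sum of Squares), i.e. the sum of the squared distances between each point and the centroid of its cluster, and plots WCSS against K. WCSS generally decreases as K grows, so the optimal K is taken at the 'elbow': the point where the curve bends sharply and additional clusters give only small reductions.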
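
As a minimal sketch of what WCSS measures, the snippet below recomputes it by hand and compares it with scikit-learn's `inertia_` attribute. The data here is synthetic and purely illustrative (an assumption, not the literacy dataset), and the names `X`, `assigned_centroids` and `manual_wcss` are introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (hypothetical): three loose groups in 2-D
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(20, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to every point, then sum the squared distances
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_wcss = np.sum((X - assigned_centroids) ** 2)

# The two numbers should agree up to floating-point error
print(manual_wcss, kmeans.inertia_)
```

This is why the notebook can read `kmeans.inertia_` directly for each K instead of computing distances itself.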

In [None]:
inertias = []

for i in range(2,21):
    # Fixed random_state so the curve is reproducible across runs
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(df_filtered)
    inertias.append(kmeans.inertia_)

plt.plot(range(2,21), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Within Cluster Sum of Squares')
plt.xticks(range(2,21))
plt.show()

**INFERENCE -** The optimal value of K (number of clusters) is thus 5, where the curve takes a bend

### Silhouette Method
The silhouette coefficient (or silhouette score) measures how similar a data point is to its own cluster (cohesion) compared with other clusters (separation). It ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. The optimal value of K is the one that maximizes the average silhouette score.
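
The cohesion and separation terms can be sketched by hand for a small example and checked against `silhouette_score`. The data and labels below are synthetic placeholders (an assumption for illustration only); for each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (hypothetical): two groups of ten points each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(10, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(10, 2)),
])
labels = np.array([0] * 10 + [1] * 10)

# Pairwise Euclidean distances between all points
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

scores = []
for i in range(len(X)):
    same = labels == labels[i]
    # Cohesion: mean distance to the other points in the same cluster
    # (the point's zero distance to itself is in the sum, hence the "- 1")
    a = dist[i, same].sum() / (same.sum() - 1)
    # Separation: smallest mean distance to the points of any other cluster
    b = min(dist[i, labels == k].mean() for k in np.unique(labels) if k != labels[i])
    scores.append((b - a) / max(a, b))

manual_score = float(np.mean(scores))
print(manual_score, silhouette_score(X, labels))
```

The sketch assumes every cluster has at least two points; scikit-learn assigns a score of 0 to points in single-member clusters.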

In [None]:
silhouette_scores = []
for i in range(2,21):
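    # Fixed random_state for reproducibility; fit_predict fits the model and returns the labels in one step
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    label = kmeans.fit_predict(df_filtered)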
    silhouette_scores.append(silhouette_score(df_filtered, label))

plt.plot(range(2,21), silhouette_scores, marker='o')
plt.title('Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.xticks(range(2,21))
plt.show()

**INFERENCE -** The optimal value of K (number of clusters) is thus 5, where the silhouette score is highest

**Both Methods suggest the K value to be 5**

### KMeans Clustering
Number of clusters = 5

In [None]:
# Fixed random_state so the cluster numbering stays the same across runs
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
label = kmeans.fit_predict(df_filtered)
df_filtered['Cluster'] = label
df['Cluster'] = label

**Cluster Centers**

In [None]:
cluster_centers = kmeans.cluster_centers_
cluster_centers 

In [None]:
# One row per cluster centroid, one column per feature used for clustering
feature_names = ['Children_Schooling_Percentage','Women_literacy','Men_literacy','Women_Schooling','Men_Schooling']
centroids = pd.DataFrame(cluster_centers, columns=feature_names)
centroids.insert(0, 'Cluster', [str(i) for i in range(len(cluster_centers))])
centroids

### Visualization

**1. Visualizing clusters on a 2D plane selecting only two features**

In [None]:
u_labels = np.unique(label)
 
for i in u_labels:
    plt.scatter(df_filtered.iloc[label == i , 1] , df_filtered.iloc[label == i , 2] , label = i)
plt.ylabel('Men_literacy')
plt.xlabel('Women_literacy')
plt.legend()
plt.show()

**2. Visualizing Clusters on a 3D Space**

In [None]:
# Cast labels to strings so plotly treats clusters as categories and applies the discrete palette
fig = px.scatter_3d(df_filtered, x='Children_Schooling_Percentage', z='Men_literacy', y='Women_literacy',color=label.astype(str),
                    height=700, width=800,color_discrete_sequence=sns.color_palette('colorblind',n_colors=5,desat=1).as_hex(),
                   title='KMeans Clustering')
fig.show()

In [None]:
# Cast labels to strings so plotly treats clusters as categories and applies the discrete palette
fig = px.scatter_3d(df_filtered, x='Children_Schooling_Percentage', z='Men_Schooling', y='Women_Schooling',color=label.astype(str),
                    height=800, width=800, color_discrete_sequence=sns.color_palette('colorblind',n_colors=5,desat=1).as_hex(),
                   title='KMeans Clustering')
fig.show()

### Grouping Indian States 

**States in Cluster-0**

In [None]:
df[df['Cluster']==0]

**States in Cluster-1**

In [None]:
df[df['Cluster']==1]

**States in Cluster-2**

In [None]:
df[df['Cluster']==2]

**States in Cluster-3**

In [None]:
df[df['Cluster']==3]

**States in Cluster-4**

In [None]:
df[df['Cluster']==4]
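
The per-cluster filters above can also be condensed into a single summary. The following is a minimal sketch using a hypothetical `toy` frame with placeholder state names and labels (not the real results); in the notebook the same `groupby` call would presumably be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; names and labels are placeholders
toy = pd.DataFrame({
    'State/UT': ['A', 'B', 'C', 'D', 'E'],
    'Cluster': [0, 1, 0, 2, 1],
})

# Collect the state names of every cluster into one dictionary
states_by_cluster = toy.groupby('Cluster')['State/UT'].apply(list).to_dict()

for cluster_id, states in states_by_cluster.items():
    print(f'Cluster {cluster_id}: {states}')
```

A dictionary keyed by cluster id keeps the membership of all clusters in one place, which is convenient when the number of clusters changes.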

### Conclusion
The Indian States and Union Territories are grouped into 5 clusters based on literacy rates. This grouping could help the state and central governments, as well as NGOs, take up measures to improve education that target the specific needs of each cluster. Different plans and policies can be formulated for each cluster.