# College Student - K Means Clustering

**What is K-Means used for?**

K-means clustering is a widely used unsupervised machine learning algorithm. It groups unlabeled data points into clusters of similar items, as this notebook does for students described by CGPA and IQ.

**What does K mean in K-Means?**

K is the number of clusters, and it is chosen before training. The algorithm groups unlabeled data into those K clusters based on similarity.

**How does the K-Means algorithm work?**

K-means clustering uses “centroids”: K points that are initialized at random locations in the data. Every data point is assigned to its nearest centroid. After all points have been assigned, each centroid is moved to the average of the points assigned to it, and the two steps repeat until the assignments stop changing.

**Is K-Means better than KNN?**

The two are not directly comparable because they solve different problems. K-NN is a supervised algorithm used for classification or regression, while K-Means is an unsupervised algorithm used for clustering. K-NN is a lazy learner: it stores the training data and defers the work to prediction time. K-Means is an eager learner: it computes the centroids during training.
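The difference shows up directly in how the two estimators are fitted. The sketch below uses a tiny made-up dataset (the values and the labels `y` are assumptions for illustration only): K-NN needs `y`, K-Means does not.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2-D points: three near (1, 1) and three near (5, 5)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # class labels, used only by K-NN

# Supervised: K-NN is fitted on features AND labels
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_pred = knn.predict([[1.1, 1.0]])

# Unsupervised: K-Means is fitted on features only
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(knn_pred)     # a class label taken from y
print(cluster_ids)  # arbitrary cluster ids, not class labels
```

The cluster ids carry no meaning beyond "same group" or "different group"; which group gets id 0 is arbitrary.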

**Can K-Means be used for classification?**

K-Means is a clustering algorithm that divides observations into k clusters; it does not use class labels. Because we choose the number of clusters, it can still support classification: cluster the data into at least as many clusters as there are classes, then map each cluster to a class using whatever labeled examples are available.
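One possible way to turn clusters into class predictions is a majority vote: each cluster is mapped to the most common known label among its members. This is a minimal sketch on synthetic data from `make_blobs`; the dictionary `cluster_to_class` and the test points are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, well-separated data with known class labels y (0, 1, 2)
X, y = make_blobs(n_samples=150, centers=[(-5, -5), (0, 5), (5, -5)],
                  cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Map each cluster id to the most common known label among its members
cluster_to_class = {
    c: int(np.bincount(y[km.labels_ == c]).argmax())
    for c in range(km.n_clusters)
}

# Classify new points: find the nearest centroid, then look up its class
new_points = np.array([[-5.0, -5.0], [5.0, -5.0]])
predicted = [cluster_to_class[c] for c in km.predict(new_points)]
print(predicted)
```

This only works well when clusters line up with classes; when they do not, a supervised model is the safer choice.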

**How do I cluster with K-Means?**

    Step 1: Choose the number of clusters k.
    Step 2: Select k random points from the data as centroids.
    Step 3: Assign all the points to the closest cluster centroid.
    Step 4: Recompute the centroids of newly formed clusters.
    Step 5: Repeat steps 3 and 4 until the centroids stop moving or a maximum number of iterations is reached.
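The steps above can be sketched in plain NumPy. This is a teaching sketch, not how scikit-learn implements it: the function name `kmeans`, the `max_iter` and `seed` parameters, and the empty-cluster handling are assumptions made for this example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: distance of every point to every centroid, then nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: mean of each cluster (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Step 1: choose k, here 2 for two synthetic groups of 20 points each
rng = np.random.default_rng(42)
points = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                    rng.normal(10.0, 1.0, size=(20, 2))])
labels, centroids = kmeans(points, k=2)
print(centroids)
```

Because the starting centroids are random, different seeds can give different results, which is why library implementations usually run several initializations and keep the best one.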

**What are centroids in K-Means?**

A centroid is the location, real or imaginary, that represents the center of a cluster; it need not coincide with an actual data point. Each data point is allocated to exactly one cluster, and the algorithm makes these allocations so as to reduce the within-cluster sum of squares.
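The within-cluster sum of squares can be computed by hand and compared with the `inertia_` attribute that scikit-learn's `KMeans` reports. The small array below is made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points forming two obvious groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For every point, look up the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Sum of squared distances from each point to its own centroid
wcss = np.sum((X - assigned_centroids) ** 2)

print(wcss, km.inertia_)
```

The two numbers should agree up to floating-point rounding, which is why `inertia_` is the quantity plotted in an elbow curve.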

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Read the data

In [2]:
df=pd.read_csv("../input/cpga-iq-placement/student_clustering.csv")
df.head()

In [3]:
df.shape

In [4]:
df.info()

In [5]:
df.describe()

In [8]:
df.isnull().sum()

In [9]:
df.iq.value_counts()

# Visualization

In [10]:
plt.scatter(df['cgpa'],df['iq'])

# Equivalent seaborn version of the same plot:
# plt.figure(figsize=(15,6))
# sns.scatterplot(x='cgpa', y='iq', data=df)
# plt.xticks(rotation=0)
# plt.show()

In [11]:
plt.figure(figsize=(20,6))
sns.barplot(x='iq',y='cgpa',data=df)
plt.xticks(rotation=0)
plt.show()

In [12]:
plt.figure(figsize=(20,6))
sns.countplot(x='iq',data=df)
plt.xticks(rotation=0)
plt.show()

# Checking for Outliers

In [13]:
# One boxplot per column, skipping the last column
for column in df.columns[0:-1]:
    plt.figure(figsize=(10,5))
    sns.boxplot(x=column,data=df)
    plt.show()

In [14]:
for i in df.columns:
    sns.boxplot(x=df[i])
    plt.show()
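Boxplots only show outliers visually. One common way to flag them in code is the 1.5 × IQR rule, which is also the default whisker rule in seaborn's boxplot. The sketch below uses hypothetical CGPA values and a hypothetical helper `iqr_outliers`; it is not applied to the notebook's data.

```python
import pandas as pd

# Hypothetical CGPA values; 2.0 is deliberately far from the rest
scores = pd.DataFrame({"cgpa": [5.1, 5.9, 6.2, 6.8, 7.0, 7.4, 8.1, 8.9, 2.0]})

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outliers(scores["cgpa"])
print(scores[mask])        # rows flagged as outliers
cleaned = scores[~mask]    # rows kept
```

Whether flagged rows should be dropped, capped, or kept depends on the data; K-Means centroids are means, so extreme values can pull them noticeably.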

In [15]:
plt.figure(figsize=(15,6))
sns.scatterplot(x='cgpa', y='iq', data=df)
plt.xticks(rotation=0)
plt.show()

# Finding the Number of Clusters (Elbow Method)

In [16]:
from sklearn.cluster import KMeans

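# Inertia (within-cluster sum of squares) for k = 1..10
data=[]
for i in range(1,11):
    km = KMeans(n_clusters=i)
    km.fit(df)  # only the fitted model is needed here, not the labels
    data.append(km.inertia_)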

In [17]:
data

In [18]:
plt.figure(figsize=(15,6))
plt.plot(range(1,11),data)
plt.grid()
plt.xticks(rotation=0)
plt.show()
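The elbow can be hard to judge by eye. A rough numeric aid is to look at how much the inertia falls each time one more cluster is added; the elbow sits where the drops become small. The sketch below uses synthetic blobs (an assumption, since the CSV is not available here) with four true groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups
X, _ = make_blobs(n_samples=200, centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
                  cluster_std=1.0, random_state=0)

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# drops[i] is the fall in inertia when going from k = i + 1 to k = i + 2
drops = -np.diff(inertias)
for k, drop in zip(list(ks)[1:], drops):
    print(f"k={k}: inertia fell by {drop:.1f}")
```

On data like this the drops up to k=4 are large and the later ones are small, which matches the bend seen in an elbow plot.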

In [19]:
x = df.values  # all rows and columns as a NumPy array
km = KMeans(n_clusters=4)
y_means = km.fit_predict(x)
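K-Means relies on Euclidean distance, so a column with a larger numeric range (IQ, presumably) can dominate one with a smaller range (CGPA). If that matters for the data at hand, one option is to standardize the features first. The rows below are hypothetical, and the pipeline is a sketch rather than a step the notebook performs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical (cgpa, iq) rows forming four pairs
x = np.array([[5.1, 88.0], [5.4, 91.0], [5.0, 112.0], [5.3, 115.0],
              [8.7, 90.0], [8.9, 86.0], [8.8, 113.0], [9.0, 117.0]])

# After scaling, each column has mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(x)
print(scaled.mean(axis=0), scaled.std(axis=0))

# Scaling and clustering chained together in one estimator
model = make_pipeline(StandardScaler(),
                      KMeans(n_clusters=4, n_init=10, random_state=0))
labels = model.fit_predict(x)
print(labels)
```

Note that centroids from a scaled model live in the scaled space; `StandardScaler.inverse_transform` converts them back to the original units if needed.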

In [20]:
y_means

In [21]:
x[y_means==0]

In [22]:
x[y_means==0,0],x[y_means==0,1]

In [23]:
plt.scatter(x[y_means==0,0],x[y_means==0,1],color='red')
plt.scatter(x[y_means==1,0],x[y_means==1,1],color='green')
plt.scatter(x[y_means==2,0],x[y_means==2,1],color='blue')
plt.scatter(x[y_means==3,0],x[y_means==3,1],color='orange')
plt.xlabel('CGPA')
plt.ylabel('IQ')
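The four `plt.scatter` calls can also be written as a loop, and the fitted centroids (available as `km.cluster_centers_`) can be drawn on top. The sketch below uses synthetic points as a stand-in for the CGPA and IQ columns, so the cluster positions are assumptions, not the notebook's results.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the (cgpa, iq) data loaded from the CSV
x, _ = make_blobs(n_samples=200,
                  centers=[(5, 85), (6, 115), (8, 100), (9, 130)],
                  cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0)
y_means = km.fit_predict(x)

fig, ax = plt.subplots(figsize=(8, 5))
colors = ['red', 'green', 'blue', 'orange']
for cluster_id, color in enumerate(colors):
    members = x[y_means == cluster_id]  # rows assigned to this cluster
    ax.scatter(members[:, 0], members[:, 1], color=color,
               label=f'cluster {cluster_id}')

# Mark the fitted centroids on top of the points
ax.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
           color='black', marker='X', s=150, label='centroids')
ax.set_xlabel('CGPA')
ax.set_ylabel('IQ')
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.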


# K-Means on 3D data

In [24]:
from sklearn.datasets import make_blobs
centroids = [(-5,-5,5),(5,5,-5),(3.5,-2.5,4),(-2.5,2.5,-4)]
cluster_std =[1,1,1,1]
X,y = make_blobs(n_samples=200,cluster_std=cluster_std,centers=centroids,n_features=3,random_state=1)

In [25]:
import plotly.express as px
fig = px.scatter_3d(x=X[:,0],y=X[:,1],z=X[:,2])
fig.show()

In [26]:
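# Inertia (within-cluster sum of squares) for k = 1..20
data= []
for i in range(1,21):
    km = KMeans(n_clusters=i)
    km.fit(X)  # only the fitted model is needed here, not the labels
    data.append(km.inertia_)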

In [27]:
data

In [29]:
plt.figure(figsize=(15,6))
plt.plot(range(1,21),data)
plt.grid()
plt.xticks(rotation=0)
plt.show()

In [30]:
km = KMeans(n_clusters=4)
y_pred = km.fit_predict(X)

In [31]:
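# Collect the 3D points and their predicted cluster labels in one DataFrame
# (this reuses the name df, replacing the student data loaded earlier)
df = pd.DataFrame(X, columns=['col1','col2','col3'])
df['label'] = y_pred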

In [32]:
df

In [33]:
fig = px.scatter_3d(df,x='col1',y='col2',z='col3',color='label')
fig.show()
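Because `make_blobs` also returns the true group of every point (`y`), the clustering can be checked numerically as well as visually. The sketch below regenerates the same blobs and compares `y` with the predicted labels using the adjusted Rand index, which ignores how the cluster ids happen to be numbered. The `n_init` and `random_state` arguments passed to `KMeans` are added here for reproducibility and are not in the cells above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same blob settings as in the 3D example above
centroids = [(-5, -5, 5), (5, 5, -5), (3.5, -2.5, 4), (-2.5, 2.5, -4)]
X, y = make_blobs(n_samples=200, cluster_std=[1, 1, 1, 1], centers=centroids,
                  n_features=3, random_state=1)

y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# 1.0 means the clusters match the true groups exactly; values near 0 mean chance level
ari = adjusted_rand_score(y, y_pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

This kind of check is only possible on synthetic or labeled data; for the student dataset there is no ground truth to compare against.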