
# Introduction

The k-means clustering algorithm represents each cluster by its centroid. Starting from k initial centroids, it partitions the input data into k disjoint clusters by repeating the following two steps until the assignments stop changing:
1. Form k clusters by assigning each instance to its nearest centroid.
2. Recompute the centroid of each cluster.
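
The two steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, the random choice of initial centroids, and the toy `points` array are all hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=50, seed=1):
    """Illustrative two-step k-means loop (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Assumption: initialise with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each instance to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialisation, so only the grouping itself is meaningful.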

In [None]:
import pandas as pd


In [None]:
ratings=[['John',5,5,2,1], ['Mary',4,5,3,2],['Bob',4,4,4,3],['Lisa',2,2,4,5], ['Lee',1,2,3,4],['Harry',2,1,5,5]]

In [None]:
ratings

[['John', 5, 5, 2, 1],
 ['Mary', 4, 5, 3, 2],
 ['Bob', 4, 4, 4, 3],
 ['Lisa', 2, 2, 4, 5],
 ['Lee', 1, 2, 3, 4],
 ['Harry', 2, 1, 5, 5]]

In [None]:
titles = ['User', 'Jaws', 'Star Wars', 'Exorcist', 'Omen']

In [None]:
titles

['User', 'Jaws', 'Star Wars', 'Exorcist', 'Omen']

In [None]:
movies = pd.DataFrame(ratings, columns=titles)

In [None]:
movies

,User,Jaws,Star Wars,Exorcist,Omen
0,John,5,5,2,1
1,Mary,4,5,3,2
2,Bob,4,4,4,3
3,Lisa,2,2,4,5
4,Lee,1,2,3,4
5,Harry,2,1,5,5


# Data Discussion


    

In this dataset, the first three users like action movies and the last three users enjoy horror movies. Our goal is to apply k-means clustering to the users to identify groups with similar movie preferences. Because we expect two such groups, we set k=2.
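
One rough way to check this reading of the data is to compare each user's average rating per genre. The sketch below assumes that Jaws and Star Wars count as "action" and that Exorcist and Omen count as "horror"; the `summary` frame and its column names are hypothetical.

```python
import pandas as pd

ratings = [['John', 5, 5, 2, 1], ['Mary', 4, 5, 3, 2], ['Bob', 4, 4, 4, 3],
           ['Lisa', 2, 2, 4, 5], ['Lee', 1, 2, 3, 4], ['Harry', 2, 1, 5, 5]]
movies = pd.DataFrame(ratings, columns=['User', 'Jaws', 'Star Wars', 'Exorcist', 'Omen'])

# Assumption: genre grouping of the four titles.
summary = pd.DataFrame({
    'User': movies['User'],
    'action_mean': movies[['Jaws', 'Star Wars']].mean(axis=1),
    'horror_mean': movies[['Exorcist', 'Omen']].mean(axis=1),
})
# Label each user by whichever genre has the higher average rating.
summary['prefers'] = (summary['action_mean'] > summary['horror_mean']).map(
    {True: 'action', False: 'horror'})
print(summary)
```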


In [None]:
from sklearn import cluster

In [None]:
myData=movies.drop('User', axis=1)

In [None]:
myData

,Jaws,Star Wars,Exorcist,Omen
0,5,5,2,1
1,4,5,3,2
2,4,4,4,3
3,2,2,4,5
4,1,2,3,4
5,2,1,5,5


In [None]:
k_means = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)

In [None]:
k_means

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=50,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)

In [None]:
k_means.fit(myData)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=50,
       n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=1, tol=0.0001, verbose=0)

In [None]:
labels = k_means.labels_

In [None]:
labels

array([0, 0, 0, 1, 1, 1], dtype=int32)

In [None]:
pd.DataFrame(labels, index=movies.User,columns=['Cluster ID'])

User,Cluster ID
John,0
Mary,0
Bob,0
Lisa,1
Lee,1
Harry,1


K-means clustering has assigned the first three users to one cluster and the last three users to the other. These results are consistent with our expectations. We can also display the centroid of each of the two clusters.

In [None]:
centroids=k_means.cluster_centers_

In [None]:
centroids

array([[4.33333333, 4.66666667, 3.        , 2.        ],
       [1.66666667, 1.66666667, 4.        , 4.66666667]])

In [None]:
pd.DataFrame(centroids, columns=myData.columns)

,Jaws,Star Wars,Exorcist,Omen
0,4.333333,4.666667,3.0,2.0
1,1.666667,1.666667,4.0,4.666667
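
Each centroid is simply the mean rating vector of the users assigned to that cluster. As a quick sanity check, the following sketch recomputes the centroids by hand from the ratings and the cluster labels reported above; the names `data`, `cluster_ids`, and `manual_centroids` are hypothetical.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [[5, 5, 2, 1], [4, 5, 3, 2], [4, 4, 4, 3],
     [2, 2, 4, 5], [1, 2, 3, 4], [2, 1, 5, 5]],
    columns=['Jaws', 'Star Wars', 'Exorcist', 'Omen'])

# Cluster labels as reported by k_means.labels_ in this notebook.
cluster_ids = np.array([0, 0, 0, 1, 1, 1])

# Average the ratings within each cluster to obtain its centroid.
manual_centroids = data.groupby(cluster_ids).mean()
print(manual_centroids)
```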


The cluster centroids can be used to determine the cluster assignments of other users: each new user is assigned to the cluster whose centroid is nearest.
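
The assignment rule can be sketched without scikit-learn by computing the Euclidean distance from each new rating row to each centroid and taking the closest one. The centroid values and rating rows are the ones used in this notebook; the names `new_ratings`, `dists`, and `nearest` are hypothetical.

```python
import numpy as np

# Centroids reported by k_means.cluster_centers_ (as exact fractions).
centroids = np.array([[13/3, 14/3, 3.0, 2.0],
                      [5/3, 5/3, 4.0, 14/3]])

# Ratings of five new users for Jaws, Star Wars, Exorcist, Omen.
new_ratings = np.array([[4, 5, 1, 2],
                        [3, 2, 4, 4],
                        [2, 3, 4, 1],
                        [3, 2, 3, 3],
                        [5, 4, 1, 4]])

# Distance from every new user (rows) to every centroid (columns).
dists = np.linalg.norm(new_ratings[:, None, :] - centroids[None, :, :], axis=2)

# Each user goes to the cluster with the smallest distance.
nearest = dists.argmin(axis=1)
print(nearest)  # [0 1 0 1 0]
```

This should agree with what `k_means.predict` returns for the same rows.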

In [None]:
import numpy as np


In [None]:
testData= np.array([[4,5,1,2],[3,2,4,4],[2,3,4,1],[3,2,3,3],[5,4,1,4]])

In [None]:
testData

array([[4, 5, 1, 2],
       [3, 2, 4, 4],
       [2, 3, 4, 1],
       [3, 2, 3, 3],
       [5, 4, 1, 4]])

In [None]:
labels=k_means.predict(testData)

In [None]:
labels

array([0, 1, 0, 1, 0], dtype=int32)

In [None]:
labels=labels.reshape(-1,1)

In [None]:
labels

array([[0],
       [1],
       [0],
       [1],
       [0]], dtype=int32)

In [None]:
usernames=np.array(['Paul','Kim','Liz','Tom','Bill']).reshape(-1,1)

In [None]:
usernames

array([['Paul'],
       ['Kim'],
       ['Liz'],
       ['Tom'],
       ['Bill']], dtype='<U4')

In [None]:
cols=movies.columns.tolist()

In [None]:
cols

['User', 'Jaws', 'Star Wars', 'Exorcist', 'Omen']

In [None]:
cols.append('Cluster ID')

In [None]:
cols

['User', 'Jaws', 'Star Wars', 'Exorcist', 'Omen', 'Cluster ID']

In [None]:
newusers=pd.DataFrame(np.concatenate((usernames,testData,labels), axis=1),columns=cols)

In [None]:
newusers

,User,Jaws,Star Wars,Exorcist,Omen,Cluster ID
0,Paul,4,5,1,2,0
1,Kim,3,2,4,4,1
2,Liz,2,3,4,1,0
3,Tom,3,2,3,3,1
4,Bill,5,4,1,4,0


In [None]:
myData

,Jaws,Star Wars,Exorcist,Omen
0,5,5,2,1
1,4,5,3,2
2,4,4,4,3
3,2,2,4,5
4,1,2,3,4
5,2,1,5,5


In [None]:
k_means.cluster_centers_

array([[4.33333333, 4.66666667, 3.        , 2.        ],
       [1.66666667, 1.66666667, 4.        , 4.66666667]])

# Future Work

An open question is how to determine the number of clusters. Here we inspected the data and concluded that two clusters were appropriate. That approach does not scale: what should we do with a large dataset or with high-dimensional data?
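
One common heuristic, offered here only as a possible starting point, is the "elbow" method: fit k-means for several values of k and look for the point where the within-cluster sum of squared distances (scikit-learn's `inertia_` attribute) stops dropping sharply. The sketch below applies it to the same ratings; the range of k values and the explicit `n_init=10` are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame(
    [[5, 5, 2, 1], [4, 5, 3, 2], [4, 4, 4, 3],
     [2, 2, 4, 5], [1, 2, 3, 4], [2, 1, 5, 5]],
    columns=['Jaws', 'Star Wars', 'Exorcist', 'Omen'])

# Within-cluster sum of squared distances for each candidate k.
sse = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, max_iter=50, random_state=1).fit(data)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 3))
```

Other criteria exist as well, for example the silhouette coefficient (`sklearn.metrics.silhouette_score`), and they do not always agree on the best k.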