The goal of this Lab is to implement the clustering technique Kmeans as a Python class.

To do it, follow the steps below.

Along the way, we will test the Kmeans implementation on a training population of employees.

<b>Todo :</b> Replace <font color=red>#?</font> with the appropriate Python code

Step 1. Define a training dataset of employees

In [1]:
# Import numpy library
import numpy as np

In [2]:
# Declare a numpy matrix denoted X
# It represents a population of employees
# The matrix rows are employees
# The matrix columns are their properties : age and salary
# The data of employees : (30,1300.5) , (48, 2500.7) , (25, 1100.5) , (45, 1900.75)
X=np.array([[30,1300.5],[48, 2500.7] , [25, 1100.5] , [45, 1900.75]])
X

array([[  30.  , 1300.5 ],
       [  48.  , 2500.7 ],
       [  25.  , 1100.5 ],
       [  45.  , 1900.75]])

In [3]:
# Shape of the matrix X : number of rows x number of columns
X.shape

(4, 2)

Step 2. Define Kmeans hyperparameters

In [4]:
# Kmeans hyperparameters are defined as attributes
# The most important hyperparameter of Kmeans is the number of clusters.
# It is manually assigned.
# It is denoted k
# It has 2 as default value
# Complete the code
class Kmeans :
  def __init__(self, k=2):
    self.k=k

In [5]:
# Run test
km=Kmeans(k=3)
km.k

3

Step 3. Define Kmeans model parameters

In [6]:
# Kmeans parameters are defined as attributes
# Kmeans has one parameter : the cluster centers
# It is denoted centers
# It is set to None before the learning process
# Complete the code :
class Kmeans :
  def __init__(self, k=2):
    self.k=k
    self.centers=None

In [7]:
# Run test
km=Kmeans(k=2)
km.centers

Step 4. Define the learning process

Some functions are useful and required to implement the learning process :
- <font color='red'>random.choice</font> : draws one random element from a given list
- <font color='red'>random.sample</font> : randomly draws a sample of distinct elements (without replacement) from a given list
- <font color='red'>range</font> : generates a sequence of integers

Some matrix properties are useful :
- <font color='red'>matrix.shape</font> : represents the shape of matrix (num_rows x num_columns) 
- <font color='red'>matrix.shape[0]</font> : number of rows of matrix
- <font color='red'>matrix.shape[1]</font> : number of columns of matrix

<u>Examples :</u>

In [8]:
import random
# The choice function draws one element from a list
random.choice([0,1,2,3])

2

In [9]:
k=2
# The sample function draws k distinct elements from a list
random.sample([0,1,2,3],k)

[1, 0]

In [10]:
# range generates a sequence of integers
list(range(0,3,1))

[0, 1, 2]

In [11]:
indices_all_rows=list(range(3))
indices_all_rows

[0, 1, 2]

In [12]:
# draw k integers from the list of indices
random.sample(indices_all_rows, k)

[1, 0]

In [13]:
# extract k rows from X matrix based on their indices [0, 3]
X[[0, 3],:]

array([[  30.  , 1300.5 ],
       [  45.  , 1900.75]])

In [14]:
X

array([[  30.  , 1300.5 ],
       [  48.  , 2500.7 ],
       [  25.  , 1100.5 ],
       [  45.  , 1900.75]])

In [15]:
# number of rows in X
X.shape[0]

4
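
The examples above can be combined into the initialization used later in fit(): draw k distinct row indices, then extract the corresponding rows of X as starting centers. This is a minimal sketch. The seed value and the name initial_centers are assumptions added here only to make the draw repeatable and readable.

```python
import random
import numpy as np

# Same training data as in Step 1
X = np.array([[30, 1300.5], [48, 2500.7], [25, 1100.5], [45, 1900.75]])
k = 2

# Hypothetical seed, used only so that the draw is repeatable
random.seed(0)

# Indices of all rows of X
indices_all_rows = list(range(X.shape[0]))

# Draw k distinct indices (random.sample draws without replacement)
indices_drawn = random.sample(indices_all_rows, k)

# Extract the k corresponding rows: one candidate center per row
initial_centers = X[indices_drawn, :]

print(indices_drawn)
print(initial_centers.shape)
```

Because the indices are drawn without replacement, the k initial centers are always k different employees.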

<u>Learning process :</u>

In [20]:
# The learning process of Kmeans is implemented in fit() function
# Define fit() function that :
# - takes as input training data matrix X
# - iteratively estimates the cluster centers
# - returns the class object (self)
class Kmeans :
  def __init__(self, k=2, max_iter=100):
    self.k=k
    self.max_iter=max_iter
    self.centers=None
  def fit(self, X):
    # Randomly initialize K cluster centers as data points
    indices_all_rows=list(range(X.shape[0]))
    indices_drawn=random.sample(indices_all_rows, self.k)
    self.centers=X[indices_drawn,:]
    for _ in range(self.max_iter):
      # Step 1: Determine the clusters - find the nearest cluster center for each data point
      clusters = []
      for x in X:
        cluster_id = np.argmin([np.linalg.norm(x - center) for center in self.centers])
        clusters.append(cluster_id)

      # Step 2: Update cluster centers - a cluster center is the mean of the data points that belong to the cluster
      new_centers = np.array([np.mean(X[np.array(clusters) == i], axis=0) for i in range(self.k)])

      # Check for convergence: stop when the centers no longer change
      if np.all(self.centers == new_centers):
        break

      self.centers = new_centers

    return self

In [17]:
# Run test
km=Kmeans(k=2)
km.fit(X)
km.centers

array([[  25. , 1100.5],
       [  30. , 1300.5]])
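
Since fit() starts from a random draw, its result can vary from one run to another. To follow the two steps of the loop by hand, here is a sketch of a single iteration with the initial centers fixed to rows 0 and 3 of X (an assumption made only for this illustration). The distances are computed with NumPy broadcasting instead of the nested loops, which is a different way of writing the same assignment step.

```python
import numpy as np

# Same training data as in Step 1
X = np.array([[30, 1300.5], [48, 2500.7], [25, 1100.5], [45, 1900.75]])

# Assumed initial centers: rows 0 and 3 of X (fixed instead of random)
centers = X[[0, 3], :]

# Step 1: distance from every employee to every center, shape (4, 2)
distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

# Each employee is assigned to the nearest center
labels = np.argmin(distances, axis=1)
print(labels)  # [0 1 0 1]

# Step 2: each new center is the mean of the employees assigned to it
new_centers = np.array([X[labels == i].mean(axis=0) for i in range(len(centers))])
print(new_centers)
```

With this starting point, employees 0 and 2 form one cluster with center (27.5, 1200.5), and employees 1 and 3 form the other with center (46.5, 2200.725). Running the two steps again from these centers gives the same assignment, so the loop would stop there.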

Step 5. Define the prediction process

In [21]:
# The prediction process of Kmeans is implemented in predict() function
# Define predict() function that :
# - takes as input an employee vector x
# - predicts the label (an integer) of the cluster to which x belongs, i.e. one of {0,1,..,k-1}
# - returns the predicted label
class Kmeans :
  def __init__(self, k=2, max_iter=100):
    self.k = k
    self.max_iter = max_iter
    self.centers = None

  def fit(self, X):
    # Randomly initialize K cluster centers as data points
    indices_all_rows = list(range(X.shape[0]))
    indices_drawn = random.sample(indices_all_rows, self.k)
    self.centers = X[indices_drawn, :]
    for _ in range(self.max_iter):
      # Step 1: Determine the clusters - find the nearest cluster center for each data point
      clusters = []
      for x in X:
        cluster_id = np.argmin([np.linalg.norm(x - center)
                                for center in self.centers])
        clusters.append(cluster_id)

      # Step 2: Update cluster centers - a cluster center is the mean of the data points that belong to the cluster
      new_centers = np.array(
          [np.mean(X[np.array(clusters) == i], axis=0) for i in range(self.k)])

      # Check for convergence: stop when the centers no longer change
      if np.all(self.centers == new_centers):
        break

      self.centers = new_centers

    return self

  def predict(self, x):
    # Assign the input vector x to the cluster with the nearest center
    cluster_id = np.argmin([np.linalg.norm(x - center)
                           for center in self.centers])
    return cluster_id


In [38]:
# Run test
# Predict which cluster a new employee x with age=29 and salary=1400 belongs to
x=np.array([29,1400])
km=Kmeans(k=2)
km.fit(X)
label=km.predict(x)
label

1
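
The label returned by predict() is the position of the nearest center in the centers matrix, so the integer itself depends on the order in which the centers were drawn during fit(). To make this visible, here is a sketch that predicts labels for several employees at once against fixed centers. The function predict_many and the center values are assumptions introduced for this illustration, not part of the Kmeans class above.

```python
import numpy as np

def predict_many(X_new, centers):
    """Hypothetical helper: label of the nearest center for each row of X_new."""
    # Accept a single vector as well as a matrix of employees
    X_new = np.atleast_2d(X_new)
    # Distance from every row of X_new to every center
    distances = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)
    # Index of the nearest center for each row
    return np.argmin(distances, axis=1)

# Assumed centers: row 0 is the lower-salary cluster, row 1 the higher-salary one
centers = np.array([[27.5, 1200.5], [46.5, 2200.725]])

# Two new employees: (age, salary)
X_new = np.array([[29, 1400], [50, 2300]])

labels = predict_many(X_new, centers)
print(labels)  # [0 1]

# Swapping the rows of centers swaps the labels, not the grouping
print(predict_many(X_new, centers[::-1]))  # [1 0]
```

The employee (29, 1400) always joins the lower-salary cluster. Whether that cluster is called 0 or 1 only depends on the row order of the centers.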