# Lab 4: Clustering (K-Means)

**Instructions:**

- Before submission, make sure your code can be run by our TAs or instructors with a single "click". You may include a readme file if needed.

- Your submission file should be titled "IS424 Lab 4 - Clustering " followed by your name and SMU email ID. Submit it to the eLearn assignment folder before the stated deadline.

- **VERY IMPORTANT: This assignment must be your independent effort. You SHOULD NOT collaborate with others. Supporting or engaging in plagiarism is a violation of academic policy, and any such attempt will lead to strict disciplinary action.**

**Learning Objectives:**

This lab assignment aims to familiarise you with some of the clustering techniques that you have learnt in the course.
By the end of this lab, you will be familiar with the following concepts:

1. K-Means clustering using Python (sklearn)
2. Effect of standardisation on clustering
3. Visualising different numbers of clusters for the same dataset

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Import Iris Dataset

The Fisher Iris Dataset (https://archive.ics.uci.edu/ml/datasets/iris) will be used for this assignment.

In [None]:
iris_df = pd.read_csv('iris.data', header=None)
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris_df.head()

## 2. Plot Scatterplots [6 marks]

There are 4 variables in the dataset. Draw a scatter plot for each of the following pairs of variables, with the points grouped (coloured) by `species`. You are free to use any visualization package of your choice. Seaborn is a good option: https://seaborn.pydata.org/generated/seaborn.scatterplot.html
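
As a minimal sketch of the seaborn approach, the snippet below plots a small made-up DataFrame and colours the points by a categorical column through the `hue` argument. The `toy_df` data and its column names (`length`, `width`, `group`) are illustrative assumptions and are not part of the Iris data.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data, for illustration only (not the Iris dataset)
toy_df = pd.DataFrame({
    "length": [1.0, 1.2, 3.1, 3.3, 5.0, 5.2],
    "width": [0.5, 0.7, 1.5, 1.4, 2.5, 2.6],
    "group": ["a", "a", "b", "b", "c", "c"],
})

# `hue` assigns one colour per distinct value in the "group" column
ax = sns.scatterplot(data=toy_df, x="length", y="width", hue="group")
ax.set_title("width vs. length, coloured by group")
```

The same call pattern should carry over to any DataFrame that has two numeric columns and one categorical column.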


**a. Sepal Width `sepal_width` and Sepal Length `sepal_length`**

In [None]:
# a. sepal_width and sepal_length


**b. Petal Width `petal_width` and Sepal Length `sepal_length`**

In [None]:
# b. petal_width and sepal_length


**c. Petal Width `petal_width` and Sepal Width `sepal_width`**

In [None]:
# c. petal_width and sepal_width


## 3. Perform K-Means Clustering [6 marks]

With the number of clusters `n_clusters` set to `3` and the seed `random_state` set to `424`, perform clustering on each of the following pairs of variables **and plot the corresponding scatter plot grouped by cluster number `labels_`**.

After fitting the model (`fit`), use the `add_cluster_number_to_dataframe` function to add the cluster numbers as a new column.
- `add_cluster_number_to_dataframe` takes in two arguments: `model` (of type `sklearn.cluster.KMeans`) and `df` (of type `pandas.DataFrame`)
- `add_cluster_number_to_dataframe` returns a new `DataFrame`

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
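
To make the `fit` and `labels_` workflow concrete, here is a minimal sketch on made-up data. The `toy_X` array and the seed `0` are illustrative assumptions, and `n_init=10` is passed explicitly only so that the behaviour is the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated groups of two points each
toy_X = np.array([
    [0.0, 0.0], [0.1, 0.2],
    [5.0, 5.0], [5.1, 4.9],
    [10.0, 0.0], [9.9, 0.1],
])

toy_model = KMeans(n_clusters=3, random_state=0, n_init=10)
toy_model.fit(toy_X)

# One zero-based cluster index (0, 1, or 2) per input row
print(toy_model.labels_)
# One centroid per cluster, in the same feature space as toy_X
print(toy_model.cluster_centers_.shape)
```

The numbering of the clusters is arbitrary: which group receives index 0 depends on the initialisation, so only the grouping itself is meaningful. The `add_cluster_number_to_dataframe` helper in this lab shifts these zero-based indices by one before storing them.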

In [None]:
from sklearn.cluster import KMeans

In [None]:
def add_cluster_number_to_dataframe(model, df):
    df = df.copy() # Prevent adding column in-place
    df['cluster'] = model.labels_ + 1 # labels_ is zero-based; number clusters from 1
    df['cluster'] = 'cluster ' + df['cluster'].astype(str) # e.g. 'cluster 1'
    df = df.sort_values(['cluster']) # Sort rows by cluster label
    return df

**a. Sepal Width `sepal_width` and Sepal Length `sepal_length`**

In [None]:
# Kmeans Clustering (sepal_width, sepal_length)
X = iris_df[[_____, _____]]

kmeans = KMeans(_____, _____, _____)
kmeans.fit(X)
iris_df = add_cluster_number_to_dataframe(_____, _____)

iris_df.head()

**Plot the scatterplot**

In [None]:
# plot scatterplot


**b. Petal Width `petal_width` and Sepal Length `sepal_length`**

In [None]:
# Kmeans Clustering (petal_width, sepal_length)


**Plot the scatterplot**

In [None]:
# plot scatterplot


**c. Petal Width `petal_width` and Sepal Width `sepal_width`**

In [None]:
# Kmeans Clustering (petal_width, sepal_width)


**Plot the scatterplot**

In [None]:
# plot scatterplot


## Standardise Variables
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
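
The sketch below illustrates what `StandardScaler(with_mean=False)` does, using made-up numbers (the `toy` array is an illustrative assumption). Each column is divided by its standard deviation, so it ends up with unit variance, but it is not centred on zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

toy_scaled = StandardScaler(with_mean=False).fit_transform(toy)

# Every column now has (population) standard deviation 1 ...
print(toy_scaled.std(axis=0))
# ... but the column means are not shifted to 0
print(toy_scaled.mean(axis=0))
```

Note that `DataFrame.describe()` reports the sample standard deviation (dividing by n - 1), so the `std` row it shows for scaled columns will be slightly above 1 rather than exactly 1.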

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# with_mean=False: scale each feature to unit variance without centring it
scaler = StandardScaler(with_mean=False)
# Select the feature columns by name rather than by position
iris_df[feature_cols] = scaler.fit_transform(iris_df[feature_cols])

iris_df.describe()

## 4. Perform K-Means Clustering <ins>with</ins> Standardised Variables [8 marks]

**a. Sepal Width `sepal_width` and Sepal Length `sepal_length`**

In [None]:
# Kmeans Clustering (sepal_width, sepal_length)


In [None]:
# plot scatterplot


**b. Petal Width `petal_width` and Sepal Length `sepal_length`**

In [None]:
# Kmeans Clustering (petal_width, sepal_length)


In [None]:
# plot scatterplot


**c. Petal Width `petal_width` and Sepal Width `sepal_width`**

In [None]:
# Kmeans Clustering (petal_width, sepal_width)


In [None]:
# plot scatterplot


Comment on why the results of this clustering analysis (with standardisation) are different from those obtained in the previous clustering analysis (without standardisation).

`type your answer here`

## 5. Perform K-Means Clustering with Different Number of Clusters [6 marks]

For this dataset, we happen to know the right number of clusters a priori. In a real-world scenario, we will usually not know the number of clusters. Additionally, the clusters may not correspond to the “ideal” clusters that we want. Using Petal Width `petal_width` and Sepal Length `sepal_length` as the variables, perform cluster analysis with number of clusters `n_clusters` = `2`, `4`, and `6`.

**Report the number of items in each cluster for all three cluster analyses, and provide the appropriate scatter plots.**
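
One possible way to count the items per cluster is sketched below on made-up data. The `toy_points` array, the seed `0`, and the `cluster_sizes` dictionary are illustrative assumptions, and `n_init=10` is passed explicitly only so that the behaviour is the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 60 random 2-D points, for illustration only
rng = np.random.default_rng(0)
toy_points = rng.normal(size=(60, 2))

cluster_sizes = {}
for k in (2, 4, 6):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(toy_points)
    # np.bincount counts how many rows received each label 0..k-1
    cluster_sizes[k] = np.bincount(model.labels_, minlength=k)
    print(k, cluster_sizes[k])
```

If the cluster labels are already stored in a DataFrame column, calling `value_counts()` on that column gives the same information.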

`n_clusters` = `2`

In [None]:
# Kmeans Clustering (n_clusters=2) 


In [None]:
# plot scatterplot


`n_clusters` = `4`

In [None]:
# Kmeans Clustering (n_clusters=4) 


In [None]:
# plot scatterplot


`n_clusters` = `6`

In [None]:
# Kmeans Clustering (n_clusters=6) 


In [None]:
# plot scatterplot
