<a href="https://colab.research.google.com/github/GabrielleRab/SRMPmachine/blob/main/K_means_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **K-means clustering with a dataset of your choice**

Recall that we use k-means clustering when we have unlabeled data and we want to find patterns based on specific characteristics.

### **Step 1:** Identify your question

This has already been done for you! Look at the research question for your dataset and consider whether or not it's a good fit for the k-means clustering approach.

### **Step 2:** Select your data

Let's import our data. First we need to load in the necessary Python libraries. Run the code below:

In [None]:
# import the necessary Python library
import pandas as pd

Next, we create a dataframe called `df` with a pre-cleaned version of your data.

**Important:** *Only run the cell for your chosen dataset. Ignore the other two cells.*

In [None]:
# Run this cell ONLY if you are using the stellar rotation dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/Stellar_rotation_clean.csv")

In [None]:
# Run this cell ONLY if you are using the dragonfly wing dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/wing_measurements_clean.csv")

In [None]:
# Run this cell ONLY if you are using the North Carolina crime dataset
df = pd.read_csv("https://raw.githubusercontent.com/GabrielleRab/SRMPmachine/main/datasets/crime_data_clean.csv")

Let's take a look at the first 5 rows of the dataset. Make sure this is the dataset you meant to import! If it's wrong, just go back and run the correct cell above. That will overwrite the dataframe.

In [None]:
df.head()

Run the code below to find out how many rows are in our dataset:

In [None]:
# return the number of rows in the dataset
len(df)

### **Step 3:** Choose your method

Review your dataset and your research question one more time to make sure that you're ready to use the k-means clustering method. This is an unsupervised method that works best when your data is not already classified. K-means clustering is an algorithm that groups unlabeled data points into clusters based on the similarity of their features.
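To make the idea concrete, here is a minimal sketch of what a k-means algorithm does, written with NumPy only. It is an illustration, not the scikit-learn implementation used later in this notebook. The function name `simple_kmeans` and the toy points are made up for this example.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=10, seed=0):
    """Toy k-means for illustration: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (a centroid with no points stays where it is)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated, made-up groups of 2-D points
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
toy_labels, toy_centroids = simple_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The first three points should share one label and the last three the other. Which group is called 0 and which is called 1 depends on the random starting centroids.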

Run the code below to import the necessary Python libraries for k-means clustering.

In [None]:
# import necessary Python libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans

### **Step 4:** Prepare your data

This step has been taken care of for you! There are no rows with missing data.

Now it's time to choose which features you will use for your clustering analysis. 

**Replace aaaa and bbbb below with the names of the columns that contain the features you want to compare (make sure there are no typos!):**

If you want to try clustering with different features, simply replace the column names with new ones and then rerun this cell and all the cells below it.
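K-means measures distances between points, so the features you choose need to be numerical. If you are unsure which columns qualify, a quick check like the sketch below can help. The dataframe `example_df` and its column names are made up for illustration. With your own data you would call `select_dtypes` on `df` instead.

```python
import pandas as pd

# Made-up stand-in for df; your real dataframe has different columns
example_df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "length_mm": [10.2, 11.5, 9.8],
    "width_mm": [3.1, 3.4, 2.9],
})

# Keep only the columns whose values are numbers
numeric_columns = example_df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)  # ['length_mm', 'width_mm']
```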

In [None]:
# Replace aaaa and bbbb below with the names of the columns 
# that contain the features you want to compare:

col1 = "aaaa"
col2 = "bbbb"

# extract the two chosen columns as a NumPy array with one row per data point
x = df[[col1, col2]].values

Let's use box plots to examine the distribution of each of the two features you have chosen.

In [None]:
# draw one box plot per feature
fig, axs = plt.subplots(2, figsize=(10, 4))
sns.boxplot(x=x[:, 0], ax=axs[0])
sns.boxplot(x=x[:, 1], ax=axs[1])
axs[0].set_xlabel(col1)
axs[1].set_xlabel(col2)
plt.tight_layout()

Let's look at the overall distribution of our data. We will plot a histogram with a kernel density estimate (KDE) curve for each feature, which shows the univariate distribution of each variable. The y-axis shows the probability density, which is why it is labeled "Density":

In [None]:
# plot the distribution of each chosen feature
sns.histplot(df[[col1, col2]], kde=True, stat="density")
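A probability density is scaled so that the total area under the curve is 1, which is why the y-axis values can look unfamiliar. The sketch below illustrates this with made-up measurements and SciPy's `gaussian_kde`. It is an assumption that this resembles the smoothing in the plot above, and the exact curve seaborn draws may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=5.0, size=200)  # made-up measurements

# Estimate a smooth density curve from the sample
kde = gaussian_kde(sample)
grid = np.linspace(sample.min() - 25.0, sample.max() + 25.0, 2000)
density = kde(grid)

# Approximate the area under the curve with a simple sum of thin rectangles
dx = grid[1] - grid[0]
area = float(np.sum(density) * dx)
print(round(area, 3))  # should be very close to 1
```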

### **Step 5:** Create and Use the model

Now it's time to make our k-means model. We will also need to set the hyperparameters (values that control how the model learns and makes decisions), such as k, the number of clusters.

**Note:** Because k-means clustering is an unsupervised method, we do not need to split our data into a training and a testing set. No data is labeled, so there are no known answers to check the model's predictions against.

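Before we create the model, we need to choose k, the number of clusters. To estimate a good value, we draw an elbow curve: we fit the model for several values of k, plot the inertia (the sum of squared distances from each point to the centroid of its cluster), and look for the "elbow" where the steep decline levels off.

Run the code below to draw the elbow curve: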

In [None]:
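# fit a model for each k from 1 to 9 and record its inertia
k_values = range(1, 10)
distortions = []
for n in k_values:
    model = KMeans(n_clusters=n, n_init=10, random_state=0)
    model.fit(x)
    distortions.append(model.inertia_)

fig = plt.figure(figsize=(15, 5))
plt.plot(k_values, distortions, marker='o')
plt.grid(True)
plt.xlabel('k (number of clusters)')
plt.ylabel('inertia')
plt.title('Elbow curve')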

See a sample elbow point below:

<img src="https://miro.medium.com/max/1400/1*V25YoNWK7v1fSDP5ixPPfQ.png">
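The value plotted on the y-axis of an elbow curve is scikit-learn's `inertia_`: the sum of squared distances from each point to the centroid of its own cluster. The sketch below recomputes it by hand on made-up points (`demo_points` is not one of the course datasets) to show what the number means.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up groups of three points each
demo_points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                        [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

demo_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(demo_points)

# Look up the centroid of the cluster each point was assigned to
own_centroids = demo_model.cluster_centers_[demo_model.labels_]

# Sum of squared distances from each point to its own centroid
manual_inertia = float(np.sum((demo_points - own_centroids) ** 2))

print(manual_inertia, demo_model.inertia_)  # the two values should match
```

Adding more clusters can only keep this number the same or shrink it, which is why we look for the point where the decline levels off rather than for the lowest value.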

**Enter the correct number for k below:**

Remember to change this if you rerun your code with new features.

In [None]:
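# enter your number for k below: replace the x with a whole number (for example, k = 3)
# (if you leave the x in place, k will be set to your data array and the model will fail)
k = x

# create the k-means model
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)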


Next we will apply the model to the data:

In [None]:
# fit the model to the data
y_kmeans = kmeans.fit_predict(x)

Let's compute the size of each cluster that our model formed. 

In [None]:
# number of data points in each cluster
for i in range(k):
    print(f"Cluster {i}: {np.sum(y_kmeans == i)} points")

**Bias alert!** If your clusters are very different in size, this might introduce some inaccuracy in the clustering. K-means tends to form clusters that cover similar areas, even when the true groups differ in size or density.

Let's visualize the clusters that our model has created. Each cluster's centroid is marked with a black star:

In [None]:
# one color per cluster (supports up to 11 clusters)
colors = ["purple", "blue", "orange", "green", "yellow", "lightblue", "magenta", "red", "pink", "turquoise", "grey"]

# visualize the clusters
for i in range(k):
    plt.scatter(x[y_kmeans == i, 0], x[y_kmeans == i, 1], s=20, c=colors[i], label=f"cluster {i}")

# plot the centroids of the clusters
# print(kmeans.cluster_centers_)
plt.scatter(
    kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
    s=100, marker='*',
    c='black', 
    label='centroids'
)

plt.xlabel(col1)
plt.ylabel(col2)
plt.legend()


What do you notice? 

Do the clusters of data points in your dataset make any logical sense to you?

**Bias alert!** How much space is there between your clusters? Recall that the borders between closely packed clusters may be somewhat arbitrary. Points along the borders should be considered with caution.

Remember that the k-means clustering algorithm will *always create clusters* even if you don't think they look very meaningful. If these clusters don't look real to you, go back and try different features.
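One way to find points that sit near a border is to compare each point's distance to its nearest centroid with its distance to the second-nearest one. The sketch below does this on made-up points using `KMeans.transform`, which returns the distance from every point to every centroid. The 0.5 cutoff is an arbitrary choice for illustration, not a standard value.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up groups plus one point (the last) sitting between them
border_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                        [10.0, 0.0], [11.0, 0.0], [10.0, 1.0], [11.0, 1.0],
                        [5.4, 0.5]])

border_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(border_demo)

# Sort each row so column 0 is the nearest centroid and column 1 the second nearest
dists = np.sort(border_model.transform(border_demo), axis=1)

# A ratio near 1 means the point is almost as close to another cluster as to its own
ratio = dists[:, 0] / dists[:, 1]

border_threshold = 0.5  # arbitrary cutoff chosen for this example
flagged = np.where(ratio > border_threshold)[0].tolist()
print(flagged)  # row numbers of points near a border
```

With your own data you would replace `border_demo` with `x` and `border_model` with `kmeans`, then look at the flagged rows before drawing conclusions about them.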

**Note: if your data looks like it has a linear correlation, you might want to try Linear Regression instead.**
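A quick way to check for a linear relationship is the Pearson correlation coefficient, which is close to 1 or -1 when the points fall near a straight line and close to 0 when there is no linear relationship. The sketch below uses made-up values. With your own data you could pass `x[:, 0]` and `x[:, 1]` instead.

```python
import numpy as np

# Made-up features where feature_b is roughly 2 * feature_a
feature_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
feature_b = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r = float(np.corrcoef(feature_a, feature_b)[0, 1])
print(round(r, 3))
```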

### **Step 6:** Evaluate the model

Did this model help you answer your research question?

What are some forms of bias that you need to be aware of in this analysis?

What questions do you still have?