# Exercise 4: Find the closest centroids

In this exercise, we will code the first k-means iteration from scratch to assign data points to their closest cluster.

1. Open a new Colab notebook and import the required packages (pandas, sklearn, altair).

In [0]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt 

2. Load the dataset and select the same columns as in exercise 1.

Hints: 
- Use the **read_csv** method from the **pandas** package
- Pass the list of the 3 variable names to the **usecols** parameter

In [0]:
file_url = 'https://data.gov.au/data/dataset/5c99cfed-254d-40a6-af1c-47412b7de6fe/resource/4e21e064-9ff9-4033-a79a-37681f9e54b1/download/taxstats2015individual28countaveragemedianbypostcode.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])

3. Calculate the minimum and maximum values of the variables '*Average total business income*' and '*Average total business expenses*' and print their values.

Hints: 
- Use the **min()** and **max()** methods from **pandas** to find the limits of each variable's range

In [0]:
business_income_min = df['Average total business income'].min()
business_income_max = df['Average total business income'].max()

business_expenses_min = df['Average total business expenses'].min()
business_expenses_max = df['Average total business expenses'].max()

print(business_income_min)
print(business_income_max)
print(business_expenses_min)
print(business_expenses_max)

0
876324
0
884659


4. Generate random coordinates for the 4 centroids, with values between the minimum and maximum of the 2 variables, and store them in a dataframe. Print the dataframe.

Hints: 
- Look at the **sample** method from the package **random**
- Use the function **range** to define the limits
- Add the dataframe index into a column called '*cluster*' (this will be used for displaying the cluster number in the scatter plot tooltip)
- Set the seed to 42 so that you get the same results


In [0]:
import random
random.seed(42)

centroids = pd.DataFrame()

centroids['Average total business income'] = random.sample(range(business_income_min, business_income_max), 4)
centroids['Average total business expenses'] = random.sample(range(business_expenses_min, business_expenses_max), 4)
centroids['cluster'] = centroids.index
centroids

| | Average total business income | Average total business expenses | cluster |
|---|---|---|---|
| 0 | 670487 | 288389 | 0 |
| 1 | 116739 | 256787 | 1 |
| 2 | 26225 | 234053 | 2 |
| 3 | 777572 | 146316 | 3 |
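
To make the role of the seed concrete, here is a minimal sketch using only the standard library. The helper name `draw_coordinates` and its parameters are hypothetical and not part of the exercise; the bounds are the minimum and maximum income values printed in step 3. Note that `range(low, high)` excludes `high`, so the maximum itself is never drawn, and that `sample` draws without replacement, so the values are distinct.

```python
import random

def draw_coordinates(low, high, k, seed):
    """Draw k distinct integers from [low, high) with a private, seeded generator."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return rng.sample(range(low, high), k)

# Bounds taken from the min/max income values printed in step 3
income_coords = draw_coordinates(0, 876324, k=4, seed=42)
income_again = draw_coordinates(0, 876324, k=4, seed=42)

# Same seed and same bounds: the draw is reproducible
print(income_coords == income_again)  # → True
```

Using a dedicated `random.Random` instance is one possible design choice; calling `random.seed(42)` as in the exercise works equally well in a notebook that runs top to bottom.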


5. Plot the centroids and the first 5 rows of the dataset.

Hints:
- Create 2 different scatter plots with **altair** and combine them

In [0]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses', 
    color=alt.value('orange'),
    tooltip=['Postcode', 'Average total business income', 'Average total business expenses']
).interactive()

chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses', 
    color=alt.value('black'),
    tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()

chart1 + chart2

6. For the first row, calculate its squared Euclidean distance to each of the 4 centroids.

Hints:
- Look at the **at** method from the **pandas** package to extract a value from a given row
- Define a function that will calculate squared Euclidean distance then use a list comprehension (or a for loop) to iterate through each centroid
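
For reference, the quantity this function needs to return can be written as follows, where $(x, y)$ are the coordinates of a data point and $(c_x, c_y)$ those of a centroid (the symbols are notation chosen here, not taken from the exercise):

$$
d^2 = (x - c_x)^2 + (y - c_y)^2
$$

Skipping the square root is safe for this purpose because the square root is monotonic: the centroid with the smallest squared distance is also the one with the smallest distance.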

In [0]:
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    # Sum of the squared differences along each axis
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2

# Coordinates of the first row of the dataset
data_x = df.at[0, 'Average total business income']
data_y = df.at[0, 'Average total business expenses']

# Distance from the first row to each of the 4 centroids
distances = [squared_euclidean(data_x, data_y, centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
distances

[215601466600, 10063365460, 34245932020, 326873037866]
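
As a cross-check that does not depend on pandas, the same distances can be recomputed with plain Python integers. This is only a sketch: the name `squared_distance` is hypothetical, the point is the first row of the dataset as displayed in step 7, and the centroid coordinates are copied from the dataframe printed in step 4.

```python
def squared_distance(point, centroid):
    """Squared Euclidean distance between two (x, y) tuples."""
    return (point[0] - centroid[0]) ** 2 + (point[1] - centroid[1]) ** 2

# (income, expenses) for the first row of the dataset
first_row = (210901, 222191)

# (income, expenses) for the 4 centroids generated with seed 42
centroid_coords = [
    (670487, 288389),
    (116739, 256787),
    (26225, 234053),
    (777572, 146316),
]

check = [squared_distance(first_row, c) for c in centroid_coords]
print(check)
# → [215601466600, 10063365460, 34245932020, 326873037866]
```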

7. Find the index of the smallest element in the distances list and save it into the dataframe in a new column called '*cluster*'.

Hints:
- Use the **index** method from the list


In [0]:
cluster_index = distances.index(min(distances))
df.at[0, 'cluster'] = cluster_index
df.head()

| | Postcode | Average total business income | Average total business expenses | cluster |
|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 1.0 |
| 1 | 2006 | 69983 | 48971 | NaN |
| 2 | 2007 | 575099 | 639499 | NaN |
| 3 | 2008 | 53329 | 32173 | NaN |
| 4 | 2009 | 237539 | 222993 | NaN |
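
The `index(min(...))` idiom scans the list twice and, when two centroids are equally close, returns the first one. A single-pass equivalent, sketched below with the distances printed in step 6, uses `min` with a `key` over the positions. The variable names are illustrative only.

```python
# Distances from the first row to the 4 centroids, as printed in step 6
step6_distances = [215601466600, 10063365460, 34245932020, 326873037866]

# Two passes: find the minimum value, then look up its position
two_pass = step6_distances.index(min(step6_distances))

# One pass: pick the position whose distance is smallest
one_pass = min(range(len(step6_distances)), key=step6_distances.__getitem__)

print(two_pass, one_pass)  # → 1 1

# On a tie, both approaches return the first matching position
tied = [5, 2, 2, 9]
tied_two_pass = tied.index(min(tied))
tied_one_pass = min(range(len(tied)), key=tied.__getitem__)
print(tied_two_pass, tied_one_pass)  # → 1 1
```

For a list of 4 distances the difference in cost is negligible, so the choice between the two is mostly a matter of readability.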


8. Repeat steps 6 and 7 for the next 4 rows.


In [0]:
# Apply the same distance calculation and cluster assignment to rows 1 to 4
for row in range(1, 5):
    data_x = df.at[row, 'Average total business income']
    data_y = df.at[row, 'Average total business expenses']
    distances = [squared_euclidean(data_x, data_y, centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
    df.at[row, 'cluster'] = distances.index(min(distances))

df.head()

| | Postcode | Average total business income | Average total business expenses | cluster |
|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 1.0 |
| 1 | 2006 | 69983 | 48971 | 2.0 |
| 2 | 2007 | 575099 | 639499 | 0.0 |
| 3 | 2008 | 53329 | 32173 | 2.0 |
| 4 | 2009 | 237539 | 222993 | 1.0 |
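
The per-row steps above can also be wrapped in a single helper. The sketch below uses plain tuples rather than the dataframe so that it runs on its own; `closest_centroid` is a hypothetical name, and the coordinates are copied from the outputs shown earlier in this exercise.

```python
def closest_centroid(point, centroid_coords):
    """Return the position of the centroid with the smallest squared Euclidean distance."""
    distances = [
        (point[0] - cx) ** 2 + (point[1] - cy) ** 2
        for cx, cy in centroid_coords
    ]
    return distances.index(min(distances))

# (income, expenses) for the 4 centroids and for the first 5 rows of the dataset
centroid_coords = [(670487, 288389), (116739, 256787), (26225, 234053), (777572, 146316)]
first_rows = [(210901, 222191), (69983, 48971), (575099, 639499), (53329, 32173), (237539, 222993)]

assignments = [closest_centroid(p, centroid_coords) for p in first_rows]
print(assignments)  # → [1, 2, 0, 2, 1]
```

The result matches the '*cluster*' column of the table, apart from the values being integers rather than floats.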


9. Plot the centroids and the first 5 rows of the dataset again, this time coloring each point by its assigned cluster.

In [0]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses', 
    color='cluster:N',
    tooltip=['Postcode', 'cluster', 'Average total business income', 'Average total business expenses']
).interactive()

chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses', 
    color=alt.value('black'),
    tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()

chart1 + chart2

Awesome! You just re-implemented part of the k-means algorithm from scratch: randomly initialising the centroids (cluster centers), calculating the squared Euclidean distance for some data points, finding their closest centroid, and assigning them to the corresponding cluster. This wasn't easy, but you made it.