## Exercise 4: Find the closest centroids

In this exercise, we will code the first iteration of k-means from scratch in order to assign data points to their closest cluster centroid.



---

1. Open a new Colab notebook and import the required packages (**pandas**, **sklearn**, **altair**):



---




In [0]:
import pandas as pd
from sklearn.cluster import KMeans
import altair as alt 



---

2. Load the dataset and select the same columns as in Exercise 1 using the read_csv() method from the pandas package:


---


In [0]:
file_url = 'https://raw.githubusercontent.com/TrainingByPackt/The-Data-Science-Workshop/master/Chapter05-Perform_Your_First_Cluster_Analysis/data/taxstats2015.csv'
df = pd.read_csv(file_url, usecols=['Postcode', 'Average total business income', 'Average total business expenses'])



---

3. Assign the columns ‘Average total business income’ and ‘Average total business expenses’ to a new variable called X:


---



In [0]:
X = df[['Average total business income', 'Average total business expenses']]



---

4. Now, calculate the minimum and maximum values of the variables 'Average total business income' and 'Average total business expenses' using the min() and max() methods:


---



In [0]:
business_income_min = df['Average total business income'].min()
business_income_max = df['Average total business income'].max()
 
business_expenses_min = df['Average total business expenses'].min()
business_expenses_max = df['Average total business expenses'].max()




---

5. Print the values of these 4 variables:


---



Expected output:

![Figure 39 - Minimum and maximum values of the 2 variables](https://docs.google.com/uc?export=download&id=1ztKDJqqbp7m_pcKWXXfroTngQqN_n3Cq)


In [5]:
print(business_income_min)
print(business_income_max)
print(business_expenses_min)
print(business_expenses_max)

0
876324
0
884659




---


6. Import the random package and use the method seed() to set a seed of 42

---

In [0]:
import random
random.seed(42)



---

7. Create an empty pandas dataframe and assign it to a variable called ‘centroids’



---



In [0]:
centroids = pd.DataFrame()



---

8. Generate 4 random values using the sample() method from the random package, with possible values between the minimum and maximum of the column ‘Average total business income’ (built with range()), and store the results in a new column called ‘Average total business income’ in the dataframe ‘centroids’:


---



In [0]:
centroids['Average total business income'] = random.sample(range(business_income_min, business_income_max), 4)



---

9. Repeat the same process to generate 4 random values for ‘Average total business expenses’:


---



In [0]:
centroids['Average total business expenses'] = random.sample(range(business_expenses_min, business_expenses_max), 4)
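
To make the behaviour of `random.sample()` combined with `range()` more concrete, here is a minimal, self-contained sketch. The bounds `low` and `high` are illustrative stand-ins for the minimum and maximum computed earlier, not values read from the dataset. It suggests three things: the draws are distinct integers, `range()` excludes its upper bound, and resetting the seed reproduces the same draws.

```python
import random

# Illustrative bounds (assumption): stand-ins for the column minimum and maximum
low, high = 0, 1000

random.seed(42)
first_draw = random.sample(range(low, high), 4)

# Resetting the seed replays the same sequence of draws
random.seed(42)
second_draw = random.sample(range(low, high), 4)

print(first_draw == second_draw)                  # True: same seed, same values
print(len(set(first_draw)) == 4)                  # True: sampling is without replacement
print(all(low <= v < high for v in first_draw))   # True: 'high' itself is never drawn
```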



---

10. Create a new column called ‘cluster’ in the dataframe ‘centroids’ using the index attribute of the dataframe, and print this dataframe:


---



Expected output:

![Figure 40 - Coordinates of the 4 random centroids](https://docs.google.com/uc?export=download&id=1TXQuyizVTrcB4XCo6PUPJJiMbi97Lz3o)


In [10]:
centroids['cluster'] = centroids.index
centroids

|   | Average total business income | Average total business expenses | cluster |
|---|---|---|---|
| 0 | 670487 | 288389 | 0 |
| 1 | 116739 | 256787 | 1 |
| 2 | 26225 | 234053 | 2 |
| 3 | 777572 | 146316 | 3 |




---

11. Create a scatter plot with the package Altair to display the first 5 rows of the dataframe df (obtained with head()) and save it into a variable called ‘chart1’:


---



In [0]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses', 
    color=alt.value('orange'),
    tooltip=['Postcode', 'Average total business income', 'Average total business expenses']
).interactive()



---

12. Create a second scatter plot using the altair package to display the ‘centroids’ and save it into a variable called ‘chart2’:



---


In [0]:
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses', 
    color=alt.value('black'),
    tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()



---


13. Display the 2 charts together using the Altair syntax: < chart A > + < chart B >

---




In [13]:
chart1 + chart2

Expected output:

![Figure 41 - Scatter plot of the random centroids and the first 5 observations](https://docs.google.com/uc?export=download&id=1WqaoIn5IOhFhCFMDqc7QTnfcQA0vpLcp)




---

14. Define a function that calculates the squared Euclidean distance and returns its value. This function takes the x and y coordinates of a data point and of a centroid:


---


In [0]:
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
  return (data_x - centroid_x)**2 + (data_y - centroid_y)**2
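
The square root of the usual Euclidean distance can be skipped here because it does not change which centroid is closest. The sketch below redefines the function so that it runs on its own, checks it on a 3-4-5 triangle, and then on the first row of the dataset (210901, 222191) against the centroid of cluster 1 (116739, 256787), both taken from the outputs of this exercise. The `candidates` list in the last check is made up purely for illustration.

```python
import math

def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2

# 3-4-5 right triangle: squared distance is 9 + 16
print(squared_euclidean(0, 0, 3, 4))  # 25

# First row of the dataset against the centroid of cluster 1
print(squared_euclidean(210901, 222191, 116739, 256787))  # 10063365460

# Made-up candidates: ranking by squared distance matches ranking by true distance
candidates = [(3, 4), (1, 1), (6, 8)]
by_squared = sorted(candidates, key=lambda c: squared_euclidean(0, 0, c[0], c[1]))
by_true = sorted(candidates, key=lambda c: math.sqrt(squared_euclidean(0, 0, c[0], c[1])))
print(by_squared == by_true)  # True
```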



---

15. Using the at method from the pandas package, extract the x and y coordinates of the first row and save them into 2 variables called ‘data_x’ and ‘data_y’:


---



In [0]:
data_x = df.at[0, 'Average total business income']
data_y = df.at[0, 'Average total business expenses']



---

16. Using a for loop or a list comprehension, calculate the squared Euclidean distance of the first observation (using its coordinates data_x and data_y) to each of the 4 centroids contained in ‘centroids’, save the results into a variable called ‘distances’ and display it:


---



In [16]:
distances = [squared_euclidean(data_x, data_y, centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
distances

[215601466600, 10063365460, 34245932020, 326873037866]

Expected output:

![Figure 42 - Squared Euclidean distances from the random centroids for the first observation](https://docs.google.com/uc?export=download&id=1CLQR4pNQe6vV6KeFZHGuGeeRkS6QPeQR)




---

17. Use the index() method of the list containing the squared Euclidean distances to find the cluster with the smallest distance:


---



In [0]:
cluster_index = distances.index(min(distances))

Expected output:

![Figure 43 - First 5 rows of the ATO dataframe with the assigned cluster number for the first row](https://docs.google.com/uc?export=download&id=1K-Nr9XZJsj4YoR_2zTLWWiV3X8_CkU05)
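
As a quick check of this step, the sketch below applies the same expression to the four squared distances printed in step 16. The `min(range(...), key=...)` variant is an alternative added here for comparison and is not part of the exercise; it finds the position in a single pass instead of scanning the list twice.

```python
# Squared distances of the first observation, copied from the step 16 output
distances = [215601466600, 10063365460, 34245932020, 326873037866]

# Position of the smallest value = index of the closest centroid
cluster_index = distances.index(min(distances))
print(cluster_index)  # 1

# Alternative (not in the exercise): single pass over the indices
cluster_index_alt = min(range(len(distances)), key=distances.__getitem__)
print(cluster_index_alt)  # 1
```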




---

18. Save the cluster index for the first observation into a new column called ‘cluster’ in the ‘df’ dataframe using the at method from the pandas package:


---



In [0]:
df.at[0, 'cluster'] = cluster_index



---

19. Display the first 5 rows of ‘df’ using the method head() from pandas package:


---



In [19]:
df.head()

|   | Postcode | Average total business income | Average total business expenses | cluster |
|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 1.0 |
| 1 | 2006 | 69983 | 48971 | NaN |
| 2 | 2007 | 575099 | 639499 | NaN |
| 3 | 2008 | 53329 | 32173 | NaN |
| 4 | 2009 | 237539 | 222993 | NaN |




---

20. Repeat steps 15 to 18 for the next 4 rows to calculate their distances from the centroids and assign each row to the cluster with the smallest distance:


---



Expected output:

![Figure 44 - First 5 rows of the ATO dataframe and their assigned cluster](https://docs.google.com/uc?export=download&id=1bv-Ezo7MsljNqAcA7SYuBak2lF_RBSaf)



In [20]:
# Assign rows 1 to 4 to their closest centroid, one row at a time
for row in range(1, 5):
    distances = [squared_euclidean(df.at[row, 'Average total business income'], df.at[row, 'Average total business expenses'], centroids.at[i, 'Average total business income'], centroids.at[i, 'Average total business expenses']) for i in range(4)]
    df.at[row, 'cluster'] = distances.index(min(distances))

df.head()


|   | Postcode | Average total business income | Average total business expenses | cluster |
|---|---|---|---|---|
| 0 | 2000 | 210901 | 222191 | 1.0 |
| 1 | 2006 | 69983 | 48971 | 2.0 |
| 2 | 2007 | 575099 | 639499 | 0.0 |
| 3 | 2008 | 53329 | 32173 | 2.0 |
| 4 | 2009 | 237539 | 222993 | 1.0 |



---

21. Plot the centroids and the first 5 rows of the dataset using the altair package as in steps 11 to 13, this time colouring the data points by their assigned cluster:


---



Expected output:

![Figure 45 - Scatter plot of the random centroids and the first 5 observations](https://docs.google.com/uc?export=download&id=1tv13Nr_osgkr_VlvAe6MR975ivgv7MkV)


In [21]:
chart1 = alt.Chart(df.head()).mark_circle().encode(x='Average total business income', y='Average total business expenses', 
    color='cluster:N',
    tooltip=['Postcode', 'cluster', 'Average total business income', 'Average total business expenses']
).interactive()
 
chart2 = alt.Chart(centroids).mark_circle(size=100).encode(x='Average total business income', y='Average total business expenses', 
    color=alt.value('black'),
    tooltip=['cluster', 'Average total business income', 'Average total business expenses']
).interactive()

chart1 + chart2

In this final result, we can see where the 4 centroids have been placed in the graph and which cluster each of the 5 data points has been assigned to:
*   The 2 data points in the bottom-left corner have been assigned to cluster 2, whose centroid has coordinates of about 26,000 (average total business income) and 234,000 (average total business expenses). It is the closest centroid for these 2 points.
*   The 2 observations in the middle are very close to the centroid with coordinates of about 116,000 (average total business income) and 256,000 (average total business expenses), which corresponds to cluster 1.
*   The observation at the top has been assigned to cluster 0, whose centroid has coordinates of about 670,000 (average total business income) and 288,000 (average total business expenses).

Awesome! You just re-implemented a big part of the k-means algorithm from scratch. You went through how to randomly initialise the centroids (cluster centers), calculate the squared Euclidean distance for some data points, find their closest centroid and assign them to the corresponding cluster. This wasn't easy, but you made it.
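
The assignment logic summarised above can be condensed into one small helper. This is a minimal sketch in plain Python rather than pandas: the function name `assign_clusters` is hypothetical, and the points and centroids are copied from the tables shown earlier in this exercise.

```python
def squared_euclidean(data_x, data_y, centroid_x, centroid_y):
    return (data_x - centroid_x)**2 + (data_y - centroid_y)**2

def assign_clusters(points, centroids):
    """Return, for each (x, y) point, the index of its closest centroid."""
    assignments = []
    for x, y in points:
        distances = [squared_euclidean(x, y, cx, cy) for cx, cy in centroids]
        assignments.append(distances.index(min(distances)))
    return assignments

# (income, expenses) of the first 5 rows, copied from the df.head() output
points = [(210901, 222191), (69983, 48971), (575099, 639499),
          (53329, 32173), (237539, 222993)]

# Random centroids, copied from the 'centroids' dataframe output
centroids = [(670487, 288389), (116739, 256787), (26225, 234053), (777572, 146316)]

print(assign_clusters(points, centroids))  # [1, 2, 0, 2, 1]
```

The printed list matches the ‘cluster’ column obtained for the first 5 rows in step 20.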