# Clustering Workshop – Student Notebook

Use this notebook to take notes and complete the challenges in your clustering workshop.

In this workshop, we'll be looking at demographic data to identify different clusters of regions within major world cities. 

The goal is to identify naturally occurring groups within this demographic data. As a result, we should be able to better understand which parts of a city are similar and which are not.

The figure below shows the result of a similar analysis on demographic data for London. Parts of the city that are most similar to each other share the same colour.

<img src="https://assets.decoded.com/managed/c6b02ff5-3899-40e5-8888-ad845beee77e_fxxqkyv_sev6lfd4.jpg" alt="London map" width="500"/>


## Import libraries

In [2]:
# Use this code block to import libraries as you complete this notebook.
# Be sure to leave a comment above each that explains what the library does and why it is required.

## About the data

For this workshop, we'll be using the dataset available at decd.co/da-resources in the clustering section. 

In this spreadsheet, you will find demographic data for four major world cities: London, Hong Kong, Amsterdam and New York. 

Each city is broken into several administrative districts. (London, for example, is made up of 33 boroughs, while Hong Kong is divided into 18 districts.) 

For each area in a city, you will find several different demographic features. These include things like population density, average salary, and average income.

## Sourcing the data

Before continuing, take a moment to choose which of the four cities you'd like to study for the rest of this workshop.

In [3]:
# Challenge 1: Load the data for your chosen city into this notebook using pandas
# Challenge 2: Check your data has imported correctly using .head()
# Challenge 3: Add a text block above this code block explaining this section's purpose and what this code does

## Exploring and transforming the data

It's important to spend time understanding your data before doing your analysis. Take a few moments to look through the data you've imported using some of the functions built into `pandas`.

In [4]:
# Challenge 4: Explore your data using the following functions: .head(), .tail(), .info(), and .dtypes
# Challenge 5: Search online for another function you can use to explore your data
# Challenge 6: Add a text block beneath this code block describing what you've found in this data

### Visualising the data

In [7]:
# Challenge 7: Import the seaborn library into this notebook
# Challenge 8: Create a scatter plot of two variables in your dataset using seaborn (support resource below)
# Challenge 9: Add a text block beneath this code block describing what your visualisation shows
# Challenge 10: (Optional) Try creating another visualisation using seaborn and describe what it shows

(See [Seaborn's official scatterplot documentation](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) for support on Challenge 8.)

### Scaling the data

Before running any clustering analysis, it is important to [_scale_ your data](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e).

Scaling is straightforward with the [`scale` function](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) from [scikit-learn](https://scikit-learn.org/stable/index.html). By default, it centres each feature on zero and rescales it to unit variance.

> scikit-learn (or sklearn) is the library most commonly used for machine learning in Python. It also has great documentation. Check out its [resources for clustering](https://scikit-learn.org/stable/modules/clustering.html#k-means).

from sklearn.preprocessing import scale
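
To see why scaling matters, it can help to do the calculation by hand. k-means works with distances, so a feature measured in the tens of thousands (such as income) would otherwise outweigh one measured in much smaller units. The sketch below is a plain-Python illustration of standardisation, not scikit-learn's own code. The `standardise` helper and the district figures are hypothetical, and it assumes each feature has at least two distinct values.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


# Hypothetical figures for five districts (illustrative only)
population_density = [4500.0, 12000.0, 8300.0, 15200.0, 6100.0]
average_income = [31000.0, 52000.0, 44000.0, 38000.0, 61000.0]

density_scaled = standardise(population_density)
income_scaled = standardise(average_income)

# Both features now sit on the same scale, so neither dominates a distance calculation
print([round(v, 2) for v in density_scaled])
print([round(v, 2) for v in income_scaled])
```

After scaling, both lists have a mean of 0 and a standard deviation of 1, so they contribute comparably to any distance calculation.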

In [None]:
# Challenge 11: Import the scale function from sklearn.preprocessing in your libraries section
# Challenge 12: Uncomment and alter the code below to scale two features from your dataset

In [None]:
# data_scaled = scale(df[['feature_1','feature_2']])

## Modelling with k-means

To run the k-means algorithm, you can use the `KMeans` class from scikit-learn.
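
Before using the library, it may help to see the basic idea behind k-means: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its points, until nothing changes. The sketch below is a minimal plain-Python version for intuition only. It is not scikit-learn's implementation, which (among other things) picks starting centroids more carefully. The function names, the sample points and the choice of starting centroids are all hypothetical.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            # An empty cluster keeps its previous centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def simple_kmeans(points, initial_centroids, max_iter=100):
    """Alternate the assign and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids


# Two visibly separate groups of (already scaled) hypothetical points
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels, centroids = simple_kmeans(points, initial_centroids=[points[0], points[3]])

print(labels)
print(centroids)
```

The first three points end up in one cluster and the last three in the other. The number of starting centroids plays the role of `n_clusters`.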

In [19]:
# Challenge 13: Familiarise yourself with the KMeans scikit-learn documentation at: 
# https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
# Challenge 14: Create the KMeans model object using the commented code below (choose a reasonable value for 
# n_clusters based on your visualisation above)

In [None]:
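# from sklearn.cluster import KMeans  # add this import to your libraries section first
# model = KMeans(n_clusters=X, random_state=123)  # replace X with your chosen number of clusters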

In [None]:
# Challenge 15: Fit the model using the code below
model.fit(data_scaled)

In [None]:
# Challenge 16: Explore the labels in your model using the code below
model.labels_

In [None]:
# Challenge 17: Add the labels to your original dataframe using a version of the code below
# df["cluster"] = model.labels_.astype(int)

In [None]:
# Challenge 18: View your original dataframe and notice how it has changed

In [None]:
# Challenge 19: Copy your scatterplot from above and add hue = df['cluster'] to your list of arguments to visualise
# your clusters

## Evaluating the results

There are two ways we can evaluate the results of our analysis:
1. Statistical methods
2. Domain knowledge

Statistical methods can give us a numeric understanding about the quality of our clusters. Our domain knowledge can help to interpret whether there are other factors that may not necessarily be explained by the numbers.

### Within-cluster sum of squares
A common statistical measure of cluster quality is the within-cluster sum of squares (WSS). This is simply the sum of the squared distances between each observation and its cluster centroid.

The idea is that, in general, a cluster that has a small sum of squares is more compact than a cluster that has a large sum of squares. Clusters that have higher values exhibit greater variability of the observations within the cluster.
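
As a worked illustration of the definition above, the sketch below computes the within-cluster sum of squares by hand for four made-up points. The helper name, points, labels and centroids are all hypothetical. It also computes the value for a single cluster containing every point, to show how the number changes with the number of clusters.

```python
def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum the squared distance from each point to its own cluster's centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Hypothetical data: two tight pairs of points
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]

# Two clusters, each centroid being the mean of its pair
wss_two = within_cluster_sum_of_squares(
    points, labels=[0, 0, 1, 1], centroids=[(1.0, 2.0), (9.0, 8.0)]
)

# One cluster, whose centroid is the mean of all four points
wss_one = within_cluster_sum_of_squares(
    points, labels=[0, 0, 0, 0], centroids=[(5.0, 5.0)]
)

print(wss_two)  # 4.0
print(wss_one)  # 104.0
```

With two clusters, every point is exactly one unit from its centroid, giving 4.0. With a single cluster the total rises to 104.0, because every point is now far from the shared centroid.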

In [None]:
# Challenge 20: Use the .inertia_ attribute on your model to find the WSS value for your analysis
# Challenge 21: Try using different values for k (n_clusters) in your algorithm above to see if you can reduce 
# this number

### Elbow method
You can use the within-cluster sum of squares to establish what the ideal number of clusters is.

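It's helpful to start your analysis by plotting this value for different values of _k_ on a graph. The value will generally fall as _k_ increases, but you will often see a kink (or elbow) in the chart, beyond which adding more clusters makes little difference. The value of _k_ at the elbow is usually a good choice.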
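
One rough way to read such a chart numerically is to look at how much the within-cluster sum of squares falls each time _k_ goes up by one. The sketch below does this for a set of invented values. The numbers are hypothetical and chosen only to make the pattern easy to see, and this is an informal aid rather than a formal rule for choosing _k_.

```python
# Hypothetical WSS values for k = 1 to 6 (illustrative only, not from the workshop data)
wss_by_k = {1: 200.0, 2: 90.0, 3: 30.0, 4: 24.0, 5: 20.0, 6: 17.0}

# For each k, work out how much the WSS fell compared with k - 1
ks = sorted(wss_by_k)
drops = {k: wss_by_k[k_prev] - wss_by_k[k] for k_prev, k in zip(ks, ks[1:])}

for k, drop in drops.items():
    print(f"k={k}: WSS fell by {drop:.1f}")
```

In this made-up example the drops are large up to k = 3 and small afterwards, which is the pattern an elbow at k = 3 would produce on a chart.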

In [20]:
# Challenge 22: Run the code below to identify the optimum number of clusters for this dataset and your chosen features.

In [None]:
# Idea: perform k-means clustering for various k, and compute the WSS each time

# create a list of the different values of k to test. Could also use: list(range(1, 11))
num_clusters = [1,2,3,4,5,6,7,8,9,10]

# create a kmeans model for each value of k. Could use a regular for loop, but let's use a "list comprehension"!
kmeans_list = [KMeans(n_clusters = i) for i in num_clusters]

# For each value of k, fit the model with our data and use the "inertia_" attribute of KMeans to get the WSS
scores = [kmeans_list[i-1].fit(data_scaled).inertia_ for i in num_clusters]

# Optional 
# We can choose to normalise the scores with respect to the score for k=1 (the highest score)
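# (scores is a plain list, so divide element by element rather than dividing the list itself)
scores_normalised = [score / scores[0] for score in scores]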

In [None]:
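# matplotlib is needed for the title and axis labels (seaborn is assumed to be imported as sns already)
import matplotlib.pyplot as plt

# We can choose to set a grid
sns.set_style('darkgrid')

# Use the lineplot function from seaborn, passing x and y as keyword arguments
sns.lineplot(x=num_clusters, y=scores_normalised)

# Add a title and axis labels
plt.xlabel("Number of clusters k")
plt.ylabel("Score (normalised)")
plt.title("Elbow test")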

## Conclusion

Now that you have your results, it's important to summarise them in a way that people will easily understand.

In [None]:
# Challenge 23: Spend some time writing up a summary of your process and results. Write it as you would if you were
# going to send it to a senior leadership team.

Another helpful way to share your results is with a compelling visualisation. The code below shows how you could create one for London. This step is optional.

In [None]:
# Optional: Run the following code to plot your clusters on a map.
# You will need to download the relevant shapefiles from decd.co/da-resources 
# and adjust the code as needed. The example below is for London.

# Import the relevant libraries
import matplotlib.pyplot as plt
import geopandas as gpd
%matplotlib inline

# set the filepath and load in a shapefile
fp = 'statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shp'
map_df = gpd.read_file(fp)

# preview the data; note that this is not a normal dataframe, but a GeoDataFrame
map_df.head()

In [None]:
# Make sure that your plot works
map_df.plot()

In [None]:
# Merge your existing dataset with the one you've just created
merged = map_df.set_index('NAME').join(london_df.set_index('Borough'))
merged.head()

In [None]:
# Set a variable that will call whatever column we want to visualise on the map
variable = 'cluster'
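# Set the range for the choropleth (only relevant for a continuous variable;
# these values are not passed to the plot below, so they have no effect on cluster labels)
vmin, vmax = 120, 220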
# Create a figure and axes for Matplotlib
fig, ax = plt.subplots(1, figsize=(10, 6))
# Plot the figure
merged.plot(variable, cmap='Blues', linewidth=0.8, ax=ax, edgecolor='0.8')
# Turn off the axes
ax.axis('off')

In [None]:
# add a title
ax.set_title('Similar boroughs in London', fontdict={'fontsize': '25', 'fontweight' : '3'})
# create an annotation for the data source
ax.annotate('Source: London Datastore',xy=(0.1, .08),  xycoords='figure fraction', horizontalalignment='left', verticalalignment='top', fontsize=12, color='#555555')

In [None]:
# Save the figure as a .png file
fig.savefig('london_clusters.png', dpi=300)