# Clustering template for Seaborn example datasets

## Task

We are going to build a k-means clustering model and evaluate its performance. For clustering we need several independent variables. These are the variables we are going to use to identify the clusters. For clustering all of these need to be numerical and at least some need to be continuous.

For a clustering model to provide useful groups, we need our independent variables to:

- Be independent of each other. If the independent variables are highly correlated then we only need to use one of them to get the same information.
- Have a similar scale. K-means groups points by distance, so a variable with much smaller values, or a much smaller range of values, than the others will have little influence on the clusters.
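To see why scale matters, here is a minimal sketch using two made-up data points (a height in metres and an income in pounds). The values and the assumed minimum and maximum of each variable are purely illustrative and do not come from any of the example datasets.

```python
import numpy as np

# Two hypothetical people: [height in metres, income in pounds]
a = np.array([1.60, 30_000.0])
b = np.array([1.90, 30_500.0])

# Distance on the raw values: the income difference (500) swamps
# the height difference (0.3)
raw_distance = np.linalg.norm(a - b)

# Assumed minimum and maximum of each variable, used to rescale to 0-1
assumed_min = np.array([1.40, 10_000.0])
assumed_max = np.array([2.00, 100_000.0])
a_scaled = (a - assumed_min) / (assumed_max - assumed_min)
b_scaled = (b - assumed_min) / (assumed_max - assumed_min)

# Distance after rescaling: now the height difference matters most
scaled_distance = np.linalg.norm(a_scaled - b_scaled)

print(raw_distance, scaled_distance)
```

On the raw values the distance is almost exactly the income difference, so height is effectively ignored. After rescaling, both variables can contribute.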

<a id='Contents'></a>
## Contents
In this notebook, we will:

- [Import](#import) packages and load in some data 
- [Prepare](#prepare) the data so we can explore it
- [Explore](#explore) the data and make our testable hypothesis
- [Build](#build) the model 
- [Interpret](#interpret) the model results

<div class="alert alert-block alert-warning">
<b>Reminder:</b> <br>
You don't need to understand all the code here for now, just look for:
<ul>
<li> What we are trying to do with each code cell. How does it fit in our objective?
<li> What do the outputs of each code cell tell us? Are we reading too much into the results of each code cell?
<li> Some of the code cells will have parts that you will need to change to fit your data. You will be told what to change in the comment before the code cell.
</ul>
</div>

<a id="import"></a>
## Import packages and read data
[Back to Contents](#Contents)

Let's start by importing the Python packages we will need:
- [**pandas**](https://pandas.pydata.org/): a tabular data manipulation package
- [**seaborn**](https://seaborn.pydata.org/): a data visualisation package
- [**scikit-learn**](https://scikit-learn.org/): a model building package

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

sns.set_style("whitegrid")

We will be using one of these Seaborn [example datasets](https://github.com/mwaskom/seaborn-data):

- `taxis`: New York taxi journeys 
- `titanic`: Records of details of passengers on the Titanic
- `penguins`: Physical details of various penguins
- `iris`: Measurements of different Iris flowers
- `tips`: Restaurant bills and tips

These can be loaded using the Seaborn function [`load_dataset`](https://seaborn.pydata.org/generated/seaborn.load_dataset.html). We can have a look at the first few rows using the method [`head`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html).

In [None]:
data = sns.load_dataset('iris')
data.head()

<a id="prepare"></a>
## Prepare the data
[Back to Contents](#Contents)

For this notebook we are assuming the data has been prepared before being loaded in. However, it is always important to check that you have what you expect.

We can look at what kind of data our table contains using the [`info`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method.

We should be asking ourselves:

- Does the data contain the columns I expect?
- Do the columns have the data type I would expect?
- Do any of the columns have missing values? Are these in any columns we intend to use? There cannot be any null values in the rows we plan to use in our model.

In [None]:
data.info()

If we have any columns with null values that we want to use in our model, we will need to drop those pieces of data. Pandas gives us the [`dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method to do this. If you only need to drop null values from one column you can use the `subset` parameter to pass a list of the columns you wish to be checked for null values.

This piece of code has been commented out, so it will not run. To use this, remove the `#` from the start of the line and replace `'column name'` with the column that you wish to be checked for null values. You can examine more than one column at a time by listing all the columns in the square brackets.

In [None]:
# data = data.dropna(subset=['column name', ])
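As an illustration of how `subset` behaves, here is a small sketch on a made-up table. The column names and values are hypothetical and are only there to show which rows are kept.

```python
import numpy as np
import pandas as pd

# A tiny made-up table with a missing value in each column
example = pd.DataFrame({
    'bill_length_mm': [39.1, np.nan, 40.3],
    'sex': ['Male', 'Female', None],
})

# Count the missing values in each column
print(example.isna().sum())

# Drop only the rows where 'bill_length_mm' is missing;
# the row with a missing 'sex' is kept
cleaned = example.dropna(subset=['bill_length_mm'])
print(cleaned)
```

Only the columns listed in `subset` are checked, so missing values in columns you do not plan to use will not cost you any rows.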

If one of the columns is only useful as a unique row identifier, we can use it as an index. We can set it using the [`set_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) method. Replace `'ID column name'` with the column that you wish to set as the index. If you need this, remove the `#` to uncomment the code.

In [None]:
# data = data.set_index('ID column name')

---
<a id="explore"></a>
## Explore the data
[Back to Contents](#Contents)

Now that we have checked our data is clean, we can explore it. A good starting point is to look at the descriptive statistics and check that they seem reasonable. We can do so using the [`describe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method.

We should be asking ourselves:

- Does the data have unexpected outliers (looking at the max and min values) that might suggest a data quality issue?
- Is the spread of the data what you would expect?

In [None]:
data.describe()

It is helpful to do these checks visually. Histograms will help us see the distribution of a variable and scatter plots will show us the relationship between variables. Seaborn allows us to plot these using the functions [`histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) for histograms and [`scatterplot`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) for scatter plots.

In [None]:
sns.histplot(data=data, x='variable name 1');

In [None]:
sns.scatterplot(data=data, x='variable name 1', y='variable name 2');

It can be helpful to plot all the histograms and scatter plots in one go. Seaborn allows us to do this using a [`pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html).

In [None]:
sns.pairplot(data);

The scatter plots help us to understand what kind of relationships there are between our variables and how strong they are. We can also quantify this using the Pandas method [`corr`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html). Setting `numeric_only=True` tells it to skip any non-numeric columns, such as text labels.

In [None]:
data.corr(numeric_only=True)
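With many variables, the correlation table can be hard to scan by eye. The sketch below, on randomly generated data, lists the pairs of columns whose correlation is stronger than a cut-off. The column names, the data and the `threshold` of 0.8 are assumptions chosen for illustration, not a recommended rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Made-up data: 'b' is almost a rescaled copy of 'a', 'c' is unrelated
example = pd.DataFrame({
    'a': x,
    'b': 2 * x + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
    'label': ['x'] * 200,  # non-numeric column, skipped by numeric_only=True
})

corr = example.corr(numeric_only=True)

threshold = 0.8  # assumed cut-off for "highly correlated"
pairs = []
for i, first in enumerate(corr.columns):
    # Only look at columns after `first` so each pair is listed once
    for second in corr.columns[i + 1:]:
        if abs(corr.loc[first, second]) > threshold:
            pairs.append((first, second))

print(pairs)
```

For any pair listed here, it is usually enough to keep just one of the two variables for clustering.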

### Confirm your hypothesis

Now that you have explored your data and seen the relationships in it, you can pick the independent variables you want to use to build your clusters. Remember, for your independent variables to be useful in identifying the clusters, they need to:

- Be independent of each other.
- Have a similar scale. As this is quite rare, we will need to normalise the data before building the model.

#### Normalising

We will need to keep only the columns we want to use for clustering. We can start by listing all the columns available using the [`columns`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) attribute.

In [None]:
data.columns

Our predictor variables, the independent variables, are likely to be a subset of our data, so we will make a list of the ones we wish to use. Replace and remove the variable names as needed.

In [None]:
independent_variable_names = ['variable name 1', 
                              'variable name 2', 
                              'variable name 3', 
                              'variable name 4', 
                              'variable name 5']

With this list we can make a subset of the data.

In [None]:
# .copy() gives us an independent table that we can safely add columns to later
data_subset = data.loc[:, independent_variable_names].copy()

Now we are ready to normalise the data. We will use the scikit-learn [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) class. This rescales each variable separately so that its minimum becomes 0 and its maximum becomes 1.

In [None]:
normalised_data_table = MinMaxScaler().fit_transform(data_subset[independent_variable_names])

To turn the normalised array into a table we will transform it into a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and copy the column names and index across from our unnormalised DataFrame.

In [None]:
data_normalised = pd.DataFrame(normalised_data_table)
data_normalised.columns = independent_variable_names
data_normalised.index = data_subset.index

We can check what this has done to our data using a pairplot. The shapes of the distributions should be unchanged, but every axis should now run from 0 to 1.

In [None]:
sns.pairplot(data_normalised);
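To make the min-max scaling used above less of a black box, the sketch below reproduces it by hand on a small made-up table: each value has its column minimum subtracted and is then divided by the column range (maximum minus minimum). The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A tiny made-up table with two variables on very different scales
example = pd.DataFrame({
    'length': [2.0, 4.0, 6.0],
    'weight': [100.0, 150.0, 300.0],
})

# Min-max scaling by hand: (value - column minimum) / (column maximum - column minimum)
manual = (example - example.min()) / (example.max() - example.min())

# The same scaling using scikit-learn
scaled = MinMaxScaler().fit_transform(example)

print(manual)
print(np.allclose(manual.to_numpy(), scaled))
```

Both approaches should give the same numbers. The scikit-learn version is convenient because the fitted scaler remembers the minimum and maximum of each column.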

---
<a id="build"></a>
## Build the model
[Back to Contents](#Contents)

Now that our data is in a suitable form, we can build our k-means clustering model. To build it, we will use the scikit-learn [`KMeans`](https://scikit-learn.org/dev/modules/clustering.html#k-means) class.

We will try several possible numbers of clusters and see how the Sum of Squared Errors (SSE) changes. We know that as we increase the number of clusters, and if the centres of the clusters are evenly spread throughout the data, then the SSE will decrease.
To see how quickly it decreases, we will generate a list of possible numbers of clusters from `2` to `14`. Note that `range` excludes its end value, so `range(2, 15)` stops at `14`.

In [None]:
potential_cluster_number = range(2, 15)

Now we can loop through this list, fitting the model and recording the SSE for each, which is called `inertia_` here.

In [None]:
Sum_of_squared_distances = []

for k in potential_cluster_number:
    # random_state fixes the random starting centres so the results are repeatable
    km = KMeans(n_clusters=k, n_init=10, random_state=13).fit(data_normalised)
    Sum_of_squared_distances.append(km.inertia_)
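To show what `inertia_` measures, the sketch below recomputes it by hand: for every point, take the squared distance to the centre of the cluster it was assigned to, then add these up. It uses synthetic data from scikit-learn's `make_blobs` rather than the notebook's dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 150 points scattered around 3 centres
points, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km_example = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Look up the centre of the cluster each point was assigned to
assigned_centres = km_example.cluster_centers_[km_example.labels_]

# Sum of squared distances from each point to its own cluster centre
manual_sse = ((points - assigned_centres) ** 2).sum()

print(manual_sse, km_example.inertia_)
```

The two numbers should agree up to rounding, which confirms that a lower `inertia_` means the points sit closer to their cluster centres.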

We can plot the SSE against the number of clusters to see how it changes.

In [None]:
plt.plot(potential_cluster_number, Sum_of_squared_distances, 'x-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of Squared Errors');
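If the shape of the curve is hard to judge by eye, one rough aid is to tabulate how much the SSE falls each time a cluster is added. The sketch below does this on synthetic data with three well-separated groups. The data, the blob centres and the column names are assumptions for illustration, and the percentage drop is only a heuristic to read alongside the plot.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 points around 3 clearly separated centres
points, _ = make_blobs(n_samples=300,
                       centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.5,
                       random_state=0)

ks = list(range(2, 8))
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
       for k in ks]

elbow_table = pd.DataFrame({'k': ks, 'sse': sse})

# Percentage fall in SSE compared with the previous number of clusters
# (the first row has nothing to compare with, so it is NaN)
elbow_table['percent_drop'] = -elbow_table['sse'].pct_change() * 100

print(elbow_table.round(1))
```

For data like this, the biggest fall would be expected when moving to three clusters, with much smaller falls afterwards. Real data is rarely this clear-cut, so treat the table as a prompt for judgement rather than an answer.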

In the plot above, we are looking for one or more 'elbows' in the change of SSE. Where does it begin to change more slowly? Once you have decided on a number of clusters to use, enter it as the `number_of_clusters` below. The default has been set to `4`.

In [None]:
number_of_clusters = 4
cluster_model = KMeans(n_clusters=number_of_clusters, random_state=13, n_init='auto').fit(data_normalised)

We can store the cluster identified for each datapoint alongside the unnormalised data. The model attribute `labels_` will provide them.

In [None]:
data_subset['clusters'] = cluster_model.labels_

---
<a id="interpret"></a>
## Interpret the model results
[Back to Contents](#Contents)

Now that we have built the model, we can look at the model quality and what it tells us. We need to examine:

- Do any of the clusters represent a meaningful group? It may only be one or two that do.
- Are any of the clusters too small to be useful? This might help us identify where more data is needed.

The first thing to check is how large the clusters are. 

- Are there some clusters that only contain a small number of outliers? 
- Are most of the data points in one cluster?

We can count the number of data points in each cluster by grouping the data in our DataFrame using the [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) and [`count`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.count.html) methods.

In [None]:
data_subset.groupby(['clusters']).count()

We can colour a pairplot with these cluster labels.

In [None]:
sns.pairplot(data_subset, hue='clusters');

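We can also examine the location of the centres of all the clusters by reading them from the model using `cluster_centers_` and putting them into a Pandas DataFrame. Because the model was fitted on the normalised data, these centres are on the normalised 0 to 1 scale rather than in the original units.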

In [None]:
pd.DataFrame(cluster_model.cluster_centers_, columns=independent_variable_names)
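Centres from a model fitted on min-max scaled data are easier to interpret once converted back to the original units. One way to do this is the scaler's `inverse_transform` method, which needs the fitted scaler to be kept in a variable. The notebook above does not keep it, so the sketch below uses its own hypothetical `scaler`, `example` and `model` names on a small made-up table.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# A tiny made-up table with two obvious groups
example = pd.DataFrame({
    'length': [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    'weight': [100.0, 110.0, 95.0, 400.0, 420.0, 390.0],
})

# Keep the fitted scaler so the scaling can be undone later
scaler = MinMaxScaler()
scaled = scaler.fit_transform(example)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Convert the cluster centres from the 0-1 scale back to the original units
centres_original_units = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=example.columns,
)

print(centres_original_units)
```

In original units, each centre is simply the average of the points in that cluster, which makes the clusters easier to describe.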

Lastly, we can use the Seaborn function [`violinplot`](https://seaborn.pydata.org/generated/seaborn.violinplot.html) to compare the distribution of each variable across the clusters.

In [None]:
for col in independent_variable_names:
    fig, ax = plt.subplots()
    sns.violinplot(data=data_subset, 
                   y=col, 
                   x='clusters',
                   inner='quartile',
                   ax=ax);