# Introduction to unsupervised machine learning with scikit-learn

<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons Licence" style="width=50" src="https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png" title='This work is licensed under a Creative Commons Attribution 4.0 International License.' align="right"/></a>

Authors: 
Dr Antonia Mey -- antonia.mey@ed.ac.uk  
Dr Matteo Degiacomi -- matteo.t.degiacomi@durham.ac.uk

Content is partially adapted from the [Software Carpentries Machine learning lesson](https://carpentries-incubator.github.io/machine-learning-novice-sklearn/index.html)
 


## Learning outcomes:
### Questions
How can we use clustering to find data points with similar attributes?
### Objectives
- Identify clusters in data using **k-means** clustering, **DBSCAN** and **spectral clustering**.
- See the limitations of k-means when clusters overlap.
- Use spectral clustering to overcome the limitations of k-means.

**Jupyter cheat sheet**:
- to run the currently highlighted cell, hold <kbd>&#x21E7; Shift</kbd> and press <kbd>&#x23ce; Enter</kbd>;
- to get help for a specific function, place the cursor within the function's brackets, hold <kbd>&#x21E7; Shift</kbd>, and press <kbd>&#x21E5; Tab</kbd>;

## Clustering

Clustering is the grouping of data points which are similar to each other. It can be a powerful technique for identifying patterns in data. Clustering does not require labelled training data, which is why it is known as an unsupervised learning technique. Because no labelled examples have to be prepared first, it can be applied quickly.

## Applications of Clustering

- Looking for trends in data
- Data compression: all data clustered around a point can be reduced to just that point, for example when reducing the colour depth of an image.
- Pattern recognition

## K-means Clustering

The K-means clustering algorithm is a simple clustering algorithm that tries to identify the centre of each cluster. It does this by searching for centres that minimise the sum of squared distances between each centre and the points assigned to it. The algorithm needs to be told how many clusters to look for; a common technique is to try several different numbers of clusters and use other tests to decide which number fits the data best.
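To make the idea concrete, here is a minimal sketch of the loop that K-means is usually described with: assign each point to its nearest centre, then move each centre to the mean of its points. The function name `kmeans_sketch`, its parameters and the toy points are all made up for illustration; this is not how scikit-learn implements the algorithm.

```python
import numpy as np

def kmeans_sketch(points, n_clusters, n_iter=10, seed=0):
    """Illustrative K-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct data points chosen at random
    centres = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centre
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centre to the mean of the points assigned to it
        for k in range(n_clusters):
            if np.any(labels == k):
                centres[k] = points[labels == k].mean(axis=0)
    return centres, labels

# Four toy points forming two obvious groups
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_centres, toy_labels = kmeans_sketch(toy_points, n_clusters=2)
print(toy_centres)
print(toy_labels)
```

The cluster numbers themselves are arbitrary: which group ends up labelled 0 and which 1 depends on the random starting centres.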

## K-means with Scikit Learn

To perform k-means clustering with scikit-learn, we first need to import the `sklearn.cluster` module.

In [None]:
import sklearn.cluster as skl_cluster

Now let’s create some random blobs using the `make_blobs` function from `sklearn.datasets`. Its arguments are:

- `n_samples` sets how many points we want in total across all of our blobs.
- `cluster_std` sets the standard deviation of the points; the smaller this value, the closer together they will be.
- `centers` sets how many clusters we’d like.
- `random_state` is the initial state of the random number generator. By specifying it we’ll get the same results every time we run the program; if we don’t, we’ll get different points every time.

This function returns two things: an array of data points and an array recording which cluster each point belongs to.

In [None]:
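# make_blobs lives in sklearn.datasets, which has not been imported yet
import sklearn.datasets as skl_datasets

data, cluster_id = skl_datasets.make_blobs(n_samples=400, cluster_std=0.75, centers=4, random_state=1)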
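A quick way to see what `make_blobs` returned is to look at the shapes of the two arrays and at the distinct labels. This is only a small inspection sketch; it repeats the call so that it can be run on its own.

```python
import numpy as np
import sklearn.datasets as skl_datasets

data, cluster_id = skl_datasets.make_blobs(n_samples=400, cluster_std=0.75, centers=4, random_state=1)

print(data.shape)             # one row per point, one column per coordinate
print(cluster_id.shape)       # one label per point
print(np.unique(cluster_id))  # the distinct cluster labels
```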

Now that we have some data we can go ahead and try to identify the clusters using K-means. First, we need to create a `KMeans` object and tell it how many clusters to look for. Next, we supply it with some data via the `fit` function, in much the same way as we did with the regression functions earlier on. Finally, we run the `predict` function to find which cluster each point belongs to.

In [None]:
Kmean = skl_cluster.KMeans(n_clusters=4)
Kmean.fit(data)
clusters = Kmean.predict(data)
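Because `fit` and `predict` are separate steps, a fitted model can also assign points it has never seen to the nearest cluster centre. The sketch below assumes two made-up points (`new_points`) purely for illustration.

```python
import numpy as np
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

data, cluster_id = skl_datasets.make_blobs(n_samples=400, cluster_std=0.75, centers=4, random_state=1)

Kmean = skl_cluster.KMeans(n_clusters=4)
Kmean.fit(data)

# Two made-up points that were not part of the data passed to fit
new_points = np.array([[0.0, 0.0], [-8.0, -8.0]])
new_clusters = Kmean.predict(new_points)

print(new_clusters)                  # one cluster label per new point
print(Kmean.cluster_centers_.shape)  # one row per cluster centre
```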

The data can now be plotted to show all the points we randomly generated. To make it clearer which cluster each point has been assigned to, we can set the colours (the `c` parameter) using the `clusters` array returned by the `predict` function. The K-means algorithm also tells us where it placed the centre of each cluster. These are stored as an array called `cluster_centers_` inside the `Kmean` object. Let’s go ahead and plot the points, colouring them by the output of the K-means algorithm, and also plot the centre of each cluster as a red X.

In [None]:
import matplotlib.pyplot as plt
plt.scatter(data[:, 0], data[:, 1], s=5, linewidth=0, c=clusters)
for cluster_x, cluster_y in Kmean.cluster_centers_:
    plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
plt.show()

In [None]:
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets
import matplotlib.pyplot as plt

data, cluster_id = skl_datasets.make_blobs(n_samples=400, cluster_std=0.75, centers=4, random_state=1)

Kmean = skl_cluster.KMeans(n_clusters=4)
Kmean.fit(data)
clusters = Kmean.predict(data)

plt.scatter(data[:, 0], data[:, 1], s=5, linewidth=0, c=clusters)
for cluster_x, cluster_y in Kmean.cluster_centers_:
    plt.scatter(cluster_x, cluster_y, s=100, c='r', marker='x')
plt.show()

<div class="alert alert-info">
<b>Working in multiple dimensions:</b>
Although this example shows two dimensions, the K-means algorithm can work in more than two; it just becomes very difficult to show this visually once we get beyond 3 dimensions. It’s very common in machine learning to work with many variables, so our algorithms often operate in multi-dimensional spaces.
</div>
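As a small illustration of the point about dimensions, the same calls work unchanged on higher-dimensional data. The sketch below assumes five features, set through the `n_features` argument of `make_blobs`; the variable names ending in `_5d` are made up for this example.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

# n_features=5 asks make_blobs for five-dimensional points
data_5d, cluster_id_5d = skl_datasets.make_blobs(n_samples=400, n_features=5, cluster_std=0.75, centers=4, random_state=1)

Kmean_5d = skl_cluster.KMeans(n_clusters=4)
Kmean_5d.fit(data_5d)
clusters_5d = Kmean_5d.predict(data_5d)

print(data_5d.shape)                    # 400 points, 5 coordinates each
print(Kmean_5d.cluster_centers_.shape)  # 4 centres, 5 coordinates each
```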


In [None]:
# ================================
# First task section
# ================================

<div class="alert alert-success">
<b>Task 1: Discuss: </b> </div>
    What are the limitations and advantages of K-Means?


<details>
<summary> <mark> Solution: Suggested limitations and advantages</mark> </summary>

Limitations:
- Requires number of clusters to be known in advance
- Struggles when clusters have irregular shapes
- Will always produce an answer with the requested number of clusters, even if the data isn’t clustered (or isn’t clustered into that many groups).
- Requires linear cluster boundaries

Advantages:
- Simple algorithm, fast to compute. A good choice as the first thing to try when attempting to cluster data.
- Suitable for large datasets due to its low memory and computing requirements.

</details>

<div class="alert alert-success">
<b>Task 2: K-means with overlapping clusters </b> </div>
    Adjust the program above to increase the standard deviation of the blobs (the cluster_std parameter to make_blobs) and increase the number of samples (n_samples) to 4000. You should start to see the clusters overlapping. Do the clusters that are identified make sense? Is there any strange behaviour from this?


<details>
<summary> <mark> Solution: Try it yourself</mark> </summary>

```Python
   a = b 
```
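One possible way to put a number on what the plots show is to compare the K-means labels with the labels that `make_blobs` generated, using the adjusted Rand index from `sklearn.metrics`. This is a sketch rather than a model answer: the three `cluster_std` values are illustrative choices, and `n_init` and `random_state` are fixed only so that the run is repeatable.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets
import sklearn.metrics as skl_metrics

scores = {}
for cluster_std in (0.75, 1.5, 3.0):  # illustrative values, not prescribed by the task
    data, cluster_id = skl_datasets.make_blobs(n_samples=4000, cluster_std=cluster_std, centers=4, random_state=1)
    # n_init and random_state are fixed only to make this sketch repeatable
    Kmean = skl_cluster.KMeans(n_clusters=4, n_init=10, random_state=0)
    clusters = Kmean.fit_predict(data)
    # A score of 1.0 means the K-means clusters match the generated blobs exactly
    scores[cluster_std] = skl_metrics.adjusted_rand_score(cluster_id, clusters)
    print(cluster_std, round(scores[cluster_std], 2))
```

As the blobs spread out and overlap, K-means still returns four clusters, but they agree less and less with the blobs the points were drawn from.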

</details>

<div class="alert alert-success">
<b>Task 3: How many clusters should we look for? </b> </div>
K-means needs to be told how many clusters to look for, so a common strategy is to vary the number of clusters. Modify the program above to loop over several values of `n_clusters`, for example from 2 to 10. Which (if any) of the results look sensible? What criteria might you use to select the best one?



<details>
<summary> <mark> Solution: Try it yourself</mark> </summary>

```Python
   a = b 
```
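A possible starting point is to loop over candidate cluster counts and record a score for each. The sketch below assumes the range 2 to 10 and two common criteria: the `inertia_` attribute of a fitted `KMeans` object (the sum of squared distances to the nearest centre) and `silhouette_score` from `sklearn.metrics`. Both are suggestions rather than the only reasonable choices.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets
import sklearn.metrics as skl_metrics

data, cluster_id = skl_datasets.make_blobs(n_samples=400, cluster_std=0.75, centers=4, random_state=1)

inertias = {}
silhouettes = {}
for n_clusters in range(2, 11):
    # n_init and random_state are fixed only to make this sketch repeatable
    Kmean = skl_cluster.KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    clusters = Kmean.fit_predict(data)
    inertias[n_clusters] = Kmean.inertia_  # lower means tighter clusters
    silhouettes[n_clusters] = skl_metrics.silhouette_score(data, clusters)  # higher means better separated
    print(n_clusters, round(inertias[n_clusters], 1), round(silhouettes[n_clusters], 3))
```

Inertia tends to keep falling as more clusters are added, so people usually look for the point where the improvement levels off rather than for the minimum.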

</details>

In [None]:
## intro to DB scan

In [None]:
## Use DB scan for ring dataset?
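As a hedged sketch of what this cell could contain, the example below runs scikit-learn's `DBSCAN` on two concentric rings. It assumes `make_circles` as a stand-in for the ring dataset, and the `factor`, `noise`, `eps` and `min_samples` values are illustrative guesses for this toy data rather than recommended settings.

```python
import numpy as np
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets

# make_circles stands in for the ring dataset; factor and noise are illustrative choices
rings, ring_id = skl_datasets.make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)

# eps and min_samples are assumed values chosen for this toy data
dbscan = skl_cluster.DBSCAN(eps=0.15, min_samples=5)
ring_clusters = dbscan.fit_predict(rings)

# DBSCAN labels noise points as -1, so exclude that label when counting clusters
n_found = len(set(ring_clusters) - {-1})
print("clusters found:", n_found)
print("noise points:", int(np.sum(ring_clusters == -1)))
```

Unlike K-means, DBSCAN is not told how many clusters to find; it grows clusters from densely packed points, so `eps` (the neighbourhood radius) is the parameter that usually needs tuning.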

In [None]:
## Spectral clustering?
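A minimal sketch for this cell could compare K-means and spectral clustering on concentric circles, scoring each against the generated labels with the adjusted Rand index. The dataset parameters and the choice of `affinity="nearest_neighbors"` are assumptions made for this illustration; scikit-learn may warn that the neighbour graph is not fully connected, which is expected when the two rings are well separated.

```python
import sklearn.cluster as skl_cluster
import sklearn.datasets as skl_datasets
import sklearn.metrics as skl_metrics

# Two concentric circles; factor and noise are illustrative choices
circles, circle_id = skl_datasets.make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)

# n_init and random_state are fixed only to make this sketch repeatable
kmeans_clusters = skl_cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(circles)

spectral = skl_cluster.SpectralClustering(n_clusters=2, affinity="nearest_neighbors", assign_labels="kmeans", random_state=0)
spectral_clusters = spectral.fit_predict(circles)

# A score of 1.0 means perfect agreement with the generated labels; values near 0 mean chance level
ari_kmeans = skl_metrics.adjusted_rand_score(circle_id, kmeans_clusters)
ari_spectral = skl_metrics.adjusted_rand_score(circle_id, spectral_clusters)
print(f"K-means ARI:  {ari_kmeans:.2f}")
print(f"Spectral ARI: {ari_spectral:.2f}")
```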

In [None]:
## Introduce Alanine dipeptide dataset

In [None]:
## What happens when we cluster ADP with k-means, DB scan or spectral clustering?

<div class="alert alert-info">
<b>Key points:</b></div>   

- Clustering is a form of unsupervised learning.
- Unsupervised learning algorithms don’t need labelled training data.
- K-means is a popular clustering algorithm.
- K-means struggles where one cluster exists within another, such as concentric circles.
- Spectral clustering is another technique which can overcome some of the limitations of K-means.
- Spectral clustering is much slower than K-means.
- As well as providing machine learning algorithms, scikit-learn also has functions to make example data.


## Next Notebook

[Getting started with Python](Session_1.2.ipynb)