# Clustering for dataset exploration

Learn how to discover the underlying groups (or "clusters") in a dataset. By the end of this chapter, you'll be clustering companies using their stock market prices, and distinguishing different species by clustering their measurements. 

# (1) Unsupervised Learning

## Unsupervised Learning
- Unsupervised learning finds patterns in data
- E.g., clustering customers by their purchases
- Compressing the data using purchase patterns (dimension reduction)

## Supervised vs unsupervised learning
- Supervised learning finds patterns for a prediction task
- E.g., classify tumors as benign or cancerous (labels)
- Unsupervised learning finds patterns in data
- ... but without a specific prediction task in mind

## Iris dataset
- Measurements of many iris plants
- Three species of iris:
    - setosa
    - versicolor
    - virginica
- Petal length, petal width, sepal length, sepal width (the features of the dataset)

<p align='center'>
    <img src='image/Screenshot 2021-02-18 000035.png'>
</p>

## Arrays, features & samples
- 2D NumPy array
- Columns are measurements (the features)
- Rows represent iris plants (the samples)
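
A minimal sketch of this layout, using a small made-up array rather than the real iris data. The name `toy_samples` and its values are hypothetical, and the column order (sepal length, sepal width, petal length, petal width) is an assumption chosen to match the column indices used in the scatter plot code later in these notes.

```python
import numpy as np

# Hypothetical measurements for three plants (made-up values, illustration only).
# Assumed column order: sepal length, sepal width, petal length, petal width
toy_samples = np.array([
    [5.0, 3.3, 1.4, 0.2],
    [6.4, 2.8, 5.6, 2.2],
    [5.7, 2.8, 4.1, 1.3],
])

# Rows are samples, columns are features
n_samples, n_features = toy_samples.shape
print(n_samples)   # number of plants
print(n_features)  # number of measurements per plant (the dimension)

first_plant = toy_samples[0]       # one row: every measurement of one plant
petal_lengths = toy_samples[:, 2]  # one column: one measurement for every plant
```

Indexing a row selects a sample, while slicing with `[:, j]` selects feature `j` across all samples.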

## Iris data is 4-dimensional
- Iris samples are points in 4 dimensional space
- Dimension = number of features
- Dimension too high to visualize!
- ... but unsupervised learning gives insight

## k-means clustering
- Find clusters of samples
- Number of clusters must be specified
- Implemented in `sklearn` ("scikit-learn")

In [None]:
print(samples)

In [None]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)

In [None]:
labels = model.predict(samples)

## Cluster labels for new samples
- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
- Finds the nearest centroid to each new sample

In [None]:
print(new_samples)

In [None]:
new_labels = model.predict(new_samples)
print(new_labels)  
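
The nearest-centroid rule can be checked by hand. The sketch below uses made-up points (`toy_points` and `toy_new_points` are hypothetical names, not the iris data) and compares a manual nearest-centroid assignment with the output of `.predict()`. The `n_init` and `random_state` arguments are only set to keep the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points forming two well-separated groups (illustration only)
toy_points = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 4.9], [4.9, 5.3],
])
toy_new_points = np.array([[0.3, 0.2], [4.8, 5.1]])

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_model.fit(toy_points)

# Distance from every new point to every centroid: shape (n_new, n_clusters)
diffs = toy_new_points[:, np.newaxis, :] - toy_model.cluster_centers_[np.newaxis, :, :]
distances = np.linalg.norm(diffs, axis=2)

# The index of the closest centroid is the cluster label
manual_labels = distances.argmin(axis=1)

print(manual_labels)
print(toy_model.predict(toy_new_points))
```

Both printed arrays should agree, since the model only needs the stored centroids to label new samples.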

## Scatter plots
- Scatter plot of sepal length vs. petal length
- Each point represents an iris sample
- Color points by cluster labels
- PyPlot (`matplotlib.pyplot`)

<p align='center'>
    <img src='image/Screenshot 2021-02-18 010119.png'>
</p>

In [None]:
import matplotlib.pyplot as plt
xs = samples[:, 0]
ys = samples[:, 2]
plt.scatter(xs, ys, c=labels)
plt.show()

# Exercise I: How many clusters?

You are given an array `points` of size 300x2, where each row gives the (x, y) co-ordinates of a point on a map. Make a scatter plot of these points, and use the scatter plot to guess how many clusters there are.

`matplotlib.pyplot` has already been imported as `plt`. In the IPython Shell:

- Create an array called `xs` that contains the values of `points[:,0]` - that is, column `0` of `points`.
- Create an array called `ys` that contains the values of `points[:,1]` - that is, column `1` of `points`.
- Make a scatter plot by passing `xs` and `ys` to the `plt.scatter()` function.
- Call the `plt.show()` function to show your plot.

How many clusters do you see?

### Instructions
#### Possible Answers
- 2
- 3 (T)
- 300

# Exercise II: Clustering 2D points

From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise. After the model has been fit, you'll obtain the cluster labels for some new points using the `.predict()` method.

You are given the array `points` from the previous exercise, and also an array `new_points`.

### Instructions

- Import `KMeans` from `sklearn.cluster`.
- Using `KMeans()`, create a `KMeans` instance called `model` to find `3` clusters. To specify the number of clusters, use the `n_clusters` keyword argument.
- Use the `.fit()` method of `model` to fit the model to the array of points `points`.
- Use the `.predict()` method of `model` to predict the cluster labels of `new_points`, assigning the result to `labels`.
- Hit 'Submit Answer' to see the cluster labels of `new_points`.


In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)


# Exercise III: Inspect your clustering

Let's now inspect the clustering you performed in the previous exercise!

A solution to the previous exercise has already run, so `new_points` is an array of points and `labels` is the array of their cluster labels.

### Instructions

- Import `matplotlib.pyplot` as `plt`.
- Assign column `0` of `new_points` to `xs`, and column `1` of `new_points` to `ys`.
- Make a scatter plot of `xs` and `ys`, specifying the `c=labels` keyword arguments to color the points by their cluster label. Also specify `alpha=0.5`.
- Compute the coordinates of the centroids using the `.cluster_centers_` attribute of `model`.
- Assign column `0` of `centroids` to `centroids_x`, and column `1` of `centroids` to `centroids_y`.
- Make a scatter plot of `centroids_x` and `centroids_y`, using `'D'` (a diamond) as a `marker` by specifying the marker parameter. Set the size of the markers to be `50` using `s=50`.


In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()


## Plot

<p align='center'>
    <img src='image/[2021-02-18] 012831.svg' width=30%>
</p>

# (2) Evaluating a clustering

## Evaluating a clustering
- Can check correspondence with e.g. iris species
- ... but what if there are no species to check against?
- Measure quality of a clustering
- Informs choice of how many clusters to look for

## Iris: Clusters vs species
- k-means found 3 clusters amongst the iris samples
- Do the clusters correspond to the species?

| species (label) | setosa | versicolor | virginica |
| :-: | :-: | :-: | :-:|
| 0 | 0 | 2 | 36 |
| 1 | 50 | 0 | 0 |
| 2 | 0 | 48 | 14 |

## Cross tabulation with pandas
- Clusters vs species is a "cross-tabulation"
- Use the `pandas` library
- Given the species of each sample as a list `species`

In [None]:
print(species)

## Aligning labels and species

In [None]:
import pandas as pd
df = pd.DataFrame({'labels': labels, 'species': species})
print(df)

| | labels | species |
| :-: | :-: | :-: |
| 0 | 1 | setosa |
| 1 | 1 | setosa |
| 2 | 2 | versicolor |
| 3 | 2 | virginica |
| 4 | 1 | setosa |

## Crosstab of labels and species

In [None]:
ct = pd.crosstab(df['labels'], df['species'])
print(ct)

| species | setosa | versicolor | virginica |
| :-: | :-: | :-: | :-: |
| 0 | 0 | 2 | 36 |
| 1 | 50 | 0 | 0 |
| 2 | 0 | 48 | 14 |

How can a clustering be evaluated if there is no species information?

## Measuring clustering quality
- Using only samples and their cluster labels
- A good clustering has tight clusters
- Samples in each cluster bunched together

## Inertia measures clustering quality
- Measures how spread out the clusters are (lower is better)
- Computed from the distance of each sample to the centroid of its cluster (scikit-learn sums the squared distances)
- After `fit()`, available as attribute `inertia_`
- k-means attempts to minimize the inertia when choosing clusters

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)
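
To make the definition concrete, here is a small sketch that recomputes inertia by hand as the sum of squared distances from each sample to its assigned centroid. The array `toy_points` is made up for illustration, and `n_init` and `random_state` are only set for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D points forming two groups (illustration only)
toy_points = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.2, 4.9], [4.9, 5.3],
])

toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_model.fit(toy_points)

# Look up the centroid assigned to each sample via its label
assigned_centroids = toy_model.cluster_centers_[toy_model.labels_]

# Sum of squared distances from each sample to its own centroid
manual_inertia = np.sum((toy_points - assigned_centroids) ** 2)

print(manual_inertia)
print(toy_model.inertia_)
```

The two values should match up to floating-point rounding.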

## The number of clusters
- Clusterings of the iris dataset with different numbers of clusters
- More clusters means lower inertia

<p align='center'>
    <img src='image/Screenshot 2021-02-18 024118.png'>
</p>

## How many clusters to choose?
- A good clustering has tight clusters (so low inertia)
- ... but not too many clusters!
- Choose an "elbow" in the inertia plot
- Where inertia begins to decrease more slowly
- E.g., for iris dataset, 3 is a good choice

# Exercise IV: How many clusters of grain?

In the video, you learned how to choose a good number of clusters for a dataset using the k-means inertia graph. You are given an array `samples` containing the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

`KMeans` and PyPlot (`plt`) have already been imported for you.

This dataset was sourced from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/seeds).

### Instructions

- For each of the given values of `k`, perform the following steps:
- Create a `KMeans` instance called `model` with `k` clusters.
- Fit the model to the grain data `samples`.
- Append the value of the `inertia_` attribute of `model` to the list `inertias`.
- The code to plot `ks` vs `inertias` has been written for you, so hit 'Submit Answer' to see the plot!


In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(samples)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()


## Plot

<p align='center'>
    <img src='image/[2021-02-18] 025003.svg'>
</p>

In [None]:
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)


# (3) Transforming features for better clusterings

## Piedmont wines dataset
- 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera
- Features measure chemical composition e.g. alcohol content
- Visual properties like "color intensity"

## Cluster vs. varieties

In [None]:
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)

| varieties | Barbera | Barolo | Grignolino |
| :-: | :-: | :-: | :-: |
| 0 | 29 | 13 | 20 |
| 1 | 0 | 46 | 1 |
| 2 | 19 | 0 | 50 |

## Feature variances
- The wine features have very different variances!
- Variance of a feature measures spread of its values
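
A quick way to inspect feature variances is NumPy's `var` along the sample axis. The sketch below uses a made-up array (`toy_features` is a hypothetical name, not the wine data) in which one column lives on a much larger scale than the other.

```python
import numpy as np

# Made-up values (illustration only): column 0 is on a much larger scale
toy_features = np.array([
    [1000.0, 1.2],
    [1500.0, 1.0],
    [ 500.0, 1.4],
    [2000.0, 0.8],
])

# axis=0 computes the variance down each column, i.e. once per feature
feature_variances = toy_features.var(axis=0)
print(feature_variances)
```

Here the first feature's variance is several orders of magnitude larger than the second's, so it would dominate any distance computed between samples.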

<p align='center'>
    <img src='image/Screenshot 2021-02-18 203600.png'>
    <img src='image/Screenshot 2021-02-18 203859.png'>
</p>

## Standard Scaler
- In k-means: feature variance = feature influence
- `StandardScaler` transforms each feature to have mean 0 and variance 1
- Features are said to be "standardized"

<p align='center'>
    <img src='image/Screenshot 2021-02-18 205354.png'>
</p>

## sklearn StandardScaler

In [None]:
from sklearn.preprocessing import StandardScaler

# Learn the mean and standard deviation of each feature
scaler = StandardScaler()
scaler.fit(samples)

# Standardize the features using the learned statistics
samples_scaled = scaler.transform(samples)
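
As a sanity check, the sketch below standardizes a made-up array (`toy_features`, hypothetical values) and confirms that each column ends up with mean 0 and variance 1. It also reproduces the result by hand, on the understanding that `StandardScaler` divides by the population standard deviation, which is what NumPy's `std` computes by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values (illustration only): two features on very different scales
toy_features = np.array([
    [1000.0, 1.2],
    [1500.0, 1.0],
    [ 500.0, 1.4],
    [2000.0, 0.8],
])

toy_scaled = StandardScaler().fit_transform(toy_features)

# Same transformation by hand: subtract the column mean, divide by the column std
manual_scaled = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)

print(toy_scaled.mean(axis=0))  # close to 0 for every feature
print(toy_scaled.var(axis=0))   # close to 1 for every feature
```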

## Similar methods
- `StandardScaler` and `KMeans` have similar methods
- Use `fit()` / `transform()` with `StandardScaler`
- Use `fit()` / `predict()` with `KMeans`

## StandardScaler, then KMeans
- Need to perform two steps: `StandardScaler`, then `KMeans`
- Use `sklearn` pipeline to combine multiple steps
- Data flows from one step into the next

## Pipelines combine multiple steps

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)

In [None]:
labels = pipeline.predict(samples)
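
To see how data flows through the steps, the sketch below fits a pipeline on made-up data (`toy_features` is hypothetical) and then repeats the two steps by hand using the fitted steps stored in the pipeline. It relies on `make_pipeline` naming each step after its lowercased class name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up values (illustration only): two features on very different scales
toy_features = np.array([
    [1000.0, 1.2], [1100.0, 1.0], [ 900.0, 1.1],
    [1000.0, 9.0], [1050.0, 9.2], [ 950.0, 8.8],
])

toy_pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
toy_pipeline.fit(toy_features)
pipeline_labels = toy_pipeline.predict(toy_features)

# The same two steps done by hand with the fitted steps of the pipeline
fitted_scaler = toy_pipeline.named_steps['standardscaler']
fitted_kmeans = toy_pipeline.named_steps['kmeans']
manual_labels = fitted_kmeans.predict(fitted_scaler.transform(toy_features))

print(pipeline_labels)
print(manual_labels)
```

The labels agree because `pipeline.predict()` transforms the data with the scaler before handing it to k-means.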

## Feature standardization improves clustering
With feature standardization:

| varieties | Barbera | Barolo | Grignolino |
| :-: | :-: | :-: | :-: |
| 0 | 0 | 59 | 3 |
| 1 | 48 | 0 | 3 |
| 2 | 0 | 0 | 65 |

Without feature standardization, the clustering was much worse:

| varieties | Barbera | Barolo | Grignolino |
| :-: | :-: | :-: | :-: |
| 0 | 29 | 13 | 20 |
| 1 | 0 | 46 | 1 |
| 2 | 19 | 0 | 50 |

## sklearn preprocessing steps
- `StandardScaler` is a "preprocessing" step
- `MaxAbsScaler` and `Normalizer` are other examples
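
As a brief illustration of one of these alternatives, the sketch below applies `MaxAbsScaler` to a made-up array (`toy_features`, hypothetical values). It divides each feature by its largest absolute value, so every column ends up within [-1, 1] without being shifted.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up values (illustration only)
toy_features = np.array([
    [ 2.0, -10.0],
    [-4.0,   5.0],
    [ 1.0,  20.0],
])

# Each column is divided by its maximum absolute value (4 and 20 here)
toy_maxabs = MaxAbsScaler().fit_transform(toy_features)
print(toy_maxabs)

# The largest absolute value in each column is now 1
print(np.abs(toy_maxabs).max(axis=0))
```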

# Exercise V: Scaling fish data for clustering

You are given an array `samples` giving measurements of fish. Each row represents an individual fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very different scales. In order to cluster this data effectively, you'll need to standardize these features first. In this exercise, you'll build a pipeline to standardize and cluster the data.

These fish measurement data were sourced from the [Journal of Statistics Education](http://ww2.amstat.org/publications/jse/jse_data_archive.htm).

### Instructions

- Import:
    - `make_pipeline` from `sklearn.pipeline`.
    - `StandardScaler` from `sklearn.preprocessing`.
    - `KMeans` from `sklearn.cluster`.
- Create an instance of `StandardScaler` called `scaler`.
- Create an instance of `KMeans` with `4` clusters called `kmeans`.
- Create a pipeline called `pipeline` that chains `scaler` and `kmeans`. To do this, you just need to pass them in as arguments to `make_pipeline()`.


In [None]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters=4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, kmeans)


# Exercise VI: Clustering the fish data

You'll now use your standardization and clustering pipeline from the previous exercise to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.

As before, `samples` is the 2D array of fish measurements. Your pipeline is available as `pipeline`, and the species of every fish sample is given by the list `species`.

### Instructions

- Import `pandas` as `pd`.
- Fit the pipeline to the fish measurements `samples`.
- Obtain the cluster labels for `samples` by using the `.predict()` method of `pipeline`.
- Using `pd.DataFrame()`, create a DataFrame `df` with two columns named `'labels'` and `'species'`, using `labels` and `species`, respectively, for the column values.
- Using `pd.crosstab()`, create a cross-tabulation `ct` of `df['labels']` and `df['species']`.


In [None]:
# Import pandas
import pandas as pd

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels': labels, 'species': species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

# Display ct
print(ct)


# Exercise VII: Clustering stocks using KMeans

In this exercise, you'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day). You are given a NumPy array `movements` of daily price movements from 2010 to 2015 (obtained from Yahoo! Finance), where each row corresponds to a company, and each column corresponds to a trading day.

Some stocks are more expensive than others. To account for this, include a `Normalizer` at the beginning of your pipeline. The Normalizer will separately transform each company's stock price to a relative scale before the clustering begins.

Note that `Normalizer()` is different to `StandardScaler()`, which you used in the previous exercise. While `StandardScaler()` standardizes **features** (such as the features of the fish data from the previous exercise) by removing the mean and scaling to unit variance, `Normalizer()` rescales **each sample** - here, each company's stock price - independently of the others.
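
The difference can be seen on a tiny made-up array (`toy_movements` is hypothetical, not the real stock data). In this sketch the second row moves exactly like the first but at ten times the dollar scale. With its default settings, `Normalizer` rescales each row to unit L2 length, so the two rows become identical, whereas `StandardScaler` works column by column.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up price movements (illustration only): rows are companies, columns are days.
# The second company moves like the first, but at 10x the dollar scale.
toy_movements = np.array([
    [ 1.0,  -2.0,  2.0],
    [10.0, -20.0, 20.0],
    [-3.0,   0.0,  4.0],
])

# Normalizer: each row (sample) is rescaled independently to unit length
toy_normalized = Normalizer().fit_transform(toy_movements)
print(np.linalg.norm(toy_normalized, axis=1))  # row lengths, all close to 1
print(toy_normalized[0], toy_normalized[1])    # rows 0 and 1 now coincide

# StandardScaler: each column (feature) is centered and scaled
toy_standardized = StandardScaler().fit_transform(toy_movements)
print(toy_standardized.mean(axis=0))           # column means, all close to 0
```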

`KMeans` and `make_pipeline` have already been imported for you.

### Instructions

- Import `Normalizer` from `sklearn.preprocessing`.
- Create an instance of `Normalizer` called `normalizer`.
- Create an instance of `KMeans` called `kmeans` with `10` clusters.
- Using `make_pipeline()`, create a pipeline called `pipeline` that chains `normalizer` and `kmeans`.
- Fit the pipeline to the `movements` array.


In [None]:
# Import Normalizer
from sklearn.preprocessing import Normalizer

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters=10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)


# Exercise VIII: Which stocks move together?

In the previous exercise, you clustered companies by their daily stock price movements. So which companies have stock prices that tend to change in the same way? You'll now inspect the cluster labels from your clustering to find out.

Your solution to the previous exercise has already been run. Recall that you constructed a Pipeline `pipeline` containing a KMeans model and fit it to the NumPy array `movements` of daily stock movements. In addition, a list `companies` of the company names is available.

### Instructions

- Import `pandas` as `pd`.
- Use the `.predict()` method of the pipeline to predict the labels for `movements`.
- Align the cluster labels with the list of company names `companies` by creating a DataFrame `df` with `labels` and `companies` as columns. This has been done for you.
- Use the `.sort_values()` method of `df` to sort the DataFrame by the `'labels'` column, and print the result.
- Hit 'Submit Answer' and take a moment to see which companies are together in each cluster!


In [None]:
# Import pandas
import pandas as pd

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values('labels'))
