# `DSML_WS_12` - Introduction to Unsupervised Learning

This workshop deals with `unsupervised learning`. In unsupervised learning, we do not predict a continuous value (regression) or a discrete class (classification) from a set of features. Instead, we try to find (hidden) structures in the data. Because there is no fixed reference (such as a label) to compare our results to, there is no exact measure of how well an algorithm works. During this workshop, we will use clustering algorithms on datasets for which we __do__ have labels, just to give you a better intuition. In real-world examples, however, assessing the quality of your clustering result is __more difficult__!

After a short recap task on tree-based regression, we will focus on hard clustering. The workshop is structured as follows:
1. **Task: Using tree-based methods to predict electricity demand**
1. **Preparation for clustering**
1. **K-means clustering**
1. **Hierarchical clustering**

---

## 1. Task: Using tree-based methods to predict electricity demand

In last week's workshop, we talked about decision trees and ensemble methods and applied them to classify breast cancer cells. Let us see how this works in a regression setting, using our electricity demand dataset.

- Load data, define your X feature vector to include `Avg_temp` and `is_weekday` (which you first have to generate) and your Y vector to be `AVG`.
- Perform a train/test split.
- Train four different trees using `DecisionTreeRegressor`, with tree_depths ranging from 2 to 5.
- Train a gradient boosting model using `XGBRegressor` (you might have to read up on hyperparameters in the documentation).
- Train a random forest model using `RandomForestRegressor` (again, you might need to consult the documentation).
- Compare model performance using appropriate test metrics. Which model performs best?

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# load data
df = pd.read_csv("Pittsburgh_load_data.csv")
df["Date"] = pd.to_datetime(df["Date"], format="%d.%m.%Y")
df["is_weekday"] = (df["Date"].dt.weekday < 5).astype(int)  # Monday=0, ..., Friday=4

# define vectors
X = df[["Avg_temp", "is_weekday"]].values
Y = df["AVG"].values

# perform train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# train decision trees
m1 = DecisionTreeRegressor(max_depth=2).fit(X_train, Y_train)
m2 = DecisionTreeRegressor(max_depth=3).fit(X_train, Y_train)
m3 = DecisionTreeRegressor(max_depth=4).fit(X_train, Y_train)
m4 = DecisionTreeRegressor(max_depth=5).fit(X_train, Y_train)

# train xgboost
m5 = XGBRegressor(booster="gbtree").fit(X_train, Y_train)

# train random forest
m6 = RandomForestRegressor(n_estimators=100, 
                            bootstrap=True,
                            random_state=42).fit(X_train, Y_train)

# evaluate performance
for i, model in enumerate([m1, m2, m3, m4, m5, m6], start=1):
    Y_pred = model.predict(X_test)  # predict once, reuse for both metrics
    mse = mean_squared_error(Y_test, Y_pred)
    r2 = r2_score(Y_test, Y_pred)
    print(f"Performance of model {i}: MSE = {mse}, R2 Score = {r2}")

---

## 2. Preparation for clustering

### Dataset used throughout the workshop: Iris flowers
For a first intuition, we will be working on the iris dataset again, which you all are familiar with by now. As you remember, in reality there are __three__ different species of irises in the dataset. So, let's see whether we are able to confirm this number of __clusters__ in our dataset using unsupervised learning.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

In [None]:
iris = pd.read_csv('iris.csv', index_col="number").dropna(axis=0)
iris.head()

To make this an unsupervised learning task we need to drop the label (i.e. `Species`). For later checks we save the response as variable `y`.

In [None]:
X = iris.drop("Species", axis=1)
y = iris["Species"]
X.head()

### Data Prep and Scaling
First, let's start out by scaling the data. The K-Means algorithm (like most distance-based clustering algorithms) minimizes some intra-cluster distance, which in turn pushes the clusters apart. For k-means, this is the sum of squared **Euclidean distances** from each point to the center (mean) of its cluster. Other distance metrics are also sometimes used in related algorithms (e.g., the **Manhattan distance**).

If one feature has a much bigger spread than the others, it dominates the distance computation and therefore influences the clustering outcome more than the other features. We will revisit this later! For now, let's standardize!
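
To make the effect of feature spread concrete, here is a minimal sketch with made-up numbers (they are not taken from the iris data, and the units in the comments are purely hypothetical). It compares Euclidean distances before and after standardizing each column by hand, using the same formula that `StandardScaler` applies by default (subtract the column mean, divide by the population standard deviation).

```python
import numpy as np

# Hypothetical toy data: column 0 has a large spread, column 1 a small one
points = np.array([
    [5000.0, 1.0],
    [5100.0, 9.0],
    [9000.0, 1.2],
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Unscaled: the distance is dominated by column 0
d01_raw = euclidean(points[0], points[1])
d02_raw = euclidean(points[0], points[2])

# Standardize by hand: zero mean and unit (population) standard deviation per column
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
d01_scaled = euclidean(scaled[0], scaled[1])
d02_scaled = euclidean(scaled[0], scaled[2])

print("raw:   ", d01_raw, d02_raw)
print("scaled:", d01_scaled, d02_scaled)
```

On the raw data, point 1 looks roughly 40 times closer to point 0 than point 2 does, only because column 0 has the larger numbers. After standardization, both distances are of similar size, so a large difference in the second feature counts as much as a large difference in the first.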

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# create a df out of the array
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
X_scaled_df.head()

In [None]:
# copy so that adding the label column does not also modify X_scaled_df
iris_scaled = X_scaled_df.copy()
iris_scaled["Species"] = iris["Species"]

The typical patterns remain in the scaled data, just at a different scale.

In [None]:
sns.pairplot(data=iris_scaled, hue="Species")

### Supervised learning case (classification)

Before we start with the actual clustering, let's revisit the classification methods we talked about during past workshops and see how well they do in terms of classification error. For that, we import a `kNN classifier` from the sklearn library and look at the confusion matrix and the classification report.

kNN has some intuitive parallels to k-means clustering (see lecture material).

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Note that we are not doing a train-test split or using other validation methods here; this is not a valid approach for a real supervised learning project!
model_neigh = KNeighborsClassifier(n_neighbors=5)
model_neigh.fit(X_scaled,y)

In [None]:
y_pred = model_neigh.predict(X_scaled)

To review the performance of our classifier we check relevant test metrics, which you are familiar with by now. We use the `classification_report` function to return a neat summary of the most important classification metrics. We also review the previously discussed `confusion_matrix`.

In [None]:
iris["SpeciesPred"] = y_pred
print(classification_report(y,y_pred))

In [None]:
classification_conf_matrix = confusion_matrix(y,y_pred)
classification_conf_matrix

## 3. K-means clustering
Now, let's drop the labels and just have a look at the given features. How would we group the datapoints for the different observed flowers?
In a typical unsupervised learning case we normally just have a rough idea of how many clusters to expect.

Three approaches are commonly applied:
1. Use expert knowledge
1. Plot residual loss for different numbers of clusters, find 'elbow' and select corresponding number of clusters
1. Use hierarchical clustering to detect suitable branching and corresponding number of clusters

We will focus on the residual loss (elbow method) as a selection criterion for now. In practice, a combination of all three methods is often used.
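
The residual loss used for the elbow plot is what scikit-learn's `KMeans` exposes as `inertia_`: the sum of squared distances from each sample to its closest cluster center. The following sketch computes this quantity by hand with NumPy. The toy data and the hand-picked centers are assumptions for illustration only; no k-means fit is performed here.

```python
import numpy as np

def inertia(X, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    # pairwise squared distances, shape (n_points, n_centers)
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

# Hypothetical toy data: two tight groups, far apart from each other
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k=1: a single center at the overall mean
loss_k1 = inertia(X_toy, X_toy.mean(axis=0, keepdims=True))

# k=2: one center per group (the group means)
centers_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])
loss_k2 = inertia(X_toy, centers_k2)

# k=4: every point is its own center, so the loss drops to zero
loss_k4 = inertia(X_toy, X_toy)

print(loss_k1, loss_k2, loss_k4)  # 104.0 4.0 0.0
```

The loss can always be pushed to zero by adding more clusters, which is why we do not simply pick the k with the smallest loss. Here the large drop happens from k=1 to k=2, and any further cluster buys very little: that bend is the "elbow".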

In [None]:
from sklearn.cluster import KMeans

In [None]:
k_max = 20  # we have 147 datapoints, more than 20 clusters are definitely not reasonable!

clusters = []
losses = []

for k in range(k_max):
    model = KMeans(n_clusters=k+1) # initialize
    model.fit(X_scaled) # fit
    clusters.append(k+1)
    losses.append(model.inertia_)

In [None]:
plt.plot(clusters, losses)
plt.xticks(range(k_max+1))
plt.show()

Let's zoom in on the relevant region by limiting the x-axis.

In [None]:
plt.plot(clusters, losses)
plt.xlim([0,10])
plt.show()

From this plot we would expect a good number of clusters to lie in the region of two to five. Of course, we know that the correct answer is 3 (i.e., one per species). However, in a true unsupervised learning setting there is no frame of reference, and we would need to draw on several indicators (quantitative and qualitative) to settle on a final number of clusters. For illustrative purposes, let's select k=2, the minimum sensible choice.

In [None]:
# re-fit algorithm
two_means = KMeans(n_clusters=2)
two_means.fit(X_scaled)

# match records to clusters by calling predict
two_means.predict(X_scaled)

Let's plot the results.

In [None]:
numbers = ["zero", "one", "two", "three","four","five"]
iris_scaled["clusters"] = two_means.predict(X_scaled)
iris_scaled["clusters"] = iris_scaled["clusters"].apply(lambda x: numbers[x])
sns.pairplot(data=iris_scaled, hue="clusters")

Now let's do this for the "correct" number of clusters, i.e., three.

In [None]:
three_means = KMeans(n_clusters=3)
three_means.fit(X_scaled)
iris_scaled["clusters"] = three_means.predict(X_scaled)
iris_scaled["clusters"] = iris_scaled["clusters"].apply(lambda x: numbers[x])
sns.pairplot(data=iris_scaled, hue="clusters")

How would k=5 fare?

In [None]:
five_means = KMeans(n_clusters=5)
five_means.fit(X_scaled)
iris_scaled["clusters"] = five_means.predict(X_scaled)
iris_scaled["clusters"] = iris_scaled["clusters"].apply(lambda x: numbers[x])
sns.pairplot(data=iris_scaled, hue="clusters")

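From inspection, we can tell that k=5 achieves poorer inter-cluster separation: some clusters overlap heavily (in our run, e.g., clusters one and three; the numbering may differ in yours, since k-means starts from random initial centers). From these analyses, a data scientist would most likely select 3 clusters.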

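Let us implement a second very neat way of evaluating clusters: a silhouette score analysis. For each sample, the silhouette score compares the mean distance to the other points in its own cluster (intra-cluster distance) with the mean distance to the points in the nearest other cluster (inter-cluster distance). It ranges from -1 (the sample is probably assigned to the wrong cluster) through 0 (the sample lies between two clusters) to 1 (the sample is far away from all other clusters).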
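
To see where these numbers come from, the sketch below computes the silhouette coefficient by hand as `s = (b - a) / max(a, b)`, where `a` is the mean distance of a sample to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster. The helper `silhouette_values` and the toy data are assumptions for illustration; in the workshop itself we rely on `silhouette_samples` and `silhouette_score` from sklearn.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette coefficient s = (b - a) / max(a, b)."""
    # full pairwise Euclidean distance matrix
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # convention: a singleton cluster gets a score of 0
        # a: mean distance to the other members of the own cluster
        a = dists[i, own].sum() / (own.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical 1-D toy data: two well-separated groups
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
good_labels = np.array([0, 0, 1, 1])  # matches the visible groups
bad_labels = np.array([0, 1, 0, 1])   # mixes the groups on purpose

good_avg = silhouette_values(X_toy, good_labels).mean()
bad_avg = silhouette_values(X_toy, bad_labels).mean()
print(good_avg, bad_avg)
```

The sensible assignment yields an average close to 0.9, while the deliberately mixed assignment yields a negative average of about -0.45, which signals that samples sit closer to a foreign cluster than to their own.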

In [None]:
from matplotlib import cm
from sklearn.metrics import silhouette_samples, silhouette_score

range_n_clusters = [2, 3, 4, 5]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # 1st subplot is the silhouette plot
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # initialize k-means with n_clusters value
    clusterer = KMeans(n_clusters=n_clusters)
    cluster_labels = clusterer.fit_predict(X_scaled)

    # silhouette_score gives the average value for all the samples.
    silhouette_avg = silhouette_score(X_scaled, cluster_labels)
    print(
        "For n_clusters =",
        n_clusters,
        "The average silhouette_score is :",
        silhouette_avg,
    )

    # compute silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X_scaled, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # aggregate silhouette scores for samples belonging to cluster i and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # label silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # compute new y_lower for next plot
        y_lower = y_upper + 10  # leave a gap of 10 between the cluster bands

    ax1.set_title("Silhouette plot for various clusters")
    ax1.set_xlabel("Silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    # 2nd plot showing actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(
        X_scaled[:, 0], X_scaled[:, 1], marker=".", s=200, lw=0, alpha=0.7, c=colors, edgecolor="k"
    )

    # labeling clusters
    centers = clusterer.cluster_centers_
    # draw white circles at cluster centers
    ax2.scatter(
        centers[:, 0],
        centers[:, 1],
        marker="o",
        c="white",
        alpha=1,
        s=200,
        edgecolor="k",
    )

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")

    ax2.set_title("Visualization of clustered data")
    ax2.set_xlabel("Sepal length")
    ax2.set_ylabel("Sepal width")

    plt.suptitle(
        "Silhouette analysis for KMeans clustering on iris dataset with n_clusters = %d"
        % n_clusters,
        fontsize=14,
        fontweight="bold",
    )

plt.show()

### Small excursion: Principal Component Analysis (PCA) revisited
In the case of the iris dataset, we have just four features - an amount we can still easily manage. But imagine we have a much larger feature space. Then, it might be necessary to reduce the dimensionality of our dataset. Also, we gain another advantage: We can visualize the data more easily! As you have seen above, already with just 4 dimensions it is hard to visualize our data in a meaningful way. The size of a pairplot increases quadratically with the number of features. For four features we have sixteen plots (more than you should expect any stakeholder to have a look at/understand...), but for 10 features it is already a hundred - way more than you would like to have to look at!

So, how do we reduce the dimensionality of our dataset without losing too much information? (unfortunately, we are always losing some)


Principal component analysis (PCA) is the way to go! With PCA, we look for a projection of our data onto fewer dimensions that keeps as much of the variance in the data as possible. As this is a little theoretical, let's have a quick look at a graphical example:

![pca](pca.png)

Here, our data originally had two dimensions and is reduced to just one dimension. As you can see, most of the variance of our data is conserved. If the datapoints follow a strongly non-linear shape, it is harder to find suitable "principal components". For such cases, there are advanced options such as Kernel-PCA (which we are not covering in this course).
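
The same two-to-one reduction can be sketched in a few lines of NumPy. The helper `pca_svd` and the toy points are assumptions for illustration. The sketch centers the data and takes a singular value decomposition, which is also the general approach the sklearn documentation describes for its `PCA` class, but it is not meant to reproduce that implementation in detail.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project X onto its first principal components via an SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]  # rows are the principal directions
    explained_variance_ratio = (S ** 2 / np.sum(S ** 2))[:n_components]
    return X_centered @ components.T, components, explained_variance_ratio

# Hypothetical toy data: points lying almost on the diagonal y = x
X_toy = np.array([[-2.0, -2.1], [-1.0, -0.9], [0.0, 0.1], [1.0, 0.9], [2.0, 2.0]])

X_proj, components, ratio = pca_svd(X_toy, n_components=1)
print(components)  # direction of the first principal component
print(ratio)       # share of the total variance it retains
```

Because the toy points scatter almost exclusively along the diagonal, the first component points (up to sign) along that diagonal and retains more than 99% of the variance. One coordinate per point is then nearly as informative as the original two.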

Furthermore, there is one more important application for PCA: When there are more features than records in your dataset, you can reduce the dimensionality of your dataset to the number of records without losing any information! So if you have 1 million features but only 100 records, then it is sufficient to look at only the first 100 principal components and you will not have lost any variance (strictly speaking, 99 already suffice, because centering the data removes one degree of freedom). Still, it can be useful to reduce the number of dimensions further, as long as you do not lose too much variance.
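
A quick numerical check of this claim, as a sketch on random data (the sizes and the seed are arbitrary assumptions): with 5 records and 50 features, the centered data has at most 5 singular values, and after centering only 4 of them are non-zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_features = 5, 50  # hypothetical sizes: far more features than records
X_wide = rng.normal(size=(n_records, n_features))

# PCA works on mean-centered data
X_centered = X_wide - X_wide.mean(axis=0)

# The squared singular values are proportional to the variance per component
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)

# Count the components that carry any variance (up to numerical noise)
n_nonzero = int(np.sum(singular_values > 1e-10))
print(len(singular_values), n_nonzero)
print(explained.round(3))
```

All of the variance is carried by the first `n_records - 1` components, so projecting the 50 features onto those components loses nothing.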

As a rule of thumb, at least 95% of the original variance should remain in the reduced dataset. Let's now run a PCA on our iris dataset and have a look at the principal components.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_scaled)

In [None]:
print(pca.explained_variance_ratio_)

In [None]:
pca2 = PCA(n_components=2)
X_pca2 = pca2.fit_transform(X_scaled)

In [None]:
print(pca2.explained_variance_ratio_)

#sum explained variance over all PCA components
print(sum(pca2.explained_variance_ratio_))

Using only two principal components, we still have approx. 96% of our variance, so we can assume that the transformed dataset is a good approximation of our original dataset. Let's now have a closer look at how each of these two principal components is composed.

In [None]:
print(pca2.components_)
print(list(X_scaled_df.columns)[:4])

In [None]:
X_scaled

In [None]:
iris_pca = pd.DataFrame(X_pca2, columns=["First PC", "Second PC"], index=iris.index)
iris_pca["Species"] = iris["Species"]
iris_pca.head()

Before running a clustering algorithm on the reduced data, let's first have a look at what we would have seen using only one principal component!

In [None]:
sns.boxplot(x="Species", y="First PC", data=iris_pca)

We are (as expected) already able to differentiate between *setosa* and the other two types easily using just the first principal component. There is a clear difference between the core elements of the other two clusters, but it is not possible to completely differentiate between them.

In [None]:
sns.lmplot(x="First PC", y="Second PC", data=iris_pca, fit_reg=False, hue="Species")

Let's now use K-means again to find clusters in this reduced 2-dimensional space! 

In [None]:
pca_clusters = []
pca_losses = []

for i in range(k_max):
    model = KMeans(n_clusters=i+1)
    model.fit(X_pca2)
    pca_clusters.append(i+1)
    pca_losses.append(model.inertia_)
    
plt.plot(pca_clusters, pca_losses)
plt.xticks(range(k_max+1))
plt.show()

In [None]:
# zoom into the good region
plt.plot(pca_clusters, pca_losses)
plt.xlim([0,10])
plt.show()

In [None]:
pca_three_means = KMeans(n_clusters=3)
pca_three_means.fit(X_pca2)

In [None]:
iris_pca["Cluster"] = pca_three_means.predict(X_pca2)

# clustering result
sns.lmplot(x="First PC", y="Second PC", hue="Cluster", data=iris_pca, fit_reg=False)
# actual classes
sns.lmplot(x="First PC", y="Second PC", hue="Species", data=iris_pca, fit_reg=False)

## 4. Hierarchical Clustering

Unlike K-Means, hierarchical clustering builds the clusters step by step, yielding a tree-like hierarchy of how the data is separated into clusters. This has the advantage that it resembles the human decision-making process more closely and is, therefore, very intuitive for stakeholders to understand.


Let's have a look at what a supervised decision tree would look like for our iris dataset and then back this up with an unsupervised hierarchical clustering.

### Supervised Learning Case - Decision Trees for Classification

The supervised counterpart of hierarchical clustering is decision tree classification, which we will briefly revisit in this section.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

model_tree = DecisionTreeClassifier(max_depth=2)

# we engineer a new feature here to enhance performance
X["Petal.Area"] = X["Petal.Length"]*X["Petal.Width"]
model_tree.fit(X, y)
y_pred_tree = model_tree.predict(X)

In [None]:
print(classification_report(y_true=y, y_pred=y_pred_tree))

In [None]:
tree.plot_tree(model_tree, class_names=y.unique().tolist(), feature_names=X.columns.tolist(), filled=True)

### Unsupervised Learning Case - Agglomerative/Hierarchical Clustering

In agglomerative hierarchical clustering, we begin with every observation in its own cluster and keep merging the closest clusters until only one single cluster is left. Obviously, neither extreme is useful, so we need to find the sweet spot in between. In contrast to k-means, we do not need to specify the number of clusters in advance. Instead, we generate a plot called a dendrogram to help us decide on an appropriate number of clusters. Let's see how this works in practice.
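
The bottom-up merging can be sketched naively in NumPy. Note the assumptions: the helper `agglomerative` is hypothetical, the toy data is made up, and the sketch uses single linkage (distance between the two closest members) because it is the easiest to follow, whereas sklearn's `AgglomerativeClustering` uses Ward linkage by default.

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Naive bottom-up clustering with single linkage (closest-pair distance)."""
    clusters = [[i] for i in range(len(X))]  # start: every point is its own cluster
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    merge_heights = []
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the two closest members
                d = dists[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merge_heights.append(float(d))  # height at which the two clusters are joined
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merge_heights

# Hypothetical 1-D toy data: two pairs and one far-away point
X_toy = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

_, heights = agglomerative(X_toy, n_clusters=1)
print(heights)  # [1.0, 1.0, 4.0, 14.0]

three_clusters, _ = agglomerative(X_toy, n_clusters=3)
print(three_clusters)  # [[0, 1], [2, 3], [4]]
```

The merge heights are what a dendrogram draws on its vertical axis. The first two merges happen at a small distance, and the later ones require much larger jumps, so cutting the tree before those jumps (here at three clusters) is a natural choice.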

In [None]:
X_scaled = StandardScaler().fit_transform(X)

In [None]:
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def plot_dendrogram(model, **kwargs):
    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

In [None]:
# fit the model; compute_distances=True stores the merge distances (distances_) that the dendrogram needs
agglo = AgglomerativeClustering(compute_distances=True, n_clusters=3)
y_pred_agglo = agglo.fit_predict(X_scaled)

plt.figure(figsize=(15,10))
plt.title('Hierarchical Clustering Dendrogram')
plot_dendrogram(agglo, labels=agglo.labels_)

In [None]:
iris_hierarchical = pd.DataFrame(X_scaled, columns=X.columns)
iris_hierarchical["Agglo_clusters"] = y_pred_agglo
iris_hierarchical["Species"] = iris["Species"].values

iris_hierarchical.head()

In [None]:
sns.lmplot(x="Sepal.Length", y="Petal.Area", data=iris_hierarchical, fit_reg=False, hue="Agglo_clusters")
sns.lmplot(x="Sepal.Length", y="Petal.Area", hue="Species", data=iris_hierarchical, fit_reg=False)

---