<a href="https://colab.research.google.com/github/MScEcologyAndDataScienceUCL/BIOS0032_AI4Environment/blob/main/2_Intro_to_ML/1_Intro_to_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2 Introduction to Machine Learning

## What we will learn

In this week's Colab we will introduce the Machine Learning library `scikit-learn` and practice some basic concepts of Machine Learning (ML), including:

1. How to break down ML projects
2. How to do classification with `scikit-learn`
3. How to evaluate model performance
4. How to do regression with `scikit-learn`
5. How to do clustering with `scikit-learn`
6. How to do dimensionality reduction with `scikit-learn`

In the process you will also learn about:
* Some classification algorithms, such as Nearest Neighbors, SVM, Decision Trees and Random Forest
* Some regression algorithms, such as Linear Regression, and Nearest Neighbor Regression

## Recap

**What is Machine Learning?**

Teaching computers how to perform a task without having to explicitly program them to do it.



> A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
>
> T. M. Mitchell. 1997. Machine Learning

**Examples**

Butterfly recognition:

* Task T: Classify images of British butterflies into different species
* Performance measure P: percent of images that have been correctly classified
* Training experience E: A database of butterfly images from museum collections 

**How does a computer program learn?**

Using **data** to **parametrize** models.

## 1. ML workflow

Using ML for ecological inference is a multistep process. In practice it can generally be broken down into the following steps:

* Data collection
* Data preparation
* Model training
* Model evaluation
* Making predictions

![ml workflow](images/ml_workflow.png)

> Simplified workflow from: S. Amershi et al., ["Software Engineering for Machine Learning: A Case Study,"](https://ieeexplore.ieee.org/abstract/document/8804457) 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Montreal, QC, Canada, 2019, pp. 291-300, doi: 10.1109/ICSE-SEIP.2019.00042.

The previous figure serves as a guideline indicating the usual flow, but it is not uncommon to work simultaneously on multiple steps.

Each step can be subdivided further:

### Data Collection

* Design your ML goals
* Determine what data (and ideally how much) you will need.
* Collect data from multiple sources
    * Field studies
    * Open-source datasets
    * Web scraping
    * Citizen science
    * Collaboration

### Data Preparation

* Select feature set
* Clean dataset, fix errors, decide what to do with missing values
* Normalize variables
* Annotate or label data
* Split data into training and test datasets
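
The last two preparation steps can be sketched with `scikit-learn` utilities. This is a minimal illustration, assuming the Iris toy dataset, a 25% test split and standardization as the normalization method; none of these choices are prescribed by the workflow itself.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load features (X) and labels (y) as NumPy arrays
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training split alone keeps information from the test set out of the preparation step.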

### Model Training

* Select an adequate ML model
* Train model and validate
* Fine-tune hyperparameters

### Model Evaluation

* Compute metrics
* Visualize predictions
* Study failure cases
* Identify weak spots
* Compare with baselines
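
As a rough sketch of "compute metrics" and "compare with baselines", the cell below scores a model against a trivial baseline that always predicts the most frequent class. The dataset, the split settings and the choice of a 1-nearest-neighbor model are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Candidate model to compare against the baseline
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
model_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"baseline accuracy = {baseline_accuracy:.2f}")
print(f"model accuracy    = {model_accuracy:.2f}")

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, model.predict(X_test)))
```

A model is only worth keeping if it clearly beats such a baseline; the off-diagonal entries of the confusion matrix point to the failure cases worth studying.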

### Make predictions

* Use model to process novel data
* Use predictions to make ecological inference

Now you will step through some practical examples of some of the steps listed above. 

## 2. Scikit-Learn

`scikit-learn` is a Python library for Machine Learning. It offers a wide array of algorithms for several ML tasks and tools to set up ML pipelines.

![scikit-learn](https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/260px-Scikit_learn_logo_small.svg.png?20180808062052)

You can learn about `scikit-learn` in its official [documentation](https://scikit-learn.org/stable/index.html).

The development team published a paper on the design of the package, which makes an interesting read:

> Pedregosa, Fabian, et al. ["Scikit-learn: Machine learning in Python."](https://arxiv.org/abs/1201.0490) Journal of Machine Learning Research 12 (2011): 2825-2830.

In [None]:
import sklearn  # Notice scikit-learn is abbreviated as sklearn.

# Scikit-learn will be preinstalled in colab environments

# Print the currently installed scikit-learn version
print(sklearn.__version__)

## 3. Classification

Suppose you need to automate the following task

**Task**: Identify Iris flower species

How would you describe this Iris flower?

![iris](https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Iris_germanica_%28Purple_bearded_Iris%29%2C_Wakehurst_Place%2C_UK_-_Diliff.jpg/470px-Iris_germanica_%28Purple_bearded_Iris%29%2C_Wakehurst_Place%2C_UK_-_Diliff.jpg?20140528110728)

* Colour?
* Number of stripes?
* Size?
* Weight?
* Environment?

**What are features?**

They are numerical or categorical descriptors, attributes or traits of the object of study.

> A **feature** is an individual measurable property or characteristic of a phenomenon
>
> *Bishop, Christopher (2006). Pattern recognition and machine learning*

In the case of Iris flowers, let's use sepal and petal length and width.

![iris sepal/petal length/width](https://ars.els-cdn.com/content/image/3-s2.0-B9780128147610000034-f03-01-9780128147610.jpg)

Feature vector - Feature (descriptor):

    x = (sepal_length, sepal_width, petal_length, petal_width)
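
In code, such a feature vector is simply an array of numbers in a fixed order. The sketch below uses made-up example measurements (in cm) and hypothetical variable names to show one flower as a vector and a small dataset as a stack of vectors.

```python
import numpy as np

# Example measurements in cm, in the order used by x above (illustrative values)
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
x = np.array([5.1, 3.5, 1.4, 0.2])

for name, value in zip(feature_names, x):
    print(f"{name} = {value} cm")

# A dataset is a stack of such vectors: one row per flower, one column per feature
X = np.array([[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4]])
print(X.shape)
```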

**Where to get Iris flower data?**

The problem of Iris species identification was famously studied in the paper

> Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).

The dataset is publicly available. Find a description [here](https://archive.ics.uci.edu/ml/datasets/iris).

Let's load a dataset of Iris flower measurements using `scikit-learn`.

In [None]:
# Scikit-learn offers toy datasets including the iris dataset.
# Load the dataset functions from scikit-learn.
from sklearn import datasets

In [None]:
# Load the iris dataset. It is a pandas DataFrame
iris = datasets.load_iris(as_frame=True).data

In [None]:
# Print the first rows
iris.head()

**Recall**: Supervised learning is one of the main approaches of Machine Learning.

It consists of trying to predict a **target** variable using **features** as predictors.

**What is a classification task?**

It is a supervised learning task in which the target variable is categorical: the goal is to predict a class from the features.


For the Iris dataset, given our feature vector **x**, can we predict the correct species (i.e. class label) **y**?

In [None]:
# We can also load the species labels using scikit-learn
y = datasets.load_iris(as_frame=True).target

In [None]:
# Print first values of y
y.head()

In [None]:
# Count the number of rows per target class
y.value_counts()
# Notice the target class is encoded as an integer value

In [None]:
# Get the class names
species_names = datasets.load_iris(as_frame=True).target_names

# Print the correspondence between integer values and species names
for index, name in enumerate(species_names):
    print(f"class {index} = {name}")

# Map the target integer values to species names
y_species = y.apply(lambda index: species_names[index])

y_species.head()

### **Exercise**: Explore the dataset (10 min)

### Nearest Neighbor Classification

Select two features: petal length and petal width

In [None]:
feature_1 = "petal length (cm)"
feature_2 = "petal width (cm)"
x = iris[[feature_1, feature_2]]

Each datapoint has some <span style="color: deepskyblue;">features</span> and a <span style="color: coral;">class label</span>

In [None]:
# import seaborn and matplotlib to make some plots
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Visualize the dataset
plt.figure(figsize=(10, 6))

# plot each data point (x = feature1, y = feature2) and the species in color
ax = sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
    style=y_species,
    s=50,
)

# select a single data point
sample = iris.iloc[60]

# Add text to point to single data point
ax.annotate(
    text=f"({feature_1}, {feature_2}), Species",
    xy=(sample[feature_1] + 0.1, sample[feature_2] - 0.05),
    xytext=(sample[feature_2] + 0.5, sample[feature_2] - 0.3),
    fontsize=12,
    arrowprops={
        "width": 1,
        "headwidth": 6,
        "headlength": 6,
        "edgecolor": "black",
        "facecolor": "black",
    },
);

Given a <span style="color: deepskyblue;">new</span> datapoint, how can we determine its <span style="color: coral">class</span>?

In [None]:
# create new test point
test_point = [3.8, 1.6]  # Petal length cm, Petal width cm

In [None]:
plt.figure(figsize=(10, 6))

# plot the species type in color
ax = sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
    style=y_species,
    s=50,
)

# plot the new test point
ax.scatter(x=[test_point[0]], y=[test_point[1]], color="deepskyblue")

# add "new" text and arrow pointing at new test point
ax.annotate(
    text="New",
    xy=(test_point[0] - 0.05, test_point[1] + 0.02),
    xytext=(test_point[0] - 1, test_point[1] + 0.3),
    fontsize=12,
    color="deepskyblue",
    arrowprops={
        "width": 1,
        "headwidth": 6,
        "headlength": 6,
        "edgecolor": "deepskyblue",
        "facecolor": "deepskyblue",
    },
);

**Simple idea**: Assign the class of the nearest point in the dataset.

How to find the nearest point in our dataset to a given test point?

In [None]:
plt.figure(figsize=(10, 6))

# plot the species type in color
ax = sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
    style=y_species,
    s=50,
    zorder=2,
)

# plot a line from the test point to each point in the dataset
for _, flower in iris.iterrows():
    ax.plot(
        [test_point[0], flower[feature_1]],
        [test_point[1], flower[feature_2]],
        color="gray",
        linewidth=1,
        alpha=0.5,
        zorder=1,
    )

# plot the test point
ax.scatter(x=[test_point[0]], y=[test_point[1]], color="deepskyblue", s=100, zorder=2);

Compute **similarity** between two feature points.

Use *Euclidean* distance (based on the Pythagorean theorem).

![euclidean distance](https://upload.wikimedia.org/wikipedia/commons/5/55/Euclidean_distance_2d.svg)
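
Written out for two points $p = (p_1, p_2)$ and $q = (q_1, q_2)$ in the plane, the distance illustrated above is:

$$
d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}
$$

With more than two features, the same expression gains one squared difference per additional feature.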

In [None]:
# Import numpy for math functions
import numpy as np


# Implementation of distance between two points
def compute_distance(point1, point2):
    return np.sqrt((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2)

And find the minimum

In [None]:
# compute the distance from the test point to every example in the dataset
distance = iris.apply(
    lambda row: compute_distance(test_point, [row[feature_1], row[feature_2]]),
    axis=1,
)

# print the first results
distance.head()

In [None]:
# find the point in the dataset that is closest to the test point
# and record its distance
closest_point = iris.iloc[distance.argmin()]
distance_to_closest = distance.min()
print("Distance to closest point: ", distance_to_closest)

In [None]:
# plot training dataset
sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
)

# plot the test point as an 'x'
sns.scatterplot(
    x=[test_point[0]],
    y=[test_point[1]],
    marker="x",
    label="test point",
    color="black",
)

# plot a ring around the nearest datapoint
sns.scatterplot(
    x=[closest_point[feature_1]],
    y=[closest_point[feature_2]],
    marker="o",
    label="nearest training point",
    edgecolor="black",
    facecolor="none",
);

In [None]:
# assume the test point is the same class as the datapoint it is closest to
predicted_species = y_species[distance.argmin()]
print(f"Predicted species: {predicted_species}")

**Summary: The nearest neighbor algorithm**

1. Given a test point x
2. Compute the distance between x and every datapoint in the dataset
3. Assign x the class of the closest datapoint

Here is a quick implementation of the nearest neighbor algorithm

In [None]:
def nearest_neighbour(test_point):
    # compute the distance from the test point to every example in the dataset
    distance = iris.apply(
        lambda row: compute_distance(test_point, [row[feature_1], row[feature_2]]),
        axis=1,
    )

    # Get index where distance is minimum
    index_of_min_distance = distance.argmin()

    # find the point in the dataset that is closest to the test point
    closest_point = iris.iloc[index_of_min_distance]

    # assume the test point is the same class as the datapoint it is closest to
    predicted_species = y_species[index_of_min_distance]

    return predicted_species, closest_point
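
The same three steps can also be written without a row-by-row `apply`, using NumPy array operations. The sketch below is a hypothetical helper (`nearest_neighbour_numpy` is not part of the notebook) run on a tiny made-up dataset.

```python
import numpy as np


def nearest_neighbour_numpy(points, labels, test_point):
    """Return the label and index of the training point closest to test_point."""
    points = np.asarray(points, dtype=float)
    # Subtract the test point from every row and take the Euclidean norm per row
    distances = np.linalg.norm(points - np.asarray(test_point, dtype=float), axis=1)
    index = int(np.argmin(distances))
    return labels[index], index


# Tiny made-up dataset: (petal length, petal width) in cm
toy_points = [[1.4, 0.2], [4.7, 1.4], [6.0, 2.5]]
toy_labels = ["setosa", "versicolor", "virginica"]

label, index = nearest_neighbour_numpy(toy_points, toy_labels, [3.8, 1.6])
print(label, index)
```

Computing all distances in one array operation is usually much faster than looping over DataFrame rows, although the amount of work still grows with the size of the dataset.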

Scikit-learn also provides an easy way of building Nearest Neighbor Classification Models.

In [None]:
# Import the KNeighborsClassifier from scikit learn
from sklearn.neighbors import KNeighborsClassifier

# create a model instance
nn_model = KNeighborsClassifier(n_neighbors=1)

# fit it to the iris dataset
nn_model.fit(x, y_species)

The model can be used to make predictions for new points. Let's try it out on a new test point, check that it agrees with our own implementation, and plot the prediction.

In [None]:
# use it to predict the species of a test point
test_point_2 = np.array([2.1, 0.7])

predicted_species = nn_model.predict(test_point_2.reshape(1, -1))[0]

print(predicted_species)

In [None]:
# Make sure it is the same result as our algorithm
predicted_species_ours, closest_point_2 = nearest_neighbour(test_point_2)
print(f"Predictions are equal = {predicted_species == predicted_species_ours}")

In [None]:
# plot training dataset
ax = sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
)

# plot the test point as an 'x'
sns.scatterplot(
    x=[test_point_2[0]],
    y=[test_point_2[1]],
    marker="x",
    label="test point",
    color="black",
)

# plot a ring around the nearest datapoint
sns.scatterplot(
    x=[closest_point_2[feature_1]],
    y=[closest_point_2[feature_2]],
    marker="o",
    label="nearest training point",
    edgecolor="black",
    facecolor="none",
);

The model assigns a species to every point in feature space.

**Decision regions** are formed by points that are assigned to the same species. 

The regions are separated by **decision boundaries** where the model is unsure what class to assign.

The next cell contains some function definitions to plot decision regions and boundaries. You can safely ignore it, but make sure to run the code block.

In [None]:
# @title plot decision boundary definition
# IGNORE: here we define functions to plot the decision boundary.
# Their implementation is not relevant.

from functools import reduce

from sklearn.base import is_regressor
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import _safe_indexing
from sklearn.utils.validation import _is_arraylike, _num_features


def _is_arraylike_not_scalar(array):
    return _is_arraylike(array) and not np.isscalar(array)


def _check_boundary_response_method(estimator, response_method):
    has_classes = hasattr(estimator, "classes_")
    if has_classes and _is_arraylike_not_scalar(estimator.classes_[0]):
        msg = "Multi-label and multi-output multi-class classifiers are not supported"
        raise ValueError(msg)

    if has_classes and len(estimator.classes_) > 2:
        if response_method not in {"auto", "predict"}:
            msg = (
                "Multiclass classifiers are only supported when response_method is"
                " 'predict' or 'auto'"
            )
            raise ValueError(msg)
        methods_list = ["predict"]
    elif response_method == "auto":
        methods_list = ["decision_function", "predict_proba", "predict"]
    else:
        methods_list = [response_method]

    prediction_method = [getattr(estimator, method, None) for method in methods_list]
    prediction_method = reduce(lambda x, y: x or y, prediction_method)
    if prediction_method is None:
        raise ValueError(
            f"{estimator.__class__.__name__} has none of the following attributes: "
            f"{', '.join(methods_list)}."
        )

    return prediction_method


def plot_decision_boundary(
    estimator,
    X,
    *,
    grid_resolution=100,
    eps=1.0,
    plot_method="contourf",
    response_method="auto",
    xlabel=None,
    ylabel=None,
    ax=None,
    **kwargs,
):
    if not grid_resolution > 1:
        raise ValueError(
            "grid_resolution must be greater than 1. Got" f" {grid_resolution} instead."
        )

    if not eps >= 0:
        raise ValueError(f"eps must be greater than or equal to 0. Got {eps} instead.")

    possible_plot_methods = ("contourf", "contour", "pcolormesh")
    if plot_method not in possible_plot_methods:
        available_methods = ", ".join(possible_plot_methods)
        raise ValueError(
            f"plot_method must be one of {available_methods}. "
            f"Got {plot_method} instead."
        )

    num_features = _num_features(X)
    if num_features != 2:
        raise ValueError(f"n_features must be equal to 2. Got {num_features} instead.")

    x0, x1 = _safe_indexing(X, 0, axis=1), _safe_indexing(X, 1, axis=1)

    x0_min, x0_max = x0.min() - eps, x0.max() + eps
    x1_min, x1_max = x1.min() - eps, x1.max() + eps

    xx0, xx1 = np.meshgrid(
        np.linspace(x0_min, x0_max, grid_resolution),
        np.linspace(x1_min, x1_max, grid_resolution),
    )

    if hasattr(X, "iloc"):
        # we need to preserve the feature names and therefore get an empty dataframe
        X_grid = X.iloc[[], :].copy()
        X_grid.iloc[:, 0] = xx0.ravel()
        X_grid.iloc[:, 1] = xx1.ravel()
    else:
        X_grid = np.c_[xx0.ravel(), xx1.ravel()]

    pred_func = _check_boundary_response_method(estimator, response_method)
    response = pred_func(X_grid)

    # convert classes predictions into integers
    if pred_func.__name__ == "predict" and hasattr(estimator, "classes_"):
        encoder = LabelEncoder()
        encoder.classes_ = estimator.classes_
        response = encoder.transform(response)

    if response.ndim != 1:
        if is_regressor(estimator):
            raise ValueError("Multi-output regressors are not supported")

        # TODO: Support pos_label
        response = response[:, 1]

    if xlabel is None:
        xlabel = X.columns[0] if hasattr(X, "columns") else ""

    if ylabel is None:
        ylabel = X.columns[1] if hasattr(X, "columns") else ""

    if plot_method not in ("contourf", "contour", "pcolormesh"):
        raise ValueError("plot_method must be 'contourf', 'contour', or 'pcolormesh'")

    if ax is None:
        _, ax = plt.subplots()

    plot_func = getattr(ax, plot_method)

    surface_ = plot_func(xx0, xx1, response.reshape(xx0.shape), **kwargs)

    if xlabel is not None or not ax.get_xlabel():
        xlabel = xlabel if xlabel is None else xlabel
        ax.set_xlabel(xlabel)

    if ylabel is not None or not ax.get_ylabel():
        ylabel = ylabel if ylabel is None else ylabel
        ax.set_ylabel(ylabel)

    return ax

In [None]:
from matplotlib.colors import ListedColormap

cmap_light = ListedColormap(["lightblue", "peachpuff", "palegreen"])

# Plot the decision regions
ax = plot_decision_boundary(nn_model, x, cmap=cmap_light)

# Plot the decision boundary
ax = plot_decision_boundary(
    nn_model,
    x,
    plot_method="contour",
    ax=ax,
    levels=[0, 1],
    colors="black",
)

# Overlay data points from iris dataset
ax = sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
    style=y_species,
)

Now you will see some problems with the nearest neighbor algorithm. These are:

1. Overconfidence
2. Memory and Speed
3. Sensitive to Noise
4. Sensitive to changes in Scale

**Overconfidence**

When making inference on a point far from the training points, the model can be very confident about its prediction.

Example: the following test point is much closer to any *virginica* point than any other species.

In [None]:
# New test point far from all other data points
test_point_3 = [10, 8]

_, closest_point_3 = nearest_neighbour(test_point_3)

# plot training dataset
sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
)

# plot the test point as an 'x'
sns.scatterplot(
    x=[test_point_3[0]],
    y=[test_point_3[1]],
    marker="x",
    label="test point",
    color="black",
)

# plot a ring around the nearest datapoint
sns.scatterplot(
    x=[closest_point_3[feature_1]],
    y=[closest_point_3[feature_2]],
    marker="o",
    label="nearest training point",
    edgecolor="black",
    facecolor="none",
);

**Question**: Is this a reasonable behaviour?
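
One way to probe this behaviour is to compare the class probabilities reported by the model with the distance to the nearest training point. The sketch below refits a 1-nearest-neighbor model on the two petal features so that it is self-contained; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_data = load_iris(as_frame=True)
petals = iris_data.data[["petal length (cm)", "petal width (cm)"]].to_numpy()

model = KNeighborsClassifier(n_neighbors=1).fit(petals, iris_data.target)

far_point = np.array([[10.0, 8.0]])
# With a single neighbour, the reported probabilities only reflect that neighbour's label
probabilities = model.predict_proba(far_point)[0]
# kneighbors returns the distance to the nearest training point
distance, _ = model.kneighbors(far_point)

print(probabilities)
print(f"distance to nearest training point: {distance[0, 0]:.2f} cm")
```

The reported probability carries no information about how far the test point lies from the training data, so the distance has to be inspected separately.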

**Memory and speed**

A Nearest Neighbor model needs to store every point in the training dataset.

Additionally, at inference time it computes the distance from the test point to every point in the training set.

For large datasets this becomes infeasible.

In [None]:
# Import a performance timer from the standard library
from time import perf_counter

test_point_4 = np.array([4, 4])


def measure_nn_speed(n_samples, test_point=test_point_4):
    # Generate toy dataset with scikit-learn dataset functions.
    X, y = datasets.make_classification(
        n_samples=n_samples,
        n_features=2,  # Two features
        n_informative=2,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,  # Two classes
    )

    # Create a NN model
    model = KNeighborsClassifier(n_neighbors=1, algorithm="brute", n_jobs=1)

    # Measure fit time
    start_fit = perf_counter()
    model.fit(X, y)
    fit_time = perf_counter() - start_fit

    # Measure prediction time
    start_predict = perf_counter()
    model.predict(test_point.reshape(1, -1))
    predict_time = perf_counter() - start_predict

    return fit_time, predict_time

In [None]:
# Select different sizes of samples. Logspace will return exponentially separated
# points. 10, 100, 1000, ...
n_samples_options = np.logspace(start=1, stop=8, num=8, dtype=np.int32)

# Measure time to fit and predict for each dataset size
fit_time, predict_time = zip(
    *[measure_nn_speed(n_samples) for n_samples in n_samples_options]
)

# Plot times
plt.plot(n_samples_options, fit_time, label="fit")
plt.plot(n_samples_options, predict_time, label="predict")
plt.xscale("log")
plt.yscale("log")
plt.ylabel("duration (s)")
plt.xlabel("dataset size (# samples)")
plt.legend();

**Noise sensitivity**

If a single mistake is introduced in the dataset the **decision boundaries** can change drastically.

In [None]:
# The species of the element at index 134 is virginica
index = 134
print(y_species[index])

# Make a copy of the targets
y_species_corrupted = y_species.copy()

# And change the label of a single entry
y_species_corrupted[index] = "versicolor"

In [None]:
# Fit new NN model with corrupted labels
nn_model_with_corrupted_data = KNeighborsClassifier(n_neighbors=1).fit(
    x, y_species_corrupted
)

In [None]:
plt.figure(figsize=(15, 5))

# Create a subplot on the left
ax1 = plt.subplot(1, 2, 1)
ax1.set_title("Decision regions and boundary of original model")

# Plot the decision regions and boundary of original model
plot_decision_boundary(nn_model, x, cmap=cmap_light, ax=ax1)

plot_decision_boundary(
    nn_model,
    x,
    plot_method="contour",
    ax=ax1,
    levels=[0, 1],
    colors="black",
)

# Overlay data points from iris dataset
sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species,
    style=y_species,
    ax=ax1,
)

# Create a subplot on the right
ax2 = plt.subplot(1, 2, 2)

ax2.set_title("Decision regions and boundary of model trained with corrupted labels")

# Plot the decision regions and boundary of model with corrupted data
plot_decision_boundary(nn_model_with_corrupted_data, x, cmap=cmap_light, ax=ax2)

plot_decision_boundary(
    nn_model_with_corrupted_data,
    x,
    plot_method="contour",
    ax=ax2,
    levels=[0, 1],
    colors="black",
)

# Overlay data points from iris dataset
sns.scatterplot(
    data=iris,
    x=feature_1,
    y=feature_2,
    hue=y_species_corrupted,
    style=y_species_corrupted,
    ax=ax2,
);

**Sensitivity to scale**

So far we have been using **cm** as units for length.

What happens if we change cm to meters for a single feature? 

How does this choice affect predictions?

In [None]:
# Modify the feature array so that the first feature is in meters
x2 = x.copy()
x2["petal length (m)"] = x2["petal length (cm)"] / 100
x2 = x2.reindex(columns=["petal length (m)", "petal width (cm)"])

In [None]:
# Fit model with new feature array
nn_model_2 = KNeighborsClassifier(n_neighbors=1).fit(x2, y_species)

In [None]:
# We wish to classify a new flower. These are the measurements of the flower with
# two different units of measurements
test_point_5_cm = np.array([2.7, 0.7])  # (petal length cm, petal width cm)
test_point_5_m = np.array([0.027, 0.7])  # (petal length m, petal width cm)

In [None]:
# Predict with cm
predicted_species_cm = nn_model.predict(test_point_5_cm.reshape(1, -1))[0]

# Predict with meters
predicted_species_m = nn_model_2.predict(test_point_5_m.reshape(1, -1))[0]

print(
    f"Species predictions: with cm as units = {predicted_species_cm}, with m as units = {predicted_species_m}"
)
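
A common way to reduce this dependence on units is to normalize the features before computing distances. The sketch below is one possible approach, assuming standardization with `StandardScaler` inside a `scikit-learn` pipeline; it refits both models so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris_data = load_iris(as_frame=True)
petals_cm = iris_data.data[["petal length (cm)", "petal width (cm)"]].to_numpy()

# Same data with the first column converted from cm to m
petals_mixed = petals_cm.copy()
petals_mixed[:, 0] /= 100

# Each pipeline standardizes the features before the nearest neighbor search
model_cm = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model_mixed = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model_cm.fit(petals_cm, iris_data.target)
model_mixed.fit(petals_mixed, iris_data.target)

point_cm = np.array([[2.7, 0.7]])
point_mixed = np.array([[0.027, 0.7]])

prediction_cm = model_cm.predict(point_cm)[0]
prediction_mixed = model_mixed.predict(point_mixed)[0]
print(prediction_cm, prediction_mixed)
```

After standardization each feature has zero mean and unit variance, so rescaling a column by a constant factor no longer changes which training point is nearest.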

**Summary**

The nearest neighbor algorithm is intuitive and simple to implement, but it:

* is very sensitive to label errors and feature scaling (recall the underfitting/overfitting discussion)
* requires lots of memory and computation

What are some alternatives? 

Is there a way that involves less computation?

### Support Vector Machines

In [None]:
# Select only data of the setosa and versicolor species
separable_dataset = iris[y_species.isin(["setosa", "versicolor"])]
separable_labels = y_species[y_species.isin(["setosa", "versicolor"])]

In [None]:
# Plot data points
sns.scatterplot(
    data=separable_dataset,
    x=feature_1,
    y=feature_2,
    hue=separable_labels,
    style=separable_labels,
);

Data points from the different species are clearly separated. In fact, they can be separated by a line, and new points can be classified depending
on which side of the line they fall.

This is what a linear classifier does, in essence. Here we will use the linear support vector classifier (SVC). See [here](https://scikit-learn.org/stable/modules/svm.html#svc) for more details.

In [None]:
# Import the Linear SVC model from the SVM module in scikit-learn
from sklearn.svm import LinearSVC

In [None]:
# Fit a linear support vector classifier (SVC) on the separable dataset
clf = LinearSVC().fit(
    separable_dataset[[feature_1, feature_2]],
    separable_labels,
)

# Plot the decision regions of the linear classifier
ax = plot_decision_boundary(
    clf,
    separable_dataset[[feature_1, feature_2]],
    cmap=ListedColormap(["lightblue", "peachpuff"]),
)

# Plot the decision boundary of the linear classifier
plot_decision_boundary(
    clf,
    separable_dataset[[feature_1, feature_2]],
    plot_method="contour",
    levels=[0],
    colors="black",
    ax=ax,
)

# Overlay the dataset points
sns.scatterplot(
    data=separable_dataset,
    x=separable_dataset[feature_1],
    y=separable_dataset[feature_2],
    hue=separable_labels,
    style=separable_labels,
    ax=ax,
);

This linear model is small in memory and very fast.

But in most cases it will not generate perfect predictions.
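
To see why a linear model is so compact, the sketch below refits a `LinearSVC` on the setosa and versicolor petal measurements and reads out its learned parameters. It is self-contained, and the manual prediction rule assumes the usual binary convention that a positive score maps to the second class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

iris_data = load_iris(as_frame=True)
# Keep setosa (0) and versicolor (1), and the two petal features
mask = iris_data.target.isin([0, 1])
X = iris_data.data.loc[mask, ["petal length (cm)", "petal width (cm)"]].to_numpy()
y = iris_data.target[mask].to_numpy()

clf = LinearSVC().fit(X, y)

# The whole fitted model is two weights and one intercept
w = clf.coef_[0]
b = clf.intercept_[0]
print(f"weights = {w}, intercept = {b:.3f}")

# Classify by the sign of w . x + b: positive means class 1 (versicolor)
scores = X @ w + b
manual_predictions = (scores > 0).astype(int)
print(np.array_equal(manual_predictions, clf.predict(X)))
```

Unlike the nearest neighbor model, nothing from the training set needs to be stored: a prediction costs one dot product, regardless of how many points were used for fitting.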

In [None]:
# Select only data of the versicolor and virginica species
non_separable_species = ["virginica", "versicolor"]
non_separable_dataset = iris[y_species.isin(non_separable_species)]
non_separable_labels = y_species[y_species.isin(non_separable_species)]

In [None]:
# Plot data points
sns.scatterplot(
    data=non_separable_dataset,
    x=feature_1,
    y=feature_2,
    hue=non_separable_labels,
    style=non_separable_labels,
);

No line will cut the two sets of points cleanly.

When we fit a linear model, prediction errors are unavoidable.

In [None]:
# Fit a linear support vector classifier (SVC) on the non-separable dataset
clf = LinearSVC().fit(
    non_separable_dataset[[feature_1, feature_2]],
    non_separable_labels,
)


# Plot the decision regions of the linear classifier
ax = plot_decision_boundary(
    clf,
    non_separable_dataset[[feature_1, feature_2]],
    cmap=ListedColormap(["lightblue", "peachpuff"]),
)

# Plot the decision boundary of the linear classifier
plot_decision_boundary(
    clf,
    non_separable_dataset[[feature_1, feature_2]],
    plot_method="contour",
    levels=[0],
    ax=ax,
    colors="black",
    zorder=1,
)

# Overlay the dataset points
sns.scatterplot(
    data=non_separable_dataset,
    x=feature_1,
    y=feature_2,
    hue=non_separable_labels,
    style=non_separable_labels,
    ax=ax,
    zorder=2,
);

Support vector machines can be made non-linear by applying the so-called kernel trick. If you are interested in the details, check the book [An Introduction to Statistical Learning](https://hastie.su.domains/ISLR2/ISLRv2_website.pdf).
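
As a pointer to what a kernel does: `SVC` in `scikit-learn` defaults to the radial basis function (RBF) kernel, which scores the similarity of two feature vectors $x$ and $x'$ as

$$
K(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^2\right)
$$

Here $\gamma$ corresponds to the `gamma` argument of `SVC`. Larger values make the influence of each training point more local, which allows more intricate decision boundaries.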

In [None]:
# Import the support vector classifier model from scikit-learn
from sklearn.svm import SVC

In [None]:
# Fit a non-linear support vector machine on the non-linearly-separable dataset
clf = SVC(C=100, gamma=10).fit(
    non_separable_dataset[[feature_1, feature_2]],
    non_separable_labels,
)

# Plot the decision regions of the non-linear classifier
ax = plot_decision_boundary(
    clf,
    non_separable_dataset[[feature_1, feature_2]],
    cmap=ListedColormap(["lightblue", "peachpuff"]),
)

# Plot the decision boundary of the non-linear classifier
plot_decision_boundary(
    clf,
    non_separable_dataset[[feature_1, feature_2]],
    plot_method="contour",
    levels=[0],
    ax=ax,
    colors="black",
    zorder=1,
)

# Overlay the dataset points
sns.scatterplot(
    data=non_separable_dataset,
    x=feature_1,
    y=feature_2,
    hue=non_separable_labels,
    style=non_separable_labels,
    ax=ax,
    zorder=2,
);

### Decision Trees

**Another simple idea**: Use simple binary decisions to discriminate between points. Use a sequence or tree of decisions to bin a test point into its correct species.

Example binary decisions: whether the petal length is ≤ 5.1 cm, or whether the petal width is ≤ 1.75 cm.

Decisions can be nested: if petal width < 1.75 cm and petal length < 4.95 cm, then predict versicolor.

Each decision splits feature space in two.
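
A nested decision of this kind is just a chain of `if` statements. The sketch below is a hand-written rule using the same illustrative thresholds as the plot in the next cell; the species assigned to each region is an assumption for the versicolor/virginica subset, not the output of a fitted model.

```python
def classify_by_rules(petal_length, petal_width):
    """Hand-written two-level decision rule (thresholds are illustrative)."""
    # First decision: split on petal width
    if petal_width >= 1.75:
        return "virginica"
    # Second decision, only reached when width < 1.75: split on petal length
    if petal_length < 4.95:
        return "versicolor"
    return "virginica"


for length, width in [(4.0, 1.3), (5.8, 2.2), (5.3, 1.5)]:
    print((length, width), "->", classify_by_rules(length, width))
```

Each `if` corresponds to one axis-aligned cut through feature space, which is why decision regions of trees are always made of rectangles.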

In [None]:
# Plot the dataset points
ax = sns.scatterplot(
    data=non_separable_dataset,
    x=feature_1,
    y=feature_2,
    hue=non_separable_labels,
    style=non_separable_labels,
)

# Draw horizontal line at y = 1.75
ax.axhline(1.75, color="gray", linewidth=3, alpha=0.2)

# Draw vertical line at x = 4.95
ax.axvline(4.95, color="blue", linewidth=3, alpha=0.2, ymax=0.5)

# add text to label regions
ax.text(4.2, 2.15, "width >= 1.75")
ax.text(2.9, 1.5, "width < 1.75\n& length < 4.95")
ax.text(5.5, 1.1, "width < 1.75\n& length >= 4.95");

Decision trees for classification can be created algorithmically. 

Multiple algorithms are available; here we will use the default algorithm from `scikit-learn` (an optimized version of CART).
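
To give an idea of how a tree can be grown automatically, the sketch below searches for the single threshold on one feature that minimizes the Gini impurity, which is the default splitting criterion of `DecisionTreeClassifier`. It is a simplified illustration on made-up data with hypothetical helper names, not the actual `scikit-learn` implementation.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions**2)


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return the best split."""
    values, labels = np.asarray(values), np.asarray(labels)
    unique = np.unique(values)
    candidates = (unique[:-1] + unique[1:]) / 2
    best = (None, np.inf)
    for threshold in candidates:
        left, right = labels[values <= threshold], labels[values > threshold]
        # Weighted average impurity of the two child nodes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up petal widths (cm) and species labels
widths = [1.0, 1.3, 1.5, 1.8, 2.1, 2.3]
species = ["versicolor"] * 3 + ["virginica"] * 3

threshold, impurity = best_threshold(widths, species)
print(threshold, impurity)
```

A full tree repeats this search over every feature, picks the best split, and then applies the same procedure recursively to each of the two resulting subsets.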

In [None]:
# Import the Decision Tree Classifier model from scikit-learn
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Create a Decision Tree classifier model
# Check the scikit-learn documentation to see possible configurations
tree_model = DecisionTreeClassifier(max_depth=2, min_impurity_decrease=0.01)

# Fit to dataset
tree_model.fit(
    non_separable_dataset[[feature_1, feature_2]],
    non_separable_labels,
);

In [None]:
# Import plot_tree function from tree tools in scikit-learn
from sklearn.tree import plot_tree

# Create a new figure
_, ax = plt.subplots(figsize=(10, 6))

# Visualize the trained decision tree
plot_tree(
    tree_model,
    feature_names=[feature_1, feature_2],
    class_names=tree_model.classes_,
    impurity=False,
    label="root",
    rounded=True,
    ax=ax,
)

# Add labels to decisions
ax.text(0.43, 0.66, "yes")
ax.text(0.73, 0.66, "no");

In [None]:
# Plot the decision boundary of the decision tree classifier
ax = plot_decision_boundary(
    tree_model,
    non_separable_dataset[[feature_1, feature_2]],
    cmap=ListedColormap(["lightblue", "peachpuff"]),
)

# Draw the boundary between the regions as a black contour line
plot_decision_boundary(
    tree_model,
    non_separable_dataset[[feature_1, feature_2]],
    plot_method="contour",
    levels=[0],
    ax=ax,
    colors="black",
    zorder=1,
)

# Overlay the dataset points
sns.scatterplot(
    data=non_separable_dataset,
    x=feature_1,
    y=feature_2,
    hue=non_separable_labels,
    style=non_separable_labels,
    ax=ax,
    zorder=2,
);

In [None]:
# Create another tree with max_depth = 4
tree_model = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01)

# Fit to data
tree_model.fit(
    non_separable_dataset[[feature_1, feature_2]],
    non_separable_labels,
)

# Plot the decision regions of the decision tree classifier
ax = plot_decision_boundary(
    tree_model,
    non_separable_dataset[[feature_1, feature_2]],
    cmap=ListedColormap(["lightblue", "peachpuff"]),
)

# Draw the boundary between the regions as a black contour line
plot_decision_boundary(
    tree_model,
    non_separable_dataset[[feature_1, feature_2]],
    plot_method="contour",
    levels=[0],
    ax=ax,
    colors="black",
    zorder=1,
)

# Overlay the dataset points
sns.scatterplot(
    data=non_separable_dataset,
    x=feature_1,
    y=feature_2,
    hue=non_separable_labels,
    style=non_separable_labels,
    ax=ax,
    zorder=2,
);

### Random Forest

### **Exercise**: (10 min)

Research what a random forest is.

Build a random forest classifier with `scikit-learn`.

Plot its decision boundary and regions.

Go to this [website](http://cs.stanford.edu/people/karpathy/svmjs/demo/demoforest.html) and play with the random forest parameters.

## 4. How to evaluate your model?

You have seen multiple models for Iris flower classification.

Which model is the best fit?

How can we be confident about the predictions of a model, or evaluate its performance?

### Training and test split

We could use the training data to count the number of correct and erroneous predictions.

However, this is a bad choice: the Nearest Neighbor model will always have 0 errors on its own training data (can you see why?).

In general, some models are very flexible and can fit any dataset.

Others are rigid - like the linear SVM - and will not perfectly fit all datasets.

Using the training data will not provide a clear picture of prediction accuracy for new points.

**Solution**: Split the dataset into two parts: one for **training** another for evaluation or **testing**.
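Conceptually, a random train/test split just shuffles the row indices and cuts the shuffled list in two. The sketch below shows the idea in plain Python. It is not how `scikit-learn` implements `train_test_split`, and the function name `simple_split` and the seed are illustrative choices.

```python
import random


def simple_split(n_samples, test_size=0.3, seed=0):
    """Return (train_indices, test_indices) for a random split."""
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(n_samples=150, test_size=0.3)
print(len(train_idx), "training points,", len(test_idx), "test points")
```

The key property is that no index appears in both lists, so the model is never evaluated on a point it has already seen.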

In [None]:
# Import the train_test_split function from scikit-learn module for model selection
from sklearn.model_selection import train_test_split

In [None]:
# Split dataset and labels into test and train. Test dataset is 30% of all data
train_x, test_x, train_y, test_y = train_test_split(x, y_species, test_size=0.3)

In [None]:
# draw the dataset
sns.scatterplot(data=iris, x=feature_1, y=feature_2, hue=y_species)

# with circles around the training set
sns.scatterplot(
    data=train_x,
    x=feature_1,
    y=feature_2,
    marker="o",
    edgecolor="black",
    facecolor="none",
    label="train set",
);

Splits are usually done randomly to avoid selection bias.

Sometimes random sampling leaves some classes over- or under-represented in the training or test dataset.

In this case stratified sampling is a better approach (such as `scikit-learn` [Stratified Shuffle Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn-model-selection-stratifiedshufflesplit)).
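As a minimal sketch of stratified sampling, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The imbalanced toy dataset and the variable names below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (made up): 90 "common" and 10 "rare" samples
toy_x = np.arange(100).reshape(-1, 1)
toy_y = np.array(["common"] * 90 + ["rare"] * 10)

# stratify=toy_y keeps the class proportions equal in both splits
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.3, stratify=toy_y, random_state=0
)

print("rare fraction in train:", np.mean(y_tr == "rare"))
print("rare fraction in test: ", np.mean(y_te == "rare"))
```

Without `stratify`, a purely random split of a dataset like this could easily leave very few "rare" samples in the test set.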

### Performance metrics

There are many measures of performance.

Accuracy, which is the percentage of correct predictions, is commonly used for classification.

Other metrics will provide different information on the model's performance.

See the list of [classification metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) available in `scikit-learn`.
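To see why more than one metric can be useful, the sketch below computes accuracy, recall and precision by hand for a made-up imbalanced problem. The labels and the "lazy" model that always predicts the majority class are invented for illustration.

```python
# Made-up ground truth: 8 "absent" and 2 "present" observations
truth = ["absent"] * 8 + ["present"] * 2
# A lazy model that always predicts the majority class
predicted = ["absent"] * 10

pairs = list(zip(truth, predicted))
true_pos = sum(t == "present" and p == "present" for t, p in pairs)
false_pos = sum(t == "absent" and p == "present" for t, p in pairs)
false_neg = sum(t == "present" and p == "absent" for t, p in pairs)

# Accuracy: fraction of all predictions that are correct
accuracy = sum(t == p for t, p in pairs) / len(pairs)
# Recall: fraction of the truly "present" cases that the model found
recall = true_pos / (true_pos + false_neg)
# Precision: fraction of "present" predictions that were right
# (undefined when the model never predicts "present")
n_predicted_pos = true_pos + false_pos
precision = true_pos / n_predicted_pos if n_predicted_pos else float("nan")

print(f"accuracy = {accuracy:.0%}, recall = {recall:.0%}, precision = {precision}")
```

The accuracy looks respectable even though the model never finds a single "present" case, which is exactly what recall reveals.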

In [None]:
# create new Nearest Neighbor model
model = KNeighborsClassifier(n_neighbors=1)

# fit with train split
model.fit(train_x, train_y)

# predict on the test data
test_predictions = model.predict(test_x)

# compare to ground truth
is_correct = test_predictions == test_y

# print first results
is_correct.head()

In [None]:
# compute accuracy
n_correct_predictions = is_correct.sum()
accuracy = n_correct_predictions / len(test_x)

print(f"Nearest Neighbor model accuracy = {accuracy:.1%}")

In [None]:
# scikit-learn provides an easy way of evaluating models
score = model.score(test_x, test_y)

print(f"Model score = {score:.1%}")

**Summary**

To evaluate a model:
1. Split dataset into train and test
2. Fit model with training data
3. Select relevant performance metrics
4. Evaluate with test data

This is done succinctly with `scikit-learn`:

In [None]:
# split dataset
train_x, test_x, train_y, test_y = train_test_split(x, y_species, test_size=0.3)

# create model
model = KNeighborsClassifier(n_neighbors=1)

# fit model with training data
model.fit(train_x, train_y)

# evaluate model with test data
# for classifiers, the score method of scikit-learn returns the mean accuracy
score = model.score(test_x, test_y)

print(f"Model score = {score:.1%}")

### **Exercise**: 

Compute the accuracy score of all previous models.

Which one is better?

## 5. What is a regression task?

A regression task is a supervised learning task in which the target is a numerical variable.

**Examples**

* Trying to predict future $CO_2$ levels for the next decade. Here the feature vector **x** = year and target variable **y** = $CO_2$ levels.

![historic atmospheric co2 data](https://research.noaa.gov/Portals/0/EasyGalleryImages/1/864/co2_data_mlo.png)

(taken from [NOAA research news](https://research.noaa.gov/article/ArtMID/587/ArticleID/2764/Coronavirus-response-barely-slows-rising-carbon-dioxide), Monday, June 7, 2021)

### Linear regression

Let's use `scikit-learn` to generate synthetic data for a regression task.

In [None]:
# Import make_regression function from scikit-learn datasets module
from sklearn.datasets import make_regression

# Generate a random dataset for regression with some noise and 200 points
features, target = make_regression(n_features=1, noise=10, n_samples=200)

# Use seaborn to generate a scatterplot
sns.scatterplot(x=features.flatten(), y=target);

A linear regression model assumes that there is a **linear** relation between the features and the target variable

$$ {\bf y} = m {\bf x} + b $$

The parameters $m$ (slope) and $b$ (bias) that best "fit" the data points can be found algorithmically.

How well a model fits the data is measured by some **loss** or error. Fitting the model means finding the parameters that minimize it.

In the case of the linear model, the loss is measured by the Mean Squared Error (MSE), but we won't delve into details here.
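For a single feature, the slope and intercept that minimize the MSE have a closed form (ordinary least squares). The sketch below computes them in plain Python on made-up data. The helper names `fit_line` and `mse` are hypothetical, and this is an illustration of the idea rather than the code `scikit-learn` runs internally.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    # The fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b


def mse(xs, ys, m, b):
    """Mean squared error of the line y = m * x + b on the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data lying close to the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

m, b = fit_line(xs, ys)
print(f"fitted line: m = {m:.2f}, b = {b:.2f}, MSE = {mse(xs, ys, m, b):.4f}")
# Any other line, such as y = 2x + 1, cannot have a smaller MSE on the same data
print(f"MSE of y = 2x + 1: {mse(xs, ys, 2.0, 1.0):.4f}")
```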

In [None]:
# Import the Linear Regression model from scikit-learn
from sklearn.linear_model import LinearRegression

# Fit a linear model to the example data
linear_model = LinearRegression().fit(features, target)

# Plot the datapoints. x = features, y = target value
ax = sns.scatterplot(x=features.flatten(), y=target)

x_min = features.min()
x_max = features.max()

# Generate a prediction using the linear model on the example data
pred = linear_model.predict([[x_min], [x_max]])

# Plot the predicted line
ax.plot([x_min, x_max], pred, color="black", linestyle="--", linewidth=2)

# Add labels to axis
ax.set_xlabel("x = Features")
ax.set_ylabel("y = Target");

Once fitted, a prediction for points outside the dataset is computed with the same formula.

$$ {\bf y_{pred}} = m {\bf x_{test}} + b $$

In [None]:
# Define a new test point at x = 2
test_point = [2]

# Use the linear model to predict its target value
predicted_value = linear_model.predict([test_point])[0]

# Plot the datapoints. x = features, y = target value
ax = sns.scatterplot(x=features.flatten(), y=target)

# Plot the predicted line
ax.plot([x_min, x_max], pred, color="black", linestyle="--", linewidth=2)

# Draw a point at the test point with its predicted value
plt.scatter(test_point, [predicted_value], color="red")

y_min = pred.min()

# Draw a vertical arrow from the x-axis at x = test_point to its predicted value
ax.arrow(
    2,
    y_min,
    0,
    predicted_value - y_min,
    color="red",
    head_width=0.1,
    head_length=10,
    length_includes_head=True,
)

# Draw a horizontal arrow from the point (x, y) = (test_point, predicted_value) to
# the y-axis at y = predicted_value
ax.arrow(
    test_point[0],
    predicted_value,
    -test_point[0] + x_min,
    0,
    color="red",
    head_width=10,
    head_length=0.1,
    length_includes_head=True,
)

# Add the linear formula to the plot
ax.text(-1, 50, "y = mx + b")

# Add labels to axis
ax.set_xlabel("x = Features")
ax.set_ylabel("y = Target");

### Nearest Neighbor Regression

In [None]:
# create another dataset
# here target_ideal is a nonlinear function of the features
features = np.arange(0, 100, 2.0)
target_ideal = np.sin(features / 10) + (features / 50) ** 2

# add some noise to our target variable target_ideal
target = target_ideal + np.random.normal(size=len(target_ideal)) * 0.3

In [None]:
# plot scatter points (x = features, y = target)
ax = sns.scatterplot(x=features, y=target)

# add title
ax.set_title("Non linear dataset")

# add labels to axis
ax.set_xlabel("x = Features")
ax.set_ylabel("y = Target");

As with classification, linear models are very rigid and will produce bad predictions here.

In [None]:
# fit linear model
lin = LinearRegression()
lin.fit(features.reshape(-1, 1), target)

# Predict on a range of test points
test_points = np.linspace(0, 100, 1000)
lin_reg_fit = lin.predict(test_points.reshape(-1, 1))

# plot the data and the fitted linear model
sns.scatterplot(x=features, y=target)

sns.lineplot(x=test_points, y=lin_reg_fit, color="red", label="linear regression");

How else can we predict target value of a test point using features?

**Simple idea revisited**: Use the nearest neighbor's target value as a prediction.
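The idea fits in a few lines of plain Python. The sketch below is a hand-written illustration for one-dimensional features, with a hypothetical function name and made-up data. It is not the `scikit-learn` implementation.

```python
def nearest_neighbor_predict(train_features, train_targets, query):
    """Predict the target of `query` as the target of its closest training point."""
    # Index of the training feature with the smallest distance to the query
    nearest = min(
        range(len(train_features)),
        key=lambda i: abs(train_features[i] - query),
    )
    return train_targets[nearest]


# Made-up one-dimensional training data
toy_features = [0.0, 10.0, 20.0, 30.0]
toy_targets = [0.1, 0.9, 1.2, 0.4]

for query in [2.0, 12.0, 26.0]:
    print(query, "->", nearest_neighbor_predict(toy_features, toy_targets, query))
```

Because each prediction copies a single training value, the fitted curve is a step function that jumps halfway between neighbouring training points.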

In [None]:
# Import Nearest Neighbor Regression model from scikit-learn
from sklearn.neighbors import KNeighborsRegressor

# fit nearest neighbor regression model
nn_reg = KNeighborsRegressor(n_neighbors=1)
nn_reg.fit(features.reshape(-1, 1), target)

# Predict on a range of test points
test_points = np.linspace(0, 100, 1000)
nn_reg_fit = nn_reg.predict(test_points.reshape(-1, 1))

# plot the data and the fitted nearest neighbor model
sns.scatterplot(x=features, y=target)

sns.lineplot(x=test_points, y=nn_reg_fit, color="red", label="nearest neighbor");

Nearest neighbor regression suffers from the same problems as nearest neighbor classification:
    
* Sensitive to noise
* Heavy on computation and memory

### Random Forest Regression

Similarly the Random Forest model can be adapted for regression tasks

In [None]:
# Import Random Forest Regression model from scikit-learn
from sklearn.ensemble import RandomForestRegressor

# fit random forest model
rf_reg = RandomForestRegressor()
rf_reg.fit(features.reshape(-1, 1), target)

# predict on a range of test points
test_points = np.linspace(0, 100, 1000)
rf_reg_fit = rf_reg.predict(test_points.reshape(-1, 1))

# plot the data and the fitted random forest model
sns.scatterplot(x=features, y=target)

sns.lineplot(x=test_points, y=rf_reg_fit, color="red", label="random forest");

In [None]:
# plot training dataset
sns.scatterplot(x=features, y=target, color="black", label="train dataset", zorder=4)

# plot all the fitted models
sns.lineplot(x=test_points, y=rf_reg_fit, label="random forest")
sns.lineplot(x=test_points, y=nn_reg_fit, label="nearest neighbors")
sns.lineplot(x=test_points, y=lin_reg_fit, label="linear");

**Which is the best predictive model?**

**How to evaluate regression models?**

The procedure is similar to classification, but uses a different metric.

How do we measure a good fit?

One option is to use **Mean Squared Error (MSE)**:

$$ MSE = \frac{1}{n} \sum_{i = 1}^{n} (y_{true, i} - y_{pred, i})^2 $$

In [None]:
# plot training dataset
ax = sns.scatterplot(x=features, y=target)

# predict at the training points
# (lin_reg_fit holds predictions at 1000 other test points)
lin_train_pred = lin.predict(features.reshape(-1, 1))

# plot fitted line
sns.lineplot(x=features, y=lin_train_pred, color="red", label="linear regression")

# plot errors as vertical segments between true and predicted values
for x_i, y_true_i, y_pred_i in zip(features, target, lin_train_pred):
    ax.plot([x_i, x_i], [y_true_i, y_pred_i], alpha=0.5, color="black", linewidth=0.5)

In [None]:
# compute the array of differences in prediction and true value
error = target - lin_train_pred

# compute the square of each error
squared_error = error ** 2

# compute the mean
MSE = squared_error.mean()

print(f"MSE of linear model on training dataset: {MSE}")

Scikit-learn implements MSE and offers multiple regression metrics.

Each metric has its benefits and pitfalls. Choice depends on use case.

Visit this [site](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) to see available regression metrics.
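As an example of how the choice of metric matters, the sketch below compares MSE with Mean Absolute Error (MAE), another of the regression metrics listed there, on made-up values with and without an outlier. Both metrics are written by hand here with hypothetical function names.

```python
def mse_by_hand(truth, preds):
    """Mean of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth)


def mae_by_hand(truth, preds):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(truth, preds)) / len(truth)


# Made-up values: every prediction is off by exactly 1
preds = [11.0, 11.0, 15.0, 15.0]
truth_clean = [10.0, 12.0, 14.0, 16.0]
# Same predictions, but the last true value is an outlier (off by 10)
truth_outlier = [10.0, 12.0, 14.0, 25.0]

for name, truth in [("no outlier", truth_clean), ("with outlier", truth_outlier)]:
    print(
        f"{name:>12}: MSE = {mse_by_hand(truth, preds):.2f}, "
        f"MAE = {mae_by_hand(truth, preds):.2f}"
    )
```

Squaring makes MSE much more sensitive to a single large error than MAE. That is useful when large errors are especially costly and a drawback when the data contains unreliable measurements.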

In [None]:
# Scikit-learn provides tools for easy computation of MSE
from sklearn.metrics import mean_squared_error

# Use scikit-learn function to compute MSE on the training points
# arguments are (true values, predicted values)
MSE = mean_squared_error(target, lin.predict(features.reshape(-1, 1)))

print(f"MSE of linear model on training dataset: {MSE}")

Split the dataset into train and test sets to make a fair comparison between different models.

In [None]:
# split features and target into train and test. Test is 30% of all data.
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.3)

# iterate over model types
for model in [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=1),
    RandomForestRegressor(),
]:
    # fit the model to training data
    model.fit(train_x.reshape(-1, 1), train_y)

    # use fitted model to predict in test data
    y_predict = model.predict(test_x.reshape(-1, 1))

    # compute MSE using the ground truth and the predictions
    mse = mean_squared_error(test_y, y_predict)

    print(f"{str(model):>34} mse = {mse}")

In [None]:
# Plot train and test points
sns.scatterplot(x=train_x, y=train_y, color="black", marker="o", label="train")
sns.scatterplot(x=test_x, y=test_y, color="black", marker="x", label="test")

# iterate over model types
for model in [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=1),
    RandomForestRegressor(),
]:
    # fit the model to training data
    model.fit(train_x.reshape(-1, 1), train_y)

    # generate predictions in range of data points
    test_points = np.linspace(0, 100, 1000)
    y_pred = model.predict(test_points.reshape(-1, 1))

    # plot predicted line
    sns.lineplot(x=test_points, y=y_pred, label=str(model))

**What if number of features > 1?**

In most cases multiple features are used for prediction.

That is the same as saying feature vectors are multidimensional.

In [None]:
# generate a synthetic dataset for regression with 2 features
features_2d, target_2d = make_regression(
    n_samples=200,
    n_features=2,
    n_targets=1,
)

# visualise with seaborn
# plot points at (x = feature 1, y = feature 2)
# use the target variable to determine point size and colour
grid = sns.relplot(
    x=features_2d[:, 0],
    y=features_2d[:, 1],
    size=target_2d,
    sizes=(40, 400),
    alpha=0.5,
    hue=target_2d,
)

# add axis labels
grid.ax.set_xlabel("feature 1")
grid.ax.set_ylabel("feature 2");

In [None]:
# Split dataset into train and test
train_x, test_x, train_y, test_y = train_test_split(
    features_2d,
    target_2d,
    test_size=0.3,
)

# fit a Nearest Neighbor Regression model to training data
model = KNeighborsRegressor(n_neighbors=1).fit(train_x, train_y)

# create a mesh of points
XX, YY = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))

# predict on each point in mesh
predictions = model.predict(np.c_[XX.flatten(), YY.flatten()])

# compute the min and max value of predictions
vmin, vmax = predictions.min(), predictions.max()

# select colormap
# see available colormaps at https://matplotlib.org/stable/tutorials/colors/colormaps.html
cmap = "plasma"

# plot
ax = plt.pcolormesh(
    XX, YY, predictions.reshape(XX.shape), vmin=vmin, vmax=vmax, cmap=cmap
)

# create a color bar to indicate mapping between colours and target values
cbar = plt.colorbar()

# add label to color bar
cbar.set_label("target")

# plot training data as small round points
plt.scatter(
    train_x[:, 0],
    train_x[:, 1],
    c=train_y,
    s=20,
    edgecolor="black",
    label="train",
    vmin=vmin,
    vmax=vmax,
    cmap=cmap,
)

# compute predictions at test points
y_pred = model.predict(test_x)

# compute prediction absolute error
error = np.abs(test_y - y_pred)

# rescale the errors to marker areas between 20 and 100
marker_sizes = 20 + 80 * error / error.max()

# plot test data as large square markers
# color squares using true value of target variable
# use absolute error to determine square size
plt.scatter(
    test_x[:, 0],
    test_x[:, 1],
    s=marker_sizes,
    c=test_y,
    marker="s",
    edgecolor="black",
    label="test",
    vmin=vmin,
    vmax=vmax,
    cmap=cmap,
)

# add legend to figure
plt.legend();

### **Exercise**:

Check out the [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).

Load the features and target variables from `scikit-learn`.

Split the data into training and test sets.

Select any regression model(s) of your choice and fit to train data.

Evaluate model fit with test data and MSE (and other metrics of your choice).

## 6. Clustering

**What if we don't have any labels?**

Often our data contains some **structure**.

* Features from different classes might be separated (**separability**)

* Similar objects might have similar features (**smoothness**)

Often we wish to find groupings or patterns in our data. This is called **clustering**.

Datapoints in the same **cluster** are deemed to be similar under some measure.

### K-Means Clustering

There are many algorithms for clustering. Here you will use k-means clustering.

Scikit-learn has a [collection of clustering algorithms](https://scikit-learn.org/stable/modules/clustering.html#clustering), including k-means.

If interested, check out an [explanation](https://www.youtube.com/watch?v=4b5d3muPQmA) of the k-means clustering algorithm or an [interactive simulation](https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/).
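The core of k-means is a loop of two steps: assign every point to its closest centroid, then move every centroid to the mean of its points. The sketch below implements that loop in plain Python for two-dimensional points. The function names, the toy points and the starting centroids are made up, and the `KMeans` class in `scikit-learn` adds refinements (such as a smarter initialisation) that are not shown here.

```python
def assign(points, centroids):
    """Assign each point to the index of its closest centroid."""

    def sq_dist(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            # Keep the centroid of an empty cluster where it is
            new_centroids.append(old)
            continue
        mean_x = sum(p[0] for p in members) / len(members)
        mean_y = sum(p[1] for p in members) / len(members)
        new_centroids.append((mean_x, mean_y))
    return new_centroids


# Made-up points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
# Deliberately poor starting centroids
centroids = [(0.0, 0.0), (2.0, 2.0)]

for _ in range(10):
    labels = assign(points, centroids)
    new_centroids = update(points, labels, centroids)
    if new_centroids == centroids:  # stop when no centroid moves any more
        break
    centroids = new_centroids

print("labels:", labels)
print("centroids:", centroids)
```

Even from a poor start, a few iterations are enough for the two centroids to settle in the middle of the two groups.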

In [None]:
# Import make_blobs function in scikit-learn datasets module
from sklearn.datasets import make_blobs

# generate synthetic dataset made up of 5 blobs
X, y_true = make_blobs(n_features=2, n_samples=4000, centers=5)

# plot synthetic dataset
ax = sns.scatterplot(x=X[:, 0], y=X[:, 1])

ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2");

In [None]:
# import k-means clustering model from scikit-learn
from sklearn.cluster import KMeans

# create a new K-means clustering model.
# specify 5 wanted clusters
kmeans_model = KMeans(n_clusters=5)

# fit to dataset
kmeans_model.fit(X)

# get predicted clusters for the dataset
y_pred = kmeans_model.predict(X)

# plot predictions
ax = sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_pred)

ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2");

Clustering performance will depend on clustering parameters, choice of algorithm and data structure.
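One common way to explore the effect of the number of clusters is to compare the `inertia_` of fitted `KMeans` models, that is, the sum of squared distances from each point to its closest centroid. The sketch below is a minimal example on a small synthetic dataset. The dataset size, the values of `k` and the `random_state` are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with 5 blobs; random_state makes the example reproducible
blob_points, _ = make_blobs(n_features=2, n_samples=500, centers=5, random_state=0)

inertias = {}
for k in [1, 3, 5, 7]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(blob_points)
    # inertia_ is the sum of squared distances of points to their closest centroid
    inertias[k] = km.inertia_
    print(f"k = {k}: inertia = {km.inertia_:.0f}")
```

Inertia tends to keep falling as clusters are added, so a low value alone does not identify the right number of clusters. A useful heuristic is to look for the value of `k` after which the improvement levels off.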

In [None]:
# repeat with 3 clusters
kmeans_model = KMeans(n_clusters=3)

# fit to dataset
kmeans_model.fit(X)

# get predicted clusters for the dataset
y_pred = kmeans_model.predict(X)

# plot predictions
ax = sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_pred)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2");

In [None]:
from sklearn.datasets import make_circles

# repeat with other dataset
X, y_target = make_circles(factor=0.2, n_samples=4000, noise=0.1)

# create K means with 2 clusters
kmeans_model = KMeans(n_clusters=2)

# fit to dataset
kmeans_model.fit(X)

# get predicted clusters for the dataset
y_pred = kmeans_model.predict(X)

# plot predictions
ax = sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_pred)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2");

In [None]:
# Repeat with the DBSCAN algorithm
from sklearn.cluster import DBSCAN

# regenerate the circles dataset
X, y_target = make_circles(factor=0.2, n_samples=4000, noise=0.1)

# create DBSCAN model
dbscan_model = DBSCAN(eps=0.15)

# fit to dataset
y_pred = dbscan_model.fit_predict(X)

# plot predictions
ax = sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_pred)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2");

### **Exercise**: (15 min)

Use K-means clustering on the iris dataset.

Can you recover the species separation?

Research Affinity Propagation clustering and compare it to K-Means clustering.

## 7. Dimensionality Reduction

Often the data we collect can be very **high dimensional**, e.g. D > 1000.

This poses a problem as it is difficult to visualize anything greater than 3 dimensions.

We can **project** this data down to a lower dimension P << D, where P is typically 2 or 3.

### PCA

One of the simplest approaches is to do a linear projection.

**PCA** is a linear projection that aligns with the directions of highest variance.
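To make "directions of highest variance" concrete, the sketch below performs a PCA-style projection by hand with NumPy: center the data, find the directions with a singular value decomposition, and project onto the first two. The 3-dimensional dataset is made up, and this is an illustration of the idea rather than a replacement for the `PCA` class in `scikit-learn`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 3-dimensional data that mostly varies along one direction
latent = rng.normal(size=(200, 1))
data = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

# 1. Center the data so that each feature has zero mean
centered = data - data.mean(axis=0)

# 2. The right singular vectors are the directions of highest variance,
#    ordered from most to least variance
_, singular_values, directions = np.linalg.svd(centered, full_matrices=False)

# 3. Project onto the first two directions
projected = centered @ directions[:2].T

# Fraction of the total variance captured by each direction
explained = singular_values**2 / np.sum(singular_values**2)
print("projected shape:", projected.shape)
print("explained variance ratio:", np.round(explained, 3))
```

When the first one or two directions capture most of the variance, as they do for this toy dataset, little information is lost by plotting only those.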

In [None]:
# the full iris dataset has 4 features
iris.shape

In [None]:
# import PCA from scikit-learn
from sklearn.decomposition import PCA

# create a 2-dimensional PCA projection
pca_model = PCA(n_components=2)

# project the 4-dimensional iris dataset into 2-d points
projected_iris = pca_model.fit_transform(iris)

# plot projected points
# use species to color points
ax = sns.scatterplot(
    x=projected_iris[:, 0], y=projected_iris[:, 1], hue=y_species, style=y_species
)

# add labels to axis
ax.set_xlabel("PCA component 1")
ax.set_ylabel("PCA component 2");

### Visualization

As a final example let's explore a dataset of digits.

Each data point is a grayscale image of a handwritten digit.

The images are 8x8 pixels, so in total each point has 64 features.

In this case a feature is the grayscale value of a single pixel.

In [None]:
# import load_digits function from scikit-learn datasets module
from sklearn.datasets import load_digits

# load digits dataset
digits = load_digits()

# extract data and target values
X = digits.data
y = digits.target

# select a single data point
# reshape to original 8x8 array
digit = X[0].reshape(8, 8)

# use matplotlib to show image
plt.imshow(digit, cmap="gray");

**How to visualize the whole dataset?**

Use dimensionality reduction.

Let's try PCA.

In [None]:
# use PCA to project to 2 dimensions
pca_digits = PCA(n_components=2).fit_transform(X)

# do a scatterplot, color points by digit
sns.scatterplot(x=pca_digits[:, 0], y=pca_digits[:, 1], hue=y, palette="tab20");

Some digits seem to cluster.

Still, there is a lot of overlap.

Let's try a different projection method.

Now we will use a non-linear projection called **t-SNE**.

Check out the [paper](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) where t-SNE was introduced, or this amazing [blog](https://distill.pub/2016/misread-tsne/) for further information.

In [None]:
from sklearn.manifold import TSNE

# use TSNE to project to 2 dimensions
tsne_digits = TSNE(
    n_components=2,
    init="pca",
    learning_rate="auto",
).fit_transform(X)

# do a scatterplot, color points by digit
sns.scatterplot(x=tsne_digits[:, 0], y=tsne_digits[:, 1], hue=y, palette="tab20");

## 8. Summary

**Which algorithm to choose?**

Short answer: It depends!

There is no silver bullet, but for classification it is often sensible to first try a Support Vector Machine or Random Forest.

This will give you an idea of how separable your data is. The next step is to try different features, and perhaps even collect more training data.

**How much data do I need?**

Short answer: It depends!

It depends on how easy it is for your classifier to separate your data. Some problems are relatively easy and don’t require lots of data, while others, such as species identification in images, can require tens of thousands of examples.