Import a few base libraries.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import sys

Load the *happiness* dataset.

In [None]:
df = pd.read_csv('../data/happiness_long.csv')

# PCA

## Q

Build a DataFrame whose columns are the principal components and the quantitative variables (exclude the *year* column).

Plot each principal component (ordinate) against each quantitative variable (abscissa).

Plot the data projected onto the first two principal components (PCA1 on the abscissa, PCA2 on the ordinate), with the points colored by region.

## A
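One possible approach is sketched below, on a synthetic table rather than the actual dataset. The column names (`score`, `gdp`, `support`, `freedom`) and the region labels are hypothetical, and standardizing the variables before the PCA is an assumption, not a requirement stated in the question.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the happiness table; all column names are hypothetical.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(60, 4)),
                   columns=['score', 'gdp', 'support', 'freedom'])
toy['year'] = 2020
toy['region'] = rng.choice(['A', 'B', 'C'], size=60)

# Keep the quantitative variables only, without the year column.
quant = toy.select_dtypes('number').drop(columns='year')

# Standardize (an assumption), then project onto the principal components.
pca = PCA()
scores = pca.fit_transform(StandardScaler().fit_transform(quant))
pc_names = [f'PCA{i + 1}' for i in range(scores.shape[1])]
pcs = pd.DataFrame(scores, columns=pc_names, index=quant.index)

# One DataFrame with the components and the original variables side by side.
combined = pd.concat([pcs, quant], axis=1)

# Grid of scatter plots: one row per component, one column per variable.
fig, axes = plt.subplots(len(pc_names), quant.shape[1], figsize=(10, 10),
                         sharex='col', sharey='row')
for i, pc in enumerate(pc_names):
    for j, var in enumerate(quant.columns):
        axes[i, j].scatter(combined[var], combined[pc], s=5)
    axes[i, 0].set_ylabel(pc)
for j, var in enumerate(quant.columns):
    axes[-1, j].set_xlabel(var)

# Projection onto the first two components, one color per region.
fig2, ax = plt.subplots()
for region, group in combined.join(toy['region']).groupby('region'):
    ax.scatter(group['PCA1'], group['PCA2'], label=region, s=10)
ax.set_xlabel('PCA1')
ax.set_ylabel('PCA2')
ax.legend(title='region')
```

With the real data, `toy` would be replaced by `df`, and the list of quantitative columns should be checked by hand in case other non-feature numeric columns are present.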

## Q

Draw a scree plot to select a number of components.

Perform a PCA that selects the number of components automatically by maximum likelihood estimation.

## A
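A minimal sketch on synthetic data follows. It assumes scikit-learn's `PCA`, whose `n_components='mle'` option (with `svd_solver='full'`) applies Minka's maximum likelihood estimate of the dimensionality. The data matrix `X` is made up for illustration.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA

# Synthetic data: 2 strong latent directions embedded in 6 noisy variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

# Scree plot: explained variance ratio of each component, in decreasing order.
pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(ratios) + 1), ratios, 'o-')
ax.set_xlabel('Component')
ax.set_ylabel('Explained variance ratio')

# Automatic selection of the number of components by maximum likelihood.
pca_mle = PCA(n_components='mle', svd_solver='full').fit(X)
print('Selected number of components:', pca_mle.n_components_)
```

On a scree plot, a common heuristic is to keep the components located before the "elbow", where the curve flattens.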

## Q

Each principal component can be thought of as a linear model with the *happiness* variables as the independent variables. Let us consider the first principal component only.

Plot the corresponding residuals *vs* the “predicted values”. The points can be colored by region.

# UMAP

*umap-learn* is now listed in the *requirements.txt* file. Just in case:

In [None]:
!"{sys.executable}" -m pip install --upgrade umap-learn

## Q

Carry out a 2D UMAP transform with arbitrary parameter values (*e.g.* 10 neighbors and unit minimum distance).

Plot the UMAP-projected points colored by region (in a first plot) and by score (in a second plot).

## Q

For illustration purposes, we may look for the set of UMAP parameters that maximizes the separation between two groups.

This can be done by maximizing the Silhouette score:

In [None]:
from sklearn.metrics import silhouette_score
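As a quick illustration of how this score behaves, the sketch below compares two well-separated groups with two overlapping ones. The point clouds are synthetic and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)  # 50 points in each group

# Two well-separated 2D groups, then two largely overlapping ones.
separated = np.vstack([rng.normal(0, 1, size=(50, 2)),
                       rng.normal(10, 1, size=(50, 2))])
overlapping = np.vstack([rng.normal(0, 1, size=(50, 2)),
                         rng.normal(0.5, 1, size=(50, 2))])

score_separated = silhouette_score(separated, labels)
score_overlapping = silhouette_score(overlapping, labels)
print(score_separated, score_overlapping)
```

The score lies between -1 and 1; values close to 1 indicate compact, well-separated groups, which is why it can serve as an objective to maximize.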

Let us try to highlight the European countries *vs* the rest of the world (or Central/Eastern Europe *vs* Western Europe):

In [None]:
# indicator variables

# 1. Europe vs rest of the world
group1 = (df['region'] == 'Central and Eastern Europe') | (df['region'] == 'Western Europe')
group2 = ~group1

# 2. Central/Eastern Europe vs Western Europe
#group1 = df['region'] == 'Central and Eastern Europe'
#group2 = df['region'] == 'Western Europe'

# note: in practice we will pass only the points in group1 or group2;
#       the points that are not in either group will be excluded

indicator = np.zeros(df.shape[0]) # 0 is the label for group1
indicator[group2] = 1 # 1 is the label for group2
indicator = indicator[group1 | group2] # keep only the labels of the points in group1 or group2

def objective_function(projected_data):
    x = projected_data[group1 | group2, :]
    score = silhouette_score(x, indicator)
    return score

Find the combination of parameter values that maximizes the above score, using a grid search over neighbor counts 2, 3, 5, 10, 20, 30, 50, 100 and minimum distances 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0. Then plot the resulting UMAP (the projected data) for the best combination.

In [None]:
param_grid = {'n_neighbors': [2, 3, 5, 10, 20, 30, 50, 100], 'min_dist': np.arange(.1, 1.1, .1)}
param_grid

Hint: you can use simple *for* loops and lists to record the scores and associated parameters.
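The loop pattern suggested by the hint could look like the sketch below. To keep it fast and free of extra dependencies, the embedding step is passed in as a callable and replaced here by a made-up stand-in (`toy_embed`). The names `grid_search`, `toy_embed`, `toy_objective`, `toy_data` and `toy_grid` are all hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def grid_search(data, param_grid, embed, objective):
    """Score objective(embed(data, ...)) for every parameter combination."""
    records = []  # one (score, n_neighbors, min_dist) tuple per combination
    for n_neighbors in param_grid['n_neighbors']:
        for min_dist in param_grid['min_dist']:
            projected = embed(data, n_neighbors=n_neighbors, min_dist=min_dist)
            records.append((objective(projected), n_neighbors, min_dist))
    best = max(records, key=lambda record: record[0])
    return best, records


# Toy inputs (hypothetical): two labelled groups in 5 dimensions.
rng = np.random.default_rng(0)
toy_labels = np.repeat([0, 1], 40)
toy_data = rng.normal(size=(80, 5)) + 3 * toy_labels[:, None]


def toy_embed(data, n_neighbors, min_dist):
    # Stand-in for the real embedding, used only to keep this sketch fast.
    # With umap-learn, one would instead return something like
    # umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist).fit_transform(data)
    noise = np.random.default_rng(n_neighbors).normal(size=(len(data), 2))
    return data[:, :2] + min_dist * noise


def toy_objective(projected):
    # Same idea as objective_function above, applied to the toy labels.
    return silhouette_score(projected, toy_labels)


toy_grid = {'n_neighbors': [2, 5, 10], 'min_dist': [0.1, 0.5, 1.0]}
best, records = grid_search(toy_data, toy_grid, toy_embed, toy_objective)
best_score, best_n_neighbors, best_min_dist = best
print(best_score, best_n_neighbors, best_min_dist)
```

Keeping every record, and not only the best one, makes it possible to inspect afterwards how sensitive the score is to each parameter.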

## A

Run the exhaustive grid search (takes time):

# Agglomerative clustering

Let us copy-paste some plotting code from the scikit-learn official documentation:

In [None]:
from scipy.cluster.hierarchy import dendrogram

# copy-pasted from https://scikit-learn.org/1.7/auto_examples/cluster/plot_agglomerative_dendrogram.html
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram
    # Example usage:
    # plot_dendrogram(agglomerated_clustering, truncate_mode="level", p=6)
    # See also scipy.cluster.hierarchy.dendrogram for more details

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

    ax = plt.gca()
    ax.set_xlabel("Number of points in node (or index of point if no parenthesis).")
    ax.set_yticks([])

    return ax

## Q

Perform agglomerative clustering:
- first without a defined number of clusters in order to plot a complex dendrogram (for example using the above function),
- second with 5 clusters to plot the data points colored by cluster (use 2D projected data, either the first 2 principal components or the UMAP components).

## A
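The two fits could be set up as in the following sketch, which assumes scikit-learn's `AgglomerativeClustering` with its default Ward linkage and uses a synthetic matrix `X` in place of the real features.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the quantitative variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# No predefined number of clusters: a zero distance threshold builds the
# full tree and populates the distances_ attribute needed for a dendrogram.
full_tree = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)

# Fixed number of clusters: labels_ holds one cluster index per point.
five_clusters = AgglomerativeClustering(n_clusters=5).fit(X)
labels = five_clusters.labels_
print(np.bincount(labels))  # number of points in each of the 5 clusters
```

The first model can then be passed to the plotting function defined above, for instance `plot_dendrogram(full_tree, truncate_mode="level", p=3)`, and `labels` can be used to color a scatter plot of the 2D projected data.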

# *k*-means

## Q

Same as above, apply *k*-means and plot the clusters.

## A
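A minimal sketch with scikit-learn's `KMeans` is given below. For simplicity it clusters synthetic 2D points directly; with the real data, the clustering could be run on the quantitative variables and only the plot would use the 2D projection.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

# Synthetic 2D points standing in for the projected data.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0], [2.5, 2.5]])
points = np.vstack([c + 0.3 * rng.normal(size=(30, 2)) for c in centers])

# k-means with 5 clusters; n_init restarts reduce the dependence on initialization.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# Points colored by cluster, with the cluster centers marked by crosses.
fig, ax = plt.subplots()
ax.scatter(points[:, 0], points[:, 1], c=labels, cmap='tab10', s=10)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           marker='x', color='black')
```

Unlike agglomerative clustering, *k*-means depends on a random initialization, so fixing `random_state` keeps the cluster labels reproducible between runs.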