# Metric learning

*Selected Topics in Mathematical Optimization*

**Bac Nguyen Cong** ([email](mailto:bac.nguyencong@ugent.be))

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from collections import Counter
from scipy.spatial.distance import mahalanobis


from solution import NCA

## 1. Exercise
Implement a function that returns the objective function value and the corresponding gradient for NCA.
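
For reference, the NCA objective and its gradient, as formulated in the original NCA paper by Goldberger et al., can be written as below. Here $A$ is the transformation matrix with one row per projected dimension, $x_{ij} = x_i - x_j$, and $C_i$ is the set of training points that share the class of $x_i$. Whether the provided `NCA` base class maximizes $f$ or minimizes $-f$, and how it flattens $A$, is not stated in this notebook, so treat the formulas as a starting point to adapt.

$$p_{ij} = \frac{\exp\left(-\lVert A x_i - A x_j \rVert^2\right)}{\sum_{k \neq i} \exp\left(-\lVert A x_i - A x_k \rVert^2\right)}, \qquad p_{ii} = 0$$

$$p_i = \sum_{j \in C_i} p_{ij}, \qquad f(A) = \sum_i p_i$$

$$\frac{\partial f}{\partial A} = 2A \sum_i \left( p_i \sum_k p_{ik}\, x_{ik} x_{ik}^\top - \sum_{j \in C_i} p_{ij}\, x_{ij} x_{ij}^\top \right)$$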

In [None]:
class MyModel(NCA):
    r"""Neighborhood Components Analysis (NCA).

    NCA is a distance metric learning algorithm which aims to improve the
    accuracy of nearest neighbors classification compared to the standard
    Euclidean distance. The algorithm directly maximizes a stochastic variant
    of the leave-one-out k-nearest neighbors (KNN) score on the training set.
    It can also learn a low-dimensional linear transformation of data that can
    be used for data visualization and fast classification.

    """    
    def compute_gradient(self, A, X, y):
        """Compute the objective function value and gradients.

        Args:
            A (array-like, shape=[n_features * n_projected]):
                The linear transformation matrix.
            X (array-like, shape=[n_examples, n_features]): Training data.
            y (array-like, shape=[n_examples]): Class labels
                for each data sample.

        Returns:
            value (float): The objective function value.
            grads (array-like, shape=[n_features * n_projected]): The
                gradients.

        """
        pass
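
One possible way to approach the exercise is sketched below as a standalone function rather than as the `compute_gradient` method itself. It assumes that the flat vector `A` is laid out row-major with shape `(n_projected, n_features)` and that the objective is to be maximized; the argument `n_projected` and the function name are hypothetical, and the provided `NCA` base class may use a different layout or sign convention.

```python
import numpy as np


def nca_objective_and_gradient(A_flat, X, y, n_projected):
    """Return the NCA objective f(A) and its gradient, flattened like A_flat."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_features = X.shape[1]
    # Assumption: A is stored row-major as (n_projected, n_features).
    A = np.asarray(A_flat, dtype=float).reshape(n_projected, n_features)

    Z = X @ A.T                                  # projected data
    diff = Z[:, None, :] - Z[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)          # ||A x_i - A x_j||^2
    np.fill_diagonal(sq_dist, np.inf)            # enforces p_ii = 0

    # Row-wise softmax of the negative distances, shifted for stability.
    logits = -sq_dist
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)

    same_class = y[:, None] == y[None, :]
    P_same = P * same_class                      # p_ij restricted to j in C_i
    p = P_same.sum(axis=1)                       # p_i
    value = float(p.sum())                       # f(A)

    # Weights w_ij = p_i * p_ij - [j in C_i] * p_ij, so that
    # grad = 2 A sum_ij w_ij (x_i - x_j)(x_i - x_j)^T.
    W = P * p[:, None] - P_same
    W_sym = W + W.T
    # sum_ij w_ij (x_i - x_j)(x_i - x_j)^T = X^T (diag(W_sym 1) - W_sym) X
    S = np.diag(W_sym.sum(axis=0)) - W_sym
    grad = 2.0 * A @ (X.T @ S @ X)
    return value, grad.ravel()


# Tiny synthetic usage example (not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(12, 4))
y_demo = np.array([0, 1, 2] * 4)
A0 = 0.5 * rng.normal(size=2 * 4)
value, grads = nca_objective_and_gradient(A0, X_demo, y_demo, n_projected=2)
print(value, grads.shape)
```

A finite-difference comparison on a small random problem is a cheap way to check any gradient implementation before plugging it into the optimizer.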

### Helper functions

In [None]:
def load_data():
    data = load_wine()
    X, y = data.data, data.target
    return X, y


def get_pipeline(ml_model):
    """Build a pipeline for testing the ml_model."""
    pipeline = Pipeline([
        ('Feature', ml_model),
        ('Classifier', KNeighborsClassifier(n_neighbors=3))
    ])
    return pipeline


def score_solution(ml_model):
    # load data
    X, y = load_data()
    
    # build the pipeline
    model = get_pipeline(ml_model)
    
    # compute cross validation
    # (recent scikit-learn versions reject random_state unless shuffle=True)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123456789)
    scores = cross_val_score(model, X, y, cv=cv)
    print('Accuracy %2.2f%c' % (scores.mean()*100, '%'))
    
    return scores


def visualize(ml_model):
    X, y = load_data()
    
    # visualize the data in 2D
    X_embedded = ml_model.fit_transform(X, y)
    data_map = pd.DataFrame({'x': X_embedded[:, 0], 'y': X_embedded[:, 1], 'Class': y})
    sns.scatterplot(x="x", y="y", hue="Class", data=data_map)

    plt.show()

## 2. Test your solution

Print the accuracy

In [None]:
nca_model = NCA(n_components=2, verbose=False)
score = score_solution(nca_model)

Compare the result using PCA

In [None]:
pca_model = PCA(n_components=2)
score = score_solution(pca_model)

Visualize the outputs

In [None]:
nca_model = NCA(n_components=2)
pca_model = PCA(n_components=2)

visualize(nca_model)
visualize(pca_model)

## 3. Recipe dataset

We will illustrate metric learning on the [recipes dataset](https://www.nature.com/articles/srep00196). It is a collection of recipes, each represented by the set of ingredients it uses and annotated with its country of origin. We will learn a distance function that measures whether two recipes are likely to share a country of origin.

For an illustration of a machine learning project using this data, see our [paper](https://www.sciencedirect.com/science/article/abs/pii/S0924224415002873).

In [None]:
recipes = pd.read_csv("recipe_data.csv", sep=';')
recipes.head()

Let us separate the ingredients.

In [None]:
ingredients = recipes.columns[1:]

X = recipes.values[:,1:]
X = np.array(X, dtype=int)

for ingredient in ingredients:
    print(ingredient)

We define two useful functions:

- `ingr2vec` maps a set of ingredients to a binary vector
- `vec2ingr` maps a binary vector to a set of ingredients

In [None]:
ingr2vec = lambda ingr_set : np.array([1 if ingr in ingr_set else 0 for ingr in ingredients])
vec2ingr = lambda vec : set(ingredients[vec>0])

In [None]:
v = ingr2vec(("coriander", "fish", "garlic"))
v

In [None]:
vec2ingr(v)

We also separate the countries!

In [None]:
countries = recipes["Country"]
Counter(countries)

Now fit the model!

In [None]:
nca = NCA(n_components=...)
nca.fit(...)

In [None]:
M = pd.DataFrame(nca.return_M(), index=ingredients, columns=ingredients)
M

In [None]:
L = pd.DataFrame(nca.return_L(), index=ingredients)
L

In [None]:
recipedist = lambda recipe1, recipe2 : mahalanobis(ingr2vec(recipe1), ingr2vec(recipe2), M)

In [None]:
recipedist(("fish", "mustard", "shallot"), ("chicken", "mustard", "vinegar"))

In [None]:
recipedist(("fish", "mustard", "shallot"), ("chicken", "coconut", "coriander"))

In [None]:
recipedist(("fish", "mustard", "shallot"), ("vanilla", "coconut", "cream"))
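
The `recipedist` calls above evaluate $\sqrt{(u - v)^\top M (u - v)}$ on the binary ingredient vectors $u$ and $v$, which is what `mahalanobis` computes when $M$ is passed as its third argument. The sketch below uses NumPy only and random stand-in matrices to illustrate that this equals the Euclidean distance after projecting with $L$. It assumes that `return_L()` has one row per ingredient (as the `index=ingredients` call suggests) and that $M = L L^\top$; the names `L_demo`, `M_demo` and `metric_distance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ingredients, n_components = 6, 2

# Hypothetical stand-in for nca.return_L(): one row per ingredient.
L_demo = rng.normal(size=(n_ingredients, n_components))
# Assumed relation between the two matrices returned by the model.
M_demo = L_demo @ L_demo.T


def metric_distance(u, v, M):
    """Compute sqrt((u - v)^T M (u - v)) for two vectors u and v."""
    delta = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(delta @ M @ delta))


# Two binary "recipes" over the six placeholder ingredients.
u = np.array([1, 0, 1, 0, 0, 1])
v = np.array([0, 1, 1, 0, 1, 0])

d_metric = metric_distance(u, v, M_demo)
d_embedded = float(np.linalg.norm(u @ L_demo - v @ L_demo))
print(d_metric, d_embedded)  # the two values agree up to rounding
```

Under these assumptions, distances can be computed either from $M$ directly or from the low-dimensional embedding, whichever is more convenient.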

**Assignments**

1. Fit an NCA model to the recipes data.
2. Interpret $L$ and $M$: which ingredients are (dis)similar?
3. Make a biplot of the ingredients and a scatter plot of the recipes in the learned space.
4. Find two recipes that differ in only one ingredient, but have an *as large as possible* Mahalanobis distance between them.
5. Find two recipes that differ in *two* ingredients, but have an *as small as possible* Mahalanobis distance between them.
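
As a hint for the last two assignments: for binary recipe vectors, adding or removing ingredient $k$ gives a difference vector $e_k$, so the distance is $\sqrt{M_{kk}}$, while swapping ingredient $j$ for ingredient $k$ gives $e_j - e_k$ and a distance of $\sqrt{M_{jj} + M_{kk} - 2M_{jk}}$. The sketch below evaluates both on a random stand-in for $M$, so the ingredients it selects carry no meaning. It also ignores which recipes actually occur in the data and treats a swap as the only way to differ in two ingredients. All names in it are hypothetical.

```python
import numpy as np

# Placeholder labels; the matrix below is random, not a fitted model.
names = np.array(["fish", "garlic", "coriander", "coconut", "vanilla"])
rng = np.random.default_rng(1)
L_demo = rng.normal(size=(len(names), 2))
M_demo = L_demo @ L_demo.T        # assumed form of the learned metric

diag = np.diag(M_demo)

# One ingredient added or removed: delta = e_k, distance = sqrt(M_kk).
single = np.sqrt(diag)
k_max = int(np.argmax(single))

# Ingredient j swapped for k: delta = e_j - e_k.
swap_sq = diag[:, None] + diag[None, :] - 2.0 * M_demo
swap = np.sqrt(np.clip(swap_sq, 0.0, None))   # clip guards against rounding
np.fill_diagonal(swap, np.inf)                # j == k is not a swap
j_min, k_min = (int(i) for i in np.unravel_index(np.argmin(swap), swap.shape))

print("largest single-ingredient distance:", names[k_max], single[k_max])
print("smallest swap distance:", names[j_min], names[k_min], swap[j_min, k_min])
```

With the fitted model, the same computation on `nca.return_M()` would point to candidate ingredients, after which matching recipes can be looked up in the data.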