# Unsupervised Learning
## DSSP Team
## Spring 2020

<script>
code_show=true; 
function code_toggle() {
  if (code_show) {
    $('div.input').each(function(id) {
      el = $(this).find('.c1');
      if (el.text() == '#solution') {
        $(this).hide();
      }
    });
    $('div.output').hide();
  } else {
    $('div.input').each(function(id) {
      $(this).show();
    });
    $('div.output').show();
  }
  code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>

<form action="javascript:code_toggle()">
The solutions are hidden by default but you can click
<input type="submit" value="here">
to toggle them on/off.
</form> 

## Multidimensional Scaling

The file __temperature.csv__ (in the __data__ directory) contains the average monthly temperatures of 35 cities in Europe. We are going to define distances between these temperature curves and feed them into the MDS algorithm to embed the cities in a 2-dimensional space.

We start by reading the data:

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set_style('whitegrid')

from sklearn import decomposition #PCA
from sklearn import metrics #Pairwise distance
from sklearn import manifold #MDS

from adjustText import adjust_text #Text labels placement

In [None]:
temperature = pd.read_csv('../data/temperature.csv',
    index_col=0)
temperature

We will use the raw temperatures without renormalization, since they are all expressed in the same unit:

In [None]:
temperature_raw = temperature.iloc[:, :13]

__1)__ Compute the pairwise distances between all cities

__Hint:__ Use `metrics.pairwise_distances`

In [None]:
#solution
D = metrics.pairwise_distances(temperature_raw)
D
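
To see what `metrics.pairwise_distances` returns, here is a minimal sketch on a toy array. The three points below are hypothetical and are not taken from the temperature data; they only illustrate that the result is a square, symmetric matrix with a zero diagonal, and that the default metric is the Euclidean one.

```python
import numpy as np
from sklearn import metrics

# Three hypothetical "cities", each described by two values (toy data).
toy = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])

# With a single argument, distances are computed between all pairs of rows.
# The default metric is the Euclidean distance.
D_toy = metrics.pairwise_distances(toy)

print(D_toy.shape)  # one row and one column per point
print(D_toy)
```

The entry in row `i` and column `j` is the distance between points `i` and `j`, which is the format expected by `manifold.MDS` when the dissimilarity is `'precomputed'`.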

__2)__ Compute and visualize the 2D MDS representation.

__Hint:__ Use `manifold.MDS` with the `'precomputed'` dissimilarity.

In [None]:
#solution
np.random.seed(42)
temperature_mds = manifold.MDS(dissimilarity='precomputed').fit_transform(D)
temperature_mds_df = pd.DataFrame({'City': temperature_raw.index, 'X1': temperature_mds[:, 0], 'X2': temperature_mds[:, 1]})
temperature_mds_df.plot(kind='scatter', x='X1', y='X2');
texts = [plt.text(x, y, city)
    for idx, city, x, y in temperature_mds_df.itertuples()]
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='k'));
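
The embedding depends on the random initialization, which is why the seed is fixed above. As a possible alternative to seeding NumPy globally, the sketch below passes `random_state` directly to `manifold.MDS` and inspects the final stress through the `stress_` attribute. The data are random toy points (an assumption for illustration), not the temperature table.

```python
import numpy as np
from sklearn import manifold, metrics

# Toy data: 10 hypothetical points in 5 dimensions.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 5))
D_toy = metrics.pairwise_distances(X_toy)

# Fixing random_state on the estimator makes the run reproducible
# without touching NumPy's global random state.
mds_a = manifold.MDS(n_components=2, dissimilarity='precomputed', random_state=42)
emb_a = mds_a.fit_transform(D_toy)

mds_b = manifold.MDS(n_components=2, dissimilarity='precomputed', random_state=42)
emb_b = mds_b.fit_transform(D_toy)

print(np.allclose(emb_a, emb_b))  # same seed, same embedding
print(mds_a.stress_)              # final value of the stress criterion
```

Comparing `stress_` across several seeds is one way to check whether a given run ended in a poor local optimum.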

__3)__ Can you play with the axes to find a representation closer to the geography?

__Hint:__ You may combine a rotation and sign inversion...

In [None]:
#solution
temperature_mds_mod_df = pd.DataFrame({'City': temperature_raw.index, 'X1': -temperature_mds[:, 1] - temperature_mds[:, 0], 'X2': -temperature_mds[:, 0] + temperature_mds[:, 1]})
temperature_mds_mod_df.plot(kind='scatter', x='X1', y='X2');
texts = [plt.text(x, y, city)
    for idx, city, x, y in temperature_mds_mod_df.itertuples()]
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='k'));
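
Rotating the embedding or flipping the sign of an axis is legitimate because MDS only tries to preserve distances, and these transforms leave all pairwise distances unchanged. The combination used in the solution above is such a transform up to a uniform scale factor of √2. The sketch below checks the invariance on three hypothetical points; the helper names `rotate` and `pairwise` are introduced here for illustration only.

```python
import numpy as np

def rotate(points, angle_deg):
    """Rotate 2D points counterclockwise by angle_deg degrees."""
    theta = np.deg2rad(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Points are stored as rows, hence the transpose.
    return points @ R.T

def pairwise(points):
    """Euclidean distance matrix between the rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Three hypothetical embedded points (toy data).
pts = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])

rotated = rotate(pts, 90)                   # rotation by 90 degrees
flipped = rotated * np.array([-1.0, 1.0])   # sign inversion of the first axis

# Both transforms keep every pairwise distance unchanged.
print(np.allclose(pairwise(pts), pairwise(flipped)))
```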

__4)__ What is the result of a PCA on the same data?

In [None]:
#solution
temperature_pca = decomposition.PCA(n_components=2).fit_transform(temperature_raw)
temperature_pca_df = pd.DataFrame({'City': temperature_raw.index, 'X1': temperature_pca[:, 0], 'X2': temperature_pca[:, 1]})
temperature_pca_df.plot(kind='scatter', x='X1', y='X2')
texts = [plt.text(x, y, city)
    for idx, city, x, y in temperature_pca_df.itertuples()]
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='k'));

In [None]:
#solution
temperature_pca_mod_df = pd.DataFrame({'City': temperature_raw.index, 'X1': -temperature_pca[:, 1], 'X2': -temperature_pca[:, 0]})
temperature_pca_mod_df.plot(kind='scatter', x='X1', y='X2')
texts = [plt.text(x, y, city)
    for idx, city, x, y in temperature_pca_mod_df.itertuples()]
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='k'));

In [None]:
#solution
#Note that we do not obtain the same result: scikit-learn's MDS minimizes the stress with an iterative algorithm (SMACOF) started from a random initialization, so it may be trapped in a local optimum.
#In R, classical MDS (cmdscale) would have given the same result as PCA, up to the sign of the axes.
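
The note above says that R would reproduce the PCA result. One way to see why, assuming the R routine is the classical (eigendecomposition-based) MDS, is to implement that variant by hand and compare it with PCA scores. The sketch below does this in plain NumPy on random toy data; `classical_mds` is a hypothetical helper written for this illustration, not a scikit-learn function, and the PCA scores are computed through an SVD of the centered data.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical MDS: eigendecomposition of the double-centered squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the centered points
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Toy data: 8 hypothetical points in 4 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))        # Euclidean distance matrix

coords_mds = classical_mds(D)

# PCA scores on the first two components, via SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords_pca = U[:, :2] * S[:2]

# The two embeddings agree up to the sign of each axis.
same_up_to_sign = np.allclose(np.abs(coords_mds), np.abs(coords_pca), atol=1e-6)
print(same_up_to_sign)
```

The agreement holds for Euclidean distances only; with another dissimilarity the double-centered matrix is no longer the Gram matrix of the data.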

__5)__ What if one uses the `canberra` distance instead of the Euclidean one?

In [None]:
#solution
D = metrics.pairwise_distances(temperature_raw, metric='canberra')
np.random.seed(42)
temperature_mds = manifold.MDS(dissimilarity='precomputed').fit_transform(D)
temperature_mds_df = pd.DataFrame({'City': temperature_raw.index, 'X1': temperature_mds[:, 0], 'X2': temperature_mds[:, 1]})
temperature_mds_df.plot(kind='scatter', x='X1', y='X2');
texts = [plt.text(x, y, city)
    for idx, city, x, y in temperature_mds_df.itertuples()]
adjust_text(texts, arrowprops=dict(arrowstyle='->', color='k'));
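
To interpret the change, it helps to recall that the Canberra distance sums the relative differences |x_i - y_i| / (|x_i| + |y_i|) over the coordinates, so a gap between two small values weighs as much as a proportionally larger gap between two large values. The sketch below computes it by hand on two hypothetical vectors and compares the result with scikit-learn; the helper `canberra` is written for this illustration, and treating 0/0 terms as 0 is an assumption of the sketch.

```python
import numpy as np
from sklearn import metrics

def canberra(x, y):
    """Canberra distance: sum of |x_i - y_i| / (|x_i| + |y_i|)."""
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    # Terms where both coordinates are zero are counted as 0 (assumption).
    terms = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
    return terms.sum()

# Two hypothetical temperature profiles (toy data).
a = np.array([1.0, 2.0, 10.0])
b = np.array([3.0, 2.0, 30.0])

# Terms: 2/4 = 0.5, 0/4 = 0, 20/40 = 0.5
d_manual = canberra(a, b)

# Same computation through scikit-learn, on 2D inputs of one row each.
d_sklearn = metrics.pairwise_distances(a[None, :], b[None, :], metric='canberra')[0, 0]

print(d_manual, d_sklearn)
```

In this example the 2-degree gap in the first coordinate contributes exactly as much as the 20-degree gap in the last one, which is the kind of reweighting that changes the MDS picture.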