# Hands-on session: UMAP
In this example, we use UMAP (Uniform Manifold Approximation and Projection), a dimensionality reduction technique, to visualize high-dimensional data. Like t-SNE, UMAP tries to preserve the local structure of the data, but it also tries to preserve more of the global structure. This makes it a good choice for visualizing data that has both local and global structure.

This exercise refers to Chapter 6, "UMAP", of the "Dimensionality reduction in neuroscience" course (tutor: Fabrizio Musacchio, Oct 17, 2024).

## Acknowledgements
The dataset used here is taken from the datasets available in the [openTSNE package](https://opentsne.readthedocs.io/en/stable/examples/01_simple_usage/01_simple_usage.html). Specifically, it is the Macosko 2015 mouse retina data set.

## Environment setup
For reproducibility:

```bash
conda create -n dimredcution python=3.11 mamba -y
conda activate dimredcution
mamba install -y ipykernel matplotlib numpy scipy scikit-learn umap-learn
```

Let's start by importing the necessary libraries:

In [None]:
# %% IMPORTS
import os
import gzip
import pickle
import matplotlib.pyplot as plt
import numpy as np

import umap

from sklearn.manifold import TSNE
from sklearn.preprocessing import LabelEncoder

from sklearn.decomposition import PCA


# set global properties for all plots:
plt.rcParams.update({'font.size': 14})
plt.rcParams["axes.spines.top"]    = False
plt.rcParams["axes.spines.bottom"] = False
plt.rcParams["axes.spines.left"]   = False
plt.rcParams["axes.spines.right"]  = False

Define the path to the data file:

In [2]:
# %% DEFINE PATHS
DATA_PATH = '../data/'
DATA_FILENAME = 'macosko_2015.pkl.gz'
DATA_FILE = os.path.join(DATA_PATH, DATA_FILENAME)

RESULTSPATH = '../results/'
# create the results path if it does not exist yet:
os.makedirs(RESULTSPATH, exist_ok=True)

Load the data:

In [None]:
# %% LOAD DATA
with gzip.open(DATA_FILE, "rb") as f:
    data = pickle.load(f)

x = data["pca_50"]

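# keep the cell-type names as strings in y (needed for the color lookup below):
y = data["CellType1"].astype(str)

# integer-encoded "CellType2" labels, stored separately so that y is not overwritten:
label_encoder = LabelEncoder()
y_celltype2_encoded = label_encoder.fit_transform(data["CellType2"])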

print(f"The RNA data set (x) contains {x.shape[0]} samples with {x.shape[1]} features")
print(f"y has shape {y.shape} with unique values: {np.unique(y)}")

For our later plots, we again define a color dictionary that assigns one color to each cell type:

In [4]:
# decipher the cell types and create an appropriate color-label array:
MACOSKO_COLORS = {
    "Amacrine cells": "#A5C93D",
    "Astrocytes": "#8B006B",
    "Bipolar cells": "#2000D7",
    "Cones": "#538CBA",
    "Fibroblasts": "#8B006B",
    "Horizontal cells": "#B33B19",
    "Microglia": "#8B006B",
    "Muller glia": "#8B006B",
    "Pericytes": "#8B006B",
    "Retinal ganglion cells": "#C38A1F",
    "Rods": "#538CBA",
    "Vascular endothelium": "#8B006B",
}

# Map the cell types in y to their corresponding colors
colors_array = [MACOSKO_COLORS[cell_type] for cell_type in y]

## 📝 Perform UMAP
Perform UMAP using the following parameters:
- `n_components=2`
- `n_neighbors=15`
- `min_dist=0.1`
- `metric='euclidean'`
- `random_state=42`

In [None]:
# Your code goes here:

# define the UMAP model:
# umap_model = ...

# fit the model:
# ...



Compare the processing time of UMAP with t-SNE: What do you notice?
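
One way to measure the processing time is a small timing helper around `fit_transform`. The sketch below is only a suggestion: `time_fit_transform` and `x_demo` are hypothetical names, and the demo runs scikit-learn's `TSNE` on random stand-in data. The helper should work the same way with any model that offers `fit_transform`, including a UMAP model.

```python
import time

import numpy as np
from sklearn.manifold import TSNE


def time_fit_transform(model, data):
    """Return (embedding, elapsed seconds) for model.fit_transform(data)."""
    start = time.perf_counter()
    embedding = model.fit_transform(data)
    elapsed = time.perf_counter() - start
    return embedding, elapsed


# hypothetical stand-in data: 200 samples with 10 features
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 10))

tsne_embedding, tsne_seconds = time_fit_transform(
    TSNE(n_components=2, random_state=42), x_demo
)
print(f"t-SNE took {tsne_seconds:.2f} s")
```

Note that the first UMAP call in a session typically includes just-in-time compilation, so timing a second run can give a fairer comparison.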

## 📝 Plot the results
Plot the UMAP results using the color dictionary defined above. 

To add a legend that maps the colors to the cell types, you can use the following code snippet:

```python
handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=8, label=label)
           for label, color in MACOSKO_COLORS.items()]
plt.legend(handles=handles, title="Cell Types", bbox_to_anchor=(1.05, 1), loc='upper left')
```
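
Since the same plot is needed for every UMAP variant in this session, it may help to wrap the scatter plot and the legend snippet above in a reusable function. The following is a sketch under assumptions: `plot_embedding` is a hypothetical helper, and the demo uses random 2D points with two made-up labels instead of a real embedding.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_embedding(embedding, labels, color_map, title):
    """Scatter a 2D embedding, coloring each point by its label."""
    point_colors = [color_map[label] for label in labels]
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(embedding[:, 0], embedding[:, 1], c=point_colors, s=2, alpha=0.7)
    ax.set_title(title)
    # one legend entry per label, built as in the snippet above:
    handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color,
                          markersize=8, label=label)
               for label, color in color_map.items()]
    ax.legend(handles=handles, title="Cell Types",
              bbox_to_anchor=(1.05, 1), loc='upper left')
    fig.tight_layout()
    return fig, ax


# hypothetical toy input: 100 random 2D points with two made-up labels
rng = np.random.default_rng(1)
demo_points = rng.normal(size=(100, 2))
demo_labels = rng.choice(["A", "B"], size=100)
demo_colors = {"A": "#A5C93D", "B": "#2000D7"}

fig, ax = plot_embedding(demo_points, demo_labels, demo_colors, "Demo embedding")
```

With the exercise data, the call would take the UMAP embedding, `y`, and `MACOSKO_COLORS` as arguments.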

Compare the UMAP results with the t-SNE results. What do you notice?

In [None]:
# Your code goes here:



As with t-SNE, UMAP can be run with different distance metrics. Available metrics are:

- euclidean
- manhattan
- chebyshev
- minkowski
- canberra
- braycurtis
- mahalanobis
- wminkowski
- seuclidean
- cosine
- correlation
- haversine
- hamming
- jaccard
- dice
- russellrao
- kulsinski
- ll_dirichlet
- hellinger
- rogerstanimoto
- sokalmichener
- sokalsneath
- yule
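
To build some intuition for how the choice of metric changes what counts as "close", the sketch below computes three of the listed metrics by hand for a few toy vectors. The vectors and function names are illustrative only, and cosine distance is taken here as one minus the cosine similarity.

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 1.0])  # orthogonal to a, same length


def euclidean(u, v):
    # straight-line distance
    return np.sqrt(np.sum((u - v) ** 2))


def manhattan(u, v):
    # sum of absolute coordinate differences
    return np.sum(np.abs(u - v))


def cosine(u, v):
    # one minus the cosine of the angle between u and v
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


for name, metric in [("euclidean", euclidean), ("manhattan", manhattan), ("cosine", cosine)]:
    print(f"{name:10s} d(a, b) = {metric(a, b):.3f}   d(a, c) = {metric(a, c):.3f}")
```

The cosine distance between `a` and `b` is zero because it ignores vector length and only compares directions, whereas the Euclidean and Manhattan distances between them are both 1.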

## 📝 Define the UMAP model with cosine metric
1. Fit the model with the data.
2. Plot the results.

In [None]:
# Your code goes here:



## 📝 Define the UMAP model with manhattan metric
1. Fit the model with the data.
2. Plot the results.

In [None]:
# Your code goes here:


## 📝 Play further with the UMAP parameters and metrics
Freely explore the UMAP parameters and metrics.
1. Vary
   - the metric (`metric`)
   - the number of neighbors (`n_neighbors`)
   - the minimum distance (`min_dist`)
2. Plot the results.

## 📝 Compare the results of the different UMAP models. What do you notice?

# Your answer goes here: