# Dimensionality reduction 1

__Week 2 - 4 May 2022__

Practice applying dimensionality reduction to analyse datasets with Principal Component Analysis (`PCA` & `Kernel PCA`).

---
### Data

The dataset is a photometric catalogue of galaxies. These galaxies were found in the 2-square degree field on the sky called COSMOS by space- and ground-based telescopes.

The radiation flux (energy per second) of each galaxy is measured in 8 bands (i.e. wavelengths of light) that span the spectrum from <span style="color:blue;">blue</span> to <span style="color:rgb(192,4,1,1);">infrared</span>: `u, r, z++, yHSC, H, Ks, SPLASH1, SPLASH2`. The fluxes are not corrected for any effects, such as distance to a galaxy, therefore there is a systematic effect in their measurements (called redshift).

So, in addition to its photometry each galaxy has its observed bias and physical properties:
* `redshift`$^1$ - systematic bias in flux measurements.
* `log_mass` - $\log_{10}$ of the stellar mass (inferred from a combination of fluxes and redshifts).
* `log_sfr` - $\log_{10}$ of the star formation rate (inferred from a combination of fluxes and redshifts).
* `is_star_forming` - classification, based on galaxy colours (inferred from a combination of fluxes and redshifts).

<span style="font-size:0.9em;"> $^1$ - redshift is the reddening of light that is proportional to the velocity of an object receding from us. On the sky, object velocities are proportional to their distances from us ([find out more](https://www.anisotropela.dk/encyclo/redshift.html)). </span>

---
### Exercise

Analyze the galaxy catalogue applying dimensionality reduction to galaxy fluxes.

* Apply `PCA` to fluxes. Can you find a basis of principal components that separates galaxies into star forming and dead? Does PCA give you a way to differentiate between various properties of galaxies?
* Think about preprocessing the data, if you haven't yet, and see if you can find a more representative set of principal components.
* Apply `Kernel PCA` afterwards. Does this give you a more meaningful vector space? If so, why?
* Apply `t-SNE`. Does it give you a cleaner separation between objects with different properties?
* Apply `UMAP`, for comparison.

---
* Authors:  Vadim Rusakov, Charles Steinhardt
* Email:  vadim.rusakov@nbi.ku.dk
* Date:   27th of April 2022

In [None]:
!conda install -c conda-forge umap-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import umap
from matplotlib.colors import LogNorm
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE

Load in the data:

In [None]:
file = "datasets/cosmos2015.csv"
df = pd.read_csv(file, index_col=False)
df

Select a random sub-sample of the dataset. `PCA` is a linear method and runs quickly, so you can use the whole dataset for it if you wish. The non-linear methods further down are slower, which is where the sub-sample helps.

In [None]:
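# select a random sub-sample of the dataset
n = min(10000, df.shape[0]) # cannot draw more rows than the catalogue has
rng = np.random.default_rng(42) # fixed seed so the sub-sample is reproducible
idxs_rand = rng.choice(df.shape[0], size=n, replace=False) # no galaxy is drawn twice
df_cut = df.iloc[idxs_rand] # dataframe
X = df_cut.values # array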

flux_cols = list(df.columns[4:]) # flux column names
flux_idxs = np.argwhere(np.isin(df.columns, flux_cols)).flatten() # flux column indices
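As a side note on the sub-sampling step, here is a minimal sketch of the difference between drawing indices with and without replacement. The sizes and the seed are arbitrary toy values, not taken from the catalogue. Drawing with replacement (the default behaviour of `choice`) can pick the same row more than once, which shrinks the effective sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_rows, n_draw = 1000, 500      # toy sizes, not the catalogue's

with_repl = rng.choice(n_rows, size=n_draw, replace=True)
without_repl = rng.choice(n_rows, size=n_draw, replace=False)

# count how many distinct rows each sample contains
n_unique_with = np.unique(with_repl).size
n_unique_without = np.unique(without_repl).size
print("distinct rows with replacement:   ", n_unique_with)
print("distinct rows without replacement:", n_unique_without)
```

Without replacement, every drawn index is distinct, so the sub-sample has exactly the requested number of different galaxies.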

## Principal Component Analysis (PCA)

Now take the galaxy data (fluxes) and find out whether you can reduce them to a couple of meaningful principal components using `PCA`. By meaningful, we mean components that separate galaxies into `star forming` and `dead`.

Use the following parameters: `n_components=2`. The user interface of the PCA in sklearn is the same as for all other similar classes (see PCA [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)).

You can access training data (only fluxes columns) as `X[:, flux_idxs]`.

In [None]:
pca = PCA(n_components=2) # PCA object that keeps the first two principal components
y_pcs = pca.fit_transform(X[:, flux_idxs]) # fit on the fluxes (raw observed data) and project onto the components
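One possible way to think about preprocessing is sketched below on synthetic data. The toy "fluxes", their scatter, and the choice of `np.log10` followed by `StandardScaler` are all assumptions made for illustration, not the intended solution. The sketch compares `explained_variance_ratio_` for PCA fitted on raw values and on log-scaled, standardised values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy "fluxes": 8 bands sharing one brightness factor that spans orders of magnitude
n_gal, n_bands = 2000, 8
brightness = 10 ** rng.normal(0.0, 1.0, size=(n_gal, 1))  # hypothetical overall scale
colour = rng.normal(0.0, 0.2, size=(n_gal, n_bands))       # hypothetical band-to-band scatter
toy_flux = brightness * 10 ** colour

# PCA on the raw toy fluxes, where the brightest objects carry most of the variance
pca_raw = PCA(n_components=2).fit(toy_flux)

# PCA on log-fluxes, standardised band by band (zero mean, unit variance)
toy_scaled = StandardScaler().fit_transform(np.log10(toy_flux))
pca_scaled = PCA(n_components=2).fit(toy_scaled)

print("raw   :", pca_raw.explained_variance_ratio_)
print("scaled:", pca_scaled.explained_variance_ratio_)
```

On the real catalogue the same idea would be applied to `X[:, flux_idxs]`, provided all fluxes are positive. Whether it gives a more representative set of components is for you to check.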

In [None]:
# create a figure
fig, ax = plt.subplots(1, figsize=(5, 5), dpi=100)
ax.set_xlim(np.percentile(y_pcs[:,0], 1), np.percentile(y_pcs[:,0], 99))
ax.set_ylim(np.percentile(y_pcs[:,1], 1), np.percentile(y_pcs[:,1], 99))
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
is_sf = np.isin(df_cut.loc[:, 'is_star_forming'], 1) # is a galaxy forming stars, i.e. alive?

# scatter plot using two principal components stored in y_pcs
ax.scatter(y_pcs[is_sf, 0], y_pcs[is_sf, 1], s=0.02, c='b')
ax.scatter(y_pcs[~is_sf, 0], y_pcs[~is_sf, 1], s=0.02, c='r')
ax.annotate("star forming", xy=(0.05, 0.9), xycoords="axes fraction", 
            color='b', fontsize=12)
ax.annotate("dead", xy=(0.05, 0.86), xycoords="axes fraction", 
            color='r', fontsize=12)
plt.show()

* Make scatter plots coloured by different galaxy properties: `log_mass`, `log_sfr`, `redshift`. Is the low-dimensional representation meaningful in any one of the properties? Can you argue why?

Below is example code for colouring the scatter by some property, e.g. `log_mass`:

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlim(np.percentile(y_pcs[:,0], 1), np.percentile(y_pcs[:,0], 99))
ax.set_ylim(np.percentile(y_pcs[:,1], 1), np.percentile(y_pcs[:,1], 99))
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

sc = ax.scatter(y_pcs[:,0], y_pcs[:,1], s=0.2, 
                c=df_cut.loc[:, 'log_mass'], cmap='jet', norm=LogNorm())
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('Mass', rotation=270, labelpad=10)
plt.show()
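When you reuse this colouring code for another property, keep in mind how the colour normalisation treats its values. The sketch below uses made-up numbers and explicit `vmin`/`vmax` limits chosen only for illustration. It shows that `LogNorm` masks non-positive values, whereas the linear `Normalize` keeps them. This matters for logarithmic quantities such as `log_sfr`, which can be negative.

```python
import numpy as np
from matplotlib.colors import LogNorm, Normalize

values = np.array([-1.0, 1.0, 10.0])  # toy values; a log-quantity can be negative

log_normed = LogNorm(vmin=0.1, vmax=10.0)(values)     # logarithmic colour scale
lin_normed = Normalize(vmin=-1.0, vmax=10.0)(values)  # linear colour scale

# LogNorm cannot map non-positive values and masks them; Normalize maps all of them
log_mask = np.ma.getmaskarray(log_normed)
lin_mask = np.ma.getmaskarray(lin_normed)
print("masked by LogNorm:  ", log_mask)
print("masked by Normalize:", lin_mask)
```

Masked points are not coloured by the colormap, so a linear norm (or no `norm` argument at all) is the safer choice for properties that take negative values.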

## Kernel PCA

For now, let us continue throwing these data at other algorithms to get some practice with them. `KernelPCA` is a variant of PCA that can use a range of kernels for non-linear operations, i.e. this extension gives flexibility in separating data that are not linearly separable.

Use the following parameters: `n_components=2`, `kernel='cosine'`. Make sure to try different kernels for reducing the dimensionality. See the `KernelPCA` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html#sklearn.decomposition.KernelPCA) in **sklearn**.

In [None]:
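kpca = KernelPCA(n_components=2, kernel='cosine') # also try other kernels, e.g. 'rbf' or 'poly'
y_pcs = kpca.fit_transform(X[:, flux_idxs]) # fit on the fluxes and project onto the two components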
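To illustrate what "not linearly separable" means, here is a small sketch on a toy dataset of two concentric rings from `make_circles`, not on the galaxy catalogue. The `rbf` kernel and `gamma=10` are hand-picked assumptions, and `gap` is a hypothetical helper written for this example. It measures how far apart the two classes sit along the first component.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# two concentric rings: no straight line separates the inner ring from the outer one
X_toy, label = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

lin = PCA(n_components=2).fit_transform(X_toy)
rbf = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X_toy)  # gamma picked by hand

def gap(z, label):
    """Distance between the class means along the first component, in units of its spread."""
    return abs(z[label == 0, 0].mean() - z[label == 1, 0].mean()) / z[:, 0].std()

print("linear PCA    :", gap(lin, label))
print("RBF kernel PCA:", gap(rbf, label))
```

Linear PCA only rotates the rings, so the class means stay on top of each other, while a suitable kernel can pull the classes apart. The galaxy fluxes may or may not behave this way, which is what the exercise asks you to find out.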

In [None]:
fig, ax = plt.subplots(1, figsize=(5, 5), dpi=100)
ax.set_xlim(np.percentile(y_pcs[:,0], 1), np.percentile(y_pcs[:,0], 99))
ax.set_ylim(np.percentile(y_pcs[:,1], 1), np.percentile(y_pcs[:,1], 99))
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
is_sf = np.isin(df_cut.loc[:, 'is_star_forming'], 1) # is a galaxy forming stars, i.e. alive?

ax.scatter(y_pcs[is_sf, 0], y_pcs[is_sf, 1], s=0.02, c='b')
ax.scatter(y_pcs[~is_sf, 0], y_pcs[~is_sf, 1], s=0.02, c='r')
ax.annotate("star forming", xy=(0.05, 0.9), xycoords="axes fraction", 
            color='b', fontsize=12)
ax.annotate("dead", xy=(0.05, 0.86), xycoords="axes fraction", 
            color='r', fontsize=12)
plt.show()

* Again, make scatter plots coloured by different galaxy properties: `log_mass`, `log_sfr`, `redshift`. Is the low-dimensional representation more meaningful with this algorithm? Can you argue why?

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlim(np.percentile(y_pcs[:,0], 1), np.percentile(y_pcs[:,0], 99))
ax.set_ylim(np.percentile(y_pcs[:,1], 1), np.percentile(y_pcs[:,1], 99))
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

sc = ax.scatter(y_pcs[:,0], y_pcs[:,1], s=0.2, norm=LogNorm(),
                c=df_cut.loc[:, 'redshift'], cmap='jet')
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('Redshift', rotation=270, labelpad=10)
plt.show()

## t-SNE

Now, try to run `t-SNE` on the dataset (for examples or set-up see the `t-SNE` documentation on the sklearn [website](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)). Use `perplexity=50, method='barnes_hut', n_iter=1000, random_state=42, verbose=2` for now. In the next class we will put more emphasis on choosing optimal values for these parameters.

* How well does `t-SNE` help to differentiate between two classes here?

* Do you get clusters of galaxies or a continuum?

* Which physical property is the most distinctly separated in the reduced space (again, use colouring of scatter to analyze this)?

In [None]:
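# running t-SNE (newer scikit-learn releases rename n_iter to max_iter)
tsne = TSNE(n_components=2, perplexity=50, method='barnes_hut', n_iter=1000,
            random_state=42, verbose=2)
y = tsne.fit_transform(X[:, flux_idxs]) # embed the fluxes in two dimensions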
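If you want to get a feel for `t-SNE` before running it on 10000 galaxies, the following sketch embeds a small synthetic dataset of three blobs in 8 dimensions. The data, the sample size, and the two perplexity values are arbitrary assumptions chosen so that it runs in seconds.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# three toy clusters in 8 dimensions (sizes and spread are arbitrary assumptions)
X_toy, label = make_blobs(n_samples=300, n_features=8, centers=3,
                          cluster_std=1.0, random_state=0)

# embed the same data with two perplexities; perplexity must be below the number of samples
embeddings = {}
for perplexity in (5, 30):
    tsne_toy = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne_toy.fit_transform(X_toy)
    print(perplexity, embeddings[perplexity].shape)
```

Plotting each entry of `embeddings` coloured by `label` gives a quick impression of how the layout depends on `perplexity`.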

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
is_sf = np.isin(df_cut.loc[:, 'is_star_forming'], 1)

ax.scatter(y[is_sf, 0], y[is_sf, 1], s=0.05, c='b')
ax.scatter(y[~is_sf, 0], y[~is_sf, 1], s=0.05, c='r')
ax.annotate("star forming", xy=(0.05, 0.9), xycoords="axes fraction", 
            color='b', fontsize=12)
ax.annotate("dead", xy=(0.05, 0.86), xycoords="axes fraction", 
            color='r', fontsize=12)
plt.show()

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

sc = ax.scatter(y[:,0], y[:,1], s=0.2, norm=LogNorm(),
                c=df_cut.loc[:, 'redshift'], cmap='jet')
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('Redshift', rotation=270, labelpad=10)
plt.show()

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

sc = ax.scatter(y[:,0], y[:,1], s=0.2, norm=LogNorm(),
                c=df_cut.loc[:, 'log_mass'], cmap='jet')
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('Mass', rotation=270, labelpad=10)
plt.show()

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

# log_sfr can be negative, so use the default linear colour scale rather than LogNorm
sc = ax.scatter(y[:,0], y[:,1], s=0.2,
                c=df_cut.loc[:, 'log_sfr'], cmap='jet')
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('SFR', rotation=270, labelpad=10)
plt.show()

## UMAP

Now try using `UMAP`. For documentation see the UMAP [webpage](https://umap-learn.readthedocs.io/en/latest/api.html). It has the same interface as the other embedding classes above. Use `n_components=2, n_neighbors=50, random_state=42`.

* Do you get something similar to `t-SNE`?

* How well can you map different properties in the reduced space?

* Do you get clusters or continuous distributions? Which physical property is the most strongly separable with `UMAP`?

In [None]:
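reducer = umap.UMAP(n_components=2, n_neighbors=50, random_state=42) # avoid the name `map`, which shadows the Python builtin
y = reducer.fit_transform(X[:, flux_idxs]) # embed the fluxes in two dimensions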

In [None]:
fig, ax = plt.subplots(1, figsize=(5, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
is_sf = np.isin(df_cut.loc[:, 'is_star_forming'], 1)

ax.scatter(y[is_sf, 0], y[is_sf, 1], s=0.02, c='b')
ax.scatter(y[~is_sf, 0], y[~is_sf, 1], s=0.02, c='r')
ax.annotate("star forming", xy=(0.05, 0.9), xycoords="axes fraction", 
            color='b', fontsize=12)
ax.annotate("dead", xy=(0.05, 0.86), xycoords="axes fraction", 
            color='r', fontsize=12)
plt.show()

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

sc = ax.scatter(y[:,0], y[:,1], s=0.2, norm=LogNorm(),
                c=df_cut.loc[:, 'redshift'], cmap='jet')
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('Redshift', rotation=270, labelpad=10)
plt.show()

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

sc = ax.scatter(y[:,0], y[:,1], s=0.2, norm=LogNorm(),
                c=df_cut.loc[:, 'log_mass'], cmap='jet')
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('Mass', rotation=270, labelpad=10)
plt.show()

In [None]:
fig, ax = plt.subplots(1, figsize=(6, 5), dpi=100)
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')

# log_sfr can be negative, so use the default linear colour scale rather than LogNorm
sc = ax.scatter(y[:,0], y[:,1], s=0.2,
                c=df_cut.loc[:, 'log_sfr'], cmap='jet')
cbar = plt.colorbar(sc)
cbar.ax.set_ylabel('SFR', rotation=270, labelpad=10)
plt.show()