# Using Uniform Manifold Approximation and Projection (UMAP) to Differentiate Subsets of Gene Populations
#### In this Jupyter Notebook, we set up a standard UMAP environment and run a UMAP analysis on an example data set.
***

### Import Tools for UMAP: 

In [None]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline

Most of these tools are ones we have already seen in class: numpy, seaborn, and pandas. It's cool to see the content we're taught in class used for these sorts of high-level analyses! In this notebook, pandas loads the data, numpy handles the arrays, and matplotlib and seaborn draw the final plot.

The sklearn imports are meant for preprocessing the data. scikit-learn is also a dependency of the UMAP package itself, so it needs to be installed either way.
***

### Import UMAP:

In [None]:
import umap

We also need to make sure that the conda environment has the UMAP package installed. To do this, go to your terminal and run:
```
conda activate <conda environment/kernel>
conda install -c conda-forge umap-learn
```
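
Before moving on, it can help to confirm that the active kernel actually sees the packages this notebook imports. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the list simply mirrors the imports above plus `openpyxl`, which is needed further down to read the Excel file.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# These are import names, not conda package names:
# umap-learn is imported as `umap`, scikit-learn as `sklearn`.
required = ["numpy", "pandas", "seaborn", "matplotlib", "sklearn", "umap", "openpyxl"]
print("Missing packages:", missing_packages(required) or "none")
```

Anything listed as missing can be installed into the active environment with `conda install` before running the rest of the notebook.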
***

### Read the Data File:

In [None]:
genes = pd.read_excel("/Users/frankiegarcia/Library/CloudStorage/OneDrive-Personal/Documents/Columbia/Classes/fall2022/python/IntroPython/coding_project/cluster_data.xlsx")
genes.head()

To read an Excel (.xlsx) file with pandas, be sure to have the openpyxl package installed in your conda environment. This data was obtained from https://www.pnas.org/doi/full/10.1073/pnas.1525528113#supplementary-materials

***

### Processing the Data File:

In [None]:
genes = genes.dropna()  # drop every row that contains a NaN

# keep in mind the 'clusterID' column - it acts as the label for each gene in the UMAP dataset
if pd.api.types.is_float_dtype(genes["clusterID"]):
    raise TypeError("floats not allowed")

cluster_count = genes["clusterID"].value_counts()  # number of genes per cluster
cluster_count

Here, we start with a blanket `dropna()` call to get rid of any rows that contain NaN values. We then count the number of genes associated with each cluster; the cell also raises a `TypeError` if the cluster IDs are stored as floats.
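
To make the effect of this cell concrete, here is a small sketch on a made-up table (the gene names and values are invented for illustration). It shows that `dropna()` removes whole rows, that the original row labels are kept, and that a column which ever held a NaN is stored as float until it is cast back.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the gene table; names and values are made up.
toy = pd.DataFrame(
    {
        "gene": ["a", "b", "c", "d", "e"],
        "clusterID": [1, 2, np.nan, 2, 1],
        "E17MG": [0.5, 1.2, 3.3, np.nan, 2.0],
    }
)

clean = toy.dropna()  # removes rows "c" and "d", each of which has a NaN
print(clean.index.tolist())  # original row labels are kept: [0, 1, 4]

# A column that held a NaN is stored as float, even after dropna()
print(clean["clusterID"].dtype)  # float64

# Cast back to integers once the missing rows are gone, then count per cluster
clean = clean.assign(clusterID=clean["clusterID"].astype(int))
print(clean["clusterID"].value_counts().sort_index())
```

If the float check ever fires on the real data, a cast like the one above is one possible way to recover integer cluster IDs.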

In [None]:
reducer = umap.UMAP(random_state=42)  # reduces the dimensionality of the dataset to something comprehensible in 2D
gene_data = genes[
    [
        "E17MG",
        "P7nMG",
        "P7pMG",
        "P14MG",
        "P21MG",
        "P60MG",
        "LPSMG",
        "E17MY.",
        "P21MY",
        "P60MY",
        "SZMGMP",
        "E17WB.",
        "P7WB.",
        "P21WB.",
        "P60WB",
        "E17LivMY."
    ]
].values #selecting the groups that will be analyzed for clustering 
reducer.fit(gene_data) #fitting the data to the UMAP package

In this processing step, we select the columns that we want to analyze with UMAP and fit the reducer to them. How UMAP reduces dimensionality is described in the package documentation. In short, it builds a graph of each point's nearest neighbors in the original high-dimensional space and then finds a 2D layout that preserves that neighborhood structure as well as it can. Note that although we imported the StandardScaler from scikit-learn, this cell fits UMAP on the values exactly as they appear in the file; z-scoring each column first is an optional preprocessing step. By reducing the dimensionality, we can view structure that is spread across many columns in a single 2D representation!
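
`StandardScaler` is imported at the top of the notebook, so one plausible preprocessing step is to z-score each column before fitting. The sketch below shows what that scaling does on a small made-up array. Whether scaling is appropriate for this particular dataset is an assumption, not something the source data prescribes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up expression values: 4 genes (rows) x 3 samples (columns)
toy_data = np.array(
    [
        [1.0, 200.0, 0.1],
        [2.0, 180.0, 0.4],
        [3.0, 220.0, 0.2],
        [4.0, 200.0, 0.3],
    ]
)

# Each column is shifted to mean 0 and rescaled to unit variance
scaled = StandardScaler().fit_transform(toy_data)

print(scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(scaled.std(axis=0).round(6))   # approximately 1 for every column
```

In this notebook, the analogous calls would be `scaled_gene_data = StandardScaler().fit_transform(gene_data)` followed by `reducer.fit(scaled_gene_data)`, where `scaled_gene_data` is a hypothetical name.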

In [None]:
embedding = reducer.transform(gene_data)  # 2D coordinates used to graph the UMAP clustering
# transform() on the training data should reproduce the fitted embedding_
assert np.all(embedding == reducer.embedding_)
embedding.shape

The embedding is the output of the fitted reducer: an array with one row per gene and two columns, giving each gene's coordinates in the reduced 2D space. We store it so that it can be plotted with matplotlib.
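
For plotting or exporting, it can be convenient to keep the coordinates next to the cluster labels. The sketch below uses a random array as a stand-in for `embedding` and a made-up label column; the names `toy_genes`, `embedding_df`, `UMAP1`, and `UMAP2` are hypothetical. Passing the original index matters because `dropna()` leaves gaps in the row labels.

```python
import numpy as np
import pandas as pd

# Stand-in for the filtered gene table: note the gap in the index left by dropna()
toy_genes = pd.DataFrame({"clusterID": [1, 2, 1, 3]}, index=[0, 1, 3, 4])

# Stand-in for reducer.embedding_: one (x, y) row per remaining gene
rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(len(toy_genes), 2))

# Reuse the gene table's index so rows line up label-for-label
embedding_df = pd.DataFrame(
    toy_embedding, columns=["UMAP1", "UMAP2"], index=toy_genes.index
)
embedding_df["clusterID"] = toy_genes["clusterID"]

print(embedding_df.shape)  # (4, 3)
```

Without the `index=` argument, the new frame would be labeled 0 to 3, and assigning `clusterID` by label would leave a missing value in the last row.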
***

### Creating and Exporting your UMAP Representation:

In [None]:
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

plt.scatter(embedding[:, 0], embedding[:, 1], c=genes.clusterID, cmap='Spectral', s=5) #customizing the graph aesthetics
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP Projection of the Gene Dataset', fontsize=24);
plt.savefig('UMAP.png', bbox_inches = 'tight')

This is the graph generated from the UMAP embedding, with each point (gene) colored by its cluster ID. If the projection is successful, genes from different population subsets should appear as visibly separate groups of points.
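
Judging "success" by eye is subjective, so a rough numerical check can complement the plot. One option, which is a swapped-in technique and not part of the original analysis, is scikit-learn's silhouette score computed on the 2D coordinates and the cluster labels. The sketch below runs it on made-up points rather than on the real embedding.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Made-up 2D coordinates: two tight, well-separated groups of 50 points each
group_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
group_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(50, 2))
toy_embedding = np.vstack([group_a, group_b])
toy_labels = np.array([0] * 50 + [1] * 50)

# Ranges from -1 to 1; values near 1 mean points sit close to their own group
score = silhouette_score(toy_embedding, toy_labels)
print(f"silhouette score: {score:.2f}")
```

In this notebook, the corresponding call would be `silhouette_score(embedding, genes.clusterID)`. UMAP can distort distances between groups, so the score is best read as a rough indicator alongside the plot.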
***