# Big data analysis using machine learning and probabilistic models 
# part C - Multivariate

### General Instructions 
Follow the instructions below to analyze the data matrix “mm_gastru_small.h5ad”. You are free to use any package or software (in Python or R). Because data analysis is always done under uncertainty, you will often need to make hard decisions about parameters, algorithms, and visualization. This is OK. There is no single “correct” solution; a perfect project is one that conveys understanding of the underlying data using the methodologies we studied in class.<br />
<br />
Describe your work in:<br />
A Jupyter notebook with explanations in markdown and comments.<br />
Alternatively, provide a written report (PDF) with a concise description of what you find, including figures (strict length limit of 6 pages, so keep figures small), as well as your “source” code (no need to document it, but divide it into sections, one per figure you generate).<br />
Please work alone. You are, however, free to discuss tools and analysis strategies with fellow students.


In [None]:
import anndata as ad
import scanpy as sc
# import metacells as mc
import pandas as pd
import numpy as np
import sklearn as skl
import scipy.stats
import matplotlib.pyplot as plt

# Multivariate (groups of cells / genes)
## Selection of feature genes – choose a small (~1000) subset of relevant genes

“Features” in this sense are genes that are useful for explaining the cells' behavior: non-empty, not noisy, and differentially expressed (and not lateral, but this last criterion is beyond our current scope).

1.  Plot gene variance VS gene mean expression
    Explain the trend. How might this affect feature selection?

2.	Plot the relative variance (var/mean) VS mean expression
    Explain the behavior at low mean values

3.	Suggest a feature selection criterion and explain its reasoning

4.	Create a data set with only the selected feature genes for use in the following sections

5.	Optional: compare your selected features with those of an external method (metacells, for example). Is there a substantial overlap?

Hint: https://tanaylab.github.io/metacell/articles/a-basic_pbmc8k.html

Extra: Perform and discuss cross validation for these models. 
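
As a rough illustration of items 1 to 4, the sketch below computes per-gene mean, variance, and relative variance (var/mean) and applies one possible selection rule. Everything specific here is an assumption for illustration: the synthetic Poisson matrix stands in for the real counts from the h5ad file, and the thresholds `min_mean` and `min_rel_var` are arbitrary choices, not recommended values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a cells x genes count matrix (assumption: the real
# counts would be read from the h5ad file instead).
n_cells, n_genes, n_planted = 500, 2000, 100
gene_rate = rng.lognormal(mean=-2.0, sigma=1.5, size=n_genes)
rates = np.tile(gene_rate, (n_cells, 1))
# The first n_planted genes differ between two halves of the cells,
# so they are overdispersed relative to pure Poisson noise.
rates[:, :n_planted] = 1.0
rates[n_cells // 2:, :n_planted] = 9.0
counts = rng.poisson(rates).astype(float)

# Per-gene statistics across cells.
mean = counts.mean(axis=0)
var = counts.var(axis=0, ddof=1)
# Relative variance; genes with zero mean are assigned 0 to avoid dividing by zero.
rel_var = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)

# Hypothetical selection rule: expressed enough, and more variable than
# expected from Poisson sampling alone (where var/mean is close to 1).
min_mean, min_rel_var = 0.1, 2.0
selected = (mean >= min_mean) & (rel_var >= min_rel_var)

# Data set restricted to the selected feature genes.
features = counts[:, selected]
print(f"selected {selected.sum()} of {n_genes} genes")
```

The arrays `mean`, `var`, and `rel_var` can be passed directly to a log-log scatter plot. The minimum-mean cutoff matters because at low mean values the var/mean ratio is estimated from very few non-zero counts and is dominated by sampling noise.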

## Perform clustering on the rows (cells).

<span style="color: red;">Note: If any computation you attempt lasts longer than 5 seconds, kill it. </span>

You may sample a small number of cells (~1000) for the following tasks.

6.  K-means – how did you choose k?

7.  Hierarchical clustering – what is your method of choice? Why?

8.  Overlay the resulting k-means clusters on the Hierarchical clustering tree – Is it a good match?

9.  Visualize a sample of your correlations from section 2, ordered according to your clustering. Also visualize the clustered matrix in a heat map (you can order the genes as well to get a better picture). Is this visualization appropriate? Does it indicate that clustering is “bad”?

10. Finally, find for each cluster the 10 columns (genes) that best separate it from the other clusters. Assign a statistical significance to each selected gene.
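
One way to approach items 6 to 8 and 10 is sketched below with SciPy only. It is a minimal sketch under several assumptions: the data are synthetic (three planted groups), k is fixed at 3 rather than chosen from the data, Ward linkage is just one of several reasonable hierarchical methods, and the marker test (one-sided Mann-Whitney U with Bonferroni correction) is one option among many.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Synthetic stand-in for sampled cells x feature genes: three groups of
# cells, each with its own block of 10 elevated genes, plus 20 noise genes.
k, cells_per_group, n_genes = 3, 100, 50
truth = np.repeat(np.arange(k), cells_per_group)
data = rng.normal(0.0, 1.0, size=(k * cells_per_group, n_genes))
for g in range(k):
    data[truth == g, g * 10:(g + 1) * 10] += 4.0

# K-means with several restarts; keep the solution with the lowest
# within-cluster sum of squares, since a single run can hit a local minimum.
best_inertia, km_labels = np.inf, None
for _ in range(10):
    centroids, labels = kmeans2(data, k, minit="++")
    inertia = ((data - centroids[labels]) ** 2).sum()
    if inertia < best_inertia:
        best_inertia, km_labels = inertia, labels

# Hierarchical clustering (Ward linkage), with the tree cut into k clusters.
tree = linkage(data, method="ward")
hc_labels = fcluster(tree, t=k, criterion="maxclust") - 1

# Contingency table between the two partitions, and the fraction of cells
# that fall in the dominant hierarchical cluster of their k-means cluster.
table = np.zeros((k, k), dtype=int)
np.add.at(table, (km_labels, hc_labels), 1)
agreement = table.max(axis=1).sum() / table.sum()

# Marker genes: for each k-means cluster, test every gene for higher values
# inside the cluster than outside, then keep the 10 smallest p-values.
markers = {}
for c in range(k):
    in_c = km_labels == c
    pvals = np.array([
        mannwhitneyu(data[in_c, j], data[~in_c, j], alternative="greater").pvalue
        for j in range(n_genes)
    ])
    top = np.argsort(pvals)[:10]
    # Bonferroni-adjusted p-values, capped at 1.
    markers[c] = [(int(j), min(1.0, float(pvals[j]) * n_genes)) for j in top]

print(table)
print(f"agreement: {agreement:.2f}")
```

The contingency table gives a quick numeric answer to whether the two clusterings match, which complements coloring the dendrogram leaves by k-means label. On real data the agreement is usually far from perfect, and that by itself does not mean either clustering is wrong.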

## Embed the cells (rows) representation over the feature genes in 2D

11. Attempt UMAP and PCA

12. Compute a kNN graph for the cells (rows) and overlay its edges on the 2D representation. What is the meaning of long edges?

*There are cell type and color annotations in the cells' metadata; consider adding those to your plots.
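
The sketch below illustrates items 11 and 12 for the PCA case: a 2D projection via SVD, a kNN graph built in the full feature space, and the length of each graph edge once it is drawn in 2D. The data are synthetic and the neighbor count `n_neighbors` is an arbitrary choice; the same edge list could be overlaid on UMAP coordinates instead.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)

# Synthetic stand-in for cells x feature genes: two groups of cells that
# differ in the first 5 of 20 features.
n_per_group, n_features, n_neighbors = 150, 20, 5
data = rng.normal(size=(2 * n_per_group, n_features))
data[n_per_group:, :5] += 4.0

# PCA via SVD of the centered matrix; the first two components give the 2D layout.
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = u[:, :2] * s[:2]
explained = s[:2] ** 2 / (s ** 2).sum()

# kNN graph in the full feature space. Each point is its own nearest
# neighbor, so query one extra neighbor and drop the first column.
tree = cKDTree(data)
_, idx = tree.query(data, k=n_neighbors + 1)
sources = np.repeat(np.arange(data.shape[0]), n_neighbors)
targets = idx[:, 1:].ravel()
edges = np.column_stack([sources, targets])

# Length of every edge as it appears in the 2D embedding, and the segment
# endpoints in a (n_edges, 2, 2) array suitable for line plotting.
edge_len_2d = np.linalg.norm(coords_2d[sources] - coords_2d[targets], axis=1)
segments = np.stack([coords_2d[sources], coords_2d[targets]], axis=1)

print(f"variance explained by PC1, PC2: {explained.round(3)}")
print(f"longest 2D edge: {edge_len_2d.max():.2f}")
```

The `segments` array has the shape expected by `matplotlib.collections.LineCollection`, so the edges can be drawn under a scatter plot of `coords_2d`. An edge that is short in the feature space but long in 2D connects cells the embedding has pulled apart, which is one way to spot distortion in the projection.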