# Big data analysis using machine learning and probabilistic models 
# part B - 2D distributions (pairs of cells / genes)

### General Instructions 
Follow the instructions below to analyze the data matrix “mm_gastru_small.h5ad”. You are free to use any package or software (in Python or R). Because we are practicing data analysis, which is always done under uncertainty, you will often need to make hard decisions about parameters, algorithms, and visualization. This is OK. There is no single “correct” solution; a perfect project is one that conveys understanding of the underlying data using the methodologies we studied.<br />
<br />
Describe your work in:<br />
A Jupyter notebook with explanations in markdown and comments.<br />
Alternatively, provide a written report (PDF) with a concise description of what you find, including figures (the strict length limit is 6 pages, so keep figures small), as well as your “source” code (it need not be documented, but divide it into sections, one per figure you generate).<br />
Please work alone. You are however free to discuss with fellow students regarding tools and analysis strategies. 


### Exercise B: In this stage you'll explore the notion of correlation between features and how to identify groups of correlated features

In [None]:
import anndata as ad
import scanpy as sc
# import metacells as mc
import pandas as pd
import numpy as np
import sklearn as skl
import scipy.stats
import matplotlib.pyplot as plt  # pyplot is the recommended plotting interface

# 2D distributions (pairs of cells / genes)

[We continue with the cleaned and downsampled matrices from the previous section]

## Tasks

1.  Plot a scatter of all the genes for a single pair of cells.
    Run through several such scatters, for example the cells "190307_P10.190307_P10350" and "190313_P12.190313_P12102".


2.  Plot all the cells for a single pair of genes. 
    Run through several such scatters, for example the genes “Pou5f1” and “Dnmt3b”.

3.  Sample pairs of cells / genes and calculate the correlation of each pair,
    for example the genes “Pou5f1” and “Dnmt3b”.

4.  Plot the distributions of these correlations. 
    How do the correlations of the raw data and down-sampled data compare?

5.  Compute a cell-to-cell correlation matrix 
    (for efficiency, use “numpy.corrcoef”) 

6. Find the top correlated gene for each gene, using: Pearson on the <b>un</b>-normalized matrix, Pearson, Pearson on log-transformed values, Spearman.

    Show scatter plots.
    Compare to the feature mean
         (show that correlation needs depth).
    Repeat on the normalized matrix. Is there a difference?
    Repeat on the downsampled matrix. Is there a difference?

7. Plot a heat map of correlations on high-variance genes (try to use seaborn clustermap)

8.  Decide on a reasonable statistical test for comparing cells and visualize the distribution of derived statistics
    e.g. if you are doing linear correlation, show the distribution of the R coefficients (Pearson coefficient). 
    Discuss the statistical significance of the test for correlation between two cells, including the multiple-testing effect.

9.  Consider the pair of cells "190307_P10.190307_P10350" and "190313_P12.190313_P12102". Are they more similar than average?

10. What are the most correlated genes? Cells?

11. Compute the FDR (False Discovery Rate) to estimate how much of the signal is significant. Repeat the same computations with the matrix after replacing each value x by log(1+x) and normalizing columns such that their sums are 0. Discuss the reasons for the differences between the outcomes. 
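
As a starting point for the correlation and FDR tasks above, here is a minimal sketch on a synthetic Poisson count matrix, not the course data. The matrix shape, the planted correlated gene pair, the cells-by-genes orientation, the 0.05 threshold, and the helper names `log_center`, `pearson_matrix`, `correlation_pvalues` and `bh_fdr` are all assumptions made for illustration. The Benjamini-Hochberg procedure is used here as one common way to control FDR; it also assumes independent (or positively dependent) tests, which gene pairs only approximately satisfy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for a cells x genes count matrix (assumption: not the real data).
n_cells, n_genes = 200, 50
counts = rng.poisson(lam=5.0, size=(n_cells, n_genes)).astype(float)
# Plant one correlated gene pair so that there is some signal to find.
counts[:, 1] = counts[:, 0] + rng.poisson(lam=1.0, size=n_cells)


def log_center(mat):
    """Replace x by log(1 + x), then shift each column so that it sums to 0."""
    logged = np.log1p(mat)
    return logged - logged.mean(axis=0, keepdims=True)


def pearson_matrix(mat):
    """Pearson correlation between columns (np.corrcoef treats rows as variables)."""
    return np.corrcoef(mat.T)


def correlation_pvalues(r, n):
    """Two-sided p-values for Pearson r using the t-distribution with n - 2 df."""
    r = np.clip(r, -0.999999, 0.999999)  # avoid division by zero at |r| = 1
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(np.abs(t), df=n - 2)


def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values) for a 1D array."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    qvals = np.empty(m)
    qvals[order] = np.clip(scaled, 0.0, 1.0)
    return qvals


corr = pearson_matrix(log_center(counts))
upper = np.triu_indices(n_genes, k=1)  # count each gene pair once
r_values = corr[upper]
q_values = bh_fdr(correlation_pvalues(r_values, n_cells))
n_significant = int((q_values < 0.05).sum())
print(f"{n_significant} of {r_values.size} gene pairs pass FDR < 0.05")
```

Running the same pipeline with and without `log_center` is one way to set up the comparison the last task asks for.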

Extra: find a reasonable way to shuffle the matrix and compare the correlations on the real matrix to the shuffled one. What structures in the matrix should your reshuffling strategy conserve? Which should be eliminated?
Hint (suggested reading): https://mathworld.wolfram.com/CorrelationCoefficient.html 
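
For the shuffling question above, one possible strategy (an assumption, not necessarily the intended answer) is to permute each gene's values independently across cells. The sketch below uses a synthetic count matrix with one planted correlated gene pair; the matrix, the cells-by-genes orientation, and the helper names `shuffle_within_genes` and `upper_triangle_correlations` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic cells x genes counts with one planted correlated pair (assumption).
n_cells, n_genes = 300, 40
counts = rng.poisson(lam=4.0, size=(n_cells, n_genes)).astype(float)
counts[:, 1] = counts[:, 0] + rng.poisson(lam=1.0, size=n_cells)


def shuffle_within_genes(mat, rng):
    """Permute each column independently, keeping each gene's set of values."""
    shuffled = mat.copy()
    for j in range(shuffled.shape[1]):
        shuffled[:, j] = rng.permutation(shuffled[:, j])
    return shuffled


def upper_triangle_correlations(mat):
    """Pearson correlation of every pair of columns, each pair counted once."""
    corr = np.corrcoef(mat.T)
    return corr[np.triu_indices(mat.shape[1], k=1)]


real_r = upper_triangle_correlations(counts)
null_r = upper_triangle_correlations(shuffle_within_genes(counts, rng))

print("max |r| real    :", np.abs(real_r).max())
print("max |r| shuffled:", np.abs(null_r).max())
```

Permuting within columns conserves each gene's value distribution but not the per-cell totals, so depth-driven structure is eliminated along with gene-gene correlation; whether that is the right trade-off is part of the question.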
