<a href="https://colab.research.google.com/github/CPukszta/BI-BE-CS-183-2023/blob/main/HW6/Problem3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Bi/Be/Cs 183 2022-2023: Intro to Computational Biology
TAs: Meichen Fang, Tara Chari, Zitong (Jerry) Wang

**Submit your notebooks by sharing a clickable link with Viewer access. Link must be accessible from submitted assignment document.**

Make sure Runtime $\rightarrow$ Restart and run all works without error

**HW 6 Problem 3**

In this problem you will test different methods for variance stabilization on real single-cell datasets, and analyze the results of these procedures. This follows a recent [paper](https://www.biorxiv.org/content/biorxiv/early/2021/08/25/2021.06.24.449781.full.pdf) and [blog post](https://www.nxn.se/valent/2017/10/15/variance-stabilizing-scrna-seq-counts) about their effects in single-cell.


## **Import data and install packages**

In [1]:
import numpy as np
import scipy.io as sio
import pandas as pd
import matplotlib.pyplot as plt #Can use other plotting packages like seaborn

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

In [2]:
#Download the gene count matrix for Drop-seq Drosophila embryo data
!wget --content-disposition https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2494nnn/GSM2494783/suppl/GSM2494783_dge_mel_vir_rep1.txt.gz

--2023-02-20 23:57:14--  https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2494nnn/GSM2494783/suppl/GSM2494783_dge_mel_vir_rep1.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.7, 130.14.250.11, 2607:f220:41e:250::7, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6388719 (6.1M) [application/x-gzip]
Saving to: ‘GSM2494783_dge_mel_vir_rep1.txt.gz’


2023-02-20 23:57:15 (22.8 MB/s) - ‘GSM2494783_dge_mel_vir_rep1.txt.gz’ saved [6388719/6388719]



In [3]:
!gunzip GSM2494783_dge_mel_vir_rep1.txt.gz

## **Read in data for analysis**

**The dataset**

This dataset is from a Drop-seq experiment whose purpose was to conduct a single-cell study of the early *Drosophila* (fruit fly) embryo at particular stages of development ([Karaiskos et al., 2017](http://dx.doi.org/10.1126/science.aan3235)), from both *Drosophila melanogaster* and *Drosophila virilis* species. Over 5000 embryos were sequenced to  generate a predictive 3D map of gene expression during development across the embryo (using previous *in situ* hybridization data).

<center><img src="https://drive.google.com/uc?export=view&id=1p4qrvbhjGahIQL1s3M-UzAFbqhuNyTt7" alt="EMFigure" width="600" height="300"></center>



**The count matrix**

The gene count matrix is 3,247 cells by 23,712 genes. These counts have not been processed/normalized, so they directly represent the UMI counts from each cell.


In [4]:
#Get gene count matrix
data = pd.read_csv('GSM2494783_dge_mel_vir_rep1.txt', sep='\t',index_col=0)
data.head()

Unnamed: 0_level_0,CATCTTGGTTCN,GTACTAATTACN,GGAAACACGTTC,ACGCACAACTCN,AGAGCTCGTGTA,AATCACCTCCAA,CATAATTTAGCT,GTGTATTTGTCN,TTCTTCACTTTC,CCAGTGTCTTGC,...,TTCCCTAGGTAA,CCTGTAGCGATA,TAAGGGCGCCTC,ATCTGACCAGAA,ATTCCCACTCGT,ATTCCTTATTAG,CGGTAAGCAGGC,AGCAATGAGTCT,CTTCACCTAAGA,TCGCTAATGCCN
GENE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
128up,6,4,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
14-3-3epsilon,665,370,4,5,1,1,4,1,0,8,...,4,2,0,0,1,0,0,1,3,1
14-3-3zeta,120,49,0,1,0,0,1,0,0,4,...,0,0,0,0,1,0,0,0,0,0
140up,6,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18SrRNA-Psi:CR41602,4,0,3,0,1,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [5]:
#Extract just the counts and transpose (to get cells x genes)
count_mat = data.to_numpy().T
count_mat.shape

(3247, 23712)

## **Problem 3 (40 points)**

For our purposes, we will use the $\mu,\phi$ parametrization of the negative binomial (NB) for this problem. Here $\phi$ is the dispersion and $\mu$ is the mean.

In this configuration, $\operatorname {var}(X) = \mu + \phi\mu^2$ (unlike the Poisson where $\operatorname{var}(X) = \mu$). $x_i$ represents expression of gene $i$.


As described in the assignment, we can find a variance-stabilizing transform, where given
\begin{align}
\operatorname {var} (X)=h(\mu ),\,
\end{align}
a suitable transform would be
\begin{align}
 y\propto \int ^{x}{\frac {1}{\sqrt {h(\mu )}}}\,d\mu 
\end{align}
to result in a constant (mean-independent) variance.
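
The reasoning behind this integral can be sketched with a first-order Taylor (delta-method) argument. This is a heuristic outline, not a full derivation, and the symbol $g$ for the transform is introduced here only for illustration. For a smooth transform $y = g(X)$,
\begin{align}
\operatorname{var}(g(X)) \approx g'(\mu)^2 \operatorname{var}(X) = g'(\mu)^2 \, h(\mu),
\end{align}
so asking for a constant variance requires $g'(\mu) \propto 1/\sqrt{h(\mu)}$, which integrates to the expression above. As a quick example, the Poisson case $h(\mu) = \mu$ gives
\begin{align}
y \propto \int^{x} \frac{1}{\sqrt{\mu}} \, d\mu = 2\sqrt{x},
\end{align}
which is the familiar square-root transform.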


### **a) Find the expression for the transformation $y$ given the var$(X)$ expression for a NB (given in the Problem statement). (5 points)**

If working by hand attach an image of your work, or directly type your answer into a text cell. Feel free to use https://www.wolframalpha.com/ for the integral calculation.

I used Mathematica to evaluate the integral. Note that the negative binomial is supported only on $[0,\infty)$, so the lower limit of integration is 0.

$\int^x_0 \frac{1}{\sqrt{\mu + \phi\mu^2}} \, d\mu = \frac{2 \sqrt{x} \sqrt{x\phi + 1} \, \sinh^{-1}\left(\sqrt{\phi x}\right)}{\sqrt{\phi} \sqrt{x (x\phi + 1)}} = \frac{2 \sinh^{-1}\left(\sqrt{\phi x}\right)}{\sqrt{\phi}}$

Therefore, the proportionality constant is 2, and the result matches the first relation given in part e.
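
As a sanity check on the closed form, the sketch below differentiates it numerically and compares the result with the integrand $1/\sqrt{\mu + \phi\mu^2}$. The helper name `vst_nb` and the value `phi_demo` are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def vst_nb(x, phi):
    """Closed-form result from part a: 2 * asinh(sqrt(phi * x)) / sqrt(phi)."""
    return 2.0 * np.arcsinh(np.sqrt(phi * x)) / np.sqrt(phi)

phi_demo = 0.5                      # illustrative dispersion, not fitted to the data
x = np.linspace(0.5, 100.0, 200)    # stay away from x = 0, where the integrand diverges
h = 1e-5                            # step size for the central finite difference

# Numerical derivative of the closed form with respect to x
dy_dx = (vst_nb(x + h, phi_demo) - vst_nb(x - h, phi_demo)) / (2.0 * h)

# Integrand from the problem statement, evaluated at mu = x
integrand = 1.0 / np.sqrt(x + phi_demo * x**2)

max_rel_err = np.max(np.abs(dy_dx - integrand) / integrand)
print("largest relative error:", max_rel_err)
```

If the antiderivative is right, the relative error should be limited only by the finite-difference step.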

### **b) Run PCA on the data matrix (with genes as features), extract the top two principal components and transform the data matrix, then plot the cells in their 2D, transformed coordinates. (5 points)**

You can use the [sklearn PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) function, similar to HW3.

In [26]:
#first run PCA on the data matrix
import sklearn
from sklearn.decomposition import PCA

components = np.shape(count_mat)[0]
pca = PCA(n_components = components, svd_solver='full')
pca.fit(count_mat)

PCA(n_components=3247, svd_solver='full')

In [27]:
# find the top two vectors
b_PCA = pd.DataFrame(np.transpose(pca.components_[0:2]))
b_PCA

Unnamed: 0,0,1
0,3.165805e-04,0.000162
1,3.764277e-02,0.019676
2,5.611002e-03,0.003016
3,2.778293e-04,0.000152
4,6.572768e-05,0.000225
...,...,...
23707,2.829995e-07,0.000005
23708,2.874869e-04,0.000092
23709,6.180326e-04,0.000313
23710,2.062485e-04,0.000104


In [28]:
#transform the count matrix
transformed = np.matmul(count_mat,np.transpose(pca.components_[0:2]))

In [29]:
f = bokeh.plotting.figure(
    width = 400, height =400,
    x_axis_label = "PCA1",
    y_axis_label = "PCA2",
    title= "Dim reduced 2D plot",
)

f.circle(transformed[:,0],transformed[:,1],color="orange")

bokeh.io.show(f)
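
One detail worth keeping in mind when projecting by hand with `components_`: as far as I know, sklearn's PCA centers the data before fitting, and its `transform` method subtracts the fitted `mean_` before projecting. Multiplying the raw matrix by the components therefore gives the same picture shifted by a constant offset. The sketch below shows this with a NumPy-only PCA via SVD on a small made-up matrix; all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "cells x genes" matrix with a non-zero mean (hypothetical data)
X = rng.poisson(lam=5.0, size=(50, 8)).astype(float)

# PCA via SVD of the centered matrix; rows of Vt are the principal axes
mean_ = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean_, full_matrices=False)
top2 = Vt[:2]

centered_proj = (X - mean_) @ top2.T   # projection of the centered data
raw_proj = X @ top2.T                  # projection without centering

# The two projections differ by the same constant vector for every cell
offset = mean_ @ top2.T
print(np.allclose(raw_proj - centered_proj, offset))
```

Because the offset is identical for every cell, the relative positions of cells and the spread along each axis are unchanged; only the axis origin moves.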

### **c) Plot the variance ($\sigma^2$) versus the mean ($\mu$) expression for all genes in a single plot, and comment on any trends you notice (how variance relates to the mean). (5 points)**

You will need to calculate a $\mu$ and $\sigma^2$ for each gene.

In [10]:
#calculate the variance and mean
mean = np.mean(count_mat,axis=0)
var = np.var(count_mat,axis=0)

#plot
meanvar = bokeh.plotting.figure(
    width = 400, height =400,
    x_axis_label = "Mean gene expression",
    y_axis_label = "gene expression variance",
    title= "mean vs variance for each gene",
)

meanvar.circle(mean,var,color="blue")

bokeh.io.show(meanvar)

This plot is pretty zoomed out, so I'll make another without the top-right point to show the rest of the data more clearly.

In [11]:
mean_trunk = np.delete(mean,22590)
var_trunk = np.delete(var,22590)


meanvar_trunk = bokeh.plotting.figure(
    width = 400, height =400,
    x_axis_label = "Mean gene expression",
    y_axis_label = "gene expression variance",
    title= "mean vs variance for each gene Zoomed in",
)

meanvar_trunk.circle(mean_trunk,var_trunk,color="blue")

bokeh.io.show(meanvar_trunk)

In general, the variance and the mean are positively correlated. However, the points appear to fall along two curves rather than one, which is unexpected and might indicate that there are two different groups of genes here.

### **d) Fit a polynomial to $\sigma^2$ vs $\mu$ (the plot from c) to approximate a single $\phi$ value (across all genes). (5 points)**

$\sigma^2$ is var$(X)$. Given that $\operatorname {var}(X) = \mu + \phi\mu^2$ you can find the fit for $\phi$ as the coefficient for the squared term.

You can use the package [curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) to define this degree 2 polynomial where $x=\mu$, $y=$ var$(X)$. Using the bounds options you can constrain the constant to be 0, the first coefficient to be 1, and learn the second coefficient ($\phi$). 

See constrained fit example [here](https://stackoverflow.com/questions/48469889/how-to-fit-a-polynomial-with-some-of-the-coefficients-constrained).

**Report your value of $\phi$ from the fit (one value across all genes).**

In [12]:
from scipy.optimize import curve_fit

def var_func(mu,phi):
  return mu+phi*mu**2

phi, _ = curve_fit(var_func,mean,var)
print("the value for phi is:", phi)

the value for phi is: [3.25263749]


In [13]:
x=np.linspace(0,50)
meanvar_trunk.line(x,var_func(x,phi),color="orange")
meanvar.line(np.linspace(0,170),var_func(np.linspace(0,170),phi),color="orange")


bokeh.io.show(meanvar)
bokeh.io.show(meanvar_trunk)
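
Because the model $\sigma^2 = \mu + \phi\mu^2$ is linear in $\phi$, the unweighted least-squares estimate also has a closed form, $\hat\phi = \sum_i \mu_i^2(\sigma_i^2 - \mu_i) / \sum_i \mu_i^4$. The sketch below is a minimal cross-check on synthetic values; the function name `fit_phi_closed_form` and the demo numbers are assumptions made for illustration.

```python
import numpy as np

def fit_phi_closed_form(mu, var):
    """Unweighted least-squares phi for the model var = mu + phi * mu**2."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    # Setting d/dphi of sum((var - mu - phi * mu**2)**2) to zero gives this ratio
    return np.sum(mu**2 * (var - mu)) / np.sum(mu**4)

# Synthetic check with a known dispersion (hypothetical values)
mu_demo = np.array([0.1, 1.0, 5.0, 20.0])
phi_true = 3.0
var_demo = mu_demo + phi_true * mu_demo**2

phi_hat = fit_phi_closed_form(mu_demo, var_demo)
print(phi_hat)
```

The $\mu^4$ weighting in the denominator means the few genes with the largest means dominate the estimate, which is worth remembering when a single highly expressed gene sits far from the rest of the data.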

### **e) Run the log1p, Pearson residual, and $\mathbf{\text{sinh}^{-1}}$ variance stabilization transforms on the full dataset. (10 points)**

Below you will test out the effect of common variance-stabilization procedures.

[In 1948](https://academic.oup.com/biomet/article-abstract/35/3-4/246/280278?redirectedFrom=fulltext), Frank Anscombe developed several transformations for the Poisson and NB distributions including

\begin{align}
y \propto \dfrac{\text{sinh}^{-1}(\sqrt{\phi x_i})}{\sqrt{\phi}} \tag{1}
\end{align} and
\begin{align}
y \propto \text{log}(x_i+\dfrac{1}{2\phi}) \tag{2}
\end{align} (similar to the log1p we've seen before) which can approximate the $\text{sinh}^{-1}$ solution.
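
To see how closely (2) tracks (1), note that for large $\phi x$ a series expansion gives $2\sinh^{-1}(\sqrt{\phi x}) \approx \log(x + \tfrac{1}{2\phi}) + \log(4\phi)$. The sketch below compares the two forms numerically after putting them on the same scale. The value `phi_demo` is an assumption chosen near the fit from part d, and the shift by $\log(4\phi)$ is added here only to line the curves up.

```python
import numpy as np

phi_demo = 3.25                      # assumed value, close to the fit reported in part d
x = np.linspace(5.0, 1000.0, 500)    # moderately large counts, where the expansion applies

# Transform (1), including the factor of 2 from part a
asinh_form = 2.0 * np.arcsinh(np.sqrt(phi_demo * x)) / np.sqrt(phi_demo)

# Transform (2), shifted by log(4 * phi) and scaled by 1 / sqrt(phi) to match (1)
log_form = (np.log(x + 1.0 / (2.0 * phi_demo)) + np.log(4.0 * phi_demo)) / np.sqrt(phi_demo)

max_abs_diff = np.max(np.abs(asinh_form - log_form))
print("largest absolute difference:", max_abs_diff)
```

The agreement degrades as $x \to 0$: the $\sinh^{-1}$ form goes to exactly 0 there, while the log form goes to $\log\frac{1}{2\phi}$ before any shift, so zero counts are treated differently by the two transforms.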



Another common method is to use Pearson residuals, shown below:

\begin{align}
y \propto \dfrac{x_i − \mu_i}{\sqrt{\mu_i + \phi \mu_i^2}}. \tag{3}
\end{align}

Again $x_i$ represents expression of gene $i$.

**After running each transformation (on the full data), print *only* the transformed values for the first gene, for the first 10 cells, under each transform (1-3).**

In [14]:
#transform 1:
#assuming we are to use the phi value from above?
t1 =  np.arcsinh((count_mat * phi)**(1/2))/np.sqrt(phi)
t1_df = pd.DataFrame(t1)

#transform 2:
t2 = np.log(count_mat+ (2*phi)**(-1))
t2_df = pd.DataFrame(t2)

#transform 3:
t3 = (count_mat-mean)/np.sqrt(mean+phi*mean**2)
t3_df = pd.DataFrame(t3)


In [15]:
print("transform 1: transformed values for the first gene and first 10 cells")
t1_df[0:10][0]

transform 1: transformed values for the first gene and first 10 cells


0    1.215039
1    1.106018
2    0.000000
3    0.000000
4    0.000000
5    0.000000
6    0.000000
7    0.000000
8    0.000000
9    0.000000
Name: 0, dtype: float64

In [22]:
print("transform 2: transformed values for the first gene and first 10 cells")
t2_df[0:10][0]

transform 2: transformed values for the first gene and first 10 cells


0    1.817057
1    1.424005
2   -1.872613
3   -1.872613
4   -1.872613
5   -1.872613
6   -1.872613
7   -1.872613
8   -1.872613
9   -1.872613
Name: 0, dtype: float64

In [24]:
print("transform 3: transformed values for the first gene and first 10 cells")
t3_df[0:10][0]

transform 3: transformed values for the first gene and first 10 cells


0    23.945950
1    15.893286
2    -0.212043
3    -0.212043
4    -0.212043
5    -0.212043
6    -0.212043
7    -0.212043
8    -0.212043
9    -0.212043
Name: 0, dtype: float64

### **f) For each of the three transformation methods, make a single plot of the variance ($\sigma^2$) versus the mean ($\mu$) for all genes, and comment on the trends you notice (particularly compared to c). (5 points)**

In [25]:
transformed_meanvar = bokeh.plotting.figure(
    width = 400, height =400,
    x_axis_label = "Mean gene expression",
    y_axis_label = "gene expression variance",
    title= "mean vs variance for each gene in the transformed matrix",
)

transformed_meanvar.circle(np.mean(t1,axis=0),np.var(t1,axis=0),color="#66c2a5",legend_label = "transform 1")
transformed_meanvar.circle(np.mean(t2,axis=0),np.var(t2,axis=0),color="#fc8d62",legend_label = "transform 2")
transformed_meanvar.circle(np.mean(t3,axis=0),np.var(t3,axis=0),color="#8da0cb",legend_label = "transform 3")

transformed_meanvar.legend.click_policy="hide"


bokeh.io.show(transformed_meanvar)

This plot is very different from part c. First, because transform 3 is mean-centered, the mean for all genes is zero, as expected. However, it seems to do the worst at reducing the variance values.

Transforms 1 and 2 seem to keep the general positive correlation seen in part c, but they definitely weaken that trend significantly, with transform 1 doing the best job of reducing variability.

Overall, all of these methods reduce the variance in the data to varying degrees.
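
For reference, the sketch below shows what stabilization looks like in an idealized setting: counts simulated from a negative binomial with one known dispersion and fairly large means. It is a toy simulation under stated assumptions (`phi_demo`, the chosen means, and the sample size are all made up), not a model of the real dataset, where dispersion can vary by gene and many genes are mostly zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
phi_demo = 0.1                        # illustrative dispersion, not the fitted value
means = np.array([20.0, 100.0, 500.0])
n_shape = 1.0 / phi_demo              # NumPy's "n" parameter for the negative binomial

raw_var, vst_var = [], []
for mu in means:
    # With p = n / (n + mu), the draw has mean mu and variance mu + phi * mu**2
    p = n_shape / (n_shape + mu)
    x = rng.negative_binomial(n_shape, p, size=200_000)
    # Transform (1) with the factor of 2 from part a
    y = 2.0 * np.arcsinh(np.sqrt(phi_demo * x)) / np.sqrt(phi_demo)
    raw_var.append(x.var())
    vst_var.append(y.var())

raw_var, vst_var = np.array(raw_var), np.array(vst_var)
print("raw variances:        ", raw_var)
print("transformed variances:", vst_var)
```

In this setting the raw variances span orders of magnitude while the transformed variances stay close to one another. Dropping the factor of 2, as in the part e code, only rescales every transformed variance by 1/4.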

### **g) For each transformation, run PCA on the variance-stabilized data matrices (with genes as features), extract the top two principal components and transform the matrix, then plot the cells in their 2D, transformed coordinates. There should be one plot for each transformation method. Comment on how these plots compare to that of b. (5 points)**

In [30]:
components = 2
pca_1 = PCA(n_components = components, svd_solver='full')
pca_1.fit(t1)
transformed_1 = np.matmul(t1,np.transpose(pca_1.components_[0:2]))

pca_2 = PCA(n_components = components, svd_solver='full')
pca_2.fit(t2)
transformed_2 = np.matmul(t2,np.transpose(pca_2.components_[0:2]))


pca_3 = PCA(n_components = components, svd_solver='full')
pca_3.fit(t3)
transformed_3 = np.matmul(t3,np.transpose(pca_3.components_[0:2]))


In [31]:
PCA2 = bokeh.plotting.figure(
    width = 400, height =400,
    x_axis_label = "PCA1",
    y_axis_label = "PCA2",
    title= "Dim reduced 2D plot",
)

PCA2.circle(transformed_1[:,0],transformed_1[:,1],color="#66c2a5",legend_label = "transform 1")
PCA2.circle(transformed_2[:,0],transformed_2[:,1],color="#fc8d62",legend_label = "transform 2")
PCA2.circle(transformed_3[:,0],transformed_3[:,1],color="#8da0cb",legend_label = "transform 3")

PCA2.legend.click_policy="hide"

#re-display the untransformed PCA plot from part b
bokeh.io.show(f)

bokeh.io.show(PCA2)

All three plots show significantly less spread along the first two principal components than the plot from part b. Transform 1 seems to have done the best at reducing variation along the first two PC axes, as was the case in the variance plot from part f. Transform 2 did second best, followed by transform 3. The most varied is the original dataset from part b.