<a href="https://colab.research.google.com/github/CPukszta/BI-BE-CS-183-2023/blob/main/HW3/Problem4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Bi/Be/Cs 183 2022-2023: Intro to Computational Biology
TAs: Meichen Fang, Tara Chari, Zitong (Jerry) Wang

**Submit your notebooks by sharing a clickable link with Viewer access. Link must be accessible from submitted assignment document.**

Make sure Runtime $\rightarrow$ Restart and run all works without error

**HW 3 Problem 4**

In this problem you will compare PCA and SVD, common procedures for dimensionality reduction, on a single-cell dataset. Using the eigenvectors (components) of these factorization procedures we will see how relevant "directions" in biological data can be extracted, such as components which distinguish between the various cell types in the data.


## **Import data and install packages**

In [1]:
import numpy as np
import scipy.io as sio
import pandas as pd
import matplotlib.pyplot as plt #Can use other plotting packages like seaborn

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

In [2]:
# ! allows you to run commands in the command line, as you would in your normal terminal/command line interface

In [3]:
#Download count matrix of cell by gene counts and metadata files, DOI: 10.22002/D1.2315
#tar.gz file which has:
#(1) count matrix 
#(2) metadata for cells (cell type, date of experimental run) 
#(3) metadata for genes (gene names)

import requests
from tqdm.notebook import tqdm

def download_file(doi, ext):
    """Download the first media file registered for a DOI; return the file name, or None on failure."""
    url = 'https://api.datacite.org/dois/'+doi+'/media'
    r = requests.get(url).json()
    netcdf_url = r['data'][0]['attributes']['url']
    r = requests.get(netcdf_url,stream=True)
    #Set file name
    fname = doi.split('/')[-1]+ext
    #Stop early if the file cannot be retrieved
    if r.status_code == 403:
        print("File Unavailable")
        return None
    if 'content-length' not in r.headers:
        print("Did not get file")
        return None
    #Download file with progress bar (one tick per 1024-byte chunk)
    total_length = int(r.headers.get('content-length'))
    with open(fname, 'wb') as f, tqdm(total=total_length // 1024, unit="B") as pbar:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                pbar.update()
                f.write(chunk)
    return fname

download_file('10.22002/D1.2315','.gz')




  0%|          | 0/94578 [00:00<?, ?B/s]

'D1.2315.gz'

In [4]:
!mv D1.2315.gz biccn.tar.gz

In [5]:
!tar -xvf biccn.tar.gz

biccnGeneMeta.csv
biccnMeta.csv
biccn.mtx


## **Read in data for analysis**

**The dataset**

This dataset maps the cells in the mouse primary motor cortex (MOp), including neuronal and non-neuronal cell types ([Yao et al., 2021](https://www.nature.com/articles/s41586-021-03500-8)). We will be dealing with the 10x sequenced data only.

We will be using PCA and SVD factorization of the gene count matrix to demonstrate how the eigenvectors can represent axes of variation which correspond to cell type designations. Thus these component vectors can be used to represent variation between cells due to their different transcriptomic signatures.


**The count matrix**

This matrix is 18,744 cells by 5,000 genes, with 10 cell types. The full dataset contains 71,365 cells; we will work with a randomly subsetted version to keep the calculations feasible within the Colab environment.

1.   For each cell, gene counts were normalized so that every cell has the same total count (usually 1e5 or 1e6), with each cell's gene counts scaled accordingly.

2.   Counts were then log-transformed using log(1+x), where x is a normalized gene count in a cell. The 1 accounts for genes with 0 counts. (log = ln here.)

3.   The ~5,000 genes retained are those that displayed large variance in expression across the cells ('highly variable genes').
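
The three preprocessing steps above can be sketched on a toy matrix. This is only an illustration under stated assumptions: the raw counts, the scaling target of 1e5, and the plain variance ranking used for gene selection are hypothetical stand-ins, not the exact procedure used to build this dataset.

```python
import numpy as np

# Hypothetical toy raw count matrix: 3 cells (rows) x 4 genes (columns)
raw = np.array([[10., 0., 5., 5.],
                [ 1., 1., 0., 2.],
                [ 0., 8., 8., 0.]])

target_total = 1e5  # assumed scaling target (the text mentions 1e5 or 1e6)

# Step 1: scale each cell so that its counts sum to target_total
per_cell_total = raw.sum(axis=1, keepdims=True)
scaled = raw / per_cell_total * target_total

# Step 2: natural-log transform with a pseudocount of 1 (log1p(x) = ln(1 + x))
log_norm = np.log1p(scaled)

# Step 3: rank genes by variance across cells and keep the most variable ones
n_keep = 2  # assumed number of genes to keep in this toy example
gene_var = log_norm.var(axis=0)
top_genes = np.sort(np.argsort(gene_var)[::-1][:n_keep])
filtered = log_norm[:, top_genes]
```

After step 1 every row of `scaled` has the same sum, so remaining differences between cells reflect relative expression and not sequencing depth.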



In [6]:
#Get gene count matrix
count_mat = sio.mmread('biccn.mtx')

count_mat = count_mat.todense() #Make dense since most functions we'll use don't work with sparse matrices
count_mat.shape

(18744, 5000)

In [7]:
#Get metadata dataframe for the 18,744 cells (rows of the matrix)
meta = pd.read_csv('biccnMeta.csv')
meta["cell_type"]

0             L2/3 IT
1             L2/3 IT
2        L6 CT Cpa6_1
3             L2/3 IT
4             L2/3 IT
             ...     
18739         L2/3 IT
18740         L2/3 IT
18741         L2/3 IT
18742    L6 CT Cpa6_1
18743         L2/3 IT
Name: cell_type, Length: 18744, dtype: object

In [8]:
#Get metadata dataframe for the 5,000 genes (columns of the matrix)

meta_gene = pd.read_csv('biccnGeneMeta.csv',index_col = 0)
meta_gene.head()


|  | gene_name |
|---|---|
| Rp1_ENSMUSG00000025900 | Rp1_ENSMUSG00000025900 |
| Sox17_ENSMUSG00000025902 | Sox17_ENSMUSG00000025902 |
| Oprk1_ENSMUSG00000025905 | Oprk1_ENSMUSG00000025905 |
| St18_ENSMUSG00000033740 | St18_ENSMUSG00000033740 |
| Sntg1_ENSMUSG00000025909 | Sntg1_ENSMUSG00000025909 |


## **Problem 4** (50 points)

### **a) Find the eigenvectors and values for the covariance matrix of (mean-)centered data (8 points)**
Mean-center the columns (gene vectors) of the matrix, find the $X^TX$ covariance matrix, and use the [numpy.linalg.eig](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html) function to obtain the eigenvalues and eigenvectors. $X^TX$ is the covariance matrix for $X^T$, thus we are treating the genes as the features/variables measured.

**Report the first 3 eigenvalues and their associated eigenvectors that are returned.**
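
As a point of reference, the intended computation can be sketched on a small random matrix. The data here are hypothetical, and the sketch swaps in `numpy.linalg.eigh` (meant for symmetric matrices, returning real eigenvalues in ascending order) for the `numpy.linalg.eig` function named in the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # hypothetical data: 200 cells x 5 genes

Xc = X - X.mean(axis=0)                # mean-center each gene (column)
gram = Xc.T @ Xc                       # X^T X, shape (genes, genes)

# eigh assumes a symmetric input: real output, ascending eigenvalues
evals, evecs = np.linalg.eigh(gram)

# Reorder so that the largest eigenvalue comes first
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# X^T X / (n - 1) is the sample covariance, so the two differ only by a constant factor
sample_cov = np.cov(Xc, rowvar=False)
```

Because $X^TX$ and the sample covariance differ only by the factor $n-1$, they share eigenvectors, and their eigenvalues differ by that same factor.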

In [9]:
# Let's start by mean-centering our count matrix (subtract each gene's mean)
count_mean = np.mean(count_mat,axis=0)
mean_cent = count_mat - count_mean

# Now we can find the covariance matrix.
# Note: np.cov is applied to X^T X here, so `cov` is the covariance of the rows of X^T X
cov = np.cov(np.matmul(np.transpose(mean_cent), mean_cent))
eigval, eigvec = np.linalg.eig(cov)

In [10]:
# Next, let's put these values into data frames for ease of use
eigval_df = pd.DataFrame(np.real(eigval))
eigvec_df = pd.DataFrame(np.real(eigvec))

print("these are the first 3 eigenvectors")

eigvec_df.iloc[:,0:3]

these are the first 3 eigenvectors


|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.000131 | 0.000103 | -0.000179 |
| 1 | 0.001487 | 0.006698 | -0.006343 |
| 2 | 0.007410 | -0.006378 | -0.001343 |
| 3 | -0.000979 | 0.000223 | 0.000161 |
| 4 | -0.025543 | -0.005343 | 0.047928 |
| ... | ... | ... | ... |
| 4995 | 0.002479 | 0.000321 | -0.001317 |
| 4996 | -0.001558 | 0.001923 | -0.006390 |
| 4997 | 0.000863 | -0.004957 | -0.001962 |
| 4998 | 0.000197 | -0.012025 | 0.002700 |


In [11]:
print("and their eigenvalues are")
eigval_df[0:3]

and their eigenvalues are


|  | 0 |
|---|---|
| 0 | 36894270000.0 |
| 1 | 12153650000.0 |
| 2 | 1349121000.0 |


### **b) Plot a Scree plot of the eigenvectors, after ranking by eigenvalue (descending order) and select the top components (eigenvectors) to use to transform the data. (5 points)**
A Scree plot has the eigenvalue of each eigenvector on the y-axis and the rank of the eigenvector on the x-axis (after ordering the eigenvectors by decreasing eigenvalue). Usually we will see a steep drop in this curve, followed by a plateau after a certain number of components. Using this plot, we can select a cutoff for how many components to keep while still capturing a large portion of the variance in the data.

**Plot a Scree plot for the top 50 eigenvectors and report how many components you would keep i.e. where you would set a cutoff on the number of components necessary to capture a majority of variance in the data.**
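
One way to turn the Scree plot into a number is to look at the cumulative fraction of variance explained. The sketch below uses hypothetical eigenvalues and an assumed threshold of 85%; the threshold is a judgment call, not a rule from this assignment.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
evals = np.array([50.0, 30.0, 10.0, 5.0, 3.0, 2.0])

explained = evals / evals.sum()        # fraction of variance per component
cumulative = np.cumsum(explained)      # running total over components

threshold = 0.85                       # assumed cutoff; choose to suit the data
# Index of the first component where the running total reaches the threshold,
# converted to a count of components
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
```

Here the first two components explain 80% of the variance, so a third is needed to pass the 85% threshold.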

In [12]:
# Start by sorting the eigenvalues in ascending order; the top 50 are then the last 50 entries.
eig_sort = np.sort(eigval)

In [13]:
g = bokeh.plotting.figure(
    width = 400, height =400,
    x_axis_label = "eigenvalue #",
    y_axis_label = "eigenvalue",
    title= "Scree Plot",
    y_axis_type="log"
)

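# eig_sort is ascending, so the x values run from 49 down to 0 and the largest eigenvalue sits at x = 0
g.circle(np.arange(49, -1, -1),np.real(eig_sort[-50:]),color="blue")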

bokeh.io.show(g)

Given the above Scree plot, I would choose to only keep the top 2 eigenvalues/vectors. 

### **c) Transform the original count matrix using the top 2 ranked principal component (eigen) vectors from b) and plot the data in the first two components, colored by 'cell_type'.** (8 points)

The 'cell_type' metadata column represents the cell type designation of each cell from the study.
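
Transforming the data means projecting each mean-centered cell onto the chosen eigenvectors. A minimal sketch with hypothetical random data follows; `scores` is an illustrative name for the projected coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: 100 cells x 4 genes
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

# Eigendecomposition of X^T X, reordered to descending eigenvalue
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

k = 2
# (cells x genes) @ (genes x k) -> (cells x k) coordinates in component space
scores = Xc @ evecs[:, :k]

# The sum of squares of each projected coordinate equals its eigenvalue
col_ss = (scores ** 2).sum(axis=0)
```

The projected coordinates are uncorrelated with each other, which is why each axis of the resulting scatter plot shows a separate direction of variation.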

In [14]:
# start by grabbing the top 2 eigenvectors and their values
vectors = np.real(eigvec[:,0:2])
vals = np.real(eigval[0:2])
# matrix multiply the count matrix and vectors
P = np.matmul(mean_cent,vectors)

#combine the result in a data frame with the cell types
p_df = pd.DataFrame(P)
p_df["cell_type"] = meta["cell_type"]
p_df

|  | 0 | 1 | cell_type |
|---|---|---|---|
| 0 | -24.975069 | -3.909889 | L2/3 IT |
| 1 | -23.570307 | -7.747387 | L2/3 IT |
| 2 | 34.546080 | -16.651556 | L6 CT Cpa6_1 |
| 3 | -31.665647 | -5.551204 | L2/3 IT |
| 4 | -16.877071 | 7.056553 | L2/3 IT |
| ... | ... | ... | ... |
| 18739 | -9.676538 | 6.658508 | L2/3 IT |
| 18740 | -18.826675 | -1.829612 | L2/3 IT |
| 18741 | -19.741742 | -0.729855 | L2/3 IT |
| 18742 | 33.880809 | -11.422378 | L6 CT Cpa6_1 |


In [15]:
# Let's plot!
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral10
from bokeh.transform import factor_cmap

cell_types = p_df["cell_type"].unique()

# Define the palette: one Spectral10 color per cell type
mapper = factor_cmap(field_name='cell_type', palette=Spectral10, factors=cell_types)

source = ColumnDataSource(dict(x=p_df[0],y=p_df[1],cell_type=p_df["cell_type"]))

p = bokeh.plotting.figure(
    width = 600, height =600,
    x_axis_label = "PC 1",
    y_axis_label = "PC 2",
    title= "Principal Component Graph",
)

p.circle(x="x",y="y",color=mapper,source=source,legend_field="cell_type")
p.legend.location = "top_left"

bokeh.io.show(p)

### **d) Plot the same transformed data and color by the total read counts for each cell (counts across all genes). (5 points)**

Directions of variance highlighted by the principal components can correspond to other, non-biological facets of the data, such as which cells had more sequenced UMIs.

Remember that the counts have been previously log transformed (ln). 
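
Since the stored values are ln(1 + x), the transform has to be undone before summing counts per cell. A small sketch with hypothetical values is shown below; it uses `np.expm1`, which computes exp(y) - 1 directly.

```python
import numpy as np

# Hypothetical counts for 2 cells x 3 genes, and their ln(1 + x) transform
counts = np.array([[3.0, 0.0, 7.0],
                   [1.0, 2.0, 0.0]])
log_counts = np.log1p(counts)

# expm1(y) = exp(y) - 1 inverts log1p and stays accurate for small y
recovered = np.expm1(log_counts)

# One total per cell, as a flat array of shape (cells,)
totals = recovered.sum(axis=1)
```

Summing the log-transformed values directly would give a different quantity (a sum of logs), so the order of operations matters here.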

In [16]:
# Let's plot!
from bokeh.models import ColorBar, ColumnDataSource
from bokeh.palettes import Turbo256
from bokeh.transform import linear_cmap

# Let's start by undoing the log transform: exp(ln(1 + x)) - 1 = x
count_mat_un = np.exp(count_mat)-1
cell_totals = np.sum(count_mat_un,axis=1)
# Flatten the (cells x 1) result so it fits in a single data frame column
p_df["total_reads"] = np.asarray(cell_totals).ravel()

# Define the palette: map total reads linearly onto Turbo256
mapper = linear_cmap(field_name='total_reads', palette=Turbo256, low=min(p_df["total_reads"]), high=max(p_df["total_reads"]))

source = ColumnDataSource(dict(x=p_df[0],y=p_df[1],total_reads=p_df["total_reads"]))

q = bokeh.plotting.figure(
    width = 600, height =600,
    x_axis_label = "PC 1",
    y_axis_label = "PC 2",
    title= "Principal Component Graph, colored by total read counts",
)

q.circle(x="x",y="y",color=mapper,source=source)
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
q.add_layout(color_bar, 'right')

bokeh.io.show(q)

**Comparing the previous graph with this one, it looks like the variation in the "Endo Slc38a5_1", "OPC Pdgfra", and "Astro Aqp4" cell types may be coming from a significantly larger number of reads in those cell types.**


### **e) Perform SVD on the centered data matrix, construct a Scree plot from $\Sigma,V$, and report the number of components chosen for reduction. Plot the points transformed by the top 2 eigenvectors, colored by 'cell_type'. (10 points)**

Use the [numpy.linalg.svd](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html) function to find the SVD factorization of the matrix. The Singular Value Decomposition of $X$ provides a factorization of $X$ where $X = U\Sigma V^T$. Here $\Sigma$ is a diagonal matrix containing the singular values of $X$. $U,V$ hold, respectively, the left and right singular vectors corresponding to those values. We will use the singular values in $\Sigma$ and the right singular vectors in $V$ in place of the eigenvalues and eigenvectors from part a); the squared singular values of $X$ are the eigenvalues of $X^TX$.
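
The link between the two factorizations follows from $X^TX = V\Sigma U^TU\Sigma V^T = V\Sigma^2V^T$. A quick numerical check on hypothetical random data, offered as a sketch and not as part of the assignment solution:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))           # hypothetical data: 50 cells x 4 genes
Xc = X - X.mean(axis=0)

# Thin SVD: U is (50 x 4), s holds 4 singular values (descending), Vt is (4 x 4)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigendecomposition of X^T X; eigh returns ascending order, so reverse it
evals, evecs = np.linalg.eigh(Xc.T @ Xc)
evals, evecs = evals[::-1], evecs[:, ::-1]

# Matching vectors agree up to sign, so the absolute dot products should be 1
overlap = np.abs(np.sum(Vt.T * evecs, axis=0))
```

The sign of each singular vector is arbitrary, so two correct implementations can return vectors that differ by a factor of -1.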

**Plot a Scree plot for the top 50 singular values/vectors and report the number of components you would select to retain. Transform the data matrix using the top 2 components and plot (as in c) the transformed points colored by 'cell_type'.**

In [17]:
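# Let's apply SVD to the mean-centered data matrix.
# full_matrices=False returns the thin SVD and avoids allocating the full 18,744 x 18,744 U matrix
u, sigma, vt = np.linalg.svd(mean_cent, full_matrices=False)

In [18]:
f = bokeh.plotting.figure(
    width = 400, height =400,
    x_axis_label = "singular value #",
    y_axis_label = "singular value",
    title= "Scree Plot",
)

# Plot the 50 largest singular values (np.linalg.svd returns them in descending order)
f.circle(np.arange(1, 51),sigma[0:50],color="orange")

bokeh.io.show(f)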

In [None]:
# start by grabbing the top 2 eigenvectors
eig_vectors = np.transpose(vt)[:,0:2]

# matrix multiply the count matrix and vectors
P2 = np.matmul(mean_cent,eig_vectors)

#combine the result in a data frame with the cell types
p2_df = pd.DataFrame(P2)
p2_df["cell_type"] = meta["cell_type"]
p2_df

In [None]:
# Let's plot as in part c
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral10
from bokeh.transform import factor_cmap

cell_types = p2_df["cell_type"].unique()

# Define the palette: one Spectral10 color per cell type
mapper = factor_cmap(field_name='cell_type', palette=Spectral10, factors=cell_types)

source = ColumnDataSource(dict(x=p2_df[0],y=p2_df[1],cell_type=p2_df["cell_type"]))

b = bokeh.plotting.figure(
    width = 600, height =600,
    x_axis_label = "PC 1",
    y_axis_label = "PC 2",
    title= "Principal Component Graph",
)

b.circle(x="x",y="y",color=mapper,source=source,legend_field="cell_type")
b.legend.location = "top_left"

bokeh.io.show(b)

### **f) Perform PCA with the [sklearn PCA function](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and print the top 2 eigenvectors from the function alongside what you calculated in a) and e) (6 points)**

You can set the number of components for the PCA calculation to be the same size as the dataset, and then select the top 2 components from the result. By default this PCA function will use SVD to approximate the solution.

**Report the top 2 components (vectors) and print the top 2 components from a) and e).**


```
>>> pca = PCA(n_components=2, svd_solver='full')
>>> pca.fit(X)
PCA(n_components=2, svd_solver='full')
>>> print(pca.singular_values_)
[6.30061... 0.54980...]
#Use pca.components_.T to get component vectors
```
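
When printing components from different methods side by side, matching vectors may differ by an overall sign. The helper below, `align_signs`, is a hypothetical name introduced only for this sketch; it flips columns so that two sets of components can be compared entry by entry.

```python
import numpy as np

def align_signs(reference, other):
    """Flip columns of `other` so each has a positive dot product with `reference`."""
    signs = np.sign(np.sum(reference * other, axis=0))
    signs[signs == 0] = 1.0            # leave orthogonal columns untouched
    return other * signs

# Hypothetical components from two methods: same directions, second column flipped
a = np.array([[0.6, 0.8],
              [0.8, -0.6]])
b = a * np.array([1.0, -1.0])

b_aligned = align_signs(a, b)
max_diff = np.max(np.abs(a - b_aligned))
```

After alignment, `max_diff` gives a single number summarizing how closely the two sets of components agree.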



In [20]:
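import sklearn
from sklearn.decomposition import PCA

components = np.shape(count_mat)[1]
pca = PCA(n_components = components, svd_solver='full')
# PCA mean-centers the data internally, so the uncentered matrix can be passed in
pca.fit(np.asarray(count_mat))

# Print the top 2 components (each row of components_ is one component)
print(pca.components_[0:2])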



[[ 1.27394340e-04  1.73097367e-03  6.91642060e-03 ...  5.25742537e-04
  -8.96989532e-04 -7.68932889e-03]
 [ 6.87475259e-05  6.16749087e-03 -7.06638447e-03 ... -5.09674912e-03
  -1.30181099e-02 -1.48158639e-02]]


In [32]:
f_PCA = pd.DataFrame(np.transpose(pca.components_[0:2]))
f_PCA

|  | 0 | 1 |
|---|---|---|
| 0 | 0.000127 | 0.000069 |
| 1 | 0.001731 | 0.006167 |
| 2 | 0.006916 | -0.007066 |
| 3 | -0.001050 | 0.000047 |
| 4 | -0.026610 | -0.005995 |
| ... | ... | ... |
| 4995 | 0.002291 | -0.000395 |
| 4996 | -0.001226 | 0.002593 |
| 4997 | 0.000526 | -0.005097 |
| 4998 | -0.000897 | -0.013018 |


In [21]:
#top components from a)
print("these are the top two components from part a:")
eigvec_df.iloc[:,0:2]


these are the top two components from part a:


|  | 0 | 1 |
|---|---|---|
| 0 | 0.000131 | 0.000103 |
| 1 | 0.001487 | 0.006698 |
| 2 | 0.007410 | -0.006378 |
| 3 | -0.000979 | 0.000223 |
| 4 | -0.025543 | -0.005343 |
| ... | ... | ... |
| 4995 | 0.002479 | 0.000321 |
| 4996 | -0.001558 | 0.001923 |
| 4997 | 0.000863 | -0.004957 |
| 4998 | 0.000197 | -0.012025 |


In [26]:
#top components from e)

e_components = pd.DataFrame(np.transpose(vt)[:,0:2])
print("these are the top two components from part e:")
e_components


### **g) Transform the count matrix using the top 2 component vectors from part f) and report the top 5 genes for the first and second principal component (eigen) vectors (8 points)**

By looking at the weights/values of each gene in each eigenvector, we can determine which genes have the highest weights for the given vector.

**Report the gene names for the top 5 weighted genes in each of the two eigenvectors.**
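
"Top weighted" can mean either the largest positive weights or the largest weights in absolute value; the two rankings can differ when a component has strong negative loadings. A sketch with hypothetical gene names and weights:

```python
import numpy as np

# Hypothetical gene names and loadings for one component
gene_names = np.array(["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"])
weights = np.array([0.10, -0.70, 0.05, 0.40, -0.20, 0.55])

n = 3
# Largest positive weights
top_signed = gene_names[np.argsort(weights)[::-1][:n]]
# Largest magnitude, regardless of sign
top_abs = gene_names[np.argsort(np.abs(weights))[::-1][:n]]

print(list(top_signed))  # ['GeneF', 'GeneD', 'GeneA']
print(list(top_abs))     # ['GeneB', 'GeneF', 'GeneD']
```

Because the sign of an eigenvector is arbitrary, ranking by absolute value gives a result that does not depend on which sign the solver happened to return.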

In [45]:
# Start by transforming the mean-centered count matrix:
transformed = np.matmul(mean_cent,np.transpose(pca.components_[0:2]))
n=5

# Get the indices of the top 5 genes for each principal component (largest positive weights)
inds_PC1 = f_PCA.nlargest(n,0).index.values.tolist()
inds_PC2 = f_PCA.nlargest(n,1).index.values.tolist()

# Print the genes corresponding to these indices
print("the top 5 genes for the first principal component are:")
meta_gene.iloc[inds_PC1]

the top 5 genes for the first principal component are:


|  | gene_name |
|---|---|
| Hs3st4_ENSMUSG00000078591 | Hs3st4_ENSMUSG00000078591 |
| Pcp4_ENSMUSG00000090223 | Pcp4_ENSMUSG00000090223 |
| Rprm_ENSMUSG00000075334 | Rprm_ENSMUSG00000075334 |
| Nxph3_ENSMUSG00000046719 | Nxph3_ENSMUSG00000046719 |
| Syt6_ENSMUSG00000027849 | Syt6_ENSMUSG00000027849 |


In [43]:
print("the top 5 genes for the second principal component are:")
meta_gene.iloc[inds_PC2]

the top 5 genes for the second principal component are:


|  | gene_name |
|---|---|
| Atp1a2_ENSMUSG00000007097 | Atp1a2_ENSMUSG00000007097 |
| Ptprz1_ENSMUSG00000068748 | Ptprz1_ENSMUSG00000068748 |
| Atp1b2_ENSMUSG00000041329 | Atp1b2_ENSMUSG00000041329 |
| Plpp3_ENSMUSG00000028517 | Plpp3_ENSMUSG00000028517 |
| S1pr1_ENSMUSG00000045092 | S1pr1_ENSMUSG00000045092 |
