<a href="https://colab.research.google.com/github/CPukszta/BI-BE-CS-183-2023/blob/main/HW4/Problem3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Bi/Be/Cs 183 2022-2023: Intro to Computational Biology
TAs: Meichen Fang, Tara Chari, Zitong (Jerry) Wang

**Submit your notebooks by sharing a clickable link with Viewer access. Link must be accessible from submitted assignment document.**

Make sure Runtime $\rightarrow$ Restart and run all works without error

**HW 4 Problem 3**

In this problem you will develop code for running the EM algorithm to fit a Gaussian Mixture Model (GMM). You will learn the mixture weights, means, and covariances of a set of (multivariate) Gaussian distributions that together describe the input single-cell data. This is a common approach to determine clusters within a dataset.


## **Import data and install packages**

In [1]:
import numpy as np
import scipy.io as sio
import pandas as pd
import matplotlib.pyplot as plt #Can use other plotting packages like seaborn

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

In [2]:
#Download count matrix of cell by gene counts and metadata files, DOI: 10.22002/D1.2315
#tar.gz file which has:
#(1) count matrix 
#(2) metadata for cells (cell type, date of experimental run) 
#(3) metadata for genes (gene names)

import requests
from tqdm.notebook import tqdm #tnrange and tqdm_notebook are deprecated
def download_file(doi,ext):
    url = 'https://api.datacite.org/dois/'+doi+'/media'
    r = requests.get(url).json()
    netcdf_url = r['data'][0]['attributes']['url']
    r = requests.get(netcdf_url,stream=True)
    #Set file name
    fname = doi.split('/')[-1]+ext
    #Stop early if the file cannot be downloaded
    if r.status_code == 403:
        print("File Unavailable")
        return None
    if 'content-length' not in r.headers:
        print("Did not get file")
        return None
    #Download file with progress bar (one tick per 1024-byte chunk)
    with open(fname, 'wb') as f:
        total_length = int(r.headers.get('content-length'))
        pbar = tqdm(total=int(total_length/1024), unit="B")
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                pbar.update()
                f.write(chunk)
        pbar.close()
    return fname

download_file('10.22002/D1.2315','.gz')



  0%|          | 0/94578 [00:00<?, ?B/s]

'D1.2315.gz'

In [3]:
!mv D1.2315.gz biccn.tar.gz
!tar -xvf biccn.tar.gz

biccnGeneMeta.csv
biccnMeta.csv
biccn.mtx


## **Read in data for analysis**

**The dataset**

This dataset maps the cells in the mouse primary motor cortex (MOp), including neuronal and non-neuronal cell types, for a total of 10 cell types ([Yao et al., 2021](https://www.nature.com/articles/s41586-021-03500-8)). We will be dealing with the 10x sequenced data only.


**The count matrix**

This matrix is 18,744 cells by 5,000 genes. The full dataset contains 71,365 cells; however, we will work with a random subset to facilitate calculations within the Colab environment. The counts were preprocessed as follows:

1.   For each cell, gene counts were normalized so that every cell has the same total count (usually 1e5 or 1e6), with each cell's gene counts scaled accordingly.

2.   Counts were then log-transformed using log(1+x), where x is a gene's normalized count in a cell. Adding 1 keeps zero counts defined (log(1) = 0).

3.   The ~5,000 genes with the largest variance in expression across cells ('highly variable genes') were selected.
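
The first two preprocessing steps can be sketched on a toy matrix as follows. This is only an illustration, not the pipeline that produced `biccn.mtx`; the names `raw_counts` and `target_sum`, the toy values, and the target of 1e5 total counts are assumptions made for the example.

```python
import numpy as np

# Hypothetical raw counts: 3 cells (rows) x 4 genes (columns)
raw_counts = np.array([[10, 0, 5, 5],
                       [ 2, 2, 0, 4],
                       [ 0, 1, 1, 0]], dtype=float)

target_sum = 1e5  # assumed per-cell total after normalization

# Step 1: scale each cell so that its counts sum to target_sum
cell_totals = raw_counts.sum(axis=1, keepdims=True)
normalized = raw_counts / cell_totals * target_sum

# Step 2: log(1 + x) transform; zero counts stay at zero
log_counts = np.log1p(normalized)

# np.expm1 inverts the transform and recovers the normalized counts
recovered = np.expm1(log_counts)
```

Because the stored values are log(1+x), any calculation that needs the normalized counts themselves has to undo the transform first, as in the last line.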



In [4]:
#Get gene count matrix
count_mat = sio.mmread('biccn.mtx')

count_mat = count_mat.todense() #Make dense since most functions we'll use don't work with sparse matrices
count_mat.shape

(18744, 5000)

In [5]:
#Get metadata dataframe for the 18,744 cells (rows of the matrix)
meta = pd.read_csv('biccnMeta.csv',index_col = 0)
meta.head()

barcode,cell_type
AAACGAAGTGGATTTC-3L8TX_181211_01_A01,L2/3 IT
AAACGCTCAATGCTCA-3L8TX_181211_01_A01,L2/3 IT
AAAGTCCGTGTATCCA-3L8TX_181211_01_A01,L6 CT Cpa6_1
AAAGTGAGTCGCCACA-3L8TX_181211_01_A01,L2/3 IT
AAAGTGATCGTCTACC-3L8TX_181211_01_A01,L2/3 IT


In [6]:
#Get metadata dataframe for the 5,000 genes (columns of the matrix)

meta_gene = pd.read_csv('biccnGeneMeta.csv',index_col = 0)
meta_gene.head()


Unnamed: 0,gene_name
Rp1_ENSMUSG00000025900,Rp1_ENSMUSG00000025900
Sox17_ENSMUSG00000025902,Sox17_ENSMUSG00000025902
Oprk1_ENSMUSG00000025905,Oprk1_ENSMUSG00000025905
St18_ENSMUSG00000033740,St18_ENSMUSG00000033740
Sntg1_ENSMUSG00000025909,Sntg1_ENSMUSG00000025909


## **Problem 3 (30 points)**

Gaussian mixture model (GMM) is as defined below:

\begin{align}
f_{GMM}(\mathbf{x})=\sum_{j=1}^k \phi_j f(\mathbf {x};{\boldsymbol{\mu }}_{j},\mathbf{\Sigma}_{j})
\end{align}
subject to $\sum_{j=1}^k \phi_j = 1$.

$\boldsymbol{\phi}$ denotes the weights for the Gaussian pdfs $f$, and the GMM is defined as the weighted sum of these Gaussians. $\boldsymbol{\mu}_j$ and $\mathbf{\Sigma}_j$ are the mean (vector) and covariance (matrix) of the $j$-th of the $k$ multivariate Gaussians. This model can describe data with multiple modes (areas of high density), e.g. $k$ cell types that have distinct gene expression signatures.
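
As a small sketch of this definition, the mixture density can be evaluated by summing weighted Gaussian pdfs. The helper name `gmm_pdf` and all parameter values here are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, phi, mu, sigma):
    """Evaluate the mixture density at each row of x (shape m x d)."""
    x = np.atleast_2d(x)
    density = np.zeros(x.shape[0])
    for j in range(len(phi)):
        # weight of component j times its Gaussian pdf at every point
        density += phi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=sigma[j])
    return density

# Toy example (assumed values): k = 2 components in d = 2 dimensions
phi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma = np.array([np.eye(2), 0.5 * np.eye(2)])

points = np.array([[0.0, 0.0], [3.0, 3.0], [1.5, 1.5]])
values = gmm_pdf(points, phi, mu, sigma)
```

The density is highest near the two component means, which is what lets a GMM represent several clusters at once.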

Let $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots \mathbf{x}_n)$ be $n$ independent observations (e.g. cells) which come from a mixture of Gaussians. They are $d$ dimensional, where $d$ is the number of gene measurements. We will define $\mathbf{z} = (z_1, z_2, \ldots z_n)$ as the latent variable which represents the mixture component (of the $k$ components) which a cell comes from. Thus $\phi_j = P(Z = j)$.

The parameters to be fit are $\theta = (\boldsymbol{\mu}_1,\ldots, \boldsymbol{\mu}_k, \mathbf{\Sigma}_1, \ldots, \mathbf{\Sigma}_k, \phi_1, \ldots, \phi_k)$, i.e. the parameters of the GMM.

The (complete) likelihood for maximization is defined as 
\begin{align}
p(\mathbf {x},\mathbf {z};\theta) =\prod _{i=1}^{n}\prod _{j=1}^{k}\ [f(\mathbf {x} _{i};{\boldsymbol{\mu }}_{j},\mathbf{\Sigma}_{j})\phi_{j}]^{\mathbb {I} (z_{i}=j)}
\end{align}

where $\mathbb {I}$ is the indicator function.

As described in Problem 1, we will maximize the $Q$ function, where 
\begin{align}
Q(\theta|\theta_t) = E_{\mathbf {z}|\mathbf {x},\theta_t} [\log(p(\mathbf {x},\mathbf {z};\theta))].
\end{align}


Dimensions of variables ($n$ cells, $d$ genes):

$\mathbf{X}: n\times d$ matrix

$\boldsymbol{\mu}: k \times d$ matrix

$\boldsymbol{\Sigma}: k\times d \times d$ array

$\boldsymbol{\phi}:$ 1d array of length $k$

### **a) Covariance matrices of single-cell datasets are often singular (non-invertible). Subset the input matrix X to create a 'regularized' matrix whose covariance is non-singular. (5 points)**

Remove the low-variance genes, i.e. calculate the variance of expression for *each* gene and remove any gene whose variance is below the mean variance (taken across all genes). In this sense we are further removing 'redundancy' from the matrix.

Remember that the counts of the matrix are log1p() of the original counts.

**Report how many genes remain after this subsetting, and use this subsetted matrix for all downstream calculations.**

In [7]:
#Un-log transform the counts
non_lg_count = np.exp(count_mat)-1

#compute the variance for each gene
var = np.var(non_lg_count,axis=0)

#compute the mean variance 
mean_var = np.mean(var)

#find the indices of any genes whose variance is below the mean variance
#and remove them from the count matrix
inds = np.where(var < mean_var)

count_mat_sub = np.delete(count_mat, inds[1], axis = 1)

print("after subsetting, there are only", np.shape(count_mat_sub)[1], " genes remaining in the count matrix")


after subsetting, there are only 97  genes remaining in the count matrix


### **b) Implement the E step (5 points)**

For the E step we assume a set of (randomly initialized) $\theta_t$.
Then:
\begin{align}
Q(\theta|\theta_t) &= \sum_{i=1}^{n} \sum_{j=1}^{k} P(Z_i = j | X_i = \mathbf{x}_i; \theta_t) \log(p(\mathbf {x}_i,j;\theta)) \\
 &= \sum_{i=1}^{n} \sum_{j=1}^{k} T_{i,j}^t[\log \phi _{j}-{\tfrac {1}{2}}\log |\mathbf{\Sigma} _{j}|-{\tfrac {1}{2}}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{j})^{\top }\mathbf{\Sigma} _{j}^{-1}(\mathbf {x} _{i}-{\boldsymbol {\mu }}_{j})-{\tfrac {d}{2}}\log(2\pi )]
\end{align}

$T_{i,j}^t$ represents $P(Z_i = j | X_i = \mathbf{x}_i; \theta_t)$ and $p(\mathbf {x}_i,j;\theta) = f(\mathbf {x} _{i};{\boldsymbol{\mu }}_{j},\mathbf{\Sigma}_{j})\phi_{j}$.

$T_{i,j}^t$ needs to be calculated given the current $\theta_t$ for every $\mathbf{x}_i$ in the E step.

Using Bayes' theorem, we get 
\begin{align}
T_{i,j}^t = \frac {\phi _{j}^{t}\ f(\mathbf {x} _{i};{\boldsymbol {\mu }}_{j}^{t},\mathbf{\Sigma} _{j}^{t})}{\sum_{r=1}^k \phi _{r}^{t}\ f(\mathbf {x} _{i};{\boldsymbol {\mu }}_{r}^{t},\mathbf{\Sigma} _{r}^{t})}
\end{align}



You can use [scipy.stats multivariate_normal.pdf](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html) to calculate $f$ and [numpy cov](https://numpy.org/doc/stable/reference/generated/numpy.cov.html) to calculate covariances.

**Fill in the e_step() function and calculate the matrix $\mathbf{T}$ given inputs $\mathbf{X}, \boldsymbol{\mu^t}, \mathbf{\Sigma^t}, \textbf{ and } \boldsymbol{\phi^t}$. $\mathbf{X}$  represents our cell x genes data matrix.**

Note: You may need to add a small epsilon ($\approx$ 1e-100) to $f$ values, to avoid divide by zero errors.
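
An alternative to adding an epsilon is to compute $T$ in log space, which is a different technique from the one suggested in the note above. The sketch below uses `multivariate_normal.logpdf` and `scipy.special.logsumexp`; the function name `responsibilities_logspace` and the toy inputs are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities_logspace(x, mu, sigma, phi):
    """Return T (n x k) computed from log densities."""
    n, k = x.shape[0], len(phi)
    log_num = np.zeros((n, k))
    for j in range(k):
        # log(phi_j) + log f(x_i; mu_j, Sigma_j) for every cell i
        log_num[:, j] = np.log(phi[j]) + multivariate_normal.logpdf(
            x, mean=mu[j], cov=sigma[j])
    # log of the denominator: log sum_r phi_r f(x_i; mu_r, Sigma_r)
    log_denom = logsumexp(log_num, axis=1, keepdims=True)
    return np.exp(log_num - log_denom)

# Toy data (assumed values): two well-separated points, k = 2, d = 2
x = np.array([[0.0, 0.0], [50.0, 50.0]])
mu = np.array([[0.0, 0.0], [50.0, 50.0]])
sigma = np.array([np.eye(2), np.eye(2)])
phi = np.array([0.5, 0.5])

T = responsibilities_logspace(x, mu, sigma, phi)
```

Here the pdf of each point under the far component underflows to zero in ordinary floating point, yet each row of `T` still sums to 1 because the normalization is done on the log scale.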

In [8]:
import scipy.stats

def e_step(x,mu,sigma,phi):
  epsilon = 1e-100 #small constant to avoid division by zero and log(0)
  n = np.shape(x)[0]
  k = len(phi)

  #f_saved[i,j] = f(x_i; mu_j, Sigma_j) + epsilon
  f_saved = np.zeros((n,k))
  for j in range(k):
    f_saved[:,j] = scipy.stats.multivariate_normal.pdf(x, np.ravel(mu[j]), cov=sigma[j]) + epsilon

  #numerator of T: phi_j * f(x_i; mu_j, Sigma_j)
  num = np.asarray(phi)*f_saved
  #denominator of T: sum over the k components for each cell
  denom = np.sum(num,axis=1)
  T_mat = num/denom[:,None]
  #Q evaluated at the current parameters
  Q = np.sum(T_mat*np.log(num))
  return T_mat, Q


### **c) Implement the M step (5 points)**

In the maximization (M) step, we then (re-)calculate the MLE values for $\theta$ or $(\boldsymbol{\mu}_1,\ldots \boldsymbol{\mu}_k, \mathbf{\Sigma}_1, \ldots \mathbf{\Sigma}_k, \mathbf{\phi}_1, \ldots \mathbf{\phi}_k)$ at this $t+1$ step.

Here $j$ runs from 1 to $k$.

For $\boldsymbol{\phi}$
\begin{align}
\phi _{j} ={\frac {1}{n}}\sum _{i=1}^{n}T_{i,j}^{t}.
\end{align}


For $\boldsymbol{\mu}$
\begin{align}
\boldsymbol{\mu}_j = \frac{\sum_{i=1}^n T_{i,j}^{t} \mathbf{x}_i}{\sum_{i=1}^n T_{i,j}^{t}}.
\end{align}

And for $\boldsymbol{\Sigma}$
\begin{align}
\mathbf{\Sigma}_j = \frac{\sum_{i=1}^n T_{i,j}^{t} (\mathbf{x}_i - \boldsymbol{\mu}_j) (\mathbf{x}_i - \boldsymbol{\mu}_j)^\top }{\sum_{i=1}^n T_{i,j}^{t}}.
\end{align}

**Fill in the m_step() function to calculate the updated $\boldsymbol{\mu}, \mathbf{\Sigma}, \textbf{ and } \boldsymbol{\phi}$ values given $\mathbf{T}$.**

Note: To calculate the new covariance matrices you can use the aweights input option for [np.cov](https://numpy.org/doc/stable/reference/generated/numpy.cov.html), where the $T^t_{i,j}$ values are the weights.
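
The following sketch compares `np.cov` with `aweights` against the weighted update formulas written out explicitly. The data and weights are random toy values (assumptions for illustration), with `w` standing in for one column of $\mathbf{T}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))        # toy data: n = 6 cells, d = 3 genes
w = rng.uniform(0.1, 1.0, size=6)  # toy weights standing in for one column of T

# Weighted mean, as in the update for mu_j
mu_j = w @ x / w.sum()

# Explicit weighted covariance, as in the update for Sigma_j
centered = x - mu_j
sigma_explicit = (w[:, None] * centered).T @ centered / w.sum()

# np.cov expects variables (genes) in rows, hence the transpose
sigma_ddof0 = np.cov(x.T, aweights=w, ddof=0)
sigma_default = np.cov(x.T, aweights=w)
```

With `ddof=0`, `np.cov` normalizes by the sum of the weights and agrees with the explicit formula. With the default settings it applies a bias correction, so its entries differ by a constant factor; whether that difference matters for this problem is a judgment call left to the reader.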

In [10]:
def m_step(T, x):
  #work with a plain ndarray so that @ behaves the same for matrix or array input
  x = np.asarray(x)

  #calculate phi
  phi = np.mean(T,axis=0)

  n=np.shape(T)[0]
  k =np.shape(T)[1]
  d=np.shape(x)[1]

  #initialize mu and sigma
  mu = np.zeros((k,d))
  sigma = np.zeros([k,d,d])

  for j in range(k):
    #calculate the covariance matrix, weighting each cell by T[i,j]
    sigma[j,:,:] = np.cov(np.transpose(x),aweights=T[:,j])
    #calculate mu as the T-weighted mean of the cells
    mu[j,:] = (T[:,j] @ x)/np.sum(T[:,j])
  return mu, sigma, phi

### **d) Run EM steps for 100 iterations, get mixture (cluster) assignments for the cells, and plot the Q function over the iterations. (10 points)**

To initialize the EM process we will (1) let $k = 10$ (also the number of labeled cell types), (2) assume $\boldsymbol{\phi}$ assigns a uniform probability to each of the $k$ components, (3) choose random rows from the data to represent each of the $k$  $\boldsymbol{\mu}$s, and (4) use the covariance matrix for $\mathbf{X}$ (the cell x genes matrix) to initialize the $k$ $\mathbf{\Sigma}$s.

Run the EM algorithm for 100 iterations.

Determine the final cluster assignment for each cell $i$ by finding the mixture component $j$ with the largest $T_{i,j}$ value.
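
On a toy responsibility matrix (values assumed for illustration), this assignment rule is a row-wise argmax:

```python
import numpy as np

# Toy responsibility matrix (assumed values): 4 cells, 3 components
T_toy = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.3, 0.4, 0.3],
                  [0.05, 0.05, 0.9]])

# Index of the largest entry in each row = assigned component for each cell
assignments = np.argmax(T_toy, axis=1)
print(assignments)  # [0 2 1 2]
```

Component indices are zero-based here, so they run from 0 to $k-1$.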

**Plot Q over the iterations, and report $\boldsymbol{\phi}$ and the cluster (k) assignments for the first 10 cells after 100 iterations.**

**You can use the initialization provided below.**

In [11]:
#Initialize which random rows to select for mu
n=count_mat.shape[0]
k=10
np.random.seed(2022)
rand_rows=np.random.choice(n,k,replace=False)
mu_inital = np.array(count_mat_sub[rand_rows,:])

#Uniform mixture weights
phi_intitial = [1/k for i in range(k)]
#Use the covariance of the full data matrix for each of the k components
cov_inital = np.cov(np.transpose(count_mat_sub))
cov_inital = np.array([cov_inital]*k)

In [12]:
Iterations = 100
Q = np.zeros(Iterations)
for g in range(Iterations):
  T, Q[g] = e_step(count_mat_sub, mu_inital, cov_inital, phi_intitial)

  mu_inital,cov_inital,phi_intitial = m_step(T,count_mat_sub)


In [13]:
p = bokeh.plotting.figure(
    width = 800, height =400,
    title = "Q function outputs",
    x_axis_label = "Iteration", y_axis_label = "Q"
       
)
x = np.linspace(1,Iterations,num=Iterations)
p.circle(x,Q)

bokeh.io.show(p)

In [14]:
# getting cluster assignments
cells = 10
assign = np.zeros(10)
for i in range(cells):
  assign[i] = T[i,:].argmax()

print("The cluster assignments for the first 10 cells are:", assign)

The cluster assignments for the first 10 cells are: [7. 7. 6. 7. 7. 7. 3. 5. 9. 9.]


### **e) Calculate correspondence of generated clusters and given cell type labels. (5 points)**

For each mixture component/cluster of cells, find the majority cell type label and the percent of cells with that label.

**Report the percent of cells with the majority label for each cluster and what that majority label is.**
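
One possible way to compute these quantities is a normalized cross-tabulation, sketched here on a toy table. The column names `cluster` and `cell_type` and all values are assumptions for the example, not the real metadata.

```python
import pandas as pd

# Toy table (assumed values): one row per cell
toy = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1, 1, 1],
    "cell_type": ["A", "A", "B", "C", "C", "C", "B"],
})

# Fraction of each label within each cluster (rows sum to 1)
fractions = pd.crosstab(toy["cluster"], toy["cell_type"], normalize="index")

# Most frequent label per cluster and the percent of cells carrying it
summary = pd.DataFrame({
    "majority_label": fractions.idxmax(axis=1),
    "percent": fractions.max(axis=1) * 100,
})
```

In this toy table, cluster 0 has majority label A (2 of 3 cells) and cluster 1 has majority label C (3 of 4 cells).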



In [15]:
cells = 18744
assign = np.zeros(cells)
for i in range(cells):
  assign[i] = T[i,:].argmax()

meta["Assignment"] = np.transpose(assign)

In [17]:
data_frame = pd.DataFrame(meta.groupby(["Assignment"]).apply(lambda x: x["cell_type"].value_counts(normalize=True).index[0]))

percentages = np.zeros(10)
for i in range(10):
  df = meta.loc[meta["Assignment"] == i].value_counts(normalize=True)*100
  percentages[i] = df[0]
  
data_frame["Percentages"] = percentages

In [18]:
data_frame

Assignment,0,Percentages
0.0,Vip Chat,51.117318
1.0,L6 CT Cpa6_1,70.611702
2.0,Vip Chat,52.298851
3.0,L5 NP Slc17a8_1,87.388724
4.0,Astro Aqp4,50.153374
5.0,L2/3 IT,100.0
6.0,L6 CT Cpa6_1,100.0
7.0,L2/3 IT,97.757375
8.0,Vip Chat,67.741935
9.0,L2/3 IT,100.0
