<a href="https://colab.research.google.com/github/CPukszta/BI-BE-CS-183-2023/blob/main/HW10/HW10Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Bi/Be/Cs 183 2022-2023: Intro to Computational Biology
TAs: Meichen Fang, Tara Chari, Zitong (Jerry) Wang

**Submit your notebooks by sharing a clickable link with Viewer access. Link must be accessible from submitted assignment document.**

Make sure Runtime $\rightarrow$ Restart and run all works without error

**HW 10 Final Problem**

In this problem you will process a single-cell dataset from the raw fastqs (sequencing reads of the cDNA library) to produce the cell x gene count matrix we usually work with. Given this count matrix you will additionally investigate the impact of various normalization techniques and dimensionality reduction on clustering of the cells (i.e. looking for cell types). With the metadata for this dataset, the cell types and ages of the mice used, you will then compare the efficacy of logistic regression vs neural network based techniques for classifying the age of the mouse a cell came from.


## **Install packages**

Install kb-python

This package is used to do transcript quantification (as shown in HW 4 Problem 2), aligning sequencing reads to a provided transcriptome to estimate transcript abundances. This produces the gene count matrices we have been working with. kb takes in the raw FASTQ files (the cDNA sequences from the generated cDNA library), aligns the transcript sequences to a reference file for the organism (the kallisto index below), and generates a count matrix of the transcripts (or genes) per cell.

In [None]:
# Install kb. This package runs kallisto and bustools. 
# These are programs used to process the single-cell RNA-seq reads to produce count matrices.
!pip3 install kb-python 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kb-python
  Downloading kb_python-0.27.3.tar.gz (7.5 MB)
  Preparing metadata (setup.py) ... done
Collecting anndata>=0.6.22.post1
  Downloading anndata-0.8.0-py3-none-any.whl (96 kB)
Collecting loompy>=3.0.6
  Downloading loompy-3.0.7.tar.gz (4.8 MB)
  Preparing metadata (setup.py) ... done
Collecting ngs-tools>=1.7.3
  Downloading ngs-tools-1.8.3.tar.gz (45.2 MB)

In [None]:
!pip3 install --quiet anndata
!pip install --quiet scanpy==1.7.0rc1

  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done
  Building wheel for umap-learn (setup.py) ... done
  Building wheel for sinfo (setup.py) ... done


In [None]:
import numpy as np
import scipy.io as sio
import pandas as pd
import matplotlib.pyplot as plt #Can use other plotting packages like seaborn

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

import anndata
import scanpy as sc

In [None]:
import time
t=time.time()

## **Read in data for Part a) analysis**
#### Running the code for data downloading and count matrix generation may take ~20mins total.

In [None]:
# Download the data from the ENA for a 3-month mouse
# This step should take 5-10mins
!wget --continue ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/002/SRR8426372/SRR8426372_1.fastq.gz
!wget --continue ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/002/SRR8426372/SRR8426372_2.fastq.gz

--2023-03-09 20:11:28--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/002/SRR8426372/SRR8426372_1.fastq.gz
           => ‘SRR8426372_1.fastq.gz’
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/fastq/SRR842/002/SRR8426372 ... done.
==> SIZE SRR8426372_1.fastq.gz ... 2389120304
==> PASV ... done.    ==> RETR SRR8426372_1.fastq.gz ... done.
Length: 2389120304 (2.2G) (unauthoritative)


2023-03-09 20:12:35 (35.2 MB/s) - ‘SRR8426372_1.fastq.gz’ saved [2389120304]

--2023-03-09 20:12:35--  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR842/002/SRR8426372/SRR8426372_2.fastq.gz
           => ‘SRR8426372_2.fastq.gz’
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:21... connected.
Logging in as anonymous 

Download a kallisto index 


In [None]:
!kb ref -d mouse -i index.idx -g t2g.txt -f1 transcriptome.fasta

[2023-03-09 20:15:30,983]    INFO [download] Downloading files for mouse from https://caltech.box.com/shared/static/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz to tmp/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz
100% 1.89G/1.89G [03:27<00:00, 9.78MB/s]
[2023-03-09 20:18:58,452]    INFO [download] Extracting files from tmp/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz


Generate the cell x gene count matrix

In [None]:
%%time
# The %%time cell magic must be the first line of the cell.
# This command processes the previously downloaded FASTQ files.
# This step should take ~11 mins.
!kb count --h5ad -i index.idx -g t2g.txt -x Dropseq -o output --filter bustools -t 2 \
SRR8426372_1.fastq.gz \
SRR8426372_2.fastq.gz


[2023-03-09 20:19:39,755]    INFO [count] Using index index.idx to generate BUS file to output from
[2023-03-09 20:19:39,755]    INFO [count]         SRR8426372_1.fastq.gz
[2023-03-09 20:19:39,755]    INFO [count]         SRR8426372_2.fastq.gz
[2023-03-09 20:28:05,778]    INFO [count] Sorting BUS file output/output.bus to output/tmp/output.s.bus
[2023-03-09 20:28:22,553]    INFO [count] Whitelist not provided
[2023-03-09 20:28:22,553]    INFO [count] Generating whitelist output/whitelist.txt from BUS file output/tmp/output.s.bus
[2023-03-09 20:28:23,664]    INFO [count] Inspecting BUS file output/tmp/output.s.bus
[2023-03-09 20:28:24,974]    INFO [count] Correcting BUS records in output/tmp/output.s.bus to output/tmp/output.s.c.bus with whitelist output/whitelist.txt
[2023-03-09 20:28:28,098]    INFO [count] Sorting BUS file output/tmp/output.s.c.bus to output/output.unfiltered.bus
[2023-03-09 20:28:34,174]    INFO [count] Generating count matrix output/counts_unfiltered/cells_x_genes 

**The dataset**

This is a Drop-seq based single-cell RNA-seq dataset produced from tissue extracted from the whole mouse lung, published by [Ilias Angelidis, Lukas M. Simon et al. 2019](https://www.nature.com/articles/s41467-019-08831-9). In the study, single-cell suspensions were generated from eight 3-month-old mice and seven 24-month-old mice, and the authors looked for cell-type-specific effects of aging between the mice, i.e. they created a single-cell atlas of the aging lung.

For Part a we will only be working with one sample from a 3-month old mouse (though in Parts b-d you will work with the full dataset across both ages and all mice).

<center><img src="https://drive.google.com/uc?export=view&id=1O_x3hmDDes7foQVVLcoZVasrVED1L0Al" alt="EMFigure" width="900" height="150"></center>

**The count matrix**

This matrix is 3,839 cells by 55,421 genes for one lung sample from a 3-month old mouse.




In [None]:
# load the raw cell x gene count matrix
adata = anndata.read_h5ad("output/counts_unfiltered/adata.h5ad") #This is the output from kb
adata.var["gene_id"] = adata.var.index.values

t2g = pd.read_csv("t2g.txt", header=None, names=["tid", "gene_id", "gene_name"], sep="\t") #Load the transcript-to-gene name mapping (t2g)
t2g.index = t2g.gene_id
t2g = t2g.loc[~t2g.index.duplicated(keep='first')]

adata.var["gene_name"] = adata.var.gene_id.map(t2g["gene_name"])
adata.var.index = adata.var["gene_name"]

In [None]:
adata

AnnData object with n_obs × n_vars = 3839 × 55421
    var: 'gene_name', 'gene_id'

In [None]:
count_mat = adata.X #Get the count matrix from this anndata object
count_mat.shape

(3839, 55421)

**Use this count_mat for Part a.**

# **a) Pre-processing: Select real/valid cells i.e. cells that pass a UMI count threshold based on the commonly used 'Knee plot'. (10 points)**

Knee plots (described below) are commonly used to filter out cell barcodes that likely correspond to empty droplets that were captured or noisy samples that may be just random transcripts that were picked up in the droplet. We want to only keep cell barcodes that seem to have high enough UMI counts (i.e. molecules detected) which suggest that a real cell was captured in that droplet.

**To construct a knee plot** we (1) rank cells in *descending* order of their total UMI counts (UMI counts summed across genes). The cell ranks are plotted on the x-axis (1 to n cells). On the y-axis we (2) plot the total UMI count of each cell. Thus, as we move along the x-axis from left to right, the right end of the plot displays cells with very few UMI counts (noisy/empty droplets to possibly remove from analysis). Often both axes are drawn on a logarithmic scale (a log-log plot).

The inflection point of the graph marks the drop between the left-hand side of the plot, where cell barcodes have high UMI counts, and the right-hand side, where cell barcodes have low UMI counts (and are thus considered to reflect a failed capture and/or to be too noisy for further analysis).

**(1) Make one knee plot for the 3,839 cells (which are the cell barcodes) and their total UMI counts across the 55,421 genes (in a log-log plot). (2) Describe what UMI threshold you might use to filter out noisy cell barcodes based on the plot.**
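
As a rough illustration of steps (1) and (2) in the knee plot description, the sketch below computes the ranked UMI totals on a small synthetic matrix. The toy matrix, the Poisson rates, and all variable names are made up for this example and are not the assignment data; the plotting call is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 "real" cells with many UMIs and 450 near-empty droplets,
# each measured over 200 made-up genes.
real_cells = rng.poisson(lam=5.0, size=(50, 200))
empty_droplets = rng.poisson(lam=0.05, size=(450, 200))
toy_counts = np.vstack([real_cells, empty_droplets])

# (1) Total UMI count per barcode, summed across genes.
# np.asarray(...).ravel() also flattens the result for np.matrix or sparse inputs.
umi_totals = np.asarray(toy_counts.sum(axis=1)).ravel()

# (2) Sort in descending order; rank 1 is the barcode with the most UMIs.
sorted_totals = np.sort(umi_totals)[::-1]
ranks = np.arange(1, sorted_totals.size + 1)

# A knee plot is a log-log line plot of sorted_totals against ranks,
# e.g. with matplotlib: plt.loglog(ranks, sorted_totals)
print(sorted_totals[:3], sorted_totals[-3:])
```

In this toy setup the drop between the two groups of barcodes is built in by construction; on real data the position of the knee has to be read off the plot.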

# **Read in data for Parts b-d analysis**

For Parts b-d you will be using the full gene count matrix across the 3-month and 24-month mouse cells combined, as provided in the original paper, which is 14,813 cells × 21,969 genes. We will filter for the top 2000 highly variable genes, so that you can use this matrix within the Colab environment. (Downloaded below)

In [None]:
#Read in full cell x gene count matrix for 3 and 24 month old mice lung samples
!wget --content-disposition https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124872/suppl/GSE124872_raw_counts_single_cell.mtx.gz

--2023-03-09 20:28:55--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124872/suppl/GSE124872_raw_counts_single_cell.mtx.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.11, 2607:f220:41f:250::230, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20074662 (19M) [application/x-gzip]
Saving to: ‘GSE124872_raw_counts_single_cell.mtx.gz’


2023-03-09 20:28:56 (29.4 MB/s) - ‘GSE124872_raw_counts_single_cell.mtx.gz’ saved [20074662/20074662]



In [None]:
#Read in cell metadata
!wget --content-disposition https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124872/suppl/GSE124872_Angelidis_2018_metadata.csv.gz

--2023-03-09 20:28:56--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124872/suppl/GSE124872_Angelidis_2018_metadata.csv.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.11, 2607:f220:41f:250::230, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 188271 (184K) [application/x-gzip]
Saving to: ‘GSE124872_Angelidis_2018_metadata.csv.gz’


2023-03-09 20:28:56 (17.4 MB/s) - ‘GSE124872_Angelidis_2018_metadata.csv.gz’ saved [188271/188271]



In [None]:
!gunzip GSE124872_Angelidis_2018_metadata.csv.gz
!gunzip GSE124872_raw_counts_single_cell.mtx.gz

Read in metadata of cells: "grouping" denotes the age (3m - 3 month, 24m - 24 month) and "celltype" denotes the cell type e.g. Plasma_cells

In [None]:
meta = pd.read_csv('GSE124872_Angelidis_2018_metadata.csv')
meta.head()

Unnamed: 0.1,Unnamed: 0,nGene,nUMI,orig.ident,identifier,res.2,identifier.1,name,grouping,batch,cells,cluster,celltype
0,muc3838:muc3838:TTCCGTGCCCCT,4255,10691,merged,muc3838,2,muc3838,,24m,good,800,cluster2,Ciliated_cells
1,muc3838:muc3838:TTGCCCAATTAA,3178,6860,merged,muc3838,2,muc3838,,24m,good,800,cluster2,Ciliated_cells
2,muc3838:muc3838:AAGCCCAGCTAT,1470,6127,merged,muc3838,24,muc3838,,24m,good,800,cluster24,Plasma_cells
3,muc3838:muc3838:GCACTTTAGAAT,2348,4359,merged,muc3838,2,muc3838,,24m,good,800,cluster2,Ciliated_cells
4,muc3838:muc3838:TCCTGCTCCCTT,2439,5293,merged,muc3838,2,muc3838,,24m,good,800,cluster2,Ciliated_cells


**The dataset**

This is a Drop-seq based single-cell RNA-seq dataset produced from tissue extracted from the whole mouse lung, published by [Ilias Angelidis, Lukas M. Simon et al. 2019](https://www.nature.com/articles/s41467-019-08831-9). In the study, single-cell suspensions were generated from eight 3-month-old mice and seven 24-month-old mice, and the authors looked for cell-type-specific effects of aging between the mice, i.e. they created a single-cell atlas of the aging lung.

For Parts b-d we will be working with the full dataset across both ages and all mice.

<center><img src="https://drive.google.com/uc?export=view&id=1O_x3hmDDes7foQVVLcoZVasrVED1L0Al" alt="EMFigure" width="900" height="150"></center>

**The count matrix**

This matrix is 14,813 cells by 21,969 genes across all mouse lung samples as provided in Ilias Angelidis, Lukas M. Simon et al. 2019.




Read in count matrix

In [None]:
import scipy.io as sio
count_mat = sio.mmread('GSE124872_raw_counts_single_cell.mtx')
count_mat = count_mat.todense().T
count_mat.shape

(14813, 21969)

Select for top 2000 highly variable genes

In [None]:
#Anndata is a common data type for processing and storing single-cell count matrices
adata = anndata.AnnData(X = count_mat)
adata


AnnData object with n_obs × n_vars = 14813 × 21969

In [None]:
#Select the 'highly variable' genes in the matrix so we don't use 21k genes
sc.pp.filter_cells(adata, min_counts=0)
sc.pp.filter_genes(adata, min_counts=0)

sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4) #Cell-size normalization
sc.pp.log1p(adata) #log-variance stabilization

sc.pp.highly_variable_genes(adata,n_top_genes=2000)

genesToKeep = adata.var['highly_variable']

In [None]:
#Subset original count matrix
count_mat = count_mat[:,genesToKeep]
count_mat.shape

(14813, 2000)

**Use this 14813 × 2000 count_mat matrix as your count matrix for parts b-d**

# **b) Normalization (10 points)**

Note that we used cell-size normalization and then log1p to transform the data before selecting highly variable genes, and then used the resulting gene indices to subset the raw count matrix. Therefore, the count_mat matrix here is still the raw count matrix.

In this part, you will do the size normalization and log1p transformation yourself. For size normalization, scale each cell so that its counts sum to 10,000.

**Report both the raw count matrix and the normalized matrix (that is size-normalized and then log1p transformed).**
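
To make the two transformations concrete, here is a minimal sketch on a made-up 2 × 3 matrix. The names `toy_raw`, `size_normalized`, and `toy_normalized` are hypothetical, and the sketch assumes every cell has a nonzero total count (otherwise the division would fail).

```python
import numpy as np

# Hypothetical toy matrix: 2 cells (rows) x 3 genes (columns)
toy_raw = np.array([[1.0, 3.0, 0.0],
                    [10.0, 0.0, 10.0]])

target_sum = 1e4  # counts per cell after size normalization

# Size normalization: divide each cell by its total count, then rescale
cell_totals = toy_raw.sum(axis=1, keepdims=True)
size_normalized = toy_raw / cell_totals * target_sum

# log1p transformation: log(1 + x), which keeps zeros at zero
toy_normalized = np.log1p(size_normalized)

print(size_normalized.sum(axis=1))
print(toy_normalized)
```

Dividing by the per-cell total removes differences in sequencing depth between cells, and log1p compresses the large dynamic range of the rescaled counts.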

In [None]:
raw = 

In [None]:
normalize = 

# **c) Clustering (20 points)**

In this part, we will do K-means clustering on 4 different count matrices:
1. Raw count matrix
2. Raw count matrix transformed by the first 15 principal components
3. Normalized count matrix
4. Normalized count matrix transformed by the first 15 principal components

You will use sklearn for PCA and [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict) clustering. **Set the cluster number k to 10 and random_state to 0 when using KMeans.**

Below is an example from sklearn

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_



array([1, 1, 1, 0, 0, 0], dtype=int32)

**For each of the four clustering results, compare the k-means cluster assignments to the given cell types in the metadata. Report the majority 'true' cell type in each k-means cluster, and the proportion of cells in that cluster that belong to this majority cell type.**

**For the two clusterings done in 15 dimensional PCA space, also plot the transformed data in 2D (only the first 2 transformed coordinates) colored by corresponding k-means clusters (which was done in the 15D PCA-space). Comment on the differences between the two plots.**
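
One possible way to summarize the majority cell type per cluster is sketched below on a tiny made-up example. The helper `majority_per_cluster` and the toy labels are hypothetical and only illustrate the bookkeeping; they are not part of the assignment code.

```python
from collections import Counter

# Hypothetical toy inputs: one cluster label and one annotated cell type per cell
cluster_labels = [0, 0, 0, 1, 1, 1, 1]
cell_types = ["A", "A", "B", "B", "B", "B", "A"]

def majority_per_cluster(clusters, types):
    """Return {cluster: (majority type, proportion of the cluster with that type)}."""
    grouped = {}
    for cluster, cell_type in zip(clusters, types):
        grouped.setdefault(cluster, []).append(cell_type)
    summary = {}
    for cluster, members in grouped.items():
        top_type, top_count = Counter(members).most_common(1)[0]
        summary[cluster] = (top_type, top_count / len(members))
    return summary

summary = majority_per_cluster(cluster_labels, cell_types)
for cluster, (top_type, proportion) in sorted(summary.items()):
    print(cluster, top_type, round(proportion, 3))
```

A proportion close to 1 means the cluster is dominated by a single annotated cell type; lower values indicate that the cluster mixes several cell types.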

# **d) Age prediction (60 points)**

In their work, [Ilias Angelidis, Lukas M. Simon et al.](https://www.nature.com/articles/s41467-019-08831-9) found that the gene expression patterns of some cell types differed between the two age groups (3 months and 24 months), as shown in the figure below. In the figure, the x-axis represents different genes and the y-axis represents different cell types, with each color bar representing the logarithm of the fold change in expression level between old and young cells for a particular gene in a particular cell type. Inspired by this result, we will build classification models (using normalized counts) and test how well they can predict the age of a mouse based on the gene expression profile of a single cell.

<center><img src="https://drive.google.com/uc?export=view&id=17gt0rkWzJmh11GPw6ipehM6miiGTD86e" alt="EMFigure" width="550" height="500"></center>

We will compare the predictive power of two different classes of models: (1) logistic regression and (2) neural network. For each class, we will fit a model using gene expression data from a single cell type to predict the age (3m or 24m) of the mouse from which the cell was taken.

**i) From the given information, what cell type would you choose to build a predictive model of mouse age? Explain your reasoning (10 points)**

**Your Answer Here:** 





To standardize the workflow, we will consider just two cell types: 

(1) 'Alveolar_macrophage' (AM)

(2) 'Type_2_pneumocytes' (T2P) 

For the rest of the question, you will build a total of **4** models: 

1.   logistic regression using alveolar macrophage 
2.   logistic regression using type II pneumocytes 
3.   neural network using alveolar macrophage
4.   neural network using type II pneumocytes

We first subset our **normalized count matrix** to get the count data for alveolar macrophages (AM) and type II pneumocytes (T2P) sampled from 3-month-old and 24-month-old mice, as well as the corresponding age label for each cell. AM_3m represents alveolar macrophages from 3-month-old mice, and T2P_24m represents type II pneumocytes from 24-month-old mice. *Note that we are representing the age class '3m' using the integer 0 and the age class '24m' using the integer 1.*

In [None]:
typename = 'Alveolar_macrophage'
# boolean masks over cells: both the age group and the cell type must match
youngIndex = meta.grouping.isin(['3m']) & meta.celltype.isin([typename])
oldIndex = meta.grouping.isin(['24m']) & meta.celltype.isin([typename])

# we will assign the number 0 to the age class '3m' and 1 to the age class '24m'
AM_3m = normalize[youngIndex,:]
AM_3m_label = np.zeros(np.size(AM_3m,0))
AM_24m = normalize[oldIndex,:]
AM_24m_label = np.ones(np.size(AM_24m,0))

typename = 'Type_2_pneumocytes'
youngIndex = meta.grouping.isin(['3m']) & meta.celltype.isin([typename])
oldIndex = meta.grouping.isin(['24m']) & meta.celltype.isin([typename])

T2P_3m = normalize[youngIndex,:]
T2P_3m_label = np.zeros(np.size(T2P_3m,0))
T2P_24m = normalize[oldIndex,:]
T2P_24m_label = np.ones(np.size(T2P_24m,0))

Next, we will split our data evenly into a training set and test set. To allow for a fair comparison between the two cell types, we will use the same number of cells for alveolar macrophage as type II pneumocytes. **Use the code block below to split the data for each cell type into a train set and a test set, make sure you split both the gene expression counts and the corresponding labels.** For both cell types, the training set will consist of the first 550 cells of the 3 month old mouse and the first 250 cells of the 24 month old, the test set will consist of the subsequent 550 cells of the 3 month old mouse and the subsequent 250 cells of the 24 month old mouse.

In [None]:
#properly subset the data and labels into these variables
AM_train = # ENTER CODE HERE
AM_trainlabel = # ENTER CODE HERE
AM_test = # ENTER CODE HERE
AM_testlabel = # ENTER CODE HERE

T2P_train = # ENTER CODE HERE
T2P_trainlabel = # ENTER CODE HERE
T2P_test = # ENTER CODE HERE
T2P_testlabel = # ENTER CODE HERE
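
The slicing pattern described above (first block of cells for training, the subsequent block for testing, with labels split the same way) can be illustrated on made-up arrays. The array names and the small sizes `n_young = 3` and `n_old = 2` are toy stand-ins for the 550 and 250 cells in the text, not the assignment solution.

```python
import numpy as np

# Hypothetical toy data: 6 "3m" cells and 4 "24m" cells, 2 genes each
young = np.arange(12).reshape(6, 2)
old = 100 + np.arange(8).reshape(4, 2)
young_label = np.zeros(young.shape[0])  # 0 encodes '3m'
old_label = np.ones(old.shape[0])       # 1 encodes '24m'

n_young, n_old = 3, 2  # toy stand-ins for 550 and 250

# Training set: the first n_young young cells and the first n_old old cells
train = np.concatenate([young[:n_young], old[:n_old]], axis=0)
train_label = np.concatenate([young_label[:n_young], old_label[:n_old]])

# Test set: the subsequent n_young young cells and n_old old cells
test = np.concatenate([young[n_young:2 * n_young], old[n_old:2 * n_old]], axis=0)
test_label = np.concatenate([young_label[n_young:2 * n_young],
                             old_label[n_old:2 * n_old]])

print(train.shape, test.shape)
```

Slicing the counts and the labels with the same index ranges keeps every row of the data aligned with its label.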

 **ii) Logistic regression models (10 points)**

 In this section, you will build a logistic regression model to classify mouse age and use it to predict the age of the mouse from the gene expression profile of a cell. Recall that with logistic regression we model a categorical variable (e.g. mouse age - 3m vs 24m) as a continuous value (i.e. the probability of being in the category). In our case, the independent variable will be the collection of count data across all 2000 genes and the dependent variable will be a binary variable, 0 or 1, representing whether the cell was from the 24 month old mouse.
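
To illustrate how a logistic regression model turns gene counts into a probability and then into a class prediction, here is a small NumPy sketch with hand-picked, hypothetical weights. In the assignment the weights are learned by fitting the model; the values of `weights`, `bias`, and the toy cells below are made up.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model over 3 genes (a fitted model would learn these values)
weights = np.array([1.5, -2.0, 0.5])
bias = -0.25

# Hypothetical toy cells (rows) and their true age labels (0 = '3m', 1 = '24m')
toy_cells = np.array([[2.0, 0.0, 1.0],
                      [0.0, 2.0, 0.0],
                      [1.0, 1.0, 1.0]])
toy_labels = np.array([1, 0, 1])

# Probability of the class '24m' for each cell, thresholded at 0.5
prob_24m = sigmoid(toy_cells @ weights + bias)
predicted = (prob_24m >= 0.5).astype(int)

# Accuracy as the percent of cells whose age class was predicted correctly
accuracy_percent = 100.0 * np.mean(predicted == toy_labels)
print('Test Accuracy: %2.2f %%' % accuracy_percent)
```

The same accuracy definition (correct predictions divided by the total number of cells) applies to the test sets used in the rest of this section.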

**For each cell type (AM and T2P), complete the code block below where you will need to:**

1) fit a logistic regression model using the training set you created above (hint: you may need to change certain hyperparameters to get convergence)

2) evaluate your logistic regression model on the corresponding test set (make sure to use the same cell type as you used to fit the model)

3) report the test accuracy of your model as the percent of cells where the age of the mouse was correctly predicted by the model

In [None]:
#Set up model for AM 
# ENTER CODE HERE

#Evaluate logistic regression model on the AM test set
# ENTER CODE HERE

#Report test accuracy of AM model
# ENTER CODE HERE
ncorrect = 
ntotal = 
print('Test Accuracy: %2.2f %%' % ((100.0 * ncorrect) / ntotal))

In [None]:
#Set up model for T2P
# ENTER CODE HERE

#Evaluate logistic regression model on the T2P test set
# ENTER CODE HERE

#Report test accuracy of T2P model
# ENTER CODE HERE
ncorrect = 
ntotal =
print('Test Accuracy: %2.2f %%' % ((100.0 * ncorrect) / ntotal))

**iii) Neural network models - use the code/function below (30 points)**

In this section, we will try to predict mouse age by building a Multi-Layer Perceptron (MLP) model using [PyTorch](https://pytorch.org/docs/stable/index.html). Our MLP will take as input the expression levels of the 2000 highly variable genes for a given cell and produce a pair of values corresponding to the probabilities of the cell having been obtained from a 3-month-old vs. a 24-month-old mouse. Specifically, we will train two separate neural networks using different cell types, one using **alveolar macrophages** and another using **type II pneumocytes**.

We will start by importing packages and initializing the random number generator to a fixed constant to ensure reproducibility.

In [None]:
import torch
import torch.nn.functional as F
from torch import nn
from torch import optim
from torch.utils.data import Dataset, DataLoader

seed = 183
torch.manual_seed(seed)

<torch._C.Generator at 0x7f23cfce6cb0>

#### We will first construct our **neural network model for alveolar macrophages**

When we train and evaluate our model, we will require batches of data to be provided. The `DataLoader` class can automatically provide batches of data fetched from a `Dataset` object (which is defined for you as the `CountDataset` class). **Complete the code block below by converting counts and labels into tensors of the appropriate type.** For training to run properly, the count data must be converted to [`tensor` objects](https://pytorch.org/docs/stable/tensors.html#torch.Tensor) of type Float, and the labels (age) must be `tensor` objects of an integer type (`nn.CrossEntropyLoss` expects class-index labels as 64-bit integers, i.e. `torch.long`). **Furthermore, set an appropriate batch size, nbatch.** The batch size is the number of samples passed through our neural network before an error gradient is computed and the network parameters are updated.

In [None]:
#create a custom dataset object so we can use dataloader for shuffling the data
class CountDataset(Dataset):
    def __init__(self, count, labels):
        self.labels = labels
        self.count = count
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        label = self.labels[idx]
        count = self.count[idx]
        sample = [count, label]
        return sample

# convert counts/label into tensors of the appropriate type
traindata = # ENTER CODE HERE
trainlabel = # ENTER CODE HERE
testdata = # ENTER CODE HERE
testlabel = # ENTER CODE HERE

trainset = CountDataset(count=traindata, labels=trainlabel)
testset = CountDataset(count=testdata, labels=testlabel)

# create data loaders
nbatch = # ENTER CODE HERE
trainloader = DataLoader(trainset, batch_size=nbatch, shuffle=True)
testloader = DataLoader(testset, batch_size=nbatch, shuffle=True)
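
As a quick way to reason about the batch size, the sketch below counts how many parameter updates happen in one epoch for a few candidate values. The helper `updates_per_epoch` is hypothetical, the candidate batch sizes are arbitrary, and the count assumes that the last, possibly smaller, batch is kept, which is the default behavior of `DataLoader`.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches, and hence parameter updates, in one pass over the data."""
    return math.ceil(n_samples / batch_size)

# 550 cells from 3-month-old mice + 250 cells from 24-month-old mice in each training set
n_train = 550 + 250

for batch_size in (16, 32, 64, 800):
    print(batch_size, updates_per_epoch(n_train, batch_size))
```

Smaller batches give more, noisier updates per epoch, while a batch size equal to the training set size gives a single update per epoch.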

We are now ready to create our simple neural network model. We will define our model in a class that extends nn.Module. nn.Module subclasses must do a minimum of one thing: implement the forward method, which takes a batch of data and performs the forward pass. PyTorch's autograd system then computes the gradients for us during the backward pass. In the code below we'll also make use of the constructor of our model to instantiate the hidden and output layers.

The model is a simple neural network with one hidden layer. A rectified linear unit (ReLU) activation function is used for the neurons in the hidden layer. The nn.Module class defines an instance variable called training that is set to True when the model is being trained and False when it is being evaluated after being trained.

Since we would like our output to represent probabilities of the cell being from either of the two age groups (outputs are between 0 and 1, and sum up to 1), we can use the [softmax activation function](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html#torch.nn.functional.softmax) on our output layer to turn the outputs into probability-like values. **Complete the code below by implementing the softmax function on the output when the model is not being trained.** We apply the softmax only outside of training because during training we will use PyTorch's implementation of Cross Entropy Loss (nn.CrossEntropyLoss), which expects the raw, unnormalized outputs and implicitly applies a softmax before the logarithmic loss.

In [None]:
# define MLP model
class MLPmodel(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(MLPmodel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = F.relu(out)
        out = self.fc2(out)
        if not self.training:
          # ENTER CODE HERE
        return out
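
The relationship between the raw outputs, the softmax, and the cross entropy loss described above can be sketched in plain NumPy. This is an illustration of the math on made-up numbers, not PyTorch's implementation; the function names and toy values are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: non-negative values that sum to 1 for each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy_from_logits(logits, labels):
    """Mean negative log-probability of the true class, computed from raw outputs."""
    probs = softmax(logits)
    picked = probs[np.arange(labels.size), labels]
    return -np.mean(np.log(picked))

# Hypothetical raw network outputs for 3 cells and 2 classes (0 = '3m', 1 = '24m')
toy_logits = np.array([[2.0, -1.0],
                       [0.5, 0.5],
                       [-3.0, 1.0]])
toy_labels = np.array([0, 1, 1])

probs = softmax(toy_logits)
loss = cross_entropy_from_logits(toy_logits, toy_labels)
print(probs)
print(loss)
```

Because the softmax is already part of the loss computation, applying it a second time inside the model during training would distort the loss, which is why the model only applies it in evaluation mode. The softmax does not change which class has the largest value, so predictions based on the largest output are the same with or without it.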

We can now fit and evaluate the model. **In the code below, complete the implementation of the training loop by doing the following:**

(1) Initialize an instance of our MLPmodel class using appropriate choices of input_size, hidden_size, and num_classes.

(2) Define the variables named loss_function and optimiser. We will use the [cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) and the [Adam optimiser](https://pytorch.org/docs/stable/optim.html).

(3) Pick an appropriate number of epochs to train for.

(4) Make a line plot showing the loss across all training epochs.

In [None]:
# build the model 
input_size = # ENTER CODE HERE
hidden_size = # ENTER CODE HERE
num_classes = # ENTER CODE HERE
model = MLPmodel(input_size, hidden_size, num_classes)

# define the loss function and the optimiser
loss_function =# ENTER CODE HERE
optimiser = # ENTER CODE HERE

# define number of epochs to train for
nepoch =# ENTER CODE HERE

# the epoch loop
for epoch in range(nepoch):
    running_loss = 0.0
    for data in trainloader:
        # get the inputs
        inputs, labels = data

        # zero the parameter gradients
        optimiser.zero_grad()

        # forward + loss + backward + optimise (update weights)
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimiser.step()

        # keep track of the loss this epoch
        running_loss += loss.item()

# plot epoch vs loss
# ENTER CODE HERE


Let's evaluate the overall accuracy of our trained network on our test set. Use the following code block to finish the implementation of the accuracy computation. Report the test accuracy as the percent of cells where the age of the mouse was correctly predicted by the trained MLP model. Note that before the code you need to implement we've made a call to model.eval(). This sets the model into evaluation mode (model.training becomes False), which switches off training-only behavior such as dropout. It does not by itself disable gradient tracking; wrap the evaluation in torch.no_grad() if you want to skip gradient computation.

In [None]:
model.eval()

# Compute the model accuracy on the test set
# ENTER CODE HERE
ncorrect = 
ntotal = 

print('Test Accuracy: %2.2f %%' % ((100.0 * ncorrect) / ntotal))

**Repeat the steps above for type II pneumocytes. Use the exact same set of hyperparameters you set for training on alveolar macrophages.**

**Show all your code and report the following**:

(1) A loss curve showing the loss value for each epoch

(2) Test accuracy on the T2P test set

In [None]:
# train the same MLP model used above for T2P
# ENTER CODE HERE

# plot loss across epoch
# ENTER CODE HERE

# report model accuracy on T2P test set
# ENTER CODE HERE

**iv) Make a bar plot showing the test accuracy across all four models that you built. (10 points)**

You can use https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html 

1.   logistic regression using alveolar macrophage 
2.   logistic regression using type II pneumocytes 
3.   neural network using alveolar macrophage
4.   neural network using type II pneumocytes

In [None]:
# make bar plot
# ENTER CODE HERE
