<a href="https://colab.research.google.com/github/DPariser/DataScience/blob/main/Preprocessing/041823_DNP1_QC_and_Pre_Processing_FASTQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Single-cell RNA-seq data processing
** Information found in publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7857060/

Single-cell sequencing data were aligned and quantified using kallisto/bustools (KB, v0.24.4) ([Bray et al., 2016](https://www.nature.com/articles/nbt.3519)) against the GRCh38 human reference genome downloaded from 10x Genomics official website. Preliminary counts were then used for downstream analysis. Quality control was applied to cells based on three metrics step by step: the total UMI counts, number of detected genes and proportion of mitochondrial gene counts per cell. Specifically, cells with less than 1000 UMI counts and 500 detected genes were filtered, as well as cells with more than 10% mitochondrial gene counts. To remove potential doublets, for PBMC samples, cells with UMI counts above 25,000 and detected genes above 5,000 are filtered out. For other tissues, cells with UMI counts above 70,000 and detected genes above 7,500 are filtered out. Additionally, we applied Scrublet ([Wolock et al., 2019](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6625319/pdf/nihms-1515604.pdf)) to identify potential doublets. The doublet score for each single cell and the threshold based on the bimodal distribution was calculated using default parameters. The expected doublet rate was set to be 0.08, and cells predicted to be doublets or with doubletScore larger than 0.25 were filtered. After quality control, a total of 1,598,708 cells were remained. The stepwise quality control metrics used for individual samples were listed in Table S1. The resulting distribution of UMI counts, gene counts as well as mitochondrial gene percentage were shown in Figures S1C–S1E. We normalized the UMI counts with the deconvolution strategy implemented in the R package scran. Specifically, cell-specific size factors were computed by computeSumFactors function and further used to scale the counts for each cell. Then the logarithmic normalized counts were used for the downstream analysis.
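
The thresholds quoted above can be written as a per-cell filter. The following is a minimal sketch under stated assumptions, not the authors' code: `CellQC` and `passes_qc` are hypothetical names, the cutoffs are read literally from the text, and each paired rule (UMI counts "and" detected genes) is interpreted as removing a cell when either value is out of range. Scrublet's own doublet call is not modelled; only the score cutoff is applied.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    """Per-cell QC metrics (hypothetical container for illustration)."""
    umi_counts: int
    n_genes: int
    mito_fraction: float   # mitochondrial counts / total counts, in [0, 1]
    doublet_score: float   # e.g. a Scrublet score


def passes_qc(cell: CellQC, is_pbmc: bool = False) -> bool:
    """Apply the thresholds quoted from the publication to one cell."""
    # Upper bounds differ between PBMC samples and other tissues.
    max_umi, max_genes = (25_000, 5_000) if is_pbmc else (70_000, 7_500)
    # Assumption: failing either criterion of a pair removes the cell.
    if cell.umi_counts < 1_000 or cell.n_genes < 500:
        return False
    if cell.mito_fraction > 0.10:
        return False
    if cell.umi_counts > max_umi or cell.n_genes > max_genes:
        return False
    if cell.doublet_score > 0.25:
        return False
    return True


# Made-up cells: one typical, one with too few counts, one with high counts.
cells = [
    CellQC(umi_counts=5_000, n_genes=1_800, mito_fraction=0.04, doublet_score=0.05),
    CellQC(umi_counts=800, n_genes=400, mito_fraction=0.02, doublet_score=0.01),
    CellQC(umi_counts=30_000, n_genes=5_500, mito_fraction=0.03, doublet_score=0.10),
]
kept_lung = [c for c in cells if passes_qc(c, is_pbmc=False)]
kept_pbmc = [c for c in cells if passes_qc(c, is_pbmc=True)]
print(len(kept_lung), len(kept_pbmc))
```

The third cell illustrates why the tissue type matters: it stays under the limits for other tissues but exceeds the PBMC limits.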


# Setup Environment

This notebook starts from a fresh environment: nothing is installed beyond what the cells below install explicitly. If you follow the instructions exactly, it should run out of the box.

In [None]:
# Record the start time so the total running time of the notebook can be reported
import time
start_time = time.time()

In [None]:
%%capture
# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally
!pip install matplotlib
!pip install scikit-learn
!pip install numpy
!pip install scipy

In [None]:
# Import packages for analysis and plotting
from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import os

from scipy.sparse import csr_matrix
matplotlib.rcParams.update({'font.size': 22})
%config InlineBackend.figure_format = 'retina'

## Install kb

Tutorials & Notebooks
* https://www.kallistobus.tools/tutorials/kb_quality_control/python/kb_intro_1_python/
* https://github.com/pachterlab/kallisto-transcriptome-indices
* https://docs.google.com/presentation/d/1QUmi1Mm5dJ1UyQIT_5XAG9806XL4qGfb3OUDrlIvIqs/edit#slide=id.gef29e9d7dc_1_82

In [None]:
%%time
%%capture
# `kb` is a wrapper for the kallisto and bustools program, and the kb-python package contains the kallisto and bustools executables.
!pip install kb-python==0.24.1

CPU times: user 15.3 ms, sys: 12.8 ms, total: 28.1 ms
Wall time: 1.82 s


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## ❗**Connect to the Data**

The data is stored in a shared location on Google Drive. Many of the files are very large, so it is not feasible to download them and work on local copies. Instead, create a shortcut to the shared folder in your own Google Drive and use the files through that shortcut as if they were your own. To set this up:

* Click on the link to the shared location of the data.
* Navigate to the "Data files" folder.
* Click on the dropdown arrow next to the breadcrumb at the top.
* Choose "Add shortcut to Drive".

The folder should now appear in your Google Drive as "Data files".
You can then connect to your Google Drive and access the files.
From this point on, we assume that your Google Drive is set up this way.

With Google Drive mounted (see the `drive.mount` cell above), define the paths used in the rest of the notebook:



In [None]:
# Google drive root
gd_root = "/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk"
# Data root
data_root = f"{gd_root}/Data_files/HRA001149/HRR339729"
# working directory
work_dir = f"{gd_root}/LungMk"

# create the directory 
!mkdir -p "{work_dir}"

In [None]:
print(work_dir)

/content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/LungMk


In [None]:
 !ls "{work_dir}"

 genes_t2g.txt	 genome.idx   GRCh38genome.idx	'GRCh38genome.idx 2'


In [None]:
if os.path.exists(work_dir):
    print(f"The directory {work_dir} exists.")
else:
    print(f"The directory {work_dir} does not exist.")

The directory /content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/LungMk exists.


# Prepare for quantification for GRCh38

GRCh38 stands for Genome Reference Consortium Human Build 38, the most recent major human assembly released by the Genome Reference Consortium. It is a widely used reference for human genetics and genomics research and provides a standardized framework for comparing and interpreting genomic data across studies. The GRCh38 assembly was released in December 2013 and contains more than 3 billion base pairs, representing a complete set of human chromosomes. It includes both coding and non-coding regions of the genome and serves as the reference for applications such as genome sequencing, variant calling, and gene expression analysis.

Here we download two files that enable us to process the FASTQ files we **have**:

1. The index file: this is a k-mer index built from a reference transcriptome. It is a binary file that records which transcripts each k-mer occurs in, and it is used to map the reads to the transcriptome during the quantification step. The index file is generated by a tool such as kallisto from a reference transcriptome in FASTA format.
2. The t2g file: this is a tab-separated file that maps each transcript to its gene (transcript ID, gene ID, gene name). kallisto assigns reads to transcripts, so transcript-level assignments have to be aggregated to obtain gene-level counts, and the t2g file provides that mapping. Its transcript IDs must match the transcript names used in the index.
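
The role of the t2g file can be illustrated with a small aggregation. This is a simplified sketch, not what kallisto and bustools do internally (they work with equivalence classes of transcripts); `load_t2g` and `gene_level_counts` are hypothetical helpers, and the IDs are made up. The input uses a tab-separated layout with transcript ID, gene ID, and gene name.

```python
import csv
import io
from collections import defaultdict


def load_t2g(handle):
    """Read a tab-separated t2g file into {transcript_id: gene_id}."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping


def gene_level_counts(transcript_counts, t2g_map):
    """Sum transcript-level counts per gene; unmapped transcripts are skipped."""
    totals = defaultdict(int)
    for transcript_id, count in transcript_counts.items():
        gene_id = t2g_map.get(transcript_id)
        if gene_id is not None:
            totals[gene_id] += count
    return dict(totals)


# Toy t2g content (IDs are made up for illustration).
toy_t2g = io.StringIO("TX1\tGENE_A\tA\nTX2\tGENE_A\tA\nTX3\tGENE_B\tB\n")
t2g_map = load_t2g(toy_t2g)
toy_counts = {"TX1": 3, "TX2": 2, "TX3": 7, "TX9": 1}
print(gene_level_counts(toy_counts, t2g_map))
```

Note that `TX9` has no entry in the mapping and is silently dropped here, which is why a t2g file that does not match the index leads to missing counts.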

## Download the index and t2g files

In [None]:
%%time
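# The quantification of single-cell RNA-seq with kallisto requires an index.
# Indices are species specific and can be generated or downloaded directly with `kb`.
# Here we download a pre-made index for human (saved as genome.idx.gz) along with an auxiliary file (genes_t2g.txt)
# that describes the relationship between transcripts and genes.
# Caution: the t2g file below comes from the hg19/mm10 species-mixing example and its transcript IDs carry an
# `hg19_` prefix, so it may not match the Ensembl release 94 transcript names in this GRCh38 index.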

index_url = "https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/94/Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx.gz"
t2g_url = "https://github.com/BUStools/getting_started/releases/download/species_mixing/transcripts_to_genes_hg19mm10.txt"

!wget {index_url} -O "{work_dir}/genome.idx.gz"
!wget {t2g_url} -O "{work_dir}/genes_t2g.txt"

# Unzip the genome index file
!gunzip -f "{work_dir}/genome.idx.gz"

--2023-04-20 22:45:21--  https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/94/Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx.gz
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/160138161/82cc8d80-f68f-11e8-8faf-e4656ce5fc8f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230420T224521Z&X-Amz-Expires=300&X-Amz-Signature=c351a0eb38f9fc90f497a4d4b7b71942ab193f2c5889140dacd1fd7f1359e76e&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=160138161&response-content-disposition=attachment%3B%20filename%3DHomo_sapiens.GRCh38.cdna.all.release-94_k31.idx.gz&response-content-type=application%2Foctet-stream [following]
--2023-04-20 22:45:22--  https://objects.githubusercontent.com/g

In [None]:
!ls "{work_dir}"

 genes_t2g.txt	 genome.idx   GRCh38genome.idx	'GRCh38genome.idx 2'


In [None]:
# inspect the index files to ensure that they were downloaded properly
genome_idx_size = os.path.getsize(f"{work_dir}/genome.idx") / (1024*1024*1024)
genes_t2g_size = os.path.getsize(f"{work_dir}/genes_t2g.txt") / (1024*1024*1024)

print(f"genome.idx size: {genome_idx_size:.2f} GB")
print(f"genes_t2g.txt size: {genes_t2g_size:.2f} GB")

genome.idx size: 2.27 GB
genes_t2g.txt size: 0.02 GB


In [None]:
!wget -S -O /dev/null "https://github.com/pachterlab/kallisto-transcriptome-indices/releases/download/94/Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx.gz" 2>&1 | grep 'Content-Length' | awk '{print $2 / (1024^3) " GB"}'
!wget -S -O /dev/null "https://github.com/BUStools/getting_started/releases/download/species_mixing/transcripts_to_genes_hg19mm10.txt" 2>&1 | grep 'Content-Length' | awk '{print $2 / (1024^3) " GB"}'

0 GB
1.71542 GB
0 GB
0.0176521 GB


In [None]:
# The index is a binary file, so `head` prints unreadable bytes
!head "{work_dir}/genome.idx"

(binary output omitted)

Kallisto index files are binary files. They are generated from the reference transcriptome sequences in FASTA format, and the resulting index file is used to perform fast and memory-efficient quantification of RNA-seq data.
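
Because the index is binary, a hex view of the first few bytes is easier to read than raw `head` output. The helper below is a small sketch; `peek_bytes` is a hypothetical name, and the demonstration uses a throwaway file rather than the real index.

```python
import os
import tempfile


def peek_bytes(path, n=16):
    """Return the first n bytes of a file as a space-separated hex string."""
    with open(path, "rb") as handle:
        return handle.read(n).hex(" ")


# Demonstration on a temporary file; in the notebook, `path` would be the index file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes([0, 1, 2, 255]) + b"kallisto")
demo_path = tmp.name
demo_hex = peek_bytes(demo_path, n=4)
print(demo_hex)
os.remove(demo_path)
```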

In [None]:
# Inspect the first few lines of the files
!head "{work_dir}/genes_t2g.txt"

hg19_ENST00000456328	hg19_ENSG00000223972	hg19_DDX11L1
hg19_ENST00000515242	hg19_ENSG00000223972	hg19_DDX11L1
hg19_ENST00000518655	hg19_ENSG00000223972	hg19_DDX11L1
hg19_ENST00000450305	hg19_ENSG00000223972	hg19_DDX11L1
hg19_ENST00000438504	hg19_ENSG00000227232	hg19_WASH7P
hg19_ENST00000541675	hg19_ENSG00000227232	hg19_WASH7P
hg19_ENST00000423562	hg19_ENSG00000227232	hg19_WASH7P
hg19_ENST00000488147	hg19_ENSG00000227232	hg19_WASH7P
hg19_ENST00000538476	hg19_ENSG00000227232	hg19_WASH7P
hg19_ENST00000473358	hg19_ENSG00000243485	hg19_MIR1302-10
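
Every transcript ID in this listing carries an `hg19_` prefix. A quick way to see whether a t2g file fits an index is to measure how many index transcript names have a t2g entry. The sketch below assumes the index names look like un-prefixed, versioned Ensembl IDs (an assumption; in practice they could be read from the `transcripts.txt` file that `kb count` writes). `overlap_fraction` is a hypothetical helper.

```python
def overlap_fraction(t2g_ids, index_ids):
    """Fraction of index transcript names that have an entry in the t2g file."""
    index_ids = list(index_ids)
    if not index_ids:
        return 0.0
    known = set(t2g_ids)
    return sum(1 for name in index_ids if name in known) / len(index_ids)


# IDs as shown in the listing above versus assumed (hypothetical) index names.
t2g_ids = ["hg19_ENST00000456328", "hg19_ENST00000515242"]
index_ids = ["ENST00000456328.2", "ENST00000515242.2"]
fraction = overlap_fraction(t2g_ids, index_ids)
print(f"{fraction:.0%} of index transcripts have a t2g entry")
```

A fraction near zero indicates that the two files use different naming schemes and that gene-level counts would be empty or incomplete.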


In [None]:
%%time
# This step runs `kb count` on the FASTQ files stored on Google Drive.
# No `-o` is passed, so kb writes its output relative to the current working directory
# (see `./output.bus` in the log below) rather than into the working directory on Drive.
!kb count -i "{gd_root}/LungMk/genome.idx" -g "{gd_root}/LungMk/genes_t2g.txt" --overwrite -t 2 -x 10xv2 "{data_root}/HRR339729_f1.fastq.gz" "{data_root}/HRR339729_r2.fastq.gz"

[2023-04-20 22:46:39,736]    INFO Generating BUS file from
[2023-04-20 22:46:39,736]    INFO         /content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149/HRR339729/HRR339729_f1.fastq.gz
[2023-04-20 22:46:39,736]    INFO         /content/drive/MyDrive/Pate_Lab/DNP/Bioinformatics/H17_LungMk/Data_files/HRA001149/HRR339729/HRR339729_r2.fastq.gz
[2023-04-20 23:29:07,255]    INFO Sorting BUS file ./output.bus to tmp/output.s.bus
[2023-04-20 23:29:53,793]    INFO Whitelist not provided
[2023-04-20 23:29:53,793]    INFO Copying pre-packaged 10XV2 whitelist to .
[2023-04-20 23:29:53,933]    INFO Inspecting BUS file tmp/output.s.bus
[2023-04-20 23:30:15,363]    INFO Correcting BUS records in tmp/output.s.bus to tmp/output.s.c.bus with whitelist ./10xv2_whitelist.txt
[2023-04-20 23:30:35,037]    INFO Sorting BUS file tmp/output.s.c.bus to ./output.unfiltered.bus
[2023-04-20 23:30:37,839]    INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file 

In [None]:
!ls "{gd_root}/LungMk"

 genes_t2g.txt	 genome.idx   GRCh38genome.idx	'GRCh38genome.idx 2'


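The listing shows none of the `kb count` outputs (BUS files, count matrix) in the working directory. The log above suggests why: the command was run without `-o`, so the files were written relative to the current directory (`./output.bus`, `./counts_unfiltered/`) rather than to Google Drive.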
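
One way to confirm where the outputs landed is to check candidate directories for the files named in the log. This is a sketch: `missing_outputs` is a hypothetical helper, and the file list is taken from the log above, so it may differ for other kb versions.

```python
from pathlib import Path

# File names taken from the kb log above; an assumption for other kb versions.
EXPECTED_OUTPUTS = [
    "output.bus",
    "output.unfiltered.bus",
    "counts_unfiltered/cells_x_genes.mtx",
]


def missing_outputs(out_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected kb output files that are absent from out_dir."""
    base = Path(out_dir)
    return [name for name in expected if not (base / name).is_file()]


# Check the current directory; add other candidate locations to the list as needed.
for candidate in [Path.cwd()]:
    print(candidate, "missing:", missing_outputs(candidate))
```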

# Install Kallisto 

In [None]:
!sudo apt-get remove kallisto

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package 'kallisto' is not installed, so not removed
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [None]:
!rm /usr/local/bin/kallisto

rm: cannot remove '/usr/local/bin/kallisto': No such file or directory


In [None]:
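# Download and unpack the kallisto v0.46.2 Linux release
!wget https://github.com/pachterlab/kallisto/releases/download/v0.46.2/kallisto_linux-v0.46.2.tar.gz
!tar -zxvf kallisto_linux-v0.46.2.tar.gz
# The tarball unpacks into a directory named `kallisto`, so this moves the whole directory:
# the binary ends up at /usr/local/bin/kallisto/kallisto (see the listing below), which is not on PATH.
!mv kallisto /usr/local/bin/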

--2023-04-20 21:10:24--  https://github.com/pachterlab/kallisto/releases/download/v0.46.2/kallisto_linux-v0.46.2.tar.gz
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/26562905/b87f5a00-510d-11ea-9bfc-64cef1470625?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230420T211024Z&X-Amz-Expires=300&X-Amz-Signature=83ab01a49071c9744e69790c3bc555d1251de59ff3e539d7a86bd67b8dc5ce16&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=26562905&response-content-disposition=attachment%3B%20filename%3Dkallisto_linux-v0.46.2.tar.gz&response-content-type=application%2Foctet-stream [following]
--2023-04-20 21:10:24--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/26562905/b87f5a00-510d-

In [None]:
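# Note: each `!` command runs in its own subshell, so `!cd` does not change the notebook's working directory (use `%cd` for that)
!cd /usr/local/bin/kallisto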

In [None]:
!ls -al /usr/local/bin/kallisto

total 21972
drwxr-xr-x 3  501 staff     4096 Feb 16  2020 .
drwxr-xr-x 1 root root      4096 Apr 20 21:10 ..
-rwxr-xr-x 1  501 staff 22477379 Feb 16  2020 kallisto
-rw-r--r-- 1  501 staff     1357 Mar 20  2017 license.txt
-rw-r--r-- 1  501 staff     2250 Mar 20  2017 README.md
drwxr-xr-x 2  501 staff     4096 Apr 20 21:10 test


In [None]:
!sudo apt-get install kallisto

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libhts3
The following NEW packages will be installed:
  kallisto libhts3
0 upgraded, 2 newly installed, 0 to remove and 24 not upgraded.
Need to get 598 kB of archives.
After this operation, 1,637 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 libhts3 amd64 1.10.2-3ubuntu0.1 [350 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/universe amd64 kallisto amd64 0.46.1+dfsg-2build1 [248 kB]
Fetched 598 kB in 2s (351 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debco

In [None]:
!git clone https://github.com/BUStools/bustools.git

Cloning into 'bustools'...
remote: Enumerating objects: 3419, done.
remote: Counting objects: 100% (631/631), done.
remote: Compressing objects: 100% (224/224), done.
remote: Total 3419 (delta 472), reused 473 (delta 405), pack-reused 2788
Receiving objects: 100% (3419/3419), 3.36 MiB | 10.04 MiB/s, done.
Resolving deltas: 100% (1351/1351), done.


In [None]:
# Configure the bustools build from the cloned source tree.
# Each `!` line runs in its own subshell, so `!cd` does not persist, and a single `&` backgrounds a
# command instead of chaining it (use `&&` to chain). Configuring with -S/-B from the current
# directory leaves the Makefile here, where `make install` in the next cell picks it up.
!cmake -S bustools -B .

-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
release mode
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_cr

In [None]:
!make install

[  4%] Building CXX object src/CMakeFiles/bustools_core.dir/BUSData.cpp.o
[  8%] Building CXX object src/CMakeFiles/bustools_core.dir/Common.cpp.o
[ 12%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_capture.cpp.o
[ 16%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_clusterhist.cpp.o
[ 20%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_collapse.cpp.o
[ 24%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_compress.cpp.o
[ 28%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_correct.cpp.o
[ 32%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_count.cpp.o
[ 36%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_decompress.cpp.o
[ 40%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_extract.cpp.o
[ 44%] Building CXX object src/CMakeFiles/bustools_core.dir/bustools_inspect.cpp.o


In [None]:
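# Caution: `%env` does not expand `$PATH` here, so the literal string "$PATH" is stored (see the output)
# and the previous PATH entries are lost.
%env PATH=/usr/bin:/usr/local/bin:$PATH:/opt/longranger-2.2.2
!echo $PATH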

env: PATH=/usr/bin:/usr/local/bin:$PATH:/opt/longranger-2.2.2
/usr/bin:/usr/local/bin:$PATH:/opt/longranger-2.2.2
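
The output above shows the literal text `$PATH` inside the new value. A way to extend `PATH` without losing the existing entries is to edit `os.environ` from Python. The sketch below uses a hypothetical helper, `prepend_to_path`, and demonstrates it on a plain dictionary so that the real environment is left untouched.

```python
import os


def prepend_to_path(directory, env=None):
    """Put directory at the front of PATH unless it is already listed."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Demonstrated on a plain dict; call prepend_to_path("/usr/local/bin")
# without `env` to modify os.environ itself.
demo_env = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
print(prepend_to_path("/usr/local/bin", env=demo_env))
```

Commands started with `!` inherit `os.environ`, so a change made this way is visible to them.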


In [None]:
!bustools

bustools 0.42.0

Usage: bustools <CMD> [arguments] ..

Where <CMD> can be one of: 

sort            Sort a BUS file by barcodes and UMIs
correct         Error correct a BUS file
umicorrect      Error correct the UMIs in a BUS file
count           Generate count matrices from a BUS file
inspect         Produce a report summarizing a BUS file
whitelist       Generate a whitelist from a BUS file
project         Project a BUS file to gene sets
capture         Capture records from a BUS file
merge           Merge bus files from same experiment
text            Convert a binary BUS file to a tab-delimited text file
extract         Extract FASTQ reads correspnding to reads in BUS file
predict         Correct the count matrix using prediction of unseen species
collapse        Turn BUS files into a BUG file
clusterhist     Create UMI histograms per cluster
linker          Remove section of barcodes in BUS files
compress        Compress a BUS file
inflate         Decompress a BUSZ (compressed BUS

In [None]:
!wget -c https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
!chmod +x Anaconda3-2021.11-Linux-x86_64.sh
!bash ./Anaconda3-2021.11-Linux-x86_64.sh -b -f -p /usr/local

--2023-04-20 21:44:24--  https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 608680744 (580M) [application/x-sh]
Saving to: ‘Anaconda3-2021.11-Linux-x86_64.sh’


2023-04-20 21:44:27 (186 MB/s) - ‘Anaconda3-2021.11-Linux-x86_64.sh’ saved [608680744/608680744]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ | done
Solving environment: - \ | / - \ | / - \ | / - \ | done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _ipyw_jlab_nb_ext_conf==0.1.0=py39h06a4308_0
    - _libgcc_mutex==0.1=main
    - _openmp_mutex==4.5=1_gnu
    - alabaster==0.7.12=pyhd3eb1b0_0
    - anaconda-client==1.9.0=py39h06a4308_0
    - ana

In [None]:
# Google drive root
gd_root = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients"
# Data root
data_root = f"{gd_root}/Data files/HRA001149"
# working directory
work_dir = f"{gd_root}/LungMk"

# create the directory 
!mkdir -p "{work_dir}"

# download the cDNA fasta file
!curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

# build the kallisto index
!kallisto index -i Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx Homo_sapiens.GRCh38.cdna.all.fa.gz

# download the transcripts to genes mapping file (-L follows the GitHub release redirect)
!curl -L -O https://github.com/BUStools/getting_started/releases/download/species_mixing/transcripts_to_genes_hg19mm10.txt

# move the mapping file to the working directory
!mv transcripts_to_genes_hg19mm10.txt "{work_dir}/genes_t2g.txt"

# run kb count with the index built above and the t2g file
# (paths are quoted because they contain spaces; the FASTQ files sit in the HRR339729 subfolder)
!kb count -i Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx -g "{work_dir}/genes_t2g.txt" --overwrite -t 2 -x 10xv2 "{data_root}/HRR339729/HRR339729_f1.fastq.gz" "{data_root}/HRR339729/HRR339729_r2.fastq.gz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64.7M  100 64.7M    0     0  5827k      0  0:00:11  0:00:11 --:--:-- 11.3M

[build] loading fasta file Homo_sapiens.GRCh38.cdna.all.fa.gz
[build] k-mer length: 31
        from 1471 target sequences
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 1118780 contigs and contains 108619921 k-mers 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
[2023-04-20 22:17:42,552]    INFO Generating BUS file from
[2023-04-20 22:17:42,552]    INFO         /content/drive/MyDrive/Pate
[2023-04-20 22:17:42,

In [None]:
!curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64.7M  100 64.7M    0     0  5908k      0  0:00:11  0:00:11 --:--:-- 11.5M


In [None]:
# Download the cDNA FASTA file to the working directory
fasta_url = "ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz"
!wget {fasta_url} -P "{work_dir}"

# Unzip the FASTA file
!gzip -d "{work_dir}/Homo_sapiens.GRCh38.cdna.all.fa.gz"

# Build the kallisto index
!kallisto index -i "{work_dir}/Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx" "{work_dir}/Homo_sapiens.GRCh38.cdna.all.fa"

In [None]:
!kallisto index -i Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx Homo_sapiens.GRCh38.cdna.all.fa.gz


[build] loading fasta file Homo_sapiens.GRCh38.cdna.all.fa.gz
[build] k-mer length: 31
        from 1471 target sequences
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ... ^C


In [None]:
!ls -lh Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx

-rw-r--r-- 1 root root 2.3G Apr 20 21:17 Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx


In [None]:
t2g_url = "https://github.com/BUStools/getting_started/releases/download/species_mixing/transcripts_to_genes_hg19mm10.txt"
!wget {t2g_url} -O "{work_dir}/genes_t2g.txt"

# Run kb count with the downloaded index and t2g file (the FASTQ paths are quoted because they contain spaces)
!kb count -i Homo_sapiens.GRCh38.cdna.all.release-94_k31.idx -g "{work_dir}/genes_t2g.txt" --overwrite -t 2 -x 10xv2 "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_f1.fastq.gz" "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_r2.fastq.gz"

--2023-04-20 21:46:13--  https://github.com/BUStools/getting_started/releases/download/species_mixing/transcripts_to_genes_hg19mm10.txt
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/191064839/37b22d80-913c-11e9-9025-bca8558040be?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230420%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230420T214613Z&X-Amz-Expires=300&X-Amz-Signature=1bf8f40bba0344195d88488e1e17834f121558c97c762000139bdd23647fe463&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=191064839&response-content-disposition=attachment%3B%20filename%3Dtranscripts_to_genes_hg19mm10.txt&response-content-type=application%2Foctet-stream [following]
--2023-04-20 21:46:13--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/1

# Quantification

This step runs `kb` to quantify the reads. `kb` can take as input URLs where the reads are located, and will stream the data to Google Colab where it is quantified as it is downloaded, which allows very large datasets to be quantified without first saving them to disk. In this notebook the reads are instead read from the mounted Google Drive.

## Define the function

Let's first define the function to do the quantification.

**Aligning and quantifying using kallisto/bustools**

kallisto is a program for quantifying transcript abundances from RNA-seq data. It uses pseudoalignment, which determines the transcripts a read is compatible with without computing a full base-level alignment, and this makes quantification fast. kallisto first builds an index from the reference transcriptome and then pseudoaligns the reads against it. For single-cell data the result is written as a BUS file, from which gene- or transcript-level counts are generated.

bustools is a set of tools for working with BUS files generated by kallisto. Each BUS record stores a cell barcode, a UMI, the set (equivalence class) of transcripts the read is compatible with, and the number of reads with that combination. bustools can correct barcodes against a whitelist, sort the records, count the unique molecular identifiers (UMIs) associated with each gene or transcript, and perform other downstream processing.

The reads are pseudoaligned against the kallisto index built from the GRCh38 transcriptome (Ensembl release 94 cDNA) prepared above.
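
The UMI counting step that bustools performs can be illustrated in a few lines. This is a simplified sketch, not the bustools implementation: it assumes each read has already been resolved to a single gene, whereas bustools works on equivalence classes. `count_umis` and the toy records are hypothetical.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per (barcode, gene) pair.

    `records` is an iterable of (barcode, umi, gene) tuples; repeated
    tuples represent duplicate reads of one molecule and count once.
    """
    seen = defaultdict(set)
    for barcode, umi, gene in records:
        seen[(barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


toy_records = [
    ("AAAC", "GGT", "GENE_A"),
    ("AAAC", "GGT", "GENE_A"),   # duplicate read of the same molecule
    ("AAAC", "CCA", "GENE_A"),
    ("TTTG", "GGT", "GENE_B"),
]
umi_counts = count_umis(toy_records)
print(umi_counts)
```

Collapsing duplicates this way is what turns read counts into molecule counts, which is the quantity stored in the cell-by-gene matrix.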

The goal is to do the quantification given a patient ID.

**Note:** You might need to create the folder /content/drive/MyDrive/LungMk

**Note:** We are saving the files into the google drive so that they are not deleted.

In [None]:
# Define the function

idx = f'{work_dir}/genome.idx'
t2g = f'{work_dir}/genes_t2g.txt'

def quantify(idx, t2g, r1, r2, out):
  !kb count --verbose  -i "{idx}" -g "{t2g}" --overwrite -t 2 -x 10xv2 -o "{out}" "{r1}" "{r2}"

def quantify_patient(patient_id):
  r1 = f'{data_root}/{patient_id}/{patient_id}_f1.fastq.gz'
  r2 = f'{data_root}/{patient_id}/{patient_id}_r2.fastq.gz'
  out = f"{work_dir}/counts/{patient_id}/"
  quantify(idx, t2g, r1, r2, out)

There is an issue with this code: we are not getting the expected BUS file output.

In [None]:
import os

idx = f'{work_dir}/genome.idx'
t2g = f'{work_dir}/genes_t2g.txt'

def quantify(idx, t2g, r1, r2, out):
  !kb count --verbose  -i "{idx}" -g "{t2g}" --overwrite -t 2 -x 10xv2 -o "{out}" "{r1}" "{r2}"
  
def quantify_patient(patient_id):
  r1 = f'{data_root}/{patient_id}/{patient_id}_f1.fastq.gz'
  r2 = f'{data_root}/{patient_id}/{patient_id}_r2.fastq.gz'
  out = f"{work_dir}/counts/{patient_id}/"
  os.makedirs(out, exist_ok=True)
  quantify(idx, t2g, r1, r2, out)
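  # The options -e, --genecounts and a positional BUS file belong to `bustools count`, not `kb count`.
  # bustools count expects a sorted BUS file, so the sorted output.unfiltered.bus written by kb is used.
  !bustools count -o "{out}/output" -g "{t2g}" -e "{out}/matrix.ec" -t "{out}/transcripts.txt" --genecounts "{out}/output.unfiltered.bus"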


In [None]:
# Set working directory
work_dir = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk"
os.makedirs(f"{work_dir}/counts", exist_ok=True)

# Set file paths
idx = f"{work_dir}/genome.idx"
t2g = f"{work_dir}/genes_t2g.txt"
r1_path = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_f1.fastq.gz"
r2_path = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_r2.fastq.gz"

def quantify(idx, t2g, r1, r2, out):
    !kb count --verbose -i "{idx}" -g "{t2g}" --overwrite -t 2 -x 10xv2 -o "{out}" "{r1}" "{r2}"

def quantify_patient(patient_id):
    out = f"{work_dir}/counts/{patient_id}"
    os.makedirs(out, exist_ok=True)
    r1 = r1_path
    r2 = r2_path
    quantify(idx, t2g, r1, r2, out)
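    # The options -e, --genecounts and a positional BUS file belong to `bustools count`, not `kb count`.
    # Paths are quoted because they contain spaces; bustools count expects the sorted BUS file.
    !bustools count -o "{out}/output" -g "{t2g}" -e "{out}/matrix.ec" -t "{out}/transcripts.txt" --genecounts "{out}/output.unfiltered.bus"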

# Call the function with patient ID "HRR339729"
quantify_patient("HRR339729")


[2023-04-19 19:47:14,134]   DEBUG Printing verbose output
[2023-04-19 19:47:14,134]   DEBUG Creating tmp directory
[2023-04-19 19:47:14,134]   DEBUG Namespace(list=False, command='count', keep_tmp=False, verbose=True, i='/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/genome.idx', g='/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/genes_t2g.txt', x='10xv2', o='/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/counts/HRR339729', w=None, t=2, m='4G', c1=None, c2=None, overwrite=True, lamanno=False, filter=None, loom=False, h5ad=False, fastqs=['/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_f1.fastq.gz', '/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_r2.fastq.gz'])
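
Several paths in this notebook contain spaces, which is easy to get wrong in `!` commands. An alternative is to pass the arguments as a list through `subprocess`, where no shell quoting is needed. The sketch below uses only the `kb count` flags that already appear in this notebook; `build_kb_count_command` and `run_kb_count` are hypothetical helpers and the example paths are placeholders.

```python
import shlex
import subprocess


def build_kb_count_command(idx, t2g, r1, r2, out, technology="10xv2", threads=2):
    """Assemble the kb count call used in this notebook as an argument list."""
    return [
        "kb", "count", "--verbose",
        "-i", idx, "-g", t2g,
        "--overwrite", "-t", str(threads),
        "-x", technology, "-o", out,
        r1, r2,
    ]


def run_kb_count(*args, **kwargs):
    """Run kb count and raise CalledProcessError if it exits with an error."""
    command = build_kb_count_command(*args, **kwargs)
    print(shlex.join(command))  # shell-quoted form, useful for the record
    subprocess.run(command, check=True)


# Building the command has no side effects; the paths here are placeholders.
example = build_kb_count_command(
    "my dir/genome.idx", "my dir/genes_t2g.txt",
    "reads/R1.fastq.gz", "reads/R2.fastq.gz", "my dir/counts/sample",
)
print(shlex.join(example))
```

With `check=True`, a failing `kb count` run raises an exception instead of passing silently, which makes problems easier to spot than in the `!` form.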


Checking to see if there is an issue with the patient data

In [None]:
import gzip

# Path to the input files
r1_path = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_f1.fastq.gz"
r2_path = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_r2.fastq.gz"

# Check that the input files are in the correct format and can be read by kb
with gzip.open(r1_path, "rt") as f:
    print(f"Contents of {r1_path}:")
    print(f.read(100))

with gzip.open(r2_path, "rt") as f:
    print(f"Contents of {r2_path}:")
    print(f.read(100))


Contents of /content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_f1.fastq.gz:
@v300039845L1C001R0010000104/1
GCTGAATAGTCGGCCTTTTTAATCTGTT
+
EFDED1?AE4EECAGFGFCBDG>@=EF?
@v3000398
Contents of /content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_r2.fastq.gz:
@v300039845L1C001R0010000104/2
NTTTACTTTGTTCTTTTACTAGCATTTTGAACTGGGAATTTTAATTTATTTCCATTCTTCTATTAACAA
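
Printing the first record only shows that the files open. A slightly broader check is to tabulate read lengths over the first few thousand records. The sketch below is a minimal, hypothetical helper (`read_length_summary` is not part of kb-python), assuming standard four-line FASTQ records.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_summary(fastq_gz_path, n_reads=10000):
    """Count sequence lengths over the first n_reads records of a gzipped FASTQ."""
    lengths = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # A FASTQ record is four lines: header, sequence, '+', qualities.
        # Taking every 4th line starting at index 1 yields the sequence lines.
        for seq in islice(handle, 1, 4 * n_reads, 4):
            lengths[len(seq.rstrip("\n"))] += 1
    return lengths
```

Calling `read_length_summary(r1_path)` and `read_length_summary(r2_path)` gives a quick view of the barcode/UMI read and the cDNA read. The R1 read printed above is 28 bases long, while the `10xv2` technology shown in the kb debug log expects a 16-base barcode plus a 10-base UMI (26 bases) and `10xv3` expects 16 plus 12 (28 bases), so it may be worth confirming which chemistry was used.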


In [None]:
import pandas as pd

# Load the count matrix
counts = pd.read_csv("/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/counts/HRR339729/counts_unfiltered/cells_x_genes.mtx", sep=" ", skiprows=3, header=None)
counts.columns = ["barcode", "gene_id", "UMI"]

# Print the first few rows to inspect the data
print(counts.head())

FileNotFoundError: ignored
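
Before loading the matrix, it can help to check which output files actually exist. The sketch below assumes a particular layout for the `kb count` output directory; the file names in `EXPECTED_KB_FILES` are an assumption and may differ between kb-python versions, and `missing_outputs` is a hypothetical helper.

```python
import os

# Assumed layout of a `kb count` output directory; adjust to your kb-python version.
EXPECTED_KB_FILES = [
    "output.bus",
    "matrix.ec",
    "transcripts.txt",
    "counts_unfiltered/cells_x_genes.mtx",
    "counts_unfiltered/cells_x_genes.barcodes.txt",
    "counts_unfiltered/cells_x_genes.genes.txt",
]


def missing_outputs(out_dir, expected=EXPECTED_KB_FILES):
    """Return the expected files that are absent or empty under out_dir."""
    missing = []
    for rel in expected:
        path = os.path.join(out_dir, rel)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(rel)
    return missing
```

For example, `missing_outputs(f"{work_dir}/counts/HRR339729")` returns an empty list only when every expected file is present and non-empty.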

In [None]:
import struct

bus_path = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/counts/HRR339729/output.bus"

# Check the number of lines in the output.bus file
num_lines = 0
with open(bus_path, "rb") as f:
    while True:
        binary = f.read(4)
        if not binary:
            break
        (record_size,) = struct.unpack("<I", binary)
        f.read(record_size)
        num_lines += 1

print(f"Number of lines in the output.bus file: {num_lines}")

# Print the first few lines of the output.bus file
with open(bus_path, "rb") as f:
    for i in range(10):
        binary = f.read(4)
        (record_size,) = struct.unpack("<I", binary)
        record = f.read(record_size)
        print(record)


Number of lines in the output.bus file: 6


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
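
The loop above treats the BUS file as a sequence of length-prefixed records, which is probably why it reports only 6 "lines" and then prints very large chunks. To the best of my understanding of the BUS format, a file starts with a fixed header (the magic bytes `BUS\0`, then version, barcode length, UMI length and free-text length as 32-bit integers, then the free text) followed by fixed-size 32-byte records. The sketch below is written under that assumption; the helper names are hypothetical, and `bustools inspect` or `bustools text` remain the authoritative tools.

```python
import os
import struct

# One record: barcode (uint64), UMI (uint64), equivalence class (int32),
# count (uint32), flags (uint32), padding (uint32) = 32 bytes, little-endian.
BUS_RECORD = struct.Struct("<QQiIII")


def read_bus_header(f):
    """Read the BUS header from an open binary file and return its fields."""
    if f.read(4) != b"BUS\x00":
        raise ValueError("not a BUS file: magic bytes missing")
    version, bclen, umilen, tlen = struct.unpack("<IIII", f.read(16))
    f.read(tlen)  # skip the free-text part of the header
    return {"version": version, "barcode_len": bclen,
            "umi_len": umilen, "header_bytes": 20 + tlen}


def count_bus_records(path):
    """Return (header, number of records) without loading the records."""
    with open(path, "rb") as f:
        header = read_bus_header(f)
    payload = os.path.getsize(path) - header["header_bytes"]
    return header, payload // BUS_RECORD.size


def read_bus_records(path, n=10):
    """Return the first n records as tuples (barcode, umi, ec, count, flags, pad)."""
    records = []
    with open(path, "rb") as f:
        read_bus_header(f)
        for _ in range(n):
            chunk = f.read(BUS_RECORD.size)
            if len(chunk) < BUS_RECORD.size:
                break
            records.append(BUS_RECORD.unpack(chunk))
    return records


def decode_2bit(value, length):
    """Decode a 2-bit packed sequence (A=0, C=1, G=2, T=3, first base most significant)."""
    return "".join("ACGT"[(value >> (2 * (length - 1 - i))) & 3] for i in range(length))
```

With these helpers, `count_bus_records(bus_path)` gives the record count, and the first few barcodes can be printed with `decode_2bit(rec[0], header["barcode_len"])`, which keeps the output small enough to avoid the IOPub rate limit.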



In [None]:
# Load the gene annotation file
genes = pd.read_csv("/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/genes_t2g.txt", sep="\t", header=None, names=["transcript_id", "gene_id", "gene_name"])

# Print the first few rows of the dataframe
print(genes.head())

# Print the shape of the dataframe
print("Shape of dataframe:", genes.shape)

# Print the unique values in the gene_id column
print("Unique gene IDs:", genes["gene_id"].unique())

          transcript_id               gene_id     gene_name
0  hg19_ENST00000456328  hg19_ENSG00000223972  hg19_DDX11L1
1  hg19_ENST00000515242  hg19_ENSG00000223972  hg19_DDX11L1
2  hg19_ENST00000518655  hg19_ENSG00000223972  hg19_DDX11L1
3  hg19_ENST00000450305  hg19_ENSG00000223972  hg19_DDX11L1
4  hg19_ENST00000438504  hg19_ENSG00000227232   hg19_WASH7P
Shape of dataframe: (333131, 3)
Unique gene IDs: ['hg19_ENSG00000223972' 'hg19_ENSG00000227232' 'hg19_ENSG00000243485' ...
 'mm10_ENSMUSG00000098647' 'mm10_ENSMUSG00000096730'
 'mm10_ENSMUSG00000095742']
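
The IDs above carry `hg19_` and `mm10_` prefixes, which suggests that this t2g file comes from a combined human/mouse reference rather than a GRCh38-only one. A quick way to quantify that is to count IDs per prefix. This is a minimal sketch; `prefix_counts` is a hypothetical helper that simply splits on the first underscore.

```python
import pandas as pd


def prefix_counts(ids):
    """Count identifiers by the text before the first underscore."""
    series = pd.Series(list(ids)).astype(str)
    return series.str.split("_", n=1).str[0].value_counts().to_dict()
```

For this notebook the call would be `prefix_counts(genes["gene_id"].unique())`.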


## Looping

Here is the code that allows us to loop through all the patients when we are ready for it.

In [None]:
# Here we list all the patient folders. We can use this list when we need to
# loop through all the patients.

folders = [f for f in os.listdir(data_root) if os.path.isdir(os.path.join(data_root, f))]
print(folders)

['HRR339728', 'HRR339729', 'HRR339730', 'HRR339731', 'HRR339732', 'HRR339733', 'HRR339734', 'HRR339735', 'HRR339736', 'HRR339737', 'HRR339738', 'HRR339740', 'HRR339739', 'HRR339741', 'HRR339743', 'HRR339748', 'HRR339753', 'HRR339751', 'HRR339752', 'HRR339755', 'HRR339756', 'HRR339754', 'HRR339749', 'HRR339750', 'HRR339742', 'HRR339746', 'HRR339747', 'HRR339744', 'HRR339759', 'HRR339757', 'HRR339761', 'HRR339760', 'HRR339762', 'HRR339765', 'HRR339764', 'HRR339763', 'HRR339770', 'HRR339769', 'HRR339771', 'HRR339772', 'HRR339775', 'HRR339777', 'HRR339774', 'HRR339776', 'HRR339773', 'HRR339778', 'HRR339786', 'HRR339787', 'HRR339795', 'HRR339766', 'HRR339767', 'HRR339768', 'HRR339790', 'HRR339791', 'HRR339792', 'HRR339789', 'HRR339788', 'HRR339758', 'HRR339794', 'HRR339793', 'HRR339805', 'HRR339801', 'HRR339802', 'HRR339799', 'HRR339804', 'HRR339803', 'HRR339798', 'HRR339797', 'HRR339796', 'HRR339800', 'HRR339808', 'HRR339806', 'HRR339815', 'HRR339812', 'HRR339811', 'HRR339814', 'HRR339807'
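
Once single-patient quantification works, the loop itself could look like the sketch below. `quantify_all` is a hypothetical helper; it takes the per-patient function as an argument (here that would be `quantify_patient`) and records failures instead of stopping at the first one.

```python
def quantify_all(patient_ids, quantify_fn, skip=()):
    """Run quantify_fn for each patient ID, collecting failures instead of stopping."""
    done, failed = [], {}
    for pid in sorted(patient_ids):
        if pid in skip:
            continue
        try:
            quantify_fn(pid)
            done.append(pid)
        except Exception as err:  # keep going; inspect the failures afterwards
            failed[pid] = repr(err)
    return done, failed
```

Usage would be `done, failed = quantify_all(folders, quantify_patient)`. Note that shell commands started with `!` in a notebook do not raise Python exceptions when they fail, so checking the output files for each patient is still advisable.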

## Process for one patient

Here we run the quantification for a single patient as a proof of concept (POC).

In [None]:
%%time

# Let's do quantification for one patient, HRR339729
patient_id = 'HRR339729'
quantify_patient(patient_id)

[2023-04-18 20:14:02,140]   DEBUG Printing verbose output
[2023-04-18 20:14:02,141]   DEBUG Creating tmp directory
[2023-04-18 20:14:02,141]   DEBUG Namespace(list=False, command='count', keep_tmp=False, verbose=True, i='/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/genome.idx', g='/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/genes_t2g.txt', x='10xv2', o='/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/counts/HRR339729/', w=None, t=2, m='4G', c1=None, c2=None, overwrite=True, lamanno=False, filter=None, loom=False, h5ad=False, fastqs=['/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_S1_L001_R1_001.fastq.gz', '/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_S1

# Quality Control

* https://colab.research.google.com/github/pachterlab/kallistobustools/blob/master/docs/tutorials/kb_getting_started/python/kb_intro_2_python.ipynb

## Filtering cells based on count
Preliminary counts were then used for downstream analysis. Quality control was applied to cells based on three metrics step by step: the total UMI counts, number of detected genes and proportion of mitochondrial gene counts per cell. Specifically, cells with less than 1500 UMI counts and 500 detected genes were filtered, as well as cells with more than 10% mitochondrial gene counts. 

In [None]:
import numpy as np
from scipy.io import mmread

# Load the sparse count matrix written by kb count (rows are cells, columns are genes)
counts_dir = "/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/counts/HRR339729/counts_unfiltered"
matrix = mmread(f"{counts_dir}/cells_x_genes.mtx").tocsr()

# Load the row (barcode) and column (gene ID) labels saved next to the matrix
barcodes = pd.read_csv(f"{counts_dir}/cells_x_genes.barcodes.txt", header=None)[0].to_numpy()
gene_ids = pd.read_csv(f"{counts_dir}/cells_x_genes.genes.txt", header=None)[0].to_numpy()

# Calculate the total UMI counts and number of detected genes per cell
umi_counts = np.asarray(matrix.sum(axis=1)).ravel()
gene_counts = np.asarray((matrix > 0).sum(axis=1)).ravel()

# Load the gene annotation file
genes = pd.read_csv("/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/genes_t2g.txt", sep="\t", header=None, names=["transcript_id", "gene_id", "gene_name"])

# Extract mitochondrial gene IDs. Gene names in this t2g carry a species prefix
# (e.g. "hg19_DDX11L1"), so match "hg19_MT-" as well as a plain "MT-"
mito_ids = genes.loc[genes.gene_name.str.contains(r"^(?:hg19_)?MT-", na=False), "gene_id"].unique()
is_mito = np.isin(gene_ids, mito_ids)

# Calculate the proportion of mitochondrial gene counts per cell (0 for empty barcodes)
mito_counts = np.asarray(matrix[:, is_mito].sum(axis=1)).ravel()
mito_prop = np.divide(mito_counts, umi_counts, out=np.zeros_like(mito_counts, dtype=float), where=umi_counts > 0)

# Filter cells with less than 1500 UMI counts, less than 500 detected genes, or more than 10% mitochondrial gene counts
cells_to_keep = (umi_counts >= 1500) & (gene_counts >= 500) & (mito_prop <= 0.1)
filtered_matrix = matrix[cells_to_keep]
filtered_barcodes = barcodes[cells_to_keep]

# Total UMI counts and number of detected genes for the cells that pass the filter
filtered_umi_counts = umi_counts[cells_to_keep]
filtered_gene_counts = gene_counts[cells_to_keep]

# Validate XML

In [None]:
xml_gz = '/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_sta.xml'

with open(xml_gz, 'rt') as f:
  HRR339729_sta = f.read()

print(HRR339729_sta)

<Run accession="/p300/HRA-Process/temp/HRA001149/HRR339729/" read_length="fixed" spot_count="227781009" base_count="29155969152" 
base_count_bio="29155969152" spot_count_mates="227781009" base_count_bio_mates="29155969152" spot_count_bad="0" base_count_bio_bad="0" spot_count_filtered="0" 
base_count_bio_filtered="0" cmp_base_count="29155969152">
  <Size value="74256608934" units="bytes"/>
  <Bases cs_native="false" count="29155969152">
    <Base value="A" count="8270024956"/>
    <Base value="C" count="6735888857"/>
    <Base value="G" count="6335923237"/>
    <Base value="T" count="7804688716"/>
    <Base value="N" count="9443386"/>
  </Bases>
  <GC-Content="44.83%"/>
  <AlignInfo>
  </AlignInfo>
  <Statistics nreads="2" nspots="227781009">
    <Read index="0" count="227781009" average="28" stdev="0"/>
    <Read index="1" count="227781009" average="100" stdev="0"/>
  </Statistics>
  <QualityCount>
<Quality value="0" count="9443386"/>
<Quality value="4" count="3469229"/>
<Quality value

In [None]:
validate_xml(xml_gz)

XML syntax error at line 12, column 14: error parsing attribute name, line 12, column 14
Error line:   <GC-Content="44.83%"/>


# Fix malformed XML and print XML structure
This XML has a malformed `GC-Content` tag, which we have to fix before we can parse the file and inspect its structure.

* The presence of an invalid token in an XML file could potentially cause issues with downstream RNA-seq analysis. If the invalid token affects the metadata or annotation information contained in the XML file, it may result in incorrect or incomplete information being used in downstream analysis. This could result in incorrect interpretations or conclusions being drawn from the analysis. It is important to ensure that all input files, including XML files, are correctly formatted and free of errors to minimize the risk of such issues occurring.

In [None]:
import os

# Define paths
xml_path = '/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_sta.xml'
output_dir = '/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/'

# Read XML file
with open(xml_path, 'rt') as f:
    xml_string = f.read()

# Give the malformed GC-Content tag an attribute name so that the XML becomes well-formed
new_xml_string = xml_string.replace('<GC-Content="44.83%"/>', '<GC-Content value="44.83%"/>')

# Write the new XML string to a file
output_path = os.path.join(output_dir, 'HRR339729_sta_new.xml')
with open(output_path, 'w') as f:
    f.write(new_xml_string)
    
print(f"New XML file saved to {output_path}")


New XML file saved to /content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_sta_new.xml
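
The replacement above hard-codes the value `44.83%`, so it only works for this one run. If the same malformed tag appears in other patients' statistics files (an assumption, not verified here), a pattern-based fix may be more convenient. The sketch below uses hypothetical helpers `fix_gc_content_tag` and `is_well_formed`, the latter checking the result with the standard library parser.

```python
import re
import xml.etree.ElementTree as ET

# Matches the malformed form <GC-Content="..."/> with any value between the quotes.
_GC_TAG = re.compile(r'<GC-Content="([^"]*)"\s*/>')


def fix_gc_content_tag(xml_string):
    """Rewrite <GC-Content="X"/> as <GC-Content value="X"/>."""
    return _GC_TAG.sub(r'<GC-Content value="\1"/>', xml_string)


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False
```

A possible use is `new_xml_string = fix_gc_content_tag(xml_string)` followed by `is_well_formed(new_xml_string)` before writing the file.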


## Remove potential doublets (two cells captured in one droplet)

This is what the investigators did in the original paper:


*   To remove potential doublets, for PBMC samples, cells with UMI counts above 25,000 and detected genes above 5,000 are filtered out. For other tissues, cells with UMI counts above 70,000 and detected genes above 7,500 are filtered out. Additionally, we applied Scrublet ([Wolock et al., 2019](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6625319/pdf/nihms-1515604.pdf)) to identify potential doublets. The doublet score for each single cell and the threshold based on the bimodal distribution was calculated using default parameters. The expected doublet rate was set to be 0.08, and cells predicted to be doublets or with doubletScore larger than 0.25 were filtered. After quality control, a total of 1,598,708 cells were remained.
*   For now we will use only the PBMC thresholds and apply them to all tissues
*  *We may revisit this later*

In [None]:
import pandas as pd
import xml.etree.ElementTree as ET

# Load the count matrix
counts = pd.read_csv("/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/counts/HRR339729/counts_unfiltered/cells_x_genes.mtx", sep=" ", skiprows=3, header=None)
counts.columns = ["barcode", "gene_id", "UMI"]

# Set the barcode column as the index
counts = counts.pivot(index="barcode", columns="gene_id", values="UMI")

# Load the metadata file from the XML
xml_tree = ET.parse('/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/Data files/HRA001149/HRR339729/HRR339729_sta_new.xml')
root = xml_tree.getroot()
metadata_dict = {}
for child in root.iter():
    metadata_dict[child.tag] = child.text
metadata = pd.DataFrame([metadata_dict])

# Determine the filtering thresholds
umi_threshold = 25000
gene_threshold = 5000

# Compute the UMI counts and number of detected genes for each cell
# (after the pivot, rows are cells and columns are genes)
umi_counts = counts.sum(axis=1)
detected_genes = (counts > 0).sum(axis=1)

# Keep cells whose UMI count and detected-gene count are both at or below the thresholds
mask = (umi_counts <= umi_threshold) & (detected_genes <= gene_threshold)
filtered_counts = counts.loc[mask]

# Print some statistics about the filtering
print("Before filtering:")
print(f"Number of cells: {counts.shape[0]}")
print(f"Max UMI count: {umi_counts.max()}")
print(f"Max detected genes: {detected_genes.max()}")
print("")
print("After filtering:")
print(f"Number of cells: {filtered_counts.shape[0]}")
print(f"Max UMI count: {filtered_counts.sum(axis=1).max()}")
print(f"Max detected genes: {(filtered_counts > 0).sum(axis=1).max()}")


Before filtering:
Number of cells: 1
Max UMI count: 0
Max detected genes: 0

After filtering:
Number of cells: 1
Max UMI count: 0
Max detected genes: 0


In [None]:
import pandas as pd
import xml.etree.ElementTree as ET

# Load the count matrix
counts = pd.read_csv("output/counts/bus_output/output.bus.count.txt", sep="\t", index_col=0)

# Load the metadata file from the XML
xml_tree = ET.parse('/content/drive/MyDrive/Colab_Notebooks/Lung_Mk/HRR339742/HRR339742_sta.xml')
root = xml_tree.getroot()
metadata_dict = {}
for child in root.iter():
    metadata_dict[child.tag] = child.text
metadata = pd.DataFrame([metadata_dict])

# Determine the filtering thresholds
umi_threshold = 25000
gene_threshold = 5000

# Compute the UMI counts and number of detected genes for each cell
umi_counts = counts.sum(axis=0)
detected_genes = (counts > 0).sum(axis=0)

# Filter out cells with UMI counts or detected genes above the thresholds
mask = (umi_counts <= umi_threshold) & (detected_genes <= gene_threshold)
filtered_counts = counts.loc[:, mask]

# Print some statistics about the filtering
print("Before filtering:")
print(f"Number of cells: {counts.shape[1]}")
print(f"Max UMI count: {umi_counts.max()}")
print(f"Max detected genes: {detected_genes.max()}")
print("")
print("After filtering:")
print(f"Number of cells: {filtered_counts.shape[1]}")
print(f"Max UMI count: {filtered_counts.sum(axis=0).max()}")
print(f"Max detected genes: {(filtered_counts > 0).sum(axis=0).max()}")
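
The cells above apply only the count thresholds. The Scrublet step quoted from the paper (expected doublet rate 0.08, score cutoff 0.25) could be combined with them roughly as sketched below. The function names are hypothetical, `run_scrublet` assumes the third-party `scrublet` package is installed and that the matrix is cells by genes, and the logic follows this notebook's reading of the thresholds (a cell is kept only if it is at or below every cutoff), which may differ from the authors' exact procedure.

```python
import numpy as np


def doublet_mask(umi_counts, detected_genes, doublet_scores=None, predicted=None,
                 umi_max=25_000, gene_max=5_000, score_max=0.25):
    """Return a boolean array that is True for the cells to keep."""
    umi_counts = np.asarray(umi_counts)
    detected_genes = np.asarray(detected_genes)
    keep = (umi_counts <= umi_max) & (detected_genes <= gene_max)
    if doublet_scores is not None:
        # Drop cells whose doublet score exceeds the cutoff
        keep &= np.asarray(doublet_scores) <= score_max
    if predicted is not None:
        # Drop cells that the doublet caller flagged directly
        keep &= ~np.asarray(predicted, dtype=bool)
    return keep


def run_scrublet(counts_matrix, expected_doublet_rate=0.08):
    """Score doublets with Scrublet; returns (doublet_scores, predicted_doublets)."""
    import scrublet as scr  # third-party package, assumed to be installed

    scrub = scr.Scrublet(counts_matrix, expected_doublet_rate=expected_doublet_rate)
    return scrub.scrub_doublets()
```

A possible use is `scores, predicted = run_scrublet(matrix)` followed by `keep = doublet_mask(umi_counts, detected_genes, scores, predicted)`.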

## Data visualization

The stepwise quality control metrics used for individual samples were listed in Table S1. The resulting distribution of UMI counts, gene counts as well as mitochondrial gene percentage were shown in Figures S1C–S1E.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the count matrix
counts = pd.read_csv("output/counts/bus_output/output.bus.count.txt", sep="\t", index_col=0)

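# Load the gene annotation from the transcript-to-gene (t2g) file; the run-statistics XML holds no gene names
genes = pd.read_csv("/content/drive/MyDrive/Pate Lab/DNP Folder/Bioinformatics/H17 Lung Mk from COVID-19 Patients/LungMk/genes_t2g.txt", sep="\t", header=None, names=["transcript_id", "gene_id", "gene_name"])

# Identify mitochondrial genes. Gene names in this t2g carry a species prefix (e.g. "hg19_DDX11L1"),
# and the rows of `counts` are assumed to be labelled with gene IDs
mito_ids = genes.loc[genes["gene_name"].str.contains(r"^(?:hg19_)?MT-", na=False), "gene_id"].unique()
mito_genes = counts.index.intersection(mito_ids)
mito_counts = counts.loc[mito_genes].sum(axis=0)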
total_counts = counts.sum(axis=0)
mito_percentage = mito_counts / total_counts * 100

# Compute the UMI counts and number of detected genes for each cell
umi_counts = counts.sum(axis=0)
detected_genes = (counts > 0).sum(axis=0)

# Plot the distribution of UMI counts, detected genes, and mitochondrial gene percentage
fig, axes = plt.subplots(ncols=3, figsize=(15,5))
sns.histplot(umi_counts, ax=axes[0])
sns.histplot(detected_genes, ax=axes[1])
sns.histplot(mito_percentage, ax=axes[2])
axes[0].set_xlabel("UMI counts")
axes[1].set_xlabel("Number of detected genes")
axes[2].set_xlabel("Mitochondrial gene percentage")
plt.show()

# before and after plots

## Normalized UMI counts

This is what the paper did:

*  We normalized the UMI counts with the deconvolution strategy implemented in the  R package scran. Specifically, cell-specific size factors were computed by computeSumFactors function and further used to scale the counts for each cell. Then the logarithmic normalized counts were used for the downstream analysis.
*  We can use Scanpy instead

In [None]:
import scanpy as sc

# Load the count matrix
adata = sc.read_text("output/counts/bus_output/output.bus.count.txt", delimiter="\t").T

# Normalize each cell to the same total count (10,000)
sc.pp.normalize_total(adata, target_sum=1e4)

# Logarithmically transform the normalized counts: log(1 + x)
sc.pp.log1p(adata)

# Standardize each gene to zero mean and unit variance, clipping values at 10.
# This must come after log1p, because scaled values can be negative
sc.pp.scale(adata, max_value=10)

For normalization of UMI counts, Scanpy provides total-count normalization followed by a log transform, a combination commonly used in single-cell RNA-seq analysis. Here, we first load the count matrix using Scanpy's read_text function and transpose it so that rows are cells. We then normalize the data using the normalize_total function, which scales the counts for each cell so that they have the same total count (in this case, 10,000), and logarithmically transform the result using the log1p function. Finally, the scale function standardizes each gene to zero mean and unit variance; it does not apply cell-specific size factors. Total-count normalization is a simpler substitute for the scran deconvolution used in the paper, not an equivalent of it.
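
To make the arithmetic concrete, the sketch below reproduces total-count normalization and the log transform with NumPy on a small dense array. It is illustrative only: `normalize_log1p` is a hypothetical helper, and the size factor here is simply each cell's total divided by the target sum, which is not how scran's pooled deconvolution estimates size factors.

```python
import numpy as np


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Size factor per cell; empty cells get a factor of 1 so they stay at zero
    size_factors = np.where(totals > 0, totals / target_sum, 1.0)
    return np.log1p(counts / size_factors)
```

Two cells with the same relative composition but different sequencing depth, such as `[1, 3]` and `[10, 30]`, end up with identical normalized values, which is the point of the size-factor step.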

