<a href="https://colab.research.google.com/github/DPariser/DataScience/blob/main/QC_and_Pre_Processing_FASTQ_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Single-cell RNA-seq data processing
Information found in publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7857060/

Single-cell sequencing data were aligned and quantified using kallisto/bustools (KB, v0.24.4) ([Bray et al., 2016](https://www.nature.com/articles/nbt.3519)) against the GRCh38 human reference genome downloaded from 10x Genomics official website. Preliminary counts were then used for downstream analysis. Quality control was applied to cells based on three metrics step by step: the total UMI counts, number of detected genes and proportion of mitochondrial gene counts per cell. Specifically, cells with less than 1000 UMI counts and 500 detected genes were filtered, as well as cells with more than 10% mitochondrial gene counts. To remove potential doublets, for PBMC samples, cells with UMI counts above 25,000 and detected genes above 5,000 are filtered out. For other tissues, cells with UMI counts above 70,000 and detected genes above 7,500 are filtered out. Additionally, we applied Scrublet ([Wolock et al., 2019](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6625319/pdf/nihms-1515604.pdf)) to identify potential doublets. The doublet score for each single cell and the threshold based on the bimodal distribution was calculated using default parameters. The expected doublet rate was set to be 0.08, and cells predicted to be doublets or with doubletScore larger than 0.25 were filtered. After quality control, a total of 1,598,708 cells were remained. The stepwise quality control metrics used for individual samples were listed in Table S1. The resulting distribution of UMI counts, gene counts as well as mitochondrial gene percentage were shown in Figures S1C–S1E. We normalized the UMI counts with the deconvolution strategy implemented in the R package scran. Specifically, cell-specific size factors were computed by computeSumFactors function and further used to scale the counts for each cell. Then the logarithmic normalized counts were used for the downstream analysis.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install pandas numpy scikit-learn htseq

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting htseq
  Downloading HTSeq-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.9 MB)
Collecting pysam
  Downloading pysam-0.20.0-cp39-cp39-manylinux_2_24_x86_64.whl (15.6 MB)
Installing collected packages: pysam, htseq
Successfully installed htseq-2.0.2 pysam-0.20.0


In [None]:
%%capture
# `%%capture` is a cell magic and only takes effect as the first line of the cell.
# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally
!pip install matplotlib
!pip install scikit-learn
!pip install numpy
!pip install scipy

In [None]:
# Import packages for analysis and plotting
from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

from scipy.sparse import csr_matrix
matplotlib.rcParams.update({'font.size': 22})
%config InlineBackend.figure_format = 'retina'

In [None]:
%%time
%%capture
# `kb` is a wrapper for the kallisto and bustools programs; the kb-python package bundles both executables.
!pip install kb-python==0.24.1

CPU times: user 47.8 ms, sys: 18.7 ms, total: 66.5 ms
Wall time: 4.85 s


# Unzip Files
Each patient has three files: two gzipped FASTQ files containing the raw sequencing reads and one .xml file. R1 refers to read 1 and R2 to read 2; in later processing steps we need to make sure the two files pair up correctly, record for record. The .xml file holds metadata such as sample identifiers, library preparation protocols, sequencing platforms, and other relevant details.
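
One way to check that the R1 and R2 files pair up is to compare the read IDs in their record headers. The sketch below is a minimal illustration, not part of the original workflow: the helper names `read_ids` and `mates_are_paired` are hypothetical, and it assumes standard four-line FASTQ records whose mate suffix, if any, is `/1` or `/2`.

```python
import gzip
from itertools import islice, zip_longest


def read_ids(path, limit=None):
    """Yield the read ID of each record in a gzipped FASTQ file (hypothetical helper)."""
    with gzip.open(path, "rt") as handle:
        headers = islice(handle, 0, None, 4)  # every 4th line is a record header
        for header in islice(headers, limit):
            fields = header[1:].split()  # drop the leading '@' and any description
            read_id = fields[0] if fields else ""
            # Strip a trailing /1 or /2 mate suffix if present
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id


def mates_are_paired(r1_path, r2_path, limit=10000):
    """True if the first `limit` records of both files carry the same IDs in the same order."""
    missing = object()  # marks the case where one file is shorter than the other
    pairs = zip_longest(read_ids(r1_path, limit), read_ids(r2_path, limit), fillvalue=missing)
    return all(a == b for a, b in pairs)
```

Only the first `limit` records are compared, so the check streams a small part of each file instead of loading it into memory.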

In [None]:
import gzip
import xml.etree.ElementTree as ET

# Paths to the input files
fastq1_gz = '/content/drive/MyDrive/Colab_Notebooks/Lung_Mk/HRR339742/HRR339742_f1.fastq.gz'
fastq2_gz = '/content/drive/MyDrive/Colab_Notebooks/Lung_Mk/HRR339742/HRR339742_r2.fastq.gz'
xml_path = '/content/drive/MyDrive/Colab_Notebooks/Lung_Mk/HRR339742/HRR339742_sta.xml'  # plain XML, so it must not be opened with gzip

def read_fastq_head(path, n_records=1000):
    """Return the lines of the first n_records FASTQ records (4 lines each).

    Reading a whole FASTQ file with f.read() can exhaust the available memory."""
    with gzip.open(path, 'rt') as f:
        return [line.rstrip('\n') for _, line in zip(range(4 * n_records), f)]

# Load the first records of each FASTQ file
fastq1_data = read_fastq_head(fastq1_gz)
fastq2_data = read_fastq_head(fastq2_gz)

# Parse the XML metadata
root = ET.parse(xml_path).getroot()

def last_attrib(tag):
    """Return the attributes of the last <tag> element, or None if the tag is absent."""
    elems = list(root.iter(tag))
    return elems[-1].attrib if elems else None

# Extract relevant data from the XML
analysis = last_attrib('Analysis')
run = last_attrib('Run')
sample = last_attrib('Sample')
library = last_attrib('Library')
statistics = last_attrib('Statistics')


# Quality Control
## Load Long Ranger and GRCh38 Human Genome Data

1.   Single-cell sequencing data were aligned and quantified using kallisto/bustools (KB, v0.24.4) (Bray et al., 2016) against the GRCh38 human reference genome downloaded from 10x Genomics official website.

Install instructions can be found here: https://support.10xgenomics.com/genome-exome/software/pipelines/latest/installation

Download page for the GRCh38 reference: https://support.10xgenomics.com/genome-exome/software/downloads/latest?

GRCh38 is the human reference assembly released by the Genome Reference Consortium in 2013, and it is the standard reference for most current human sequencing pipelines. Compared with GRCh37, it corrects sequence errors, closes gaps, and adds alternate loci and modeled centromere sequences, together with updated gene annotations. Large-scale projects such as recent releases of the 1000 Genomes Project and the Genotype-Tissue Expression (GTEx) project provide data aligned to it. The original Human Genome Project predates GRCh38, which is a later refinement of the assembly that project produced.


In [None]:
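# `!cd` runs in a throwaway subshell and has no lasting effect; `%cd` changes the notebook's working directory
%cd /opt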

In [None]:
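# Download the Long Ranger archive
# Long Ranger - 2.2.2 (March 26, 2018)
# The signed URL below carries an expiry timestamp; a 403 response means a fresh link must be copied from the 10x Genomics download page.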
!wget -O longranger-2.2.2.tar.gz "https://cf.10xgenomics.com/releases/genome/longranger-2.2.2.tar.gz?Expires=1678942097&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvZ2Vub21lL2xvbmdyYW5nZXItMi4yLjIudGFyLmd6IiwiQ29uZGl0aW9uIjp7IkRhdGVMZXNzVGhhbiI6eyJBV1M6RXBvY2hUaW1lIjoxNjc4OTQyMDk3fX19XX0_&Signature=GHLpJcQ6WIza~wstIxoVHGaEvVAPZfuCP~VbmRb6PuZzqcNMfNeViiKQfx~JpqNpEXKv-eUyDpkyapH5~eWOVDQ09irzJjwNb1JATeo-FWwGBOOVR1ps2A-eVWkDPbbHbkdi2snHKGawL1ZogGm-DRkCqqGfTiGdAh7sXYHGb-3v4eWDtKgiG6icf202HvQSM8oSnZdQftvwp20EkY0Np5M6VH16-dL3RKN0zVqn3scTRFW4gdGJwyQQep1Y8IdNVrgaxEzAM2WkWutJ0zKgTwMW9ODS1dnSQGzOaYY5NF9OWbAwE36gSffBi~Y-CJO058KmFpnrwbNkbP0ztg8HsA__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"

--2023-03-20 15:08:04--  https://cf.10xgenomics.com/releases/genome/longranger-2.2.2.tar.gz?Expires=1678942097&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvZ2Vub21lL2xvbmdyYW5nZXItMi4yLjIudGFyLmd6IiwiQ29uZGl0aW9uIjp7IkRhdGVMZXNzVGhhbiI6eyJBV1M6RXBvY2hUaW1lIjoxNjc4OTQyMDk3fX19XX0_&Signature=GHLpJcQ6WIza~wstIxoVHGaEvVAPZfuCP~VbmRb6PuZzqcNMfNeViiKQfx~JpqNpEXKv-eUyDpkyapH5~eWOVDQ09irzJjwNb1JATeo-FWwGBOOVR1ps2A-eVWkDPbbHbkdi2snHKGawL1ZogGm-DRkCqqGfTiGdAh7sXYHGb-3v4eWDtKgiG6icf202HvQSM8oSnZdQftvwp20EkY0Np5M6VH16-dL3RKN0zVqn3scTRFW4gdGJwyQQep1Y8IdNVrgaxEzAM2WkWutJ0zKgTwMW9ODS1dnSQGzOaYY5NF9OWbAwE36gSffBi~Y-CJO058KmFpnrwbNkbP0ztg8HsA__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 104.18.0.173, 104.18.1.173, 2606:4700::6812:1ad, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|104.18.0.173|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-03-20 15:08:04 ERROR 403: Forbidden.

In [None]:
# Unpack the Long Ranger archive (this fails if the download above did not complete)
# Long Ranger - 2.2.2 (March 26, 2018)
!tar -xzvf longranger-2.2.2.tar.gz


gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now


In [None]:
# Download the reference data file
# GRCh38 Reference - 2.1.0 (Sep 15, 2016)
!wget https://cf.10xgenomics.com/supp/genome/refdata-GRCh38-2.1.0.tar.gz

--2023-03-20 15:08:04--  https://cf.10xgenomics.com/supp/genome/refdata-GRCh38-2.1.0.tar.gz
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 104.18.0.173, 104.18.1.173, 2606:4700::6812:1ad, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|104.18.0.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5187997538 (4.8G) [application/x-tar]
Saving to: ‘refdata-GRCh38-2.1.0.tar.gz’


2023-03-20 15:11:30 (24.2 MB/s) - ‘refdata-GRCh38-2.1.0.tar.gz’ saved [5187997538/5187997538]



In [None]:
# Unpack the reference data file
# GRCh38 Reference - 2.1.0 (Sep 15, 2016)
!tar -xzvf refdata-GRCh38-2.1.0.tar.gz

refdata-GRCh38-2.1.0/
refdata-GRCh38-2.1.0/version
refdata-GRCh38-2.1.0/README.BEFORE.MODIFYING
refdata-GRCh38-2.1.0/fasta/
refdata-GRCh38-2.1.0/fasta/genome.dict
refdata-GRCh38-2.1.0/fasta/genome.fa
refdata-GRCh38-2.1.0/fasta/genome.fa.amb
refdata-GRCh38-2.1.0/fasta/genome.fa.ann
refdata-GRCh38-2.1.0/fasta/genome.fa.bwt
refdata-GRCh38-2.1.0/fasta/genome.fa.fai
refdata-GRCh38-2.1.0/fasta/genome.fa.flat
refdata-GRCh38-2.1.0/fasta/genome.fa.gdx
refdata-GRCh38-2.1.0/fasta/genome.fa.pac
refdata-GRCh38-2.1.0/fasta/genome.fa.sa
refdata-GRCh38-2.1.0/fasta/primary_contigs.txt
refdata-GRCh38-2.1.0/fasta/sex_chromosomes.tsv
refdata-GRCh38-2.1.0/genes/
refdata-GRCh38-2.1.0/genes/gene_annotations.gtf.gz
refdata-GRCh38-2.1.0/genome
refdata-GRCh38-2.1.0/regions/
refdata-GRCh38-2.1.0/regions/centromeres.tsv
refdata-GRCh38-2.1.0/snps/


In [None]:
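# Prepend the Long Ranger directory to PATH so that the longranger commands can be invoked.
# `!export` only changes a throwaway subshell, so the notebook's environment is updated from Python instead.
import os
os.environ["PATH"] = "/opt/longranger-2.2.2:" + os.environ["PATH"]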

In [None]:
# Site check
# Next, please run the bundled site check script and send the output to 10x. We will review the information to ensure that Long Ranger will run smoothly once you have generated your own Chromium data. Assuming you have installed and entered the 10x environment as described above, please run the following commands:
!longranger sitecheck > sitecheck.txt
!longranger upload dpariser@mit.edu sitecheck.txt

/bin/bash: longranger: command not found
/bin/bash: longranger: command not found


In [None]:
# Verify Installation
# To ensure that the longranger pipeline is installed correctly, use longranger testrun. This test can take up to 60 minutes on a sixteen-core workstation. Assuming you have installed Long Ranger into /opt, the command to run the test would look like:
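# `!export` only changes a throwaway subshell, so PATH is updated from Python instead.
import os
os.environ["PATH"] = "/opt/longranger-2.2.2:" + os.environ["PATH"]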
!longranger testrun --id=tiny

/bin/bash: longranger: command not found


In [None]:
!longranger upload dpariser@mit.edu tiny/tiny.mri.tgz

/bin/bash: longranger: command not found


## Aligning and quantifying using kallisto/bustools

Kallisto is a program for quantifying transcript abundances from RNA-Seq data. It relies on pseudoalignment: instead of aligning each read base by base, it determines which transcripts a read is compatible with, which makes quantification fast while remaining accurate. Kallisto builds an index from the reference transcriptome, pseudoaligns the reads against it, and reports transcript-level abundances that can be summarized to the gene level.

Bustools is a set of tools for working with the BUS files generated by kallisto. Each BUS record stores a cell barcode, a UMI, and the set of transcripts (equivalence class) that the read is compatible with. Bustools can correct barcodes against a whitelist, sort the records, count the unique molecular identifiers (UMIs) associated with each gene or transcript per barcode, and perform other downstream operations.

We will need the GRCh38 human reference here so that our sequencing reads can be aligned to it.
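
To make the correct-and-count idea concrete, here is a toy sketch in plain Python. It is not how bustools is implemented: the function names are hypothetical, each record is simplified to a `(barcode, umi, gene)` triple instead of a real BUS record with an equivalence class, and barcode correction is assumed to accept a single mismatch against the whitelist.

```python
from collections import defaultdict


def correct_barcode(barcode, whitelist):
    """Return the barcode if whitelisted, else the unique whitelist entry one mismatch away, else None."""
    if barcode in whitelist:
        return barcode
    hits = [w for w in whitelist
            if len(w) == len(barcode) and sum(a != b for a, b in zip(w, barcode)) == 1]
    return hits[0] if len(hits) == 1 else None


def count_umis(records, whitelist):
    """Collapse (barcode, umi, gene) records into {barcode: {gene: number of unique UMIs}}."""
    umis_seen = defaultdict(set)
    for barcode, umi, gene in records:
        fixed = correct_barcode(barcode, whitelist)
        if fixed is not None:  # records with uncorrectable barcodes are dropped
            umis_seen[(fixed, gene)].add(umi)  # repeated UMIs collapse inside the set
    counts = defaultdict(dict)
    for (barcode, gene), umis in sorted(umis_seen.items()):
        counts[barcode][gene] = len(umis)
    return dict(counts)


records = [
    ("AAAA", "U1", "GeneA"),
    ("AAAA", "U1", "GeneA"),  # PCR duplicate: same barcode, UMI and gene
    ("AAAA", "U2", "GeneA"),
    ("AAAT", "U3", "GeneB"),  # one mismatch away from AAAA
    ("CCCC", "U1", "GeneA"),
    ("GGGG", "U9", "GeneA"),  # not correctable, dropped
]
print(count_umis(records, {"AAAA", "CCCC"}))
```

The resulting nested dictionary is the cell-by-gene UMI count matrix in sparse form, which is the kind of table the filtering steps later in this notebook operate on.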

In [None]:
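# Check kb-python version
# `kb` has no `version` subcommand; running `kb` without a command prints the usage text, which includes the version
!kb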

usage: kb [-h] [--list] <CMD> ...

kb_python 0.24.1

positional arguments:
  <CMD>
    info      Display package and citation information
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files

optional arguments:
  -h, --help  Show this help message and exit
  --list      Display list of supported single-cell technologies


Kallisto information


*   https://github.com/pachterlab/kallisto
*   http://pachterlab.github.io/kallisto/manual.html
*   https://colab.research.google.com/github/pachterlab/kallistobustools/blob/master/docs/tutorials/kb_quality_control/python/kb_intro_1_python.ipynb#scrollTo=x79Inh3LnnMj



In [None]:
# Create the index file for Kallisto
# -i, -g and -f1 are OUTPUT paths (index, transcript-to-gene map, extracted cDNA FASTA);
# the two positional arguments are the INPUT genome FASTA and GTF annotation.
# `cdna.fa` is a placeholder output name chosen here; passing genome.fa to -f1 would overwrite the genome.
!kb ref \
  -i index.idx \
  -g transcript-to-gene.tsv \
  -f1 cdna.fa \
  refdata-GRCh38-2.1.0/fasta/genome.fa \
  refdata-GRCh38-2.1.0/genes/gene_annotations.gtf.gz

[2023-03-21 00:15:41,273]    INFO Creating transcript-to-gene mapping at transcript-to-gene.tsv
[2023-03-21 00:16:49,048]    INFO Sorting refdata-GRCh38-2.1.0/fasta/genome.fa
[2023-03-21 00:24:41,315]    INFO Sorting refdata-GRCh38-2.1.0/genes/gene_annotations.gtf.gz


In [None]:
# download the GRCh38 Index file
from google.colab import files
files.download('index.idx')

In [None]:
# !wget https://github.com/pachterlab/kallisto/releases/download/v0.46.2/kallisto_linux-v0.46.2.tar.gz
# !tar -zxvf kallisto_linux-v0.46.2.tar.gz
# !mv kallisto /usr/local/bin/

In [None]:
# !git clone https://github.com/BUStools/bustools
# !cd bustools && ./compile.sh
# !mv bustools /usr/local/bin/

In [None]:
#!sudo apt-get install kallisto

In [None]:
#!git clone https://github.com/BUStools/bustools.git
#!cd bustools && ./compile.sh

In [None]:
#!wget -c https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
#!chmod +x Anaconda3-2021.11-Linux-x86_64.sh
#!bash ./Anaconda3-2021.11-Linux-x86_64.sh -b -f -p /usr/local

In [None]:
#!conda install -c bioconda bustools -y

https://github.com/BUStools/bustools/releases

In [None]:
# Pseudoalign reads to the reference using kallisto
!kallisto bus -i /content/refdata-gex-GRCh38-2020-A/kallisto_index.idx -o output/ -x 10xv3 -t 2 /content/drive/MyDrive/Colab_Notebooks/Lung_Mk/HRR339742_f1.fastq /content/drive/MyDrive/Colab_Notebooks/Lung_Mk/HRR339742_f2.fastq

# Correct and sort the barcodes and UMIs, then generate a matrix of UMI counts per gene for each barcode in the sorted BUS file
!bustools correct -w /content/refdata-gex-GRCh38-2020-A/10xv3_whitelist.txt -p output/output.bus | bust


Found existing installation: kallisto 1.0.9
Uninstalling kallisto-1.0.9:
  Would remove:
    /usr/local/lib/python3.9/dist-packages/kallisto-1.0.9.dist-info/*
    /usr/local/lib/python3.9/dist-packages/kallisto/*
Proceed (Y/n)? Y
  Successfully uninstalled kallisto-1.0.9
Found existing installation: bustools 0.1.0.dev2
Uninstalling bustools-0.1.0.dev2:
  Would remove:
    /usr/local/lib/python3.9/dist-packages/bustools-0.1.0.dev2.dist-info/*
    /usr/local/lib/python3.9/dist-packages/bustools/*
Proceed (Y/n)? Y
  Successfully uninstalled bustools-0.1.0.dev2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kallisto
  Using cached kallisto-1.0.9-py3-none-any.whl (104 kB)
Installing collected packages: kallisto
ERROR: Could not install packages due to an OSError: [Errno 21] Is a directory: '/usr/local/bin/kallisto'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simpl

## Filtering cells based on count
Preliminary counts are then used for downstream analysis. Quality control is applied to cells step by step based on three metrics: the total UMI count, the number of detected genes, and the proportion of mitochondrial gene counts per cell. Specifically, cells with fewer than 1,500 UMI counts or fewer than 500 detected genes are filtered out, as are cells with more than 10% mitochondrial gene counts. Note that the 1,500-UMI cutoff used here is stricter than the 1,000-UMI cutoff quoted from the paper above.

In [None]:
import pandas as pd

# Load the output from bustools count.
# Assumed layout: rows are genes (indexed by gene symbol), columns are cell barcodes, with a header row.
matrix = pd.read_csv("output/counts/bus_output/output.bus.count.txt", sep="\t", index_col=0)

# Calculate the total UMI counts and number of detected genes per cell (one value per column)
umi_counts = matrix.sum(axis=0)
gene_counts = (matrix > 0).sum(axis=0)

# Identify mitochondrial genes by their "MT-" symbol prefix (human gene symbols)
mito_genes = matrix.index[matrix.index.astype(str).str.startswith("MT-")]

# Calculate the proportion of mitochondrial gene counts per cell
mito_counts = matrix.loc[mito_genes].sum(axis=0)
mito_prop = mito_counts / umi_counts

# Filter cells with less than 1500 UMI counts, less than 500 detected genes, or more than 10% mitochondrial gene counts
cells_to_keep = (umi_counts >= 1500) & (gene_counts >= 500) & (mito_prop <= 0.1)
filtered_matrix = matrix.loc[:, cells_to_keep]

# Calculate total UMI counts and number of detected genes for the filtered matrix
filtered_umi_counts = filtered_matrix.sum(axis=0)
filtered_gene_counts = (filtered_matrix > 0).sum(axis=0)


## Remove potential doublets (two cells captured in one droplet)

This is what the investigators did in the original paper:


*   To remove potential doublets, for PBMC samples, cells with UMI counts above 25,000 and detected genes above 5,000 are filtered out. For other tissues, cells with UMI counts above 70,000 and detected genes above 7,500 are filtered out. Additionally, we applied Scrublet (Wolock et al., 2019) to identify potential doublets. The doublet score for each single cell and the threshold based on the bimodal distribution was calculated using default parameters. The expected doublet rate was set to be 0.08, and cells predicted to be doublets or with doubletScore larger than 0.25 were filtered. After quality control, a total of 1,598,708 cells were remained.
*   For now, we will use only the PBMC thresholds and apply them to all tissues.
*  *We may revisit this later*
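
The rules quoted above can be combined into a single per-cell decision. The sketch below is an illustration under stated assumptions rather than the paper's code: the names `UPPER_LIMITS` and `is_suspected_doublet` are hypothetical, the doublet score and predicted-doublet flag are assumed to come from a separate Scrublet run, and a cell is flagged when either upper limit is exceeded (the same reading the filtering cell in this notebook uses).

```python
# Upper limits quoted from the paper: (max UMI count, max detected genes)
UPPER_LIMITS = {"PBMC": (25_000, 5_000), "other": (70_000, 7_500)}


def is_suspected_doublet(umi_count, n_genes, doublet_score, predicted_doublet,
                         tissue="PBMC", score_cutoff=0.25):
    """Return True if a cell should be removed as a potential doublet.

    doublet_score and predicted_doublet are assumed to be Scrublet outputs for this cell.
    """
    max_umi, max_genes = UPPER_LIMITS.get(tissue, UPPER_LIMITS["other"])
    too_large = umi_count > max_umi or n_genes > max_genes
    return too_large or predicted_doublet or doublet_score > score_cutoff


# A PBMC cell with 30,000 UMIs is flagged; the same cell from another tissue is not.
print(is_suspected_doublet(30_000, 3_000, 0.10, False, tissue="PBMC"))
print(is_suspected_doublet(30_000, 3_000, 0.10, False, tissue="lung"))
```

Any tissue name other than `"PBMC"` falls back to the `"other"` limits, which mirrors the paper's two-tier thresholds.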



In [None]:
import pandas as pd
import xml.etree.ElementTree as ET

# Load the count matrix (tab-separated, as in the other cells)
counts = pd.read_csv("output/counts/bus_output/output.bus.count.txt", sep="\t", index_col=0)

# Load the metadata file from the XML
xml_tree = ET.parse('/content/drive/MyDrive/Colab_Notebooks/Lung_Mk/HRR339742/HRR339742_sta.xml')
root = xml_tree.getroot()
metadata_dict = {}
for child in root.iter():
    metadata_dict[child.tag] = child.text
metadata = pd.DataFrame([metadata_dict])

# Determine the filtering thresholds
umi_threshold = 25000
gene_threshold = 5000

# Compute the UMI counts and number of detected genes for each cell
umi_counts = counts.sum(axis=0)
detected_genes = (counts > 0).sum(axis=0)

# Filter out cells with UMI counts or detected genes above the thresholds
mask = (umi_counts <= umi_threshold) & (detected_genes <= gene_threshold)
filtered_counts = counts.loc[:, mask]

# Print some statistics about the filtering
print("Before filtering:")
print(f"Number of cells: {counts.shape[1]}")
print(f"Max UMI count: {umi_counts.max()}")
print(f"Max detected genes: {detected_genes.max()}")
print("")
print("After filtering:")
print(f"Number of cells: {filtered_counts.shape[1]}")
print(f"Max UMI count: {filtered_counts.sum(axis=0).max()}")
print(f"Max detected genes: {(filtered_counts > 0).sum(axis=0).max()}")

## Data visualization

The stepwise quality control metrics used for individual samples were listed in Table S1. The resulting distribution of UMI counts, gene counts as well as mitochondrial gene percentage were shown in Figures S1C–S1E.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the count matrix
counts = pd.read_csv("output/counts/bus_output/output.bus.count.txt", sep="\t", index_col=0)

# Identify mitochondrial genes by their "MT-" symbol prefix.
# This assumes the rows of the count matrix are indexed by gene symbol; the metadata XML holds no gene annotation.
mito_genes = counts.index[counts.index.astype(str).str.startswith("MT-")]
mito_counts = counts.loc[mito_genes].sum(axis=0)
total_counts = counts.sum(axis=0)
mito_percentage = mito_counts / total_counts * 100

# Compute the UMI counts and number of detected genes for each cell
umi_counts = counts.sum(axis=0)
detected_genes = (counts > 0).sum(axis=0)

# Plot the distribution of UMI counts, detected genes, and mitochondrial gene percentage
fig, axes = plt.subplots(ncols=3, figsize=(15,5))
sns.histplot(umi_counts, ax=axes[0])
sns.histplot(detected_genes, ax=axes[1])
sns.histplot(mito_percentage, ax=axes[2])
axes[0].set_xlabel("UMI counts")
axes[1].set_xlabel("Number of detected genes")
axes[2].set_xlabel("Mitochondrial gene percentage")
plt.show()

# before and after plots

## Normalized UMI counts

This is what the paper did:

*  We normalized the UMI counts with the deconvolution strategy implemented in the  R package scran. Specifically, cell-specific size factors were computed by computeSumFactors function and further used to scale the counts for each cell. Then the logarithmic normalized counts were used for the downstream analysis.
*  We can use Scanpy instead
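
The size-factor idea can be sketched on a toy matrix. This is a simplified stand-in, not scran's method: `computeSumFactors` pools cells and deconvolves the pooled factors, whereas the hypothetical `library_size_factors` below uses each cell's total count. The log base 2 and the pseudocount of 1 are also assumptions.

```python
import math


def library_size_factors(counts):
    """counts: one list of gene counts per cell. Factors are cell totals scaled to average 1.

    Assumes no cell has a total of zero (such cells are removed by quality control).
    """
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def log_normalize(counts, size_factors):
    """Divide each cell's counts by its size factor, then take log2(x + 1)."""
    return [[math.log2(c / sf + 1) for c in cell]
            for cell, sf in zip(counts, size_factors)]


counts = [[10, 0, 30],   # cell 1: 40 UMIs in total
          [5, 5, 10]]    # cell 2: 20 UMIs in total
factors = library_size_factors(counts)
normalized = log_normalize(counts, factors)
print(factors)
print(normalized)
```

After scaling, the first gene has the same value in both cells even though cell 1 was sequenced twice as deeply, which is the effect the size factors are meant to achieve.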

In [None]:
import scanpy as sc

# Load the count matrix and transpose it so that rows are cells and columns are genes
adata = sc.read_text("output/counts/bus_output/output.bus.count.txt", delimiter="\t").T

# Normalize the data using Total Count Normalization (TCN)
sc.pp.normalize_total(adata, target_sum=1e4)

# Logarithmically transform the data
sc.pp.log1p(adata)

# Standardize each gene to zero mean and unit variance, clipping values at 10.
# This must run after log1p: scaling produces negative values, and their logarithm can be undefined.
sc.pp.scale(adata, max_value=10)

Scanpy provides several normalization utilities that are commonly used in single-cell RNA-seq analysis. Here, we first load the count matrix using Scanpy's read_text function. We then call normalize_total, which scales the counts of each cell so that every cell has the same total count (in this case, 10,000), and log-transform the result with log1p. Finally, the scale function standardizes each gene to zero mean and unit variance; it is a gene-wise transformation rather than a cell-specific size-factor correction, and it has to run after log1p because the logarithm can be undefined for the negative values that scaling produces. This total-count approach is a simpler substitute for scran's deconvolution, not an equivalent.

