# Run-through of Kallisto Single-cell RNA-Seq

First, let's look at our sample data - a mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells [sequenced using the Chromium v2 Chemistry](https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/hgmm_1k). 

Here are some relevant facts about the dataset:

- 1:1 mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells. This is a classic human-mouse mixture experiment to demonstrate single-cell behavior.
- 1,017 cells detected
- Sequenced on an Illumina HiSeq 4000 with approximately 61,000 reads per cell
- 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode

The data are located in the following directories on our tutorial image.

FASTQ files:

In [None]:
ls $HOME/tutorial-data/kallisto-single-cell/data/fastq_files/hgmm_1k_v2_fastqs 

These are paired-end reads (R1, R2) from two lanes (L001, L002) of the experiment. The remaining files (the index reads) hold the sample indexes used on lanes 1 and 2, respectively.

For this experiment, we also have two transcriptomes, one mouse, one human:

In [None]:
ls $HOME/tutorial-data/kallisto-single-cell/data/transcriptomes

## Quality control of FastQ data

As done previously, we will use fastp to take a look at the files containing our cell sequence data (we won't use the index files). To be safe, we will disable adapter trimming with the `-A` option; remember that fastp has several quality-control options enabled by default ([see the fastp manual](https://github.com/OpenGene/fastp#simple-usage)). **Note:** since we are dealing with barcodes, where we expect a minimum length, we use the `--length_required` option so that reads shorter than this length are discarded.

In [None]:
cd $HOME/tutorial-data/kallisto-single-cell/data/fastq_files/hgmm_1k_v2_fastqs 
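# Run fastp on every R1/R2 pair; -A disables adapter trimming
for r1infile in *R1_*fastq.gz; do
  # Derive the mate file name by swapping the _R1_ token for _R2_
  r2infile=$(echo "$r1infile" | sed -e "s/_R1_/_R2_/")
  r1outfile="fastp_${r1infile}"
  r2outfile="fastp_${r2infile}"
  # Field 5 of the underscore-separated file name is the lane (e.g. L001)
  reportname=$(echo "$r1infile" | cut -f5 -d _).fastp-report
  echo "Processing pair $r1infile,$r2infile"
  # -j names the JSON report per lane so it is not overwritten by the next pair
  fastp -A --thread=4 --length_required=28 \
    -h "${reportname}.html" -j "${reportname}.json" \
    -i "$r1infile" -o "$r1outfile" -I "$r2infile" -O "$r2outfile"
done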
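
The file-name handling in the loop can be checked without running fastp. The sketch below is a dry run on a single example name; it assumes the input files follow the `hgmm_1k_v2_S1_L001_R1_001.fastq.gz` pattern seen in the combined file names later in this notebook. The `lane` variable is introduced here for illustration only.

```shell
# Dry run of the name derivation used in the fastp loop (fastp is not called).
r1infile="hgmm_1k_v2_S1_L001_R1_001.fastq.gz"   # example name, assumed pattern

# Swap only the "_R1_" token so other parts of the name are left untouched.
r2infile=$(echo "$r1infile" | sed -e 's/_R1_/_R2_/')

# Field 5 of the underscore-separated name is the lane (L001, L002, ...).
lane=$(echo "$r1infile" | cut -f5 -d _)
reportname="${lane}.fastp-report.html"

echo "R1:     $r1infile"
echo "R2:     $r2infile"
echo "report: $reportname"
```

Anchoring the substitution on `_R1_` rather than `R1` avoids accidental matches if a sample name ever contains the characters `R1`.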

We will move the outputs of the fastp analysis to appropriate locations on our computer...

In [None]:
mv fastp_hgmm* $HOME/tutorial-data/kallisto-single-cell/analyses/fastp/
mv *.json $HOME/tutorial-data/kallisto-single-cell/analyses/fastp/
mv *.html $HOME/tutorial-data/kallisto-single-cell/analyses/fastp/

In [None]:
ls $HOME/tutorial-data/kallisto-single-cell/analyses/fastp

You can view the HTML reports in your Jupyter notebook file browser.

## Clean data using umi-tools

[umi-tools](https://github.com/CGATOxford/UMI-tools) is one way to handle an additional requirement of single-cell RNA-Seq: cell barcodes and UMIs may need to be filtered and/or demultiplexed for proper analysis. In these steps we will try to remove reads that are likely empty (e.g. barcode-only amplification) and deal with sequencing errors. We will again work with pre-processed FASTQ files that have already been prepared:

In [None]:
cd $HOME/tutorial-data/kallisto-single-cell/analyses/fastp/pre-processed
ls

Since we have two lanes of data, we will combine them into single R1 and R2 files:

In [None]:
cat fastp_hgmm_1k_v2_S1_L001_R1_001.fastq.gz fastp_hgmm_1k_v2_S1_L002_R1_001.fastq.gz > fastp_hgmm_1k_v2_S1_R1_001.fastq.gz
cat fastp_hgmm_1k_v2_S1_L001_R2_001.fastq.gz fastp_hgmm_1k_v2_S1_L002_R2_001.fastq.gz > fastp_hgmm_1k_v2_S1_R2_001.fastq.gz 
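
Concatenating compressed files with `cat` works because a gzip file may consist of several compressed members, which are decompressed one after another. The sketch below demonstrates this on two tiny stand-in files; the file names and reads are made up for illustration.

```shell
# Show that concatenated gzip files decompress to the concatenated text.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/lane1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/lane2.fastq.gz"

# Same pattern as the cell above: plain cat of the compressed files.
cat "$workdir/lane1.fastq.gz" "$workdir/lane2.fastq.gz" > "$workdir/combined.fastq.gz"

# gzip -t checks integrity; gzip -dc decompresses to stdout (like zcat).
gzip -t "$workdir/combined.fastq.gz" && echo "combined file is valid gzip"
combined_lines=$(gzip -dc "$workdir/combined.fastq.gz" | wc -l | tr -d ' ')
echo "lines in combined file: $combined_lines"
```

The R1 and R2 files must be concatenated in the same lane order so that read pairs stay in sync.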

In [None]:
ls 

We can preview the first two reads (eight lines) in our data:

In [None]:
zcat fastp_hgmm_1k_v2_S1_R1_001.fastq.gz | head -8

According to the [Chromium site](https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/hgmm_1k) we expect:

- 28bp read1 (16bp Chromium barcode and 12bp UMI), 91bp read2 (transcript), and 8bp I7 sample barcode

This is close to what we see in the preview:

In [None]:
NCACCTACATGGTCAT|GCATACGCCTTT|
|--Cell barcode-|----UMI-----|
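
The layout in the diagram can be reproduced on the command line. This is a minimal sketch that assumes the 16 bp barcode plus 12 bp UMI layout shown above and uses the same example sequence; the variable names are illustrative.

```shell
# Split a 28 bp read 1 into cell barcode (bases 1-16) and UMI (bases 17-28).
r1_seq="NCACCTACATGGTCATGCATACGCCTTT"   # sequence from the preview above

barcode=$(printf '%s\n' "$r1_seq" | cut -c1-16)
umi=$(printf '%s\n' "$r1_seq" | cut -c17-28)

echo "cell barcode: $barcode"
echo "UMI:          $umi"
```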

Next we will use the `whitelist` method to take the top X most abundant barcodes. X can be estimated automatically from the data; here we set it explicitly to 1,000 with `--set-cell-number`.

In [None]:
umi_tools whitelist --stdin fastp_hgmm_1k_v2_S1_R1_001.fastq.gz \
 --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN \
 --set-cell-number=1000 \
 --log2stderr > umi-tools_whitelist.txt
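
The idea of "taking the top X most abundant barcodes" can be illustrated with standard shell tools. The sketch below is not how `umi_tools whitelist` is implemented (it does not estimate a cutoff or correct errors); it only ranks barcodes by read count. The toy sequences use a shortened 4 bp "barcode" and 2 bp "UMI" and are made up for illustration.

```shell
# Toy read-1 sequences: 4 bp "barcode" + 2 bp "UMI" (shortened for illustration).
workdir=$(mktemp -d)
printf '%s\n' AAAACC AAAAGT AAAATT CCCCAA CCCCGG GGGGTA > "$workdir/r1_seqs.txt"

# Count reads per barcode, sort by count (descending), keep the top 2,
# and write them as "barcode<TAB>count".
cut -c1-4 "$workdir/r1_seqs.txt" \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2,2 \
  | head -n 2 \
  | awk '{print $2 "\t" $1}' > "$workdir/top_barcodes.tsv"

cat "$workdir/top_barcodes.tsv"
```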

The output of the whitelist command is a table containing the accepted cell barcodes (CBs). It has four columns:

1. The accepted CB.
2. A comma-separated list of other CBs within the allowed edit distance of the CB in column 1 and more than 1 edit away from any other accepted CB.
3. The abundance (read or UMI count) of the accepted CB.
4. A comma-separated list of abundances for the CBs in column 2.
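
A table in this four-column layout is easy to summarise with `awk`. The sketch below uses two made-up rows and assumes the columns are tab-separated, with columns 2 and 4 left empty when a barcode has no neighbours; check this against your own `umi-tools_whitelist.txt` before relying on it.

```shell
workdir=$(mktemp -d)
# Two made-up rows in the four-column layout described above (tab-separated).
printf 'AAAACCCCGGGGTTTT\tAAAACCCCGGGGTTTA,AAAACCCCGGGGTTTC\t1000\t5,3\n' >  "$workdir/whitelist.txt"
printf 'CCCCAAAATTTTGGGG\t\t800\t\n'                                       >> "$workdir/whitelist.txt"

# Number of accepted cell barcodes = number of rows.
n_accepted=$(wc -l < "$workdir/whitelist.txt" | tr -d ' ')

# Total abundance of the accepted barcodes (column 3).
total_abundance=$(awk -F '\t' '{sum += $3} END {print sum}' "$workdir/whitelist.txt")

# Number of neighbouring barcodes listed in column 2 across all rows.
n_neighbours=$(awk -F '\t' '$2 != "" {n += split($2, a, ",")} END {print n + 0}' "$workdir/whitelist.txt")

echo "accepted: $n_accepted, abundance: $total_abundance, neighbours: $n_neighbours"
```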


In [None]:
head -n1 ./umi-tools_whitelist.txt

In the next step, umi-tools will extract the cell barcode and UMI from read 1, add them to the read name of read 2, and filter out reads that do not match one of the accepted cell barcodes (contained in the whitelist).

In [None]:
umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN \
 --stdin fastp_hgmm_1k_v2_S1_R1_001.fastq.gz \
 --stdout fastp_hgmm_1k_v2_S1_R1_001.extracted.fastq.gz \
 --read2-in fastp_hgmm_1k_v2_S1_R2_001.fastq.gz \
 --read2-out=fastp_hgmm_1k_v2_S1_R2_001.extracted.fastq.gz \
 --filter-cell-barcode \
 --whitelist=umi-tools_whitelist.txt

In [None]:
ls

In [None]:
zcat fastp_hgmm_1k_v2_S1_R2_001.extracted.fastq.gz |head -n16|sed -n '1~4p'

Each FASTQ record spans four lines, so the number of reads in our newly extracted read file is the line count divided by 4:

In [None]:
zcat fastp_hgmm_1k_v2_S1_R2_001.extracted.fastq.gz| wc -l

In [None]:
expr 246092468 / 4
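
The line count and the division can also be combined so that no number has to be copied by hand. This is a minimal sketch on a tiny stand-in file with made-up reads; for the real data, point it at the extracted R2 file instead.

```shell
workdir=$(mktemp -d)
# Tiny stand-in FASTQ with three reads (12 lines), gzip-compressed.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' \
  | gzip -c > "$workdir/example.fastq.gz"

# Count lines, then divide by 4 with shell arithmetic.
n_lines=$(gzip -dc "$workdir/example.fastq.gz" | wc -l | tr -d ' ')
n_reads=$(( n_lines / 4 ))
echo "$n_lines lines = $n_reads reads"
```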

We want to extract the UMIs for use by kallisto; we can do so using this shell command:

In [None]:
zcat fastp_hgmm_1k_v2_S1_R2_001.extracted.fastq.gz |sed -n '1~4p'|cut -f7 -d :|cut -f3 -d_|cut -f1 -d " " >umi.txt
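
To see what the `cut` chain does, it helps to run it on a single read name. The sketch below assumes the read name has the form `<original id>_<cell barcode>_<UMI> <comment>` after `umi_tools extract`; the instrument, flowcell, and index values in the example header are made up. It also shows an alternative that does not depend on the number of `:` fields in the read id.

```shell
# Example read name in the assumed post-extract layout (values are made up).
header='@ST-K00126:308:HFLYFBBXX:1:1101:25834:1173_NCACCTACATGGTCAT_GCATACGCCTTT 2:N:0:ACCGAACA'

# Same cut chain as above: 7th ':' field, 3rd '_' field, then drop the comment.
umi=$(echo "$header" | cut -f7 -d : | cut -f3 -d _ | cut -f1 -d ' ')

# Alternative: take the first space-separated token and keep
# only what follows the last underscore.
umi_alt=$(echo "$header" | cut -f1 -d ' ' | sed -e 's/.*_//')

echo "cut chain:   $umi"
echo "alternative: $umi_alt"
```

The `cut` chain relies on the read id having exactly seven colon-separated fields, so it is worth checking a few headers from your own file first.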

The number of lines should also be equal to the number of reads. 

In [None]:
wc -l umi.txt

We can also preview the UMI file:

In [None]:
head umi.txt

Let's move the UMI associated files to their own space in our analyses folder

In [None]:
mkdir -p $HOME/tutorial-data/kallisto-single-cell/analyses/umi
mv fastp_hgmm_1k_v2_S1_R1_001.extracted.fastq.gz $HOME/tutorial-data/kallisto-single-cell/analyses/umi
mv fastp_hgmm_1k_v2_S1_R2_001.extracted.fastq.gz $HOME/tutorial-data/kallisto-single-cell/analyses/umi
mv fastp_hgmm_1k_v2_S1_R1_001.fastq.gz $HOME/tutorial-data/kallisto-single-cell/analyses/umi
mv fastp_hgmm_1k_v2_S1_R2_001.fastq.gz $HOME/tutorial-data/kallisto-single-cell/analyses/umi
mv umi-tools_whitelist.txt $HOME/tutorial-data/kallisto-single-cell/analyses/umi
mv umi.txt $HOME/tutorial-data/kallisto-single-cell/analyses/umi

## Prepare transcriptome for Kallisto

Since we are using both human and mouse cells, we will need a single transcriptome containing both human and mouse transcripts. We can combine the existing transcriptomes using the `cat` command:


In [None]:
cd $HOME/tutorial-data/kallisto-single-cell/data/transcriptomes
cat Homo_sapiens.GRCh38.cdna.all.fa.gz Mus_musculus.GRCm38.cdna.all.fa.gz > human-mouse-transcriptome.fa.gz
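
A quick sanity check after combining is to confirm that the number of FASTA records in the combined file equals the sum of the two inputs. The sketch below shows the idea on tiny stand-in files; the transcript names and sequences are made up.

```shell
workdir=$(mktemp -d)
# Tiny stand-in transcriptomes; the transcript names and sequences are made up.
printf '>ENST_example_1\nACGT\n>ENST_example_2\nGGCC\n' | gzip -c > "$workdir/human.fa.gz"
printf '>ENSMUST_example_1\nTTAA\n'                     | gzip -c > "$workdir/mouse.fa.gz"

cat "$workdir/human.fa.gz" "$workdir/mouse.fa.gz" > "$workdir/combined.fa.gz"

# Count FASTA records (lines starting with '>') in each file.
n_human=$(gzip -dc "$workdir/human.fa.gz" | grep -c '^>')
n_mouse=$(gzip -dc "$workdir/mouse.fa.gz" | grep -c '^>')
n_combined=$(gzip -dc "$workdir/combined.fa.gz" | grep -c '^>')

echo "human=$n_human mouse=$n_mouse combined=$n_combined"
```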

In [None]:
pwd

We can now build the kallisto index and place this in an appropriate location


In [None]:
kallisto index --index="human_mouse_transcriptome_index" $HOME/tutorial-data/kallisto-single-cell/data/transcriptomes/human-mouse-transcriptome.fa.gz

In [None]:
mkdir -p $HOME/tutorial-data/kallisto-single-cell/indicies
mv human_mouse_transcriptome_index $HOME/tutorial-data/kallisto-single-cell/indicies

## Quantification using Kallisto

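kallisto provides the `pseudo` command, which runs only the pseudoalignment step. Running it without arguments prints its usage. The quantification further below uses `kallisto quant` in single-end mode on the extracted R2 reads.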

In [None]:
kallisto pseudo

In [None]:
cd $HOME/tutorial-data/kallisto-single-cell/analyses/umi/pre-processed
ls

In [None]:
kallisto quant\
 --single\
 --fragment-length=200\
 --sd=30\
 --threads=4\
 --index=$HOME/tutorial-data/kallisto-single-cell/indicies/pre-processed/human_mouse_transcriptome_index\
 --output-dir=human-mouse_umi_quant\
 $HOME/tutorial-data/kallisto-single-cell/analyses/umi/pre-processed/fastp_hgmm_1k_v2_S1_R2_001.extracted.fastq.gz
 

Our results will be located in the `human-mouse_umi_quant` directory:

In [104]:
cd ./human-mouse_umi_quant
ls
pwd

abundance.h5  abundance.tsv  run_info.json
/home/tutorial-user/tutorial-data/kallisto-single-cell/analyses/umi/pre-processed/human-mouse_umi_quant


We can see statistics for the pseudoalignment in the `run_info.json` file:

In [None]:
cat run_info.json
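
Individual values can be pulled out of the JSON with `sed` when no JSON tool is at hand. The sketch below assumes the file has one `"key": value` pair per line and contains the `n_processed` and `n_pseudoaligned` fields; the numbers are made up, and `get_field` is a helper defined here, not part of kallisto.

```shell
workdir=$(mktemp -d)
# Made-up stand-in for run_info.json with two of its fields.
cat > "$workdir/run_info.json" <<'EOF'
{
    "n_processed": 1000,
    "n_pseudoaligned": 800
}
EOF

get_field() {
  # Print the numeric value of a "key": value pair ($1 = key, $2 = file).
  sed -n "s/.*\"$1\": *\([0-9.]*\).*/\1/p" "$2"
}

n_processed=$(get_field n_processed "$workdir/run_info.json")
n_pseudoaligned=$(get_field n_pseudoaligned "$workdir/run_info.json")

# Percentage of processed reads that pseudoaligned.
pct=$(awk -v a="$n_pseudoaligned" -v p="$n_processed" 'BEGIN {printf "%.1f", 100 * a / p}')
echo "pseudoaligned: $n_pseudoaligned of $n_processed reads ($pct%)"
```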

We will use the `abundance.tsv` file to continue our analysis in R.
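
Before moving to R, a quick look at the table from the shell can confirm that both species received counts. The sketch below assumes the usual tab-separated `abundance.tsv` layout (`target_id`, `length`, `eff_length`, `est_counts`, `tpm`) and Ensembl transcript IDs, where human IDs start with `ENST` and mouse IDs with `ENSMUST`. The rows are made up for illustration.

```shell
workdir=$(mktemp -d)
# Made-up rows in the abundance.tsv layout (tab-separated, with header).
{
  printf 'target_id\tlength\teff_length\test_counts\ttpm\n'
  printf 'ENST00000000001.1\t1000\t801\t10\t5\n'
  printf 'ENST00000000002.1\t2000\t1801\t30\t7\n'
  printf 'ENSMUST00000000001.1\t1500\t1301\t20\t6\n'
} > "$workdir/abundance.tsv"

# Sum est_counts (column 4) per species, skipping the header line.
awk -F '\t' 'NR > 1 {
    species = ($1 ~ /^ENSMUST/) ? "mouse" : (($1 ~ /^ENST/) ? "human" : "other")
    counts[species] += $4
  }
  END { for (s in counts) print s "\t" counts[s] }' "$workdir/abundance.tsv" \
  | sort > "$workdir/species_counts.tsv"

cat "$workdir/species_counts.tsv"
```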