# Run-through of Kallisto Single-cell RNA-Seq

First, let's look at our sample data: a mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells [sequenced using the Chromium v2 chemistry](https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/hgmm_1k).

Here are some relevant facts about the dataset:

- 1:1 mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells. This is a classic human-mouse mixture experiment to demonstrate single-cell behavior.
- 1,017 cells detected
- Sequenced on Illumina HiSeq 4000 with approximately 61,000 reads per cell
- 26bp read1 (16bp Chromium barcode and 10bp UMI), 98bp read2 (transcript), and 8bp I7 sample barcode
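
The read 1 layout listed above can be illustrated with a short shell sketch. The sequence used here is invented for illustration and is not taken from the dataset; the only assumption is the layout stated above, with the first 16 bases holding the cell barcode and the next 10 holding the UMI.

```shell
# Hypothetical 26 bp read 1 sequence (invented for illustration, not from the dataset).
read1="AAACCTGAGAAACCATTTGCGTACGT"

# cut -c selects character positions (1-based, inclusive).
barcode=$(printf '%s\n' "$read1" | cut -c1-16)    # bases 1-16: Chromium cell barcode
umi=$(printf '%s\n' "$read1" | cut -c17-26)       # bases 17-26: UMI

echo "barcode: $barcode (${#barcode} bp)"
echo "UMI:     $umi (${#umi} bp)"
```

Read 2 carries the transcript sequence itself, so only read 1 needs to be split this way.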

The data are located in the following directories on our tutorial image.

fastq files:

In [1]:
ls $HOME/tutorial-data/kallisto-single-cell/data/fastq_files/hgmm_1k_v2_fastqs 

hgmm_1k_v2_S1_L001_I1_001.fastq.gz  hgmm_1k_v2_S1_L002_I1_001.fastq.gz
hgmm_1k_v2_S1_L001_R1_001.fastq.gz  hgmm_1k_v2_S1_L002_R1_001.fastq.gz
hgmm_1k_v2_S1_L001_R2_001.fastq.gz  hgmm_1k_v2_S1_L002_R2_001.fastq.gz


These are paired-end sequences (R1, R2) from two lanes (L001, L002) of the experiment. R1 holds the barcode and UMI, and R2 holds the transcript sequence. The `I1` files contain the sample index reads for lanes 1 and 2, respectively.
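
The naming convention can be taken apart with standard shell tools. The sketch below uses one of the file names listed above and assumes only that the fields are separated by underscores, with the lane in field 5 and the read type in field 6.

```shell
# One of the file names listed above.
r1file="hgmm_1k_v2_S1_L001_R1_001.fastq.gz"

# Fields are separated by underscores: field 5 is the lane, field 6 the read type.
lane=$(echo "$r1file" | cut -f5 -d _)
readtype=$(echo "$r1file" | cut -f6 -d _)

# Derive the mate file name by swapping the read tag. Anchoring on the
# underscores avoids touching any other "R1" that might appear in a name.
r2file=$(echo "$r1file" | sed -e 's/_R1_/_R2_/')

echo "lane=$lane read=$readtype mate=$r2file"
```

Deriving the R2 name from the R1 name in this way keeps the two files of a pair together when looping over lanes.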

For this experiment, we also have two transcriptomes, one mouse, one human:

In [2]:
ls $HOME/tutorial-data/kallisto-single-cell/data/transcriptomes

Homo_sapiens.GRCh38.cdna.all.fa.gz  Mus_musculus.GRCm38.cdna.all.fa.gz
human-mouse-transcriptome.fa.gz


## Processing with UMI Tools

Install UMI-tools with Python's pip (`pip install umi_tools`). Running `umi_tools` with no arguments lists the available tools:


In [1]:
umi_tools


umi_tools.py - Tools for UMI analyses

:Author: Tom Smith & Ian Sudbury, CGAT
:Release: $Id$
:Date: |today|
:Tags: Genomics UMI

There are 6 tools:

  - whitelist
  - extract
  - group
  - dedup
  - count
  - count_tab

To get help on a specific tool, type:

    umi_tools <tool> --help

To use a specific tool, type::

    umi_tools <tool> [tool options] [tool arguments]




## Quality control of FastQ data

As done previously, we will use fastp to take a look at the files containing our cell sequence data (we won't use the index files). To be safe, we will disable adapter trimming using the `-A` option. Remember that fastp has several quality control options enabled by default ([see the fastp manual](https://github.com/OpenGene/fastp#simple-usage)).

In [18]:
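cd $HOME/tutorial-data/kallisto-single-cell/data/fastq_files/hgmm_1k_v2_fastqs
for r1infile in *_R1_*fastq.gz
do
  # Derive the matching R2 file name from the R1 file name
  r2infile=$(echo "$r1infile" | sed -e "s/_R1_/_R2_/")
  r1outfile="fastp_${r1infile}"
  r2outfile="fastp_${r2infile}"
  # Field 5 of the underscore-separated name is the lane (e.g. L001)
  reportname="$(echo "$r1infile" | cut -f5 -d _).fastp-report.html"
  echo "Processing pair $r1infile,$r2infile"
  # -A disables adapter trimming; -h sets the HTML report name
  fastp -A -h "$reportname" -i "$r1infile" -o "$r1outfile" -I "$r2infile" -O "$r2outfile"
done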

Processing pair hgmm_1k_v2_S1_L001_R1_001.fastq.gz,hgmm_1k_v2_S1_L001_R2_001.fastq.gz
Read1 before filtering:
total reads: 37636688
total bases: 1053827264
Q20 bases: 1043290586(99.0002%)
Q30 bases: 1019063898(96.7012%)

Read1 after filtering:
total reads: 36607618
total bases: 1025012884
Q20 bases: 1015109251(99.0338%)
Q30 bases: 991825546(96.7623%)

Read2 before filtering:
total reads: 37636688
total bases: 3424938608
Q20 bases: 3319762598(96.9291%)
Q30 bases: 3201987689(93.4904%)

Read2 aftering filtering:
total reads: 36607618
total bases: 3330921245
Q20 bases: 3276509776(98.3665%)
Q30 bases: 3172814998(95.2534%)

Filtering result:
reads passed filter: 73215236
reads failed due to low quality: 1897134
reads failed due to too many N: 5340
reads failed due to too short: 155666

Duplication rate: 0%

Insert size peak (evaluated by paired-end reads): 28

JSON report: fastp.json
HTML report: L001.fastp-report.html

fastp -A -h L001.fastp-report.html -i hgmm_1k_v2_S1_L001_R1_001.fastq.gz
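
As a quick consistency check on the fastp summary for lane L001, the filtering counts can be added up and turned into a pass rate. This is only a sketch of the arithmetic: all numbers are copied from the output above, and the variable names are made up for illustration.

```shell
# Numbers copied from the fastp summary above (lane L001).
reads_before=$((37636688 * 2))              # read 1 + read 2 before filtering
reads_passed=73215236                       # "reads passed filter"
reads_failed=$((1897134 + 5340 + 155666))   # low quality + too many N + too short

# Sanity check: passed + failed should account for every input read.
echo "accounted for: $((reads_passed + reads_failed)) of $reads_before"

# Shell arithmetic is integer-only, so use awk for the percentage.
pct_passed=$(awk -v p="$reads_passed" -v t="$reads_before" 'BEGIN { printf "%.2f", 100 * p / t }')
echo "passed filter: ${pct_passed}%"
```

The passed and failed counts sum to the total number of input reads, so every read is accounted for by one of the four categories in the filtering result.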

We will move the outputs of the fastp analysis to appropriate locations on our computer...

In [24]:
mv fastp_hgmm* $HOME/tutorial-data/kallisto-single-cell/processed/fastp
mv *.json $HOME/tutorial-data/kallisto-single-cell/processed/fastp
mv *.html $HOME/tutorial-data/kallisto-single-cell/processed/fastp

mv: cannot stat 'fastp_hgmm*': No such file or directory


In [25]:
ls $HOME/tutorial-data/kallisto-single-cell/processed/fastp

fastp_hgmm_1k_v2_S1_L001_R1_001.fastq.gz  fastp.json
fastp_hgmm_1k_v2_S1_L001_R2_001.fastq.gz  L001.fastp-report.html
fastp_hgmm_1k_v2_S1_L002_R1_001.fastq.gz  L002.fastp-report.html
fastp_hgmm_1k_v2_S1_L002_R2_001.fastq.gz


You can view the HTML reports in your Jupyter notebook file browser.

## Prepare transcriptome for Kallisto

Since we are using both human and mouse cells, we will need a single transcriptome containing both human and mouse transcripts. We can combine the existing transcriptomes using the `cat` command:


In [26]:
cd $HOME/tutorial-data/kallisto-single-cell/data/transcriptomes
cat Homo_sapiens.GRCh38.cdna.all.fa.gz Mus_musculus.GRCm38.cdna.all.fa.gz > human-mouse-transcriptome.fa.gz
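
Using `cat` directly on compressed files works because the gzip format allows several compressed members to be concatenated, and decompressing the result yields the original contents one after the other. The sketch below demonstrates this on two tiny stand-in FASTA files; the file names and records are hypothetical and are created in a temporary directory, so the tutorial data are not touched.

```shell
# Work in a throwaway directory so nothing in the tutorial data is touched.
tmpdir=$(mktemp -d)

# Two tiny stand-in FASTA files (hypothetical records, for illustration only).
printf '>human_tx1\nACGT\n' | gzip -c > "$tmpdir/human.fa.gz"
printf '>mouse_tx1\nTTGA\n' | gzip -c > "$tmpdir/mouse.fa.gz"

# Concatenate the compressed files directly, without decompressing first.
cat "$tmpdir/human.fa.gz" "$tmpdir/mouse.fa.gz" > "$tmpdir/combined.fa.gz"

# Decompressing the result yields both records, one after the other.
combined=$(gzip -dc "$tmpdir/combined.fa.gz")
n_records=$(printf '%s\n' "$combined" | grep -c '^>')
echo "$combined"
echo "records: $n_records"

rm -rf "$tmpdir"
```

Counting the `>` header lines in the same way is a simple check that the combined transcriptome contains the records of both inputs.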

In [28]:
pwd

/home/tutorial-user/tutorial-data/kallisto-single-cell/data/transcriptomes


We can now build the kallisto index and move it to an appropriate location:


In [31]:
kallisto index --index="human_mouse_transcriptome_index" $HOME/tutorial-data/kallisto-single-cell/data/transcriptomes/human-mouse-transcriptome.fa.gz


[build] loading fasta file /home/tutorial-user/tutorial-data/kallisto-single-cell/data/transcriptomes/human-mouse-transcriptome.fa.gz
[build] k-mer length: 31
        from 2071 target sequences
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 2138563 contigs and contains 206125466 k-mers 



In [33]:
mkdir $HOME/tutorial-data/kallisto-single-cell/indicies
mv human_mouse_transcriptome_index $HOME/tutorial-data/kallisto-single-cell/indicies

## Quantification using Kallisto

In [34]:
kallisto bus --list 

List of supported single cell technologies

short name       description
----------       -----------
10Xv1            10X chemistry version 1
10Xv2            10X chemistry verison 2
DropSeq          DropSeq
inDrop           inDrop
CELSeq           CEL-Seq
CELSeq2          CEL-Seq version 2
SCRBSeq          SCRB-Seq




In [35]:
cd $HOME/tutorial-data/kallisto-single-cell/processed/fastp
ls

fastp_hgmm_1k_v2_S1_L001_R1_001.fastq.gz  fastp.json
fastp_hgmm_1k_v2_S1_L001_R2_001.fastq.gz  L001.fastp-report.html
fastp_hgmm_1k_v2_S1_L002_R1_001.fastq.gz  L002.fastp-report.html
fastp_hgmm_1k_v2_S1_L002_R2_001.fastq.gz


In [37]:
kallisto bus\
 --threads=4\
 --index=$HOME/tutorial-data/kallisto-single-cell/indicies/human_mouse_transcriptome_index\
 --technology=10Xv2\
 --output-dir=human-mouse_quant\
 fastp_hgmm_1k_v2_S1_L001_R1_001.fastq.gz fastp_hgmm_1k_v2_S1_L001_R2_001.fastq.gz\
 fastp_hgmm_1k_v2_S1_L002_R1_001.fastq.gz fastp_hgmm_1k_v2_S1_L002_R2_001.fastq.gz
 


[index] k-mer length: 31
[index] number of targets: 302,896
[index] number of k-mers: 206,125,466
[index] number of equivalence classes: 1,252,306
[quant] will process sample 1: fastp_hgmm_1k_v2_S1_L001_R1_001.fastq.gz
                               fastp_hgmm_1k_v2_S1_L001_R2_001.fastq.gz
[quant] will process sample 2: fastp_hgmm_1k_v2_S1_L002_R1_001.fastq.gz
                               fastp_hgmm_1k_v2_S1_L002_R2_001.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 73,166,000 reads, 58,791,882 reads pseudoaligned


Our results will be located in the `human-mouse_quant` directory:

In [41]:
ls ./human-mouse_quant

matrix.ec  output.bus  run_info.json  transcripts.txt
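
The two text files describe the equivalence classes used in `output.bus`. The sketch below assumes the usual kallisto layout, in which each line of `matrix.ec` holds an equivalence class ID followed by a comma-separated list of 0-based transcript indices, and `transcripts.txt` lists one transcript name per line in index order. The miniature files and transcript names are hypothetical stand-ins, not real output.

```shell
tmpdir=$(mktemp -d)

# Hypothetical miniature versions of the two text files (not real output).
printf 'tx_A\ntx_B\ntx_C\n' > "$tmpdir/transcripts.txt"
printf '0\t0\n1\t1\n2\t2\n3\t0,2\n' > "$tmpdir/matrix.ec"

# Look up which transcripts equivalence class 3 stands for.
ec=3
ids=$(awk -F '\t' -v ec="$ec" '$1 == ec { print $2 }' "$tmpdir/matrix.ec")

# Transcript indices are 0-based, so index i is line i+1 of transcripts.txt.
names=$(echo "$ids" | tr ',' '\n' | while read -r i; do
  sed -n "$(($i + 1))p" "$tmpdir/transcripts.txt"
done | paste -s -d ' ' -)

echo "EC $ec -> $names"
rm -rf "$tmpdir"
```

A read assigned to an equivalence class with several transcripts is compatible with all of them, which is why the mapping back to transcript names is needed before gene-level counts can be produced.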


Next, we will use `bustools` to sort `output.bus`, which speeds up the processing we will do in the R notebook that follows:

In [43]:
cd $HOME/tutorial-data/kallisto-single-cell/processed/fastp/human-mouse_quant
bustools sort\
 --threads 4\
 --output output_sorted.bus\
 output.bus
 

Read in 58791882 number of busrecords
All sorted


In [2]:
ls

matrix.ec  output.bus  output_sorted.bus  run_info.json  transcripts.txt


We will convert the sorted output to text for downstream processing:

In [3]:
bustools text -o output_sorted.txt output_sorted.bus
ls

Read in 40589878 number of busrecords
matrix.ec   output_sorted.bus  run_info.json
output.bus  output_sorted.txt  transcripts.txt


In [8]:
head output_sorted.txt

AAAAAAAAAAAAAAAA	AAAAAAAAAA	246173	1
AAAAAAAAAAAAAAAA	AAAAAAAAAA	1076301	1
AAAAAAAAAAAAAAAA	AAAAGAACCA	553619	1
AAAAAAAAAAAAAAAA	ATAAAAAAAA	321346	1
AAAAAAAAAAAAAAAA	ATAACCAAAA	1207183	1
AAAAAAAAAAAAAAAA	CCCTGGTATC	456094	1
AAAAAAAAAAAAAAAC	TAAACAAAAC	313279	1
AAAAAAAAAAAAAACA	AAAACACATC	451008	1
AAAAAAAAAAAAAATA	ATCAAACAAA	523919	1
AAAAAAAAAAAAAGTC	AAGAGGAAAT	513903	1
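
Each line of the text output has four tab-separated columns: barcode, UMI, equivalence class ID, and count. As a minimal sketch of the kind of summary this format allows, the following counts the distinct UMIs seen for each barcode, using only the ten records shown above. The temporary file and variable names are made up for illustration; on the real data the same pipeline could be pointed at `output_sorted.txt`.

```shell
tmpfile=$(mktemp)

# The ten records shown above: barcode, UMI, equivalence class, count.
# printf reuses the format for each group of four arguments.
printf '%s\t%s\t%s\t%s\n' \
  AAAAAAAAAAAAAAAA AAAAAAAAAA 246173 1 \
  AAAAAAAAAAAAAAAA AAAAAAAAAA 1076301 1 \
  AAAAAAAAAAAAAAAA AAAAGAACCA 553619 1 \
  AAAAAAAAAAAAAAAA ATAAAAAAAA 321346 1 \
  AAAAAAAAAAAAAAAA ATAACCAAAA 1207183 1 \
  AAAAAAAAAAAAAAAA CCCTGGTATC 456094 1 \
  AAAAAAAAAAAAAAAC TAAACAAAAC 313279 1 \
  AAAAAAAAAAAAAACA AAAACACATC 451008 1 \
  AAAAAAAAAAAAAATA ATCAAACAAA 523919 1 \
  AAAAAAAAAAAAAGTC AAGAGGAAAT 513903 1 > "$tmpfile"

# Keep unique (barcode, UMI) pairs, then tally how many remain per barcode.
umi_counts=$(cut -f1,2 "$tmpfile" | LC_ALL=C sort -u | cut -f1 | uniq -c \
  | awk '{ print $1 > "/dev/null"; print $2 "\t" $1 }')
echo "$umi_counts"

rm -f "$tmpfile"
```

The same barcode and UMI can appear with more than one equivalence class, as in the first two records, so the pairs are deduplicated before counting.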
