# Pre-processing mouse single-nuclei RNA-seq data with kallisto and bustools

In this tutorial we will process the 10x dataset `1k Brain Nuclei from an E18 Mouse` using kallisto bus and a custom-built cDNA and intron index for mouse. We will generate two matrices, one for spliced transcripts and one for unspliced transcripts, and sum them to obtain total nuclear transcripts.


The 10x dataset `1k Brain Nuclei from an E18 Mouse` (5GB) is available here:

https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/nuclei_900

To learn how to build a cDNA and intron index see this tutorial:

https://www.kallistobus.tools/velocity_index_tutorial.html

Important: The mouse cDNA and intron index is about 26GB. Building it and processing data with it therefore requires significantly more RAM than typical kallisto workflows, and we recommend using a machine with at least 64GB of RAM for this workflow.

To save you time, we have made available a mouse cDNA and intron index built with the mouse Ensembl 86 release.
You can download the index and the other files used in this tutorial from Caltech Data (19GB zip file) here:

==== insert caltech data link to mouse intron index ====
```
 26GB cDNA_introns.idx
 46MB cDNA_introns.t2g.txt
3.7MB cDNA_transcripts_to_capture.txt
 20MB introns_transcripts_to_capture.txt
```
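Before starting, it can help to confirm that the four files listed above are in the working directory. The following is a minimal sketch using only the Python standard library; the helper name `missing_files` is hypothetical, and the file names are copied from the listing above.

```python
from pathlib import Path

# File names copied from the listing above.
REQUIRED_FILES = [
    "cDNA_introns.idx",
    "cDNA_introns.t2g.txt",
    "cDNA_transcripts_to_capture.txt",
    "introns_transcripts_to_capture.txt",
]

def missing_files(directory=".", names=REQUIRED_FILES):
    """Return the names from `names` that are not regular files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

missing = missing_files()
print("All index files present." if not missing else f"Missing: {missing}")
```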

### Download fastq files and whitelist

In [2]:
# Download files from 10x genomics and untar
!wget http://cf.10xgenomics.com/samples/cell-exp/2.1.0/nuclei_900/nuclei_900_fastqs.tar
!tar -xf nuclei_900_fastqs.tar

--2019-08-02 00:07:03--  http://cf.10xgenomics.com/samples/cell-exp/2.1.0/nuclei_900/nuclei_900_fastqs.tar
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 99.84.41.7, 99.84.41.96, 99.84.41.41, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|99.84.41.7|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5290926080 (4.9G) [application/x-tar]
Saving to: ‘nuclei_900_fastqs.tar’


2019-08-02 00:09:48 (31.5 MB/s) - ‘nuclei_900_fastqs.tar’ saved [5290926080/5290926080]
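Because the archive is large, an interrupted download is easy to miss. As a rough check, the file size on disk can be compared with the length reported by wget in the log above (5290926080 bytes). This is only a sketch: the helper name `download_is_complete` is hypothetical, and a matching size does not prove the content is intact.

```python
import os

# Size in bytes reported by wget in the log above ("Length: 5290926080").
EXPECTED_TAR_BYTES = 5290926080

def download_is_complete(path, expected_bytes=EXPECTED_TAR_BYTES):
    """Return True if `path` is a file with exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes

print(download_is_complete("nuclei_900_fastqs.tar"))
```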



In [5]:
# Download 10x Chromium v2 chemistry barcode whitelist 10xv2_whitelist.txt
!wget https://github.com/BUStools/getting_started/releases/download/velocity_tutorial/10xv2_whitelist.txt

--2019-08-02 00:10:11--  https://github.com/BUStools/getting_started/releases/download/velocity_tutorial/10xv2_whitelist.txt
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/191064839/2f757f00-8d45-11e9-8067-d123e7762f59?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20190802%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20190802T071016Z&X-Amz-Expires=300&X-Amz-Signature=b068be06f67ae3ef3da3a48e9f3e55272968c0e203300371e9c1c85acf12c963&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3D10xv2_whitelist.txt&response-content-type=application%2Foctet-stream [following]
--2019-08-02 00:10:16--  https://github-production-release-asset-2e65be.s3.amazonaws.com/191064839/2f757f00-8d45-11e9-8067-d123e7762f59?X-Amz-Algorithm=AWS4-HMAC-SHA2

In [8]:
# make sure all files were downloaded
!ls -lh

total 31G
-rw-rw-r-- 1 munfred munfred  12M Jun 12 19:14 10xv2_whitelist.txt
-rw-rw-r-- 1 munfred munfred  26G Aug  1 16:17 cDNA_introns.idx
-rw-rw-r-- 1 munfred munfred  46M Aug  1 16:17 cDNA_introns.t2g.txt
-rw-rw-r-- 1 munfred munfred 3.7M Aug  1 16:17 cDNA_transcripts_to_capture.txt
-rw-rw-r-- 1 munfred munfred  20M Aug  1 16:17 introns_transcripts_to_capture.txt
-rw-rw-r-- 1 munfred munfred 3.6K Aug  2 00:09 kallisto_bus_mouse_nuclei_tutorial.ipynb
drwxr-xr-x 2 munfred munfred 4.0K Aug 23  2017 nuclei_900_fastqs
-rw-rw-r-- 1 munfred munfred 5.0G Nov  8  2017 nuclei_900_fastqs.tar


In [9]:
!ls -lh ./nuclei_900_fastqs/

total 5.0G
-rw-r--r-- 1 munfred munfred 225M Aug 23  2017 nuclei_900_S1_L001_I1_001.fastq.gz
-rw-r--r-- 1 munfred munfred 599M Aug 23  2017 nuclei_900_S1_L001_R1_001.fastq.gz
-rw-r--r-- 1 munfred munfred 1.7G Aug 23  2017 nuclei_900_S1_L001_R2_001.fastq.gz
-rw-r--r-- 1 munfred munfred 226M Aug 23  2017 nuclei_900_S1_L002_I1_001.fastq.gz
-rw-r--r-- 1 munfred munfred 602M Aug 23  2017 nuclei_900_S1_L002_R1_001.fastq.gz
-rw-r--r-- 1 munfred munfred 1.7G Aug 23  2017 nuclei_900_S1_L002_R2_001.fastq.gz
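Each lane has an index read (I1) and two sequencing reads (R1 and R2). The `kallisto bus` command used later in this tutorial is given only the R1 and R2 files, as consecutive pairs, one pair per lane. The sketch below builds that ordered list from the directory; the helper name `paired_fastqs` is hypothetical, and the regular expression assumes the file naming pattern shown in the listing above.

```python
import re
from pathlib import Path

def paired_fastqs(fastq_dir):
    """Return [R1, R2, R1, R2, ...] paths ordered by lane, skipping I1 index reads."""
    pattern = re.compile(r"_L(\d+)_(R[12])_\d+\.fastq\.gz$")
    lanes = {}
    for path in Path(fastq_dir).iterdir():
        match = pattern.search(path.name)
        if match:
            lanes.setdefault(match.group(1), {})[match.group(2)] = str(path)
    ordered = []
    for lane in sorted(lanes):
        reads = lanes[lane]
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"lane {lane} is missing R1 or R2")
        ordered += [reads["R1"], reads["R2"]]
    return ordered
```

Joining the result with spaces, for example `" ".join(paired_fastqs("nuclei_900_fastqs"))`, gives the FASTQ arguments in the order kallisto expects.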


In [10]:
!tree

.
├── 10xv2_whitelist.txt
├── cDNA_introns.idx
├── cDNA_introns.t2g.txt
├── cDNA_transcripts_to_capture.txt
├── introns_transcripts_to_capture.txt
├── kallisto_bus_mouse_nuclei_tutorial.ipynb
├── nuclei_900_fastqs
│   ├── nuclei_900_S1_L001_I1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R2_001.fastq.gz
│   ├── nuclei_900_S1_L002_I1_001.fastq.gz
│   ├── nuclei_900_S1_L002_R1_001.fastq.gz
│   └── nuclei_900_S1_L002_R2_001.fastq.gz
└── nuclei_900_fastqs.tar

1 directory, 13 files


## Run kallisto  

In [15]:
!kallisto bus

kallisto 0.46.0
Generates BUS files for single-cell sequencing

Usage: kallisto bus [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              pseudoalignment
-o, --output-dir=STRING       Directory to write output to
-x, --technology=STRING       Single-cell technology used 

Optional arguments:
-l, --list                    List all single-cell technologies supported
-t, --threads=INT             Number of threads to use (default: 1)


In [24]:
!kallisto bus -i cDNA_introns.idx -o bus_output -x 10xv2 -t 4 nuclei_900_fastqs/nuclei_900_S1_L001_R1_001.fastq.gz nuclei_900_fastqs/nuclei_900_S1_L001_R2_001.fastq.gz  nuclei_900_fastqs/nuclei_900_S1_L002_R1_001.fastq.gz nuclei_900_fastqs/nuclei_900_S1_L002_R2_001.fastq.gz 


[index] k-mer length: 31
[index] number of targets: 818,724
[index] number of k-mers: 1,105,269,838
[index] number of equivalence classes: 5,740,477
[quant] will process sample 1: nuclei_900_fastqs/nuclei_900_S1_L001_R1_001.fastq.gz
                               nuclei_900_fastqs/nuclei_900_S1_L001_R2_001.fastq.gz
[quant] will process sample 2: nuclei_900_fastqs/nuclei_900_S1_L002_R1_001.fastq.gz
                               nuclei_900_fastqs/nuclei_900_S1_L002_R2_001.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 48,909,315 reads, 44,705,830 reads pseudoaligned
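The last line of the log gives the number of reads processed and the number pseudoaligned. Their ratio is a quick sanity check on the run. The function name below is hypothetical; the two numbers are copied from the log above.

```python
def pseudoalignment_rate(n_processed, n_pseudoaligned):
    """Fraction of processed reads that were pseudoaligned."""
    if n_processed <= 0:
        raise ValueError("n_processed must be positive")
    return n_pseudoaligned / n_processed

# Numbers copied from the kallisto bus log above.
rate = pseudoalignment_rate(48_909_315, 44_705_830)
print(f"{rate:.1%}")  # prints 91.4%
```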


In [25]:
# check output files
!tree

.
├── 10xv2_whitelist.txt
├── bus_output
│   ├── matrix.ec
│   ├── output.bus
│   ├── run_info.json
│   └── transcripts.txt
├── cDNA_introns.idx
├── cDNA_introns.t2g.txt
├── cDNA_transcripts_to_capture.txt
├── introns_transcripts_to_capture.txt
├── kallisto_bus_mouse_nuclei_tutorial.ipynb
├── nuclei_900_fastqs
│   ├── nuclei_900_S1_L001_I1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R2_001.fastq.gz
│   ├── nuclei_900_S1_L002_I1_001.fastq.gz
│   ├── nuclei_900_S1_L002_R1_001.fastq.gz
│   └── nuclei_900_S1_L002_R2_001.fastq.gz
└── nuclei_900_fastqs.tar

2 directories, 17 files


### Run bustools
Correct the barcodes, sort the BUS file, capture the cDNA and intron records, and count to produce the spliced and unspliced matrices.

In [37]:
!bustools

bustools 0.39.3

Usage: bustools <CMD> [arguments] ..

Where <CMD> can be one of: 

capture         Capture records from a BUS file
correct         Error correct a BUS file
count           Generate count matrices from a BUS file
inspect         Produce a report summarizing a BUS file
linker          Remove section of barcodes in BUS files
project         Project a BUS file to gene sets
sort            Sort a BUS file by barcodes and UMIs
text            Convert a binary BUS file to a tab-delimited text file
whitelist       Generate a whitelist from a BUS file

Running bustools <CMD> without arguments prints usage information for <CMD>



In [107]:
!mkdir -p bus_output/cDNA_capture/ bus_output/intron_capture/ bus_output/spliced/ bus_output/unspliced/ bus_output/tmp/

In [51]:
!bustools393 correct -w 10xv2_whitelist.txt -o bus_output/output.correct.bus bus_output/output.bus 

Found 737280 barcodes in the whitelist
Number of hamming dist 1 barcodes = 20550336
Processed 44705830 bus records
In whitelist = 43426576
Corrected = 305038
Uncorrected = 974216
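The log mentions "hamming dist 1 barcodes": a barcode that is not on the whitelist can be rescued when it differs from a whitelisted barcode at a single position. The sketch below illustrates that idea in plain Python. It is not the bustools implementation; in particular, returning `None` when more than one whitelisted barcode is one mismatch away is an assumption made for this example, and the function name is hypothetical.

```python
def correct_barcode(barcode, whitelist):
    """Return the whitelisted barcode for `barcode`, or None if it cannot be rescued.

    A barcode is kept if it is on the whitelist, and replaced if exactly one
    whitelisted barcode lies at Hamming distance 1 from it.
    """
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, base in enumerate(barcode):
        for alt in "ACGT":
            if alt != base:
                neighbor = barcode[:i] + alt + barcode[i + 1:]
                if neighbor in whitelist:
                    candidates.add(neighbor)
    return candidates.pop() if len(candidates) == 1 else None

# Made-up 4-base barcodes for illustration (real 10x v2 barcodes are 16 bases).
toy_whitelist = {"AAAA", "CCCC"}
print(correct_barcode("AAAT", toy_whitelist))  # one mismatch from AAAA
print(correct_barcode("AACC", toy_whitelist))  # two mismatches from both
```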


In [53]:
!bustools393 sort -o bus_output/output.correct.sort.bus -t 4 bus_output/output.correct.bus

Read in 43731614 BUS records
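The numbers in the `correct` and `sort` logs can be cross-checked: the whitelisted and corrected records together equal the number of records read by `sort`, and adding the uncorrected records recovers the total processed. This is consistent with only whitelisted and corrected records being written to `output.correct.bus`. A short check, with all values copied from the logs above:

```python
# Numbers copied from the bustools correct and sort logs above.
processed = 44_705_830
in_whitelist = 43_426_576
corrected = 305_038
uncorrected = 974_216
sorted_records = 43_731_614

kept = in_whitelist + corrected
assert kept + uncorrected == processed  # every record is accounted for
assert kept == sorted_records           # sort read exactly the kept records
print(f"kept {kept / processed:.1%} of records")  # prints kept 97.8% of records
```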


In [64]:
!ls -lah bus_output/

total 5.1G
drwxrwxr-x 7 munfred munfred 4.0K Aug  2 00:52 .
drwxrwxr-x 5 munfred munfred 4.0K Aug  2 01:13 ..
drwxrwxr-x 2 munfred munfred 4.0K Aug  2 00:31 cDNA_capture
drwxrwxr-x 2 munfred munfred 4.0K Aug  2 00:31 introns_capture
-rw-rw-r-- 1 munfred munfred 1.7G Aug  2 00:30 matrix.ec
-rw-rw-r-- 1 munfred munfred 1.4G Aug  2 00:30 output.bus
-rw-rw-r-- 1 munfred munfred 1.4G Aug  2 00:51 output.correct.bus
-rw-rw-r-- 1 munfred munfred 817M Aug  2 00:52 output.correct.sort.bus
-rw-rw-r-- 1 munfred munfred  546 Aug  2 00:30 run_info.json
drwxrwxr-x 2 munfred munfred 4.0K Aug  2 00:31 spliced
drwxrwxr-x 2 munfred munfred 4.0K Aug  2 00:31 tmp
-rw-rw-r-- 1 munfred munfred  24M Aug  2 00:30 transcripts.txt
drwxrwxr-x 2 munfred munfred 4.0K Aug  2 00:31 unspliced


In [65]:
!tree

.
├── 10xv2_whitelist.txt
├── bus_output
│   ├── cDNA_capture
│   ├── introns_capture
│   ├── matrix.ec
│   ├── output.bus
│   ├── output.correct.bus
│   ├── output.correct.sort.bus
│   ├── run_info.json
│   ├── spliced
│   ├── tmp
│   ├── transcripts.txt
│   └── unspliced
├── cDNA_introns.idx
├── cDNA_introns_t2g.txt
├── cDNA_transcripts_to_capture.txt
├── introns_transcripts_to_capture.txt
├── kallisto_bus_mouse_nuclei_tutorial.ipynb
├── nuclei_900_fastqs
│   ├── nuclei_900_S1_L001_I1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R2_001.fastq.gz
│   ├── nuclei_900_S1_L002_I1_001.fastq.gz
│   ├── nuclei_900_S1_L002_R1_001.fastq.gz
│   └── nuclei_900_S1_L002_R2_001.fastq.gz
└── nuclei_900_fastqs.tar

7 directories, 19 files


In [74]:
!bustools393 capture -s -o bus_output/cDNA_capture/cDNA_capture.bus -c cDNA_transcripts_to_capture.txt -e bus_output/matrix.ec -t bus_output/transcripts.txt bus_output/output.correct.sort.bus

Parsing transcripts .. done
Parsing ECs .. done
Parsing capture list .. done
Read in 26766408 BUS records, wrote 22230247 BUS records


In [89]:
!bustools393 capture -s -o bus_output/intron_capture/intron_capture.bus -c introns_transcripts_to_capture.txt -e bus_output/matrix.ec -t bus_output/transcripts.txt bus_output/output.correct.sort.bus

Parsing transcripts .. done
Parsing ECs .. done
Parsing capture list .. done
Read in 26766408 BUS records, wrote 11835626 BUS records


In [90]:
!ls -lh ./bus_output/cDNA_capture
!ls -lh ./bus_output/intron_capture

total 679M
-rw-rw-r-- 1 munfred munfred 679M Aug  2 11:02 cDNA_capture.bus
total 362M
-rw-rw-r-- 1 munfred munfred 362M Aug  2 11:13 intron_capture.bus


In [92]:
!tree

.
├── 10xv2_whitelist.txt
├── bus_output
│   ├── cDNA_capture
│   │   └── cDNA_capture.bus
│   ├── intron_capture
│   │   └── intron_capture.bus
│   ├── matrix.ec
│   ├── output.bus
│   ├── output.correct.bus
│   ├── output.correct.sort.bus
│   ├── run_info.json
│   ├── spliced
│   ├── tmp
│   ├── transcripts.txt
│   └── unspliced
│       └── u
├── cDNA_introns.idx
├── cDNA_introns_t2g.txt
├── cDNA_transcripts_to_capture.txt
├── introns_transcripts_to_capture.txt
├── kallisto_bus_mouse_nuclei_tutorial.ipynb
├── nuclei_900_fastqs
│   ├── nuclei_900_S1_L001_I1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R2_001.fastq.gz
│   ├── nuclei_900_S1_L002_I1_001.fastq.gz
│   ├── nuclei_900_S1_L002_R1_001.fastq.gz
│   └── nuclei_900_S1_L002_R2_001.fastq.gz
└── [01;31m

In [108]:
!bustools393 count -o bus_output/unspliced/unspliced -g cDNA_introns_t2g.txt -e bus_output/matrix.ec -t bus_output/transcripts.txt --genecounts bus_output/cDNA_capture/cDNA_capture.bus 

In [109]:
!bustools393 count -o bus_output/spliced/spliced -g cDNA_introns_t2g.txt -e bus_output/matrix.ec -t bus_output/transcripts.txt --genecounts bus_output/intron_capture/intron_capture.bus

In [111]:
!tree

.
├── 10xv2_whitelist.txt
├── bus_output
│   ├── cDNA_capture
│   │   └── cDNA_capture.bus
│   ├── intron_capture
│   │   └── intron_capture.bus
│   ├── matrix.ec
│   ├── output.bus
│   ├── output.correct.bus
│   ├── output.correct.sort.bus
│   ├── run_info.json
│   ├── spliced
│   │   ├── spliced
│   │   ├── spliced.barcodes.txt
│   │   ├── spliced.genes.txt
│   │   └── spliced.mtx
│   ├── tmp
│   ├── transcripts.txt
│   └── unspliced
│       ├── unspliced
│       ├── unspliced.barcodes.txt
│       ├── unspliced.genes.txt
│       └── unspliced.mtx
├── cDNA_introns.idx
├── cDNA_introns_t2g.txt
├── cDNA_transcripts_to_capture.txt
├── introns_transcripts_to_capture.txt
├── kallisto_bus_mouse_nuclei_tutorial.ipynb
├── nuclei_900_fastqs
│   ├── nuclei_900_S1_L001_I1_001.fastq.gz
│   ├── nuclei_900_S1_L001_R1_001.fastq.gz
│   ├── [01;31

# Load spliced and unspliced matrices in Python and merge them

In [1]:
from anndata import AnnData
import anndata
from scipy import sparse
import scipy
import scipy.io
import os
import pandas as pd

In [2]:
## load unspliced data into AnnData as a sparse CSR matrix
unspliced = anndata.AnnData(scipy.io.mmread('./bus_output/unspliced/unspliced.mtx').tocsr())
unspliced.obs= pd.read_csv('./bus_output/unspliced/unspliced.barcodes.txt', index_col = 0, header = None, names = ['barcode'])
unspliced.var = pd.read_csv('./bus_output/unspliced/unspliced.genes.txt', header = None, index_col = 0, names =['ensembl_id'], sep = '\t')
print('Loaded unspliced count matrix.')
print(unspliced)

Loaded unspliced count matrix.
AnnData object with n_obs × n_vars = 223419 × 54838 


In [3]:
## load spliced data into AnnData as a sparse CSR matrix
spliced = anndata.AnnData(scipy.io.mmread('./bus_output/spliced/spliced.mtx').tocsr())
spliced.obs= pd.read_csv('./bus_output/spliced/spliced.barcodes.txt', index_col = 0, header = None, names = ['barcode'])
spliced.var = pd.read_csv('./bus_output/spliced/spliced.genes.txt', header = None, index_col = 0, names =['ensembl_id'], sep = '\t')
print('Loaded spliced count matrix')
print(spliced)

Loaded spliced count matrix
AnnData object with n_obs × n_vars = 170857 × 54838 


In [4]:
print(unspliced)

AnnData object with n_obs × n_vars = 223419 × 54838 


In [5]:
spliced.obs.head()

AAACCTGAGAAACGAG
AAACCTGAGAAAGTGG
AAACCTGAGAAGCCCA
AAACCTGAGAAGGTTT
AAACCTGAGAATTCCC


In [6]:
spliced.var.head()

ENSMUSG00000102693.1
ENSMUSG00000064842.1
ENSMUSG00000051951.5
ENSMUSG00000102851.1
ENSMUSG00000103377.1


In [7]:
unspliced.obs.head()

AAACCTGAGAAACGCC
AAACCTGAGAAAGTGG
AAACCTGAGAAGCCCA
AAACCTGAGAATTCCC
AAACCTGAGACACGAC


In [8]:
unspliced.var.head()

ENSMUSG00000102693.1
ENSMUSG00000064842.1
ENSMUSG00000051951.5
ENSMUSG00000102851.1
ENSMUSG00000103377.1


# Sum spliced + unspliced counts
Now that we have the spliced and unspliced matrices, we can sum the gene counts for the barcodes common to both matrices.
We take the intersection of the barcodes because barcodes with no counts at all in one of the two matrices presumably have very low counts overall.

In [9]:
# barcodes present in both matrices
idx = spliced.obs.index.intersection(unspliced.obs.index)
spliced_intersection = spliced[idx]
unspliced_intersection = unspliced[idx]

In [13]:
spliced_intersection.X + unspliced_intersection.X

<139214x54838 sparse matrix of type '<class 'numpy.float32'>'
	with 6464835 stored elements in Compressed Sparse Row format>

In [18]:
spliced_plus_unspliced = spliced_intersection.copy()
spliced_plus_unspliced.X = spliced_intersection.X + unspliced_intersection.X
spliced_plus_unspliced

AnnData object with n_obs × n_vars = 139214 × 54838 
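In the run above, the unspliced matrix has 223419 barcodes, the spliced matrix has 170857, and 139214 are shared. To make the intersect-then-sum step concrete, here is a library-free toy version in which plain dictionaries stand in for the sparse matrices. The function name and the toy barcodes and genes are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def sum_shared_barcodes(spliced, unspliced):
    """Sum per-gene counts for barcodes present in both inputs.

    Each input maps barcode -> {gene: count}. Barcodes found in only one
    input are dropped, mirroring the index intersection used above.
    """
    shared = sorted(set(spliced) & set(unspliced))
    return {bc: dict(Counter(spliced[bc]) + Counter(unspliced[bc])) for bc in shared}

# Made-up toy data, not taken from the dataset.
spliced_toy = {"AAAC": {"g1": 2}, "AAAG": {"g1": 1, "g2": 4}}
unspliced_toy = {"AAAG": {"g2": 3, "g3": 5}, "AAAT": {"g1": 7}}
total_toy = sum_shared_barcodes(spliced_toy, unspliced_toy)
print(total_toy)
```

Only `AAAG` appears in both inputs, so it is the only barcode in the result, with the counts for `g2` added together.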