# Aggregating multiple count matrices

This notebook describes how to aggregate multiple count matrices by concatenating them into a single AnnData object with batch labels for different samples.

This is similar to the Cell Ranger `aggr` command, except that no normalization is performed. `cellranger aggr` is described at https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/aggregate

For this tutorial we use the dataset E-MTAB-6108. We provide the count matrices as an 80 MB zip file at https://github.com/BUStools/getting_started/releases/download/aggr/E-MTAB-6108_sample1_sample2_genecounts.zip

If you download the zip file, it has the following structure:
```
E-MTAB-6108_sample1_sample2_genecounts.zip
├── sample1
│   ├── genecounts
│   │   ├── genes.barcodes.txt
│   │   ├── genes.genes.txt
│   │   └── genes.mtx
│   ├── matrix.ec
│   ├── run_info.json
│   └── transcripts.txt
└── sample2
    ├── genecounts
    │   ├── genes.barcodes.txt
    │   ├── genes.genes.txt
    │   └── genes.mtx
    ├── matrix.ec
    ├── run_info.json
    └── transcripts.txt
```
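After extracting the archive, a quick check can confirm that the layout matches the tree above. This is a minimal sketch using only the standard library; the helper name `missing_files` and the choice of the current directory as the extraction root are assumptions for illustration.

```python
from pathlib import Path

# Files expected under each sample directory, taken from the tree above.
EXPECTED_FILES = [
    "genecounts/genes.barcodes.txt",
    "genecounts/genes.genes.txt",
    "genecounts/genes.mtx",
    "matrix.ec",
    "run_info.json",
    "transcripts.txt",
]


def missing_files(root, samples=("sample1", "sample2")):
    """Return the expected relative paths that are absent under `root`."""
    root = Path(root)
    return [
        str(Path(sample) / rel)
        for sample in samples
        for rel in EXPECTED_FILES
        if not (root / sample / rel).is_file()
    ]


# Assumes the zip was extracted into the current working directory.
print(missing_files("."))
```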
The raw data for E-MTAB-6108 is available at:
https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6108/


In [1]:
import os

import anndata
import pandas as pd  # used below to read the barcode and gene lists
import scipy.io
from anndata import AnnData
from scipy import sparse

# Check that FASTQ data is in place 

In [2]:
!ls -lh /data/E-MTAB-6108/

total 22G
-rw-rw-r--. 1 munfred munfred 2.7G Jul 17 22:23 iPSC_RGCscRNAseq_Sample1_L005_R1.fastq.gz
-rw-rw-r--. 1 munfred munfred 9.9G Jul 17 22:41 iPSC_RGCscRNAseq_Sample1_L005_R2.fastq.gz
-rw-rw-r--. 1 munfred munfred 1.9G Jul 17 22:20 iPSC_RGCscRNAseq_Sample2_L005_R1.fastq.gz
-rw-rw-r--. 1 munfred munfred 7.1G Jul 17 22:34 iPSC_RGCscRNAseq_Sample2_L005_R2.fastq.gz
-rw-rw-r--. 1 munfred munfred 4.7K Jul 17 22:46 README.txt


## Process sample 1 with kallisto and bustools

In [3]:
!kallisto bus -i /references/homo_sapiens-ensembl-96/transcriptome.idx -o ./sample1 -x 10xv2 -t 8 \
/data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample1_L005_R1.fastq.gz \
/data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample1_L005_R2.fastq.gz



[index] k-mer length: 31
[index] number of targets: 188,753
[index] number of k-mers: 109,544,288
[index] number of equivalence classes: 760,757
[quant] will process sample 1: /data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample1_L005_R1.fastq.gz
                               /data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample1_L005_R2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 135,299,096 reads, 90,746,402 reads pseudoaligned


In [4]:
!ls ./sample1/

genecounts  matrix.ec  output.bus  run_info.json  transcripts.txt


In [5]:
!bustools correct --output ./sample1/output.corrected.bus --whitelist /references/10xv2_whitelist.txt ./sample1/output.bus

Found 737280 barcodes in the whitelist
Number of hamming dist 1 barcodes = 737280
Processed 90746402 bus records, rescued 88289244 records


In [6]:
!ls -lh ./sample1/

total 5.4G
drwxrwxr-x. 2 munfred munfred 4.0K Jul 17 23:51 genecounts
-rw-rw-r--. 1 munfred munfred  48M Jul 19 14:59 matrix.ec
-rw-rw-r--. 1 munfred munfred 2.8G Jul 19 14:59 output.bus
-rw-rw-r--. 1 munfred munfred 2.7G Jul 19 15:00 output.corrected.bus
-rw-rw-r--. 1 munfred munfred  489 Jul 19 14:59 run_info.json
-rw-rw-r--. 1 munfred munfred 3.3M Jul 19 14:59 transcripts.txt


In [7]:
!bustools sort --output ./sample1/output.corrected.sorted.bus ./sample1/output.corrected.bus

Read in 88289244 number of busrecords
All sorted


In [8]:
!ls -lh ./sample1/

total 6.6G
drwxrwxr-x. 2 munfred munfred 4.0K Jul 17 23:51 genecounts
-rw-rw-r--. 1 munfred munfred  48M Jul 19 14:59 matrix.ec
-rw-rw-r--. 1 munfred munfred 2.8G Jul 19 14:59 output.bus
-rw-rw-r--. 1 munfred munfred 2.7G Jul 19 15:00 output.corrected.bus
-rw-rw-r--. 1 munfred munfred 1.2G Jul 19 15:00 output.corrected.sorted.bus
-rw-rw-r--. 1 munfred munfred  489 Jul 19 14:59 run_info.json
-rw-rw-r--. 1 munfred munfred 3.3M Jul 19 14:59 transcripts.txt


In [9]:
!mkdir -p ./sample1/genecounts
!bustools count \
--output  ./sample1/genecounts/genes \
--genecounts \
--genemap /references/homo_sapiens-ensembl-96/transcripts_to_genes.txt \
--ecmap ./sample1/matrix.ec \
--txnames  ./sample1/transcripts.txt \
./sample1/output.corrected.sorted.bus 

bad counts = 0, rescued  =0, compacted = 0


In [10]:
!ls -lh ./sample1/genecounts/

total 135M
-rw-rw-r--. 1 munfred munfred 3.7M Jul 19 15:00 genes.barcodes.txt
-rw-rw-r--. 1 munfred munfred 640K Jul 19 15:00 genes.genes.txt
-rw-rw-r--. 1 munfred munfred 130M Jul 19 15:00 genes.mtx


## Process sample 2 with kallisto and bustools

In [11]:
!kallisto bus -i /references/homo_sapiens-ensembl-96/transcriptome.idx -o ./sample2 -x 10xv2 -t 8 \
/data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample2_L005_R1.fastq.gz \
/data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample2_L005_R2.fastq.gz


[index] k-mer length: 31
[index] number of targets: 188,753
[index] number of k-mers: 109,544,288
[index] number of equivalence classes: 760,757
[quant] will process sample 1: /data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample2_L005_R1.fastq.gz
                               /data/E-MTAB-6108/iPSC_RGCscRNAseq_Sample2_L005_R2.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 95,769,921 reads, 65,902,895 reads pseudoaligned


In [12]:
!ls ./sample2/

genecounts  matrix.ec  output.bus  run_info.json  transcripts.txt


In [13]:
!bustools correct --output ./sample2/output.corrected.bus --whitelist /references/10xv2_whitelist.txt ./sample2/output.bus

Found 737280 barcodes in the whitelist
Number of hamming dist 1 barcodes = 737280
Processed 65902895 bus records, rescued 64168706 records
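
The logs above can be summarized as two fractions per sample: the share of reads that pseudoaligned, and the share of BUS records retained by `bustools correct`. The sketch below only does the arithmetic on numbers copied from the logs. It treats the "rescued" count as the number of records kept after correction, which is an interpretation based on the fact that `bustools sort` later reads in the same number of records for sample 1. The function and key names are hypothetical.

```python
# Counts copied from the kallisto bus and bustools correct logs above.
runs = {
    "sample1": {"processed": 135_299_096, "pseudoaligned": 90_746_402, "rescued": 88_289_244},
    "sample2": {"processed": 95_769_921, "pseudoaligned": 65_902_895, "rescued": 64_168_706},
}


def rates(run):
    """Return the pseudoaligned and post-correction fractions for one run."""
    return {
        "pseudoaligned_fraction": run["pseudoaligned"] / run["processed"],
        "kept_after_correct_fraction": run["rescued"] / run["pseudoaligned"],
    }


for name, run in runs.items():
    r = rates(run)
    print(
        f"{name}: {r['pseudoaligned_fraction']:.1%} pseudoaligned, "
        f"{r['kept_after_correct_fraction']:.1%} kept after barcode correction"
    )
```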


In [14]:
!ls -lh ./sample2/

total 4.0G
drwxrwxr-x. 2 munfred munfred 4.0K Jul 17 23:50 genecounts
-rw-rw-r--. 1 munfred munfred  46M Jul 19 15:04 matrix.ec
-rw-rw-r--. 1 munfred munfred 2.0G Jul 19 15:04 output.bus
-rw-rw-r--. 1 munfred munfred 2.0G Jul 19 15:04 output.corrected.bus
-rw-rw-r--. 1 munfred munfred  488 Jul 19 15:04 run_info.json
-rw-rw-r--. 1 munfred munfred 3.3M Jul 19 15:04 transcripts.txt


In [15]:
!bustools sort --output ./sample2/output.corrected.sorted.bus ./sample2/output.corrected.bus

Read in 64168706 number of busrecords
All sorted


In [16]:
!ls -lh ./sample2/

total 4.5G
drwxrwxr-x. 2 munfred munfred 4.0K Jul 17 23:50 genecounts
-rw-rw-r--. 1 munfred munfred  46M Jul 19 15:04 matrix.ec
-rw-rw-r--. 1 munfred munfred 2.0G Jul 19 15:04 output.bus
-rw-rw-r--. 1 munfred munfred 2.0G Jul 19 15:04 output.corrected.bus
-rw-rw-r--. 1 munfred munfred 548M Jul 19 15:04 output.corrected.sorted.bus
-rw-rw-r--. 1 munfred munfred  488 Jul 19 15:04 run_info.json
-rw-rw-r--. 1 munfred munfred 3.3M Jul 19 15:04 transcripts.txt


In [17]:
!bustools count \
--output  ./sample2/genecounts/genes \
--genecounts \
--genemap /references/homo_sapiens-ensembl-96/transcripts_to_genes.txt \
--ecmap ./sample2/matrix.ec \
--txnames  ./sample2/transcripts.txt \
./sample2/output.corrected.sorted.bus 

In [18]:
!ls -lh ./sample2/genecounts/

total 55M
-rw-rw-r--. 1 munfred munfred 2.2M Jul 17 23:50 genes.barcodes.txt
-rw-rw-r--. 1 munfred munfred 640K Jul 17 23:50 genes.genes.txt
-rw-rw-r--. 1 munfred munfred  52M Jul 17 23:50 genes.mtx


In [19]:
!mkdir -p ./sample2/genecounts
!bustools count \
--output  ./sample2/genecounts/genes \
--genecounts \
--genemap /references/homo_sapiens-ensembl-96/transcripts_to_genes.txt \
--ecmap ./sample2/matrix.ec \
--txnames  ./sample2/transcripts.txt \
./sample2/output.corrected.sorted.bus 

bad counts = 0, rescued  =0, compacted = 0


In [20]:
!ls -lh ./sample2/genecounts/

total 55M
-rw-rw-r--. 1 munfred munfred 2.2M Jul 19 15:04 genes.barcodes.txt
-rw-rw-r--. 1 munfred munfred 640K Jul 19 15:04 genes.genes.txt
-rw-rw-r--. 1 munfred munfred  52M Jul 19 15:04 genes.mtx
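
Both samples go through the same four steps, so the commands can be generated from the sample number instead of being typed twice, which avoids path typos. The sketch below only builds and prints the command lines; it does not run them. The paths and flags are the ones used in the cells above, while the function name `build_commands` and the module-level constants are hypothetical.

```python
import shlex

# Reference and data locations as used in the cells above.
INDEX = "/references/homo_sapiens-ensembl-96/transcriptome.idx"
T2G = "/references/homo_sapiens-ensembl-96/transcripts_to_genes.txt"
WHITELIST = "/references/10xv2_whitelist.txt"
FASTQ_DIR = "/data/E-MTAB-6108"


def build_commands(n, threads=8):
    """Return the kallisto/bustools commands for sample `n` as argument lists."""
    out = f"./sample{n}"
    fastq = f"{FASTQ_DIR}/iPSC_RGCscRNAseq_Sample{n}_L005"
    return [
        ["kallisto", "bus", "-i", INDEX, "-o", out, "-x", "10xv2", "-t", str(threads),
         f"{fastq}_R1.fastq.gz", f"{fastq}_R2.fastq.gz"],
        ["bustools", "correct", "--output", f"{out}/output.corrected.bus",
         "--whitelist", WHITELIST, f"{out}/output.bus"],
        ["bustools", "sort", "--output", f"{out}/output.corrected.sorted.bus",
         f"{out}/output.corrected.bus"],
        ["mkdir", "-p", f"{out}/genecounts"],
        ["bustools", "count", "--output", f"{out}/genecounts/genes", "--genecounts",
         "--genemap", T2G, "--ecmap", f"{out}/matrix.ec",
         "--txnames", f"{out}/transcripts.txt", f"{out}/output.corrected.sorted.bus"],
    ]


# Print the commands; to execute them, pass each list to subprocess.run(cmd, check=True).
for n in (1, 2):
    for cmd in build_commands(n):
        print(shlex.join(cmd))
```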


# Read sample1 and sample2 gene count matrices into anndata and concatenate them

In [21]:
## Load sample1 into AnnData as a sparse CSR matrix (requires `import pandas as pd`)
sample1 = anndata.AnnData(scipy.io.mmread('./sample1/genecounts/genes.mtx').tocsr())
sample1.obs = pd.read_csv('./sample1/genecounts/genes.barcodes.txt', index_col=0, header=None, names=['barcode'])
sample1.var = pd.read_csv('./sample1/genecounts/genes.genes.txt', header=None, index_col=0, names=['ensembl_id'], sep='\t')
print('Loaded sample1 mtx:', sample1.X.shape)


Loaded sample1 mtx: (226612, 35606)
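
The shape printed above is (barcodes, genes): each row of `genes.mtx` corresponds to a line of `genes.barcodes.txt` and each column to a line of `genes.genes.txt`. The sketch below wraps the three reads in one function and checks that the dimensions agree, using a tiny made-up matrix written to a temporary directory so it runs without the real data. The function name `read_counts` and the toy barcodes and gene names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd
import scipy.io
from scipy import sparse


def read_counts(prefix):
    """Read <prefix>.mtx, <prefix>.barcodes.txt and <prefix>.genes.txt."""
    matrix = scipy.io.mmread(f"{prefix}.mtx").tocsr()
    barcodes = pd.read_csv(f"{prefix}.barcodes.txt", header=None, names=["barcode"])["barcode"]
    genes = pd.read_csv(f"{prefix}.genes.txt", header=None, names=["ensembl_id"], sep="\t")["ensembl_id"]
    # Rows must match barcodes and columns must match genes.
    if matrix.shape != (len(barcodes), len(genes)):
        raise ValueError(f"matrix shape {matrix.shape} does not match "
                         f"{len(barcodes)} barcodes x {len(genes)} genes")
    return matrix, barcodes, genes


# Toy data: 2 barcodes x 3 genes (made up, not from E-MTAB-6108).
with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "genes"
    scipy.io.mmwrite(f"{prefix}.mtx", sparse.csr_matrix([[1, 0, 2], [0, 3, 0]]))
    Path(f"{prefix}.barcodes.txt").write_text("AAAC\nAAAG\n")
    Path(f"{prefix}.genes.txt").write_text("G1\nG2\nG3\n")
    matrix, barcodes, genes = read_counts(prefix)

print(matrix.shape, list(barcodes), list(genes))
```

With the real data, the equivalent call would be `read_counts('./sample1/genecounts/genes')`.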


In [22]:
sample1

AnnData object with n_obs × n_vars = 226612 × 35606 

In [23]:
sample1.X

<226612x35606 sparse matrix of type '<class 'numpy.float32'>'
	with 8379616 stored elements in Compressed Sparse Row format>

In [24]:
sample1.obs.head()

AAACCTGAGAAACCAT
AAACCTGAGAAACCGC
AAACCTGAGAAACCTA
AAACCTGAGAAACGAG
AAACCTGAGAAAGTGG


In [25]:
sample1.var.head()

ENSG00000223972.5
ENSG00000227232.5
ENSG00000268020.3
ENSG00000240361.2
ENSG00000186092.6


In [26]:
## Load sample2 into AnnData as a sparse CSR matrix (requires `import pandas as pd`)
sample2 = anndata.AnnData(scipy.io.mmread('./sample2/genecounts/genes.mtx').tocsr())
sample2.obs = pd.read_csv('./sample2/genecounts/genes.barcodes.txt', index_col=0, header=None, names=['barcode'])
sample2.var = pd.read_csv('./sample2/genecounts/genes.genes.txt', header=None, index_col=0, names=['ensembl_id'], sep='\t')
print('Loaded sample2 mtx:', sample2.X.shape)

Loaded sample2 mtx: (135582, 35606)


In [27]:
sample2

AnnData object with n_obs × n_vars = 135582 × 35606 

In [28]:
sample2.X

<135582x35606 sparse matrix of type '<class 'numpy.float32'>'
	with 3177015 stored elements in Compressed Sparse Row format>
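
The two `.X` summaries show why a sparse format is used: only a small fraction of the barcode-by-gene entries are stored. The arithmetic below uses the shapes and stored-element counts reported above; the function name `density` is hypothetical.

```python
def density(stored, n_rows, n_cols):
    """Fraction of matrix entries that are explicitly stored."""
    return stored / (n_rows * n_cols)


# Shapes and stored-element counts copied from the outputs above.
d1 = density(8_379_616, 226_612, 35_606)
d2 = density(3_177_015, 135_582, 35_606)

print(f"sample1 density: {d1:.4%}")
print(f"sample2 density: {d2:.4%}")
```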

In [29]:
sample2.obs.head()

AAACCTGAGAACTCGG
AAACCTGAGAAGCCCA
AAACCTGAGAAGGCCT
AAACCTGAGAATTCCC
AAACCTGAGACAATAC


In [30]:
sample2.var.head()

ENSG00000223972.5
ENSG00000227232.5
ENSG00000268020.3
ENSG00000240361.2
ENSG00000186092.6


# Concatenate sample1 and sample2 anndatas

In [31]:
concat_samples = sample1.concatenate(sample2, join='outer', batch_categories=['sample1', 'sample2'], index_unique='-')

In [32]:
concat_samples

AnnData object with n_obs × n_vars = 362194 × 35606 
    obs: 'batch'
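
Because both samples were counted against the same gene map, they have identical gene columns (35606 each), so the concatenation amounts to stacking the rows, suffixing each barcode with its sample name, and recording the sample in a `batch` column. The sketch below reproduces that effect with `scipy` and `pandas` on made-up toy matrices. It is not anndata's implementation, and the function name `concat_counts` is hypothetical.

```python
import pandas as pd
from scipy import sparse


def concat_counts(samples, index_unique="-"):
    """Stack count matrices that share the same gene columns.

    `samples` maps a batch name to (csr matrix, list of barcodes).
    """
    matrices, obs_names, batches = [], [], []
    for name, (matrix, barcodes) in samples.items():
        matrices.append(matrix)
        # Make barcodes unique across samples, e.g. "AAAC-sample1".
        obs_names.extend(f"{bc}{index_unique}{name}" for bc in barcodes)
        batches.extend([name] * len(barcodes))
    stacked = sparse.vstack(matrices, format="csr")
    obs = pd.DataFrame(
        {"batch": pd.Categorical(batches, categories=list(samples))},
        index=obs_names,
    )
    return stacked, obs


# Toy input: the barcode "AAAC" occurs in both samples.
toy = {
    "sample1": (sparse.csr_matrix([[1, 0, 2], [0, 3, 0]]), ["AAAC", "AAAG"]),
    "sample2": (sparse.csr_matrix([[0, 0, 4]]), ["AAAC"]),
}
X, obs = concat_counts(toy)
print(X.shape)
print(obs)

# Row counts add up the same way for the real data: 226612 + 135582 barcodes.
print(226_612 + 135_582)  # 362194, matching n_obs above
```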

In [33]:
concat_samples.var.head()

ENSG00000223972.5
ENSG00000227232.5
ENSG00000268020.3
ENSG00000240361.2
ENSG00000186092.6


In [34]:
concat_samples.obs

                            batch
AAACCTGAGAAACCAT-sample1  sample1
AAACCTGAGAAACCGC-sample1  sample1
AAACCTGAGAAACCTA-sample1  sample1
AAACCTGAGAAACGAG-sample1  sample1
AAACCTGAGAAAGTGG-sample1  sample1
AAACCTGAGAACAATC-sample1  sample1
AAACCTGAGAAGATTC-sample1  sample1
AAACCTGAGAAGGTGA-sample1  sample1
AAACCTGAGAATGTTG-sample1  sample1
AAACCTGAGAATTGTG-sample1  sample1
