# mvTCR Preprocessing
mvTCR uses a specific format to handle single-cell data, which is based on AnnData objects. Unless otherwise stated, we follow the specification from Scanpy [1] and Scirpy [2]. However, mvTCR needs some additional information to use all of its functions. In this notebook, we show where to add this information in the AnnData object.

All experiments in our paper were conducted on datasets:
- after quality control (cell filtering, doublet detection, ...)
- with normalized and log(x+1)-transformed count data

We assume that these steps have already been performed. For further reference, please see Luecken et al. [3].
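
As a reminder of what "normalized and log(x+1)-transformed" means, the following is a minimal sketch in plain Python on a toy count matrix. The target sum of 10,000 counts per cell and the helper name `normalize_and_log1p` are assumptions made for illustration; they are not taken from the mvTCR paper. In a Scanpy workflow, the same step is typically done with `sc.pp.normalize_total` followed by `sc.pp.log1p`.

```python
import math


def normalize_and_log1p(counts, target_sum=10_000.0):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    `target_sum` is an illustrative assumption, not a value from the source.
    """
    transformed = []
    for cell in counts:
        total = sum(cell)
        # Guard against empty cells to avoid a division by zero.
        scale = target_sum / total if total > 0 else 0.0
        transformed.append([math.log1p(value * scale) for value in cell])
    return transformed


# Toy matrix: 2 cells x 3 genes of raw counts.
raw_counts = [[1, 0, 3], [10, 5, 5]]
log_norm = normalize_and_log1p(raw_counts)
```

Genes with zero counts stay at zero after the transformation, which keeps the matrix sparse.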

[1] Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome biology 19, 1–5 (2018).

[2] Sturm, G. et al. Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data. Bioinformatics 36, 4817–4818 (2020).

[3] Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular systems biology 15, e8746 (2019).

The remaining preprocessing is showcased on the dataset from Stephenson et al. [4], whose preprocessed data can be downloaded from:

- https://covid19.cog.sanger.ac.uk/submissions/release1/haniffa21.processed.h5ad
- https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-10026/E-MTAB-10026.processed.2.zip


[4] Stephenson, E. et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nature medicine 27, 904–916 (2021).

In [1]:
import scanpy as sc
import scirpy as ir
import pandas as pd

In [2]:
path_gex = '../data/Haniffa/haniffa21.processed.h5ad'
path_tcr = '../data/Haniffa/TCR_merged-Updated'

First, we load the transcriptome data. To reduce runtime, we restrict the data to two patients.

In [3]:
adata = sc.read(path_gex)

In [4]:
selected_patients = ['AP1', 'CV0062']
adata = adata[adata.obs['patient_id'].isin(selected_patients)].copy()

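All models have been trained on the 5000 most highly variable genes. Note that the call below only flags these genes in `adata.var['highly_variable']`; it does not subset the matrix, so the printed shape still contains all genes.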

In [5]:
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
print('Shape: ', adata.shape)

Shape:  (8811, 24929)


## Adding TCR information
Next, we load the TCR information and merge it with the transcriptome data using Scirpy:

In [6]:
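# Load the raw TCR table and rename the cell identifier to 'barcode',
# the column name used by the 10x VDJ format
df_tcr = pd.read_csv(f'{path_tcr}.tsv', sep='\t')
df_tcr['barcode'] = df_tcr.pop('CellID')

# Keep only the entries of the selected patients
df_tcr = df_tcr[df_tcr['study_id'].isin(selected_patients)]

# Write a CSV file that can be read with ir.io.read_10x_vdj
df_tcr.to_csv(f'{path_tcr}.csv')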


In [7]:
adata_tcr = ir.io.read_10x_vdj(f'{path_tcr}.csv')
ir.pp.merge_with_ir(adata, adata_tcr)



mvTCR requires paired TCR and GEX data. We therefore remove all cells that lack a CDR3 sequence for either the TRA or the TRB chain.

In [8]:
print(len(adata))
adata = adata[~adata.obs['IR_VDJ_1_junction_aa'].isna()]
adata = adata[~adata.obs['IR_VJ_1_junction_aa'].isna()].copy()
print(len(adata))

8811
4227


For training the shared embedding, we advise oversampling rare clonotypes. This prevents the model from overfitting to the few TCR sequences of highly expanded clonotypes. To do so, we need to add a clonotype label to adata.obs. Here, we use Scirpy to define a clonotype as all cells that share exactly the same CDR3 sequences in both the TRA and TRB chains.

In [9]:
ir.tl.chain_qc(adata)
ir.pp.ir_dist(adata)
ir.tl.define_clonotypes(adata, key_added='clonotype', receptor_arms='all', dual_ir='primary_only')

100%|█████████████████████████████████████████████████████████████████████████████| 4049/4049 [00:15<00:00, 267.09it/s]
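
To illustrate why the clonotype label matters for oversampling, here is a minimal sketch of one possible weighting scheme: each cell is weighted by the inverse of its clonotype size, so that every clonotype is drawn with the same total probability. The function name `inverse_frequency_weights` and the toy labels are hypothetical, and this scheme is an assumption for illustration; mvTCR's own sampling strategy may differ.

```python
import random
from collections import Counter


def inverse_frequency_weights(clonotypes):
    """Return per-cell sampling weights that give every clonotype equal total mass."""
    sizes = Counter(clonotypes)
    raw = [1.0 / sizes[label] for label in clonotypes]
    total = sum(raw)
    return [weight / total for weight in raw]


# Toy labels: one expanded clonotype (4 cells) and two singletons.
clonotypes = ['c0', 'c0', 'c0', 'c0', 'c1', 'c2']
weights = inverse_frequency_weights(clonotypes)

# Draw cell indices with replacement according to the weights.
rng = random.Random(0)
sampled = rng.choices(range(len(clonotypes)), weights=weights, k=6)
```

With these weights, each cell of the expanded clonotype is four times less likely to be drawn than a singleton cell, so the expanded clonotype as a whole is sampled as often as each rare one.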


Next, we encode the TCR sequences numerically and store them in adata.obsm. For this, we need to provide the names of the columns that store the CDR3α and CDR3β sequences. Additionally, we need to specify the padding parameter. For data analysis, we use the maximal CDR3 sequence length. If you plan to add new data in the future via a pretrained model, you might want to add a safety margin, since new sequences may be longer than those seen here.

In [10]:
import sys
sys.path.append('..')
from tcr_embedding.utils_preprocessing import encode_tcr

# Use the longest CDR3 sequence of either chain as the padding length
len_beta = adata.obs['IR_VDJ_1_junction_aa'].str.len().max()
len_alpha = adata.obs['IR_VJ_1_junction_aa'].str.len().max()
pad = max(len_beta, len_alpha)

encode_tcr(adata, 'IR_VJ_1_junction_aa', 'IR_VDJ_1_junction_aa', pad)
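
The idea of numeric encoding with padding can be sketched as follows. This is an illustrative assumption and not the actual implementation of `encode_tcr`: the amino acid ordering, the use of index 0 for padding, the helper `encode_cdr3`, the toy sequences, and the margin of 5 are all hypothetical.

```python
# The 20 standard amino acids; the ordering here is an arbitrary choice.
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
# Index 0 is reserved for padding, so residues are numbered from 1.
AA_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_cdr3(sequence, pad):
    """Map a CDR3 sequence to integer ids and right-pad with zeros to length `pad`."""
    if len(sequence) > pad:
        raise ValueError(f'Sequence of length {len(sequence)} exceeds pad={pad}')
    ids = [AA_TO_ID[aa] for aa in sequence]
    return ids + [0] * (pad - len(ids))


# Toy sequences for illustration only.
cdr3_beta = ['CASSLGQAYEQYF', 'CASSPTGGYEQYF', 'CAWSVGTDTQYF']
cdr3_alpha = ['CAVRDSNYQLIW', 'CAASGGSYIPTF']

# Maximal length over both chains, as in the cell above.
pad = max(max(map(len, cdr3_beta)), max(map(len, cdr3_alpha)))

# Hypothetical safety margin for longer sequences in future data.
margin = 5
pad_with_margin = pad + margin

encoded = [encode_cdr3(seq, pad_with_margin) for seq in cdr3_beta]
```

Because every encoded sequence has the same length, the result can be stored as a single matrix. A sequence longer than the padding length raises an error in this sketch, which is the situation the safety margin is meant to avoid.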

## Adding conditional variables
Conditioning your model partially removes the effect of a specified condition. For example, we can add the patient as a conditional variable to reduce batch effects across multiple samples.

In [11]:
from sklearn.preprocessing import OneHotEncoder

In [12]:
# Note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`
enc = OneHotEncoder(sparse=False)
enc.fit(adata.obs['patient_id'].to_numpy().reshape(-1, 1))
adata.obsm['patient_id'] = enc.transform(adata.obs['patient_id'].to_numpy().reshape(-1, 1))
adata.uns['patient_id_enc'] = enc.categories_
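
To make the stored objects concrete, the following plain-Python sketch mimics what the one-hot encoding above produces: one row per cell with a single 1 in the column of its patient, plus the list of categories needed to decode the columns. The helper `one_hot` is hypothetical and only illustrates the layout; the notebook itself uses scikit-learn's `OneHotEncoder`.

```python
def one_hot(labels):
    """Return a one-hot matrix (list of rows) and the sorted category list."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    matrix = [
        [1.0 if index[label] == column else 0.0 for column in range(len(categories))]
        for label in labels
    ]
    return matrix, categories


# Toy patient labels for three cells.
patients = ['AP1', 'CV0062', 'AP1']
matrix, categories = one_hot(patients)

# The category list maps each column back to its patient id.
decoded = [categories[row.index(1.0)] for row in matrix]
```

Storing the categories alongside the matrix, as done with `adata.uns['patient_id_enc']`, is what allows the columns to be interpreted later.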

## Saving the data
Finally, we save the data to a compressed h5ad file. To check that everything worked, we load it again afterwards.

In [13]:
path_out = '../data/Haniffa/haniffa_test.h5ad'
adata.write_h5ad(path_out, compression='gzip')
adata = sc.read(path_out)

... storing 'receptor_type' as categorical
... storing 'receptor_subtype' as categorical
... storing 'chain_pairing' as categorical
... storing 'clonotype' as categorical
