# mvTCR Preprocessing Pipeline
mvTCR uses a specific format to handle single-cell data, which is based on AnnData objects. Unless stated otherwise, we follow the specification from Scanpy [1] and Scirpy [2]. However, mvTCR needs some additional information to use all of its functions. In this notebook, we show how to add this information to the appropriate places in the AnnData object.

All experiments in our paper were conducted on datasets:
- after quality control (cell filtering, doublet detection, ...)
- with normalized and log(1 + x)-transformed count data
- with TCR information added

We assume that these steps have already been performed. For further reference, please see Luecken & Theis [3].

If you know what you are doing, other normalization methods or variance-stabilizing transformations can also be used, but they need to be handled with care!
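
For orientation, the normalization and log(1 + x) transformation assumed above can be sketched in plain NumPy. In a Scanpy workflow, these steps are usually done with `sc.pp.normalize_total` followed by `sc.pp.log1p`. The function name `normalize_and_log1p` and the target of 10,000 counts per cell are assumptions made for this illustration, not values prescribed by mvTCR.

```python
import numpy as np

def normalize_and_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)

# Toy count matrix: 2 cells x 3 genes (illustrative values only)
raw = np.array([[1, 3, 0],
                [10, 0, 10]])
X = normalize_and_log1p(raw)
print(X.round(2))
```

After this transformation, undoing the log with `np.expm1` gives rows that each sum to `target_sum`, which is a quick way to verify that a matrix has been processed as expected.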


[1] Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biology 19, 1–5 (2018).

[2] Sturm, G. et al. Scirpy: a Scanpy extension for analyzing single-cell T-cell receptor-sequencing data. Bioinformatics 36, 4817–4818 (2020).

[3] Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology 15, e8746 (2019).

In [1]:
import sys
sys.path.append('..')

In [2]:
import scanpy as sc

from tcr_embedding.utils_preprocessing import Preprocessing



## All-in-one Pipeline

This is the fast and easy way to preprocess your data. Load your data and call the Preprocessing.preprocessing_pipeline() function with the parameters of interest.
The pipeline performs the following steps, in order:
- Normalization & log transformation checks (experimental)
- "Reasonable" number of highly variable genes (500 < n < 5000)
- Scirpy VDJ gene usage information
- One-Hot encoding of conditional variables
- Stratified-shuffle-splits (into train & val)
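
To give an idea of what the first two checks in the list above might look like, here is a small heuristic sketch. It is not the implementation used inside Preprocessing.preprocessing_pipeline(); the function names and the `max_expected` threshold are hypothetical and chosen only for illustration. The bounds 500 and 5000 are taken from the list above.

```python
import numpy as np

def looks_log_normalized(X, max_expected=20.0):
    """Heuristic: raw counts are non-negative integers and can be large,
    while log(1 + x)-transformed values are small, non-integer floats."""
    X = np.asarray(X, dtype=float)
    has_fractions = not np.allclose(X, np.round(X))
    return bool(has_fractions and X.min() >= 0 and X.max() < max_expected)

def reasonable_n_hvg(n_hvg, lower=500, upper=5000):
    """Check that the number of highly variable genes lies in (lower, upper)."""
    return lower < n_hvg < upper

raw = np.array([[0, 5, 120], [3, 0, 40]])   # toy raw counts
print(looks_log_normalized(raw))            # integer counts: flagged
print(looks_log_normalized(np.log1p(raw)))  # log-transformed values: accepted
print(reasonable_n_hvg(2000))
```

Heuristics like these can produce false positives and negatives, which matches the "experimental" label on this step.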



In [3]:
path_data = '../data/10x_CD8TC/v7_avidity.h5ad'
adata = sc.read(path_data)

In [4]:
Preprocessing.preprocessing_pipeline(adata, 
                                     clonotype_key_added='clonotype', 
                                     column_cdr3a='IR_VJ_1_junction_aa', 
                                     column_cdr3b='IR_VDJ_1_junction_aa',
                                     cond_vars=['donor'],
                                     stratify_col='binding_name', 
                                     group_col='clonotype', 
                                     val_split=0.2)



100%|██████████| 51605/51605 [01:02<00:00, 827.01it/s] 
100%|██████████| 40/40 [00:00<00:00, 58.71it/s]


## Piece-by-Piece Preprocessing

All the steps of the pipeline can also be executed separately, either to preprocess the data step by step or to apply only specific methods.

Let's begin by loading our adata object again.

In [5]:
path_data = '../data/10x_CD8TC/v7_avidity.h5ad'
adata = sc.read(path_data)

In [6]:
adata.obs.head(5)

| | is_cell | high_confidence | multi_chain | extra_chains | IR_VJ_1_c_call | IR_VJ_2_c_call | IR_VDJ_1_c_call | IR_VDJ_2_c_call | IR_VJ_1_consensus_count | IR_VJ_2_consensus_count | ... | seq_len | has_binding | binding_label | binding_name | donor+binding | set | high_count_binding_name | high_count_binding_label | alpha_len | beta_len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAACGGGAGAAGATTC-1-donor_1 | True | True | False | `[{"c_call": "TRBC2", "consensus_count": 3996, ...` | TRAC | | TRBC1 | | 36437.0 | | ... | 30 | True | 29 | A0301_KLGGALQAK_IE-1_CMV_binder | donor_1_A0301_KLGGALQAK_IE-1_CMV_binder | test | A0301_KLGGALQAK_IE-1_CMV_binder | 3 | 12 | 15 |
| AAACGGGTCGGACAAG-1-donor_1 | True | True | False | `[]` | TRAC | | TRBC2 | | 18565.0 | | ... | 32 | False | -1 | no_data | donor_1_no_data | train | no_data | 8 | 14 | 15 |
| AAAGATGGTACAGACG-1-donor_1 | True | True | False | `[{"c_call": "TRAC", "consensus_count": 18416, ...` | TRAC | | TRBC2 | | 31549.0 | | ... | 32 | False | -1 | no_data | donor_1_no_data | train | no_data | 8 | 12 | 17 |
| AAAGTAGAGACGCTTT-1-donor_1 | True | True | False | `[]` | TRAC | | TRBC2 | | 34680.0 | | ... | 33 | False | -1 | no_data | donor_1_no_data | train | no_data | 8 | 13 | 17 |
| AAAGTAGAGCGCTTAT-1-donor_1 | True | True | False | `[]` | TRAC | TRAC | TRBC2 | | 30686.0 | 23335.0 | ... | 30 | False | -1 | no_data | donor_1_no_data | val | no_data | 8 | 15 | 12 |


### Checking whether adata is in an mvTCR-compatible format

In [7]:
Preprocessing.check_if_valid_adata(adata)



True

### Encoding clonotypes with Scirpy

For training the shared embedding, we advise oversampling rare clonotypes. This prevents the model from overfitting to the few TCR sequences of highly expanded clonotypes. Therefore, we need to add a clonotype label to adata.obs. Here, we use Scirpy to define a clonotype as a set of cells with exactly the same CDR3 sequences in the TRA and TRB chains.
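
To illustrate why the clonotype label matters for oversampling, here is a minimal sketch of inverse-frequency sampling weights. This is one common way to oversample rare groups; whether mvTCR uses exactly this weighting internally is an assumption, and the helper name `clonotype_sampling_weights` is hypothetical.

```python
import random
from collections import Counter

def clonotype_sampling_weights(clonotypes):
    """Weight each cell by 1 / (size of its clonotype), so that every
    clonotype receives the same total sampling probability."""
    counts = Counter(clonotypes)
    return [1.0 / counts[c] for c in clonotypes]

# Toy example: one expanded clonotype (8 cells) and one rare clonotype (2 cells)
clonotypes = ["ct0"] * 8 + ["ct1"] * 2
weights = clonotype_sampling_weights(clonotypes)

rng = random.Random(0)
sample = rng.choices(clonotypes, weights=weights, k=1000)
print(Counter(sample))  # both clonotypes are drawn about equally often
```

Without the weights, the expanded clonotype would make up about 80% of the sampled cells in this toy example.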

In [8]:
# Remove any existing clonotype annotation that was not created by Scirpy.
# errors='ignore' makes this a no-op if adata.obs has no 'clonotype' column.
adata.obs.drop(columns='clonotype', inplace=True, errors='ignore')

In [9]:
Preprocessing.encode_clonotypes(adata, key_added='clonotype')

adata.obs.clonotype.value_counts()

100%|██████████| 51605/51605 [00:55<00:00, 937.13it/s] 


18943    5328
41435    3837
2235     3823
29208    2977
10       2210
         ... 
26568       1
27689       1
22478       1
10627       1
12933       1
Name: clonotype, Length: 51605, dtype: int64

### Adding TCR encoding

Next, we encode the TCR sequences numerically and store them in adata.obsm. Here, we need to provide the names of the columns storing the CDR3a and CDR3b sequences. Additionally, we need to specify the padding parameter pad; if it is set to None, the maximal CDR3 sequence length is used as the default. If you plan to add new data in the future via a pretrained model, you might want to add some safety margin so that longer sequences still fit.

In [10]:
Preprocessing.encode_tcr(adata,
                         column_cdr3a='IR_VJ_1_junction_aa',
                         column_cdr3b='IR_VDJ_1_junction_aa',
                         alpha_label_col='alpha_seq',
                         alpha_length_col='alpha_len',
                         beta_label_col='beta_seq',
                         beta_length_col='beta_len',
                         pad=None)

In [11]:
adata.obsm['beta_seq']

array([[ 2,  1, 16, ...,  0,  0,  0],
       [ 2,  1, 16, ...,  0,  0,  0],
       [ 2, 16,  1, ...,  0,  0,  0],
       ...,
       [ 2,  1, 16, ...,  0,  0,  0],
       [ 2, 16,  1, ...,  0,  0,  0],
       [ 2,  1, 17, ...,  0,  0,  0]])
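
The kind of encoding stored in adata.obsm can be sketched as follows: each amino acid is mapped to an integer, and shorter sequences are right-padded with zeros. The alphabetical ordering of the amino acids is an assumption; it happens to be consistent with rows starting with 2, 1, 16 (which would correspond to "CAS"), but the actual mapping used by mvTCR may differ. The sequences and the helper `encode_cdr3` are purely illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering, 20 standard amino acids
AA_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding

def encode_cdr3(seq, pad_to):
    """Map a CDR3 amino-acid string to integer tokens, right-padded with zeros."""
    if len(seq) > pad_to:
        raise ValueError(f"Sequence of length {len(seq)} exceeds padding length {pad_to}")
    ids = [AA_TO_ID[aa] for aa in seq]
    return ids + [0] * (pad_to - len(ids))

# Toy CDR3 sequences (made up for illustration)
seqs = ["CASSLGQAYEQYF", "CSARDRTGNGYTF"]

# Pad to the longest sequence plus a safety margin of 2 positions
pad_to = max(len(s) for s in seqs) + 2
encoded = [encode_cdr3(s, pad_to) for s in seqs]
lengths = [len(s) for s in seqs]

for row in encoded:
    print(row)
```

The safety margin shows why a fixed padding length matters: a sequence longer than `pad_to` cannot be encoded, so a model trained with a tight padding length cannot accept longer sequences later.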

### Adding conditional variables

Conditioning the model on a variable partially removes the effect of that variable. For example, we can add the donor as a conditional variable to reduce batch effects across multiple samples.

In [12]:
Preprocessing.encode_conditional_var(adata, column_id='donor')

In [13]:
adata.obsm['donor']

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       ...,
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.]])
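
A one-hot matrix like the one above can be produced with a few lines of NumPy. This is a minimal sketch of the general technique, not the code behind Preprocessing.encode_conditional_var(); the helper name `one_hot` and the donor labels are assumptions for illustration.

```python
import numpy as np

def one_hot(labels):
    """Return the sorted categories and a (n_cells, n_categories) one-hot matrix."""
    categories, codes = np.unique(labels, return_inverse=True)
    return categories, np.eye(len(categories))[codes]

# Toy donor annotation for five cells
donors = ["donor_1", "donor_1", "donor_3", "donor_2", "donor_4"]
categories, encoded = one_hot(donors)

print(categories)
print(encoded)
```

Each row contains exactly one 1, in the column of the cell's donor. The column order follows the sorted category names, so it should be stored alongside the matrix if new data is encoded later.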

### Creating training and validation splits

The split is designed to satisfy two properties:
- Stratification: the label distribution (e.g. antigen specificity) is roughly the same in both sets.
- Grouping: each group (e.g. clonotype) occurs in only one set, so the model cannot peek at near-identical samples during training.
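
The two properties can be combined in a simple greedy procedure, sketched below. This is not the algorithm behind Preprocessing.stratified_group_shuffle_split(); it is a simplified illustration, and the function name `stratified_group_split` and the toy data are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_group_split(labels, groups, val_split=0.2, seed=42):
    """Return (train_idx, val_idx) such that no group occurs in both sets
    and each label contributes roughly `val_split` of its cells to val."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)

    # Bucket every group under its most common label.
    by_label = defaultdict(list)
    for g, idx in members.items():
        majority = Counter(labels[i] for i in idx).most_common(1)[0][0]
        by_label[majority].append(g)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        label_groups = sorted(by_label[label])
        rng.shuffle(label_groups)
        target = val_split * sum(len(members[g]) for g in label_groups)
        n_val = 0
        for g in label_groups:
            # Whole groups are assigned, so the val share is only approximate.
            if n_val < target:
                val_idx.extend(members[g])
                n_val += len(members[g])
            else:
                train_idx.extend(members[g])
    return sorted(train_idx), sorted(val_idx)

# Toy data: 20 cells, 2 labels, 10 clonotypes of 2 cells each
labels = ["A"] * 10 + ["B"] * 10
groups = [i // 2 for i in range(20)]
train_idx, val_idx = stratified_group_split(labels, groups)
print(len(train_idx), len(val_idx))
```

Because whole groups are moved at once, the resulting validation share can end up above `val_split`, especially when some groups are large.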

In [14]:
# Remove any previous split. errors='ignore' makes this a no-op if no 'set' column exists.
adata.obs.drop(columns=['set'], inplace=True, errors='ignore')

In [15]:
train, val = Preprocessing.stratified_group_shuffle_split(adata.obs, stratify_col='binding_name', group_col='clonotype', val_split=0.2, random_seed=42)

adata.obs['set'] = 'train'
adata.obs.loc[val.index, 'set'] = 'val'

100%|██████████| 40/40 [00:00<00:00, 57.64it/s]


In [16]:
adata.obs['set'].value_counts()

train    98220
val      30367
Name: set, dtype: int64
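
After splitting, it can be worth verifying that no clonotype ended up in both sets. The following sketch shows one way to do this with the standard library only; the helper `groups_in_multiple_sets` is hypothetical and not part of mvTCR.

```python
from collections import defaultdict

def groups_in_multiple_sets(groups, sets):
    """Return the sorted list of groups that occur in more than one set."""
    seen = defaultdict(set)
    for g, s in zip(groups, sets):
        seen[g].add(s)
    return sorted(g for g, s in seen.items() if len(s) > 1)

# Toy example: clonotype 2 leaks from train into val
clonotypes = [0, 0, 1, 2, 2]
split = ["train", "train", "val", "train", "val"]
leaking = groups_in_multiple_sets(clonotypes, split)
print(leaking)
```

In this notebook, the check could be called as `groups_in_multiple_sets(adata.obs['clonotype'], adata.obs['set'])`; an empty list means the grouping property holds.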

### Done! Your data is ready for mvTCR. Don't forget to save it.

In [18]:
path_out = '../data/preprocessed/tutorial.h5ad'
adata.write_h5ad(path_out, compression='gzip')