# RNA-seq and ATAC-seq integration using SemiLT

In this tutorial, we illustrate how to use SemiLT step by step on the scRNA-seq and scATAC-seq mouse spleen dataset (data-5) from Kai Cao et al. 2022. The data can be downloaded from https://github.com/caokai1073/uniPort. After preprocessing, quality control, and cell type annotation, the data contain 4271 cells from RNA-seq and 3166 cells from ATAC-seq.

In [1]:
import torch
import os
from datetime import datetime
from SemiLT.trainingprocess import Training
from SemiLT.transfer import Transfer
import time
from setting import Setting
import random
random.seed(1)
setting = Setting()

## Preparing input for SemiLT in setting.py

```python
# Excerpt from setting.py. The enclosing class and __init__ are inferred
# from the `Setting()` call and the `self.` attributes used in this tutorial.
class Setting:
    def __init__(self):
        DB = 'ms'
        if DB == "ms":
            self.number_of_class = 10
            self.input_size = 11055  # equals n_vars of the h5ad files shown below
            self.rna_paths = ['data_ms/ms_rna.h5ad']
            self.atac_paths = ['data_ms/ms_atac.h5ad']
            self.rna_protein_paths = []
            self.atac_protein_paths = []
            self.peak_paths = ['data_ms/ms_PCA50.h5ad']
            self.atac_labels = True

            # Training setting
            self.batch_size = 256
            self.lr = 0.008
            self.lr_decay_epoch = 20
            self.epochs = 20
            self.embedding_size = 64
            self.momentum = 0.9
            self.seed = 1
            self.checkpoint = ''
```
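
Before training, it can help to confirm that every file listed in the configuration actually exists. The following is a minimal sketch and not part of SemiLT: `missing_files` is a hypothetical helper, and the `SimpleNamespace` object only stands in for a `Setting()` instance so that the example runs on its own.

```python
import os
from types import SimpleNamespace

# Attribute names taken from the configuration above.
PATH_FIELDS = ("rna_paths", "atac_paths", "rna_protein_paths",
               "atac_protein_paths", "peak_paths")

def missing_files(setting):
    """Return every configured path that does not exist on disk.

    Hypothetical helper, not part of SemiLT.
    """
    missing = []
    for field in PATH_FIELDS:
        # Fields that are absent are treated like empty lists.
        for path in getattr(setting, field, []):
            if not os.path.isfile(path):
                missing.append(path)
    return missing

# Stand-in object for illustration; in the tutorial you would pass Setting().
demo = SimpleNamespace(
    rna_paths=["data_ms/ms_rna.h5ad"],
    atac_paths=["data_ms/ms_atac.h5ad"],
    rna_protein_paths=[],
    atac_protein_paths=[],
    peak_paths=["data_ms/ms_PCA50.h5ad"],
)
for path in missing_files(demo):
    print("missing:", path)
```

An empty result means all listed files were found relative to the current working directory.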

## View data

In [2]:
import scanpy as sc
adata_ref_rna = sc.read(setting.rna_paths[0])
print(adata_ref_rna)
adata_tar_atac = sc.read(setting.atac_paths[0])
print(adata_tar_atac)

AnnData object with n_obs × n_vars = 4271 × 11055
    obs: 'cell_type', 'source', 'domain_id'
    var: 'n_cells-0', 'n_cells-1'
AnnData object with n_obs × n_vars = 3166 × 11055
    obs: 'cell_type', 'source', 'domain_id'
    var: 'n_cells-0', 'n_cells-1'
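
Both matrices above have 11055 columns, the same value as `input_size` in setting.py. A check along the following lines can catch a mismatch early. It is only a sketch: `check_feature_dims` is a hypothetical helper, and it works on plain `(n_cells, n_features)` tuples so that it does not depend on how the data were loaded.

```python
def check_feature_dims(rna_shape, atac_shape, input_size):
    """Raise ValueError if either matrix does not have `input_size` features.

    Shapes are (n_cells, n_features) tuples, e.g. the `.shape` of an AnnData object.
    Hypothetical helper, not part of SemiLT.
    """
    for name, shape in (("RNA", rna_shape), ("ATAC", atac_shape)):
        if shape[1] != input_size:
            raise ValueError(
                f"{name} has {shape[1]} features, expected {input_size}"
            )
    # Return the cell counts so the caller can report them.
    return rna_shape[0], atac_shape[0]

# Shapes printed in the cell above.
n_rna, n_atac = check_feature_dims((4271, 11055), (3166, 11055), 11055)
print(f"{n_rna} RNA cells, {n_atac} ATAC cells")
```

With the objects loaded in this tutorial, the call would be `check_feature_dims(adata_ref_rna.shape, adata_tar_atac.shape, setting.input_size)`.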


## Running SemiLT in main.py

In [3]:
def main():
    # hardware constraint for speed test
    start_time = time.time()
    torch.set_num_threads(1)
    os.environ['OMP_NUM_THREADS'] = '1'
    
    # initialization 
    setting = Setting()    
    torch.manual_seed(setting.seed)
    print('Start time: ', datetime.now().strftime('%H:%M:%S'))
    
    # Training
    print('SemiLT start:')
    model_stage1 = Training(setting)
    for epoch in range(setting.epochs):
        print('Epoch:', epoch)
        model_stage1.train(epoch)
    
    print('Write embeddings')
    model_stage1.write_embeddings()
    print('SemiLT finished: ', datetime.now().strftime('%H:%M:%S'))
    
    # Label transfer
    print('Label transfer:')
    Transfer(setting, neighbors=10, knn_rna_samples=50000)
    print('Label transfer finished: ', datetime.now().strftime('%H:%M:%S'))
    
    # Split the elapsed wall-clock time into hours, minutes, and seconds
    run_time = time.time() - start_time
    hours, remainder = divmod(int(run_time), 3600)
    minutes, seconds = divmod(remainder, 60)
    print(f"Run time：{hours}: {minutes}: {seconds}")
    
if __name__ == "__main__":
    main()

Start time:  20:15:39
SemiLT start:
num_workers: 0
load h5ad matrix: /users/PCON0022/wangxiaoying/czt/myJoint/5-SemiLT/data_ms/ms_rna.h5ad
load h5ad matrix: /users/PCON0022/wangxiaoying/czt/myJoint/5-SemiLT/data_ms/ms_atac.h5ad
load h5ad matrix: /users/PCON0022/wangxiaoying/czt/myJoint/5-SemiLT/data_ms/ms_PCA50.h5ad
Epoch: 0
LR is set to 0.008
LR is set to 0.008
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6
Epoch: 7
Epoch: 8
Epoch: 9
Epoch: 10
Epoch: 11
Epoch: 12
Epoch: 13
Epoch: 14
Epoch: 15
Epoch: 16
Epoch: 17
Epoch: 18
Epoch: 19
Write embeddings
SemiLT finished:  20:19:06
Label transfer:
[Label transfer] Read RNA data
[Label transfer] Read ATAC data
[Label transfer] Build Space
[Label transfer] finished
ARI:0.874840
Recall：0.896715
Precision:0.898406
F1-score：0.895107
Label transfer finished:  20:19:09
Run time：0: 3: 29
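
The `Transfer` call takes a `neighbors` argument, which suggests a k-nearest-neighbour vote in the shared embedding space. The sketch below illustrates that general idea in plain Python on toy 2-D points. It is an assumption about the approach, not SemiLT's actual implementation, and `knn_transfer` and the toy data are hypothetical.

```python
from collections import Counter
import math

def knn_transfer(ref_embeddings, ref_labels, query_embeddings, k=3):
    """Give each query point the majority label of its k nearest reference points."""
    predictions = []
    for q in query_embeddings:
        # Sort reference indices by Euclidean distance to the query.
        order = sorted(range(len(ref_embeddings)),
                       key=lambda i: math.dist(q, ref_embeddings[i]))
        # Count the labels of the k closest reference points.
        votes = Counter(ref_labels[i] for i in order[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy embeddings: two well-separated groups of labelled reference (RNA) cells.
rna_emb = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
rna_labels = ["B", "B", "B", "T", "T", "T"]
# Unlabelled query (ATAC) cells embedded in the same space.
atac_emb = [(0.05, 0.1), (5.0, 5.1)]

print(knn_transfer(rna_emb, rna_labels, atac_emb, k=3))
```

Comparing such predicted labels with the annotated ATAC `cell_type` column is what makes scores like the ARI and F1 values reported above possible.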
