### Notebook for preparing the Human Cell Atlas (human heart leucocytes) dataset as a reference dataset for seed label transfer to mouse ACM heart (merged Pkp2+Ttn dataset) using `scANVI` 

#### Environment: scANVI

- **Developed by:** Alexandra Cirnu
- **Modified by:** Alexandra Cirnu
- **Würzburg Institute for Systems Immunology & Julius-Maximilian-Universität Würzburg**
- **Date of creation:** 240308
- **Date of modification:** 240319

Healthy human heart leucocytes from the Human Cell Atlas were used to generate a reference for seed labelling.

**Strategy:** First seed-label the ACM data set as 'myeloid' or 'lymphoid' so that it can be subset by lineage. Then seed-label the lymphoid and the myeloid subsets separately, using the reference cell states.

#### Import required modules

In [49]:
import anndata
import warnings
import numpy as np
import scanpy as sc
import pandas as pd

In [50]:
sc.settings.verbosity = 3
sc.logging.print_versions()
sc.settings.set_figure_params(dpi = 180, color_map = 'magma_r', dpi_save = 300, vector_friendly = True, format = 'svg')

-----
anndata     0.10.5.post1
scanpy      1.9.8
-----
PIL                 10.2.0
asttokens           NA
colorama            0.4.6
comm                0.2.1
cycler              0.12.1
cython_runtime      NA
dateutil            2.9.0
debugpy             1.8.1
decorator           5.1.1
exceptiongroup      1.2.0
executing           2.0.1
h5py                3.10.0
ipykernel           6.29.3
ipywidgets          8.1.2
jedi                0.19.1
joblib              1.3.2
kiwisolver          1.4.5
llvmlite            0.42.0
matplotlib          3.8.3
matplotlib_inline   0.1.6
mpl_toolkits        NA
natsort             8.4.0
numba               0.59.0
numpy               1.26.4
packaging           23.2
pandas              2.2.1
parso               0.8.3
platformdirs        4.2.0
prompt_toolkit      3.0.43
psutil              5.9.8
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA

In [51]:
def X_is_raw(adata):  # heuristic: raw counts give integer-valued per-gene (column) sums
    return np.array_equal(adata.X.sum(axis=0).astype(int), adata.X.sum(axis=0))
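
The `X_is_raw` helper compares per-gene column sums, which can be whole numbers even when individual entries are not. The sketch below is a stricter, entry-wise variant. The name `looks_like_raw_counts` and the toy matrix are illustrative assumptions, not part of the original notebook, and sparse input is detected only by the presence of a `toarray` method.

```python
import numpy as np


def looks_like_raw_counts(X):
    """Return True if every stored value is a non-negative whole number."""
    # Sparse matrices (e.g. scipy CSR) keep their stored entries in `.data`;
    # anything else is treated as a dense array.
    values = np.asarray(X.data if hasattr(X, "toarray") else X)
    return bool(np.all(values >= 0) and np.all(values % 1 == 0))


# Toy matrix (assumption): each column sums to 1, but no entry is an integer.
normalised = np.array([[0.5, 0.25], [0.5, 0.75]])
column_sums = normalised.sum(axis=0)

# Column-sum heuristic, as used by X_is_raw
print(np.array_equal(column_sums.astype(int), column_sums))
# Entry-wise check on the same matrix
print(looks_like_raw_counts(normalised))
# Entry-wise check on integer counts
print(looks_like_raw_counts(np.array([[0, 3], [1, 0]])))
```

On the toy matrix the column-sum heuristic passes while the entry-wise check does not, so the stricter check may be worth running once on `adata.X` before the counts are stored as a layer.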

### Read in Human Cell Atlas Cardiac Leucocytes data set

Data downloaded from [here](https://cellgeni.cog.sanger.ac.uk/heartcellatlas/data/hca_heart_immune_download.h5ad)

In [52]:
adata = sc.read_h5ad('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/hca_heart_immune_download.h5ad')
adata

AnnData object with n_obs × n_vars = 40868 × 33538
    obs: 'NRP', 'age_group', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'type', 'version', 'scNym', 'scNym_confidence'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities'
    obsm: 'X_pca', 'X_scnym', 'X_umap'

In [53]:
# Drop metadata columns that are not needed for the reference
adata.obs = adata.obs.drop(columns=['age_group', 'percent_mito', 'percent_ribo',
                                    'scrublet_score', 'type', 'version'])

adata.obs

|  | NRP | cell_source | cell_states | donor | gender | n_counts | n_genes | region | sample | scNym | scNym_confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAGTGAAGTCGGCCT-1-H0015_apex | No | Harvard-Nuclei | CD4+T_cytox | H5 | Female | 724.717285 | 588 | AX | H0015_apex | CD4+T_cell | 0.797180 |
| AAATGGAAGGTCCCTG-1-H0015_apex | No | Harvard-Nuclei | CD4+T_cytox | H5 | Female | 668.059509 | 515 | AX | H0015_apex | CD4+T_cell | 0.999248 |
| AAATGGAGTTGTCTAG-1-H0015_apex | No | Harvard-Nuclei | doublets | H5 | Female | 670.216309 | 504 | AX | H0015_apex | NK | 0.680673 |
| AACAACCGTAATTGGA-1-H0015_apex | No | Harvard-Nuclei | DOCK4+MØ1 | H5 | Female | 730.082947 | 578 | AX | H0015_apex | CD14+Monocyte | 0.538159 |
| AAGACTCTCAGGACGA-1-H0015_apex | No | Harvard-Nuclei | Mast | H5 | Female | 612.323425 | 428 | AX | H0015_apex | Mast | 0.990977 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGATCGTGTCATGT-1-HCAHeart8102862 | Yes | Sanger-CD45 | CD4+T_cytox | D11 | Female | 631.149170 | 715 | AX | HCAHeart8102862 | CD8+T_cell | 0.756579 |
| TTTGATCGTTCTCCTG-1-HCAHeart8102862 | Yes | Sanger-CD45 | LYVE1+MØ1 | D11 | Female | 819.040100 | 2526 | AX | HCAHeart8102862 | CD8+T_cell | 0.269561 |
| TTTGGAGGTCGCTCGA-1-HCAHeart8102862 | Yes | Sanger-CD45 | MØ_AgP | D11 | Female | 757.455505 | 1350 | AX | HCAHeart8102862 | M3 | 0.585436 |
| TTTGGTTTCAGTGTTG-1-HCAHeart8102862 | Yes | Sanger-CD45 | LYVE1+MØ3 | D11 | Female | 815.372131 | 2507 | AX | HCAHeart8102862 | MØ | 0.968681 |


In [54]:
X_is_raw(adata)

True

In [55]:
adata.obs['cell_states'].cat.categories

Index(['NK', 'CD16+Mo', 'DOCK4+MØ1', 'CD4+T_cytox', 'LYVE1+MØ1', 'CD8+T_tem',
       'CD8+T_cytox', 'LYVE1+MØ2', 'LYVE1+MØ3', 'CD14+Mo', 'Mo_pi',
       'DOCK4+MØ2', 'Mast', 'NKT', 'MØ_mod', 'MØ_AgP', 'B_cells', 'CD4+T_tem',
       'DC', 'doublets', 'NØ', 'IL17RA+Mo'],
      dtype='object')

##### Assign lineage categories (myeloid, lymphoid, doublets) based on the `cell_states` categories

Cell State  | Lineage       | Cell State  | Lineage
:---:       | :---:         | :---:       | :---:
NK          | lymphoid      | DOCK4+MØ2   | myeloid 
CD16+Mo     | myeloid       | Mast        | myeloid
DOCK4+MØ1   | myeloid       | NKT         | lymphoid 
CD4+T_cytox | lymphoid      | MØ_mod      | myeloid
LYVE1+MØ1   | myeloid       | MØ_AgP      | myeloid 
CD8+T_tem   | lymphoid      | B_cells     | lymphoid 
CD8+T_cytox | lymphoid      | CD4+T_tem   | lymphoid
LYVE1+MØ2   | myeloid       | DC          | myeloid
LYVE1+MØ3   | myeloid       | doublets    | doublets 
CD14+Mo     | myeloid       | NØ          | myeloid
Mo_pi       | myeloid       | IL17RA+Mo   | myeloid
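
The table above can also be written as an explicit lookup, so that a cell state missing from the table raises an error instead of silently falling into a default lineage. This is a minimal sketch: the names `LINEAGE_OF` and `assign_lineage` are hypothetical and do not appear in the original notebook.

```python
# Explicit cell-state -> lineage lookup, transcribed from the table above.
LYMPHOID = ["NK", "CD4+T_cytox", "CD8+T_tem", "CD8+T_cytox",
            "NKT", "B_cells", "CD4+T_tem"]
MYELOID = ["CD16+Mo", "DOCK4+MØ1", "LYVE1+MØ1", "LYVE1+MØ2", "LYVE1+MØ3",
           "CD14+Mo", "Mo_pi", "DOCK4+MØ2", "Mast", "MØ_mod", "MØ_AgP",
           "DC", "NØ", "IL17RA+Mo"]

LINEAGE_OF = {state: "lymphoid" for state in LYMPHOID}
LINEAGE_OF.update({state: "myeloid" for state in MYELOID})
LINEAGE_OF["doublets"] = "doublets"


def assign_lineage(cell_states):
    """Map each cell state to its lineage; fail loudly on unknown states."""
    unknown = sorted(set(cell_states) - set(LINEAGE_OF))
    if unknown:
        raise KeyError(f"cell states without a lineage: {unknown}")
    return [LINEAGE_OF[state] for state in cell_states]


print(assign_lineage(["NK", "Mast", "doublets"]))
```

With pandas, the same dictionary could be applied to the annotation column via `adata.obs['cell_states'].map(LINEAGE_OF)`.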

In [56]:
adata.obs['lineage'] = "myeloid"
adata.obs.loc[adata.obs['cell_states'].isin(['NK', 'CD4+T_cytox', 'CD8+T_tem', 'CD8+T_cytox', 'NKT', 'B_cells', 'CD4+T_tem']), 'lineage'] = 'lymphoid'
adata.obs.loc[adata.obs['cell_states'].isin(['doublets']), 'lineage'] = 'doublets'
adata.obs

|  | NRP | cell_source | cell_states | donor | gender | n_counts | n_genes | region | sample | scNym | scNym_confidence | lineage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAGTGAAGTCGGCCT-1-H0015_apex | No | Harvard-Nuclei | CD4+T_cytox | H5 | Female | 724.717285 | 588 | AX | H0015_apex | CD4+T_cell | 0.797180 | lymphoid |
| AAATGGAAGGTCCCTG-1-H0015_apex | No | Harvard-Nuclei | CD4+T_cytox | H5 | Female | 668.059509 | 515 | AX | H0015_apex | CD4+T_cell | 0.999248 | lymphoid |
| AAATGGAGTTGTCTAG-1-H0015_apex | No | Harvard-Nuclei | doublets | H5 | Female | 670.216309 | 504 | AX | H0015_apex | NK | 0.680673 | doublets |
| AACAACCGTAATTGGA-1-H0015_apex | No | Harvard-Nuclei | DOCK4+MØ1 | H5 | Female | 730.082947 | 578 | AX | H0015_apex | CD14+Monocyte | 0.538159 | myeloid |
| AAGACTCTCAGGACGA-1-H0015_apex | No | Harvard-Nuclei | Mast | H5 | Female | 612.323425 | 428 | AX | H0015_apex | Mast | 0.990977 | myeloid |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGATCGTGTCATGT-1-HCAHeart8102862 | Yes | Sanger-CD45 | CD4+T_cytox | D11 | Female | 631.149170 | 715 | AX | HCAHeart8102862 | CD8+T_cell | 0.756579 | lymphoid |
| TTTGATCGTTCTCCTG-1-HCAHeart8102862 | Yes | Sanger-CD45 | LYVE1+MØ1 | D11 | Female | 819.040100 | 2526 | AX | HCAHeart8102862 | CD8+T_cell | 0.269561 | myeloid |
| TTTGGAGGTCGCTCGA-1-HCAHeart8102862 | Yes | Sanger-CD45 | MØ_AgP | D11 | Female | 757.455505 | 1350 | AX | HCAHeart8102862 | M3 | 0.585436 | myeloid |
| TTTGGTTTCAGTGTTG-1-HCAHeart8102862 | Yes | Sanger-CD45 | LYVE1+MØ3 | D11 | Female | 815.372131 | 2507 | AX | HCAHeart8102862 | MØ | 0.968681 | myeloid |


### Remove doublets

In [57]:
adata = adata[~adata.obs['lineage'].isin(["doublets"])]
adata

View of AnnData object with n_obs × n_vars = 40245 × 33538
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities'
    obsm: 'X_pca', 'X_scnym', 'X_umap'

In [58]:
adata_raw = adata.copy()

### Export reference data set containing lymphoids and myeloids

In [59]:
reference_lymph_myeloid = adata.copy()
reference_lymph_myeloid.layers['counts'] = adata.X.copy()
reference_lymph_myeloid.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_myeloids+lymphoids_healthy_reference_ac240319.raw.h5ad')   

### Subset the data set to contain only **myeloid** cells

In [60]:
# Copy the subset so that adding a layer later modifies an independent object, not a view
adata_myeloid = adata[adata.obs['lineage'].isin(["myeloid"])].copy()
adata_myeloid.obs

|  | NRP | cell_source | cell_states | donor | gender | n_counts | n_genes | region | sample | scNym | scNym_confidence | lineage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AACAACCGTAATTGGA-1-H0015_apex | No | Harvard-Nuclei | DOCK4+MØ1 | H5 | Female | 730.082947 | 578 | AX | H0015_apex | CD14+Monocyte | 0.538159 | myeloid |
| AAGACTCTCAGGACGA-1-H0015_apex | No | Harvard-Nuclei | Mast | H5 | Female | 612.323425 | 428 | AX | H0015_apex | Mast | 0.990977 | myeloid |
| AAGCATCGTTCGCGTG-1-H0015_apex | No | Harvard-Nuclei | LYVE1+MØ2 | H5 | Female | 887.090454 | 1124 | AX | H0015_apex | MØ | 0.816231 | myeloid |
| AAGCCATCAGCACAGA-1-H0015_apex | No | Harvard-Nuclei | LYVE1+MØ3 | H5 | Female | 905.353455 | 1301 | AX | H0015_apex | MØ | 0.994292 | myeloid |
| AAGCGAGTCAAGGTGG-1-H0015_apex | No | Harvard-Nuclei | Mast | H5 | Female | 605.308716 | 401 | AX | H0015_apex | Mast | 0.887313 | myeloid |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGATCCACCACTGG-1-HCAHeart8102862 | Yes | Sanger-CD45 | LYVE1+MØ1 | D11 | Female | 869.875061 | 2411 | AX | HCAHeart8102862 | MØ | 0.986432 | myeloid |
| TTTGATCGTTCTCCTG-1-HCAHeart8102862 | Yes | Sanger-CD45 | LYVE1+MØ1 | D11 | Female | 819.040100 | 2526 | AX | HCAHeart8102862 | CD8+T_cell | 0.269561 | myeloid |
| TTTGGAGGTCGCTCGA-1-HCAHeart8102862 | Yes | Sanger-CD45 | MØ_AgP | D11 | Female | 757.455505 | 1350 | AX | HCAHeart8102862 | M3 | 0.585436 | myeloid |
| TTTGGTTTCAGTGTTG-1-HCAHeart8102862 | Yes | Sanger-CD45 | LYVE1+MØ3 | D11 | Female | 815.372131 | 2507 | AX | HCAHeart8102862 | MØ | 0.968681 | myeloid |


In [61]:
adata_myeloid.layers['counts'] = adata_myeloid.X.copy()
adata_myeloid

AnnData object with n_obs × n_vars = 23842 × 33538
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities'
    obsm: 'X_pca', 'X_scnym', 'X_umap'
    layers: 'counts'

In [62]:
X_is_raw(adata_myeloid)

True

In [63]:
adata_myeloid.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_myeloids_healthy_reference_ac240319.raw.h5ad')  

### Subset the data set to contain only **lymphoid** cells

In [64]:
# Copy the subset so that adding a layer later modifies an independent object, not a view
adata_lymphoid = adata[adata.obs['lineage'].isin(["lymphoid"])].copy()
adata_lymphoid.obs

|  | NRP | cell_source | cell_states | donor | gender | n_counts | n_genes | region | sample | scNym | scNym_confidence | lineage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAGTGAAGTCGGCCT-1-H0015_apex | No | Harvard-Nuclei | CD4+T_cytox | H5 | Female | 724.717285 | 588 | AX | H0015_apex | CD4+T_cell | 0.797180 | lymphoid |
| AAATGGAAGGTCCCTG-1-H0015_apex | No | Harvard-Nuclei | CD4+T_cytox | H5 | Female | 668.059509 | 515 | AX | H0015_apex | CD4+T_cell | 0.999248 | lymphoid |
| AAGTTCGCAGTGTATC-1-H0015_apex | No | Harvard-Nuclei | CD4+T_cytox | H5 | Female | 759.100281 | 633 | AX | H0015_apex | CD4+T_cell | 0.979933 | lymphoid |
| AATCACGTCCCGTAAA-1-H0015_apex | No | Harvard-Nuclei | CD8+T_tem | H5 | Female | 715.088623 | 604 | AX | H0015_apex | CD4+T_cell | 0.987953 | lymphoid |
| AATGACCGTGTTGCCG-1-H0015_apex | No | Harvard-Nuclei | NK | H5 | Female | 743.207153 | 645 | AX | H0015_apex | NK | 0.759773 | lymphoid |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTCACATCGAACACT-1-HCAHeart8102862 | Yes | Sanger-CD45 | B_cells | D11 | Female | 736.678040 | 1991 | AX | HCAHeart8102862 | DC | 0.999435 | lymphoid |
| TTTCAGTCAGTACTAC-1-HCAHeart8102862 | Yes | Sanger-CD45 | NKT | D11 | Female | 774.458374 | 1108 | AX | HCAHeart8102862 | NK | 0.985920 | lymphoid |
| TTTCGATAGCGCGTTC-1-HCAHeart8102862 | Yes | Sanger-CD45 | NK | D11 | Female | 778.039612 | 1215 | AX | HCAHeart8102862 | NK | 0.984224 | lymphoid |
| TTTGACTGTATGAGAT-1-HCAHeart8102862 | Yes | Sanger-CD45 | CD8+T_tem | D11 | Female | 707.116882 | 957 | AX | HCAHeart8102862 | CD8+T_cell | 0.982434 | lymphoid |


In [65]:
adata_lymphoid.layers['counts'] = adata_lymphoid.X.copy()
adata_lymphoid

AnnData object with n_obs × n_vars = 16403 × 33538
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities'
    obsm: 'X_pca', 'X_scnym', 'X_umap'
    layers: 'counts'

In [66]:
X_is_raw(adata_lymphoid)

True

In [67]:
adata_lymphoid.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_lymphoids_healthy_reference_ac240319.raw.h5ad')  
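
As a closing sanity check, the cell numbers printed in the outputs above can be reconciled with simple arithmetic. The sketch below hard-codes those printed `n_obs` values; the variable names are illustrative, and the doublet count is derived here rather than reported by the notebook.

```python
# n_obs values copied from the AnnData summaries printed above
n_total = 40868        # downloaded HCA immune object
n_no_doublets = 40245  # after removing the 'doublets' lineage
n_myeloid = 23842      # myeloid subset
n_lymphoid = 16403     # lymphoid subset

# Cells removed as doublets (derived value)
n_doublets = n_total - n_no_doublets

# The two lineage subsets should partition the doublet-free data set
assert n_myeloid + n_lymphoid == n_no_doublets

print(n_doublets)  # 623
```

In the notebook itself the same check could be run on `adata.n_obs`, `adata_myeloid.n_obs` and `adata_lymphoid.n_obs` rather than on copied numbers.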