### Notebook for preparing the Human Cell Atlas (human heart leucocytes) dataset as a reference dataset for seed label transfer to mouse ACM heart (merged Pkp2+Ttn dataset) using `scANVI` 

#### Environment: scANVI

- **Developed by:** Alexandra Cirnu
- **Modified by:** Alexandra Cirnu
- **Würzburg Institute for Systems Immunology & Julius-Maximilian-Universität Würzburg**
- **Date of creation:** 240308
- **Date of modification:** 240318

Healthy human heart leucocytes (from the Human Cell Atlas) were used to generate a reference for seed labelling.

**Strategy:** Restrict the reference to lymphoid cells only, then select highly variable genes at several set sizes (1500, 3000, 5000, 7500 and 10000) and save each selection as a separate reference file.

#### Import required modules

In [16]:
import anndata
import warnings
import numpy as np
import scanpy as sc
import pandas as pd

In [17]:
sc.settings.verbosity = 3
sc.logging.print_versions()
sc.settings.set_figure_params(dpi = 180, color_map = 'magma_r', dpi_save = 300, vector_friendly = True, format = 'svg')

-----
anndata     0.10.5.post1
scanpy      1.9.8
-----
PIL                 10.2.0
asttokens           NA
colorama            0.4.6
comm                0.2.1
cycler              0.12.1
cython_runtime      NA
dateutil            2.9.0
debugpy             1.8.1
decorator           5.1.1
exceptiongroup      1.2.0
executing           2.0.1
h5py                3.10.0
ipykernel           6.29.3
ipywidgets          8.1.2
jedi                0.19.1
joblib              1.3.2
kiwisolver          1.4.5
llvmlite            0.42.0
matplotlib          3.8.3
matplotlib_inline   0.1.6
mpl_toolkits        NA
natsort             8.4.0
numba               0.59.0
numpy               1.26.4
packaging           23.2
pandas              2.2.1
parso               0.8.3
platformdirs        4.2.0
prompt_toolkit      3.0.43
psutil              5.9.8
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA

In [18]:
def X_is_raw(adata):
    return np.array_equal(adata.X.sum(axis=0).astype(int), adata.X.sum(axis=0))
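
The helper above compares per-gene column sums with their integer-truncated values, so a matrix whose fractional entries happen to add up to whole numbers in every column would still pass. A stricter, element-wise variant could look like the sketch below. The name `is_integer_counts` and the toy matrices are hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy import sparse


def is_integer_counts(X):
    """Return True if every value in X is a non-negative whole number."""
    # For sparse matrices only the explicitly stored values need checking;
    # implicit zeros are whole numbers by definition.
    values = X.data if sparse.issparse(X) else np.asarray(X)
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


# Hypothetical toy data: a sparse count matrix and a dense non-integer matrix.
raw = sparse.csr_matrix(np.array([[0, 2, 1], [3, 0, 0]], dtype=np.float32))
# Column sums here are whole numbers (1, 2, 0) although the entries are not.
normalised = np.array([[0.5, 1.5, 0.0], [0.5, 0.5, 0.0]])

print(is_integer_counts(raw))         # True
print(is_integer_counts(normalised))  # False
```

The second matrix passes a column-sum comparison but fails the element-wise check, which is the case this variant is meant to catch.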

### Read in Human Cell Atlas Cardiac Leucocytes data set

Data downloaded from [here](https://cellgeni.cog.sanger.ac.uk/heartcellatlas/data/hca_heart_immune_download.h5ad)

In [19]:
adata = sc.read_h5ad('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/hca_heart_immune_download.h5ad')
adata

AnnData object with n_obs × n_vars = 40868 × 33538
    obs: 'NRP', 'age_group', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'percent_mito', 'percent_ribo', 'region', 'sample', 'scrublet_score', 'type', 'version', 'scNym', 'scNym_confidence'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities'
    obsm: 'X_pca', 'X_scnym', 'X_umap'

In [20]:
# Drop metadata columns that are not needed for the reference
adata.obs = adata.obs.drop(columns = ['age_group', 'percent_mito', 'percent_ribo', 'scrublet_score', 'type', 'version'])

adata.obs

Unnamed: 0,NRP,cell_source,cell_states,donor,gender,n_counts,n_genes,region,sample,scNym,scNym_confidence
AAAGTGAAGTCGGCCT-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,724.717285,588,AX,H0015_apex,CD4+T_cell,0.797180
AAATGGAAGGTCCCTG-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,668.059509,515,AX,H0015_apex,CD4+T_cell,0.999248
AAATGGAGTTGTCTAG-1-H0015_apex,No,Harvard-Nuclei,doublets,H5,Female,670.216309,504,AX,H0015_apex,NK,0.680673
AACAACCGTAATTGGA-1-H0015_apex,No,Harvard-Nuclei,DOCK4+MØ1,H5,Female,730.082947,578,AX,H0015_apex,CD14+Monocyte,0.538159
AAGACTCTCAGGACGA-1-H0015_apex,No,Harvard-Nuclei,Mast,H5,Female,612.323425,428,AX,H0015_apex,Mast,0.990977
...,...,...,...,...,...,...,...,...,...,...,...
TTTGATCGTGTCATGT-1-HCAHeart8102862,Yes,Sanger-CD45,CD4+T_cytox,D11,Female,631.149170,715,AX,HCAHeart8102862,CD8+T_cell,0.756579
TTTGATCGTTCTCCTG-1-HCAHeart8102862,Yes,Sanger-CD45,LYVE1+MØ1,D11,Female,819.040100,2526,AX,HCAHeart8102862,CD8+T_cell,0.269561
TTTGGAGGTCGCTCGA-1-HCAHeart8102862,Yes,Sanger-CD45,MØ_AgP,D11,Female,757.455505,1350,AX,HCAHeart8102862,M3,0.585436
TTTGGTTTCAGTGTTG-1-HCAHeart8102862,Yes,Sanger-CD45,LYVE1+MØ3,D11,Female,815.372131,2507,AX,HCAHeart8102862,MØ,0.968681


In [21]:
X_is_raw(adata)

True

In [22]:
adata.obs['cell_states'].cat.categories

Index(['NK', 'CD16+Mo', 'DOCK4+MØ1', 'CD4+T_cytox', 'LYVE1+MØ1', 'CD8+T_tem',
       'CD8+T_cytox', 'LYVE1+MØ2', 'LYVE1+MØ3', 'CD14+Mo', 'Mo_pi',
       'DOCK4+MØ2', 'Mast', 'NKT', 'MØ_mod', 'MØ_AgP', 'B_cells', 'CD4+T_tem',
       'DC', 'doublets', 'NØ', 'IL17RA+Mo'],
      dtype='object')

##### Based on the cell_states categories assign lineage categories (myeloid, lymphoid, doublets)
Cell State  | Lineage       | Cell State  | Lineage
:---:       | :---:         | :---:       | :---:
NK          | lymphoid      | DOCK4+MØ2   | myeloid 
CD16+Mo     | myeloid       | Mast        | myeloid
DOCK4+MØ1   | myeloid       | NKT         | lymphoid 
CD4+T_cytox | lymphoid      | MØ_mod      | myeloid
LYVE1+MØ1   | myeloid       | MØ_AgP      | myeloid 
CD8+T_tem   | lymphoid      | B_cells     | lymphoid 
CD8+T_cytox | lymphoid      | CD4+T_tem   | lymphoid
LYVE1+MØ2   | myeloid       | DC          | myeloid
LYVE1+MØ3   | myeloid       | doublets    | doublets 
CD14+Mo     | myeloid       | NØ          | myeloid
Mo_pi       | myeloid       | IL17RA+Mo   | myeloid
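
The table above can also be expressed as a lookup dictionary, which keeps the assignment in one place. The following is a minimal sketch on a toy series; `LYMPHOID_STATES`, `lineage_of` and `assign_lineage` are hypothetical names. As in the notebook, any state that is neither lymphoid nor `doublets` falls back to `myeloid`.

```python
import pandas as pd

# Lymphoid cell states as listed in the table above.
LYMPHOID_STATES = ['NK', 'CD4+T_cytox', 'CD8+T_tem', 'CD8+T_cytox',
                   'NKT', 'B_cells', 'CD4+T_tem']

# Explicit lookup: lymphoid states plus the doublet label.
lineage_of = {state: 'lymphoid' for state in LYMPHOID_STATES}
lineage_of['doublets'] = 'doublets'


def assign_lineage(cell_states):
    """Map cell states to a lineage; states not in the lookup become 'myeloid'."""
    states = pd.Series(cell_states).astype(str)
    return states.map(lineage_of).fillna('myeloid')


# Hypothetical toy input for illustration only.
toy_states = pd.Series(['NK', 'Mast', 'doublets', 'B_cells', 'CD14+Mo'])
print(assign_lineage(toy_states).tolist())
# ['lymphoid', 'myeloid', 'doublets', 'lymphoid', 'myeloid']
```

Because the fallback is silent, a misspelled or newly added state would be labelled myeloid; checking the categories beforehand guards against that.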

In [23]:
adata.obs['lineage'] = "myeloid"
adata.obs.loc[adata.obs['cell_states'].isin(['NK', 'CD4+T_cytox', 'CD8+T_tem', 'CD8+T_cytox', 'NKT', 'B_cells', 'CD4+T_tem']), 'lineage'] = 'lymphoid'
adata.obs.loc[adata.obs['cell_states'].isin(['doublets']), 'lineage'] = 'doublets'
adata.obs

Unnamed: 0,NRP,cell_source,cell_states,donor,gender,n_counts,n_genes,region,sample,scNym,scNym_confidence,lineage
AAAGTGAAGTCGGCCT-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,724.717285,588,AX,H0015_apex,CD4+T_cell,0.797180,lymphoid
AAATGGAAGGTCCCTG-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,668.059509,515,AX,H0015_apex,CD4+T_cell,0.999248,lymphoid
AAATGGAGTTGTCTAG-1-H0015_apex,No,Harvard-Nuclei,doublets,H5,Female,670.216309,504,AX,H0015_apex,NK,0.680673,doublets
AACAACCGTAATTGGA-1-H0015_apex,No,Harvard-Nuclei,DOCK4+MØ1,H5,Female,730.082947,578,AX,H0015_apex,CD14+Monocyte,0.538159,myeloid
AAGACTCTCAGGACGA-1-H0015_apex,No,Harvard-Nuclei,Mast,H5,Female,612.323425,428,AX,H0015_apex,Mast,0.990977,myeloid
...,...,...,...,...,...,...,...,...,...,...,...,...
TTTGATCGTGTCATGT-1-HCAHeart8102862,Yes,Sanger-CD45,CD4+T_cytox,D11,Female,631.149170,715,AX,HCAHeart8102862,CD8+T_cell,0.756579,lymphoid
TTTGATCGTTCTCCTG-1-HCAHeart8102862,Yes,Sanger-CD45,LYVE1+MØ1,D11,Female,819.040100,2526,AX,HCAHeart8102862,CD8+T_cell,0.269561,myeloid
TTTGGAGGTCGCTCGA-1-HCAHeart8102862,Yes,Sanger-CD45,MØ_AgP,D11,Female,757.455505,1350,AX,HCAHeart8102862,M3,0.585436,myeloid
TTTGGTTTCAGTGTTG-1-HCAHeart8102862,Yes,Sanger-CD45,LYVE1+MØ3,D11,Female,815.372131,2507,AX,HCAHeart8102862,MØ,0.968681,myeloid


In [24]:
adata_raw = adata.copy()

##### Subset the data set to contain only lymphoid cells

In [25]:
adata = adata[adata.obs['lineage'].isin(["lymphoid"])].copy()  # copy so the subset is a standalone AnnData, not a view
adata.obs

Unnamed: 0,NRP,cell_source,cell_states,donor,gender,n_counts,n_genes,region,sample,scNym,scNym_confidence,lineage
AAAGTGAAGTCGGCCT-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,724.717285,588,AX,H0015_apex,CD4+T_cell,0.797180,lymphoid
AAATGGAAGGTCCCTG-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,668.059509,515,AX,H0015_apex,CD4+T_cell,0.999248,lymphoid
AAGTTCGCAGTGTATC-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,759.100281,633,AX,H0015_apex,CD4+T_cell,0.979933,lymphoid
AATCACGTCCCGTAAA-1-H0015_apex,No,Harvard-Nuclei,CD8+T_tem,H5,Female,715.088623,604,AX,H0015_apex,CD4+T_cell,0.987953,lymphoid
AATGACCGTGTTGCCG-1-H0015_apex,No,Harvard-Nuclei,NK,H5,Female,743.207153,645,AX,H0015_apex,NK,0.759773,lymphoid
...,...,...,...,...,...,...,...,...,...,...,...,...
TTTCACATCGAACACT-1-HCAHeart8102862,Yes,Sanger-CD45,B_cells,D11,Female,736.678040,1991,AX,HCAHeart8102862,DC,0.999435,lymphoid
TTTCAGTCAGTACTAC-1-HCAHeart8102862,Yes,Sanger-CD45,NKT,D11,Female,774.458374,1108,AX,HCAHeart8102862,NK,0.985920,lymphoid
TTTCGATAGCGCGTTC-1-HCAHeart8102862,Yes,Sanger-CD45,NK,D11,Female,778.039612,1215,AX,HCAHeart8102862,NK,0.984224,lymphoid
TTTGACTGTATGAGAT-1-HCAHeart8102862,Yes,Sanger-CD45,CD8+T_tem,D11,Female,707.116882,957,AX,HCAHeart8102862,CD8+T_cell,0.982434,lymphoid


In [26]:
adata.layers['counts'] = adata.X.copy()


In [27]:
adata.obs["general_cell_types"] = adata.obs['cell_states'].copy()

In [28]:
adata.obs["general_cell_types"].cat.categories

Index(['NK', 'CD4+T_cytox', 'CD8+T_tem', 'CD8+T_cytox', 'NKT', 'B_cells',
       'CD4+T_tem'],
      dtype='object')

In [29]:
# Collapse fine-grained cell states into broader lymphoid cell types
state_to_type = {
    'NK': 'NK',
    'CD4+T_cytox': 'CD4',
    'CD4+T_tem': 'CD4',
    'CD8+T_tem': 'CD8',
    'CD8+T_cytox': 'CD8',
    'NKT': 'NKT',
    'B_cells': 'B',
}

# Replace in a single step to avoid pandas chained assignment;
# states without an entry in the dictionary keep their original name
adata.obs['general_cell_types'] = adata.obs['cell_states'].astype(str).replace(state_to_type)

In [30]:
adata.obs

Unnamed: 0,NRP,cell_source,cell_states,donor,gender,n_counts,n_genes,region,sample,scNym,scNym_confidence,lineage,general_cell_types
AAAGTGAAGTCGGCCT-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,724.717285,588,AX,H0015_apex,CD4+T_cell,0.797180,lymphoid,CD4
AAATGGAAGGTCCCTG-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,668.059509,515,AX,H0015_apex,CD4+T_cell,0.999248,lymphoid,CD4
AAGTTCGCAGTGTATC-1-H0015_apex,No,Harvard-Nuclei,CD4+T_cytox,H5,Female,759.100281,633,AX,H0015_apex,CD4+T_cell,0.979933,lymphoid,CD4
AATCACGTCCCGTAAA-1-H0015_apex,No,Harvard-Nuclei,CD8+T_tem,H5,Female,715.088623,604,AX,H0015_apex,CD4+T_cell,0.987953,lymphoid,CD8
AATGACCGTGTTGCCG-1-H0015_apex,No,Harvard-Nuclei,NK,H5,Female,743.207153,645,AX,H0015_apex,NK,0.759773,lymphoid,NK
...,...,...,...,...,...,...,...,...,...,...,...,...,...
TTTCACATCGAACACT-1-HCAHeart8102862,Yes,Sanger-CD45,B_cells,D11,Female,736.678040,1991,AX,HCAHeart8102862,DC,0.999435,lymphoid,B
TTTCAGTCAGTACTAC-1-HCAHeart8102862,Yes,Sanger-CD45,NKT,D11,Female,774.458374,1108,AX,HCAHeart8102862,NK,0.985920,lymphoid,NKT
TTTCGATAGCGCGTTC-1-HCAHeart8102862,Yes,Sanger-CD45,NK,D11,Female,778.039612,1215,AX,HCAHeart8102862,NK,0.984224,lymphoid,NK
TTTGACTGTATGAGAT-1-HCAHeart8102862,Yes,Sanger-CD45,CD8+T_tem,D11,Female,707.116882,957,AX,HCAHeart8102862,CD8+T_cell,0.982434,lymphoid,CD8
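
A quick way to confirm that every cell state ended up under exactly one general cell type is a cross-tabulation of the two columns. The sketch below uses a small hypothetical `obs` table standing in for `adata.obs`; only the column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs with one row per cell state.
obs = pd.DataFrame({
    'cell_states': ['NK', 'CD4+T_cytox', 'CD4+T_tem', 'CD8+T_tem',
                    'CD8+T_cytox', 'NKT', 'B_cells'],
    'general_cell_types': ['NK', 'CD4', 'CD4', 'CD8', 'CD8', 'NKT', 'B'],
})

# Rows: fine-grained states, columns: collapsed labels, values: cell counts.
table = pd.crosstab(obs['cell_states'], obs['general_cell_types'])

# Each fine-grained state should map to exactly one collapsed label.
labels_per_state = (table > 0).sum(axis=1)
assert (labels_per_state == 1).all()

print(table)
```

On the real data the same two lines (`pd.crosstab` and the row-wise check) would be applied to `adata.obs` directly.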


#### Define 1500 highly variable genes and save as a reference file
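
The selection in this and the following sections passes `batch_key = "donor"`, so gene variability is evaluated per donor before being combined. Donors that contribute very few lymphoid cells may give a noisy per-batch estimate; this is an assumption, not something the notebook reports. A minimal sketch for listing such donors is shown below. The helper `small_batches`, the threshold of 50 cells and the toy donor labels are all hypothetical.

```python
import pandas as pd


def small_batches(batch_labels, min_cells=50):
    """Return the cell count of every batch with fewer than `min_cells` cells."""
    counts = pd.Series(batch_labels).value_counts()
    return counts[counts < min_cells].sort_index()


# Hypothetical donor labels for illustration only.
toy_donors = ['H5'] * 3 + ['D11'] * 60 + ['D2'] * 10
print(small_batches(toy_donors, min_cells=50))
```

On the real object the input would be `adata.obs['donor']`.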

In [31]:
adata1500 = adata.copy()

sc.pp.highly_variable_genes(
    adata1500,
    flavor = "seurat_v3",
    n_top_genes = 1500,
    layer = "counts",
    batch_key = "donor",
    subset = True,
    span = 1
    )

adata1500

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)


AnnData object with n_obs × n_vars = 16403 × 1500
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage', 'general_cell_types'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities', 'hvg'
    obsm: 'X_pca', 'X_scnym', 'X_umap'
    layers: 'counts'

In [32]:
adata1500.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_lymphoids_healthy_reference_HVG1500_ac240308.raw.h5ad')   

#### Define 3000 highly variable genes and save as a reference file

In [33]:
adata3000 = adata.copy()

sc.pp.highly_variable_genes(
    adata3000,
    flavor = "seurat_v3",
    n_top_genes = 3000,
    layer = "counts",
    batch_key = "donor",
    subset = True,
    span = 1
    )

adata3000

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)


AnnData object with n_obs × n_vars = 16403 × 3000
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage', 'general_cell_types'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities', 'hvg'
    obsm: 'X_pca', 'X_scnym', 'X_umap'
    layers: 'counts'

In [34]:
adata3000.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_lymphoids_healthy_reference_HVG3000_ac240308.raw.h5ad')   

#### Define 5000 highly variable genes and save as a reference file

In [35]:
adata5000 = adata.copy()

sc.pp.highly_variable_genes(
    adata5000,
    flavor = "seurat_v3",
    n_top_genes = 5000,
    layer = "counts",
    batch_key = "donor",
    subset = True,
    span = 1
    )

adata5000

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)


AnnData object with n_obs × n_vars = 16403 × 5000
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage', 'general_cell_types'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities', 'hvg'
    obsm: 'X_pca', 'X_scnym', 'X_umap'
    layers: 'counts'

In [36]:
adata5000.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_lymphoids_healthy_reference_HVG5000_ac240308.raw.h5ad')   

#### Define 7500 highly variable genes and save as a reference file

In [37]:
adata7500 = adata.copy()

sc.pp.highly_variable_genes(
    adata7500,
    flavor = "seurat_v3",
    n_top_genes = 7500,
    layer = "counts",
    batch_key = "donor",
    subset = True,
    span = 1
    )

adata7500

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)


AnnData object with n_obs × n_vars = 16403 × 7500
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage', 'general_cell_types'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities', 'hvg'
    obsm: 'X_pca', 'X_scnym', 'X_umap'
    layers: 'counts'

In [38]:
adata7500.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_lymphoids_healthy_reference_HVG7500_ac240308.raw.h5ad')  

#### Define 10000 highly variable genes and save as a reference file

In [39]:
adata10000 = adata.copy()

sc.pp.highly_variable_genes(
    adata10000,
    flavor = "seurat_v3",
    n_top_genes = 10000,
    layer = "counts",
    batch_key = "donor",
    subset = True,
    span = 1
    )

adata10000

If you pass `n_top_genes`, all cutoffs are ignored.
extracting highly variable genes
--> added
    'highly_variable', boolean vector (adata.var)
    'highly_variable_rank', float vector (adata.var)
    'means', float vector (adata.var)
    'variances', float vector (adata.var)
    'variances_norm', float vector (adata.var)


AnnData object with n_obs × n_vars = 16403 × 10000
    obs: 'NRP', 'cell_source', 'cell_states', 'donor', 'gender', 'n_counts', 'n_genes', 'region', 'sample', 'scNym', 'scNym_confidence', 'lineage', 'general_cell_types'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'cell_states_colors', 'scNym_colors', 'scNym_probabilities', 'hvg'
    obsm: 'X_pca', 'X_scnym', 'X_umap'
    layers: 'counts'

In [40]:
adata10000.write('/home/acirnu/data/ACM_cardiac_leuco/Reference_data/HCA_lymphoids_healthy_reference_HVG10000_ac240308.raw.h5ad')  
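
The five near-identical selection blocks above differ only in `n_top_genes` and the output file name, so they could be condensed into a loop. The following is a sketch under that assumption, not the notebook's own code: `reference_path` and `write_hvg_references` are hypothetical helpers, and the output directory in the demo is a placeholder. The `sc.pp.highly_variable_genes` arguments are copied from the cells above.

```python
from pathlib import Path

HVG_SIZES = [1500, 3000, 5000, 7500, 10000]


def reference_path(out_dir, n_top_genes, tag="ac240308"):
    """Build the output file path for one HVG reference, following the notebook's naming."""
    return Path(out_dir) / f"HCA_lymphoids_healthy_reference_HVG{n_top_genes}_{tag}.raw.h5ad"


def write_hvg_references(adata, out_dir, sizes=HVG_SIZES):
    """Write one HVG-subset copy of `adata` per entry in `sizes`."""
    import scanpy as sc  # imported lazily so the path helper works without scanpy

    for n_top_genes in sizes:
        subset = adata.copy()  # keep the full object untouched between iterations
        sc.pp.highly_variable_genes(
            subset,
            flavor="seurat_v3",
            n_top_genes=n_top_genes,
            layer="counts",
            batch_key="donor",
            subset=True,
            span=1,
        )
        subset.write(reference_path(out_dir, n_top_genes))


# Show the file names the loop would produce (placeholder directory).
for n in HVG_SIZES:
    print(reference_path("Reference_data", n).name)
```

Calling `write_hvg_references(adata, out_dir)` on the lymphoid subset would then replace the five separate cells, at the cost of not displaying each intermediate object.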