# Ontology mapping

Ontologies are structured, standardized representations of knowledge in a specific domain. They define the concepts, relationships, and properties within that domain. They are essential for perturbation analysis because they provide a common vocabulary and framework for organizing and integrating perturbation data.

pertpy is compatible with [Bionty](https://github.com/laminlabs/bionty), which provides access to public ontologies and functionality to map values against them.

Here, we'll create an artificial AnnData object containing gene identifiers and cell lines, and map both against ontologies to ensure that all of our annotations adhere to them.

## Setup

In [1]:
import anndata as ad
import numpy as np
import pandas as pd

Create an AnnData object with gene IDs in Ensembl notation as variable names and cell line annotations in the `obs` slot.

In [33]:
adata = ad.AnnData(
    X=np.random.random((3, 3)),
    var=pd.DataFrame(index=["ENSG00000148584", "ENSG00000121410", "ENSGcorrupted"]),
    obs=pd.DataFrame(columns=["cell lines"], data=["HEK293", "JURKAT", "THP-1 cell"]),
)
adata



AnnData object with n_obs × n_vars = 3 × 3
    obs: 'cell lines'

In [25]:
adata.obs

,cell lines
0,HEK293
1,JURKAT
2,THP-1 cell


## Introduction to Bionty

First we import Bionty.

In [3]:
import bionty as bt

Let's look at all available ontologies.

In [77]:
bt.display_available_sources()

entity,source,species,version,url,md5,source_name,source_website
Species,ensembl,vertebrate,release-109,https://ftp.ensembl.org/pub/release-109/specie...,,Ensembl,https://www.ensembl.org
Species,ensembl,vertebrate,release-108,https://ftp.ensembl.org/pub/release-108/specie...,,Ensembl,https://www.ensembl.org
Gene,ensembl,human,release-109,s3://bionty-assets/human_ensembl_release-109_G...,,Ensembl,https://www.ensembl.org
Gene,ensembl,mouse,release-109,s3://bionty-assets/mouse_ensembl_release-109_G...,,Ensembl,https://www.ensembl.org
Protein,uniprot,human,2023-02,s3://bionty-assets/human_uniprot_2023-02_Prote...,,Uniprot,https://www.uniprot.org
Protein,uniprot,mouse,2023-02,s3://bionty-assets/mouse_uniprot_2023-02_Prote...,,Uniprot,https://www.uniprot.org
CellMarker,cellmarker,human,2.0,s3://bionty-assets/human_cellmarker_2.0_CellMa...,,CellMarker,http://bio-bigdata.hrbmu.edu.cn/CellMarker
CellMarker,cellmarker,mouse,2.0,s3://bionty-assets/mouse_cellmarker_2.0_CellMa...,,CellMarker,http://bio-bigdata.hrbmu.edu.cn/CellMarker
CellLine,clo,all,2022-03-21,https://data.bioontology.org/ontologies/CLO/su...,ea58a1010b7e745702a8397a526b3a33,Cell Line Ontology,https://bioportal.bioontology.org/ontologies/CLO
CellType,cl,all,2023-04-20,http://purl.obolibrary.org/obo/cl/releases/202...,,Cell Ontology,https://obophenotype.github.io/cell-ontology


Bionty provides three key functionalities:

1. `inspect`: Check whether our values (here cell lines and genes) are mappable against a specified ontology.
2. `map_synonyms`: Map synonyms to their standardized names.
3. `curate`: Curate values against the ontology to ensure compliance.
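To make the idea behind `inspect` concrete, here is a minimal sketch of a membership check against a reference vocabulary. It does not use Bionty: `reference_names` and `inspect_values` are hypothetical names, and the three-term reference set is a toy stand-in for a real ontology table.

```python
# Toy reference vocabulary (assumption): a real ontology table holds
# many thousands of terms.
reference_names = {"HEK293", "JURKAT cell", "THP-1 cell"}


def inspect_values(values, reference):
    """Split values into those found in the reference and those missing."""
    mapped = [v for v in values if v in reference]
    not_mapped = [v for v in values if v not in reference]
    return {"mapped": mapped, "not_mapped": not_mapped}


result = inspect_values(["HEK293", "JURKAT", "THP-1 cell"], reference_names)
print(result)
# {'mapped': ['HEK293', 'THP-1 cell'], 'not_mapped': ['JURKAT']}
```

The check is an exact, case-sensitive string comparison, which is why a near miss such as `JURKAT` versus `JURKAT cell` counts as unmapped.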

## Mapping against the Cell Line Ontology with Bionty

We will now showcase how to access the [cell line ontology](https://www.ebi.ac.uk/ols4/ontologies/clo) with Bionty. The Cell Line Ontology (CLO) aims to harmonize cell line definitions across the world.

Bionty is centered around entity objects that provide the functionality introduced above. We create a Bionty `CellLine` object with the Cell Line Ontology as its source and pin a specific version for reproducibility.

### Cell lines

In [35]:
cell_line_bt = bt.CellLine(source="clo", version="2022-03-21")
cell_line_bt

CellLine
Species: all
Source: clo, 2022-03-21

📖 CellLine.df(): ontology reference table
🔎 CellLine.lookup(): autocompletion of terms
🎯 CellLine.search(): free text search of terms
🧐 CellLine.inspect(): check if identifiers are mappable
👽 CellLine.map_synonyms(): map synonyms to standardized names
🔗 CellLine.ontology: Pronto.Ontology object

We can access the DataFrame that contains all ontology terms:

In [36]:
cell_line_bt.df()

ontology_id,name,definition,synonyms,parents
CLO:0000000,cell line cell culturing,a maintaining cell culture process that keeps ...,,[]
CLO:0000001,cell line cell,A cultured cell that is part of a cell line - ...,,[]
CLO:0000002,suspension cell line culturing,suspension cell line culturing is a cell line ...,,[CLO:0000000]
CLO:0000003,adherent cell line culturing,adherent cell line culturing is a cell line cu...,,[CLO:0000000]
CLO:0000004,cell line cell modification,a material processing that modifies an existin...,,[]
...,...,...,...,...
CLO:0051617,RCB0187 cell,A immortal medaka cell line cell that has the ...,RCB0187|OLHE-131,[CLO:0009822]
CLO:0051618,RCB2945 cell,A immortal medaka cell line cell that has the ...,RCB2945|DIT29,[CLO:0009822]
CLO:0051619,RCB0184 cell,A immortal medaka cell line cell that has the ...,OLF-136|RCB0184,[CLO:0009822]
CLO:0051620,RCB0188 cell,A immortal medaka cell line cell that has the ...,RCB0188|OLME-104,[CLO:0009822]


Let's inspect all of our cell lines to learn whether they can be mapped against the ontology using the `name` field:

In [34]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)

🔶 The identifiers contain synonyms!
   To increase mappability, standardize them via '.map_synonyms()'
✅ 2 terms (66.7%) are mapped
🔶 1 terms (33.3%) are not mapped


cell lines,__mapped__
HEK293,True
JURKAT,False
THP-1 cell,True


We observe that `JURKAT` cannot be mapped against the Cell Line Ontology. Hence, we create a lookup object and use auto-complete to find JURKAT cells in the ontology.

In [37]:
cell_line_bt_lookup = cell_line_bt.lookup()

In [38]:
cell_line_bt_lookup.jurkat_cell

CellLine(ontology_id='CLO:0007043', name='JURKAT cell', definition='an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell', synonyms='JURKAT', parents=array(['CLO:0000523'], dtype=object))

In [70]:
cell_line_bt_lookup.jurkat_cell.name

'JURKAT cell'

In [73]:
cell_line_bt_lookup.jurkat_cell.definition

'an immortalized human T lymphocyte cell that was derived in the late 1970s from the peripheral blood of a 14-year-old boy with T cell leukemia|disease: leukemia, T cell'

Indeed we find that the actual name of the cells is `JURKAT cell`.
Let's rename it.

In [68]:
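# Assign the result back instead of using inplace=True on a column selection,
# which can trigger chained-assignment warnings in recent pandas versions.
adata.obs["cell lines"] = adata.obs["cell lines"].replace({"JURKAT": "JURKAT cell"})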
adata.obs["cell lines"]

0         HEK293
1    JURKAT cell
2     THP-1 cell
Name: cell lines, dtype: object

In [69]:
cell_line_bt.inspect(adata.obs["cell lines"], field=cell_line_bt.name, return_df=True)

✅ 3 terms (100.0%) are mapped
🔶 0 terms (0.0%) are not mapped


cell lines,__mapped__
HEK293,True
JURKAT cell,True
THP-1 cell,True


All terms are now mapped.
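The manual rename works for a single value. For many values, the inspection hint about synonyms suggests standardizing them in bulk. The following is a minimal sketch of that idea in plain Python, not Bionty's implementation. `toy_terms`, `build_synonym_index`, and `standardize` are hypothetical names. The `JURKAT cell` and `RCB0187 cell` entries mirror the synonym strings shown in the reference table, and the empty synonym string for `HEK293` is an assumption made for illustration.

```python
# Toy excerpt (assumption): standardized name -> pipe-separated synonyms,
# following the layout of the `synonyms` column in the reference table.
toy_terms = {
    "JURKAT cell": "JURKAT",
    "RCB0187 cell": "RCB0187|OLHE-131",
    "HEK293": "",  # illustrative entry without synonyms
}


def build_synonym_index(terms):
    """Return a dict that maps each synonym to its standardized name."""
    index = {}
    for name, synonyms in terms.items():
        for synonym in filter(None, synonyms.split("|")):
            # Keep the first name seen if a synonym occurs more than once.
            index.setdefault(synonym, name)
    return index


def standardize(values, terms):
    """Replace known synonyms by standardized names; keep other values as-is."""
    index = build_synonym_index(terms)
    return [index.get(value, value) for value in values]


standardized = standardize(["HEK293", "JURKAT", "OLHE-131"], toy_terms)
print(standardized)
# ['HEK293', 'JURKAT cell', 'RCB0187 cell']
```

Synonyms are not guaranteed to be unique across terms, so a real mapping needs a rule for ambiguous cases. This sketch simply keeps the first match.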

We could have also used the search functionality to find the match for JURKAT cells:

In [75]:
cell_line_bt.search("JURKAT").head()

name,ontology_id,definition,synonyms,parents,__ratio__
RCB0806 cell,CLO:0050978,A immortal human blood cell line cell that has...,RCB0806|Jurkat,[CLO:0000617],100.0
JURKAT cell,CLO:0007043,an immortalized human T lymphocyte cell that w...,JURKAT,[CLO:0000523],100.0
Jurkat J6 cell,CLO:0007044,,Jurkat J6,[CLO:0000019],80.0
Rat2 cell,CLO:0008750,,Rat2,[CLO:0009760],60.0
JURL-MK1 cell,CLO:0009838,,JURL-MK1,[CLO:0000617],57.142857
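The `__ratio__` column ranks candidates by string similarity. As a rough illustration of fuzzy ranking, the sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in. This is an assumption: it is not the scorer Bionty uses, so its scores will differ from the `__ratio__` values above. `candidate_names` and `rank_by_similarity` are hypothetical names, and the candidates are taken from the search result.

```python
from difflib import SequenceMatcher

# A few term names from the search result above.
candidate_names = ["JURKAT cell", "Jurkat J6 cell", "Rat2 cell", "JURL-MK1 cell"]


def rank_by_similarity(query, names):
    """Sort names by case-insensitive similarity to the query, best first."""

    def score(name):
        return SequenceMatcher(None, query.lower(), name.lower()).ratio()

    return sorted(names, key=score, reverse=True)


ranked = rank_by_similarity("JURKAT", candidate_names)
print(ranked[0])
# JURKAT cell
```

Fuzzy matches are suggestions only. A high score does not guarantee that the candidate is the intended term, so check the definition before renaming.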


The same workflow can be applied to genes.

### Genes

In [30]:
gene_bt = bt.Gene()
gene_bt

Gene
Species: human
Source: ensembl, release-109

📖 Gene.df(): ontology reference table
🔎 Gene.lookup(): autocompletion of terms
🎯 Gene.search(): free text search of terms
🧐 Gene.inspect(): check if identifiers are mappable
👽 Gene.map_synonyms(): map synonyms to standardized names
🔗 Gene.ontology: Pronto.Ontology object

In [31]:
gene_bt.inspect(adata.var_names, gene_bt.ensembl_gene_id)

✅ 2 terms (66.7%) are mapped
🔶 1 terms (33.3%) are not mapped


{'mapped': ['ENSG00000148584', 'ENSG00000121410'],
 'not_mapped': ['ENSGcorrupted']}

`ENSGcorrupted` is not a valid Ensembl gene ID and should therefore also be corrected.
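Before querying an ontology, a cheap format check can catch obviously malformed identifiers. The sketch below assumes the usual shape of human Ensembl gene IDs: the prefix `ENSG`, 11 digits, and an optional version suffix such as `.5`. `looks_like_human_ensembl_gene_id` is a hypothetical helper. It validates the format only and does not confirm that the ID exists in a given Ensembl release.

```python
import re

# Assumed format: "ENSG" + 11 digits, optionally followed by ".<version>".
ENSEMBL_HUMAN_GENE_ID = re.compile(r"ENSG\d{11}(\.\d+)?")


def looks_like_human_ensembl_gene_id(value):
    """Return True if the value has the shape of a human Ensembl gene ID."""
    return ENSEMBL_HUMAN_GENE_ID.fullmatch(value) is not None


ids = ["ENSG00000148584", "ENSG00000121410", "ENSGcorrupted"]
malformed = [i for i in ids if not looks_like_human_ensembl_gene_id(i)]
print(malformed)
# ['ENSGcorrupted']
```

An ID that passes this check can still be retired or absent from the chosen release, so `inspect` remains the authoritative test.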

## Conclusion

pertpy supports ontology management, inspection, and mapping through Bionty. Bionty provides access to gene, cell type, cell line, disease, and phenotype ontologies, among many others.

To access these ontologies, we create Bionty objects whose methods map synonyms and inspect data for adherence to ontologies. Mismatches can be remedied by finding the correct ontology name using lookup objects or fuzzy matching.