<a href="https://polly.elucidata.io/manage/workspaces?action=open_polly_notebook&amp;source=github&amp;path=ElucidataInc%2Fpolly-python%2Fblob%2Fmain%2FCuration%2FUsing+the+Curation+Library+on+Polly-Python.ipynb&amp;kernel=elucidata%2FPython+3.10&amp;machine=small" target="_parent"><img alt="Open in Polly" src="https://elucidatainc.github.io/PublicAssets/open_polly.svg"/></a>


# Usage of Curation Functions on Polly Python3 

What can the curation library be used for?
  * Semantic annotation of text with established Polly-compatible ontologies
  * Recognize and annotate 9 entity types: disease, drug, drug_chebi, species, tissue, cell_type, cell_line, gene, metabolite
  * Uses our pre-trained NLP models in the backend, trained on millions of datasets
  * Curate custom metadata fields automatically

In [1]:
# Install polly python
!sudo pip3 install polly-python --quiet



In [2]:
# Import libraries
from polly.auth import Polly
from polly.curation import Curation
import os
import pandas as pd
from json import dumps
import ipywidgets as widgets

In [3]:
# Create curation object and authenticate
AUTH_TOKEN = os.environ['POLLY_REFRESH_TOKEN']
curate = Curation(AUTH_TOKEN)
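
If `POLLY_REFRESH_TOKEN` is not set, the lookup above fails with a bare `KeyError`. A minimal sketch of a more explicit lookup follows; `get_refresh_token` is a hypothetical helper for illustration and is not part of polly-python.

```python
import os


def get_refresh_token(env_var="POLLY_REFRESH_TOKEN"):
    """Return the token stored in env_var, or raise a descriptive error.

    Hypothetical helper: not provided by polly-python.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export your Polly refresh token before creating the Curation object."
        )
    return token
```

The returned string can then be passed to `Curation(...)` in place of the direct `os.environ` lookup.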

## 1. standardise_entity()

Map a given mention (keyword) to an ontology term.

Parameters:
> mention (str): mention of an entity, e.g. "Cardiac arrhythmia"  
> entity_type (str): Should be one of the 9 supported entity types  
> context *(Optional)* (str): The text where the mention occurs, used to resolve abbreviations  
> threshold *(Optional)* (float): All entities with a score < threshold are filtered out from the output. It's best not to specify a threshold and to use the default value instead.  

Returns:
> dict  

In [4]:
# Basic example
curate.standardise_entity("Mouse","species")

{'ontology': 'NCBI',
 'ontology_id': 'txid10090',
 'name': 'Mus musculus',
 'entity_type': 'species',
 'score': None,
 'synonym': None}

In [5]:
# Without 'context'
curate.standardise_entity("AD", "disease")

{'ontology': 'MESH',
 'ontology_id': 'C564330',
 'name': 'Alzheimer Disease, Familial, 3, with Spastic Paraparesis and Apraxia',
 'entity_type': 'disease',
 'score': 202.1661376953125,
 'synonym': 'ad'}

In [6]:
# With context, returns the desired keyword in case of abbreviation
curate.standardise_entity("AD", "disease", 
                context="Patients with atopic dermatitis (AD) where given drug A whereas non AD patients were given drug B")

{'ontology': 'MESH',
 'ontology_id': 'D003876',
 'name': 'Dermatitis, Atopic',
 'entity_type': 'disease',
 'score': 196.61105346679688,
 'synonym': 'atopic dermatitis'}

In [7]:
# A non-matching 'entity_type' returns a 'CUI-less' result with None values
curate.standardise_entity("Mouse","disease")

{'ontology': 'CUI-less',
 'ontology_id': None,
 'name': None,
 'entity_type': 'disease',
 'score': None,
 'synonym': None}
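
Because an unmatched mention comes back as a dictionary rather than an error, downstream code usually needs to check the result before using it. A minimal sketch, assuming the result always has the keys shown in the outputs above; `is_mapped` is a hypothetical helper, not a polly-python function.

```python
def is_mapped(result):
    """True when a standardise_entity-style dict points at a real ontology term."""
    return result.get("ontology_id") is not None and result.get("ontology") != "CUI-less"


# Dictionaries copied from the outputs shown above.
mouse_as_species = {"ontology": "NCBI", "ontology_id": "txid10090",
                    "name": "Mus musculus", "entity_type": "species",
                    "score": None, "synonym": None}
mouse_as_disease = {"ontology": "CUI-less", "ontology_id": None,
                    "name": None, "entity_type": "disease",
                    "score": None, "synonym": None}

print(is_mapped(mouse_as_species))  # True
print(is_mapped(mouse_as_disease))  # False
```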

In [8]:
# Usage of non-supported 'entity_type' returns error -> Here, it is supposed to be "species" and not "specie"
curate.standardise_entity("Mouse","specie")

RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'mention', 'entity_type'], 'msg': "value is not a valid enumeration member; permitted: 'disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite'", 'type': 'type_error.enum', 'ctx': {'enum_values': ['disease', 'drug', 'drug_chebi', 'species', 'tissue', 'cell_type', 'cell_line', 'gene', 'metabolite']}}]})
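
The request above only fails once it reaches the server. A minimal sketch of a client-side check that catches the typo earlier; the permitted values are copied from the error message above and may change on the server, and `check_entity_type` is a hypothetical helper.

```python
import difflib

# Copied from the 'permitted' list in the error message; treat as a snapshot.
SUPPORTED_ENTITY_TYPES = (
    "disease", "drug", "drug_chebi", "species", "tissue",
    "cell_type", "cell_line", "gene", "metabolite",
)


def check_entity_type(entity_type):
    """Return entity_type unchanged if supported, else raise ValueError with a hint."""
    if entity_type in SUPPORTED_ENTITY_TYPES:
        return entity_type
    # Suggest the closest supported name, if any is reasonably similar.
    close = difflib.get_close_matches(entity_type, SUPPORTED_ENTITY_TYPES, n=1)
    hint = f" Did you mean '{close[0]}'?" if close else ""
    raise ValueError(f"Unsupported entity_type '{entity_type}'.{hint}")
```

For example, `check_entity_type("specie")` raises a `ValueError` that suggests `'species'`.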

## 2. recognise_entity()

Run a named entity recognition (NER) model on the given text. The returned value is a list of entities along with their span information. The score is the model's confidence that the given keyword belongs to the returned entity type/ontology.

Parameters:
> text (str): input text  
> threshold *(Optional)* (float): All entities with a score < threshold are filtered out from the output  
> normalize_output (bool): whether to normalize the keywords  

Returns:
> entities (List(dict)): returns a list of spans, each containing the keyword, the start and end index of the keyword, and the entity type

In [10]:
# Basic example with two entities
curate.recognise_entity("Gene expression profiling on mice lungs and reveals ACE2 upregulation")

[{'keyword': 'lungs',
  'entity_type': 'tissue',
  'span_begin': 34,
  'span_end': 39,
  'score': 0.9985597729682922},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 52,
  'span_end': 55,
  'score': 0.9900580048561096},
 {'keyword': 'mice',
  'entity_type': 'species',
  'span_begin': 29,
  'span_end': 32,
  'score': 0.989605188369751}]
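
The entities above are not listed in the order they appear in the text. A minimal sketch that reorders them by `span_begin`, using the output above as literal data; `in_text_order` is a hypothetical helper.

```python
def in_text_order(entities):
    """Return the entities sorted by where they start in the input text."""
    return sorted(entities, key=lambda entity: entity["span_begin"])


# Output copied from the cell above.
entities = [
    {"keyword": "lungs", "entity_type": "tissue",
     "span_begin": 34, "span_end": 39, "score": 0.9985597729682922},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 52, "span_end": 55, "score": 0.9900580048561096},
    {"keyword": "mice", "entity_type": "species",
     "span_begin": 29, "span_end": 32, "score": 0.989605188369751},
]

print([entity["keyword"] for entity in in_text_order(entities)])
# ['mice', 'lungs', 'ACE2']
```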

In [14]:
# Multiple entities of the same type
curate.recognise_entity("Batch effects were observed between ductal carcinoma and lobular carcinoma")

[{'keyword': 'ductal carcinoma',
  'entity_type': 'disease',
  'span_begin': 36,
  'span_end': 51,
  'score': 0.9999971389770508},
 {'keyword': 'lobular carcinoma',
  'entity_type': 'disease',
  'span_begin': 57,
  'span_end': 73,
  'score': 0.9999983906745911}]
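
When several entities share a type, it can be convenient to collect the keywords per type. A minimal sketch over the output above; `group_by_type` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_type(entities):
    """Map each entity_type to the list of keywords recognised for it."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_type"]].append(entity["keyword"])
    return dict(grouped)


# Output copied from the cell above.
carcinoma_entities = [
    {"keyword": "ductal carcinoma", "entity_type": "disease",
     "span_begin": 36, "span_end": 51, "score": 0.9999971389770508},
    {"keyword": "lobular carcinoma", "entity_type": "disease",
     "span_begin": 57, "span_end": 73, "score": 0.9999983906745911},
]

print(group_by_type(carcinoma_entities))
# {'disease': ['ductal carcinoma', 'lobular carcinoma']}
```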

In [24]:
# Repeating entities
curate.recognise_entity("The study showed ACE2 upregulation and ACE2 downregulation")

[{'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 17,
  'span_end': 20,
  'score': 0.9962862730026245},
 {'keyword': 'ACE2',
  'entity_type': 'gene',
  'span_begin': 39,
  'span_end': 42,
  'score': 0.990687906742096}]
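
Each occurrence of a repeated keyword is returned as its own entry. A minimal sketch that counts mentions per (entity type, keyword) pair, using the output above; `count_mentions` is a hypothetical helper.

```python
from collections import Counter


def count_mentions(entities):
    """Count how often each (entity_type, keyword) pair was recognised."""
    return Counter((entity["entity_type"], entity["keyword"]) for entity in entities)


# Output copied from the cell above.
ace2_entities = [
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 17, "span_end": 20, "score": 0.9962862730026245},
    {"keyword": "ACE2", "entity_type": "gene",
     "span_begin": 39, "span_end": 42, "score": 0.990687906742096},
]

print(count_mentions(ace2_entities))
# Counter({('gene', 'ACE2'): 2})
```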

In [16]:
# No entity in the text
curate.recognise_entity("Significant upregulation was found in 100 samples")

[]

## 3. annotate_with_ontology()

Tag a given piece of text with Polly-supported ontologies. A "tag" is an ontology term. This function calls 'recognise_entity' followed by 'normalize'.
 
Parameters:
> text (str): input text  
        
Returns:
> tags (set of tuples): the unique tags found in the text; the examples below show them returned as a list of `Tag` objects  

In [17]:
# Basic example
curate.annotate_with_ontology("Mouse model shows presence of Adeno carcinoma")

[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species'),
 Tag(name='Adenocarcinoma', ontology_id='MESH:D000230', entity_type='disease')]
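
The `ontology_id` of each tag combines the ontology name and the term identifier, separated by a colon. A minimal sketch that splits them into separate fields; the `Tag` namedtuple here is a stand-in built from the printed output, and the real class in polly-python may differ. `tags_to_rows` is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the objects printed above; an assumption, not the library's class.
Tag = namedtuple("Tag", ["name", "ontology_id", "entity_type"])


def tags_to_rows(tags):
    """Convert Tag-like objects to dicts with ontology and term id separated."""
    rows = []
    for tag in tags:
        # Split on the first colon only, e.g. 'NCBI:txid10090'.
        ontology, _, term_id = tag.ontology_id.partition(":")
        rows.append({
            "name": tag.name,
            "ontology": ontology,
            "term_id": term_id,
            "entity_type": tag.entity_type,
        })
    return rows


tags = [
    Tag(name="Mus musculus", ontology_id="NCBI:txid10090", entity_type="species"),
    Tag(name="Adenocarcinoma", ontology_id="MESH:D000230", entity_type="disease"),
]
rows = tags_to_rows(tags)
print(rows[0]["ontology"], rows[0]["term_id"])  # NCBI txid10090
```

A list of dictionaries in this shape can be passed directly to `pd.DataFrame(rows)` for tabular inspection.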

In [21]:
# Spelling errors
curate.annotate_with_ontology("Mouse model shows presence of Adino carcinoma")

[Tag(name='Mus musculus', ontology_id='NCBI:txid10090', entity_type='species')]

In [23]:
# Incorrect input format -> here, a list instead of a string
curate.annotate_with_ontology(["Mouse model shows presence", "adeno carcinoma"])

RequestException: ('Invalid Payload', {'detail': [{'loc': ['body', 'text'], 'msg': 'str type expected', 'type': 'type_error.str'}]})
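
Since the function accepts a single string, several texts have to be annotated one call at a time. A minimal sketch of such a loop; `annotate_texts` is a hypothetical wrapper, and it assumes only that the client object exposes `annotate_with_ontology(text)` as used above.

```python
def annotate_texts(curation_client, texts):
    """Call annotate_with_ontology once per string, keyed by the input text."""
    if isinstance(texts, str):
        texts = [texts]  # accept a single string as well as a list
    results = {}
    for text in texts:
        if not isinstance(text, str):
            raise TypeError(f"expected str, got {type(text).__name__}")
        results[text] = curation_client.annotate_with_ontology(text)
    return results
```

With the `curate` object from this notebook, the call would look like `annotate_texts(curate, ["Mouse model shows presence", "adeno carcinoma"])`. Note that annotating fragments separately removes the surrounding context, so results may differ from annotating the full sentence.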

## 4. find_abbreviations()

Runs abbreviation detection on its own and maps each abbreviation to its ontologically relevant keyword.

Parameters:
> text (str): The string to detect abbreviations in  
> context *(Optional)* (str): The text where the mention occurs, used to resolve abbreviations  
        
Returns:
> dict: abbreviation as key and full form as value

In [25]:
# Full form is not mentioned in the text
curate.find_abbreviations("Patient is diagnosed with T1D")

{}

In [26]:
# '-' in the text is not understood
curate.find_abbreviations("Patient is diagnosed with T1D- Type 1 Diabetes")

{}

In [27]:
# Abbreviation is recognized
curate.find_abbreviations("Patient is diagnosed with T1D (Type 1 Diabetes)")

{'T1D': 'Type 1 Diabetes'}

In [28]:
# Abbreviation does not match the full text
curate.find_abbreviations("Patient is diagnosed with T2D (Type 1 Diabetes)")

{}
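
Once an abbreviation map has been found, it can be applied to other text from the same source. A minimal sketch, assuming the dictionary has the abbreviation-to-full-form shape shown above; `expand_abbreviations` is a hypothetical helper, not part of polly-python.

```python
import re


def expand_abbreviations(text, abbreviations):
    """Replace whole-word abbreviations in text with their full forms."""
    if not abbreviations:
        return text
    # Longest keys first so a longer abbreviation wins over a shorter prefix.
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b")
    return pattern.sub(lambda match: abbreviations[match.group(1)], text)


abbreviations = {"T1D": "Type 1 Diabetes"}  # output copied from the example above
print(expand_abbreviations("T1D patients were enrolled", abbreviations))
# Type 1 Diabetes patients were enrolled
```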

## 5. assign_control_pert_labels()

Classifies each sample as control or not, based on the "disease" label value.

Parameters:
> sample_metadata (pd.DataFrame)  
> columns_to_exclude *(Optional)* (Set(str)): Any columns which don't play any role in determining the label, e.g. any arbitrary sample identifier

Returns:
> pd.DataFrame: Returns the sample metadata dataframe with 2 additional columns (i) 'is_control' - whether the sample is a control sample (ii) 'control_prob' - the probability that the sample is control


In [30]:
sample_metadata = pd.DataFrame({"sample_id": [1, 2, 3, 4], "disease": ["control1", "ctrl2", "healthy", "HCC"],})
sample_metadata

|   | sample_id | disease  |
|---|-----------|----------|
| 0 | 1         | control1 |
| 1 | 2         | ctrl2    |
| 2 | 3         | healthy  |
| 3 | 4         | HCC      |


In [32]:
curate.assign_control_pert_labels(sample_metadata, columns_to_exclude=["sample_id"])

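|   | sample_id | disease  | is_control | control_prob |
|---|-----------|----------|------------|--------------|
| 0 | 1         | control1 | True       | 1.0          |
| 1 | 2         | ctrl2    | True       | 1.0          |
| 2 | 3         | healthy  | True       | 0.96         |
| 3 | 4         | HCC      | False      | 0.08         |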
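
The added `is_control` column can be used as a boolean mask to separate the two groups. A minimal sketch with pandas; the `labelled` frame is rebuilt by hand from the output above so that the example runs without a Polly connection.

```python
import pandas as pd

# Rebuilt from the output above; in practice this is the returned DataFrame.
labelled = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "disease": ["control1", "ctrl2", "healthy", "HCC"],
    "is_control": [True, True, True, False],
    "control_prob": [1.0, 1.0, 0.96, 0.08],
})

# Boolean indexing: rows flagged as control, and the remaining rows.
controls = labelled[labelled["is_control"]]
perturbed = labelled[~labelled["is_control"]]

print(controls["sample_id"].tolist())   # [1, 2, 3]
print(perturbed["sample_id"].tolist())  # [4]
```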
