![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/release_notebooks/NLU_3_0_2_release_notebook.ipynb)


# Entity Resolution
**Named entities** are sub-strings in a text that can be classified into categories. For example, in the string
`"Tesla is a great stock to invest in"`, the sub-string `"Tesla"` is a named entity that an ML algorithm can classify with the label `company`.  
**Named entities** can easily be extracted by the various pre-trained, deep-learning-based NER algorithms provided by NLU.



After extracting **named entities**, an **entity resolution algorithm** can be applied to them. The resolution algorithm maps each extracted entity to a class, which reduces the dimensionality of the data and has many useful applications.
For example:
- "**Tesla** is a great stock to invest in"
- "**TSLA** is a great stock to invest in"
- "**Tesla, Inc** is a great company to invest in"

The sub-strings `Tesla`, `TSLA`, and `Tesla, Inc` are all named entities that the NER algorithm classifies with the label `company`. This tells us that all three sub-strings are of type `company`, but we cannot yet infer that they refer to the same company.

Resolver algorithms solve exactly this problem: they resolve all three entities to a common identifier, such as a company ID. Every reference to Tesla, regardless of how the string is written, is mapped to the same ID.
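The idea of mapping several surface forms to one identifier can be sketched with a toy lookup table. This is only an illustration: the alias table and the ID `COMPANY_TESLA` are hypothetical, and the pre-trained resolvers used later in this notebook rely on learned models rather than a hand-written dictionary.

```python
# Toy illustration only: a hand-written alias table standing in for a trained resolver.
# The alias table and the ID "COMPANY_TESLA" are hypothetical.
ALIASES = {
    "tesla": "COMPANY_TESLA",
    "tsla": "COMPANY_TESLA",
    "tesla, inc": "COMPANY_TESLA",
}


def normalize(mention):
    """Strip surrounding whitespace and trailing periods, then lower-case."""
    return mention.strip().rstrip(".").lower()


def resolve(mention):
    """Return the shared ID for a known mention, or None if it is unknown."""
    return ALIASES.get(normalize(mention))


mentions = ["Tesla", "TSLA", "Tesla, Inc"]
resolved = {mention: resolve(mention) for mention in mentions}
print(resolved)
```

All three mentions end up with the same ID, while a mention that is not in the table resolves to `None`.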

This example extends analogously to healthcare and other text problems. In medical documents, the same disease can be referenced in many different ways.

With NLU Healthcare you can leverage state-of-the-art pre-trained NER models to extract **Medical Named Entities** (Diseases, Treatments, Posology, etc.) and **resolve these** to common **healthcare disease codes**.


These algorithms are provided by **Spark NLP for Healthcare's** [SentenceEntityResolver](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#sentenceentityresolver) and [ChunkEntityResolver](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunkentityresolver).
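One common way to build a sentence-level resolver is to embed the input text and pick the closest entry in a reference vocabulary. The sketch below illustrates that general idea with cosine similarity. It is an assumption about the overall approach, not a description of the library's internals: the three-dimensional vectors, the codes `CODE_A` to `CODE_C`, and the function names are all made up for illustration.

```python
import math

# Hypothetical reference vocabulary: code -> (label, tiny made-up embedding).
REFERENCE = {
    "CODE_A": ("diabetes mellitus", [0.9, 0.1, 0.0]),
    "CODE_B": ("pancreatitis", [0.1, 0.9, 0.1]),
    "CODE_C": ("obesity", [0.0, 0.2, 0.9]),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def resolve_embedding(vector):
    """Return (code, label) of the reference entry most similar to `vector`."""
    code, (label, _) = max(
        REFERENCE.items(), key=lambda item: cosine(vector, item[1][1])
    )
    return code, label


# A made-up query embedding that lies closest to the first reference entry.
query = [0.8, 0.2, 0.1]
best = resolve_embedding(query)
print(best)
```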

## New Entity Resolvers In NLU 3.0.2

| NLU REF                           | NLP REF                                 |
|-----------------------------------|-----------------------------------------|
| [`en.resolve.umls`](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_umls_major_concepts_en.html) | [`sbiobertresolve_umls_major_concepts`](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_umls_major_concepts_en.html) |
| [`en.resolve.umls.findings`](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_umls_findings_en.html) | [`sbiobertresolve_umls_findings`](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_umls_findings_en.html) |
| [`en.resolve.loinc`](https://nlp.johnsnowlabs.com/2021/04/29/sbiobertresolve_loinc_en.html) | [`sbiobertresolve_loinc`](https://nlp.johnsnowlabs.com/2021/04/29/sbiobertresolve_loinc_en.html) |
| [`en.resolve.loinc.biobert`](https://nlp.johnsnowlabs.com/2021/04/29/sbiobertresolve_loinc_en.html) | [`sbiobertresolve_loinc`](https://nlp.johnsnowlabs.com/2021/04/29/sbiobertresolve_loinc_en.html) |
| [`en.resolve.loinc.bluebert`](https://nlp.johnsnowlabs.com/2021/04/29/sbluebertresolve_loinc_en.html) | [`sbluebertresolve_loinc`](https://nlp.johnsnowlabs.com/2021/04/29/sbluebertresolve_loinc_en.html) |
| [`en.resolve.HPO`](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_HPO_en.html) | [`sbiobertresolve_HPO`](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_HPO_en.html) |



In [None]:
# Upload your spark_nlp_for_healthcare.json and define the four credentials passed to nlu.auth() below
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu

nlu.auth(SPARK_NLP_LICENSE,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,JSL_SECRET)

--2022-04-15 11:39:17--  https://setup.johnsnowlabs.com/nlu/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh [following]
--2022-04-15 11:39:17--  https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1665 (1.6K) [text/plain]
Saving to: ‘STDOUT’

Installing NLU 3.4.3rc2 with PySpark 3.0.3 and Spark NLP 3.4.2 for Google Colab.



Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/3.5.0-658432c5c0ac83e65947c58ebd7f573e1c72530e
Collecting spark-nlp-jsl==3.5.0
  Downloading https://pypi.johnsnowlabs.com/3.5.0-658432c5c0ac83e65947c58ebd7f573e1c72530e/spark-nlp-jsl/spark_nlp_jsl-3.5.0-py3-none-any.whl (188 kB)
Installing collected packages: spark-nlp-jsl
Successfully installed spark-nlp-jsl-3.5.0


<module 'nlu' from '/usr/local/lib/python3.7/dist-packages/nlu/__init__.py'>

#### [Sentence Entity Resolver for UMLS CUI Codes](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_umls_major_concepts_en.html)

In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.umls').predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""
,output_level = 'sentence')

ner_wikiner_glove_840B_300 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
sbiobertresolve_umls_major_concepts download started this may take some time.
Approximate size to download 761 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| | entities_wikiner_glove_840B_300 | entities_wikiner_glove_840B_300_class | entities_wikiner_glove_840B_300_confidence | sentence | sentence_embedding_biobert | sentence_resolution_umls | sentence_resolution_umls_code | sentence_resolution_umls_confidence | word_embedding_glove |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [TSS2DM] | [MISC] | [0.8243] | A 28-year-old female with a history of gestati... | [[0.057425666600465775, 0.8864338397979736, -0... | [increased risk of type 2 diabetes] | [C5195213] | [0.3339] | [[-0.2099599987268448, -0.15577000379562378, -... |
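The resolution label, code, and confidence arrive in three parallel list-valued columns, so downstream code usually needs to pair them up. The sketch below does that for one row, using values copied from the output above. The helper name `pair_resolutions`, the plain-dict row, and the 0.5 threshold are assumptions for illustration; the `float()` conversion is there because the confidence values may be stored as strings or as numbers.

```python
# Values copied from the output row above; only the resolution columns are used.
row = {
    "sentence_resolution_umls": ["increased risk of type 2 diabetes"],
    "sentence_resolution_umls_code": ["C5195213"],
    "sentence_resolution_umls_confidence": ["0.3339"],
}


def pair_resolutions(row, prefix="sentence_resolution_umls", min_confidence=0.0):
    """Zip the parallel label/code/confidence lists of one row into records.

    Hypothetical helper: keeps only records whose confidence is at least
    `min_confidence`.
    """
    labels = row[prefix]
    codes = row[prefix + "_code"]
    confidences = [float(value) for value in row[prefix + "_confidence"]]
    return [
        {"label": label, "code": code, "confidence": confidence}
        for label, code, confidence in zip(labels, codes, confidences)
        if confidence >= min_confidence
    ]


pairs = pair_resolutions(row)
confident = pair_resolutions(row, min_confidence=0.5)  # 0.5 is an arbitrary cut-off
print(pairs)
print(confident)
```

With the illustrative 0.5 cut-off, the single resolution in this row (confidence 0.3339) is filtered out, which is a reminder to inspect confidence values before relying on a resolved code.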


In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.umls').viz("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""")

ner_wikiner_glove_840B_300 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
sbiobertresolve_umls_major_concepts download started this may take some time.
Approximate size to download 761 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]




Collecting spark-nlp-display
  Downloading spark_nlp_display-1.9.1-py3-none-any.whl (95 kB)
Collecting svgwrite==1.4
  Downloading svgwrite-1.4-py3-none-any.whl (66 kB)
Installing collected packages: svgwrite, spark-nlp-display
Successfully installed spark-nlp-display-1.9.1 svgwrite-1.4


#### [Sentence Entity Resolver for UMLS CUI Codes (Findings)](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_umls_findings_en.html)





In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.umls.findings').predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""
,output_level = 'sentence')

ner_wikiner_glove_840B_300 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
sbiobertresolve_umls_findings download started this may take some time.
Approximate size to download 134.1 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| | entities_wikiner_glove_840B_300 | entities_wikiner_glove_840B_300_class | entities_wikiner_glove_840B_300_confidence | sentence | sentence_embedding_biobert | sentence_resolution_umls | sentence_resolution_umls_code | sentence_resolution_umls_confidence | word_embedding_glove |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [TSS2DM] | [MISC] | [0.8243] | A 28-year-old female with a history of gestati... | [[0.057425666600465775, 0.8864338397979736, -0... | [body mass index] | [C0578022] | [0.3338] | [[-0.2099599987268448, -0.15577000379562378, -... |


In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.umls.findings').viz("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""
)

ner_wikiner_glove_840B_300 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
sbiobertresolve_umls_findings download started this may take some time.
Approximate size to download 134.1 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]


#### [LOINC Sentence Entity Resolver](https://nlp.johnsnowlabs.com/2021/04/29/sbiobertresolve_loinc_en.html)

In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.loinc').predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""
,output_level = 'sentence')

ner_wikiner_glove_840B_300 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
sbiobertresolve_loinc download started this may take some time.
Approximate size to download 212.6 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| | entities_wikiner_glove_840B_300 | entities_wikiner_glove_840B_300_class | entities_wikiner_glove_840B_300_confidence | sentence | sentence_embedding_biobert | sentence_resolution_loinc | sentence_resolution_loinc_code | sentence_resolution_loinc_confidence | word_embedding_glove |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [TSS2DM] | [MISC] | [0.8243] | A 28-year-old female with a history of gestati... | [[0.057425666600465775, 0.8864338397979736, -0... | [Insulin Ab] | [8072-1] | [0.3333] | [[-0.2099599987268448, -0.15577000379562378, -... |


In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.loinc.biobert').viz("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""")

ner_wikiner_glove_840B_300 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]
sbiobertresolve_loinc download started this may take some time.
Approximate size to download 212.6 MB
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
glove_840B_300 download started this may take some time.
Approximate size to download 2.3 GB
[OK!]


#### [LOINC Sentence Entity Resolver (BlueBERT)](https://nlp.johnsnowlabs.com/2021/04/29/sbluebertresolve_loinc_en.html)

In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.loinc.bluebert').predict("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting."""
,output_level = 'sentence')

ner_wikiner_glove_840B_300 download started this may take some time.
Approximate size to download 14.8 MB
[OK!]


Exception: ignored

In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.loinc.bluebert').viz("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and
subsequent type two diabetes mellitus (TSS2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with an acute 
hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting.""")

#### [Entity Resolver for Human Phenotype Ontology](https://nlp.johnsnowlabs.com/2021/05/16/sbiobertresolve_HPO_en.html)

In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.HPO').predict("""These disorders include cancer, bipolar disorder, schizophrenia, autism, Cri-du-chat syndrome,
 myopia, cortical cataract-linked Alzheimer's disease, and infectious diseases"""
,output_level = 'sentence')

In [None]:
nlu.load('med_ner.jsl.wip.clinical en.resolve.HPO').viz("""These disorders include cancer, bipolar disorder, schizophrenia, autism, Cri-du-chat syndrome,
 myopia, cortical cataract-linked Alzheimer's disease, and infectious diseases""")