<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

# Chunk-Disambiguator - version 2.4.0

## Example for people disambiguator

A typical NLP use case is the following: once a Named Entity Recognition model has identified that a given chunk refers to a person, we want to link that chunk to a particular person using an external source such as Wikipedia.

This is fundamentally a multiclass classification problem in which the classes are all the Wikipedia entries that refer to persons, and the entity to be disambiguated is the piece of original content we know refers to a person. That is why this kind of annotator is called a disambiguator.
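To make the classification framing concrete, here is a minimal sketch. It is not the DisambiguatorModel implementation: it assumes that the mention and every candidate entry are already represented as vectors, and it simply ranks candidates by cosine similarity. The `rank_candidates` helper and the toy three-dimensional vectors are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(mention_vec, candidates):
    """Return (label, score) pairs sorted from best to worst match."""
    scored = [(label, cosine(mention_vec, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only (not real embeddings).
mention = [0.9, 0.1, 0.3]
candidates = {
    "Ronald_Reagan": [0.8, 0.2, 0.3],
    "Nancy_Reagan": [0.5, 0.6, 0.2],
    "Ronald_Hines": [0.1, 0.9, 0.4],
}

ranking = rank_candidates(mention, candidates)
best_label, best_score = ranking[0]
print(best_label, round(best_score, 4))
```

Each candidate entry plays the role of one class, and the predicted class is the one with the highest score.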

Let's imagine we have this sentence:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
Ronald Reagan was a president of the United States during the 80s.<br>
</div>

SparkNLP Enterprise provides pipeline functionality that identifies which words refer to persons (Ronald Reagan) and links each of those references to the specific Wikipedia entry.

### Step 1. Prepare the environment

#### Install OpenSource spark-nlp and pyspark pip packages
As a first step, we import the required Python dependencies, including some sparknlp components.

Be sure that you have the required python libraries (pyspark 2.4.4, spark-nlp 2.4.0) by running <code>pip list</code>. Check that the versions are correct.

If any of them is missing, you can run:

<code>pip install --ignore-installed pyspark==2.4.4</code><br>
<code>pip install --ignore-installed spark-nlp==2.4.0</code><br>

The <code>--ignore-installed</code> parameter overwrites a previous version of the package if one is already installed.
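The version check can also be done from Python rather than by reading the `pip list` output. The following is a rough sketch: the `installed_version` and `check_requirements` helpers are hypothetical, and on Python versions older than 3.8 it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python < 3.8: assumes the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Versions this notebook was written against (taken from the text above).
REQUIRED = {"pyspark": "2.4.4", "spark-nlp": "2.4.0"}

def installed_version(package):
    """Return the installed version of a pip package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def check_requirements(required):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for package, expected in required.items():
        found = installed_version(package)
        report[package] = (expected, found, found == expected)
    return report

for package, (expected, found, ok) in check_requirements(REQUIRED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package}: expected {expected}, found {found} [{status}]")
```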

<i>*If this cell fails, it means you have not properly set up the required environment. Please check the prerequisites guideline at http://www.johnsnowlabs.com</i>

In [1]:
from sparknlp.base import *
from sparknlp.annotator import *

#### Install Licensed Spark-NLP package

We will also use some Spark-NLP enterprise functionality contained in the spark-nlp-jsl package.

You can check that spark-nlp-jsl is installed by running <code>pip list</code>. Check that the installed version is 2.4.0.

If it is not, install it with:

<code>pip install spark-nlp-jsl==2.4.0 --extra-index-url https://pypi.johnsnowlabs.com/##### --ignore-installed</code>

The ##### is a secret URL. If you have not received it, please contact us at info@johnsnowlabs.com.

<i>*If the next cell fails, it means your licensed enterprise version is not properly installed, so please check the prerequisites guideline at http://www.johnsnowlabs.com/</i>

In [2]:
import sparknlp_jsl

#### Setup credentials to private JohnSnowLabs models repository with AWS-CLI

Now it is time to configure Spark-NLP to access the private JohnSnowLabs models repository. This access goes through the Amazon AWS command line interface (AWS CLI).

Instructions about how to install awscli are available at:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

Make sure you configure your credentials with <code>aws configure</code>, following the instructions at:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Please substitute the ACCESS_KEY and SECRET_KEY with the credentials you have received. If you need your credentials, contact us at info@johnsnowlabs.com


#### Setup your license number in order to have access to private JohnSnowLabs models repository

You need to set the SPARK_NLP_LICENSE environment variable to the license number provided to you. If you need your license credentials, contact us at info@johnsnowlabs.com

In [3]:
import os
os.environ['SPARK_NLP_LICENSE']='#########'
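
A small guard like the following can catch a forgotten or placeholder license value before the Spark session is started. This is only a sketch: `require_env` is a hypothetical helper, not part of Spark NLP, and treating a value made solely of `#` characters as a placeholder is an assumption based on how this notebook masks secrets.

```python
import os

def require_env(name, placeholder_char="#"):
    """Return the value of an environment variable.

    Raises RuntimeError if the variable is unset, empty, or still a
    placeholder consisting only of `placeholder_char`.
    """
    value = os.environ.get(name, "")
    if not value or set(value) == {placeholder_char}:
        raise RuntimeError(f"{name} is not set to a real value")
    return value

try:
    require_env("SPARK_NLP_LICENSE")
    print("SPARK_NLP_LICENSE looks set")
except RuntimeError as error:
    print(f"Configuration problem: {error}")
```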

#### Start Spark session

The following will initialize the Spark session in case you have run the Jupyter notebook directly. If you have started the notebook using pyspark, this cell is simply ignored.

Initializing the Spark session takes some seconds (usually less than 1 minute) because the jar needs to be loaded from the server.

The ####### is a secret code required to run the licensed version 2.4.0. If you have not received it, please contact us at info@johnsnowlabs.com.

In [4]:
spark = sparknlp_jsl.start("########") # Secret code provided as part of the license

### Step 2: Disambiguation pipeline generation

In Spark-NLP, annotation happens through pipelines, which are made up of several annotator stages. In our case, the architecture of the people disambiguation pipeline will be:

* DocumentAssembler (text -> document)
* SentenceDetector (document -> sentence)
* Tokenizer (sentence -> token)
* WordEmbeddingsModel ([sentence, token] -> word_embeddings)
* NerDLModel (ner_dl) ([sentence, token, word_embeddings] -> ner)
* NerConverter ([sentence, token, ner] -> ner_chunk)
* DisambiguatorModel ([ner_chunk, word_embeddings] -> disambiguation)

So, starting from a text, we end up with links to the Wikipedia URLs of the persons referenced in the document.

We will use a pretrained model (NerDLModel ner_dl) that, leveraging a language model encoded in word embeddings (word_embeddings), is able to recognize tokens that name persons, organizations and other entities.

Then we transform its output (ner) into chunks (ner_chunk) that are then used by another pretrained annotator (DisambiguatorModel). The disambiguator will select the most relevant wikipedia entry for those chunks naming persons.
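
The chain of annotators described above only works if every stage finds its input columns among the columns produced by earlier stages. The sketch below checks that wiring in plain Python, without Spark. The `check_wiring` helper is hypothetical, and the column names follow the ones used in this notebook's code cells (`word_embeddings`, `disambiguation`).

```python
# Each stage: (annotator name, input columns, output column).
STAGES = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("WordEmbeddingsModel", ["sentence", "token"], "word_embeddings"),
    ("NerDLModel", ["sentence", "token", "word_embeddings"], "ner"),
    ("NerConverter", ["sentence", "token", "ner"], "ner_chunk"),
    ("DisambiguatorModel", ["ner_chunk", "word_embeddings"], "disambiguation"),
]

def check_wiring(stages, initial_columns):
    """Return a list of (stage name, missing input columns) problems."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        missing = [column for column in inputs if column not in available]
        if missing:
            problems.append((name, missing))
        available.add(output)
    return problems

print(check_wiring(STAGES, ["text"]))

# Dropping the embeddings stage breaks every stage that depends on it.
without_embeddings = [stage for stage in STAGES if stage[0] != "WordEmbeddingsModel"]
print(check_wiring(without_embeddings, ["text"]))
```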

In [5]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP

token = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")


The fourth annotator in the pipeline is "WordEmbeddingsModel". We will download a <a href='https://nlp.stanford.edu/projects/glove/'>freely available</a> pretrained model named "glove_100d".

When running this cell, please be patient.

The first time you call this pretrained model, it needs to be downloaded to your local machine, which takes a while (depending on your internet connection).

The size is about 145 MB and it will typically be saved in your home folder as

    ~HOMEFOLDER/cached_models/glove_100d_en_2.4.0_2.4_1579690104032

On subsequent calls the model is loaded from your cached copy, but even then it needs to be indexed each time, so expect to wait up to 1 minute (depending on your machine).

In [6]:
wordEmbeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")\
    .setInputCols("sentence", "token")\
    .setOutputCol("word_embeddings")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


We will now download another freely available SparkNLP pretrained model (ner_dl): a Named Entity Recognizer based on a deep learning architecture that takes as input the GloVe 100d embeddings (previously loaded in our pipeline). Its size is about 13 MB and it will typically be stored on your local machine as:

    ~HOMEFOLDER/cached_models/ner_dl_en_2.4.0_2.4_1580251789753



In [7]:
# "ner_dl" / "en" is the default pretrained model; named explicitly for clarity
ner = NerDLModel.pretrained("ner_dl", "en")\
    .setInputCols("sentence", "token", "word_embeddings")\
    .setOutputCol("ner")

ner_dl download started this may take some time.
Approximate size to download 13.5 MB
[OK!]


Our next pipeline component is the NerConverter, which we configure to keep only person entities (hence the <code>.setWhiteList(["PER"])</code>).

In [8]:
nerConverter = NerConverter()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["PER"])
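
To illustrate what this conversion step does conceptually, the sketch below groups IOB-tagged tokens into chunks and keeps only whitelisted labels. It is a plain-Python approximation, not the NerConverter implementation: `iob_to_chunks` is a hypothetical helper, and the tags assigned to the example sentence are assumed, since the notebook does not print the raw `ner` column.

```python
def iob_to_chunks(tokens, tags, whitelist=None):
    """Group IOB-tagged tokens into (text, label) chunks.

    When `whitelist` is given, only chunks whose label is in it are kept.
    """
    chunks = []
    words, label = [], None
    # A trailing "O" sentinel flushes the last open chunk.
    for token, tag in zip(tokens + [None], tags + ["O"]):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and words and tag_label == label:
            words.append(token)  # continue the current chunk
            continue
        if words and (whitelist is None or label in whitelist):
            chunks.append((" ".join(words), label))
        if prefix in ("B", "I"):
            words, label = [token], tag_label  # start a new chunk
        else:
            words, label = [], None
    return chunks

tokens = ["Ronald", "Reagan", "was", "a", "president", "of", "the",
          "United", "States", "during", "the", "80s", "."]
# Assumed tags, for illustration only.
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O",
        "B-LOC", "I-LOC", "O", "O", "O", "O"]

print(iob_to_chunks(tokens, tags))
print(iob_to_chunks(tokens, tags, whitelist={"PER"}))
```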

Finally, we load the pretrained people disambiguator. This pretrained model is licensed, so if this cell fails it is because you have not set up the AWS CLI or the SPARK_NLP_LICENSE environment variable.

Contact us at info@johnsnowlabs.com if you have not received the required credentials and license key.

The "people_disambiguator" model size is about 54 MB and it will typically be saved on your local system as:

    ~HOMEFOLDER/cached_models/people_disambiguator_en_2.3.4_2.4_1574806205059

In [9]:
disambiguator = sparknlp_jsl.annotator.DisambiguatorModel.pretrained('people_disambiguator', 'en', 'clinical/models')\
    .setInputCols("ner_chunk", "word_embeddings")\
    .setOutputCol("disambiguation")

people_disambiguator download started this may take some time.
Approximate size to download 54.1 MB
[OK!]


#### Step 2.2 Defining the stages of the pipeline
Now that we have created all the components of our pipeline, let's put them all together into a pipeline.

In [10]:
pipeline = Pipeline().setStages([
    document,
    sentenceDetector,
    token,
    wordEmbeddings,
    ner,
    nerConverter,
    disambiguator
])

### Step 3 Fit the pipeline with some data
Let's now see how our disambiguation pipeline works with some data. We will use the following sentence naming Ronald Reagan:

<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
Ronald Reagan was a president of the United States during the 80s.<br>
</div>

We will create a Spark DataFrame containing the only line of this document:

In [11]:
data = spark.createDataFrame([
    [1, "Ronald Reagan was a president of the United States during the 80s."]
]).toDF('id', 'text')

data.show(truncate=False)

+---+------------------------------------------------------------------+
|id |text                                                              |
+---+------------------------------------------------------------------+
|1  |Ronald Reagan was a president of the United States during the 80s.|
+---+------------------------------------------------------------------+



Now we will create a model by fitting our pipeline to our content:

In [12]:
model = pipeline.fit(data)

And we apply the model by transforming our data:

In [13]:
output = model.transform(data)

The results of our pipeline are stored in <code>output</code>, a Spark DataFrame, so let's show some relevant columns:

In [14]:
output.select('ner_chunk.result', 'disambiguation.result', 'disambiguation.metadata').show()

+---------------+--------------------+--------------------+
|         result|              result|            metadata|
+---------------+--------------------+--------------------+
|[Ronald Reagan]|[http://en.wikipe...|[[chunk -> Ronald...|
+---------------+--------------------+--------------------+



The output of Spark is not especially appealing so for the sake of this demo we can extract just the first row of that dataframe, extract the relevant pieces of information and show them in a prettier html format:

In [15]:
# We extract the NER chunk string:
chunk_token = output.select('ner_chunk.result').take(1)[0]['result'][0]

# We extract the URLs suggested by the disambiguator
disambiguator_urls = output.select('disambiguation.result').take(1)
dis_urls = [x.strip() for x in disambiguator_urls[0]['result'][0].split(",")]

# We extract the scores suggested by the disambiguator
disambiguator_scores = output.select('disambiguation.metadata').take(1)
dis_scores = [float(x.strip()) for x in disambiguator_scores[0]['metadata'][0]['scores'].split(",")]

# Now we print all the relevant information in a HTML table
html_output = "<center><table style='font-size: 1.0em; border: 1px solid'>"
html_output += "<tr style='border: 1px solid'><td style='text-align: left'>Original sentence:</td><td></td></tr>"
html_output += "<tr style='border: 1px solid'><td>Ronald Reagan was a president of the United States during the 80s.</td><td></td></tr>"

html_output += "<tr style='border: 1px solid'><td style='text-align: left'>Person Chunk: " + chunk_token + "</td><td></td></tr>"
html_output += "<tr style='border: 1px solid'><td style='text-align: center'>URL candidate</td><td style='text-align: center'>Score</td></tr>"
for this_index in range(len(dis_urls)):
    html_output += "<tr><td style='text-align: center; border: 1px solid'>" + dis_urls[this_index] + "</td><td style='text-align: left; border: 1px solid'>" + str(dis_scores[this_index]) + "</td></tr>"
html_output += "</table></center>"

from IPython.display import display, HTML
display(HTML(html_output))

Original sentence: Ronald Reagan was a president of the United States during the 80s.

Person Chunk: Ronald Reagan

| URL candidate | Score |
|---|---|
| http://en.wikipedia.org/wiki/Ronald_Reagan | 0.9672 |
| http://en.wikipedia.org/wiki/Nancy_Reagan | 0.9551 |
| http://en.wikipedia.org/wiki/Ronald_Hines | 0.9398 |
| http://en.wikipedia.org/wiki/Ronald_Brittain | 0.9376 |
| http://en.wikipedia.org/wiki/Ronald_Millar | 0.9373 |


You can see how our pipeline has identified "Ronald Reagan" as a chunk referring to a person. For this chunk, it calculates a score for each candidate Wikipedia article, indicating how likely that article is to be about the person named in the chunk.

In our example http://en.wikipedia.org/wiki/Ronald_Reagan gets the highest score (0.9672) followed by the entry for Nancy Reagan (0.9551).
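
Once the candidates and scores are parsed, it can be useful to look at how far ahead the top candidate is, since a small gap suggests an ambiguous mention. The sketch below works on plain strings and does not need Spark. The `rank_urls` helper is hypothetical, and the two input strings are rebuilt from the values shown above, assuming the comma-separated format that the parsing code in this notebook relies on.

```python
def rank_urls(urls_field, scores_field):
    """Parse comma-separated URLs and scores into (url, score) pairs, best first."""
    urls = [url.strip() for url in urls_field.split(",")]
    scores = [float(score.strip()) for score in scores_field.split(",")]
    if len(urls) != len(scores):
        raise ValueError("URL and score lists differ in length")
    return sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)

# Strings rebuilt from the results shown above (format assumed).
urls_field = (
    "http://en.wikipedia.org/wiki/Ronald_Reagan, "
    "http://en.wikipedia.org/wiki/Nancy_Reagan, "
    "http://en.wikipedia.org/wiki/Ronald_Hines, "
    "http://en.wikipedia.org/wiki/Ronald_Brittain, "
    "http://en.wikipedia.org/wiki/Ronald_Millar"
)
scores_field = "0.9672, 0.9551, 0.9398, 0.9376, 0.9373"

ranked = rank_urls(urls_field, scores_field)
best_url, best_score = ranked[0]
margin = best_score - ranked[1][1]  # gap between the top two candidates
print(best_url, best_score, round(margin, 4))
```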


### Step 4 Use LightPipelines

Once you have created a model by fitting a pipeline with some data, you can use LightPipelines, which are faster and easier to use for testing or real-time queries.

Let's create a light_pipeline from our model:

In [16]:
light_pipeline = LightPipeline(model)

Now we can run the pipeline on a new, slightly different sentence:

In [17]:
light_data = light_pipeline.annotate("Nancy Reagan was the spouse of one president of the United States.")

The results are stored in a Python dictionary. Let's check the sentence identified by our pipeline:

In [18]:
light_data['sentence']

['Nancy Reagan was the spouse of one president of the United States.']

And the chunk identified as a person name:

In [19]:
light_data['ner_chunk']

['Nancy Reagan']

We see that the list of URLs suggested by our people_disambiguator has now changed, with Nancy Reagan's Wikipedia entry now being the one with the highest score:

In [20]:
[x.strip() for x in light_data['disambiguation'][0].split(",")]

['http://en.wikipedia.org/wiki/Nancy_Reagan',
 'http://en.wikipedia.org/wiki/Ronald_Reagan',
 'http://en.wikipedia.org/wiki/Nancy_McIntosh',
 'http://en.wikipedia.org/wiki/Nancy_Nevinson',
 'http://en.wikipedia.org/wiki/Nancy_Guild']
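
For downstream use, it is often enough to keep one link per person chunk. The sketch below pairs each chunk with the first URL in its candidate list. It rests on two assumptions that the notebook does not state explicitly: that the entries in `disambiguation` are aligned by position with those in `ner_chunk`, and that candidates are listed from highest to lowest score. The `top_links` helper is hypothetical, and `sample` is a stand-in for `light_data` rebuilt from the outputs printed above.

```python
def top_links(annotation):
    """Map each person chunk to the first URL of its candidate list."""
    links = {}
    for chunk, candidates in zip(annotation["ner_chunk"], annotation["disambiguation"]):
        links[chunk] = candidates.split(",")[0].strip()
    return links

# Stand-in for `light_data`, rebuilt from the outputs shown above.
sample = {
    "sentence": ["Nancy Reagan was the spouse of one president of the United States."],
    "ner_chunk": ["Nancy Reagan"],
    "disambiguation": [
        "http://en.wikipedia.org/wiki/Nancy_Reagan, "
        "http://en.wikipedia.org/wiki/Ronald_Reagan, "
        "http://en.wikipedia.org/wiki/Nancy_McIntosh, "
        "http://en.wikipedia.org/wiki/Nancy_Nevinson, "
        "http://en.wikipedia.org/wiki/Nancy_Guild"
    ],
}

print(top_links(sample))
```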