


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/JohnSnowLabs/nlu/blob/master/examples/colab/Training/sentence_entity_resolution/sentence_entity_resolution_training.ipynb)


# Sentence Entity Resolution training
Named entities are spans of text that are labeled with classes.  
These classes and strings are still ambiguous, though, and semantically identical entities cannot be grouped without a defined `terminology`.
With the `Sentence Resolver` you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

A concrete example:

- The `TSLA` stock is good to buy.
- `Tesla, Inc.` is a great company to invest in
- The price of `Teslas` stocks is going up

`TSLA`, `Tesla`, and `Teslas` can be extracted by an NER model and labeled with the `company` entity class. But we cannot tell programmatically whether they all refer to the same semantic concept, in this case the same company.

To resolve this ambiguity, we can introduce a terminology in which the Tesla company has the ID 21 and every other company in our portfolio gets a unique ID as well.  
With a defined terminology at hand and a labeled dataset, we can train a resolver to map textually different but semantically equivalent `company entities` to `the same id`.
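
To make the idea of a terminology concrete, here is a minimal plain-Python sketch. The names `TERMINOLOGY`, `LABELED_MENTIONS` and `resolve` are hypothetical illustrations and not part of NLU; a trained resolver learns this mapping from embeddings instead of relying on an exact-match table like this one.

```python
# Hypothetical terminology: one unique ID per company concept.
TERMINOLOGY = {"21": "Tesla, Inc."}

# Hypothetical labeled mentions: surface form -> terminology ID.
LABELED_MENTIONS = {
    "TSLA": "21",
    "Tesla, Inc.": "21",
    "Teslas": "21",
}

def resolve(mention):
    """Return (id, canonical name) for a known mention, else None."""
    concept_id = LABELED_MENTIONS.get(mention)
    if concept_id is None:
        return None
    return concept_id, TERMINOLOGY[concept_id]

mentions = ["TSLA", "Tesla, Inc.", "Teslas"]
# The raw strings all differ, so string comparison cannot group them ...
all_strings_differ = len(set(mentions)) == len(mentions)
# ... but every mention resolves to the same terminology ID.
resolved_ids = {resolve(m)[0] for m in mentions}
print(all_strings_differ, resolved_ids)
```

An exact-match table only covers mentions it has already seen, which is why the rest of this notebook trains a model on labeled examples instead.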








## 1. Install NLU, dependencies and Authenticate

See the [install docs](https://nlu.johnsnowlabs.com/docs/en/install#super-quickstart-on-google-colab-or-kaggle) and [authentication docs](https://nlu.johnsnowlabs.com/docs/en/examples_hc#authorize-access-to-licensed-features-and-install-healthcare-dependencies) for more information.


In [None]:
!wget http://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
import pandas as pd 

# Train Sentence Resolver

This is a mini example to familiarize you with the dataset structure you must provide for training.  
Train a sentence resolver on a dataset with columns named `y`, `_y` and `text`. `y` is the label (the terminology code), `_y` is an extra identifier label (the normalized name for that code), and `text` is the raw text.
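
As a rough sketch of how that three-column layout could be assembled from labeled examples, the plain-Python helper below turns `(text, code, name)` triples into the `text`, `y` and `_y` columns. `build_training_columns` is a hypothetical helper, not an NLU function, and storing the codes as strings is an assumption that mirrors the example dataset in this notebook.

```python
def build_training_columns(examples):
    """Turn (text, code, name) triples into the three training columns.

    `build_training_columns` is a hypothetical helper, not an NLU function.
    """
    columns = {"text": [], "y": [], "_y": []}
    for text, code, name in examples:
        if not text.strip():
            raise ValueError("every example needs non-empty text")
        columns["text"].append(text)
        columns["y"].append(str(code))  # terminology code, kept as a string
        columns["_y"].append(name)      # normalized name for that code
    return columns

columns = build_training_columns([
    ("TSLA is good to invest", 23, "TESLA"),
    ("TESLA INC. we should buy", 23, "TESLA"),
])
# Passing `columns` to pd.DataFrame(...) would give a frame with the
# columns `text`, `y` and `_y`.
print(columns)
```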


In [1]:
import nlu
SPARK_NLP_LICENSE           ="????"
AWS_ACCESS_KEY_ID          = "????"
AWS_SECRET_ACCESS_KEY       ="????"
JSL_SECRET                  ="????"
nlu.auth(SPARK_NLP_LICENSE,AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY,JSL_SECRET)

<module 'nlu' from '/usr/local/lib/python3.7/dist-packages/nlu/__init__.py'>

In [2]:
import pandas as pd 
import nlu
dataset = pd.DataFrame({
    'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
    'y': ['23','23','23','23'],
    '_y': ['TESLA','TESLA','TESLA','TESLA'],

})

trainable_pipe = nlu.load('train.resolve_sentence')
fitted_pipe  = trainable_pipe.fit(dataset)
fitted_pipe.predict(dataset.text)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence_resolution_resolve_sentence_confidence,sentence_resolution_resolve_sentence_code,document,sentence_resolution_resolve_sentence,sentence,sentence_embedding_use
0,[1.0000],[23],The Tesla company is good to invest is,[TESLA],[The Tesla company is good to invest is],"[[0.075991176, 0.04411165, -0.0010679043, -0.0..."
1,[1.0000],[23],TSLA is good to invest,[TESLA],[TSLA is good to invest],"[[0.06989084, -0.002836604, -0.02459646, -0.02..."
2,[1.0000],[23],TESLA INC. we should buy,[TESLA],[TESLA INC. we should buy],"[[0.08029125, 0.03371899, -0.006450202, 0.0065..."
3,[1.0000],[23],PUT ALL MONEY IN TSLA inc!!,[TESLA],[PUT ALL MONEY IN TSLA inc!!],"[[0.06478285, -0.0022001457, -0.0069140834, -0..."


## Train Sentence Resolver with Bert Embeddings
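By default, Universal Sentence Encoder (USE) sentence embeddings are used, as the `tfhub_use` download in the log above shows. You can use any of the [100+ Word Embeddings]() to train your resolver. If you are handling medical data, vectors trained on biomedical text, such as GloVe or BioBERT variants, are a good choice.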

In [4]:
# We can configure various parameters on the sentence resolver
trainable_pipe.print_info()


The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['sentence_resolver'] has settable params:
pipe['sentence_resolver'].setNormalizedCol('_y')     | Info: Column name for the original, normalized description | Currently set to : _y
pipe['sentence_resolver'].setDistanceFunction('EUCLIDIAN')  | Info: What distance function to use for WMD: 'EUCLIDEAN' or 'COSINE' | Currently set to : EUCLIDIAN
pipe['sentence_resolver'].setNeighbours(25)          | Info: Number of neighbours to consider in the KNN query to calculate WMD | Currently set to : 25
pipe['sentence_resolver'].setThreshold(1000.0)       | Info: Threshold value for the last distance calculated | Currently set to : 1000.0
pipe['sentence_resolver'].setMissAsEmpty(True)       | Info: whether or not to return an empty annotation on unmatched chunks | Currently set to : True
pipe['sentence_resolver'].setReturnCosineDistances(True)  | Info: Extract Cosine Distances. TRUE or False |
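
The `setDistanceFunction` and `setNeighbours` parameters listed above describe a nearest-neighbour query over embeddings. The toy sketch below only illustrates what a distance function and a neighbour count mean in such a query. The two-dimensional vectors, the codes `A` and `B`, and the `nearest_codes` helper are made up for illustration, and this is not how the library implements resolution internally.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length, compares direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest_codes(query, labeled_vectors, distance, neighbours):
    """Return the codes of the `neighbours` closest labeled vectors."""
    ranked = sorted(labeled_vectors, key=lambda item: distance(query, item[0]))
    return [code for _, code in ranked[:neighbours]]

# Made-up labeled embeddings: (vector, code).
labeled_vectors = [
    ((1.0, 0.0), "A"),
    ((0.9, 0.1), "A"),
    ((0.0, 1.0), "B"),
    ((0.0, 5.0), "B"),
]
query = (0.1, 2.0)
euclidean_hits = nearest_codes(query, labeled_vectors, euclidean, 2)
cosine_hits = nearest_codes(query, labeled_vectors, cosine_distance, 2)
print(euclidean_hits, cosine_hits)
```

In this toy data the two distance functions return different neighbour sets for the same query, which is why the choice of distance function and neighbour count can be worth experimenting with.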

# Train a SNOMED resolver
We download a sample SNOMED dataset which we can use for training.

In [5]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.test.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.train.txt

In [8]:
import pandas as pd
cols = ["y","_y","text"]
aap_tr = pd.read_csv("AskAPatient.fold-0.train.txt",sep="\t",encoding="ISO-8859-1",header=None).iloc[:250]
aap_te = pd.read_csv("AskAPatient.fold-0.test.txt",sep="\t",encoding="ISO-8859-1",header=None).iloc[:250]
aap_tr.columns = cols
aap_te.columns = cols
aap_tr


Unnamed: 0,y,_y,text
0,108367008,Dislocation of joint,Dislocation of joint
1,3384011000036100,Arthrotec,Arthrotec
2,166717003,Serum creatinine raised,Serum creatinine raised
3,3877011000036101,Lipitor,Lipitor
4,402234004,Foot eczema,Foot eczema
...,...,...,...
245,162290004,Dry eyes,Dry eyes
246,419723007,Mentally dull,Mentally dull
247,4216011000036104,Norvasc,Norvasc
248,13791008,Asthenia,Asthenia


In [11]:
# Healthcare Embeddings
trainable_pipe = nlu.load('en.embed_sentence.bert.jsl_tiny_umls_uncased train.resolve_sentence')
trainable_pipe['sentence_resolver'].setNeighbours(4)  
fitted_pipe  = trainable_pipe.fit(aap_tr)
prediction = fitted_pipe.predict(aap_tr)
prediction

sbert_jsl_tiny_umls_uncased download started this may take some time.
Approximate size to download 15.8 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence_embedding_bert,sentence_resolution_resolve_sentence_confidence,document,sentence_resolution_resolve_sentence,sentence,sentence_resolution_resolve_sentence_code,text,_y,y
0,"[[-0.9687815, -0.31864247, -0.2600491, -0.4007...",[0.9992],Dislocation of joint,[Dislocation of joint],[Dislocation of joint],[108367008],Dislocation of joint,Dislocation of joint,108367008
1,"[[-0.7108755, -0.52662104, -0.5808298, -0.7171...",[0.9921],Arthrotec,[Arthrotec],[Arthrotec],[3384011000036100],Arthrotec,Arthrotec,3384011000036100
2,"[[-0.54099977, -2.0953283, 0.14658599, 0.04070...",[0.9975],Serum creatinine raised,[Serum creatinine raised],[Serum creatinine raised],[166717003],Serum creatinine raised,Serum creatinine raised,166717003
3,"[[-0.4524097, -1.3946228, -0.528351, -0.196025...",[1.0000],Lipitor,[Lipitor],[Lipitor],[3877011000036101],Lipitor,Lipitor,3877011000036101
4,"[[-0.76311046, -0.4025006, -0.17865096, -0.329...",[0.9942],Foot eczema,[Foot eczema],[Foot eczema],[402234004],Foot eczema,Foot eczema,402234004
...,...,...,...,...,...,...,...,...,...
245,"[[-0.037025943, -1.3459508, -0.27637935, -0.10...",[0.9981],Dry eyes,[Dry eyes],[Dry eyes],[162290004],Dry eyes,Dry eyes,162290004
246,"[[-0.93272203, -1.3695889, -0.043282345, -0.65...",[1.0000],Mentally dull,[Mentally dull],[Mentally dull],[419723007],Mentally dull,Mentally dull,419723007
247,"[[-0.45309123, -1.5768625, -0.43449545, -0.912...",[0.9864],Norvasc,[Norvasc],[Norvasc],[4216011000036104],Norvasc,Norvasc,4216011000036104
248,"[[-0.55921286, -1.661054, -0.27011818, -0.6075...",[1.0000],Asthenia,[Asthenia],[Asthenia],[13791008],Asthenia,Asthenia,13791008
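
The cell above predicts on the same rows the resolver was trained on, so the scores shown say little about unseen data. One way to get a more informative number is to compare predicted codes with gold labels on held-out rows such as `aap_te`. The sketch below is plain Python with made-up values. `code_accuracy` is a hypothetical helper, and it assumes that predictions arrive as lists of codes, as in the `sentence_resolution_resolve_sentence_code` column shown above, while gold labels may be integers.

```python
def code_accuracy(predicted_codes, gold_codes):
    """Fraction of rows whose first predicted code equals the gold code."""
    if len(predicted_codes) != len(gold_codes):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_codes:
        return 0.0
    hits = 0
    for predicted, gold in zip(predicted_codes, gold_codes):
        # An empty prediction list counts as a miss.
        first = str(predicted[0]) if predicted else None
        # Compare as strings so integer and string codes match.
        hits += first == str(gold)
    return hits / len(gold_codes)

# Made-up example values, not results from this notebook.
predicted = [["108367008"], ["166717003"], [], ["13791008"]]
gold = [108367008, 166717003, 402234004, 419723007]
print(code_accuracy(predicted, gold))
```

With real data, the two arguments could be taken from a prediction made on `aap_te` and from `aap_te.y`, assuming the column names match the output shown above.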
