


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/JohnSnowLabs/nlu/blob/master/examples/colab/Training/sentence_entity_resolution/sentence_entity_resolution_training.ipynb)


# Sentence Entity Resolution training
Named entities are spans of text that are labeled with a class.
These classes and strings are still ambiguous, though: without a defined `terminology`, it is not possible to group semantically identical entities.
With the `Sentence Resolver` you can train a state-of-the-art deep learning architecture to map entities to their unique representation in a terminology.

A concrete example:

- The `TSLA` stock is good to buy.
- `Tesla, Inc.` is a great company to invest in.
- The price of `Teslas` stock is going up.

`TSLA`, `Tesla`, and `Teslas` can be extracted by an NER model and labeled with the `company` entity class. But we cannot tell programmatically whether they all refer to the same semantic concept, in this case the same company.

To resolve this ambiguity, we can introduce a terminology in which the Tesla company has the ID 21 and every other company in our portfolio gets a unique ID as well.
With a defined terminology and a labeled dataset at hand, we can train a resolver to map textually different but semantically equivalent `company entities` to `the same id`.
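
To see why a trained resolver is useful, it helps to look at the naive alternative first. The sketch below is a plain dictionary lookup over a hypothetical toy terminology (only the ID 21 for Tesla comes from the text above; the surface forms and helper names are made up for illustration). It resolves variants it has seen, but fails on anything new.

```python
# Hypothetical toy terminology: normalized surface form -> concept ID.
# Only the ID 21 for Tesla is taken from the text; the entries are illustrative.
TERMINOLOGY = {
    "tsla": 21,
    "tesla": 21,
    "tesla, inc.": 21,
    "teslas": 21,
}


def normalize(mention):
    """Lower-case and collapse whitespace so trivial variants collide."""
    return " ".join(mention.lower().split())


def lookup(mention):
    """Return the concept ID for a mention, or None if it is unknown."""
    return TERMINOLOGY.get(normalize(mention))


mentions = ["TSLA", "Tesla, Inc.", "Teslas", "Tesla Motors"]
resolved = {mention: lookup(mention) for mention in mentions}
print(resolved)
```

`Tesla Motors` comes back as `None` because it was never listed. A trained resolver is meant to cover such unseen variants by comparing embeddings instead of exact strings.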








## 1. Colab Setup


In [None]:
# Install the johnsnowlabs library
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
# Start the Spark session
spark = nlp.start()

# Train Sentence Resolver

This is a minimal example to familiarize you with the dataset structure you must provide for training.
The resolver is trained on a dataset with columns named `y`, `_y` and `text`: `y` is the label, `_y` is an extra identifier label, and `text` is the raw text. In the output below, `y` comes back as the resolved code and `_y` as the resolved text.
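
Before fitting, it can be worth checking that the three columns described above are present and complete. The following is a minimal sketch in plain Python that works on the same kind of column dictionary you would pass to `pd.DataFrame`. The helper name `validate_training_columns` and the rules it enforces are assumptions for illustration, not part of the library.

```python
REQUIRED_COLUMNS = ("text", "y", "_y")


def validate_training_columns(columns):
    """Return a list of problems found in a column dict; an empty list means OK."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        return [f"missing columns: {missing}"]

    lengths = {name: len(columns[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        return [f"column lengths differ: {lengths}"]

    problems = []
    rows = zip(*(columns[name] for name in REQUIRED_COLUMNS))
    for i, row in enumerate(rows):
        for name, value in zip(REQUIRED_COLUMNS, row):
            # Codes may be ints or strings, so only reject None and blank values.
            if value is None or not str(value).strip():
                problems.append(f"row {i}: column '{name}' is empty")
    return problems


columns = {
    "text": ["TSLA is good to invest", "TESLA INC. we should buy"],
    "y": ["23", "23"],
    "_y": ["TESLA", "TESLA"],
}
problems = validate_training_columns(columns)
print(problems or "dataset looks well-formed")
```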


In [5]:
import pandas as pd

# Toy training set: four different mentions that all map to the same code
dataset = pd.DataFrame({
    'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
    'y': ['23','23','23','23'],
    '_y': ['TESLA','TESLA','TESLA','TESLA'],
})

# Load the trainable resolver pipeline, fit it, and predict on the training text
trainable_pipe = nlp.load('train.resolve_sentence')
fitted_pipe = trainable_pipe.fit(dataset)
fitted_pipe.predict(dataset.text)

setInputCols in SentenceEntityResolverApproach_d4f860f61823 expecting 1 columns. Provided column amount: 2. Which should be columns from the following annotators: ['sentence_embeddings']
sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,document,resolution_sentence_entity_resolver_code,resolution_sentence_entity_resolver_confidence,resolution_sentence_entity_resolver_distance,resolution_sentence_entity_resolver_origin_sentence,resolution_sentence_entity_resolver_resolved_text,resolution_sentence_entity_resolver_target_text,resolution_sentence_entity_resolver_token,sentence_embedding_small_bert_L2_128
0,The Tesla company is good to invest is,23,1.0,0.0,0,TESLA,The Tesla company is good to invest is,The Tesla company is good to invest is,"[[0.5044986009597778, 0.7948187589645386, -0.6..."
1,TSLA is good to invest,23,1.0,0.0,0,TESLA,TSLA is good to invest,TSLA is good to invest,"[[-1.1105577945709229, 0.8402332067489624, -1...."
2,TESLA INC. we should buy,23,1.0,0.0,0,TESLA,TESLA INC. we should buy,TESLA INC. we should buy,"[[-0.6380321979522705, 0.5634128451347351, -0...."
3,PUT ALL MONEY IN TSLA inc!!,23,1.0,0.0,0,TESLA,PUT ALL MONEY IN TSLA inc!!,PUT ALL MONEY IN TSLA inc!!,"[[-1.7485851049423218, 0.26517942547798157, -0..."
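
The `distance` column in the output above is 0.0 for every row because we predicted on the exact sentences we trained on. As a rough intuition only, and not the library's actual implementation, resolution can be pictured as a nearest-neighbour search over sentence embeddings. The sketch below uses made-up 3-dimensional vectors and hypothetical codes (`42`, `77`) with a cosine distance.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def resolve(query, index, k=1):
    """Return the k closest (distance, code) pairs from the index."""
    ranked = sorted((cosine_distance(query, vector), code) for vector, code in index)
    return ranked[:k]


# Toy index: (embedding, code). Real sentence embeddings have many more dimensions.
index = [
    ([1.0, 0.0, 0.0], "23"),
    ([0.0, 1.0, 0.0], "42"),
    ([0.0, 0.0, 1.0], "77"),
]

query = [0.9, 0.1, 0.0]  # close to the first entry, but not identical
nearest = resolve(query, index, k=2)
print(nearest)
```

A query identical to an indexed vector gets distance 0, which mirrors what the training rows show above.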


## Train Sentence Resolver with BERT Embeddings
The run above used `sent_small_bert_L2_128` sentence embeddings by default, as its download log shows. You can use any of the 100+ available embeddings to train your resolver. If you are handling medical data, biomedical GloVe or BioBERT embeddings are a good choice.

In [6]:
# We can configure various parameters on the sentence resolver
trainable_pipe.print_info()


The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)                          | Info: Size of every batch | Currently set to : 8
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setEngine('tensorflow')                  | Info: Deep Learning engine used for this model | Currently set to : tensorflow
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setIsLong(False)                         | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setMaxSentenceLength(128)                | Info: Max sentence length to process | Currently set to : 128
component_list['bert_sentence_embeddings@sent_smal

# Train a SNOMED resolver
We download a sample SNOMED dataset that we can use for training.

In [7]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.test.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.train.txt

In [8]:
import pandas as pd

# The files are tab-separated and have no header row, so we name the columns ourselves
cols = ["y","_y","text"]

# Keep only the first 250 rows of each split
aap_tr = pd.read_csv("AskAPatient.fold-0.train.txt",sep="\t",encoding="ISO-8859-1",header=None).iloc[:250]
aap_te = pd.read_csv("AskAPatient.fold-0.test.txt",sep="\t",encoding="ISO-8859-1",header=None).iloc[:250]
aap_tr.columns = cols
aap_te.columns = cols
aap_tr


Unnamed: 0,y,_y,text
0,108367008,Dislocation of joint,Dislocation of joint
1,3384011000036100,Arthrotec,Arthrotec
2,166717003,Serum creatinine raised,Serum creatinine raised
3,3877011000036101,Lipitor,Lipitor
4,402234004,Foot eczema,Foot eczema
...,...,...,...
245,162290004,Dry eyes,Dry eyes
246,419723007,Mentally dull,Mentally dull
247,4216011000036104,Norvasc,Norvasc
248,13791008,Asthenia,Asthenia


In [12]:
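# Combine healthcare sentence embeddings with the trainable resolver
trainable_pipe = nlp.load('en.embed_sentence.bert.jsl_tiny_umls_uncased train.resolve_sentence')
# Set the number of neighbours on the resolver
trainable_pipe['trainable_sentence_entity_resolver'].setNeighbours(4)
fitted_pipe = trainable_pipe.fit(aap_tr)
# Note: this predicts on the training data itself
prediction = fitted_pipe.predict(aap_tr)
prediction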

sbert_jsl_tiny_umls_uncased download started this may take some time.
Approximate size to download 15.8 MB
[OK!]
setInputCols in SentenceEntityResolverApproach_2520cda31704 expecting 1 columns. Provided column amount: 2. Which should be columns from the following annotators: ['sentence_embeddings']
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,_y,document,resolution_sentence_entity_resolver_code,resolution_sentence_entity_resolver_confidence,resolution_sentence_entity_resolver_distance,resolution_sentence_entity_resolver_k_codes,resolution_sentence_entity_resolver_k_confidences,resolution_sentence_entity_resolver_k_cos_distances,resolution_sentence_entity_resolver_k_distances,resolution_sentence_entity_resolver_k_resolution,resolution_sentence_entity_resolver_origin_sentence,resolution_sentence_entity_resolver_resolved_text,resolution_sentence_entity_resolver_target_text,resolution_sentence_entity_resolver_token,sentence_embedding_bert,text,y
0,Dislocation of joint,Dislocation of joint,108367008,0.9992,0.0000,"[[108367008, 21288011000036105, 404640003]]","[[0.9992, 0.0005, 0.0004]]","[[0.0000, 0.2300, 0.2344]]","[[0.0000, 7.7017, 7.9164]]","[[Dislocation of joint, diclofenac, Dizziness]]",0,Dislocation of joint,Dislocation of joint,Dislocation of joint,"[[-0.9687817692756653, -0.31864216923713684, -...",Dislocation of joint,108367008
1,Arthrotec,Arthrotec,3384011000036100,0.9921,0.0000,"[[3384011000036100, 57676002]]","[[0.9921, 0.0079]]","[[0.0000, 0.0922]]","[[0.0000, 4.8368]]","[[Arthrotec, Arthralgia]]",0,Arthrotec,Arthrotec,Arthrotec,"[[-0.7108752131462097, -0.5266207456588745, -0...",Arthrotec,3384011000036100
2,Serum creatinine raised,Serum creatinine raised,166717003,0.9975,0.0000,"[[166717003, 39575007, 13644009, 124055002]]","[[0.9975, 0.0011, 0.0009, 0.0005]]","[[0.0000, 0.1798, 0.2049, 0.2325]]","[[0.0000, 6.7997, 7.0506, 7.5232]]","[[Serum creatinine raised, Urine looks dark, H...",0,Serum creatinine raised,Serum creatinine raised,Serum creatinine raised,"[[-0.5410001277923584, -2.0953280925750732, 0....",Serum creatinine raised,166717003
3,Lipitor,Lipitor,3877011000036101,1.0000,0.0000,,,,,,0,Lipitor,Lipitor,Lipitor,"[[-0.45240962505340576, -1.394622564315796, -0...",Lipitor,3877011000036101
4,Foot eczema,Foot eczema,402234004,0.9942,0.0000,"[[402234004, 21930011000036101, 41710110000361...","[[0.9942, 0.0025, 0.0020, 0.0013]]","[[0.0000, 0.1463, 0.1500, 0.1788]]","[[0.0000, 6.0054, 6.1894, 6.6619]]","[[Foot eczema, ezetimibe, Celebrex, Arthralgia]]",0,Foot eczema,Foot eczema,Foot eczema,"[[-0.763110876083374, -0.40250054001808167, -0...",Foot eczema,402234004
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Dry eyes,Dry eyes,162290004,0.9981,0.0000,"[[162290004, 238810007, 404640003]]","[[0.9981, 0.0013, 0.0006]]","[[0.0000, 0.1612, 0.2016]]","[[0.0000, 6.6192, 7.4328]]","[[Dry eyes, Flushing, Dizziness]]",0,Dry eyes,Dry eyes,Dry eyes,"[[-0.03702589124441147, -1.3459508419036865, -...",Dry eyes,162290004
246,Mentally dull,Mentally dull,419723007,1.0000,0.0000,,,,,,0,Mentally dull,Mentally dull,Mentally dull,"[[-0.9327226281166077, -1.3695887327194214, -0...",Mentally dull,419723007
247,Norvasc,Norvasc,4216011000036104,0.9864,0.0000,"[[4216011000036104, 2929011000036108, 367391008]]","[[0.9864, 0.0080, 0.0056]]","[[0.0000, 0.0863, 0.1000]]","[[0.0000, 4.8183, 5.1746]]","[[Norvasc, Nexium, Malaise]]",0,Norvasc,Norvasc,Norvasc,"[[-0.4530910551548004, -1.576862096786499, -0....",Norvasc,4216011000036104
248,Asthenia,Asthenia,13791008,1.0000,0.0000,,,,,,0,Asthenia,Asthenia,Asthenia,"[[-0.5592130422592163, -1.6610543727874756, -0...",Asthenia,13791008
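
The predictions above are made on the training rows, so near-perfect matches are expected. To get a more honest picture, you could compare predicted codes with gold codes on held-out data. The helper below is a minimal sketch in plain Python. The function name `accuracy` and the sample values are illustrative: the gold codes are copied from the training preview, and the last predicted code is deliberately wrong.

```python
def accuracy(predicted_codes, gold_codes):
    """Fraction of positions where the predicted code equals the gold code."""
    if len(predicted_codes) != len(gold_codes):
        raise ValueError("predicted and gold lists must have the same length")
    if not gold_codes:
        return 0.0
    # Compare as strings so integer and string codes are treated alike.
    hits = sum(str(p) == str(g) for p, g in zip(predicted_codes, gold_codes))
    return hits / len(gold_codes)


# Illustrative values only, not real model output.
gold = ["108367008", "3384011000036100", "166717003", "3877011000036101"]
predicted = ["108367008", "3384011000036100", "166717003", "57676002"]

score = accuracy(predicted, gold)
print(f"accuracy: {score:.2f}")
```

With the notebook's DataFrames, one option is to run `fitted_pipe.predict(aap_te)` and pass its `resolution_sentence_entity_resolver_code` and `y` columns to this helper, assuming the output keeps the `y` column as it does in the table above.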
