![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/13.1.Finetuning_Sentence_Entity_Resolver_Model.ipynb)

# 13. Finetuning Sentence Entity Resolver Model

In [2]:
import os

jsl_secret = os.getenv('SECRET')

import sparknlp
sparknlp_version = sparknlp.version()
import sparknlp_jsl
jsl_version = sparknlp_jsl.version()

print (jsl_secret)

In [None]:
import json
import os
import sparknlp_jsl
import sparknlp
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
import sys, time
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.util import *
from sparknlp_jsl.annotator import *

from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(jsl_secret,params=params)

print (sparknlp.version())
print (sparknlp_jsl.version())

## Load Dataset

In [4]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.train.txt

Next, we create a pandas DataFrame from the downloaded dataset and then convert it to a Spark DataFrame.



In [5]:
import pandas as pd

cols = ["conceptId","_term","term"]

aap_tr = pd.read_csv("AskAPatient.fold-0.train.txt",sep="\t", encoding="ISO-8859-1",header=None)
aap_tr.columns = cols
aap_tr["conceptId"] = aap_tr.conceptId.apply(str)

In [7]:
aap_tr.head()

Unnamed: 0,conceptId,_term,term
0,108367008,Dislocation of joint,Dislocation of joint
1,3384011000036100,Arthrotec,Arthrotec
2,166717003,Serum creatinine raised,Serum creatinine raised
3,3877011000036101,Lipitor,Lipitor
4,402234004,Foot eczema,Foot eczema


In [66]:
aap_train_sdf = spark.createDataFrame(aap_tr)
aap_train_sdf.show()

+----------------+--------------------+--------------------+
|       conceptId|               _term|                term|
+----------------+--------------------+--------------------+
|       XXXX67008|Dislocation New Term|Dislocation New Term|
|XXXXXXXX00036100|           Arthrotec|           Arthrotec|
|       XXXXX7003|Raised serum crea...|Raised serum crea...|
|XXXXXXXX00036101|            New Drug|            New Drug|
|       XXXXX4004|      athlete's foot|      athlete's foot|
|       404640003|           Dizziness|           Dizziness|
|       271681002|        Stomach ache|        Stomach ache|
|        76948002|         Severe pain|         Severe pain|
|        36031001|        Burning feet|        Burning feet|
|        76948002|         Severe pain|         Severe pain|
|        42399005|       Renal failure|       Renal failure|
|       288227007|Myalgia/myositis ...|Myalgia/myositis ...|
|       419723007|       Mentally dull|       Mentally dull|
|       248490000|    Bl

In [10]:
aap_train_sdf.printSchema()

root
 |-- conceptId: string (nullable = true)
 |-- _term: string (nullable = true)
 |-- term: string (nullable = true)



In [67]:
aap_train_sdf.count()

15612

We limit the DataFrame to 1,000 rows for faster training.

In [16]:
aap_train_sdf = aap_train_sdf.limit(1000)

Here we create a pipeline that adds an embeddings column to our Spark DataFrame.

In [17]:
documentAssembler = DocumentAssembler()\
    .setInputCol("_term")\
    .setOutputCol("sentence")

bert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence"])\
    .setOutputCol("bert_embeddings")

snomed_emb_pipeline = Pipeline(stages = [
    documentAssembler,
    bert_embeddings])


snomed_emb_model = snomed_emb_pipeline.fit(aap_train_sdf)

snomed_data = snomed_emb_model.transform(aap_train_sdf)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


Here is the new training dataframe.

In [19]:
snomed_data.show()

+----------------+--------------------+--------------------+--------------------+--------------------+
|       conceptId|               _term|                term|            sentence|     bert_embeddings|
+----------------+--------------------+--------------------+--------------------+--------------------+
|       108367008|Dislocation of joint|Dislocation of joint|[{document, 0, 19...|[{sentence_embedd...|
|3384011000036100|           Arthrotec|           Arthrotec|[{document, 0, 8,...|[{sentence_embedd...|
|       166717003|Serum creatinine ...|Serum creatinine ...|[{document, 0, 22...|[{sentence_embedd...|
|3877011000036101|             Lipitor|             Lipitor|[{document, 0, 6,...|[{sentence_embedd...|
|       402234004|         Foot eczema|         Foot eczema|[{document, 0, 10...|[{sentence_embedd...|
|       404640003|           Dizziness|           Dizziness|[{document, 0, 8,...|[{sentence_embedd...|
|       271681002|        Stomach ache|        Stomach ache|[{document, 0

Now we can train our SNOMED sentence entity resolver model using `SentenceEntityResolverApproach`.

In [20]:
bertExtractor = SentenceEntityResolverApproach()\
  .setNeighbours(25)\
  .setThreshold(1000)\
  .setInputCols("bert_embeddings")\
  .setNormalizedCol("_term")\
  .setLabelCol("conceptId")\
  .setOutputCol('snomed_code')\
  .setDistanceFunction("EUCLIDEAN")\
  .setCaseSensitive(False)

%time snomed_model = bertExtractor.fit(snomed_data)

CPU times: user 601 ms, sys: 55.2 ms, total: 656 ms
Wall time: 1min 53s
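
Conceptually, the trained resolver looks up the stored terms whose embeddings are closest to the query embedding. The sketch below illustrates that idea in plain Python with Euclidean distance. It is not the library's implementation: the helper names `euclidean` and `resolve` and the 3-dimensional vectors are assumptions made up for illustration, and only the codes and terms are taken from the dataset above.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def resolve(query_vec, index, k=3):
    """Return the k index entries closest to query_vec, nearest first.

    index is a list of (code, term, vector) tuples.
    """
    scored = [(euclidean(query_vec, vec), code, term) for code, term, vec in index]
    return sorted(scored, key=lambda item: item[0])[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, not real sBioBERT output).
toy_index = [
    ("271681002", "Stomach ache", [0.9, 0.1, 0.0]),
    ("404640003", "Dizziness", [0.0, 1.0, 0.2]),
    ("76948002", "Severe pain", [0.7, 0.0, 0.4]),
]

# Query with the exact vector stored for "Stomach ache".
top = resolve([0.9, 0.1, 0.0], toy_index, k=2)
for dist, code, term in top:
    print(f"{code}\t{term}\t{dist:.4f}")
```

An exact match has distance zero and therefore ranks first, which is the pattern to look for in the resolver outputs later in this notebook.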


In [21]:
# save the model if you will need it later
snomed_model.write().overwrite().save("sbiobertresolve_snomed_model")

Let's create a new dataset and re-train our model on it.



In [26]:
aap_tr.head()

Unnamed: 0,conceptId,_term,term
0,108367008,Dislocation of joint,Dislocation of joint
1,3384011000036100,Arthrotec,Arthrotec
2,166717003,Serum creatinine raised,Serum creatinine raised
3,3877011000036101,Lipitor,Lipitor
4,402234004,Foot eczema,Foot eczema


We need the same columns for training.

In [37]:
new_dataset = pd.DataFrame(columns=aap_tr.columns)
new_dataset

Unnamed: 0,conceptId,_term,term


## Use Cases

Now we can add our concept codes and terms to this new dataframe. The cases below are chosen to show the effect of re-training, and the new codes contain `XXXX` so that they are easy to spot in the results:

- Added a new term that is close to an existing term in the main dataset (`Dislocation of joint -> Dislocation New Term`).

- Changed the code of a term that already exists in the main dataset (`Arthrotec`).

- Changed the order of the words in a term (`Serum creatinine raised -> Raised serum creatinine`).

- Added new terms to the dataset (`New Drug`, `athlete's foot`).

In [38]:
new_dataset.conceptId = ["XXXX67008", "XXXXXXXX00036100", "XXXXX7003", "XXXXXXXX00036101", "XXXXX4004"]
new_dataset._term = ["Dislocation New Term", "Arthrotec", "Raised serum creatinine", "New Drug", "athlete's foot"]
new_dataset.term = ["Dislocation New Term", "Arthrotec", "Raised serum creatinine", "New Drug", "athlete's foot"]
new_dataset

Unnamed: 0,conceptId,_term,term
0,XXXX67008,Dislocation New Term,Dislocation New Term
1,XXXXXXXX00036100,Arthrotec,Arthrotec
2,XXXXX7003,Raised serum creatinine,Raised serum creatinine
3,XXXXXXXX00036101,New Drug,New Drug
4,XXXXX4004,athlete's foot,athlete's foot


We transform the new dataframe with `snomed_emb_model` to add the columns needed for re-training.

In [40]:
new_snomed_data = snomed_emb_model.transform(spark.createDataFrame(new_dataset))
new_snomed_data.show()

+----------------+--------------------+--------------------+--------------------+--------------------+
|       conceptId|               _term|                term|            sentence|     bert_embeddings|
+----------------+--------------------+--------------------+--------------------+--------------------+
|       XXXX67008|Dislocation New Term|Dislocation New Term|[{document, 0, 19...|[{sentence_embedd...|
|XXXXXXXX00036100|           Arthrotec|           Arthrotec|[{document, 0, 8,...|[{sentence_embedd...|
|       XXXXX7003|Raised serum crea...|Raised serum crea...|[{document, 0, 22...|[{sentence_embedd...|
|XXXXXXXX00036101|            New Drug|            New Drug|[{document, 0, 7,...|[{sentence_embedd...|
|       XXXXX4004|      athlete's foot|      athlete's foot|[{document, 0, 13...|[{sentence_embedd...|
+----------------+--------------------+--------------------+--------------------+--------------------+



Now we re-train our main model on the new dataset by pointing `.setPretrainedModelPath()` at the model we saved earlier.

In [43]:
new_snomed_model = bertExtractor.setPretrainedModelPath("sbiobertresolve_snomed_model").fit(new_snomed_data)
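
One way to picture re-training from a saved model is as extending an existing lookup index with new entries while keeping the old ones. The sketch below is a conceptual model under that assumption, not the library's internals: `cosine_distance`, `rank`, and the toy vectors are hypothetical, and only the codes and terms come from this notebook.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def rank(query_vec, index):
    """Score every (code, term, vector) entry and sort by distance, nearest first."""
    scored = [(cosine_distance(query_vec, vec), code, term) for code, term, vec in index]
    # sorted() is stable, so entries with equal distance keep their index order.
    return sorted(scored, key=lambda item: item[0])

# Toy vectors (hypothetical); the codes and terms come from the notebook.
base_index = [
    ("3384011000036100", "Arthrotec", [1.0, 0.0, 0.0]),
    ("3736011000036100", "Avandia", [0.8, 0.6, 0.0]),
]
new_entries = [
    ("XXXXXXXX00036100", "Arthrotec", [1.0, 0.0, 0.0]),  # same term, new code
]

query = [1.0, 0.0, 0.0]
before = rank(query, base_index)
# Placing new entries first is a choice of this sketch so that ties favor them.
after = rank(query, new_entries + base_index)

print([code for _, code, _ in before])
print([code for _, code, _ in after])
```

In this sketch the old `Arthrotec` entry is still present after the extension, and both entries sit at distance zero from the query.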

In [44]:
# save the model if you will need it later

new_snomed_model.write().overwrite().save("new_sbiobertresolve_snomed_model")

Write a function to show the results more clearly.

In [46]:
import pandas as pd

pd.set_option('display.max_colwidth', 0)


def get_codes (lp, text, vocab='snomed_code'):
    
    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions=[]
    all_distances =[]
    all_codes=[]
    all_cosines = []

    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):
            
        begin.append(chunk.begin)
        end.append(chunk.end)
        chunks.append(chunk.result)
        codes.append(code.result) 
        all_codes.append(code.metadata['all_k_results'].split(':::'))
        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
        all_distances.append(code.metadata['all_k_distances'].split(':::'))
        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
        
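    # Note: the 'all_distances' column is filled with the cosine distances (all_cosines),
    # not with the 'all_k_distances' values collected in all_distances above.
    df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 'code':codes,'all_codes':all_codes, 
                       'resolutions':resolutions, 'all_distances':all_cosines})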
        
    return df
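
The function above relies on the resolver metadata storing its candidate lists as `:::`-separated strings. The sketch below shows that parsing step on its own. The helper `split_candidates` is hypothetical, and `sample_metadata` is a hand-written dictionary built from the first three `Arthrotec` candidates printed later in this notebook, not a real annotation object.

```python
def split_candidates(metadata, sep=":::"):
    """Turn the ':::'-separated metadata strings into (code, term, distance) tuples."""
    codes = metadata["all_k_results"].split(sep)
    terms = metadata["all_k_resolutions"].split(sep)
    dists = [float(d) for d in metadata["all_k_cosine_distances"].split(sep)]
    return list(zip(codes, terms, dists))

# Hand-written stand-in for code.metadata (values copied from the notebook output).
sample_metadata = {
    "all_k_results": "3384011000036100:::3736011000036100:::87715008",
    "all_k_resolutions": "Arthrotec:::Avandia:::Aptyalism",
    "all_k_cosine_distances": "0.0000:::0.2204:::0.2272",
}

candidates = split_candidates(sample_metadata)
for code, term, dist in candidates:
    print(f"{code}\t{term}\t{dist:.4f}")
```

Zipping the three lists keeps each code aligned with its resolution text and distance, which is easier to scan than three separate list columns.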



## Test the New Model
We will create a pipeline that contains both of the models and check the results by using `LightPipeline`.

In [45]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sbert_embeddings")

first_model = SentenceEntityResolverModel.load("sbiobertresolve_snomed_model") \
      .setInputCols(["ner_chunk", "sbert_embeddings"]) \
      .setOutputCol("first_code")

second_model = SentenceEntityResolverModel.load("new_sbiobertresolve_snomed_model") \
      .setInputCols(["ner_chunk", "sbert_embeddings"]) \
      .setOutputCol("second_code")


pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        first_model,
        second_model])

snomed_lp = LightPipeline(pipelineModel)


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


Let's test our models with the terms that we added to the new dataset and with terms that come from the main dataset (`Stomach ache` and `Pins and needles`).

In [63]:
chunk_list = ["Dislocation New Term", "Arthrotec", "Raised serum creatinine", "New Drug", "athlete's foot", "Stomach ache", "Pins and needles"]
chunk_list

['Dislocation New Term',
 'Arthrotec',
 'Raised serum creatinine',
 'New Drug',
 "athlete's foot",
 'Stomach ache',
 'Pins and needles']

In [64]:
from IPython.display import display

for chunk in chunk_list:

    print ('\n >>','/'*30, chunk, '/'*30, '\n')
    
    print('First Model Result:')
    display(get_codes (snomed_lp, chunk, vocab='first_code'))
    
    print('\n Second Model Result:')
    display(get_codes (snomed_lp, chunk, vocab='second_code'))


 >> ////////////////////////////// Dislocation New Term ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Dislocation New Term,0,19,108367008,"[108367008, 414469009, 40806005, 2764000, 84480002, 77424011000036100, 3298011000036103, 21288011000036105, 53523011000036100, 3559011000036109, 15611000168108, 314983004, 5662003, 429513001, 161891005]","[Dislocation of joint, Impending shock, Derealisation, Joint crepitus, Retching, ubidecarenone, Lipex, diclofenac, Diovan, Zoloft, Naprosyn, Deteriorating renal function, Contusion of hand, Rupture of Achilles tendon, Backache]","[0.2829, 0.3046, 0.2938, 0.3153, 0.3190, 0.3524, 0.3483, 0.3700, 0.3433, 0.3566, 0.3654, 0.3523, 0.3593, 0.3552, 0.3629]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Dislocation New Term,0,19,XXXX67008,"[XXXX67008, XXXXXXXX00036101, 108367008, 414469009, 40806005, 2764000, 84480002, 77424011000036100, 3298011000036103, 21288011000036105, 53523011000036100, 3559011000036109, 15611000168108, 314983004, 5662003, 429513001, 161891005]","[Dislocation New Term, New Drug, Dislocation of joint, Impending shock, Derealisation, Joint crepitus, Retching, ubidecarenone, Lipex, diclofenac, Diovan, Zoloft, Naprosyn, Deteriorating renal function, Contusion of hand, Rupture of Achilles tendon, Backache]","[0.0000, 0.2396, 0.2829, 0.3046, 0.2938, 0.3153, 0.3190, 0.3524, 0.3483, 0.3700, 0.3433, 0.3566, 0.3654, 0.3523, 0.3593, 0.3552, 0.3629]"



 >> ////////////////////////////// Arthrotec ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Arthrotec,0,8,3384011000036100,"[3384011000036100, 3736011000036100, 87715008, 271807003, 37787011000036104, 4171011000036100, 3723001, 4031011000036106, 247472004]","[Arthrotec, Avandia, Aptyalism, Rash, vitamin A, Celebrex, Arthritis, Crestor, Hives]","[0.0000, 0.2204, 0.2272, 0.2310, 0.2365, 0.2597, 0.2509, 0.2623, 0.2607]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Arthrotec,0,8,XXXXXXXX00036100,"[XXXXXXXX00036100, 3384011000036100, 3736011000036100, 87715008, 271807003, 37787011000036104, 4171011000036100, 3723001, 4031011000036106]","[Arthrotec, Arthrotec, Avandia, Aptyalism, Rash, vitamin A, Celebrex, Arthritis, Crestor]","[0.0000, 0.0000, 0.2204, 0.2272, 0.2310, 0.2365, 0.2597, 0.2509, 0.2623]"



 >> ////////////////////////////// Raised serum creatinine ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Raised serum creatinine,0,22,166717003,"[166717003, 69791001, 314983004, 24184005, 166830008, 40095003, 414469009, 42399005, 278528006, 34436003]","[Serum creatinine raised, Increased venous pressure, Deteriorating renal function, Finding of increased blood pressure, Serum cholesterol raised, Renal injury, Impending shock, Renal failure, Facial swelling, Haematuria]","[0.0141, 0.1653, 0.1787, 0.1859, 0.1853, 0.2075, 0.2184, 0.2066, 0.2235, 0.2338]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Raised serum creatinine,0,22,XXXXX7003,"[XXXXX7003, 166717003, 69791001, 314983004, 24184005, 166830008, 40095003, 414469009, 42399005, 278528006, 34436003]","[Raised serum creatinine, Serum creatinine raised, Increased venous pressure, Deteriorating renal function, Finding of increased blood pressure, Serum cholesterol raised, Renal injury, Impending shock, Renal failure, Facial swelling, Haematuria]","[0.0000, 0.0141, 0.1653, 0.1787, 0.1859, 0.1853, 0.2075, 0.2184, 0.2066, 0.2235, 0.2338]"



 >> ////////////////////////////// New Drug ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,New Drug,0,7,40806005,"[40806005, 15611000168108, 3298011000036103, 3877011000036101]","[Derealisation, Naprosyn, Lipex, Lipitor]","[0.3170, 0.3397, 0.3536, 0.3497]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,New Drug,0,7,XXXXXXXX00036101,"[XXXXXXXX00036101, XXXX67008, 40806005, 15611000168108, 3298011000036103, 3877011000036101]","[New Drug, Dislocation New Term, Derealisation, Naprosyn, Lipex, Lipitor]","[0.0000, 0.2396, 0.3170, 0.3397, 0.3536, 0.3497]"



 >> ////////////////////////////// athlete's foot ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,athlete's foot,0,13,102551008,"[102551008, 699368004, 68172002, 297142003, 40806005, 108367008, 309537005, 285395009, 82971005, 271807003, 84480002, 387603000, 267052005, 3877011000036101]","[Cramp in foot, Symptom of ankle, Disorder of tendon, Foot swelling, Derealisation, Dislocation of joint, Numbness of lower limb, Strain of calf muscle, Impaired mobility, Rash, Retching, Impairment of balance, Flatulence/wind, Lipitor]","[0.1984, 0.1969, 0.2218, 0.2347, 0.2373, 0.2575, 0.2578, 0.2631, 0.2564, 0.2714, 0.2732, 0.2631, 0.2797, 0.2807]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,athlete's foot,0,13,XXXXX4004,"[XXXXX4004, 102551008, 699368004, 68172002, 297142003, 40806005, 108367008, 309537005, 285395009, 82971005, 271807003, 84480002, 387603000, 267052005, 3877011000036101]","[athlete's foot, Cramp in foot, Symptom of ankle, Disorder of tendon, Foot swelling, Derealisation, Dislocation of joint, Numbness of lower limb, Strain of calf muscle, Impaired mobility, Rash, Retching, Impairment of balance, Flatulence/wind, Lipitor]","[0.0000, 0.1984, 0.1969, 0.2218, 0.2347, 0.2373, 0.2575, 0.2578, 0.2631, 0.2564, 0.2714, 0.2732, 0.2631, 0.2797, 0.2807]"



 >> ////////////////////////////// Stomach ache ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Stomach ache,0,11,271681002,"[271681002, 51197009, 82991003, 55145008, 119416008, 36349006, 53057004, 29857009]","[Stomach ache, Stomach cramps, Generalised aches and pains, Stabbing pain, Epigastric discomfort, Burning pain, Hand pain, Chest pain]","[0.0000, 0.1166, 0.1145, 0.1424, 0.1392, 0.1407, 0.1461, 0.1451]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Stomach ache,0,11,271681002,"[271681002, 51197009, 82991003, 55145008, 119416008, 36349006, 53057004, 29857009]","[Stomach ache, Stomach cramps, Generalised aches and pains, Stabbing pain, Epigastric discomfort, Burning pain, Hand pain, Chest pain]","[0.0000, 0.1166, 0.1145, 0.1424, 0.1392, 0.1407, 0.1461, 0.1451]"



 >> ////////////////////////////// Pins and needles ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Pins and needles,0,15,62507009,"[62507009, 274676007, 84480002, 3191011000036109, 271782001, 247472004, 415690000, 238810007, 3848011000036104, 80313002]","[Pins and needles, Tingling of skin, Retching, Prinivil, Drowsy, Hives, Sweating, Flushing, Pravachol, Palpitations]","[0.0000, 0.2330, 0.2453, 0.2661, 0.2651, 0.2716, 0.2638, 0.2805, 0.2894, 0.2914]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Pins and needles,0,15,62507009,"[62507009, 274676007, 84480002, 3191011000036109, 271782001, 247472004, 415690000, 238810007, 3848011000036104, 80313002]","[Pins and needles, Tingling of skin, Retching, Prinivil, Drowsy, Hives, Sweating, Flushing, Pravachol, Palpitations]","[0.0000, 0.2330, 0.2453, 0.2661, 0.2651, 0.2716, 0.2638, 0.2805, 0.2894, 0.2914]"


## Conclusion
As the results show:

- The newly added terms are resolved to their new codes at the top of the results.

- The terms whose concept codes we changed appear at the top of the results with their new codes.

- Terms that are close to existing ones are resolved successfully.

- Terms that come from the main dataset are resolved to the same results as before.
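
These observations can also be checked mechanically. The sketch below compares the top code returned by each model for every test term. The two dictionaries are transcribed by hand from the `code` column of the outputs above, and the variable names are made up for illustration.

```python
# Top code per term, transcribed from the 'code' column of the outputs above.
first_model_top = {
    "Dislocation New Term": "108367008",
    "Arthrotec": "3384011000036100",
    "Raised serum creatinine": "166717003",
    "New Drug": "40806005",
    "athlete's foot": "102551008",
    "Stomach ache": "271681002",
    "Pins and needles": "62507009",
}
second_model_top = {
    "Dislocation New Term": "XXXX67008",
    "Arthrotec": "XXXXXXXX00036100",
    "Raised serum creatinine": "XXXXX7003",
    "New Drug": "XXXXXXXX00036101",
    "athlete's foot": "XXXXX4004",
    "Stomach ache": "271681002",
    "Pins and needles": "62507009",
}

# Terms whose top code differs between the two models, and terms where it is the same.
changed = sorted(t for t in first_model_top if first_model_top[t] != second_model_top[t])
unchanged = sorted(t for t in first_model_top if first_model_top[t] == second_model_top[t])

print("changed:  ", changed)
print("unchanged:", unchanged)
```

Every term in `changed` is one of the five re-training cases and now resolves to an `XXXX` code, while the two terms taken from the main dataset are unchanged.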