![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/05.2.Finetuning_Clinical_Entity_Resolver_Model.ipynb)

# Finetuning Clinical Entity Resolver Model

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.3.4

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical, visual

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
spark

👌 Detected license file /content/5.3.1.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.1, 💊Spark-Healthcare==5.3.0, running on ⚡ PySpark==3.4.0


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Load Dataset

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/AskAPatient.fold-0.train.txt

Now we will create a pandas dataframe from the downloaded dataset and then convert it to a Spark dataframe.



In [None]:
import pandas as pd

cols = ["conceptId","_term","term"]

aap_tr = pd.read_csv("AskAPatient.fold-0.train.txt",sep="\t", encoding="ISO-8859-1",header=None)
aap_tr.columns = cols
aap_tr["conceptId"] = aap_tr.conceptId.apply(str)
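The file is tab-separated with no header row, and the concept IDs are used as labels, so they are kept as strings rather than numbers. As a minimal, library-free sketch of the same layout, the snippet below parses two rows copied from the dataset preview using only the standard library. The in-memory `sample` string stands in for the downloaded file and is an assumption made for illustration.

```python
import csv
import io

# Two rows in the same tab-separated, header-less layout as the training file.
sample = (
    "108367008\tDislocation of joint\tDislocation of joint\n"
    "3384011000036100\tArthrotec\tArthrotec\n"
)

cols = ["conceptId", "_term", "term"]

# csv.reader yields every field as a string, so concept IDs are never
# converted to integers and can be used directly as labels.
reader = csv.reader(io.StringIO(sample), delimiter="\t")
rows = [dict(zip(cols, fields)) for fields in reader]

for row in rows:
    print(row)
```

With pandas, the notebook gets the same effect by casting the `conceptId` column to `str` after loading.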

In [None]:
aap_tr.head()

Unnamed: 0,conceptId,_term,term
0,108367008,Dislocation of joint,Dislocation of joint
1,3384011000036100,Arthrotec,Arthrotec
2,166717003,Serum creatinine raised,Serum creatinine raised
3,3877011000036101,Lipitor,Lipitor
4,402234004,Foot eczema,Foot eczema


In [None]:
aap_train_sdf = spark.createDataFrame(aap_tr).drop_duplicates()
aap_train_sdf.show()

+-----------------+--------------------+--------------------+
|        conceptId|               _term|                term|
+-----------------+--------------------+--------------------+
|        161891005|            Backache|            backache|
|        418290006|             Itching|               itchy|
|         35489007|          Depression|very serious depr...|
|         10601006|  Pain in lower limb|        pain in legs|
|        386806002|  Impaired cognition|  Impaired cognition|
|        386807006|   Memory impairment| memory difficulties|
|         21499005|Feeling agitated ...|   Severe aggitation|
|        247373008|          Ankle pain|          ANKLE PAIN|
|        262286000|    Weight increased|Big weight gain i...|
|         36349006|        Burning pain|burning in back o...|
|         24184005|Finding of increa...|blood pressure ha...|
|        308921004|Neurological symptom|Neurological symptom|
|         49049000| Parkinson's disease| Parkinson's disease|
|       

In [None]:
aap_train_sdf.printSchema()

root
 |-- conceptId: string (nullable = true)
 |-- _term: string (nullable = true)
 |-- term: string (nullable = true)



In [None]:
aap_train_sdf.count()

4382

We will limit the dataframe to 1,000 rows for faster training.

In [None]:
aap_train_sdf = aap_train_sdf.limit(1000)

Here, we will create a pipeline that adds an embeddings column to our Spark dataframe.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("_term")\
    .setOutputCol("ner_chunk")

bert_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("bert_embeddings")\
    .setCaseSensitive(False)

snomed_emb_pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    bert_embeddings])


snomed_emb_model = snomed_emb_pipeline.fit(aap_train_sdf)

snomed_data = snomed_emb_model.transform(aap_train_sdf)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


Here is the new training dataframe.

In [None]:
snomed_data.show()

+-----------------+--------------------+--------------------+--------------------+--------------------+
|        conceptId|               _term|                term|           ner_chunk|     bert_embeddings|
+-----------------+--------------------+--------------------+--------------------+--------------------+
|        161891005|            Backache|            backache|[{document, 0, 7,...|[{sentence_embedd...|
|        418290006|             Itching|               itchy|[{document, 0, 6,...|[{sentence_embedd...|
|         35489007|          Depression|very serious depr...|[{document, 0, 9,...|[{sentence_embedd...|
|         10601006|  Pain in lower limb|        pain in legs|[{document, 0, 17...|[{sentence_embedd...|
|        386806002|  Impaired cognition|  Impaired cognition|[{document, 0, 17...|[{sentence_embedd...|
|        386807006|   Memory impairment| memory difficulties|[{document, 0, 16...|[{sentence_embedd...|
|         21499005|Feeling agitated ...|   Severe aggitation|[{d

Now we can train our SNOMED Sentence Entity Resolver Model by using `SentenceEntityResolverApproach`.

In [None]:
bertExtractor = medical.SentenceEntityResolverApproach()\
  .setNeighbours(25)\
  .setThreshold(1000)\
  .setInputCols("bert_embeddings")\
  .setNormalizedCol("_term")\
  .setLabelCol("conceptId")\
  .setOutputCol('snomed_code')\
  .setDistanceFunction("EUCLIDIAN")\
  .setCaseSensitive(False)

%time snomed_model = bertExtractor.fit(snomed_data)

CPU times: user 425 ms, sys: 44.7 ms, total: 469 ms
Wall time: 1min 13s
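Conceptually, the trained resolver compares the embedding of an input chunk with the embeddings of the training terms and returns the labels of the closest ones. The sketch below is not the library's implementation. It is a toy, pure-Python illustration that assumes `neighbours` caps the number of candidates and `threshold` acts as a maximum allowed distance. The `resolve` function and the 2-d vectors are hypothetical; only the codes and terms come from the dataset.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def resolve(query_vec, index, neighbours=25, threshold=1000.0):
    """Return up to `neighbours` (code, term, distance) tuples, closest first.

    Assumption: candidates farther away than `threshold` are discarded.
    """
    scored = [(code, term, euclidean(query_vec, vec)) for code, term, vec in index]
    scored = [item for item in scored if item[2] <= threshold]
    scored.sort(key=lambda item: item[2])
    return scored[:neighbours]


# Hypothetical 2-d "embeddings"; real sentence embeddings are much larger.
toy_index = [
    ("161891005", "Backache", [1.0, 0.0]),
    ("25064002", "Headache", [0.0, 1.0]),
    ("418290006", "Itching", [-1.0, 0.0]),
]

results = resolve([0.9, 0.1], toy_index, neighbours=2)
for code, term, dist in results:
    print(code, term, round(dist, 4))
```

The query vector lies closest to the `Backache` vector, so that code is ranked first, and the `neighbours=2` cap drops the third candidate.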


In [None]:
# Save the model if you will need it later
snomed_model.write().overwrite().save("sbiobertresolve_snomed_model")

Let's create a new dataset and re-train our model with it.



In [None]:
aap_tr.head()

Unnamed: 0,conceptId,_term,term
0,108367008,Dislocation of joint,Dislocation of joint
1,3384011000036100,Arthrotec,Arthrotec
2,166717003,Serum creatinine raised,Serum creatinine raised
3,3877011000036101,Lipitor,Lipitor
4,402234004,Foot eczema,Foot eczema


We need the same columns for training.

In [None]:
new_dataset = pd.DataFrame(columns=aap_tr.columns)
new_dataset

Unnamed: 0,conceptId,_term,term


## Use Cases

Now we can add our concept codes and terms to this new dataframe. To make the effect of re-training easy to spot in the results, the new codes contain `XXXX`. The examples cover the following cases:

- New terms that are close to existing terms in the main dataset (`Dislocation of joint -> Dislocation New Term`).

- Existing terms whose codes were changed (`Arthrotec`).

- Existing terms with the word order changed (`Serum creatinine raised -> Raised serum creatinine`).

- Entirely new terms (`New Drug`, `athlete's foot`).

In [None]:
new_dataset.conceptId = ["XXXX67008", "XXXXXXXX00036100", "XXXXX7003", "XXXXXXXX00036101", "XXXXX4004"]
new_dataset._term = ["Dislocation New Term", "Arthrotec", "Raised serum creatinine", "New Drug", "athlete's foot"]
new_dataset.term = ["Dislocation New Term", "Arthrotec", "Raised serum creatinine", "New Drug", "athlete's foot"]
new_dataset

Unnamed: 0,conceptId,_term,term
0,XXXX67008,Dislocation New Term,Dislocation New Term
1,XXXXXXXX00036100,Arthrotec,Arthrotec
2,XXXXX7003,Raised serum creatinine,Raised serum creatinine
3,XXXXXXXX00036101,New Drug,New Drug
4,XXXXX4004,athlete's foot,athlete's foot


We transform the new dataframe with `snomed_emb_model` to add the columns needed for re-training.

In [None]:
new_snomed_data = snomed_emb_model.transform(spark.createDataFrame(new_dataset))
new_snomed_data.show()

+----------------+--------------------+--------------------+--------------------+--------------------+
|       conceptId|               _term|                term|           ner_chunk|     bert_embeddings|
+----------------+--------------------+--------------------+--------------------+--------------------+
|       XXXX67008|Dislocation New Term|Dislocation New Term|[{document, 0, 19...|[{sentence_embedd...|
|XXXXXXXX00036100|           Arthrotec|           Arthrotec|[{document, 0, 8,...|[{sentence_embedd...|
|       XXXXX7003|Raised serum crea...|Raised serum crea...|[{document, 0, 22...|[{sentence_embedd...|
|XXXXXXXX00036101|            New Drug|            New Drug|[{document, 0, 7,...|[{sentence_embedd...|
|       XXXXX4004|      athlete's foot|      athlete's foot|[{document, 0, 13...|[{sentence_embedd...|
+----------------+--------------------+--------------------+--------------------+--------------------+



Now we will re-train our main model on the new dataset by pointing `.setPretrainedModelPath()` at the saved model.

In [None]:
new_snomed_model = bertExtractor.setPretrainedModelPath("sbiobertresolve_snomed_model").fit(new_snomed_data)

In [None]:
# Save the model if you will need it later

new_snomed_model.write().overwrite().save("new_sbiobertresolve_snomed_model")

Write a function to show the results more clearly.

In [None]:
pd.set_option('display.max_colwidth', 0)


def get_codes(lp, text, vocab='snomed_code'):
    """Annotate `text` with LightPipeline `lp` and tabulate the candidates in column `vocab`."""

    full_light_result = lp.fullAnnotate(text)

    chunks = []
    codes = []
    begin = []
    end = []
    resolutions = []
    all_codes = []
    all_cosines = []

    for chunk, code in zip(full_light_result[0]['ner_chunk'], full_light_result[0][vocab]):

        begin.append(chunk.begin)
        end.append(chunk.end)
        chunks.append(chunk.result)
        codes.append(code.result)
        # Candidate lists are stored in the metadata as ':::'-separated strings.
        all_codes.append(code.metadata['all_k_results'].split(':::'))
        resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
        all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))

    # Note: the 'all_distances' column holds the cosine distances.
    df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end, 'code': codes, 'all_codes': all_codes,
                       'resolutions': resolutions, 'all_distances': all_cosines})

    return df
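The helper above relies on the resolver storing its candidate lists in the annotation metadata as `:::`-separated strings. As a small standalone sketch of that parsing step, the snippet below turns such a metadata dictionary into ranked `(code, term, distance)` tuples. The `parse_candidates` function is hypothetical, and `sample_metadata` is a hand-built dictionary that reuses the first three `Backache` candidates from this notebook's results rather than a live annotation.

```python
def parse_candidates(metadata, sep=":::"):
    """Split the ':::'-separated metadata fields into (code, term, distance) tuples."""
    codes = metadata["all_k_results"].split(sep)
    terms = metadata["all_k_resolutions"].split(sep)
    dists = [float(d) for d in metadata["all_k_cosine_distances"].split(sep)]
    return list(zip(codes, terms, dists))


# Hand-built stand-in for `code.metadata` on a resolver annotation.
sample_metadata = {
    "all_k_results": "161891005:::48926009:::404640003",
    "all_k_resolutions": "Backache:::Pain in spine:::Dizziness",
    "all_k_cosine_distances": "0.0000:::0.1410:::0.1426",
}

candidates = parse_candidates(sample_metadata)
for code, term, dist in candidates:
    print(f"{code:>12}  {term:<15} {dist:.4f}")
```

Converting the distances to `float` makes it easy to filter or sort candidates, whereas the dataframe built by `get_codes` keeps them as strings.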

## Test the New Model
We will create a pipeline that contains both of the models and check the results by using `LightPipeline`.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = nlp.BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

first_model = medical.SentenceEntityResolverModel.load("sbiobertresolve_snomed_model") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("first_code")

second_model = medical.SentenceEntityResolverModel.load("new_sbiobertresolve_snomed_model") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("second_code")


pipelineModel = nlp.PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        first_model,
        second_model])

snomed_lp = nlp.LightPipeline(pipelineModel)


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


Let's test our models with the terms that we added to the new dataset and with terms that come from the main model (`Stomach ache` and `Pins and needles`).

In [None]:
chunk_list = ["Dislocation New Term", "Arthrotec", "Raised serum creatinine", "New Drug", "athlete's foot", "Stomach ache", "Pins and needles"]
chunk_list

['Dislocation New Term',
 'Arthrotec',
 'Raised serum creatinine',
 'New Drug',
 "athlete's foot",
 'Stomach ache',
 'Pins and needles']

In [None]:
from IPython.display import display

for chunk in chunk_list:

    print ('\n >>','/'*30, chunk, '/'*30, '\n')

    print('First Model Result:')
    display(get_codes (snomed_lp, chunk, vocab='first_code'))

    print('\n Second Model Result:')
    display(get_codes (snomed_lp, chunk, vocab='second_code'))


 >> ////////////////////////////// Dislocation New Term ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Dislocation New Term,0,19,2764000,"[2764000, 125667009, 415749005, 40806005, 202855006, 47268002, 414469009, 3559011000036109, 249966004, 14351000168102, 12441001, 4308002, 21288011000036105, 54981004, 698065002, 34840004, 283902008, 419076005, 3530011000036104]","[Joint crepitus, Contusion, Rupture of tendon, Derealisation, Lateral epicondylitis, Reflux, Impending shock, Zoloft, Spasmodic movement, Seroquel, Epistaxis, Repetitive strain injury, diclofenac, Charleyhorse, Acid reflux, Tendonitis, Has delayed recall, Allergic reaction, Lopid]","[0.2654, 0.2695, 0.3011, 0.3030, 0.3281, 0.3145, 0.3136, 0.3337, 0.3238, 0.3381, 0.3422, 0.3231, 0.3598, 0.3562, 0.3352, 0.3447, 0.3418, 0.3383, 0.3609]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Dislocation New Term,0,19,XXXX67008,"[XXXX67008, XXXXXXXX00036101, 2764000, 125667009, 415749005, 40806005, 202855006, 47268002, 414469009, 3559011000036109, 249966004, 14351000168102, 12441001, 4308002, 21288011000036105, 54981004, 698065002, 34840004, 283902008]","[Dislocation New Term, New Drug, Joint crepitus, Contusion, Rupture of tendon, Derealisation, Lateral epicondylitis, Reflux, Impending shock, Zoloft, Spasmodic movement, Seroquel, Epistaxis, Repetitive strain injury, diclofenac, Charleyhorse, Acid reflux, Tendonitis, Has delayed recall]","[0.0000, 0.2513, 0.2654, 0.2695, 0.3011, 0.3030, 0.3281, 0.3145, 0.3136, 0.3337, 0.3238, 0.3381, 0.3422, 0.3231, 0.3598, 0.3562, 0.3352, 0.3447, 0.3418]"



 >> ////////////////////////////// Arthrotec ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Arthrotec,0,8,3384011000036100,"[3384011000036100, 53215011000036102, 87715008, 3736011000036100, 4171011000036100, 416675009, 54981004, 3559011000036109, 28551000168108, 3066011000036105, 3530011000036104, 3563011000036102, 3298011000036103, 3572011000036102, 3848011000036104, 35209006, 40806005, 3904011000036106]","[Arthrotec, Advil, Aptyalism, Avandia, Celebrex, Furuncle, Charleyhorse, Zoloft, Voltaren, Olmetec, Lopid, Zantac, Lipex, Mobic, Pravachol, Sensitivity, Derealisation, Zocor]","[0.0000, 0.1731, 0.1846, 0.1900, 0.2209, 0.2225, 0.2309, 0.2301, 0.2299, 0.2322, 0.2448, 0.2475, 0.2455, 0.2373, 0.2498, 0.2400, 0.2447, 0.2558]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Arthrotec,0,8,XXXXXXXX00036100,"[XXXXXXXX00036100, 3384011000036100, 53215011000036102, 87715008, 3736011000036100, 4171011000036100, 416675009, 54981004, 3559011000036109, 28551000168108, 3066011000036105, 3530011000036104, 3563011000036102, 3298011000036103, 3572011000036102, 3848011000036104, 35209006, 40806005]","[Arthrotec, Arthrotec, Advil, Aptyalism, Avandia, Celebrex, Furuncle, Charleyhorse, Zoloft, Voltaren, Olmetec, Lopid, Zantac, Lipex, Mobic, Pravachol, Sensitivity, Derealisation]","[0.0000, 0.0000, 0.1731, 0.1846, 0.1900, 0.2209, 0.2225, 0.2309, 0.2301, 0.2299, 0.2322, 0.2448, 0.2475, 0.2455, 0.2373, 0.2498, 0.2400, 0.2447]"



 >> ////////////////////////////// Raised serum creatinine ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Raised serum creatinine,0,22,432352001,"[432352001, 166830008, 38936003, 69791001, 51590001, 249477003, 42399005, 60728008, 166584001, 24184005, 40095003, 278528006, 34436003, 271737000, 124042003, 166643006]","[Increased creatine kinase level, Serum cholesterol raised, Abnormal blood pressure, Increased venous pressure, Increased pressure, Increased thirst, Renal failure, Abdominal swelling, C-reactive protein abnormal, Finding of increased blood pressure, Renal injury, Facial swelling, Haematuria, Anaemia, Increased lipid, Liver enzymes abnormal]","[0.1387, 0.1686, 0.1718, 0.1815, 0.1821, 0.1831, 0.1847, 0.1939, 0.2032, 0.2020, 0.2050, 0.2016, 0.2143, 0.2166, 0.2300, 0.2210]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Raised serum creatinine,0,22,XXXXX7003,"[XXXXX7003, 432352001, 166830008, 38936003, 69791001, 51590001, 249477003, 42399005, 60728008, 166584001, 24184005, 40095003, 278528006, 34436003, 271737000, 124042003, 166643006]","[Raised serum creatinine, Increased creatine kinase level, Serum cholesterol raised, Abnormal blood pressure, Increased venous pressure, Increased pressure, Increased thirst, Renal failure, Abdominal swelling, C-reactive protein abnormal, Finding of increased blood pressure, Renal injury, Facial swelling, Haematuria, Anaemia, Increased lipid, Liver enzymes abnormal]","[0.0000, 0.1387, 0.1686, 0.1718, 0.1815, 0.1821, 0.1831, 0.1847, 0.1939, 0.2032, 0.2020, 0.2050, 0.2016, 0.2143, 0.2166, 0.2300, 0.2210]"



 >> ////////////////////////////// New Drug ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,New Drug,0,7,419511003,"[419511003, 14351000168102, 271807003, 34839011000036106, 21885011000036105, 21814011000036109, 40806005, 21839011000036103, 77424011000036100, 21747011000036106, 21252011000036100, 21304011000036105, 47268002, 21134002, 21288011000036105, 21659011000036107]","[Propensity to adverse reactions to drug, Seroquel, Rash, pethidine, ibuprofen, hydrochlorothiazide, Derealisation, bisoprolol, ubidecarenone, glipizide, morphine, naproxen, Reflux, Disability, diclofenac, gemfibrozil]","[0.3044, 0.3488, 0.3267, 0.3476, 0.3438, 0.3644, 0.3470, 0.3811, 0.3921, 0.3705, 0.3779, 0.3704, 0.3670, 0.3497, 0.3936, 0.3961]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,New Drug,0,7,XXXXXXXX00036101,"[XXXXXXXX00036101, XXXX67008, 419511003, 14351000168102, 271807003, 34839011000036106, 21885011000036105, 21814011000036109, 40806005, 21839011000036103, 77424011000036100, 21747011000036106, 21252011000036100, 21304011000036105, 47268002, 21134002]","[New Drug, Dislocation New Term, Propensity to adverse reactions to drug, Seroquel, Rash, pethidine, ibuprofen, hydrochlorothiazide, Derealisation, bisoprolol, ubidecarenone, glipizide, morphine, naproxen, Reflux, Disability]","[0.0000, 0.2513, 0.3044, 0.3488, 0.3267, 0.3476, 0.3438, 0.3644, 0.3470, 0.3811, 0.3921, 0.3705, 0.3779, 0.3704, 0.3670, 0.3497]"



 >> ////////////////////////////// athlete's foot ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,athlete's foot,0,13,118932009,"[118932009, 70733008, 699368004, 128605003, 309087008, 297142003, 102551008, 425772008, 6389006, 54981004, 16973004, 82971005, 387603000, 416675009, 55260003]","[Disorder of foot, Limitation of joint movement, Symptom of ankle, Disorder of extremity, Paraesthesia of foot, Foot swelling, Cramp in foot, Tendonitis of foot, Disturbance in physical behaviour, Charleyhorse, Limping, Impaired mobility, Impairment of balance, Furuncle, Calcaneal spur]","[0.1275, 0.1914, 0.2012, 0.2001, 0.2106, 0.2152, 0.2178, 0.2278, 0.2248, 0.2601, 0.2451, 0.2423, 0.2451, 0.2613, 0.2645]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,athlete's foot,0,13,XXXXX4004,"[XXXXX4004, 118932009, 70733008, 699368004, 128605003, 309087008, 297142003, 102551008, 425772008, 6389006, 54981004, 16973004, 82971005, 387603000, 416675009, 55260003]","[athlete's foot, Disorder of foot, Limitation of joint movement, Symptom of ankle, Disorder of extremity, Paraesthesia of foot, Foot swelling, Cramp in foot, Tendonitis of foot, Disturbance in physical behaviour, Charleyhorse, Limping, Impaired mobility, Impairment of balance, Furuncle, Calcaneal spur]","[0.0000, 0.1275, 0.1914, 0.2012, 0.2001, 0.2106, 0.2152, 0.2178, 0.2278, 0.2248, 0.2601, 0.2451, 0.2423, 0.2451, 0.2613, 0.2645]"



 >> ////////////////////////////// Stomach ache ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Stomach ache,0,11,271681002,"[271681002, 162059005, 116289008, 16331000, 119416008, 162043005, 82991003, 4969004, 248490000, 36349006, 25064002]","[Stomach ache, Upset stomach, Abdominal bloating, Heartburn, Epigastric discomfort, Hunger pain, Generalised aches and pains, Sinus pain, Bloating symptom, Burning pain, Headache]","[0.0000, 0.0699, 0.0826, 0.0854, 0.0902, 0.0919, 0.1028, 0.1053, 0.1087, 0.1092, 0.1102]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Stomach ache,0,11,271681002,"[271681002, 162059005, 116289008, 16331000, 119416008, 162043005, 82991003, 4969004, 248490000, 36349006, 25064002]","[Stomach ache, Upset stomach, Abdominal bloating, Heartburn, Epigastric discomfort, Hunger pain, Generalised aches and pains, Sinus pain, Bloating symptom, Burning pain, Headache]","[0.0000, 0.0699, 0.0826, 0.0854, 0.0902, 0.0919, 0.1028, 0.1053, 0.1087, 0.1092, 0.1102]"



 >> ////////////////////////////// Pins and needles ////////////////////////////// 

First Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Pins and needles,0,15,62507009,"[62507009, 37567005, 17971005, 247472004, 131148009, 91019004, 417237009, 283050005, 161891005, 271807003, 40806005, 387603000, 409589004]","[Pins and needles, Acenaesthesia, Sedated, Hives, Bleeding, Paraesthesia, Blister, Abrasion of eye region, Backache, Rash, Derealisation, Impairment of balance, Scab of skin]","[0.0000, 0.2716, 0.2708, 0.2917, 0.2857, 0.2918, 0.3023, 0.2994, 0.2994, 0.2969, 0.3049, 0.2985, 0.3066]"



 Second Model Result:


Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Pins and needles,0,15,62507009,"[62507009, 37567005, 17971005, 247472004, 131148009, 91019004, 417237009, 283050005, 161891005, 271807003, 40806005, 387603000, 409589004]","[Pins and needles, Acenaesthesia, Sedated, Hives, Bleeding, Paraesthesia, Blister, Abrasion of eye region, Backache, Rash, Derealisation, Impairment of balance, Scab of skin]","[0.0000, 0.2716, 0.2708, 0.2917, 0.2857, 0.2918, 0.3023, 0.2994, 0.2994, 0.2969, 0.3049, 0.2985, 0.3066]"


### Conclusion
As the results show:

- The newly added terms resolve to their new codes at the top of the results.

- The terms whose concept codes we changed appear at the top of the results with the new code.

- Terms that are close to existing ones are resolved successfully.

- The terms that come from the main dataset resolve to the same results as before.

## Overriding Codes

We can replace existing codes in a pretrained Sentence Entity Resolver Model by using `.setOverrideExistingCodes(True)`. For example, suppose you want to add a new term to a pretrained resolver model and the term's code already exists in that model. With `.setOverrideExistingCodes(True)`, all existing entries with that code and their descriptions are removed, so the fine-tuned model contains only the new term for that code.
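To make the difference between plain fine-tuning and overriding concrete, here is a toy sketch that models the resolver's stored entries as a list of `(code, term)` pairs. It only mirrors the behaviour described in this notebook and says nothing about how the library stores its data internally. The `fine_tune` function and its `override_existing_codes` argument are hypothetical names chosen for illustration.

```python
def fine_tune(index, new_rows, override_existing_codes=False):
    """Return a new list of (code, term) entries after adding `new_rows`.

    With override_existing_codes=True, every existing entry whose code also
    appears in `new_rows` is removed before the new rows are added.
    """
    new_codes = {code for code, _ in new_rows}
    if override_existing_codes:
        kept = [(code, term) for code, term in index if code not in new_codes]
    else:
        kept = list(index)
    return kept + list(new_rows)


base = [("161891005", "Backache"), ("404640003", "Dizziness")]
new = [("161891005", "toothache")]

appended = fine_tune(base, new)
overridden = fine_tune(base, new, override_existing_codes=True)

print(appended)    # both descriptions of 161891005 are present
print(overridden)  # only the new description of 161891005 remains
```

Without the override, the code ends up with two descriptions, which matches the two `Arthrotec` candidates seen earlier after plain re-training.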

In [None]:
snomed_data.show()

+-----------------+--------------------+--------------------+--------------------+--------------------+
|        conceptId|               _term|                term|           ner_chunk|     bert_embeddings|
+-----------------+--------------------+--------------------+--------------------+--------------------+
|        161891005|            Backache|            backache|[{document, 0, 7,...|[{sentence_embedd...|
|        418290006|             Itching|               itchy|[{document, 0, 6,...|[{sentence_embedd...|
|         35489007|          Depression|very serious depr...|[{document, 0, 9,...|[{sentence_embedd...|
|         10601006|  Pain in lower limb|        pain in legs|[{document, 0, 17...|[{sentence_embedd...|
|        386806002|  Impaired cognition|  Impaired cognition|[{document, 0, 17...|[{sentence_embedd...|
|        386807006|   Memory impairment| memory difficulties|[{document, 0, 16...|[{sentence_embedd...|
|         21499005|Feeling agitated ...|   Severe aggitation|[{d

In [None]:
display(get_codes (snomed_lp, "Backache", vocab='first_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Backache,0,7,161891005,"[161891005, 48926009, 404640003, 249931001, 367391008, 16269008, 116289008, 44077006, 81680005, 25064002]","[Backache, Pain in spine, Dizziness, Weakness of neck, Malaise, Neuralgia, Abdominal bloating, Numbness, Neck pain, Headache]","[0.0000, 0.1410, 0.1426, 0.1424, 0.1517, 0.1569, 0.1520, 0.1529, 0.1530, 0.1571]"


In [None]:
display(get_codes (snomed_lp, "toothache", vocab='first_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,toothache,0,8,288939007,"[288939007, 404640003, 22253000, 161891005, 44077006, 25064002]","[Difficulty swallowing, Dizziness, Pain, Backache, Numbness, Headache]","[0.1266, 0.1317, 0.1340, 0.1390, 0.1348, 0.1360]"


Let's add a new term, `toothache`, with the same code as `Backache` (`161891005`), and set `.setOverrideExistingCodes(True)`.

In [None]:
override_data = spark.createDataFrame(pd.DataFrame({"conceptId":["161891005"], "_term": ["toothache"], "term": ["toothache"]}))
override_data.show()

+---------+---------+---------+
|conceptId|    _term|     term|
+---------+---------+---------+
|161891005|toothache|toothache|
+---------+---------+---------+



In [None]:
override_data = snomed_emb_model.transform(override_data)
override_data.show()

+---------+---------+---------+--------------------+--------------------+
|conceptId|    _term|     term|           ner_chunk|     bert_embeddings|
+---------+---------+---------+--------------------+--------------------+
|161891005|toothache|toothache|[{document, 0, 8,...|[{sentence_embedd...|
+---------+---------+---------+--------------------+--------------------+



In [None]:
overrided_model = bertExtractor.setPretrainedModelPath("/content/sbiobertresolve_snomed_model").setOverrideExistingCodes(True).fit(override_data)
overrided_model.write().overwrite().save("overrided_model")

In [None]:
overrided_resolver = medical.SentenceEntityResolverModel.load("overrided_model") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("overrided_code")


overrided_pipelineModel = nlp.PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        first_model,
        overrided_resolver])

overrided_lp = nlp.LightPipeline(overrided_pipelineModel)

In [None]:
# original model

display(get_codes (overrided_lp, "Backache", vocab='first_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Backache,0,7,161891005,"[161891005, 48926009, 404640003, 249931001, 367391008, 16269008, 116289008, 44077006, 81680005, 25064002]","[Backache, Pain in spine, Dizziness, Weakness of neck, Malaise, Neuralgia, Abdominal bloating, Numbness, Neck pain, Headache]","[0.0000, 0.1410, 0.1426, 0.1424, 0.1517, 0.1569, 0.1520, 0.1529, 0.1530, 0.1571]"


In [None]:
# overrided model

display(get_codes (overrided_lp, "Backache", vocab='overrided_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Backache,0,7,161891005,"[161891005, 48926009, 404640003, 249931001, 367391008, 16269008, 116289008, 44077006, 81680005, 25064002]","[toothache, Pain in spine, Dizziness, Weakness of neck, Malaise, Neuralgia, Abdominal bloating, Numbness, Neck pain, Headache]","[0.1390, 0.1410, 0.1426, 0.1424, 0.1517, 0.1569, 0.1520, 0.1529, 0.1530, 0.1571]"


In [None]:
# overrided model

display(get_codes (overrided_lp, "toothache", vocab='overrided_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,toothache,0,8,161891005,"[161891005, 288939007, 404640003, 22253000, 44077006, 25064002, 367391008, 41652007]","[toothache, Difficulty swallowing, Dizziness, Pain, Numbness, Headache, Malaise, Eye pain]","[0.0000, 0.1266, 0.1317, 0.1340, 0.1348, 0.1360, 0.1415, 0.1374]"


As you can see, `Backache` -> `161891005` no longer exists. All previous descriptions of code `161891005` have been removed, and the only description left for this code is `toothache` -> `161891005`.

## Drop List Of Codes

We can drop codes from a pretrained resolver model by using `.setDropCodesList()`. Let's remove `161891005` (toothache) and `404640003` (Dizziness) from the overridden model (`overrided_model`).

In [None]:
blackListedModel = bertExtractor.\
    setPretrainedModelPath("overrided_model").\
    setNormalizedCol("_term").\
    setDropCodesList(["161891005", "404640003"]).\
    fit(override_data.limit(0))
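Note that the cell above fits on `override_data.limit(0)`, an empty dataframe, so no new terms are added and only the listed codes are removed. The toy sketch below shows the same idea on a plain list of `(code, term)` pairs. The `drop_codes` function is a hypothetical helper for illustration and not part of the library.

```python
def drop_codes(index, codes_to_drop):
    """Return the entries of `index` whose code is not in `codes_to_drop`."""
    blocked = set(codes_to_drop)
    return [(code, term) for code, term in index if code not in blocked]


# A few entries taken from the results shown in this notebook.
index = [
    ("161891005", "toothache"),
    ("404640003", "Dizziness"),
    ("288939007", "Difficulty swallowing"),
    ("22253000", "Pain"),
]

remaining = drop_codes(index, ["161891005", "404640003"])
print(remaining)
```

After the drop, the next closest entries move up the ranking, which is the effect the comparison cells for `toothache` and `Dizziness` demonstrate.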

In [None]:
blackListedModel.write().overwrite().save("blackListedModel")

In [None]:
blackListed_resolver = medical.SentenceEntityResolverModel.load("blackListedModel") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("blackListed_code")


blackListed_pipelineModel = nlp.PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        overrided_resolver,
        blackListed_resolver])

blackListed_lp = nlp.LightPipeline(blackListed_pipelineModel)

In [None]:
# overrided model

display(get_codes (blackListed_lp, "toothache", vocab='overrided_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,toothache,0,8,161891005,"[161891005, 288939007, 404640003, 22253000, 44077006, 25064002, 367391008, 41652007]","[toothache, Difficulty swallowing, Dizziness, Pain, Numbness, Headache, Malaise, Eye pain]","[0.0000, 0.1266, 0.1317, 0.1340, 0.1348, 0.1360, 0.1415, 0.1374]"


In [None]:
# blackListed model

display(get_codes (blackListed_lp, "toothache", vocab='blackListed_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,toothache,0,8,288939007,"[288939007, 22253000, 44077006, 25064002, 367391008, 41652007, 53057004, 45534005, 248490000, 18876004]","[Difficulty swallowing, Pain, Numbness, Headache, Malaise, Eye pain, Hand pain, Glossitis, Bloating symptom, Pain in finger]","[0.1266, 0.1340, 0.1348, 0.1360, 0.1415, 0.1374, 0.1379, 0.1457, 0.1395, 0.1412]"


In [None]:
# overrided model

display(get_codes (blackListed_lp, "Dizziness", vocab='overrided_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Dizziness,0,8,404640003,"[404640003, 44077006, 271782001, 119416008, 271713000, 248490000, 373931001, 309838005, 249931001, 367391008, 301026000, 214264003, 55929007]","[Dizziness, Numbness, Drowsy, Epigastric discomfort, General unsteadiness, Bloating symptom, Sensation of heaviness in limbs, Emotional upset, Weakness of neck, Malaise, Loss of confidence, Lethargy, Feeling irritable]","[0.0000, 0.0802, 0.0811, 0.0838, 0.0858, 0.0877, 0.0893, 0.0979, 0.1022, 0.1055, 0.1028, 0.1055, 0.1024]"


In [None]:
# blackListed model

display(get_codes (blackListed_lp, "Dizziness", vocab='blackListed_code'))

Unnamed: 0,chunks,begin,end,code,all_codes,resolutions,all_distances
0,Dizziness,0,8,44077006,"[44077006, 271782001, 119416008, 271713000, 248490000, 373931001, 309838005, 249931001, 367391008, 301026000, 214264003, 55929007, 271795006, 101000119102, 53057004, 298753001]","[Numbness, Drowsy, Epigastric discomfort, General unsteadiness, Bloating symptom, Sensation of heaviness in limbs, Emotional upset, Weakness of neck, Malaise, Loss of confidence, Lethargy, Feeling irritable, Malaise and fatigue, Numbness and tingling sensation of skin, Hand pain, Numbness of upper limb]","[0.0802, 0.0811, 0.0838, 0.0858, 0.0877, 0.0893, 0.0979, 0.1022, 0.1055, 0.1028, 0.1055, 0.1024, 0.1097, 0.1104, 0.1125, 0.1145]"


As you can see, the codes `161891005` and `404640003` have been removed from the model.