![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CLINICAL_MULTI.ipynb)

## Setup

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

locals().update(license_keys)

os.environ.update(license_keys)
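
Outside Colab, `files.upload()` is not available. The following is a minimal sketch of the same license-loading step using only the standard library; the helper name `load_license_keys` and the placeholder values are assumptions made for illustration, not part of the Spark NLP API.

```python
import json
import os
import tempfile


def load_license_keys(path):
    """Read a license JSON file and export its entries as environment variables."""
    with open(path) as f:
        keys = json.load(f)
    # Environment variable values must be strings.
    os.environ.update({k: str(v) for k, v in keys.items()})
    return keys


# Demonstration with a throwaway file holding placeholder values (not real credentials).
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"SECRET": "placeholder-secret", "JSL_VERSION": "0.0.0"}, tmp)

license_keys = load_license_keys(tmp.name)
os.remove(tmp.name)
print(sorted(license_keys))
```

In practice the path would point at the license file downloaded from your John Snow Labs account instead of a temporary file.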

In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.pretrained import InternalResourceDownloader

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 5.1.4
Spark NLP_JSL Version : 5.1.3


In [71]:
text_dict = {

    "ner_clinical_en": [
        """A 50-year-old woman presented to the orthopedic clinic with complaints of persistent pain, swelling and limitation of motion in her right knee. She reported a history of osteoarthritis and previous knee injury. Clinical examination and X-rays revealed joint space narrowing, osteophyte formation and cartilage degeneration. An MRI scan was ordered to confirm the diagnosis and assess its severity. The MRI showed extensive cartilage loss and bone changes consistent with advanced osteoarthritis. After considering the patient's condition and preferences, a treatment plan was discussed, including physical therapy and the possibility of joint replacement surgery.""",
        """A 45-year-old man presented to the clinic with persistent shortness of breath and chest discomfort. He had a medical history of hypertension and a family history of cardiovascular disease. Initial physical examination revealed high blood pressure and irregular heart sounds. An electrocardiogram (ECG), echocardiogram and stress test were performed to further evaluate the condition. ECG showed signs of cardiac arrhythmia, while echocardiogram showed decreased left ventricular function. The exercise test showed exercise-induced myocardial ischemia, confirming the presence of coronary artery disease (CAD).""",
        """A 7-year-old child was brought to the pediatrician because of persistent cough, high fever and difficulty in breathing. He had a history of recurrent respiratory tract infections. On examination, "crackling sounds" were noted on lung auscultation and chest x-ray was performed. X-ray showed patchy infiltrates in both lungs indicating pneumonia. Culture was taken for further testing and Streptococcus pneumoniae was detected. The child was diagnosed with community-acquired pneumonia and appropriate antibiotics were prescribed.""",

    ],

    "ner_clinical_da": [
        """Dette forværredes akut den 18. september 2017, da patienten blev fundet med ændringer i mental status, og en ABG viste en CO2 på over 100, sandsynligvis på grund af nedsat respiratorisk drive som følge af sedativer givet for agitation oven på en eksisterende alkalose.""",
        """Tylenol , 650 mg p.o. q 4 prn ; Allopurinol , 100 mg p.o. q.d. ; Peridex mundskyllevand , 10 ml b.i.d.; cholysteramineresin , 4 gram p.o. q.d. ; clotrimazole , en pastil p.o. q.i.d. ; Fentanyl plaster , 100 mcg per time topical q 72 timer ; glipizide , 10 mg p.o. q.d.; Atarax , 25 operationsbord 50 mg p.o. q 6 til 8 timer prn kløe ; glidende skala insulin ; KBL mundskyllevand , 15 ccp .o. q.d. prn mund ubehag ; Lactulose , 30 ml p.o. q.i.d. der blev startet den 30. Ativan , 0.25 til 0.5 mg IV q 4 til 6 timer prn angst med instruktioner om at være forsigtig.""",
        ],

    "ner_clinical_es": [
        """Una mujer de 50 años acudió a la quejándose de dolor persistente, inflamación y limitación de la amplitud de movimiento en la rodilla derecha. La paciente refería antecedentes de artrosis y una lesión previa de rodilla. Se realizó un examen clínico y radiografías que revelaron un estrechamiento del espacio articular, formación de osteofitos y signos de degeneración del cartílago. Para confirmar el diagnóstico y evaluar la gravedad, se solicitó una resonancia magnética. La resonancia mostró una gran pérdida de cartílago y cambios óseos compatibles con una artrosis avanzada. Tras considerar el estado y las preferencias del paciente, se discutió un plan de tratamiento que incluía control del dolor, fisioterapia y la posibilidad de una cirugía de sustitución articular.""",
        """Un hombre de 45 años acudió a la consulta con disnea persistente y molestias torácicas. La historia clínica del paciente revelaba hipertensión y antecedentes familiares de enfermedad cardiovascular. Un examen físico inicial indicó hipertensión arterial y ruidos cardíacos irregulares. Para evaluar mejor su estado, se le practicaron un electrocardiograma (ECG), un ecocardiograma y una prueba de esfuerzo. ECG reveló signos de arritmia cardiaca, mientras que el ecocardiograma mostró una función ventricular izquierda reducida. La prueba de esfuerzo indicó isquemia miocárdica inducida por el ejercicio, confirmando la presencia de enfermedad arterial coronaria (EAC)""",
    ],

    "ner_clinical_fi": [
        """Lisäksi potilas sai edellä kuvatut diagnostiset toimenpiteet, jotka osoittivat vammoja, jotka rajoittuivat takaraivon haavaan, vasempaan murtuneeseen solisluuhun, vasempaan murtuneeseen säteeseen, oikean reiden hematoomaan ja oikean jalan haavaa""",
        """Tämä on 37-vuotias miespotilas, jolla on sairaushistoriassa Tyypin I diabetes, verenpainetauti, gastropareesi, loppuvaiheen munuaissairaus (viimeisin hoito) 26.12.""",
    ],

     "ner_clinical_fr": [
        """Un enfant de 7 ans a été amené chez le pédiatre en raison d'une toux persistante, d'une forte fièvre et de difficultés respiratoires. Le patient avait des antécédents médicaux d'infections respiratoires récurrentes. À l'examen, des craquements ont été notés à l'auscultation des poumons et des radiographies du thorax ont été réalisées. Les radiographies ont révélé des infiltrats parcellaires dans les deux poumons, indiquant une pneumonie. Les examens complémentaires comprenaient une culture des expectorations, qui a révélé la présence de Streptococcus pneumoniae. On a diagnostiqué chez l'enfant une pneumonie d'origine communautaire et on lui a prescrit un traitement antibiotique approprié.""",
    ],

    "ner_clinical_he":[
        """מר פוהל הוא זכר בן 53 עם היסטוריה של ETOH, דלקת דם שהתפתחה בחדר החירום עם התרגשות מוגברת, כמו משנה לאור אירוע התרסה של ETOH ופיתח לאחר מכן היפוטנזיה, לחץ דם סיסטולי בשנות ה - 80, סטטוס אחרי Ativan ו Haldol, אך היה רגי""",
        """COPD ( s / p חילוף ריאות FEV1 25% ) , הפרעה פריקרדיאלית כרונית , PVD , סטנוזת קרות הכליה הימנית , עורקות רקמה , אוסטיאופורוזיס , פרסבילרינגי""",
        ],

    "ner_clinical_it": [
        """Una donna di 50 anni si è presentata alla clinica ortopedica lamentando persistente, gonfiore e limitata capacità di movimento del ginocchio destro. La paziente ha riferito un'anamnesi di osteoartrite e un precedente infortunio al ginocchio. Sono stati eseguiti un esame clinico e delle radiografie che hanno rivelato un restringimento dello spazio articolare, la formazione di osteofiti e segni di degenerazione della cartilagine. Per confermare la diagnosi e valutarne la gravità, è stata ordinata una risonanza magnetica. La risonanza magnetica ha mostrato un'estesa perdita di cartilagine e alterazioni ossee coerenti con un'osteoartrite avanzata. Dopo aver considerato le condizioni e le preferenze del paziente, è stato discusso un piano di trattamento che prevedeva il controllo del dolera, la fisioterapia e la possibilità di un intervento di sostituzione dell'articolazione.""",
        """Un uomo di 45 anni si è presentato in clinica con una persistente mancanza di respiro e un fastidio al petto. L'anamnesi del paziente rivelava ipertensione e una storia familiare di malattie cardiovascolari. Un esame fisico iniziale indicava pressione alta e suoni cardiaci irregolari. Per valutare ulteriormente la condizione, sono stati eseguiti un elettrocardiogramma (ECG), un ecocardiogramma e un test da sforzo. L'ECG ha rivelato segni di aritmia cardiaca, mentre l'ecocardiogramma ha evidenziato una ridotta funzionalità del ventricolo sinistro. Il test da sforzo ha indicato un'ischemia miocardica indotta dall'esercizio, confermando la presenza di coronaropatia (CAD).""",
        ],

    "ner_clinical_ja":[
        """中等度肺高血圧 、 PA圧 48/24、 1+僧帽弁逆流 、 重度大動脈弁狭窄 、 LVEDP 19、 駆出率 43%。 クロトリマゾール 、1錠 p.o . q.i.d .;""",
        """彼女はまた、 息切れ 、 吐き気 、ついても説明しました。""",
    ],

    "ner_clinical_nl": [
        """Een 50-jarige vrouw kwam naar de orthopedische polikliniek met klachten van aanhoudende pijn, zwelling en bewegingsbeperking in haar rechterknie. Ze meldde een voorgeschiedenis van artrose en eerder knieletsel. Klinisch onderzoek en röntgenfoto's toonden vernauwing van de gewrichtsruimte, osteofytvorming en kraakbeendegeneratie. Er werd een MRI-scan besteld om de diagnose te bevestigen en de ernst ervan te beoordelen. De MRI toonde uitgebreid kraakbeenverlies en botveranderingen die overeenkwamen met gevorderde artrose. Na afweging van de toestand en voorkeuren van de patiënt werd een behandelplan besproken, inclusief fysiotherapie en de mogelijkheid van een gewrichtsvervangende operatie.""",
    ],

    "ner_clinical_no":[
        """Natrium var 140, kalium 3,7 ,klorid 96, bikarbonat 30, BUN og kreatinin 14/0,9 , glukose105, hematokrit42, hvittblodtall 8,6 , blodplater 644, protrombintid 10,4 , delvis tromboplastintid 28,7 , urinanalyse spor av hvite blodceller, svake skjulte røde blodceller. Natrium 148, kalium 3.4, glukose 174, P02 102, PC02 115, PH 7.11 på 40% 02.""",
        """Kolesterol 236, triglyserider 115, HDL 99, LDL 114, total protein 6,2, globulin 2,6, direkte bilirubin 0, total bilirubin 0,2, alkalisk fosfatase 59, amylase 64, SGOT 16, LDH 141, CPK 57.""",
    ],

    "ner_clinical_pt": [
        """Uma mulher de 50 anos veio à clínica ortopédica com queixas de dor persistente, inchaço e limitação da amplitude de movimentos no joelho direito. A doente referia uma história de osteoartrite e uma lesão anterior no joelho. Foi efectuado um exame clínico e radiografias que revelaram um estreitamento do espaço articular, formação de osteófitos e sinais de degeneração da cartilagem. Para confirmar o diagnóstico e avaliar a gravidade, foi pedida uma ressonância magnética. A RM mostrou uma perda extensa de cartilagem e alterações ósseas consistentes com osteoartrite avançada. Depois de considerar a condição e as preferências do doente, foi discutido um plano de tratamento que envolvia o controlo da dor, fisioterapia e a possibilidade de cirurgia de substituição da articulação.""",
        """Um homem de 45 anos apresentou-se na clínica com falta de ar persistente e desconforto no peito. A história clínica do doente revelava hipertensão e uma história familiar de doença cardiovascular. Um exame físico inicial indicou uma tensão arterial elevada e sons cardíacos irregulares. Para avaliar melhor a condição, foram realizados um eletrocardiograma (ECG), um ecocardiograma e uma prova de esforço. O ECG revelou sinais de arritmia cardíaca, enquanto o ecocardiograma mostrou uma função ventricular esquerda reduzida. O teste de esforço indicou isquemia miocárdica induzida pelo exercício, confirmando a presença de doença arterial coronariana (DAC).""",
    ],

    "ner_clinical_pl": [
        """50-letnia kobieta zgłosiła się do kliniki ortopedycznej skarżąc się na uporczywy ból, obrzęk i ograniczony zakres ruchu w prawym kolanie. Pacjentka zgłosiła historię choroby zwyrodnieniowej stawów i wcześniejszy uraz kolana. Przeprowadzono badanie kliniczne i wykonano zdjęcia rentgenowskie, które wykazały zwężenie przestrzeni stawowej, tworzenie się osteofitów i oznaki zwyrodnienia chrząstki. Aby potwierdzić diagnozę i ocenić stopień zaawansowania, zlecono rezonans magnetyczny. Rezonans magnetyczny wykazał rozległą utratę chrząstki i zmiany kostne odpowiadające zaawansowanej chorobie zwyrodnieniowej stawów. Po rozważeniu stanu pacjenta i jego preferencji, omówiono plan leczenia, który obejmował kontrolę bólu, fizjoterapię i możliwość operacji wymiany stawu.""",
        """45-letni mężczyzna zgłosił się do kliniki z uporczywą dusznością i dyskomfortem w klatce piersiowej. Historia medyczna pacjenta ujawniła nadciśnienie tętnicze i rodzinną historię chorób sercowo-naczyniowych. Wstępne badanie fizykalne wykazało wysokie ciśnienie krwi i nieregularne dźwięki serca. W celu dalszej oceny stanu pacjenta wykonano elektrokardiogram (EKG), echokardiogram i test wysiłkowy. EKG ujawniło oznaki arytmii serca, podczas gdy echokardiogram wykazał zmniejszoną czynność lewej komory. Test wysiłkowy wykazał niedokrwienie mięśnia sercowego wywołane wysiłkiem fizycznym, potwierdzając obecność choroby wieńcowej (CAD).""",
    ],

    "ner_clinical_tr": [
        """50 yaşında bir kadın hasta ortopedi kliniğine sağ dizinde sürekli ağrı, şişlik ve hareket kısıtlılığı şikâyetleriyle başvurdu. Hasta osteoartrit ve daha önce geçirilmiş diz yaralanması öyküsü bildirdi. Klinik muayene ve çekilen röntgenlerde eklem aralığında daralma, osteofit oluşumu ve kıkırdak dejenerasyonu bulguları tespit edildi. Tanıyı doğrulamak ve ciddiyetini değerlendirmek için bir MR taraması istendi. MRG, ileri osteoartrit ile uyumlu yoğun kıkırdak kaybı ve kemik değişiklikleri gösterdi. Hastanın durumu ve tercihleri göz önünde bulundurulduktan sonra  fizik tedavi ve eklem replasmanı ameliyatı olasılığını içeren bir tedavi planı tartışıldı.""",
        """45 yaşında bir erkek hasta kliniğe sürekli nefes darlığı ve göğüs rahatsızlığı şikâyetiyle başvurdu. Hastanın tıbbi geçmişinde hipertansiyon ve ailede kardiyovasküler hastalık öyküsü vardı. İlk fizik muayenede yüksek kan basıncı ve düzensiz kalp sesleri tespit edildi. Durumu daha iyi değerlendirmek için elektrokardiyogram (EKG), ekokardiyogram ve efor testi yapıldı. EKG'de kardiyak aritmi belirtileri görülürken, ekokardiyogramda sol ventrikül fonksiyonunun azaldığı görüldü. Efor testi egzersize bağlı miyokardiyal iskemiyi göstererek koroner arter hastalığının (KAH) varlığını doğrulamıştır.""",
    ],

    "ner_clinical_sv": [
        """MUN NORMAL HALS NORMAL sköldkörtel wnl BRÖST NORMAL inga tydliga knölar BRÖSTVÅRTOR NORMAL inverterade [ b ] , evert w / stimulering BRÖST NORMAL LCTA COR NORMAL RRR BUK NORMAL gravid EXTREMITETER NORMAL HUD NORMAL LYMFKÖRTLAR NORMAL VULVA NORMAL inga lesioner , vit d / c vid introitus VAGINA NORMAL liten mängd tunn vit d / c ph 4.5 , koh +amin , NS +clue , neg trich CERVIX NORMAL 1/100/0 srom klar OS NORMAL stängd ADNEXAE NORMAL inga palpabla knölar , NT LIVMODER NORMAL gravid LIVMODERSTORLEK I VECKOR NORMAL term REKTUM NORMAL inga yttre lesioner.""",
        """Patienten hade inga ytterligare klagomål och den 10 mars 2012 var hans vita blodkroppar 2,3, neutrofiler 50%, band 2%, lymfocyter 5% , monocyter 40% och blaster 1%. instruktioner i 250 ml långsam IV-infusion över en timme.""",
    ],

    "ner_clinical_vi":[
        """A/P : Nam , 48 tuổi, có tiền sử HCV, rối loạn lưỡng cực , có nỗ lực tự tử, dùng Inderal, Klonopin, Geodon, nhập viện tại Jackson với ống thông khí để bảo vệ đường thở, có câu hỏi về xâm nhập phía sau tim bên trái, hiện tại đang ổn định. một cuộc quét MRI cổ chỉ ra sự thoái hóa đĩa ấn tượng rất ở C5-C6, ít hơn ở C4-C5, với sự nén tủy rõ ràng, đặc biệt là ở phía bên phải C5-C6.""",
        """FBS nhỏ hơn 200 = 0 đơn vị CZI FBS 201-250 = 2 đơn vị CZI FBS 251-300 = 4 đơn vị CZI FBS 301-350 = 6 đơn vị CZI FBS 351-400 = 8 đơn vị CZI FBS lớn hơn 400= 10 đơn vị CZI.""",
    ],

}
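
Each key in `text_dict` follows the pattern `ner_clinical_<lang>`, and the per-language sections in this notebook differ mainly in their embeddings stage. The sketch below records those choices in a plain lookup table. It covers only the languages whose pipelines appear in this part of the notebook, and the names `EMBEDDINGS_BY_LANG`, `language_of`, and `embeddings_for` are hypothetical helpers, not library functions.

```python
# (annotator class name, pretrained model name) as used in the per-language cells.
# Languages not listed here are outside the scope of this sketch.
EMBEDDINGS_BY_LANG = {
    "en": ("WordEmbeddingsModel", "embeddings_clinical"),
    "da": ("WordEmbeddingsModel", "w2v_cc_300d"),
    "es": ("WordEmbeddingsModel", "w2v_cc_300d"),
    "fi": ("WordEmbeddingsModel", "w2v_cc_300d"),
    "fr": ("WordEmbeddingsModel", "w2v_cc_300d"),
    "he": ("BertEmbeddings", "alephbertgimmel_base_512"),
}


def language_of(model_key):
    """Return the language suffix of a key such as 'ner_clinical_da'."""
    return model_key.rsplit("_", 1)[-1]


def embeddings_for(model_key):
    """Look up the embeddings choice for a `text_dict` key."""
    return EMBEDDINGS_BY_LANG[language_of(model_key)]


print(embeddings_for("ner_clinical_he"))
```

A table like this makes it easier to see at a glance that only English uses the clinical embeddings and only Hebrew uses a BERT model in the sections shown here.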

# ner_clinical_en

In [5]:
model_name = "ner_clinical_en"

In [6]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        word_embeddings,
                        ner,
                        ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [7]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per detected chunk: zip chunk texts with their metadata, then explode.
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|persistent pain                       |PROBLEM  |
|swelling                              |PROBLEM  |
|limitation of motion in her right knee|PROBLEM  |
|osteoarthritis                        |PROBLEM  |
|previous knee injury                  |PROBLEM  |
|Clinical examination                  |TEST     |
|X-rays                                |TEST     |
|joint space narrowing                 |PROBLEM  |
|osteophyte formation                  |PROBLEM  |
|cartilage degeneration                |PROBLEM  |
|An MRI scan                           |TEST     |
|The MRI                               |TEST     |
|extensive cartilage loss              |PROBLEM  |
|bone changes                          |PROBLEM  |
|advanced osteoarthritis               |PROBLEM  |
|joint replacement surgery             |TREATMENT|
|persistent shortness of breath

In [8]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once, then render each document with its entities highlighted.
for row in result.collect():
    visualiser.display(result=row, label_col='ner_chunk', document_col='document')
    print("\n\n")


# ner_clinical_da

In [9]:
model_name = "ner_clinical_da"

In [10]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","da") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "da", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [11]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per detected chunk: zip chunk texts with their metadata, then explode.
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|ændringer i mental        |PROBLEM  |
|ABG                       |TEST     |
|en CO2                    |TEST     |
|nedsat respiratorisk drive|PROBLEM  |
|sedativer                 |TREATMENT|
|agitation                 |PROBLEM  |
|en eksisterende alkalose  |PROBLEM  |
|Tylenol                   |TREATMENT|
|Allopurinol               |TREATMENT|
|Peridex mundskyllevand    |TREATMENT|
|cholysteramineresin       |TREATMENT|
|clotrimazole              |TREATMENT|
|Fentanyl plaster          |TREATMENT|
|glipizide                 |TREATMENT|
|Atarax                    |TREATMENT|
|kløe                      |PROBLEM  |
|glidende skala insulin    |TREATMENT|
|KBL mundskyllevand        |TREATMENT|
|mund ubehag               |PROBLEM  |
|Lactulose                 |TREATMENT|
+--------------------------+---------+
only showing top 20 rows
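
The `select` in the cell above pairs each entry of `ner_chunk.result` with the matching entry of `ner_chunk.metadata` and then emits one row per pair. As a rough pure-Python analogue of that zip-then-explode step, the sketch below uses a hand-written row whose shape is a simplified assumption (the real metadata map carries more keys than `entity`); the two chunks are taken from the table above.

```python
# Simplified stand-in for one row of the `ner_chunk` column (assumed shape).
row = {
    "result": ["ABG", "sedativer"],
    "metadata": [{"entity": "TEST"}, {"entity": "TREATMENT"}],
}


def explode_chunks(row):
    """Pair each chunk with its metadata and emit one (chunk, label) tuple per pair."""
    return [
        (chunk, meta["entity"])
        for chunk, meta in zip(row["result"], row["metadata"])
    ]


pairs = explode_chunks(row)
for chunk, label in pairs:
    print(f"{chunk:<12}{label}")
```

Note that `show()` prints 20 rows unless a row count is passed as its first argument, which is why the Danish output ends with "only showing top 20 rows".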



In [12]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once, then render each document with its entities highlighted.
for row in result.collect():
    visualiser.display(result=row, label_col='ner_chunk', document_col='document')
    print("\n\n")


# ner_clinical_es

In [13]:
model_name = "ner_clinical_es"

In [14]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "es", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [15]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per detected chunk: zip chunk texts with their metadata, then explode.
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)

+-------------------------------------------------------------+---------+
|chunk                                                        |ner_label|
+-------------------------------------------------------------+---------+
|dolor persistente                                            |PROBLEM  |
|inflamación                                                  |PROBLEM  |
|limitación de la amplitud de movimiento en la rodilla derecha|PROBLEM  |
|artrosis                                                     |PROBLEM  |
|una lesión previa de rodilla                                 |PROBLEM  |
|examen clínico y radiografías                                |TEST     |
|estrechamiento del espacio articular                         |PROBLEM  |
|formación de osteofitos                                      |PROBLEM  |
|signos de degeneración del cartílago                         |PROBLEM  |
|una resonancia magnética                                     |TEST     |
|resonancia                           

In [16]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once, then render each document with its entities highlighted.
for row in result.collect():
    visualiser.display(result=row, label_col='ner_chunk', document_col='document')
    print("\n\n")


# ner_clinical_fi

In [17]:
model_name = "ner_clinical_fi"

In [18]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fi") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "fi", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [19]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per detected chunk: zip chunk texts with their metadata, then explode.
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+----------------------------------+---------+
|chunk                             |ner_label|
+----------------------------------+---------+
|diagnostiset toimenpiteet         |TEST     |
|vammoja                           |PROBLEM  |
|takaraivon haavaan                |PROBLEM  |
|vasempaan murtuneeseen solisluuhun|PROBLEM  |
|vasempaan murtuneeseen säteeseen  |PROBLEM  |
|oikean reiden hematoomaan         |PROBLEM  |
|oikean jalan haavaa               |PROBLEM  |
|Tyypin I diabetes                 |PROBLEM  |
|verenpainetauti                   |PROBLEM  |
|gastropareesi                     |PROBLEM  |
|loppuvaiheen munuaissairaus       |PROBLEM  |
|hoito                             |TREATMENT|
+----------------------------------+---------+
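
To get a quick per-label summary of a result like this one, the labels can be tallied. The sketch below does so in plain Python on rows copied by hand from the Finnish table above; on the Spark DataFrame itself, a `groupBy("ner_label").count()` on the exploded rows would be the usual equivalent. The variable names are illustrative only.

```python
from collections import Counter

# (chunk, ner_label) pairs copied from the Finnish output table above.
rows = [
    ("diagnostiset toimenpiteet", "TEST"),
    ("vammoja", "PROBLEM"),
    ("takaraivon haavaan", "PROBLEM"),
    ("vasempaan murtuneeseen solisluuhun", "PROBLEM"),
    ("vasempaan murtuneeseen säteeseen", "PROBLEM"),
    ("oikean reiden hematoomaan", "PROBLEM"),
    ("oikean jalan haavaa", "PROBLEM"),
    ("Tyypin I diabetes", "PROBLEM"),
    ("verenpainetauti", "PROBLEM"),
    ("gastropareesi", "PROBLEM"),
    ("loppuvaiheen munuaissairaus", "PROBLEM"),
    ("hoito", "TREATMENT"),
]

# Tally how many chunks were assigned to each entity label.
label_counts = Counter(label for _, label in rows)

for label, count in label_counts.most_common():
    print(label, count)
```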



In [20]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once, then render each document with its entities highlighted.
for row in result.collect():
    visualiser.display(result=row, label_col='ner_chunk', document_col='document')
    print("\n\n")


# ner_clinical_fr

In [21]:
model_name = "ner_clinical_fr"

In [22]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "fr", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [23]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per detected chunk: zip chunk texts with their metadata, then explode.
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------------------------------------+---------+
|chunk                                                      |ner_label|
+-----------------------------------------------------------+---------+
|toux persistante                                           |PROBLEM  |
|fièvre                                                     |PROBLEM  |
|difficultés respiratoires                                  |PROBLEM  |
|antécédents médicaux d'infections respiratoires récurrentes|PROBLEM  |
|craquements                                                |PROBLEM  |
|l'auscultation des poumons                                 |TEST     |
|radiographies du thorax                                    |TEST     |
|Les radiographies                                          |TEST     |
|infiltrats parcellaires dans les deux poumons              |PROBLEM  |
|pneumonie                                                  |PROBLEM  |
|Les examens                                                |TES

In [24]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once, then render each document with its entities highlighted.
for row in result.collect():
    visualiser.display(result=row, label_col='ner_chunk', document_col='document')
    print("\n\n")


# ner_clinical_he

In [25]:
model_name = "ner_clinical_he"

In [26]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained("alephbertgimmel_base_512","he") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "he", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(['sentence', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
alephbertgimmel_base_512 download started this may take some time.
Approximate size to download 658.5 MB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [27]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+------------------------+---------+
|chunk                   |ner_label|
+------------------------+---------+
|דלקת דם                 |PROBLEM  |
|התרגשות מוגברת          |PROBLEM  |
|אירוע התרסה של ETOH     |PROBLEM  |
|היפוטנזיה               |PROBLEM  |
|לחץ דם סיסטולי          |TEST     |
|Ativan                  |TREATMENT|
|Haldol                  |TREATMENT|
|רגי                     |PROBLEM  |
|COPD                    |PROBLEM  |
|חילוף ריאות             |TREATMENT|
|FEV1                    |TEST     |
|הפרעה פריקרדיאלית כרונית|PROBLEM  |
|PVD                     |PROBLEM  |
|סטנוזת קרות הכליה הימנית|PROBLEM  |
|אוסטיאופורוזיס          |PROBLEM  |
|פרסבילרינגי             |PROBLEM  |
+------------------------+---------+
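
The `arrays_zip` + `explode` selection used in these cells pairs each chunk with its metadata by position and then emits one row per pair. The reshaping can be sketched in plain Python, assuming a single row whose `result` and `metadata` arrays have the same length. `explode_chunks` is an illustrative helper, not part of Spark NLP, and the sample values are three rows taken from the table above.

```python
def explode_chunks(results, metadata):
    """Mimic F.explode(F.arrays_zip(results, metadata)) for one row."""
    # arrays_zip pairs elements by position; explode yields one output row per pair.
    return [
        {"chunk": chunk, "ner_label": meta["entity"]}
        for chunk, meta in zip(results, metadata)
    ]


# Hypothetical contents of one ner_chunk row (values from the output table above).
row_results = ["Ativan", "COPD", "FEV1"]
row_metadata = [{"entity": "TREATMENT"}, {"entity": "PROBLEM"}, {"entity": "TEST"}]

exploded = explode_chunks(row_results, row_metadata)
for item in exploded:
    print(item["chunk"], item["ner_label"])
```

One difference worth keeping in mind: Python's `zip` stops at the shorter list, whereas Spark's `arrays_zip` pads the shorter array with nulls, so the sketch only matches when both arrays are aligned.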



In [28]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_it

In [67]:
model_name = "ner_clinical_it"

In [68]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","it") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "it", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [69]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+---------------------------------------------------+---------+
|chunk                                              |ner_label|
+---------------------------------------------------+---------+
|persistente                                        |PROBLEM  |
|gonfiore                                           |PROBLEM  |
|limitata capacità di movimento del ginocchio destro|PROBLEM  |
|osteoartrite                                       |PROBLEM  |
|infortunio al ginocchio                            |PROBLEM  |
|un esame clinico                                   |TEST     |
|delle radiografie                                  |TEST     |
|un restringimento dello spazio articolare          |PROBLEM  |
|formazione di osteofiti                            |PROBLEM  |
|segni di degenerazione della cartilagine           |PROBLEM  |
|una risonanza magnetica                            |TEST     |
|La risonanza magnetica                             |TEST     |
|un'estesa perdita di cartilagine       

In [70]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_ja

In [33]:
model_name = "ner_clinical_ja"

In [34]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese","ja") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "ja", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_embeddings_bert_large_japanese download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [35]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+----------------+---------+
|chunk           |ner_label|
+----------------+---------+
|中等度肺高血圧  |PROBLEM  |
|PA圧            |TEST     |
|1+僧帽弁逆流    |PROBLEM  |
|重度大動脈弁狭窄|PROBLEM  |
|LVEDP           |TEST     |
|駆出率          |TEST     |
|クロトリマゾール|TREATMENT|
|息切れ          |PROBLEM  |
|吐き気          |PROBLEM  |
+----------------+---------+
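
A quick way to sanity-check an output like this is to tally the labels. The sketch below does that in plain Python on the nine rows copied from the table above; the list `ja_rows` is a hand-transcribed stand-in for the DataFrame, not something the pipeline produces.

```python
from collections import Counter

# (chunk, label) pairs copied from the ner_clinical_ja output table above.
ja_rows = [
    ("中等度肺高血圧", "PROBLEM"),
    ("PA圧", "TEST"),
    ("1+僧帽弁逆流", "PROBLEM"),
    ("重度大動脈弁狭窄", "PROBLEM"),
    ("LVEDP", "TEST"),
    ("駆出率", "TEST"),
    ("クロトリマゾール", "TREATMENT"),
    ("息切れ", "PROBLEM"),
    ("吐き気", "PROBLEM"),
]

# Count how many chunks were assigned to each entity label.
label_counts = Counter(label for _, label in ja_rows)
for label, count in label_counts.most_common():
    print(label, count)
```

On the Spark side, a comparable summary should be obtainable by applying `groupBy("ner_label").count()` to the exploded DataFrame.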



In [36]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_nl

In [72]:
model_name = "ner_clinical_nl"

In [73]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","nl") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "nl", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [74]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|aanhoudende pijn                      |PROBLEM  |
|zwelling                              |PROBLEM  |
|bewegingsbeperking in haar rechterknie|PROBLEM  |
|artrose                               |PROBLEM  |
|knieletsel                            |PROBLEM  |
|Klinisch onderzoek                    |TEST     |
|röntgenfoto's                         |TEST     |
|vernauwing van de gewrichtsruimte     |PROBLEM  |
|osteofytvorming                       |PROBLEM  |
|kraakbeendegeneratie                  |PROBLEM  |
|een MRI-scan                          |TEST     |
|De MRI                                |TEST     |
|uitgebreid kraakbeenverlies           |PROBLEM  |
|botveranderingen                      |PROBLEM  |
|gevorderde artrose                    |PROBLEM  |
|een behandelplan                      |TREATMENT|
|fysiotherapie                 

In [75]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")






# ner_clinical_no

In [42]:
model_name = "ner_clinical_no"

In [43]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","no") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "no", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [44]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-----------------------------+---------+
|chunk                        |ner_label|
+-----------------------------+---------+
|Natrium                      |TEST     |
|kalium                       |TEST     |
|klorid                       |TEST     |
|bikarbonat                   |TEST     |
|BUN                          |TEST     |
|kreatinin                    |TEST     |
|glukose105                   |TEST     |
|hematokrit42                 |TEST     |
|hvittblodtall                |TEST     |
|blodplater                   |TEST     |
|protrombintid                |TEST     |
|delvis tromboplastintid      |TEST     |
|urinanalyse                  |TEST     |
|spor av hvite blodceller     |PROBLEM  |
|svake skjulte røde blodceller|PROBLEM  |
|Natrium                      |TEST     |
|kalium                       |TEST     |
|glukose                      |TEST     |
|P02                          |TEST     |
|PC02                         |TEST     |
+-----------------------------+---------+

In [45]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_pt

In [46]:
model_name = "ner_clinical_pt"

In [47]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pt") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "pt", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [48]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+------------------------------------------------------+---------+
|chunk                                                 |ner_label|
+------------------------------------------------------+---------+
|dor persistente                                       |PROBLEM  |
|inchaço                                               |PROBLEM  |
|limitação da amplitude de movimentos no joelho direito|PROBLEM  |
|osteoartrite                                          |PROBLEM  |
|uma lesão anterior no joelho                          |PROBLEM  |
|exame clínico                                         |TEST     |
|radiografias                                          |TEST     |
|estreitamento do espaço articular                     |PROBLEM  |
|osteófitos                                            |PROBLEM  |
|sinais de degeneração da cartilagem                   |PROBLEM  |
|uma ressonância magnética                             |TEST     |
|A RM                                                  |TEST     |

In [49]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_pl

In [50]:
model_name = "ner_clinical_pl"

In [51]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pl") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "pl", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [52]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------------------------------------------------------------------------+---------+
|chunk                                                                    |ner_label|
+-------------------------------------------------------------------------+---------+
|uporczywy ból                                                            |PROBLEM  |
|obrzęk                                                                   |PROBLEM  |
|ograniczony zakres ruchu w prawym kolanie                                |PROBLEM  |
|choroby zwyrodnieniowej stawów                                           |PROBLEM  |
|wcześniejszy uraz kolana                                                 |PROBLEM  |
|badanie kliniczne                                                        |TEST     |
|zdjęcia rentgenowskie                                                    |TEST     |
|zwężenie przestrzeni stawowej                                            |PROBLEM  |
|tworzenie się osteofitów                             

In [53]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_sv

In [54]:
model_name = "ner_clinical_sv"

In [55]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","sv") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "sv", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [56]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|tydliga knölar            |PROBLEM  |
|inverterade               |PROBLEM  |
|evert w / stimulering     |TREATMENT|
|lesioner                  |PROBLEM  |
|vit d / c vid introitus   |PROBLEM  |
|liten mängd tunn vit d / c|PROBLEM  |
|ph                        |TEST     |
|koh                       |TEST     |
|NS                        |TEST     |
|srom                      |PROBLEM  |
|palpabla knölar           |PROBLEM  |
|yttre lesioner            |PROBLEM  |
|ytterligare klagomål      |PROBLEM  |
|hans vita blodkroppar     |TEST     |
|neutrofiler               |TEST     |
|band                      |TEST     |
|lymfocyter                |TEST     |
|monocyter                 |TEST     |
|blaster                   |TEST     |
|långsam IV-infusion       |TREATMENT|
+--------------------------+---------+



In [57]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_tr

In [58]:
model_name = "ner_clinical_tr"

In [59]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tr") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "tr", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [60]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------+---------+
|chunk                           |ner_label|
+--------------------------------+---------+
|sürekli ağrı                    |PROBLEM  |
|şişlik                          |PROBLEM  |
|hareket kısıtlılığı             |PROBLEM  |
|osteoartrit                     |PROBLEM  |
|geçirilmiş diz yaralanması      |PROBLEM  |
|eklem aralığında daralma        |PROBLEM  |
|osteofit oluşumu                |PROBLEM  |
|kıkırdak dejenerasyonu bulguları|PROBLEM  |
|bir MR taraması                 |TEST     |
|MRG                             |TEST     |
|ileri osteoartrit               |PROBLEM  |
|yoğun kıkırdak kaybı            |PROBLEM  |
|kemik değişiklikleri            |PROBLEM  |
|fizik tedavi                    |TREATMENT|
|eklem replasmanı ameliyatı      |TREATMENT|
|sürekli nefes darlığı           |PROBLEM  |
|göğüs rahatsızlığı              |PROBLEM  |
|hipertansiyon                   |PROBLEM  |
|kardiyovasküler hastalık        |PROBLEM  |
|İlk fizik

In [61]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_vi

In [62]:
model_name = "ner_clinical_vi"

In [63]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","vi") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "vi", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(['sentence', 'token', 'ner'])\
    .setOutputCol('ner_chunk')

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [64]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

# One row per extracted chunk, with its entity label
result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+------------------------------+---------+
|chunk                         |ner_label|
+------------------------------+---------+
|HCV                           |PROBLEM  |
|rối loạn lưỡng cực            |PROBLEM  |
|tự tử                         |PROBLEM  |
|Inderal                       |TREATMENT|
|Klonopin                      |TREATMENT|
|Geodon                        |TREATMENT|
|ống thông khí                 |TREATMENT|
|bảo vệ đường thở              |TREATMENT|
|xâm nhập phía sau tim bên trái|PROBLEM  |
|một cuộc quét MRI cổ          |TEST     |
|sự nén tủy rõ ràng            |PROBLEM  |
|FBS                           |TEST     |
|CZI                           |TREATMENT|
|FBS                           |TEST     |
|CZI                           |TREATMENT|
|FBS                           |TEST     |
|CZI                           |TREATMENT|
|FBS                           |TEST     |
|CZI                           |TREATMENT|
|FBS                           |TEST     |
+------------------------------+---------+

In [65]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

# Collect once; calling result.collect() inside the loop re-runs the Spark job for every row
rows = result.collect()
for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")
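
The pipelines from `ner_clinical_he` through `ner_clinical_vi` differ mainly in the language code and the embeddings stage. As a rough summary, the sketch below records which embeddings each of those cells loads, as plain strings. The lookup table and the `embeddings_spec` helper are hypothetical conveniences for illustration; they are not part of Spark NLP and do not load any model.

```python
# Embeddings used by the pipelines from ner_clinical_he to ner_clinical_vi,
# as (annotator class name, pretrained model name), transcribed from the cells above.
EMBEDDINGS_BY_LANG = {
    "he": ("BertEmbeddings", "alephbertgimmel_base_512"),
    "ja": ("BertEmbeddings", "bert_embeddings_bert_large_japanese"),
}
# The remaining languages all load w2v_cc_300d with their own language code.
for _lang in ("it", "nl", "no", "pt", "pl", "sv", "tr", "vi"):
    EMBEDDINGS_BY_LANG[_lang] = ("WordEmbeddingsModel", "w2v_cc_300d")


def embeddings_spec(model_name):
    """Hypothetical helper: map e.g. 'ner_clinical_he' to its embeddings entry."""
    lang = model_name.rsplit("_", 1)[-1]
    if lang not in EMBEDDINGS_BY_LANG:
        raise ValueError(f"No embeddings recorded for language code {lang!r}")
    return EMBEDDINGS_BY_LANG[lang]


print(embeddings_spec("ner_clinical_he"))
print(embeddings_spec("ner_clinical_tr"))
```

A table like this makes it easier to see that only Hebrew and Japanese rely on BERT embeddings here, which is also why their download sizes differ from the `w2v_cc_300d` pipelines.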









