![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_CLINICAL_MULTI.ipynb)

## Setup

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

locals().update(license_keys)

os.environ.update(license_keys)
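
Later cells read `$PUBLIC_VERSION`, `$JSL_VERSION` and `$SECRET` from the uploaded license file, so a missing key only shows up as a failed `pip install`. The sketch below checks for those three names up front. The helper `missing_license_keys` and the placeholder values are hypothetical, and a real license file may contain additional keys.

```python
import json

# Key names taken from how this notebook uses the license file;
# this is an assumption, not an official list.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(license_keys: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not license_keys.get(key)]


# Placeholder content standing in for an uploaded license file.
example = json.loads('{"SECRET": "placeholder", "PUBLIC_VERSION": "5.0.2"}')
print(missing_license_keys(example))  # ['JSL_VERSION']
```

Running the check right after `json.load(f)` and raising on a non-empty result would stop the notebook before any installation is attempted.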

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.pretrained import InternalResourceDownloader

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 5.0.2
Spark NLP_JSL Version : 5.0.2


In [46]:
text_dict = {

    "ner_clinical_pt": [
        """Uma mulher de 50 anos veio à clínica ortopédica com queixas de dor persistente, inchaço e limitação da amplitude de movimentos no joelho direito. A doente referia uma história de osteoartrite e uma lesão anterior no joelho. Foi efectuado um exame clínico e radiografias que revelaram um estreitamento do espaço articular, formação de osteófitos e sinais de degeneração da cartilagem. Para confirmar o diagnóstico e avaliar a gravidade, foi pedida uma ressonância magnética. A RM mostrou uma perda extensa de cartilagem e alterações ósseas consistentes com osteoartrite avançada. Depois de considerar a condição e as preferências do doente, foi discutido um plano de tratamento que envolvia o controlo da dor, fisioterapia e a possibilidade de cirurgia de substituição da articulação.""",
        """Um homem de 45 anos apresentou-se na clínica com falta de ar persistente e desconforto no peito. A história clínica do doente revelava hipertensão e uma história familiar de doença cardiovascular. Um exame físico inicial indicou uma tensão arterial elevada e sons cardíacos irregulares. Para avaliar melhor a condição, foram realizados um eletrocardiograma (ECG), um ecocardiograma e uma prova de esforço. O ECG revelou sinais de arritmia cardíaca, enquanto o ecocardiograma mostrou uma função ventricular esquerda reduzida. O teste de esforço indicou isquemia miocárdica induzida pelo exercício, confirmando a presença de doença arterial coronariana (DAC).""",
    ],
    "ner_clinical_tr": [
        """50 yaşında bir kadın hasta ortopedi kliniğine sağ dizinde sürekli ağrı, şişlik ve hareket kısıtlılığı şikâyetleriyle başvurdu. Hasta osteoartrit ve daha önce geçirilmiş diz yaralanması öyküsü bildirdi. Klinik muayene ve çekilen röntgenlerde eklem aralığında daralma, osteofit oluşumu ve kıkırdak dejenerasyonu bulguları tespit edildi. Tanıyı doğrulamak ve ciddiyetini değerlendirmek için bir MR taraması istendi. MRG, ileri osteoartrit ile uyumlu yoğun kıkırdak kaybı ve kemik değişiklikleri gösterdi. Hastanın durumu ve tercihleri göz önünde bulundurulduktan sonra  fizik tedavi ve eklem replasmanı ameliyatı olasılığını içeren bir tedavi planı tartışıldı.""",
        """45 yaşında bir erkek hasta kliniğe sürekli nefes darlığı ve göğüs rahatsızlığı şikâyetiyle başvurdu. Hastanın tıbbi geçmişinde hipertansiyon ve ailede kardiyovasküler hastalık öyküsü vardı. İlk fizik muayenede yüksek kan basıncı ve düzensiz kalp sesleri tespit edildi. Durumu daha iyi değerlendirmek için elektrokardiyogram (EKG), ekokardiyogram ve efor testi yapıldı. EKG'de kardiyak aritmi belirtileri görülürken, ekokardiyogramda sol ventrikül fonksiyonunun azaldığı görüldü. Efor testi egzersize bağlı miyokardiyal iskemiyi göstererek koroner arter hastalığının (KAH) varlığını doğrulamıştır.""",
    ],
    "ner_clinical_pl": [
        """50-letnia kobieta zgłosiła się do kliniki ortopedycznej skarżąc się na uporczywy ból, obrzęk i ograniczony zakres ruchu w prawym kolanie. Pacjentka zgłosiła historię choroby zwyrodnieniowej stawów i wcześniejszy uraz kolana. Przeprowadzono badanie kliniczne i wykonano zdjęcia rentgenowskie, które wykazały zwężenie przestrzeni stawowej, tworzenie się osteofitów i oznaki zwyrodnienia chrząstki. Aby potwierdzić diagnozę i ocenić stopień zaawansowania, zlecono rezonans magnetyczny. Rezonans magnetyczny wykazał rozległą utratę chrząstki i zmiany kostne odpowiadające zaawansowanej chorobie zwyrodnieniowej stawów. Po rozważeniu stanu pacjenta i jego preferencji, omówiono plan leczenia, który obejmował kontrolę bólu, fizjoterapię i możliwość operacji wymiany stawu.""",
        """45-letni mężczyzna zgłosił się do kliniki z uporczywą dusznością i dyskomfortem w klatce piersiowej. Historia medyczna pacjenta ujawniła nadciśnienie tętnicze i rodzinną historię chorób sercowo-naczyniowych. Wstępne badanie fizykalne wykazało wysokie ciśnienie krwi i nieregularne dźwięki serca. W celu dalszej oceny stanu pacjenta wykonano elektrokardiogram (EKG), echokardiogram i test wysiłkowy. EKG ujawniło oznaki arytmii serca, podczas gdy echokardiogram wykazał zmniejszoną czynność lewej komory. Test wysiłkowy wykazał niedokrwienie mięśnia sercowego wywołane wysiłkiem fizycznym, potwierdzając obecność choroby wieńcowej (CAD).""",
    ],
    "ner_clinical_es": [
        """Una mujer de 50 años acudió a la clínica ortopédica quejándose de dolor persistente, inflamación y limitación de la amplitud de movimiento en la rodilla derecha. La paciente refería antecedentes de artrosis y una lesión previa de rodilla. Se realizó un examen clínico y radiografías que revelaron un estrechamiento del espacio articular, formación de osteofitos y signos de degeneración del cartílago. Para confirmar el diagnóstico y evaluar la gravedad, se solicitó una resonancia magnética. La resonancia mostró una gran pérdida de cartílago y cambios óseos compatibles con una artrosis avanzada. Tras considerar el estado y las preferencias del paciente, se discutió un plan de tratamiento que incluía control del dolor, fisioterapia y la posibilidad de una cirugía de sustitución articular.""",
        """Un hombre de 45 años acudió a la consulta con disnea persistente y molestias torácicas. La historia clínica del paciente revelaba hipertensión y antecedentes familiares de enfermedad cardiovascular. Un examen físico inicial indicó hipertensión arterial y ruidos cardíacos irregulares. Para evaluar mejor su estado, se le practicaron un electrocardiograma (ECG), un ecocardiograma y una prueba de esfuerzo. El ECG reveló signos de arritmia cardiaca, mientras que el ecocardiograma mostró una función ventricular izquierda reducida. La prueba de esfuerzo indicó isquemia miocárdica inducida por el ejercicio, confirmando la presencia de enfermedad arterial coronaria (EAC)""",
    ],
    "ner_clinical_it": [
        """Una donna di 50 anni si è presentata alla clinica ortopedica lamentando dolore persistente, gonfiore e limitata capacità di movimento del ginocchio destro. La paziente ha riferito un'anamnesi di osteoartrite e un precedente infortunio al ginocchio. Sono stati eseguiti un esame clinico e delle radiografie che hanno rivelato un restringimento dello spazio articolare, la formazione di osteofiti e segni di degenerazione della cartilagine. Per confermare la diagnosi e valutarne la gravità, è stata ordinata una risonanza magnetica. La risonanza magnetica ha mostrato un'estesa perdita di cartilagine e alterazioni ossee coerenti con un'osteoartrite avanzata. Dopo aver considerato le condizioni e le preferenze del paziente, è stato discusso un piano di trattamento che prevedeva il controllo del dolore, la fisioterapia e la possibilità di un intervento di sostituzione dell'articolazione.""",
        """Un uomo di 45 anni si è presentato in clinica con una persistente mancanza di respiro e un fastidio al petto. L'anamnesi del paziente rivelava ipertensione e una storia familiare di malattie cardiovascolari. Un esame fisico iniziale indicava pressione alta e suoni cardiaci irregolari. Per valutare ulteriormente la condizione, sono stati eseguiti un elettrocardiogramma (ECG), un ecocardiogramma e un test da sforzo. L'ECG ha rivelato segni di aritmia cardiaca, mentre l'ecocardiogramma ha evidenziato una ridotta funzionalità del ventricolo sinistro. Il test da sforzo ha indicato un'ischemia miocardica indotta dall'esercizio, confermando la presenza di coronaropatia (CAD).""",
    ],
    "ner_clinical_fr": [
        """Une femme de 50 ans s'est présentée à la clinique orthopédique en se plaignant d'une douleur persistante, d'un gonflement et d'une limitation de l'amplitude de mouvement de son genou droit. La patiente a déclaré avoir des antécédents d'arthrose et s'être déjà blessée au genou. L'examen clinique et les radiographies effectuées ont révélé un rétrécissement de l'espace articulaire, la formation d'ostéophytes et des signes de dégénérescence du cartilage. Pour confirmer le diagnostic et en évaluer la gravité, une IRM a été demandée. L'IRM a montré une perte importante de cartilage et des modifications osseuses correspondant à une arthrose avancée. Après avoir pris en compte l'état de santé et les préférences du patient, un plan de traitement comprenant la prise en charge de la douleur, la kinésithérapie et la possibilité d'une arthroplastie a été discuté.""",
        """Un homme de 45 ans s'est présenté à la clinique avec un essoufflement persistant et une gêne thoracique. Les antécédents médicaux du patient révèlent une hypertension et des antécédents familiaux de maladies cardiovasculaires. L'examen physique initial a révélé une tension artérielle élevée et des bruits cardiaques irréguliers. Pour mieux évaluer la situation, un électrocardiogramme (ECG), un échocardiogramme et une épreuve d'effort ont été réalisés. L'ECG a révélé des signes d'arythmie cardiaque, tandis que l'échocardiographie a montré une réduction de la fonction ventriculaire gauche. L'épreuve d'effort a révélé une ischémie myocardique induite par l'exercice, confirmant la présence d'une maladie coronarienne.""",
    ],
}

# ner_clinical_tr

In [24]:
model_name = "ner_clinical_tr"
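
The per-language sections in this notebook differ only in the two-letter code passed to `WordEmbeddingsModel.pretrained` and `MedicalNerModel.pretrained`, and that code is also the suffix of each `text_dict` key. A minimal sketch of deriving it from `model_name` follows; the helper `language_of` is hypothetical and assumes every key has the form `ner_clinical_<lang>`.

```python
def language_of(model_name: str) -> str:
    """Return the trailing language code of a key such as 'ner_clinical_tr'."""
    prefix, _, lang = model_name.rpartition("_")
    if prefix != "ner_clinical" or len(lang) != 2:
        raise ValueError(f"unexpected model key: {model_name!r}")
    return lang


# The six keys used in text_dict above.
model_keys = [
    "ner_clinical_pt", "ner_clinical_tr", "ner_clinical_pl",
    "ner_clinical_es", "ner_clinical_it", "ner_clinical_fr",
]
languages = [language_of(key) for key in model_keys]
print(languages)  # ['pt', 'tr', 'pl', 'es', 'it', 'fr']
```

With this in place, the pipeline cell could take `language_of(model_name)` as its language argument instead of a hard-coded string.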

In [25]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","tr") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "tr", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [26]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)
# result.show()  # uncomment to print the raw annotation columns


result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------+---------+
|chunk                           |ner_label|
+--------------------------------+---------+
|sürekli ağrı                    |PROBLEM  |
|şişlik                          |PROBLEM  |
|hareket kısıtlılığı             |PROBLEM  |
|osteoartrit                     |PROBLEM  |
|geçirilmiş diz yaralanması      |PROBLEM  |
|eklem aralığında daralma        |PROBLEM  |
|osteofit oluşumu                |PROBLEM  |
|kıkırdak dejenerasyonu bulguları|PROBLEM  |
|bir MR taraması                 |TEST     |
|MRG                             |TEST     |
|ileri osteoartrit               |PROBLEM  |
|yoğun kıkırdak kaybı            |PROBLEM  |
|kemik değişiklikleri            |PROBLEM  |
|fizik tedavi                    |TREATMENT|
|eklem replasmanı ameliyatı      |TREATMENT|
|sürekli nefes darlığı           |PROBLEM  |
|göğüs rahatsızlığı              |PROBLEM  |
|hipertansiyon                   |PROBLEM  |
|kardiyovasküler hastalık        |PROBLEM  |
|İlk fizik
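
The `select` in the cell above pairs each chunk text with its metadata entry (`arrays_zip`), emits one row per pair (`explode`), and then reads the `entity` field. The plain-Python sketch below mirrors that logic for a single document. The helper `chunk_label_pairs` is hypothetical, the three chunks are taken from the table above, and the metadata dictionaries are simplified to the one key the notebook reads.

```python
# Simplified stand-in for one row of the ner_chunk column:
# parallel lists of chunk texts and their metadata.
results = ["sürekli ağrı", "şişlik", "bir MR taraması"]
metadata = [{"entity": "PROBLEM"}, {"entity": "PROBLEM"}, {"entity": "TEST"}]


def chunk_label_pairs(results, metadata):
    """Pair each chunk with its entity label, one tuple per chunk."""
    return [(chunk, meta["entity"]) for chunk, meta in zip(results, metadata)]


for chunk, label in chunk_label_pairs(results, metadata):
    print(f"{chunk:<20} {label}")
```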

In [27]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

rows = result.collect()  # collect once instead of re-running the pipeline on every iteration

for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_pt

In [28]:
model_name = "ner_clinical_pt"

In [29]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pt") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "pt", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [30]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)
# result.show()  # uncomment to print the raw annotation columns


result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+------------------------------------------------------+---------+
|chunk                                                 |ner_label|
+------------------------------------------------------+---------+
|dor persistente                                       |PROBLEM  |
|inchaço                                               |PROBLEM  |
|limitação da amplitude de movimentos no joelho direito|PROBLEM  |
|osteoartrite                                          |PROBLEM  |
|uma lesão anterior no joelho                          |PROBLEM  |
|exame clínico                                         |TEST     |
|radiografias                                          |TEST     |
|estreitamento do espaço articular                     |PROBLEM  |
|osteófitos                                            |PROBLEM  |
|sinais de degeneração da cartilagem                   |PROBLEM  |
|uma ressonância magnética                             |TEST     |
|A RM                                                  |TEST  
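
A quick way to summarise a chunk table like the one above is to count chunks per label. The sketch below does this in plain Python for the eleven complete rows shown in the Portuguese output; the truncated rows are left out.

```python
from collections import Counter

# (chunk, ner_label) pairs copied from the complete rows of the output above.
rows = [
    ("dor persistente", "PROBLEM"),
    ("inchaço", "PROBLEM"),
    ("limitação da amplitude de movimentos no joelho direito", "PROBLEM"),
    ("osteoartrite", "PROBLEM"),
    ("uma lesão anterior no joelho", "PROBLEM"),
    ("exame clínico", "TEST"),
    ("radiografias", "TEST"),
    ("estreitamento do espaço articular", "PROBLEM"),
    ("osteófitos", "PROBLEM"),
    ("sinais de degeneração da cartilagem", "PROBLEM"),
    ("uma ressonância magnética", "TEST"),
]

label_counts = Counter(label for _, label in rows)
print(label_counts.most_common())  # [('PROBLEM', 8), ('TEST', 3)]
```

On the Spark side, the same summary could presumably be obtained by applying `groupBy("ner_label").count()` to the selected DataFrame.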

In [31]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

rows = result.collect()  # collect once instead of re-running the pipeline on every iteration

for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_pl

In [32]:
model_name = "ner_clinical_pl"

In [33]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","pl") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "pl", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])



sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.1 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [34]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)
# result.show()  # uncomment to print the raw annotation columns


result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------------------------------------------------------------------------+---------+
|chunk                                                                    |ner_label|
+-------------------------------------------------------------------------+---------+
|uporczywy ból                                                            |PROBLEM  |
|obrzęk                                                                   |PROBLEM  |
|ograniczony zakres ruchu w prawym kolanie                                |PROBLEM  |
|choroby zwyrodnieniowej stawów                                           |PROBLEM  |
|wcześniejszy uraz kolana                                                 |PROBLEM  |
|badanie kliniczne                                                        |TEST     |
|zdjęcia rentgenowskie                                                    |TEST     |
|zwężenie przestrzeni stawowej                                            |PROBLEM  |
|tworzenie się osteofitów                             

In [35]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

rows = result.collect()  # collect once instead of re-running the pipeline on every iteration

for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_es

In [36]:
model_name = "ner_clinical_es"

In [37]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "es", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [42]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)
# result.show()  # uncomment to print the raw annotation columns


result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+-------------------------------------------------------------+---------+
|chunk                                                        |ner_label|
+-------------------------------------------------------------+---------+
|clínica ortopédica                                           |TREATMENT|
|dolor persistente                                            |PROBLEM  |
|inflamación                                                  |PROBLEM  |
|limitación de la amplitud de movimiento en la rodilla derecha|PROBLEM  |
|artrosis                                                     |PROBLEM  |
|una lesión previa de rodilla                                 |PROBLEM  |
|examen clínico y radiografías                                |TEST     |
|estrechamiento del espacio articular                         |PROBLEM  |
|formación de osteofitos                                      |PROBLEM  |
|signos de degeneración del cartílago                         |PROBLEM  |
|una resonancia magnética             

In [43]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

rows = result.collect()  # collect once instead of re-running the pipeline on every iteration

for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_it

In [45]:
model_name = "ner_clinical_it"

In [44]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","it") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "it", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [None]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)
# result.show()  # uncomment to print the raw annotation columns


result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)


In [None]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

rows = result.collect()  # collect once instead of re-running the pipeline on every iteration

for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")











# ner_clinical_fr

In [47]:
model_name = "ner_clinical_fr"

In [49]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setSplitChars(['-'])

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fr") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_clinical", "fr", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler,
                        sentence_detector,
                        tokenizer,
                        embeddings,
                        ner,
                        ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_clinical download started this may take some time.
[OK!]


In [50]:
df = spark.createDataFrame(pd.DataFrame({'text':text_dict[model_name]}))
result = pipeline.fit(df).transform(df)
# result.show()  # uncomment to print the raw annotation columns


result.select(F.explode(F.arrays_zip("ner_chunk.result", "ner_chunk.metadata")).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------------+---------+
|chunk                                 |ner_label|
+--------------------------------------+---------+
|douleur persistante                   |PROBLEM  |
|gonflement                            |PROBLEM  |
|mouvement de son genou droit          |PROBLEM  |
|antécédents d'arthrose                |PROBLEM  |
|blessée au genou                      |PROBLEM  |
|L'examen clinique                     |TEST     |
|les radiographies                     |TEST     |
|rétrécissement de l'espace articulaire|PROBLEM  |
|signes de dégénérescence du cartilage |PROBLEM  |
|gravité                               |PROBLEM  |
|une IRM                               |TEST     |
|L'IRM                                 |TEST     |
|perte importante de cartilage         |PROBLEM  |
|osseuses correspondant                |PROBLEM  |
|arthrose avancée                      |PROBLEM  |
|douleur                               |PROBLEM  |
|arthroplastie                 

In [51]:
from sparknlp_display import NerVisualizer

df = spark.createDataFrame(pd.DataFrame({'text': text_dict[model_name]}))
result = pipeline.fit(df).transform(df)

visualiser = NerVisualizer()

rows = result.collect()  # collect once instead of re-running the pipeline on every iteration

for row in rows:
  visualiser.display(result = row, label_col = 'ner_chunk', document_col = 'document')
  print("\n\n")









