![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_GENE_PHENOTYPES.ipynb)

## Setup

In [None]:
import json
import os

from google.colab import files

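# Upload the license JSON file from the local machine
license_keys = files.upload()

# Parse the first (and only) uploaded file into a dict
with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Expose the keys as notebook variables so the `$VAR` references in the pip cells resolve
locals().update(license_keys)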

os.environ.update(license_keys)
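
The cells that follow read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from the uploaded license file. Below is a minimal sketch of a sanity check, assuming those three keys are the only ones this notebook needs. `missing_license_keys` is a hypothetical helper, not part of Spark NLP.

```python
# Hypothetical helper (not part of Spark NLP): report which expected keys
# are absent or empty in a parsed license dict.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")  # keys referenced in this notebook


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required keys that are missing or empty, in declaration order."""
    return [name for name in required if not keys.get(name)]


# Example with placeholder values; in the notebook, pass `license_keys` instead.
example_keys = {"PUBLIC_VERSION": "x.y.z", "SECRET": ""}
print(missing_license_keys(example_keys))
```

An empty list would suggest the install and start cells have the values they reference.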

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.pretrained import InternalResourceDownloader

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 5.5.1
Spark NLP_JSL Version : 5.5.1


## ner_genes_phenotypes

In [4]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_genes_phenotypes", "en", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
    ])

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_genes_phenotypes download started this may take some time.
[OK!]


In [15]:
from pyspark.sql.types import StringType

sample_texts = ["""
The G6PD gene provides instructions for glucose-6-phosphate dehydrogenase, crucial for protecting cells from oxidative stress.

Mutations in the G6PD gene cause G6PD deficiency, an X-linked recessive disorder affecting red blood cells.

Over 400 variants have been identified, with the G6PD A- variant common in African populations.

The variant G6PD protein results in reduced enzyme activity.

Clinical presentations of G6PD deficiency include hemolytic anemia triggered by certain medications, foods (e.g., fava beans), or infections.

Symptoms during hemolytic episodes include jaundice, fatigue, and dark urine.

Gene-environment interactions are significant, with G6PD deficiency conferring some protection against malaria.

Diagnosis involves enzyme activity assays and genetic testing. Management focuses on avoiding triggers and providing supportive care during hemolytic episodes.

In severe cases, blood transfusions may be necessary. Patient education about trigger avoidance is crucial for preventing complications.

The global prevalence of G6PD deficiency is estimated at 4.9%, with higher rates in malaria-endemic regions.

"""]

data = spark.createDataFrame(sample_texts, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

In [18]:
# Zip the parallel chunk arrays and explode them into one row per detected entity
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("ner_chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"))\
      .show(60, truncate=False)

+--------------------------------------+-----+----+---------------------+
|ner_chunk                             |begin|end |ner_label            |
+--------------------------------------+-----+----+---------------------+
|G6PD gene                             |5    |13  |MPG                  |
|glucose-6-phosphate dehydrogenase     |41   |73  |MPG                  |
|protecting cells from oxidative stress|88   |125 |Gene_Function        |
|G6PD gene                             |147  |155 |MPG                  |
|G6PD deficiency                       |163  |177 |Phenotype_Disease    |
|X-linked recessive                    |183  |200 |Inheritance_Pattern  |
|G6PD A                                |289  |294 |Phenotype_Disease    |
|African populations                   |315  |333 |Prevalence           |
|G6PD protein                          |350  |361 |MPG                  |
|G6PD deficiency                       |427  |441 |Phenotype_Disease    |
|hemolytic anemia                     
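
The `begin` and `end` columns appear to be character offsets into the input text, with `end` inclusive. The first rows of the table are consistent with that reading. Below is a small sketch in plain Python that uses only the first sentence of the sample text, including the leading newline of the triple-quoted string. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, begin, end):
    """Slice a chunk out of `text`, treating `end` as an inclusive offset."""
    return text[begin:end + 1]


# First sentence of the sample text, preceded by the newline that opens the string.
text = (
    "\nThe G6PD gene provides instructions for glucose-6-phosphate dehydrogenase, "
    "crucial for protecting cells from oxidative stress."
)

print(chunk_text(text, 5, 13))
print(chunk_text(text, 41, 73))
print(chunk_text(text, 88, 125))
```

Slicing with `end + 1` reproduces the chunk strings from the first three rows. A plain `text[begin:end]` would drop the last character of each chunk.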

In [19]:
# Wrap the fitted pipeline in a LightPipeline for fast in-memory annotation of small inputs
light_model = LightPipeline(pipeline.fit(data))

light_result = light_model.fullAnnotate(sample_texts)
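
`fullAnnotate` returns one dict per input text, keyed by output column, and each value is a list of annotation objects. The sketch below flattens such a list into plain tuples, assuming each annotation exposes `result`, `begin`, `end`, and `metadata` attributes. `chunks_to_rows` is a hypothetical helper, and the stand-in objects only mimic annotations so the example runs without Spark.

```python
from types import SimpleNamespace


def chunks_to_rows(annotations):
    """Flatten chunk annotations into (chunk, begin, end, label) tuples."""
    return [(a.result, a.begin, a.end, a.metadata.get("entity")) for a in annotations]


# Stand-in objects mimicking annotation attributes, filled with two rows from the table above.
stand_in = [
    SimpleNamespace(result="G6PD gene", begin=5, end=13, metadata={"entity": "MPG"}),
    SimpleNamespace(result="G6PD deficiency", begin=163, end=177,
                    metadata={"entity": "Phenotype_Disease"}),
]

for row in chunks_to_rows(stand_in):
    print(row)
```

In the notebook, the corresponding call would be `chunks_to_rows(light_result[0]["ner_chunk"])`, and the resulting tuples can be passed to `pd.DataFrame` with explicit column names.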

In [20]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', save_path="display_result.html")