
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/CLASSIFICATION_GENDER.ipynb)

# **Detect the Gender of the Patient in a Clinical Document**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. If you do not have a license, you can still look at the example outputs shown under each cell.

# **Colab Setup**

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
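
Before the keys are pushed into the environment, it can help to check that the uploaded file contains the entries the later cells rely on. The following is only a sketch and not part of the original notebook: `REQUIRED_KEYS` and `missing_license_keys` are hypothetical names, and the key list is inferred from the variables used further down (`PUBLIC_VERSION`, `JSL_VERSION`, `SECRET`). A real license file may contain more entries.

```python
import json

# Keys referenced later in this notebook; an assumption, not an official schema.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")


def missing_license_keys(license_json, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license JSON text."""
    keys = json.loads(license_json)
    return [name for name in required if not keys.get(name)]


# Usage with a made-up license file (placeholder values only).
example = json.dumps({"PUBLIC_VERSION": "3.4.4", "JSL_VERSION": "3.5.3"})
print(missing_license_keys(example))  # ['SECRET']
```

An empty list means the install and session cells below should find every variable they expect.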

# **Install dependencies**

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

# **Import dependencies into Python and start the Spark session**

In [4]:
import json
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

import sparknlp
import sparknlp_jsl

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)

import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(secret = SECRET, params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.4
Spark NLP_JSL Version : 3.5.3


# **🔎 About the models**


📌 **classifierdl_gender_sbert** --> *This model classifies the gender of the patient in the clinical document using context. It is a ClassifierDL model that runs on `sbiobert_base_cased_mli` sentence embeddings.*

*   Predicted Entities => **Female, Male, Unknown**

📌 **classifierdl_gender_biobert** --> *This model classifies the gender of the patient in the clinical document using context. It is a ClassifierDL model that runs on averaged `biobert_pubmed_base_cased` token embeddings.*

*   Predicted Entities => **Female, Male, Unknown**

📌 **bert_sequence_classifier_gender_biobert** --> *This model classifies the gender of a patient in a clinical document using context. This model is a BioBERT-based classifier.*

*   Predicted Entities => **Female, Male, Unknown**


# **🔎 Sample Text**

In [22]:
text_list = ["""HISTORY: The patient is a 57-year-old female, who I initially saw in the office on 12/27/07, as a referral from the Tomball Breast Center. On 12/21/07, the patient underwent image-guided needle core biopsy of a 1.5 cm lesion at the 7 o'clock position of the left breast (inferomedial). The biopsy returned showing infiltrating ductal carcinoma high histologic grade. The patient stated that she had recently felt and her physician had felt a palpable mass in that area prior to her breast imaging.""",
"""The patient states that she has been overweight for approximately 35 years and has tried multiple weight loss modalities in the past including Weight Watchers, NutriSystem, Jenny Craig, TOPS, cabbage diet, grape fruit diet, Slim-Fast, Richard Simmons, as well as over-the-counter  measures without any long-term sustainable weight loss. At the time of presentation to the practice, xx is 5 feet 6 inches tall with a weight of 285.4 pounds and a body mass index of 46. She has obesity-related comorbidities, which includes hypertension and hypercholesterolemia.""",
"""Prostate gland showing moderately differentiated infiltrating adenocarcinoma, Gleason 3 + 2 extending to the apex involving both lobes of the prostate, mainly right.""",
"""SKIN: The patient has significant subcutaneous emphysema of the upper chest and  anterior neck area although he states that the subcutaneous emphysema has improved significantly since yesterday.""",
"""Procedure in detail: after obtaining informed consent from the patient, including a thorough explanation of the risks and benefits of the aforementioned procedure, the patient was taken to the operating room and general endotracheal anesthesia was administered."""
]


In [23]:
df = spark.createDataFrame([[text_list[0]]]).toDF('text')
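
The cell above classifies only the first sample. To score every entry of `text_list`, each text has to become its own single-column row. The helper below is a hypothetical sketch (`texts_to_rows` is not a Spark NLP function) and runs on placeholder strings; the Spark call is left as a comment because it needs the active session from this notebook.

```python
def texts_to_rows(texts):
    """Wrap each text in a one-element list so every text becomes one row."""
    return [[text] for text in texts]


# Placeholder texts standing in for the clinical notes in text_list.
sample = ["first note", "second note"]
rows = texts_to_rows(sample)
print(rows)  # [['first note'], ['second note']]

# With the notebook's Spark session, the same rows would feed a DataFrame:
# df_all = spark.createDataFrame(texts_to_rows(text_list)).toDF("text")
```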

# **🔎 Define Spark NLP pipeline**

### **classifierdl_gender_sbert**

In [37]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


sbert_embedder = BertSentenceEmbeddings\
     .pretrained("sbiobert_base_cased_mli", 'en', 'clinical/models')\
     .setInputCols(["document"])\
     .setOutputCol("sentence_embeddings")


gender_classifier = ClassifierDLModel.pretrained( 'classifierdl_gender_sbert', 'en', 'clinical/models')\
      .setInputCols(["document", "sentence_embeddings"]) \
      .setOutputCol("gender")

pipeline = Pipeline(stages=[
                            document_assembler, 
                            sbert_embedder, 
                            gender_classifier
                            ])


sbert_model = pipeline.fit(spark.createDataFrame([['']]).toDF("text"))


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
classifierdl_gender_sbert download started this may take some time.
Approximate size to download 22.2 MB
[OK!]


In [38]:
result = sbert_model.transform(df)

result.select(F.explode(F.arrays_zip('document.result', 'gender.result')).alias("cols")) \
               .select( F.expr("cols['0']").alias("Text"),
                        F.expr("cols['1']").alias("Gender")).show(truncate=80)

+--------------------------------------------------------------------------------+------+
|                                                                            Text|Gender|
+--------------------------------------------------------------------------------+------+
|HISTORY: The patient is a 57-year-old female, who I initially saw in the offi...|Female|
+--------------------------------------------------------------------------------+------+
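
The `select` above pairs each document text with its predicted label: `arrays_zip` combines the two result arrays element by element, and `explode` turns every pair into its own row. As a rough plain-Python analogy (a sketch only, with the hypothetical helper name `zip_and_explode`; it does not call Spark), the same reshaping looks like this. `zip_longest` is used because `arrays_zip` fills the shorter array with nulls rather than truncating.

```python
from itertools import zip_longest


def zip_and_explode(document_results, gender_results):
    """Pair texts with labels position by position, one dict per output row."""
    return [
        {"Text": text, "Gender": gender}
        for text, gender in zip_longest(document_results, gender_results)
    ]


# One document and one prediction, mirroring the single-row output above.
rows = zip_and_explode(["HISTORY: The patient is a 57-year-old female"], ["Female"])
print(rows)
```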



### **classifierdl_gender_biobert**

In [32]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


tokenizer = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')


biobert_embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
        .setInputCols(["document",'token'])\
        .setOutputCol("bert_embeddings")


sentence_embeddings = SentenceEmbeddings() \
     .setInputCols(["document", "bert_embeddings"]) \
     .setOutputCol("sentence_bert_embeddings") \
     .setPoolingStrategy("AVERAGE")


genderClassifier = ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
       .setInputCols(["document", "sentence_bert_embeddings"]) \
       .setOutputCol("gender")


pipeline = Pipeline(stages=[document_assembler, 
                                tokenizer, 
                                biobert_embeddings, 
                                sentence_embeddings, 
                                genderClassifier])


biobert_model = pipeline.fit(spark.createDataFrame([['']]).toDF("text"))


biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_gender_biobert download started this may take some time.
Approximate size to download 21 MB
[OK!]
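
The `SentenceEmbeddings` stage above uses `setPoolingStrategy("AVERAGE")`, which collapses the per-token BioBERT vectors of a document into a single vector for the classifier. The snippet below is a minimal illustration of element-wise averaging on toy two-dimensional vectors; `average_pool` is a hypothetical helper, not the Spark NLP implementation, and real embeddings have many more dimensions.

```python
def average_pool(token_vectors):
    """Average equally sized token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three toy token vectors; each output component is the mean of its column.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_pool(tokens))  # [3.0, 4.0]
```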


In [33]:
result = biobert_model.transform(df)

result.select(F.explode(F.arrays_zip('document.result', 'gender.result')).alias("cols")) \
               .select( F.expr("cols['0']").alias("Text"),
                        F.expr("cols['1']").alias("Gender")).show(truncate=80)



+--------------------------------------------------------------------------------+------+
|                                                                            Text|Gender|
+--------------------------------------------------------------------------------+------+
|HISTORY: The patient is a 57-year-old female, who I initially saw in the offi...|Female|
+--------------------------------------------------------------------------------+------+



### **bert_sequence_classifier_gender_biobert**

In [34]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_gender_biobert", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("gender")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    sequenceClassifier    
])


bert_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


bert_sequence_classifier_gender_biobert download started this may take some time.
[OK!]


In [36]:
result = bert_model.transform(df)

result.select(F.explode(F.arrays_zip('document.result', 'gender.result')).alias("cols")) \
               .select( F.expr("cols['0']").alias("Text"),
                        F.expr("cols['1']").alias("Gender")).show(truncate=80)



+--------------------------------------------------------------------------------+------+
|                                                                            Text|Gender|
+--------------------------------------------------------------------------------+------+
|HISTORY: The patient is a 57-year-old female, who I initially saw in the offi...|Female|
+--------------------------------------------------------------------------------+------+

