
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/CLASSIFICATION_GENDER.ipynb)

# **Detect the Gender of the Patient in a Clinical Document**

To run this notebook yourself, you need to upload your license keys. Run the cells below to do that, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. Otherwise, you can look at the example outputs further down in the notebook.

# **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print("Please Upload your John Snow Labs License using the button below")
license_keys = files.upload()
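
Before installing, it can help to confirm that the uploaded file is present and parses as JSON. The sketch below is a minimal example and not part of the johnsnowlabs library: `load_license_key_names` is a hypothetical helper, and it assumes `license_keys.json` sits in the working directory and holds a JSON object at the top level. It returns only the key names, so secret values are never printed.

```python
import json
from pathlib import Path


def load_license_key_names(path="license_keys.json"):
    """Return the sorted key names found in the license file (values are not exposed)."""
    license_path = Path(path)
    if not license_path.is_file():
        raise FileNotFoundError(f"{path} not found; upload it first")
    with license_path.open() as fh:
        data = json.load(fh)
    # Assumption: the license file is a single JSON object of key/value pairs.
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(data)
```

Calling `load_license_key_names()` after the upload either lists the key names or raises an error that points at the missing or malformed file.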

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart the notebook afterwards for the changes to take effect.

jsl.install()

# **Start Session**

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all JARs the user has access to
spark = jsl.start()

# **🔎 About the Models**


📌 **classifierdl_gender_sbert** --> *This model classifies the gender of the patient in the clinical document using context. It takes sentence embeddings from `sbiobert_base_cased_mli` as input.*

*   Predicted Entities => **Female, Male, Unknown**

📌 **classifierdl_gender_biobert** --> *This model classifies the gender of the patient in the clinical document using context. It takes averaged `biobert_pubmed_base_cased` token embeddings as input.*

*   Predicted Entities => **Female, Male, Unknown**

📌 **bert_sequence_classifier_gender_biobert** --> *This model classifies the gender of a patient in a clinical document using context. This model is a BioBERT-based classifier.*

*   Predicted Entities => **Female, Male, Unknown**


# **🔎 Sample Text**

In [None]:
sample_texts = ["""HISTORY: The patient is a 57-year-old female, who I initially saw in the office on 12/27/07, as a referral from the Tomball Breast Center. On 12/21/07, the patient underwent image-guided needle core biopsy of a 1.5 cm lesion at the 7 o'clock position of the left breast (inferomedial). The biopsy returned showing infiltrating ductal carcinoma high histologic grade. The patient stated that she had recently felt and her physician had felt a palpable mass in that area prior to her breast imaging.""",
"""The patient states that she has been overweight for approximately 35 years and has tried multiple weight loss modalities in the past including Weight Watchers, NutriSystem, Jenny Craig, TOPS, cabbage diet, grape fruit diet, Slim-Fast, Richard Simmons, as well as over-the-counter  measures without any long-term sustainable weight loss. At the time of presentation to the practice, xx is 5 feet 6 inches tall with a weight of 285.4 pounds and a body mass index of 46. She has obesity-related comorbidities, which includes hypertension and hypercholesterolemia.""",
"""Prostate gland showing moderately differentiated infiltrating adenocarcinoma, Gleason 3 + 2 extending to the apex involving both lobes of the prostate, mainly right.""",
"""SKIN: The patient has significant subcutaneous emphysema of the upper chest and  anterior neck area although he states that the subcutaneous emphysema has improved significantly since yesterday.""",
"""Procedure in detail: after obtaining informed consent from the patient, including a thorough explanation of the risks and benefits of the aforementioned procedure, the patient was taken to the operating room and general endotracheal anesthesia was administered."""
]


In [None]:
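# Explicit imports for the names used in the cells below
from pyspark.ml import Pipeline
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# One row per sample text, in a single string column named "text"
df = spark.createDataFrame(sample_texts, StringType()).toDF('text')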

# **🔎 Define the Spark NLP Pipeline**

### **classifierdl_gender_sbert**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


# Sentence-level embeddings computed directly from the document text
sbert_embedder = nlp.BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli", 'en', 'clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")


# Classifier that reads the sentence embeddings and writes the gender label
gender_classifier = nlp.ClassifierDLModel.pretrained('classifierdl_gender_sbert', 'en', 'clinical/models')\
    .setInputCols(["document", "sentence_embeddings"])\
    .setOutputCol("gender")

pipeline = Pipeline(
    stages=[
        document_assembler, 
        sbert_embedder, 
        gender_classifier
        ])


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
classifierdl_gender_sbert download started this may take some time.
Approximate size to download 22.2 MB
[OK!]


In [None]:
result = pipeline.fit(df).transform(df)

# Pair each document text with its predicted label and show one row per pair
result.select(F.explode(F.arrays_zip(result.document.result,
                                     result.gender.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("Text"),
              F.expr("cols['1']").alias("Gender")).show(truncate=80)

+--------------------------------------------------------------------------------+-------+
|                                                                            Text| Gender|
+--------------------------------------------------------------------------------+-------+
|HISTORY: The patient is a 57-year-old female, who I initially saw in the offi...| Female|
|The patient states that she has been overweight for approximately 35 years an...| Female|
|Prostate gland showing moderately differentiated infiltrating adenocarcinoma,...|   Male|
|SKIN: The patient has significant subcutaneous emphysema of the upper chest a...|   Male|
|Procedure in detail: after obtaining informed consent from the patient, inclu...|Unknown|
+--------------------------------------------------------------------------------+-------+
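
The `arrays_zip` plus `explode` step above pairs the elements of the `document` and `gender` arrays position by position, then emits one output row per pair. A plain-Python sketch of the same idea follows; `zip_and_explode` is a hypothetical helper, and the rows are shortened placeholders shaped like the pipeline output rather than real Spark rows.

```python
from itertools import zip_longest

# Toy rows shaped like the pipeline output: one array per annotation column.
# The texts are shortened placeholders, not the full sample texts.
rows = [
    {"document": ["The patient is a 57-year-old female ..."], "gender": ["Female"]},
    {"document": ["Prostate gland showing ..."], "gender": ["Male"]},
    {"document": ["Procedure in detail: ..."], "gender": ["Unknown"]},
]


def zip_and_explode(rows):
    """Pair array elements by position within each row, then flatten to one pair per output row."""
    pairs = []
    for row in rows:
        # zip_longest pads the shorter array with None, as arrays_zip pads with null.
        for text, label in zip_longest(row["document"], row["gender"]):
            pairs.append((text, label))
    return pairs


for text, label in zip_and_explode(rows):
    print(f"{str(label):>7} | {text}")
```

Because each document here yields exactly one label, every input row produces exactly one output row.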



### **classifierdl_gender_biobert**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")


tokenizer = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')


biobert_embeddings = nlp.BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
    .setInputCols(["document", 'token'])\
    .setOutputCol("bert_embeddings")


sentence_embeddings = nlp.SentenceEmbeddings() \
     .setInputCols(["document", "bert_embeddings"]) \
     .setOutputCol("sentence_bert_embeddings") \
     .setPoolingStrategy("AVERAGE")


genderClassifier = nlp.ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
       .setInputCols(["document", "sentence_bert_embeddings"]) \
       .setOutputCol("gender")


pipeline = Pipeline(
    stages=[
        document_assembler, 
        tokenizer, 
        biobert_embeddings, 
        sentence_embeddings, 
        genderClassifier
        ])


biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_gender_biobert download started this may take some time.
Approximate size to download 21 MB
[OK!]
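
The `SentenceEmbeddings` stage in this pipeline uses `setPoolingStrategy("AVERAGE")`, which collapses the per-token vectors of a document into a single vector by taking the mean of each dimension. The sketch below illustrates that arithmetic in plain Python; `average_pool` is a hypothetical helper and the three-dimensional vectors are made-up toy values, far shorter than real BioBERT embeddings.

```python
def average_pool(token_vectors):
    """Return the element-wise mean of a list of equal-length vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    # Average each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy embeddings for a two-token document (illustrative values only).
toy_tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
sentence_vector = average_pool(toy_tokens)
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

The classifier then sees one fixed-size vector per document, regardless of how many tokens the text contains.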


In [None]:
result = pipeline.fit(df).transform(df)

result.select(F.explode(F.arrays_zip(result.document.result,
                                     result.gender.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("Text"),
              F.expr("cols['1']").alias("Gender")).show(truncate=80)



+--------------------------------------------------------------------------------+-------+
|                                                                            Text| Gender|
+--------------------------------------------------------------------------------+-------+
|HISTORY: The patient is a 57-year-old female, who I initially saw in the offi...| Female|
|The patient states that she has been overweight for approximately 35 years an...| Female|
|Prostate gland showing moderately differentiated infiltrating adenocarcinoma,...|   Male|
|SKIN: The patient has significant subcutaneous emphysema of the upper chest a...|   Male|
|Procedure in detail: after obtaining informed consent from the patient, inclu...|Unknown|
+--------------------------------------------------------------------------------+-------+



### **bert_sequence_classifier_gender_biobert**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_gender_biobert", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("gender")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

pipeline = Pipeline(
    stages=[
        document_assembler, 
        tokenizer,
        sequenceClassifier    
    ])


bert_sequence_classifier_gender_biobert download started this may take some time.
[OK!]


In [None]:
result = pipeline.fit(df).transform(df)

result.select(F.explode(F.arrays_zip(result.document.result,
                                     result.gender.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("Text"),
              F.expr("cols['1']").alias("Gender")).show(truncate=80)



+--------------------------------------------------------------------------------+-------+
|                                                                            Text| Gender|
+--------------------------------------------------------------------------------+-------+
|HISTORY: The patient is a 57-year-old female, who I initially saw in the offi...| Female|
|The patient states that she has been overweight for approximately 35 years an...| Female|
|Prostate gland showing moderately differentiated infiltrating adenocarcinoma,...|   Male|
|SKIN: The patient has significant subcutaneous emphysema of the upper chest a...|   Male|
|Procedure in detail: after obtaining informed consent from the patient, inclu...|Unknown|
+--------------------------------------------------------------------------------+-------+
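
All three models return the same labels for the five sample texts. When running the pipelines on other documents, a small check like the one below can flag the texts on which the models differ. It is a sketch in plain Python: `disagreements` is a hypothetical helper, and the label lists are copied by hand from the three output tables in this notebook rather than read from the Spark results.

```python
# Labels copied from the three output tables in this notebook, in sample-text order.
predictions = {
    "classifierdl_gender_sbert": ["Female", "Female", "Male", "Male", "Unknown"],
    "classifierdl_gender_biobert": ["Female", "Female", "Male", "Male", "Unknown"],
    "bert_sequence_classifier_gender_biobert": ["Female", "Female", "Male", "Male", "Unknown"],
}


def disagreements(preds_by_model):
    """Return (text_index, {model: label}) for every text on which the models differ."""
    models = list(preds_by_model)
    lengths = {len(preds_by_model[name]) for name in models}
    if len(lengths) > 1:
        raise ValueError("every model needs exactly one label per text")
    differing = []
    for index, labels in enumerate(zip(*preds_by_model.values())):
        if len(set(labels)) > 1:
            differing.append((index, dict(zip(models, labels))))
    return differing


print(disagreements(predictions))  # prints [] because the three tables match
```

An empty list means full agreement; any entry gives the index of the text and the label each model assigned to it.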

