<a href="https://colab.research.google.com/github/AlfredIsair/Clinical-SDOH-Social-Determinants-of-Health-Analysis/blob/main/Clinical_SDOH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Colab Setup**

In [25]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [26]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8539.json to spark_nlp_for_healthcare_spark_ocr_8539 (1).json


In [27]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8539.json
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8539.json
👌 JSL-Home is up to date! 
👌 Everything is already installed, no changes made


In [28]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()

Spark Session already created, some configs may not take.
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8539.json


In [29]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
from pyspark.sql.types import StringType

## Building the Pipeline

This pipeline processes clinical text in six stages: it assembles raw text into documents, splits them into sentences, tokenizes the sentences, generates clinical word embeddings, runs Named Entity Recognition (NER) for Social Determinants of Health (SDOH), and converts the NER tags into entity chunks for further analysis. The resulting chunks expose SDOH mentions such as housing, employment, and insurance status in clinical documents.

In [30]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_sdoh", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

sdoh_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
    ])



sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_sdoh download started this may take some time.
[OK!]


In [31]:
sample_texts = [
                "Smith is a 55 years old, divorced Mexcian American woman with financial problems. She speaks spanish. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and does not have access to health insurance or paid sick leave. She has a son student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reprots having her catholic faith as a means of support as well.  She has long history of etoh abuse, beginning in her teens. She reports she has been drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. She had DUI back in April and was due to be in court this week."
                ]


In [32]:
# Create a Spark DataFrame with a single "text" column
data = spark.createDataFrame(sample_texts, StringType()).toDF("text")

In [33]:
result = sdoh_pipeline.fit(data).transform(data)

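`.fit(data)`: This call fits the pipeline on the input DataFrame (`data`) and returns a fitted `PipelineModel`. The learned components here (sentence detector, embeddings, and NER model) are loaded pretrained, so this call does not train them on `data`; it mainly resolves the stages into a model that is ready to transform text.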

.transform(data): This part applies the fitted pipeline to the input data, transforming it into a new DataFrame (result). The transformation involves processing the clinical text through the various stages of the pipeline, and the result is a DataFrame containing the original columns and additional columns generated by the pipeline (e.g., NER results, embeddings).
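
The division of labour between `fit` and `transform` can be sketched with a toy pipeline in plain Python. The classes below (`Lowercase`, `VocabEstimator`, `VocabModel`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical stand-ins invented for this illustration, not Spark or Spark NLP classes. The sketch only assumes the general Spark ML convention that stages exposing `fit` are fitted in order, while plain transformers are passed through unchanged.

```python
# Toy stand-ins for illustration only; these are NOT Spark or Spark NLP classes.

class Lowercase:
    """Transformer-like stage: nothing to learn, so it only has transform()."""
    def transform(self, rows):
        return [{**row, "lower": row["text"].lower()} for row in rows]


class VocabModel:
    """Fitted stage produced by VocabEstimator.fit()."""
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            {**row, "known": [w for w in row["lower"].split() if w in self.vocab]}
            for row in rows
        ]


class VocabEstimator:
    """Estimator-like stage: fit() learns a vocabulary and returns a VocabModel."""
    def fit(self, rows):
        return VocabModel({w for row in rows for w in row["lower"].split()})


class ToyPipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted, current = [], rows
        for stage in self.stages:
            # Stages with fit() are fitted; transformer-like stages pass through unchanged.
            model = stage.fit(current) if hasattr(stage, "fit") else stage
            fitted.append(model)
            # Each later stage sees the output of the stages before it.
            current = model.transform(current)
        return ToyPipelineModel(fitted)


rows = [{"text": "She lives in an apartment"}]
lowercase = Lowercase()
model = ToyPipeline([lowercase, VocabEstimator()]).fit(rows)
out = model.transform(rows)
print(out[0]["known"])
```

In this toy run the `Lowercase` stage comes out of `fit` as the same object, while `VocabEstimator` is replaced by a fitted `VocabModel`. The output rows keep the original `text` key and gain the keys added by each stage.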

In [34]:
# Show a tabular view of the NER chunks and their associated labels.
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(30, truncate=False)

+------------------+-------------------+
|chunk             |ner_label          |
+------------------+-------------------+
|55 years old      |Age                |
|divorced          |Marital_Status     |
|Mexcian American  |Race_Ethnicity     |
|woman             |Gender             |
|financial problems|Financial_Status   |
|She               |Gender             |
|spanish           |Language           |
|She               |Gender             |
|apartment         |Housing            |
|She               |Gender             |
|diabetes          |Other_Disease      |
|hospitalizations  |Other_SDoH_Keywords|
|cleaning assistant|Employment         |
|health insurance  |Insurance_Status   |
|She               |Gender             |
|son               |Family_Member      |
|student           |Education          |
|college           |Education          |
|depression        |Mental_Health      |
|She               |Gender             |
|she               |Gender             |
|rehab          
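
What `arrays_zip` followed by `explode` does in the cell above can be sketched on plain Python lists. The two lists below are a hypothetical stand-in for one row of the `ner_chunk` column, with a few values copied from the table above. Only the `entity` metadata key is modelled, and the grouping step at the end is an illustrative addition rather than part of the notebook.

```python
from collections import defaultdict

# Hypothetical stand-in for one row of `ner_chunk`: two parallel lists,
# mirroring `ner_chunk.result` and `ner_chunk.metadata`.
chunk_results = ["55 years old", "divorced", "woman", "She", "spanish", "She"]
chunk_metadata = [
    {"entity": "Age"},
    {"entity": "Marital_Status"},
    {"entity": "Gender"},
    {"entity": "Gender"},
    {"entity": "Language"},
    {"entity": "Gender"},
]

# arrays_zip pairs the i-th chunk with the i-th metadata map;
# explode then emits one output row per pair.
rows = [(chunk, meta["entity"]) for chunk, meta in zip(chunk_results, chunk_metadata)]

# Illustrative extra step: group distinct chunks by label, keeping first-seen order.
by_label = defaultdict(list)
for chunk, label in rows:
    if chunk not in by_label[label]:
        by_label[label].append(chunk)

for label, chunks in by_label.items():
    print(f"{label}: {chunks}")
```

Grouping by label collapses repeated mentions such as the pronoun `She`, which can make a long entity list easier to scan.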

In [35]:
visualizer = nlp.viz.NerVisualizer()

# Collect once so the pipeline is not re-evaluated for every row
collected_rows = result.collect()

for row in collected_rows:
    visualizer.display(
        result = row,
        label_col = 'ner_chunk',
        document_col = 'document'
    )
    print("\n"*2)







The extracted SDOH entities (for example age, marital status, housing, employment, and insurance status) describe the patient's social and environmental context and support a more complete picture of health and potential healthcare needs.

Recognizing and addressing SDOH is essential for practitioners who aim to provide patient-centered care. With these determinants in view, healthcare professionals can tailor interventions, improve patient engagement, and design more effective and equitable care strategies, since health outcomes are shaped by a complex interplay of social, economic, and environmental factors.

