![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/SOCIAL_DETERMINANT.ipynb)

# **Social Determinants of Health**

📌 To run this notebook yourself, you will need to upload your license keys. Run the cell below and select your license file when prompted. Alternatively, open the file explorer on the left side of the screen and upload the file as `spark_jsl.json`, which is the name the setup cell looks for.
Otherwise, you can look at the example outputs further down in the notebook.

# **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
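
The later setup cells read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from the license file, so a missing key only surfaces as a failed `pip install` or a failed session start. The snippet below is a minimal sketch of an optional pre-check; the helper name `missing_license_keys` and the example values are hypothetical and not part of Spark NLP.

```python
import json

# Keys that the setup cells in this notebook read from the license file.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in license_keys."""
    return [key for key in required if not license_keys.get(key)]

# Hypothetical, non-functional values used only to illustrate the check.
example = json.loads('{"SECRET": "xxxx", "PUBLIC_VERSION": "4.2.4"}')
print(missing_license_keys(example))
```

In the notebook itself, the same function could be called on the `license_keys` dictionary right after it is loaded.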

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.4
Spark NLP_JSL Version : 4.2.4


# 🔎 MODELS 

### Named Entity Recognition : 
> * ### *`ner_sdoh_mentions`*
> * ### *`ner_sdoh_slim_wip`*

### Sequence Classifier : 
> * ### *`bert_sequence_classifier_sdoh_community_present_status`*
> * ### *`bert_sequence_classifier_sdoh_community_absent_status`*


**🔎 You can find all these models and more in the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Named+Entity+Recognition&edition=Spark+NLP+for+Healthcare)**

# 📌 Named Entity Recognition

### **`ner_sdoh_mentions`**

In [58]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
    
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_sdoh_mentions", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

ner_pipeline = Pipeline(stages=[
                                documentAssembler, 
                                sentenceDetector,
                                tokenizer,
                                word_embeddings,
                                ner,
                                ner_converter])



embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_sdoh_mentions download started this may take some time.
[OK!]
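
The stages above are connected only through their input and output column names, so a typo in one name breaks the chain. The following is a plain-Python sketch (no Spark required) that replays the wiring from the cell above and reports any input column that no earlier stage produces. The helper `unresolved_inputs` is a hypothetical illustration, not a Spark NLP function.

```python
# (stage name, input columns, output column), copied from the pipeline cell above.
stages = [
    ("documentAssembler", ["text"], "document"),
    ("sentenceDetector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings", ["sentence", "token"], "embeddings"),
    ("ner", ["sentence", "token", "embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
]

def unresolved_inputs(stages, initial_columns=("text",)):
    """Return (stage, column) pairs whose input column is not yet available."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        problems += [(name, col) for col in inputs if col not in available]
        available.add(output)
    return problems

# An empty list means every stage finds the columns it needs.
print(unresolved_inputs(stages))
```

Dropping the first stage from the list makes the check report that `sentenceDetector` is missing its `document` input.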


In [59]:
text_list = [
    """Cooperative gentleman with a long standing history (20 years) diverticulitis. Over the past year he has been having flares of diverticulitis which were well controlled, intermittently. This controlled his pain until today, when he went to his PCP who ordered Initial (PRE) CT abdomen/pelvis w/PO contrast, which showed inflammation of the sigmoid colon with diverticuli and thickened bowel wall. He states his pain is sharp, 4-24, intermittent over the last 2 weeks. The pain is nonradiating, has no provoking factors but is alleviated with narcotics. Over the last 2 weeks, he has had two bouts of emesis, roughly 10 days ago. He reports a 20 lb weight loss over the last 3-4 weeks. Past Medical History: sarcoidosis w/cardiac involvement 2185 pacemaker staph infection psoriatic arthritis Social History: He is history teacher. He is divorced and lives at home with his girlfriend. He does not currently and never has used tobacco or illicit drugs. Until 3 weeks ago, he was having 1-19 drinks per day. Currently he uses no alcohol at all. Family History: noncontributory, no history of colon cancers or IBD.""",
    """This is an 80 year old Caucasian female with known congestive heart failure, who presents as an elective admission for worsened shortness of breath and administration of Natrecor. The patient states that over the past two weeks she has had felt fatigued and short of breath. The patient can no longer climb stairs, and consistently gets short of breath after walking about forty yards, and some times gets short of breath just sitting at rest.The patient also states she has poor appetite. The patient does not adhere to a strict low salt diet and drinks a lot of water, good constant urine output. In cardiac review of systems, the patient had a resolving dry cough times two weeks, no fever or chills. The patient gets intermittent left sided chest pain times two to three seconds, about one time a day, relieved spontaneously, occurring only at rest. Currently, the patient is pain free.  SOCIAL HISTORY: The patient denies any current or past history of tobacco or alcohol use. The patient is police and widowed. FAMILY HISTORY: The patient's father passed away of a myocardial infarction at age 62. The patient's brother died of a myocardial infarction at age 45. low as 70 to 75 over palpable. """]

In [60]:
from pyspark.sql.types import StringType, IntegerType
df = spark.createDataFrame(text_list, StringType()).toDF("text")

In [61]:
result = ner_pipeline.fit(df).transform(df)

In [62]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(30, truncate=False)

+-------------+----------------+
|chunk        |ner_label       |
+-------------+----------------+
|narcotics    |behavior_drug   |
|teacher      |sdoh_economics  |
|divorced     |sdoh_community  |
|home         |sdoh_environment|
|girlfriend   |sdoh_community  |
|tobacco      |behavior_tobacco|
|illicit drugs|behavior_drug   |
|drinks       |behavior_alcohol|
|alcohol      |behavior_alcohol|
|tobacco      |behavior_tobacco|
|alcohol      |behavior_alcohol|
|police       |sdoh_economics  |
|widowed      |sdoh_community  |
|father       |sdoh_community  |
|brother      |sdoh_community  |
+-------------+----------------+
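
To summarize how often each label occurs, the chunk/label pairs can be tallied once they are on the driver. The sketch below hard-codes the rows from the table above so that it runs without Spark; in the notebook the pairs would presumably come from collecting the selected DataFrame instead.

```python
from collections import Counter

# (chunk, ner_label) pairs copied from the output table above.
rows = [
    ("narcotics", "behavior_drug"),
    ("teacher", "sdoh_economics"),
    ("divorced", "sdoh_community"),
    ("home", "sdoh_environment"),
    ("girlfriend", "sdoh_community"),
    ("tobacco", "behavior_tobacco"),
    ("illicit drugs", "behavior_drug"),
    ("drinks", "behavior_alcohol"),
    ("alcohol", "behavior_alcohol"),
    ("tobacco", "behavior_tobacco"),
    ("alcohol", "behavior_alcohol"),
    ("police", "sdoh_economics"),
    ("widowed", "sdoh_community"),
    ("father", "sdoh_community"),
    ("brother", "sdoh_community"),
]

# Count how many extracted chunks fall under each label.
label_counts = Counter(label for _, label in rows)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```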



In [63]:
from sparknlp_display import NerVisualizer

# Collect the rows once instead of re-collecting the DataFrame on every iteration
for row in result.collect():
    NerVisualizer().display(
        result = row,
        label_col = 'ner_chunk',
        document_col = 'document')
    print("\n"*2)


### **`ner_sdoh_slim_wip`**

In [64]:
text_list = [
""" Mother states that he does smoke, there is a family hx of alcohol on both maternal and paternal sides of the family, maternal grandfather who died of alcohol related complications and paternal grandmother with alcoholism. Pts own drinking began at age 16, living in LA, had a DUI at 17yo after totaling a new car that his mother bought for him, he was married. """,
""" Husband presented as anxious , while friend took notes about pt s condition and names of providers, etc.  Husb reports both he and pt had been drinking on Saturday night, and he left her sitting up in a chair.  In the morning he found her bleeding from the mouth, and it became apparent she had overdosed, and left a suicide note.  Husband and friend report pt has hx of suicide attempts, most recently in of this year.  She also has hx of EtOH abuse, has been to detox and treatment programs several times over recent years, and resided in sober homes until recently.  Husband reports pt was a pedestrian struck by motor vehicle at 12yo , sustained head injury. He reports pt had been diagnosed bipolar disorder. """
]

In [65]:
from pyspark.sql.types import StringType, IntegerType
df = spark.createDataFrame(text_list, StringType()).toDF("text")

In [66]:
ner = MedicalNerModel.pretrained("ner_sdoh_slim_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_pipeline = Pipeline(stages=[
                                documentAssembler, 
                                sentenceDetector,
                                tokenizer,
                                word_embeddings,
                                ner,
                                ner_converter])



ner_sdoh_slim_wip download started this may take some time.
[OK!]


In [67]:
# Re-fit on the new texts so that `result` comes from ner_sdoh_slim_wip
# rather than from the earlier ner_sdoh_mentions run
result = ner_pipeline.fit(df).transform(df)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(30, truncate=False)



In [68]:
from sparknlp_display import NerVisualizer

# Collect the rows once instead of re-collecting the DataFrame on every iteration
for row in result.collect():
    NerVisualizer().display(
        result = row,
        label_col = 'ner_chunk',
        document_col = 'document')
    print("\n"*2)


# 📌 Sequence Classifier

### **`bert_sequence_classifier_sdoh_community_present_status`**

In [69]:
text_list = [
" Patient with history of lupus, lupus nephritis with ESRD on peritoneal dialysis on transplant list, hx of PE/Antiphopholipid antibody on coumadin, mitral regurg, presents with 4-6 month history of cough, worse in the morning, one week of trace blood, now producing bright red blood over last couple days. Patient states that the amount of blood she has been coughing has been increasing and is now almost hourly, aprroximately 1 teaspoon bright red blood. Patient states that the cough produced primarily yellow sputum until it turned to blood. Patient denies any other symptoms such as dizziness or lightheadedness.  Married with three children,Worked as an accountant until health declined in early 2002. No tobacco, ethanol or drug use. Centrilobular nodules and ground glass opacities throughout both lungs, with a basilar predominance, with associated mild bronchiectasis, compatible with chronic collagen vascular disease, progressed since 2002. There is no advanced fibrosis. Superimposed infection cannot be excluded by imaging alone. Ground glass opacities could also represent hemorrhage. 3. Chronic left lower segmental pulmonary arterial PE, unchanged since 2191. No new acute PE detected to the subsegmental levels.",
" This is an 87 year old man status post motor vehicle accident in Month (only) 956 who was recently discharged from Hospital1 18 status post left radical nephrectomy for renal oncocytoma, who returned to Hospital1 18 on 3-22 for outpatient followup CT scan of the head. Patient was found to have a left subdural hematoma and was transported to the emergency department for workup. Currently patient does not complain of fever, chills, nausea, vomiting, chest pain, shortness of breath. No known drug allergies.The patient is a retired priest. Denies history of tobacco or alcohol use. Patient currently lives at Hospital3 2558.",
]

In [70]:
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")
    
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_community_present_status", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class_")
        

pipeline = Pipeline(stages=[
                        documentAssembler, 
                        tokenizer,
                        sequenceClassifier])
    

df = spark.createDataFrame(text_list, StringType()).toDF("text")
results = pipeline.fit(df).transform(df)

bert_sequence_classifier_sdoh_community_present_status download started this may take some time.
[OK!]


In [71]:
res = results.select(F.explode(F.arrays_zip(results.document.result, 
                                             results.class_.result,
                                             results.class_.metadata)).alias("col"))\
               .select(F.expr("col['1']").alias("prediction"),
                       F.expr("col['2']").alias("confidence"),
                       F.expr("col['0']").alias("sentence"))
                  
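# Show results whenever there is at least one row
if res.count() > 0:
  # The metadata map stores each score under the key "Some(<label>)";
  # keep the score that belongs to the predicted label
  udf_func = F.udf(lambda x,y:  x["Some("+str(y)+")"])
  res.withColumn('confidence', udf_func(res.confidence, res.prediction)).show(truncate=150)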

+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|prediction|confidence|                                                                                                                                              sentence|
+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|      True| 0.9982032| Patient with history of lupus, lupus nephritis with ESRD on peritoneal dialysis on transplant list, hx of PE/Antiphopholipid antibody on coumadin,...|
|     False|0.78874034| This is an 87 year old man status post motor vehicle accident in Month (only) 956 who was recently discharged from Hospital1 18 status post left r...|
+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
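
The `confidence` column above is produced by looking up the predicted label inside the classifier's metadata map, using a key of the form `Some(<label>)`. The sketch below mirrors that lookup in plain Python. The helper name `confidence_for` is hypothetical, and the example metadata is illustrative: only the `Some(True)` score is taken from the first output row, while the `Some(False)` score is a made-up placeholder. The real map may also contain other keys.

```python
def confidence_for(metadata, prediction):
    """Look up the score stored under the key 'Some(<prediction>)'."""
    return metadata["Some(" + str(prediction) + ")"]

# Illustrative metadata map (assumption about its shape, see the note above).
# 'Some(True)' comes from the first output row; 'Some(False)' is a placeholder.
example_metadata = {"Some(True)": "0.9982032", "Some(False)": "0.0017968"}

print(confidence_for(example_metadata, "True"))
```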

### **`bert_sequence_classifier_sdoh_community_absent_status`**

In [72]:
text_list = [
"She has two adult sons. She is a widow. She was employed with housework. She quit smoking 20 to 30 years ago, but smoked two packs per day for 20 to 30 years. She drinks one glass of wine occasionally. She avoids salt in her diet. ",
"65 year old male presented with several days of vice like chest pain. He states that he felt like his chest was being crushed from back to the front. Lives with spouse and two sons moved to US 1 month ago.",
]

In [73]:
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_community_absent_status", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class_")
        

pipeline = Pipeline(stages=[
                      documentAssembler, 
                      tokenizer,
                      sequenceClassifier])
    

df = spark.createDataFrame(text_list, StringType()).toDF("text")
results = pipeline.fit(df).transform(df)

bert_sequence_classifier_sdoh_community_absent_status download started this may take some time.
[OK!]


In [74]:
res = results.select(F.explode(F.arrays_zip(results.document.result, 
                                             results.class_.result,
                                             results.class_.metadata)).alias("col"))\
               .select(F.expr("col['1']").alias("prediction"),
                       F.expr("col['2']").alias("confidence"),
                       F.expr("col['0']").alias("sentence"))
                  
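# Show results whenever there is at least one row
if res.count() > 0:
  # The metadata map stores each score under the key "Some(<label>)";
  # keep the score that belongs to the predicted label
  udf_func = F.udf(lambda x,y:  x["Some("+str(y)+")"])
  res.withColumn('confidence', udf_func(res.confidence, res.prediction)).show(truncate=150)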

+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|prediction|confidence|                                                                                                                                              sentence|
+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|      True| 0.9894813|She has two adult sons. She is a widow. She was employed with housework. She quit smoking 20 to 30 years ago, but smoked two packs per day for 20 t...|
|     False|0.72528476|65 year old male presented with several days of vice like chest pain. He states that he felt like his chest was being crushed from back to the fron...|
+----------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------+