# Automated Health Information Extraction and Co-occurrence Analysis with John Snow Labs Models

🔎 In this analytical endeavor, we utilize Natural Language Processing (NLP) to unlock vital insights from health-related discussions across various digital platforms. The primary goal is to pinpoint patterns that could suggest the presence of co-occurring health risks, contributing substantially to preventive health strategies.

🔎 By deploying John Snow Labs' NLP models, we extract essential health information from these discussions. This not only includes diseases and conditions but also risk factors, lifestyle habits such as alcohol consumption, and even complex aspects like family health history and the status of alcohol, tobacco, and substance behaviors.

🔎 Our exploration progresses into a risk analysis phase, wherein we concentrate on identifying patterns of health conditions that frequently appear together. For instance, a common co-occurrence of "smoking" and "lung cancer" in discussions could underscore critical health risks often linked together.

🔎 The insights generated from this work aim to bolster our understanding of digital health discussions and help predict potential health risks.

In [1]:
import json
import os
license_key = "5.0.0.spark_nlp_for_healthcare.json"
with open(license_key) as f:
    license_keys = json.load(f)
    
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

In [2]:
# # Installing pyspark and spark-nlp
#%pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# # Installing Spark NLP Healthcare
#%pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# # Installing Spark NLP Display Library for visualization
#%pip install -q spark-nlp-display

In [3]:
import os, json
import pandas as pd
import numpy as np
import warnings
import logging

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import CoNLL
from sparknlp_jsl.annotator import *
from sparknlp_jsl.eval import NerDLMetrics

from ast import literal_eval
from collections import Counter
import networkx as nx
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")
logging.getLogger().setLevel(logging.ERROR)

pd.set_option('display.max_columns', None)
pd.set_option("display.max_colwidth",50)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

params = {'spark.jsl.settings.pretrained.cache_folder': '/home/jovyan/work/shared/cache_folder',
          'spark.settings.pretrained.cache_folder': '/home/jovyan/work/shared/cache_folder',
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2048M",
          "spark.local.dir": "/home/jovyan/work/shared/spark-temp",
         }

spark = sparknlp_jsl.start(secret=license_keys["SECRET"], params= params)
print("sparknlp version:",sparknlp.version())
print("sparknlp_jsl version:", sparknlp_jsl.version())

spark.sparkContext.setLogLevel("ERROR")

spark


sparknlp version: 5.0.0
sparknlp_jsl version: 5.0.0


## Download Dataset

RHMD-Health-Mention-Dataset: https://github.com/usmaann/RHMD-Health-Mention-Dataset/blob/main/README.md

RHMD is a dataset developed to understand how users on Reddit use disease or symptom terms. This dataset is designed to classify Reddit posts that use these terms in ways other than describing their health conditions.

The dataset consists of `10,015` manually labeled Reddit posts that mention 15 common disease or symptom terms. These posts are divided into three different categories:

- `Figurative Mentions` (Label: 0): cases where health terms are used metaphorically. [3,225 instances]
- `Non-Health Mentions` (Label: 1): cases that discuss non-health-related topics. [3,430 instances]
- `Health Mentions` (Label: 2): cases where actual health conditions or diseases are discussed. [3,360 instances]



📌 In our analysis, we will primarily be working with the subset of data labeled as "Health mentions". This segment of data, containing Reddit posts discussing genuine health conditions, is rich in the health-related information we aim to extract and analyze. We believe this focus will enable us to accurately identify patterns of co-occurring health risks, and thus produce the most relevant and meaningful insights.
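As a small illustration of the label scheme described above, the sketch below selects only the `Health Mentions` rows from a few made-up records and checks that the quoted class sizes add up to the stated total. The records and the `select_health_mentions` helper are hypothetical; only the label codes (0, 1, 2) and the class counts come from the dataset description.

```python
# Label codes as described for RHMD above.
LABELS = {0: "Figurative Mentions", 1: "Non-Health Mentions", 2: "Health Mentions"}

# Class sizes quoted in the dataset description.
CLASS_COUNTS = {0: 3225, 1: 3430, 2: 3360}


def select_health_mentions(rows):
    """Keep only rows whose Label is 2 (Health Mentions)."""
    return [row for row in rows if row["Label"] == 2]


# Made-up records, for illustration only (not taken from the dataset).
toy_rows = [
    {"Text": "That meeting gave me a heart attack.", "Label": 0},
    {"Text": "New cancer research center opens downtown.", "Label": 1},
    {"Text": "I was diagnosed with asthma last year.", "Label": 2},
]

health_rows = select_health_mentions(toy_rows)
total_posts = sum(CLASS_COUNTS.values())

print([row["Text"] for row in health_rows])
print(total_posts)
```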

In [4]:
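# Read the data as a pandas DataFrame; reset_index() exposes the row number as a column
data = pd.read_csv("./RHMD_3_Class.csv").reset_index()
# Uncomment to keep only the Health Mentions (Label == 2)
#data = data[data.Label == 2 ].reset_index(drop=True).reset_index()
# Use the row number as text_id and drop the label column
data = data.rename(columns={"index": "text_id","Text": "text"}).drop("Label",axis=1)
data.tail()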

Unnamed: 0,text_id,text
10010,10010,An alternative solution to EU Challeger Ladder...
10011,10011,Gold 5 support looking for adc to duo with (EU...
10012,10012,PAC Man is unimpressed with Lux 10 game ranked...
10013,10013,Spongebob is literally curing cancer.
10014,10014,My apple has cancer.


In [5]:
# create spark dataframe
df = spark.createDataFrame(data)
df.show(truncate=50)


+-------+--------------------------------------------------+
|text_id|                                              text|
+-------+--------------------------------------------------+
|      0|Corona and mental health. Indiana's 211 hotline...|
|      1|Impact of genetic mutations on cocaine addictio...|
|      2|Oakland on Tuesday became the second U.S. city ...|
|      3|MDMA treatment for alcoholism reduces relapse s...|
|      4|'I was non-stop Juuling up a storm': 10 college...|
|      5|A single ketamine infusion combined with mindfu...|
|      6|Research suggests chocolate chip cookies equiva...|
|      7|Study finds CBD effective in treating heroin ad...|
|      8|Not all cannabis users develop an addiction, ev...|
|     10|A new study provides evidence that cannabidiol,...|
|     11|Could psychedelics transform mental health? A n...|
|     12|New York Health Officials See Marijuana as an A...|
|     13|Can Cannabis Solve the Opioid Crisis? Recent st...|
|     14|Is Sugar the Ne

                                                                                

## Gender Classification

In [6]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sbert_embedder = BertSentenceEmbeddings().pretrained("sbiobert_base_cased_mli", 'en', 'clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")\
    .setMaxSentenceLength(512)

gender_classifier = ClassifierDLModel.pretrained('classifierdl_gender_sbert', 'en', 'clinical/models') \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("gender_class")


gender_pipeline = Pipeline(stages=[
    documentAssembler,
    sbert_embedder,
    gender_classifier,

])

# gender_pipeline_model = gender_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

gender_result = gender_pipeline.fit(df).transform(df)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
Download done! Loading the resource.
[OK!]
classifierdl_gender_sbert download started this may take some time.
Approximate size to download 22.2 MB
Download done! Loading the resource.
[OK!]


In [7]:
df =  gender_result.select("text_id", "text",
                          F.explode(F.arrays_zip(gender_result.gender_class.result,)).alias("cols"))\
                 .select("text_id","text",
                          F.expr("cols['0']").alias("gender"))
df.show(10,truncate=100)


+-------+----------------------------------------------------------------------------------------------------+-------+
|text_id|                                                                                                text| gender|
+-------+----------------------------------------------------------------------------------------------------+-------+
|      0|Corona and mental health. Indiana's 211 hotline went from receiving roughly 1,000 calls a day reg...|Unknown|
|      1|Impact of genetic mutations on cocaine addiction elucidated: Scientists recently demonstrated tha...|Unknown|
|      2|Oakland on Tuesday became the second U.S. city to decriminalize magic mushrooms after a string of...|Unknown|
|      3|MDMA treatment for alcoholism reduces relapse safely with no serious side effects, suggests the f...|Unknown|
|      4|                'I was non-stop Juuling up a storm': 10 college students on their vaping addictions |   Male|
|      5|A single ketamine infusion combined wit

                                                                                

## Risk Factor Extraction

### Initial pipeline components

In [8]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")


sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
Download done! Loading the resource.
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
Download done! Loading the resource.
[OK!]


🔎 We will download the `ner_risk_factors` NER model and whitelist the labels that can be used as risk factors.

In [9]:
ner_risks_entities = ['CAD','DIABETES','HYPERLIPIDEMIA','HYPERTENSION','MEDICATION','OBESE','SMOKER']

# risk factors
risk_factors_ner = MedicalNerModel.pretrained("ner_risk_factors", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_risks")\
    .setLabelCasing('upper')

risk_factors_ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_risks"])\
    .setOutputCol("ner_risks_chunk")\
    .setWhiteList(ner_risks_entities)


ner_risk_factors download started this may take some time.
Approximate size to download 13.9 MB
Download done! Loading the resource.
[OK!]


🔎 We will download the `ner_jsl_langtest` NER model and whitelist the labels that can be used as risk factors.

In [10]:
ner_jsl_entities = [
    'Alcohol', 'Cerebrovascular_Disease', 'Diabetes',
    'Disease_Syndrome_Disorder',  'Heart_Disease', 
    'Hyperlipidemia','Hypertension',"Injury_or_Poisoning",
    'Kidney_Disease', 'Obesity', 'Oncological',
    'Smoking', 'Overweight', 'Psychological_Condition', 'BMI',
    'Total_Cholesterol', 'Race_Ethnicity'
]

ner_jsl_entities = [a.upper() for a in ner_jsl_entities]

# general clinical terminology
jsl_ner = MedicalNerModel.pretrained("ner_jsl_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_jsl")\
    .setLabelCasing('upper')

jsl_ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_jsl"])\
    .setOutputCol("ner_jsl_chunk")\
    .setWhiteList(ner_jsl_entities)


ner_jsl_langtest download started this may take some time.
Approximate size to download 3.1 MB
Download done! Loading the resource.
[OK!]


🔎 We will download the `ner_sdoh_langtest` NER model and whitelist the labels that can be used as risk factors.

In [11]:
ner_sdoh_entities = [
    "Access_To_Care",'Alcohol', "Childhood_Event", "Community_Safety", "Diet", "Disability", 
    "Eating_Disorder", "Education", "Employment", "Environmental_Condition", "Exercise",  
    "Financial_Status", "Income", "Insurance_Status", "Food_Insecurity", "Family_Member",
    "Geographic_Entity", "Housing", "Mental_Health", "Obesity", "Other_Disease",   
    "Smoking", "Social_Exclusion", "Race_Ethnicity","Sexual_Activity","Sexual_Orientation",
    "Spiritual_Beliefs",  "Substance_Use",  "Violence_Or_Abuse"
]

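# Upper-case the labels to match setLabelCasing('upper'), as done for ner_jsl_entities
ner_sdoh_entities = [a.upper() for a in ner_sdoh_entities]

# social determinants of health (sdoh)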
sdoh_ner = MedicalNerModel.pretrained("ner_sdoh_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_sdoh")\
    .setLabelCasing('upper')

sdoh_ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_sdoh"])\
    .setOutputCol("ner_sdoh_chunk")\
    .setWhiteList(ner_sdoh_entities)

ner_sdoh_langtest download started this may take some time.
Approximate size to download 2.8 MB
Download done! Loading the resource.
[OK!]


🔎 We will download the `ner_vop` NER model and whitelist the labels that can be used as risk factors.

In [12]:
ner_vop_entities = [
    "Substance", "PsychologicalCondition", "Vaccine", "Drug", "Disease", "RelationshipStatus",
    "Allergen", "Symptom", "HealthStatus", "InjuryOrPoisoning", "MedicalDevice","Treatment","Employment"
]

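# Upper-case the labels to match setLabelCasing('upper'), as done for ner_jsl_entities
ner_vop_entities = [a.upper() for a in ner_vop_entities]

# voice of the patient (vop)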
vop_ner = MedicalNerModel.pretrained("ner_vop", "en", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner_vop")\
    .setLabelCasing('upper')

vop_ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_vop"])\
    .setOutputCol("ner_vop_chunk")\
    .setWhiteList(ner_vop_entities)



ner_vop download started this may take some time.
Approximate size to download 3.7 MB
Download done! Loading the resource.
[OK!]


🔎 We will download the `ner_posology_langtest` NER model and whitelist the labels that can be used as risk factors.

In [13]:
ner_posology_entities = ['DRUG']

# posology
posology_ner = MedicalNerModel.pretrained("ner_posology_langtest","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_posology")\
    .setLabelCasing('upper')

posology_ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_posology"])\
    .setOutputCol("ner_posology_chunk")\
    .setWhiteList(ner_posology_entities)

ner_posology_langtest download started this may take some time.
Approximate size to download 2.7 MB
Download done! Loading the resource.
[OK!]


🔎 We will download the `ner_deid_generic_augmented` NER model and whitelist the labels that can be used as risk factors.

In [14]:
ner_deid_entities = ["PROFESSION", "AGE"]

# deidentification - Profession and Age labels only
deid_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented","en","clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner_deid")\
    .setLabelCasing('upper')

deid_ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_deid"])\
    .setOutputCol("ner_deid_chunk")\
    .setWhiteList(ner_deid_entities)


ner_deid_generic_augmented download started this may take some time.
Approximate size to download 13.8 MB
Download done! Loading the resource.
[OK!]


In [15]:
with open('replace_dict.csv', 'w') as f:
    f.write("""SMOKING,SMOKER
CAD,DISEASE
CEREBROVASCULAR_DISEASE,DISEASE
DIABETES,DISEASE
DISEASE_SYNDROME_DISORDER,DISEASE
HEART_DISEASE,DISEASE
HYPERLIPIDEMIA,DISEASE
HYPERTENSION,DISEASE
INJURY_OR_POISONING,DISEASE
KIDNEY_DISEASE,DISEASE
MENTAL_HEALTH,DISEASE
OBESE,DISEASE
OBESITY,DISEASE
ONCOLOGICAL,DISEASE
OTHER_DISEASE,DISEASE
OVERWEIGHT,DISEASE
EKG_FINDINGS,DISEASE
IMAGINGFINDINGS,DISEASE
VS_FINDING,DISEASE
DRUG_INGREDIENT,DRUG
DRUG_BRANDNAME,DRUG
MEDICATION,DRUG
SUBSTANCE_USE,SUBSTANCE
EMPLOYMENT,PROFESSION
MENTAL_HEALTH,PSYCHOLOGICAL_CONDITION
PSYCHOLOGICALCONDITION,PSYCHOLOGICAL_CONDITION
PROBLEM,DISEASE
""")

chunk_merger = ChunkMergeApproach()\
    .setInputCols(  "ner_risks_chunk", "ner_sdoh_chunk", "ner_vop_chunk", "ner_posology_chunk", "ner_deid_chunk","ner_jsl_chunk",)\
    .setOutputCol('ner_chunk_merged')\
    .setOrderingFeatures(["ChunkLength"])\
    .setSelectionStrategy("DiverseLonger")\
    .setReplaceDictResource('replace_dict.csv',"text", {"delimiter":","})
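To make the role of the replace dictionary concrete, here is a minimal plain-Python sketch of label harmonisation using a few rows copied from `replace_dict.csv`. It only illustrates the mapping idea; it is not how `ChunkMergeApproach` is implemented, and the helper names `load_label_map` and `normalize_label` are hypothetical. In this sketch a later duplicate key overwrites an earlier one, which is a property of Python dictionaries and is not claimed for the annotator.

```python
import csv
import io

# A few rows copied from replace_dict.csv above (source label, target label).
replace_csv = """SMOKING,SMOKER
CAD,DISEASE
DIABETES,DISEASE
MEDICATION,DRUG
EMPLOYMENT,PROFESSION
PSYCHOLOGICALCONDITION,PSYCHOLOGICAL_CONDITION
"""


def load_label_map(csv_text):
    """Parse 'source,target' rows into a dict, skipping malformed rows."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) == 2}


def normalize_label(label, label_map):
    """Return the harmonised label, or the label itself when no rule applies."""
    return label_map.get(label, label)


label_map = load_label_map(replace_csv)
normalized = [normalize_label(label, label_map) for label in ["CAD", "SMOKING", "ALCOHOL"]]
print(normalized)
```

Labels without a rule, such as `ALCOHOL` here, pass through unchanged, so the different NER models end up sharing one label vocabulary.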

🔎 We will use `ChunkFilterer` to filter out meaningless chunks.

In [16]:
chunk_blacklist = ["McDs", "the urge", "weed", "his smoking spot", "I miss", "super crazy", 
                   "many positive changes","Anythoughts", "Wife's brain tumour", "too shy", 
                   "a misery","would've","genetic condition","therapist", "therapists", 
                   "doctor", "doctors", "Doctors","Dr",  "doc", "Doctor", "pyschiatrist", "psyc",  
                 "psychologist", "psych", "&nbsp", "stupidest", "best jobs", "real doctor", 
                   "ER", "RAN", "EVER", "ER doctor", "Ran", "residents", "ghetto", "Army", 
                 "grandmama", "American Doctors", "grandpa", "HS diploma", "solo", "spycology", 
                   "pediatric psychiatrist", "computer shop", "pubs", "you‚Äôre fine now‚Äô", 
                   "psychiatrist", "army", "Sir", "Sorry officer", "dragon prince", "shoplift",
                   "friend's house", "George", "pet George", "regular therapist", "surgeon", 
                   "pediatrician", "coworkers", "radiation technicians", "techs", "study nurse", 
                   "a cat" ,"problems","Home Depot","fall","doesn't","Mr Bean", "friend's",
                   " tofu He's","soy marinero","Symptoms", "therapist", "therapists", "doctor", 
                   "doctors", "Doctors","Dr",  "doc", "Doctor","DOCTOR", "pyschiatrist", "psyc",  
                 "psychologist", "psych", "phn", "stupidest", "dispo","good job","gp ","sniff","grey",
                 "real doctor", "gyno", "RAN", "EVER", "ER doctor", "Ran", "residents", "Army", 
                 "grandmama", "American Doctors", "grandpa", "HS diploma", "solo", "spycology", 
                   "pediatric psychiatrist", " gp", "pubs", "psychiatrist", "army", "lockdown,when", 
                   "Sorry officer", "working in my home","shoplift", "friend's house", "George", 
                   "VPN", "regular therapist", "surgeon", "pediatrician", "granny", 
                 "lord", "metals testing lab", "PC", "movies","crack","gaming addiction","spirits",
                   "spotify", "S", "TED"
                ]

chunk_filterer = ChunkFilterer()\
    .setInputCols("sentence", "ner_chunk_merged")\
    .setOutputCol("ner_chunk")\
    .setCriteria("isin")\
    .setBlackList(chunk_blacklist)
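The blacklist idea can be sketched in plain Python as follows. This assumes exact, case-sensitive matching on the chunk text, which is how the `isin` criterion is used here; the `filter_chunks` helper and the toy chunks are hypothetical and are not part of the Spark NLP API.

```python
def filter_chunks(chunks, blacklist):
    """Drop chunks whose text exactly matches a blacklisted string (case-sensitive)."""
    blocked = set(blacklist)  # a set also collapses repeated entries
    return [chunk for chunk in chunks if chunk["text"] not in blocked]


# Made-up chunks, for illustration only.
toy_chunks = [
    {"text": "doctor", "entity": "PROFESSION"},
    {"text": "alcoholism", "entity": "DISEASE"},
    {"text": "Doctor", "entity": "PROFESSION"},
    {"text": "nurse", "entity": "PROFESSION"},
]
toy_blacklist = ["doctor", "doctors", "doctor"]

kept = filter_chunks(toy_chunks, toy_blacklist)
print([chunk["text"] for chunk in kept])
```

Under exact matching, `"Doctor"` survives a blacklist that only contains `"doctor"`, which is why the list above carries several casing variants of the same word.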

In [17]:

# Family Health History
chunk_merger_disease = ChunkMergeApproach()\
    .setInputCols("ner_chunk",)\
    .setOutputCol('ner_chunk_disease')\
    .setWhiteList(["DISEASE"])

clinical_assertion_disease = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk_disease", "embeddings"]) \
    .setOutputCol("assertion_disease")

assertion_filterer_disease = AssertionFilterer()\
    .setInputCols("sentence","ner_chunk_disease","assertion_disease")\
    .setOutputCol("assertion_filtered_family")\
    .setWhiteList(["Family"])


assertion_jsl_augmented download started this may take some time.
Approximate size to download 6.2 MB
Download done! Loading the resource.
[OK!]


In [18]:
# Status of Alcohol, Tobacco and Substance Behaviours
chunk_merger_behaviour = ChunkMergeApproach()\
    .setInputCols( "ner_chunk")\
    .setOutputCol('ner_chunk_behaviour')\
    .setWhiteList(["ALCOHOL", "SUBSTANCE", "SMOKER"])

clinical_assertion_behaviour = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk_behaviour", "embeddings"]) \
    .setOutputCol("assertion_behaviour")

assertion_filterer_behaviour = AssertionFilterer()\
    .setInputCols("sentence","ner_chunk_behaviour","assertion_behaviour")\
    .setOutputCol("assertion_filtered_behaviour")\
    .setWhiteList(["Present","Past"])

assertion_jsl_augmented download started this may take some time.
[OK!]
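The two assertion filters above follow the same pattern: restrict the chunks to a set of entity labels, then keep only those whose assertion status is whitelisted. A minimal plain-Python sketch of that pattern, with made-up chunks and a hypothetical `filter_by_assertion` helper (not the `AssertionFilterer` implementation), could look like this:

```python
def filter_by_assertion(chunks, entity_whitelist, status_whitelist):
    """Keep chunks whose entity and assertion status are both whitelisted."""
    return [
        chunk for chunk in chunks
        if chunk["entity"] in entity_whitelist and chunk["assertion"] in status_whitelist
    ]


# Made-up chunks with illustrative assertion statuses.
toy_chunks = [
    {"text": "smoking", "entity": "SMOKER", "assertion": "Present"},
    {"text": "alcohol", "entity": "ALCOHOL", "assertion": "Absent"},
    {"text": "cocaine", "entity": "SUBSTANCE", "assertion": "Past"},
    {"text": "diabetes", "entity": "DISEASE", "assertion": "Family"},
]

# Same whitelists as the behaviour and family-history stages above.
behaviour = filter_by_assertion(toy_chunks, {"ALCOHOL", "SUBSTANCE", "SMOKER"}, {"Present", "Past"})
family = filter_by_assertion(toy_chunks, {"DISEASE"}, {"Family"})

print([chunk["text"] for chunk in behaviour])
print([chunk["text"] for chunk in family])
```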


### pipeline

In [19]:
ner_pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    risk_factors_ner,
    risk_factors_ner_converter,
    jsl_ner,
    jsl_ner_converter,
    sdoh_ner,
    sdoh_ner_converter,
    vop_ner,
    vop_ner_converter,
    posology_ner,
    posology_ner_converter,
    deid_ner,
    deid_ner_converter,
    
    chunk_merger,
    chunk_filterer,
    
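    # Assertion stages: their outputs (assertion_filtered_family and
    # assertion_filtered_behaviour) are read by the export cells below
    chunk_merger_disease,
    clinical_assertion_disease,
    assertion_filterer_disease,

    chunk_merger_behaviour,
    clinical_assertion_behaviour,
    assertion_filterer_behaviour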
])

pipeline_model = ner_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = pipeline_model.transform(df).cache()

In [20]:
from sparknlp_display import NerVisualizer
from IPython.display import display, HTML

lmodel= LightPipeline(pipeline_model)

light_result = lmodel.fullAnnotate(df.select("text").take(20)[4]["text"])

visualiser = NerVisualizer()
ner_vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', return_html=True)
display(HTML(ner_vis))

                                                                                

In [None]:
ASSERTION_BEHAVIOUR = result.select("text_id","text","gender" ,
                          F.explode(F.arrays_zip(result.assertion_filtered_behaviour.result,
                                                 result.assertion_filtered_behaviour.metadata)).alias("cols"))\
                 .select("text_id","text","gender" ,
                          F.expr("cols['0']").alias("ner_chunk"),
                          F.expr("cols['1']['entity']").alias("entity"),
                          F.expr("cols['1']['assertion']").alias("ASSERTION_BEHAVIOUR")).toPandas()

ASSERTION_BEHAVIOUR.to_csv("ASSERTION_BEHAVIOUR.csv",index=False)



In [None]:
ASSERTION_DISEASE = result.select("text_id","text","gender" ,
                          F.explode(F.arrays_zip(result.assertion_filtered_family.result,
                                                 result.assertion_filtered_family.metadata)).alias("cols"))\
                 .select("text_id","text","gender" ,
                          F.expr("cols['0']").alias("ner_chunk"),
                          F.expr("cols['1']['entity']").alias("entity"),
                          F.expr("cols['1']['assertion']").alias("ASSERTION_DISEASE")).toPandas()

ASSERTION_DISEASE.to_csv("ASSERTION_DISEASE.csv",index=False)

In [None]:
result_pd = result.select("text_id","text","gender" ,
                          F.explode(F.arrays_zip(result.ner_chunk.result,
                                                 result.ner_chunk.metadata)).alias("cols"))\
                 .select("text_id","text","gender" ,
                          F.expr("cols['0']").alias("ner_chunk"),
                          F.expr("cols['1']['entity']").alias("entity")).toPandas()

result_pd.to_csv("result_pd.csv",index=False)

result_pd.tail()

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)

result_pd = pd.read_csv("result_pd.csv")

#result_pd = result_pd.join(data, on=["text_id"], how="left",rsuffix = "_r").drop("text_id_r",axis=1)

#result_pd = result_pd[["text_id","text","gender","ner_chunk","entity"]]

result_pd

In [None]:
result_pd.entity.value_counts()

In [None]:
result_pd.columns

In [None]:
# Group by 'text_id', 'text', 'gender' and 'entity', then aggregate 'ner_chunk' into lists
result_df_slim = result_pd.groupby(['text_id', 'text', 'gender', 'entity'])['ner_chunk'].apply(list).reset_index()

# Pivot so that each entity becomes a column holding the list of chunks for that text
result_pivot = result_df_slim.pivot(index=['text_id', 'text', 'gender'], columns='entity', values='ner_chunk')

# Fill missing entity cells with an empty string
result_pivot.fillna(value="", inplace=True)

# Reset the index so 'text_id', 'text' and 'gender' become regular columns
result_pivot.reset_index(inplace=True)
# Treat an "Unknown" gender as missing
result_pivot = result_pivot.replace("Unknown","")
result_pivot.tail(3)
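The group-then-pivot step can be hard to picture, so here is a plain-Python sketch of the same reshaping on made-up records. It is an illustration of the idea only; the `pivot_chunks` helper and the toy rows are hypothetical, and the notebook itself does this with pandas.

```python
from collections import defaultdict

# Long format: one (text_id, entity, ner_chunk) record per extracted chunk (made-up values).
long_rows = [
    (0, "DISEASE", "asthma"),
    (0, "DISEASE", "anxiety"),
    (0, "DRUG", "albuterol"),
    (1, "SMOKER", "smoking"),
]


def pivot_chunks(rows):
    """Turn long records into {text_id: {entity: [chunks] or ""}}."""
    entities = sorted({entity for _, entity, _ in rows})
    grouped = defaultdict(lambda: defaultdict(list))
    for text_id, entity, chunk in rows:
        grouped[text_id][entity].append(chunk)
    # Entities missing for a text become "" to mirror the fillna("") step.
    return {
        text_id: {entity: (cols[entity] if entity in cols else "") for entity in entities}
        for text_id, cols in grouped.items()
    }


wide = pivot_chunks(long_rows)
print(wide[0])
print(wide[1])
```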

## Correlation Table

In [None]:
corr = pd.DataFrame(columns=result_pivot.columns[2:], index=result_pivot.columns[2:])
for col_left in result_pivot.columns[2:]:
    #print(col_left)
    for col_right in result_pivot.columns[2:]:
        corr.loc[col_left ,col_right ] = len(result_pivot[(result_pivot[col_left]!="") & (result_pivot[col_right]!="")])
        
corr.columns.name=None
corr.index.name=None

corr.to_csv("corr_table.csv")

corr
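The table built above holds co-occurrence counts: each cell is the number of texts in which both columns are non-empty. A minimal plain-Python sketch of the same counting on made-up rows (the `cooccurrence_counts` helper is hypothetical) shows how the cells are filled:

```python
def cooccurrence_counts(rows, columns):
    """counts[a][b] = number of rows where both column a and column b are non-empty."""
    counts = {a: {b: 0 for b in columns} for a in columns}
    for row in rows:
        present = [column for column in columns if row.get(column, "") != ""]
        for a in present:
            for b in present:
                counts[a][b] += 1
    return counts


# Made-up pivoted rows: "" marks an entity that was not found in the text.
toy_rows = [
    {"DISEASE": ["lung cancer"], "SMOKER": ["smoking"], "DRUG": ""},
    {"DISEASE": ["asthma"], "SMOKER": "", "DRUG": ["albuterol"]},
    {"DISEASE": ["copd"], "SMOKER": ["smoker"], "DRUG": ""},
]

counts = cooccurrence_counts(toy_rows, ["DISEASE", "SMOKER", "DRUG"])
print(counts["DISEASE"])
```

The diagonal of such a table is simply the number of texts that contain the entity at all, and the table is symmetric, so only one triangle carries independent information.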

In [None]:
for i, row in corr.iterrows():
    print(i)
    print(corr.loc[i].sort_values(ascending=False)[:10])
    print("\n\n\n")
    

In [None]:
def plot_bar_graph(main_df: pd.DataFrame, col1: str, col2: str, replacement_dict: dict, min_count: int = 3) -> None:
    
    df = main_df.copy()

    # Replacing empty strings with NaNs
    df = df[[col1, col2]].applymap(lambda x: np.nan if x == "" else x)

    # If the column data is a string representation of a list (e.g. after reading a CSV), convert it to a list
    df[col1] = df[col1].apply(lambda x: literal_eval(x) if isinstance(x, str) and x.startswith("[") else x)
    df[col2] = df[col2].apply(lambda x: literal_eval(x) if isinstance(x, str) and x.startswith("[") else x)

    # Check if any value in the columns is a boolean and convert to string if so
    if df[col1].apply(isinstance, args=(bool,)).any():
        df[col1] = df[col1].astype(str)
    if df[col2].apply(isinstance, args=(bool,)).any():
        df[col2] = df[col2].astype(str)
        
    
    # Filtering the dataframe for rows that contain both column 1 and column 2 elements
    # (.copy() so the assignments below do not write into a view of df)
    df_filtered = df[(df[col1].apply(lambda x: len(x) if isinstance(x, list) else 0) > 0) & 
                     (df[col2].apply(lambda x: len(x) if isinstance(x, list) else 0) > 0)].copy()

    # Applying the replacement mapping
    df_filtered[col1] = df_filtered[col1].apply(lambda lst: [replacement_dict.get(item, item) for item in lst])
    df_filtered[col2] = df_filtered[col2].apply(lambda lst: [replacement_dict.get(item, item) for item in lst])

    # Getting the top 3 elements from column 1
    col1_counts = Counter([item for sublist in df_filtered[col1] for item in sublist])
    top_col1 = [item for item, _ in col1_counts.most_common(3)]

    # Creating a dictionary to hold co-occurrence counts
    co_occurrences = {}
    for _, row in df_filtered.iterrows():
        for col1_item in row[col1]:
            if col1_item in top_col1:
                for col2_item in row[col2]:
                    co_occurrences.setdefault(col1_item, {}).setdefault(col2_item, 0)
                    co_occurrences[col1_item][col2_item] += 1

    # Get the 3 most common co-occurrences for each top column 1 element
    co_occurrences_most_common = []
    for col1_item, sub_dict in co_occurrences.items():
        co_occurrences_most_common.extend([(col1_item, col2_item, count) for col2_item, count in Counter(sub_dict).most_common(3)])

    # Sort the co-occurrences by count
    co_occurrences_most_common.sort(key=lambda x: x[2], reverse=True)

    # Filter out co-occurrences with a count less than min_count
    co_occurrences_most_common = [item for item in co_occurrences_most_common if item[2] >= min_count]

    # Creating the bar plot
    labels = [f"{item[0]} - {item[1]}" for item in co_occurrences_most_common]
    counts = [item[2] for item in co_occurrences_most_common]

    plt.figure(figsize=(11, 7))
    plt.barh(labels, counts, color='skyblue')
    plt.xlabel("Count")
    plt.ylabel("Co-occurrences")
    plt.title("Top co-occurrences in data")
    plt.gca().invert_yaxis()  # reverse the order of the y-axis
    plt.show()

In [None]:
# Replacement dictionary
replacement_dict = {
        'drink': 'drink',
        'Drinking': 'drink',
        'drank': 'drink',
        'drinks': 'drink',
        'drunk': 'drink',
        'drink coffee' : "coffee",
        'drinking': 'drink',
        "drinkers" : "drink",
        'Alcohol': 'alcohol',
        "Alcohol's": 'alcohol',
        "alcoholic" : 'alcohol',
        "alcoholics" : 'alcohol',
        "beer" : 'alcohol',
        'Cancer': 'cancer',
        'cancer.': 'cancer',
        "Cancer's": 'cancer',
        "my depresion": "depresion",
        "Addiction" : "addiction",
        "addictions" : "addiction",
        "addicted" : "addiction",
        "kids" : "kids",
        "doctors" : "doctor",
        "Officer" : "officer",
        "class teacher": "teacher",
        "teachers": "teacher",
        "Cop" : "police officer",
        "smoke": "smoking",
        "smokers": "smoking"
    }
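The replacement dictionary collapses surface variants of a chunk into one canonical term before counting. The sketch below applies a handful of entries copied from the dictionary above to made-up chunks; the `normalize_chunks` helper is hypothetical and mirrors the `replacement_dict.get(item, item)` lookup used in `plot_bar_graph`.

```python
from collections import Counter

# A few entries copied from replacement_dict above.
toy_replacements = {
    "Drinking": "drink",
    "drank": "drink",
    "beer": "alcohol",
    "smokers": "smoking",
}


def normalize_chunks(chunks, replacements):
    """Map each chunk to its canonical form; unknown chunks are kept as they are."""
    return [replacements.get(chunk, chunk) for chunk in chunks]


# Made-up chunks, for illustration only.
toy_chunks = ["Drinking", "drank", "beer", "smokers", "smoking", "anxiety"]

term_counts = Counter(normalize_chunks(toy_chunks, toy_replacements))
print(term_counts.most_common())
```

Without this step, `"Drinking"` and `"drank"` would be counted as separate terms and their co-occurrences would be split across several bars in the plot.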

In [None]:
import numpy as np