# 7PAVREPR ASMHI Research Project
## Research Project Code
## AF37930

# Task 1
## Construction of a Social History-Enriched Discharge Summary Dataset from MIMIC-III

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 1.0 Import packages we need for Part One

In [None]:
import pandas as pd
import medspacy
from medspacy.section_detection import SectionRule
from spacy.language import Language

from termcolor import colored
from IPython.display import display, HTML

### 1.1 Load the Datasets (MIMIC-III Notes & MIMIC-SBDH)

In [None]:
data_notes = pd.read_csv('/content/drive/MyDrive/MIMIC/NOTEEVENTS.csv.gz', compression='gzip', low_memory=False)
data_notes.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CHARTTIME,STORETIME,CATEGORY,DESCRIPTION,CGID,ISERROR,TEXT
0,174,22532,167853.0,2151-08-04,,,Discharge summary,Report,,,Admission Date: [**2151-7-16**] Dischar...
1,175,13702,107527.0,2118-06-14,,,Discharge summary,Report,,,Admission Date: [**2118-6-2**] Discharg...
2,176,13702,167118.0,2119-05-25,,,Discharge summary,Report,,,Admission Date: [**2119-5-4**] D...
3,177,13702,196489.0,2124-08-18,,,Discharge summary,Report,,,Admission Date: [**2124-7-21**] ...
4,178,26880,135453.0,2162-03-25,,,Discharge summary,Report,,,Admission Date: [**2162-3-3**] D...


In [None]:
data_SBDHlabels = pd.read_csv('/content/drive/MyDrive/MIMIC/MIMIC-SBDH.csv', low_memory=False)
data_SBDHlabels.head()

In [None]:
# Get the number of rows / records in the MIMIC-SBDH label dataset we use
len(data_SBDHlabels)

7025

The result shows that the MIMIC-SBDH dataset we selected contains 7,025 annotated "social history" records.

### 1.2 Merging MIMIC III Notes with SBDH Labels using row_id

In this MIMIC-SBDH dataset, two CSV files are publicly available:

① One file contains 8 annotated social and behavioural determinants of health labels, with each input text corresponding to one label per category.

② The other file provides the start and end positions of the annotated keywords within the text, which enables precise localization of these keywords. 

##### Here I used the first MIMIC-SBDH label file to merge the MIMIC-III notes with the SBDH labels on row_id

In [None]:
# Rename the 'ROW_ID' column in the original MIMIC-III dataset to lowercase for merging
data_notes.rename(columns={'ROW_ID': 'row_id'}, inplace=True)

# Select only 'row_id' and 'TEXT' columns and merge with SBDH labels to match each label with its corresponding clinical note
note_texts = data_notes[['row_id', 'TEXT']]
data_SBDH_full = pd.merge(data_SBDHlabels, note_texts, on='row_id', how='left')

print(data_SBDH_full.head())
# Get the number of rows (records) in the MIMIC-SBDH dataset we use
len(data_SBDH_full)
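A left merge silently keeps labels whose `row_id` has no matching note (their `TEXT` becomes NaN) and duplicates rows if a `row_id` is repeated. The sketch below, on toy data with made-up values, shows how the `validate` and `indicator` parameters of `pd.merge` can surface both problems. It is an illustration under those assumptions, not part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for the label table and the notes table (values are made up)
labels = pd.DataFrame({"row_id": [5, 42, 136], "behavior_tobacco": [1, 2, 4]})
notes = pd.DataFrame({"row_id": [5, 42], "TEXT": ["note five", "note forty-two"]})

# validate="one_to_one" raises MergeError if row_id is duplicated on either side;
# indicator=True adds a "_merge" column telling where each row came from
merged = pd.merge(labels, notes, on="row_id", how="left",
                  validate="one_to_one", indicator=True)

# Labels with no matching note are marked "left_only" and have a missing TEXT
unmatched = merged.loc[merged["_merge"] == "left_only", "row_id"].tolist()
print(f"{len(unmatched)} label rows without a note: {unmatched}")
```

On the real tables the same check would be applied to `data_SBDHlabels` and `note_texts` before the merged frame is passed to the sectionizer.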

##### Have a look at the full discharge summary text

In [None]:
print(data_SBDH_full.head(5))

   row_id  sdoh_community_present  sdoh_community_absent  sdoh_education  \
0       5                       0                      0               0   
1      42                       0                      0               0   
2     136                       1                      0               0   
3     442                       1                      1               0   
4     328                       1                      0               0   

   sdoh_economics  sdoh_environment  behavior_alcohol  behavior_tobacco  \
0               0                 0                 0                 1   
1               0                 0                 0                 2   
2               2                 1                 3                 4   
3               0                 1                 3                 1   
4               2                 1                 3                 3   

   behavior_drug                                               TEXT  
0              0  Admi

In [None]:
print(data_SBDH_full.loc[0, 'TEXT'])

Admission Date:  [**2190-5-16**]     Discharge Date:  [**2190-5-22**]

Date of Birth:   [**2139-4-22**]     Sex:  F

Service:  CARDIOTHORACIC

HISTORY OF PRESENT ILLNESS:  This 51 year-old female was
admitted to an outside hospital with chest pain and ruled in
for myocardial infarction.  She was transferred here for a
cardiac catheterization.

PAST MEDICAL HISTORY:  Hypertension, fibromyalgia,
hypothyroidism, NASH and noninsulin dependent diabetes.

PAST SURGICAL HISTORY:  Hysterectomy and cholecystectomy.

SOCIAL HISTORY:  She smokes a pack per day.

MEDICATIONS ON ADMISSION:  Hydrochlorothiazide, Alprazolam,
Ursodiol and Levoxyl.

She was hospitalized with Aggrastat, nitroglycerin and
heparin as she ruled in for myocardial infarction.

ALLERGIES:  No known drug allergies.

Cardiac catheterization showed left anterior descending
coronary artery diagonal 80% lesion, circumflex 90% lesion
and 90% lesion of the right coronary artery with a normal
ejection fraction.  She was transferred f

In [None]:
example2 = data_SBDH_full[data_SBDH_full['row_id'] == 1169]
print(example2['TEXT'].values[0])

Admission Date:  [**2140-8-25**]              Discharge Date:   [**2140-9-20**]

Date of Birth:  [**2102-2-3**]             Sex:   M

Service: MEDICINE

Allergies:
Penicillins

Attending:[**First Name3 (LF) 5129**]
Chief Complaint:
Severe Pancreatitis

Major Surgical or Invasive Procedure:
Placement of left IJ CVL
Intubation
Mechanical Ventilation


History of Present Illness:
This is a 38yo M with h/o paranoid schizophrenia who initially
presented to [**Hospital6 7472**] in [**Location (un) 5583**] on [**8-18**]
with new DKA and then developed severe pancreatitis/bandemia
with possible HCAP with bilat lower lobe infiltrates s/p
intubation now transferred here for further management.  Pt was
intially treated for DKA however bc of persistent abdominal pain
and fevers, a CT scan was obtained on [**8-19**] which revealed
finsing consitent with pancreatitis.  On [**8-20**], pt was found to
be obtunded and O2sats were in mid80s so pt was intubated for
airway protection and presumed aspirati

### 1.3 Extracting "Social History" from Merged Dataset

The social history section was extracted from each discharge summary using MedSpaCy’s clinical sectionizer, which performs pattern-based section extraction.

In [None]:
pip install medspacy

Collecting medspacy
  Downloading medspacy-1.3.1.tar.gz (244 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting spacy<3.8 (from medspacy)
  Downloading spacy-3.7.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting PyRuSH>=1.0.8 (from medspacy)
  Downloading PyRuSH-1.0.9-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting pysbd==0.3.4 (from medspacy)
  Downloading pysbd-

Here MedSpaCy's sectionizer scans the text and applies predefined rules to identify the sections. These built-in rules allowed me to quickly extract the 'social history' sections without first creating custom rules.

However, section headers in MIMIC-III clinical documents vary considerably. The built-in rules do not recognise all of these header variants, so the section that follows the social history is sometimes not detected, which results in over-extraction of the Social History section.

There are several kinds of variation in the 'social history' header and in the titles of the surrounding sections. For example:

①  Social history header: SHx, SOC HX, SOCIAL HX, SOCIAL HISTORY

②  Physical exam: PHYSICAL EXAMINATION ON PRESENTATION, PHYSICAL EXAMINATION AT TIME OF TRANSFER, ...

③  Laboratory: LABORATORY VALUES UPON ADMISSION, LABORATORIES ON ADMISSION


These variations were identified by manually reviewing all 7,025 extracted social history sections and looking for implausible extractions.



While checking the extracted results in separate batches, I primarily employed two strategies:

1) To address over-extraction:

   I added custom rules covering the variations listed above so that unconventional or abbreviated section titles are correctly identified and the social history section ends where the next section begins.
   
   At the same time, I excluded titles that were incorrectly categorised as social_history, such as PSH, PSHx and similar variations.

2) To address under-extraction:

   I manually completed the Social History content that was cut off prematurely. This can happen when 'Past History' or a similar phrase appears in the text, presumably because it is matched as a section header.
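To make the over-extraction problem concrete, the following is a minimal sketch that uses plain regular expressions rather than MedSpaCy. A section runs until the next recognised header, so an unrecognised header variant lets the following section leak into the social history. The header lists and the helper `extract_social_history` are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical header lists: only headers listed here can end a section
BASIC_HEADERS = [r"SOCIAL HISTORY:", r"PHYSICAL EXAMINATION:"]
EXTENDED_HEADERS = BASIC_HEADERS + [r"PHYSICAL EXAMINATION ON PRESENTATION:"]

def extract_social_history(text, headers):
    """Return the text between a social history header and the next known header."""
    header_re = re.compile("|".join(headers), flags=re.IGNORECASE)
    matches = list(header_re.finditer(text))
    for i, m in enumerate(matches):
        if m.group().upper().startswith("SOCIAL HISTORY"):
            # The section ends at the next recognised header, or at the end of the note
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            return text[m.end():end].strip()
    return ""

note = (
    "SOCIAL HISTORY: The patient lives alone and smokes.\n"
    "PHYSICAL EXAMINATION ON PRESENTATION: Blood pressure 120/80.\n"
)

basic = extract_social_history(note, BASIC_HEADERS)        # header variant not recognised
extended = extract_social_history(note, EXTENDED_HEADERS)  # header variant recognised
print(repr(basic))
print(repr(extended))
```

With the basic list the physical examination text is swallowed into the social history. Adding the variant to the list restores the correct boundary, which is the effect the custom `SectionRule` entries aim for.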

In [None]:
import medspacy
from medspacy.section_detection import SectionRule
from spacy.language import Language
# Option 1: Load medspacy model
nlp = medspacy.load()

# Built-in rules in the Medspacy
@Language.factory("sectionizer36")
def sectionizer36(nlp, name):
    from medspacy.section_detection import Sectionizer
    return Sectionizer(nlp)

nlp.add_pipe("sectionizer36", last=True)
sectionizer = nlp.get_pipe("sectionizer36")

# Define new custom rules
add_rules = [
    SectionRule("Social History", "social_history", "Social History:"),
    SectionRule("SOCIAL HISTORY", "social_history", "SOCIAL HISTORY:"),
    SectionRule("SOCIAL HX", "social_history", "SOCIAL HX:"),
    SectionRule("SOC HX", "social_history", "SOC HX:"),
    SectionRule("SHx", "social_history", "SHx:"),

    SectionRule("LABORATORY", "labs", "LABORATORY:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "PHYSICAL EXAMINATION ON PRESENTATION:"),
    SectionRule("PHYSICAL EXAM TRANSFER", "physical_exam", "PHYSICAL EXAMINATION AT TIME OF TRANSFER:"),
    SectionRule("PHYSICAL EXAM ADMISSION", "physical_exam", "PHYSICAL EXAMINATION ON ADMISSION:"),
    SectionRule("REVIEW OF SYSTEMS", "review_of_systems", "REVIEW OF SYSTEMS:"),
    SectionRule("LABORATORY DATA", "labs", "LABORATORY DATA:"),
    SectionRule("EXAM ON ADMISSION", "physical_exam", "EXAM ON ADMISSION:"),
    SectionRule("PHYSICAL EXAM ADMISSION", "physical_exam", r"(?i)^Physical examination on admission\b"),
    SectionRule("Hospital Course", "hospital_course", r"^HO\[\*\*.*?\*\*\] COURSE:"),
    SectionRule("LABORATORY", "labs", "LABORATORY VALUES UPON ADMISSION:"),
    SectionRule("LABORATORY", "labs", "LABORATORIES:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "Exam:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "EXAMINATION ON TRANSFER TO THE MEDICAL WARDS:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "PHYSICAL EXAMINATION AT DISCHARGE:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "INITIAL EXAM IN THE EMERGENCY DEPARTMENT:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "PHYSICAL EXAMINATION ON ADMISSION TO CCU:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "PHYSICAL EXAM ON TRANSFER:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "PHYSICAL EXAM UPON ADMISSION:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "PHYSICAL EXAM ON PRESENTING TO THE EMERGENCY DEPARTMENT:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "PHYSICAL EXAMINATION AT THE TIME OF ADMISSION:"),
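    # The pattern argument is a regular expression, so the de-identification brackets are escaped
    SectionRule("PHYSICAL EXAM", "physical_exam", r"PHYSICAL EXAMINATION ON ADMISSION TO \[\*\*Hospital \*\*\] \[\*\*Hospital \*\*\] MEDICAL CENTER:"),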
    SectionRule("LABORATORY", "labs", "LABORATORIES ON ADMISSION:"),
    SectionRule("Course", "course", r"EMERGENCY DEPARTMENT COURSE:"),
    SectionRule("PHYSICAL EXAM", "physical_exam", "MEASURES AT BIRTH:"),


    SectionRule("PSH", "past_surgical_history", r"^PSH:"),
    SectionRule("PSHx", "past_surgical_history", r"^PSHx:"),
]

sectionizer.add(add_rules) # Add these new rules to the sectionizer to extract the 'social history' section correctly


def extract_SH_sections(doc):
    extracted_result = []

    for section in doc._.sections:
        # If the current section contains relevant social history, extract the corresponding text content from the document
        if section.category == "social_history":
            title=doc[section.title_span[0]:section.title_span[1]].text.strip() if section.title_span else ""

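            # Skip past-surgical-history headers (PSH / PSHx) that were mislabelled as social_history;
            # the title is upper-cased first, so only upper-case variants need to be listed
            if title.upper() not in {"PSH", "PSH:", "PSHX", "PSHX:"}:
                social_history_content=doc[section.body_start:section.body_end].text.strip()
                # Keep the section only if the extracted text is not empty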
                if social_history_content != "":
                    extracted_result.append(social_history_content)

    return "\n\n".join(extracted_result)

##### Test text under these specific header variations to see whether the rules work correctly

In [None]:
# Have a test
text= """
Past Medical History:
PMH:
HTN
GERD
Osteoarthritis affecting lower back
Left elbow tendonitis
.
PSH:
Right inguinal hernia repair in childhood
Cervical discectomy 3 years ago
Umbilical hernia repair [**2137**]

Social History:
SHx: Retired schoolteacher, now substitutes. Lives with wife in
[**Location (un) 1439**]. Has a 27 yo son and a 25 yo daughter. [**Name (NI) **] past or present
smoking hx, no EtOH

Family History:
Father had a fatal MI age 86.
"""
text2 = """
SOCIAL HISTORY:  Mr. [**Known lastname 32142**] is a retired optometrist. He is a
one and a half pack per day smoker for the past 55 years. He
has a past history of heavy alcohol abuse but now drinks
occasionally.

PAST HISTORY: Patient has hypertension and diabetes.
"""
doc = nlp(text2)
social_history2=extract_SH_sections(doc)

print("Extracted Social History:\n")
print(social_history2)

Extracted Social History:

Mr. [**Known lastname 32142**] is a retired optometrist. He is a
one and a half pack per day smoker for the past 55 years. He
has a


In [None]:
# Have another test of the new rules we defined
print(nlp.pipe_names)

for rule in sectionizer.rules:
    print(rule.category)

text = """
SOCIAL HISTORY: The patient lives alone and smokes.
PHYSICAL EXAMINATION ON PRESENTATION: Blood pressure 120/80.
"""

doc1 = nlp(text)
print(extract_SH_sections(doc1))

['medspacy_pyrush', 'medspacy_target_matcher', 'medspacy_context', 'sectionizer36']
addendum
addendum
allergy
allergy
allergies
chief_complaint
chief_complaint
chief_complaint
comments
patient_education
diagnoses
diagnoses
diagnoses
diagnoses
diagnoses
diagnoses
diagnoses
diagnoses
family_history
family_history
history_of_present_illness
history_of_present_illness
history_of_present_illness
history_of_present_illness
history_of_present_illness
history_of_present_illness
history_of_present_illness
history_of_present_illness
hospital_course
hospital_course
hospital_course
hospital_course
imaging
imaging
imaging
imaging
imaging
imaging
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
labs_and_studies
medications
medicat

#### Apply all rules to process the 7,025 full discharge summaries

In [None]:
full_ds=list(nlp.pipe(data_SBDH_full["TEXT"], n_process=1))

data_SBDH_full["social_history"] = [extract_SH_sections(ds) for ds in full_ds]

##### Finally, manually complete the specific Social History sections that were cut off prematurely

In [None]:
# Row ids whose social history was cut off prematurely, with the text to append
complete = {
    9417: " past history of heavy alcohol abuse but now drinks occasionally.",
    10786: " allergies. He has two brothers, alive and well.",
    14838: " DISCHARGE DISPOSITION:  The patient is to be discharged home.",
    13928: "past history of smoking. Unknown history of etoh. Is very independent, lives with her daughter in [**Name (NI) 33977**] but still plays golf a few times a week.",
    12406: "past history of smoking or alcohol use.  She was transferred from [**Hospital **] Rehabilitation.",
    38660: "Past history of drug abuse, history of opiate dependence. Smokes 1.5 ppd x 30 years Hx of Alcoholism, sober since [**2122**] Recently discharged from [**Hospital 85897**].",
    36605: "Past history of percocet abuse.",
    50593: "past history of heroin use. State he bought Klonipin 2 mg #15 tabs off the street and took them all the weekend prior to admission to 'help me come down'.",
    43534: "vital signs. Tobacco: Quit > 25 years ago ETOH: Reports none Illicits: None",
    43040: "past history of cocaine use. He currently denies illicit drugs.",
    23617: "past history of alcohol abuse, current intake unknown, has 3 children",
    21397: "MEDICATIONS AT HOME:  Lasix 40 mg P.O. B.I.D, potassium supplement 10 mg P.O. B.I.D, Toprol XL 50 mg P.O. q.d., aspirin.",
    13465: "Social History: Family History: Brother with DM. Physical Exam: General: Somnolent but arousable and answers questions"
}

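for row_id, extra_text in complete.items():  # 'row_id' avoids shadowing the built-in id()
    if row_id in data_SBDH_full["row_id"].values:
        ix = data_SBDH_full.index[data_SBDH_full["row_id"] == row_id][0]
        # Append the missing text only if it is not already present (keeps the cell safe to re-run)
        if extra_text.strip() not in data_SBDH_full.loc[ix, "social_history"]:
            data_SBDH_full.loc[ix, "social_history"] += " " + extra_text.strip()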
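The patching loop can also be written as a small function so that it can be checked on toy data before touching the real table. This is a sketch only: the helper name `append_missing_text` and the sample rows are hypothetical.

```python
import pandas as pd

def append_missing_text(df, completions, id_col="row_id", text_col="social_history"):
    """Append each completion to its row unless the text already contains it."""
    df = df.copy()
    for row_id, extra in completions.items():
        extra = extra.strip()
        mask = df[id_col] == row_id
        # Only touch rows that exist and do not already contain the extra text
        needs_patch = mask & ~df[text_col].str.contains(extra, regex=False, na=False)
        df.loc[needs_patch, text_col] = df.loc[needs_patch, text_col] + " " + extra
    return df

# Made-up rows: the first is cut off, the second is already complete
toy = pd.DataFrame({
    "row_id": [1, 2],
    "social_history": ["He has a", "Lives alone. past history of smoking."],
})
patched = append_missing_text(
    toy,
    {1: " past history of alcohol abuse.",   # appended
     2: "past history of smoking.",          # already present, left unchanged
     3: "row that does not exist"},          # unknown id, ignored
)
print(patched["social_history"].tolist())
```

Returning a copy leaves the input frame untouched, so the patched and unpatched versions can be compared side by side.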

In [None]:
print(data_SBDH_full[['row_id', 'social_history']].head())

   row_id                                     social_history
0       5                         She smokes a pack per day.
1      42  Social history is significant for the absence ...
2     136  Retired schoolteacher, now substitutes. Lives ...
3     442  - Tobacco: smokes 1-1.5ppd x 30yrs\n- Alcohol:...
4     328  Married with three children, born in [**2184**...


In [None]:
for idx, row in data_SBDH_full[['row_id', 'social_history']].dropna().iterrows():
    print(f"Row ID: {row['row_id']}")
    print("Social History:")
    print(row['social_history'])
    print("="*40)

Streaming output truncated to the last 5000 lines.
father is having health problems and she recently went through a
divorce.
Row ID: 15517
Social History:
Jehovah's Witness, youngest child four weeks
old.
Row ID: 14858
Social History:
Denies any substance abuse (EtOH, tobacco, illicits). She lives
with her mother. On disability for multiple medical problems.
Row ID: 11240
Social History:
Denies any alcohol or tobacco use.  He is a
dentist.
Row ID: 11094
Social History:
He is retired, barely inactive.  He is a
non-smoker.  He drinks 3-4 glasses of wine; he previously
drank 1 gallon per day.  He has not any alcohol over the last
two months.
Row ID: 13605
Social History:
The patient was discharged after his last
hospitalization to [**Hospital3 1186**] [**Hospital3 **] facility.
He has a significant tobacco history, unknown quantity pack
years, having quit tobacco two years ago.  He is dependent in
all of his activities of daily living.  He is incontinent of
bowel and bladder

##### Calculate the mean length (in whitespace-separated tokens) of the final social history sections obtained after this adjustment process

In [None]:
social_histories = data_SBDH_full['social_history']

word_counts= social_histories.apply(lambda x: len(x.split()))

mean_tokens = word_counts.mean()
std_tokens = word_counts.std()

print(f"{mean_tokens:.2f}±{std_tokens:.2f} tokens")

28.22±29.08 tokens
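Because `extract_SH_sections` returns an empty string when no section is found, empty extractions count as zero tokens and pull the mean down. A minimal sketch on made-up strings shows how the share of empty sections could be reported alongside the statistics:

```python
import pandas as pd

# Made-up examples; the empty string stands for a note with no extracted section
sections = pd.Series(["She smokes a pack per day.", "", "Lives with wife. No EtOH."])

# Whitespace tokenisation, as in the cell above
token_counts = sections.str.split().str.len()

n_empty = int((token_counts == 0).sum())
non_empty = token_counts[token_counts > 0]

print(f"all sections:       {token_counts.mean():.2f}±{token_counts.std():.2f} tokens")
print(f"non-empty sections: {non_empty.mean():.2f}±{non_empty.std():.2f} tokens")
print(f"empty sections:     {n_empty} of {len(sections)}")
```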


### 1.4 Show the keywords in the text

To more easily verify that the keywords provided by MIMIC-SBDH are included in the extracted Social History results, I highlight these keywords using HTML <span> tags.

##### Have a look at the extracted 'social history' section

In [None]:
# Here we use row_id 2218 as an example
example_row_id = 2218
target_row = data_SBDH_full.loc[data_SBDH_full['row_id'] == example_row_id, 'social_history'].dropna()

print(f"Row ID: {example_row_id}")
print("Social History:")
print(target_row.values[0])

Row ID: 2218
Social History:
Retired. Denies alcohol use. States history of "a few months of
cigarette use", denies IVDU and illicit drug use.
Import Social History Social History: Family History: Brother with DM. Physical Exam: General: Somnolent but arousable and answers questions


#### Highlight the keywords

In [None]:
keywords_df = pd.read_csv("/content/drive/MyDrive/MIMIC/MIMIC-SBDH-keywords.csv")

row_id=52601
text = data_SBDH_full[data_SBDH_full['row_id'] == row_id]['TEXT'].values[0]

keyword_spans = keywords_df[keywords_df['row_id'] == row_id][['start', 'end']].values.tolist()

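# Process the spans from the end of the text backwards so that inserting tags does not shift the earlier offsets
keyword_spans = sorted(keyword_spans, key=lambda x: x[0], reverse=True)

# Highlight these keywords using HTML <span> tags (an example here)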
for start, end in keyword_spans:
    highlight_text = f"<span style='background-color: yellow; font-weight: bold'>{text[start:end]}</span>"
    text = text[:start] + highlight_text + text[end:]

display(HTML(f"<div style='white-space: pre-wrap'>{text}</div>"))
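Clinical text can contain characters such as `<` or `&`, which a browser would interpret as markup when the note is passed to `HTML(...)`. The sketch below is one possible way to highlight character spans while escaping everything else. The helper `highlight_spans` and the sample sentence are hypothetical, and overlapping spans are simply skipped.

```python
import html

def highlight_spans(text, spans):
    """Wrap each (start, end) character span in a highlighted <span>, escaping the rest."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps an earlier one
            continue
        # Escape the plain text between the previous span and this one
        pieces.append(html.escape(text[cursor:start]))
        pieces.append(
            "<span style='background-color: yellow; font-weight: bold'>"
            + html.escape(text[start:end]) + "</span>"
        )
        cursor = end
    # Escape whatever follows the last span
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)

sample = "BP <120/80. She smokes a pack per day."
marked = highlight_spans(sample, [(16, 22), (25, 29)])
print(marked)
```

Building the output left to right from escaped pieces keeps the original offsets valid, so the spans do not need to be processed in reverse order.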


### 1.5 Checking the Extracted Sections by Matching Keywords in the Extracted Social History

As a further check, I verified that each keyword located by the start and end positions in the MIMIC-SBDH keywords file actually appears in the social history section we extracted.

In [None]:
keywords_df = pd.read_csv("/content/drive/MyDrive/MIMIC/MIMIC-SBDH-keywords.csv")

keywords_df["row_id"] = keywords_df["row_id"].astype(int)
data_SBDH_full["row_id"] = data_SBDH_full["row_id"].astype(int)

rowid_to_text = dict(zip(data_SBDH_full["row_id"], data_SBDH_full["TEXT"]))
rowid_to_social = dict(zip(data_SBDH_full["row_id"], data_SBDH_full["social_history"]))

def check_keyword_in_social(row):
    rid=row["row_id"]
    start = row["start"]
    end = row["end"]

    full_text = rowid_to_text.get(rid, "")
    keyword = full_text[start:end] if pd.notna(full_text) else ""

    social_text = rowid_to_social.get(rid, "")
    found = keyword.strip() in social_text if pd.notna(social_text) else False

    return pd.Series({"keyword_text": keyword, "found_in_social": found})

keywords_df[["keyword_text", "found_in_social"]] = keywords_df.apply(check_keyword_in_social, axis=1)

print(keywords_df.head(10))

keywords_df.to_csv("keyword_in_social_check.csv", index=False)

   row_id              sbdh  start   end keyword_text  found_in_social
0       5  behavior_tobacco    534   540       smokes             True
1       5  behavior_tobacco    543   547         pack             True
2      42  behavior_tobacco   3160  3167      tobacco             True
3      42  behavior_tobacco   3178  3184       smoked             True
4      42  behavior_tobacco   3197  3200          PPD             True
5      42  behavior_alcohol   3247  3254      alcohol             True
6      42  behavior_alcohol   3283  3287         wine             True
7     136    sdoh_economics   1749  1756      Retired             True
8     136  sdoh_environment   1789  1794        Lives             True
9     136    sdoh_community   1800  1804         wife             True


I checked whether each keyword (labelled word) appears in the extracted Social History. If a keyword could not be matched, found_in_social is False. After this check, only 5 unique row_ids had at least one keyword that could not be matched.


In [None]:
# Show the unmatched id
unmatched_ids = keywords_df.loc[keywords_df['found_in_social'] == False, 'row_id'].unique()

In [None]:
len(unmatched_ids)

5

In [None]:
unmatched_ids = keywords_df.loc[~keywords_df["found_in_social"], "row_id"].unique()

print("Unmatched row_ids:")
print(unmatched_ids.tolist())


Unmatched row_ids:
[52601, 2194, 23548, 36627, 42944]
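Besides listing the unmatched ids, it can help to see how many keywords are missing per note and the overall keyword-level match rate. A minimal sketch on a toy table (made-up values, same column names as `keywords_df`):

```python
import pandas as pd

# Toy keyword table in the same shape as the check above (values are made up)
toy_keywords = pd.DataFrame({
    "row_id": [5, 5, 42, 42, 42],
    "found_in_social": [True, True, True, False, False],
})

# Per-note totals: number of annotated keywords and number found in the section
per_note = (
    toy_keywords.groupby("row_id")["found_in_social"]
    .agg(n_keywords="size", n_found="sum")
)
per_note["n_missing"] = per_note["n_keywords"] - per_note["n_found"]

print(f"keyword-level match rate: {toy_keywords['found_in_social'].mean():.1%}")
print(per_note[per_note["n_missing"] > 0])
```

Notes with many missing keywords are the most likely candidates for a truncated or misplaced section and can be reviewed first.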


In [None]:
import numpy as np

social_histories = data_SBDH_full['social_history'].dropna()

word_counts = social_histories.apply(lambda x: len(x.split()))

mean_tokens = word_counts.mean()
std_tokens = word_counts.std()

print(f"{mean_tokens:.2f}±{std_tokens:.2f} tokens")

28.22±29.08 tokens


In [None]:
data_SBDH_full.head()

Unnamed: 0,row_id,sdoh_community_present,sdoh_community_absent,sdoh_education,sdoh_economics,sdoh_environment,behavior_alcohol,behavior_tobacco,behavior_drug,TEXT,social_history
0,5,0,0,0,0,0,0,1,0,Admission Date: [**2190-5-16**] Discharge...,She smokes a pack per day.
1,42,0,0,0,0,0,0,2,0,Admission Date: [**2174-8-6**] D...,Social history is significant for the absence ...
2,136,1,0,0,2,1,3,4,0,Admission Date: [**2139-2-4**] D...,"Retired schoolteacher, now substitutes. Lives ..."
3,442,1,1,0,0,1,3,1,2,Admission Date: [**2193-1-8**] D...,- Tobacco: smokes 1-1.5ppd x 30yrs\n- Alcohol:...
4,328,1,0,0,2,1,3,3,3,Admission Date: [**2198-4-22**] ...,"Married with three children, born in [**2184**..."


In [None]:
data_SBDH_full.to_csv("/content/drive/MyDrive/MIMIC/mydataset1.csv", index=False)

##### 1.6 Visualizing Frequency of SBDH Discussions in Selected Dataset (moved to the next part)

##### In Part One, we obtain the data_SBDH_full dataset, saved as a CSV file, which serves as the input to the models

# End of Part One