Load NOTEEVENTS 

Notes to self
- The first 60 rows in column A are junk
- After those rows, data is categorized as follows (according to https://mimic.mit.edu/docs/iii/tables/noteevents/)

Table: 
- ROW_ID	INT
- SUBJECT_ID	INT
- HADM_ID	INT
- CHARTDATE	TIMESTAMP(0)
- CHARTTIME	TIMESTAMP(0)
- STORETIME	TIMESTAMP(0)
- CATEGORY	VARCHAR(50)
- DESCRIPTION	VARCHAR(300)
- CGID	INT
- ISERROR	CHAR(1)
- TEXT	TEXT

In [15]:
import pandas as pd

noteevents = pd.read_csv(
    'NOTEEVENTS.csv', 
    skiprows=60  # Skip the first 60 rows
)

# Assign column names to the dataframe
noteevents.columns = [
    'ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'CHARTDATE', 'CHARTTIME', 
    'STORETIME', 'CATEGORY', 'DESCRIPTION', 'CGID', 'ISERROR', 'TEXT'
]
print(noteevents.head())


   ROW_ID  SUBJECT_ID   HADM_ID   CHARTDATE            CHARTTIME  \
0  716996       55118  102445.0  2110-11-21  2110-11-21 06:24:00   
1  717001       55118  102445.0  2110-11-21  2110-11-21 11:44:00   
2  717067       53244  184401.0  2107-12-16  2107-12-16 22:01:00   
3  717068       57615  127312.0  2158-01-27  2158-01-27 22:17:00   
4  717069       57615  127312.0  2158-01-27  2158-01-27 22:17:00   

             STORETIME    CATEGORY            DESCRIPTION     CGID  ISERROR  \
0  2110-11-21 11:12:56  Physician        Intensivist Note  21070.0      NaN   
1  2110-11-21 11:44:59     Nursing  Nursing Transfer Note  17600.0      NaN   
2  2107-12-16 22:01:59     Nursing  Nursing Progress Note  14891.0      NaN   
3  2158-01-27 22:18:01     General           Generic Note  16037.0      NaN   
4  2158-01-27 22:46:29     General           Generic Note  16037.0      NaN   

                                                TEXT  
0  CVICU\n   HPI:\n   HD5   POD 2-CABGx4(LIMA-LAD...  
1  D #
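
One caveat with the load cell above: with `skiprows=60` and the default `header` setting, pandas treats the first remaining row as a header, so overwriting `.columns` afterwards may silently drop one note. A minimal sketch of an alternative, assuming the 61st line of the file is already data rather than a header row (not verified against the real file), is to pass `header=None` and `names=` instead. The tiny in-memory CSV and the shortened column list are made up for illustration.

```python
import io
import pandas as pd

COLUMNS = ['ROW_ID', 'SUBJECT_ID', 'CATEGORY', 'TEXT']  # shortened for the demo

# Made-up stand-in for NOTEEVENTS.csv: 3 junk lines followed by 2 data rows
raw = "junk\njunk\njunk\n1,10,Nursing,first note\n2,11,General,second note\n"

# Default header handling: the first data row is consumed as the header
lossy = pd.read_csv(io.StringIO(raw), skiprows=3)
lossy.columns = COLUMNS

# header=None + names= keeps every data row
full = pd.read_csv(io.StringIO(raw), skiprows=3, header=None, names=COLUMNS)

print(len(lossy), len(full))
```

If the row counts differ by one on the real file, the first note was being used as a header.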

Filter relevant note categories and save to Relevant_NOTEEVENTS.csv 

In [None]:
# Filter for relevant note categories (determine if we should include other categories like general or physician)

#relevant_notes = noteevents[noteevents['CATEGORY'].isin(['Social Work', 'Nursing', 'General', 'Physician', 'Nursing/other', 'Nutrition'])]

# test with all categories (see if it affects 'None' label distribution)
relevant_notes = noteevents

# Select only required columns (copy so the LABEL column can be added later without a SettingWithCopyWarning)
relevant_notes = relevant_notes[['SUBJECT_ID', 'HADM_ID', 'CATEGORY', 'TEXT']].copy()

# Display the first few rows of the filtered notes
print(relevant_notes.head())

relevant_notes.to_csv('Relevant_NOTEEVENTS.csv', index=False)


   SUBJECT_ID   HADM_ID    CATEGORY  \
0       55118  102445.0  Physician    
1       55118  102445.0     Nursing   
2       53244  184401.0     Nursing   
3       57615  127312.0     General   
4       57615  127312.0     General   

                                                TEXT  
0  CVICU\n   HPI:\n   HD5   POD 2-CABGx4(LIMA-LAD...  
1  D #2 from cx4.\n   Hyperglycemia\n   Assessmen...  
2  Labile bp with brisk huo,low filling pressures...  
3  TITLE: Pt was made CMO at previous shift,all f...  
4  TITLE: Pt was made CMO at previous shift,all f...  
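
If the commented-out category filter is re-enabled, note that the alignment of the output above suggests `Physician ` is stored with a trailing space, so an exact match on `'Physician'` would probably miss those notes. A minimal sketch, using made-up rows rather than the real table, that strips whitespace before matching:

```python
import pandas as pd

# Made-up rows; 'Physician ' carries a trailing space, as the output above suggests
notes = pd.DataFrame({
    'CATEGORY': ['Physician ', 'Nursing', 'Radiology', 'Social Work'],
    'TEXT': ['a', 'b', 'c', 'd'],
})

wanted = ['Social Work', 'Nursing', 'General', 'Physician', 'Nursing/other', 'Nutrition']

# Exact matching misses 'Physician ' because of the trailing space
exact = notes[notes['CATEGORY'].isin(wanted)]

# Stripping whitespace before matching keeps it
stripped = notes[notes['CATEGORY'].str.strip().isin(wanted)].copy()

print(len(exact), len(stripped))
```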


Keyword-based approach for labeling cleaned text (will have to consider other methods; this will likely leave a lot of entries labeled `None`)

In [16]:
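# Define simple keyword rules (substring matching; the first matching rule wins)
def enhanced_keyword_label(text):
    if not isinstance(text, str):  # missing TEXT values would otherwise raise on .lower()
        return "None"
    text = text.lower()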
    if any(word in text for word in ["homeless", "shelter", "evicted", "no stable housing"]):
        return "Housing Instability"
    elif any(word in text for word in ["transport", "distance", "commute", "travel issues"]):
        return "Transportation Issues"
    elif any(word in text for word in ["afford", "financial", "money", "costly", "expensive"]):
        return "Financial Difficulty"
    elif any(word in text for word in ["family", "support", "caretaker", "guardian"]):
        return "Social Support"
    elif any(word in text for word in ["food", "nutrition", "hunger", "malnutrition"]):
        return "Food Insecurity"
    else:
        return "None"

# apply keyword-based labeling
relevant_notes['LABEL'] = relevant_notes['TEXT'].apply(enhanced_keyword_label)
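
Substring matching also fires on unrelated words: "transport" matches "transported to CT", for example. One possible refinement, sketched here as an assumption rather than the method used for the results below, is to match whole words with regular expressions. `word_boundary_label` is a hypothetical name; the keyword lists and rule order are copied from the cell above.

```python
import re

# Keyword lists copied from the notebook; rule order sets precedence.
RULES = [
    ("Housing Instability", ["homeless", "shelter", "evicted", "no stable housing"]),
    ("Transportation Issues", ["transport", "distance", "commute", "travel issues"]),
    ("Financial Difficulty", ["afford", "financial", "money", "costly", "expensive"]),
    ("Social Support", ["family", "support", "caretaker", "guardian"]),
    ("Food Insecurity", ["food", "nutrition", "hunger", "malnutrition"]),
]

# One compiled pattern per label; \b anchors each keyword at word boundaries.
PATTERNS = [
    (label, re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", re.IGNORECASE))
    for label, words in RULES
]

def word_boundary_label(text):
    """Return the first label whose keyword appears as a whole word, else 'None'."""
    if not isinstance(text, str):  # NaN / missing TEXT values
        return "None"
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "None"

print(word_boundary_label("Pt transported to CT for scan"))
print(word_boundary_label("Pt is homeless, lives in a shelter"))
```

The trade-off is that whole-word matching also skips inflected forms such as "families" or "affording", so the keyword lists would need to be extended if those should count.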


Inspect prelabeled data (ensure it makes sense and captures relevant categories)

Result: Abundant `None` labels

In [11]:
# Load prelabeled data
prelabeled_notes = pd.read_csv('prelabeled_notes.csv')

# Check the label distribution
print(prelabeled_notes['LABEL'].value_counts()) 

# Inspect a few examples
print(prelabeled_notes[['TEXT', 'LABEL']].head(10))

# Replace NaN values in the LABEL column with 'None' (read_csv parses the string 'None' as NaN by default, which is why value_counts() above omits it)
prelabeled_notes['LABEL'] = prelabeled_notes['LABEL'].fillna('None')

# Get the count of 'None' labels
none_count = prelabeled_notes['LABEL'].value_counts().get('None', 0)
print(f"Number of 'None' labels: {none_count}")


LABEL
Social Support           238907
Food Insecurity           16387
Transportation Issues      6021
Housing Instability         317
Financial Difficulty        238
Name: count, dtype: int64
                                                TEXT           LABEL
0  D #2 from cx4.\n   Hyperglycemia\n   Assessmen...             NaN
1  Labile bp with brisk huo,low filling pressures...             NaN
2  TITLE: Pt was made CMO at previous shift,all f...  Social Support
3  TITLE: Pt was made CMO at previous shift,all f...  Social Support
4  Delirium / confusion\n   Assessment:\n   Minim...             NaN
5  [**Age over 90 **] year old female found down ...             NaN
6  48M with recently diagnosed Large B-Cell Lymph...             NaN
7  [**Age over 90 **] year old female found down ...             NaN
8  55 yr old with severe PVD, CAD with EF 20%, ES...  Social Support
9                                           TITLE:\n             NaN
Number of 'None' labels: 572724
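
The NaN entries in the LABEL column above are most likely the `None` labels themselves: in pandas 2.x, `read_csv` includes the string `None` in its default list of missing-value markers. A minimal sketch of reading the labels back intact, using a made-up in-memory CSV in place of `prelabeled_notes.csv`:

```python
import io
import pandas as pd

# Made-up stand-in for prelabeled_notes.csv
raw = "TEXT,LABEL\nnote a,None\nnote b,Social Support\n,None\n"

# Disable the default NA strings, but still treat empty fields as missing
kept = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[''])

print(kept['LABEL'].value_counts())
print(kept['TEXT'].isna().sum())
```

With this, `value_counts()` reports `None` directly and the `fillna('None')` step is no longer needed.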


Issue: The dataset is heavily imbalanced: `None` is the majority label (572,724 of 834,594 notes, about 69%). A ClinicalBERT classifier trained on this data may become biased toward `None`, learning to predict the majority label almost always because it sees it so often. As a result, it won’t properly learn to recognize the rare labels like `Housing Instability` (317 notes) or `Financial Difficulty` (238 notes).

Solutions:

- Balance the dataset:
    - Downsample `None`: reduce the number of `None` notes so they match the size of the smaller labels (e.g., if there are 100,000 `None` and 5,000 `Housing Instability` notes, randomly pick 5,000 `None` samples to keep)

    - Oversample minority labels: duplicate the notes for smaller labels (e.g., `Housing Instability`) to make them more frequent in the dataset

- Use a weighted approach:
    - Instead of changing the dataset, assign more weight to mistakes on minority labels during training

- Fine-tune ClinicalBERT:
    Because the keyword rules leave a large number of notes labeled `None`, fine-tuning ClinicalBERT can help capture implicit patterns (information not stated directly but understood from the context of the text). Fine-tuning does not balance the label distribution by itself, so it should be combined with resampling or weighting to improve classification of minority SDOH categories like `Housing Instability` and `Transportation Issues`.
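
The first two options can be sketched as follows. This is a minimal illustration, not the approach finally used: the class weights follow the common "balanced" heuristic `n_samples / (n_classes * class_count)`, computed from the label counts printed above, and `downsample_majority` is a hypothetical helper shown on a made-up toy frame.

```python
import pandas as pd

# Label counts copied from the value_counts() output above, plus the 'None' count
counts = pd.Series({
    "None": 572724,
    "Social Support": 238907,
    "Food Insecurity": 16387,
    "Transportation Issues": 6021,
    "Housing Instability": 317,
    "Financial Difficulty": 238,
})

# "Balanced" weights: n_samples / (n_classes * class_count); rare labels get large weights
class_weights = counts.sum() / (len(counts) * counts)
print(class_weights.round(3))

def downsample_majority(df, label_col="LABEL", cap=5000, seed=42):
    """Keep at most `cap` rows per label, then shuffle the result."""
    parts = [
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame standing in for prelabeled_notes
toy = pd.DataFrame({
    "TEXT": [f"note {i}" for i in range(12)],
    "LABEL": ["None"] * 9 + ["Housing Instability"] * 3,
})
balanced = downsample_majority(toy, cap=3)
print(balanced["LABEL"].value_counts())
```

During training, such weights would typically be passed to the loss function (for example the `weight` argument of PyTorch's `CrossEntropyLoss`), ordered to match the integer label codes.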


Prepare Data for Fine-Tuning ClinicalBERT:

- Load the dataset of pre-labeled clinical notes and encode text labels (e.g., "Housing Instability") into numerical values.

- Split the dataset into training and validation sets, ensuring that each set keeps the same label proportions as the full dataset (stratification).

- Save the splits to separate files for use during training.

In [None]:
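from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load prelabeled notes
data = pd.read_csv('prelabeled_notes.csv')

# read_csv parses the string 'None' as NaN by default, so restore the label before encoding
data['LABEL'] = data['LABEL'].fillna('None')

# Encode labels
encoder = LabelEncoder()
data['LABEL_NUM'] = encoder.fit_transform(data['LABEL'])  # Converts text labels to numbers
# Map each label name to its integer code (e.g., for decoding predictions later)
label_mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))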

# Split into training and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, stratify=data['LABEL_NUM'], random_state=42)

print("Training set size:", len(train_data))
print("Validation set size:", len(val_data))

# Save training and validation sets to separate files
train_data.to_csv('train_data.csv', index=False)
val_data.to_csv('val_data.csv', index=False)

print("Training and validation splits saved.")

Training set size: 667675
Validation set size: 166919
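
The `label_mapping` built above is not saved anywhere, yet it will be needed to turn predicted class ids back into label names. A minimal sketch of persisting and inverting it, assuming the six labels seen so far; it mirrors `LabelEncoder`'s sorted class order with plain Python instead of calling the encoder, and the file name mentioned in the comment is hypothetical.

```python
import json

labels = ["Social Support", "Food Insecurity", "Transportation Issues",
          "Housing Instability", "Financial Difficulty", "None"]

# LabelEncoder assigns codes in sorted order of the class names; mirror that here.
# With the real encoder, cast codes with int(...) because NumPy integers are not JSON serializable.
label_mapping = {name: int(code) for code, name in enumerate(sorted(labels))}

# Could be written to a file instead, e.g. a hypothetical label_mapping.json
serialized = json.dumps(label_mapping, indent=2)

# Invert the mapping to decode predicted class ids back to label names
id_to_label = {code: name for name, code in json.loads(serialized).items()}
print(id_to_label[3])
```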
