# Topic Modeling 

In [34]:
import pandas as pd
import spacy
import re
from gensim import corpora
import swifter
from gensim.models import LdaModel
from gensim.models import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

## Loading the data (MIMIC-III)

This data is from a demo of the MIMIC-III dataset. 

In [2]:
#loading the data 
df = pd.read_csv('/Users/rithvik/Desktop/personal/code/Github/Topic Modeling/NOTEEVENTS-2.csv')

In [3]:
df.head(1)

Unnamed: 0,row_id,subject_id,hadm_id,chartdate,category,description,cgid,iserror,text
0,776,20007,188442.0,2183-10-29 00:00:00,Discharge summary,Report,,,Admission Date: [**2183-9-25**] Dischar...


In [4]:
df.shape

(2083180, 9)

In [5]:
df['text'][0]

"Admission Date:  [**2183-9-25**]       Discharge Date: [**2183-10-29**] Service: HISTORY OF PRESENT ILLNESS:  The following discharge summary will cover the time period from [**10-15**] through [**2183-10-28**]. Please see previous discharge summary for information on patient's admission diagnosis and medications. HOSPITAL COURSE: 1.  Gastrointestinal.  On [**10-16**] the patient developed nausea, vomiting and abdominal pain.  Because of this she was not discharged to rehabilitation at [**Location (un) 511**] Center Hospital as had been previously planned.  Due to her symptoms a CT scan was obtained which revealed the patient had an ileus. There were no abscesses or other processes that could be identified.  The neurology service was consulted regarding possibility of this ileus being related to the patient's myopathy but felt this was unlikely since skeletal muscle myopathies typically do not also involve smooth muscle of the Gastrointestinal tract.  A Gastrointestinal consult was ob

In [6]:
df.columns

Index(['row_id', 'subject_id', 'hadm_id', 'chartdate', 'category',
       'description', 'cgid', 'iserror', 'text'],
      dtype='object')

In [7]:
df['description'][0]

'Report'

In [8]:
df.isnull().sum()

row_id               0
subject_id           0
hadm_id         231836
chartdate            0
category             0
description          0
cgid            836776
iserror        2082294
text                 0
dtype: int64

In [9]:
text_data =  df[['text']].copy()
text_data.head()

Unnamed: 0,text
0,Admission Date: [**2183-9-25**] Dischar...
1,Admission Date: [**2184-1-16**] Dischar...
2,Admission Date: [**2103-4-11**] ...
3,Admission Date: [**2103-10-7**] Dischar...
4,Admission Date: [**2131-4-2**] D...


## Data preprocessing 

In [10]:
text_data.shape

(2083180, 1)

In [11]:
text_data_sampled = text_data.sample(n=10000, random_state=42)
text_data_sampled.head()

Unnamed: 0,text
1292716,ADMIT NOTE Pt is a 75yo male POD 2 from open ...
1160271,[**2158-1-27**] 9:26 AM AV FITULOGRAM SCH ...
1549380,"Resp Care Note, Pt remains on current vent set..."
7474,Admission Date: [**2167-2-28**] ...
2014768,NEonatology-[** 63**] Progress Note PE; [**Kno...


In [12]:
text_data_sampled.shape

(10000, 1)

In [None]:
# Load spaCy's small English model
#nlp = spacy.load('en_core_web_sm')

In [None]:
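# The preprocessing pipeline below is wrapped in a string literal so it does not rerun;
# its output is saved to a pickle and reloaded in a later cell.
# Function to process a batch of texts
"""def preprocess_texts(texts):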
    preprocessed_texts = []
    
    # Use spaCy's efficient `pipe` method (batch processing)
    for doc in nlp.pipe(texts, batch_size=700):
        tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
        preprocessed_texts.append(' '.join(tokens))
    
    return preprocessed_texts

# Function to clean text (regex part)
def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.lower()

# Combined function — apply regex cleaning, then batch-process
def preprocess_text_pipeline(texts):
    cleaned_texts = [clean_text(text) for text in texts]
    return preprocess_texts(cleaned_texts)

# Sample DataFrame
# text_data = pd.DataFrame({'text': ["This is a test.", "Another sentence!"]})

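# Apply the regex cleaning with swifter, which chooses between a plain pandas apply and a parallel apply
# (the "Pandas Apply" progress bar in the output indicates the plain pandas path was used here)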
text_data_sampled['text'] = text_data_sampled['text'].swifter.apply(clean_text)  # Clean first (regex part)

# Now run batch processing (spaCy part) — outside swifter, since batching is its own optimization
text_data_sampled['text'] = preprocess_texts(text_data_sampled['text'].tolist())

# Check results
print(text_data_sampled.head())"""


Pandas Apply:   0%|          | 0/10000 [00:00<?, ?it/s]

                                                      text
1292716  admit note pt yo male pod open choleystectomy ...
1160271  av fitulogram sch clip clip number radiology r...
1549380  resp care note pt remain current vent setting ...
7474     admission date discharge date date birth sex m...
2014768  neonatology progress note pe know lastname rem...
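
The sample output above still contains tokens such as "know lastname". These most likely come from MIMIC-III de-identification placeholders such as `[**Known lastname ...**]`: the regex in `clean_text` strips the brackets, asterisks, and digits but keeps the letters inside. The following is a minimal sketch of one way to drop the placeholders before cleaning. `DEID_PATTERN`, `clean_text_without_placeholders`, and the sample sentence are illustrative assumptions, not part of the original pipeline.

```python
import re

# Assumed placeholder shape: anything between "[**" and "**]"
DEID_PATTERN = re.compile(r"\[\*\*.*?\*\*\]")

def clean_text(text):
    # Same regex cleaning as in the notebook: keep only letters and whitespace
    return re.sub(r"[^a-zA-Z\s]", "", text).lower()

def clean_text_without_placeholders(text):
    # Hypothetical variant: remove whole placeholders first, then clean
    return clean_text(DEID_PATTERN.sub(" ", text))

# Invented sample sentence in the style of a MIMIC-III note
sample = "Pt seen by [**Known lastname 12**] on [**2183-9-25**]."

tokens_before = clean_text(sample).split()
tokens_after = clean_text_without_placeholders(sample).split()

print(tokens_before)
print(tokens_after)
```

Whether to remove the placeholders is a judgment call: they carry no clinical meaning, but removing them changes the vocabulary and therefore the topics.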


In [None]:
# Save the preprocessed text to a pickle file
#text_data_sampled.to_pickle('text_data_sampled')

In [13]:
#Loading a pickle file 
preped_sampled_data = pd.read_pickle('/Users/rithvik/Desktop/personal/code/Github/Topic Modeling/text_data_sampled')
preped_sampled_data.head()

Unnamed: 0,text
1292716,admit note pt yo male pod open choleystectomy ...
1160271,av fitulogram sch clip clip number radiology r...
1549380,resp care note pt remain current vent setting ...
7474,admission date discharge date date birth sex m...
2014768,neonatology progress note pe know lastname rem...


In [14]:
# Tokenize the preprocessed text by splitting each document on whitespace
tokenized_texts = [text.split() for text in preped_sampled_data['text']]

In [15]:
# Create a dictionary from tokenized texts
dictionary = corpora.Dictionary(tokenized_texts)

# Optionally filter out extremes (adjust thresholds as needed)
dictionary.filter_extremes(no_below=2, no_above=0.95)

# Convert tokenized documents to a bag-of-words corpus
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# Optional: Print a sample to check the corpus format
print(corpus[0])

[(0, 1), (1, 1), (2, 3), (3, 1), (4, 2), (5, 2), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 3), (29, 1), (30, 1), (31, 1), (32, 1), (33, 2), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 2), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 2), (59, 2), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 2), (69, 1), (70, 1), (71, 2), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 2), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 3), (97, 1), (98, 2), (99, 2), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1),
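
To make the `(token_id, count)` pairs and the `filter_extremes` thresholds more concrete, here is a plain-Python sketch of the same idea on a toy corpus. It is an illustration of the rule, not gensim's implementation: the helper names and toy documents are made up, and gensim's additional `keep_n` cap is not modeled.

```python
from collections import Counter

def document_frequencies(tokenized_docs):
    # Count in how many documents each token appears (not total occurrences)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    return df

def keep_tokens(tokenized_docs, no_below=2, no_above=0.95):
    # Keep tokens found in at least `no_below` documents
    # and in at most a `no_above` fraction of all documents
    n_docs = len(tokenized_docs)
    df = document_frequencies(tokenized_docs)
    return {tok for tok, count in df.items()
            if count >= no_below and count / n_docs <= no_above}

def to_bow(doc, token2id):
    # Sorted (token_id, count) pairs, ignoring filtered-out tokens
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())

# Invented toy documents
toy_docs = [
    ["pt", "stable", "pain"],
    ["pt", "pain", "pain", "plan"],
    ["pt", "ct", "contrast"],
    ["pt", "ct", "plan"],
]

kept = keep_tokens(toy_docs, no_below=2, no_above=0.95)
token2id = {tok: i for i, tok in enumerate(sorted(kept))}
bow = to_bow(toy_docs[1], token2id)

print(token2id)
print(bow)
```

In this toy corpus "pt" is dropped because it occurs in every document, while "stable" and "contrast" are dropped because each occurs in only one.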

## LDA Model

In [30]:
# Set the number of topics (adjust as needed for your data)
num_topics = 5

# Train the LDA model
lda_model = LdaModel(corpus=corpus, 
                     id2word=dictionary, 
                     num_topics=num_topics, 
                     random_state=42, 
                     passes=10)

# Print the topics to inspect them
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

Topic 0: 0.033*"pt" + 0.011*"plan" + 0.009*"s" + 0.009*"assessment" + 0.008*"continue" + 0.007*"response" + 0.007*"note" + 0.006*"action" + 0.006*"give" + 0.006*"pain"
Topic 1: 0.018*"normal" + 0.016*"contrast" + 0.014*"ct" + 0.014*"valve" + 0.013*"right" + 0.010*"left" + 0.010*"aortic" + 0.010*"ventricular" + 0.010*"leave" + 0.009*"see"
Topic 2: 0.014*"ml" + 0.010*"mg" + 0.010*"pm" + 0.009*"patient" + 0.008*"po" + 0.007*"mgdl" + 0.007*"blood" + 0.006*"tablet" + 0.005*"history" + 0.005*"day"
Topic 3: 0.024*"clip" + 0.023*"reason" + 0.019*"chest" + 0.017*"right" + 0.013*"report" + 0.013*"number" + 0.013*"examination" + 0.012*"hospital" + 0.012*"radiology" + 0.012*"final"
Topic 4: 0.018*"infant" + 0.014*"feed" + 0.012*"note" + 0.012*"care" + 0.011*"p" + 0.011*"continue" + 0.010*"stable" + 0.009*"o" + 0.009*"cont" + 0.008*"monitor"


## Evaluation

In [31]:
# Create a CoherenceModel using your LDA model, tokenized texts, and dictionary
coherence_model = CoherenceModel(model=lda_model, texts=tokenized_texts, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model.get_coherence()
print(f"Coherence Score: {coherence_score:.4f}")


Coherence Score: 0.5925


In [32]:
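# log_perplexity returns the per-word likelihood bound (a log value), not the perplexity itself
print(f"Per-word log-likelihood bound: {lda_model.log_perplexity(corpus)}")

Per-word log-likelihood bound: -7.225988081713401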
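
The number above is negative because `log_perplexity` returns a per-word likelihood bound rather than a perplexity. To my knowledge, gensim reports its perplexity estimate as 2 raised to the negative of this bound, so a rough conversion can be sketched as follows. The variable names are illustrative, and the bound is copied from the output above.

```python
# Per-word bound copied from the cell output above
per_word_bound = -7.225988081713401

# Perplexity estimate, assuming gensim's convention of 2 ** (-bound)
perplexity = 2 ** (-per_word_bound)

print(f"Perplexity estimate: {perplexity:.1f}")
```

Because the bound was computed on the same corpus the model was trained on, this is an optimistic estimate. A held-out set of notes would give a fairer number.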


## Visualization 

In [35]:
# Prepare the visualization data
lda_vis_data = gensimvis.prepare(lda_model, corpus, dictionary)

In [36]:
pyLDAvis.display(lda_vis_data)

## Results 

In [37]:
topic_labels = {
    0: "Assessment and Plan",
    1: "CT Imaging / Radiology",
    2: "Medication Dosage and Lab Measurements",
    3: "Chest Radiology Report",
    4: "Pediatric/Infant Care"
}
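
One way to use these labels is to tag each note with its most probable topic. The sketch below assumes a per-document topic distribution in the form of `(topic_id, probability)` pairs, which is the shape `lda_model.get_document_topics` returns. The helper name and the example distribution are hypothetical.

```python
topic_labels = {
    0: "Assessment and Plan",
    1: "CT Imaging / Radiology",
    2: "Medication Dosage and Lab Measurements",
    3: "Chest Radiology Report",
    4: "Pediatric/Infant Care",
}

def dominant_topic_label(doc_topics, labels):
    # Pick the (topic_id, probability) pair with the highest probability
    topic_id, prob = max(doc_topics, key=lambda pair: pair[1])
    # Fall back to a generic name if the topic has no label
    return labels.get(topic_id, f"Topic {topic_id}"), prob

# Hypothetical distribution for one document, not real model output
example_doc_topics = [(0, 0.12), (3, 0.81), (4, 0.07)]

label, prob = dominant_topic_label(example_doc_topics, topic_labels)
print(f"{label} ({prob:.2f})")
```

With the trained model, the input would come from something like `lda_model.get_document_topics(corpus[i])` for the i-th note.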


In this project, I preprocessed a random sample of 10,000 clinical notes from the MIMIC-III dataset using regex cleaning followed by spaCy lemmatization. I then converted the cleaned text into a bag-of-words representation and fit an LDA model with five topics. The topics correspond to recognizable clinical themes: Assessment and Plan, CT Imaging/Radiology, Medication Dosage and Lab Measurements, Chest Radiology Reports, and Pediatric/Infant Care. With a c_v coherence score of about 0.59 and a per-word log-likelihood bound of about -7.23 (the value returned by `log_perplexity`, which is not a perplexity itself), the model produced moderately coherent topics that align with real-world clinical documentation. These results show that even a basic topic modeling approach can uncover meaningful patterns in clinical notes, offering a foundation for further analysis and integration with additional patient data.