<div style="background-color:#121212; border-left: 5px solid #00bfbf; padding: 1.5em; font-family: 'Segoe UI', sans-serif; color: #e0e0e0; line-height:1.7;">

<h2 style="color:#00e6e6;">🧠 Plastic Surgery Abstracts: Topic Modeling & Trend Analysis</h2>

<p><strong style="color:#fff;">📍 Objective:</strong><br>
To apply advanced <strong style="color:#ffffff;">Natural Language Processing (NLP)</strong> techniques — specifically <strong style="color:#ffffff;">BERTopic</strong> — to a curated dataset of plastic surgery abstracts. This project aims to <strong style="color:#ffffff;">discover latent research themes</strong>, visualize their <strong style="color:#ffffff;">evolution over time</strong>, and generate reproducible insights for clinical and academic interpretation.</p>

<hr style="border:none; border-top: 1px dashed #666;">

<h3 style="color:#ffffff;">🎯 Project Goals:</h3>
<ul>
  <li><strong>🧩 Theme Discovery:</strong> Identify core topics from ~5,000 abstracts (4,867 rows in the loaded dataset, spanning 2014 to 2023) using clustering techniques.</li>
  <li><strong>🔑 Keyword Extraction:</strong> Use <code style="color:#111;background:#fff;padding:1px 4px;border-radius:3px;">c-TF-IDF</code> to surface meaningful and distinct terms for each topic.</li>
  <li><strong>📈 Trend Analysis:</strong> Track how topics emerge, decline, or dominate over the years.</li>
  <li><strong>🧭 Topic Relationships:</strong> Visualize topic similarity using UMAP and hierarchical dendrograms.</li>
  <li><strong>📦 Deliverables:</strong> Export reproducible models (.pkl), clean code, and publication-ready visuals.</li>
</ul>

<p style="font-size:0.95em; color:#aaa;">
<em>Note:</em> A preprocessing pipeline was applied, including section header stripping, lemmatization, and medical-domain stopword removal, to improve topic coherence.
</p>
</div>
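For orientation, the class-based TF-IDF weighting named in the goals can be sketched in plain Python. This is a simplified illustration, assuming the formulation given in the BERTopic documentation: the term frequency within a topic multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics. The toy documents and the function name `c_tf_idf` are invented for the example and are not part of the project code.

```python
import math
from collections import Counter

def c_tf_idf(topic_docs):
    """Score terms per topic: tf(term, topic) * log(1 + A / f(term))."""
    # One bag of words per topic: all documents of a topic are concatenated.
    topic_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in topic_docs.items()
    }
    # f(term): frequency of the term across all topics.
    total_counts = Counter()
    for counts in topic_counts.values():
        total_counts.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(topic_counts)

    scores = {}
    for topic, counts in topic_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

# Toy input (hypothetical): two topics that share the word "flap".
toy = {
    0: ["breast reconstruction flap", "breast implant flap"],
    1: ["cleft palate repair", "cleft lip flap"],
}
scores = c_tf_idf(toy)
top_terms = {t: max(s, key=s.get) for t, s in scores.items()}
print(top_terms)
# {0: 'breast', 1: 'cleft'}
```

The shared word "flap" is frequent in topic 0 but scores below "breast" because it also occurs in topic 1, which is the behaviour that makes the extracted keywords distinctive per topic.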

In [4]:
import pandas as pd
import numpy as np
import nltk
import os
import spacy
import re  # standard-library regular expressions (not the third-party `regex` package)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets  # example module from scikit-learn
from umap import UMAP
import hdbscan
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from transformers import pipeline
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

In [3]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'E:\dataScience\Fiver orders\Order 3 Betopic modeling Plastic surgery\data\data\merged_abstracts.csv', encoding='latin1')

In [4]:
df

,year,abstract,Unnamed: 2,Unnamed: 3
0,2014,BACKGROUND and PURPOSE: Teenagers with severe ...,,
1,2014,"BACKGROUND: 160,000 hip and knee replacements ...",,
2,2014,"BACKGROUND: 3,800 patients are diagnosed with ...",,
3,2014,"BACKGROUND: 350,000 ventral hernia repairs (VH...",,
4,2014,BACKGROUND: A challenge to education is offeri...,,
...,...,...,...,...
4862,2023,The Potential of Amniotic Membrane Derived Pro...,,
4863,2023,Use of Ultrasound in the Diagnosis of Craniosy...,,
4864,2023,Utility of AI tools to Detect Pain Through Fac...,,
4865,2023,Ventral Hernia Repair in Complex Patients: The...,,


In [267]:
df['year'].value_counts()

year
2023    907
2022    895
2021    735
2019    594
2016    366
2018    349
2017    326
2014    242
2015    238
2020    215
Name: count, dtype: int64
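Yearly volumes differ widely (215 abstracts in 2020 against 907 in 2023), so raw topic counts per year would largely track corpus size rather than topic popularity. A minimal sketch of one way to correct for this, dividing a topic's yearly count by that year's total. The totals are copied from the output above; the per-topic counts and the function name are placeholders, not results from this project.

```python
# Yearly abstract totals, copied from the value_counts() output above.
abstracts_per_year = {
    2014: 242, 2015: 238, 2016: 366, 2017: 326, 2018: 349,
    2019: 594, 2020: 215, 2021: 735, 2022: 895, 2023: 907,
}

def topic_share_by_year(topic_counts, totals):
    """Return the fraction of each year's abstracts assigned to a topic."""
    return {
        year: topic_counts.get(year, 0) / total
        for year, total in sorted(totals.items())
    }

# Hypothetical counts for one topic (placeholders, not real results).
example_topic_counts = {2020: 20, 2023: 40}
shares = topic_share_by_year(example_topic_counts, abstracts_per_year)

# The raw count doubles from 2020 to 2023, yet the share falls.
print(f"2020: {shares[2020]:.3f}, 2023: {shares[2023]:.3f}")
# 2020: 0.093, 2023: 0.044
```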

In [233]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4867 entries, 0 to 4866
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   year        4867 non-null   int64  
 1   abstract    4867 non-null   object 
 2   Unnamed: 2  0 non-null      float64
 3   Unnamed: 3  0 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 152.2+ KB


In [234]:
df['abstract'][0]

'BACKGROUND and PURPOSE: Teenagers with severe hemifacial microsomia can present with complex deformities of skin, adipose layer, mandible, and occlusion. Various individual techniques have been described to treat specific aspects of the deformity, including orthognathic surgery, distraction, bone grafting (vascularized/nonvascularized), and autologous fat transfer.1-3 Total correction of composite deficiencies, including creation of a proper skeletal foundation and soft tissue envelope, without relapse, is challenging in a single stage. We present our approach to teenagers with severe hemifacial microsomia incorporating orthognathic surgery, a rigid external distractor (RED) device, free osteoseptocutaneous fibular transfer, and fat grafting, to achieve stable skeletal and soft tissue correction.\n      METHODS: 3 teenage patients with severe hemifacial microsomia (Pruzansky 3) were treated at Dell Children\x19s Medical Center with a sequential multi-staged approach for both skeletal 
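The `\x19` in `Children\x19s` above is a control character rather than a printable apostrophe. One plausible explanation, stated here as an assumption because the file's true encoding is not shown, is that a right single quotation mark (U+2019) lost its high byte somewhere upstream. A minimal sketch that maps it back to an apostrophe and replaces any other control characters with a space; the function name is hypothetical.

```python
import re

# C0 control characters except tab, newline and carriage return, plus DEL.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def repair_control_chars(text):
    """Replace a stray U+0019 with an apostrophe, other controls with a space."""
    text = text.replace("\x19", "'")  # assumed to be a damaged U+2019
    return CONTROL_CHARS.sub(" ", text)

sample = "treated at Dell Children\x19s Medical Center"
print(repair_control_chars(sample))
# treated at Dell Children's Medical Center
```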

In [235]:
# Collect the abstracts into a list (named `abstracts` to avoid shadowing the built-in abs())
abstracts = df['abstract'].tolist()

In [236]:
# Collect all abstracts into a single text blob
abstracts = df['abstract'].tolist()  # avoid shadowing the built-in abs()
all_text = " ".join(abstracts)

# `all_text` now contains the full text of all abstracts
print(all_text[:1000])  # Show a preview of the combined text


BACKGROUND and PURPOSE: Teenagers with severe hemifacial microsomia can present with complex deformities of skin, adipose layer, mandible, and occlusion. Various individual techniques have been described to treat specific aspects of the deformity, including orthognathic surgery, distraction, bone grafting (vascularized/nonvascularized), and autologous fat transfer.1-3 Total correction of composite deficiencies, including creation of a proper skeletal foundation and soft tissue envelope, without relapse, is challenging in a single stage. We present our approach to teenagers with severe hemifacial microsomia incorporating orthognathic surgery, a rigid external distractor (RED) device, free osteoseptocutaneous fibular transfer, and fat grafting, to achieve stable skeletal and soft tissue correction.
      METHODS: 3 teenage patients with severe hemifacial microsomia (Pruzansky 3) were treated at Dell Childrens Medical Center with a sequential multi-staged approach for both skeletal and s

In [None]:


abs_text_path = "E:\\dataScience\\Fiver orders\\Order 3 Betopic modeling Plastic surgery\\all_abs.txt"
with open(abs_text_path, 'w', encoding='utf-8') as f:
    f.write(all_text)


<div style="background-color:#121212; border-left: 5px solid #00bfbf; padding: 1.5em; font-family: 'Segoe UI', sans-serif; color: #e0e0e0; line-height:1.7;">

<h2 style="color:#00e6e6;">🧹 Preprocessing Pipeline Before Topic Modeling</h2>

<p>This project involves advanced cleaning and structuring of plastic surgery abstracts to ensure high-quality input for topic modeling using BERTopic. Each step below is critical to eliminate noise and enhance semantic clarity.</p>

<hr style="border:none; border-top: 1px dashed #666;">

<h3 style="color:#ffffff;">⚙️ Steps Applied:</h3>
<ol>
  <li><strong>🗃 Drop Empty Columns:</strong> Remove unused or null-filled columns (e.g., "Unnamed: 2").</li>
  <li><strong>🔻 Lowercase Text:</strong> Normalize casing to reduce duplication (e.g., “Surgery” vs “surgery”).</li>
  <li><strong>🧽 Remove Boilerplate Headers:</strong> Strip section headers such as <code>BACKGROUND:</code>, <code>METHODS:</code>, <code>CONCLUSION:</code>, etc.</li>
  <li><strong>✂️ Remove Punctuation & Digits:</strong> Eliminate characters irrelevant to topic formation.</li>
  <li><strong>🧠 Lemmatization:</strong> Convert words to their base form using NLTK's <code>WordNetLemmatizer</code> with part-of-speech tags (e.g., "treated" → "treat").</li>
  <li><strong>🚫 Stopword Removal:</strong> Use both <code>ENGLISH_STOP_WORDS</code> and a domain-specific list to remove high-frequency medical filler terms (e.g., "patient", "procedures", "outcomes").</li>
  <li><strong>📏 Filter Short Texts:</strong> Discard abstracts with fewer than 30 meaningful tokens to avoid noise in topic modeling.</li>
  <li><strong>🧬 Remove Duplicates:</strong> Drop identical or highly similar abstracts to avoid over-weighting.</li>
</ol>

<hr style="border:none; border-top: 1px dashed #666;">

<p style="font-size:0.95em;color:#aaa;"><em>Note:</em> These steps ensure that BERTopic clusters are formed on meaningful linguistic signals rather than formatting or redundant content.</p>

</div>
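One way steps 7 and 8 could look, as a sketch under stated assumptions: token counts use whitespace splitting, and only exact duplicates (after case and whitespace normalization) are dropped. Detecting "highly similar" abstracts would need a similarity measure and is not covered here. The function name and the toy inputs are hypothetical.

```python
import re

def dedupe_and_filter(texts, min_tokens=30):
    """Drop exact duplicates (after normalization) and short texts."""
    seen = set()
    kept = []
    for text in texts:
        # Normalize case and whitespace so trivial variants compare equal.
        key = re.sub(r"\s+", " ", text).strip().lower()
        if len(key.split()) < min_tokens or key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept

docs = [
    "flap " * 30,          # 30 tokens: kept
    "FLAP " * 30,          # same text in a different case: dropped
    "too short to keep",   # fewer than 30 tokens: dropped
]
print(len(dedupe_and_filter(docs)))
# 1
```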


In [5]:
df = df.drop(columns=[col for col in df.columns if 'Unnamed' in col])

In [6]:
def clean_boilerplate(text):
    return re.sub(
        r"\b(background|methods|results|conclusions|purpose|discussion|study design|figure|table|summary|introduction|objectives|design|references|study population|statistical analysis|data availability|acknowledgement|clinical question|study results|study objective)\b[\s:–\-]*",
        " ",
        text,
        flags=re.IGNORECASE
    )
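The pattern above matches the listed words anywhere in the text, so words such as "design", "table" or "results" are also removed from running prose. If the intent is to strip only section headers (an assumption about the goal), a stricter variant can require the trailing colon. The sketch below uses a shortened word list and a hypothetical function name.

```python
import re

# Header words taken from the pattern above (shortened list for the example).
HEADER_WORDS = r"background|methods|results|conclusions?|purpose|objectives?"

# A header is one or more header words joined by "and", followed by a colon.
HEADER_RE = re.compile(
    rf"\b(?:{HEADER_WORDS})(?:\s+and\s+(?:{HEADER_WORDS}))*\s*:\s*",
    flags=re.IGNORECASE,
)

def strip_section_headers(text):
    """Remove 'BACKGROUND:'-style headers but keep the same words in prose."""
    text = HEADER_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = ("BACKGROUND and PURPOSE: Teenagers with severe deformity. "
          "METHODS: 3 patients were treated. The results were stable.")
print(strip_section_headers(sample))
# Teenagers with severe deformity. 3 patients were treated. The results were stable.
```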

In [7]:
def clean_text(text):
    text = clean_boilerplate(text)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
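
# Note: only clean_boilerplate is applied here, so the text keeps its original
# casing, punctuation, and digits; clean_text is defined but not called.
df['cleaned_abs'] = df['abstract'].apply(clean_boilerplate)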

In [8]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stopwords = ENGLISH_STOP_WORDS.union([
    'purpose', 'methods', 'method', 'results', 'conclusion', 'patient', 'patients',
    'study', 'studies', 'clinical', 'analysis', 'significant', 'outcomes',
    'surgery', 'surgeon', 'surgeons', 'treatment', 'procedures', 'case', 'cases',
    'including', 'performed', 'approach', 'report', 'data', 'number', 'using',
    'compared', 'included', 'surgical', 'underwent', 'group', 'significantly',
    'md', 'authors', 'references', 'and', 'of', 'to', 'the', 'in', 'for', 'on',
    'with', 'as', 'by', 'at', 'from', 'a', 'an', 'is', 'was', 'are', 'be', 'this',
    'that', 'it', 'we', 'they', 'their', 'or'
])

def tokenize_and_filter(text):
    tokens = text.split()
    tokens = [t for t in tokens if t not in custom_stopwords and len(t) > 2]
    return " ".join(tokens)

df["cleaned_abs"] = df["cleaned_abs"].apply(tokenize_and_filter)

In [9]:
df['cleaned_abs'].head()

0    Teenagers severe hemifacial microsomia present...
1    160,000 hip knee replacements year UK. After m...
2    3,800 diagnosed sarcoma year 10% requiring spe...
3    350,000 ventral hernia repairs (VHR) yearly US...
4    challenge education offering adequate hands-on...
Name: cleaned_abs, dtype: object
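Row 1 of the output above still contains "After", most likely because the membership test is case-sensitive and the text was not lowercased beforehand. A minimal sketch of a filter that compares lowercased, punctuation-stripped forms while leaving the kept tokens unchanged. The small stopword set is a stand-in for `custom_stopwords`, and the function name is hypothetical.

```python
import string

# Small stand-in for `custom_stopwords` (hypothetical subset for the example).
STOPWORDS = {"after", "the", "and", "patients", "year"}

def filter_tokens(text, stopwords=STOPWORDS, min_len=3):
    """Drop stopwords case-insensitively, ignoring surrounding punctuation."""
    kept = []
    for token in text.split():
        core = token.strip(string.punctuation).lower()
        if len(core) >= min_len and core not in stopwords:
            kept.append(token)
    return " ".join(kept)

print(filter_tokens("After THE repair, patients. recovered and improved"))
# repair, recovered improved
```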

In [246]:
# Run the following on Colab: the local NLTK data path is misconfigured, and
# running it there is quicker than fixing the local setup.

# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger_eng')  # required by recent NLTK versions
# nltk.download('wordnet')

# lemmatizer = WordNetLemmatizer()

# def get_wordnet_pos(word):
#     """Map POS tag to first character used by WordNetLemmatizer"""
#     tag = pos_tag([word])[0][1][0].upper()
#     tag_dict = {
#         "J": wordnet.ADJ,
#         "N": wordnet.NOUN,
#         "V": wordnet.VERB,
#         "R": wordnet.ADV
#     }
#     return tag_dict.get(tag, wordnet.NOUN)

# def lemmatize_text(text):
#     if isinstance(text, str):
#         tokens = word_tokenize(text)
#         lemmatized_tokens = [lemmatizer.lemmatize(t, get_wordnet_pos(t)) for t in tokens]
#         return " ".join(lemmatized_tokens)
#     return ""

# df['lemmatized_abs'] = df['cleaned_abs'].apply(lemmatize_text)
# df[['cleaned_abs', 'lemmatized_abs']].head()
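The commented cell calls `pos_tag` on each word in isolation, which discards sentence context. A common alternative is to tag the full token list once and then map each Penn Treebank tag to a WordNet part of speech. The sketch below shows only the dependency-free mapping step: the single-letter codes are the values behind `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`, and the tagged tokens are hand-written stand-ins for tagger output, not results from this dataset.

```python
# WordNet POS codes (the values behind wordnet.ADJ, NOUN, VERB, ADV).
PENN_PREFIX_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1].upper(), "n")  # default: noun

# Hand-written (token, tag) pairs standing in for the output of
# nltk.pos_tag(tokens) applied once to a whole sentence.
tagged = [("patients", "NNS"), ("were", "VBD"), ("treated", "VBN"),
          ("successfully", "RB"), ("severe", "JJ")]
mapped = [(tok, penn_to_wordnet(tag)) for tok, tag in tagged]
print(mapped)
# [('patients', 'n'), ('were', 'v'), ('treated', 'v'), ('successfully', 'r'), ('severe', 'a')]
```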


In [7]:
## Lemmatized data (raw string keeps the Windows backslashes literal)
df = pd.read_csv(r'E:\dataScience\Fiver orders\Order 3 Betopic modeling Plastic surgery\data\data\lemmatized_abstracts.csv')

In [8]:
df.drop(columns=['Unnamed: 0'], inplace=True)

In [9]:
# Collect all abstracts into a single text blob
abstracts = df['lemmatized_abs'].tolist()  # avoid shadowing the built-in abs()
all_text = " ".join(abstracts)

# `all_text` now contains the full text of all abstracts
print(all_text[:1000])  # Show a preview of the combined text

abs_text_path = "E:\\dataScience\\Fiver orders\\Order 3 Betopic modeling Plastic surgery\\lemmatized_abs.txt"
with open(abs_text_path, 'w', encoding='utf-8') as f:
    f.write(all_text)


Teenagers severe hemifacial microsomia present complex deformity skin , adipose layer , mandible , occlusion . Various individual technique described treat specific aspect deformity , orthognathic surgery , distraction , bone graft ( vascularized/nonvascularized ) , autologous fat transfer.1-3 Total correction composite deficiency , creation proper skeletal foundation soft tissue envelope , relapse , challenge single stage . present teenager severe hemifacial microsomia incorporate orthognathic surgery , rigid external distractor ( RED ) device , free osteoseptocutaneous fibular transfer , fat graft , achieve stable skeletal soft tissue correction . teenage severe hemifacial microsomia ( Pruzansky treat Dell Childrens Medical Center sequential multi-staged skeletal soft tissue correction . Treatment protocol follow : Stage conventional orthognathic surgery , application RED device traction correct mandible . Stage II : Mandible facial soft tissue reconstruction free fibula osteoseptoc

In [10]:
df_cleaned = df.copy()

# Step 1: Remove statistical noise such as OR [x, y], percentages, and p-value labels
def remove_statistical_noise(text):
    # Remove bracketed intervals like [x, y] first, while their digits are still present
    text = re.sub(r"\[\s*\d+\.?\d*,?\s*\d*\.?\d*\s*\]", " ", text)
    text = re.sub(r"\b(OR|CI|AUC|ROC|p)\b[\s:=\[\]\d.,%-]*", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\d+(\.\d+)?%?", " ", text)  # remove remaining numbers and percentages
    return text

df_cleaned["step1_stats_removed"] = df_cleaned["lemmatized_abs"].astype(str).apply(remove_statistical_noise)

# Step 2: Remove named entities and boilerplate like degrees, affiliations, author roles
def remove_named_entities(text):
    return re.sub(r"\b(presenter|co-author|affiliation|university|hospital|institute|center|clinic|department|md|phd|mph)\b", " ", text, flags=re.IGNORECASE)

df_cleaned["step2_entities_removed"] = df_cleaned["step1_stats_removed"].apply(remove_named_entities)

# Step 3: Normalize dashes, then remove the remaining non-ASCII characters
def normalize_text(text):
    text = re.sub(r"[–—−]", "-", text)  # normalize dashes first; the ASCII step would otherwise drop them
    text = text.encode("ascii", "ignore").decode("ascii")
    return text

df_cleaned["step3_normalized"] = df_cleaned["step2_entities_removed"].apply(normalize_text)

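# Step 4: Expand stopwords (merge built-in + domain-specific)
# Import the set directly so no module named `text` clashes with the `text` parameters
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stopwords = ENGLISH_STOP_WORDS.union([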
    # Domain-specific
    'purpose', 'methods', 'method', 'results', 'conclusion', 'patient', 'patients',
    'study', 'studies', 'clinical', 'analysis', 'significant', 'outcomes',
    'surgery', 'surgeon', 'surgeons', 'treatment', 'procedures', 'case', 'cases',
    'including', 'performed', 'approach', 'report', 'data', 'number', 'using',
    'compared', 'included', 'surgical', 'underwent', 'group', 'significantly',
    'md', 'authors', 'references', 'augmentation', 'hospital', 'plastics', 'abstract',
    # Generic filler
    'and', 'of', 'to', 'the', 'in', 'for', 'on', 'with', 'as', 'by', 'at', 'from',
    'a', 'an', 'is', 'was', 'are', 'be', 'this', 'that', 'it', 'we', 'they',
    'their', 'or'
])

def remove_stopwords(text):
    tokens = text.split()
    return " ".join([t for t in tokens if t.lower() not in custom_stopwords and len(t) > 2])

df_cleaned["step4_stopwords_removed"] = df_cleaned["step3_normalized"].apply(remove_stopwords)

# Final column for modeling
df_cleaned["final_ready"] = df_cleaned["step4_stopwords_removed"]
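Encoding to ASCII with errors ignored, as in Step 3, also discards accented letters, which can mangle eponyms and names. A minimal sketch of an alternative that maps dash variants to a hyphen and then decomposes accented characters with `unicodedata`, so the base letter survives. The function name `to_ascii` and the sample string are hypothetical.

```python
import re
import unicodedata

def to_ascii(text):
    """Normalize dashes, then strip accents while keeping base letters."""
    # Replace dash variants before any ASCII conversion, or they are lost.
    text = re.sub(r"[\u2010-\u2015\u2212]", "-", text)
    # NFKD splits an accented letter into base letter plus combining mark;
    # the combining mark is then dropped by the ASCII encoding.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")

print(to_ascii("M\u00f6bius syndrome \u2013 free\u2014flap transfer"))
# Mobius syndrome - free-flap transfer
```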


In [11]:
df['final_ready'] = df_cleaned['final_ready']



In [12]:

# Additional filtering
def final_clean(text):
    # Collapse immediately repeated words (e.g. "fat fat graft" -> "fat graft")
    text = re.sub(r'\b(\w+)( \1\b)+', r'\1', text)
    
    # Remove short tokens
    text = ' '.join([w for w in text.split() if len(w) > 2])
    
    # Remove known leftover noise
    noise_words = {'msc', 'md', 'beta', 'aor', 'mnc', 'qq', 'mph'}
    text = ' '.join([w for w in text.split() if w.lower() not in noise_words])
    
    return text.strip()

# Apply to final_ready column
df['final_ready'] = df['final_ready'].astype(str).apply(final_clean)
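Lists like `noise_words` are usually built by inspecting what remains after cleaning. A minimal sketch of one way to surface candidates: rank short tokens by the number of documents they appear in. The function name, the length threshold and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def short_token_candidates(texts, max_len=4, top_n=5):
    """Rank short tokens by the number of documents they appear in."""
    doc_freq = Counter()
    for text in texts:
        # A set, so each token counts once per document.
        doc_freq.update({w.lower() for w in text.split() if len(w) <= max_len})
    return doc_freq.most_common(top_n)

# Toy documents (hypothetical) with 'msc' as recurring credential residue.
docs = [
    "Jane Doe msc flap reconstruction",
    "John Roe msc cleft repair",
    "mastectomy flap msc outcome",
]
print(short_token_candidates(docs, top_n=1))
# [('msc', 3)]
```

Candidates still need manual review, since short clinical terms such as "flap" also rank highly.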

In [13]:
df.to_csv(r'E:\dataScience\Fiver orders\Order 3 Betopic modeling Plastic surgery\data\data\final.csv', index=False, encoding='utf-8-sig')

In [14]:
df = pd.read_csv(r'E:\dataScience\Fiver orders\Order 3 Betopic modeling Plastic surgery\data\data\final.csv', encoding='utf-8-sig')

In [15]:
df['final_ready'][4000]

'Delineating Effectiveness Perioperative Tranexamic Acid Reducing Bleeding Events Joseph Kuhn Tegan Clarke Aaron Segura Eric Ensign Samantha Huang Dominick Byrd Joshua Harrison Anil Shetty Panniculectomy commonly procedure restores abdominal cosmesis improves hygiene enhances health-related quality life experience massive medical weight loss obesity-related dysmorphic change pannus comorbidities subclinical nutritional deficiency contribute elevate complication profile Risk mitigation strategy include preservation lymphatics mattress progressive tension suture hemostatic agent tissue adhesive negative pressure wound therapy Tranexamic acid block conversion plasminogen plasmin gain recognition pharmacologic adjunct reduce hematoma bruising blood product transfusion post operative edema harness antifibrinolytic anti- inflammatory property hope discern effectiveness parenteral tranexamic acid reduce bleeding complication follow panniculectomy retrospective chart review consecutive pannicu