<div style="background-color:#121212; border-left: 5px solid #00bfbf; padding: 1.5em; font-family: 'Segoe UI', sans-serif; color: #e0e0e0; line-height:1.7;">

<h2 style="color:#00e6e6;">🧠 Plastic Surgery Abstracts: Topic Modeling & Trend Analysis</h2>

<p><strong style="color:#fff;">📍 Objective:</strong><br>
To apply advanced <strong style="color:#ffffff;">Natural Language Processing (NLP)</strong> techniques — specifically <strong style="color:#ffffff;">BERTopic</strong> — to a curated dataset of plastic surgery abstracts. This project aims to <strong style="color:#ffffff;">discover latent research themes</strong>, visualize their <strong style="color:#ffffff;">evolution over time</strong>, and generate reproducible insights for clinical and academic interpretation.</p>

<hr style="border:none; border-top: 1px dashed #666;">

<h3 style="color:#ffffff;">🎯 Project Goals:</h3>
<ul>
  <li><strong>🧩 Theme Discovery:</strong> Identify core topics from ~5,000 abstracts using clustering techniques.</li>
  <li><strong>🔑 Keyword Extraction:</strong> Use <code style="color:#111;background:#fff;padding:1px 4px;border-radius:3px;">c-TF-IDF</code> to surface meaningful and distinct terms for each topic.</li>
  <li><strong>📈 Trend Analysis:</strong> Track how topics emerge, decline, or dominate over the years.</li>
  <li><strong>🧭 Topic Relationships:</strong> Visualize topic similarity using UMAP and hierarchical dendrograms.</li>
  <li><strong>📦 Deliverables:</strong> Export reproducible models (.pkl), clean code, and publication-ready visuals.</li>
</ul>

<p style="font-size:0.95em; color:#aaa;">
<em>Note:</em> A rigorous preprocessing pipeline was applied including section header stripping, lemmatization, and medical-domain stopword removal to enhance topic coherence and signal clarity.
</p>
</div>

In [230]:
import pandas as pd
import numpy as np
import nltk
import os
import spacy
import re  # from regex
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets  # example module from scikit-learn
from umap import UMAP
import hdbscan
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from transformers import pipeline
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

In [231]:
df = pd.read_csv('E:\dataScience\Fiver orders\Order 3 Betopic modeling Plastic surgery\data\data\merged_abstracts.csv',encoding='latin1')

In [232]:
df['year'].value_counts()

year
2023    907
2022    895
2021    735
2019    594
2016    366
2018    349
2017    326
2014    242
2015    238
2020    215
Name: count, dtype: int64

In [233]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4867 entries, 0 to 4866
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   year        4867 non-null   int64  
 1   abstract    4867 non-null   object 
 2   Unnamed: 2  0 non-null      float64
 3   Unnamed: 3  0 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 152.2+ KB


In [234]:
df['abstract'][0]

'BACKGROUND and PURPOSE: Teenagers with severe hemifacial microsomia can present with complex deformities of skin, adipose layer, mandible, and occlusion. Various individual techniques have been described to treat specific aspects of the deformity, including orthognathic surgery, distraction, bone grafting (vascularized/nonvascularized), and autologous fat transfer.1-3 Total correction of composite deficiencies, including creation of a proper skeletal foundation and soft tissue envelope, without relapse, is challenging in a single stage. We present our approach to teenagers with severe hemifacial microsomia incorporating orthognathic surgery, a rigid external distractor (RED) device, free osteoseptocutaneous fibular transfer, and fat grafting, to achieve stable skeletal and soft tissue correction.\n      METHODS: 3 teenage patients with severe hemifacial microsomia (Pruzansky 3) were treated at Dell Children\x19s Medical Center with a sequential multi-staged approach for both skeletal 

In [235]:
abs = []
for i in df['abstract']:
    abs.append(i)

In [236]:
# Collect all abstracts into a single text blob
abs = df['abstract'].tolist()
all_text = " ".join(abs)

# `all_text` now contains the full text of all abstracts
print(all_text[:1000])  # Show a preview of the combined text


BACKGROUND and PURPOSE: Teenagers with severe hemifacial microsomia can present with complex deformities of skin, adipose layer, mandible, and occlusion. Various individual techniques have been described to treat specific aspects of the deformity, including orthognathic surgery, distraction, bone grafting (vascularized/nonvascularized), and autologous fat transfer.1-3 Total correction of composite deficiencies, including creation of a proper skeletal foundation and soft tissue envelope, without relapse, is challenging in a single stage. We present our approach to teenagers with severe hemifacial microsomia incorporating orthognathic surgery, a rigid external distractor (RED) device, free osteoseptocutaneous fibular transfer, and fat grafting, to achieve stable skeletal and soft tissue correction.
      METHODS: 3 teenage patients with severe hemifacial microsomia (Pruzansky 3) were treated at Dell Childrens Medical Center with a sequential multi-staged approach for both skeletal and s

In [237]:
abs_text_path = "E:\\dataScience\\Fiver orders\\Order 3 Betopic modeling Plastic surgery\\all_abs.txt"
with open(abs_text_path, 'w', encoding='utf-8') as f:
    f.write(all_text)


<div style="background-color:#121212; border-left: 5px solid #00bfbf; padding: 1.5em; font-family: 'Segoe UI', sans-serif; color: #e0e0e0; line-height:1.7;">

<h2 style="color:#00e6e6;">🧹 Preprocessing Pipeline Before Topic Modeling</h2>

<p>This project involves advanced cleaning and structuring of plastic surgery abstracts to ensure high-quality input for topic modeling using BERTopic. Each step below is critical to eliminate noise and enhance semantic clarity.</p>

<hr style="border:none; border-top: 1px dashed #666;">

<h3 style="color:#ffffff;">⚙️ Steps Applied:</h3>
<ol>
  <li><strong>🗃 Drop Empty Columns:</strong> Remove unused or null-filled columns (e.g., "Unnamed: 2").</li>
  <li><strong>🔻 Lowercase Text:</strong> Normalize casing to reduce duplication (e.g., “Surgery” vs “surgery”).</li>
  <li><strong>🧽 Remove Boilerplate Headers:</strong> Strip section headers such as <code>BACKGROUND:</code>, <code>METHODS:</code>, <code>CONCLUSION:</code>, etc.</li>
  <li><strong>✂️ Remove Punctuation & Digits:</strong> Eliminate characters irrelevant to topic formation.</li>
  <li><strong>🧠 Lemmatization:</strong> Convert words to their base form using <code>spaCy</code> (e.g., "treated" → "treat").</li>
  <li><strong>🚫 Stopword Removal:</strong> Use both <code>ENGLISH_STOP_WORDS</code> and a domain-specific list to remove high-frequency medical filler terms (e.g., "patient", "procedure", "outcome").</li>
  <li><strong>📏 Filter Short Texts:</strong> Discard abstracts with fewer than 30 meaningful tokens to avoid noise in topic modeling.</li>
  <li><strong>🧬 Remove Duplicates:</strong> Drop identical or highly similar abstracts to avoid over-weighting.</li>
</ol>

<hr style="border:none; border-top: 1px dashed #666;">

<p style="font-size:0.95em;color:#aaa;"><em>Note:</em> These steps ensure that BERTopic clusters are formed on meaningful linguistic signals rather than formatting or redundant content.</p>

</div>


In [238]:
df = df.drop(columns=[col for col in df.columns if 'Unnamed' in col])

In [239]:
def clean_boilerplate(text):
    return re.sub(
        r"\b(background|methods|results|conclusions|purpose|discussion|study design|figure|table|summary|introduction|objectives|design|references|study population|statistical analysis|data availability|acknowledgement|clinical question|study results|study objective)\b[\s:–\-]*",
        " ",
        text,
        flags=re.IGNORECASE
    )

In [240]:
def clean_text(text):
    text = clean_boilerplate(text)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
df['cleaned_abs']= df['abstract'].apply(clean_boilerplate)

In [241]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stopwords = ENGLISH_STOP_WORDS.union([
    'purpose', 'methods', 'method', 'results', 'conclusion', 'patient', 'patients',
    'study', 'studies', 'clinical', 'analysis', 'significant', 'outcomes',
    'surgery', 'surgeon', 'surgeons', 'treatment', 'procedures', 'case', 'cases',
    'including', 'performed', 'approach', 'report', 'data', 'number', 'using',
    'compared', 'included', 'surgical', 'underwent', 'group', 'significantly',
    'md', 'authors', 'references', 'and', 'of', 'to', 'the', 'in', 'for', 'on',
    'with', 'as', 'by', 'at', 'from', 'a', 'an', 'is', 'was', 'are', 'be', 'this',
    'that', 'it', 'we', 'they', 'their', 'or'
])

def tokenize_and_filter(text):
    tokens = text.split()
    tokens = [t for t in tokens if t not in custom_stopwords and len(t) > 2]
    return " ".join(tokens)

df["cleaned_abs"] = df["cleaned_abs"].apply(tokenize_and_filter)

In [242]:
df['cleaned_abs'].head()

0    Teenagers severe hemifacial microsomia present...
1    160,000 hip knee replacements year UK. After m...
2    3,800 diagnosed sarcoma year 10% requiring spe...
3    350,000 ventral hernia repairs (VHR) yearly US...
4    challenge education offering adequate hands-on...
Name: cleaned_abs, dtype: object

In [244]:
lemmatizer = WordNetLemmatizer()

def lemmatize_text_simple(text):
    tokens = word_tokenize(text)
    lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
    return " ".join(lemmatized)

df['lemmatized_abs'] = df['cleaned_abs'].apply(lemmatize_text_simple)
df['lemmatized_abs'].head()

LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - 'e:\\dataScience\\Fiver orders\\Order 3 Betopic modeling Plastic surgery\\Plastic_Surgery_topicModeling\\nltk_data'
**********************************************************************


In [None]:
# There is no error shown in the provided context.
# If you encountered an error, please provide the error message for specific help.
# If you are referring to a variable or file not found, ensure all previous cells are executed in order.
# If you meant to check for a specific error, please clarify.---------------------------------------------------------------------------
LookupError                               Traceback (most recent call last)
Cell In[244], line 8
      5     lemmatized = [lemmatizer.lemmatize(token) for token in tokens]
      6     return " ".join(lemmatized)
----> 8 df['lemmatized_abs'] = df['cleaned_abs'].apply(lemmatize_text_simple)
      9 df['lemmatized_abs'].head()

File e:\dataScience\Fiver orders\Order 3 Betopic modeling Plastic surgery\Plastic_Surgery_topicModeling\.venv\Lib\site-packages\pandas\core\series.py:4935, in Series.apply(self, func, convert_dtype, args, by_row, **kwargs)
   4800 def apply(
   4801     self,
   4802     func: AggFuncType,
   (...)   4807     **kwargs,
   4808 ) -> DataFrame | Series:
   4809     """
   4810     Invoke function on values of Series.
   4811 
   (...)   4926     dtype: float64
   4927     """
   4928     return SeriesApply(
   4929         self,
   4930         func,
   4931         convert_dtype=convert_dtype,
   4932         by_row=by_row,
   4933         args=args,
...

  Searched in:
    - 'e:\\dataScience\\Fiver orders\\Order 3 Betopic modeling Plastic surgery\\Plastic_Surgery_topicModeling\\nltk_data'
**********************************************************************
Output is truncated. View as a scrollable element or open in a text editor. Adjust cell output settings...