# Scientific Content Generation Using Topic Modeling

##### This notebook builds scientific content embeddings from PubMed abstracts using Latent Dirichlet Allocation (LDA).

Goal:

- Transform unstructured biomedical text into structured topic distributions.
- Identify latent scientific themes that can serve as input features for an LSTM-based recommender system predicting the Next Best Content (NBC) for healthcare professionals (HCPs).

By the end, we’ll obtain:

- Topic-level representations (topic_0 ... topic_9)
- Dominant topic labels and keywords
- Medical specialty mapping for each topic

#### Step 1 : Importing Required Libraries

Import the key Python packages for:

- **Text Preprocessing:** spacy, nltk, re
- **Topic Modeling:** gensim
- **Visualization:** pyLDAvis, matplotlib, seaborn
- **Data Handling:** pandas, numpy

These libraries together provide a robust NLP + statistical modeling pipeline.
**LDA** assumes a **bag-of-words (BoW)** representation of text, so each abstract must be tokenized, lemmatized, and stripped of stop words before modeling.

In [1]:
# import required libraries
import pandas as pd
import numpy as np
import spacy
import nltk
import re

from gensim.models import LdaModel
from gensim.models import CoherenceModel
from gensim import corpora
from gensim.corpora.dictionary import Dictionary

from nltk.corpus import stopwords

# EDA visualization
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
import seaborn as sn
sn.set()

# Enable inline pyLDAvis rendering inside the notebook
pyLDAvis.enable_notebook()

In [2]:
# Download nltk stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bhand\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Step 2 : Load and Inspect PubMed Abstracts

We load the dataset of PubMed abstracts (2020–2025).
Each abstract is treated as one document.
In LDA’s statistical framework:

- Each document is a mixture of latent topics
- Each topic is a distribution over words

The corpus therefore follows a hierarchical Bayesian model where both documents and words are generated probabilistically.
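
The two bullet points above combine into a single formula: the probability of a word in a document is the sum, over topics, of the document's topic weight times the topic's word probability. The snippet below is a minimal sketch with made-up numbers; the topic names and all probabilities are illustrative assumptions, not values fitted in this notebook.

```python
# Toy LDA parameters (made-up numbers, not fitted values).
# theta: topic mixture of one document; phi: word distribution of each topic.
theta = {"endocrine": 0.7, "cardio": 0.3}
phi = {
    "endocrine": {"thyroid": 0.5, "insulin": 0.4, "stent": 0.1},
    "cardio": {"thyroid": 0.1, "insulin": 0.2, "stent": 0.7},
}


def word_probability(word, theta, phi):
    """p(word | document) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(theta[k] * phi[k].get(word, 0.0) for k in theta)


doc_word_probs = {w: word_probability(w, theta, phi)
                  for w in ["thyroid", "insulin", "stent"]}
print(doc_word_probs)
```

Because each topic's word probabilities sum to 1 and the topic weights sum to 1, the resulting word probabilities for the document also sum to 1.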

In [3]:
# Loading pubmed abstracts csv

pubmed_df = pd.read_csv("C:/Users/bhand/Desktop/Data Science - My Collection/Deep Learning Project - 1/Data/Data for Recommender System/HCP_Hybrid_Recommendation_System/data/pubmed_abstracts_2020_2025.csv")
print(f" Loaded {len(pubmed_df)} abstracts")

# View the dataframe
pubmed_df.head()

 Loaded 509 abstracts


| | PMID | Title | Abstract | Topic |
|---|---|---|---|---|
| 0 | 40887509 | Sodium glucose co-transporter 2 inhibitor-asso... | Perioperative euglycaemic diabetic ketoacidosi... | Diabetes Management |
| 1 | 40886230 | Can Dual Incretin Receptor Agonists Exert Bett... | Despite advances in cardiovascular risk reduct... | Diabetes Management |
| 2 | 40885915 | The association between diabetes management se... | Self-efficacy emerges as a crucial element tha... | Diabetes Management |
| 3 | 40884731 | Intrinsic Motivation Moderates the Effect of F... | Few studies have examined effects of intrinsic... | Diabetes Management |
| 4 | 40877913 | Inhibitory effects of the flavonoids extracted... | Pollen Typhae (PT), a traditional Chinese medi... | Diabetes Management |


#### Step 3 : Load SpaCy Model & Define Domain Stopwords

The spaCy English medium model *(en_core_web_md)* is loaded. It includes pre-trained word embeddings and a **general-purpose Named Entity Recognizer (NER)** whose labels cover categories such as people, organizations, locations, and dates rather than biomedical types.

Next, a domain-specific stopword list is defined to remove common biomedical filler words such as *“et”, “al”, “study”, “method”*, etc.
These words appear frequently across papers but carry little discriminative information, so removing them improves topic interpretability.
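
One detail worth checking when applying such a list is whether the filter looks at the surface form or the lemma. The sketch below uses hypothetical (surface form, lemma) pairs, written by hand rather than produced by spaCy, to show that a surface-form check alone lets inflected forms such as "studies" through as the lemma "study".

```python
# Hypothetical (surface form, lemma) pairs, as a lemmatizer might produce them.
tokens = [("Studies", "study"), ("examine", "examine"), ("thyroid", "thyroid"),
          ("methods", "method"), ("et", "et")]
domain_stopwords = {"et", "al", "study", "method", "methods", "results"}


def keep_by_text(pairs, stopwords):
    """Filter on the lowercased surface form only; return surviving lemmas."""
    return [lemma for text, lemma in pairs if text.lower() not in stopwords]


def keep_by_text_and_lemma(pairs, stopwords):
    """Filter on both the surface form and the lemma."""
    return [lemma for text, lemma in pairs
            if text.lower() not in stopwords and lemma not in stopwords]


print(keep_by_text(tokens, domain_stopwords))
print(keep_by_text_and_lemma(tokens, domain_stopwords))
```

Checking the lemma as well is one possible way to keep filler words out of the vocabulary regardless of inflection.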

In [4]:
# Loading spacy data

nlp = spacy.load("en_core_web_md")

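# Custom stop words
# Note: the name `stopwords` is rebound here and shadows the nltk.corpus import from cell 1
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
extra_stopwords = {"et", "al", "study", "patients", "conclusion", "method", "methods", "results"}
stopwords = spacy_stopwords.union(extra_stopwords)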

#### Step 4: Biomedical Entity-Aware Preprocessing

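A custom preprocessing function is defined that:

- Converts text to lowercase and replaces non-alphanumeric characters with spaces.
- Uses spaCy NER to detect entities.
- Replaces each entity whose label names a disease, chemical/drug, gene/protein, or clinical trial with a placeholder (DISEASE, DRUG, GENE, TRIAL) to preserve biomedical context. Entities with any other label are kept as lemmatized words.
- Lemmatizes and filters the remaining non-entity tokens.

*This reduces noise and lets the model learn from conceptual patterns such as "Disease-Drug-Gene" relationships, which should improve semantic cohesion and topic interpretability. The placeholder branches only fire when the loaded NER model emits such labels; with the general-purpose labels of en_core_web_md, most entities fall through to the lemmatized-word branch, as the sample output below (april, palestine, hispanic) suggests.*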
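
The text cleanup and the placeholder rules can be sketched without spaCy. In the snippet below, `normalize` mirrors the regex step and `placeholder_for` is a hypothetical helper that mirrors the label-to-placeholder rules. Labels such as DISEASE or CHEMICAL are assumed to come from a biomedical NER model; the example strings are invented.

```python
import re


def normalize(text):
    """Lowercase and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())


def placeholder_for(label, entity_text):
    """Return a placeholder for an entity label, or None if no rule matches."""
    label = label.lower()
    if "disease" in label:
        return "<DISEASE>"
    if "chemical" in label or "drug" in label:
        return "<DRUG>"
    if "gene" in label or "protein" in label:
        return "<GENE>"
    if "clinical_trial" in label or entity_text.lower().startswith("nct"):
        return "<TRIAL>"
    return None  # caller falls back to the entity's lemmatized words


cleaned = normalize("SGLT2-inhibitors (n=12) reduced HbA1c.").split()
print(cleaned)
print(placeholder_for("DISEASE", "diabetes"), placeholder_for("DATE", "april"))
```

A general-purpose label such as DATE matches none of the rules, so that entity would be kept as ordinary words.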

In [5]:
# Preprocessing pubmed abstracts

def preprocess_with_entities(text):
    # Lowercase & keep only alphanumerics
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    doc = nlp(text)

    tokens = []
    entity_spans = [(ent.start, ent.end, ent) for ent in doc.ents]

    skip_indexes = set()  # to track which tokens belong to entities

    for start, end, ent in entity_spans:
        skip_indexes.update(range(start, end))  # mark tokens inside entity
        # Replace entity with a placeholder depending on its type
        if "disease" in ent.label_.lower():
            tokens.append("<DISEASE>")
        elif "chemical" in ent.label_.lower() or "drug" in ent.label_.lower():
            tokens.append("<DRUG>")
        elif "gene" in ent.label_.lower() or "protein" in ent.label_.lower():
            tokens.append("<GENE>")
        elif "clinical_trial" in ent.label_.lower() or ent.text.lower().startswith("nct"):
            tokens.append("<TRIAL>")
        else:
            # For untagged scientific entities, append their lemmatized words
            tokens.extend([w.lemma_.lower() for w in ent if w.is_alpha and w.text.lower() not in stopwords])

    # Process non-entity tokens
    for i, token in enumerate(doc):
        if i in skip_indexes:  # skip if token already covered by an entity
            continue
        if token.is_alpha and token.text.lower() not in stopwords and len(token.lemma_) > 2:
            tokens.append(token.lemma_.lower())

    return list(tokens)

In [6]:
# Applying entity-aware preprocessing to each abstract

pubmed_df["tokens"] = pubmed_df["Abstract"].astype(str).apply(preprocess_with_entities)
print(pubmed_df['tokens'].head())

0    [april, day, day, perioperative, euglycaemic, ...
1    [mace, hr, ci, despite, advance, cardiovascula...
2    [palestine, self, efficacy, emerge, crucial, e...
3    [hispanic, debs, p, study, examine, effect, in...
4    [pollen, typhae, pt, chinese, traditional, med...
Name: tokens, dtype: object


#### Step 5 : Create Dictionary and Corpus

**After preprocessing:**

- Build a Dictionary mapping each token to an integer ID.
- Construct a Corpus, which represents each document as a bag-of-words vector:

     $\text{doc}_i = \{(w_1, n_1), (w_2, n_2), \ldots, (w_k, n_k)\}$

**Filtering rules:**

- no_below=5: Remove rare words appearing in fewer than 5 docs.
- no_above=0.5: Remove overly common words (in >50% of docs).
- keep_n=1000: After the two filters above, retain at most the 1000 most frequent terms.

These thresholds help balance vocabulary diversity and topic quality.
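
To make the filtering rules and the bag-of-words format concrete, here is a small pure-Python sketch. It approximates what the dictionary filtering does and is not gensim's implementation; `build_vocab` and `doc_to_bow` are hypothetical helpers, and the four toy documents are invented.

```python
from collections import Counter


def build_vocab(docs, no_below=2, no_above=0.5, keep_n=1000):
    """Keep tokens whose document frequency passes both thresholds."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first (ties broken alphabetically), truncated to keep_n.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return {tok: idx for idx, tok in enumerate(kept[:keep_n])}


def doc_to_bow(doc, vocab):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())


docs = [
    ["thyroid", "hormone", "thyroid", "risk"],
    ["glucose", "insulin", "risk"],
    ["thyroid", "hormone", "risk"],
    ["glucose", "risk", "stent"],
]
vocab = build_vocab(docs)
bows = [doc_to_bow(d, vocab) for d in docs]
print(vocab)
print(bows)
```

In this toy corpus "risk" appears in every document and is dropped as too common, while "insulin" and "stent" appear once and are dropped as too rare.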

In [7]:
# Build the dictionary and corpus from the PubMed tokens

dictionary = corpora.Dictionary(pubmed_df["tokens"])
dictionary.filter_extremes(no_below = 5, no_above = 0.5, keep_n = 1000)

# corpus
corpus = [dictionary.doc2bow(tokens) for tokens in pubmed_df["tokens"]]

#### Step 6 : Train the LDA Topic Model

**Latent Dirichlet Allocation (LDA)** model is trained with:

- num_topics = 10
- passes = 15 (full passes through the corpus during training)
- alpha = 'auto' (learns an asymmetric document-topic prior from the corpus)

**LDA Assumptions**:
- Each document is generated as a mixture of topics drawn from a Dirichlet distribution: $\theta_d \sim \mathrm{Dir}(\alpha)$

- Each topic is a mixture of words drawn from another Dirichlet: $\phi_k \sim \mathrm{Dir}(\beta)$

- Words in a document are generated from a topic-specific multinomial distribution.

Thus, LDA captures the hidden semantic structure of the corpus through probabilistic inference.
Each topic is displayed by its top-weighted keywords.
Interpreting these keywords provides insight into the semantic domain represented by each latent topic.
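
The generative assumptions listed above can be simulated in a few lines. This is a minimal sketch using only the standard library; the vocabulary, the prior values, and the helper names `sample_dirichlet` and `generate_document` are illustrative assumptions. It shows the forward process only, not the inference that training performs.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, topic_word, vocab, rng):
    """Sample a topic mixture, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)  # theta_d ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(theta)), weights=theta)[0]        # pick a topic
        words.append(rng.choices(vocab, weights=topic_word[k])[0])  # pick a word
    return theta, words


rng = random.Random(42)
vocab = ["thyroid", "hormone", "glucose", "insulin"]
beta = [0.5] * len(vocab)
topic_word = [sample_dirichlet(beta, rng) for _ in range(2)]  # phi_k ~ Dir(beta)
theta, words = generate_document(8, [0.5, 0.5], topic_word, vocab, rng)
print(theta, words)
```

Training an LDA model works in the opposite direction: given only the observed words, it estimates the topic mixtures and topic-word distributions that plausibly generated them.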

In [8]:
# Training LDA model
NUM_TOPICS = 10
lda_model = LdaModel(corpus = corpus,
                     id2word = dictionary,
                     num_topics = NUM_TOPICS,
                     passes = 15,
                     random_state = 42,
                     alpha = 'auto',
                     per_word_topics=True)              

# Print topics as per idx
for idx, topic in lda_model.print_topics(num_words = 10):
    print(f"Topic {idx}:{topic}")

Topic 0:0.082*"thyroid" + 0.029*"disorder" + 0.022*"woman" + 0.020*"hormone" + 0.018*"disease" + 0.016*"hypothyroidism" + 0.014*"association" + 0.014*"pregnancy" + 0.012*"tsh" + 0.012*"level"
Topic 1:0.022*"health" + 0.020*"care" + 0.014*"healthcare" + 0.009*"self" + 0.009*"model" + 0.008*"include" + 0.008*"factor" + 0.008*"intervention" + 0.008*"management" + 0.008*"medical"
Topic 2:0.018*"treatment" + 0.015*"use" + 0.014*"respiratory" + 0.013*"cancer" + 0.013*"type" + 0.012*"therapy" + 0.012*"clinical" + 0.010*"need" + 0.009*"care" + 0.009*"obesity"
Topic 3:0.034*"respiratory" + 0.023*"care" + 0.016*"disease" + 0.015*"practice" + 0.012*"use" + 0.011*"recommendation" + 0.009*"clinical" + 0.008*"pulmonary" + 0.007*"report" + 0.007*"year"
Topic 4:0.034*"thyroid" + 0.019*"disorder" + 0.016*"drug" + 0.013*"infection" + 0.013*"role" + 0.011*"include" + 0.010*"relate" + 0.010*"adverse" + 0.010*"function" + 0.009*"gut"
Topic 5:0.019*"management" + 0.019*"diabetes" + 0.015*"glucose" + 0.014*"

#### Step 7 : Model Evaluation — Coherence 

Model evaluation is performed using standard metrics:

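**Coherence Score (c_v)**: Measures semantic similarity between the top words of each topic.
 Higher values (roughly 0.5–0.7) indicate more interpretable topics.

In this case, the score is **0.39**, which falls below that range and points to only moderate coherence; tuning the number of topics or the preprocessing may raise it.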
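
To give an intuition for what a coherence score rewards, the sketch below computes a simplified UMass-style score, which is a different and simpler measure than c_v: it averages the log ratio of co-document counts over pairs of top words. The helper name and the toy documents are assumptions made for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for w_j, w_i in combinations(top_words, 2):  # w_j ranks above w_i
        d_j = doc_count(w_j)
        if d_j:
            scores.append(math.log((doc_count(w_i, w_j) + 1) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")


docs = [
    ["thyroid", "hormone", "tsh"],
    ["thyroid", "hormone"],
    ["glucose", "insulin"],
    ["glucose", "insulin", "thyroid"],
]
coherent = umass_coherence(["thyroid", "hormone"], docs)
mixed = umass_coherence(["hormone", "insulin"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is the behavior any coherence measure is meant to capture.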

In [9]:
# Evaluation using Coherence 
coherence_model = CoherenceModel(model = lda_model,
                                 texts = pubmed_df["tokens"],
                                 dictionary = dictionary,
                                 coherence = 'c_v')

coherence_score = coherence_model.get_coherence()

print(f" Coherence Score: {coherence_score}")

 Coherence Score: 0.3924401426074726


#### Step 8 : Extract Document-Level Topic Distributions

Compute for each abstract:

- The topic probability distribution (a 10-dimensional vector).
- The dominant topic with the highest probability.
- The top 10 keywords representing that dominant topic.

These topic probability vectors serve as dense, interpretable features for downstream machine learning, such as content recommendation or clustering.
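
The two core operations, picking the dominant topic and expanding a sparse list of (topic, probability) pairs into a fixed-length vector, can be sketched in plain Python. The helper names and the sample probabilities are illustrative assumptions.

```python
def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, probability) pairs into a fixed-length list."""
    vec = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        vec[topic_id] = prob
    return vec


def dominant_topic(sparse_topics):
    """Return the topic id with the highest probability."""
    return max(sparse_topics, key=lambda pair: pair[1])[0]


sparse = [(1, 0.53), (2, 0.45), (5, 0.02)]
dense = to_dense(sparse, num_topics=10)
top = dominant_topic(sparse)
print(top, dense)
```

Topics missing from the sparse list get a probability of 0.0, so every document ends up with a vector of the same length.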

In [10]:
# Extract Topics
topic_keywords = []
topic_ids = []
topic_distributions = []

for i, row in enumerate(corpus):
    topics = lda_model.get_document_topics(row, minimum_probability=0)

    # Dominant topic
    dominant_topic = max(topics, key=lambda x: x[1])[0]
    topic_ids.append(dominant_topic)

    # Keywords for dominant topic
    keywords = ", ".join([w for w, _ in lda_model.show_topic(dominant_topic, topn=10)])
    topic_keywords.append(keywords)

    # Expand topic distribution into fixed-length vector
    dist_dict = {t: p for t, p in topics}
    topic_distributions.append([dist_dict.get(t, 0.0) for t in range(NUM_TOPICS)])

#### Step 9 : Construct the Scientific Content Dataset

Building a structured dataset containing:

- PubMed metadata *(PMID, Title, Abstract)*
- Topic-based features *(topic_id, topic_keywords, topic probabilities)*
- Unique *content_id* and data source fields

This transforms textual abstracts into numerical, **semantically-rich features**, enabling **hybrid recommender systems to understand content relevance**.
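
As a rough illustration of the record layout described above, the sketch below assembles one row as a plain dictionary. `make_content_id` and `build_record` are hypothetical helpers, and the three-topic vector and placeholder title are invented for brevity.

```python
def make_content_id(position, width=4, prefix="C"):
    """Zero-padded identifier such as C0000, C0001, ..."""
    return f"{prefix}{position:0{width}d}"


def build_record(position, pmid, title, topic_vector, source="PubMed"):
    """Combine metadata, the dominant topic, and per-topic probabilities."""
    record = {
        "content_id": make_content_id(position),
        "PMID": pmid,
        "Title": title,
        "source": source,
    }
    # Dominant topic = index of the largest probability.
    record["topic_ids"] = max(range(len(topic_vector)),
                              key=topic_vector.__getitem__)
    record.update({f"topic_{k}": p for k, p in enumerate(topic_vector)})
    return record


record = build_record(0, 40887509, "Example title", [0.1, 0.7, 0.2])
print(record)
```

With a width of 4 the identifiers stay fixed-length up to 9,999 items; larger positions simply produce longer identifiers.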

In [11]:
# Assemble the metadata and topic features (the CSV is written in the next cell)
topic_df = pd.DataFrame(topic_distributions, columns=[f"topic_{i}" for i in range(NUM_TOPICS)])
scientific_content = pubmed_df[[
    'PMID', 'Title', 'Abstract']].reset_index(drop=True)

scientific_content['topic_ids'] = topic_ids
scientific_content['topic_keywords'] = topic_keywords
scientific_content['content_id'] = ["C"+ str(i).zfill(4) for i in range(len(scientific_content))]
scientific_content['source'] = "PubMed"

In [12]:
# Merge structured topic distributions
scientific_content_data = pd.concat([scientific_content, topic_df], axis=1)

scientific_content_data.to_csv("Scientific_content_data.csv", index = False)
print("\nScientific Content Data Saved")

Scientific Content Data Saved


#### Step 10 : Visualize Topics with pyLDAvis

**pyLDAvis** is used for an interactive topic visualization:

- Each bubble represents a topic
- Distance between bubbles shows topic dissimilarity
- Hovering reveals top contributing words

This is crucial for qualitative model validation — ensuring topics are well-separated and interpretable.
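
To my understanding, the default pyLDAvis layout derives inter-topic distances from the Jensen-Shannon divergence between topic-word distributions. The sketch below computes that divergence in plain Python on toy distributions; it is an illustration of the measure, not pyLDAvis code.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
print(js_divergence(topic_a, topic_b))
print(js_divergence(topic_a, topic_a))
```

Identical topics have a divergence of zero, and topics with no shared words reach the maximum of ln(2), so well-separated bubbles correspond to large pairwise values.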

In [13]:
# Visualize topics
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)

#### Step 11 : Map Topics to Medical Specialties (Taxonomy Alignment)

To connect topics with **HCP specialties**:

- Define a taxonomy of medical specialties (Cardiology, Endocrinology, Oncology, etc.).
- Represent both topics and specialties using **SpaCy word embeddings**.
- Compute **cosine similarity** between topic vectors and specialty keyword vectors.
- Assign each topic to its **most semantically similar specialty**.

This converts unsupervised topic clusters into clinically meaningful domains, bridging NLP-driven insights with medical taxonomy.
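
The matching logic above can be sketched with toy vectors. The 3-dimensional embeddings below are hypothetical stand-ins for real word vectors, and `mean_vector`, `cosine`, and `best_specialty` are illustrative helpers rather than spaCy functions.

```python
import math


def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


embeddings = {  # hypothetical 3-d vectors
    "thyroid": [0.9, 0.1, 0.0], "hormone": [0.8, 0.2, 0.1], "tsh": [0.9, 0.0, 0.1],
    "heart": [0.1, 0.9, 0.0], "cardiac": [0.0, 0.9, 0.2], "tumor": [0.1, 0.1, 0.9],
}
taxonomy = {
    "Endocrinology": ["hormone", "tsh"],
    "Cardiology": ["heart", "cardiac"],
    "Oncology": ["tumor"],
}


def best_specialty(topic_words, taxonomy, embeddings):
    """Return the specialty whose keyword vector is closest to the topic vector."""
    topic_vec = mean_vector(topic_words, embeddings)
    scores = {s: cosine(topic_vec, mean_vector(kw, embeddings))
              for s, kw in taxonomy.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


name, score = best_specialty(["thyroid", "hormone"], taxonomy, embeddings)
print(name, round(score, 3))
```

Averaging word vectors discards word order and weights every keyword equally, so long or generic keyword lists can pull specialties toward each other; inspecting the similarity scores alongside the assigned label helps spot weak matches.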

In [14]:
hcp_master = pd.read_csv("C:/Users/bhand/Desktop/Data Science - My Collection/Deep Learning Project - 1/Data/Data for Recommender System/HCP_Hybrid_Recommendation_System/data/hcp_master.csv")


In [15]:
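# Get top N words for each topic
# (topic_keywords is rebuilt here as a dict: topic id -> space-separated top words;
#  the loop variable is named topic_id so it does not overwrite the topic_ids list from cell 10)
topic_keywords = {}
for topic_id in range(lda_model.num_topics):
    words = lda_model.show_topic(topic_id, topn=15)  # top 15 words
    topic_keywords[topic_id] = " ".join([w for w, p in words])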


In [16]:
# Creating taxonomy data
taxonomy_df = pd.DataFrame({
    'Specialty': ["Cardiology", "Endocrinology", "Family Medicine", "Internal Medicine", "General Practice", "Oncology"],
    'keywords': [
        ["heart, cardiac, hypertension, atrial, stroke, arrhythmia, cardiovascular, coronary, blood pressure, heart failure, myocardial, angina, echocardiography, cholesterol, stent"],
        ["hormone, thyroid, tsh, pituitary, adrenal, metabolism, endocrine, insulin, glucose, pancreas, cortisol, estrogen, testosterone, parathyroid, diabetes"],
        ["primary care, family, general health, routine, preventive, holistic, wellness, pediatrics, geriatrics, chronic care, screening, outpatient, immunization, counseling, checkup"],
        ["internal medicine, adult, chronic disease, hypertension, diabetes, asthma, kidney, liver, digestive, infection, anemia, pneumonia, arthritis, obesity, cardiometabolic"],
        ["general practice, gp, outpatient, clinic, routine, primary care, wellness, diagnosis, treatment, prevention, lifestyle, counseling, vaccination, screening, referral"],
        ["cancer, tumor, oncology, chemotherapy, immunotherapy, checkpoint, pd1, tcell, metastasis, radiation, biopsy, targeted therapy, precision medicine, carcinoma, leukemia"]
    ]
})

In [17]:
results = []

taxonomy_dict = {
    row["Specialty"]: row["keywords"]
    for _, row in taxonomy_df.iterrows()
}

for tid, twords in topic_keywords.items():
    topic_doc = nlp(twords)

    best_match, best_score = "Others", 0
    for spec, spec_words in taxonomy_dict.items():
        spec_doc = nlp(" ".join(spec_words))
        score = topic_doc.similarity(spec_doc)

        if score > best_score:
            best_match, best_score = spec, score

    results.append([tid, twords, best_match, best_score])

results_df = pd.DataFrame(
    results, columns=["topic_ids", "topic_words", "specialty", "similarity"]
)

#### Step 12 : Merge Topic–Specialty Mapping

Merge the **topic–specialty** relationships into our main *scientific_content_data* table.
Now each content piece is labeled with:

- Its dominant topic
- Top keywords
- Topic probability vector
- Matched medical specialty

This enriched representation supports downstream HCP targeting, content personalization, and engagement analytics.

In [18]:
# merging with scientific content data

topic_specialty_map = (
    results_df.loc[results_df.groupby("topic_ids")["similarity"].idxmax(),
                   ["topic_ids", "specialty"]]
    .reset_index(drop=True)
)

scientific_content_data = scientific_content_data.merge(
    topic_specialty_map, on="topic_ids", how="left"
)

#### Step 13 : Save Final Scientific Content Data

Finally!

*scientific_content_data* is transformed and exported as:

Scientific_data_linked_new.csv → *Base topic distribution data + Specialty-mapped enriched data*

This structured dataset can now be used for:
- Input features in **HCP Recommender Systems (LSTM / Hybrid models)**
- **Campaign analytics** for content optimization
- **Clustering / segmentation** of scientific themes

In [19]:
# Saving scientific content mapped

scientific_content_data.to_csv("Scientific_data_linked_new.csv", index = False)
print("\n New Scientific Content Data Mapped")

# view scientific data file
scientific_content_data.head()

 New Scientific Content Data Mapped


| | PMID | Title | Abstract | topic_ids | topic_keywords | content_id | source | topic_0 | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 | topic_9 | specialty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40887509 | Sodium glucose co-transporter 2 inhibitor-asso... | Perioperative euglycaemic diabetic ketoacidosi... | 9 | group, control, risk, cardiovascular, disease,... | C0000 | PubMed | 0.0004 | 0.000584 | 0.205033 | 0.000507 | 0.000541 | 0.12989 | 0.187372 | 0.000603 | 0.000469 | 0.4746 | Cardiology |
| 1 | 40886230 | Can Dual Incretin Receptor Agonists Exert Bett... | Despite advances in cardiovascular risk reduct... | 5 | management, diabetes, glucose, risk, clinical,... | C0001 | PubMed | 0.00044 | 0.000642 | 0.000446 | 0.000558 | 0.000595 | 0.994736 | 0.00065 | 0.000663 | 0.000516 | 0.000755 | General Practice |
| 2 | 40885915 | The association between diabetes management se... | Self-efficacy emerges as a crucial element tha... | 1 | health, care, healthcare, self, model, include... | C0002 | PubMed | 0.001692 | 0.53234 | 0.448721 | 0.002145 | 0.002287 | 0.002878 | 0.002498 | 0.002551 | 0.001984 | 0.002904 | Family Medicine |
| 3 | 40884731 | Intrinsic Motivation Moderates the Effect of F... | Few studies have examined effects of intrinsic... | 0 | thyroid, disorder, woman, hormone, disease, hy... | C0003 | PubMed | 0.546631 | 0.271954 | 0.000572 | 0.000716 | 0.000763 | 0.00096 | 0.17592 | 0.000851 | 0.000662 | 0.000969 | Endocrinology |
| 4 | 40877913 | Inhibitory effects of the flavonoids extracted... | Pollen Typhae (PT), a traditional Chinese medi... | 7 | disease, therapy, ckd, treatment, cell, risk, ... | C0004 | PubMed | 0.209026 | 0.004637 | 0.003217 | 0.004026 | 0.004293 | 0.366371 | 0.004689 | 0.394566 | 0.003725 | 0.005451 | General Practice |
