References:
1. Blei DM, Ng AY, et al. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003;3(Jan):993-1022.
2. Pham CM, et al. TopicGPT: A prompt-based topic modeling framework. In: Proc of NAACL; 2024.
3. Rijcken E, et al. Towards interpreting topic models with ChatGPT. In: Proc of IFSA; 2023.

Warm-up: prompt an LLM UI directly

In [None]:
# Warm-up: prompt an LLM UI directly (with Deep Research)
prompt0 = """You are the best expert in academic paper topic detection. The attached file contains 100 abstract texts for PubMed papers. Read the abstract from each entry and summarize the 6 most common topics for them. You will analyze step by step to cover all 100 abstracts."""

print(prompt0)

You are the best expert in academic paper topic detection. The attached file contains 100 abstract texts for PubMed papers. Read the abstract from each entry and summarize the 6 most common topics for them. You will analyze step by step to cover all 100 abstracts.


LLM-assisted Topic modeling with LDA

In [None]:
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
# Download necessary NLTK data files for preprocessing
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

class DataPreprocessor:
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()

    def preprocess(self, text):
        # Remove special characters and digits
        text = re.sub(r'\W', ' ', text)
        text = re.sub(r'\d', ' ', text)
        text = text.lower()  # Convert to lower case
        text = text.split()  # Split into individual words
        text = [self.lemmatizer.lemmatize(word) for word in text if word not in self.stop_words]  # Lemmatize and remove stop words
        return ' '.join(text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
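
To see what the two regular-expression substitutions in DataPreprocessor.preprocess do to a title, here is a minimal standard-library sketch. It skips lemmatization, and STOP_WORDS is a tiny hypothetical stand-in for NLTK's English stop-word list, so the result only approximates the class above.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "an", "and", "of", "the"}

def simple_preprocess(text, stop_words=STOP_WORDS):
    """Mirror DataPreprocessor.preprocess without the lemmatization step."""
    text = re.sub(r"\W", " ", text)   # every non-word character becomes a space
    text = re.sub(r"\d", " ", text)   # every digit becomes a space
    tokens = text.lower().split()     # lower-case, then split on whitespace
    return " ".join(t for t in tokens if t not in stop_words)

example = "MethBank 4.0: an updated database of DNA methylation"
cleaned = simple_preprocess(example)
print(cleaned)
```

Note that version numbers and digits inside names are dropped entirely, which is why a name such as "STAB2" ends up as "stab" in the processed text.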


In [None]:
import os
os.getcwd()

'/content'

In [None]:
data = 'ad+aging-v3_2011_100.jsonl'
df = pd.read_json(data, lines=True)

In [None]:
# Preprocess the abstracts
preprocessor = DataPreprocessor()
df['processed_text'] = df['text'].apply(preprocessor.preprocess)


In [None]:
df

Unnamed: 0,id,text,processed_text
0,0,Convolutional neural networks for classificati...,convolutional neural network classification al...
1,1,MUTATE: a human genetic atlas of multiorgan ar...,mutate human genetic atlas multiorgan artifici...
2,2,MUTATE: a human genetic atlas of multiorgan ar...,mutate human genetic atlas multiorgan artifici...
3,3,Disentangling Normal Aging From Severity of Di...,disentangling normal aging severity disease vi...
4,4,CellTICS: an explainable neural network for ce...,celltics explainable neural network cell type ...
...,...,...,...
140,140,STAB2: an updated spatio-temporal cell atlas o...,stab updated spatio temporal cell atlas human ...
141,141,CirGRDB: a database for the genome-wide deciph...,cirgrdb database genome wide deciphering circa...
142,142,MethBank 4.0: an updated database of DNA methy...,methbank updated database dna methylation acro...
143,143,A metabolome atlas of the aging mouse brain. T...,metabolome atlas aging mouse brain mammalian b...


In [None]:
tr = df['processed_text'][:100]
te = df['processed_text'][100:]

In [None]:
# Convert the processed abstracts to a bag-of-words representation
vectorizer = CountVectorizer()
trX = vectorizer.fit_transform(tr)

In [None]:
# Run LDA with topic number set to 6
n_topics = 6
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_model.fit(trX)

In [None]:
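# Calculate Perplexity and a simple coherence proxy
def compute_coherence(lda_model, vectorizer, n_top_words=10):
    # Note: this averages the weights (pseudo-counts in components_) of each
    # topic's top words. It is a rough proxy, not a co-occurrence-based
    # coherence measure such as UMass or c_v.
    # The vectorizer argument is unused; it is kept so existing calls still work.
    topic_words = lda_model.components_
    coherence = []
    for topic in topic_words:
        top_words_indices = topic.argsort()[-n_top_words:][::-1]
        coherence_score = sum(topic[top_words_indices]) / n_top_words
        coherence.append(coherence_score)
    return sum(coherence) / len(coherence)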

In [None]:
# Calculate Perplexity
perplexity = lda_model.perplexity(trX)

# Calculate Coherence score
coherence_score_val = compute_coherence(lda_model, vectorizer)

print(f"Perplexity: {perplexity}")   # lower is better (computed here on the training data)
print(f"Coherence Score: {coherence_score_val}")  # average top-word weight; not on the same scale as c_v or UMass coherence

Perplexity: 1496.4888419365739
Coherence Score: 32.09124485951401
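
The coherence value above is an average of top-word weights, so it does not measure how often a topic's words co-occur in documents. As a point of comparison, here is a minimal sketch of the UMass coherence for a single topic, written with the standard library only. The function name, the toy documents, and the assumption that every top word occurs in at least one document are all illustrative and not part of the notebook.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """UMass coherence for one topic; top_words must be ordered most to least probable.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        # i < j, so top_words[i] is the higher-ranked word of the pair
        w_hi, w_lo = top_words[i], top_words[j]
        score += math.log((doc_freq(w_hi, w_lo) + 1) / doc_freq(w_hi))
    return score

# Toy corpus for illustration only
toy_docs = [
    ["brain", "mri", "disease"],
    ["brain", "mri"],
    ["gene", "gwas", "variant"],
    ["gene", "gwas"],
]
coherent = umass_coherence(["brain", "mri"], toy_docs)
mixed = umass_coherence(["brain", "gwas"], toy_docs)
print(coherent > mixed)
```

Values closer to zero indicate word pairs that co-occur more often; a topic mixing unrelated words scores lower.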


In [None]:
# Function to summarize the topics
def print_topic_keywords(model, vectorizer, n_top_words):
    feature_names = vectorizer.get_feature_names_out()  # look up the vocabulary once
    to_print = ''
    for idx, topic in enumerate(model.components_):
        # argsort is ascending, so each topic's words run from lower to higher weight
        top_indices = topic.argsort()[-n_top_words:]
        to_print += f"Topic {idx}:\n" + " ".join(feature_names[i] for i in top_indices) + '\n'

    return to_print

# Displaying the topics
top_words = print_topic_keywords(lda_model, vectorizer, 10)
print(top_words)

Topic 0:
alzheimer available network model based ad brain data method disease
Topic 1:
study clinical protein tremor data based region method model mri
Topic 2:
analysis network information genetic ad feature method disease gene data
Topic 3:
using gene ad study data deep image cell model disease
Topic 4:
method tool analysis available gwas data variant model network gene
Topic 5:
brain ferroptosis pyaging body event http ad aging disease editing



In [None]:
area = 'AI for health'

In [None]:
prompt1 = f"""Given the fact that the following {n_topics} groups of words all originate from topic modeling of literature corpus in the area of {area}, what common denominator or topic do each of the following groups of words have? Please be as general and distinguishable among groups as possible, and save the result to a python list.
{top_words}"""

print(prompt1)

Given the fact that the following 6 groups of words all originate from topic modeling of literature corpus in the area of AI for health, what common denominator or topic do each of the following groups of words have? Please be as general and distinguishable among groups as possible, and save the result to a python list.
Topic 0:
alzheimer available network model based ad brain data method disease
Topic 1:
study clinical protein tremor data based region method model mri
Topic 2:
analysis network information genetic ad feature method disease gene data
Topic 3:
using gene ad study data deep image cell model disease
Topic 4:
method tool analysis available gwas data variant model network gene
Topic 5:
brain ferroptosis pyaging body event http ad aging disease editing



In [None]:
# Labels returned by ChatGPT 5.1 in response to prompt1 (pasted manually)
prompt1_topics = [
    "Alzheimer’s disease modeling using brain data and network-based methods",
    "Clinical neuroimaging studies of neurological disorders and protein/tremor biomarkers",
    "Genetic and network-based analyses of disease mechanisms",
    "Deep learning on multimodal biological data for disease research",
    "GWAS and variant-focused computational genetic analysis",
    "Aging and neurodegeneration research involving cellular aging pathways"
]
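
prompt1 asks the model to save its answer to a Python list, but a chat reply usually wraps that list in extra prose. A minimal sketch of pulling the list out of a raw reply follows; extract_python_list and the sample reply are hypothetical, and the approach assumes the reply contains a single list literal.

```python
import ast
import re

def extract_python_list(reply: str) -> list:
    """Parse the span from the first '[' to the last ']' in a reply as a list of strings."""
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no list literal found in reply")
    # literal_eval only evaluates Python literals, so it never runs arbitrary code
    parsed = ast.literal_eval(match.group(0))
    if not isinstance(parsed, list) or not all(isinstance(x, str) for x in parsed):
        raise ValueError("reply did not contain a list of strings")
    return parsed

# Hypothetical reply text, shortened for illustration
reply = 'Here is the list:\ntopics = ["Topic A", "Topic B"]\nLet me know if you need more.'
topics = extract_python_list(reply)
print(topics)
```

If the reply contains several bracketed spans, this simple pattern will over-match, so a manual check of the parsed labels is still advisable.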

In [None]:
prompt2 = f"""You will receive a list of topics that belong to the same level of a topic hierarchy. Your task is to merge topics that are paraphrases or near duplicates of one another. Return 'None' if no modification is needed.

Here are some examples:
[Example 1]
Topic List:
<pairs of similar topics>

Your response:
<topics being merged into an existing topic>

[Example 2]
<pairs of similar topics>

Your response:
<topics being merged into a new topic>

[Rules]
- Each line represents a topic, with a level indicator and a topic label.
- Perform the following operations as many times as needed:
    - Merge relevant topics into a single topic.
    - Do nothing and return 'None' if no modification is needed.
- When merging, the output format should contain a level indicator, the updated label and description, followed by the original topics.


[Topic List]
{prompt1_topics}

Output the modification or 'None' where appropriate. Do not output anything else.
[Your response]
"""

print(prompt2)

You will receive a list of topics that belong to the same level of a topic hierarchy. Your task is to merge topics that are paraphrases or near duplicates of one another. Return 'None' if no modification is needed. 

Here are some examples: 
[Example 1]
Topic List: 
<pairs of similar topics>

Your response: 
<topics being merged into an existing topic>

[Example 2]
<pairs of similar topics>

Your response: 
<topics being merged into a new topic>

[Rules]
- Each line represents a topic, with a level indicator and a topic label. 
- Perform the following operations as many times as needed: 
    - Merge relevant topics into a single topic.
    - Do nothing and return 'None' if no modification is needed.
- When merging, the output format should contain a level indicator, the updated label and description, followed by the original topics.


[Topic List]
['Alzheimer’s disease modeling using brain data and network-based methods', 'Clinical neuroimaging studies of neurological disorders and pro

In [None]:
# prompt2_results = None
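
The rules in prompt2 say that each line represents a topic with a level indicator, while `{prompt1_topics}` interpolates the Python list representation on a single line. One way to match the stated format is to render the labels line by line first. The helper below is a hypothetical sketch; the "[1]" indicator follows the convention used in the prompt3 examples.

```python
def format_topic_lines(topics, level=1):
    """Render each topic as '[level] label' on its own line."""
    return "\n".join(f"[{level}] {label}" for label in topics)

# Two labels copied from prompt1_topics above
sample_topics = [
    "Genetic and network-based analyses of disease mechanisms",
    "GWAS and variant-focused computational genetic analysis",
]
topic_block = format_topic_lines(sample_topics)
print(topic_block)
```

The resulting string could then be placed under [Topic List] in place of the raw list.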

In [None]:
# Provide manual feedback to revise the topics. Note: the UI cannot fit all 100 documents at once,
# so only the first 10 are included as a demo (the prompt header still reports tr.shape[0] = 100);
# use API calls for actual research.
prompt3 = f"""You will receive {tr.shape[0]} documents and a set of top-level topics previously extracted from a topic hierarchy. Your task is, for each document, identify generalizable topics within the document that can act as top-level topics in the hierarchy. If any relevant topics are missing from the provided set, please add them. Otherwise, output the existing top-level topics as identified in the document.

[Top-level topics]
{prompt1_topics}

[Examples]
Example 1: Adding '[1] Data development for healthy aging or Alzheimer's disease'
Document:
Open datasets and code for multi-scale relations on structure, function and neuro-genetics in the human brain. The human brain is an extremely complex network of structural and functional connections that operate at multiple spatial and temporal scales. Investigating the relationship between these multi-scale connections is critical to advancing our comprehension of brain function and disorders. However, accurately predicting structural connectivity from its functional counterpart remains a challenging pursuit. One of the major impediments is the lack of public repositories that integrate structural and functional networks at diverse resolutions, in conjunction with modular transcriptomic profiles, which are essential for comprehensive biological interpretation. To mitigate this limitation, our contribution encompasses the provision of an open-access dataset consisting of derivative matrices of functional and structural connectivity across multiple scales, accompanied by code that facilitates the investigation of their interrelations. We also provide additional resources focused on neuro-genetic associations of module-level network metrics, which present promising opportunities to further advance research in the field of network neuroscience, particularly concerning brain disorders.

Your response:
[1] Data development for healthy aging or Alzheimer's disease: Mentions the provision of an open-access dataset for neuroscience and brain disorders.

Example 2: Adding '[1] Hardware for healthy aging or Alzheimer's disease'
Document:
Helping Older Adults Hear in Noisy Social Situations Using Novel Hardware and AI. In the United States there are 37 million people with hearing loss but only 8 million wear hearing aids. Use of hearing aids by older adults slows the decline of thinking and memory abilities by 48%, making their low rate of adoption a big public health concern. The most common reason for not wearing hearing aids is difficulty participating in conversations in noisy social environments: family reunions, birthday parties, weddings, etc. Imagine a proud mom at her son\u2019s wedding frustrated by loud noise instead of enjoying and reconnecting with her family & friends. Or an uncle struggling to catch up with their niece at the family thanksgiving party. AudioFocus is reimagining the hearing aid from the ground-up to solve this noise problem using AI. Their patented proximity-based speech enhancement algorithm enhances conversations in front of patients while suppressing noise from other conversations around them. To validate the benefits of their technology over existing AI hearing aids, they are running pilot studies with Dr. Reed from John Hopkins University, Dr. Fitzgerald from Stanford, and Dr. Hu from the University of the Pacific. AudioFocus\u2019 team includes hearing aid science & engineering experts Dr. Shariq Mobin (PhD UC Berkeley, Google Brain) & Dr. Reza Kassayan (EarLens, Sonitus Medical). Dr. Shariq Mobin, PhD, Chief Scientific Officer of AudioFocus Professor Jiong \u201cJoe\u201d Hu, PhD, AuD, Vice Chair of Audiology at University of the Pacific

Your response:
[1] Hardware for healthy aging or Alzheimer's disease: Mentions using novel hardware to help older adults.

[Instructions]
For each document,
Step 1: Determine topics mentioned in the document.
- The topic labels must be a general research concept--from one of Data, Software method, Hardware--for AD or aging study. They must not be document-specific.
- The topics must reflect a SINGLE topic instead of a combination of topics.
- The new topics must have a level number, a short general label, and a topic description.
- The topics must be broad enough to accommodate future subtopics.
Step 2: Perform ONE of the following operations:
1. If there are already duplicates or relevant topics in the hierarchy, output those topics and stop here.
2. If the document contains no topic, return 'None'.
3. Otherwise, add your topic as a top-level topic. Stop here and output the added topic(s). DO NOT add any additional levels.


[Documents]
{tr[:10].values}

Please ONLY return the relevant or modified topics at the top level in the hierarchy. Your response should be in the python list format

Your response:
"""

print(prompt3)

You will receive 100 documents and a set of top-level topics previously extracted from a topic hierarchy. Your task is, for each document, identify generalizable topics within the document that can act as top-level topics in the hierarchy. If any relevant topics are missing from the provided set, please add them. Otherwise, output the existing top-level topics as identified in the document.

[Top-level topics]
['Alzheimer’s disease modeling using brain data and network-based methods', 'Clinical neuroimaging studies of neurological disorders and protein/tremor biomarkers', 'Genetic and network-based analyses of disease mechanisms', 'Deep learning on multimodal biological data for disease research', 'GWAS and variant-focused computational genetic analysis', 'Aging and neurodegeneration research involving cellular aging pathways']

[Examples]
Example 1: Adding '[1] Data development for healthy aging or Alzheimer's disease'
Document:
Open datasets and code for multi-scale relations on stru

In [None]:
# Manually evaluate the results: if they are not correct, add human demonstrations of corrected examples to prompt3 above and refine iteratively
prompt3_results = [
"Deep learning on multimodal biological data for disease research",
"GWAS and variant-focused computational genetic analysis",
"GWAS and variant-focused computational genetic analysis",
"Aging and neurodegeneration research involving cellular aging pathways",
"Deep learning on multimodal biological data for disease research",
"Genetic and network-based analyses of disease mechanisms",
"Genetic and network-based analyses of disease mechanisms",
"Aging and neurodegeneration research involving cellular aging pathways",
"Deep learning on multimodal biological data for disease research",
"GWAS and variant-focused computational genetic analysis"
]
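
A quick tally of the assigned labels helps with the manual evaluation, for example by showing which of the six candidate topics were never used for these 10 documents. This is a minimal sketch using collections.Counter; both lists are copied from the cells above so that the block runs on its own.

```python
from collections import Counter

prompt1_topics = [
    "Alzheimer’s disease modeling using brain data and network-based methods",
    "Clinical neuroimaging studies of neurological disorders and protein/tremor biomarkers",
    "Genetic and network-based analyses of disease mechanisms",
    "Deep learning on multimodal biological data for disease research",
    "GWAS and variant-focused computational genetic analysis",
    "Aging and neurodegeneration research involving cellular aging pathways",
]
prompt3_results = [
    "Deep learning on multimodal biological data for disease research",
    "GWAS and variant-focused computational genetic analysis",
    "GWAS and variant-focused computational genetic analysis",
    "Aging and neurodegeneration research involving cellular aging pathways",
    "Deep learning on multimodal biological data for disease research",
    "Genetic and network-based analyses of disease mechanisms",
    "Genetic and network-based analyses of disease mechanisms",
    "Aging and neurodegeneration research involving cellular aging pathways",
    "Deep learning on multimodal biological data for disease research",
    "GWAS and variant-focused computational genetic analysis",
]

topic_counts = Counter(prompt3_results)
unused = [t for t in prompt1_topics if t not in topic_counts]

for label, n in topic_counts.most_common():
    print(f"{n:2d}  {label}")
print("Unused topics:", unused)
```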

This workflow can be repeated with different LLMs and topic modeling methods. When API calls are used instead of the UI, the iterative procedure above mitigates the scalability issue.
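
As a rough illustration of how API calls help with scale, the documents can be split into fixed-size batches with one prompt sent per batch. The sketch below is only an outline: batched, label_in_batches, fake_build_prompt and fake_call_llm are hypothetical names, and the fake functions stand in for a real prompt template and a real API client.

```python
from typing import Callable, List

def batched(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def label_in_batches(
    docs: List[str],
    build_prompt: Callable[[List[str]], str],
    call_llm: Callable[[str], str],
    batch_size: int = 10,
) -> List[str]:
    """Send one prompt per batch and collect the raw replies in order."""
    return [call_llm(build_prompt(batch)) for batch in batched(docs, batch_size)]

# Stand-ins for illustration only; a real run would use prompt3 and an API client.
def fake_build_prompt(batch: List[str]) -> str:
    return "[Documents]\n" + "\n".join(batch)

def fake_call_llm(prompt: str) -> str:
    n_docs = len(prompt.splitlines()) - 1  # subtract the header line
    return f"labelled {n_docs} documents"

docs = [f"doc {i}" for i in range(25)]
replies = label_in_batches(docs, fake_build_prompt, fake_call_llm, batch_size=10)
print(replies)
```

Each reply can then be reviewed as in the manual evaluation step, with corrected examples fed back into the prompt for the next batch.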

The LDA code above was generated by gpt-4o-mini within a multi-agent framework built with AutoGen. It shows that the entire topic modeling pipeline can be built with an agentic framework with little to no human intervention. Below is an example prompt you can try (the resulting code uses gensim LDA instead of sklearn as above).

In [None]:
agent_persona="""Forget about all the previous conversation. You are the expert in AI and NLP. Suggest a plan for topic modeling using Latent Dirichlet Allocation (LDA).
    The plan includes:
    1. Preprocessing text (DataPreprocessor).
    2. Running LDA for multiple topic numbers (LDATopicModeling).
    3. Evaluating coherence and perplexity scores.
    4. Choosing the best topic number.
    5. Summarizing topics with topN words (TopicSummarizer).
    6. Generating prompts for LLMs to interpret the topics (TopicDescriptor).
    Write Python code for these steps.
    """

In [None]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [None]:
"""
End-to-end LDA topic modeling pipeline:

1. DataPreprocessor       – text cleaning + tokenization (+ optional bigrams)
2. LDATopicModeling       – train LDA for multiple topic numbers
3. evaluate_models        – compute coherence & perplexity
4. choose_best_topic_num  – pick best #topics
5. TopicSummarizer        – top-N words per topic
6. TopicDescriptor        – generate LLM prompts for topic interpretation
"""

import re
from typing import List, Dict

import nltk
from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import LdaModel, CoherenceModel
from gensim.models.phrases import Phrases, Phraser


# Make sure NLTK stopwords are available once in your environment:
# >>> import nltk
# >>> nltk.download('stopwords')


# 1. -----------------------------------------------------------
#    DataPreprocessor
# --------------------------------------------------------------

class DataPreprocessor:
    """
    Basic text preprocessing for LDA:
      - lowercasing
      - removing punctuation & digits
      - tokenization
      - stopword removal
      - optional bigram detection
    """

    def __init__(
        self,
        language: str = "english",
        min_token_len: int = 2,
        use_bigrams: bool = True,
        bigram_min_count: int = 10,
        bigram_threshold: float = 10.0,
    ):
        self.language = language
        self.stop_words = set(stopwords.words(language))
        self.min_token_len = min_token_len

        self.use_bigrams = use_bigrams
        self.bigram_min_count = bigram_min_count
        self.bigram_threshold = bigram_threshold
        self.bigram_model = None  # set in fit()

    @staticmethod
    def _basic_clean(text: str) -> str:
        text = text.lower()
        text = re.sub(r"\s+", " ", text)  # normalize spaces
        # keep letters and spaces; strip digits/punct
        text = re.sub(r"[^a-z\s]", " ", text)
        text = re.sub(r"\s+", " ", text).strip()
        return text

    def _tokenize(self, text: str) -> List[str]:
        tokens = text.split()
        tokens = [
            t
            for t in tokens
            if t not in self.stop_words and len(t) >= self.min_token_len
        ]
        return tokens

    def fit(self, documents: List[str]) -> None:
        """
        Learn bigram model (if enabled). Must be called before transform().
        """
        cleaned_docs = [self._basic_clean(doc) for doc in documents]
        tokenized_docs = [self._tokenize(doc) for doc in cleaned_docs]

        if self.use_bigrams:
            phrases = Phrases(
                tokenized_docs,
                min_count=self.bigram_min_count,
                threshold=self.bigram_threshold,
            )
            self.bigram_model = Phraser(phrases)

    def transform(self, documents: List[str]) -> List[List[str]]:
        """
        Apply preprocessing (and bigrams, if fitted) to raw docs.
        """
        cleaned_docs = [self._basic_clean(doc) for doc in documents]
        tokenized_docs = [self._tokenize(doc) for doc in cleaned_docs]

        if self.use_bigrams and self.bigram_model is not None:
            tokenized_docs = [self.bigram_model[doc] for doc in tokenized_docs]

        return tokenized_docs

    def fit_transform(self, documents: List[str]) -> List[List[str]]:
        self.fit(documents)
        return self.transform(documents)


# 2. -----------------------------------------------------------
#    LDATopicModeling
# --------------------------------------------------------------

class LDATopicModeling:
    """
    Handles:
      - building gensim dictionary & corpus
      - training LDA for multiple topic numbers
    """

    def __init__(
        self,
        tokenized_docs: List[List[str]],
        no_below: int = 5,
        no_above: float = 0.5,
        keep_n: int = 50000,
    ):
        """
        tokenized_docs: list of lists of tokens (output of DataPreprocessor)
        """
        self.tokenized_docs = tokenized_docs
        self.dictionary = corpora.Dictionary(tokenized_docs)
        self.dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=keep_n)
        self.corpus = [self.dictionary.doc2bow(doc) for doc in tokenized_docs]
        self.models: Dict[int, LdaModel] = {}

    def train_models(
        self,
        topic_nums: List[int],
        passes: int = 10,
        iterations: int = 200,
        random_state: int = 42,
        alpha: str = "auto",
        eta: str = "auto",
    ) -> Dict[int, LdaModel]:
        """
        Train an LDA model for each k in topic_nums.
        """
        for k in topic_nums:
            lda = LdaModel(
                corpus=self.corpus,
                id2word=self.dictionary,
                num_topics=k,
                passes=passes,
                iterations=iterations,
                random_state=random_state,
                alpha=alpha,
                eta=eta,
                eval_every=None,  # we'll evaluate separately
            )
            self.models[k] = lda
        return self.models


# 3. -----------------------------------------------------------
#    Evaluation: coherence & perplexity
# --------------------------------------------------------------

def evaluate_models(
    lda_pipeline: LDATopicModeling,
    coherence_metric: str = "c_v",
) -> Dict[int, Dict[str, float]]:
    """
    Compute coherence and perplexity for each trained model.

    Returns:
      {
        k: {'coherence': float, 'log_perplexity': float},
        ...
      }
    """
    scores: Dict[int, Dict[str, float]] = {}

    for k, model in lda_pipeline.models.items():
        coherence_model = CoherenceModel(
            model=model,
            texts=lda_pipeline.tokenized_docs,
            dictionary=lda_pipeline.dictionary,
            coherence=coherence_metric,
        )
        coherence = coherence_model.get_coherence()

        # Gensim's log_perplexity returns the per-word likelihood bound, which is
        # negative; higher (less negative) is better
        log_perplexity = model.log_perplexity(lda_pipeline.corpus)
        # Convert to a positive perplexity estimate if desired (gensim's own log uses base 2):
        # perplexity = 2 ** (-log_perplexity)

        scores[k] = {
            "coherence": coherence,
            "log_perplexity": log_perplexity,
        }

    return scores


# 4. -----------------------------------------------------------
#    Choose best topic number
# --------------------------------------------------------------

def choose_best_topic_num(
    scores: Dict[int, Dict[str, float]],
    weight_coherence: float = 1.0,
    weight_perplexity: float = 0.0,
) -> int:
    """
    Select best k given evaluation scores.

    Default: only coherence (max coherence).
    You can incorporate perplexity by setting weight_perplexity > 0.

    Note: log_perplexity is typically negative; higher is better.
    """

    # Normalize metrics for simple weighted scoring
    coherences = [v["coherence"] for v in scores.values()]
    perps = [v["log_perplexity"] for v in scores.values()]

    min_coh, max_coh = min(coherences), max(coherences)
    min_perp, max_perp = min(perps), max(perps)

    def normalize(x, xmin, xmax):
        if xmax == xmin:
            return 0.5  # avoid div by zero; neutral
        return (x - xmin) / (xmax - xmin)

    best_k = None
    best_score = float("-inf")

    for k, vals in scores.items():
        coh_norm = normalize(vals["coherence"], min_coh, max_coh)
        perp_norm = normalize(vals["log_perplexity"], min_perp, max_perp)

        combined_score = weight_coherence * coh_norm + weight_perplexity * perp_norm

        if combined_score > best_score:
            best_score = combined_score
            best_k = k

    return best_k


# 5. -----------------------------------------------------------
#    TopicSummarizer
# --------------------------------------------------------------

class TopicSummarizer:
    """
    Extract top-N words per topic from an LDA model.
    """

    @staticmethod
    def summarize_topics(
        lda_model: LdaModel,
        topn: int = 10
    ) -> Dict[int, List[str]]:
        """
        Returns:
          {topic_id: [word1, word2, ...], ...}
        """
        topic_terms: Dict[int, List[str]] = {}

        for topic_id in range(lda_model.num_topics):
            terms = lda_model.show_topic(topic_id, topn=topn)
            topic_terms[topic_id] = [word for word, _ in terms]

        return topic_terms


# 6. -----------------------------------------------------------
#    TopicDescriptor (LLM prompts)
# --------------------------------------------------------------

class TopicDescriptor:
    """
    Generate prompts for LLMs to interpret topics.
    """

    @staticmethod
    def build_prompts_for_llm(
        topic_words: Dict[int, List[str]],
        domain_hint: str = "AI and health",
        style: str = "concise and high-level"
    ) -> Dict[int, str]:
        """
        Returns:
          {topic_id: prompt_string, ...}
        """
        prompts: Dict[int, str] = {}

        for topic_id, words in topic_words.items():
            word_list_str = ", ".join(words)
            prompt = (
                f"You are an expert in {domain_hint}. "
                f"Given the following topic keywords:\n"
                f"{word_list_str}\n\n"
                f"1. Provide a {style} description of the main theme of this topic.\n"
                f"2. Suggest a short, general topic label (no more than 10 words).\n"
                f"3. Optionally, list 2–3 possible subtopics.\n"
            )
            prompts[topic_id] = prompt

        return prompts
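
To make the weighting in choose_best_topic_num concrete, the sketch below applies the same min-max normalization to the three scores reported by the example run further down and compares a coherence-only choice with a perplexity-only choice. The helper names min_max and combined_scores are hypothetical; the numbers are copied from that run's output.

```python
# Scores copied from the example run's printed output (k -> metrics)
scores = {
    5: {"coherence": 0.2624, "log_perplexity": -6.2448},
    6: {"coherence": 0.2714, "log_perplexity": -6.2598},
    7: {"coherence": 0.2998, "log_perplexity": -6.2594},
}

def min_max(x, lo, hi):
    """Scale x into [0, 1]; return a neutral 0.5 when all values are equal."""
    return 0.5 if hi == lo else (x - lo) / (hi - lo)

def combined_scores(scores, w_coh, w_perp):
    """Weighted sum of normalized coherence and normalized log-perplexity per k."""
    cohs = [v["coherence"] for v in scores.values()]
    perps = [v["log_perplexity"] for v in scores.values()]
    return {
        k: w_coh * min_max(v["coherence"], min(cohs), max(cohs))
        + w_perp * min_max(v["log_perplexity"], min(perps), max(perps))
        for k, v in scores.items()
    }

coherence_only = combined_scores(scores, w_coh=1.0, w_perp=0.0)
perplexity_only = combined_scores(scores, w_coh=0.0, w_perp=1.0)
best_by_coherence = max(coherence_only, key=coherence_only.get)
best_by_perplexity = max(perplexity_only, key=perplexity_only.get)
print(best_by_coherence, best_by_perplexity)
```

With only three candidate values of k, min-max scaling maps the best and worst model to 1 and 0 on each metric, so small raw differences can dominate the combined score.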




In [None]:
# ------------------------------------------------------------------
# Example usage (wire everything together)
# ------------------------------------------------------------------


# Example corpus (replace with your own documents)
raw_documents = df['text'][:100].to_list()
# raw_documents = [
#     "Alzheimer's disease is a progressive neurodegenerative disorder affecting memory.",
#     "We used MRI scans and deep learning to classify Alzheimer's disease.",
#     "Genome-wide association studies reveal genetic variants associated with dementia.",
#     "Aging is associated with changes in brain structure and cognitive decline.",
#     "Deep learning methods can model complex relationships in multimodal health data.",
#     "RNA-seq data can identify biomarkers for neurodegenerative diseases.",
# ]

# 1. Preprocess
preprocessor = DataPreprocessor(use_bigrams=True)
tokenized_docs = preprocessor.fit_transform(raw_documents)

# 2. Train LDA for multiple topic numbers
lda_pipeline = LDATopicModeling(tokenized_docs)
topic_nums_to_try = [5,6,7]
lda_models = lda_pipeline.train_models(topic_nums_to_try)

# 3. Evaluate models
scores = evaluate_models(lda_pipeline, coherence_metric="c_v")
print("Scores (k -> metrics):")
for k, m in scores.items():
    print(f"k={k}: coherence={m['coherence']:.4f}, log_perplexity={m['log_perplexity']:.4f}")

# 4. Choose best topic number (here, coherence-only)
best_k = choose_best_topic_num(scores, weight_coherence=1.0, weight_perplexity=0.0)
best_model = lda_models[best_k]
print(f"\nBest number of topics: {best_k}")

# 5. Summarize topics with top-N words
summarizer = TopicSummarizer()
topic_words = summarizer.summarize_topics(best_model, topn=10)
print("\nTop words per topic:")
for tid, words in topic_words.items():
    print(f"Topic {tid}: {', '.join(words)}")

# 6. Generate LLM prompts to interpret topics
descriptor = TopicDescriptor()
prompts = descriptor.build_prompts_for_llm(
    topic_words,
    domain_hint="AI and neurodegenerative disease research",
    style="succinct, expert-level"
)

print("\nExample LLM prompts:")
for tid, prompt in prompts.items():
    print(f"\n--- Prompt for Topic {tid} ---\n{prompt}")


Scores (k -> metrics):
k=5: coherence=0.2624, log_perplexity=-6.2448
k=6: coherence=0.2714, log_perplexity=-6.2598
k=7: coherence=0.2998, log_perplexity=-6.2594

Best number of topics: 7

Top words per topic:
Topic 0: mri, deep, regions, across, human, pipeline, predicting, proteins, clinical, framework
Topic 1: brain, disease, method, prediction, datasets, analysis, diseases, mri, ad, proposed
Topic 2: genes, tool, pipeline, time, gene, analysis, genetic, available, key, variants
Topic 3: genes, ad, methods, performance, dataset, gwas, genetic, method, human, different
Topic 4: drug, aging, gene, online, analysis, available, supplementary_information, rna, software, seq
Topic 5: ad, network, information, framework, aging, genes, multi_modal, features, diagnosis, functional
Topic 6: longitudinal, brain, disease, imaging, analysis, genetic, snps, methods, method, variants

Example LLM prompts:

--- Prompt for Topic 0 ---
You are an expert in AI and neurodegenerative disease research. Gi
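
The log_perplexity values in the scores above are per-word likelihood bounds, not perplexities. A minimal sketch of converting them to a positive perplexity estimate follows, assuming the base-2 convention that gensim's own training log uses; the values are copied from the printed scores.

```python
# Per-word bounds copied from the printed scores (k -> log_perplexity)
log_perplexities = {5: -6.2448, 6: -6.2598, 7: -6.2594}

# Assumed conversion: perplexity = 2 ** (-bound), so a higher bound means lower perplexity
perplexities = {k: 2 ** (-bound) for k, bound in log_perplexities.items()}

for k, p in perplexities.items():
    print(f"k={k}: perplexity estimate = {p:.1f}")
```

These estimates are computed on the training corpus, so they are not directly comparable to the sklearn perplexity reported earlier and say little about held-out performance.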