# 🧬 Discovering Emerging Topics in Drug Discovery Research using BERTopic

## 1️⃣ Introduction
- Motivation and goals of the project
- Importance of topic modelling for scientific literature analysis


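## 2️⃣ Data Collection
Fetch abstracts from PubMed or arXiv through their APIs, using queries such as 'drug discovery', 'AI drug design', and 'molecular docking'.
Store the results as `data/raw_publications.csv`, with at least the `abstract` and `publication_date` columns that later steps read. Example: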

In [None]:

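from Bio import Entrez
import pandas as pd

Entrez.email = "youremail@example.com"  # NCBI asks for a contact address with each request
query = "drug discovery OR AI drug design"

# Search PubMed and collect the IDs of matching articles
handle = Entrez.esearch(db="pubmed", term=query, retmax=100)
records = Entrez.read(handle)
handle.close()
pmids = records["IdList"]
# Fetching the abstracts for these IDs (e.g. with Entrez.efetch) is omitted for brevity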
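
One way to turn fetched records into `data/raw_publications.csv` is sketched below. It assumes the PubMed records have already been downloaded and parsed into dictionaries keyed by MEDLINE field tags (`PMID`, `TI`, `AB`, `DP`), for example with `Entrez.efetch(..., rettype="medline", retmode="text")` and `Bio.Medline.parse`. The helper name `write_publications_csv` and the output column names are illustrative choices, picked to match the `abstract` and `publication_date` columns used later in this notebook.

```python
import csv
from pathlib import Path

def write_publications_csv(records, path):
    """Write MEDLINE-style records to CSV, skipping entries without an abstract.

    Returns the number of rows written.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = ["pmid", "title", "abstract", "publication_date"]
    written = 0
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        for rec in records:
            abstract = (rec.get("AB") or "").strip()
            if not abstract:
                continue  # topic modelling needs text to work with
            writer.writerow({
                "pmid": rec.get("PMID", ""),
                "title": rec.get("TI", ""),
                "abstract": abstract,
                "publication_date": rec.get("DP", ""),
            })
            written += 1
    return written

# Hypothetical records, for illustration only (not real PubMed entries)
sample_records = [
    {"PMID": "0000001", "TI": "Example title", "AB": "Example abstract text.", "DP": "2023 Jan"},
    {"PMID": "0000002", "TI": "No abstract here", "DP": "2022"},
]
# write_publications_csv(sample_records, "data/raw_publications.csv")
```

PubMed publication dates come in mixed formats (for instance `2023 Jan` or just `2022`), so they will likely need normalizing before any time-based analysis.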


## 3️⃣ Data Cleaning & Preprocessing

In [None]:

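import re
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def clean_text(text):
    """Lowercase, strip punctuation, remove stop words, and lemmatize."""
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    text = re.sub(r'[^\w\s]', '', text.lower())  # lowercase, then drop punctuation
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop]
    return ' '.join(tokens)

df = pd.read_csv("data/raw_publications.csv")
# Rows without an abstract would be read as NaN and break clean_text
df = df.dropna(subset=['abstract']).reset_index(drop=True)
df['clean_text'] = df['abstract'].apply(clean_text)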
df.head()
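
To see what the two regular expressions in `clean_text` do before spaCy is involved, the sketch below applies just those steps to a made-up sentence. The helper name `normalize` is hypothetical; the patterns are copied from the function above.

```python
import re

def normalize(text):
    """Apply only the regex steps of clean_text (no lemmatization or stop-word removal)."""
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return re.sub(r'[^\w\s]', '', text.lower())  # lowercase, then drop punctuation

sample = "The IC50 was 2.5 nM\n(n=3) for the SARS-CoV-2 protease."
print(normalize(sample))
# the ic50 was 25 nm n3 for the sarscov2 protease
```

Because punctuation is removed outright, decimals and hyphenated names are fused (`2.5` becomes `25`, `SARS-CoV-2` becomes `sarscov2`). Whether that matters depends on the corpus; replacing punctuation with a space instead is one possible alternative.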


## 4️⃣ Embedding Generation

In [None]:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# Pass a plain list of strings; encode() returns one vector per document
embeddings = model.encode(df['clean_text'].tolist(), show_progress_bar=True)


## 5️⃣ Topic Modelling with BERTopic

In [None]:

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

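# Fix random_state so the UMAP projection, and therefore the topics, are reproducible
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=25, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
# Reuse the embeddings from section 4 instead of letting BERTopic recompute them
topics, probs = topic_model.fit_transform(df['clean_text'].tolist(), embeddings)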
topic_model.get_topic_info().head()
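
HDBSCAN labels documents it cannot place in any cluster as topic `-1`, so the first rows of the topic table can hide how much of the corpus was left unassigned. A minimal sketch of a summary over the `topics` list returned by `fit_transform` follows; the function name `summarize_assignments` is an illustrative assumption, not part of BERTopic.

```python
from collections import Counter

def summarize_assignments(topics):
    """Return the number of real topics, the outlier share, and the largest topics."""
    counts = Counter(int(t) for t in topics)
    total = sum(counts.values())
    outliers = counts.pop(-1, 0)  # -1 is the outlier label
    return {
        "n_topics": len(counts),
        "outlier_share": outliers / total if total else 0.0,
        "largest": counts.most_common(3),
    }

# Toy labels for illustration; in the notebook, pass the `topics` list instead
toy_topics = [0, 0, 0, 1, 1, -1, 2, -1]
summary = summarize_assignments(toy_topics)
print(summary)
```

A high outlier share is often a sign that `min_cluster_size` is too large for the corpus, which is one thing the tuning step can probe.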


## 6️⃣ Hyperparameter Tuning

In [None]:

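docs = df['clean_text'].tolist()
results = []
for n in [5, 15, 30]:
    for min_cluster in [10, 25, 50]:
        umap_model = UMAP(n_neighbors=n, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
        hdbscan_model = HDBSCAN(min_cluster_size=min_cluster, prediction_data=True)
        # Use a separate name so the model fitted in section 5 is not overwritten
        candidate = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
        candidate_topics, _ = candidate.fit_transform(docs, embeddings)
        labels = list(candidate_topics)
        n_topics = len(set(labels) - {-1})           # topic -1 collects outliers
        outlier_share = labels.count(-1) / len(labels)
        results.append({"n_neighbors": n, "min_cluster_size": min_cluster,
                        "n_topics": n_topics, "outlier_share": outlier_share})

pd.DataFrame(results)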


## 7️⃣ Evaluation & Visualization

In [None]:

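# Call .show() on each figure; a notebook cell only renders its last expression
topic_model.visualize_topics().show()
topic_model.visualize_barchart().show()
topic_model.visualize_hierarchy().show()

# visualize_topics_over_time expects the table returned by topics_over_time
topics_over_time = topic_model.topics_over_time(df['clean_text'].tolist(), df['publication_date'].tolist(), nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10).show()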
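
The plots give a qualitative view. As a rough quantitative complement, one commonly used measure is topic diversity: the share of unique words among the top words of all topics. The sketch below assumes a mapping shaped like the output of `topic_model.get_topics()` (topic id to a list of `(word, score)` pairs); `topic_diversity` is a hypothetical helper, and the toy data are invented.

```python
def topic_diversity(topic_words, top_k=10):
    """Fraction of unique words among the top_k words of every non-outlier topic."""
    words = []
    for topic_id, pairs in topic_words.items():
        if topic_id == -1:
            continue  # skip the outlier topic
        words.extend(word for word, _score in pairs[:top_k])
    return len(set(words)) / len(words) if words else 0.0

# Invented topics for illustration; "binding" appears in both real topics
toy = {
    -1: [("the", 0.1), ("of", 0.1)],
    0: [("docking", 0.5), ("ligand", 0.4), ("binding", 0.3)],
    1: [("generative", 0.5), ("molecule", 0.4), ("binding", 0.2)],
}
diversity = topic_diversity(toy, top_k=3)
print(round(diversity, 3))
```

Values close to 1 suggest topics with little vocabulary overlap, while values close to 0 suggest redundant topics. In the notebook, the call would be `topic_diversity(topic_model.get_topics())`.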


## 8️⃣ Results & Insights
- List and interpret top topics.
- Identify emerging or declining research themes.
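
One simple way to flag emerging or declining themes is to fit a straight line to each topic's share of publications per year and look at the sign of the slope. The sketch below is a minimal version under that assumption; `trend_slope` and the `(year, share)` input format are illustrative, and the numbers are invented.

```python
def trend_slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0  # all points share the same x, so no trend can be fitted
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return cov_xy / var_x

# Invented yearly shares of all abstracts for two hypothetical topics
shares = {
    "topic_a": [(2019, 0.02), (2020, 0.04), (2021, 0.06), (2022, 0.08)],
    "topic_b": [(2019, 0.10), (2020, 0.08), (2021, 0.06), (2022, 0.04)],
}
slopes = {name: trend_slope(pts) for name, pts in shares.items()}
emerging = [name for name, s in slopes.items() if s > 0]
declining = [name for name, s in slopes.items() if s < 0]
```

Using shares rather than raw counts guards against mistaking overall growth in publication volume for growth in a single topic.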


## 9️⃣ Conclusion
- Summarize key findings.
- Highlight business and scientific relevance.
- Suggest future work (e.g., comparing journals, regions, or institutions).
