# PubMed Open Access Alzheimer's Disease Articles: Query and NLP-Augmented Parsing Example

This notebook demonstrates how to use the `PubMedQueryService`, `PubMedAnalyzer`, and `PMCFullTextParser` classes to:
- Search PubMed for open access articles related to Alzheimer's disease (MeSH terms), limited in this example to the current year
- Fetch and summarize article metadata
- Filter articles by keywords
- Retrieve full text from PMC
- Parse and extract tables, sections, and figures from full text XML
- Analyze and visualize article metadata distributions

## 1. Initialize environment

* Load the contact email and NCBI API key from a `.env` file

In [1]:
# Load the email and NCBI API key from a .env file into the process environment
import os
from dotenv import load_dotenv

# Without this call, variables defined only in .env are not visible to os.environ
load_dotenv()

EMAIL = os.environ.get("EMAIL")  # e.g., 'your.email@example.com'
API_KEY = os.environ.get("NCBI_API_KEY")  # e.g., 'your_ncbi_api_key'

## 2. Initialize PubMedQueryService

Create an instance of PubMedQueryService. Optionally, provide your email and NCBI API key for higher rate limits.

In [2]:
from niagads.pubmed.services import PubMedQueryService

pubmed_service = PubMedQueryService(email=EMAIL, api_key=API_KEY)
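
The rate-limit benefit mentioned above can be made concrete. NCBI's E-utilities guidance allows up to 3 requests per second without an API key and up to 10 with one. The sketch below turns that into a minimum delay between requests. `min_request_interval` is a hypothetical helper for illustration and is not part of `niagads`; how `PubMedQueryService` throttles internally is not shown in this notebook.

```python
from typing import Optional

# NCBI E-utilities guidance: 3 requests/second without an API key, 10 with one.
REQUESTS_PER_SECOND_NO_KEY = 3
REQUESTS_PER_SECOND_WITH_KEY = 10


def min_request_interval(api_key: Optional[str]) -> float:
    """Return the minimum delay in seconds between consecutive requests."""
    limit = REQUESTS_PER_SECOND_WITH_KEY if api_key else REQUESTS_PER_SECOND_NO_KEY
    return 1.0 / limit


print(min_request_interval(None))
print(min_request_interval("your_ncbi_api_key"))
```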

## 3. Search PubMed for Articles (Current Year Only)

Search for open access PubMed articles using the MeSH terms 'Alzheimer Disease' and 'Genetics', restricted to the current year. The cell first requests the number of matches, then fetches every matching PMID.

In [3]:
from niagads.pubmed.services import PubMedQueryFilters

# Search for open access Alzheimer's disease genetics articles (MeSH) for the current year only
mesh_term = ["Alzheimer Disease", "Genetics"]
current_year = 2025

# Build the filters once so the count and the PMID fetch use the same query
filters = PubMedQueryFilters(
    keyword=None,
    year=current_year,
    mesh_term=mesh_term,
    open_access_only=True
)

# Get article count for the current year
count = await pubmed_service.find_articles(filters=filters, counts_only=True)
print(f"{current_year}: {count} articles found")

# Fetch all PMIDs for the current year (if any)
if count > 0:
    all_pmids = await pubmed_service.find_articles(
        filters=filters,
        max_results=count,
        ids_only=True
    )
    print(f"Total PMIDs found: {len(all_pmids)} (should match count: {count})")
else:
    all_pmids = []
    print("No articles found for the current year.")

2025: 96 articles found
Total PMIDs found: 96 (should match count: 96)
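
For readers who want to see what such a filtered search might look like as a raw PubMed query, here is a minimal sketch. It is an assumption about the general shape of the term, not the string `PubMedQueryService` actually builds (that is not shown in this notebook). It also assumes multiple MeSH terms are combined with AND. `build_pubmed_term` is a hypothetical helper; the field tags `[MeSH Terms]` and `[pdat]` and the `pubmed pmc open access` filter are standard PubMed search syntax.

```python
from typing import Optional, Sequence


def build_pubmed_term(
    mesh_terms: Sequence[str],
    year: Optional[int] = None,
    open_access_only: bool = False,
) -> str:
    """Combine the filters into one PubMed search term joined with AND."""
    clauses = [f'"{term}"[MeSH Terms]' for term in mesh_terms]
    if year is not None:
        # Restrict by publication date
        clauses.append(f"{year}[pdat]")
    if open_access_only:
        # PubMed filter for records in the PMC Open Access subset
        clauses.append('"pubmed pmc open access"[filter]')
    return " AND ".join(clauses)


term = build_pubmed_term(["Alzheimer Disease", "Genetics"], year=2025, open_access_only=True)
print(term)
```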


## 4. Fetch Article Metadata

Fetch metadata (title, abstract, authors, etc.) for the PMIDs found in the previous step.

In [4]:
# Fetch metadata for the first 20 PMIDs as a sample
sample_pmids = all_pmids[:20]

async def fetch_sample_metadata():
    metadata = await pubmed_service.fetch_article_metadata(sample_pmids)
    print(f"Fetched metadata for {len(metadata)} articles.")
    for article in metadata:
        print(f"PMID: {article.pmid}")
        print(f"Title: {article.title}")
        print(f"Year: {article.year}")
        print(f"Journal: {article.journal}")
        print(f"Authors: {[a.last for a in article.authors]}")
        print(f"Abstract: {article.abstract[:200] if article.abstract else ''}...\n")
    return metadata

sample_metadata = await fetch_sample_metadata()

Fetched metadata for 20 articles.
PMID: 40864721
Title: Gut-brain nexus: Mapping multimodal links to neurodegeneration at biobank scale.
Year: 2025
Journal: Science advances
Authors: ['Shafieinouri', 'Hong', 'Lee', 'Grant', 'Khani', 'Dadu', 'Schumacher Schuh', 'Makarious', 'Sandon', 'Simmonds', 'Iwaki', 'Hill', 'Blauwendraat', 'Escott-Price', 'Qi', 'Noyce', 'Reyes-Palomares', 'Leonard', 'Tansey', 'Faghri', 'Singleton', 'Nalls', 'Levine', 'Bandres-Ciga']
Abstract: Alzheimer's disease (AD) and Parkinson's disease (PD) are influenced by genetic and environmental factors. We conducted a biobank-scale study to (i) identify endocrine, nutritional, metabolic, and dig...

PMID: 40862516
Title: UnCOT-AD: Unpaired Cross-Omics Translation Enables Multi-Omics Integration for Alzheimer's Disease Prediction.
Year: 2025
Journal: Briefings in bioinformatics
Authors: ['Abir', 'Dip', 'Zhang']
Abstract: Alzheimer's Disease (AD) is a progressive neurodegenerative disorder, posing a growing public health c
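
To cover every PMID rather than a 20-article sample, the ID list can be split into fixed-size batches. This is a minimal sketch: `batched_pmids` is a hypothetical helper, and the batch size of 20 simply mirrors the sample size used above.

```python
from typing import Iterator, List, Sequence


def batched_pmids(pmids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size PMIDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(pmids), batch_size):
        yield list(pmids[start:start + batch_size])


# Placeholder IDs standing in for the 96 PMIDs returned by the search
example_pmids = [str(n) for n in range(96)]
batches = list(batched_pmids(example_pmids, 20))
print([len(batch) for batch in batches])
```

Each batch could then be passed to `pubmed_service.fetch_article_metadata(batch)` in turn.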

## 5. Filter Articles by Keywords

Filter the fetched articles by keywords in the title or abstract (e.g., 'amyloid', 'tau').

In [5]:
# Filter articles by keywords in title or abstract
keywords = ["amyloid", "tau"]
filtered_articles = PubMedQueryService.filter_articles_by_keywords(sample_metadata, keywords)

print(f"Articles matching keywords {keywords}: {len(filtered_articles)}")
for article in filtered_articles:
    print(f"PMID: {article.pmid} | Title: {article.title}")

Articles matching keywords ['amyloid', 'tau']: 7
PMID: 40810263 | Title: Plasma proteomic analysis identifies proteins and pathways related to Alzheimer's risk.
PMID: 40696469 | Title: CHIT1 and DDAH1 levels relate to amyloid-related imaging abnormalities risk profile in Alzheimer's disease patients.
PMID: 40665049 | Title: APOE ε4 carriers share immune-related proteomic changes across neurodegenerative diseases.
PMID: 40660303 | Title: Proteomic landscape of Alzheimer's disease: emerging technologies, advances and insights (2021 - 2025).
PMID: 40653809 | Title: Label-Free Proteomic Profiling of the dvls2 (CL2006) Caenorhabditis elegans Alzheimer's Disease (AD) Model Reveals Conserved Molecular Signatures Shared With the Human AD Brain.
PMID: 40600356 | Title: Milestone Review: The History of Molecular Genetics Analysis of Alzheimer's Disease.
PMID: 40595720 | Title: Proteomic analysis of Down syndrome cerebrospinal fluid compared to late-onset and autosomal dominant Alzheimer´s diseas
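
Several of the matches above have no keyword in the title, so they must have matched on the abstract. The sketch below shows one way to filter on title or abstract. It is not the implementation of `filter_articles_by_keywords`, whose matching rules are not shown here. `Article` and `filter_by_keywords` are hypothetical stand-ins, and whole-word, case-insensitive matching is an assumption chosen so that a short keyword like "tau" does not match inside "taurine".

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Article:
    """Hypothetical stand-in for the article metadata objects used above."""
    pmid: str
    title: str
    abstract: Optional[str] = None


def filter_by_keywords(articles: Iterable[Article], keywords: Iterable[str]) -> List[Article]:
    """Keep articles whose title or abstract contains any keyword as a whole word."""
    escaped = [re.escape(k) for k in keywords]
    if not escaped:
        return []
    # \b anchors each keyword at word boundaries; IGNORECASE matches "Tau" and "tau" alike
    pattern = re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)
    return [a for a in articles if pattern.search(f"{a.title} {a.abstract or ''}")]


articles = [
    Article("1", "Plasma proteomics of AD risk", "Levels of Tau were elevated."),
    Article("2", "Taurine supplementation in mice"),
    Article("3", "Amyloid-related imaging abnormalities"),
]
matches = filter_by_keywords(articles, ["amyloid", "tau"])
print([a.pmid for a in matches])
```

Whole-word matching is stricter than substring matching: it would skip "amyloidosis" as well as "taurine", so the right choice depends on the keywords.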

## 6. Summarize and Plot Article Metadata

Use PubMedAnalyzer to summarize the distribution of publication years, journals, and MeSH terms, and plot the results.

In [None]:
# Summarize and plot article metadata using PubMedAnalyzer
from niagads.pubmed.parsers import PubMedAnalyzer
analyzer = PubMedAnalyzer(sample_metadata)
analyzer.summarize()
print(analyzer.summary)
analyzer.plot_summary()
print(analyzer.summary.mesh_terms)
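
The distributions described above are frequency counts over metadata fields. The sketch below shows that idea with `collections.Counter`. It is not `PubMedAnalyzer`'s implementation; `summarize_records` and the dictionary record layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_records(records: Iterable[dict]) -> Dict[str, Counter]:
    """Count publication years, journals, and MeSH terms across records."""
    years, journals, mesh_terms = Counter(), Counter(), Counter()
    for record in records:
        years[record["year"]] += 1
        journals[record["journal"]] += 1
        # A record can carry several MeSH terms; count each one
        mesh_terms.update(record.get("mesh_terms", []))
    return {"year": years, "journal": journals, "mesh_terms": mesh_terms}


records = [
    {"year": 2025, "journal": "Science advances", "mesh_terms": ["Alzheimer Disease", "Humans"]},
    {"year": 2025, "journal": "Briefings in bioinformatics", "mesh_terms": ["Alzheimer Disease"]},
]
summary = summarize_records(records)
print(summary["mesh_terms"].most_common(1))
```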

## 7. Generate a Word Cloud from Article Titles

This step visualizes the most frequent phrases in article titles using the `TextSummarizer` class, which uses large language models (LLMs) for phrase extraction and, optionally, for semantic clustering. When embedding-based clustering is enabled, similar phrases are grouped together, so the word cloud reflects semantic similarity as well as raw frequency. The cell below passes `use_embeddings=False`, so that grouping is not applied here. The run captured below stopped with an `ImportError` because the `protobuf` package was not installed; install it before running this cell.

In [6]:
# Generate a word cloud from article titles using TextSummarizer
from niagads.nlp_utils.summarizer import TextSummarizer

# Collect all article titles (use filtered_articles if available, else sample_metadata)
if 'filtered_articles' in locals() and filtered_articles:
    titles = [article.title for article in filtered_articles if article.title]
elif 'sample_metadata' in locals() and sample_metadata:
    titles = [article.title for article in sample_metadata if article.title]
else:
    titles = []

if titles:
    summarizer = TextSummarizer()
    clusters = summarizer.semantic_phrase_clustering(titles, top_n=30, min_ngram=1, max_ngram=3, use_embeddings=False)
    summarizer.plot_ngram_wordcloud(clusters, max_words=50, title="Word Cloud of Article Titles")
else:
    print("No article titles available for word cloud visualization.")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-pubmed and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ImportError: 
 requires the protobuf library but it was not found in your environment. Check out the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment. Please note that you may need to restart your runtime after installation.
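
As a dependency-free point of comparison, phrase frequencies for a word cloud can be approximated by counting n-grams directly. This sketch is a plain frequency count, not the LLM-based extraction performed by `TextSummarizer`. `title_ngrams` and the small stopword set are hypothetical; the parameter names mirror the `min_ngram`, `max_ngram`, and `top_n` arguments used above.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "with", "at"}


def title_ngrams(
    titles: Iterable[str], min_n: int = 1, max_n: int = 3, top_n: int = 30
) -> List[Tuple[str, int]]:
    """Return the top_n most frequent n-grams (min_n..max_n words) across titles."""
    counts: Counter = Counter()
    for title in titles:
        # Stopwords are dropped before n-grams are formed, so an n-gram may
        # join words that were separated by a stopword in the original title
        tokens = [t for t in re.findall(r"[a-z0-9']+", title.lower()) if t not in STOPWORDS]
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_n)


titles = [
    "Proteomic analysis of Alzheimer's disease",
    "Plasma proteomic analysis in Alzheimer's disease",
]
top_phrases = dict(title_ngrams(titles))
print(top_phrases["proteomic analysis"])
```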


## 8. Fetch Full Text for Selected PMIDs

Fetch full text XML for selected PMIDs from the PMC Open Access subset.

In [None]:
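# Fetch full text for the first filtered article (if available)
full_text_xml = None
if filtered_articles:
    pmid = filtered_articles[0].pmid
    full_text_results = await pubmed_service.fetch_full_text([pmid])
    # Assumption: the return type of fetch_full_text is not shown in this notebook.
    # Handle a dict keyed by PMID or a sequence in request order; adjust as needed.
    if isinstance(full_text_results, dict):
        full_text_xml = full_text_results.get(pmid)
    elif full_text_results:
        full_text_xml = full_text_results[0]
else:
    print("No filtered articles available for full text fetch.")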

## 9. Parse PMC Full Text XML

Create a PMCFullTextParser instance with the full text XML and parse it to extract content.

In [None]:
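# Parse the full text XML if available
# Assumed import path (same module as PubMedAnalyzer); adjust if PMCFullTextParser lives elsewhere
from niagads.pubmed.parsers import PMCFullTextParser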
if full_text_xml:
    parser = PMCFullTextParser(full_text_xml)
    parser.parse()
    print("PMC full text parsed.")
else:
    parser = None
    print("No full text XML available to parse.")

## 10. Extract Tables, Sections, and Figures from Full Text

Use PMCFullTextParser methods to extract tables, sections, and figures from the parsed XML.

In [None]:
# Extract and display tables, sections, and figures from the parsed PMC full text
if parser:
    tables = parser.get_tables()
    sections = parser.get_sections()
    figures = parser.get_figures()
    print(f"Tables found: {len(tables)}")
    print(f"Sections found: {len(sections)}")
    print(f"Figures found: {len(figures)}")
    if tables:
        print("First table:", tables[0])
    if sections:
        print("First section:", sections[0])
    if figures:
        print("First figure:", figures[0])
else:
    print("No parsed PMC full text to extract content from.")
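
PMC full text is delivered as JATS XML, where sections, tables, and figures are `sec`, `table-wrap`, and `fig` elements. The sketch below shows how such content can be pulled out with the standard library. It is not the `PMCFullTextParser` implementation; `extract_content`, `text_of`, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<article>
  <body>
    <sec>
      <title>Results</title>
      <p>APOE was associated with risk.</p>
      <table-wrap id="T1">
        <label>Table 1</label>
        <caption><p>Cohort characteristics</p></caption>
      </table-wrap>
      <fig id="F1">
        <label>Figure 1</label>
        <caption><p>Study design</p></caption>
      </fig>
    </sec>
    <sec>
      <title>Discussion</title>
      <p>These findings suggest shared pathways.</p>
    </sec>
  </body>
</article>"""


def text_of(element):
    """Collapse all nested text of an element into one whitespace-normalized string."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def labeled_items(root, tag):
    """Collect id, label, and caption for every element with the given tag."""
    return [
        {"id": el.get("id"), "label": text_of(el.find("label")), "caption": text_of(el.find("caption"))}
        for el in root.iter(tag)
    ]


def extract_content(xml_string):
    """Extract sections, tables, and figures from a JATS-style XML string."""
    root = ET.fromstring(xml_string)
    sections = [
        # Only direct <p> children are joined, so captions nested in tables/figures are excluded
        {"title": text_of(sec.find("title")), "text": " ".join(text_of(p) for p in sec.findall("p"))}
        for sec in root.iter("sec")
    ]
    return {
        "sections": sections,
        "tables": labeled_items(root, "table-wrap"),
        "figures": labeled_items(root, "fig"),
    }


content = extract_content(SAMPLE_XML)
print(len(content["sections"]), len(content["tables"]), len(content["figures"]))
```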

## 11. Extract Specific Sections: Results, Conclusions, Discussion

Extract and display only the sections whose titles contain 'Results', 'Conclusions', or 'Discussion'. The title list is first expanded with LLM-generated synonyms so that variant headings are also matched.

In [None]:
# Extract and display 'Results', 'Conclusions', 'Discussion' sections using LLM-based synonym expansion
from niagads.nlp_utils.matchers.synonyms import LLMSynonymMatcher, SynonymModel

if parser:
    # Define section titles to extract (case-insensitive)
    section_titles = ["Results", "Conclusions", "Discussion"]
    # Use LLMSynonymMatcher to get extended list of synonyms for section titles
    matcher = LLMSynonymMatcher(section_titles, model=SynonymModel.T5)
    # The notebook already runs an event loop, so await the coroutine directly;
    # asyncio.run() would raise RuntimeError here
    extended_synonyms = await matcher.get_extended_synonyms()
    print(f"Section titles + LLM synonyms: {extended_synonyms}")
    # Use parser.get_sections with the extended_synonyms as the titles argument
    filtered_sections = parser.get_sections(titles=list(extended_synonyms))
    print(f"Sections matching {section_titles} (with LLM synonyms): {len(filtered_sections)}")
    for sec in filtered_sections:
        print(f"Title: {sec['title']}")
        print(f"Text: {sec['text'][:500]}...\n")
else:
    print("No parsed PMC full text to extract specific sections from.")
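
The title matching itself can be illustrated without a model. The sketch below applies case-insensitive substring matching against an expanded title list. It is an assumption about how `parser.get_sections(titles=...)` behaves, `match_sections` is a hypothetical helper, and the extra synonym "Findings" is hard-coded here in place of real LLM output.

```python
from typing import Dict, Iterable, List


def match_sections(sections: Iterable[Dict[str, str]], titles: Iterable[str]) -> List[Dict[str, str]]:
    """Keep sections whose title contains any wanted title, ignoring case."""
    wanted = [t.casefold() for t in titles]
    return [s for s in sections if any(w in s["title"].casefold() for w in wanted)]


sections = [
    {"title": "2. Methods", "text": "..."},
    {"title": "3. Results and Discussion", "text": "..."},
    {"title": "CONCLUSIONS", "text": "..."},
    {"title": "Key Findings", "text": "..."},
]
# Hard-coded stand-in for the expanded synonym list
expanded_titles = ["Results", "Conclusions", "Discussion", "Findings"]
matched = match_sections(sections, expanded_titles)
print([s["title"] for s in matched])
```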

## 12. Extract Gene References from Filtered Sections

Use the LLM-based gene extractor to identify gene mentions in the filtered sections. This step demonstrates how to apply the `GeneReferenceExtractor` to biomedical text extracted from PMC full text articles.

In [None]:
# Extract gene references from the filtered sections using GeneReferenceExtractor
from niagads.nlp_utils.extractors.gene import GeneReferenceExtractor, GeneNERModel

gene_extractor = GeneReferenceExtractor(model=GeneNERModel.D4DATA)

# Combine all filtered section texts into one block for extraction
if parser and 'filtered_sections' in locals():
    all_section_text = "\n".join(sec['text'] for sec in filtered_sections)
    gene_mentions = gene_extractor.extract(all_section_text)
    print(f"Gene mentions found: {gene_mentions}")
else:
    print("No filtered sections available for gene extraction.")

## 13. Summarize Gene Contexts in Filtered Sections

Use the `TextSummarizer` to generate a summary for each gene, based on the sentences in which it appears in the filtered sections. This provides a concise overview of the context in which each gene is discussed.

In [None]:
# Summarize gene contexts using TextSummarizer
from niagads.nlp_utils.summarizer import TextSummarizer, SummarizationModel

# Extract gene context sentences (per gene) from the filtered sections
if parser and 'filtered_sections' in locals():
    all_section_text = "\n".join(sec['text'] for sec in filtered_sections)
    gene_contexts = gene_extractor.extract_gene_contexts(all_section_text)
    summarizer = TextSummarizer(model=SummarizationModel.PEGASUS_PUBMED)
    gene_summaries = summarizer.summarize(gene_contexts)
    for gene, summary in gene_summaries.items():
        print(f"Gene: {gene}\nSummary: {summary}\n")
else:
    print("No filtered sections available for gene context summarization.")
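
The per-gene context step can be pictured as grouping sentences by the gene symbols they mention. The sketch below shows that grouping with a simple regex-based sentence split. It is not the implementation of `extract_gene_contexts`; `gene_context_sentences` is a hypothetical helper, and the gene list is supplied by hand rather than detected by a model.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def gene_context_sentences(text: str, genes: Iterable[str]) -> Dict[str, List[str]]:
    """Map each gene symbol to the sentences in which it appears as a whole word."""
    # Simple split on sentence-ending punctuation; abbreviations such as "e.g." would be split too
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    contexts: Dict[str, List[str]] = defaultdict(list)
    for sentence in sentences:
        for gene in genes:
            # Case-sensitive match, since gene symbols are conventionally uppercase
            if re.search(rf"\b{re.escape(gene)}\b", sentence):
                contexts[gene].append(sentence)
    return dict(contexts)


text = (
    "APOE e4 carriers showed higher risk. "
    "TREM2 variants alter microglial function. "
    "Both APOE and TREM2 were replicated."
)
contexts = gene_context_sentences(text, ["APOE", "TREM2"])
for gene, sentences in contexts.items():
    print(gene, len(sentences))
```

Each gene's sentence list could then be joined into one string and passed to a summarizer.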