# PubMed Open Access Alzheimer's Disease Articles: Query and NLP Augmented Parsing Example

This notebook demonstrates how to use the `PubMedQueryService`, `PubMedAnalyzer`, and `PMCFullTextParser` classes to:
- Search PubMed for open access articles indexed under the Alzheimer's disease and genetics MeSH terms, published in the current year
- Fetch and summarize article metadata
- Filter articles by keywords
- Retrieve full text from PMC
- Parse and extract tables, sections, and figures from full text XML
- Analyze and visualize article metadata distributions

## 1. Initialize environment

* Pull NCBI API Key from .env file

In [1]:

import os
import logging
import sys
from dotenv import load_dotenv

DEBUG = False

# configure logging
logging.basicConfig(
    level=logging.DEBUG if DEBUG else logging.INFO,
    stream=sys.stdout,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s"
)

# load variables from a local .env file into the environment
load_dotenv()

# set your email and API key for NCBI from environment variables
EMAIL = os.environ.get("EMAIL")  # e.g., 'your.email@example.com'
API_KEY = os.environ.get("NCBI_API_KEY")  # e.g., 'your_ncbi_api_key'


## 2. Initialize PubMedQueryService

Create an instance of PubMedQueryService. Optionally, provide your email and NCBI API key for higher rate limits.

In [2]:
from niagads.pubmed.services import PubMedQueryService

pubmed_service = PubMedQueryService(email=EMAIL, api_key=API_KEY, debug=DEBUG)
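To make the rate-limit remark concrete: NCBI documents a limit of 3 E-utilities requests per second without an API key and 10 per second with one. The sketch below shows one way a client could space its calls accordingly. `min_request_interval` and `Throttle` are hypothetical helpers written for this illustration; they are not part of `niagads`, and how `PubMedQueryService` throttles internally is not shown in this notebook.

```python
import time
from typing import Optional

# NCBI's documented E-utilities limits (requests per second)
REQUESTS_PER_SECOND_NO_KEY = 3
REQUESTS_PER_SECOND_WITH_KEY = 10


def min_request_interval(api_key: Optional[str]) -> float:
    """Return the minimum spacing in seconds between requests (hypothetical helper)."""
    rate = REQUESTS_PER_SECOND_WITH_KEY if api_key else REQUESTS_PER_SECOND_NO_KEY
    return 1.0 / rate


class Throttle:
    """Sleep just long enough to keep calls at or below the allowed rate."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last_call: Optional[float] = None  # monotonic time of the previous call

    def wait(self) -> None:
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_request_interval(api_key=None))
throttle.wait()  # first call returns immediately
throttle.wait()  # second call pauses until the interval has elapsed
```

A caller would invoke `throttle.wait()` immediately before each HTTP request.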

## 3. Search PubMed for Articles (Current Year Only)

Search for open access PubMed articles indexed under the MeSH terms 'Alzheimer Disease' and 'Genetics' for the current year only. This step first counts the matching articles, then fetches all of their PMIDs.

In [3]:
from niagads.pubmed.services import PubMedQueryFilters

# Search for open access Alzheimer's disease genetics articles (MeSH) for the current year only
mesh_term = ["Alzheimer Disease", "Genetics"]
current_year = 2025

# Build the filters once so the count query and the ID query stay identical
filters = PubMedQueryFilters(
    keyword=None,
    year=current_year,
    mesh_term=mesh_term,
    open_access_only=True
)

# Get article count for the current year
count = await pubmed_service.find_articles(filters=filters, counts_only=True)
print(f"{current_year}: {count} articles found")

# Fetch all PMIDs for the current year (if any)
if count > 0:
    all_pmids = await pubmed_service.find_articles(
        filters=filters,
        max_results=count,
        ids_only=True
    )
    print(f"Total PMIDs found: {len(all_pmids)} (should match count: {count})")
else:
    all_pmids = []
    print("No articles found for the current year.")

2025: 146 articles found
Total PMIDs found: 146 (should match count: 146)
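For readers curious how filters like these could map onto an NCBI ESearch request, here is a minimal sketch that only builds the query string and URL (no network call). The way `PubMedQueryService` actually composes its query is not shown in this notebook, so `build_search_term` and `build_esearch_url` are hypothetical helpers. Joining the MeSH terms with `AND` and using the `"pubmed pmc open access"[filter]` clause are both assumptions.

```python
from typing import List, Optional
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_term(mesh_terms: List[str], year: int, open_access_only: bool = True) -> str:
    """Combine MeSH terms, a publication year, and an open access filter with AND."""
    clauses = [f'"{term}"[MeSH Terms]' for term in mesh_terms]
    clauses.append(f"{year}[pdat]")  # publication date field
    if open_access_only:
        # Assumption: this filter restricts results to the PMC open access subset
        clauses.append('"pubmed pmc open access"[filter]')
    return " AND ".join(clauses)


def build_esearch_url(
    term: str,
    retmax: int = 0,
    email: Optional[str] = None,
    api_key: Optional[str] = None,
) -> str:
    """Build an ESearch URL; with retmax=0 the response carries the count but no IDs."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    if email:
        params["email"] = email
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = build_search_term(["Alzheimer Disease", "Genetics"], 2025)
print(term)
print(build_esearch_url(term, retmax=0))
```

Requesting the count first (`retmax=0`) and then the IDs (`retmax=count`) mirrors the two-step pattern used in the cell above.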


## 4. Fetch Article Metadata

Fetch metadata (title, abstract, authors, etc.) for the PMIDs found in the previous step.

In [4]:
# Fetch metadata for the first 20 PMIDs as a sample
sample_pmids = all_pmids[:20]

async def fetch_sample_metadata():
    metadata = await pubmed_service.fetch_article_metadata(sample_pmids)
    print(f"Fetched metadata for {len(metadata)} articles.")
    for article in metadata:
        print(f"PMID: {article.pmid}")
        print(f"Title: {article.title}")
        print(f"Year: {article.year}")
        print(f"Journal: {article.journal}")
        print(f"Authors: {[a.last for a in article.authors]}")
        print(f"Abstract: {article.abstract[:200] if article.abstract else ''}...\n")
    return metadata

sample_metadata = await fetch_sample_metadata()

Fetched metadata for 20 articles.
PMID: 41235755
Title: Long-read sequencing reveals genomic and epigenomic variation in the dark genome of human Alzheimer's disease.
Year: 2025
Journal: Alzheimer's & dementia : the journal of the Alzheimer's Association
Authors: ['Ramirez', 'Sun', 'Dehkordi', 'Zare', 'Pascarella', 'Carninci', 'Fongang', 'Bieniek', 'Frost']
Abstract: Faulty DNA repair and epigenetic regulation contribute to neurodegeneration in Alzheimer's disease. Long-read sequencing enables analysis of "dark regions" that are difficult to study via traditional ...

PMID: 41225772
Title: Genetic variation in the Nr1d1 transcription factor binding site shapes metabolism-related protein networks associated with cognitive resilience in an Alzheimer's disease mouse reference panel.
Year: 2025
Journal: Alzheimer's & dementia : the journal of the Alzheimer's Association
Authors: ['Chen', 'Stevenson', 'Cao', 'Fish', 'Robbins', 'Merrihew', 'Park', 'Hohman', 'MacCoss', 'Kaczorowski']
Abstract
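The metadata fields printed above correspond to elements in the XML that PubMed's EFetch returns (`PMID`, `ArticleTitle`, `Journal/Title`, `AbstractText`, `Author/LastName`). As a rough illustration of that mapping, the sketch below parses a small hand-written record with the standard library. The sample XML is invented, and `ArticleRecord` and `parse_pubmed_xml` are hypothetical names; the parsing done inside `fetch_article_metadata` may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional

# Invented record that follows the PubmedArticleSet layout
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An invented title for illustration.</ArticleTitle>
        <Abstract><AbstractText>An invented abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


@dataclass
class ArticleRecord:
    pmid: Optional[str]
    title: Optional[str]
    journal: Optional[str]
    abstract: Optional[str]
    authors: List[str] = field(default_factory=list)


def parse_pubmed_xml(xml_text: str) -> List[ArticleRecord]:
    """Extract a few metadata fields from each PubmedArticle element."""
    records = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        title_node = citation.find("Article/ArticleTitle")
        # Structured abstracts have several AbstractText nodes; join them in order
        abstract_parts = [
            "".join(node.itertext()).strip()
            for node in citation.iterfind("Article/Abstract/AbstractText")
        ]
        records.append(
            ArticleRecord(
                pmid=citation.findtext("PMID"),
                title="".join(title_node.itertext()).strip() if title_node is not None else None,
                journal=citation.findtext("Article/Journal/Title"),
                abstract=" ".join(abstract_parts) or None,
                authors=[
                    author.findtext("LastName")
                    for author in citation.iterfind("Article/AuthorList/Author")
                    if author.findtext("LastName")
                ],
            )
        )
    return records


records = parse_pubmed_xml(SAMPLE_XML)
print(records[0])
```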

## 5. Filter Articles by Keywords

Filter the fetched articles by keywords in the title or abstract (e.g., 'amyloid', 'tau').

In [5]:
# Filter articles by keywords in title or abstract
keywords = ["amyloid", "tau"]
filtered_articles = PubMedQueryService.filter_articles_by_keywords(sample_metadata, keywords)

print(f"Articles matching keywords {keywords}: {len(filtered_articles)}")
for article in filtered_articles:
    print(f"PMID: {article.pmid} | Title: {article.title}")

Articles matching keywords ['amyloid', 'tau']: 5
PMID: 41178698 | Title: Proteomic subtyping of Alzheimer's disease CSF links blood-brain barrier dysfunction to reduced levels of tau and synaptic biomarkers.
PMID: 41174000 | Title: Combined multi-omics and brain pathology reveal novel biomarkers for alzheimer's disease.
PMID: 41085772 | Title: Genetic and proteomic analysis identifies BAG3 as an amyloid-responsive regulator of neuronal proteostasis.
PMID: 41078215 | Title: Differential effects of age and sex on tau pathology propagation in the htau mouse model: A neuropathological and proteomic study.
PMID: 41073674 | Title: Glia inflammation and cell death pathways drive disease progression in preclinical and early AD.
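Note that the last title listed above contains neither keyword, so that match presumably came from the abstract. The sketch below shows one way such a title-or-abstract filter could work. `Article` and `filter_by_keywords` are hypothetical stand-ins, and whether `filter_articles_by_keywords` matches substrings or whole words is not stated in this notebook, so both modes are shown.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    pmid: str
    title: Optional[str] = None
    abstract: Optional[str] = None


def filter_by_keywords(
    articles: List[Article], keywords: List[str], whole_words: bool = True
) -> List[Article]:
    """Keep articles whose title or abstract mentions any keyword (case-insensitive)."""
    if not keywords:
        return []
    alternatives = "|".join(re.escape(keyword) for keyword in keywords)
    # Word boundaries stop short keywords such as "tau" from matching inside longer words
    pattern = re.compile(
        rf"\b(?:{alternatives})\b" if whole_words else alternatives, re.IGNORECASE
    )
    return [
        article
        for article in articles
        if pattern.search(f"{article.title or ''} {article.abstract or ''}")
    ]


# Invented records for illustration
articles = [
    Article("1", "Tau pathology in mice"),
    Article("2", "A study of restaurants", "No relevant terms."),
    Article("3", "Glial inflammation", "Amyloid deposition was measured."),
]
print([a.pmid for a in filter_by_keywords(articles, ["amyloid", "tau"])])
print([a.pmid for a in filter_by_keywords(articles, ["amyloid", "tau"], whole_words=False)])
```

With plain substring matching, the second record is kept because "restaurants" contains "tau"; whole-word matching avoids that kind of false positive.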


## 6. Summarize and Plot Article Metadata

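Use `PubMedAnalyzer` to summarize the distribution of publication years, journals, and MeSH terms. The `plot_summary()` call that plots the results is commented out in the cell below; uncomment it to display the plots.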

In [None]:
# Summarize and plot article metadata using PubMedAnalyzer
from niagads.pubmed.parsers import PubMedAnalyzer
analyzer = PubMedAnalyzer(sample_metadata)
analyzer.summarize()
print(analyzer.summary)
# analyzer.plot_summary()
print(analyzer.summary.mesh_terms)
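Conceptually, a summary of this kind is a set of frequency tables. The sketch below builds them with `collections.Counter` over invented records. `ArticleMeta` and `summarize_metadata` are hypothetical names, and the structure of the real `analyzer.summary` object may differ.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ArticleMeta:
    year: int
    journal: str
    mesh_terms: List[str] = field(default_factory=list)


def summarize_metadata(articles: List[ArticleMeta]) -> Dict[str, Counter]:
    """Count how often each year, journal, and MeSH term occurs."""
    return {
        "year": Counter(article.year for article in articles),
        "journal": Counter(article.journal for article in articles),
        "mesh_terms": Counter(term for article in articles for term in article.mesh_terms),
    }


# Invented records for illustration
articles = [
    ArticleMeta(2025, "Journal A", ["Alzheimer Disease", "Genetics"]),
    ArticleMeta(2025, "Journal B", ["Alzheimer Disease"]),
    ArticleMeta(2024, "Journal A", ["Alzheimer Disease", "tau Proteins"]),
]
summary = summarize_metadata(articles)
for name, counts in summary.items():
    print(name, counts.most_common(3))
```

Each `Counter` can be handed directly to a bar chart, for example by plotting the keys and values returned by `most_common()`.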

## 7. Generate a Word Cloud from Article Titles

This step visualizes the most frequent phrases in article titles using the `PhraseClusterAnalyzer` class, which extracts n-gram phrases and can optionally cluster them semantically. With `use_embeddings=True`, embeddings group similar phrases together, so the word cloud reflects semantic similarity between phrases as well as raw frequency. The cell below passes `use_embeddings=False`, so embedding-based grouping is turned off.

In [None]:
# Generate a word cloud from article titles using PhraseClusterAnalyzer
from niagads.nlp.analyzers import PhraseClusterAnalyzer

# Collect all article titles (use filtered_articles if available, else sample_metadata)
if 'filtered_articles' in locals() and filtered_articles:
    titles = [article.title for article in filtered_articles if article.title]
elif 'sample_metadata' in locals() and sample_metadata:
    titles = [article.title for article in sample_metadata if article.title]
else:
    titles = []

if titles:
    analyzer = PhraseClusterAnalyzer(top_n=30, min_ngram=1, max_ngram=3, use_embeddings=False, debug=DEBUG)
    clusters = analyzer.semantic_phrase_clustering(titles)
    analyzer.plot_ngram_wordcloud(clusters, max_words=50, title="Word Cloud of Article Titles")
else:
    print("No article titles available for word cloud visualization.")
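To illustrate what the `min_ngram`, `max_ngram`, and `top_n` parameters might control, here is a frequency-only sketch of n-gram phrase counting using the standard library. It is not the `PhraseClusterAnalyzer` implementation: the tokenizer, the stopword list, and the rule of dropping phrases that start or end with a stopword are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import List, Tuple

# Small illustrative stopword list (an assumption, not the library's list)
STOPWORDS = {"a", "an", "and", "as", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep words, allowing internal apostrophes and hyphens."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def ngram_counts(titles: List[str], min_ngram: int = 1, max_ngram: int = 3) -> Counter:
    """Count every n-gram with min_ngram <= n <= max_ngram across all titles."""
    counts: Counter = Counter()
    for title in titles:
        tokens = tokenize(title)
        for n in range(min_ngram, max_ngram + 1):
            for start in range(len(tokens) - n + 1):
                gram = tokens[start:start + n]
                # Skip phrases that begin or end with a stopword
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


def top_phrases(titles: List[str], top_n: int = 30) -> List[Tuple[str, int]]:
    return ngram_counts(titles).most_common(top_n)


# Invented titles for illustration
titles = ["Tau pathology in Alzheimer's disease", "Amyloid and tau pathology"]
print(top_phrases(titles, top_n=5))
```

The resulting phrase-to-count mapping is the kind of input a word cloud renderer typically expects.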

## 8. Fetch Full Text for Selected PMIDs

Fetch full text XML for selected PMIDs from the PMC Open Access subset.

In [6]:
# Fetch full text for all filtered articles
pmids = [f.pmid for f in filtered_articles]
# write the XML files to the current working directory
await pubmed_service.fetch_full_text(pmids, output_dir=os.getcwd())

# return the XML in memory as a dict instead of writing files
full_text_xml = await pubmed_service.fetch_full_text(pmids)


2025-11-18 00:25:14,295 INFO niagads.common.core: PubMedQueryService:fetch_full_text      Fetching full text for 5 PMIDs (batch size = 200).
2025-11-18 00:25:17,563 INFO niagads.common.core: PubMedQueryService:fetch_full_text      Wrote 5 full text XML files to /home/allenem/projects/ai/niagads-pylib/development/bibliography.
2025-11-18 00:25:17,567 INFO niagads.common.core: PubMedQueryService:fetch_full_text      Fetching full text for 5 PMIDs (batch size = 200).
2025-11-18 00:25:19,689 INFO niagads.common.core: PubMedQueryService:fetch_full_text      Retrieved 5 full text files from PMC.
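The log lines above mention a batch size of 200. A minimal sketch of splitting an ID list into batches of that size follows; `batched` is a hypothetical helper, and the batching logic inside `fetch_full_text` is not shown in this notebook.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Invented IDs for illustration
example_ids = [str(number) for number in range(450)]
for batch in batched(example_ids, batch_size=200):
    print(f"would request {len(batch)} IDs in one call")
```

Batching keeps each request to a bounded size and reduces the number of calls counted against the NCBI rate limit.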


## 9. Parse PMC Full Text XML

Create a PMCFullTextParser instance with the full text XML and parse it to extract content.

In [7]:
from niagads.pubmed.parsers import PMCFullTextParser

# Parse the first retrieved full text XML document
first_xml = next(iter(full_text_xml.values()))
parser = PMCFullTextParser(full_text_xml=first_xml, debug=DEBUG)
parser.parse()
print("PMC full text parsed")


PMC full text parsed


## 10. Extract Tables, Sections, and Figures from Full Text

Use PMCFullTextParser methods to extract tables, sections, and figures from the parsed XML.

In [8]:
# Extract and display tables, sections, and figures from the parsed PMC full text
if parser:
    tables = parser.get_tables()
    sections = parser.get_sections()
    figures = parser.get_figures()
    print(f"Tables found: {len(tables)}")
    print(f"Sections found: {len(sections)}")
    print(f"Figures found: {len(figures)}")
    if tables:
        print("First table:", tables[0])
    if sections:
        print("First section:", sections[0])
        print(f"Section Titles: {[s['title'] for s in sections]}")
    if figures:
        print("First figure:", figures[0])
else:
    print("No parsed PMC full text to extract content from.")

Tables found: 0
Sections found: 9
Figures found: 7
First section: {'title': 'Methods', 'text': 'Due to its stable pathological phenotype, the APP/PS1 model has become a classic tool for the discovery of AD biological markers. The use of only male mouse offers advantages in terms of stability and controllability in experimental design, making it particularly suitable for early exploratory studies. Therefore, SPF-grade male APP/PS1 mouse (9 months old, weighing 32\u2009±\u20093\xa0g) were used in this study. These mouse were obtained from the Key Laboratory of Animal Models and Human Disease Mechanisms of the Kunming Institute of Zoology, Chinese Academy of Sciences. The animals were housed at the Experimental Animal Center of the Kunming Institute of Zoology, Chinese Academy of Sciences, under controlled conditions. The facility maintained a constant temperature of 23±\u20092\xa0°C, with relative humidity regulated between 40% and 60%. A 12-hour light/dark cycle was implemented to ensur
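PMC full text is distributed as JATS XML, where `<sec>`, `<table-wrap>`, and `<fig>` elements hold sections, tables, and figures. The sketch below extracts them from a small invented document with the standard library, returning sections as dicts with `title` and `text` keys like those printed above. `get_sections` and `get_captioned` here are standalone hypothetical functions, not the `PMCFullTextParser` methods.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Invented JATS-like document for illustration
SAMPLE_JATS = """<article>
  <body>
    <sec>
      <title>Methods</title>
      <p>Mice were housed under a <italic>12-hour</italic> light/dark cycle.</p>
      <fig id="F1"><label>Figure 1</label><caption><p>Study design.</p></caption></fig>
    </sec>
    <sec>
      <title>Results</title>
      <p>Protein levels differed between groups.</p>
      <table-wrap id="T1"><label>Table 1</label><caption><p>Sample characteristics.</p></caption></table-wrap>
    </sec>
  </body>
</article>"""


def _text(node: Optional[ET.Element]) -> str:
    """Return all nested text of a node with whitespace collapsed."""
    return " ".join("".join(node.itertext()).split()) if node is not None else ""


def get_sections(root: ET.Element) -> List[Dict[str, str]]:
    sections = []
    for sec in root.iter("sec"):
        # Only direct child paragraphs, so figure and table captions are excluded
        paragraphs = [_text(p) for p in sec.findall("p")]
        sections.append({"title": _text(sec.find("title")), "text": " ".join(paragraphs)})
    return sections


def get_captioned(root: ET.Element, tag: str) -> List[Dict[str, Optional[str]]]:
    """Collect id, label, and caption for every element named tag ('fig' or 'table-wrap')."""
    return [
        {"id": node.get("id"), "label": _text(node.find("label")), "caption": _text(node.find("caption"))}
        for node in root.iter(tag)
    ]


root = ET.fromstring(SAMPLE_JATS)
print(get_sections(root))
print(get_captioned(root, "fig"))
print(get_captioned(root, "table-wrap"))
```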

## 11. Extract Specific Sections: Introduction, Conclusion, Discussion

Fetch the sections whose titles match the Introduction, Conclusion, or Discussion section types, first with exact matching only and then with fuzzy matching allowed. A section type with no match yields `None`.

In [9]:
from niagads.pubmed.parsers import ArticleSectionType, ArticleSection

section_types = [ArticleSectionType.INTRODUCTION, ArticleSectionType.CONCLUSION, ArticleSectionType.DISCUSSION]

# exact matches only
matching_sections = [parser.fetch_matching_article_section(st, allow_fuzzy=False) for st in section_types]
print(matching_sections)

# allow fuzzy matching
matching_sections = [parser.fetch_matching_article_section(st, allow_fuzzy=True) for st in section_types]
print(matching_sections)


[None, ArticleSection(section_type=<ArticleSectionType.CONCLUSION: 'CONCLUSION'>, section_index=7, title='conclusions', matched_title='conclusion', score=0.9, text='Through an extensive multi-level association analysis, encompassing transcriptomics, proteomics, pathology, and protein-target interactions, we provide compelling evidence supporting the significant role of three novel genes ('), ArticleSection(section_type=<ArticleSectionType.DISCUSSION: 'DISCUSSION'>, section_index=3, title='western blot (wb) analysis', matched_title='analysis', score=0.9, text='Total proteins were extracted from mouse hippocampal tissues by homogenization in RIPA lysis buffer, and their concentrations were determined. Protein separation was achieved via SDS-PAGE, after which the samples were transferred to a PVDF membrane. The membrane was subjected to blocking with 5% skimmed milk powder for 2\xa0h at room temperature under gentle agitation. Following blocking, three PBST washes were performed (10\xa0mi
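In the fuzzy output above, the DISCUSSION type is mapped to the section titled 'western blot (wb) analysis', which looks like a false positive. As a point of comparison, the sketch below scores the whole normalized title against a canonical name with `difflib.SequenceMatcher` and applies a threshold. It is a generic illustration, not the algorithm behind `fetch_matching_article_section`; `CANONICAL_TITLES`, `normalize`, `match_section`, and the 0.85 threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical canonical title per section type
CANONICAL_TITLES = {
    "INTRODUCTION": "introduction",
    "CONCLUSION": "conclusion",
    "DISCUSSION": "discussion",
}


def normalize(title: str) -> str:
    """Lowercase the title and strip leading numbering and trailing punctuation."""
    return re.sub(r"^[\d.\s]+", "", title.strip().lower()).rstrip(" .:")


def match_section(
    section_type: str, titles: List[str], allow_fuzzy: bool = False, threshold: float = 0.85
) -> Optional[Tuple[int, str, float]]:
    """Return (index, original title, score) for the best match, or None."""
    target = CANONICAL_TITLES[section_type]
    best: Optional[Tuple[int, str, float]] = None
    for index, raw_title in enumerate(titles):
        title = normalize(raw_title)
        is_exact = title == target
        score = 1.0 if is_exact else SequenceMatcher(None, title, target).ratio()
        # Non-exact titles count only when fuzzy matching is on and the score is high enough
        if not is_exact and (not allow_fuzzy or score < threshold):
            continue
        if best is None or score > best[2]:
            best = (index, raw_title, score)
    return best


# Invented section titles for illustration
titles = ["1. Introduction", "Methods", "Western blot (WB) analysis",
          "Results and discussion", "Conclusions"]
for section_type in CANONICAL_TITLES:
    print(section_type, match_section(section_type, titles, allow_fuzzy=True))
```

Scoring the full title keeps 'Western blot (WB) analysis' away from DISCUSSION, at the cost of also missing the combined 'Results and discussion' heading; handling combined headings would need an extra rule, such as splitting titles on "and".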

## 12. Extract Gene References from Filtered Sections

Use the LLM-based gene extractor to identify gene mentions in the filtered sections. This step demonstrates how to apply the `GeneReferenceExtractor` to biomedical text extracted from PMC full text articles.

In [None]:
# Extract gene references from the matched sections using GeneReferenceExtractor
from niagads.nlp_utils.extractors.gene import GeneReferenceExtractor, GeneNERModel

gene_extractor = GeneReferenceExtractor(model=GeneNERModel.D4DATA)

# Keep the sections matched in step 11 (unmatched types are None) as dicts with
# 'title' and 'text' keys; attribute names follow the ArticleSection repr shown above
filtered_sections = [
    {"title": sec.title, "text": sec.text} for sec in matching_sections if sec is not None
]

# Combine all filtered section texts into one block for extraction
if filtered_sections:
    all_section_text = "\n".join(sec['text'] for sec in filtered_sections)
    gene_mentions = gene_extractor.extract(all_section_text)
    print(f"Gene mentions found: {gene_mentions}")
else:
    print("No filtered sections available for gene extraction.")
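As a simple baseline to compare against model-based extraction, the sketch below finds gene mentions by dictionary lookup: it scans for uppercase symbol-like tokens and keeps those present in a small lexicon. This is a different and much cruder technique than the NER model used by `GeneReferenceExtractor`. `KNOWN_SYMBOLS` is a tiny illustrative lexicon, `extract_gene_mentions` is a hypothetical helper, and the sample sentence is invented.

```python
import re
from collections import Counter
from typing import Set

# Tiny illustrative lexicon of gene symbols; a real one would come from a curated resource
KNOWN_SYMBOLS: Set[str] = {"APOE", "APP", "BAG3", "MAPT", "TREM2"}

# Candidate tokens: an uppercase letter followed by 1 to 9 uppercase letters or digits
CANDIDATE = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")


def extract_gene_mentions(text: str, lexicon: Set[str] = KNOWN_SYMBOLS) -> Counter:
    """Count candidate tokens that appear in the lexicon."""
    return Counter(token for token in CANDIDATE.findall(text) if token in lexicon)


# Invented text for illustration
sample_text = (
    "BAG3 was upregulated in APP/PS1 mice, and BAG3 knockdown altered MAPT levels. "
    "RNA was extracted and amplified by PCR."
)
print(extract_gene_mentions(sample_text))
```

The lexicon filter is what rejects acronyms such as RNA and PCR; without it the regular expression alone would over-match heavily.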

## 13. Summarize Gene Contexts in Filtered Sections

Use the `TextSummarizer` to generate a summary for each gene, based on the sentences in which it appears in the filtered sections. This provides a concise overview of the context in which each gene is discussed.

In [None]:
# Summarize gene contexts using TextSummarizer
from niagads.nlp_utils.summarizer import TextSummarizer, SummarizationModel

# Extract gene context sentences (per gene) from the filtered sections
if parser and 'filtered_sections' in locals():
    all_section_text = "\n".join(sec['text'] for sec in filtered_sections)
    gene_contexts = gene_extractor.extract_gene_contexts(all_section_text)
    summarizer = TextSummarizer(model=SummarizationModel.PEGASUS_PUBMED)
    gene_summaries = summarizer.summarize(gene_contexts)
    for gene, summary in gene_summaries.items():
        print(f"Gene: {gene}\nSummary: {summary}\n")
else:
    print("No filtered sections available for gene context summarization.")
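The per-gene context step can be pictured as grouping sentences by the genes they mention. Here is a minimal sketch using a regular-expression sentence splitter. `split_sentences` and `collect_gene_contexts` are hypothetical helpers, the sample text is invented, and the output format of `extract_gene_contexts` may differ.

```python
import re
from collections import defaultdict
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [part.strip() for part in re.split(r"(?<=[.!?])\s+", text) if part.strip()]


def collect_gene_contexts(text: str, genes: List[str]) -> Dict[str, List[str]]:
    """Map each gene symbol to the sentences that mention it as a whole word."""
    contexts: Dict[str, List[str]] = defaultdict(list)
    for sentence in split_sentences(text):
        for gene in genes:
            if re.search(rf"\b{re.escape(gene)}\b", sentence):
                contexts[gene].append(sentence)
    return dict(contexts)


# Invented text for illustration
sample_text = (
    "BAG3 was upregulated in treated neurons. MAPT levels were unchanged. "
    "Loss of BAG3 increased MAPT aggregation."
)
for gene, sentences in collect_gene_contexts(sample_text, ["BAG3", "MAPT"]).items():
    print(gene, sentences)
```

Joining each gene's sentences into a single string would give a summarizer one compact input per gene. The naive splitter will mis-split abbreviations such as "et al.", so a proper sentence tokenizer is preferable for real articles.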