# Europe PMC Full Text XML Parser Demo

This notebook demonstrates how to use `pyEuropePMC`'s `FullTextClient` and `FullTextXMLParser` to download, parse, and extract information from Europe PMC full-text XML articles. It covers both live downloads and local fixture files, and shows how to extract metadata, tables, references, and sections, and how to convert articles to plain text and Markdown.

## 1. Import Required Libraries

Import the necessary libraries, including `pathlib` and the relevant classes from `pyEuropePMC`.

In [1]:
from pathlib import Path
from pyeuropepmc import FullTextClient, FullTextXMLParser

## 2. Set Up File Paths and Download Directory

Create a downloads directory and define file paths for saving XML, plaintext, and markdown files.

In [2]:
downloads_dir = Path("downloads")
downloads_dir.mkdir(exist_ok=True)

pmcid = "PMC3258128"
xml_path = downloads_dir / f"{pmcid}.xml"
plaintext_path = downloads_dir / f"{pmcid}_plaintext.txt"
markdown_path = downloads_dir / f"{pmcid}.md"
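
When more than one article is processed, the path definitions above can be wrapped in a small helper. The sketch below is only an illustration: `build_output_paths` and `PMCID_PATTERN` are hypothetical names, not part of `pyEuropePMC`, and the "PMC followed by digits" check is an assumption based on the ID used in this notebook.

```python
import re
from pathlib import Path

# Hypothetical helper (not part of pyEuropePMC).
# Assumption: a PMC ID looks like "PMC" followed by digits, as in "PMC3258128".
PMCID_PATTERN = re.compile(r"PMC\d+")


def build_output_paths(pmcid: str, base_dir: Path) -> dict[str, Path]:
    """Return the XML, plaintext, and markdown paths for one PMC ID."""
    if not PMCID_PATTERN.fullmatch(pmcid):
        raise ValueError(f"Unexpected PMC ID format: {pmcid!r}")
    return {
        "xml": base_dir / f"{pmcid}.xml",
        "plaintext": base_dir / f"{pmcid}_plaintext.txt",
        "markdown": base_dir / f"{pmcid}.md",
    }


paths = build_output_paths("PMC3258128", Path("downloads"))
print(paths["xml"])
```

Rejecting malformed IDs early keeps a typo from turning into a failed download later on.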

## 3. Download Full Text XML from Europe PMC

Use `FullTextClient` to download a full text XML file from Europe PMC by PMC ID and save it to the downloads directory.

In [3]:
with FullTextClient() as client:
    xml_downloaded_path = client.download_xml_by_pmcid(pmcid, xml_path)
    if xml_downloaded_path:
        print(f"Downloaded XML to: {xml_downloaded_path}")
    else:
        print(f"Failed to download XML for {pmcid}")

Downloaded XML to: downloads/PMC3258128.xml
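
Before handing the file to the parser, it can be useful to confirm that the download is well-formed XML and not, for example, an error page. The following is a minimal sketch using only the standard library; `is_well_formed_xml` is a hypothetical helper and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(xml_text) -> bool:
    """Return True if the text (str or bytes) parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Made-up samples: one balanced document and one with an unclosed <front> tag.
sample_ok = "<article><front><article-meta/></front></article>"
sample_bad = "<article><front></article>"

print(is_well_formed_xml(sample_ok))   # True
print(is_well_formed_xml(sample_bad))  # False
```

In the notebook, the same check could be applied to `xml_path.read_bytes()` after the download cell.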


## 4. Parse XML and Extract Metadata

Read the downloaded XML file, initialize `FullTextXMLParser`, and extract metadata such as title, authors, journal, DOI, publication date, volume, issue, pages, and keywords.

In [4]:
# Read the XML content
xml_content = xml_path.read_text(encoding="utf-8")

# Create parser instance
parser = FullTextXMLParser(xml_content)

# Extract metadata
data = parser.extract_metadata()
print(f"Title: {data['title']}")
print(f"Authors: {', '.join(data['authors'][:3])}")
if len(data['authors']) > 3:
    print(f"... and {len(data['authors']) - 3} more")
print(f"Journal: {data['journal']}")
print(f"DOI: {data['doi']}")
print(f"Publication Date: {data['pub_date']}")
print(f"Volume: {data['volume']}, Issue: {data['issue']}")
print(f"Pages: {data['pages']}")
if data['keywords']:
    print(f"Keywords: {', '.join(data['keywords'])}")

Title: Hepato-specific microRNA-122 facilitates accumulation of newly synthesized miRNA through regulating PRKRA
Authors: Shuai Li, Juanjuan Zhu, Hanjiang Fu
... and 9 more
Journal: Nucleic Acids Research
DOI: 10.1093/nar/gkr715
Publication Date: 2012-01
Volume: 40, Issue: 2
Pages: 884-891
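
The cell above indexes the metadata dictionary directly. Whether `extract_metadata()` always returns every key is not shown here, so the sketch below is only a defensive pattern. It uses a plain stand-in dictionary built from the values printed above, and `format_authors` is a hypothetical helper.

```python
def format_authors(authors, limit=3):
    """Join up to `limit` names and summarize the rest; tolerates None."""
    names = list(authors or [])
    shown = ", ".join(names[:limit])
    remaining = len(names) - limit
    if remaining > 0:
        return f"{shown} ... and {remaining} more"
    return shown


# Stand-in for the parser result, using values from the output above.
# The "keywords" key is deliberately left out to exercise the fallback.
sample = {
    "authors": [
        "Shuai Li", "Juanjuan Zhu", "Hanjiang Fu", "Jing Wan",
        "Zheng Hu", "Shanshan Liu", "Jie Li", "Yi Tie",
        "Ruiyun Xing", "Jie Zhu", "Zhixian Sun", "Xiaofei Zheng",
    ],
    "doi": "10.1093/nar/gkr715",
}

authors_line = format_authors(sample.get("authors"))
keywords = sample.get("keywords") or []

print(f"Authors: {authors_line}")
print(f"Keywords: {', '.join(keywords) if keywords else 'n/a'}")
```

Using `.get()` with a fallback avoids a `KeyError` when a field is absent and keeps the printing logic in one place.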


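## 5. List Element Types

Call `list_element_types()` to see which XML element names occur in the parsed article.

In [6]: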
element_types = parser.list_element_types()
element_types

['abstract',
 'aff',
 'article',
 'article-categories',
 'article-id',
 'article-meta',
 'article-title',
 'author-notes',
 'award-id',
 'back',
 'body',
 'bold',
 'caption',
 'contrib',
 'contrib-group',
 'copyright-statement',
 'copyright-year',
 'corresp',
 'counts',
 'date',
 'day',
 'element-citation',
 'email',
 'etal',
 'ext-link',
 'fax',
 'fig',
 'fn',
 'fpage',
 'front',
 'funding-source',
 'given-names',
 'graphic',
 'history',
 'issn',
 'issue',
 'italic',
 'journal-id',
 'journal-meta',
 'journal-title',
 'journal-title-group',
 'label',
 'license',
 'license-p',
 'lpage',
 'media',
 'month',
 'name',
 'p',
 'page-count',
 'permissions',
 'person-group',
 'phone',
 'pub-date',
 'pub-id',
 'publisher',
 'publisher-name',
 'ref',
 'ref-list',
 'sec',
 'source',
 'sub',
 'subj-group',
 'subject',
 'sup',
 'supplementary-material',
 'surname',
 'title',
 'title-group',
 'volume',
 'xref',
 'year']
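
For comparison, a similar sorted list of element names can be produced with the standard library alone. This is a sketch of the general idea, not the implementation behind `list_element_types()`; `list_tags` is a hypothetical helper and the sample document is made up.

```python
import xml.etree.ElementTree as ET


def list_tags(xml_text):
    """Return the sorted, de-duplicated element names in an XML document."""
    root = ET.fromstring(xml_text)
    # Drop any "{namespace}" prefix so only the local tag name remains.
    return sorted({el.tag.rsplit("}", 1)[-1] for el in root.iter()})


# Made-up miniature article, loosely shaped like the element list above.
sample_xml = (
    "<article><front><article-meta><title-group>"
    "<article-title>T</article-title></title-group></article-meta></front>"
    "<body><sec><title>Intro</title><p>Text</p></sec></body></article>"
)

tags = list_tags(sample_xml)
print(tags)
```

A set removes repeated tags such as multiple `p` elements, and sorting gives a stable, alphabetical listing.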

## 6. Convert to Markdown

Use `to_markdown()` to render the parsed article as Markdown.

In [9]:
markdown_output = parser.to_markdown()
print(markdown_output)

# Hepato-specific microRNA-122 facilitates accumulation of newly synthesized miRNA through regulating PRKRA

**Authors:** Shuai Li, Juanjuan Zhu, Hanjiang Fu, Jing Wan, Zheng Hu, Shanshan Liu, Jie Li, Yi Tie, Ruiyun Xing, Jie Zhu, Zhixian Sun, Xiaofei Zheng

**Journal:** Nucleic Acids Research

**DOI:** 10.1093/nar/gkr715

## Abstract

microRNAs (miRNAs) are a versatile class of non-coding RNAs involved in regulation of various biological processes. miRNA-122 (miR-122) is specifically and abundantly expressed in human liver. In this study, we employed 3′-end biotinylated synthetic miR-122 to identify its targets based on affinity purification. Quantitative RT-PCR analysis of the affinity purified RNAs demonstrated a specific enrichment of several known miR-122 targets such as CAT-1 (also called SLC7A1), ADAM17 and BCL-w. Using microarray analysis of affinity purified RNAs, we also discovered many candidate target genes of miR-122. Among these candidates, we confirmed that protein kinas

## 7. Convert to Plain Text

Use `to_plaintext()` to render the parsed article as plain text.

In [10]:
plain_output = parser.to_plaintext()
print(plain_output)

Hepato-specific microRNA-122 facilitates accumulation of newly synthesized miRNA through regulating PRKRA

Authors: Shuai Li, Juanjuan Zhu, Hanjiang Fu, Jing Wan, Zheng Hu, Shanshan Liu, Jie Li, Yi Tie, Ruiyun Xing, Jie Zhu, Zhixian Sun, Xiaofei Zheng

Abstract
microRNAs (miRNAs) are a versatile class of non-coding RNAs involved in regulation of various biological processes. miRNA-122 (miR-122) is specifically and abundantly expressed in human liver. In this study, we employed 3′-end biotinylated synthetic miR-122 to identify its targets based on affinity purification. Quantitative RT-PCR analysis of the affinity purified RNAs demonstrated a specific enrichment of several known miR-122 targets such as CAT-1 (also called SLC7A1), ADAM17 and BCL-w. Using microarray analysis of affinity purified RNAs, we also discovered many candidate target genes of miR-122. Among these candidates, we confirmed that protein kinase, interferon-inducible double-stranded RNA-dependent activator (PRKRA), a D
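
Section 2 defined `plaintext_path` and `markdown_path`, but the cells above only print the converted text. A minimal way to write such strings to disk is sketched below; `save_text` is a hypothetical helper, and the demo writes a short made-up string into a temporary directory so it runs on its own.

```python
import tempfile
from pathlib import Path


def save_text(path: Path, text: str) -> Path:
    """Write text as UTF-8, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Self-contained demo using a temporary directory and a made-up document.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = save_text(Path(tmp) / "downloads" / "demo.md", "# Title\n\nBody text\n")
    demo_roundtrip = demo_path.read_text(encoding="utf-8")

print(demo_roundtrip)
```

In this notebook the equivalent calls would be `save_text(markdown_path, markdown_output)` and `save_text(plaintext_path, plain_output)`.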