# 01 - Ingest PubMed Records

## Goal

Ingest PubMed records with a dentistry focus: query year by year through NCBI E-utilities and save the MEDLINE XML to `data/raw/`. The pipeline respects NCBI rate limits, can resume after an interruption, and states its query scope explicitly.


## Why This Step Matters

**Reproducibility** is the foundation of scientific computing. By explicitly:

- Documenting our query terms
- Respecting NCBI rate limits (with `NCBI_EMAIL` and `NCBI_API_KEY`)
- Storing raw XML for provenance

...we ensure anyone can rebuild this dataset from scratch.

### How E-utilities Work

1. **ESearch** → returns the list of PMIDs matching a query
2. **EFetch** → retrieves the full MEDLINE XML for those PMIDs
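
To make the two steps concrete, here is a minimal sketch that only builds the request URLs and sends nothing over the network. `build_url` is a hypothetical helper for illustration, not the notebook's `eutils_get`, and the email address and PMIDs are placeholders.

```python
from urllib.parse import urlencode

# Public base URL for NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(endpoint, params):
    """Return the full request URL for an E-utilities endpoint (hypothetical helper)."""
    return EUTILS_BASE + endpoint + "?" + urlencode(params)


# Step 1: ESearch asks for the PMIDs that match a query.
esearch_url = build_url(
    "esearch.fcgi",
    {
        "db": "pubmed",
        "term": "dentistry[MeSH Terms] AND 2022[PDAT]",
        "retmode": "json",
        "retmax": 5,
        "email": "you@example.org",  # placeholder, not a real address
    },
)

# Step 2: EFetch asks for the full records behind those PMIDs.
efetch_url = build_url(
    "efetch.fcgi",
    {"db": "pubmed", "id": "12345678,23456789", "retmode": "xml"},
)

print(esearch_url)
print(efetch_url)
```

Printing the URLs before sending anything is a cheap way to check that the query term and parameters are encoded as expected.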

### Example XML Structure

```xml
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>Effect of dental implants...</ArticleTitle>
      <Abstract><AbstractText>This study...</AbstractText></Abstract>
      <PublicationTypeList>
        <PublicationType>Randomized Controlled Trial</PublicationType>
      </PublicationTypeList>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Dental Implants</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">12345678</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
```
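
As a rough illustration of how such a record can be read back later, the sketch below parses a small inline sample with the standard library. The sample string and the `summarize` helper are assumptions made for this example. Elements are looked up by tag name anywhere under the article, so the code does not depend on exactly where an element is nested.

```python
import xml.etree.ElementTree as ET

# Inline sample for illustration; EFetch wraps records in <PubmedArticleSet>.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <ArticleTitle>Effect of dental implants...</ArticleTitle>
        <Abstract><AbstractText>This study...</AbstractText></Abstract>
        <PublicationTypeList>
          <PublicationType>Randomized Controlled Trial</PublicationType>
        </PublicationTypeList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Dental Implants</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def summarize(article):
    """Pull a few fields out of one <PubmedArticle> element (hypothetical helper)."""
    return {
        "pmid": article.findtext(".//PMID"),
        "title": article.findtext(".//ArticleTitle"),
        "mesh": [d.text for d in article.iter("DescriptorName")],
        "pub_types": [p.text for p in article.iter("PublicationType")],
    }


root = ET.fromstring(SAMPLE)
records = [summarize(a) for a in root.iter("PubmedArticle")]
print(records)
```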


In [None]:
# === TODO (you code this) ===
# Goal: Import required libraries for PubMed ingestion.
# Hints:
# 1) You'll need: os, yaml, Path, tqdm for progress bars
# 2) Import eutils_get helper from src.utils
# Acceptance:
# - All imports successful
# - Can call eutils_get() function


In [None]:
# === TODO (you code this) ===
# Goal: Read YAML config into a dict.
# Hints:
# 1) Use yaml.safe_load() on the file handle
# 2) Config should have: start_year, end_year, term_template, batch_ids, batch_fetch
# Acceptance:
# - config dict contains all 5 keys
# - start_year and end_year are integers

def load_config(path="../configs/query.yaml"):
    """Load YAML configuration."""
    # TODO
    raise NotImplementedError

# config = load_config()


In [None]:
# === TODO (you code this) ===
# Goal: Set NCBI credentials from environment variables.
# Hints:
# 1) Use os.getenv() with sensible defaults
# 2) Without API key: 3 req/sec; with key: 10 req/sec
# Acceptance:
# - NCBI_EMAIL and NCBI_API_KEY variables defined
# - Defaults provided if env vars missing

# TODO: Set NCBI_EMAIL and NCBI_API_KEY


In [None]:
# === TODO (you code this) ===
# Goal: Create data/raw directory if it doesn't exist.
# Hints:
# 1) Use Path().mkdir() with appropriate flags
# Acceptance:
# - Directory exists after running
# - No error if directory already exists

# TODO: Create ../data/raw directory


## Build the Yearly Query

We'll loop through years and construct queries like:

```
(dentistry[MeSH Terms] OR dental[Title/Abstract] OR ...) AND (2018[PDAT]:2018[PDAT])
```
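
One possible way to produce such a string is plain text substitution. The sketch below assumes the placeholder is spelled literally as `{{year}}`, as in the hint in the next cell, and uses a shortened, hypothetical template; the real `term_template` lives in the YAML config. `fill_year` is an illustrative name, not part of the notebook.

```python
# Hypothetical, shortened template for illustration only.
TEMPLATE = (
    "(dentistry[MeSH Terms] OR dental[Title/Abstract]) "
    "AND ({{year}}[PDAT]:{{year}}[PDAT])"
)


def fill_year(template, year):
    """Replace every literal {{year}} placeholder with the given year."""
    return template.replace("{{year}}", str(year))


query_2022 = fill_year(TEMPLATE, 2022)
print(query_2022)
# (dentistry[MeSH Terms] OR dental[Title/Abstract]) AND (2022[PDAT]:2022[PDAT])
```

Using `str.replace` rather than `str.format` avoids surprises if the template ever contains other curly braces.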


In [None]:
# === TODO (you code this) ===
# Goal: Build the PubMed query string for a given year.
# Hints:
# 1) Replace {{year}} placeholder in term_template
# 2) Return complete query string ready for ESearch
# Acceptance:
# - Function build_query(year:int, template:str) -> str
# - Query for 2022 contains "2022[PDAT]:2022[PDAT]"

def build_query(year, template):
    """Build PubMed query string for a given year."""
    # TODO
    raise NotImplementedError


## ESearch: Get PMIDs

Use `esearch.fcgi` to get a list of PMIDs matching our query.
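
Because ESearch returns results one page at a time, the paging logic is worth seeing in isolation. The sketch below is one way to do it under stated assumptions: `collect_pmids` and `fake_search` are hypothetical names, and the fake function stands in for a real ESearch call so the loop can run offline. It relies only on the `esearchresult`, `count`, and `idlist` fields of the ESearch JSON response.

```python
def collect_pmids(search_page, query, batch_size=500):
    """Page through ESearch-style responses until every PMID is collected.

    `search_page(query, retmax, retstart)` is a caller-supplied function
    that returns a dict shaped like the ESearch JSON response.
    """
    pmids, seen = [], set()
    retstart = 0
    while True:
        response = search_page(query, retmax=batch_size, retstart=retstart)
        result = response["esearchresult"]
        total = int(result["count"])  # ESearch reports the count as a string
        page = result["idlist"]
        if not page:
            break
        for pmid in page:
            if pmid not in seen:  # keep the first occurrence, drop duplicates
                seen.add(pmid)
                pmids.append(pmid)
        retstart += len(page)
        if retstart >= total:
            break
    return pmids


# Offline stand-in for ESearch, used only to exercise the loop.
FAKE_IDS = [str(30000000 + i) for i in range(1234)]


def fake_search(query, retmax, retstart):
    return {
        "esearchresult": {
            "count": str(len(FAKE_IDS)),
            "idlist": FAKE_IDS[retstart:retstart + retmax],
        }
    }


all_ids = collect_pmids(fake_search, "any query", batch_size=500)
print(len(all_ids))  # 1234
```

One caveat to keep in mind: NCBI's documentation states that ESearch on PubMed serves at most the first 10,000 records of a result set. If a single year matches more than that, paging alone may not reach every PMID, and a narrower date window per query might be needed.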


In [None]:
# === TODO (you code this) ===
# Goal: Call NCBI ESearch to get PMIDs for a query.
# Hints:
# 1) Use eutils_get() helper with 'esearch.fcgi' endpoint
# 2) Return JSON response (see NCBI E-utilities docs for params)
# 3) Include email and api_key in params for rate limit compliance
# Acceptance:
# - Function esearch(query, retmax, retstart) returns dict
# - Response contains 'esearchresult' with 'idlist'

def esearch(query, retmax=500, retstart=0):
    """Query NCBI ESearch and return JSON response."""
    # TODO: build params dict and call eutils_get
    raise NotImplementedError


In [None]:
# === TODO (you code this) ===
# Goal: Paginate through ESearch results to collect all PMIDs.
# Hints:
# 1) Loop with incrementing retstart until no more IDs returned
# 2) Pass batch_ids from config as batch_size (used for retmax)
# 3) Check total count to know when to stop
# Acceptance:
# - Function get_all_pmids(query, batch_size) -> list[str]
# - Returns all PMIDs, not just first batch
# - No duplicates

def get_all_pmids(query, batch_size=500):
    """Paginate ESearch to collect all PMIDs for a query."""
    # TODO: implement pagination loop
    raise NotImplementedError


## EFetch: Download XML

Use `efetch.fcgi` to retrieve MEDLINE XML for batches of PMIDs.
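
A small sketch of the batching side, without any network call: it splits a PMID list into chunks and builds the parameter dict an EFetch request would carry. `chunked` and `efetch_params` are hypothetical helpers for illustration; the `db`, `id`, `retmode`, `email`, and `api_key` parameter names are the ones E-utilities documents.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_params(pmids, email=None, api_key=None):
    """Build the query parameters for one EFetch request (hypothetical helper)."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),  # EFetch takes a comma-separated ID list
        "retmode": "xml",
    }
    if email:
        params["email"] = email
    if api_key:
        params["api_key"] = api_key
    return params


pmids = [str(n) for n in range(1, 8)]
batches = [efetch_params(batch)["id"] for batch in chunked(pmids, 3)]
print(batches)  # ['1,2,3', '4,5,6', '7']
```

NCBI suggests using HTTP POST when a request carries more than about 200 IDs, so it may be worth keeping `batch_fetch` at or below that size if the helper only issues GET requests.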


In [None]:
# === TODO (you code this) ===
# Goal: Fetch MEDLINE XML for a list of PMIDs.
# Hints:
# 1) Join PMIDs with commas for 'id' parameter
# 2) Request 'xml' retmode (not json)
# 3) Return raw XML text, not JSON
# Acceptance:
# - Function efetch(pmids:list) -> str
# - Returns XML string starting with <?xml

def efetch(pmids):
    """Fetch MEDLINE XML for a batch of PMIDs."""
    # TODO: build params and call eutils_get, return .text
    raise NotImplementedError


## Main Ingestion Loop

For each year:
1. Build query
2. Get PMIDs
3. Fetch XML in batches
4. Save to `data/raw/pubmed_{year}_{chunk:04d}.xml`
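
The resumable part of step 4 can be sketched on its own. The example below assumes a caller-supplied `fetch_xml(batch)` function (here replaced by a fake that returns a stub document) and writes into a temporary directory. `save_year` is a hypothetical name, and the zero-padded file naming is an assumption of this sketch.

```python
import tempfile
from pathlib import Path


def save_year(year, pmids, out_dir, fetch_xml, chunk_size=200):
    """Write one XML file per chunk, skipping chunks already on disk.

    Returns the number of chunks actually fetched in this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = 0
    for chunk_no, start in enumerate(range(0, len(pmids), chunk_size)):
        target = out_dir / f"pubmed_{year}_{chunk_no:04d}.xml"
        if target.exists():
            continue  # already downloaded in an earlier run
        xml_text = fetch_xml(pmids[start:start + chunk_size])
        # Write to a side file first so an interrupted run never leaves
        # a half-written file under the final name.
        partial = target.with_name(target.name + ".part")
        partial.write_text(xml_text, encoding="utf-8")
        partial.replace(target)
        fetched += 1
    return fetched


def fake_fetch(batch):
    """Offline stand-in for EFetch, used only for this demonstration."""
    return "<?xml version='1.0'?><PubmedArticleSet/>"


with tempfile.TemporaryDirectory() as tmp_dir:
    ids = [str(n) for n in range(450)]
    first = save_year(2022, ids, tmp_dir, fake_fetch)
    second = save_year(2022, ids, tmp_dir, fake_fetch)
    print(first, second)  # 3 0
```

Skipping by chunk index assumes the PMID list comes back in the same order on every run. If the result set changes between runs, the existing files may no longer line up with the new chunks.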


In [None]:
# === TODO (you code this) ===
# Goal: Orchestrate full ingestion for all years.
# Hints:
# 1) Loop through year range from config
# 2) For each year: build query, get all PMIDs, chunk and fetch XML
# 3) Save files as pubmed_{year}_{chunk:04d}.xml
# 4) Skip existing files for resumability
# Acceptance:
# - Function ingest_years(config) processes all years
# - Files written to ../data/raw/
# - Re-running skips existing files
# - Progress printed for each year

def ingest_years(config):
    """Ingest all configured years into data/raw/."""
    # TODO: implement year loop with chunked fetching
    raise NotImplementedError

# Run ingestion (uncomment when ready)
# ingest_years(config)


## QA Checklist

Before moving to the next notebook, verify:

- [ ] Counts per year are non-zero
- [ ] Files written to `data/raw/` directory
- [ ] XML files are valid (spot-check by opening one)
- [ ] Re-running the cell skips existing files (resumability)
- [ ] No rate-limit errors from NCBI

### Sanity Check: Count Files
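
As an illustration of the idea, the sketch below counts files per year from their names. It assumes files are named like `pubmed_2022_0000.xml` and runs against a temporary directory with empty placeholder files; `count_by_year` is a hypothetical helper.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Assumed naming scheme: pubmed_<4-digit year>_<chunk number>.xml
NAME_RE = re.compile(r"pubmed_(\d{4})_\d+\.xml$")


def count_by_year(raw_dir):
    """Count XML files per year, ignoring names that do not match the scheme."""
    counts = Counter()
    for path in Path(raw_dir).glob("*.xml"):
        match = NAME_RE.match(path.name)
        if match:
            counts[int(match.group(1))] += 1
    return counts


with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["pubmed_2021_0000.xml", "pubmed_2021_0001.xml",
                 "pubmed_2022_0000.xml", "notes.xml"]:
        (Path(tmp_dir) / name).touch()
    counts = count_by_year(tmp_dir)

print(sum(counts.values()), dict(sorted(counts.items())))
# 3 {2021: 2, 2022: 1}
```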


In [None]:
# === TODO (you code this) ===
# Goal: Count and verify downloaded XML files.
# Hints:
# 1) Use Path.glob() to find all .xml files
# 2) Parse filenames to extract year
# 3) Show total count and breakdown by year
# Acceptance:
# - Prints total file count
# - Shows count per year (use Counter)

# TODO: count files in ../data/raw/*.xml


## 🧘 Reflection Log

**What did you learn in this session?**
- 

**What challenges did you encounter?**
- 

**How will this improve Periospot AI?**
- 
