# 01 - Ingest PubMed Records

## Goal

Ingest PubMed records with a dentistry focus: query year by year through the NCBI E-utilities and save the fetched MEDLINE XML to `data/raw/`. The ingestion respects NCBI rate limits, can resume after an interruption, and states its scope explicitly.


## Why This Step Matters

**Reproducibility** is the foundation of scientific computing. By explicitly:

- Documenting our query terms
- Respecting NCBI rate limits (with `NCBI_EMAIL` and `NCBI_API_KEY`)
- Storing raw XML for provenance

...we ensure anyone can rebuild this dataset from scratch.

### How E-utilities Work

1. **ESearch** → returns the list of PMIDs matching the query
2. **EFetch** → retrieves the full MEDLINE XML for those PMIDs
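
The two steps can be sketched as plain URL construction. This is a minimal illustration, assuming the public E-utilities base URL. The helper name `eutils_url` and the example search term are hypothetical, and this is not the project's `eutils_get`.

```python
from urllib.parse import urlencode

# Public E-utilities base URL
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(endpoint, **params):
    """Build a request URL for an E-utilities endpoint (hypothetical helper)."""
    return f"{EUTILS_BASE}{endpoint}?{urlencode(params)}"


# Step 1: ESearch returns the PMIDs that match the query.
search_url = eutils_url(
    "esearch.fcgi", db="pubmed", term="dental implants[MeSH Terms]", retmode="json"
)

# Step 2: EFetch retrieves the XML records for those PMIDs.
fetch_url = eutils_url("efetch.fcgi", db="pubmed", id="12345678", retmode="xml")

print(search_url)
print(fetch_url)
```

No request is sent here; the sketch only shows which parameters each endpoint receives.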

### Example XML Structure

```xml
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>Effect of dental implants...</ArticleTitle>
      <Abstract><AbstractText>This study...</AbstractText></Abstract>
      <PublicationTypeList>
        <PublicationType>Randomized Controlled Trial</PublicationType>
      </PublicationTypeList>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Dental Implants</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">12345678</ArticleId>
    </ArticleIdList>
  </PubmedData>
</PubmedArticle>
```
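
As a rough illustration of how such a record can be read, the sketch below parses a trimmed record with the standard library. The sample data is illustrative only, and the variable names are assumptions rather than part of the project code.

```python
import xml.etree.ElementTree as ET

# A trimmed record in the same shape as the example (illustrative data only)
sample = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>Effect of dental implants...</ArticleTitle>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Dental Implants</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""

article = ET.fromstring(sample)

# Paths are relative to the <PubmedArticle> root element
pmid = article.findtext("MedlineCitation/PMID")
title = article.findtext("MedlineCitation/Article/ArticleTitle")

# Collect every MeSH descriptor anywhere under the record
mesh_terms = [d.text for d in article.iter("DescriptorName")]

print(pmid, title, mesh_terms)
```

Real titles and abstracts can contain inline markup such as `<i>`, in which case `"".join(element.itertext())` is a safer way to read the text than `findtext`.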


In [None]:
# TODO: Import libraries
# Hint: import os, yaml, json, time
# from pathlib import Path
# from tqdm import tqdm
# import sys
# sys.path.append('..')
# from src.utils import need_env, eutils_get


In [None]:
# TODO: Load config from configs/query.yaml
# Hint: with open('../configs/query.yaml') as f:
#     config = yaml.safe_load(f)
# Extract: start_year, end_year, term_template, batch_ids, batch_fetch


In [None]:
# TODO: Set up NCBI identification (email recommended, API key optional)
# Hint: NCBI_EMAIL = os.getenv('NCBI_EMAIL', 'your@email.com')  # replace the placeholder with your address
#       NCBI_API_KEY = os.getenv('NCBI_API_KEY', '')
# Note: With an API key the rate limit is 10 requests/sec; without one it is 3 requests/sec
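
One way to stay within those limits is a small client-side throttle. The sketch below is an assumption about how this could be done: `min_interval` and `Throttle` are hypothetical names, and the project's `eutils_get` may already handle pacing on its own.

```python
import os
import time


def min_interval(api_key):
    """Seconds between requests: 10 req/sec with a key, 3 req/sec without."""
    return 1 / 10 if api_key else 1 / 3


class Throttle:
    """Sleep just long enough to keep calls at least `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval(os.getenv("NCBI_API_KEY", "")))
throttle.wait()  # call this before each request
```

The first call returns immediately; later calls block only for the time still needed to respect the interval.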


In [None]:
# TODO: Create output directory
# Hint: Path('../data/raw').mkdir(parents=True, exist_ok=True)


## Build the Yearly Query

We'll loop through years and construct queries like:

```
(dentistry[MeSH Terms] OR dental[Title/Abstract] OR ...) AND (2018[PDAT]:2018[PDAT])
```
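
A minimal sketch of the substitution, assuming the template marks the year with a `{{year}}` placeholder. The template string below is a shortened, hypothetical stand-in; the real one lives in `configs/query.yaml`.

```python
def build_query(year, template):
    """Substitute the year into a query template that uses a {{year}} placeholder."""
    return template.replace("{{year}}", str(year))


# Hypothetical, shortened template for illustration only
template = (
    "(dentistry[MeSH Terms] OR dental[Title/Abstract]) "
    "AND ({{year}}[PDAT]:{{year}}[PDAT])"
)

query = build_query(2018, template)
print(query)
```

Plain `str.replace` is used instead of `str.format` because PubMed field tags and double braces would otherwise need escaping.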


In [None]:
# TODO: Build query for a single year
# Hint: def build_query(year, template):
#           return template.replace('{{year}}', str(year))
# Test with one year first


## ESearch: Get PMIDs

Use `esearch.fcgi` to get a list of PMIDs matching our query.


In [None]:
# TODO: Write ESearch function
# Hint: def esearch(query, retmax=500, retstart=0):
#     params = {
#         'db': 'pubmed',
#         'term': query,
#         'retmax': retmax,
#         'retstart': retstart,
#         'retmode': 'json',
#         'email': NCBI_EMAIL,
#     }
#     if NCBI_API_KEY:  # send the key only when one is configured
#         params['api_key'] = NCBI_API_KEY
#     response = eutils_get('esearch.fcgi', params)
#     return response.json()


In [None]:
# TODO: Collect all PMIDs for a year (with pagination)
# Note: For PubMed, ESearch only serves the first 10,000 results of a query.
#       If a year's count exceeds that, split the date range (e.g. by month).
# Hint: def get_all_pmids(query, batch_size=500):
#     pmids = []
#     retstart = 0
#     while True:
#         result = esearch(query, retmax=batch_size, retstart=retstart)
#         id_list = result['esearchresult'].get('idlist', [])
#         if not id_list:
#             break
#         pmids.extend(id_list)
#         retstart += batch_size
#         # Stop once every matching record has been requested
#         if retstart >= int(result['esearchresult']['count']):
#             break
#     return pmids


## EFetch: Download XML

Use `efetch.fcgi` to retrieve MEDLINE XML for batches of PMIDs.


In [None]:
# TODO: Write EFetch function
# Hint: def efetch(pmids):
#     params = {
#         'db': 'pubmed',
#         'id': ','.join(pmids),
#         'retmode': 'xml',
#         'email': NCBI_EMAIL,
#     }
#     if NCBI_API_KEY:  # send the key only when one is configured
#         params['api_key'] = NCBI_API_KEY
#     response = eutils_get('efetch.fcgi', params)
#     return response.text


## Main Ingestion Loop

For each year:
1. Build query
2. Get PMIDs
3. Fetch XML in batches
4. Save to `data/raw/pubmed_{year}_{batch}.xml`
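
The batching and file-naming part of these steps can be sketched without any network access. `plan_batches` is a hypothetical helper, and the PMIDs are illustrative only.

```python
from pathlib import Path


def plan_batches(pmids, year, batch_size, out_dir="../data/raw"):
    """Yield (outfile, batch) pairs; the zero-padded index keeps files sorted."""
    for start in range(0, len(pmids), batch_size):
        index = start // batch_size
        outfile = Path(out_dir) / f"pubmed_{year}_{index:04d}.xml"
        yield outfile, pmids[start:start + batch_size]


# Illustrative PMIDs only
plan = list(plan_batches(["1", "2", "3", "4", "5"], year=2018, batch_size=2))

for outfile, batch in plan:
    # An existing file means the batch was fetched in an earlier run
    status = "skip" if outfile.exists() else "fetch"
    print(outfile.name, batch, status)
```

Skipping by file name assumes that ESearch returns the PMIDs in the same order on every run. If new records are indexed between runs, batch boundaries can shift, so saving the PMID list per year alongside the XML may be worth considering.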


In [None]:
# TODO: Implement main ingestion loop
# Hint: 
# for year in range(config['start_year'], config['end_year'] + 1):
#     print(f"\n=== Processing year {year} ===")
#     query = build_query(year, config['term_template'])
#     pmids = get_all_pmids(query, config['batch_ids'])
#     print(f"Found {len(pmids)} records for {year}")
#     
#     # Fetch in chunks
#     for i in range(0, len(pmids), config['batch_fetch']):
#         batch = pmids[i:i + config['batch_fetch']]
#         outfile = Path(f"../data/raw/pubmed_{year}_{i//config['batch_fetch']:04d}.xml")
#         
#         # Skip if already exists (resumability)
#         if outfile.exists():
#             print(f"  Skipping {outfile.name} (already exists)")
#             continue
#         
#         xml_data = efetch(batch)
#         outfile.write_text(xml_data, encoding='utf-8')
#         print(f"  Wrote {outfile.name} ({len(batch)} records)")


## QA Checklist

Before moving to the next notebook, verify:

- [ ] Counts per year are non-zero
- [ ] Files written to `data/raw/` directory
- [ ] XML files are valid (spot-check by opening one)
- [ ] Re-running the cell skips existing files (resumability)
- [ ] No rate-limit errors from NCBI
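
The validity check can be partly automated. The sketch below is one possible approach, assuming each file has `<PubmedArticle>` elements under its root; `check_xml_files` is a hypothetical helper, demonstrated on temporary files rather than real downloads.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def check_xml_files(raw_dir):
    """Return ({filename: article count}, [files that failed to parse])."""
    counts, broken = {}, []
    for path in sorted(Path(raw_dir).glob("*.xml")):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            broken.append(path.name)
            continue
        counts[path.name] = len(root.findall(".//PubmedArticle"))
    return counts, broken


# Demonstration on throwaway files: one well-formed, one truncated
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "pubmed_2018_0000.xml").write_text(
        "<PubmedArticleSet><PubmedArticle/><PubmedArticle/></PubmedArticleSet>",
        encoding="utf-8",
    )
    Path(tmp, "pubmed_2018_0001.xml").write_text(
        "<PubmedArticleSet><PubmedArticle>", encoding="utf-8"
    )
    counts, broken = check_xml_files(tmp)

print(counts, broken)
```

On the real data, `check_xml_files('../data/raw')` would flag truncated downloads, which can then be deleted and re-fetched by re-running the ingestion loop.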

### Sanity Check: Count Files


In [None]:
# TODO: Count downloaded files
# Hint: raw_files = list(Path('../data/raw').glob('*.xml'))
#       print(f"Total XML files: {len(raw_files)}")
#       # Group by year and count
#       from collections import Counter
#       years = [f.stem.split('_')[1] for f in raw_files]
#       print(Counter(years))


## 🧘 Reflection Log

**What did you learn in this session?**
- 

**What challenges did you encounter?**
- 

**How will this improve Periospot AI?**
- 
