# PubMed Data Ingestion Pipeline

This pipeline queries PubMed for articles matching a search term, fetches their full text from PubMed Central (PMC) where available, and summarizes each article with a local LLM served by Ollama.

---

## Setup

In [26]:
import os
from dotenv import load_dotenv
from Bio import Entrez
import ollama

# Import .env file
load_dotenv()

# Access environment variables
Entrez.email = os.getenv("ENTREZ_EMAIL")
SEARCH_QUERY = os.getenv("PUBMED_SEARCHQUERY")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL")
OLLAMA_CONTEXT_WINDOW = int(os.getenv("OLLAMA_CTX", "2048"))
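
If one of these variables is missing from `.env`, `os.getenv` returns `None` and the problem only surfaces later, for example when the search runs with an empty query. A minimal sketch of an early check, assuming the variable names used above; `require_env` is a hypothetical helper and not part of python-dotenv:

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the given variables, or raise if any is missing or empty.

    Hypothetical helper: `environ` defaults to os.environ and exists so the
    function can be exercised without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    values = {name: (environ.get(name) or "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing required settings in .env: {', '.join(missing)}")
    return values
```

It could be called right after `load_dotenv()`, for instance as `require_env(["ENTREZ_EMAIL", "PUBMED_SEARCHQUERY", "OLLAMA_MODEL"])`.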

## Ollama

In [27]:
import httpx

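def check_ollama(base_url="http://localhost:11434"):
    """Return True if an Ollama server responds at base_url."""
    try:
        # Short timeout so the cell fails fast; raise_for_status rejects 4xx/5xx replies.
        httpx.get(base_url, timeout=5.0).raise_for_status()
        print("Ollama is running")
        return True
    except httpx.HTTPError: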
        print("Ollama is not running — start it with `ollama serve`")
        return False

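def list_ollama_models():
    """List locally available models (run `ollama pull <model>` to add one)."""
    try:
        models = ollama.list()
        # Depending on the ollama client version, the tag is under "model" or "name".
        return [m.get("model") or m.get("name") for m in models.get("models", [])]
    except Exception:
        # Server unreachable or unexpected response shape.
        return []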

check_ollama()
list_ollama_models()

Ollama is not running — start it with `ollama serve`


[]

### Models under test

| Model | Pull command |
|-------|--------------|
| **tinyllama:1.1b** | `ollama pull tinyllama:1.1b` |
| **gemma2:2b** | `ollama pull gemma2:2b` |
| **gemma3:4b** | `ollama pull gemma3:4b` |

Set `OLLAMA_MODEL` in `.env` to switch models.
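
To compare the models in the table without editing `.env` between runs, the same text can be passed to each model in turn. The sketch below assumes a callable `summarize_fn(text, model)` shaped like this notebook's `summarize` helper; `compare_models` is a hypothetical name, and the timing is plain wall-clock time rather than a proper benchmark.

```python
import time


def compare_models(text, models, summarize_fn):
    """Run summarize_fn(text, model) for each model and record wall-clock time."""
    results = []
    for model in models:
        start = time.perf_counter()
        try:
            summary, error = summarize_fn(text, model), None
        except Exception as exc:  # e.g. model not pulled, or server not running
            summary, error = None, str(exc)
        results.append({
            "model": model,
            "seconds": round(time.perf_counter() - start, 2),
            "summary": summary,
            "error": error,
        })
    return results
```

A call might look like `compare_models(text, ["tinyllama:1.1b", "gemma2:2b", "gemma3:4b"], summarize)`; a failure for one model is recorded in its `error` field instead of stopping the loop.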

## PubMed Helpers

In [28]:
def query_pubmed(query, max_results=5):
    """
    Search PubMed and return the PMIDs of up to max_results matching articles.
    """
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

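def fetch_pubmed_full_text(pmc_text_id):
    """
    Fetch full-text XML from PubMed Central (PMC).

    pmc_text_id must be a PMC ID. A PubMed ID (PMID) returned by query_pubmed
    is a different identifier and will not resolve to the same article.
    """
    handle = Entrez.efetch(db="pmc", id=pmc_text_id, rettype="full", retmode="xml")
    xml_bytes = handle.read()
    handle.close()
    return xml_bytes.decode("utf-8")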

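def summarize(text, model, num_ctx=2048):
    """
    Summarize one article's full text with the given Ollama model.
    """
    # Cap the prompt by characters; this is a rough bound, not a token count,
    # so long inputs can still exceed a small num_ctx.
    text = text[:12000]
    options = {"num_ctx": num_ctx} if num_ctx else {}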
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "user",
                "content": f"Generate a concise summary of this biomedical article:\n\n{text}",
            }
        ],
        options=options,
    )
    return response.message["content"]
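
`summarize` truncates its input to 12,000 characters, and `fetch_pubmed_full_text` returns raw XML, so markup and front matter can use up much of that budget. A minimal sketch of stripping tags first, assuming the PMC response is JATS-style XML with `<abstract>` and `<body>` elements; `jats_to_text` is a hypothetical helper and makes no attempt to handle tables, figures or references specially.

```python
import xml.etree.ElementTree as ET


def jats_to_text(xml_text):
    """Return the abstract and body text of a JATS-style article as plain text."""
    root = ET.fromstring(xml_text)
    parts = []
    for tag in ("abstract", "body"):
        for element in root.iter(tag):
            # itertext() walks every nested text node; split/join collapses whitespace.
            # Joining with spaces also separates inline markup (e.g. subscripts).
            text = " ".join(" ".join(element.itertext()).split())
            if text:
                parts.append(text)
    return "\n\n".join(parts)
```

The result could then be passed on as `summarize(jats_to_text(xml), OLLAMA_MODEL, OLLAMA_CONTEXT_WINDOW)`.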

---

## Search & Summarize

In [None]:
'''
Pipeline blueprint for the lab deliverable

Step A: Retrieve IDs
- esearch PubMed for "solitary fibrous tumor" with retmax=4500
- store the PMIDs

Step B: Map IDs
For each PMID:
- try to convert it to a PMCID (ID Converter API)
- if a PMCID exists, fetch full text from PMC
- otherwise, fetch the abstract from PubMed

Step C: Extract candidate snippets (regex)
- collect sentences/paragraphs that mention age

Step D: LLM extraction (structured JSON)
Prompt like:
    Extract all patient age information from the snippets below.
    Return JSON: { "age_values": [...], "age_summary": {...}, "n_patients": ..., "evidence": [...] }
    If no ages, return empty lists.

Step E: Store results
Write to CSV/JSONL:
- PMID, PMCID (if any), title, year
- extracted ages, evidence, counts
- status flags: full_text|abstract_only|not_accessible|parse_error

The result is a dataset plus a pipeline that can be extended later to sex, race, site, metastasis, survival, etc.
'''
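
Steps C, D and E of the blueprint can be sketched with the standard library alone. Everything here is an assumption made for illustration: the regex is a deliberately narrow pattern, the function names are hypothetical, and the JSON keys simply mirror the prompt in Step D.

```python
import json
import re

# Illustrative pattern only: matches phrases such as "45-year-old", "aged 62"
# or "mean age of 51.3 years". Real articles will need a broader pattern.
AGE_PATTERN = re.compile(
    r"\b(?:\d{1,3}(?:\.\d+)?[- ]year[- ]old|age[ds]?\b[^.]{0,40}?\d{1,3})",
    re.IGNORECASE,
)


def extract_age_snippets(text):
    """Step C: return the sentences that appear to mention a patient age."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if AGE_PATTERN.search(s)]


def parse_age_json(reply):
    """Step D: pull the JSON object out of a model reply; None if it cannot be parsed."""
    # Models often wrap JSON in prose, so take the outermost {...} span.
    start, end = reply.find("{"), reply.rfind("}")
    try:
        data = json.loads(reply[start:end + 1])
    except ValueError:
        return None  # caller can record status "parse_error"
    defaults = {"age_values": [], "age_summary": {}, "n_patients": None, "evidence": []}
    return {**defaults, **data}


def write_jsonl(records, path):
    """Step E: write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Filtering with a regex before calling the model keeps the prompt short, which matters with the small context windows configured in this notebook.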

In [None]:
ids = query_pubmed(SEARCH_QUERY)
print(f"Pulled the following IDs: {ids}")

Pulled the following IDs: ['41720295', '41710908', '41700583', '41700580', '41699972']


In [30]:
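pmid = ids[0]

# NOTE: esearch returned PubMed IDs (PMIDs), but fetch_pubmed_full_text queries
# PMC, which expects a PMC ID; passing a PMID yields a "not available" error.
text = fetch_pubmed_full_text(pmid)

print(f"Full Text: \n{text}")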

Full Text: 
<?xml version="1.0"  ?><!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd"><pmc-articleset><error id="41720295">The following PMCID is not available: 41720295</error></pmc-articleset>
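
The error above arises because `esearch` on `db="pubmed"` returns PMIDs, while `efetch` on `db="pmc"` expects PMC IDs. One way to bridge the two is sketched below, assuming Biopython's `Entrez.elink` with the `pubmed_pmc` link name. The helper names are hypothetical, and the network call is kept apart from the parsing so the pure functions can be checked offline.

```python
import xml.etree.ElementTree as ET


def pmcids_from_elink(record):
    """Collect linked PMC IDs from a parsed Entrez.elink result (a list of link sets)."""
    ids = []
    for linkset in record:
        for linkset_db in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linkset_db.get("Link", []))
    return ids


def pmid_to_pmcid(pmid):
    """Return the first PMC ID linked to a PMID, or None. Needs network access."""
    from Bio import Entrez  # imported here so the pure helpers work without Biopython

    handle = Entrez.elink(dbfrom="pubmed", db="pmc", id=pmid, linkname="pubmed_pmc")
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    ids = pmcids_from_elink(record)
    return ids[0] if ids else None


def pmc_error(xml_text):
    """Return the <error> message from a PMC efetch response, or None if absent."""
    error = ET.fromstring(xml_text).find("error")
    return error.text if error is not None else None
```

When `pmid_to_pmcid` returns `None`, the article has no linked PMC copy, which is the case where the blueprint falls back to the PubMed abstract. Calling `pmc_error(text)` before summarizing keeps an error payload like the one above from being passed to the model.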
