Python API
pxseek is CLI-first, but it exposes a small documented Python API for code that should not need to shell out to the CLI. The same functions, cache, and artifact formats work in both paths.
Import from the package root:
from pxseek import (
    fetch_datasets,
    filter_datasets,
    lookup_datasets,
    FetchResult,
    LookupResult,
    read_artifact,
    render_artifact,
    write_artifact,
)
These six functions and two dataclasses are the stable surface for the 0.5.x series. Everything else in the package is implementation detail.
fetch_datasets() downloads, parses, and caches the ProteomeXchange summary table.
from pxseek import fetch_datasets
result = fetch_datasets()
summary_df = result.df
Parameters
- refresh (bool, default False): When True, bypass the cache and download fresh data.
- cache_dir (Path | str | None, default None): Base directory for .pxseek_cache/. Defaults to the current working directory.
- use_stale_on_error (bool, default True): When True, return cached data if the network request fails with a connection error or timeout.
Returns
A FetchResult dataclass with these fields:
- df (pd.DataFrame): Clean summary with 9 columns (see Data Formats).
- from_cache (bool): True if the data came from the local cache.
- stale_fallback (bool): True if stale cached data was returned after a network failure.
- parse_result (ParseResult | None): Parse diagnostics for freshly downloaded data. None when served from cache.
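How refresh and use_stale_on_error relate to the from_cache and stale_fallback flags can be sketched without any network access. This is a minimal illustration under stated assumptions, not pxseek's implementation: fetch_with_fallback, SketchResult, and the injected download callable are hypothetical names, the payload is plain text instead of a DataFrame, and cache expiry is ignored.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class SketchResult:
    """Hypothetical stand-in for FetchResult (text payload, no DataFrame)."""
    data: str
    from_cache: bool
    stale_fallback: bool


def fetch_with_fallback(
    download: Callable[[], str],
    cache_file: Path,
    refresh: bool = False,
    use_stale_on_error: bool = True,
) -> SketchResult:
    # Serve from the cache unless the caller asked for fresh data.
    if cache_file.exists() and not refresh:
        return SketchResult(cache_file.read_text(), from_cache=True, stale_fallback=False)
    try:
        data = download()
    except (ConnectionError, TimeoutError):
        # Assumption: only connection errors and timeouts trigger the fallback,
        # and a fallback result also reports from_cache=True.
        if use_stale_on_error and cache_file.exists():
            return SketchResult(cache_file.read_text(), from_cache=True, stale_fallback=True)
        raise
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(data)
    return SketchResult(data, from_cache=False, stale_fallback=False)


def failing_download() -> str:
    """Simulate an unreachable server."""
    raise ConnectionError("network is down")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "summary.tsv"
    # First call: nothing cached yet, so the download runs and fills the cache.
    first = fetch_with_fallback(lambda: "dataset_id\ttitle\n", cache)
    # Second call: refresh bypasses the cache, the download fails, stale data is used.
    second = fetch_with_fallback(failing_download, cache, refresh=True)

print(first.from_cache, first.stale_fallback)    # False False
print(second.from_cache, second.stale_fallback)  # True True
```

Passing use_stale_on_error=False in the second call would re-raise the ConnectionError instead of returning the cached text.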
result = fetch_datasets()
print(result.from_cache) # True or False
print(result.stale_fallback) # True if network failed and stale cache was used
print(len(result.df))  # Row count
filter_datasets() filters a summary DataFrame using the same rules as the CLI.
from pxseek import fetch_datasets, filter_datasets
summary_df = fetch_datasets().df
filtered_df, summary = filter_datasets(
    summary_df,
    species="Homo sapiens",
    keywords="cancer, proteomics",
    match_all=True,
)
Parameters
- df (pd.DataFrame, required): Clean summary DataFrame from fetch_datasets().
- species (str | None, default None): Species regex (case-insensitive).
- repository (str | None, default None): Comma-separated repository names.
- keywords (str | None, default None): Comma-separated keywords or path to a keyword file.
- keyword_columns (str | None, default None): Comma-separated column names to search (defaults to title,keywords).
- after (str | None, default None): Lower date bound in YYYY-MM-DD format.
- before (str | None, default None): Upper date bound in YYYY-MM-DD format.
- instrument (str | None, default None): Instrument regex (case-insensitive).
- match_all (bool, default False): When True, all keywords must match (AND logic).
- deep (bool, default False): When True, also search within XML descriptions. Requires keywords.
- cache_dir (Path | str | None, default None): Cache directory used by deep search for XML files.
- delay (float, default 1.0): Seconds between XML requests during deep search.
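The difference between the default OR matching and match_all=True can be illustrated with a small stand-alone sketch. It assumes case-insensitive substring matching over the searched columns, which this page does not confirm, and it skips the keyword-file form of the keywords argument. parse_keywords and row_matches are hypothetical helpers, not pxseek functions.

```python
from __future__ import annotations


def parse_keywords(raw: str) -> list[str]:
    """Split a comma-separated string into trimmed, non-empty keywords."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def row_matches(texts: list[str], keywords: list[str], match_all: bool = False) -> bool:
    """Check one row's searched columns against the keyword list."""
    # Assumption: case-insensitive substring matching across all searched columns.
    haystack = " ".join(texts).lower()
    hits = [kw.lower() in haystack for kw in keywords]
    # match_all=True means AND logic; the default is OR logic.
    return all(hits) if match_all else any(hits)


keywords = parse_keywords("cancer, proteomics")

# Made-up title and keyword values for two rows.
row_a = ["Phosphoproteomics of breast cancer cell lines", "phosphorylation"]
row_b = ["Yeast proteomics benchmark", "label-free"]

print(row_matches(row_a, keywords, match_all=True))   # True
print(row_matches(row_b, keywords, match_all=True))   # False
print(row_matches(row_b, keywords, match_all=False))  # True
```

Under substring matching, "proteomics" also hits "Phosphoproteomics"; a word-boundary regex would behave differently.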
Returns
A tuple of (pd.DataFrame, dict).
The DataFrame is the filtered subset with the same columns as the input. When deep=True, it also includes a description column.
The summary dict contains:
- original_count (int): Rows before any filtering.
- filtered_count (int): Rows after all filters.
- active_filters (list[str]): Human-readable descriptions of each active filter.
- nat_count (int): Number of unparseable dates dropped (only present when a date filter is active).
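The sketch below shows one way the counts in the summary dict could relate to a date filter, and where nat_count comes from. It is an assumption-laden illustration: the rows are made up, the "date" key is a hypothetical column name, both bounds are treated as inclusive, and filter_by_date is not part of pxseek.

```python
from datetime import date
from typing import Optional


def parse_date(value: str) -> Optional[date]:
    """Return a date for YYYY-MM-DD input, or None when parsing fails."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def filter_by_date(rows, after=None, before=None):
    """Keep rows inside the date bounds and count rows with unparseable dates."""
    lower = date.fromisoformat(after) if after else None
    upper = date.fromisoformat(before) if before else None
    kept, nat_count = [], 0
    for row in rows:
        parsed = parse_date(row["date"])  # "date" is a hypothetical column name
        if parsed is None:
            nat_count += 1  # unparseable dates are dropped and counted
            continue
        # Assumption: both bounds are inclusive.
        if lower is not None and parsed < lower:
            continue
        if upper is not None and parsed > upper:
            continue
        kept.append(row)
    summary = {
        "original_count": len(rows),
        "filtered_count": len(kept),
        "nat_count": nat_count,
    }
    return kept, summary


# Made-up rows for illustration only.
rows = [
    {"dataset_id": "PXD900001", "date": "2019-05-02"},
    {"dataset_id": "PXD900002", "date": "2021-11-30"},
    {"dataset_id": "PXD900003", "date": "not available"},
]
kept, summary = filter_by_date(rows, after="2020-01-01")
print([row["dataset_id"] for row in kept])  # ['PXD900002']
print(summary)
```

On Python 3.11 and later, date.fromisoformat also accepts other ISO 8601 forms, so a stricter check would be needed to enforce YYYY-MM-DD exactly.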
filtered_df, summary = filter_datasets(summary_df, species="Homo sapiens")
print(summary["original_count"]) # e.g. 50000
print(summary["filtered_count"]) # e.g. 12000
print(summary["active_filters"])  # ["species: Homo sapiens"]
lookup_datasets() fetches detailed XML metadata for specific dataset identifiers.
from pxseek import lookup_datasets
result = lookup_datasets(["PXD000001", "PXD000002"])
details_df = result.df
Parameters
- dataset_ids (Iterable[str], required): PXD or RPXD identifiers. Validated before any HTTP request.
- cache_dir (Path | str | None, default None): Base directory for .pxseek_cache/.
- delay (float, default 1.0): Seconds between uncached XML requests.
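Validating every identifier up front, before any request is sent, can be sketched as follows. The pattern is an assumption based on the usual ProteomeXchange accession shape (PXD or RPXD followed by six digits); the rule pxseek actually applies may differ, and validate_dataset_ids is a hypothetical helper.

```python
from __future__ import annotations

import re
from typing import Iterable

# Assumption: "PXD" or "RPXD" followed by exactly six digits.
DATASET_ID_PATTERN = re.compile(r"R?PXD\d{6}")


def validate_dataset_ids(dataset_ids: Iterable[str]) -> list[str]:
    """Return the identifiers as a list, or raise ValueError if any is malformed."""
    ids = list(dataset_ids)
    invalid = [item for item in ids if not DATASET_ID_PATTERN.fullmatch(str(item))]
    if invalid:
        # Fail before any network work starts, reporting every bad identifier.
        raise ValueError(f"Invalid dataset identifiers: {invalid}")
    return ids


print(validate_dataset_ids(["PXD000001", "RPXD000002"]))

try:
    validate_dataset_ids(["PXD000001", "PX-42"])
except ValueError as exc:
    print(exc)
```

A well-formed identifier that does not exist on the server passes this check. That case is reported through failed_ids rather than an exception.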
Returns
A LookupResult dataclass with these fields:
- df (pd.DataFrame): Parsed metadata rows for successfully fetched datasets.
- failed_ids (list[str]): Identifiers that could not be fetched or parsed.
The DataFrame has 19 columns covering title, description, species, instruments, modifications, contacts, publications, keywords, FTP location, and more. See Data Formats for the full list.
result = lookup_datasets(["PXD000001", "PXD999999"])
print(len(result.df)) # Successful rows
print(result.failed_ids)  # ["PXD999999"] if that one failed
The three artifact helpers (read_artifact, render_artifact, and write_artifact) let Python code produce and consume the same file formats as the CLI.
read_artifact() reads a TSV, CSV, or JSON artifact into a DataFrame. The format is inferred from the file suffix unless overridden. If path is "-", the helper reads from stdin and auto-detects JSON, TSV, or CSV from the content when no explicit format is given.
from pxseek import read_artifact
df = read_artifact("results.tsv")
df = read_artifact("results.json")
df = read_artifact("-", format="json")
df = read_artifact("data.csv", format="csv")
render_artifact() turns a DataFrame into text for stdout or an API response.
from pxseek import render_artifact
json_text = render_artifact(df, format="json")
tsv_text = render_artifact(df, format="tsv")
write_artifact() writes a DataFrame to disk. The format is inferred from the file suffix unless overridden. In auto mode, .tsv, .csv, .json, and no suffix are accepted. Unknown suffixes raise an error unless you pass format= explicitly. Missing parent directories are created automatically.
from pxseek import write_artifact
write_artifact(df, "results.tsv")
write_artifact(df, "results.json")
write_artifact(df, "results.csv", format="csv")
Supported formats: tsv, csv, json.
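The suffix rules and the stdin content detection described for the artifact helpers can be approximated with two small functions. This is a sketch, not pxseek's code: infer_format and sniff_format are hypothetical names, ValueError is an assumed error type, and mapping a missing suffix to TSV is an assumption, since this page only says that no suffix is accepted.

```python
from pathlib import Path
from typing import Optional

SUPPORTED_FORMATS = ("tsv", "csv", "json")


def infer_format(path: str, format: Optional[str] = None) -> str:
    """Pick an artifact format from an explicit override or the file suffix."""
    if format is not None:
        if format not in SUPPORTED_FORMATS:
            raise ValueError(f"Unsupported format: {format!r}")
        return format  # an explicit format wins over any suffix
    suffix = Path(path).suffix.lower().lstrip(".")
    if suffix == "":
        return "tsv"  # Assumption: the default for suffix-less paths is TSV.
    if suffix in SUPPORTED_FORMATS:
        return suffix
    raise ValueError(f"Cannot infer format from suffix {suffix!r}; pass format=")


def sniff_format(text: str) -> str:
    """Guess JSON, TSV, or CSV from content, as a stdin reader might."""
    stripped = text.lstrip()
    if stripped.startswith(("[", "{")):
        return "json"
    first_line = stripped.splitlines()[0] if stripped else ""
    # Assumption: a tab in the header line means TSV, otherwise CSV.
    return "tsv" if "\t" in first_line else "csv"


print(infer_format("results.json"))               # json
print(infer_format("results.txt", format="csv"))  # csv
print(infer_format("results"))                    # tsv (assumed default)
print(sniff_format('[{"dataset_id": "PXD000001"}]'))  # json
print(sniff_format("dataset_id\ttitle\n"))            # tsv
print(sniff_format("dataset_id,title\n"))             # csv
```

Calling infer_format("results.xlsx") without format= raises ValueError, mirroring the rule that unknown suffixes need an explicit format.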
from pxseek import fetch_datasets, filter_datasets, lookup_datasets
# Step 1: fetch the summary
fetch_result = fetch_datasets()
summary_df = fetch_result.df
print(f"Fetched {len(summary_df)} datasets (from cache: {fetch_result.from_cache})")
# Step 2: filter
filtered_df, summary = filter_datasets(
    summary_df,
    species="Homo sapiens",
    keywords="cancer, phosphoproteomics",
    match_all=False,
)
print(f"Filtered to {len(filtered_df)} datasets")
print(f"Active filters: {summary['active_filters']}")
# Step 3: lookup details
lookup_result = lookup_datasets(filtered_df["dataset_id"])
print(f"Looked up {len(lookup_result.df)} datasets")
print(f"Failed: {lookup_result.failed_ids}")
# Step 4: use the results
detailed_df = lookup_result.df
for _, row in detailed_df.iterrows():
    print(f"{row['dataset_id']}: {row['title']} ({row['species']})")
    print(f" FTP: {row['ftp_location']}")
    print(f" PubMed: {row['pubmed_ids']}")
- The workflow API shares the same local cache as the CLI. Data fetched from Python is visible to the CLI, and vice versa.
- fetch_datasets(use_stale_on_error=True) returns cached data when the network is down, matching the CLI behavior.
- filter_datasets() with deep=True fetches XML for each candidate dataset. This can be slow for large shortlists. Use cache_dir to persist XML across runs.
- lookup_datasets() validates all identifiers before making any HTTP requests. It raises ValueError if any ID is invalid.
- The artifact helpers produce exactly the same output as the CLI. A TSV written by write_artifact() can be read by pxseek filter -i.