# Systematic Review - Literature Search Execution

This notebook runs the literature searches defined in a search strategy JSON file and generates a search execution JSON file that is later used to create a nanopublication.

**All configuration is read from JSON:**
- Databases
- Time period
- Search terms with labels
- Search groups (Boolean structure)

**Outputs:**
- Search results (CSV, RIS, BibTeX, JSON)
- Search execution JSON (for nanopublication)

In [1]:
# Install required packages (uncomment if needed)
# !pip install requests arxiv biopython pandas -q

In [2]:
import sys
import json
from pathlib import Path
from datetime import datetime

sys.path.insert(0, str(Path('.').resolve()))

from search_utils import (
    SearchExecutor,
    deduplicate_by_doi,
    print_dependency_status,
    load_config_from_json,
    build_boolean_query
)

print_dependency_status()

Optional Dependencies:
  arxiv: ✓ installed
  biopython: ✓ installed

Database Support:
  OpenAlex: ✓ API (no key required)
  Semantic Scholar: ✓ API (no key required)
  Europe PMC: ✓ API (no key required)
  arXiv: ✓ API
  PubMed: ✓ API
  CORE: ✓ API (requires free key from core.ac.uk)
  Google Scholar: ⚠ Manual only (use Publish or Perish)
  BASE: ⚠ Manual only
  Web of Science: ⚠ Requires institutional access
  Scopus: ⚠ Requires institutional access


## 1. Load Configuration from JSON

In [3]:
# ============== SET YOUR CONFIG ==============

JSON_CONFIG_PATH = "../inputs/pets-biodiversity/search-strategy-privacy-pet-biodiversity.json"
EMAIL = "anne.fouilloux@gmail.com"
MAX_RESULTS = 500
OUTPUT_DIR = Path("./pets-biodiversity/results")

# =============================================

In [4]:
# Load configuration
config = load_config_from_json(JSON_CONFIG_PATH)

print(f"Search Strategy: {config['label']}")
print(f"Time period: {config['start_year']} - {config['end_year']}")
print(f"\nDatabases ({len(config['databases'])}):")
for db in config['databases']:
    print(f"  - {db}")
print(f"\nMethodology: {config['methodology_notes'][:200]}...")

Search Strategy: Privacy-Enhancing Technologies for Geospatial Biodiversity Data Sharing - Literature Search
Time period: 2010 - 2026

Databases (3):
  - https://arxiv.org/
  - https://www.semanticscholar.org/
  - https://openalex.org/

Methodology: Search strategy for scoping review on privacy-enhancing technologies applicable to geospatial biodiversity data. All terms backed by Wikidata URIs. Removed broad terms (blockchain, federated learning,...
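
Because every later cell reads from `config`, a quick sanity check right after loading can catch a malformed strategy file early. The following is a minimal sketch, not part of `search_utils`: `check_config` and `REQUIRED_KEYS` are hypothetical names, and the key list is assumed from the keys this notebook reads (`label`, `start_year`, `end_year`, `databases`, `search_groups`, `methodology_notes`).

```python
# Hypothetical helper: the keys below are the ones this notebook reads from `config`.
REQUIRED_KEYS = ("label", "start_year", "end_year", "databases",
                 "search_groups", "methodology_notes")


def check_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if problems:
        return problems  # the remaining checks need all keys to be present
    if int(config["start_year"]) > int(config["end_year"]):
        problems.append("start_year is after end_year")
    if not config["databases"]:
        problems.append("no databases listed")
    for group, terms in config["search_groups"].items():
        if not terms:
            problems.append(f"search group '{group}' has no terms")
    return problems


# Small stand-in config with the same shape as the one loaded above.
example_config = {
    "label": "Example strategy",
    "start_year": 2010,
    "end_year": 2026,
    "databases": ["https://openalex.org/"],
    "search_groups": {"biodiversity": ["biodiversity", "GBIF"]},
    "methodology_notes": "Example notes",
}
print(check_config(example_config))
```

In the notebook this could be called as `check_config(config)` before building the query; a non-empty list would be a reason to stop and fix the JSON file.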


## 2. Search Groups (from JSON)

In [5]:
# Search groups defined in JSON
SEARCH_TERMS = config['search_groups']

# Build the Boolean query string
BOOLEAN_QUERY = build_boolean_query(SEARCH_TERMS)

print("Boolean query structure:")
print("="*60)
for group, terms in SEARCH_TERMS.items():
    print(f"\n{group.upper()}:")
    for term in terms:
        print(f"  - {term}")

print(f"\n{'='*60}")
print(f"Query: {BOOLEAN_QUERY}")

Boolean query structure:

PRIVACY_METHODS:
  - geo-privacy
  - geoprivacy
  - differential privacy
  - k-anonymity
  - secure multi-party computation
  - data anonymization
  - homomorphic encryption

BIODIVERSITY:
  - biodiversity
  - species occurrence
  - GBIF
  - species distribution model

Query: (geo-privacy OR geoprivacy OR differential privacy OR k-anonymity OR secure multi-party computation OR data anonymization OR homomorphic encryption) AND (biodiversity OR species occurrence OR GBIF OR species distribution model)
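
The printed query joins the terms of each group with `OR` and the groups with `AND`. The implementation of `build_boolean_query` is not shown in this notebook, so the following is only a sketch of that structure under this assumption; `sketch_boolean_query` and its `quote_phrases` option are hypothetical names, not the `search_utils` API.

```python
def sketch_boolean_query(search_groups: dict, quote_phrases: bool = False) -> str:
    """OR the terms inside each group, then AND the groups together."""
    clauses = []
    for terms in search_groups.values():
        if not terms:
            continue  # an empty group would produce "()", so skip it
        if quote_phrases:
            # Wrap multi-word terms in double quotes so they read as phrases.
            terms = [f'"{t}"' if " " in t else t for t in terms]
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


groups = {
    "privacy_methods": ["geoprivacy", "differential privacy"],
    "biodiversity": ["biodiversity", "GBIF"],
}
print(sketch_boolean_query(groups))
# prints: (geoprivacy OR differential privacy) AND (biodiversity OR GBIF)
print(sketch_boolean_query(groups, quote_phrases=True))
# prints: (geoprivacy OR "differential privacy") AND (biodiversity OR GBIF)
```

The unquoted form matches the human-readable query recorded above; the quoted variant is closer to the per-database translations visible in the search log, where multi-word terms appear in double quotes.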


## 3. Execute Searches

In [6]:
# Create search executor
executor = SearchExecutor(
    search_terms=SEARCH_TERMS,
    start_year=config['start_year'],
    end_year=config['end_year'],
    email=EMAIL,
    max_results=MAX_RESULTS,
    databases=config['databases']
)

SEARCH_DATE = datetime.now().strftime("%Y-%m-%d")
print(f"Search started: {datetime.now().isoformat()}\n")

# Run searches
all_records = executor.run_all_searches()

Search started: 2026-01-10T15:55:26.325542

Searching arXiv: (all:"geo-privacy" OR all:"geoprivacy" OR all:"differential privacy" OR all:"k-anonymity" OR all:...
arXiv: Retrieved 0 records
Searching Semantic Scholar: (geo-privacy OR geoprivacy OR differential privacy OR k-anonymity OR secure multi-party computati...
  (S2 query: +geo-privacy +biodiversity geoprivacy differential privacy s...)
Semantic Scholar: Total results available: 0
Semantic Scholar: Retrieved 0 records
Searching OpenAlex: (geo-privacy OR geoprivacy OR differential privacy OR k-anonymity OR secure multi-party computati...
  (OpenAlex filter: publication_year:2010-2026,title_and_abstract.search:geo-privacy|geoprivacy|"differential privacy"|k...)
OpenAlex: Total results available: 21
OpenAlex: Retrieved 21 records


In [7]:
executor.print_summary()


SEARCH RESULTS SUMMARY
Search Date: 2026-01-10T15:55:26.325384
Date Range: 2010-2026

arxiv:
  - Total available: 0
  - Retrieved: 0
semanticscholar:
  - Total available: 0
  - Retrieved: 0
openalex:
  - Total available: 21
  - Retrieved: 21

TOTAL RECORDS (before deduplication): 21
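
Two of the three databases returned no records in this run. A zero can mean that no matching literature exists, or that the query was translated too strictly for that database, so it may be worth flagging such cases before the counts are reported. The sketch below assumes the `results_summary` layout used later in this notebook (a `"databases"` mapping whose entries carry `total_available`, which may be the string `'manual'`); `zero_result_databases` is a hypothetical helper.

```python
def zero_result_databases(results_summary: dict) -> list:
    """Return the names of automated searches that reported zero available results."""
    flagged = []
    for name, counts in results_summary["databases"].items():
        total = counts.get("total_available")
        if total == "manual":
            continue  # manual-only databases have no API count to check
        if total == 0:
            flagged.append(name)
    return flagged


# Stand-in summary mirroring the counts printed above.
summary = {
    "databases": {
        "arxiv": {"total_available": 0},
        "semanticscholar": {"total_available": 0},
        "openalex": {"total_available": 21},
    }
}
for name in zero_result_databases(summary):
    print(f"Check query translation for: {name}")
```

In the notebook, `zero_result_databases(executor.results_summary)` would be the equivalent call.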


## 4. Deduplication

In [8]:
unique_records, dup_count, no_doi_count = deduplicate_by_doi(all_records)

print("DEDUPLICATION (DOI-based):")
print(f"  Total records: {len(all_records)}")
print(f"  Duplicates removed: {dup_count}")
print(f"  Records without DOI: {no_doi_count}")
print(f"  Unique records: {len(unique_records)}")

DEDUPLICATION (DOI-based):
  Total records: 21
  Duplicates removed: 0
  Records without DOI: 1
  Unique records: 21
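
The counts above show that records without a DOI are kept rather than dropped (21 total, 1 without DOI, 21 unique). The implementation of `deduplicate_by_doi` is not shown here, so the following is only a sketch of DOI-based deduplication consistent with that behaviour. It assumes each record is a dict with an optional `"doi"` key; `normalize_doi` and `sketch_deduplicate_by_doi` are hypothetical names.

```python
DOI_PREFIXES = ("https://doi.org/", "http://doi.org/",
                "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes; return None if empty."""
    if not doi:
        return None
    doi = str(doi).strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def sketch_deduplicate_by_doi(records):
    """Return (unique_records, duplicates_removed, records_without_doi)."""
    seen = set()
    unique = []
    dup_count = 0
    no_doi_count = 0
    for record in records:
        doi = normalize_doi(record.get("doi"))
        if doi is None:
            no_doi_count += 1
            unique.append(record)  # cannot be matched on DOI, so keep it
        elif doi in seen:
            dup_count += 1  # same DOI already kept: drop this copy
        else:
            seen.add(doi)
            unique.append(record)
    return unique, dup_count, no_doi_count


demo_records = [
    {"title": "A", "doi": "10.1234/abc"},
    {"title": "A (copy)", "doi": "https://doi.org/10.1234/ABC"},
    {"title": "B", "doi": None},
]
unique, dups, missing = sketch_deduplicate_by_doi(demo_records)
print(len(unique), dups, missing)
```

Normalizing before comparison matters because the same DOI can arrive as a bare identifier from one source and as a `https://doi.org/` URL from another.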


## 5. Export Results

In [9]:
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

executor.export_results(
    unique_records,
    output_dir=OUTPUT_DIR,
    formats=["csv", "ris", "bibtex", "json"]
)

print(f"\nFiles saved to: {OUTPUT_DIR.resolve()}")

Exported 21 records to pets-biodiversity/results/search_results_combined.csv
Exported 21 records to pets-biodiversity/results/search_results_combined.ris
Exported 21 records to pets-biodiversity/results/search_results_combined.bib
Exported 21 records to pets-biodiversity/results/search_results_combined.json
Saved search summary to pets-biodiversity/results/search_summary.json

Files saved to: /Users/annef/Documents/FAIR2Adapt/systematic-review-pipeline/notebooks/pets-biodiversity/results


## 6. Results Analysis

In [10]:
import pandas as pd

df = pd.DataFrame(unique_records)

print(f"Total unique records: {len(df)}")
print(f"\nBy source:")
print(df["source"].value_counts().to_string())
print(f"\nBy year:")
df["year"] = pd.to_numeric(df["year"], errors="coerce")
print(df["year"].value_counts().sort_index().to_string())

Total unique records: 21

By source:
source
OpenAlex    21

By year:
year
2013    1
2020    3
2021    1
2022    7
2024    2
2025    7


In [11]:
# Sample titles
print("\nSample titles (first 10):")
for i, title in enumerate(df["title"].head(10), 1):
    title = str(title)
    print(f"{i}. {title[:90]}..." if len(title) > 90 else f"{i}. {title}")


Sample titles (first 10):
1. An Efficient Approach Based on Privacy-Preserving Deep Learning for Satellite Image Classi...
2. LUCAS cover photos 2006–2018 over the EU: 874 646 spatially distributed geo-tagged close-u...
3. Ethics of Environmental and Biodiversity Data. When Data Travel, Do the Benefits Return? P...
4. Ethics of Environmental and Biodiversity Data. When Data Travel, Do the Benefits Return? P...
5. Ethics of Environmental and Biodiversity Data. When Data Travel, Do the Benefits Return? P...
6. Exploring the Methods and the Need for Data Anonymization of Species Locational Data
7. Fisheries Inspection in Portuguese Waters from 2015 to 2023
8. LUCAS Cover photos 2006–2018 over the EU: 874,646 spatially distributed geo-tagged close-u...
9. The Importance of Multilevel and Multidimensional Approaches to Integrated Resources Manag...
10. Flora of Russia on iNaturalist backup 2020 Sep 08 (750K + 136K records)
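
Several titles in this sample look alike even though DOI-based deduplication removed nothing, so a title-based check may help before screening. The sketch below groups titles that become identical after lowercasing and dropping everything except ASCII letters and digits. `title_key` and `group_possible_duplicates` are hypothetical helpers, and the demo titles are made up; matches are candidates for manual review, not confirmed duplicates.

```python
import re
from collections import defaultdict


def title_key(title) -> str:
    """Lowercase the title and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", str(title).lower())


def group_possible_duplicates(titles):
    """Return lists of positions whose titles share the same normalized key."""
    groups = defaultdict(list)
    for index, title in enumerate(titles):
        groups[title_key(title)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Made-up titles that differ only in capitalization and number formatting.
demo_titles = [
    "Field photos 2006–2018: 874 646 geo-tagged images",
    "Field Photos 2006–2018: 874,646 geo-tagged images",
    "An unrelated survey of fisheries inspections",
]
print(group_possible_duplicates(demo_titles))
```

In the notebook, `group_possible_duplicates(df["title"])` would return row positions to inspect. Records with identical titles can still be distinct items (for example separate parts of a series), so this is a prompt for a manual look rather than an automatic removal step.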


## 7. PRISMA Counts

In [12]:
print("PRISMA FLOW - IDENTIFICATION")
print("="*40)
for db, counts in executor.results_summary["databases"].items():
    if counts.get('total_available') != 'manual':
        print(f"{db}: {counts['total_available']}")
    else:
        print(f"{db}: (manual search required)")
print(f"\nTotal identified: {len(all_records)}")
print(f"After deduplication: {len(unique_records)}")

PRISMA FLOW - IDENTIFICATION
arxiv: 0
semanticscholar: 0
openalex: 21

Total identified: 21
After deduplication: 21


## 8. Generate Search Execution JSON

This JSON file documents the search execution and will be used to create a nanopublication.

In [13]:
# Database metadata mapping
DB_METADATA = {
    "arxiv": {
        "url": "https://arxiv.org/",
        "label": "arXiv",
        "export_format": "BibTeX",
        "notes": "Preprints in cs.CV, cs.LG, eess.IV categories"
    },
    "openalex": {
        "url": "https://openalex.org/",
        "label": "OpenAlex",
        "export_format": "BibTeX",
        "notes": "Comprehensive bibliographic coverage, free API"
    },
    "semanticscholar": {
        "url": "https://www.semanticscholar.org/",
        "label": "Semantic Scholar",
        "export_format": "BibTeX",
        "notes": "AI-enhanced discovery, free API"
    },
    "pubmed": {
        "url": "https://pubmed.ncbi.nlm.nih.gov/",
        "label": "PubMed",
        "export_format": "RIS/NBIB",
        "notes": "Biomedical focus"
    },
    "europepmc": {
        "url": "https://europepmc.org/",
        "label": "Europe PMC",
        "export_format": "RIS",
        "notes": "European repository content"
    },
    "core.ac.uk": {
        "url": "https://core.ac.uk/",
        "label": "CORE",
        "export_format": "BibTeX",
        "notes": "Open access aggregator"
    }
}

# Build db_searches array
db_searches = []
for db_key, counts in executor.results_summary["databases"].items():
    metadata = DB_METADATA.get(db_key, {
        "url": f"https://{db_key}/",
        "label": db_key,
        "export_format": "RIS",
        "notes": ""
    })
    
    db_search = {
        "database_url": metadata["url"],
        "database_label": metadata["label"],
        "search_query": BOOLEAN_QUERY,
        "filters": f"{config['start_year']}-{config['end_year']}",
        # Manual-only databases report 'manual' instead of a count; those are recorded as 0.
        "results_count": counts.get('total_available', 0) if counts.get('total_available') != 'manual' else 0,
        "export_format": metadata["export_format"],
        "notes": metadata["notes"]
    }
    db_searches.append(db_search)

print(f"Captured {len(db_searches)} database searches:")
for db in db_searches:
    print(f"  {db['database_label']}: {db['results_count']}")

Captured 3 database searches:
  arXiv: 0
  Semantic Scholar: 0
  OpenAlex: 21


In [14]:
# Load author info from original config
with open(JSON_CONFIG_PATH, "r", encoding="utf-8") as f:
    original_config = json.load(f)

author_info = original_config.get("author", {
    "orcid": "0000-0000-0000-0000",
    "name": "Unknown"
})

# Generate search execution JSON (matching quantum-biodiversity format)
search_execution = {
    "_instructions": "Update screening fields after completing title/abstract and full-text screening.",
    "author": author_info,
    "search_execution_dataset": {
        "label": config['label'].replace("Literature Search", "Search Execution Results"),
        "part_of": original_config.get("search_strategy", {}).get("part_of", ""),
        "creation_date": SEARCH_DATE,
        "db_searches": db_searches,
        "deduplication_methodology": f"Records from {len(db_searches)} databases combined. Automatic deduplication on DOI field. {dup_count} duplicates removed. {no_doi_count} records without DOI kept for title-based deduplication in screening tool.",
        "review_methodology": "UPDATE: Describe your review methodology (e.g., single/dual reviewer, validation sample).",
        "screening_methodology": "UPDATE: Describe screening approach (e.g., Rayyan, ASReview, manual).",
        "screened_record_count": str(len(unique_records)),
        "fulltext_screened_record_count": "UPDATE: Number after title/abstract screening",
        "final_included_study_count": "UPDATE: Final included studies",
        "exclusion_breakdown": "UPDATE: Provide breakdown of exclusions at each stage (e.g., Title/abstract screening: X excluded as not relevant. Full-text screening: Y excluded for reason Z.)",
        "dataset_file_location": "UPDATE: Add Zenodo DOI or repository URL",
        "limitations": "UPDATE: Document limitations (e.g., database coverage, language restrictions, grey literature exclusion)."
    },
    "output": {
        "filename": original_config.get("output", {}).get("filename", "search-execution").replace("search-strategy", "search-execution")
    }
}

print("Search execution JSON generated")

Search execution JSON generated


In [15]:
# Save search execution JSON
execution_filename = search_execution["output"]["filename"] + ".json"
execution_path = OUTPUT_DIR / execution_filename

with open(execution_path, "w") as f:
    json.dump(search_execution, f, indent=2)

print(f"Saved: {execution_path}")
print("\n" + "="*60)
print("⚠ FIELDS TO UPDATE AFTER SCREENING:")
print("="*60)
print("  - review_methodology")
print("  - screening_methodology")
print("  - fulltext_screened_record_count")
print("  - final_included_study_count")
print("  - exclusion_breakdown")
print("  - dataset_file_location")
print("  - limitations")

Saved: pets-biodiversity/results/privacy-pet-biodiversity-search-execution.json

⚠ FIELDS TO UPDATE AFTER SCREENING:
  - review_methodology
  - screening_methodology
  - fulltext_screened_record_count
  - final_included_study_count
  - exclusion_breakdown
  - dataset_file_location
  - limitations
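
The fields listed above are the ones whose values still start with the `UPDATE:` placeholder. Rather than maintaining that list by hand, the pending fields can be derived from the JSON itself, which also makes it easy to re-check the file after screening. This is a minimal sketch; `pending_fields` is a hypothetical helper and the demo dictionary is a shortened stand-in for `search_execution["search_execution_dataset"]`.

```python
def pending_fields(dataset: dict) -> list:
    """Return the keys whose string values still begin with the 'UPDATE:' placeholder."""
    return [
        key
        for key, value in dataset.items()
        if isinstance(value, str) and value.startswith("UPDATE:")
    ]


# Shortened stand-in for the generated dataset section.
demo_dataset = {
    "creation_date": "2026-01-10",
    "screened_record_count": "21",
    "review_methodology": "UPDATE: Describe your review methodology.",
    "limitations": "UPDATE: Document limitations.",
    "db_searches": [],
}
for field in pending_fields(demo_dataset):
    print(f"  - {field}")
```

An empty result from `pending_fields(search_execution["search_execution_dataset"])` would indicate that every placeholder has been replaced before the nanopublication is created.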


In [16]:
# Preview the generated JSON
print("\nGenerated search execution JSON:")
print("="*60)
print(json.dumps(search_execution, indent=2))


Generated search execution JSON:
{
  "_instructions": "Update screening fields after completing title/abstract and full-text screening.",
  "author": {
    "orcid": "0000-0002-1784-2920",
    "name": "Anne Fouilloux"
  },
  "search_execution_dataset": {
    "label": "Privacy-Enhancing Technologies for Geospatial Biodiversity Data Sharing - Search Execution Results",
    "part_of": "https://w3id.org/np/RAqmVeNbWgL7sNtsr9GqdX0ZTa6aQf3itQmort-JMy4tM",
    "creation_date": "2026-01-10",
    "db_searches": [
      {
        "database_url": "https://arxiv.org/",
        "database_label": "arXiv",
        "search_query": "(geo-privacy OR geoprivacy OR differential privacy OR k-anonymity OR secure multi-party computation OR data anonymization OR homomorphic encryption) AND (biodiversity OR species occurrence OR GBIF OR species distribution model)",
        "filters": "2010-2026",
        "results_count": 0,
        "export_format": "BibTeX",
        "notes": "Preprints in cs.CV, cs.LG, eess.I

## Next Steps

1. **Import to Rayyan/ASReview**: Upload `search_results_combined.ris`
2. **Title/abstract screening**: Apply inclusion criteria
3. **Full-text screening**: Review candidates
4. **Update search execution JSON**: Fill in screening results
5. **Create nanopublication**: Run `search-execution-nanopub-from-json.ipynb`
6. **Data extraction**: Create nanopublications for included papers