# ASReview Results to Nanopublication Export

This notebook extracts screening results from an ASReview `.asreview` project file and generates:
1. **study_inclusion.json** - Ready for nanopub generation
2. **PRISMA flow diagram data**
3. **Export files** (CSV, RIS) for included/excluded studies

## 1. Configuration

Edit these settings for your review:

In [1]:
# === EDIT THIS SECTION ===

# Path to your exported .asreview file
ASREVIEW_FILE = "search_results_combined.asreview"

# Review metadata
REVIEW_TITLE = "Quantum Computing Applications in Biodiversity Research"
REVIEW_DESCRIPTION = "Systematic review of quantum computing methods applied to biodiversity, conservation, and ecological research"

# Screener info (for provenance)
SCREENER_ORCID = "0000-0002-1784-2920"
SCREENER_NAME = "Anne Fouilloux"

# Link to your systematic review nanopubs (provenance chain)
PICO_NANOPUB_URI = "https://w3id.org/np/RA8B3ptXUOsN7obpkFGtA0FBmsh0OnID53wOsUIpSKTcg"
SEARCH_STRATEGY_URI = "https://w3id.org/np/RAEK3jctU2x3IKW174OTgmFH9zDygPiaD-vb4zGrD39A4"
SEARCH_EXECUTION_URI = "https://w3id.org/np/RAMPy96eCLCXlGR9VvCVf6rJmpN_DlxxarMGm91_5n-O8"

# Output directory
OUTPUT_DIR = "screening_results"

In [1]:
# Path to your exported .asreview file
ASREVIEW_FILE = "wildfire-sentinel2-ml/results/wildfire-sentinel2-ml-no-duplicates.asreview"

# Review metadata
REVIEW_TITLE = "Machine Learning Algorithms for Wildfire Detection and Burned Area Mapping using Sentinel-2"
REVIEW_DESCRIPTION = "Systematic review on Machine Learning Algorithms for Wildfire Detection and Burned Area Mapping using Sentinel-2"

# Screener info (for provenance)
SCREENER_ORCID = "0000-0002-1784-2920"
SCREENER_NAME = "Anne Fouilloux"

# Link to your systematic review nanopubs (provenance chain)
PICO_NANOPUB_URI = "https://w3id.org/np/RAjO8tdVOla9I77PeXF4iY92ULngrpx5_ZSKFkVrCmsW0"
SEARCH_STRATEGY_URI = " "
SEARCH_EXECUTION_URI = " "

# Output directory
OUTPUT_DIR = "wildfire-sentinel2-ml/screening_results"

## 2. Setup

In [2]:
import zipfile
import json
import pandas as pd
import sqlite3
from pathlib import Path
from datetime import datetime, timezone
import tempfile
import shutil

# Create output directory
# parents=True is needed because OUTPUT_DIR may be a nested path
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

print(f"✓ Setup complete")
print(f"  ASReview file: {ASREVIEW_FILE}")
print(f"  Output directory: {OUTPUT_DIR}")

✓ Setup complete
  ASReview file: wildfire-sentinel2-ml/results/wildfire-sentinel2-ml-no-duplicates.asreview
  Output directory: wildfire-sentinel2-ml/screening_results


## 3. Extract Data from ASReview Project

The `.asreview` file is a ZIP archive. This notebook reads three of its members:
- `project.json` - Project metadata
- `data_store.db` - SQLite database with paper metadata
- `reviews/*/results.db` - SQLite database with screening decisions
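
Because the project file is an ordinary ZIP archive, its layout can be checked before anything is extracted. The following is a minimal sketch using only the standard library. The helper names `list_asreview_members` and `find_results_dbs` are hypothetical and are not part of ASReview; the `reviews/<id>/results.db` pattern is taken from the list above.

```python
import zipfile


def list_asreview_members(path):
    """Return the file names stored in an .asreview (ZIP) archive, sorted."""
    with zipfile.ZipFile(path, "r") as archive:
        # Directory entries end with "/" and are skipped
        return sorted(n for n in archive.namelist() if not n.endswith("/"))


def find_results_dbs(members):
    """Pick out the per-review results databases (reviews/<id>/results.db)."""
    return [
        m for m in members
        if m.startswith("reviews/") and m.endswith("/results.db")
    ]
```

Calling `find_results_dbs(list_asreview_members(ASREVIEW_FILE))` would show up front whether the archive holds zero, one, or several reviews.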

In [3]:
# Extract the .asreview file (it's a ZIP)
temp_dir = tempfile.mkdtemp()
print(f"Extracting to: {temp_dir}")

with zipfile.ZipFile(ASREVIEW_FILE, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

# List contents
print("\nProject contents:")
for item in Path(temp_dir).rglob("*"):
    if item.is_file():
        print(f"  {item.relative_to(temp_dir)}")

Extracting to: /var/folders/zf/53jxd5nj2j3dmfjqpx41p24c0000gn/T/tmpieghkk0m

Project contents:
  project.json
  data_store.db
  feature_matrices/tfidf_feature_matrix.npz
  data/wildfire-sentinel2-ml-no-duplicates.ris
  reviews/dd8a86cb9f02451c9a9dde42ae792afe/results.db
  reviews/dd8a86cb9f02451c9a9dde42ae792afe/settings_metadata.json


In [4]:
# Load project config
with open(Path(temp_dir) / "project.json") as f:
    project_config = json.load(f)

print("Project info:")
print(f"  Name: {project_config.get('name', 'N/A')}")
print(f"  ID: {project_config.get('id', 'N/A')}")
print(f"  Version: {project_config.get('version', 'N/A')}")
print(f"  Reviews: {len(project_config.get('reviews', []))}")

Project info:
  Name: wildfire-sentinel2-ml-no-duplicates
  ID: b50068f3a2e2439fa80b64922d8418ff
  Version: 2.2
  Reviews: 1


In [5]:
# Load paper metadata from data_store.db
data_store_path = Path(temp_dir) / "data_store.db"

conn = sqlite3.connect(data_store_path)
papers_df = pd.read_sql_query("SELECT * FROM record", conn)
conn.close()

print(f"\nLoaded {len(papers_df)} papers from data store")
print(f"Columns: {list(papers_df.columns)}")
papers_df.head()


Loaded 450 papers from data store
Columns: ['dataset_row', 'dataset_id', 'duplicate_of', 'title', 'abstract', 'authors', 'keywords', 'year', 'doi', 'url', 'included', 'record_id']


| | dataset_row | dataset_id | duplicate_of | title | abstract | authors | keywords | year | doi | url | included | record_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2806332e78004d39a273e4af26696f59 | | FireScope: Wildfire Risk Prediction with a Cha... | Predicting wildfire risk is a reasoning-intens... | ["Mario Markov", "Stefan Maria Ailuro", "Luc V... | [] | 2025 | | | | 0 |
| 1 | 1 | 2806332e78004d39a273e4af26696f59 | | Assessment of the January 2025 Los Angeles Cou... | This study presents a comprehensive analysis o... | ["Seyd Teymoor Seydi"] | [] | 2025 | | | | 1 |
| 2 | 2 | 2806332e78004d39a273e4af26696f59 | | On the Generalizability of Foundation Models f... | Foundation models pre-trained using self-super... | ["Yi-Chia Chang", "Adam J. Stewart", "Favyen B... | [] | 2024 | | | | 2 |
| 3 | 3 | 2806332e78004d39a273e4af26696f59 | | Sen2Fire: A Challenging Benchmark Dataset for ... | Utilizing satellite imagery for wildfire detec... | ["Yonghao Xu", "Amanda Berg", "Leif Haglund"] | [] | 2024 | 10.1109/IGARSS53475.2024.10641441 | | | 3 |
| 4 | 4 | 2806332e78004d39a273e4af26696f59 | | CaBuAr: California Burned Areas dataset for de... | Forest wildfires represent one of the catastro... | ["Daniele Rege Cambrin", "Luca Colomba", "Paol... | [] | 2024 | 10.1109/MGRS.2023.3292467 | | | 4 |


In [6]:
# Find and load screening results
reviews_dir = Path(temp_dir) / "reviews"
results_db = None

for review_dir in reviews_dir.iterdir():
    if review_dir.is_dir():
        results_path = review_dir / "results.db"
        if results_path.exists():
            results_db = results_path
            print(f"Found results database: {results_path.relative_to(temp_dir)}")
            break

if results_db:
    conn = sqlite3.connect(results_db)
    
    # Check available tables
    tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", conn)
    print(f"\nTables in results.db: {list(tables['name'])}")
    
    # Load results
    results_df = pd.read_sql_query("SELECT * FROM results", conn)
    conn.close()
    
    print(f"\nScreening results: {len(results_df)} decisions")
    print(f"Columns: {list(results_df.columns)}")
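else:
    # Fail here rather than with a NameError on results_df in a later cell
    raise FileNotFoundError("Could not find results.db under reviews/ in the extracted project")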

Found results database: reviews/dd8a86cb9f02451c9a9dde42ae792afe/results.db

Tables in results.db: ['results', 'decision_changes', 'last_ranking']

Screening results: 261 decisions
Columns: ['record_id', 'label', 'classifier', 'querier', 'balancer', 'feature_extractor', 'training_set', 'time', 'note', 'tags', 'user_id']


In [7]:
# Merge papers with screening decisions
# Check what columns are in results_df
print("Columns in results_df:", list(results_df.columns))

# Add record_id to papers_df if not present
if 'record_id' not in papers_df.columns:
    papers_df['record_id'] = papers_df.index

# Select only columns that exist in results_df
merge_cols = ['record_id', 'label']
# The results table names this column 'note' (singular)
if 'note' in results_df.columns:
    merge_cols.append('note')

# Merge
merged_df = papers_df.merge(
    results_df[merge_cols], 
    on='record_id', 
    how='left'
)

# Categorize
merged_df['status'] = merged_df['label'].map({
    1: 'included',
    0: 'excluded'
}).fillna('not_screened')

print("\nScreening summary:")
print(merged_df['status'].value_counts())

Columns in results_df: ['record_id', 'label', 'classifier', 'querier', 'balancer', 'feature_extractor', 'training_set', 'time', 'note', 'tags', 'user_id']

Screening summary:
status
not_screened    190
included        150
excluded        110
Name: count, dtype: int64
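
The results table reports 261 decision rows, while the summary counts 260 labelled records (150 included plus 110 excluded). One possible explanation is a row whose label is neither 0 nor 1, but the output does not confirm this. The sketch below shows one way to surface such rows with pandas. `merge_with_checks` is a hypothetical helper; it assumes both frames carry a `record_id` column and that `results` has a `label` column, as in the cells above.

```python
import pandas as pd


def merge_with_checks(papers, results):
    """Left-merge decisions onto papers and report rows that did not line up."""
    merged = papers.merge(
        results[["record_id", "label"]],
        on="record_id",
        how="left",
        validate="one_to_one",  # raises MergeError on duplicate record_id values
        indicator=True,         # adds a _merge column: 'both' or 'left_only'
    )
    # Decisions whose record_id is absent from the paper table
    orphans = results.loc[~results["record_id"].isin(papers["record_id"])]
    # Decision rows whose label is neither 0 nor 1 (for example, missing)
    unlabeled = results.loc[~results["label"].isin([0, 1])]
    return merged, orphans, unlabeled
```

With the notebook's data this would be called as `merge_with_checks(papers_df, results_df)`; non-empty `orphans` or `unlabeled` frames point at the rows behind a count mismatch.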


## 4. PRISMA Flow Diagram Data

In [8]:
# Calculate PRISMA numbers
total_records = len(merged_df)
screened = len(merged_df[merged_df['status'] != 'not_screened'])
included = len(merged_df[merged_df['status'] == 'included'])
excluded = len(merged_df[merged_df['status'] == 'excluded'])
not_screened = len(merged_df[merged_df['status'] == 'not_screened'])

prisma_data = {
    "identification": {
        "total_records": total_records,
        "source": "Multiple databases (OpenAlex, arXiv, PubMed, Europe PMC, Semantic Scholar)"
    },
    "screening": {
        "records_screened": screened,
        "records_excluded_titleabstract": excluded,
        "not_screened_ai_prioritization": not_screened,
        "screening_method": "ASReview LAB v2.2 (active learning)"
    },
    "included": {
        "studies_included_titleabstract": included
    },
    "notes": {
        "ai_assisted": True,
        "stopping_rule": "Consecutive irrelevant threshold",
        "estimated_recall": ">95%"
    }
}

print("="*60)
print("PRISMA FLOW DIAGRAM DATA")
print("="*60)
print(f"\nIDENTIFICATION")
print(f"  Total records from databases: {total_records}")
print(f"\nSCREENING (Title/Abstract)")
print(f"  Records screened: {screened}")
print(f"  Records excluded: {excluded}")
print(f"  Not screened (AI stopped): {not_screened}")
print(f"\nINCLUDED")
print(f"  Studies after title/abstract screening: {included}")
print("="*60)

# Save PRISMA data
with open(f"{OUTPUT_DIR}/prisma_flow_data.json", 'w') as f:
    json.dump(prisma_data, f, indent=2)
print(f"\n✓ Saved: {OUTPUT_DIR}/prisma_flow_data.json")

PRISMA FLOW DIAGRAM DATA

IDENTIFICATION
  Total records from databases: 450

SCREENING (Title/Abstract)
  Records screened: 260
  Records excluded: 110
  Not screened (AI stopped): 190

INCLUDED
  Studies after title/abstract screening: 150

✓ Saved: wildfire-sentinel2-ml/screening_results/prisma_flow_data.json
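
PRISMA numbers should add up: included plus excluded equals screened, and screened plus not screened equals the total. The sketch below checks both identities before the numbers go into a diagram. `check_prisma_counts` is a hypothetical helper that takes the five counts computed in the cell above.

```python
def check_prisma_counts(total_records, screened, included, excluded, not_screened):
    """Return a list of problems found; an empty list means the counts agree."""
    problems = []
    if min(total_records, screened, included, excluded, not_screened) < 0:
        problems.append("counts must not be negative")
    if included + excluded != screened:
        problems.append(
            f"included + excluded = {included + excluded}, expected {screened}"
        )
    if screened + not_screened != total_records:
        problems.append(
            f"screened + not_screened = {screened + not_screened}, "
            f"expected {total_records}"
        )
    return problems
```

For the figures printed above (450 total, 260 screened, 150 included, 110 excluded, 190 not screened) both identities hold, so the function returns an empty list.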


## 5. Generate Study Inclusion JSON for Nanopubs

In [9]:
# Get included studies
included_df = merged_df[merged_df['status'] == 'included'].copy()

print(f"Preparing {len(included_df)} included studies for nanopub export")
print(f"\nSample of included papers:")
display_cols = ['title', 'doi', 'authors', 'year']
available_cols = [c for c in display_cols if c in included_df.columns]
included_df[available_cols].head()

Preparing 150 included studies for nanopub export

Sample of included papers:


| | title | doi | authors | year |
|---|---|---|---|---|
| 0 | FireScope: Wildfire Risk Prediction with a Cha... | | ["Mario Markov", "Stefan Maria Ailuro", "Luc V... | 2025 |
| 1 | Assessment of the January 2025 Los Angeles Cou... | | ["Seyd Teymoor Seydi"] | 2025 |
| 3 | Sen2Fire: A Challenging Benchmark Dataset for ... | 10.1109/IGARSS53475.2024.10641441 | ["Yonghao Xu", "Amanda Berg", "Leif Haglund"] | 2024 |
| 4 | CaBuAr: California Burned Areas dataset for de... | 10.1109/MGRS.2023.3292467 | ["Daniele Rege Cambrin", "Luca Colomba", "Paol... | 2024 |
| 7 | Burned Area Detection with Sentinel-2A Data: U... | 10.5194/isprs-annals-x-5-2024-251-2024 | ["Elif Ozlem Yilmaz", "T. Kavzoglu"] | 2024 |


In [10]:
# Build study inclusion JSON
def get_study_uri(row):
    """Get best available URI for the study"""
    if pd.notna(row.get('doi')) and row['doi']:
        doi = row['doi']
        if not doi.startswith('http'):
            return f"https://doi.org/{doi}"
        return doi
    if pd.notna(row.get('url')) and row['url']:
        return row['url']
    if pd.notna(row.get('openalex_id')) and row['openalex_id']:
        return row['openalex_id']
    return None

def clean_title(title):
    """Clean title for use as label"""
    if pd.isna(title):
        return "Untitled"
    # Truncate long titles
    title = str(title).strip()
    if len(title) > 200:
        return title[:197] + "..."
    return title

# Build studies list
studies = []
missing_uri = 0

for idx, row in included_df.iterrows():
    uri = get_study_uri(row)
    if uri is None:
        missing_uri += 1
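        # Use a placeholder URI based on a title hash.
        # NOTE: the built-in hash() is salted per Python process for strings,
        # so this value changes between runs unless PYTHONHASHSEED is fixed.
        title_hash = hash(str(row.get('title', idx))) % 10000000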
        uri = f"urn:study:{title_hash}"
    
    study = {
        "uri": uri,
        "label": clean_title(row.get('title')),
        "metadata": {
            "authors": row.get('authors', ''),
            "year": int(row['year']) if pd.notna(row.get('year')) else None,
            "journal": row.get('journal', row.get('primary_location', '')),
            "doi": row.get('doi', ''),
            "abstract": row.get('abstract', '')[:500] if pd.notna(row.get('abstract')) else ''
        }
    }
    studies.append(study)

print(f"\nProcessed {len(studies)} studies")
if missing_uri > 0:
    print(f"⚠️ {missing_uri} studies without DOI/URL (using placeholder URIs)")


Processed 150 studies
⚠️ 12 studies without DOI/URL (using placeholder URIs)
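
Placeholder URIs built with Python's `hash()` differ from one interpreter session to the next, because string hashing is randomized by default. If reproducible placeholders matter for provenance, a cryptographic digest is one option. This is a sketch of that alternative, not the method the notebook uses: `placeholder_uri` is a hypothetical helper, and the `urn:study:` prefix is simply carried over from the cell above.

```python
import hashlib


def placeholder_uri(title, fallback_id):
    """Build a reproducible placeholder URI from a title, or from a fallback id."""
    has_title = isinstance(title, str) and title.strip() != ""
    # Missing titles (None, NaN, empty) fall back to the record identifier
    text = title.strip() if has_title else f"record-{fallback_id}"
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # 16 hex characters keep the URI short while making collisions unlikely
    return f"urn:study:{digest[:16]}"
```

The same title always maps to the same URI, so re-running the export does not mint new identifiers for studies that lack a DOI or URL.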


In [11]:
# Create the full study_inclusion.json
study_inclusion_config = {
    "_comment": "Study Inclusion nanopub configuration for Science Live",
    "_generated": datetime.now(timezone.utc).isoformat(),
    "_source": ASREVIEW_FILE,
    
    "review_metadata": {
        "title": REVIEW_TITLE,
        "description": REVIEW_DESCRIPTION,
        "screener_orcid": SCREENER_ORCID,
        "screener_name": SCREENER_NAME,
        "screening_date": datetime.now(timezone.utc).strftime("%Y-%m-%d"),
        "screening_tool": "ASReview LAB v2.2",
        "total_screened": screened,
        "total_included": included,
        "total_excluded": excluded
    },
    
    "provenance": {
        "pico_nanopub": PICO_NANOPUB_URI,
        "search_strategy_nanopub": SEARCH_STRATEGY_URI,
        "search_execution_nanopub": SEARCH_EXECUTION_URI
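    },
    
    "nanopub_template": {
        # NOTE: this base_uri still names the quantum-biodiversity template review;
        # update it to match the review configured in section 1
        "base_uri": "https://w3id.org/sciencelivehub/quantum-biodiversity-review/",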
        "type": "https://w3id.org/slo/StudyInclusion",
        "license": "https://creativecommons.org/licenses/by/4.0/"
    },
    
    "studies": studies
}

# Save
output_file = f"{OUTPUT_DIR}/study_inclusion.json"
with open(output_file, 'w') as f:
    json.dump(study_inclusion_config, f, indent=2, default=str)

print(f"✓ Saved: {output_file}")
print(f"  Contains {len(studies)} studies ready for nanopub generation")

✓ Saved: wildfire-sentinel2-ml/screening_results/study_inclusion.json
  Contains 150 studies ready for nanopub generation
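
The `authors` field in the data store is a JSON-encoded list held in a string, so it is written to `study_inclusion.json` as one escaped string rather than as an array. If a downstream consumer expects a list of names, the field can be decoded first. This is a minimal sketch: `parse_authors` is a hypothetical helper, and the `;` fallback for plain strings is an assumption about other input formats.

```python
import json


def parse_authors(value):
    """Turn the stored authors field into a list of author names."""
    if value is None or isinstance(value, float):  # None or NaN
        return []
    if isinstance(value, list):
        return [str(a).strip() for a in value if str(a).strip()]
    text = str(value).strip()
    if text.startswith("["):
        try:
            parsed = json.loads(text)
            if isinstance(parsed, list):
                return [str(a).strip() for a in parsed if str(a).strip()]
        except json.JSONDecodeError:
            pass  # not valid JSON: fall through to delimiter-based splitting
    return [a.strip() for a in text.split(";") if a.strip()]
```

Using `parse_authors(row.get('authors'))` when building each study's metadata would store the names as a JSON array.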


## 6. Export CSV and RIS Files

In [12]:
# Export included studies to CSV
export_cols = ['title', 'authors', 'year', 'doi', 'journal', 'abstract', 'url']
available_export_cols = [c for c in export_cols if c in included_df.columns]

included_df[available_export_cols].to_csv(
    f"{OUTPUT_DIR}/included_studies.csv", 
    index=False
)
print(f"✓ Saved: {OUTPUT_DIR}/included_studies.csv ({len(included_df)} studies)")

# Export excluded studies to CSV
excluded_df = merged_df[merged_df['status'] == 'excluded'].copy()
excluded_df[available_export_cols].to_csv(
    f"{OUTPUT_DIR}/excluded_studies.csv", 
    index=False
)
print(f"✓ Saved: {OUTPUT_DIR}/excluded_studies.csv ({len(excluded_df)} studies)")

✓ Saved: wildfire-sentinel2-ml/screening_results/included_studies.csv (150 studies)
✓ Saved: wildfire-sentinel2-ml/screening_results/excluded_studies.csv (110 studies)


In [13]:
# Export to RIS format for reference managers
def df_to_ris(df, filename):
    """Convert DataFrame to RIS format"""
    with open(filename, 'w', encoding='utf-8') as f:
        for idx, row in df.iterrows():
            f.write("TY  - JOUR\n")
            
            if pd.notna(row.get('title')):
                f.write(f"TI  - {row['title']}\n")
            
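            if pd.notna(row.get('authors')):
                # Authors are stored as a JSON list string (e.g. '["A", "B"]');
                # fall back to splitting on ';' for plain strings
                authors = str(row['authors'])
                try:
                    names = json.loads(authors)
                except json.JSONDecodeError:
                    names = authors.split(';')
                if not isinstance(names, list):
                    names = [authors]
                for author in names:
                    author = str(author).strip()
                    if author:
                        f.write(f"AU  - {author}\n")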
            
            if pd.notna(row.get('year')):
                f.write(f"PY  - {int(row['year'])}\n")
            
            if pd.notna(row.get('journal')):
                f.write(f"JO  - {row['journal']}\n")
            
            if pd.notna(row.get('doi')):
                doi = row['doi']
                if not doi.startswith('http'):
                    doi = f"https://doi.org/{doi}"
                f.write(f"DO  - {row['doi']}\n")
                f.write(f"UR  - {doi}\n")
            elif pd.notna(row.get('url')):
                f.write(f"UR  - {row['url']}\n")
            
            if pd.notna(row.get('abstract')):
                # Clean abstract for RIS
                abstract = str(row['abstract']).replace('\n', ' ').strip()
                f.write(f"AB  - {abstract}\n")
            
            f.write("ER  - \n\n")

# Export included
df_to_ris(included_df, f"{OUTPUT_DIR}/included_studies.ris")
print(f"✓ Saved: {OUTPUT_DIR}/included_studies.ris")

# Export excluded
df_to_ris(excluded_df, f"{OUTPUT_DIR}/excluded_studies.ris")
print(f"✓ Saved: {OUTPUT_DIR}/excluded_studies.ris")

✓ Saved: wildfire-sentinel2-ml/screening_results/included_studies.ris
✓ Saved: wildfire-sentinel2-ml/screening_results/excluded_studies.ris


## 7. Cleanup and Summary

In [14]:
# Cleanup temp directory
shutil.rmtree(temp_dir)
print(f"✓ Cleaned up temporary files")

✓ Cleaned up temporary files


In [15]:
# Final summary
print("="*60)
print("EXPORT COMPLETE")
print("="*60)
print(f"\nReview: {REVIEW_TITLE}")
print(f"Screener: {SCREENER_NAME} ({SCREENER_ORCID})")
print(f"\nResults:")
print(f"  Total records: {total_records}")
print(f"  Screened: {screened}")
print(f"  Included: {included}")
print(f"  Excluded: {excluded}")
print(f"\nOutput files in '{OUTPUT_DIR}/' :")
print(f"  • study_inclusion.json     - For nanopub generation")
print(f"  • prisma_flow_data.json    - PRISMA diagram numbers")
print(f"  • included_studies.csv     - Included studies")
print(f"  • excluded_studies.csv     - Excluded studies")
print(f"  • included_studies.ris     - For Zotero/reference managers")
print(f"  • excluded_studies.ris     - For Zotero/reference managers")
print(f"\nProvenance chain:")
print(f"  PICO → Search Strategy → Search Execution → Study Inclusion")
print("="*60)
print("\nNext step: Run study-inclusion-nanopub-from-json.ipynb")
print("           with study_inclusion.json to generate nanopubs")

EXPORT COMPLETE

Review: Machine Learning Algorithms for Wildfire Detection and Burned Area Mapping using Sentinel-2
Screener: Anne Fouilloux (0000-0002-1784-2920)

Results:
  Total records: 450
  Screened: 260
  Included: 150
  Excluded: 110

Output files in 'wildfire-sentinel2-ml/screening_results/' :
  • study_inclusion.json     - For nanopub generation
  • prisma_flow_data.json    - PRISMA diagram numbers
  • included_studies.csv     - Included studies
  • excluded_studies.csv     - Excluded studies
  • included_studies.ris     - For Zotero/reference managers
  • excluded_studies.ris     - For Zotero/reference managers

Provenance chain:
  PICO → Search Strategy → Search Execution → Study Inclusion

Next step: Run study-inclusion-nanopub-from-json.ipynb
           with study_inclusion.json to generate nanopubs


## 8. Preview Study Inclusion JSON

In [16]:
# Show sample of the generated JSON
print("Sample of study_inclusion.json:\n")
preview = {
    "review_metadata": study_inclusion_config["review_metadata"],
    "provenance": study_inclusion_config["provenance"],
    "studies": study_inclusion_config["studies"][:3]  # First 3 studies
}
print(json.dumps(preview, indent=2, default=str))
print(f"\n... and {len(studies) - 3} more studies")

Sample of study_inclusion.json:

{
  "review_metadata": {
    "title": "Machine Learning Algorithms for Wildfire Detection and Burned Area Mapping using Sentinel-2",
    "description": "Systematic review on Machine Learning Algorithms for Wildfire Detection and Burned Area Mapping using Sentinel-2",
    "screener_orcid": "0000-0002-1784-2920",
    "screener_name": "Anne Fouilloux",
    "screening_date": "2026-01-08",
    "screening_tool": "ASReview LAB v2.2",
    "total_screened": 260,
    "total_included": 150,
    "total_excluded": 110
  },
  "provenance": {
    "pico_nanopub": "https://w3id.org/np/RAjO8tdVOla9I77PeXF4iY92ULngrpx5_ZSKFkVrCmsW0",
    "search_strategy_nanopub": " ",
    "search_execution_nanopub": " "
  },
  "studies": [
    {
      "uri": "urn:study:3142543",
      "label": "FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle",
      "metadata": {
        "authors": "[\"Mario Markov\", \"Stefan Maria Ailuro\", \"Luc Van Gool\", \"Konrad Schindler\", \"D