
Conversation

@pierrepo
Member

No description provided.

Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors the Figshare scraper (and related scraping pipeline) to use the new ScraperContext and Pydantic-based metadata normalization/export, while also aligning Zenodo and NOMAD scrapers with the same metadata workflow and adding a debug mode flag.

Changes:

  • Refactor Figshare/Zenodo/NOMAD scrapers to produce raw metadata dicts, normalize them into Pydantic models, and export to parquet via shared utilities.
  • Replace the legacy ContextManager/dataframe flow with ScraperContext, list-based processing, and new helper utilities (normalization, deduplication, exclusion filtering, stats printing).
  • Update Figshare docs and simplify the Figshare scraper tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 12 comments.

Summary per file:

  • src/mdverse_scrapers/scrapers/figshare.py: Refactors the Figshare scraping pipeline to ScraperContext + Pydantic normalization/export; adds debug mode and zip-content scraping.
  • src/mdverse_scrapers/scrapers/zenodo.py: Refactors the Zenodo scraping pipeline to list-based metadata + Pydantic normalization/export; adds debug mode.
  • src/mdverse_scrapers/scrapers/nomad.py: Updates the NOMAD scraper to use shared normalization/export utilities and adds a debug-mode early stop.
  • src/mdverse_scrapers/models/utils.py: Introduces shared normalization helpers and parquet export for lists of Pydantic models (a minimal sketch of this export pattern is shown just after this list).
  • src/mdverse_scrapers/core/toolbox.py: Removes legacy context/dataclass usage; adds list deduplication; refactors exclusion and false-positive filtering to operate on Pydantic models.
  • src/mdverse_scrapers/models/scraper.py: Adds is_in_debug_mode to the scraper context.
  • src/mdverse_scrapers/models/file.py: Reorders fields (no functional changes intended) while keeping required file metadata.
  • src/mdverse_scrapers/models/dataset.py: Reorders dataset statistic fields (no functional changes intended).
  • tests/scrapers/test_figshare.py: Simplifies Figshare tests after context removal.
  • docs/figshare.md: Fixes wording and improves formatting in the Figshare documentation.
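
To make the shared export step concrete, here is a minimal sketch of how a list of Pydantic models can be flattened and written to parquet. This is only an illustration: the model fields, the helper name export_models_to_parquet, and the pandas-based export are assumptions rather than the exact API added in models/utils.py, and Pydantic v2 is assumed.

# Hedged sketch only: field names, helper name, and the pandas-based export
# are assumptions, not the exact API introduced in models/utils.py.
from pathlib import Path

import pandas as pd  # writing parquet also requires pyarrow or fastparquet
from pydantic import BaseModel


class FileMetadata(BaseModel):
    dataset_id_in_repository: str
    file_name: str
    file_type: str


def export_models_to_parquet(models: list[BaseModel], path: Path) -> None:
    """Dump a list of Pydantic models to a parquet file via a dataframe."""
    records = [model.model_dump() for model in models]
    pd.DataFrame.from_records(records).to_parquet(path, index=False)


# Example usage with a single made-up record:
export_models_to_parquet(
    [FileMetadata(dataset_id_in_repository="12345", file_name="traj.xtc", file_type="xtc")],
    Path("figshare_files.parquet"),
)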


Comment on lines 153 to 176
def scrap_zip_files_content(
    all_files_metadata, logger: "loguru.Logger" = loguru.logger
) -> list[dict]:
    """Scrap information from files contained in zip archives.

    Uncertain how many files can be fetched from the preview.
    Only get file name and file type.
    File size and MD5 checksum are not available.

    Arguments
    ---------
    all_files_metadata: list[dict]
        List of dictionaries with files metadata.
    logger: "loguru.Logger"
        Logger for logging messages.

    Returns
    -------
    list[dict]
        List of dictionaries with files metadata found in zip archive.
    """
    files_in_zip_lst = []
    zip_files_counter = 0
    # Select zip files only.
    zip_files = [f_meta for f_meta in all_files_metadata if f_meta.file_type == "zip"]
    logger.info(f"Number of zip files to scrap content from: {len(zip_files):,}")
Copilot AI Jan 26, 2026

scrap_zip_files_content iterates all_files_metadata as FileMetadata objects (uses .file_type, .file_url_in_repository, etc.), but the signature/docstring say list[dict]. Update the type hint and docstring to list[FileMetadata] to match actual usage and help callers pass the correct type.
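
A minimal sketch of the suggested signature fix, assuming FileMetadata is the Pydantic model used elsewhere in this PR (the body is abridged and only illustrative):

import loguru


def scrap_zip_files_content(
    all_files_metadata: list["FileMetadata"],
    logger: "loguru.Logger" = loguru.logger,
) -> list[dict]:
    """Scrap information from files contained in zip archives.

    Arguments
    ---------
    all_files_metadata: list[FileMetadata]
        Normalized file metadata objects (not raw dicts).
    logger: "loguru.Logger"
        Logger for logging messages.
    """
    ...  # body unchanged from the snippet quoted above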

file_lst.append(
    {
        "file_name": path,
        "file_size": size,
Copilot AI Jan 26, 2026

extract_data_from_zip_file builds entries with the key file_size, but FileMetadata uses file_size_in_bytes. As a result, zip-extracted file sizes will be ignored during normalization. Use file_size_in_bytes here (ByteSize can parse strings like "4.6 kB" after the model's validator).

Suggested change
"file_size": size,
"file_size_in_bytes": size,

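To illustrate why the key matters, here is a small, hedged sketch. Pydantic v2 and the exact field set are assumptions based on the comment above; the real FileMetadata model may differ. ByteSize parses a human-readable size string such as "4.6 kB" into an integer number of bytes, but only if the value arrives under the field name the model actually declares.

from pydantic import BaseModel, ByteSize


class FileMetadata(BaseModel):
    file_name: str
    file_type: str
    file_size_in_bytes: ByteSize | None = None


# ByteSize accepts human-readable strings, so a size scraped from the zip
# preview such as "4.6 kB" is stored as 4600 bytes.
meta = FileMetadata(file_name="topol.top", file_type="top", file_size_in_bytes="4.6 kB")
print(int(meta.file_size_in_bytes))  # 4600

# A dict built with the wrong key ("file_size") would leave this field at None,
# which is exactly the data loss the comment above points out.
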
try:
    _ = response_json["hits"]["hits"]
except KeyError:
    logger.warning("Cannot extract hits the response JSON.")
Copilot AI Jan 26, 2026

Spelling/grammar: log message reads "Cannot extract hits the response JSON." It should say "Cannot extract hits from the response JSON."

Suggested change
logger.warning("Cannot extract hits the response JSON.")
logger.warning("Cannot extract hits from the response JSON.")

Comment on lines 116 to 121
for file_meta in files_list:
    dataset_id = file_meta["dataset_id_in_repository"]
    # Print info only when changing dataset.
    if dataset_id != previous_dataset_id:
        logger.info(f"Normalizing metadata for files in dataset: {dataset_id}")
    normalized_metadata = validate_metadata_against_model(
Copilot AI Jan 26, 2026

normalize_files_metadata directly indexes file_meta["dataset_id_in_repository"] and file_meta["file_name"] before validation. A single malformed entry will raise KeyError and stop the whole normalization. Use .get(...) (or wrap per-item handling in try/except) and skip/log malformed entries so normalization is resilient.
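
A hedged sketch of the defensive pattern described above. The names validate_metadata_against_model and FileMetadata are taken from the quoted snippet and the PR description, but the import paths and exact signatures are assumptions:

import loguru

# Assumed import locations; the real project layout may differ.
from mdverse_scrapers.models.file import FileMetadata
from mdverse_scrapers.models.utils import validate_metadata_against_model


def normalize_files_metadata(
    files_list: list[dict], logger: "loguru.Logger" = loguru.logger
) -> list[FileMetadata]:
    """Normalize raw file metadata dicts, skipping malformed entries."""
    normalized = []
    previous_dataset_id = None
    for file_meta in files_list:
        dataset_id = file_meta.get("dataset_id_in_repository")
        file_name = file_meta.get("file_name")
        if dataset_id is None or file_name is None:
            # Skip malformed entries instead of letting a KeyError abort the run.
            logger.warning(f"Skipping malformed file metadata entry: {file_meta!r}")
            continue
        # Print info only when changing dataset.
        if dataset_id != previous_dataset_id:
            logger.info(f"Normalizing metadata for files in dataset: {dataset_id}")
            previous_dataset_id = dataset_id
        normalized_metadata = validate_metadata_against_model(
            file_meta, FileMetadata, logger=logger
        )
        if not normalized_metadata:
            logger.error(f"Metadata normalization failed for file: {file_name}")
            continue
        normalized.append(normalized_metadata)
    return normalized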

            f"Metadata normalization failed for file: {file_meta['file_name']}"
        )
        logger.info(
            f"In dataset: {dataset_id}from {file_meta['dataset_repository_name']}"
Copilot AI Jan 26, 2026

Spelling/formatting in log message: f"In dataset: {dataset_id}from ..." is missing a space after the dataset id, which makes logs hard to read.

Suggested change
f"In dataset: {dataset_id}from {file_meta['dataset_repository_name']}"
f"In dataset: {dataset_id} from {file_meta['dataset_repository_name']}"

files_normalized_metadata += zip_normalized_metadata
logger.info(f"Total number of files found: {len(files_normalized_metadata)}")
files_normalized_metadata = remove_excluded_files(
    files_normalized_metadata, excluded_files, excluded_paths
Copilot AI Jan 26, 2026

remove_excluded_files now accepts a logger argument, but this call doesn't pass it, so exclusion statistics will go to the default loguru.logger instead of the scraper's configured logger/file. Pass logger=logger here for consistent logging.

Suggested change
files_normalized_metadata, excluded_files, excluded_paths
files_normalized_metadata, excluded_files, excluded_paths, logger=logger

"date_created": record_json["created_date"],
"date_last_updated": record_json["modified_date"],
"title": clean_text(record_json["title"]),
"author": clean_text(record_json["authors"][0]["full_name"]),
Copilot AI Jan 26, 2026

In extract_metadata_from_single_dataset_record, the dataset author is stored under the key author, but the DatasetMetadata model expects author_names (list of strings). With the current key, the author information will be dropped during normalization (and could fail if extra fields are forbidden). Populate author_names (e.g., a single-element list) instead of author.

Suggested change
"author": clean_text(record_json["authors"][0]["full_name"]),
"author_names": [clean_text(record_json["authors"][0]["full_name"])],

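For illustration, a minimal sketch of the mismatch. It only assumes an author_names: list[str] field, as the comment states; the real DatasetMetadata in models/dataset.py may declare more fields and different defaults:

from pydantic import BaseModel


class DatasetMetadata(BaseModel):
    title: str
    author_names: list[str] = []


# Using the model's field name preserves the author during normalization.
ok = DatasetMetadata(title="Example dataset", author_names=["Jane Doe"])
print(ok.author_names)  # ['Jane Doe']

# A raw dict keyed "author" is silently ignored with default settings
# (or rejected if extra fields are forbidden), so the author is lost.
lost = DatasetMetadata(**{"title": "Example dataset", "author": "Jane Doe"})
print(lost.author_names)  # []
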
Comment on lines 422 to 423
datasets_lst.append(dataset_info)
files_lst += files_info
Copilot AI Jan 26, 2026

extract_metadata_from_single_dataset_record returns an empty dict for embargoed records, but get_metadata_for_datasets_and_files appends dataset_info unconditionally. This will later raise KeyError in normalize_datasets_metadata when it indexes dataset['dataset_id_in_repository']. Skip appending when dataset_info is empty (and similarly avoid merging empty files_info).

Suggested change
datasets_lst.append(dataset_info)
files_lst += files_info
# Skip embargoed or otherwise invalid records that return empty metadata.
if dataset_info:
    datasets_lst.append(dataset_info)
if files_info:
    files_lst += files_info

Comment on lines 73 to 83
    logger.info(
        f"Normalizing metadata for dataset: {dataset['dataset_id_in_repository']}"
    )
    normalized_metadata = validate_metadata_against_model(
        dataset, DatasetMetadata, logger=logger
    )
    if not normalized_metadata:
        logger.error(
            f"Metadata normalization failed for dataset "
            f"{dataset['dataset_id_in_repository']} "
            f"from {dataset['dataset_repository_name']}"
Copilot AI Jan 26, 2026

normalize_datasets_metadata indexes dataset['dataset_id_in_repository'] (and other keys) before validation. If any upstream scraper returns an empty/partial dict (e.g., Figshare embargoed records currently return {}), this will raise KeyError and abort normalization. Use .get(...) and skip/log malformed entries rather than crashing.

Suggested change
    logger.info(
        f"Normalizing metadata for dataset: {dataset['dataset_id_in_repository']}"
    )
    normalized_metadata = validate_metadata_against_model(
        dataset, DatasetMetadata, logger=logger
    )
    if not normalized_metadata:
        logger.error(
            f"Metadata normalization failed for dataset "
            f"{dataset['dataset_id_in_repository']} "
            f"from {dataset['dataset_repository_name']}"
    if not isinstance(dataset, dict):
        logger.warning(
            "Skipping dataset metadata entry because it is not a dict: "
            f"{dataset!r}"
        )
        continue
    dataset_id = dataset.get("dataset_id_in_repository")
    repo_name = dataset.get("dataset_repository_name")
    dataset_id_for_log = (
        dataset_id if dataset_id is not None else "<missing dataset_id_in_repository>"
    )
    repo_name_for_log = (
        repo_name if repo_name is not None else "<missing dataset_repository_name>"
    )
    logger.info(
        f"Normalizing metadata for dataset: {dataset_id_for_log}"
    )
    normalized_metadata = validate_metadata_against_model(
        dataset, DatasetMetadata, logger=logger
    )
    if not normalized_metadata:
        logger.error(
            "Metadata normalization failed for dataset "
            f"{dataset_id_for_log} from {repo_name_for_log}"

Comment on lines 193 to 210
def remove_duplicates_in_list_of_dicts(input_list: list[dict]) -> list[dict]:
    """Remove duplicates in a list while preserving the original order.

    Parameters
    ----------
    input_list : list
        List with possible duplicate entries.

    Returns
    -------
    list
        List without duplicates.
    """
    output_list = []
    for dict_item in input_list:
        if dict_item not in output_list:
            output_list.append(dict_item)
    return output_list
Copilot AI Jan 26, 2026

remove_duplicates_in_list_of_dicts is O(n^2) because it does a linear dict_item not in output_list check for every element. This can become a major bottleneck for large Zenodo result sets. Consider tracking a set of hashable keys (e.g., tuple(sorted(d.items())) or a domain-specific key like (dataset_id_in_repository, file_name)), while still preserving insertion order.
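
A hedged sketch of the order-preserving, set-based alternative the comment describes. It assumes all dict values are hashable; if they are not, a domain-specific key such as (dataset_id_in_repository, file_name) can be used instead:

def remove_duplicates_in_list_of_dicts(input_list: list[dict]) -> list[dict]:
    """Remove duplicate dicts while preserving the original order, in O(n)."""
    seen: set[tuple] = set()
    output_list = []
    for dict_item in input_list:
        # Hashable fingerprint of the dict (keys are unique, so sorting by key
        # never needs to compare values).
        key = tuple(sorted(dict_item.items()))
        if key not in seen:
            seen.add(key)
            output_list.append(dict_item)
    return output_list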

@pierrepo merged commit 5116fe7 into main on Jan 26, 2026 (1 check passed).
@pierrepo deleted the update-figshare-scraper branch on January 26, 2026 at 08:26.