Hea 572/label recognition workbook #180

rhunwicks · 2025-10-15T03:15:58Z

No description provided.

…_completeness_constraint_2

Copilot

Pull Request Overview

This PR adds support for generating a label recognition workbook that documents how BSS labels are recognized (either via regex patterns or database entries), updates Python version from 3.10 to 3.12, and refactors instance validation to use iterative processing instead of DataFrame operations. The changes also include improved error handling with a new strict configuration option and several code simplifications.

Adds new configuration for BSS label recognition workbook output
Refactors validation logic to iterate over instances rather than using DataFrames
Introduces conditional error handling based on config.strict setting
Updates target Python version to 3.12
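
The conditional error handling tied to config.strict can be pictured with a minimal sketch. This is not the PR's code: `SketchConfig`, `validate_instances_sketch`, and the `required_fields` rule are hypothetical names and checks chosen for illustration. Only the idea of iterating over instances and branching on a `strict` flag comes from the overview above.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class SketchConfig:
    # Hypothetical stand-in for the pipeline config; only `strict` is modelled.
    strict: bool = True


def validate_instances_sketch(instances, config, required_fields=("id",)):
    """Check each instance in turn and collect human-readable error messages."""
    errors = []
    for index, instance in enumerate(instances):
        # Treat None and "" as missing; this rule is an assumption of the sketch.
        missing = [f for f in required_fields if instance.get(f) in (None, "")]
        if missing:
            errors.append(f"instance {index}: missing {', '.join(missing)}")

    if errors and config.strict:
        # Strict mode: fail the run with every problem listed.
        raise ValueError("\n".join(errors))

    # Lenient mode: report the problems but let processing continue.
    for message in errors:
        logger.warning(message)
    return errors
```

With `strict=False` the caller receives the list of messages and can decide what to do with them; with `strict=True` the same input raises `ValueError`.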

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

File	Description
requirements/base.txt	Updates gdrivefs dependency commit hash
pyproject.toml	Adds project metadata and updates Python version from 3.10 to 3.12
pipelines_tests/test_assets/test_livelihood_activity_regexes.json	Adds test cases for new wild food product recognition patterns
pipelines/configs.py	Adds configuration fields for BSS label recognition workbook
pipelines/assets/wild_foods.py	Simplifies return statements and adds config parameter to function calls
pipelines/assets/wealth_characteristic.py	Updates validate_instances call to include config parameter
pipelines/assets/other_cash_income.py	Simplifies return statements and adds config parameter to function calls
pipelines/assets/livelihood_activity_regexes.json	Adds regex patterns for wild foods and fish product recognition
pipelines/assets/livelihood_activity.py	Adds label recognition workbook generation, refactors label matching logic, and improves error handling
pipelines/assets/fixtures.py	Refactors validation from DataFrame operations to iterative instance processing and standardizes return values
pipelines/__init__.py	Adds new livelihood_activity_label_recognition_dataframe asset to exports
env.example	Documents new environment variables for label recognition workbook

pipelines/assets/livelihood_activity.py

Copilot · 2025-10-15T03:17:01Z

pipelines/assets/livelihood_activity.py

        df.iloc[num_header_rows:][
-            (df["A"].iloc[num_header_rows:] != "") & (all_label_attributes.iloc[num_header_rows:, 0].isna())
+            (df["A"].iloc[num_header_rows:] != "")
+            & (all_label_attributes.iloc[num_header_rows:]["activity_label"] == "")


The comparison all_label_attributes.iloc[num_header_rows:]["activity_label"] == "" may not correctly identify null/NA values. If activity_label contains None or pd.NA values instead of empty strings, this condition will miss them. Consider using isna() or checking for both empty strings and null values.

Suggested change

-            & (all_label_attributes.iloc[num_header_rows:]["activity_label"] == "")
+            & (
+                all_label_attributes.iloc[num_header_rows:]["activity_label"].isna()
+                | (all_label_attributes.iloc[num_header_rows:]["activity_label"] == "")
+            )

We know that a missing attribute will be "" because of the fillna("") in get_all_label_attributes.
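
The reply above can be checked with a small pandas sketch. The frame below is a hypothetical stand-in for whatever get_all_label_attributes returns (its real columns and contents are not shown in this thread); it only illustrates that once `fillna("")` has run, a comparison with `""` already covers the values that `isna()` would have caught.

```python
import pandas as pd

# Hypothetical data: one real label, two kinds of missing value, one empty string.
raw = pd.DataFrame({"activity_label": ["maize", None, "", float("nan")]})

# Mirror the fillna("") step mentioned in the reply.
filled = raw.fillna("")

# Before filling, == "" misses the None/NaN rows, which is Copilot's concern.
before = (raw["activity_label"] == "").tolist()

# After filling, every missing value is an empty string, so == "" finds them all.
after = (filled["activity_label"] == "").tolist()

# Nothing is left for isna() to detect.
any_na_left = bool(filled["activity_label"].isna().any())
```

Under that assumption the extra `isna()` branch in the suggestion is redundant rather than wrong: it would add a second pass over the column without changing which rows are selected.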

pipelines/assets/fixtures.py

Copilot · 2025-10-15T03:17:02Z

pipelines/assets/fixtures.py

+                record_reference += f"{instance['bss_column']}:{instance['bss_column']}: "
+            elif "bss_row" in instance:
+                record_reference += f"{instance['bss_row']}:{instance['bss_row']}: "
+            record_reference += f"{str({k: v for k,v in instance.items()})}"


[nitpick] The dict comprehension creates a full copy of the instance which could be large. Consider limiting this to relevant fields or using a more concise representation for error messages to improve readability and performance.

Suggested change

-            record_reference += f"{str({k: v for k,v in instance.items()})}"
+            # Only include relevant fields for error messages to improve readability and performance
+            relevant_fields = ["natural_key", "id", "pk"]
+            concise_instance = {k: instance[k] for k in relevant_fields if k in instance}
+            record_reference += f"{concise_instance}"
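
To see what the suggested change would produce, here is a self-contained sketch. The function name `build_record_reference` is hypothetical, and the opening `if "bss_column" in instance` test is an assumption, since the diff excerpt starts partway through that branch. The field list is the one from Copilot's suggestion, not something confirmed to exist on every instance.

```python
def build_record_reference(instance, relevant_fields=("natural_key", "id", "pk")):
    """Build a short location-plus-keys string for use in an error message."""
    reference = ""
    # Assumed condition: the excerpt only shows the body of this branch.
    if "bss_column" in instance:
        reference += f"{instance['bss_column']}:{instance['bss_column']}: "
    elif "bss_row" in instance:
        reference += f"{instance['bss_row']}:{instance['bss_row']}: "

    # Keep only the identifying fields instead of dumping the whole instance.
    concise_instance = {k: instance[k] for k in relevant_fields if k in instance}
    reference += f"{concise_instance}"
    return reference


example = build_record_reference({"bss_row": 12, "id": 7, "notes": "long free text"})
# example == "12:12: {'id': 7}"
```

The trade-off is that an instance with none of the listed keys yields an empty `{}`, which identifies nothing, so the field list would need to match the keys the instances actually carry.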

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

The merge-base changed after approval.

rhunwicks added 6 commits October 13, 2025 14:14

Relax data completeness constraints - see HEA-572

2bd0d32

Merge branch 'main' into HEA-572/relax_data_completeness_constraint_2

ab035c7

Merge branch 'HEA-572/regexes_for_wild_foods' into HEA-572/relax_data…

896e2b6

…_completeness_constraint_2

Set Python version requirements in pyproject.toml - see HEA-572

ce5dba6

Minor fixes to livelihood_activity_instances - see HEA-572

6e7f151

Create an external spreadsheet of BSS Labels - see HEA-572

9d6a71a

rhunwicks requested review from Copilot and girum-air October 15, 2025 03:15

rhunwicks self-assigned this Oct 15, 2025

Copilot AI reviewed Oct 15, 2025

rhunwicks and others added 3 commits October 14, 2025 23:20

Fix isort of label_recognition_dataframe - see HEA-572

7b15e91

Remove temporary dataframe row limit - see HEA-572

e08014d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Fix docstring typo - see HEA-572

b9d4380

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

girum-air previously approved these changes Oct 15, 2025

rhunwicks merged commit b8046de into main Oct 15, 2025
13 checks passed

3 participants
