-
Notifications
You must be signed in to change notification settings - Fork 1
Hea 572/label recognition workbook #180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…_completeness_constraint_2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for generating a label recognition workbook that documents how BSS labels are recognized (either via regex patterns or database entries), updates Python version from 3.10 to 3.12, and refactors instance validation to use iterative processing instead of DataFrame operations. The changes also include improved error handling with a new strict configuration option and several code simplifications.
- Adds new configuration for BSS label recognition workbook output
- Refactors validation logic to iterate over instances rather than using DataFrames
- Introduces conditional error handling based on
config.strictsetting - Updates target Python version to 3.12
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| requirements/base.txt | Updates gdrivefs dependency commit hash |
| pyproject.toml | Adds project metadata and updates Python version from 3.10 to 3.12 |
| pipelines_tests/test_assets/test_livelihood_activity_regexes.json | Adds test cases for new wild food product recognition patterns |
| pipelines/configs.py | Adds configuration fields for BSS label recognition workbook |
| pipelines/assets/wild_foods.py | Simplifies return statements and adds config parameter to function calls |
| pipelines/assets/wealth_characteristic.py | Updates validate_instances call to include config parameter |
| pipelines/assets/other_cash_income.py | Simplifies return statements and adds config parameter to function calls |
| pipelines/assets/livelihood_activity_regexes.json | Adds regex patterns for wild foods and fish product recognition |
| pipelines/assets/livelihood_activity.py | Adds label recognition workbook generation, refactors label matching logic, and improves error handling |
| pipelines/assets/fixtures.py | Refactors validation from DataFrame operations to iterative instance processing and standardizes return values |
| pipelines/init.py | Adds new livelihood_activity_label_recognition_dataframe asset to exports |
| env.example | Documents new environment variables for label recognition workbook |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| df.iloc[num_header_rows:][ | ||
| (df["A"].iloc[num_header_rows:] != "") & (all_label_attributes.iloc[num_header_rows:, 0].isna()) | ||
| (df["A"].iloc[num_header_rows:] != "") | ||
| & (all_label_attributes.iloc[num_header_rows:]["activity_label"] == "") |
Copilot
AI
Oct 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comparison all_label_attributes.iloc[num_header_rows:]["activity_label"] == "" may not correctly identify null/NA values. If activity_label contains None or pd.NA values instead of empty strings, this condition will miss them. Consider using isna() or checking for both empty strings and null values.
| & (all_label_attributes.iloc[num_header_rows:]["activity_label"] == "") | |
| & ( | |
| all_label_attributes.iloc[num_header_rows:]["activity_label"].isna() | |
| | (all_label_attributes.iloc[num_header_rows:]["activity_label"] == "") | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We know that a missing attribute will be " because of the fillna("") in get_all_label_attributes
| record_reference += f"{instance['bss_column']}:{instance['bss_column']}: " | ||
| elif "bss_row" in instance: | ||
| record_reference += f"{instance['bss_row']}:{instance['bss_row']}: " | ||
| record_reference += f"{str({k: v for k,v in instance.items()})}" |
Copilot
AI
Oct 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The dict comprehension creates a full copy of the instance which could be large. Consider limiting this to relevant fields or using a more concise representation for error messages to improve readability and performance.
| record_reference += f"{str({k: v for k,v in instance.items()})}" | |
| # Only include relevant fields for error messages to improve readability and performance | |
| relevant_fields = ["natural_key", "id", "pk"] | |
| concise_instance = {k: instance[k] for k in relevant_fields if k in instance} | |
| record_reference += f"{concise_instance}" |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The merge-base changed after approval.
No description provided.