Skip to content

fix: improve resume location parsing and editing#180

Merged
michaelmwu merged 3 commits intomainfrom
michaelmwu/location-parse-fix
Mar 8, 2026
Merged

fix: improve resume location parsing and editing#180
michaelmwu merged 3 commits intomainfrom
michaelmwu/location-parse-fix

Conversation

@michaelmwu
Copy link
Member

@michaelmwu michaelmwu commented Mar 8, 2026

Description

Tighten CRM/resume location parsing so country values must be valid countries, city/state fields reject obvious garbage, and two-part locations like Nanzih, Kaohsiung City are repaired as city plus region instead of a bad country.
Add a resume-review Edit Location modal for City, State / Region, Country, and Timezone, and render grouped human-readable location summaries in both the preview and applied-updates embeds.
Expand unit coverage for normalization, resume extraction, resume confirmation UI, and non-US location parsing.

Related Issue

N/A

How Has This Been Tested?

uv run pytest tests/unit/test_crm_normalization.py tests/unit/test_resume_extractor.py tests/unit/test_crm.py -q

Summary by CodeRabbit

  • New Features

    • Added a location editor (city, state, country, timezone) and a button to open it.
    • Resume/profile previews now show Location as a single, consolidated line when related fields change.
    • Improved normalization and parsing for ambiguous two-part locations.
  • Bug Fixes

    • Better validation distinguishing cities, states, and countries; more robust timezone handling.
  • Tests

    • Added tests covering location editing, grouping, parsing, and preview behavior.

Copilot AI review requested due to automatic review settings March 8, 2026 21:44
@coderabbitai
Copy link

coderabbitai bot commented Mar 8, 2026

Warning

Rate limit exceeded

@michaelmwu has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 13 minutes and 32 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5786c91-e0e5-4c3f-97cd-8df86ee99500

📥 Commits

Reviewing files that changed from the base of the PR and between f4a312b and 3cd379e.

📒 Files selected for processing (4)
  • apps/discord_bot/src/five08/discord_bot/cogs/crm.py
  • packages/shared/src/five08/crm_normalization.py
  • tests/unit/test_crm.py
  • tests/unit/test_crm_normalization.py
📝 Walkthrough

Walkthrough

Adds end-to-end location editing: UI modal/button, normalization utilities, resume-extraction heuristics for city/state/country/timezone, and tests; integrates location rendering/collapsing into the CRM update confirmation view and preview generation.

Changes

Cohort / File(s) Summary
CRM UI & View
apps/discord_bot/src/five08/discord_bot/cogs/crm.py
Added ResumeEditLocationModal and ResumeEditLocationButton; extended ResumeUpdateConfirmationView with _LOCATION_FIELDS, _is_location_field, _has_location_updates, _location_component, _format_location_summary; updated applied/preview rendering and collapsing to show Location as a single logical field.
Normalization library
packages/shared/src/five08/crm_normalization.py
Introduced extensive location normalization: US state maps, country canonical/alias maps, and helpers (_normalize_location_text, _location_lookup_key, _is_plausible_location_phrase, _title_case_location); refactored normalize_city, normalize_state, and normalize_country to use these helpers.
Resume extractor
packages/shared/src/five08/resume_extractor.py
Updated _parse_location_candidate to treat two-part locations as city+state when second token isn't a country; added prompt constraints to prefer real country names, treat ambiguous two-part tokens as city/state, and exclude tech tokens from location fields.
Unit tests — CRM
tests/unit/test_crm.py
Added tests for location UI (modal pre-population, submission/normalization), location-button visibility, and grouping/collapsing of location subfields in the confirmation view.
Unit tests — Normalization
tests/unit/test_crm_normalization.py
Exposed/added normalize_state to public API and added tests covering state normalization, country/city disambiguation, and invalid-token rejection.
Unit tests — Resume extractor
tests/unit/test_resume_extractor.py
Added/updated tests for two-part city/region parsing, repair of invalid country data, and updated expectations for normalized state values.

Sequence Diagram

sequenceDiagram
    autonumber
    actor User
    participant DiscordUI as "Discord UI\n(Button)"
    participant Modal as "Location Modal\n(Form)"
    participant Normalizer as "Normalization\n(crm_normalization)"
    participant CRM as "CRM View\n(ResumeUpdateConfirmationView)"
    participant DB as "Database"

    User->>DiscordUI: Click Edit Location
    DiscordUI->>Modal: Open (pre-populate values)
    User->>Modal: Submit city/state/country/timezone
    Modal->>Normalizer: normalize_city(state/city)
    Normalizer-->>Modal: normalized_city
    Modal->>Normalizer: normalize_state(input)
    Normalizer-->>Modal: normalized_state
    Modal->>Normalizer: normalize_country(input)
    Normalizer-->>Modal: normalized_country
    Modal->>Normalizer: normalize_timezone(input)
    Normalizer-->>Modal: normalized_timezone
    Modal->>CRM: Return proposed_updates (location fields)
    CRM->>CRM: _has_location_updates()
    CRM->>CRM: _format_location_summary()
    CRM->>DB: Persist proposed_updates
    CRM-->>User: Display Proposed Changes with consolidated "Location" line
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 I hopped to a modal, tidy and bright,
City, State, Country — all snug in a line,
Normalized hops, timestamps kept light,
Preview sings "Location" — compact and fine,
Rabbits cheer for tidy data tonight! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the primary changes: improving resume location parsing and adding location editing UI.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch michaelmwu/location-parse-fix

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens resume/CRM location normalization (rejecting invalid countries and obvious garbage in city/state) and adds Discord UI affordances to review/edit location fields before applying updates to the CRM.

Changes:

  • Strengthen location normalization: canonical-country validation, US state abbreviation expansion, and plausibility checks for city/state text.
  • Improve resume location extraction to better repair two-part non-country locations (city + region).
  • Add a Discord “Edit Location” modal and group location fields into a single human-readable “Location” line in preview/applied-updates embeds, with expanded unit coverage.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/shared/src/five08/crm_normalization.py Adds canonical country validation + aliases, US state abbreviation expansion, and plausibility checks for city/state.
packages/shared/src/five08/resume_extractor.py Updates deterministic parsing to treat 2-part non-country locations as city+region and strengthens LLM prompt constraints.
apps/discord_bot/src/five08/discord_bot/cogs/crm.py Introduces Edit Location modal/button; groups location fields into a single “Location” summary in preview + applied updates.
tests/unit/test_crm_normalization.py Adds/updates normalization tests for country/state/city plausibility and US state expansion.
tests/unit/test_resume_extractor.py Adds extraction test ensuring invalid LLM location fields are overridden by deterministic parsing; updates state expectations.
tests/unit/test_crm.py Adds tests for grouped location rendering and the new Edit Location modal/button behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.



def _normalize_location_text(value: str) -> str:
return re.sub(r"\s+", " ", value).strip(" ,")
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_normalize_location_text currently strips spaces/commas but not trailing periods. This makes normalize_country reject common resume punctuation like Taiwan. / Germany. (and then higher-level location parsing may fail for City, Region, Taiwan.). Consider stripping terminal punctuation (at least .) during normalization so canonical country matching still works.

Suggested change
return re.sub(r"\s+", " ", value).strip(" ,")
# Collapse internal whitespace and strip common trailing punctuation/spaces.
return re.sub(r"\s+", " ", value).strip(" ,.")

Copilot uses AI. Check for mistakes.
Comment on lines +440 to +441
letters_only = re.sub(r"[^A-Za-z]", "", normalized)
if len(letters_only) <= 2:
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize_state uses letters_only = re.sub(r"[^A-Za-z]", "", normalized) to reject short tokens. This unintentionally rejects non-ASCII region names that otherwise pass _is_plausible_location_phrase (e.g., Chinese/Japanese/Korean region strings), because letters_only becomes empty and the function returns None. Consider counting Unicode letters instead (e.g., based on str.isalpha()), so the “short token” guard doesn’t exclude valid non-Latin locations.

Suggested change
letters_only = re.sub(r"[^A-Za-z]", "", normalized)
if len(letters_only) <= 2:
letter_count = sum(1 for ch in normalized if ch.isalpha())
if letter_count <= 2:

Copilot uses AI. Check for mistakes.
Comment on lines 990 to 994
if len(parts) == 2:
if _looks_like_state_region(region):
return city, _normalize_state(region), None
country = _normalize_country(region)
if country:
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _parse_location_candidate, the _looks_like_state_region(region) branch runs before trying _normalize_country(region). Since _looks_like_state_region returns true for any 2-letter uppercase token, inputs like London, UK or San Francisco, US will be treated as a “state” and short-circuited, causing country parsing to fail (and may return (city, None, None) with the new stricter normalize_state). Consider attempting _normalize_country(region) first, or tightening _looks_like_state_region to only match known US state abbreviations (instead of [A-Z]{2}).

Copilot uses AI. Check for mistakes.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (3)
tests/unit/test_crm_normalization.py (1)

55-61: Add the same garbage-string rejection check for normalize_state.

The PR tightens both city and state parsing, but this block only proves the long bullet-prefixed garbage value is rejected for normalize_city. If normalize_state starts accepting the same junk, this file will miss it.

Suggested assertion
     assert normalize_state("CA") == "California"
     assert normalize_state("Kaohsiung City") == "Kaohsiung City"
+    assert (
+        normalize_state("○ A Python Django Api Handles Account Creation And Management")
+        is None
+    )
     assert normalize_state("Js") is None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_crm_normalization.py` around lines 55 - 61, The test
currently verifies that normalize_city rejects the long bullet-prefixed garbage
string but does not assert the same for normalize_state; add an analogous
assertion that normalize_state("○ A Python Django Api Handles Account Creation
And Management") is None so the test covers state normalization rejecting the
same junk. Locate the assertions block containing normalize_city and
normalize_state in tests/unit/test_crm_normalization.py and add the new
assertion next to the existing normalize_city check to ensure normalize_state
behavior is validated.
tests/unit/test_resume_extractor.py (1)

759-826: Also assert timezone backfill in this repaired-location path.

This test proves the invalid country is replaced and the region is repaired, but it never checks the Taiwan timezone inference that similar extractor tests already cover. A regression could keep address_city/state/country correct while dropping result.timezone.

Suggested assertion
     assert result.address_city == "Nanzih"
     assert result.address_state == "Kaohsiung City"
     assert result.address_country == "Taiwan"
+    assert result.timezone == "UTC+08:00"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_resume_extractor.py` around lines 759 - 826, The test
test_extract_discards_invalid_country_and_repairs_current_location_region
currently verifies address_city/state/country but omits timezone; update this
test (which uses ResumeProfileExtractor and assigns result =
extractor.extract(...)) to also assert the inferred timezone (expecting
"Asia/Taipei") on result.timezone to ensure timezone backfill is present when
the invalid country is replaced and the region is repaired.
tests/unit/test_crm.py (1)

669-693: Cover non-canonical manual edits in the location modal submit test.

This case only feeds already-normalized city/state/country values, so it mainly proves passthrough plus timezone normalization. A regression that skips normalize_country/normalize_state in the new modal path would still pass.

Consider a companion submit case with mixed-case or abbreviated inputs, plus an invalid country value, and assert the payload is normalized or rejected as intended.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_crm.py` around lines 669 - 693, Add a companion async test
(e.g., test_edit_location_modal_submit_normalizes_or_rejects) that exercises
ResumeEditLocationModal.on_submit using mixed-case/abbreviated inputs (e.g.,
"nAnZiH", "ca", "uSa" or "US") and an invalid country value to cover
non-canonical manual edits; ensure the test sets
modal.city_input/_state_input/_country_input/_timezone_input values, calls await
modal.on_submit(mock_interaction), and then asserts that
ResumeUpdateConfirmationView.proposed_updates contains normalized keys via
normalize_country/normalize_state (or that the modal rejects/handles the invalid
country as the code path dictates) and that
mock_interaction.response.send_message was called (or not) according to the
rejection behavior so the normalization and validation paths are covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py`:
- Around line 1153-1154: The ResumeEditLocationButton is only added when
self._has_location_updates(proposed_updates) is true, making the location editor
unreachable if the parser returns no location; change the logic so the button is
added whenever location data is missing or incomplete in proposed_updates (i.e.,
when not self._has_location_updates(proposed_updates) or when specific location
fields like city/state/country/timezone are absent) so users can open the modal
to add a new location; update the condition around
add_item(ResumeEditLocationButton()) in the same block to check for
missing/incomplete location fields rather than only existing updates.
- Around line 954-964: The modal currently only uses
normalize_timezone(raw_timezone) and thus rejects abbreviation inputs like
"PST"/"EST"; update the timezone parsing to mirror _parse_location_input by
first trying normalize_timezone(raw_timezone) and if that returns falsy, check
_LOCATION_TIMEZONE_ABBREV_MAP for a mapped value (or call the same helper used
by _parse_location_input) so timezone accepts both offset-style and abbreviation
inputs; adjust the timezone variable assignment and the invalid_fields check
(referencing raw_timezone and timezone) accordingly.

In `@packages/shared/src/five08/crm_normalization.py`:
- Around line 90-317: The country lookup rejects legitimate names because
_location_lookup_key() only strips periods and the canonical/alias maps
(_CANONICAL_COUNTRY_NAMES, _COUNTRY_CANONICAL_MAP, _COUNTRY_ALIASES) lack many
variants (e.g., "Democratic Republic of the Congo", accented forms like "Côte
d’Ivoire"). Fix by (1) updating _CANONICAL_COUNTRY_NAMES to include missing
canonical names (e.g., "Democratic Republic of the Congo") and common official
variants, (2) enhancing _location_lookup_key() to perform robust Unicode
normalization (unicodedata.normalize to NFKD + strip combining marks), normalize
apostrophes/quotes (curly to straight), remove punctuation and extra whitespace
and then casefold, and (3) rebuild _COUNTRY_CANONICAL_MAP and _COUNTRY_ALIASES
keys using that normalized lookup key so alias entries (add accented and
punctuation variants) map correctly to the canonical values; update any
reference code that uses _COUNTRY_CANONICAL_MAP/_COUNTRY_ALIASES to call the
improved _location_lookup_key() before lookup.
- Around line 425-444: The fallback is too permissive and lacks non-US aliases;
update normalize_state() and the shared fallback logic by first checking a
curated map of non-US region abbreviations/names (e.g.,
CANADA_PROVINCE_ABBREVIATIONS, OTHER_REGION_NAMES) after
_US_STATE_ABBREVIATIONS/_US_STATE_NAMES and return canonical values if found,
then tighten _is_plausible_location_phrase (reduce max_words to ~3, max_length
to ~40) and add a short stopword list (e.g.,
"engineer","developer","javascript","senior","jr","sr","intern") to reject
job/tech phrases; also require letters_only length >= 3 before accepting and
calling _title_case_location(normalized). Apply the same stricter checks in
normalize_city()-related paths so 1–2 letter tokens (like "JS") are not
title-cased while known region abbreviations are still preserved via the curated
alias maps.

---

Nitpick comments:
In `@tests/unit/test_crm_normalization.py`:
- Around line 55-61: The test currently verifies that normalize_city rejects the
long bullet-prefixed garbage string but does not assert the same for
normalize_state; add an analogous assertion that normalize_state("○ A Python
Django Api Handles Account Creation And Management") is None so the test covers
state normalization rejecting the same junk. Locate the assertions block
containing normalize_city and normalize_state in
tests/unit/test_crm_normalization.py and add the new assertion next to the
existing normalize_city check to ensure normalize_state behavior is validated.

In `@tests/unit/test_crm.py`:
- Around line 669-693: Add a companion async test (e.g.,
test_edit_location_modal_submit_normalizes_or_rejects) that exercises
ResumeEditLocationModal.on_submit using mixed-case/abbreviated inputs (e.g.,
"nAnZiH", "ca", "uSa" or "US") and an invalid country value to cover
non-canonical manual edits; ensure the test sets
modal.city_input/_state_input/_country_input/_timezone_input values, calls await
modal.on_submit(mock_interaction), and then asserts that
ResumeUpdateConfirmationView.proposed_updates contains normalized keys via
normalize_country/normalize_state (or that the modal rejects/handles the invalid
country as the code path dictates) and that
mock_interaction.response.send_message was called (or not) according to the
rejection behavior so the normalization and validation paths are covered.

In `@tests/unit/test_resume_extractor.py`:
- Around line 759-826: The test
test_extract_discards_invalid_country_and_repairs_current_location_region
currently verifies address_city/state/country but omits timezone; update this
test (which uses ResumeProfileExtractor and assigns result =
extractor.extract(...)) to also assert the inferred timezone (expecting
"Asia/Taipei") on result.timezone to ensure timezone backfill is present when
the invalid country is replaced and the region is repaired.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a50a859-4f68-4423-b9a3-c05af581bbe8

📥 Commits

Reviewing files that changed from the base of the PR and between b48c508 and 9b90746.

📒 Files selected for processing (6)
  • apps/discord_bot/src/five08/discord_bot/cogs/crm.py
  • packages/shared/src/five08/crm_normalization.py
  • packages/shared/src/five08/resume_extractor.py
  • tests/unit/test_crm.py
  • tests/unit/test_crm_normalization.py
  • tests/unit/test_resume_extractor.py

Copilot AI review requested due to automatic review settings March 8, 2026 21:58
@michaelmwu michaelmwu merged commit 9fc4f98 into main Mar 8, 2026
6 checks passed
@michaelmwu michaelmwu deleted the michaelmwu/location-parse-fix branch March 8, 2026 21:59
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +946 to +950
raw_city = (self.city_input.value or "").strip()
raw_state = (self.state_input.value or "").strip()
raw_country = (self.country_input.value or "").strip()
raw_timezone = (self.timezone_input.value or "").strip()

Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResumeEditLocationModal.on_submit treats user-entered strings like "None"/"null" as real values (e.g., normalize_city("None") returns "None"), so submitting the modal can write the literal string "None" into addressCity/addressState rather than clearing the field. Consider pre-normalizing inputs with the same logic as _location_component (treat case-insensitive "none"/"null" as empty) before calling normalize_city/normalize_state, or make the normalization helpers reject these sentinel values.

Suggested change
raw_city = (self.city_input.value or "").strip()
raw_state = (self.state_input.value or "").strip()
raw_country = (self.country_input.value or "").strip()
raw_timezone = (self.timezone_input.value or "").strip()
def _clean_location_input(value: str | None) -> str:
"""Normalize raw user location input before CRM normalization.
Treats case-insensitive 'none'/'null' as empty so those values clear
existing fields instead of being stored literally.
"""
cleaned = (value or "").strip()
if cleaned.lower() in {"none", "null"}:
return ""
return cleaned
raw_city = _clean_location_input(self.city_input.value)
raw_state = _clean_location_input(self.state_input.value)
raw_country = _clean_location_input(self.country_input.value)
raw_timezone = _clean_location_input(self.timezone_input.value)

Copilot uses AI. Check for mistakes.
Comment on lines +901 to +920
city_input: discord.ui.TextInput = discord.ui.TextInput(
label="City",
required=False,
max_length=100,
)
state_input: discord.ui.TextInput = discord.ui.TextInput(
label="State / Region",
required=False,
max_length=100,
)
country_input: discord.ui.TextInput = discord.ui.TextInput(
label="Country",
required=False,
max_length=100,
)
timezone_input: discord.ui.TextInput = discord.ui.TextInput(
label="Timezone",
required=False,
max_length=100,
)
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location modal allows up to 100 characters per field, but normalize_city/normalize_state currently reject values over 40 chars (via _is_plausible_location_phrase(..., max_length=40)). This mismatch will cause the UI to accept input that the backend will always mark invalid. Consider aligning max_length on the TextInputs with the normalization constraints (or loosening the normalization max_length if 100 is intentional).

Copilot uses AI. Check for mistakes.
Comment on lines +1168 to +1171
@classmethod
def _has_location_updates(cls, values: dict[str, Any]) -> bool:
return any(values.get(field) for field in cls._LOCATION_FIELDS)

Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResumeUpdateConfirmationView._has_location_updates is introduced but appears unused in the module. If it’s not needed, consider removing it to keep the view API minimal; otherwise, wire it into the call sites that need it (e.g., deciding whether to render grouped location output).

Suggested change
@classmethod
def _has_location_updates(cls, values: dict[str, Any]) -> bool:
return any(values.get(field) for field in cls._LOCATION_FIELDS)

Copilot uses AI. Check for mistakes.
Comment on lines +368 to +410
def _normalize_location_text(value: str) -> str:
return re.sub(r"\s+", " ", value).strip(" ,.")


def _location_lookup_key(value: str) -> str:
normalized = _normalize_location_text(value)
normalized = normalized.replace("’", "'").replace("‘", "'")
normalized = unicodedata.normalize("NFKD", normalized)
normalized = "".join(
ch for ch in normalized if not unicodedata.combining(ch)
).casefold()
collapsed = "".join(ch if ch.isalnum() else " " for ch in normalized)
return re.sub(r"\s+", " ", collapsed).strip()


_COUNTRY_CANONICAL_MAP: dict[str, str] = {
_location_lookup_key(value): value for value in _CANONICAL_COUNTRY_NAMES
}
_COUNTRY_ALIASES: dict[str, str] = {
_location_lookup_key(key): value for key, value in _RAW_COUNTRY_ALIASES.items()
}


def _is_plausible_location_phrase(
value: str,
*,
max_words: int,
max_length: int,
) -> bool:
if not value or len(value) > max_length:
return False
tokens = [
token.casefold()
for token in re.findall(r"[^\W\d_][^\W\d_'-]*", value, flags=re.UNICODE)
]
if not tokens or len(tokens) > max_words:
return False
if any(token in _LOCATION_STOPWORDS for token in tokens):
return False
for ch in value:
if ch.isalpha() or ch in {" ", "-", "'", ".", "(", ")"}:
continue
return False
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalize_city/normalize_state currently reject strings containing typographic apostrophes (e.g., “’” / “‘”) because _is_plausible_location_phrase only allows ASCII ', while _location_lookup_key already normalizes these for countries. This can cause valid city/region names with curly quotes to be treated as invalid. Consider normalizing curly quotes in _normalize_location_text (or allowing them in _is_plausible_location_phrase) so city/state validation matches the country normalization behavior.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants