fix: dedupe personal website candidates #198
Pull request overview
Normalizes personal-website deduplication so http:// vs https:// variants collapse to a single website, and updates extraction behavior to avoid adding heuristic “junk” website candidates after an accepted LLM personal-website candidate.
Changes:
- Introduce a scheme-insensitive website identity key and use it for website/social dedupe in the shared resume extractor.
- Update the worker resume profile processor to dedupe CRM website updates using a scheme-insensitive key.
- Add regression tests covering scheme-only duplicates and suppression of heuristic backfill after an accepted LLM website.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `packages/shared/src/five08/resume_extractor.py` | Adds `_website_identity_key()` and applies it across multiple website/social dedupe paths; suppresses heuristic non-social website backfill after LLM personal website acceptance. |
| `apps/worker/src/five08/worker/crm/resume_profile_processor.py` | Updates website link coercion/merge logic to dedupe using a scheme-insensitive key. |
| `tests/unit/test_resume_extractor.py` | Adds extractor regression tests for scheme-only dedupe and heuristic suppression after LLM website success. |
| `tests/unit/test_resume_profile_processor.py` | Adds processor regression test ensuring existing/extracted website links don't trigger updates when differing only by scheme. |
```python
result = extractor.extract("Portfolio: https://bit.ly/charleschen-portfolio")

assert result.website_links == ["http://bit.ly/charleschen-portfolio"]
```
This assertion hard-codes that the surviving deduped URL must be the http variant. The behavior being tested is scheme-insensitive dedupe; depending on candidate ordering or confidence ties, the implementation could reasonably keep https instead while still satisfying the requirement. To reduce brittleness, assert that only one link remains and that its identity matches the expected host/path (ignoring scheme).
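A scheme-insensitive assertion might look like the following sketch; the `identity` helper and the stand-in `links` list are illustrative, not part of the PR (the real test would use `result.website_links` from the extractor call above):

```python
from urllib.parse import urlsplit

def identity(url: str) -> tuple[str, str]:
    """Scheme-insensitive identity: (host without www., path without trailing slash)."""
    parts = urlsplit(url)
    host = parts.netloc.casefold().removeprefix("www.")
    return host, parts.path.rstrip("/")

# Stand-in for result.website_links; either scheme variant should pass.
links = ["http://bit.ly/charleschen-portfolio"]
assert len(links) == 1
assert identity(links[0]) == ("bit.ly", "/charleschen-portfolio")
```

This keeps the test green regardless of whether candidate ordering leaves the `http` or `https` variant as the survivor.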
```python
netloc = parsed.netloc.casefold()
if netloc.startswith("www."):
    netloc = netloc[4:]
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
```
_website_identity_key() case-folds the host but leaves the URL path casing unchanged. Previously dedupe used normalized_url.casefold(), so links that differ only by path casing (e.g. /About vs /about) would have been deduped but will now be treated as distinct. If case-insensitive path dedupe is still desired for this pipeline, case-fold (or otherwise normalize) the path component when building the key, or add coverage to lock in the intended behavior.
```diff
- path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
+ path = re.sub(r"/+", "/", parsed.path or "").rstrip("/").casefold()
```
```python
def _website_identity_key(value: str) -> str | None:
    normalized_url = _normalize_website_url(value)
    if not normalized_url:
        return None

    try:
        parsed = urlsplit(normalized_url)
    except Exception:
        return normalized_url.casefold()
```
_website_identity_key() re-normalizes the URL via _normalize_website_url(), but most call sites already pass a normalized value (e.g., normalized_link / normalized_url). This causes redundant parsing/normalization work and also couples dedupe-key generation to the full normalization policy. Consider letting the key helper accept an already-normalized URL (or splitting into normalize + key steps) so callers don’t pay the normalization cost twice.
```python
netloc = parsed.netloc.casefold()
if netloc.startswith("www."):
    netloc = netloc[4:]
path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
```
_website_dedupe_key() duplicates the URL identity logic that already exists in packages/shared/src/five08/resume_extractor.py (_website_identity_key). Keeping two slightly different implementations increases the chance they drift over time (e.g., the current key keeps original path casing, which also changes prior case-insensitive dedupe behavior). Consider moving this into a shared helper (e.g., crm_normalization) and reusing it from both places, and clarify whether path casing should be normalized as part of dedupe.
```diff
- path = re.sub(r"/+", "/", parsed.path or "").rstrip("/")
+ path = re.sub(r"/+", "/", parsed.path or "").rstrip("/").casefold()
```
Description
- Normalize website dedupe to ignore scheme-only differences so `http` and `https` variants do not produce duplicate personal websites.
- Stop appending heuristic non-social website candidates after an accepted LLM personal website candidate, while keeping the existing social-link split behavior.
- Add extractor and resume profile processor regression tests covering scheme-only duplicates and heuristic junk websites.
Related Issue
N/A
How Has This Been Tested?

```shell
uv run pytest tests/unit/test_resume_extractor.py -q
uv run pytest tests/unit/test_resume_profile_processor.py -q
```