feat(shared): reuse resume extractor in bot and worker#113
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR introduces OpenAI-driven resume profile extraction and structured profile processing across the Discord bot and worker services. It adds a new shared ResumeProfileExtractor module supporting both LLM-based and heuristic extraction, refactors resume processing pipelines to use canonicalized skills and normalized seniority levels, integrates OpenAI configuration parameters, and updates skill and intake form processors to consume structured profile data. Changes
Sequence Diagram(s)sequenceDiagram
participant DCBot as Discord Bot
participant CRMCog as CRM Cog
participant ResExtractor as ResumeProfileExtractor
participant OpenAI as OpenAI API
participant ResProcProc as ResumeProfileProcessor
participant EspoCRM as EspoCRM Client
DCBot->>CRMCog: Upload resume file
CRMCog->>ResExtractor: extract(file_content)
ResExtractor->>ResExtractor: Check cache
alt Cache miss
ResExtractor->>OpenAI: Request profile extraction (LLM)
OpenAI-->>ResExtractor: JSON profile response
else Cache hit
ResExtractor-->>ResExtractor: Return cached profile
end
Note over ResExtractor: Heuristic fallback if LLM unavailable
ResExtractor-->>CRMCog: ResumeExtractedProfile
CRMCog->>CRMCog: _extract_resume_contact_hints(profile)
Note over CRMCog: Extract emails, GitHub, LinkedIn,<br/>phone, skills, seniority, country
CRMCog->>CRMCog: _build_resume_create_contact_payload(hints)
CRMCog-->>EspoCRM: Create contact with structured data
rect rgba(100, 150, 200, 0.5)
Note over ResProcProc: Background: Profile Processing
ResProcProc->>ResProcProc: apply_profile_updates(profile)
ResProcProc->>ResProcProc: Normalize & deduplicate skills
ResProcProc->>ResProcProc: Coerce seniority level
ResProcProc->>ResProcProc: Merge website links & emails
ResProcProc-->>EspoCRM: Update contact fields
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/worker/src/five08/worker/crm/resume_profile_processor.py (1)
356-381:⚠️ Potential issue | 🟠 MajorMove the empty-update guard after Discord link enrichment.
The early return at Line 356 short-circuits link-only apply requests before
cDiscordUserID/cDiscordUsernameare added.Suggested fix
- if not sanitized_updates: - return ResumeApplyResult( - contact_id=contact_id, - updated_fields=[], - success=False, - error="No valid profile fields provided", - ) - link_applied = False if link_discord: discord_user_id = str(link_discord.get("user_id", "")).strip() discord_username = str(link_discord.get("username", "")).strip() if discord_user_id and discord_username: sanitized_updates["cDiscordUserID"] = discord_user_id sanitized_updates["cDiscordUsername"] = ( f"{discord_username} (ID: {discord_user_id})" ) link_applied = True if not sanitized_updates: return ResumeApplyResult( contact_id=contact_id, updated_fields=[], success=False, error="No valid profile fields provided", )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/five08/worker/crm/resume_profile_processor.py` around lines 356 - 381, The current early-return that checks sanitized_updates and returns a ResumeApplyResult happens before enriching sanitized_updates with Discord info, which prevents link-only updates from applying; move the empty-update guard (the conditional that returns ResumeApplyResult when sanitized_updates is false) to after the Discord enrichment block that uses link_discord and sets sanitized_updates["cDiscordUserID"] and sanitized_updates["cDiscordUsername"] (and link_applied) so that Discord-only requests will be applied; update references in resume_profile_processor.py around the sanitized_updates check, the link_discord handling, and the ResumeApplyResult creation accordingly.
🧹 Nitpick comments (2)
apps/worker/src/five08/worker/crm/intake_form_processor.py (1)
413-416: KeepcSkillAttrspayload shape consistent across form and resume paths.Form updates write
{"skill": {"strength": N}}, while resume updates currently write{"skill": N}. Unifying this avoids schema drift in CRM payload consumers.♻️ Suggested consistency patch
- def _parse_profile_skill_attrs(self, profile: Any) -> dict[str, int]: + def _parse_profile_skill_attrs(self, profile: Any) -> dict[str, dict[str, int]]: raw_attrs = getattr(profile, "skill_attrs", {}) if not isinstance(raw_attrs, dict): return {} - parsed: dict[str, int] = {} + parsed: dict[str, dict[str, int]] = {} for raw_skill, raw_payload in raw_attrs.items(): normalized_name = self.skills_extractor.canonicalize_skill(str(raw_skill)) if not normalized_name: continue if isinstance(raw_payload, dict): raw_payload = raw_payload.get("strength") try: - parsed[normalized_name] = max(1, min(5, int(raw_payload))) + parsed[normalized_name] = { + "strength": max(1, min(5, int(raw_payload))) + } except Exception: continue return parsedAlso applies to: 421-436
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/five08/worker/crm/intake_form_processor.py` around lines 413 - 416, The cSkillAttrs payload written by the form path must match the nested shape used by the resume path; update the logic around _parse_profile_skill_attrs and where updates["cSkillAttrs"] is set so each skill value is an object like {"strength": N} rather than a bare number—ensure updates["cSkillAttrs"] = json.dumps(profile_attrs) builds profile_attrs with values as dicts (e.g., {"skill": {"strength": value}}) and apply the same transformation in the resume-path code block covering the logic duplicated around lines referenced (the other block at 421-436) so both form and resume produce identical shapes; keep updates["skills"] = sorted(profile_attrs.keys()) unchanged.tests/unit/test_skills_extractor.py (1)
117-125: Extend this test to also cover the hyphenated variant.The extractor now disallows
"bug-tracking"too; asserting that here will fully lock this regression case.✅ Suggested test tweak
- result = extractor._normalize_extracted_payload( - skills_value=["Bug Tracking", "Python", "bugtracking", "Code Review"], + result = extractor._normalize_extracted_payload( + skills_value=[ + "Bug Tracking", + "Python", + "bugtracking", + "bug-tracking", + "Code Review", + ], skill_attrs_value=None, confidence=0.8, source="model", ) assert "bug tracking" not in result.skills assert "bugtracking" not in result.skills + assert "bug-tracking" not in result.skills assert result.skills == ["python"]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_skills_extractor.py` around lines 117 - 125, Add an assertion to ensure the hyphenated token is also filtered: extend the existing assertions after creating the extractor result (referencing result.skills and the skills_value list used to build it) to assert that "bug-tracking" is not in result.skills, and keep the existing checks that "bug tracking" and "bugtracking" are absent and that result.skills == ["python"].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/shared/src/five08/resume_extractor.py`:
- Around line 136-166: _normailze_seniority currently returns the string
"unknown" for unmapped or empty inputs which is truthy and prevents the fallback
_infer_seniority_from_resume from running in extract; update
_normalize_seniority to (1) add explicit mappings for "principal" and "principal
engineer" (return a distinct "principal" value) and (2) return None (not
"unknown") for empty or otherwise-unmapped inputs so extract can fall back to
_infer_seniority_from_resume; apply the same change to the other normalization
occurrence referenced around lines 396-400.
- Around line 647-657: The heuristic parsing after the heading fails when the
matched position is followed by a newline because tail.splitlines()[0] returns
an empty string; change the logic in the block using line_start/tail/first_line
so you skip leading blank lines (e.g., strip leading whitespace or iterate to
the first non-empty line) before splitting into comma-separated items and
calling _normalize_skill_payload; ensure you still set attrs to
{skill.casefold(): DEFAULT_SKILL_STRENGTH} when skills exist but attrs is None.
In `@tests/unit/test_resume_profile_processor.py`:
- Around line 80-82: Change the multiline return annotation "-> (None)" to a
single-line "-> None" on the test function signature for
test_extract_profile_proposal_merges_and_serializes_website_and_skill_attrs (and
any other test functions that currently use "-> (None)" such as the similar test
near the end of the file). Edit the def lines so the return annotation is on the
same line as the def and use -> None without parentheses to satisfy ruff format
--check.
In `@tests/unit/test_worker_processor.py`:
- Around line 38-41: The current assertion assumes positional args and exact
order; change it to inspect the mock's call args
(processor.espocrm_client.update_contact_skills.call_args), extract the contact
id and the skills either from kwargs or from positional args, assert the contact
id equals "contact-1", and assert the skills using an order-insensitive
comparison (e.g., compare sets) so the test accepts any valid skill ordering
while still ensuring the same elements were passed.
---
Outside diff comments:
In `@apps/worker/src/five08/worker/crm/resume_profile_processor.py`:
- Around line 356-381: The current early-return that checks sanitized_updates
and returns a ResumeApplyResult happens before enriching sanitized_updates with
Discord info, which prevents link-only updates from applying; move the
empty-update guard (the conditional that returns ResumeApplyResult when
sanitized_updates is false) to after the Discord enrichment block that uses
link_discord and sets sanitized_updates["cDiscordUserID"] and
sanitized_updates["cDiscordUsername"] (and link_applied) so that Discord-only
requests will be applied; update references in resume_profile_processor.py
around the sanitized_updates check, the link_discord handling, and the
ResumeApplyResult creation accordingly.
---
Nitpick comments:
In `@apps/worker/src/five08/worker/crm/intake_form_processor.py`:
- Around line 413-416: The cSkillAttrs payload written by the form path must
match the nested shape used by the resume path; update the logic around
_parse_profile_skill_attrs and where updates["cSkillAttrs"] is set so each skill
value is an object like {"strength": N} rather than a bare number—ensure
updates["cSkillAttrs"] = json.dumps(profile_attrs) builds profile_attrs with
values as dicts (e.g., {"skill": {"strength": value}}) and apply the same
transformation in the resume-path code block covering the logic duplicated
around lines referenced (the other block at 421-436) so both form and resume
produce identical shapes; keep updates["skills"] = sorted(profile_attrs.keys())
unchanged.
In `@tests/unit/test_skills_extractor.py`:
- Around line 117-125: Add an assertion to ensure the hyphenated token is also
filtered: extend the existing assertions after creating the extractor result
(referencing result.skills and the skills_value list used to build it) to assert
that "bug-tracking" is not in result.skills, and keep the existing checks that
"bug tracking" and "bugtracking" are absent and that result.skills ==
["python"].
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
apps/discord_bot/src/five08/discord_bot/cogs/crm.pyapps/discord_bot/src/five08/discord_bot/config.pyapps/worker/src/five08/worker/crm/intake_form_processor.pyapps/worker/src/five08/worker/crm/processor.pyapps/worker/src/five08/worker/crm/resume_profile_processor.pyapps/worker/src/five08/worker/crm/skills_extractor.pyapps/worker/src/five08/worker/models.pypackages/shared/src/five08/resume_extractor.pytests/unit/test_crm.pytests/unit/test_intake_form_processor.pytests/unit/test_resume_profile_processor.pytests/unit/test_skills_extractor.pytests/unit/test_worker_processor.py
| def _normalize_seniority(value: Any) -> str | None: | ||
| if not isinstance(value, str): | ||
| return None | ||
| normalized = value.strip().lower() | ||
| if not normalized: | ||
| return "unknown" | ||
| if normalized in {"jr", "junior", "entry", "entry-level", "entry level"}: | ||
| return "junior" | ||
| if normalized in {"intern", "internship"}: | ||
| return "junior" | ||
| if normalized in {"mid-level", "midlevel", "mid", "intermediate"}: | ||
| return "midlevel" | ||
| if normalized in {"staff", "staff+", "staff and beyond"}: | ||
| return "staff" | ||
| if normalized in { | ||
| "senior", | ||
| "sr", | ||
| "sr. engineer", | ||
| "lead", | ||
| "lead engineer", | ||
| "lead engineer/tech lead", | ||
| }: | ||
| return "senior" | ||
| if "lead" in normalized and ("engineer" in normalized or "lead" == normalized): | ||
| return "senior" | ||
| if "staff" in normalized: | ||
| return "staff" | ||
| if normalized.startswith("sr "): | ||
| return "senior" | ||
| return "unknown" | ||
|
|
There was a problem hiding this comment.
Seniority normalization currently collapses principal titles and can block fallback inference.
_normalize_seniority does not map "principal"/"principal engineer" and returns "unknown" for unmapped values. In extract(), that truthy "unknown" prevents _infer_seniority_from_resume(...) from running.
🐛 Suggested fix
def _normalize_seniority(value: Any) -> str | None:
@@
- if normalized in {"staff", "staff+", "staff and beyond"}:
+ if normalized in {
+ "staff",
+ "staff+",
+ "staff and beyond",
+ "principal",
+ "principal engineer",
+ }:
return "staff"
@@
- return "unknown"
+ return NoneAlso applies to: 396-400
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/shared/src/five08/resume_extractor.py` around lines 136 - 166,
_normailze_seniority currently returns the string "unknown" for unmapped or
empty inputs which is truthy and prevents the fallback
_infer_seniority_from_resume from running in extract; update
_normalize_seniority to (1) add explicit mappings for "principal" and "principal
engineer" (return a distinct "principal" value) and (2) return None (not
"unknown") for empty or otherwise-unmapped inputs so extract can fall back to
_infer_seniority_from_resume; apply the same change to the other normalization
occurrence referenced around lines 396-400.
| line_start = match.end() | ||
| tail = resume_text[line_start : line_start + 500] | ||
| first_line = tail.splitlines()[0] if tail else "" | ||
| if first_line: | ||
| skills, attrs = _normalize_skill_payload( | ||
| [item.strip() for item in first_line.split(",") if item.strip()], | ||
| None, | ||
| ) | ||
| if skills and not attrs: | ||
| attrs = {skill.casefold(): DEFAULT_SKILL_STRENGTH for skill in skills} | ||
| return skills, attrs |
There was a problem hiding this comment.
Heuristic skills parsing drops common Skills: sections followed by a newline.
After the heading match, tail usually starts with a newline, so splitlines()[0] is empty and no skills are parsed.
🐛 Suggested fix
- first_line = tail.splitlines()[0] if tail else ""
- if first_line:
+ first_non_empty = next(
+ (line.strip() for line in tail.splitlines() if line.strip()),
+ "",
+ )
+ if first_non_empty:
skills, attrs = _normalize_skill_payload(
- [item.strip() for item in first_line.split(",") if item.strip()],
+ [item.strip() for item in first_non_empty.split(",") if item.strip()],
None,
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| line_start = match.end() | |
| tail = resume_text[line_start : line_start + 500] | |
| first_line = tail.splitlines()[0] if tail else "" | |
| if first_line: | |
| skills, attrs = _normalize_skill_payload( | |
| [item.strip() for item in first_line.split(",") if item.strip()], | |
| None, | |
| ) | |
| if skills and not attrs: | |
| attrs = {skill.casefold(): DEFAULT_SKILL_STRENGTH for skill in skills} | |
| return skills, attrs | |
| line_start = match.end() | |
| tail = resume_text[line_start : line_start + 500] | |
| first_non_empty = next( | |
| (line.strip() for line in tail.splitlines() if line.strip()), | |
| "", | |
| ) | |
| if first_non_empty: | |
| skills, attrs = _normalize_skill_payload( | |
| [item.strip() for item in first_non_empty.split(",") if item.strip()], | |
| None, | |
| ) | |
| if skills and not attrs: | |
| attrs = {skill.casefold(): DEFAULT_SKILL_STRENGTH for skill in skills} | |
| return skills, attrs |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/shared/src/five08/resume_extractor.py` around lines 647 - 657, The
heuristic parsing after the heading fails when the matched position is followed
by a newline because tail.splitlines()[0] returns an empty string; change the
logic in the block using line_start/tail/first_line so you skip leading blank
lines (e.g., strip leading whitespace or iterate to the first non-empty line)
before splitting into comma-separated items and calling
_normalize_skill_payload; ensure you still set attrs to {skill.casefold():
DEFAULT_SKILL_STRENGTH} when skills exist but attrs is None.
| def test_extract_profile_proposal_merges_and_serializes_website_and_skill_attrs() -> ( | ||
| None | ||
| ): |
There was a problem hiding this comment.
Fix these function signatures to pass ruff format --check.
The current multiline -> (None) annotations are causing the formatter gate to fail.
Suggested fix
-def test_extract_profile_proposal_merges_and_serializes_website_and_skill_attrs() -> (
- None
-):
+def test_extract_profile_proposal_merges_and_serializes_website_and_skill_attrs() -> None:
@@
-def test_apply_profile_updates_appends_resume_email_as_primary_emailAddressData() -> (
- None
-):
+def test_apply_profile_updates_appends_resume_email_as_primary_emailAddressData() -> None:Also applies to: 313-315
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/test_resume_profile_processor.py` around lines 80 - 82, Change the
multiline return annotation "-> (None)" to a single-line "-> None" on the test
function signature for
test_extract_profile_proposal_merges_and_serializes_website_and_skill_attrs (and
any other test functions that currently use "-> (None)" such as the similar test
near the end of the file). Edit the def lines so the return annotation is on the
same line as the def and use -> None without parentheses to satisfy ruff format
--check.
| processor.espocrm_client.update_contact_skills.assert_called_once_with( | ||
| "contact-1", | ||
| ["python", "redis", "fastapi", "docker"], | ||
| ) |
There was a problem hiding this comment.
Make this assertion kwargs-aware and order-insensitive.
CI already reports this call arrives via kwargs and with a different (but valid) skill order.
Suggested fix
- processor.espocrm_client.update_contact_skills.assert_called_once_with(
- "contact-1",
- ["python", "redis", "fastapi", "docker"],
- )
+ processor.espocrm_client.update_contact_skills.assert_called_once()
+ _, kwargs = processor.espocrm_client.update_contact_skills.call_args
+ assert kwargs["contact_id"] == "contact-1"
+ assert set(kwargs["skills"]) == {"python", "redis", "fastapi", "docker"}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| processor.espocrm_client.update_contact_skills.assert_called_once_with( | |
| "contact-1", | |
| ["python", "redis", "fastapi", "docker"], | |
| ) | |
| processor.espocrm_client.update_contact_skills.assert_called_once() | |
| _, kwargs = processor.espocrm_client.update_contact_skills.call_args | |
| assert kwargs["contact_id"] == "contact-1" | |
| assert set(kwargs["skills"]) == {"python", "redis", "fastapi", "docker"} |
🧰 Tools
🪛 GitHub Actions: Tests
[error] 38-38: CI failure in pytest: test_process_contact_skills_merges_and_updates did not call update_contact_skills with expected argument order. Actual call: update_contact_skills(contact_id='contact-1', skills=['python', 'redis', 'docker', 'fastapi']); expected: update_contact_skills('contact-1', ['python', 'redis', 'fastapi', 'docker']). Command: 'uv run pytest tests/ -v --tb=short'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/test_worker_processor.py` around lines 38 - 41, The current
assertion assumes positional args and exact order; change it to inspect the
mock's call args (processor.espocrm_client.update_contact_skills.call_args),
extract the contact id and the skills either from kwargs or from positional
args, assert the contact id equals "contact-1", and assert the skills using an
order-insensitive comparison (e.g., compare sets) so the test accepts any valid
skill ordering while still ensuring the same elements were passed.
Description
Related Issue
How Has This Been Tested?
Pre-commit checks (ruff, ruff format, mypy) passed on commit.
Summary by CodeRabbit
Release Notes
New Features
Configuration