feat(shared): reuse resume extractor in bot and worker#107
Conversation
📝 WalkthroughWalkthroughThis PR consolidates resume extraction functionality into a shared module ( Changes
Sequence Diagram(s)sequenceDiagram
participant DiscordBot as Discord Bot<br/>(CRMCog)
participant Extractor as ResumeProfileExtractor
participant OpenAI as OpenAI API /<br/>Heuristic
participant Cache as Profile Cache
DiscordBot->>Cache: Check cached profile
alt Cache hit
Cache-->>DiscordBot: Return cached profile
else Cache miss
DiscordBot->>Extractor: extract(resume_text)
Extractor->>OpenAI: Try AI extraction<br/>(build prompt + chat)
OpenAI-->>Extractor: JSON with fields
alt AI success
Extractor->>Extractor: Normalize fields
else AI fails
Extractor->>OpenAI: Fallback heuristic
OpenAI-->>Extractor: Rule-based extraction
end
Extractor-->>DiscordBot: ResumeExtractedProfile
DiscordBot->>Cache: Store profile
end
DiscordBot->>DiscordBot: extract_contact_hints()<br/>(emails, github, linkedin, skills)
DiscordBot->>DiscordBot: build_resume_payload()<br/>(enrich with hints)
DiscordBot-->>DiscordBot: Infer/create contact
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
packages/shared/src/five08/resume_extractor.py (1)
231-232: Silent fallback hides extraction failures.Line 231 catches all exceptions and immediately falls back, but there’s no logging/telemetry. This makes production regressions in the LLM path very hard to detect and triage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/shared/src/five08/resume_extractor.py` around lines 231 - 232, The except Exception block that currently does "except Exception: return self._heuristic_extract(text)" should log the caught exception (with stack trace and contextual info like the input text or resume id) before falling back; update the except handler in the method surrounding the LLM extraction to call your telemetry/logging utility (e.g., logger.exception(...) or telemetry.capture_exception(...)) including the exception object and context, then still return self._heuristic_extract(text) so behavior is unchanged but failures are observable.apps/discord_bot/src/five08/discord_bot/cogs/crm.py (2)
3003-3009: Normalize extracted skills before writing payload.Lines 3004-3008 currently join raw extracted skills directly. Running them through
normalize_skill_listkeeps create payload behavior consistent with the rest of this cog’s skill handling.♻️ Suggested refinement
if skills: - normalized_skills = [ - str(item).strip() for item in skills if str(item).strip() - ] + normalized_skills = normalize_skill_list( + [str(item) for item in skills if str(item).strip()] + ) if normalized_skills: payload["skills"] = ", ".join(normalized_skills)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py` around lines 3003 - 3009, The extracted skills are currently trimmed and joined directly into payload["skills"]; instead, pass the raw skills through the existing normalize_skill_list helper before joining to keep behavior consistent. Update the block that handles skills (the one referencing the local variable skills and payload["skills"]) to call normalize_skill_list(skills) (or the appropriate import) and then join the returned normalized list into payload["skills"]; ensure you still filter out empty strings as normalize_skill_list may return them.
2911-2918: Use a stable digest cache key instead ofhash(file_content).Using Python’s runtime hash for bytes on Line 2912 can theoretically collide and return the wrong cached profile. A digest key avoids cross-content aliasing risk.
🔐 Suggested hardening
+import hashlib @@ - self._resume_profile_cache: tuple[int, ResumeExtractedProfile] | None = None + self._resume_profile_cache: tuple[bytes, ResumeExtractedProfile] | None = None @@ - def _extract_resume_profile(self, file_content: bytes) -> Any: + def _extract_resume_profile(self, file_content: bytes) -> ResumeExtractedProfile: @@ - cache_key = hash(file_content) + cache_key = hashlib.sha256(file_content).digest()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py` around lines 2911 - 2918, The cache currently uses Python's runtime hash(file_content) in the resume extraction path (see _resume_profile_cache, cache_key and resume_extractor.extract), which can collide; replace it with a stable cryptographic digest of file_content (e.g., compute a sha256 or similar hex digest from the bytes) and use that digest string as cache_key before comparing/updating _resume_profile_cache so cached profiles reliably map to the exact file bytes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py`:
- Around line 2909-2919: The synchronous call to
self.resume_extractor.extract(text) in _extract_resume_profile blocks the event
loop; change _extract_resume_profile into an async method that offloads the
blocking work to a thread executor (e.g., use
asyncio.get_running_loop().run_in_executor or loop.run_in_executor) to call
self.resume_extractor.extract(text) without blocking, keep the existing caching
logic (cache_key, self._resume_profile_cache) but adapt it to async, and then
update all callers (e.g., upload_resume() and confirm_create() button handler
and any uses of _extract_resume_contact_hints) to await _extract_resume_profile
so the extractor runs without stalling the Discord event loop.
In `@packages/shared/src/five08/resume_extractor.py`:
- Around line 333-346: The _extract_skills heuristic only matches headings on
their own and then takes just the very next line, which misses formats like
"Skills: Python, SQL" or cases where the first returned line is empty; update
_extract_skills to (1) expand the regex to also capture headings followed by
inline content (e.g., allow something after the heading on the same line) and
(2) when no inline content is found, gather the next few non-empty lines (or up
to a character limit, e.g., 500 chars) after match.end(), join them until a
blank line or unrelated section heading, then pass that combined text into
_normalize_skills; keep references to the same function name _extract_skills and
call _normalize_skills with the consolidated candidate skills string.
---
Nitpick comments:
In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py`:
- Around line 3003-3009: The extracted skills are currently trimmed and joined
directly into payload["skills"]; instead, pass the raw skills through the
existing normalize_skill_list helper before joining to keep behavior consistent.
Update the block that handles skills (the one referencing the local variable
skills and payload["skills"]) to call normalize_skill_list(skills) (or the
appropriate import) and then join the returned normalized list into
payload["skills"]; ensure you still filter out empty strings as
normalize_skill_list may return them.
- Around line 2911-2918: The cache currently uses Python's runtime
hash(file_content) in the resume extraction path (see _resume_profile_cache,
cache_key and resume_extractor.extract), which can collide; replace it with a
stable cryptographic digest of file_content (e.g., compute a sha256 or similar
hex digest from the bytes) and use that digest string as cache_key before
comparing/updating _resume_profile_cache so cached profiles reliably map to the
exact file bytes.
In `@packages/shared/src/five08/resume_extractor.py`:
- Around line 231-232: The except Exception block that currently does "except
Exception: return self._heuristic_extract(text)" should log the caught exception
(with stack trace and contextual info like the input text or resume id) before
falling back; update the except handler in the method surrounding the LLM
extraction to call your telemetry/logging utility (e.g., logger.exception(...)
or telemetry.capture_exception(...)) including the exception object and context,
then still return self._heuristic_extract(text) so behavior is unchanged but
failures are observable.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
apps/discord_bot/src/five08/discord_bot/cogs/crm.pyapps/discord_bot/src/five08/discord_bot/config.pyapps/worker/src/five08/worker/crm/intake_form_processor.pyapps/worker/src/five08/worker/crm/resume_profile_processor.pyapps/worker/src/five08/worker/models.pypackages/shared/src/five08/resume_extractor.pytests/unit/test_crm.py
| def _extract_resume_profile(self, file_content: bytes) -> Any: | ||
| """Extract resume profile fields and cache per-file-content results.""" | ||
| cache = self._resume_profile_cache | ||
| cache_key = hash(file_content) | ||
| if cache and cache[0] == cache_key: | ||
| return cache[1] | ||
|
|
||
| linkedin_matches: list[str] = [] | ||
| for linkedin_url in linkedin_re.findall(snippet): | ||
| candidate = str(linkedin_url).strip().lower().rstrip("/") | ||
| if candidate and candidate not in linkedin_matches: | ||
| linkedin_matches.append(candidate) | ||
| text = file_content.decode("utf-8", errors="ignore") | ||
| profile = self.resume_extractor.extract(text) | ||
| self._resume_profile_cache = (cache_key, profile) | ||
| return profile |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check ResumeProfileExtractor.extract() signature and if it's sync or async
find . -name "resume_extractor.py" -type f | head -5Repository: 508-dev/508-workflows
Length of output: 113
🏁 Script executed:
# Search for ResumeProfileExtractor and extract method definition
rg -n "class ResumeProfileExtractor|def extract" packages/shared/src/five08/resume_extractor.py -A 2Repository: 508-dev/508-workflows
Length of output: 349
🏁 Script executed:
# Find all calls to _extract_resume_profile in crm.py to understand the call chain
rg -n "_extract_resume_profile" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2Repository: 508-dev/508-workflows
Length of output: 631
🏁 Script executed:
# Check if the methods calling _extract_resume_profile are async
rg -n "async def upload_resume|async def _infer_contact_from_resume|async def _extract_resume_contact_hints" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 1 -A 3Repository: 508-dev/508-workflows
Length of output: 441
🏁 Script executed:
# Verify the current state of _extract_resume_profile matches the snippet
sed -n '2909,2919p' apps/discord_bot/src/five08/discord_bot/cogs/crm.pyRepository: 508-dev/508-workflows
Length of output: 562
🏁 Script executed:
# Find all callers of _extract_resume_contact_hints
rg -n "_extract_resume_contact_hints" apps/discord_bot/src/five08/discord_bot/cogs/crm.pyRepository: 508-dev/508-workflows
Length of output: 363
🏁 Script executed:
# Check the implementation of _infer_contact_from_resume to see what it calls
sed -n '3068,3150p' apps/discord_bot/src/five08/discord_bot/cogs/crm.pyRepository: 508-dev/508-workflows
Length of output: 3293
🏁 Script executed:
# Check upload_resume to see the call chain
sed -n '5006,5100p' apps/discord_bot/src/five08/discord_bot/cogs/crm.pyRepository: 508-dev/508-workflows
Length of output: 3770
🏁 Script executed:
# Check if upload_resume calls _infer_contact_from_resume
rg -n "_infer_contact_from_resume" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2Repository: 508-dev/508-workflows
Length of output: 565
🏁 Script executed:
# Verify the resume_extractor.extract implementation to confirm it does blocking I/O
sed -n '186,250p' packages/shared/src/five08/resume_extractor.pyRepository: 508-dev/508-workflows
Length of output: 2817
🏁 Script executed:
# Check the methods containing the other calls to _extract_resume_contact_hints (lines 2957 and 2967)
sed -n '2940,2980p' apps/discord_bot/src/five08/discord_bot/cogs/crm.pyRepository: 508-dev/508-workflows
Length of output: 1765
🏁 Script executed:
# Check if _extract_resume_name_hint or _build_resume_create_contact_payload are called from async functions
rg -n "_extract_resume_name_hint|_build_resume_create_contact_payload" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2Repository: 508-dev/508-workflows
Length of output: 1486
🏁 Script executed:
# Find the context around line 1024 - what class/method is this in
sed -n '1000,1030p' apps/discord_bot/src/five08/discord_bot/cogs/crm.pyRepository: 508-dev/508-workflows
Length of output: 1386
🏁 Script executed:
# Find callers of _build_resume_link_contact_payload
rg -n "_build_resume_link_contact_payload" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2Repository: 508-dev/508-workflows
Length of output: 47
🏁 Script executed:
# Check what class contains confirm_create to understand the context better
sed -n '990,1015p' apps/discord_bot/src/five08/discord_bot/cogs/crm.py | head -30Repository: 508-dev/508-workflows
Length of output: 1046
Synchronous extractor call can block the Discord event loop.
Line 2917 calls self.resume_extractor.extract(text) synchronously. This method makes a blocking OpenAI API call that stalls the Discord bot's event loop when invoked from async command handlers like upload_resume() and the confirm_create() button handler.
⚙️ Suggested fix direction
- def _extract_resume_profile(self, file_content: bytes) -> Any:
+ async def _extract_resume_profile(self, file_content: bytes) -> ResumeExtractedProfile:
"""Extract resume profile fields and cache per-file-content results."""
cache = self._resume_profile_cache
cache_key = hash(file_content)
if cache and cache[0] == cache_key:
return cache[1]
text = file_content.decode("utf-8", errors="ignore")
- profile = self.resume_extractor.extract(text)
+ profile = await asyncio.to_thread(self.resume_extractor.extract, text)
self._resume_profile_cache = (cache_key, profile)
return profile- def _extract_resume_contact_hints(self, file_content: bytes) -> dict[str, Any]:
+ async def _extract_resume_contact_hints(self, file_content: bytes) -> dict[str, Any]:
"""Extract contact-identifying signals and shared resume fields from bytes."""
- profile = self._extract_resume_profile(file_content)
+ profile = await self._extract_resume_profile(file_content)Also update callers of _extract_resume_contact_hints to use await.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py` around lines 2909 -
2919, The synchronous call to self.resume_extractor.extract(text) in
_extract_resume_profile blocks the event loop; change _extract_resume_profile
into an async method that offloads the blocking work to a thread executor (e.g.,
use asyncio.get_running_loop().run_in_executor or loop.run_in_executor) to call
self.resume_extractor.extract(text) without blocking, keep the existing caching
logic (cache_key, self._resume_profile_cache) but adapt it to async, and then
update all callers (e.g., upload_resume() and confirm_create() button handler
and any uses of _extract_resume_contact_hints) to await _extract_resume_profile
so the extractor runs without stalling the Discord event loop.
| def _extract_skills(resume_text: str) -> list[str]: | ||
| match = re.search( | ||
| r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*$", | ||
| resume_text, | ||
| ) | ||
| if not match: | ||
| return [] | ||
|
|
||
| line_start = match.end() | ||
| tail = resume_text[line_start : line_start + 500] | ||
| first_line = tail.splitlines()[0] if tail else "" | ||
| if first_line: | ||
| return _normalize_skills(first_line) | ||
| return [] |
There was a problem hiding this comment.
Heuristic skills parsing misses common resume formats.
Line 335 only matches a heading-only line, and Line 343 reads only splitlines()[0], which is often empty right after the heading. This causes frequent false negatives for skills when the resume uses Skills: Python, ... or puts skills on the next line.
🔧 Proposed fix
`@staticmethod`
def _extract_skills(resume_text: str) -> list[str]:
match = re.search(
- r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*$",
+ r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*(.*)$",
resume_text,
)
if not match:
return []
- line_start = match.end()
- tail = resume_text[line_start : line_start + 500]
- first_line = tail.splitlines()[0] if tail else ""
- if first_line:
- return _normalize_skills(first_line)
+ inline = (match.group(1) or "").strip()
+ if inline:
+ return _normalize_skills(inline)
+
+ line_start = match.end()
+ tail = resume_text[line_start : line_start + 500]
+ for line in tail.splitlines():
+ candidate = line.strip()
+ if candidate:
+ return _normalize_skills(candidate)
return []📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _extract_skills(resume_text: str) -> list[str]: | |
| match = re.search( | |
| r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*$", | |
| resume_text, | |
| ) | |
| if not match: | |
| return [] | |
| line_start = match.end() | |
| tail = resume_text[line_start : line_start + 500] | |
| first_line = tail.splitlines()[0] if tail else "" | |
| if first_line: | |
| return _normalize_skills(first_line) | |
| return [] | |
| def _extract_skills(resume_text: str) -> list[str]: | |
| match = re.search( | |
| r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*(.*)$", | |
| resume_text, | |
| ) | |
| if not match: | |
| return [] | |
| inline = (match.group(1) or "").strip() | |
| if inline: | |
| return _normalize_skills(inline) | |
| line_start = match.end() | |
| tail = resume_text[line_start : line_start + 500] | |
| for line in tail.splitlines(): | |
| candidate = line.strip() | |
| if candidate: | |
| return _normalize_skills(candidate) | |
| return [] |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/shared/src/five08/resume_extractor.py` around lines 333 - 346, The
_extract_skills heuristic only matches headings on their own and then takes just
the very next line, which misses formats like "Skills: Python, SQL" or cases
where the first returned line is empty; update _extract_skills to (1) expand the
regex to also capture headings followed by inline content (e.g., allow something
after the heading on the same line) and (2) when no inline content is found,
gather the next few non-empty lines (or up to a character limit, e.g., 500
chars) after match.end(), join them until a blank line or unrelated section
heading, then pass that combined text into _normalize_skills; keep references to
the same function name _extract_skills and call _normalize_skills with the
consolidated candidate skills string.
Description
test_link_discord_user_name_searchassertion resilient to wrappedwherefilters.Related Issue
How Has This Been Tested?
Pre-commit checks (ruff, ruff format, mypy) passed on commit.
Summary by CodeRabbit
Release Notes