Skip to content

feat(shared): reuse resume extractor in bot and worker#107

Closed
michaelmwu wants to merge 1 commit into
mainfrom
michaelmwu/resume-llm-extract
Closed

feat(shared): reuse resume extractor in bot and worker#107
michaelmwu wants to merge 1 commit into
mainfrom
michaelmwu/resume-llm-extract

Conversation

@michaelmwu
Copy link
Copy Markdown
Member

@michaelmwu michaelmwu commented Mar 3, 2026

Description

  • Introduced a shared resume extraction service in packages/shared/src/five08/resume_extractor.py to normalize candidate fields, emit confidence, and provide LLM + heuristic fallback extraction.
  • Updated Discord bot CRM resume workflows to use that shared extractor for resume matching payloads, name inference, and inferred contact creation fields.
  • Refactored worker resume intake and resume profile processors to use the shared extractor and removed duplicated extraction logic.
  • Added bot OpenAI configuration fields and made the test_link_discord_user_name_search assertion resilient to wrapped where filters.
  • Validation passed pre-commit hooks (ruff and mypy), with no runtime tests executed in this pass.

Related Issue

How Has This Been Tested?

Pre-commit checks (ruff, ruff format, mypy) passed on commit.

Summary by CodeRabbit

Release Notes

  • New Features
    • Enhanced resume parsing with AI-powered extraction of candidate information including email, GitHub profiles, LinkedIn URLs, phone numbers, location, seniority level, and skills
    • Improved contact inference and matching workflows with more reliable structured data extraction from uploaded resumes

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

This PR consolidates resume extraction functionality into a shared module (five08.resume_extractor) with AI-based and heuristic extraction support. It refactors the Discord bot's CRM cog and worker modules to use this unified extractor, adds OpenAI configuration fields, and enriches contact data extraction with additional signals (GitHub, LinkedIn, skills, seniority).

Changes

Cohort / File(s) Summary
Shared Resume Extraction Module
packages/shared/src/five08/resume_extractor.py
Introduces ResumeProfileExtractor class and ResumeExtractedProfile model supporting both OpenAI-based and heuristic resume text extraction. Provides normalization helpers for emails, GitHub usernames, LinkedIn URLs, phone numbers, countries, seniority levels, and skills.
Discord Bot Configuration
apps/discord_bot/src/five08/discord_bot/config.py
Adds three new optional OpenAI-related configuration fields: openai_api_key, openai_base_url, and openai_model (default: "gpt-4o-mini").
Discord Bot CRM Integration
apps/discord_bot/src/five08/discord_bot/cogs/crm.py
Integrates ResumeProfileExtractor into CRMCog; adds cached resume profile extraction and richer contact hint extraction (emails, GitHub, LinkedIn, skills, seniority). Extends payload building to populate additional contact fields with extracted data.
Worker Resume Extraction Refactoring
apps/worker/src/five08/worker/crm/resume_profile_processor.py, apps/worker/src/five08/worker/crm/intake_form_processor.py, apps/worker/src/five08/worker/models.py
Removes embedded ResumeProfileExtractor implementation from resume_profile_processor; updates all modules to import and instantiate the shared extractor with explicit OpenAI configuration. Aliases ResumeExtractedProfile to SharedResumeExtractedProfile from shared module.
Test Robustness
tests/unit/test_crm.py
Adds type safety guard in test_link_discord_user_name_search to handle or-clause values robustly; validates dict items before accessing attributes.

Sequence Diagram(s)

sequenceDiagram
    participant DiscordBot as Discord Bot<br/>(CRMCog)
    participant Extractor as ResumeProfileExtractor
    participant OpenAI as OpenAI API /<br/>Heuristic
    participant Cache as Profile Cache
    
    DiscordBot->>Cache: Check cached profile
    alt Cache hit
        Cache-->>DiscordBot: Return cached profile
    else Cache miss
        DiscordBot->>Extractor: extract(resume_text)
        Extractor->>OpenAI: Try AI extraction<br/>(build prompt + chat)
        OpenAI-->>Extractor: JSON with fields
        alt AI success
            Extractor->>Extractor: Normalize fields
        else AI fails
            Extractor->>OpenAI: Fallback heuristic
            OpenAI-->>Extractor: Rule-based extraction
        end
        Extractor-->>DiscordBot: ResumeExtractedProfile
        DiscordBot->>Cache: Store profile
    end
    
    DiscordBot->>DiscordBot: extract_contact_hints()<br/>(emails, github, linkedin, skills)
    DiscordBot->>DiscordBot: build_resume_payload()<br/>(enrich with hints)
    DiscordBot-->>DiscordBot: Infer/create contact
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Poem

🐰 A resume walks in, we extract with care,
AI or heuristics—both solutions fair!
Shared code now glimmers across modules wide,
Skills and seniority sorted with pride.
From Discord to Worker, one pipeline thrives!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective: extracting resume parsing logic into a shared module for reuse across bot and worker services.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch michaelmwu/resume-llm-extract

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
packages/shared/src/five08/resume_extractor.py (1)

231-232: Silent fallback hides extraction failures.

Line 231 catches all exceptions and immediately falls back, but there’s no logging/telemetry. This makes production regressions in the LLM path very hard to detect and triage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/five08/resume_extractor.py` around lines 231 - 232, The
except Exception block that currently does "except Exception: return
self._heuristic_extract(text)" should log the caught exception (with stack trace
and contextual info like the input text or resume id) before falling back;
update the except handler in the method surrounding the LLM extraction to call
your telemetry/logging utility (e.g., logger.exception(...) or
telemetry.capture_exception(...)) including the exception object and context,
then still return self._heuristic_extract(text) so behavior is unchanged but
failures are observable.
apps/discord_bot/src/five08/discord_bot/cogs/crm.py (2)

3003-3009: Normalize extracted skills before writing payload.

Lines 3004-3008 currently join raw extracted skills directly. Running them through normalize_skill_list keeps create payload behavior consistent with the rest of this cog’s skill handling.

♻️ Suggested refinement
         if skills:
-            normalized_skills = [
-                str(item).strip() for item in skills if str(item).strip()
-            ]
+            normalized_skills = normalize_skill_list(
+                [str(item) for item in skills if str(item).strip()]
+            )
             if normalized_skills:
                 payload["skills"] = ", ".join(normalized_skills)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py` around lines 3003 -
3009, The extracted skills are currently trimmed and joined directly into
payload["skills"]; instead, pass the raw skills through the existing
normalize_skill_list helper before joining to keep behavior consistent. Update
the block that handles skills (the one referencing the local variable skills and
payload["skills"]) to call normalize_skill_list(skills) (or the appropriate
import) and then join the returned normalized list into payload["skills"];
ensure you still filter out empty strings as normalize_skill_list may return
them.

2911-2918: Use a stable digest cache key instead of hash(file_content).

Using Python’s runtime hash for bytes on Line 2912 can theoretically collide and return the wrong cached profile. A digest key avoids cross-content aliasing risk.

🔐 Suggested hardening
+import hashlib
@@
-        self._resume_profile_cache: tuple[int, ResumeExtractedProfile] | None = None
+        self._resume_profile_cache: tuple[bytes, ResumeExtractedProfile] | None = None
@@
-    def _extract_resume_profile(self, file_content: bytes) -> Any:
+    def _extract_resume_profile(self, file_content: bytes) -> ResumeExtractedProfile:
@@
-        cache_key = hash(file_content)
+        cache_key = hashlib.sha256(file_content).digest()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py` around lines 2911 -
2918, The cache currently uses Python's runtime hash(file_content) in the resume
extraction path (see _resume_profile_cache, cache_key and
resume_extractor.extract), which can collide; replace it with a stable
cryptographic digest of file_content (e.g., compute a sha256 or similar hex
digest from the bytes) and use that digest string as cache_key before
comparing/updating _resume_profile_cache so cached profiles reliably map to the
exact file bytes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py`:
- Around line 2909-2919: The synchronous call to
self.resume_extractor.extract(text) in _extract_resume_profile blocks the event
loop; change _extract_resume_profile into an async method that offloads the
blocking work to a thread executor (e.g., use
asyncio.get_running_loop().run_in_executor or loop.run_in_executor) to call
self.resume_extractor.extract(text) without blocking, keep the existing caching
logic (cache_key, self._resume_profile_cache) but adapt it to async, and then
update all callers (e.g., upload_resume() and confirm_create() button handler
and any uses of _extract_resume_contact_hints) to await _extract_resume_profile
so the extractor runs without stalling the Discord event loop.

In `@packages/shared/src/five08/resume_extractor.py`:
- Around line 333-346: The _extract_skills heuristic only matches headings on
their own and then takes just the very next line, which misses formats like
"Skills: Python, SQL" or cases where the first returned line is empty; update
_extract_skills to (1) expand the regex to also capture headings followed by
inline content (e.g., allow something after the heading on the same line) and
(2) when no inline content is found, gather the next few non-empty lines (or up
to a character limit, e.g., 500 chars) after match.end(), join them until a
blank line or unrelated section heading, then pass that combined text into
_normalize_skills; keep references to the same function name _extract_skills and
call _normalize_skills with the consolidated candidate skills string.

---

Nitpick comments:
In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py`:
- Around line 3003-3009: The extracted skills are currently trimmed and joined
directly into payload["skills"]; instead, pass the raw skills through the
existing normalize_skill_list helper before joining to keep behavior consistent.
Update the block that handles skills (the one referencing the local variable
skills and payload["skills"]) to call normalize_skill_list(skills) (or the
appropriate import) and then join the returned normalized list into
payload["skills"]; ensure you still filter out empty strings as
normalize_skill_list may return them.
- Around line 2911-2918: The cache currently uses Python's runtime
hash(file_content) in the resume extraction path (see _resume_profile_cache,
cache_key and resume_extractor.extract), which can collide; replace it with a
stable cryptographic digest of file_content (e.g., compute a sha256 or similar
hex digest from the bytes) and use that digest string as cache_key before
comparing/updating _resume_profile_cache so cached profiles reliably map to the
exact file bytes.

In `@packages/shared/src/five08/resume_extractor.py`:
- Around line 231-232: The except Exception block that currently does "except
Exception: return self._heuristic_extract(text)" should log the caught exception
(with stack trace and contextual info like the input text or resume id) before
falling back; update the except handler in the method surrounding the LLM
extraction to call your telemetry/logging utility (e.g., logger.exception(...)
or telemetry.capture_exception(...)) including the exception object and context,
then still return self._heuristic_extract(text) so behavior is unchanged but
failures are observable.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9bbee4 and 8b08297.

📒 Files selected for processing (7)
  • apps/discord_bot/src/five08/discord_bot/cogs/crm.py
  • apps/discord_bot/src/five08/discord_bot/config.py
  • apps/worker/src/five08/worker/crm/intake_form_processor.py
  • apps/worker/src/five08/worker/crm/resume_profile_processor.py
  • apps/worker/src/five08/worker/models.py
  • packages/shared/src/five08/resume_extractor.py
  • tests/unit/test_crm.py

Comment on lines +2909 to +2919
def _extract_resume_profile(self, file_content: bytes) -> Any:
"""Extract resume profile fields and cache per-file-content results."""
cache = self._resume_profile_cache
cache_key = hash(file_content)
if cache and cache[0] == cache_key:
return cache[1]

linkedin_matches: list[str] = []
for linkedin_url in linkedin_re.findall(snippet):
candidate = str(linkedin_url).strip().lower().rstrip("/")
if candidate and candidate not in linkedin_matches:
linkedin_matches.append(candidate)
text = file_content.decode("utf-8", errors="ignore")
profile = self.resume_extractor.extract(text)
self._resume_profile_cache = (cache_key, profile)
return profile
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check ResumeProfileExtractor.extract() signature and if it's sync or async
find . -name "resume_extractor.py" -type f | head -5

Repository: 508-dev/508-workflows

Length of output: 113


🏁 Script executed:

# Search for ResumeProfileExtractor and extract method definition
rg -n "class ResumeProfileExtractor|def extract" packages/shared/src/five08/resume_extractor.py -A 2

Repository: 508-dev/508-workflows

Length of output: 349


🏁 Script executed:

# Find all calls to _extract_resume_profile in crm.py to understand the call chain
rg -n "_extract_resume_profile" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2

Repository: 508-dev/508-workflows

Length of output: 631


🏁 Script executed:

# Check if the methods calling _extract_resume_profile are async
rg -n "async def upload_resume|async def _infer_contact_from_resume|async def _extract_resume_contact_hints" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 1 -A 3

Repository: 508-dev/508-workflows

Length of output: 441


🏁 Script executed:

# Verify the current state of _extract_resume_profile matches the snippet
sed -n '2909,2919p' apps/discord_bot/src/five08/discord_bot/cogs/crm.py

Repository: 508-dev/508-workflows

Length of output: 562


🏁 Script executed:

# Find all callers of _extract_resume_contact_hints
rg -n "_extract_resume_contact_hints" apps/discord_bot/src/five08/discord_bot/cogs/crm.py

Repository: 508-dev/508-workflows

Length of output: 363


🏁 Script executed:

# Check the implementation of _infer_contact_from_resume to see what it calls
sed -n '3068,3150p' apps/discord_bot/src/five08/discord_bot/cogs/crm.py

Repository: 508-dev/508-workflows

Length of output: 3293


🏁 Script executed:

# Check upload_resume to see the call chain
sed -n '5006,5100p' apps/discord_bot/src/five08/discord_bot/cogs/crm.py

Repository: 508-dev/508-workflows

Length of output: 3770


🏁 Script executed:

# Check if upload_resume calls _infer_contact_from_resume
rg -n "_infer_contact_from_resume" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2

Repository: 508-dev/508-workflows

Length of output: 565


🏁 Script executed:

# Verify the resume_extractor.extract implementation to confirm it does blocking I/O
sed -n '186,250p' packages/shared/src/five08/resume_extractor.py

Repository: 508-dev/508-workflows

Length of output: 2817


🏁 Script executed:

# Check the methods containing the other calls to _extract_resume_contact_hints (lines 2957 and 2967)
sed -n '2940,2980p' apps/discord_bot/src/five08/discord_bot/cogs/crm.py

Repository: 508-dev/508-workflows

Length of output: 1765


🏁 Script executed:

# Check if _extract_resume_name_hint or _build_resume_create_contact_payload are called from async functions
rg -n "_extract_resume_name_hint|_build_resume_create_contact_payload" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2

Repository: 508-dev/508-workflows

Length of output: 1486


🏁 Script executed:

# Find the context around line 1024 - what class/method is this in
sed -n '1000,1030p' apps/discord_bot/src/five08/discord_bot/cogs/crm.py

Repository: 508-dev/508-workflows

Length of output: 1386


🏁 Script executed:

# Find callers of _build_resume_link_contact_payload
rg -n "_build_resume_link_contact_payload" apps/discord_bot/src/five08/discord_bot/cogs/crm.py -B 2 -A 2

Repository: 508-dev/508-workflows

Length of output: 47


🏁 Script executed:

# Check what class contains confirm_create to understand the context better
sed -n '990,1015p' apps/discord_bot/src/five08/discord_bot/cogs/crm.py | head -30

Repository: 508-dev/508-workflows

Length of output: 1046


Synchronous extractor call can block the Discord event loop.

Line 2917 calls self.resume_extractor.extract(text) synchronously. This method makes a blocking OpenAI API call that stalls the Discord bot's event loop when invoked from async command handlers like upload_resume() and the confirm_create() button handler.

⚙️ Suggested fix direction
-    def _extract_resume_profile(self, file_content: bytes) -> Any:
+    async def _extract_resume_profile(self, file_content: bytes) -> ResumeExtractedProfile:
         """Extract resume profile fields and cache per-file-content results."""
         cache = self._resume_profile_cache
         cache_key = hash(file_content)
         if cache and cache[0] == cache_key:
             return cache[1]

         text = file_content.decode("utf-8", errors="ignore")
-        profile = self.resume_extractor.extract(text)
+        profile = await asyncio.to_thread(self.resume_extractor.extract, text)
         self._resume_profile_cache = (cache_key, profile)
         return profile
-    def _extract_resume_contact_hints(self, file_content: bytes) -> dict[str, Any]:
+    async def _extract_resume_contact_hints(self, file_content: bytes) -> dict[str, Any]:
         """Extract contact-identifying signals and shared resume fields from bytes."""
-        profile = self._extract_resume_profile(file_content)
+        profile = await self._extract_resume_profile(file_content)

Also update callers of _extract_resume_contact_hints to use await.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py` around lines 2909 -
2919, The synchronous call to self.resume_extractor.extract(text) in
_extract_resume_profile blocks the event loop; change _extract_resume_profile
into an async method that offloads the blocking work to a thread executor (e.g.,
use asyncio.get_running_loop().run_in_executor or loop.run_in_executor) to call
self.resume_extractor.extract(text) without blocking, keep the existing caching
logic (cache_key, self._resume_profile_cache) but adapt it to async, and then
update all callers (e.g., upload_resume() and confirm_create() button handler
and any uses of _extract_resume_contact_hints) to await _extract_resume_profile
so the extractor runs without stalling the Discord event loop.

Comment on lines +333 to +346
def _extract_skills(resume_text: str) -> list[str]:
match = re.search(
r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*$",
resume_text,
)
if not match:
return []

line_start = match.end()
tail = resume_text[line_start : line_start + 500]
first_line = tail.splitlines()[0] if tail else ""
if first_line:
return _normalize_skills(first_line)
return []
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Heuristic skills parsing misses common resume formats.

Line 335 only matches a heading-only line, and Line 343 reads only splitlines()[0], which is often empty right after the heading. This causes frequent false negatives for skills when the resume uses Skills: Python, ... or puts skills on the next line.

🔧 Proposed fix
     `@staticmethod`
     def _extract_skills(resume_text: str) -> list[str]:
         match = re.search(
-            r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*$",
+            r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*(.*)$",
             resume_text,
         )
         if not match:
             return []
 
-        line_start = match.end()
-        tail = resume_text[line_start : line_start + 500]
-        first_line = tail.splitlines()[0] if tail else ""
-        if first_line:
-            return _normalize_skills(first_line)
+        inline = (match.group(1) or "").strip()
+        if inline:
+            return _normalize_skills(inline)
+
+        line_start = match.end()
+        tail = resume_text[line_start : line_start + 500]
+        for line in tail.splitlines():
+            candidate = line.strip()
+            if candidate:
+                return _normalize_skills(candidate)
         return []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _extract_skills(resume_text: str) -> list[str]:
match = re.search(
r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*$",
resume_text,
)
if not match:
return []
line_start = match.end()
tail = resume_text[line_start : line_start + 500]
first_line = tail.splitlines()[0] if tail else ""
if first_line:
return _normalize_skills(first_line)
return []
def _extract_skills(resume_text: str) -> list[str]:
match = re.search(
r"(?im)^\s*(?:skills|technical\s+skills|technologies)\s*[:\-]?\s*(.*)$",
resume_text,
)
if not match:
return []
inline = (match.group(1) or "").strip()
if inline:
return _normalize_skills(inline)
line_start = match.end()
tail = resume_text[line_start : line_start + 500]
for line in tail.splitlines():
candidate = line.strip()
if candidate:
return _normalize_skills(candidate)
return []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/shared/src/five08/resume_extractor.py` around lines 333 - 346, The
_extract_skills heuristic only matches headings on their own and then takes just
the very next line, which misses formats like "Skills: Python, SQL" or cases
where the first returned line is empty; update _extract_skills to (1) expand the
regex to also capture headings followed by inline content (e.g., allow something
after the heading on the same line) and (2) when no inline content is found,
gather the next few non-empty lines (or up to a character limit, e.g., 500
chars) after match.end(), join them until a blank line or unrelated section
heading, then pass that combined text into _normalize_skills; keep references to
the same function name _extract_skills and call _normalize_skills with the
consolidated candidate skills string.

@michaelmwu michaelmwu closed this Mar 3, 2026
@michaelmwu michaelmwu deleted the michaelmwu/resume-llm-extract branch May 8, 2026 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant