feat: add job match extraction and ranked candidate search#142
Conversation
Add complete job matching pipeline: - **DB Migration**: Extend people table with profile fields (contact_type, is_member, address_country, address_city, timezone, seniority, linkedin) and separate skills (TEXT[] with GIN index) + skill_attrs (JSONB) columns. - **Job Requirements Extraction**: OpenAI-powered job_match.py parser with regex hints. Extracts required/preferred skills, seniority, location, timezones. Logs to webhook if unconfigured. - **Candidate Search**: PostgreSQL-based candidate_search.py with multi-level ranking by member status, required skills, skill strength, preferred skills. Seniority scoring in Python (1.0 exact, 0.7 one level up). - **Discord Command**: /match-candidates slash command parses job postings, searches ranked candidates, replies with top 10 results (member/prospect label, contact, CRM link, LinkedIn, resume, matched skills). Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughJob-posting text is analyzed via an OpenAI pipeline to extract structured requirements, then PostgreSQL-backed candidate search ranks people by skills, seniority, membership, and location. Worker sync and DB schema are extended to surface skills, resumes, timezone, seniority, and LinkedIn. Discord bot exposes a /match_candidates command to run this flow. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Discord as "Discord Bot"
participant Job as "Job Match\n(OpenAI)"
participant Search as "Candidate Search"
participant DB as "PostgreSQL DB"
User->>Discord: /match_candidates (posting_text)
Discord->>Discord: defer interaction
Discord->>Job: extract_job_requirements(posting_text, api_key)
Job->>Job: build prompt & hints
Job->>Job: call OpenAI API
Job-->>Discord: JobRequirements (skills, seniority, location, timezones)
alt extraction failed
Discord-->>User: Error response
else no required skills
Discord-->>User: Inform no required skills found
else success
Discord->>Search: search_candidates(requirements, limit=10)
Search->>DB: SQL with CTEs/lateral (skill match, scores, filters)
DB-->>Search: candidate rows
Search->>Search: compute seniority_score & assemble CandidateMatch list
Search-->>Discord: list[CandidateMatch]
Discord->>Discord: format paginated messages
Discord-->>User: candidate result pages
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (1)
apps/worker/src/five08/worker/migrations/versions/20260304_0100_add_job_match_fields_to_people.py (1)
54-70: Use concurrent index operations to avoid write-blocking locks.Index creation on a large
peopletable can block writes during migration. Usepostgresql_concurrently=Truewithautocommit_block()in both upgrade and downgrade.Recommended migration pattern
op.add_column("people", sa.Column("latest_resume_id", sa.Text(), nullable=True)) op.add_column("people", sa.Column("latest_resume_name", sa.Text(), nullable=True)) + with op.get_context().autocommit_block(): - op.create_index( + op.create_index( "idx_people_skills", "people", ["skills"], postgresql_using="gin", + postgresql_concurrently=True, ) - op.create_index("idx_people_is_member", "people", ["is_member"]) - op.create_index("idx_people_seniority", "people", ["seniority"]) - op.create_index("idx_people_address_country", "people", ["address_country"]) + op.create_index( + "idx_people_is_member", + "people", + ["is_member"], + postgresql_concurrently=True, + ) + op.create_index( + "idx_people_seniority", + "people", + ["seniority"], + postgresql_concurrently=True, + ) + op.create_index( + "idx_people_address_country", + "people", + ["address_country"], + postgresql_concurrently=True, + ) def downgrade() -> None: """Remove job-match fields from the people table.""" + with op.get_context().autocommit_block(): - op.drop_index("idx_people_address_country", table_name="people") - op.drop_index("idx_people_seniority", table_name="people") - op.drop_index("idx_people_is_member", table_name="people") - op.drop_index("idx_people_skills", table_name="people") + op.drop_index( + "idx_people_address_country", + table_name="people", + postgresql_concurrently=True, + ) + op.drop_index( + "idx_people_seniority", + table_name="people", + postgresql_concurrently=True, + ) + op.drop_index( + "idx_people_is_member", + table_name="people", + postgresql_concurrently=True, + ) + op.drop_index( + "idx_people_skills", + table_name="people", + postgresql_concurrently=True, + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/worker/src/five08/worker/migrations/versions/20260304_0100_add_job_match_fields_to_people.py` around lines 54 - 70, The migration uses op.create_index and op.drop_index in upgrade() and downgrade() and must perform them without write-blocking locks; modify both upgrade() and downgrade() to run index operations inside the migration's autocommit_block() context and call op.create_index(..., postgresql_concurrently=True) for each index and op.drop_index(..., postgresql_concurrently=True) for each drop so Postgres creates/drops indexes concurrently on the people table; ensure you still keep the same index names ("idx_people_skills", "idx_people_is_member", "idx_people_seniority", "idx_people_address_country") and the current column lists when applying these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/discord_bot/src/five08/discord_bot/cogs/crm.py`:
- Around line 6211-6328: The match_candidates command (function
match_candidates) currently returns on several early/error paths without calling
the CRM audit helper; call the audit helper _audit_command on every exit: on
parse failure (in the RuntimeError except), on no required_skills early return,
on candidate search exception, and on successful completion after sending
messages; include context such as interaction.user/id, the posting summary or
id, the outcome (error message or success) and metadata like len(candidates) and
requirements (e.g., required_skills, title) so the worker audit ingest receives
a consistent record for use by apps/worker/src/five08/backend/api.py.
In `@apps/worker/src/five08/worker/crm/people_sync.py`:
- Around line 344-353: The loop in people_sync.py appends raw skill keys to
skills but stores strengths under raw.casefold() in attrs, causing lookups like
p.skill_attrs ->> s to miss values; make the key format consistent by either
storing strengths under the exact raw string or normalizing both collections the
same way — e.g., change the attrs assignment in the block inside the for-loop
(the code that calls attrs[raw.casefold()] = ...) so it uses the same key as
appended to skills (attrs[raw] = ...) or instead append raw.casefold() to
skills; ensure the chosen normalization (raw or raw.casefold()) is used for both
skills.append(...) and attrs[...] so subsequent ranking reads find the strength
values.
In `@packages/shared/src/five08/audit.py`:
- Around line 152-158: The VALUES clause in the upsert_person SQL has one fewer
%s placeholder than the 21 columns and parameters supplied; update the VALUES
string in the upsert_person function (the INSERT ... VALUES definition) to
include the missing %s so there are 21 placeholders matching the parameter tuple
used later, ensuring the parameter count aligns with the INSERT column list.
In `@packages/shared/src/five08/candidate_search.py`:
- Around line 188-192: The secondary sort currently uses key=lambda c: (not
c.is_member, -c.required_skill_score, -c.seniority_score) which omits the
primary/SQL match metrics and therefore lets seniority override stronger skill
matches; update the sort key in the results.sort call (the lambda referenced
above) to include the primary skill-match metrics (e.g., c.sql_score or both
c.required_skill_score and c.preferred_skill_score) before seniority so primary
ranking is preserved — for example include (-c.sql_score, not c.is_member,
-c.required_skill_score, -c.preferred_skill_score, -c.seniority_score) or
otherwise ensure the primary score fields appear earlier in the tuple so
seniority only breaks ties.
In `@packages/shared/src/five08/job_match.py`:
- Around line 29-39: The seniority mapping keys in _SENIORITY_KEYWORDS must
match the normalization performed by _regex_hints (no spaces/hyphens), so
replace space/hyphen variants like "entry level" and "entry-level" with their
normalized form "entrylevel" (and any other affected keys) so matched hints
aren't lost; update the other similar seniority mapping block (lines noted
around the second definition) the same way, ensuring keys like "entrylevel",
"midlevel", "senior", "staff", "principal" are present and map to the intended
canonical values.
- Around line 201-203: Update the logic that applies the US-only regex signal so
it always overrides any LLM-derived location value: replace the conditional that
only sets location_type when None with an unconditional assignment when
hints.get("us_only_detected") is truthy (i.e., if hints.get("us_only_detected"):
location_type = "us_only"). Ensure you modify the code handling location_type
(the block referencing hints.get("us_only_detected") and the variable
location_type) so the "us_only" signal cannot be overridden by an LLM-returned
value like "remote_any".
- Around line 192-200: Normalize and trim enum strings before validation: for
raw_seniority, if non-null call .strip() and .lower(), compare against a set of
lowercased entries derived from SENIORITY_ORDER, and if found assign the
canonical value from SENIORITY_ORDER (preserve original canonical casing) to
seniority; for raw_location_type call .strip() and .lower() then compare to the
allowed lowercase values ("us_only","timezone_preferred","remote_any") and
assign the canonical lowercase string to location_type; ensure you handle None
safely and avoid dropping values that differ only by case or trailing/leading
whitespace.
- Line 177: The code currently sets raw_content via
response.choices[0].message.content without validating the response shape;
update the logic around raw_content in job_match.py (where raw_content is
assigned) to defensively verify response.choices exists and is non-empty, that
response.choices[0].message is present, and that message.content is not
None/empty before using it—if any of these are missing, raise or log a clear,
descriptive error (including the actual response payload) instead of silently
defaulting to "", so downstream JSON parsing failures surface the real
API-structure issue.
---
Nitpick comments:
In
`@apps/worker/src/five08/worker/migrations/versions/20260304_0100_add_job_match_fields_to_people.py`:
- Around line 54-70: The migration uses op.create_index and op.drop_index in
upgrade() and downgrade() and must perform them without write-blocking locks;
modify both upgrade() and downgrade() to run index operations inside the
migration's autocommit_block() context and call op.create_index(...,
postgresql_concurrently=True) for each index and op.drop_index(...,
postgresql_concurrently=True) for each drop so Postgres creates/drops indexes
concurrently on the people table; ensure you still keep the same index names
("idx_people_skills", "idx_people_is_member", "idx_people_seniority",
"idx_people_address_country") and the current column lists when applying these
changes.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
apps/discord_bot/src/five08/discord_bot/cogs/crm.pyapps/worker/src/five08/worker/crm/people_sync.pyapps/worker/src/five08/worker/migrations/versions/20260304_0100_add_job_match_fields_to_people.pypackages/shared/src/five08/audit.pypackages/shared/src/five08/candidate_search.pypackages/shared/src/five08/job_match.py
There was a problem hiding this comment.
Pull request overview
Adds an end-to-end job matching pipeline: extract structured requirements from a pasted job posting, search/rank candidates from the people cache in Postgres, and expose the workflow via a Discord slash command.
Changes:
- Introduces OpenAI-backed job requirement extraction (
job_match.py) with regex hinting + coercion into a normalizedJobRequirements. - Adds a ranked Postgres candidate search (
candidate_search.py) and extends the people cache schema + CRM sync to store matchable profile/skill fields. - Adds a Discord
/match-candidatescommand to run extraction + search and render top candidates.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/shared/src/five08/job_match.py | New job posting → normalized requirements extraction (LLM + regex hints). |
| packages/shared/src/five08/candidate_search.py | New SQL-backed candidate scoring query + Python post-processing. |
| packages/shared/src/five08/audit.py | Extends PersonRecord and people-cache upsert to persist new profile/skill fields. |
| apps/worker/src/five08/worker/migrations/versions/20260304_0100_add_job_match_fields_to_people.py | Adds new people columns + indexes for skills/member/location filtering. |
| apps/worker/src/five08/worker/crm/people_sync.py | Extends CRM → people-cache sync to populate new fields (skills, seniority, resume IDs, etc.). |
| apps/discord_bot/src/five08/discord_bot/cogs/crm.py | Adds /match-candidates command to run extraction and render ranked results. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| requirements = await asyncio.get_event_loop().run_in_executor( | ||
| None, | ||
| lambda: extract_job_requirements( | ||
| posting, | ||
| api_key=settings.openai_api_key, |
There was a problem hiding this comment.
Within an async command handler, asyncio.get_event_loop().run_in_executor(...) is deprecated-style and inconsistent with the rest of this cog (which uses asyncio.to_thread(...) for blocking work). Switching to asyncio.to_thread(...) avoids event-loop selection issues and matches existing patterns in this file.
|
|
||
| seniority_match = _SENIORITY_RE.search(text) | ||
| if seniority_match: | ||
| raw = seniority_match.group(1).lower().replace("-", "").replace(" ", "") |
There was a problem hiding this comment.
This regex hint normalization removes spaces/hyphens (e.g. "entry level" -> "entrylevel"), but _SENIORITY_KEYWORDS is keyed by "entry level"/"entry-level" and does not include "entrylevel". As a result, entry-level postings won't set seniority_hint. Consider normalizing the map keys the same way or normalizing after lookup.
| raw = seniority_match.group(1).lower().replace("-", "").replace(" ", "") | |
| raw = seniority_match.group(1).lower() |
| @app_commands.command( | ||
| name="match-candidates", | ||
| description="Analyze a job posting and return ranked matching candidates from the people cache.", | ||
| ) | ||
| @app_commands.describe(posting="Paste the full job posting text here") |
There was a problem hiding this comment.
PR description says the command "replies to thread", but this implementation posts followups to whatever channel the command is invoked in and doesn’t enforce/ensure a thread context. If the thread requirement is important, consider checking interaction.channel type and creating a thread (or responding ephemerally with instructions) when invoked outside a thread.
| """Extract structured job requirements from a posting using OpenAI. | ||
|
|
||
| Raises RuntimeError if OpenAI is not configured (also logs to webhook). | ||
| Uses regex pre-scan as cheap hints injected into the LLM prompt. | ||
| """ |
There was a problem hiding this comment.
This new extraction pipeline has multiple edge-cases (LLM fenced JSON, invalid JSON, regex-vs-LLM disagreements, seniority/location coercion) but there are no unit tests covering them. Adding focused unit tests for extract_job_requirements will help keep /match-candidates stable as prompts/models change.
| %s, %s, %s, %s, %s, | ||
| %s, %s, %s, %s, %s, | ||
| %s, %s, %s, %s, %s, | ||
| %s, %s, %s, %s, |
There was a problem hiding this comment.
The INSERT column list includes 21 columns (ending with sync_status) but the VALUES placeholder list only has 20 %s entries, so this query will fail at runtime with a bind/parameter count error. Add the missing placeholder (or remove a column) so placeholders and parameters match exactly.
| %s, %s, %s, %s, | |
| %s, %s, %s, %s, %s, |
|
|
||
| Ranking priority: | ||
| 1. Members before prospects. | ||
| 2. Location match (hard filter when us_only, else soft signal). |
There was a problem hiding this comment.
The ranking description says location match is a soft signal when not us_only, but the SQL only applies a hard filter when us_only and otherwise doesn’t score location/timezone at all. Either implement the stated soft location scoring (e.g. ORDER BY terms for country/timezone match) or update the docstring to match the actual behavior.
| 2. Location match (hard filter when us_only, else soft signal). | |
| 2. US-only location restriction when enabled (hard filter); otherwise, | |
| location/timezone are not used in ranking. |
| # Regex hints — used to pre-scan the posting and inform the LLM prompt. | ||
| # These are NOT a standalone fallback; they're cheap signals the LLM can use. |
There was a problem hiding this comment.
The header comment says these regex hints are "NOT a standalone fallback", but later the code forces location_type = "us_only" when the regex matched and the LLM returned null. Either update the comment to reflect the actual fallback behavior or remove the fallback if it's not intended.
| # Regex hints — used to pre-scan the posting and inform the LLM prompt. | |
| # These are NOT a standalone fallback; they're cheap signals the LLM can use. | |
| # Regex hints — used to pre-scan the posting and inform the LLM prompt, and | |
| # in limited cases may be used as a lightweight fallback (e.g., defaulting | |
| # location_type to "us_only" when the LLM output is null but regex matches). |
| if not skills and isinstance(raw_skill_attrs, dict): | ||
| for key, payload in raw_skill_attrs.items(): | ||
| raw = _text_or_none(key) | ||
| if raw and isinstance(payload, dict): | ||
| skills.append(raw) | ||
| value = payload.get("strength") | ||
| parsed = self._coerce_int(value) | ||
| if parsed is not None: | ||
| attrs[raw.casefold()] = max( | ||
| attrs.get(raw.casefold(), 0), parsed | ||
| ) |
There was a problem hiding this comment.
normalize_skill_payload(...) already normalizes skills/attrs from both skills and dict-like cSkillAttrs. This fallback path re-adds skills from cSkillAttrs without normalization, which can introduce non-canonical skill strings into people.skills and reduce match accuracy. Consider removing this fallback or normalizing raw before appending/using it as a key.
| if not skills and isinstance(raw_skill_attrs, dict): | |
| for key, payload in raw_skill_attrs.items(): | |
| raw = _text_or_none(key) | |
| if raw and isinstance(payload, dict): | |
| skills.append(raw) | |
| value = payload.get("strength") | |
| parsed = self._coerce_int(value) | |
| if parsed is not None: | |
| attrs[raw.casefold()] = max( | |
| attrs.get(raw.casefold(), 0), parsed | |
| ) |
| # Secondary sort: among equal SQL scores, rank by seniority alignment. | ||
| # This is stable (preserves SQL order within same seniority bucket). | ||
| results.sort( | ||
| key=lambda c: (not c.is_member, -c.required_skill_score, -c.seniority_score) |
There was a problem hiding this comment.
The final Python results.sort(...) can reorder candidates across different required_matched / preferred_matched buckets because the sort key does not include those fields (and can even override the SQL ORDER BY priority). If the intent is only to break ties by seniority, restrict the sort to seniority (stable) or include the matched-count metrics (e.g. len(matched_required_skills), len(matched_preferred_skills)) in the key so it preserves the documented ranking order.
| # Secondary sort: among equal SQL scores, rank by seniority alignment. | |
| # This is stable (preserves SQL order within same seniority bucket). | |
| results.sort( | |
| key=lambda c: (not c.is_member, -c.required_skill_score, -c.seniority_score) | |
| # Final in-Python ranking: | |
| # 1. Members before prospects. | |
| # 2. Required skill count matched (descending). | |
| # 3. Required skill strength score (descending). | |
| # 4. Preferred skill count matched (descending). | |
| # 5. Seniority score (descending). | |
| results.sort( | |
| key=lambda c: ( | |
| not c.is_member, | |
| -len(c.matched_required_skills), | |
| -c.required_skill_score, | |
| -len(c.matched_preferred_skills), | |
| -c.seniority_score, | |
| ) |
| def search_candidates( | ||
| settings: SharedSettings, | ||
| requirements: JobRequirements, | ||
| *, | ||
| limit: int = 10, |
There was a problem hiding this comment.
This ranking/query logic is subtle (SQL scoring + Python post-sort + us_only filtering) but there are no unit tests ensuring the documented ranking priority is preserved and filtering behaves as expected. Adding a small test matrix for search_candidates would prevent regressions as the query evolves.
- audit.py: fix VALUES placeholder count (20 → 21 to match 21 columns) - people_sync.py: store skill strength attrs under raw key (not casefold) to match skills array keys for DB lookup - job_match.py: fix seniority hint lookup (use normalized keys without spaces/hyphens); add defensive check for empty/missing response.choices content; normalize enum strings before validation (strip/lower); regex US-only detection always overrides LLM location value - candidate_search.py: add timezone_matched scoring in SQL ORDER BY when preferred_timezones provided; fix secondary Python sort to preserve primary ranking metrics before seniority - crm.py: add _audit_command calls on all exit paths (extract error, no skills, search error, success) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…omments - Replace asyncio.get_event_loop().run_in_executor() with asyncio.to_thread() in match_candidates command handler - Fix candidate_search.py docstring: clarify US-only is a hard filter and timezone is a soft signal - Fix job_match.py module comment: clarify regex hints as LLM context with US-only as override guard (not standalone fallback) - Update match_candidates command description to mention public reply - Add tests/unit/test_job_match.py: 23 tests covering _regex_hints, _parse_llm_response, _coerce_str_list, extract_job_requirements, and _build_prompt - Add tests/unit/test_candidate_search.py: 14 tests covering _seniority_score edge cases, early-return on empty required_skills, row-to-CandidateMatch mapping, secondary sort order, and empty DB Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Use postgresql_concurrently=True inside autocommit_block() for all create_index and drop_index calls in the job-match fields migration, preventing table-level write locks on the people table during deploy. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Remove the posting parameter. The command now: - Rejects invocations outside a thread with an ephemeral error. - Reads the thread's starter_message as the job posting text. - Falls back to fetching the parent message by thread ID if starter_message is not cached (TextChannel threads only). - Returns an ephemeral error if the opener is missing or empty. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Forum post IDs equal their thread IDs, so thread.fetch_message(thread.id) works for both text-channel threads and forum posts without needing to go through the parent channel. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Thread tags (e.g. "Python", "Remote", "Senior") are prepended as "Thread tags: ..." before the starter message content so the LLM naturally incorporates them into skill/seniority/location extraction. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Summary
Adds a complete job matching pipeline: OpenAI-powered job requirement extraction, PostgreSQL-based ranked candidate search with skill/seniority/location matching, and a Discord
/match-candidatescommand to find top candidates from the people cache for a given job posting.Key Changes
peopletable with profile fields and separateskills(TEXT[] with GIN index) +skill_attrs(JSONB) for efficient PostgreSQL skill matching.job_match.py): Parses job postings with OpenAI (structured JSON output), uses regex pre-scan as hints. Reports configuration errors to webhook.candidate_search.py): Ranked PostgreSQL query (members first, location match, required/preferred skills with strength scoring, seniority alignment)./match-candidatescommand replies to thread with top 10 candidates including member/prospect status, contact info, CRM links, LinkedIn, resume names, and matched skills.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores / Database
Tests