
JobSearchOptimizer

A daily-run job search pipeline (Steps 1-6) that fetches job postings, disqualifies irrelevant roles through a multi-gate filter, scores the remainder with an LLM (Gemini free tier), and ranks them into a priority list. Steps 7+ (outreach, trajectory tracking, analytics) are described for context but are not included in this repository.

Python 3.10+ | License: MIT | Cost: Free | Platform: Windows


Current Status

Pipeline state: post-recovery patch; first clean validation completed; still experimental and under human review.

This project went through a late-April 2026 recovery after overly aggressive filtering/scoring logic collapsed priority output from dozens of rows per day to 1-5. One clean forced validation run has completed successfully, but this should not be treated as proof of long-term stability.

Human review is requested before further automation expansion.

Primary review areas:

  • hard reject vs soft reject separation
  • remote/hybrid location safety
  • AP/staff accountant/payroll systems scoring
  • prevention of silent data loss
  • LLM parse/API failure handling
  • whether the system should produce a review queue for borderline or failed jobs

See REVIEW_REQUEST.md for the full reviewer brief and open questions.


What This Is

This is a real-world job search pipeline built over roughly five weeks of daily use by a billing/AP/payroll systems/staff accountant specialist targeting remote and regional roles. It was refined through failures and is now open-sourced to share the architecture and invite community input.

The core insight: most job search tools just list jobs. This pipeline scores, ranks, and learns -- cutting through noise so you spend time applying to roles that match, not reading 50 postings to find 3 worth sending a resume to.

What This Is NOT

  • A plug-and-play tool. The scoring rubric, keyword weights, disqualifiers, and cluster definitions are all tuned for one person's profile.
  • A web scraper wrapper. JobSpy handles the actual scraping; this pipeline adds filtering, scoring, ranking, and data retention on top of it.
  • Production software. It runs locally on Windows via PowerShell and Python. No Docker, no cloud deployment, no database -- just files.

The Problem It Solves

Early versions of this pipeline silently overwrote historical data on every run. After the discovery that 30+ days of scoring data had been lost, every script was audited and a timestamped archive pattern was applied uniformly. The full findings are documented in PIPELINE_DATA_AUDIT.md.

Beyond data retention, the pipeline addresses three scoring problems:

  1. False positives -- jobs that sound related (e.g., "billing" in a nursing context) pass naive keyword filters and waste LLM quota (see the toy example below).
  2. False negatives -- strong-fit jobs with unusual title wording get buried by keyword-only scoring.
  3. Static weights -- a system that scores "billing coordinator" the same in month 1 as month 6, even after learning which clusters produce interviews.
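
As a toy illustration of the first problem (both titles are hypothetical), a naive substring filter passes these two postings even though only one is relevant:

# Naive keyword check -- the kind of filter the pipeline's gates replace.
def naive_match(text: str) -> bool:
    return "billing" in text.lower()

naive_match("Billing Specialist - Accounts Receivable")   # True -- relevant
naive_match("RN Case Manager - reviews patient billing")  # True -- false positive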

Architecture

                 ┌──────────────────┐
                 │  fetch_jobs.py   │  JobSpy scrapes Indeed, LinkedIn, etc.
                 └────────┬─────────┘
                          │ new_jobs_YYYY-MM-DD_HHMM.csv
                 ┌────────▼─────────┐
                 │  build_queue.py  │  Dedup, standardize, archive
                 └────────┬─────────┘
                          │ queue.csv
                 ┌────────▼─────────┐
                 │  pre_scoring.py  │  Cluster, zone, environment, pre_score (0-1)
                 └────────┬─────────┘
                          │ queue_prescored.csv
                 ┌────────▼─────────┐
                 │  disqualify.py   │  14-gate hard-rejection engine
                 └────────┬─────────┘
                          │ (adds is_disqualified flag)
                 ┌────────▼─────────┐
                 │ prepare_scoring  │  Split into batches of 8; skip disqualified
                 │   _batch.py      │  and already-rejected URLs
                 └────────┬─────────┘
                          │ batch_*.md
                 ┌────────▼─────────┐
                 │  gemini_score.py │  Gemini 2.5 Flash Lite via system_instruction
                 │  grok_score.py   │  (xAI Grok fallback if Gemini quota exhausted)
                 └────────┬─────────┘
                          │ queue_scored.csv  (TIER 1 / TIER 2 / MONITOR / SKIP)
                 ┌────────▼─────────┐
                 │ priority_scoring │  Composite formula + time decay → 0-100
                 │      .py         │  Tier 1 >= 85, Tier 2 >= 60
                 └────────┬─────────┘
                          │ priority.csv
                 ┌────────▼─────────┐
                 │ generate_resume  │  Tailored .docx for Tier 1 jobs (optional)
                 └──────────────────┘

Steps 1-6 run daily via run_daily.ps1. Steps 7+ cover outreach, trajectory tracking, and analytics; they are not included in this repository.
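
The actual runner is run_daily.ps1 (PowerShell); purely as an illustration of the sequencing, an equivalent Python driver for the diagram above might look like this (script paths match the repository structure; the error handling is an assumption):

import subprocess
import sys

# Diagram order: each script reads the previous step's output file.
STEPS = [
    "JobFetcher/fetch_jobs.py",          # -> new_jobs_YYYY-MM-DD_HHMM.csv
    "Scripts/build_queue.py",            # -> queue.csv
    "Scripts/pre_scoring.py",            # -> queue_prescored.csv
    "Scripts/disqualify.py",             # adds is_disqualified flag
    "Scripts/prepare_scoring_batch.py",  # -> batch_*.md
    "Scripts/gemini_score.py",           # -> queue_scored.csv
    "Scripts/priority_scoring.py",       # -> priority.csv
]

for step in STEPS:
    if subprocess.run([sys.executable, step]).returncode != 0:
        sys.exit(f"Step failed: {step}")  # stop rather than rank stale data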


Key Innovations

1. Multi-gate disqualifier before LLM scoring

disqualify.py runs a 14-gate hard-rejection engine before any job reaches Gemini. Gates catch:

  • No billing/AR signal in title or description at all
  • Wrong role archetype (clinical coder, benefits admin, etc.)
  • Executive seniority mismatch
  • Healthcare-specific EMR/EHR requirements the candidate lacks
  • Out-of-state hybrid or onsite roles
  • Required tool stacks the candidate hasn't used
  • Specialty niches outside the candidate's experience

Result: 76-92% of fetched jobs are rejected before the LLM sees them, preserving free-tier quota for roles that actually warrant evaluation.
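
The gate pattern is simple: each gate returns a reject reason or nothing, and the first hit wins. A minimal sketch of the pattern (gate names and keywords here are illustrative, not the repository's actual 14 gates):

import re

def gate_no_billing_signal(job):
    text = f"{job['title']} {job['description']}".lower()
    if not re.search(r"billing|accounts receivable|invoic", text):
        return "no_billing_ar_signal"

def gate_wrong_archetype(job):
    if re.search(r"clinical coder|benefits admin", job["title"].lower()):
        return "wrong_archetype"

GATES = [gate_no_billing_signal, gate_wrong_archetype]  # ...12 more in the real engine

def disqualify(job):
    for gate in GATES:
        reason = gate(job)
        if reason:
            return reason  # hard reject -- the job never reaches the LLM
    return None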

2. Rubric-driven LLM scoring with clean separation

The full scoring rubric lives in State/scoring_rubric.md and is injected as system_instruction on every Gemini call. Batch files contain only job data. This means:

  • Swapping the scorer (Gemini, Grok, Claude, local model) requires changing one variable, not rewriting the prompt.
  • The rubric is version-controlled and human-readable.
  • Feedback constraints (State/feedback.json) are appended to the rubric per-run without modifying the base file.
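
Concretely, a scoring call might look like the sketch below (google-genai SDK; the batch filename is hypothetical, while the rubric path and model name match the repository):

from google import genai
from google.genai import types

client = genai.Client()  # picks up GOOGLE_API_KEY from the environment

rubric = open("State/scoring_rubric.md", encoding="utf-8").read()
batch = open("batch_001.md", encoding="utf-8").read()  # job data only, no rubric

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=batch,
    # The rubric rides along as system_instruction, never inside the batch.
    config=types.GenerateContentConfig(system_instruction=rubric),
)
print(response.text)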

3. Composite priority formula

priority_scoring.py combines three signals into a 0-100 score:

final = (pre_score * 0.35) + (priority_base * 0.50) + (cluster_weight * 0.15)

The script then applies time decay (jobs posted more than 7 days ago are penalized). Gemini's rubric judgment (priority_base) is the dominant signal at weight 0.50. The weights are tunable via State/scoring_weights.json.
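
A sketch of the whole calculation, assuming pre_score (0-1 in the diagram) is rescaled to 0-100 before blending; the exact decay curve below is an assumption, not the repository's:

def final_score(pre_score, priority_base, cluster_weight, days_old):
    # Weights from State/scoring_weights.json (0.35 / 0.50 / 0.15 by default);
    # all three inputs assumed on a 0-100 scale here.
    base = pre_score * 0.35 + priority_base * 0.50 + cluster_weight * 0.15
    if days_old > 7:
        base *= max(0.5, 1 - 0.05 * (days_old - 7))  # assumed decay shape
    return round(base, 1)  # Tier 1 >= 85, Tier 2 >= 60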

4. Adaptive cluster weights from passive signals

The pipeline supports Bayesian weight updates: it scans ResumeVersions/ for company-specific PDFs (a strong passive signal that an application was submitted), matches them to past priority outputs, and adjusts scoring_weights.json with Bayesian dampening -- rewarding clusters that generate applications and penalizing ignored ones. The update script relies on excluded personal-data folders and is not included in this repository.
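
The update script itself is excluded, but a dampened update of roughly this shape would match the description (the prior strength and signal definitions are assumptions):

def update_cluster_weight(old_weight, applications, opportunities, prior_strength=20):
    # applications: priority rows in this cluster matched to a ResumeVersions/ PDF
    # opportunities: total priority rows seen for the cluster
    # A handful of observations barely moves the weight; many observations dominate.
    observed = applications / max(opportunities, 1)
    return (prior_strength * old_weight + opportunities * observed) / (prior_strength + opportunities)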

5. Timestamped archive everywhere

Every intermediate file is archived with _YYYY-MM-DD_HHMM timestamps. The current design attempts to prevent silent overwrites by archiving key intermediate files and recording failure states. See PIPELINE_DATA_AUDIT.md for the full per-file retention map.
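
The archive step is a small, repeatable pattern; a minimal sketch (the helper name and directory argument are illustrative):

import shutil
from datetime import datetime
from pathlib import Path

def archive(path, archive_dir):
    # Copy the file aside with the _YYYY-MM-DD_HHMM suffix before it is overwritten.
    src = Path(path)
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M")
    dest = Path(archive_dir) / f"{src.stem}_{stamp}{src.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return dest

# e.g. archive("Automation/queue_prescored.csv", "Automation/PRESCORED_ARCHIVE")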


Repository Structure

JobSearchOptimizer/
├── JobFetcher/
│   └── fetch_jobs.py                       # Job scraper (JobSpy + Adzuna fallback)
├── Scripts/
│   ├── build_queue.py                      # Dedup + standardize -> queue.csv
│   ├── pre_scoring.py                      # Keyword scoring, zone/cluster
│   ├── disqualify.py                       # Multi-gate filter
│   ├── prepare_scoring_batch.py            # Batch prep for LLM
│   ├── gemini_score.py                     # Gemini scorer (primary)
│   ├── grok_score.py                       # Grok scorer (fallback)
│   ├── priority_scoring.py                 # Composite formula + decay
│   ├── generate_resume.py                  # Tailored .docx builder (see note below)
│   ├── validate_queue_scored.py            # Schema + range checks
│   ├── prune_reject_log.py                 # Prune stale rejected URLs
│   └── priority_drift_monitor.py           # Detect scoring drift over time
├── State/
│   ├── scoring_rubric.md                   # LLM system instruction (the rubric)
│   ├── scoring_weights.json                # Per-cluster priority weights
│   └── feedback.json                       # Session-level overrides (EXCLUDE)
├── Automation/                             # Pipeline output -- EXCLUDE from repo
├── Profile/                                # Resume bullets -- EXCLUDE from repo
├── ResumeVersions/                         # Application PDFs -- EXCLUDE from repo
├── run_daily.ps1                           # Headless daily pipeline runner (Steps 1-6)
├── PIPELINE.md                             # Full technical reference
└── PIPELINE_DATA_AUDIT.md                  # Per-file data retention audit

Note on generate_resume.py (Step 6): This script requires build_resume_base.py and Profile/experience_library.json, which contain personal resume data and are not included in this repository. The script exits gracefully with an error message if those files are missing -- the pipeline runner warns and continues. Steps 1-5 run fully without it.
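
The guard itself can be as small as this sketch (the exact wording and paths are assumptions):

import sys
from pathlib import Path

REQUIRED = ["build_resume_base.py", "Profile/experience_library.json"]
missing = [p for p in REQUIRED if not Path(p).exists()]
if missing:
    print(f"generate_resume: missing excluded dependencies {missing}; skipping.")
    sys.exit(1)  # run_daily.ps1 warns on the nonzero exit and continues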


Getting Started

This pipeline is tuned for one person's job search. To adapt it:

  1. Edit disqualify.py -- replace gate logic with your target domain's disqualifiers (the 14 gates are well-commented).
  2. Edit State/scoring_rubric.md -- replace the scoring criteria with your profile and target roles.
  3. Edit KEYWORD_WEIGHTS in pre_scoring.py -- match your domain's vocabulary (see the sketch after this list).
  4. Edit JobFetcher/fetch_jobs.py -- set your search terms, locations, and sites.
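
For reference, KEYWORD_WEIGHTS is a plain keyword-to-weight map feeding the 0-1 pre_score; the entries below are illustrative, not the repository's actual values:

KEYWORD_WEIGHTS = {
    "billing": 0.30,
    "accounts receivable": 0.25,
    "payroll": 0.20,
    "staff accountant": 0.20,
    "collections": -0.15,  # negative weights push a posting down
}

def pre_score(text):
    score = sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in text.lower())
    return max(0.0, min(1.0, score))  # clamp to the 0-1 pre_score range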

Prerequisites:

  • Python 3.10+ with: google-genai, python-jobspy, python-docx, openai
  • Google AI Studio API key (free tier) -- set GOOGLE_API_KEY in keys.env
  • PowerShell 5.1+ (Windows)
  • Windows Task Scheduler (optional, for daily automation)

Install dependencies:

python -m pip install google-genai python-jobspy python-docx openai --break-system-packages

Run a single daily cycle (Steps 1-6):

.\run_daily.ps1 -phase all

Output: Automation/priority.csv -- your ranked Tier 1/2 jobs for the day.

Step 6 (generate_resume.py) will warn and skip gracefully if its excluded dependencies are missing. Customize or omit it as needed.

Run specific phases:

.\run_daily.ps1 -phase pre-score   # Steps 1-3 only (fetch + filter)
.\run_daily.ps1 -phase post-score  # Steps 4-6 only (score + rank)

Recommended .gitignore

# API keys
keys.env

# Personal data -- never commit
Profile/
ResumeVersions/
Automation/
State/feedback.json

# Pipeline state (large, regeneratable)
JobFetcher/seen_jobs.json
Automation/rejected_urls_seen.txt
pipeline_checkpoint.json
wave8_state.json

# Python
__pycache__/
*.pyc
*.pyo

# Archives (large, regeneratable)
Automation/PRESCORED_ARCHIVE/
Automation/SCORING_BATCH_ARCHIVE/
Automation/SCRAPES_RAW/

Contributing

Note: This is a personal job-search pipeline, not a general-purpose tool. The scoring rubric, keyword weights, and disqualifier logic are tuned for a specific candidate profile. If you adapt it, start with State/scoring_rubric.md and disqualify.py -- those two files drive most of the filtering behavior.

Human review, issues, and critical feedback are welcome. This repo is shared as a transparent architecture/recovery case study, not as a polished general-purpose product.

Community input is especially welcome on these open questions:

  • Scoring formula -- Is the 0.35/0.50/0.15 weight split between pre_score, Gemini score, and cluster weight optimal? Would cross-validation against actual application outcomes improve it?
  • Gate efficiency -- Are there false positive or false negative patterns in disqualify.py that a new rule could catch cleanly?
  • Reject log taxonomy -- Which gate failures deserve permanent vs TTL suppression? Currently only salary floor and collections-primary are permanent.
  • Dedup pruning -- rejected_urls_seen.txt (permanent log) and disqualified_soft.csv/gemini_skipped.csv (TTL logs) each have different retention needs. What are sensible policies for each?
  • Cross-platform -- The runner is PowerShell on Windows. A run_daily.sh bash equivalent would make this usable on Linux/macOS.
  • Monitoring integration -- priority_drift_monitor.py, priority_tuning_assistant.py, and priority_action_alignment.py exist but are not wired into the daily pipeline. Help integrating them would be valuable.
  • LLM backends -- Adding support for other free-tier LLMs (Mistral, local LLaMA via Ollama) as additional fallback scorers.

Please read PIPELINE.md before contributing code. For architecture or design changes, open an issue first.


License

MIT -- use, modify, and share freely.


Built with significant assistance from Claude (Anthropic), ChatGPT, Gemini, Microsoft Copilot, and DeepSeek AI.
