Bobshpog/code_review_assistant

AI Code Review Assistant

A full-stack web application that performs intelligent, multi-step code review using GenAI. It analyzes submitted code in Python, TypeScript, or Java and returns structured feedback covering security, performance, style, and logic issues.

Architecture

User → POST /api/review → FastAPI Backend
                              │
                    ┌─────────┴─────────┐
                    │  Comment Stripping │  (tree-sitter, removes injection vectors)
                    └─────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         tree-sitter      semgrep         ruff/lizard
         (syntax)        (security)       (style/perf/quality)
              │               │               │
              └───────────────┼───────────────┘
                              │ tool findings
                    ┌─────────┴─────────┐
                    │  3 Parallel Claude │  (Haiku reviewers via asyncio.gather)
                    │  Haiku Reviewers   │
                    └─────────┬─────────┘
                              │ + AI findings
                    ┌─────────┴─────────┐
                    │  Post-Processing   │  (deterministic: dedup, rank)
                    └─────────┬─────────┘
                              │
                    ┌─────────┴─────────┐
                    │  Report Aggregator │  (Claude Haiku: group related findings)
                    └─────────┬─────────┘
                              │
                         SSE Stream → React Frontend

Three-layer design:

  1. Layer 1 (Deterministic): 4 static analysis tools run in parallel. Every applicable tool runs on every submission — guaranteed, no LLM gatekeeping.
  2. Layer 2 (AI Reviewers): 3 Claude Haiku reviewers (security, performance, quality) run in parallel. Each receives tool findings as context and adds deeper, contextual analysis.
  3. Post-processing: Deterministic dedup and severity-based ranking. No LLM needed.
  4. Layer 3 (Report Aggregation): A 4th Claude Haiku call groups semantically related findings into distinct issues with validated coverage — every original finding must appear in exactly one group.
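
The Layer 2 fan-out can be sketched with plain asyncio (a minimal illustration; the reviewer signatures here are hypothetical, not the project's actual API):

```python
import asyncio

async def run_reviewers(reviewers, code, tool_findings):
    """Run all reviewers concurrently; one failure does not cancel the rest."""
    results = await asyncio.gather(
        *(reviewer(code, tool_findings) for reviewer in reviewers),
        return_exceptions=True,  # crashed reviewers come back as exception objects
    )
    findings, errors = [], []
    for name, result in zip([r.__name__ for r in reviewers], results):
        if isinstance(result, Exception):
            errors.append((name, result))  # surviving reviewers still contribute
        else:
            findings.extend(result)
    return findings, errors
```

Because `return_exceptions=True` turns failures into values, a timeout in one reviewer never cancels the other two.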

Frontend

Split-pane interface built with React 19, TypeScript, and CodeMirror 6:

  • Code editor with syntax highlighting for Python, TypeScript, and Java (CodeMirror 6 with language-specific extensions)
  • Language selector dropdown, disabled during review
  • Review button with phase-aware states — shows tool progress, then AI reviewer progress, with per-agent status indicators
  • Tabbed results panel — "All Findings" shows every individual finding grouped by category; "Report" shows the AI-aggregated executive summary with deduplicated issues
  • Click-to-navigate — clicking a finding scrolls the editor to the relevant line
  • Markdown export — generates a downloadable code-review-report.md with executive summary, severity breakdown, all findings with suggestions and fixed code
  • Responsive layout — stacks vertically on mobile, side-by-side on desktop

Quick Start

Option A: Docker (recommended)

git clone <repo-url> && cd code_review_assistant

# Configure API key
cp backend/.env.example backend/.env
# Edit backend/.env and set: AWS_BEARER_TOKEN_BEDROCK=your-key-here

# Start everything
docker compose up --build
# Frontend: http://localhost:5173
# Backend:  http://localhost:8000/api/health

Option B: Local Development

git clone <repo-url> && cd code_review_assistant

# Backend
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and set: AWS_BEARER_TOKEN_BEDROCK=your-key-here

# Start backend
uvicorn app.main:app --reload --port 8000

# Frontend (separate terminal)
cd frontend
npm install
npm run dev
# Opens at http://localhost:5173

Verification

# Health check
curl http://localhost:8000/api/health
# → {"status":"ok"}

# Test review
curl -X POST http://localhost:8000/api/review \
  -H "Content-Type: application/json" \
  -d '{"code":"def foo(x):\n  return eval(x)","language":"python"}'
# → SSE stream with findings

API Documentation

POST /api/review

Submit code for review. Returns an SSE (Server-Sent Events) stream.

Request:

{
  "code": "def get_user_data(user_id):\n    query = \"SELECT * FROM users WHERE id = \" + str(user_id)\n    cursor.execute(query)\n    return cursor.fetchall()",
  "language": "python"
}
  • code: string, 1-50,000 characters
  • language: "python" | "typescript" | "java"

Response: text/event-stream with these event types:

  • tools_start (no fields): pipeline begins
  • finding (finding: {category, severity, source, title, description, line_range, suggestion, fixed_code}): emitted for each finding, tool or AI
  • tools_complete (finding_count, error_count): all tools finished
  • agent_start (agent: "security" | "performance" | "quality" | "report"): an AI reviewer or the report aggregator starts
  • agent_complete (agent, finding_count, error?): an AI reviewer finished; includes an error string if that reviewer failed
  • summary (summary): deterministic summary of the deduped findings
  • report (issues: [{title, category, severity, line_start, suggestion, fixed_code, source_indices}], coverage): AI-aggregated report grouping related findings
  • error (error, detail?): an error occurred; findings already streamed are still delivered
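
Since the response is a raw text/event-stream, any client can consume it by splitting on blank lines. A minimal parser sketch (illustrative helper, not part of the project):

```python
def parse_sse(body: str):
    """Split a text/event-stream body into (event, data) pairs."""
    events = []
    event, data = "message", []   # "message" is the SSE default event name
    for line in body.splitlines():
        if not line:              # a blank line terminates one event
            if data:
                events.append((event, "\n".join(data)))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    return events
```

In the browser the same stream is read via fetch and a ReadableStream reader, since EventSource cannot send a POST body (see Design Decisions).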

GET /api/health

Returns {"status": "ok"}.

Tech Stack

Layer Technology
Backend Python 3.12, FastAPI, Pydantic v2
AI Claude Haiku 4.5 via AWS Bedrock, accessed through litellm
Static Analysis tree-sitter, semgrep, ruff, lizard
Frontend React 19, TypeScript, Vite 8, CodeMirror 6 (syntax highlighting), Tailwind CSS v4
Infrastructure Docker, nginx (SSE proxy)

Design Decisions

Decision Why
Layered architecture Deterministic tools always run (guaranteed findings); the LLM adds nuance on top. Graceful degradation if Claude is down.
asyncio.gather for parallelism Three Claude calls are trivially parallelized without an agent framework. Simpler than Strands/LangGraph.
Claude Haiku 4.5 (not Sonnet) for reviewers Structured JSON extraction is a focused task. Haiku is 5x cheaper, 3x faster. Sufficient quality.
litellm for LLM access Abstracts Bedrock auth (bearer token). Supports model switching without code changes.
AI report aggregation with validated coverage Groups related findings into distinct issues. Validates that every original finding is covered — auto-adds any the AI missed.
Deterministic post-processing Dedup and ranking are mechanical — no LLM needed.
Comment stripping (not classification) Eliminates the prompt injection vector entirely. No BERT model, no false positives. Replaces comment bytes with spaces (not deletion) so line numbers stay valid between original and stripped code.
Line-numbered code in LLM prompts LLMs hallucinate line numbers when counting themselves. Pre-numbering the code eliminates this. Extraction layer validates references against actual line count and nullifies out-of-range values.
Graceful degradation at every layer return_exceptions=True isolates tool crashes. Reviewer timeouts don't block other reviewers. Report aggregation falls back to ungrouped findings. The pipeline always delivers results.
Sonnet-judges-Haiku eval framework A stronger model (Sonnet) judges Haiku's output to avoid self-evaluation blind spots. Mock mode uses keyword heuristics for fast CI without LLM costs.
CodeMirror 6 (not Monaco) 43% smaller bundle, mobile support, native line scrolling API.
SSE via fetch (not EventSource) EventSource only supports GET. We POST the code body.
4 tools (not 8) ruff includes pyflakes + bandit rules. No redundancy, each tool has a distinct purpose. All pip install-able, no standalone binaries.
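
The comment-stripping idea (replace comment bytes with spaces so line and column offsets survive) can be illustrated with Python's stdlib tokenize module; the real pipeline uses tree-sitter so the same trick works across all three languages:

```python
import io
import tokenize

def strip_comments(code: str) -> str:
    """Blank out comments with spaces, preserving every line/column offset."""
    lines = code.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start              # 1-based row, 0-based column
            line = lines[row - 1]
            end = col + len(tok.string)
            lines[row - 1] = line[:col] + " " * len(tok.string) + line[end:]
    return "".join(lines)
```

A finding reported at line 7 of the stripped code therefore points at line 7 of the original, with no offset bookkeeping.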

Known Limitations

  • No persistence: Reviews are stateless. No history or saved reviews.
  • No authentication: Rate limiting by IP only (10 req/min via slowapi).
  • Java/TypeScript have fewer tools: 3 tools vs 4 for Python (no ruff). Semgrep's open-source rules also have uneven coverage — strong for SQL injection and deserialization, but no rules for Runtime.exec() command injection or resource leaks. Claude compensates for the gaps.
  • Single-file only: No multi-file or project analysis.
  • Comments ignored: Comment stripping eliminates prompt injection but also means the pipeline never reviews comment quality, documentation accuracy, or TODO/FIXME tracking.
  • No library-specific guidance: Reviews are language-generic. The pipeline has no awareness of framework or library best practices (e.g. React hook rules, SQLAlchemy session management, Spring Security configuration). It can only flag general language-level issues.
  • Semgrep image size: ~500MB in Docker due to rule database.

Security

Mitigations in Place

Threat Mitigation
Prompt injection via comments Tree-sitter comment stripping removes all comment nodes before LLM receives code. Replaces bytes with spaces to preserve line offsets. Falls back to original code if stripping fails.
Prompt injection via code System prompts include safety instructions to ignore embedded directives. An inherent LLM limitation — string literals and variable names can contain instructions that survive comment stripping.
Tool process crashes ToolRunner uses asyncio.gather(return_exceptions=True) — each tool is isolated. A crash in semgrep doesn't affect tree-sitter or lizard.
LLM reviewer failures Each reviewer runs independently. Timeouts and exceptions are caught per-reviewer; surviving reviewers still produce findings.
Report aggregation failure Falls back to ungrouped findings (1:1 mapping). The pipeline always delivers results.
LLM output validation extract_findings() validates line numbers against actual code length. Hallucinated references are nullified. Malformed JSON is caught and dropped per-finding.
Input validation Pydantic enforces code length (1–50,000 chars), language enum, and field constraints. Invalid requests return 422.
Rate limiting slowapi enforces 10 requests/minute per IP on /api/review.
CORS Restricted to configured origins (default: http://localhost:5173). Configurable via CORS_ORIGINS env var. Only Content-Type header allowed.
XSS via LLM output Frontend renders all finding text as plain text (React escapes by default). No dangerouslySetInnerHTML.
Security headers nginx sets Content-Security-Policy, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, and Referrer-Policy.
Error information disclosure Error events sent to clients use generic messages. Full details are logged server-side only.
Subprocess injection All tools use create_subprocess_exec (not shell=True). Temp files use tempfile.NamedTemporaryFile with random names.
Secret management API keys loaded from .env (gitignored). Only .env.example with placeholders is tracked.
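
The subprocess isolation pattern (exec, never a shell, plus a timeout) looks roughly like this; the timeout value is illustrative:

```python
import asyncio

async def run_tool(argv: list[str], timeout: float = 30.0):
    """Run a static-analysis tool via exec (no shell interprets the arguments)."""
    proc = await asyncio.create_subprocess_exec(
        *argv,                                # argv list, nothing is shell-parsed
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()                           # a hung tool cannot stall the pipeline
        raise
    return proc.returncode, out.decode(), err.decode()
```

Because arguments are passed as a list, malicious code content can never break out into a shell command.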

Deployment Hardening Checklist

These items are not needed for local development but should be addressed before production deployment:

  • HTTPS termination — Add TLS via a reverse proxy (Cloudflare, ALB, or nginx with certs). The current setup uses HTTP only.
  • Trusted proxy configuration — Rate limiting uses client IP from request headers. Behind a load balancer, configure slowapi to trust only your proxy's IP range to prevent X-Forwarded-For spoofing.
  • Concurrent request limiting — Rate limiting caps requests/minute but not concurrent requests. Add a semaphore or API gateway throttle (e.g., max 5 concurrent reviews) to prevent LLM slot exhaustion.
  • CORS origins — Set CORS_ORIGINS to your production domain. The default http://localhost:5173 is for development only.
  • HSTS header — Add Strict-Transport-Security to nginx once HTTPS is enabled.
  • Container user — Dockerfiles run as root. Add a non-root USER directive for production.
  • Secret rotation — Rotate the Bedrock API key periodically. Consider using IAM roles or AWS Secrets Manager instead of bearer tokens.
  • Request body size limit — Add client_max_body_size 100k; to nginx to reject oversized payloads at the proxy layer.
  • Monitoring / audit logging — Add structured logging of review requests (timestamp, IP, language, code hash) for abuse detection. No code content should be logged.

Testing

# All backend tests (activate venv first)
cd backend && source .venv/bin/activate
python -m pytest -v --timeout=60

# Coverage
python -m pytest --cov=app --cov-report=term-missing

# Frontend lint + build check
cd frontend && npm run build

Evaluation

An LLM-as-a-judge eval framework validates that the pipeline reliably detects bugs. 58 YAML test cases (172 runnable snippets across Python, TypeScript, and Java) cover security, performance, logic, style, and syntax categories. A stronger model (Sonnet) judges whether Haiku's review output correctly identifies each planted bug, scoring on detection, actionability, category accuracy, severity, and conciseness.

Latest Results (172 cases, live mode against Docker)

Metric Score
Detection rate 98% (169/172)
Category accuracy 91%
Actionability 96%
Severity accuracy 70%

By category: Security 100%, Logic 98%, Style 97%, Performance 97%, Syntax 100%

By language: Java 100%, TypeScript 98%, Python 96%

3 remaining misses: a boundary error where the reviewer identified the right fix but wrong root cause, a memory leak via unremoved event listeners (subtle lifetime issue), and a long function where the reviewer flagged low-level style issues but missed the SRP violation.

Full report: backend/eval/results/

Running Evals

# Framework self-tests (no LLM calls)
cd backend && pytest eval/tests/ -v

# Mock mode — keyword heuristics, fast CI validation
cd backend && python -m eval --mode mock

# Full live run — Sonnet judges Haiku's output
cd backend && python -m eval --mode live

# Against a running Docker container
docker compose up -d backend
cd backend && python -m eval --mode live --target http://localhost:8000

See backend/eval/README.md for test case format, configuration, and report details.

Test Cases

Three sample inputs with expected outputs are in backend/test_cases/:

File Language Key Expected Findings
python_sql_injection.py Python Security: SQL injection, Style/Perf findings from ruff or AI
typescript_xss.ts TypeScript Security: innerHTML XSS, possible unused variable
java_resource_leak.java Java Performance/Logic: unclosed stream, broad exception

Each .expected.json file specifies minimum expected finding categories.
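
A minimal check against such a file might look like the following (the min_categories field name is hypothetical; see the actual .expected.json files for the real schema):

```python
import json

def missing_categories(findings: list[dict], expected_path: str) -> list[str]:
    """Return expected finding categories that the review failed to produce."""
    with open(expected_path) as fh:
        expected = json.load(fh)
    found = {f["category"] for f in findings}
    return [c for c in expected["min_categories"] if c not in found]
```

An empty return value means the review met the minimum expectations for that sample.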

Future Improvements

  • Expand eval dataset: Current 58 bug types / 172 cases provide good coverage but miss edge cases. Target 150+ bug types with difficulty tiers, multi-bug snippets, and false-positive cases (clean code that shouldn't trigger findings).
  • Stronger reviewer models: Upgrade from Haiku to Sonnet for reviewers, or use a mix (Sonnet for security, Haiku for style). The 3 remaining detection failures are reasoning gaps, not extraction failures — a more capable model would likely catch them.
  • Extended thinking budget: Add a thinking/reasoning budget to reviewer prompts so the model can reason through subtle bugs (e.g. lifetime issues, SRP violations) before committing to findings. Particularly impactful for the quality reviewer.
  • More language-specific static analyzers: Add ESLint for TypeScript, PMD/SpotBugs for Java, and mypy for Python. Semgrep's open-source Java rules have notable gaps (no Runtime.exec() command injection, no resource leak detection) — PMD/SpotBugs would fill these. ESLint would add TS-specific linting that ruff provides for Python.
  • Skill-based reviewer agent: A reviewer agent that dynamically loads library-specific guidance and patterns based on the imports/dependencies detected in the submitted code. For example, if the code imports react, load React hook rules and component patterns; if it uses sqlalchemy, load session management best practices. This turns the generic reviewer into a context-aware one without hardcoding library knowledge into prompts.

Project Structure

code_review_assistant/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app, CORS, rate limiting, endpoints
│   │   ├── config.py            # AppConfig (pydantic-settings, Bedrock credentials)
│   │   ├── orchestrator.py      # Pipeline: tools → LLM → post-processing → report → SSE
│   │   ├── post_processing.py   # Deterministic dedup, rank, summary
│   │   ├── format.py            # number_lines, format_findings, build_user_message
│   │   ├── models/              # Pydantic models (Finding, SSE events, request/response)
│   │   ├── tools/               # 4 tool wrappers (tree-sitter, semgrep, ruff, lizard) + ToolRunner
│   │   ├── agents/              # AI service (litellm), extraction, prompts, report aggregator
│   │   └── guards/              # Comment stripping (tree-sitter based)
│   ├── tests/                   # pytest suite (unit + integration)
│   ├── eval/                    # LLM-as-a-judge evaluation framework
│   │   ├── cases/               # 58 YAML test cases (172 snippets)
│   │   ├── tests/               # Framework self-tests
│   │   └── README.md            # Full eval documentation
│   ├── test_cases/              # 3 sample inputs with expected outputs
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── App.tsx              # Main layout (split pane, tabs)
│   │   ├── components/          # 11 React components (editor, findings, report, export)
│   │   ├── hooks/               # useStreamingReview SSE hook
│   │   └── types/               # TypeScript interfaces
│   ├── nginx.conf               # SSE proxy config
│   └── Dockerfile
├── docker-compose.yml
├── DESIGN.md                    # Full architecture design document
└── README.md
