Bobshpog/code_review_assistant

AI Code Review Assistant

A full-stack web application that performs intelligent, multi-step code review using GenAI. It analyzes submitted code in Python, TypeScript, or Java and returns structured feedback covering security, performance, style, and logic issues.

Architecture

User → POST /api/review → FastAPI Backend
                              │
                    ┌─────────┴─────────┐
                    │  Comment Stripping │  (tree-sitter, removes injection vectors)
                    └─────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         tree-sitter      semgrep         ruff/lizard
         (syntax)        (security)       (style/perf/quality)
              │               │               │
              └───────────────┼───────────────┘
                              │ tool findings
                    ┌─────────┴─────────┐
                    │  3 Parallel Claude │  (Haiku reviewers via asyncio.gather)
                    │  Haiku Reviewers   │
                    └─────────┬─────────┘
                              │ + AI findings
                    ┌─────────┴─────────┐
                    │  Post-Processing   │  (deterministic: dedup, rank)
                    └─────────┬─────────┘
                              │
                    ┌─────────┴─────────┐
                    │  Report Aggregator │  (Claude Haiku: group related findings)
                    └─────────┬─────────┘
                              │
                         SSE Stream → React Frontend

Three-layer design:

  1. Layer 1 (Deterministic): 4 static analysis tools run in parallel. Every applicable tool runs on every submission — guaranteed, no LLM gatekeeping.
  2. Layer 2 (AI Reviewers): 3 Claude Haiku reviewers (security, performance, quality) run in parallel. Each receives tool findings as context and adds deeper, contextual analysis.
  3. Post-processing: Deterministic dedup and severity-based ranking. No LLM needed.
  4. Layer 3 (Report Aggregation): A 4th Claude Haiku call groups semantically related findings into distinct issues with validated coverage — every original finding must appear in exactly one group.
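
The Layer 2 fan-out can be sketched with plain asyncio (a minimal illustration; the reviewer signatures here are hypothetical, not the project's actual API):

```python
import asyncio

async def run_reviewers(reviewers, code, tool_findings):
    """Run all reviewers concurrently; one failure does not cancel the rest."""
    results = await asyncio.gather(
        *(reviewer(code, tool_findings) for reviewer in reviewers),
        return_exceptions=True,  # crashed reviewers come back as exception objects
    )
    findings, errors = [], []
    for name, result in zip([r.__name__ for r in reviewers], results):
        if isinstance(result, Exception):
            errors.append((name, result))  # surviving reviewers still contribute
        else:
            findings.extend(result)
    return findings, errors
```

Because `return_exceptions=True` turns failures into values, a timeout in one reviewer never cancels the other two.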

Frontend

Split-pane interface built with React 19, TypeScript, and CodeMirror 6:

  • Code editor with syntax highlighting for Python, TypeScript, and Java (CodeMirror 6 with language-specific extensions)
  • Language selector dropdown, disabled during review
  • Review button with phase-aware states — shows tool progress, then AI reviewer progress, with per-agent status indicators
  • Tabbed results panel — "All Findings" shows every individual finding grouped by category; "Report" shows the AI-aggregated executive summary with deduplicated issues
  • Click-to-navigate — clicking a finding scrolls the editor to the relevant line
  • Markdown export — generates a downloadable code-review-report.md with executive summary, severity breakdown, all findings with suggestions and fixed code
  • Responsive layout — stacks vertically on mobile, side-by-side on desktop

Quick Start

Option A: Docker (recommended)

git clone <repo-url> && cd code_review_assistant

# Configure API key
cp backend/.env.example backend/.env
# Edit backend/.env and set: AWS_BEARER_TOKEN_BEDROCK=your-key-here

# Start everything
docker compose up --build
# Frontend: http://localhost:5173
# Backend:  http://localhost:8000/api/health

Option B: Local Development

git clone <repo-url> && cd code_review_assistant

# Backend
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and set: AWS_BEARER_TOKEN_BEDROCK=your-key-here

# Start backend
uvicorn app.main:app --reload --port 8000

# Frontend (separate terminal)
cd frontend
npm install
npm run dev
# Opens at http://localhost:5173

Verification

# Health check
curl http://localhost:8000/api/health
# → {"status":"ok"}

# Test review
curl -X POST http://localhost:8000/api/review \
  -H "Content-Type: application/json" \
  -d '{"code":"def foo(x):\n  return eval(x)","language":"python"}'
# → SSE stream with findings

API Documentation

POST /api/review

Submit code for review. Returns an SSE (Server-Sent Events) stream.

Request:

{
  "code": "def get_user_data(user_id):\n    query = \"SELECT * FROM users WHERE id = \" + str(user_id)\n    cursor.execute(query)\n    return cursor.fetchall()",
  "language": "python"
}
  • code: string, 1-50,000 characters
  • language: "python" | "typescript" | "java"

Response: text/event-stream with these event types:

  • tools_start (no fields): pipeline begins
  • finding (finding: {category, severity, source, title, description, line_range, suggestion, fixed_code}): emitted for each finding, tool or AI
  • tools_complete (finding_count, error_count): all tools finished
  • agent_start (agent: "security" | "performance" | "quality" | "report"): an AI reviewer or the report aggregator starts
  • agent_complete (agent, finding_count, error?): an AI reviewer finished; includes an error string if that reviewer failed
  • summary (summary): deterministic summary of the deduped findings
  • report (issues: [{title, category, severity, line_start, suggestion, fixed_code, source_indices}], coverage): AI-aggregated report grouping related findings
  • error (error, detail?): an error occurred; findings already streamed are still delivered
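
Since the response is a raw text/event-stream, any client can consume it by splitting on blank lines. A minimal parser sketch (illustrative helper, not part of the project):

```python
def parse_sse(body: str):
    """Split a text/event-stream body into (event, data) pairs."""
    events = []
    event, data = "message", []   # "message" is the SSE default event name
    for line in body.splitlines():
        if not line:              # a blank line terminates one event
            if data:
                events.append((event, "\n".join(data)))
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
    return events
```

In the browser the same stream is read via fetch and a ReadableStream reader, since EventSource cannot send a POST body (see Design Decisions).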

GET /api/health

Returns {"status": "ok"}.

Tech Stack

Layer Technology
Backend Python 3.12, FastAPI, Pydantic v2
AI Claude Haiku 4.5 via AWS Bedrock, accessed through litellm
Static Analysis tree-sitter, semgrep, ruff, lizard
Frontend React 19, TypeScript, Vite 8, CodeMirror 6 (syntax highlighting), Tailwind CSS v4
Infrastructure Docker, nginx (SSE proxy)

Design Decisions

Decision Why
Layered architecture Deterministic tools always run (guaranteed findings); the LLM adds nuance on top. Graceful degradation if Claude is down.
asyncio.gather for parallelism Three Claude calls are trivially parallelized without an agent framework. Simpler than Strands/LangGraph.
Claude Haiku 4.5 (not Sonnet) for reviewers Structured JSON extraction is a focused task. Haiku is 5x cheaper, 3x faster. Sufficient quality.
litellm for LLM access Abstracts Bedrock auth (bearer token). Supports model switching without code changes.
AI report aggregation with validated coverage Groups related findings into distinct issues. Validates that every original finding is covered — auto-adds any the AI missed.
Deterministic post-processing Dedup and ranking are mechanical — no LLM needed.
Comment stripping (not classification) Eliminates the prompt injection vector entirely. No BERT model, no false positives. Replaces comment bytes with spaces (not deletion) so line numbers stay valid between original and stripped code.
Line-numbered code in LLM prompts LLMs hallucinate line numbers when counting themselves. Pre-numbering the code eliminates this. Extraction layer validates references against actual line count and nullifies out-of-range values.
Graceful degradation at every layer return_exceptions=True isolates tool crashes. Reviewer timeouts don't block other reviewers. Report aggregation falls back to ungrouped findings. The pipeline always delivers results.
Sonnet-judges-Haiku eval framework A stronger model (Sonnet) judges Haiku's output to avoid self-evaluation blind spots. Mock mode uses keyword heuristics for fast CI without LLM costs.
CodeMirror 6 (not Monaco) 43% smaller bundle, mobile support, native line scrolling API.
SSE via fetch (not EventSource) EventSource only supports GET. We POST the code body.
4 tools (not 8) ruff includes pyflakes + bandit rules. No redundancy, each tool has a distinct purpose. All pip install-able, no standalone binaries.
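
The comment-stripping idea (replace comment bytes with spaces so line and column offsets survive) can be illustrated with Python's stdlib tokenize module; the real pipeline uses tree-sitter so the same trick works across all three languages:

```python
import io
import tokenize

def strip_comments(code: str) -> str:
    """Blank out comments with spaces, preserving every line/column offset."""
    lines = code.splitlines(keepends=True)
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start              # 1-based row, 0-based column
            line = lines[row - 1]
            end = col + len(tok.string)
            lines[row - 1] = line[:col] + " " * len(tok.string) + line[end:]
    return "".join(lines)
```

A finding reported at line 7 of the stripped code therefore points at line 7 of the original, with no offset bookkeeping.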

Known Limitations

  • No persistence: Reviews are stateless. No history or saved reviews.
  • No authentication: Rate limiting by IP only (10 req/min via slowapi).
  • Java/TypeScript have fewer tools: 3 tools vs 4 for Python (no ruff). Semgrep's open-source rules also have uneven coverage — strong for SQL injection and deserialization, but no rules for Runtime.exec() command injection or resource leaks. Claude compensates for the gaps.
  • Single-file only: No multi-file or project analysis.
  • Comments ignored: Comment stripping eliminates prompt injection but also means the pipeline never reviews comment quality, documentation accuracy, or TODO/FIXME tracking.
  • No library-specific guidance: Reviews are language-generic. The pipeline has no awareness of framework or library best practices (e.g. React hook rules, SQLAlchemy session management, Spring Security configuration). It can only flag general language-level issues.
  • Semgrep image size: ~500MB in Docker due to rule database.

Security

Mitigations in Place

Threat Mitigation
Prompt injection via comments Tree-sitter comment stripping removes all comment nodes before LLM receives code. Replaces bytes with spaces to preserve line offsets. Falls back to original code if stripping fails.
Prompt injection via code System prompts include safety instructions to ignore embedded directives. An inherent LLM limitation — string literals and variable names can contain instructions that survive comment stripping.
Tool process crashes ToolRunner uses asyncio.gather(return_exceptions=True) — each tool is isolated. A crash in semgrep doesn't affect tree-sitter or lizard.
LLM reviewer failures Each reviewer runs independently. Timeouts and exceptions are caught per-reviewer; surviving reviewers still produce findings.
Report aggregation failure Falls back to ungrouped findings (1:1 mapping). The pipeline always delivers results.
LLM output validation extract_findings() validates line numbers against actual code length. Hallucinated references are nullified. Malformed JSON is caught and dropped per-finding.
Input validation Pydantic enforces code length (1–50,000 chars), language enum, and field constraints. Invalid requests return 422.
Rate limiting slowapi enforces 10 requests/minute per IP on /api/review.
CORS Restricted to configured origins (default: http://localhost:5173). Configurable via CORS_ORIGINS env var. Only Content-Type header allowed.
XSS via LLM output Frontend renders all finding text as plain text (React escapes by default). No dangerouslySetInnerHTML.
Security headers nginx sets Content-Security-Policy, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, and Referrer-Policy.
Error information disclosure Error events sent to clients use generic messages. Full details are logged server-side only.
Subprocess injection All tools use create_subprocess_exec (not shell=True). Temp files use tempfile.NamedTemporaryFile with random names.
Secret management API keys loaded from .env (gitignored). Only .env.example with placeholders is tracked.
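
The subprocess isolation pattern (exec, never a shell, plus a timeout) looks roughly like this; the timeout value is illustrative:

```python
import asyncio

async def run_tool(argv: list[str], timeout: float = 30.0):
    """Run a static-analysis tool via exec (no shell interprets the arguments)."""
    proc = await asyncio.create_subprocess_exec(
        *argv,                                # argv list, nothing is shell-parsed
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()                           # a hung tool cannot stall the pipeline
        raise
    return proc.returncode, out.decode(), err.decode()
```

Because arguments are passed as a list, malicious code content can never break out into a shell command.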

Deployment Hardening Checklist

These items are not needed for local development but should be addressed before production deployment:

  • HTTPS termination — Add TLS via a reverse proxy (Cloudflare, ALB, or nginx with certs). The current setup uses HTTP only.
  • Trusted proxy configuration — Rate limiting uses client IP from request headers. Behind a load balancer, configure slowapi to trust only your proxy's IP range to prevent X-Forwarded-For spoofing.
  • Concurrent request limiting — Rate limiting caps requests/minute but not concurrent requests. Add a semaphore or API gateway throttle (e.g., max 5 concurrent reviews) to prevent LLM slot exhaustion.
  • CORS origins — Set CORS_ORIGINS to your production domain. The default http://localhost:5173 is for development only.
  • HSTS header — Add Strict-Transport-Security to nginx once HTTPS is enabled.
  • Container user — Dockerfiles run as root. Add a non-root USER directive for production.
  • Secret rotation — Rotate the Bedrock API key periodically. Consider using IAM roles or AWS Secrets Manager instead of bearer tokens.
  • Request body size limit — Add client_max_body_size 100k; to nginx to reject oversized payloads at the proxy layer.
  • Monitoring / audit logging — Add structured logging of review requests (timestamp, IP, language, code hash) for abuse detection. No code content should be logged.

Testing

# All backend tests (activate venv first)
cd backend && source .venv/bin/activate
python -m pytest -v --timeout=60

# Coverage
python -m pytest --cov=app --cov-report=term-missing

# Frontend lint + build check
cd frontend && npm run build

Evaluation

An LLM-as-a-judge eval framework validates that the pipeline reliably detects bugs. 58 YAML test cases (172 runnable snippets across Python, TypeScript, and Java) cover security, performance, logic, style, and syntax categories. A stronger model (Sonnet) judges whether Haiku's review output correctly identifies each planted bug, scoring on detection, actionability, category accuracy, severity, and conciseness.

Latest Results (172 cases, live mode against Docker)

Metric Score
Detection rate 98% (169/172)
Category accuracy 91%
Actionability 96%
Severity accuracy 70%

By category: Security 100%, Logic 98%, Style 97%, Performance 97%, Syntax 100%

By language: Java 100%, TypeScript 98%, Python 96%

3 remaining misses: a boundary error where the reviewer identified the right fix but wrong root cause, a memory leak via unremoved event listeners (subtle lifetime issue), and a long function where the reviewer flagged low-level style issues but missed the SRP violation.

Full report: backend/eval/results/

Running Evals

# Framework self-tests (no LLM calls)
cd backend && pytest eval/tests/ -v

# Mock mode — keyword heuristics, fast CI validation
cd backend && python -m eval --mode mock

# Full live run — Sonnet judges Haiku's output
cd backend && python -m eval --mode live

# Against a running Docker container
docker compose up -d backend
cd backend && python -m eval --mode live --target http://localhost:8000

See backend/eval/README.md for test case format, configuration, and report details.

Test Cases

Three sample inputs with expected outputs are in backend/test_cases/:

File Language Key Expected Findings
python_sql_injection.py Python Security: SQL injection, Style/Perf findings from ruff or AI
typescript_xss.ts TypeScript Security: innerHTML XSS, possible unused variable
java_resource_leak.java Java Performance/Logic: unclosed stream, broad exception

Each .expected.json file specifies minimum expected finding categories.
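
A minimal check against such a file might look like the following (the min_categories field name is hypothetical; see the actual .expected.json files for the real schema):

```python
import json

def missing_categories(findings: list[dict], expected_path: str) -> list[str]:
    """Return expected finding categories that the review failed to produce."""
    with open(expected_path) as fh:
        expected = json.load(fh)
    found = {f["category"] for f in findings}
    return [c for c in expected["min_categories"] if c not in found]
```

An empty return value means the review met the minimum expectations for that sample.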

Future Improvements

  • Expand eval dataset: Current 58 bug types / 172 cases provide good coverage but miss edge cases. Target 150+ bug types with difficulty tiers, multi-bug snippets, and false-positive cases (clean code that shouldn't trigger findings).
  • Stronger reviewer models: Upgrade from Haiku to Sonnet for reviewers, or use a mix (Sonnet for security, Haiku for style). The 3 remaining detection failures are reasoning gaps, not extraction failures — a more capable model would likely catch them.
  • Extended thinking budget: Add a thinking/reasoning budget to reviewer prompts so the model can reason through subtle bugs (e.g. lifetime issues, SRP violations) before committing to findings. Particularly impactful for the quality reviewer.
  • More language-specific static analyzers: Add ESLint for TypeScript, PMD/SpotBugs for Java, and mypy for Python. Semgrep's open-source Java rules have notable gaps (no Runtime.exec() command injection, no resource leak detection) — PMD/SpotBugs would fill these. ESLint would add TS-specific linting that ruff provides for Python.
  • Skill-based reviewer agent: A reviewer agent that dynamically loads library-specific guidance and patterns based on the imports/dependencies detected in the submitted code. For example, if the code imports react, load React hook rules and component patterns; if it uses sqlalchemy, load session management best practices. This turns the generic reviewer into a context-aware one without hardcoding library knowledge into prompts.

Project Structure

code_review_assistant/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app, CORS, rate limiting, endpoints
│   │   ├── config.py            # AppConfig (pydantic-settings, Bedrock credentials)
│   │   ├── orchestrator.py      # Pipeline: tools → LLM → post-processing → report → SSE
│   │   ├── post_processing.py   # Deterministic dedup, rank, summary
│   │   ├── format.py            # number_lines, format_findings, build_user_message
│   │   ├── models/              # Pydantic models (Finding, SSE events, request/response)
│   │   ├── tools/               # 4 tool wrappers (tree-sitter, semgrep, ruff, lizard) + ToolRunner
│   │   ├── agents/              # AI service (litellm), extraction, prompts, report aggregator
│   │   └── guards/              # Comment stripping (tree-sitter based)
│   ├── tests/                   # pytest suite (unit + integration)
│   ├── eval/                    # LLM-as-a-judge evaluation framework
│   │   ├── cases/               # 58 YAML test cases (172 snippets)
│   │   ├── tests/               # Framework self-tests
│   │   └── README.md            # Full eval documentation
│   ├── test_cases/              # 3 sample inputs with expected outputs
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── App.tsx              # Main layout (split pane, tabs)
│   │   ├── components/          # 11 React components (editor, findings, report, export)
│   │   ├── hooks/               # useStreamingReview SSE hook
│   │   └── types/               # TypeScript interfaces
│   ├── nginx.conf               # SSE proxy config
│   └── Dockerfile
├── docker-compose.yml
├── DESIGN.md                    # Full architecture design document
└── README.md
