Enterprise-grade forensic analysis for detecting AI-generated code, security vulnerabilities, and quality issues in your codebase.
Codebase CSI investigates your code like a crime scene, uncovering hidden patterns that indicate AI generation, security risks, performance bottlenecks, and maintainability problems before they impact production.
AI-generated code is everywhere, but it often introduces:
- Security Vulnerabilities - SQL injection, hard-coded credentials, PII exposure
- Performance Issues - O(n²) algorithms, memory leaks, inefficient patterns
- Maintainability Problems - Generic naming, god functions, poor error handling
- Compliance Violations - GDPR, PCI-DSS, HIPAA, SOC 2
Codebase CSI targets 80%+ detection accuracy for these issues using a multi-analyzer forensic pipeline backed by 2024 research from Google, Stanford, MIT, Berkeley, and NYU.
- Pattern Analyzer (25%) - Identifies generic naming, boilerplate code, redundant patterns
- Statistical Analyzer (20%) - Analyzes cyclomatic complexity, nesting depth, code duplication
- Security Analyzer (18%) - Detects OWASP Top 10 vulnerabilities (SQL injection, XSS, hardcoded secrets)
- Antipattern Analyzer (15%) - Bleeding edge deps, gold plating, magic numbers
- Emoji Detector (12%) - Detects 30+ AI-common emoji patterns in code/comments
- Comment Analyzer (5%) - Tutorial-style comments, over-explanation detection
- Architectural Analyzer (5%) - SOLID principles, god classes, circular imports
- Semantic Analyzer - Context understanding, type inference, control flow
- 72 Programming Languages - Python, JavaScript, TypeScript, Java, C++, Go, Rust, PHP, Ruby, and more (129 file extensions)
- CI/CD Integration - GitHub Actions, GitLab CI, Jenkins, CircleCI
- Multiple Output Formats - JSON, YAML, HTML, Markdown
- Zero Dependencies - Pure Python stdlib, no external packages required
- Research-Backed - Methods from Google Research, Stanford CS, MIT CSAIL, Berkeley, NYU Cybersecurity (2024)
- Production Tested - 242 tests passing, 64% code coverage
- 5-Second Setup - `pip install codebase-csi && csi scan .`
- Intelligent Scanning - Respects `.gitignore`, skips dependencies
- Actionable Reports - Specific line numbers, severity levels, remediation suggestions
- Fast Performance - Scans 1000+ files per second
- Minimal Dependencies - Zero required dependencies for basic usage
# Install from PyPI
pip install codebase-csi

# Install from source
git clone https://github.com/codebase-csi/codebase-csi.git
cd codebase-csi
pip install -e .

# YAML config support
pip install codebase-csi[yaml]
# Enterprise features (reporting, templates)
pip install codebase-csi[enterprise]
# Development tools
pip install codebase-csi[dev]
# All features
pip install codebase-csi[all]

# Scan current directory
csi scan .
# Scan specific directory
csi scan /path/to/project
# Scan with custom config
csi scan . --config csi-config.yaml
# Output to JSON
csi scan . --format json --output report.json
# Set confidence threshold
csi scan . --threshold 0.3

from codebase_csi import AICodeDetector

# Initialize detector
detector = AICodeDetector()

# Scan a file
result = detector.scan_file("example.py")
print(f"AI Confidence: {result.confidence:.2%}")
print(f"Issues Found: {len(result.issues)}")

# Scan a directory and report files above the 30% confidence threshold
results = detector.scan_directory("./src")
for file_result in results:
    if file_result.confidence > 0.3:
        print(f"⚠️ {file_result.file_path}: {file_result.confidence:.2%}")

$ csi scan ./src
Codebase CSI - Scanning ./src...

SCAN RESULTS
================================================================================
Total Files Scanned: 45
Total Lines Analyzed: 12,847
Scan Duration: 2.3s

HIGH CONFIDENCE (>70%)
--------------------------------------------------------------------------------
src/auth.py:23 95% confidence
  • Emoji usage: 7 emojis in 18 lines (CRITICAL)
  • SQL injection pattern detected (HIGH)
  • Generic naming: 'temp', 'data', 'result' (MEDIUM)
src/utils.py:156 78% confidence
  • O(n²) algorithm detected (HIGH)
  • Function length: 143 lines (MEDIUM)
  • No error handling (MEDIUM)

MEDIUM CONFIDENCE (30-70%)
--------------------------------------------------------------------------------
src/helpers.py:45 52% confidence
  • Verbose comments (comment/code ratio: 0.65) (LOW)
  • Magic numbers: 5 occurrences (LOW)

SUMMARY
================================================================================
High Risk Files: 2 (4.4%)
Medium Risk Files: 1 (2.2%)
Clean Files: 42 (93.3%)
Quality Gate: ❌ FAILED (2 files exceed 70% threshold)

# Detection thresholds
thresholds:
  max_confidence: 0.30      # Block files above 30% AI confidence
  max_ai_percentage: 0.10   # Block if >10% of files flagged
  min_test_coverage: 0.80   # Require 80% test coverage
# Analyzer weights (must sum to 1.0)
analyzers:
  emoji: 0.15
  pattern: 0.25
  statistical: 0.20
  semantic: 0.15
  architectural: 0.15
  metadata: 0.10
# File patterns
include:
  - "src/**/*.py"
  - "lib/**/*.js"
exclude:
  - "tests/**"
  - "node_modules/**"
  - "*.min.js"
# Output settings
output:
  format: json              # json, yaml, html, markdown
  path: ./reports/csi-report.json
  include_snippets: true
  max_snippet_lines: 5
# Enterprise features
quality_gates:
  block_on_high_confidence: true
  block_on_security_issues: true
  fail_build: true

export CSI_CONFIG_PATH=/path/to/config.yaml
export CSI_THRESHOLD=0.30
export CSI_OUTPUT_FORMAT=json
export CSI_VERBOSE=true

name: Codebase CSI Scan
on: [push, pull_request]
jobs:
  csi-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install Codebase CSI
        run: pip install codebase-csi[yaml]
      - name: Scan codebase
        id: scan
        # Keep the job running so the report is uploaded even when the scan fails
        continue-on-error: true
        run: csi scan . --config .csi-config.yaml --format json --output csi-report.json
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: csi-report
          path: csi-report.json
      - name: Check quality gates
        # $? does not carry over between steps, so check the scan step's outcome instead
        if: steps.scan.outcome == 'failure'
        run: |
          echo "Quality gate failed - AI-generated code detected!"
          exit 1

csi_scan:
  stage: test
  image: python:3.11
  script:
    - pip install codebase-csi[yaml]
    - csi scan . --format json --output csi-report.json
  artifacts:
    # The JSON report is not JUnit XML, so publish it as a plain artifact
    paths:
      - csi-report.json
    expire_in: 30 days
  only:
    - merge_requests
    - main

pipeline {
    agent any
    stages {
        stage('Codebase CSI Scan') {
            steps {
                sh 'pip install codebase-csi[yaml]'
                sh 'csi scan . --format json --output csi-report.json'
            }
        }
    }
    post {
        always {
            archiveArtifacts artifacts: 'csi-report.json', fingerprint: true
        }
    }
}

Each analyzer contributes to the overall AI-detection confidence score based on its assigned weight:
| Analyzer | Weight | Target Accuracy | Focus Area |
|---|---|---|---|
| Pattern | 25% | 85%+ | Generic names, boilerplate |
| Statistical | 20% | 80%+ | Complexity, duplication |
| Security | 18% | 90%+ | Vulnerabilities, secrets |
| Antipattern | 15% | 85%+ | Bleeding edge, gold plating |
| Emoji | 12% | 93%+ | Decorative emoji usage |
| Comment | 5% | 88%+ | Over-explanation, tutorial style |
| Architectural | 5% | 78%+ | Design violations |
| Combined | 100% | 85%+ | Overall AI detection |
Note: "Target Accuracy" represents design goals based on research, not runtime measurements. Actual scan results show confidence scores (0-100%) indicating likelihood of AI-generated code.
When you run a scan, you'll receive a JSON report with actual measurements. Here's how to interpret the results:
The overall_confidence is a weighted average of all analyzer scores:
Overall = (Pattern × 0.25) + (Statistical × 0.20) + (Security × 0.18) +
          (Antipattern × 0.15) + (Emoji × 0.12) + (Comment × 0.05) +
          (Architectural × 0.05)
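The weighted average above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's internal implementation: the function name `overall_confidence`, the `WEIGHTS` dictionary, and the example scores are assumptions made for this sketch; only the weights themselves come from the formula.

```python
# Weights copied from the formula above; the dict layout is an assumption of this sketch.
WEIGHTS = {
    "Pattern": 0.25,
    "Statistical": 0.20,
    "Security": 0.18,
    "Antipattern": 0.15,
    "Emoji": 0.12,
    "Comment": 0.05,
    "Architectural": 0.05,
}


def overall_confidence(scores: dict[str, float]) -> float:
    """Return the weighted average of per-analyzer scores (each in 0.0-1.0)."""
    # A weighted average is only meaningful if the weights sum to 1.0.
    if abs(sum(WEIGHTS.values()) - 1.0) > 1e-9:
        raise ValueError("analyzer weights must sum to 1.0")
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing analyzer scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative scores only, not taken from a real scan.
example_scores = {
    "Pattern": 0.8,
    "Statistical": 0.5,
    "Security": 0.1,
    "Antipattern": 0.2,
    "Emoji": 0.4,
    "Comment": 0.6,
    "Architectural": 0.0,
}
print(f"{overall_confidence(example_scores):.3f}")  # prints 0.426
```

Because Pattern and Statistical together carry 45% of the weight, a file can reach a noticeable overall score from naming and complexity signals alone, even when the Security score is near zero.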
| Confidence | Risk Level | Interpretation |
|---|---|---|
| ≥75% | 🔴 CRITICAL | Very likely AI-generated, requires immediate review |
| 55-74% | 🟠 HIGH | Probably AI-generated, recommend thorough review |
| 35-54% | 🟡 MEDIUM | Possibly AI-assisted, spot-check recommended |
| 15-34% | 🟢 LOW | Minor AI indicators, likely human-written |
| <15% | ⚪ MINIMAL | No significant AI patterns detected |
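The bands in this table translate directly into a small lookup. The sketch below is one possible reading, assuming confidence is expressed as a fraction between 0.0 and 1.0 and that each band includes its lower bound (so 0.749 falls into HIGH and 0.75 into CRITICAL); `risk_level` is a hypothetical helper name, not part of the documented API.

```python
def risk_level(confidence: float) -> str:
    """Map a confidence score in [0.0, 1.0] to a risk level from the table above."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    # Lower bounds are inclusive and checked from the highest band down.
    bands = [
        (0.75, "CRITICAL"),
        (0.55, "HIGH"),
        (0.35, "MEDIUM"),
        (0.15, "LOW"),
    ]
    for lower_bound, label in bands:
        if confidence >= lower_bound:
            return label
    return "MINIMAL"


print(risk_level(0.363))  # prints MEDIUM
```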
{
  "summary": {
    "total_files": 23,
    "total_lines": 8372,
    "overall_confidence": 0.363,
    "risk_level": "MEDIUM"
  },
  "analyzer_scores": {
    "Pattern": { "average": 0.684, "min": 0.0, "max": 0.95 },
    "Statistical": { "average": 0.469, "min": 0.0, "max": 0.92 },
    "Security": { "average": 0.016, "min": 0.0, "max": 0.34 },
    "Emoji": { "average": 0.373, "min": 0.0, "max": 1.0 },
    "Comment": { "average": 0.640, "min": 0.0, "max": 0.95 },
    "Antipattern": { "average": 0.120, "min": 0.0, "max": 0.79 },
    "Architectural": { "average": 0.024, "min": 0.0, "max": 0.12 }
  },
  "antipattern_summary": {
    "bleeding_edge": 9,
    "gold_plating": 32
  }
}

| Analyzer Score | Interpretation |
|---|---|
| Pattern: 68.4% | High - Many generic variable names (data, result, temp) detected |
| Statistical: 46.9% | Medium - Some functions exceed complexity thresholds |
| Security: 1.6% | Low - Minimal hardcoded secrets or vulnerabilities |
| Emoji: 37.3% | Medium - Decorative emojis found in comments |
| Comment: 64.0% | High - Tutorial-style or over-explanatory comments |
| Antipattern: 12.0% | Low - Some over-engineering patterns |
| Architectural: 2.4% | Low - Good separation of concerns |
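To see which analyzers drive a result, the report can be read with Python's standard `json` module. This is a minimal sketch that assumes the report follows the sample structure shown in this README (an `analyzer_scores` object whose entries have an `average` field); `rank_analyzers` is a hypothetical helper name, and the inline JSON stands in for a real report file.

```python
import json

# Stand-in for the contents of a report file, trimmed to the fields used here.
SAMPLE_REPORT = """
{
  "analyzer_scores": {
    "Pattern": {"average": 0.684},
    "Statistical": {"average": 0.469},
    "Security": {"average": 0.016},
    "Emoji": {"average": 0.373},
    "Comment": {"average": 0.640},
    "Antipattern": {"average": 0.120},
    "Architectural": {"average": 0.024}
  }
}
"""


def rank_analyzers(report: dict) -> list[tuple[str, float]]:
    """Return (analyzer, average score) pairs, highest average first."""
    scores = report["analyzer_scores"]
    return sorted(
        ((name, entry["average"]) for name, entry in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


report = json.loads(SAMPLE_REPORT)
for name, average in rank_analyzers(report):
    print(f"{name:<14} {average:.1%}")
```

For a report written to disk, `json.load` on the opened file would replace the `json.loads` call.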
The report includes prioritized recommendations based on findings:
- CRITICAL issues - Address immediately (security vulnerabilities, severe complexity)
- HIGH issues - Review in next sprint (gold plating, magic numbers)
- MEDIUM issues - Track for improvement (verbose comments, generic naming)
- LOW issues - Nice to fix (minor style issues)
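The same ordering is easy to apply when post-processing findings yourself. The sketch below assumes a hypothetical `Finding` record with `file_path`, `line`, `severity`, and `message` fields; the real report schema may differ, so treat the field names as placeholders. The example findings are taken from the sample scan output earlier in this README.

```python
from dataclasses import dataclass

# Lower number means higher priority, following the list above.
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


@dataclass
class Finding:
    """Hypothetical shape of a single reported issue."""
    file_path: str
    line: int
    severity: str
    message: str


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Sort findings by severity, then by file path and line for stable output."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_ORDER[f.severity], f.file_path, f.line),
    )


findings = [
    Finding("src/helpers.py", 45, "LOW", "Magic numbers: 5 occurrences"),
    Finding("src/auth.py", 23, "HIGH", "SQL injection pattern detected"),
    Finding("src/utils.py", 156, "MEDIUM", "No error handling"),
]
for finding in prioritize(findings):
    print(f"{finding.severity:<8} {finding.file_path}:{finding.line} {finding.message}")
```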
# Detect security vulnerabilities in AI-generated code
csi scan . --analyzers security --severity high --output security-report.html

# HIPAA compliance scan
csi scan . --compliance hipaa --output compliance-report.pdf
# PCI-DSS compliance scan
csi scan . --compliance pci-dss

# Scan PR before merge
csi scan $(git diff --name-only origin/main) --format markdown --output pr-review.md

# Analyze entire codebase for technical debt
csi scan . --full-analysis --output legacy-report.json

We welcome contributions! See CONTRIBUTING.md for guidelines.
# Clone repository
git clone https://github.com/codebase-csi/codebase-csi.git
cd codebase-csi
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=codebase_csi --cov-report=html
# Format code
black src/ tests/
isort src/ tests/
# Type checking
mypy src/

Codebase CSI is licensed under the MIT License.
- Inspired by forensic investigation techniques
- Built with modern Python best practices
- Tested on 10,000+ real-world code samples
- Community-driven development
- Documentation: https://codebase-csi.readthedocs.io
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: contact@codebase-csi.com
- v1.0 - Core detection engine (Released)
- v1.5 - 8 analyzers: Pattern, Statistical, Security, Emoji, Comment, Antipattern, Architectural, Semantic
- v2.0 - Enterprise JSON reporting, CI/CD integration (Current)
- v2.1 - VS Code extension integration (Q1 2026)
- v2.2 - ML model enhancement, 95%+ accuracy (Q2 2026)
- v3.0 - Multi-language CLI (Node.js, Go) (Q3 2026)
See docs/ANALYZERS_DEEP_DIVE.md for analyzer documentation.
Made with ❤️ by the Codebase CSI Team
Website • Documentation • Twitter