Rahul-Raval-2912/PDF-Malware-Analysis

PDF Malware Analysis Framework

Automatically dissects weaponized PDFs through four analysis layers to reconstruct complete attacker kill chains.

🚀 What This Project Does (Beats All Current Tools)

This framework provides comprehensive analysis of malicious PDF files through four distinct layers:

  1. Static Dissection: Rebuilds full PDF object dependency graph + decodes 27 nested stream encodings
  2. JavaScript Deobfuscation: 19 advanced techniques (AST parsing, dead-code removal, eval reconstruction)
  3. Dynamic Execution: Instrumented VMs with 132 Windows API hooks + memory forensics
  4. Exploit Chain Reconstruction: ML-powered mapping (CVE→AMSI Bypass→Reflective PE→C2 callback)

Output: Interactive exploit chain graphs, memory dumps, behavioral timelines, SIEM-ready JSON.
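
To make the first layer more concrete, here is a minimal sketch of an object dependency graph built with regular expressions over raw PDF bytes. It is not the framework's implementation: the function name `build_reference_graph` and the sample document are assumptions for illustration, and a regex pass like this will miss object streams, incremental updates, and deliberately malformed files that a real parser such as pikepdf handles.

```python
import re
from collections import defaultdict

# "12 0 obj ... endobj" bodies and "12 0 R" indirect references.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def build_reference_graph(pdf_bytes: bytes) -> dict:
    """Map each object number to the sorted object numbers it references."""
    graph = defaultdict(set)
    for match in OBJ_RE.finditer(pdf_bytes):
        obj_id = int(match.group(1))
        graph[obj_id]  # register the node even if it references nothing
        for ref in REF_RE.finditer(match.group(3)):
            graph[obj_id].add(int(ref.group(1)))
    return {obj_id: sorted(refs) for obj_id, refs in graph.items()}


# Hypothetical four-object document: catalog -> OpenAction -> JavaScript stream.
SAMPLE = b"""1 0 obj
<< /Type /Catalog /Pages 2 0 R /OpenAction 3 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
3 0 obj
<< /S /JavaScript /JS 4 0 R >>
endobj
4 0 obj
<< /Length 12 >>
stream
app.alert(1)
endstream
endobj
"""

graph = build_reference_graph(SAMPLE)
print(graph)  # {1: [2, 3], 2: [], 3: [4], 4: []}
```

Following the edges from the catalog (object 1) through `/OpenAction` to the JavaScript stream (object 4) is the kind of path an analyst looks for first in a suspicious file.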

🛠️ Technologies Used

CORE: pikepdf, pdfminer.six, yara-python, pefile, capstone, unicorn-engine, lief, volatility3
ML: transformers (BERT), torch, scikit-learn
DYNAMIC: frida-tools, QEMU VM farm, python-socketio
CLI: typer, rich, asyncio, celery
VISUALIZATION: plotly, networkx

📁 Project Structure

pdfexploitsforge/
├── core/                          # Analysis engines
│   ├── static_dissection.py      # Object graph + 27 decoder chains
│   ├── js_deobfuscator.py        # 19 AcroJS deobf techniques
│   ├── payload_extractor.py      # PE/ELF/shellcode extraction
│   └── exploit_classifier.py     # ML chain reconstruction
├── dynamic/                       # VM orchestration
│   ├── qemu_orchestrator.py      # XP/Win7/Win10 VM farm
│   ├── api_monitor.py            # 132 Windows API hooks
│   └── memory_forensics.py       # Volatility + YARA scans
├── signatures/                    # Detection rules
│   ├── yara_rules/               # 1.2k PDF exploit signatures
│   └── regex_patterns.py         # JS/PowerShell primitives
├── ml_models/                     # Trained models
│   ├── js_malware_bert.pt        # JavaScript classifier
│   └── rop_chain_detector.pt     # ROP gadget chains
├── visualizers/                   # Attack graphs
│   ├── exploit_graph.py          # NetworkX→Plotly chains
│   └── object_dependency.py      # PDF internal references
├── output/                        # Generated reports
│   ├── chain_graph.html          # Interactive visualization
│   ├── memory.dmp                # Volatility dumps
│   └── network_timeline.json     # C2 behavioral data
└── cli.py                         # Production CLI entrypoint

🔧 Installation

Prerequisites

  • Python 3.8+
  • QEMU (for dynamic analysis)
  • Volatility3
  • YARA

Quick Install

git clone https://github.com/your-repo/pdfexploitsforge.git
cd pdfexploitsforge
pip install -r requirements.txt
pip install -e .

Docker Installation

docker build -t pdfexploitsforge .
docker run -v "$(pwd)/samples:/samples" pdfexploitsforge analyze /samples/malicious.pdf

🚀 Usage

Basic Analysis

# Analyze single PDF
pdfexploitsforge analyze malicious.pdf

# With dynamic analysis
pdfexploitsforge analyze malicious.pdf --dynamic --vm-snapshot win7_sp1

# Batch processing
pdfexploitsforge batch ./pdf_samples/ --workers 4

Python API

from pdfexploitsforge import StaticAnalyzer, JSDeobfuscator, ExploitClassifier

# Static analysis
analyzer = StaticAnalyzer()
results = analyzer.analyze("malicious.pdf")

# JavaScript deobfuscation
deobf = JSDeobfuscator()
js_results = deobf.process(results['javascript'])

# ML exploit classification
classifier = ExploitClassifier()
exploit_chain = classifier.reconstruct_chain(results, js_results, [])

🎯 Key Features

Static Analysis Engine

  • PDF Object Graph: Complete dependency reconstruction
  • 27 Stream Decoders: FlateDecode, ASCIIHexDecode, ASCII85Decode, LZWDecode, etc.
  • JavaScript Extraction: All embedded JS code with context
  • Embedded File Detection: PE/ELF/Office docs
  • YARA Integration: 1.2k PDF exploit signatures

JavaScript Deobfuscation (19 Techniques)

  1. Unicode unescape sequences
  2. Hexadecimal string decoding
  3. Base64 string decoding
  4. URL encoding resolution
  5. String concatenation resolution
  6. Character code resolution
  7. Eval call reconstruction
  8. Function call resolution
  9. Dead code removal
  10. Array access resolution
  11. Object property access
  12. Mathematical operations
  13. Boolean operations
  14. Conditional expressions
  15. Loop unrolling
  16. Variable substitution
  17. String split/join operations
  18. Regex pattern resolution
  19. Escape sequence resolution
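
Three of the techniques listed above (character code resolution, escape sequence resolution, and string concatenation resolution) can be sketched with plain regular expressions. This is a simplified illustration under the assumption that string literals use double quotes; the function names are hypothetical, and the project itself describes AST parsing, which is more robust than regex rewriting.

```python
import json
import re


def resolve_char_codes(js: str) -> str:
    """String.fromCharCode(97, 0x70) -> "ap"."""
    def repl(match):
        try:
            codes = [int(tok.strip(), 16) if tok.strip().lower().startswith("0x")
                     else int(tok.strip())
                     for tok in match.group(1).split(",")]
        except ValueError:
            return match.group(0)  # leave anything we cannot parse untouched
        return json.dumps("".join(chr(c) for c in codes))
    return re.sub(r"String\.fromCharCode\(\s*([0-9a-fA-Fx,\s]+?)\s*\)", repl, js)


def resolve_hex_escapes(js: str) -> str:
    """\\x61 -> a, skipping characters that would break a string literal."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        if char in '"\'\\' or not char.isprintable():
            return match.group(0)
        return char
    return re.sub(r"\\x([0-9a-fA-F]{2})", repl, js)


def fold_string_concat(js: str) -> str:
    """"ev" + "al" -> "eval", repeated until nothing changes."""
    pattern = re.compile(r'"([^"\\]*)"\s*\+\s*"([^"\\]*)"')
    previous = None
    while previous != js:
        previous = js
        js = pattern.sub(lambda m: '"' + m.group(1) + m.group(2) + '"', js)
    return js


def deobfuscate(js: str) -> str:
    for deobf_pass in (resolve_char_codes, resolve_hex_escapes, fold_string_concat):
        js = deobf_pass(js)
    return js


sample = ('var f = "ev" + "\\x61l"; '
          'this[f](String.fromCharCode(97, 0x70, 112) + ".alert(1)");')
print(deobfuscate(sample))
# var f = "eval"; this[f]("app.alert(1)");
```

The order of passes matters: literals have to be resolved before concatenations can be folded, which is why the sketch runs them as a fixed pipeline.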

Dynamic Analysis

  • VM Orchestration: XP/Win7/Win10 snapshots
  • 132 API Hooks: Kernel32, Ntdll, Advapi32, User32, WinInet
  • Memory Forensics: Volatility3 integration
  • Network Monitoring: C2 communication detection
  • File System Tracking: Creation/modification monitoring
  • Registry Analysis: Persistence mechanism detection

ML-Powered Classification

  • BERT JavaScript Classifier: Malware family identification
  • ROP Chain Detector: Neural network-based detection
  • CVE Mapping: Automatic vulnerability identification
  • Attack Chain Reconstruction: Complete killchain mapping
  • MITRE ATT&CK Integration: Technique classification

📊 Output Examples

Interactive Exploit Graph

[Image: Exploit Chain Visualization]

Analysis Report Structure

{
  "pdf_file": "malicious.pdf",
  "static_analysis": {
    "structure": {...},
    "javascript": [...],
    "embedded_files": [...],
    "yara_matches": [...]
  },
  "javascript_analysis": [...],
  "payloads": [...],
  "exploit_chain": {
    "nodes": [...],
    "edges": [...],
    "cve_mappings": [...],
    "attack_techniques": [...]
  },
  "dynamic_analysis": {
    "api_calls": [...],
    "network_activity": [...],
    "file_changes": [...],
    "memory_dump": {...}
  }
}
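
A report shaped like the one above could be post-processed to print each attack path as a readable line. The README does not specify the fields inside `nodes` and `edges`, so the `id`, `label`, `source`, and `target` keys below are assumptions, as is the sample chain and the helper name `chain_paths`.

```python
import json


def chain_paths(report: dict) -> list:
    """Return every root-to-leaf path in exploit_chain as 'A -> B -> C'."""
    chain = report.get("exploit_chain", {})
    # Assumed schema: nodes carry "id"/"label", edges carry "source"/"target".
    labels = {n["id"]: n.get("label", n["id"]) for n in chain.get("nodes", [])}
    children, targets = {}, set()
    for edge in chain.get("edges", []):
        children.setdefault(edge["source"], []).append(edge["target"])
        targets.add(edge["target"])

    paths = []

    def walk(node_id, trail):
        trail = trail + (node_id,)
        # Skip nodes already on the trail so a cyclic graph cannot loop forever.
        unvisited = [c for c in children.get(node_id, []) if c not in trail]
        if not unvisited:
            paths.append(" -> ".join(str(labels.get(n, n)) for n in trail))
        for child in unvisited:
            walk(child, trail)

    for root in (node_id for node_id in labels if node_id not in targets):
        walk(root, ())
    return paths


# Hypothetical report fragment; real reports contain the other sections too.
REPORT_JSON = """
{
  "pdf_file": "malicious.pdf",
  "exploit_chain": {
    "nodes": [
      {"id": "n1", "label": "CVE-2008-2992"},
      {"id": "n2", "label": "AMSI bypass"},
      {"id": "n3", "label": "Reflective PE"},
      {"id": "n4", "label": "C2 callback"}
    ],
    "edges": [
      {"source": "n1", "target": "n2"},
      {"source": "n2", "target": "n3"},
      {"source": "n3", "target": "n4"}
    ]
  }
}
"""

report = json.loads(REPORT_JSON)
for path in chain_paths(report):
    print(path)  # CVE-2008-2992 -> AMSI bypass -> Reflective PE -> C2 callback
```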

🔍 Detection Capabilities

CVE Coverage

  • CVE-2013-2729 (Adobe Reader integer overflow)
  • CVE-2010-0188 (Adobe Reader TIFF parsing, libtiff)
  • CVE-2009-0927 (Adobe Reader Collab.getIcon)
  • CVE-2008-2992 (Adobe Reader util.printf)
  • And 50+ more PDF vulnerabilities

Exploit Techniques

  • Heap spraying
  • ROP chain exploitation
  • JavaScript API abuse
  • Embedded executable deployment
  • Reflective PE loading
  • AMSI bypass techniques
  • Process injection
  • Persistence mechanisms

🧪 Testing

# Run unit tests
python -m pytest tests/

# Test with sample PDFs
python -m pytest tests/test_samples.py

# Performance benchmarks
python -m pytest tests/test_performance.py --benchmark

📈 Performance

  • Static Analysis: ~2-5 seconds per PDF
  • JavaScript Deobfuscation: ~1-3 seconds per script
  • Dynamic Analysis: ~5-10 minutes per PDF (VM dependent)
  • ML Classification: ~0.5-1 second per sample

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Adobe Security Research Team
  • YARA Project
  • Volatility Foundation
  • MITRE ATT&CK Framework
  • PDF Association

📞 Support


⚠️ Disclaimer: This tool is for educational and authorized security testing purposes only. Users are responsible for complying with applicable laws and regulations.
