24CSB0B05/CD_PROJECT


CVE Vulnerability Description Parser

A compiler-inspired NLP tool that parses CVE descriptions, maps them to known vulnerability patterns, and enriches each finding with CWE IDs, mitigations, and code-construct linkage.


Table of Contents

  1. Project Overview
  2. Architecture & Pipeline
  3. Project Structure
  4. Setup & Installation
  5. Usage
  6. Module Reference
  7. Output Format
  8. Supported Vulnerability Types
  9. Sample Run
  10. Extending the Tool
  11. Compiler Concepts Applied

Project Overview

Security teams deal with hundreds of CVE descriptions daily. Reading each one manually to determine the vulnerability class, root cause, affected component, severity, and appropriate mitigation is slow and error-prone.

This tool applies compiler design principles (lexical analysis, parsing, semantic analysis, and IR generation) to automate that workflow:

  • Parse raw CVE text using NLP tokenisation
  • Pattern-match to 11 known vulnerability categories
  • Semantically extract root cause, impact, and affected component
  • Enrich records with CWE IDs, mitigations, and linked code constructs
  • Output structured JSON + a formatted terminal report

Architecture & Pipeline

 Raw CVE Text
      │
      ▼
 ┌─────────────┐
 │   LEXER     │  Lowercase → normalise punctuation → strip stop-words → tokens[]
 └──────┬──────┘
        │
        ▼
 ┌─────────────┐
 │   PARSER    │  Keyword pattern-matching with weighted scoring → vulnerability_type
 └──────┬──────┘
        │
        ▼
 ┌──────────────────┐
 │ SEMANTIC ANALYSIS│  Rule-based extraction → root_cause, impact, component, severity
 └──────┬───────────┘
        │
        ▼
 ┌─────────────┐
 │ IR BUILDER  │  Assembles full record + CWE enrichment + code-construct linkage
 └──────┬──────┘
        │
        ▼
 ┌─────────────────────────────────┐
 │  OUTPUT                         │
 │  • output.json  (structured IR) │
 │  • Terminal report + statistics │
 └─────────────────────────────────┘
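The data flow in the diagram can be sketched as four chained calls. This is a minimal illustration, not the project's main.py: run_pipeline and the toy_* stand-ins are hypothetical names, and the stand-ins only mimic the shape of each stage's input and output.

```python
def run_pipeline(text, tokenize, detect, analyse, build):
    """Chain the four stages in the order shown in the diagram."""
    tokens = tokenize(text)                    # LEXER
    vuln_type = detect(tokens)                 # PARSER
    semantics = analyse(text)                  # SEMANTIC ANALYSIS
    return build(text, vuln_type, semantics)   # IR BUILDER


# Hypothetical stand-ins, just enough to make the flow runnable.
def toy_tokenize(text):
    return text.lower().split()


def toy_detect(tokens):
    return "Buffer Overflow" if "overflow" in tokens else "Unknown"


def toy_analyse(text):
    return {"severity": "High" if "overflow" in text.lower() else "Medium"}


def toy_build(text, vuln_type, semantics):
    return {"description": text, "vulnerability_type": vuln_type, **semantics}


record = run_pipeline(
    "A buffer overflow exists in the parser",
    toy_tokenize, toy_detect, toy_analyse, toy_build,
)
print(record)
```

Passing the stages in as callables keeps each one replaceable, which mirrors how the modules are split across files.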

Project Structure

cve_parser/
│
├── main.py          # Entry point: orchestrates the full pipeline
├── lexer.py         # Tokenisation and text normalisation
├── parser.py        # Vulnerability pattern detection
├── semantic.py      # Root cause / impact / component / severity extraction
├── ir.py            # Intermediate Representation builder
├── patterns.py      # Vulnerability keyword patterns + CWE metadata
├── utils.py         # File I/O, statistics, terminal report printer
│
├── data/
│   └── cves.txt     # Input file: one CVE description per line
│
├── output.json      # Generated output (created on first run)
└── README.md        # This file

Setup & Installation

Requirements

  • Python 3.10+ (uses str | None union syntax)
  • No third-party packages required; pure standard library

Clone / Download

git clone <repo-url>
cd cve_parser

Prepare Input Data

Edit data/cves.txt and add your CVE descriptions, one per line:

# Lines starting with '#' are comments and will be ignored

A buffer overflow vulnerability exists in the input validation function...
SQL injection vulnerability in the login module exposes sensitive database...
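The input format above could be read roughly as follows. This is a sketch under assumptions: parse_cve_lines and load_cves_sketch are hypothetical names, and skipping blank lines is assumed rather than documented (only the '#' comment rule is stated).

```python
from pathlib import Path


def parse_cve_lines(lines):
    """Keep one description per line, dropping '#' comments and blank lines."""
    descriptions = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # comment line, or blank line (assumed to be ignored)
        descriptions.append(stripped)
    return descriptions


def load_cves_sketch(filepath):
    """Read a UTF-8 text file and return its CVE descriptions."""
    text = Path(filepath).read_text(encoding="utf-8")
    return parse_cve_lines(text.splitlines())
```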

Usage

Basic Run

python main.py

Reads data/cves.txt, writes output.json, and prints a full terminal report.

Custom Input / Output

python main.py --input path/to/my_cves.txt --output results.json

Quiet Mode (JSON file only, no terminal output)

python main.py --quiet

Print Raw JSON to stdout

python main.py --json-only

Help

python main.py --help
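The flags above map naturally onto argparse. The sketch below shows one way they might be declared; the flag names and default paths come from this README, while build_arg_parser and the help strings are assumptions.

```python
import argparse


def build_arg_parser():
    """Declare the four documented flags (--help is added by argparse itself)."""
    parser = argparse.ArgumentParser(
        description="CVE Vulnerability Description Parser"
    )
    parser.add_argument("--input", default="data/cves.txt",
                        help="text file with one CVE description per line")
    parser.add_argument("--output", default="output.json",
                        help="path of the JSON file to write")
    parser.add_argument("--quiet", action="store_true",
                        help="write the JSON file only, no terminal output")
    parser.add_argument("--json-only", action="store_true",
                        help="print raw JSON to stdout")
    return parser


# Parsing an explicit list avoids touching sys.argv in this example.
args = build_arg_parser().parse_args(["--input", "my_cves.txt", "--json-only"])
print(args.input, args.output, args.quiet, args.json_only)
```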

Module Reference

lexer.py

| Function | Description |
|---|---|
| tokenize(text) | Returns filtered tokens (stop-words removed) |
| tokenize_raw(text) | Returns all tokens without filtering |
| clean_text(text) | Lowercases, normalises hyphens, strips special chars |
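A minimal sketch of how these three functions could fit together is shown below. It is not the project's lexer.py: the stop-word list is a small hypothetical sample, and treating hyphens as word separators is one assumed reading of "normalises hyphens".

```python
import re

# Hypothetical stop-word sample; the real list is not shown in this README.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "is", "and"}


def clean_text(text):
    """Lowercase, turn hyphens into spaces, and strip special characters."""
    text = text.lower().replace("-", " ")       # assumed hyphen handling
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


def tokenize_raw(text):
    """Return every token of the cleaned text."""
    return clean_text(text).split()


def tokenize(text):
    """Return tokens with stop-words filtered out."""
    return [token for token in tokenize_raw(text) if token not in STOP_WORDS]


print(tokenize("A Use-After-Free exists in the XML parser!"))
```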

parser.py

| Function | Description |
|---|---|
| detect_vulnerability(text) | Returns the best-matching vulnerability type string |
| detect_all_vulnerabilities(text) | Returns all matching types sorted by score |
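Weighted keyword scoring could look roughly like this. The sketch assumes a weighting scheme (each matched phrase scores its word count, so longer phrases count for more) and a two-entry sample pattern table; the project's actual weights and phrases are not documented here.

```python
# Hypothetical sample of the pattern table.
PATTERNS = {
    "Buffer Overflow": ["buffer overflow", "stack overflow"],
    "SQL Injection": ["sql injection", "sql query", "database"],
}


def score_patterns(text, patterns=PATTERNS):
    """Score each type by summing the word counts of its matched phrases."""
    lowered = text.lower()
    scores = {}
    for vuln_type, phrases in patterns.items():
        score = sum(len(phrase.split()) for phrase in phrases if phrase in lowered)
        if score:
            scores[vuln_type] = score
    return scores


def detect_all_vulnerabilities(text):
    """Return every matching type, highest score first."""
    scores = score_patterns(text)
    return sorted(scores, key=scores.get, reverse=True)


def detect_vulnerability(text):
    """Return the best match, or 'Unknown' when nothing matches."""
    ranked = detect_all_vulnerabilities(text)
    return ranked[0] if ranked else "Unknown"


print(detect_vulnerability("SQL injection in the login module exposes the database"))
```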

semantic.py

| Function | Description |
|---|---|
| semantic_analysis(text) | Returns dict: {root_cause, impact, affected_component, severity, confidence} |
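The rule-based extraction can be illustrated with a cut-down version that resolves only two of the fields. The rule contents and the first_match helper are assumptions; ROOT_CAUSE_RULES and IMPACT_RULES are named in this README, but their real structure is not shown. Confidence is computed as the share of fields resolved, following the description in the Output Format section.

```python
# Assumed rule shape: (keyword, label) pairs checked in order.
ROOT_CAUSE_RULES = [
    ("bounds", "Boundary checking failure"),
    ("overflow", "Boundary checking failure"),
]
IMPACT_RULES = [
    ("code execution", "Arbitrary code execution"),
    ("crash", "Denial of service"),
]


def first_match(text, rules):
    """Return the label of the first rule whose keyword occurs in text."""
    for keyword, label in rules:
        if keyword in text:
            return label
    return None


def semantic_analysis(text):
    """Resolve root_cause and impact, then derive a confidence percentage."""
    lowered = text.lower()
    fields = {
        "root_cause": first_match(lowered, ROOT_CAUSE_RULES),
        "impact": first_match(lowered, IMPACT_RULES),
    }
    resolved = sum(value is not None for value in fields.values())
    result = {name: value or "Unknown" for name, value in fields.items()}
    result["confidence"] = round(100 * resolved / len(fields))
    return result


print(semantic_analysis("A buffer overflow allows remote code execution"))
```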

ir.py

| Function | Description |
|---|---|
| build_ir(description, vuln_type, semantics) | Assembles enriched IR dict |
| _extract_code_constructs(description, vuln_type) | Heuristically finds linked code identifiers |
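As a rough illustration of the enrichment step, the sketch below merges the semantic fields with CWE metadata into one record. The source_id and index parameters, the sequential CVE-SAMPLE-NNNN scheme, and the single metadata entry are assumptions made for this example; the README does not specify how the deterministic ID is derived.

```python
from datetime import datetime, timezone

# Abbreviated, assumed shape of the metadata table from patterns.py.
VULNERABILITY_METADATA = {
    "Buffer Overflow": {
        "cwe_id": "CWE-120",
        "mitigation": "Implement bounds checking, use safe string functions...",
        "references": ["https://cwe.mitre.org/data/definitions/120.html"],
    },
}


def build_ir(description, vuln_type, semantics, source_id=None, index=1):
    """Assemble one IR record; unknown types fall back to placeholder metadata."""
    meta = VULNERABILITY_METADATA.get(vuln_type, {})
    return {
        # Assumed ID scheme: real CVE ID if given, else a sequential sample ID.
        "id": source_id or f"CVE-SAMPLE-{index:04d}",
        "parsed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "description": description,
        "vulnerability_type": vuln_type,
        **semantics,
        "cwe_id": meta.get("cwe_id", "Unknown"),
        "mitigation": meta.get("mitigation", "Unknown"),
        "references": meta.get("references", []),
    }


record = build_ir(
    "A buffer overflow vulnerability exists...",
    "Buffer Overflow",
    {"severity": "High"},
    source_id="CVE-2023-12345",
)
print(record["id"], record["cwe_id"])
```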

utils.py

| Function | Description |
|---|---|
| load_cves(filepath) | Loads CVE descriptions from a text file |
| save_results(results, output_path) | Writes IR list to JSON |
| generate_statistics(results) | Returns aggregated stats dict |
| print_results(results) | Colourised terminal output of all records |
| print_statistics(stats) | Bar-chart style statistics summary |
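The aggregation behind the statistics summary might be implemented along these lines. The keys of the returned dict are assumptions; only the reported quantities (total, average confidence, counts by type and by severity) are taken from the sample run.

```python
from collections import Counter


def generate_statistics(results):
    """Aggregate totals, average confidence, and counts by type and severity."""
    total = len(results)
    average = (
        round(sum(r.get("confidence", 0) for r in results) / total, 1)
        if total else 0.0
    )
    by_type = Counter(r.get("vulnerability_type", "Unknown") for r in results)
    by_severity = Counter(r.get("severity", "Unknown") for r in results)
    return {
        "total": total,
        "average_confidence": average,
        # most_common() orders the counts from largest to smallest.
        "by_type": dict(by_type.most_common()),
        "by_severity": dict(by_severity.most_common()),
    }


sample_results = [
    {"vulnerability_type": "Buffer Overflow", "severity": "High", "confidence": 100},
    {"vulnerability_type": "SQL Injection", "severity": "Medium", "confidence": 75},
    {"vulnerability_type": "Buffer Overflow", "severity": "High", "confidence": 50},
]
print(generate_statistics(sample_results))
```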

patterns.py

| Object | Description |
|---|---|
| VULNERABILITY_PATTERNS | Dict mapping vuln type → keyword phrases |
| VULNERABILITY_METADATA | Dict mapping vuln type → CWE ID, mitigation, references |

Output Format

Each entry in output.json follows this schema:

{
    "id": "CVE-SAMPLE-0001",
    "parsed_at": "2025-01-01T00:00:00Z",
    "description": "A buffer overflow vulnerability exists...",
    "vulnerability_type": "Buffer Overflow",
    "root_cause": "Boundary checking failure",
    "impact": "Arbitrary code execution",
    "affected_component": "Application function",
    "severity": "High",
    "confidence": 100,
    "cwe_id": "CWE-120",
    "mitigation": "Implement bounds checking, use safe string functions...",
    "references": ["https://cwe.mitre.org/data/definitions/120.html"],
    "linked_code_constructs": ["buffer", "inputValidation"]
}

| Field | Description |
|---|---|
| id | Auto-generated deterministic ID (or source CVE-ID if provided) |
| parsed_at | UTC timestamp of analysis |
| vulnerability_type | Detected class (11 categories + Unknown) |
| root_cause | Underlying programming flaw |
| impact | What an attacker achieves |
| affected_component | System part at risk |
| severity | Low / Medium / High / Critical |
| confidence | 0–100% estimate based on fields resolved |
| cwe_id | MITRE CWE identifier |
| mitigation | Recommended remediation |
| references | Links to authoritative resources |
| linked_code_constructs | Relevant identifiers found in description |
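Downstream scripts can consume this schema with the standard json module. The sketch below checks a record for missing fields and filters by severity; missing_fields and high_priority are hypothetical helpers, and the inline JSON stands in for the contents of output.json.

```python
import json

# The thirteen fields documented in the schema above.
REQUIRED_FIELDS = {
    "id", "parsed_at", "description", "vulnerability_type", "root_cause",
    "impact", "affected_component", "severity", "confidence", "cwe_id",
    "mitigation", "references", "linked_code_constructs",
}


def missing_fields(record):
    """Return the schema fields that a record lacks, sorted by name."""
    return sorted(REQUIRED_FIELDS - set(record))


def high_priority(records, levels=("High", "Critical")):
    """Keep only the records whose severity is in the given levels."""
    return [r for r in records if r.get("severity") in levels]


# Stand-in for json.load(open("output.json")), truncated to two fields.
records = json.loads(
    '[{"id": "CVE-SAMPLE-0001", "severity": "High"},'
    ' {"id": "CVE-SAMPLE-0002", "severity": "Medium"}]'
)
print([r["id"] for r in high_priority(records)])
```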

Supported Vulnerability Types

| Vulnerability | CWE | Default Severity |
|---|---|---|
| Buffer Overflow | CWE-120 | High |
| SQL Injection | CWE-89 | Medium |
| Cross Site Scripting | CWE-79 | Medium |
| Use After Free | CWE-416 | Medium |
| Directory Traversal | CWE-22 | Medium |
| Authentication Bypass | CWE-287 | Medium |
| Integer Overflow | CWE-190 | Medium |
| Denial Of Service | CWE-400 | Medium |
| Command Injection | CWE-78 | High |
| Race Condition | CWE-362 | Medium |
| Privilege Escalation | CWE-269 | High |

Sample Run

  Loaded 17 CVE description(s) from 'data/cves.txt' …

✔  Results written to 'output.json'

══════════════════════════════════════════════════════════════
   CVE VULNERABILITY MAPPING — PARSED RESULTS
══════════════════════════════════════════════════════════════

[01] [HIGH] Buffer Overflow
     ID          : CVE-SAMPLE-0001
     Description : A buffer overflow vulnerability exists in the input validati…
     Root Cause  : Boundary checking failure
     Impact      : Arbitrary code execution
     Component   : Application function
     CWE         : CWE-120
     Confidence  : 100%
     Constructs  : buffer
     Mitigation  : Implement bounds checking, use safe string functions, enable…

══════════════════════════════════════════════════════════════
   VULNERABILITY STATISTICS
══════════════════════════════════════════════════════════════

  Total CVEs processed: 17
  Average confidence  : 78.2%

  ── By Vulnerability Type ──────────────────────
  Denial Of Service            ███ 3
  Buffer Overflow              ███ 3
  Cross Site Scripting         ███ 3
  SQL Injection                ██ 2
  Authentication Bypass        ██ 2
  Use After Free               ██ 2
  Directory Traversal          █ 1
  Integer Overflow             █ 1

  ── By Severity ────────────────────────────────
  Medium       █████████████ 13
  High         ████ 4

Extending the Tool

Add a new vulnerability type

  1. Add keyword phrases to VULNERABILITY_PATTERNS in patterns.py
  2. Add CWE metadata to VULNERABILITY_METADATA in patterns.py
  3. Optionally add semantic rules to ROOT_CAUSE_RULES / IMPACT_RULES in semantic.py
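Steps 1 and 2 amount to adding one entry to each dict. The sketch below uses Server Side Request Forgery (CWE-918) purely as an illustrative new type; the dict shapes, the abbreviated existing entry, the keyword phrases, and the mitigation wording are all assumptions, and unregistered_types is a hypothetical consistency check.

```python
# Assumed shapes for the two dicts in patterns.py; existing entries abbreviated.
VULNERABILITY_PATTERNS = {
    "SQL Injection": ["sql injection", "sql query"],
}
VULNERABILITY_METADATA = {
    "SQL Injection": {
        "cwe_id": "CWE-89",
        "mitigation": "Use parameterised queries",
        "references": ["https://cwe.mitre.org/data/definitions/89.html"],
    },
}

# Step 1: keyword phrases for the new type.
VULNERABILITY_PATTERNS["Server Side Request Forgery"] = [
    "server side request forgery",
    "ssrf",
]

# Step 2: matching CWE metadata under the same key.
VULNERABILITY_METADATA["Server Side Request Forgery"] = {
    "cwe_id": "CWE-918",
    "mitigation": "Validate and allow-list outbound request destinations",
    "references": ["https://cwe.mitre.org/data/definitions/918.html"],
}


def unregistered_types():
    """List types that appear in one dict but not the other."""
    return sorted(set(VULNERABILITY_PATTERNS) ^ set(VULNERABILITY_METADATA))


print(unregistered_types())
```

A check like this catches the most likely mistake when extending the tool: adding patterns for a type without its metadata, or the reverse.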

Feed real CVE-IDs

Pass a source_id through process_cve() in main.py so that build_ir() uses it as the record ID instead of generating one:

record = process_cve(cve_text, source_id="CVE-2023-12345")

Integrate with the NVD API

Replace load_cves() in utils.py with an HTTP fetch from https://services.nvd.nist.gov/rest/json/cves/2.0 to pull live CVE data.
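One possible shape for that replacement, using only the standard library, is sketched below. The function names are hypothetical; the keywordSearch and resultsPerPage query parameters and the vulnerabilities / cve / descriptions response layout follow the NVD CVE API 2.0 as understood here and should be checked against the NVD documentation. Rate limits and API keys are not handled.

```python
import json
import urllib.parse
import urllib.request

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def extract_descriptions(payload):
    """Return (cve_id, english_description) pairs from a decoded NVD response."""
    pairs = []
    for item in payload.get("vulnerabilities", []):
        cve = item.get("cve", {})
        for description in cve.get("descriptions", []):
            if description.get("lang") == "en":
                pairs.append((cve.get("id"), description.get("value", "")))
                break  # one English description per CVE is enough
    return pairs


def fetch_cves(keyword, limit=20, timeout=30):
    """Query the NVD API by keyword and return (cve_id, description) pairs."""
    query = urllib.parse.urlencode(
        {"keywordSearch": keyword, "resultsPerPage": limit}
    )
    with urllib.request.urlopen(f"{NVD_URL}?{query}", timeout=timeout) as response:
        return extract_descriptions(json.load(response))
```

Keeping the parsing in extract_descriptions, separate from the network call, lets it be tested against a saved response without hitting the API.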


Compiler Concepts Applied

Compiler Phase Implementation in This Project
| Lexical Analysis | lexer.py: tokenisation, stop-word removal, text normalisation |
| Parsing / Pattern Matching | parser.py: weighted keyword pattern matching, multi-type scoring |
| Semantic Analysis | semantic.py: rule-based meaning extraction (root cause, impact, severity) |
| Intermediate Representation | ir.py: structured IR dictionary as canonical internal format |
| Source-to-Source Transformation | CVE free text → enriched structured JSON suitable for SIEMs / ticketing systems |
