Machine Learning–Based Risk Analysis Using CVSS v3.1
Security teams often need to prioritize vulnerabilities before official CVSS scores are published. This project addresses that gap by predicting vulnerability severity directly from CVE descriptions using machine learning.
The system learns how historical CVE descriptions were translated into CVSS v3.1 metrics by security experts and applies that knowledge to newly disclosed vulnerabilities. By analyzing vulnerability text, the model predicts key CVSS attributes and reconstructs an estimated severity score to support early-stage risk assessment.
- CVE databases grow rapidly, making manual risk analysis slow and inconsistent
- Official CVSS scores may be delayed after disclosure
- Organizations need early, explainable severity estimates for faster remediation
- Input: Historical CVE descriptions + their official CVSS v3.1 metrics
- CVE text is converted into numerical features using TF-IDF
- Supervised machine learning models (Logistic Regression) learn patterns between:
  - Vulnerability language
  - Exploitability conditions
  - Impact severity
📌 Important: The model is not told what “high” or “low” severity means. It learns this implicitly by observing how similar descriptions were scored in the past.
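The training step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes one independent TF-IDF + Logistic Regression pipeline per CVSS metric, the helper names `train_metric_models` and `predict_metrics` are hypothetical, and the four training rows are hand-written toy examples rather than real NVD data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hand-written rows for illustration only (NOT real NVD data).
descriptions = [
    "Remote attacker sends a crafted network request without authentication",
    "Unauthenticated remote code execution via crafted HTTP request",
    "Local user with low privileges can escalate to root through a race condition",
    "Local attacker with shell access can read arbitrary files",
]
labels = {
    "attackVector": ["NETWORK", "NETWORK", "LOCAL", "LOCAL"],
    "privilegesRequired": ["NONE", "NONE", "LOW", "LOW"],
}


def train_metric_models(texts, labels_by_metric):
    """Fit one TF-IDF + Logistic Regression pipeline per CVSS metric (assumed design)."""
    models = {}
    for metric, y in labels_by_metric.items():
        model = make_pipeline(
            TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(texts, y)
        models[metric] = model
    return models


def predict_metrics(models, description):
    """Predict every modelled CVSS metric for a single description."""
    return {
        metric: str(model.predict([description])[0])
        for metric, model in models.items()
    }


models = train_metric_models(descriptions, labels)
prediction = predict_metrics(
    models, "Remote unauthenticated attacker sends a crafted request"
)
print(prediction)
```

Because each metric gets its own linear model, the learned TF-IDF term weights can be inspected per metric, which is one way to keep the predictions explainable.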
When a new CVE description is provided:
- The model predicts individual CVSS components:
  - Attack Vector
  - Attack Complexity
  - Privileges Required
  - User Interaction
  - Confidentiality, Integrity, Availability impact
- These predicted values are combined using CVSS formulas to estimate a base severity score
- The result enables early risk prioritization before official scoring
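The score reconstruction step can be sketched with the base-score equations and metric weights from the CVSS v3.1 specification. This is a minimal sketch rather than the project's implementation: the function names `base_score` and `severity_rating` are hypothetical, and because Scope is not among the predicted components listed above, the sketch assumes Scope is Unchanged unless a value is passed in.

```python
import math

# Metric weights from the CVSS v3.1 specification, keyed by NVD-style value names.
AV = {"NETWORK": 0.85, "ADJACENT_NETWORK": 0.62, "LOCAL": 0.55, "PHYSICAL": 0.2}
AC = {"LOW": 0.77, "HIGH": 0.44}
UI = {"NONE": 0.85, "REQUIRED": 0.62}
CIA = {"HIGH": 0.56, "LOW": 0.22, "NONE": 0.0}
PR_UNCHANGED = {"NONE": 0.85, "LOW": 0.62, "HIGH": 0.27}
PR_CHANGED = {"NONE": 0.85, "LOW": 0.68, "HIGH": 0.5}


def roundup(value):
    """Round up to one decimal place, following the spec's Roundup definition."""
    i = int(round(value * 100000))
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(av, ac, pr, ui, c, i, a, scope="UNCHANGED"):
    """Compute a CVSS v3.1 base score. Scope defaults to UNCHANGED (assumption)."""
    changed = scope == "CHANGED"
    pr_weight = (PR_CHANGED if changed else PR_UNCHANGED)[pr]
    # Impact sub-score
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    if impact <= 0:
        return 0.0
    exploitability = 8.22 * AV[av] * AC[ac] * pr_weight * UI[ui]
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10))


def severity_rating(score):
    """Map a base score to the CVSS v3.1 qualitative severity rating."""
    if score == 0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


score = base_score("NETWORK", "LOW", "NONE", "NONE", "HIGH", "HIGH", "NONE")
print(score, severity_rating(score))
```

For the vector `AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N` this evaluates to 9.1 (Critical). A wrong prediction for a single component can move the score across a severity boundary, so the estimate is best read together with the predicted components themselves.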
Input CVE Description:
“A remote attacker can exploit this vulnerability via a network request without authentication, leading to system compromise.”
Model Output:
- Attack Vector: Network
- Privileges Required: None
- Attack Complexity: Low
- Availability Impact: High
➡️ Predicted Severity: High / Critical
- Programming: Python
- Machine Learning: Scikit-learn (Logistic Regression)
- Text Processing: TF-IDF (NLP)
- Data Handling: Pandas, NumPy
- Visualization: Matplotlib, Seaborn, Plotly
- Data Source: National Vulnerability Database (NVD)
- Attack Vector prediction accuracy: 96%
- Attack Complexity accuracy: 94%
- User Interaction accuracy: 95%
- Strong performance on early-stage severity estimation
- High interpretability compared to deep learning models
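Accuracy figures like those above are easier to interpret alongside per-class numbers, since labels for a CVSS metric can be skewed toward one value (class imbalance is listed later as an area to improve). The sketch below is illustrative only: the helper names are hypothetical and the labels are made-up toy data, not the project's results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each true class: correct predictions / occurrences of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up, imbalanced toy labels: a model that always predicts the majority class.
y_true = ["LOW"] * 9 + ["HIGH"]
y_pred = ["LOW"] * 10

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

In this toy case the overall accuracy is 0.9 even though the minority class is never predicted correctly, which is why per-class recall is worth reporting next to accuracy.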
- CVE severity trend heatmaps
- Vulnerability clustering (honeycomb visualization)
- Time-series analysis of CVE growth
- Interactive dashboard for analyst exploration
- TF-IDF lacks contextual understanding (semantic ambiguity)
- Some overlap between similar CVSS categories
- Batch-based analysis (not real-time yet)
- Replace TF-IDF with transformer-based NLP models (BERT)
- Integrate real-time CVE feeds
- Support CVSS v4.0
- Improve handling of class imbalance
This project demonstrates how machine learning can:
- Reduce response time in vulnerability management
- Provide explainable and transparent risk scoring
- Assist security teams in prioritizing threats proactively
Here is an example of the CVE data from NVD:
{
  "cveMetadata": {
    "cveId": "CVE-2025-35028",
    "datePublished": "2025-11-30T21:27:56.057Z"
  },
  "descriptions": [
    {
      "lang": "en",
      "value": "By providing a command-line argument starting with a semi-colon …"
    }
  ],
  "metrics": [
    {
      "cvssV3_1": {
        "attackVector": "NETWORK",
        "attackComplexity": "LOW",
        "privilegesRequired": "NONE",
        "userInteraction": "NONE",
        "confidentialityImpact": "HIGH",
        "integrityImpact": "HIGH",
        "availabilityImpact": "NONE",
        "baseScore": 9.1,
        "baseSeverity": "CRITICAL",
        "vectorString": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N",
        "version": "3.1"
      }
    }
  ]
}
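A record shaped like the one above could be turned into a training row (description text plus metric labels) along these lines. This is a hedged sketch using only the standard library: `to_training_row`, `parse_vector`, and `METRIC_FIELDS` are hypothetical names, the record is a trimmed copy of the example with the description left truncated as in the source, and real feeds may nest these fields differently.

```python
import json

# Trimmed copy of the example record; the description stays truncated as in the source.
record_json = """
{
  "cveMetadata": {"cveId": "CVE-2025-35028"},
  "descriptions": [
    {"lang": "en", "value": "By providing a command-line argument starting with a semi-colon …"}
  ],
  "metrics": [
    {"cvssV3_1": {
      "attackVector": "NETWORK",
      "attackComplexity": "LOW",
      "privilegesRequired": "NONE",
      "userInteraction": "NONE",
      "confidentialityImpact": "HIGH",
      "integrityImpact": "HIGH",
      "availabilityImpact": "NONE",
      "vectorString": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N"
    }}
  ]
}
"""

METRIC_FIELDS = [
    "attackVector", "attackComplexity", "privilegesRequired", "userInteraction",
    "confidentialityImpact", "integrityImpact", "availabilityImpact",
]


def parse_vector(vector):
    """Split a CVSS vector string into {abbreviation: value}, skipping the version prefix."""
    _prefix, *parts = vector.split("/")
    return dict(part.split(":", 1) for part in parts)


def to_training_row(record):
    """Return text and labels for one record, or None if either is missing."""
    text = next(
        (d["value"] for d in record.get("descriptions", [])
         if d.get("lang", "").startswith("en")),
        None,
    )
    cvss = next(
        (m["cvssV3_1"] for m in record.get("metrics", []) if "cvssV3_1" in m),
        None,
    )
    if text is None or cvss is None:
        return None
    labels = {field: cvss[field] for field in METRIC_FIELDS if field in cvss}
    return {"cve_id": record["cveMetadata"]["cveId"], "text": text, "labels": labels}


row = to_training_row(json.loads(record_json))
vector = parse_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N")
print(row["cve_id"], row["labels"]["attackVector"], vector["S"])
```

The example record carries no separate scope field, so parsing `vectorString` is one way to recover Scope (`S:U` here) when it is needed for score reconstruction.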