---
title: CVE Severity Re-Ranker API
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# CVE Severity Re-Ranker API

Context-aware vulnerability prioritisation using NLP, Deep Learning, and Machine Learning.
The National Vulnerability Database (NVD) publishes thousands of CVEs every month, each with a static CVSS score. Security teams sort by CVSS and start patching from the top — but CVSS is environment-blind. A CVSS 9.8 Critical in software you don't use is less urgent than a CVSS 7.5 High in software running on your public-facing server.
This project fixes that:
- Analyses CVE descriptions from the NVD using NLP + SecBERT embeddings
- Predicts severity (Low / Medium / High / Critical) using XGBoost
- Re-ranks CVEs based on your specific software inventory
- Surfaces the most dangerous vulnerabilities for your environment — not a generic list
- Serves results through a FastAPI backend consumed by a Next.js dashboard
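The keyword-flag part of the NLP layer can be sketched as below. The flag names and regex patterns here are illustrative, not the project's actual eight flags (those live in `scripts/02_preprocess.py`):

```python
import re

# Illustrative keyword flags; the project's real flag set is defined
# in scripts/02_preprocess.py. These four are examples only.
FLAG_PATTERNS = {
    "has_remote":   r"\bremote(ly)?\b",
    "has_exec":     r"\b(execute|execution)\b",
    "has_overflow": r"\boverflow\b",
    "has_sql":      r"\bsql injection\b",
}

def keyword_flags(description: str) -> dict:
    """Binary keyword features extracted from a CVE description."""
    text = description.lower()
    return {name: int(bool(re.search(pattern, text)))
            for name, pattern in FLAG_PATTERNS.items()}

flags = keyword_flags(
    "A remote attacker may execute arbitrary code via a crafted request."
)
```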
## Team

| Name | Roll No | Contribution |
|---|---|---|
| Manglam Jaiswal | 10127 | Data collection, NLP preprocessing, EDA |
| Tanaya Bane | 10107 | Re-ranking module, Next.js UI, evaluation |
| Tanmay Sarode | 10154 | SecBERT embeddings, XGBoost training, SHAP |

Third Year | Semester 6 | ML + DL + NLP Mini Project | 2025–26
## Results at a Glance

| Metric | Value |
|---|---|
| Dataset | 200,431 CVEs (NVD 2019–2026) |
| Model | XGBoost on 781-dim fused feature vector |
| Weighted F1 | 0.77 |
| Accuracy | 77% |
| Medium F1 | 0.82 |
| Critical F1 | 0.73 |
## Architecture

```
NVD API
   │
   ▼
[Layer 1 — NLP]
Text cleaning → spaCy NER → keyword feature flags
   │
   ▼
[Layer 2 — Deep Learning]
SecBERT → 768-dim CLS embedding per CVE
   │
   ▼
[Layer 3 — Feature Fusion]
BERT (768) + NLP features (8) + CVSS metadata (5) = 781-dim vector
   │
   ▼
[Layer 4 — Machine Learning]
XGBoost classifier → Low / Medium / High / Critical
   │
   ▼
[Layer 5 — Contextual Re-Ranking]
User inventory CSV → fuzzy match → boost score → re-sorted list
   │
   ▼
[FastAPI Backend]
REST API serving predictions and re-ranked results
   │
   ▼
[Next.js Frontend]
Single CVE lookup | Bulk CSV upload | Inventory matcher | Dashboard
```
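Layer 3's fusion step is a plain concatenation of the three feature groups. A minimal NumPy sketch (function name and dummy inputs are illustrative, not the project's actual code):

```python
import numpy as np

def fuse_features(bert_cls, nlp_flags, cvss_meta):
    """Concatenate the three feature groups into the 781-dim vector
    fed to XGBoost: 768 SecBERT + 8 NLP flags + 5 CVSS metadata."""
    bert_cls  = np.asarray(bert_cls,  dtype=np.float32)   # shape (768,)
    nlp_flags = np.asarray(nlp_flags, dtype=np.float32)   # shape (8,)
    cvss_meta = np.asarray(cvss_meta, dtype=np.float32)   # shape (5,)
    assert bert_cls.shape == (768,)
    assert nlp_flags.shape == (8,) and cvss_meta.shape == (5,)
    return np.concatenate([bert_cls, nlp_flags, cvss_meta])

# Dummy inputs just to show the shape arithmetic: 768 + 8 + 5 = 781
vec = fuse_features(np.zeros(768), np.ones(8), np.full(5, 0.5))
print(vec.shape)  # (781,)
```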
## SDG Alignment

- **SDG 9 — Industry, Innovation and Infrastructure:** makes intelligent vulnerability prioritisation accessible to organisations of all sizes.
- **SDG 16 — Peace, Justice and Strong Institutions:** strengthens institutional resilience against cyberattacks by enabling faster, targeted vulnerability response.
## Project Structure

```
cve-severity-reranker/
│
├── .github/
│   └── workflows/
│       ├── daily_fetch.yml        # Fetches new CVEs every day at 6 AM UTC
│       └── weekly_pipeline.yml    # Full pipeline every Sunday at 2 AM UTC
│
├── scripts/
│   ├── 01_fetch.py                # NVD API data collection
│   ├── 02_preprocess.py           # NLP cleaning + feature engineering
│   ├── 03_embeddings.py           # SecBERT embedding generation
│   └── 04_train.py                # XGBoost training (smart update mode)
│
├── backend/
│   ├── main.py                    # FastAPI app — all API routes
│   ├── pipeline.py                # Prediction + re-ranking logic
│   ├── reranker.py                # Inventory matching + boost formula
│   └── requirements.txt           # Python dependencies for backend
│
├── frontend/
│   ├── app/                       # Next.js app directory
│   │   ├── page.tsx               # Home — dashboard overview
│   │   ├── lookup/page.tsx        # Single CVE lookup screen
│   │   ├── bulk/page.tsx          # Bulk CSV upload screen
│   │   └── inventory/page.tsx     # Inventory matcher screen
│   ├── components/                # Reusable React components
│   ├── public/                    # Static assets
│   ├── package.json
│   └── next.config.js
│
├── data/
│   ├── cves_raw.csv               # Raw NVD data (Git LFS)
│   ├── cves_processed.csv         # Cleaned + feature engineered (Git LFS)
│   ├── bert_embeddings.npy        # 200k × 768 embedding matrix (Git LFS)
│   └── last_updated.json          # Tracks last data collection date
│
├── models/
│   ├── model_xgb.pkl              # Trained XGBoost model (Git LFS)
│   ├── label_encoder.pkl          # Label encoder
│   └── training_tracker.json     # Tracks rows model was trained on
│
├── notebooks/
│   ├── 01_data_collection.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 04_embeddings.ipynb
│   ├── 05_training.ipynb
│   └── 06_live_updater.ipynb
│
├── requirements.txt
└── README.md
```
## Getting Started

### Prerequisites

- Python 3.10+
- Node.js 18+
- npm or yarn
### Setup

```bash
git clone https://github.com/ManglamX/cve-severity-reranker.git
cd cve-severity-reranker
```

Create a `.env.local` in the `frontend` folder:

```
NEXT_PUBLIC_API_URL=https://manglamx-cve-reranker.hf.space
```

Install dependencies and start the dev server:

```bash
cd frontend
npm install
npm run dev
```

The frontend runs at http://localhost:3000, connecting to the Hugging Face API.
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| GET | `/cve/{cve_id}` | Analyse a single CVE |
| POST | `/bulk` | Analyse a list of CVE IDs |
| POST | `/inventory` | Find CVEs matching uploaded inventory |
| GET | `/stats` | Dataset and model statistics |
Single CVE lookup:

```bash
curl http://localhost:8000/cve/CVE-2021-44228
```

```json
{
  "cve_id": "CVE-2021-44228",
  "cvss_score": 10.0,
  "cvss_label": "Critical",
  "predicted_label": "Critical",
  "context_score": 0.512,
  "boost_factor": 1.0,
  "matched_inventory": [],
  "attack_vector": "NETWORK",
  "has_remote": 1,
  "has_exec": 1
}
```

Bulk analysis with an inventory:

```bash
curl -X POST http://localhost:8000/bulk \
  -H "Content-Type: application/json" \
  -d '{
    "cve_ids": ["CVE-2021-44228", "CVE-2022-30190"],
    "inventory": ["Apache Log4j", "OpenSSL", "Windows Server"]
  }'
```

## Frontend Screens

**Dashboard**: overview of your CVE dataset, with a class distribution chart, the top 10 highest context score CVEs, and a model performance summary.
**Single CVE Lookup**: enter any CVE ID and get instant analysis: predicted severity, context score, risk signals (remote exploitable, code execution, attack vector), and an inventory match warning if applicable.

**Bulk CSV Upload**: upload a CSV with a `cve_id` column. Type your inventory manually or upload an inventory CSV to get environment-aware rankings.

**Inventory Matcher**: upload your software inventory CSV. The system returns only the CVEs that affect your software, ranked by context score.
Sample inventory CSV format:

```csv
software
Apache HTTP Server
OpenSSL
Windows Server
MySQL
Log4j
nginx
```

The re-ranking boost formula:

```
boost = 1.0
      + 0.30 × (number of inventory matches)
      × 1.25 (if public exploit exists)
      × 1.15 (if remote + unauthenticated)
      × 1.10 (if attack vector = NETWORK)

context_score = min(prob_critical × boost, 1.0)
```
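One reading of the formula as Python is sketched below. The conditional multipliers are applied to the whole boost here, which is an assumption since the formula's grouping is ambiguous; `match_inventory` uses the stdlib `difflib` as a dependency-free stand-in for the project's FuzzyWuzzy matching, so neither function is the repo's actual code:

```python
from difflib import SequenceMatcher

def match_inventory(cve_text, inventory, threshold=0.75):
    """Return inventory entries found (exactly or fuzzily) in the
    CVE description. The project uses FuzzyWuzzy; difflib stands in."""
    text = cve_text.lower()
    return [item for item in inventory
            if item.lower() in text
            or SequenceMatcher(None, item.lower(), text).ratio() >= threshold]

def context_score(prob_critical, n_matches,
                  public_exploit=False, remote_unauth=False, network=False):
    """Boost formula from the section above, capped at 1.0."""
    boost = 1.0 + 0.30 * n_matches
    if public_exploit:
        boost *= 1.25
    if remote_unauth:
        boost *= 1.15
    if network:
        boost *= 1.10
    return min(prob_critical * boost, 1.0)

matches = match_inventory(
    "Apache Log4j2 JNDI features allow remote code execution.",
    ["Apache Log4j", "OpenSSL", "Windows Server"],
)
score = context_score(0.512, len(matches), network=True)
```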
CVEs are sorted by context_score — not by CVSS score.
**Example — CVE-2021-44228 (Log4Shell)**

| Condition | Context Score |
|---|---|
| No inventory | 0.51 |
| Inventory contains Log4j | 0.67 (boost 1.43×) |
## Automation

| Workflow | Schedule | What it does |
|---|---|---|
| `daily_fetch.yml` | Every day, 6 AM UTC | Fetches new CVEs, updates `cves_raw.csv` |
| `weekly_pipeline.yml` | Every Sunday, 2 AM UTC | Fetch → preprocess → embed → retrain |
Smart training — only does what is needed:
| Situation | Mode | Time |
|---|---|---|
| Dataset unchanged | Skip — loads existing model | 10 sec |
| < 10% new rows | Update — trains on new rows only | ~2 min |
| ≥ 10% new rows | Full retrain | ~15 min |
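The decision table above amounts to a simple threshold check on new-row counts. A hypothetical sketch (file names follow the project structure, but this is not the actual code in `scripts/04_train.py`):

```python
def training_mode(trained_rows, dataset_rows):
    """Pick the smart-training mode from the table above.

    trained_rows -- rows recorded in models/training_tracker.json
    dataset_rows -- current rows in data/cves_processed.csv
    """
    new_rows = dataset_rows - trained_rows
    if new_rows <= 0:
        return "skip"           # dataset unchanged: load existing model
    if new_rows / dataset_rows < 0.10:
        return "update"         # < 10% new rows: train on new rows only
    return "full_retrain"       # >= 10% new rows: retrain from scratch
```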
Add your NVD API key as a GitHub Secret named `NVD_API_KEY` to enable automation.
## Notebooks

- `notebooks/01_data_collection.ipynb` — fetch CVE data
- `notebooks/02_preprocessing.ipynb` — NLP pipeline
- `notebooks/04_embeddings.ipynb` — SecBERT embeddings (GPU, ~65 min)
- `notebooks/05_training.ipynb` — XGBoost training + evaluation

Each notebook is smart: it skips completed steps and only processes new rows.
## Deployment

| Component | Platform | URL |
|---|---|---|
| API / Backend | Hugging Face Spaces (Docker) | https://manglamx-cve-reranker.hf.space |
| Frontend | Vercel | https://cve-reranker.vercel.app |
## Tech Stack

| Component | Technology |
|---|---|
| Frontend | Next.js (React) |
| Backend | FastAPI (Python) |
| NLP | spaCy, regex |
| Deep Learning | SecBERT, PyTorch, Hugging Face transformers |
| Machine Learning | XGBoost, scikit-learn |
| Explainability | SHAP TreeExplainer |
| Inventory matching | FuzzyWuzzy |
| Data source | NVD REST API v2.0 |
| Training platform | Google Colab (T4 GPU) |
| Automation | GitHub Actions |
| Storage | Google Drive + Git LFS |
## Evaluation

**Classification report — test set (40,087 CVEs)**
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Critical | 0.72 | 0.74 | 0.73 | 4,705 |
| High | 0.76 | 0.72 | 0.74 | 14,792 |
| Low | 0.88 | 0.38 | 0.53 | 1,589 |
| Medium | 0.79 | 0.86 | 0.82 | 19,001 |
| Weighted avg | 0.77 | 0.77 | 0.77 | 40,087 |
Recall on the Low class (0.38) suffers from class imbalance. Planned fix: SMOTE oversampling.
## Dataset

| Property | Value |
|---|---|
| Source | NVD REST API v2.0 |
| Date range | January 2019 — March 2026 |
| Total CVEs | 200,431 |
| Features per CVE | 781 (768 BERT + 8 NLP + 5 metadata) |
| Auto-updated | Daily via GitHub Actions |
CVSS label mapping:
| Score | Label |
|---|---|
| 0.0 – 3.9 | Low |
| 4.0 – 6.9 | Medium |
| 7.0 – 8.9 | High |
| 9.0 – 10.0 | Critical |
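The mapping table encodes directly as a helper, handy for reproducing the training labels locally (a hypothetical function, not taken from the repo):

```python
def cvss_label(score):
    """Map a CVSS base score (0.0-10.0) to the four training labels."""
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

print(cvss_label(10.0))  # Critical
```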
## References

- Shahid & Debar. *CVSS-BERT*. arXiv, 2021.
- *NLP-Based Analysis of Cyber Threats*. PMC, 2023.
- *CVE Severity Prediction — A Deep Learning Approach*. ScienceDirect, 2024.
- jackaduma. *SecBERT*. Hugging Face Hub.
- Lundberg & Lee. *SHAP*. NeurIPS, 2017.
- Chen & Guestrin. *XGBoost*. KDD, 2016.
This project was built for academic purposes as part of a Third Year Mini Project (2025–26).