
---
title: CVE Severity Re-Ranker API
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

CVE Vulnerability Severity Re-Ranking

Context-aware vulnerability prioritisation using NLP, Deep Learning, and Machine Learning



The Problem

The National Vulnerability Database (NVD) publishes thousands of CVEs every month, each with a static CVSS score. Security teams sort by CVSS and start patching from the top — but CVSS is environment-blind. A CVSS 9.8 Critical in software you don't use is less urgent than a CVSS 7.5 High in software running on your public-facing server.

This project fixes that.


What This System Does

  1. Analyses CVE descriptions from the NVD using NLP + SecBERT embeddings
  2. Predicts severity (Low / Medium / High / Critical) using XGBoost
  3. Re-ranks CVEs based on your specific software inventory
  4. Surfaces the most dangerous vulnerabilities for your environment — not a generic list
  5. Serves results through a FastAPI backend consumed by a Next.js dashboard

Team

| Name | Roll No | Contribution |
| --- | --- | --- |
| Manglam Jaiswal | 10127 | Data collection, NLP preprocessing, EDA |
| Tanaya Bane | 10107 | Re-ranking module, Next.js UI, evaluation |
| Tanmay Sarode | 10154 | SecBERT embeddings, XGBoost training, SHAP |

Third Year | Semester 6 | ML + DL + NLP Mini Project | 2025–26


Results

| Metric | Value |
| --- | --- |
| Dataset | 200,431 CVEs (NVD 2019–2026) |
| Model | XGBoost on 781-dim fused feature vector |
| Weighted F1 | 0.77 |
| Accuracy | 77% |
| Medium F1 | 0.82 |
| Critical F1 | 0.73 |

System Architecture

NVD API
   │
   ▼
[Layer 1 — NLP]
  Text cleaning → spaCy NER → keyword feature flags
   │
   ▼
[Layer 2 — Deep Learning]
  SecBERT → 768-dim CLS embedding per CVE
   │
   ▼
[Layer 3 — Feature Fusion]
  BERT (768) + NLP features (8) + CVSS metadata (5) = 781-dim vector
   │
   ▼
[Layer 4 — Machine Learning]
  XGBoost classifier → Low / Medium / High / Critical
   │
   ▼
[Layer 5 — Contextual Re-Ranking]
  User inventory CSV → fuzzy match → boost score → re-sorted list
   │
   ▼
[FastAPI Backend]
  REST API serving predictions and re-ranked results
   │
   ▼
[Next.js Frontend]
  Single CVE lookup | Bulk CSV upload | Inventory matcher | Dashboard
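The Layer 3 fusion step above is a plain concatenation of the three feature groups. A minimal sketch with NumPy (the function name `fuse_features` and the zero-valued dummy inputs are illustrative, not from the repository):

```python
import numpy as np

def fuse_features(bert_cls, nlp_flags, cvss_meta):
    """Concatenate the three feature groups into one 781-dim vector."""
    vec = np.concatenate([bert_cls, nlp_flags, cvss_meta])
    assert vec.shape == (781,), "expected 768 + 8 + 5 features"
    return vec

# Dummy inputs with the dimensions from the architecture diagram.
vec = fuse_features(np.zeros(768), np.zeros(8), np.zeros(5))
print(vec.shape)  # (781,)
```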

SDG Mapping

SDG 9 — Industry, Innovation and Infrastructure: makes intelligent vulnerability prioritisation accessible to organisations of all sizes.

SDG 16 — Peace, Justice and Strong Institutions: strengthens institutional resilience against cyberattacks by enabling faster, targeted vulnerability response.


Repository Structure

cve-severity-reranker/
│
├── .github/
│   └── workflows/
│       ├── daily_fetch.yml         # Fetches new CVEs every day at 6 AM UTC
│       └── weekly_pipeline.yml     # Full pipeline every Sunday at 2 AM UTC
│
├── scripts/
│   ├── 01_fetch.py                 # NVD API data collection
│   ├── 02_preprocess.py            # NLP cleaning + feature engineering
│   ├── 03_embeddings.py            # SecBERT embedding generation
│   └── 04_train.py                 # XGBoost training (smart update mode)
│
├── backend/
│   ├── main.py                     # FastAPI app — all API routes
│   ├── pipeline.py                 # Prediction + re-ranking logic
│   ├── reranker.py                 # Inventory matching + boost formula
│   └── requirements.txt            # Python dependencies for backend
│
├── frontend/
│   ├── app/                        # Next.js app directory
│   │   ├── page.tsx                # Home — dashboard overview
│   │   ├── lookup/page.tsx         # Single CVE lookup screen
│   │   ├── bulk/page.tsx           # Bulk CSV upload screen
│   │   └── inventory/page.tsx      # Inventory matcher screen
│   ├── components/                 # Reusable React components
│   ├── public/                     # Static assets
│   ├── package.json
│   └── next.config.js
│
├── data/
│   ├── cves_raw.csv                # Raw NVD data (Git LFS)
│   ├── cves_processed.csv          # Cleaned + feature engineered (Git LFS)
│   ├── bert_embeddings.npy         # 200k × 768 embedding matrix (Git LFS)
│   └── last_updated.json           # Tracks last data collection date
│
├── models/
│   ├── model_xgb.pkl               # Trained XGBoost model (Git LFS)
│   ├── label_encoder.pkl           # Label encoder
│   └── training_tracker.json       # Tracks rows model was trained on
│
├── notebooks/
│   ├── 01_data_collection.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 04_embeddings.ipynb
│   ├── 05_training.ipynb
│   └── 06_live_updater.ipynb
│
├── requirements.txt
└── README.md

Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • npm or yarn

1. Clone the repo

git clone https://github.com/ManglamX/cve-severity-reranker.git
cd cve-severity-reranker

2. Configure Environment

Create a .env.local in the frontend folder:

NEXT_PUBLIC_API_URL=https://manglamx-cve-reranker.hf.space

3. Start the Next.js frontend

cd frontend
npm install
npm run dev

The frontend runs at http://localhost:3000, connecting to the hosted Hugging Face API.


API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /cve/{cve_id} | Analyse a single CVE |
| POST | /bulk | Analyse a list of CVE IDs |
| POST | /inventory | Find CVEs matching uploaded inventory |
| GET | /stats | Dataset and model statistics |

Example — single CVE

curl http://localhost:8000/cve/CVE-2021-44228
{
  "cve_id": "CVE-2021-44228",
  "cvss_score": 10.0,
  "cvss_label": "Critical",
  "predicted_label": "Critical",
  "context_score": 0.512,
  "boost_factor": 1.0,
  "matched_inventory": [],
  "attack_vector": "NETWORK",
  "has_remote": 1,
  "has_exec": 1
}

Example — bulk with inventory

curl -X POST http://localhost:8000/bulk \
  -H "Content-Type: application/json" \
  -d '{
    "cve_ids": ["CVE-2021-44228", "CVE-2022-30190"],
    "inventory": ["Apache Log4j", "OpenSSL", "Windows Server"]
  }'
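The same /bulk request can be built from Python using only the standard library. A sketch (the helper name `build_bulk_request` is hypothetical; the actual frontend calls the API from Next.js):

```python
import json
import urllib.request

def build_bulk_request(base_url, cve_ids, inventory):
    """Build a POST request for /bulk with the same JSON payload as the curl example."""
    payload = json.dumps({"cve_ids": cve_ids, "inventory": inventory}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/bulk",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_bulk_request(
    "http://localhost:8000",
    ["CVE-2021-44228", "CVE-2022-30190"],
    ["Apache Log4j", "OpenSSL", "Windows Server"],
)
# urllib.request.urlopen(req) would send it; omitted here to keep the sketch offline.
```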

Frontend Screens

Dashboard

Overview of your CVE dataset — class distribution chart, top 10 highest context score CVEs, model performance summary.

Single CVE Lookup

Enter any CVE ID and get instant analysis — predicted severity, context score, risk signals (remote exploitable, code execution, attack vector), and inventory match warning if applicable.

Bulk CSV Upload

Upload a CSV with a cve_id column. You can type your inventory manually or upload an inventory CSV to get environment-aware rankings.

Inventory Matcher

Upload your software inventory CSV. The system returns only CVEs that affect your software, ranked by context score.

Sample inventory CSV format:

software
Apache HTTP Server
OpenSSL
Windows Server
MySQL
Log4j
nginx
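The repository's matcher uses FuzzyWuzzy (see Tech Stack). A rough standard-library approximation with difflib shows the idea; the function name, threshold, and matching strategy here are illustrative, not the repository's actual logic:

```python
from difflib import SequenceMatcher

def match_inventory(cve_description, inventory, threshold=0.6):
    """Return inventory entries that fuzzily appear in the CVE description."""
    desc = cve_description.lower()
    matches = []
    for software in inventory:
        name = software.lower()
        # Exact substring hit, or a high similarity ratio against the description.
        if name in desc or SequenceMatcher(None, name, desc).ratio() >= threshold:
            matches.append(software)
    return matches

inventory = ["Apache HTTP Server", "OpenSSL", "Log4j", "nginx"]
desc = "Apache Log4j2 JNDI features allow remote code execution"
print(match_inventory(desc, inventory))  # ['Log4j']
```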

How the Re-Ranking Works

boost = 1.0
  + 0.30 × (number of inventory matches)
  × 1.25  (if public exploit exists)
  × 1.15  (if remote + unauthenticated)
  × 1.10  (if attack vector = NETWORK)

context_score = min(prob_critical × boost, 1.0)

CVEs are sorted by context_score — not by CVSS score.
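Translated into Python, the formula looks like the sketch below. Function names are illustrative, and it is an assumption (from the formula as printed) that each multiplier applies only when its signal fires. With Log4Shell's critical-class probability of 0.512 from the API example, no inventory gives 0.51, and a single inventory match alone gives a 1.30× boost and a score of 0.67:

```python
def boost_factor(n_matches, has_exploit=False, remote_unauth=False, network_av=False):
    """Additive term for inventory matches, multiplicative terms for risk signals."""
    boost = 1.0 + 0.30 * n_matches
    if has_exploit:
        boost *= 1.25
    if remote_unauth:
        boost *= 1.15
    if network_av:
        boost *= 1.10
    return boost

def context_score(prob_critical, boost):
    """Boosted critical probability, capped at 1.0."""
    return min(prob_critical * boost, 1.0)

p_crit = 0.512  # Log4Shell's critical-class probability from the API example
print(round(context_score(p_crit, boost_factor(0)), 2))  # 0.51 — no inventory
print(round(context_score(p_crit, boost_factor(1)), 2))  # 0.67 — one inventory match
```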

Example — CVE-2021-44228 (Log4Shell)

| Condition | Context Score |
| --- | --- |
| No inventory | 0.51 |
| Inventory contains Log4j | 0.67 (boost 1.30×) |

Automation (GitHub Actions)

| Workflow | Schedule | What it does |
| --- | --- | --- |
| daily_fetch.yml | Every day 6 AM UTC | Fetches new CVEs, updates cves_raw.csv |
| weekly_pipeline.yml | Every Sunday 2 AM UTC | Fetch → preprocess → embed → retrain |

Smart training — only does what is needed:

| Situation | Mode | Time |
| --- | --- | --- |
| Dataset unchanged | Skip — loads existing model | 10 sec |
| < 10% new rows | Update — trains on new rows only | ~2 min |
| ≥ 10% new rows | Full retrain | ~15 min |
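The mode selection implied by the table can be sketched as follows (the function name and exact threshold handling are assumptions; the repository tracks trained rows in models/training_tracker.json):

```python
def choose_training_mode(total_rows, trained_rows):
    """Pick a training mode from the fraction of rows the model has not yet seen."""
    new_rows = total_rows - trained_rows
    if new_rows == 0:
        return "skip"           # dataset unchanged: load the existing model
    if new_rows / total_rows < 0.10:
        return "update"         # incremental: train on new rows only
    return "full_retrain"       # too much drift: retrain from scratch

print(choose_training_mode(200_431, 200_431))  # skip
print(choose_training_mode(200_431, 199_000))  # update (~0.7% new)
print(choose_training_mode(200_431, 150_000))  # full_retrain (~25% new)
```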

Add your NVD API key as a GitHub Secret named NVD_API_KEY to enable automation.


Retraining (Google Colab)

  1. notebooks/01_data_collection.ipynb — fetch CVE data
  2. notebooks/02_preprocessing.ipynb — NLP pipeline
  3. notebooks/04_embeddings.ipynb — SecBERT embeddings (GPU, ~65 min)
  4. notebooks/05_training.ipynb — XGBoost training + evaluation

Each notebook is smart — skips completed steps and only processes new rows.


Deployment

| Component | Platform | URL |
| --- | --- | --- |
| API / Backend | Hugging Face Spaces (Docker) | https://manglamx-cve-reranker.hf.space |
| Frontend | Vercel | https://cve-reranker.vercel.app |

Tech Stack

| Component | Technology |
| --- | --- |
| Frontend | Next.js (React) |
| Backend | FastAPI (Python) |
| NLP | spaCy, regex |
| Deep Learning | SecBERT, PyTorch, Hugging Face transformers |
| Machine Learning | XGBoost, scikit-learn |
| Explainability | SHAP TreeExplainer |
| Inventory matching | FuzzyWuzzy |
| Data source | NVD REST API v2.0 |
| Training platform | Google Colab (T4 GPU) |
| Automation | GitHub Actions |
| Storage | Google Drive + Git LFS |

Evaluation

Classification Report — test set (40,087 CVEs)

| Class | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| Critical | 0.72 | 0.74 | 0.73 | 4,705 |
| High | 0.76 | 0.72 | 0.74 | 14,792 |
| Low | 0.88 | 0.38 | 0.53 | 1,589 |
| Medium | 0.79 | 0.86 | 0.82 | 19,001 |
| Weighted avg | 0.77 | 0.77 | 0.77 | 40,087 |

Low class recall is lower due to class imbalance. Planned fix: SMOTE oversampling.
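As a sanity check, the weighted F1 from the results table can be recomputed from the per-class F1 scores and supports above:

```python
# Per-class F1 and support from the classification report above.
report = {
    "Critical": (0.73, 4_705),
    "High":     (0.74, 14_792),
    "Low":      (0.53, 1_589),
    "Medium":   (0.82, 19_001),
}

total = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total
print(total, round(weighted_f1, 2))  # 40087 0.77
```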


Dataset

| Property | Value |
| --- | --- |
| Source | NVD REST API v2.0 |
| Date range | January 2019 — March 2026 |
| Total CVEs | 200,431 |
| Features per CVE | 781 (768 BERT + 8 NLP + 5 metadata) |
| Auto-updated | Daily via GitHub Actions |

CVSS label mapping:

| Score | Label |
| --- | --- |
| 0.0 – 3.9 | Low |
| 4.0 – 6.9 | Medium |
| 7.0 – 8.9 | High |
| 9.0 – 10.0 | Critical |
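The mapping as a function (the name `cvss_to_label` is illustrative; NVD base scores have one decimal, so the band boundaries are unambiguous):

```python
def cvss_to_label(score):
    """Map a CVSS base score (0.0–10.0) to the four-class severity label."""
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

print([cvss_to_label(s) for s in (3.9, 4.0, 7.5, 9.8)])
# ['Low', 'Medium', 'High', 'Critical']
```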

References

  1. Shahid & Debar. CVSS-BERT. arXiv 2021.
  2. NLP-Based Analysis of Cyber Threats. PMC 2023.
  3. CVE Severity Prediction — A Deep Learning Approach. ScienceDirect 2024.
  4. jackaduma. SecBERT. Hugging Face Hub.
  5. Lundberg & Lee. SHAP. NeurIPS 2017.
  6. Chen & Guestrin. XGBoost. KDD 2016.

License

This project was built for academic purposes as part of a Third Year Mini Project (2025–26).
