
---
title: CVE Severity Re-Ranker API
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---

CVE Vulnerability Severity Re-Ranking

Context-aware vulnerability prioritisation using NLP, Deep Learning, and Machine Learning



The Problem

The National Vulnerability Database (NVD) publishes thousands of CVEs every month, each with a static CVSS score. Security teams sort by CVSS and start patching from the top — but CVSS is environment-blind. A CVSS 9.8 Critical in software you don't use is less urgent than a CVSS 7.5 High in software running on your public-facing server.

This project fixes that.


What This System Does

  1. Analyses CVE descriptions from the NVD using NLP + SecBERT embeddings
  2. Predicts severity (Low / Medium / High / Critical) using XGBoost
  3. Re-ranks CVEs based on your specific software inventory
  4. Surfaces the most dangerous vulnerabilities for your environment — not a generic list
  5. Serves results through a FastAPI backend consumed by a Next.js dashboard

Team

| Name | Roll No | Contribution |
| --- | --- | --- |
| Manglam Jaiswal | 10127 | Data collection, NLP preprocessing, EDA |
| Tanaya Bane | 10107 | Re-ranking module, Next.js UI, evaluation |
| Tanmay Sarode | 10154 | SecBERT embeddings, XGBoost training, SHAP |

Third Year | Semester 6 | ML + DL + NLP Mini Project | 2025–26


Results

| Metric | Value |
| --- | --- |
| Dataset | 200,431 CVEs (NVD 2019–2026) |
| Model | XGBoost on 781-dim fused feature vector |
| Weighted F1 | 0.77 |
| Accuracy | 77% |
| Medium F1 | 0.82 |
| Critical F1 | 0.73 |

System Architecture

NVD API
   │
   ▼
[Layer 1 — NLP]
  Text cleaning → spaCy NER → keyword feature flags
   │
   ▼
[Layer 2 — Deep Learning]
  SecBERT → 768-dim CLS embedding per CVE
   │
   ▼
[Layer 3 — Feature Fusion]
  BERT (768) + NLP features (8) + CVSS metadata (5) = 781-dim vector
   │
   ▼
[Layer 4 — Machine Learning]
  XGBoost classifier → Low / Medium / High / Critical
   │
   ▼
[Layer 5 — Contextual Re-Ranking]
  User inventory CSV → fuzzy match → boost score → re-sorted list
   │
   ▼
[FastAPI Backend]
  REST API serving predictions and re-ranked results
   │
   ▼
[Next.js Frontend]
  Single CVE lookup | Bulk CSV upload | Inventory matcher | Dashboard
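The Layer 3 fusion step above is a plain concatenation of the three feature groups. A minimal sketch with NumPy (the function name `fuse_features` and the zero-valued dummy inputs are illustrative, not from the repository):

```python
import numpy as np

def fuse_features(bert_cls, nlp_flags, cvss_meta):
    """Concatenate the three feature groups into one 781-dim vector."""
    vec = np.concatenate([bert_cls, nlp_flags, cvss_meta])
    assert vec.shape == (781,), "expected 768 + 8 + 5 features"
    return vec

# Dummy inputs with the dimensions from the architecture diagram.
vec = fuse_features(np.zeros(768), np.zeros(8), np.zeros(5))
print(vec.shape)  # (781,)
```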

SDG Mapping

SDG 9 — Industry, Innovation and Infrastructure: makes intelligent vulnerability prioritisation accessible to organisations of all sizes.

SDG 16 — Peace, Justice and Strong Institutions: strengthens institutional resilience against cyberattacks by enabling faster, targeted vulnerability response.


Repository Structure

cve-severity-reranker/
│
├── .github/
│   └── workflows/
│       ├── daily_fetch.yml         # Fetches new CVEs every day at 6 AM UTC
│       └── weekly_pipeline.yml     # Full pipeline every Sunday at 2 AM UTC
│
├── scripts/
│   ├── 01_fetch.py                 # NVD API data collection
│   ├── 02_preprocess.py            # NLP cleaning + feature engineering
│   ├── 03_embeddings.py            # SecBERT embedding generation
│   └── 04_train.py                 # XGBoost training (smart update mode)
│
├── backend/
│   ├── main.py                     # FastAPI app — all API routes
│   ├── pipeline.py                 # Prediction + re-ranking logic
│   ├── reranker.py                 # Inventory matching + boost formula
│   └── requirements.txt            # Python dependencies for backend
│
├── frontend/
│   ├── app/                        # Next.js app directory
│   │   ├── page.tsx                # Home — dashboard overview
│   │   ├── lookup/page.tsx         # Single CVE lookup screen
│   │   ├── bulk/page.tsx           # Bulk CSV upload screen
│   │   └── inventory/page.tsx      # Inventory matcher screen
│   ├── components/                 # Reusable React components
│   ├── public/                     # Static assets
│   ├── package.json
│   └── next.config.js
│
├── data/
│   ├── cves_raw.csv                # Raw NVD data (Git LFS)
│   ├── cves_processed.csv          # Cleaned + feature engineered (Git LFS)
│   ├── bert_embeddings.npy         # 200k × 768 embedding matrix (Git LFS)
│   └── last_updated.json           # Tracks last data collection date
│
├── models/
│   ├── model_xgb.pkl               # Trained XGBoost model (Git LFS)
│   ├── label_encoder.pkl           # Label encoder
│   └── training_tracker.json       # Tracks rows model was trained on
│
├── notebooks/
│   ├── 01_data_collection.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 04_embeddings.ipynb
│   ├── 05_training.ipynb
│   └── 06_live_updater.ipynb
│
├── requirements.txt
└── README.md

Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • npm or yarn

1. Clone the repo

git clone https://github.com/ManglamX/cve-severity-reranker.git
cd cve-severity-reranker

2. Configure Environment

Create a .env.local in the frontend folder:

NEXT_PUBLIC_API_URL=https://manglamx-cve-reranker.hf.space

3. Start the Next.js frontend

cd frontend
npm install
npm run dev

The frontend runs at http://localhost:3000, connecting to the hosted Hugging Face API.


API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check |
| GET | /cve/{cve_id} | Analyse a single CVE |
| POST | /bulk | Analyse a list of CVE IDs |
| POST | /inventory | Find CVEs matching uploaded inventory |
| GET | /stats | Dataset and model statistics |

Example — single CVE

curl http://localhost:8000/cve/CVE-2021-44228
{
  "cve_id": "CVE-2021-44228",
  "cvss_score": 10.0,
  "cvss_label": "Critical",
  "predicted_label": "Critical",
  "context_score": 0.512,
  "boost_factor": 1.0,
  "matched_inventory": [],
  "attack_vector": "NETWORK",
  "has_remote": 1,
  "has_exec": 1
}

Example — bulk with inventory

curl -X POST http://localhost:8000/bulk \
  -H "Content-Type: application/json" \
  -d '{
    "cve_ids": ["CVE-2021-44228", "CVE-2022-30190"],
    "inventory": ["Apache Log4j", "OpenSSL", "Windows Server"]
  }'
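The same /bulk request can be built from Python using only the standard library. A sketch (the helper name `build_bulk_request` is hypothetical; the actual frontend calls the API from Next.js):

```python
import json
import urllib.request

def build_bulk_request(base_url, cve_ids, inventory):
    """Build a POST request for /bulk with the same JSON payload as the curl example."""
    payload = json.dumps({"cve_ids": cve_ids, "inventory": inventory}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/bulk",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_bulk_request(
    "http://localhost:8000",
    ["CVE-2021-44228", "CVE-2022-30190"],
    ["Apache Log4j", "OpenSSL", "Windows Server"],
)
# urllib.request.urlopen(req) would send it; omitted here to keep the sketch offline.
```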

Frontend Screens

Dashboard

Overview of your CVE dataset — class distribution chart, top 10 highest context score CVEs, model performance summary.

Single CVE Lookup

Enter any CVE ID and get instant analysis — predicted severity, context score, risk signals (remote exploitable, code execution, attack vector), and inventory match warning if applicable.

Bulk CSV Upload

Upload a CSV with a cve_id column. You can type your inventory manually or upload an inventory CSV to get environment-aware rankings.

Inventory Matcher

Upload your software inventory CSV. The system returns only CVEs that affect your software, ranked by context score.

Sample inventory CSV format:

software
Apache HTTP Server
OpenSSL
Windows Server
MySQL
Log4j
nginx
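The repository's matcher uses FuzzyWuzzy (see Tech Stack). A rough standard-library approximation with difflib shows the idea; the function name, threshold, and matching strategy here are illustrative, not the repository's actual logic:

```python
from difflib import SequenceMatcher

def match_inventory(cve_description, inventory, threshold=0.6):
    """Return inventory entries that fuzzily appear in the CVE description."""
    desc = cve_description.lower()
    matches = []
    for software in inventory:
        name = software.lower()
        # Exact substring hit, or a high similarity ratio against the description.
        if name in desc or SequenceMatcher(None, name, desc).ratio() >= threshold:
            matches.append(software)
    return matches

inventory = ["Apache HTTP Server", "OpenSSL", "Log4j", "nginx"]
desc = "Apache Log4j2 JNDI features allow remote code execution"
print(match_inventory(desc, inventory))  # ['Log4j']
```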

How the Re-Ranking Works

boost = 1.0
  + 0.30 × (number of inventory matches)
  × 1.25  (if public exploit exists)
  × 1.15  (if remote + unauthenticated)
  × 1.10  (if attack vector = NETWORK)

context_score = min(prob_critical × boost, 1.0)

CVEs are sorted by context_score — not by CVSS score.
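Translated into Python, the formula looks like the sketch below. Function names are illustrative, and it is an assumption (from the formula as printed) that each multiplier applies only when its signal fires. With Log4Shell's critical-class probability of 0.512 from the API example, no inventory gives 0.51, and a single inventory match alone gives a 1.30× boost and a score of 0.67:

```python
def boost_factor(n_matches, has_exploit=False, remote_unauth=False, network_av=False):
    """Additive term for inventory matches, multiplicative terms for risk signals."""
    boost = 1.0 + 0.30 * n_matches
    if has_exploit:
        boost *= 1.25
    if remote_unauth:
        boost *= 1.15
    if network_av:
        boost *= 1.10
    return boost

def context_score(prob_critical, boost):
    """Boosted critical probability, capped at 1.0."""
    return min(prob_critical * boost, 1.0)

p_crit = 0.512  # Log4Shell's critical-class probability from the API example
print(round(context_score(p_crit, boost_factor(0)), 2))  # 0.51 — no inventory
print(round(context_score(p_crit, boost_factor(1)), 2))  # 0.67 — one inventory match
```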

Example — CVE-2021-44228 (Log4Shell)

| Condition | Context Score |
| --- | --- |
| No inventory | 0.51 |
| Inventory contains Log4j | 0.67 (boost 1.30×) |

Automation (GitHub Actions)

| Workflow | Schedule | What it does |
| --- | --- | --- |
| daily_fetch.yml | Every day 6 AM UTC | Fetches new CVEs, updates cves_raw.csv |
| weekly_pipeline.yml | Every Sunday 2 AM UTC | Fetch → preprocess → embed → retrain |

Smart training — only does what is needed:

| Situation | Mode | Time |
| --- | --- | --- |
| Dataset unchanged | Skip — loads existing model | 10 sec |
| < 10% new rows | Update — trains on new rows only | ~2 min |
| ≥ 10% new rows | Full retrain | ~15 min |
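The mode selection implied by the table can be sketched as follows (the function name and exact threshold handling are assumptions; the repository tracks trained rows in models/training_tracker.json):

```python
def choose_training_mode(total_rows, trained_rows):
    """Pick a training mode from the fraction of rows the model has not yet seen."""
    new_rows = total_rows - trained_rows
    if new_rows == 0:
        return "skip"           # dataset unchanged: load the existing model
    if new_rows / total_rows < 0.10:
        return "update"         # incremental: train on new rows only
    return "full_retrain"       # too much drift: retrain from scratch

print(choose_training_mode(200_431, 200_431))  # skip
print(choose_training_mode(200_431, 199_000))  # update (~0.7% new)
print(choose_training_mode(200_431, 150_000))  # full_retrain (~25% new)
```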

Add your NVD API key as a GitHub Secret named NVD_API_KEY to enable automation.


Retraining (Google Colab)

  1. notebooks/01_data_collection.ipynb — fetch CVE data
  2. notebooks/02_preprocessing.ipynb — NLP pipeline
  3. notebooks/04_embeddings.ipynb — SecBERT embeddings (GPU, ~65 min)
  4. notebooks/05_training.ipynb — XGBoost training + evaluation

Each notebook is smart — skips completed steps and only processes new rows.


Deployment

| Component | Platform | URL |
| --- | --- | --- |
| API / Backend | Hugging Face Spaces (Docker) | https://manglamx-cve-reranker.hf.space |
| Frontend | Vercel | https://cve-reranker.vercel.app |

Tech Stack

| Component | Technology |
| --- | --- |
| Frontend | Next.js (React) |
| Backend | FastAPI (Python) |
| NLP | spaCy, regex |
| Deep Learning | SecBERT, PyTorch, Hugging Face transformers |
| Machine Learning | XGBoost, scikit-learn |
| Explainability | SHAP TreeExplainer |
| Inventory matching | FuzzyWuzzy |
| Data source | NVD REST API v2.0 |
| Training platform | Google Colab (T4 GPU) |
| Automation | GitHub Actions |
| Storage | Google Drive + Git LFS |

Evaluation

Classification Report — test set (40,087 CVEs)

| Class | Precision | Recall | F1 | Support |
| --- | --- | --- | --- | --- |
| Critical | 0.72 | 0.74 | 0.73 | 4,705 |
| High | 0.76 | 0.72 | 0.74 | 14,792 |
| Low | 0.88 | 0.38 | 0.53 | 1,589 |
| Medium | 0.79 | 0.86 | 0.82 | 19,001 |
| Weighted avg | 0.77 | 0.77 | 0.77 | 40,087 |

Low class recall is lower due to class imbalance. Planned fix: SMOTE oversampling.
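As a sanity check, the weighted F1 from the results table can be recomputed from the per-class F1 scores and supports above:

```python
# Per-class F1 and support from the classification report above.
report = {
    "Critical": (0.73, 4_705),
    "High":     (0.74, 14_792),
    "Low":      (0.53, 1_589),
    "Medium":   (0.82, 19_001),
}

total = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total
print(total, round(weighted_f1, 2))  # 40087 0.77
```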


Dataset

| Property | Value |
| --- | --- |
| Source | NVD REST API v2.0 |
| Date range | January 2019 — March 2026 |
| Total CVEs | 200,431 |
| Features per CVE | 781 (768 BERT + 8 NLP + 5 metadata) |
| Auto-updated | Daily via GitHub Actions |

CVSS label mapping:

| Score | Label |
| --- | --- |
| 0.0 – 3.9 | Low |
| 4.0 – 6.9 | Medium |
| 7.0 – 8.9 | High |
| 9.0 – 10.0 | Critical |
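The mapping as a function (the name `cvss_to_label` is illustrative; NVD base scores have one decimal, so the band boundaries are unambiguous):

```python
def cvss_to_label(score):
    """Map a CVSS base score (0.0–10.0) to the four-class severity label."""
    if score <= 3.9:
        return "Low"
    if score <= 6.9:
        return "Medium"
    if score <= 8.9:
        return "High"
    return "Critical"

print([cvss_to_label(s) for s in (3.9, 4.0, 7.5, 9.8)])
# ['Low', 'Medium', 'High', 'Critical']
```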

References

  1. Shahid & Debar. CVSS-BERT. arXiv 2021.
  2. NLP-Based Analysis of Cyber Threats. PMC 2023.
  3. CVE Severity Prediction — A Deep Learning Approach. ScienceDirect 2024.
  4. jackaduma. SecBERT. Hugging Face Hub.
  5. Lundberg & Lee. SHAP. NeurIPS 2017.
  6. Chen & Guestrin. XGBoost. KDD 2016.

License

This project was built for academic purposes as part of a Third Year Mini Project (2025–26).
