
# 📘 CVE ID Extraction from NVD ZIP feeds (v2.0) — 2002 → 2025


**Goal:**  
This notebook walks through the `nvdcve-2.0-YYYY.json.zip` archives, extracts every CVE ID (plus useful metadata) from the NVD JSON dumps, concatenates and deduplicates the results, and writes a single CSV. Each ID is turned into a link to its CVE detail page on **CVEFeed** (e.g. `https://cvefeed.io/vuln/detail/CVE-2025-1102`) for downstream ingestion.

**Expected input files:**

```
data/
  nvdcve-2.0-2002.json.zip
  nvdcve-2.0-2003.json.zip
  ...
  nvdcve-2.0-2025.json.zip
```

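The link-building step described in the goal can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper name `cvefeed_url` and the validation pattern are assumptions added here, while the URL prefix is the one quoted above.

```python
import re

# Assumed pattern: "CVE-", a 4-digit year, then a sequence number of 4+ digits.
CVE_ID_RE = re.compile(r"CVE-\d{4}-\d{4,}")
CVEFEED_PREFIX = "https://cvefeed.io/vuln/detail/"

def cvefeed_url(cve_id: str) -> str:
    """Return the CVEFeed detail URL for a CVE ID (hypothetical helper)."""
    cve_id = cve_id.strip().upper()
    if not CVE_ID_RE.fullmatch(cve_id):
        raise ValueError(f"Not a CVE ID: {cve_id!r}")
    return CVEFEED_PREFIX + cve_id

print(cvefeed_url("CVE-2025-1102"))
```

Validating before concatenating keeps malformed identifiers out of the exported links; the notebook itself only checks that the ID starts with `CVE-`.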

## 1) Imports & configuration

In [4]:
# !pip install --upgrade pip
# !pip install pandas

import json, io, zipfile
from pathlib import Path
from typing import List
import pandas as pd

DATA_DIR = Path("../../../Data/Raw")
OUTPUT_DIR = Path("../../../Data")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

YEARS = list(range(2002, 2026))  # 2002..2025 inclusive
ZIP_PATTERN = "nvdcve-2.0-{}.json.zip"

print("DATA_DIR:", DATA_DIR.resolve())
print("OUTPUT_DIR:", OUTPUT_DIR.resolve())
print("Years:", YEARS[:3], "...", YEARS[-3:])


DATA_DIR: C:\Users\hamza\OneDrive\Desktop\Projects\threat-intelligence-pipeline\Data\Raw
OUTPUT_DIR: C:\Users\hamza\OneDrive\Desktop\Projects\threat-intelligence-pipeline\Data
Years: [2002, 2003, 2004] ... [2023, 2024, 2025]


## 2) Utility functions (read ZIP → JSON → DataFrame)

In [5]:
def read_json_from_zip(zip_path: Path) -> dict:
    """Read the first .json file found in an NVD .zip and return the parsed JSON object."""
    if not zip_path.exists():
        raise FileNotFoundError(zip_path)
    with zipfile.ZipFile(zip_path, "r") as zf:
        # Find the first .json member
        json_members = [n for n in zf.namelist() if n.endswith(".json")]
        if not json_members:
            raise ValueError(f"No JSON found in: {zip_path.name}")
        with zf.open(json_members[0], "r") as f:
            # Read the bytes and decode them as UTF-8 text
            raw = f.read().decode("utf-8")
            return json.loads(raw)

def extract_df_from_obj(data) -> pd.DataFrame:
    """Flatten the 'vulnerabilities' section (or a fallback) into a DataFrame."""
    if isinstance(data, dict) and "vulnerabilities" in data and isinstance(data["vulnerabilities"], list):
        vulns = data["vulnerabilities"]
    elif isinstance(data, list):
        vulns = data
    else:
        vulns = [data]
    return pd.json_normalize(vulns, sep=".")

def trim_df_to_core_cols(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the essential columns and rename them cleanly."""
    # The ID column is mandatory; accept a bare "id" column as a fallback.
    if "cve.id" not in df.columns:
        if "id" in df.columns:
            df = df.rename(columns={"id": "cve.id"})
        else:
            raise ValueError("Key columns not found. Available columns: " + ", ".join(map(str, df.columns)))
    cols_wanted = ["cve.id", "cve.published", "cve.lastModified", "cve.sourceIdentifier"]
    cols = [c for c in cols_wanted if c in df.columns]
    out = df[cols].copy()
    out.rename(columns={
        "cve.id": "cve_id",
        "cve.published": "published",
        "cve.lastModified": "last_modified",
        "cve.sourceIdentifier": "source_identifier",
    }, inplace=True)
    # Normalize IDs and drop anything that is not a CVE identifier
    out["cve_id"] = out["cve_id"].astype(str).str.strip()
    out = out[out["cve_id"].str.startswith("CVE-")]
    return out

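To see what these helpers operate on without downloading a feed, the read path can be exercised against a tiny archive built in memory. This is a stdlib-only sketch: the member name `nvdcve-2.0-sample.json` is made up, and the sample object contains only the `vulnerabilities` → `cve` → `id` / `published` fields that this notebook relies on, not the full NVD schema.

```python
import io
import json
import zipfile

# Tiny stand-in for an NVD 2.0 dump; only the fields used in this notebook.
sample = {
    "vulnerabilities": [
        {"cve": {"id": "CVE-1999-0095", "published": "1988-10-01T04:00:00.000"}},
        {"cve": {"id": "CVE-1999-0082", "published": "1988-11-11T05:00:00.000"}},
    ]
}

# Build the archive in memory instead of on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("nvdcve-2.0-sample.json", json.dumps(sample))
buf.seek(0)

# Same steps as read_json_from_zip: pick the first .json member, then parse it.
with zipfile.ZipFile(buf, "r") as zf:
    member = next(n for n in zf.namelist() if n.endswith(".json"))
    with zf.open(member) as f:
        data = json.load(f)

# Each entry nests the record under "cve", hence the "cve.id" column name
# after flattening with sep=".".
cve_ids = [item["cve"]["id"] for item in data["vulnerabilities"]]
print(cve_ids)
```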

## 3) Loop 2002 → 2025, concatenation & summary

In [6]:
all_frames = []
missing = []
rows_by_year = {}

for year in YEARS:
    zp = DATA_DIR / ZIP_PATTERN.format(year)
    if not zp.exists():
        missing.append(year)
        print(f"[SKIP] {year}: {zp.name} missing")
        continue
    try:
        data = read_json_from_zip(zp)
        df = extract_df_from_obj(data)
        core = trim_df_to_core_cols(df)
        all_frames.append(core)
        rows_by_year[year] = len(core)
        print(f"[OK]   {year}: {len(core):>6} rows")
    except Exception as e:
        print(f"[ERR]  {year}: {e}")

print("\nSummary:")
print("Missing years:", missing if missing else "None")
print({k: rows_by_year[k] for k in sorted(rows_by_year)})


[OK]   2002:   6770 rows
[OK]   2003:   1555 rows
[OK]   2004:   2707 rows
[OK]   2005:   4769 rows
[OK]   2006:   7143 rows
[OK]   2007:   6580 rows
[OK]   2008:   7177 rows
[OK]   2009:   5052 rows
[OK]   2010:   5244 rows
[OK]   2011:   4886 rows
[OK]   2012:   5937 rows
[OK]   2013:   6819 rows
[OK]   2014:   9000 rows
[OK]   2015:   8766 rows
[OK]   2016:  10561 rows
[OK]   2017:  17027 rows
[OK]   2018:  17495 rows
[OK]   2019:  17082 rows
[OK]   2020:  20642 rows
[OK]   2021:  23098 rows
[OK]   2022:  27046 rows
[OK]   2023:  30406 rows
[OK]   2024:  38685 rows
[OK]   2025:  29072 rows

Summary:
Missing years: None
{2002: 6770, 2003: 1555, 2004: 2707, 2005: 4769, 2006: 7143, 2007: 6580, 2008: 7177, 2009: 5052, 2010: 5244, 2011: 4886, 2012: 5937, 2013: 6819, 2014: 9000, 2015: 8766, 2016: 10561, 2017: 17027, 2018: 17495, 2019: 17082, 2020: 20642, 2021: 23098, 2022: 27046, 2023: 30406, 2024: 38685, 2025: 29072}


## 4) Merge, deduplication, CVEFeed URLs & CSV export

In [7]:
if not all_frames:
    raise SystemExit(f"No data loaded: check the .zip files in {DATA_DIR}.")

df_all = pd.concat(all_frames, ignore_index=True)

# Global deduplication (keeps the first occurrence of each cve_id)
before = len(df_all)
df_all.drop_duplicates(subset=["cve_id"], inplace=True)
after = len(df_all)

# CVEFeed URL
df_all["url"] = "https://cvefeed.io/vuln/detail/" + df_all["cve_id"]

print(f"Deduplication: {before} -> {after} unique rows")
print("Columns:", df_all.columns.tolist())

# CSV export
out_csv = OUTPUT_DIR / "cve_ids_all_years_2002_2025_from_zip.csv"
df_all.to_csv(out_csv, index=False)
print("CSV written ->", out_csv.resolve())

# Preview
df_all.head(10)


Deduplication: 313519 -> 313519 unique rows
Columns: ['cve_id', 'published', 'last_modified', 'source_identifier', 'url']
CSV written -> C:\Users\hamza\OneDrive\Desktop\Projects\threat-intelligence-pipeline\Data\cve_ids_all_years_2002_2025_from_zip.csv


| | cve_id | published | last_modified | source_identifier | url |
|---|---|---|---|---|---|
| 0 | CVE-1999-0095 | 1988-10-01T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-0095 |
| 1 | CVE-1999-0082 | 1988-11-11T05:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-0082 |
| 2 | CVE-1999-1471 | 1989-01-01T05:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1471 |
| 3 | CVE-1999-1122 | 1989-07-26T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1122 |
| 4 | CVE-1999-1467 | 1989-10-26T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1467 |
| 5 | CVE-1999-1506 | 1990-01-29T05:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1506 |
| 6 | CVE-1999-0084 | 1990-05-01T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-0084 |
| 7 | CVE-2000-0388 | 1990-05-09T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-2000-0388 |
| 8 | CVE-1999-0209 | 1990-08-14T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-0209 |
| 9 | CVE-1999-1198 | 1990-10-03T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1198 |

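`drop_duplicates` keeps the first occurrence of each `cve_id`. In this run the count is 313519 both before and after, so no ID appeared twice and the choice made no difference. If an ID did show up in more than one feed, one might prefer to keep the most recently modified record instead. The sketch below illustrates that alternative with the standard library only; the records are made up (placeholder IDs and dates), and the string comparison assumes every timestamp uses the same ISO 8601 layout as the `last_modified` column, so that lexical order matches chronological order.

```python
# Made-up records; only the field names mirror the notebook's columns.
records = [
    {"cve_id": "CVE-2099-0001", "last_modified": "2024-01-01T00:00:00.000"},
    {"cve_id": "CVE-2099-0002", "last_modified": "2024-02-01T00:00:00.000"},
    {"cve_id": "CVE-2099-0001", "last_modified": "2025-03-01T00:00:00.000"},
]

# Keep, for each ID, the record with the greatest last_modified value.
latest = {}
for rec in records:
    kept = latest.get(rec["cve_id"])
    if kept is None or rec["last_modified"] > kept["last_modified"]:
        latest[rec["cve_id"]] = rec

deduped = list(latest.values())
print(len(records), "->", len(deduped))
```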

## 5) Simple metrics

In [8]:
print("Total unique CVEs:", len(df_all))
if "published" in df_all.columns:
    try:
        # Unparseable dates become NaT instead of raising
        d = pd.to_datetime(df_all["published"], errors="coerce")
        print("\n5 oldest (by 'published'):")
        display(df_all.loc[d.sort_values().index].head(5))
    except Exception as e:
        print("Could not parse dates:", e)
else:
    print("Column 'published' is missing.")


Total unique CVEs: 313519

5 oldest (by 'published'):


| | cve_id | published | last_modified | source_identifier | url |
|---|---|---|---|---|---|
| 0 | CVE-1999-0095 | 1988-10-01T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-0095 |
| 1 | CVE-1999-0082 | 1988-11-11T05:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-0082 |
| 2 | CVE-1999-1471 | 1989-01-01T05:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1471 |
| 3 | CVE-1999-1122 | 1989-07-26T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1122 |
| 4 | CVE-1999-1467 | 1989-10-26T04:00:00.000 | 2025-04-03T01:03:51.193 | cve@mitre.org | https://cvefeed.io/vuln/detail/CVE-1999-1467 |

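The rows above show that the year embedded in a CVE ID (1999, 2000) need not match the `published` year (1988 to 1990). A complementary metric is therefore the count of IDs per embedded year. The sketch below is one possible way to compute it with the standard library; the ID list is copied from the preview tables above, and the variable name `by_id_year` is an illustrative choice, not part of the notebook.

```python
from collections import Counter

# IDs copied from the preview tables above.
cve_ids = [
    "CVE-1999-0095", "CVE-1999-0082", "CVE-1999-1471",
    "CVE-1999-1122", "CVE-1999-1467", "CVE-2000-0388",
]

# The year is the second dash-separated field of the ID.
by_id_year = Counter(int(cid.split("-")[1]) for cid in cve_ids)

for year, count in sorted(by_id_year.items()):
    print(year, count)
```

On the full dataset the same idea could be applied to the `cve_id` column, for example by splitting it on `-` and counting the second field.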