# Exploratory Data Analysis (EDA): CVE Vulnerability Dataset

**Purpose:** Evaluate structure, quality, and readiness for BI dashboards and CVSS version comparisons.

## Contents

1. [Overview](#1-overview)
2. [Dataset Description](#2-dataset-description)
3. [Data Dictionary](#3-data-dictionary)
4. [Objectives](#4-objectives)
5. [Expected Outcomes](#5-expected-outcomes)
6. [Assumptions & Caveats](#6-assumptions--caveats)

## 1. Overview

This EDA examines a cybersecurity dataset of Common Vulnerabilities and Exposures (CVEs). Each row details a software vulnerability, including metadata, descriptions, affected products, and CVSS scores. Focus areas: data quality, nested field normalization, and CVSS comparisons (v2.0, 3.x, 4.0).

## 2. Dataset Description

**Source:** Aggregated from CVE feeds (e.g., CVEFeed.io, MITRE, NVD) via internal ETL. Sample includes identifiers, text fields, timestamps, exploitation flags, nested products, and CVSS objects.

<details>
<summary>Sample Rows (Truncated)</summary>

```
cve_id,title,description,published_date,last_modified,remotely_exploit,source,category,affected_products,cvss_scores,url,loaded_at
CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,"Oct. 11, 2025, 5:15 p.m.","Oct. 11, 2025, 5:15 p.m.",Yes !,cve@mitre.org,Injection,[],[{"score": "7.5", "version": "CVSS 2.0", "seve...],https://cvefeed.io/vuln/detail/CVE-2025-11608,2025-10-12 17:22:01.863992+00:00
CVE-1999-0095,Sendmail Command Injection Vulnerability,The following products are affected by...,"Oct. 1, 1988, 4 a.m.","April 3, 2025, 1:03 a.m.",Yes !,cve@mitre.org,,[{"id": "1", "vendor": "Eric_allman", "product"...],[{"score": "10", "version": "CVSS 2.0", "sever...],https://cvefeed.io/vuln/detail/CVE-1999-0095,2025-10-12 17:22:01.863992+00:00
```

</details>

## 3. Data Dictionary

| Column            | Description                          | Type            | Notes                          |
|-------------------|--------------------------------------|-----------------|--------------------------------|
| `cve_id`          | Global CVE identifier                | string          | Primary key                    |
| `title`           | Short summary                        | string          | May be truncated               |
| `description`     | Detailed description                 | string          | Contains product hints         |
| `published_date`  | Initial disclosure timestamp         | datetime string | Normalize to UTC               |
| `last_modified`   | Last update timestamp                | datetime string | May differ from publish        |
| `remotely_exploit`| Remote exploitation flag             | string/bool     | Normalize values like "Yes !"  |
| `source`          | Origin/maintainer                    | string          | Email/domain format            |
| `category`        | Vulnerability type                   | string (nullable)| Map to taxonomy if missing     |
| `affected_products`| Affected vendors/products list      | JSON list (string)| Parse and normalize            |
| `cvss_scores`     | CVSS objects list (score, version, etc.) | JSON list (string)| Explode to long format         |
| `url`             | CVE detail link                      | string (URL)    | For verification               |
| `loaded_at`       | ETL load timestamp                   | datetime (UTC)  | For freshness checks           |

## 4. Objectives

1. Check data quality: missing values, inconsistencies, duplicates.
2. Normalize nested data: Parse and expand `cvss_scores`, `affected_products`.
3. Analyze CVSS: Distributions, severity trends, version differences.
4. Profile landscape: Categories, sources, vendors.
5. Prep for BI: Tidy tables with stable keys for dashboards.

## 5. Expected Outcomes

- Clean, normalized tables for analysis and BI.
- Long-format CVSS data (one row per CVE-version).
- Stats and visuals for trends.
- Notes on assumptions and limitations.
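
As an illustration of the long-format outcome listed above, here is a minimal sketch that turns the JSON text in `cvss_scores` into one row per CVE and CVSS entry. The toy rows, the helper names `parse_json_list` and `explode_cvss`, and the assumption that every entry carries `score` and `version` keys are illustrative; the real objects contain more fields than the truncated sample shows.

```python
import json
import pandas as pd

# Toy rows shaped like the raw table; IDs and values are made up for illustration.
raw = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0002"],
    "cvss_scores": [
        '[{"score": "7.5", "version": "CVSS 2.0"}, {"score": "9.8", "version": "CVSS 3.1"}]',
        "[]",
    ],
})

def parse_json_list(value):
    """Return a list parsed from a JSON string, or [] when missing or invalid."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return []
    return parsed if isinstance(parsed, list) else []

def explode_cvss(df):
    """One output row per (cve_id, CVSS entry); CVEs without scores yield no rows."""
    entries = df["cvss_scores"].map(parse_json_list)
    long = df[["cve_id"]].assign(entry=entries).explode("entry")
    long = long.dropna(subset=["entry"]).reset_index(drop=True)
    fields = pd.json_normalize(long["entry"].tolist())
    out = pd.concat([long[["cve_id"]], fields], axis=1)
    if "score" in out.columns:
        # Scores arrive as strings in the sample, so convert them to numbers.
        out["score"] = pd.to_numeric(out["score"], errors="coerce")
    return out

cvss_long = explode_cvss(raw)
print(cvss_long)
```

Keeping unscored CVEs out of the long table while leaving them in the main table matches the idea of a fact table that only holds rows for existing scores.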

## 6. Assumptions & Caveats

- **Datetimes:** Convert heterogeneous formats to UTC.
- **Flags:** Normalize values such as `"Yes !"` to boolean.
- **CVSS:** Multiple versions per CVE; parse vectors for metrics.
- **Categories:** Infer missing ones via text if needed.
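
To make the "parse vectors for metrics" caveat concrete, the sketch below splits a CVSS vector string into its version prefix and metric pairs. The function name `parse_cvss_vector` is hypothetical, and the name of the field that holds the vector inside each `cvss_scores` object is not visible in the truncated sample, so that part is left to the reader. CVSS 2.0 vectors carry no `CVSS:` prefix, which is why the version can come back as `None`.

```python
def parse_cvss_vector(vector):
    """Split a CVSS vector string into (version, metrics).

    version is the text after 'CVSS:' (e.g. '3.1'), or None for v2-style
    vectors, which have no prefix. metrics maps abbreviations to values.
    """
    parts = [p for p in vector.strip().split("/") if p]
    version = None
    if parts and parts[0].upper().startswith("CVSS:"):
        version = parts.pop(0).split(":", 1)[1]
    metrics = {}
    for part in parts:
        key, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"Malformed metric: {part!r}")
        metrics[key] = value
    return version, metrics

v3_version, v3_metrics = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
v2_version, v2_metrics = parse_cvss_vector("AV:N/AC:L/Au:N/C:P/I:P/A:P")
print(v3_version, v3_metrics["AV"], v2_version, v2_metrics["Au"])
```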

### Step 1 — Import Libraries and Configuration

In [176]:

# --- Data manipulation and analysis
import pandas as pd
import numpy as np

# --- Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --- Text processing
import re
import unicodedata
import string

# --- Preprocessing and machine-learning utilities
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Dates and times
from datetime import datetime, timedelta

# --- pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 120)
pd.set_option('display.float_format', '{:.2f}'.format)

# --- Plot style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['axes.titlesize'] = 13
plt.rcParams['axes.labelsize'] = 11

from sqlalchemy import create_engine

DB_USER = "postgres"
DB_PASS = "tip_pwd"
DB_HOST = "localhost"
DB_PORT = "5432"
DB_NAME = "tip"

engine = create_engine(f"postgresql+psycopg2://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}")
df = pd.read_sql("SELECT * FROM raw.cve_details;", engine)

### Step 2 - Loading and Initial Inspection of the Dataset

In [177]:
query_dims = """
WITH
rows_count AS (
  SELECT COUNT(*)::bigint AS rows
  FROM raw.cve_details
),
cols_count AS (
  SELECT COUNT(*)::int AS cols
  FROM information_schema.columns
  WHERE table_schema = 'raw'
    AND table_name   = 'cve_details'
)
SELECT rows_count.rows, cols_count.cols
FROM rows_count, cols_count;
"""

dims = pd.read_sql(query_dims, engine).iloc[0]
n_rows, n_cols = int(dims["rows"]), int(dims["cols"])

print("✅ Dataset loaded successfully!")
print(f"Dataset dimensions: {n_rows} rows × {n_cols} columns\n")

✅ Dataset loaded successfully!
Dataset dimensions: 64188 rows × 12 columns



In [178]:
"""
=============================================================================
IMPROVED STRATEGY - HANDLING MISSING VALUES
=============================================================================

Principle: never delete data without analyzing it first.
CVEs without CVSS scores can still be useful for other analyses!
"""

import pandas as pd
import numpy as np

# =============================================================================
# 1. DETAILED ANALYSIS OF MISSING VALUES
# =============================================================================
def analyze_missing_data(df):
    """
    Summarize missing values per column and print the affected columns.
    """

    print("="*70)
    print("📊 MISSING VALUE ANALYSIS")
    print("="*70)

    missing_summary = []

    for col in df.columns:
        missing_count = df[col].isna().sum()
        missing_pct = (missing_count / len(df)) * 100

        missing_summary.append({
            'column': col,
            'missing_count': missing_count,
            'missing_pct': missing_pct,
            'dtype': df[col].dtype
        })

    df_missing = pd.DataFrame(missing_summary).sort_values('missing_pct', ascending=False)

    # Show only the columns that have missing values
    df_missing_filtered = df_missing[df_missing['missing_count'] > 0]

    if len(df_missing_filtered) > 0:
        print("\n⚠️  Columns with missing values:\n")
        print(df_missing_filtered.to_string(index=False))
    else:
        print("\n✅ No missing values detected!")

    return df_missing


# =============================================================================
# 2. HANDLING STRATEGY - DO NOT DROP ROWS!
# =============================================================================
def handle_missing_cvss_scores(df):
    """
    Handle CVEs without CVSS scores without discarding them.

    Strategy:
    1. Create a 'has_cvss_score' flag
    2. Keep ALL rows in dim_cve
    3. Create no entries in fact_cvss_scores for CVEs without a score
    4. Keep unscored CVEs available for analysis

    Note: the flag column is added in place to the DataFrame passed in.
    """

    print("\n🔍 Analyzing missing CVSS scores...")

    # Identify rows without a score (NULL or an empty JSON list)
    no_score_mask = (
        df['cvss_scores'].isna() |
        (df['cvss_scores'].str.strip() == '[]')
    )

    no_score_count = no_score_mask.sum()
    no_score_pct = (no_score_count / len(df)) * 100

    print(f"\n📊 CVEs without a CVSS score:")
    print(f"   • Count: {no_score_count:,} ({no_score_pct:.1f}%)")

    # Add a flag instead of dropping rows
    df['has_cvss_score'] = ~no_score_mask

    # Profile the CVEs without a score
    if no_score_count > 0:
        print("\n🔎 Profile of CVEs without a score:")

        # By year
        cves_no_score = df[no_score_mask].copy()
        if 'published_date' in cves_no_score.columns:
            # errors='coerce' keeps unparseable raw date strings from raising
            cves_no_score['year'] = pd.to_datetime(
                cves_no_score['published_date'], errors='coerce'
            ).dt.year

            year_dist = cves_no_score['year'].value_counts().sort_index()
            print("\n   Distribution by year (earliest 5 years):")
            print(year_dist.head().to_string())

        # By category
        if 'category' in cves_no_score.columns:
            cat_dist = cves_no_score['category'].value_counts()
            print("\n   Distribution by category (top 5):")
            print(cat_dist.head().to_string())

    print(f"\n✅ Flag 'has_cvss_score' added. All rows kept.")

    return df


# =============================================================================
# 3. RULE-BASED IMPUTATION (Optional)
# =============================================================================
import re  # word-boundary keyword matching

def impute_cvss_scores(df):
    """
    Estimate a severity label for CVEs without a CVSS score, using business rules.

    ⚠️  WARNING: use only when the business context justifies it!
    """

    print("\n🔮 Imputing CVSS scores...")

    # Rules, applied in this order:
    #   1. High-risk keywords in the description
    #   2. Medium-risk keywords in the description
    #   3. 'remotely_exploit' is set -> probably high severity
    # (A category-based rule is not implemented in this example.)

    df_imputed = df.copy()

    # Simple example (adapt to your context). Word boundaries prevent
    # 'rce' from matching inside words such as 'source' or 'resource'.
    high_risk_re = re.compile(r'\b(?:remote code execution|rce|overflow|injection)\b')
    medium_risk_re = re.compile(r'\b(?:disclosure|xss|csrf)\b')

    def estimate_severity(row):
        """Estimate severity from the row's context."""

        # If a score already exists, keep the existing severity.
        # ('cvss_score' / 'cvss_severity' exist only after cvss_scores is flattened.)
        if pd.notna(row.get('cvss_score')):
            return row.get('cvss_severity', 'UNKNOWN')

        desc = str(row.get('description', '')).lower()

        # Check the keywords
        if high_risk_re.search(desc):
            return 'HIGH_ESTIMATED'
        elif medium_risk_re.search(desc):
            return 'MEDIUM_ESTIMATED'
        elif row.get('remotely_exploit') in (True, 'Yes !'):
            # Accept both the normalized boolean and the raw "Yes !" string
            return 'HIGH_ESTIMATED'
        else:
            return 'UNKNOWN'

    # Apply the estimate
    df_imputed['severity_estimated'] = df_imputed.apply(estimate_severity, axis=1)

    # Statistics
    estimated_count = (
        df_imputed['severity_estimated'].str.endswith('_ESTIMATED')
    ).sum()

    print(f"   ✅ {estimated_count:,} severities estimated by business rules")

    return df_imputed


# =============================================================================
# 4. QUALITY VALIDATION
# =============================================================================
def validate_data_quality(df):
    """
    Final data-quality validation.
    """

    print("\n" + "="*70)
    print("🔍 DATA QUALITY VALIDATION")
    print("="*70)

    checks = []

    # Check 1: no duplicate cve_id values
    dup_count = df['cve_id'].duplicated().sum()
    checks.append({
        'check': 'Duplicate CVE IDs',
        'status': '✅ PASS' if dup_count == 0 else f'❌ FAIL ({dup_count} duplicates)',
        'critical': True
    })

    # Check 2: consistent dates (last_modified must not precede published_date).
    # Compare as datetimes: raw string columns would compare lexicographically.
    # Values that cannot be parsed become NaT and are skipped by the comparison.
    if 'published_date' in df.columns and 'last_modified' in df.columns:
        published = pd.to_datetime(df['published_date'], errors='coerce')
        modified = pd.to_datetime(df['last_modified'], errors='coerce')
        invalid_dates = (modified < published).sum()
        checks.append({
            'check': 'Consistent dates',
            'status': '✅ PASS' if invalid_dates == 0 else f'⚠️  WARNING ({invalid_dates} inconsistencies)',
            'critical': False
        })

    # Check 3: valid CVSS scores
    if 'cvss_score' in df.columns:
        invalid_scores = (
            (df['cvss_score'] < 0) | (df['cvss_score'] > 10)
        ).sum()
        checks.append({
            'check': 'Valid CVSS scores (0-10)',
            'status': '✅ PASS' if invalid_scores == 0 else f'❌ FAIL ({invalid_scores} invalid)',
            'critical': True
        })

    # Check 4: categories defined (NULL, empty, or 'undefined' count as missing)
    if 'category' in df.columns:
        undefined_cat = (
            df['category'].isna() | df['category'].isin(['', 'undefined'])
        ).sum()
        checks.append({
            'check': 'Categories defined',
            'status': '✅ PASS' if undefined_cat == 0 else f'⚠️  INFO ({undefined_cat} undefined)',
            'critical': False
        })

    # Display the results
    df_checks = pd.DataFrame(checks)
    print("\n" + df_checks.to_string(index=False))

    # Summary
    critical_fails = df_checks[
        df_checks['critical'] &
        df_checks['status'].str.contains('FAIL')
    ]

    if len(critical_fails) > 0:
        print("\n❌ VALIDATION FAILED - fix the critical errors!")
        return False
    else:
        print("\n✅ VALIDATION PASSED - data ready for the Silver layer")
        return True


# =============================================================================
# MAIN FUNCTION - ORCHESTRATION
# =============================================================================
def prepare_data_for_silver(df):
    """
    Prepare Bronze data for the Silver layer.

    Steps:
    1. Analyze missing values
    2. Handle missing CVSS scores (with a flag)
    3. Validate quality
    """

    print("🚀 PREPARING DATA: BRONZE → SILVER")

    # 1. Analysis
    analyze_missing_data(df)

    # 2. Handle missing CVSS scores
    df_prepared = handle_missing_cvss_scores(df)

    # 3. (Optional) Imputation
    # df_prepared = impute_cvss_scores(df_prepared)

    # 4. Validation
    is_valid = validate_data_quality(df_prepared)

    if is_valid:
        print("\n✅ Data ready for the Silver transformation!")
        return df_prepared
    else:
        print("\n❌ Validation failed. Fix the errors before continuing.")
        return None


# =============================================================================
# USAGE EXAMPLE
# =============================================================================
if __name__ == "__main__":
    # Load the Bronze data
    # df = pd.read_sql("SELECT * FROM bronze.cve_details", engine)

    # Prepare the data
    df_prepared = prepare_data_for_silver(df)

    if df_prepared is not None:
        # Final statistics
        print("\n📊 FINAL STATISTICS:")
        print(f"   • Total CVEs: {len(df_prepared):,}")
        print(f"   • With CVSS scores: {df_prepared['has_cvss_score'].sum():,}")
        print(f"   • Without CVSS scores: {(~df_prepared['has_cvss_score']).sum():,}")

NameError: name 'df_bronze' is not defined
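
One caveat on the missing-value analysis: `isna()` counts only real nulls, while this dataset also uses empty strings (for `category`) and empty JSON lists (`[]`) to mean "no value". The sketch below counts both kinds separately. The helper name `effective_missing` and the choice of `""` and `"[]"` as empty tokens are assumptions made for illustration.

```python
import pandas as pd

def effective_missing(df, empty_tokens=("", "[]")):
    """Per column, count real nulls and string values that are effectively empty."""
    rows = []
    for col in df.columns:
        series = df[col]
        # True where the value is a string equal to one of the empty tokens
        is_empty = series.map(lambda v: isinstance(v, str) and v.strip() in empty_tokens)
        rows.append({
            "column": col,
            "null_count": int(series.isna().sum()),
            "empty_count": int(is_empty.sum()),
        })
    return pd.DataFrame(rows)

# Toy data: one real null, one empty string, two empty JSON lists.
demo = pd.DataFrame({
    "category": ["Injection", "", None],
    "cvss_scores": ["[]", '[{"score": "7.5"}]', " [] "],
})
summary = effective_missing(demo).set_index("column")
print(summary)
```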

In [137]:
# Preview of the first rows
display(df.head(5))

Unnamed: 0,cve_id,title,description,published_date,last_modified,remotely_exploit,source,category,affected_products,cvss_scores,url,loaded_at
0,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,"Oct. 11, 2025, 5:15 p.m.","Oct. 11, 2025, 5:15 p.m.",Yes !,[email protected],Injection,[],"[{""score"": ""7.5"", ""version"": ""CVSS 2.0"", ""seve...",https://cvefeed.io/vuln/detail/CVE-2025-11608,2025-10-13 12:00:10.184538+00:00
1,CVE-1999-0095,Sendmail Command Injection Vulnerability,The following products are affected byCVE-1999...,"Oct. 1, 1988, 4 a.m.","April 3, 2025, 1:03 a.m.",Yes !,[email protected],,"[{""id"": ""1"", ""vendor"": ""Eric_allman"", ""product...","[{""score"": ""10"", ""version"": ""CVSS 2.0"", ""sever...",https://cvefeed.io/vuln/detail/CVE-1999-0095,2025-10-13 12:00:10.184538+00:00
2,CVE-1999-0082,Tenable FTP Server Command Injection Vulnerabi...,The following products are affected byCVE-1999...,"Nov. 11, 1988, 5 a.m.","April 3, 2025, 1:03 a.m.",Yes !,[email protected],,"[{""id"": ""1"", ""vendor"": ""Ftp"", ""product"": ""ftp""...","[{""score"": ""10"", ""version"": ""CVSS 2.0"", ""sever...",https://cvefeed.io/vuln/detail/CVE-1999-0082,2025-10-13 12:00:10.184538+00:00
3,CVE-1999-1471,"""BSD Passwd Buffer Overflow Root Privilege Esc...",The following products are affected byCVE-1999...,"Jan. 1, 1989, 5 a.m.","April 3, 2025, 1:03 a.m.",No,[email protected],,"[{""id"": ""1"", ""vendor"": ""Bsd"", ""product"": ""bsd""}]","[{""score"": ""7.2"", ""version"": ""CVSS 2.0"", ""seve...",https://cvefeed.io/vuln/detail/CVE-1999-1471,2025-10-13 12:00:10.184538+00:00
4,CVE-1999-1122,SunOS Restore Privilege Escalation Vulnerability,Vulnerability in restore in SunOS 4.0.3 and ea...,"July 26, 1989, 4 a.m.","April 3, 2025, 1:03 a.m.",No,[email protected],,"[{""id"": ""1"", ""vendor"": ""Sun"", ""product"": ""suno...","[{""score"": ""4.6"", ""version"": ""CVSS 2.0"", ""seve...",https://cvefeed.io/vuln/detail/CVE-1999-1122,2025-10-13 12:00:10.184538+00:00


In [138]:
# General information about the columns
print("Column type information:")
df.info()

Column type information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64188 entries, 0 to 64187
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   cve_id             64188 non-null  object             
 1   title              64187 non-null  object             
 2   description        64185 non-null  object             
 3   published_date     64184 non-null  object             
 4   last_modified      64184 non-null  object             
 5   remotely_exploit   64181 non-null  object             
 6   source             64181 non-null  object             
 7   category           64181 non-null  object             
 8   affected_products  64181 non-null  object             
 9   cvss_scores        64179 non-null  object             
 10  url                64179 non-null  object             
 11  loaded_at          64188 non-null  datetime64[ns, UTC]
dtypes: da

### Step 3 - Basic Cleaning and Type Conversions

#### 3.1 - Date Conversion

The **`published_date`** and **`last_modified`** columns are stored as *object* (string) and must be converted to **datetime**. Their formats are heterogeneous (for example `Oct. 11, 2025, 5:15 p.m.` and `Oct. 1, 1988, 4 a.m.`), so the parser has to be tolerant.

**Post-Conversion Notes:**
- Any unparseable dates become **NaT**. The conversion code reports how many there are and then drops the affected rows, so review them first if they matter.
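
As an alternative to tolerant parsing, here is a minimal sketch of a strict parser for the one format visible in the sample rows, paired with a review flag instead of a drop. This is not the notebook's method (which relies on `dateutil`). The function name `parse_feed_datetime` is hypothetical, the `noon`/`midnight` handling is a guess at how such a feed might render those times rather than something seen in the sample, and `%b` assumes an English locale.

```python
import re
from datetime import datetime
import pandas as pd

# Matches e.g. "Oct. 11, 2025, 5:15 PM" or "April 3, 2025, 4 AM" after normalization.
_PATTERN = re.compile(
    r"(?P<month>[A-Za-z]+)\.? (?P<day>\d{1,2}), (?P<year>\d{4}), "
    r"(?P<hour>\d{1,2})(?::(?P<minute>\d{2}))? (?P<ampm>AM|PM)"
)

def parse_feed_datetime(text):
    """Return a Timestamp for strings like 'Oct. 11, 2025, 5:15 p.m.', else NaT."""
    if not isinstance(text, str):
        return pd.NaT
    s = text.strip().replace("a.m.", "AM").replace("p.m.", "PM")
    s = s.replace("midnight", "12 AM").replace("noon", "12 PM")  # assumption
    m = _PATTERN.fullmatch(s)
    if m is None:
        return pd.NaT
    normalized = "{} {} {} {}:{} {}".format(
        m["month"][:3], m["day"], m["year"], m["hour"], m["minute"] or "00", m["ampm"]
    )
    try:
        return pd.Timestamp(datetime.strptime(normalized, "%b %d %Y %I:%M %p"))
    except ValueError:
        return pd.NaT

raw_dates = pd.Series([
    "Oct. 11, 2025, 5:15 p.m.",
    "Oct. 1, 1988, 4 a.m.",
    "April 3, 2025, 1:03 a.m.",
    "unknown",
])
parsed = raw_dates.map(parse_feed_datetime)
# Flag values that were present but could not be parsed, rather than dropping them.
needs_review = parsed.isna() & raw_dates.notna()
print(pd.DataFrame({"raw": raw_dates, "parsed": parsed, "needs_review": needs_review}))
```

A strict parser fails loudly on unexpected formats, which makes the review flag meaningful; a fuzzy parser may instead return a plausible but wrong date.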

In [139]:
import pandas as pd
from dateutil import parser

# Columns to convert
DATE_COLUMNS = ["published_date", "last_modified"]

def parse_date(date_str):
    """Parse various date formats to datetime"""
    if pd.isna(date_str):
        return pd.NaT
    
    try:
        # dateutil.parser handles most formats automatically
        return parser.parse(str(date_str), fuzzy=False)
    except (ValueError, OverflowError):
        try:
            # Fuzzy parsing as fallback
            return parser.parse(str(date_str), fuzzy=True)
        except (ValueError, OverflowError):
            return pd.NaT

# Process each date column
for col in DATE_COLUMNS:
    if col in df.columns:
        df[col] = df[col].apply(parse_date)
        # Ensure datetime64[ns] dtype
        df[col] = pd.to_datetime(df[col], errors='coerce')

# ========== VERIFICATION ==========
print("== Dtypes ==")
print(df[DATE_COLUMNS].dtypes)

print("\n== Conversion Success Rate ==")
for col in DATE_COLUMNS:
    if col in df.columns:
        total = len(df[col])
        null_count = df[col].isna().sum()
        success_count = total - null_count
        success_rate = (success_count / total * 100) if total > 0 else 0
        
        print(f"\n{col}:")
        print(f"  ✓ Successfully converted: {success_count}/{total} ({success_rate:.1f}%)")
        print(f"  ✗ Failed (NaT): {null_count}")

# Drop rows with failed conversions
failed_mask = df[DATE_COLUMNS].isna().any(axis=1)
rows_to_drop = failed_mask.sum()
if rows_to_drop > 0:
    df.dropna(subset=DATE_COLUMNS, inplace=True)
    print(f"\n✓ Dropped {rows_to_drop} rows with failed date conversions")
    print(f"✓ Remaining rows: {len(df)}")

== Dtypes ==
published_date    datetime64[ns]
last_modified     datetime64[ns]
dtype: object

== Conversion Success Rate ==

published_date:
  ✓ Successfully converted: 64179/64188 (100.0%)
  ✗ Failed (NaT): 9

last_modified:
  ✓ Successfully converted: 64180/64188 (100.0%)
  ✗ Failed (NaT): 8

✓ Dropped 9 rows with failed date conversions
✓ Remaining rows: 64179


In [140]:
df[['cve_id' , 'published_date', 'last_modified']].head(5)

Unnamed: 0,cve_id,published_date,last_modified
0,CVE-2025-11608,2025-10-11 17:15:00,2025-10-11 17:15:00
1,CVE-1999-0095,1988-10-01 04:00:00,2025-04-03 01:03:00
2,CVE-1999-0082,1988-11-11 05:00:00,2025-04-03 01:03:00
3,CVE-1999-1471,1989-01-01 05:00:00,2025-04-03 01:03:00
4,CVE-1999-1122,1989-07-26 04:00:00,2025-04-03 01:03:00


In [141]:
# --- Normalize `loaded_at` ----------------------------------------------------
df['loaded_at'] = (
    pd.to_datetime(df['loaded_at'], utc=True, errors='coerce')
      .dt.tz_localize(None)
      .dt.floor('s')  # Truncate to the second (drops microseconds)
)

# Quick check
print(df['loaded_at'].head(10))
print(df['loaded_at'].dtype)

0   2025-10-13 12:00:10
1   2025-10-13 12:00:10
2   2025-10-13 12:00:10
3   2025-10-13 12:00:10
4   2025-10-13 12:00:10
5   2025-10-13 12:00:10
6   2025-10-13 12:00:10
7   2025-10-13 12:00:10
8   2025-10-13 12:00:10
9   2025-10-13 12:00:10
Name: loaded_at, dtype: datetime64[ns]
datetime64[ns]


#### 3.2 - Dropping Unneeded Columns and Checking for Duplicates

Since the **`url`** column can be reconstructed from `cve_id` (it follows the simple pattern `https://cvefeed.io/vuln/detail/{cve_id}`), we can safely drop it from the dataset.


In [142]:
df.drop(columns=["url"], inplace=True, errors='ignore')

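Afterwards, check for **duplicate entries** to ensure data consistency and avoid redundant analyses.
Note that `df.duplicated()` without arguments only flags rows that are identical in every column; rows that share a `cve_id` but differ elsewhere are not caught by this check.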

In [143]:
# checking for duplicates
df.duplicated().any()

np.False_
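
The difference between a full-row check and a key-level check can be seen on a toy frame. This is a minimal sketch with made-up IDs; on the real data the key-level check would be `df.duplicated(subset=["cve_id"])`.

```python
import pandas as pd

# Two rows share a cve_id but differ in title, so they are not full-row duplicates.
demo = pd.DataFrame({
    "cve_id": ["CVE-0000-0001", "CVE-0000-0001", "CVE-0000-0002"],
    "title": ["first version", "revised title", "other"],
})

full_row_dupes = int(demo.duplicated().sum())                  # compares every column
key_dupes = int(demo.duplicated(subset=["cve_id"]).sum())      # compares the key only
print(full_row_dupes, key_dupes)
```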

#### 3.3 - Normalizing the `remotely_exploit` Column

The values in this column should be either `True` or `False`. Let's check which values actually occur in the column.

In [144]:
# Get all unique values in a column
unique_values = df['remotely_exploit'].unique()
print(unique_values)

['Yes !' 'No']


In [145]:
# Convert 'Yes !' to True and 'No' to False, leave existing True/False as is
df['remotely_exploit'] = df['remotely_exploit'].apply(
    lambda x: True if x == 'Yes !' else (False if x == 'No' else x)
)

# Check the result
print(df['remotely_exploit'].head())

0     True
1     True
2     True
3    False
4    False
Name: remotely_exploit, dtype: bool
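
The lambda above passes any unexpected value through unchanged, which would silently leave the column as mixed types if a new spelling appeared. A minimal alternative sketch uses an explicit mapping and pandas' nullable `boolean` dtype so that unknown values become `<NA>`. The helper name `normalize_flag` and the extra `"yes"` spelling are assumptions for illustration.

```python
import pandas as pd

FLAG_MAP = {"yes !": True, "yes": True, "no": False}

def normalize_flag(series):
    """Map known spellings to booleans; anything else becomes <NA>."""
    cleaned = series.str.strip().str.lower()
    return cleaned.map(FLAG_MAP).astype("boolean")

flags = normalize_flag(pd.Series(["Yes !", "No", None, "maybe"]))
print(flags)
```

Counting `flags.isna()` afterwards shows how many values fell outside the expected vocabulary.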


In [146]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 64179 entries, 0 to 64187
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   cve_id             64179 non-null  object        
 1   title              64179 non-null  object        
 2   description        64179 non-null  object        
 3   published_date     64179 non-null  datetime64[ns]
 4   last_modified      64179 non-null  datetime64[ns]
 5   remotely_exploit   64179 non-null  bool          
 6   source             64179 non-null  object        
 7   category           64179 non-null  object        
 8   affected_products  64179 non-null  object        
 9   cvss_scores        64179 non-null  object        
 10  loaded_at          64179 non-null  datetime64[ns]
dtypes: bool(1), datetime64[ns](3), object(7)
memory usage: 5.4+ MB


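#### 3.4 - Processing the `source` column

Nearly every value in the `source` column is the placeholder `[email protected]`, most likely an email-obfuscation artifact from data collection, so the column carries almost no information. We measure the share of this placeholder per publication year and then drop the column.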

In [147]:
import pandas as pd

# --- Normalize source and extract year (temporary, not added to df) ---
source_norm = (
    df['source']
    .fillna('')
    .str.normalize('NFKC')
    .str.replace('\u00a0', ' ', regex=False)
    .str.replace('\u200b', '', regex=False)
    .str.replace(r'[\[\]]', '', regex=True)
    .str.strip()
    .str.lower()
    .str.replace(r'\s+', ' ', regex=True)
)

year = df['published_date'].dt.year

# --- Calculate counts efficiently ---
TARGET = 'email protected'
YEAR_RANGE = range(1999, 2026)

# Create temporary DataFrame for grouping
temp_df = pd.DataFrame({
    'year': year,
    'is_target': source_norm == TARGET,
    'cve_id': df['cve_id']
})

# Group once and calculate both metrics
yearly_stats = (
    temp_df.groupby(['year', 'is_target'])['cve_id']
    .nunique()
    .unstack(fill_value=0)
)

# Extract total and target counts
total_per_year = yearly_stats.sum(axis=1).reindex(YEAR_RANGE, fill_value=0)
email_per_year = yearly_stats.get(True, pd.Series(0, index=yearly_stats.index)).reindex(YEAR_RANGE, fill_value=0)

# --- Build results DataFrame ---
percent_df = pd.DataFrame({
    'year': YEAR_RANGE,
    'total_cves': total_per_year.values,
    'email_protected': email_per_year.values,
    'email_protected_pct': (email_per_year / total_per_year * 100).fillna(0).values
})

# --- Display summary ---
print(percent_df)
print(f"\n✓ Example: In 2025, {percent_df.loc[percent_df['year']==2025, 'email_protected_pct'].values[0]:.1f}% of CVEs are from [email protected]")

    year  total_cves  email_protected  email_protected_pct
0   1999         923              923               100.00
1   2000        1020             1020               100.00
2   2001        1678             1678               100.00
3   2002        2165             2165               100.00
4   2003        1545             1545               100.00
5   2004        2472             2472               100.00
6   2005        4994             4993                99.98
7   2006        6336             6329                99.89
8   2007           0                0                 0.00
9   2008           0                0                 0.00
10  2009          25               25               100.00
11  2010          40               40               100.00
12  2011           6                6               100.00
13  2012           1                1               100.00
14  2013           1                1               100.00
15  2014           4                4               100.

In [148]:
df.drop(columns=['source'], inplace=True)

#### 3.5 - Processing the `category` column

In [150]:
df["category"].unique()

array(['Injection', '', 'Race Condition', 'Memory Corruption',
       'Authorization', 'Misconfiguration', 'Cryptography',
       'Cross-Site Request Forgery', 'Cross-Site Scripting',
       'Information Disclosure', 'Path Traversal', 'Authentication',
       'Server-Side Request Forgery', 'Denial of Service',
       'XML External Entity', 'Supply Chain'], dtype=object)

In [151]:
df["category"] = df["category"].replace("", "undefined")

In [152]:
df["category"].unique()

array(['Injection', 'undefined', 'Race Condition', 'Memory Corruption',
       'Authorization', 'Misconfiguration', 'Cryptography',
       'Cross-Site Request Forgery', 'Cross-Site Scripting',
       'Information Disclosure', 'Path Traversal', 'Authentication',
       'Server-Side Request Forgery', 'Denial of Service',
       'XML External Entity', 'Supply Chain'], dtype=object)
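
The Assumptions & Caveats section mentions inferring missing categories from text. A minimal sketch of that idea follows, using a small keyword table against the title. The keyword table, the function name `infer_category`, and the `category_inferred` flag are hypothetical; the target labels come from the category list above, and any real mapping would need review by someone who knows the taxonomy.

```python
import pandas as pd

# Hypothetical keyword -> category table; extend and review before real use.
KEYWORD_TO_CATEGORY = {
    "injection": "Injection",
    "buffer overflow": "Memory Corruption",
    "cross-site scripting": "Cross-Site Scripting",
    "path traversal": "Path Traversal",
    "denial of service": "Denial of Service",
}

def infer_category(title, current):
    """Return a keyword-based category when current is 'undefined', else current."""
    if current != "undefined" or not isinstance(title, str):
        return current
    lowered = title.lower()
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in lowered:
            return category
    return current

demo = pd.DataFrame({
    "title": [
        "Sendmail Command Injection Vulnerability",
        "SunOS Restore Privilege Escalation Vulnerability",
        "Example Path Traversal Issue",
    ],
    "category": ["undefined", "undefined", "Authorization"],
})
inferred = [infer_category(t, c) for t, c in zip(demo["title"], demo["category"])]
# Keep a flag so inferred labels can be told apart from source labels.
demo["category_inferred"] = [new != old for new, old in zip(inferred, demo["category"])]
demo["category"] = inferred
print(demo)
```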

In [153]:
df.head(5)

Unnamed: 0,cve_id,title,description,published_date,last_modified,remotely_exploit,category,affected_products,cvss_scores,loaded_at
0,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,[],"[{""score"": ""7.5"", ""version"": ""CVSS 2.0"", ""seve...",2025-10-13 12:00:10
1,CVE-1999-0095,Sendmail Command Injection Vulnerability,The following products are affected byCVE-1999...,1988-10-01 04:00:00,2025-04-03 01:03:00,True,undefined,"[{""id"": ""1"", ""vendor"": ""Eric_allman"", ""product...","[{""score"": ""10"", ""version"": ""CVSS 2.0"", ""sever...",2025-10-13 12:00:10
2,CVE-1999-0082,Tenable FTP Server Command Injection Vulnerabi...,The following products are affected byCVE-1999...,1988-11-11 05:00:00,2025-04-03 01:03:00,True,undefined,"[{""id"": ""1"", ""vendor"": ""Ftp"", ""product"": ""ftp""...","[{""score"": ""10"", ""version"": ""CVSS 2.0"", ""sever...",2025-10-13 12:00:10
3,CVE-1999-1471,"""BSD Passwd Buffer Overflow Root Privilege Esc...",The following products are affected byCVE-1999...,1989-01-01 05:00:00,2025-04-03 01:03:00,False,undefined,"[{""id"": ""1"", ""vendor"": ""Bsd"", ""product"": ""bsd""}]","[{""score"": ""7.2"", ""version"": ""CVSS 2.0"", ""seve...",2025-10-13 12:00:10
4,CVE-1999-1122,SunOS Restore Privilege Escalation Vulnerability,Vulnerability in restore in SunOS 4.0.3 and ea...,1989-07-26 04:00:00,2025-04-03 01:03:00,False,undefined,"[{""id"": ""1"", ""vendor"": ""Sun"", ""product"": ""suno...","[{""score"": ""4.6"", ""version"": ""CVSS 2.0"", ""seve...",2025-10-13 12:00:10


#### 3.6 - Processing the `affected_products` column

The `affected_products` column holds a JSON list of vendor/product objects. Because affected products are important for analysis, we extract them into a dedicated product table destined for the Silver layer of our data warehouse.

In [154]:
import json

# ===== Extract the unique products =====
products_dict = {}

for idx, row in df.iterrows():
    cve_id = row['cve_id']
    affected_products = row['affected_products']
    
    if affected_products and affected_products != '[]':
        try:
            if isinstance(affected_products, str):
                products = json.loads(affected_products)
            else:
                products = affected_products
            
            for product in products:
                vendor = product.get('vendor', '').strip()
                product_name = product.get('product', '').strip()
                
                if vendor and product_name:
                    key = (vendor.lower(), product_name.lower())
                    
                    if key not in products_dict:
                        products_dict[key] = {
                            'vendor': vendor,
                            'product_name': product_name,
                            'cves': set()
                        }
                    products_dict[key]['cves'].add(cve_id)
                    
        except (json.JSONDecodeError, TypeError):
            continue

# Build df_products
products_data = []
product_lookup = {}

for product_id, ((vendor_lower, product_lower), data) in enumerate(products_dict.items(), start=1):
    products_data.append({
        'product_id': product_id,
        'vendor': data['vendor'],
        'product_name': data['product_name'],
        'total_cves': len(data['cves']),
        'cve_list_json': json.dumps(list(data['cves']))
    })
    product_lookup[(vendor_lower, product_lower)] = product_id

df_products = pd.DataFrame(products_data)

# ===== Enrich with CVE dates =====
cve_products_for_dates = []

for idx, row in df.iterrows():
    cve_id = row['cve_id']
    published_date = row['published_date']
    affected_products = row['affected_products']
    
    if affected_products and affected_products != '[]':
        try:
            if isinstance(affected_products, str):
                products = json.loads(affected_products)
            else:
                products = affected_products
            
            for product in products:
                vendor = product.get('vendor', '').strip()
                product_name = product.get('product', '').strip()
                
                if vendor and product_name:
                    key = (vendor.lower(), product_name.lower())
                    product_id = product_lookup.get(key)
                    
                    if product_id:
                        cve_products_for_dates.append({
                            'product_id': product_id,
                            'published_date': published_date
                        })
        except (json.JSONDecodeError, TypeError):
            continue

df_temp = pd.DataFrame(cve_products_for_dates)

# Aggregate dates per product
product_dates = df_temp.groupby('product_id').agg({
    'published_date': ['min', 'max']
}).reset_index()

product_dates.columns = ['product_id', 'first_cve_date', 'last_cve_date']

# Join with df_products
df_pr = df_products.merge(product_dates, on='product_id', how='left')

In [155]:
df_pr.head(10)

Unnamed: 0,product_id,vendor,product_name,total_cves,cve_list_json,first_cve_date,last_cve_date
0,1,Eric_allman,sendmail,14,"[""CVE-1999-0163"", ""CVE-1999-0203"", ""CVE-1999-0...",1988-10-01 04:00:00,2000-04-23 04:00:00
1,2,Ftp,ftp,2,"[""CVE-1999-0201"", ""CVE-1999-0082""]",1988-11-11 05:00:00,1997-01-01 05:00:00
2,3,Ftpcd,ftpcd,1,"[""CVE-1999-0082""]",1988-11-11 05:00:00,1988-11-11 05:00:00
3,4,Bsd,bsd,6,"[""CVE-1999-1394"", ""CVE-1999-1098"", ""CVE-2001-0...",1989-01-01 05:00:00,2001-10-03 04:00:00
4,5,Sun,sunos,339,"[""CVE-2004-0496"", ""CVE-2000-0030"", ""CVE-1999-1...",1989-07-26 04:00:00,2006-12-13 01:28:00
5,6,Sun,nfs,4,"[""CVE-1999-0084"", ""CVE-1999-0165"", ""CVE-1999-0...",1990-05-01 04:00:00,1997-07-01 04:00:00
6,7,Freebsd,freebsd,256,"[""CVE-2002-0754"", ""CVE-2006-4304"", ""CVE-2000-0...",1990-05-09 04:00:00,2025-01-30 05:15:00
7,8,Linux,linux_kernel,3659,"[""CVE-2022-49618"", ""CVE-2025-37950"", ""CVE-2022...",1995-09-07 04:00:00,2025-10-07 16:15:00
8,9,Next,next,5,"[""CVE-1999-1391"", ""CVE-1999-1193"", ""CVE-1999-1...",1990-10-03 04:00:00,1991-10-22 04:00:00
9,10,Next,nex,1,"[""CVE-1999-1392""]",1990-10-03 04:00:00,1990-10-03 04:00:00


In [156]:
df.drop(columns=['affected_products'], inplace=True)

### Step 4 — CVSS Data Extraction and Transformation

#### 4.1 — ⚙️ Preparing Multi-Version CVSS Data

The **`cvss_scores`** column contains a list of CVSS score objects for each CVE entry.  
A typical schema looks like this:

```json
[
  {
    "score": "7.5",
    "version": "CVSS 2.0",
    "severity": "HIGH",
    "vector": "AV:N/AC:L/Au:N/C:P/I:P/A:P",
    "exploitability_score": "10.0",
    "impact_score": "6.4",
    "source": "[email protected]"
  },
  {
    "score": "7.3",
    "version": "CVSS 3.1",
    "severity": "HIGH",
    "vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L",
    "exploitability_score": "3.9",
    "impact_score": "3.4",
    "source": "[email protected]"
  },
  {
    "score": "6.9",
    "version": "CVSS 4.0",
    "severity": "MEDIUM",
    "vector": "CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L/VA:L/SC:N/SI:N/SA:N/..."
  }
]
```

For CVEs that contain **multiple CVSS versions** (e.g., _2.0_, _3.1_, _4.0_),  
we will **duplicate the corresponding rows** so that **each row represents a single CVSS entry** —  
with its specific **score**, **version**, and **vector**.

This normalized structure will make it easier to **analyze**, **compare**, and **visualize** different CVSS versions later during the dashboard or analytical phase.
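
As a small illustration of this "one row per CVSS entry" shape, the sketch below expands a toy frame with `DataFrame.explode` and `pandas.json_normalize`. The frame `toy`, its two placeholder CVE ids, and the result name `toy_long` are assumptions made for the example; the notebook itself uses an explicit loop for the real data.

```python
import json

import pandas as pd

# Toy input: two CVEs, the first scored under two CVSS versions.
toy = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B"],
    "cvss_scores": [
        '[{"score": "7.5", "version": "CVSS 2.0"}, {"score": "7.3", "version": "CVSS 3.1"}]',
        '[{"score": "6.9", "version": "CVSS 4.0"}]',
    ],
})

# Parse each JSON string into a list of dicts, then emit one row per dict.
exploded = (
    toy.assign(cvss_scores=toy["cvss_scores"].map(json.loads))
       .explode("cvss_scores", ignore_index=True)
)

# Spread the dict keys into columns prefixed with "cvss_".
entries = pd.json_normalize(exploded["cvss_scores"].tolist()).add_prefix("cvss_")
toy_long = pd.concat([exploded.drop(columns="cvss_scores"), entries], axis=1)
toy_long["cvss_score"] = pd.to_numeric(toy_long["cvss_score"], errors="coerce")

print(toy_long)
```

The first CVE appears twice in `toy_long`, once per CVSS version, which is the layout the later comparisons rely on.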

Before performing this operation, we first **remove all rows that contain no CVSS data**.  
If `cvss_scores` is null or an empty list (`[]`), the CVE has no score, version, or vector,  
so **no vulnerability scoring information is available**,  
and other related fields are likely missing as well.

➡️ These rows are dropped so that every remaining row carries at least one CVSS entry.

In [157]:
# 1️⃣ Count rows without CVSS scores (NaN or empty list)
missing_count = df["cvss_scores"].isna().sum() + (df["cvss_scores"].str.strip() == "[]").sum()
print(f"Number of rows without CVSS scores: {missing_count}")

# 2️⃣ Drop those rows directly from df
df.drop(df[df["cvss_scores"].isna() | (df["cvss_scores"].str.strip() == "[]")].index, inplace=True)

# 3️⃣ Check the result
print(f"Remaining rows after drop: {len(df)}")

Number of rows without CVSS scores: 4871
Remaining rows after drop: 59308


In [158]:
df.head(5)

Unnamed: 0,cve_id,title,description,published_date,last_modified,remotely_exploit,category,cvss_scores,loaded_at
0,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,"[{""score"": ""7.5"", ""version"": ""CVSS 2.0"", ""seve...",2025-10-13 12:00:10
1,CVE-1999-0095,Sendmail Command Injection Vulnerability,The following products are affected byCVE-1999...,1988-10-01 04:00:00,2025-04-03 01:03:00,True,undefined,"[{""score"": ""10"", ""version"": ""CVSS 2.0"", ""sever...",2025-10-13 12:00:10
2,CVE-1999-0082,Tenable FTP Server Command Injection Vulnerabi...,The following products are affected byCVE-1999...,1988-11-11 05:00:00,2025-04-03 01:03:00,True,undefined,"[{""score"": ""10"", ""version"": ""CVSS 2.0"", ""sever...",2025-10-13 12:00:10
3,CVE-1999-1471,"""BSD Passwd Buffer Overflow Root Privilege Esc...",The following products are affected byCVE-1999...,1989-01-01 05:00:00,2025-04-03 01:03:00,False,undefined,"[{""score"": ""7.2"", ""version"": ""CVSS 2.0"", ""seve...",2025-10-13 12:00:10
4,CVE-1999-1122,SunOS Restore Privilege Escalation Vulnerability,Vulnerability in restore in SunOS 4.0.3 and ea...,1989-07-26 04:00:00,2025-04-03 01:03:00,False,undefined,"[{""score"": ""4.6"", ""version"": ""CVSS 2.0"", ""seve...",2025-10-13 12:00:10


In [159]:
import pandas as pd
import json

def extract_cvss_scores(df):
    """
    Extract and normalize CVSS scores from the cvss_scores column.
    Creates one row per CVSS version for each CVE.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing a 'cvss_scores' column with CVSS data
        
    Returns:
    --------
    pandas.DataFrame
        Normalized DataFrame with one row per CVSS score entry
    """
    
    # List to store expanded rows
    expanded_rows = []
    
    # Iterate through each CVE record
    for idx, row in df.iterrows():
        cvss_scores = row['cvss_scores']
        
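        # Handle missing values (None, NaN, '', '[]', or an empty list).
        # pd.isna() is applied to scalars only: on a list it returns an array,
        # whose truth value is ambiguous inside an `if`.
        is_list = isinstance(cvss_scores, list)
        if (is_list and not cvss_scores) or (
            not is_list and (pd.isna(cvss_scores) or cvss_scores in ('', '[]'))
        ):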
            # Keep the row but with null CVSS data
            row_dict = row.to_dict()
            row_dict.update({
                'cvss_score': None,
                'cvss_version': None,
                'cvss_severity': None,
                'cvss_vector': None,
                'cvss_exploitability_score': None,
                'cvss_impact_score': None,
            })
            expanded_rows.append(row_dict)
            continue
        
        # Parse JSON if it's a string
        if isinstance(cvss_scores, str):
            try:
                cvss_scores = json.loads(cvss_scores)
            except json.JSONDecodeError:
                # If parsing fails, skip or handle gracefully
                row_dict = row.to_dict()
                row_dict.update({
                    'cvss_score': None,
                    'cvss_version': None,
                    'cvss_severity': None,
                    'cvss_vector': None,
                    'cvss_exploitability_score': None,
                    'cvss_impact_score': None,
                })
                expanded_rows.append(row_dict)
                continue
        
        # For each CVSS score entry, create a new row
        for cvss_entry in cvss_scores:
            row_dict = row.to_dict()
            
            # Extract CVSS-specific fields
            row_dict['cvss_score'] = cvss_entry.get('score')
            row_dict['cvss_version'] = cvss_entry.get('version')
            row_dict['cvss_severity'] = cvss_entry.get('severity')
            row_dict['cvss_vector'] = cvss_entry.get('vector')
            row_dict['cvss_exploitability_score'] = cvss_entry.get('exploitability_score')
            row_dict['cvss_impact_score'] = cvss_entry.get('impact_score')
            
            expanded_rows.append(row_dict)
    
    # Create new DataFrame from expanded rows
    df_expanded = pd.DataFrame(expanded_rows)
    
    # Drop the original cvss_scores column
    if 'cvss_scores' in df_expanded.columns:
        df_expanded = df_expanded.drop('cvss_scores', axis=1)
    
    # Convert numeric columns to appropriate types
    numeric_cols = ['cvss_score', 'cvss_exploitability_score', 'cvss_impact_score']
    for col in numeric_cols:
        if col in df_expanded.columns:
            df_expanded[col] = pd.to_numeric(df_expanded[col], errors='coerce')
    
    return df_expanded


def analyze_cvss_versions(df_expanded):
    """
    Analyze the distribution of CVSS versions in the normalized dataset.
    
    Parameters:
    -----------
    df_expanded : pandas.DataFrame
        Normalized DataFrame with CVSS data
        
    Returns:
    --------
    pandas.DataFrame
        Summary statistics by CVSS version
    """
    
    version_summary = df_expanded.groupby('cvss_version').agg({
        'cve_id': 'count',
        'cvss_score': ['mean', 'median', 'min', 'max'],
        'cvss_severity': lambda x: x.value_counts().to_dict()
    }).round(2)
    
    version_summary.columns = ['_'.join(col).strip() for col in version_summary.columns]
    version_summary = version_summary.rename(columns={'cve_id_count': 'total_entries'})
    
    return version_summary


# Example usage:
df = extract_cvss_scores(df)

In [160]:
df.head(3)

Unnamed: 0,cve_id,title,description,published_date,last_modified,remotely_exploit,category,loaded_at,cvss_score,cvss_version,cvss_severity,cvss_vector,cvss_exploitability_score,cvss_impact_score
0,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,2025-10-13 12:00:10,7.5,CVSS 2.0,HIGH,AV:N/AC:L/Au:N/C:P/I:P/A:P,10.0,6.4
1,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,2025-10-13 12:00:10,7.3,CVSS 3.1,HIGH,CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L,3.9,3.4
2,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,2025-10-13 12:00:10,6.9,CVSS 4.0,MEDIUM,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L/VA...,,


One thing stands out: for entries where the CVSS version is **4.0**, the fields **`cvss_exploitability_score`** and **`cvss_impact_score`** are empty.

In [161]:
df[df["cvss_version"] == "CVSS 4.0"][["cve_id", "cvss_version", "cvss_severity",   "cvss_vector" ,   "cvss_exploitability_score", "cvss_impact_score"]].head(5)

Unnamed: 0,cve_id,cvss_version,cvss_severity,cvss_vector,cvss_exploitability_score,cvss_impact_score
2,CVE-2025-11608,CVSS 4.0,MEDIUM,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L/VA...,,
9838,CVE-2025-0103,CVSS 4.0,CRITICAL,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:L/VA...,,
9844,CVE-2025-0104,CVSS 4.0,HIGH,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:A/VC:H/VI:L/VA...,,
9851,CVE-2025-0105,CVSS 4.0,MEDIUM,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:N/VI:L/VA...,,
9857,CVE-2025-0106,CVSS 4.0,MEDIUM,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:N/VA...,,
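
To check whether this holds for every CVSS 4.0 row rather than only the first five, one option is to compute the share of missing sub-scores per version. The sketch below runs on a toy frame (`toy` and `missing_share` are names chosen for the example); on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the expanded DataFrame.
toy = pd.DataFrame({
    "cvss_version": ["CVSS 2.0", "CVSS 3.1", "CVSS 4.0", "CVSS 4.0"],
    "cvss_exploitability_score": [10.0, 3.9, None, None],
    "cvss_impact_score": [6.4, 3.4, None, None],
})

# Fraction of null sub-scores within each CVSS version (1.0 means always missing).
missing_share = (
    toy[["cvss_exploitability_score", "cvss_impact_score"]]
    .isna()
    .groupby(toy["cvss_version"])
    .mean()
)

print(missing_share)
```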


## Why `cvss_exploitability_score` Is Empty for CVSS 4.0

CVSS 4.0 no longer includes a separate "Exploitability" sub-score.

In earlier versions of CVSS (v2 and v3.x), the base score was built from two sub-scores. The score structure by version is as follows:

|CVSS Version|Score Structure|
|---|---|
|CVSS 2.0|Base Score = (0.6 × Impact + 0.4 × Exploitability − 1.5) × f(Impact), where f(Impact) is 0 if Impact is 0 and 1.176 otherwise|
|CVSS 3.0 / 3.1|Base Score = f(Impact, Exploitability) (still calculated separately)|
|CVSS 4.0|Exploitability is no longer a standalone component|

In CVSS 4.0, the exploitability metrics (such as Attack Vector, Attack Complexity, Privileges Required, etc.) still exist,  
but they are integrated directly into the overall formula, rather than being summarized in a separate `exploitability_score` field.
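
As a concrete check of the v2 row, the CVSS 2.0 specification combines the two sub-scores as `round_to_1_decimal((0.6 × Impact + 0.4 × Exploitability − 1.5) × f(Impact))`, with `f(Impact) = 0` when Impact is 0 and `1.176` otherwise. The sketch below applies that equation to the sub-scores of the CVSS 2.0 entry in the sample schema (impact 6.4, exploitability 10.0). The helper name `cvss2_base_score` is hypothetical and not part of the notebook.

```python
def cvss2_base_score(impact: float, exploitability: float) -> float:
    """CVSS 2.0 base equation, taking the two published sub-scores as inputs."""
    # f(Impact) zeroes the score when there is no impact at all.
    f_impact = 0.0 if impact == 0 else 1.176
    raw = (0.6 * impact + 0.4 * exploitability - 1.5) * f_impact
    return round(raw, 1)


# Sub-scores of the CVSS 2.0 entry in the sample schema.
sample_score = cvss2_base_score(impact=6.4, exploitability=10.0)
print(sample_score)  # 7.5, matching the "score" field of that entry
```

Because the dataset stores sub-scores already rounded to one decimal, recomputing from them may differ slightly from the published score in edge cases. No equivalent two-input function exists for CVSS 4.0, which is why those columns stay empty.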

## About Missing Exploitability and Impact Scores in CVSS 4.0

For now, we will **leave the `exploitability_score` and `impact_score` fields empty** for CVSS 4.0 entries.

However, it is technically possible to **approximate** these values using weighted metrics extracted from the CVSS vector.  
An example approach could be:

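```python
# Illustrative only: assumes numeric *_score columns were already derived from
# the parsed vector metrics. The weights are placeholders, not values defined
# by the CVSS 4.0 specification.
df["exploitability_proxy"] = (
    df["AV_score"] * 0.3 +
    df["AC_score"] * 0.25 +
    df["PR_score"] * 0.25 +
    df["UI_score"] * 0.2
)

df["impact_proxy"] = (
    df["VC_score"] * 0.4 +
    df["VI_score"] * 0.3 +
    df["VA_score"] * 0.3
)
```
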

# 💡 CVSS Versions and Metric Mappings

## Overview

**CVSS Version** refers to the version of the **CVSS standard** used to assess the severity of a vulnerability.

---

## 🔍 What is CVSS?

**CVSS (Common Vulnerability Scoring System)** is a standardized scoring system used in cybersecurity to measure the severity of vulnerabilities (CVE). It is managed by the **Forum of Incident Response and Security Teams (FIRST)**.

Each CVSS version defines:
- A mathematical formula to calculate a score from **0 to 10**
- Criteria (vectors) describing how the vulnerability can be exploited

---

## ⚙️ Main CVSS Versions

| Version | Year | Main Characteristics |
|---------|------|---------------------|
| **CVSS 2.0** | 2007 | First widely used version; less precise for real-world exploitation contexts |
| **CVSS 3.0** | 2015 | Better distinction between exploitability and impact; introduction of the "scope" concept |
| **CVSS 3.1** | 2019 | Most widely used version; clarifies metric definitions (same formula as 3.0) |
| **CVSS 4.0** | 2023 | Next generation: adds Attack Requirements (AT), splits impact into vulnerable-system and subsequent-system metrics, and introduces supplemental metrics; better reflects modern attack scenarios |

---

## 🧩 CVSS Metric Mappings

Below are the **abbreviations and their meanings** for each CVSS version. These mappings are essential for parsing CVSS vectors into human-readable components.
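
Before any mapping is applied, a vector string first has to be split into its `metric:value` pairs. A minimal sketch of that step, using only the standard library (the helper name `split_vector` is hypothetical):

```python
def split_vector(vector: str) -> dict[str, str]:
    """Split a CVSS vector string into {metric: raw_value} pairs."""
    parts = vector.split("/")
    # CVSS 3.x and 4.0 vectors start with a "CVSS:<version>" prefix; 2.0 vectors do not.
    if parts and parts[0].startswith("CVSS:"):
        parts = parts[1:]
    return dict(part.split(":", 1) for part in parts if ":" in part)


v3_metrics = split_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L")
v2_metrics = split_vector("AV:N/AC:L/Au:N/C:P/I:P/A:P")
print(v3_metrics)
print(v2_metrics)
```

The raw codes (`N`, `L`, `P`, ...) are what the mappings below translate into labels such as "Network" or "Partial".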

---

### 🟦 Common Metrics (CVSS 3.x & Compatible)

```python
MAPS_COMMON = {
    "AV": {    # Attack Vector
        "N": "Network",
        "A": "Adjacent",
        "L": "Local",
        "P": "Physical"
    },
    "AC": {    # Attack Complexity
        "L": "Low",
        "H": "High"
    },
    "PR": {    # Privileges Required
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "UI": {    # User Interaction
        "N": "None",
        "R": "Required"
    },
    "S": {     # Scope
        "U": "Unchanged",
        "C": "Changed"
    },
    "C": {     # Confidentiality Impact
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "I": {     # Integrity Impact
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "A": {     # Availability Impact
        "N": "None",
        "L": "Low",
        "H": "High"
    }
}
```

---

### 🟨 CVSS 2.0 Specific Metrics

```python
MAPS_V2 = {
    "AV": {    # Access Vector
        "N": "Network",
        "A": "Adjacent",
        "L": "Local"
    },
    "AC": {    # Access Complexity
        "L": "Low",
        "M": "Medium",
        "H": "High"
    },
    "Au": {    # Authentication
        "N": "None",
        "S": "Single",
        "M": "Multiple"
    },
    "C": {     # Confidentiality Impact
        "N": "None",
        "P": "Partial",
        "C": "Complete"
    },
    "I": {     # Integrity Impact
        "N": "None",
        "P": "Partial",
        "C": "Complete"
    },
    "A": {     # Availability Impact
        "N": "None",
        "P": "Partial",
        "C": "Complete"
    }
}
```

**CVSS 2.0 Differences:**

- Uses **Au (Authentication)** instead of PR (Privileges Required)
- Impact values use **Partial/Complete** instead of Low/High
- Less granular than CVSS 3.x versions

---

### 🟥 CVSS 4.0 Additional Metrics

```python
MAPS_V40 = {
    "AT": {    # Attack Requirements
        "N": "None",
        "P": "Present"
    },
    # Vulnerable System Impact
    "VC": {    # Vulnerable System - Confidentiality
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "VI": {    # Vulnerable System - Integrity
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "VA": {    # Vulnerable System - Availability
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    # Subsequent System Impact
    "SC": {    # Subsequent System - Confidentiality
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "SI": {    # Subsequent System - Integrity
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "SA": {    # Subsequent System - Availability
        "N": "None",
        "L": "Low",
        "H": "High"
    }
}
```

In [162]:
import pandas as pd
import re

# CVSS mappings
MAPS_COMMON = {
    "AV": {"N": "Network", "A": "Adjacent", "L": "Local", "P": "Physical"},
    "AC": {"L": "Low", "H": "High"},
    "PR": {"N": "None", "L": "Low", "H": "High"},
    "UI": {"N": "None", "R": "Required"},
    "S": {"U": "Unchanged", "C": "Changed"},
    "C": {"N": "None", "L": "Low", "H": "High"},
    "I": {"N": "None", "L": "Low", "H": "High"},
    "A": {"N": "None", "L": "Low", "H": "High"}
}

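# Note: this first-pass mapping has no "AC" entry, so CVSS 2.0 AC values stay as
# raw codes (L/M/H) in the output below. The per-version mappings defined later
# in the notebook include it.
MAPS_V2 = {
    "AV": {"N": "Network", "A": "Adjacent/Local", "L": "Local", "P": "Physical"},
    "Au": {"N": "None", "S": "Single", "M": "Multiple"},
    "C": {"N": "None", "P": "Partial", "C": "Complete", "L": "Low"},
    "I": {"N": "None", "P": "Partial", "C": "Complete", "L": "Low"},
    "A": {"N": "None", "P": "Partial", "C": "Complete", "L": "Low"}
}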

MAPS_V40 = {
    "AT": {"N": "None", "P": "Present"},
    "VC": {"N": "None", "L": "Low", "H": "High"},
    "VI": {"N": "None", "L": "Low", "H": "High"},
    "VA": {"N": "None", "L": "Low", "H": "High"},
    "SC": {"N": "None", "L": "Low", "H": "High"},
    "SI": {"N": "None", "L": "Low", "H": "High"},
    "SA": {"N": "None", "L": "Low", "H": "High"}
}

def parse_cvss_vector(vector_str, version):
    """
    Parse a CVSS vector and return a dictionary of its metrics.
    """
    if pd.isna(vector_str) or not isinstance(vector_str, str):
        return {}
    
    metrics = {}
    
    # Pick the mappings to use for this version
    if version == "CVSS 2.0":
        maps = {**MAPS_V2}
    elif version == "CVSS 3.1" or version == "CVSS 3.0":
        maps = {**MAPS_COMMON}
    elif version == "CVSS 4.0":
        maps = {**MAPS_COMMON, **MAPS_V40}
    else:
        maps = {**MAPS_COMMON}
    
    # Clean the vector (strip the "CVSS:3.1/" prefix or similar)
    vector_str = re.sub(r'^CVSS:\d+\.\d+/', '', vector_str)
    
    # Parse the metric:value pairs
    pairs = vector_str.split('/')
    for pair in pairs:
        if ':' in pair:
            metric, value = pair.split(':', 1)
            metric = metric.strip()
            value = value.strip()
            
            # Look up the decoded value
            if metric in maps and value in maps[metric]:
                metrics[metric] = maps[metric][value]
            else:
                # Keep the raw value when no mapping exists
                metrics[metric] = value
    
    return metrics

def extract_cvss_metrics(df):
    """
    Extract the CVSS metrics and add them to the DataFrame as columns.
    """
    # Work on a copy of the DataFrame
    df_result = df.copy()
    
    # Parse every vector
    parsed_metrics = []
    for idx, row in df_result.iterrows():
        metrics = parse_cvss_vector(row['cvss_vector'], row['cvss_version'])
        parsed_metrics.append(metrics)
    
    # Collect all unique metric names
    all_metrics = set()
    for metrics in parsed_metrics:
        all_metrics.update(metrics.keys())
    
    # Create one column per metric
    for metric in sorted(all_metrics):
        column_name = f'cvss_metric_{metric}'
        df_result[column_name] = [metrics.get(metric, None) for metrics in parsed_metrics]
    
    return df_result


# Assumes 'df' is your existing DataFrame
df = extract_cvss_metrics(df)

In [163]:
df.head(3)

Unnamed: 0,cve_id,title,description,published_date,last_modified,remotely_exploit,category,loaded_at,cvss_score,cvss_version,cvss_severity,cvss_vector,cvss_exploitability_score,cvss_impact_score,cvss_metric_A,cvss_metric_AC,cvss_metric_AR,cvss_metric_AT,cvss_metric_AU,cvss_metric_AV,cvss_metric_Au,cvss_metric_C,cvss_metric_CR,cvss_metric_E,cvss_metric_I,cvss_metric_IR,cvss_metric_MAC,cvss_metric_MAT,cvss_metric_MAV,cvss_metric_MPR,cvss_metric_MSA,cvss_metric_MSC,cvss_metric_MSI,cvss_metric_MUI,cvss_metric_MVA,cvss_metric_MVC,cvss_metric_MVI,cvss_metric_PR,cvss_metric_R,cvss_metric_RE,cvss_metric_S,cvss_metric_SA,cvss_metric_SC,cvss_metric_SI,cvss_metric_U,cvss_metric_UI,cvss_metric_V,cvss_metric_VA,cvss_metric_VC,cvss_metric_VI
0,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,2025-10-13 12:00:10,7.5,CVSS 2.0,HIGH,AV:N/AC:L/Au:N/C:P/I:P/A:P,10.0,6.4,Partial,L,,,,Network,,Partial,,,Partial,,,,,,,,,,,,,,,,,,,,,,,,,
1,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,2025-10-13 12:00:10,7.3,CVSS 3.1,HIGH,CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L,3.9,3.4,Low,Low,,,,Network,,Low,,,Low,,,,,,,,,,,,,,,,Unchanged,,,,,,,,,
2,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,True,Injection,2025-10-13 12:00:10,6.9,CVSS 4.0,MEDIUM,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L/VA...,,,,Low,X,,X,Network,,,X,P,,X,X,X,X,X,X,X,X,X,X,X,X,,X,X,X,,,,X,,X,Low,Low,Low


Convert the metric columns to `category`, since each one has only a few distinct values that repeat many times (e.g., `High`, `Low`, `None`).

Why use `category`?

- **Memory efficiency:** the `category` dtype uses less memory than `object` when the number of unique values is small.
- **Faster operations:** sorting, filtering, and grouping are faster with `category` than with `object`, because `category` stores integer codes internally.
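
A rough way to see the memory effect is to compare `memory_usage(deep=True)` before and after conversion. The sketch below uses a synthetic column (`values` is made up for the example); the exact byte counts depend on the pandas version and string backend, so none are quoted here.

```python
import pandas as pd

# Synthetic low-cardinality column: three labels repeated many times.
values = pd.Series(["High", "Low", "None"] * 10_000)
as_category = values.astype("category")

string_bytes = values.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"string column:   {string_bytes:,} bytes")
print(f"category column: {category_bytes:,} bytes")
print(list(as_category.cat.categories))
```

The categorical version stores each label once plus one small integer code per row, which is where the saving comes from.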

In [164]:
# Convert relevant columns to 'category' dtype
metric_columns = [
    'cvss_metric_A', 'cvss_metric_AC', 'cvss_metric_AR', 'cvss_metric_AT', 'cvss_metric_AU',
    'cvss_metric_AV', 'cvss_metric_Au', 'cvss_metric_C', 'cvss_metric_CR', 'cvss_metric_E',
    'cvss_metric_I', 'cvss_metric_IR', 'cvss_metric_MAC', 'cvss_metric_MAT', 'cvss_metric_MAV',
    'cvss_metric_MPR', 'cvss_metric_MSA', 'cvss_metric_MSC', 'cvss_metric_MSI', 'cvss_metric_MUI',
    'cvss_metric_MVA', 'cvss_metric_MVC', 'cvss_metric_MVI', 'cvss_metric_PR', 'cvss_metric_R',
    'cvss_metric_RE', 'cvss_metric_S', 'cvss_metric_SA', 'cvss_metric_SC', 'cvss_metric_SI',
    'cvss_metric_U', 'cvss_metric_UI', 'cvss_metric_V', 'cvss_metric_VA', 'cvss_metric_VC',
    'cvss_metric_VI'
]

# Apply category dtype conversion
df[metric_columns] = df[metric_columns].apply(lambda x: x.astype('category'))


In [165]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92107 entries, 0 to 92106
Data columns (total 50 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   cve_id                     92107 non-null  object        
 1   title                      92107 non-null  object        
 2   description                92107 non-null  object        
 3   published_date             92107 non-null  datetime64[ns]
 4   last_modified              92107 non-null  datetime64[ns]
 5   remotely_exploit           92107 non-null  bool          
 6   category                   92107 non-null  object        
 7   loaded_at                  92107 non-null  datetime64[ns]
 8   cvss_score                 92107 non-null  float64       
 9   cvss_version               92107 non-null  object        
 10  cvss_severity              92107 non-null  object        
 11  cvss_vector                92107 non-null  object        
 12  cvss

Some metrics only ever take the value `X` (unknown, i.e. "Not Defined" in CVSS), so they add no information.
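
Such columns can also be flagged programmatically instead of by eye. A minimal sketch on a toy frame, assuming the same `cvss_metric_*` naming (`toy` and `only_x_cols` are names chosen for the example):

```python
import pandas as pd

# Toy stand-in: AR is always "X", the other two columns carry real values.
toy = pd.DataFrame({
    "cvss_metric_AR": ["X", "X", None],
    "cvss_metric_AU": ["X", "N", "Y"],
    "cvss_metric_AV": ["Network", "Local", None],
})

# A column is uninformative here if, ignoring nulls, "X" is the only value it holds.
only_x_cols = [
    col for col in toy.columns
    if set(toy[col].dropna().unique()) == {"X"}
]

print(only_x_cols)
```

The unique-value listing in the next cell serves the same purpose, with the added benefit of showing which values the remaining columns contain.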

In [166]:
# list of your CVSS-related columns
cvss_cols = [
    "cvss_metric_A", "cvss_metric_AC", "cvss_metric_AR", "cvss_metric_AT",
    "cvss_metric_AU", "cvss_metric_AV", "cvss_metric_Au", "cvss_metric_C",
    "cvss_metric_CR", "cvss_metric_E", "cvss_metric_I", "cvss_metric_IR",
    "cvss_metric_MAC", "cvss_metric_MAT", "cvss_metric_MAV", "cvss_metric_MPR",
    "cvss_metric_MSA", "cvss_metric_MSC", "cvss_metric_MSI", "cvss_metric_MUI",
    "cvss_metric_MVA", "cvss_metric_MVC", "cvss_metric_MVI", "cvss_metric_PR",
    "cvss_metric_R", "cvss_metric_RE", "cvss_metric_S", "cvss_metric_SA",
    "cvss_metric_SC", "cvss_metric_SI", "cvss_metric_U", "cvss_metric_UI",
    "cvss_metric_V", "cvss_metric_VA", "cvss_metric_VC", "cvss_metric_VI"
]

# display unique values for each
for col in cvss_cols:
    unique_vals = df[col].dropna().unique()
    n_unique = len(unique_vals)
    print(f"\n=== {col} ({n_unique} unique) ===")
    if n_unique <= 15:
        print(unique_vals)
    else:
        print(unique_vals[:10], "...")


=== cvss_metric_A (5 unique) ===
['Partial', 'Low', 'Complete', 'High', 'None']
Categories (5, object): ['Complete', 'High', 'Low', 'None', 'Partial']

=== cvss_metric_AC (5 unique) ===
['L', 'Low', 'H', 'M', 'High']
Categories (5, object): ['H', 'High', 'L', 'Low', 'M']

=== cvss_metric_AR (1 unique) ===
['X']
Categories (1, object): ['X']

=== cvss_metric_AT (2 unique) ===
['None', 'Present']
Categories (2, object): ['None', 'Present']

=== cvss_metric_AU (3 unique) ===
['X', 'N', 'Y']
Categories (3, object): ['N', 'X', 'Y']

=== cvss_metric_AV (5 unique) ===
['Network', 'Local', 'Adjacent', 'Adjacent/Local', 'Physical']
Categories (5, object): ['Adjacent', 'Adjacent/Local', 'Local', 'Network', 'Physical']

=== cvss_metric_Au (3 unique) ===
['None', 'Single', 'Multiple']
Categories (3, object): ['Multiple', 'None', 'Single']

=== cvss_metric_C (5 unique) ===
['Partial', 'Low', 'Complete', 'High', 'None']
Categories (5, object): ['Complete', 'High', 'Low', 'None', 'Partial']

=== cvs

In [167]:
# 1️⃣ Drop useless columns
drop_cols = [
    "cvss_metric_AR", "cvss_metric_CR", "cvss_metric_IR", "cvss_metric_MAC",
    "cvss_metric_MAT", "cvss_metric_MAV", "cvss_metric_MPR", "cvss_metric_MSA",
    "cvss_metric_MSC", "cvss_metric_MUI", "cvss_metric_MVA", "cvss_metric_MVC",
    "cvss_metric_MVI"
]
df = df.drop(columns=drop_cols)

In [168]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92107 entries, 0 to 92106
Data columns (total 37 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   cve_id                     92107 non-null  object        
 1   title                      92107 non-null  object        
 2   description                92107 non-null  object        
 3   published_date             92107 non-null  datetime64[ns]
 4   last_modified              92107 non-null  datetime64[ns]
 5   remotely_exploit           92107 non-null  bool          
 6   category                   92107 non-null  object        
 7   loaded_at                  92107 non-null  datetime64[ns]
 8   cvss_score                 92107 non-null  float64       
 9   cvss_version               92107 non-null  object        
 10  cvss_severity              92107 non-null  object        
 11  cvss_vector                92107 non-null  object        
 12  cvss

In [170]:
df["cvss_version"].unique()

array(['CVSS 2.0', 'CVSS 3.1', 'CVSS 4.0', 'CVSS 3.0'], dtype=object)

In [179]:
import pandas as pd
import re

# ============================================================================
# CVSS MAPPINGS BY VERSION
# ============================================================================

MAPS_V2 = {
    "AV": {"N": "Network", "A": "Adjacent", "L": "Local"},
    "AC": {"L": "Low", "M": "Medium", "H": "High"},
    "Au": {"N": "None", "S": "Single", "M": "Multiple"},
    "C": {"N": "None", "P": "Partial", "C": "Complete"},
    "I": {"N": "None", "P": "Partial", "C": "Complete"},
    "A": {"N": "None", "P": "Partial", "C": "Complete"}
}

MAPS_V3 = {
    "AV": {"N": "Network", "A": "Adjacent", "L": "Local", "P": "Physical"},
    "AC": {"L": "Low", "H": "High"},
    "PR": {"N": "None", "L": "Low", "H": "High"},
    "UI": {"N": "None", "R": "Required"},
    "S": {"U": "Unchanged", "C": "Changed"},
    "C": {"N": "None", "L": "Low", "H": "High"},
    "I": {"N": "None", "L": "Low", "H": "High"},
    "A": {"N": "None", "L": "Low", "H": "High"}
}

MAPS_V4 = {
    "AV": {"N": "Network", "A": "Adjacent", "L": "Local", "P": "Physical"},
    "AC": {"L": "Low", "H": "High"},
    "AT": {"N": "None", "P": "Present"},
    "PR": {"N": "None", "L": "Low", "H": "High"},
    "UI": {"N": "None", "P": "Passive", "A": "Active"},
    "VC": {"N": "None", "L": "Low", "H": "High"},
    "VI": {"N": "None", "L": "Low", "H": "High"},
    "VA": {"N": "None", "L": "Low", "H": "High"},
    "SC": {"N": "None", "L": "Low", "H": "High"},
    "SI": {"N": "None", "L": "Low", "H": "High"},
    "SA": {"N": "None", "L": "Low", "H": "High"}
}

# Metrics specific to each version
METRICS_V2 = ['AV', 'AC', 'Au', 'C', 'I', 'A']
METRICS_V3 = ['AV', 'AC', 'PR', 'UI', 'S', 'C', 'I', 'A']
METRICS_V4 = ['AV', 'AC', 'AT', 'PR', 'UI', 'VC', 'VI', 'VA', 'SC', 'SI', 'SA']

# ============================================================================
# PARSING FUNCTIONS
# ============================================================================

def parse_cvss_vector(vector_str, version):
    """Parse a CVSS vector according to its version."""
    if pd.isna(vector_str) or not isinstance(vector_str, str):
        return {}
    
    # Select the appropriate mapping
    if version == "CVSS 2.0":
        maps = MAPS_V2
    elif version in ["CVSS 3.0", "CVSS 3.1"]:
        maps = MAPS_V3
    elif version == "CVSS 4.0":
        maps = MAPS_V4
    else:
        return {}
    
    metrics = {}
    
    # Clean the vector (strip the "CVSS:x.y/" prefix)
    vector_str = re.sub(r'^CVSS:\d+\.\d+/', '', vector_str)
    
    # Parse the metric:value pairs
    pairs = vector_str.split('/')
    for pair in pairs:
        if ':' in pair:
            metric, value = pair.split(':', 1)
            metric = metric.strip()
            value = value.strip()
            
            if metric in maps and value in maps[metric]:
                metrics[metric] = maps[metric][value]
            else:
                metrics[metric] = value
    
    return metrics

# ============================================================================
# MAIN FUNCTION - BUILDING THE 3 DATASETS
# ============================================================================

def create_silver_datasets(df):
    """
    Build 3 Silver datasets, one per CVSS version family.
    
    Args:
        df: Bronze DataFrame with duplicated CVEs (one row per CVSS version)
    
    Returns:
        dict: {
            'cvss_v2': DataFrame for CVSS 2.0,
            'cvss_v3': DataFrame for CVSS 3.0/3.1 combined,
            'cvss_v4': DataFrame for CVSS 4.0
        }
    """
    
    datasets = {}
    
    # Common columns to keep
    base_cols = [
        'cve_id', 'title', 'description', 'published_date', 'last_modified',
        'remotely_exploit', 'category', 'loaded_at', 
        'cvss_score', 'cvss_version', 'cvss_severity', 'cvss_vector',
        'cvss_exploitability_score', 'cvss_impact_score'
    ]
    
    # ========================================================================
    # DATASET 1: CVSS 2.0
    # ========================================================================
    print("🔄 Processing CVSS 2.0...")
    df_v2 = df[df['cvss_version'] == 'CVSS 2.0'].copy()

    # Parse every vector into a metric dict (iterating the column avoids iterrows overhead)
    metrics_list = [parse_cvss_vector(v, 'CVSS 2.0') for v in df_v2['cvss_vector']]

    # Add one column per metric (None when the vector lacks that metric)
    for metric in METRICS_V2:
        col_name = f'cvss_metric_{metric}'
        df_v2[col_name] = [m.get(metric) for m in metrics_list]

    # Keep only the columns relevant to V2
    v2_metric_cols = [f'cvss_metric_{m}' for m in METRICS_V2]
    df_v2 = df_v2[base_cols + v2_metric_cols].reset_index(drop=True)

    datasets['cvss_v2'] = df_v2
    print(f"✅ CVSS 2.0 Dataset: {len(df_v2):,} rows, {len(df_v2.columns)} columns")
    
    # ========================================================================
    # DATASET 2: CVSS 3.0 / 3.1 (combined)
    # ========================================================================
    print("🔄 Processing CVSS 3.x...")
    df_v3 = df[df['cvss_version'].isin(['CVSS 3.0', 'CVSS 3.1'])].copy()

    # Parse every vector using the version recorded on its own row
    metrics_list = [
        parse_cvss_vector(vector, version)
        for vector, version in zip(df_v3['cvss_vector'], df_v3['cvss_version'])
    ]

    # Add one column per metric (None when the vector lacks that metric)
    for metric in METRICS_V3:
        col_name = f'cvss_metric_{metric}'
        df_v3[col_name] = [m.get(metric) for m in metrics_list]

    # Keep only the columns relevant to V3
    v3_metric_cols = [f'cvss_metric_{m}' for m in METRICS_V3]
    df_v3 = df_v3[base_cols + v3_metric_cols].reset_index(drop=True)

    datasets['cvss_v3'] = df_v3
    print(f"✅ CVSS 3.x Dataset: {len(df_v3):,} rows, {len(df_v3.columns)} columns")
    
    # ========================================================================
    # DATASET 3: CVSS 4.0
    # ========================================================================
    print("🔄 Processing CVSS 4.0...")
    df_v4 = df[df['cvss_version'] == 'CVSS 4.0'].copy()

    # Parse every vector into a metric dict
    metrics_list = [parse_cvss_vector(v, 'CVSS 4.0') for v in df_v4['cvss_vector']]

    # Add one column per metric (None when the vector lacks that metric)
    for metric in METRICS_V4:
        col_name = f'cvss_metric_{metric}'
        df_v4[col_name] = [m.get(metric) for m in metrics_list]

    # Keep only the columns relevant to V4
    v4_metric_cols = [f'cvss_metric_{m}' for m in METRICS_V4]
    df_v4 = df_v4[base_cols + v4_metric_cols].reset_index(drop=True)

    datasets['cvss_v4'] = df_v4
    print(f"✅ CVSS 4.0 Dataset: {len(df_v4):,} rows, {len(df_v4.columns)} columns")
    
    return datasets

# ============================================================================
# EXTRACTION FUNCTION
# ============================================================================

def extract_dataframes(datasets):
    """
    Return the individual DataFrames under simple names.

    Returns:
        tuple: (df_v2, df_v3, df_v4)
    """
    # Fall back to an empty DataFrame when a version is missing from the dict
    df_v2 = datasets.get('cvss_v2', pd.DataFrame())
    df_v3 = datasets.get('cvss_v3', pd.DataFrame())
    df_v4 = datasets.get('cvss_v4', pd.DataFrame())

    print("\n📦 Extracted DataFrames:")
    print(f"  • df_v2 (CVSS 2.0)  : {len(df_v2):,} rows")
    print(f"  • df_v3 (CVSS 3.x)  : {len(df_v3):,} rows")
    print(f"  • df_v4 (CVSS 4.0)  : {len(df_v4):,} rows")

    return df_v2, df_v3, df_v4

# ============================================================================
# QUICK ANALYSIS FUNCTION
# ============================================================================

def analyze_silver_datasets(datasets):
    """Print detailed statistics for the Silver datasets."""
    print("\n" + "="*70)
    print("📊 SILVER LAYER STATISTICS")
    print("="*70)

    total_rows = 0
    for name, df in datasets.items():
        total_rows += len(df)

        print(f"\n🔹 {name.upper().replace('_', ' ')}")
        print(f"   Rows           : {len(df):,}")
        print(f"   Columns        : {len(df.columns)}")

        # Count the CVSS metric columns
        metric_cols = [c for c in df.columns if c.startswith('cvss_metric_')]
        print(f"   CVSS metrics   : {len(metric_cols)}")

        # In-memory size
        memory_mb = df.memory_usage(deep=True).sum() / 1024**2
        print(f"   Memory         : {memory_mb:.2f} MB")

        # Score statistics (assumes cvss_score is numeric)
        if 'cvss_score' in df.columns:
            print(f"   Mean score     : {df['cvss_score'].mean():.2f}")
            print(f"   Max score      : {df['cvss_score'].max():.2f}")

    print(f"\n{'─'*70}")
    print(f"📈 TOTAL: {total_rows:,} rows across {len(datasets)} datasets")
    print("="*70)

# ============================================================================
# USAGE EXAMPLE
# ============================================================================

if __name__ == "__main__":
    # Load your bronze DataFrame first. It must already be flattened to one row
    # per CVSS version and contain the cvss_* columns listed in base_cols.
    # df = pd.read_parquet('bronze_layer/cve_data.parquet')

    print("🚀 Starting Bronze → Silver transformation")
    print("="*70)

    # Build the 3 Silver datasets
    silver_datasets = create_silver_datasets(df)

    # Analyze the datasets
    analyze_silver_datasets(silver_datasets)

    # Extract the individual DataFrames
    df_v2, df_v3, df_v4 = extract_dataframes(silver_datasets)

    print("\n✅ Transformation completed successfully!")
    print("\n💡 Usage:")
    print("   df_v2  → CVSS 2.0")
    print("   df_v3  → CVSS 3.0 & 3.1")
    print("   df_v4  → CVSS 4.0")

🚀 Starting Bronze → Silver transformation
🔄 Processing CVSS 2.0...


KeyError: 'cvss_version'
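
The `KeyError` above indicates that the DataFrame passed to `create_silver_datasets` has no `cvss_version` column, which is what one would expect if the raw feed (with its nested `cvss_scores` field) was passed in before being flattened to one row per CVSS version. A minimal sketch of a pre-flight check follows. The helper name `missing_columns` and the constant `REQUIRED_COLS` are hypothetical, not part of the original notebook; the required list mirrors `base_cols`, and the raw column names come from the sample rows in section 2.

```python
# Hypothetical pre-flight check (illustration only, not the notebook's own code).
REQUIRED_COLS = [
    'cve_id', 'title', 'description', 'published_date', 'last_modified',
    'remotely_exploit', 'category', 'loaded_at',
    'cvss_score', 'cvss_version', 'cvss_severity', 'cvss_vector',
    'cvss_exploitability_score', 'cvss_impact_score',
]

def missing_columns(columns, required=REQUIRED_COLS):
    """Return the required column names that are absent from `columns`, in order."""
    present = set(columns)
    return [c for c in required if c not in present]

# Column names of the raw feed, as listed in the sample rows of section 2
raw_columns = [
    'cve_id', 'title', 'description', 'published_date', 'last_modified',
    'remotely_exploit', 'source', 'category', 'affected_products',
    'cvss_scores', 'url', 'loaded_at',
]

missing = missing_columns(raw_columns)
print(missing)
```

With a DataFrame, `missing_columns(df.columns)` could be called before `create_silver_datasets(df)` so that a missing flattening step fails with a message naming every absent column rather than only the first one.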

In [175]:
df_v4.head()

| Unnamed: 0 | cve_id | title | description | published_date | last_modified | remotely_exploit | category | loaded_at | cvss_score | cvss_version | cvss_severity | cvss_vector | cvss_exploitability_score | cvss_impact_score | cvss_metric_AV | cvss_metric_AC | cvss_metric_AT | cvss_metric_PR | cvss_metric_UI | cvss_metric_VC | cvss_metric_VI | cvss_metric_VA | cvss_metric_SC | cvss_metric_SI | cvss_metric_SA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CVE-2025-11608 | code-projects E-Banking System POST Parameter ... | A security vulnerability has been detected in ... | 2025-10-11 17:15:00 | 2025-10-11 17:15:00 | True | Injection | 2025-10-13 12:00:10 | 6.9 | CVSS 4.0 | MEDIUM | CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L/VA... |  |  | Network | Low |  |  |  | Low | Low | Low |  |  |  |
| 1 | CVE-2025-0103 | Expedition: SQL Injection Vulnerability | An SQL injection vulnerability in Palo Alto Ne... | 2025-01-11 03:15:00 | 2025-01-11 03:15:00 | True | Injection | 2025-10-13 12:00:10 | 9.2 | CVSS 4.0 | CRITICAL | CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:L/VA... |  |  | Network | Low |  |  |  | High | Low |  | High |  |  |
| 2 | CVE-2025-0104 | Expedition: Cross-Site Scripting (XSS) Vulnera... | A reflected cross-site scripting (XSS) vulnera... | 2025-01-11 03:15:00 | 2025-01-11 03:15:00 | True | Cross-Site Scripting | 2025-10-13 12:00:10 | 7.0 | CVSS 4.0 | HIGH | CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:A/VC:H/VI:L/VA... |  |  | Network | Low |  |  | Active | High | Low |  |  |  |  |
| 3 | CVE-2025-0105 | Expedition: Arbitrary File Deletion Vulnerability | An arbitrary file deletion vulnerability in Pa... | 2025-01-11 03:15:00 | 2025-01-11 03:15:00 | True | Path Traversal | 2025-10-13 12:00:10 | 6.9 | CVSS 4.0 | MEDIUM | CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:N/VI:L/VA... |  |  | Network | Low |  |  |  |  | Low |  |  |  |  |
| 4 | CVE-2025-0106 | Expedition: Wildcard Expansion Vulnerability | A wildcard expansion vulnerability in Palo Alt... | 2025-01-11 03:15:00 | 2025-01-11 03:15:00 | True | Information Disclosure | 2025-10-13 12:00:10 | 6.9 | CVSS 4.0 | MEDIUM | CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:N/VA... |  |  | Network | Low |  |  |  | Low |  |  |  |  |  |
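
In the preview above, every metric whose vector value is `N` (for example `AT:N`, `PR:N`, `UI:N`) shows up empty, while other values such as `UI:A` are mapped to `Active`. One possible explanation, assuming `MAPS_V4` maps `N` to the string `"None"` and the frame was round-tripped through CSV (which the `Unnamed: 0` column hints at), is that recent pandas versions treat the literal text `None` as a missing-value marker in `read_csv` by default. The sketch below shows one way such labels could be preserved on reload; the sample frame and its values are illustrative placeholders, not rows from the dataset.

```python
import io

import pandas as pd

# Illustrative frame: the string "None" stands in for a mapped value such as PR:N
sample = pd.DataFrame({
    'cve_id': ['CVE-A', 'CVE-B'],          # placeholder identifiers
    'cvss_metric_PR': ['None', 'Low'],
})

# Write to an in-memory CSV; index=False avoids an extra "Unnamed: 0" column on reload
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default parsing: depending on the pandas version, "None" may come back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Treat only empty fields as missing, so the "None" label survives the round trip
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False, na_values=[''])

print(default_read['cvss_metric_PR'].tolist())
print(strict_read['cvss_metric_PR'].tolist())
```

If the two printed lists differ, the empty cells in the preview are likely a reload artifact rather than missing metrics. Storing the Silver datasets in a typed format such as Parquet would be another way to sidestep the issue.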
