## 🚀 Starting EDA Version 2

After completing the first version of our EDA and fixing the scrapers,  
we will now perform a **new round of Exploratory Data Analysis**.

But first, let's **redefine and repeat the initial setup steps** to ensure a clean and consistent workflow.

### Step 1 — Import Libraries and Configuration
---

In [6]:
# --- Data manipulation and analysis
import pandas as pd
import numpy as np

# --- Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --- Text processing
import re
import string

# --- Preprocessing and machine learning utilities
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Date and time
from datetime import datetime, timedelta

# --- pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 120)
pd.set_option('display.float_format', '{:.2f}'.format)

# --- Plot style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['axes.titlesize'] = 13
plt.rcParams['axes.labelsize'] = 11

### Step 2 : Loading and Initial Inspection of the Dataset
---

In [2]:
from sqlalchemy import create_engine
import pandas as pd

# ===============================================================
# 🔌 Database Connection
# ===============================================================
PG_USER = "postgres"
PG_PASSWORD = "tip_pwd"
PG_HOST = "localhost"
PG_PORT = "5432"
PG_DB = "tip"

CONN_STR = f"postgresql+psycopg2://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"
engine = create_engine(CONN_STR)

# ===============================================================
# 📥 Load silver.cve_cleaned
# ===============================================================
def load_silver_cve_details(limit=None):
    """
    Load cleaned CVE details from the PostgreSQL table silver.cve_cleaned
    - limit: optional integer to restrict the number of rows (for testing);
      when set, the most recently published rows are returned
    """
    query = "SELECT * FROM silver.cve_cleaned"
    if limit is not None:
        query += f" ORDER BY published_date DESC NULLS LAST LIMIT {int(limit)}"

    df = pd.read_sql(query, engine)
    print("✅ Dataset loaded successfully from silver!")
    print(f"📊 Dataset dimensions: {df.shape[0]} rows × {df.shape[1]} columns\n")
    return df


# Example usage
df = load_silver_cve_details()

✅ Dataset loaded successfully from silver!
📊 Dataset dimensions: 36359 rows × 14 columns
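
The credentials in the cell above are hardcoded, which is convenient for a local database. As a hedged alternative, the sketch below assembles the same kind of connection string from environment variables. The helper name `build_conn_str`, the variable names, and the defaults are assumptions made for illustration; they are not part of the original pipeline.

```python
import os
from urllib.parse import quote_plus


def build_conn_str(env=None):
    """Build a PostgreSQL connection URL from environment variables.

    Hypothetical helper: variable names and defaults are illustrative.
    """
    env = os.environ if env is None else env
    user = env.get("PG_USER", "postgres")
    password = env.get("PG_PASSWORD", "")
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    db = env.get("PG_DB", "tip")
    # Escape characters such as '@' or ':' that would otherwise break the URL.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# Passing a plain dict makes the helper easy to try without touching the real environment.
print(build_conn_str({"PG_USER": "postgres", "PG_PASSWORD": "p@ss:word"}))
```

The resulting string can be passed to `create_engine` in place of `CONN_STR`.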



In [3]:
display(df.head(5))

Unnamed: 0,cve_id,title,description,category,published_date,last_modified,loaded_at,remotely_exploit,source_identifier,affected_products,cvss_scores,url,created_at,updated_at
0,CVE-2010-0315,WebKit Chrome CSS Injection Information Disclo...,The following products are affected byCVE-2010...,undefined,2010-01-14 19:30:00,2025-04-09 00:30:00,2025-10-16 16:40:38.008621,,cve@mitre.org,"[{""id"": ""1"", ""vendor"": ""Google"", ""product"": ""c...","[{""score"": ""5"", ""vector"": ""AV:N/AC:L/Au:N/C:P/...",https://cvefeed.io/vuln/detail/CVE-2010-0315,2025-10-17 10:51:41.620019,2025-10-17 10:51:41.620019
1,CVE-2010-0249,Internet Explorer HTML Object Memory Corruptio...,Use-after-free vulnerability in Microsoft Inte...,undefined,2010-01-15 17:30:00,2025-04-09 00:30:00,2025-10-16 16:40:38.008621,,secure@microsoft.com,"[{""id"": ""1"", ""vendor"": ""Microsoft"", ""product"":...","[{""score"": ""9.3"", ""vector"": ""AV:N/AC:M/Au:N/C:...",https://cvefeed.io/vuln/detail/CVE-2010-0249,2025-10-17 10:51:41.620019,2025-10-17 10:51:41.620019
2,CVE-2010-0280,Google SketchUp lib3ds Array Index Error Remot...,The following products are affected byCVE-2010...,undefined,2010-01-15 17:30:00,2025-04-09 00:30:00,2025-10-16 16:40:38.008621,,cve@mitre.org,"[{""id"": ""1"", ""vendor"": ""Jan_eric_krprianidis"",...","[{""score"": ""9.3"", ""vector"": ""AV:N/AC:M/Au:N/C:...",https://cvefeed.io/vuln/detail/CVE-2010-0280,2025-10-17 10:51:41.620019,2025-10-17 10:51:41.620019
3,CVE-2010-0316,Google SketchUp Integer Overflow Denial of Ser...,The following products are affected byCVE-2010...,undefined,2010-01-15 17:30:00,2025-04-09 00:30:00,2025-10-16 16:40:38.008621,,cve@mitre.org,"[{""id"": ""1"", ""vendor"": ""Google"", ""product"": ""g...","[{""score"": ""9.3"", ""vector"": ""AV:N/AC:M/Au:N/C:...",https://cvefeed.io/vuln/detail/CVE-2010-0316,2025-10-17 10:51:41.620019,2025-10-17 10:51:41.620019
4,CVE-2010-0317,Novell Netware Denial of Service Vulnerability,The following products are affected byCVE-2010...,undefined,2010-01-15 18:30:00,2025-04-09 00:30:00,2025-10-16 16:40:38.008621,,cve@mitre.org,"[{""id"": ""1"", ""vendor"": ""Novell"", ""product"": ""n...","[{""score"": ""7.8"", ""vector"": ""AV:N/AC:L/Au:N/C:...",https://cvefeed.io/vuln/detail/CVE-2010-0317,2025-10-17 10:51:41.620019,2025-10-17 10:51:41.620019


We will drop the source column (`source_identifier`) because most of its values are missing.

In [10]:
# any duplicate CVE IDs?
has_dupes = df.duplicated(subset=["cve_id"]).any()
print("Duplicates on cve_id?", has_dupes)

# all rows that share a duplicate cve_id (keep=False marks all members)
dupes_cve = df[df.duplicated(subset=["cve_id"], keep=False)].sort_values("cve_id")
print(dupes_cve.head())


Duplicates on cve_id? False
Empty DataFrame
Columns: [cve_id, title, description, published_date, last_modified, remotely_exploit, source_identifier, category, affected_products, cvss_scores, url, loaded_at]
Index: []


In [11]:
# General information about the columns
print("Column type information:")
df.info()

Column type information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4783 entries, 0 to 4782
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   cve_id             4783 non-null   object             
 1   title              4783 non-null   object             
 2   description        4783 non-null   object             
 3   published_date     4783 non-null   object             
 4   last_modified      4783 non-null   object             
 5   remotely_exploit   622 non-null    object             
 6   source_identifier  4783 non-null   object             
 7   category           4783 non-null   object             
 8   affected_products  4783 non-null   object             
 9   cvss_scores        4783 non-null   object             
 10  url                4783 non-null   object             
 11  loaded_at          4783 non-null   datetime64[ns, UTC]
dtypes: date

Let's drop the `url` column: the links are easy to rebuild, since they generally follow the pattern `https://cvefeed.io/vuln/detail/{cve_id}`.

In [12]:
df.drop(columns=["url"], inplace=True, errors='ignore')
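
If the link is needed again later, it can be rebuilt from the identifier. This is a minimal sketch that assumes the URL pattern seen in the sample rows holds for every CVE; `build_cve_url` is a hypothetical helper name.

```python
CVEFEED_DETAIL_URL = "https://cvefeed.io/vuln/detail/{cve_id}"


def build_cve_url(cve_id):
    """Rebuild the detail-page URL from a CVE identifier."""
    return CVEFEED_DETAIL_URL.format(cve_id=cve_id.strip().upper())


print(build_cve_url("CVE-2010-0315"))
# https://cvefeed.io/vuln/detail/CVE-2010-0315
```

With pandas, `df["cve_id"].map(build_cve_url)` would restore the column on demand.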

In [13]:
from dateutil import parser  # flexible fallback parser used in try_parse_date


def normalize_date_str(s):
    """Normalize common variations before parsing:
       - converts None/NaN to None
       - replaces 'a.m.' / 'p.m.' / 'a.m' / 'pm.' etc. with 'AM' / 'PM'
       - removes the dot after a month abbreviation (e.g. 'Oct.' -> 'Oct')
       - normalizes the spacing after commas
    """
    if pd.isna(s):
        return None
    s = str(s).strip()

    # Normalize AM/PM variants to 'AM' / 'PM'; the optional dot placed after
    # the word boundary also consumes the trailing dot of 'a.m.' / 'p.m.'
    s = re.sub(r'\ba\.?m\b\.?', 'AM', s, flags=re.IGNORECASE)
    s = re.sub(r'\bp\.?m\b\.?', 'PM', s, flags=re.IGNORECASE)

    # Remove dot after 3-letter month abbreviations like 'Oct.' -> 'Oct'
    # only if it's followed by space and digit (month dot used only there)
    s = re.sub(r'([A-Za-z]{3})\.(?=\s+\d)', r'\1', s)

    # Also remove stray dots that break parsing (but be conservative)
    # e.g. 'CVE-...' might contain dots but dates are fine after previous fixes.
    # Remove remaining dots in the AM/PM area already handled.
    s = s.replace('..', '.')  # collapse double dots if any

    # Normalize commas/spaces: ensure one space after comma
    s = re.sub(r',\s*', ', ', s)

    # Examples of remaining forms:
    # "Oct 11, 2025, 5:15 PM", "Nov 11, 1988, 5 AM", "July 26, 1989, 4 AM"
    return s

def try_parse_date(s):
    """Try several parsing strategies, return pd.Timestamp or NaT."""
    if s is None:
        return pd.NaT

    # 1) Try common explicit formats (fast)
    formats = [
        "%b %d, %Y, %I:%M %p",   # "Oct 11, 2025, 5:15 PM"
        "%b %d, %Y, %I %p",      # "Nov 11, 1988, 5 PM" (no minutes)
        "%B %d, %Y, %I:%M %p",   # "July 26, 1989, 4:00 AM" (full month)
        "%B %d, %Y, %I %p",      # "July 26, 1989, 4 AM"
        "%Y-%m-%dT%H:%M:%S.%f",  # ISO-ish (if present)
        "%Y-%m-%d %H:%M:%S",     # fallback ISO/no-T
    ]
    for fmt in formats:
        try:
            return pd.to_datetime(s, format=fmt, errors='raise')
        except Exception:
            pass

    # 2) Let pandas infer the format; the deprecated infer_datetime_format
    #    flag is not needed in recent pandas versions and only raised warnings
    try:
        return pd.to_datetime(s, errors='raise')
    except Exception:
        pass

    # 3) Last fallback: direct dateutil parsing (most flexible)
    try:
        return parser.parse(s)
    except Exception:
        return pd.NaT

# Apply to the dataframe
raw_dates = {}
for col in ["published_date", "last_modified"]:
    # Keep the raw strings so that parsing failures can be inspected later
    raw_dates[col] = df[col].copy()

    # 1) Normalize the strings, 2) parse them with the robust function
    parsed = df[col].apply(normalize_date_str).apply(try_parse_date)

    # 3) Assign back as datetime dtype
    df[col] = pd.to_datetime(parsed, errors='coerce')

# Quick checks
print("Dtypes:")
print(df[["published_date", "last_modified"]].dtypes)
print("\nHow many missing after parse?")
print(df["published_date"].isna().sum(), "published_date NaT")
print(df["last_modified"].isna().sum(), "last_modified NaT")

# Show the raw strings of the rows that still failed (to inspect problematic values)
failed_pub = raw_dates["published_date"][df["published_date"].isna()].head(10)
if len(failed_pub) > 0:
    print("\nSample rows with published_date still NaT (show original raw strings for debugging):")
    print(failed_pub.to_string())



Dtypes:
published_date    datetime64[ns]
last_modified     datetime64[ns]
dtype: object

How many missing after parse?
68 published_date NaT
0 last_modified NaT

Sample rows with published_date still NaT (show original raw strings for debugging):
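
To see why the normalization step matters, the sketch below runs a standard-library version of it on a single string. The input `"Oct. 11, 2025, 5:15 p.m."` is a constructed example of the scraped style described in the docstring, not a value taken from the dataset, and `normalize` is a simplified stand-in for `normalize_date_str`.

```python
import re
from datetime import datetime


def normalize(s):
    """Simplified stand-in for normalize_date_str (standard library only)."""
    s = s.strip()
    # 'a.m.' / 'p.m.' variants -> 'AM' / 'PM', including the trailing dot
    s = re.sub(r'\ba\.?m\b\.?', 'AM', s, flags=re.IGNORECASE)
    s = re.sub(r'\bp\.?m\b\.?', 'PM', s, flags=re.IGNORECASE)
    # 'Oct.' -> 'Oct' when followed by a day number
    s = re.sub(r'([A-Za-z]{3})\.(?=\s+\d)', r'\1', s)
    # exactly one space after each comma
    return re.sub(r',\s*', ', ', s)


raw = "Oct. 11, 2025, 5:15 p.m."  # constructed example
clean = normalize(raw)
parsed = datetime.strptime(clean, "%b %d, %Y, %I:%M %p")

print(clean)   # Oct 11, 2025, 5:15 PM
print(parsed)  # 2025-10-11 17:15:00
```

Without the normalization, the explicit `%b %d, %Y, %I:%M %p` format would reject the raw string because of the dots. Note that `%b` depends on the active locale; the sketch assumes English month abbreviations.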


After what we did in Version 1 — where we refactored the scrapers to collect data in a more complete and reliable way —  
let’s now explore how the **CVSS scores** are distributed in this new dataset.


In [14]:
print(df["cvss_scores"].iloc[0])

[{'score': '5', 'vector': 'AV:N/AC:L/Au:N/C:P/I:N/A:N', 'version': 'CVSS 2.0', 'severity': 'MEDIUM', 'impact_score': '2.9', 'source_identifier': 'nvd@nist.gov', 'exploitability_score': '10'}]


## ⚙️ Preparing Multi-Version CVSS Data

For CVEs that contain **multiple CVSS versions** (e.g., 2.0, 3.1, 4.0),  
we will **duplicate the corresponding rows** so that **each row represents a single CVSS entry** —  
with its specific **score**, **version**, and **vector**.

This structure will make it easier to analyze and compare different CVSS versions later in the dashboard or analytics phase.
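
The one-row-per-entry layout can be sketched in plain Python before applying it to the DataFrame. The record below reuses two entries of CVE-2025-11608 as they appear in this notebook; `expand_cvss` is a hypothetical helper, and only three of the available fields are carried over to keep the example short.

```python
def expand_cvss(record):
    """Return one flat dict per CVSS entry found in record['cvss_scores']."""
    base = {k: v for k, v in record.items() if k != "cvss_scores"}
    rows = []
    for entry in record.get("cvss_scores") or []:
        rows.append({
            **base,
            "cvss_score": entry.get("score"),
            "cvss_version": entry.get("version"),
            "cvss_vector": entry.get("vector"),
        })
    return rows


record = {
    "cve_id": "CVE-2025-11608",
    "cvss_scores": [
        {"score": "7.5", "version": "CVSS 2.0",
         "vector": "AV:N/AC:L/Au:N/C:P/I:P/A:P"},
        {"score": "7.3", "version": "CVSS 3.1",
         "vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L"},
    ],
}

for row in expand_cvss(record):
    print(row["cve_id"], row["cvss_version"], row["cvss_score"])
```

Each output row repeats the CVE-level fields and carries exactly one score, version, and vector.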


Before performing this operation, we should first **remove all rows that do not contain any CVSS data**.  
If a CVE has no `cvss_score`, `cvss_version`, or `cvss_vector`,  
it means that **no vulnerability scoring information is available** —  
and therefore, other related fields are likely missing as well.

➡️ These rows will be dropped to ensure we only work with complete and meaningful data.
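
Depending on how the column was loaded, an "empty" value may be `None`, `NaN`, the string `"[]"`, or an empty list. The following sketch shows one way to treat all of these uniformly; `has_cvss_data` is a hypothetical helper and assumes string values are JSON text.

```python
import json
import math


def has_cvss_data(value):
    """Return True when value holds at least one CVSS entry."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    if isinstance(value, str):
        try:
            value = json.loads(value)  # assumption: strings contain JSON
        except json.JSONDecodeError:
            return False
    try:
        return len(value) > 0
    except TypeError:
        # Scalars such as numbers carry no CVSS entries
        return False


for sample in (None, float("nan"), "[]", [], '[{"score": "5"}]', [{"score": "5"}]):
    print(repr(sample), "->", has_cvss_data(sample))
```

A mask such as `df["cvss_scores"].apply(has_cvss_data)` could then be used to keep only the rows that carry scoring data.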


In [15]:
# 1️⃣ Count the rows without CVSS scores (NaN, empty list, or the string "[]")
def is_missing_cvss(value):
    # Lists/arrays are checked by length; .str.strip() would return NaN for them
    if hasattr(value, "__len__") and not isinstance(value, str):
        return len(value) == 0
    return pd.isna(value) or str(value).strip() == "[]"

missing_mask = df["cvss_scores"].apply(is_missing_cvss)
print(f"Number of rows without CVSS scores: {missing_mask.sum()}")

# 2️⃣ Drop these rows directly in df
df.drop(df[missing_mask].index, inplace=True)

# 3️⃣ Check
print(f"Remaining rows after drop: {len(df)}")


Number of rows without CVSS scores: 0
Remaining rows after drop: 4783


In [16]:
import ast
import json

# Parse the cvss_scores column when it is stored as a string
def parse_cvss_scores(score_str):
    """Parse CVSS scores into a list of dicts"""
    # Already a sequence: check this first, because pd.isna() on a list or
    # array returns an array whose truth value is ambiguous (ValueError)
    if isinstance(score_str, (list, tuple, np.ndarray)):
        return list(score_str)
    if pd.isna(score_str) or score_str == '[]':
        return []
    try:
        # JSON text (double-quoted keys and values)
        return json.loads(score_str)
    except (TypeError, ValueError):
        pass
    try:
        # Python-literal text (single-quoted keys and values)
        return ast.literal_eval(score_str)
    except (ValueError, SyntaxError):
        return []

# Apply the parsing function
df['cvss_scores_parsed'] = df['cvss_scores'].apply(parse_cvss_scores)

# Collect the new rows here
expanded_rows = []

# Iterate over each row of the dataframe
for idx, row in df.iterrows():
    cvss_list = row['cvss_scores_parsed']
    
    # No CVSS scores: keep the original row
    if len(cvss_list) == 0:
        row_dict = row.drop('cvss_scores_parsed').to_dict()
        expanded_rows.append(row_dict)
    else:
        # Create one new row per CVSS score
        for cvss_score in cvss_list:
            row_dict = row.drop('cvss_scores_parsed').to_dict()
            # Replace cvss_scores with a single score
            row_dict['cvss_scores'] = json.dumps([cvss_score])
            # Optional: add individual columns to make the analysis easier
            row_dict['cvss_score'] = cvss_score.get('score', '')
            row_dict['cvss_version'] = cvss_score.get('version', '')
            row_dict['cvss_severity'] = cvss_score.get('severity', '')
            row_dict['cvss_vector'] = cvss_score.get('vector', '')
            row_dict['cvss_exploitability'] = cvss_score.get('exploitability_score', '')
            row_dict['cvss_impact'] = cvss_score.get('impact_score', '')
            
            expanded_rows.append(row_dict)

# Replace the original dataframe with the expanded one
df = pd.DataFrame(expanded_rows)

# Show the results
print(f"Number of rows after expansion: {len(df)}")
print("\nExample for CVE-2025-11608:")
print(df[df['cve_id'] == 'CVE-2025-11608'][['cve_id', 'cvss_version', 'cvss_score', 'cvss_severity']])

# The dataframe df now contains the duplicated rows
print("\nThe dataset 'df' has been updated with the duplicated rows")

In [149]:
df.drop(columns=["cvss_scores", "cvss_scores_parsed"], inplace=True, errors='ignore')

In [150]:
df.head(3)

Unnamed: 0,cve_id,title,description,published_date,last_modified,remotely_exploit,category,affected_products,cvss_score,cvss_version,cvss_severity,cvss_vector,cvss_exploitability,cvss_impact
0,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,Yes !,Injection,[],7.5,CVSS 2.0,HIGH,AV:N/AC:L/Au:N/C:P/I:P/A:P,10.0,6.4
1,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,Yes !,Injection,[],7.3,CVSS 3.1,HIGH,CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L,3.9,3.4
2,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,Yes !,Injection,[],6.9,CVSS 4.0,MEDIUM,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L/VA...,,


One thing stands out: for entries scored with CVSS 4.0, `cvss_exploitability` and `cvss_impact` are empty.

In [151]:
cols = ["cve_id", "cvss_version", "cvss_score", "cvss_severity", "cvss_impact", "cvss_exploitability"]
df.loc[df["cvss_version"] == "CVSS 4.0", cols].head()

Unnamed: 0,cve_id,cvss_version,cvss_score,cvss_severity,cvss_impact,cvss_exploitability
2,CVE-2025-11608,CVSS 4.0,6.9,MEDIUM,,
1915,CVE-2024-7830,CVSS 4.0,8.7,HIGH,,
1926,CVE-2024-41906,CVSS 4.0,6.3,MEDIUM,,
1930,CVE-2024-11050,CVSS 4.0,5.3,MEDIUM,,
1936,CVE-2024-8089,CVSS 4.0,5.3,MEDIUM,,


## Why `cvss_exploitability` Is Empty for CVSS 4.0

CVSS 4.0 no longer includes a separate “Exploitability” sub-score.

In earlier versions of CVSS (v2 and v3.x), the base score was derived from two sub-scores that were computed separately:

|CVSS Version|Score Structure|
|---|---|
|CVSS 2.0|Base Score = ((0.6 × Impact) + (0.4 × Exploitability) − 1.5) × f(Impact), with f(Impact) = 0 if Impact is 0 and 1.176 otherwise|
|CVSS 3.0 / 3.1|Base Score = Roundup(min(Impact + Exploitability, 10)) when Scope is Unchanged; the sum is multiplied by 1.08 when Scope is Changed|
|CVSS 4.0|Exploitability is no longer a standalone component|

In CVSS 4.0, the exploitability metrics (such as Attack Vector, Attack Complexity, Privileges Required, etc.) still exist,  
but they are integrated directly into the overall formula, rather than being summarized in a separate `exploitability_score` field.
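
As a sanity check on how the two sub-scores combine in the older versions, here is a minimal sketch of the CVSS 2.0 and 3.1 base-score equations as published by FIRST. The function names are my own. Because the sub-scores stored in the dataset are already rounded to one decimal, recomputed scores may occasionally differ from the published ones by 0.1.

```python
import math


def base_score_v2(impact, exploitability):
    """CVSS 2.0 base score from its two sub-scores."""
    f_impact = 0.0 if impact == 0 else 1.176
    return round(((0.6 * impact) + (0.4 * exploitability) - 1.5) * f_impact, 1)


def roundup(value):
    """CVSS 3.1 Roundup: smallest one-decimal number >= value."""
    int_input = round(value * 100000)
    if int_input % 10000 == 0:
        return int_input / 100000.0
    return (math.floor(int_input / 10000) + 1) / 10.0


def base_score_v3(impact, exploitability, scope_changed=False):
    """CVSS 3.x base score from its two sub-scores."""
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope_changed:
        total *= 1.08
    return roundup(min(total, 10))


# Sub-scores of CVE-2025-11608 as shown in the table above
print(base_score_v2(impact=6.4, exploitability=10.0))  # 7.5
print(base_score_v3(impact=3.4, exploitability=3.9))   # 7.3
```

Both results match the `cvss_score` values listed for that CVE. No comparable function exists for CVSS 4.0, which is why those two columns stay empty.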

## ⚖️ About Missing Exploitability and Impact Scores in CVSS 4.0

For now, we will **leave the `exploitability_score` and `impact_score` fields empty** for CVSS 4.0 entries.

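However, it is technically possible to **approximate** these values using weighted metrics extracted from the CVSS vector.  
An example approach could be the following, where the `*_score` columns and the weights are illustrative placeholders rather than values from the CVSS specification: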

```python
df["exploitability_proxy"] = (
    df["AV_score"] * 0.3 +
    df["AC_score"] * 0.25 +
    df["PR_score"] * 0.25 +
    df["UI_score"] * 0.2
)

df["impact_proxy"] = (
    df["VC_score"] * 0.4 +
    df["VI_score"] * 0.3 +
    df["VA_score"] * 0.3
)
```


In [152]:
df.head()

Unnamed: 0,cve_id,title,description,published_date,last_modified,remotely_exploit,category,affected_products,cvss_score,cvss_version,cvss_severity,cvss_vector,cvss_exploitability,cvss_impact
0,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,Yes !,Injection,[],7.5,CVSS 2.0,HIGH,AV:N/AC:L/Au:N/C:P/I:P/A:P,10.0,6.4
1,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,Yes !,Injection,[],7.3,CVSS 3.1,HIGH,CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L,3.9,3.4
2,CVE-2025-11608,code-projects E-Banking System POST Parameter ...,A security vulnerability has been detected in ...,2025-10-11 17:15:00,2025-10-11 17:15:00,Yes !,Injection,[],6.9,CVSS 4.0,MEDIUM,CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L/VA...,,
3,CVE-1999-0082,Tenable FTP Server Command Injection Vulnerabi...,The following products are affected byCVE-1999...,1988-11-11 05:00:00,2025-04-03 01:03:00,Yes !,,"[{""id"": ""1"", ""vendor"": ""Ftp"", ""product"": ""ftp""...",10.0,CVSS 2.0,HIGH,AV:N/AC:L/Au:N/C:C/I:C/A:C,10.0,10.0
4,CVE-1999-0095,Sendmail Command Injection Vulnerability,The following products are affected byCVE-1999...,1988-10-01 04:00:00,2025-04-03 01:03:00,Yes !,,"[{""id"": ""1"", ""vendor"": ""Eric_allman"", ""product...",10.0,CVSS 2.0,HIGH,AV:N/AC:L/Au:N/C:C/I:C/A:C,10.0,10.0


Now that the scores are in place, let's extract the information encoded in `cvss_vector`.

## 💡 Definition — CVSS Versions and Metric Mappings

<div style="border-radius:10px; border:#DEB887 solid; padding:15px; background-color:#f6f5f5; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Definition:</font></h3>

`cvss_version` refers to the version of the **CVSS standard** used to assess the severity of a vulnerability.

---

### 🔍 What is CVSS?

**CVSS (Common Vulnerability Scoring System)** is a standardized scoring system used in cybersecurity to measure the severity of vulnerabilities (CVE).  
It is managed by the **Forum of Incident Response and Security Teams (FIRST)**.

Each CVSS version defines:
- A mathematical formula to calculate a score from 0 to 10  
- Criteria (vectors) describing how the vulnerability can be exploited

---

### ⚙️ Main Versions

| Version | Year | Main Characteristics |
|:--------:|:----:|:---------------------|
| **CVSS 2.0** | 2007 | First widely used version; less precise for real-world exploitation contexts. |
| **CVSS 3.0** | 2015 | Better distinction between exploitability and impact; introduction of the “scope” concept. |
| **CVSS 3.1** | 2019 | Most widely used version; clarifies metric definitions (same formula as 3.0). |
| **CVSS 4.0** | 2023 | Next generation: adds the Attack Requirements metric and a Supplemental metric group, and separates vulnerable-system from subsequent-system impacts. |
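
The 0 to 10 score is usually reported together with a qualitative severity label, which is what `cvss_severity` holds. The sketch below maps a score to its label using the rating scale defined for CVSS 3.x and 4.0, and the three-level ranking that NVD applies to CVSS 2.0 (an NVD convention, not part of the v2 specification). `severity_label` is a hypothetical helper name.

```python
def severity_label(score, version):
    """Map a numeric base score to its qualitative severity rating."""
    score = float(score)
    if version == "CVSS 2.0":
        # NVD ranking for v2 scores: LOW / MEDIUM / HIGH only
        if score >= 7.0:
            return "HIGH"
        if score >= 4.0:
            return "MEDIUM"
        return "LOW"
    # Qualitative severity rating scale of CVSS 3.x and 4.0
    if score >= 9.0:
        return "CRITICAL"
    if score >= 7.0:
        return "HIGH"
    if score >= 4.0:
        return "MEDIUM"
    if score >= 0.1:
        return "LOW"
    return "NONE"


print(severity_label("5", "CVSS 2.0"))   # MEDIUM
print(severity_label(7.3, "CVSS 3.1"))   # HIGH
print(severity_label(6.9, "CVSS 4.0"))   # MEDIUM
```

These results agree with the severity values shown for the sample rows earlier in the notebook.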

---

### 🧩 CVSS Metric Mappings

Below are the **abbreviations and their meanings** for each CVSS version.  
These mappings are essential for parsing CVSS vectors into human-readable components.

---

#### 🟦 Common (CVSS 3.x & compatible)

```python
MAPS_COMMON = {
    "AV": {    # Attack Vector
        "N": "Network",
        "A": "Adjacent",
        "L": "Local",
        "P": "Physical"
    },
    "AC": {    # Attack Complexity
        "L": "Low",
        "H": "High"
    },
    "PR": {    # Privileges Required
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "UI": {    # User Interaction
        "N": "None",
        "R": "Required"
    },
    "S": {     # Scope
        "U": "Unchanged",
        "C": "Changed"
    },
    "C": {     # Confidentiality Impact
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "I": {     # Integrity Impact
        "N": "None",
        "L": "Low",
        "H": "High"
    },
    "A": {     # Availability Impact
        "N": "None",
        "L": "Low",
        "H": "High"
    }
}
```
#### 🟨 CVSS 2.0 Specific Mappings

```python
MAPS_V2 = {
    "AV": {  # Access Vector
        "N": "Network",
        "A": "Adjacent Network",
        "L": "Local"
    },
    "AC": {  # Access Complexity
        "L": "Low",
        "M": "Medium",
        "H": "High"
    },
    "Au": {  # Authentication
        "N": "None",
        "S": "Single",
        "M": "Multiple"
    },
    "C": {   # Confidentiality Impact
        "N": "None",
        "P": "Partial",
        "C": "Complete"
    },
    "I": {   # Integrity Impact
        "N": "None",
        "P": "Partial",
        "C": "Complete"
    },
    "A": {   # Availability Impact
        "N": "None",
        "P": "Partial",
        "C": "Complete"
    }
}
```

#### 🟥 CVSS 4.0 Additions & Conventions

```python
MAPS_V40 = {
    "AT": {  # Attack Requirements
        "N": "None",
        "P": "Present"
    },
    "UI": {  # User Interaction (v4.0 values override the 3.x ones)
        "N": "None",
        "P": "Passive",
        "A": "Active"
    },
    # Vulnerable System impacts (Confidentiality / Integrity / Availability)
    "VC": {"N": "None", "L": "Low", "H": "High"},
    "VI": {"N": "None", "L": "Low", "H": "High"},
    "VA": {"N": "None", "L": "Low", "H": "High"},
    # Subsequent System impacts (Confidentiality / Integrity / Availability)
    "SC": {"N": "None", "L": "Low", "H": "High"},
    "SI": {"N": "None", "L": "Low", "H": "High"},
    "SA": {"N": "None", "L": "Low", "H": "High"}
}
```

</div>

In [154]:
import pandas as pd
import re

# CVSS mappings
MAPS_COMMON = {
    "AV": {"N": "Network", "A": "Adjacent", "L": "Local", "P": "Physical"},
    "AC": {"L": "Low", "H": "High"},
    "PR": {"N": "None", "L": "Low", "H": "High"},
    "UI": {"N": "None", "R": "Required"},
    "S": {"U": "Unchanged", "C": "Changed"},
    "C": {"N": "None", "L": "Low", "H": "High"},
    "I": {"N": "None", "L": "Low", "H": "High"},
    "A": {"N": "None", "L": "Low", "H": "High"}
}

# CVSS 2.0 has no Physical access vector, and uses Low/Medium/High complexity
MAPS_V2 = {
    "AV": {"N": "Network", "A": "Adjacent Network", "L": "Local"},
    "AC": {"L": "Low", "M": "Medium", "H": "High"},
    "Au": {"N": "None", "S": "Single", "M": "Multiple"},
    "C": {"N": "None", "P": "Partial", "C": "Complete"},
    "I": {"N": "None", "P": "Partial", "C": "Complete"},
    "A": {"N": "None", "P": "Partial", "C": "Complete"}
}

# CVSS 4.0: UI takes None/Passive/Active; VC/VI/VA are Vulnerable System
# impacts and SC/SI/SA are Subsequent System impacts
MAPS_V40 = {
    "AT": {"N": "None", "P": "Present"},
    "UI": {"N": "None", "P": "Passive", "A": "Active"},
    "VC": {"N": "None", "L": "Low", "H": "High"},
    "VI": {"N": "None", "L": "Low", "H": "High"},
    "VA": {"N": "None", "L": "Low", "H": "High"},
    "SC": {"N": "None", "L": "Low", "H": "High"},
    "SI": {"N": "None", "L": "Low", "H": "High"},
    "SA": {"N": "None", "L": "Low", "H": "High"}
}

def parse_cvss_vector(vector_str, version):
    """
    Parse a CVSS vector and return a dictionary of decoded metrics
    """
    # Anything that is not a string (None, NaN, ...) has no metrics to decode
    if not isinstance(vector_str, str):
        return {}
    
    metrics = {}
    
    # Choose the mappings to use according to the version
    if version == "CVSS 2.0":
        maps = {**MAPS_V2}
    elif version in ("CVSS 3.0", "CVSS 3.1"):
        maps = {**MAPS_COMMON}
    elif version == "CVSS 4.0":
        maps = {**MAPS_COMMON, **MAPS_V40}
    else:
        maps = {**MAPS_COMMON}
    
    # Clean the vector (remove the CVSS:3.1/ prefix or similar)
    vector_str = re.sub(r'^CVSS:\d+\.\d+/', '', vector_str)
    
    # Parse the metric:value pairs
    pairs = vector_str.split('/')
    for pair in pairs:
        if ':' in pair:
            metric, value = pair.split(':', 1)
            metric = metric.strip()
            value = value.strip()
            
            # Look up the decoded value
            if metric in maps and value in maps[metric]:
                metrics[metric] = maps[metric][value]
            else:
                # Keep the raw value when no mapping exists
                metrics[metric] = value
    
    return metrics

def extract_cvss_metrics(df):
    """
    Extract the CVSS metrics and add them as columns to the DataFrame
    """
    # Work on a copy of the DataFrame
    df_result = df.copy()
    
    # Parse every vector
    parsed_metrics = [
        parse_cvss_vector(vector, version)
        for vector, version in zip(df_result['cvss_vector'], df_result['cvss_version'])
    ]
    
    # Collect all unique metric names
    all_metrics = set()
    for metrics in parsed_metrics:
        all_metrics.update(metrics.keys())
    
    # Create one column per metric
    for metric in sorted(all_metrics):
        column_name = f'cvss_metric_{metric}'
        df_result[column_name] = [metrics.get(metric, None) for metrics in parsed_metrics]
    
    return df_result

# Example usage: update the DataFrame directly

# Assumes 'df' is the existing DataFrame
df = extract_cvss_metrics(df)
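
Before trusting the decoded columns, it can help to check that each vector actually contains the mandatory base metrics of its version. The sketch below is one possible check, using the base metric names from the CVSS 2.0, 3.x, and 4.0 specifications; `missing_base_metrics` is a hypothetical helper, and the incomplete 4.0 vector in the example is deliberately constructed.

```python
import re

BASE_METRICS = {
    "CVSS 2.0": ["AV", "AC", "Au", "C", "I", "A"],
    "CVSS 3.0": ["AV", "AC", "PR", "UI", "S", "C", "I", "A"],
    "CVSS 3.1": ["AV", "AC", "PR", "UI", "S", "C", "I", "A"],
    "CVSS 4.0": ["AV", "AC", "AT", "PR", "UI",
                 "VC", "VI", "VA", "SC", "SI", "SA"],
}


def missing_base_metrics(vector, version):
    """List the mandatory base metrics that are absent from a vector string."""
    if version not in BASE_METRICS:
        raise ValueError(f"Unknown CVSS version: {version!r}")
    # Drop the 'CVSS:x.y/' prefix used from version 3.0 onwards
    body = re.sub(r'^CVSS:\d+\.\d+/', '', vector)
    present = {pair.split(':', 1)[0].strip()
               for pair in body.split('/') if ':' in pair}
    return [m for m in BASE_METRICS[version] if m not in present]


print(missing_base_metrics("AV:N/AC:L/Au:N/C:P/I:N/A:N", "CVSS 2.0"))
# []
print(missing_base_metrics("CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:L/VI:L", "CVSS 4.0"))
# ['VA', 'SC', 'SI', 'SA']
```

An empty list means the vector is complete for its version; anything else points to a truncated or malformed vector worth inspecting.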

In [1]:
df.head(10)
