# 🧭 Exploratory Data Analysis (EDA) and Preprocessing of the CVE Dataset

This is a practical, focused guide to **exploratory data analysis (EDA)** and **preprocessing** for our vulnerability (CVE) dataset.  
This notebook covers the **goals of the EDA**, the **quality checks**, the **cleaning steps**, **feature engineering** for analytics or machine learning, and **visualization ideas**.  

---

## 1️⃣ EDA goals (what we want to observe)

- **Data quality**: duplicates, missing values, inconsistent formats (dates, scores).
- **Distribution of vulnerabilities**: by `cvss_score`, `cvss_severity`, `vendor`, `product`, `category`.
- **Temporal patterns**: publication and modification trends; frequency per day, week, month.
- **Correlations**: between the CVSS score and exploitability/impact; between `attack_vector` and severity.
- **Text mining**: recurring themes in `title` and `description` (RCE, path traversal, etc.).
- **Analytical flags**: `is_remote`, `has_exploit`, `is_patched`, `is_high_risk`, etc.

---

## 2️⃣ Quick quality checks

- Check for **duplicates** on `cve_id`.
- Check that the dates are **parseable** (`published_date`, `last_modified`).
- Check that `cvss_score` is **numeric** and lies between **0 and 10**.
- Normalize case and categorical values (`Network` → `network`).
- Identify the **multi-valued fields** (`affected_vendors`, `affected_products`) and split them so they can be *exploded*.
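
A minimal sketch of the first checks on a hypothetical four-row sample (the values and the helper name `quality_report` are illustrative, not part of the dataset). Date parsing is left out here because it is handled in Step 3.

```python
import pandas as pd

# Hypothetical sample mimicking the raw CSV (every column read as a string).
raw = pd.DataFrame({
    "cve_id": ["CVE-2025-0001", "CVE-2025-0002", "CVE-2025-0002", "CVE-2025-0003"],
    "cvss_score": ["7.5", "11.2", "11.2", "n/a"],
    "attack_vector": ["Network", "network ", "network ", None],
})


def quality_report(df: pd.DataFrame) -> dict:
    """Count basic quality problems without modifying the frame."""
    score = pd.to_numeric(df["cvss_score"], errors="coerce")
    return {
        "duplicate_cve_ids": int(df["cve_id"].duplicated().sum()),
        # Values that were present but could not be read as numbers.
        "non_numeric_scores": int(score.isna().sum() - df["cvss_score"].isna().sum()),
        "scores_out_of_range": int((score.notna() & ~score.between(0, 10)).sum()),
    }


report = quality_report(raw)
print(report)

# Normalize case and surrounding whitespace of a categorical column.
raw["attack_vector"] = raw["attack_vector"].str.strip().str.lower()
```

Counting problems before fixing them keeps a record of what the cleaning steps changed.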

---

## 3️⃣ Detailed preprocessing (recommended pipeline)

### a) Loading and basic cleaning
- Read the CSV with `dtype=str` to avoid automatic type conversions.
- Strip surrounding whitespace and normalize to lowercase.
- Convert types: `float` for `cvss_score`, `datetime` for the dates.

### b) Dates
- Standardize `published_date` and `last_modified` (UTC or local time zone).
- Extract temporal variables: `year`, `month`, `week`, `day_of_week`, `age_days = today - published_date`.
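
The temporal variables can be sketched as follows, assuming the dates are already parsed. The two sample dates and the fixed `reference` timestamp are illustrative; a notebook run would more likely use `pd.Timestamp.now()`.

```python
import pandas as pd

# Hypothetical, already-parsed publication dates.
df = pd.DataFrame(
    {"published_date": pd.to_datetime(["2025-10-03 17:15", "2025-10-06 14:56"])}
)

# A fixed reference date keeps the example reproducible.
reference = pd.Timestamp("2025-10-10")

dates = df["published_date"]
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["week"] = dates.dt.isocalendar().week.astype(int)  # ISO week number
df["day_of_week"] = dates.dt.day_name()
df["age_days"] = (reference - dates).dt.days          # whole days only
print(df)
```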

### c) CVSS
- Convert `cvss_score` to float and create `severity_label` if absent, using the CVSS v3.x bands: None = 0.0; Low 0.1–3.9; Medium 4.0–6.9; High 7.0–8.9; Critical ≥ 9.0 (CVSS 2.0 has no Critical band: High covers 7.0–10.0).
- **Parse** `cvss_vector` (e.g. `CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H`) to extract the AV, AC, PR, UI, S, C, I, A metrics.
  - Without an external library: split on `/`, then split each part on `:` into a key and a value.
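
A minimal sketch of both steps with the standard library only. The function names `parse_cvss_vector` and `severity_label` are hypothetical, and the bands are those of the CVSS v3.x qualitative rating scale. CVSS 2.0 vectors carry no `CVSS:` prefix, so they produce no `version` key.

```python
def parse_cvss_vector(vector: str) -> dict:
    """Split 'CVSS:3.1/AV:N/...' into {'version': '3.1', 'AV': 'N', ...}."""
    metrics = {}
    for part in vector.split("/"):
        key, sep, value = part.partition(":")
        if not sep:
            continue  # skip malformed fragments that have no colon
        metrics["version" if key == "CVSS" else key] = value
    return metrics


def severity_label(score: float) -> str:
    """Map a numeric score to the CVSS v3.x qualitative rating."""
    if score == 0:
        return "None"
    if score < 4:
        return "Low"
    if score < 7:
        return "Medium"
    if score < 9:
        return "High"
    return "Critical"


parsed = parse_cvss_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H")
print(parsed["AV"], parsed["UI"], severity_label(8.8))
```

Returning a plain dictionary makes it easy to expand the result into columns, for example with `df["cvss_vector"].dropna().apply(parse_cvss_vector).apply(pd.Series)`.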

### d) Boolean columns / flags
- `is_remote = attack_vector.str.contains('Network', case=False, na=False)`
- `has_exploit = source_from_table.str.contains('exploit', case=False, na=False)`
- `is_patched`: `description` contains "fixed" / "version" → extract `fixed_version` where possible.
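
A sketch of `is_remote` and `is_patched` on hypothetical rows. The regular expression is an assumption about how fixes are worded (the word "fixed" followed by a dotted version number before the next period) and will miss other phrasings.

```python
import pandas as pd

# Hypothetical rows; the descriptions are invented for illustration.
df = pd.DataFrame({
    "attack_vector": ["Network", "Local", None],
    "description": [
        "Fixed in version 1.7.2 of the editor.",
        "A path traversal issue; no patch is available.",
        "This issue has been fixed in 2.0.1.",
    ],
})

# na=False turns missing attack vectors into False instead of NaN.
df["is_remote"] = df["attack_vector"].str.contains("network", case=False, na=False)

# Assumed wording: "fixed", then a dotted version number before the next period.
version_pattern = r"(?i)\bfixed\b[^.]*?\b(\d+(?:\.\d+)+)"
df["fixed_version"] = df["description"].str.extract(version_pattern, expand=False)
df["is_patched"] = df["fixed_version"].notna()
print(df[["is_remote", "fixed_version", "is_patched"]])
```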

### e) Multi-valued fields
- `affected_vendors` and `affected_products` are often separated by `,` or `;` → normalize, then *explode* for per-product or per-vendor analysis.
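
The split-then-explode step might look like this. The two rows are hypothetical, and the `regex=True` argument of `str.split` assumes pandas 1.4 or later.

```python
import pandas as pd

# Hypothetical rows with a multi-valued vendor field.
df = pd.DataFrame({
    "cve_id": ["CVE-2025-0001", "CVE-2025-0002"],
    "affected_vendors": ["Qnap; Anysphere", "Qnap"],
})

# Split on "," or ";" (with optional spaces), then emit one row per vendor.
vendors = (
    df.assign(vendor=df["affected_vendors"].str.split(r"\s*[,;]\s*", regex=True))
      .explode("vendor")
)
vendors["vendor"] = vendors["vendor"].str.strip().str.lower()
vendors = vendors[vendors["vendor"].notna() & (vendors["vendor"] != "")]

# Number of distinct vendors per CVE (the `num_vendors` feature).
num_vendors = vendors.groupby("cve_id")["vendor"].nunique()
print(vendors["vendor"].value_counts())
```

The exploded frame repeats each `cve_id`, so it suits per-vendor counts but should not replace the one-row-per-CVE frame used for score statistics.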

### f) Text (`title`, `description`)
- Cleaning (lowercasing; removal of HTML, URLs, and punctuation), tokenization, stop-word removal, lemmatization (spaCy / NLTK).
- TF-IDF representation or embeddings (Sentence-Transformers) for *topic modeling*.
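
A sketch of the cleaning step only, using the standard library. The stop-word set is a tiny illustrative list, tokenization is a plain whitespace split, and lemmatization and TF-IDF are deliberately left out.

```python
import html
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use the spaCy or NLTK lists.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "is", "via", "and"}


def clean_text(text: str) -> list[str]:
    """Lowercase, drop HTML tags, URLs and punctuation, then split into tokens."""
    text = html.unescape(text).lower()
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z0-9\s-]", " ", text)  # punctuation, hyphens kept
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean_text("<p>Path traversal in the upload API, see https://example.com/advisory</p>")
print(tokens)
print(Counter(tokens).most_common(3))
```

The cleaned tokens can be joined back with spaces and passed to `TfidfVectorizer`, which expects strings rather than token lists by default.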

### g) Missing values
- Imputation: median `cvss_score` per `category`; an `unknown` label for missing categories.
- Keep a mask of the imputed values (useful for quality tracking).
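
A sketch of the imputation with its mask, on hypothetical rows. The column name `cvss_score_imputed` and the fallback to the global median are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical rows with one missing score and one missing category.
df = pd.DataFrame({
    "category": ["Denial of Service", "Denial of Service", "Denial of Service", None],
    "cvss_score": [7.1, 5.3, None, 6.5],
})

# Record what was missing before filling anything.
df["cvss_score_imputed"] = df["cvss_score"].isna()

# Label missing categories first so they form their own group.
df["category"] = df["category"].fillna("unknown")

# Median per category, then the global median if a whole category has no score.
group_median = df.groupby("category")["cvss_score"].transform("median")
df["cvss_score"] = df["cvss_score"].fillna(group_median).fillna(df["cvss_score"].median())
print(df)
```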

### h) Encoding for ML
- One-hot or target encoding for the categorical variables (`attack_vector`, `privileges_required`, `user_interaction`, `scope`).
- Standardization of the continuous variables (`cvss_score`, `exploitability_score`, `impact_score`) with `StandardScaler`.
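
A pandas-only sketch of both steps on hypothetical rows. The standardization is written out by hand with the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses; the output column name `cvss_score_std` is illustrative.

```python
import pandas as pd

# Hypothetical rows.
df = pd.DataFrame({
    "attack_vector": ["network", "local", "network"],
    "cvss_score": [7.5, 5.0, 10.0],
})

# One-hot encode the categorical column (one 0/1 column per category).
encoded = pd.get_dummies(df, columns=["attack_vector"], dtype=int)

# Standardize to zero mean and unit variance (population std, ddof=0).
score = encoded["cvss_score"]
encoded["cvss_score_std"] = (score - score.mean()) / score.std(ddof=0)
print(encoded)
```

For a model that is evaluated on held-out data, the mean and standard deviation should be computed on the training split only, which is what fitting a `StandardScaler` on that split achieves.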

---

## 4️⃣ Useful features to create (examples)

- `severity_label` → Low / Medium / High / Critical  
- `vector_av`, `vector_ac`, `vector_pr`, `vector_ui`, `vector_scope`, `vector_c`, `vector_i`, `vector_a`  
- `is_remote` → bool  
- `days_since_published`  
- `num_vendors`, `num_products`  
- `description_len`, `title_len`  
- `has_patch_info` → bool  
- `exploitability_normalized` (0–1 scale)

---

## 5️⃣ Recommended visualizations (quick wins)

- Histogram of `cvss_score`
- Bar plot: top 20 vendors / products (after *explode*)
- Time series: number of CVEs published per week or month
- Correlation heatmap (`cvss_score`, `exploitability_score`, `impact_score`)
- Word cloud / TF-IDF: top tokens in `description` by `severity`
- Stacked bars: `attack_vector` vs `severity`
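
Two of these plots reduce to simple aggregations. The sketch below builds their inputs on hypothetical rows and leaves the drawing itself out.

```python
import pandas as pd

# Hypothetical rows.
df = pd.DataFrame({
    "published_date": pd.to_datetime(["2025-09-29", "2025-10-03", "2025-10-06", "2025-10-07"]),
    "attack_vector": ["network", "network", "local", "network"],
    "cvss_severity": ["HIGH", "MEDIUM", "HIGH", "HIGH"],
})

# CVEs published per week (weeks end on Sunday, the pandas "W" default).
weekly = df.set_index("published_date").resample("W").size()

# Counts of attack_vector by severity: the input of a stacked bar chart.
stacked = pd.crosstab(df["attack_vector"], df["cvss_severity"])

print(weekly)
print(stacked)
```

With Matplotlib available, `weekly.plot()` and `stacked.plot(kind="bar", stacked=True)` would draw the time series and the stacked bars.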

---

🧩 **Next step**: import the required libraries (pandas, NumPy, Matplotlib, Seaborn, etc.) and load the raw dataset.


### Step 1 — Import Libraries and Configuration
---

In [40]:
# --- Data manipulation and analysis
import pandas as pd
import numpy as np

# --- Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# --- Text processing
import re
import string

# --- Preprocessing and machine learning utilities
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Date and time
from datetime import datetime, timedelta

# --- pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 120)
pd.set_option('display.float_format', '{:.2f}'.format)

# --- Plot style
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['axes.titlesize'] = 13
plt.rcParams['axes.labelsize'] = 11

### Step 2 : Loading and Initial Inspection of the Dataset
---

In [52]:
# Load the raw dataset
file_path = "output/cve_detailed_raw_v1.csv"
df = pd.read_csv(file_path, dtype=str, encoding='utf-8', low_memory=False)

print("Dataset loaded successfully!")
print(f"Dataset dimensions: {df.shape[0]} rows × {df.shape[1]} columns\n")

Dataset loaded successfully!
Dataset dimensions: 500 rows × 26 columns



In [53]:
# Preview of the first rows
display(df.head(5))

Unnamed: 0,cve_id,title,description,cvss_score,cvss_severity,cvss_version,cvss_vector,exploitability_score,impact_score,published_date,last_modified,remotely_exploit,source,source_from_table,category,affected_vendors,affected_products,attack_vector,attack_complexity,privileges_required,user_interaction,scope,confidentiality_impact,integrity_impact,availability_impact,url
0,CVE-2025-61590,Cursor is vulnerable to RCE via .code-workspac...,The following products are affected byCVE-2025...,7.5,HIGH,CVSS 3.1,CVSS:3.1/AV:N/AC:H/PR:L/UI:N/S:U/C:H/I:H/A:H,1.6,5.9,"Oct. 3, 2025, 5:15 p.m.","Oct. 6, 2025, 2:56 p.m.",Yes !,[email protected],[email protected],Misconfiguration,Anysphere,cursor,Network,High,Low,,Unchanged,High,High,High,https://cvefeed.io/vuln/detail/CVE-2025-61590
1,CVE-2025-61591,Cursor CLI's Cursor Agent MCP OAuth2 Communica...,Cursor is a code editor built for programming ...,8.8,HIGH,CVSS 3.1,CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H,2.8,5.9,"Oct. 3, 2025, 5:15 p.m.","Oct. 6, 2025, 2:56 p.m.",Yes !,[email protected],[email protected],Authentication,Anysphere,cursor,Network,Low,,Required,Unchanged,High,High,High,https://cvefeed.io/vuln/detail/CVE-2025-61591
2,CVE-2025-33034,Qsync Central,A path traversal vulnerability has been report...,6.5,MEDIUM,CVSS 3.1,CVSS:4.0/AV:N/AC:L/AT:N/PR:L/UI:N/VC:L/VI:N/VA...,,,"Oct. 3, 2025, 6:15 p.m.","Oct. 7, 2025, 3:04 p.m.",Yes !,[email protected],[email protected],Path Traversal,Qnap,qsync_central,Network,Low,Low,,Unchanged,High,,,https://cvefeed.io/vuln/detail/CVE-2025-33034
3,CVE-2025-33039,Qsync Central,An allocation of resources without limits or t...,7.1,HIGH,CVSS 4.0,CVSS:4.0/AV:N/AC:L/AT:N/PR:L/UI:N/VC:N/VI:N/VA...,,,"Oct. 3, 2025, 6:15 p.m.","Oct. 7, 2025, 3:01 p.m.",Yes !,[email protected],[email protected],Denial of Service,Qnap,qsync_central,Network,Low,Low,,Unchanged,,,High,https://cvefeed.io/vuln/detail/CVE-2025-33039
4,CVE-2025-33040,Qsync Central,An allocation of resources without limits or t...,7.1,HIGH,CVSS 4.0,CVSS:4.0/AV:N/AC:L/AT:N/PR:L/UI:N/VC:N/VI:N/VA...,,,"Oct. 3, 2025, 6:15 p.m.","Oct. 7, 2025, 3 p.m.",Yes !,[email protected],[email protected],Denial of Service,Qnap,qsync_central,Network,Low,Low,,Unchanged,,,High,https://cvefeed.io/vuln/detail/CVE-2025-33040


In [54]:
# General information about the columns
print("Column type information:")
df.info()

Column type information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   cve_id                  500 non-null    object
 1   title                   494 non-null    object
 2   description             500 non-null    object
 3   cvss_score              500 non-null    object
 4   cvss_severity           464 non-null    object
 5   cvss_version            464 non-null    object
 6   cvss_vector             464 non-null    object
 7   exploitability_score    184 non-null    object
 8   impact_score            184 non-null    object
 9   published_date          500 non-null    object
 10  last_modified           500 non-null    object
 11  remotely_exploit        500 non-null    object
 12  source                  500 non-null    object
 13  source_from_table       464 non-null    object
 14  category         

### Step 3 : Basic Cleaning and Type Conversions
---

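- The first thing we noticed is that `published_date` and `last_modified` are stored as `object` (strings) and need to be converted to `datetime`. The raw values look like `Oct. 3, 2025, 5:15 p.m.`, sometimes without minutes (`3 p.m.`). The `%p` directive does not match `p.m.`, so a fixed format with `errors="coerce"` would silently turn the values into `NaT`. The strings are therefore normalized before parsing.

In [44]:
def parse_feed_date(col):
    # "p.m." is not matched by %p: map to AM/PM, then drop the dot in "Oct."
    cleaned = (col.str.replace("a.m.", "AM", regex=False)
                  .str.replace("p.m.", "PM", regex=False)
                  .str.replace(".", "", regex=False))
    # format="mixed" needs pandas >= 2.0; unparseable values become NaT
    return pd.to_datetime(cleaned, format="mixed", errors="coerce")

df["published_date"] = parse_feed_date(df["published_date"])
df["last_modified"] = parse_feed_date(df["last_modified"])
print(df[["published_date", "last_modified"]].dtypes)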

published_date    datetime64[ns]
last_modified     datetime64[ns]
dtype: object


- The second thing we noticed is that the numeric scores (`cvss_score`, `exploitability_score`, and `impact_score`) are stored as `object`. They should be converted to `float`, with missing values and any non-numeric entries (such as a stray `'Unchanged'`) becoming `NaN`.


In [57]:
# --- Exploring the different types of values in 'cvss_score', 'exploitability_score', and 'impact_score'

cols_to_explore = ["cvss_score", "exploitability_score", "impact_score"]

print("Unique value types in score columns:\n" + "-"*50)
for col in cols_to_explore:
    unique_values = df[col].unique()
    print(f"\n▶ {col}:")
    print(unique_values)
    print(f"Total unique values: {len(unique_values)}")
    print("-"*50)

Unique value types in score columns:
--------------------------------------------------

▶ cvss_score:
['7.5' '8.8' '6.5' '7.1' '9.8' '7.6' '7.3' '6.9' '7.2' '5.1' '4.6' '4.8'
 '8.5' '9.9' '6.1' '2.3' '8.0' '0.0' '8.6' '2.5' '7.7' '6.6' '5.4' '5.3'
 '5.9' '4.3' '8.1' '6.4' '5.5' '7.8' '6.3' '5.0' '5.8' '9.0' '8.2' '8.4'
 '3.8' '2.7' '4.5' '8.7' '5.7' '6.0' '4.7' '8.9' '9.2' '9.3' '9.6' '8.3'
 '10.0' '9.1' '9.4' '3.6' '2.2' '4.9' '3.5' '7.9' '7.4' '6.7' '4.2' '3.7'
 '4.4' '1.0' '9.5' '3.3' '5.6']
Total unique values: 65
--------------------------------------------------

▶ exploitability_score:
['1.6' '2.8' nan '3.9' '2.3' '1.3' '1.8' '2.1' '1.7' '3.1' '1.0' '2.5'
 '2.2' '0.3' '0.6' '0.9' '1.2' '0.5' '1.4' '0.8']
Total unique values: 20
--------------------------------------------------

▶ impact_score:
['5.9' nan '2.7' '5.2' '2.5' '4.7' '6.0' '1.4' '3.6' '3.4' '5.5' '5.3'
 '4.2' '5.8' '4.0']
Total unique values: 15
--------------------------------------------------


In [59]:
# --- Converting score columns to numeric types
# Non-numeric values will be converted to NaN

df["cvss_score"] = pd.to_numeric(df["cvss_score"], errors="coerce")
df["exploitability_score"] = pd.to_numeric(df["exploitability_score"], errors="coerce")
df["impact_score"] = pd.to_numeric(df["impact_score"], errors="coerce")

# Verify the resulting dtypes
print(df[["cvss_score", "exploitability_score", "impact_score"]].dtypes)

cvss_score              float64
exploitability_score    float64
impact_score            float64
dtype: object


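- Another issue: some values in `cvss_version` and `cvss_vector` are missing. `df.info()` reports 464 non-null values out of 500 for both columns, so 36 rows are affected. The counts below were captured after those rows had been dropped (the cell was re-run later), which is why they sum to 464 and show no `NaN` bucket.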

In [69]:
df["cvss_version"].value_counts(dropna=False)

cvss_version
CVSS 3.1    293
CVSS 4.0    104
CVSS 2.0     63
CVSS 3.0      4
Name: count, dtype: int64


  * Let's first understand what these CVSS elements represent.

<div style="border-radius:10px; border:#DEB887 solid; padding:15px; background-color:#f6f5f5; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Definition:</font></h3>

`cvss_version` refers to the version of the **CVSS standard** used to assess the severity of a vulnerability.

---

### 🔍 What is CVSS?

**CVSS (Common Vulnerability Scoring System)** is a standardized scoring system used in cybersecurity to measure the severity of vulnerabilities (CVE).  
It is managed by the **Forum of Incident Response and Security Teams (FIRST)**.

Each CVSS version defines:
- A mathematical formula to calculate a score from 0 to 10  
- Criteria (vectors) describing how the vulnerability can be exploited

---

### ⚙️ Main Versions

| Version | Year | Main Characteristics |
|:--------:|:----:|:---------------------|
| **CVSS 2.0** | 2007 | First widely used version; less precise for real-world exploitation contexts. |
| **CVSS 3.0** | 2015 | Better distinction between exploitability and impact; introduction of the “scope” concept. |
| **CVSS 3.1** | 2019 | Most widely used version; clarifies metric definitions (same formula as 3.0). |
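| **CVSS 4.0** | 2023 | Latest generation: adds the Attack Requirements (AT) base metric and a Supplemental metric group, renames Temporal metrics to Threat metrics, and replaces Scope with separate impact metrics for the vulnerable and subsequent systems. |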

---

### 🧩 Example — CVSS 3.1 Vector
**CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H**

This vector describes:

- **AV:N** — Attack Vector: Network  
- **AC:L** — Attack Complexity: Low  
- **PR:N** — Privileges Required: None  
- **UI:N** — User Interaction: None  
- **S:U** — Scope: Unchanged  
- **C:H / I:H / A:H** — High impacts on Confidentiality, Integrity, and Availability  

→ **Score: 9.8 (CRITICAL)**

</div>
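
To make the 9.8 above concrete, here is a sketch of the CVSS v3.1 base score calculation. The weights and equations follow the published v3.1 specification as far as I can reconstruct them; the function names are hypothetical, and only base metrics are handled (no temporal or environmental metrics).

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}


def roundup(value: float) -> float:
    """Round up to one decimal place, avoiding floating-point artifacts."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score of a 'CVSS:3.1/...' vector string."""
    m = dict(part.split(":") for part in vector.split("/")[1:])
    changed = m["S"] == "C"
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total if changed else total, 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```

For the example vector, the impact sub-score is 6.42 × (1 − 0.44³) ≈ 5.87 and the exploitability sub-score is 8.22 × 0.85 × 0.77 × 0.85 × 0.85 ≈ 3.89; their sum, about 9.76, rounds up to 9.8. The same function returns 7.5 and 8.8 for the vectors of the first two rows in the preview, matching their `cvss_score` values.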

* With this definition in mind, we check whether the entries missing a `cvss_version` are also incomplete in the related columns. The cell below inspects the first five such rows: apart from the identifiers, text, score, dates, and URL, every CVSS-related column is empty.

In [66]:
df[df["cvss_version"].isna()].head(5).isna().sum()

cve_id                    0
title                     0
description               0
cvss_score                0
cvss_severity             5
cvss_version              5
cvss_vector               5
exploitability_score      5
impact_score              5
published_date            0
last_modified             0
remotely_exploit          0
source                    0
source_from_table         5
category                  5
affected_vendors          5
affected_products         5
attack_vector             5
attack_complexity         5
privileges_required       5
user_interaction          5
scope                     5
confidentiality_impact    5
integrity_impact          5
availability_impact       5
url                       0
dtype: int64
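
Because the check above looks at five rows only, a variant over every affected row is a useful confirmation before dropping anything. The sketch below uses a hypothetical four-row frame (with placeholder vector strings) to show the idea: the share of missing values per column among all rows lacking a version.

```python
import pandas as pd

# Hypothetical rows; two of them have no CVSS version.
df = pd.DataFrame({
    "cve_id": ["CVE-A", "CVE-B", "CVE-C", "CVE-D"],
    "cvss_version": ["CVSS 3.1", None, None, "CVSS 4.0"],
    "cvss_vector": ["CVSS:3.1/...", None, None, "CVSS:4.0/..."],
    "category": ["Authentication", None, "Path Traversal", None],
})

no_version = df["cvss_version"].isna()

# Share of missing values per column among ALL rows lacking a version.
missing_share = df[no_version].isna().mean()
print(missing_share)
```

A share of 1.0 means the column is empty for every such row; a lower share, as for `category` in this toy frame, signals information that dropping the rows would discard.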

In [67]:
df = df.dropna(subset=["cvss_version"]).reset_index(drop=True)

=> These rows lack essential information (the `cvss_vector`, the sub-scores, and the impact metrics), so we drop every record where `cvss_version` is missing to keep the dataset consistent.

In [68]:
df["cvss_version"].value_counts(dropna=False)

cvss_version
CVSS 3.1    293
CVSS 4.0    104
CVSS 2.0     63
CVSS 3.0      4
Name: count, dtype: int64

## 🧩 Notes — End of EDA (Version 1)

After this clarification, we will keep this notebook as **Version 1 of the Exploratory Data Analysis (EDA)**.  
The next step will be to **refactor the scrapers** to improve data completeness and structure.

Currently, the scrapers only extract the `cvss_vector`, without directly collecting its sub-metrics such as:

```json
{
  "AV": "Attack Vector",
  "AC": "Attack Complexity",
  "PR": "Privileges Required",
  "UI": "User Interaction",
  "S": "Scope",
  "C": "Confidentiality Impact",
  "I": "Integrity Impact",
  "A": "Availability Impact"
}
```

However, since these fields can be derived directly from the CVSS vector, there is no need to scrape them separately.
This approach offers several advantages:

- ⚡ **Simpler scraping logic**: fewer selectors and fewer parsing errors.
- 🧱 **More consistent data**: no mismatches between the vector and the individual metric values.
- 🚀 **Easier future updates**: if the CVSS format evolves (e.g., CVSS 4.0), only the parsing function needs modification.
- 📉 **Reduced redundancy**: smaller data files and faster processing.

Additionally, the scrapers have been updated to collect all available CVSS versions on each CVE page (2.0, 3.0, 3.1, 4.0, etc.),
which will allow future comparative analysis across versions in the final dashboard or analytics module.

✅ In short, this notebook serves as a prototype (EDA v1) — data structure and cleaning methods are validated,
while the next version will focus on full-scale scraping and advanced analysis.