# EDA and neutralization of XSS payloads from the Kaggle dataset

In this notebook we work with a **Cross-Site Scripting (XSS) dataset from Kaggle**, whose goal is to distinguish between:

- **Benign text** (mostly real HTML from Wikipedia).
- **Malicious payloads** designed to exploit XSS vulnerabilities.

This notebook has two purposes:

1. **Exploring and understanding the dataset (EDA):**  
   - How many examples there are and how the labels are distributed.  
   - What the actual text of each class looks like.  
   - Which concrete patterns appear in the malicious payloads.

2. **Building a _neutralization_ function for XSS payloads:**  
   - Starting from the real data, we define **families of XSS patterns**.  
   - We design functions that **safely remove or replace executable JavaScript**  
     while **preserving the overall structure of the HTML text**.  
   - We generate a **clean version of the dataset** to use in a later mitigation benchmark.
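
To make the second goal concrete, here is a minimal sketch of what a neutralization step could look like. The function name `neutralize` and the three patterns are assumptions made for illustration, not the rules this notebook ends up with, and regular expressions are a blunt tool for HTML: a production sanitizer would parse the markup instead.

```python
import re

# Hypothetical patterns, one per payload family (illustrative only).
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
EVENT_ATTR = re.compile(
    r"""\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)
JS_URI = re.compile(r"javascript\s*:", re.IGNORECASE)


def neutralize(html_text: str) -> str:
    """Strip executable parts while leaving the surrounding markup in place."""
    out = SCRIPT_BLOCK.sub("", html_text)  # drop whole <script>...</script> blocks
    out = EVENT_ATTR.sub("", out)          # drop on*=... attributes (quoted or bare)
    out = JS_URI.sub("blocked:", out)      # defuse javascript: URIs
    return out


print(neutralize('<tt onmouseover="alert(1)">test</tt>'))
print(neutralize("<a onblur=alert(1) tabindex=1 id=x></a><input autofocus>"))
```

The tag names and the remaining attributes survive, so the cleaned text keeps its overall HTML shape.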


In [1]:
import os
import re
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_colwidth", 200)
plt.rcParams["figure.figsize"] = (8, 4)

# Detect project base directory
cwd = Path.cwd()


def is_project_root(path: Path) -> bool:
    """
    Detect the project root by looking for key markers.
    A directory counts as the project root if it contains
    at least one of:
    - requirements.txt
    - app/
    - notebooks/
    """
    markers = ["requirements.txt", "app", "notebooks"]
    return any((path / m).exists() for m in markers)


if is_project_root(cwd):
    # Running from the project root itself
    BASE_DIR = cwd
elif cwd.name in {"notebooks", "src"} and is_project_root(cwd.parent):
    # Running from notebooks/ or src/ → go one level up
    BASE_DIR = cwd.parent
else:
    # Fallback: use current directory (for ad-hoc runs)
    BASE_DIR = cwd

NOTEBOOKS_DIR = BASE_DIR / "notebooks"
NB_DATA_DIR = NOTEBOOKS_DIR / "data"
OUTPUT_DIR = NB_DATA_DIR / "data_processed"

NB_DATA_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

DATA_PATH_KAGGLE = NB_DATA_DIR / "XSS_dataset.csv"

print("CWD              :", cwd)
print("BASE_DIR         :", BASE_DIR)
print("NOTEBOOKS_DIR    :", NOTEBOOKS_DIR)
print("NB_DATA_DIR      :", NB_DATA_DIR)
print("OUTPUT_DIR       :", OUTPUT_DIR)
print("DATA_PATH_KAGGLE :", DATA_PATH_KAGGLE)


CWD              : d:\Archivos de Usuario\Documents\xss-cookie\notebooks
BASE_DIR         : d:\Archivos de Usuario\Documents\xss-cookie
NOTEBOOKS_DIR    : d:\Archivos de Usuario\Documents\xss-cookie\notebooks
NB_DATA_DIR      : d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data
OUTPUT_DIR       : d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data\data_processed
DATA_PATH_KAGGLE : d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data\XSS_dataset.csv
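
The detection above only checks the current directory and its parent. As an alternative, a sketch that walks up through every ancestor is shown here; `find_project_root` and its default marker are hypothetical names, not part of this project.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_project_root(
    start: Path, markers: Iterable[str] = ("requirements.txt",)
) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains every marker."""
    markers = tuple(markers)
    for candidate in (start, *start.parents):
        if all((candidate / m).exists() for m in markers):
            return candidate
    return None  # no ancestor matched; the caller decides on a fallback


print(find_project_root(Path.cwd()))
```

Requiring all markers (rather than any) is a stricter choice; with a single marker the two behave the same.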


## 1. Loading and describing the dataset

In this section we load the original Kaggle file and check its basic structure.

- The expected file is **`XSS_dataset.csv`**, located in `notebooks/data/`.  
- The main columns we will use are:

  - **`Sentence`**: a text string that can be either:
    - a benign HTML fragment (for example, real text from Wikipedia), or  
    - an XSS payload designed to execute JavaScript.
  - **`Label`**: a binary label indicating the class of each example:
    - `0` → benign  
    - `1` → malicious

In the following cells we:

1. Load the CSV into a pandas `DataFrame`.  
2. Show its shape (number of rows and columns).  
3. Display a few initial rows to get a first idea of the content.

In [5]:
# Load raw Kaggle dataset
df_raw = pd.read_csv(DATA_PATH_KAGGLE)

print("Shape (rows, columns):", df_raw.shape)
df_raw.head()

Shape (rows, columns): (13686, 3)


Unnamed: 0.1,Unnamed: 0,Sentence,Label
0,0,"<li><a href=""/wiki/File:Socrates.png"" class=""image""><img alt=""Socrates.png"" src=""//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Socrates.png/18px-Socrates.png"" decoding=""async"" width=""18"" hei...",0
1,1,"<tt onmouseover=""alert(1)"">test</tt>",1
2,2,"\t </span> <span class=""reference-text"">Steering for the 1995 ""<a href=""/wiki/History_of_autonomous_cars#1990s"" class=""mw-redirect"" title=""History of autonomous cars"">No Hands Across America </a>""...",0
3,3,"\t </span> <span class=""reference-text""><cite class=""citation web""><a rel=""nofollow"" class=""external text"" href=""https://www.mileseducation.com/finance/artificial_intelligence"">""Miles Education | ...",0
4,4,"\t </span>. <a href=""/wiki/Digital_object_identifier"" title=""Digital object identifier"">doi </a>:<a rel=""nofollow"" class=""external text"" href=""https://doi.org/10.1016%2FS0921-8890%2805%2980025-9"">...",0


The result:

- **13686 rows** and **3 columns** (`Unnamed: 0`, `Sentence`, `Label`); `Unnamed: 0` is a leftover index column that is dropped later.  

This means we have **13,686 text examples** labeled as benign or malicious.

Looking at the first rows, we see that:

- Many benign texts are **real HTML fragments from Wikipedia** (lists, citations, links, etc.).  
- The malicious texts include **tags with event-handler attributes** such as `onmouseover="alert(1)"`,
  typical of XSS payloads.

This combination makes the dataset interesting because it mixes **real HTML** with
**payloads specifically crafted to attack browsers**.

In [12]:
# 1. Inspect the basic structure of the dataset
print("=== DataFrame info ===")
df_raw.info()
print()

print("=== Primeras filas (crudo) ===")
print(df_raw.head(10))
print()

# 2. Look at the label distribution
print("=== Distribución de Label ===")
print(df_raw["Label"].value_counts(dropna=False))
print()

# 3. Drop the leftover index column if present
cols_to_drop = [c for c in df_raw.columns if c.lower().startswith("unnamed")]
df = df_raw.drop(columns=cols_to_drop)
print("Columnas después de eliminar posibles índices:", df.columns.tolist())
print()

# 4. Check the unique values of Label
print("Valores únicos en Label:", df["Label"].unique())
print()

# 5. Review some benign and malicious examples
print("=== Ejemplos benignos (Label = 0) ===")
print(df[df["Label"] == 0].head(5)[["Sentence", "Label"]])
print()

print("=== Ejemplos maliciosos (Label = 1) ===")
print(df[df["Label"] == 1].head(5)[["Sentence", "Label"]])
print()

# 6. Short summary of text lengths (to see how long they are)
df["len_chars"] = df["Sentence"].astype(str).str.len()
print("=== Estadísticas de longitud de Sentence (caracteres) ===")
print(df["len_chars"].describe())


=== DataFrame info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13686 entries, 0 to 13685
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  13686 non-null  int64 
 1   Sentence    13686 non-null  object
 2   Label       13686 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 320.9+ KB

=== Primeras filas (crudo) ===
   Unnamed: 0  \
0           0   
1           1   
2           2   
3           3   
4           4   
5           5   
6           6   
7           7   
8           8   
9           9   

                                                                                                                                                                                                  Sentence  \
0  <li><a href="/wiki/File:Socrates.png" class="image"><img alt="Socrates.png" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Socrates.png/18px-Socrates.png" decoding="async" width=

In [3]:
# --- Rebuild df (in case the kernel has restarted) ---

# Reload the raw dataset
df_raw = pd.read_csv(DATA_PATH_KAGGLE)

# Drop "Unnamed"-style columns
cols_to_drop = [c for c in df_raw.columns if c.lower().startswith("unnamed")]
df = df_raw.drop(columns=cols_to_drop)

# ----------------------------------------------------------
# Payload family analysis
# ----------------------------------------------------------

import re

def detect_family(text):
    t = text.lower()
    families = []

    # Script tags
    if "<script" in t or "</script" in t:
        families.append("script_tag")

    # Event handlers
    if re.search(r"on[a-z]+\s*=", t):
        families.append("event_handler")

    # javascript: URI
    if "javascript:" in t:
        families.append("javascript_uri")

    # img tags
    if "<img" in t:
        families.append("image_tag")

    # iframe tags
    if "<iframe" in t:
        families.append("iframe_tag")

    # svg tags
    if "<svg" in t:
        families.append("svg_tag")

    # simple heuristic for polyglots
    if "<script" not in t and "<img" not in t and "on" in t and ";" in t:
        families.append("maybe_polyglot")

    return families if families else ["other"]


# Apply detection
df["families"] = df["Sentence"].astype(str).apply(detect_family)

# Frequencies
from collections import Counter
family_counts = Counter(f for row in df["families"] for f in row)

print("=== Conteo de familias detectadas ===")
print(family_counts)

print("\n=== Ejemplos de cada familia ===")
for fam in ["script_tag", "event_handler", "javascript_uri", "image_tag",
            "iframe_tag", "svg_tag", "maybe_polyglot", "other"]:
    subset = df[df["families"].apply(lambda lst: fam in lst)]
    print(f"\n--- {fam} (n={len(subset)}) ---")
    print(subset.head(3)[["Sentence", "Label"]])


=== Conteo de familias detectadas ===
Counter({'event_handler': 7184, 'other': 5597, 'maybe_polyglot': 1446, 'svg_tag': 200, 'image_tag': 152, 'script_tag': 118, 'iframe_tag': 57, 'javascript_uri': 50})

=== Ejemplos de cada familia ===

--- script_tag (n=118) ---
                                                    Sentence  Label
87              <script onmouseover="alert(1)">test</script>      1
117          <script src="_static/js/hoverxref.js"></script>      0
247             <script onmousemove="alert(1)">test</script>      1

--- event_handler (n=7184) ---
                                                    Sentence  Label
1                       <tt onmouseover="alert(1)">test</tt>      1
11  <a onblur=alert(1) tabindex=1 id=x></a><input autofocus>      1
12   <col draggable="true" ondragenter="alert(1)">test</col>      1

--- javascript_uri (n=50) ---
                                                                            Sentence  \
64                                      
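
The event-handler pattern `on[a-z]+\s*=` used in `detect_family` has no left boundary, so it can also fire inside ordinary attribute names such as `content=`. The sketch below compares it with an assumed stricter variant (`STRICT` is a hypothetical name) that only accepts `on...=` when it follows whitespace, a quote, or a slash. Both patterns expect lowercased text, as in `detect_family`.

```python
import re

LOOSE = re.compile(r"on[a-z]+\s*=")
# Assumed stricter variant: "on..." must start an attribute name.
STRICT = re.compile(r"""(?<=[\s"'/])on[a-z]+\s*=""")

attack = '<tt onmouseover="alert(1)">test</tt>'
benign = '<meta name="description" content="ai article">'

for text in (attack, benign):
    print(bool(LOOSE.search(text)), bool(STRICT.search(text)), text)
```

Swapping the pattern would change the family counts reported in this notebook, so this is a possible refinement rather than a description of the recorded runs.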

In [4]:
from collections import Counter

# Expand families into rows (one row per (sample, family) pair)
records = []
for idx, row in df.iterrows():
    for fam in row["families"]:
        records.append({
            "index": idx,
            "family": fam,
            "Label": row["Label"],
        })

fam_df = pd.DataFrame(records)

print("=== Primeras filas de fam_df (estructura expandida) ===")
print(fam_df.head(10))
print()

# Summary table: counts per family and Label
summary = fam_df.groupby(["family", "Label"]).size().unstack(fill_value=0)
summary.columns = [f"Label_{c}" for c in summary.columns]
summary["total"] = summary.sum(axis=1)
summary["prop_malicious"] = summary["Label_1"] / summary["total"]

print("=== Resumen por familia y Label ===")
print(summary.sort_values("prop_malicious", ascending=False))
print()

# Count families per sample (to gauge how much overlap there is)
df["n_families"] = df["families"].apply(len)
print("=== Distribución de número de familias por fila ===")
print(df["n_families"].value_counts().sort_index())
print()

# Examples of rows with multiple families (if any)
multi = df[df["n_families"] > 1].head(5)
print("=== Ejemplos con múltiples familias detectadas ===")
print(multi[["Sentence", "Label", "families"]])


=== Primeras filas de fam_df (estructura expandida) ===
   index          family  Label
0      0       image_tag      0
1      1   event_handler      1
2      2           other      0
3      3  maybe_polyglot      0
4      4           other      0
5      5           other      0
6      6           other      0
7      7           other      0
8      8           other      0
9      9           other      0

=== Resumen por familia y Label ===
                Label_0  Label_1  total  prop_malicious
family                                                 
iframe_tag            0       57     57        1.000000
javascript_uri        0       50     50        1.000000
svg_tag               0      200    200        1.000000
event_handler         9     7175   7184        0.998747
script_tag           32       86    118        0.728814
image_tag            51      101    152        0.664474
maybe_polyglot      731      715   1446        0.494467
other              5492      105   5597        0.018760
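
The same per-family summary can be built without an explicit `iterrows` loop. The following sketch uses `DataFrame.explode` and `pd.crosstab` on a small toy frame; the toy data is made up for illustration and only mirrors the column names used above.

```python
import pandas as pd

# Toy data with the same columns as the notebook's df (values are invented).
toy = pd.DataFrame({
    "Label": [1, 0, 1],
    "families": [["event_handler", "svg_tag"], ["other"], ["event_handler"]],
})

# One row per (sample, family) pair.
fam_long = toy.explode("families").rename(columns={"families": "family"})

# Rows: family, columns: Label, cells: counts.
summary = pd.crosstab(fam_long["family"], fam_long["Label"])
summary["prop_malicious"] = summary[1] / summary.sum(axis=1)
print(summary)
```

On the full dataset this avoids building a list of dictionaries row by row, which is typically slower in pandas.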

In [6]:
import re
from IPython.display import display

# Make sure df is loaded and clean
df_raw = pd.read_csv(DATA_PATH_KAGGLE)
cols_to_drop = [c for c in df_raw.columns if c.lower().startswith("unnamed")]
df = df_raw.drop(columns=cols_to_drop)

# Recompute families and n_families if needed
def detect_family(text):
    t = str(text).lower()
    families = []

    if "<script" in t or "</script" in t:
        families.append("script_tag")
    if re.search(r"on[a-z]+\s*=", t):
        families.append("event_handler")
    if "javascript:" in t:
        families.append("javascript_uri")
    if "<img" in t:
        families.append("image_tag")
    if "<iframe" in t:
        families.append("iframe_tag")
    if "<svg" in t:
        families.append("svg_tag")
    if "<script" not in t and "<img" not in t and "on" in t and ";" in t:
        families.append("maybe_polyglot")

    return families if families else ["other"]

df["families"] = df["Sentence"].astype(str).apply(detect_family)
df["n_families"] = df["families"].apply(len)

# Recompute length in characters
df["len_chars"] = df["Sentence"].astype(str).str.len()

# -------- helper function to display examples --------
def show_samples(df_sub, n=5, max_chars=500):
    n = min(n, len(df_sub))
    samples = df_sub.sample(n=n, random_state=42)
    rows = []
    for _, row in samples.iterrows():
        text = str(row["Sentence"])
        shortened = text[:max_chars] + ("..." if len(text) > max_chars else "")
        rows.append({
            "Label": row["Label"],
            "Length": len(text),
            "Families": row["families"],
            "Snippet": shortened
        })
    return pd.DataFrame(rows)

print("=== 5 ejemplos aleatorios de cada familia ===")

for fam in ["script_tag", "event_handler", "javascript_uri", "image_tag",
            "iframe_tag", "svg_tag", "maybe_polyglot", "other"]:
    
    subset = df[df["families"].apply(lambda lst: fam in lst)]
    print(f"\n--- {fam} (total={len(subset)}) ---")
    display(show_samples(subset, n=5))

# Longest payloads
print("\n=== 5 payloads más largos del dataset ===")
longest = df.sort_values("len_chars", ascending=False).head(5)
display(show_samples(longest, n=5, max_chars=800))

# Examples with multiple families
print("\n=== Ejemplos de payloads con múltiples familias (n_families > 1) ===")
multi = df[df["n_families"] > 1]
display(show_samples(multi, n=5, max_chars=500))


=== 5 ejemplos aleatorios de cada familia ===

--- script_tag (total=118) ---


Unnamed: 0,Label,Length,Families,Snippet
0,1,46,[script_tag],"<SCRIPT ="">"" SRC=""httx://.rocks/.js""></SCRIPT>"
1,0,31,[script_tag],"<script type=""text/javascript"">"
2,1,77,[script_tag],"<SCRIPT>document.write(""<SCRI"");</SCRIPT>PT SRC=""httx://.rocks/.js""></SCRIPT>"
3,1,47,[script_tag],"<SCRIPT a="">"" SRC=""httx://.rocks/.js""></SCRIPT>"
4,1,59,"[script_tag, event_handler]","<script onkeypress=""alert(1)"" contenteditable>test</script>"



--- event_handler (total=7184) ---


Unnamed: 0,Label,Length,Families,Snippet
0,1,59,[event_handler],"<strike onkeypress=""alert(1)"" contenteditable>test</strike>"
1,1,30,[event_handler],"<u onclick=""alert(1)"">test</u>"
2,1,59,[event_handler],"<output draggable=""true"" ondragend=""alert(1)"">test</output>"
3,1,48,[event_handler],"<h1 onkeyup=""alert(1)"" contenteditable>test</h1>"
4,1,55,"[event_handler, svg_tag]","<svg draggable=""true"" ondragstart=""alert(1)"">test</svg>"



--- javascript_uri (total=50) ---


Unnamed: 0,Label,Length,Families,Snippet
0,1,30,"[javascript_uri, image_tag]",<IMG SRC=javascript:alert('')>
1,1,25,[javascript_uri],"SRC=""javascript:alert('')"
2,1,37,[javascript_uri],"<BGSOUND SRC=""javascript:alert('');"">"
3,1,35,"[javascript_uri, image_tag]","<IMG DYNSRC=""javascript:alert('')"">"
4,1,33,"[javascript_uri, image_tag]","<IMG SRC=""javascript:alert('');"">"



--- image_tag (total=152) ---


Unnamed: 0,Label,Length,Families,Snippet
0,1,52,"[event_handler, image_tag]","<img onkeydown=""alert(1)"" contenteditable>test</img>"
1,0,648,[image_tag],"<li><a href=""/wiki/File:Nuvola_apps_kalzium.svg"" class=""image""><img alt=""Nuvola apps kalzium.svg"" src=""//upload.wikimedia.org/wikipedia/commons/thumb/8/8b/Nuvola_apps_kalzium.svg/28px-Nuvola_apps_..."
2,1,37,"[event_handler, image_tag]",<img onpointermove=alert(1)>XSS</img>
3,0,611,[image_tag],"<li><img alt=""List-Class article"" src=""//upload.wikimedia.org/wikipedia/en/thumb/d/db/Symbol_list_class.svg/16px-Symbol_list_class.svg.png"" decoding=""async"" title=""List-Class article"" width=""16"" h..."
4,1,104,"[event_handler, image_tag]","<style>:target {color:red;}</style><img id=x style=""transition:color 1s"" ontransitionend=alert(1)></img>"



--- iframe_tag (total=57) ---


Unnamed: 0,Label,Length,Families,Snippet
0,1,33,"[event_handler, iframe_tag]",<iframe onload=alert(1)></iframe>
1,1,54,"[event_handler, iframe_tag]","<iframe oncut=""alert(1)"" contenteditable>test</iframe>"
2,1,71,"[iframe_tag, maybe_polyglot]",<iframe src=javascript&colon;alert&lpar;document&period;location&rpar;>
3,1,113,"[event_handler, iframe_tag]","<div draggable=""true"" contenteditable>drag me</div><iframe ondragover=alert(1) contenteditable>drop here</iframe>"
4,1,43,"[event_handler, iframe_tag]","<iframe ondblclick=""alert(1)"">test</iframe>"



--- svg_tag (total=200) ---


Unnamed: 0,Label,Length,Families,Snippet
0,1,52,"[event_handler, svg_tag]",<svg><animate onend=alert(1) attributeName=x dur=1s>
1,1,104,"[event_handler, svg_tag, maybe_polyglot]","<style>:target {color:red;}</style><svg id=x style=""transition:color 1s"" ontransitionend=alert(1)></svg>"
2,1,38,"[event_handler, svg_tag]",<svg><source onload=alert(1)></source>
3,1,32,"[event_handler, svg_tag]",<svg><pre onload=alert(1)></pre>
4,1,34,"[event_handler, svg_tag]",<svg><samp onload=alert(1)></samp>



--- maybe_polyglot (total=1446) ---


Unnamed: 0,Label,Length,Families,Snippet
0,0,107,[maybe_polyglot],"<link rel=""edit"" title=""Edit this page"" href=""/w/index.php?title=Artificial_intelligence&amp;action=edit""/>"
1,0,550,[maybe_polyglot],"\t </span>. </cite><span title=""ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article&amp;rft.jtitle=The+Atlantic&amp;rft.atitle=Noam+Chomsky+on+Where+Ar..."
2,1,184,"[event_handler, maybe_polyglot]","<style>@keyframes x{from {left:0;}to {left: 1000px;}}:target {animation:10s ease-in-out 0s 1 x;}</style><listing id=x style=""position:absolute;"" onanimationcancel=""alert(1)""></listing>"
3,1,119,"[event_handler, maybe_polyglot]","<style>:target {color: red;}</style><frameset id=x style=""transition:color 10s"" ontransitioncancel=alert(1)></frameset>"
4,1,102,"[event_handler, maybe_polyglot]","<style>:target {color:red;}</style><li id=x style=""transition:color 1s"" ontransitionend=alert(1)></li>"



--- other (total=5597) ---


Unnamed: 0,Label,Length,Families,Snippet
0,0,93,[other],"<li id=""cite_note-260""><span class=""mw-cite-backlink""><b><a href=""#cite_ref-260"">^ </a> </b>"
1,0,9,[other],\t </span>
2,0,6,[other],</li>
3,0,15,[other],\t </span> </li>
4,0,7,[other],</div>



=== 5 payloads más largos del dataset ===


Unnamed: 0,Label,Length,Families,Snippet
0,0,3681,[maybe_polyglot],"<p><a href=""/wiki/Knowledge_representation"" class=""mw-redirect"" title=""Knowledge representation"">Knowledge representation </a><sup id=""cite_ref-Knowledge_representation_91-0"" class=""reference""><a ..."
1,0,3479,[maybe_polyglot],"</p><p>The field of AI research was born at <a href=""/wiki/Dartmouth_workshop"" title=""Dartmouth workshop"">a workshop </a> at <a href=""/wiki/Dartmouth_College"" title=""Dartmouth College"">Dartmouth ..."
2,0,3587,[maybe_polyglot],"</p><p>In 2011, a <i><a href=""/wiki/Jeopardy!"" title=""Jeopardy!"">Jeopardy! </a> </i> <a href=""/wiki/Quiz_show"" class=""mw-redirect"" title=""Quiz show"">quiz show </a> exhibition match, <a href=""/wik..."
3,0,5911,[other],"\t <div id=""mw-hidden-catlinks"" class=""mw-hidden-catlinks mw-hidden-cats-hidden"">Hidden categories: <ul><li><a href=""/wiki/Category:Wikipedia_articles_needing_page_number_citations_from_December_2..."
4,0,3479,[maybe_polyglot],"</p><p><a href=""/wiki/Bayesian_network"" title=""Bayesian network"">Bayesian networks </a><sup id=""cite_ref-Bayesian_networks_205-0"" class=""reference""><a href=""#cite_note-Bayesian_networks-205"">&#91..."



=== Ejemplos de payloads con múltiples familias (n_families > 1) ===


Unnamed: 0,Label,Length,Families,Snippet
0,1,115,"[event_handler, maybe_polyglot]","<style>:target {color: red;}</style><button id=x style=""transition:color 10s"" ontransitioncancel=alert(1)></button>"
1,1,50,"[event_handler, svg_tag]",<svg><set onbegin=alert(1) attributename=x dur=1s>
2,1,42,"[event_handler, svg_tag]",<svg><template onload=alert(1)></template>
3,1,113,"[event_handler, maybe_polyglot]","<style>:target {color: red;}</style><blink id=x style=""transition:color 10s"" ontransitioncancel=alert(1)></blink>"
4,1,166,"[event_handler, iframe_tag, maybe_polyglot]","<style>@keyframes slidein {}</style><iframe style=""animation-duration:1s;animation-name:slidein;animation-iteration-count:2"" onanimationiteration=""alert(1)""></iframe>"
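
One sample in the `iframe_tag` output above, `<iframe src=javascript&colon;alert&lpar;...&rpar;>`, is not tagged as `javascript_uri` because the literal substring `javascript:` never appears in it. A sketch of one way to catch such cases is to decode HTML character references before matching; `has_javascript_uri` is a hypothetical helper, not a function from this notebook.

```python
import html

payload = "<iframe src=javascript&colon;alert&lpar;document&period;location&rpar;>"


def has_javascript_uri(text: str) -> bool:
    """Look for 'javascript:' after decoding HTML character references."""
    decoded = html.unescape(text).lower()
    return "javascript:" in decoded


print("javascript:" in payload.lower())  # check on the raw text
print(has_javascript_uri(payload))       # check on the decoded text
print(html.unescape(payload))
```

Decoding first is a common normalization step, but it covers only entity encoding; other obfuscations would need their own handling.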


In [7]:
import re
from IPython.display import display

# =====================================================
# 1. Base reconstruction of the dataset
# =====================================================

df_raw = pd.read_csv(DATA_PATH_KAGGLE)
cols_to_drop = [c for c in df_raw.columns if c.lower().startswith("unnamed")]
df = df_raw.drop(columns=cols_to_drop)

# -----------------------------------------------------
# Recompute families and lengths
# -----------------------------------------------------
def detect_family(text):
    t = str(text).lower()
    families = []

    if "<script" in t or "</script" in t:
        families.append("script_tag")

    if re.search(r"on[a-z]+\s*=", t):
        families.append("event_handler")

    if "javascript:" in t:
        families.append("javascript_uri")

    if "<img" in t:
        families.append("image_tag")

    if "<iframe" in t:
        families.append("iframe_tag")

    if "<svg" in t:
        families.append("svg_tag")

    if "<script" not in t and "<img" not in t and "on" in t and ";" in t:
        families.append("maybe_polyglot")

    return families if families else ["other"]

df["families"] = df["Sentence"].astype(str).apply(detect_family)
df["len_chars"] = df["Sentence"].astype(str).str.len()


# =====================================================
# 2. Minimal cleaning
# =====================================================

def clean_text(s):
    s = str(s)
    # normalize tabs
    s = s.replace("\t", " ")
    # collapse runs of spaces
    s = re.sub(r" {2,}", " ", s)
    # strip leading/trailing whitespace and line breaks
    s = s.strip()
    return s

# apply cleaning
df["Sentence_clean"] = df["Sentence"].apply(clean_text)
df["len_after_clean"] = df["Sentence_clean"].str.len()


# =====================================================
# 3. Minimal removal filters
# =====================================================

# 3.1 exclude payloads that are too long
MAX_LEN = 3000
mask_length = df["len_after_clean"] <= MAX_LEN

# 3.2 exclude payloads that are too short (keep rows with at least 5 characters)
mask_too_short = df["len_after_clean"] >= 5

# 3.3 exclude corrupt payloads (text that cannot be encoded as UTF-8)
def looks_corrupt(s):
    try:
        s.encode("utf-8")
        return False
    except UnicodeEncodeError:
        return True

mask_corrupt = ~df["Sentence_clean"].apply(looks_corrupt)

df_clean = df[mask_length & mask_too_short & mask_corrupt].copy()


# =====================================================
# 4. Split the dataset into attacks vs. benign
# =====================================================

# attacks: Label == 1
df_clean_attacks = df_clean[df_clean["Label"] == 1].copy()

# benign: Label == 0
df_clean_benign = df_clean[df_clean["Label"] == 0].copy()


# =====================================================
# 5. Show final statistics
# =====================================================

print("=== Limpieza mínima completada ===")
print("Original dataset size:", len(df))
print("Clean dataset size:", len(df_clean))
print("Clean attacks:", len(df_clean_attacks))
print("Clean benign:", len(df_clean_benign))

print("\n=== Top 5 ejemplos de df_clean_attacks ===")
display(df_clean_attacks.head(5)[["Sentence_clean", "Label", "families", "len_after_clean"]])

print("\n=== Top 5 ejemplos de df_clean_benign ===")
display(df_clean_benign.head(5)[["Sentence_clean", "Label", "families", "len_after_clean"]])


=== Limpieza mínima completada ===
Original dataset size: 13686
Clean dataset size: 13438
Clean attacks: 7373
Clean benign: 6065

=== Top 5 ejemplos de df_clean_attacks ===


Unnamed: 0,Sentence_clean,Label,families,len_after_clean
1,"<tt onmouseover=""alert(1)"">test</tt>",1,[event_handler],36
11,<a onblur=alert(1) tabindex=1 id=x></a><input autofocus>,1,[event_handler],56
12,"<col draggable=""true"" ondragenter=""alert(1)"">test</col>",1,[event_handler],55
13,<caption onpointerdown=alert(1)>XSS</caption>,1,[event_handler],45
16,<caption id=x tabindex=1 ondeactivate=alert(1)></caption><input id=y autofocus>,1,[event_handler],79



=== Top 5 ejemplos de df_clean_benign ===


Unnamed: 0,Sentence_clean,Label,families,len_after_clean
0,"<li><a href=""/wiki/File:Socrates.png"" class=""image""><img alt=""Socrates.png"" src=""//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Socrates.png/18px-Socrates.png"" decoding=""async"" width=""18"" hei...",0,[image_tag],557
2,"</span> <span class=""reference-text"">Steering for the 1995 ""<a href=""/wiki/History_of_autonomous_cars#1990s"" class=""mw-redirect"" title=""History of autonomous cars"">No Hands Across America </a>"" re...",0,[other],230
3,"</span> <span class=""reference-text""><cite class=""citation web""><a rel=""nofollow"" class=""external text"" href=""https://www.mileseducation.com/finance/artificial_intelligence"">""Miles Education | Fut...",0,[maybe_polyglot],392
4,"</span>. <a href=""/wiki/Digital_object_identifier"" title=""Digital object identifier"">doi </a>:<a rel=""nofollow"" class=""external text"" href=""https://doi.org/10.1016%2FS0921-8890%2805%2980025-9"">10....",0,[other],419
5,"<li id=""cite_note-118""><span class=""mw-cite-backlink""><b><a href=""#cite_ref-118"">^ </a> </b>",0,[other],92
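
The cleaning step reports only the final sizes (13686 rows before, 13438 after), not which filter removed what. A small sketch of a per-filter breakdown is shown below on invented toy data; `MIN_LEN` is a hypothetical name for the threshold of 5 used above.

```python
import pandas as pd

# Toy rows: one too short, one exactly at the minimum, one too long, one normal.
toy = pd.DataFrame(
    {"Sentence_clean": ["<b>", "</li>", "x" * 3001, "<u onclick=alert(1)>t</u>"]}
)
toy["len_after_clean"] = toy["Sentence_clean"].str.len()

MAX_LEN, MIN_LEN = 3000, 5  # MIN_LEN is an assumed name for the literal 5

# Each mask marks the rows that a given filter would drop.
drop_masks = {
    "too_long": toy["len_after_clean"] > MAX_LEN,
    "too_short": toy["len_after_clean"] < MIN_LEN,
}

dropped = {name: int(mask.sum()) for name, mask in drop_masks.items()}
kept = int((~pd.concat(drop_masks, axis=1).any(axis=1)).sum())
print(dropped, "kept:", kept)
```

Applied to the real `df`, the same dictionary would show how the removed rows split between the length filters.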


In [None]:
from collections import Counter

# 1. View the family distribution in the clean attack and benign subsets
attack_family_counts = Counter(
    fam
    for row in df_clean_attacks["families"]
    for fam in row
)

benign_family_counts = Counter(
    fam
    for row in df_clean_benign["families"]
    for fam in row
)

print("=== Familias en ataques limpios (df_clean_attacks) ===")
print(attack_family_counts)
print()

print("=== Familias en benignos limpios (df_clean_benign) ===")
print(benign_family_counts)
print()

# 2. Prepare columns for saving (families as a readable string)
def families_to_str(fams):
    return "|".join(fams)

df_clean_to_save = df_clean.copy()
df_clean_to_save["families_str"] = df_clean_to_save["families"].apply(families_to_str)

df_clean_attacks_to_save = df_clean_attacks.copy()
df_clean_attacks_to_save["families_str"] = df_clean_attacks_to_save["families"].apply(families_to_str)

df_clean_benign_to_save = df_clean_benign.copy()
df_clean_benign_to_save["families_str"] = df_clean_benign_to_save["families"].apply(families_to_str)

# 3. Define output paths
path_all = OUTPUT_DIR / "xss_kaggle_clean.csv"
# path_attacks = OUTPUT_DIR / "xss_kaggle_clean_attacks.csv"
# path_benign = OUTPUT_DIR / "xss_kaggle_clean_benign.csv"

# 4. Save to disk
df_clean_to_save.to_csv(path_all, index=False)
# df_clean_attacks_to_save.to_csv(path_attacks, index=False)
# df_clean_benign_to_save.to_csv(path_benign, index=False)

print("=== Archivos guardados ===")
print("Todos     :", path_all)
# print("Ataques   :", path_attacks)
# print("Benignos  :", path_benign)


=== Familias en ataques limpios (df_clean_attacks) ===
Counter({'event_handler': 7175, 'maybe_polyglot': 715, 'svg_tag': 200, 'other': 105, 'image_tag': 101, 'script_tag': 86, 'iframe_tag': 57, 'javascript_uri': 50})

=== Familias en benignos limpios (df_clean_benign) ===
Counter({'other': 5249, 'maybe_polyglot': 726, 'image_tag': 51, 'script_tag': 32, 'event_handler': 9})

=== Archivos guardados ===
Todos     : d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data\data_processed\xss_kaggle_clean.csv
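
A list column such as `families` is written to CSV as its string representation, so `families_str` is the convenient field to parse when the file is read back. The sketch below shows that round trip on invented toy rows, using an in-memory buffer instead of the real output path.

```python
import io

import pandas as pd

# Toy rows with the same column names as the saved file (values are invented).
toy = pd.DataFrame({
    "Sentence_clean": ["<svg><set onbegin=alert(1)>", "</div>"],
    "families": [["event_handler", "svg_tag"], ["other"]],
})
toy["families_str"] = toy["families"].apply(lambda fams: "|".join(fams))

# Write to an in-memory buffer and read it back.
buffer = io.StringIO()
toy.drop(columns=["families"]).to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
loaded["families"] = loaded["families_str"].str.split("|")
print(loaded["families"].tolist())
```

The `|` separator works here because none of the family names contain that character.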
