# EDA and neutralization of XSS payloads from the `Payloads.csv` dataset

In this notebook we analyze and neutralize a second set of XSS payloads.

The goals are:

- Understand the structure and content of `Payloads.csv`.
- Explore typical payload patterns (tags, attributes, schemes, etc.).
- Design a neutralization strategy that disarms executable JavaScript
  while preserving the structure of the original text as far as possible.
- Generate a clean version of the dataset for future mitigation benchmarks.
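
As a rough illustration of what "neutralizing" a payload can mean, the sketch below escapes HTML metacharacters with the standard library's `html.escape`, so that markup is shown as text instead of being parsed. It assumes the payload ends up in an HTML body or a quoted attribute; the name `neutralize_html` is hypothetical and this is not necessarily the strategy the notebook ends up using.

```python
import html


def neutralize_html(payload: str) -> str:
    """Escape <, >, &, and both quote characters so markup renders as inert text.

    Assumption: the output is placed in an HTML body or quoted-attribute
    context. Other contexts (for example a javascript: URL) need other rules.
    """
    return html.escape(payload, quote=True)


print(neutralize_html("<script>alert(document.cookie)</script>"))
```

Escaping keeps every character of the original text recoverable with `html.unescape`, which fits the goal of preserving the structure of the payload.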

In [1]:
import os
import re
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_colwidth", 200)
plt.rcParams["figure.figsize"] = (8, 4)

# Detect project base directory
cwd = Path.cwd()


def is_project_root(path: Path) -> bool:
    """
    Detect the project root by looking for key markers.
    A directory counts as the project root if it contains
    at least one of:
    - requirements.txt
    - app/
    - notebooks/
    """
    markers = ["requirements.txt", "app", "notebooks"]
    return any((path / m).exists() for m in markers)


if is_project_root(cwd):
    # Running from the project root itself
    BASE_DIR = cwd
elif cwd.name in {"notebooks", "src"} and is_project_root(cwd.parent):
    # Running from notebooks/ or src/ → go one level up
    BASE_DIR = cwd.parent
else:
    # Fallback: use current directory (for ad-hoc runs)
    BASE_DIR = cwd

NOTEBOOKS_DIR = BASE_DIR / "notebooks"
NB_DATA_DIR = NOTEBOOKS_DIR / "data"
PROJECT_DATA_DIR = BASE_DIR / "data"
OUTPUT_DIR = NB_DATA_DIR / "data_processed"

NB_DATA_DIR.mkdir(parents=True, exist_ok=True)
PROJECT_DATA_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

DATA_PATH_PAYLOADS = NB_DATA_DIR / "Payloads.csv"

print("CWD              :", cwd)
print("BASE_DIR         :", BASE_DIR)
print("NOTEBOOKS_DIR    :", NOTEBOOKS_DIR)
print("NB_DATA_DIR      :", NB_DATA_DIR)
print("OUTPUT_DIR       :", OUTPUT_DIR)
print("DATA_PATH_PAYLOADS :", DATA_PATH_PAYLOADS)


CWD              : d:\Archivos de Usuario\Documents\xss-cookie\notebooks
BASE_DIR         : d:\Archivos de Usuario\Documents\xss-cookie
NOTEBOOKS_DIR    : d:\Archivos de Usuario\Documents\xss-cookie\notebooks
NB_DATA_DIR      : d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data
OUTPUT_DIR       : d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data\data_processed
DATA_PATH_PAYLOADS : d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data\Payloads.csv
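
The detection above only checks the current directory and its parent. A minimal alternative sketch, assuming the same three markers, walks up every ancestor and requires all markers to be present; `find_project_root` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_project_root(
    start: Path,
    markers: Iterable[str] = ("requirements.txt", "app", "notebooks"),
) -> Optional[Path]:
    """Return the closest directory (start or one of its ancestors) holding ALL markers."""
    markers = tuple(markers)
    for candidate in (start, *start.parents):
        if all((candidate / m).exists() for m in markers):
            return candidate
    # No ancestor qualifies; the caller decides on a fallback
    return None
```

A call such as `find_project_root(Path.cwd()) or Path.cwd()` would keep the same fallback behavior as the cell above for ad-hoc runs.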


## 1. Loading and basic description of the `Payloads.csv` dataset

In this section we:

- Load the `Payloads.csv` file from `notebooks/data/`.
- Inspect the shape of the DataFrame (number of rows and columns).
- Review the first rows to understand the column structure.
- Get a summary with `info()` to see data types and possible null values.

Since this dataset does not necessarily share the structure of the Kaggle one,
we first let the data tell us how it is organized.

In [2]:
import chardet
import io

# 1) Detect the encoding of the original file
with open(DATA_PATH_PAYLOADS, 'rb') as f:
    raw_bytes = f.read()

detected = chardet.detect(raw_bytes)
print("Detected encoding:", detected)

encoding_to_use = detected["encoding"] or "latin1"
print("Using encoding:", encoding_to_use)

# 2) Decode the content to text, handling errors here rather than in pandas
text = raw_bytes.decode(encoding_to_use, errors="replace")

# 3) Create an in-memory buffer from the text
buffer = io.StringIO(text)

# 4) Load the CSV from the buffer (no temporary file)
df_raw = pd.read_csv(buffer)

print("\nShape (rows, columns):", df_raw.shape)
display(df_raw.head())

print("\nDataFrame info:")
df_raw.info()


Detected encoding: {'encoding': 'Windows-1252', 'confidence': 0.7299850765302397, 'language': ''}
Using encoding: Windows-1252

Shape (rows, columns): (43217, 2)


Unnamed: 0,Payloads,Class
0,http://www.nwce.gov.uk/search_process.php?keyword=%22%3e%3cscript%3ealert%28document.cookie%29%3b%3c<br>%2fscript%3e,Malicious
1,http://www.manchester.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,Malicious
2,http://www.ldsmissions.com/us/index.php?action=missionary.info%3cmarquee%3epappy%3c/marquee%3e&amp;missi<br>onary_id=69,Malicious
3,http://education.powys.gov.uk/english/adult_ed/register.php?lforenam=\%22%3e%3cscript%3ealert(docume<br>nt.cookie);%3c/script%3e&amp;subdwell=&amp;dwelling=&amp;streetnm=&amp;locality=&amp;hometow...,Malicious
4,http://www.northwarks.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,Malicious



DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43217 entries, 0 to 43216
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Payloads  43217 non-null  object
 1   Class     43217 non-null  object
dtypes: object(2)
memory usage: 675.4+ KB
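
The detector reported Windows-1252 with a confidence of about 0.73, so the guess is not certain. One hedged alternative is to try strict UTF-8 first and only fall back to a single-byte codec when that fails. The helper name `decode_bytes` and the `cp1252` default are assumptions for illustration.

```python
from typing import Tuple


def decode_bytes(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Decode bytes as strict UTF-8, else with `fallback`; return (text, codec used)."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising
        return raw.decode(fallback, errors="replace"), fallback


text_utf8, used = decode_bytes("café".encode("utf-8"))
print(used)
```

Strict UTF-8 rarely succeeds by accident on text in another encoding, which is why it is tried first in this sketch.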


In [3]:
# We will work on a copy
dfg_raw = df_raw.copy()

print("=== Columns and dtypes ===")
print(dfg_raw.dtypes)
print()

# 1. Class distribution
print("=== Distribution of 'Class' ===")
print(dfg_raw["Class"].value_counts(dropna=False))
print()

# 2. Payload length (the URLs)
dfg_raw["len_chars"] = dfg_raw["Payloads"].astype(str).str.len()
print("=== Length statistics for 'Payloads' (characters) ===")
print(dfg_raw["len_chars"].describe())
print()

# 3. Examples from each class (if there is more than one)
print("=== Examples of 'Malicious' rows ===")
print(dfg_raw[dfg_raw["Class"] == "Malicious"].head(5))
print()

print("=== Examples of non-'Malicious' rows (if any) ===")
print(dfg_raw[dfg_raw["Class"] != "Malicious"].head(5))
print()

# 4. Check how common typical URL-encoding patterns are
patterns = ["%3c", "%3e", "%22", "%27", "javascript:", "<br>", "%253c", "%253e"]
for p in patterns:
    # Literal (non-regex) match; counts rows that contain the pattern at least once
    count = dfg_raw["Payloads"].astype(str).str.contains(p, case=False, regex=False, na=False).sum()
    print(f"Rows containing '{p}':", count)


=== Columns and dtypes ===
Payloads    object
Class       object
dtype: object

=== Distribution of 'Class' ===
Benign       28068
Malicious    15149
Name: Class, dtype: int64

=== Length statistics for 'Payloads' (characters) ===
count    43217.000000
mean       430.634889
std       1438.856850
min          1.000000
25%         93.000000
50%        124.000000
75%        291.000000
max      32758.000000
Name: len_chars, dtype: float64

=== Examples of 'Malicious' rows ===
                                                                                                                                                                                                  Payloads  \
0                                                                                     http://www.nwce.gov.uk/search_process.php?keyword=%22%3e%3cscript%3ealert%28document.cookie%29%3b%3c<br>%2fscript%3e   
1  http://www.manchester.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br

In [4]:
from IPython.display import display

dfg = dfg_raw.copy()

print("=== Length statistics by class ===")
display(dfg.groupby("Class")["len_chars"].describe())
print()

# Helper to display random examples in a readable way
def show_samples(dfg_sub, n=5, max_chars=200):
    n = min(n, len(dfg_sub))
    samples = dfg_sub.sample(n=n, random_state=42)
    rows = []
    for _, row in samples.iterrows():
        text = str(row["Payloads"])
        shortened = text[:max_chars] + ("..." if len(text) > max_chars else "")
        rows.append({
            "Class": row["Class"],
            "Length": len(text),
            "Snippet": shortened,
        })
    return pd.DataFrame(rows)

print("=== Random MALICIOUS examples ===")
display(show_samples(dfg[dfg["Class"] == "Malicious"], n=5, max_chars=250))

print("\n=== Random BENIGN examples ===")
display(show_samples(dfg[dfg["Class"] == "Benign"], n=5, max_chars=250))

# Top 5 longest of each class
# Note: show_samples() re-samples its input, so these five rows are displayed in shuffled order
print("\n=== Top 5 longest MALICIOUS ===")
mal_longest = dfg[dfg["Class"] == "Malicious"].sort_values("len_chars", ascending=False).head(5)
display(show_samples(mal_longest, n=5, max_chars=300))

print("\n=== Top 5 longest BENIGN ===")
ben_longest = dfg[dfg["Class"] == "Benign"].sort_values("len_chars", ascending=False).head(5)
display(show_samples(ben_longest, n=5, max_chars=300))

# Check what share of rows look like URLs (start with http/https)
dfg["starts_with_http"] = dfg["Payloads"].str.startswith(("http://", "https://"), na=False)

print("\n=== Starts with http/https? (by class) ===")
display(dfg.groupby("Class")["starts_with_http"].value_counts(normalize=True).unstack(fill_value=0))


=== Length statistics by class ===


Class,count,mean,std,min,25%,50%,75%,max
Benign,28068.0,572.133747,1761.209253,1.0,89.0,119.0,422.0,32758.0
Malicious,15149.0,168.466433,230.873129,10.0,105.0,129.0,181.0,11312.0



=== Random MALICIOUS examples ===


Unnamed: 0,Class,Length,Snippet
0,Malicious,136,http://recaq.com/html/index_done.php?url=&lt;script&gt;alert(1);&lt;/script&gt;&amp;short_url=&lt;script&gt;alert(1);&lt;/scr<br>ipt&gt;
1,Malicious,348,http://sxeseis.gr/ks_searchmem.php?action=searchbasic&amp;page=%30%26%6d%65%6d%73%73%68%6f%77%3d%33%30%2<br>6%73%68%6f%77%3d%3e%22%3e%3c%53%43%52%49%50%54%20%53%52%43%3d%68%74%74%70%3a%2f%2f%6b%65...
2,Malicious,99,http://www.21sey.com/redirect.php?url=&quot;&gt;&lt;script&gt;alert(document.cookie)&lt;/script&gt;
3,Malicious,119,https://www.sharif.edu/fa/about/vtour.jsp?d=1%22/%3e%3cscript%3ealert%28/securitylab.ir/%29%3c/scrip<br>t%3e&amp;i=gate
4,Malicious,160,http://www.comparex.es/cgi-bin/search?country=es&amp;language=es&amp;suchen=switched}%3c/style%3e%3cscript%3<br>ea=eval;b=alert;a(b(/xss/.source));%3c/script%3e



=== Random BENIGN examples ===


Unnamed: 0,Class,Length,Snippet
0,Benign,83,http://www.wikihow.com/make-an-indoor-jungle&t=1396547084137&n=1740316&k=mainentity
1,Benign,119,http://www.wikihow.com/factor-falling-fuel-prices-into-your-household-activities&t=1396563884583&n=2640162&k=mainentity
2,Benign,1030,"convenient, clean and very comfortable! we are a family of five who just returned from a short two night stay at orchid suites at the end of a two week multi-city vacation. the location was very c..."
3,Benign,90,http://www.wikihow.com/accessorize-a-child%27s-room&t=1396562012330&n=2523276&k=mainentity
4,Benign,77,http://www.wikihow.com/do-chord-magick&t=1396570115556&n=2933877&k=mainentity



=== Top 5 longest MALICIOUS ===


Unnamed: 0,Class,Length,Snippet
0,Malicious,10975,"var flashhttprequest; var swfobject=function(){var d=""undefined"",r=""object"",s=""shockwave flash"",w=""shockwaveflash.shockwaveflash"",q=""application/x-shockwave-flash"",r=""swfobjectexprinst"",x=""onready..."
1,Malicious,7206,"""),cm.close();d=cm.createelement(a),cm.body.appendchild(d),e=f.css(d,""display""),b.removechild(cl)}ck[a]=e}return ck[a]}function cu(a,b){var c={};f.each(cq.concat.apply([],cq.slice(0,b)),function()..."
2,Malicious,7827,"var sonar = { 'debug': false, 'fingerprints': [], 'scans': {}, /* * start the exploit */ 'start': function(debug) { if( debug !== undefined ) { sonar.debug = true; } if( sonar.fingerprints.length ..."
3,Malicious,11312,"var swfobject=function(){var z=""undefined"",p=""object"",b=""shockwave flash"",h=""shockwaveflash.shockwaveflash"",w=""application/x-shockwave-flash"",k=""swfobjectexprinst"",g=window,g=document,n=navigator,..."
4,Malicious,7442,"var sonar = { 'debug': false, 'fingerprints': [], 'scans': {}, 'websocket_timeout': 5000, 'ip_queue': [], // queue of ips to scan /* * start the exploit */ 'start': function(debug, interval_scan) ..."



=== Top 5 longest BENIGN ===


Unnamed: 0,Class,Length,Snippet
0,Benign,29896,"/*\rcss browser selector v0.4.0 (nov 02, 2010)\rrafael lima (http://rafael.adm.br)\rhttp://rafael.adm.br/css_browser_selector\rlicense: http://creativecommons.org/licenses/by/2.5/\rcontributors: h..."
1,Benign,28947,"use strict;var videoplayerlightmngr,actions={pause:""pause"",mute:""mute"",destroy:""destroy""};videoplayerlightmngr=function(){var player_count=0,max_players=2,players_registry={},findandexecute=functi..."
2,Benign,29812,"(function() { module.exports = { \""::placeholder\"": { selector: true, browsers: [\""android 4.4\"", \""chrome 4\"", \""chrome 5\"", \""chrome 6\"", \""chrome 7\"", \""chrome 8\"", \""chrome 9\"", \""chrome 10\"",..."
3,Benign,32758,"<script id=""td-ads-prefetched-content"" type=""text/plain"">{""prefetchad"":{""5417818"":[{""ad_feedback_beacon"":""https://emea.af.beap.bc.yahoo.com/af?bv=1.0.0&bs=(15on1s3qp(gid$61a8bc2a-16a2-11e6-b3b9-68..."
4,Benign,29405,"/** this file must be encoded in utf-8*/var i18n = new array();// in file: ../../js/main.js i18n[\""refresh\""] = \""???½?¾?²?¸?‚?¸\""; i18n[\""forbidden\""] = \""?—?°?±?¾?€?¾?½?µ?½?¾\""; i18n[\""you can\'..."



=== Starts with http/https? (by class) ===


Class,starts_with_http=False,starts_with_http=True
Benign,0.522196,0.477804
Malicious,0.059542,0.940458
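
The prefix test above is case-sensitive and also accepts a bare `http://` with no host. A slightly stricter check, sketched here with the standard library's `urlsplit`, requires an `http` or `https` scheme plus a non-empty host. The function name `is_http_url` is an assumption for illustration.

```python
from urllib.parse import urlsplit


def is_http_url(s) -> bool:
    """True when `s` is a string with an http(s) scheme and a non-empty host."""
    if not isinstance(s, str):
        return False
    try:
        parts = urlsplit(s.strip())
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host
        return False
    # urlsplit lowercases the scheme, so HTTP:// is accepted as well
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

On this dataset the two checks should mostly agree, since the payloads shown so far are lowercased; the sketch mainly guards against edge cases.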


In [5]:
import re

dfg = dfg_raw.copy()

def is_url(s):
    return isinstance(s, str) and (s.startswith("http://") or s.startswith("https://"))

def contains_encoded_xss(s):
    s = str(s).lower()
    patterns = [
        "%3cscript", "%3ealert", "%22%3e%3c", "%3c/br",
        "javascript:", "<marquee", "%253cscript", "%253c", "%253e"
    ]
    return any(p in s for p in patterns)

def looks_like_code(s):
    s = str(s).lower()
    # heuristics for JS/CSS/html chunks
    return any(x in s for x in [
        "function(", "var ", "/*", "{", "}", "jquery", "i18n", "<script id=",
        "module.exports", "define(", "window.", "navigator.", "selector:",
        "doctype", "<html", "<head", "<body"
    ])

dfg["IS_URL"] = dfg["Payloads"].apply(is_url)
dfg["HAS_XSS"] = dfg["Payloads"].apply(contains_encoded_xss)
dfg["IS_CODE"] = dfg["Payloads"].apply(looks_like_code)

print("=== Preliminary category counts ===")
print(dfg[["IS_URL", "HAS_XSS", "IS_CODE"]].value_counts())

print("\n=== Samples from each category ===")

print("\n--- URLs with XSS ---")
display(dfg[dfg["IS_URL"] & dfg["HAS_XSS"]].head(5))

# URLs where the heuristic found no XSS pattern (not necessarily Class == "Benign")
print("\n--- Benign URLs ---")
display(dfg[dfg["IS_URL"] & ~dfg["HAS_XSS"]].head(5))

print("\n--- CODE / junk (non-URL) ---")
display(dfg[~dfg["IS_URL"] & dfg["IS_CODE"]].head(5))


=== Preliminary category counts ===
IS_URL  HAS_XSS  IS_CODE
True    False    False      20681
False   False    False      13519
True    True     False       6733
False   False    True        1625
        True     False        327
True    True     True         173
False   True     True          88
True    False    True          71
dtype: int64

=== Samples from each category ===

--- URLs with XSS ---


Unnamed: 0,Payloads,Class,len_chars,IS_URL,HAS_XSS,IS_CODE
0,http://www.nwce.gov.uk/search_process.php?keyword=%22%3e%3cscript%3ealert%28document.cookie%29%3b%3c<br>%2fscript%3e,Malicious,116,True,True,False
1,http://www.manchester.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,Malicious,223,True,True,False
3,http://education.powys.gov.uk/english/adult_ed/register.php?lforenam=\%22%3e%3cscript%3ealert(docume<br>nt.cookie);%3c/script%3e&amp;subdwell=&amp;dwelling=&amp;streetnm=&amp;locality=&amp;hometow...,Malicious,372,True,True,False
4,http://www.northwarks.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,Malicious,223,True,True,False
6,http://www.waverley.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.co<br>okie%29%3b%253c%2fscript%3e,Malicious,131,True,True,False



--- Benign URLs ---


Unnamed: 0,Payloads,Class,len_chars,IS_URL,HAS_XSS,IS_CODE
2,http://www.ldsmissions.com/us/index.php?action=missionary.info%3cmarquee%3epappy%3c/marquee%3e&amp;missi<br>onary_id=69,Malicious,119,True,False,False
5,http://www.chaoticwars.co.uk/register.php?ref='%3e%3ciframe%20src=http://google.com%3e,Malicious,86,True,False,False
7,http://battleofthevalley.com/register.php?ref='&gt;&lt;iframe%20src=http://google.com&gt;,Malicious,89,True,False,False
8,http://www.cincinnatibell.com/search/default.asp?query=&quot;&gt;&lt;script&gt;alert(document.cookie);&lt;/script&gt;&amp;bu<br>siness=true,Malicious,139,True,False,False
11,https://www.scriptlance.com/cgi-bin/freelancers/search.cgi?cats=36&quot;&gt;a&lt;/a&gt;&lt;marquee&gt;pappy was<br>here&lt;/marquee&gt;,Malicious,135,True,False,False



--- CODE / junk (non-URL) ---


Unnamed: 0,Payloads,Class,len_chars,IS_URL,HAS_XSS,IS_CODE
10000,"function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new date().gettime(),event:'gtm.js'});var f=d.getelementsbytagname(s)[0], j=d.createelement(s),dl=l!='datalayer'?'&l='+l:'';j.async=true;j...",Benign,344,False,False,True
10002,"type=""text/javascript"" src=""../../../pcache.alexa.com/js/ext/jquery-ui.269c3437a2b6c6afe28cdac8cfa71b2e.js"">",Benign,108,False,False,True
10003,"type=""text/javascript"" src=""../../../pcache.alexa.com/js/ext/jquery-ui-touch-punch-023.2705ef75a78ce2e407cd4b1c4a3e6a61.js"">",Benign,124,False,False,True
10004,"type=""text/javascript"" src=""../../../pcache.alexa.com/pro/js/ext/jquery-validate-min.0d639ad62710126000b7687afbe9dc32.js"">",Benign,122,False,False,True
10005,"type=""text/javascript"" src=""../../../pcache.alexa.com/js/ext/jquery-cookie-13.608d09785a90f1325fb2d8eaa26c6bb1.js"">",Benign,115,False,False,True
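
Several rows in the sample of URLs without a detected XSS pattern carry `Class = Malicious`: their payloads use HTML entities (`&lt;script&gt;`) or tags such as `%3ciframe` that the pattern list does not cover. A possible way to reduce these misses, sketched below, is to decode once and then look for any opening or closing tag. The names `TAG_RE` and `looks_like_markup_injection` are hypothetical.

```python
import html
import re
from urllib.parse import unquote

# "<" followed (optionally after spaces or "/") by a letter, i.e. the start of a tag
TAG_RE = re.compile(r"<\s*/?\s*[a-z]", re.IGNORECASE)


def looks_like_markup_injection(s) -> bool:
    """Decode percent-encoding and HTML entities once, then search for a tag."""
    text = str(s).replace("<br>", "")  # drop the artificial line-break marker
    text = html.unescape(unquote(text))
    return bool(TAG_RE.search(text))


sample = "http://battleofthevalley.com/register.php?ref='&gt;&lt;iframe%20src=http://google.com&gt;"
print(looks_like_markup_injection(sample))
```

This remains a heuristic: it flags any markup, including harmless tags, and a single decoding pass does not undo double encoding.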


In [6]:
dfg = dfg_raw.copy()

# ---- Redefine the earlier helpers with broader pattern lists ----
def is_url(s):
    return isinstance(s, str) and (s.startswith("http://") or s.startswith("https://"))

def contains_encoded_xss(s):
    s = str(s).lower()
    patterns = [
        "%3c", "%3e", "%22", "%27", 
        "%253c", "%253e",
        "<br>",
        "javascript:",
        "onerror", "onload", "onblur", "onmouseover",
    ]
    return any(p in s for p in patterns)

def looks_like_code(s):
    s = str(s).lower()
    return any(x in s for x in [
        "function(", "var ", "/*", "*/", "{", "}", 
        "jquery", "i18n", "<script id=", "define(",
        "module.exports", "<html", "<body", "doctype"
    ])

dfg["IS_URL"] = dfg["Payloads"].apply(is_url)
dfg["HAS_XSS"] = dfg["Payloads"].apply(contains_encoded_xss)
dfg["IS_CODE"] = dfg["Payloads"].apply(looks_like_code)

# --------- Select the useful groups ---------
group_urls_xss = dfg[dfg["IS_URL"] & dfg["HAS_XSS"]]
group_urls_benign = dfg[dfg["IS_URL"] & ~dfg["HAS_XSS"]]
group_payloads_raw = dfg[~dfg["IS_URL"] & dfg["HAS_XSS"]]

print("=== Sizes of the useful groups ===")
print("URLs with XSS:", len(group_urls_xss))
print("Benign URLs:", len(group_urls_benign))
print("Raw payloads with XSS:", len(group_payloads_raw))

print("\n=== Examples from each group ===")
print("\n--- URLs with XSS ---")
display(group_urls_xss.head(5))

print("\n--- Benign URLs ---")
display(group_urls_benign.head(5))

print("\n--- Raw payloads with XSS ---")
display(group_payloads_raw.head(5))


=== Sizes of the useful groups ===
URLs with XSS: 11249
Benign URLs: 16409
Raw payloads with XSS: 707

=== Examples from each group ===

--- URLs with XSS ---


Unnamed: 0,Payloads,Class,len_chars,IS_URL,HAS_XSS,IS_CODE
0,http://www.nwce.gov.uk/search_process.php?keyword=%22%3e%3cscript%3ealert%28document.cookie%29%3b%3c<br>%2fscript%3e,Malicious,116,True,True,False
1,http://www.manchester.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,Malicious,223,True,True,False
2,http://www.ldsmissions.com/us/index.php?action=missionary.info%3cmarquee%3epappy%3c/marquee%3e&amp;missi<br>onary_id=69,Malicious,119,True,True,False
3,http://education.powys.gov.uk/english/adult_ed/register.php?lforenam=\%22%3e%3cscript%3ealert(docume<br>nt.cookie);%3c/script%3e&amp;subdwell=&amp;dwelling=&amp;streetnm=&amp;locality=&amp;hometow...,Malicious,372,True,True,False
4,http://www.northwarks.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,Malicious,223,True,True,False



--- Benign URLs ---


Unnamed: 0,Payloads,Class,len_chars,IS_URL,HAS_XSS,IS_CODE
7,http://battleofthevalley.com/register.php?ref='&gt;&lt;iframe%20src=http://google.com&gt;,Malicious,89,True,False,False
23,http://prima-tv.ru/search/?string=&quot;&gt;&lt;iframe+src=http://google.com&gt;,Malicious,80,True,False,False
29,http://kone-liga.ru/search/?q=&quot;&gt;&lt;script&gt;alert(document.cookie);&lt;/script&gt;,Malicious,92,True,False,False
40,http://www.thetaxreliefsite.com/?refid=&quot;&gt;&lt;script&gt;alert(document.cookie);&lt;/script&gt;,Malicious,101,True,False,False
50,http://search.moh.gov.my/i/?s=1169536581265&amp;p=&lt;/title&gt;&lt;iframe+src=http://google.com&gt;,Malicious,100,True,False,False



--- Raw payloads with XSS ---


Unnamed: 0,Payloads,Class,len_chars,IS_URL,HAS_XSS,IS_CODE
10065,"y.later(10, this, function() {yahoo = window.yahoo || {};yahoo.media = yahoo.media || {};yahoo.media.socialbuttons = yahoo.media.socialbuttons || {};yahoo.media.socialbuttons.configs = yahoo.media...",Benign,2929,False,True,True
24128,srchfd=all&kwd=%27%22%3e%3ciframe+src%3d%2f%2fxssed.com%3e&pagenum=1&category=t&resrchflag2=false&se archtype=&cat1=&cat2=&cat3=&cat4=&regdatestartp=&regdateendp=&isadvancesearched=true&prekwd=%22...,Malicious,290,False,True,False
24179,email=&firstname=%22%3e%3cscript%3ealert%28document.cookie%29%3c%2fscript%3e&lastname=&password=&pas swordc=&referer=,Malicious,117,False,True,False
24197,kis=5&kis=3&kis=1&kispc=5&kispc=3&kispc=%27%22%28%29%26%251%3cscript%20%3econfirm%28/xss/.source%29% 3c%2fscript%3e,Malicious,115,False,True,False
24200,form_action=view&query=%3cscript%3ealert%28document.cookie%29%3c%2fscript%3e&ff=0,Malicious,81,False,True,False
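
The samples above show the heuristic and the dataset label disagreeing in both directions: the URLs without a detected pattern are all labeled `Malicious`, and the first raw payload flagged as XSS is labeled `Benign`. A small sketch for quantifying that disagreement follows; `confusion_counts` is a hypothetical helper that treats the `Class` column as ground truth.

```python
from collections import Counter


def confusion_counts(labels, flags, positive="Malicious"):
    """Count TP/FP/FN/TN for a boolean heuristic against class labels."""
    counts = Counter()
    for label, flag in zip(labels, flags):
        actual = label == positive
        if actual:
            counts["TP" if flag else "FN"] += 1
        else:
            counts["FP" if flag else "TN"] += 1
    return counts


demo = confusion_counts(
    ["Malicious", "Malicious", "Benign", "Benign"],
    [True, False, True, False],
)
print(dict(demo))
```

With the DataFrame from the previous cell, a call such as `confusion_counts(dfg["Class"], dfg["HAS_XSS"])` would summarize how often the two disagree.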


In [7]:
import urllib.parse

# Select ONLY the useful groups
dfg_useful = pd.concat([group_urls_xss, group_urls_benign, group_payloads_raw], ignore_index=True)

# Robust decoding function
def decode_payload(s):
    s = str(s)

    # Remove the artificial <br>
    s = s.replace("<br>", "")

    # URL-decode (twice, in case it is double-encoded)
    try:
        s = urllib.parse.unquote(s)
        s = urllib.parse.unquote(s)
    except Exception:
        pass

    # Replace common HTML entities
    replacements = {
        "&lt;": "<",
        "&gt;": ">",
        "&amp;": "&",
        "&quot;": "\"",
        "&#x27;": "'",
        "&#39;": "'",
    }
    for k, v in replacements.items():
        s = s.replace(k, v)

    return s.strip()

# Create the decoded column
dfg_useful["Sentence_decoded"] = dfg_useful["Payloads"].apply(decode_payload)

print("=== Decoded examples ===")
display(dfg_useful[["Payloads", "Sentence_decoded"]].head(10))

print("Length after decoding:")
dfg_useful["len_decoded"] = dfg_useful["Sentence_decoded"].str.len()
print(dfg_useful["len_decoded"].describe())


=== Decoded examples ===


Unnamed: 0,Payloads,Sentence_decoded
0,http://www.nwce.gov.uk/search_process.php?keyword=%22%3e%3cscript%3ealert%28document.cookie%29%3b%3c<br>%2fscript%3e,"http://www.nwce.gov.uk/search_process.php?keyword=""><script>alert(document.cookie);</script>"
1,http://www.manchester.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,"http://www.manchester.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0"
2,http://www.ldsmissions.com/us/index.php?action=missionary.info%3cmarquee%3epappy%3c/marquee%3e&amp;missi<br>onary_id=69,http://www.ldsmissions.com/us/index.php?action=missionary.info<marquee>pappy</marquee>&missionary_id=69
3,http://education.powys.gov.uk/english/adult_ed/register.php?lforenam=\%22%3e%3cscript%3ealert(docume<br>nt.cookie);%3c/script%3e&amp;subdwell=&amp;dwelling=&amp;streetnm=&amp;locality=&amp;hometow...,"http://education.powys.gov.uk/english/adult_ed/register.php?lforenam=\""><script>alert(document.cookie);</script>&subdwell=&dwelling=&streetnm=&locality=&hometown=&postcode=&datebrth=&learngen=&eth..."
4,http://www.northwarks.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.<br>cookie%29%3b%253c%2fscript%3e&amp;btng=search&amp;ie=&amp;site=&amp;output=xml&amp;client=&a...,"http://www.northwarks.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0"
5,http://www.chaoticwars.co.uk/register.php?ref='%3e%3ciframe%20src=http://google.com%3e,http://www.chaoticwars.co.uk/register.php?ref='><iframe src=http://google.com>
6,http://www.waverley.gov.uk/site/scripts/google_results.php?q=%22%3e%253cscript%3ealert%28document.co<br>okie%29%3b%253c%2fscript%3e,"http://www.waverley.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>"
7,http://www.cincinnatibell.com/search/default.asp?query=&quot;&gt;&lt;script&gt;alert(document.cookie);&lt;/script&gt;&amp;bu<br>siness=true,"http://www.cincinnatibell.com/search/default.asp?query=""><script>alert(document.cookie);</script>&business=true"
8,http://walkmyplank.net/signup.php?ref=%22%3e%3ciframe%20src=http://google.com%3e%3cfont%20color=red%<br>3e,"http://walkmyplank.net/signup.php?ref=""><iframe src=http://google.com><font color=red>"
9,http://www.locumco.com.au/register.php?show=pharm&amp;tertmenu=%22;%3c/script%3e%3cscript%3ealert(docume<br>nt.cookie);&lt;/script&gt;&lt;script&gt;,"http://www.locumco.com.au/register.php?show=pharm&tertmenu="";</script><script>alert(document.cookie);</script><script>"


Length after decoding:
count    28365.000000
mean       165.803561
std        804.961513
min         19.000000
25%         87.000000
50%        101.000000
75%        127.000000
max      29848.000000
Name: len_decoded, dtype: float64
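
Decoding exactly twice and replacing a fixed set of entities handles the examples shown, but deeper nesting or numeric entities such as `&#x3c;` would slip through. A hedged alternative is to repeat percent-decoding and `html.unescape` until the text stops changing, with a cap on the number of rounds. The name `decode_until_stable` and the `max_rounds` default are assumptions.

```python
import html
from urllib.parse import unquote


def decode_until_stable(s, max_rounds: int = 5) -> str:
    """Apply URL-decoding and HTML-unescaping repeatedly until a fixed point."""
    text = str(s).replace("<br>", "")  # drop the artificial line-break marker
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break  # nothing left to decode
        text = decoded
    return text.strip()


print(decode_until_stable("q=%253cscript%253ealert(1)%253c/script%253e"))
```

The trade-off is that aggressive decoding can also alter benign text that legitimately contains `%25` or `&amp;`, so the cap and the choice of decoders would need checking against the data.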


In [8]:
from collections import Counter
import re
from IPython.display import display

# Work ONLY with rows originally labeled Malicious in the GitHub dataset
dfg_mal = dfg_useful[dfg_useful["Class"] == "Malicious"].copy()

# Helpers to extract tags and event handlers from Sentence_decoded
def extract_tags(text):
    text = str(text)
    # Matches things like <script, <iframe, <marquee, <img, <svg, etc.
    return re.findall(r"<\s*([a-zA-Z0-9]+)", text)

def extract_event_handlers(text):
    text = str(text)
    # Matches onXXX= (onload=, onmouseover=, onblur=, etc.).
    # Caveat: there is no word boundary, so fragments of ordinary parameter
    # names such as "month=" are also captured (as "onth").
    return re.findall(r"(on[a-zA-Z0-9]+)\s*=", text)

dfg_mal["tags"] = dfg_mal["Sentence_decoded"].apply(extract_tags)
dfg_mal["events"] = dfg_mal["Sentence_decoded"].apply(extract_event_handlers)

# Count frequencies
tag_counts = Counter(tag.lower() for tags in dfg_mal["tags"] for tag in tags)
event_counts = Counter(ev.lower() for evs in dfg_mal["events"] for ev in evs)

print("=== Top 20 tags in MALICIOUS decoded payloads ===")
for tag, c in tag_counts.most_common(20):
    print(f"{tag}: {c}")
    
print("\n=== Top 20 event handlers in MALICIOUS decoded payloads ===")
for ev, c in event_counts.most_common(20):
    print(f"{ev}: {c}")

print("\n=== Examples of decoded malicious payloads with tags and events ===")
display(dfg_mal.head(10)[["Sentence_decoded", "tags", "events"]])


=== Top 20 tags in MALICIOUS decoded payloads ===
script: 12635
h1: 2723
marquee: 2331
iframe: 657
img: 483
body: 332
br: 323
font: 100
div: 94
p: 93
center: 87
h2: 66
a: 60
title: 58
html: 54
input: 50
style: 45
meta: 43
noscript: 41
plaintext: 40

=== Top 20 event handlers in MALICIOUS decoded payloads ===
onload: 393
onerror: 256
onmouseover: 178
onth: 102
onid: 90
ontent: 78
one: 61
onto: 30
oneid: 25
ontentid: 24
onfig: 15
onfocus: 14
ons: 13
ontype: 13
ontext: 13
onmousemove: 12
onclick: 10
ont: 10
onreadystatechange: 10
onunload: 8

=== Examples of decoded malicious payloads with tags and events ===


Unnamed: 0,Sentence_decoded,tags,events
0,"http://www.nwce.gov.uk/search_process.php?keyword=""><script>alert(document.cookie);</script>",[script],[]
1,"http://www.manchester.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0",[script],[]
2,http://www.ldsmissions.com/us/index.php?action=missionary.info<marquee>pappy</marquee>&missionary_id=69,[marquee],[]
3,"http://education.powys.gov.uk/english/adult_ed/register.php?lforenam=\""><script>alert(document.cookie);</script>&subdwell=&dwelling=&streetnm=&locality=&hometown=&postcode=&datebrth=&learngen=&eth...",[script],[]
4,"http://www.northwarks.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0",[script],[]
5,http://www.chaoticwars.co.uk/register.php?ref='><iframe src=http://google.com>,[iframe],[]
6,"http://www.waverley.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>",[script],[]
7,"http://www.cincinnatibell.com/search/default.asp?query=""><script>alert(document.cookie);</script>&business=true",[script],[]
8,"http://walkmyplank.net/signup.php?ref=""><iframe src=http://google.com><font color=red>","[iframe, font]",[]
9,"http://www.locumco.com.au/register.php?show=pharm&tertmenu="";</script><script>alert(document.cookie);</script><script>","[script, script]",[]
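
Entries such as `onth`, `onid`, and `ontent` in the handler counts above are not real event handlers; they are most likely fragments of ordinary names ending in `=` (for example `month=` or `content=`). A stricter extraction, sketched below, only looks inside tag-like spans and requires that `on` is not preceded by a word character. The regexes and the name `extract_event_handlers_strict` are assumptions for illustration.

```python
import re

# A tag-like span: "<", a letter, then everything up to the next "<" or ">"
TAG_BODY_RE = re.compile(r"<[a-zA-Z][^<>]*>?")
# "on" + at least three letters + "=", not preceded by a word character or "-"
HANDLER_RE = re.compile(r"(?<![\w-])(on[a-z]{3,})\s*=", re.IGNORECASE)


def extract_event_handlers_strict(text) -> list:
    """Return lowercase handler names found as attributes inside tags."""
    handlers = []
    for tag in TAG_BODY_RE.findall(str(text)):
        handlers.extend(h.lower() for h in HANDLER_RE.findall(tag))
    return handlers


print(extract_event_handlers_strict("<img src=x onerror=alert(1)>"))
```

This is still approximate: a `>` inside an attribute value cuts the tag span short, so handlers placed after it would be missed.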


In [9]:
import re
from collections import Counter

dfg = dfg_useful.copy()

# 1. RELEVANT TAGS found in the dataset
TAGS = [
    "script", "marquee", "iframe", "img", "body",
    "h1", "h2", "plaintext", "noscript",
    "div", "p", "center", "font"
]

# 2. EVENT HANDLERS found in the dataset
EVENTS = [
    "onload", "onerror", "onmouseover", "onmousemove", "onclick", 
    "onfocus", "onunload", "onreadystatechange"
]

# --------------------------------------------------------
# 3. Function to extract ONLY the injected part
# --------------------------------------------------------

def extract_injected_payload(text):
    s = str(text)
    s_low = s.lower()
    
    # Find the position of the first relevant tag
    tag_positions = []
    for tag in TAGS:
        idx = s_low.find(f"<{tag}")
        if idx != -1:
            tag_positions.append(idx)
    
    # If any tags were found, start at the earliest one
    if tag_positions:
        start = min(tag_positions)
        return s[start:].strip()
    
    # No known tags, but there is an attribute breakout such as "><
    m = re.search(r"[\"']><", s)
    if m:
        start = m.start() + 2
        return s[start:].strip()
    
    # Nothing but a suspicious "<"
    first_angle = s.find("<")
    if first_angle != -1:
        return s[first_angle:].strip()
    
    # No "<" at all: return the original (unlikely for attacks)
    return s.strip()

dfg["payload_extracted"] = dfg["Sentence_decoded"].apply(extract_injected_payload)

# --------------------------------------------------------
# 4. Detect families based on the dataset
# --------------------------------------------------------

def detect_families(text):
    s = text.lower()
    fam = []

    # Families based on tags
    if "<script" in s:
        fam.append("script_tag")
    if "<marquee" in s:
        fam.append("marquee_tag")
    if "<iframe" in s:
        fam.append("iframe_tag")
    if "<img" in s:
        fam.append("img_tag")
    if "<body" in s:
        fam.append("body_tag")
    if "<h1" in s or "<h2" in s:
        fam.append("header_tag")
    if "<plaintext" in s or "<noscript" in s:
        fam.append("plain_tag")
    # Note: "<p" is a prefix match, so it also fires on "<plaintext", "<param", etc.
    if any(x in s for x in ["<div", "<p", "<center", "<font"]):
        fam.append("text_container_tag")

    # Event-handler-based families (plain substring match, not attribute-aware)
    for ev in EVENTS:
        if ev in s:
            fam.append(f"event_{ev}")

    return fam if fam else ["other"]

dfg["families"] = dfg["payload_extracted"].apply(detect_families)

# --------------------------------------------------------
# 5. Actual attack detection (data-driven)
# --------------------------------------------------------

def is_attack(text):
    s = text.lower()
    for tag in TAGS:
        if f"<{tag}" in s:
            return True
    for ev in EVENTS:
        if ev in s:
            return True
    return False

dfg["Label"] = dfg["payload_extracted"].apply(lambda x: 1 if is_attack(x) else 0)

# --------------------------------------------------------
# 6. Final length
# --------------------------------------------------------

dfg["len_after_clean"] = dfg["payload_extracted"].str.len()

# --------------------------------------------------------
# 7. Show results
# --------------------------------------------------------

print("=== Extraction examples ===")
display(dfg[["Sentence_decoded","payload_extracted","families","Label"]].head(15))
print("\n=== Final distribution ===")
print(dfg["Label"].value_counts())


=== Extraction examples ===


Unnamed: 0,Sentence_decoded,payload_extracted,families,Label
0,"http://www.nwce.gov.uk/search_process.php?keyword=""><script>alert(document.cookie);</script>",<script>alert(document.cookie);</script>,[script_tag],1
1,"http://www.manchester.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0",<script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0,[script_tag],1
2,http://www.ldsmissions.com/us/index.php?action=missionary.info<marquee>pappy</marquee>&missionary_id=69,<marquee>pappy</marquee>&missionary_id=69,[marquee_tag],1
3,"http://education.powys.gov.uk/english/adult_ed/register.php?lforenam=\""><script>alert(document.cookie);</script>&subdwell=&dwelling=&streetnm=&locality=&hometown=&postcode=&datebrth=&learngen=&eth...",<script>alert(document.cookie);</script>&subdwell=&dwelling=&streetnm=&locality=&hometown=&postcode=&datebrth=&learngen=&ethnicor=&tel_numb=&tel_mob=&email_add=&email_add2=&agree_info=&username=&p...,[script_tag],1
4,"http://www.northwarks.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0",<script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0,[script_tag],1
5,http://www.chaoticwars.co.uk/register.php?ref='><iframe src=http://google.com>,<iframe src=http://google.com>,[iframe_tag],1
6,"http://www.waverley.gov.uk/site/scripts/google_results.php?q=""><script>alert(document.cookie);</script>",<script>alert(document.cookie);</script>,[script_tag],1
7,"http://www.cincinnatibell.com/search/default.asp?query=""><script>alert(document.cookie);</script>&business=true",<script>alert(document.cookie);</script>&business=true,[script_tag],1
8,"http://walkmyplank.net/signup.php?ref=""><iframe src=http://google.com><font color=red>",<iframe src=http://google.com><font color=red>,"[iframe_tag, text_container_tag]",1
9,"http://www.locumco.com.au/register.php?show=pharm&tertmenu="";</script><script>alert(document.cookie);</script><script>",<script>alert(document.cookie);</script><script>,[script_tag],1



=== Final distribution ===
0    14751
1    13614
Name: Label, dtype: int64
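
The substring checks in `detect_families` and `is_attack` are prefix matches: `"<p"` also fires on `<plaintext>` or `<param>`, and an event name such as `onload` matches anywhere in the text, not only when it is used as an attribute. The block below is a minimal sketch of a stricter variant, assuming tag names should end at a word boundary and handlers should be followed by `=`. The names `TAG_RE`, `EVENT_RE` and `detect_families_strict` are hypothetical and not part of the notebook, and the sketch emits one family per tag instead of the grouped names (`header_tag`, `plain_tag`, `text_container_tag`) used above.

```python
import re

TAGS = ["script", "marquee", "iframe", "img", "body", "h1", "h2",
        "plaintext", "noscript", "div", "p", "center", "font"]
EVENTS = ["onload", "onerror", "onmouseover", "onmousemove", "onclick",
          "onfocus", "onunload", "onreadystatechange"]

# "<tag" followed by a word boundary, so "<p" no longer matches "<pre" or "<param"
TAG_RE = re.compile(r"<(" + "|".join(TAGS) + r")\b", re.IGNORECASE)

# Handler name not preceded by a word character and followed by "="
EVENT_RE = re.compile(r"(?<!\w)(" + "|".join(EVENTS) + r")\s*=", re.IGNORECASE)


def detect_families_strict(text):
    """Hypothetical stricter variant of detect_families (assumption, not the notebook's code)."""
    s = str(text)
    tags = {m.group(1).lower() for m in TAG_RE.finditer(s)}
    events = {m.group(1).lower() for m in EVENT_RE.finditer(s)}
    fam = sorted(f"{t}_tag" for t in tags) + sorted(f"event_{e}" for e in events)
    return fam or ["other"]


print(detect_families_strict("<img src=x onerror=alert(1)>"))
print(detect_families_strict("<pre>hello</pre>"))
```

Switching the notebook to a variant like this would change the family and label counts reported above, so it is best evaluated side by side with the original columns rather than as a drop-in replacement.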


In [11]:
# We start from dfg with:
# - Sentence_decoded
# - payload_extracted
# - families
# - Label
# - len_after_clean

# 1) Define Sentence_clean as the payload that was actually injected
dfg["Sentence_clean"] = dfg["payload_extracted"].astype(str)

# 2) Filter by a reasonable length for the injected payload
MIN_LEN = 5
MAX_LEN = 3000

mask_len = (dfg["len_after_clean"] >= MIN_LEN) & (dfg["len_after_clean"] <= MAX_LEN)
dfg_clean_github = dfg[mask_len].copy()

# 3) Add a 'source' column to record where this dataset comes from
dfg_clean_github["source"] = "github"

print("=== Size of the clean GitHub dataset ===")
print(len(dfg_clean_github))

print("\n=== Label distribution (0=benign, 1=attack) ===")
print(dfg_clean_github["Label"].value_counts())

print("\n=== Attack examples (Label=1) ===")
display(dfg_clean_github[dfg_clean_github["Label"] == 1].head(5)[
    ["Sentence_clean", "families", "len_after_clean", "Class"]
])

print("\n=== Benign examples (Label=0) ===")
display(dfg_clean_github[dfg_clean_github["Label"] == 0].head(5)[
    ["Sentence_clean", "families", "len_after_clean", "Class"]
])

# 4) Save to disk with the key columns
cols_to_save = [
    "Sentence_clean",
    "Label",
    "families",
    "len_after_clean",
    "source",
    "Sentence_decoded",
    "Payloads",
    "Class",
]

path_github_clean = OUTPUT_DIR / "xss_github_clean.csv"
import csv

# Write only the key columns listed in cols_to_save
dfg_clean_github.to_csv(
    path_github_clean,
    columns=cols_to_save,
    index=False,
    quoting=csv.QUOTE_ALL,
    escapechar="\\"
)


print("\n=== Clean GitHub file saved to ===")
print(path_github_clean)


=== Size of the clean GitHub dataset ===
28297

=== Label distribution (0=benign, 1=attack) ===
0    14743
1    13554
Name: Label, dtype: int64

=== Attack examples (Label=1) ===


Unnamed: 0,Sentence_clean,families,len_after_clean,Class
0,<script>alert(document.cookie);</script>,[script_tag],40,Malicious
1,<script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0,[script_tag],98,Malicious
2,<marquee>pappy</marquee>&missionary_id=69,[marquee_tag],41,Malicious
3,<script>alert(document.cookie);</script>&subdwell=&dwelling=&streetnm=&locality=&hometown=&postcode=&datebrth=&learngen=&ethnicor=&tel_numb=&tel_mob=&email_add=&email_add2=&agree_info=&username=&p...,[script_tag],212,Malicious
4,<script>alert(document.cookie);</script>&btng=search&ie=&site=&output=xml&client=&lr=&oe=&filter=0,[script_tag],98,Malicious



=== Benign examples (Label=0) ===


Unnamed: 0,Sentence_clean,families,len_after_clean,Class
38,"http://www.knoxcounty.org/search/sresults.php?search="");alert(document.cookie);//",[other],81,Malicious
50,"<frame src=""javascript:alert('pappy was here');""></frameset>&subseq=y",[other],69,Malicious
84,"http://bigcharts.marketwatch.com/symbollookup/symbollookupresults.asp?symb=""');alert(document.cookie);//&country=all&type=all",[other],125,Malicious
98,"<>folder_id=2534374302031664&bmuid=1213393925797&adid=dsetdir&originalhostname="";alert(document.cookie);//&bmuid=1213393931143",[other],126,Malicious
100,"<frameset><frame src=""javascript:alert('xss');""></frameset>",[other],59,Malicious



=== Clean GitHub file saved to ===
d:\Archivos de Usuario\Documents\xss-cookie\notebooks\data\data_processed\xss_github_clean.csv
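
The `Label=0` examples above all carry `Class=Malicious`: they rely on vectors that the `TAGS` and `EVENTS` lists do not cover, such as `javascript:` URIs inside `<frame>` tags and script-context breakouts like `");alert(document.cookie);//`. The block below is a minimal sketch of extra signals that could feed `is_attack`. The patterns are assumptions drawn only from the five rows shown, not an exhaustive list, and `EXTRA_SIGNALS` and `extra_families` are hypothetical names.

```python
import re

# Hypothetical extra signals, derived from the Label=0 / Class=Malicious rows above
EXTRA_SIGNALS = {
    # <frame ...> and <frameset ...>, which are missing from TAGS
    "frame_tag": re.compile(r"<frame(?:set)?\b", re.IGNORECASE),
    # javascript: scheme, e.g. in a src or href attribute
    "javascript_uri": re.compile(r"javascript\s*:", re.IGNORECASE),
    # a quote, an optional ")", then ";" and a call to alert(
    "js_breakout": re.compile(r"[\"']\)?;\s*alert\s*\(", re.IGNORECASE),
}


def extra_families(text):
    """Return the names of the extra signals found in text, in dictionary order."""
    s = str(text)
    return [name for name, rx in EXTRA_SIGNALS.items() if rx.search(s)]


samples = [
    'http://www.knoxcounty.org/search/sresults.php?search=");alert(document.cookie);//',
    "<frameset><frame src=\"javascript:alert('xss');\"></frameset>",
]
for sample in samples:
    print(extra_families(sample))
```

To measure how often the heuristic label disagrees with the dataset's own annotation, `pd.crosstab(dfg_clean_github["Label"], dfg_clean_github["Class"])` gives a quick confusion table before and after adding signals like these.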
