<a href="https://colab.research.google.com/github/Ignacioelamo/LLMs4Phishing/blob/main/02_feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preliminary Setup

## Downloads

In [1]:
!pip install --upgrade --force-reinstall gensim pyLDAvis

Collecting gensim
  Using cached gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting pyLDAvis
  Using cached pyLDAvis-3.4.1-py3-none-any.whl.metadata (4.2 kB)
Collecting numpy<2.0,>=1.18.5 (from gensim)
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy<1.14.0,>=1.7.0 (from gensim)
  Using cached scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Using cached smart_open-7.1.0-py3-none-any.whl.metadata (24 kB)
Collecting pandas>=2.0.0 (from pyLDAvis)
  Using cached pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting joblib>=1.2.0 (from pyLDAvis)
  Using cached joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Collecting jinja2 (from pyLDAvis)
  Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting numexpr (from pyLDAvis)
  Us

In [2]:
!pip install keybert




In [4]:
NOMBRE_ARCHIVO = 'emails.csv'

!wget https://raw.githubusercontent.com/Ignacioelamo/LLMs4Phishing/main/data/01_combined_cleaned_email_data.csv -O $NOMBRE_ARCHIVO

--2025-05-23 13:11:26--  https://raw.githubusercontent.com/Ignacioelamo/LLMs4Phishing/main/data/01_combined_cleaned_email_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10064045 (9.6M) [text/plain]
Saving to: ‘emails.csv’


2025-05-23 13:11:27 (151 MB/s) - ‘emails.csv’ saved [10064045/10064045]



## Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


import gensim
import gensim.corpora as corpora
from gensim.models import LdaModel
from gensim.models.phrases import Phrases, Phraser
from gensim.models.coherencemodel import CoherenceModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

import spacy
from nltk.corpus import stopwords
import nltk

from keybert import KeyBERT

# HELPER FUNCTIONS

## KeyBERT

In [2]:
kw_model = KeyBERT()

def extract_keywords_from_text(text):
    """
    Extract keywords from a text using KeyBERT.
    """
    try:
        if pd.isna(text) or str(text).strip() == '':
            return []

        keywords = kw_model.extract_keywords(
            str(text),
            keyphrase_ngram_range=(1, 1),
            stop_words='english',
            use_mmr=True,
            diversity=0.7
        )

        # Return only the keywords (without the scores)
        return [keyword[0] for keyword in keywords]

    except Exception as e:
        print(f"Error extracting keywords: {e}")
        return []

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


# Feature extraction from the email body:
1. **body_html** (stored as `contains_html`): This is a binary feature that represents the presence of HTML in the email body.
2. **body_forms**: This binary feature represents the presence of forms in HTML email bodies.
3. **body_noWords**: This feature measures the total number of words occurring in the email body.
4. **body_noCharacters**: This feature measures the total number of characters occurring in the email body.
5. **body_noDistinctWords**: This feature measures the total number of distinct words occurring in the body of the email.
6. **body_richness**: The richness is defined as the ratio of the number of words to the number of characters in the email body.
$$
\text{body_richness} = \frac{\text{body_noWords}}{\text{body_noCharacters}}
$$
7. **body_noFunctionWords**: Chandrasekaran [6] also listed a set of function words that included:  
`account`, `access`, `bank`, `credit`, `click`, `identity`, `inconvenience`, `information`, `limited`, `log`, `minutes`, `password`, `recently`, `risk`, `social`, `security`, `service`, and `suspended`. The `body_noFunctionWords` feature measures the total number of occurrences of these function words in the email body.
8. **body_suspension**: This binary feature represents the presence of the word **"suspension"** in the body of the email.
9. **body_verifyYourAccount**: This binary feature represents the presence of the phrase **"verify your account"** in the body of the email.
10. **body_text**: This feature captures the context and purpose of an email. To build it, we extract the plain text from the email body and represent it with word embedding techniques.

We also keep the features we already had: has_attachment, contains_html, urls.
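The two binary phrase features described in the list, `body_suspension` and `body_verifyYourAccount`, could be computed along the following lines. This is only a minimal sketch: the helper names are hypothetical, and the matching rules (case-insensitive, whole-word, any whitespace between the words of the phrase) are assumptions rather than something the list specifies.

```python
import re

# Assumed matching rules: case-insensitive, whole words only,
# and any run of whitespace allowed between the words of the phrase.
SUSPENSION_RE = re.compile(r"\bsuspension\b", re.IGNORECASE)
VERIFY_RE = re.compile(r"\bverify\s+your\s+account\b", re.IGNORECASE)

def body_suspension(body) -> int:
    """Return 1 if the word 'suspension' appears in the body, else 0."""
    return int(bool(SUSPENSION_RE.search(str(body))))

def body_verify_your_account(body) -> int:
    """Return 1 if the phrase 'verify your account' appears in the body, else 0."""
    return int(bool(VERIFY_RE.search(str(body))))

sample = "Please VERIFY your\naccount to avoid suspension."
print(body_suspension(sample), body_verify_your_account(sample))
```

With a DataFrame that has a `body` column, these helpers could be applied as `df['body'].apply(body_suspension)` and `df['body'].apply(body_verify_your_account)`.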

In [5]:
# NumPy version 1.26.4 is required
#%pip install --upgrade --force-reinstall numpy==1.26.4 pandas

In [5]:
# Function words listed by Chandrasekaran [6]; a set gives fast membership tests
FUNCTION_WORDS = {
    'account', 'access', 'bank', 'credit', 'click', 'identity',
    'inconvenience', 'information', 'limited', 'log', 'minutes', 'password',
    'recently', 'risk', 'social', 'security', 'service', 'suspended',
}

# Note: str(x) turns a missing body (NaN) into the string 'nan'
df = (
    pd.read_csv(NOMBRE_ARCHIVO)
      .assign(
          # Feature 3: body_noWords (Total number of words)
          body_noWords=lambda d: d['body'].apply(lambda x: len(str(x).split())),

          # Feature 4: body_noCharacters (Total number of characters)
          body_noCharacters=lambda d: d['body'].apply(lambda x: len(str(x))),

          # Feature 5: body_noDistinctWords (Total number of distinct words)
          body_noDistinctWords=lambda d: d['body'].apply(lambda x: len(set(str(x).split()))),

          # Feature 6: body_richness (Ratio of words to characters),
          # computed from the two columns assigned above
          body_richness=lambda d: d['body_noWords'] / d['body_noCharacters'],

          # Feature 7: body_noFunctionWords (Count of specific function words)
          body_noFunctionWords=lambda d: d['body'].apply(
              lambda x: sum(word.lower() in FUNCTION_WORDS for word in str(x).split())
          ),

          # Additional feature: body_keywords (Keywords extracted using KeyBERT)
          body_keywords=lambda d: d['body'].apply(extract_keywords_from_text)
      )
)
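The cell above tokenizes with `str.split()`, so a function word with attached punctuation (for example `suspended.` or `password!`) is not counted. A possible alternative, shown here only as a sketch, is to tokenize with a regular expression before counting. The helper names and the token pattern `[a-z]+` are assumptions made for illustration.

```python
import re

FUNCTION_WORDS = frozenset({
    'account', 'access', 'bank', 'credit', 'click', 'identity',
    'inconvenience', 'information', 'limited', 'log', 'minutes', 'password',
    'recently', 'risk', 'social', 'security', 'service', 'suspended',
})

def count_function_words_split(body) -> int:
    """Count function words using whitespace tokenization (as in the cell above)."""
    return sum(word.lower() in FUNCTION_WORDS for word in str(body).split())

def count_function_words_regex(body) -> int:
    """Count function words after stripping punctuation via a regex tokenizer."""
    tokens = re.findall(r"[a-z]+", str(body).lower())
    return sum(token in FUNCTION_WORDS for token in tokens)

sample = "Your account was suspended. Click here to update your password!"
print(count_function_words_split(sample))  # misses 'suspended.' and 'password!'
print(count_function_words_regex(sample))  # counts all four function words
```

Whether punctuation-stripped counting is closer to the feature's original definition is not stated in the source, so the two variants are best compared on the actual data.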

In [6]:
df.columns

Index(['source', 'subject', 'body', 'contains_html', 'body_forms',
       'has_attachment', 'urls', 'label', 'body_noWords', 'body_noCharacters',
       'body_noDistinctWords', 'body_richness', 'body_noFunctionWords',
       'body_keywords'],
      dtype='object')

In [9]:
print(df['body_keywords'].head(1000))

0               [viagra, pharmacy, cheapest, send, high]
1      [empowerment, 500, 0bligation, returning, real...
2                  [watches, vip, gift, stylish, models]
3              [watches, real, signature, du4d, cartier]
4      [rewards, shopperssavingcenter, kingsbury, dea...
                             ...                        
995             [speakup, file, braille, types, mailman]
996         [unsubscribe, debootstrap, wifi, ata, golov]
997                  [websvn, samba, 1092, tagging, bin]
998           [profiling, init, ptc, discussed, tickets]
999    [debugging, stubs, hierarchy, unsubscribe, ens...
Name: body_keywords, Length: 1000, dtype: object


# KeyBERT

In [8]:
# Store all the text from the dataframe's body column in a single variable
# (missing bodies are dropped so that str.join does not fail on NaN values)

email_bodies = df['body'].dropna().astype(str).tolist()
all_text = '\n'.join(email_bodies)

In [9]:
kw_model = KeyBERT()



In [None]:
keywords = kw_model.extract_keywords(all_text, keyphrase_ngram_range=(1, 1), stop_words='english', use_mmr=True, diversity=0.7)
print(keywords)

In [None]:
# We use the all-MiniLM-L6-v2 model to generate the body_text feature
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
df = df.dropna(subset=['body'])
emails = df['body'].tolist()
embeddings = model.encode(emails, show_progress_bar=True)
df["body_text"] = embeddings.tolist()

There are two ways to extract topics from the emails:
1. BERTopic: uses contextual embeddings to cluster documents, then re-weights terms with TF-IDF to extract the representative terms of each cluster.
2. LDA: a generative topic model over Bag-of-Words representations that discovers a distribution of words for each topic.
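To make the Bag-of-Words input of LDA concrete, here is a small pure-Python sketch of the representation: each document becomes a list of `(token_id, count)` pairs, which is the same shape that Gensim's `Dictionary.doc2bow` returns. The helper names and the toy documents are hypothetical, and the ids assigned here are illustrative and need not match the ids Gensim would assign.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

toy_docs = [["bank", "account", "bank"], ["watch", "gift"]]
vocab = build_vocabulary(toy_docs)
bow_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bow_corpus)
```

Word order is discarded in this representation; only how often each token occurs in each document is kept.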

In [None]:
# Lemmatize the email body before applying the LDA model
import spacy
from nltk.corpus import stopwords
import nltk
import matplotlib.pyplot as plt
import numpy as np

In [None]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')

The Coherence Score measures how interpretable the topics are, based on how often the most representative words of each topic co-occur within the documents. We obtain a moderate value.
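The idea of scoring a topic by the co-occurrence of its top words can be illustrated with a simplified measure. The sketch below averages the normalized pointwise mutual information (NPMI) over all pairs of top words, using document-level co-occurrence. It is not the `c_v` measure (which also uses sliding windows and vector similarities); the function name and the toy documents are assumptions for illustration only.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, tokenized_docs):
    """Average NPMI over all pairs of top words (needs at least two words)."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def doc_prob(*words):
        # Fraction of documents that contain all the given words
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = doc_prob(w1), doc_prob(w2), doc_prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur: minimum NPMI
        elif p12 == 1:
            scores.append(0.0)   # NPMI is undefined (0/0); treated as neutral here
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

toy_docs = [
    ["bank", "account", "password"],
    ["bank", "account", "login"],
    ["watch", "gift", "luxury"],
    ["watch", "gift", "cartier"],
]
print(npmi_coherence(["bank", "account"], toy_docs))  # words that always co-occur
print(npmi_coherence(["bank", "watch"], toy_docs))    # words that never co-occur
```

Scores close to 1 indicate top words that tend to appear together, while scores close to -1 indicate words that rarely share a document.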

In [None]:
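# LDA MODEL:
# 1) Load the email bodies as strings
texts = df['body'].astype(str).tolist()

# 2) Prepare the stop words and spaCy for lemmatization (same as in the notebook)
nlp = spacy.load('en_core_web_sm', disable=['parser','ner'])
stop_words = set(stopwords.words('english'))

def lemmatize_tokens(doc):
    """Return lowercase lemmas that are alphabetic, longer than 3 characters, and not stop words."""
    parsed = nlp(doc)
    # NLTK stop words are lowercase, so the lemma is lowercased before the comparison
    lemmas = ((token, token.lemma_.lower()) for token in parsed)
    return [
        lemma
        for token, lemma in lemmas
        if lemma not in stop_words
           and token.is_alpha
           and len(lemma) > 3
    ]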

data_lemm = [lemmatize_tokens(doc) for doc in texts]

# 3) Detect bigrams and apply them
bigram = Phrases(data_lemm, min_count=5, threshold=100)
bigram_mod = Phraser(bigram)
data_words = [bigram_mod[doc] for doc in data_lemm]

# 4) Build the Gensim dictionary and corpus (Bag-of-Words)
id2word = corpora.Dictionary(data_words)
id2word.filter_extremes(no_below=15, no_above=0.5)
corpus = [id2word.doc2bow(text) for text in data_words]

# --- Average coherence over several runs ---
k_values = list(range(10, 16))
seeds    = [0, 7, 42, 99, 123]   # different random seeds
results  = {k: [] for k in k_values}

for k in k_values:
    for seed in seeds:
        lda = LdaModel(
            corpus=corpus,
            id2word=id2word,
            num_topics=k,
            random_state=seed,
            update_every=1,
            chunksize=100,
            passes=10,
            alpha='auto'
        )
        cm = CoherenceModel(
            model=lda,
            texts=data_words,
            dictionary=id2word,
            coherence='c_v'
        )
        results[k].append(cm.get_coherence())

# --- Compute the mean and standard deviation ---
means = [np.mean(results[k]) for k in k_values]
stds  = [np.std(results[k])  for k in k_values]

# --- Plot with error bars ---
plt.figure(figsize=(8,5))
plt.errorbar(k_values, means, yerr=stds, fmt='-o', capsize=5)
plt.xticks(k_values)
plt.xlabel("Number of topics (k)")
plt.ylabel("Coherence Score (c_v)")
plt.title("Coherence vs k (mean ± 1σ across seeds)")
plt.grid(True)
plt.show()

Within the tested range, k=10 is the best number of topics: it reaches the highest mean coherence, and its standard deviation across the different seeds is low.
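The selection rule described above (highest mean coherence, with the spread across seeds used as a sanity check) can be sketched as follows. The helper names are hypothetical and the scores in `example_results` are made-up numbers for illustration, not results from this notebook. `pstdev` is used because it matches the population standard deviation that `np.std` computes by default.

```python
from statistics import mean, pstdev

def summarize(results):
    """Map each k to (mean coherence, population standard deviation)."""
    return {k: (mean(scores), pstdev(scores)) for k, scores in results.items()}

def pick_best_k(results):
    """Return the k with the highest mean coherence."""
    summary = summarize(results)
    return max(summary, key=lambda k: summary[k][0])

# Made-up coherence scores per k (one value per seed), for illustration only
example_results = {
    10: [0.50, 0.52, 0.51],
    11: [0.45, 0.55, 0.47],
    12: [0.48, 0.49, 0.47],
}

best_k = pick_best_k(example_results)
print(best_k, summarize(example_results)[best_k])
```

In the notebook, the same helpers could be called on the `results` dictionary built in the coherence loop.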

In [None]:
# We therefore use k=10 for the LDA model
lda = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,
    random_state=42,
    chunksize=100,
    passes=10,
    alpha='auto',
    eta='auto',
    per_word_topics=True
)

# 6) Print the most representative terms of each topic
for idx, topic in lda.print_topics(num_topics=10, num_words=10):
    print(f"Tópico {idx:2d}: {topic}")

# Plot the topics using pyLDAvis
data = gensimvis.prepare(lda, corpus, id2word)
pyLDAvis.display(data)