![Unsupervised learning class](https://raw.githubusercontent.com/MECA4605-Aprendizaje-no-supervisado/S8_clase_analisis_geografico_II/main/figs/taller-meca-aprendizaje%20no%20supervisado_banner%201169%20x%20200%20px%20-05.png)

# Fine-Tuning BETO for Spam Detection in Text Messages (Spam vs. Ham)

The goal of this notebook is to fine-tune a model based on BETO (a Spanish-language BERT) to detect fraudulent text messages (spam) and distinguish them from legitimate ones (ham).

## Steps
* Load the base dataset from CSV files and convert it to a Hugging Face `DatasetDict`.
* Extend the dataset with synthetic examples balanced between the spam and ham classes.
* Fine-tune the BETO model on the combined data.
* Evaluate the model on the test set.

## Key Results
Recall for the spam class: 94%
In the run shown in this notebook, the model detects 94% of the fraudulent messages in the test set, which is crucial in fraud-detection tasks.

Overall accuracy: 90%
Precision for the spam class is 87%, so there is still room to reduce false positives, but the main focus was maximizing recall on the spam class so that fraudulent messages do not go undetected.

In [2]:
!pip install datasets
!pip install pysentimiento


In [3]:
import torch
from datasets import load_dataset
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import pandas as pd
import os
os.environ["WANDB_DISABLED"] = "true"

# Data preparation
First, we load the dataset:

In [4]:
# Read the CSV files with pandas and convert them directly to Hugging Face Datasets
from datasets import Dataset, DatasetDict

train_path = "https://raw.githubusercontent.com/MECA4605-Aprendizaje-no-supervisado/S8_clase_analisis_geografico_II/main/data/train.csv"
test_path = "https://raw.githubusercontent.com/MECA4605-Aprendizaje-no-supervisado/S8_clase_analisis_geografico_II/main/data/test.csv"

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

train_df.columns = train_df.columns.str.strip()
test_df.columns = test_df.columns.str.strip()

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

dataset = DatasetDict({"train": train_dataset, "test": test_dataset})

In [5]:
def prepare_labels(example):
    example["labels"] = 1 if example["tipo"] == "spam" else 0
    return example

dataset = dataset.map(prepare_labels)


# Load and train the BETO model

In [18]:
# Load the BETO model and tokenizer
model_name = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

In [19]:
# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["mensaje"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Prepare the training and evaluation datasets
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]
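To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal, dependency-free sketch. It assumes a made-up padding id (`PAD_ID`) and made-up token ids, not BETO's real vocabulary, and the helper `pad_or_truncate` is hypothetical. A real tokenizer also preserves special tokens when truncating, which this toy version ignores.

```python
# Toy illustration of fixed-length padding and truncation.
# PAD_ID and the token ids are invented values, not BETO's vocabulary.
PAD_ID = 0


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop the overflow
    n_pad = max_length - len(kept)         # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([4, 1037, 2088, 5], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because text messages are usually short, padding every example to the model's maximum length spends most of the compute on padding tokens. Padding each batch dynamically (for example with `DataCollatorWithPadding` from `transformers`) is a possible alternative, although it is not what this notebook does.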


In [20]:
# Evaluation metrics (spam, label 1, is the positive class)
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
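As a rough illustration of what `compute_metrics` calculates, the sketch below reproduces the same steps in plain Python: take the argmax of each row of logits, then compute binary precision, recall, F1, and accuracy with spam (label 1) as the positive class. The logits and labels are invented for the example, and `argmax` and `binary_metrics` are hypothetical helpers, not part of `transformers` or scikit-learn.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_metrics(labels, preds, positive=1):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Invented logits for six messages: column 0 = ham, column 1 = spam.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3], [0.4, 0.6], [-1.0, 2.0]]
labels = [0, 1, 1, 1, 0, 1]

preds = [argmax(row) for row in logits]
metrics = binary_metrics(labels, preds)
print(preds)
print(metrics)
```

In this toy batch one spam message is missed (a false negative) and one ham message is flagged (a false positive), so precision and recall are both 3/4.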

In [21]:
# Training configuration
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=7,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

# Run the fine-tuning
trainer.train()



{'eval_loss': 0.4344066381454468, 'eval_accuracy': 0.8516746411483254, 'eval_f1': 0.8622222222222222, 'eval_precision': 0.776, 'eval_recall': 0.97, 'eval_runtime': 5.9543, 'eval_samples_per_second': 35.101, 'eval_steps_per_second': 4.535, 'epoch': 1.0}
{'eval_loss': 0.42428478598594666, 'eval_accuracy': 0.9138755980861244, 'eval_f1': 0.9072164948453608, 'eval_precision': 0.9361702127659575, 'eval_recall': 0.88, 'eval_runtime': 5.9136, 'eval_samples_per_second': 35.342, 'eval_steps_per_second': 4.566, 'epoch': 2.0}
{'eval_loss': 0.46943262219429016, 'eval_accuracy': 0.9234449760765551, 'eval_f1': 0.9207920792079208, 'eval_precision': 0.9117647058823529, 'eval_recall': 0.93, 'eval_runtime': 5.8594, 'eval_samples_per_second': 35.669, 'eval_steps_per_second': 4.608, 'epoch': 3.0}
{'loss': 0.1395, 'grad_norm': 0.01971636898815632, 'learning_rate': 8.571428571428571e-06, 'epoch': 4.0}
{'eval_loss': 0.7996419668197632, 'eval_accuracy': 0.8660287081339713, 'eval_f1': 0.8703703703703703, 'eval_

TrainOutput(global_step=875, training_loss=0.0853114367893764, metrics={'train_runtime': 714.2189, 'train_samples_per_second': 9.781, 'train_steps_per_second': 1.225, 'train_loss': 0.0853114367893764, 'epoch': 7.0})
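The step count and the logged learning rate can be checked with simple arithmetic on the configuration. The sketch below assumes a single device, no gradient accumulation, and a linear decay schedule with no warmup, which appear to be the `Trainer` defaults when nothing else is set. `linear_lr` is a hypothetical helper written for this check, not a `transformers` function.

```python
import math

n_train = 998        # training rows reported by the dataset
batch_size = 8       # per_device_train_batch_size
epochs = 7           # num_train_epochs
base_lr = 2e-5       # learning_rate

# The last, partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate after `step` optimizer steps under linear warmup/decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# The training log above reports loss and learning rate at epoch 4.0.
logged_step = 4 * steps_per_epoch
lr_at_logged_step = linear_lr(logged_step, base_lr, total_steps)

print(steps_per_epoch, total_steps, logged_step, lr_at_logged_step)
```

Under these assumptions the arithmetic gives 125 steps per epoch and 875 steps in total, matching `global_step=875`, and the learning rate after step 500 agrees with the logged `8.571428571428571e-06`.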

# Classification Report

In [22]:
# Predict on the test set and build the classification report
predictions = trainer.predict(test_dataset)
y_true = predictions.label_ids
y_pred = predictions.predictions.argmax(-1)

# Show the classification report
report = classification_report(y_true, y_pred, target_names=["ham", "spam"])
print("\nClassification Report:")
print(report)


Classification Report:
              precision    recall  f1-score   support

         ham       0.94      0.87      0.90       109
        spam       0.87      0.94      0.90       100

    accuracy                           0.90       209
   macro avg       0.91      0.91      0.90       209
weighted avg       0.91      0.90      0.90       209
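The rows of the report can be traced back to a confusion matrix. The counts below are not printed anywhere in the notebook; they are inferred from the rounded recalls and the supports (94 of 100 spam and 95 of 109 ham classified correctly), so treat them as an assumption. The helper `prf` is hypothetical. The sketch shows how the per-class, macro, and weighted figures follow from those counts.

```python
# Confusion counts inferred from the rounded report (assumption, see text).
tn, fp, fn, tp = 95, 14, 6, 94   # spam is the positive class


def prf(tp, fp, fn):
    """Precision, recall and F1 for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


spam = prf(tp=tp, fp=fp, fn=fn)
# For ham, the roles swap: true negatives become its true positives.
ham = prf(tp=tn, fp=fn, fn=fp)

support_ham, support_spam = tn + fp, tp + fn
total = support_ham + support_spam
accuracy = (tp + tn) / total

# Macro average: unweighted mean over classes.
macro = [(h + s) / 2 for h, s in zip(ham, spam)]
# Weighted average: mean over classes weighted by support.
weighted = [(h * support_ham + s * support_spam) / total for h, s in zip(ham, spam)]

for name, row in [("ham", ham), ("spam", spam), ("macro avg", macro), ("weighted avg", weighted)]:
    print(f"{name:>12}  " + "  ".join(f"{v:.2f}" for v in row))
print(f"{'accuracy':>12}  {accuracy:.2f}")
```

Read this way, the 14 ham messages flagged as spam are what holds spam precision at 0.87, while only 6 spam messages slip through.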



In [14]:
dataset

DatasetDict({
    train: Dataset({
        features: ['mensaje', 'tipo', 'labels'],
        num_rows: 998
    })
    test: Dataset({
        features: ['mensaje', 'tipo', 'labels'],
        num_rows: 209
    })
})

### Classical machine learning

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from xgboost import XGBClassifier
import pandas as pd

# Convert to pandas
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

# Vectorize the text with TF-IDF
vectorizer = TfidfVectorizer(max_features=3000)
X_train = vectorizer.fit_transform(train_df["mensaje"])
X_test = vectorizer.transform(test_df["mensaje"])

# Labels
y_train = train_df["labels"]
y_test = test_df["labels"]

# Train the XGBoost model (use_label_encoder is dropped: recent XGBoost versions ignore it and warn)
xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train, y_train)

# Predict and evaluate
y_pred = xgb.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))



              precision    recall  f1-score   support

           0     0.7250    0.7982    0.7598       109
           1     0.7528    0.6700    0.7090       100

    accuracy                         0.7368       209
   macro avg     0.7389    0.7341    0.7344       209
weighted avg     0.7383    0.7368    0.7355       209
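For readers unfamiliar with TF-IDF, the following sketch computes it by hand on three invented messages. It follows what scikit-learn documents as the `TfidfVectorizer` defaults as far as I am aware: lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It does not implement `max_features`, which restricts the vocabulary to the most frequent terms. The messages and the helper names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Invented example messages (not taken from the dataset).
docs = [
    "gana un premio gratis ahora",
    "nos vemos ahora en casa",
    "premio gratis gratis",
]


def tokenize(text):
    """Lowercase and keep tokens of at least two word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many messages each term appears.
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Raw term counts times IDF, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


matrix = [tfidf_row(doc) for doc in tokenized]

for t in ("gana", "gratis"):
    print(t, round(idf[t], 4))
```

Terms that appear in fewer messages, such as "gana" here, receive a higher IDF than terms shared across messages, such as "gratis".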



In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Convert to pandas
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

# Vectorize the text
vectorizer = TfidfVectorizer(max_features=3000)

# Elastic-net model (LogisticRegression with penalty='elasticnet' and solver='saga')
model = LogisticRegression(
    penalty='elasticnet',
    solver='saga',
    max_iter=1000,
    l1_ratio=0.5
)

# Pipeline
pipeline = Pipeline([
    ("tfidf", vectorizer),
    ("clf", model)
])

# Train
pipeline.fit(train_df["mensaje"], train_df["labels"])

# Evaluate
y_pred = pipeline.predict(test_df["mensaje"])
print(classification_report(test_df["labels"], y_pred, digits=4))

              precision    recall  f1-score   support

           0     0.8500    0.7798    0.8134       109
           1     0.7798    0.8500    0.8134       100

    accuracy                         0.8134       209
   macro avg     0.8149    0.8149    0.8134       209
weighted avg     0.8164    0.8134    0.8134       209
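The `l1_ratio=0.5` setting mixes the L1 and L2 penalties in equal parts. The sketch below evaluates the penalty term in the form I understand scikit-learn to document for `penalty='elasticnet'`: `l1_ratio * ||w||_1 + (1 - l1_ratio) / 2 * ||w||_2^2`, added to `C` times the log-loss. The weights are invented and `elastic_net_penalty` is a hypothetical helper, not a scikit-learn function.

```python
def elastic_net_penalty(weights, l1_ratio):
    """Elastic-net penalty: a blend of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return l1_ratio * l1 + (1 - l1_ratio) / 2 * l2_squared


# Invented coefficient vector for four TF-IDF features.
weights = [0.5, -2.0, 0.0, 1.5]

pure_l2 = elastic_net_penalty(weights, l1_ratio=0.0)   # ridge-style penalty
mixed = elastic_net_penalty(weights, l1_ratio=0.5)     # setting used in this notebook
pure_l1 = elastic_net_penalty(weights, l1_ratio=1.0)   # lasso-style penalty

print(pure_l2, mixed, pure_l1)
```

The L1 part pushes coefficients of uninformative terms to exactly zero, while the L2 part keeps correlated terms from being dropped arbitrarily, which is a common reason to choose this mix for sparse text features.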



# Conclusions
1. In the run shown above, the fine-tuned BETO model reached 94% recall on the spam class, which is key for detecting fraudulent messages.
2. Overall accuracy was 90% and spam precision 87%, which means the model still classifies some legitimate messages as spam (false positives).
3. The model is effective at detecting fraud in Spanish, but it could be improved by tuning hyperparameters and adding more examples of legitimate messages (ham).
4. The current approach prioritizes catching as many fraudulent messages as possible, which is appropriate for this kind of task.

# Pysentimiento
https://huggingface.co/pysentimiento/robertuito-sentiment-analysis

In [12]:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")

analyzer.predict("Qué gran jugador es Messi")


AnalyzerOutput(output=POS, probas={POS: 0.946, NEU: 0.037, NEG: 0.017})

In [13]:
analyzer.predict("Este restaurante esta horrible, pesimo servicio")

AnalyzerOutput(output=NEG, probas={NEG: 0.983, NEU: 0.014, POS: 0.003})