# Toxic comment classification with Naive Bayes

In this notebook we train a classic **Naive Bayes (MultinomialNB)** model to detect
toxic comments (`IsToxic`) in the preprocessed dataset.


**Main objective:**
- Train a **Multinomial Naive Bayes** model using:
  - Clean dataset: `data/preprocessing_data/youtoxic_english_1000_clean.csv`
  - Processed text (`text_classic`)
  - Numeric *features* already computed during preprocessing:
    - `text_len_classic`
    - `word_count_classic`
    - `uppercase_ratio`
    - `exclamation_count`
    - `hate_words_count`
- Predict the target column: **`IsToxic`**

By the end of the notebook we will have:

- Trained the model with a **train/test split (80/20)**.
- Computed metrics: accuracy, precision, recall, F1, ROC-AUC and the confusion matrix.
- Saved:
  - The trained model (`.pkl`) in `backend/models/`
  - A results file (`.json`) in `data/results/`, following the agreed format.


## 1. Library imports and configuration

In this cell we import the libraries needed for:
- Data loading (`pandas`, `pathlib`)
- Classic modeling (`scikit-learn`)
- Metric computation
- Saving the model (`joblib`)
- Saving the results as JSON

The target column, the text column and the numeric columns are defined in section 3, and the model name in section 7.


In [9]:
# === 1. Import libraries =======================================

import json  # To save metrics in JSON format
from datetime import datetime  # To generate ISO timestamp
from pathlib import Path  # To handle file system paths

import pandas as pd  # DataFrame handling

# Machine learning (scikit-learn): data split, preprocessing and modeling
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.feature_extraction.text import TfidfVectorizer  # Text vectorization
from sklearn.compose import ColumnTransformer  # Combine text + numeric features
from sklearn.pipeline import Pipeline  # Build end-to-end ML pipeline
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes classifier (text)

# Metrics
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
    confusion_matrix,
)

# Persistence
import joblib

import warnings  # To silence some sklearn warnings
warnings.filterwarnings("ignore")

print("✅ Libraries imported successfully")


✅ Libraries imported successfully


## 2. Loading the preprocessed dataset

In this section we:

- Locate the project root.
- Load the clean file: `data/preprocessing_data/youtoxic_english_1000_clean.csv`.
- Check its dimensions and a few key columns.


In [10]:
# =============================================================================
# 2. LOAD THE PREPROCESSED DATASET
# =============================================================================

# Detect project root (assuming this notebook lives in backend/notebooks)
notebook_dir = Path.cwd()

if notebook_dir.name == "notebooks":
    project_root = notebook_dir.parent.parent
elif notebook_dir.name == "backend":
    project_root = notebook_dir.parent
else:
    project_root = notebook_dir

data_path = project_root / "data" / "preprocessing_data" / "youtoxic_english_1000_clean.csv"

print(f"📂 Project root: {project_root}")
print(f"📄 Loading dataset from: {data_path}")

if not data_path.exists():
    raise FileNotFoundError(
        f"Dataset not found at {data_path}.\n"
        "Check that the 'data' folder is at the project root and that\n"
        "'youtoxic_english_1000_clean.csv' is inside 'data/preprocessing_data'."
    )

# Load CSV
df = pd.read_csv(data_path)

print("\n📊 Dimensions of the clean dataset:")
print(f"   Rows:    {df.shape[0]}")
print(f"   Columns: {df.shape[1]}")

print("\n🔍 First rows:")
display(df.head(3))


📂 Project root: c:\dev\proyectos\PX_NLP_G4
📄 Loading dataset from: c:\dev\proyectos\PX_NLP_G4\data\preprocessing_data\youtoxic_english_1000_clean.csv

📊 Dimensions of the clean dataset:
   Rows:    997
   Columns: 18

🔍 First rows:


Unnamed: 0,CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,IsObscene,IsHatespeech,IsRacist,IsReligiousHate,text_basic,text_classic,text_len_classic,word_count_classic,uppercase_ratio,exclamation_count,hate_words_count
0,Ugg2KwwX0V8-aXgCoAEC,04kJtp6pVXI,If only people would just take a step back and...,False,False,False,False,False,False,False,False,If only people would just take a step back and...,people would take step back make case wasnt an...,850,129,0.014121,0,2
1,Ugg2s5AzSPioEXgCoAEC,04kJtp6pVXI,Law enforcement is not trained to shoot to app...,True,True,False,False,False,False,False,False,Law enforcement is not trained to shoot to app...,law enforcement trained shoot apprehend traine...,90,13,0.036232,0,3
2,Ugg3dWTOxryFfHgCoAEC,04kJtp6pVXI,\r\nDont you reckon them 'black lives matter' ...,True,True,False,False,True,False,False,False,Dont you reckon them 'black lives matter' bann...,dont reckon black life matter banner held whit...,252,40,0.002375,0,1
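
The root detection above depends on the name of the working directory. A possible alternative, sketched here under the assumption that the project root is the closest ancestor containing a `data` folder, walks up the directory tree instead. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, marker: str = "data") -> Path:
    """Return the closest directory at or above `start` that contains `marker`.

    Hypothetical helper: assumes the root is identified by a `data` folder.
    """
    start = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

In the notebook this would be used as `project_root = find_project_root()`. One caveat: if a folder named `data` also exists closer to the notebook (for example inside `backend`), that directory would be returned instead, so a more specific marker may be safer.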


## 3. Defining the input columns and the target variable

In this notebook we:

- Use the preprocessed text `text_classic` as the **main text feature**.
- Add the following **numeric features**:
  - `text_len_classic` → text length
  - `word_count_classic` → number of words
  - `uppercase_ratio` → fraction of uppercase letters
  - `exclamation_count` → number of exclamation marks
  - `hate_words_count` → number of hate words found
- Predict the binary variable **`IsToxic`** as the target.

To train the model on another label (`IsHatespeech`, `IsAbusive`, etc.), a teammate only needs to change
the value of `TARGET_COL` (and `model_name` in section 7, so the saved files are not overwritten).


In [11]:
# =============================================================================
# 3. FEATURE AND TARGET DEFINITION
# =============================================================================

# Target column for this notebook
TARGET_COL = "IsToxic"  # Change this if you want to model another label
TEXT_COL = "text_classic"

# Numeric features already prepared in the preprocessing step
numeric_features = [
    "text_len_classic",
    "word_count_classic",
    "uppercase_ratio",
    "exclamation_count",
    "hate_words_count",
]

print("🎯 Target column:", TARGET_COL)
print("📝 Text column:", TEXT_COL)
print("🔢 Numeric features:", numeric_features)

# Check that all needed columns exist
required_cols = [TEXT_COL, TARGET_COL] + numeric_features
missing_cols = [c for c in required_cols if c not in df.columns]

if missing_cols:
    raise ValueError(f"❌ Missing columns in the dataset: {missing_cols}")

print("\n✅ All required columns are present in the dataset")

# Quick check of target distribution
print(f"\n📊 Distribution of the target variable ({TARGET_COL}):")
print(df[TARGET_COL].value_counts(normalize=True).rename("ratio").to_frame())


🎯 Target column: IsToxic
📝 Text column: text_classic
🔢 Numeric features: ['text_len_classic', 'word_count_classic', 'uppercase_ratio', 'exclamation_count', 'hate_words_count']

✅ All required columns are present in the dataset

📊 Distribution of the target variable (IsToxic):
            ratio
IsToxic          
False    0.539619
True     0.460381
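
Because the model name is fixed in section 7, switching `TARGET_COL` to another label would overwrite the saved files of the `IsToxic` model. One way to avoid that, sketched below, is to derive the name from the label. The helper `model_name_for` and its naming scheme are assumptions made for illustration; only the name `naive_bayes_toxic_v1` comes from this notebook.

```python
def model_name_for(target_col: str, family: str = "naive_bayes", version: int = 1) -> str:
    """Build a model name such as 'naive_bayes_toxic_v1' from a label column.

    Hypothetical helper: the naming scheme is inferred from the single
    model name used in this notebook.
    """
    # Drop the leading "Is" of label columns like "IsToxic" or "IsAbusive".
    label = target_col[2:] if target_col.startswith("Is") else target_col
    return f"{family}_{label.lower()}_v{version}"


print(model_name_for("IsToxic"))       # naive_bayes_toxic_v1
print(model_name_for("IsHatespeech"))  # naive_bayes_hatespeech_v1
```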


## 4. Train/test split

We perform an **80/20 train/test split**, stratified by the target variable so that both subsets
keep the same class proportions.


In [12]:
# =============================================================================
# 4. TRAIN/TEST SPLIT
# =============================================================================

# Features (X) and target (y)
X = df[[TEXT_COL] + numeric_features]
y = df[TARGET_COL].astype(int)  # Make sure it's 0/1

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # Keep class distribution
)

print("📚 Set sizes:")
print(f"   X_train: {X_train.shape}")
print(f"   X_test:  {X_test.shape}")
print(f"   y_train: {y_train.shape}")
print(f"   y_test:  {y_test.shape}")

print(f"\n📊 Proportion of the positive class ({TARGET_COL} = 1):")
print(f"   Train: {y_train.mean():.3f}")
print(f"   Test:  {y_test.mean():.3f}")


📚 Set sizes:
   X_train: (797, 6)
   X_test:  (200, 6)
   y_train: (797,)
   y_test:  (200,)

📊 Proportion of the positive class (IsToxic = 1):
   Train: 0.460
   Test:  0.460
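
The sizes printed above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the test count up when `test_size` is a fraction, which is consistent with the 797/200 shapes shown. The number of positive rows is recovered from the class ratio printed in section 3.

```python
import math

n_samples = 997   # rows in the clean dataset
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to train.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Positive rows recovered from the printed ratio 0.460381.
n_positive = round(0.460381 * n_samples)

print(n_test, n_train, n_positive)  # 200 797 459
```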


## 5. Pipeline: TF-IDF + numeric features + Naive Bayes

We build a **scikit-learn Pipeline** that:

1. Applies `TfidfVectorizer` to the `text_classic` column (unigrams + bigrams).
2. Appends the numeric columns unchanged. They are non-negative, as `MultinomialNB` requires, but not rescaled: in the sample rows above `text_len_classic` reaches 850, while TF-IDF weights stay between 0 and 1.
3. Trains a `MultinomialNB` model on all of those features combined.

This way:

- A single object (`Pipeline`) holds both the preprocessing and the model.
- The model is easier to save and reuse later (`.pkl`).


In [13]:
# =============================================================================
# 5. PIPELINE DEFINITION (TF-IDF + NUMERIC FEATURES + NAIVE BAYES)
# =============================================================================

# Define TF-IDF vectorizer for text
tfidf_vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),       # unigrams + bigrams
    max_features=10000,       # limit vocabulary size
    min_df=2,                 # ignore very rare terms
    strip_accents="unicode",  # normalize accents
)

# ColumnTransformer to combine text and numeric features
preprocessor = ColumnTransformer(
    transformers=[
        # Apply TF-IDF on text column
        ("text", tfidf_vectorizer, TEXT_COL),
        # Pass numeric features through unchanged (MultinomialNB needs them non-negative)
        ("num", "passthrough", numeric_features),
    ]
)

# Build Pipeline: preprocessor + Naive Bayes classifier
nb_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", MultinomialNB()),
    ]
)

print("✅ Pipeline defined successfully")


✅ Pipeline defined successfully
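
Since the numeric columns are passed through on their raw scale, a possible variation is to rescale them to the same [0, 1] range as the TF-IDF weights. The sketch below is not the pipeline trained in this notebook: it swaps `"passthrough"` for `MinMaxScaler(clip=True)` (the `clip` parameter needs scikit-learn 0.24 or later) and runs on a toy DataFrame invented for illustration. Clipping keeps unseen values inside [0, 1], so `MultinomialNB` never receives a negative feature at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data invented for illustration; these are not rows from the real dataset.
toy = pd.DataFrame(
    {
        "text_classic": [
            "nice video thanks",
            "stupid idiot shut up",
            "great point well said",
            "idiot stupid people",
        ],
        "text_len_classic": [17, 20, 21, 19],
        "hate_words_count": [0, 2, 0, 2],
        "IsToxic": [0, 1, 0, 1],
    }
)
X_toy = toy.drop(columns="IsToxic")
y_toy = toy["IsToxic"]

scaled_pipeline = Pipeline(
    steps=[
        (
            "preprocessor",
            ColumnTransformer(
                transformers=[
                    ("text", TfidfVectorizer(), "text_classic"),
                    # Rescale numeric columns to [0, 1]; clip values unseen during fit.
                    ("num", MinMaxScaler(clip=True), ["text_len_classic", "hate_words_count"]),
                ]
            ),
        ),
        ("model", MultinomialNB()),
    ]
)
scaled_pipeline.fit(X_toy, y_toy)

# Every feature fed to the classifier now lies in the same range.
features = scaled_pipeline.named_steps["preprocessor"].transform(X_toy)
print(features.min(), features.max())
```

Whether this improves the metrics on the real dataset would have to be measured; it is offered only as an option to compare against the passthrough version.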


## 6. Model training and evaluation

We train the full pipeline and compute the following metrics on the test set:

- **Accuracy**
- **Precision**
- **Recall**
- **F1-Score**
- **ROC-AUC**
- **Confusion matrix** (TN, FP, FN, TP)


In [None]:
# =============================================================================
# 6. TRAINING AND EVALUATION
# =============================================================================

# Fit the pipeline on training data
print("⏳ Training Naive Bayes model...")
nb_pipeline.fit(X_train, y_train)
print("✅ Training complete")

# Predictions
y_pred = nb_pipeline.predict(X_test)
y_proba = nb_pipeline.predict_proba(X_test)[:, 1]  # Probabilities for positive class

# Compute metrics
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0
)
roc_auc = roc_auc_score(y_test, y_proba)

# labels=[0, 1] guarantees a 2x2 matrix even if one class is never predicted
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

print(f"\n📊 TEST METRICS (Naive Bayes - {TARGET_COL}):")
print(f"   Accuracy : {accuracy:.3f}")
print(f"   Precision: {precision:.3f}")
print(f"   Recall   : {recall:.3f}")
print(f"   F1-Score : {f1:.3f}")
print(f"   ROC-AUC  : {roc_auc:.3f}")

print("\n📌 Confusion matrix:")
print(f"   TN: {tn}   FP: {fp}")
print(f"   FN: {fn}   TP: {tp}")


⏳ Training Naive Bayes model...
✅ Training complete

📊 TEST METRICS (Naive Bayes - IsToxic):
   Accuracy : 0.760
   Precision: 0.833
   Recall   : 0.598
   F1-Score : 0.696
   ROC-AUC  : 0.801

📌 Confusion matrix:
   TN: 97   FP: 11
   FN: 37   TP: 55
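
As a sanity check, the threshold-based metrics can be recomputed from the four confusion-matrix counts alone, using the standard definitions. The sketch also adds a majority-class baseline (always predicting the most frequent class) for context; that baseline is not computed anywhere in the notebook itself.

```python
# Counts taken from the confusion matrix printed above.
tn, fp, fn, tp = 97, 11, 37, 55
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted-toxic comments that are toxic
recall = tp / (tp + fn)      # share of toxic comments that were detected
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts the majority class (non-toxic).
majority_baseline = max(tn + fp, fn + tp) / total

print(f"{accuracy:.3f} {precision:.3f} {recall:.3f} {f1:.3f}")  # 0.760 0.833 0.598 0.696
print(f"{majority_baseline:.3f}")                               # 0.540
```

The gap between precision and recall means the model is conservative: 11 non-toxic comments are flagged, while 37 of the 92 toxic comments in the test set are missed.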


## 7. Building the results JSON and saving the model

To compare models in a uniform way:

- We build a dictionary with the same JSON structure for **all models**.
- We save that JSON to `data/results/<model_name>.json`.
- We save the model (the full `Pipeline`) to `backend/models/<model_name>.pkl`.

This way, the comparison notebook only has to read the `.json` files in `data/results`.


In [16]:
# =============================================================================
# 7. BUILD THE RESULTS JSON AND SAVE
# =============================================================================

model_name = "naive_bayes_toxic_v1"

# Get TF-IDF feature count after fitting
fitted_tfidf = nb_pipeline.named_steps["preprocessor"].named_transformers_["text"]
n_features_text = len(fitted_tfidf.get_feature_names_out())
n_features_numeric = len(numeric_features)
n_samples = df.shape[0]

results_dict = {
    "model_name": model_name,
    "task": "binary_classification",
    "target_label": TARGET_COL,
    "data": {
        "n_samples": int(n_samples),
        "n_features_text": int(n_features_text),
        "n_features_numeric": int(n_features_numeric),
        "train_size": float(len(X_train) / len(df)),
        "test_size": float(len(X_test) / len(df)),
        "random_state": 42,
    },
    "metrics": {
        "accuracy": float(accuracy),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
        "roc_auc": float(roc_auc),
    },
    "confusion_matrix": {
        "tn": int(tn),
        "fp": int(fp),
        "fn": int(fn),
        "tp": int(tp),
    },
    "timestamp": datetime.now().isoformat(timespec="seconds"),
    "notes": "Naive Bayes + TF-IDF (1,2) + 5 numeric features on text_classic",
}

# Paths for saving
results_dir = project_root / "data" / "results"
results_dir.mkdir(parents=True, exist_ok=True)

models_dir = project_root / "backend" / "models"
models_dir.mkdir(parents=True, exist_ok=True)

json_path = results_dir / f"{model_name}.json"
model_path = models_dir / f"{model_name}.pkl"

# Save JSON
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(results_dict, f, indent=2, ensure_ascii=False)

# Save model (the full Pipeline: preprocessing + classifier)
joblib.dump(nb_pipeline, model_path)

print("\n💾 Files saved:")
print(f"   📁 Results JSON: {json_path}")
print(f"   📁 Model (.pkl): {model_path}")


💾 Files saved:
   📁 Results JSON: c:\dev\proyectos\PX_NLP_G4\data\results\naive_bayes_toxic_v1.json
   📁 Model (.pkl): c:\dev\proyectos\PX_NLP_G4\backend\models\naive_bayes_toxic_v1.pkl
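
On the consuming side, the comparison notebook could read these files back with a few lines of standard-library code. This is a minimal sketch under the assumption that every file in `data/results` follows the structure built above, with a `model_name` key and a `metrics` dictionary. The function names `load_results` and `rank_by_metric` are hypothetical.

```python
import json
from pathlib import Path


def load_results(results_dir: Path) -> list:
    """Read every results JSON in `results_dir` into a list of dictionaries."""
    results = []
    for json_file in sorted(results_dir.glob("*.json")):
        with open(json_file, encoding="utf-8") as f:
            results.append(json.load(f))
    return results


def rank_by_metric(results: list, metric: str = "f1") -> list:
    """Sort result dictionaries from best to worst on the chosen metric."""
    return sorted(results, key=lambda r: r["metrics"][metric], reverse=True)
```

For example, `rank_by_metric(load_results(project_root / "data" / "results"), "roc_auc")` would list the models from highest to lowest ROC-AUC.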
