# NLP Feedback Modeling Framework (NLP + Survey Analytics)
**Author:** David José Parales Araujo  
**Goal:** Replicable framework for classifying textual feedback (binary and multiclass) and integrating survey analytics (Likert) to build a **Performance Index (0–100)**.  
**Applications:** Education (student→teacher / leadership→teacher / self-assessment), HR Analytics, Customer Feedback, Service Quality.

> This notebook uses a **synthetic** dataset for demonstration. Replace it with your own anonymized real dataset.

## 1) Setup

In [None]:
# If you run this in Colab, uncomment the line below.
# !pip -q install pandas numpy scikit-learn nltk matplotlib joblib imbalanced-learn

import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    f1_score
)
import matplotlib.pyplot as plt

# NLP utils
import nltk
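nltk.download("punkt", quiet=True)
# Newer NLTK releases load tokenizer data from "punkt_tab" instead of "punkt".
nltk.download("punkt_tab", quiet=True)
nltk.download("stopwords", quiet=True)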
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Optional (only needed if you want SMOTE)
try:
    from imblearn.over_sampling import SMOTE
    IMBLEARN_AVAILABLE = True
except ImportError:
    IMBLEARN_AVAILABLE = False

SEED = 42
np.random.seed(SEED)

## 2) Problem Definition

### Supported scenarios
1. **Binary:** `risk` vs `ok`.
2. **Multiclass:** `Negativa / Neutral / Positiva`.
3. **Continuous score:** **0–100 index** (derived from Likert + text signals).

### Design note (important)
- **Labels must NOT be derived from the same text with ad-hoc rules** if you are going to train a supervised model (risk of leakage).
- In this framework, the label can be derived from the **structured survey** (Likert), while the text is used as the predictor.

## 3) Data Model (schema recomendado)

Recommended columns (minimum viable):

- `source`: student / leadership / self  
- `year`: year  
- `level`: grade/year (e.g., 5)  
- `role_evaluator`: student / director / self  
- `likert_avg`: Likert average normalized to **0–1**  
- `text_feedback`: free-text comment

Derived targets:
- `performance_index` (0–100)
- `target_binary` (0/1)
- `target_multiclass` (Negativa/Neutral/Positiva)
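
The schema expects `likert_avg` to arrive already normalized to 0–1. A minimal sketch of that normalization, assuming raw responses on a 1–5 scale (the scale bounds and the helper name `normalize_likert` are illustrative assumptions, not part of the notebook):

```python
def normalize_likert(responses, scale_min=1, scale_max=5):
    """Average raw Likert responses and rescale the mean to the 0-1 range."""
    if not responses:
        raise ValueError("at least one response is required")
    if any(r < scale_min or r > scale_max for r in responses):
        raise ValueError("response outside the declared scale")
    mean = sum(responses) / len(responses)
    return (mean - scale_min) / (scale_max - scale_min)


print(normalize_likert([4, 5, 3, 4]))  # mean 4.0 on a 1-5 scale -> 0.75
```

If your survey uses a different scale (for example 1–7), pass the matching bounds so the result stays comparable across instruments.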

## 4) Build a Synthetic Dataset (demo)

In [None]:
positive_texts = [
    "Explica con claridad y responde dudas con paciencia.",
    "Las clases son dinámicas y se nota dominio del tema.",
    "Motiva al curso y brinda material útil.",
    "Evalúa de forma justa y retroalimenta con detalle."
]

neutral_texts = [
    "Algunas clases son buenas, otras podrían mejorar.",
    "A veces explica claro, a veces rápido.",
    "El ritmo es variable, en general cumple."
]

negative_texts = [
    "Las explicaciones son confusas y desorganizadas.",
    "No responde bien a las preguntas y llega tarde.",
    "Las clases son aburridas y poco productivas.",
    "Las evaluaciones no reflejan lo visto en clase."
]

def make_rows(n=600, p_pos=0.65, p_neu=0.20, p_neg=0.15):
    # p_pos is implicit: rows not drawn as negative or neutral become positive.
    rows = []
    for _ in range(n):
        r = np.random.rand()
        if r < p_neg:
            txt = np.random.choice(negative_texts)
            likert = np.random.uniform(0.15, 0.55)
        elif r < p_neg + p_neu:
            txt = np.random.choice(neutral_texts)
            likert = np.random.uniform(0.45, 0.75)
        else:
            txt = np.random.choice(positive_texts)
            likert = np.random.uniform(0.65, 0.95)

        source = np.random.choice(["student", "leadership", "self"], p=[0.7, 0.2, 0.1])
        role = {"student":"student", "leadership":"director", "self":"self"}[source]
        year = np.random.choice([2023, 2024, 2025])
        level = np.random.choice([1,2,3,4,5])

        rows.append([source, year, level, role, float(likert), txt])
    return pd.DataFrame(rows, columns=["source","year","level","role_evaluator","likert_avg","text_feedback"])

df = make_rows(n=900)
df.head()

## 5) Preprocessing Pipeline (Spanish)

In [None]:
STOPWORDS_ES = set(stopwords.words("spanish"))

def preprocess_text(text: str) -> str:
    text = (text or "").lower()
    text = re.sub(r"[^a-záéíóúñü\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    tokens = word_tokenize(text, language="spanish")
    tokens = [t for t in tokens if t not in STOPWORDS_ES and len(t) > 2]
    return " ".join(tokens)

df["text_clean"] = df["text_feedback"].apply(preprocess_text)
df[["text_feedback","text_clean"]].head()
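
One caveat with the cleaning step above: NLTK's Spanish stopword list includes `no`, and the `len(t) > 2` filter drops two-letter tokens anyway, so negations disappear before vectorization. Below is a hedged sketch of a variant that keeps a small set of negation tokens. It uses `str.split` and a tiny hand-written stopword set purely for illustration; `DEMO_STOPWORDS`, `NEGATIONS`, and `preprocess_keep_negations` are assumed names, not part of the notebook.

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's Spanish list.
DEMO_STOPWORDS = {"a", "las", "los", "el", "la", "de", "y", "no", "se", "con"}
# Negation tokens to preserve even though they are short stopwords.
NEGATIONS = {"no", "ni", "nunca", "sin"}


def preprocess_keep_negations(text, stopwords=DEMO_STOPWORDS, keep=NEGATIONS):
    text = (text or "").lower()
    text = re.sub(r"[^a-záéíóúñü\s]", " ", text)
    tokens = text.split()
    # Keep negations unconditionally; apply the usual filters to everything else.
    return " ".join(
        t for t in tokens
        if t in keep or (t not in stopwords and len(t) > 2)
    )


example = "No responde bien a las preguntas y llega tarde."
print(preprocess_keep_negations(example))
# no responde bien preguntas llega tarde
```

With bigrams enabled in the TF-IDF step, keeping `no` lets the model see features such as `no responde` instead of only `responde bien`.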

## 6) Survey Analytics → Performance Index (0–100) + Targets

In [None]:
df["performance_index"] = (df["likert_avg"] * 100).round().astype(int)

def multiclass_from_index(idx: int) -> str:
    if idx < 50:
        return "Negativa"
    if idx <= 75:
        return "Neutral"
    return "Positiva"

df["target_multiclass"] = df["performance_index"].apply(multiclass_from_index)
df["target_binary"] = (df["performance_index"] >= 60).astype(int)  # 1 = OK, 0 = Risk

df[["likert_avg","performance_index","target_binary","target_multiclass"]].head()
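
The two targets above use different cut points: the binary cutoff (60) falls inside the Neutral band (50–75), so Neutral rows end up split between `risk` and `ok`. A small standalone sketch that prints the labels at the boundary values; `binary_from_index` is a hypothetical helper that mirrors the `>= 60` rule:

```python
def multiclass_from_index(idx):
    if idx < 50:
        return "Negativa"
    if idx <= 75:
        return "Neutral"
    return "Positiva"


def binary_from_index(idx, cutoff=60):
    # 1 = ok, 0 = risk, same convention as target_binary.
    return int(idx >= cutoff)


for idx in (49, 50, 59, 60, 75, 76):
    print(idx, multiclass_from_index(idx), binary_from_index(idx))
# 49 Negativa 0
# 50 Neutral 0
# 59 Neutral 0
# 60 Neutral 1
# 75 Neutral 1
# 76 Positiva 1
```

Whether that overlap is desirable depends on how the two targets are consumed; aligning the cutoffs is a design decision to make before using real data.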

## 7) Train/Test Split

In [None]:
X = df["text_clean"].values
y_bin = df["target_binary"].values
y_multi = df["target_multiclass"].values

X_train, X_test, yb_train, yb_test = train_test_split(
    X, y_bin, test_size=0.2, random_state=SEED, stratify=y_bin
)

Xm_train, Xm_test, ym_train, ym_test = train_test_split(
    X, y_multi, test_size=0.2, random_state=SEED, stratify=y_multi
)

print("Binary class distribution:", np.bincount(yb_train), "train |", np.bincount(yb_test), "test")
print("Multiclass distribution:\n", pd.Series(ym_train).value_counts())

## 8) Baselines: TF-IDF + Logistic Regression / Linear SVM (Binary)

In [None]:
binary_lr = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), max_features=5000)),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced", random_state=SEED))
])

binary_svm = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), max_features=5000)),
    ("clf", LinearSVC(class_weight="balanced", random_state=SEED))
])

binary_lr.fit(X_train, yb_train)
pred_lr = binary_lr.predict(X_test)

binary_svm.fit(X_train, yb_train)
pred_svm = binary_svm.predict(X_test)

print("=== Logistic Regression (Binary) ===")
print(classification_report(yb_test, pred_lr, target_names=["risk(0)","ok(1)"]))

print("=== Linear SVM (Binary) ===")
print(classification_report(yb_test, pred_svm, target_names=["risk(0)","ok(1)"]))

### Confusion Matrix (Binary)

In [None]:
fig, ax = plt.subplots(figsize=(5,4))
ConfusionMatrixDisplay.from_predictions(yb_test, pred_lr, display_labels=["risk","ok"], ax=ax)
ax.set_title("Binary - Logistic Regression")
plt.show()

## 9) Cross-Validation (Binary) — Macro F1

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = cross_val_score(binary_lr, X, y_bin, cv=cv, scoring="f1_macro")
print("Macro F1 (5-fold):", round(scores.mean(), 4), "+/-", round(scores.std(), 4))
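
For readers unfamiliar with the `f1_macro` scorer: it computes F1 per class and takes the unweighted mean, so the minority class counts as much as the majority class. A plain-Python sketch of that arithmetic on a toy example (the helper names are illustrative, not scikit-learn functions):

```python
def f1_for_class(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is treated as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1]
print(round(macro_f1(y_true, y_pred), 4))  # (0.6667 + 0.8889) / 2 = 0.7778
```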

## 10) Threshold Tuning (Binary): using probabilities (LR)

In [None]:
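from sklearn.model_selection import cross_val_predict

# Tune the threshold on out-of-fold training probabilities so the test set
# stays untouched until the final evaluation (tuning on it inflates the score).
cv_thr = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
proba_oof = cross_val_predict(
    binary_lr, X_train, yb_train, cv=cv_thr, method="predict_proba"
)[:, 1]

thresholds = np.linspace(0.2, 0.8, 13)
best_thr, best_f1 = None, -1

for thr in thresholds:
    pred_thr = (proba_oof >= thr).astype(int)
    f1 = f1_score(yb_train, pred_thr, average="macro")
    if f1 > best_f1:
        best_f1 = f1
        best_thr = float(thr)

print("Best threshold:", best_thr, "Macro F1 (out-of-fold):", round(best_f1, 4))

# Apply the chosen threshold once to the held-out test set.
proba = binary_lr.predict_proba(X_test)[:, 1]
pred_best = (proba >= best_thr).astype(int)
print(classification_report(yb_test, pred_best, target_names=["risk(0)","ok(1)"]))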

## 11) Multiclass Modeling (Negativa / Neutral / Positiva)

In [None]:
multi_lr = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), max_features=8000)),
    ("clf", LogisticRegression(max_iter=3000, class_weight="balanced", random_state=SEED))
])

multi_lr.fit(Xm_train, ym_train)
pred_m = multi_lr.predict(Xm_test)

print(classification_report(ym_test, pred_m))

### Confusion Matrix (Multiclass)

In [None]:
fig, ax = plt.subplots(figsize=(6,5))
ConfusionMatrixDisplay.from_predictions(ym_test, pred_m, ax=ax, xticks_rotation=45)
ax.set_title("Multiclass - Logistic Regression")
plt.show()

## 12) Imbalance Handling (Optional): SMOTE (requires imbalanced-learn)

In [None]:
if IMBLEARN_AVAILABLE:
    vec = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
    Xv = vec.fit_transform(X_train)
    sm = SMOTE(random_state=SEED)
    X_res, y_res = sm.fit_resample(Xv, yb_train)
    clf = LogisticRegression(max_iter=2000, random_state=SEED)
    clf.fit(X_res, y_res)
    Xt = vec.transform(X_test)
    pred = clf.predict(Xt)
    print(classification_report(yb_test, pred, target_names=["risk(0)","ok(1)"]))
else:
    print("SMOTE not available. Install with: pip install imbalanced-learn")
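
For intuition about what SMOTE does: it creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The toy sketch below shows only that interpolation step in plain Python. It is not the imbalanced-learn implementation: real SMOTE picks at random among the k nearest neighbors, whereas this sketch uses the single nearest one, and `nearest_neighbor` and `synth_sample` are hypothetical helpers.

```python
import random


def nearest_neighbor(point, others):
    """Return the element of `others` closest to `point` (squared Euclidean)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))


def synth_sample(point, minority, rng):
    """Interpolate between `point` and its nearest minority neighbor."""
    others = [m for m in minority if m is not point]
    neighbor = nearest_neighbor(point, others)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(point, neighbor)]


rng = random.Random(42)
minority = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
new_point = synth_sample(minority[0], minority, rng)
print(new_point)  # a point on the segment between [0, 0] and [1, 1]
```

Because the synthetic points live in TF-IDF space rather than in text space, they do not correspond to real sentences; that is worth keeping in mind when interpreting results.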

## 13) Composite Performance Index (0–100): Survey + NLP Integration

In [None]:
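# Probability of "ok" from the text model. Rows that were used to fit binary_lr
# are scored in-sample here, so their probabilities are optimistic.
prob_ok_text = binary_lr.predict_proba(df["text_clean"])[:, 1]
score_likert = df["likert_avg"].values  # 0-1

# Weighted blend: 60% survey signal, 40% text signal, rescaled to 0-100.
score_comp = (0.6 * score_likert + 0.4 * prob_ok_text) * 100
df["performance_index_composite"] = score_comp.round().astype(int)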

df[["performance_index","performance_index_composite"]].describe()
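
The composite score is a weighted average of the two signals. A worked sketch of the same arithmetic for a single row, using the 0.6 / 0.4 weights from the cell above (`composite_index` is a hypothetical helper, and exposing the weight as a parameter is an assumption for illustration):

```python
def composite_index(likert_avg, prob_ok, w_likert=0.6):
    """Blend a 0-1 Likert average with a 0-1 text probability into a 0-100 index."""
    if not (0.0 <= likert_avg <= 1.0 and 0.0 <= prob_ok <= 1.0):
        raise ValueError("both inputs must be in [0, 1]")
    w_text = 1.0 - w_likert
    return round((w_likert * likert_avg + w_text * prob_ok) * 100)


print(composite_index(0.80, 0.50))  # (0.48 + 0.20) * 100 = 68
print(composite_index(0.40, 0.90))  # (0.24 + 0.36) * 100 = 60
```

The second row shows how a confident positive text score can lift a weak survey score; the weights control how much of that lift is allowed.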

## 14) Aggregations: ready for Power BI / Looker

In [None]:
agg = (df
       .groupby(["year","level","role_evaluator"], as_index=False)
       .agg(
           n=("text_feedback","count"),
           index_mean=("performance_index_composite","mean"),
           index_p25=("performance_index_composite", lambda x: np.percentile(x, 25)),
           index_p75=("performance_index_composite", lambda x: np.percentile(x, 75)),
       ))
agg.head(10)

## 15) Ethical & Interpretability Considerations

- Do not upload real feedback to public repositories without anonymization and permission.
- Use human review for borderline cases.
- For interpretability: inspect the highest-weighted tokens in linear models.
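
One simple way to operationalize human review of borderline cases is to flag predictions whose probability lies close to the decision threshold. A minimal sketch, assuming a threshold of 0.5 and a review margin of 0.1 (both values and the helper name `needs_human_review` are illustrative assumptions):

```python
def needs_human_review(prob_ok, threshold=0.5, margin=0.1):
    """Flag a prediction when its probability is within `margin` of the threshold."""
    return abs(prob_ok - threshold) <= margin


probs = [0.05, 0.45, 0.58, 0.93]
flags = [needs_human_review(p) for p in probs]
print(flags)  # [False, True, True, False]
```

A wider margin sends more cases to reviewers; the right value depends on review capacity and on the cost of a wrong automatic label.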

In [None]:
tfidf = binary_lr.named_steps["tfidf"]
clf = binary_lr.named_steps["clf"]

feature_names = np.array(tfidf.get_feature_names_out())
coefs = clf.coef_.ravel()

top_ok = feature_names[np.argsort(coefs)[-15:]][::-1]
top_risk = feature_names[np.argsort(coefs)[:15]]

print("Top tokens pushing toward OK (1):\n", top_ok)
print("\nTop tokens pushing toward RISK (0):\n", top_risk)

## 16) How to plug your real data

Recommended minimum CSV:

`source,year,level,role_evaluator,likert_avg,text_feedback`
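
Before feeding a real file into the pipeline, it helps to check that the header and the `likert_avg` range match the schema. A standard-library sketch of such a check; `load_feedback_rows` is a hypothetical helper, and the sample row is made up for illustration:

```python
import csv
import io

REQUIRED_COLUMNS = [
    "source", "year", "level", "role_evaluator", "likert_avg", "text_feedback",
]


def load_feedback_rows(file_obj):
    """Read CSV rows, verifying required columns and the 0-1 range of likert_avg."""
    reader = csv.DictReader(file_obj)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        likert = float(row["likert_avg"])
        if not 0.0 <= likert <= 1.0:
            raise ValueError(f"line {line_no}: likert_avg must be in [0, 1], got {likert}")
        row["likert_avg"] = likert
        rows.append(row)
    return rows


sample = io.StringIO(
    "source,year,level,role_evaluator,likert_avg,text_feedback\n"
    "student,2024,5,student,0.82,Explica con claridad.\n"
)
rows = load_feedback_rows(sample)
print(len(rows), rows[0]["likert_avg"])  # 1 0.82
```

With a real file, pass an open handle instead of the `StringIO` object, for example `open(path, newline="", encoding="utf-8")`.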

✅ This notebook is ready to demo, reproducible, and industry-oriented.