![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie Genre Classification

The purpose of this project is for you to put into practice, in your respective working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. As you work on it, follow the instructions given in the "Project 2 Guide: Movie Genre Classification".

**Submission**: The project must be submitted during week 8. However, it is important to make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF format to the project submission activity in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The goal is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
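
Because a movie can carry several genres at once, the target is a multi-label one: each movie maps to a 0/1 vector with one entry per genre. The following is a minimal sketch of that encoding using only the standard library. The helper names `parse_genres` and `multi_hot` are hypothetical, and the two rows are toy values in the same string format as the `genres` column.

```python
import ast

def parse_genres(raw):
    """Parse a string such as "['Comedy', 'Crime']" into a list of genre names."""
    # ast.literal_eval only accepts Python literals, so it cannot run arbitrary code
    return list(ast.literal_eval(raw))

def multi_hot(genres, classes):
    """Return one 0/1 entry per class, in the order given by `classes`."""
    present = set(genres)
    return [1 if c in present else 0 for c in classes]

rows = ["['Comedy', 'Crime', 'Horror']", "['Drama']"]
parsed = [parse_genres(r) for r in rows]
classes = sorted({g for genres in parsed for g in genres})
targets = [multi_hot(g, classes) for g in parsed]

print(classes)   # alphabetical list of the genres seen in the toy rows
print(targets)   # one 0/1 vector per movie
```

This is the same shape of output that `MultiLabelBinarizer` produces for the real data: one column per genre, sorted by name.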

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting on the test set for submission to Kaggle
This section shows the format in which the prediction results must be saved so that they can be uploaded to the Kaggle competition.

In [59]:
import warnings
warnings.filterwarnings('ignore')

In [60]:
# Library imports
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [61]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [62]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [63]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


# **1. Data preprocessing**

In [64]:
# Build the predictor by combining title and plot (year is not used)
dataTraining['combined'] = dataTraining['title'] + ' ' + dataTraining['plot']

# Final X to vectorize (still raw text at this point)
X_dtm = dataTraining['combined']

In [65]:
# Definition of the target variable (y)
import ast
# ast.literal_eval parses the list literal without executing arbitrary code
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

## **1.1. Split into training and validation data**

In [66]:
# Split the predictors (X) and the target (y) into training and validation sets using train_test_split
X_train_plot, X_test_plot, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

## **1.2. Fitting on the training and validation data**

In [34]:
# Vectorization of the predictor (X): fit on the training split only, then transform the validation split
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_vec = vectorizer.fit_transform(X_train_plot)
X_test_vec = vectorizer.transform(X_test_plot)  

# **2. Model calibration**

## **2.1. Parameter calibration**

In [None]:
# Model definition and training
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train_vec, y_train_genres)

## **2.2. Justification of the selected calibration method**

For example:

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer, f1_score

pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('clf', OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, random_state=42)))
])

param_grid = {
    'vect__max_features': [500, 1000, 2000, 3000],
    'clf__estimator__n_estimators': [50, 100, 200],
    'clf__estimator__max_depth': [10, 20]
}

scorer = make_scorer(f1_score, average='macro')

grid = GridSearchCV(pipeline, param_grid, scoring=scorer, cv=3, verbose=1)
grid.fit(X_train_plot, y_train_genres)

print("Best parameters:", grid.best_params_)
print("Best F1-score:", grid.best_score_)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best parameters: {'clf__estimator__max_depth': 20, 'clf__estimator__n_estimators': 50, 'vect__max_features': 500}
Best F1-score: 0.11345350928830088
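
The best macro F1 reported above is about 0.11. Macro averaging gives every genre the same weight, so any genre the model never predicts contributes an F1 of 0 and pulls the average down, however well the frequent genres are handled. The sketch below computes macro F1 by hand on toy indicator matrices to show the effect. The function name `macro_f1` is hypothetical, and treating a label with no true and no predicted positives as F1 = 0 is an assumption of this sketch.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 for 0/1 indicator matrices (rows = samples)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = ((y_true == 1) & (y_pred == 1)).sum(axis=0)
    fp = ((y_true == 0) & (y_pred == 1)).sum(axis=0)
    fn = ((y_true == 1) & (y_pred == 0)).sum(axis=0)
    denom = 2 * tp + fp + fn
    # Assumption: a label with no true and no predicted positives scores 0
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f1.mean(), f1

# Toy data: label 0 is predicted perfectly, labels 1 and 2 are never predicted
y_true = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]]

score, per_label = macro_f1(y_true, y_pred)
print(per_label)  # per-label F1
print(score)      # their unweighted mean
```

Here one perfectly predicted label and two ignored labels give a macro F1 of one third, which is the pattern to look for when a depth-limited forest predicts almost no positives for rare genres.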


## 2.3. Analysis of the calibrated value of each parameter

# **3. Model training**

## 3.1. Training the model with the training data and the optimal parameters

In [None]:
# OPTION 1

from sklearn.model_selection import cross_val_score

# Model definition and training
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train_vec, y_train_genres)

# 10-fold cross-validation (cv=10) scored with macro F1
scores = cross_val_score(clf, X_train_vec, y_train_genres, cv=10, scoring='f1_macro')
print(f"\nMean F1-score of the calibrated model (cv=10): {np.mean(scores):.3f}")

# Predictions on the held-out validation split
y_pred_rf = clf.predict(X_test_vec)
print(y_pred_rf.shape)

In [67]:
# OPTION 2 - adding bigrams

vectorizer = TfidfVectorizer(stop_words='english', max_features=5000, ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train_plot)
X_test_vec = vectorizer.transform(X_test_plot)

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(
    LogisticRegression(C=2.0, max_iter=2000, solver='lbfgs')  # C > 1 weakens the regularization
)
clf.fit(X_train_vec, y_train_genres)

y_proba = clf.predict_proba(X_test_vec)
y_pred = (y_proba >= 0.3).astype(int)
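
The cell above applies one cutoff, 0.3, to every genre. Since genres differ widely in frequency, one alternative is to pick a cutoff per genre on the validation split. The sketch below is an illustration of that idea on toy arrays, not the method used in this notebook; `f1_binary`, `best_thresholds`, and the candidate grid are hypothetical.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one label given 0/1 vectors; returns 0.0 when nothing is positive."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_thresholds(y_true, y_proba, grid=(0.2, 0.3, 0.4, 0.5)):
    """For each label column, pick the grid value with the highest F1."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    chosen = []
    for j in range(y_true.shape[1]):
        scores = [f1_binary(y_true[:, j], (y_proba[:, j] >= t).astype(int))
                  for t in grid]
        chosen.append(grid[int(np.argmax(scores))])  # ties go to the first grid value
    return np.array(chosen)

# Toy validation data: 4 movies, 2 genres
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
y_proba = np.array([[0.9, 0.1], [0.35, 0.2], [0.25, 0.45], [0.1, 0.35]])

thresholds = best_thresholds(y_true, y_proba)
y_pred = (y_proba >= thresholds).astype(int)  # broadcasts one cutoff per column
print(thresholds)
print(y_pred)
```

Cutoffs tuned this way are fitted to the validation split, so the resulting F1 is optimistic; a separate split or cross-validation would be needed for an unbiased estimate.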

In [68]:
# Repeat the title + plot combination for dataTesting
dataTesting['combined'] = dataTesting['title'] + ' ' + dataTesting['plot']
X_test_competencia = vectorizer.transform(dataTesting['combined'])

# Probabilities and binarization
y_proba_kaggle = clf.predict_proba(X_test_competencia)
y_pred_kaggle = (y_proba_kaggle >= 0.3).astype(int)

# Save CSV
cols = ['p_' + genre for genre in le.classes_]
res = pd.DataFrame(y_pred_kaggle, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_robust.csv', index_label='ID')
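
The cell above writes thresholded 0/1 labels, whereas the project statement asks for the probability of each genre. If the competition scores probabilities (an assumption to check on the Kaggle page), the same `p_<genre>` layout can be filled with the output of `predict_proba` directly. The sketch below uses toy values; `build_submission` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def build_submission(proba, ids, classes):
    """Wrap a (n_samples, n_genres) probability array in the p_<genre> layout."""
    cols = ['p_' + c for c in classes]
    return pd.DataFrame(np.asarray(proba),
                        index=pd.Index(ids, name='ID'),
                        columns=cols)

# Toy values standing in for clf.predict_proba(...), dataTesting.index and le.classes_
demo = build_submission([[0.8, 0.1], [0.2, 0.7]],
                        ids=[1, 4],
                        classes=['Action', 'Drama'])
print(demo.to_csv())
```

With the real objects this would be called as `build_submission(y_proba_kaggle, dataTesting.index, le.classes_)` followed by `to_csv` with a file name of your choice.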

In [79]:
# OPTION 3

# -*- coding: utf-8 -*-
import ast
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, f1_score
import joblib

# -------------------------------------------
# 1. Load data
# -------------------------------------------
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='utf-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='utf-8', index_col=0)

# -------------------------------------------
# 2. Preprocessing
# -------------------------------------------
# Combine title and plot
dataTraining['combined'] = dataTraining['title'] + ' ' + dataTraining['plot']
dataTesting['combined'] = dataTesting['title'] + ' ' + dataTesting['plot']

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000, ngram_range=(1, 2), sublinear_tf=True, min_df=3)
X_dtm = vectorizer.fit_transform(dataTraining['combined'])

# Multi-label target variable (literal_eval parses the list without executing code)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()
y_genres = mlb.fit_transform(dataTraining['genres'])

# -------------------------------------------
# 3. Train/validation split
# -------------------------------------------
X_train_vec, X_test_vec, y_train_genres, y_test_genres = train_test_split(
    X_dtm, y_genres, test_size=0.2, random_state=42)

# -------------------------------------------
# 4. Robust model with LightGBM
# -------------------------------------------
clf = OneVsRestClassifier(LGBMClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=12,
    num_leaves=128,
    random_state=42,
    n_jobs=-1
))

clf.fit(X_train_vec, y_train_genres)

# -------------------------------------------
# 5. Evaluation with threshold tuning
# -------------------------------------------
y_proba = clf.predict_proba(X_test_vec)

print("\nEvaluation at different thresholds:")
for t in [0.5, 0.4, 0.35, 0.3]:
    y_pred = (y_proba >= t).astype(int)
    score = f1_score(y_test_genres, y_pred, average='macro')
    print(f"Threshold {t} - Macro F1-score: {score:.3f}")

# Final selected threshold
threshold = 0.3
y_pred_final = (y_proba >= threshold).astype(int)
print("\nPer-genre report:")
print(classification_report(y_test_genres, y_pred_final, target_names=mlb.classes_))

# -------------------------------------------
# 6. Cross-validation (macro F1)
# -------------------------------------------
cv_scores = cross_val_score(clf, X_train_vec, y_train_genres, cv=5, scoring='f1_macro')
print(f"\nMean cross-validation F1-macro: {cv_scores.mean():.3f}")

# -------------------------------------------
# 7. Prediction on the test set for Kaggle
# -------------------------------------------
X_test_kaggle = vectorizer.transform(dataTesting['combined'])
y_pred_kaggle = (clf.predict_proba(X_test_kaggle) >= threshold).astype(int)

# Save the results to CSV
cols = ['p_' + g for g in mlb.classes_]
submission = pd.DataFrame(y_pred_kaggle, index=dataTesting.index, columns=cols)
submission.to_csv('pred_genres_lightgbm.csv', index_label='ID')

# -------------------------------------------
# 8. Save model and objects
# -------------------------------------------
joblib.dump(clf, 'clf_genero.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(mlb, 'binarizer.pkl')

print("\n✅ Training and export completed. File ready: pred_genres_lightgbm.csv")


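
In the cell above, the TF-IDF vectorizer is fit on all of `dataTraining` before `train_test_split`, so the held-out rows already contribute to the vocabulary and IDF weights, and `cross_val_score` then runs on pre-vectorized features. One way to avoid that is to put the vectorizer and the classifier in a `Pipeline`, so that each fold refits both on its own training part. The sketch below shows the pattern on toy texts; `LogisticRegression` stands in for the LightGBM model purely to keep the example light, and the texts and labels are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy synopses and genres (illustrative only)
texts = [
    "a detective hunts a killer in the city",
    "two friends plan a funny wedding prank",
    "a funny thief robs a bank",
    "police chase a killer across town",
    "a clumsy waiter causes funny chaos",
    "a gang robs a bank and the police follow",
    "a funny detective fails to catch a thief",
    "friends throw a funny surprise party",
]
labels = [["Crime"], ["Comedy"], ["Comedy", "Crime"], ["Crime"],
          ["Comedy"], ["Crime"], ["Comedy", "Crime"], ["Comedy"]]
y = MultiLabelBinarizer().fit_transform(labels)

# The vectorizer is a pipeline step, so it is refit inside every CV fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

scores = cross_val_score(pipe, texts, y, cv=2, scoring="f1_macro")
print(scores)
```

The same pipeline object can be passed raw text at prediction time, which also removes the need to export the vectorizer separately.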

In [80]:
# OPTION 4:

# -*- coding: utf-8 -*-
import ast
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, f1_score
import joblib

# -------------------------------------------
# 1. Load data
# -------------------------------------------
dataTraining = pd.read_csv(
    'https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip',
    encoding='utf-8', index_col=0)
dataTesting = pd.read_csv(
    'https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip',
    encoding='utf-8', index_col=0)

# -------------------------------------------
# 2. Preprocessing: use only plot
# -------------------------------------------
dataTraining['combined'] = dataTraining['plot']
dataTesting['combined'] = dataTesting['plot']

# Tuned TF-IDF vectorization
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=10000,
    ngram_range=(1, 2),
    sublinear_tf=True,
    min_df=3
)
X_dtm = vectorizer.fit_transform(dataTraining['combined'])

# Parse the multi-label target variable (literal_eval parses the list without executing code)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()
y_genres = mlb.fit_transform(dataTraining['genres'])

# -------------------------------------------
# 3. Train/validation split
# -------------------------------------------
X_train_vec, X_test_vec, y_train_genres, y_test_genres = train_test_split(
    X_dtm, y_genres, test_size=0.2, random_state=42)

# -------------------------------------------
# 4. Robust model with LightGBM
# -------------------------------------------
clf = OneVsRestClassifier(LGBMClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=12,
    num_leaves=128,
    random_state=42,
    n_jobs=-1
))
clf.fit(X_train_vec, y_train_genres)

# -------------------------------------------
# 5. Evaluation at different thresholds
# -------------------------------------------
y_proba = clf.predict_proba(X_test_vec)

print("\n🔎 Evaluation at different thresholds:")
for t in [0.5, 0.4, 0.35, 0.3]:
    y_pred = (y_proba >= t).astype(int)
    f1 = f1_score(y_test_genres, y_pred, average='macro')
    print(f"Threshold {t:.2f} - F1-macro: {f1:.3f}")

# Select the final threshold
threshold = 0.3
y_pred_final = (y_proba >= threshold).astype(int)
print("\n📋 Per-class report with threshold = 0.3:")
print(classification_report(y_test_genres, y_pred_final, target_names=mlb.classes_))

# -------------------------------------------
# 6. Cross-validation
# -------------------------------------------
cv_scores = cross_val_score(clf, X_train_vec, y_train_genres, cv=5, scoring='f1_macro')
print(f"\n📊 Mean cross-validation F1-macro: {cv_scores.mean():.3f}")

# -------------------------------------------
# 7. Final prediction on the test set (Kaggle)
# -------------------------------------------
X_test_kaggle = vectorizer.transform(dataTesting['combined'])
y_pred_kaggle = (clf.predict_proba(X_test_kaggle) >= threshold).astype(int)

# Export predictions
cols = ['p_' + g for g in mlb.classes_]
submission = pd.DataFrame(y_pred_kaggle, index=dataTesting.index, columns=cols)
submission.to_csv('pred_genres_plot_only.csv', index_label='ID')

# -------------------------------------------
# 8. Save model and objects
# -------------------------------------------
joblib.dump(clf, 'clf_genero_plot.pkl')
joblib.dump(vectorizer, 'vectorizer_plot.pkl')
joblib.dump(mlb, 'binarizer_plot.pkl')

print("\n✅ Model trained on 'plot' only. File ready for Kaggle: pred_genres_plot_only.csv")




## 3.2. Model performance on the validation data

In [46]:
from sklearn.metrics import classification_report

y_pred_test = clf.predict(X_test_vec)
print(classification_report(y_test_genres, y_pred_test, target_names=le.classes_))

              precision    recall  f1-score   support

      Action       0.71      0.17      0.27       423
   Adventure       0.92      0.13      0.23       340
   Animation       0.00      0.00      0.00        99
   Biography       0.00      0.00      0.00       130
      Comedy       0.73      0.50      0.60      1028
       Crime       0.86      0.27      0.41       468
 Documentary       0.93      0.21      0.34       129
       Drama       0.66      0.71      0.69      1283
      Family       0.75      0.01      0.02       252
     Fantasy       0.89      0.03      0.06       243
   Film-Noir       0.00      0.00      0.00        57
     History       0.00      0.00      0.00        80
      Horror       0.89      0.14      0.24       300
       Music       0.89      0.07      0.12       123
     Musical       0.00      0.00      0.00        97
     Mystery       0.86      0.05      0.09       242
        News       0.00      0.00      0.00         3
     Romance       0.73    

In [84]:
# OPTION 5:

# -*- coding: utf-8 -*-
import ast
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report, f1_score
import joblib

# ---------------------------------------
# 1. Load data
# ---------------------------------------
dataTraining = pd.read_csv(
    'https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip',
    encoding='utf-8', index_col=0)

dataTesting = pd.read_csv(
    'https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip',
    encoding='utf-8', index_col=0)

# Use only the plot column
dataTraining['combined'] = dataTraining['plot']
dataTesting['combined'] = dataTesting['plot']

# Parse the multi-label targets (literal_eval parses the list without executing code)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()
y_genres = mlb.fit_transform(dataTraining['genres'])

# ---------------------------------------
# 2. Split into training and validation
# ---------------------------------------
X_train_text, X_valid_text, y_train_genres, y_valid_genres = train_test_split(
    dataTraining['combined'], y_genres, test_size=0.2, random_state=42)

# ---------------------------------------
# 3. TF-IDF + LSA
# ---------------------------------------
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=20000,
    ngram_range=(1, 2),
    sublinear_tf=True,
    min_df=3
)

X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_valid_tfidf = vectorizer.transform(X_valid_text)

# LSA: truncated SVD fit on the training TF-IDF matrix only
svd = TruncatedSVD(n_components=300, random_state=42)
X_train_lsa = svd.fit_transform(X_train_tfidf)
X_valid_lsa = svd.transform(X_valid_tfidf)

# ---------------------------------------
# 4. Robust model with LogisticRegressionCV
# ---------------------------------------
clf = OneVsRestClassifier(LogisticRegressionCV(
    cv=5, scoring='f1_macro', max_iter=1000, n_jobs=-1
))
clf.fit(X_train_lsa, y_train_genres)

# ---------------------------------------
# 5. Evaluation on the validation split
# ---------------------------------------
y_proba_valid = clf.predict_proba(X_valid_lsa)

print("\n🔎 Evaluation at different thresholds:")
for t in [0.5, 0.4, 0.3]:
    y_pred = (y_proba_valid >= t).astype(int)
    f1 = f1_score(y_valid_genres, y_pred, average='macro')
    print(f"Threshold {t} - F1 macro: {f1:.3f}")

# Full report at the selected threshold
threshold = 0.3
y_pred_final = (y_proba_valid >= threshold).astype(int)
print("\n📋 Per-class report with threshold = 0.3:")
print(classification_report(y_valid_genres, y_pred_final, target_names=mlb.classes_))

# ---------------------------------------
# 6. Final prediction on the test set for Kaggle
# ---------------------------------------
X_test_lsa = svd.transform(vectorizer.transform(dataTesting['combined']))
y_pred_kaggle = (clf.predict_proba(X_test_lsa) >= threshold).astype(int)

# Export to CSV
cols = ['p_' + g for g in mlb.classes_]
submission = pd.DataFrame(y_pred_kaggle, index=dataTesting.index, columns=cols)
submission.to_csv('submission_lsa_robust.csv', index_label='ID')

# ---------------------------------------
# 7. Save model and objects
# ---------------------------------------
joblib.dump(clf, 'clf_lsa.pkl')
joblib.dump(vectorizer, 'vectorizer_lsa.pkl')
joblib.dump(svd, 'svd_lsa.pkl')
joblib.dump(mlb, 'binarizer_lsa.pkl')

print("\n✅ Training complete. File ready for Kaggle: submission_lsa_robust.csv")



🔎 Evaluation at different thresholds:
Threshold 0.5 - F1 macro: 0.382
Threshold 0.4 - F1 macro: 0.428
Threshold 0.3 - F1 macro: 0.459

📋 Per-class report with threshold = 0.3:
              precision    recall  f1-score   support

      Action       0.55      0.56      0.55       264
   Adventure       0.55      0.51      0.53       208
   Animation       0.48      0.17      0.25        60
   Biography       0.44      0.17      0.24        83
      Comedy       0.58      0.81      0.68       617
       Crime       0.64      0.64      0.64       286
 Documentary       0.76      0.58      0.66        83
       Drama       0.50      1.00      0.67       783
      Family       0.70      0.43      0.53       159
     Fantasy       0.52      0.38      0.44       144
   Film-Noir       0.54      0.18      0.27        39
     History       0.50      0.22      0.30        46
      Horror       0.58      0.54      0.56       169
       Music       0.71      0.45      0.55        76
     Musical 

In [89]:
!pip install transformers[sentencepiece] --no-deps



In [91]:
pip install tf-keras


In [95]:
# OPTION 6

!pip install sentence-transformers scikit-learn pandas joblib

# -*- coding: utf-8 -*-
import ast
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report, f1_score
from sentence_transformers import SentenceTransformer
import joblib

# ---------------------------------------
# 1. Load data
# ---------------------------------------
print("📥 Loading data...")
dataTraining = pd.read_csv(
    'https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip',
    encoding='utf-8', index_col=0)

dataTesting = pd.read_csv(
    'https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip',
    encoding='utf-8', index_col=0)

# Use only 'plot' as the text input
dataTraining['combined'] = dataTraining['plot']
dataTesting['combined'] = dataTesting['plot']

# Parse the multi-label targets (literal_eval parses the list without executing code)
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()
y_genres = mlb.fit_transform(dataTraining['genres'])

# ---------------------------------------
# 2. Split into training and validation
# ---------------------------------------
print("📊 Splitting data...")
X_train_text, X_valid_text, y_train_genres, y_valid_genres = train_test_split(
    dataTraining['combined'], y_genres, test_size=0.2, random_state=42)

# ---------------------------------------
# 3. Generate embeddings with SentenceTransformers
# ---------------------------------------
print("🔍 Generating embeddings with BERT...")
bert_model = SentenceTransformer('all-MiniLM-L6-v2')  # Small and fast model

X_train_embed = bert_model.encode(X_train_text.tolist(), show_progress_bar=True)
X_valid_embed = bert_model.encode(X_valid_text.tolist(), show_progress_bar=True)
X_test_embed = bert_model.encode(dataTesting['combined'].tolist(), show_progress_bar=True)

# ---------------------------------------
# 4. Train the multi-label model
# ---------------------------------------
print("⚙️ Training model...")
clf = OneVsRestClassifier(LogisticRegressionCV(
    cv=5, scoring='f1_macro', max_iter=2000, n_jobs=1
))
clf.fit(X_train_embed, y_train_genres)

# ---------------------------------------
# 5. Evaluate at different thresholds
# ---------------------------------------
print("📈 Evaluating on the validation set...")
y_proba_valid = clf.predict_proba(X_valid_embed)

for t in [0.5, 0.4, 0.3]:
    y_pred = (y_proba_valid >= t).astype(int)
    score = f1_score(y_valid_genres, y_pred, average='macro')
    print(f"Threshold {t} - F1 macro: {score:.3f}")

# Detailed report at the best of the tested thresholds
threshold = 0.3
y_pred_final = (y_proba_valid >= threshold).astype(int)
print("\n📋 Classification report:")
print(classification_report(y_valid_genres, y_pred_final, target_names=mlb.classes_))

# ---------------------------------------
# 6. Predictions on the test set (Kaggle)
# ---------------------------------------
print("✈️ Generating predictions for Kaggle...")
y_pred_kaggle = (clf.predict_proba(X_test_embed) >= threshold).astype(int)

cols = ['p_' + g for g in mlb.classes_]
submission = pd.DataFrame(y_pred_kaggle, index=dataTesting.index, columns=cols)
submission.to_csv('submission_bert_logreg.csv', index_label='ID')

# ---------------------------------------
# 7. Save models
# ---------------------------------------
joblib.dump(clf, 'clf_bert.pkl')
joblib.dump(mlb, 'binarizer_bert.pkl')
bert_model.save('bert_model')  # Saved as a folder

print("\n✅ Done! File for Kaggle: submission_bert_logreg.csv")




📊 Splitting data...
🔍 Generating embeddings with BERT...


Batches:   0%|          | 0/198 [00:00<?, ?it/s]

Batches:   0%|          | 0/50 [00:00<?, ?it/s]

Batches:   0%|          | 0/106 [00:00<?, ?it/s]

⚙️ Training model...
📈 Evaluating on the validation set...
Threshold 0.5 - F1 macro: 0.483
Threshold 0.4 - F1 macro: 0.498
Threshold 0.3 - F1 macro: 0.502

📋 Classification report:
              precision    recall  f1-score   support

      Action       0.53      0.61      0.57       264
   Adventure       0.51      0.57      0.54       208
   Animation       0.52      0.37      0.43        60
   Biography       0.40      0.33      0.36        83
      Comedy       0.60      0.81      0.69       617
       Crime       0.56      0.65      0.60       286
 Documentary       0.67      0.78      0.72        83
       Drama       0.57      0.94      0.71       783
      Family       0.62      0.56      0.59       159
     Fantasy       0.40      0.47      0.43       144
   Film-Noir       0.21      0.08      0.11        39
     History       0.29      0.41      0.34        46
      Horror       0.56      0.70      0.62       169
       Music       0.67      0.64      0.66        76

In [93]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # If the model is already cached, it is not downloaded again
emb = model.encode(["This is a test."])
print(emb.shape)

(1, 384)


## 3.3. Justification of the selected model

# **4. Model deployment**

In [35]:
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Model definition and training
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train_vec, y_train_genres)

# 10-fold cross-validation (cv=10) scored with macro F1
scores = cross_val_score(clf, X_train_vec, y_train_genres, cv=10, scoring='f1_macro')
print(f"\nMean F1-score of the calibrated model (cv=10): {np.mean(scores):.3f}")

# Predictions on the held-out validation split
y_pred_rf = clf.predict(X_test_vec)
print(y_pred_rf.shape)


Mean F1-score of the calibrated model (cv=10): 0.178
(2606, 24)


In [38]:
from sklearn.metrics import classification_report

y_pred_test = clf.predict(X_test_vec)
print(classification_report(y_test_genres, y_pred_test, target_names=le.classes_))

              precision    recall  f1-score   support

      Action       0.71      0.17      0.27       423
   Adventure       0.92      0.13      0.23       340
   Animation       0.00      0.00      0.00        99
   Biography       0.00      0.00      0.00       130
      Comedy       0.73      0.50      0.60      1028
       Crime       0.86      0.27      0.41       468
 Documentary       0.93      0.21      0.34       129
       Drama       0.66      0.71      0.69      1283
      Family       0.75      0.01      0.02       252
     Fantasy       0.89      0.03      0.06       243
   Film-Noir       0.00      0.00      0.00        57
     History       0.00      0.00      0.00        80
      Horror       0.89      0.14      0.24       300
       Music       0.89      0.07      0.12       123
     Musical       0.00      0.00      0.00        97
     Mystery       0.86      0.05      0.09       242
        News       0.00      0.00      0.00         3
     Romance       0.73    
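
The f1-score column in the report is the harmonic mean of precision and recall, which is why genres with high precision but very low recall (for example Action, 0.71 and 0.17) still score poorly. The following is a minimal sketch in plain Python and is not part of the original notebook: `f1_from_pr` and `macro_f1` are hypothetical helper names, and the inputs are the rounded values printed in the report, so results can differ from the report in the last digit.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs):
    """Unweighted mean of per-class F1 over (precision, recall) pairs."""
    scores = [f1_from_pr(p, r) for p, r in pairs]
    return sum(scores) / len(scores)


# Rounded (precision, recall) values copied from the report
report_rows = {
    "Action": (0.71, 0.17),
    "Adventure": (0.92, 0.13),
    "Animation": (0.00, 0.00),
    "Crime": (0.86, 0.27),
}

for genre, (p, r) in report_rows.items():
    print(f"{genre}: F1 = {f1_from_pr(p, r):.2f}")

# Macro average over these four genres only (illustrative subset)
print(f"macro F1 (subset): {macro_f1(report_rows.values()):.3f}")
```

Because macro averaging weights every genre equally, each genre with zero recall pulls the mean down regardless of how few test examples it has.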

In [39]:
# Export the model, vectorizer, and label binarizer to .pkl files
import joblib
joblib.dump(clf, 'clf_genero.pkl', compress=3)
joblib.dump(vectorizer, 'vectorizer.pkl', compress=3)
joblib.dump(le, 'binarizer.pkl', compress=3)

['binarizer.pkl']

In [40]:
# Library imports
from flask import Flask
from flask_restx import Api, Resource, fields

In [41]:
from flask import Flask, request
from flask_restx import Api, Resource, fields

app = Flask(__name__)
api = Api(app, version='1.0', title='Genres Prediction API',
          description='Predicts movie genres from their features')
ns = api.namespace('Predict', description='Classification model')

# Output model
output_model = api.model('Prediction', {
    'result': fields.String
})

In [42]:
# Resource class that serves the model

import joblib
from flask import request

modelo = joblib.load("clf_genero.pkl")
vectorizer = joblib.load("vectorizer.pkl")
binarizer = joblib.load("binarizer.pkl")

@ns.route('/')
@ns.doc(params={
    'title': 'Movie title',
    'plot': 'Movie plot synopsis'
})
class GenreClassifier(Resource):
    @ns.marshal_with(output_model)
    def get(self):
        title = request.args.get('title')
        plot = request.args.get('plot')

        # Both query parameters are required
        if not all([title, plot]):
            return {'result': 'Error: missing parameters'}, 400

        # Concatenate title and plot, vectorize, and predict hard 0/1 labels
        texto_completo = title + ' ' + plot
        X_input = vectorizer.transform([texto_completo])
        y_pred = modelo.predict(X_input)
        # Map the binary row back to genre names
        etiquetas = binarizer.inverse_transform(y_pred)

        return {'result': ', '.join(etiquetas[0]) if etiquetas[0] else 'No genre detected'}
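
A request to this resource is a GET with `title` and `plot` passed as query parameters. The following is a small sketch of how a client might build that URL using only the standard library. `build_predict_url` is a hypothetical helper, and the base address is an assumption (Flask's default local host and port).

```python
from urllib.parse import quote, urlencode


def build_predict_url(base_url: str, title: str, plot: str) -> str:
    """Return the GET URL for the /Predict/ endpoint with encoded parameters."""
    # quote_via=quote encodes spaces as %20 rather than '+'
    query = urlencode({"title": title, "plot": plot}, quote_via=quote)
    return f"{base_url.rstrip('/')}/Predict/?{query}"


url = build_predict_url(
    "http://127.0.0.1:5000",  # assumed: Flask's default local address
    "How to Be a Serial Killer",
    "A serial killer decides to teach the secrets of his satisfying career "
    "to a video store clerk.",
)
print(url)
```

While the server is running, the resulting URL can be opened in a browser or fetched with any HTTP client. The response is a JSON object with a single `result` field.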

In [43]:
import numpy as np
genre_frequencies = np.sum(y_genres, axis=0)
for genre, count in zip(le.classes_, genre_frequencies):
    print(f"{genre}: {count}")

Action: 1303
Adventure: 1024
Animation: 260
Biography: 373
Comedy: 3046
Crime: 1447
Documentary: 419
Drama: 3965
Family: 682
Fantasy: 707
Film-Noir: 168
History: 273
Horror: 954
Music: 341
Musical: 271
Mystery: 759
News: 7
Romance: 1892
Sci-Fi: 723
Short: 92
Sport: 261
Thriller: 2024
War: 348
Western: 237


In [45]:
# Run the server
if __name__ == '__main__':
    app.run(debug=False)

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [15/May/2025 13:06:25] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [15/May/2025 13:06:25] "GET /swaggerui/swagger-ui.css HTTP/1.1" 304 -
127.0.0.1 - - [15/May/2025 13:06:25] "GET /swaggerui/droid-sans.css HTTP/1.1" 304 -
127.0.0.1 - - [15/May/2025 13:06:25] "GET /swaggerui/swagger-ui-bundle.js HTTP/1.1" 304 -
127.0.0.1 - - [15/May/2025 13:06:25] "GET /swaggerui/swagger-ui-standalone-preset.js HTTP/1.1" 304 -
127.0.0.1 - - [15/May/2025 13:06:26] "GET /swagger.json HTTP/1.1" 200 -
127.0.0.1 - - [15/May/2025 13:06:50] "GET /Predict/?title=Master%20and%20Commander:%20The%20Far%20Side%20of%20the%20World&plot=during%20the%20napoleonic%20wars%20,%20%20a%20british%20frigate%20,%20%20hms%20surprise%20,%20%20and%20a%20much%20larger%20french%20warship%20,%20%20the%20acheron%20,%20%20with%20greater%20fire%20power%20,%20%20stalk%20each%20other%20off%20of%20the%20coast%20of%20south%20america%20.%20%20russell%20crowe%20brings%20great%20i

In [53]:
from sklearn.metrics import roc_auc_score

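# Predict hard 0/1 labels
y_pred_rf = clf.predict(X_test_vec)

# Evaluate. Note: roc_auc_score receives hard labels here, not probabilities,
# so the score reflects a single operating point rather than the full ranking;
# passing clf.predict_proba(X_test_vec) instead would score the probabilities.
roc_auc = roc_auc_score(y_test_genres, y_pred_rf, average='macro')
print(f"ROC AUC macro: {roc_auc}")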

ROC AUC macro: 0.5554862362483376
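
One plausible reason this macro ROC AUC sits close to 0.5 is that the metric was computed from hard 0/1 predictions, which discard the ranking information that ROC AUC measures. The sketch below illustrates the effect with a plain-Python pairwise AUC on made-up toy data. `roc_auc` is a hypothetical helper written for this illustration, not scikit-learn's function.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, made up for illustration
y_true = [0, 0, 1, 1, 1]
y_prob = [0.10, 0.30, 0.20, 0.45, 0.90]
# Hard labels at a 0.5 threshold, analogous to what predict() returns
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]

print(f"AUC from probabilities: {roc_auc(y_true, y_prob):.3f}")  # 0.833
print(f"AUC from hard labels:   {roc_auc(y_true, y_hard):.3f}")  # 0.667
```

With hard labels, every positive that falls below the threshold ties with every negative, so those pairs earn only half credit.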


In [57]:
import pandas as pd

# 1. Build the combined text column if it does not exist yet
dataTesting['combined'] = dataTesting['title'] + ' ' + dataTesting['plot']

# 2. Vectorize
X_test_competencia = vectorizer.transform(dataTesting['combined'])

# 3. Predict binary labels (0/1)
y_pred_test_genres = clf.predict(X_test_competencia)

# 4. Expected columns
cols = ['p_' + genre for genre in le.classes_]

In [58]:
# Save to CSV for Kaggle
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
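
The project statement asks for the probability that a movie belongs to each genre, whereas the table above holds 0/1 labels. As a hedged illustration of the same file layout with probabilities, the sketch below renders an `ID` column plus one `p_<genre>` column per genre using only the standard library. `submission_csv` is a hypothetical helper and the numbers are made up. In the notebook, the analogous matrix would presumably come from `clf.predict_proba(X_test_competencia)`.

```python
import csv
import io


def submission_csv(ids, genres, probabilities):
    """Render an ID column plus one 'p_<genre>' probability column per genre."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ID"] + [f"p_{g}" for g in genres])
    for row_id, probs in zip(ids, probabilities):
        if len(probs) != len(genres):
            raise ValueError("one probability per genre is required")
        writer.writerow([row_id] + [f"{p:.4f}" for p in probs])
    return buffer.getvalue()


# Made-up probabilities for three genres
genres = ["Action", "Comedy", "Drama"]
ids = [1, 4]
probabilities = [[0.12, 0.05, 0.81], [0.30, 0.44, 0.52]]
print(submission_csv(ids, genres, probabilities))
```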
