![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie Genre Classification

The purpose of this project is for you to put into practice, within your working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. As you work, follow the instructions given in the "Guía del proyecto 2: Clasificación de género de películas" (Project 2 guide: movie genre classification).

**Submission**: The project must be submitted during week 8. However, it is important that you make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF format to the project submission activity in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The goal is to use these data to predict, given the synopsis, the probability that a movie belongs to each genre.
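
Because a movie can carry several genres at once, the target is a binary indicator matrix with one column per genre rather than a single class label. The sketch below illustrates that encoding with scikit-learn's `MultiLabelBinarizer` on two movies taken from the dataset preview; it is a toy illustration, not the full training target.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists for two movies from the dataset preview
genres = [
    ['Comedy', 'Crime', 'Horror'],  # 'How to Be a Serial Killer'
    ['Short', 'Drama'],             # 'Most'
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)  # one row per movie, one column per genre

print(list(mlb.classes_))  # columns are the sorted genre names
print(Y.tolist())          # 1 where the movie has that genre, 0 otherwise
```

Each column can then be treated as its own binary problem, which is what a one-vs-rest classifier does.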

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting on the test set for Kaggle submission
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.
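
As a rough sketch of that format: the file has an `ID` column followed by one `p_<Genre>` probability column per genre. The example below uses only two of the genre columns, and the probability values are made up for illustration.

```python
import pandas as pd

# Two of the genre columns, shortened for the example
cols = ['p_Action', 'p_Comedy']

# Made-up probabilities for two test-set IDs (illustrative values only)
probs = [[0.25, 0.5], [0.75, 0.125]]
ids = [1, 4]

res = pd.DataFrame(probs, index=ids, columns=cols)

# With no path given, to_csv returns the CSV text instead of writing a file
csv_text = res.to_csv(index_label='ID')
print(csv_text)
```

Passing `index_label='ID'` names the index column, so the header row starts with `ID`.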

In [2]:
pip install neattext

Collecting neattext
  Downloading neattext-0.1.3-py3-none-any.whl.metadata (12 kB)
Downloading neattext-0.1.3-py3-none-any.whl (114 kB)
Installing collected packages: neattext
Successfully installed neattext-0.1.3


In [1]:
# Library imports
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
import joblib
import neattext as nt
import neattext.functions as nfx

In [2]:
# Load data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [4]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


Initial vectorization

In [5]:
# Definition of the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)

In [6]:
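# Definition of the target variable (y)
# ast.literal_eval parses the list literal stored as text without executing arbitrary code
import ast
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])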

In [7]:
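# Scan each plot for noise (this result is not displayed; only the last expression of the cell is)
dataTraining['plot'].apply(lambda x: nt.TextFrame(x).noise_scan())
# List the stopwords found in each plot
dataTraining['plot'].apply(lambda x: nt.TextExtractor(x).extract_stopwords())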

Unnamed: 0,plot
3107,"[most, is, the, of, a, who, his, eight, to, wi..."
900,"[a, to, the, of, his, to, a]"
6724,"[in, a, with, a, a, who, beyond, his, they, be..."
4704,"[in, a, in, the, of, the, has, just, had, a, w..."
2582,"[in, the, of, a, to, a, with, the, who, has, t..."
...,...
8417,"[our, their, it, s, one, for, any, and, and, a..."
1592,"[the, his, are, with, and, her, to, a, they, m..."
1723,"[a, by, the, and, of, that, a, it, all, in, on..."
7605,"[a, in, a, with, her, on, the, she, is, to, mo..."


In [8]:
dataTraining['plot'].apply(nfx.remove_stopwords)

Unnamed: 0,plot
3107,story single father takes year - old son work ...
900,serial killer decides teach secrets satisfying...
6724,"sweden , female blackmailer disfiguring facial..."
4704,"friday afternoon new york , president tredway ..."
2582,"los angeles , editor publishing house carol hu..."
...,...
8417,""" marriage , wedding . "" ' lesson number newly..."
1592,"wandering barbarian , conan , alongside goofy ..."
1723,"like tale spun scheherazade , kismet follows r..."
7605,"mrs . brisby , widowed mouse , lives cinder bl..."


In [9]:
corpus = dataTraining['plot'].apply(lambda x: nt.TextFrame(x).remove_stopwords().remove_special_characters().text)

In [10]:
corpus

Unnamed: 0,plot
3107,story single father takes year old son work r...
900,serial killer decides teach secrets satisfying...
6724,sweden female blackmailer disfiguring facial ...
4704,friday afternoon new york president tredway c...
2582,los angeles editor publishing house carol hun...
...,...
8417,marriage wedding lesson number newly enga...
1592,wandering barbarian conan alongside goofy ro...
1723,like tale spun scheherazade kismet follows re...
7605,mrs brisby widowed mouse lives cinder block...


In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TF-IDF vectorizer on the cleaned corpus and persist the fitted vectorizer
tfidf = TfidfVectorizer()
tfidf.fit(corpus)
joblib.dump(tfidf, 'featureX.pkl', compress=3)

['featureX.pkl']

In [12]:
# Transform the corpus with the fitted vectorizer and convert the result to a dense array
Xfeatures = tfidf.transform(corpus).toarray()

In [15]:
Xfeatures

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
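
The array above is almost entirely zeros, which is typical of TF-IDF features. As a hedged aside, scikit-learn vectorizers return SciPy sparse matrices, and most scikit-learn estimators accept them directly, so the `.toarray()` step is optional. The sketch below, on invented sentences, also shows that new text must go through `transform` on the already fitted vectorizer so that it shares the same columns.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences standing in for cleaned plots
train_docs = [
    "serial killer teaches clerk",
    "editor witnesses murder",
    "killer hunts witness",
]
new_docs = ["a killer stalks the editor"]

tfidf_demo = TfidfVectorizer()
X_train_demo = tfidf_demo.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_new_demo = tfidf_demo.transform(new_docs)          # reuses them; unseen words are ignored

print(issparse(X_train_demo), X_train_demo.shape)
print(X_new_demo.shape, X_new_demo.nnz)  # nnz counts the non-zero entries
```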

RandomForestClassifier

In [None]:
# Split the predictor variables (X) and the target variable (y) into training and test sets using train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(Xfeatures, y_genres, test_size=0.33, random_state=42)

In [None]:
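# Alternative candidate: calibrated logistic regression.
# Note: `clf` is reassigned in the next cell, so this model is not the one trained below.
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

base_model = LogisticRegression(C=0.5, max_iter=1000)
calibrated_model = CalibratedClassifierCV(estimator=base_model, method='sigmoid', cv=3)
clf = OneVsRestClassifier(calibrated_model)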


In [None]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [None]:
# Prediction with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Report model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7883120686450534
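
For reference, `average='macro'` computes one ROC AUC per genre column and then takes the unweighted mean, so rare genres count as much as common ones. The following is a minimal pure-Python sketch of that definition on made-up labels and scores. It is not scikit-learn's implementation, only the pairwise-ranking view of AUC.

```python
def binary_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-label AUCs."""
    n_labels = len(Y_true[0])
    per_label = [
        binary_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels

# Made-up example: 4 movies, 2 genres
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]

result = macro_auc(Y_true, Y_score)
print(result)  # per-label AUCs are 0.75 and 1.0
```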

In [None]:
# Transform the predictor variables X of the test set
X_test_dtm = tfidf.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Prediction on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [None]:
# Save predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
#res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.152426,0.125171,0.023958,0.038249,0.38455,0.162418,0.0446,0.526177,0.070303,0.113444,...,0.036715,0.082483,0.000267,0.299491,0.070677,0.021965,0.024055,0.238398,0.034735,0.021139
4,0.155435,0.111219,0.045896,0.044988,0.380398,0.202274,0.047566,0.525979,0.070993,0.075016,...,0.026642,0.082124,0.000364,0.206102,0.071992,0.008137,0.024055,0.237947,0.041568,0.026517
5,0.174872,0.116391,0.033689,0.060856,0.335869,0.303432,0.0446,0.552076,0.070993,0.076485,...,0.026642,0.139509,0.000108,0.245291,0.072627,0.007861,0.025331,0.315422,0.056708,0.021134
6,0.185941,0.113372,0.023958,0.048485,0.346191,0.174821,0.054119,0.520513,0.070373,0.075532,...,0.055662,0.099481,0.000233,0.244145,0.086598,0.00786,0.033627,0.301791,0.046571,0.02133
7,0.187201,0.132477,0.023958,0.038249,0.354594,0.209041,0.044167,0.487815,0.070373,0.088925,...,0.036156,0.14279,0.00027,0.217588,0.177314,0.008137,0.024055,0.250821,0.034735,0.021144


In [None]:
# Export the model to a binary .pkl file
joblib.dump(clf, 'genreclf.pkl', compress=3)

['genreclf.pkl']
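
When the model is later served, the vectorizer and the classifier must be loaded together and kept in sync. One possible approach, sketched here under the assumption that a single scikit-learn pipeline is acceptable for deployment, is to persist both steps as one object. The plots, genres, and the file name `genre_pipeline.pkl` are invented for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Tiny invented training set
plots = [
    "a serial killer teaches a video store clerk",
    "a single father takes his son to work",
    "a detective hunts a killer in los angeles",
    "a father and his son share a quiet drama",
]
genres = [['Crime', 'Horror'], ['Drama'], ['Crime'], ['Drama']]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

# Vectorizer and classifier travel together as one estimator
pipe = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
pipe.fit(plots, Y)

# Round trip through a temporary file (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'genre_pipeline.pkl')
    joblib.dump(pipe, path, compress=3)
    restored = joblib.load(path)

new_plot = ["a killer follows a clerk"]
p_before = pipe.predict_proba(new_plot)
p_after = restored.predict_proba(new_plot)
print(dict(zip(mlb.classes_, p_after[0].round(3))))
```

The restored pipeline takes raw text, so the serving code does not need to repeat the vectorization step by hand.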

Tokenization

In [14]:
import nltk
import pandas as pd
import numpy as np
import json
import re
# Imports for tokenization
from nltk.corpus import stopwords
from nltk.tokenize import ToktokTokenizer
from nltk.stem import SnowballStemmer

# Imports for modeling

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
tokenizer = ToktokTokenizer()
STOPWORDS = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")
# Helper functions for text cleaning
def limpiar_texto(texto):
    texto = re.sub(r'\W', ' ', str(texto))
    texto = re.sub(r'\s+[a-zA-Z]\s+', ' ', texto)
    texto = re.sub(r'\s+', ' ', texto)
    texto = texto.lower()
    return texto

def filtrar_stopword_digitos(tokens):
    return [token for token in tokens if token not in STOPWORDS
            and not token.isdigit()]

def stem_palabras(tokens):
    return [stemmer.stem(token) for token in tokens]

def tokenize(texto):
    text_cleaned = limpiar_texto(texto)
    tokens = [word for word in tokenizer.tokenize(text_cleaned) if len(word) > 1]
    tokens = filtrar_stopword_digitos(tokens)
    stems = stem_palabras(tokens)
    return stems
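
To see what the cleaning steps do to a sentence, the sketch below reproduces the same three regular expressions with the standard library only. It is a simplification under stated assumptions: `DEMO_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, `str.split` replaces `ToktokTokenizer`, and stemming is omitted.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"a", "in", "the", "to", "of"}

def clean_text(text):
    text = re.sub(r'\W', ' ', str(text))         # non-word characters become spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop isolated single letters
    text = re.sub(r'\s+', ' ', text)             # collapse repeated whitespace
    return text.lower()

def simple_tokenize(text):
    # Whitespace split instead of ToktokTokenizer; no stemming in this sketch
    tokens = [w for w in clean_text(text).split() if len(w) > 1]
    return [t for t in tokens if t not in DEMO_STOPWORDS and not t.isdigit()]

sample = "Dr. Eugene's wife, a nurse, left in 1950!"
cleaned = clean_text(sample)
tokens = simple_tokenize(sample)

print(cleaned.split())
print(tokens)
```

Note that the possessive "s" and the article "a" disappear in the second regex step, before any stopword filtering happens.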

In [16]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [17]:
# Definition of the predictor variables (X)
vect = TfidfVectorizer(tokenizer=tokenize,sublinear_tf=True,max_features=15000)
X_dtm = vect.fit_transform(dataTraining['plot']).toarray()
X_dtm.shape

(7895, 15000)

In [18]:
# Split the predictor variables (X) and the target variable (y) into training and test sets using train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score
import joblib
import numpy as np

model_base = LogisticRegression(random_state=42)
model = OneVsRestClassifier(model_base)
param_dist = {
    "estimator__max_iter": [100, 200],  # e.g. 400, 1000
    "estimator__penalty": ['l2'],  # e.g. 'l1', 'elasticnet'
    "estimator__C": np.logspace(-2, 2, 5),
    "estimator__solver": ['lbfgs']  # e.g. 'liblinear', 'saga'
}

random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10,
                                   scoring='roc_auc', n_jobs=-1, cv=3, random_state=42, verbose=1)

random_search.fit(X_train, y_train_genres)

# Keep the best model under the name MRL, which the prediction cells below use
MRL = random_search.best_estimator_

y_pred_genres = MRL.predict_proba(X_test)

roc_auc = roc_auc_score(y_test_genres, y_pred_genres, average='macro')
print(f'ROC-AUC Score: {roc_auc}')


joblib.dump(MRL, 'movie_genre_MRL.pkl', compress=3)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
ROC-AUC Score: 0.8869842434654136


['movie_genre_MRL.pkl']
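
A note on the search above: every entry in `param_dist` is a list and the grid has 2 × 1 × 5 × 1 = 10 combinations, so with `n_iter=10` the randomized search should end up visiting each combination once, much like a grid search. The sketch below shows the same pattern on synthetic multilabel data from `make_multilabel_classification`. The data and the values of `C` are assumptions for illustration only.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel problem standing in for the TF-IDF features and genres
X_toy, Y_toy = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=3, random_state=0
)

search = RandomizedSearchCV(
    OneVsRestClassifier(LogisticRegression(max_iter=500)),
    # The "estimator__" prefix routes the parameter to the wrapped LogisticRegression
    param_distributions={"estimator__C": [0.01, 1.0, 100.0]},
    n_iter=3,           # equals the grid size, so all three values are tried
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_toy, Y_toy)

n_candidates = len(search.cv_results_["params"])
proba = search.best_estimator_.predict_proba(X_toy)
print(search.best_params_, n_candidates, proba.shape)
```

To sample the search space rather than enumerate it, a distribution such as `scipy.stats.loguniform` can be passed in place of a list.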

In [22]:
cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']


In [24]:
X_test_dtm = vect.transform(dataTesting['plot'])
y_pred_test_genres = MRL.predict_proba(X_test_dtm)


In [25]:
# Save predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('genresLR.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.083539,0.084634,0.024657,0.030599,0.364692,0.126847,0.032072,0.56881,0.048432,0.10463,...,0.03607,0.102274,0.000664,0.536068,0.050091,0.011571,0.026651,0.180365,0.022727,0.028469
4,0.155448,0.041694,0.027395,0.132348,0.210923,0.330046,0.062884,0.746857,0.031286,0.028319,...,0.030652,0.037363,0.00076,0.081135,0.027176,0.011926,0.026721,0.253232,0.064128,0.033672
5,0.079382,0.027562,0.01383,0.052855,0.109737,0.627616,0.025651,0.846717,0.020891,0.037216,...,0.018164,0.331941,0.000649,0.180833,0.051431,0.008129,0.023165,0.531663,0.035482,0.017067
6,0.137946,0.09507,0.018552,0.042129,0.201322,0.096524,0.030458,0.723572,0.050979,0.048188,...,0.034903,0.091591,0.000676,0.254433,0.091935,0.008276,0.032115,0.393536,0.066028,0.020368
7,0.082682,0.065654,0.027199,0.034701,0.224991,0.093203,0.049482,0.332222,0.05703,0.121667,...,0.021731,0.096431,0.000687,0.14109,0.364073,0.011675,0.017451,0.257702,0.022852,0.021613


In [26]:
import pandas as pd
import numpy as np

# Transform the plots into their TF-IDF representation
X_test_dtm = vect.transform(dataTesting['plot'])

# Predict the probability of each genre
y_pred_test_genres = MRL.predict_proba(X_test_dtm)

# Keep only the first two observations
sample_pred = y_pred_test_genres[:2]
sample_index = dataTesting.index[:2]

# Build the results DataFrame with the genre names as columns
res_sample = pd.DataFrame(sample_pred, index=sample_index, columns=cols)

# Show the results
print("Genre predictions for two observations from the test set:\n")
print(res_sample)

# (Optional) Save to CSV
# res_sample.to_csv("sample_pred_genres.csv", index_label="ID")


Genre predictions for two observations from the test set:

   p_Action  p_Adventure  p_Animation  p_Biography  p_Comedy   p_Crime  \
1  0.083539     0.084634     0.024657     0.030599  0.364692  0.126847   
4  0.155448     0.041694     0.027395     0.132348  0.210923  0.330046   

   p_Documentary   p_Drama  p_Family  p_Fantasy  ...  p_Musical  p_Mystery  \
1       0.032072  0.568810  0.048432   0.104630  ...   0.036070   0.102274   
4       0.062884  0.746857  0.031286   0.028319  ...   0.030652   0.037363   

     p_News  p_Romance  p_Sci-Fi   p_Short   p_Sport  p_Thriller     p_War  \
1  0.000664   0.536068  0.050091  0.011571  0.026651    0.180365  0.022727   
4  0.000760   0.081135  0.027176  0.011926  0.026721    0.253232  0.064128   

   p_Western  
1   0.028469  
4   0.033672  

[2 rows x 24 columns]
