![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, within your working groups, what you have learned about preprocessing techniques, predictive NLP models, and model deployment. As you work on it, follow the instructions given in the "Guía del proyecto 2: Clasificación de género de películas".

**Submission**: The project is due during week 8. However, it is important that you make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report as a PDF to the project submission activity found in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a movie genre dataset. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the storyline), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The goal is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
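
Because a movie can carry several genres at once, the target is multi-label: one binary indicator per genre rather than a single class. The following is a minimal sketch of that encoding in plain Python, using the example movie plus a second, illustrative one. The helper name `multi_hot` is hypothetical; it only mirrors the idea behind scikit-learn's `MultiLabelBinarizer`, which the notebook uses later.

```python
def multi_hot(label_lists):
    """Return (sorted classes, one 0/1 row per sample)."""
    classes = sorted({g for labels in label_lists for g in labels})
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_lists]
    return classes, rows

movies = [
    ['Comedy', 'Crime', 'Horror'],  # 'How to Be a Serial Killer'
    ['Drama'],                      # illustrative second movie
]
classes, y = multi_hot(movies)
print(classes)  # ['Comedy', 'Crime', 'Drama', 'Horror']
print(y)        # [[1, 1, 0, 1], [0, 0, 1, 0]]
```

A model for this task then outputs one probability per column instead of a single predicted class.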

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting the test set for submission to Kaggle
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.
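
As a rough sketch of that layout: the file has one row per test movie, an `ID` column holding the movie's index, and one `p_<genre>` probability column per genre. The snippet below uses only the standard library, three genres, and made-up probabilities to stay short; the real file needs all 24 genre columns.

```python
import csv
import io

genres = ['Action', 'Comedy', 'Drama']   # subset, for illustration only
predictions = {1: [0.14, 0.35, 0.49],    # made-up probabilities keyed by movie ID
               4: [0.12, 0.37, 0.52]}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['ID'] + [f'p_{g}' for g in genres])   # header row
for movie_id, probs in predictions.items():
    writer.writerow([movie_id] + probs)                # one row per movie

submission = buffer.getvalue()
print(submission)
```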

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Library imports
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [3]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [4]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [5]:
# Preview the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [6]:
# Define the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)

In [7]:
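# Define the target variable (y)
# The genres are stored as strings such as "['Short', 'Drama']"; ast.literal_eval parses them without executing arbitrary code
import ast
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)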
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])
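
The `genres` column arrives as text (for example `"['Short', 'Drama']"`), so it has to be parsed into a Python list before it can be binarized. Below is a small sketch of one safe way to do that with `ast.literal_eval` from the standard library, which accepts only Python literals and rejects anything else.

```python
import ast

raw = "['Comedy', 'Crime', 'Horror']"   # a cell value as read from the CSV
genres = ast.literal_eval(raw)
print(genres)        # ['Comedy', 'Crime', 'Horror']
print(type(genres))  # <class 'list'>

# Anything that is not a plain literal is rejected instead of being executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```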

In [8]:
# Split the predictors (X) and the target (y) into training and test sets with train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [9]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [10]:
# Predict with the classification model
y_pred_genres = clf.predict_proba(X_test)

# Print the model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7812262183677007
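
The score above is a macro-averaged ROC AUC: an AUC is computed for each genre column separately, and the per-genre values are averaged with equal weight. The following pure-Python sketch reproduces that calculation on toy data, using the pairwise-ranking definition of AUC (the probability that a random positive is scored above a random negative, with ties counting one half). The function names are illustrative; the notebook itself calls `roc_auc_score(..., average='macro')`.

```python
def auc_binary(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of the per-column AUCs."""
    n_labels = len(Y_true[0])
    per_label = [auc_binary([row[j] for row in Y_true], [row[j] for row in Y_score])
                 for j in range(n_labels)]
    return sum(per_label) / n_labels

# Toy example: 4 movies, 2 genres
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.6], [0.4, 0.1], [0.1, 0.5]]
print(macro_auc(Y_true, Y_score))  # 0.75 (per-genre AUCs are 1.0 and 0.5)
```

Because every genre weighs the same, a rare genre with a poor AUC pulls the macro average down as much as a common one.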

In [11]:
# Transform the predictor variables X of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [12]:
# Save the predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.14303,0.10196,0.024454,0.029938,0.354552,0.13883,0.030787,0.49014,0.073159,0.101339,...,0.025069,0.063208,0.0,0.362818,0.056648,0.00897,0.017522,0.202605,0.033989,0.018117
4,0.122624,0.085786,0.024213,0.084795,0.370949,0.216657,0.080359,0.515684,0.062976,0.067019,...,0.024734,0.060935,0.000477,0.149703,0.05819,0.014248,0.020099,0.204794,0.030438,0.018506
5,0.151364,0.110284,0.013762,0.075334,0.304837,0.448736,0.02101,0.611544,0.081741,0.169121,...,0.044538,0.261372,0.0,0.335987,0.128505,0.001016,0.048658,0.423242,0.052693,0.025351
6,0.154448,0.125772,0.020991,0.064124,0.340779,0.140892,0.009133,0.632038,0.068287,0.063631,...,0.131074,0.088418,0.0,0.197224,0.132208,0.001432,0.039743,0.269385,0.077607,0.017862
7,0.175143,0.210069,0.035476,0.032505,0.31385,0.24315,0.021793,0.427885,0.079781,0.143879,...,0.023859,0.090359,4.8e-05,0.205117,0.241663,0.002634,0.018403,0.259465,0.021569,0.017585


## 1- Data preprocessing

In [45]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [46]:
# Preview the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [47]:
# Normalize the text predictors: convert everything to lowercase
dataTraining['title'] = dataTraining['title'].str.lower()
dataTraining['plot'] = dataTraining['plot'].str.lower()
dataTraining.head()


# dataTesting
dataTesting['title'] = dataTesting['title'].str.lower()
dataTesting['plot'] = dataTesting['plot'].str.lower()
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,message in a bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,midnight express,"the true story of billy hayes , an american c..."
5,1996,primal fear,martin vail left the chicago da ' s office to ...
6,1950,crisis,husband and wife americans dr . eugene and mr...
7,1959,the tingler,the coroner and scientist dr . warren chapin ...


In [48]:
# Remove symbols, digits, and punctuation

import re
import unicodedata

def remove_accents(text):
    # Normalize the text so that base characters are separated from their diacritics
    nfkd_form = unicodedata.normalize('NFKD', text)
    # Keep only the base characters (drop the combining diacritics)
    return ''.join([c for c in nfkd_form if not unicodedata.combining(c)])


def remove_symbols(text):
    # Use a regular expression to remove every character that is not an ASCII letter or whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text)


def clean_text(text):
    text = remove_accents(text)
    text = remove_symbols(text)
    return text

dataTraining['plot'] = dataTraining['plot'].map(clean_text)
dataTraining['title'] = dataTraining['title'].map(clean_text)


# dataTesting
dataTesting['plot'] = dataTesting['plot'].map(clean_text)
dataTesting['title'] = dataTesting['title'].map(clean_text)
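
To make the effect of this cleaning step concrete, the sketch below repeats the same three helpers on an invented sentence. It suggests two things worth keeping in mind: digits disappear along with punctuation (so years or other numbers in a plot are lost), and deleted characters leave extra spaces behind, which whitespace tokenization absorbs.

```python
import re
import unicodedata

def remove_accents(text):
    # Split accented characters into base + combining mark, then drop the marks
    nfkd_form = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))

def remove_symbols(text):
    # Keep only ASCII letters and whitespace
    return re.sub(r'[^a-zA-Z\s]', '', text)

def clean_text(text):
    return remove_symbols(remove_accents(text))

sample = "amélie , 23 , moves to paris ."   # invented example
cleaned = clean_text(sample)
print(repr(cleaned))

# Removed characters leave runs of spaces; split() collapses them
tokens = cleaned.split()
print(tokens)  # ['amelie', 'moves', 'to', 'paris']
```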

In [49]:
# CountVectorizer with lemmatization and stop-word removal

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Make sure the required NLTK resources have been downloaded
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

lemmatizer = WordNetLemmatizer()

def lemmatize_as_verb(text):
    words = nltk.word_tokenize(text)
    return [lemmatizer.lemmatize(w, pos='v') for w in words]

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\camil\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\camil\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\camil\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [50]:
##### Process the words in PLOT

# Create a CountVectorizer that uses the custom tokenization and lemmatization function
vectorizer_splem = CountVectorizer(tokenizer=lemmatize_as_verb, stop_words='english', max_features=2000)

# Apply fit_transform to the training set to learn the vocabulary and turn the text into count vectors.
dataTraining_plot_splem = vectorizer_splem.fit_transform(dataTraining['plot'])

# Apply transform to the test set so that it uses the same vocabulary.
dataTesting_plot_splem = vectorizer_splem.transform(dataTesting['plot'])

column_names = vectorizer_splem.get_feature_names_out()
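
The key point in this cell is that the vocabulary is learned from the training plots only (`fit_transform`) and then reused unchanged on the test plots (`transform`), so both matrices share the same columns. Here is a toy, standard-library sketch of that idea. The names `fit_vocabulary` and `transform` are hypothetical, and the sketch ignores lemmatization, stop words, and the exact tie-breaking that `CountVectorizer` applies.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens, returned in sorted order."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic
    top = sorted(counts, key=lambda t: (-counts[t], t))[:max_features]
    return sorted(top)

def transform(docs, vocabulary):
    """Count each vocabulary token per document; unseen tokens are ignored."""
    return [[doc.split().count(tok) for tok in vocabulary] for doc in docs]

train_docs = ["killer teaches clerk", "killer hunts killer"]
test_docs = ["clerk meets alien"]   # 'meets' and 'alien' never appear in training

vocab = fit_vocabulary(train_docs, max_features=2)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)
print(vocab)         # ['clerk', 'killer']
print(train_counts)  # [[1, 1], [0, 2]]
print(test_counts)   # [[1, 0]]
```

Words that appear only in the test set simply have no column, which is the intended behavior: the model never saw them during training.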

In [51]:
##### Process the words in TITLE

# Create a CountVectorizer that uses the custom tokenization and lemmatization function
vectorizer_splem = CountVectorizer(tokenizer=lemmatize_as_verb, stop_words='english', max_features=2000)

# Apply fit_transform to the training set to learn the vocabulary and turn the text into count vectors.
dataTraining_title_splem = vectorizer_splem.fit_transform(dataTraining['title'])

# Apply transform to the test set so that it uses the same vocabulary.
dataTesting_title_splem = vectorizer_splem.transform(dataTesting['title'])

column_names_title = vectorizer_splem.get_feature_names_out()

In [52]:
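# dataTraining: reuse the original index so the rows stay aligned with dataTraining when the frames are concatenated
df = pd.DataFrame.sparse.from_spmatrix(dataTraining_plot_splem, index=dataTraining.index, columns=column_names)
df = df.add_prefix('plot_')
df.head()

# dataTesting: add the prefix to df_test itself, not to the training frame df
df_test = pd.DataFrame.sparse.from_spmatrix(dataTesting_plot_splem, index=dataTesting.index, columns=column_names)
df_test = df_test.add_prefix('plot_')
df_test.head()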

Unnamed: 0,plot_plot_aaron,plot_plot_abandon,plot_plot_abby,plot_plot_abduct,plot_plot_ability,plot_plot_able,plot_plot_aboard,plot_plot_abuse,plot_plot_abusive,plot_plot_academy,...,plot_plot_writer,plot_plot_wrong,plot_plot_wwii,plot_plot_x,plot_plot_year,plot_plot_years,plot_plot_york,plot_plot_young,plot_plot_younger,plot_plot_youth
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [53]:
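# dataTraining: reuse the original index so the rows stay aligned with dataTraining when the frames are concatenated
df1 = pd.DataFrame.sparse.from_spmatrix(dataTraining_title_splem, index=dataTraining.index, columns=column_names_title)
df1 = df1.add_prefix('title_')
df1.head()

# dataTesting: add the prefix to df1_test itself, not to the training frame df1
df1_test = pd.DataFrame.sparse.from_spmatrix(dataTesting_title_splem, index=dataTesting.index, columns=column_names_title)
df1_test = df1_test.add_prefix('title_')
df1_test.head()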

Unnamed: 0,title_title_abandon,title_title_abbott,title_title_abcs,title_title_abduction,title_title_abominable,title_title_abraham,title_title_absolute,title_title_aby,title_title_academy,title_title_accidental,...,title_title_yuma,title_title_z,title_title_zanzibar,title_title_zero,title_title_ziegfeld,title_title_zombie,title_title_zombies,title_title_zone,title_title_zoo,title_title_zorro
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [54]:
df_final = pd.concat([dataTraining,df,df1], axis=1)
df_final_test = pd.concat([dataTesting,df_test,df1_test], axis=1)
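
Note that `pd.concat(..., axis=1)` matches rows by index label, not by position. The small sketch below, on invented data, shows what can happen when one frame is indexed by movie ID and the other by a default 0..n-1 range, and how sharing the index avoids it.

```python
import pandas as pd

movies = pd.DataFrame({'year': [2003, 2008]}, index=[3107, 900])   # indexed by movie ID
features = pd.DataFrame({'plot_killer': [0, 1]})                    # default index 0, 1

# Labels 3107, 900, 0 and 1 are all different, so nothing lines up
misaligned = pd.concat([movies, features], axis=1)
print(misaligned.shape)   # (4, 2), with NaN wherever a label is missing

# Giving the feature frame the same index keeps each row with its movie
features_by_id = features.set_axis(movies.index, axis=0)
aligned = pd.concat([movies, features_by_id], axis=1)
print(aligned)
```

A quick check such as comparing `len(result)` with `len(movies)` right after the concatenation is usually enough to spot this kind of mismatch.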

In [55]:
# Keep only the rows with a known year
df_final = df_final[df_final['year'].notna()]
df_final_test = df_final_test[df_final_test['year'].notna()]

In [56]:
df_final.shape

(7895, 4005)

In [57]:
df_final.drop(columns = ['title','plot'], inplace = True)
df_final.head()

Unnamed: 0,year,genres,rating,plot_aaron,plot_abandon,plot_abby,plot_abduct,plot_ability,plot_able,plot_aboard,...,title_yuma,title_z,title_zanzibar,title_zero,title_ziegfeld,title_zombie,title_zombies,title_zone,title_zoo,title_zorro
3107,2003.0,"['Short', 'Drama']",8.0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
900,2008.0,"['Comedy', 'Crime', 'Horror']",5.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6724,1941.0,"['Drama', 'Film-Noir', 'Thriller']",7.2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4704,1954.0,['Drama'],7.4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2582,1990.0,"['Action', 'Crime', 'Thriller']",6.6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [58]:
# dataTesting
df_final_test.drop(columns = ['title','plot'], inplace = True)
df_final_test.head()

Unnamed: 0,year,plot_plot_aaron,plot_plot_abandon,plot_plot_abby,plot_plot_abduct,plot_plot_ability,plot_plot_able,plot_plot_aboard,plot_plot_abuse,plot_plot_abusive,...,title_title_yuma,title_title_z,title_title_zanzibar,title_title_zero,title_title_ziegfeld,title_title_zombie,title_title_zombies,title_title_zone,title_title_zoo,title_title_zorro
1,1999.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1978.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1996.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,1950.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,1959.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [59]:
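# Define the target variable (y)
# The genres are stored as strings; ast.literal_eval parses them without executing arbitrary code
import ast
df_final['genres'] = df_final['genres'].map(ast.literal_eval)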
le = MultiLabelBinarizer()
y_genres = le.fit_transform(df_final['genres'])
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])

In [60]:
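# Reuse the index of df_final so each label row stays attached to its movie in the concat below
df_genders = pd.DataFrame(y_genres, index=df_final.index, columns=le.classes_)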
df_genders.head()

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Musical,Mystery,News,Romance,Sci-Fi,Short,Sport,Thriller,War,Western
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [61]:
df_final = pd.concat([df_final, df_genders], axis=1)
df_final.head()

Unnamed: 0,year,genres,rating,plot_aaron,plot_abandon,plot_abby,plot_abduct,plot_ability,plot_able,plot_aboard,...,Musical,Mystery,News,Romance,Sci-Fi,Short,Sport,Thriller,War,Western
3107,2003.0,"[Short, Drama]",8.0,0,0,0,0,0,0,0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
900,2008.0,"[Comedy, Crime, Horror]",5.6,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6724,1941.0,"[Drama, Film-Noir, Thriller]",7.2,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4704,1954.0,[Drama],7.4,0,0,0,0,0,0,0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2582,1990.0,"[Action, Crime, Thriller]",6.6,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
df_final = df_final[df_final['year'].notna()]
df_final.shape

(7895, 4027)

In [63]:
df_final.head()

Unnamed: 0,year,genres,rating,plot_aaron,plot_abandon,plot_abby,plot_abduct,plot_ability,plot_able,plot_aboard,...,Musical,Mystery,News,Romance,Sci-Fi,Short,Sport,Thriller,War,Western
3107,2003.0,"[Short, Drama]",8.0,0,0,0,0,0,0,0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
900,2008.0,"[Comedy, Crime, Horror]",5.6,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6724,1941.0,"[Drama, Film-Noir, Thriller]",7.2,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4704,1954.0,[Drama],7.4,0,0,0,0,0,0,0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2582,1990.0,"[Action, Crime, Thriller]",6.6,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [64]:
# Split the data into a frame of predictor variables and a frame of response variables

X = df_final.drop(columns = list(df_genders.columns))
y = df_final[df_genders.columns]


# Split the predictors (X) and the target (y) into training and test sets with train_test_split

X_train, X_test, y_train_genres, y_test_genres = train_test_split(X, y, test_size=0.33, random_state=42)


In [65]:
X.shape

(7895, 4003)

In [66]:
y.head()

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Musical,Mystery,News,Romance,Sci-Fi,Short,Sport,Thriller,War,Western
3107,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
900,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6724,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2582,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [67]:
X_train.dropna(inplace = True)

In [68]:
y_train_genres.dropna(inplace = True)

In [69]:
X_train.shape

(3723, 4003)

In [70]:
y_train_genres.shape

(3723, 24)

In [71]:
X_train.head()

Unnamed: 0,year,genres,rating,plot_aaron,plot_abandon,plot_abby,plot_abduct,plot_ability,plot_able,plot_aboard,...,title_yuma,title_z,title_zanzibar,title_zero,title_ziegfeld,title_zombie,title_zombies,title_zone,title_zoo,title_zorro
6350,2005.0,[Horror],5.3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2154,1985.0,"[Biography, Crime, Drama, Thriller]",6.8,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6191,1989.0,"[Adventure, History, Romance]",5.8,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3142,2014.0,[Comedy],3.7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2254,1994.0,"[Biography, Drama, Sport, Western]",6.5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [72]:
X_train.drop(columns=['genres'], inplace = True)
X_train.head()

Unnamed: 0,year,rating,plot_aaron,plot_abandon,plot_abby,plot_abduct,plot_ability,plot_able,plot_aboard,plot_abuse,...,title_yuma,title_z,title_zanzibar,title_zero,title_ziegfeld,title_zombie,title_zombies,title_zone,title_zoo,title_zorro
6350,2005.0,5.3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2154,1985.0,6.8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6191,1989.0,5.8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3142,2014.0,3.7,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2254,1994.0,6.5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [73]:
y_train_genres.head()

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Musical,Mystery,News,Romance,Sci-Fi,Short,Sport,Thriller,War,Western
6350,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2154,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6191,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3142,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2254,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## RandomForestClassifier with OneVsRestClassifier model

In [74]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

In [75]:
# Predict with the classification model

X_test.drop(columns=['genres'], inplace = True)
X_test.dropna(inplace = True)
y_test_genres.dropna(inplace = True)

y_pred_genres = clf.predict_proba(X_test)

# Print the model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7952884745921843

## RandomForestClassifier with OneVsRestClassifier model - GridSearch

In [88]:
# Define the hyperparameters to tune
clf_params = {
    'estimator__n_jobs': [-1],
    'estimator__n_estimators': [50,60,70,75,80,85],
    'estimator__max_depth': [25,30,35,40],
    'estimator__random_state': [42]
}
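
A grid search fits one model per combination of values and per cross-validation fold, so its cost grows multiplicatively. The following quick sketch sizes a grid like the one above before launching it, assuming 3-fold cross-validation (the fold count is an assumption of this example).

```python
from itertools import product

clf_params = {
    'estimator__n_jobs': [-1],
    'estimator__n_estimators': [50, 60, 70, 75, 80, 85],
    'estimator__max_depth': [25, 30, 35, 40],
    'estimator__random_state': [42],
}
cv_folds = 3  # assumed number of cross-validation folds

# One candidate per element of the Cartesian product of all value lists
candidates = list(product(*clf_params.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 24 72
```

By default `GridSearchCV` also refits the best candidate once more on the full training data, so the total is one fit higher than this count.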


In [89]:
# Model definition
rf = RandomForestClassifier()
clf = OneVsRestClassifier(rf)

In [85]:
from sklearn.model_selection import GridSearchCV

# Set up and run the grid search
gs = GridSearchCV(estimator = clf, param_grid = clf_params, cv = 3, verbose = 2)
gs.fit(X_train, y_train_genres)

print('Best parameters according to Grid Search:', gs.best_params_)

Fitting 3 folds for each of 21 candidates, totalling 63 fits
[CV] END estimator__max_depth=10, estimator__n_estimators=10, estimator__n_jobs=-1, estimator__random_state=42; total time=  24.3s
[CV] END estimator__max_depth=10, estimator__n_estimators=10, estimator__n_jobs=-1, estimator__random_state=42; total time=  20.4s
[CV] END estimator__max_depth=10, estimator__n_estimators=10, estimator__n_jobs=-1, estimator__random_state=42; total time=  20.1s
[CV] END estimator__max_depth=10, estimator__n_estimators=20, estimator__n_jobs=-1, estimator__random_state=42; total time=  20.9s
[CV] END estimator__max_depth=10, estimator__n_estimators=20, estimator__n_jobs=-1, estimator__random_state=42; total time=  22.4s
[CV] END estimator__max_depth=10, estimator__n_estimators=20, estimator__n_jobs=-1, estimator__random_state=42; total time=  21.4s
[CV] END estimator__max_depth=10, estimator__n_estimators=30, estimator__n_jobs=-1, estimator__random_state=42; total time=  22.3s
[CV] END estimator__ma

In [87]:
# Predict with the classification model

# X_test.drop(columns=['genres'], inplace = True)
# X_test.dropna(inplace = True)
# y_test_genres.dropna(inplace = True)

y_pred_genres = gs.predict_proba(X_test)

# Print the model performance
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7798331415963063

## Neural network model

In [None]:
pip install keras

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Library imports
from keras.models import Sequential, Model
from keras.layers import Dense, Input
from keras import metrics
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
from keras import backend as K
from livelossplot import PlotLossesKeras

KeyError: "Registering two gradient with name 'FakeQuantWithMinMaxArgs'! (Previous registration was in register c:\\Users\\camil\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\tensorflow\\python\\framework\\registry.py:65)"

In [None]:
# Model definition and training
rn = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
rn.fit(X_train, y_train_genres)

In [69]:
# Transform the predictor variables X of the test set
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

ValueError: X has 1000 features, but RandomForestClassifier is expecting 4002 features as input.
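
An error like the one above means that the matrix passed to `predict_proba` was not built the same way as the training matrix. One way to catch this earlier is to compare column names before predicting. The following is a minimal sketch with a hypothetical helper, `compare_columns`, applied to invented column lists.

```python
def compare_columns(train_columns, test_columns):
    """Report which columns are missing from, or extra in, the test set."""
    train_set, test_set = set(train_columns), set(test_columns)
    return {
        'missing_in_test': sorted(train_set - test_set),
        'extra_in_test': sorted(test_set - train_set),
        'same_order': list(train_columns) == list(test_columns),
    }

train_columns = ['year', 'rating', 'plot_killer', 'title_zombie']
test_columns = ['year', 'plot_killer', 'title_zombie']   # e.g. no 'rating' at prediction time

report = compare_columns(train_columns, test_columns)
print(report)
# {'missing_in_test': ['rating'], 'extra_in_test': [], 'same_order': False}
```

With pandas frames, `test_df.reindex(columns=train_df.columns, fill_value=0)` is one way to force the same columns and order, although a column that is genuinely unavailable at prediction time is probably better left out of training.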

In [None]:
# Save the predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.14303,0.10196,0.024454,0.029938,0.354552,0.13883,0.030787,0.49014,0.073159,0.101339,...,0.025069,0.063208,0.0,0.362818,0.056648,0.00897,0.017522,0.202605,0.033989,0.018117
4,0.122624,0.085786,0.024213,0.084795,0.370949,0.216657,0.080359,0.515684,0.062976,0.067019,...,0.024734,0.060935,0.000477,0.149703,0.05819,0.014248,0.020099,0.204794,0.030438,0.018506
5,0.151364,0.110284,0.013762,0.075334,0.304837,0.448736,0.02101,0.611544,0.081741,0.169121,...,0.044538,0.261372,0.0,0.335987,0.128505,0.001016,0.048658,0.423242,0.052693,0.025351
6,0.154448,0.125772,0.020991,0.064124,0.340779,0.140892,0.009133,0.632038,0.068287,0.063631,...,0.131074,0.088418,0.0,0.197224,0.132208,0.001432,0.039743,0.269385,0.077607,0.017862
7,0.175143,0.210069,0.035476,0.032505,0.31385,0.24315,0.021793,0.427885,0.079781,0.143879,...,0.023859,0.090359,4.8e-05,0.205117,0.241663,0.002634,0.018403,0.259465,0.021569,0.017585
