![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/banner_1.png)

# Project 2 - Movie genre classification

The purpose of this project is for you to put into practice, in your working groups, your knowledge of preprocessing techniques, predictive NLP models, and model deployment. While developing it, follow the instructions given in the "Project 2 guide: Movie genre classification".

**Submission**: The project must be submitted during week 8. However, it is important that you make progress during week 7 on modeling the problem and on part of the report, as indicated in the guide.

To submit, attach the self-contained report in PDF to the project submission activity in week 8, and upload the predictions file to the [Kaggle competition](https://www.kaggle.com/t/2c54d005f76747fe83f77fbf8b3ec232).

## Data for movie genre prediction

![image info](https://raw.githubusercontent.com/albahnsen/MIAD_ML_and_NLP/main/images/moviegenre.png)

This project uses a dataset of movie genres. Each observation contains a movie's title, its release year, its synopsis or plot (a summary of the story), and the genres it belongs to (a movie can belong to more than one genre). For example:
- Title: 'How to Be a Serial Killer'
- Plot: 'A serial killer decides to teach the secrets of his satisfying career to a video store clerk.'
- Genres: 'Comedy', 'Crime', 'Horror'

The idea is to use these data to predict, given the synopsis, the probability that a movie belongs to each of the genres.
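
As a minimal sketch of what "a probability for each genre" implies for the target, each movie's genre list can be encoded as a binary indicator vector with one column per genre. The two toy rows below reuse genre lists shown in this notebook; the variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Two toy observations: a movie may carry several genres at once
genres = [['Comedy', 'Crime', 'Horror'], ['Drama']]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)   # one 0/1 column per genre

print(list(mlb.classes_))       # genre order of the columns (sorted)
print(y)
```

A model trained on such a target then outputs one probability per column, which is the shape the Kaggle submission expects.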

We thank Professor Fabio González, Ph.D., and his student John Arevalo for providing this dataset. See https://arxiv.org/abs/1702.01992

## Example: predicting the test set for submission to Kaggle
This section shows the format in which you must save the prediction results so that you can upload them to the Kaggle competition.
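
A minimal sketch of the expected file layout, assuming (as in the cells further down) one `p_<Genre>` probability column per genre and an `ID` index column. The genre list is shortened to three entries and the probabilities are made-up placeholder values, not model output.

```python
import io
import numpy as np
import pandas as pd

genres = ['Action', 'Comedy', 'Drama']          # shortened, illustrative genre list
cols = ['p_' + g for g in genres]               # column naming used for the submission
ids = [1, 4, 5]                                 # test-set IDs (index of dataTesting)

# Placeholder probabilities; in practice these come from predict_proba
probs = np.array([[0.1, 0.2, 0.5],
                  [0.0, 0.2, 0.7],
                  [0.0, 0.0, 0.9]])

res = pd.DataFrame(probs, index=ids, columns=cols)

# Write to an in-memory buffer instead of disk to inspect the CSV text
buffer = io.StringIO()
res.to_csv(buffer, index_label='ID')
header = buffer.getvalue().splitlines()[0]
print(header)
```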

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Library imports
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

In [3]:
# Load the data from the .csv files
dataTraining = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/MIAD_ML_and_NLP/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [4]:
# Preview of the training data
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [5]:
# Preview of the test data
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [6]:
dataTraining['length']=dataTraining['plot'].apply(len)
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating,length
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0,1236
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6,94
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2,737
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4,2067
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6,1027


In [10]:
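import re
import string
import nltk
from wordcloud import WordCloud
from collections import Counter
from tensorflow.keras.preprocessing.text import Tokenizer

# Load the English stopword list once instead of on every call
# (requires the NLTK 'stopwords' and 'punkt' resources, see nltk.download)
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def cleaningtext(text):
    text = text.lower()
    text = re.sub(r'@\S+', '', text)                    # drop @mentions
    text = re.sub(r'http\S+', '', text)                 # drop URLs
    text = re.sub(r'pic\.\S+', '', text)                # drop pic.* links (escaped dot, so words like "picture" are kept)
    text = re.sub(r"[^a-zA-Z+']", ' ', text)            # keep only letters, '+' and apostrophes
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text + ' ')   # drop isolated single letters
    text = "".join([i for i in text if i not in string.punctuation])
    words = nltk.tokenize.word_tokenize(text)
    # Remove stopwords and tokens of one or two characters
    text = " ".join([i for i in words if i not in STOPWORDS and len(i) > 2])
    text = re.sub(r"\s\s+", " ", text).strip()          # collapse repeated whitespace
    return text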

In [11]:
# Cleaned text; check that it has fewer characters than the original
dataTraining['Text_cleaning'] = dataTraining["plot"].apply(cleaningtext)
dataTraining['length_Text_cleaning']=dataTraining['Text_cleaning'].apply(len)
dataTraining.head()

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\usuario/nltk_data'
    - 'C:\\Users\\usuario\\anaconda3\\nltk_data'
    - 'C:\\Users\\usuario\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\usuario\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\usuario\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************
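
The `LookupError` above means the NLTK `punkt` tokenizer data is not installed. As the message says, `nltk.download('punkt')` fetches it, and `nltk.download('stopwords')` does the same for the stopword list. Where downloading is not possible, a rough alternative is to avoid NLTK altogether. The sketch below is a hypothetical variant (`cleaningtext_no_nltk` is not part of the notebook) that splits on whitespace and uses scikit-learn's built-in `ENGLISH_STOP_WORDS`. That list differs from NLTK's, so the cleaned text may not match exactly.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def cleaningtext_no_nltk(text):
    """Lowercase, keep letters only, drop stopwords and tokens of <= 2 characters."""
    text = text.lower()
    text = re.sub(r'http\S+', ' ', text)      # drop URLs
    text = re.sub(r'[^a-z]', ' ', text)       # keep letters only
    words = text.split()                      # whitespace tokenization
    kept = [w for w in words if w not in ENGLISH_STOP_WORDS and len(w) > 2]
    return ' '.join(kept)

plot = ('A serial killer decides to teach the secrets of his satisfying '
        'career to a video store clerk.')
cleaned = cleaningtext_no_nltk(plot)
print(cleaned)
```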


In [9]:
# Definition of the target variable (y)
# ast.literal_eval safely parses the list literal stored as a string (avoids eval)
import ast
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [10]:
dataTraining = dataTraining.drop('genres', axis=1)
dataTraining

Unnamed: 0,year,title,plot,rating,length,Text_cleaning,length_Text_cleaning
3107,2003,Most,most is the story of a single father who takes...,8.0,1236,story single father takes eight year old son w...,742
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,5.6,94,serial killer decides teach secrets satisfying...,71
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...",7.2,737,sweden female blackmailer disfiguring facial s...,470
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",7.4,2067,friday afternoon new york president tredway co...,1360
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...",6.6,1027,los angeles editor publishing house carol hunn...,668
...,...,...,...,...,...,...,...
8417,2010,Our Family Wedding,""" our marriage , their wedding . "" it ' s l...",4.9,1072,marriage wedding lesson number one newly engag...,632
1592,1984,Conan the Destroyer,"the wandering barbarian , conan , alongside ...",5.8,593,wandering barbarian conan alongside goofy rogu...,415
1723,1955,Kismet,"like a tale spun by scheherazade , kismet fol...",6.4,199,like tale spun scheherazade kismet follows rem...,145
7605,1982,The Secret of NIMH,"mrs . brisby , a widowed mouse , lives in a...",7.6,1815,mrs brisby widowed mouse lives cinder block ch...,1137


In [11]:
# Definition of the predictor variables (X)
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['Text_cleaning'])
X_dtm.shape

(7895, 1000)

In [12]:
# Split predictors (X) and target (y) into training and test sets using train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [48]:
# XGBoost as the base classifier
base_classifier = XGBClassifier(learning_rate=0.1,n_estimators=300,max_depth=3)

classifier = OneVsRestClassifier(base_classifier)

classifier.fit(X_train, y_train_genres)

In [49]:
# Model predictions (per-genre probabilities)
y_pred_genres = classifier.predict_proba(X_test)

# Print model performance (macro-averaged ROC AUC)
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.8125283216752321
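
For reference, `average='macro'` computes one ROC AUC per genre column and takes their unweighted mean, so rare genres count as much as frequent ones. The sketch below checks this on a small hand-made example with two labels; the arrays are illustrative, not notebook data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multi-label ground truth (rows: movies, columns: genres)
y_true = np.array([[1, 0],
                   [0, 1],
                   [1, 1],
                   [0, 0]])
# Toy predicted probabilities for the same genres
y_score = np.array([[0.9, 0.2],
                    [0.4, 0.7],
                    [0.3, 0.6],
                    [0.1, 0.8]])

# One AUC per genre column, then the unweighted mean
per_label = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]
macro_manual = float(np.mean(per_label))
macro_sklearn = roc_auc_score(y_true, y_score, average='macro')

print(per_label, macro_manual, macro_sklearn)
```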

First attempt

- Without tuning, the score is 0.7997
- With light tuning (learning_rate=0.1, n_estimators=300): 0.804
- With light tuning (learning_rate=0.1, n_estimators=300, max_depth=5): 0.807
- With light tuning (learning_rate=0.1, n_estimators=300, max_depth=4): 0.8104
- With light tuning (learning_rate=0.1, n_estimators=300, max_depth=3): 0.8125
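
The comparison above was done by hand. A small loop makes such comparisons repeatable: fit one candidate per setting and score each on the held-out split with macro ROC AUC. The sketch below uses synthetic data and swaps in `LogisticRegression` for XGBoost to stay lightweight. The data, the candidate values of `C`, and all variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 3 labels, each driven by one feature plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 10))
Y = (X[:, :3] + 0.5 * rng.normal(size=(400, 3)) > 0).astype(int)

X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.33, random_state=42)

scores = {}
for C in [0.01, 0.1, 1.0]:                       # candidate regularization strengths
    clf = OneVsRestClassifier(LogisticRegression(C=C, max_iter=1000))
    clf.fit(X_tr, Y_tr)
    proba = clf.predict_proba(X_val)             # shape (n_samples, n_labels)
    scores[C] = roc_auc_score(Y_val, proba, average='macro')

best_C = max(scores, key=scores.get)
print(scores, best_C)
```

The same loop structure applies to any estimator wrapped in `OneVsRestClassifier`; only the candidate settings change.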


In [None]:
#### NOT EXECUTED ####
from sklearn.model_selection import GridSearchCV

# Instantiate the binary classifier (e.g., XGBoost)
base_classifier = XGBClassifier()

# Instantiate the OneVsRestClassifier
classifier = OneVsRestClassifier(base_classifier)

# Define the hyperparameters and their possible values
param_grid = {
    'estimator__n_estimators': [100, 200, 300],
    'estimator__max_depth': [3, 4, 5],
    'estimator__learning_rate': [0.1, 0.01, 0.001]
}

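# Perform grid search with cross-validation
# Note: with no `scoring` argument, candidates are ranked by the estimator's default score (subset accuracy for multi-label targets), not by ROC AUC
grid_search = GridSearchCV(classifier, param_grid, cv=5)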
grid_search.fit(X_train, y_train_genres)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Train the model with the best hyperparameters
classifier.set_params(**best_params)
classifier.fit(X_train, y_train_genres)

# Predict per-genre probabilities (ROC AUC needs scores, not hard 0/1 labels)
y_pred = classifier.predict_proba(X_test)


In [None]:
### NOT EXECUTED ###
# Print model performance
roc_auc_score(y_test_genres, y_pred, average='macro')

In [None]:
# New attempt: new CountVectorizer parameters, same OneVsRest model

In [24]:
# Definition of the predictor variables (X)
vect2 = CountVectorizer(stop_words="english", analyzer='word',ngram_range=(1, 2), tokenizer=lambda text: text.split(),max_df=1.0, min_df=1, max_features=1000)
X_dtm2 = vect2.fit_transform(dataTraining['Text_cleaning'])
X_dtm2.shape

(7895, 1000)

In [25]:
# Split predictors (X) and target (y) into training and test sets using train_test_split
X_train2, X_test2, y_train_genres2, y_test_genres2 = train_test_split(X_dtm2, y_genres, test_size=0.33, random_state=42)

In [50]:
classifier.fit(X_train2, y_train_genres2)

In [51]:
# Model predictions (per-genre probabilities)
y_pred_genres2 = classifier.predict_proba(X_test2)

# Print model performance (macro-averaged ROC AUC)
roc_auc_score(y_test_genres2, y_pred_genres2, average='macro')

0.8186199856779007

In [None]:
# CountVectorizer tuning (max_features=1000), no model tuning: 0.806
# CountVectorizer tuning (max_features=1000, tokenizer), no model tuning: 0.806
# CountVectorizer tuning (max_features=1000, tokenizer), light model tuning: 0.816
# CountVectorizer tuning (max_features=1000, tokenizer), light model tuning: 0.818

In [52]:
# Transform the predictor variables X of the test set
dataTesting['Text_cleaning'] = dataTesting["plot"].apply(cleaningtext)
dataTesting['length_Text_cleaning']=dataTesting['Text_cleaning'].apply(len)
dataTesting.head()

Unnamed: 0,year,title,plot,Text_cleaning,length_Text_cleaning
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ....",meets fate shall sealed fate theresa osborne r...,287
4,1978,Midnight Express,"the true story of billy hayes , an american c...",true story billy hayes american college studen...,91
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...,martin vail left chicago office become success...,1274
6,1950,Crisis,husband and wife americans dr . eugene and mr...,husband wife americans eugene mrs helen fergus...,694
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...,coroner scientist warren chapin researching sh...,675


In [53]:
X_test_dtm = vect2.transform(dataTesting['Text_cleaning'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = classifier.predict_proba(X_test_dtm)

In [54]:
# Save predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_XGB1.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.091717,0.085754,0.015698,0.038545,0.233926,0.073458,0.009824,0.507028,0.044467,0.187805,...,0.029091,0.063169,1.9e-05,0.707223,0.024595,0.007462,0.01489,0.135813,0.010802,0.013612
4,0.049555,0.022849,0.013458,0.173837,0.210501,0.158839,0.097006,0.727237,0.027878,0.012107,...,0.013534,0.045998,1.9e-05,0.039702,0.018374,0.011704,0.012133,0.211963,0.028143,0.015717
5,0.011926,0.005832,0.000453,0.024776,0.032616,0.903003,0.004752,0.93318,0.002276,0.009023,...,0.001309,0.228701,0.000761,0.053699,0.006433,0.000756,0.004632,0.319013,0.000574,0.002142
6,0.07848,0.053122,0.001645,0.079013,0.076584,0.077605,0.058923,0.622371,0.03464,0.040761,...,0.017651,0.157864,2.4e-05,0.181866,0.042531,5.6e-05,0.028391,0.387057,0.061712,0.002801
7,0.337785,0.138581,0.007091,0.0097,0.345536,0.177293,0.00459,0.197485,0.048612,0.278645,...,0.00656,0.048535,1.9e-05,0.054,0.478431,0.004356,0.009862,0.235329,0.004012,0.008468


In [None]:
# Kaggle result: 0.805

In [44]:
num_words = 50000
max_len = 250
tokenizer = Tokenizer(num_words=num_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(dataTraining['Text_cleaning'].values)


In [50]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Convert each cleaned plot to a sequence of word indices, then pad/truncate to max_len
X = tokenizer.texts_to_sequences(dataTraining['Text_cleaning'].values)
X = pad_sequences(X, maxlen=max_len)
#y = y_genres
X

array([[    0,     0,     0, ...,  4911,  8273,    67],
       [    0,     0,     0, ...,  1008,   432,  1739],
       [    0,     0,     0, ...,    40,    14,  1826],
       ...,
       [    0,     0,     0, ...,   665,    58,   373],
       [    0,     0,     0, ...,   133, 23579,   234],
       [    0,     0,     0, ..., 38229,  4010,  2077]], dtype=int32)
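
The leading zeros in the array above come from `pad_sequences`, which by default pads (and truncates) at the start of each sequence. The NumPy function below is a simplified re-implementation written for illustration (`pad_pre` is a hypothetical helper, not the Keras code), showing that default behaviour on two short sequences.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` items of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for i, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                   # 'pre' truncation keeps the tail
        if trimmed:
            out[i, -len(trimmed):] = trimmed      # 'pre' padding puts zeros first
    return out

padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```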

In [88]:
# Split predictors (X) and target (y) into training and test sets using train_test_split
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

In [91]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, SpatialDropout1D
from tensorflow.keras.callbacks import Callback, EarlyStopping
EMBEDDING_DIM = 100
model = Sequential()
# The Embedding layer receives padded index sequences of length max_len
model.add(Embedding(num_words, EMBEDDING_DIM, input_length=max_len))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.1, recurrent_dropout=0.2))
# Multi-label target: one independent sigmoid per genre with binary cross-entropy
# (softmax with categorical cross-entropy assumes exactly one genre per movie)
model.add(Dense(24, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
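
Because a movie can belong to several genres, the choice of output activation matters. Softmax normalizes the scores so they sum to 1 across genres, while independent sigmoids let several genres receive a high probability at once. A small NumPy sketch with made-up logits for three genres:

```python
import numpy as np

logits = np.array([3.0, 2.5, -4.0])     # made-up scores for three genres

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax.round(3), softmax.sum())   # probabilities compete and sum to 1
print(sigmoid.round(3))                  # each genre is scored independently
```

With these logits the first two genres both look likely, yet softmax must split the probability mass between them, whereas each sigmoid output can stay above 0.9.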

In [92]:
my_callbacks  = [EarlyStopping(monitor='val_loss',
                              min_delta=0,
                              patience=2,
                              mode='auto')]

In [93]:
X_test

<2606x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 72920 stored elements in Compressed Sparse Row format>

In [94]:
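# Keras needs the dense padded sequences X; the error recorded below came from passing the sparse X_dtm split
X_train_seq, X_test_seq, y_train_seq, y_test_seq = train_test_split(X, y_genres, test_size=0.33, random_state=42)
history = model.fit(X_train_seq, y_train_seq, epochs=6, batch_size=32, validation_data=(X_test_seq, y_test_seq), callbacks=my_callbacks)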

2023-05-20 21:55:44.227394: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at serialize_sparse_op.cc:389 : INVALID_ARGUMENT: indices[2] = [0,925] is out of order. Many sparse ops require sorted indices.
    Use `tf.sparse.reorder` to create a correctly ordered copy.




InvalidArgumentError: {{function_node __wrapped__SerializeManySparse_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[2] = [0,925] is out of order. Many sparse ops require sorted indices.
    Use `tf.sparse.reorder` to create a correctly ordered copy.

 [Op:SerializeManySparse]

(5289, 250)

In [89]:
# Model definition and training
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))
clf.fit(X_train, y_train_genres)

0.7699927924648776

In [11]:
# Transform the predictor variables X of the test set
# (vect was fitted on 'Text_cleaning', so the same cleaned column is used here)
X_test_dtm = vect.transform(dataTesting['Text_cleaning'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

# Predict on the test set
y_pred_test_genres = clf.predict_proba(X_test_dtm)

In [12]:
# Save predictions in the format required by the Kaggle competition
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
res.to_csv('pred_genres_text_RF.csv', index_label='ID')
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.14303,0.10196,0.024454,0.029938,0.354552,0.13883,0.030787,0.49014,0.073159,0.101339,...,0.025069,0.063208,0.0,0.362818,0.056648,0.00897,0.017522,0.202605,0.033989,0.018117
4,0.122624,0.085786,0.024213,0.084795,0.370949,0.216657,0.080359,0.515684,0.062976,0.067019,...,0.024734,0.060935,0.000477,0.149703,0.05819,0.014248,0.020099,0.204794,0.030438,0.018506
5,0.151364,0.110284,0.013762,0.075334,0.304837,0.448736,0.02101,0.611544,0.081741,0.169121,...,0.044538,0.261372,0.0,0.335987,0.128505,0.001016,0.048658,0.423242,0.052693,0.025351
6,0.154448,0.125772,0.020991,0.064124,0.340779,0.140892,0.009133,0.632038,0.068287,0.063631,...,0.131074,0.088418,0.0,0.197224,0.132208,0.001432,0.039743,0.269385,0.077607,0.017862
7,0.175143,0.210069,0.035476,0.032505,0.31385,0.24315,0.021793,0.427885,0.079781,0.143879,...,0.023859,0.090359,4.8e-05,0.205117,0.241663,0.002634,0.018403,0.259465,0.021569,0.017585
