## Movie Genre Classification

Classify a movie's genres based on its plot.

<img src="https://raw.githubusercontent.com/sergiomora03/AdvancedTopicsAnalytics/main/notebooks/img/moviegenre.png"
     style="float: left; margin-right: 10px;" />



### Data

Input:
- movie plot

Output:
- probability that the movie belongs to each genre


### Evaluation

- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 30% Code with the data processing and models developed that support the reported results.
- 30% Presentation of no more than 15 minutes with the main results of the project.
- 10% Model performance achieved. Metric: "AUC".

- The project must be carried out in groups of 4 people.
- Use clear and rigorous procedures.
- The project is due on March 15th, 2024, at 11:59 pm, submitted by email with a GitHub link.
- No projects will be received after the deadline or by any means other than the one established.
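
The model-performance criterion names AUC as the metric. As a minimal sketch of how a multilabel AUC can be computed, the block below scores made-up probabilities for three unnamed genres with `roc_auc_score`. Macro averaging is an assumption here (it matches the call used later in this notebook); the evaluation text itself only says "AUC".

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth: 4 movies x 3 genres (1 = the movie has that genre).
y_true = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# Made-up predicted probabilities, one column per genre.
y_prob = np.array([
    [0.9, 0.2, 0.70],
    [0.1, 0.8, 0.30],
    [0.8, 0.6, 0.65],
    [0.3, 0.1, 0.60],
])

# One AUC per genre, then the unweighted mean across genres.
per_genre_auc = roc_auc_score(y_true, y_prob, average=None)
macro_auc = roc_auc_score(y_true, y_prob, average="macro")

print(per_genre_auc)
print(macro_auc)
```

Looking at the per-genre values alongside the macro average helps spot genres that the model ranks poorly even when the overall score looks acceptable.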




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [1]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [2]:
dataTraining = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [4]:
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [5]:
import re
special_char_pattern = re.compile(r'[^a-zA-Z0-9\s]')

In [6]:
# Replace special characters with plain-text equivalents
replacements = {
    "ï": "i",
    "£": "$",
    "à": "a",
    "è": "e",
    "ì": "i",
    "ò": "o",
    "ù": "u",
    "®": "",
    "ä": "a",
    "ë": "e",
    "ö": "o",
    "ü": "u",
    "Bouvetøya": "Bouvet",
    " ' ": "'",
    "¹": "'",
    "â": "a",
    "ê": "e",
    "î": "i",
    "ô": "o",
    "û": "u",
    "å": "a",
    "é": "e",
    "í": "i",
    "ó": "o",
    "ú": "u",
    "á": "a",
}

# Apply every substitution
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()


3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden ,  a female blackmailer with a disfi...
4704    in a friday afternoon in new york ,  the presi...
2582    in los angeles ,  the editor of a publishing h...
Name: plot, dtype: object
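
The accent table in the cell above lists characters one by one. As a hedged alternative (not what this notebook does), Unicode decomposition can strip most diacritics without enumerating them. `strip_accents` is a hypothetical helper name, and the sample sentence is made up.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining marks so that 'é' becomes 'e', 'ü' becomes 'u', and so on."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = "amélie visits zürich and são paulo"
cleaned = strip_accents(sample)
print(cleaned)

# Letters without a Unicode decomposition, such as 'ø', pass through unchanged.
print(strip_accents("bouvetøya"))
```

Characters such as "ø" or "£" have no decomposition, so explicit entries like the ones in the dictionary would still be needed for them.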

In [7]:
# Collapse possessives and contractions into single tokens
replacements = {
    "bullard's": "bullards",
    "world's": "worlds",
    "avery's": "averys",
    "wallet's": "wallets",
    "father's": "fathers",
    "mother's": "mothers",
    "brother's": "brothers",
    "sister's": "sisters",
    "haakon's": "haakons",
    "king's": "kings",
    "queen's": "queens",
    "family's": "families",
    "it's": "its",
    "won't": "wont",
    "weyland's": "weylands",
    "didn't": "didnt",
}

# Apply every substitution
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()


3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden ,  a female blackmailer with a disfi...
4704    in a friday afternoon in new york ,  the presi...
2582    in los angeles ,  the editor of a publishing h...
Name: plot, dtype: object

In [8]:
# Collapse runs of two or more spaces in the 'plot' column into a single space
dataTraining['plot'] = dataTraining['plot'].str.replace(r'  +', ' ', regex=True)

# Replace doubled commas in the 'plot' column with a single comma
dataTraining['plot'] = dataTraining['plot'].str.replace(',,', ',', regex=False)

# Strip leading and trailing whitespace in the 'plot' column
dataTraining['plot'] = dataTraining['plot'].str.strip()

In [9]:
import re

def normalize_text(text):
    # Collapse any run of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Ensure a space after punctuation (.,;:!?) when one is missing
    text = re.sub(r'([.,;:!?])([^\s])', r'\1 \2', text)
    # Remove the space before punctuation
    text = re.sub(r'\s([.,;:!?])', r'\1', text)
    return text

# Apply the normalization to every plot
dataTraining['plot'] = dataTraining['plot'].apply(normalize_text)
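
To make the effect of `normalize_text` concrete, the sketch below repeats the function and runs it on a few made-up strings. It also shows two side effects worth knowing about: abbreviations and numbers that contain punctuation get split, which is presumably why a later cell maps "u. s." back to "u.s.".

```python
import re

def normalize_text(text):
    # Same three substitutions as in the cell above.
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'([.,;:!?])([^\s])', r'\1 \2', text)
    text = re.sub(r'\s([.,;:!?])', r'\1', text)
    return text

# Typical tokenized input: spaces before punctuation, none after.
sample = "in sweden ,  a female blackmailer meets a man .he leaves"
print(normalize_text(sample))

# Side effects: punctuation inside abbreviations and numbers is also spaced out.
print(normalize_text("u.s. army"))
print(normalize_text("1,000"))
```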

In [10]:
# Text-cleaning function
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    return text

# Apply the cleaning function to the 'plot' column
dataTraining['plot'] = dataTraining['plot'].apply(clean_text)


In [11]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden, a female blackmailer with a disfigu...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york, the preside...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles, the editor of a publishing hou...","['Action', 'Crime', 'Thriller']",6.6


Lemmatization

In [12]:
%pip install spacy



In [13]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.2 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [14]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 27.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [15]:
import re

def corregir_espacios_s(texto):
    # Pattern that matches a word followed by a detached "s" ("word s")
    patron = r"\b(\w+)\s+s\b"
    # Rejoin it as a possessive ("word's")
    texto_corregido = re.sub(patron, r"\1's", texto)
    return texto_corregido

# Apply the function to the 'plot' column
dataTraining['plot'] = dataTraining['plot'].apply(corregir_espacios_s)
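
As a small illustration of the possessive repair, the sketch below applies the same pattern to a made-up sentence and then removes the apostrophes, mirroring what the next cell does. `fix_detached_s` is a hypothetical stand-in for `corregir_espacios_s`.

```python
import re

def fix_detached_s(text):
    # Same pattern as corregir_espacios_s: "word s" becomes "word's".
    return re.sub(r"\b(\w+)\s+s\b", r"\1's", text)

sample = "the king s army guards the queen s castle"

# Step 1: reattach the detached "s" as a possessive.
rejoined = fix_detached_s(sample)
print(rejoined)

# Step 2: drop the apostrophes, as the following cell does.
stripped = rejoined.replace("'", "")
print(stripped)
```

The pattern is purely textual, so any stray single-letter "s" token is attached to the word before it, whether or not it was a possessive.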

In [16]:
# Remove apostrophes from the 'plot' column
dataTraining['plot'] = dataTraining['plot'].str.replace("'", "", regex=False)

In [17]:
# Remove a possessive 's preceded by whitespace (raw string and regex=True so \s is a regex class)
# Note: the previous cell already removed every apostrophe, so neither pattern matches at this point
dataTraining['plot'] = dataTraining['plot'].str.replace(r"\s's", "", regex=True)

# Also handle the case with no space before the 's
dataTraining['plot'] = dataTraining['plot'].str.replace("'s", "", regex=False)

# Check the result on a few rows
print(dataTraining['plot'].head())


3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden, a female blackmailer with a disfigu...
4704    in a friday afternoon in new york, the preside...
2582    in los angeles, the editor of a publishing hou...
Name: plot, dtype: object




In [18]:
# Rejoin split abbreviations and tidy leftover symbols
replacements = {
    "u. s.": "u.s.",
    "dr. t. ": "dr.",
    "!!": "!",
    "_": " "
}

# Apply every substitution
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()

3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden, a female blackmailer with a disfigu...
4704    in a friday afternoon in new york, the preside...
2582    in los angeles, the editor of a publishing hou...
Name: plot, dtype: object

In [19]:
import spacy
# Load the English language model
nlp = spacy.load('en_core_web_sm')

def remove_stopwords_spacy(text):
    # Process the text with spaCy
    doc = nlp(text)
    # Drop stop words and join the remaining tokens
    filtered_text = ' '.join([token.text for token in doc if not token.is_stop])
    return filtered_text

# Apply stop-word removal to the 'plot' column
dataTraining['plot'] = dataTraining['plot'].apply(remove_stopwords_spacy)

In [20]:
import spacy
from tqdm import tqdm
nlp = spacy.load('en_core_web_sm')

def lematizar_texto(texto):
    # Replace every token with its lemma
    doc = nlp(texto)
    lemas = [token.lemma_ for token in doc]
    return ' '.join(lemas)

# Apply the function to the 'plot' column with a progress bar
tqdm.pandas()
dataTraining['plot'] = dataTraining['plot'].progress_apply(lematizar_texto)


100%|██████████| 7895/7895 [02:30<00:00, 52.46it/s]


In [21]:
import re

def corregir_guiones(texto):
    # Collapse "- -" sequences into a single "-"
    texto_corregido = re.sub(r"- -", "-", texto)
    # Replace "a$$" with "ass"
    texto_corregido = re.sub(r"a\$\$", "ass", texto_corregido)
    return texto_corregido

# Apply the function to the 'plot' column
dataTraining['plot'] = dataTraining['plot'].apply(corregir_guiones)


def corregir_puntos_variados(texto):
    # Collapse runs of three or more periods, with or without spaces between them, into one period
    texto_corregido = re.sub(r"(\.(\s*\.){2,}\s*)|(\.{3,})", ".", texto)
    return texto_corregido

# Apply the function to the 'plot' column
dataTraining['plot'] = dataTraining['plot'].apply(corregir_puntos_variados)

# Check the result on a few rows
print(dataTraining['plot'].head())


def ajustar_puntuacion(texto):
    # Remove whitespace before punctuation marks
    texto = re.sub(r"\s+([,.;:!?%])", r"\1", texto)
    # Ensure a space after commas and periods when one is missing
    texto = re.sub(r"([,.])([^\s])", r"\1 \2", texto)
    return texto

# Apply the function to every text in the 'plot' column
dataTraining['plot'] = dataTraining['plot'].apply(ajustar_puntuacion)

# Check the result
print(dataTraining['plot'])

3107    story single father take year - old son work r...
900     serial killer decide teach secret satisfy care...
6724    sweden , female blackmailer disfigure facial s...
4704    friday afternoon new york , president tredway ...
2582    los angeles , editor publish house carol hunni...
Name: plot, dtype: object
3107    story single father take year - old son work r...
900     serial killer decide teach secret satisfy care...
6724    sweden, female blackmailer disfigure facial sc...
4704    friday afternoon new york, president tredway c...
2582    los angeles, editor publish house carol hunnic...
                              ...                        
8417    " marriage, wedding. " lesson number newly eng...
1592    wander barbarian, conan, alongside goofy rogue...
1723    like tale spin scheherazade, kismet follow rem...
7605    mrs. brisby, widow mouse, live cinder block ch...
215     tinker bell journey far north land patch thing...
Name: plot, Length: 7895, dtype: object


In [22]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,story single father take year - old son work r...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,serial killer decide teach secret satisfy care...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"sweden, female blackmailer disfigure facial sc...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"friday afternoon new york, president tredway c...",['Drama'],7.4
2582,1990,Narrow Margin,"los angeles, editor publish house carol hunnic...","['Action', 'Crime', 'Thriller']",6.6


In [23]:
import pandas as pd
from ast import literal_eval

# Parse the genre strings into lists when they are still stored as strings
dataTraining['genres'] = dataTraining['genres'].apply(lambda x: literal_eval(x) if isinstance(x, str) else x)

# Build the set of unique genres
generos_unicos = set()
for lista_generos in dataTraining['genres']:
    generos_unicos.update(lista_generos)

# Print the unique genres
print(generos_unicos)


{'War', 'Film-Noir', 'Adventure', 'Short', 'Mystery', 'Western', 'Biography', 'Musical', 'Romance', 'Animation', 'Fantasy', 'Music', 'Horror', 'Action', 'Family', 'Comedy', 'Documentary', 'Crime', 'Sport', 'Sci-Fi', 'History', 'Drama', 'Thriller', 'News'}
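
The set above shows which genres exist but not how often each one occurs, which matters when every genre counts equally in a macro-averaged metric. A minimal sketch of counting label frequencies follows; `genre_lists` is a stand-in holding only the five rows shown by `dataTraining.head()` earlier, not the full dataset.

```python
from collections import Counter

# Stand-in for dataTraining['genres'] after parsing: the five rows from head().
genre_lists = [
    ["Short", "Drama"],
    ["Comedy", "Crime", "Horror"],
    ["Drama", "Film-Noir", "Thriller"],
    ["Drama"],
    ["Action", "Crime", "Thriller"],
]

# Count how many movies carry each genre label.
genre_counts = Counter(genre for genres in genre_lists for genre in genres)
most_common = genre_counts.most_common(3)
print(most_common)
```

On the real column, the same counter could be fed with `dataTraining['genres']` directly, since it is already a sequence of lists at this point.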


In [24]:
import pandas as pd

# Collect the unique characters that appear in 'plot'
caracteres_unicos = set(''.join(dataTraining['plot'].tolist()))

# Map each character to a list of example plots that contain it
ejemplos_por_caracter = {caracter: [] for caracter in caracteres_unicos}

# Fill the dictionary, keeping at most 10 examples per character
for plot in dataTraining['plot']:
    for caracter in caracteres_unicos:
        if caracter in plot and len(ejemplos_por_caracter[caracter]) < 10:
            ejemplos_por_caracter[caracter].append(plot)

# Print the collected examples for one character
print("Examples for the character '.':")
for ejemplo in ejemplos_por_caracter['.'][:10]:
    print(ejemplo)

Examples for the character '.':
story single father take year - old son work railroad drawbridge bridge tender. day, boy meet woman boarding train, drug abuser. bridge, father go engine room, tell son stay edge nearby lake. ship come, bridge lift. suppose arrive hour later, train happen arrive. son see, try warn father, able. oncoming train approach, son fall drawbridge gear work attempt low bridge, leave father horrific choice. father lowers bridge, gear crush boy. people train completely oblivious fact boy die try save, drug addict woman, happen look train window. movie end, man wander new city, meet woman, long drug addict, hold small baby. relevant narrative run parallel, female drug - addict, meet climax tumultuous film.
serial killer decide teach secret satisfy career video store clerk.
sweden, female blackmailer disfigure facial scar meet gentleman life mean. accomplice blackmail, fall love, bitterly resign impossibility return affection. life change victim prove wife plastic s

In [25]:
caracter_a_buscar = ','
if caracter_a_buscar in ejemplos_por_caracter:
    print(f"Examples for the character '{caracter_a_buscar}':")
    for ejemplo in ejemplos_por_caracter[caracter_a_buscar][:10]:
        print(ejemplo)
else:
    print(f"No examples found for the character '{caracter_a_buscar}'.")

Examples for the character ',':
story single father take year - old son work railroad drawbridge bridge tender. day, boy meet woman boarding train, drug abuser. bridge, father go engine room, tell son stay edge nearby lake. ship come, bridge lift. suppose arrive hour later, train happen arrive. son see, try warn father, able. oncoming train approach, son fall drawbridge gear work attempt low bridge, leave father horrific choice. father lowers bridge, gear crush boy. people train completely oblivious fact boy die try save, drug addict woman, happen look train window. movie end, man wander new city, meet woman, long drug addict, hold small baby. relevant narrative run parallel, female drug - addict, meet climax tumultuous film.
sweden, female blackmailer disfigure facial scar meet gentleman life mean. accomplice blackmail, fall love, bitterly resign impossibility return affection. life change victim prove wife plastic surgeon, catch apartment, believe jewel thief blackmailer. offer chance

In [27]:
%pip install xgboost



In [28]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the text features with TfidfVectorizer
vect = TfidfVectorizer(max_features=50000)
X_dtm = vect.fit_transform(dataTraining['plot'])

# Binarize the genre labels (one column per genre)
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

# Split the data into training and hold-out sets
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.20, random_state=40)

# XGBClassifier with the best hyperparameters found
clf = OneVsRestClassifier(XGBClassifier(
    n_estimators=439,  # Best number of trees found
    max_depth=4,  # Best maximum tree depth found
    learning_rate=0.013193250444042839,  # Best learning rate found
    subsample=0.6964101864104046,  # Best row subsampling ratio found
    colsample_bytree=0.7541666010159664,  # Best per-tree feature subsampling ratio found
    objective='binary:logistic',  # Objective and loss function
    n_jobs=-1,  # Use all available cores
    eval_metric='auc',  # Evaluation metric
    use_label_encoder=False,  # Avoids the label-encoder warning
    random_state=42  # Seed for reproducibility
), n_jobs=-1)

# Train the classifier
clf.fit(X_train, y_train_genres)

# Predict genre probabilities on the hold-out set
y_pred_genres = clf.predict_proba(X_test)

# Compute and print the ROC AUC score
auc_score = roc_auc_score(y_test_genres, y_pred_genres, average='macro')
print(f"ROC AUC Score: {auc_score}")

ROC AUC Score: 0.8129234757782973
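
The score above is measured on a hold-out split of the training data; producing predictions for `dataTesting` would additionally require running its plots through the same cleaning steps and through `vect.transform` (not `fit_transform`). The sketch below shows the shape of that final step on made-up data. It swaps in `LogisticRegression` for XGBoost to stay small, and the `p_` column prefix and all toy plots are assumptions, since the expected submission format is not stated here.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up, already-cleaned training plots and their genre lists.
train_plots = [
    "detective hunt serial killer city",
    "couple fall love summer paris",
    "killer stalk couple summer cabin",
    "detective fall love witness",
]
train_genres = [["Crime", "Thriller"], ["Romance"], ["Thriller"], ["Crime", "Romance"]]

# Made-up test plots, indexed like dataTesting (by movie id).
test_plots = pd.Series(["detective chase killer", "love paris"], index=[1, 4])

# Fit the vectorizer and label binarizer on training data only.
vect = TfidfVectorizer()
X_train = vect.fit_transform(train_plots)
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train_genres)

# One binary classifier per genre (LogisticRegression as a lightweight stand-in).
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Reuse the fitted vocabulary for the test plots, then predict probabilities.
X_test = vect.transform(test_plots)
proba = clf.predict_proba(X_test)

# One row per test movie, one probability column per genre.
submission = pd.DataFrame(
    proba,
    index=test_plots.index,
    columns=["p_" + genre for genre in mlb.classes_],  # hypothetical column naming
)
print(submission)
```

A frame like this could then be written out with `submission.to_csv(...)` once the required file name and column layout are known.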
