## Movie Genre Classification

Classify a movie's genres based on its plot.

<img src="https://raw.githubusercontent.com/sergiomora03/AdvancedTopicsAnalytics/main/notebooks/img/moviegenre.png"
     style="float: left; margin-right: 10px;" />



### Data

Input:
- movie plot

Output:
- Probability of the movie belonging to each genre


### Evaluation

- 30% Report with all the details of the solution, the analysis, and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format, and must be self-contained.
- 30% Code with the data processing and models developed that support the reported results.
- 30% Presentation of no more than 15 minutes with the main results of the project.
- 10% Model performance achieved. Metric: "AUC".
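Because each movie can carry several genres, the AUC has to be aggregated across genres. The following is a minimal sketch on toy values (hypothetical, not taken from the dataset), assuming the macro average that the notebook's final cell uses; whether the official scoring uses the same averaging is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth for 4 movies and 2 genres (hypothetical values).
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
# Predicted probability of each genre for each movie.
y_score = np.array([[0.9, 0.20], [0.1, 0.80], [0.8, 0.15], [0.3, 0.10]])

# One AUC per genre (column), then their unweighted mean.
per_genre_auc = roc_auc_score(y_true, y_score, average=None)
macro_auc = roc_auc_score(y_true, y_score, average="macro")
print(per_genre_auc, macro_auc)
```

The macro average weights every genre equally, so rare genres influence the score as much as frequent ones.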

- The project must be carried out in groups of 4 people.
- Use clear and rigorous procedures.
- The project is due on March 15th, 2024, at 11:59 pm, by email with a GitHub link.
- No projects will be accepted after the deadline or by any means other than the one established.




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [57]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [58]:
dataTraining = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [59]:
dataTraining.head()

|      | year | title | plot | genres | rating |
|------|------|-------|------|--------|--------|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [60]:
dataTesting.head()

|   | year | title | plot |
|---|------|-------|------|
| 1 | 1999 | Message in a Bottle | who meets by fate , shall be sealed by fate .... |
| 4 | 1978 | Midnight Express | the true story of billy hayes , an american c... |
| 5 | 1996 | Primal Fear | martin vail left the chicago da ' s office to ... |
| 6 | 1950 | Crisis | husband and wife americans dr . eugene and mr... |
| 7 | 1959 | The Tingler | the coroner and scientist dr . warren chapin ... |


In [61]:
import re
special_char_pattern = re.compile(r'[^a-zA-Z0-9\s]')
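The compiled pattern is not used in the later cells. As a hedged sketch of one way it could help, the snippet below counts every character the pattern flags, which gives a quick inventory of what might need a replacement rule. The sample strings are hypothetical stand-ins for `dataTraining['plot']`.

```python
import re
from collections import Counter

special_char_pattern = re.compile(r'[^a-zA-Z0-9\s]')

# Hypothetical plot snippets standing in for dataTraining['plot'].
sample_plots = ["in sweden , a café owner", "it ' s £5 !!"]

# Count each character that is not an ASCII letter, digit, or whitespace.
special_counts = Counter()
for plot in sample_plots:
    special_counts.update(special_char_pattern.findall(plot))

print(special_counts.most_common())
```

On the real data, the same loop would run over `dataTraining['plot']` instead of `sample_plots`.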

In [62]:
# Replace special characters with their text equivalents
replacements = {
    # "ï": "i",
    "£": "$",
    "à":"á",
    "è": "é",
    "ì": "í",
    "ò": "ó",
    "ù": "ú",
    # "®": "",
    # "ä": "a",
    # "ë": "e",
    # "ï": "i",
    # "ö": "o",
    # "ü": "u",
    "Bouvetøya":"Bouvet",
    " ' ": "'"  # rejoin apostrophes that are surrounded by spaces
    # "¹":"'",
    # "â":"a",
    # "ê":"e",
    # "î":"i",
    # "ô":"o",
    # "û":"u",
    # "å": "a",
    # "é": "e",
    # "í": "i",
    # "ó": "o",
    # "ú": "u",
    # "á": "a",
    # "é": "e",
    # "í": "i",
    # "ó": "o",
    # "ú": "u"
    # "'":""
}

# Apply every substitution
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()


3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden ,  a female blackmailer with a disfi...
4704    in a friday afternoon in new york ,  the presi...
2582    in los angeles ,  the editor of a publishing h...
Name: plot, dtype: object

In [63]:
# Strip apostrophes from possessives and contractions
replacements = {
    "bullard\'s": "bullards",
    "world\'s":"worlds",
    "avery\'s":"averys",
    "wallet\'s":"wallets",
    "father\'s": "fathers",
    "mother\'s": "mothers",
    "brother\'s": "brothers",
    "sister\'s": "sisters",
    "haakon\'s": "haakons",
    "king\'s": "kings",
    "queen\'s": "queens",
    "family\'s": "families",
    "it\'s": "its",
    "won\'t":"wont",
    "weyland\'s": "weylands",
    "didn\'t": "didnt"
}

# Apply every substitution
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()


3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden ,  a female blackmailer with a disfi...
4704    in a friday afternoon in new york ,  the presi...
2582    in los angeles ,  the editor of a publishing h...
Name: plot, dtype: object

In [67]:
dataTraining.head()

|      | year | title | plot | genres | rating |
|------|------|-------|------|--------|--------|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [74]:
# Normalize abbreviations, repeated punctuation, and underscores
replacements = {
    "u. s.": "u.s.",
    "dr. t. ": "dr.",
    "!!": "!",
    "_": " "
}

# Apply every substitution
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()

3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden ,  a female blackmailer with a disfi...
4704    in a friday afternoon in new york ,  the presi...
2582    in los angeles ,  the editor of a publishing h...
Name: plot, dtype: object
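The replacement passes above are applied only to `dataTraining`. Since a fitted model would later score `dataTesting`, one option is to wrap the rules in a helper and run it on both frames so the text is normalized the same way. The sketch below is an assumption about how that could look: `clean_plots` is a hypothetical name, the rule list is only a subset of the dictionaries above, and the two data frames are toy stand-ins.

```python
import pandas as pd

# Subset of the active rules from the cells above, kept in the same order.
# Order matters: apostrophes are rejoined before possessives are rewritten.
REPLACEMENTS = {
    "£": "$",
    " ' ": "'",
    "father's": "fathers",
    "it's": "its",
    "u. s.": "u.s.",
    "!!": "!",
}


def clean_plots(plots: pd.Series, replacements: dict = REPLACEMENTS) -> pd.Series:
    """Apply each literal (non-regex) replacement in insertion order."""
    for old, new in replacements.items():
        plots = plots.str.replace(old, new, regex=False)
    return plots


# Toy stand-ins for dataTraining and dataTesting (hypothetical rows).
train = pd.DataFrame({"plot": ["his father ' s wallet cost £5 !!"]})
test = pd.DataFrame({"plot": ["it ' s the u. s. army"]})

train["plot"] = clean_plots(train["plot"])
test["plot"] = clean_plots(test["plot"])
print(train["plot"].iloc[0])
print(test["plot"].iloc[0])
```

Keeping the rules in one place also avoids re-running three separate cells whenever a rule changes.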

In [78]:
dataTraining.head()

|      | year | title | plot | genres | rating |
|------|------|-------|------|--------|--------|
| 3107 | 2003 | Most | most is the story of a single father who takes... | ['Short', 'Drama'] | 8.0 |
| 900 | 2008 | How to Be a Serial Killer | a serial killer decides to teach the secrets o... | ['Comedy', 'Crime', 'Horror'] | 5.6 |
| 6724 | 1941 | A Woman's Face | in sweden , a female blackmailer with a disfi... | ['Drama', 'Film-Noir', 'Thriller'] | 7.2 |
| 4704 | 1954 | Executive Suite | in a friday afternoon in new york , the presi... | ['Drama'] | 7.4 |
| 2582 | 1990 | Narrow Margin | in los angeles , the editor of a publishing h... | ['Action', 'Crime', 'Thriller'] | 6.6 |


In [79]:
import pandas as pd
from ast import literal_eval

# Parse the genre strings into lists when they are still stored as strings
dataTraining['genres'] = dataTraining['genres'].apply(lambda x: literal_eval(x) if isinstance(x, str) else x)

# Collect the set of unique genres
generos_unicos = set()
for lista_generos in dataTraining['genres']:
    generos_unicos.update(lista_generos)

# Print the unique genres
print(generos_unicos)


{'Romance', 'Thriller', 'Drama', 'Adventure', 'News', 'Western', 'Comedy', 'Documentary', 'Fantasy', 'Mystery', 'Horror', 'Action', 'Family', 'Crime', 'Short', 'Film-Noir', 'History', 'Sport', 'Biography', 'Animation', 'Sci-Fi', 'Musical', 'War', 'Music'}
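The set above lists which genres exist but not how often each one occurs. As a hedged sketch, a `Counter` over the parsed lists gives the per-genre frequencies; the input here is only the five rows shown by `dataTraining.head()`, so the counts say nothing about the full dataset.

```python
from ast import literal_eval
from collections import Counter

# Genre strings of the five rows shown by dataTraining.head().
genre_strings = [
    "['Short', 'Drama']",
    "['Comedy', 'Crime', 'Horror']",
    "['Drama', 'Film-Noir', 'Thriller']",
    "['Drama']",
    "['Action', 'Crime', 'Thriller']",
]

# Parse each string into a list and count every genre occurrence.
genre_counts = Counter(g for s in genre_strings for g in literal_eval(s))
print(genre_counts.most_common(3))
```

On the full data, the same expression could iterate over `dataTraining['genres']` directly once the column holds lists.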


In [80]:
import pandas as pd

# Identify the unique characters in 'plot'
caracteres_unicos = set(''.join(dataTraining['plot'].tolist()))

# Dictionary that stores example plots containing each character
ejemplos_por_caracter = {caracter: [] for caracter in caracteres_unicos}

# Fill the dictionary, keeping at most 10 examples per character
for plot in dataTraining['plot']:
    for caracter in caracteres_unicos:
        if caracter in plot and len(ejemplos_por_caracter[caracter]) < 10:
            ejemplos_por_caracter[caracter].append(plot)

# Print the examples collected for one character
print("Examples for the character '.':")
for ejemplo in ejemplos_por_caracter['.'][:10]:
    print(ejemplo)

Examples for the character '.':
most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender .  a day before ,  the boy meets a woman boarding a train ,  a drug abuser .  at the bridge ,  the father goes into the engine room ,  and tells his son to stay at the edge of the nearby lake .  a ship comes ,  and the bridge is lifted .  though it is supposed to arrive an hour later ,  the train happens to arrive .  the son sees this ,  and tries to warn his father ,  who is not able to see this .  just as the oncoming train approaches ,  his son falls into the drawbridge gear works while attempting to lower the bridge ,  leaving the father with a horrific choice .  the father then lowers the bridge ,  the gears crushing the boy .  the people in the train are completely oblivious to the fact a boy died trying to save them ,  other than the drug addict woman ,  who happened to look out her train window .  the 

In [81]:
caracter_a_buscar = ','
if caracter_a_buscar in ejemplos_por_caracter:
    print(f"Examples for the character '{caracter_a_buscar}':")
    for ejemplo in ejemplos_por_caracter[caracter_a_buscar][:10]:
        print(ejemplo)
else:
    print(f"No examples found for the character '{caracter_a_buscar}'.")

Examples for the character ',':
most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender .  a day before ,  the boy meets a woman boarding a train ,  a drug abuser .  at the bridge ,  the father goes into the engine room ,  and tells his son to stay at the edge of the nearby lake .  a ship comes ,  and the bridge is lifted .  though it is supposed to arrive an hour later ,  the train happens to arrive .  the son sees this ,  and tries to warn his father ,  who is not able to see this .  just as the oncoming train approaches ,  his son falls into the drawbridge gear works while attempting to lower the bridge ,  leaving the father with a horrific choice .  the father then lowers the bridge ,  the gears crushing the boy .  the people in the train are completely oblivious to the fact a boy died trying to save them ,  other than the drug addict woman ,  who happened to look out her train window .  the mo

In [83]:
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier
from scipy.sparse import hstack
import pandas as pd
from textblob import TextBlob


# First, compute the additional features
dataTraining['text_length'] = dataTraining['plot'].apply(len)  # Text length in characters
dataTraining['sentiment'] = dataTraining['plot'].apply(lambda x: TextBlob(x).sentiment.polarity)  # Sentiment polarity score

# Build the text features with TfidfVectorizer
vect = TfidfVectorizer(max_features=10000)
X_dtm = vect.fit_transform(dataTraining['plot'])

# Select the additional features
X_additional = dataTraining[['text_length', 'sentiment']]

# Make sure the genres are stored as lists (literal_eval only parses literals, unlike eval)
dataTraining['genres'] = dataTraining['genres'].map(lambda x: literal_eval(x) if isinstance(x, str) else x)

# Use MultiLabelBinarizer to encode the genre labels
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

# Split into training and held-out sets, keeping the additional features aligned
X_train_dtm, X_test_dtm, X_train_additional, X_test_additional, y_train_genres, y_test_genres = train_test_split(X_dtm, X_additional, y_genres, test_size=0.20, random_state=22)

# Combine the text features with the additional features
X_train_combined = hstack([X_train_dtm, X_train_additional])
X_test_combined = hstack([X_test_dtm, X_test_additional])

# Train the model on the combined features
clf = OneVsRestClassifier(RandomForestClassifier(
    bootstrap=False,  # No bootstrap sampling: every tree is fit on the full training set
    max_depth=31,  # Maximum tree depth
    max_features='log2',  # Number of features considered when searching for the best split
    min_samples_leaf=11,  # Minimum number of samples required at a leaf node
    min_samples_split=9,  # Minimum number of samples required to split a node
    n_estimators=800,  # Number of trees
    n_jobs=-1,  # Use all available cores
    random_state=42  # Seed for reproducibility
))
clf.fit(X_train_combined, y_train_genres)

# Predict genre probabilities on the held-out set
y_pred_genres = clf.predict_proba(X_test_combined)

# Compute and print the macro-averaged ROC AUC
auc_score = roc_auc_score(y_test_genres, y_pred_genres, average='macro')
print(f"ROC AUC Score: {auc_score}")



ROC AUC Score: 0.8637622758187478
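
The cells above stop at the held-out score and do not build a submission from `dataTesting`. The following is a minimal sketch of that last step under stated assumptions: the rows are toy stand-ins, the model is a small text-only forest (the `text_length` and `sentiment` features are left out for brevity), and the `p_<genre>` column names and the `ID` index label are hypothetical, since the source does not specify the required file layout.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for dataTraining and dataTesting (hypothetical rows).
train = pd.DataFrame({
    "plot": ["a detective hunts a killer", "two friends fall in love",
             "a killer stalks a detective", "a love story between friends"],
    "genres": [["Crime", "Thriller"], ["Romance"],
               ["Crime", "Thriller"], ["Romance"]],
})
test = pd.DataFrame({"plot": ["a detective in love", "friends hunt a killer"]},
                    index=[1, 4])

# Fit the vocabulary on the training plots, then reuse it for the test plots.
vect = TfidfVectorizer()
X_train = vect.fit_transform(train["plot"])
X_test = vect.transform(test["plot"])

mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train["genres"])

clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=50, random_state=42))
clf.fit(X_train, y_train)

# One probability column per genre, indexed like the test set.
proba = clf.predict_proba(X_test)
submission = pd.DataFrame(proba, index=test.index,
                          columns=[f"p_{g}" for g in mlb.classes_])

# With no path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = submission.to_csv(index_label="ID")
print(csv_text)
```

The important detail is calling `transform`, not `fit_transform`, on the test plots so that the columns match the features the model was trained on.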
