## Movie Genre Classification

Classify a movie genre based on its plot.

<img src="https://raw.githubusercontent.com/sergiomora03/AdvancedTopicsAnalytics/main/notebooks/img/moviegenre.png"
     style="float: left; margin-right: 10px;" />



### Data

Input:
- movie plot

Output:
- Probability that the movie belongs to each genre


### Evaluation

- 30% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 30% Code with the data processing and models developed that support the reported results.
- 30% Presentation of no more than 15 minutes with the main results of the project.
- 10% Model performance achieved. Metric: "AUC".

- The project must be carried out in groups of 4 people.
- Use clear and rigorous procedures.
- The project is due on March 15th, 2024, at 11:59 pm, by email with a GitHub link.
- No projects will be received after the deadline or by any means other than the one established.
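
The "AUC" metric listed in the evaluation criteria can be illustrated on a toy multi-label problem. The sketch below assumes the macro average (the same `average='macro'` setting used later in this notebook); the labels and scores are made-up values, not project data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth: 4 movies, 2 genres (one column per genre)
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
# Hypothetical predicted probabilities for the same movies and genres
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.8, 0.15], [0.3, 0.1]])

# AUC computed separately for each genre column
per_genre = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]

# Macro AUC is the unweighted mean of the per-genre AUCs
macro_auc = roc_auc_score(y_true, y_score, average="macro")

print(per_genre, macro_auc)
```

Because the metric ranks probabilities rather than hard labels, the model output passed to it should come from `predict_proba`, not `predict`.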




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [1]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split

In [2]:
dataTraining = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/sergiomora03/AdvancedTopicsAnalytics/raw/main/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [3]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [4]:
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


In [5]:
import re
special_char_pattern = re.compile(r'[^a-zA-Z0-9\s]')

In [6]:
# Check for null values: treat empty strings as missing
dataTraining.replace('', np.nan, inplace=True)

# Columns to check
columns_to_check = ['plot', 'genres', 'year', 'title']

# Count the missing values in each column
for column in columns_to_check:
    missing_count = dataTraining[column].isnull().sum()
    print(f'Number of missing values in {column}: {missing_count}')

Number of missing values in plot: 0
Number of missing values in genres: 0
Number of missing values in year: 0
Number of missing values in title: 0


In [7]:
from collections import Counter

# Extract every special character found in the 'plot' column
all_special_characters_text_id = ''.join(dataTraining['plot'].apply(lambda x: ''.join(special_char_pattern.findall(x))))

# Count the frequency of each special character
special_character_frequencies_text_id = Counter(all_special_characters_text_id)

special_character_frequencies_text_id

Counter({'-': 10466,
         '.': 46614,
         ',': 57327,
         "'": 15571,
         '"': 4776,
         '?': 1136,
         '(': 2072,
         ')': 2059,
         '!': 350,
         ':': 1112,
         '$': 246,
         'ï': 5,
         ';': 1202,
         '/': 318,
         'é': 320,
         '£': 7,
         '%': 20,
         'è': 23,
         '&': 137,
         'ʼ': 1,
         'ç': 9,
         'ú': 6,
         '¹': 2,
         'û': 2,
         'ö': 6,
         'â': 3,
         'ñ': 6,
         '¡': 1,
         'ê': 3,
         '®': 2,
         'í': 2,
         'ä': 2,
         'ø': 2,
         '=': 3,
         'ó': 4,
         'á': 5,
         'ë': 2,
         '½': 1,
         'ô': 3,
         'å': 1,
         'ò': 5,
         'à': 13,
         'ù': 5,
         '°': 1,
         'ü': 2})

In [8]:
from collections import defaultdict
import re

text_column = dataTraining['plot']

# Regex pattern matching any character that is not alphanumeric or whitespace
special_char_pattern = re.compile(r'[^a-zA-Z0-9\s]')

# Dictionary holding one example text for each special character found
examples_per_special_char = defaultdict(list)

# Go through each text in the 'plot' column
for text in text_column:
    # Find the unique special characters in the current text
    unique_special_chars = set(special_char_pattern.findall(text))
    # Keep the current text as the example for any character that has none yet
    for char in unique_special_chars:
        if len(examples_per_special_char[char]) < 1:
            examples_per_special_char[char].append(text)

# Convert to a plain dict for display
{char: examples[:2] for char, examples in examples_per_special_char.items()}

{'-': ['most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender .  a day before ,  the boy meets a woman boarding a train ,  a drug abuser .  at the bridge ,  the father goes into the engine room ,  and tells his son to stay at the edge of the nearby lake .  a ship comes ,  and the bridge is lifted .  though it is supposed to arrive an hour later ,  the train happens to arrive .  the son sees this ,  and tries to warn his father ,  who is not able to see this .  just as the oncoming train approaches ,  his son falls into the drawbridge gear works while attempting to lower the bridge ,  leaving the father with a horrific choice .  the father then lowers the bridge ,  the gears crushing the boy .  the people in the train are completely oblivious to the fact a boy died trying to save them ,  other than the drug addict woman ,  who happened to look out her train window .  the movie ends ,  with the ma

In [9]:
# Replace accented and other special characters with plain-text equivalents
replacements = {
    "ï": "i",
    "£": "$",
    "à": "a",
    "è": "e",
    "ì": "i",
    "ò": "o",
    "ù": "u",
    "®": "",
    "ä": "a",
    "ë": "e",
    "ö": "o",
    "ü": "u",
    "Bouvetøya": "Bouvet",
    " ' ": "'",
    "¹": "'",
    "â": "a",
    "ê": "e",
    "î": "i",
    "ô": "o",
    "û": "u",
    "å": "a",
    "é": "e",
    "í": "i",
    "ó": "o",
    "ú": "u",
    "á": "a"
    # "'":""
}

# Apply all substitutions (literal matches, not regex)
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()


3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden ,  a female blackmailer with a disfi...
4704    in a friday afternoon in new york ,  the presi...
2582    in los angeles ,  the editor of a publishing h...
Name: plot, dtype: object
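
As an alternative to listing each accented character by hand, the accents can be stripped with Unicode decomposition. This is a minimal sketch using only the standard library; `strip_accents` is a hypothetical helper name, and the sample sentence is made up. It does not cover characters without a decomposition (for example `ø`, `£` or `®`), which would still need explicit replacements like the ones above.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks so that, e.g., 'é' becomes 'e'."""
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else unchanged
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = "amélie meets señor muñoz at the café"
print(strip_accents(sample))
```

With pandas, the same helper could be applied to both DataFrames via `dataTraining['plot'].apply(strip_accents)` and `dataTesting['plot'].apply(strip_accents)`, so that train and test texts receive identical cleaning.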

In [10]:
# Remove the apostrophe from common possessives and contractions
replacements = {
    "bullard's": "bullards",
    "world's": "worlds",
    "avery's": "averys",
    "wallet's": "wallets",
    "father's": "fathers",
    "mother's": "mothers",
    "brother's": "brothers",
    "sister's": "sisters",
    "haakon's": "haakons",
    "king's": "kings",
    "queen's": "queens",
    "family's": "families",
    "it's": "its",
    "won't": "wont",
    "weyland's": "weylands",
    "didn't": "didnt"
}

# Apply all substitutions (literal matches, not regex)
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()


3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden ,  a female blackmailer with a disfi...
4704    in a friday afternoon in new york ,  the presi...
2582    in los angeles ,  the editor of a publishing h...
Name: plot, dtype: object

In [11]:
# Collapse runs of two or more spaces into a single space in the 'plot' column
dataTraining['plot'] = dataTraining['plot'].str.replace(r'  +', ' ', regex=True)

# Replace two consecutive commas with a single comma (literal match)
dataTraining['plot'] = dataTraining['plot'].str.replace(',,', ',', regex=False)

# Strip leading and trailing whitespace
dataTraining['plot'] = dataTraining['plot'].str.strip()

In [12]:
# Replace spelled-out number prefixes (e.g. "two-") with digits
replacements = {
    "one-": "1-",
    "two-": "2-",
    "three-": "3-",
    "four-": "4-",
    "five-": "5-",
    "six-": "6-",
    "seven-": "7-",
    "eight-": "8-",
    "nine-": "9-"
}

# Apply all substitutions (literal matches, not regex)
for old, new in replacements.items():
    dataTraining['plot'] = dataTraining['plot'].str.replace(old, new, regex=False)

# Check the result
dataTraining['plot'].head()

3107    most is the story of a single father who takes...
900     a serial killer decides to teach the secrets o...
6724    in sweden , a female blackmailer with a disfig...
4704    in a friday afternoon in new york , the presid...
2582    in los angeles , the editor of a publishing ho...
Name: plot, dtype: object

In [13]:
import re

def normalize_text(text):
    # Collapse any run of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Ensure a space after punctuation marks (.,;:!?) when one is missing
    text = re.sub(r'([.,;:!?])([^\s])', r'\1 \2', text)
    # Remove whitespace that precedes a punctuation mark
    text = re.sub(r'\s([.,;:!?])', r'\1', text)
    return text

# Apply the normalization to every plot
dataTraining['plot'] = dataTraining['plot'].apply(normalize_text)
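
One side effect of the second substitution in `normalize_text` is that it also inserts a space inside decimals and thousands separators ("5.6" becomes "5. 6"). The variant below is a sketch of one way to avoid that, assuming numbers should stay intact; `normalize_text_keep_numbers` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re


def normalize_text_keep_numbers(text: str) -> str:
    # Collapse any run of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Add a space after punctuation that is followed by a non-space character,
    # except when a '.' or ',' sits between two digits (e.g. "5.6", "1,000")
    text = re.sub(r"([.,;:!?])(?=\S)(?!(?<=\d[.,])\d)", r"\1 ", text)
    # Remove whitespace that precedes a punctuation mark
    text = re.sub(r"\s([.,;:!?])", r"\1", text)
    return text


example = "rated 5.6 ,great film .the end"
print(normalize_text_keep_numbers(example))
```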

In [14]:
from collections import defaultdict
import re

text_column = dataTraining['plot']

# Regex pattern matching any character that is not alphanumeric or whitespace
special_char_pattern = re.compile(r'[^a-zA-Z0-9\s]')

# Dictionary holding one example text for each special character found
examples_per_special_char = defaultdict(list)

# Go through each text in the 'plot' column
for text in text_column:
    # Find the unique special characters in the current text
    unique_special_chars = set(special_char_pattern.findall(text))
    # Keep the current text as the example for any character that has none yet
    for char in unique_special_chars:
        if len(examples_per_special_char[char]) < 1:
            examples_per_special_char[char].append(text)

# Convert to a plain dict for display
{char: examples[:2] for char, examples in examples_per_special_char.items()}

{'-': ['most is the story of a single father who takes his eight year - old son to work with him at the railroad drawbridge where he is the bridge tender. a day before, the boy meets a woman boarding a train, a drug abuser. at the bridge, the father goes into the engine room, and tells his son to stay at the edge of the nearby lake. a ship comes, and the bridge is lifted. though it is supposed to arrive an hour later, the train happens to arrive. the son sees this, and tries to warn his father, who is not able to see this. just as the oncoming train approaches, his son falls into the drawbridge gear works while attempting to lower the bridge, leaving the father with a horrific choice. the father then lowers the bridge, the gears crushing the boy. the people in the train are completely oblivious to the fact a boy died trying to save them, other than the drug addict woman, who happened to look out her train window. the movie ends, with the man wandering a new city, and meets the woman, n

In [15]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re

# Text-cleaning function
def clean_text(text):
    # 1. Remove special characters (disabled)
    # text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)
    # 2. Convert to lowercase
    text = text.lower()
    # 3. Remove stopwords (disabled)
    # text = ' '.join([word for word in text.split() if word not in ENGLISH_STOP_WORDS])
    return text

# Apply the cleaning function to the plot column of both DataFrames
dataTraining['plot'] = dataTraining['plot'].apply(clean_text)
dataTesting['plot'] = dataTesting['plot'].apply(clean_text)


In [16]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden, a female blackmailer with a disfigu...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york, the preside...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles, the editor of a publishing hou...","['Action', 'Crime', 'Thriller']",6.6


# Model 1: Original

### Create count vectorizer


In [17]:
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)

In [18]:
print(list(vect.vocabulary_.keys())[:50])

['most', 'is', 'the', 'story', 'of', 'single', 'father', 'who', 'takes', 'his', 'year', 'old', 'son', 'to', 'work', 'with', 'him', 'at', 'where', 'he', 'day', 'before', 'boy', 'meets', 'woman', 'train', 'drug', 'goes', 'into', 'room', 'and', 'tells', 'stay', 'ship', 'comes', 'though', 'it', 'arrive', 'an', 'later', 'happens', 'sees', 'this', 'tries', 'not', 'able', 'see', 'just', 'as', 'falls']


### Create y

In [19]:
# Check the type of the first element to see how the genres are stored
print(type(dataTraining['genres'].iloc[0]))


<class 'str'>


In [20]:
import ast

# Parse the genre strings into Python lists; values that are already lists are left as-is.
# ast.literal_eval only evaluates literals, so it is safer than eval() on file contents.
dataTraining['genres'] = dataTraining['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Binarize the labels
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [21]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
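
The binarized matrix is easier to read on a small example. The sketch below parses three genre strings copied from the `dataTraining.head()` output above and shows which column corresponds to which genre; the variable names are illustrative only.

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Genre strings as they appear in the CSV (lists stored as text)
raw_genres = ["['Short', 'Drama']", "['Comedy', 'Crime', 'Horror']", "['Drama']"]

# Parse each string into a real Python list
parsed = [ast.literal_eval(g) for g in raw_genres]

mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(parsed)

# classes_ gives the genre for each column, in sorted order
print(list(mlb.classes_))
# One row per movie, 1 where the movie has that genre
print(y_toy)
```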

In [22]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.20, random_state=42)

### Train multi-class multi-label model

In [23]:
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))

In [24]:
clf.fit(X_train, y_train_genres)

In [25]:
y_pred_genres = clf.predict_proba(X_test)

In [26]:
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7917311245652595

In [27]:
# Get the predicted labels
y_pred_genres_labels = clf.predict(X_test)

In [28]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score


accuracy = accuracy_score(y_test_genres, y_pred_genres_labels)
recall = recall_score(y_test_genres, y_pred_genres_labels, average='macro')
precision = precision_score(y_test_genres, y_pred_genres_labels, average='macro')
f1 = f1_score(y_test_genres, y_pred_genres_labels, average='macro')
roc_auc = roc_auc_score(y_test_genres, y_pred_genres, average='macro')

print(f"Accuracy: {accuracy}")
print(f"Recall: {recall}")
print(f"Precision: {precision}")
print(f"F1 Score: {f1}")
print(f"ROC AUC Score: {roc_auc}")

#0.790530
#0.791731

Accuracy: 0.045598480050664976
Recall: 0.03773380097266429
Precision: 0.19869762165642157
F1 Score: 0.047105652909985134
ROC AUC Score: 0.7917311245652595
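
The very low accuracy next to a reasonable AUC is partly a property of the metric: for multi-label targets, `accuracy_score` counts a sample as correct only when every genre is predicted exactly. A small sketch with made-up labels, contrasting this with the per-label accuracy obtained from `hamming_loss`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical labels: 2 movies, 3 genres
y_true_toy = np.array([[1, 0, 1], [0, 1, 0]])
# The first movie misses one genre; the second is fully correct
y_pred_toy = np.array([[1, 0, 0], [0, 1, 0]])

# Subset accuracy: only rows that match exactly count as correct
subset_acc = accuracy_score(y_true_toy, y_pred_toy)
# Per-label accuracy: fraction of individual genre decisions that are right
per_label_acc = 1 - hamming_loss(y_true_toy, y_pred_toy)

print(subset_acc, per_label_acc)
```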




# Model 2: TfidfVectorizer

In [29]:
import ast

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.multiclass import OneVsRestClassifier

# Use TfidfVectorizer instead of CountVectorizer
vect = TfidfVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])

# Make sure each genres entry is a list (parse it only if it is still a string)
dataTraining['genres'] = dataTraining['genres'].map(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# Use MultiLabelBinarizer to transform the genre labels
le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

# Split the data into training and test sets
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.20, random_state=42)

# Define and train the classifier with the chosen hyperparameters
clf = OneVsRestClassifier(RandomForestClassifier(
    n_estimators=200,  # number of trees
    max_depth=20,  # maximum tree depth
    min_samples_split=2,  # minimum samples required to split a node
    min_samples_leaf=4,  # minimum samples required at a leaf node
    n_jobs=-1,  # use all available cores
    random_state=42  # seed for reproducibility
))
clf.fit(X_train, y_train_genres)

# Predict probabilities on the test split
y_pred_genres = clf.predict_proba(X_test)

# Compute and print the ROC AUC score
auc_score = roc_auc_score(y_test_genres, y_pred_genres, average='macro')
print(f"ROC AUC Score: {auc_score}")

ROC AUC Score: 0.8185995257049203


In [30]:
print(dataTraining.shape)

(7895, 5)
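
The notebook stops before producing predictions for `dataTesting`. The sketch below shows, on toy data, how a table with one probability column per genre could be built. The `p_` column prefix, the toy rows and the output file name in the final comment are assumptions, not part of the original material.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-ins for dataTraining and dataTesting (hypothetical rows)
train = pd.DataFrame({
    "plot": [
        "a detective hunts a killer in the city",
        "two friends go on a funny road trip",
        "a killer stalks a funny detective",
        "a family shares a quiet road trip",
    ],
    "genres": [["Crime"], ["Comedy"], ["Comedy", "Crime"], ["Drama"]],
})
test = pd.DataFrame(
    {"plot": ["a detective on a road trip", "a quiet family"]}, index=[1, 4]
)

# Fit the vectorizer on the training plots only, then reuse it for the test plots
vect = TfidfVectorizer()
X_train = vect.fit_transform(train["plot"])
X_test = vect.transform(test["plot"])

mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train["genres"])

clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=20, random_state=42))
clf.fit(X_train, y_train)

# One probability per genre for every test movie
proba = clf.predict_proba(X_test)
cols = ["p_" + genre for genre in mlb.classes_]  # assumed column naming
submission = pd.DataFrame(proba, index=test.index, columns=cols)
print(submission)

# To save it (file name is an assumption):
# submission.to_csv("pred_genres.csv", index_label="ID")
```

In the real notebook, `dataTesting['plot']` would also need the same cleaning steps that were applied to `dataTraining['plot']`, since only the lowercasing step was applied to both.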
