# Movie recommendation model
## Objective
In this practice we will train a **movie recommendation model**. Recommendation models can use **collaborative filtering**, which picks recommendations from the ratings of users with similar tastes. However, our dataset only contains the ratings of a single user, so we will use **content-based filtering** instead: we take the features of the movies the user has watched and recommend other movies with similar features.
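
To make the idea concrete, here is a minimal sketch of content-based filtering on a made-up catalogue. The titles, the genre flags and the `most_similar` helper are hypothetical and are not part of the datasets used in this practice; the sketch only illustrates ranking movies by the cosine similarity of their feature vectors.

```python
import numpy as np

# Hypothetical toy catalogue: rows are movies, columns are genre flags
# in the order (Action, Adventure, Drama).
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
features = np.array([
    [1, 1, 0],  # Movie A: Action, Adventure
    [1, 0, 0],  # Movie B: Action
    [0, 0, 1],  # Movie C: Drama
    [1, 1, 1],  # Movie D: Action, Adventure, Drama
], dtype=float)

def most_similar(index, features, top_n=2):
    """Return the indices of the top_n rows most similar to row `index`."""
    norms = np.linalg.norm(features, axis=1)
    scores = features @ features[index] / (norms * norms[index])
    scores[index] = -np.inf  # never recommend the movie itself
    return np.argsort(scores)[::-1][:top_n]

recommended = [titles[i] for i in most_similar(0, features)]
print(recommended)
```

Cosine similarity is only one possible choice of measure; it is used here because it does not favour movies just for having more features set.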

First we need to prepare the target dataset, the movies:

## Curating the dataset
To train the model we will use the [IMDb](https://developer.imdb.com/non-commercial-datasets/) datasets: title.basics for the titles, release years and genres of the movies, and title.crew for their directors.

We filter the set to drop everything that is not a movie and save the result to a CSV file that is easier to process.


In [None]:
import pandas as pd

# Load the IMDb dataset
df = pd.read_csv("datasets/title.basics.tsv", sep='\t')

# Keep only the movies and save them to another file
df = df[df['titleType'] == 'movie']
df.to_csv("datasets/titles_filtered.csv")
df.head(3)

We drop the columns we do not need and discard records that lack a release year or a genre, which we consider irrelevant.

In [22]:
import pandas as pd

df = pd.read_csv("datasets/titles_filtered.csv")

# Drop the first column: it only holds the index saved by to_csv
df = df.drop(df.columns[0], axis=1)

# Drop irrelevant columns
df = df.drop(['endYear', 'isAdult', 'titleType'], axis=1)

# Drop movies that lack a release year or genres
df = df[df['startYear'] != '\\N']
df = df[df['genres'] != '\\N']

# Normalize the titles to upper case
df['primaryTitle'] = df['primaryTitle'].str.upper()
df['originalTitle'] = df['originalTitle'].str.upper()

# Replace missing runtimes with 0
df.loc[df['runtimeMinutes'] == '\\N', 'runtimeMinutes'] = 0

# Force startYear and runtimeMinutes to be numeric;
# they were mixed-type because of the null marker ('\N')
df['startYear'] = pd.to_numeric(df['startYear'], downcast='integer', errors='coerce')
df['runtimeMinutes'] = pd.to_numeric(df['runtimeMinutes'], downcast='integer', errors='coerce')

df.head(10)



Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
0,tt0000009,MISS JERRY,MISS JERRY,1894,45,Romance
1,tt0000147,THE CORBETT-FITZSIMMONS FIGHT,THE CORBETT-FITZSIMMONS FIGHT,1897,100,"Documentary,News,Sport"
3,tt0000574,THE STORY OF THE KELLY GANG,THE STORY OF THE KELLY GANG,1906,70,"Action,Adventure,Biography"
4,tt0000591,THE PRODIGAL SON,L'ENFANT PRODIGUE,1907,90,Drama
5,tt0000615,ROBBERY UNDER ARMS,ROBBERY UNDER ARMS,1907,0,Drama
6,tt0000630,HAMLET,AMLETO,1908,0,Drama
7,tt0000675,DON QUIJOTE,DON QUIJOTE,1908,0,Drama
8,tt0000679,THE FAIRYLOGUE AND RADIO-PLAYS,THE FAIRYLOGUE AND RADIO-PLAYS,1908,120,"Adventure,Fantasy"
19,tt0000886,"HAMLET, PRINCE OF DENMARK",HAMLET,1910,0,Drama
20,tt0000941,LOCURA DE AMOR,LOCURA DE AMOR,1909,45,Drama
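
The cleaning cell compares each column against the literal string `\N`, which is the marker the IMDb files use for missing values. As a possible alternative, pandas can parse that marker as missing while loading. The sketch below assumes a hypothetical three-row excerpt with the same tab-separated layout; the titles and identifiers are invented.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt laid out like title.basics.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tExample One\t1907\t\\N\tDrama\n"
    "tt0000002\tmovie\tExample Two\t\\N\t90\tComedy\n"
    "tt0000003\tshort\tExample Three\t1910\t5\t\\N\n"
)

# na_values makes pandas read the "\N" marker as NaN while loading.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

movies = sample[sample["titleType"] == "movie"]
movies = movies.dropna(subset=["startYear", "genres"])  # need year and genre
movies = movies.fillna({"runtimeMinutes": 0})           # unknown runtime -> 0
print(movies)
```

With the markers parsed as NaN, the numeric columns load as floats rather than mixed-type strings, so the later `pd.to_numeric` step would mostly be needed for downcasting.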


We join each movie in our set with its directors (and writers) from the IMDb title.crew file.

In [23]:
df_crew = pd.read_csv("datasets/title.crew.tsv", sep ='\t')

df = pd.merge(df, df_crew, on="tconst")

df.head(10)

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,directors,writers
0,tt0000009,MISS JERRY,MISS JERRY,1894,45,Romance,nm0085156,nm0085156
1,tt0000147,THE CORBETT-FITZSIMMONS FIGHT,THE CORBETT-FITZSIMMONS FIGHT,1897,100,"Documentary,News,Sport",nm0714557,\N
2,tt0000574,THE STORY OF THE KELLY GANG,THE STORY OF THE KELLY GANG,1906,70,"Action,Adventure,Biography",nm0846879,nm0846879
3,tt0000591,THE PRODIGAL SON,L'ENFANT PRODIGUE,1907,90,Drama,nm0141150,nm0141150
4,tt0000615,ROBBERY UNDER ARMS,ROBBERY UNDER ARMS,1907,0,Drama,nm0533958,"nm0092809,nm0533958"
5,tt0000630,HAMLET,AMLETO,1908,0,Drama,nm0143333,nm0000636
6,tt0000675,DON QUIJOTE,DON QUIJOTE,1908,0,Drama,nm0194088,nm0148859
7,tt0000679,THE FAIRYLOGUE AND RADIO-PLAYS,THE FAIRYLOGUE AND RADIO-PLAYS,1908,120,"Adventure,Fantasy","nm0091767,nm0877783","nm0000875,nm0877783"
8,tt0000886,"HAMLET, PRINCE OF DENMARK",HAMLET,1910,0,Drama,nm0099901,nm0000636
9,tt0000941,LOCURA DE AMOR,LOCURA DE AMOR,1909,45,Drama,"nm0063413,nm0550220","nm0063413,nm0550220,nm0848502"
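
`pd.merge` performs an inner join by default, so any movie without a matching row in the crew table would be dropped silently. The following sketch, on two invented miniature tables, shows one way to check for that with the `how` and `indicator` parameters; the identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
movies = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                       "primaryTitle": ["A", "B", "C"]})
crew = pd.DataFrame({"tconst": ["tt1", "tt3", "tt9"],
                     "directors": ["nm1", "nm2", "nm3"]})

# Default merge is an inner join: tt2 has no crew row and disappears.
inner = pd.merge(movies, crew, on="tconst")

# A left join keeps every movie; indicator=True adds a `_merge` column
# that records whether each row found a match.
left = pd.merge(movies, crew, on="tconst", how="left", indicator=True)
unmatched = left[left["_merge"] == "left_only"]
print(len(inner), len(left), list(unmatched["tconst"]))
```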


We check that the column types are consistent.

In [24]:
for column in df.columns:
    print(column, ":", pd.api.types.infer_dtype(df[column]))

tconst : string
primaryTitle : string
originalTitle : string
startYear : integer
runtimeMinutes : integer
genres : string
directors : string
writers : string


We save a new CSV file with the curated records that we will use for training.

In [26]:
df.to_csv("datasets/movies_curated.csv")

## Recommendation system

In [2]:
import pandas as pd

df = pd.read_csv("datasets/movies_curated.csv")

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from scipy.sparse import hstack, csr_matrix
import numpy as np

# Clean null markers out of the columns
df['genres'] = df['genres'].replace("\\N", "").fillna("")
df['directors'] = df['directors'].replace("\\N", "").fillna("")
df['writers'] = df['writers'].replace("\\N", "").fillna("")

# Turn the comma-separated strings into lists
df['genres'] = df['genres'].apply(lambda x: [g.strip() for g in x.split(",") if g.strip()])
df['directors'] = df['directors'].apply(lambda x: [d.strip() for d in x.split(",") if d.strip()])
df['writers'] = df['writers'].apply(lambda x: [w.strip() for w in x.split(",") if w.strip()])

# Build 'cast' = directors + writers
df['cast'] = df['directors'] + df['writers']

# Multi-hot encode genres and cast
mlb_genres = MultiLabelBinarizer(sparse_output=True)
mlb_cast = MultiLabelBinarizer(sparse_output=True)

genres_matrix = mlb_genres.fit_transform(df['genres'])
cast_matrix = mlb_cast.fit_transform(df['cast'])

# Combine sparse matrices horizontally
item_matrix = hstack([genres_matrix, cast_matrix]).tocsr()

# Show matrix shape and density info
matrix_shape = item_matrix.shape
nonzeros = item_matrix.nnz
sparsity = 1 - (nonzeros / (matrix_shape[0] * matrix_shape[1]))

matrix_shape, nonzeros, round(sparsity, 4)


((548451, 411591), 1916655, 1.0)
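
The sparsity printed as `1.0` is a rounding effect: the matrix is not empty, but its density is too small to survive `round(..., 4)`. A small sketch that recomputes the figure from the numbers in the output and prints it in scientific notation:

```python
# Figures copied from the output of the previous cell.
n_rows, n_cols = 548451, 411591
nonzeros = 1916655

density = nonzeros / (n_rows * n_cols)
sparsity = 1 - density

# Scientific notation keeps the information that round(..., 4) hides.
print(f"density  = {density:.3e}")
print(f"sparsity = {sparsity:.8f}")
print(f"average non-zero features per movie = {nonzeros / n_rows:.2f}")
```

The density is on the order of 1e-5, which is why the features are kept in a sparse matrix rather than a dense array.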

In [4]:
# Build feature names: genres first, then cast
feature_names = list(mlb_genres.classes_) + list(mlb_cast.classes_)

# Convert first 5 rows to dense array
dense_sample = item_matrix[:5, :20].toarray()

# Put into a DataFrame for easier viewing
sample_df = pd.DataFrame(dense_sample, 
                         columns=feature_names[:20], 
                         index=df['primaryTitle'][:5])
sample_df

primaryTitle,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV
MISS JERRY,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
THE CORBETT-FITZSIMMONS FIGHT,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
THE STORY OF THE KELLY GANG,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
THE PRODIGAL SON,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
ROBBERY UNDER ARMS,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
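
For reference, this is how `MultiLabelBinarizer` turns lists of labels into multi-hot rows. The sketch uses three invented genre lists rather than the real data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre lists for three movies; the last one has no genre.
genre_lists = [["Drama"], ["Action", "Drama"], []]

mlb = MultiLabelBinarizer(sparse_output=True)
encoded = mlb.fit_transform(genre_lists)

print(list(mlb.classes_))   # column order: labels sorted alphabetically
print(encoded.toarray())    # one row per movie, 1 where the genre applies
```

Because the columns are sorted alphabetically, a row can look empty in a truncated view like the one above: MISS JERRY is a Romance, and Romance sorts after Reality-TV, so its only genre flag falls outside the first 20 columns.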


In [19]:
# Map movie titles to row indices
title_to_index = {title: idx for idx, title in enumerate(df['primaryTitle'])}

# --- Build title+year lookup ---
title_year_to_index = {
    (title, int(year)) : idx
    for idx, (title, year) in enumerate(zip(df['primaryTitle'], df['startYear']))
    if not pd.isna(year)
}

# --- Build user profile vector ---
ratings_df = pd.read_csv("datasets/ratings.csv")

user_profile = csr_matrix((1, item_matrix.shape[1]))  # empty profile

for _, row in ratings_df.iterrows():
    movie_name = row['Name'].upper()
    year = int(row['Year'])
    rating = row['Rating']

    key = (movie_name, year)
    if key in title_year_to_index:
        movie_idx = title_year_to_index[key]
        movie_vector = item_matrix[movie_idx]

        # Add weighted movie vector
        user_profile += rating * movie_vector

# Convert to dense (only for inspection)
user_vector = np.array(user_profile.todense())[0]

# Show first 20 features (could be genres or cast)
user_profile_df = pd.DataFrame(
    user_vector[:20].reshape(1, -1),
    columns=feature_names[:20]
)


user_profile_df

Unnamed: 0,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV
0,62.0,0.0,168.0,163.5,53.0,598.5,249.5,8.0,1371.0,67.5,109.5,21.0,0.0,15.0,205.0,80.0,143.0,198.0,0.0,0.0
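
One possible next step, sketched here under assumptions, is to score every movie against the user profile and keep the best matches among the movies the user has not rated. The `toy_items`, `toy_profile` and `rated` names are hypothetical stand-ins for `item_matrix`, `user_profile` and the set of matched rating rows; cosine similarity is an assumed choice of measure, not something the notebook has fixed.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for item_matrix (4 movies x 4 features)
# and for the rating-weighted user_profile (1 x 4 features).
toy_items = csr_matrix(np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
], dtype=float))
toy_profile = csr_matrix(np.array([[4.0, 0.0, 5.0, 0.0]]))
rated = {0}  # row indices of movies the user has already rated

# cosine_similarity accepts sparse input and returns a dense (1, n_items) array.
scores = cosine_similarity(toy_profile, toy_items).ravel()
scores[list(rated)] = -np.inf  # do not recommend already rated movies

top = np.argsort(scores)[::-1][:2]
print(top, scores[top])
```

On the real data the same call would produce one score per row of `item_matrix`, and the resulting indices could be mapped back to titles through `df['primaryTitle']`.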
