<h3>Recommendation System</h3>

Once all the data is served through the API and ready to be consumed by the Analytics and Machine Learning departments, and our EDA has given us a good understanding of the data we have access to, it is time to train a machine learning model to build a movie recommendation system. The EDA should include charts that help extract insights, for example a word cloud of the most frequent words in movie titles. The system recommends movies to users based on similar movies: it computes the similarity score between the given movie and every other movie, sorts the movies by that score, and returns a Python list of 5 values, each one the title string of one of the highest-scoring movies, in descending order. It must be deployed as an additional function of the previous API and must be named:

<b>def recomendacion(<i>titulo</i>)</b>: Takes the name of a movie and returns a list of the 5 most similar movies.

<h3>Libraries</h3>

In [29]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer


<h3>Data Loading</h3>

In [30]:
ruta_movies_dataset = r'C:\Users\Practical Tecno\Desktop\HENRY\Henry_Proyecto_Individual_1_Memolli\datasets\movies_limpio.parquet'
credit_crew_dataset = r'C:\Users\Practical Tecno\Desktop\HENRY\Henry_Proyecto_Individual_1_Memolli\datasets\credit_crew_limpio.parquet'
credit_cast_dataset = r'C:\Users\Practical Tecno\Desktop\HENRY\Henry_Proyecto_Individual_1_Memolli\datasets\credit_cast_limpio.parquet'

df_movies = pd.read_parquet(ruta_movies_dataset)
df_credit_crew = pd.read_parquet(credit_crew_dataset)
df_credit_cast = pd.read_parquet(credit_cast_dataset)
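
The absolute Windows paths above only resolve on the author's machine. A minimal sketch of building the same three paths relative to a project folder, assuming the notebook runs from the repository root and the files live in a `datasets` subfolder (`BASE_DIR`, `DATASETS_DIR` and `ruta_dataset` are hypothetical names, not part of the original notebook):

```python
from pathlib import Path

# Assumption: the notebook is executed from the project root.
BASE_DIR = Path.cwd()
DATASETS_DIR = BASE_DIR / "datasets"


def ruta_dataset(nombre: str) -> Path:
    """Return the path of a parquet file inside the datasets folder."""
    return DATASETS_DIR / f"{nombre}.parquet"


ruta_movies_dataset = ruta_dataset("movies_limpio")
credit_crew_dataset = ruta_dataset("credit_crew_limpio")
credit_cast_dataset = ruta_dataset("credit_cast_limpio")

print(ruta_movies_dataset.name)
```

The resulting `Path` objects can be passed to `pd.read_parquet` in place of the raw strings.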

In [31]:
df_movies

Unnamed: 0,budget,id,overview,popularity,release_date,revenue,runtime,title,vote_average,vote_count,release_year,return,genres_name,production_companies_name,production_countries_name,spoken_languages_name
0,30000000,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033,81.0,Toy Story,7.7,5415,1995,12.451801,Animation,Pixar Animation Studios,United States of America,English
1,30000000,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033,81.0,Toy Story,7.7,5415,1995,12.451801,Comedy,Pixar Animation Studios,United States of America,English
2,30000000,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033,81.0,Toy Story,7.7,5415,1995,12.451801,Family,Pixar Animation Studios,United States of America,English
3,65000000,8844,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249,104.0,Jumanji,6.9,2413,1995,4.043035,Adventure,TriStar Pictures,United States of America,English
4,65000000,8844,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249,104.0,Jumanji,6.9,2413,1995,4.043035,Adventure,Teitler Film,United States of America,English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133594,0,30840,"Yet another version of the classic epic, with ...",5.683753,1991-05-13,0,104.0,Robin Hood,5.7,26,1991,0.000000,Romance,CanWest Global Communications,Canada,English
133595,0,111109,An artist struggles to finish his work while a...,0.178241,2011-11-17,0,360.0,Century of Birthing,9.0,3,2011,0.000000,Drama,Sine Olivia,Philippines,
133596,0,67758,"When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,0,90.0,Betrayal,3.8,6,2003,0.000000,Action,American World Pictures,United States of America,English
133597,0,67758,"When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,0,90.0,Betrayal,3.8,6,2003,0.000000,Drama,American World Pictures,United States of America,English


In [32]:
df_movies = df_movies[df_movies['release_year'] >= 1980]

In [33]:
principales_idiomas = ['English', 'Français', 'Deutsch', 'Español', 'Italiano']

df_movies = df_movies[df_movies['spoken_languages_name'].isin(principales_idiomas)]

In [34]:
paises_principales = ['United States of America', 'United Kingdom', 'France', 'Canada', 
                      'Japan', 'Germany', 'Italy', 'Russia', 'India', 'Spain', 'Argentina']

df_movies = df_movies[df_movies['production_countries_name'].isin(paises_principales)]
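
Because `df_movies` holds one row per combination of genre, company, country and language, each `isin` filter keeps or drops individual rows rather than whole movies. A small sketch on toy data (the rows below are invented for illustration, not taken from the dataset) shows the three filters combined into a single boolean mask:

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "release_year": [1995, 1995, 1975, 2003],
    "spoken_languages_name": ["English", "Svenska", "English", "Español"],
    "production_countries_name": [
        "United States of America",
        "United States of America",
        "France",
        "Argentina",
    ],
})

idiomas = ["English", "Français", "Deutsch", "Español", "Italiano"]
paises = ["United States of America", "France", "Argentina"]

# A row is kept only if it satisfies all three conditions.
mascara = (
    (toy["release_year"] >= 1980)
    & toy["spoken_languages_name"].isin(idiomas)
    & toy["production_countries_name"].isin(paises)
)
filtrado = toy[mascara]
print(filtrado["id"].tolist())
```

Movie 1 survives because one of its rows matches every condition, even though its other row is dropped.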

<h3>Transformations Before Joining the DataFrames</h3>
• The values of the "genres_name", "crew_name" and "cast_name" columns are gathered into lists, leaving a single row per ID in each resulting DataFrame.

In [35]:
genero = (df_movies[['id', 'genres_name']]
                               .drop_duplicates()
                               .groupby('id')['genres_name']
                               .apply(list)
                               .reset_index(name='generos'))

In [36]:
directores = (df_credit_crew[['id', 'crew_name']]
                               .drop_duplicates()
                               .groupby('id')['crew_name']
                               .apply(list)
                               .reset_index(name='directores'))

In [37]:
actores = (df_credit_cast[['id', 'cast_name']]
                               .drop_duplicates()
                               .groupby('id')['cast_name']
                               .apply(list)
                               .reset_index(name='actores'))
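
The three cells above follow the same pattern: deduplicate the `(id, value)` pairs, group by `id`, and gather the values into a list. A sketch of that pattern on toy data (invented rows; `generos_toy` is a hypothetical name):

```python
import pandas as pd

# Toy rows with a repeated (id, genre) pair, invented for illustration.
toy = pd.DataFrame({
    "id": [862, 862, 862, 8844, 8844],
    "genres_name": ["Animation", "Comedy", "Comedy", "Adventure", "Fantasy"],
})

generos_toy = (
    toy[["id", "genres_name"]]
    .drop_duplicates()             # remove the repeated (862, Comedy) pair
    .groupby("id")["genres_name"]
    .apply(list)                   # one list of genres per movie id
    .reset_index(name="generos")
)
print(generos_toy)
```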

A new DataFrame is created from selected columns of the different DataFrames.

In [38]:
col_movies = df_movies[['id', 'title','overview']]
col_genero = genero[['id', 'generos']]
col_directores = directores[['id', 'directores']]
col_actores = actores[['id', 'actores']]

# Merge the DataFrames sequentially on 'id'
df_tags = pd.merge(col_movies, col_genero, on='id', how='inner')
df_tags = pd.merge(df_tags, col_directores, on='id', how='inner')
df_tags = pd.merge(df_tags, col_actores, on='id', how='inner')
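
`col_movies` still has one row per genre and company combination, so the merged frame repeats each movie. A possible alternative, sketched on invented toy data, is to keep one row per `id` before merging (`unicas` and `tags_toy` are hypothetical names):

```python
import pandas as pd

# Toy frames, invented for illustration only.
col_movies_toy = pd.DataFrame({
    "id": [862, 862, 8844],
    "title": ["Toy Story", "Toy Story", "Jumanji"],
})
generos_toy = pd.DataFrame({
    "id": [862, 8844],
    "generos": [["Animation", "Comedy"], ["Adventure", "Fantasy"]],
})

# Keep a single row per movie before merging.
unicas = col_movies_toy.drop_duplicates(subset="id")
tags_toy = unicas.merge(generos_toy, on="id", how="inner")
print(len(tags_toy))
```

The notebook reaches a similar result later by calling `drop_duplicates()` on the final frame; deduplicating earlier only keeps the intermediate frames smaller.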

In [39]:
df_tags

Unnamed: 0,id,title,overview,generos,directores,actores
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[Animation, Comedy, Family]",[John Lasseter],"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney]"
1,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[Animation, Comedy, Family]",[John Lasseter],"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney]"
2,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[Animation, Comedy, Family]",[John Lasseter],"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney]"
3,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[Adventure, Fantasy, Family]",[Joe Johnston],"[Robin Williams, Jonathan Hyde, Kirsten Dunst,..."
4,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[Adventure, Fantasy, Family]",[Joe Johnston],"[Robin Williams, Jonathan Hyde, Kirsten Dunst,..."
...,...,...,...,...,...,...
87600,30840,Robin Hood,"Yet another version of the classic epic, with ...","[Drama, Action, Romance]",[John Irvin],"[Patrick Bergin, Uma Thurman, David Morrissey,..."
87601,30840,Robin Hood,"Yet another version of the classic epic, with ...","[Drama, Action, Romance]",[John Irvin],"[Patrick Bergin, Uma Thurman, David Morrissey,..."
87602,67758,Betrayal,"When one of her hits goes wrong, a professiona...","[Action, Drama, Thriller]",[Mark L. Lester],"[Erika Eleniak, Adam Baldwin, Julie du Page, J..."
87603,67758,Betrayal,"When one of her hits goes wrong, a professiona...","[Action, Drama, Thriller]",[Mark L. Lester],"[Erika Eleniak, Adam Baldwin, Julie du Page, J..."


In [40]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87605 entries, 0 to 87604
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          87605 non-null  int64 
 1   title       87605 non-null  object
 2   overview    87605 non-null  object
 3   generos     87605 non-null  object
 4   directores  87605 non-null  object
 5   actores     87605 non-null  object
dtypes: int64(1), object(5)
memory usage: 4.0+ MB


<h3>Transformations</h3>
The 'overview' column is transformed by splitting its text into individual words.<br>
Any spaces inside the genre, director and actor names are removed. This step is crucial for vectorization.<br>
The 'etiquetas' column is created to consolidate the contents of the other columns, making sure there is only one row per movie.

In [41]:
# Drop any rows whose 'overview' is missing.
df_tags.dropna(subset=['overview'], inplace=True)
# Split the 'overview' text into a list of words.
df_tags['overview'] = df_tags['overview'].apply(lambda x: x.split())

In [42]:
# Helper that removes the spaces inside each string of a list.
def eliminar_espacios(cadena):
    sin_espacios = []
    for i in cadena:
        sin_espacios.append(i.replace(" ", ""))
    return sin_espacios

# Apply the helper to the 'generos', 'directores' and 'actores' columns.
df_tags['generos'] = df_tags['generos'].apply(eliminar_espacios)
df_tags['directores'] = df_tags['directores'].apply(eliminar_espacios)
df_tags['actores'] = df_tags['actores'].apply(eliminar_espacios)
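
Removing the spaces matters because `CountVectorizer` tokenizes on word boundaries: "Tom Hanks" would become two unrelated tokens, while "TomHanks" stays a single token that identifies one person. A small sketch with two invented sample strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

con_espacios = ["Tom Hanks Tim Allen"]
sin_espacios = ["TomHanks TimAllen"]

# CountVectorizer lowercases by default, so tokens come out in lowercase.
vocab_con = sorted(CountVectorizer().fit(con_espacios).vocabulary_)
vocab_sin = sorted(CountVectorizer().fit(sin_espacios).vocabulary_)

print(vocab_con)  # four separate tokens
print(vocab_sin)  # one token per person
```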


The "etiquetas" column is created to store all the transformed information.

In [43]:
df_tags['etiquetas'] = df_tags['overview'] + df_tags['generos'] + df_tags['directores'] + df_tags['actores']

In [44]:
# Drop the columns that are no longer needed.
df_tags = df_tags.drop(columns = ['overview', 'generos', 'directores', 'actores','id'])

In [45]:
# Convert each list of strings in the 'etiquetas' column into a single
# space-separated string, using a lambda with apply().
df_tags['etiquetas'] = df_tags['etiquetas'].apply(lambda x: " ".join(x))

In [46]:
df_tags = df_tags.drop_duplicates()

In [47]:
df_tags

Unnamed: 0,title,etiquetas
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
3,Jumanji,When siblings Judy and Peter discover an encha...
12,Grumpier Old Men,A family wedding reignites the ancient feud be...
16,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
19,Father of the Bride Part II,Just when George Banks has recovered from his ...
...,...,...
87583,The Morning After,The Morning After is a feature film that consi...
87587,The Burkittsville 7,A film archivist revisits the story of Rustin ...
87589,Caged Heat 3000,It's the year 3000 AD. The world's most danger...
87590,Robin Hood,"Yet another version of the classic epic, with ..."


In [48]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to C:\Users\Practical
[nltk_data]     Tecno\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<h3>Building the Recommendation Model</h3>


In [49]:
# Reset the index: after drop_duplicates() the index labels no longer match the row positions, which produced wrong recommendations.
df_tags.reset_index(drop=True, inplace=True)

In [50]:
cv = CountVectorizer(max_features = 10000 , stop_words= 'english')

In [51]:
vector = cv.fit_transform(df_tags['etiquetas']).toarray()

In [52]:
vector.shape

(15550, 10000)
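
Calling `.toarray()` materializes every zero: a 15550 × 10000 matrix of 64-bit integers takes roughly 1.2 GB. `cosine_similarity` also accepts the sparse matrix that `fit_transform` returns, so the dense conversion can be skipped. A sketch on an invented three-document corpus (`cv_toy`, `sim_sparse` and `sim_dense` are hypothetical names):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy corpus, for illustration only.
corpus = [
    "toys cowboy space ranger",
    "toys cowboy daycare escape",
    "jungle board game",
]

cv_toy = CountVectorizer(stop_words="english")
sparse_vectors = cv_toy.fit_transform(corpus)  # SciPy sparse matrix

sim_sparse = cosine_similarity(sparse_vectors)            # no dense copy of the counts
sim_dense = cosine_similarity(sparse_vectors.toarray())   # same result via a dense array

print(np.allclose(sim_sparse, sim_dense))
```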

In [53]:
similitud_coseno = cosine_similarity(vector)

similitud_coseno

array([[1.        , 0.03928371, 0.04474374, ..., 0.        , 0.        ,
        0.        ],
       [0.03928371, 1.        , 0.07118685, ..., 0.03965258, 0.        ,
        0.        ],
       [0.04474374, 0.07118685, 1.        , ..., 0.        , 0.03001501,
        0.        ],
       ...,
       [0.        , 0.03965258, 0.        , ..., 1.        , 0.02507849,
        0.06302535],
       [0.        , 0.        , 0.03001501, ..., 0.02507849, 1.        ,
        0.08377078],
       [0.        , 0.        , 0.        , ..., 0.06302535, 0.08377078,
        1.        ]])
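
Each entry of this matrix is the dot product of two count vectors divided by the product of their norms, which is why the diagonal is 1 and the matrix is symmetric. A short sketch checking the formula by hand against scikit-learn on two invented vectors:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two invented count vectors over a four-word vocabulary.
a = np.array([1, 1, 0, 2])
b = np.array([1, 0, 1, 2])

# cos(a, b) = (a . b) / (|a| * |b|)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
sklearn_value = cosine_similarity([a], [b])[0, 0]

print(round(manual, 4), round(sklearn_value, 4))
```

Here the dot product is 5 and both norms are the square root of 6, so both values equal 5/6.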

In [54]:
def recomendacion(titulo):
    try:
        indice = df_tags[df_tags['title'] == titulo].index[0]
    
        distancia = sorted(list(enumerate(similitud_coseno[indice])), reverse = True, key = lambda x: x[1])

        recomendadas = [df_tags.iloc[i[0]].title for i in distancia[1:6]]
        
        print(f"Porque viste '{titulo}', tal vez te guste:")
        for pelicula in recomendadas:
            print(pelicula)
    except IndexError:
        print(f"No se encontró la película '{titulo}' en la base de datos.")

In [55]:
recomendacion("Toy Story")

Porque viste 'Toy Story', tal vez te guste:
Toy Story 3
Toy Story 2
The 40 Year Old Virgin
Small Fry
Andy Peters: Exclamation Mark Question Point
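
The requirement at the top asks for a function that returns a Python list of 5 titles, while the cell above prints them. A minimal sketch of a returning variant, run on invented toy data; `recomendacion_lista`, `df_toy` and `sim_toy` are hypothetical names, and the `n` parameter is an assumption added for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy data standing in for df_tags and similitud_coseno.
df_toy = pd.DataFrame({
    "title": ["Toy Story", "Toy Story 2", "Jumanji", "Heat"],
    "etiquetas": [
        "toys cowboy woody buzz animation",
        "toys cowboy woody buzz rescue animation",
        "jungle board game adventure",
        "heist police thriller",
    ],
})
sim_toy = cosine_similarity(CountVectorizer().fit_transform(df_toy["etiquetas"]))


def recomendacion_lista(titulo, n=5):
    """Return up to n titles most similar to `titulo`, best first."""
    coincidencias = df_toy.index[df_toy["title"] == titulo]
    if len(coincidencias) == 0:
        return []  # unknown title: empty list instead of a printed message
    # Labels equal positions here because df_toy has a fresh RangeIndex.
    indice = coincidencias[0]
    orden = np.argsort(-sim_toy[indice], kind="stable")
    # Skip the movie itself, then keep the first n positions.
    return [df_toy["title"].iloc[i] for i in orden if i != indice][:n]


print(recomendacion_lista("Toy Story", n=2))
```

Returning a list rather than printing makes the result easy to serialize from an API endpoint; an empty list for an unknown title is one possible convention, not the only one.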


In [56]:
# Save the dataset used to build the recommendation function.
df_tags.to_parquet('tags_ML.parquet', index=False)