## ***Machine Learning***
The model follows an `item-item` relationship: given one item, it measures how similar that item is to all the others and recommends the closest ones. Here the input is a movie and the output is a list of recommended movies; similarity is measured with cosine similarity.
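
To make the measure concrete: for two vectors $a$ and $b$, cosine similarity is $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below computes it by hand on small word-count vectors. The helper name `cosine` and the toy vectors are illustrative assumptions; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy word-count vectors over a shared 4-word vocabulary (made-up values)
movie_a = [1, 2, 0, 1]
movie_b = [1, 1, 0, 0]
movie_c = [0, 0, 3, 0]

print(round(cosine(movie_a, movie_b), 3))  # the two movies share words
print(cosine(movie_a, movie_c))            # no words in common
```

With non-negative counts the value lies between 0 (no shared words) and 1 (same direction), which is why the most similar movies are the ones with the highest scores.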

**We start by loading the data and the libraries, then build our recommendation system with an `item-item` similarity filter computed from each movie's text tags. So...**
<img src = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTqF0wkErtJDCV5QzwoPO_9sdEA7nxQdk5EJA&usqp=CAU">

### ***1. Import libraries***

In [1]:
# Import libraries
import numpy as np 
import pandas as pd 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem.porter import PorterStemmer

### **1.1 Initial data load**

In [2]:
movies = pd.read_parquet("Data/movies.parquet")
actors = pd.read_parquet("Data/actores.parquet")
team = pd.read_parquet("Data/equipo.parquet")

In [3]:
# Convert to DataFrames (pd.read_parquet already returns DataFrames)
df_movies = pd.DataFrame(movies)
df_actors = pd.DataFrame(actors)
df_team = pd.DataFrame(team)

### ***2. Prepare the data***

In [4]:
df_actors.head(1)

Unnamed: 0,id_movie,gender,actor_id,name_actor
0,862,2,31,Tom Hanks


In [5]:
df_team.head(1)

Unnamed: 0,id_movie,team_member_id,job,name,gender
0,862,7879,Director,John Lasseter,2


In [6]:
df_movies.head(1)

Unnamed: 0,id_movie,name,original_language,release_date,release_year,genre_id,genre_name,popularity,runtime,vote_average,vote_count,company_id,company_name,revenue,budget,return,overview
0,862,Toy Story,en,1995-10-30,1995,16,Animation,21.946943,81,7.699219,5415,3,Pixar Animation Studios,373554033,30000000,12.453125,"Led by Woody, Andy's toys live happily in his ..."


In [7]:
# Index the movie names by id_movie so a title can be looked up by id
movie_name = (
    df_movies[['id_movie', 'name']]
    .drop_duplicates()
    .set_index('id_movie')
    .rename(columns={'name': 'Name'})
)
movie_name.head(4)


id_movie,Name
862,Toy Story
8844,Jumanji
15602,Grumpier Old Men
31357,Waiting to Exhale


In [8]:
item = 862
print(movie_name.loc[item].Name)

Toy Story


In [9]:
# For example, look up the rows for 'Batman Begins'
batman_movies = df_movies[df_movies['name'].str.contains('Batman Begins', case=False, na=False)]
batman_movies.head(4)

Unnamed: 0,id_movie,name,original_language,release_date,release_year,genre_id,genre_name,popularity,runtime,vote_average,vote_count,company_id,company_name,revenue,budget,return,overview
49512,272,Batman Begins,en,2005-06-10,2005,28,Action,28.505341,-116,7.5,7511,429,DC Comics,374218673,150000000,2.494141,"Driven by tragedy, billionaire Bruce Wayne ded..."
49513,272,Batman Begins,en,2005-06-10,2005,28,Action,28.505341,-116,7.5,7511,923,Legendary Pictures,374218673,150000000,2.494141,"Driven by tragedy, billionaire Bruce Wayne ded..."
49514,272,Batman Begins,en,2005-06-10,2005,28,Action,28.505341,-116,7.5,7511,6194,Warner Bros.,374218673,150000000,2.494141,"Driven by tragedy, billionaire Bruce Wayne ded..."
49515,272,Batman Begins,en,2005-06-10,2005,28,Action,28.505341,-116,7.5,7511,9993,DC Entertainment,374218673,150000000,2.494141,"Driven by tragedy, billionaire Bruce Wayne ded..."


In [10]:
# DataFrame df_movies
df_genres = df_movies[['id_movie', 'genre_name']]
# Drop duplicate rows
df_genres = df_genres.drop_duplicates(subset=df_genres.columns)
print(df_genres.shape)
# Group by id_movie and keep the first 3 genres
df_genres = df_genres.groupby('id_movie')['genre_name'].apply(lambda x : x.head(3).tolist()).reset_index()
df_genres

(70407, 2)


Unnamed: 0,id_movie,genre_name
0,-32763,"[Mystery, Drama]"
1,-32754,"[TV Movie, Drama]"
2,-32746,"[Family, Fantasy, Drama]"
3,-32744,"[Drama, Music]"
4,-32739,"[Romance, Comedy]"
...,...,...
28155,32757,[Comedy]
28156,32761,[Drama]
28157,32764,[Drama]
28158,32765,"[Thriller, Crime, Drama]"


In [11]:
id_seleccionado = 272

# Find the row for the selected 'id_movie'
fila_seleccionada = df_genres[df_genres['id_movie'] == id_seleccionado]

# Access the list of genres
if not fila_seleccionada.empty:
    genre = fila_seleccionada.iloc[0]['genre_name']
    print(f"Genres for the movie with id_movie {id_seleccionado}: {genre}")
else:
    print(f"No movie found with id_movie {id_seleccionado}")


Genres for the movie with id_movie 272: ['Action', 'Crime', 'Drama']


In [12]:
# DataFrame df_actors
# Group by id_movie and keep the first 4 actors
top_actors = df_actors.groupby('id_movie')['name_actor'].apply(lambda x : x.head(4).tolist()).reset_index()
top_actors

Unnamed: 0,id_movie,name_actor
0,2,"[Turo Pajala, Susanna Haavisto, Matti Pellonpä..."
1,3,"[Matti Pellonpää, Kati Outinen, Sakari Kuosman..."
2,5,"[Tim Roth, Antonio Banderas, Jennifer Beals, M..."
3,6,"[Emilio Estevez, Cuba Gooding Jr., Denis Leary..."
4,11,"[Mark Hamill, Harrison Ford, Carrie Fisher, Pe..."
...,...,...
43013,464207,"[William Shatner, Neil deGrasse Tyson, Chris H..."
43014,465044,"[Karolina Antosik, Amelie Leroy, Tessa McGinn,..."
43015,467731,"[Lloyd Bridges, Jack Warden, Rafael Campos, Ro..."
43016,468707,"[Inka Haapamäki, Rosa Honkonen, Tiitus Rantala..."


In [13]:
# Select a specific 'id_movie'
id_movie_seleccionado = 272

# Find the row for the selected 'id_movie'
fila_seleccionada = top_actors[top_actors['id_movie'] == id_movie_seleccionado]

# Access the list of actors
if not fila_seleccionada.empty:
    lista_actores = fila_seleccionada.iloc[0]['name_actor']
    print(f"Actors in the movie with id_movie {id_movie_seleccionado}: {lista_actores}")
else:
    print(f"No movie found with id_movie {id_movie_seleccionado}")


Actors in the movie with id_movie 272: ['Christian Bale', 'Michael Caine', 'Liam Neeson', 'Katie Holmes']


In [14]:
top_actors.head(1)

Unnamed: 0,id_movie,name_actor
0,2,"[Turo Pajala, Susanna Haavisto, Matti Pellonpä..."


In [15]:
# Keep only the columns we need and rename 'name' to 'director'
df_team = df_team[['id_movie', 'name']].rename(columns={'name': 'director'})
df_team.head(1)

Unnamed: 0,id_movie,director
0,862,John Lasseter


In [16]:
# Do the same with df_movies: keep only the needed columns and drop duplicate rows
df_movies = df_movies[['id_movie', 'name', 'overview']]
print("Before ", df_movies.shape)
df_movies = df_movies.drop_duplicates()
print("After ", df_movies.shape)

Before  (157676, 3)
After  (33213, 3)


In [17]:
# Columns needed for the recommendation system: id_movie, name, overview, genres, cast, crew
df_system = pd.merge(df_movies, top_actors, on= 'id_movie', how='inner')

df_com = pd.merge(df_system, df_team, on='id_movie', how='inner')

df_combined = pd.merge(df_com, df_genres, on='id_movie', how='inner')

df_combined.head()

Unnamed: 0,id_movie,name,overview,name_actor,director,genre_name
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...","[Tom Hanks, Tim Allen, Don Rickles, Jim Varney]",John Lasseter,"[Animation, Comedy, Family]"
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",Joe Johnston,"[Adventure, Fantasy, Family]"
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...",Howard Deutch,"[Romance, Comedy]"
3,31357,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","[Whitney Houston, Angela Bassett, Loretta Devi...",Forest Whitaker,"[Comedy, Drama, Romance]"
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,"[Steve Martin, Diane Keaton, Martin Short, Kim...",Charles Shyer,[Comedy]


In [18]:
# Check how many rows we have
df_combined.shape

(16649, 6)
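
An inner merge repeats a movie once for every matching row on the right-hand side, which is one possible source of repeated titles in the combined table. The sketch below shows the effect on made-up frames and one way to guard against it, assuming that keeping the first crew row per movie is acceptable; the frames `movies` and `crew` and their values are hypothetical.

```python
import pandas as pd

# Toy frames (made-up rows): movie 2 has two crew rows
movies = pd.DataFrame({'id_movie': [1, 2], 'name': ['A', 'B']})
crew = pd.DataFrame({'id_movie': [1, 2, 2], 'director': ['X', 'Y', 'Z']})

# Plain inner merge: one output row per (movie, crew row) pair
merged = pd.merge(movies, crew, on='id_movie', how='inner')
print(merged.shape)

# Keep one crew row per movie first; validate raises if keys are still repeated
crew_unique = crew.drop_duplicates(subset='id_movie', keep='first')
merged_unique = pd.merge(movies, crew_unique, on='id_movie',
                         how='inner', validate='one_to_one')
print(merged_unique.shape)
```

Whether the first row is the right one to keep depends on how the crew table is ordered, so this is a design choice rather than a rule.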

In [19]:
df_combined.rename(columns={'genre_name' : 'genre', 'name_actor' : 'actor'}, inplace=True)
df_combined.columns

Index(['id_movie', 'name', 'overview', 'actor', 'director', 'genre'], dtype='object')

In [20]:
# Remove the spaces inside each genre and actor name so each becomes a single token
df_combined['genre'] = df_combined['genre'].apply(lambda x:[i.replace(" ", "") for i in x])
df_combined['actor'] = df_combined['actor'].apply(lambda x:[i.replace(" ", "") for i in x])

In [21]:
# Concatenate the genre and actor lists, then join them into one space-separated string
df_combined['semi_tags'] = df_combined['genre'] + df_combined['actor']

df_combined['semi_tags'] = df_combined['semi_tags'].apply(lambda x:" ".join(x))

In [22]:
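# Build the final tags: overview + genre/actor tokens + director.
# Note: the pieces are joined without a separating space, so the last actor and
# the director fuse into one token (see the output of the next cell).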
df_combined['tags'] = df_combined['overview'] + df_combined['semi_tags'] + df_combined['director']

matriz = df_combined[['id_movie', 'name', 'tags']]

In [23]:
matriz['tags'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.Animation Comedy Family TomHanks TimAllen DonRickles JimVarneyJohn Lasseter"
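
The output above ends in `JimVarneyJohn Lasseter`: the last actor and the director's first name have fused into one token because the pieces were concatenated without a space. A minimal sketch of joining the same pieces with explicit separators follows. The helper names `squeeze` and `build_tags` are hypothetical, and squeezing the director's name into one token is an extra assumption (the notebook leaves it as two words).

```python
def squeeze(name):
    """Remove inner spaces so a full name counts as a single token."""
    return name.replace(" ", "")

def build_tags(overview, genres, actors, director):
    """Join every text piece with single spaces so no two tokens fuse."""
    parts = [overview]
    parts.extend(squeeze(g) for g in genres)
    parts.extend(squeeze(a) for a in actors)
    parts.append(squeeze(director))  # assumption: treat the director like an actor
    return " ".join(part for part in parts if part)

tags = build_tags(
    "Woody plots against Buzz.",
    ["Animation", "Comedy"],
    ["Tom Hanks", "Tim Allen"],
    "John Lasseter",
)
print(tags)
```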

In [24]:
# Lowercase everything in the tags column
matriz = matriz.assign(tags=matriz['tags'].str.lower())
matriz.head()


Unnamed: 0,id_movie,name,tags
0,862,Toy Story,"led by woody, andy's toys live happily in his ..."
1,8844,Jumanji,when siblings judy and peter discover an encha...
2,15602,Grumpier Old Men,a family wedding reignites the ancient feud be...
3,31357,Waiting to Exhale,"cheated on, mistreated and stepped on, the wom..."
4,11862,Father of the Bride Part II,just when george banks has recovered from his ...


In [25]:
matriz.tags[0]

"led by woody, andy's toys live happily in his room until andy's birthday brings buzz lightyear onto the scene. afraid of losing his place in andy's heart, woody plots against buzz. but when circumstances separate buzz and woody from their owner, the duo eventually learns to put aside their differences.animation comedy family tomhanks timallen donrickles jimvarneyjohn lasseter"

In [26]:
# Export the file for deployment
# matriz.to_parquet("ml1.parquet")

### ***Preprocessing with nltk***

In [27]:
# Initialize the CountVectorizer
cv = CountVectorizer(max_features= 5000, stop_words= 'english')

In [28]:
vectors = cv.fit_transform(matriz['tags']).toarray()
print(vectors)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
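
Each row above is one movie and each column is the count of one vocabulary word in that movie's tags. The following is a conceptual sketch of that bag-of-words step using only the standard library, not scikit-learn's implementation: it reuses `CountVectorizer`'s default token pattern and keeps the most frequent terms, but it skips stop-word removal. The function names and the toy documents are assumptions.

```python
import re
from collections import Counter

# Default token pattern of CountVectorizer: words of 2+ word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def count_vectors(docs, max_features):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    totals = Counter(tok for doc in docs for tok in tokenize(doc))
    # Keep the most frequent terms, then sort them to get stable column order
    vocabulary = sorted(term for term, _ in totals.most_common(max_features))
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

docs = ["woody toys toys", "woody buzz", "buzz buzz buzz woody"]
vocabulary, rows = count_vectors(docs, max_features=2)
print(vocabulary)
print(rows)
```

Because only the top `max_features` terms survive, rare words such as `toys` in this toy corpus contribute nothing to the similarity.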


In [29]:
# Initialize the Porter stemmer to reduce words to their base form
ps = PorterStemmer()

In [30]:
# Define the stemming helper built on ps (PorterStemmer)
def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))

    return " ".join(y)

In [31]:
# Stem every word in the tags column.
# Note: `vectors` was already built in In [28], so these stems only affect the
# similarity if CountVectorizer is fitted again after this cell.
matriz = matriz.assign(tags=matriz['tags'].apply(stem))


In [32]:
# Check the shape of the count vectors (movies x vocabulary terms)
vectors.shape

(16649, 5000)

In [33]:
# reduce_vector = vectors[:1000, :]

In [34]:
# Compute the cosine similarity between every pair of movies
similarity = cosine_similarity(vectors)

In [35]:
similarity.shape[0]

16649
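
A square similarity matrix grows with the square of the number of movies, which is worth checking before deployment. The arithmetic below is a rough estimate that assumes dense arrays with 8-byte entries; the shapes are the ones printed in this notebook, and the helper `gib` is illustrative.

```python
n_movies, n_features = 16649, 5000   # shapes printed in the notebook
bytes_per_value = 8                  # assumption: 64-bit entries

def gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30

vectors_gib = gib(n_movies * n_features * bytes_per_value)
similarity_gib = gib(n_movies * n_movies * bytes_per_value)

print(f"vectors:    {vectors_gib:.2f} GiB")
print(f"similarity: {similarity_gib:.2f} GiB")
```

Under those assumptions the similarity matrix alone needs roughly 2 GiB, which is one reason to store only the top recommendations per movie rather than the full matrix.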

In [36]:
print(similarity)

[[1.         0.04652421 0.05415304 ... 0.05025189 0.05025189 0.        ]
 [0.04652421 1.         0.0831411  ... 0.02571722 0.02571722 0.        ]
 [0.05415304 0.0831411  1.         ... 0.08980265 0.08980265 0.03666178]
 ...
 [0.05025189 0.02571722 0.08980265 ... 1.         1.         0.06804138]
 [0.05025189 0.02571722 0.08980265 ... 1.         1.         0.06804138]
 [0.         0.         0.03666178 ... 0.06804138 0.06804138 1.        ]]


In [37]:
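def recommend(movie):
    # Row position of the first movie with this exact title
    movie_index = matriz[matriz['name'] == movie].index[0]
    distances = similarity[movie_index]
    # Sort by similarity (highest first) and skip position 0, normally the movie itself
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]

    for i in movies_list:
        print(matriz.iloc[i[0]]['name'])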

In [38]:
recommend('Toy Story')

Toy Story 3
Toy Story 2
The 40 Year Old Virgin
Andy Peters: Exclamation Mark Question Point
Factory Girl
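
Slicing with `[1:6]` assumes the queried movie is always first after sorting. When another row has similarity 1.0 and a lower position (the printed similarity matrix contains such off-diagonal 1.0 values), the queried movie can stay in its own list. A minimal sketch that excludes it by position instead is shown below; `top_k_similar` and the toy matrix are hypothetical, and the function accepts any row-indexable matrix.

```python
import heapq

def top_k_similar(similarity, index, k=5):
    """Positions of the k most similar rows to `index`, never `index` itself."""
    scores = similarity[index]
    candidates = ((score, j) for j, score in enumerate(scores) if j != index)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [j for _, j in best]

# Toy 4x4 similarity matrix (made-up values); rows 0 and 1 are exact duplicates
toy = [
    [1.0, 1.0, 0.2, 0.7],
    [1.0, 1.0, 0.2, 0.7],
    [0.2, 0.2, 1.0, 0.1],
    [0.7, 0.7, 0.1, 1.0],
]

# Slicing approach for row 1: the duplicate at position 0 sorts first,
# so the slice keeps position 1, the queried row itself
ranked = sorted(enumerate(toy[1]), reverse=True, key=lambda x: x[1])[1:3]
print([j for j, _ in ranked])

# Excluding by position avoids that
print(top_k_similar(toy, index=1, k=2))
```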


In [40]:
def recommend(movie):
    movie_index = matriz[matriz['name'] == movie].index[0]
    distances = similarity[movie_index]
    movies_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
    
    recommended_movies = [matriz.iloc[i[0]]['name'] for i in movies_list]
    return recommended_movies

In [41]:
# Build the DataFrame with the recommendations for every movie
# (each call to recommend returns a list of movie titles)
rows = [{'movie': movie, 'recomendations': recommend(movie)} for movie in matriz['name']]
df_recommendations = pd.DataFrame(rows, columns=['movie', 'recomendations'])

df_recommendations.head()

Unnamed: 0,movie,recomendations
0,Toy Story,"[Toy Story 3, Toy Story 2, The 40 Year Old Vir..."
1,Jumanji,"[Word Wars, Word Wars, Dungeons & Dragons, Bra..."
2,Grumpier Old Men,"[Grumpy Old Men, Rendezvous in Paris, Spoon, A..."
3,Waiting to Exhale,"[The Apple Game, The Great Sinner, How I Learn..."
4,Father of the Bride Part II,"[Father of the Bride, Lambchops, Kuffs, I Star..."


In [42]:
df_recommendations.info(0)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16649 entries, 0 to 16648
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   movie           16649 non-null  object
 1   recomendations  16649 non-null  object
dtypes: object(2)
memory usage: 260.3+ KB


In [45]:
def recom(name):
    indice = df_recommendations[df_recommendations['movie'] == name].index[0]
    recomendacion = df_recommendations.iloc[indice]['recomendations']
    return recomendacion
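
`recom` raises an `IndexError` when the title is not in the table, because `.index[0]` has nothing to return. A minimal sketch of a lookup that returns an empty list instead follows, assuming the precomputed recommendations are held in a plain dictionary; `build_lookup`, `recom_safe`, and the sample pairs are hypothetical.

```python
def build_lookup(pairs):
    """Map each title to its recommendation list; the first occurrence wins."""
    lookup = {}
    for title, recommendations in pairs:
        lookup.setdefault(title, recommendations)
    return lookup

def recom_safe(lookup, name):
    """Return the stored recommendations, or an empty list for an unknown title."""
    return lookup.get(name, [])

# Made-up pairs; in the notebook they could come from
# zip(df_recommendations['movie'], df_recommendations['recomendations'])
lookup = build_lookup([
    ("Movie A", ["Movie B", "Movie C"]),
    ("Movie B", ["Movie A"]),
    ("Movie A", ["ignored duplicate"]),
])

print(recom_safe(lookup, "Movie A"))
print(recom_safe(lookup, "Unknown title"))
```

Keeping the first occurrence mirrors what `.index[0]` does when several rows share a title.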

In [47]:
recom('Batman')

['Batman Beyond: Return of the Joker',
 'Batman & Robin',
 'DC Showcase: Catwoman',
 'Lugares comunes',
 'Batman Begins']

In [41]:
# For example, search for Batman movies
batman_movies = matriz[matriz['name'].str.contains('Batman', case=False, na=False)]
batman_movies.head(4)

Unnamed: 0,id_movie,name,tags
133,414,Batman Forever,the dark knight of gotham citi confront a dast...
574,268,Batman,the dark knight of gotham citi begin hi war on...
1283,364,Batman Returns,"have defeat the joker, batman now face the pen..."
1415,415,Batman & Robin,along with crime-fight partner robin and new r...


In [49]:
# Save the recommendations DataFrame
# Note: to_csv writes semicolon-separated text, so despite its name this file is not Parquet
df_recommendations.to_csv("movies.parquet", sep=';')
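
One consequence of saving to a text format is that a list-valued cell is stored as its text representation, so it comes back as a string when the file is read. The sketch below imitates that round trip with the standard `csv` module (it is not pandas' own code path) and parses the text back with `ast.literal_eval`; the sample row is made up.

```python
import ast
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';')
writer.writerow(['movie', 'recomendations'])
# A list cell ends up in the file as its text representation
writer.writerow(['Movie A', str(['Movie B', 'Movie C'])])

buffer.seek(0)
rows = list(csv.DictReader(buffer, delimiter=';'))
raw = rows[0]['recomendations']
print(type(raw).__name__, raw)

# Parse the text back into a Python list
restored = ast.literal_eval(raw)
print(restored[0])
```

A binary format that supports list columns, such as Parquet written with `DataFrame.to_parquet`, would avoid the parsing step.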