**MOVIE RECOMMENDATION MODEL**

In this notebook I build a content-based movie recommendation model.

The user enters the title of a movie.

The model returns 5 similar movies.

Based on the conclusions I reached in the EDA, I am going to build a content-based model that takes the movies' overview into account. It returns 5 similar movies, sorted by popularity from highest to lowest.

The KNN model is built by ranking items according to cosine similarity.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

sns.set()

**Step 1: Data preparation**

In [2]:
movies = pd.read_csv('Movies ETL.csv',usecols= ['title', 'genres', 'overview', 'popularity'], sep=',')
movies.head()

Unnamed: 0,genres,overview,popularity,title
0,"['Animation', 'Comedy', 'Family']","Led by Woody, Andy's toys live happily in his ...",21.946943,Toy Story
1,"['Adventure', 'Fantasy', 'Family']",When siblings Judy and Peter discover an encha...,17.015539,Jumanji
2,"['Romance', 'Comedy']",A family wedding reignites the ancient feud be...,11.7129,Grumpier Old Men
3,"['Comedy', 'Drama', 'Romance']","Cheated on, mistreated and stepped on, the wom...",3.859495,Waiting to Exhale
4,['Comedy'],Just when George Banks has recovered from his ...,8.387519,Father of the Bride Part II


In [3]:
# Fill null values in 'title' with an empty string so that TfidfVectorizer can process every row
movies['title'] = movies['title'].fillna('')
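
The introduction describes an overview-based model, while the cell above and the vectorizer later in the notebook work on `title`. The sketch below shows what the same preparation could look like for `overview`. It is only an illustration: the toy frame, its `Film A`/`Film B`/`Film C` titles, and the name `overview_matrix` are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the `movies` DataFrame (invented rows)
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "overview": [
        "A detective hunts a thief in Paris.",
        None,  # missing overview
        "A thief hides in Paris.",
    ],
})

# TfidfVectorizer cannot handle missing values, so replace them with ''
toy["overview"] = toy["overview"].fillna("")

# One TF-IDF row per movie, built from the overview text
overview_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
```

A movie with an empty overview ends up as an all-zero row, so it cannot be matched to anything by cosine similarity.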

I am going to reduce the size of the matrix by working with a random sample of the movies, because of the limited processing capacity of my machine.

In [4]:
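muestra = 10000  # sample size
# Draw row positions at random, without replacement (no seed is set, so the sample differs on every run)
muestra_index = np.random.choice(range(len(movies)), size=muestra, replace=False)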


In [5]:
movies_muestra = movies.iloc[muestra_index]
movies_muestra = movies_muestra.reset_index(drop=True)
movies_muestra.shape

(10000, 4)
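
Because `np.random.choice` is called without a seed, each run draws a different sample and the outputs further down change with it. A minimal sketch of a reproducible alternative follows, assuming a fixed seed is acceptable. The helper `sample_movies`, the toy frame, and the seed value 42 are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the `movies` DataFrame
toy_movies = pd.DataFrame({
    "title": [f"Movie {i}" for i in range(100)],
    "popularity": [float(i) for i in range(100)],
})

def sample_movies(df, n, seed=42):
    """Return n rows drawn without replacement, with a fresh 0..n-1 index."""
    n = min(n, len(df))  # avoid an error when the frame has fewer than n rows
    return df.sample(n=n, random_state=seed).reset_index(drop=True)

toy_sample = sample_movies(toy_movies, 10)
```

With the same seed, calling `sample_movies` again returns the same rows, which makes the recommendations repeatable across runs.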

In [18]:
movies_muestra.tail(20)

Unnamed: 0,genres,overview,popularity,title
9980,"['TV Movie', 'Drama', 'Science Fiction']",A Scottish chaplain embarks on an epic journey...,0.353742,Oasis
9981,"['Action', 'Comedy', 'Thriller']",Two bickering mercenaries are hired by the CIA...,1.199005,Fifty/Fifty
9982,"['Action', 'Comedy', 'Romance']","A girl falls for the ""perfect"" guy, who happen...",7.317816,Mr. Right
9983,['Comedy'],"A shy, introverted young girl takes a summer j...",0.122181,Experience Preferred...But Not Essential
9984,"['Romance', 'Drama']","Divya, a woman grieving over the death of her ...",0.437959,Mouna Raagam
9985,"['Mystery', 'Horror']",A housewife is frequently left alone by her hu...,0.352005,Glass Ceiling
9986,[],A static camera observes a room as it slowly f...,1e-06,Tango
9987,"['Adventure', 'Comedy', 'Crime']",All the passengers on an airplane headed for S...,0.267863,The Sky Dragon
9988,['Crime'],Three teens get into the drug business when th...,0.047829,Stakeout on Dope Street
9989,"['Action', 'Adventure', 'Western']",The clerk at the train station is assaulted an...,4.248169,The Great Train Robbery


**Step 2: Text vectorization**

In [10]:
tfidf = TfidfVectorizer(stop_words='english')
title_matrix = tfidf.fit_transform(movies_muestra['title'])

In [11]:
title_matrix.shape

(10000, 8682)
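
To make the shape above easier to read, here is a small sketch of what `TfidfVectorizer(stop_words='english')` does to short titles: each row is a movie and each column is a word kept in the vocabulary. The four toy titles are chosen for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_titles = ["Toy Story", "Toy Story 2", "Jumanji", "The"]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_titles)

# Words that survived tokenization and stop-word removal
vocabulary = sorted(toy_tfidf.vocabulary_)

# Dense copy, only reasonable here because the example is tiny
dense = toy_matrix.toarray()
```

Two things are worth noting. The default tokenizer keeps only tokens of two or more characters, so "Toy Story" and "Toy Story 2" get identical vectors. A title made up entirely of stop words, such as "The", becomes an all-zero row.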

**Step 3: Similarity computation**

In [12]:
# Pairwise cosine similarity between all title vectors (a dense 10000 x 10000 array)
similarity_matrix = cosine_similarity(title_matrix)
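
As a small worked example of the measure used here, the sketch below computes cosine similarity for three hand-picked vectors and compares it with scikit-learn's cosine distance, which is defined as 1 minus the similarity. The vectors are arbitrary illustration values.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

sim = cosine_similarity(vectors)   # 1 = same direction, 0 = nothing in common
dist = cosine_distances(vectors)   # 1 - similarity
```

Rows 0 and 1 share one component, so their similarity is 1/sqrt(2). Rows 0 and 2 share none, so their similarity is 0 and their distance is 1. Vector length does not matter, only direction.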

**Step 4: Building model 1: KNN**

I am going to use KNN with K=5.

In [13]:
k = 5
model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(title_matrix)
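
The sketch below shows what `kneighbors` returns on a tiny example built the same way as the model above. The five titles are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented titles: the first three share the word "moon"
toy_titles = ["Moon Harbor", "Moon Garden", "Moon Bridge",
              "Silent Orchard", "Copper Lantern"]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_titles)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_matrix)

# Query with the row of "Moon Harbor"; both results have shape (1, 3)
distances, indices = toy_model.kneighbors(toy_matrix[0], n_neighbors=3)
```

The first neighbour is the query row itself at distance 0, followed by the two other "Moon" titles. This is why a recommendation function typically asks for one more neighbour than it needs and then discards the query.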


**Step 5: Recommendation function**

In [14]:
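def recomendacion(titulo):
    """Return the k movies closest to `titulo`, sorted by popularity (highest first)."""
    # Row of the first movie in the sample whose title matches exactly;
    # raises IndexError if the title is not in the sample
    movie_index = movies_muestra[movies_muestra['title'] == titulo].index[0]
    # Request k+1 neighbours because the query movie is normally its own nearest neighbour
    distancias, indices = model.kneighbors(title_matrix[movie_index], n_neighbors=k+1)
    # Sort by distance and drop the first entry (normally the query movie itself)
    similar_movies = sorted(zip(indices.squeeze().tolist(), distancias.squeeze().tolist()), key=lambda x: x[1])[1:]
    similar_movie_indices = [i[0] for i in similar_movies]

    recommended_movies = movies_muestra.iloc[similar_movie_indices][['title', 'genres', 'overview', 'popularity']]

    # Sort by popularity, descending
    recommended_movies = recommended_movies.sort_values(by='popularity', ascending=False)

    return recommended_movies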
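
A possible more defensive variant is sketched below. It returns an empty frame when the title is not in the sample instead of raising `IndexError`, and it removes the query movie by row position rather than assuming it is always the first neighbour. The name `recommend_safe`, the toy frame, and its popularity values are hypothetical. The sketch also assumes the frame has a 0..n-1 index aligned with the matrix rows, as `movies_muestra` does after `reset_index`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented titles and popularity values, for illustration only
toy = pd.DataFrame({
    "title": ["Moon Harbor", "Moon Garden", "Moon Bridge",
              "Silent Orchard", "Copper Lantern"],
    "popularity": [5.0, 3.0, 1.0, 4.0, 2.0],
})
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["title"])
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_matrix)

def recommend_safe(title, frame, matrix, knn, k=2):
    """Return up to k neighbours of `title`, most popular first."""
    matches = frame.index[frame["title"] == title]
    if len(matches) == 0:
        return frame.iloc[0:0]  # empty frame with the same columns
    query = int(matches[0])
    n = min(k + 1, len(frame))  # never ask for more rows than exist
    _, idx = knn.kneighbors(matrix[query], n_neighbors=n)
    # Drop the query row explicitly, then keep at most k neighbours
    neighbours = [i for i in idx.ravel().tolist() if i != query][:k]
    return frame.iloc[neighbours].sort_values("popularity", ascending=False)

result = recommend_safe("Moon Harbor", toy, toy_matrix, toy_model, k=2)
missing = recommend_safe("Not In Sample", toy, toy_matrix, toy_model, k=2)
```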


In [19]:
recomendacion('Elizabeth')

Unnamed: 0,title,genres,overview,popularity
6665,Reality,"['Drama', 'Comedy']",A dark comedy centering on the lives of a Neap...,2.890778
6667,Goya in Bordeaux,"['Drama', 'War']","Francisco Goya (1746-1828), deaf and ill, live...",1.300405
6668,Promise,"['War', 'Drama']",The film tells about members of Finnish women'...,0.368474
6670,Fresh,['Documentary'],"FRESH is more than a movie, it’s a gateway to ...",0.086169
6666,VeggieTales: Josh and the Big Wall,"['Family', 'Animation']","A lesson in obedience, it's the Bible story of...",0.07202
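
A title-only TF-IDF can match only movies that share a word with the query. When no other title shares a token, every candidate sits at cosine distance 1 and the neighbours returned are essentially arbitrary. The near-consecutive index values in the output above may be a symptom of this, although the notebook does not show the distances. The sketch below reproduces the effect on invented titles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented titles: "Solitaire" shares no word with any other title
toy_titles = ["Solitaire", "Moon Harbor", "Moon Garden", "Copper Lantern"]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_titles)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_matrix)

distances, indices = toy_model.kneighbors(toy_matrix[0], n_neighbors=4)

# Distances to everything except the query itself
neighbour_distances = distances.ravel()[1:]
all_tied = bool((neighbour_distances > 1 - 1e-9).all())
```

Inspecting the distances alongside the recommendations is a quick way to tell a genuine match from a tie among unrelated titles.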
