## With the following dataset we hope to predict which movies may or may not be a commercial success. The dataset collects part of the information available through the TMDB API and contains only about 5000 of the total number of movies.

In [1]:
import pandas as pd 
import numpy as np
import sqlite3
import json
from pickle import dump
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import cosine_similarity

We load the data into two datasets: the first holds the movie information and the second holds the credits information.

In [2]:
movies = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv")
credits = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv")

### We connect to a SQLite database and create two tables from the two datasets.

In [3]:
conn = sqlite3.connect("../data/movies_database.db")

In [4]:
movies.to_sql("movies_table", conn, if_exists = "replace", index = False)
credits.to_sql("credits_table", conn, if_exists = "replace", index = False)

4803

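We build a single dataset by joining the two tables on rows that share the same title. Because titles are not guaranteed to be unique, this join can return more rows than either table contains. The last line of the cell drops the column names that appear twice after the join.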


In [5]:
query = """
    SELECT *
    FROM movies_table
    INNER JOIN credits_table
    ON movies_table.title = credits_table.title;
"""

total_data = pd.read_sql_query(query, conn)
conn.close()

total_data = total_data.loc[:, ~total_data.columns.duplicated()]
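To illustrate why a join on `title` can inflate the row count, here is a minimal sketch on made-up toy tables (the `*_demo` names and their contents are hypothetical, not part of the TMDB data). It compares the title join with a join on `id = movie_id`, assuming those two columns identify the same movie, as the first rows of the output suggest.

```python
import sqlite3
import pandas as pd

# Toy tables (hypothetical data): two different films share the title "Batman".
movies_demo = pd.DataFrame({"id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})
credits_demo = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["Batman", "Batman", "Avatar"]})

conn_demo = sqlite3.connect(":memory:")
movies_demo.to_sql("movies_table", conn_demo, index=False)
credits_demo.to_sql("credits_table", conn_demo, index=False)

# Join on title: every "Batman" movie row pairs with every "Batman" credits row.
by_title = pd.read_sql_query(
    "SELECT * FROM movies_table m JOIN credits_table c ON m.title = c.title", conn_demo)
# Join on the numeric identifier: one match per movie.
by_id = pd.read_sql_query(
    "SELECT * FROM movies_table m JOIN credits_table c ON m.id = c.movie_id", conn_demo)
conn_demo.close()

# Drop the repeated "title" column, as in the cell above.
by_title = by_title.loc[:, ~by_title.columns.duplicated()]
by_id = by_id.loc[:, ~by_id.columns.duplicated()]

print(len(by_title), len(by_id))  # 5 3
```

Joining on the identifier keeps one row per movie in this toy case; whether that holds for the full dataset would need to be checked against the real tables.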

In [6]:
total_data.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."


#### We select the most relevant columns, which are the ones the model will work with.

In [7]:
data_limpia = total_data[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']].copy()

In [8]:
# Helper that parses a JSON string and returns default_value when parsing fails
def load_json_safe(json_str, default_value = None):
    try:
        return json.loads(json_str)
    except (TypeError, json.JSONDecodeError):
        return default_value

In [9]:
# Transform the 'genres' and 'keywords' columns into lists of names (spaces removed)
data_limpia['genres'] = data_limpia['genres'].apply(lambda x: [item["name"].replace(" ", "") for item in load_json_safe(x, [])] if x else None)
data_limpia['keywords'] = data_limpia['keywords'].apply(lambda x: [item["name"].replace(" ", "") for item in load_json_safe(x, [])] if x else None)
# Transform the 'cast' column, keeping only the first three names
data_limpia['cast'] = data_limpia['cast'].apply(lambda x: [item["name"].replace(" ", "") for item in load_json_safe(x, [])][:3] if x else None)
# Transform the 'crew' column to obtain the director's name (a single string, not a list)
def get_director(crew_data):
    for crew_member in crew_data:
        if crew_member['job'] == 'Director':
            return crew_member['name'].replace(" ", "")
    return None
data_limpia['crew'] = data_limpia['crew'].apply(lambda x: get_director(load_json_safe(x, [])) if x else None)
# Wrap the 'overview' text in a one-element list
data_limpia['overview'] = data_limpia['overview'].apply(lambda x: [x] if x else None)
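As a quick check of what these transformations do, the sketch below applies the same idea to a single made-up JSON string shaped like the `genres` column. The helper name `names_from_json` and the sample string are assumptions introduced for illustration only.

```python
import json

def names_from_json(json_str, limit=None):
    """Return the 'name' of each object in a JSON array string, spaces removed."""
    try:
        items = json.loads(json_str)
    except (TypeError, json.JSONDecodeError):
        # Missing or malformed input yields an empty list instead of an error.
        return []
    names = [item["name"].replace(" ", "") for item in items]
    return names[:limit] if limit is not None else names

# Hypothetical sample in the same shape as the 'genres' column.
sample_genres = '[{"id": 28, "name": "Action"}, {"id": 1, "name": "Science Fiction"}]'

print(names_from_json(sample_genres))           # ['Action', 'ScienceFiction']
print(names_from_json(sample_genres, limit=1))  # ['Action']
print(names_from_json(None))                    # []
```

Removing the spaces turns multi-word names into a single token, so that "Science Fiction" is later treated as one term rather than two unrelated words.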

We save the cleaned dataset.

In [10]:
data_limpia.to_csv("/workspaces/Carlos2607-a-KNN/data/processed/data_limpia.csv")
data_limpia

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In the 22nd century, a paraplegic Marine is d...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",JamesCameron
1,285,Pirates of the Caribbean: At World's End,"[Captain Barbossa, long believed to be dead, h...","[Adventure, Fantasy, Action]","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",GoreVerbinski
2,206647,Spectre,[A cryptic message from Bond’s past sends him ...,"[Action, Adventure, Crime]","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",SamMendes
3,49026,The Dark Knight Rises,[Following the death of District Attorney Harv...,"[Action, Crime, Drama, Thriller]","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",ChristopherNolan
4,49529,John Carter,"[John Carter is a war-weary, former military c...","[Action, Adventure, ScienceFiction]","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",AndrewStanton
...,...,...,...,...,...,...,...
4804,9367,El Mariachi,[El Mariachi just wants to play his guitar and...,"[Action, Crime, Thriller]","[unitedstates–mexicobarrier, legs, arms, paper...","[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",RobertRodriguez
4805,72766,Newlyweds,[A newlywed couple's honeymoon is upended by t...,"[Comedy, Romance]",[],"[EdwardBurns, KerryBishé, MarshaDietlein]",EdwardBurns
4806,231617,"Signed, Sealed, Delivered","[""Signed, Sealed, Delivered"" introduces a dedi...","[Comedy, Drama, Romance, TVMovie]","[date, loveatfirstsight, narration, investigat...","[EricMabius, KristinBooth, CrystalLowe]",ScottSmith
4807,126186,Shanghai Calling,[When ambitious New York attorney Sam is sent ...,[],[],"[DanielHenney, ElizaCoupe, BillPaxton]",DanielHsia


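We store the data in a SQL table so that we can work with it later. Note that the cell below writes the original `movies` DataFrame (hence the 4803 rows reported), not `data_limpia`, whose list-valued columns would first need to be serialized to text before SQLite can store them.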

In [11]:
conn = sqlite3.connect("../data/movies_database.db")

movies.to_sql("clean_movies_data", conn, if_exists = "replace", index = False)

4803
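The cell above writes `movies`, not the cleaned frame. One possible way to store a frame such as `data_limpia`, sketched here on a one-row toy frame, is to convert its list columns to JSON text first, since `sqlite3` cannot bind Python lists. The frame `clean_demo` and the in-memory database are assumptions for illustration.

```python
import json
import sqlite3
import pandas as pd

# Toy stand-in for the cleaned frame (hypothetical single row).
clean_demo = pd.DataFrame({
    "title": ["Avatar"],
    "genres": [["Action", "Adventure"]],
    "crew": ["JamesCameron"],
})

# Lists become JSON text; every other value passes through unchanged.
storable = clean_demo.copy()
for col in storable.columns:
    storable[col] = storable[col].apply(
        lambda v: json.dumps(v) if isinstance(v, list) else v)

conn_demo = sqlite3.connect(":memory:")
storable.to_sql("clean_movies_data", conn_demo, if_exists="replace", index=False)
restored = pd.read_sql_query("SELECT * FROM clean_movies_data", conn_demo)
conn_demo.close()

# Decode the JSON text back into a Python list after reading.
restored["genres"] = restored["genres"].apply(json.loads)
print(restored.loc[0, "genres"])  # ['Action', 'Adventure']
```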

With the following function we concatenate all the columns, separated by spaces, to build one complete text per row.

In [12]:
def list_to_string(lst):
    return ' '.join(lst) if lst else ''
# Apply the conversion to each column and join the results.
# Note: 'crew' is a plain string, so ' '.join(...) separates its characters with spaces.
data_limpia['tags'] = data_limpia.apply(lambda x: ' '.join(list_to_string(x[col]) for col in ['genres', 'keywords', 'cast', 'crew', 'overview']), axis=1)


In [13]:
data_limpia.iloc[0].tags

'Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver J a m e s C a m e r o n In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'
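In the output above the director appears as `J a m e s C a m e r o n`, because `crew` is a string and `' '.join` iterates over its characters. A variant that keeps plain strings whole could look like the sketch below; the name `to_text` and the sample `row` are hypothetical. The results shown in this notebook were produced with the original `list_to_string`, so switching to this variant may change the recommendations.

```python
def to_text(value):
    """Join a list of tokens with spaces; keep a plain string whole."""
    if not value:
        return ""
    if isinstance(value, str):
        return value
    return " ".join(value)

# Hypothetical row with the same mix of lists, a string and a missing value.
row = {"genres": ["Action"], "cast": ["SamWorthington"],
       "crew": "JamesCameron", "overview": None}
tags = " ".join(to_text(row[col]) for col in ["genres", "cast", "crew", "overview"])

print(tags.split())  # ['Action', 'SamWorthington', 'JamesCameron']
```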

We vectorize the `tags` text with TF-IDF so that the model can process it as numeric features.

In [14]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data_limpia["tags"])
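To see what the vectorizer keeps from a `tags` string, here is a small sketch on two made-up documents. With its default settings, `TfidfVectorizer` lowercases the text and only keeps tokens of two or more word characters, so a director name that has been split into single letters does not contribute any feature.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; the first mimics a director split into characters.
docs = [
    "Action SamWorthington J a m e s C a m e r o n",
    "Action ChristianBale ChristopherNolan",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)
vocab = sorted(vec.vocabulary_)

print(vocab)         # ['action', 'christianbale', 'christophernolan', 'samworthington']
print(matrix.shape)  # (2, 4)
```

Each row of the resulting sparse matrix is one movie and each column is one term of the vocabulary.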

In [15]:
model = NearestNeighbors(n_neighbors = 6, algorithm = "brute", metric = "cosine")
model.fit(tfidf_matrix)
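The model above uses cosine distance, which is one minus the cosine similarity. The sketch below checks that relation on three made-up documents, using `cosine_similarity` (imported at the top of the notebook but not used elsewhere). The names `docs`, `X` and `nn` are assumptions for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Hypothetical documents: the first two share most of their terms.
docs = ["batman gotham crime", "batman gotham joker", "romance paris comedy"]
X = TfidfVectorizer().fit_transform(docs)

# Same configuration as the notebook's model, with a smaller neighborhood.
nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine").fit(X)
distances, indices = nn.kneighbors(X[0])

# Cosine similarity between the first document and every document.
sims = cosine_similarity(X[0], X)

# The nearest neighbor of a document is the document itself (distance ~0).
print(indices[0])
print(np.allclose(distances[0], 1 - sims[0, indices[0]]))
```

A smaller distance therefore means a more similar movie, and the query movie itself always comes back first, which is why the recommendation function discards the first result.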

In [16]:
def get_movie_recommendations(movie_title):
    # Find the index of the movie in the DataFrame
    movie_index = data_limpia[data_limpia["title"] == movie_title].index[0]
    # Use the k-Nearest Neighbors (k-NN) model to find similar movies
    distances, indices = model.kneighbors(tfidf_matrix[movie_index])
    # Build a list of tuples holding each similar movie and its distance
    similar_movies = [(data_limpia["title"][i], distances[0][j]) for j, i in enumerate(indices[0])]
    # Return the similar movies, excluding the first one (the movie itself)
    return similar_movies[1:]

input_movie = "The Dark Knight Rises"
recommendations = get_movie_recommendations(input_movie)
print("Film recommendations '{}'".format(input_movie))
for movie, distance in recommendations:
    print("- Film: {}".format(movie))

Film recommendations 'The Dark Knight Rises'
- Film: The Dark Knight
- Film: Batman Returns
- Film: Batman Forever
- Film: Batman Begins
- Film: Batman
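The function above raises an `IndexError` when the title is not in the dataset. A more defensive variant could look like the following sketch, shown on a toy dataset; `recommend`, `toy`, `toy_matrix` and `toy_model` are hypothetical names, and this is not the notebook's own function.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy data (hypothetical titles and tags).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tags": ["batman gotham crime", "batman gotham joker",
             "romance paris comedy", "batman crime"],
})
toy_matrix = TfidfVectorizer().fit_transform(toy["tags"])
toy_model = NearestNeighbors(n_neighbors=3, algorithm="brute", metric="cosine").fit(toy_matrix)

def recommend(title, data, matrix, knn):
    """Return (title, distance) pairs for the neighbors of `title`, or [] if unknown."""
    # Positional lookup, so it does not depend on the DataFrame's index labels.
    positions = np.flatnonzero((data["title"] == title).to_numpy())
    if len(positions) == 0:
        return []
    pos = int(positions[0])
    distances, indices = knn.kneighbors(matrix[pos])
    # Skip the query movie by position instead of assuming it is the first result.
    return [(data["title"].iloc[i], float(d))
            for i, d in zip(indices[0], distances[0]) if i != pos]

for movie, distance in recommend("A", toy, toy_matrix, toy_model):
    print("- Film: {} (distance {:.3f})".format(movie, distance))
print(recommend("Unknown title", toy, toy_matrix, toy_model))  # []
```

Printing the distance next to each title also makes it easier to judge how close each recommendation is.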


## Based on this example, the model is able to recommend movies similar to the one given.
