## Preparing the Data for the Recommendation Model

To avoid running out of memory when the app is deployed on Render, the dataset is preprocessed ahead of time so that it is ready for the model.

This includes dropping the columns that the model prototype will not use and tokenizing the data in advance.

In [154]:
import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# download the tokenizer models and the stopword lists used below
nltk.download('punkt')
nltk.download('stopwords')



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\migue\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\migue\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [155]:
# load a dataset that was cleaned in a previous step
df = pd.read_csv("../Data/limpio2.csv")
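
Since memory is the concern, the column selection could in principle happen at load time instead of after the full file is read. The sketch below is only an illustration: it uses a hypothetical two-row miniature of the file (the rows and the reduced column list are assumptions) and compares the footprint of both loads with `memory_usage(deep=True)`.

```python
import io

import pandas as pd

# Hypothetical miniature of limpio2.csv: two rows, a subset of its columns.
csv_text = (
    "id,overview,popularity,title,director_name\n"
    "862,Led by Woody,21.946943,Toy Story,John Lasseter\n"
    "8844,When siblings Judy and Peter,17.015539,Jumanji,Joe Johnston\n"
)

# Columns to keep (listed in the same order as in the file).
wanted = ["overview", "title", "director_name"]

full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# deep=True also counts the bytes held by the string values themselves.
full_bytes = int(full.memory_usage(deep=True).sum())
slim_bytes = int(slim.memory_usage(deep=True).sum())
print(f"all columns: {full_bytes} bytes, selected columns: {slim_bytes} bytes")
```

On the real file the saving depends on how large the discarded columns are, so it is worth measuring rather than assuming.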

In [156]:
df.head(3)

Unnamed: 0,id,overview,popularity,title,vote_average,genres_names,production_companies_names,production_countries_iso,spoken_languages_iso,belongs_to_collection_id,cast_names,director_name
0,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,Toy Story,7.7,"['Animation', 'Comedy', 'Family']",['Pixar Animation Studios'],['US'],['en'],10194.0,"['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...",John Lasseter
1,8844,When siblings Judy and Peter discover an encha...,17.015539,Jumanji,6.9,"['Adventure', 'Fantasy', 'Family']","['TriStar Pictures', 'Teitler Film', 'Intersco...",['US'],"['en', 'fr']",,"['Robin Williams', 'Jonathan Hyde', 'Kirsten D...",Joe Johnston
2,15602,A family wedding reignites the ancient feud be...,11.7129,Grumpier Old Men,6.5,"['Romance', 'Comedy']","['Warner Bros.', 'Lancaster Gate']",['US'],['en'],119050.0,"['Walter Matthau', 'Jack Lemmon', 'Ann-Margret...",Howard Deutch


In [157]:
# select the columns that will feed the model
columnas_ml = ['overview', 'title', 'director_name', 'genres_names', 'production_companies_names']
# .copy() returns an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df = df[columnas_ml].copy()
df

## Tokenizing the relevant columns

In [159]:
# build a set with the English stopwords
stop_words = set(stopwords.words('english'))

# function that normalizes and tokenizes a text value
def preprocess(text):
    # missing values become an empty string
    if pd.isna(text):
        return ""
    text = text.lower()
    words = word_tokenize(text)
    # keep only alphanumeric tokens that are not stopwords
    words = [word for word in words if word.isalnum() and word not in stop_words]
    return ' '.join(words)
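
The `genres_names` and `production_companies_names` cells hold the text of a Python list, for example `"['Animation', 'Comedy', 'Family']"`. When such a string goes straight through `preprocess`, the quote characters can stay attached to a token, which then fails `isalnum()`. That would be consistent with the tokenized output in this notebook, where the genres column comes out empty and `['Pixar Animation Studios']` becomes `animation studios`. A minimal sketch of one possible workaround, assuming the cells really are list literals, is to parse them with `ast.literal_eval` before tokenizing. The helper name `flatten_list_cell` is hypothetical and not part of the notebook.

```python
import ast


def flatten_list_cell(cell):
    """Turn "['A B', 'C']" into "A B C"; return other strings unchanged."""
    if not isinstance(cell, str):
        return ""  # NaN or None
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return cell  # plain text such as a title or a director name
    if isinstance(parsed, (list, tuple)):
        return " ".join(str(item) for item in parsed)
    return cell


print(flatten_list_cell("['Pixar Animation Studios']"))
print(flatten_list_cell("['Animation', 'Comedy', 'Family']"))
```

One way to use it would be `df[column].apply(flatten_list_cell).apply(preprocess)`, so that plain-text columns pass through unchanged and list-like columns lose their brackets and quotes first.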

In [160]:
# apply the tokenization to each selected column
for column in columnas_ml:
    df[f'tokenizada_{column}'] = df[column].apply(preprocess)

In [161]:
# drop the original columns; the raw 'title' is kept alongside the tokenized ones
columnas_tokenizadas = ['title', 'tokenizada_overview', 'tokenizada_title', 'tokenizada_director_name', 'tokenizada_genres_names', 'tokenizada_production_companies_names']
df = df[columnas_tokenizadas]
df

Unnamed: 0,title,tokenizada_overview,tokenizada_title,tokenizada_director_name,tokenizada_genres_names,tokenizada_production_companies_names
0,Toy Story,led woody andy toys live happily room andy bir...,toy story,john lasseter,,animation studios
1,Jumanji,siblings judy peter discover enchanted board g...,jumanji,joe johnston,,pictures film communications
2,Grumpier Old Men,family wedding reignites ancient feud neighbor...,grumpier old men,howard deutch,,gate
3,Waiting to Exhale,cheated mistreated stepped women holding breat...,waiting exhale,forest whitaker,,century fox film corporation
4,Father of the Bride Part II,george banks recovered daughter wedding receiv...,father bride part ii,charles shyer,,productions pictures
...,...,...,...,...,...,...
43267,Robin Hood,yet another version classic epic enough variat...,robin hood,john irvin,,rundfunk wdr title films century fox televisio...
43268,Century of Birthing,artist struggles finish work storyline cult pl...,century birthing,lav diaz,,olivia
43269,Betrayal,one hits goes wrong professional assassin ends...,betrayal,mark lester,,world pictures
43270,Satan Triumphant,small town live two brothers one minister one ...,satan triumphant,yakov protazanov,,


## Keeping only the columns that will feed the model
The data is now tokenized and ready for the model prototype.

In [162]:
# For now, leave out the genres_names column (its tokenized version is empty in the output above)
df = df.drop(columns=['tokenizada_genres_names'])
df


Unnamed: 0,title,tokenizada_overview,tokenizada_title,tokenizada_director_name,tokenizada_production_companies_names
0,Toy Story,led woody andy toys live happily room andy bir...,toy story,john lasseter,animation studios
1,Jumanji,siblings judy peter discover enchanted board g...,jumanji,joe johnston,pictures film communications
2,Grumpier Old Men,family wedding reignites ancient feud neighbor...,grumpier old men,howard deutch,gate
3,Waiting to Exhale,cheated mistreated stepped women holding breat...,waiting exhale,forest whitaker,century fox film corporation
4,Father of the Bride Part II,george banks recovered daughter wedding receiv...,father bride part ii,charles shyer,productions pictures
...,...,...,...,...,...
43267,Robin Hood,yet another version classic epic enough variat...,robin hood,john irvin,rundfunk wdr title films century fox televisio...
43268,Century of Birthing,artist struggles finish work storyline cult pl...,century birthing,lav diaz,olivia
43269,Betrayal,one hits goes wrong professional assassin ends...,betrayal,mark lester,world pictures
43270,Satan Triumphant,small town live two brothers one minister one ...,satan triumphant,yakov protazanov,


### This model prototype uses only the 'overview', 'director_name', 'title' and 'production_companies_names' columns.
'genres_names', 'belongs_to_collection' and other columns are expected to be included in future deployments.
The tokenized data is saved to the file data_modelo.csv.

In [163]:
# save the tokenized dataset used to train the model (uncomment to write the file)
# df.to_csv('data_modelo.csv', index=False)
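
One detail to keep in mind when the saved file is read back: some tokenized cells are empty strings (the last row shown above has no production company), and `pd.read_csv` turns empty fields into `NaN` by default. The sketch below shows this round trip on a hypothetical two-row frame, using an in-memory buffer instead of a real file; `keep_default_na=False` is one way to get the empty strings back.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the tokenized dataset.
col = "tokenizada_production_companies_names"
df_demo = pd.DataFrame({
    "title": ["Toy Story", "Satan Triumphant"],
    col: ["animation studios", ""],
})

# Write to an in-memory buffer rather than to data_modelo.csv.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
default_read = pd.read_csv(io.StringIO(buffer.getvalue()))
# keep_default_na=False keeps the empty field as an empty string.
string_read = pd.read_csv(io.StringIO(buffer.getvalue()), keep_default_na=False)

print(default_read[col].isna().tolist())
print(string_read[col].tolist())
```

Calling `fillna("")` after a default read would be an alternative with a similar effect on text columns.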