# Proyecto individual: Sistema de recomendación de películas
​
Este proyecto constará de dos fases: `Ingenieria de datos`, `Modelamiento y evaluación con machine learning`.

### 1. Ingeniería de datos
* Esto incluye la limpieza y transformación de los datos, abordando problemas como:
    * valores faltantes, 
    * datos duplicados y variables irrelevantes,
    * valores anidados,
    * formateo de columnas,
    * nubes de palabras para ver las más frecuentes.
    * a fin de mejorar la calidad del dataset para el modelado.
    * análisis univariado.
    * análisis bivariado y multivariado.
​
### 2. Modelamiento y evaluación con machine learning
* Implementar un modelo de clasificación con aprendizaje supervisado que permita clasificar (**con un algoritmo de coseno de similitud, por ejemplo**) las películas por ... para encontrar una lista de 5 películas similares 

### 1. Transformaciones

Se realizará una transformación de datos para que corra el sistema de recomendación de películas (último endpoints):
* Las variables a considerar serán:
    * belongs_to_collection (name_collection)
    * budget
    * revenue
    * genres (name_genre)
    * vote_average
    * production_companies (name_company)
    * release_date (release_year)
    * keywords (se verá si hay tiempo)

#### 1.1. Importación de librerías

In [2]:
import pandas as pd 
from pandas import json_normalize
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import datetime
import re
import json
import math

# Mostrar figuras de matplotlib en el entorno de Jupyter Notebook
%matplotlib inline

#### 1.2. Carga y visualización los datos

In [3]:
df_movies = pd.read_csv('datasets/movies.csv')
df_movies.head(2)

  df_movies = pd.read_csv('datasets/movies.csv')


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


#### 1.3. Eliminación de filas

1.3.1. Primero se van a eliminar las filas donde haya valores nulos en las columnas status (no fueron lanzadas todavía) y release_date (no tienen fecha y no sirve para los endpoints).

In [4]:
# Se filtran las filas con valores notna en las columnas 'status', 'release_date' 
df_movies_1 = df_movies[df_movies['status'].notna()] 
df_movies_1 = df_movies[df_movies['release_date'].notna()] 

1.3.2. El propósito es reducir el dataset y enfocar los datos, para ello se elige trabajar solo con las películas en inglés, las cuales representan el 70 % del total, dado que el cine en inglés es el más visualizado.

In [5]:
total_rows = len(df_movies)

count_en = (df_movies['original_language'] == 'en').sum()

prop_en = int(count_en / total_rows * 100)

print(f"La proporción del idioma inglés es del: {prop_en} %")

La proporción del idioma inglés es del: 70 %


In [6]:
# Se eliminan las películas que su idioma original no es en inglés (habria que analizar que porcentaje representan)
df_movies_1 = df_movies[df_movies['original_language'] == 'en']

1.3.3. Seguidamente, se elige trabajar con las películas mayores al año 1990, las cuales representan más del 75 % de los datos.

Este punto se resolverá luego de crear la columna release_year

#### 1.4. Eliminación de columnas 

In [7]:
df_movies_2 = df_movies_1.drop(['adult', 'poster_path', 'status', 'homepage', 'imdb_id', 'production_countries', 'original_language', 'runtime', 'spoken_languages', 'tagline', 'original_title', 'video'], axis=1)

In [8]:
df_movies_2.head(2)

Unnamed: 0,belongs_to_collection,budget,genres,id,overview,popularity,production_companies,release_date,revenue,title,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",1995-10-30,373554033.0,Toy Story,7.7,5415.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",1995-12-15,262797249.0,Jumanji,6.9,2413.0


#### 1.5. Distinción entre day, month y year en la columna release_date

In [9]:
# Convertir la columna 'date' a datetime
df_movies_2['release_date'] = pd.to_datetime(df_movies_2['release_date'], errors='coerce')

# Verificar el formato de la columna 'date' después de la conversión
print(df_movies_2['release_date'].head())

0   1995-10-30
1   1995-12-15
2   1995-12-22
3   1995-12-22
4   1995-02-10
Name: release_date, dtype: datetime64[ns]


In [10]:
# Crear nuevas columnas para día, mes y año
df_movies_2['day'] = df_movies_2['release_date'].dt.day
df_movies_2['month'] = df_movies_2['release_date'].dt.month
df_movies_2['release_year'] = df_movies_2['release_date'].dt.year

In [11]:
df_movies_2.head(2)

Unnamed: 0,belongs_to_collection,budget,genres,id,overview,popularity,production_companies,release_date,revenue,title,vote_average,vote_count,day,month,release_year
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0


In [12]:
df_movies_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32269 entries, 0 to 45465
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  3113 non-null   object        
 1   budget                 32269 non-null  object        
 2   genres                 32269 non-null  object        
 3   id                     32269 non-null  object        
 4   overview               32200 non-null  object        
 5   popularity             32267 non-null  object        
 6   production_companies   32267 non-null  object        
 7   release_date           32202 non-null  datetime64[ns]
 8   revenue                32267 non-null  float64       
 9   title                  32267 non-null  object        
 10  vote_average           32267 non-null  float64       
 11  vote_count             32267 non-null  float64       
 12  day                    32202 non-null  float64       
 13  month 

In [13]:
df_movies_2.describe()

Unnamed: 0,release_date,revenue,vote_average,vote_count,day,month,release_year
count,32202,32267.0,32267.0,32267.0,32202.0,32202.0,32202.0
mean,1991-08-16 23:18:38.166573440,15171920.0,5.491171,141.56643,14.116111,6.44873,1991.135613
min,1878-06-14 00:00:00,0.0,0.0,0.0,1.0,1.0,1878.0
25%,1978-05-03 12:00:00,0.0,5.0,3.0,6.0,3.0,1978.0
50%,2000-12-27 00:00:00,0.0,5.9,10.0,14.0,7.0,2000.0
75%,2010-10-08 00:00:00,0.0,6.7,43.0,22.0,10.0,2010.0
max,2020-12-16 00:00:00,2787965000.0,10.0,14075.0,31.0,12.0,2020.0
std,,75578050.0,1.941068,574.58508,9.262318,3.598641,24.711462


#### 1.6. Filtrado de películas mayores a 1980

Las cuales representan cerca del 75 % de los datos según lo muestra el .describe

In [14]:
df_movies_2 = df_movies_2[df_movies_2['release_year'] >= 1980]

In [15]:
df_movies_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23802 entries, 0 to 45465
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   belongs_to_collection  2518 non-null   object        
 1   budget                 23802 non-null  object        
 2   genres                 23802 non-null  object        
 3   id                     23802 non-null  object        
 4   overview               23751 non-null  object        
 5   popularity             23802 non-null  object        
 6   production_companies   23802 non-null  object        
 7   release_date           23802 non-null  datetime64[ns]
 8   revenue                23802 non-null  float64       
 9   title                  23802 non-null  object        
 10  vote_average           23802 non-null  float64       
 11  vote_count             23802 non-null  float64       
 12  day                    23802 non-null  float64       
 13  month 

#### 1.7. Desanidado de la columna belongs_to_collection

In [16]:
df_collection = df_movies_2['belongs_to_collection']

df_collection.head(2)

0    {'id': 10194, 'name': 'Toy Story Collection', ...
1                                                  NaN
Name: belongs_to_collection, dtype: object

In [17]:
# Se convierte la columna belongs_to_collection a lista
df_collection_list = df_collection.tolist()
df_collection_list

["{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",
 nan,
 "{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}",
 nan,
 "{'id': 96871, 'name': 'Father of the Bride Collection', 'poster_path': '/nts4iOmNnq7GNicycMJ9pSAn204.jpg', 'backdrop_path': '/7qwE57OVZmMJChBpLEbJEmzUydk.jpg'}",
 nan,
 nan,
 nan,
 nan,
 "{'id': 645, 'name': 'James Bond Collection', 'poster_path': '/HORpg5CSkmeQlAolx3bKMrKgfi.jpg', 'backdrop_path': '/6VcVl48kNKvdXOZfJPdarlUGOsk.jpg'}",
 nan,
 nan,
 "{'id': 117693, 'name': 'Balto Collection', 'poster_path': '/w0ZgH6Lgxt2bQYnf1ss74UvYftm.jpg', 'backdrop_path': '/9VM5LiJV0bGb1st1KyHA3cVnO2G.jpg'}",
 nan,
 nan,
 nan,
 nan,
 nan,
 "{'id': 3167, 'name': 'Ace Ventura Collection', 'poster_path': '/qCxH543pScFed1CycwJ1nVgrkOc.jpg', 'backdrop_path': '/bswWgdDsLu0fhWMYUzLF8X

In [18]:
# Lista de ejemplo con elementos en formato de cadena
df_collection_list
# Extraer nombres, ignorando valores NaN y elementos no válidos
nombres = []
for item in df_collection_list:
    if isinstance(item, str):
        # Convertir comillas simples a dobles y None a null para hacerlo compatible con JSON
        item_json_format = item.replace("'", '"').replace("None", "null")
        try:
            coleccion = json.loads(item_json_format)  # Convertir de JSON string a diccionario
            nombre = coleccion.get("name")
            print(f"Nombre encontrado: {nombre}")  # Depuración
            nombres.append(nombre)  # Agregar el nombre a la lista
        except json.JSONDecodeError:
            print(f"Error de JSON en el elemento: {item}")  # Depuración
            continue
    elif isinstance(item, float) and math.isnan(item):
        # Ignorar elementos NaN
        continue

# Imprimir la lista de nombres final
print("Lista de nombres extraídos:", nombres)

Nombre encontrado: Toy Story Collection
Nombre encontrado: Grumpy Old Men Collection
Nombre encontrado: Father of the Bride Collection
Nombre encontrado: James Bond Collection
Nombre encontrado: Balto Collection
Nombre encontrado: Ace Ventura Collection
Nombre encontrado: Chili Palmer Collection
Nombre encontrado: Babe Collection
Nombre encontrado: Mortal Kombat Collection
Nombre encontrado: Pocahontas Collection
Nombre encontrado: The Lawnmower Man Collection
Nombre encontrado: Friday Collection
Nombre encontrado: From Dusk Till Dawn Collection
Nombre encontrado: Screamers Collection
Nombre encontrado: The Muppet Collection
Nombre encontrado: The Neverending Story Collection
Nombre encontrado: Bad Boys Collection
Nombre encontrado: Batman Collection
Nombre encontrado: Brooklyn Cigar Store Collection
Nombre encontrado: Casper Collection
Nombre encontrado: Mexico Trilogy
Nombre encontrado: Die Hard Collection
Nombre encontrado: Teenage Apocalypse Trilogy
Nombre encontrado: Free Willy Co

In [19]:
# Crear el DataFrame con una columna llamada "name"
df_collection_name = pd.DataFrame(nombres, columns=["collection_name"])
df_collection_name.head(3)

Unnamed: 0,collection_name
0,Toy Story Collection
1,Grumpy Old Men Collection
2,Father of the Bride Collection


In [20]:
df_movies_3 = pd.concat([df_movies_2, df_collection_name], axis=1)
df_movies_3.head(3)

Unnamed: 0,belongs_to_collection,budget,genres,id,overview,popularity,production_companies,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,A family wedding reignites the ancient feud be...,11.7129,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",1995-12-22,0.0,Grumpier Old Men,6.5,92.0,22.0,12.0,1995.0,Father of the Bride Collection


In [21]:
df_movies_4 = df_movies_3.drop('belongs_to_collection', axis=1)
df_movies_4.head(2)

Unnamed: 0,budget,genres,id,overview,popularity,production_companies,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection


#### 1.8. Desanidado de la columna production_companies

In [22]:
# Crear una lista
df_production = df_movies_4['production_companies'].tolist()
df_production

["[{'name': 'Pixar Animation Studios', 'id': 3}]",
 "[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]",
 "[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]",
 "[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]",
 "[{'name': 'Sandollar Productions', 'id': 5842}, {'name': 'Touchstone Pictures', 'id': 9195}]",
 "[{'name': 'Regency Enterprises', 'id': 508}, {'name': 'Forward Pass', 'id': 675}, {'name': 'Warner Bros.', 'id': 6194}]",
 "[{'name': 'Paramount Pictures', 'id': 4}, {'name': 'Scott Rudin Productions', 'id': 258}, {'name': 'Mirage Enterprises', 'id': 932}, {'name': 'Sandollar Productions', 'id': 5842}, {'name': 'Constellation Entertainment', 'id': 14941}, {'name': 'Worldwide', 'id': 55873}, {'name': 'Mont Blanc Entertainment GmbH', 'id': 58079}]",
 "[{'name': 'Walt Disney Pictures', 'id': 2}]",
 "[{'name': 'Universal Pictures', 'id': 33}, {'name': 'Imperia

In [23]:
# Crear un DataFrame
df_production = pd.DataFrame(data=df_production, columns=['production_companies'])
df_production

Unnamed: 0,production_companies
0,"[{'name': 'Pixar Animation Studios', 'id': 3}]"
1,"[{'name': 'TriStar Pictures', 'id': 559}, {'na..."
2,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'..."
3,[{'name': 'Twentieth Century Fox Film Corporat...
4,"[{'name': 'Sandollar Productions', 'id': 5842}..."
...,...
24408,
24409,
24410,
24411,


In [24]:
import ast

# Función para extraer los nombres de las compañías considerando valores NaN
def extract_company_names(row):
    if pd.isna(row):  # Verificar si la fila es NaN
        return []  # Devolver una lista vacía en caso de NaN
    companies_list = ast.literal_eval(row)  # Convertir la cadena en lista de diccionarios
    return [company['name'] for company in companies_list]  # Extraer solo los nombres

# Crear la columna 'studios_name' con listas de nombres de compañías
df_production['studios_name'] = df_production['production_companies'].apply(extract_company_names)

In [25]:
# Crear el nuevo DataFrame con solo la columna 'studios_name'
df_studios_name = df_production[['studios_name']]

df_studios_name.head(3)

Unnamed: 0,studios_name
0,[Pixar Animation Studios]
1,"[TriStar Pictures, Teitler Film, Interscope Co..."
2,"[Warner Bros., Lancaster Gate]"


In [26]:
df_movies_5 = pd.concat([df_movies_4, df_studios_name], axis=1)
df_movies_5.head(2)

Unnamed: 0,budget,genres,id,overview,popularity,production_companies,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name,studios_name
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,"Led by Woody, Andy's toys live happily in his ...",21.946943,"[{'name': 'Pixar Animation Studios', 'id': 3}]",1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection,[Pixar Animation Studios]
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,17.015539,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection,"[TriStar Pictures, Teitler Film, Interscope Co..."


In [27]:
df_movies_6 = df_movies_5.drop('production_companies', axis=1)
df_movies_6.head(2)

Unnamed: 0,budget,genres,id,overview,popularity,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name,studios_name
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection,[Pixar Animation Studios]
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection,"[TriStar Pictures, Teitler Film, Interscope Co..."


#### 1.9. Desanidado de la columna genres

In [28]:
# Crear una lista
df_genres = df_movies_6['genres'].tolist()
df_genres

["[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
 "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",
 "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",
 "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",
 "[{'id': 35, 'name': 'Comedy'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10

In [29]:
# Crear un DataFrame
df_genres = pd.DataFrame(data=df_genres, columns=['genres_name'])
df_genres

Unnamed: 0,genres_name
0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."
3,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam..."
4,"[{'id': 35, 'name': 'Comedy'}]"
...,...
34632,
34633,
34634,
34635,


In [30]:
import ast

# Función para extraer los nombres de las compañías considerando valores NaN
def extract_genres_names(row):
    if pd.isna(row):  # Verificar si la fila es NaN
        return []  # Devolver una lista vacía en caso de NaN
    genres_list = ast.literal_eval(row)  # Convertir la cadena en lista de diccionarios
    return [genre['name'] for genre in genres_list]  # Extraer solo los nombres

# Crear la columna 'genres_name' con listas de nombres de géneros
df_genres['genres_name'] = df_genres['genres_name'].apply(extract_genres_names)

In [31]:
# Crear el nuevo DataFrame con solo la columna 'genres_name'
df_genres_name = df_genres[['genres_name']]

df_genres_name.head(3)

Unnamed: 0,genres_name
0,"[Animation, Comedy, Family]"
1,"[Adventure, Fantasy, Family]"
2,"[Romance, Comedy]"


In [32]:
df_movies_7 = pd.concat([df_movies_6, df_genres_name], axis=1)
df_movies_7.head(2)

Unnamed: 0,budget,genres,id,overview,popularity,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name,studios_name,genres_name
0,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection,[Pixar Animation Studios],"[Animation, Comedy, Family]"
1,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection,"[TriStar Pictures, Teitler Film, Interscope Co...","[Adventure, Fantasy, Family]"


In [33]:
df_movies_8 = df_movies_7.drop('genres', axis=1)
df_movies_8.head(2)

Unnamed: 0,budget,id,overview,popularity,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name,studios_name,genres_name
0,30000000,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection,[Pixar Animation Studios],"[Animation, Comedy, Family]"
1,65000000,8844,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection,"[TriStar Pictures, Teitler Film, Interscope Co...","[Adventure, Fantasy, Family]"


#### 1.10. Transformación de valores nulos

1.10.1. Primero se encuentra si existen valores nulos o vacíos

In [34]:
# Encontrar valores nulos
valores_nulos = df_movies_8[['budget', 'revenue']].isnull()
print("Valores nulos en el DataFrame:\n", valores_nulos)

Valores nulos en el DataFrame:
        budget  revenue
0       False    False
1       False    False
2       False    False
3       False    False
4       False    False
...       ...      ...
34629    True     True
34630    True     True
34632    True     True
34635    True     True
34636    True     True

[39901 rows x 2 columns]


In [35]:
nulos_budget = df_movies_8['budget'].isnull().sum()
nulos_revenue = df_movies_8['revenue'].isnull().sum()

print(f"Valores nulos en 'budget': {nulos_budget}")
print(f"Valores nulos en 'revenue': {nulos_revenue}")

Valores nulos en 'budget': 16099
Valores nulos en 'revenue': 16099


1.10.2. Luego se reemplaza por 0 y se verifica que no haya nulos

In [36]:
df_movies_8[['budget', 'revenue']] = df_movies_8[['budget', 'revenue']].fillna(0)

In [37]:
nulos_budget = df_movies_8['budget'].isnull().sum()
nulos_revenue = df_movies_8['revenue'].isnull().sum()

print(f"Valores nulos en 'budget': {nulos_budget}")
print(f"Valores nulos en 'revenue': {nulos_revenue}")

Valores nulos en 'budget': 0
Valores nulos en 'revenue': 0


In [38]:
# Convertir las columnas 'budget' y 'revenue' a valores numéricos
df_movies_8['budget'] = pd.to_numeric(df_movies_8['budget'], errors='coerce')
df_movies_8['revenue'] = pd.to_numeric(df_movies_8['revenue'], errors='coerce')

def apply_return(row):

    if row['budget'] == 0:
        return 0
    else:
        return row['revenue'] / row['budget']

In [39]:
df_movies_8['return'] = df_movies_8.apply(apply_return, axis=1)

In [40]:
df_movies_8.head(2)

Unnamed: 0,budget,id,overview,popularity,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name,studios_name,genres_name,return
0,30000000,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection,[Pixar Animation Studios],"[Animation, Comedy, Family]",12.451801
1,65000000,8844,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection,"[TriStar Pictures, Teitler Film, Interscope Co...","[Adventure, Fantasy, Family]",4.043035


#### 1.11. Cambio en los tipos de dato a float

Se cambian los tipos de datos a integer en las columnas: [revenue, vote_count, day, month, release_year] para visualizarlos más correctamente.

In [41]:
df_movies_8[['vote_count', 'day', 'month', 'release_year']] = df_movies_8[['vote_count', 'day', 'month', 'release_year']].fillna(0)

In [42]:
df_movies_8['popularity'] = df_movies_8['popularity'].astype(float)
df_movies_8['vote_average'] = df_movies_8['vote_average'].astype(float)
df_movies_8['revenue'] = df_movies_8['revenue'].astype(float)
df_movies_8['vote_count'] = df_movies_8['vote_count'].astype(float)
df_movies_8['day'] = df_movies_8['day'].astype(float)
df_movies_8['month'] = df_movies_8['month'].astype(float)
df_movies_8['release_year'] = df_movies_8['release_year'].astype(float)
df_movies_8['vote_average'] = df_movies_8['vote_average'].astype(float)

In [43]:
df_movies_8.head(2)

Unnamed: 0,budget,id,overview,popularity,release_date,revenue,title,vote_average,vote_count,day,month,release_year,collection_name,studios_name,genres_name,return
0,30000000,862,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,Toy Story,7.7,5415.0,30.0,10.0,1995.0,Toy Story Collection,[Pixar Animation Studios],"[Animation, Comedy, Family]",12.451801
1,65000000,8844,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,Jumanji,6.9,2413.0,15.0,12.0,1995.0,Grumpy Old Men Collection,"[TriStar Pictures, Teitler Film, Interscope Co...","[Adventure, Fantasy, Family]",4.043035


#### 1.12. Eliminación de columnas para el modelo de sistema de recomendación

* Al inicio de la transformación de habia expresado que variables se iban a considerar, por tanto, se eliminan las siguientes:
    * release_date
    * day
    * month 

In [44]:
df_movies_9 = df_movies_8.drop(['release_date', 'day', 'month', 'id'], axis=1)

In [45]:
data_movies_ml = df_movies_9.to_csv('dataset/data_movies_ml.csv', index=False)