# **P0. Structured data collection. Movies**

Data sources:
- Movies: https://media.githubusercontent.com/media/melodiromero/movies/refs/heads/main/dataset/movies_limpio.csv

## **Movies**

In [1]:
import pandas as pd
import json
import ast

In [2]:
url = "https://media.githubusercontent.com/media/melodiromero/movies/refs/heads/main/dataset/movies_limpio.csv"
df = pd.read_csv(url)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45376 entries, 0 to 45375
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   budget             45376 non-null  float64
 1   genres             45376 non-null  object 
 2   id                 45376 non-null  int64  
 3   original_language  45365 non-null  object 
 4   overview           44435 non-null  object 
 5   popularity         45376 non-null  float64
 6   release_date       45376 non-null  object 
 7   revenue            45376 non-null  float64
 8   runtime            45130 non-null  float64
 9   spoken_languages   45376 non-null  object 
 10  status             45296 non-null  object 
 11  tagline            20398 non-null  object 
 12  title              45376 non-null  object 
 13  vote_average       45376 non-null  float64
 14  vote_count         45376 non-null  float64
 15  franquicia         4488 non-null   object 
 16  productoras        330

### Preprocessing
The most relevant attributes of the DataFrame are selected. In addition, attribute types are converted so that the data can be inserted into a PostgreSQL table.
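
As a rough illustration of the type conversion described above, the sketch below normalizes a single record before insertion. The helper `prepare_row` and the sample record are hypothetical and are not part of the notebook, which passes the DataFrame values to psycopg2 as they are.

```python
import math
from datetime import date, datetime


def prepare_row(record):
    """Convert a raw record into values that map cleanly to SQL column types."""
    cleaned = {}
    for key, value in record.items():
        # pandas represents missing values as float NaN; SQL expects NULL (None).
        if isinstance(value, float) and math.isnan(value):
            cleaned[key] = None
        else:
            cleaned[key] = value
    # release_date is read as a 'YYYY-MM-DD' string; the target column is DATE.
    if cleaned.get("release_date") is not None:
        cleaned["release_date"] = datetime.strptime(
            cleaned["release_date"], "%Y-%m-%d"
        ).date()
    # vote_count is read as float64 but stored in an INT column.
    if cleaned.get("vote_count") is not None:
        cleaned["vote_count"] = int(cleaned["vote_count"])
    return cleaned


# Hypothetical record with one missing value (runtime).
sample = {
    "id": 1,
    "release_date": "1995-10-30",
    "runtime": float("nan"),
    "vote_count": 5415.0,
}
row = prepare_row(sample)
print(row)
```

Converting NaN to `None` matters mostly for text columns such as `overview`, where a NaN would otherwise not be stored as a SQL NULL.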

In [4]:
df.head(5)

Unnamed: 0,budget,genres,id,original_language,overview,popularity,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,franquicia,productoras,paises,release_year,retorno
0,30000000.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,7.7,5415.0,Toy Story Collection,Pixar Animation Studios,United States of America,1995,12.451801
1,65000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",8844,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413.0,,"TriStar Pictures, Teitler Film, Interscope Com...",United States of America,1995,4.043035
2,0.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",15602,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92.0,Grumpy Old Men Collection,"Warner Bros., Lancaster Gate",United States of America,1995,0.0
3,16000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34.0,,Twentieth Century Fox Film Corporation,United States of America,1995,5.09076
4,0.0,"[{'id': 35, 'name': 'Comedy'}]",11862,en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173.0,Father of the Bride Collection,"Sandollar Productions, Touchstone Pictures",United States of America,1995,0.0


## Preprocessing

In [5]:
# Null values per attribute
df.isnull().sum()

budget                   0
genres                   0
id                       0
original_language       11
overview               941
popularity               0
release_date             0
revenue                  0
runtime                246
spoken_languages         0
status                  80
tagline              24978
title                    0
vote_average             0
vote_count               0
franquicia           40888
productoras          12280
paises                6216
release_year             0
retorno                  0
dtype: int64

In [6]:
df.drop(['tagline', 'paises'], axis = 1, inplace = True)

In [7]:
df['genres'] = df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
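
The `genres` column stores each list as a string written with Python literal syntax (single quotes), which is why the cell above uses `ast.literal_eval` rather than `json.loads`. A minimal sketch, using the `genres` value of the last row displayed by `df.head(5)`:

```python
import ast
import json

raw = "[{'id': 35, 'name': 'Comedy'}]"  # genres value as stored in the CSV

# json.loads rejects the string because JSON requires double-quoted keys and values.
try:
    json.loads(raw)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# literal_eval only evaluates Python literals, never arbitrary expressions.
genres = ast.literal_eval(raw)
genre_names = [g["name"] for g in genres]
print(parsed_as_json, genre_names)
```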

In [17]:
# Show the row count of the original DataFrame
print("Original DataFrame:")
print(len(df))

# Drop duplicates based on the 'id' column
df.drop_duplicates(subset='id', keep='first', inplace = True)

# Show the row count of the resulting DataFrame with unique rows
print("\nDataFrame with unique rows (no duplicates in 'id'):")
print(len(df))

Original DataFrame:
45376

DataFrame with unique rows (no duplicates in 'id'):
45346
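
In the run above, 30 rows (45376 - 45346) shared an `id` with an earlier row and were dropped. The toy example below, with made-up data, shows what `keep='first'` retains. Deduplicating on `id` matters here because that column later becomes the primary key of the `movies` table.

```python
import pandas as pd

# Made-up data: id 2 appears twice.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "title": ["A", "B", "B (duplicate)", "C"],
})

# keep='first' retains the earliest row for each repeated id.
unique = toy.drop_duplicates(subset="id", keep="first")
removed = len(toy) - len(unique)
print(removed, unique["title"].tolist())
```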


## **Insertion into Postgres**

In [8]:
# df = df[:20]

In [19]:
import psycopg2
import pandas as pd


try:
    conn_postgre = psycopg2.connect(
        dbname='postgres',
        user='hive',
        password='password',
        host='hive4-postgres',
        port='5432'
    )
    print("Successfully connected to PostgreSQL")
    
    cur = conn_postgre.cursor()
    
    ### Create 3 tables: movies, genres, and a junction table that relates the two.
    cur.execute("""
    CREATE TABLE IF NOT EXISTS movies (
        movie_id INT PRIMARY KEY,
        budget FLOAT,
        original_language VARCHAR(10),
        overview TEXT,
        popularity FLOAT,
        release_date DATE,
        revenue FLOAT,
        runtime FLOAT,
        title VARCHAR(255),
        vote_average FLOAT,
        vote_count INT,
        release_year INT,
        retorno FLOAT
    );
    """)
    
    cur.execute("""
    CREATE TABLE IF NOT EXISTS genres (
        genre_id INT PRIMARY KEY,
        name VARCHAR(100) UNIQUE
    );
    """)
    
    cur.execute("""
    CREATE TABLE IF NOT EXISTS movie_genres (
        movie_id INT REFERENCES movies(movie_id) ON DELETE CASCADE,
        genre_id INT REFERENCES genres(genre_id) ON DELETE CASCADE,
        PRIMARY KEY (movie_id, genre_id)
    );
    """)
    
    print("Tables created (if they did not exist).")
    
    # Extract the unique genres (genre id -> genre name)
    genres_set = {g['id']: g['name'] for genres in df['genres'] for g in genres}
    genre_rows = [(genre_id, name) for genre_id, name in genres_set.items()]

    # Insert genres
    genre_insert_query = """
        INSERT INTO genres (genre_id, name) VALUES (%s, %s)
        ON CONFLICT (genre_id) DO NOTHING;
    """
    cur.executemany(genre_insert_query, genre_rows)
    conn_postgre.commit()
    print(f"Genres inserted: {len(genre_rows)}")
    
    # Insert movies
    movie_rows = [
        (row['id'], row['budget'], row['original_language'], row['overview'],
         row['popularity'], row['release_date'], row['revenue'], row['runtime'],
         row['title'], row['vote_average'], row['vote_count'], row['release_year'], row['retorno'])
        for _, row in df.iterrows()
    ]
    movie_insert_query = """
        INSERT INTO movies (movie_id, budget, original_language, overview, popularity, 
                            release_date, revenue, runtime, title, vote_average, 
                            vote_count, release_year, retorno)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """
    cur.executemany(movie_insert_query, movie_rows)
    conn_postgre.commit()
    print(f"Movies inserted: {len(movie_rows)}")

    # Insert the relationships into movie_genres
    movie_genre_rows = [
        (row['id'], genre['id']) for _, row in df.iterrows() for genre in row['genres']
    ]
    movie_genre_insert_query = """
        INSERT INTO movie_genres (movie_id, genre_id) VALUES (%s, %s)
    """
    cur.executemany(movie_genre_insert_query, movie_genre_rows)
    conn_postgre.commit()
    print(f"Movie-genre relationships inserted: {len(movie_genre_rows)}")
    
finally:
    # Close the cursor and the connection only if they were actually created
    if 'cur' in locals():
        cur.close()
    if 'conn_postgre' in locals():
        conn_postgre.close()

Successfully connected to PostgreSQL
Tables created (if they did not exist).
Genres inserted: 20
Movies inserted: 45346
Movie-genre relationships inserted: 90957
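
To try the three-table design without a running PostgreSQL server, the sketch below reproduces it with Python's built-in `sqlite3` module. This is a stand-in and not the notebook's setup: `INSERT OR IGNORE` is used in place of `ON CONFLICT ... DO NOTHING`, the `movies` table is reduced to two columns, and the second movie and its extra genre are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

cur.execute("CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE genres (genre_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
cur.execute("""
    CREATE TABLE movie_genres (
        movie_id INTEGER REFERENCES movies(movie_id) ON DELETE CASCADE,
        genre_id INTEGER REFERENCES genres(genre_id) ON DELETE CASCADE,
        PRIMARY KEY (movie_id, genre_id)
    )
""")

# (movie id, title, parsed genres); the second entry is hypothetical.
movies = [
    (11862, "Father of the Bride Part II", [{"id": 35, "name": "Comedy"}]),
    (1, "Hypothetical Movie", [{"id": 35, "name": "Comedy"},
                               {"id": 9001, "name": "Hypothetical Genre"}]),
]

# A dict keyed by genre id collapses genres shared by several movies.
genre_rows = list({g["id"]: g["name"] for _, _, gs in movies for g in gs}.items())
cur.executemany("INSERT OR IGNORE INTO genres (genre_id, name) VALUES (?, ?)", genre_rows)
cur.executemany("INSERT INTO movies (movie_id, title) VALUES (?, ?)",
                [(movie_id, title) for movie_id, title, _ in movies])
cur.executemany("INSERT INTO movie_genres (movie_id, genre_id) VALUES (?, ?)",
                [(movie_id, g["id"]) for movie_id, _, gs in movies for g in gs])
conn.commit()

# Resolve the many-to-many relationship through the junction table.
cur.execute("""
    SELECT m.title, g.name
    FROM movies m
    JOIN movie_genres mg ON mg.movie_id = m.movie_id
    JOIN genres g ON g.genre_id = mg.genre_id
    ORDER BY m.movie_id, g.genre_id
""")
pairs = cur.fetchall()
conn.close()
print(pairs)
```

Genres and movies are inserted before the relationships so that every row in `movie_genres` points at rows that already exist, which is the same order the notebook cell follows.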


In [20]:
### Query that returns 3 movies

conn_postgre = psycopg2.connect(
        dbname='postgres',
        user='hive',
        password='password',
        host='hive4-postgres',
        port='5432'
    )

cur = conn_postgre.cursor()
    
    
cur.execute("SELECT * FROM movies LIMIT 3;")
rows = cur.fetchall()
print("Data in movies:", rows)

cur.close()
conn_postgre.close()

Data in movies: [(862, 30000000.0, 'en', "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.", 21.946943, datetime.date(1995, 10, 30), 373554033.0, 81.0, 'Toy Story', 7.7, 5415, 1995, 12.4518011), (8844, 65000000.0, 'en', "When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.", 17.015539, datetime.date(1995, 12, 15), 262797249.0, 104.0, 'Jumanji', 6.9, 2413, 1995, 4.0430346), (15602, 0.0, 'en', "A fam

In [18]:
#################################################
#####   TO DROP THE TABLES CREATED ABOVE    #####
#################################################
"""import psycopg2

try:
    conn_postgre = psycopg2.connect(
        dbname='postgres',
        user='hive',
        password='password',
        host='hive4-postgres',
        port='5432'
    )
    print("Successfully connected to PostgreSQL")
    
    cur = conn_postgre.cursor()
    
    cur.execute("DROP TABLE IF EXISTS movie_genres;")
    cur.execute("DROP TABLE IF EXISTS movies;")
    cur.execute("DROP TABLE IF EXISTS genres;")
    
    conn_postgre.commit()
    print("Tables dropped successfully.")

finally:
    if 'cur' in locals():
        cur.close()
    if 'conn_postgre' in locals():
        conn_postgre.close()"""

Successfully connected to PostgreSQL
Tables dropped successfully.


In [22]:
df[:100].to_csv('./movies_100.csv', index=True)