# Environment setup

Run the following commands to set up the project environment:
> conda create --name db_project -c conda-forge --file requirements.txt

> conda activate db_project

> pip install python-dotenv

Write your database access credentials to the file `.env`. The notebook reads three variables from it: `USERNAME_DB`, `PASSWORD`, and `NAME_DB`.
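
The connection cell later in the notebook reads `USERNAME_DB`, `PASSWORD`, and `NAME_DB` with `os.getenv`, so a missing entry would only show up as a connection error. A minimal sketch of an early check, assuming those three names; the helper `missing_env_vars` is hypothetical and not part of the notebook:

```python
import os

# Variable names taken from the notebook's connection cell.
REQUIRED_VARS = ("USERNAME_DB", "PASSWORD", "NAME_DB")


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"USERNAME_DB": "alice", "PASSWORD": "", "NAME_DB": "oscars"}
print(missing_env_vars(example_env))
```

Calling `missing_env_vars()` with no arguments right after `load_dotenv()` would check the real environment.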

# Data preprocessing

In [8]:
import pandas as pd
import numpy as np
import re

In [39]:
oscars = pd.read_csv('oscars.csv')
oscars.head(3)

Unnamed: 0,year,edition,award,nomination_actor,nomination_country,nomination_character_name,nomination_citation,nomination_producers,nomination_description,film_title,is_winner,acceptance_speech_text,acceptance_speech_url
0,2024,97,Actor In A Leading Role,Adrien Brody,,László Tóth,,,,The Brutalist,True,,http://aaspeechesdb.oscars.org/ics-wpd/url.ash...
1,2024,97,Actor In A Leading Role,Timothée Chalamet,,Bob Dylan,,,,A Complete Unknown,False,,
2,2024,97,Actor In A Leading Role,Colman Domingo,,Divine G,,,,Sing Sing,False,,


In [40]:
rows_number, cols_number = oscars.shape
print(f"Oscars dataset contains {rows_number} rows and {cols_number} columns")

Oscars dataset contains 11995 rows and 13 columns


In [41]:
print(f"The percentage of NaN values in the film_title column: {round(oscars.film_title.isna().mean() * 100, 1)}%")

The percentage of NaN values in the film_title column: 10.9%


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
    
Some award categories, such as the *Honorary Award* or the *Jean Hersholt Humanitarian Award*, are not associated with a specific film. Around 11% of the rows in the dataset fall into this group (i.e., *film_title* is NaN). We will remove these rows.
</div>

In [42]:
oscars.dropna(subset=['film_title'], inplace=True)

In [43]:
print("The percentage of NaN values in the columns of the dataset:")
oscars.isna().mean().sort_values(ascending=False) * 100

The percentage of NaN values in the columns of the dataset:


nomination_citation          100.000000
nomination_description       100.000000
acceptance_speech_text       100.000000
nomination_country            99.719311
nomination_producers          87.846183
acceptance_speech_url         83.327096
nomination_character_name     82.718937
nomination_actor              12.453219
year                           0.000000
edition                        0.000000
award                          0.000000
is_winner                      0.000000
film_title                     0.000000
dtype: float64

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

The columns *nomination_citation*, *nomination_description*, *acceptance_speech_text*, and *nomination_country* are entirely or almost entirely NaN (99.7% to 100%). In addition, the *acceptance_speech_url* data is not relevant to our analysis. We will remove these five columns as well.
</div>

In [44]:
oscars.drop(
    columns=[
        'nomination_citation',
        'nomination_description',
        'acceptance_speech_text',
        'nomination_country',
        'acceptance_speech_url'
    ],
    inplace=True
)

In [45]:
print("The percentage of NaN values in the columns of the dataset:")
oscars.isna().mean().sort_values(ascending=False) * 100

The percentage of NaN values in the columns of the dataset:


nomination_producers         87.846183
nomination_character_name    82.718937
nomination_actor             12.453219
year                          0.000000
award                         0.000000
edition                       0.000000
film_title                    0.000000
is_winner                     0.000000
dtype: float64

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

We can also notice that, when the people involved in a nomination are specified in a row, they are listed in either the *nomination_actor* column or the *nomination_producers* column, never in both.
</div>

In [46]:
any(~oscars.nomination_actor.isna().values & ~oscars.nomination_producers.isna().values)

False

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

Thus, we can combine these two columns into one and name it *nomination_people*.
</div>

In [47]:
oscars['nomination_people'] = oscars.nomination_actor.fillna(oscars.nomination_producers)

In [48]:
oscars.drop(
    columns=[
        'nomination_actor',
        'nomination_producers',
    ],
    inplace=True
)

In [49]:
print("The percentage of NaN values in the columns of the dataset:")
oscars.isna().mean().sort_values(ascending=False) * 100

The percentage of NaN values in the columns of the dataset:


nomination_character_name    82.718937
nomination_people             0.299401
year                          0.000000
award                         0.000000
edition                       0.000000
film_title                    0.000000
is_winner                     0.000000
dtype: float64

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

The column *nomination_people* contains highly inconsistent textual data: sometimes a single person, sometimes multiple people, sometimes people sharing one last name, and sometimes entries that are not people at all (countries, groups, studios, etc.). Because of this wide variety of formats, we need a robust parsing function that can handle all these scenarios.

The function below normalizes the text, detects and expands shared-last-name patterns, filters out non-human entries, and finally returns a clean list of individual people (each represented as "First Last") for every row.
</div>

In [50]:
def parse_people(text):
    """
    Parse a free-form text field containing names or groups
    and extract a clean list of human names (first + last name).
    
    Rules applied:
    - A valid person must consist of at least two words (e.g., "John Smith").
    - Single words (e.g., "Russia", "Pixar") are ignored.
    - Group indicators (team, company, studio…) are ignored.
    - Handles the pattern "Joel and Ethan Coen" by expanding to:
      ["Joel Coen", "Ethan Coen"].
    """

    # Return empty list for empty or non-string values (NaN is a float)
    if not isinstance(text, str) or not text.strip():
        return []

    group_pattern = r'\b(group|team|band|studio|company|committee|ensemble|crew)\b'

    # 1. Normalize connectors:
    #    Replace "&" with "and", and drop an "and" that directly follows
    #    a comma ("A, B, and C" -> "A, B, C").
    #    Any other "and" is kept for now: step 4 needs it to detect
    #    shared last names.
    t = re.sub(r'\s*&\s*', ' and ', text.strip())
    t = re.sub(r',\s*and\s+', ', ', t, flags=re.I)

    # 2. Split on commas into individual candidate segments
    parts = [p.strip() for p in t.split(',') if p.strip()]

    persons = []

    for part in parts:

        # 3. Split the segment on "and":
        #    "Joel and Ethan Coen" -> ["Joel", "Ethan Coen"]
        names = [n.strip() for n in re.split(r'\s+and\s+', part, flags=re.I) if n.strip()]

        # 4. Handle cases with shared last name:
        #    If every name before the last one is a single word and the
        #    last one is a valid person, reuse the last one's surname.
        if len(names) > 1:
            last = names[-1]
            if (len(last.split()) >= 2
                    and not re.search(group_pattern, last, flags=re.I)
                    and all(len(n.split()) == 1 for n in names[:-1])):
                surname = last.split()[-1]
                names = [f"{n} {surname}" for n in names[:-1]] + [last]

        for name in names:

            # 5. Ignore group-like entities:
            #    This removes segments such as "The VFX Team",
            #    "Disney Studio", "Editing Crew", etc.
            if re.search(group_pattern, name, flags=re.I):
                continue

            # 6. Accept only names that contain at least two words
            #    to filter out countries, organizations, nicknames, etc.
            if len(name.split()) >= 2:
                persons.append(name)

    return persons

In [51]:
oscars['people_list'] = oscars.nomination_people.apply(parse_people)
empty_people_rows = oscars[oscars['people_list'].apply(lambda x: len(x) == 0)]
print(f"The dataset contains only {len(empty_people_rows)} rows in which people_list is empty")

The dataset contains only 481 rows in which people_list is empty


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

We applied the parsing function to every row, stored the result in the *people_list* column, and extracted the rows whose list of people is empty. There are only 481 such rows, so we exported them and checked by hand that none of them contains a person's name, which confirms that the function does not discard rows that mention people.
</div>
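
The shared-last-name rule mentioned earlier is the subtlest part of the parsing, so it may help to look at it in isolation. The sketch below is a standalone illustration, not the notebook's function: `expand_shared_surname` is a hypothetical helper, and it assumes the surname is shared only when every name before the last one is a single word.

```python
import re


def expand_shared_surname(segment):
    """Expand 'Joel and Ethan Coen' into ['Joel Coen', 'Ethan Coen'].

    Hypothetical helper for illustration. Names joined by "and" or "&"
    share a surname only if all names before the last are single words.
    """
    names = [
        n.strip()
        for n in re.split(r'\s+(?:and|&)\s+', segment.strip(), flags=re.I)
        if n.strip()
    ]
    if len(names) < 2:
        return names

    last_words = names[-1].split()
    if len(last_words) >= 2 and all(len(n.split()) == 1 for n in names[:-1]):
        surname = last_words[-1]
        return [f"{n} {surname}" for n in names[:-1]] + [names[-1]]

    # Every name already carries its own surname (or none can be inferred).
    return names


for sample in ["Joel and Ethan Coen", "John Smith and Jane Doe", "Adrien Brody"]:
    print(sample, "->", expand_shared_surname(sample))
```

A handful of such spot checks is a cheap complement to the manual review of the empty rows, because it also covers rows where only part of the names could be lost.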

In [52]:
# empty_people_rows.to_csv('empty_rows.csv', index=False)

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

Now we can remove the *nomination_people* column.
</div>

In [53]:
oscars.drop(
    columns=[
        'nomination_people',
    ],
    inplace=True
)

In [54]:
oscars.dtypes

year                          int64
edition                       int64
award                        object
nomination_character_name    object
film_title                   object
is_winner                      bool
people_list                  object
dtype: object

In [55]:
oscars.head()

Unnamed: 0,year,edition,award,nomination_character_name,film_title,is_winner,people_list
0,2024,97,Actor In A Leading Role,László Tóth,The Brutalist,True,[Adrien Brody]
1,2024,97,Actor In A Leading Role,Bob Dylan,A Complete Unknown,False,[Timothée Chalamet]
2,2024,97,Actor In A Leading Role,Divine G,Sing Sing,False,[Colman Domingo]
3,2024,97,Actor In A Leading Role,Lawrence,Conclave,False,[Ralph Fiennes]
4,2024,97,Actor In A Leading Role,Donald Trump,The Apprentice,False,[Sebastian Stan]


# Database creation

In [1]:
import os
from sqlalchemy import create_engine, text, DDL
from dotenv import load_dotenv
from sqlalchemy.exc import IntegrityError

In [2]:
load_dotenv()

True

In [3]:
password = os.getenv("PASSWORD")
user = os.getenv("USERNAME_DB")
db = os.getenv("NAME_DB")
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@postgresql-edu.in.centralelille.fr:5432/{db}")

In [4]:
#if needed
#with engine.begin() as conn:
#    conn.execute(text("""DROP SCHEMA films CASCADE;"""))

In [14]:
with engine.begin() as conn:
    #conn.execute(text("""CREATE SCHEMA films;"""))
    conn.execute(text("""set search_path = 'films'"""))
   
    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Person (
        person_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Film (
        film_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        title TEXT NOT NULL,
        CONSTRAINT unique_title UNIQUE (title)
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Award (
        award_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        award_name TEXT NOT NULL UNIQUE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Edition (
        year INTEGER PRIMARY KEY,
        edition INTEGER NOT NULL UNIQUE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Nomination (
        nom_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        year INTEGER NOT NULL,
        film_id INTEGER NOT NULL,
        award_id INTEGER NOT NULL,
        is_winner INTEGER NOT NULL CHECK (is_winner IN (0, 1)),
        UNIQUE(year, film_id, award_id),
        FOREIGN KEY(year) REFERENCES Edition(year) ON DELETE CASCADE,
        FOREIGN KEY(film_id) REFERENCES Film(film_id) ON DELETE CASCADE,
        FOREIGN KEY(award_id) REFERENCES Award(award_id) ON DELETE CASCADE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Nomination_person (
        nomination_id INTEGER NOT NULL,
        person_id INTEGER NOT NULL,
        character TEXT,
        PRIMARY KEY (nomination_id, person_id),
        FOREIGN KEY (nomination_id) REFERENCES Nomination(nom_id) ON DELETE CASCADE,
        FOREIGN KEY (person_id) REFERENCES Person(person_id) ON DELETE CASCADE
    ); """))
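
The `UNIQUE` constraints in this schema are what make the `ON CONFLICT ... DO NOTHING` inserts used during table population safe to repeat. The sketch below illustrates the idea on a reduced `Nomination` table. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL (an assumption made so that the example runs without a server; SQLite 3.24 or later is needed for `ON CONFLICT`), and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Nomination (
        nom_id INTEGER PRIMARY KEY,
        year INTEGER NOT NULL,
        film_id INTEGER NOT NULL,
        award_id INTEGER NOT NULL,
        is_winner INTEGER NOT NULL CHECK (is_winner IN (0, 1)),
        UNIQUE (year, film_id, award_id)
    )
""")

# The second row repeats the (year, film_id, award_id) key of the first one.
rows = [(2024, 1, 1, 1), (2024, 1, 1, 1), (2024, 2, 1, 0)]
conn.executemany(
    "INSERT INTO Nomination (year, film_id, award_id, is_winner) "
    "VALUES (?, ?, ?, ?) "
    "ON CONFLICT (year, film_id, award_id) DO NOTHING",
    rows,
)
count = conn.execute("SELECT COUNT(*) FROM Nomination").fetchone()[0]
print("rows stored:", count)

# The CHECK constraint rejects any is_winner value other than 0 or 1.
try:
    conn.execute(
        "INSERT INTO Nomination (year, film_id, award_id, is_winner) "
        "VALUES (2023, 3, 1, 2)"
    )
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True
print("invalid is_winner rejected:", check_rejected)
```

One consequence worth keeping in mind: with `UNIQUE(year, film_id, award_id)`, two nominees from the same film in the same category and year share a single `Nomination` row, and therefore a single `is_winner` value.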

# Table population

In [62]:
import io

# We need a low-level DB connection because we will use PostgreSQL's COPY via psycopg2 cursor.
# SQLAlchemy's high-level Connection doesn't expose copy_expert, so we get the raw DB connection.
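raw_conn = engine.raw_connection()
cursor = raw_conn.cursor()

# The pool may hand out a different connection than the one used to create the tables,
# so select the schema explicitly for this session.
cursor.execute("SET search_path = 'films'")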

# ---------------------------------------------------------
# 1. Prepare data for COPY
# ---------------------------------------------------------

# Make a working copy of the DataFrame so we don't modify the original 'oscars'
df = oscars.copy()

# The tmp_oscars.people_list column should be plain text for COPY.
# If people_list is a Python list, convert it to a single comma-separated string.
# If it's already a string, leave it as-is (str(x) will convert NaN to 'nan' though — see caveats).
# We use map(str, x) to safely convert non-string entries to strings before joining.
df["people_list"] = df["people_list"].apply(
    lambda x: ",".join(map(str, x)) if isinstance(x, list) else str(x)
)

# Convert the DataFrame to an in-memory CSV. COPY from STDIN expects a file-like object.
# We disable the header because the temporary table's columns are specified explicitly in COPY.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False, header=False)   # header=False — COPY assumes the same column order
csv_buffer.seek(0)  # rewind to start so copy_expert reads from the beginning

# ---------------------------------------------------------
# 2. Load into temporary table tmp_oscars via COPY
# ---------------------------------------------------------

# Create a temporary table that mirrors the columns we will COPY.
# TEMP tables live only for the session and are convenient for bulk-loading and transformation.
cursor.execute("""
    DROP TABLE IF EXISTS tmp_oscars;
    CREATE TEMP TABLE tmp_oscars (
        year INTEGER,
        edition INTEGER,
        award TEXT,
        nomination_character_name TEXT,
        film_title TEXT,
        is_winner BOOLEAN,
        people_list TEXT
    );
""")

# Use psycopg2's copy_expert to run a COPY FROM STDIN. This is very fast for bulk loads.
# Because we passed header=False above, COPY will read rows in the exact column order we exported.
cursor.copy_expert(
    """
    COPY tmp_oscars
    FROM STDIN WITH (FORMAT CSV);
    """,
    csv_buffer
)

# Commit the COPY - required because we're using raw_connection outside SQLAlchemy transactional context.
raw_conn.commit()

# ---------------------------------------------------------
# 3. Bulk INSERT ... SELECT operations (no Python loops)
# ---------------------------------------------------------

# Insert Editions (one row per year+edition). DISTINCT prevents duplicate inserts.
# ON CONFLICT (year) DO NOTHING uses the unique constraint on year to skip existing rows.
cursor.execute("""
    INSERT INTO Edition (edition, year)
    SELECT DISTINCT edition, year 
    FROM tmp_oscars
    ON CONFLICT (year) DO NOTHING;
""")

# Insert unique awards from tmp_oscars -> Award table
cursor.execute("""
    INSERT INTO Award (award_name)
    SELECT DISTINCT award FROM tmp_oscars
    ON CONFLICT (award_name) DO NOTHING;
""")

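# Insert unique films into Film(title).
# ON CONFLICT DO NOTHING skips titles that already exist, so the cell can be re-run
# like the Edition and Award inserts above.
cursor.execute("""
    INSERT INTO Film (title)
    SELECT DISTINCT film_title FROM tmp_oscars
    ON CONFLICT DO NOTHING;
""")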

# Insert nominations by joining tmp data to the Film and Award rows we just inserted.
# We convert the boolean is_winner to 1/0 because Nomination.is_winner is an INTEGER restricted to 0 or 1.
# ON CONFLICT (year, film_id, award_id) DO NOTHING prevents duplicate nominations.
cursor.execute("""
    INSERT INTO Nomination (year, film_id, award_id, is_winner)
    SELECT 
        t.year,
        f.film_id,
        a.award_id,
        CASE WHEN t.is_winner THEN 1 ELSE 0 END
    FROM tmp_oscars t
    JOIN Film f ON f.title = t.film_title
    JOIN Award a ON a.award_name = t.award
    ON CONFLICT (year, film_id, award_id) DO NOTHING;
""")

# ---------------------------------------------------------
# 4. Insert people and link them to nominations
# We need to split people_list back into individual names.
# ---------------------------------------------------------

# First: expand people_list into separate rows and insert unique Person names
cursor.execute("""
    WITH expanded AS (
        SELECT 
            nom.nom_id,
            unnest(string_to_array(t.people_list, ',')) AS person_name,
            t.nomination_character_name AS character
        FROM tmp_oscars t
        JOIN Film f ON f.title = t.film_title
        JOIN Award a ON a.award_name = t.award
        JOIN Nomination nom 
          ON nom.year = t.year 
         AND nom.film_id = f.film_id 
         AND nom.award_id = a.award_id
    )
    INSERT INTO Person (name)
    SELECT DISTINCT trim(person_name)
    FROM expanded
    ON CONFLICT (name) DO NOTHING;
""")
# Notes:
# - string_to_array(t.people_list, ',') splits the comma-separated string into text[]
# - unnest(...) expands the array into rows
# - trim(...) removes leading/trailing whitespace from names
# Caveat: if a person's name itself contains a comma (e.g., "Surname, Jr."), splitting on commas will mis-split.
# If names may contain commas, you should have used a safer delimiter or JSON array when exporting.

# Now insert the association rows into Nomination_person
cursor.execute("""
    WITH expanded AS (
        SELECT 
            nom.nom_id,
            trim(unnest(string_to_array(t.people_list, ','))) AS person_name,
            t.nomination_character_name AS character
        FROM tmp_oscars t
        JOIN Film f ON f.title = t.film_title
        JOIN Award a ON a.award_name = t.award
        JOIN Nomination nom 
          ON nom.year = t.year 
         AND nom.film_id = f.film_id 
         AND nom.award_id = a.award_id
    )
    INSERT INTO Nomination_person (nomination_id, person_id, character)
    SELECT 
        e.nom_id,
        p.person_id,
        e.character
    FROM expanded e
    JOIN Person p ON p.name = trim(e.person_name)
    ON CONFLICT (nomination_id, person_id) DO NOTHING;
""")
# Notes:
# - We join expanded rows to Person to obtain person_id
# - ON CONFLICT prevents duplicates if the same person-nomination pair already exists

# ---------------------------------------------------------
# 5. Finalize: commit and clean up low-level resources
# ---------------------------------------------------------
raw_conn.commit()
cursor.close()
raw_conn.close()
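
The caveat noted in the comments above (a comma inside a name breaks the comma-separated export) can be illustrated in plain Python. This is a sketch with made-up names, not a change to the notebook: it contrasts the comma-joined export with a JSON-encoded list, which round-trips safely.

```python
import json

# Made-up names for illustration; the first one contains a comma.
people = ["John Doe, Jr.", "Jane Roe"]

# Comma-joined export: the comma inside the name looks like a delimiter.
joined = ",".join(people)
naive = [p.strip() for p in joined.split(",")]
print(naive)

# JSON export: the list structure survives the round trip.
encoded = json.dumps(people, ensure_ascii=False)
decoded = json.loads(encoded)
print(decoded)
```

On the PostgreSQL side, a JSON-encoded column could then be expanded with `jsonb_array_elements_text(people_list::jsonb)` instead of `unnest(string_to_array(...))`. This is an untested suggestion for this notebook, and it only matters if the parser is ever changed to keep names that contain commas.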

# Queries (part 1)

In [6]:
def run_query(query):
    text_query = text(query)
    return pd.read_sql_query(text_query, engine)

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
List all award names.
</div>

In [15]:
run_query("SELECT award_name FROM Award;")

Unnamed: 0,award_name
0,Music (Original Song Score And Its Adaptation ...
1,Engineering Effects
2,Film Editing
3,Actress In A Leading Role
4,Outstanding Picture
...,...
102,Art Direction
103,Directing (Dramatic Picture)
104,Special Effects
105,Writing (Title Writing)


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Get the titles of all films that won an Oscar in any category. A film appears once for each award it won.
</div>

In [16]:
run_query("select film.title from film inner join nomination as nomi on Film.film_id = nomi.film_id where nomi.is_winner = 1;")

Unnamed: 0,title
0,The Fighter
1,Blade Runner 2049
2,Blade Runner 2049
3,Lilies of the Field
4,Skyfall
...,...
2168,The Hurricane
2169,Just Another Missing Kid
2170,The Father
2171,The Father


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
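Count the number of nominations each film received. Because the *Film* table stores one row per title, films that share a title (remakes, for example) are counted together.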
</div>

In [17]:
run_query("SELECT Film.title, COUNT(Nomination.nom_id) AS total_nominations FROM Film INNER JOIN Nomination ON Film.film_id = Nomination.film_id GROUP BY Film.title ORDER BY COUNT(Nomination.nom_id) desc;")

Unnamed: 0,title,total_nominations
0,A Star Is Born,25
1,West Side Story,18
2,Titanic,16
3,Moulin Rouge,15
4,Little Women,14
...,...,...
5085,Jasper and the Beanstalk,1
5086,When Marnie Was There,1
5087,Opera,1
5088,Attica,1


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
List all the categories in which *Titanic* was nominated.
</div>

In [67]:
run_query("SELECT Award.award_name FROM Award JOIN Nomination ON Nomination.award_id = Award.award_id JOIN Film ON Nomination.film_id = Film.film_id WHERE Film.title = 'Titanic';")

Unnamed: 0,award_name
0,Actress In A Leading Role
1,Actress In A Supporting Role
2,Art Direction
3,Best Picture
4,Cinematography
5,Costume Design
6,Directing
7,Film Editing
8,Makeup
9,Music (Original Dramatic Score)


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Give the names of the people nominated for playing a character named 'Joker'.
</div>

In [68]:
run_query("select Person.name from Person join Nomination_person on Person.person_id = Nomination_person.person_id where Nomination_person.character = 'Joker';")

Unnamed: 0,name
0,Heath Ledger


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Give the names of the people nominated for playing a character whose name contains 'Bob'.
</div>

In [69]:
run_query("select Person.name from Person join Nomination_person on Person.person_id = Nomination_person.person_id where Nomination_person.character LIKE '%Bob%';")

Unnamed: 0,name
0,Timothée Chalamet
1,Kathy Bates
2,Sam Elliott
3,Willem Dafoe
4,Michael Shannon
5,Laura Dern
6,George Clooney
7,Bill Murray
8,Bruce Dern


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
List the actors and actresses who were nominated both for a leading role and for a supporting role (*Actor In A Leading Role* and *Actor In A Supporting Role*, or the corresponding *Actress* categories).
</div>

In [70]:
run_query("SELECT P.name FROM Person AS P INNER JOIN Nomination_person AS Np ON P.person_id = Np.person_id INNER JOIN Nomination AS N ON Np.nomination_id = N.nom_id INNER JOIN Award AS A ON N.award_id = A.award_id WHERE A.award_name = 'Actor In A Leading Role' INTERSECT SELECT P.name FROM Person AS P INNER JOIN Nomination_person AS Np ON P.person_id = Np.person_id INNER JOIN Nomination AS N ON Np.nomination_id = N.nom_id INNER JOIN Award AS A ON N.award_id = A.award_id WHERE A.award_name = 'Actor In A Supporting Role' UNION SELECT P.name FROM Person AS P INNER JOIN Nomination_person AS Np ON P.person_id = Np.person_id INNER JOIN Nomination AS N ON Np.nomination_id = N.nom_id INNER JOIN Award AS A ON N.award_id = A.award_id WHERE A.award_name = 'Actress In A Leading Role' INTERSECT SELECT P.name FROM Person AS P INNER JOIN Nomination_person AS Np ON P.person_id = Np.person_id INNER JOIN Nomination AS N ON Np.nomination_id = N.nom_id INNER JOIN Award AS A ON N.award_id = A.award_id WHERE A.award_name = 'Actress In A Supporting Role';")

Unnamed: 0,name
0,Nick Nolte
1,Ian McKellen
2,Paul Newman
3,Kate Winslet
4,Robert Duvall
...,...
125,Adam Driver
126,Nicole Kidman
127,Richard Burton
128,Helen Mirren


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Find the film with the best win-to-nomination ratio in a single year, among films with at least five nominations that year.
</div>

In [71]:
run_query("WITH FilmStats AS ( SELECT Film.title, N.year, COUNT(N.nom_id) AS total_noms, SUM(CASE WHEN N.is_winner = 1 THEN 1 ELSE 0 END) AS total_wins FROM Film INNER JOIN Nomination AS N ON Film.film_id = N.film_id GROUP BY Film.title, N.year HAVING COUNT(N.nom_id) >= 5 ) SELECT title, year, total_wins, total_noms, CAST(total_wins AS numeric) / total_noms AS conversion_rate FROM FilmStats ORDER BY conversion_rate DESC, total_noms DESC LIMIT 1;")

Unnamed: 0,title,year,total_wins,total_noms,conversion_rate
0,The Lord of the Rings: The Return of the King,2003,11,11,1.0
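
The logic of this query (group by film and year, keep groups with enough nominations, rank by win ratio, then by number of nominations) can be sketched in plain Python. The helper `best_conversion` and the toy rows are hypothetical and serve only to illustrate why the minimum-nomination filter matters.

```python
from collections import defaultdict


def best_conversion(nominations, min_nominations=5):
    """Return (title, year, wins, noms, rate) for the best win ratio, or None.

    nominations: iterable of (title, year, is_winner) with is_winner in {0, 1}.
    """
    stats = defaultdict(lambda: [0, 0])  # (title, year) -> [noms, wins]
    for title, year, is_winner in nominations:
        stats[(title, year)][0] += 1
        stats[(title, year)][1] += is_winner

    eligible = [
        (title, year, wins, noms, wins / noms)
        for (title, year), (noms, wins) in stats.items()
        if noms >= min_nominations
    ]
    if not eligible:
        return None
    # Highest ratio first; ties are broken by the larger number of nominations.
    return max(eligible, key=lambda row: (row[4], row[3]))


toy = (
    [("Film A", 2000, 1)] * 5                              # 5 nominations, 5 wins
    + [("Film B", 2000, 1)] * 5 + [("Film B", 2000, 0)]    # 6 nominations, 5 wins
    + [("Film C", 2001, 1)]                                # 1 nomination, 1 win
)
print(best_conversion(toy))
```

Without the threshold, any film with a single nomination and a single win would tie at a ratio of 1.0.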


# Adding a new dataset to our data

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Once the Oscars database is set up, someone asks us whether we can add data they have collected on a large number of films, so that the Oscar nominations can be linked to film information.
</div>

In [24]:
import pandas as pd
films = pd.read_csv('TMDB_movie_dataset_v11.csv')
films.head(3)


Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0"> To add the films to the database, we must first verify the quality of the new data.
We will only keep the following columns: 
<ul> 
    <li> title</li> 
    <li> release_date</li> 
    <li> vote_average</li> 
    <li> vote_count</li> 
    <li> runtime</li>
    <li> revenue</li> 
    <li> budget</li> 
</ul> </div>

# Data preprocessing

In [25]:
rows_number, cols_number = films.shape
print(f"Films dataset contains {rows_number} rows and {cols_number} columns")

Films dataset contains 1333570 rows and 24 columns


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Drop all adult films and all films without a title.
</div>

In [26]:
films = films[films['adult'] != True]
films = films.dropna(subset=['title'])
films.shape

(1201510, 24)

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Look for duplicated titles, then remove the duplicates that share both title and release date.
</div>

In [27]:
lowtitles = films['title'].str.lower()

# Count titles that appear more than once (case-insensitive)
nb_doublons = lowtitles.duplicated().sum()
print(f"Number of duplicated titles: {nb_doublons}")

films['low_title'] = films['title'].str.lower()
nb_doublons = films.duplicated(subset=['low_title', 'release_date']).sum()

print(f"Number of films that repeat both the title and the release date of an earlier row: {nb_doublons}")

film = films.drop_duplicates(subset=['low_title', 'release_date'], keep='first').copy()
film.shape

Number of duplicated titles: 197721
Number of films that repeat both the title and the release date of an earlier row: 26109


(1175401, 25)
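
The keep-first rule applied by `drop_duplicates` above can be written out in plain Python. The sketch below uses made-up rows and a hypothetical helper, `drop_duplicate_films`; it only illustrates that two rows count as duplicates when both the lowercased title and the release date match.

```python
def drop_duplicate_films(rows):
    """Keep the first row for each (lowercased title, release date) pair.

    rows: iterable of (title, release_date) tuples.
    """
    seen = set()
    kept = []
    for title, release_date in rows:
        key = (title.lower(), release_date)
        if key not in seen:
            seen.add(key)
            kept.append((title, release_date))
    return kept


rows = [
    ("Alpha", "2001-05-01"),
    ("ALPHA", "2001-05-01"),   # same title (ignoring case) and same date: dropped
    ("Alpha", "2010-01-01"),   # same title, different date: kept
]
print(drop_duplicate_films(rows))
```

Films that share a title but have different release dates survive this step, which is why a separate matching rule is needed later to decide which of them an Oscar nomination refers to.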

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
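Drop films without a valid release date. A film that has not been released yet cannot have been nominated for an Oscar, and the release date is needed later to match films with Oscar years.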
</div>

In [28]:
from datetime import datetime
film['release_date'] = pd.to_datetime(film['release_date'], errors='coerce')

# Drop films whose release date is missing or could not be parsed.
film = film.dropna(subset=['release_date'])
film.shape

(942153, 25)

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Select the columns to insert into the table.
</div>

In [77]:
columns = [
    'id',
    'title',
    'release_date',
    'vote_average',
    'vote_count',
    'runtime',
    'revenue',
    'budget'
]
df_to_push = film[columns].copy()


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Next, we separate the films in the new dataset that already exist in our database from those that are new.
</div>

In [78]:
db_films = run_query("select * from film;")
print("Number of unique movies : ", db_films['title'].shape)
# The Oscars database holds 5090 unique film titles; look for them in the new film dataset.

title_oscar = db_films['title'].str.lower()
titres_to_push = df_to_push['title'].str.lower()

films_trouves = titres_to_push.isin(title_oscar)
films_trouves

Number of unique movies :  (5090,)


0           True
1           True
2           True
3           True
4          False
           ...  
1333564    False
1333565    False
1333566    False
1333568    False
1333569    False
Name: title, Length: 942153, dtype: bool

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Films with a budget, revenue, or runtime of 0 have these values set to NaN. Likewise, when *vote_count* is 0, *vote_average* is set to NaN.
</div>

In [79]:
df_to_push[['budget','runtime','revenue']] = df_to_push[['budget','runtime','revenue']].replace(0, np.nan)

zero_vote = df_to_push['vote_count'] == 0
df_to_push.loc[zero_vote, 'vote_average'] = np.nan
df_to_push

Unnamed: 0,id,title,release_date,vote_average,vote_count,runtime,revenue,budget
0,27205,Inception,2010-07-15,8.364,34495,148.0,8.255328e+08,160000000.0
1,157336,Interstellar,2014-11-05,8.417,32571,169.0,7.017292e+08,165000000.0
2,155,The Dark Knight,2008-07-16,8.512,30619,152.0,1.004558e+09,185000000.0
3,19995,Avatar,2009-12-15,7.573,29815,162.0,2.923706e+09,237000000.0
4,24428,The Avengers,2012-04-25,7.710,29166,143.0,1.518816e+09,220000000.0
...,...,...,...,...,...,...,...,...
1333564,829326,Around Mr. Yasunari Kawabata,1968-12-11,,0,29.0,,
1333565,829327,One Foot In,2021-05-13,,0,25.0,,
1333566,829329,West,2020-03-20,,0,4.0,,
1333568,829332,Lagaan: The Thrill of Victory,2016-11-17,,0,4.0,,


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
When several films share the same title, we need to decide which one the Oscar nominations refer to. We pick the film whose release year is closest to the year of its first nomination.
</div>

In [80]:
db_films_oscar = run_query("select f.film_id, lower(f.title) as title_low , min(n.year) as first_oscar_year from film f join nomination n on f.film_id = n.film_id group by f.title, f.film_id;")
df_to_push['title_low'] = df_to_push['title'].str.lower()

df_close = pd.merge(
    db_films_oscar,
    df_to_push[['title', 'title_low', 'release_date', 'vote_average', 'vote_count', 'runtime', 'revenue', 'budget']],
    on='title_low',
    how='inner',
    suffixes=('_bdd', '_tmdb')
)

df_close['diff_year'] = abs(
    df_close['release_date'].dt.year - df_close['first_oscar_year']
)

df_close.sort_values(by=['film_id', 'diff_year'], ascending=True, inplace=True)

df_fin = df_close.drop_duplicates(
    subset=['film_id'], 
    keep='first'
).copy()

final_cols = ['film_id', 'title', 'release_date', 'vote_average', 'vote_count', 'runtime', 'revenue', 'budget']
df_fin = df_fin[final_cols]
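
The matching rule used in this cell (keep the candidate whose release year is closest to the first nomination year) can be sketched in plain Python. The helper `closest_release`, its optional `max_gap` parameter, and the candidate list are hypothetical; the notebook applies no gap limit.

```python
def closest_release(candidates, first_oscar_year, max_gap=None):
    """Pick the candidate released closest to the first nomination year.

    candidates: list of (label, release_year) tuples.
    max_gap: optional limit in years; None reproduces the notebook's rule.
    """
    if not candidates:
        return None
    best = min(candidates, key=lambda c: abs(c[1] - first_oscar_year))
    if max_gap is not None and abs(best[1] - first_oscar_year) > max_gap:
        return None
    return best


# Three made-up films sharing one title, first nominated in 1985.
versions = [("version 1", 1950), ("version 2", 1984), ("version 3", 2019)]
print(closest_release(versions, 1985))
print(closest_release(versions, 1900, max_gap=2))
```

Without a limit, a title is always matched to some film, even when the closest release is decades away from the nomination. A threshold such as `max_gap` is one possible way to reject those cases.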

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
We now have a dataset (*df_fin*) with the films that are already in the database. Next, we separate these from the films that still need to be inserted.
</div>

In [81]:
df_fin

Unnamed: 0,film_id,title,release_date,vote_average,vote_count,runtime,revenue,budget
7341,1,The Fighter,2010-12-10,7.380,4025,116.0,93617009.0,25000000.0
394,2,Kangaroo Court,1994-01-01,,0,33.0,,
12254,3,Blade Runner 2049,2017-10-04,7.544,12331,164.0,259239658.0,150000000.0
2118,4,Lilies of the Field,1963-06-04,7.248,139,94.0,,
1354,5,Singing Guns,1950-02-28,5.400,5,91.0,,
...,...,...,...,...,...,...,...,...
3376,5086,The Help,2011-08-09,8.202,7538,146.0,216639112.0,25000000.0
8843,5087,Lonelyhearts,1959-03-04,6.000,17,100.0,,
11506,5088,Broadway Melody of 1936,1935-09-18,6.375,16,103.0,,
1446,5089,The Olympics in Mexico,1969-08-29,5.700,6,160.0,,


In [82]:
run_query("select * from film where title='Inception'")

Unnamed: 0,film_id,title
0,1429,Inception


# Population of the new table

In [10]:
with engine.begin() as conn:
    #conn.execute(text("""CREATE SCHEMA films;"""))
    conn.execute(text("""set search_path = 'films'"""))

    # IF NOT EXISTS keeps the cell re-runnable: without it, a second run fails
    # with DuplicateColumn because the columns already exist.
    conn.execute(text("""
    ALTER TABLE Film
    ADD COLUMN IF NOT EXISTS release_date DATE,
    ADD COLUMN IF NOT EXISTS vote_average FLOAT,
    ADD COLUMN IF NOT EXISTS vote_count INTEGER,
    ADD COLUMN IF NOT EXISTS runtime INTEGER,
    ADD COLUMN IF NOT EXISTS revenue FLOAT,
    ADD COLUMN IF NOT EXISTS budget FLOAT;
    """))

    # A title alone is no longer unique once same-named films are loaded,
    # so uniqueness moves to the (title, release_date) pair.
    conn.execute(text("""
    ALTER TABLE Film
    DROP CONSTRAINT IF EXISTS unique_title,
    DROP CONSTRAINT IF EXISTS uq_film_title_release;
    """))

    conn.execute(text("""
    ALTER TABLE Film
    ADD CONSTRAINT uq_film_title_release UNIQUE (title, release_date);
    """))

In [85]:
import pandas as pd
from sqlalchemy import text

df_a_pusher = df_fin

# Columns to copy onto the existing film rows.
colonnes_update = [
    'release_date', 'vote_average', 'vote_count', 
    'runtime', 'revenue', 'budget'
]

with engine.begin() as conn:
    conn.execute(text("""set search_path = 'films'"""))

    for index, row in df_a_pusher.iterrows():
        set_clauses = []
        params = {}

        # Only set columns that have a value, so missing data stays NULL.
        for col in colonnes_update:
            if pd.notna(row[col]):
                # Column names come from the fixed list above; values are bound parameters.
                set_clauses.append(f"{col} = :{col}")
                params[col] = row[col]

        if set_clauses:
            params['film_id'] = int(row['film_id'])

            sql_update = text(f"""
                UPDATE film
                SET {', '.join(set_clauses)}
                WHERE film_id = :film_id;
            """)

            conn.execute(sql_update, params)
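
The dynamic `SET` clause can also be built by a small helper that is testable without a database connection. This is only a sketch: `build_update` and `ALLOWED_COLUMNS` are hypothetical names, not part of the notebook, and the helper takes a plain dict rather than a pandas row.

```python
import math

# Fixed whitelist: only these names are ever interpolated into the SQL text.
ALLOWED_COLUMNS = (
    "release_date", "vote_average", "vote_count",
    "runtime", "revenue", "budget",
)


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_update(row, columns=ALLOWED_COLUMNS):
    """Return (sql, params) updating only the non-missing columns, or None."""
    params = {c: row[c] for c in columns if c in row and not is_missing(row[c])}
    if not params:
        return None
    set_clause = ", ".join(f"{c} = :{c}" for c in params)
    params["film_id"] = int(row["film_id"])
    return f"UPDATE film SET {set_clause} WHERE film_id = :film_id", params


# Example row with a missing budget: only the known columns are updated.
example = {"film_id": 4, "release_date": "1963-06-04",
           "runtime": 94.0, "budget": float("nan")}
sql, params = build_update(example)
print(sql)
print(params)
```

Values always travel as bound parameters; only column names from the whitelist reach the SQL string.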

In [86]:
run_query("SELECT * FROM film WHERE title = 'Inception';")

Unnamed: 0,film_id,title,release_date,vote_average,vote_count,runtime,revenue,budget
0,1429,Inception,2010-07-15,8.364,34495,148,825532764.0,160000000.0


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Of the 5090 films already in the database, 357 have no additional information in the IMDB dataset.
</div>

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Let's now add 100,000 random films from the IMDB dataset.
</div>

In [87]:
df_sample = film.sample(n=100000, random_state=42)
df_sample = df_sample[['title', 'release_date', 'vote_average', 'vote_count', 'runtime', 'revenue', 'budget']]
df_sample

Unnamed: 0,title,release_date,vote_average,vote_count,runtime,revenue,budget
133318,Heroic Pioneers,1986-10-09,5.00,4,110,0,0
532882,REVERT,2023-07-30,0.00,0,12,0,0
984507,Lær At Spille Guitar Med Ole Kibsgaard,2003-01-01,0.00,0,0,0,0
7014,Find Me Guilty,2006-03-16,6.55,439,125,2636637,13000000
128660,The Sadist Has Red Teeth,1971-01-13,4.20,4,75,0,0
...,...,...,...,...,...,...,...
422204,Happy Go Jenny,2022-10-23,0.00,0,329,0,0
110767,Isle of Forgotten Sins,1943-08-15,4.50,6,82,0,0
1111602,"Rocks, Paper, Scissors",2014-01-01,0.00,0,0,0,0
1030983,Solo,1989-01-01,0.00,0,4,0,0


In [88]:
run_query("select count(*) from film;")

Unnamed: 0,count
0,5090


In [89]:
# Rows that collide with an existing (title, release_date) pair are skipped.
sql_insert = text("""
    INSERT INTO films.film (title, release_date, vote_average, vote_count, runtime, revenue, budget)
    VALUES (:title, :release_date, :vote_average, :vote_count, :runtime, :revenue, :budget)
    ON CONFLICT ON CONSTRAINT uq_film_title_release DO NOTHING;""")

# Passing a list of dicts runs the statement once per record in a single transaction.
with engine.begin() as conn:
    conn.execute(sql_insert, df_sample.to_dict('records'))

# Query

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Films released after January 1, 2020 (first 10 rows).
</div>

In [18]:
run_query("select title, release_date from film where release_date > '2020-01-01' LIMIT 10;")

Unnamed: 0,title,release_date
0,Istärin,2022-07-08
1,Tell Me Yes,2020-05-18
2,Facade,2022-06-28
3,Bones,2020-11-11
4,Чеснова супа,2024-06-11
5,The Ballad of John St. George,2020-02-12
6,Acces Denied,2022-11-02
7,Not a Word of Truth,2021-08-28
8,O Chá de Alice,2023-08-28
9,The 7th Man,2020-09-23


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Top 10 films by average vote, restricted to films with more than 10,000 votes so that the ranking is meaningful.
</div>

In [19]:
run_query("select title, vote_average from film where vote_average is not null and vote_count > 10000 order by vote_average desc LIMIT 10;")

Unnamed: 0,title,vote_average
0,The Godfather,8.707
1,The Shawshank Redemption,8.702
2,The Godfather Part II,8.591
3,Schindler's List,8.573
4,Spirited Away,8.539
5,Parasite,8.515
6,The Dark Knight,8.512
7,The Green Mile,8.507
8,Pulp Fiction,8.488
9,Forrest Gump,8.477


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Films with a vote_average below 6 (and more than 1,000 votes) that were still nominated for an Oscar.
</div>

In [20]:
run_query("select distinct f.title, f.vote_average, n.is_winner " \
"from film f join nomination n using(film_id) " \
"where vote_average < 6 and vote_count > 1000")

Unnamed: 0,title,vote_average,is_winner
0,102 Dalmatians,5.456,0
1,Babe: Pig in the City,5.553,0
2,Batman Forever,5.409,0
3,Fifty Shades of Grey,5.882,0
4,"Hail, Caesar!",5.914,0
5,Hollow Man,5.918,0
6,Into the Woods,5.75,0
7,Junior,5.167,0
8,Mirror Mirror,5.918,0
9,Norbit,5.584,0


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
Average budget of the films for which Brad Pitt was nominated.
</div>

In [21]:
run_query("with brad_movie as (select n.film_id from nomination n join nomination_person np on n.nom_id=np.nomination_id join person p using(person_id) where p.name = 'Brad Pitt')" \
"select avg(f.budget) from film f join brad_movie using(film_id)")

Unnamed: 0,avg
0,59600000.0
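
Because the `brad_movie` CTE holds one row per nomination, a film with several nominations for the same person is counted several times in the average. The sketch below compares the per-nomination and per-film averages on hypothetical toy data in an in-memory SQLite database (not the project's PostgreSQL database; the table layout is assumed from the query above and `avg_budget` is an illustrative helper).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, budget REAL);
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nomination (nom_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE nomination_person (nomination_id INTEGER, person_id INTEGER);

-- Hypothetical data: two nominations for film 1, one for film 2.
INSERT INTO film VALUES (1, 'Film A', 100.0), (2, 'Film B', 40.0);
INSERT INTO person VALUES (1, 'Some Actor');
INSERT INTO nomination VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO nomination_person VALUES (10, 1), (11, 1), (12, 1);
""")


def avg_budget(distinct):
    """Average budget over the person's nominated films, optionally de-duplicated."""
    keyword = "DISTINCT" if distinct else ""
    query = f"""
        WITH person_movie AS (
            SELECT {keyword} n.film_id
            FROM nomination n
            JOIN nomination_person np ON n.nom_id = np.nomination_id
            JOIN person p USING (person_id)
            WHERE p.name = 'Some Actor'
        )
        SELECT AVG(f.budget) FROM film f JOIN person_movie USING (film_id)
    """
    return conn.execute(query).fetchone()[0]


per_nomination = avg_budget(distinct=False)  # Film A counted twice
per_film = avg_budget(distinct=True)         # each film counted once
print(per_nomination, per_film)
```

Which of the two averages is wanted depends on the question being asked; `SELECT DISTINCT` in the CTE gives the per-film version.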


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
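Budget of each Oscar-winning film (minimum per title). Because the query groups by title, it returns one row per winning title rather than the single lowest-budget winner; NaN marks films with no budget data.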
</div>

In [22]:
run_query("select f.title, min(f.budget) from film f join nomination n using(film_id) where is_winner=1 group by f.title")

Unnamed: 0,title,min
0,The Sandpiper,
1,Colette,
2,The Fighter,25000000.0
3,Blade Runner 2049,150000000.0
4,Lilies of the Field,
...,...,...
1330,Traffic,48000000.0
1331,The Father,6000000.0
1332,Just Another Missing Kid,
1333,Mighty Joe Young,1800000.0
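
To get only the single lowest-budget winner, one option is to drop the `GROUP BY`, ignore rows with no budget, and sort. The sketch below uses hypothetical toy data in an in-memory SQLite database (not the project's PostgreSQL database); the column names are assumed from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, budget REAL);
CREATE TABLE nomination (nom_id INTEGER PRIMARY KEY, film_id INTEGER, is_winner INTEGER);

-- Hypothetical data: three winners (one without a budget) and one non-winner.
INSERT INTO film VALUES (1, 'Winner without budget', NULL),
                        (2, 'Expensive winner', 150000000.0),
                        (3, 'Cheap winner', 6000000.0),
                        (4, 'Cheap non-winner', 1000000.0);
INSERT INTO nomination VALUES (1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 0);
""")

# Keep winners with a known budget, sort ascending, and take the first row.
cheapest = conn.execute("""
    SELECT f.title, f.budget
    FROM film f
    JOIN nomination n USING (film_id)
    WHERE n.is_winner = 1 AND f.budget IS NOT NULL
    ORDER BY f.budget ASC
    LIMIT 1
""").fetchone()
print(cheapest)
```

If unknown budgets are stored as 0 rather than NULL in some rows, an extra `f.budget > 0` filter would be needed; that is an assumption about the data, not something the query above checks.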
