# Environment setup

Run the following commands to set up the project environment:
> conda create --name db_project -c conda-forge --file requirement.txt

> conda activate db_project

> pip install python-dotenv

Write your database access credentials to the file `.env`, using the variable names that the notebook reads: `PASSWORD`, `USERNAME_DB`, and `NAME_DB`.
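
As a quick sanity check before connecting, a minimal sketch along these lines can report which credentials are missing. The variable names are the ones read by the `os.getenv` calls in the database section; `missing_credentials` is a hypothetical helper and not part of the project.

```python
import os

# Variable names taken from the os.getenv calls in the database section.
REQUIRED_VARS = ("PASSWORD", "USERNAME_DB", "NAME_DB")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("All database credentials are set")
```

The function inspects the process environment by default, so in the notebook it would be called after `load_dotenv()`.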

# Data preprocessing

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
oscars = pd.read_csv('oscars.csv')
oscars.head(3)

Unnamed: 0,year,edition,award,nomination_actor,nomination_country,nomination_character_name,nomination_citation,nomination_producers,nomination_description,film_title,is_winner,acceptance_speech_text,acceptance_speech_url
0,2024,97,Actor In A Leading Role,Adrien Brody,,László Tóth,,,,The Brutalist,True,,http://aaspeechesdb.oscars.org/ics-wpd/url.ash...
1,2024,97,Actor In A Leading Role,Timothée Chalamet,,Bob Dylan,,,,A Complete Unknown,False,,
2,2024,97,Actor In A Leading Role,Colman Domingo,,Divine G,,,,Sing Sing,False,,


In [3]:
rows_number, cols_number = oscars.shape
print(f"Oscars dataset contains {rows_number} rows and {cols_number} columns")

Oscars dataset contains 11995 rows and 13 columns


In [4]:
print(f"The percentage of NaN values in the film_title column: {round(oscars.film_title.isna().mean() * 100, 1)}%")

The percentage of NaN values in the film_title column: 10.9%


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">
    
Some award categories, such as the *Honorary Award* or the *Jean Hersholt Humanitarian Award*, are not associated with a specific film. About 10.9% of the rows in the dataset fall into this group (*film_title* is NaN), so we remove them.
</div>

In [5]:
oscars.dropna(subset=['film_title'], inplace=True)

In [6]:
print("The percentage of NaN values in the columns of the dataset:")
oscars.isna().mean().sort_values(ascending=False) * 100

The percentage of NaN values in the columns of the dataset:


nomination_citation          100.000000
nomination_description       100.000000
acceptance_speech_text       100.000000
nomination_country            99.719311
nomination_producers          87.846183
acceptance_speech_url         83.327096
nomination_character_name     82.718937
nomination_actor              12.453219
year                           0.000000
edition                        0.000000
award                          0.000000
is_winner                      0.000000
film_title                     0.000000
dtype: float64

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

The columns *nomination_citation*, *nomination_description*, and *acceptance_speech_text* are entirely NaN, and *nomination_country* is 99.7% NaN. In addition, the *acceptance_speech_url* data is not relevant to our analysis. We remove all five columns.
</div>

In [7]:
oscars.drop(
    columns=[
        'nomination_citation',
        'nomination_description',
        'acceptance_speech_text',
        'nomination_country',
        'acceptance_speech_url'
    ],
    inplace=True
)

In [8]:
print("The percentage of NaN values in the columns of the dataset:")
oscars.isna().mean().sort_values(ascending=False) * 100

The percentage of NaN values in the columns of the dataset:


nomination_producers         87.846183
nomination_character_name    82.718937
nomination_actor             12.453219
year                          0.000000
award                         0.000000
edition                       0.000000
film_title                    0.000000
is_winner                     0.000000
dtype: float64

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

We can also see that whenever a row names the people involved in a nomination, they appear in either the *nomination_actor* column or the *nomination_producers* column, never in both. The check below confirms that no row has both columns filled.
</div>

In [9]:
any(~oscars.nomination_actor.isna().values & ~oscars.nomination_producers.isna().values)

False

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

Thus, we can combine these two columns into one and name it *nomination_people*.
</div>

In [10]:
oscars['nomination_people'] = oscars.nomination_actor.fillna(oscars.nomination_producers)

In [11]:
oscars.drop(
    columns=[
        'nomination_actor',
        'nomination_producers',
    ],
    inplace=True
)

In [12]:
print("The percentage of NaN values in the columns of the dataset:")
oscars.isna().mean().sort_values(ascending=False) * 100

The percentage of NaN values in the columns of the dataset:


nomination_character_name    82.718937
nomination_people             0.299401
year                          0.000000
award                         0.000000
edition                       0.000000
film_title                    0.000000
is_winner                     0.000000
dtype: float64

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

The column *nomination_people* contains highly inconsistent textual data: sometimes a single person, sometimes multiple people, sometimes people sharing one last name, and sometimes entries that are not people at all (countries, groups, studios, etc.). Because of this wide variety of formats, we need a robust parsing function that can handle all these scenarios.

The function below normalizes the text, detects and expands shared-last-name patterns, filters out non-human entries, and finally returns a clean list of individual people (each represented as "First Last") for every row.
</div>

In [13]:
def parse_people(text):
    """
    Parse a free-form text field containing names or groups
    and extract a clean list of human names (first + last name).
    
    Rules applied:
    - A valid person must consist of at least two words (e.g., "John Smith").
    - Single words (e.g., "Russia", "Pixar") are ignored.
    - Group indicators (team, company, studio…) are ignored.
    - Handles the pattern "Joel and Ethan Coen" by expanding to:
      ["Joel Coen", "Ethan Coen"].
    """

    # Return empty list for empty or non-string values
    if pd.isna(text) or not isinstance(text, str) or not text.strip():
        return []

    t = text.strip()

    # 1. Normalize connectors: replace "&" with "and".
    t = re.sub(r'\s*&\s*', ' and ', t)

    # 2. Expand shared last names while "and" is still in the text:
    #    "Joel and Ethan Coen" -> "Joel Coen, Ethan Coen".
    #    The pattern requires a single word before "and" and exactly
    #    two words after it, so "John Smith and Jane Doe" is untouched.
    t = re.sub(
        r"(^|,\s*)([\w.'-]+)\s+and\s+([\w.'-]+)\s+([\w.'-]+)(?=\s*(?:,|$))",
        r'\1\2 \4, \3 \4',
        t,
        flags=re.I
    )

    # 3. Turn the remaining "and" connectors into commas, so that
    #    "A and B" becomes a comma-separated list that is easy to split.
    t = re.sub(r'\s+and\s+', ', ', t, flags=re.I)

    # 4. Split on commas into individual candidate segments
    parts = [p.strip() for p in t.split(',') if p.strip()]

    persons = []

    for part in parts:

        # 5. Ignore group-like entities:
        #    This removes segments such as "The VFX Team",
        #    "Disney Studio", "Editing Crew", etc.
        if re.search(r'\b(group|team|band|studio|company|committee|ensemble|crew)\b',
                     part, flags=re.I):
            continue

        # 6. Regular case:
        #    Accept only segments that contain at least two words
        #    to filter out countries, organizations, nicknames, etc.
        if len(part.split()) >= 2:
            persons.append(part)

    return persons
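
The shared-last-name rule described in the docstring can be exercised on its own. The following is a minimal sketch, assuming a hypothetical helper `expand_shared_surname` that is not part of the notebook. It has to run on the raw text, before any `and` is turned into a comma, because the pattern is no longer visible after that step.

```python
import re

# Name tokens may contain letters, digits, dots, apostrophes and hyphens.
_NAME = r"[\w.'-]+"

# "<First> and <First> <Last>" as a whole segment: at the start of the text
# or right after a comma, and followed by a comma or the end of the text.
_SHARED_SURNAME = re.compile(
    rf"(^|,\s*)({_NAME})\s+and\s+({_NAME})\s+({_NAME})(?=\s*(?:,|$))",
    flags=re.I,
)


def expand_shared_surname(text):
    """Rewrite 'Joel and Ethan Coen' as 'Joel Coen, Ethan Coen'."""
    return _SHARED_SURNAME.sub(r"\1\2 \4, \3 \4", text)


print(expand_shared_surname("Joel and Ethan Coen"))      # Joel Coen, Ethan Coen
print(expand_shared_surname("John Smith and Jane Doe"))  # unchanged
```

Requiring a single word before `and` keeps two full names joined by `and` intact. The cost is that a list of one-word entries followed by a two-word entry would also be expanded, so this remains a heuristic.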

In [14]:
oscars['people_list'] = oscars.nomination_people.apply(parse_people)
empty_people_rows = oscars[oscars['people_list'].apply(lambda x: len(x) == 0)]
print(f"The dataset contains only {len(empty_people_rows)} rows in which people_list is empty")

The dataset contains only 481 rows in which people_list is empty


<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

We applied the parsing function to all rows, saved the result into the *people_list* column, and then extracted the rows for which the list of people turned out to be empty. There were only 481 such rows, so we saved them and manually verified that they indeed do not contain any human names and our function works as expected.
</div>

In [15]:
# empty_people_rows.to_csv('empty_rows.csv', index=False)

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

Now we can remove the *nomination_people* column, since *people_list* replaces it.
</div>

In [16]:
oscars.drop(
    columns=[
        'nomination_people',
    ],
    inplace=True
)

In [17]:
oscars.dtypes

year                          int64
edition                       int64
award                        object
nomination_character_name    object
film_title                   object
is_winner                      bool
people_list                  object
dtype: object

In [18]:
oscars.head()

Unnamed: 0,year,edition,award,nomination_character_name,film_title,is_winner,people_list
0,2024,97,Actor In A Leading Role,László Tóth,The Brutalist,True,[Adrien Brody]
1,2024,97,Actor In A Leading Role,Bob Dylan,A Complete Unknown,False,[Timothée Chalamet]
2,2024,97,Actor In A Leading Role,Divine G,Sing Sing,False,[Colman Domingo]
3,2024,97,Actor In A Leading Role,Lawrence,Conclave,False,[Ralph Fiennes]
4,2024,97,Actor In A Leading Role,Donald Trump,The Apprentice,False,[Sebastian Stan]


# Database creation

In [19]:
import os
from sqlalchemy import create_engine, text, DDL
from dotenv import load_dotenv
from sqlalchemy.exc import IntegrityError

In [20]:
load_dotenv()

True

In [21]:
password = os.getenv("PASSWORD")
user = os.getenv("USERNAME_DB")
db = os.getenv("NAME_DB")
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@postgresql-edu.in.centralelille.fr:5432/{db}")
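
If the password contains characters such as `@`, `/` or `:`, placing it directly in the URL can cause the URL to be parsed incorrectly. One possible safeguard is to percent-encode the credentials first. This is a minimal sketch using only the standard library; `build_db_url` is a hypothetical helper, and the host and port are copied from the connection string above.

```python
from urllib.parse import quote

# Host and port copied from the connection string in the notebook.
HOST = "postgresql-edu.in.centralelille.fr"
PORT = 5432


def build_db_url(user, password, db):
    """Build a SQLAlchemy-style URL with percent-encoded credentials."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql+psycopg2://{safe_user}:{safe_password}@{HOST}:{PORT}/{db}"


# Illustrative values only, not real credentials.
print(build_db_url("alice", "p@ss/word", "films_db"))
```

The resulting string could then be passed to `create_engine` in place of the f-string.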

In [22]:
# Optional reset: drop the schema and everything in it before recreating it
with engine.begin() as conn:
    conn.execute(text("""DROP SCHEMA IF EXISTS films CASCADE;"""))

In [23]:
with engine.begin() as conn:
    conn.execute(text("""CREATE SCHEMA films;"""))
    conn.execute(text("""set search_path = 'films'"""))
   
    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Person (
        person_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Film (
        film_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        title TEXT NOT NULL UNIQUE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Award (
        award_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        award_name TEXT NOT NULL UNIQUE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Edition (
        year INTEGER PRIMARY KEY,
        edition INTEGER NOT NULL UNIQUE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Nomination (
        nom_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        year INTEGER NOT NULL,
        film_id INTEGER NOT NULL,
        award_id INTEGER NOT NULL,
        is_winner INTEGER NOT NULL CHECK (is_winner IN (0, 1)),
        UNIQUE(year, film_id, award_id),
        FOREIGN KEY(year) REFERENCES Edition(year) ON DELETE CASCADE,
        FOREIGN KEY(film_id) REFERENCES Film(film_id) ON DELETE CASCADE,
        FOREIGN KEY(award_id) REFERENCES Award(award_id) ON DELETE CASCADE
    ); """))

    conn.execute(text("""
    CREATE TABLE IF NOT EXISTS Nomination_person (
        nomination_id INTEGER NOT NULL,
        person_id INTEGER NOT NULL,
        character TEXT,
        PRIMARY KEY (nomination_id, person_id),
        FOREIGN KEY (nomination_id) REFERENCES Nomination(nom_id) ON DELETE CASCADE,
        FOREIGN KEY (person_id) REFERENCES Person(person_id) ON DELETE CASCADE
    ); """))
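
Every foreign key in this schema is declared with `ON DELETE CASCADE`, so deleting a parent row also removes the rows that reference it. Below is a minimal sketch of that behavior, using Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server. The schema is cut down to two tables, and `INTEGER PRIMARY KEY` replaces `GENERATED ALWAYS AS IDENTITY`, which SQLite does not support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
    CREATE TABLE Film (
        film_id INTEGER PRIMARY KEY,
        title TEXT NOT NULL UNIQUE
    );
    CREATE TABLE Nomination (
        nom_id INTEGER PRIMARY KEY,
        film_id INTEGER NOT NULL,
        is_winner INTEGER NOT NULL CHECK (is_winner IN (0, 1)),
        FOREIGN KEY (film_id) REFERENCES Film(film_id) ON DELETE CASCADE
    );
    INSERT INTO Film (title) VALUES ('The Brutalist'), ('Sing Sing');
    INSERT INTO Nomination (film_id, is_winner) VALUES (1, 1), (2, 0);
""")

# Deleting a film also deletes the nominations that reference it.
conn.execute("DELETE FROM Film WHERE title = 'Sing Sing'")
remaining = conn.execute("SELECT film_id, is_winner FROM Nomination").fetchall()
print(remaining)  # only the nomination for film_id 1 is left
```

In the full schema the same rule chains further: removing a `Nomination` row also removes its `Nomination_person` links.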

# Table population

In [24]:
import io

# We need a low-level DB connection because we will use PostgreSQL's COPY via psycopg2 cursor.
# SQLAlchemy's high-level Connection doesn't expose copy_expert, so we get the raw DB connection.
raw_conn = engine.raw_connection()
cursor = raw_conn.cursor()

# ---------------------------------------------------------
# 1. Prepare data for COPY
# ---------------------------------------------------------

# Make a working copy of the DataFrame so we don't modify the original 'oscars'
df = oscars.copy()

# COPY needs people_list as plain text, so each Python list is joined into a
# single comma-separated string. parse_people always returns a list, so the
# str(x) fallback is only a safeguard (note that it would turn NaN into 'nan').
# map(str, x) converts any non-string entries to strings before joining.
df["people_list"] = df["people_list"].apply(
    lambda x: ",".join(map(str, x)) if isinstance(x, list) else str(x)
)

# Convert the DataFrame to an in-memory CSV. COPY FROM STDIN expects a file-like object.
# We omit the header because COPY without the HEADER option treats every line as data;
# columns are matched to tmp_oscars by position, so the export order must match the table.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False, header=False)   # header=False: data rows only
csv_buffer.seek(0)  # rewind to start so copy_expert reads from the beginning

# ---------------------------------------------------------
# 2. Load into temporary table tmp_oscars via COPY
# ---------------------------------------------------------

# Create a temporary table that mirrors the columns we will COPY.
# TEMP tables live only for the session and are convenient for bulk-loading and transformation.
cursor.execute("""
    DROP TABLE IF EXISTS tmp_oscars;
    CREATE TEMP TABLE tmp_oscars (
        year INTEGER,
        edition INTEGER,
        award TEXT,
        nomination_character_name TEXT,
        film_title TEXT,
        is_winner BOOLEAN,
        people_list TEXT
    );
""")

# Use psycopg2's copy_expert to run a COPY FROM STDIN. This is very fast for bulk loads.
# Because we passed header=False above, COPY will read rows in the exact column order we exported.
cursor.copy_expert(
    """
    COPY tmp_oscars
    FROM STDIN WITH (FORMAT CSV);
    """,
    csv_buffer
)

# Commit the COPY - required because we're using raw_connection outside SQLAlchemy transactional context.
raw_conn.commit()

# ---------------------------------------------------------
# 3. Bulk INSERT ... SELECT operations (no Python loops)
# ---------------------------------------------------------

# Insert Editions (one row per year+edition). DISTINCT removes repeated (edition, year) pairs.
# ON CONFLICT (year) DO NOTHING relies on the primary key on year to skip existing rows.
cursor.execute("""
    INSERT INTO Edition (edition, year)
    SELECT DISTINCT edition, year 
    FROM tmp_oscars
    ON CONFLICT (year) DO NOTHING;
""")

# Insert unique awards from tmp_oscars -> Award table
cursor.execute("""
    INSERT INTO Award (award_name)
    SELECT DISTINCT award FROM tmp_oscars
    ON CONFLICT (award_name) DO NOTHING;
""")

# Insert unique films into Film(title)
cursor.execute("""
    INSERT INTO Film (title)
    SELECT DISTINCT film_title FROM tmp_oscars
    ON CONFLICT (title) DO NOTHING;
""")

# Insert nominations by joining the staged data to the Film and Award rows we just inserted.
# Nomination.is_winner is an INTEGER restricted to 0/1, so the boolean is converted here.
# ON CONFLICT (year, film_id, award_id) DO NOTHING prevents duplicate nominations: when
# several staged rows share the same year, film, and award, only one of them is inserted.
cursor.execute("""
    INSERT INTO Nomination (year, film_id, award_id, is_winner)
    SELECT 
        t.year,
        f.film_id,
        a.award_id,
        CASE WHEN t.is_winner THEN 1 ELSE 0 END
    FROM tmp_oscars t
    JOIN Film f ON f.title = t.film_title
    JOIN Award a ON a.award_name = t.award
    ON CONFLICT (year, film_id, award_id) DO NOTHING;
""")

# ---------------------------------------------------------
# 4. Insert people and link them to nominations
# We need to split people_list back into individual names.
# ---------------------------------------------------------

# First: expand people_list into separate rows and insert unique Person names
cursor.execute("""
    WITH expanded AS (
        SELECT 
            nom.nom_id,
            unnest(string_to_array(t.people_list, ',')) AS person_name,
            t.nomination_character_name AS character
        FROM tmp_oscars t
        JOIN Film f ON f.title = t.film_title
        JOIN Award a ON a.award_name = t.award
        JOIN Nomination nom 
          ON nom.year = t.year 
         AND nom.film_id = f.film_id 
         AND nom.award_id = a.award_id
    )
    INSERT INTO Person (name)
    SELECT DISTINCT trim(person_name)
    FROM expanded
    ON CONFLICT (name) DO NOTHING;
""")
# Notes:
# - string_to_array(t.people_list, ',') splits the comma-separated string into text[]
# - unnest(...) expands the array into rows
# - trim(...) removes leading/trailing whitespace from names
# Caveat: splitting on commas would mis-split a name that itself contains a comma (e.g., "Surname, Jr.").
# parse_people already splits its input on commas, so the names stored in people_list never contain one.

# Now insert the association rows into Nomination_person
cursor.execute("""
    WITH expanded AS (
        SELECT 
            nom.nom_id,
            trim(unnest(string_to_array(t.people_list, ','))) AS person_name,
            t.nomination_character_name AS character
        FROM tmp_oscars t
        JOIN Film f ON f.title = t.film_title
        JOIN Award a ON a.award_name = t.award
        JOIN Nomination nom 
          ON nom.year = t.year 
         AND nom.film_id = f.film_id 
         AND nom.award_id = a.award_id
    )
    INSERT INTO Nomination_person (nomination_id, person_id, character)
    SELECT 
        e.nom_id,
        p.person_id,
        e.character
    FROM expanded e
    JOIN Person p ON p.name = trim(e.person_name)
    ON CONFLICT (nomination_id, person_id) DO NOTHING;
""")
# Notes:
# - We join expanded rows to Person to obtain person_id
# - ON CONFLICT prevents duplicates if the same person-nomination pair already exists

# ---------------------------------------------------------
# 5. Finalize: commit and clean up low-level resources
# ---------------------------------------------------------
raw_conn.commit()
cursor.close()
raw_conn.close()
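
The load above follows a staging pattern: rows are first copied into a temporary table, and each target table is then filled with `INSERT ... SELECT DISTINCT`, with conflicts skipped so that the step can be rerun safely. Below is a minimal sketch of that idea, using Python's built-in `sqlite3` as a stand-in for PostgreSQL. SQLite's `INSERT OR IGNORE` plays the role of `ON CONFLICT ... DO NOTHING` here, and the staging table is reduced to two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tmp_oscars (award TEXT, film_title TEXT);
    CREATE TABLE Award (
        award_id INTEGER PRIMARY KEY,
        award_name TEXT NOT NULL UNIQUE
    );
    INSERT INTO tmp_oscars VALUES
        ('Actor In A Leading Role', 'The Brutalist'),
        ('Actor In A Leading Role', 'A Complete Unknown'),
        ('Actor In A Leading Role', 'Sing Sing');
""")

load_awards = """
    INSERT OR IGNORE INTO Award (award_name)
    SELECT DISTINCT award FROM tmp_oscars
"""

# Run the load twice: the second run finds the name already present
# and skips it instead of raising a UNIQUE violation.
conn.execute(load_awards)
conn.execute(load_awards)

award_count = conn.execute("SELECT COUNT(*) FROM Award").fetchone()[0]
print(award_count)  # 1
```

`DISTINCT` removes duplicates within one load, while the conflict clause protects against rows that are already in the table from an earlier run.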

# Queries (part 1)

In [25]:
def run_query(query):
    text_query = text(query)
    return pd.read_sql_query(text_query, engine)

<div style="border: 2px solid #4CAF50; padding: 10px; border-radius: 8px; background-color:#f0fff0">

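Preview the first rows of the *Nomination_person* table to check that the links between nominations and people were loaded.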
</div>

In [31]:
run_query("SELECT * FROM Nomination_person LIMIT 20;").head()

Unnamed: 0,nomination_id,person_id,character
0,1,77,László Tóth
1,2,7649,Bob Dylan
2,3,1544,Divine G
3,4,6284,Lawrence
4,5,7110,Donald Trump
