# Assignment 09: Join and Merge in SQL (SQLite Version)

## Due 09 April 2025

### Introduction

For this assignment, you will continue working with SQL databases using SQLite. You should use Python to write the SQL queries. If possible, please submit your answers in HTML format. The data and questions are listed below.

In [1]:
import sqlite3
import pandas as pd

# Create in-memory database
conn = sqlite3.connect(':memory:')

# Create tables
conn.execute('''
CREATE TABLE directors (
    director_id INTEGER PRIMARY KEY AUTOINCREMENT,
    director_name TEXT,
    country TEXT,
    birth_year INTEGER,
    awards INTEGER
)''')

conn.execute('''
CREATE TABLE movies (
    movie_id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    director_id INTEGER,
    release_year INTEGER,
    box_office REAL,
    rating REAL,
    FOREIGN KEY (director_id) REFERENCES directors(director_id)
)''')

# Insert data
directors_data = [
    ('Christopher Nolan', 'UK', 1970, 5),
    ('Greta Gerwig', 'USA', 1983, 3),
    ('Bong Joon-ho', 'South Korea', 1969, 4),
    ('Sofia Coppola', 'USA', 1971, 2),
    ('Pedro Almodóvar', 'Spain', 1949, 6),
    ('Agnès Varda', 'France', 1928, 4)
]
conn.executemany('INSERT INTO directors (director_name, country, birth_year, awards) VALUES (?,?,?,?)', directors_data)

movies_data = [
    ('Oppenheimer', 1, 2023, 950000000.00, 8.5),
    ('Barbie', 2, 2023, 1440000000.00, 7.0),
    ('Parasite', 3, 2019, 258773645.00, 8.9),
    ('Lost in Translation', 4, 2003, 119723856.00, 7.7),
    ('Pain and Glory', 5, 2019, 38219573.00, 7.5),
    ('Faces Places', 6, 2017, 903996.00, 7.9),
    ('Inception', 1, 2010, 836836967.00, 8.8),
    ('Lady Bird', 2, 2017, 78965367.00, 7.4)
]
conn.executemany('''
    INSERT INTO movies (title, director_id, release_year, box_office, rating)
    VALUES (?,?,?,?,?)''', movies_data)
conn.commit()
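
A side note on the setup above: SQLite accepts the `FOREIGN KEY` clause in `movies`, but standard builds do not enforce it unless foreign key support is switched on for the connection. The sketch below uses a pared-down, hypothetical version of the schema and a made-up movie row to show the effect of `PRAGMA foreign_keys = ON`. It is not part of the assignment.

```python
import sqlite3

# Hypothetical, pared-down version of the assignment schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default
conn.execute(
    "CREATE TABLE directors (director_id INTEGER PRIMARY KEY, director_name TEXT)"
)
conn.execute("""
CREATE TABLE movies (
    movie_id INTEGER PRIMARY KEY,
    title TEXT,
    director_id INTEGER,
    FOREIGN KEY (director_id) REFERENCES directors(director_id)
)""")
conn.execute("INSERT INTO directors (director_name) VALUES ('Greta Gerwig')")

try:
    # director_id 99 does not exist, so the insert should be rejected.
    conn.execute(
        "INSERT INTO movies (title, director_id) VALUES ('Made-up Movie', 99)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

Without the pragma, the same insert would be expected to succeed and leave a movie that no join on `director_id` can match.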

1. Write a query using `INNER JOIN` to display the movie title, director name, and box office earnings for all movies, ordered by box office earnings in descending order.

In [2]:
# Write your answer here
table1 = pd.read_sql_query("""
SELECT 
    movies.title AS movie_title,
    directors.director_name AS director_name,
    movies.box_office AS box_office_earnings
FROM 
    movies
INNER JOIN 
    directors
ON 
    movies.director_id = directors.director_id
ORDER BY 
    movies.box_office DESC;
""", conn)
print(table1)

           movie_title      director_name  box_office_earnings
0               Barbie       Greta Gerwig         1.440000e+09
1          Oppenheimer  Christopher Nolan         9.500000e+08
2            Inception  Christopher Nolan         8.368370e+08
3             Parasite       Bong Joon-ho         2.587736e+08
4  Lost in Translation      Sofia Coppola         1.197239e+08
5            Lady Bird       Greta Gerwig         7.896537e+07
6       Pain and Glory    Pedro Almodóvar         3.821957e+07
7         Faces Places        Agnès Varda         9.039960e+05


2. Using a `LEFT JOIN`, find all directors and count the number of movies they have directed.

In [3]:
# Write your answer here
table2 = pd.read_sql_query("""
SELECT 
    directors.director_name AS director_name,
    COUNT(movies.movie_id) AS movie_count
FROM 
    directors
LEFT JOIN 
    movies
ON 
    directors.director_id = movies.director_id
GROUP BY 
    directors.director_id
ORDER BY 
    movie_count DESC;
""", conn)
print(table2)

       director_name  movie_count
0  Christopher Nolan            2
1       Greta Gerwig            2
2       Bong Joon-ho            1
3      Sofia Coppola            1
4    Pedro Almodóvar            1
5        Agnès Varda            1
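
Every director in the data has at least one movie, so the output above does not show why `COUNT(movies.movie_id)` is used rather than `COUNT(*)`. A minimal sketch, assuming a made-up director with no movies, illustrates the difference: `COUNT(column)` skips the NULLs that `LEFT JOIN` produces for unmatched rows, while `COUNT(*)` counts the NULL-padded row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directors (director_id INTEGER PRIMARY KEY, director_name TEXT)"
)
conn.execute(
    "CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT, director_id INTEGER)"
)
# The second director is hypothetical and has no movies.
conn.executemany(
    "INSERT INTO directors VALUES (?, ?)",
    [(1, "Greta Gerwig"), (2, "Hypothetical Director")],
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, "Barbie", 1), (2, "Lady Bird", 1)],
)

rows = conn.execute("""
SELECT d.director_name,
       COUNT(m.movie_id) AS movie_count,  -- ignores NULLs from unmatched rows
       COUNT(*)          AS row_count     -- counts the NULL-padded row too
FROM directors d
LEFT JOIN movies m ON d.director_id = m.director_id
GROUP BY d.director_id
ORDER BY d.director_id
""").fetchall()

print(rows)
# [('Greta Gerwig', 2, 2), ('Hypothetical Director', 0, 1)]
```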


3. Write a self join (the `movies` table joined to itself) to compare the ratings of movies by the same director. Show only pairs where the second movie has a higher rating than the first.

In [4]:
# Write your answer here
table3 = pd.read_sql_query("""
SELECT 
    m1.title AS movie_1,
    m2.title AS movie_2,
    m1.rating AS rating_1,
    m2.rating AS rating_2,
    d.director_name AS director_name
FROM 
    movies m1
JOIN 
    movies m2
ON 
    m1.director_id = m2.director_id AND m1.movie_id < m2.movie_id
JOIN 
    directors d
ON 
    m1.director_id = d.director_id
WHERE 
    m2.rating > m1.rating
ORDER BY 
    d.director_name, m1.title, m2.title;
""", conn)
print(table3)

       movie_1    movie_2  rating_1  rating_2      director_name
0  Oppenheimer  Inception       8.5       8.8  Christopher Nolan
1       Barbie  Lady Bird       7.0       7.4       Greta Gerwig


4. Using appropriate joins, find directors who have made movies with above-average box office earnings (compared to all movies in the database).

In [5]:
# Write your answer here
table4 = pd.read_sql_query("""
SELECT 
    d.director_name AS director_name,
    m.title AS movie_title,
    m.box_office AS box_office_earnings
FROM 
    movies m
JOIN 
    directors d
ON 
    m.director_id = d.director_id
WHERE 
    m.box_office > (SELECT AVG(box_office) FROM movies)
ORDER BY 
    m.box_office DESC;
""", conn)
print(table4)

       director_name  movie_title  box_office_earnings
0       Greta Gerwig       Barbie         1.440000e+09
1  Christopher Nolan  Oppenheimer         9.500000e+08
2  Christopher Nolan    Inception         8.368370e+08
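
The question asks for directors, while the query above returns one row per qualifying movie, so Christopher Nolan appears twice. If a de-duplicated list is wanted, one possible variant is `SELECT DISTINCT`. The sketch below rebuilds only the columns it needs from the assignment data and is an alternative reading of the question, not the required answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directors (director_id INTEGER PRIMARY KEY, director_name TEXT)"
)
conn.execute(
    "CREATE TABLE movies (title TEXT, director_id INTEGER, box_office REAL)"
)
conn.executemany("INSERT INTO directors VALUES (?, ?)", [
    (1, "Christopher Nolan"), (2, "Greta Gerwig"), (3, "Bong Joon-ho"),
    (4, "Sofia Coppola"), (5, "Pedro Almodóvar"), (6, "Agnès Varda"),
])
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    ("Oppenheimer", 1, 950000000.00), ("Barbie", 2, 1440000000.00),
    ("Parasite", 3, 258773645.00), ("Lost in Translation", 4, 119723856.00),
    ("Pain and Glory", 5, 38219573.00), ("Faces Places", 6, 903996.00),
    ("Inception", 1, 836836967.00), ("Lady Bird", 2, 78965367.00),
])

# DISTINCT collapses directors with several above-average movies into one row.
names = conn.execute("""
SELECT DISTINCT d.director_name
FROM directors d
JOIN movies m ON m.director_id = d.director_id
WHERE m.box_office > (SELECT AVG(box_office) FROM movies)
ORDER BY d.director_name
""").fetchall()

print(names)
# [('Christopher Nolan',), ('Greta Gerwig',)]
```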


5. Create a query using `CROSS JOIN` to show all possible combinations of directors and movies, even if they did not direct them. Limit the output to 10 rows.

In [6]:
# Write your answer here
table5 = pd.read_sql_query("""
SELECT 
    d.director_name AS director_name,
    m.title AS movie_title
FROM 
    directors d
CROSS JOIN 
    movies m
LIMIT 10;
""", conn)
print(table5)

       director_name          movie_title
0  Christopher Nolan          Oppenheimer
1  Christopher Nolan               Barbie
2  Christopher Nolan             Parasite
3  Christopher Nolan  Lost in Translation
4  Christopher Nolan       Pain and Glory
5  Christopher Nolan         Faces Places
6  Christopher Nolan            Inception
7  Christopher Nolan            Lady Bird
8       Greta Gerwig          Oppenheimer
9       Greta Gerwig               Barbie
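
Only 10 rows are shown because of `LIMIT 10`. A `CROSS JOIN` has no join condition, so the full result has one row for every director-movie combination: 6 directors × 8 movies = 48 rows. A quick check, using stand-in tables that hold only id columns (an assumption made to keep the sketch short):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in tables with the same row counts as the assignment data.
conn.execute("CREATE TABLE directors (director_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE movies (movie_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO directors VALUES (?)", [(i,) for i in range(1, 7)])
conn.executemany("INSERT INTO movies VALUES (?)", [(i,) for i in range(1, 9)])

# Every director row is paired with every movie row.
total = conn.execute(
    "SELECT COUNT(*) FROM directors CROSS JOIN movies"
).fetchone()[0]

print(total)
# 48
```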


6. Write a query that uses `UNION` to create a list of all director names and movie titles in a single column. Label the column `name` and include a column (called `type`) indicating if it is a director or movie. Order the results by type and name.

In [7]:
# Write your answer here
table6 = pd.read_sql_query("""
SELECT 
    director_name AS name,
    'Director' AS type
FROM 
    directors
UNION
SELECT 
    title AS name,
    'Movie' AS type
FROM 
    movies
ORDER BY 
    type, name;
""", conn)
print(table6)

                   name      type
0           Agnès Varda  Director
1          Bong Joon-ho  Director
2     Christopher Nolan  Director
3          Greta Gerwig  Director
4       Pedro Almodóvar  Director
5         Sofia Coppola  Director
6                Barbie     Movie
7          Faces Places     Movie
8             Inception     Movie
9             Lady Bird     Movie
10  Lost in Translation     Movie
11          Oppenheimer     Movie
12       Pain and Glory     Movie
13             Parasite     Movie
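
`UNION` also removes duplicate rows from the combined result, which has no visible effect here because every name is unique. A minimal sketch, assuming a made-up duplicate movie title, contrasts it with `UNION ALL`, which keeps every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE directors (director_name TEXT)")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.executemany("INSERT INTO directors VALUES (?)", [("Greta Gerwig",)])
# The repeated title is hypothetical, added only to show de-duplication.
conn.executemany("INSERT INTO movies VALUES (?)", [("Barbie",), ("Barbie",)])

query = """
SELECT director_name AS name, 'Director' AS type FROM directors
{op}
SELECT title AS name, 'Movie' AS type FROM movies
ORDER BY type, name
"""

union_rows = conn.execute(query.format(op="UNION")).fetchall()
union_all_rows = conn.execute(query.format(op="UNION ALL")).fetchall()

print(union_rows)
# [('Greta Gerwig', 'Director'), ('Barbie', 'Movie')]
print(union_all_rows)
# [('Greta Gerwig', 'Director'), ('Barbie', 'Movie'), ('Barbie', 'Movie')]
```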


7. Using appropriate joins, find the director with the highest average movie rating. Show only the row with the director's name, average rating, and number of movies.

In [8]:
# Write your answer here
table7 = pd.read_sql_query("""
SELECT 
    d.director_name AS director_name,
    ROUND(AVG(m.rating), 2) AS average_rating,
    COUNT(m.movie_id) AS number_of_movies
FROM 
    directors d
JOIN 
    movies m
ON 
    d.director_id = m.director_id
GROUP BY 
    d.director_id
ORDER BY 
    average_rating DESC
LIMIT 1;
""", conn)
print(table7)

  director_name  average_rating  number_of_movies
0  Bong Joon-ho             8.9                 1
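
`ORDER BY ... LIMIT 1` returns exactly one row, so if two directors shared the top average, only one of them would be shown. That does not happen with the assignment data. The sketch below uses made-up directors and ratings to show one possible way to keep all tied rows, by comparing each average against the maximum average in a common table expression. The name `director_avg` is an assumption introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directors (director_id INTEGER PRIMARY KEY, director_name TEXT)"
)
conn.execute(
    "CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, director_id INTEGER, rating REAL)"
)
# Hypothetical data: directors A and B tie for the highest average.
conn.executemany(
    "INSERT INTO directors VALUES (?, ?)",
    [(1, "Director A"), (2, "Director B"), (3, "Director C")],
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, 1, 8.0), (2, 2, 8.0), (3, 3, 7.0)],
)

tied = conn.execute("""
WITH director_avg AS (
    SELECT d.director_name,
           AVG(m.rating)     AS average_rating,
           COUNT(m.movie_id) AS number_of_movies
    FROM directors d
    JOIN movies m ON d.director_id = m.director_id
    GROUP BY d.director_id
)
SELECT director_name, average_rating, number_of_movies
FROM director_avg
WHERE average_rating = (SELECT MAX(average_rating) FROM director_avg)
ORDER BY director_name
""").fetchall()

print(tied)
# [('Director A', 8.0, 1), ('Director B', 8.0, 1)]
```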


8. Create a query using `LEFT JOIN` and `IS NULL` to find whether there are directors who have not directed any movies.

In [10]:
# Write your answer here
table8 = pd.read_sql_query("""
SELECT 
    d.director_name AS director_name
FROM 
    directors d
LEFT JOIN 
    movies m
ON 
    d.director_id = m.director_id
WHERE 
    m.movie_id IS NULL;
""", conn)
print(table8)

Empty DataFrame
Columns: [director_name]
Index: []
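
The empty result means every director in the data has at least one movie. To see the `LEFT JOIN ... IS NULL` pattern return a row, the sketch below adds a made-up director with no movies to a small stand-in dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directors (director_id INTEGER PRIMARY KEY, director_name TEXT)"
)
conn.execute(
    "CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT, director_id INTEGER)"
)
# "Hypothetical Director" is invented for this example and has no movies.
conn.executemany(
    "INSERT INTO directors VALUES (?, ?)",
    [(1, "Sofia Coppola"), (2, "Hypothetical Director")],
)
conn.execute("INSERT INTO movies VALUES (1, 'Lost in Translation', 1)")

# Unmatched directors get NULL in every movies column after the LEFT JOIN,
# so filtering on m.movie_id IS NULL keeps only directors without movies.
no_movies = conn.execute("""
SELECT d.director_name
FROM directors d
LEFT JOIN movies m ON d.director_id = m.director_id
WHERE m.movie_id IS NULL
""").fetchall()

print(no_movies)
# [('Hypothetical Director',)]
```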


9. Using appropriate joins, find pairs of movies released in the same year, along with their directors' names. Please do not match a movie with itself.

In [11]:
# Write your answer here
table9 = pd.read_sql_query("""
SELECT 
    m1.title AS movie_1,
    d1.director_name AS director_1,
    m2.title AS movie_2,
    d2.director_name AS director_2,
    m1.release_year AS release_year
FROM 
    movies m1
JOIN 
    movies m2
ON 
    m1.release_year = m2.release_year AND m1.movie_id < m2.movie_id
JOIN 
    directors d1
ON 
    m1.director_id = d1.director_id
JOIN 
    directors d2
ON 
    m2.director_id = d2.director_id
ORDER BY 
    m1.release_year, m1.title, m2.title;
""", conn)
print(table9)

        movie_1         director_1         movie_2       director_2  release_year
0  Faces Places        Agnès Varda       Lady Bird     Greta Gerwig          2017
1      Parasite       Bong Joon-ho  Pain and Glory  Pedro Almodóvar          2019
2   Oppenheimer  Christopher Nolan          Barbie     Greta Gerwig          2023
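
The condition `m1.movie_id < m2.movie_id` does two jobs: it stops a movie from matching itself, and it keeps each pair only once. A minimal sketch on a three-movie subset of the data compares it with `<>`, which also excludes self-matches but returns every pair in both orders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT, release_year INTEGER)"
)
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    (1, "Oppenheimer", 2023), (2, "Barbie", 2023), (3, "Parasite", 2019),
])

query = """
SELECT m1.title, m2.title
FROM movies m1
JOIN movies m2
  ON m1.release_year = m2.release_year AND m1.movie_id {cmp} m2.movie_id
ORDER BY m1.title, m2.title
"""

once = conn.execute(query.format(cmp="<")).fetchall()
mirrored = conn.execute(query.format(cmp="<>")).fetchall()

print(once)
# [('Oppenheimer', 'Barbie')]
print(mirrored)
# [('Barbie', 'Oppenheimer'), ('Oppenheimer', 'Barbie')]
```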


10. Show the age of each director when they released their movies. Create a column entitled `age_at_release` in your output. Order the results by the director's name and the movie's release year.

In [12]:
# Write your answer here
table10 = pd.read_sql_query("""
SELECT 
    d.director_name AS director_name,
    m.title AS movie_title,
    m.release_year AS release_year,
    (m.release_year - d.birth_year) AS age_at_release
FROM 
    directors d
JOIN 
    movies m
ON 
    d.director_id = m.director_id
ORDER BY 
    d.director_name, m.release_year;
""", conn)
print(table10)

       director_name          movie_title  release_year  age_at_release
0        Agnès Varda         Faces Places          2017              89
1       Bong Joon-ho             Parasite          2019              50
2  Christopher Nolan            Inception          2010              40
3  Christopher Nolan          Oppenheimer          2023              53
4       Greta Gerwig            Lady Bird          2017              34
5       Greta Gerwig               Barbie          2023              40
6    Pedro Almodóvar       Pain and Glory          2019              70
7      Sofia Coppola  Lost in Translation          2003              32


Good luck! 😃