# Connect to DuckDB

This cell establishes a connection to the DuckDB database file `sakila.duckdb` so we can query the data. The connection uses a relative path from the `notebooks/` folder.


In [1]:
import duckdb
import pandas as pd

# Connect to DuckDB (relative path from notebooks folder)
con = duckdb.connect("../data/sakila.duckdb")




### Preview Staging Tables

This cell lists all tables in the `staging` schema. The first 5 rows of the `actor` table are previewed in the cell after it.


In [2]:
# List every table in the staging schema via the information schema
staging_tables = con.execute(
    "SELECT table_name FROM information_schema.tables WHERE table_schema = 'staging'"
).fetchall()
print("Tables in staging schema:", staging_tables)

Tables in staging schema: [('actor',), ('address',), ('category',), ('city',), ('country',), ('customer',), ('film',), ('film_actor',), ('film_category',), ('inventory',), ('language',), ('payment',), ('rental',), ('staff',), ('store',), ('_dlt_loads',), ('_dlt_pipeline_state',), ('_dlt_version',)]


### First 5 Rows
This cell shows the first 5 rows of the `actor` table, where we can see the actor columns alongside the `_dlt_load_id` and `_dlt_id` columns.

In [4]:

sample_data = con.execute("SELECT * FROM staging.actor LIMIT 5").fetchdf()
sample_data


Unnamed: 0,actor_id,first_name,last_name,last_update,_dlt_load_id,_dlt_id
0,1.0,PENELOPE,GUINESS,2021-03-06 15:51:59,1765206049.4797397,/n6D5pphutgN6g
1,2.0,NICK,WAHLBERG,2021-03-06 15:51:59,1765206049.4797397,vYtmfzoPJ99Pzg
2,3.0,ED,CHASE,2021-03-06 15:51:59,1765206049.4797397,CM6qcu0U+Apl4w
3,4.0,JENNIFER,DAVIS,2021-03-06 15:51:59,1765206049.4797397,SlvWobGEuMou6g
4,5.0,JOHNNY,LOLLOBRIGIDA,2021-03-06 15:51:59,1765206049.4797397,jRE9Gm1dmLgtBQ


### 1a) Identify Movies Longer Than 3 Hours

Objective: Find all movies with a duration greater than 180 minutes and display their title and length.

In [5]:
long_movies = con.execute("""
    SELECT title, length 
    FROM staging.film 
    WHERE length > 180
""").df()
long_movies


Unnamed: 0,title,length
0,ANALYZE HOOSIERS,181
1,BAKED CLEOPATRA,182
2,CATCH AMISTAD,183
3,CHICAGO NORTH,185
4,CONSPIRACY SPIRIT,184
5,CONTROL ANTHEM,185
6,CRYSTAL BREAKING,184
7,DARN FORRESTER,185
8,FRONTIER CABIN,183
9,GANGS PRIDE,185


### 1b) Movies Containing the Word “Love” in the Title

Objective: List movies whose titles contain “love” (matched case-insensitively as a substring, so titles such as LOVELY JINGLE also qualify), along with their rating, length, and description.

In [50]:
love_movies = con.execute("""
    SELECT title, rating, length, description
    FROM staging.film
    WHERE title ILIKE '%love%'
""").df()

love_movies

Unnamed: 0,title,rating,length,description
0,GRAFFITI LOVE,PG,117,A Unbelieveable Epistle of a Sumo Wrestler And...
1,IDAHO LOVE,PG-13,172,A Fast-Paced Drama of a Student And a Crocodil...
2,IDENTITY LOVER,PG-13,119,A Boring Tale of a Composer And a Mad Cow who ...
3,INDIAN LOVE,NC-17,135,A Insightful Saga of a Mad Scientist And a Mad...
4,LAWRENCE LOVE,NC-17,175,A Fanciful Yarn of a Database Administrator An...
5,LOVE SUICIDES,R,181,A Brilliant Panorama of a Hunter And a Explore...
6,LOVELY JINGLE,PG,65,A Fanciful Yarn of a Crocodile And a Forensic ...
7,LOVER TRUMAN,G,75,A Emotional Yarn of a Robot And a Boy who must...
8,LOVERBOY ATTACKS,PG-13,162,A Boring Story of a Car And a Butler who must ...
9,STRANGELOVE DESIRE,NC-17,103,A Awe-Inspiring Panorama of a Lumberjack And a...


### 1c) Descriptive Statistics of Movie Length

Objective: Compute summary statistics for the length column, including the shortest, longest, average, and median movie lengths.

In [51]:
length_stats = con.execute("""
    SELECT
        MIN(length) AS shortest,
        AVG(length) AS average,
        MEDIAN(length) AS median,
        MAX(length) AS longest
    FROM staging.film
""").df()

length_stats


Unnamed: 0,shortest,average,median,longest
0,46,115.272,114.0,185


### 1d) Top 10 Most Expensive Movies to Rent Per Day

Objective: Determine the movies with the highest rental cost per day, computed as `rental_rate` divided by `rental_duration`.

In [52]:
expensive_movies = con.execute("""
    SELECT title, rental_rate, rental_duration, (rental_rate / rental_duration) AS cost_per_day
    FROM staging.film
    ORDER BY cost_per_day DESC
    LIMIT 10
""").df()
expensive_movies

Unnamed: 0,title,rental_rate,rental_duration,cost_per_day
0,ACE GOLDFINGER,4.99,3,1.663333
1,AMERICAN CIRCUS,4.99,3,1.663333
2,AUTUMN CROW,4.99,3,1.663333
3,BACKLASH UNDEFEATED,4.99,3,1.663333
4,BEAST HUNCHBACK,4.99,3,1.663333
5,BEHAVIOR RUNAWAY,4.99,3,1.663333
6,BILKO ANONYMOUS,4.99,3,1.663333
7,CARIBBEAN LIBERTY,4.99,3,1.663333
8,CASPER DRAGONFLY,4.99,3,1.663333
9,CASUALTIES ENCINO,4.99,3,1.663333


### 1e) Top 10 Actors by Number of Movies

Objective: Identify the actors who have appeared in the most films.

In [53]:
top_actors = con.execute("""
    SELECT a.first_name, a.last_name, COUNT(fa.film_id) AS movie_count
    FROM staging.actor a
    JOIN staging.film_actor fa ON a.actor_id = fa.actor_id
    GROUP BY a.actor_id, a.first_name, a.last_name
    ORDER BY movie_count DESC
    LIMIT 10
""").df()

top_actors



Unnamed: 0,first_name,last_name,movie_count
0,GINA,DEGENERES,42
1,WALTER,TORN,41
2,MARY,KEITEL,40
3,MATTHEW,CARREY,39
4,SANDRA,KILMER,37
5,SCARLETT,DAMON,36
6,ANGELA,WITHERSPOON,35
7,GROUCHO,DUNST,35
8,VIVIEN,BASINGER,35
9,VAL,BOLGER,35


### **Custom Question 1: Which film has the highest replacement cost?**

**Objective:**  
Identify the film with the highest *replacement cost*, which indicates the most expensive movie to replace in the inventory.

**Explanation:**  
The `replacement_cost` column represents how much it costs the store to replace a lost or damaged film. Finding the highest value helps the business understand which films are most valuable.

**Query:**


In [None]:
highest_replacement = con.execute("""
    SELECT title, replacement_cost
    FROM staging.film
    ORDER BY replacement_cost DESC
    LIMIT 1
""").df()

highest_replacement

### **Custom Question 2: How many customers do we have in each city?**

**Objective:**  
Understand how the customer base is distributed across cities.

**Explanation:**  
This helps reveal where most of the customers live. Cities with higher customer counts may need more marketing focus, promotions, or store attention.

**Query:**


In [None]:
customers_by_city = con.execute("""
    SELECT ci.city, COUNT(c.customer_id) AS customer_count
    FROM staging.customer c
    JOIN staging.address a ON c.address_id = a.address_id
    JOIN staging.city ci ON a.city_id = ci.city_id
    GROUP BY ci.city_id, ci.city  -- city_id keeps same-named cities separate
    ORDER BY customer_count DESC
""").df()

customers_by_city


### **Custom Question 3: What is the average rental duration for each film rating?**

**Objective:**  
Compare the average allowed rental period (`rental_duration`, in days) across MPAA ratings (G, PG, PG-13, R, NC-17).

**Explanation:**  
This helps reveal whether certain types of movies are offered for longer rental periods (for example, family movies vs. adult-rated movies). Note that `rental_duration` is a property of the film, not a measurement of how long customers actually kept it.

**Query:**


In [None]:
avg_rental_duration = con.execute("""
    SELECT rating, AVG(rental_duration) AS avg_duration
    FROM staging.film
    GROUP BY rating
    ORDER BY avg_duration DESC
""").df()

avg_rental_duration
