## SQL 3: Subqueries and joins

In [1]:
from sqlalchemy import create_engine, text
import pandas as pd
import os
import gc

In [2]:
engine = create_engine("mysql+mysqlconnector://root:abc@127.0.0.1:3306/cs639")
conn = engine.connect()

In [3]:
list(conn.execute(text("show tables;")))

[('name_basics',),
 ('title_akas',),
 ('title_basics',),
 ('title_crew',),
 ('title_episode',),
 ('title_principals',),
 ('title_ratings',)]

### IMDB dataset

- Source: https://datasets.imdbws.com/ 
- The original dataset is too large to analyze on our current VM, so this notebook works with a much smaller subset (`IMDB.zip`, downloaded below)
- Schema information: https://developer.imdb.com/non-commercial-datasets/

In [4]:
!rm IMDB.zip
!rm *.tsv
!wget https://ms.sites.cs.wisc.edu/cs639/data/IMDB.zip
!unzip IMDB.zip

rm: cannot remove 'IMDB.zip': No such file or directory
rm: cannot remove '*.tsv': No such file or directory
--2025-02-10 18:54:49--  https://ms.sites.cs.wisc.edu/cs639/data/IMDB.zip
Resolving ms.sites.cs.wisc.edu (ms.sites.cs.wisc.edu)... 65.8.243.101, 65.8.243.112, 65.8.243.63, ...
Connecting to ms.sites.cs.wisc.edu (ms.sites.cs.wisc.edu)|65.8.243.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 584293 (571K) [application/zip]
Saving to: ‘IMDB.zip’


2025-02-10 18:54:50 (2.87 MB/s) - ‘IMDB.zip’ saved [584293/584293]

Archive:  IMDB.zip
  inflating: name.basics.tsv         
  inflating: title.akas.tsv          
  inflating: title.basics.tsv        
  inflating: title.crew.tsv          
  inflating: title.episode.tsv       
  inflating: title.principals.tsv    
  inflating: title.ratings.tsv       


#### Populating MySQL server with tables corresponding to all tsv files

In [5]:
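files = os.listdir()
# Keep only files whose name ends with ".tsv" (not merely contains it)
tsv_files = [f for f in files if f.endswith(".tsv")]
# Derive table names, e.g. "name.basics.tsv" -> "name_basics"
table_names = [f.replace(".tsv", "").replace(".", "_") for f in tsv_files]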

In [6]:
for idx, tsv_file in enumerate(tsv_files):
    df = pd.read_csv(tsv_file, sep="\t", na_values='\\N')
    df.to_sql(table_names[idx], conn, index=False, if_exists="replace")
    print(f"Populated {table_names[idx]}")

Populated title_basics
Populated title_principals
Populated title_akas
Populated name_basics
Populated title_episode
Populated title_ratings
Populated title_crew


In [7]:
list(conn.execute(text("show tables;")))

[('name_basics',),
 ('title_akas',),
 ('title_basics',),
 ('title_crew',),
 ('title_episode',),
 ('title_principals',),
 ('title_ratings',)]

### Data Analysis

### SQL Subqueries

- What is a subquery?
    - A query contained within another query. The outer query is typically called the "containing statement".
    - A subquery can be used with all four SQL data statements: `SELECT`, `INSERT`, `UPDATE`, `DELETE`.
    - A noncorrelated subquery is executed once, before the containing statement. A correlated subquery is conceptually evaluated once for each candidate row of the containing statement.
    - Subqueries act like temporary tables with statement scope: once the containing statement has finished executing, the data returned by its subqueries is discarded.
    - Subqueries can return:
        - Single row with a single column
        - Multiple rows with a single column
        - Multiple rows with multiple columns
- Types of subqueries:
    1. noncorrelated subqueries: self-contained subqueries
    2. correlated subqueries: reference columns from the containing statement
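
The difference between the two types can be sketched on a toy table. This is a minimal illustration, assuming an in-memory SQLite database in place of the MySQL server; the `titles` table and its rows are hypothetical and are not the IMDB data.

```python
import sqlite3

# Hypothetical toy data, not the IMDB tables used in this notebook.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE titles (title TEXT, genre TEXT, runtime INTEGER)")
con.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", "Drama", 100), ("B", "Drama", 60), ("C", "Comedy", 50), ("D", "Comedy", 30)],
)

# Noncorrelated: the inner query is self-contained and could run on its own.
noncorrelated = con.execute("""
    SELECT title FROM titles
    WHERE runtime > (SELECT AVG(runtime) FROM titles)
    ORDER BY title
""").fetchall()

# Correlated: the inner query references t_outer.genre, so it is
# conceptually re-evaluated for each candidate row of the outer query.
correlated = con.execute("""
    SELECT title FROM titles AS t_outer
    WHERE runtime > (
        SELECT AVG(runtime) FROM titles AS t_inner
        WHERE t_inner.genre = t_outer.genre
    )
    ORDER BY title
""").fetchall()

print(noncorrelated)  # [('A',)]: only A exceeds the overall average of 60
print(correlated)     # [('A',), ('C',)]: each row is compared with its own genre's average
```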

### Noncorrelated subqueries

- What is a scalar subquery?
    - A query returning a result set containing a single row and column.
    - Can be used with comparison operators: `=`, `<`, `<=`, `>`, `>=`, `<>`

#### Single-Row and Single-Column subqueries

#### Q1: What are the titles that have a runtime greater than the average runtime of all titles?

In [8]:
pd.read_sql("""
    SELECT AVG(runtimeMinutes)
    FROM title_basics
    WHERE runtimeMinutes IS NOT NULL
""", conn)

Unnamed: 0,AVG(runtimeMinutes)
0,42.008753


In [9]:
pd.read_sql("""
    SELECT *
    FROM title_basics
    WHERE runtimeMinutes > (
        SELECT AVG(runtimeMinutes)
        FROM title_basics
        WHERE runtimeMinutes IS NOT NULL
    )
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0017504,movie,Unseen Enemies,Unseen Enemies,0,1925.0,,54.0,Western
1,tt0024996,movie,Coming Out Party,Coming Out Party,0,1934.0,,80.0,Drama
2,tt0029553,movie,The Sheik Steps Out,The Sheik Steps Out,0,1937.0,,65.0,Musical
3,tt0035860,movie,The Fallen Sparrow,The Fallen Sparrow,0,1943.0,,94.0,"Film-Noir,Mystery"
4,tt0037142,movie,Oath of Vengeance,Oath of Vengeance,0,1944.0,,57.0,Western
...,...,...,...,...,...,...,...,...,...
355,tt9653828,movie,Arest,Arest,0,2019.0,,126.0,Drama
356,tt9654270,tvSeries,Giardino d'inverno,Giardino d'inverno,0,1961.0,1961.0,120.0,Comedy
357,tt9685774,tvMovie,The Farewell Girls,The Farewell Girls,0,2017.0,,86.0,Drama
358,tt9728774,tvSeries,Innocent the Bhola,Innocent the Bhola,0,2020.0,,98.0,Thriller


#### Q2: What are the most recent movies?

In [10]:
pd.read_sql("""
    SELECT MAX(startYear)
    FROM title_basics
    WHERE titleType = 'movie'
""", conn)

Unnamed: 0,MAX(startYear)
0,2024.0


In [11]:
pd.read_sql("""
    SELECT *
    FROM title_basics
    WHERE startYear = (
        SELECT MAX(startYear)
        FROM title_basics
        WHERE titleType = 'movie'
    ) AND titleType = 'movie'
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt16311360,movie,Krzyk: Losing Control,Krzyk: Losing Control,0,2024.0,,80.0,"Drama,Thriller"
1,tt29009061,movie,Amici per caso,Amici per caso,0,2024.0,,95.0,Comedy
2,tt32848875,movie,Dad and I - Chapter 1: The Life of Timothy J. ...,Dad and I - Chapter 1: The Life of Timothy J. ...,0,2024.0,,111.0,Biography


#### Multiple-Row and Single-Column subqueries

- Operators: `IN`, `NOT IN`, `ALL`, `ANY`

#### Q3: Find the number of titles that have more than one genre.

We can find the number of genres by counting the commas in `genres` and adding 1. Let's first look at the `genres` column.

In [12]:
pd.read_sql("""
    SELECT *
    FROM title_basics
    LIMIT 5
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000912,short,The Indian Runner's Romance,The Indian Runner's Romance,0,1909.0,,11.0,"Short,Western"
1,tt0013001,short,The Cashier,The Cashier,0,1922.0,,,"Animation,Comedy,Short"
2,tt0016344,movie,Shirayuri wa nageku,Shirayuri wa nageku,0,1925.0,,,
3,tt0017504,movie,Unseen Enemies,Unseen Enemies,0,1925.0,,54.0,Western
4,tt0024996,movie,Coming Out Party,Coming Out Party,0,1934.0,,80.0,Drama


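`LENGTH` returns the length of a string. In MySQL the length is measured in bytes; `CHAR_LENGTH` counts characters instead, which matters only for multibyte text.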

In [13]:
pd.read_sql("""
    SELECT genres, LENGTH(genres)
    FROM title_basics
""", conn)

Unnamed: 0,genres,LENGTH(genres)
0,"Short,Western",13.0
1,"Animation,Comedy,Short",22.0
2,,
3,Western,7.0
4,Drama,5.0
...,...,...
2769,Drama,5.0
2770,"Family,Short",12.0
2771,Game-Show,9.0
2772,Comedy,6.0


To find the number of commas, we can replace every comma with an empty string and take the difference between the lengths of the original and the replaced strings.

In [14]:
pd.read_sql("""
    SELECT genres, LENGTH(genres) - LENGTH(REPLACE(genres, ',', '')) + 1
    FROM title_basics
""", conn)

Unnamed: 0,genres,"LENGTH(genres) - LENGTH(REPLACE(genres, ',', '')) + 1"
0,"Short,Western",2.0
1,"Animation,Comedy,Short",3.0
2,,
3,Western,1.0
4,Drama,1.0
...,...,...
2769,Drama,1.0
2770,"Family,Short",2.0
2771,Game-Show,1.0
2772,Comedy,1.0
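
The same arithmetic can be checked in plain Python. This is only an analogy of the SQL expression: the helper name `genre_count` is hypothetical, and `None` stands in for SQL `NULL`, which propagates through the expression.

```python
def genre_count(genres):
    """Mirror LENGTH(g) - LENGTH(REPLACE(g, ',', '')) + 1 for a single value."""
    if genres is None:  # SQL NULL in, NULL out
        return None
    # Removing the commas shortens the string by exactly the number of commas.
    return len(genres) - len(genres.replace(",", "")) + 1

counts = [genre_count(g) for g in ["Short,Western", "Animation,Comedy,Short", None, "Western"]]
print(counts)  # [2, 3, None, 1]
```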


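Now let's put it together. The subquery here has no `FROM` clause; it simply evaluates the expression on the current row's `genres`, so the same filter could also be written directly in the `WHERE` clause.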

In [15]:
pd.read_sql("""
    SELECT COUNT(*)
    FROM title_basics
    WHERE (
        SELECT LENGTH(genres) - LENGTH(REPLACE(genres, ',', '')) + 1
    ) > 1;
""", conn)

Unnamed: 0,COUNT(*)
0,1181


#### Q4: Find the titles that have the maximum number of genres.

In [16]:
pd.read_sql("""
    SELECT primaryTitle, genres
    FROM title_basics
    WHERE (
        SELECT LENGTH(genres) - LENGTH(REPLACE(genres, ',', '')) + 1
    ) = (
        SELECT MAX(LENGTH(genres) - LENGTH(REPLACE(genres, ',', '')) + 1)
        FROM title_basics
    )
""", conn)

Unnamed: 0,primaryTitle,genres
0,The Cashier,"Animation,Comedy,Short"
1,You Bet Your Life,"Comedy,Family,Game-Show"
2,Return of the Seven,"Action,Drama,Western"
3,Kindergeld,"Crime,Drama,Mystery"
4,Mindwarp,"Horror,Sci-Fi,Thriller"
...,...,...
442,Episode dated 21 January 2019,"Documentary,News,Talk-Show"
443,Episode dated 5 November 2018,"Documentary,News,Talk-Show"
444,Frozen and Afraid,"Adventure,Game-Show,Horror"
445,Múmia do Amor,"Adventure,Animation,Comedy"


#### Q5: Find the titles of movies that belong to the same genres as those with a runtime longer than 150 minutes.

In [17]:
pd.read_sql("""
    SELECT genres
    FROM title_basics
    WHERE titleType = 'movie' AND runtimeMinutes > 150 AND genres IS NOT NULL
""", conn)

Unnamed: 0,genres
0,Drama


In [18]:
pd.read_sql("""
    SELECT *
    FROM title_basics
    WHERE genres IN (
        SELECT genres
        FROM title_basics
        WHERE titleType = 'movie' AND runtimeMinutes > 150 AND genres IS NOT NULL
    ) AND titleType = 'movie'
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0024996,movie,Coming Out Party,Coming Out Party,0,1934.0,,80.0,Drama
1,tt0098516,movie,Trois pommes à côté du sommeil,Trois pommes à côté du sommeil,0,1989.0,,98.0,Drama
2,tt0173156,movie,Saajan Ka Ghar,Saajan Ka Ghar,0,1994.0,,153.0,Drama
3,tt0179070,movie,All the King's Horses,All the King's Horses,0,1977.0,,80.0,Drama
4,tt0228992,movie,An Outgoing Woman,Une femme d'extérieur,0,2000.0,,118.0,Drama
5,tt0268446,movie,Mask of Desire,Mukundo,0,2000.0,,105.0,Drama
6,tt0328672,movie,A Yellow Raft in Blue Water,A Yellow Raft in Blue Water,0,,,,Drama
7,tt0347010,movie,Fondovalle,Fondovalle,0,1998.0,,74.0,Drama
8,tt0349688,movie,A Little Bit of Freedom,Kleine Freiheit,0,2003.0,,102.0,Drama
9,tt0371002,movie,"Ne se sardi, choveche","Ne se sardi, choveche",0,1985.0,,81.0,Drama


#### Q6: Find titles of movies that have not received any ratings.

In [19]:
pd.read_sql("SELECT tconst FROM title_ratings", conn)

Unnamed: 0,tconst
0,tt0000912
1,tt0017504
2,tt0024996
3,tt0029553
4,tt0030476
...,...
384,tt9728774
385,tt9758424
386,tt9796264
387,tt9847426


In [20]:
pd.read_sql("""
    SELECT primaryTitle
    FROM title_basics
    WHERE titleType = 'movie' 
    AND tconst NOT IN (
        SELECT tconst
        FROM title_ratings
    )
""", conn)

Unnamed: 0,primaryTitle
0,Shirayuri wa nageku
1,L'ippocampo
2,Diary of a Window
3,Der Elfenbeinturm
4,Zhi Mo Nu
...,...
88,Dreamlock
89,Making Masculine
90,Gado
91,A Song or Two to Make You Feel
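
One caveat with `NOT IN` is worth a sketch: if the subquery returns even one `NULL`, the comparison evaluates to unknown for every non-matching row and the outer query returns nothing. The example below assumes an in-memory SQLite database and hypothetical `basics`/`ratings` tables; the `NULL` row is inserted deliberately to trigger the behavior.

```python
import sqlite3

# Hypothetical toy tables; ratings deliberately contains a NULL tconst.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE basics (tconst TEXT, title TEXT)")
con.execute("CREATE TABLE ratings (tconst TEXT)")
con.executemany("INSERT INTO basics VALUES (?, ?)", [("tt1", "Rated"), ("tt2", "Unrated")])
con.executemany("INSERT INTO ratings VALUES (?)", [("tt1",), (None,)])

# 'tt2' NOT IN ('tt1', NULL) is unknown, not true, so the row is filtered out.
naive = con.execute("""
    SELECT title FROM basics
    WHERE tconst NOT IN (SELECT tconst FROM ratings)
""").fetchall()

# Excluding NULLs inside the subquery restores the intended result.
guarded = con.execute("""
    SELECT title FROM basics
    WHERE tconst NOT IN (SELECT tconst FROM ratings WHERE tconst IS NOT NULL)
""").fetchall()

print(naive)    # []
print(guarded)  # [('Unrated',)]
```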


#### Q7: Find all the titles that have an average rating greater than that of every title released in the year 2005.

In [21]:
pd.read_sql("""
    SELECT *
    FROM title_basics
    WHERE tconst IN (
        SELECT tconst
        FROM title_ratings
        WHERE averageRating > ALL (
            SELECT averageRating
            FROM title_ratings
            WHERE tconst IN (
                SELECT tconst
                FROM title_basics
                WHERE startYear = 2005
            )
        )
    )
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0042171,tvSeries,You Bet Your Life,You Bet Your Life,0,1950.0,1961.0,30.0,"Comedy,Family,Game-Show"
1,tt0108382,short,The Traveling Poet,The Traveling Poet,0,1993.0,,8.0,Short
2,tt0118989,tvMovie,The Ditchdigger's Daughters,The Ditchdigger's Daughters,0,1997.0,,92.0,"Biography,Drama"
3,tt0123134,tvMovie,Kismaszat és a Gézengúzok,Kismaszat és a Gézengúzok,0,1984.0,,74.0,"Adventure,Comedy,Family"
4,tt0128154,movie,Daybreak,Daybreak,0,2002.0,,87.0,"Crime,Mystery,Thriller"
...,...,...,...,...,...,...,...,...,...
143,tt9472276,tvEpisode,The Shepherd,The Shepherd,0,2017.0,,20.0,"Drama,History"
144,tt9506684,tvEpisode,Episode #6.2,Episode #6.2,0,2019.0,,29.0,Comedy
145,tt9728774,tvSeries,Innocent the Bhola,Innocent the Bhola,0,2020.0,,98.0,Thriller
146,tt9758424,tvEpisode,Frozen and Afraid,Frozen and Afraid,0,2019.0,,41.0,"Adventure,Game-Show,Horror"


#### Q8: Find all the titles that have an average rating lower than that of at least one title released in the year 2005.

In [22]:
pd.read_sql("""
    SELECT *
    FROM title_basics
    WHERE tconst IN (
        SELECT tconst
        FROM title_ratings
        WHERE averageRating < ANY (
            SELECT averageRating
            FROM title_ratings
            WHERE tconst IN (
                SELECT tconst
                FROM title_basics
                WHERE startYear = 2005
            )
        )
    )
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000912,short,The Indian Runner's Romance,The Indian Runner's Romance,0,1909.0,,11.0,"Short,Western"
1,tt0017504,movie,Unseen Enemies,Unseen Enemies,0,1925.0,,54.0,Western
2,tt0024996,movie,Coming Out Party,Coming Out Party,0,1934.0,,80.0,Drama
3,tt0029553,movie,The Sheik Steps Out,The Sheik Steps Out,0,1937.0,,65.0,Musical
4,tt0030476,short,Music Made Simple,Music Made Simple,0,1938.0,,8.0,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
214,tt9050522,tvEpisode,It's Okay for Me to Be Moe for Little Sisters,It's Okay for Me to Be Moe for Little Sisters,0,2018.0,,,"Animation,Comedy,Romance"
215,tt9452890,tvEpisode,When Thieves Drop By,When Thieves Drop By,0,2018.0,,1.0,"Action,Adventure,Animation"
216,tt9642604,movie,Los hombres sin rostros,Los hombres sin rostros,0,2016.0,,59.0,Documentary
217,tt9685774,tvMovie,The Farewell Girls,The Farewell Girls,0,2017.0,,86.0,Drama
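
When the subquery is non-empty and contains no `NULL`s, `> ALL (...)` is equivalent to comparing against the subquery's `MAX`, and `< ANY (...)` is also equivalent to comparing against its `MAX` (being lower than at least one value means being lower than the largest). The sketch below assumes an in-memory SQLite database, which does not support `ALL`/`ANY`, so only the `MAX` forms are run; the `ratings` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical toy data: two titles from 2005 rated 6.0 and 8.0.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ratings (tconst TEXT, start_year INTEGER, rating REAL)")
con.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("t1", 2005, 6.0), ("t2", 2005, 8.0), ("t3", 1999, 5.0),
     ("t4", 2010, 7.0), ("t5", 2020, 9.0)],
)

# rating > ALL (2005 ratings)  <=>  rating > MAX(2005 ratings)
above_all = con.execute("""
    SELECT tconst FROM ratings
    WHERE rating > (SELECT MAX(rating) FROM ratings WHERE start_year = 2005)
    ORDER BY tconst
""").fetchall()

# rating < ANY (2005 ratings)  <=>  rating < MAX(2005 ratings)
below_any = con.execute("""
    SELECT tconst FROM ratings
    WHERE rating < (SELECT MAX(rating) FROM ratings WHERE start_year = 2005)
    ORDER BY tconst
""").fetchall()

print(above_all)  # [('t5',)]
print(below_any)  # [('t1',), ('t3',), ('t4',)]
```

The equivalence breaks for an empty subquery: `> ALL` over an empty set is true for every row, while a comparison with `MAX` of an empty set is `NULL` and matches nothing.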


### Correlated subqueries

#### Q9: Find the titles of movies that have a runtime longer than the average runtime of all movies in the same genre (that is, with an identical `genres` value).

In [23]:
pd.read_sql("""
    SELECT primaryTitle, runtimeMinutes, genres
    FROM title_basics tb_outer
    WHERE titleType = 'movie'
    AND runtimeMinutes > (
        SELECT AVG(runtimeMinutes)
        FROM title_basics tb_inner
        WHERE tb_inner.genres = tb_outer.genres
        AND tb_inner.titleType = 'movie'
    )
""", conn)

Unnamed: 0,primaryTitle,runtimeMinutes,genres
0,Oath of Vengeance,57.0,Western
1,Escape from the Planet of the Apes,98.0,"Action,Sci-Fi"
2,Per amore,100.0,"Drama,Romance"
3,Trois pommes à côté du sommeil,98.0,Drama
4,Saajan Ka Ghar,153.0,Drama
5,L'île d'amour,106.0,"Drama,Romance"
6,Proêzas de Satanás na Vila de Leva-e-Traz,100.0,Comedy
7,An Outgoing Woman,118.0,Drama
8,Mask of Desire,105.0,Drama
9,A Little Bit of Freedom,102.0,Drama
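
A correlated subquery of this shape can usually be rewritten as a join against a derived table of per-group averages. The sketch below compares the two forms on hypothetical toy data in an in-memory SQLite database; the `titles` table and the alias `g` are assumptions made for illustration.

```python
import sqlite3

# Hypothetical toy data, not the IMDB tables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE titles (title TEXT, genre TEXT, runtime INTEGER)")
con.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", "Drama", 100), ("B", "Drama", 60), ("C", "Comedy", 50), ("D", "Comedy", 30)],
)

# Correlated form: the average is recomputed per outer row.
correlated = con.execute("""
    SELECT title FROM titles AS t_outer
    WHERE runtime > (
        SELECT AVG(runtime) FROM titles AS t_inner
        WHERE t_inner.genre = t_outer.genre
    )
    ORDER BY title
""").fetchall()

# Join form: compute each genre's average once, then join it back.
joined = con.execute("""
    SELECT t.title
    FROM titles AS t
    JOIN (
        SELECT genre, AVG(runtime) AS avg_runtime
        FROM titles
        GROUP BY genre
    ) AS g ON g.genre = t.genre
    WHERE t.runtime > g.avg_runtime
    ORDER BY t.title
""").fetchall()

print(correlated)  # [('A',), ('C',)]
print(joined)      # [('A',), ('C',)]
```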


### JOINs

### `JOIN` aka `INNER JOIN` 

#### Q10: Find all movies and their corresponding ratings.

In [24]:
pd.read_sql("""
    SELECT * FROM title_basics
    WHERE titleType = 'movie'
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0016344,movie,Shirayuri wa nageku,Shirayuri wa nageku,0,1925.0,,,
1,tt0017504,movie,Unseen Enemies,Unseen Enemies,0,1925.0,,54.0,Western
2,tt0024996,movie,Coming Out Party,Coming Out Party,0,1934.0,,80.0,Drama
3,tt0029553,movie,The Sheik Steps Out,The Sheik Steps Out,0,1937.0,,65.0,Musical
4,tt0035860,movie,The Fallen Sparrow,The Fallen Sparrow,0,1943.0,,94.0,"Film-Noir,Mystery"
...,...,...,...,...,...,...,...,...,...
183,tt8787458,movie,Gado,Gado,0,,,,Western
184,tt8906732,movie,A Song or Two to Make You Feel,A Song or Two to Make You Feel,0,2018.0,,54.0,Music
185,tt9198442,movie,My Hero Academia,My Hero Academia,0,,,,"Action,Adventure,Animation"
186,tt9642604,movie,Los hombres sin rostros,Los hombres sin rostros,0,2016.0,,59.0,Documentary


In [25]:
pd.read_sql("""
    SELECT * FROM title_ratings
""", conn)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000912,4.3,73
1,tt0017504,4.8,27
2,tt0024996,5.9,66
3,tt0029553,6.0,45
4,tt0030476,6.2,81
...,...,...,...
384,tt9728774,8.9,11
385,tt9758424,8.3,58
386,tt9796264,7.7,53
387,tt9847426,7.3,8


In [26]:
pd.read_sql("""
    SELECT b.primaryTitle, r.averageRating
    FROM title_basics b
    JOIN title_ratings r ON b.tconst = r.tconst
    WHERE b.titleType = 'movie'
""", conn)

Unnamed: 0,primaryTitle,averageRating
0,Unseen Enemies,4.8
1,Coming Out Party,5.9
2,The Sheik Steps Out,6.0
3,The Fallen Sparrow,6.6
4,Oath of Vengeance,5.7
...,...,...
90,"Horror, Madness & Mayhem Vol 1 Snuff Party",7.2
91,Natha Pure Aata,4.9
92,Ordinary Gods,8.5
93,Los hombres sin rostros,6.8


### `LEFT JOIN` aka `LEFT OUTER JOIN`

#### Q11: Find all movies and their corresponding ratings. If a movie doesn't have a rating, still include it in the results.

In [27]:
pd.read_sql("""
    SELECT * FROM title_basics
    WHERE titleType = 'movie'
""", conn)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0016344,movie,Shirayuri wa nageku,Shirayuri wa nageku,0,1925.0,,,
1,tt0017504,movie,Unseen Enemies,Unseen Enemies,0,1925.0,,54.0,Western
2,tt0024996,movie,Coming Out Party,Coming Out Party,0,1934.0,,80.0,Drama
3,tt0029553,movie,The Sheik Steps Out,The Sheik Steps Out,0,1937.0,,65.0,Musical
4,tt0035860,movie,The Fallen Sparrow,The Fallen Sparrow,0,1943.0,,94.0,"Film-Noir,Mystery"
...,...,...,...,...,...,...,...,...,...
183,tt8787458,movie,Gado,Gado,0,,,,Western
184,tt8906732,movie,A Song or Two to Make You Feel,A Song or Two to Make You Feel,0,2018.0,,54.0,Music
185,tt9198442,movie,My Hero Academia,My Hero Academia,0,,,,"Action,Adventure,Animation"
186,tt9642604,movie,Los hombres sin rostros,Los hombres sin rostros,0,2016.0,,59.0,Documentary


In [28]:
pd.read_sql("""
    SELECT * FROM title_ratings
""", conn)

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000912,4.3,73
1,tt0017504,4.8,27
2,tt0024996,5.9,66
3,tt0029553,6.0,45
4,tt0030476,6.2,81
...,...,...,...
384,tt9728774,8.9,11
385,tt9758424,8.3,58
386,tt9796264,7.7,53
387,tt9847426,7.3,8


In [29]:
pd.read_sql("""
    SELECT b.primaryTitle, r.averageRating
    FROM title_basics b
    LEFT JOIN title_ratings r ON b.tconst = r.tconst
    WHERE b.titleType = 'movie'
""", conn)

Unnamed: 0,primaryTitle,averageRating
0,Shirayuri wa nageku,
1,Unseen Enemies,4.8
2,Coming Out Party,5.9
3,The Sheik Steps Out,6.0
4,The Fallen Sparrow,6.6
...,...,...
183,Gado,
184,A Song or Two to Make You Feel,
185,My Hero Academia,
186,Los hombres sin rostros,6.8
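
The rows that a `LEFT JOIN` keeps and an inner `JOIN` drops are exactly the unmatched ones, so filtering on `IS NULL` after a `LEFT JOIN` gives an alternative to `NOT IN` for finding titles without ratings (an "anti-join"). A minimal sketch, assuming an in-memory SQLite database and hypothetical `basics`/`ratings` tables:

```python
import sqlite3

# Hypothetical toy tables: only tt1 has a rating.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE basics (tconst TEXT, title TEXT)")
con.execute("CREATE TABLE ratings (tconst TEXT, rating REAL)")
con.executemany(
    "INSERT INTO basics VALUES (?, ?)",
    [("tt1", "Rated"), ("tt2", "Unrated"), ("tt3", "Also unrated")],
)
con.execute("INSERT INTO ratings VALUES ('tt1', 7.5)")

# Inner join: only titles with a matching rating row survive.
inner = con.execute("""
    SELECT b.title, r.rating
    FROM basics b JOIN ratings r ON b.tconst = r.tconst
    ORDER BY b.tconst
""").fetchall()

# Left join: every title survives; missing ratings come back as NULL (None).
left = con.execute("""
    SELECT b.title, r.rating
    FROM basics b LEFT JOIN ratings r ON b.tconst = r.tconst
    ORDER BY b.tconst
""").fetchall()

# Anti-join: keep only the rows for which no rating matched.
unrated = con.execute("""
    SELECT b.title
    FROM basics b LEFT JOIN ratings r ON b.tconst = r.tconst
    WHERE r.tconst IS NULL
    ORDER BY b.tconst
""").fetchall()

print(inner)    # [('Rated', 7.5)]
print(left)     # [('Rated', 7.5), ('Unrated', None), ('Also unrated', None)]
print(unrated)  # [('Unrated',), ('Also unrated',)]
```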


### `RIGHT JOIN` aka `RIGHT OUTER JOIN`

#### Q13: Solve Q11 using `RIGHT JOIN`.

In [30]:
pd.read_sql("""
    SELECT b.primaryTitle, r.averageRating
    FROM title_ratings r 
    RIGHT JOIN title_basics b ON b.tconst = r.tconst
    WHERE b.titleType = 'movie'
""", conn)

Unnamed: 0,primaryTitle,averageRating
0,Shirayuri wa nageku,
1,Unseen Enemies,4.8
2,Coming Out Party,5.9
3,The Sheik Steps Out,6.0
4,The Fallen Sparrow,6.6
...,...,...
183,Gado,
184,A Song or Two to Make You Feel,
185,My Hero Academia,
186,Los hombres sin rostros,6.8


#### Q14: Find all movies, their average rating, and the total number of regions they have been released in.

In [31]:
pd.read_sql("""
SELECT b.primaryTitle, r.averageRating, COUNT(DISTINCT a.region) AS totalRegions
FROM title_basics b
JOIN title_ratings r ON b.tconst = r.tconst
JOIN title_akas a ON b.tconst = a.titleId
WHERE b.titleType = 'movie'
GROUP BY b.primaryTitle, b.tconst, r.averageRating
ORDER BY totalRegions DESC, averageRating DESC, primaryTitle ASC
""", conn)

Unnamed: 0,primaryTitle,averageRating,totalRegions
0,Wild,7.1,50
1,Return of the Seven,5.5,40
2,Escape from the Planet of the Apes,6.3,38
3,Crazed Fruit,7.2,18
4,The Fallen Sparrow,6.6,17
...,...,...,...
90,Fondovalle,4.6,1
91,Klarar Bananen Biffen?,4.6,1
92,Saajan Ka Ghar,4.3,1
93,Cock Tail,4.1,1


### Order of execution

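Logical execution order: `FROM`, `JOIN`, `WHERE`, `GROUP BY`, `HAVING`, `SELECT` (including window functions), `ORDER BY`, and `LIMIT`. This is the order in which the clauses are conceptually applied; the optimizer may physically run the query differently as long as the result is the same.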
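
One consequence of this order can be sketched with a window function, which is evaluated as part of the `SELECT` list: after `WHERE`, but before `ORDER BY` and `LIMIT`. The example assumes an in-memory SQLite database (version 3.25 or newer for window functions) and a hypothetical `ratings` table.

```python
import sqlite3

# Hypothetical toy data with distinct ratings.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ratings (tconst TEXT, rating REAL)")
con.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("a", 9.0), ("b", 8.0), ("c", 7.0), ("d", 6.0)],
)

rows = con.execute("""
    SELECT tconst, RANK() OVER (ORDER BY rating DESC) AS rnk
    FROM ratings
    WHERE rating < 9      -- applied first: row 'a' never reaches RANK()
    ORDER BY rnk          -- applied after SELECT, so the alias is visible
    LIMIT 2               -- applied last: trims rows that are already ranked
""").fetchall()

print(rows)  # [('b', 1), ('c', 2)]
```

Because `WHERE` runs before the ranking, `b` receives rank 1 even though it is only the second-highest row in the table.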

### Window functions aka Analytic Functions aka Online Analytical Processing (OLAP) functions 

- What are window functions?
    - Special types of functions that perform calculations across a set of table rows that are related to the current row.
    - Unlike aggregate functions, window functions do not collapse the result set into a single row or group of rows. Instead, they provide a result for each row while still considering a "window" of other rows.

### Clauses

- `OVER`: marks a function as a window function and defines the window over which it operates.
- `PARTITION BY`: divides the result set into partitions; the function is applied to each partition separately.
- `ORDER BY`: specifies the order in which rows are processed within each partition.

### Ranking functions

- `RANK`:
    - returns the same rank for tied rows, with gaps in the ranking
    - why are there gaps? The rank assigned after a tie skips the positions taken by the tied rows (for example 1, 1, 3)
- `DENSE_RANK`:
    - returns the same rank for tied rows, like `RANK`, but with no gaps in the ranking (for example 1, 1, 2)
- `ROW_NUMBER`:
    - returns a unique number for each row; tied rows are numbered in arbitrary order
    - adding tie-breaking columns to `ORDER BY` makes the numbering predictable
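
The three functions can be compared side by side on a small tie. This sketch assumes an in-memory SQLite database (version 3.25 or newer for window functions) and a hypothetical `ratings` table in which `t1` and `t2` share the top rating.

```python
import sqlite3

# Hypothetical toy data: t1 and t2 are tied at 9.5.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ratings (tconst TEXT, rating REAL)")
con.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("t1", 9.5), ("t2", 9.5), ("t3", 9.4), ("t4", 9.0)],
)

rows = con.execute("""
    SELECT
        tconst,
        RANK()       OVER (ORDER BY rating DESC)             AS rnk,
        DENSE_RANK() OVER (ORDER BY rating DESC)             AS dense_rnk,
        ROW_NUMBER() OVER (ORDER BY rating DESC, tconst ASC) AS row_num
    FROM ratings
    ORDER BY rating DESC, tconst ASC
""").fetchall()

for row in rows:
    print(row)
# ('t1', 1, 1, 1)
# ('t2', 1, 1, 2)
# ('t3', 3, 2, 3)   <- RANK skips 2, DENSE_RANK does not
# ('t4', 4, 3, 4)
```

The extra `tconst ASC` key in `ROW_NUMBER` is what makes the numbering of the tied rows deterministic.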

#### Q15: Rank all title IDs by their rating (descending order).

In [32]:
pd.read_sql("""
    SELECT 
        tconst, averageRating, 
        RANK() OVER (ORDER BY averageRating DESC) AS titleRank
    FROM title_ratings
""", conn)

Unnamed: 0,tconst,averageRating,titleRank
0,tt2924058,10.0,1
1,tt1841655,9.8,2
2,tt12601448,9.6,3
3,tt4065164,9.5,4
4,tt4740328,9.5,4
...,...,...,...
384,tt5188300,3.2,383
385,tt18257696,2.9,386
386,tt6840238,2.8,387
387,tt0933342,2.7,388


#### Q16: Rank all titles by their rating (descending order).

In [33]:
pd.read_sql("""
    SELECT 
        b.tconst, b.primaryTitle, 
        r.averageRating, 
        RANK() OVER (ORDER BY r.averageRating DESC) AS titleRank
    FROM title_ratings r
    JOIN title_basics b ON r.tconst = b.tconst
    LIMIT 15
""", conn)

Unnamed: 0,tconst,primaryTitle,averageRating,titleRank
0,tt2924058,Episode #9.2,10.0,1
1,tt1841655,In the Bin,9.8,2
2,tt12601448,Episode 11,9.6,3
3,tt4065164,All Shook Up,9.5,4
4,tt4740328,Lavanya fires Khushi,9.5,4
5,tt7385060,A Premature Christmas,9.5,4
6,tt2271562,Episode #1.6,9.4,7
7,tt23901758,The Mountain Path,9.4,7
8,tt29208392,Postmord,9.3,9
9,tt30835366,Learning English,9.3,9


#### Q17: Dense rank all titles by their rating (descending order).

In [34]:
pd.read_sql("""
    SELECT 
        b.tconst, b.primaryTitle, 
        r.averageRating, 
        DENSE_RANK() OVER (ORDER BY r.averageRating DESC) AS titleDenseRank
    FROM title_ratings r
    JOIN title_basics b ON r.tconst = b.tconst
    LIMIT 15
""", conn)

Unnamed: 0,tconst,primaryTitle,averageRating,titleDenseRank
0,tt2924058,Episode #9.2,10.0,1
1,tt1841655,In the Bin,9.8,2
2,tt12601448,Episode 11,9.6,3
3,tt4065164,All Shook Up,9.5,4
4,tt4740328,Lavanya fires Khushi,9.5,4
5,tt7385060,A Premature Christmas,9.5,4
6,tt2271562,Episode #1.6,9.4,5
7,tt23901758,The Mountain Path,9.4,5
8,tt29208392,Postmord,9.3,6
9,tt30835366,Learning English,9.3,6


#### Q18: Assign a sequential rank to each title by rating (descending order). If there are ties in ratings, break ties based on ascending order of titles.

In [35]:
pd.read_sql("""
    SELECT 
        b.tconst, b.primaryTitle, 
        r.averageRating, 
        ROW_NUMBER() OVER (ORDER BY r.averageRating DESC, primaryTitle ASC) AS titleUniqueRank
    FROM title_ratings r
    JOIN title_basics b ON r.tconst = b.tconst
    LIMIT 15
""", conn)

Unnamed: 0,tconst,primaryTitle,averageRating,titleUniqueRank
0,tt2924058,Episode #9.2,10.0,1
1,tt1841655,In the Bin,9.8,2
2,tt12601448,Episode 11,9.6,3
3,tt7385060,A Premature Christmas,9.5,4
4,tt4065164,All Shook Up,9.5,5
5,tt4740328,Lavanya fires Khushi,9.5,6
6,tt2271562,Episode #1.6,9.4,7
7,tt23901758,The Mountain Path,9.4,8
8,tt30835366,Learning English,9.3,9
9,tt29208392,Postmord,9.3,10


### Ranking using `PARTITION BY`

#### Q19: Rank all titles by their rating (descending order) within each genre.

In [36]:
pd.read_sql("""
    SELECT 
        b.tconst, b.primaryTitle, b.genres,r.averageRating, 
        ROW_NUMBER() OVER (PARTITION BY b.genres ORDER BY r.averageRating DESC) AS genreRanking
    FROM title_ratings r
    JOIN title_basics b ON r.tconst = b.tconst
    WHERE b.genres IS NOT NULL
""", conn)

Unnamed: 0,tconst,primaryTitle,genres,averageRating,genreRanking
0,tt5316184,Episode #1.9,Action,9.2,1
1,tt27946257,Yongchun of South Shaolin: Breakthrough,Action,8.1,2
2,tt28114581,Tebus the Movie,Action,7.4,3
3,tt0892322,Lumines II,Action,7.3,4
4,tt0318498,Ninja in the Killing Fields,Action,3.8,5
...,...,...,...,...,...
380,tt0556591,Death Ride,Western,8.6,1
381,tt0631931,Truth About Gunfighting,Western,7.6,2
382,tt0038874,Red River Renegades,Western,6.6,3
383,tt0037142,Oath of Vengeance,Western,5.7,4


### Aggregate functions with window functions

`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`
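
Used with `OVER`, an aggregate keeps every row and attaches the aggregate of the row's window to it, whereas `GROUP BY` collapses each group into one row. The sketch below contrasts the two, assuming an in-memory SQLite database (version 3.25 or newer) and a hypothetical `titles` table with integer vote counts.

```python
import sqlite3

# Hypothetical toy data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE titles (title TEXT, genre TEXT, votes INTEGER)")
con.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", "Action", 10), ("B", "Action", 30), ("C", "Western", 5)],
)

# GROUP BY: one output row per genre.
grouped = con.execute("""
    SELECT genre, SUM(votes) AS genre_votes
    FROM titles
    GROUP BY genre
    ORDER BY genre
""").fetchall()

# Window aggregate: one output row per title, each carrying its genre's total.
windowed = con.execute("""
    SELECT title, genre, votes,
           SUM(votes) OVER (PARTITION BY genre) AS genre_votes
    FROM titles
    ORDER BY title
""").fetchall()

print(grouped)   # [('Action', 40), ('Western', 5)]
print(windowed)  # [('A', 'Action', 10, 40), ('B', 'Action', 30, 40), ('C', 'Western', 5, 5)]
```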

#### Q20: Rank all titles by their total number of ratings, that is, votes (descending order). If there are ties in the totals, break ties based on ascending order of titles.

In [37]:
pd.read_sql("""
    SELECT 
        b.tconst, b.primaryTitle, 
        SUM(r.numVotes) AS totalRatings,
        RANK() OVER (ORDER BY SUM(r.numVotes) DESC, primaryTitle ASC) AS rating_rank
    FROM title_basics b
    JOIN title_ratings r ON b.tconst = r.tconst
    GROUP BY b.tconst, b.primaryTitle
""", conn)

Unnamed: 0,tconst,primaryTitle,totalRatings,rating_rank
0,tt2305051,Wild,140750.0,1
1,tt0067065,Escape from the Planet of the Apes,41883.0,2
2,tt13659418,Pam & Tommy,40732.0,3
3,tt6294706,The Chi,8463.0,4
4,tt0060897,Return of the Seven,4931.0,5
...,...,...,...,...
384,tt4175544,Reem Halloween,5.0,385
385,tt27749874,Shoeless in the Woods,5.0,386
386,tt23901758,The Mountain Path,5.0,387
387,tt4041376,Vampires & Hormones,5.0,388
