## SQL: Window functions

In [None]:
from sqlalchemy import create_engine, text
import pandas as pd
import os
import gc

In [None]:
engine = create_engine("mysql+mysqlconnector://root:abc@127.0.0.1:3306/cs639")
conn = engine.connect()

In [None]:
list(conn.execute(text("show tables;")))

### IMDB dataset

- Source: https://datasets.imdbws.com/ 
- The original dataset is too large to analyze on our current VM
- Schema information: https://developer.imdb.com/non-commercial-datasets/

In [None]:
!rm -f IMDB.zip
!rm -f *.tsv
!wget https://ms.sites.cs.wisc.edu/cs639/data/IMDB.zip
!unzip IMDB.zip

#### Populating MySQL server with tables corresponding to all tsv files

In [None]:
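files = os.listdir()
# Match on the extension only, so names that merely contain ".tsv" are skipped
tsv_files = [f for f in files if f.endswith(".tsv")]
# Table names: drop the extension, then replace the remaining dots with underscores
table_names = [f.replace(".tsv", "") for f in tsv_files]
table_names = [f.replace(".", "_") for f in table_names]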

In [None]:
for tsv_file, table_name in zip(tsv_files, table_names):
    # The IMDB files mark missing values with the literal string \N
    df = pd.read_csv(tsv_file, sep="\t", na_values="\\N")
    df.to_sql(table_name, conn, index=False, if_exists="replace")
    print(f"Populated {table_name}")

In [None]:
list(conn.execute(text("show tables;")))

### Data Analysis

### Window functions aka Analytic Functions aka Online Analytical Processing (OLAP) functions 

- What are window functions?
    - Special types of functions that perform calculations across a set of table rows that are related to the current row.
    - Unlike aggregate functions, window functions do not collapse the result set into a single row or group of rows. Instead, they provide a result for each row while still considering a "window" of other rows.
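
To make the contrast with aggregate functions concrete, here is a minimal sketch. It assumes an in-memory SQLite database (via Python's built-in `sqlite3`) as a stand-in for the MySQL server; the `ratings` table and its three rows are hypothetical toy data, not the IMDB tables.

```python
import sqlite3

# Hypothetical toy table; an in-memory SQLite database stands in for MySQL.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE ratings (title TEXT, genre TEXT, rating REAL)")
demo.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("A", "Drama", 8.0), ("B", "Drama", 6.0), ("C", "Comedy", 7.0)],
)

# Aggregate function: one row per genre, the individual titles are gone.
grouped = demo.execute(
    "SELECT genre, AVG(rating) FROM ratings GROUP BY genre ORDER BY genre"
).fetchall()

# Window function: every title keeps its row and gains the genre average.
windowed = demo.execute(
    """
    SELECT title, rating, AVG(rating) OVER (PARTITION BY genre) AS genre_avg
    FROM ratings
    ORDER BY title
    """
).fetchall()

print(grouped)   # 2 rows, one per genre
print(windowed)  # 3 rows, one per title
```

The grouped query returns two rows for three input rows, while the windowed query returns all three.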

### Clauses

- `OVER`: marks a function call as a window function and defines the window it operates on; `ORDER BY` and `PARTITION BY` are written inside its parentheses.
- `ORDER BY`: specifies the order in which rows are processed within each window.
- `PARTITION BY`: divides the result set into partitions; the function is applied to each partition separately.

### Ranking functions

- `RANK`:
    - returns the same rank for tied rows, leaving gaps in the ranking
    - why are there gaps? the rank after a tie skips the positions taken by the tied rows (two rows tied at rank 1 are followed by rank 3)
- `DENSE_RANK`:
    - like `RANK`, gives tied rows the same rank, but the next rank is not skipped, so there are no gaps
- `ROW_NUMBER`:
    - returns a unique number for each row; tied rows are numbered in arbitrary order
    - adding more columns to `ORDER BY` breaks ties and makes the numbering predictable
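
The three ranking functions can be compared side by side. This is a small sketch on hypothetical data, again using an in-memory SQLite database as a stand-in for MySQL; the `ratings` table and its values are made up so that two titles tie.

```python
import sqlite3

# Hypothetical toy data with a tie at the top (A and B both have 9.0).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE ratings (title TEXT, rating REAL)")
demo.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("A", 9.0), ("B", 9.0), ("C", 8.0), ("D", 7.0)],
)

ranked = demo.execute(
    """
    SELECT title,
           RANK()       OVER (ORDER BY rating DESC)            AS rnk,
           DENSE_RANK() OVER (ORDER BY rating DESC)            AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY rating DESC, title ASC) AS row_num
    FROM ratings
    ORDER BY row_num
    """
).fetchall()

for row in ranked:
    print(row)
# ('A', 1, 1, 1)
# ('B', 1, 1, 2)
# ('C', 3, 2, 3)   <- RANK skips 2, DENSE_RANK does not
# ('D', 4, 3, 4)
```

`ROW_NUMBER` is deterministic here only because `title` was added as a tie-breaker in its `ORDER BY`.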

#### Q1: Rank all title IDs by their rating (descending order).

In [None]:
pd.read_sql("""
    
""", conn)

#### Q2: Rank all titles by their rating (descending order).

In [None]:
pd.read_sql("""
    
""", conn)

#### Q3: Dense rank all titles by their rating (descending order).

In [None]:
pd.read_sql("""
    
""", conn)

#### Q4: Assign a sequential rank to each title by rating (descending order). If there are ties in ratings, break ties based on ascending order of titles.

In [None]:
pd.read_sql("""
    
""", conn)

### `PARTITION BY`

- divides the result set into subsets or partitions, based on one or more columns and performs calculations separately for each partition
- similar to a `GROUP BY` clause, but `PARTITION BY` does not collapse rows into a single result
- Use case scenarios:
    - ranking within groups
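
As an illustration of ranking within groups, the sketch below restarts the rank for each genre. It assumes a hypothetical `ratings` table in an in-memory SQLite database rather than the IMDB tables on the MySQL server.

```python
import sqlite3

# Hypothetical toy data: two genres with two titles each.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE ratings (title TEXT, genre TEXT, rating REAL)")
demo.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        ("A", "Drama", 9.0),
        ("B", "Drama", 8.0),
        ("C", "Comedy", 7.5),
        ("D", "Comedy", 8.5),
    ],
)

# PARTITION BY restarts the ranking for every genre; no rows are collapsed.
per_genre = demo.execute(
    """
    SELECT genre, title,
           RANK() OVER (PARTITION BY genre ORDER BY rating DESC) AS genre_rank
    FROM ratings
    ORDER BY genre, genre_rank
    """
).fetchall()

for row in per_genre:
    print(row)
```

Each genre gets its own rank 1, and all four rows are still present in the output.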

#### Q5: Rank all titles by their rating (descending order) within each genre.

In [None]:
pd.read_sql("""
    
""", conn)

### Aggregate functions with window functions

`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`
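
Any of these aggregates becomes a window function once it is followed by `OVER`. The sketch below assumes a hypothetical `titles` table in an in-memory SQLite database (a stand-in for MySQL) and attaches a per-genre total and an overall maximum to every row.

```python
import sqlite3

# Hypothetical toy data; column names only loosely mirror the IMDB schema.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE titles (title TEXT, genre TEXT, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", "Drama", 100), ("B", "Drama", 300), ("C", "Comedy", 50)],
)

with_totals = demo.execute(
    """
    SELECT title, numVotes,
           SUM(numVotes) OVER (PARTITION BY genre) AS genre_votes,
           MAX(numVotes) OVER ()                   AS max_votes
    FROM titles
    ORDER BY title
    """
).fetchall()

for row in with_totals:
    print(row)
```

An empty `OVER ()` treats the whole result set as a single window, so `max_votes` is the same on every row.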

#### Q6: Rank all titles by their total number of ratings (descending order). If there are ties in the totals, break ties based on ascending order of titles.

In [None]:
pd.read_sql("""
    
""", conn)

### Window Frames

### `ROWS UNBOUNDED PRECEDING`

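- the window runs from the first row of the partition through the current row (shorthand for `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`), so an aggregate such as `SUM` becomes a running total
- Use case scenarios:
    - running totals
    - running (cumulative) averages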
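
A running total can be sketched as follows. The example assumes a hypothetical `yearly` table in an in-memory SQLite database as a stand-in for MySQL; the years and vote counts are made up.

```python
import sqlite3

# Hypothetical toy data: one row per year.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE yearly (startYear INTEGER, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO yearly VALUES (?, ?)",
    [(2000, 10), (2001, 20), (2002, 30)],
)

# The frame covers every row from the start of the window up to the current row.
running = demo.execute(
    """
    SELECT startYear, numVotes,
           SUM(numVotes) OVER (
               ORDER BY startYear
               ROWS UNBOUNDED PRECEDING
           ) AS running_votes
    FROM yearly
    ORDER BY startYear
    """
).fetchall()

for row in running:
    print(row)
# (2000, 10, 10)
# (2001, 20, 30)
# (2002, 30, 60)
```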

#### Q7: Calculate the cumulative total of votes for each title over time (based on the startYear).

In [None]:
pd.read_sql("""
    SELECT * FROM title_ratings LIMIT 2
""", conn)

In [None]:
pd.read_sql("""
    SELECT * FROM title_basics LIMIT 2
""", conn)

In [None]:
pd.read_sql("""
    
""", conn)

### `ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING`

- the window includes the current row, the previous row (`1 PRECEDING`), and the next row (`1 FOLLOWING`)
- for the first row, the window includes only the first and second rows (there is no previous row)
- similarly, for the last row, the window includes only the penultimate and last rows
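
The edge behaviour described above shows up in a small moving average. This sketch assumes a hypothetical `movies` table in an in-memory SQLite database (a stand-in for MySQL) with one made-up rating per year.

```python
import sqlite3

# Hypothetical toy data: four movies, one per year.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE movies (startYear INTEGER, rating REAL)")
demo.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(2000, 6.0), (2001, 8.0), (2002, 7.0), (2003, 9.0)],
)

moving = demo.execute(
    """
    SELECT startYear, rating,
           AVG(rating) OVER (
               ORDER BY startYear
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           ) AS moving_avg
    FROM movies
    ORDER BY startYear
    """
).fetchall()

for row in moving:
    print(row)
# (2000, 6.0, 7.0)   <- only two rows in the window: 6.0 and 8.0
# (2001, 8.0, 7.0)   <- three rows: 6.0, 8.0, 7.0
# (2002, 7.0, 8.0)   <- three rows: 8.0, 7.0, 9.0
# (2003, 9.0, 8.0)   <- only two rows in the window: 7.0 and 9.0
```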

#### Q8: Calculate the average rating of each movie, including the ratings of the previous and next movies based on their release year (ascending).

In [None]:
pd.read_sql("""
    
""", conn)

### `RANGE BETWEEN INTERVAL <N> DAY PRECEDING AND INTERVAL <N> DAY FOLLOWING`

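- typically used when the `ORDER BY` column has a `DATE`, `DATETIME`, or `TIMESTAMP` type
- unlike `ROWS`, which counts physical rows, `RANGE` includes every row whose `ORDER BY` value falls within the given interval around the current row's value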
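
The difference between a value-based `RANGE` frame and a row-based `ROWS` frame can be sketched as below. Two loud assumptions: the example runs on an in-memory SQLite database, which has no `INTERVAL` syntax, so the release date is modelled as a hypothetical integer day number (`releaseDay`) and the frame is written as a plain numeric range; it also needs SQLite 3.28 or newer. On MySQL, with a real `DATE` column, the frame would use the `INTERVAL 3 DAY` form from the heading above.

```python
import sqlite3

# Hypothetical toy data: releaseDay is an integer day number, not a real DATE.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE releases (releaseDay INTEGER, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO releases VALUES (?, ?)",
    [(1, 10), (3, 20), (10, 30), (12, 40)],
)

# RANGE looks at values: every row whose releaseDay is within 3 of the
# current row's releaseDay is included, however many rows that is.
nearby = demo.execute(
    """
    SELECT releaseDay, numVotes,
           SUM(numVotes) OVER (
               ORDER BY releaseDay
               RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
           ) AS nearby_votes
    FROM releases
    ORDER BY releaseDay
    """
).fetchall()

for row in nearby:
    print(row)
```

Days 1 and 3 fall within three days of each other, as do days 10 and 12, so each pair shares a total (30 and 70); a `ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING` frame would instead sum all four rows for every row.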

#### Q9: Calculate the total number of votes each movie received, including votes from movies released in the 3 days before and after the release date of each movie.

Let's first explore the title_basics table schema.

In [None]:
pd.read_sql("", conn)

In [None]:
pd.read_sql("", conn)

In [None]:
pd.read_sql("""
    
""", conn)

### `LAG` and `LEAD`

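- `LAG` accesses data from a previous row in the same partition (by default, the row immediately before the current one); it returns `NULL` when no such row exists
- `LEAD` accesses data from a following row in the same partition (by default, the row immediately after the current one); it returns `NULL` when no such row exists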
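
A minimal sketch of both functions on hypothetical data follows, using an in-memory SQLite database as a stand-in for MySQL. The `titles` table and its rows are made up, and ordering by `title` within each year is an arbitrary choice for the illustration.

```python
import sqlite3

# Hypothetical toy data: two titles in 2000, one in 2001.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE titles (title TEXT, startYear INTEGER, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", 2000, 100), ("B", 2000, 150), ("C", 2001, 80)],
)

neighbours = demo.execute(
    """
    SELECT title, startYear, numVotes,
           LAG(numVotes)  OVER (PARTITION BY startYear ORDER BY title) AS previousVotes,
           LEAD(numVotes) OVER (PARTITION BY startYear ORDER BY title) AS nextVotes
    FROM titles
    ORDER BY startYear, title
    """
).fetchall()

for row in neighbours:
    print(row)
# ('A', 2000, 100, None, 150)
# ('B', 2000, 150, 100, None)
# ('C', 2001, 80, None, None)
```

The `NULL` values (shown as `None`) appear wherever a partition has no previous or next row.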

#### Q10: What is the number of votes for each title compared to the previous title released in the same year?

In [None]:
pd.read_sql("""
    
""", conn)

What if you want to filter out rows where `previousVotes` is `NULL`?

### Common Table Expression (CTE)

- temporary result set that you can reference within a SQL query
- defined using the `WITH` clause
- CTEs are only visible to the SQL statement that immediately follows them
- benefits: modularity, reusability
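
A window function result cannot be referenced in the `WHERE` clause of the same `SELECT`, which is why filtering on `previousVotes` needs an extra step. One way to do it, sketched here on hypothetical data in an in-memory SQLite database (a stand-in for MySQL), is to compute the window column inside a CTE and filter in the outer query. The CTE name `lagged` is an arbitrary choice.

```python
import sqlite3

# Hypothetical toy data: two titles in 2000, one in 2001.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE titles (title TEXT, startYear INTEGER, numVotes INTEGER)")
demo.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", 2000, 100), ("B", 2000, 150), ("C", 2001, 80)],
)

filtered = demo.execute(
    """
    WITH lagged AS (
        SELECT title, startYear, numVotes,
               LAG(numVotes) OVER (PARTITION BY startYear ORDER BY title) AS previousVotes
        FROM titles
    )
    SELECT title, startYear, numVotes, previousVotes
    FROM lagged
    WHERE previousVotes IS NOT NULL
    """
).fetchall()

print(filtered)  # [('B', 2000, 150, 100)]
```

Only title B survives, because it is the only row with a previous title in the same year.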

In [None]:
pd.read_sql("""
    
""", conn)

#### Q11: What is the number of votes for each title compared to the next title released in the same year?

In [None]:
pd.read_sql("""
    
""", conn)