## SQL 3: Subqueries and joins

In [None]:
from sqlalchemy import create_engine, text
import pandas as pd
import os
import gc

In [None]:
engine = create_engine("mysql+mysqlconnector://root:abc@127.0.0.1:3306/cs639")
conn = engine.connect()

In [None]:
list(conn.execute(text("show tables;")))

### IMDB dataset

- Source: https://datasets.imdbws.com/ 
- The original dataset is too large to analyze on our current VM
- Schema information: https://developer.imdb.com/non-commercial-datasets/

In [None]:
!wget https://ms.sites.cs.wisc.edu/cs639/data/IMDB.zip

In [None]:
!unzip IMDB.zip

#### Populating MySQL server with tables corresponding to all tsv files

In [None]:
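files = os.listdir()
# Match on the extension only, so names that merely contain ".tsv" are skipped
tsv_files = [f for f in files if f.endswith(".tsv")]
# Table names: drop the extension, then replace the remaining dots with underscores
table_names = [f[:-len(".tsv")] for f in tsv_files]
table_names = [name.replace(".", "_") for name in table_names]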

In [None]:
for tsv_file, table_name in zip(tsv_files, table_names):
    # IMDB marks missing values with the literal string \N
    df = pd.read_csv(tsv_file, sep="\t", na_values="\\N")
    df.to_sql(table_name, conn, index=False, if_exists="replace")
    print(f"Populated {table_name}")

In [None]:
list(conn.execute(text("show tables;")))

### Data Analysis

### SQL Subqueries

- What is a subquery?
    - A query contained within another query. The outer query is typically referred to as the "containing statement".
    - A subquery can be used with all four SQL data statements: `SELECT`, `INSERT`, `UPDATE`, `DELETE`.
    - Conceptually, a noncorrelated subquery is executed once, before the containing statement, while a correlated subquery is evaluated for each candidate row of the containing statement.
    - Subqueries act like temporary tables with statement scope: once the containing statement has finished executing, the data returned by its subqueries is discarded.
    - Subqueries can return:
        - Single row with a single column
        - Multiple rows with a single column
        - Multiple rows with multiple columns
- Types of subqueries:
    1. noncorrelated subqueries: self-contained subqueries
    2. correlated subqueries: reference columns from the containing statement
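The two types can be contrasted on a toy table. This is a minimal sketch, assuming an in-memory SQLite database (Python's built-in `sqlite3`) in place of the MySQL connection used in this notebook; the `titles` table and its rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for title_basics; every row here is invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE titles (title TEXT, genre TEXT, runtime INTEGER)")
db.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", "Drama", 100), ("B", "Drama", 140),
     ("C", "Comedy", 80), ("D", "Comedy", 90)],
)

# Noncorrelated: the inner query is self-contained and could run on its own.
noncorrelated = db.execute("""
    SELECT title FROM titles
    WHERE runtime > (SELECT AVG(runtime) FROM titles)
    ORDER BY title
""").fetchall()

# Correlated: the inner query references outer_t.genre from the containing
# statement, so it cannot run on its own.
correlated = db.execute("""
    SELECT title FROM titles AS outer_t
    WHERE runtime > (
        SELECT AVG(runtime) FROM titles AS inner_t
        WHERE inner_t.genre = outer_t.genre
    )
    ORDER BY title
""").fetchall()

print(noncorrelated)  # titles above the overall average runtime
print(correlated)     # titles above the average runtime of their own genre
```

The only difference between the two queries is the `WHERE inner_t.genre = outer_t.genre` condition, which ties the inner query to the current outer row.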

### Noncorrelated subqueries

- What is a scalar subquery?
    - A query returning a result set containing a single row and column.
    - Can be used with comparison operators: `=`, `<`, `<=`, `>`, `>=`, `<>`
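A scalar subquery combined with `=` might look like the sketch below. It assumes an in-memory SQLite database rather than the MySQL server used here, and the `titles` table with its `year` column is hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE titles (title TEXT, year INTEGER)")
db.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("A", 2003), ("B", 2005), ("C", 2005)],  # invented rows
)

# On its own, the subquery returns exactly one row with one column.
scalar = db.execute("SELECT MAX(year) FROM titles").fetchall()

# Because it yields a single value, it can sit on the right-hand side of "=".
most_recent = db.execute("""
    SELECT title FROM titles
    WHERE year = (SELECT MAX(year) FROM titles)
    ORDER BY title
""").fetchall()

print(scalar)
print(most_recent)
```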

#### Single-Row and Single-Column subqueries

#### Q1: What are the titles that have a runtime greater than the average runtime of all movies?

In [None]:
pd.read_sql("""
    
""", conn)

In [None]:
pd.read_sql("""
    
""", conn)

#### Q2: What are the most recent movies?

In [None]:
pd.read_sql("""
    
""", conn)

In [None]:
pd.read_sql("""
    
""", conn)

#### Multiple-Row and Single-Column subqueries

- Operators: `IN`, `NOT IN`, `ALL`, `ANY`
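`IN` and `NOT IN` can be sketched on two toy tables. The example assumes an in-memory SQLite database instead of the MySQL connection; the table names `basics` and `ratings` and all rows are invented. `ALL` and `ANY` are left out because SQLite does not support them. The last query shows a common pitfall: once the subquery returns a `NULL`, `NOT IN` matches no rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE basics (tconst TEXT, title TEXT)")
db.execute("CREATE TABLE ratings (tconst TEXT, rating REAL)")
db.executemany("INSERT INTO basics VALUES (?, ?)",
               [("t1", "A"), ("t2", "B"), ("t3", "C")])
db.executemany("INSERT INTO ratings VALUES (?, ?)",
               [("t1", 7.5), ("t3", 6.0)])

# IN: keep rows whose tconst appears in the subquery's single column.
rated = db.execute("""
    SELECT title FROM basics
    WHERE tconst IN (SELECT tconst FROM ratings)
    ORDER BY title
""").fetchall()

NOT_IN_QUERY = """
    SELECT title FROM basics
    WHERE tconst NOT IN (SELECT tconst FROM ratings)
    ORDER BY title
"""
unrated = db.execute(NOT_IN_QUERY).fetchall()

# With a NULL in the subquery result, "x NOT IN (...)" is never true.
db.execute("INSERT INTO ratings VALUES (NULL, 5.0)")
unrated_with_null = db.execute(NOT_IN_QUERY).fetchall()

print(rated, unrated, unrated_with_null)
```

If the inner column can contain `NULL`, adding `WHERE tconst IS NOT NULL` to the subquery avoids the empty result.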

#### Q3: Find the number of movies that have more than one genre.

We can find the number of genres by counting the commas in the `genres` column and adding 1. Let's first determine the length of the `genres` column.

In [None]:
pd.read_sql("""
    
""", conn)

`LENGTH` in SQL.

In [None]:
pd.read_sql("""
    
""", conn)

To find the number of commas, we can replace every comma with an empty string and take the difference in length between the original string and the replaced string.
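The comma-counting trick can be checked on a few invented strings. This sketch assumes an in-memory SQLite database as a stand-in for MySQL; the `titles` table is hypothetical. Note that MySQL's `LENGTH` counts bytes while SQLite's counts characters, which makes no difference for plain ASCII genre names.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE titles (title TEXT, genres TEXT)")
db.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    [("A", "Drama"),
     ("B", "Comedy,Romance"),
     ("C", "Action,Adventure,Sci-Fi")],  # invented rows
)

# commas = length of original - length with commas removed; genres = commas + 1
genre_counts = db.execute("""
    SELECT title,
           LENGTH(genres) - LENGTH(REPLACE(genres, ',', '')) + 1 AS n_genres
    FROM titles
    ORDER BY title
""").fetchall()

print(genre_counts)
```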

In [None]:
pd.read_sql("""
    
""", conn)

Now putting it together in a subquery.

In [None]:
pd.read_sql("""
    
""", conn)

#### Q4: Find the titles of movies that have the maximum number of genres.

In [None]:
pd.read_sql("""
    
""", conn)

#### Q5: Find the titles of movies that belong to the same genres as those with a runtime longer than 150 minutes.

In [None]:
pd.read_sql("""
    
""", conn)

In [None]:
pd.read_sql("""
    
""", conn)

#### Q6: Find titles of movies that have not received any ratings.

In [None]:
pd.read_sql("SELECT tconst FROM title_ratings", conn)

In [None]:
pd.read_sql("""
    
""", conn)

#### Q7: Find all the titles that have an average rating greater than all titles released in the year 2005.

In [None]:
pd.read_sql("""
    
""", conn)

#### Q8: Find all the titles that have an average rating lower than any title released in the year 2005.

In [None]:
pd.read_sql("""
    SELECT *
    FROM title_basics
    WHERE tconst IN (
        SELECT tconst
        FROM title_ratings
        WHERE averageRating < ANY (
            SELECT averageRating
            FROM title_ratings
            WHERE tconst IN (
                SELECT tconst
                FROM title_basics
                WHERE startYear = 2005
            )
        )
    )
""", conn)

### Correlated subqueries

#### Q9: Find the titles of movies that have a runtime longer than the average runtime of all movies in the same genre.

In [None]:
pd.read_sql("""
    
""", conn)

### JOINs

### `JOIN` aka `INNER JOIN` 

#### Q10: Find all movies and their corresponding ratings.

In [None]:
pd.read_sql("""
    SELECT * FROM title_basics
    WHERE titleType = 'movie'
""", conn)

In [None]:
pd.read_sql("""
    SELECT * FROM title_ratings
""", conn)

In [None]:
pd.read_sql("""
    
""", conn)

### `LEFT JOIN` aka `LEFT OUTER JOIN`

#### Q11: Find all movies and their corresponding ratings. If a movie doesn't have a rating, still include it in the results.

In [None]:
pd.read_sql("""
    SELECT * FROM title_basics
    WHERE titleType = 'movie'
""", conn)

In [None]:
pd.read_sql("""
    SELECT * FROM title_ratings
""", conn)

In [None]:
pd.read_sql("""
    
""", conn)
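The difference between the inner join asked for in Q10 and the left join asked for in Q11 can be sketched on two toy tables. The example assumes an in-memory SQLite database in place of MySQL; the `movies` and `ratings` tables and their rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE movies (tconst TEXT, title TEXT)")
db.execute("CREATE TABLE ratings (tconst TEXT, rating REAL)")
db.executemany("INSERT INTO movies VALUES (?, ?)",
               [("t1", "A"), ("t2", "B"), ("t3", "C")])
db.executemany("INSERT INTO ratings VALUES (?, ?)",
               [("t1", 7.5), ("t3", 6.0)])  # t2 has no rating

# INNER JOIN: only movies with a matching row in ratings survive.
inner = db.execute("""
    SELECT m.title, r.rating
    FROM movies AS m
    JOIN ratings AS r ON m.tconst = r.tconst
    ORDER BY m.title
""").fetchall()

# LEFT JOIN: every movie is kept; missing ratings come back as NULL (None).
left = db.execute("""
    SELECT m.title, r.rating
    FROM movies AS m
    LEFT JOIN ratings AS r ON m.tconst = r.tconst
    ORDER BY m.title
""").fetchall()

print(inner)
print(left)
```

Filtering the left join with `WHERE r.tconst IS NULL` would return only the unrated movies, which is an alternative to the `NOT IN` subquery approach.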

### `RIGHT JOIN` aka `RIGHT OUTER JOIN`

#### Q13: Solve Q11 using `RIGHT JOIN`.

In [None]:
pd.read_sql("""
    
""", conn)

#### Q14: Find all movies, their average rating, and the total number of regions they have been released in.

In [None]:
pd.read_sql("""
    
""", conn)

### Order of execution

Logical execution order: `FROM`, `JOIN`, `WHERE`, `GROUP BY`, `HAVING`, `SELECT`, `ORDER BY`, and `LIMIT`.
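One observable consequence of this order is that `WHERE` filters individual rows before grouping, while `HAVING` filters the groups afterwards. The sketch below assumes an in-memory SQLite database rather than MySQL, and the `titles` table is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE titles (title TEXT, genre TEXT, runtime INTEGER)")
db.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", "Drama", 100), ("B", "Drama", 140), ("C", "Comedy", 80),
     ("D", "Comedy", 90), ("E", "Action", 130)],
)

# WHERE drops short titles first, GROUP BY then counts what is left,
# HAVING keeps only groups with at least two remaining titles.
result = db.execute("""
    SELECT genre, COUNT(*) AS n
    FROM titles
    WHERE runtime >= 90
    GROUP BY genre
    HAVING COUNT(*) >= 2
    ORDER BY n DESC
    LIMIT 1
""").fetchall()

# An aggregate inside WHERE is rejected: groups do not exist yet at that step.
try:
    db.execute("SELECT genre FROM titles WHERE COUNT(*) >= 2 GROUP BY genre")
    where_rejected = False
except sqlite3.Error:
    where_rejected = True

print(result, where_rejected)
```

Comedy has two titles in the table, but only one passes the `WHERE` filter, so it is removed by `HAVING`.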

### Window functions aka Analytic Functions aka Online Analytical Processing (OLAP) functions 

- What are window functions?
    - Special types of functions that perform calculations across a set of table rows that are related to the current row.
    - Unlike aggregate functions, window functions do not collapse the result set into a single row or group of rows. Instead, they provide a result for each row while still considering a "window" of other rows.

### Clauses

- `OVER`: defines the window or partition over which the function operates; `PARTITION BY` and `ORDER BY` are written inside its parentheses.
- `ORDER BY`: specifies the order in which rows should be processed within each window.
- `PARTITION BY`: divides the result set into partitions and applies the function to each partition separately.
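The three clauses can be seen together in a small sketch, which also contrasts a window function with `GROUP BY`. It assumes an in-memory SQLite database (version 3.25 or newer, which added window functions) instead of MySQL; the `titles` table is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE titles (title TEXT, genre TEXT, rating REAL)")
db.executemany(
    "INSERT INTO titles VALUES (?, ?, ?)",
    [("A", "Drama", 8.0), ("B", "Drama", 6.0), ("C", "Comedy", 7.0)],
)

# GROUP BY collapses each genre into a single row.
grouped = db.execute("""
    SELECT genre, AVG(rating) FROM titles
    GROUP BY genre ORDER BY genre
""").fetchall()

# A window function keeps every row and attaches the per-partition result.
windowed = db.execute("""
    SELECT title,
           AVG(rating) OVER (PARTITION BY genre) AS genre_avg,
           RANK() OVER (PARTITION BY genre ORDER BY rating DESC) AS genre_rank
    FROM titles
    ORDER BY title
""").fetchall()

print(grouped)   # one row per genre
print(windowed)  # one row per title
```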

### Ranking functions

- `RANK`:
    - returns the same rank for tied rows, with gaps in the ranking
    - why are there gaps? The rank assigned after a tie skips the positions taken up by the tied rows (for example 1, 1, 3)
- `DENSE_RANK`:
    - like `RANK`, returns the same rank for tied rows, but with no gaps in the ranking (for example 1, 1, 2)
- `ROW_NUMBER`:
    - returns a unique number for each row; tied rows are numbered arbitrarily
    - adding more columns to `ORDER BY` breaks ties and makes the numbering predictable
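The three ranking functions can be compared side by side on a table with one tie. This is a sketch under the assumption of an in-memory SQLite database (3.25 or newer) standing in for MySQL; the `ratings` table and its rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ratings (title TEXT, rating REAL)")
db.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("A", 9.0), ("B", 8.0), ("C", 8.0), ("D", 7.0)],  # B and C tie
)

rows = db.execute("""
    SELECT title,
           rating,
           RANK()       OVER (ORDER BY rating DESC)            AS rnk,
           DENSE_RANK() OVER (ORDER BY rating DESC)            AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY rating DESC, title ASC) AS row_num
    FROM ratings
    ORDER BY rating DESC, title ASC
""").fetchall()

for row in rows:
    print(row)
```

`RANK` gives 1, 2, 2, 4, `DENSE_RANK` gives 1, 2, 2, 3, and `ROW_NUMBER` gives 1, 2, 3, 4. The extra `title ASC` key is what makes the `ROW_NUMBER` result deterministic for the tied rows.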

#### Q15: Rank all title IDs by their rating (descending order).

In [None]:
pd.read_sql("""
    
""", conn)

#### Q16: Rank all titles by their rating (descending order).

In [None]:
pd.read_sql("""
    
""", conn)

#### Q17: Dense rank all titles by their rating (descending order).

In [None]:
pd.read_sql("""
    
""", conn)

#### Q18: Assign a sequential rank to each title by rating (descending order). If there are ties in ratings, break ties based on ascending order of titles.

In [None]:
pd.read_sql("""
    
""", conn)

### Ranking using `PARTITION BY`

#### Q19: Rank all titles by their rating (descending order) within each genre.

In [None]:
pd.read_sql("""
    
""", conn)

### Aggregate functions with window functions

`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`
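Used with `OVER`, these aggregates return a value for every row instead of one value per group. The sketch below assumes an in-memory SQLite database (3.25 or newer) in place of MySQL; the `ratings` table and its `votes` column are hypothetical, and the vote counts are deliberately distinct so that the running total is unambiguous.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ratings (title TEXT, votes INTEGER)")
db.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [("A", 100), ("B", 50), ("C", 30)],
)

rows = db.execute("""
    SELECT title,
           votes,
           SUM(votes) OVER ()                    AS total_votes,
           SUM(votes) OVER (ORDER BY votes DESC) AS running_votes
    FROM ratings
    ORDER BY votes DESC
""").fetchall()

for row in rows:
    print(row)
```

An empty `OVER ()` treats the whole result set as one window, while adding `ORDER BY` inside `OVER` turns the sum into a running total up to the current row.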

#### Q20: Rank all titles by each title's total number of ratings (descending order). If there are ties in the number of ratings, break ties based on ascending order of titles.

In [None]:
pd.read_sql("""
    
""", conn)