# Introduction to SQL

In this notebook we walk through basic SQL queries and review how to interact with SQL from Python.  


## Connecting to the Database

There are actually many ways to connect to database systems from Python.  For the labs, homeworks, and projects we will use the simple SQLite database system.  In this notebook we will also demonstrate how to use `sqlalchemy`.

## SQLite

[SQLite](https://www.sqlite.org/index.html) is a simple database engine implemented as a C library that operates on self-contained database files and has been incorporated into many different programming languages and systems.  SQLite also has a simple command line interface.

# Download the Database

The following block of code will download the database if it is not already present.  This may take a few minutes.

In [None]:
from ds100_utils import download_file_from_google_drive
from pathlib import Path
import zipfile

if not Path("im.db").exists():
    download_file_from_google_drive("1owm2jBnVwCXAXVRsbzLSj8j9Ct8PkeVF", "im.db.zip")
    with zipfile.ZipFile("im.db.zip", 'r') as zip_ref:
        zip_ref.extractall()

### Using the Command Line SQLite Client

To connect to a database from the command line, open the database file with the `sqlite3` client:

```bash
> sqlite3 im.db 
```

Then once you connect to the client you might want to improve output formatting by running:

```
.mode column
.headers on
```

Note that these dot-commands are specific to the `sqlite3` client and will not work with other database systems.

Try running a few commands:

* List tables

```
.tables
```

* Try running a few basic queries.

## Connecting to Databases from Python


In [None]:
import sqlite3

The following code connects to the database and opens a cursor to send queries.

In [None]:
conn = sqlite3.connect("im.db")
cursor = conn.cursor()

To get a list of the tables in the database we can actually query the `sqlite_master` table.  Each database will have a different mechanism to list the available tables.

In [None]:
query = """
    SELECT name 
    FROM sqlite_master 
    WHERE type='table';
"""
for row in cursor.execute(query):
    print(row)

Notice that `cursor.execute` returns the cursor itself, which we can then use to read the results.

In [None]:
res = cursor.execute("""
    SELECT * FROM students LIMIT 10;
""")
res

The `sqlite3` cursor operates like an iterator.

In [None]:
next(res)

In [None]:
[r for r in res]

The `sqlite3` cursor has a `description` attribute that lists the columns of the most recent query.

In [None]:
res.description
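
Iterating over the cursor and reading column names from `description` can be combined to turn rows into dictionaries. The following is a minimal sketch on a throwaway in-memory database; the `students` columns used here and the helper `rows_as_dicts` are illustrative assumptions, not the contents of `im.db`.

```python
import sqlite3

# In-memory database with a made-up table; not the real im.db schema.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE students (name TEXT, year INTEGER)")
demo_cur.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Ada", 1), ("Grace", 2)],
)

def rows_as_dicts(cur):
    """Pair every remaining row with the column names from cur.description."""
    # Each entry of description is a 7-tuple; its first item is the column name.
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur]

demo_res = demo_cur.execute("SELECT name, year FROM students ORDER BY name")
same_cursor = demo_res is demo_cur  # execute hands back the cursor it was called on
records = rows_as_dicts(demo_res)
print(records)
demo_conn.close()
```

The standard library also provides `sqlite3.Row` as a row factory, which gives name-based access to columns without a helper like this one.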

## SQL Alchemy

We can also use the [sqlalchemy library](http://docs.sqlalchemy.org/en/latest/core/tutorial.html) as an abstraction layer on top of the underlying database system. This is how you will likely connect to databases in many real-world applications.

In [None]:
import sqlalchemy

Here we use `sqlalchemy` to connect to SQLite, but in lecture Professor Gonzalez connected to a separate PostgreSQL server.

In [None]:
# The following line connects to Professor Gonzalez's PostgreSQL server.
# (SQLAlchemy 1.4 and later expect the "postgresql://" scheme rather than "postgres://".)
# engine = sqlalchemy.create_engine("postgresql://jegonzal:@localhost:5432/data100")

engine = sqlalchemy.create_engine("sqlite:///im.db")
conn = engine.connect()

Just as before, `sqlalchemy` returns a cursor-like result object for each query.

In [None]:
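# sqlalchemy.text wraps raw SQL; SQLAlchemy 2.0 no longer accepts plain strings here.
res = conn.execute(sqlalchemy.text("""
    SELECT * FROM students LIMIT 10;
"""))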
res

In [None]:
res.keys()

In [None]:
[r for r in res]

## Reading Directly into a DataFrame

We haven't yet started to work with the Pandas DataFrame (next lecture) but we will use DataFrames in this lecture to make it easier to see the table output.  In the following line of code we use the `read_sql` command to construct a DataFrame containing the results from the SQL query. 

The following function prints the SQL query and returns the DataFrame.  Notice that Jupyter is able to render the DataFrame in an easy-to-read format.  Also notice that `read_sql` loads the entire query result into memory, so a careless query could pull more data than fits in memory.

In [None]:
# Normally this will be at the top of the notebook 
# but we import it here to talk about it. 
# Pandas is always imported as pd (it's just standard style)
import pandas as pd

In [None]:
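def pretty_print_query(query, stop_early=100):
    """Print the query, run it on `conn`, and return the result as a DataFrame."""
    # Note: `stop_early` is currently unused, so every row of the result is loaded.
    print(query, "\n;")
    return pd.read_sql(query, conn)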

In [None]:
pretty_print_query("""
    SELECT * FROM names LIMIT 1000;
""")

Notice that the above command did load all 1000 rows, but the displayed output was at least truncated.
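
One way to keep a large result from being loaded all at once is to fetch rows in bounded batches. The sketch below uses the standard `sqlite3` module rather than pandas; the helper name `fetch_at_most` and the toy `numbers` table are assumptions made for illustration.

```python
import sqlite3

# Toy in-memory table with 1000 rows; a stand-in for a large table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE numbers (n INTEGER)")
demo_conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(1000)])

def fetch_at_most(connection, query, max_rows=100):
    """Run the query but pull at most max_rows rows into Python."""
    cur = connection.execute(query)
    # fetchmany returns a list of up to max_rows rows and leaves the rest unread.
    return cur.fetchmany(max_rows)

first_rows = fetch_at_most(demo_conn, "SELECT n FROM numbers ORDER BY n", max_rows=5)
print(first_rows)
demo_conn.close()
```

With pandas, a comparable option is the `chunksize` argument of `read_sql`, which yields the result as an iterator of smaller DataFrames instead of one large one.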

## Listing Tables with `sqlalchemy`

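Each database has a different way to get information about the database.  The `sqlalchemy` library provides a common abstraction through its inspector and the `get_table_names()` method (older releases also offered `engine.table_names()`, which was deprecated in SQLAlchemy 1.4):

In [None]:
sqlalchemy.inspect(engine).get_table_names()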

# Taking a Random Sample 

In lecture we talked about a method to take a random sample.  Here we apply that method to learn about the tables.

In [None]:
pretty_print_query("""
    SELECT * FROM names 
    ORDER BY RANDOM() 
    LIMIT 5
""")

The **names** table contains the *primary key* `nconst` and the `name`, `birth_year`, and `death_year` of the actors.  Note there are some missing values.  Try running the query multiple times.
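
To see why repeated runs give different rows, here is a small sketch of the `ORDER BY RANDOM() LIMIT k` pattern on a made-up `items` table (the table and its contents are assumptions for illustration only).

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
demo_conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(100)],
)

# Every row gets a fresh random sort key, and LIMIT keeps the first five.
sample_query = "SELECT id, label FROM items ORDER BY RANDOM() LIMIT 5"
sample_a = demo_conn.execute(sample_query).fetchall()
sample_b = demo_conn.execute(sample_query).fetchall()

# Each sample holds 5 distinct rows; the two samples will usually differ.
print(sample_a)
print(sample_b)
demo_conn.close()
```

Because the database has to visit every row to assign the random keys, this approach is convenient for exploration but can be slow on very large tables.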

In [None]:
pretty_print_query("""
    SELECT * FROM profession 
    ORDER BY RANDOM() 
    LIMIT 5
""")

The **profession** table contains each person's listed profession and what appears to be a foreign key `nconst` referencing the **names** table.

In [None]:
pretty_print_query("""
    SELECT * FROM titles 
    ORDER BY RANDOM() 
    LIMIT 5
""")

The **titles** table contains information about each film and appears to have the primary key `tconst`. 

In [None]:
pretty_print_query("""
    SELECT * FROM name_to_title 
    ORDER BY RANDOM() 
    LIMIT 5
""")

The **name_to_title** table (which probably should have been called "starring in") records the titles in which each actor starred.  Notice that there are two *foreign key* references in this table.
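
The primary key and foreign key relationships described above could be declared roughly as follows. This is only a sketch: the actual schema of `im.db` is not shown in this notebook, so the column types and the constraints below are assumptions.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
demo_conn.execute("PRAGMA foreign_keys = ON")
demo_conn.executescript("""
    CREATE TABLE names (nconst INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE titles (tconst INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE name_to_title (
        nconst INTEGER REFERENCES names(nconst),
        tconst INTEGER REFERENCES titles(tconst)
    );
    INSERT INTO names VALUES (1, 'Example Actor');
    INSERT INTO titles VALUES (10, 'Example Film');
    INSERT INTO name_to_title VALUES (1, 10);
""")

try:
    # nconst 99 does not exist in names, so this row violates the foreign key.
    demo_conn.execute("INSERT INTO name_to_title VALUES (99, 10)")
    violation_rejected = False
except sqlite3.IntegrityError as err:
    violation_rejected = True
    print("rejected:", err)
demo_conn.close()
```

When such constraints are declared and enforced, the database itself rejects rows that point at a missing person or title, which is one way data issues like dangling references can be prevented.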

<br/><br/><br/>

# Q1: What is the Average Age of Actors and Actresses in Films?



What tables do I need?

In [None]:
pretty_print_query("""
    SELECT * FROM names LIMIT 2
""")

In [None]:
pretty_print_query("""
    SELECT * FROM profession LIMIT 3
""")

To differentiate between actors and actresses we need to join the **profession** table with the **names** table.
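
The notebook writes its joins by listing tables in `FROM` and matching keys in `WHERE`. As a sketch on made-up rows (the toy data and integer keys are assumptions), the same join can also be written with an explicit `JOIN ... ON`, and both forms return the same rows:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE names (nconst INTEGER PRIMARY KEY, name TEXT, birth_year INTEGER);
    CREATE TABLE profession (nconst INTEGER, profession TEXT);
    INSERT INTO names VALUES (1, 'A. Example', 1950), (2, 'B. Example', 1960);
    INSERT INTO profession VALUES (1, 'actor'), (2, 'actress'), (2, 'director');
""")

# Comma-separated tables with the join condition in WHERE.
implicit_join = """
    SELECT names.nconst, name, profession
    FROM names, profession
    WHERE names.nconst = profession.nconst
    ORDER BY names.nconst, profession
"""
# The same join with the condition attached to the JOIN itself.
explicit_join = """
    SELECT names.nconst, name, profession
    FROM names JOIN profession ON names.nconst = profession.nconst
    ORDER BY names.nconst, profession
"""
implicit_rows = demo_conn.execute(implicit_join).fetchall()
explicit_rows = demo_conn.execute(explicit_join).fetchall()
print(implicit_rows)
demo_conn.close()
```

Note that a person with two professions appears twice in the joined result, once per matching row in **profession**.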

In [None]:
pretty_print_query("""
    SELECT 
        names.nconst, 
        name, 
        birth_year, 
        profession 
    FROM names, profession 
    WHERE 
        names.nconst = profession.nconst
    -- ORDER BY RANDOM() -- commented this out
    LIMIT 5
""")

Restrict to actors and actresses:

In [None]:
pretty_print_query("""
    SELECT names.nconst, name, birth_year, profession 
    FROM names, profession 
    WHERE 
        names.nconst = profession.nconst AND
        (profession = 'actor' OR profession = 'actress')
    LIMIT 5
""")

Is anyone listed as both an actor and an actress?

In [None]:
pretty_print_query("""
    SELECT names.nconst, name, COUNT(*) AS cnt
    FROM names, profession 
    WHERE 
        names.nconst = profession.nconst AND
        (profession = 'actor' OR profession = 'actress')
    
    GROUP BY names.nconst, name 
    HAVING COUNT(*) > 1
    
    ORDER BY name
""")

That is a lot!  These could be people with multiple roles, or they could reflect issues in the data.  We should probably investigate, but not today.
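
The difference between `WHERE` (which filters rows before grouping) and `HAVING` (which filters groups after aggregation) is easy to check on a tiny made-up table. The rows below are assumptions chosen so that exactly one person is listed under both professions.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE profession (nconst INTEGER, profession TEXT);
    INSERT INTO profession VALUES
        (1, 'actor'), (2, 'actress'), (2, 'actor'), (3, 'director'), (3, 'actor');
""")

both_query = """
    SELECT nconst, COUNT(*) AS cnt
    FROM profession
    WHERE profession = 'actor' OR profession = 'actress'   -- filters rows
    GROUP BY nconst
    HAVING COUNT(*) > 1                                    -- filters groups
    ORDER BY nconst
"""
listed_as_both = demo_conn.execute(both_query).fetchall()
print(listed_as_both)
demo_conn.close()
```

Person 3 is dropped because the `director` row is removed before grouping. In real data, `COUNT(*) > 1` would also match a person with a duplicated `actor` row; replacing it with `COUNT(DISTINCT profession) > 1` is one way to tell those two cases apart.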

## Common Table Expressions

The next step in our query about the age of actors in different films requires combining the actor information with the date of their films.  However, we will also want to keep track of the role (actor/actress) information.

To do this we will use a **Common Table Expression** (CTE): a named subquery, introduced with `WITH`, that the main query can reference like a table.
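
Before applying this to the real tables, here is a minimal sketch of the `WITH` syntax on toy data. The `people`, `films`, and `cast_members` tables are hypothetical stand-ins, not the notebook's schema.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth_year INTEGER);
    CREATE TABLE films (id INTEGER PRIMARY KEY, title TEXT, start_year INTEGER);
    CREATE TABLE cast_members (person_id INTEGER, film_id INTEGER);
    INSERT INTO people VALUES (1, 'A. Example', 1950), (2, 'B. Example', 1980);
    INSERT INTO films VALUES (10, 'Example Film', 1990);
    INSERT INTO cast_members VALUES (1, 10), (2, 10);
""")

cte_query = """
    WITH older_people AS (                 -- the named subquery
        SELECT id, name, birth_year FROM people WHERE birth_year < 1970
    )
    SELECT p.name, f.title, f.start_year - p.birth_year AS age
    FROM older_people p, cast_members c, films f   -- used like a table
    WHERE p.id = c.person_id AND c.film_id = f.id
"""
cte_rows = demo_conn.execute(cte_query).fetchall()
print(cte_rows)
demo_conn.close()
```

The CTE keeps the filtering logic in one named place, so the main query only has to describe how the pieces join together.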

In [None]:
pretty_print_query("""
WITH actors_with_profession AS (
    SELECT names.nconst, name, birth_year, profession 
    FROM names, profession 
    WHERE 
        names.nconst = profession.nconst AND
        (profession = 'actor' OR profession = 'actress')
    )
SELECT 
    a.name, 
    t.start_year - a.birth_year AS age, 
    t.start_year as year, 
    profession
FROM 
    actors_with_profession a, 
    name_to_title nt, 
    titles t
WHERE 
    a.nconst = nt.nconst AND nt.tconst = t.tconst
LIMIT 10
""")

We are programming in Python so we can actually use code to help organize the query:

In [None]:
actors_with_profession = """
    SELECT names.nconst, name, birth_year, profession 
    FROM names, profession 
    WHERE 
        names.nconst = profession.nconst AND
        (profession = 'actor' OR profession = 'actress')
"""

The fragment is spliced into the full query using Python f-strings (formatted string literals).
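
Splicing SQL with f-strings is reasonable for fixed fragments that we wrote ourselves. For values that vary (especially anything supplied by a user), a safer habit is to pass them as bound parameters. The sketch below shows both side by side on a made-up `titles` table; the data and variable names are assumptions.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE titles (tconst INTEGER PRIMARY KEY, title TEXT, start_year INTEGER);
    INSERT INTO titles VALUES (1, 'Old Film', 1935), (2, 'Newer Film', 1975);
""")

# A fixed, trusted SQL fragment: fine to splice in with an f-string.
recent_titles = "SELECT tconst, title, start_year FROM titles WHERE start_year > 1940"

# A value that might come from outside: bind it with a ? placeholder instead.
wanted_year = 1975
param_query = f"""
    WITH recent_titles AS ({recent_titles})
    SELECT title FROM recent_titles WHERE start_year = ?
"""
param_rows = demo_conn.execute(param_query, (wanted_year,)).fetchall()
print(param_rows)
demo_conn.close()
```

Binding the value lets the database driver handle quoting, which avoids SQL injection problems that string formatting can introduce.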

In [None]:
pretty_print_query(f"""
WITH actors_with_profession AS ({actors_with_profession})
SELECT 
    a.name, 
    t.start_year - a.birth_year AS age, 
    t.start_year as year, 
    profession
FROM 
    actors_with_profession a, 
    name_to_title nt, 
    titles t
WHERE 
    a.nconst = nt.nconst AND nt.tconst = t.tconst
LIMIT 10
""")

## Checking the Actor Age Calculation

I have added the birth year of the actor and the title of the film and ordered by age.

In [None]:
pretty_print_query(f"""
WITH actors_with_profession AS ({actors_with_profession})
SELECT 
    a.name, 
    t.start_year - a.birth_year AS age,
    birth_year AS born,
    t.start_year as year, 
    title,
    profession
FROM 
    actors_with_profession a, 
    name_to_title nt, 
    titles t
WHERE 
    a.nconst = nt.nconst AND nt.tconst = t.tconst
ORDER BY age
LIMIT 10;
""")

More issues!  For now let's focus on more recent films and constrain the age to be a positive value.  This will be throwing away data that might be significant.  In a real analysis we would want to examine the data we are removing to understand biases we may be introducing.
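
One detail worth keeping in mind is that a filter such as `start_year - birth_year > 0` also drops rows where the birth year is missing, because arithmetic on `NULL` yields `NULL` and the comparison is then not true. The sketch below, on a made-up `appearances` table, counts what such a filter removes and why; the table and its rows are assumptions.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE appearances (name TEXT, birth_year INTEGER, start_year INTEGER);
    INSERT INTO appearances VALUES
        ('known age', 1950, 1990),
        ('missing birth year', NULL, 1990),
        ('born after film', 1995, 1990);
""")

# Rows that survive the positive-age filter.
kept = demo_conn.execute("""
    SELECT name, start_year - birth_year AS age
    FROM appearances
    WHERE start_year - birth_year > 0
""").fetchall()

# Count what the filter removed, split by reason (SQLite booleans are 0 or 1).
dropped_null, dropped_nonpositive = demo_conn.execute("""
    SELECT
        SUM(birth_year IS NULL),
        SUM(birth_year IS NOT NULL AND start_year - birth_year <= 0)
    FROM appearances
""").fetchone()
print(kept, dropped_null, dropped_nonpositive)
demo_conn.close()
```

Running a count like this against the real tables would show how much of the discarded data comes from missing values versus implausible ages.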

In [None]:
pretty_print_query(f"""
WITH actors_with_profession AS ({actors_with_profession})
SELECT 
    a.name, 
    t.start_year - a.birth_year AS age, 
    t.start_year as year, 
    profession
FROM 
    actors_with_profession a, 
    name_to_title nt, 
    titles t
WHERE 
    a.nconst = nt.nconst AND nt.tconst = t.tconst 
        AND t.start_year - a.birth_year > 0
        AND t.start_year > 1940
        AND t.start_year < 2020
ORDER BY RANDOM()
LIMIT 20
""")

## Let's examine aggregate summaries for each year

Notice I am saving the resulting DataFrame for future visualization.  We will learn more about this in future lectures.

In [None]:
df = pretty_print_query(f"""
WITH actors_with_profession AS ({actors_with_profession})
SELECT 
    t.start_year AS year, 
    profession, 
    AVG(t.start_year - a.birth_year) AS avg_age,  
    COUNT(*) AS cnt
FROM 
    actors_with_profession a, 
    name_to_title nt, 
    titles t
WHERE a.nconst = nt.nconst AND nt.tconst = t.tconst 
    AND t.start_year - a.birth_year > 0
    AND t.start_year > 1940
    AND t.start_year < 2020
GROUP BY year, profession
ORDER BY year, profession
""")
df

## Visualizing the Resulting Table

In future lectures we will cover how to visualize data.  Here I am using `plotly`, `cufflinks` and `pandas` to build an interactive web visualization (in one line!).

In [None]:
import plotly.offline as py
import plotly.express as px
import cufflinks as cf
cf.set_config_file(sharing="private", offline=True, offline_connected=False)

In [None]:
df.iplot(kind="line", x="year", y="cnt", yTitle="Count",
         categories="profession", 
         colors={"actor": "blue", "actress": "red"}, 
         mode="lines+markers")
df.iplot(kind="line", x="year", y="avg_age", yTitle="Average Age",
         categories="profession", 
         colors={"actor": "blue", "actress": "red"}, 
         mode="lines+markers")

## Digging into 1970

We restrict the data to 1970 and 1971 using `WHERE (t.start_year = 1970 OR t.start_year = 1971)`.

In [None]:
df70s = pretty_print_query(f"""
WITH actors_with_profession AS ({actors_with_profession})
SELECT 
    t.start_year - a.birth_year AS age, 
    t.start_year AS year, 
    COUNT(*) AS cnt
FROM 
    actors_with_profession a, 
    name_to_title nt, 
    titles  t
WHERE a.nconst = nt.nconst AND nt.tconst = t.tconst 
    AND t.start_year - a.birth_year > 0
    AND (t.start_year = 1970 OR t.start_year = 1971)
GROUP BY age, t.start_year
""")
df70s

Visualizing the distribution of actors at each age, we see a large number of young actors in 1970 that are not present in 1971.

In [None]:
px.bar(df70s.astype({"year":"str"}), x="age", y="cnt", color="year", barmode="overlay")

Let's look at which titles have many young actors in 1970:

In [None]:
df70s_1 = pretty_print_query(f"""
WITH actors_with_profession AS ({actors_with_profession})
SELECT birth_year, name, profession, title, title_type,
    (t.start_year - a.birth_year) AS age
FROM 
    actors_with_profession a, name_to_title nt, titles t
WHERE 
    a.nconst = nt.nconst AND nt.tconst = t.tconst 
    AND t.start_year - a.birth_year > 0
    AND t.start_year = 1970
    AND (t.start_year - a.birth_year) < 10
""")
df70s_1

In the following line I am cheating: rather than grouping by title in SQL, I count the titles with the Pandas DataFrame API (`value_counts`).

In [None]:
df70s_1['title'].value_counts().iplot(kind='barh')
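
For comparison, the same per-title count can be expressed in SQL with `GROUP BY` and `COUNT(*)`. This is a sketch on a made-up `young_cast` table standing in for the query result above; the table name and rows are assumptions.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE young_cast (name TEXT, title TEXT);
    INSERT INTO young_cast VALUES
        ('a', 'Show X'), ('b', 'Show X'), ('c', 'Show X'),
        ('d', 'Film Y'), ('e', 'Film Z'), ('f', 'Film Z');
""")

# One row per title, most frequent first (ties broken alphabetically).
title_counts = demo_conn.execute("""
    SELECT title, COUNT(*) AS cnt
    FROM young_cast
    GROUP BY title
    ORDER BY cnt DESC, title
""").fetchall()
print(title_counts)
demo_conn.close()
```

Doing the count in the database means only the summary rows are transferred to Python, which matters more as the underlying table grows.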

We would normally want to investigate the TV show Tatort a bit more, but not today.  Instead we will restrict our attention to titles that are actually movies.

# Focusing on Movies Only

In [None]:
df = pretty_print_query(f"""
WITH actors_with_profession AS ({actors_with_profession})
SELECT t.start_year as year, profession, 
    AVG(t.start_year - a.birth_year) AS avg_age,  
    COUNT(*) AS cnt
FROM 
    actors_with_profession a, 
    name_to_title nt, 
    titles t
WHERE a.nconst = nt.nconst AND nt.tconst = t.tconst 
    AND t.start_year - a.birth_year > 0
    AND t.start_year > 1940
    AND t.start_year < 2020
    AND t.title_type = 'movie'
GROUP BY year, profession
ORDER BY year, profession
""")
df

In [None]:
df.iplot(kind="line", x="year", y="cnt", categories="profession", colors={"actor": "blue", "actress": "red"}, mode="lines+markers")
df.iplot(kind="line", x="year", y="avg_age", categories="profession", colors={"actor": "blue", "actress": "red"}, mode="lines+markers")

Do you see any interesting trends?