# 💽 Lecture 20 – Data 100, Summer 2025


[Acknowledgments Page](https://ds100.org/su25/acks/)

## Starting Up SQL

Before we look at SQL syntax in detail, let's first get ourselves set up to run SQL queries in Jupyter.

### Approach #1: SQL Magic

**1. Load the `sql` Module.** 

Loading the `sql` extension registers the `%sql` and `%%sql` magics, which let the Jupyter notebook run SQL.

> FYI: `%sql` is called line magic because it only applies to one line. We will see shortly that `%%sql` applies to an entire cell. So, it's called cell magic.

In [None]:
%load_ext sql

**2. Connect to a database.**  

Here, we connect to the SQLite database `basic_examples.db` and the DuckDB database `example_duck.db`.

In [None]:
%sql sqlite:///data/basic_examples.db --alias sqlite_ex

In [None]:
%sql duckdb:///data/example_duck.db --alias duckdb_ex

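If you were connecting to an "enterprise data platform" such as Snowflake or Databricks, you would first create a SQLAlchemy engine and then pass it to `%sql`: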

```python
from sqlalchemy import create_engine

snow_engine = create_engine(
    f"snowflake://{user}:{password}@{account_identifier}")
%sql snow_engine --alias snow

db_engine = create_engine(
  url = f"databricks://token:{access_token}@{server_hostname}?" +
        f"http_path={http_path}&catalog={catalog}&schema={schema}"
)
%sql db_engine --alias db
```

<br/>

**3. Run a simple SQL query.** 

`%sql` parses only the immediate line as a SQL command.

In [None]:
%sql SELECT * FROM Dragon;

The `%%sql` command, on the other hand, lets Jupyter parse the rest of the lines (the entire code block) as a SQL command.

In [None]:
%%sql
SELECT * FROM Dragon;

The same simple query, this time split across two lines.

In [None]:
%%sql
SELECT *
FROM Dragon;

#### Storing one-line `%sql` queries

For simple one-line queries, you can use IPython's ability to store the result of a magic command like `%sql` as if it were any other Python statement, and save the output to a variable:

In [None]:
dragon_table = %sql SELECT * FROM Dragon
dragon_table

The result of the query is a Python object of type `ResultSet`. More specifically:

In [None]:
type(dragon_table)

You need to manually convert it to a Pandas `DataFrame` if you want to do pandas-things with its content:

In [None]:
dragon_df = dragon_table.DataFrame()
dragon_df

You can configure `jupysql` to _automatically_ convert all outputs to Pandas DataFrames. This can be handy if you intend all your Python-side work to be done with Pandas, as it saves you from manually having to call `.DataFrame()` first on all outputs. 

- On the other hand, you don't get access to the original SQL `ResultSet` objects, which have a number of interesting properties and capabilities. You can learn more about those in the [jupysql documentation](https://jupysql.ploomber.io).

For now, let's turn this on so you can see what this simplified, "pandas all the way" workflow looks like:

In [None]:
%config SqlMagic.autopandas = True

In [None]:
dragon_df = %sql SELECT * FROM Dragon
dragon_df

In [None]:
type(dragon_df)

#### Storing output of multiple SQL lines

You can use the `variable <<` syntax in jupysql to store the output of a multi-line `%%sql` cell in a Python variable.

- Note: This will follow your `autopandas` state and store either a `sql.run.ResultSet` or a Pandas `DataFrame`.

In [None]:
%%sql res <<
SELECT *
FROM Dragon;

In [None]:
res

### Approach #2: `pd.read_sql`

It turns out that `pandas` has a special-purpose function for running SQL queries. We pass in a SQL query as a string, along with a database connection, and get back a `pandas` DataFrame. To achieve the same result as we did using cell magic above, we can do the following.

**1. Connect to a database**

In [None]:
import sqlalchemy 
import pandas as pd

engine = sqlalchemy.create_engine("duckdb:///data/example_duck.db")

**2. Run a simple SQL query**

In [None]:
query = """
SELECT * 
FROM Dragon;
"""

df = pd.read_sql(query, engine)
df
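As a small, self-contained sketch of the same idea, the snippet below runs `pd.read_sql` against an in-memory SQLite database instead of the course's `example_duck.db`. The table contents are made-up rows used only for illustration, and the `params` argument shows one way to pass a value into the query without building the SQL string by hand.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory stand-in for the Dragon table (rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dragon (name TEXT PRIMARY KEY, year INTEGER, cute INTEGER)")
conn.executemany(
    "INSERT INTO Dragon VALUES (?, ?, ?)",
    [("hiccup", 2010, 10), ("drogon", 2011, -100), ("puff", 2010, 100)],
)
conn.commit()

# The ? placeholder is filled in from `params`, so the year is never
# spliced into the SQL text manually.
query = "SELECT name, cute FROM Dragon WHERE year = ? ORDER BY name"
df_2010 = pd.read_sql(query, conn, params=(2010,))
print(df_2010)
```

Here `pd.read_sql` is given a plain `sqlite3` connection; with other databases (such as the DuckDB file above) you would pass a SQLAlchemy engine as in the cell before.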

### Approach #3: DuckDB Special

With DuckDB we can directly reference dataframe objects in our Python environment:

In [None]:
import seaborn as sns
import duckdb
mpg = sns.load_dataset("mpg")

In [None]:
output = duckdb.query("SELECT * FROM mpg")
output

In [None]:
type(output)

In [None]:
output.df()


---

## Tables and Schema

A **database** contains a collection of SQL **tables**. Let's connect to our "toy" database `example_duck.db` and explore the tables it stores.

In [None]:
%%sql
SELECT * FROM information_schema.tables

In [None]:
%%sql
SELECT * FROM information_schema.columns

### Getting Schema information with SQLAlchemy 
How you list the tables varies across database platforms.  For example, the statement:

```sql
SELECT * FROM information_schema.columns
```

works only on databases that provide the standard `information_schema` views, such as PostgreSQL and DuckDB. SQLite does not provide them, so to get the schema for tables in SQLite we would need the following:

In [None]:
pd.read_sql("SELECT * FROM sqlite_schema", "sqlite:///data/basic_examples.db")

Fortunately, SQLAlchemy has some generic tools that will be helpful regardless of what database platform you use.

In [None]:
from sqlalchemy import inspect
inspector = inspect(engine)
inspector.get_table_names()

In [None]:
inspector.get_columns('scene')

The same approach works with SQLite:

In [None]:
sqlite_engine = sqlalchemy.create_engine("sqlite:///data/basic_examples.db")
inspect(sqlite_engine).get_columns("scene")
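For comparison, here is a minimal sketch of what SQLite-specific schema inspection looks like without SQLAlchemy, using only Python's built-in `sqlite3` module. The `scene` table and its columns below are hypothetical stand-ins created in memory, not the table in `basic_examples.db`.

```python
import sqlite3

# Hypothetical in-memory table; the column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scene (id INTEGER PRIMARY KEY, biome TEXT NOT NULL, city TEXT)")

# sqlite_master is the older name for sqlite_schema and is accepted by
# every SQLite version.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(scene)")]

print(tables)
print(columns)
```

Both statements are SQLite-only, which is exactly the portability problem the SQLAlchemy inspector hides.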

Here is a more advanced example that creates tables with primary key, foreign key, and `CHECK` constraints:

In [None]:
%sql duckdb:///data/duckdb_example.db --alias student_db

In [None]:
%%sql student_db

DROP TABLE IF EXISTS grade;
DROP TABLE IF EXISTS assignment;
DROP TABLE IF EXISTS student;


CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name VARCHAR,
    email VARCHAR
);

CREATE TABLE assignment (
    assignment_id INTEGER PRIMARY KEY,
    description VARCHAR
);

CREATE TABLE grade (
    student_id INTEGER,
    assignment_id INTEGER,
    score REAL CHECK (score > 0 AND score <= 100),
    FOREIGN KEY (student_id) REFERENCES student(student_id),
    FOREIGN KEY (assignment_id) REFERENCES assignment(assignment_id)
);

INSERT INTO student VALUES
(123, 'JoeyG', 'jegonzal@berkeley.edu'),
(456, 'NargesN', 'norouzi@berkeley.edu');

INSERT INTO assignment VALUES
(1, 'easy assignment'),
(2, 'hard assignment');

In [None]:
%%sql 
INSERT INTO grade VALUES
(123, 1, 80),
(123, 2, 42),
(456, 2, 100);

In [None]:
%sql SELECT * FROM grade;
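To see what the constraints buy us, the sketch below rebuilds a trimmed-down version of the same three tables in an in-memory SQLite database and tries to insert rows that break the rules. It uses SQLite rather than DuckDB only so that it runs with the standard library; the helper `try_insert` is a hypothetical name introduced for this illustration, and note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name VARCHAR);
CREATE TABLE assignment (assignment_id INTEGER PRIMARY KEY, description VARCHAR);
CREATE TABLE grade (
    student_id INTEGER,
    assignment_id INTEGER,
    score REAL CHECK (score > 0 AND score <= 100),
    FOREIGN KEY (student_id) REFERENCES student(student_id),
    FOREIGN KEY (assignment_id) REFERENCES assignment(assignment_id)
);
INSERT INTO student VALUES (123, 'JoeyG');
INSERT INTO assignment VALUES (1, 'easy assignment');
""")


def try_insert(row):
    """Insert one grade row; report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO grade VALUES (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


valid = try_insert((123, 1, 80))         # satisfies every constraint
bad_score = try_insert((123, 1, 150))    # violates the CHECK on score
bad_student = try_insert((999, 1, 50))   # no student with id 999
print(valid, bad_score, bad_student, sep="\n")
```

Only the first row ends up in `grade`; the other two are refused by the database itself rather than by application code.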

<br/>

---

## Basic Queries

### `SELECT` and `FROM`
Every SQL query *must* contain a `SELECT` and `FROM` clause.

* `SELECT`: specify the column(s) to return in the output.
* `FROM`: specify the database table from which to extract data.

First, let's reconnect to our `duckdb_ex` database from earlier:

In [None]:
%sql duckdb_ex

In [None]:
%%sql
SELECT * 
FROM Dragon;

In [None]:
%%sql
SELECT cute, year 
FROM Dragon;

### Aliasing with `AS`

In [None]:
%%sql
SELECT cute AS cuteness,
       year AS "birth year"
FROM Dragon;

`AS` is technically optional, but it is good practice to include it!

In [None]:
%%sql
SELECT cute cuteness,
       year "birth year"
FROM Dragon;

### Uniqueness with `DISTINCT`

In [None]:
%%sql
SELECT DISTINCT year
FROM Dragon;
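The effect of aliasing and `DISTINCT` can be checked outside the notebook as well. The following sketch uses Python's built-in `sqlite3` module with a hypothetical, made-up `Dragon` table; it reads the output column names from the cursor to show that the aliases replace the original names, and collects the de-duplicated years.

```python
import sqlite3

# Hypothetical stand-in for the Dragon table (rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dragon (name TEXT, year INTEGER, cute INTEGER)")
conn.executemany(
    "INSERT INTO Dragon VALUES (?, ?, ?)",
    [("hiccup", 2010, 10), ("drogon", 2011, -100), ("puff", 2010, 100)],
)

# Aliases become the column names of the result; an alias containing a
# space must be wrapped in double quotes.
cur = conn.execute('SELECT cute AS cuteness, year AS "birth year" FROM Dragon')
alias_names = [col[0] for col in cur.description]

# DISTINCT collapses the two 2010 rows into a single value.
distinct_years = sorted(
    row[0] for row in conn.execute("SELECT DISTINCT year FROM Dragon")
)

print(alias_names)
print(distinct_years)
```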

### Filtering with `WHERE`

In [None]:
%%sql
SELECT name, year
FROM Dragon
WHERE cute > 0;

In [None]:
%%sql
SELECT name, cute, year
FROM Dragon
WHERE cute > 0 OR year > 2013;

In [None]:
%%sql
SELECT name, year
FROM Dragon 
WHERE name IN ('puff', 'hiccup');

In [None]:
%%sql
SELECT name, cute
FROM Dragon
WHERE cute IS NOT NULL;
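The `IS NOT NULL` filter matters because comparisons against `NULL` are never true in SQL. As a minimal sketch (again with a hypothetical, made-up `Dragon` table in an in-memory SQLite database), the helper `names_where` below is an illustrative name that runs the same query with different `WHERE` conditions:

```python
import sqlite3

# Hypothetical stand-in for the Dragon table; smaug has a missing cute value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dragon (name TEXT, year INTEGER, cute INTEGER)")
conn.executemany(
    "INSERT INTO Dragon VALUES (?, ?, ?)",
    [("hiccup", 2010, 10), ("drogon", 2011, -100),
     ("puff", 2010, 100), ("smaug", 2011, None)],
)


def names_where(condition):
    """Return the sorted names of rows that satisfy a fixed WHERE condition."""
    sql = f"SELECT name FROM Dragon WHERE {condition} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]


gt_zero = names_where("cute > 0")            # NULL row is not included
not_gt_zero = names_where("NOT (cute > 0)")  # NULL row is not included here either
eq_null = names_where("cute = NULL")         # never true, so always empty
is_null = names_where("cute IS NULL")        # the correct way to find missing values

print(gt_zero, not_gt_zero, eq_null, is_null, sep="\n")
```

The row with a missing `cute` value shows up in neither `cute > 0` nor its negation, which is why `IS NULL` and `IS NOT NULL` exist as separate operators.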

### Ordering data using `ORDER BY`

In [None]:
%%sql
SELECT *
FROM Dragon
ORDER BY cute DESC;

### Restricting output with `LIMIT` and `OFFSET`

In [None]:
%%sql
SELECT *
FROM Dragon
LIMIT 2;

In [None]:
%%sql
SELECT *
FROM Dragon
LIMIT 2
OFFSET 1;
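A common use of `LIMIT` together with `OFFSET` is paging through a result a few rows at a time. The sketch below assumes a hypothetical, made-up `Dragon` table and a helper named `page` (not part of any library). It adds an `ORDER BY`, since without one the database is free to return rows in any order and the pages may not be stable.

```python
import sqlite3

# Hypothetical stand-in for the Dragon table (names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dragon (name TEXT)")
conn.executemany(
    "INSERT INTO Dragon VALUES (?)",
    [("hiccup",), ("drogon",), ("puff",), ("smaug",), ("toothless",)],
)


def page(page_number, page_size):
    """Return one page of names; page_number starts at 0."""
    offset = page_number * page_size  # rows to skip before this page
    rows = conn.execute(
        "SELECT name FROM Dragon ORDER BY name LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [row[0] for row in rows]


first, second, third, fourth = page(0, 2), page(1, 2), page(2, 2), page(3, 2)
print(first, second, third, fourth, sep="\n")
```

The last non-empty page may be shorter than `page_size`, and any page past the end is simply empty.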

### Sampling with `RANDOM()`
What if we wanted a random sample?

In [None]:
%%sql
SELECT *
FROM Dragon
ORDER BY RANDOM() 
LIMIT 2

In [None]:
%%sql
SELECT * 
FROM Dragon USING SAMPLE reservoir(2 ROWS) REPEATABLE (100);
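The `REPEATABLE (100)` clause above fixes the seed in DuckDB, so the same rows come back on every run. SQLite's `RANDOM()` takes no seed, so one possible way to get a reproducible sample there is to draw it in Python instead. The sketch below uses a hypothetical, made-up `Dragon` table, and `seeded_sample` is an illustrative helper name; it reads every name into memory, which is only reasonable for small tables.

```python
import random
import sqlite3

# Hypothetical stand-in for the Dragon table (names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dragon (name TEXT)")
conn.executemany(
    "INSERT INTO Dragon VALUES (?)",
    [("hiccup",), ("drogon",), ("puff",), ("smaug",), ("toothless",)],
)

# Unseeded sample in SQL: shuffle all rows, keep two. Differs between runs.
sql_sample = [
    row[0]
    for row in conn.execute("SELECT name FROM Dragon ORDER BY RANDOM() LIMIT 2")
]

# Seeded sample in Python: fetch the rows in a fixed order, then sample
# with a dedicated random generator so the result depends only on the seed.
all_names = [row[0] for row in conn.execute("SELECT name FROM Dragon ORDER BY name")]


def seeded_sample(names, k, seed):
    """Return k names chosen without replacement, reproducibly for a given seed."""
    return random.Random(seed).sample(names, k)


py_sample = seeded_sample(all_names, 2, seed=100)
print(sql_sample)
print(py_sample)
```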