# STA 141B Lecture 14

The class website is <https://github.com/2019-winter-ucdavis-sta141b/notes>

### Announcements

* Want additional feedback or a regrade on Assignment 3? Email Shan (<shjiang@ucdavis.edu>) and CC me (<naulle@ucdavis.edu>).
* Presentation info coming next week.
* Assignment 4 deadline extended 1 day (to Saturday).

### Topics

* Databases & SQL

### Datasets

* The [Suppliers Database](http://nick-ulle.github.io/teach/suppliers.sqlite)

### References

* [W3 Schools SQL Tutorial](https://www.w3schools.com/sql/)
* [SQL Cheatsheet](http://anson.ucdavis.edu/~clarkf/sql/sql_cheatsheet.pdf)

In [1]:
import numpy as np
import pandas as pd

## Databases

A _database_ is a collection of data. There are several different models for how to organize data in a database; these are called _database models_. In this context, "model" refers to a design or mental model, not a statistical model.

The _relational model_ organizes data as a collection of tables. Tables have rows (also called _tuples_ or _records_) and columns (also called _attributes_). Most tables have a _key_: a column (or combination of columns) whose values are unique for each row. Keys are also used to _relate_ a table to other tables. The relational model is the most popular database model by far, and the one we'll focus on in this course.

There are also many different software programs for managing databases, called _database management systems_ (DBMS). Each DBMS usually has its own format for storing data on disk, independent of the database model. Some popular DBMSes are:

* [SQLite](https://www.sqlite.org/)
* [MySQL](https://www.mysql.com/)
* [Microsoft SQL Server](https://www.microsoft.com/en-us/sql-server)
* [PostgreSQL](https://www.postgresql.org/)

Why use a database? There are several reasons:

* Your data may already be in a database, so converting to another format is extra work.
* Database operations are highly optimized, so they typically take less time and memory than an equivalent operation in Python.
* Database operations can run on datasets that are too large to fit in memory. Doing this in Python requires special programming strategies.
* Many DBMSes provide built-in version control, multi-user access, and security checks.
* Databases can be updated in real time.

## Structured Query Language

_Structured query language_ (SQL) is a language designed for querying information in relational databases.

A free SQL tutorial is available [here](https://www.w3schools.com/sql/).

### Getting Connected

There are several ways to connect to a database and run SQL queries from Python:

* The built-in __sqlite3__ module, which only supports SQLite.
* The __sqlalchemy__ package, a unified interface for a variety of different SQL database formats (more than just SQLite). See the [tutorial](http://docs.sqlalchemy.org/en/latest/core/tutorial.html) for more details.

We'll use a SQLite database here, since SQLite is possibly [the most-used database engine in the world](https://sqlite.org/mostdeployed.html). SQLite's popularity is partly due to its reliability, easy setup, and broad range of features.

Let's connect to the suppliers database:

In [2]:
import sqlite3 as sql

To connect to a database, use the module's `connect()` function. This is similar to opening a file; you should close the database when you're done using it.

In [3]:
db = sql.connect("../data/suppliers.sqlite")

In [4]:
db

<sqlite3.Connection at 0x7fb882ef4ab0>

To execute a SQL query, use the connection's `.execute()` method. This returns a _cursor_, which is a pointer to the results in the database (imagine a finger pointing at the results).

SQLite databases store metadata in a special table called `sqlite_master`. We can use `sqlite_master` to find out the names of the other tables in the database.

In [6]:
cur = db.execute("SELECT * FROM sqlite_master")

To get the results from the database, use one of the cursor's fetch methods. The `.fetchall()` method returns all rows in the result.

In [7]:
cur.fetchall()

[('table',
  'Suppliers',
  'Suppliers',
  2,
  'CREATE TABLE Suppliers (\n  SupplierID integer,\n  SupplierName text,\n  Status integer,\n  City text,\n  PRIMARY KEY(SupplierID)\n)'),
 ('table',
  'Parts',
  'Parts',
  3,
  'CREATE TABLE Parts (\n  PartID integer,\n  PartName text,\n  Color text,\n  Weight real,\n  City text,\n  PRIMARY KEY(PartID)\n)'),
 ('table',
  'SupplierParts',
  'SupplierParts',
  4,
  'CREATE TABLE SupplierParts (\n  PartID integer,\n  SupplierID integer,\n  Qty integer,\n  PRIMARY KEY(PartID, SupplierID)\n)'),
 ('index', 'sqlite_autoindex_SupplierParts_1', 'SupplierParts', 5, None)]
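Besides `.fetchall()`, cursors also provide `.fetchone()` and `.fetchmany()`. The sketch below is illustrative only: it assumes an in-memory copy of the `Parts` table (built by hand so the code runs without `suppliers.sqlite`) and uses a query with `ORDER BY` just to make the row order predictable.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

cur = db.execute("SELECT PartID, PartName FROM Parts ORDER BY PartID")
first = cur.fetchone()       # the next row as a tuple
next_two = cur.fetchmany(2)  # a list with up to 2 further rows
rest = cur.fetchall()        # a list with every remaining row
leftover = cur.fetchone()    # None, because the cursor is exhausted

print(first, next_two, rest, leftover)
db.close()
```

Each fetch call advances the cursor, so a row is returned only once.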

By default, `sqlite3` returns rows as tuples. If you'd rather look up values by column name, as with a dictionary, set the `.row_factory` attribute on the database connection to `sql.Row`.

In [8]:
db.row_factory = sql.Row

Now the rows will behave like dictionaries:

In [13]:
cur = db.execute("SELECT * FROM sqlite_master")
rows = cur.fetchall()
dict(rows[0])

{'type': 'table',
 'name': 'Suppliers',
 'tbl_name': 'Suppliers',
 'rootpage': 2,
 'sql': 'CREATE TABLE Suppliers (\n  SupplierID integer,\n  SupplierName text,\n  Status integer,\n  City text,\n  PRIMARY KEY(SupplierID)\n)'}
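As a rough illustration of what `Row` objects allow, the sketch below assumes a hand-built, in-memory copy of the `Parts` table rather than the real database file. It shows lookup by column name, lookup by position, and conversion to an ordinary dictionary.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

db.row_factory = sqlite3.Row  # must be set before the query is executed
row = db.execute("SELECT * FROM Parts WHERE PartID = 6").fetchone()

by_name = row["PartName"]  # lookup by column name
by_position = row[1]       # lookup by position still works
columns = row.keys()       # list of column names
as_dict = dict(row)        # convert to a plain dictionary

print(by_name, by_position, columns)
db.close()
```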

Don't forget to close the database when you're done!

In [None]:
# db.close()
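One way to make sure the connection is always closed is `contextlib.closing()` from the standard library. This is a minimal sketch, not the approach used in the rest of these notes; the table `t` is a made-up example. Note that using the connection itself in a `with` statement manages transactions but does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls db.close() when the block ends, even if an error occurs.
with closing(sqlite3.connect(":memory:")) as db:
    db.execute("CREATE TABLE t (x integer)")  # hypothetical example table
    db.execute("INSERT INTO t VALUES (1), (2), (3)")
    total = db.execute("SELECT SUM(x) FROM t").fetchone()[0]

# The connection is closed now, so further queries raise an error.
try:
    db.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(total, closed)
```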

We'll generally use the `pd.read_sql()` function in __pandas__ to run our SQL queries. 

The function takes a SQL query and an open database connection as arguments, so you still need to connect to the database first with `sqlite3` or `sqlalchemy`. The result of the query is returned as a data frame.

### `SELECT`

The `SELECT` command selects rows from a table. Most of your SQL queries will start with `SELECT`. The syntax is:

```sql
SELECT col1, col2, ... FROM my_table;
```

Here `col1`, `col2`, and so on are column names and `my_table` is a table name. You can select all columns with an asterisk `*`.

SQL keywords are not case-sensitive, and in SQLite neither are table and column names. Extra whitespace is ignored. The convention is to write SQL keywords in uppercase and column/table names in lowercase. A semicolon `;` marks the end of a SQL query, but many tools treat it as optional.
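A small check of these rules, assuming a hand-built, in-memory copy of the `Parts` table instead of the real database file. It also shows one caveat: in SQLite's default configuration, string values compared with `=` are still case-sensitive.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# Keyword case, identifier case, whitespace, and the final semicolon
# do not change the result.
tidy = db.execute("SELECT PartName, Color FROM Parts;").fetchall()
messy = db.execute("select partname,   color\nfrom parts").fetchall()
same = tidy == messy

# String *values* are a different matter: 'Red' and 'red' are not equal.
n_exact = db.execute("SELECT COUNT(*) FROM Parts WHERE Color = 'Red'").fetchone()[0]
n_lower = db.execute("SELECT COUNT(*) FROM Parts WHERE Color = 'red'").fetchone()[0]

print(same, n_exact, n_lower)
db.close()
```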

In [15]:
pd.read_sql("SELECT * FROM sqlite_master;", db)

|   | type | name | tbl_name | rootpage | sql |
|---|---|---|---|---|---|
| 0 | table | Suppliers | Suppliers | 2 | CREATE TABLE Suppliers (\n SupplierID integer... |
| 1 | table | Parts | Parts | 3 | CREATE TABLE Parts (\n PartID integer,\n Par... |
| 2 | table | SupplierParts | SupplierParts | 4 | CREATE TABLE SupplierParts (\n PartID integer... |
| 3 | index | sqlite_autoindex_SupplierParts_1 | SupplierParts | 5 |  |


In [20]:
pd.read_sql("SELECT name, type FROM sqlite_master;", db)

|   | name | type |
|---|---|---|
| 0 | Suppliers | table |
| 1 | Parts | table |
| 2 | SupplierParts | table |
| 3 | sqlite_autoindex_SupplierParts_1 | index |


In [19]:
pd.read_sql("SELECT * FROM parts;", db)

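|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 1 | Nut | Red | 12.0 | London |
| 1 | 2 | Bolt | Green | 17.0 | Paris |
| 2 | 3 | Screw | Blue | 17.0 | Oslo |
| 3 | 4 | Screw | Red | 14.0 | London |
| 4 | 5 | Cam | Blue | 12.0 | Paris |
| 5 | 6 | Cog | Red | 19.0 | London |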


### `LIMIT`

The `SELECT` command can be extended with many other keywords.

The first of these is `LIMIT`, which limits the number of rows returned. `LIMIT` is the SQL equivalent of Pandas' `.head()` method.

In [22]:
pd.read_sql("SELECT * FROM supplierparts LIMIT 3;", db)

|   | PartID | SupplierID | Qty |
|---|---|---|---|
| 0 | 1 | 1 | 300 |
| 1 | 1 | 2 | 200 |
| 2 | 1 | 3 | 400 |
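The sketch below illustrates two properties of `LIMIT`, assuming a hand-built, in-memory copy of the `Parts` table: the limit is an upper bound, so asking for more rows than exist is fine, and unless the query also sorts the rows, the database decides which rows come first.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# At most 3 rows come back; which 3 is up to the database.
three = db.execute("SELECT PartID FROM Parts LIMIT 3").fetchall()

# A limit larger than the table simply returns every row.
everything = db.execute("SELECT PartID FROM Parts LIMIT 100").fetchall()

print(len(three), len(everything))
db.close()
```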


### `DISTINCT`

The `DISTINCT` keyword limits rows to distinct results. `DISTINCT` is the SQL equivalent of Pandas' `.drop_duplicates()` method.

Keep in mind that `DISTINCT` applies to all of the selected columns, not just one column.

In [26]:
pd.read_sql("SELECT color, city FROM parts;", db)

|   | Color | City |
|---|---|---|
| 0 | Red | London |
| 1 | Green | Paris |
| 2 | Blue | Oslo |
| 3 | Red | London |
| 4 | Blue | Paris |
| 5 | Red | London |


In [25]:
pd.read_sql("SELECT DISTINCT color, city FROM parts;", db)

|   | Color | City |
|---|---|---|
| 0 | Red | London |
| 1 | Green | Paris |
| 2 | Blue | Oslo |
| 3 | Blue | Paris |
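To see that `DISTINCT` works on whole rows of the selected columns, compare one column against two. This sketch assumes a hand-built, in-memory copy of the `Parts` table; the `COUNT(DISTINCT ...)` form at the end is an addition not covered in the notes above.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# One column: three distinct colors.
colors = db.execute("SELECT DISTINCT Color FROM Parts").fetchall()

# Two columns: Blue appears twice because (Blue, Oslo) and (Blue, Paris)
# are different rows.
pairs = db.execute("SELECT DISTINCT Color, City FROM Parts").fetchall()

# Counting distinct values directly.
n_colors = db.execute("SELECT COUNT(DISTINCT Color) FROM Parts").fetchone()[0]

print(sorted(colors), sorted(pairs), n_colors)
db.close()
```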


### `ORDER BY`

The `ORDER BY` keyword sorts the returned rows. `ORDER BY` is the SQL equivalent of Pandas' `.sort_values()` method.

In [29]:
pd.read_sql("SELECT * FROM parts ORDER BY weight LIMIT 3;", db)

|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 1 | Nut | Red | 12.0 | London |
| 1 | 5 | Cam | Blue | 12.0 | Paris |
| 2 | 4 | Screw | Red | 14.0 | London |


Add the suffix `ASC` for an ascending sort (smallest to largest) and `DESC` for a descending sort (largest to smallest).

In SQLite, as in standard SQL, the default is ascending.

In [30]:
pd.read_sql("SELECT * FROM parts ORDER BY weight DESC;", db)

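|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 6 | Cog | Red | 19.0 | London |
| 1 | 2 | Bolt | Green | 17.0 | Paris |
| 2 | 3 | Screw | Blue | 17.0 | Oslo |
| 3 | 4 | Screw | Red | 14.0 | London |
| 4 | 1 | Nut | Red | 12.0 | London |
| 5 | 5 | Cam | Blue | 12.0 | Paris |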


In [33]:
pd.read_sql("SELECT * FROM parts ORDER BY weight DESC, city DESC;", db)

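|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 6 | Cog | Red | 19.0 | London |
| 1 | 2 | Bolt | Green | 17.0 | Paris |
| 2 | 3 | Screw | Blue | 17.0 | Oslo |
| 3 | 4 | Screw | Red | 14.0 | London |
| 4 | 5 | Cam | Blue | 12.0 | Paris |
| 5 | 1 | Nut | Red | 12.0 | London |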
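When several columns are listed, later columns only break ties in earlier ones, and each column can have its own direction. The sketch below assumes a hand-built, in-memory copy of the `Parts` table.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# Omitting the direction is the same as writing ASC.
implicit = db.execute("SELECT PartID FROM Parts ORDER BY Weight, PartID").fetchall()
explicit = db.execute("SELECT PartID FROM Parts ORDER BY Weight ASC, PartID ASC").fetchall()

# Heaviest first; among equal weights, cities in alphabetical order.
mixed = db.execute(
    "SELECT PartName, Weight, City FROM Parts ORDER BY Weight DESC, City ASC"
).fetchall()

print(implicit == explicit)
for row in mixed:
    print(row)
db.close()
```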


### `WHERE`

`WHERE` puts conditions on the rows returned. `WHERE` is the SQL equivalent of subsetting a data frame with a Boolean condition.

You can use `=` to test equality. Other comparison operators, such as `>=`, are also available.

In [37]:
pd.read_sql("SELECT * FROM parts WHERE weight = 17", db)

|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 2 | Bolt | Green | 17.0 | Paris |
| 1 | 3 | Screw | Blue | 17.0 | Oslo |


You can use `AND` and `OR` to combine conditions. You can also use parentheses to indicate the order of operations.

In [41]:
pd.read_sql("SELECT * FROM parts WHERE city = 'London' OR color = 'Red';", db)

|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 1 | Nut | Red | 12.0 | London |
| 1 | 4 | Screw | Red | 14.0 | London |
| 2 | 6 | Cog | Red | 19.0 | London |
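Parentheses matter because `AND` is evaluated before `OR`. The following sketch, which assumes a hand-built, in-memory copy of the `Parts` table, runs the same conditions with and without parentheses.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# Read as: Paris, OR (London AND heavier than 13).
no_parens = db.execute(
    "SELECT PartID FROM Parts "
    "WHERE City = 'Paris' OR City = 'London' AND Weight > 13 "
    "ORDER BY PartID"
).fetchall()

# Read as: (Paris OR London) AND heavier than 13.
with_parens = db.execute(
    "SELECT PartID FROM Parts "
    "WHERE (City = 'Paris' OR City = 'London') AND Weight > 13 "
    "ORDER BY PartID"
).fetchall()

print(no_parens)
print(with_parens)
db.close()
```

Part 5 (a 12.0-weight part in Paris) appears only in the first result, because there the weight condition applies to London parts alone.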


You can use `IN` to check whether a value is in a collection of values.

In [42]:
pd.read_sql("SELECT * FROM parts WHERE city IN ('Paris', 'London');", db)

|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 1 | Nut | Red | 12.0 | London |
| 1 | 2 | Bolt | Green | 17.0 | Paris |
| 2 | 4 | Screw | Red | 14.0 | London |
| 3 | 5 | Cam | Blue | 12.0 | Paris |
| 4 | 6 | Cog | Red | 19.0 | London |


SQL's `LIKE` keyword provides a simple pattern-matching language for strings. It is less powerful than regular expressions, but still useful.

* `%` matches zero or more of any character, similar to regex `.*`
* `_` matches any one character, similar to regex `.`

Some other databases (but not SQLite) also support:

* `[]` matches any one of the characters inside the brackets, similar to a regex character class `[]`

In [44]:
pd.read_sql("SELECT * FROM parts WHERE city LIKE '%s';", db)

|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 2 | Bolt | Green | 17.0 | Paris |
| 1 | 5 | Cam | Blue | 12.0 | Paris |
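A few more patterns, tried against a hand-built, in-memory copy of the `Parts` table. The helper `names()` is a made-up convenience function for this sketch. The last pattern relies on SQLite's default behavior, where `LIKE` ignores case for ASCII letters; other databases may behave differently.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

def names(pattern):
    """Return the part names that match a LIKE pattern (hypothetical helper)."""
    rows = db.execute(
        "SELECT PartName FROM Parts WHERE PartName LIKE ? ORDER BY PartID",
        (pattern,),
    )
    return [row[0] for row in rows]

starts_with_c = names("C%")    # 'C' followed by anything
c_any_g = names("C_g")         # 'C', any one character, then 'g'
four_letters = names("____")   # exactly four characters
contains_o = names("%o%")      # an 'o' anywhere in the name
lowercase_c = names("c%")      # same as 'C%' in SQLite's default setup

print(starts_with_c, c_any_g, four_letters, contains_o, lowercase_c)
db.close()
```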


The `BETWEEN` keyword is useful for selecting ranges.

In [46]:
pd.read_sql("SELECT * FROM parts WHERE weight BETWEEN 14 AND 20;", db)

|   | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|
| 0 | 2 | Bolt | Green | 17.0 | Paris |
| 1 | 3 | Screw | Blue | 17.0 | Oslo |
| 2 | 4 | Screw | Red | 14.0 | London |
| 3 | 6 | Cog | Red | 19.0 | London |
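Note that the 14.0-weight part is included above: `BETWEEN` includes both endpoints. The sketch below, assuming a hand-built, in-memory copy of the `Parts` table, compares it with the equivalent pair of comparisons.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

between = db.execute(
    "SELECT PartID FROM Parts WHERE Weight BETWEEN 14 AND 19 ORDER BY PartID"
).fetchall()

# Same result: both endpoints (14 and 19) are included.
inclusive = db.execute(
    "SELECT PartID FROM Parts WHERE Weight >= 14 AND Weight <= 19 ORDER BY PartID"
).fetchall()

# Strict comparisons drop the parts weighing exactly 14 or 19.
strict = db.execute(
    "SELECT PartID FROM Parts WHERE Weight > 14 AND Weight < 19 ORDER BY PartID"
).fetchall()

print(between, inclusive, strict)
db.close()
```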


### Operators

You can use the arithmetic operators `+`, `-`, `*`, `/`, and `%` on SQL columns to perform columnwise computations. These are the SQL equivalent of vectorized arithmetic.

In [52]:
pd.read_sql("SELECT weight * weight AS squared_weight, * FROM parts;", db)

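|   | squared_weight | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|---|
| 0 | 144.0 | 1 | Nut | Red | 12.0 | London |
| 1 | 289.0 | 2 | Bolt | Green | 17.0 | Paris |
| 2 | 289.0 | 3 | Screw | Blue | 17.0 | Oslo |
| 3 | 196.0 | 4 | Screw | Red | 14.0 | London |
| 4 | 144.0 | 5 | Cam | Blue | 12.0 | Paris |
| 5 | 361.0 | 6 | Cog | Red | 19.0 | London |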


In [54]:
pd.read_sql("SELECT weight * weight AS squared_weight, * FROM parts WHERE squared_weight > 300;", db)

|   | squared_weight | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|---|
| 0 | 361.0 | 6 | Cog | Red | 19.0 | London |
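One detail worth checking when doing arithmetic is how division behaves. In SQLite, dividing two integers gives an integer, while a real operand gives a real result. The sketch below assumes a hand-built, in-memory copy of the `Parts` table, whose `Weight` column is stored as real.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

row = db.execute(
    "SELECT 7 / 2, 7 / 2.0, 7 % 2, Weight / 5 FROM Parts WHERE PartID = 1"
).fetchone()

int_div, real_div, remainder, weight_fifth = row
# int_div is 3 (integer division), real_div is 3.5, remainder is 1,
# and weight_fifth is 12.0 / 5 because Weight is a real column.
print(int_div, real_div, remainder, weight_fifth)
db.close()
```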


### `AS`

You can rename a column with the `AS` keyword. This keyword is especially useful together with SQL arithmetic operators and functions.
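The effect of `AS` is visible in the column names a query reports. This sketch assumes a hand-built, in-memory copy of the `Parts` table and reads the names from the cursor's `.description` attribute. SQLite also lets a `WHERE` clause refer to an alias; some other systems (PostgreSQL, for example) reject that, so the last query repeats the expression, which should work more widely.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# Without AS, the database picks a name for the computed column.
cur = db.execute("SELECT Weight * Weight FROM Parts")
unnamed = cur.description[0][0]

# With AS, the column gets the name we chose.
cur = db.execute("SELECT Weight * Weight AS squared_weight FROM Parts")
named = cur.description[0][0]

# Filtering on the expression itself instead of on the alias.
heavy = db.execute(
    "SELECT PartID, Weight * Weight AS squared_weight FROM Parts "
    "WHERE Weight * Weight > 300"
).fetchall()

print(repr(unnamed), repr(named), heavy)
db.close()
```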

### Functions & Aggregation

SQL has built-in functions, which vary from one DBMS to another. The SQL cheatsheet lists most of the functions supported by SQLite.

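Many SQL functions, such as `AVG()` and `COUNT()`, aggregate the data in a column, summarizing that column as a single value. Others, such as `UPPER()`, are applied to each row separately.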

In [55]:
pd.read_sql("SELECT weight * 12 AS multiplied_weight, * FROM parts;", db)

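|   | multiplied_weight | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|---|
| 0 | 144.0 | 1 | Nut | Red | 12.0 | London |
| 1 | 204.0 | 2 | Bolt | Green | 17.0 | Paris |
| 2 | 204.0 | 3 | Screw | Blue | 17.0 | Oslo |
| 3 | 168.0 | 4 | Screw | Red | 14.0 | London |
| 4 | 144.0 | 5 | Cam | Blue | 12.0 | Paris |
| 5 | 228.0 | 6 | Cog | Red | 19.0 | London |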


In [58]:
pd.read_sql("SELECT AVG(weight * 12) AS avg_multiplied_weight FROM parts;", db)

|   | avg_multiplied_weight |
|---|---|
| 0 | 182.0 |


In [61]:
pd.read_sql("SELECT COUNT(*) FROM parts WHERE weight > 15;", db)

|   | COUNT(*) |
|---|---|
| 0 | 3 |


In [63]:
pd.read_sql("SELECT UPPER(city), * FROM parts;", db)

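|   | UPPER(city) | PartID | PartName | Color | Weight | City |
|---|---|---|---|---|---|---|
| 0 | LONDON | 1 | Nut | Red | 12.0 | London |
| 1 | PARIS | 2 | Bolt | Green | 17.0 | Paris |
| 2 | OSLO | 3 | Screw | Blue | 17.0 | Oslo |
| 3 | LONDON | 4 | Screw | Red | 14.0 | London |
| 4 | PARIS | 5 | Cam | Blue | 12.0 | Paris |
| 5 | LONDON | 6 | Cog | Red | 19.0 | London |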
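The examples above mix two kinds of functions: aggregate functions collapse many rows into one value, while scalar functions return one value per row. A side-by-side sketch, assuming a hand-built, in-memory copy of the `Parts` table and a handful of functions that SQLite supports:

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# Aggregate functions: six rows in, one row out.
aggregated = db.execute(
    "SELECT COUNT(*), MIN(Weight), MAX(Weight), AVG(Weight), SUM(Weight) FROM Parts"
).fetchall()

# Scalar functions: six rows in, six rows out.
per_row = db.execute(
    "SELECT UPPER(PartName), LENGTH(PartName) FROM Parts ORDER BY PartID"
).fetchall()

print(aggregated)
print(per_row)
db.close()
```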


### `GROUP BY`

The `GROUP BY` keyword groups rows before they are aggregated. `GROUP BY` is the SQL equivalent of Pandas' `.groupby()` method.

In [64]:
pd.read_sql("SELECT AVG(weight) FROM parts;", db)

|   | AVG(weight) |
|---|---|
| 0 | 15.166667 |


In [66]:
pd.read_sql("SELECT AVG(weight), city FROM parts GROUP BY city;", db)

|   | AVG(weight) | City |
|---|---|---|
| 0 | 15.0 | London |
| 1 | 17.0 | Oslo |
| 2 | 14.5 | Paris |
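Several aggregates can be computed per group in one query, and rows can be grouped by more than one column. The sketch below assumes a hand-built, in-memory copy of the `Parts` table; the aliases `n` and `avg_weight` are names chosen for this example.

```python
import sqlite3

# In-memory copy of the Parts table from these notes, so the sketch
# runs without suppliers.sqlite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Parts (PartID integer, PartName text, Color text, Weight real, City text)")
db.executemany("INSERT INTO Parts VALUES (?, ?, ?, ?, ?)", [
    (1, "Nut", "Red", 12.0, "London"), (2, "Bolt", "Green", 17.0, "Paris"),
    (3, "Screw", "Blue", 17.0, "Oslo"), (4, "Screw", "Red", 14.0, "London"),
    (5, "Cam", "Blue", 12.0, "Paris"), (6, "Cog", "Red", 19.0, "London"),
])

# One output row per city, with a row count and an average weight.
by_city = db.execute(
    "SELECT City, COUNT(*) AS n, AVG(Weight) AS avg_weight "
    "FROM Parts GROUP BY City ORDER BY City"
).fetchall()

# One output row per (color, city) combination that occurs in the table.
by_color_city = db.execute(
    "SELECT Color, City, COUNT(*) AS n "
    "FROM Parts GROUP BY Color, City ORDER BY Color, City"
).fetchall()

print(by_city)
print(by_color_city)
db.close()
```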
