# Lecture 17:  Databases
## Monday, November 6th 2017

# Introduction

## Why Learn Databases

- You will see many databases in your career.

- `SQL` (Structured Query Language) is still very popular and will remain so for a long time.  Hence, you will need to code `SQL`.
  * `SQL` is the language used to query a relational database.

- You will have to deal with systematic storage of structured and unstructured data at some point.

## More Database Motivation

- Implementing a database well is very hard, but you still need to understand how databases work.

- Data storage/wrangling are not just database concerns; packages such as `dplyr` and `pandas` require a similar knowledge-base.

- It is very important to make an informed decision on a storage engine that is sufficient for your program.

- It is important to understand query performance.

- A storage engine tuned for transaction processing is not optimal for analytics.

## What kind of data access do you need?

The answer depends on your problem and the resources you have available.

|Database Genre     | Examples                 |
| :-----------:     | :------:                 |
| relational        | SQL and its derivatives |
| document oriented | MongoDB, CouchDB         |
| key-value         | Riak, Memcached, leveldb |
| graph oriented    | Neo4J                    |
| columnar          | HBase                    |

# A Sampling of Database Genres

## Relational Model
- A relation (table) is a collection of tuples. Each tuple is called a *row*.

- A relational database is a collection of tables related to each other through common data values.

- Items in a column are values of one attribute.

- A cell is expected to be atomic.

- Tables are related to each other if they have columns (called keys) which represent the same values.

- SQL (Structured Query Language) is a declarative model: a query optimizer decides how to execute the query.
  - For example, if a range condition on a field matches 80% of the rows, should the optimizer use the index or scan the whole table?
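
As a rough illustration of the points above, the sketch below uses `Python`'s built-in `sqlite3` module to build two tables related through a key, join them, and ask `SQLite` how it plans to run the query. The table names, column names, and rows are made up for illustration and are not taken from the CS109 example.

```python
import sqlite3

# In-memory database; all names and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidates (
        id INTEGER PRIMARY KEY,
        last_name TEXT NOT NULL
    );
    CREATE TABLE contributors (
        id INTEGER PRIMARY KEY,
        last_name TEXT NOT NULL,
        amount REAL NOT NULL,
        candidate_id INTEGER REFERENCES candidates(id)  -- key relating the tables
    );
    INSERT INTO candidates VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO contributors VALUES
        (1, 'Lee', 50.0, 1),
        (2, 'Kim', 25.0, 2),
        (3, 'Park', 75.0, 1);
""")

# Declarative query: we state WHAT we want, not HOW to compute it.
query = """
    SELECT contributors.last_name, candidates.last_name, contributors.amount
    FROM contributors
    JOIN candidates ON contributors.candidate_id = candidates.id
    WHERE contributors.amount >= 50
    ORDER BY contributors.amount
"""
rows = conn.execute(query).fetchall()
print(rows)

# Ask the query optimizer for its chosen strategy.
# The exact text of the plan depends on the SQLite version.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)

conn.close()
```

The same `query` string could be executed in several different ways internally; the plan printed at the end is simply the one the optimizer picked for this data and schema.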

### Example from CS109:
![](https://github.com/cs109/2015/raw/master/Lectures/Lecture4/contributors.png)

![](https://github.com/cs109/2015/raw/master/Lectures/Lecture4/candidates.png)

## Key-Value Model

- like a dictionary
- the database is the index
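
A toy sketch of the idea, assuming nothing more than a `Python` dictionary behind a small wrapper. The `KeyValueStore` class and the key format are hypothetical; real stores such as those listed earlier add persistence, networking, and much more.

```python
class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}  # the dictionary itself acts as the index

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
# Values are opaque to the store: it cannot query inside them.
store.put("user:42", b'{"name": "Ada"}')
found = store.get("user:42")
store.delete("user:42")
missing = store.get("user:42")
print(found, missing)
```

The only access path is the key, which is why lookups are simple and fast but anything resembling a `WHERE` clause on the value has to be done by the application.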

## Document Model

- stores nested records
- bad for many-to-many
- storage locality good for access, bad for writing
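
A minimal sketch of a nested record, assuming a JSON-like document. The field names and values are invented for illustration and do not follow any particular database's schema.

```python
import json

# One self-contained document: the customer and line items are nested
# inside the order instead of living in separate tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ada", "city": "Boston"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

# The whole record is stored and fetched as a single unit (good locality).
serialized = json.dumps(order)
loaded = json.loads(serialized)

total_qty = sum(item["qty"] for item in loaded["items"])
print(total_qty)
```

Reading the order needs no joins, but changing one nested field generally means rewriting the document, and data shared by many documents (the same customer on many orders) ends up duplicated or referenced by hand.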

## Components to a database

1. Client connection manager: what to do with incoming data
2. Transactional storage
    - storage data structures and the log
    - transactions and ACID: atomicity, consistency, isolation, durability
3. Process model: coroutines, threads, processes
4. Query model and language: query optimization
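
The atomicity part of ACID can be sketched with `sqlite3`, assuming a made-up `accounts` table and a hypothetical `transfer` helper. Using the connection as a context manager commits the transaction if the block succeeds and rolls it back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single transaction."""
    with conn:  # commit on success, roll back on exception
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )


try:
    transfer(conn, "alice", "bob", 500)  # alice cannot afford this
except sqlite3.IntegrityError:
    print("transfer failed; transaction rolled back")

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
conn.close()
```

The credit to `bob` runs first and succeeds, but because the debit violates the constraint, the whole transaction is undone and neither balance changes.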

# Working with Databases

## Relational Grammar of Data

- We want a language to help us easily query items in the database.
- Provide simple verbs for simple things.
- [`Pandas`](http://pandas.pydata.org/) is a library for `Python` that allows users to work with data structures and relational databases.
- The `dplyr` package offers data manipulation tools for the `R` programming language, including tools for working with relational databases.

The [`dplyr_pandas`](https://gist.github.com/TomAugspurger/6e052140eaa5fdb6e8c0/) notebook by Tom Augspurger contains a table comparing `dplyr` and `pandas`.  The following table is a modification to that table:

<table>
  <tr>
    <th><b>VERB</b></th>
    <th><b>dplyr</b></th>
    <th><b>pandas</b></th>
    <th><b>SQL</b></th>
  </tr>
  <tr>
    <td>QUERY/SELECTION</td>
    <td>filter() (and slice())</td>
    <td>query() (and loc[], iloc[])</td>
    <td>SELECT WHERE</td>
  </tr>
  <tr>
    <td>SORT</td>
    <td>arrange()</td>
    <td>sort_values()</td>
    <td>ORDER BY</td>
  </tr>
  <tr>
    <td>SELECT-COLUMNS/PROJECTION</td>
    <td>select() (and rename())</td>
    <td>[](__getitem__) (and rename())</td>
    <td>SELECT COLUMN</td>
  </tr>
  <tr>
    <td>SELECT-DISTINCT</td>
    <td>distinct()</td>
    <td>unique(),drop_duplicates()</td>
    <td>SELECT DISTINCT COLUMN</td>
  </tr>
  <tr>
    <td>ASSIGN</td>
    <td>mutate() (and transmute())</td>
    <td>assign</td>
    <td>ALTER/UPDATE</td>
  </tr>
  <tr>
    <td>AGGREGATE</td>
    <td>summarise()</td>
    <td>describe(), mean(), max()</td>
    <td>None, AVG(),MAX()</td>
  </tr>
  <tr>
    <td>SAMPLE</td>
    <td>sample_n() and sample_frac()</td>
    <td>sample()</td>
    <td>implementation dep, use RAND()</td>
  </tr>
  <tr>
    <td>GROUP-AGG</td>
    <td>group_by()/summarise()</td>
    <td>groupby/agg, count, mean</td>
    <td>GROUP BY</td>
  </tr>
  <tr>
    <td>DELETE</td>
    <td>?</td>
    <td>drop/masking</td>
    <td>DELETE/WHERE</td>
  </tr>
</table>
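
To make the `SQL` column of the table concrete, here is a small sketch that runs several of the query verbs through `sqlite3`. The `flights` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data set.
conn.execute("CREATE TABLE flights (carrier TEXT, origin TEXT, delay INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("AA", "BOS", 10), ("AA", "JFK", 30), ("UA", "BOS", 5), ("UA", "BOS", 25)],
)

# QUERY/SELECTION, PROJECTION, and SORT in one statement
late = conn.execute(
    "SELECT carrier, delay FROM flights WHERE delay > 8 ORDER BY delay DESC"
).fetchall()

# SELECT-DISTINCT
origins = conn.execute(
    "SELECT DISTINCT origin FROM flights ORDER BY origin"
).fetchall()

# AGGREGATE over the whole table
overall = conn.execute("SELECT AVG(delay), MAX(delay) FROM flights").fetchone()

# GROUP-AGG: one aggregate per carrier
by_carrier = conn.execute(
    "SELECT carrier, AVG(delay) FROM flights GROUP BY carrier ORDER BY carrier"
).fetchall()

print(late)
print(origins)
print(overall)
print(by_carrier)
conn.close()
```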

`NoSQL` databases are gaining in popularity.  However, we will stick with traditional relational databases in this course.

* We need a way of querying a given relational database.  There are several languages for such a purpose.  We will focus on `SQL` (Structured Query Language).

* `SQL` has a long history.  Because of this (or in spite of it), there are many versions of `SQL` available today.

* We'll use `SQLite`.  Here are some great references: 
  - [`SQLite` Homepage](https://www.sqlite.org/)
  - [A thorough guide to SQLite database operations in Python](http://sebastianraschka.com/Articles/2014_sqlite_in_python_tutorial.html)
  - [SQL-Tutorial](https://github.com/tthibo/SQL-Tutorial)

## `SQLite`

* The `sqlite3` module ships with `Python`'s standard library, so `SQLite` is available without a separate install.
  - `Python` defines a standard database API, called DBAPI2, that database modules are expected to implement.  Some references:  [DBAPI2](http://cewing.github.io/training.codefellows/lectures/day21/intro_to_dbapi2.html) and [PEP 249](https://www.python.org/dev/peps/pep-0249/).

* Note:  There is an even higher level API available, called [SQLAlchemy](http://www.sqlalchemy.org).

* You can install `SQLite` if you need to: https://www.sqlite.org/download.html.
* You may find the `SQLite` browser useful: http://sqlitebrowser.org.
* You can access the command line interface by downloading the [`SQLite` CLI](https://www.sqlite.org/cli.html).
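
The DBAPI2 pattern (connect, get a cursor, execute, fetch, close) looks roughly like this with `sqlite3`. The `people` table is a made-up example; other DBAPI2 modules follow the same shape but may use a different placeholder style.

```python
import sqlite3

# Module-level attributes required by PEP 249
print(sqlite3.apilevel)    # DB-API version the module implements
print(sqlite3.paramstyle)  # placeholder style used in queries

conn = sqlite3.connect(":memory:")  # 1. open a connection
cur = conn.cursor()                 # 2. get a cursor

# 3. execute statements; values are passed separately, never pasted in
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")
cur.execute("INSERT INTO people VALUES (?, ?)", ("Ada", 36))
conn.commit()                       # 4. make the changes permanent

cur.execute("SELECT name, age FROM people WHERE age > ?", (30,))
row = cur.fetchone()                # 5. fetch results
print(row)

conn.close()                        # 6. release the connection
```

Passing values through `?` placeholders rather than string formatting lets the module handle quoting and helps guard against SQL injection.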

# `SQLite` Basics

## The Plan
* We're going to work with the `sqlite3` package in `Python`.
* This package will allow us to execute basic `SQLite` commands in `Python` to build and manipulate our database.
* We'll start by creating a `SQL` database and work up from there.
* Ultimately, we'd like to work with `pandas` to make our lives easier.
* At least in the beginning, we'll just work directly with the `SQLite` commands to get the basics down.

## The Essentials
### Core Commands
* SELECT --- Retrieve rows (and columns) from a table
* INSERT --- Insert data into the table
* UPDATE --- Change data values in the table
* DELETE --- Delete data in the table

We can string these commands together to perform our basic operations on the database.
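
A minimal sketch of the four core commands in sequence, assuming a hypothetical `students` table (created here only so that the example is self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for the demonstration.
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade INTEGER)"
)

# INSERT: add rows to the table
conn.executemany(
    "INSERT INTO students (name, grade) VALUES (?, ?)",
    [("Ada", 90), ("Grace", 85), ("Alan", 70)],
)

# UPDATE: change values in existing rows
conn.execute("UPDATE students SET grade = ? WHERE name = ?", (95, "Grace"))

# DELETE: remove rows that match a condition
conn.execute("DELETE FROM students WHERE grade < ?", (80,))
conn.commit()

# SELECT: read back what is left
remaining = conn.execute(
    "SELECT name, grade FROM students ORDER BY name"
).fetchall()
print(remaining)
conn.close()
```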

### Structural Commands
* CREATE --- Create a table in a database
* DROP --- Delete a table in the database
* ALTER --- Add, delete, or modify columns in an existing table
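
The structural commands can be sketched the same way. The `courses` table and the two helper functions are hypothetical. Note that `SQLite` supports only a subset of `ALTER TABLE`, and which forms are available depends on the `SQLite` version; adding a column, as done here, is one of the supported forms.

```python
import sqlite3


def table_names(conn):
    """Return the names of all tables, read from SQLite's schema table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def column_names(conn, table):
    """Return the column names of `table` (trusted input only)."""
    # PRAGMA table_info yields one row per column; field 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")

# CREATE: define a new table
conn.execute("CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT)")
after_create = table_names(conn)

# ALTER: add a column to the existing table
conn.execute("ALTER TABLE courses ADD COLUMN credits INTEGER")
columns = column_names(conn, "courses")

# DROP: delete the table and all of its data
conn.execute("DROP TABLE courses")
after_drop = table_names(conn)

print(after_create, columns, after_drop)
conn.close()
```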