<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Database Fundamentals and Types of Databases

---

## Learning Objectives

### Core

- Explain RDBMS fundamentals
- Describe what SQL and NoSQL mean
- Describe tradeoffs between SQL and NoSQL
- Identify remote vs local database instances

### Target

- Describe the importance of
    - Transactional integrity
    - ACID
    - Relational databases
    - Schemas

### Stretch 
- Understand how to design a relational database
- Know about the relevance of alternative databases
    - Key-value stores
    - NoSQL
    - Timeseries databases (TSDB)
    - Graph databases

### Lesson Guide

- [Opening](#opening)
- [Intro to Relational Databases](#introduction)
    - [Transactional integrity](#transactional-integrity)
    - [ACID](#acid)
    - [Relational databases](#relational-databases)
    - [Schemas](#schemas)
    - [Entity relation diagram (ERD)](#erd)
- [Design a relational database](#guided_practice_1)
- [Alternative databases](#alternative)
    - [Key-value stores](#key-value)
    - [NoSQL](#nosql)
    - [Timeseries databases (TSDB)](#tsdb)
    - [Graph databases](#graph-db)
- [Conclusion](#conclusion)
- [Additional Resources](#resources)

#### Who has used relational databases and/or non-relational databases (NoSQL)?

#### How is this different from a Pandas DataFrame?

<a name="opening"></a>

## Opening

---

Up to this point, we have used DataFrames and sourced our data from CSV or JSON files.

Mainly, these solutions lack:
- **Fault tolerance**
- **Performance / Scalability**
- **Interactive Features**

Databases are the standard solution for data storage and are much more robust than text, CSV, or JSON files. Most analyses involve using data in some format, and in most settings a database is the tool of choice.

Databases come in many flavors, designed to serve different use cases. We will survey a few applications and explore the most common families of databases:
- **Relational (RDBMS)** 
- **Non-relational (NoSQL)**

### Skill prevalence in the DS job market
_Circa December 2015 - Frequency of Terms related to "Data Scientist" on Indeed.com_
![](https://snag.gy/Gweik7.jpg)

<a name="introduction"></a>

## Intro to Relational Database Management Systems (RDBMS)

---

Databases are computer systems that manage storage and querying of data. Databases provide a way to organize data along with efficient methods to retrieve specific information.

Typically, retrieval is performed using structured query language (SQL), with many operators for conditional selection, aggregation, joining/merging, and data transformation.  

**Many of these concepts we've already explored using Pandas DataFrames!**

Databases allow users to create rules that ensure proper data management and verification.

### Industry example: bank data

Consider the case of a bank. It needs to keep track of all the money in each of its clients' accounts. Let's suppose that the bank stores these as numbers in a table with two columns:

| ACCOUNT_ID | BALANCE |
|---|---|
| 1 | 10.000 |
| 2 | 12.546 |
| 3 | 8761 |
|...|...|

#### If this table were stored in a file in a central bank, what would internet banking look like? What problems could arise?

Notice problems of:
- Consistency (what if two nodes try to read/edit the file at the same time?)
- Availability (what if a node is not connected to the central bank?)
- Partition tolerance (what if only part of the file is available?)
- Scale (what if too many nodes request data from the file at the same time?)

As you may have realized, when multiple processes/users are interacting with the same data, it quickly becomes impractical to store it in a single file on a single machine. That's when a database comes in.

<a id='transactional-integrity'></a>

### Transactional Integrity

**A unit of work performed against a database is called a _transaction_.**

This term generally represents any change to a database.

Going back to the bank example, consider the case where you want to transfer money from one account to another.

![Transaction](./assets/images/transaction.png)

**Imagine your money in this system:**
- What happens if step 1 succeeds and step 2 fails?
- What if you request the balance between step 1 and step 2?

The system that stores the data must be resilient to these problems. It must know:
- When a transaction _begins_
- When a transaction _ends_
- What to do if a transaction _never ends_ 
- What to do if another transaction is _requested_ while the previous one is _still in process_
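
The two questions above can be made concrete with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, the reading of step 1 as a withdrawal and step 2 as a deposit, and the simulated failure are all assumptions for illustration, not part of the bank example itself.

```python
import sqlite3

# In-memory database with a hypothetical accounts table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 10000), (2, 12546)])
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst as a single transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            # Step 1 (assumed): withdraw from the source account.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, src),
            )
            if fail_midway:
                raise RuntimeError("simulated crash between step 1 and step 2")
            # Step 2 (assumed): deposit into the destination account.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return {account_id: balance} for every account."""
    return dict(conn.execute("SELECT account_id, balance FROM accounts"))


ok = transfer(conn, 1, 2, 500)                        # both steps are applied
failed = transfer(conn, 1, 2, 500, fail_midway=True)  # step 1 is rolled back
final_balances = balances(conn)
```

After the failed transfer, `final_balances` still reflects only the first, successful transfer: the withdrawal from the aborted attempt does not survive.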

<a id='acid'></a>

### ACID

![](https://snag.gy/kp5Rqi.jpg)

The acronym ACID stands for Atomicity, Consistency, Isolation, Durability. This is a set of properties that guarantee that database transactions are processed reliably.

**Atomicity** requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged.

**Consistency** ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.

**Isolation** ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e., one after the other.

**Durability** ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements has executed and been committed, the results need to be stored permanently (even if the database crashes immediately thereafter).

**This is the typical model under which _relational databases_ operate**. These guarantees would work perfectly for the example bank.
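
As a rough illustration of consistency and atomicity working together, the sketch below uses `sqlite3` with a hypothetical rule that balances may never go negative. The table layout and the `CHECK` rule are assumptions chosen for the example; a real bank's rules would be far richer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rule: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts ("
    " account_id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # commit on success, roll back on error
        # The deposit succeeds, but the withdrawal would overdraw account 1.
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE account_id = 2")
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE account_id = 1")
    rejected = False
except sqlite3.IntegrityError:
    # The CHECK constraint refused the invalid state (consistency), and the
    # rollback also undid the deposit made earlier in the same transaction (atomicity).
    rejected = True

rows = conn.execute(
    "SELECT account_id, balance FROM accounts ORDER BY account_id"
).fetchall()
```

Here `rejected` ends up `True` and `rows` holds the original balances, so the database never exposes a half-applied transfer.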


<a id='relational-databases'></a>

### Relational Databases

**A _relational database_ is a database of tabular data and links between data entities or concepts.** Typically, a relational database is organized into _tables_. Each table should correspond to one entity or concept. Each _table_ is similar to a single CSV file or Pandas dataframe.

For example, let's take a sample application like Twitter. Our two main entities are Users and Tweets. For each of these we would have a table.

| TWEET_ID | USER_ID | TWEET_TEXT |
|---|---|---|
| 5234 | 1234567 | "Ate an entire pound of bacon this morning.  My arteries are ready to start the day." |
| 2351 | 4529234 | "Spock vs Chewbaka.  My definitive fan fiction chronicles a potential outcome." |
| 5521 | 2348902932 | "OMG Kardashians + Bieber convolutional network mashup madness." |
|...|...|...|

| USER_ID | USERNAME |
|---|---|
| 1234567 | "dyerrington" |
| 4529234 | "kieferk" |
| 2348902932 | "stoneyv" |
|...|...|

A table is made up of rows and columns, similar to a Pandas dataframe or Excel spreadsheet.  It's standard practice in relational database design to segment your data.  Rather than having a third column with "username" in every single row of the tweets table, we can simply reference a username by "ID".  This saves a lot of space if you have a table with billions of records.
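
To show how a username can be looked up by ID instead of being repeated in every tweet row, here is a small sketch with `sqlite3`. The table and column names mirror the sample tables above; the lowercase spelling and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT)")
conn.execute(
    "CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, tweet_text TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1234567, "dyerrington"), (4529234, "kieferk"), (2348902932, "stoneyv")],
)
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [
        (5234, 1234567, "Ate an entire pound of bacon this morning."),
        (2351, 4529234, "Spock vs Chewbaka."),
        (5521, 2348902932, "OMG Kardashians + Bieber convolutional network mashup madness."),
    ],
)

# The username lives only in the users table; the join recovers it per tweet.
query = """
SELECT t.tweet_id, u.username
FROM tweets AS t
JOIN users AS u ON u.user_id = t.user_id
ORDER BY t.tweet_id
"""
joined = conn.execute(query).fetchall()
```

This is the SQL counterpart of merging two DataFrames on a shared `user_id` column.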

<a id='schemas'></a>

### A quick note on "Schemas"

The term **"schema"** can mean different things depending on which flavor of database you are talking about (MySQL, Postgres, Oracle, MSSQL).  Generally, the definition that we will accept for this class is:

>A **schema** is a collection of database objects which includes logical structures.

Including:

- Databases
- Tables
- Relationships between Tables
- Keys and Indices

We will talk more about these soon.

![](https://snag.gy/Qzhvdp.jpg)

### Who remembers what happens with dtypes in a DataFrame?

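```python
import numpy as np
import pandas as pd

# Each inner list is one row; note the mixed values in the first and last columns.
data = [
   [1, 34, "2004-12-31", 55],
   [0, 34, "2004-12-31", 55],
   ['?', np.nan, "2004-12-31", 55],
   [1, 34, "2004-12-31", 55],
   [1, 34, "2004-12-31", 5.5],
]
df = pd.DataFrame(data)

# Inspect the dtype pandas inferred for each column.
df.dtypes
```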

###  Relational Database Schemas

A table's definition can also be referred to as a _schema_, which defines how data will be managed and contained.

Table schemas define:

- Column definitions
  - Type
  - Length
- Indices
  - Unique constraints
- Keys
  - Auto-increment behavior
  - Relationships to other tables
    - Primary 
    - Foreign

These specify what columns are contained in the table and what _type_ those columns are (text, integers, floats, etc.).

**The addition of _type_ information makes this constraint stronger than a CSV file. For this reason, and many others, databases allow for stronger data consistency and often are a better solution for data storage.**

**Each table typically has a _primary key_ column. This column is a unique value per row and serves as the identifier for the row.**

A table can have many _foreign keys_ as well. **A _foreign key_ is a column that contains values to link the table to the other tables.** For example, the tweets table may have as columns:
- tweet_id, the primary key tweet identifier
- the tweet text
- the user id of the member, a foreign key to the users table

| _Primary Key_ | _Foreign Key_ | |
|---|---|---|
| **TWEET_ID** | **USER_ID** | **TWEET_TEXT** |
| 5234 | 1234567 | "Ate an entire pound of bacon this morning.  My arteries are ready to start the day." |
| 2351 | 4529234 | "Spock vs Chewbaka.  My definitive fan fiction chronicles a potential outcome." |
| 5521 | 2348902932 | "OMG Kardashians + Bieber convolutional network mashup madness." |
|...|...|...|

These keys that link the table together define the relational database.
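
The sketch below shows, under assumed column types and constraints, how a primary key and a foreign key are declared and what happens when a row violates them. It uses `sqlite3`; note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued, and other databases behave differently in this respect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE tweets ("
    " tweet_id INTEGER PRIMARY KEY,"                         # primary key
    " user_id INTEGER NOT NULL REFERENCES users(user_id),"   # foreign key
    " tweet_text TEXT NOT NULL)"
)
conn.execute("INSERT INTO users VALUES (1234567, 'dyerrington')")
conn.execute("INSERT INTO tweets VALUES (5234, 1234567, 'first tweet')")

errors = []
for statement in (
    "INSERT INTO tweets VALUES (5234, 1234567, 'duplicate primary key')",
    "INSERT INTO tweets VALUES (9999, 42, 'unknown user')",
):
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as exc:
        errors.append(str(exc))  # the database refuses both rows

tweet_count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Both invalid inserts are rejected, so only the one valid tweet remains in the table.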

MySQL and Postgres are popular relational databases and are widely used. Both are open source and available for free.

Alternatively, many larger companies may use Oracle or Microsoft SQL databases. While all of these offer many of the same features (and use SQL as a query language), the latter also offer some maintenance features that large companies find useful.

<a id='erd'></a>

### An Entity Relation Diagram (ERD)

![](https://snag.gy/QsBNnS.jpg)

Find more about the symbols used in this context [here](https://en.wikipedia.org/wiki/Entity–relationship_model).

<a name="guided_practice_1"></a>

## Design a relational database

---

Consider the following dataset from Uber with the fields:

- User ID
- User Name
- Driver ID
- Driver Name
- Ride ID
- Ride Time
- Pickup Longitude
- Pickup Latitude
- Pickup Location Entity
- Drop-off Longitude
- Drop-off Latitude
- Drop-off Location Entity
- Miles
- Travel Time
- Fare
- CC Number


#### Work in pairs and answer the following questions:

- How would you design a relational database to support this data?
- List the tables you would create.
- What fields would they contain?
- How would they link to other tables?

> Answer:
>
> Users table:
> - User ID
> - User Name
> - CC Number
>
> Drivers table:
> - Driver ID
> - Driver Name
>
> Locations table (should store metadata for popular destinations):
> - Entity
> - Longitude
> - Latitude
>
> Rides table:
> - Ride ID
> - Ride Time
> - User ID (link to users)
> - Driver ID (link to drivers)
> - Pickup Location Entity (link to locations)
> - Drop-off Location Entity (link to locations)
> - Miles
> - Travel Time
> - Fare
> - CC Number
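
One possible way to express the answer above as an actual schema is sketched here with `sqlite3`. The snake_case names, the column types, and the choice of `entity` as the key of the locations table are assumptions; other designs are equally defensible.

```python
import sqlite3

# Hypothetical schema for the four tables in the answer above.
schema = """
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    user_name TEXT NOT NULL,
    cc_number TEXT
);
CREATE TABLE drivers (
    driver_id   INTEGER PRIMARY KEY,
    driver_name TEXT NOT NULL
);
CREATE TABLE locations (
    entity    TEXT PRIMARY KEY,
    longitude REAL NOT NULL,
    latitude  REAL NOT NULL
);
CREATE TABLE rides (
    ride_id        INTEGER PRIMARY KEY,
    ride_time      TEXT NOT NULL,
    user_id        INTEGER NOT NULL REFERENCES users(user_id),
    driver_id      INTEGER NOT NULL REFERENCES drivers(driver_id),
    pickup_entity  TEXT REFERENCES locations(entity),
    dropoff_entity TEXT REFERENCES locations(entity),
    miles          REAL,
    travel_time    REAL,
    fare           REAL,
    cc_number      TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(schema)

# List the tables that were created.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

The `REFERENCES` clauses are what turn the "link to ..." notes in the answer into enforceable relationships.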

<a id='alternative-databases'></a>
<a id='alternative'></a>

## Alternative types of databases

---

<a id='key-value'></a>

### Key-value stores

Some databases are nothing more than very large (and very fast) hash maps or dictionaries. These are useful for storing key-based data, e.g. storing the last access or visit time per user, or counting things per user or customer.

Every entry in these databases has two parts, a key and a value, and we can retrieve any value based on its key. This is much like a Python dictionary, but it can be much larger and uses smart caching algorithms to ensure frequently or recently accessed items are quickly accessible. The ideal use case for a key-value store is data that does not come in fixed formats.

Popular key-value stores include **Redis** and **MemcacheDB**. **Cassandra** (strictly a wide-column store) is often grouped with them. **Kafka** is sometimes mentioned alongside these systems, but it is a distributed event log rather than a key-value store.

Key-value stores are typically used for:
- image stores
- key-based file systems
- object cache
- systems designed to scale
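
To make the "dictionary with smart caching" idea concrete, here is a toy in-memory store that evicts the least recently used key once it is full. `TinyKeyValueStore` and the key naming scheme are hypothetical; real key-value stores add persistence, networking, and far more sophisticated eviction policies.

```python
from collections import OrderedDict


class TinyKeyValueStore:
    """Toy in-memory key-value store that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)         # mark as most recently used
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used key

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)         # reading also counts as a use
        return self._data[key]


store = TinyKeyValueStore(capacity=2)
store.set("user:1:last_visit", "2015-12-01T09:30:00")
store.set("user:2:last_visit", "2015-12-01T09:31:00")
store.get("user:1:last_visit")                         # touch user 1
store.set("user:3:last_visit", "2015-12-01T09:32:00")  # evicts user 2
```

Every operation goes through a key, which is why lookups stay fast but questions like "which users visited yesterday?" are awkward without extra indexing.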

<a id='nosql'></a>

### NoSQL or Document databases
"NoSQL" databases don't rely on a traditional table setup and are more flexible in their data organization. Despite the name, many of them offer SQL-like querying abilities; they simply model their data differently.

Many organize the data on an entity level, but often have denormalized and nested data setups. For example, for each user, we may store their metadata, along with a collection of tweets, each of which has its own metadata. This nesting can continue down, encapsulating further entities. This data layout is similar to what we might expect in hierarchical data structures such as JSON, XML, or Python dictionaries.

Popular databases of this variety include MongoDB and CouchDB.

Typical uses: 
- high-variability data
- document search
- integration hubs
- web content management
- publishing
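
A denormalized, nested record of the kind described above might look like the following sketch. The field names (`profile`, `metadata`, `lang`) are hypothetical and not tied to any particular document database; the point is only that one document carries the user together with their tweets.

```python
import json

# One document holds the user, their metadata, and their tweets (hypothetical fields).
user_document = {
    "user_id": 1234567,
    "username": "dyerrington",
    "profile": {"lang": "en"},
    "tweets": [
        {
            "tweet_id": 5234,
            "text": "Ate an entire pound of bacon this morning.",
            "metadata": {"lang": "en"},
        },
        {
            "tweet_id": 5235,
            "text": "A second, made-up tweet for illustration.",
            "metadata": {"lang": "en"},
        },
    ],
}

# Documents are typically serialized as JSON (or a binary variant of it).
serialized = json.dumps(user_document)
restored = json.loads(serialized)
tweet_ids = [tweet["tweet_id"] for tweet in restored["tweets"]]
```

Reading a user's tweets needs no join here, at the cost of duplicating or nesting data that a relational design would keep in separate tables.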

### A quick note on selecting a database solution:

**If you are unsure about which database solution to use, choose a relational database.  Postgres and MySQL are much more scalable than you think.**

Benefits of a relational database structure:
- Easy to migrate to NoSQL
- Scalable 
- Maximum flexibility to query data
- Widest array of features overall

### Industry Example: social network app

Consider the case of a social network app under development by a startup. The app has:
- Many user profiles. 
- Multiple activities that users might participate in.
- Being a start-up, database requirements change almost daily so nothing is set in stone.
- It's not important that everything is consistent.
- It's very likely some profiles will have features in the future that older profiles won't.

#### What would be the shortcomings of the relational DB model in this case?

> Answer:
> - schema: different profiles have different properties, and a rigid schema makes it hard to add new features
> - potentially:  scalability
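
The schema point can be illustrated with a small, assumption-laden sketch: in a table, a new profile feature requires a schema change that touches every row, while document-style records simply carry whatever fields they have. The `profiles` table and the `favorite_activity` field are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (profile_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO profiles VALUES (1, 'early adopter')")

# A new feature needs a schema migration; existing rows get NULL for the new column.
conn.execute("ALTER TABLE profiles ADD COLUMN favorite_activity TEXT")
conn.execute("INSERT INTO profiles VALUES (2, 'new user', 'climbing')")
table_rows = conn.execute("SELECT * FROM profiles ORDER BY profile_id").fetchall()

# Document-style profiles: older records just lack the newer field.
documents = [
    {"profile_id": 1, "name": "early adopter"},
    {"profile_id": 2, "name": "new user", "favorite_activity": "climbing"},
]
activities = [doc.get("favorite_activity") for doc in documents]
```

With requirements changing almost daily, each such migration is a coordination cost that the document model largely avoids.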

<a id='tsdb'></a>

### Timeseries databases

Time series databases (TSDB) are optimized for handling time series data, i.e. data that is indexed by time (a datetime or a datetime range).

Examples of time series include:
- stock market data
- energy load data from a utility company
- server metrics
- purchase history
- website metrics
- ads and clicks
- sensor data from a wearable device or an internet-of-things sensor
- smartphone sensor data

Time series pose challenges that are often hard to address well with a traditional relational database model.

#### What issues could arise when modeling time series data with a tabular data model?

> Answer:
> - critical data volume
> - time ordering
> - out of order inserts
> - joins
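
Two of the issues above, time ordering and out-of-order inserts, can be sketched with the standard-library `bisect` module. The helpers `insert_reading` and `readings_between` and the sample sensor values are hypothetical; a real TSDB handles this at far larger volume with specialized storage.

```python
import bisect
from datetime import datetime

# Readings are kept sorted by timestamp as (timestamp, value) tuples.
series = []


def insert_reading(series, timestamp, value):
    """Insert a reading at its sorted position, even if it arrives late."""
    bisect.insort(series, (timestamp, value))


def readings_between(series, start, end):
    """Return readings with start <= timestamp < end using binary search."""
    # A 1-tuple (t,) sorts before any (t, value), so it marks the first
    # position whose timestamp is >= t.
    lo = bisect.bisect_left(series, (start,))
    hi = bisect.bisect_left(series, (end,))
    return series[lo:hi]


insert_reading(series, datetime(2015, 12, 1, 9, 0), 20.5)
insert_reading(series, datetime(2015, 12, 1, 9, 2), 21.0)
insert_reading(series, datetime(2015, 12, 1, 9, 1), 20.8)  # arrives out of order

window = readings_between(
    series, datetime(2015, 12, 1, 9, 1), datetime(2015, 12, 1, 9, 3)
)
window_values = [value for _, value in window]
```

Keeping data physically ordered by time is what makes range queries like this cheap, and it is one of the things time series databases are built around.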

#### Popular TSDBs include:

- Atlas
- Druid
- InfluxDB
- Splunk

Mainly, these are good at dimensional time series data for near real-time operational insight.

Splunk, for example, is a solution for searching activity data across entire infrastructures in real time. MySQL can do this, but Splunk has many features tailored for time series problems that are central to reporting and forensics.

<a id='graph-db'></a>

### Graph databases

Graph databases are optimized to store data about networks. Most graph databases are NoSQL in nature and store their data in a key-value store or document-oriented database. In general terms, they can be considered to be key-value databases with the addition of the relationship concept.

![](http://image.slidesharecdn.com/beginnerpresentation-120429104540-phpapp01/95/introduction-to-graph-databases-24-728.jpg?cb=1335696642)

In traditional relational databases, the relationships are defined within the data itself. In graph databases, relationships allow the values in the store to be related to each other in a free form way. This allows complex hierarchies to be quickly traversed, addressing one of the more common performance problems found in traditional key-value stores.

Most graph databases also add the concept of _tags_ or _properties_, which are essentially relationships lacking a pointer to another document.

**Popular databases of this variety include:**
- Neo4j
- OpenCog
- AllegroGraph
- Oracle Spatial and Graph

**Typical uses:**
- social networks
- fraud detection
- relationship-heavy data

### Industry example: phone company

Consider a phone company that has information about phone calls. Each phone call entity has the following properties:

- caller_id
- receiver_id
- time_of_call
- duration

Each user makes several calls, and some users may be more connected than others. The company is interested in finding the people who are central in the network of call connections (super connectors) in order to offer them a promotion on their phone usage. The company wants the super connectors to be happy with the service and in turn speak highly of the service to their connections.

A graph database is perfectly suited to answer such a question.

Other examples include:

- finding communities
- finding the shortest path between two entities
- detecting fraudulent behavior
- establishing user identity
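
As a rough sketch of the kind of question a graph database answers, the code below builds a small call graph in plain Python, finds the most connected person by counting distinct contacts, and finds a shortest path with breadth-first search. The call records and names are made up, and counting contacts (degree) is only one simple notion of being "central".

```python
from collections import Counter, defaultdict, deque

# Hypothetical (caller_id, receiver_id) pairs.
calls = [
    ("ann", "bob"),
    ("ann", "cat"),
    ("ann", "dan"),
    ("bob", "cat"),
    ("dan", "eve"),
]

# Undirected graph: each person maps to the set of people they spoke with.
neighbors = defaultdict(set)
for caller, receiver in calls:
    neighbors[caller].add(receiver)
    neighbors[receiver].add(caller)

# Degree = number of distinct contacts per person.
degree = Counter({person: len(contacts) for person, contacts in neighbors.items()})
super_connector, connections = degree.most_common(1)[0]


def shortest_path(neighbors, start, goal):
    """Breadth-first search; returns the list of people on one shortest path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbors[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection between start and goal


path = shortest_path(neighbors, "bob", "eve")
```

A graph database stores these relationships directly, so traversals like this do not require repeated joins over a calls table.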

## A Graph Example


![](http://web.madstudio.northwestern.edu/wp-content/uploads/2015/04/InterestingOut.png)

<a id='base'></a>

<a name="guided-practice_2"></a>

## Conclusion
---
<a id='conclusion'></a>

Relational databases are the most common type of database. They organize data into tables. Other database types exist, including graph, key-value, document, and time series databases.

![](http://itknowledgeexchange.techtarget.com/overheard/files/2014/01/Graph-database-sketch.jpg)

![](https://snag.gy/Yaz0yT.jpg)

![](https://snag.gy/pz01bd.jpg)

### ADDITIONAL RESOURCES
<a id='resources'></a>

- [Database page on Wikipedia](https://en.wikipedia.org/wiki/Database)
- [Database tutorials](http://www.tutorialspoint.com/database_tutorials.htm)
- [Postgres Cheat Sheet](https://gist.github.com/Kartones/dd3ff5ec5ea238d4c546)
- [ACID versus BASE](https://neo4j.com/blog/acid-vs-base-consistency-models-explained/)