# Data Management and Database Basics

## Motivation

<img src="https://preview.redd.it/gph4rp6drvo41.jpg?width=640&crop=smart&auto=webp&s=a407a7be1da73ba010f0295a6351ab9d14471b2a" width=400 />

## Overview

1. Pre-SQL  (Robin)
2. SQL databases  (Emilio)
3. Non-SQL databases  (Ali)
4. Graph databases  (Simon, Day 5)

# Pre-SQL

- You kind of have data, but not really that much.
- You want to organize it better, but keep things lightweight to share.

## Working with CSV files

### Comma Separated Values (CSV) Theory

Documented in 2005 by [RFC 4180](https://tools.ietf.org/html/rfc4180), an informational Request for Comments (RFC) from the IETF, published under the Internet Society (ISOC).

Basics:
- One table
- Values separated by a separator, usually `,` (although `\t`, `;`, etc. are also possible)
- Entries separated by new lines `\n` (although other line terminators are also possible)

Benefits:
- Human readability
- Extremely efficient appending
- Easy sharing between people and programs
- Relatively fast reading and writing

Drawbacks:
- Only one table
- Only tabular data
- Disk space (plain text, no compression)

### Basics

Simplest CSV handling: Pandas

#### Reading CSV Files

In [1]:
csv_test_file = "C:/Users/rgrei/OneDrive/Work/Cryptix/data/Marketdata/boerse-frankfurt/bid-ask/XETR/sap-se.csv"

##### The Read function has *A LOT* of kwargs

```python
pandas.read_csv(filepath_or_buffer, 
                sep=',', 
                delimiter=None, 
                header='infer', 
                names=None, 
                index_col=None, 
                usecols=None, 
                squeeze=False, 
                prefix=None, 
                mangle_dupe_cols=True, 
                dtype=None, 
                engine=None, 
                converters=None, 
                true_values=None, 
                false_values=None, 
                skipinitialspace=False, 
                skiprows=None, 
                skipfooter=0, 
                nrows=None, 
                na_values=None, 
                keep_default_na=True, 
                na_filter=True, 
                verbose=False, 
                skip_blank_lines=True, 
                parse_dates=False, 
                infer_datetime_format=False, 
                keep_date_col=False, 
                date_parser=None, 
                dayfirst=False, 
                cache_dates=True, 
                iterator=False, 
                chunksize=None, 
                compression='infer', 
                thousands=None, 
                decimal='.', 
                lineterminator=None, 
                quotechar='"', 
                quoting=0, 
                doublequote=True, 
                escapechar=None, 
                comment=None, 
                encoding=None, 
                dialect=None, 
                error_bad_lines=True, 
                warn_bad_lines=True, 
                delim_whitespace=False, 
                low_memory=True, 
                memory_map=False, 
                float_precision=None)
```

In [2]:
import pandas as pd

In [3]:
%%time
df = pd.read_csv(csv_test_file)

Wall time: 601 ms


###### Define your `sep`arator

Lets you read any CSV-like format, e.g.:
- Tab separated "\t"
- Space separated " "
- Semicolon separated ";"
- etc.

###### Ensuring validity with `quotechar`, `escapechar`, and `dialect`

`quotechar` defines your quote character to ensure proper parsing

```python
    quotechar='"',
```

`escapechar` defines the character used to escape other characters, e.g. `\`, so that an escaped separator or quote character is read as literal data instead of being interpreted

```python
    escapechar=None,
```

`dialect`, if given, will "override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting."

```python
    dialect=None,
```
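
As a small illustration of the quoting options above, the sketch below parses an in-memory CSV string in which a quoted field contains the separator itself. The sample data and the variable names are made up for this example.

```python
import io

import pandas as pd

# Made-up sample: the first comment contains a comma, so it is wrapped in quotes
raw = 'id,comment\n1,"hello, world"\n2,plain text\n'

quoted_df = pd.read_csv(io.StringIO(raw), sep=",", quotechar='"')

# The quoted comma is kept as data instead of being treated as a separator
print(quoted_df.loc[0, "comment"])
```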

##### Thousands and Fractions

```python
    thousands=None, 
    decimal='.', 
```
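
A minimal sketch of how `thousands` and `decimal` work together, assuming a made-up file in a European-style number format where `.` groups thousands and `,` marks the decimal point:

```python
import io

import pandas as pd

# Made-up sample: ";" separates values, "." groups thousands, "," marks decimals
raw = "item;amount\nlaptop;1.234,50\ncable;7,99\n"

amounts_df = pd.read_csv(io.StringIO(raw), sep=";", thousands=".", decimal=",")

# The amount column is parsed as floats rather than left as strings
print(amounts_df["amount"].dtype)
print(amounts_df["amount"].tolist())
```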

##### Save CSV Files

In [None]:
df.to_csv("test.csv")  # the file name must be passed as a string

#### Reading ***large*** CSV files

##### File is compressed because it is too large

```python
    compression='infer', 
```

compression:
- "on-the-fly decompression of on-disk data"
- default: `infer` the compression from the file extension
- other options: gzip, bz2, zip, xz, None
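
The following sketch shows the default `infer` behaviour on a round trip. It assumes a small made-up DataFrame and a temporary directory; the `.gz` ending alone is what selects gzip on both the write and the read.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data
prices = pd.DataFrame({"time": ["t1", "t2"], "units": [570, 694]})

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "prices.csv.gz")
    # to_csv also infers gzip compression from the ".gz" ending
    prices.to_csv(gz_path, index=False)
    # no compression argument needed: the default 'infer' picks gzip from the ending
    restored = pd.read_csv(gz_path)

print(restored)
```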

##### Simple options

###### Skipping rows

```python
    skipinitialspace=False, 
    skiprows=None, 
    skipfooter=0, 
    nrows=None, 
```

###### Reading select columns

```python
    index_col=None, 
    usecols=None, 
```
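
A short sketch of how these options combine to read only part of a file. The sample text, including the junk first line, is made up for illustration.

```python
import io

import pandas as pd

# Made-up sample: one junk line, then a header and three data rows
raw = "exported by some tool\ntime,ask,bid\nt1,10,9\nt2,11,10\nt3,12,11\n"

subset = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,               # skip the junk line before the header
    usecols=["time", "bid"],  # only parse these two columns
    nrows=2,                  # stop after two data rows
)

print(subset)
```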

##### Memory Maps

```python
    low_memory=True, 
    memory_map=False,
```

`memory_map`
- boolean parameter
- "map the file object directly onto memory" for access
- "can improve performance because there is no longer any I/O overhead"
  - paging
  - flushing
- specifically:
  - OS optimized subroutines are used
  - strided memory access through numpy
  - lazy loading

In [331]:
%%time
df = pd.read_csv(csv_test_file, memory_map=True)

Wall time: 509 ms


##### Implicit mappings

`low_memory`
- process the file in chunks internally
- returns one full DataFrame
- only one chunk is parsed at a time, reducing memory use while parsing
- can lead to mixed type inference within a column (!)
    - to avoid mixed types, either set `low_memory=False`
    - or, better, specify the type with the `dtype` parameter

In [334]:
%%time
df = pd.read_csv(csv_test_file, low_memory=True)

Wall time: 459 ms


##### Explicit Mapping

```python
    chunksize=None, 
    iterator=False,
```

- Iterating through files chunk by chunk by loading file lazily

`chunksize`:
- number of lines to be read as a chunk
- returns `TextFileReader` object for iteration

`iterator`:
- Return `TextFileReader` object for iteration
- getting chunks with `get_chunk()`

In [7]:
a = pd.read_csv(csv_test_file, chunksize=4)
a.get_chunk()
a.get_chunk()
type(a)

pandas.io.parsers.TextFileReader
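
Besides calling `get_chunk()` by hand, the reader can be used directly in a `for` loop. The sketch below, on made-up in-memory data, computes a mean without ever holding the whole file in one DataFrame.

```python
import io

import pandas as pd

# Made-up sample: a single "price" column with the values 1..10
raw = "price\n" + "\n".join(str(i) for i in range(1, 11)) + "\n"

total = 0
n_rows = 0
# each chunk is an ordinary DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += int(chunk["price"].sum())
    n_rows += len(chunk)

mean_price = total / n_rows
print(mean_price)
```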

For more details see:
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

### Efficiently reading last lines

Sample file of roughly 20 MB

In [191]:
test_file = "C:/Users/rgrei/OneDrive/Work/Cryptix/data/Marketdata/boerse-frankfurt/bid-ask/XETR/sap-se.csv"

#### Simple Read Last Line Codes

In [233]:
%%time
with open(test_file, 'rb') as f:
    lines = f.read().splitlines()
    last_line = lines[-1].decode('utf-8')
    print(last_line)

2020-08-24T11:58:31.11+02:00,138.82,138.78,419.0,510.0
Wall time: 79.8 ms


In [8]:
%%time
import os

with open(test_file, 'rb') as f:
    f.seek(0, os.SEEK_END)
    # jump back 128 bytes from the end; assumes the last line is shorter than that
    f.seek(f.tell() - 8*16, os.SEEK_SET)
    lines = f.read().decode('utf-8').splitlines()[-1]
    print(lines)

In [201]:
from collections import deque

In [235]:
%%time
with open(test_file, 'rb') as f:
    last_line = deque(f, 1)
    print(last_line)

deque([b'2020-08-24T11:58:31.11+02:00,138.82,138.78,419.0,510.0\r\n'], maxlen=1)
Wall time: 49.9 ms


In [236]:
%%time
with open(test_file, 'rb') as f:
    for i in f:
        pass
    print(i)

b'2020-08-24T11:58:31.11+02:00,138.82,138.78,419.0,510.0\r\n'
Wall time: 58.8 ms


#### Using seek offsets and buffer bytes

In [168]:
import os
import io

In [248]:
def tail(f, window=1):
    """Returns the last `window` lines of binary file `f` as a list of str."""
    if window <= 0:
        return []

    BUFSIZ = 8192
    f.seek(0, os.SEEK_END)
    remaining_bytes = f.tell()
    newlines_needed = window + 1  # one extra so the first kept line is complete
    block = -1
    data = []

    while newlines_needed > 0 and remaining_bytes > 0:
        if remaining_bytes - BUFSIZ > 0:
            # seek back one whole BUFSIZ block from the end
            f.seek(block * BUFSIZ, os.SEEK_END)
            bunch = f.read(BUFSIZ)
        else:
            # fewer than BUFSIZ bytes left: read the unread start of the file
            f.seek(0, os.SEEK_SET)
            bunch = f.read(remaining_bytes)

        # keep raw bytes: a block boundary may split a multi-byte UTF-8 character
        data.insert(0, bunch)
        newlines_needed -= bunch.count(b'\n')
        remaining_bytes -= BUFSIZ
        block -= 1

    # decode only the complete lines that are returned
    return [line.decode('utf-8') for line in b''.join(data).splitlines()[-window:]]

In [249]:
%%time
with open(test_file, 'rb') as f:
    last_line = tail(f, 1)
    print(last_line)

['2020-08-24T11:58:31.11+02:00,138.82,138.78,419.0,510.0']
Wall time: 997 µs


### Appending CSV Files

In [226]:
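filename = "test.csv"  # assumed path of an existing CSV with the same columns
with open(filename, "a", newline="") as f:
    # write to the open handle so rows are appended instead of overwriting the file
    df.to_csv(f, header=False)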

### Working with multiple CSV Files

Since dictionaries are hash maps, you can use them to map names to files.

Further, you can use them to map to memory-mapped files.

Using the `os` module, you can build nested dictionaries that mirror a folder's structure and memory map its files.

The result:

> Simple access to various large data files

#### Implementation Example

In [250]:
import os

folderpath = "C:/Users/rgrei/OneDrive/Work/Cryptix/data/Marketdata/boerse-frankfurt"

In [327]:
def map_folder_csv(folderpath):
    """creates a tree hashmap to the memory mapped csv files (identical structure)"""
    new_dic = {}
    for file in os.listdir(folderpath):
        filepath = os.path.join(folderpath, file)
        if file.endswith('.csv'):
            # key is the file name without the '.csv' ending
            new_dic[file[:-4]] = pd.read_csv(filepath, memory_map=True)
        elif os.path.isdir(filepath):
            # recurse into subfolders to mirror the folder tree
            new_dic[file] = map_folder_csv(filepath)
    return new_dic
a = map_folder_csv(folderpath)

In [None]:
def process_filepath(filepath, fileending, reader):
    """read a matching file, recurse into a folder, ignore anything else"""
    if filepath.endswith(fileending):
        return reader(filepath)
    elif os.path.isdir(filepath):
        # pass the settings on so subfolders are mapped the same way
        return map_folder(filepath, fileending, reader)
    else:
        return None


def map_folder(folderpath, fileending='.csv', reader=lambda x: pd.read_csv(x, memory_map=True)):
    """creates a tree hashmap to the memory mapped csv files (identical structure)"""
    return {file.replace(fileending, ''): process_filepath(os.path.join(folderpath, file), fileending, reader)
            for file in os.listdir(folderpath)}

a = map_folder(folderpath)

## JSON

In [335]:
json_test_file = "C:/Users/rgrei/OneDrive/Work/Cryptix/data/Marketdata/boerse-frankfurt/bond_definitions.json"

### JavaScript Object Notation

Defined by [RFC 7159](https://tools.ietf.org/html/rfc7159.html) (previously [RFC 4627](https://tools.ietf.org/html/rfc4627.html)) and [ECMA-404](http://www.ecma-international.org/publications/standards/Ecma-404.htm)

Standardized in 2013; the current version is [RFC 8259](https://tools.ietf.org/html/rfc8259) from 2017

Benefits:
- Human readability
- Easy sharing between people and programs
- Relatively fast reading and writing
- Hierarchies native

Drawbacks:
- Inefficient Storage and Access

JSON ignores indentation and other insignificant whitespace, so the same data can be written in an easily readable, indented form or in a more compact one.
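
To make the point about whitespace concrete, here is a minimal sketch that serializes one made-up record twice with the standard `json` module, once indented and once compact, and checks that both decode to the same object.

```python
import json

# Made-up sample record
record = {"name": "Ford Prefect", "relatives": ["Zaphod Beeblebrox"]}

pretty = json.dumps(record, indent=4)                # readable, one item per line
compact = json.dumps(record, separators=(",", ":"))  # no optional whitespace at all

# both strings decode to the same Python object
same = json.loads(pretty) == json.loads(compact) == record
print(same, len(pretty), len(compact))
```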

### Basic Usage

In [9]:
import json

#### JSON to Python


JSON          | Python
--------------|--------
object        | dict
array         | list
string        | str
number (int)  | int
number (real) | float
true          | True
false         | False
null          | None

#### Python to JSON


Python                                 | JSON
---------------------------------------|------
dict                                   | object
list, tuple                            | array
str                                    | string
int, float, int- & float-derived Enums | number
True                                   | true
False                                  | false
None                                   | null

#### From file to Python object

"Deserialize fp (a .read()-supporting text file or binary file containing a JSON document) to a Python object using this conversion table."

```python
json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
```

In [338]:
with open(json_test_file, "r") as f:  # the with block closes the file again
    jf = json.load(f)

#### From string to Python native dtypes

"Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object using this conversion table."

```python
json.loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
```

In [358]:
json_string = """
{
    "researcher": {
        "name": "Ford Prefect",
        "species": "Betelgeusian",
        "relatives": [
            {
                "name": "Zaphod Beeblebrox",
                "species": "Betelgeusian"
            }
        ]
    }
}
"""
data = json.loads(json_string)
data

{'researcher': {'name': 'Ford Prefect', 'species': 'Betelgeusian', 'relatives': [{'name': 'Zaphod Beeblebrox', 'species': 'Betelgeusian'}]}}

#### Convert dictionary object to string for JSON storage

"Serialize obj to a JSON formatted str using this conversion table. The arguments have the same meaning as in dump()."

```python
json.dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
```

In [361]:
print(json.dumps(data))
print(type(json.dumps(data)))

{"researcher": {"name": "Ford Prefect", "species": "Betelgeusian", "relatives": [{"name": "Zaphod Beeblebrox", "species": "Betelgeusian"}]}}
<class 'str'>


#### Saving object as JSON

"Serialize obj as a JSON formatted stream to fp (a .write()-supporting file-like object) using this conversion table."

```python
json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
```

In [359]:
with open('test_file.json', 'w') as f:
    json.dump(data, f, indent=4)

#### More on working with JSON files

- https://realpython.com/python-json/
- https://docs.python.org/3/library/json.html

## XML

### Extensible Markup Language (XML) Basics

First Published 1998

Python XML module in the core library: https://docs.python.org/3/library/xml.html

There is a big warning:
```
Warning
The XML modules are not secure against erroneous or maliciously constructed data. If you need to parse untrusted or unauthenticated data see the XML vulnerabilities and The defusedxml and defusedexpat Packages sections. 
```

```xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
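
A minimal sketch of reading the note above with `xml.etree.ElementTree` from the standard library. In line with the warning, this assumes the XML comes from a trusted source; the variable names are chosen for this example.

```python
import xml.etree.ElementTree as ET

note_xml = """<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

root = ET.fromstring(note_xml)

# iterating over an element yields its direct children
note = {child.tag: child.text for child in root}

print(root.tag, note["to"], note["heading"])
```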

## YAML

### YAML Ain't Markup Language Basics

- Introduced in 2001
- Supposed to be a human readable format
- Used for configuration files often


```yaml
---
receipt:     Oz-Ware Purchase Invoice
date:        2012-08-06
customer:
    first_name:   Dorothy
    family_name:  Gale

items:
    - part_no:   A4786
      descrip:   Water Bucket (Filled)
      price:     1.47
      quantity:  4

    - part_no:   E1628
      descrip:   High Heeled "Ruby" Slippers
      size:      8
      price:     133.7
      quantity:  1

bill-to:  &id001
    street: |
            123 Tornado Alley
            Suite 16
    city:   East Centerville
    state:  KS

ship-to:  *id001

specialDelivery:  >
    Follow the Yellow Brick
    Road to the Emerald City.
    Pay no attention to the
    man behind the curtain.
```

## Parquet

Introduced in 2013

"[Apache Parquet](http://parquet.apache.org/) is a [columnar storage format](http://en.wikipedia.org/wiki/Column-oriented_DBMS) available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language."

In Python, pyarrow can be used to access Parquet files

```bash
pip install pyarrow
```

### Converting to parquet compatible formats

In [363]:
import pyarrow as pa
import pyarrow.parquet as pq

In [368]:
table = pa.Table.from_pandas(df)
table

pyarrow.Table
time: string
bestAskPrice: double
bestBidPrice: double
bestAskUnits: double
bestBidUnits: double

### Writing Parquet files

In [369]:
pq.write_table(table, 'example.parquet')

In [383]:
import humanize
print(f"csv:     {humanize.naturalsize(os.path.getsize(csv_test_file))}, \nparquet:  {humanize.naturalsize(os.path.getsize('example.parquet'))}")

csv:     18.8 MB, 
parquet:  3.6 MB


### Reading Parquet files

In [384]:
table2 = pq.read_table('example.parquet')
table2

pyarrow.Table
time: string
bestAskPrice: double
bestBidPrice: double
bestAskUnits: double
bestBidUnits: double

### Convert back to Pandas

In [387]:
table2.to_pandas().head()

Unnamed: 0,time,bestAskPrice,bestBidPrice,bestAskUnits,bestBidUnits
0,2020-08-10T13:58:30.42+02:00,135.600006,135.56,570.0,694.0
1,2020-08-10T13:58:30.88+02:00,135.600006,135.56,570.0,622.0
2,2020-08-10T13:58:34.81+02:00,135.600006,135.56,570.0,694.0
3,2020-08-10T13:58:36.06+02:00,135.6,135.56,488.0,816.0
4,2020-08-10T13:58:36.06+02:00,135.6,135.56,488.0,816.0


### More Information

For more information on pyarrow: https://arrow.apache.org/docs/python/parquet.html

There is also the popular: [FastParquet](https://fastparquet.readthedocs.io/en/latest/)

You can also read and write straight from pandas if you prefer.

# SQL

## intro
 - sql is a declarative programming language to manipulate tables
 - declarative: no functions or loops, just _declare_ what you need and the runtime will figure out how to compute it
 - sql queries can be used to
   - insert new rows into a table
   - delete rows from a table
   - update one or more attributes of one or more rows in a table
   - retrieve and possibly transform rows coming from one or more tables
 - this section will mostly focus on reading data (last point)


## main abstraction: tables
 - a table is a _set_ of tuples (rows)
   - no two rows are the same
 - rows are distinguished by _primary keys_
   - primary key: smallest set of attributes that uniquely identifies a row, examples:
     - student ID (one attribute)
     - first name, last name, birth date, place of birth (four attributes)
   - the primary key is a property of each table
     - all rows in a table use the same attributes as primary key
     - but different tables can have different primary keys
   - cannot have two rows with the same primary key
 - _foreign keys_ are used to refer to rows of other tables
   - e.g. a table with grades will have foreign keys that point to the student and the course
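
The sketch below illustrates primary and foreign keys with Python's built-in `sqlite3` module and an in-memory database. The schema and sample rows are assumptions made for this example; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE courses  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE grades (
        student INTEGER REFERENCES students(id),  -- foreign key to students
        course  INTEGER REFERENCES courses(id),   -- foreign key to courses
        grade   REAL,
        PRIMARY KEY (student, course)             -- two attributes form the key
    );
    INSERT INTO students VALUES (1, 'Ada');
    INSERT INTO courses  VALUES (10, 'Databases');
    INSERT INTO grades   VALUES (1, 10, 1.3);
""")


def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


duplicate = try_insert("INSERT INTO students VALUES (1, 'Bob')")  # primary key 1 already exists
orphan = try_insert("INSERT INTO grades VALUES (99, 10, 2.0)")    # there is no student 99
print(duplicate)
print(orphan)
```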

## domain
 - good database design has
   - one table for each "entity" in the domain
   - relationships between entities
 - types of binary relationships:
   - 1 to 1 (can be stored in either entity, but NOT BOTH, or in a separate table)
   - 1 to n (must be stored in the entity on the "n" side, where each row refers to exactly one row of the other entity, or in a separate table)
   - m to n (requires a separate table)  
 - possible to have relationships between more than two entities
 - example:
   - entities
     - students (id, name, degree)
     - courses (id, faculty, semester)
     - professors (id, name, chair)
   - relationships
     - "mentor" between students (suppose 1 to 1), three possibilities
       - have a column "mentor"
       - have a column "mentee"
       - have a new table (mentor, mentee)
       - having both columns is not ideal: more work to ensure consistency
     - grades (entity student, entity course, attribute grade)
       - m to n -> requires a table
     - teaches (professors, courses)
       - if m to n -> requires a table
       - if only one professor per course -> store professor in course table or create separate table
 - sql shines when "navigating" across relationships, for example:
   - for each student, find the professor that gave them the highest grade
   - for each professor, find courses taught last semester

## anatomy of a select query
 - "select" queries are used to retrieve data from the database
   
```
SELECT <columns and transformation>
FROM <source table(s)>
[WHERE <filter rows coming from source table(s)>]
[GROUP BY <create groups of rows>
[HAVING <filter groups>]]
```
    
 - must have SELECT+FROM
 - WHERE and GROUP BY optional
 - HAVING optional, must be used with GROUP BY
 - note on GROUP BY: eventually you must have only one row per group

## select query untangled
 - confusingly, order of execution is different than order of writing:
   1. FROM: first, gather all input rows from all tables
   2. WHERE: next, remove all rows not matching the predicate
   3. GROUP BY: now, if needed, create groups of rows
   4. HAVING: then, remove all groups that do not match the predicate
   5. SELECT: finally, produce output rows
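
The execution order can be traced on a tiny example. This sketch uses `sqlite3` with a made-up `grades` table and assumes a grading scale where lower is better and anything above 4.0 is a fail; the numbered SQL comments follow the five steps above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (student TEXT, course TEXT, grade REAL)")
conn.executemany("INSERT INTO grades VALUES (?, ?, ?)", [
    ("ada", "db", 1.0), ("ada", "ml", 2.0), ("ada", "os", 5.0),
    ("bob", "db", 3.0), ("bob", "ml", 5.0),
    ("eve", "db", 2.0),
])

rows = conn.execute("""
    SELECT student, COUNT(*) AS n_passed, AVG(grade) AS mean_grade  -- 5. one output row per group
    FROM grades                                                     -- 1. gather all input rows
    WHERE grade <= 4.0                                              -- 2. drop failed exams
    GROUP BY student                                                -- 3. one group per student
    HAVING COUNT(*) >= 2                                            -- 4. drop groups with < 2 rows
""").fetchall()

print(rows)
```

Only `ada` survives: after the WHERE step she still has two passed exams, while `bob` and `eve` each have one and are removed by HAVING.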

## FROM: source tables
 - you can specify one or more tables in the from clause
 - FROM will do a cross-product of all tuples of all tables
 - in almost all cases, you only want a small subset of the cross-product
   - use WHERE to remove tuples that do not make sense
 - possible to give aliases to tables and use that alias in the rest of the query
   - useful to keep query short and when the same table is used several times in the same query

## WHERE: tuple filter
 - specify a boolean condition that is evaluated for each row produced by the FROM
 - all rows where this evaluates to false are discarded
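 - handling of null values: a comparison with NULL is never true, so such rows are discarded; use `IS NULL` / `IS NOT NULL` to test for them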

## JOIN: a special case of FROM+WHERE
 - in most cases, we are not interested in the cross-product
 - we actually want tuples that match primary/foreign keys
 - example `SELECT * FROM students, grades WHERE students.id = grades.student`
   - associates to each student all their grades (one per row)
 - this operation is so common that it has a special name to distinguish it from the general case
 - `SELECT * FROM students JOIN grades ON students.id = grades.student`
 - options to handle non-matches:
   - inner join: `FROM students [INNER] JOIN grades ON students.id = grades.student`
     - equivalent to `WHERE students.id = grades.student`
     - only keep matches
   - left join: `FROM students LEFT JOIN grades ON students.id = grades.student`
     - conceptually `WHERE students.id = grades.student OR grades.student IS NULL`
     - keep matches and un-matched records from the _left_ table (the missing columns are filled with NULL)
   - right join: `FROM students RIGHT JOIN grades ON students.id = grades.student`
     - conceptually `WHERE students.id = grades.student OR students.id IS NULL`
     - keep matches and un-matched records from the _right_ table
   - full outer join: `FROM students FULL OUTER JOIN grades ON students.id = grades.student`
     - conceptually `WHERE students.id = grades.student OR grades.student IS NULL OR students.id IS NULL`
     - keep matches and un-matched records from _both_ tables, filling the missing side with NULL
 - other possibilities:
    - natural join: `FROM students NATURAL JOIN grades`
      - no `ON` clause -> match all columns with the same name
    - self join: `FROM students s JOIN students t`
      - aliases are needed to tell the two copies of the table apart
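
The difference between an inner and a left join shows up as soon as one student has no grades. A minimal `sqlite3` sketch with made-up tables and rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grades (student INTEGER, course TEXT, grade REAL);
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Bob');
    INSERT INTO grades VALUES (1, 'db', 1.3), (1, 'ml', 2.0);  -- Bob has no grades
""")

inner = conn.execute("""
    SELECT s.name, g.course
    FROM students s JOIN grades g ON s.id = g.student
    ORDER BY s.name, g.course
""").fetchall()

left = conn.execute("""
    SELECT s.name, g.course
    FROM students s LEFT JOIN grades g ON s.id = g.student
    ORDER BY s.name, g.course
""").fetchall()

print(inner)  # Bob is missing: he has no matching row in grades
print(left)   # Bob is kept, with NULL (None in Python) for the grade columns
```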

## GROUP BY: create groups of rows
 - must specify one or more columns, possibly with transformation
 - all rows that have the same values for all (transformed) column(s) end up in the same group

## HAVING: filter groups
 - another boolean condition applied to each group
 - example: filter by group size, or by the min/max/mean of some column

## SELECT: produce output columns
 - all the surviving groups/rows are transformed
 - select only a subset of attributes, or transform values
 - careful: each group must be collapsed into a row

## subqueries and CTE
 - to make things messy
 - too many CTEs can make a query slower; sometimes it is better to create a temporary table
 - jww's use case:

```sql
select t.id, (
  select u.status
  from tbl u
  where t.id = u.id and t.timestamp >= u.timestamp and u.status is not null
  order by u.timestamp desc
  limit 1
) as status
from tbl t
```

   - tbl has columns id, timestamp, status, where status can be null.
   - goal is to fill null status with most recent non-null status (of the same id)
   - need index on (id, timestamp) to be quick
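
To see the fill-forward behaviour in action, here is a sketch that runs such a correlated subquery in an in-memory SQLite database. The sample rows and the index name are made up, and the subquery explicitly skips rows whose status is null so that only non-null values are carried forward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, timestamp INTEGER, status TEXT)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", [
    (1, 1, "open"), (1, 2, None), (1, 3, "closed"), (1, 4, None),
    (2, 1, None), (2, 2, "open"),
])
# index on (id, timestamp) so each subquery lookup does not scan the whole table
conn.execute("CREATE INDEX idx_tbl_id_ts ON tbl (id, timestamp)")

filled = conn.execute("""
    SELECT t.id, t.timestamp, (
        SELECT u.status
        FROM tbl u
        WHERE u.id = t.id AND u.timestamp <= t.timestamp AND u.status IS NOT NULL
        ORDER BY u.timestamp DESC
        LIMIT 1
    ) AS status
    FROM tbl t
    ORDER BY t.id, t.timestamp
""").fetchall()

for row in filled:
    print(row)
```

The first row of id 2 stays null because there is no earlier non-null status to carry forward.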

## examples of complex queries
 - TODO

## programmatically interfacing to a RDBMS
 - connections
 - cursors
 - sql injection and proper escaping
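
A minimal sketch of these three points with Python's built-in `sqlite3` module. The table, the rows and the hostile input string are made up; the point is the contrast between pasting input into the SQL text and passing it through a `?` placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # connection: the session with the database
cur = conn.cursor()                 # cursor: executes statements and iterates over results
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO students (name) VALUES (?)", [("Ada",), ("Bob",)])
conn.commit()

user_input = "Ada' OR '1'='1"  # hostile input

# unsafe: the input becomes part of the SQL text and changes the query's meaning
unsafe_sql = f"SELECT name FROM students WHERE name = '{user_input}'"
leaked = cur.execute(unsafe_sql).fetchall()

# safe: the value is passed separately and is only ever treated as data
safe = cur.execute("SELECT name FROM students WHERE name = ?", (user_input,)).fetchall()

print(len(leaked), len(safe))
```

The unsafe variant returns every student, while the parameterized variant correctly finds nobody with that odd name.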

## transactions and ACID
 - heh
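
As one possible illustration of atomicity, the sketch below uses `sqlite3`, where a connection used as a context manager commits on success and rolls back if an exception is raised. The `accounts` table, the `CHECK` constraint and the `transfer` helper are hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # fine: alice has enough money
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK, so the credit to bob is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```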

## advanced: indexing
 - depending on your query and how you express it, it may be quite slow
 - the DBMS tries to optimize every query, but sometimes it fails
 - when most of the time is spent on joins and lookups, creating _indices_ can greatly speed up the query
 - an index is just a mapping from values to rows that contain that value in one or more columns
 - this makes it much faster to find rows that contain a given value
   - instead of checking row by row, simply look in the index
   - think about books!
 - an index is always relative to a table and one or more columns
   - `CREATE INDEX <index name> ON <table name>(<list of columns>)`
 - a table can have many indices, but one is always created automatically for primary keys
   - all other unique keys must also have an index
   - joins are much faster when there is an index on one of the columns
 - if a query is slow and/or executed very frequently, consider adding an index on columns used in the WHERE/JOIN
 - types of index:
   - tree-based: O(log N) access, can be used to quickly answer queries like `WHERE L < column < U`
   - hash-based: O(1) access, cannot answer range queries
   - clustered index: table is physically sorted by the columns

## advanced: query plans
 - understanding why a query is slow is not trivial
 - the query plan is produced by the optimizer and shows exactly what is done, and how, to execute the query
 - it contains an estimated cost and can be augmented with the actual cost measured when executing the query
 - estimated cost:
   - computed from statistics about rows/values that the DBMS maintains internally
   - these statistics can become inaccurate after lots of operations
   - useful to periodically recompute these statistics
   - also useful to periodically clear the space allocated to deleted rows and defragment table data
 - (show example of plans before/after adding an index)
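
One way to see a plan before and after adding an index is SQLite's `EXPLAIN QUERY PLAN`. The sketch below uses a made-up `grades` table and a hypothetical `plan` helper; the exact wording of the plan text differs between SQLite versions, so it is printed rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (student INTEGER, course TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [(i, f"course{i % 50}", 1.0 + i % 4) for i in range(1000)],
)

query = "SELECT * FROM grades WHERE course = 'course7'"


def plan(sql):
    """Return the human-readable detail column of each plan step."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before = plan(query)  # without an index: a scan over the whole table
conn.execute("CREATE INDEX idx_grades_course ON grades (course)")
after = plan(query)   # with the index: a search that uses idx_grades_course

print(before)
print(after)
```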

### Example

![](../img/qplan.png)

Image from [dba.stackexchange.com](https://dba.stackexchange.com/q/9234)

# Non-SQL

What does NoSQL actually mean?

A bit of history …
- 1998: First used for a relational database that omitted usage of SQL
- 2009: First used during a conference to advocate non-relational databases

NoSQL is an accidental term with no precise definition

## NoSQL: Overview

Main objective: implement distributed state

- Different objects stored on different servers
- Same object replicated on different servers

Main idea: give up some of the ACID constraints to improve performance

Simple interface:

- Write (=Put): needs to write all replicas
- Read (=Get): may get only one

Strong consistency is relaxed to eventual consistency

## NoSQL Six key features

1. Scale “simple operations” horizontally
2. Replicate/distribute data over many servers
3. Simple call level interface (contrast w/ SQL)
4. Weaker concurrency model than ACID
5. Efficient use of distributed indexes and RAM
6. Flexible schema

## Types of NoSQL Databases

Core types
- Key-value stores
- Document stores
- Wide column (column family, column oriented, …) stores
- Graph databases

Non-core types
- Object databases
- Native XML databases
- RDF stores
- ...

## Key-Value Stores

Data model
- The simplest NoSQL database type
    - Works as a simple hash table (mapping)
- Key-value pairs
    - Key (id, identifier, primary key)
    - Value: binary object, black box for the database system

Query patterns
- Create, update or remove value for a given key
- Get value for a given key

Characteristics
- Simple model ⇒ great performance, easily scaled, …
- Simple model ⇒ not for complex queries nor complex data
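
The model and query patterns above can be sketched as a toy in-memory store. The `KeyValueStore` class and the session data are hypothetical and not the API of any real product; the point is that the store only ever sees opaque bytes.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store: values are opaque bytes to the store."""

    def __init__(self):
        self._data = {}  # a plain hash table

    def put(self, key, value: bytes):
        # create or update the value for a key
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", json.dumps({"user": "ford", "cart": ["towel"]}).encode("utf-8"))

# the store cannot query inside the value; the client decodes it after the lookup
session = json.loads(store.get("session:42"))
print(session["user"])

store.delete("session:42")
print(store.get("session:42"))
```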


### Key-Value Stores

Suitable use cases
- Session data, user profiles, user preferences, shopping carts, …
    - I.e. when values are only accessed via keys

When not to use
- Relationships among entities
- Queries requiring access to the content of the value part
- Set operations involving multiple key-value pairs

Representatives
- Redis, MemcachedDB, Riak KV, Hazelcast, Ehcache, Amazon SimpleDB, Berkeley DB, Oracle NoSQL, Infinispan, LevelDB, Ignite, Project Voldemort
- Multi-model: OrientDB, ArangoDB

### Key-Value Stores

<img src="photos/keyvalue_example.PNG">

### Key-Value Stores Use cases

Key-value data stores are ideal for storing user profiles, blog comments, product recommendations, and session information.

- Twitter uses Redis to deliver your Twitter timeline
- Pinterest uses Redis to store lists of users, followers, unfollowers, boards, and more
- Coinbase uses Redis to enforce rate limits and guarantee correctness of Bitcoin transactions
- Quora uses Memcached to cache results from slower, persistent databases

## Document Stores

Data model
- Documents
    - Self-describing
    - Hierarchical tree structures (JSON, XML, …)
        - Scalar values, maps, lists, sets, nested documents, …
    - Identified by a unique identifier (key, …)
- Documents are organized into collections

Query patterns
- Create, update or remove a document
- Retrieve documents according to complex query conditions

Observation
- Extended key-value stores where the value part is examinable!


## Document Stores

Suitable use cases
- Event logging, content management systems, blogs, web, analytics, e-commerce applications, …
    - I.e. *for structured documents with similar schema*

When not to use
- *Set operations* involving multiple documents
- Design of document structure is constantly changing
    - I.e. when the required level of granularity would outweigh the advantages of aggregates

Representatives
- MongoDB, Couchbase, Amazon DynamoDB, CouchDB, RethinkDB, RavenDB, Terrastore
- Multi-model: MarkLogic, OrientDB, OpenLink Virtuoso, ArangoDB


## Document Stores

```json
[
    {
        "year" : 2013,
        "title" : "Turn It Down, Or Else!",
        "info" : {
            "directors" : [ "Alice Smith", "Bob Jones"],
            "release_date" : "2013-01-18T00:00:00Z",
            "rating" : 6.2,
            "genres" : ["Comedy", "Drama"],
            "image_url" : "http://ia.media-imdb.com/images/N/O9ERWAU7FS797AJ7LU8HN09AMUP908RLlo5JF90EWR7LJKQ7@@._V1_SX400_.jpg",
            "plot" : "A rock band plays their music at high volumes, annoying the neighbors.",
            "actors" : ["David Matthewman", "Jonathan G. Neff"]
        }
    },
    {
        "year": 2015,
        "title": "The Big New Movie",
        "info": {
            "plot": "Nothing happens at all.",
            "rating": 0
        }
    }
]
```
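
Because the store can look inside documents, queries may filter on nested fields even when documents differ in shape. The sketch below mimics this in plain Python on shortened copies of the two movie documents; the `find` helper is hypothetical and only stands in for a real document store's query language.

```python
# Shortened copies of the two documents above; note the second one has no "genres"
movies = [
    {"year": 2013, "title": "Turn It Down, Or Else!",
     "info": {"rating": 6.2, "genres": ["Comedy", "Drama"]}},
    {"year": 2015, "title": "The Big New Movie",
     "info": {"plot": "Nothing happens at all.", "rating": 0}},
]


def find(collection, predicate):
    """Return all documents in the collection for which predicate(doc) is true."""
    return [doc for doc in collection if predicate(doc)]


# conditions on nested fields; .get() copes with fields that a document lacks
well_rated = find(movies, lambda d: d["info"].get("rating", 0) >= 5)
comedies = find(movies, lambda d: "Comedy" in d["info"].get("genres", []))

print([d["title"] for d in well_rated])
print([d["title"] for d in comedies])
```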

## Document Stores Use Cases

- SEGA uses MongoDB for handling 11 million in-game accounts
- Cisco moved its VSRM (video session and research manager) platform to Couchbase to achieve greater scalability
- Aer Lingus uses MongoDB with Studio 3T to handle ticketing and internal apps
- Built on MongoDB, The Weather Channel’s iOS and Android apps deliver weather alerts to 40 million users in real-time

## Wide Column Stores

Data model
- Column family (table)
    - Table is a collection of similar rows (not necessarily identical)

- Row
    - Row is a collection of columns
        - Should encompass a group of data that is accessed together
    - Associated with a unique row key

- Column
    - Column consists of a column name and column value (and possibly other metadata records)
    - Scalar values, but also flat sets, lists or maps may be allowed
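
As a rough mental model only, a column family can be pictured as a mapping from row keys to per-row sets of named columns. The nested dictionaries and the `get_columns` helper below are assumptions for illustration and say nothing about how real wide column stores lay out data.

```python
# column family "users": row key -> {column name: column value}
users = {
    "user:1": {"name": "Ada", "email": "ada@example.org"},
    # rows are similar but not identical: different columns, one with a flat list
    "user:2": {"name": "Bob", "last_login": "2020-08-24", "tags": ["admin", "beta"]},
}


def get_columns(family, row_key, columns):
    """Select some columns of one row by its row key, skipping absent columns."""
    row = family.get(row_key, {})
    return {name: row[name] for name in columns if name in row}


print(get_columns(users, "user:1", ["name", "email"]))
print(get_columns(users, "user:2", ["name", "email"]))
```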

## Wide Column Stores

Query patterns
- Create, update or remove a row within a given column family
- Select rows according to a row key or simple conditions

Warning
- Wide column stores are not just a special kind of RDBMSs with a variable set of columns!


## Wide Column Stores

Suitable use cases
- Event logging, content management systems, blogs, …
    - I.e. for structured flat data with similar schema

When not to use
- ACID transactions are required
- Complex queries: aggregation (SUM, AVG, …), joining, …
- Early prototypes: i.e. when database design may change

Representatives
- Apache Cassandra, Apache HBase, Apache Accumulo, Hypertable, Google Bigtable

## Wide Column Stores

<img src="https://pandaforme.gitbooks.io/introduction-to-cassandra/content/Screen%20Shot%202016-02-24%20at%2012.24.12.png">

## Wide Column Stores Use Cases

Column stores offer very high performance and a highly scalable architecture. Because they’re fast to load and query, they’ve been popular among companies and organizations dealing with big data, IoT, and user recommendation and personalization engines.

- Spotify uses Cassandra to store user profile attributes and metadata about artists, songs, etc. for better personalization
- Facebook initially built its revamped Messages on top of HBase, which is now also used for other Facebook services such as the Nearby Friends feature and search indexing
- Outbrain uses Cassandra to serve over 190 billion personalized content recommendations each month

## Graph Databases

Data model
- Property graphs
    - Directed / undirected graphs, i.e. collections of …
        - nodes (vertices) for real-world entities, and
        - relationships (edges) between these nodes
    - Both the nodes and relationships can be associated with additional properties
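As a rough illustration, a property graph can be written down with plain Python structures. All node ids, relationship types and properties below are invented for the example:

```python
# Nodes: id -> properties (the "label" key is just another property here).
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob": {"label": "Person", "age": 29},
    "acme": {"label": "Company", "city": "Berlin"},
}

# Directed relationships: each has a start node, an end node, a type,
# and optionally its own properties.
relationships = [
    {"start": "alice", "end": "bob", "type": "KNOWS", "since": 2015},
    {"start": "alice", "end": "acme", "type": "WORKS_AT", "role": "engineer"},
]


def outgoing(node_id, rel_type=None):
    """Return the relationships leaving a node, optionally filtered by type."""
    return [
        r for r in relationships
        if r["start"] == node_id and (rel_type is None or r["type"] == rel_type)
    ]


print([r["end"] for r in outgoing("alice")])
print(outgoing("alice", "KNOWS")[0]["since"])
```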

Types of databases
- Non-transactional = small number of very large graphs
- Transactional = large number of small graphs

## Graph Databases

Query patterns
- Create, update or remove a node / relationship in a graph
- Graph algorithms (shortest paths, spanning trees, …)
- General graph traversals
- Sub-graph queries or super-graph queries
- Similarity based queries (approximate matching)
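One of the graph algorithms mentioned above, the shortest path in an unweighted graph, can be sketched with a breadth-first search. The adjacency dictionary `graph` is a made-up example, and a graph database would run this inside its own engine rather than in application code:

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node we reached it from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in previous:
                continue  # already visited
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical directed graph as an adjacency dictionary.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(graph, "A", "E"))
```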

Representatives
- Neo4j, Titan, Apache Giraph, InfiniteGraph, FlockDB
- Multi-model: OrientDB, OpenLink Virtuoso, ArangoDB

## Graph Databases

Suitable use cases
- Social networks, routing, dispatch, and location-based services, recommendation engines, chemical compounds, biological pathways, linguistic trees, …
    - I.e. simply for __graph structures__

When not to use
- Extensive batch operations are required
    - Multiple nodes / relationships are to be affected
- The graph is too large to be stored on a single machine
    - Distributing a graph across machines is difficult or even impossible


## Graph Databases

The image below shows how a relational database like MySQL works: it uses memory-intensive and more complicated join operations that search entire tables to find a match:

<img src="https://s3.amazonaws.com/dev.assets.neo4j.com/wp-content/uploads/from_relational_model.png">


## Graph Databases

Compare that to a graph database, which stores relationships directly as connections between nodes, so following them can be much faster and cheaper than a join.

<img src="https://s3.amazonaws.com/dev.assets.neo4j.com/wp-content/uploads/relational_to_graph.png">
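A deliberately small sketch of the contrast, using the standard-library `sqlite3` module for the relational side and a plain dictionary for the graph side. The `knows` table and the names in it are made up, and the sketch says nothing about real-world performance: relational engines use indexes and query planners, and graph databases have their own storage formats.

```python
import sqlite3

# Hypothetical "who knows whom" data.
edges = [("alice", "bob"), ("bob", "carol"), ("bob", "dave"), ("carol", "erin")]

# Relational style: relationships are rows, and two hops need a self-join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO knows VALUES (?, ?)", edges)
rows = conn.execute(
    "SELECT DISTINCT k2.dst FROM knows AS k1 "
    "JOIN knows AS k2 ON k1.dst = k2.src "
    "WHERE k1.src = ? ORDER BY k2.dst",
    ("alice",),
).fetchall()
sql_result = [row[0] for row in rows]
conn.close()

# Graph style: every node keeps direct references to its neighbours,
# so two hops are two lookups starting from the node itself.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)

graph_result = sorted(
    {fof for friend in adjacency.get("alice", []) for fof in adjacency.get(friend, [])}
)

print(sql_result, graph_result)
```

Each additional hop adds another join on the relational side, while on the graph side it is one more step along the stored connections.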


## Graph Databases Use Cases

Graph databases are well suited to relationship-heavy data, especially in large data sets. For queries that follow relationships, they can be much faster than relational databases, especially as the number of hops (the traversal depth) grows.

- Walmart uses Neo4j to provide customers personalized, relevant product recommendations and promotions
- Medium uses Neo4j to build their social graph to enhance content personalization
- Cisco uses Neo4j to mine customer support cases to anticipate bugs

## Reasons to use a NoSQL database

- The pace of development with NoSQL databases can be much faster than with a SQL database.
- Data that comes in many different forms is more easily handled and evolved with a NoSQL database.
- The amount of data in many applications cannot be served affordably by a single SQL database.
- The scale of traffic and the need for zero downtime can be hard to handle with a traditional SQL setup.

## Big Data

<img src="https://media.makeameme.org/created/big-data-big-5ad56d.jpg">

## What is Big Data

- high Volume
- high Velocity
- high Variety
- Veracity

## Data Sources

- Social media and networks
- Scientific instruments
- Mobile devices
- Sensor technology and networks

# Architecture Design Project

# Background

Our client is a well-known manufacturer who builds different types of robotic manipulators. These robots are used by different operators/companies all around the world. 

During operation, each robot writes a log file that contains timestamps as well as readings from different sensors.

For example, this is a snapshot of the log file:

```
.
.
.
timestamp 10:20:00 X 100 Y 100 Z 100
timestamp 10:20:01 T 5 R 6
timestamp 10:20:02 X 101 Y 99 Z 99
timestamp 10:20:03 X 102 Y 100 Z 99
timestamp 10:20:04 T 7 R 6
timestamp 10:20:05 X 100 Y 100 Z 99
.
.
.

```

A log file can contain readings from different sensors, and both the set of sensors and the way their values are recorded can differ between robots.
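A minimal parsing sketch, assuming every line follows the pattern seen in the snapshot (`timestamp <time>` followed by sensor name and value pairs) and that all values are numeric. The function name `parse_log_line` is hypothetical, and real log files may need more cases:

```python
def parse_log_line(line):
    """Turn one log line into {"timestamp": ..., "sensors": {...}} or None."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "timestamp":
        return None  # blank lines and placeholder lines such as "."
    readings = tokens[2:]
    if len(readings) % 2 != 0:
        return None  # a sensor name without a value (assumed malformed)
    names, values = readings[0::2], readings[1::2]
    return {
        "timestamp": tokens[1],
        "sensors": {name: float(value) for name, value in zip(names, values)},
    }


snapshot = """\
timestamp 10:20:00 X 100 Y 100 Z 100
timestamp 10:20:01 T 5 R 6
.
"""
records = [r for r in map(parse_log_line, snapshot.splitlines()) if r is not None]
print(records)
```

Because each record carries only the sensors that were present on that line, the result already has a variable set of fields per record, which is worth keeping in mind when choosing a database.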

# Main Idea

Our client wants to create a platform (website) for health-condition monitoring of the robots. In particular, operators of the robotic arms should be able to upload their log files and get insights into the status of their robot and its individual parts.

<img src="photos/cybernetics-1869205_1280.jpg">

# Task

Design an architecture for the front end and the back end (at an abstract level).
Assume that each log file is around 100 MB and is in text format.

- What types of databases are needed?
- In the case of SQL databases, what columns do you suggest?
- Do we need a NoSQL database? If so, which type?

Note: in this case, there is no unique solution. Try to be creative :)