## Data, Analytics & AI
# <font color=indigo> Practical Big Data: NoSQL Models </font>


---

<small>QA Ltd. owns the copyright and other intellectual property rights of this material and asserts its moral rights as the author. All rights reserved.</small>

## PreReq Python

### Tuples

In [56]:
row_tuple = (1, "Michael", "UK")

ordered, fixed, ...

In [57]:
row_tuple[0]

1

### Lists

In [58]:
data_list = ["Michael"]
data_list.append("Alice")
data_list.append("Eve")

ordered, mutable (editable)

In [60]:
data_list[0]

'Michael'

### Sets

In [61]:
unique_set = {'A', 'A', 'B', 'C'}

unique, unordered, ...

In [62]:
unique_set

{'A', 'B', 'C'}

In [64]:
# unique_set[0]  # ERROR!

### Dictionary

In [69]:
word_dictionary = {
    "happy"  : "def. a positive emotion", # key-value pair
    "sad"    : "def. a negative emotion"
}

unique *pairs* (keys are unique), key is the index; Python 3.7+ keeps insertion order, but you query by key, not by position,

In [73]:
word_dictionary['happy'] # query *by key*

'def. a positive emotion'

#### ASIDE: Dictionaries are a bit like collections of 2-tuples

In [70]:
dicty = [
    ("happy", "def. positive"), 
    ("sad", "def. negative")
]

In [71]:
dicty[0]

('happy', 'def. positive')

In [72]:
word_dictionary['happy']

'def. a positive emotion'
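
To make the aside concrete, here is a minimal sketch contrasting a scan over a list of 2-tuples with a dictionary built from the same pairs. The helper name `lookup_in_pairs` and the variable `as_dict` are hypothetical, introduced only for illustration.

```python
# A list of 2-tuples: finding a key means scanning the pairs one by one.
pairs = [
    ("happy", "def. positive"),
    ("sad", "def. negative"),
]

def lookup_in_pairs(pairs, key):
    """Linear scan: check each (key, value) pair until the key matches."""
    for k, v in pairs:
        if k == key:
            return v
    return None  # key not present

# dict() accepts an iterable of 2-tuples and builds a key -> value index.
as_dict = dict(pairs)

print(lookup_in_pairs(pairs, "sad"))  # found by scanning
print(as_dict["sad"])                 # found directly by key
```

Both lookups return the same value; the difference is that the dictionary does not need to walk through the other pairs first.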

## What is a **Data Model** ?

The term data model can refer to two distinct but closely related concepts. 

Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain: for example the customers, products, and orders found in a manufacturing organization. 


At other times it refers to the set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. 




* (Abstract) Data Model:
    * table, graph, document...
* Data Model (of an Application):
    * customer, sales, order, ...

## What is the Relational Model?

* <font color=green> Relational Databases are ALMOST ALWAYS the right answer! </font>

* Relation = Table 
    * Set of Tuples
    * Tuple = Row
* Set
    * all the entries are unique
    * no guaranteed ordering to entries
* Tuple (aka Row)
    * entries are ordered, may be duplicated
    * columns are parts of a row
    * fixed size

In [5]:
customers_table = {
    (3, "Eve", "DE"),
    (1, "Michael", "UK"),
    (2, "Alice", "FR"),
}

In [6]:
customers_table

{(1, 'Michael', 'UK'), (2, 'Alice', 'FR'), (3, 'Eve', 'DE')}
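
A short sketch of what "set of tuples" buys us, assuming the same toy customers as above. The names `customers` and `uk_names` are hypothetical; the set comprehension is only an analogy for a SQL query, not how a database executes one.

```python
customers = {
    (3, "Eve", "DE"),
    (1, "Michael", "UK"),
    (2, "Alice", "FR"),
}

# Adding a row that already exists changes nothing: a set keeps entries unique.
customers.add((1, "Michael", "UK"))
print(len(customers))  # 3

# Roughly: SELECT name FROM customers WHERE country = 'UK'
# Each tuple has a fixed layout, so we can unpack it by position.
uk_names = {name for (cid, name, country) in customers if country == "UK"}
print(uk_names)  # {'Michael'}
```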

## Why do we want to use Relations (ie., Tables)?

* A relational system provides a data model with the minimum properties we need for modelling application data
    * Sets = a group (we don't need order)
    * Tuple = an element (we do need order)

### Why are relational DBs unordered?

* powerful
    * recall: it doesn't **promise** any ordering
    * the database can optimize the storage by reordering
        * you don't need to care
* weakness 
    * can't force an ordering
    * if we know a good ordering, we can't use it
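
As a rough illustration of "no promised ordering", the sketch below stores rows in a Python set and only imposes an order at read time, much like `ORDER BY` in SQL. The variable names `by_id` and `by_name` are assumptions made for this example.

```python
customers = {
    (3, "Eve", "DE"),
    (1, "Michael", "UK"),
    (2, "Alice", "FR"),
}

# The set itself promises no order; we ask for one when we read.
by_id = sorted(customers)  # tuples compare field by field, so this sorts by id
by_name = sorted(customers, key=lambda row: row[1])  # sort by the name field

print(by_id)
print(by_name)
```

The cost is that the sort happens on every read: the storage cannot be told to keep the rows in the order we already know is useful.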

## What are the NoSQL Data Models?

* The relational model can be used for pretty much anything
    * but its lack of order (and other guarantees) may make some queries perform very poorly

### The Relational Model Failing: An Example

In [14]:
people = {
    # staff, manager, age, location
    ("Alice", "Eve", 30, "UK"),
    ("Eve", "Bob", 50, "FR"),
    ("Bob", None, 60, "DE"),
}

In [21]:
alices_manager = None

# SELECT, WHERE, O(N)
for (eid, manager, age, location) in people: 
    if eid == "Alice":
        alices_manager = manager

# SELECT, WHERE, O(N)
for (eid, manager, age, location) in people: 
    if eid == alices_manager:
        print(manager)

Bob


```sql
SELECT manager
FROM people 
WHERE eid = (
    SELECT manager 
    FROM people 
    WHERE eid = 'Alice'
)
```


```sql
-- one extra nested SELECT per level of the hierarchy, and so on...
SELECT manager
FROM people 
WHERE eid = (
    SELECT manager
    FROM people 
    WHERE eid = (
        SELECT manager 
        FROM people 
        WHERE eid = 'Alice'
    )
)
```
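
The nesting above can be sketched in Python to show where the cost comes from. This is a minimal illustration, assuming the hierarchy has no cycles; `find_manager` and `management_chain` are hypothetical helpers, and counting "scans" is a simplification of what a real database would do.

```python
people = {
    # staff, manager, age, location
    ("Alice", "Eve", 30, "UK"),
    ("Eve", "Bob", 50, "FR"),
    ("Bob", None, 60, "DE"),
}

def find_manager(people, name):
    """One SELECT ... WHERE eid = name: may look at every row, O(N)."""
    for (eid, manager, age, location) in people:
        if eid == name:
            return manager
    return None

def management_chain(people, name):
    """Follow manager links upwards; each step costs one more scan."""
    chain = []
    scans = 1
    current = find_manager(people, name)
    while current is not None:
        chain.append(current)
        current = find_manager(people, current)
        scans += 1
    return chain, scans

chain, scans = management_chain(people, "Alice")
print(chain, scans)  # ['Eve', 'Bob'] 3
```

Every level of the hierarchy adds another pass over the table, which is the Python equivalent of adding another nested `SELECT`.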

#### A Possible Solution

In [19]:
graph = {
    'Alice': ['Eve', 'Bob', 'Michael'],
    'Eve': ['Bob'],
    'Bob': []
}

In [20]:
graph['Alice'] # time = constant time, O(1)

['Eve', 'Bob', 'Michael']

###### Denormalization: Store the above results in a useful table

In [17]:
bosses = {
    ("Alice", "Eve", "Bob")
}

## Reflection: How did the `{k: [v1, v2, ...]}` structure help?

* `k` provided an index, we can use `k` (eg., "Alice") to find data *in one operation*
* `v` is a *list* which can change its size 
    * ordered, varying-size, mutable...
    
Having these properties as part of the data model means we can query very efficiently *when they help*. 



* Comparing to a relation:
    * relational DBs **do** maintain indexes
        * they know where data is
    * but because
        * rows are fixed-width (ie., tuples are fixed length)
        * rows are rarely going to be "just next to each other"
    * you will run many selects
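
The point about `k` acting as an index can be sketched by building a dictionary from the relation once and then following links with direct lookups. The names `manager_of` and `management_chain` are hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
people = {
    # staff, manager, age, location
    ("Alice", "Eve", 30, "UK"),
    ("Eve", "Bob", 50, "FR"),
    ("Bob", None, 60, "DE"),
}

# Build the index once: a single pass over the relation.
manager_of = {eid: manager for (eid, manager, age, location) in people}

def management_chain(manager_of, name):
    """Follow manager links using one dictionary lookup per step."""
    chain = []
    current = manager_of.get(name)
    while current is not None:
        chain.append(current)
        current = manager_of.get(current)
    return chain

print(management_chain(manager_of, "Alice"))  # ['Eve', 'Bob']
```

After the one-off build, each step up the hierarchy is a key lookup rather than a scan over every row.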

## What are the NoSQL Data Models?

* Key-Value
* Graph
* Document
* Columnar

### Key-Value Pairs

* Key-Value pairs provide a minimal amount of structure for one piece of data
    * tag + value
* A collection of kv-pairs is then a very loosely structured dataset

In [23]:
kvpair = {
    #  KEY                                            VALUE
    '/uk/suspects/theft/images/2022-01-01/michael.jpg' : 'IMGDATA' 
}

In [30]:
suspects = [
    {'name': 'Michael', 'age': 32, 'height': 1.81},
    {'name': 'Thomas', 'location': 'UK'}, # height: None
    {'location': 'UK', 'age': 19}, # height: None
]

In [27]:
suspects[0]['name']

'Michael'

In [28]:
for s in suspects:
    if 'name' in s:
        print(s['name'])

Michael
Thomas


* We get a semi-structured, maybe "schemaless" benefit
* Enables:
    * storing sparse data
        * where, if you had the same columns for all rows, most would be `NULL`
    * you don't have to commit to a schema
        * eg., many different data sources, different fields, etc.
        
* WARNING
    * if you don't commit to a schema, you tend to make querying much harder
    
* Example:
    * you could use a key-value system as a *staging area* for data
        * a place for data before it's moved into a relational database
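
The staging-area idea can be sketched as follows: loosely structured records are mapped onto a fixed set of columns, with `None` standing in for `NULL`. The target schema in `columns` is an assumption chosen for illustration, not something defined in the material.

```python
suspects = [
    {'name': 'Michael', 'age': 32, 'height': 1.81},
    {'name': 'Thomas', 'location': 'UK'},
    {'location': 'UK', 'age': 19},
]

# Hypothetical target schema for the relational table.
columns = ("name", "age", "height", "location")

# dict.get returns None for a missing key, which plays the role of NULL.
# A list is used (rather than a set) only to keep the printout readable.
rows = [tuple(record.get(col) for col in columns) for record in suspects]

for row in rows:
    print(row)
```

Most of the cells in the third row are `None`, which is the sparse-data situation the bullets above describe.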

## Documents

* document = hierarchical kvpair

In [36]:
credit_file = {
    # profile table
    'profile': {
        'name': 'Alice',
        'age': 30,
    },
    
    # loans table, scores table...
    'loans': [
        {'amount': 1000, 'score': 600},
        {'amount': 2000, 'score': 800},
    ]
}

In one query we get *a lot* of information,

In [33]:
credit_file['profile']['name']

'Alice'

In [35]:
credit_file['loans'][0]['amount']

1000

* Advantage
    * pre-joined, "denormalized"
    * ie., not split up into different tables, all together
* Use case:
    * when queries always want the same info (a lot)
    * eg., credit report
* Disadvantage:
    * if you need multiple different types of documents
        * and to *join those together*
        * then: worse performance than relational
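
To show what "pre-joined" means, here is a minimal sketch that builds a document from two small relations. The tables `profiles` and `loans`, their column layouts, and the helper `build_credit_file` are all assumptions made for this example.

```python
# Hypothetical normalized tables.
profiles = {(1, "Alice", 30)}              # (customer_id, name, age)
loans = {(1, 1000, 600), (1, 2000, 800)}   # (customer_id, amount, score)

def build_credit_file(customer_id, profiles, loans):
    """Join the two relations once and return the result as one document."""
    # Raises StopIteration if the customer does not exist.
    _, name, age = next(p for p in profiles if p[0] == customer_id)
    # Sorting gives the loan list a stable order (by amount, then score).
    customer_loans = sorted(
        (amount, score) for (cid, amount, score) in loans if cid == customer_id
    )
    return {
        "profile": {"name": name, "age": age},
        "loans": [{"amount": a, "score": s} for (a, s) in customer_loans],
    }

credit_file = build_credit_file(1, profiles, loans)
print(credit_file["loans"][0]["amount"])  # 1000
```

The join is paid for once, when the document is written; every later read gets the profile and the loans in a single lookup.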

## Graphs

* lots of ways of representing as a data model
* key idea:
    * links between nodes (aka rows) will determine the data order
    * so data is "in the right place"

In [37]:
graph

{'Alice': ['Eve', 'Bob', 'Michael'], 'Eve': ['Bob'], 'Bob': []}

All the data we need *is in the same place*,

In [39]:
graph['Alice']

['Eve', 'Bob', 'Michael']

## Columnar Stores (aka DataFrames, BigTable)

* Columnar databases are *tabular* 
* Table
    * Set of *Columns*
    * Tuple = Column

In [44]:
columnar = {
    (1, 2, 3, 4), # id col
    ("Alice", "Eve", "Bob", "Michael"), # name col
    ("UK", "FR", "FR", "UK") # location col
}
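
A small sketch of how the same table looks in row form and in column form. Lists are used here instead of sets so that positions stay aligned; the variable names are hypothetical.

```python
# Row-oriented: one tuple per record.
rows = [
    (1, "Alice", "UK"),
    (2, "Eve", "FR"),
    (3, "Bob", "FR"),
    (4, "Michael", "UK"),
]

# zip(*rows) transposes the table: one tuple per column.
columns = list(zip(*rows))
ids, names, locations = columns

print(locations)  # a whole column is one contiguous tuple
print(rows[1])    # a whole row is one contiguous tuple

# Transposing again recovers the original rows.
round_trip = list(zip(*columns))
print(round_trip == rows)  # True
```

Reading one column touches a single tuple in the columnar layout, whereas in the row layout it would mean visiting every row.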

* Relation makes easy
    * adding rows, querying across rows, ...
* Columnar
    * adding columns, querying across columns
* Useful for analytical databases
    * analytical queries usually require all rows of many columns
    * for a report you're bringing together many datasets
        * lots of columns
* Advantage
    * easily compressible

In [91]:
key = {0: "UK", 1: "FR"} # this is invisible when using columnar dbs


compressed_columnar = [
    (1, 2, 3, 4), # id col
    ("Alice", "Eve", "Bob", "Michael"), # name col
    (0, 1, 1, 0) # location col, encoded using `key`
]

query_result = compressed_columnar[2] # + decompress
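
The "+ decompress" step can be sketched with a simple dictionary encoding. This is an illustrative scheme, not the format any particular columnar database uses; `dictionary_encode` and `dictionary_decode` are hypothetical helpers.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code; return (codes, key)."""
    code_of = {}
    codes = []
    for value in column:
        if value not in code_of:
            code_of[value] = len(code_of)  # next unused code
        codes.append(code_of[value])
    # Invert the mapping so codes can be turned back into values.
    key = {code: value for value, code in code_of.items()}
    return tuple(codes), key

def dictionary_decode(codes, key):
    """Look each code up in the key to recover the original column."""
    return tuple(key[code] for code in codes)

locations = ("UK", "FR", "FR", "UK")
codes, key = dictionary_encode(locations)

print(codes)  # (0, 1, 1, 0)
print(key)    # {0: 'UK', 1: 'FR'}
print(dictionary_decode(codes, key))
```

This works well on columns because values of the same type and meaning sit together, so repeats are common.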



## What databases provide these data models?

* Relation
    * Set of Rows
    * Postgres, MySQL, Oracle, DB2, MS SQL Server, ...
* Columnar
    * Set of Columns
    * Snowflake, Delta Lake, Spark/Databricks, ...
* Key-Value Pair
    * Tag + Value
    * Redis, Postgres, ...
* Document
    * Hierarchical Key-Value Pairs
    * MongoDB, Postgres, ...
* Graphs
    * Ordered by their edges
    * Connected nodes are "usefully together"
    * Postgres, Neo4j, ...

## How do data systems work together?

* Redis

```sql
result = SELECT * FROM lotsofdata WHERE ...
```

```python
kv = {
    "12pm/today/SELECT * FROM lotsofdata WHERE": "CACHEOFDATA"
}
```
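
The caching pattern hinted at above can be sketched with a plain dictionary standing in for a key-value store such as Redis. No real Redis API is used here; `slow_query`, `cached_query`, and the `bucket` argument (a time window such as "12pm/today") are hypothetical names for illustration.

```python
cache = {}  # stands in for the key-value store

def slow_query(sql):
    """Hypothetical stand-in for an expensive query against the main database."""
    slow_query.calls += 1
    return f"rows for: {sql}"

slow_query.calls = 0  # count how often the expensive path is taken

def cached_query(sql, bucket):
    """Return a cached result for this time bucket, running the query only on a miss."""
    key = f"{bucket}/{sql}"
    if key not in cache:
        cache[key] = slow_query(sql)
    return cache[key]

first = cached_query("SELECT * FROM lotsofdata", "12pm/today")
second = cached_query("SELECT * FROM lotsofdata", "12pm/today")

print(first == second)   # True
print(slow_query.calls)  # 1
```

Including the time bucket in the key means a new bucket naturally misses the cache, so stale results are not served forever.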

### Eg. Pandas

In [105]:
import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')

We can *easily* subset on columns, 

Asking a columnar system for three columns is about as cheap as asking a relational system for three rows. 

In [106]:
titanic[['sex', 'age']]

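|     | sex    | age  |
|-----|--------|------|
| 0   | male   | 22.0 |
| 1   | female | 38.0 |
| 2   | female | 26.0 |
| 3   | female | 35.0 |
| 4   | male   | 35.0 |
| ... | ...    | ...  |
| 886 | male   | 27.0 |
| 887 | female | 19.0 |
| 888 | female | NaN  |
| 889 | male   | 26.0 |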


In [107]:
titanic.to_csv('relational.csv')

In [108]:
titanic.to_parquet('columnar.pq')

In [112]:
import os

print("the CSV is larger than the Parquet file by a factor of:")
round(
    os.stat('relational.csv').st_size / os.stat('columnar.pq').st_size
)

the CSV is larger than the Parquet file by a factor of:


4

In [110]:
pd.read_parquet('columnar.pq')

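|     | survived | pclass | sex    | age  | sibsp | parch | fare    | embarked | class  | who   | adult_male | deck | embark_town | alive | alone |
|-----|----------|--------|--------|------|-------|-------|---------|----------|--------|-------|------------|------|-------------|-------|-------|
| 0   | 0        | 3      | male   | 22.0 | 1     | 0     | 7.2500  | S        | Third  | man   | True       | NaN  | Southampton | no    | False |
| 1   | 1        | 1      | female | 38.0 | 1     | 0     | 71.2833 | C        | First  | woman | False      | C    | Cherbourg   | yes   | False |
| 2   | 1        | 3      | female | 26.0 | 0     | 0     | 7.9250  | S        | Third  | woman | False      | NaN  | Southampton | yes   | True  |
| 3   | 1        | 1      | female | 35.0 | 1     | 0     | 53.1000 | S        | First  | woman | False      | C    | Southampton | yes   | False |
| 4   | 0        | 3      | male   | 35.0 | 0     | 0     | 8.0500  | S        | Third  | man   | True       | NaN  | Southampton | no    | True  |
| ... | ...      | ...    | ...    | ...  | ...   | ...   | ...     | ...      | ...    | ...   | ...        | ...  | ...         | ...   | ...   |
| 886 | 0        | 2      | male   | 27.0 | 0     | 0     | 13.0000 | S        | Second | man   | True       | NaN  | Southampton | no    | True  |
| 887 | 1        | 1      | female | 19.0 | 0     | 0     | 30.0000 | S        | First  | woman | False      | B    | Southampton | yes   | True  |
| 888 | 0        | 3      | female | NaN  | 1     | 2     | 23.4500 | S        | Third  | woman | False      | NaN  | Southampton | no    | False |
| 889 | 1        | 1      | male   | 26.0 | 0     | 0     | 30.0000 | C        | First  | man   | True       | C    | Cherbourg   | yes   | True  |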


## Polyglot Persistence

* The same data is stored in different data models in different databases
    * each one serves a different kind of query
* This seems a bad idea *if* data copying is manual
    * in this case you're relying on a person to ensure the systems are in sync
* This **requires** automated data management processes
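
One way to picture "automated data management" is a single write path that updates every copy together. The sketch below is a toy under stated assumptions: two in-memory structures stand in for two databases, it handles inserts only (no updates or deletes), and `save_patient` and `in_sync` are hypothetical names.

```python
patients_relation = set()   # relational copy: a set of rows
patients_documents = {}     # document copy: patient_id -> document

def save_patient(patient_id, name, city):
    """Single write path: both copies are updated in the same call."""
    patients_relation.add((patient_id, name, city))
    patients_documents[patient_id] = {"name": name, "city": city}

def in_sync():
    """Rebuild rows from the documents and compare them with the relation."""
    from_documents = {
        (pid, doc["name"], doc["city"]) for pid, doc in patients_documents.items()
    }
    return from_documents == patients_relation

save_patient(1001, "Michael", "London")
save_patient(1002, "Alice", "London")
print(in_sync())  # True
```

If someone wrote to only one of the two structures by hand, `in_sync()` would return `False`, which is the failure mode the bullets above warn about.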

## Exercise: Apply these Data Models (30 min)


* Example Problem:
    * you are given health records of some patients
    * they contain:
        * name, age, hr, bp, etc.
        * medical health history:
            * hospital visits, treatments, surgeries
        * known contacts
            * friends, etc.
* Question 1:
    * on a piece of paper (or word, etc.) sketch an example health record
* Question 2: 
    * based on this dataset
    * define: relation, keyvals, columnar, document
* HINT:
    * relational = rows of fixed no. cols
    * keyval = rows of different cols
    * columnar = lots of cols
    * document = hierarchy/nesting of data
* EXTRA:
    * define a graph
    * HINT: keys are nodes, values are lists of their friends

In [49]:
people # relational

{('Alice', 'Eve', 30, 'UK'), ('Bob', None, 60, 'DE'), ('Eve', 'Bob', 50, 'FR')}

In [51]:
suspects # key-value

[{'name': 'Michael', 'age': 32, 'height': 1.81},
 {'name': 'Thomas', 'location': 'UK'},
 {'location': 'UK', 'age': 19}]

In [52]:
columnar # columnar

{('Alice', 'Eve', 'Bob', 'Michael'), ('UK', 'FR', 'FR', 'UK'), (1, 2, 3, 4)}

In [53]:
credit_file # document

{'profile': {'name': 'Alice', 'age': 30},
 'loans': [{'amount': 1000, 'score': 600}, {'amount': 2000, 'score': 800}]}

---

In [54]:
graph

{'Alice': ['Eve', 'Bob', 'Michael'], 'Eve': ['Bob'], 'Bob': []}

---

## Solution

In [78]:
# live data systems
patients_relation = {
    (1001, "Michael", "London"),
    (1002, "Alice", "London"),
}

visits_keyval = [
    {"patient_id": 1001, "dr": "dr. gloster", "summary": "fever"},
    {"patient_id": 1001, "pills": "5mg happy"},
    {"patient_id": 1001, "dr": "dr. gloster", "summary": "fever"},
    {"patient_id": 1002, "dr": "dr. miggins", "diagnosis": "flu"}
]

# predict-database
predictivedata_documents = [
    
    {"patient_id": 1001, 
     "name": "Michael", 
     "prognosis": [
         {"prescription": "5mg happy", "outcome": "cured"},
         {"prescription": "5mg sad", "outcome": "cured"},
     ],
     "visits": [
        {"dr": "dr. gloster", "summary": "fever"},
        {"pills": "5mg happy"},
        {"patient_id": 1001, "dr": "dr. gloster", "summary": "fever"}
    ]
    },
    
    
    # another 
    {"patient_id": 1002, 
     "name": "Alice", 
     "prognosis": [
         {"prescription": "5mg happy", "outcome": "cured"},
         {"prescription": "5mg sad", "outcome": "cured"},
     ],
     "visits": [
        {"dr": "dr. miggins", "diagnosis": "flu"}
    ]
    },
]
 

# analytical reporting db
reporting_columnar = {
    (1001, 1002),
    ("Michael", "Alice"),
    ("London", "London"),
    ("Dr. G", "Dr. Miggins"),
    ("5mg Happy", None),
    ("fever", None),
    (None, "flu")
    # ...
}


In [83]:
patientcontacts_graph = {
    1001: [1002, 2002, 2001],
    2002: [1004, 1005],
    1005: [3001]
}

Q. Who is within two degrees of separation (ie., 2 links) from 1001?

In [90]:
for firstdeg in patientcontacts_graph[1001]:
    print(firstdeg)
    for secondeg in patientcontacts_graph.get(firstdeg, []):
        print(secondeg)

1002
2002
1004
1005
2001
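
The nested loops above hard-code two levels. A more general sketch uses a breadth-first search with a link limit and a record of visited nodes, so shared contacts are not reported twice. The helper `within_degrees` and its `max_links` parameter are hypothetical names for this example.

```python
from collections import deque

patientcontacts_graph = {
    1001: [1002, 2002, 2001],
    2002: [1004, 1005],
    1005: [3001],
}

def within_degrees(graph, start, max_links):
    """Breadth-first search: map each reachable node to its distance in links."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_links:
            continue  # do not expand beyond the limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:  # skip nodes we have already reached
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the start node is not its own contact
    return distance

print(within_degrees(patientcontacts_graph, 1001, 2))
# {1002: 1, 2002: 1, 2001: 1, 1004: 2, 1005: 2}
```

Raising `max_links` to 3 would also reach 3001, through 2002 and 1005.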
