**Types of Data Anomalies:** 

\- Update Anomaly: redundant data is only partially updated (e.g., the address was updated on only some of the author's books)

\- Insertion Anomaly: can't insert data because other data required in the row is missing (e.g., we didn't have the author's address when adding a new book, so we can't add the book)

\- Deletion Anomaly: unintentionally losing data because other data was deleted (e.g., we deleted the books and now the author has disappeared, even though we want to keep them on file)

**Avoiding Anomalies: Normalization Rules**

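First Normal Form - Values in each column of a table must be atomic

Second Normal Form - All attributes not part of the key depend on the whole key (no dependency on only part of a composite key)

Third Normal Form - No transitive dependencies (non-key attributes do not depend on other non-key attributes)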

\----

Normalized tables in a schema should show their relationships to other tables: 1:1, 1:N, ...
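
A minimal Postgres sketch of how normalization avoids the three anomalies, using the author/book example from above. The table and column names are hypothetical, chosen only for illustration.

```sql
-- Hypothetical schema: the author's address is stored once, not on every book
CREATE TABLE author (
    author_id int PRIMARY KEY,
    name      text NOT NULL,
    address   text              -- may be NULL, so an author can be added without it
);

-- 1:N relationship: one author, many books
CREATE TABLE book (
    book_id   int PRIMARY KEY,
    title     text NOT NULL,
    author_id int NOT NULL REFERENCES author (author_id)
);

-- No update anomaly: the address changes in exactly one row
UPDATE author SET address = '12 New Street' WHERE author_id = 1;

-- No deletion anomaly: deleting a book leaves the author row on file
DELETE FROM book WHERE book_id = 10;
```

Because `address` is nullable and lives in `author`, a new book can be inserted for a known author even when the address is missing, which avoids the insertion anomaly.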

**OLTP - Online Transaction Processing Systems**

\- many reads/writes

\- many processes

\- often normalized to third normal form 

**Analytical DBs**

\- data analysis

\- many reads by many processes

\- many writes by few processes 

\- Often Denormalized

**Denormalization** 

Can significantly improve performance, mainly by eliminating the need for complex joins

Characteristics:

\- Data is often redundant

\- May contain non-atomic values

\- Tolerates transitive dependencies

\- The risk of anomalies is still reduced in analytical workloads due to:

> \- few updates
> 
> \- batch inserts; data transformed before the inserts
> 
> \- streaming inserts; simple data structures

**Forms of Denormalized DBs**

\- Star Schema: Fact and Dimension tables

> \- Row-level orientation
> 
> \- Columnar data stores

\- Wide Column: everything in one really wide table

> \- The opposite of Tidy Data
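
A small sketch of a star schema in Postgres, assuming hypothetical `dim_hotel`, `dim_date`, and `fact_expense` tables (none of these names come from the notes). The fact table holds the measures and points at the dimension tables.

```sql
-- Dimension tables: descriptive attributes
CREATE TABLE dim_hotel (
    hotel_key int PRIMARY KEY,
    hotel     text,
    city      text,
    country   text
);

CREATE TABLE dim_date (
    date_key  int PRIMARY KEY,
    full_date date,
    cal_year  int,
    cal_month int
);

-- Fact table: numeric measures plus a key into each dimension
CREATE TABLE fact_expense (
    hotel_key int REFERENCES dim_hotel (hotel_key),
    date_key  int REFERENCES dim_date (date_key),
    payroll   numeric,
    supplies  numeric
);

-- Typical analytical query: aggregate facts, grouped by dimension attributes
SELECT h.country, d.cal_year, sum(f.payroll) AS total_payroll
FROM   fact_expense f
       JOIN dim_hotel h ON h.hotel_key = f.hotel_key
       JOIN dim_date  d ON d.date_key  = f.date_key
GROUP  BY h.country, d.cal_year;
```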

**Partitioning**

**Vertical Partitioning**

\- Separating by column

\- With fewer columns per row, more rows fit in each data block

\- Global indexes for each partition

\- Can reduce I/O

  

**Horizontal Partitioning**  

\- Separation by subsets of rows (the most common form of partitioning)

\- Limits scans to a subset of partitions by chunks of rows, same columns

\- Local indexes for each partition (smaller indexes)

\- Efficient adding and deleting 

  

**Range Partitioning** - a type of horizontal partitioning

\- Partition on non-overlapping key ranges (e.g., by month)

\- Partitioning by date is common

\- Numeric ranges are often used

\- Alphabetic, global region... 

\- _List partitioning, a related scheme, partitions on a value or list of values_

  

**Hash Partitioning**

\- Partition on the modulus of the hash of the partition key

\- The partition for a row is chosen by the remainder of that modulus

\- Does not logically group into subgroups

\- _Want an even distribution of data across partitions_
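
A minimal sketch of hash partitioning in Postgres, assuming a hypothetical `sensor_reading` table split into four partitions. Each partition takes the rows whose hashed key leaves a given remainder when divided by the modulus.

```sql
-- Hypothetical table; names and columns are illustrative only
CREATE TABLE sensor_reading (
    sensor_id int NOT NULL,
    reading   numeric
) PARTITION BY HASH (sensor_id);

-- One partition per possible remainder of hash(sensor_id) modulo 4
CREATE TABLE sensor_reading_p0 PARTITION OF sensor_reading
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE sensor_reading_p1 PARTITION OF sensor_reading
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE sensor_reading_p2 PARTITION OF sensor_reading
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE sensor_reading_p3 PARTITION OF sensor_reading
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Check how evenly the rows are spread across the partitions
SELECT tableoid::regclass AS part_name, count(*) AS n_rows
FROM   sensor_reading
GROUP  BY 1
ORDER  BY 1;
```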

In [1]:
-- Postgres: create a partitioned table, then create its partitions

-- The schema must exist before a table can be created in it
CREATE SCHEMA IF NOT EXISTS iot;

CREATE TABLE iot.sensor_msmt (
    sensor_id int NOT NULL, 
    msmt_date date NOT NULL, 
    temperature int, 
    humidity int)
    PARTITION BY RANGE (msmt_date);

-- The table is partitioned by date; now we have to create the partitions.
-- The TO bound is exclusive, so each month ends on the first day of the next month.
CREATE TABLE io_sensor_msmt_y2021m01 PARTITION OF iot.sensor_msmt
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');

CREATE TABLE io_sensor_msmt_y2021m02 PARTITION OF iot.sensor_msmt
    FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');

**Materialized Views**

Persisted results of a query - a form of caching

  

\- Execute the query once

\- Save results once

\- Read many times

\- Trading space for time

  

**When to use materialized views?** 

\- Long-running queries

\- Complex queries, many joins

\- Computing aggregates or other derived data

\- Separate read and write operations

  

**When NOT to use Materialized Views?**

\- Eventual Consistency: you may not have the latest data in an updating system

\- Cost of update process

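\- Concurrent reads during refresh? (In Postgres a plain REFRESH blocks reads; REFRESH MATERIALIZED VIEW CONCURRENTLY allows them but requires a unique index on the view)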

\- Size of materialized view

\- Refresh Frequency

In [None]:
-- Create a materialized view: defined by a query like a WITH clause, but the results are stored

CREATE MATERIALIZED VIEW landon.mv_locations_expenses AS 
(
SELECT  l.hotel, 
        l.city, 
        l.state_province, 
        l.country, 
        e.year, 
        e.annual_payroll, 
        e.health_insurance, 
        e.supplies
FROM    
    landon.locations l
    LEFT JOIN
    landon.expenses e 
    ON (l.hotel_id = e.hotel_id)
);

SELECT * FROM landon.mv_locations_expenses;

-- If the underlying tables are updated, refresh the view to pick up the changes

REFRESH MATERIALIZED VIEW landon.mv_locations_expenses;

**Read Replicas**

Without replicas, the primary server is responsible for all read and write queries at the same time

A **read replica** is another instance of the database server; data written to the primary is also written to it

  

With a read replica, only WRITEs go to the primary server

The primary replicates its WRITEs to the read replica

All READ queries access the read replica  

\- Primary can focus on the writes

\- Multiple replicas can scale to meet read load

\- Especially useful when more read than write workload

\- Need to consider eventual consistency
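
A short sketch for checking eventual consistency, assuming Postgres streaming replication. Both queries are meant to be run on the replica; the reported delay is only approximate and can look large when the primary is idle.

```sql
-- Returns true when this server is a replica (standby) rather than the primary
SELECT pg_is_in_recovery();

-- Approximate time since the last replayed transaction from the primary
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
```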

**CHALLENGE**

An IoT company collects streaming data from thousands of sensors every minute

Data scientists will perform time series analyses, including many aggregations of data by sensor over hours and days

\- Low Latency is essential

\- Access to all data older than one hour

  

Design 

\- High-level model

\- What kind of structures

\- What kind of design patterns

\-------------------------------------------------------------------

  Solution: 

\- Sensor data written to table with the attributes for id and measures

\- Partition by time

\- Use Read Replicas to not bog down the server

\- Use Materialized Views for the aggregates, refreshed once per hour for the hourly aggregates
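
One possible sketch of this solution in Postgres. The table `sensor_msmt_ts`, the view `mv_sensor_hourly`, and the daily partition size are assumptions made for illustration, not part of the challenge statement.

```sql
-- Hypothetical measurement table, partitioned by time
CREATE TABLE sensor_msmt_ts (
    sensor_id   int NOT NULL,
    msmt_time   timestamptz NOT NULL,
    temperature int,
    humidity    int
) PARTITION BY RANGE (msmt_time);

-- One partition per day (upper bound is exclusive)
CREATE TABLE sensor_msmt_ts_2021_01_01 PARTITION OF sensor_msmt_ts
    FOR VALUES FROM ('2021-01-01') TO ('2021-01-02');

-- Hourly aggregates per sensor, persisted as a materialized view
CREATE MATERIALIZED VIEW mv_sensor_hourly AS
SELECT sensor_id,
       date_trunc('hour', msmt_time) AS msmt_hour,
       avg(temperature)              AS avg_temperature,
       avg(humidity)                 AS avg_humidity,
       count(*)                      AS n_readings
FROM   sensor_msmt_ts
GROUP  BY sensor_id, date_trunc('hour', msmt_time);

-- A unique index is required before the view can be refreshed concurrently
CREATE UNIQUE INDEX ON mv_sensor_hourly (sensor_id, msmt_hour);

-- Run once per hour (e.g., from a scheduler); readers are not blocked
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sensor_hourly;
```

Running the analytical queries against a read replica would keep this load off the primary that receives the streaming inserts.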

**B-Tree Indexing**
*Indexing for Analytical Queries*

- Reduces need to scan data blocks
- Comes at cost of additional writes during data loading
- The higher the cardinality of indexed data, the better the performance improvement
- Some analytical databases, such as Google BigQuery, do not use B-tree indexes

---
Types of Indexes: 
- B-tree
- Bitmap
- Hash
- Special Purpose

## B-Tree (Balanced Tree)
- Capture small amounts of data
- (Workhorse of indexing)
- Work well in many different cases
- Ability to look up values in logarithmic time
- Provides good order and lookup time
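
A minimal B-tree example in Postgres, assuming a hypothetical `invoice` table. B-tree is the default index type, so `USING btree` is written out only to make the choice explicit.

```sql
-- Hypothetical table for illustration
CREATE TABLE invoice (
    invoice_id   int PRIMARY KEY,
    invoice_date date NOT NULL,
    amount       numeric
);

-- B-tree index: supports equality, range, and ordered access
CREATE INDEX idx_invoice_date ON invoice USING btree (invoice_date);

-- Inspect the plan for a range query that can use the index
EXPLAIN
SELECT *
FROM   invoice
WHERE  invoice_date BETWEEN '2021-01-01' AND '2021-01-31'
ORDER  BY invoice_date;
```

Whether the planner actually picks the index depends on table size and statistics, so the plan is worth checking on real data.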

## Bitmap Indexes
- Used when there are a small number of possible values in a column (low cardinality)
- Filter by Bitwise operations such as AND, OR, NOT 
- Time to access is based on time to perform the Bitwise operations (fast)
- Read-intensive use cases with few writes (data warehouse/data science applications)

- Some databases allow you to explicitly create bitmap indexes (Postgres does not)
- Postgres builds bitmap indexes on the fly as needed
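
A sketch of how the on-the-fly bitmaps can show up in Postgres, assuming a hypothetical `staff` table with two low-cardinality columns. With enough data, the planner may combine the two indexes with a `BitmapAnd` node; on a small or empty table it may choose a sequential scan instead.

```sql
-- Hypothetical table with low-cardinality columns
CREATE TABLE staff (
    staff_id        int PRIMARY KEY,
    is_union_member boolean,
    pay_type        text      -- e.g., 'Hourly' or 'Salaried'
);

-- Ordinary indexes; Postgres has no CREATE BITMAP INDEX
CREATE INDEX idx_staff_union ON staff (is_union_member);
CREATE INDEX idx_staff_pay   ON staff (pay_type);

-- Look for Bitmap Index Scan / BitmapAnd nodes in the plan
EXPLAIN
SELECT *
FROM   staff
WHERE  is_union_member = true
  AND  pay_type = 'Hourly';
```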

## Hash Indexes
- Built on hash functions, which map arbitrary-length data to a fixed-size value
- Hash values are virtually unique
- Even slight changes in the input produce a different hash
- Support equality comparisons only
- Can be smaller than B-Tree indexes
- Build and access speed comparable to B-Tree
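
A brief illustration in Postgres. The `md5` call only demonstrates the general behavior of a hash function (it is not the function Postgres uses inside hash indexes), and the `claim_detail` table is hypothetical.

```sql
-- Two inputs that differ by one character give unrelated fixed-size hashes
SELECT md5('claim-1001') AS hash_1, md5('claim-1002') AS hash_2;

-- Hypothetical table with a hash index on the lookup key
CREATE TABLE claim_detail (
    claim_id int NOT NULL,
    detail   text
);

CREATE INDEX idx_claim_detail_hash ON claim_detail USING hash (claim_id);

-- Equality lookups can use the hash index; range predicates cannot
SELECT * FROM claim_detail WHERE claim_id = 1001;
```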

## GiST - Generalized Search Tree 
- Balanced, tree-structured access method
- Used as template to implement other indexing schemes
    - B-Tree -- self-balancing tree; operations in log time
    - R-Tree -- index of multidimensional data such as geographical coordinates
- Used in Postgres for indexing: 
    - hstore
    - ltree

--- Operator Classes and Indexed Data Types
- box_ops - box
- circle_ops - circle
- inet_ops - inet, cidr
- point_ops - point
- poly_ops - polygon
- range_ops - any range type
- tsquery_ops - tsquery (text queries)
- tsvector_ops - tsvector, a sorted list of distinct lexemes
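
A minimal GiST example in Postgres, assuming a hypothetical `room_booking` table with a `daterange` column. The default operator class for range types is `range_ops`, so it does not need to be named.

```sql
-- Hypothetical table storing a date range per booking
CREATE TABLE room_booking (
    room_id int NOT NULL,
    stay    daterange NOT NULL
);

-- GiST index on the range column
CREATE INDEX idx_booking_stay ON room_booking USING gist (stay);

-- Find bookings that overlap a given period (&& is the overlap operator)
SELECT *
FROM   room_booking
WHERE  stay && daterange('2021-01-10', '2021-01-15');
```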

## SP-GiST - Space Partitioned GiST
- Supports partitioned search trees
- Useful for non-balanced data structures
    - quadtree - tree with internal nodes having 4 children
    - k-d tree -- k-dimensional tree, used to index points in k-dimensional space
- Develop custom indexes

--- Operator Classes
- kd_point_ops
- quad_point_ops
- range_ops -- any range type
- box_ops
- poly_ops
- text_ops
- inet_ops -- inet, cidr
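
A short SP-GiST sketch in Postgres, assuming a hypothetical `sensor_location` table. For `point` columns the default operator class is `quad_point_ops` (quadtree); `kd_point_ops` selects a k-d tree instead.

```sql
-- Hypothetical table of sensor coordinates
CREATE TABLE sensor_location (
    sensor_id int PRIMARY KEY,
    pos       point NOT NULL
);

-- Quadtree index (default operator class for point)
CREATE INDEX idx_sensor_pos_quad ON sensor_location USING spgist (pos);

-- k-d tree index on the same column, chosen via the operator class
CREATE INDEX idx_sensor_pos_kd ON sensor_location USING spgist (pos kd_point_ops);

-- Points contained in a bounding box
SELECT sensor_id
FROM   sensor_location
WHERE  pos <@ box '((0,0),(10,10))';
```

In practice only one of the two indexes would be kept; both appear here to show the syntax.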

## GIN Index
- Generalized Inverted Index
- Used when data to be indexed are composite values
- Composite values require index to search for elements within composite item
- EX:  Words in a document (each word is an individual element that can be indexed)

Index stores data in pairs (key, posting list)
- A key is an element value
- Posting list is a set of row IDs in which the key occurs

- The operator class, which depends on the type of data indexed, is chosen when creating a GIN index

Built-In Operator Classes
- array_ops - any array
- jsonb_ops - jsonb
- jsonb_path_ops - jsonb
- tsvector_ops - tsvector

Tips 
- Insertion can be slow because many keys may be inserted for each item (many words in a document)
- For very large bulk operations, likely faster to drop and recreate index
- Postgres can postpone much of the indexing work by using temporary lists
    - Temp lists are eventually inserted into the index using optimized bulk insertion techniques
    - Disadvantage: the temp list must ALSO be searched in addition to the regular index
    - Large temp lists will slow searches significantly
    - Tradeoff: slow search vs slow load
    - Turn off the **fastupdate** parameter in **CREATE INDEX** to disable temporary lists
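
A minimal GIN sketch in Postgres, assuming a hypothetical `doc` table with a text body and a `jsonb` column. The first index turns off `fastupdate`, trading slower inserts for searches that never have to scan a temporary list.

```sql
-- Hypothetical table of documents
CREATE TABLE doc (
    doc_id int PRIMARY KEY,
    body   text,
    attrs  jsonb
);

-- Full-text index: one key per lexeme, with temporary lists disabled
CREATE INDEX idx_doc_body_fts ON doc
    USING gin (to_tsvector('english', body))
    WITH (fastupdate = off);

-- jsonb index using the jsonb_path_ops operator class
CREATE INDEX idx_doc_attrs ON doc USING gin (attrs jsonb_path_ops);

-- Documents containing both words
SELECT doc_id
FROM   doc
WHERE  to_tsvector('english', body) @@ to_tsquery('english', 'partition & index');

-- Documents whose attributes contain the given key/value pair
SELECT doc_id
FROM   doc
WHERE  attrs @> '{"status": "draft"}';
```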

## BRIN Index
- Block Range Index
- Used with very large tables 
- Column data correlates with the physical order of rows in the table (postal code, coordinates)

- Block ranges are pages that are physically adjacent in a table
- BRIN indexes store summary info about block ranges
- BRIN indexes tend to be small
    - Entries are for entire block ranges, not indiv elements
    - Quickly scan, skip large segments of a table when searching (min/max)

Operator Classes: 
- date_minmax_ops
- char_minmax_ops
- float8_minmax_ops
- timestamp_minmax_ops
- uuid_minmax_ops
- Many more ...
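
A small BRIN sketch in Postgres, assuming a hypothetical append-only `event_log` table where rows arrive roughly in `event_time` order. The `pages_per_range` value of 64 is an arbitrary illustration; the default is 128.

```sql
-- Hypothetical append-only log table
CREATE TABLE event_log (
    event_id   bigint NOT NULL,
    event_time timestamptz NOT NULL,
    payload    text
);

-- BRIN index: stores min/max of event_time per range of 64 pages
CREATE INDEX idx_event_time_brin ON event_log
    USING brin (event_time)
    WITH (pages_per_range = 64);

-- Block ranges whose min/max fall outside the filter can be skipped
SELECT count(*)
FROM   event_log
WHERE  event_time >= '2021-01-01'
  AND  event_time <  '2021-02-01';
```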



---
**CHALLENGE:**
 - You have a very large dataset of insurance claim details, and you want to ingest data into existing db 
 - Each claim has a unique id and 12 columns of data about the claim
 - Existing db has a table of all claim numbers ever generated

 How would you index the new claim detail data to optimize a join operation on the claim ID? 
 - B-Tree
 - Hash

 SOLUTION: 
 - Because the join is an equality match on claim ID and each claim ID appears in one row in each table, use a Hash Index - it stores a 32-bit hash of the ID
