**Master SQL Development**

_Procedural Language Structure_

In procedural languages (Python, R, C), you specify **how** to do something. SQL is declarative: you specify **what** result you want.

\- Directly manipulate data structures in an order determined by code

  

_Execution Plans_

\- SQL query processors take declarative statements

\- They create procedural plans

\- These are the set of steps that scan, filter and join data

  

**Scanning Tables and Indexes**

_Scanning Operations - Linear operation_

\- Scanning looks at each row

\- Fetch data block containing row 

\- Apply filter or condition (based on WHERE)

\- Cost of query based on number of rows in the table

  

Most efficient with small tables, inefficient with large tables. 

  

**_Full table scan_ \-** reads every row in the table to find the matching data; inefficient for large tables

_**Indexes**_ \-  Ordered subsets of data in a table  

> \- Efficient to look up rows by value
> 
> \- Faster to search index for attribute value
> 
> \- Points to a location of row
> 
> \- Ex: filter by checking index for match, then retrieve row
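
The difference between a full scan and an index lookup can be sketched in Python. This is a minimal illustration, assuming a hypothetical in-memory table and a sorted list standing in for the index; real indexes live on disk and are more elaborate.

```python
from bisect import bisect_left

# Hypothetical table: (id, salary) rows stored in no particular order.
rows = [(1, 52000), (2, 101768), (3, 96897), (4, 48000), (5, 118497)]

def full_scan(table, salary):
    """Linear scan: check every row against the filter."""
    return [row for row in table if row[1] == salary]

# Hypothetical index: (salary, row position) pairs kept sorted by salary.
index = sorted((row[1], pos) for pos, row in enumerate(rows))

def index_lookup(table, idx, salary):
    """Binary-search the ordered index, then fetch matching rows by position."""
    i = bisect_left(idx, (salary, -1))
    matches = []
    while i < len(idx) and idx[i][0] == salary:
        matches.append(table[idx[i][1]])
        i += 1
    return matches

print(full_scan(rows, 96897))
print(index_lookup(rows, index, 96897))
```

Both calls return the same row; the scan touches every row, while the lookup only touches the index entries around the match.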

  

NOTE: SQL tables are NOT ordered. Row order is only guaranteed when the query has an ORDER BY clause, which sorts the result set.

  

**Index Types:** 

\- Balanced Tree, or B-Tree - for equality and Range queries

\- Hash Indexes, for equality

\- Bitmap, for inclusion

\- Specialized Indexes, for geo-spatial or user-defined indexing strategies

**Joining Tables**

How to match rows:

> \- Primary Key in one table   
> \- Foreign Key in the other table

**3 Ways to Join Tables**  

> 1.  Nested Loop Join
> > \- Compare all rows in both tables to each other
> 
> > \- Loop through one table
> 
> > \- For each row, loop through the other table
> 
> > \- At each step, compare keys
> 
> > \- Simple to implement
> 
> > \- Can be expensive
> 
> 2\. Hash Join
> > \- Calculate the hash value of key and join based on matching hash values
> 
> > \- Compute hash values of key values in smaller table
> 
> > \- Store in hash table, which has hash value and row attributes
> 
> > \- Scan larger table; find rows from smaller hash table
> 
> 3\. Sort Merge Join
> > \- Sort both tables and then join rows while taking advantage of order
> 
> > \- Compare rows like nested loop join, but...
> 
> > \- Stop when it is no longer possible to find a match later in the table due to sort order
> 
> > \- Scan the driving table only once
  
**Partitioning Data**
> \- Large tables are stored as smaller tables, known as partitions
> 
> \- Used to improve query, load, and delete operations
> 
> \- Used for large tables
> 
> \- When subset of data is accessed or changed
> 
> \- Can be expensive
  
Need to use a partition key
> \- Based on time... 
> 
> \- Local indexes can be used per partition
> 
> \- Global indexes across all partitions, for data spread across all partitions

**Using EXPLAIN and ANALYZE** 

EXPLAIN shows the query plan the planner chose, with estimated costs and row counts. EXPLAIN ANALYZE also runs the query and reports the actual row counts and timings.

In [2]:
EXPLAIN ANALYZE SELECT * FROM staff


QUERY PLAN
Seq Scan on staff (cost=0.00..24.00 rows=1000 width=75) (actual time=0.019..0.136 rows=1000 loops=1)
Planning Time: 0.066 ms
Execution Time: 0.183 ms


In [3]:
EXPLAIN ANALYZE SELECT last_name FROM staff

QUERY PLAN
Seq Scan on staff (cost=0.00..24.00 rows=1000 width=7) (actual time=0.010..0.086 rows=1000 loops=1)
Planning Time: 0.041 ms
Execution Time: 0.115 ms


**Query all staff with a salary above 75,000**

In [4]:
SELECT  *
FROM    staff
WHERE   salary > 75000

id,last_name,email,gender,department,start_date,salary,job_title,region_id
3,Carr,fcarr2@woothemes.com,Male,Automotive,2009-07-12,101768,Recruiting Manager,3
4,Murray,jmurray3@gov.uk,Female,Jewelery,2014-12-25,96897,Desktop Support Technician,3
6,Phillips,bphillips5@time.com,Male,Tools,2013-08-21,118497,Executive Secretary,1
8,Harris,aharris7@ucoz.com,Female,Toys,2003-08-12,84427,Safety Technician I,4
9,James,rjames8@prnewswire.com,Male,Jewelery,2005-09-07,108657,Sales Associate,2
10,Sanchez,rsanchez9@cloudflare.com,Male,Movies,2013-03-13,108093,Sales Representative,1
11,Jacobs,jjacobsa@sbwire.com,Female,Jewelery,2003-11-27,121966,Community Outreach Specialist,7
13,Schmidt,sschmidtc@state.gov,Male,Baby,2002-10-13,85227,Compensation Analyst,3
15,Jacobs,ajacobse@google.it,Female,Games,2007-03-04,141139,Community Outreach Specialist,7
16,Medina,smedinaf@amazonaws.com,Female,Baby,2008-03-14,106659,Web Developer III,1


In [5]:
-- Note: the number of rows is actually off by 2, but this is just an estimate
EXPLAIN 
SELECT  *
FROM    staff
WHERE   salary > 75000

QUERY PLAN
Seq Scan on staff (cost=0.00..26.50 rows=715 width=75)
Filter: (salary > 75000)


In [6]:
EXPLAIN 
ANALYZE
SELECT  *
FROM    staff
WHERE   salary > 75000

QUERY PLAN
Seq Scan on staff (cost=0.00..26.50 rows=715 width=75) (actual time=0.014..0.102 rows=717 loops=1)
Filter: (salary > 75000)
Rows Removed by Filter: 283
Planning Time: 0.059 ms
Execution Time: 0.129 ms


**Using Indexes to reduce Query Time**

In [7]:
CREATE INDEX idx_staff_salary ON staff(salary)

In [11]:
EXPLAIN ANALYZE SELECT * FROM staff WHERE salary > 75000

QUERY PLAN
Seq Scan on staff (cost=0.00..26.50 rows=715 width=75) (actual time=0.015..0.131 rows=717 loops=1)
Filter: (salary > 75000)
Rows Removed by Filter: 283
Planning Time: 0.093 ms
Execution Time: 0.161 ms


In [12]:
EXPLAIN ANALYZE SELECT * FROM staff WHERE salary > 150000

QUERY PLAN
Index Scan using idx_staff_salary on staff (cost=0.28..8.29 rows=1 width=75) (actual time=0.002..0.003 rows=0 loops=1)
Index Cond: (salary > 150000)
Planning Time: 0.097 ms
Execution Time: 0.012 ms


Note: for salary > 75000, the planner determined it was faster to scan the whole table than to use the index, because the filter matches 717 of the 1,000 rows.

When filtering for salaries above 150000, it used the index instead: the planner estimated 1 matching row out of 1,000 (the query actually returned 0 rows).

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

**Types of Indexes**

Purpose of Indexes

> \- Speed up access to data  
> \- Help enforce constraints
> 
> \- Indexes are ordered
> 
> \- Typically smaller than tables

  

Reading Data (approximate, order-of-magnitude latencies): 

> \- From Memory:    100 ns

> \- From SSD:           1,000,000 ns   (1 ms)
> 
> \- From HDD:          20,000,000 ns (20 ms)

  

Implementing Indexes: 

> \- Data structure is separate from table
> 
> \- Sometimes duplicates some data, for example, keys
> 
> \- Organized differently from tables (ordered, whereas tables are not ordered)

  

Indexes duplicate some of the table's data, but they speed up data access.

  

**\- B-Trees**  
**\- Bitmap**  
**\- Hash**  

**\- Special Purpose**

**\-------------------------------------------------**

**B-Tree Indexes (Balanced Tree)**

> \- Values less than a node's key go to the left, values greater go to the right
> 
> \- Splitting continues level by level until every value is placed; leaf entries point to the table rows

> \- Most common type of index

> \- Used when a column has a large number of possible values (high cardinality)

> \- Rebalances as needed to keep the tree balanced

> \- Time to access is based on the depth of the tree (log of number of nodes in the tree)
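
To see why depth grows so slowly, here is a rough Python sketch that counts the levels needed to hold a given number of keys. It assumes an idealized tree in which every node is full; the fanout values are illustrative only, since real fanout depends on key and page size.

```python
def btree_levels(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels

# A binary tree (fanout 2) versus a wider node (hypothetical fanout of 100).
for keys in (1_000, 1_000_000):
    print(keys, btree_levels(keys, 2), btree_levels(keys, 100))
```

A thousand times more keys adds only one level at a fanout of 100, which is why lookups stay cheap as the table grows.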

In [13]:
SELECT  * 
FROM    staff
WHERE   email = 'bphillips5@time.com'

id,last_name,email,gender,department,start_date,salary,job_title,region_id
6,Phillips,bphillips5@time.com,Male,Tools,2013-08-21,118497,Executive Secretary,1


In [14]:
-- Used 26.5 computational units
EXPLAIN
SELECT  * 
FROM    staff
WHERE   email = 'bphillips5@time.com'

QUERY PLAN
Seq Scan on staff (cost=0.00..26.50 rows=1 width=75)
Filter: ((email)::text = 'bphillips5@time.com'::text)


In [17]:
-- Used 8.29 computational units with the index

-- B-Tree is the default index in postgres, so no need to specify index type
CREATE INDEX idx_staff_email ON staff(email);

EXPLAIN
SELECT  * 
FROM    staff
WHERE   email = 'bphillips5@time.com'

QUERY PLAN
Index Scan using idx_staff_email on staff (cost=0.28..8.29 rows=1 width=75)
Index Cond: ((email)::text = 'bphillips5@time.com'::text)


**\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_**

**Bitmap Index**

Good for low cardinality problems

  

> \- Stores one bit per row for each distinct value in a column: yes/no = 2 bits per row, ranks from 1-5 = 5 bits per row

> \- Can quickly perform Boolean Operations

> \- Used when small number of possible values in a column (low cardinality)

> \- Filter by bitwise operations, such as AND, OR, NOT

> \- Time to access is based on time to perform bitwise operations (longer)

> \- Read-intensive use cases, few writes

>   

  

> \- Some dbs allow you to create bitmap indexes explicitly
> 
> \- Postgres does not
> 
> \- Postgres builds bitmaps on the fly from existing indexes as needed
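
A small Python sketch of the bitmap idea, using plain integers as bitmaps. The column values are hypothetical and this is only a conceptual model, not how any particular database stores its bitmaps.

```python
# Hypothetical low-cardinality columns for six rows (row numbers 0..5).
gender = ["Male", "Female", "Male", "Female", "Female", "Male"]
region = [1, 1, 2, 2, 1, 2]

def build_bitmaps(column):
    """One integer bitmap per distinct value; bit i is set when row i has that value."""
    bitmaps = {}
    for row_number, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_number)
    return bitmaps

def rows_from_bitmap(bitmap):
    """Translate the set bits back into row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

gender_bm = build_bitmaps(gender)
region_bm = build_bitmaps(region)

# WHERE gender = 'Female' AND region = 1 becomes a single bitwise AND.
both = gender_bm["Female"] & region_bm[1]
# WHERE gender = 'Female' OR region = 1 becomes a bitwise OR.
either = gender_bm["Female"] | region_bm[1]

print(rows_from_bitmap(both))
print(rows_from_bitmap(either))
```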

In [5]:
-- Find all employees with title operator

SELECT      *
FROM        staff
WHERE       job_title = 'Operator'

id,last_name,email,gender,department,start_date,salary,job_title,region_id
71,Vasquez,evasquez1y@behance.net,Male,Baby,2002-10-20,77285,Operator,6
113,Moore,kmoore34@shareasale.com,Male,Baby,2000-03-01,54413,Operator,5
151,Larson,blarson46@newsvine.com,Male,Books,2011-08-09,50066,Operator,1
242,Robinson,probinson6p@ucla.edu,Male,Health,2003-11-30,137594,Operator,6
257,Freeman,gfreeman74@bloomberg.com,Female,Home,2012-06-05,83804,Operator,1
371,Sims,bsimsaa@privacy.gov.au,Male,Sports,2000-06-04,127223,Operator,5
673,Thomas,lthomasio@pagesperso-orange.fr,Male,Health,2014-10-27,51782,Operator,6
719,Wagner,ewagnerjy@jalbum.net,Male,Beauty,2013-01-04,135445,Operator,2
807,Gibson,hgibsonme@ox.ac.uk,Male,Industrial,2005-12-15,148816,Operator,4
847,Knight,dknightni@unc.edu,Female,Clothing,2008-03-22,91532,Operator,4


In [15]:
-- Now add indexing 
CREATE INDEX idx_staff_job_title ON staff(job_title);

SELECT  *
FROM    staff
WHERE   job_title = 'Operator'



id,last_name,email,gender,department,start_date,salary,job_title,region_id
71,Vasquez,evasquez1y@behance.net,Male,Baby,2002-10-20,77285,Operator,6
113,Moore,kmoore34@shareasale.com,Male,Baby,2000-03-01,54413,Operator,5
151,Larson,blarson46@newsvine.com,Male,Books,2011-08-09,50066,Operator,1
242,Robinson,probinson6p@ucla.edu,Male,Health,2003-11-30,137594,Operator,6
257,Freeman,gfreeman74@bloomberg.com,Female,Home,2012-06-05,83804,Operator,1
371,Sims,bsimsaa@privacy.gov.au,Male,Sports,2000-06-04,127223,Operator,5
673,Thomas,lthomasio@pagesperso-orange.fr,Male,Health,2014-10-27,51782,Operator,6
719,Wagner,ewagnerjy@jalbum.net,Male,Beauty,2013-01-04,135445,Operator,2
807,Gibson,hgibsonme@ox.ac.uk,Male,Industrial,2005-12-15,148816,Operator,4
847,Knight,dknightni@unc.edu,Female,Clothing,2008-03-22,91532,Operator,4


In [16]:
EXPLAIN SELECT  *
FROM    staff
WHERE   job_title = 'Operator'

QUERY PLAN
Bitmap Heap Scan on staff (cost=4.36..18.36 rows=11 width=75)
Recheck Cond: ((job_title)::text = 'Operator'::text)
-> Bitmap Index Scan on idx_staff_job_title (cost=0.00..4.36 rows=11 width=0)
Index Cond: ((job_title)::text = 'Operator'::text)


Postgres automatically builds a bitmap from idx\_staff\_job\_title in this case (the Bitmap Index Scan step in the plan)

  

\----------------

  

**HASH FUNCTIONS**

> \- Function for mapping arbitrary length data to a fixed-size string
> 
> \- Hash values are virtually unique
> 
> \- Even slight changes in input produce a new hash
> 
> \- Size of hash value depends on algorithm used
> 
> \- Even changing caps or adding a single character can significantly change the hash value (looks like a product key code)
> 
> \- NO order preserving with hash functions

> 1.  Hash indexes test for equality only, when '=' is used; they cannot test ranges of values
> 
> 2\. Latest versions of postgres have improved hash indexes
> 
> 3\. Build and lookup times are comparable to B-tree; the advantage is size, since a smaller index may fit in memory, which is fastest
> 
>
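
These properties can be checked with a short Python sketch. SHA-256 from the standard library is used purely as a stand-in hash function (it is not what Postgres uses inside its hash indexes), and the dictionary-based "index" is a hypothetical illustration of equality-only lookups.

```python
import hashlib

def sha256_hex(text):
    """Map a string of any length to a fixed-size hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = sha256_hex("bphillips5@time.com")
b = sha256_hex("Bphillips5@time.com")   # only the first letter's case differs
print(len(a), len(b), a == b)

# Hypothetical hash "index": digest -> positions of rows with that email.
emails = ["fcarr2@woothemes.com", "jmurray3@gov.uk", "bphillips5@time.com"]
hash_index = {}
for pos, email in enumerate(emails):
    hash_index.setdefault(sha256_hex(email), []).append(pos)

def lookup(email):
    """Equality lookup: hash the search value, then recheck the stored value."""
    candidates = hash_index.get(sha256_hex(email), [])
    return [pos for pos in candidates if emails[pos] == email]

print(lookup("bphillips5@time.com"))
```

Because the digests carry no ordering information, this structure cannot answer a range condition such as email > 'b'; only exact matches work.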

In [18]:
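-- Create hash index for emails
-- (drop the B-tree index created earlier first, since the name is reused)
DROP INDEX IF EXISTS idx_staff_email;
CREATE INDEX idx_staff_email ON staff USING HASH (email);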

In [19]:
--See if it worked
EXPLAIN
SELECT  *
FROM    staff
WHERE   email = 'bphillips5@time.com' 

QUERY PLAN
Index Scan using idx_staff_email on staff (cost=0.00..8.02 rows=1 width=75)
Index Cond: ((email)::text = 'bphillips5@time.com'::text)


**SPECIALIZED POSTGRES INDEXES**

**\- GIST**

**\- SP-GIST**

**\- GIN**

**\- BRIN**

**\----**

**GIST**

> \- Generalized Search Tree
> 
> \- Not a single type of index
> 
> \- Framework for implementing custom indexes

**SP-GIST**

> \- Space-partitioned GIST
> 
> \- Supports partitioned search trees
> 
> \- Used for nonbalanced data structures
> 
> \- Partitions do not have to be same size

**GIN**

> \- Generalized Inverted Index, used for text indexing
> 
> \- Lookups are faster than GIST
> 
> \- Builds are slower than GIST
> 
> \- Indexes are 2-3 times larger than GIST

**BRIN**

> \- Block Range Indexing
> 
> \- Used for large data sets
> 
> \- Divides data into ordered blocks
> 
> \- Keeps min and max values
> 
> \- Search only blocks that may have a match
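
A rough Python sketch of the block-range idea. The block size and data are hypothetical, and the values are assumed to be stored in roughly ascending order (as timestamps often are), which is what makes the min/max summaries selective.

```python
def build_brin(values, block_size):
    """Summarize each block of consecutive values by its (start, min, max)."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append((start, min(block), max(block)))
    return summaries

def brin_search(values, summaries, block_size, target):
    """Read only the blocks whose [min, max] range could contain the target."""
    hits, blocks_read = [], 0
    for start, lowest, highest in summaries:
        if lowest <= target <= highest:
            blocks_read += 1
            for offset, value in enumerate(values[start:start + block_size]):
                if value == target:
                    hits.append(start + offset)
    return hits, blocks_read

values = list(range(0, 1000, 2))      # 500 ordered values: 0, 2, 4, ...
summaries = build_brin(values, 100)   # 5 blocks of 100 values each
hits, blocks_read = brin_search(values, summaries, 100, 642)
print(hits, blocks_read)
```

Only one of the five blocks is read here; on unordered data the ranges would overlap and most blocks would have to be read.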

**Type of Joins**

> \- INNER JOIN

> > \- Return rows from both tables that have corresponding row in other table
> > 
> > \- Performed when joining in WHERE clause
> 
> \- LEFT OUTER JOIN

> > \- Returns all rows from the left table, plus rows from the right table with matching keys
> 
> > \- Left rows with no match get NULLs in the right table's columns
> 
> \- RIGHT OUTER JOIN

> > \- Mirror image of the left
> 
> \- FULL OUTER JOIN

> > \- Returns all rows from both tables
> 
> > \- When there are no matches, Nulls will replace the missing values
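
The difference between the join types can be sketched in Python with two tiny hypothetical tables, using None in place of NULL. The column layouts are assumptions for illustration.

```python
staff = [(1, "Carr", 3), (2, "Murray", 9)]       # (id, last_name, region_id)
regions = [(3, "Northwest"), (7, "Quebec")]      # (region_id, region_name)

def inner_join(left, right):
    """Keep only pairs whose keys match."""
    return [(s[1], r[1]) for s in left for r in right if s[2] == r[0]]

def left_outer_join(left, right):
    """Keep every left row; unmatched ones get None for the right side."""
    result = []
    for s in left:
        matches = [r[1] for r in right if s[2] == r[0]]
        result.extend((s[1], m) for m in (matches or [None]))
    return result

def full_outer_join(left, right):
    """Left outer join plus the right rows that no left row matched."""
    result = left_outer_join(left, right)
    left_keys = {s[2] for s in left}
    result.extend((None, r[1]) for r in right if r[0] not in left_keys)
    return result

print(inner_join(staff, regions))
print(left_outer_join(staff, regions))
print(full_outer_join(staff, regions))
```

A right outer join is the same as left_outer_join with the two tables swapped.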

**Nested Loop Joins**

> **\-** Two loops  

> > \- Outer loop iterates over one table, the driver table  
> > \- Inner loop iterates over other table, the join table  
> > \- Outer loop runs once  
> > \- Inner loop runs once for each row in the driver table

> \- When to use them: 

> > \- Works with all join conditions
> 
> > \- Low overhead
> 
> > \- Work well with small tables

> \- Limitations: 

> > \- Can be slow
> 
> > \- If tables do not fit in memory, even slower performance
> 
> > \- Indexes can improve the performance of nested loop joins, especially covering indexes

>
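
A minimal Python sketch of a nested loop join. The tables and the comparison counter are hypothetical additions to make the cost visible; it is a conceptual model, not how Postgres implements the operator.

```python
def nested_loop_join(driver, join_table, condition):
    """Compare every driver row with every join-table row."""
    comparisons = 0
    result = []
    for outer_row in driver:              # outer loop: one pass over the driver table
        for inner_row in join_table:      # inner loop: restarts for every driver row
            comparisons += 1
            if condition(outer_row, inner_row):
                result.append((outer_row, inner_row))
    return result, comparisons

staff = [(1, "Carr", 3), (2, "Murray", 3), (3, "Phillips", 1)]   # (id, last_name, region_id)
regions = [(1, "Northeast"), (3, "Northwest")]                    # (region_id, region_name)

joined, comparisons = nested_loop_join(staff, regions, lambda s, r: s[2] == r[0])
print(len(joined), comparisons)
```

The number of comparisons is rows(driver) x rows(join table), which is why the approach suits small tables. Because the condition is an arbitrary function, it also works for non-equality joins.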

In [21]:
EXPLAIN
SELECT  s.id, 
        s.last_name, 
        s.job_title, 
        cr.country
FROM    staff s
        INNER JOIN
        company_regions cr ON s.region_id = cr.region_id

QUERY PLAN
Hash Join (cost=22.38..49.02 rows=1000 width=88)
Hash Cond: (s.region_id = cr.region_id)
-> Seq Scan on staff s (cost=0.00..24.00 rows=1000 width=34)
-> Hash (cost=15.50..15.50 rows=550 width=62)
-> Seq Scan on company_regions cr (cost=0.00..15.50 rows=550 width=62)


Note: the planner determined that a Hash Join was the optimal way to run the query

  

To perform a nested loop...

In [22]:
SET enable_nestloop = true;
SET enable_hashjoin = false;
SET enable_mergejoin = false;

EXPLAIN
SELECT  s.id, 
        s.last_name, 
        s.job_title, 
        cr.country
FROM    staff s
        INNER JOIN
        company_regions cr ON s.region_id = cr.region_id

QUERY PLAN
Nested Loop (cost=0.16..50.69 rows=1000 width=88)
-> Seq Scan on staff s (cost=0.00..24.00 rows=1000 width=34)
-> Memoize (cost=0.16..0.23 rows=1 width=62)
Cache Key: s.region_id
Cache Mode: logical
-> Index Scan using company_regions_pkey on company_regions cr (cost=0.15..0.22 rows=1 width=62)
Index Cond: (region_id = s.region_id)


**HASH JOINS**

- Function for mapping arbitrary length data to a value that can act as an index to an array
- Hash values are virtually unique
- Even slight changes in input create a new hash

Create a Hash Table  

- Uses the smaller of the two tables
- Compute hash value of primary key value
- Stores in the table

Probe Hash Table

- Step through the large table
- Compute hash value of primary or foreign key value
- Lookup corresponding value in hash table
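
The build and probe phases can be sketched in Python. A dictionary plays the role of the hash table (Python hashes the keys internally); the tables and key functions are hypothetical.

```python
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    """Equality join: build a hash table on the small table, probe with the large one."""
    # Build phase: key -> rows of the smaller table.
    table = defaultdict(list)
    for row in small:
        table[small_key(row)].append(row)
    # Probe phase: one pass over the larger table, one lookup per row.
    result = []
    for row in large:
        for match in table.get(large_key(row), []):
            result.append((row, match))
    return result

regions = [(1, "Northeast"), (3, "Northwest"), (7, "Quebec")]     # smaller table
staff = [(1, "Carr", 3), (2, "Murray", 3), (3, "Phillips", 1)]    # larger table

joined = hash_join(regions, staff, lambda r: r[0], lambda s: s[2])
print(joined)
```

Each table is read once, so the cost grows with rows(small) + rows(large) rather than their product, but only equality conditions can be evaluated this way.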

In [24]:
SET enable_nestloop = false;
SET enable_hashjoin = true;
SET enable_mergejoin = false;

EXPLAIN
SELECT  s.id, 
        s.last_name, 
        s.job_title, 
        cr.country
FROM    staff s
        INNER JOIN
        company_regions cr ON s.region_id = cr.region_id

QUERY PLAN
Hash Join (cost=22.38..49.02 rows=1000 width=88)
Hash Cond: (s.region_id = cr.region_id)
-> Seq Scan on staff s (cost=0.00..24.00 rows=1000 width=34)
-> Hash (cost=15.50..15.50 rows=550 width=62)
-> Seq Scan on company_regions cr (cost=0.00..15.50 rows=550 width=62)


**MERGE JOINS / SORTING**

- Merge join is also known as sort merge
- First step is sorting tables
- Takes advantage of ordering to reduce the # of rows checked

1. Equality only 
2. Time is based on table size -- Time to sort and time to scan
3. Large table joins = works well when tables do not fit in memory
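
A minimal Python sketch of a sort merge join on an equality key. The tables are hypothetical, and both inputs are sorted in memory here for simplicity, whereas a database may sort on disk or read from an already ordered index.

```python
def merge_join(left, right, key_left, key_right):
    """Sort both inputs on the join key, then advance two cursors in step."""
    left = sorted(left, key=key_left)
    right = sorted(right, key=key_right)
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key_left(left[i]), key_right(right[j])
        if lk < rk:
            i += 1          # left key too small: no later right row can match it
        elif lk > rk:
            j += 1          # right key too small: skip it for good
        else:
            # Emit every right row sharing this key, then advance the left side.
            k = j
            while k < len(right) and key_right(right[k]) == lk:
                result.append((left[i], right[k]))
                k += 1
            i += 1
    return result

staff = [(1, "Carr", 3), (2, "Murray", 3), (3, "Phillips", 1)]    # (id, last_name, region_id)
regions = [(7, "Quebec"), (3, "Northwest"), (1, "Northeast")]     # (region_id, region_name)

joined = merge_join(staff, regions, lambda s: s[2], lambda r: r[0])
print([(s[1], r[1]) for s, r in joined])
```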

In [25]:
SET enable_nestloop = false;
SET enable_hashjoin = false;
SET enable_mergejoin = true;

EXPLAIN
SELECT  s.id, 
        s.last_name, 
        s.job_title, 
        cr.country
FROM    staff s
        INNER JOIN
        company_regions cr ON s.region_id = cr.region_id

QUERY PLAN
Merge Join (cost=114.36..132.11 rows=1000 width=88)
Merge Cond: (s.region_id = cr.region_id)
-> Sort (cost=73.83..76.33 rows=1000 width=34)
Sort Key: s.region_id
-> Seq Scan on staff s (cost=0.00..24.00 rows=1000 width=34)
-> Sort (cost=40.53..41.91 rows=550 width=62)
Sort Key: cr.region_id
-> Seq Scan on company_regions cr (cost=0.00..15.50 rows=550 width=62)


**SUBQUERIES VS JOINS**

**Subqueries -** return values from a related table

  

```
SELECT s.id, s.last_name, s.department, 
( SELECT cr.company_regions
  FROM company_regions cr
  WHERE cr.region_id = s.region_id) region_name

FROM staff s
```

**Using Joins**

```
SELECT s.id, s.last_name, s.department, cr.company_regions region_name
FROM  company_regions cr
      INNER JOIN 
      staff s
ON cr.region_id = s.region_id
```

Both methods work well, and newer query planners have gotten better at handling subqueries. 

Aim to optimize clarity - what makes the intention clear in the query?

  

If there is a performance difference, note it.

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

**PARTITIONING**

**Horizontal Partitioning** 

- Large tables can be difficult to query efficiently
- Split tables by rows into partitions
- Treat each partition as a table

Benefits of Horizontal Partitions

- Limits the scans to the subset of partitions
- Local indexes can be created for each partition
- Efficient adding and deleting

Where used:

1. Data Warehouses
    - Partition on time
    - Query on time
    - Delete by time
2. Timeseries
    - Most likely to query the latest data
    - Summarize data in older partitions
3. Naturally partitioned data
    - Retailer by geography
    - Data science, by product category

**Vertical Partitioning**

- Implement as separate tables
- No partitioning-specific definitions are required
- Separates columns into multiple tables
- Keep frequently queried columns together
- Use the same primary key in all tables

Benefits

- Increases the number of rows in each block
- Global indexes for each partition
- Can reduce I/O
- Columnar data storage offers similar benefits

Where Used?

1. Data warehouses - partitioning on groups
2. Wide variety of products, each with different attributes (very wide tables)
3. Data analytics - Stats on subsets of attributes; after factor analysis

**Range Partitioning**

- Type of horizontal partitioning
- Partition on non-overlapping keys, like dates, numeric, or alphabetic ranges

1. Partition Key - determines which partition is used for the data
2. Partition Bounds - minimum and maximum values allowed in the partition
3. Constraints - each partition can have its own indexes, constraints, and defaults

When to use?

- Query the latest data
- Comparative queries, for ex. same time last year? 
- Report within range, for ex. numeric identifier range
- Drop data after period of time (keep data for 3 years, then drop oldest monthly as new is added)

In [33]:
CREATE TABLE iot_measurement
(   location_id int not null, 
    measure_date date not null, 
    temp_celcius int, 
    rel_humidity_pct int)
PARTITION BY RANGE (measure_date);

CREATE TABLE iot_measurement_wk1_2019 PARTITION OF iot_measurement
FOR VALUES FROM ('2019-01-01') TO ('2019-01-08');

CREATE TABLE iot_measurement_wk2_2019 PARTITION OF iot_measurement
FOR VALUES FROM ('2019-01-08') TO ('2019-01-15');

CREATE TABLE iot_measurement_wk3_2019 PARTITION OF iot_measurement
FOR VALUES FROM ('2019-01-15') TO ('2019-01-22');
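
How a row finds its partition can be sketched in Python. The bounds are copied from the statements above; in Postgres the FROM bound is inclusive and the TO bound is exclusive, so the ranges do not overlap. The route function itself is a hypothetical illustration.

```python
from datetime import date

# (partition name, inclusive lower bound, exclusive upper bound)
partitions = [
    ("iot_measurement_wk1_2019", date(2019, 1, 1), date(2019, 1, 8)),
    ("iot_measurement_wk2_2019", date(2019, 1, 8), date(2019, 1, 15)),
    ("iot_measurement_wk3_2019", date(2019, 1, 15), date(2019, 1, 22)),
]

def route(measure_date):
    """Return the partition whose half-open range [lower, upper) holds the date."""
    for name, lower, upper in partitions:
        if lower <= measure_date < upper:
            return name
    return None   # no partition accepts this value

print(route(date(2019, 1, 7)))
print(route(date(2019, 1, 8)))
print(route(date(2019, 2, 1)))
```

The same check explains partition pruning: a query filtered on measure_date only needs to scan the partitions whose range overlaps the filter.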


**Partition By List**

- Type of horizontal partitioning
- Partition on non-overlapping keys
- Defined by a list of values

1. Partition Key - determines which partition is used for data
2. Partition Bounds - list of values for a partition
3. Constraints - Each partition can have its own indexes, constraints, and defaults

Use when: 

- Data is logically grouped into subgroups
- Often query within subgroups
- Data not time oriented enough to warrant range partitions by time

In [32]:
CREATE TABLE products 
( prod_id int not null, 
  prod_name text not null, 
  prod_short_desc text not null, 
  prod_long_desc text not NULL, 
  prod_category varchar not null)
PARTITION BY LIST (prod_category);

CREATE TABLE product_clothing PARTITION OF products 
    FOR VALUES IN ('casual_clothing', 'business_attire', 'formal_clothing');
    
CREATE TABLE product_electronics PARTITION OF products 
    FOR VALUES IN ('mobile_phones', 'tablets', 'computers');

    
CREATE TABLE product_kitchen PARTITION OF products 
    FOR VALUES IN ('appliances', 'dishware', 'silverware');


**Hash Partitioning**

- Another type of horizontal partitioning
- Partitions on the modulus of hash on a partition key

1. Partition Key - determines which partition is used, but not used directly
2. Modulus - the number of partitions; the hash of the key is divided by it, and the remainder selects the partition, giving roughly equal-sized partitions
3. Available in postgres, oracle, and mysql

When to use?

- Data does NOT logically group into subgroups
- Want even distributions across partitions
- No need for subgroup-specific operations, such as drop a partition

In [34]:
CREATE TABLE customer_interations
(   ci_id int not null, 
    ci_url text not null, 
    time_at_url int not NULL, 
    click_sequence int not null)
PARTITION BY HASH (ci_id);

CREATE TABLE customer_interations_1 PARTITION OF customer_interations
    FOR VALUES WITH (MODULUS 5, REMAINDER 0);

CREATE TABLE customer_interations_2 PARTITION OF customer_interations
    FOR VALUES WITH (MODULUS 5, REMAINDER 1);


CREATE TABLE customer_interations_3 PARTITION OF customer_interations
    FOR VALUES WITH (MODULUS 5, REMAINDER 2);
    
CREATE TABLE customer_interations_4 PARTITION OF customer_interations
    FOR VALUES WITH (MODULUS 5, REMAINDER 3);
    
CREATE TABLE customer_interations_5 PARTITION OF customer_interations
    FOR VALUES WITH (MODULUS 5, REMAINDER 4);
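
The modulus/remainder routing can be sketched in Python. CRC32 from the standard library is only a stand-in hash (Postgres uses its own internal hash function, so the actual row placement will differ); the partition names follow the statements above.

```python
import zlib
from collections import Counter

MODULUS = 5   # number of partitions, as in the statements above

def partition_for(ci_id):
    """Hash the key, divide by the modulus, and let the remainder pick the partition."""
    remainder = zlib.crc32(str(ci_id).encode("utf-8")) % MODULUS
    # REMAINDER 0 maps to customer_interations_1, REMAINDER 1 to _2, and so on.
    return f"customer_interations_{remainder + 1}"

# Distribute 10,000 hypothetical ids and count rows per partition.
counts = Counter(partition_for(ci_id) for ci_id in range(10_000))
for name in sorted(counts):
    print(name, counts[name])
```

The key is never compared directly, which is why hash partitions cannot be pruned for range conditions on the key.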

**MATERIALIZED VIEWS**

- Store results of precomputed queries
- Join and store results
- Apply other operations without performing the expensive query

1. Duplicate data - this must be stored/cached
    - Trading space for time
2. Updates
    - Updates to sources require updates to the materialized views
    - It's a snapshot of the query as of when the view was created or last refreshed
3. Potential Inconsistencies
    - REFRESH MATERIALIZED VIEW command

When to use a materialized view instead of a view: 

- Time is more important than storage space
- Can tolerate some inconsistencies
- Or can refresh after each update to sources

In [1]:
CREATE MATERIALIZED VIEW mv_staff AS 
    SELECT 
        s.last_name, 
        s.department, 
        s.job_title, 
        cr.company_regions
    FROM 
        staff s
        INNER JOIN 
        company_regions cr
        ON
            s.region_id = cr.region_id
    

The materialized view is now saved in the db as its own object, listed under materialized views rather than tables

Indexes can be created on materialized views, so queries on MVs can be optimized too
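
The snapshot behaviour can be sketched in Python. The MaterializedView class and the data are hypothetical; the sketch only illustrates that results are stored at creation time and go stale until a refresh re-runs the full query.

```python
class MaterializedView:
    """Stores the result of a query instead of re-running it on every read."""

    def __init__(self, query):
        self.query = query
        self.rows = query()        # computed once and stored

    def refresh(self):
        self.rows = self.query()   # re-run the full query and replace the snapshot

staff = [("Carr", 3), ("Murray", 3)]    # (last_name, region_id)
regions = {3: "Northwest"}              # region_id -> region name

def staff_with_region():
    return [(name, regions[region_id]) for name, region_id in staff]

mv = MaterializedView(staff_with_region)
staff.append(("Phillips", 3))           # the source table changes...
stale = list(mv.rows)                   # ...but the stored snapshot does not
mv.refresh()
print(len(stale), len(mv.rows))
```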

In [3]:
SELECT *
FROM mv_staff;

-- To update the mv, need to use refresh command

REFRESH MATERIALIZED VIEW mv_staff;

-- This will rebuild the ENTIRE table, utilizing the FULL query 

last_name,department,job_title,company_regions
Kelley,Computers,Structural Engineer,Southeast
Armstrong,Sports,Financial Advisor,Southeast
Carr,Automotive,Recruiting Manager,Northwest
Murray,Jewelery,Desktop Support Technician,Northwest
Ellis,Grocery,Software Engineer III,Nova Scotia
Phillips,Tools,Executive Secretary,Northeast
Williamson,Computers,Dental Hygienist,Quebec
Harris,Toys,Safety Technician I,Southwest
James,Jewelery,Sales Associate,Southeast
Sanchez,Movies,Sales Representative,Northeast


**Other Optimization Techniques**

**What Schemas contain...**

- Tables 
- Indexes
- Constraints
- Views
- Materialized Views
- Statistics about distribution of data in tables

1. Size of the data
    - Number of rows
    - How much storage is used? 
2. Frequency Data
    - Fraction of nulls
    - Number of distinct values
    - Frequent values
3. Distribution
    - Histogram
    - Spread of data
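
A rough Python sketch of the kind of per-column statistics listed above. The function, the sample data, and the bucket count are hypothetical; the dictionary keys are chosen to resemble columns of the pg\_stats view in Postgres, but the calculation is a simplification and assumes a non-empty column with at least one non-null value.

```python
from collections import Counter

def column_stats(values, num_buckets=4):
    """Summarize one column: nulls, distinct values, frequent values, histogram."""
    non_null = sorted(v for v in values if v is not None)
    counts = Counter(non_null)
    # Equal-height histogram: bounds picked so each bucket holds a similar number of rows.
    last = len(non_null) - 1
    bounds = [non_null[i * last // num_buckets] for i in range(num_buckets + 1)]
    return {
        "null_frac": (len(values) - len(non_null)) / len(values),
        "n_distinct": len(counts),
        "most_common": counts.most_common(1),
        "histogram_bounds": bounds,
    }

salaries = [50000, 60000, 60000, None, 75000, 90000, 120000, None, 60000, 101000]
stats = column_stats(salaries)
print(stats)
```

The planner uses figures like these to estimate how many rows a filter will return, which drives choices such as sequential scan versus index scan.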

**Skew**

- Is the distribution normal, with left (neg) or right (pos) skew?

**ANALYZE**

In postgres, ANALYZE will 

- Collect statistics on columns, tables, or schemas
- Not human readable
- Run automatically by AUTOVACUUM daemon or manually

**VACUUM**

- Reclaims space left behind by updated or deleted rows
- VACUUM 
    - reclaims space
- VACUUM(FULL)\[tablename\]
    - Locks tables and reclaims more space
- VACUUM(FULL, ANALYZE)\[tablename\]
    - performs full vacuum and collects statistics 

**REINDEX**

- Rebuilds corrupt indexes
- Shouldn't be needed, but if there are bugs... 
- Cleans up unused pages in B-tree indexes
- REINDEX INDEX \[indexname\]
    - Reindexes that index
- REINDEX TABLE \[tablename\]
    - Reindexes all indexes in a table
- REINDEX SCHEMA \[schemaname\]
    - Reindexes all indexes in a schema

**HINTS TO QUERY OPTIMIZER**

- Suggest optimizations
- Some databases accept hints
- Extra-SQL statements suggesting methods
- Pushing boundary between declarative and procedural code

**_Inline hints are supported by_:** 

- Oracle
- EnterpriseDB (based on Postgres)
- MySQL
- SQL Server

  

**_Postgres Uses Parameters_**

- SET command
- SET enable\_hashjoin = off
- SET enable\_nestloop = on

\* When using SSDs, try setting random\_page\_cost and seq\_page\_cost to the same value

  

  

**Some Caveats... (before using Hints...)**

1. Analyze and Vacuum
2. Try other optimization techniques first
3. Verify query plan is consistently suboptimal
4. Watch for changes in amount or distribution of data

  

**PARALLEL EXECUTION**

- Query optimizer may determine all or part of a query can be run in parallel
- Executes part of a plan in parallel
- Then gathers results
- It is running in parallel if there is a GATHER or GATHER MERGE in the query plan
- All steps below the GATHER/GATHER MERGE are executed in parallel
- Number of parallel processes limited by _max\_parallel\_workers_ and _max\_worker\_processes_ parameters (postgres specific)
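
The scan-in-parallel-then-gather pattern can be sketched in Python. Threads are used here only to keep the example self-contained, whereas Postgres uses separate worker processes; the function name and worker count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_count(rows, predicate, workers=2):
    """Split the rows among workers, count matches in each, then gather the partials."""
    chunks = [rows[i::workers] for i in range(workers)]

    def scan(chunk):
        # Each worker scans and filters only its own share of the rows.
        return sum(1 for row in chunk if predicate(row))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan, chunks))
    return sum(partials)   # the "gather" step combines the partial results

salaries = list(range(1000))
print(parallel_count(salaries, lambda s: s > 750, workers=4))
```

Splitting and gathering have a fixed overhead, so for small inputs like this one a plain single scan would be at least as fast.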

In order to have a Parallel Query

- max\_parallel\_workers\_per\_gather must be \> 0
- dynamic\_shared\_memory\_type must NOT be 'none'
- Database must not be running in 'single-user' mode
- Query does not WRITE data or LOCK rows
- Does not use a function marked PARALLEL UNSAFE (user-defined functions)

**Parallel queries may be less efficient**

- Parallel plans carry extra cost (parallel\_setup\_cost and parallel\_tuple\_cost parameters); the estimated savings must exceed it before a parallel plan is chosen
- Parallel index scans only supported for B-tree indexes
- Inner side of nested loop and merge join is nonparallel
- Hash joins are parallel but each process creates a copy of the hash table (consider this in using large tables)

**INDEXING**

- Create indexes on join columns, same for columns used in WHERE clauses
- Use covering indexes
- Don't filter on a column using equality to NULL (Null is not a value), use IS NULL
- Don't use functions in WHERE clauses unless you have a functional index
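
The NULL rule can be demonstrated with Python's built-in sqlite3 module. SQLite is used here only because it needs no server; the table is a hypothetical cut-down version of staff, and the same comparison semantics apply in Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER, department TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [(1, "Tools"), (2, None), (3, None)],   # None is stored as NULL
)

# "= NULL" is never true, so this filter matches no rows at all.
equals_null = conn.execute(
    "SELECT id FROM staff WHERE department = NULL"
).fetchall()

# IS NULL is the correct test for missing values.
is_null = conn.execute(
    "SELECT id FROM staff WHERE department IS NULL ORDER BY id"
).fetchall()

print(equals_null, is_null)
conn.close()
```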

**INDEX RANGE SCAN**

- If a plan uses an index range scan, keep the range as SMALL as possible
- Use equality with conditions
- Be careful with LIKE
    - LIKE 'ABC%' can be used and indexed
    - LIKE '%ABC' cannot 
- Use indexes to avoid sorts with ORDER BY
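
Why a leading wildcard defeats the index can be sketched in Python, with a sorted list standing in for the index. The function names are hypothetical, a non-empty prefix is assumed, and the ordering is plain character-code order, which may differ from a database's collation.

```python
from bisect import bisect_left

def prefix_range_scan(sorted_values, prefix):
    """LIKE 'prefix%' as a range scan: prefix <= value < next string after the prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # e.g. 'ABC' -> 'ABD'
    low = bisect_left(sorted_values, prefix)
    high = bisect_left(sorted_values, upper)
    return sorted_values[low:high]                   # one contiguous slice of the index

def suffix_scan(values, suffix):
    """LIKE '%suffix': matches are scattered, so every value must be checked."""
    return [v for v in values if v.endswith(suffix)]

names = sorted(["ABA", "ABC", "ABCD", "ABCZ", "ABD", "XABC"])
print(prefix_range_scan(names, "ABC"))
print(suffix_scan(names, "ABC"))
```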

**FILTERING and DATA TYPES**

- When filtering on a range condition, especially dates, use continuous conditions
    - TRUNC(sysdate)
    - TRUNC(sysdate+1)
- Don't separate date and time into separate columns; use a datetime datatype
- Don't store numeric as char, varchar, or text - change to numeric