# SQLite for Data Scientists

#### Produced & Presented by Florents Tselai - [tselai.com](tselai.com)

## 5. ETL with Triggers

An (SQLite) trigger is a named database object that is executed automatically when an INSERT, UPDATE or DELETE statement is issued against the associated table.

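* Triggers let you specify ETL routines inside the database, so the data never has to make a round trip through the client (no extra IO cost).
* They are a simple way to leverage the so-called *pushdown optimization* (send the logic to the data).
* They run inside the same transaction as the statement that fired them, so the transformation inherits SQLite's ACID guarantees.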

They are generally regarded as "obscure" by non-DBAs, but they can be a great tool in a data practitioner's toolbelt.
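To make the transactional point concrete, here is a minimal sketch on an in-memory database. The table and trigger names (`raw`, `audit`, `copy_to_audit`) are made up for illustration and are not part of the workshop database. It shows that when the surrounding transaction fails, the trigger's write is rolled back together with the insert that fired it.

```python
import sqlite3

# In-memory database; all names here are illustrative only.
db = sqlite3.connect(":memory:")
db.executescript("""
    create table raw (value int);
    create table audit (value int);
    create trigger copy_to_audit after insert on raw
    begin
        insert into audit values (new.value);
    end;
""")

# Committed transaction: the insert and the trigger's write both persist.
with db:
    db.execute("insert into raw values (1)")

# Failed transaction: the insert and the trigger's write roll back together.
try:
    with db:
        db.execute("insert into raw values (2)")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

raw_rows = db.execute("select value from raw").fetchall()
audit_rows = db.execute("select value from audit").fetchall()
print(raw_rows, audit_rows)
```

Both tables end up holding only the first value, so the derived table cannot drift out of sync with the raw one.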

In [None]:
import json
import pandas as pd
from gzip import GzipFile

In [None]:
with GzipFile('../data/hn_dump.json.gz', 'r') as fin:
    data = json.loads(fin.read().decode('utf-8'))

In [None]:
data[0]

Let's create a *table* to store a "summarized" view of the data.

In [None]:
from sqlite3 import connect
DB_PATH = '../sqlite-olt.db'

with connect(DB_PATH) as db:
    db.execute("""
        create table if not exists items (
            title      text,
            points     int,
            item_id    text primary key,
            item_url   text,
            created_at timestamp
        );
    """)

Trigger definitions follow a fairly fine-grained syntax:

```
CREATE TRIGGER [IF NOT EXISTS] trigger_name 
   [BEFORE|AFTER|INSTEAD OF] [INSERT|UPDATE|DELETE] 
   ON table_name
   [WHEN condition]
BEGIN
 statements;
END;
```
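As a small illustration of the optional `WHEN` clause, the sketch below copies only rows that satisfy a condition. The tables `posts` and `popular_posts`, the trigger name and the threshold of 100 points are all hypothetical, chosen only to exercise the syntax on an in-memory database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table posts (title text, points int);
    create table popular_posts (title text, points int);

    -- Fires once per inserted row, but only when the WHEN condition holds.
    create trigger if not exists keep_popular
        after insert on posts
        for each row
        when new.points >= 100
    begin
        insert into popular_posts values (new.title, new.points);
    end;
""")

with db:
    db.executemany(
        "insert into posts values (?, ?)",
        [("low", 3), ("high", 250), ("edge", 100)],
    )

popular = db.execute("select title from popular_posts order by title").fetchall()
print(popular)
```

Inside the trigger body, `new` refers to the row being inserted; for `UPDATE` and `DELETE` triggers, `old` refers to the row before the change.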

We want to simply dump the JSON data into the `hn_items_raw` table and have the summarized view (i.e. selected fields) pushed to the `items` table automatically.

In [None]:
with connect(DB_PATH) as db:
    # Recreate the trigger from scratch so the cell can be re-run safely.
    db.execute("drop trigger if exists clean_hn_items_raw;")
    db.execute("""
        create trigger if not exists clean_hn_items_raw
            after insert on hn_items_raw
            for each row
        begin
            insert into items
            values (json_extract(new.data, '$.title'),
                    json_extract(new.data, '$.points'),
                    json_extract(new.data, '$.objectID'),
                    -- single quotes: double quotes denote identifiers in SQL
                    'https://news.ycombinator.com/item?id=' || json_extract(new.data, '$.objectID'),
                    json_extract(new.data, '$.created_at'))
            on conflict do nothing;
        end
    """)

In [None]:
import json
with connect(DB_PATH) as db:
    db.execute("insert into hn_items_raw(data) values (?)", (json.dumps(data[0]),))
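To check that the trigger actually fired, the following self-contained sketch rebuilds the same setup on an in-memory database and inspects `items` afterwards. Two things are assumptions: the schema of `hn_items_raw` (a single `data` text column, since the table is not defined in this section) and the contents of `fake_item`, which only mimics the fields the trigger reads. It also requires an SQLite build with the JSON functions and upsert support.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Assumed schema: the raw table is not defined in this section.
    create table hn_items_raw (data text);
    create table items (
        title text, points int, item_id text primary key,
        item_url text, created_at timestamp
    );
    create trigger clean_hn_items_raw after insert on hn_items_raw
    for each row
    begin
        insert into items
        values (json_extract(new.data, '$.title'),
                json_extract(new.data, '$.points'),
                json_extract(new.data, '$.objectID'),
                'https://news.ycombinator.com/item?id=' || json_extract(new.data, '$.objectID'),
                json_extract(new.data, '$.created_at'))
        on conflict do nothing;
    end;
""")

# Made-up record that only mimics the fields the trigger reads.
fake_item = {"title": "Example", "points": 42, "objectID": "1",
             "created_at": "2020-01-01T00:00:00Z"}

with db:
    # Insert the same document twice: the second copy hits the primary key
    # conflict inside the trigger and is silently skipped.
    for _ in range(2):
        db.execute("insert into hn_items_raw(data) values (?)",
                   (json.dumps(fake_item),))

raw_count = db.execute("select count(*) from hn_items_raw").fetchone()[0]
rows = db.execute("select title, points, item_id, item_url from items").fetchall()
print(raw_count, rows)
```

The raw table keeps both copies while `items` keeps one row per `item_id`, which is the effect of `on conflict do nothing`.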

What about performance?
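One way to approach the question is simply to measure it. The sketch below times a batch insert with and without a trigger on an in-memory database. The helper `time_inserts`, the simplified two-column `items` table, the `hn_items_raw` schema and the generated documents are all assumptions made for the benchmark; the numbers will vary by machine and say nothing about the real dataset.

```python
import json
import sqlite3
import time

def time_inserts(with_trigger, n=5000):
    """Insert n made-up JSON documents; return (elapsed seconds, rows in items)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table hn_items_raw (data text);  -- assumed schema
        create table items (item_id text primary key, title text);
    """)
    if with_trigger:
        db.execute("""
            create trigger fill_items after insert on hn_items_raw
            begin
                insert into items
                values (json_extract(new.data, '$.objectID'),
                        json_extract(new.data, '$.title'))
                on conflict do nothing;
            end
        """)
    docs = [(json.dumps({"objectID": str(i), "title": f"title {i}"}),)
            for i in range(n)]
    start = time.perf_counter()
    with db:  # one transaction for the whole batch
        db.executemany("insert into hn_items_raw(data) values (?)", docs)
    elapsed = time.perf_counter() - start
    n_items = db.execute("select count(*) from items").fetchone()[0]
    db.close()
    return elapsed, n_items

plain_secs, plain_items = time_inserts(with_trigger=False)
trigger_secs, trigger_items = time_inserts(with_trigger=True)
print(f"without trigger: {plain_secs:.4f}s, with trigger: {trigger_secs:.4f}s")
```

A fairer comparison would also time the alternative the trigger replaces, namely extracting the fields in Python and issuing a second insert per row.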

### Question: 

How could we use a trigger to automatically index comments for FTS, as we did in the previous section?

In [None]:
with connect(DB_PATH) as db:
    # Recreate the trigger from scratch so the cell can be re-run safely.
    db.execute("drop trigger if exists do_index_comments_text;")
    db.execute("""
        create trigger if not exists do_index_comments_text
            after insert on hn_items_raw
            for each row
        begin
            insert into comments_fts
            values (json_extract(new.data, '$.objectID'),
                    json_extract(new.data, '$.author'),
                    json_extract(new.data, '$.comment_text'));
        end
    """)

In [None]:
with connect(DB_PATH) as db:
    db.execute("delete from comments_fts")  

In [None]:
with connect(DB_PATH) as db:
    for item in data[:1000]:
        db.execute("insert into hn_items_raw(data) values (?)", (json.dumps(item),))
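To confirm that rows inserted this way become searchable, here is a self-contained sketch on an in-memory database. It assumes `comments_fts` is an FTS5 table with the columns `objectID`, `author` and `comment_text` (inferred from the order of values in the trigger, not confirmed by this section), that `hn_items_raw` has a single `data` column, and that the SQLite build includes FTS5. The two comments are made up.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table hn_items_raw (data text);  -- assumed schema
    -- Assumed FTS5 layout, matching the column order the trigger inserts.
    create virtual table comments_fts using fts5(objectID, author, comment_text);
    create trigger do_index_comments_text after insert on hn_items_raw
    begin
        insert into comments_fts
        values (json_extract(new.data, '$.objectID'),
                json_extract(new.data, '$.author'),
                json_extract(new.data, '$.comment_text'));
    end;
""")

# Made-up comments that mimic the fields the trigger reads.
fake_comments = [
    {"objectID": "1", "author": "alice", "comment_text": "SQLite triggers are handy"},
    {"objectID": "2", "author": "bob", "comment_text": "I prefer plain pandas"},
]
with db:
    db.executemany("insert into hn_items_raw(data) values (?)",
                   [(json.dumps(c),) for c in fake_comments])

# Full-text query against the index that the trigger populated.
hits = db.execute(
    "select objectID, author from comments_fts where comments_fts match ?",
    ("triggers",),
).fetchall()
print(hits)
```

The client only ever writes to `hn_items_raw`; the full-text index is maintained entirely by the trigger.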