# SQLite for Data Scientists

#### Produced & Presented by Florents Tselai - [tselai.com](tselai.com)

# 3. Working with JSON Data

In [35]:
import json
import pandas as pd
from gzip import GzipFile

In [36]:
with GzipFile('../data/hn_dump.json.gz', 'r') as fin:
    data = json.loads(fin.read().decode('utf-8'))

In [37]:
len(data)

100954

In [38]:
data[100]

{'created_at': '2012-04-03T21:25:57.000Z',
 'title': '',
 'url': '',
 'author': 'crag',
 'points': 15,
 'story_text': None,
 'comment_text': 'Let me add another database that\'s "underestimated" (by mainstream corporate America): SQLite3.<p>SQLite is fast, small, portable, easy &#38; simple to maintain and backup, AND reliable. And unless you are running a high traffic site (or application) it could handle everything a small (even medium) business would need.<p>Why small companies get talked into running MSQL or Oracle or MySQL is beyond me. And even if (and that\'s a big IF) they needed more "power", there\'s Postgres.<p>PS: Sorry for hijacking this thread. I\'m a big fan boy of both SQLite and Postgres.',
 'num_comments': None,
 'story_id': 3793973,
 'story_title': 'Postgres 9.2 will feature linear read scalability up to 64 cores',
 'story_url': 'http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html',
 'parent_id': 3794160,
 'created_at_i': 1333488357,
 'relevancy_sc

In [39]:
df = pd.DataFrame(data).set_index('objectID')

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100954 entries, 4616844 to 2323113
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   created_at        100954 non-null  object 
 1   title             92443 non-null   object 
 2   url               86420 non-null   object 
 3   author            100954 non-null  object 
 4   points            99609 non-null   float64
 5   story_text        36261 non-null   object 
 6   comment_text      21000 non-null   object 
 7   num_comments      79954 non-null   float64
 8   story_id          21003 non-null   float64
 9   story_title       20977 non-null   object 
 10  story_url         20524 non-null   object 
 11  parent_id         21000 non-null   float64
 12  created_at_i      100954 non-null  int64  
 13  relevancy_score   87631 non-null   float64
 14  _tags             100954 non-null  object 
 15  _highlightResult  100954 non-null  object 
dtypes: float64(5), int

In [41]:
from sqlite3 import connect

In [42]:
DB_PATH = '../sqlite-olt.db'

In [43]:
with connect(DB_PATH) as db:
    db.execute("create table if not exists hn_items_raw(data)")

### ~Dumping~ Writing Schemaless Data to a Relational Database

In [44]:
COUNT_ITEMS=1000

In [45]:
with connect(DB_PATH) as db:
    db.execute("DELETE FROM hn_items_raw")

### 1st Way

In [46]:
%%timeit

for item in data[:COUNT_ITEMS]:
    with connect(DB_PATH) as db:
        db.execute("insert into hn_items_raw(data) values (?)", (json.dumps(item),))

1.03 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Clearing the DB to re-run the experiment

In [47]:
with connect(DB_PATH) as db:
    db.execute("DELETE FROM hn_items_raw")

### 2nd Way

In [48]:
%%timeit
with connect(DB_PATH) as db:
    for item in data[:COUNT_ITEMS]:
        db.execute("insert into hn_items_raw(data) values (?)", (json.dumps(item),))

85 ms ± 44.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
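Most of the gap between the first two runs likely comes from commits rather than from the inserts themselves. A `sqlite3` connection used as a context manager commits when the block exits, so the 1st way opens a connection and commits once per row, while the 2nd way commits once for the whole loop. The sketch below reproduces both patterns on a throwaway database. The file name `demo.db`, the made-up `items` and the helper names are assumptions for illustration, and the printed timings will vary by machine.

```python
import json
import os
import sqlite3
import tempfile
import time
from contextlib import closing

# Made-up stand-in for the HN items (assumption, not the real dump).
items = [{"author": f"user_{i}", "points": i} for i in range(200)]
INSERT_SQL = "insert into hn_items_raw(data) values (?)"


def insert_commit_per_row(path, rows):
    """1st way: a new connection and one commit for every row."""
    for row in rows:
        with closing(sqlite3.connect(path)) as db:
            with db:  # commits when this block exits
                db.execute(INSERT_SQL, (json.dumps(row),))


def insert_single_commit(path, rows):
    """2nd way: one connection and one commit for all rows."""
    with closing(sqlite3.connect(path)) as db:
        with db:
            for row in rows:
                db.execute(INSERT_SQL, (json.dumps(row),))


def count_rows(path):
    with closing(sqlite3.connect(path)) as db:
        return db.execute("select count(*) from hn_items_raw").fetchone()[0]


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.db")
    with closing(sqlite3.connect(demo_path)) as db:
        with db:
            db.execute("create table hn_items_raw(data)")

    start = time.perf_counter()
    insert_commit_per_row(demo_path, items)
    per_row_seconds = time.perf_counter() - start
    rows_after_first = count_rows(demo_path)

    start = time.perf_counter()
    insert_single_commit(demo_path, items)
    single_commit_seconds = time.perf_counter() - start
    rows_after_second = count_rows(demo_path)

print(f"commit per row: {per_row_seconds:.3f}s, single commit: {single_commit_seconds:.3f}s")
```

Both helpers write the same rows; only the number of connections and commits differs.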


In [49]:
with connect(DB_PATH) as db:
    db.execute("DELETE FROM hn_items_raw")

### 3rd Way

In [50]:
%%timeit
with connect(DB_PATH) as db:
    db.executemany("insert into hn_items_raw(data) values (?)", 
                   [(json.dumps(item),) for item in data[:COUNT_ITEMS]]
                  )

The slowest run took 4.69 times longer than the fastest. This could mean that an intermediate result is being cached.
110 ms ± 72.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [51]:
with connect(DB_PATH) as db:
    db.execute("DELETE FROM hn_items_raw")

### It's usually smart to write data in batches

In [52]:
def make_chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

In [53]:
for chunk in make_chunks(data, 1000):
    with connect(DB_PATH) as db:
        db.executemany("insert into hn_items_raw(data) values (?)", 
                       [(json.dumps(item),) for item in chunk]
                      )
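
`make_chunks` relies on `len` and slicing, so it needs the whole list in memory. When the items arrive as a stream, a generator built on `itertools.islice` gives the same batching. A minimal sketch, assuming a hypothetical helper `iter_chunks` and an in-memory database with made-up rows:

```python
import json
import sqlite3
from contextlib import closing
from itertools import islice


def iter_chunks(iterable, n):
    """Yield lists of up to n items from any iterable (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, n))
        if not chunk:
            return
        yield chunk


# Made-up stream of items; a generator, so len() and slicing are unavailable.
item_stream = ({"objectID": str(i), "author": f"user_{i % 3}"} for i in range(2500))

with closing(sqlite3.connect(":memory:")) as db:
    db.execute("create table hn_items_raw(data)")
    batch_sizes = []
    for chunk in iter_chunks(item_stream, 1000):
        with db:  # one transaction (and one commit) per batch
            db.executemany(
                "insert into hn_items_raw(data) values (?)",
                [(json.dumps(item),) for item in chunk],
            )
        batch_sizes.append(len(chunk))
    total_rows = db.execute("select count(*) from hn_items_raw").fetchone()[0]

print(batch_sizes, total_rows)
```

Each batch is committed on its own, so a failure partway through loses at most the batch in flight.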

### Let's tabulate (~ *normalize*) the data

```
objectID
created_at
title
url
author
points
story_text
comment_text
comment_text_length
num_comments
story_id
story_title
story_url
parent_id
relevancy_score
tags
```

In [54]:
with connect(DB_PATH) as db:
    db.execute("drop view if exists hn_items_fields")
    db.execute("""
        create view if not exists hn_items_fields as
        select
            json_extract(data, '$.created_at') as created_at,
            json_extract(data, '$.title') as title,
            json_extract(data, '$.url') as url,
            json_extract(data, '$.author') as author,
            json_extract(data, '$.points') as points,
            json_extract(data, '$.comment_text') as comment_text,
            length(json_extract(data, '$.comment_text')) as comment_text_length,
            json_extract(data, '$.story_text') as story_text,
            json_extract(data, '$.story_id') as story_id,
            json_extract(data, '$.story_title') as story_title,
            json_extract(data, '$.story_url') as story_url,
            json_extract(data, '$.parent_id') as parent_id,
            json_extract(data, '$.relevancy_score') as relevancy_score,
            json_extract(data, '$._tags') as tags
        from hn_items_raw
    """)

### Let's see what we got

In [55]:
with connect(DB_PATH) as db:
    hn_items_fields = pd.read_sql('select * from hn_items_fields', db)

hn_items_fields

Unnamed: 0,created_at,title,url,author,points,comment_text,comment_text_length,story_text,story_id,story_title,story_url,parent_id,relevancy_score,tags
0,2012-10-05T13:51:25.000Z,,,chris_wot,136.0,One of my proudest moments was finding a bug i...,1040.0,,4616548.0,How SQLite is tested,http://www.sqlite.org/testing.html,4616548.0,4193.0,"[""comment"",""author_chris_wot"",""story_4616548""]"
1,2011-06-22T19:53:59.000Z,,,thechangelog,103.0,SQLite all over the place – it's great having ...,161.0,,2684620.0,Poll: What database does your company use?,,2684620.0,3293.0,"[""comment"",""author_thechangelog"",""story_2684620""]"
2,2013-05-23T19:31:03.000Z,,,bane,71.0,"Fun SQLite story, I had a project that needed ...",2775.0,,5758192.0,SQLite improves performance with memory-mapped...,http://www.sqlite.org/releaselog/3_7_17.html,5758192.0,4637.0,"[""comment"",""author_bane"",""story_5758192""]"
3,2013-04-30T23:58:27.000Z,,,SQLite,71.0,There are currently 24 contiguous bytes of unu...,248.0,,5634992.0,We Need A Standard Layered Image Format,http://shapeof.com/archives/2013/4/we_need_a_s...,5635283.0,4597.0,"[""comment"",""author_SQLite"",""story_5634992""]"
4,2013-02-06T17:14:11.000Z,,,evmar,70.0,I worked on Chrome for most of its life. The ...,1812.0,,5176288.0,Remember when people tracked bugs?,http://www.jwz.org/blog/2013/02/wow-remember-w...,5176779.0,4435.0,"[""comment"",""author_evmar"",""story_5176288""]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101950,2016-03-18T21:31:55.000Z,Ask HN: Do you use Android's capability to con...,,plugnburn,4.0,,,My laptop&#x27;s hard drive died irrecoverably...,,,,,6612.0,"[""story"",""author_plugnburn"",""story_11315277"",""..."
101951,2012-02-12T22:05:50.000Z,Ask HN: Thoughts on cyclical content refreshing?,,ak86,4.0,,,I spend a fair amount of time consuming conten...,,,,,3735.0,"[""story"",""author_ak86"",""story_3583185"",""ask_hn""]"
101952,2018-02-11T21:01:48.000Z,What news sources do you pay for and why?,,lemonberry,4.0,,,The post below from TechCrunch got me thinking...,,,,,7942.0,"[""story"",""author_lemonberry"",""story_16354779""]"
101953,2011-06-09T13:52:32.000Z,My Weekend Project: A push based Android Bitco...,,DrHeisenberg,4.0,,,"Bitcoins (BTC) is a hot topic at the moment, a...",,,,,3265.0,"[""story"",""author_DrHeisenberg"",""story_2637110""]"

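The empty cells and the string-looking `tags` column follow from how `json_extract` maps JSON values to SQL values: strings and numbers come back as SQL text and numbers, JSON `null` and missing keys come back as `NULL`, and arrays come back as JSON text. A small sketch on an in-memory database, with made-up items that only loosely resemble the HN records (an assumption):

```python
import json
import sqlite3
from contextlib import closing

# Made-up items shaped loosely like the HN records (assumption).
sample_items = [
    {"author": "alice", "points": 3, "comment_text": "hello",
     "_tags": ["comment", "author_alice"]},
    {"author": "bob", "points": None, "title": "A story", "_tags": ["story"]},
]

with closing(sqlite3.connect(":memory:")) as db:
    db.execute("create table hn_items_raw(data)")
    db.executemany(
        "insert into hn_items_raw(data) values (?)",
        [(json.dumps(item),) for item in sample_items],
    )
    db.execute("""
        create view hn_items_fields as
        select
            json_extract(data, '$.author') as author,
            json_extract(data, '$.points') as points,
            length(json_extract(data, '$.comment_text')) as comment_text_length,
            json_extract(data, '$._tags') as tags
        from hn_items_raw
    """)
    view_rows = db.execute(
        "select author, points, comment_text_length, tags "
        "from hn_items_fields order by author"
    ).fetchall()

for author, points, comment_text_length, tags in view_rows:
    # tags arrives as JSON text, so json.loads turns it back into a list.
    print(author, points, comment_text_length, json.loads(tags))
```

Bob's `points` is JSON `null` and his `comment_text` key is missing, so both columns come back as `None` in Python.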

### Let's find the most frequent authors in our data

In [57]:
query_1 = """
select 
    json_extract(data, '$.author') as author, 
    count(*) as count_author_comments
from hn_items_raw
group by author
order by count_author_comments desc
"""

with connect(DB_PATH) as db:
    frequent_authors_1 = pd.read_sql(query_1, db)
    
frequent_authors_1  

Unnamed: 0,author,count_author_comments
0,rbanffy,552
1,pseudolus,440
2,jseliger,391
3,prostoalex,314
4,ingve,298
...,...,...
36952,zyxo,1
36953,zzaner,1
36954,zzbn00,1
36955,zzeder,1


In [58]:
query_2 = """
select author, count(*) as count_author_comments
from hn_items_fields
group by author
order by count_author_comments desc
"""

with connect(DB_PATH) as db:
    frequent_authors_2 = pd.read_sql(query_2, db)
    
frequent_authors_2.head(20)

Unnamed: 0,author,count_author_comments
0,rbanffy,552
1,pseudolus,440
2,jseliger,391
3,prostoalex,314
4,ingve,298
5,tptacek,232
6,Tomte,217
7,jonbaer,211
8,bookofjoe,209
9,danso,204


In [59]:
filter_author_query = """
select json_extract(data, '$.author'), json_extract(data, '$.objectID')
from hn_items_raw
where json_extract(data, '$.author') = 'luu'
"""

In [None]:
%%timeit
with connect(DB_PATH) as db:
    luu_df = pd.read_sql(filter_author_query, db)

### How can we speed this up?

- We would usually create an index, but indices are defined on **columns**; where's the column here?
- There is none: the author lives inside the JSON text of `data`, so SQLite has to parse every row to evaluate the filter.
- What we want instead is to "cache" the result of the computation `json_extract(data, '$.author')`.
- Such a computation is called an *expression*.
- In these scenarios we create an *index on expression*.


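The format of an index on expression is:

`CREATE INDEX idx_name ON TABLE_NAME (<expression_here>)`

For `<expression_here>` we usually copy-paste the expression from the `WHERE` clause (the part being compared, not the comparison itself). SQLite only considers the index when the query spells the expression exactly as it is written in the index definition.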
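
One way to check that an expression index is actually picked up is `EXPLAIN QUERY PLAN`. The sketch below runs against an in-memory table with made-up rows (an assumption; the real `hn_items_raw` is not needed) and compares the plan before and after creating the index. The exact plan wording differs between SQLite versions, so it is best to look only for the index name.

```python
import json
import sqlite3
from contextlib import closing

filter_sql = """
select json_extract(data, '$.author'), json_extract(data, '$.objectID')
from hn_items_raw
where json_extract(data, '$.author') = 'luu'
"""


def plan_details(db, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return " | ".join(row[3] for row in db.execute("explain query plan " + sql))


with closing(sqlite3.connect(":memory:")) as db:
    db.execute("create table hn_items_raw(data)")
    db.executemany(
        "insert into hn_items_raw(data) values (?)",
        [(json.dumps({"objectID": str(i), "author": name}),)
         for i, name in enumerate(["luu", "crag", "luu", "bane"])],
    )
    plan_before = plan_details(db, filter_sql)

    # The indexed expression is spelled exactly as in the WHERE clause.
    db.execute(
        "create index idx_author on hn_items_raw (json_extract(data, '$.author'))"
    )
    plan_after = plan_details(db, filter_sql)
    luu_rows = db.execute(filter_sql).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
```

If the plan after creating the index still shows a full scan, the usual culprit is a mismatch between the expression in the query and the one in the index.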

In [None]:
create_author_idx_query = """
create index if not exists idx_author on hn_items_raw (json_extract(data, '$.author'))
"""

In [None]:
with connect(DB_PATH) as db:
    db.execute(create_author_idx_query)

In [None]:
%%timeit
with connect(DB_PATH) as db:
    luu_df = pd.read_sql(filter_author_query, db)

### Indices on expressions often come in handy for time-oriented computations.

Say we want to filter comments posted on Sundays.

The query would look like this:

In [None]:
sunday_comments="""
select json_extract(data, '$.comment_text'), datetime(json_extract(data, '$.created_at'))
from hn_items_raw
where strftime('%w', datetime(json_extract(data, '$.created_at'))) = '0'
"""

In [None]:
%%timeit

with connect(DB_PATH) as db:
    sunday_comments_df = pd.read_sql(sunday_comments, db)
    
sunday_comments_df

The predicate expression here is a little more complex, but it can be indexed in the same way.

Let's create an index:

In [None]:
create_index_on_sunday_comments_query =\
"""
create index if not exists 
idx_comments_on_sundays on 
hn_items_raw (strftime('%w', datetime(json_extract(data, '$.created_at'))))
"""

In [None]:
with connect(DB_PATH) as db:
    db.execute(create_index_on_sunday_comments_query)
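
A quick sanity check of the weekday expression and of the planner's use of the new index, on an in-memory table with made-up timestamps (an assumption). `strftime('%w', ...)` returns the weekday as text, `'0'` for Sunday, which is why the query compares against a string. Date functions inside index expressions need a reasonably recent SQLite; to our knowledge they are accepted from 3.20 onward, provided no `'now'`-style modifiers are used.

```python
import json
import sqlite3
from contextlib import closing
from datetime import datetime

# Made-up timestamps in the same ISO format as `created_at` (assumption).
timestamps = [
    "2012-04-03T21:25:57.000Z",
    "2012-10-05T13:51:25.000Z",
    "2012-10-07T09:00:00.000Z",
]
weekday_expr = "strftime('%w', datetime(json_extract(data, '$.created_at')))"
sunday_sql = (
    "select json_extract(data, '$.created_at') "
    f"from hn_items_raw where {weekday_expr} = '0'"
)

with closing(sqlite3.connect(":memory:")) as db:
    db.execute("create table hn_items_raw(data)")
    db.executemany(
        "insert into hn_items_raw(data) values (?)",
        [(json.dumps({"created_at": ts}),) for ts in timestamps],
    )
    db.execute(
        f"create index idx_comments_on_sundays on hn_items_raw ({weekday_expr})"
    )
    sunday_plan = " | ".join(
        row[3] for row in db.execute("explain query plan " + sunday_sql)
    )
    sqlite_sundays = [row[0] for row in db.execute(sunday_sql)]

# Cross-check with Python: isoweekday() is 7 on Sundays.
python_sundays = [
    ts for ts in timestamps
    if datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").isoweekday() == 7
]

print(sunday_plan)
print(sqlite_sundays, python_sundays)
```

The two lists should agree, and the plan should mention `idx_comments_on_sundays`.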

In [None]:
%%timeit

with connect(DB_PATH) as db:
    sunday_comments_df = pd.read_sql(sunday_comments, db)
    
sunday_comments_df