## <span style=color:blue>Patterns used in Programming Assignment 2 Part 2 (version using the util.py file)</span>

In [1]:
# These are boilerplate imports that seem useful
# It would be cleaner to delete or comment out the ones that aren't used in this script

import sys
import json
import csv
import yaml

import copy

import pandas as pd
import numpy as np

import matplotlib as mpl

import time
from datetime import datetime
# see https://stackoverflow.com/questions/415511/how-do-i-get-the-current-time-in-python
#   for some basics about datetime

import pprint

# sqlalchemy 2.0 documentation: https://www.sqlalchemy.org/
import psycopg2
from sqlalchemy import create_engine, text as sql_text

# the following seems to be deprecated, so we use sqlalchemy instead
# from psycopg2 import sqlio

# the file in benchmarking/util.py should hold utilities useful for your benchmarking exercise
# This version of the notebook imports and calls util, so benchmarking/util.py
#    must be in place before you run the cells below
sys.path.append('benchmarking/')
import util
# to invoke a function "foo()" inside util.py, use "util.foo()"

In [2]:
# test that util.py has been imported correctly
# util.hello_world()

### <span style=color:blue>Setting up Postgres connection.  Note database name is "airbnb" </span>

### <span style=color:blue>Note: this should be modified so that the user name and password are not included in the program. </span>

<span style=color:blue>E.g., see https://docs.sqlalchemy.org/en/20/core/engines.html for how to construct the URLs that the create_engine command uses.  Also, one should store the user/password into environment variables and read them in to populate the URL.  </span>

<span style=color:blue>E.g., see https://stackoverflow.com/questions/4906977/how-can-i-access-environment-variables-in-python for how to work with environment variables on a Mac. </span>
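
<span style=color:blue>As a minimal sketch of that idea, the helper below assembles the URL from environment variables using only the standard library. The function name build_db_url and the variable names PGUSER, PGPASSWORD, PGHOST, PGPORT and PGDATABASE are assumptions made for this illustration; they are not part of util.py. </span>

```python
import os
from urllib.parse import quote


def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL for create_engine from environment variables.

    The variable names used here are assumptions for this sketch.
    """
    user = env['PGUSER']            # raises KeyError if the variable is unset
    password = env['PGPASSWORD']
    host = env.get('PGHOST', 'localhost')
    port = env.get('PGPORT', '5432')
    database = env.get('PGDATABASE', 'airbnb')
    # percent-encode user and password so characters like '@' or ' ' are safe in a URL
    return 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(
        quote(user, safe=''), quote(password, safe=''), host, port, database)


# Example with an explicit mapping instead of the real environment
print(build_db_url({'PGUSER': 'postgres', 'PGPASSWORD': 'postgres'}))
```

<span style=color:blue>The resulting string could then be passed to create_engine in place of the hard-coded URL in the next cell. </span>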

In [3]:
# following https://www.geeksforgeeks.org/connecting-postgresql-with-sqlalchemy-in-python/

db_eng = create_engine('postgresql+psycopg2://postgres:postgres@localhost:5432/airbnb',
                       connect_args={'options': '-csearch_path={}'.format('new_york_city')},
                       isolation_level = 'SERIALIZABLE')
#    , echo=True)
#    , echo_pool="debug")

print("Successfully created db engine.")

# connect_args is used to set search_path to the schema 'new_york_city' in the airbnb database

# isolation_level SERIALIZABLE makes transactions happen in sequence, which is good 
#      for the benchmarking we will be doing

# for general info on sqlalchemy connections,
#    see: https://docs.sqlalchemy.org/en/20/core/connections.html

# echo from https://docs.sqlalchemy.org/en/20/core/engines.html

Successfully created db engine.


### <span style=color:blue>Working with btree indexing</span>

<span style=color:blue>I will encourage students to build a query that computes, for each year, the number of reviews that happened in that year. </span>
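
<span style=color:blue>For orientation, here is one way such a query could be written. This is a hypothetical stand-in, not the actual contents of util.query_reviews_by_year_counts(); it assumes that reviews.date is a date (or timestamp) column, so that to_char can extract the year as text. </span>

```python
def query_reviews_by_year_counts():
    """Hypothetical stand-in for the util.py helper of the same name."""
    # to_char(..., 'YYYY') yields the year as text, e.g. '2015'
    return """SELECT to_char(r.date, 'YYYY') AS review_year, count(*) AS review_count
FROM reviews r
GROUP BY to_char(r.date, 'YYYY')
ORDER BY review_year;"""


print(query_reviews_by_year_counts())
```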

In [4]:
q = util.query_reviews_by_year_counts()

with db_eng.connect() as conn:
    result = conn.execute(sql_text(q))
    # fetch the rows while the connection is still open
    result_list = result.fetchall()

pprint.pp(result_list)


[('2009', 56),
 ('2010', 449),
 ('2011', 1905),
 ('2012', 3872),
 ('2013', 7317),
 ('2014', 14203),
 ('2015', 28465),
 ('2016', 48527),
 ('2017', 66146),
 ('2018', 95137),
 ('2019', 126469),
 ('2020', 51172),
 ('2021', 109415),
 ('2022', 196136),
 ('2023', 228831),
 ('2024', 8710)]


<span style=color:blue>Likewise, the next query gives a count of listings within the price ranges 0 <= p < 100, 100 <= p < 200, ..., 900 <= p < 1000, plus '1000_and_above'. A separate 'NULL' bucket also appears in the output. </span>
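
<span style=color:blue>The bucketing can be sketched as follows. Both function names are hypothetical, and the sketch assumes prices are non-negative numbers; the real util.query_listings_price_ranges_counts() may build its labels differently. The first function computes a label in Python, and the second builds a SQL CASE expression that yields the same labels. </span>

```python
def price_range_label(price):
    """Return the bucket label for a listing price (None models SQL NULL)."""
    if price is None:
        return 'NULL'
    if price >= 1000:
        return '1000_and_above'
    # integer-divide by 100 to find the bucket, e.g. 345 -> 3 -> '300s'
    return '{}00s'.format(int(price // 100))


def price_range_case_sql(column='price'):
    """Build a SQL CASE expression that produces the same labels."""
    lines = ['CASE', "  WHEN {} IS NULL THEN 'NULL'".format(column)]
    for lo in range(0, 1000, 100):
        lines.append("  WHEN {c} >= {lo} AND {c} < {hi} THEN '{label}'".format(
            c=column, lo=lo, hi=lo + 100, label=price_range_label(lo)))
    lines.append("  ELSE '1000_and_above'")
    lines.append('END')
    return '\n'.join(lines)


print(price_range_case_sql())
```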

In [5]:
q = util.query_listings_price_ranges_counts()

with db_eng.connect() as conn:
    result = conn.execute(sql_text(q))
    # fetch the rows while the connection is still open
    result_list = result.fetchall()

pprint.pp(result_list)


[('000s', 1735),
 ('1000_and_above', 226),
 ('100s', 6362),
 ('200s', 2875),
 ('300s', 1307),
 ('400s', 519),
 ('500s', 233),
 ('600s', 146),
 ('700s', 95),
 ('800s', 85),
 ('900s', 75),
 ('NULL', 7195)]


<span style=color:blue>And here is a query that gives the counts of occurrences in the reviews table, for three selected words</span>

In [6]:
q = util.query_reviews_words_counts()

with db_eng.connect() as conn:
    result = conn.execute(sql_text(q))
    # fetch the rows while the connection is still open
    result_list = result.fetchall()

pprint.pp(result_list)

[('apartment', 195261), ('awesome', 27938), ('horrible', 859), (None, 762752)]


<span style=color:blue>First, here is an example of the queries we are using for the testing with b-tree indexes </span>

In [7]:
date_start = '2015-01-01'
date_end = '2015-12-31'

q = util.build_query_listings_join_reviews(date_start, date_end)

print(q)


SELECT DISTINCT l.id, l.name
FROM listings l, reviews r 
WHERE l.id = r.listing_id
  AND r.date >= '2015-01-01'
  AND r.date <= '2015-12-31'
ORDER BY l.id;
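
<span style=color:blue>A query builder that produces text like the output above could look like the sketch below. It is an illustrative reimplementation, not the actual code in util.py, and it assumes the two dates are trusted literals in 'YYYY-MM-DD' form (for untrusted input, bound parameters would be the safer choice). </span>

```python
def build_query_listings_join_reviews(date_start, date_end):
    """Illustrative stand-in: join listings to reviews within a date range."""
    return """SELECT DISTINCT l.id, l.name
FROM listings l, reviews r
WHERE l.id = r.listing_id
  AND r.date >= '{}'
  AND r.date <= '{}'
ORDER BY l.id;""".format(date_start, date_end)


print(build_query_listings_join_reviews('2015-01-01', '2015-12-31'))
```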


<span style=color:blue>My big vision was to have the students set up some infrastructure so that they could easily experiment with a variety of different queries and indexes, and explore things on their own.  So, in util.py I have some helper functions that enable someone to specify a list of indexes, and then run experiments for all combinations of that list.  However, at present I'm thinking that will take people too far into the weeds.   </span>
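
<span style=color:blue>"All combinations" of a list of index specifications can be enumerated with itertools, as in this sketch. The function name all_index_specs is hypothetical; the helpers in util.py may enumerate the combinations differently. </span>

```python
from itertools import combinations


def all_index_specs(indexes):
    """Return every subset of `indexes`, from the empty list up to the full list."""
    specs = []
    for k in range(len(indexes) + 1):
        for combo in combinations(indexes, k):
            specs.append(list(combo))
    return specs


for spec in all_index_specs([['id', 'listings'], ['date', 'reviews']]):
    print(spec)
```

<span style=color:blue>For n indexes this yields 2^n specifications, which is one reason the full exploration grows quickly. </span>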

<span style=color:blue>I am now thinking that I may ask the students to build only files like my perf_test_v01.json and ts_perf_test_v01.json.  Or I might find time to do some update-based experiments, again for the same set of indexes involved in perf_test_v01.json and ts_perf_test_v01.json.  In either case, they will most likely only have to think about the indexes date_in_reviews, id_in_listings, and comments_tsv_in_reviews. </span>

<span style=color:blue>With regard to storing results, I originally designed this assignment to let the students explore different index combinations over time.  So I developed a helper function that reads a JSON file from disk, adds to it, and then writes it out again.  That way, performance information accumulates over many programming/execution sessions.  This is my function util.fetch_and_update_perf_data(perf_file, perf_summary).  However, I am now thinking that the students don't need to bother with that, given that they will be given quite specific files that they are supposed to generate.</span>
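
<span style=color:blue>The read, update, write-back pattern can be sketched with the standard json module. This is an illustration under assumptions, not the actual util.py code: it assumes the file holds a single JSON object keyed by query name, and that new results overwrite older entries with the same key. </span>

```python
import json
import os


def fetch_and_update_perf_data(perf_file, perf_summary):
    """Merge perf_summary into the JSON file at perf_file and return the merged dict."""
    if os.path.exists(perf_file):
        with open(perf_file, 'r') as f:
            data = json.load(f)
    else:
        data = {}                   # first run: start from an empty summary
    data.update(perf_summary)       # new keys are added; existing keys are overwritten
    with open(perf_file, 'w') as f:
        json.dump(data, f, indent=2)
    return data
```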

<span style=color:blue>The next cell generates a JSON file of performance results, written to the name set in perf_file.  </span>

<span style=color:blue>It also produces a lot of log output, which may be helpful ... </span>

In [8]:
# testing util.run_one_query_and_multi_index_specs and util.fetch_and_update_perf_data

all_indexes = [['date','reviews'], ['id','listings'], ['price','listings'], ['comments_tsv', 'reviews', 'gin']] 

q_dict = {}
# reviews has data for years 2009 to 2024
for yr in range(2009,2025):
    q_name = 'listings_join_review_' + str(yr)
    date_start = str(yr) + '-01-01'
    date_end = str(yr) + '-12-31'
    q_dict[q_name] = util.build_query_listings_join_reviews(date_start, date_end)
# pprint.pp(q_dict)

i_spec = [['id','listings'], ['date','reviews']]

perf_file = 'listings_join_reviews_v01.json'

# setting count to 3 for now so that things run faster
count = 3

for q in q_dict:
    print('\n====>>> Now working on query:', q)
    perf_summary = util.run_one_query_and_multi_index_specs(db_eng, all_indexes, q, q_dict[q], i_spec, count)
    # print('\nThe perf_summary of running query', q, 'is as follows:') 
    # pprint.pp(perf_summary)
    updated_perf_summary = util.fetch_and_update_perf_data(perf_file, perf_summary)

print('\nThe value of updated perf_summary is:')
pprint.pp(updated_perf_summary, sort_dicts=True)



====>>> Now working on query: listings_join_review_2009

Now working on the i_spec: []
Current set of indexes in effect is:
{'reviews': ["CREATE INDEX comments_in_reviews ON new_york_city.reviews USING gin (to_tsvector('simple'::regconfig, (comments)::text))"],
 'listings': ['CREATE INDEX neighbourhood_group_in_listings ON new_york_city.listings USING btree (neighbourhood_group)']}

Now invoking run_one_query

Now working on the i_spec: [['id', 'listings']]
Current set of indexes in effect is:
{'reviews': ["CREATE INDEX comments_in_reviews ON new_york_city.reviews USING gin (to_tsvector('simple'::regconfig, (comments)::text))"],
 'listings': ['CREATE INDEX neighbourhood_group_in_listings ON new_york_city.listings USING btree (neighbourhood_group)',
              'CREATE INDEX id_in_listings ON new_york_city.listings USING btree (id)']}

Now invoking run_one_query

Now working on the i_spec: [['date', 'reviews']]
Current set of indexes in effect is:
{'reviews': ['CREATE INDEX date_in_r

### <span style=color:blue>Working with text indexing</span>

<span style=color:blue>First, here are examples of the 2 kinds of query we are working with.  I might decide to ask the students to experiment with a second word.  (I need to find a word with different behavior!)</span>

### <span style=color:blue>NOTE: please see the double "%%" in the first query -- this is a kind of escape character so that sqlalchemy will be able to work with the "%" signs.  </span>

In [11]:
word = 'awesome'
start_date = '2015-01-01'
end_date = '2015-12-31'

q1 = util.build_query_reviews_no_index(word, start_date, end_date)
q2 = util.build_query_reviews_ts(word, start_date, end_date)

print('q1 is:')
print(q1)
print('\nq2 is:')
print(q2)
    

q1 is:
SELECT *
FROM reviews r 
WHERE comments ILIKE '%%awesome%%' 
  AND date >= '2015-01-01'
  AND date <= '2015-12-31';

q2 is:
SELECT *
FROM reviews r 
WHERE comments_tsv @@ to_tsquery('awesome')
  AND date >= '2015-01-01'
  AND date <= '2015-12-31';


<span style=color:blue>The next cell generates the file 'ts_perf_test_v01.json'</span>

<span style=color:blue>For these experiments the "no-index" case searches the comments column of reviews, whereas the "with-index" case searches the comments_tsv column of reviews.  So I don't need to drop the index on comments_tsv when running the "no-index" case.  This makes it simpler than the b-tree situation.</span>
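
<span style=color:blue>For readers who have not yet created the comments_tsv column, a one-time setup along the following lines is one possibility. These statements are assumptions, not taken from util.py: in particular, the sketch relies on the database's default text search configuration, and the actual setup for this assignment may differ. </span>

```python
# Hypothetical one-time setup; the statements and the choice of the default
# text search configuration are assumptions made for this sketch.
setup_statements = [
    "ALTER TABLE reviews ADD COLUMN IF NOT EXISTS comments_tsv tsvector;",
    "UPDATE reviews SET comments_tsv = to_tsvector(comments);",
    "CREATE INDEX IF NOT EXISTS comments_tsv_in_reviews ON reviews USING gin (comments_tsv);",
]

for stmt in setup_statements:
    print(stmt)
```

<span style=color:blue>Each statement could be run inside a "with db_eng.begin() as conn:" block via conn.execute(sql_text(stmt)), so that the changes are committed when the block exits. </span>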

In [10]:
# testing util.run_one_text_query_without_with_ts_index and util.fetch_and_update_ts_perf_data

perf_file = 'ts_perf_test_v01.json'

# using a low count so that the query without index runs reasonably quickly
count = 50

# reviews has data for years 2009 to 2024
for yr in range(2009,2025):
    print('\nEntering loop for yr = ', str(yr))
    q_name = 'listings_join_review_' + str(yr)
    date_start = str(yr) + '-01-01'
    date_end = str(yr) + '-12-31'
    for word in ['horrible', 'awesome', 'apartment']:
        print('Entering sub-loop for word: ', word)
        perf_info = util.run_one_text_query_without_with_ts_index(db_eng, word, date_start, date_end, count)
        print('perf_file for the run for year "', yr, '" and word "', word, '" is:')
        pprint.pp(perf_info)
        updated_perf_summary = util.fetch_and_update_ts_perf_data(perf_file, perf_info)




print('After running all years, the updated perf_summary info, in sorted order, is:')
# the sorted(...items()) returns list of ordered pairs; 
#     for each pair, the first element is a key in the dictionary
#                    and the second element is the value associated with that key
pprint.pp(sorted(updated_perf_summary.items()))





Entering loop for yr =  2009
Entering sub-loop for word:  horrible
{'avg': 0.0028,
 'min': 0.0021,
 'max': 0.0079,
 'std': 0.0008,
 'exec_count': 50,
 'timestamp': '2024-05-07-23:24:02'}
{'avg': 0.0021,
 'min': 0.0016,
 'max': 0.0032,
 'std': 0.0004,
 'exec_count': 50,
 'timestamp': '2024-05-07-23:24:03'}
perf_file for the run for year " 2009 " and word " horrible " is:
{'horrible_2009': {'no_ts_index': {'avg': 0.0028,
                                   'min': 0.0021,
                                   'max': 0.0079,
                                   'std': 0.0008,
                                   'exec_count': 50,
                                   'timestamp': '2024-05-07-23:24:02'},
                   'with_ts_index': {'avg': 0.0021,
                                     'min': 0.0016,
                                     'max': 0.0032,
                                     'std': 0.0004,
                                     'exec_count': 50,
                                     '

[('apartment_2009',
  {'no_ts_index': {'avg': 0.0033,
                   'min': 0.0028,
                   'max': 0.004,
                   'std': 0.0005,
                   'count': 3,
                   'timestamp': '2024-05-07-19:21:29'},
   'with_ts_index': {'avg': 0.002,
                     'min': 0.0019,
                     'max': 0.0021,
                     'std': 0.0001,
                     'count': 3,
                     'timestamp': '2024-05-07-19:21:29'}}),
 ('apartment_2010',
  {'no_ts_index': {'avg': 0.0059,
                   'min': 0.0053,
                   'max': 0.0067,
                   'std': 0.0006,
                   'count': 3,
                   'timestamp': '2024-05-07-19:21:29'},
   'with_ts_index': {'avg': 0.0038,
                     'min': 0.0037,
                     'max': 0.0041,
                     'std': 0.0002,
                     'count': 3,
                     'timestamp': '2024-05-07-19:21:29'}}),
 ('apartment_2011',
  {'no_ts_index': {'av

### <span style=color:blue>Here is an interesting query, in which we can vary three things: the price range of the listing, the year in which a review was made, and whether one of the 3 main words is in the review.</span>

In [11]:
q_using_ts_index = """select count(*)
from listings l, reviews r
where l.id = r.listing_id
  and r.date >= '2015-01-01' and r.date <= '2015-12-31'
  and comments_tsv @@ to_tsquery('awesome')
  and l.price >= 300 and l.price < 400"""

# notice that I'm using the double "%%"; sqlalchemy reads that as "%"
q_not_using_ts_index = """select count(*)
from listings l, reviews r
where l.id = r.listing_id
  and r.date >= '2015-01-01' and r.date <= '2015-12-31'
  and comments ilike '%%awesome%%'
  and l.price >= 300 and l.price < 400"""

count = 10
perf_with_ts_index, time_list = util.run_one_query(db_eng, q_using_ts_index, count)
perf_no_ts_index, time_list = util.run_one_query(db_eng, q_not_using_ts_index, count)

print('Perf profile when using ts index is:')
pprint.pp(perf_with_ts_index)

print('\nPerf profile when NOT using ts index is:')
pprint.pp(perf_no_ts_index)



Perf profile when using ts index is:
{'avg': 0.0754,
 'min': 0.069,
 'max': 0.1181,
 'std': 0.0143,
 'count': 10,
 'timestamp': '2024-05-07-19:13:16'}

Perf profile when NOT using ts index is:
{'avg': 0.1434,
 'min': 0.1412,
 'max': 0.1457,
 'std': 0.0015,
 'count': 10,
 'timestamp': '2024-05-07-19:13:17'}
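
<span style=color:blue>A profile with these fields could be produced by a timing loop like the sketch below. The function time_callable is hypothetical and is not the actual util.run_one_query; it times an arbitrary callable so that it can be tried without a database, and it assumes the sample standard deviation is the one reported. </span>

```python
import statistics
import time
from datetime import datetime


def time_callable(run_once, count):
    """Call run_once() `count` times; return (summary dict, list of elapsed seconds)."""
    time_list = []
    for _ in range(count):
        start = time.perf_counter()
        run_once()
        time_list.append(time.perf_counter() - start)
    summary = {
        'avg': round(statistics.mean(time_list), 4),
        'min': round(min(time_list), 4),
        'max': round(max(time_list), 4),
        # the sample standard deviation needs at least two measurements
        'std': round(statistics.stdev(time_list), 4) if count > 1 else 0.0,
        'count': count,
        'timestamp': datetime.now().strftime('%Y-%m-%d-%H:%M:%S'),
    }
    return summary, time_list


# Example: time a cheap stand-in workload instead of a real query
summary, times = time_callable(lambda: sum(range(100000)), 5)
print(summary)
```

<span style=color:blue>To time a real query, run_once could be a small function that opens a connection with db_eng.connect(), executes sql_text(q), and fetches all rows. </span>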


In [9]:
print('0100s'[1:4]) 

100
