## Speeding up BigQuery queries with BI Engine

To speed up small queries in BigQuery, simply turn on BI Engine.
The Client API remains exactly the same.

Accompanies https://medium.com/@lakshmanok/speeding-up-small-queries-in-bigquery-with-bi-engine-4ac8420a2ef0

#### Queries

In [1]:
from google.cloud import bigquery
from timeit import default_timer as timer
from datetime import timedelta

def show_query(query):
    client = bigquery.Client()
    query_job = client.query(query)
    return query_job.result().to_dataframe()

This query finds the most common names by state.

In [15]:
NAME_BY_STATE="""
    SELECT name, state, SUM(number) as total_people
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name, state
    ORDER BY total_people DESC
    LIMIT 10
"""
show_query(NAME_BY_STATE)

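| | name | state | total_people |
|---|---|---|---|
| 0 | John | NY | 494388 |
| 1 | Robert | NY | 439191 |
| 2 | Michael | NY | 432493 |
| 3 | Michael | CA | 422555 |
| 4 | John | PA | 418333 |
| 5 | David | CA | 364313 |
| 6 | Robert | PA | 351959 |
| 7 | Robert | CA | 348162 |
| 8 | James | NY | 340430 |
| 9 | Joseph | NY | 339021 |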


This query is more complex and finds the highest relative-frequency names by state.

In [3]:
NAME_RELFREQ="""
    WITH num_babies_in_state AS (
      SELECT 
         state,
         SUM(number) as total_people
      FROM `bigquery-public-data.usa_names.usa_1910_2013`
      GROUP BY state
    )

    SELECT 
       name,
       state,
       SUM(number/total_people) as rel_freq
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    JOIN num_babies_in_state USING(state)
    GROUP BY name, state
    ORDER BY rel_freq DESC
    LIMIT 10
"""
show_query(NAME_RELFREQ)

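| | name | state | rel_freq |
|---|---|---|---|
| 0 | James | SC | 0.028909 |
| 1 | James | MS | 0.028827 |
| 2 | James | AL | 0.027832 |
| 3 | James | TN | 0.027124 |
| 4 | James | KY | 0.026861 |
| 5 | John | MA | 0.026169 |
| 6 | John | RI | 0.025709 |
| 7 | James | WV | 0.025554 |
| 8 | James | AR | 0.02554 |
| 9 | John | PA | 0.025107 |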
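
To make the relative-frequency calculation concrete, here is a minimal pure-Python sketch of the same logic on a handful of made-up rows. The counts below are hypothetical and are not taken from the public dataset. Like the SQL, it first totals the babies per state, then sums number/total_people over each (name, state) pair.

```python
from collections import defaultdict

# Hypothetical (name, state, number) rows for illustration only.
# A (name, state) pair can appear in several rows, as in the real table.
rows = [
    ("James", "SC", 30),
    ("James", "SC", 20),
    ("Mary", "SC", 50),
    ("John", "MA", 40),
    ("Mary", "MA", 160),
]

# Equivalent of the num_babies_in_state CTE: total count per state.
total_by_state = defaultdict(int)
for name, state, number in rows:
    total_by_state[state] += number

# Equivalent of SUM(number/total_people) ... GROUP BY name, state.
rel_freq = defaultdict(float)
for name, state, number in rows:
    rel_freq[(name, state)] += number / total_by_state[state]

# Equivalent of ORDER BY rel_freq DESC.
ranked = sorted(rel_freq.items(), key=lambda kv: kv[1], reverse=True)
for (name, state), freq in ranked:
    print(name, state, round(freq, 3))
```

Because each row is divided by its own state's total, a name can rank highly in a small state even when its absolute count is modest, which is why the results differ from the first query.
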


This query groups by a discrete number, so it will be inherently faster than the names queries (which group by strings).
It finds the sites with the worst air quality on average.

In [6]:
AIR_QUALITY="""
            SELECT
               site_num,
               ANY_VALUE(state_name) AS state,
               AVG(aqi) as air_quality_index,
            FROM `bigquery-public-data.epa_historical_air_quality.pm10_daily_summary`
            GROUP BY site_num
            ORDER BY air_quality_index DESC
            LIMIT 10
"""
show_query(AIR_QUALITY)

| | site_num | state | air_quality_index |
|---|---|---|---|
| 0 | 8012 | Country Of Mexico | 81.027027 |
| 1 | 3013 | Arizona | 68.102236 |
| 2 | 7030 | Arizona | 61.595819 |
| 3 | 3015 | Arizona | 57.42039 |
| 4 | 241 | California | 54.416667 |
| 5 | 3008 | Pennsylvania | 51.125606 |
| 6 | 149 | Pennsylvania | 49.254864 |
| 7 | 1999 | California | 48.589238 |
| 8 | 2306 | Guam | 46.846154 |
| 9 | 3011 | Arizona | 45.814396 |


### Without BI Engine

Time it. Note that I am measuring the time taken on the server
using query_job.started and query_job.ended.
This takes out variability due to the time it takes to send the query
over the network to the BigQuery API.

In [8]:
from google.cloud import bigquery
from datetime import timedelta

# Construct a BigQuery client object.
client = bigquery.Client()

def run_query(query, n=5):
    """Run the query n times and print the average slot-milliseconds and server time."""
    tot_slotmillis, tot_timeelapsed = 0, timedelta(0)
    # Turn off the query cache so that every run actually executes the query.
    job_config = bigquery.QueryJobConfig(use_query_cache=False)
    for _ in range(n):
        query_job = client.query(query, job_config=job_config)
        query_job.result().to_dataframe()  # wait for the job to finish and fetch the results
        # Server-side execution time, excluding the network round trip.
        tot_timeelapsed += (query_job.ended - query_job.started)
        tot_slotmillis += query_job.slot_millis
    print("Job stat: slot_mills={} server_time={}".format(tot_slotmillis/n, tot_timeelapsed/n))

Here, I'm running the query without BI Engine turned on.

In [18]:
run_query(NAME_BY_STATE)

Job stat: slot_mills=4052.2 server_time=0:00:03.902200


In [19]:
run_query(NAME_RELFREQ)

Job stat: slot_mills=11572.8 server_time=0:00:05.105800


In [20]:
run_query(AIR_QUALITY)

Job stat: slot_mills=1200.8 server_time=0:00:01.003200


Slot milliseconds are a proxy for cost if you have a reservation: they measure how much your BigQuery slots are being used.
The server_time is the time taken to process the request (we don't measure the network round-trip time because it's going to be the
same whether or not you use BI Engine).
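
One way to read the two numbers together is to divide slot milliseconds by the elapsed server time in milliseconds, which gives the average number of slots kept busy while the query ran. The sketch below does that arithmetic on the averages printed above; the helper name average_slots is my own, and because it divides one five-run average by another, treat the result as approximate.

```python
from datetime import timedelta

def average_slots(slot_millis, server_time):
    """Average number of slots busy during the query: slot-ms / elapsed ms."""
    elapsed_ms = server_time / timedelta(milliseconds=1)
    return slot_millis / elapsed_ms

# (slot_millis, server_time) averages printed by run_query without BI Engine.
measurements = {
    "NAME_BY_STATE": (4052.2, timedelta(seconds=3, microseconds=902200)),
    "NAME_RELFREQ": (11572.8, timedelta(seconds=5, microseconds=105800)),
    "AIR_QUALITY": (1200.8, timedelta(seconds=1, microseconds=3200)),
}

for name, (slot_millis, server_time) in measurements.items():
    print(f"{name}: about {average_slots(slot_millis, server_time):.2f} slots on average")
```

On these measurements the join query keeps a little over two slots busy on average, while the other two use roughly one.
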

### With BI Engine

Then, I went to the web console and turned on a 1 GB BI Engine reservation (monthly cost: $30).
Note: It seems to take about 3 minutes for the memory to become available, so plan on keeping
the reservation on for at least a few hours rather than toggling it on a per-query basis.
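
Since the reservation takes a few minutes to become active, it can help to confirm that a query was actually accelerated before timing it. The sketch below assumes a recent version of the google-cloud-bigquery client in which a finished query job exposes a bi_engine_stats attribute with a mode field; the helper name bi_engine_mode and the example mode string "FULL" are my assumptions, so check them against your installed version. Stand-in objects are used so the sketch runs offline.

```python
from types import SimpleNamespace

def bi_engine_mode(query_job):
    """Return the BI Engine mode reported for a finished job, or None if absent.

    Assumes query_job.bi_engine_stats.mode exists in the installed client;
    falls back to None when the attribute is missing or empty.
    """
    stats = getattr(query_job, "bi_engine_stats", None)
    return getattr(stats, "mode", None)

# Hypothetical stand-ins for query jobs. With a real job you would pass the
# object returned by client.query(...) after calling .result() on it.
accelerated = SimpleNamespace(bi_engine_stats=SimpleNamespace(mode="FULL"))
not_reported = SimpleNamespace(bi_engine_stats=None)

print(bi_engine_mode(accelerated))
print(bi_engine_mode(not_reported))
```

If the mode comes back empty right after you create the reservation, waiting a few minutes and rerunning the query is a reasonable first step.
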

In [21]:
run_query(NAME_BY_STATE)

Job stat: slot_mills=1846.8 server_time=0:00:01.634400


In [22]:
run_query(NAME_RELFREQ)

Job stat: slot_mills=9884.0 server_time=0:00:05.075800


In [23]:
run_query(AIR_QUALITY)

Job stat: slot_mills=94.6 server_time=0:00:00.254400


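As you can see, BI Engine cut server time by about 2.4x for the first query and about 3.9x for the air quality query, while the more complex join query took essentially the same time. Slot usage dropped by anywhere from 1.2x to 12.7x. My code did not change.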
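
The ratios can be read straight off the measurements. Here is a small sketch that divides the averages measured without BI Engine by the averages measured with it, using the numbers printed by run_query in this notebook; the speedups helper is just for illustration.

```python
from datetime import timedelta

# (slot_millis, server_time) averages as printed by run_query.
without_bi = {
    "NAME_BY_STATE": (4052.2, timedelta(seconds=3, microseconds=902200)),
    "NAME_RELFREQ": (11572.8, timedelta(seconds=5, microseconds=105800)),
    "AIR_QUALITY": (1200.8, timedelta(seconds=1, microseconds=3200)),
}
with_bi = {
    "NAME_BY_STATE": (1846.8, timedelta(seconds=1, microseconds=634400)),
    "NAME_RELFREQ": (9884.0, timedelta(seconds=5, microseconds=75800)),
    "AIR_QUALITY": (94.6, timedelta(microseconds=254400)),
}

def speedups(before, after):
    """Return {query: (slot_millis ratio, server_time ratio)}, before / after."""
    result = {}
    for name, (slots_before, time_before) in before.items():
        slots_after, time_after = after[name]
        result[name] = (slots_before / slots_after, time_before / time_after)
    return result

for name, (slot_ratio, time_ratio) in speedups(without_bi, with_bi).items():
    print(f"{name}: slot_millis {slot_ratio:.1f}x lower, server_time {time_ratio:.1f}x faster")
```

The simple aggregations benefit the most, while the query with the join shows almost no change in server time.
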

Copyright 2021 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License