# Before you begin

1. Use the [Cloud Resource Manager](https://console.cloud.google.com/cloud-resource-manager) to create a Cloud Platform project if you do not already have one.
2. [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery) for the project.
4. Provide your credentials to the runtime.

In [2]:
from google.colab import auth
auth.authenticate_user()

In [None]:
GCP_PROJECT = 'httparchive'  # @param {type: "string"}

The [`pages`](https://har.fyi/reference/tables/pages/) table contains details about each page tracked in the archive. It includes timings, the number of requests, types of requests and byte sizes. You can see the table schema by [selecting the table in the BigQuery UI](https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1shttparchive!2sall!3spages) or reading the [schema reference](https://har.fyi/reference/tables/pages/#schema). You can also preview the contents by clicking the **Preview** button.

In [5]:
%%bigquery df_preview --project {GCP_PROJECT}
-- This query will process 898 MB when run.
SELECT *
FROM `httparchive.all.pages`
WHERE date = '2024-05-01'
    AND client='desktop'
    AND is_root_page
    AND rank = 1000
    AND page = 'https://www.google.com/'

In [6]:
df_preview.head()

| | date | client | page | is_root_page | root_page | rank | wptid | payload | summary | custom_metrics | lighthouse | features | technologies | metadata |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-05-01 | desktop | https://www.google.com/ | True | https://www.google.com/ | 1000 | 240514_Dx1RL_DMOOB | `{"startedDateTime":"2024-05-16T10:40:35.950225...` | `{"metadata": "{\"rank\": 1000, \"page_id\": 22...` | `{"00_reset":null,"Colordepth":24,"Dpi":{"dppx"...` | `{"lighthouseVersion":"12.0.0","requestedUrl":"...` | `[{'feature': 'V8SloppyMode', 'id': '1075', 'ty...` | `[{'technology': 'Apple iCloud Mail', 'categori...` | `{"rank":1000,"page_id":22893419,"tested_url":"...` |


Let's start exploring this table with a simple query and build up to something more interesting. How many pages are included in the May 2024 desktop data? To find out, we'll use [COUNT(0)](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#count) in a simple aggregate query.

In [3]:
%%bigquery --project {GCP_PROJECT}
-- This query will process 397 MB when run.
SELECT
    COUNT(0) pages_total
FROM `httparchive.all.pages`
WHERE date = '2024-05-01'
    AND client='desktop'

| | pages_total |
|---|---|
| 0 | 24485986 |


Next, let's calculate the average number of requests per page across all 24.5 million pages. In the table preview shown earlier, the `reqTotal` metric within the `summary` column contained the total number of requests on the page, and Google's homepage had 39 requests. Because `summary` is stored as a JSON string, the query extracts the value with `JSON_VALUE()` and casts it to `INT64`. To calculate the average, we'll use SQL's [AVG()](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#avg) function.

In [4]:
%%bigquery --project {GCP_PROJECT}
-- This query will process 50 GB when run.
SELECT
    COUNT(0) pages_total,
    AVG(CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64)) avg_requests
FROM `httparchive.all.pages`
WHERE date = '2024-05-01'
    AND client='desktop'

| | pages_total | avg_requests |
|---|---|---|
| 0 | 24485986 | 97.734921 |
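
The `JSON_VALUE(...)` plus `CAST(... AS INT64)` pattern in the query above has a close analogue in plain Python, which can be handy when inspecting a single row. The sketch below is illustrative only: `summary_json` is a made-up, heavily abbreviated stand-in for a real `summary` value, and `req_total` is a hypothetical helper name.

```python
import json

# Hypothetical, abbreviated stand-in for one row's `summary` value.
# 39 matches the request count quoted for Google's homepage; numDomains is made up.
summary_json = '{"reqTotal": 39, "numDomains": 7}'


def req_total(summary):
    """Mimic CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64) for one row."""
    value = json.loads(summary).get("reqTotal")
    # JSON_VALUE yields NULL for a missing key; None plays that role here.
    return None if value is None else int(value)


print(req_total(summary_json))
```

In the notebook itself, the same helper could be applied to a real row, for example `req_total(df_preview.loc[0, 'summary'])`, assuming the preview query has been run.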


The average number of requests per page is 97.734921. Let's use the [ROUND() function](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#round) to round this to two decimal places.

In [7]:
%%bigquery --project {GCP_PROJECT}
-- This query will process 50 GB when run.
SELECT
    COUNT(0) pages_total,
    ROUND(AVG(CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64)), 2) avg_requests
FROM `httparchive.all.pages`
WHERE date = '2024-05-01'
    AND client='desktop'

| | pages_total | avg_requests |
|---|---|---|
| 0 | 24485986 | 97.73 |


You may have heard the phrase "averages are misleading", and that's certainly true here. While averages are an easy and familiar way to represent stats, they hide a lot of detail and are easily skewed by outliers. Let's now explore the number of requests per page with percentiles.
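
As a quick illustration of how a single outlier can drag an average away from the typical value, here is a minimal sketch. The request counts are made up for the example and are not taken from the HTTP Archive.

```python
from statistics import mean, median

# Made-up request counts for ten pages; the last page is an extreme outlier.
requests_per_page = [20, 35, 40, 55, 60, 70, 80, 95, 110, 1500]

avg = mean(requests_per_page)
mid = median(requests_per_page)
print(f"average: {avg}, median: {mid}")
```

With this toy data the average is 206.5 while the median is 65, even though nine of the ten pages make 110 requests or fewer.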

In Standard SQL, we can use the [APPROX_QUANTILES()](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#approx_quantiles) function to calculate approximate percentile boundaries for a field, which are returned as an array. If we combine that with the [SAFE_ORDINAL()](https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_offset-and-safe_ordinal) function, we can select the percentile that interests us from this array. In the example below, we create 100 quantiles and then select the Nth element of the array to approximate the Nth percentile. So, `APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(50)]` is approximately the 50th percentile, or the median. Strictly speaking, the array has 101 elements (the minimum, 99 interior boundaries and the maximum) and `SAFE_ORDINAL` is 1-based, so `SAFE_ORDINAL(50)` selects the 49th percentile boundary while `SAFE_OFFSET(50)` selects the median position; on a dataset this large the two are usually close.
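
To make the array indexing concrete, here is a minimal sketch in plain Python. It is not BigQuery's algorithm: `quantile_boundaries` is a hypothetical helper that computes exact nearest-rank boundaries on made-up data, whereas `APPROX_QUANTILES` returns approximations. It only illustrates how a 0-based offset and a 1-based ordinal address the `number + 1` boundaries.

```python
def quantile_boundaries(values, number):
    """Exact stand-in for APPROX_QUANTILES: returns number + 1 boundaries,
    from the minimum (index 0) to the maximum (index `number`)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / number)] for i in range(number + 1)]


values = list(range(0, 101))  # made-up data: 0, 1, ..., 100
boundaries = quantile_boundaries(values, 100)

offset_50 = boundaries[50]       # like [SAFE_OFFSET(50)], 0-based
ordinal_50 = boundaries[50 - 1]  # like [SAFE_ORDINAL(50)], 1-based
print(len(boundaries), offset_50, ordinal_50)
```

On this toy data the array has 101 entries, the offset lookup returns 50 and the ordinal lookup returns 49, which is why the two indexing styles can differ by one percentile step.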

Let's do this for the 25th, 50th, 75th and 95th percentiles and see how that compares to the average.

In [8]:
%%bigquery --project {GCP_PROJECT}
-- This query will process 50 GB when run.
WITH pages AS (
    SELECT CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64) AS requests_total
    FROM `httparchive.all.pages`
    WHERE date = '2024-05-01'
        AND client='desktop'
)

SELECT
    COUNT(0) pages,
    ROUND(AVG(requests_total),2) avg_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(25)] p25_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(50)] p50_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(75)] p75_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(95)] p95_requests
FROM pages

| | pages | avg_requests | p25_requests | p50_requests | p75_requests | p95_requests |
|---|---|---|---|---|---|---|
| 0 | 24485986 | 97.73 | 40 | 69 | 111 | 213 |


When we look at the results from this query, the median number of requests per page was 69, well below the average of 97.73. The average was in fact skewed by outliers. Also, since the 25th percentile is 40 requests and the 75th percentile is 111 requests, 50% of the 24.5 million pages tracked by the HTTP Archive have between 40 and 111 requests. The span between these two percentiles is known as the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range).
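
The arithmetic behind these observations is small enough to check by hand. The sketch below simply plugs in the values reported by the query above; the variable names are illustrative.

```python
# Values reported by the query above (May 2024, desktop).
p25, p50, p75 = 40, 69, 111
avg = 97.73

iqr = p75 - p25                 # width of the middle 50% of pages
skew_gap = round(avg - p50, 2)  # how far the average sits above the median
print(iqr, skew_gap)
```

The middle half of pages spans 71 requests, and the average sits almost 29 requests above the median, which is consistent with a distribution that has a long right tail.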

Now let's add another dimension to this query. The `numDomains` metric within the `summary` column counts the number of unique domain names used across all of the page's requests. If we add `numDomains` to the query and `GROUP BY` it, we can see these stats broken down by the number of domains per page. In this next example, we'll use the `HAVING` clause to limit the results to domain counts that have more than 1,000 pages.

In [9]:
%%bigquery df_domains --project {GCP_PROJECT}
-- This query will process 50 GB when run.
WITH pages AS (
    SELECT
        CAST(JSON_VALUE(summary, '$.reqTotal') AS INT64) AS requests_total,
        CAST(JSON_VALUE(summary, '$.numDomains') AS INT64) AS number_domains
    FROM `httparchive.all.pages`
    WHERE date = "2024-05-01"
        AND client = 'desktop'
)

SELECT
    number_domains,
    COUNT(0) pages,
    ROUND(AVG(requests_total), 2) avg_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(25)] p25_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(50)] p50_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(75)] p75_requests,
    APPROX_QUANTILES(requests_total, 100)[SAFE_ORDINAL(95)] p95_requests
FROM pages
GROUP BY number_domains
HAVING pages > 1000
ORDER BY number_domains ASC

In [10]:
df_domains

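| | number_domains | pages | avg_requests | p25_requests | p50_requests | p75_requests | p95_requests |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 835476 | 28.04 | 8 | 19 | 35 | 75 |
| 1 | 2 | 480351 | 37.90 | 15 | 28 | 47 | 90 |
| 2 | 3 | 875923 | 48.63 | 23 | 39 | 61 | 108 |
| 3 | 4 | 714381 | 53.72 | 27 | 44 | 65 | 115 |
| 4 | 5 | 826471 | 60.58 | 32 | 50 | 75 | 127 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 135 | 164 | 1096 | 660.70 | 578 | 638 | 717 | 870 |
| 136 | 165 | 1062 | 659.99 | 574 | 631 | 702 | 860 |
| 137 | 166 | 1020 | 656.51 | 573 | 639 | 707 | 850 |
| 138 | 167 | 1101 | 659.79 | 571 | 637 | 708 | 874 |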
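
For readers less familiar with SQL, the `GROUP BY`, `HAVING` and `ORDER BY` steps in the query above can be sketched in plain Python. The rows below are made up, and `MIN_PAGES` is a hypothetical stand-in for the query's threshold of 1000, lowered so that the toy data survives the filter.

```python
from collections import defaultdict
from statistics import mean

# Made-up (number_domains, requests_total) pairs standing in for page rows.
rows = [(1, 8), (1, 20), (1, 32), (2, 15), (2, 45), (3, 40)]
MIN_PAGES = 1  # the query uses 1000; lowered for this toy data

groups = defaultdict(list)
for number_domains, requests_total in rows:      # GROUP BY number_domains
    groups[number_domains].append(requests_total)

result = [
    (domains, len(reqs), round(mean(reqs), 2))   # COUNT(0), ROUND(AVG(...), 2)
    for domains, reqs in sorted(groups.items())  # ORDER BY number_domains ASC
    if len(reqs) > MIN_PAGES                     # HAVING pages > MIN_PAGES
]
print(result)
```

The group with three domains contains a single page, so it is dropped by the filter, just as sparsely populated domain counts are dropped from the real result.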


The result contains 139 rows of data. Now that we're dealing with larger result sets, it's time to start graphing them! In BigQuery you can save your query results to a CSV file or a Google Sheet, or export them to Data Studio for visualization. In this guide, we'll visualize some of this data directly in the notebook.

When we look at the relationship between the number of domains and the number of pages, it looks like a fair number of sites load content from fewer than 25 unique domains. Using the same technique we practiced above, we can validate this by calculating the percentiles for the number of domains.

In [11]:
%%bigquery --project {GCP_PROJECT}
-- This query will process 50 GB when run.
WITH pages AS (
    SELECT CAST(JSON_VALUE(summary, '$.numDomains') AS INT64) AS number_domains
    FROM `httparchive.all.pages`
    WHERE date = "2024-05-01"
        AND client = 'desktop'
)
SELECT
    APPROX_QUANTILES(number_domains, 100)[SAFE_ORDINAL(25)] p25_number_domains,
    APPROX_QUANTILES(number_domains, 100)[SAFE_ORDINAL(50)] p50_number_domains,
    APPROX_QUANTILES(number_domains, 100)[SAFE_ORDINAL(75)] p75_number_domains,
    APPROX_QUANTILES(number_domains, 100)[SAFE_ORDINAL(95)] p95_number_domains
FROM pages

| | p25_number_domains | p50_number_domains | p75_number_domains | p95_number_domains |
|---|---|---|---|---|
| 0 | 5 | 9 | 15 | 30 |


When we put all of this together, we can see some interesting patterns. For example:

- The number of requests per page increases roughly linearly with the number of domains.
- The long tail of the histogram of domains per site represents a fairly large percentage of sites: 25% of sites have more than 15 domains.
- The median number of requests per page tracks close to the average.
- There is a wide gap between the 75th and 95th percentiles of requests per page, which remains consistent across all domain groupings.
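
One way to sanity-check the first observation is a least-squares fit. The sketch below uses only the first five rows of the `df_domains` output shown earlier, so it says nothing about the full range of domain counts; it is a rough illustration rather than a rigorous test of linearity.

```python
# First five rows of the df_domains output shown earlier.
number_domains = [1, 2, 3, 4, 5]
avg_requests = [28.04, 37.90, 48.63, 53.72, 60.58]

n = len(number_domains)
mean_x = sum(number_domains) / n
mean_y = sum(avg_requests) / n

# Ordinary least squares for y = intercept + slope * x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(number_domains, avg_requests))
sxx = sum((x - mean_x) ** 2 for x in number_domains)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"about {slope:.2f} extra requests per additional domain")
```

For these five rows the fitted slope is about 8 requests per additional domain. In the notebook, the same calculation could be run over every row of `df_domains` to see whether the trend holds further out.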

In [12]:
import altair
chart = altair.Chart(df_domains).mark_point().encode(
    x='number_domains',
    y='pages',
    tooltip=['number_domains', 'pages']
).properties(
    width=1200,
    height=600
)
chart

In [13]:
chart = altair.Chart(df_domains).mark_point().encode(
    x='number_domains',
    y='avg_requests',
    tooltip=['number_domains', 'avg_requests']
).properties(
    width=1200,
    height=600
)
chart

In [14]:
# Reshape the percentile columns from wide to long so each percentile becomes its own line.
melted_df = df_domains.melt(
    id_vars='number_domains',
    value_vars=['p25_requests', 'p50_requests', 'p75_requests', 'p95_requests'],
    var_name='percentile',
    value_name='requests'
)

chart = altair.Chart(melted_df).mark_line().encode(
    x=altair.X('number_domains:O', title='Number of domains'),
    y=altair.Y('requests:Q', title='Requests per page'),
    color='percentile:N',
    tooltip=['percentile', 'number_domains', 'requests']
).properties(
    title='Request percentiles by number of domains',
    width=1200,
    height=600
)
chart

Let's step back and look at another example. In the `pages` table, the `summary` column contains the `num_scripts_sync` and `num_scripts_async` metrics, which indicate the number of sync and async scripts per page. We can run a simple query using the techniques you learned above to see how they relate to each other.

In [15]:
%%bigquery df_sync_async_scripts --project {GCP_PROJECT}
-- This query will process 50 GB when run.
SELECT
    CAST(JSON_VALUE(summary, '$.num_scripts_async') AS INT64) AS number_scripts_async,
    CAST(JSON_VALUE(summary, '$.num_scripts_sync') AS INT64) AS number_scripts_sync,
    COUNT(0) AS pages_total
FROM `httparchive.all.pages`
WHERE date = "2024-05-01"
    AND client = 'desktop'
GROUP BY
    number_scripts_async,
    number_scripts_sync
HAVING pages_total > 100

In [16]:
df_sync_async_scripts

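| | number_scripts_async | number_scripts_sync | pages_total |
|---|---|---|---|
| 0 | 9 | 89 | 218 |
| 1 | 1 | 60 | 1826 |
| 2 | 21 | 15 | 3724 |
| 3 | 16 | 63 | 329 |
| 4 | 48 | 4 | 738 |
| ... | ... | ... | ... |
| 3925 | 50 | 30 | 132 |
| 3926 | 3 | 22 | 30459 |
| 3927 | 67 | 5 | 914 |
| 3928 | 20 | 10 | 6934 |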


The results contain many rows of data, so let's visualize them again. To look at the relationship between these two metrics, the chart below cross-tabulates the results like a pivot table, and a color scale turns it into a heat map.

In [17]:
heatmap = altair.Chart(df_sync_async_scripts).mark_rect().encode(
    x=altair.X('number_scripts_sync:O', title='Number of Scripts Sync'),
    y=altair.Y('number_scripts_async:O', title='Number of Scripts Async'),
    color=altair.Color('pages_total:Q', title='Pages Total', scale=altair.Scale(type='quantile', scheme='plasma')),
    tooltip=['number_scripts_sync', 'number_scripts_async', 'pages_total']
).properties(
    title='Heatmap of Pages Total by Number of Scripts Sync and Async',
    width=1200,
    height=600
)
heatmap
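
The cross-tabulation behind the heat map above can also be sketched without a charting library. The example below pivots only the first five rows of the `df_sync_async_scripts` output shown earlier, and the nested-dictionary layout is one possible choice, not the only one.

```python
# First five rows of the df_sync_async_scripts output shown earlier:
# (number_scripts_async, number_scripts_sync, pages_total)
rows = [(9, 89, 218), (1, 60, 1826), (21, 15, 3724), (16, 63, 329), (48, 4, 738)]

# pivot[async_count][sync_count] -> pages_total
pivot = {}
for n_async, n_sync, pages_total in rows:
    pivot.setdefault(n_async, {})[n_sync] = pages_total

sync_values = sorted({n_sync for _, n_sync, _ in rows})
for n_async in sorted(pivot):
    # Combinations with no row in this sample are shown as 0.
    print(n_async, [pivot[n_async].get(s, 0) for s in sync_values])
```

Each printed line is one row of the pivot table: an async script count followed by the page totals for every sync script count in the sample.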

You can also make a copy of the notebook and experiment with your own visualization ideas for the data.

In [Part 2](https://colab.research.google.com/github/rviscomi/har.fyi/blob/main/workbooks/exploring_httparchive-all-requests_tables.ipynb) we'll explore the requests tables.