In [4]:
from google.colab import auth
auth.authenticate_user()

In [5]:
GCP_PROJECT = 'httparchive'  # @param {type: "string"}

The [`requests`](https://har.fyi/reference/tables/requests/) table contains details about all HTTP requests made by the 24.5 million pages tracked in the archive. This dataset is quite large, as the aggregate query below shows by counting all rows in the table for the desktop crawl of 2024-05-01.

In [6]:
# This query will process 38 GB when run.
%%bigquery --project {GCP_PROJECT}
SELECT COUNT(0) AS requests_total
FROM `httparchive.crawl.requests`
WHERE date = "2024-05-01"
  AND client = 'desktop'

|   | requests_total |
|---|---|
| 0 | 2393199079 |


In [Part 1](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_httparchive-all-pages_tables.ipynb) we looked at a sample of the data in the `pages` table. The following query returns a single sample row from the `requests` table. The `url` column is a unique identifier for each request. The `page` column corresponds to the `page` column in the `pages` table and can be used to relate the two datasets. Numerous other columns hold request and response headers, resource types, and even a partial dump of the response body in the `response_body` column.

In [7]:
# This query will process 10 GB when run.
%%bigquery --project {GCP_PROJECT}
SELECT *
FROM `httparchive.crawl.requests` TABLESAMPLE SYSTEM (5 PERCENT)
WHERE date = "2024-05-01"
  AND client = 'desktop'
  AND is_root_page
  AND is_main_document
  AND CONTAINS_SUBSTR(page, 'www.google.com')
LIMIT 1

|   | date | client | page | is_root_page | root_page | url | is_main_document | type | index | payload | summary | request_headers | response_headers | response_body |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-05-01 | desktop | https://www.google.com.gh/ | True | https://www.google.com.gh/ | https://www.google.com.gh/ | True | html | 1 | `{"pageref":"page_1_0_1","_run":1,"_cached":0,"...` | `{"requestid": 97122220018499585, "pageid": 226...` | `[{'name': 'accept', 'value': 'text/html,applic...` | `[{'name': 'accept-ch', 'value': 'Sec-CH-UA-Pla...` | `<!doctype html><html itemscope="" itemtype="ht...` |
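The `request_headers` and `response_headers` values in the sample row appear as lists of name/value records. A minimal sketch of turning such a list into a dictionary for easier lookups is shown here. The helper name `headers_to_dict` and the sample values are hypothetical, and the sketch assumes each entry is a Python dict with `name` and `value` keys, which may not match exactly how the column arrives in your DataFrame.

```python
def headers_to_dict(headers):
    """Collapse a list of {'name': ..., 'value': ...} records into a dict.

    Header names are lowercased; repeated headers are joined with ', '.
    """
    result = {}
    for header in headers:
        name = header["name"].lower()
        value = header["value"]
        if name in result:
            result[name] = f"{result[name]}, {value}"
        else:
            result[name] = value
    return result


# Hypothetical header list, shaped like the truncated sample output.
sample_headers = [
    {"name": "Accept", "value": "text/html"},
    {"name": "accept-language", "value": "en-US"},
    {"name": "Accept", "value": "application/xhtml+xml"},
]

print(headers_to_dict(sample_headers))
```

Lowercasing the names is a design choice that makes lookups such as `headers_to_dict(row)["content-type"]` independent of how a server capitalized the header.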


Since each page tracked by the HTTP Archive has a unique `page` URL value, we can also summarize these results by the number of distinct pages. For example, the following query tells us that there are 2,393,199,079 requests in this dataset and that they were loaded from 24,485,975 pages.

In [8]:
# This query will process 137 GB when run.
%%bigquery --project {GCP_PROJECT}
SELECT
  COUNT(0) requests,
  COUNT(DISTINCT page) pages
FROM `httparchive.crawl.requests`
WHERE date = "2024-05-01"
  AND client = 'desktop'

|   | requests | pages |
|---|---|---|
| 0 | 2393199079 | 24485975 |
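Dividing the two counts gives a quick sense of scale. The short sketch below is plain arithmetic on the numbers returned by the query, not an additional query; the variable names are chosen for illustration.

```python
# Counts copied from the query output.
requests_total = 2_393_199_079
pages_total = 24_485_975

# Mean number of requests per page for this crawl and client.
requests_per_page = requests_total / pages_total

print(f"{requests_per_page:.1f} requests per page on average")
```

This is a mean across all pages, so it says nothing about how requests are spread between light and heavy pages.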


We already knew the number of pages from previous queries, but now let's add a dimension to the query to explore this table some more. The following query adds the `type` column, which indicates the type of resource loaded (e.g., script, image, css).

In [9]:
# This query will process 155 GB when run.
%%bigquery --project {GCP_PROJECT}
SELECT
  type,
  COUNT(0) requests,
  COUNT(DISTINCT page) pages
FROM `httparchive.crawl.requests`
WHERE date = '2024-05-01'
    AND client = 'desktop'
GROUP BY type
ORDER BY requests DESC

|   | type | requests | pages |
|---|---|---|---|
| 0 | script | 813565233 | 23556030 |
| 1 | image | 747018774 | 24146223 |
| 2 | css | 278463773 | 23259825 |
| 3 | other | 194533696 | 11901742 |
| 4 | html | 185868729 | 24145122 |
| 5 | font | 128104348 | 21224863 |
| 6 | text | 64065722 | 14736086 |
| 7 | video | 7191779 | 1407697 |
| 8 | xml | 2714133 | 747170 |
| 9 | audio | 1177141 | 518769 |


Looking at this data, we can see counts of requests and pages for each content type. But what if we want the percentage of the total? You could divide by the totals we already know, but a more repeatable approach is to compute the totals inside the query. The following query does this in a `WITH` clause, using window functions (`COUNT(0) OVER()` and `COUNT(DISTINCT page) OVER()`) to attach the overall totals to every row. The number of requests per type is then divided by the total number of requests, and likewise for pages. As we saw in Part 1, we can use the `ROUND()` function to trim the result to 2 decimal places for readability.

In [11]:
# This query will process 153 GB when run.
%%bigquery requests_type_df --project {GCP_PROJECT}
WITH requests AS (
  SELECT
    type,
    page,
    COUNT(0) OVER() AS total_requests,
    COUNT(DISTINCT page) OVER() AS total_pages
  FROM `httparchive.crawl.requests`
  WHERE date = '2024-05-01'
    AND client = 'desktop'
)

SELECT
  type,
  COUNT(0) requests,
  COUNT(DISTINCT page) pages,
  ROUND(COUNT(0) / ANY_VALUE(total_requests), 2) percent_requests,
  ROUND(COUNT(DISTINCT page) / ANY_VALUE(total_pages), 2) percent_pages
FROM requests
GROUP BY type
ORDER BY requests DESC

In [12]:
requests_type_df

|   | type | requests | pages | percent_requests | percent_pages |
|---|---|---|---|---|---|
| 0 | script | 817636850 | 23885146 | 0.34 | 0.98 |
| 1 | image | 744181758 | 24465443 | 0.31 | 1.0 |
| 2 | css | 279208078 | 23575222 | 0.12 | 0.96 |
| 3 | other | 187424669 | 12001071 | 0.08 | 0.49 |
| 4 | html | 162335275 | 24462613 | 0.07 | 1.0 |
| 5 | font | 128926306 | 21491621 | 0.05 | 0.88 |
| 6 | text | 62780147 | 14798581 | 0.03 | 0.6 |
| 7 | video | 6874793 | 1377851 | 0.0 | 0.06 |
| 8 | xml | 2504041 | 772155 | 0.0 | 0.03 |
| 9 | audio | 1184613 | 504820 | 0.0 | 0.02 |
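The percent-of-total arithmetic for requests can also be reproduced outside SQL. The sketch below uses plain Python and the request counts from the output above. It assumes the ten listed types account for essentially all requests, so their sum stands in for the total that `COUNT(0) OVER()` computes in the query.

```python
# Request counts per type, copied from the output above.
requests_by_type = {
    "script": 817_636_850,
    "image": 744_181_758,
    "css": 279_208_078,
    "other": 187_424_669,
    "html": 162_335_275,
    "font": 128_926_306,
    "text": 62_780_147,
    "video": 6_874_793,
    "xml": 2_504_041,
    "audio": 1_184_613,
}

# Assumption: the listed types cover (nearly) all requests, so their sum
# approximates the overall request total used as the denominator.
total_requests = sum(requests_by_type.values())

# Share of all requests per type, rounded to 2 decimal places like ROUND(..., 2).
percent_requests = {
    resource_type: round(count / total_requests, 2)
    for resource_type, count in requests_by_type.items()
}

for resource_type, share in percent_requests.items():
    print(f"{resource_type:<7}{share:.0%}")
```

The same shortcut does not work for `percent_pages`: a page that loads several resource types is counted once under each type, so the per-type page counts add up to far more than the number of distinct pages. That denominator has to come from a separate distinct count, as the query does.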


Graphing this shows both the distribution of content types across all requests in the archive and how common each type of content is across pages. For example, after rounding, 100% of pages contain images and HTML, 98% contain JavaScript, and 96% contain CSS. 88% contain custom web fonts and 6% contain video files.

In [13]:
import altair as alt

pie_chart_requests = alt.Chart(requests_type_df).mark_arc().encode(
    theta=alt.Theta(field="percent_requests", type="quantitative", stack=True),
    color=alt.Color(field="type", type="nominal", sort=None),
    order=alt.Order(field="percent_requests", type="quantitative",sort='descending'),
    tooltip=["type", alt.Tooltip("percent_requests:Q", format=".0%")]
).properties(
    title="Requests by Type"
)
pie_chart_requests + pie_chart_requests.mark_text(
    radius=170
).encode(
    text=alt.Text("percent_requests:Q", format=".0%")
)

In [14]:
bar_chart_pages = alt.Chart(requests_type_df).mark_bar().encode(
    x=alt.X("type", title="Type", sort=None),
    y=alt.Y("percent_pages", title="Percent Pages"),
    color=alt.Color(field="type", sort=None),
    tooltip=["type", alt.Tooltip("percent_pages:Q", format=".0%")]
).properties(
    title="Percent Pages by Type"
)
bar_chart_pages + bar_chart_pages.mark_text(
    dy=-10,
).encode(
    text=alt.Text("percent_pages:Q", format=".0%")
)

Let's extend this query further and look at the formats within each type. The following query summarizes the percentage of requests and pages for each file type and format.

The `summary` column stores quite a lot of data, and analyzing it in full would mean scanning ~10 TB. So let's look at root pages only and sample the data scanned by the query using the [`TABLESAMPLE`](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#tablesample_operator) operator to reduce the cost of the query.

In [15]:
# This query will process 129 GB when run.
%%bigquery --project {GCP_PROJECT}
WITH requests AS (
  SELECT
    type,
    STRING(summary.format) AS format,
    page,
    COUNT(0) OVER() AS total_requests,
    COUNT(DISTINCT page) OVER() AS total_pages
  FROM `httparchive.crawl.requests` TABLESAMPLE SYSTEM (5 PERCENT)
  WHERE date = '2024-05-01'
    AND client = 'desktop'
    AND is_root_page
)

SELECT
  type,
  format,
  COUNT(0) requests,
  COUNT(DISTINCT page) pages,
  ROUND(COUNT(0) / ANY_VALUE(total_requests), 2) percent_requests,
  ROUND(COUNT(DISTINCT page) / ANY_VALUE(total_pages), 2) percent_pages
FROM requests
GROUP BY
  type,
  format
ORDER BY requests DESC

|   | type | format | requests | pages | percent_requests | percent_pages |
|---|---|---|---|---|---|---|
| 0 | script |  | 21085473 | 8030900 | 0.32 | 0.69 |
| 1 | image | jpg | 8696102 | 4322825 | 0.13 | 0.37 |
| 2 | image | png | 6928635 | 4122297 | 0.11 | 0.35 |
| 3 | css |  | 6775517 | 4371876 | 0.1 | 0.38 |
| 4 | other |  | 4447765 | 1466010 | 0.07 | 0.13 |
| 5 | html |  | 4139296 | 2255120 | 0.06 | 0.19 |
| 6 | image | gif | 4022545 | 1539581 | 0.06 | 0.13 |
| 7 | font |  | 3812048 | 2903219 | 0.06 | 0.25 |
| 8 | text |  | 1406090 | 1120520 | 0.02 | 0.1 |
| 9 | image | svg | 1316180 | 942875 | 0.02 | 0.08 |
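Because this query reads only a sample of the table, the `requests` and `pages` counts describe the sample rather than the whole crawl, while the percentage columns are ratios within the sample and need no adjustment. If a rough full-table figure is wanted, a sampled count can be scaled up by the sampling rate. The sketch below is an approximation only: `TABLESAMPLE SYSTEM` selects whole data blocks, so the share of rows actually read is only roughly 5 percent. The helper name `estimate_full_count` is hypothetical.

```python
SAMPLE_PERCENT = 5  # matches TABLESAMPLE SYSTEM (5 PERCENT) in the query


def estimate_full_count(sampled_count, sample_percent=SAMPLE_PERCENT):
    """Rough scale-up of a count observed in a sampled query.

    Assumes the sample holds about sample_percent of the rows, which block
    sampling only approximates.
    """
    return round(sampled_count * 100 / sample_percent)


# jpg image requests observed in the sampled output above.
sampled_jpg_requests = 8_696_102

print(estimate_full_count(sampled_jpg_requests))
```

Treat the scaled number as an order-of-magnitude estimate for root pages, not as a replacement for an unsampled query.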


Using a `WHERE` clause, we can filter out all of the non-image content and examine the popularity of various image formats. For example, how often are "jpg", "gif", "webp", etc. used?

By filtering on the `type` column, we take advantage of the table's clustering to reduce the data scanned by the query even further.

In [16]:
# This query will process 41 GB when run.
%%bigquery requests_image_df --project {GCP_PROJECT}
WITH requests AS (
  SELECT
    STRING(summary.format) AS format,
    page,
    COUNT(0) OVER() AS total_requests,
    COUNT(DISTINCT page) OVER() AS total_pages
  FROM `httparchive.crawl.requests` TABLESAMPLE SYSTEM (5 PERCENT)
  WHERE date = '2024-05-01'
    AND client = 'desktop'
    AND is_root_page
    AND type = 'image'
)

SELECT
  format,
  COUNT(0) requests,
  COUNT(DISTINCT page) pages,
  ROUND(COUNT(0) / ANY_VALUE(total_requests), 2) percent_requests,
  ROUND(COUNT(DISTINCT page) / ANY_VALUE(total_pages), 2) percent_pages
FROM requests
GROUP BY format
ORDER BY requests DESC

In [17]:
requests_image_df

|   | format | requests | pages | percent_requests | percent_pages |
|---|---|---|---|---|---|
| 0 | jpg | 8290474 | 4189907 | 0.38 | 0.53 |
| 1 | png | 6604473 | 3984064 | 0.31 | 0.5 |
| 2 | gif | 3836964 | 1487233 | 0.18 | 0.19 |
| 3 | svg | 1255002 | 906022 | 0.06 | 0.11 |
| 4 | webp | 1040312 | 630692 | 0.05 | 0.08 |
| 5 | ico | 272608 | 269039 | 0.01 | 0.03 |
| 6 |  | 191874 | 148218 | 0.01 | 0.02 |
| 7 | avif | 131101 | 81393 | 0.01 | 0.01 |
| 8 |  | 23113 | 17699 | 0.0 | 0.0 |
| 9 | heic | 2156 | 1783 | 0.0 | 0.0 |


**Note**: It's important to understand the bias in the data when doing this type of analysis. While the dataset covers a very diverse set of millions of pages, every page is loaded with Chrome (both desktop and emulated mobile). Because of this, some formats may be under-represented: Chrome supports webp but not jpeg-xr or jpeg2000, so sites that choose an image format based on browser support will not serve those formats to the crawler. You may find cases like this with other types of technologies as well, for example custom web font formats that vary based on browser support.

Let's graph the results now to see what types of images are being served to Chrome browsers.

In [18]:
pie_chart_image_formats = alt.Chart(requests_image_df).mark_arc().encode(
    theta=alt.Theta(field="percent_requests", type="quantitative", stack=True),
    color=alt.Color(field="format", sort=None),
    order=alt.Order(field="percent_requests", type="quantitative",sort='descending'),
    tooltip=["format", alt.Tooltip("percent_requests:Q", format=".0%")]
).properties(
    title="Distribution of image formats"
)
pie_chart_image_formats + pie_chart_image_formats.mark_text(
    radius=170
).encode(
    text=alt.Text("percent_requests:Q", format=".0%")
)

In [19]:
bar_chart_image_pages = alt.Chart(requests_image_df).mark_bar().encode(
    x=alt.X("format", title="Format", sort=None),
    y=alt.Y("percent_pages", title="Percent Pages", sort=None),
    color=alt.Color(field="format", sort=None),
    tooltip=["format", alt.Tooltip("percent_pages:Q", format=".0%")]
).properties(
    title="Percent Pages by Image Format"
)
bar_chart_image_pages + bar_chart_image_pages.mark_text(
    dy=-10,
).encode(
    text=alt.Text("percent_pages:Q", format=".0%")
)


Let's explore a simple histogram of the requests dataset by looking at the distribution of response sizes. Histograms are useful for representing the distribution of data, by organizing a range of values into "bins" (or buckets), and then counting the number of values that fall into each of the bins. If you are not familiar with this type of visualization, then you can [read more about histograms here](https://en.wikipedia.org/wiki/Histogram).

In the example below, we'll use a histogram to visualize the size of individual responses served from websites in the dataset (the query samples root pages to keep the cost down). To do this, we'll use the `respBodySize` metric from the `summary` column, which represents the size of the response payload in bytes. Since 1 byte is very granular, we'll divide by 1024 to get KB and then by 100 so that we are looking at this data with bin sizes of 100 KB. We'll also wrap this in a `CEIL()` function to round up to a whole number and then multiply the result by 100, so that each bin is labeled by its upper edge. Using this technique, 1234567 bytes (about 1205.6 KB) would be rounded up to a bin of 1300 KB.
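The binning arithmetic is easy to check in plain Python before running it in SQL. This is a minimal sketch mirroring the `CEIL(... / 1024 / 100) * 100` expression; the function name `size_bin_kb` and its `bin_kb` parameter are introduced here for illustration and are not part of the workbook.

```python
import math


def size_bin_kb(size_bytes, bin_kb=100):
    """Return the upper edge, in KB, of the bin that contains size_bytes."""
    return math.ceil(size_bytes / 1024 / bin_kb) * bin_kb


print(size_bin_kb(1234567))      # 100 KB bins
print(size_bin_kb(1234567, 10))  # 10 KB bins
```

With 100 KB bins the example value lands in the 1300 KB bin, and with 10 KB bins it lands in the 1210 KB bin. Note that Python returns an integer here, whereas the query output shows floating-point labels such as `100.0`.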

In [23]:
# This query will process 125 GB when run.
%%bigquery response_size_df --project {GCP_PROJECT}
WITH requests AS (
  SELECT
    CEIL(INT64(summary.respBodySize)/1024/100)*100 AS responseSize100KB,
    COUNT(0) OVER () AS total_requests
  FROM `httparchive.crawl.requests` TABLESAMPLE SYSTEM (5 PERCENT)
  WHERE date = '2024-05-01'
    AND client = 'desktop'
    AND is_root_page
    AND INT64(summary.respBodySize) > 0
)

SELECT
  responseSize100KB,
  COUNT(0) AS requests,
  COUNT(0)/ANY_VALUE(total_requests) AS pct_requests
FROM requests
GROUP BY responseSize100KB
ORDER BY responseSize100KB ASC

In [24]:
response_size_df

|   | responseSize100KB | requests | pct_requests |
|---|---|---|---|
| 0 | 100.0 | 50125121 | 9.134219e-01 |
| 1 | 200.0 | 2343976 | 4.271389e-02 |
| 2 | 300.0 | 818513 | 1.491563e-02 |
| 3 | 400.0 | 405782 | 7.394499e-03 |
| 4 | 500.0 | 252538 | 4.601959e-03 |
| ... | ... | ... | ... |
| 555 | 67000.0 | 1 | 1.822284e-08 |
| 556 | 67300.0 | 1 | 1.822284e-08 |
| 557 | 67700.0 | 1 | 1.822284e-08 |
| 558 | 67800.0 | 1 | 1.822284e-08 |


When we analyze this data, we can see that 91% of the sampled requests with a non-empty response body have a response size of 100 KB or less. Try repeating this with 10 KB bin sizes and you'll be able to see the spread of response sizes with more granularity.
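A running total of the per-bin shares shows how quickly the distribution tails off. The sketch below accumulates the `pct_requests` values of the first five bins from the output above; the values are the rounded figures as displayed, so the result is approximate.

```python
from itertools import accumulate

# First five rows of the output: (bin upper edge in KB, share of requests).
first_bins = [
    (100, 0.9134219),
    (200, 0.04271389),
    (300, 0.01491563),
    (400, 0.007394499),
    (500, 0.004601959),
]

# Running total of the shares, one entry per bin.
cumulative_shares = list(accumulate(share for _, share in first_bins))

for (upper_edge_kb, _), cumulative in zip(first_bins, cumulative_shares):
    print(f"<= {upper_edge_kb} KB: {cumulative:.1%}")
```

In this sample, roughly 98% of responses are 500 KB or smaller, which is why the plot that follows only needs the first few bins.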

In [25]:
# Plot the first 10 rows
alt.Chart(response_size_df.head(10)).mark_point().encode(
    x=alt.X("responseSize100KB", title="Response Size, 100KB bins"),
    y=alt.Y("pct_requests", title="Percent Requests"),
    tooltip=['responseSize100KB', alt.Tooltip("pct_requests:Q", format=".0%")]
).properties(
    width=1200,
    height=200
)

You can find many examples of working with the `requests` table in the [HTTP Archive discussion forums](https://discuss.httparchive.org/).

You can also make a copy of the workbook and experiment with some of your own visualization ideas for the data.

In [Part 3](https://colab.research.google.com/github/HTTPArchive/har.fyi/blob/main/workbooks/exploring_pages_and_requests_tables_joined.ipynb) we'll look at how you can use SQL JOINs to analyze both the `pages` and `requests` datasets.
