# Accessing the NLNZ Web archive dataset

This notebook includes the following sections:
1. Query web archive data using Memento
2. Query web archive data using CDX API
3. Access CDX and WARC files
4. Extracting metadata (URLs, timestamps, MIME types)


## Install required python packages

In [None]:
# Install pre-requisites
!pip -q install "warcio>=1.7.4" validators "boto3>=1.40.26" s3fs bs4 wordcloud
!pip -q install selenium chromedriver-autoinstaller # for webpage screenshots

In [None]:
# Install wa_nlnz_toolkit
!pip -q install -i https://test.pypi.org/simple/ wa-nlnz-toolkit==0.2.0

## Query web archive data using Memento

The **Memento protocol** makes it easier to find and use archived versions of web pages, even if other APIs aren't available. This gives us machine-readable information about web captures.

In the following section, we'll see how the NLNZ web archive supports the Memento protocol. Specifically, we'll look at three main features:
- TimeGate - get the version of a page closest to a date you choose.
- TimeMap - see all archived versions of a page.
- Memento - change how an archived page is shown using special URL options.

In [2]:
import wa_nlnz_toolkit as want

In [5]:
webpage = "www.natlib.govt.nz"

# default query - get latest capture
dict(want.query_memento(webpage).headers)

{'Date': 'Wed, 01 Oct 2025 23:03:09 GMT',
 'Server': 'Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_fcgid/2.3.9',
 'Content-Type': 'text/html; charset=UTF-8',
 'Link': '<http://www.natlib.govt.nz/>; rel="original", <https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/>; rel="timegate", <https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/>; rel="timemap"; type="application/link-format", <https://ndhadeliver.natlib.govt.nz/webarchive/20250908102349mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Mon, 08 Sep 2025 10:23:49 GMT"',
 'Vary': 'accept-datetime',
 'Expect-CT': 'max-age=86400, enforce',
 'X-XSS-Protection': '1; mode=block',
 'X-Content-Type-Options': 'nosniff',
 'X-Permitted-Cross-Domain-Policies': 'none',
 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
 'Referrer-Policy': 'no-referrer',
 'Keep-Alive': 'timeout=5, max=92',
 'Connection': 'Keep-Alive',
 'Set-Cookie': 'visid_incap_

In [6]:
# or get a tidied-up version
want.get_memento_urls(webpage)

{'original': 'http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/20250908102349mp_/http://www.natlib.govt.nz/'}

The `Link` header contains the Memento information. In this case, it contains four link types:

- **original**: the URL that was archived (e.g., https://covid19.govt.nz/)
- **timegate**: the URL that resolves to the capture closest to a requested datetime (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/https://covid19.govt.nz/)
- **timemap**: the list of all available captures over time (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/https://covid19.govt.nz/)
- **memento**: the URL of a specific archived version of the webpage (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/20250728214105mp_/https://covid19.govt.nz/)
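
To make the structure of the `Link` header concrete, here is a minimal sketch that splits it into the four link types using only the standard library. The helper name `parse_link_header` is hypothetical and not part of `wa_nlnz_toolkit`; the `.links` attribute of the response object gives a similar result without custom code.

```python
import re

# One entry of a Link header looks like: <URL>; key="value"; key="value"
LINK_ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
LINK_PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(header):
    """Return {rel: {"url": ..., other attributes}} for each Link entry."""
    links = {}
    for url, raw_params in LINK_ENTRY.findall(header):
        params = dict(LINK_PARAM.findall(raw_params))
        rel = params.get("rel")
        if rel:
            links[rel] = {"url": url, **params}
    return links


# Link header copied from the response shown earlier in this section
link_header = (
    '<http://www.natlib.govt.nz/>; rel="original", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/>; rel="timegate", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/>; '
    'rel="timemap"; type="application/link-format", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/20250908102349mp_/http://www.natlib.govt.nz/>; '
    'rel="memento"; datetime="Mon, 08 Sep 2025 10:23:49 GMT"'
)

links = parse_link_header(link_header)
for rel, info in links.items():
    print(rel, "->", info["url"])
```

The quoted `datetime` value contains a comma, which is why the sketch matches quoted attribute values explicitly instead of splitting the header on commas.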

By default, *memento* is the URL of the latest capture. If a specific datetime is provided, it is the URL of the capture closest in time to that datetime, as in the following example.

In [7]:
import datetime


# query for a capture closest to a given datetime
dt_required = datetime.datetime(2020, 1, 1, 0, 0, 0)
dict(want.query_memento(webpage, dt=dt_required).headers)

{'Date': 'Wed, 01 Oct 2025 23:03:16 GMT',
 'Server': 'Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_fcgid/2.3.9',
 'Content-Type': 'text/html; charset=UTF-8',
 'Link': '<http://www.natlib.govt.nz/>; rel="original", <https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/>; rel="timegate", <https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/>; rel="timemap"; type="application/link-format", <https://ndhadeliver.natlib.govt.nz/webarchive/20200130060111mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Thu, 30 Jan 2020 06:01:11 GMT"',
 'Vary': 'accept-datetime',
 'Expect-CT': 'max-age=86400, enforce',
 'X-XSS-Protection': '1; mode=block',
 'X-Content-Type-Options': 'nosniff',
 'X-Permitted-Cross-Domain-Policies': 'none',
 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
 'Referrer-Policy': 'no-referrer',
 'Keep-Alive': 'timeout=5, max=100',
 'Connection': 'Keep-Alive',
 'Set-Cookie': 'visid_incap

In [8]:
# or get the tidied-up version
want.get_memento_urls(webpage, dt=dt_required)

{'original': 'http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/20200130060111mp_/http://www.natlib.govt.nz/'}
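
In the Memento protocol, the requested datetime travels in an `Accept-Datetime` request header formatted as an HTTP date (the `Vary: accept-datetime` response header above reflects this). The sketch below shows that formatting and measures how far the returned capture is from the requested datetime. Whether `want.query_memento` builds the header this way internally is an assumption; the helper name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_value(dt):
    """Format a datetime as an HTTP date suitable for Accept-Datetime."""
    if dt.tzinfo is None:
        # Assumption for this sketch: naive datetimes are treated as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


requested = datetime(2020, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
header_value = accept_datetime_value(requested)

# datetime attribute of the memento link returned for this request (see above)
returned = parsedate_to_datetime("Thu, 30 Jan 2020 06:01:11 GMT")
gap = returned - requested

print(header_value)
print("closest capture is", gap.days, "days after the requested datetime")
```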

In [9]:
want.query_memento("www.niwa.co.nz").links

{'original': {'url': 'http://www.niwa.co.nz/', 'rel': 'original'},
 'timegate': {'url': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://www.niwa.co.nz/',
  'rel': 'timegate'},
 'timemap': {'url': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.niwa.co.nz/',
  'rel': 'timemap',
  'type': 'application/link-format'},
 'memento': {'url': 'https://ndhadeliver.natlib.govt.nz/webarchive/20200318073807mp_/http://www.niwa.co.nz/',
  'rel': 'memento',
  'datetime': 'Wed, 18 Mar 2020 07:38:07 GMT'}}

### Get full list of captures from _timemap_

A Memento TimeMap provides a list of captures for a given webpage. It is available from Pywb (NLNZ selective web archive) and OpenWayback systems. For Pywb, three formats are supported: link, cdxj, and json.
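
As a small illustration, the three formats differ only in one path segment of the TimeMap URL. The sketch below builds those URLs; the `link` and `json` patterns match URLs that appear in this notebook's output, while the `cdxj` pattern is assumed to follow the same shape. The function name is hypothetical.

```python
TIMEMAP_BASE = "https://ndhadeliver.natlib.govt.nz/webarchive/timemap"
TIMEMAP_FORMATS = ("link", "cdxj", "json")


def timemap_url(page, fmt="link"):
    """Build the TimeMap URL for a page in one of the supported formats."""
    if fmt not in TIMEMAP_FORMATS:
        raise ValueError(f"unsupported timemap format: {fmt!r}")
    return f"{TIMEMAP_BASE}/{fmt}/{page}"


for fmt in TIMEMAP_FORMATS:
    print(timemap_url("www.natlib.govt.nz", fmt))
```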

The example below shows a timemap for the given webpage from the NLNZ selective web archive.

In [11]:
webpage = "www.natlib.govt.nz"

want.get_timemap(webpage)

https://ndhadeliver.natlib.govt.nz/webarchive/timemap/json/www.natlib.govt.nz


timestamp,urlkey,url,mime,status,digest,redirect,robotflags,length,offset,filename,source,source-coll,access_url
2020-01-30 06:01:06,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,Q2NZOEQVJ4XX7ZDMCJH63Q4ZJZ3NX66N,-,-,0,151529,V1-FL50765403.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200130060106/http://natlib.govt.nz/
2020-01-30 06:01:11,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,DHRWFIKXJPMQV3BNEWYTURZ5ZJ4PXSDT,-,-,0,162188,V1-FL50765403.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200130060111/https://natlib.govt.nz/
2020-05-30 06:35:23,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,J5DYVJWHJ7MKZDKQCH2B6CVY6YJOF5IY,-,-,0,3477,V1-FL57988342.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200530063523/https://natlib.govt.nz/
2020-05-30 23:42:00,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,Q2NZOEQVJ4XX7ZDMCJH63Q4ZJZ3NX66N,-,-,0,76339245,V1-FL57988457.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200530234200/http://natlib.govt.nz/
2020-07-30 07:00:58,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,24OAFLGTCDDIOQZ6PINZAXBPJGAV73WW,-,-,0,27225,V1-FL55461382.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200730070058/https://natlib.govt.nz/
2021-01-29 01:36:44,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,200,UAOK5GLA5X7OKWOTZI4JPXTNQK7TVKKC,-,-,0,86249495,V1-FL62209940.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20210129013644/http://natlib.govt.nz/
2021-01-30 06:00:56,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,503,N67J36CWSVSGPQLJCVMHS3EG7Q4S5VNW,-,-,0,155090,V1-FL62215104.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20210130060056/http://natlib.govt.nz/
2021-12-03 01:31:57,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,VXIQIOHXS7HN7B6YVNWYHDUEFRQBWUGK,-,-,0,3819,V1-FL80155252.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20211203013157/https://natlib.govt.nz/
2022-04-01 09:56:43,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2,-,-,0,31898460,V1-FL80790464.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20220401095643/http://natlib.govt.nz/
2022-04-01 09:56:49,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,QSUK5KELPFRQLUEDBD6B66YJKVTHA7ZQ,-,-,0,34852274,V1-FL80790464.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20220401095649/https://natlib.govt.nz/


Note that the `load_url` field contains the URL that Pywb uses internally; it cannot be used directly to access a specific archived version of a page.


The way an archived page is presented can also be changed by adding a modifier directly after the timestamp in the URL. For example:

- **mp_** modifier: indicates "main page" content replay.
- **id_** modifier: returns the original harvested version of the webpage.
- **if_** modifier: returns the view with web archive headers (default for NLNZ web archive).

For more information, check https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting
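
The URL layout described above (archive base, 14-digit timestamp, optional modifier, original URL) can be sketched as follows. Both helper names are hypothetical; the base URL and the sample values are taken from outputs earlier in this notebook.

```python
import re

ARCHIVE_BASE = "https://ndhadeliver.natlib.govt.nz/webarchive"


def replay_url(timestamp, original_url, modifier=""):
    """Compose a replay URL; modifier is e.g. "mp_", "id_", "if_" or ""."""
    return f"{ARCHIVE_BASE}/{timestamp}{modifier}/{original_url}"


def split_replay_url(url):
    """Split a replay URL into (timestamp, modifier, original_url)."""
    pattern = re.escape(ARCHIVE_BASE) + r"/(\d{14})([a-z]{2}_)?/(.+)"
    match = re.fullmatch(pattern, url)
    if match is None:
        raise ValueError(f"not a replay URL: {url}")
    timestamp, modifier, original_url = match.groups()
    return timestamp, modifier or "", original_url


memento = replay_url("20200130060111", "http://www.natlib.govt.nz/", "mp_")
print(memento)
print(split_replay_url(memento))
# Same capture, requesting the original harvested content instead
print(replay_url("20200130060111", "http://www.natlib.govt.nz/", "id_"))
```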

## Query web archive data using CDX API

Because our OutbackCDX server cannot be accessed directly, the following CDX API queries are redirected by Pywb to the OutbackCDX server. As a result, some native CDX query parameters, such as setting the CDX output format, are not supported.

In [12]:
webpage = "www.natlib.govt.nz"

df_captures = want.query_cdx_index(webpage)
df_captures

timestamp,urlkey,url,mime,status,digest,redirect,robotflags,length,offset,filename,source,source-coll,access_url
2004-07-11 21:32:25,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK,-,-,0,976,V1-FL1645590.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225/http://www.natlib.govt.nz/
2006-07-04 03:31:35,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,JKXIM5NTOXWFNC5UOIAN37AGPV2KL73O,-,-,0,976,V1-FL1645520.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20060704033135/http://www.natlib.govt.nz/
2007-03-22 04:15:46,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,CU2KIAIJGUZD4IOV43D7LE2J5TVMUJYR,-,-,0,19799,V1-FL481509.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20070322041546/http://www.natlib.govt.nz/
2008-02-25 06:02:38,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,2IIVSKCHCNVKN6Z273YKZBEW6QVMYXKK,-,-,0,2717767,V1-FL538322.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20080225060238/http://www.natlib.govt.nz/
2008-10-19 22:53:43,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,6TCIF3SQHDMFZWZ2YTJ5AFNTSYUPYZX7,-,-,0,48523900,V1-FL617362.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081019225343/http://www.natlib.govt.nz/
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-08-04 19:03:40,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,OIDDMZRGBYM5BP3V5VIVABZQOKPJILGV,-,-,-,78080125,V1-FL94835963.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250804190340/https://natlib.govt.nz/
2025-09-04 12:11:18,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2,-,-,-,26316599,V1-FL94997298.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250904121118/http://natlib.govt.nz/
2025-09-04 12:11:24,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,B545ERXBLTOCXX6CJZQBC2CF4TWXAJZY,-,-,-,27802697,V1-FL94997298.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250904121124/https://natlib.govt.nz/
2025-09-08 10:23:46,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2,-,-,-,29801086,V1-FL94998111.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250908102346/http://natlib.govt.nz/


Note that the query results above are essentially the same as the timemap. Our function additionally provides an `access_url` column, which contains the actual URL of each webpage snapshot.
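
In the results above, captures of `http://www.natlib.govt.nz/`, `http://natlib.govt.nz/` and `https://natlib.govt.nz/` all share the `urlkey` value `nz,govt,natlib)/`. The sketch below is a deliberately simplified approximation of how such a key can be derived (drop the scheme and a leading `www.`, reverse the host labels, lowercase the path). It is not the canonicalization code used by the index, which applies more rules, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def simple_urlkey(url):
    """Simplified SURT-style key: reversed host labels + ')' + lowercased path."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path.lower() or "/")
    if parts.query:
        key += "?" + parts.query.lower()
    return key


for url in ("www.natlib.govt.nz",
            "http://natlib.govt.nz/",
            "https://natlib.govt.nz/",
            "https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf"):
    print(url, "->", simple_urlkey(url))
```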

In [3]:
# Furthermore, we can also query the CDX index for other types of files, such as images, videos, etc.
# However, due to the architecture design, we cannot do a fuzzy query for these types of files. 
# Instead, we will need to query the webpage at least from the first-level subdomain.
webpage = "covid19.govt.nz/assets/"

df_captures = want.query_cdx_index(webpage, filter="mimetype:application/pdf", matchType="prefix")
df_captures["original_file_name"] = df_captures["urlkey"].str.split("/").str[-1]
df_captures

timestamp,urlkey,url,mime,status,digest,redirect,robotflags,length,offset,filename,source,source-coll,access_url,original_file_name
2020-03-30 21:00:21,"nz,govt,covid19)/assets/covid_alert-levels_v2.pdf",https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf,application/pdf,200,MSTJSOX3WPVLPJK2NCOALZS7TUBXBHNS,-,-,0,7418315,V1-FL52965947.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200330210021/https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf,covid_alert-levels_v2.pdf
2020-03-30 21:02:01,"nz,govt,covid19)/assets/covid_event-criteria-guide.pdf",https://covid19.govt.nz/assets/COVID_event-criteria-guide.pdf,application/pdf,200,NIXKDOXXOZ2MMIQL3J77FBZ4CKEQUQO3,-,-,0,24868717,V1-FL52965947.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200330210201/https://covid19.govt.nz/assets/COVID_event-criteria-guide.pdf,covid_event-criteria-guide.pdf
2020-04-06 06:24:09,"nz,govt,covid19)/assets/covid_alert-levels_v2.pdf",https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf,application/pdf,200,MSTJSOX3WPVLPJK2NCOALZS7TUBXBHNS,-,-,0,7767870,V1-FL53132949.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200406062409/https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf,covid_alert-levels_v2.pdf
2020-04-06 06:24:54,"nz,govt,covid19)/assets/covid_event-criteria-guide.pdf",https://covid19.govt.nz/assets/COVID_event-criteria-guide.pdf,application/pdf,200,NIXKDOXXOZ2MMIQL3J77FBZ4CKEQUQO3,-,-,0,25954920,V1-FL53132949.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200406062454/https://covid19.govt.nz/assets/COVID_event-criteria-guide.pdf,covid_event-criteria-guide.pdf
2020-04-13 06:32:18,"nz,govt,covid19)/assets/covid_alert-levels_v2.pdf",https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf,application/pdf,200,MSTJSOX3WPVLPJK2NCOALZS7TUBXBHNS,-,-,0,37794252,V1-FL53203417.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200413063218/https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf,covid_alert-levels_v2.pdf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-08-07 08:04:15,"nz,govt,covid19)/assets/covid-19-protection-framework/covid-19-protection-framework-traffic-lights-table.pdf",https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,application/pdf,200,UCP2OCCD22DUHAMNBIXW5F6FR7VYTJUT,-,-,0,94249314,V1-FL88232058.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20230807080415/https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,covid-19-protection-framework-traffic-lights-table.pdf
2023-08-14 07:57:55,"nz,govt,covid19)/assets/covid-19-protection-framework/covid-19-protection-framework-traffic-lights-table.pdf",https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,application/pdf,200,UCP2OCCD22DUHAMNBIXW5F6FR7VYTJUT,-,-,0,93535743,V1-FL88260441.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20230814075755/https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,covid-19-protection-framework-traffic-lights-table.pdf
2023-08-21 07:57:10,"nz,govt,covid19)/assets/covid-19-protection-framework/covid-19-protection-framework-traffic-lights-table.pdf",https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,application/pdf,200,UCP2OCCD22DUHAMNBIXW5F6FR7VYTJUT,-,-,0,88745446,V1-FL88366738.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20230821075710/https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,covid-19-protection-framework-traffic-lights-table.pdf
2023-10-01 07:01:23,"nz,govt,covid19)/assets/covid-19-protection-framework/covid-19-protection-framework-traffic-lights-table.pdf",https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,application/pdf,200,UCP2OCCD22DUHAMNBIXW5F6FR7VYTJUT,-,-,0,91087870,V1-FL88887196.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20231001070123/https://covid19.govt.nz/assets/COVID-19-Protection-Framework/COVID-19-Protection-Framework-traffic-lights-table.pdf,covid-19-protection-framework-traffic-lights-table.pdf
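
Because `urlkey` is lowercased, the `original_file_name` values above lose the original capitalisation. As a possible alternative, the file name can be taken from the `url` column instead. The sketch below shows the idea on a single URL from the table, using only the standard library; the helper name is hypothetical.

```python
from posixpath import basename
from urllib.parse import unquote, urlsplit


def file_name_from_url(url):
    """Return the last path segment of a URL, with percent-escapes decoded."""
    return unquote(basename(urlsplit(url).path))


url = "https://covid19.govt.nz/assets/COVID_Alert-levels_v2.pdf"
urlkey = "nz,govt,covid19)/assets/covid_alert-levels_v2.pdf"

print(file_name_from_url(url))   # keeps the original capitalisation
print(urlkey.split("/")[-1])     # lowercased, as in the table above
```

With pandas, the same function could be applied to the whole column, for example via `df_captures["url"].map(file_name_from_url)`.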


> HANDS-ON: Query the CDX index for all PNG files from the given webpage.

In [None]:
# webpage = "covid19.govt.nz/assets/"

# df_captures = want.query_cdx_index(webpage, filter="mimetype:image/png", matchType="prefix")
# df_captures["original_file_name"] = df_captures["urlkey"].str.split("/").str[-1]
# df_captures

timestamp,urlkey,url,mime,status,digest,redirect,robotflags,length,offset,filename,source,source-coll,access_url,original_file_name
2020-03-30 21:03:09,"nz,govt,covid19)/assets/images/email-signature.png",https://covid19.govt.nz/assets/Images/Email-signature.png,image/png,200,BHYGUNLPKCYW2RZMEGMXGL6SWFGLQFVH,-,-,0,28461102,V1-FL52965947.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200330210309/https://covid19.govt.nz/assets/Images/Email-signature.png,email-signature.png
2020-04-06 06:25:50,"nz,govt,covid19)/assets/images/email-signature.png",https://covid19.govt.nz/assets/Images/Email-signature.png,image/png,200,BHYGUNLPKCYW2RZMEGMXGL6SWFGLQFVH,-,-,0,27930179,V1-FL53132949.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200406062550/https://covid19.govt.nz/assets/Images/Email-signature.png,email-signature.png
2020-08-27 01:42:00,"nz,govt,covid19)/assets/images/face-coverings/make_no-sew_step-1__resizedimagewzywmcwzodrd.png",https://covid19.govt.nz/assets/Images/face-coverings/make_no-sew_step-1__ResizedImageWzYwMCwzODRd.png,image/png,200,GLQPSQIWJDFKBP6YBV6XP5VJYYL5K6W3,-,-,0,8869271,V1-FL56954801.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200827014200/https://covid19.govt.nz/assets/Images/face-coverings/make_no-sew_step-1__ResizedImageWzYwMCwzODRd.png,make_no-sew_step-1__resizedimagewzywmcwzodrd.png
2020-08-27 01:42:00,"nz,govt,covid19)/assets/images/face-coverings/make_no-sew_step-2__resizedimagewzywmcwzodrd.png",https://covid19.govt.nz/assets/Images/face-coverings/make_no-sew_step-2__ResizedImageWzYwMCwzODRd.png,image/png,200,XVG7ESMXWPWXGACH44WZ5ZJXLR5CLOGE,-,-,0,1485271,V1-FL56954801.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200827014200/https://covid19.govt.nz/assets/Images/face-coverings/make_no-sew_step-2__ResizedImageWzYwMCwzODRd.png,make_no-sew_step-2__resizedimagewzywmcwzodrd.png
2020-08-27 01:42:00,"nz,govt,covid19)/assets/images/face-coverings/make_no-sew_step-3__resizedimagewzywmcwzodrd.png",https://covid19.govt.nz/assets/Images/face-coverings/make_no-sew_step-3__ResizedImageWzYwMCwzODRd.png,image/png,200,F6UBJCMUPIJBIGFQ66WXPOWFZKQFNN52,-,-,0,3889630,V1-FL56954801.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20200827014200/https://covid19.govt.nz/assets/Images/face-coverings/make_no-sew_step-3__ResizedImageWzYwMCwzODRd.png,make_no-sew_step-3__resizedimagewzywmcwzodrd.png
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-02-11 07:14:10,"nz,govt,covid19)/assets/images/background/support-for-disabled-people_image-block.png",https://covid19.govt.nz/assets/Images/background/support-for-disabled-people_image-block.png,image/png,200,3IBFZ6IKCKXUFNSMPD2BSEVVENDIH5BS,-,-,0,67334803,V1-FL90448663.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20240211071410/https://covid19.govt.nz/assets/Images/background/support-for-disabled-people_image-block.png,support-for-disabled-people_image-block.png
2024-02-11 07:14:13,"nz,govt,covid19)/assets/images/background/support-and-info-for-seniors_image-block.png",https://covid19.govt.nz/assets/Images/background/support-and-info-for-seniors_image-block.png,image/png,200,4FQ3XDR6J4GFNT72MM7MOA4RLN6QHTON,-,-,0,67554895,V1-FL90448663.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20240211071413/https://covid19.govt.nz/assets/Images/background/support-and-info-for-seniors_image-block.png,support-and-info-for-seniors_image-block.png
2024-02-11 07:14:16,"nz,govt,covid19)/assets/images/background/support-for-maori_image-block.png",https://covid19.govt.nz/assets/Images/background/support-for-maori_image-block.png,image/png,200,RTCPNJFLDQWI566FLC3HAVAN5LH2R6S2,-,-,0,69141886,V1-FL90448663.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20240211071416/https://covid19.govt.nz/assets/Images/background/support-for-maori_image-block.png,support-for-maori_image-block.png
2024-02-11 07:14:20,"nz,govt,covid19)/assets/images/background/support-for-pacific-people_image-block.png",https://covid19.govt.nz/assets/Images/background/support-for-pacific-people_image-block.png,image/png,200,G3KQXJEDJ53A4VCGN6KQQN22N7OWH5JB,-,-,0,70908026,V1-FL90448663.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20240211071420/https://covid19.govt.nz/assets/Images/background/support-for-pacific-people_image-block.png,support-for-pacific-people_image-block.png
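
Many rows in results like these are repeat captures of unchanged files: the same `digest` value means the captured content was identical. A minimal sketch of keeping only the earliest capture per digest is shown below, using three rows copied from the table above (timestamps written in 14-digit form). The function name is hypothetical.

```python
# (timestamp, url, digest) copied from the first rows of the table above
captures = [
    ("20200406062550",
     "https://covid19.govt.nz/assets/Images/Email-signature.png",
     "BHYGUNLPKCYW2RZMEGMXGL6SWFGLQFVH"),
    ("20200330210309",
     "https://covid19.govt.nz/assets/Images/Email-signature.png",
     "BHYGUNLPKCYW2RZMEGMXGL6SWFGLQFVH"),
    ("20200827014200",
     "https://covid19.govt.nz/assets/Images/face-coverings/make_no-sew_step-1__ResizedImageWzYwMCwzODRd.png",
     "GLQPSQIWJDFKBP6YBV6XP5VJYYL5K6W3"),
]


def first_capture_per_digest(rows):
    """Map each digest to the (timestamp, url) of its earliest capture."""
    earliest = {}
    for timestamp, url, digest in sorted(rows):  # sorts by timestamp first
        earliest.setdefault(digest, (timestamp, url))
    return earliest


unique = first_capture_per_digest(captures)
for digest, (timestamp, url) in unique.items():
    print(timestamp, digest, url)
```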


## Access CDX and WARC files

In the following section, we will access real WARC files and their corresponding CDX files selected from the NLNZ web archive dataset.

In [14]:
bucket_name = "ndha-public-data-ap-southeast-2"
folder_prefix = "iPRES-2025"

want.list_s3_files(bucket_name, folder_prefix)

['iPRES-2025-prep/',
 'iPRES-2025-prep/covid19.govt.nz/',
 'iPRES-2025-prep/covid19.govt.nz/2020-03-19_IE52722688/FL52722690_NLNZ-20200318051635624-00000-29270~appserv17~8443.warc.gz',
 'iPRES-2025-prep/covid19.govt.nz/2020-03-19_IE52722688/IE52722688.cdx',
 'iPRES-2025-prep/covid19.govt.nz/2020-04-07_IE53132947/FL53132949_NLNZ-20200406061614107-00000-7225~appserv17~8443.warc.gz',
 'iPRES-2025-prep/covid19.govt.nz/2020-04-07_IE53132947/IE53132947.cdx',
 'iPRES-2025-prep/covid19.govt.nz/2020-05-11_IE53917175/FL53917177_NLNZ-20200508060017653-00000-7225~appserv17~8443.warc.gz',
 'iPRES-2025-prep/covid19.govt.nz/2020-05-11_IE53917175/FL53917178_NLNZ-20200508062841598-00001-7225~appserv17~8443.warc.gz',
 'iPRES-2025-prep/covid19.govt.nz/2020-05-11_IE53917175/FL53917179_NLNZ-20200508063436203-00002-7225~appserv17~8443.warc.gz',
 'iPRES-2025-prep/covid19.govt.nz/2020-05-11_IE53917175/FL53917180_NLNZ-20200508064355099-00003-7225~appserv17~8443.warc.gz',
 'iPRES-2025-prep/covid19.govt.nz/2020-
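
The listing suggests a layout of one folder per harvest, each holding one CDX file and one or more WARC files. A short sketch of grouping the keys that way is given below, using a few complete keys copied from the output above. The function name is hypothetical and the layout is inferred from the listing only.

```python
from collections import defaultdict
from posixpath import basename, dirname

# A few complete keys copied from the listing above
keys = [
    "iPRES-2025-prep/",
    "iPRES-2025-prep/covid19.govt.nz/",
    "iPRES-2025-prep/covid19.govt.nz/2020-03-19_IE52722688/FL52722690_NLNZ-20200318051635624-00000-29270~appserv17~8443.warc.gz",
    "iPRES-2025-prep/covid19.govt.nz/2020-03-19_IE52722688/IE52722688.cdx",
    "iPRES-2025-prep/covid19.govt.nz/2020-04-07_IE53132947/FL53132949_NLNZ-20200406061614107-00000-7225~appserv17~8443.warc.gz",
    "iPRES-2025-prep/covid19.govt.nz/2020-04-07_IE53132947/IE53132947.cdx",
]


def group_by_harvest(keys):
    """Group CDX and WARC file names by the folder that contains them."""
    groups = defaultdict(lambda: {"cdx": [], "warc": []})
    for key in keys:
        name = basename(key)  # empty for folder placeholders ending in "/"
        if name.endswith(".cdx"):
            groups[dirname(key)]["cdx"].append(name)
        elif name.endswith((".warc", ".warc.gz")):
            groups[dirname(key)]["warc"].append(name)
    return dict(groups)


harvests = group_by_harvest(keys)
for folder, files in harvests.items():
    print(folder, "| cdx:", len(files["cdx"]), "| warc:", len(files["warc"]))
```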

Let's have a look at the CDX file first.

Here we have followed the standard 11-field format as described in the [CDX documentation](https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/).

These fields consist of the following:

1. N: massaged url
2. b: date
3. a: original url
4. m: mime type of original document
5. s: response code
6. k: new style checksum
7. r: redirect
8. M: meta tags 
9. S: compressed record size
10. V: compressed payload offset 
11. g: file name
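
To show how these field letters map onto a raw CDX file, here is a minimal sketch that reads the header line and one record using only the standard library. The header line follows the CDX specification's convention (a leading delimiter, then `CDX`, then the field letters) and is assumed here, since the notebook skips it rather than displaying it. The record is reconstructed from a row of the DataFrame output shown below; the function name is hypothetical.

```python
from datetime import datetime

# Assumed header line (per the CDX format convention) plus one record
# reconstructed from the DataFrame output in this notebook
cdx_lines = [
    " CDX N b a m s k r M S V g",
    'nz,govt,covid19)/"tel:111/">111</a> 20231213012505 '
    "https://covid19.govt.nz/%22tel:111/%22%3E111%3C//a%3E text/html 200 "
    "II7BJKXT7P6EDVO4UOXYHELSBETKETB4 - - 26531 217399 "
    "FL89493931_NLNZ-20231213012006008-00002-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz",
]


def parse_cdx(lines):
    """Yield one dict per record, keyed by the field letters in the header."""
    lines = iter(lines)
    fields = next(lines).split()[1:]  # drop the leading "CDX" marker
    for line in lines:
        record = dict(zip(fields, line.rstrip("\n").split(" ")))
        record["b"] = datetime.strptime(record["b"], "%Y%m%d%H%M%S")
        record["S"] = int(record["S"])  # compressed record size
        record["V"] = int(record["V"])  # offset within the WARC file
        yield record


records = list(parse_cdx(cdx_lines))
print(records[0]["b"], records[0]["m"], records[0]["V"], records[0]["g"])
```

Reading the field names from the header rather than hard-coding them keeps every data line as a record.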

The following cell reads the CDX index data into a pandas DataFrame.

In [17]:
import pandas as pd


object_key = 'iPRES-2025/test/2023-12-14_IE89493927/IE89493927.cdx'
df = pd.read_csv(f"s3://{bucket_name}/{object_key}", sep=" ", skiprows=1)
df.columns = ['N', 'b', 'a', 'm', 's', 'k', 'r', 'M', 'S', 'V', 'g']
df.head()

Unnamed: 0,N,b,a,m,s,k,r,M,S,V,g
0,"nz,govt,covid19)/""tel:111/"">111.</a></p>"",""sortfield"":3302},{""question"":""how",20231213001225,"https://covid19.govt.nz/%22tel:111/%22%3E111.%3C//a%3E%3C//p%3E%22,%22sortField%22:3302%7D,%7B%22question%22:%22How",text/html,200,7Z5PDAWCYNIKTTHNDEBCFBLY35VHLUMX,-,-,26616,3126252,FL89493929_NLNZ-20231212233435565-00000-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz
1,"nz,govt,covid19)/""tel:111/"">111</a>",20231213012505,https://covid19.govt.nz/%22tel:111/%22%3E111%3C//a%3E,text/html,200,II7BJKXT7P6EDVO4UOXYHELSBETKETB4,-,-,26531,217399,FL89493931_NLNZ-20231213012006008-00002-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz
2,"nz,govt,covid19)/""tel:111/"">111</a>&nbsp;m/u0113n/u0101",20231213034942,https://covid19.govt.nz/%22tel:111/%22%3E111%3C//a%3E&nbsp;m/u0113n/u0101,text/html,200,I6RDYSRBSSRTAZGLD7MOT4KZSPP2SPJC,-,-,26244,1461304,FL89493933_NLNZ-20231213032831620-00004-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz
3,"nz,govt,covid19)/""tel:111/"">111</a>&nbsp;m/u014d",20231213034922,https://covid19.govt.nz/%22tel:111/%22%3E111%3C//a%3E&nbsp;m/u014d,text/html,200,ONZZJTI3FZKSOYNL62RKIOXSVQGQB4RQ,-,-,26245,1402094,FL89493933_NLNZ-20231213032831620-00004-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz
4,"nz,govt,covid19)/""tel:111/"">111</a>,",20231213002555,"https://covid19.govt.nz/%22tel:111/%22%3E111%3C//a%3E,",text/html,200,WVK3EVPDR7IOH7TGZTKXEPP4IEA3RMHA,-,-,26539,268477,FL89493930_NLNZ-20231213002040269-00001-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz


Using the information from the CDX file, we can extract a specific payload from the WARC file.

In [28]:
warc_path = (
    "s3://ndha-public-data-ap-southeast-2/iPRES-2025/test/2023-12-14_IE89493927/"
    "FL89493929_NLNZ-20231212233435565-00000-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz"
)
html_payload = want.extract_payload(warc_path, offset=3126252)
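
The offset works because, in a `.warc.gz` file, each record is normally compressed as its own gzip member, so a reader can jump to the offset (field `V`) and decompress just that record (field `S` gives its compressed size). The sketch below demonstrates the principle on a tiny synthetic archive built in memory. The record contents are placeholders, and this is not a description of how `want.extract_payload` is implemented.

```python
import gzip
import zlib


def read_member(blob, offset, length=None):
    """Decompress the single gzip member that starts at `offset`."""
    end = None if length is None else offset + length
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect gzip framing
    return decompressor.decompress(blob[offset:end])


# Two placeholder "records", each compressed as a separate gzip member
raw_records = [
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nplaceholder info record",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n<html><p>placeholder page</p></html>",
]
members = [gzip.compress(record) for record in raw_records]
archive = b"".join(members)

# Equivalent of the CDX fields V (offset) and S (compressed size)
offset = len(members[0])
size = len(members[1])

record = read_member(archive, offset, size)
print(record.decode("utf-8"))
```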

After extracting the payload, we can parse it with the BeautifulSoup module and extract the text content.

In [29]:
from bs4 import BeautifulSoup

# Parse HTML
soup = BeautifulSoup(html_payload, "html.parser")

# Get all <p> elements as separate paragraphs
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

for para in paragraphs:
    print(para)

COVID-19 information has a new home. Starting from December 11 2023, you can find information on the Te Whatu Ora Health Information and Services website. COVID-19 — Health Information and Services
Keep up to date with Aotearoa New Zealand’s response to COVID-19.
With COVID-19 case numbers trending downwards, a highly vaccinated population, and increased access to antiviral medicines, most COVID-19 rules and border restrictions have ended.
Protect yourself and others from COVID-19 by following the latest health advice and not sharing unreliable information.
COVID-19 vaccines are free for everyone aged 5 and over. They’re also available to tamariki from 6 months who are at greater risk of severe illness if they were to get COVID-19.
Testing for COVID-19 and isolating when you test positive are 2 important ways to help manage COVID-19 in Aotearoa New Zealand.
Information for air and sea travellers about entering and leaving New Zealand.
Although requirements have been removed, we recomme

In [None]:
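# The parsing steps above are wrapped in `want.extract_content_html()`, e.g.:
warc_path = f"s3://{bucket_name}/iPRES-2025/test/2023-12-14_IE89493927/FL89493929_NLNZ-20231212233435565-00000-72544~wlgprdwctweb01.natlib.govt.nz~8443.warc.gz"
html_payload = want.extract_payload(warc_path, offset=3126252)
content = want.extract_content_html(html_payload)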