# Exploring Access Methods for the NLNZ Web Archive Dataset

This notebook includes the following sections:
1. Use Memento and OutbackCDX to query data.
2. Parsing WARC files.
3. Extracting metadata (URLs, timestamps, MIME types).
4. Visualizing simple stats (volume by year, domain types).

## Install required python packages

In [None]:
# !pip -q install warcio playwright validators
# !pip -q install -i https://test.pypi.org/simple/ wa-nlnz-toolkit

## Query web archive data using Memento

The **Memento protocol** makes it easier to find and use archived versions of web pages, even if other APIs aren't available. This gives us machine-readable information about web captures.

In the following section, we'll see how the NLNZ web archive supports the Memento protocol. Specifically, we'll look at three main features:
- TimeGate - get the version of a page closest to a date you choose.
- TimeMap - see all archived versions of a page.
- Memento - change how an archived page is shown using special URL options.

In [1]:
import datetime
import pandas as pd
import wa_nlnz_toolkit as want

In [2]:
webpage = "www.natlib.govt.nz"

# default query - get latest capture
dict(want.query_memento(webpage).headers)

{'Date': 'Tue, 09 Sep 2025 23:38:58 GMT',
 'Server': 'Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_fcgid/2.3.9',
 'Content-Type': 'text/html; charset=UTF-8',
 'Link': '<http://www.natlib.govt.nz/>; rel="original", <https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/>; rel="timegate", <https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/>; rel="timemap"; type="application/link-format", <https://ndhadeliver.natlib.govt.nz/webarchive/20250908102349mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Mon, 08 Sep 2025 10:23:49 GMT"',
 'Vary': 'accept-datetime',
 'Expect-CT': 'max-age=86400, enforce',
 'X-XSS-Protection': '1; mode=block',
 'X-Content-Type-Options': 'nosniff',
 'X-Permitted-Cross-Domain-Policies': 'none',
 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
 'Referrer-Policy': 'no-referrer',
 'Keep-Alive': 'timeout=5, max=100',
 'Connection': 'Keep-Alive',
 'Set-Cookie': 'visid_incap

In [3]:
# or get a tidied-up version containing only the Memento links
want.get_memento_urls(webpage)

{'original': 'http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/20250908102349mp_/http://www.natlib.govt.nz/'}

The *Link* header contains the Memento information. In this case, it carries four link types:

- **original**: the URL that was archived (e.g., https://covid19.govt.nz/)
- **timegate**: the endpoint that negotiates which capture to return for a requested datetime (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/https://covid19.govt.nz/)
- **timemap**: list of all available captures over time (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/https://covid19.govt.nz/)
- **memento**: the url of the specific archived version of the webpage (e.g., https://ndhadeliver.natlib.govt.nz/webarchive/20250728214105mp_/https://covid19.govt.nz/)
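
The four link types can also be pulled out of the raw header by hand. The sketch below uses only the standard library and a `Link` value copied from the response above; `parse_link_header` is a hypothetical helper written for illustration, not a function of `wa_nlnz_toolkit`, and it assumes every parameter value is double-quoted as in that response.

```python
import re

# Link header value copied from the response shown earlier.
link_header = (
    '<http://www.natlib.govt.nz/>; rel="original", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/>; rel="timegate", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/>; '
    'rel="timemap"; type="application/link-format", '
    '<https://ndhadeliver.natlib.govt.nz/webarchive/20250908102349mp_/http://www.natlib.govt.nz/>; '
    'rel="memento"; datetime="Mon, 08 Sep 2025 10:23:49 GMT"'
)


def parse_link_header(value):
    """Map each rel value to its target URL (hypothetical helper)."""
    links = {}
    # Each entry looks like: <url>; key="value"; key="value"
    for url, params in re.findall(r'<([^>]*)>((?:\s*;\s*\w+="[^"]*")*)', value):
        attrs = dict(re.findall(r'(\w+)="([^"]*)"', params))
        # A rel attribute may hold several space-separated relation types.
        for rel in attrs.get("rel", "").split():
            links[rel] = url
    return links


memento_links = parse_link_header(link_header)
print(memento_links["memento"])
```

Matching quoted values as a unit matters here because the `datetime` parameter contains a comma, so splitting the header on commas would break the last entry.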

By default, the *memento* link points to the latest capture. If a specific datetime is provided, it points to the capture closest in time to that datetime. An example is shown below.

In [4]:
# query for a capture closest to a given datetime
dt_required = datetime.datetime(2020, 1, 1, 0, 0, 0)
dict(want.query_memento(webpage, dt=dt_required).headers)

{'Date': 'Tue, 09 Sep 2025 23:39:00 GMT',
 'Server': 'Apache/2.4.6 (Red Hat Enterprise Linux) OpenSSL/1.0.2k-fips mod_fcgid/2.3.9',
 'Content-Type': 'text/html; charset=UTF-8',
 'Link': '<http://www.natlib.govt.nz/>; rel="original", <https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/>; rel="timegate", <https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/>; rel="timemap"; type="application/link-format", <https://ndhadeliver.natlib.govt.nz/webarchive/20200130060111mp_/http://www.natlib.govt.nz/>; rel="memento"; datetime="Thu, 30 Jan 2020 06:01:11 GMT"',
 'Vary': 'accept-datetime',
 'Expect-CT': 'max-age=86400, enforce',
 'X-XSS-Protection': '1; mode=block',
 'X-Content-Type-Options': 'nosniff',
 'X-Permitted-Cross-Domain-Policies': 'none',
 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
 'Referrer-Policy': 'no-referrer',
 'Keep-Alive': 'timeout=5, max=98',
 'Connection': 'Keep-Alive',
 'Set-Cookie': 'visid_incap_

In [5]:
# or get the tidied-up version containing only the Memento links
want.get_memento_urls(webpage, dt=dt_required)

{'original': 'http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/timemap/link/http://www.natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/20200130060111mp_/http://www.natlib.govt.nz/'}
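
In the Memento protocol, the requested datetime travels in an `Accept-Datetime` request header, and the selected capture is identified by the 14-digit timestamp in the *memento* URL. The sketch below shows one way to build such a header value and to measure how far the returned capture is from the request. Whether `query_memento` builds the header this way is an assumption, and `memento_datetime` is a hypothetical helper.

```python
import datetime
import re
from email.utils import format_datetime

# Requested datetime; it must be timezone-aware UTC for usegmt=True to work.
dt_requested = datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc)

# HTTP-date string suitable for an Accept-Datetime request header.
accept_datetime = format_datetime(dt_requested, usegmt=True)
print(accept_datetime)  # Wed, 01 Jan 2020 00:00:00 GMT


def memento_datetime(memento_url):
    """Extract the 14-digit capture timestamp from a replay URL (hypothetical helper)."""
    match = re.search(r"/(\d{14})[a-z]{0,2}_?/", memento_url)
    if match is None:
        raise ValueError(f"no capture timestamp in {memento_url!r}")
    ts = datetime.datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return ts.replace(tzinfo=datetime.timezone.utc)


# Memento URL returned for the 2020-01-01 request above.
memento_url = (
    "https://ndhadeliver.natlib.govt.nz/webarchive/"
    "20200130060111mp_/http://www.natlib.govt.nz/"
)
captured = memento_datetime(memento_url)
offset = captured - dt_requested
print(offset)  # distance between the requested datetime and the selected capture
```

Here the closest capture is about a month after the requested date, which is worth checking before treating a memento as representative of a given day.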

### Get full list of captures from _timemap_

A Memento TimeMap provides a list of captures for a given webpage. It is available from Pywb (NLNZ selective web archive) and OpenWayback systems. For Pywb, three formats are supported: link, cdxj, and json.

The example below shows a TimeMap for the given webpage from the NLNZ selective web archive.

In [3]:
want.get_timemap(webpage)

https://ndhadeliver.natlib.govt.nz/webarchive/timemap/json/www.natlib.govt.nz


timestamp,urlkey,url,mime,status,digest,redirect,robotflags,length,offset,filename,source,source-coll,access_url
2004-07-11 21:32:25,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK,-,-,0,976,V1-FL1645590.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225/http://www.natlib.govt.nz/
2006-07-04 03:31:35,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,JKXIM5NTOXWFNC5UOIAN37AGPV2KL73O,-,-,0,976,V1-FL1645520.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20060704033135/http://www.natlib.govt.nz/
2007-03-22 04:15:46,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,CU2KIAIJGUZD4IOV43D7LE2J5TVMUJYR,-,-,0,19799,V1-FL481509.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20070322041546/http://www.natlib.govt.nz/
2008-02-25 06:02:38,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,2IIVSKCHCNVKN6Z273YKZBEW6QVMYXKK,-,-,0,2717767,V1-FL538322.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20080225060238/http://www.natlib.govt.nz/
2008-10-19 22:53:43,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,6TCIF3SQHDMFZWZ2YTJ5AFNTSYUPYZX7,-,-,0,48523900,V1-FL617362.arc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081019225343/http://www.natlib.govt.nz/
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-08-04 19:03:40,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,OIDDMZRGBYM5BP3V5VIVABZQOKPJILGV,-,-,-,78080125,V1-FL94835963.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250804190340/https://natlib.govt.nz/
2025-09-04 12:11:18,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2,-,-,-,26316599,V1-FL94997298.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250904121118/http://natlib.govt.nz/
2025-09-04 12:11:24,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,B545ERXBLTOCXX6CJZQBC2CF4TWXAJZY,-,-,-,27802697,V1-FL94997298.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250904121124/https://natlib.govt.nz/
2025-09-08 10:23:46,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2,-,-,-,29801086,V1-FL94998111.warc,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250908102346/http://natlib.govt.nz/
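
The TimeMap address follows a regular pattern: the `link` form appears in the *timemap* entry of the Memento links and the `json` form is the URL printed by `get_timemap`. The sketch below builds the address for each format. `timemap_url` is a hypothetical helper, and the `cdxj` address is assumed to follow the same pattern as the other two.

```python
# Prefix taken from the TimeMap URLs shown in this notebook.
TIMEMAP_BASE = "https://ndhadeliver.natlib.govt.nz/webarchive/timemap"

# Formats named in the text above.
TIMEMAP_FORMATS = ("link", "cdxj", "json")


def timemap_url(url, fmt="link"):
    """Build a TimeMap URL for the given page and format (hypothetical helper)."""
    if fmt not in TIMEMAP_FORMATS:
        raise ValueError(f"unsupported TimeMap format: {fmt!r}")
    return f"{TIMEMAP_BASE}/{fmt}/{url}"


for fmt in TIMEMAP_FORMATS:
    print(timemap_url("www.natlib.govt.nz", fmt))
```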


Note that the *load_url* field contains the URL used internally by Pywb; it cannot be used directly to access a specific archived version.


Pywb also lets you change how a capture is presented by adding a modifier after the timestamp in the URL. For example,

- **mp_** modifier: indicates "main page" content replay.
- **id_** modifier: returns the original harvested version of the webpage, without rewriting.
- **if_** modifier: returns the view with web archive headers (default for NLNZ web archive).

For more information, check https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting
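
A modifier sits directly after the 14-digit timestamp, as in `20250908102349mp_`. The sketch below swaps or adds a modifier on an existing replay URL. `with_modifier` is a hypothetical helper, and it assumes modifiers are two lowercase letters followed by an underscore, as in the examples listed above.

```python
import re


def with_modifier(replay_url, modifier):
    """Return replay_url with its timestamp modifier set to e.g. 'id_' or 'mp_'.

    Hypothetical helper; pass an empty string to drop the modifier.
    """
    # The timestamp is 14 digits, optionally followed by a two-letter modifier and '_'.
    pattern = r"/(\d{14})(?:[a-z]{2}_)?/"
    new_url, count = re.subn(pattern, rf"/\g<1>{modifier}/", replay_url, count=1)
    if count == 0:
        raise ValueError(f"no capture timestamp in {replay_url!r}")
    return new_url


replay = "https://ndhadeliver.natlib.govt.nz/webarchive/20250908102349mp_/http://www.natlib.govt.nz/"
raw_url = with_modifier(replay, "id_")
print(raw_url)
```

Requesting the `id_` form is useful when the harvested bytes are needed for analysis rather than for display.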

## Query web archive data using CDX API

Because our OutbackCDX server is not directly accessible to external users, the following CDX API queries are redirected by Pywb to the OutbackCDX server. As a result, some native CDX query parameters are not supported, such as setting the CDX output format.

In [7]:
df_captures = want.query_cdx_index(webpage)
df_captures

timestamp,urlkey,url,mime,status,digest,redirect,robotflags,length,offset,filename,load_url,source,source-coll,access_url
2004-07-11 21:32:25,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,JV66FPIIX6IJTB42TNHMQDEU5Z3LFBCK,-,-,0,976,V1-FL1645590.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20040711213225id_/http://www.natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20040711213225/http://www.natlib.govt.nz/
2006-07-04 03:31:35,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,JKXIM5NTOXWFNC5UOIAN37AGPV2KL73O,-,-,0,976,V1-FL1645520.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20060704033135id_/http://www.natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20060704033135/http://www.natlib.govt.nz/
2007-03-22 04:15:46,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,CU2KIAIJGUZD4IOV43D7LE2J5TVMUJYR,-,-,0,19799,V1-FL481509.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20070322041546id_/http://www.natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20070322041546/http://www.natlib.govt.nz/
2008-02-25 06:02:38,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,2IIVSKCHCNVKN6Z273YKZBEW6QVMYXKK,-,-,0,2717767,V1-FL538322.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20080225060238id_/http://www.natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20080225060238/http://www.natlib.govt.nz/
2008-10-19 22:53:43,"nz,govt,natlib)/",http://www.natlib.govt.nz/,text/html,200,6TCIF3SQHDMFZWZ2YTJ5AFNTSYUPYZX7,-,-,0,48523900,V1-FL617362.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20081019225343id_/http://www.natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081019225343/http://www.natlib.govt.nz/
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-08-04 19:03:40,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,OIDDMZRGBYM5BP3V5VIVABZQOKPJILGV,-,-,-,78080125,V1-FL94835963.warc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20250804190340id_/https://natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250804190340/https://natlib.govt.nz/
2025-09-04 12:11:18,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2,-,-,-,26316599,V1-FL94997298.warc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20250904121118id_/http://natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250904121118/http://natlib.govt.nz/
2025-09-04 12:11:24,"nz,govt,natlib)/",https://natlib.govt.nz/,text/html,200,B545ERXBLTOCXX6CJZQBC2CF4TWXAJZY,-,-,-,27802697,V1-FL94997298.warc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20250904121124id_/https://natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250904121124/https://natlib.govt.nz/
2025-09-08 10:23:46,"nz,govt,natlib)/",http://natlib.govt.nz/,text/html,301,DFS4JFJMZDAIFJRQP3LHAYNFPKVWMMX2,-,-,-,29801086,V1-FL94998111.warc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20250908102346id_/http://natlib.govt.nz/,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20250908102346/http://natlib.govt.nz/


Note that the query results above are essentially the same as the TimeMap. In our function, we have added an "access_url" column, which contains the actual URL for each webpage snapshot.
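
Judging from the rows above, each "access_url" is the replay prefix followed by the capture timestamp as 14 digits and then the captured URL. The sketch below reproduces that pattern for one row; `build_access_url` is a hypothetical helper, and the toolkit's own implementation may differ.

```python
import datetime

# Replay prefix seen in the access_url column above.
REPLAY_BASE = "https://ndhadeliver.natlib.govt.nz/webarchive"


def build_access_url(timestamp, url, base=REPLAY_BASE):
    """Combine a capture timestamp and the captured URL into a replay URL (hypothetical helper)."""
    ts = timestamp.strftime("%Y%m%d%H%M%S")  # 14-digit capture timestamp
    return f"{base}/{ts}/{url}"


# First row of the table above.
first_capture = datetime.datetime(2004, 7, 11, 21, 32, 25)
access_url = build_access_url(first_capture, "http://www.natlib.govt.nz/")
print(access_url)
```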

In [9]:
webpage = "niwa.co.nz/"
want.query_cdx_index(webpage, filter="mimetype:image/png", matchType="prefix")

timestamp,urlkey,url,mime,status,digest,redirect,robotflags,length,offset,filename,load_url,source,source-coll,access_url
2008-10-09 07:19:37,"nz,co,niwa)/__data/assets/image/0003/64128/vessel_background.png",http://www.niwa.co.nz/__data/assets/image/0003/64128/vessel_background.png,image/png,200,3G3ZYUAOXKCM564FLQQ2HD6ZMP4QGW55,-,-,0,90648133,V1-FL617890.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20081009071937id_/http://www.niwa.co.nz/__data/assets/image/0003/64128/vessel_background.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081009071937/http://www.niwa.co.nz/__data/assets/image/0003/64128/vessel_background.png
2008-10-10 06:16:08,"nz,co,niwa)/__data/assets/image/0003/64128/vessel_background.png",http://www.niwa.co.nz/__data/assets/image/0003/64128/vessel_background.png,image/png,200,3G3ZYUAOXKCM564FLQQ2HD6ZMP4QGW55,-,-,0,16823778,V1-FL619668.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20081010061608id_/http://www.niwa.co.nz/__data/assets/image/0003/64128/vessel_background.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081010061608/http://www.niwa.co.nz/__data/assets/image/0003/64128/vessel_background.png
2008-10-10 11:31:13,"nz,co,niwa)/__data/assets/image/0004/64624/ipy-caml.png",http://niwa.co.nz/__data/assets/image/0004/64624/ipy-caml.png,image/png,200,HQVY2YENMUN7EIBBDW6OFBKEOXMOFIJC,-,-,0,44575369,V1-FL619036.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20081010113113id_/http://niwa.co.nz/__data/assets/image/0004/64624/ipy-caml.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081010113113/http://niwa.co.nz/__data/assets/image/0004/64624/ipy-caml.png
2008-10-10 12:20:56,"nz,co,niwa)/__data/assets/image/0004/64624/ipy-caml.png",http://www.niwa.co.nz/__data/assets/image/0004/64624/ipy-caml.png,image/png,200,HQVY2YENMUN7EIBBDW6OFBKEOXMOFIJC,-,-,0,30778495,V1-FL619172.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20081010122056id_/http://www.niwa.co.nz/__data/assets/image/0004/64624/ipy-caml.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081010122056/http://www.niwa.co.nz/__data/assets/image/0004/64624/ipy-caml.png
2008-10-10 18:50:23,"nz,co,niwa)/__data/assets/image/0003/74469/nz_spr_probrain_en.png",http://niwa.co.nz/__data/assets/image/0003/74469/nz_spr_probrain_EN.png,image/png,200,2A22PN35DMRDKQ6WMAA3LO32JE4L43G7,-,-,0,45258726,V1-FL618824.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20081010185023id_/http://niwa.co.nz/__data/assets/image/0003/74469/nz_spr_probrain_EN.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20081010185023/http://niwa.co.nz/__data/assets/image/0003/74469/nz_spr_probrain_EN.png
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2010-11-11 17:35:02,"nz,co,niwa)/__data/assets/image/0003/97761/outrain200909_thumb.png",http://www.niwa.co.nz/__data/assets/image/0003/97761/outrain200909_thumb.png,image/png,200,SCIIA6RRDNJ6AS3WNQMDBHA27TAECWMH,-,-,0,69134947,V1-FL2626535.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20101111173502id_/http://www.niwa.co.nz/__data/assets/image/0003/97761/outrain200909_thumb.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20101111173502/http://www.niwa.co.nz/__data/assets/image/0003/97761/outrain200909_thumb.png
2011-06-13 14:28:50,"nz,co,niwa)/__data/assets/image/0003/91632/shadow.png",http://www.niwa.co.nz/__data/assets/image/0003/91632/shadow.png,image/png,200,P2UVZ4HCIZJDCXVMJCREBIRFCLLXL625,-,-,0,84019957,V1-FL6616986.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20110613142850id_/http://www.niwa.co.nz/__data/assets/image/0003/91632/shadow.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20110613142850/http://www.niwa.co.nz/__data/assets/image/0003/91632/shadow.png
2011-06-13 14:33:17,"nz,co,niwa)/__data/assets/image/0003/83424/subnav-header-about.png",http://www.niwa.co.nz/__data/assets/image/0003/83424/subnav-header-about.png,image/png,200,M2E5SQOASVVAYLI5VFA7RT5EIHUORXZF,-,-,0,77590749,V1-FL6616985.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20110613143317id_/http://www.niwa.co.nz/__data/assets/image/0003/83424/subnav-header-about.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20110613143317/http://www.niwa.co.nz/__data/assets/image/0003/83424/subnav-header-about.png
2011-06-29 01:47:15,"nz,co,niwa)/__data/assets/image/0003/91632/shadow.png",http://www.niwa.co.nz/__data/assets/image/0003/91632/shadow.png,image/png,200,P2UVZ4HCIZJDCXVMJCREBIRFCLLXL625,-,-,0,96609486,V1-FL6264804.arc,https://wlgprdowapp01.natlib.govt.nz/nlnzwebarchive_PROD/ap/20110629014715id_/http://www.niwa.co.nz/__data/assets/image/0003/91632/shadow.png,webarchive,webarchive,https://ndhadeliver.natlib.govt.nz/webarchive/20110629014715/http://www.niwa.co.nz/__data/assets/image/0003/91632/shadow.png


In [None]:
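import requests

# Ask the TimeGate directly for the capture closest to a given datetime
response = requests.head(
    "https://ndhadeliver.natlib.govt.nz/webarchive/http://natlib.govt.nz/",
    headers={
        "Accept-Datetime": "Wed, 01 Jan 2025 01:00:00 GMT",
        "User-Agent": "NLNZWebArchiveAccessBot/1.0 (wa-nlnz-toolkit)",
    },
)
# requests parses the Link header into a dict keyed by rel
response.links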

In [None]:
# optionally pass filter="mimetype:application/pdf" to restrict the results to PDFs
df_records = want.query_cdx_index("covid19.govt.nz")
want.plot_monthly_captures(df_records)

In [None]:
want.query_memento("covid19.govt.nz", dt=datetime.datetime(2020, 3, 18, 5, 16, 41)).links

In [None]:
# the toolkit was imported as `want` at the top of the notebook
want.get_memento_urls("covid19.govt.nz")

In [None]:
# the toolkit was imported as `want` at the top of the notebook
await want.webshot("http://www.natlib.govt.nz")


In [None]:
# Count captures per calendar month; on pandas >= 2.2 use freq="ME" ("M" is deprecated there)
g = df_records.groupby(pd.Grouper(freq="M"))
g["status"].count().plot.bar()

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

# Example: list of archived URLs and their links
# Replace this with your own extracted link data
edges = [
    ("http://web.archive.org/web/20200101/example.com", 
     "http://web.archive.org/web/20200102/example.com/page1"),
    
    ("http://web.archive.org/web/20200101/example.com", 
     "http://web.archive.org/web/20200102/example.com/page2"),
    
    ("http://web.archive.org/web/20200102/example.com/page1", 
     "http://web.archive.org/web/20200103/example.com/page3"),

    ("http://web.archive.org/web/20200101/example.com", 
     "http://web.archive.org/web/20200102/example.com/page4"),

    ("http://web.archive.org/web/20200101/example.com", 
     "http://web.archive.org/web/20200102/example.com/page5"),

    ("http://web.archive.org/web/20200101/example.com", 
     "http://web.archive.org/web/20200102/example.com/page7"),
]

# Create a directed graph
G = nx.DiGraph()
G.add_edges_from(edges)

# Draw the graph
plt.figure(figsize=(10, 6))
pos = nx.spring_layout(G, k=0.5, iterations=50)
nx.draw(G, pos, with_labels=False, node_size=500, node_color="skyblue", arrows=True)
nx.draw_networkx_labels(G, pos, font_size=8)

plt.title("Link Graph of Web Archive URLs")
plt.show()


In [None]:
import matplotlib
import matplotlib.colors as mcolors
import networkx as nx
from pyvis.network import Network

# Step 1: Build graph with networkx (reuses `edges` from the previous cell)
G = nx.DiGraph()
G.add_edges_from(edges)

# Step 2: Compute degrees (in+out connections)
degrees = dict(G.degree())

# Normalize degrees for colormap
norm = mcolors.Normalize(vmin=min(degrees.values()), vmax=max(degrees.values()))
# cm.get_cmap was removed in matplotlib 3.9; the colormaps registry needs matplotlib >= 3.5
cmap = matplotlib.colormaps["plasma"]  # you can try "viridis", "coolwarm", etc.

# Step 3: Create Pyvis network
net = Network(height="1200px", width="100%", directed=True)

for node, deg in degrees.items():
    # Convert degree to RGBA then to HEX
    rgba = cmap(norm(deg))
    color = mcolors.to_hex(rgba)
    
    net.add_node(node, 
                 label=node.split("/")[-1],  # show shorter label
                 title=f"{node}<br>Degree: {deg}",
                 color=color)

# Step 4: Add edges
for u, v in edges:
    net.add_edge(u, v)

# Step 5: Save and open interactive graph
net.save_graph("archive_linkgraph_colored.html")