# Using Memento

Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available.

**Documentation**
* [Memento Protocol Specification](https://tools.ietf.org/html/rfc7089)
* [Pywb implementation](https://pywb.readthedocs.io/en/latest/manual/memento.html)
* [Pywb URL rewriting](https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting)

**Tools**
* [Memento client](https://github.com/mementoweb/py-memento-client)

In [2]:
import requests
import arrow
import re
import json

# Alternatively use the python Memento client 

In [16]:
# Some handy functions that we'll use below

def format_date_for_headers(iso_date, tz):
    '''
    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.
    Convert the datetime to UTC and format as required by Accept-Datetime headers:
    eg Fri, 23 Mar 2007 01:00:00 GMT
    '''
    local = arrow.get(f'{iso_date} 12:00:00 {tz}', 'YYYY-MM-DD HH:mm:ss ZZZ')
    gmt = local.to('utc')
    return f'{gmt.format("ddd, DD MMM YYYY HH:mm:ss")} GMT'

def parse_links_from_headers(headers):
    '''
    Extract original, timegate, timemap, and memento links from 'Link' header.
    '''
    memento_links = {}
    try:
        links = re.findall(r'<(.*?)>; rel="(original|timegate|timemap|memento|first memento|prev memento|next memento|last memento)"', headers['Link'])
    except (KeyError):
        print("No 'Link' header")
        print(headers)
    else:
        for url, url_type in links:
            memento_links[url_type] = url
    return memento_links

def format_timestamp(timestamp, date_format='YYYY-MM-DD HH:mm:ss'):
    return arrow.get(timestamp, 'YYYYMMDDHHmmss').format(date_format)

In [4]:
format_date_for_headers('2010-01-01', 'Australia/Canberra')

'Fri, 01 Jan 2010 01:00:00 GMT'
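
As a cross-check on the value above, here's a minimal sketch that builds the same header value using only the standard library. It assumes a fixed UTC+11 offset for Canberra on this date (daylight saving time) instead of looking the timezone up by name, so it isn't a general replacement for `format_date_for_headers`. The function name `http_date_from_local_noon` is made up for this example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def http_date_from_local_noon(iso_date, utc_offset_hours):
    '''
    Take an ISO date (YYYY-MM-DD), assume noon at a fixed UTC offset,
    and return the equivalent HTTP date string in GMT.
    '''
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local_noon = datetime.strptime(iso_date, '%Y-%m-%d').replace(hour=12, tzinfo=local_tz)
    # format_datetime(usegmt=True) needs a datetime whose tzinfo is timezone.utc
    return format_datetime(local_noon.astimezone(timezone.utc), usegmt=True)

# Assumption: Canberra is at UTC+11 on 1 January 2010
print(http_date_from_local_noon('2010-01-01', 11))
```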

In [5]:
format_timestamp('20010101000000')

'2001-01-01 00:00:00'

In [6]:
TIMEGATES = {
    'nla': 'https://web.archive.org.au/awa/',
    'nlnz': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'bl': 'https://www.webarchive.org.uk/wayback/archive/',
    'ia': 'https://web.archive.org/web/'
}

## Timegates

Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the `Accept-Datetime` value in the headers of your request.

For example, if you wanted to query the Australian Web Archive to find the version of `http://nla.gov.au/` that was captured as close as possible to 1 January 2010, you'd set the `Accept-Datetime` header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:

```
https://web.archive.org.au/awa/http://nla.gov.au/
```

A `get` request will return the captured page, but if all you want is the url of the archived page you can use a `head` request and extract the information you need from the response headers. Try this:

In [7]:
response = requests.head('https://web.archive.org.au/awa/http://nla.gov.au/', headers={'Accept-Datetime': 'Fri, 01 Jan 2010 01:00:00 GMT'})
response.headers

{'Server': 'nginx', 'Date': 'Fri, 08 May 2020 02:03:16 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 'Vary': 'accept-datetime'}

The request above returns the following headers:

``` python
{
    'Server': 'nginx', 
    'Date': 'Wed, 06 May 2020 04:34:50 GMT', 
    'Content-Length': '0', 'Connection': 'keep-alive', 
    'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 
    'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 
    'Vary': 'accept-datetime'
}
```

The `Link` header contains the Memento information. You can see that it's actually providing information on four types of link:

* the `original` url (ie the url that was archived) – `<http://nla.gov.au/>`
* the `timegate` for the harvested url (which is what we just used) – `<https://web.archive.org.au/awa/http://nla.gov.au/>`
* the `timemap` for the harvested url (we'll look at this below) – `<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>`
* the `memento` – `<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>`

The `memento` link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive. As we'll see below, some systems provide additional links such as `first memento`, `last memento`, `prev memento`, and `next memento`.
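
The `memento` entry in the `Link` header also carries a `datetime` attribute, which the `parse_links_from_headers` function defined earlier doesn't keep. Here's a rough sketch of one way to capture it, using the `Link` value from the response above. The function name `parse_links_with_dates` is hypothetical, and the pattern assumes quoted `rel` values with any other attributes following `rel`, which may not hold for every archive.

```python
import re
from email.utils import parsedate_to_datetime

# Link header copied from the response shown above
LINK_HEADER = (
    '<http://nla.gov.au/>; rel="original", '
    '<https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", '
    '<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", '
    '<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"'
)

def parse_links_with_dates(link_header):
    '''
    Return a dict keyed by rel value. Each value holds the url and,
    where a datetime attribute is present, the parsed capture datetime.
    '''
    links = {}
    # url, rel value, then any further quoted attributes attached to the same link
    pattern = r'<([^>]*)>; rel="([^"]*)"((?:; [\w-]+="[^"]*")*)'
    for url, rel, params in re.findall(pattern, link_header):
        entry = {'url': url}
        date_match = re.search(r'datetime="([^"]*)"', params)
        if date_match:
            entry['datetime'] = parsedate_to_datetime(date_match.group(1))
        links[rel] = entry
    return links

links = parse_links_with_dates(LINK_HEADER)
print(links['memento']['url'])
print(links['memento']['datetime'].isoformat())
```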

Here's a basic function to query a timegate in one of the four systems we're exploring. We'll use it to compare the results we get from each.

In [8]:
def query_timegate(timegate, url, date=None, tz='Australia/Canberra', request_type='head'):
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers['Accept-Datetime'] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash
    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'
    print(tg_url)
    if request_type == 'head':
        response = requests.head(tg_url, headers=headers)
    else:
        response = requests.get(tg_url, headers=headers)
    # print(response.headers)
    return parse_links_from_headers(response.headers)

### National Library of Australia

A query without an `Accept-Datetime` value returns a recent capture.

In [105]:
query_timegate('nla', 'http://www.nla.gov.au')

https://web.archive.org.au/awa/http://www.nla.gov.au/


{'original': 'http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'timegate': 'https://web.archive.org.au/awa/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html',
 'memento': 'https://web.archive.org.au/awa/20200305172547mp_/http://pandora.nla.gov.au/pan/161756/20200306-0200/www.nla.gov.au/index.html'}

A query with an `Accept-Datetime` value of 1 January 2002 returns a capture from 20 January 2002.

In [98]:
query_timegate('nla', 'http://www.education.gov.au/', date='2002-01-01')

https://web.archive.org.au/awa/http://www.education.gov.au/


{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}
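
The capture date quoted above comes from the 14-digit timestamp embedded in the `memento` url. A minimal sketch for pulling it out is below. It assumes the timestamp is a path segment of exactly 14 digits, optionally followed by a two-letter modifier such as `mp_`; the function name `capture_datetime` is made up for this example.

```python
import re
from datetime import datetime

def capture_datetime(memento_url):
    '''
    Extract the 14-digit capture timestamp (YYYYMMDDHHmmss) from a memento url
    and return it as a datetime, or None if no timestamp is found.
    '''
    match = re.search(r'/(\d{14})(?:[a-z]{2}_)?/', memento_url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%Y%m%d%H%M%S')

print(capture_datetime('https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'))
```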

Using a `GET` rather than a `HEAD` request returns no Memento information.

In [17]:
query_timegate('nla', 'http://www.education.gov.au/', date='2002-01-01', request_type='get')

https://web.archive.org.au/awa/http://www.education.gov.au/
No 'Link' header
{'Server': 'nginx', 'Date': 'Fri, 08 May 2020 02:25:04 GMT', 'Content-Type': 'text/html;charset=UTF-8', 'Connection': 'close', 'Content-Language': 'en-AU', 'Content-Encoding': 'gzip'}


{}

### National Library of New Zealand

A query without an `Accept-Datetime` value doesn't return a `memento`, but does include `first memento`, `last memento`, and `prev memento`.

In [9]:
query_timegate('nlnz', 'http://natlib.govt.nz')

https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/


{'original': 'http://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/http://natlib.govt.nz/'}

A query with an `Accept-Datetime` value of 1 January 2005 doesn't return a `memento`, even though there's a capture available from July 2004. I don't know why this is.

In [10]:
query_timegate('nlnz', 'http://natlib.govt.nz', date='2005-01-01')

https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/


{'original': 'http://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20060704033135/http://natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/'}

A query with an `Accept-Datetime` value of 1 January 2008 returns a `memento` from 25 February 2008, as well as `first memento`, `last memento`, `prev memento`, and `next memento`.

In [11]:
query_timegate('nlnz', 'http://natlib.govt.nz', date='2008-01-01')

https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/


{'original': 'http://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20070322041546/http://natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20080225060238/http://natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20081019225343/http://natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/'}

A `GET` request returns the same mementos as a `HEAD` request, though the urls now start with `www.` and a `timegate` link is included.

In [18]:
query_timegate('nlnz', 'http://natlib.govt.nz', date='2008-01-01', request_type='get')

https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/


{'original': 'http://www.natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://www.natlib.govt.nz/',
 'timegate': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://www.natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://www.natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20070322041546/http://www.natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20080225060238/http://www.natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20081019225343/http://www.natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://www.natlib.govt.nz/'}

### Internet Archive

Using a `HEAD` request returns no Memento information (this is different from all the others).

In [19]:
query_timegate('ia', 'http://discontents.com.au')

https://web.archive.org/web/http://discontents.com.au/
No 'Link' header
{'Server': 'nginx/1.15.8', 'Date': 'Fri, 08 May 2020 02:33:02 GMT', 'Content-Type': 'text/plain; charset=utf-8', 'Content-Length': '0', 'Connection': 'keep-alive', 'X-Archive-Redirect-Reason': 'found capture at 20200418035854', 'Location': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/', 'Server-Timing': 'PetaboxLoader3.datanode;dur=25.887495, captures_list;dur=263.083917, LoadShardBlock;dur=209.967965, PetaboxLoader3.resolve;dur=67.117410, RedisCDXSource;dur=25.657726, exclusion.robots.policy;dur=0.173983, CDXLines.iter;dur=15.774050, esindex;dur=0.019052, exclusion.robots;dur=0.186151', 'X-App-Server': 'wwwb-app29', 'X-ts': '302', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/http://discontents.com.au/AU', 'X-Page-Cache': 'MISS'}


{}

A query without an `Accept-Datetime` value returns a `memento` and also includes a `first memento`, `prev memento`, `next memento`, and `last memento`. It seems that the `memento` returned is the second last capture.

In [21]:
query_timegate('ia', 'http://discontents.com.au', request_type='get')

https://web.archive.org/web/http://discontents.com.au/


{'original': 'http://discontents.com.au/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20200307153616/http://www.discontents.com.au/',
 'memento': 'https://web.archive.org/web/20200417044906/http://discontents.com.au/',
 'next memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/',
 'last memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/'}

A query with an `Accept-Datetime` value of 1 January 2010 returns a `memento` from 9 February 2010, as well as `first memento`, `prev memento`, `next memento`, and `last memento`. The `prev memento` is from 30 October 2009.

In [23]:
query_timegate('ia', 'http://discontents.com.au', date='2010-01-01', request_type='get')

https://web.archive.org/web/http://discontents.com.au/


{'original': 'http://discontents.com.au:80/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://discontents.com.au:80/',
 'timegate': 'https://web.archive.org/web/http://discontents.com.au:80/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'prev memento': 'https://web.archive.org/web/20091030053520/http://discontents.com.au/',
 'memento': 'https://web.archive.org/web/20100209041537/http://discontents.com.au:80/',
 'next memento': 'https://web.archive.org/web/20100523101442/http://discontents.com.au:80/',
 'last memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/'}
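
To see how far each returned capture is from the requested date, you can compare the timestamps directly. This is only a sketch: the timestamps are copied by hand from the output above, and the requested datetime assumes noon on 1 January 2010 in Canberra, which is 01:00:00 GMT. The `distance_from` helper is hypothetical.

```python
from datetime import datetime

# Requested Accept-Datetime (noon on 1 January 2010 in Canberra, expressed in GMT)
requested = datetime(2010, 1, 1, 1, 0, 0)

# Capture timestamps copied from the links in the output above
timestamps = {
    'prev memento': '20091030053520',
    'memento': '20100209041537',
    'next memento': '20100523101442',
}

def distance_from(requested, timestamp):
    '''Return the absolute time difference between a capture and the requested date.'''
    captured = datetime.strptime(timestamp, '%Y%m%d%H%M%S')
    return abs(captured - requested)

distances = {rel: distance_from(requested, ts) for rel, ts in timestamps.items()}

# List the links from nearest to furthest
for rel, delta in sorted(distances.items(), key=lambda item: item[1]):
    print(f'{rel}: {delta.days} days')

closest = min(distances, key=distances.get)
```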

### UK Web Archive

A query without an `Accept-Datetime` value doesn't return a `memento`.

In [101]:
query_timegate('bl', 'http://bl.uk')

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/'}

A query with an `Accept-Datetime` value of 1 January 2006 returns a `memento` from 1 January 2006.

In [100]:
query_timegate('bl', 'http://bl.uk', date='2006-01-01')

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20060101010000mp_/http://bl.uk/'}

A `GET` request returns the same results as a `HEAD` request.

In [24]:
query_timegate('bl', 'http://bl.uk', date='2006-01-01', request_type='get')

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20060101010000mp_/http://bl.uk/'}

## Timemaps

In [262]:
def get_timemap(timegate, url):
    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'
    response = requests.get(tg_url)
    data = [json.loads(line) for line in response.text.splitlines()]
    return data

In [None]:
timemap = get_timemap('nla', 'http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm')
timemap
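
The `Link` headers shown earlier also advertise a timemap with the type `application/link-format`. Here's a rough sketch of how a timemap in that layout might be parsed. The sample text is hand-written for illustration, reusing two capture urls that appear in the National Library of New Zealand results above; it is not real output from any archive. The function name is hypothetical, and the pattern assumes that `datetime` directly follows `rel`, which real timemaps may not guarantee.

```python
import re
from email.utils import parsedate_to_datetime

# Hand-written sample in the link-format layout; not real output from any archive
SAMPLE_TIMEMAP = '''<http://natlib.govt.nz/>; rel="original",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/>; rel="first memento"; datetime="Sun, 11 Jul 2004 21:32:25 GMT",
<https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20060704033135/http://natlib.govt.nz/>; rel="memento"; datetime="Tue, 04 Jul 2006 03:31:35 GMT"'''

def mementos_from_link_timemap(text):
    '''Return (datetime, url) pairs for every entry whose rel value includes "memento".'''
    pattern = r'<([^>]*)>;\s*rel="([^"]*memento[^"]*)";\s*datetime="([^"]*)"'
    return [(parsedate_to_datetime(date), url) for url, rel, date in re.findall(pattern, text)]

for captured, url in mementos_from_link_timemap(SAMPLE_TIMEMAP):
    print(captured.isoformat(), url)
```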

## Mementos

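Memento urls can include a modifier after the timestamp: `id_`, `if_`, or `mp_` (see the Pywb URL rewriting documentation linked above). The request below uses `if_`.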

In [247]:
response = requests.head('https://web.archive.org.au/awa/20200302223537if_/http://discontents.com.au/')
response.headers

{'Server': 'nginx', 'Date': 'Sat, 02 May 2020 12:17:38 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '0', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Connection': 'close', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'Link': '<http://discontents.com.au/wp-json/>; rel="https://api.w.org/", <http://wp.me/65XnW>; rel=shortlink, <http://discontents.com.au/>; rel="original", <https://web.archive.org.au/awa/http://discontents.com.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://discontents.com.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20200302223537mp_/http://discontents.com.au/>; rel="memento"; datetime="Mon, 02 Mar 2020 22:35:37 GMT"; collection="awa"', 'Memento-Datetime': 'Mon, 02 Mar 2020 22:35:37 GMT'}