# Using Memento

Systems supporting the Memento protocol provide machine-readable information about web archive captures, even if other APIs are not available.

**Documentation**
* [Memento Protocol Specification](https://tools.ietf.org/html/rfc7089)
* [Pywb implementation](https://pywb.readthedocs.io/en/latest/manual/memento.html)
* [Pywb URL rewriting](https://pywb.readthedocs.io/en/latest/manual/rewriter.html?highlight=id_#url-rewriting)

**Tools**
* [Memento client](https://github.com/mementoweb/py-memento-client)

In [3]:
import requests
import arrow
import re
import json

# Alternatively use the python Memento client 

In [27]:
def format_date_for_headers(iso_date, tz):
    '''
    Convert an ISO date (YYYY-MM-DD) to a datetime at noon in the specified timezone.
    Convert the datetime to UTC and format as required by Accept-Datetime headers:
    eg Fri, 23 Mar 2007 01:00:00 GMT
    '''
    local = arrow.get(f'{iso_date} 12:00:00 {tz}', 'YYYY-MM-DD HH:mm:ss ZZZ')
    gmt = local.to('utc')
    return f'{gmt.format("ddd, DD MMM YYYY HH:mm:ss")} GMT'

def parse_links_from_headers(headers):
    '''
    Extract original, timegate, timemap, and memento links from 'Link' header.
    Returns an empty dict if there is no 'Link' header.
    '''
    memento_links = {}
    # [^>]* stops each url at its closing bracket, so links with other rel values can't be swallowed
    links = re.findall(r'<([^>]*)>; rel="(original|timegate|timemap|memento|first memento|prev memento|next memento|last memento)"', headers.get('Link', ''))
    for url, url_type in links:
        memento_links[url_type] = url
    return memento_links

def format_timestamp(timestamp, date_format='YYYY-MM-DD HH:mm:ss'):
    return arrow.get(timestamp, 'YYYYMMDDHHmmss').format(date_format)

In [5]:
format_date_for_headers('2010-01-01', 'Australia/Canberra')

'Fri, 01 Jan 2010 01:00:00 GMT'
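
If you'd rather not depend on `arrow`, the same conversion can be sketched with Python's standard library. This is only a sketch, not part of the original notebook: `format_accept_datetime` is a hypothetical name, and the fixed UTC+11 offset is an assumption that stands in for Canberra's daylight saving time in January. On Python 3.9+ with time zone data installed, you could pass `zoneinfo.ZoneInfo('Australia/Canberra')` instead.

```python
from datetime import datetime, time, timedelta, timezone
from email.utils import format_datetime


def format_accept_datetime(iso_date, tzinfo):
    '''
    Hypothetical stdlib-only helper: take an ISO date (YYYY-MM-DD), assume noon
    in the supplied timezone, and return an HTTP date in GMT.
    '''
    day = datetime.strptime(iso_date, '%Y-%m-%d').date()
    local_noon = datetime.combine(day, time(12, 0), tzinfo=tzinfo)
    # format_datetime(usegmt=True) requires a datetime whose tzinfo is timezone.utc
    return format_datetime(local_noon.astimezone(timezone.utc), usegmt=True)


# Assumption: a fixed UTC+11 offset, which matches Canberra in January
aedt = timezone(timedelta(hours=11))
accept_datetime = format_accept_datetime('2010-01-01', aedt)
print(accept_datetime)
```

The result should match the value returned by `format_date_for_headers()` above.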

In [6]:
format_timestamp('20010101000000')

'2001-01-01 00:00:00'

In [7]:
TIMEGATES = {
    'nla': 'https://web.archive.org.au/awa/',
    'nlnz': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/',
    'bl': 'https://www.webarchive.org.uk/wayback/archive/',
    'ia': 'https://web.archive.org/web/'
}

## Timegates

Timegates let you query a web archive for the capture closest to a specific date. You do this by supplying your target date as the `Accept-Datetime` value in the headers of your request.

For example, if you wanted to query the Australian Web Archive to find the version of `http://nla.gov.au/` that was captured as close as possible to 1 January 2010, you'd set the `Accept-Datetime` header to 'Fri, 01 Jan 2010 01:00:00 GMT' and request the url:

```
https://web.archive.org.au/awa/http://nla.gov.au/
```

A `get` request will return the captured page, but if all you want is the url of the archived page you can use a `head` request and extract the information you need from the response headers. Try this:

In [12]:
response = requests.head('https://web.archive.org.au/awa/http://nla.gov.au/', headers={'Accept-Datetime': 'Fri, 01 Jan 2010 01:00:00 GMT'})
response.headers

{'Server': 'nginx', 'Date': 'Wed, 06 May 2020 04:37:21 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 'Vary': 'accept-datetime'}

The request above returns the following headers:

``` python
{
    'Server': 'nginx', 
    'Date': 'Wed, 06 May 2020 04:34:50 GMT', 
    'Content-Length': '0', 
    'Connection': 'keep-alive', 
    'Location': 'https://web.archive.org.au/awa/20100205144227/http://nla.gov.au/', 
    'Link': '<http://nla.gov.au/>; rel="original", <https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"', 
    'Vary': 'accept-datetime'
}
```

The `Link` header contains the Memento information. You can see that it's actually providing information on four types of link:

* the `original` url (ie the url that was archived) – `<http://nla.gov.au/>`
* the `timegate` for the harvested url (which is what we just used) – `<https://web.archive.org.au/awa/http://nla.gov.au/>`
* the `timemap` for the harvested url (we'll look at this below) – `<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>`
* the `memento` – `<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>`

The `memento` link is the capture closest in time to the date we requested. In this case there's only about a month's difference, but of course this will depend on how frequently a url is captured. Opening the link will display the capture in the web archive.
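
To see exactly how far the capture is from the requested date, you can read the `datetime` attribute attached to the `memento` link. The sketch below works on the `Link` header copied from the response above, so it doesn't make a request. `get_memento_datetime` is a hypothetical helper, and it assumes that the `datetime` attribute directly follows `rel`, as it does in this response.

```python
import re
from email.utils import parsedate_to_datetime

# Link header copied from the timegate response above
link_header = (
    '<http://nla.gov.au/>; rel="original", '
    '<https://web.archive.org.au/awa/http://nla.gov.au/>; rel="timegate", '
    '<https://web.archive.org.au/awa/timemap/link/http://nla.gov.au/>; rel="timemap"; type="application/link-format", '
    '<https://web.archive.org.au/awa/20100205144227mp_/http://nla.gov.au/>; rel="memento"; datetime="Fri, 05 Feb 2010 14:42:27 GMT"'
)


def get_memento_datetime(link_header):
    '''
    Hypothetical helper: return (url, datetime) for the first link whose rel
    value ends in "memento" and is followed by a datetime attribute, else None.
    '''
    match = re.search(r'<([^>]+)>; rel="(?:\w+ )?memento"; datetime="([^"]+)"', link_header)
    if match is None:
        return None
    url, http_date = match.groups()
    return url, parsedate_to_datetime(http_date)


# The Accept-Datetime value used in the request above
requested = parsedate_to_datetime('Fri, 01 Jan 2010 01:00:00 GMT')
memento_url, captured = get_memento_datetime(link_header)
gap = captured - requested
print(memento_url)
print(f'Captured {gap.days} days after the requested date')
```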

In [99]:
def query_timegate(timegate, url, date=None, tz='Australia/Canberra'):
    '''
    Query the timegate of the specified archive for the capture of a url closest to a date.
    If no date is supplied, no Accept-Datetime header is sent.
    Returns a dict of the Memento links found in the response's Link header.
    '''
    headers = {}
    if date:
        formatted_date = format_date_for_headers(date, tz)
        headers['Accept-Datetime'] = formatted_date
    # BL doesn't seem to default to latest date if no date supplied
    # elif not date and timegate == 'bl':
    #     formatted_date = format_date_for_headers(arrow.utcnow().format('YYYY-MM-DD'), tz)
    #     headers['Accept-Datetime'] = formatted_date
    # Note that you don't get a timegate response if you leave off the trailing slash
    tg_url = f'{TIMEGATES[timegate]}{url}/' if not url.endswith('/') else f'{TIMEGATES[timegate]}{url}'
    print(tg_url)
    # IA is queried with a GET request, the other archives with HEAD
    if timegate == 'ia':
        response = requests.get(tg_url, headers=headers)
    else:
        response = requests.head(tg_url, headers=headers)
    return parse_links_from_headers(response.headers)

In [103]:
query_timegate('nla', 'http://www.education.gov.au/')

https://web.archive.org.au/awa/http://www.education.gov.au/


{'original': 'http://pandora.nla.gov.au/pan/180147/20200304-0021/www.education.gov.au/index.html',
 'timegate': 'https://web.archive.org.au/awa/http://pandora.nla.gov.au/pan/180147/20200304-0021/www.education.gov.au/index.html',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://pandora.nla.gov.au/pan/180147/20200304-0021/www.education.gov.au/index.html',
 'memento': 'https://web.archive.org.au/awa/20200304105707mp_/http://pandora.nla.gov.au/pan/180147/20200304-0021/www.education.gov.au/index.html'}

In [98]:
query_timegate('nla', 'http://www.education.gov.au/', date='2002-01-01')

https://web.archive.org.au/awa/http://www.education.gov.au/


{'original': 'http://www.education.gov.au:80/',
 'timegate': 'https://web.archive.org.au/awa/http://www.education.gov.au:80/',
 'timemap': 'https://web.archive.org.au/awa/timemap/link/http://www.education.gov.au:80/',
 'memento': 'https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/'}

In [102]:
query_timegate('nlnz', 'http://natlib.govt.nz')

https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/


{'original': 'http://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060106/http://natlib.govt.nz/'}

In [93]:
query_timegate('nlnz', 'http://natlib.govt.nz', date='2005-01-01')

https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/


{'original': 'http://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20060704033135/http://natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/'}

In [94]:
query_timegate('nlnz', 'http://natlib.govt.nz', date='2008-01-01')

https://ndhadeliver.natlib.govt.nz/webarchive/wayback/http://natlib.govt.nz/


{'original': 'http://natlib.govt.nz/',
 'timemap': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/timemap/link/http://natlib.govt.nz/',
 'first memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20040711213225/http://natlib.govt.nz/',
 'prev memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20070322041546/http://natlib.govt.nz/',
 'memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20080225060238/http://natlib.govt.nz/',
 'next memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20081019225343/http://natlib.govt.nz/',
 'last memento': 'https://ndhadeliver.natlib.govt.nz/webarchive/wayback/20200130060111/http://natlib.govt.nz/'}

In [89]:
query_timegate('ia', 'http://discontents.com.au', date='1998-01-01')

https://web.archive.org/web/http://discontents.com.au/


{'original': 'http://www.discontents.com.au:80/',
 'timemap': 'https://web.archive.org/web/timemap/link/http://www.discontents.com.au:80/',
 'timegate': 'https://web.archive.org/web/http://www.discontents.com.au:80/',
 'first memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'memento': 'https://web.archive.org/web/19981206012233/http://www.discontents.com.au:80/',
 'next memento': 'https://web.archive.org/web/19981212024410/http://www.discontents.com.au:80/',
 'last memento': 'https://web.archive.org/web/20200418035854/http://www.discontents.com.au/'}

In [101]:
query_timegate('bl', 'http://bl.uk')

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/'}

In [100]:
query_timegate('bl', 'http://bl.uk', date='2006-01-01')

https://www.webarchive.org.uk/wayback/archive/http://bl.uk/


{'original': 'http://bl.uk/',
 'timegate': 'https://www.webarchive.org.uk/wayback/archive/http://bl.uk/',
 'timemap': 'https://www.webarchive.org.uk/wayback/archive/timemap/link/http://bl.uk/',
 'memento': 'https://www.webarchive.org.uk/wayback/archive/20060101010000mp_/http://bl.uk/'}
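
The memento urls returned above all embed a 14-digit capture timestamp, sometimes followed by a modifier such as `mp_`, and then the original url. Here's a rough sketch of pulling those pieces apart. `split_memento_url` and the `MEMENTO_URL` pattern are hypothetical, and they assume urls follow the pattern seen in these examples, which other archives might not.

```python
import re
from datetime import datetime

# Assumption: memento urls look like <prefix>/<14-digit timestamp><optional modifier>/<original url>,
# as they do in the examples above
MEMENTO_URL = re.compile(r'^(?P<prefix>.+?/)(?P<timestamp>\d{14})(?P<modifier>[a-z]{2}_)?/(?P<original>.+)$')


def split_memento_url(memento_url):
    '''
    Hypothetical helper: break a memento url into its archive prefix, capture
    timestamp, modifier (eg 'mp_'), and original url. Returns None if the url
    doesn't fit the assumed pattern.
    '''
    match = MEMENTO_URL.match(memento_url)
    if match is None:
        return None
    parts = match.groupdict()
    # Add the timestamp as a datetime object for easier comparison
    parts['captured'] = datetime.strptime(parts['timestamp'], '%Y%m%d%H%M%S')
    return parts


parts = split_memento_url('https://web.archive.org.au/awa/20020120171009mp_/http://www.education.gov.au:80/')
print(parts['prefix'], parts['captured'], parts['modifier'], parts['original'])
```

If a url has no modifier, as in the NLNZ results, `parts['modifier']` will be `None`.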

## Timemaps

In [262]:
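def get_timemap(timegate, url):
    '''
    Get the timemap for a url from the specified archive in JSON format.
    The response is parsed as one JSON object per line.
    '''
    tg_url = f'{TIMEGATES[timegate]}timemap/json/{url}/'
    response = requests.get(tg_url)
    data = [json.loads(line) for line in response.text.splitlines()]
    return data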

In [336]:
timemap = get_timemap('nla', 'http://www.aph.gov.au/Senate/committee/eet_ctte/uni_finances/report/index.htm')
timemap

False


[{'urlkey': 'au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm',
  'timestamp': '20031122074837',
  'url': 'http://www.aph.gov.au/senate/committee/EET_CTTE/uni_finances/report/index.htm',
  'mime': 'text/html',
  'status': '200',
  'digest': '3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE',
  'offset': '97170362',
  'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-01336-000000.arc.gz',
  'source': 'awa',
  'source-coll': 'awa'},
 {'urlkey': 'au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm',
  'timestamp': '20031204140637',
  'url': 'http://www.aph.gov.au/senate/committee/eet_ctte/uni_finances/report/index.htm',
  'mime': 'text/html',
  'status': '200',
  'digest': '3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE',
  'offset': '42227559',
  'filename': 'NLA-EXTRACTION-1996-2004-ARCS-PART-03097-000000.arc.gz',
  'source': 'awa',
  'source-coll': 'awa'},
 {'urlkey': 'au,gov,aph)/senate/committee/eet_ctte/uni_finances/report/index.htm',
  'timestamp': '20031204170123',
  'url': 'ht
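
Each capture in the timemap includes a `timestamp` and a `digest`. One thing you might do with these is collapse runs of captures that share a digest, to get a sense of when the content changed. The sketch below uses the first two records from the output above, trimmed to the relevant fields. `find_versions` is a hypothetical helper, and it assumes that identical digests mean identical content.

```python
from itertools import groupby

# First two captures from the timemap above, trimmed to the fields used here
captures = [
    {'timestamp': '20031122074837', 'digest': '3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE'},
    {'timestamp': '20031204140637', 'digest': '3H5Z77RMYKXRFCE6ODTWAKMBPZOGJ4YE'},
]


def find_versions(captures):
    '''
    Hypothetical helper: collapse consecutive captures with the same digest
    into one "version", recording its first and last timestamps and a count.
    '''
    versions = []
    # 14-digit timestamps sort chronologically as strings
    ordered = sorted(captures, key=lambda c: c['timestamp'])
    for digest, group in groupby(ordered, key=lambda c: c['digest']):
        group = list(group)
        versions.append({
            'digest': digest,
            'first': group[0]['timestamp'],
            'last': group[-1]['timestamp'],
            'captures': len(group),
        })
    return versions


versions = find_versions(captures)
print(versions)
```

With real data you'd pass in the full list returned by `get_timemap()`.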

## Mementos

In [247]:
response = requests.head('https://web.archive.org.au/awa/20200302223537if_/http://discontents.com.au/')
response.headers

{'Server': 'nginx', 'Date': 'Sat, 02 May 2020 12:17:38 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '0', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Connection': 'close', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'Link': '<http://discontents.com.au/wp-json/>; rel="https://api.w.org/", <http://wp.me/65XnW>; rel=shortlink, <http://discontents.com.au/>; rel="original", <https://web.archive.org.au/awa/http://discontents.com.au/>; rel="timegate", <https://web.archive.org.au/awa/timemap/link/http://discontents.com.au/>; rel="timemap"; type="application/link-format", <https://web.archive.org.au/awa/20200302223537mp_/http://discontents.com.au/>; rel="memento"; datetime="Mon, 02 Mar 2020 22:35:37 GMT"; collection="awa"', 'Memento-Datetime': 'Mon, 02 Mar 2020 22:35:37 GMT'}
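
The `Memento-Datetime` header in the response above gives the datetime of the capture. As a quick check, this sketch converts that value into a 14-digit timestamp and compares it with the timestamp embedded in the requested url. The values are copied from the request and response above, and `to_wayback_timestamp` is a hypothetical helper name.

```python
from email.utils import parsedate_to_datetime

# Values copied from the request and response above
memento_url = 'https://web.archive.org.au/awa/20200302223537if_/http://discontents.com.au/'
memento_datetime = 'Mon, 02 Mar 2020 22:35:37 GMT'


def to_wayback_timestamp(http_date):
    '''
    Hypothetical helper: convert an HTTP date such as a Memento-Datetime value
    into a 14-digit YYYYMMDDHHMMSS timestamp.
    '''
    return parsedate_to_datetime(http_date).strftime('%Y%m%d%H%M%S')


timestamp = to_wayback_timestamp(memento_datetime)
print(timestamp)
# Check whether the timestamp embedded in the url agrees with the header
print(f'/{timestamp}' in memento_url)
```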