# Resources

Several external resources have been employed in our experiments. This notebook automates the process of downloading them and storing them in the appropriate locations.

Run All cells to complete the downloads. **Beware that this will download >70GB to your local machine. You will need a reliable and stable internet connection for this.**

Manual instructions for obtaining the resources are included too.

## Table of contents

- [Zenodo](#zenodo)
- [Geonames](#geonames)
- [Boundary-Line™ Data](#boundary-line-data)
- [Ranklib](#ranklib)
- [Wikidata](#wikidata)


In [None]:
import requests
import zipfile
import io
from os import path

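# A handy wrapper function
def download_and_unzip(source_url, dest_dir):
    """Download a zip archive into memory and extract it into `dest_dir`."""
    response = requests.get(source_url)
    # Fail early on HTTP errors instead of trying to unzip an error page
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as zipped_data:
        zipped_data.extractall(dest_dir)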

root_local_dir = path.dirname(path.realpath("__file__"))
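
The wrapper above holds the whole archive in memory before extracting it, which may be uncomfortable for large downloads. A minimal sketch of an alternative, assuming it is acceptable to spool the archive to a temporary file first; `unzip_from_chunks` is a hypothetical helper and is not part of the original notebook:

```python
import tempfile
import zipfile


def unzip_from_chunks(chunks, dest_dir):
    """Write an iterable of byte chunks to a temporary file, then extract it as a zip."""
    with tempfile.TemporaryFile() as tmp_file:
        for chunk in chunks:
            tmp_file.write(chunk)
        # Rewind so that zipfile can read the archive from the start
        tmp_file.seek(0)
        with zipfile.ZipFile(tmp_file) as zipped_data:
            zipped_data.extractall(dest_dir)


# Possible use with requests (not executed here):
# with requests.get(source_url, stream=True) as response:
#     response.raise_for_status()
#     unzip_from_chunks(response.iter_content(chunk_size=1024 * 1024), dest_dir)
```

Taking an iterable of chunks rather than a URL keeps the extraction step independent of the network, so it can be tried on any in-memory archive.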


# Zenodo

Many of the resources we use for our experiments can be downloaded from [Zenodo](https://zenodo.org/record/5520883).

#### Automated download:

The cell below automates this download. Some of the directories will be empty at this stage. They will be populated by later cells.

#### Manual download:

Download the compressed file `resources.zip` and unzip it. Our code assumes the following directory structure:

```
station-to-station/
├── ...
├── resources/
│   ├── deezymatch/
│   ├── geonames/
│   ├── geoshapefiles/
│   ├── quicks/
│   ├── ranklib/
│   ├── wikidata/
│   ├── wikigaz/
│   └── wikipedia/
└── ...
```

Some of the directories will be empty, because we cannot share all the resources we used in our experiments. Please follow the instructions below to obtain the remaining files and store them in the right location.
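
To see which directories still need files, a small check along the following lines may help. This is a sketch only: `report_resource_dirs` is a hypothetical helper, and the directory names are copied from the tree above.

```python
from pathlib import Path

# Directory names taken from the structure listed above
EXPECTED_DIRS = [
    "deezymatch", "geonames", "geoshapefiles", "quicks",
    "ranklib", "wikidata", "wikigaz", "wikipedia",
]


def report_resource_dirs(resources_dir):
    """Return a dict mapping each expected directory to 'missing', 'empty' or 'populated'."""
    report = {}
    for name in EXPECTED_DIRS:
        directory = Path(resources_dir) / name
        if not directory.is_dir():
            report[name] = "missing"
        elif any(directory.iterdir()):
            report[name] = "populated"
        else:
            report[name] = "empty"
    return report
```

For example, `report_resource_dirs("resources")` could be run after each download step to confirm that the corresponding directory is no longer empty.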

In [None]:
zenodo_url = "https://zenodo.org/record/5520883/files/resources.zip?download=1"
zenodo_dest_dir = root_local_dir

# Get resources from Zenodo
download_and_unzip(zenodo_url, zenodo_dest_dir)


# Geonames

#### Automated download:
The cell below downloads both required Geonames files.

#### Manual download:
Download the [GB table](http://download.geonames.org/export/dump/GB.zip), and store the unzipped file (`GB.txt`) under `resources/geonames/`.
> For reference, we have used the `2021-04-26 09:01` version in our experiments.

Download the [alternateNamesV2 table](http://download.geonames.org/export/dump/alternateNamesV2.zip), and store the unzipped files (`alternateNamesV2.txt` and `iso-languagecodes.txt`) under `resources/geonames/`.
> For reference, we have used the `2021-04-26 09:11` version in our experiments.

In [None]:
# Get two Geonames data files
geonames_url = "http://download.geonames.org/export/dump/GB.zip"
alt_geonames_url = "http://download.geonames.org/export/dump/alternateNamesV2.zip"
geonames_dest_dir = path.join(root_local_dir, "resources/geonames/")

download_and_unzip(geonames_url, geonames_dest_dir)
download_and_unzip(alt_geonames_url, geonames_dest_dir)


# Boundary-Line™ Data

#### Automated download:
The cell below downloads the Ordnance Survey Boundary-Line™ data.

#### Manual download:

Download the Boundary-Line™ ESRI Shapefile from https://osdatahub.os.uk/downloads/open/BoundaryLine (see [licence](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)). Unzip it and copy the following files under the `geoshapefiles/` folder:
* `Data/Supplementary_Country/country_region.dbf`
* `Data/Supplementary_Country/country_region.prj`
* `Data/Supplementary_Country/country_region.shp`
* `Data/Supplementary_Country/country_region.shx`
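
If you have already downloaded the Boundary-Line™ zip file, the manual copy step can be scripted. The sketch below is an illustration rather than part of the original notebook: `extract_country_region` is a hypothetical helper, and it assumes the archive contains the four paths listed above, possibly nested under a top-level folder.

```python
import zipfile
from pathlib import Path, PurePosixPath

# The four files named in the manual instructions above
WANTED_SUFFIXES = tuple(
    f"Data/Supplementary_Country/country_region.{ext}"
    for ext in ("dbf", "prj", "shp", "shx")
)


def extract_country_region(zip_source, dest_dir):
    """Copy only the country_region files from the archive directly into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.endswith(WANTED_SUFFIXES):
                # Drop the folders inside the archive and keep only the file name
                target = dest / PurePosixPath(member).name
                target.write_bytes(archive.read(member))
                extracted.append(target.name)
    return sorted(extracted)
```

`zip_source` can be a path to the downloaded zip or an open binary file object, since `zipfile.ZipFile` accepts both.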


In [None]:
# Get Boundary-Line™ Data
boundaryline_url = "https://api.os.uk/downloads/v1/products/BoundaryLine/downloads?area=GB&format=ESRI%C2%AE+Shapefile&redirect"
boundaryline_dest_dir = path.join(root_local_dir, "resources/geoshapefiles/")

download_and_unzip(boundaryline_url, boundaryline_dest_dir)


# Ranklib

#### Automated download:
The cell below downloads the Ranklib software library.

#### Manual download:

Download the Ranklib `.jar` file from the Lemur project [RankLib page](https://sourceforge.net/p/lemur/wiki/RankLib/) and store it in `ranklib/`. In our experiments, we have used version 2.13, available [here](https://sourceforge.net/projects/lemur/files/lemur/RankLib-2.13/). If this is not available anymore, we would suggest that you get the most recent binary [from here](https://sourceforge.net/projects/lemur/files/lemur/).
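
Download pages sometimes return an HTML page instead of the file itself, so it can be worth checking that the saved file really is a Java archive. A minimal sketch, assuming the jar carries a manifest (expected for a runnable jar, though not guaranteed here); `looks_like_jar` is a hypothetical helper:

```python
import zipfile


def looks_like_jar(file_path):
    """Return True if the file is a zip archive that contains a JAR manifest."""
    if not zipfile.is_zipfile(file_path):
        return False
    with zipfile.ZipFile(file_path) as archive:
        return "META-INF/MANIFEST.MF" in archive.namelist()
```

For instance, calling `looks_like_jar` on the path of the saved `RankLib-2.13.jar` should return `True` if the download succeeded.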


In [None]:
ranklib_url = "https://sourceforge.net/projects/lemur/files/lemur/RankLib-2.13/RankLib-2.13.jar/download"
rank_lib_fpath = path.join(root_local_dir, "resources/ranklib/RankLib-2.13.jar")

# Download without needing to unzip
response = requests.get(ranklib_url)
# Fail early on HTTP errors instead of saving an error page as the jar
response.raise_for_status()
with open(rank_lib_fpath, mode='wb') as jar_file:
    jar_file.write(response.content)

# Wikidata

The Wikidata dump is huge (~70 GB). It therefore needs to be downloaded in chunks, and its size needs to be verified to ensure that the download has completed successfully.


#### Automated download:
The cell below downloads the Wikidata file and verifies its size.

#### Manual download:

Download a full Wikidata dump from [here](https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2) and store the `latest-all.json.bz2` file in `wikidata/`.

In [None]:
# The Wikidata dump is downloaded in chunks, and its size is verified
# against the server's file listing to check that the download completed.
import urllib.parse
from bs4 import BeautifulSoup
import re

def get_details_from_nginx_filelist(filelist_url, filename):
    """
    This function expects the url for an `nginx` styled directory listing, and a filename appearing on it.
    It parses the directory listing to obtain metadata about the named file.
    It returns a tuple with the `filename`, the `modified date` (as str) and the filesize (as int),
    or None if the file is not found in the listing.
    """
    search_fname = re.escape(filename)
    pattern = (r"^(?P<filename>" +
                search_fname +
                r")\s+(?P<datetime>\d\d-[a-zA-Z]{3}-20\d\d \d\d:\d\d)\s+(?P<filesize>\d+)"
    )

    response = requests.get(filelist_url)

    # As a minimum, check that it is an nginx site
    if not re.search("nginx", response.headers["server"]):
        raise ValueError(f"Not an Nginx webserver: {filelist_url}")

    # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
    file_list_txt = BeautifulSoup(response.text, "html.parser").get_text()
    match = re.search(pattern, file_list_txt, re.MULTILINE)

    if match:
        return match.group("filename"), match.group("datetime"), int(match.group("filesize"))

    return None

# Must have trailing slash
wikidata_filelist_url = "https://dumps.wikimedia.org/wikidatawiki/entities/"
wikidata_fname = "latest-all.json.bz2"

wikidata_url = urllib.parse.urljoin(wikidata_filelist_url, wikidata_fname)
wikidata_fpath = path.join(root_local_dir, "resources/wikidata", wikidata_fname)

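details = get_details_from_nginx_filelist(wikidata_filelist_url, wikidata_fname)
if details is None:
    raise ValueError(f"{wikidata_fname} not found in listing at {wikidata_filelist_url}")
_, _, target_size = details
print(f"Expected size: {target_size} bytes")
# The local file may not exist yet, so only report its size if it does
if path.exists(wikidata_fpath):
    print(f"Local size: {path.getsize(wikidata_fpath)} bytes")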


# Only download if the correct sized file does not exist locally
if (not path.exists(wikidata_fpath)) or path.getsize(wikidata_fpath) != target_size:
    print("Up to date Wikidata not downloaded locally. Downloading now")

    chunk_size = 1024*1024
    bytes_written = 0

    response = requests.get(wikidata_url, stream=True)
    response.raise_for_status()
    with open(wikidata_fpath, mode="wb") as fb:
        for chunk in response.iter_content(chunk_size=chunk_size):
            fb.write(chunk)
            # Count the bytes actually written; the last chunk may be shorter
            bytes_written += len(chunk)
            print(f"downloaded {bytes_written} bytes so far")


if (not path.exists(wikidata_fpath)) or path.getsize(wikidata_fpath) != target_size:
    raise UserWarning("warning - wikidata not downloaded correctly")
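
To illustrate what the regular expression in `get_details_from_nginx_filelist` is looking for, the sketch below applies the same pattern to a made-up listing. The dates and sizes in `sample_listing` are invented for illustration and do not reflect the real dump; `parse_listing_line` is a hypothetical helper that isolates the parsing step from the network request.

```python
import re


def parse_listing_line(listing_text, filename):
    """Find `filename` in plain-text listing output and return (name, date, size) or None."""
    pattern = (
        r"^(?P<filename>" + re.escape(filename) + r")"
        r"\s+(?P<datetime>\d\d-[a-zA-Z]{3}-20\d\d \d\d:\d\d)"
        r"\s+(?P<filesize>\d+)"
    )
    match = re.search(pattern, listing_text, re.MULTILINE)
    if match is None:
        return None
    return match.group("filename"), match.group("datetime"), int(match.group("filesize"))


# Made-up listing text: the dates and sizes below are NOT real values
sample_listing = (
    "../\n"
    "latest-all.json.bz2        01-Jan-2024 00:00        1234567890\n"
    "latest-all.json.gz         01-Jan-2024 00:00        2345678901\n"
)

result = parse_listing_line(sample_listing, "latest-all.json.bz2")
print(result)
```

Because the pattern is anchored to the start of a line and the filename is escaped, `latest-all.json.gz` is not mistaken for the `.bz2` file.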


# -- End --