### This first set of scripts downloads and creates AddressBase Premium.  

AddressBase Premium is a ~120GB database of all the addresses in the UK, so some of these scripts take a long time to run.

### Prerequisites

- You need a connection to a Postgres database that supports PostGIS and Full Text Search. AWS RDS for Postgres will work.
- The Postgres instance needs about 150GB of spare disk space.
- You need about 100GB of spare disk space on your local machine.
- The scripts take about a day to run in total. Most of that time is spent waiting for long-running processes to complete.
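
Before starting, it can be worth confirming that the local machine really has the spare disk space listed above. The following is a minimal sketch using the standard library; the helper name `has_free_space` and the choice of checking the current directory are assumptions for illustration, not part of the original scripts.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


# Assumption: the files are downloaded somewhere under the current directory
print(has_free_space(".", 100))
```

If this prints `False`, free up space or point the download at a larger disk before running the cells below.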


To perform address lookups, we use [AddressBase Premium](https://www.ordnancesurvey.co.uk/business-and-government/products/addressbase-premium.html). This is free to the Government under the Public Sector Mapping Agreement (PSMA). You can get a login to the [PSMA portal](https://www.ordnancesurvey.co.uk/psma/) by emailing customerservices@os.uk.

AddressBase Premium is provided as one CSV file per OS grid square. There are around 10,000 files in total.

This script automatically downloads and unzips all the files. Its input is "Ordnance Survey Download Centre.htm", a saved copy of the HTML page containing the download links.

We use BeautifulSoup to parse this HTML page and pull out a list of download links.

In [None]:
from bs4 import BeautifulSoup

with open("Ordnance Survey Download Centre.htm") as f:
    soup = BeautifulSoup(f, "lxml")

In [None]:
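# How many links do we expect to find?
# The page states the total next to the label "Number of Files:"
# (`string=` is the current name for the older `text=` argument)
el = soup.find(string="Number of Files:").parent.parent
num = el.text.replace("Number of Files:", "").strip()
numfiles = int(num)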

In [None]:
my_links = set()
for a in soup.find_all("a"):
    href = a.get("href")
    # Found by trial and error: the download links all contain "AB76DL" in the URL
    if href and "AB76DL" in href:
        my_links.add(href)
my_links = list(my_links)

# Check we've found the right number of links
if len(my_links) != numfiles:
    raise Exception(f"Expected {numfiles} download links but found {len(my_links)}")

In [None]:
# links_done and links_failed let us restart from where we left off after an error, e.g. if the internet connection drops
links_done = set()
links_failed = set()
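
These sets only live in memory, so they are lost if the notebook kernel restarts. One possible extension, sketched below, is to save the completed links to a small JSON file and reload them on startup. The file name `links_done.json` and the helpers `load_done` and `save_done` are hypothetical and not part of the original scripts.

```python
import json
from pathlib import Path

# Hypothetical location for the progress file
PROGRESS_FILE = Path("links_done.json")


def load_done(path=PROGRESS_FILE):
    """Return the set of links recorded as done, or an empty set if no file exists yet."""
    path = Path(path)
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()


def save_done(done, path=PROGRESS_FILE):
    """Write the set of completed links to disk as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(done)))
```

With these in place, `links_done = load_done()` would replace the empty set above, and `save_done(links_done)` could be called periodically inside the download loop.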

In [None]:
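from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO

counter = 0
denom = len(my_links)
for link in my_links:
    if link in links_done:
        continue  # already downloaded on a previous run of this cell
    counter += 1
    if counter % 20 == 0:
        print(counter / denom)  # rough progress as a fraction of all links
    try:
        # Download the zip into memory, then extract it to raw/outdata/
        with urlopen(link, timeout=5) as response:
            data = response.read()
        with ZipFile(BytesIO(data)) as z:
            z.extractall("raw/outdata/")
        links_done.add(link)
        links_failed.discard(link)
    except Exception:
        # Record the failure so the link can be retried below.
        # (A bare `except:` would also swallow KeyboardInterrupt.)
        links_failed.add(link)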

Next, retry any links that failed, this time with a longer timeout.

In [None]:
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO

# Links that still fail after the longer timeout
links_failed2 = set()

counter = 0
for link in links_failed:
    counter += 1
    print(counter)
    try:
        with urlopen(link, timeout=30) as response:
            data = response.read()
        with ZipFile(BytesIO(data)) as z:
            z.extractall("raw/outdata/")
        links_done.add(link)
    except Exception:
        links_failed2.add(link)

In [None]:
len(links_failed2) == 0 
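
The two download cells above repeat the same logic with different timeouts. As an alternative, the passes could be folded into one helper that retries the failures with each successive timeout. This is a minimal sketch, not the method the notebook uses: `fetch_bytes` and `download_with_retries` are hypothetical names, and the `fetch` parameter exists only so the network call can be swapped out.

```python
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile


def fetch_bytes(link, timeout):
    """Download `link` and return the response body as bytes."""
    with urlopen(link, timeout=timeout) as response:
        return response.read()


def download_with_retries(links, dest, timeouts=(5, 30), fetch=fetch_bytes):
    """Download and unzip each link into `dest`, retrying failures with each later timeout.

    Returns (done, failed): the links that succeeded and those that never did.
    """
    remaining = set(links)
    done = set()
    for timeout in timeouts:
        failed = set()
        for link in remaining:
            try:
                with ZipFile(BytesIO(fetch(link, timeout))) as z:
                    z.extractall(dest)
                done.add(link)
            except Exception:
                failed.add(link)
        remaining = failed
        if not remaining:
            break
    return done, remaining
```

Under these assumptions, `done, failed = download_with_retries(my_links, "raw/outdata/")` would perform both passes, and an empty `failed` set corresponds to the check above evaluating to `True`.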

Finally, verify that the number of files downloaded equals the number expected.

In [None]:
import os

# Count the CSV files extracted into raw/outdata/
dl_files_counter = 0
for name in os.listdir("raw/outdata/"):
    if os.path.isfile(os.path.join("raw", "outdata", name)) and name.endswith(".csv"):
        dl_files_counter += 1

# Each download link should have produced one CSV file
if len(my_links) != dl_files_counter:
    raise Exception(f"Expected {len(my_links)} CSV files but found {dl_files_counter}")