# Scan for New SPD Files
Two approaches have been attempted: parsing the HTML content with bs4, and "hashing" the page content with hashlib.

In [1]:
import hashlib
import urllib
import urllib.request
from urllib.request import urlopen, Request
import filecmp
import os
from bs4 import BeautifulSoup
import numpy as np
import wget
import requests
import zipfile
import io

## Opening and reading the URL
We first input the desired url and parse the HTML content.

In [2]:
url = Request('https://www.nrscotland.gov.uk/statistics-and-data/geography/nrs-postcode-extract',
              headers={'User-Agent': 'Mozilla/5.0'})

Read the url and convert to a "soup" object to read the HTML content.

In [3]:
response = urlopen(url).read()
soup = BeautifulSoup(response, "html.parser")  # name the parser explicitly to avoid the bs4 warning

## Searching for the target location
Now we search for the required information, which is the '2021-2' link to the latest SPD files. This is subject to change in the future, e.g., when it is updated to '2022-2'. Assuming the format of the names remains unchanged, we can simply replace the '2021-2' below with the next expected name.

There are various ways to look for our target information. Going through the HTML content, we find that the tags 'td' (table cells) and 'a' (hyperlinks) are associated with our target. This is also apparent from the website itself: the link to '2021-2' sits inside a table and is, more obviously, a hyperlink.

We use the find_all function to look for content associated with the tags.

In [4]:
soup.find_all("td")

[<td valign="top" width="15%"><strong>Latest</strong></td>,
 <td valign="top" width="85%"><a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a></td>,
 <td valign="top" width="15%"><strong>Previous</strong></td>,
 <td valign="top" width="85%"><a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-1">2021-1</a></td>,
 <td valign="top" width="15%"><strong>Archive</strong></td>,
 <td valign="top" width="85%"><a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/archived-postcode-extract">2012-2 onwards</a></td>]

By inspecting the results, we see that the desired result is the second item. So we extract it and turn it into a string to confirm this is indeed what we are looking for.

In [5]:
section_label = soup.find_all("td")[1]
section_label.string

'2021-2'
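Picking the second 'td' by position works only while the table layout stays the same. A possible alternative, sketched below, is to find the cell labelled "Latest" and take the link from the cell next to it, so that the release label does not need to be known in advance. The sample markup is copied from the output above; wrapping it in `table` and `tr` tags, and the helper name `find_latest_link`, are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample cells copied from the find_all("td") output above.
# The surrounding <table>/<tr> tags are an assumption about the page layout.
SAMPLE_HTML = """
<table>
<tr><td valign="top" width="15%"><strong>Latest</strong></td>
<td valign="top" width="85%"><a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a></td></tr>
<tr><td valign="top" width="15%"><strong>Previous</strong></td>
<td valign="top" width="85%"><a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-1">2021-1</a></td></tr>
</table>
"""


def find_latest_link(soup):
    """Return the <a> tag in the cell following the 'Latest' cell, or None."""
    for cell in soup.find_all("td"):
        if cell.get_text(strip=True) == "Latest":
            next_cell = cell.find_next_sibling("td")
            if next_cell is not None:
                return next_cell.find("a")
    return None


latest = find_latest_link(BeautifulSoup(SAMPLE_HTML, "html.parser"))
latest_label = latest.string  # the release label
latest_href = latest["href"]  # the relative link to the release page
```

On the live page, the same helper would be called as `find_latest_link(soup)`; it returns None if no "Latest" cell is found, which should be checked before using the result.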

We then search for hyperlinks on the page. There are *many* hyperlinks, so we get a long list of results. By inspecting the soup object, we find that the required link is at index 36 of that list (hard-coding the position is fragile; a more robust way is shown further below).

In [6]:
soup.find_all("a")[36]

<a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a>

Again, the above method can be difficult since we have to spot this specific link manually in a list of messy results (websites are often very complicated). We know that the name of the link is '2021-2', so we attempt to find the link using this information.

In [7]:
soup.find_all('a',text='2021-2')

[<a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a>,
 <a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a>]

In [8]:
def search_link(response_object, tag, date):
    """Return every `tag` element whose text is exactly `date`."""
    parsed = BeautifulSoup(response_object, "html.parser")
    return parsed.find_all(tag, text=date)

# Same search as in the previous cell, wrapped in a reusable function
links = search_link(response, "a", "2021-2")

There are two identical links on the page, so we pick the latter one and extract its name.

In [9]:
target1=soup.find_all("a",text='2021-2')[1]
t1=target1.string
t1

'2021-2'

Another method is to look for the exact link directly. This only works if the full link is known in advance, i.e. it must be predictable.

In [10]:
target_link = soup.find_all("a",href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2")[1]
target_link

<a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a>

In [11]:
newlink = soup.find_all("a")[36]
newlinkstr = str(newlink) # turn content into a string
#dllink = newlink['href']
#newlink.string # gives same result as section_label.string
newlink

<a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a>

We can also "hash" the entire webpage and compare the content later on to spot changes. However, this is not ideal, as the hash changes on any edit to the page, however small and however unrelated to the SPD files.

We can try to isolate a certain part of the website and apply this method. This requires a bit more thinking and we should come back to this in the future.

In [12]:
newHash = hashlib.sha224(response).hexdigest()
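One way to isolate part of the page, as suggested above, might be to hash only the markup of the relevant fragment (for example the table cell holding the latest link) instead of the full response. The sketch below is an illustration under that assumption; the helper name `fingerprint` and the whitespace normalisation are not part of the original notebook.

```python
import hashlib
import re


def fingerprint(fragment: str) -> str:
    """Hash a markup fragment after collapsing runs of whitespace."""
    normalised = re.sub(r"\s+", " ", fragment).strip()
    return hashlib.sha224(normalised.encode("utf-8")).hexdigest()


# Fragments taken from the find_all("td") output earlier in the notebook.
latest_cell = '<td valign="top" width="85%"><a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a></td>'
previous_cell = '<td valign="top" width="85%"><a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-1">2021-1</a></td>'

hash_latest = fingerprint(latest_cell)
hash_previous = fingerprint(previous_cell)

# Extra whitespace between attributes does not change the fingerprint.
hash_respaced = fingerprint(latest_cell.replace(" ", "   "))
```

With the soup object from above, the call could look like `fingerprint(str(section_label))`. The hash then changes only when that cell changes, not when unrelated parts of the page are edited.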

## Check for updates
Now we want to know if there have been any updates.

We write the above result into a text file (here we use the label extracted from the HTML rather than the hash).

In [13]:
with open('currentSPD.txt','w') as f:
    f.write(t1)

We can then compare the contents of the text files to see if anything has changed. The result will be True if there are no updates. This assumes that "oldSPD.txt" already exists from a previous run.

In [14]:
f1 = "./currentSPD.txt"
f2 = "./oldSPD.txt"

result = filecmp.cmp(f1, f2, shallow=False)
result

False
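`filecmp.cmp` raises an error when one of the files is missing, which happens on the very first run when "oldSPD.txt" has not been written yet. A minimal sketch of a comparison that treats a missing file as "new release" is shown below; the function name `has_new_release` is hypothetical, and the demonstration uses a temporary directory so that it does not touch the real files.

```python
import tempfile
from pathlib import Path


def has_new_release(current_label, state_file="oldSPD.txt"):
    """Return True if the stored label differs from current_label.

    A missing state file is treated as a new release.
    """
    path = Path(state_file)
    if not path.exists():
        return True
    return path.read_text().strip() != current_label.strip()


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "oldSPD.txt"
    first_run = has_new_release("2021-2", state)   # no file yet
    state.write_text("2021-2")
    unchanged = has_new_release("2021-2", state)   # same label stored
    updated = has_new_release("2022-2", state)     # label has moved on
```

Note that this compares labels rather than files, so "currentSPD.txt" would no longer be needed.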

If there is an update, we also write the current label into "oldSPD.txt" so that it can be used for comparison next time.

In [15]:
if result:
    print('No new SPD files.')
else:
    print('Check the website for updates.')
    with open('oldSPD.txt','w') as f:
        f.write(t1)

Check the website for updates.


## Accessing the target link
Now that we have located the desired link, we follow it to the next page and download our target file.

Firstly, let's examine our result again.

In [16]:
target_link

<a href="/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2">2021-2</a>

Notice that the link contained in there is missing the base url. However, we know what it is by examining the website ourselves.

In [17]:
base_target = "https://www.nrscotland.gov.uk"

Now we can get the actual url and parse it. But first let's extract the "href" content from our result above.

In [18]:
target_linkhref = target_link["href"]
target_linkhref

'/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2'

In [19]:
target_url = base_target+target_linkhref
target_url

'https://www.nrscotland.gov.uk/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2'

That's exactly what we need! Now we read and parse the url as before.

In [20]:
get_target_url = Request(target_url, headers={'User-Agent': 'Mozilla/5.0'})
target_html = urlopen(get_target_url).read()
target_soup = BeautifulSoup(target_html, "html.parser")  # name the parser explicitly

We have a BeautifulSoup object again. As before, we look for our target file: SPD_PostcodeIndex_Cut_21_2_CSV.zip. Again, this is only possible when we can predict the file names and the names of the hyperlinks.

In [21]:
target_soup.find_all("a",text='Postcode Index')

[<a href="/files//statistics/geography/2021-2/SPD_PostcodeIndex_Cut_21_2_Access.zip">Postcode Index</a>,
 <a href="/files//statistics/geography/2021-2/SPD_PostcodeIndex_Cut_21_2_CSV.zip">Postcode Index</a>]

The second result is what we are looking for.

In [22]:
file_target = target_soup.find_all("a",text='Postcode Index')[1]["href"]
file_target

'/files//statistics/geography/2021-2/SPD_PostcodeIndex_Cut_21_2_CSV.zip'
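Both links carry the same text, so choosing between them by position relies on their order on the page. As a sketch of an alternative, the link could be selected by matching its 'href' against a pattern for the CSV archive. The sample markup is copied from the output above; the regular expression is an assumption about how future file names will look.

```python
import re

from bs4 import BeautifulSoup

# Sample links copied from the find_all output above.
SAMPLE_HTML = """
<a href="/files//statistics/geography/2021-2/SPD_PostcodeIndex_Cut_21_2_Access.zip">Postcode Index</a>
<a href="/files//statistics/geography/2021-2/SPD_PostcodeIndex_Cut_21_2_CSV.zip">Postcode Index</a>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Assumed naming pattern: "SPD_PostcodeIndex...CSV.zip", matched case-insensitively.
csv_pattern = re.compile(r"SPD_PostcodeIndex.*_CSV\.zip$", re.IGNORECASE)

csv_link = sample_soup.find("a", href=csv_pattern)
csv_href = csv_link["href"] if csv_link is not None else None
```

On the live page, `target_soup` would take the place of `sample_soup`.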

Again we construct the full link with the previous base link.

In [23]:
final_target = base_target+file_target
final_target

'https://www.nrscotland.gov.uk/files//statistics/geography/2021-2/SPD_PostcodeIndex_Cut_21_2_CSV.zip'
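The base URL was typed in by hand in this section. As an alternative sketch, `urllib.parse.urljoin` from the standard library can resolve a relative 'href' against the URL of the page it was found on, so no separate base string is needed. The variable names below are chosen for illustration.

```python
from urllib.parse import urljoin

# URL of the page that was parsed, and the relative href extracted from it.
page_url = "https://www.nrscotland.gov.uk/statistics-and-data/geography/nrs-postcode-extract"
href = "/statistics-and-data/geography/our-products/scottish-postcode-directory/2021-2"

# urljoin keeps the scheme and host of page_url and replaces the path.
joined = urljoin(page_url, href)
```

The same call should work for the file link, with the release page URL as the first argument.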

## Downloading the files
Now we can download the desired files. The content will be saved in the "SPD_PostcodeIndex_Cut_21_2_CSV" folder. You might get an error saying the target is not a zip file. This is still being investigated, but you might want to make sure you do not have any files with the same name in your downloads folder.

**Edit**: this is probably related to the 403 Forbidden error that we encountered using urllib before. Is it a permissions issue, or are we blocked from downloading through Python?

**Temporary workaround:** run all cells first before running the download cell.

In [26]:
user_agent = {'User-agent': 'Mozilla/5.0'}
r = requests.get(final_target,headers=user_agent)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall("./SPD_PostcodeIndex_Cut_21_2_CSV")
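The "not a zip file" error would be raised if the response body is something other than the archive, for instance an HTML error page. A possible way to make this visible is to validate the bytes before extracting them. The sketch below is self-contained and builds a small archive in memory for the demonstration; the function name `extract_zip_bytes` and the placeholder file are hypothetical.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(payload, dest):
    """Extract an in-memory zip archive into dest and return its member names."""
    buffer = io.BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        # Likely an error page (e.g. 403 Forbidden) instead of the archive.
        raise ValueError("Response body is not a zip archive: %r" % payload[:60])
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Demonstration with a placeholder archive built in memory.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("example.csv", "column\nvalue\n")

with tempfile.TemporaryDirectory() as tmp:
    members = extract_zip_bytes(demo.getvalue(), tmp)
    extracted_ok = os.path.exists(os.path.join(tmp, "example.csv"))

# A non-zip body is rejected with a readable message.
try:
    extract_zip_bytes(b"<html>403 Forbidden</html>", ".")
    rejected = False
except ValueError:
    rejected = True
```

In the download cell, this could be used as `r.raise_for_status()` followed by `extract_zip_bytes(r.content, "./SPD_PostcodeIndex_Cut_21_2_CSV")`, so that an HTTP error is reported as such before any extraction is attempted.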

## To-do
* Currently this notebook only demonstrates how this can be done in general. A Python script is being written to execute the process and will restructure the methods (e.g. if there are no updates, then there is no need to download anything)
* Make it run on a schedule. This will probably require some integration with the command-line
* Make the whole process smarter, e.g. as mentioned above, stop early if there are no updates, and accept input arguments (e.g. to look for a specific date/file name, pass it in from the command-line when running the script instead of opening the script and editing it manually)
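
As a starting point for the command-line arguments mentioned above, a minimal sketch using `argparse` from the standard library is given below. The option names (`--label`, `--out-dir`, `--check-only`) and their defaults are assumptions for illustration, not part of an existing script.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command-line options of the SPD script."""
    parser = argparse.ArgumentParser(
        description="Check for and download new SPD files.")
    parser.add_argument("--label", default="2021-2",
                        help="release label to look for, e.g. 2021-2")
    parser.add_argument("--out-dir", default=".",
                        help="directory to extract the files into")
    parser.add_argument("--check-only", action="store_true",
                        help="report updates without downloading anything")
    return parser.parse_args(argv)


# Example: look for a specific release without downloading.
args = parse_args(["--label", "2022-2", "--check-only"])
```

Calling `parse_args()` without arguments reads the options from the actual command line, which would also let a scheduler such as cron run the script with fixed options.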