# Gov.uk DEFRA actions compatibility scraper

To compute the available area for a given action, we need to assess compatibility between the actions already on a parcel and the action being applied for. That requires a compatibility matrix, which we build by scraping each action's page and extracting its compatible actions.

Process:
- use the action-scraper.ipynb notebook to construct a list of the URLs of all the actions
- write a new function to extract only the compatible actions
- run that over the list of actions, storing each action's compatible actions as the values in a dict keyed by action
- write an algorithm that reads in this dict and outputs a compatibility matrix; we may need to derive a class in which to store the matrix, or some JSON-serialisable form that the JS developers can read in and parse
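
The last step could look something like the sketch below. It is only an illustration under stated assumptions: `build_compat_matrix` is a hypothetical name, the input is assumed to be a dict mapping each action code to a list of compatible codes, and the codes in the example are made up rather than scraped.

```python
import json

def build_compat_matrix(compatible):
    """Turn {action: [compatible actions]} into a nested dict of booleans."""
    # Collect every code that appears either as a key or as a listed value
    codes = sorted(set(compatible) | {c for v in compatible.values() for c in v})
    # matrix[a][b] is True when b is listed as compatible on a's page
    return {a: {b: b in compatible.get(a, ()) for b in codes} for a in codes}

# Made-up codes, purely for illustration
example = {"AAA1": ["BBB1"], "BBB1": ["AAA1", "CCC1"], "CCC1": []}
matrix = build_compat_matrix(example)

# A nested dict of booleans serialises directly to JSON
serialised = json.dumps(matrix)
```

The sketch records what each page lists and does not force symmetry, so `matrix["BBB1"]["CCC1"]` and `matrix["CCC1"]["BBB1"]` can differ. Whether the real matrix should be symmetrised is a separate decision.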

In [40]:
import json

gov_base_url = "https://www.gov.uk"
finder_base_url = "https://www.gov.uk/find-funding-for-land-or-farms"
page2 = "?page=2"
page3 = "?page=3"

In [2]:
# use the action-scraper.ipynb file to generate the list of urls, if you haven't already, then
# read in the three files

with open('output/actions_links_page1.txt') as file:
    links1 = file.readlines()
with open('output/actions_links_page2.txt') as file:
    links2 = file.readlines()
with open('output/actions_links_page3.txt') as file:
    links3 = file.readlines()

# concatenate, stripping newlines

all_links_relative = [link.rstrip() for link in links1 + links2 + links3]

In [42]:
import pandas
import requests
from bs4 import BeautifulSoup
from utils import get_code, get_filename

def get_tables(relative_link):
    """Return {code: {scheme: compatible actions text}} for one action page, or None on failure."""
    code = get_code(relative_link)
    gov_base_url = "https://www.gov.uk"
    url = gov_base_url + relative_link

    try:
        # Send a GET request to the URL; the 30 second timeout is an arbitrary
        # choice that stops a stalled request from hanging the whole run
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # Parse the HTML content using BeautifulSoup and find the table that
        # follows the "Other actions or options" heading and its intro paragraph
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.select_one('h2:-soup-contains("Other actions or options") + p + table')

        # select_one returns None when the page has no matching table
        if table is None:
            print("No compatibility table found:", url)
            return None

        # set up an empty dict for the results
        results = {code: {}}

        # collect the rows; fall back to the table itself if there is no <tbody>
        for row in (table.tbody or table).find_all('tr'):
            columns = row.find_all('td')

            # skip header rows and anything without both columns
            if len(columns) >= 2:
                scheme = columns[0].text.strip()
                cactions = columns[1].text.strip()

                results[code][scheme] = cactions

        return results

    except requests.exceptions.HTTPError as errh:
        print("HTTP Error:", errh)
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting:", errc)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
    except requests.exceptions.RequestException as err:
        print("Oops! Something went wrong:", err)

    # reached only when one of the request errors above was caught
    return None

In [37]:
all_links_relative[4]

'/find-funding-for-land-or-farms/agf2-maintain-low-density-in-field-agroforestry-on-less-sensitive-land'
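
The `get_code` and `get_filename` helpers are imported from a local `utils` module that is not shown in this notebook. Judging only by the link above and the `Retrieved ...` output further down, they probably behave roughly like the sketch below. The names `code_from_link` and `filename_from_link` are hypothetical stand-ins, and the real helpers may differ.

```python
def slug_from_link(relative_link):
    """Return the last path segment of a relative link."""
    return relative_link.rstrip("/").rsplit("/", 1)[-1]

def code_from_link(relative_link):
    """Assumed behaviour: the action code is the slug's first hyphen-separated part, upper-cased."""
    return slug_from_link(relative_link).split("-", 1)[0].upper()

def filename_from_link(relative_link):
    """Assumed behaviour: the output file is the slug with a .txt extension."""
    return slug_from_link(relative_link) + ".txt"

link = "/find-funding-for-land-or-farms/agf2-maintain-low-density-in-field-agroforestry-on-less-sensitive-land"
code = code_from_link(link)
filename = filename_from_link(link)
```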

In [38]:
get_tables(all_links_relative[0])

https://www.gov.uk/find-funding-for-land-or-farms/wbd2-manage-ditches


{'WBD2': {'SFI  2024 actions': 'All SFI 2024 actions, except BND1',
  'SFI 2023 actions': 'All SFI 2023 actions',
  'CS options': 'All CS management options, including BE3 (management of hedgerows)',
  'ES options': 'All ES revenue options, except boundary options',
  'SFI pilot standards': 'All SFI pilot standards, including all levels of the SFI pilot hedgerows standard'}}

Now we're ready to iterate over the list of URLs, extract the table from each page, and save each result to its own JSON document.

In [43]:
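import os

code_to_results = {}

# make sure the output directory exists before writing to it
os.makedirs('output/compatibility', exist_ok=True)

for relative_link in all_links_relative:
    output_filename = get_filename(relative_link)
    code = get_code(relative_link)

    # get the document contents; get_tables returns None if the request failed
    result = get_tables(relative_link)
    if result is None:
        print(f"Skipped {code}: no compatibility table retrieved")
        continue

    # note: result is already keyed by code, so this dict is nested as {code: {code: {...}}}
    code_to_results[code] = result
    with open('output/compatibility/' + output_filename, 'w') as f:
        json.dump(result, f)
    print(f"Retrieved {code}: {output_filename}")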

Retrieved WBD2: wbd2-manage-ditches.txt
Retrieved WBD1: wbd1-manage-ponds.txt
Retrieved OFA3: ofa3-supplementary-winter-bird-food-organic-land.txt
Retrieved AGF1: agf1-maintain-very-low-density-in-field-agroforestry-on-less-sensitive-land.txt
Retrieved AGF2: agf2-maintain-low-density-in-field-agroforestry-on-less-sensitive-land.txt
Retrieved CHRW1: chrw1-assess-and-record-hedgerow-condition.txt
Retrieved CHRW2: chrw2-manage-hedgerows.txt
Retrieved CHRW3: chrw3-maintain-or-establish-hedgerow-trees.txt
Retrieved BND1: bnd1-maintain-dry-stone-walls.txt
Retrieved BND2: bnd2-maintain-earth-banks-or-stone-faced-hedgebanks.txt
Retrieved CAHL4: cahl4-4m-to-12m-grass-buffer-strip-on-arable-and-horticultural-land.txt
Retrieved CIGL3: cigl3-4m-to-12m-grass-buffer-strip-on-improved-grassland.txt
Retrieved BFS1: bfs1-12m-to-24m-watercourse-buffer-strip-on-cultivated-land.txt
Retrieved BFS2: bfs2-buffer-in-field-ponds-on-arable-land.txt
Retrieved BFS3: bfs3-buffer-in-field-ponds-on-improved-grasslan
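
Once the files are written, they can be read back without scraping again. The following is a minimal sketch, assuming the files sit in one directory, keep the `.txt` extension seen in the output above, and each contain a single JSON document. `load_compatibility` is a hypothetical name, and the demo uses a throwaway directory with made-up content.

```python
import json
import tempfile
from pathlib import Path

def load_compatibility(directory):
    """Read every saved file back into {code: parsed JSON}."""
    loaded = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # File names start with the action code, e.g. wbd2-manage-ditches.txt
        code = path.stem.split("-", 1)[0].upper()
        loaded[code] = json.loads(path.read_text())
    return loaded

# Demo with a temporary directory and made-up content
with tempfile.TemporaryDirectory() as tmp:
    sample = {"XYZ1": {"Some scheme": "Some text"}}
    Path(tmp, "xyz1-example-action.txt").write_text(json.dumps(sample))
    demo = load_compatibility(tmp)
```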

In [47]:
for code, result in code_to_results.items():
    for k, v in result.items():
        for k2, v2 in v.items():
            print(v2)

All SFI 2024 actions, except BND1
All SFI 2023 actions
All CS management options, including BE3 (management of hedgerows)
All ES revenue options, except boundary options
All SFI pilot standards, including all levels of the SFI pilot hedgerows standard
No SFI 2024 actions
No SFI 2023 actions
No CS management options
No ES options
No SFI pilot standards
Same as base action
Same as base action
Same as base action
Same as base action
Same as base action
AHW1, AHW3, AHW6, AHW7, AHW8, AHW9, AHW10, AHW11, AHW12, BFS1, BFS2, BFS3, OFA1, OFA6, OFC1, OFC3, OFC4, OFM1, OFM4, OFM5, SOH4, WBD3, WBD4, WBD5, WBD6, WBD7, WBD8, PRF1, PRF2, PRF3, PRF4, SOH1, SOH2, SOH3, CAHL1, CAHL2, CAHL3, CAHL4, CIGL1, CIGL2, CIGL3, CIPM1, CIPM2, CIPM3, CIPM4, CLIG3, CNUM1, CNUM2, CNUM3, CSAM1, CSAM2, CSAM3
AHL1, AHL2, AHL3, AHL4, IGL1, IGL2, IGL3, IPM1, IPM2, IPM3, IPM4, LIG1, LIG2, NUM1, NUM2, NUM3, SAM1, SAM2, SAM3
AB1, AB2, AB3, AB6, AB7, AB8, AB9, AB10, AB11, AB13, AB14, AB15, AB16, GS2, GS3, GS4, GS5, HS6, SW1, 
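
The values printed above fall into a few shapes: "All ... except ...", "No ...", "Same as base action", and explicit comma-separated lists of codes. Before building a matrix they will need parsing. The sketch below is one possible approach and not a settled design: `parse_compat_text` is a hypothetical name, and the regular expression assumes that action codes are two to five capital letters followed by one or two digits, which matches the codes seen so far but has not been checked against every page.

```python
import re

# Assumed code shape: 2-5 capital letters then 1-2 digits, e.g. BND1, AHW10, CAHL4
CODE_RE = re.compile(r"\b[A-Z]{2,5}\d{1,2}\b")

def parse_compat_text(text):
    """Classify one table cell and pull out any action codes it names."""
    text = text.strip()
    lowered = text.lower()
    if lowered.startswith("same as base"):
        return {"kind": "same_as_base", "codes": []}
    if lowered.startswith("no "):
        return {"kind": "none", "codes": []}
    if lowered.startswith("all "):
        # Only codes after "except" are exclusions; "including ..." adds nothing to exclude
        _, _, tail = text.partition("except")
        return {"kind": "all_except", "codes": CODE_RE.findall(tail)}
    # Anything else is treated as an explicit list of codes
    return {"kind": "list", "codes": CODE_RE.findall(text)}

parsed = parse_compat_text("All SFI 2024 actions, except BND1")
```

Free-text exceptions such as "except boundary options" name a group rather than codes, so this sketch returns an empty `codes` list for them. Those cases would need a manual mapping or some other rule.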