# Gov.uk DEFRA actions compatibility scraper

To compute the available area for a given action, we need to assess compatibility between the actions already on a parcel and the action being applied for. To do this we build a compatibility matrix, which means scraping each action's page and extracting its compatible actions.

Process:
- use the action-scraper.ipynb notebook to construct a list of the urls of all the actions
- write a new function to extract only the compatible actions
- run that over the list of actions, storing the compatible actions as the values in a dict keyed by action
- write an algorithm that reads in this dict and outputs a compatibility matrix; we may need to derive a class in which to store the matrix, or some JSON-serialisable form that the JS developers can read in and parse
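
The last step above could be sketched as follows. This is a minimal illustration, assuming the scraped dict already maps each action code to an explicit list of compatible codes; `build_compat_matrix` and the placeholder codes in `sample` are hypothetical and not part of this notebook's code.

```python
import json

def build_compat_matrix(compatible):
    """Turn {code: [compatible codes]} into a nested dict of booleans.

    Hypothetical helper: assumes every value is already an explicit list of codes.
    """
    # include codes that only appear as values, so the matrix is square
    codes = sorted(set(compatible) | {c for v in compatible.values() for c in v})
    return {
        row: {col: col in compatible.get(row, []) for col in codes}
        for row in codes
    }

# made-up placeholder codes, for illustration only
sample = {"AAA1": ["BBB1"], "BBB1": ["AAA1", "CCC1"]}
matrix = build_compat_matrix(sample)

# a nested dict of booleans serialises directly to JSON
matrix_json = json.dumps(matrix, indent=2)
```

The sketch does not force the matrix to be symmetric: each row only reflects what that action's own page lists, so whether to mirror entries is a separate decision.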

In [20]:
import json
from pydantic import BaseModel

In [21]:

gov_base_url = "https://www.gov.uk"
finder_base_url = "https://www.gov.uk/find-funding-for-land-or-farms"
page2 = "?page=2"
page3 = "?page=3"

In [22]:
# use the action-scraper.ipynb file to generate the list of urls, if you haven't already, then
# read in the three files

with open('output/actions_links_page1.txt') as file:
    links1 = file.readlines()
with open('output/actions_links_page2.txt') as file:
    links2 = file.readlines()
with open('output/actions_links_page3.txt') as file:
    links3 = file.readlines()

# concatenate, stripping newlines

all_links_relative = [link.rstrip() for link in links1 + links2 + links3]

In [23]:
import pandas
import requests
from bs4 import BeautifulSoup
from utils import get_code, get_filename

def get_tables(relative_link):
    code = get_code(relative_link)
    gov_base_url = "https://www.gov.uk"
    url = gov_base_url + relative_link

    try:
        # Send a GET request to the URL; the timeout stops a stalled request hanging the loop
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.select_one('h2:-soup-contains("Other actions or options") + p + table')

        # set up an empty dict for the results
        results = {code: {}}

        # if the page has no matching table, return the empty result rather than crash
        if table is None or table.tbody is None:
            return results

        # collect the rows
        for row in table.tbody.find_all('tr'):
            columns = row.find_all('td')

            # skip rows without both a scheme cell and a compatible-actions cell
            if len(columns) >= 2:
                scheme = columns[0].text.strip()
                cactions = columns[1].text.strip()

                # there might be nonbreaking spaces in here, let's replace them with plain spaces
                cactions = cactions.replace('\xa0', ' ')

                results[code][scheme] = cactions

        return results

    except requests.exceptions.HTTPError as errh:
        print("HTTP Error:", errh)
    except requests.exceptions.ConnectionError as errc:
        print("Error Connecting:", errc)
    except requests.exceptions.Timeout as errt:
        print("Timeout Error:", errt)
    except requests.exceptions.RequestException as err:
        print("Oops! Something went wrong:", err)

In [24]:
all_links_relative[4]

'/find-funding-for-land-or-farms/agf2-maintain-low-density-in-field-agroforestry-on-less-sensitive-land'

In [25]:
get_tables(all_links_relative[0])

{'WBD2': {'SFI  2024 actions': 'All SFI 2024 actions, except BND1',
  'SFI 2023 actions': 'All SFI 2023 actions',
  'CS options': 'All CS management options, including BE3 (management of hedgerows)',
  'ES options': 'All ES revenue options, except boundary options',
  'SFI pilot standards': 'All SFI pilot standards, including all levels of the SFI pilot hedgerows standard'}}

Now we're ready to iterate over the list of urls, extract the table from each page, and save each result out to its own JSON document.

In [26]:
code_to_results = {}

for relative_link in all_links_relative:
    # get the document contents
    output_filename = get_filename(relative_link)
    code = get_code(relative_link)
    result = get_tables(relative_link)
    if result is None:
        # get_tables prints the error and returns None when the request fails
        print(f"Skipped {code}: no result")
        continue
    code_to_results[code] = result
    with open('output/compatibility/' + output_filename, 'w') as f:
        json.dump(result, f)
    print(f"Retrieved {code}: {output_filename}")

Retrieved WBD2: wbd2-manage-ditches.txt
Retrieved WBD1: wbd1-manage-ponds.txt
Retrieved OFA3: ofa3-supplementary-winter-bird-food-organic-land.txt
Retrieved AGF1: agf1-maintain-very-low-density-in-field-agroforestry-on-less-sensitive-land.txt
Retrieved AGF2: agf2-maintain-low-density-in-field-agroforestry-on-less-sensitive-land.txt
Retrieved CHRW1: chrw1-assess-and-record-hedgerow-condition.txt
Retrieved CHRW2: chrw2-manage-hedgerows.txt
Retrieved CHRW3: chrw3-maintain-or-establish-hedgerow-trees.txt
Retrieved BND1: bnd1-maintain-dry-stone-walls.txt
Retrieved BND2: bnd2-maintain-earth-banks-or-stone-faced-hedgebanks.txt
Retrieved CAHL4: cahl4-4m-to-12m-grass-buffer-strip-on-arable-and-horticultural-land.txt
Retrieved CIGL3: cigl3-4m-to-12m-grass-buffer-strip-on-improved-grassland.txt
Retrieved BFS1: bfs1-12m-to-24m-watercourse-buffer-strip-on-cultivated-land.txt
Retrieved BFS2: bfs2-buffer-in-field-ponds-on-arable-land.txt
Retrieved BFS3: bfs3-buffer-in-field-ponds-on-improved-grasslan

The above data structure is three layers deep, but every leaf in the bottom layer represents the same thing: a list of codes. However, some of the entries aren't explicit lists but instructions on how to build a list. We need to be able to execute each instruction and turn it into an explicit list so that it can be used to train a model.

We'll collect all those bottom-level leaves, classify them and build a set of scenarios. We can use those to help write a class to perform the expansion.
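
As a starting point for that class, here is a minimal sketch of expanding one kind of instruction, "All ... actions, except ...", into an explicit list. The `CATEGORIES` contents (a small subset chosen for illustration), the `expand` function and both regular expressions are assumptions; only the phrasing of the instruction strings comes from the scraped leaves.

```python
import re

# hypothetical category contents: a small illustrative subset, not the full list
CATEGORIES = {
    "SFI 2024": ["BND1", "BND2", "WBD1", "WBD2"],
}

# matches e.g. "All SFI 2024 actions" or "All SFI 2024 actions, except BND1"
ALL_PATTERN = re.compile(r"^All (?P<category>.+?) actions(?:, except (?P<excluded>.+))?$")
# matches e.g. "No SFI 2024 actions"
NO_PATTERN = re.compile(r"^No (?P<category>.+?) actions$")

def expand(leaf):
    """Return an explicit list of codes, or None if the leaf is not understood."""
    leaf = " ".join(leaf.split())  # collapse repeated whitespace
    if NO_PATTERN.match(leaf):
        return []
    match = ALL_PATTERN.match(leaf)
    if match is None or match["category"] not in CATEGORIES:
        return None
    excluded = set()
    if match["excluded"]:
        # exclusions are separated by commas, "and" or "or"
        excluded = set(re.split(r",\s*|\s+and\s+|\s+or\s+", match["excluded"]))
    return [c for c in CATEGORIES[match["category"]] if c not in excluded]
```

Leaves with other phrasing, such as "All CS management options" or "Same as base action", return `None` here and would need their own rules.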

In [27]:
# collect leaves into an array

leaves = []

for code, result in code_to_results.items():
    for k, v in result.items():
        for k2, v2 in v.items():
            print(v2)
            leaves.append(v2)

All SFI 2024 actions, except BND1
All SFI 2023 actions
All CS management options, including BE3 (management of hedgerows)
All ES revenue options, except boundary options
All SFI pilot standards, including all levels of the SFI pilot hedgerows standard
No SFI 2024 actions
No SFI 2023 actions
No CS management options
No ES options
No SFI pilot standards
Same as base action
Same as base action
Same as base action
Same as base action
Same as base action
AHW1, AHW3, AHW6, AHW7, AHW8, AHW9, AHW10, AHW11, AHW12, BFS1, BFS2, BFS3, OFA1, OFA6, OFC1, OFC3, OFC4, OFM1, OFM4, OFM5, SOH4, WBD3, WBD4, WBD5, WBD6, WBD7, WBD8, PRF1, PRF2, PRF3, PRF4, SOH1, SOH2, SOH3, CAHL1, CAHL2, CAHL3, CAHL4, CIGL1, CIGL2, CIGL3, CIPM1, CIPM2, CIPM3, CIPM4, CLIG3, CNUM1, CNUM2, CNUM3, CSAM1, CSAM2, CSAM3
AHL1, AHL2, AHL3, AHL4, IGL1, IGL2, IGL3, IPM1, IPM2, IPM3, IPM4, LIG1, LIG2, NUM1, NUM2, NUM3, SAM1, SAM2, SAM3
AB1, AB2, AB3, AB6, AB7, AB8, AB9, AB10, AB11, AB13, AB14, AB15, AB16, GS2, GS3, GS4, GS5, HS6, SW1, 

In [28]:
# first we dedupe and sort
leaves = list(set(leaves))
leaves.sort()

Some of the entries in this list are already in the right form: an explicit list of codes, albeit as a single string. We need a method to split each of these strings into an actual list of codes. We also need to remove those entries from the leaves, so we can see what else we have.

todo list:
- write method to split string on commas
- get an exhaustive list of the codes. Enum? Class to which we add and remove codes? Maybe the latter
- identify entries that are just lists of codes. Perhaps we use an 'extract codes' method and then see if there's anything left?
- remove those entries from the leaves


class design
- method to separate string into array of codes


In [29]:
# obtain list of codes and their categories
# for now what we'll do is reprocess them from the list of links and put all 101 current ones into SFI2024
# we can get the ES and whatnot categories later

all_sfi2024_codes = [get_code(relative_link) for relative_link in all_links_relative]

# turns out a lot of the ones in the leaves are not SFI2024 actions - makes sense, these'll be the old ones
# let's collect those too
all_the_codes = list(set([code.strip() for leaf in leaves for code in leaf.split(",")]))
# print them to screen, copy to file and manually clean up
all_the_codes.sort()
# all_the_codes

Now we import the class we just made and use it to process the 'action code list strings' obtained from the actions text. We first run each string through the 'extract codes' method and check whether there's a remainder. Anything with no remainder is cleared off the list. Anything left in the list has to be processed further, and we'll need to write methods for that.

In [30]:
import actionCodes

<module 'actionCodes' from '/Users/joe/Documents/git/defra/ffc-rps-scratchpad/src/python/actionCodes.py'>

In [31]:
ac = actionCodes.ActionCodes()
ac.add_category('SFI2024', actionCodes.all_sfi2024_codes)
ac.add_category('historic', actionCodes.historic_codes)

new_leaves = []

for leaf in leaves:
    res, remainder = ac.extract_code_from_string(leaf)
    if len(remainder)!=0:
        print(f"Processed {leaf[0:10]}...")
        print(f"   Res: {res[0:10]}")
        print(f"   Rem: {remainder}")
        new_leaves.append(remainder)

Added category SFI2024
Added category historic
Processed AB1, AB2, ...
   Res: ['AB1', 'AB2', 'AB3', 'AB5', 'AB6', 'AB7', 'AB8', 'AB9', 'AB10', 'AB11']
   Rem: ['UP2 (if located below the moorland line)']
Processed AHL2, IPM1...
   Res: ['AHL2', 'IPM1', 'IPM4', 'NUM1', 'SAM1']
   Rem: ['SAM2 (only if CIPM3 is done during the summer months)']
Processed All CS man...
   Res: []
   Rem: ['All CS management options']
Processed All CS man...
   Res: []
   Rem: ['All CS management options (if located above the moorland line)']
Processed All CS man...
   Res: []
   Rem: ['All CS management options', 'except BE3']
Processed All CS man...
   Res: []
   Rem: ['All CS management options', 'except BE3 (management of hedgerows)']
Processed All CS man...
   Res: []
   Rem: ['All CS management options', 'including BE3 (management of hedgerows)']
Processed All ES rev...
   Res: []
   Rem: ['All ES revenue options']
Processed All ES rev...
   Res: []
   Rem: ['All ES revenue options (if located above t

In [60]:
from importlib import reload
reload(actionCodes)

ac = actionCodes.ActionCodes()
ac.add_category('SFI2024', actionCodes.all_sfi2024_codes)
ac.add_category('SFI', actionCodes.historic_codes)
ac.add_category('CS', actionCodes.all_CS)
ac.add_category('SFI2023', actionCodes.all_SFI_2023)

unmatched = []

for leaf in leaves:
    print(f"Processing {leaf}")
    ac_result = ac.process_string(leaf)
    print(f"Processed {leaf[0:10]}...")
    print(f"   Found codes: {ac_result.codes[0:10]}")
    if len(ac_result.remnant)!=0:
        print(f"   Unprocessable: {ac_result.remnant}")
        unmatched.append(leaf)
    print()



Added category SFI2024
Added category SFI
Added category CS
Added category SFI2023
Processing AB1, AB2, AB3, AB5, AB6, AB7, AB8, AB9, AB10, AB11, AB13, AB14, AB15, AB16, BE1, BE2, BE4, BE5, GS1, GS2, GS3, GS4, GS5, GS6, GS7, GS8, GS9, GS10, GS11, GS12, GS13, GS14, HS3, HS4, HS7, HS9, CT1, CT2, CT3, CT4, CT5, CT7, LH1, LH2, LH3, WT6, WT7, WT8, WT9, WT10, OP1, OP2, OP4, OP5, OR1, OR2, OR3, OR4, OR5, OT1, OT2, OT3, OT4, OT5, SW1, SW2, SW3, SW4, SW5, SW6, SW7, SW8, SW9, SW10, SW12, SW13, SW15, SW16, SW17, SW18, UP2, WT1, WT2, UP2 (if located below the moorland line)
  Items split into ['AB1', 'AB2', 'AB3', 'AB5', 'AB6', 'AB7', 'AB8', 'AB9', 'AB10', 'AB11', 'AB13', 'AB14', 'AB15', 'AB16', 'BE1', 'BE2', 'BE4', 'BE5', 'GS1', 'GS2', 'GS3', 'GS4', 'GS5', 'GS6', 'GS7', 'GS8', 'GS9', 'GS10', 'GS11', 'GS12', 'GS13', 'GS14', 'HS3', 'HS4', 'HS7', 'HS9', 'CT1', 'CT2', 'CT3', 'CT4', 'CT5', 'CT7', 'LH1', 'LH2', 'LH3', 'WT6', 'WT7', 'WT8', 'WT9', 'WT10', 'OP1', 'OP2', 'OP4', 'OP5', 'OR1', 'OR2', 'OR3', 

AttributeError: 'NoneType' object has no attribute 'codes'
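
The `AttributeError` above means `process_string` returned `None` for that leaf. One way to rule this out, sketched here with hypothetical names (`ProcessResult`, a free-standing `process_string`, and a `known_codes` argument), is to build a result object up front and return it on every path. This is an assumption about the shape of the fix, not the code in `actionCodes.py`.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessResult:
    """Hypothetical result type: matched codes plus whatever could not be matched."""
    codes: list = field(default_factory=list)
    remnant: list = field(default_factory=list)

def process_string(text, known_codes):
    # create the result first so that every path returns an object, never None
    result = ProcessResult()
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        if item in known_codes:
            result.codes.append(item)
        else:
            result.remnant.append(item)
    return result

# illustrative call with a small subset of codes
ac_result = process_string("WT1, WT2, UP2 (if located below the moorland line)", {"WT1", "WT2", "UP2"})
```

With this shape the calling loop can always read `ac_result.codes` and `ac_result.remnant`, and an empty `remnant` signals a fully matched leaf. A pydantic `BaseModel`, which the notebook already imports, could serve the same purpose as the dataclass.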

In [None]:
new_leaves

[['UP2 (if located below the moorland line)'],
 ['SAM2 (only if CIPM3 is done during the summer months)'],
 ['All CS management options'],
 ['All CS management options (if located above the moorland line)'],
 ['All CS management options', 'except BE3'],
 ['All CS management options', 'except BE3 (management of hedgerows)'],
 ['All CS management options', 'including BE3 (management of hedgerows)'],
 ['All ES revenue options'],
 ['All ES revenue options (if located above the moorland line)'],
 ['All ES revenue options', 'except boundary options'],
 ['All SFI 2023 actions'],
 ['All SFI 2023 actions', 'except HRW1'],
 ['All SFI 2023 actions', 'except HRW1', 'HRW2 and HRW3'],
 ['All SFI 2023 actions', 'except HRW2'],
 ['All SFI 2023 actions', 'except HRW3'],
 ['All SFI 2024 actions', 'except BND1'],
 ['All SFI 2024 actions', 'except CHRW1', 'CHRW3 and WBD2'],
 ['All SFI 2024 actions', 'expect BND1'],
 ['All SFI actions', 'except CMOR1 or MOR1'],
 ['All SFI actions', 'except CMOR1', 'MOR1 or