# GOV.UK DEFRA Actions scraper

We want to recover the full text of all the available actions on the DEFRA finder, so that we can analyse them and build up a taxonomy to help us derive a data model.

In [10]:
finder_base_url = "https://www.gov.uk/find-funding-for-land-or-farms"
page2 = "?page=2"
page3 = "?page=3"

There are three pages of links, so rather than spend time building a clever scraper we'll extract the links from all three pages, save them to files, manually trim the unnecessary ones and then concatenate the results. Then we can run through that list and pull the text of each page.

In [3]:
import requests
from bs4 import BeautifulSoup

def extract_hyperlinks(url):
    # Send a GET request to the URL (the timeout stops a stalled connection from hanging the notebook)
    response = requests.get(url, timeout=10)

    # Check if the request was successful
    if response.status_code != 200:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
        return []

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all anchor tags
    anchor_tags = soup.find_all('a')

    # Extract the href attribute from each anchor tag
    hyperlinks = [a.get('href') for a in anchor_tags if a.get('href')]

    return hyperlinks

# Example usage
url = 'https://example.com'
hyperlinks = extract_hyperlinks(url)

# Print the extracted hyperlinks
for link in hyperlinks:
    print(link)

https://www.iana.org/domains/example


Now we'll get the three pages of links.

In [13]:
links_page1 = extract_hyperlinks(finder_base_url)

with open('output/actions_links_page1.txt', 'w') as f:
    for link in links_page1:
        f.write(f"{link}\n")

In [11]:
links_page2 = extract_hyperlinks(finder_base_url+page2)

with open('output/actions_links_page2.txt', 'w') as f:
    for link in links_page2:
        f.write(f"{link}\n")

In [12]:
links_page3 = extract_hyperlinks(finder_base_url+page3)

with open('output/actions_links_page3.txt', 'w') as f:
    for link in links_page3:
        f.write(f"{link}\n")
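
The three cells above differ only in the page number, so they could be collapsed into a single loop. This is a minimal sketch, assuming the finder keeps its `?page=N` query pattern; `page_url` and `save_links` are hypothetical helper names, not part of the original notebook, and the commented-out loop relies on the `extract_hyperlinks` function defined earlier.

```python
from pathlib import Path

FINDER_BASE_URL = "https://www.gov.uk/find-funding-for-land-or-farms"

def page_url(base_url, page):
    """Return the finder URL for a 1-based page number (page 1 has no query string)."""
    return base_url if page == 1 else f"{base_url}?page={page}"

def save_links(links, path):
    """Write one link per line, creating the parent directory if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(f"{link}\n" for link in links))

# In the notebook, with extract_hyperlinks already defined:
# for page in range(1, 4):
#     links = extract_hyperlinks(page_url(FINDER_BASE_URL, page))
#     save_links(links, f"output/actions_links_page{page}.txt")
```

Creating the `output` directory inside `save_links` also avoids a `FileNotFoundError` the first time the notebook is run in a fresh checkout.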

At this point we edit the files manually. Do that and then come back here : )
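
If the manual trimming gets tedious, part of it could be automated. The sketch below is only a rough first pass and rests on an assumption worth checking against the saved files: that each action page is a direct child of the finder's own path. `keep_action_links` is a hypothetical helper, and the sample links are made up for illustration.

```python
FINDER_PATH = "/find-funding-for-land-or-farms"

def keep_action_links(links, finder_path=FINDER_PATH):
    """Keep links of the form <finder_path>/<slug>, dropping duplicates but preserving order."""
    prefix = finder_path + "/"
    seen = set()
    kept = []
    for link in links:
        link = link.strip()
        # Assumption: an action page lives directly under the finder path.
        if link.startswith(prefix) and len(link) > len(prefix) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept

# Made-up sample mixing navigation, pagination and a duplicated action link
sample = [
    "/",
    "/find-funding-for-land-or-farms",
    "/find-funding-for-land-or-farms?page=2",
    "/find-funding-for-land-or-farms/example-action",
    "/find-funding-for-land-or-farms/example-action",
    "/help/cookies",
]
print(keep_action_links(sample))
```

A quick look over the result is still worthwhile, since anything that happens to share the prefix would slip through.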

In [17]:
# read in the three files

with open('output/actions_links_page1.txt') as file:
    links1 = file.readlines()
with open('output/actions_links_page2.txt') as file:
    links2 = file.readlines()
with open('output/actions_links_page3.txt') as file:
    links3 = file.readlines()

# concatenate, stripping whitespace and skipping any blank lines left by manual editing

all_links_relative = [link.strip() for link in links1 + links2 + links3 if link.strip()]
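
Since the links are relative, they need resolving against the site root before they can be fetched. A minimal sketch using `urllib.parse.urljoin` from the standard library; `to_absolute` is a hypothetical helper and the sample links are placeholders, not real action pages.

```python
from urllib.parse import urljoin

GOV_UK_ROOT = "https://www.gov.uk"

def to_absolute(links, root=GOV_UK_ROOT):
    """Resolve each link against the site root; already-absolute URLs pass through unchanged."""
    return [urljoin(root, link) for link in links]

# Placeholder links for illustration only
sample = ["/find-funding-for-land-or-farms/example-action", "https://www.gov.uk/help"]
print(to_absolute(sample))
```

In the notebook this would be called as `to_absolute(all_links_relative)`.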

Now we need to iterate over the list of links and retrieve each page. We'll store the pages in individual files and, if it looks easy enough, extract the action codes and use them as keys in a dictionary that maps each code to its filename.
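
One way this step might look is sketched below. It makes several assumptions that are not confirmed by the notebook: that each link starts with `/`, that its last path segment begins with the action code (letters followed by digits), and that saving the raw HTML is acceptable for now, with text extraction done later. `code_from_link`, `fetch_text` and `save_pages` are hypothetical names.

```python
import re
import urllib.request
from pathlib import Path

def code_from_link(link):
    """Guess an action code from the last path segment, e.g. '/x/abc1-some-title' -> 'ABC1'."""
    slug = link.rstrip("/").rsplit("/", 1)[-1]
    # Assumption: the slug begins with letters then digits, followed by a hyphen or the end.
    match = re.match(r"([a-z]+\d+)(?:-|$)", slug, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

def fetch_text(url):
    """Download a page and return its body as text (standard library only)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def save_pages(links, out_dir, root="https://www.gov.uk", fetch=fetch_text):
    """Fetch each relative link, write it to its own file, and return {key: filename}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    index = {}
    for link in links:
        slug = link.rstrip("/").rsplit("/", 1)[-1]
        key = code_from_link(link) or slug  # fall back to the slug when no code is found
        path = out_dir / f"{slug}.html"
        path.write_text(fetch(root + link), encoding="utf-8")
        index[key] = str(path)
    return index

# In the notebook:
# pages_by_code = save_pages(all_links_relative, "output/pages")
```

The `fetch` parameter is there so the function can be tried out with a stand-in before hitting the live site, and a short `time.sleep` inside the loop would be a polite addition when running it for real.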