## Automation - Scraping data publications from NHS England web pages and consolidating them into one file.

This demonstration shows how to automatically download multiple published data files from different (related) web pages and consolidate them into one file. The example used is the **Learning Disability Health Check Scheme**.

[Data landing page](https://digital.nhs.uk/data-and-information/publications/statistical/learning-disabilities-health-check-scheme)

### Beautiful Soup

### Using urljoin to construct URLs

This will extract any .csv files for the calendar year 2024. Each file is published on its own web page, so urljoin is used to construct the URLs dynamically (i.e. so that you don't have to hard-code all the individual web pages).
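To make the role of urljoin concrete, here is a minimal sketch of how it resolves an href against the landing page URL. The relative href below (ending in `england-december-2024`) is a hypothetical page name chosen to fit the pattern used later; the absolute file URL is taken from the download output further down.

```python
from urllib.parse import urljoin

base_url = 'https://digital.nhs.uk/data-and-information/publications/statistical/learning-disabilities-health-check-scheme'

# Hypothetical root-relative href, as it might appear in an <a> tag on the landing page.
relative_href = '/data-and-information/publications/statistical/learning-disabilities-health-check-scheme/england-december-2024'

# An href starting with "/" replaces the path of the base URL but keeps its scheme and host.
joined_relative = urljoin(base_url, relative_href)
print(joined_relative)

# An href that is already absolute is returned unchanged.
absolute_href = 'https://files.digital.nhs.uk/89/11D2B8/learning-disabilities-health-check-scheme-eng-Dec-2024.csv'
joined_absolute = urljoin(base_url, absolute_href)
print(joined_absolute)
```

Because urljoin handles both cases, the same call can be used whether a page lists its links as relative or absolute paths.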

The package "re" is imported so that regular expressions can be used to pick out the URLs, i.e. any link matching the pattern of the regular expression is treated as a web page of interest. (NOTE: you do not need to install "re"; it is part of the Python standard library.)
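As a rough illustration of the matching step, the sketch below applies the same pattern to the last path segment of a few hrefs. The hrefs are made up for the example and are not taken from the real landing page.

```python
import re

dynamic_section = r'^england-[a-z]+-2024$'

# Hypothetical hrefs, for illustration only.
candidate_hrefs = [
    '/publications/learning-disabilities-health-check-scheme/england-december-2024',
    '/publications/learning-disabilities-health-check-scheme/england-december-2023',
    '/publications/learning-disabilities-health-check-scheme/england-december-2024/data-quality',
    '/about-nhs-digital',
]

# Keep only the hrefs whose final path segment matches the pattern.
pages_of_interest = [
    href for href in candidate_hrefs
    if re.match(dynamic_section, href.split('/')[-1])
]
print(pages_of_interest)
```

Only the first href is kept: the second has the wrong year, the third ends in a different segment, and the fourth does not fit the pattern at all.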

It's been limited to 2024 files to reduce the amount of data being transferred, but you could use a different regular expression to cover more.
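For instance, the pattern could be widened along these lines. This is only a sketch: the page names are invented, and it assumes that pages for other years follow the same `england-<month>-<year>` naming, which has not been checked here.

```python
import re

two_years = r'^england-[a-z]+-(2023|2024)$'   # 2023 and 2024 only
any_year = r'^england-[a-z]+-20\d{2}$'        # any year from 2000 to 2099

# Hypothetical page names, for illustration only.
page_names = [
    'england-march-2022',
    'england-june-2023',
    'england-december-2024',
    'england-december-2024-final',
]

two_year_hits = [name for name in page_names if re.match(two_years, name)]
any_year_hits = [name for name in page_names if re.match(any_year, name)]

print(two_year_hits)
print(any_year_hits)
```

In both patterns the trailing `$` still rejects names with anything after the year, such as the last entry in the list.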

### Re-usability

In [1]:
from urllib.parse import urljoin
import requests as req
import re
import os
from bs4 import BeautifulSoup

url = 'https://digital.nhs.uk/data-and-information/publications/statistical/learning-disabilities-health-check-scheme'

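target_urls = []                           # empty list that will later be filled with target URLs in a for loop.

os.makedirs("downloads", exist_ok=True)    # make sure the "downloads" folder exists before any files are written to it.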

dynamic_section = r'^england-[a-z]+-2024$' # the regular expression for the URLs we are interested in. Note that the $ means nothing else may follow the year.

response = req.get(url)                  # get the response from the base URL

if response.status_code == 200:
    soup5 = BeautifulSoup(response.content, "html.parser")     # if there is a successful response, create a BeautifulSoup object.

    for link in soup5.find_all('a', href = True):
        sublink = link["href"]
        if re.match(dynamic_section,sublink.split('/')[-1]):
            full_url = urljoin(url, sublink)                   # for each of the instances of the pattern we are looking for
            target_urls.append(full_url)                        # add the constructed full URL to a list of target URLs
        
    for link in target_urls:                                    # check for a successful response (code 200) from each URL
        response = req.get(link)                                # and create a BeautifulSoup object for each.
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")

            for anchor in soup.find_all("a", href=True):        # for each link found on each of the pages in target_urls...
                file_url = urljoin(link, anchor['href'])        # ... resolve it against the page URL in case the href is relative

                if file_url.endswith('.csv'):                   # ... and check for the .csv file extension
                    print("Found .csv file:", file_url)

                    file_name = file_url.split("/")[-1]         # extract the file name from the URL i.e. everything after the last /
                    file_response = req.get(file_url)           # request each file

                    file_path = os.path.join("downloads",       # set up the file path for the downloaded files to the "downloads" folder.
                                              file_name)
            
                    if file_response.status_code == 200:        # if there's a successful response
                        
                        with open(file_path, "wb") as file:     # save the file to the downloads directory
                            file.write(file_response.content)
                        print(f"Downloaded: {file_name}")
                    else:
                        print(f"Failed to download: {file_url}")

else:
    print(f'Failed to fetch webpage: {response.status_code}')   # this else statement pairs with the original response code check for the base URL
                                                                # (see the first "if" in this code block)

Found .csv file: https://files.digital.nhs.uk/89/11D2B8/learning-disabilities-health-check-scheme-eng-Dec-2024.csv
Downloaded: learning-disabilities-health-check-scheme-eng-Dec-2024.csv
Found .csv file: https://files.digital.nhs.uk/DB/E2BCB6/learning-disabilities-health-check-scheme-eng-Nov-2024.csv
Downloaded: learning-disabilities-health-check-scheme-eng-Nov-2024.csv
Found .csv file: https://files.digital.nhs.uk/8E/429E64/learning-disabilities-health-check-scheme-eng-Oct-2024.csv
Downloaded: learning-disabilities-health-check-scheme-eng-Oct-2024.csv
Found .csv file: https://files.digital.nhs.uk/1E/56812A/learning-disabilities-health-check-scheme-eng-Sep-2024.csv
Downloaded: learning-disabilities-health-check-scheme-eng-Sep-2024.csv
Found .csv file: https://files.digital.nhs.uk/1C/BF3E28/learning-disabilities-health-check-scheme-eng-Aug-2024.csv
Downloaded: learning-disabilities-health-check-scheme-eng-Aug-2024.csv
Found .csv file: https://files.digital.nhs.uk/0C/DC2F3D/learning-disab

### Consolidating the files into one