# Web Scraper for gov.uk Immigration Legislation

This Python script scrapes content from the UK government's official website (gov.uk), focusing on immigration legislation and related guidance. It uses Selenium to automate the browser and BeautifulSoup to parse the HTML. For each page it extracts the URL, the title, and the paragraph text, and writes them to a CSV file.

In [None]:
import csv
import logging
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import time

In [None]:
def scrape_category(category_url, category_title, csvwriter, visited_urls):
    if category_url in visited_urls:
        return
    visited_urls.add(category_url)

    logging.info(f"Visiting category URL: {category_url}")

    try:
        driver.get(category_url)
        time.sleep(5)  # Fixed 5-second pause to give the page time to load

        # Parse the page with BeautifulSoup
        page_soup = BeautifulSoup(driver.page_source, 'html.parser')
        content = page_soup.find('div', id='wrapper', class_='wrapper')

        if not content:
            logging.error(f"No content found on {category_url}")
            return

        # Extract all paragraphs
        paragraphs = ' '.join([p.get_text(strip=True) for p in content.find_all('p')])

        # Write the category content to the CSV
        csvwriter.writerow([category_url, category_title, paragraphs])
        logging.info(f"Successfully scraped category URL: {category_url}")

        # Find and follow sub-links within the category
        sub_links = content.find_all('a', href=True)
        for sub_link in sub_links:
            sub_href = sub_link['href']
            if sub_href.startswith('/'):
                sub_full_url = "https://www.gov.uk" + sub_href
                scrape_section(sub_full_url, category_title, csvwriter, visited_urls)
            elif sub_href.startswith('http'):
                scrape_section(sub_href, category_title, csvwriter, visited_urls)

    except Exception as e:
        logging.error(f"An error occurred while scraping category {category_url}: {e}")

def scrape_section(section_url, category_title, csvwriter, visited_urls):
    if section_url in visited_urls:
        return
    visited_urls.add(section_url)

    logging.info(f"Visiting section URL: {section_url}")

    try:
        driver.get(section_url)
        time.sleep(5)  # Fixed 5-second pause to give the page time to load

        # Parse the page with BeautifulSoup
        page_soup = BeautifulSoup(driver.page_source, 'html.parser')
        content = page_soup.find('div', id='wrapper', class_='wrapper')

        if not content:
            logging.error(f"No content found on {section_url}")
            return

        # Extract the title from the <h1> tag
        title_tag = content.find('h1')
        title = title_tag.get_text(strip=True) if title_tag else 'No Title'

        # Extract all paragraphs
        paragraphs = ' '.join([p.get_text(strip=True) for p in content.find_all('p')])

        # Write the section content to the CSV
        csvwriter.writerow([section_url, title, paragraphs])
        logging.info(f"Successfully scraped section URL: {section_url}")

        # Find and follow sub-links within the section
        sub_links = content.find_all('a', href=True)
        for sub_link in sub_links:
            sub_href = sub_link['href']
            if sub_href.startswith('/'):
                sub_full_url = "https://www.gov.uk" + sub_href
                scrape_section(sub_full_url, category_title, csvwriter, visited_urls)
            elif sub_href.startswith('http'):
                scrape_section(sub_href, category_title, csvwriter, visited_urls)

    except Exception as e:
        logging.error(f"An error occurred while scraping section {section_url}: {e}")

def main(url, output_csv, log_file):
    # Set up logging
    logging.basicConfig(filename=log_file, level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    # Set up Selenium WebDriver for Firefox
    options = Options()
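    options.add_argument("-headless")  # The options.headless setter was deprecated and later removed in Selenium 4
    service = Service('/opt/homebrew/bin/geckodriver')  # macOS Homebrew location; change to your geckodriver path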
    global driver
    driver = webdriver.Firefox(service=service, options=options)

    try:
        # Open the output CSV file
        with open(output_csv, 'w', newline='', encoding='utf-8') as outfile:
            csvwriter = csv.writer(outfile)
            csvwriter.writerow(['Section', 'Title', 'Content'])

            # Start scraping from the main URL
            visited_urls = set()
            scrape_category(url, 'Main Page', csvwriter, visited_urls)

    except Exception as e:
        logging.error(f"An error occurred in the main function: {e}")

    finally:
        driver.quit()

# Example usage
url = "https://www.gov.uk/browse/visas-immigration"
output_csv = "gov_additional_content.csv"
log_file = "scrape_errors.log"
main(url, output_csv, log_file)

## Files

*   `gov_additional_content.csv`: The output CSV file where the scraped data is stored.
*   `scrape_errors.log`: A log file that records any errors or informational messages encountered during the scraping process.
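
To illustrate the layout of the output file, here is a minimal sketch that writes the same three-column header the script uses and reads the rows back. It uses an in-memory buffer as a stand-in for `gov_additional_content.csv`, and the sample row is made up for illustration.

```python
import csv
import io

# In-memory stand-in for gov_additional_content.csv (an assumption for this sketch).
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['Section', 'Title', 'Content'])  # same header the script writes
# Hypothetical row; real rows come from the scraped pages.
writer.writerow(['https://www.gov.uk/example', 'Example title',
                 'First paragraph, with a comma.'])

# Read the rows back as dictionaries keyed by the header names.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
for row in rows:
    print(row['Section'], '|', row['Title'], '|', row['Content'])
```

Note that the `Section` column holds the page URL, and that the `csv` module quotes fields containing commas, so paragraph text survives the round trip intact.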

## Functions

### `scrape_category(category_url, category_title, csvwriter, visited_urls)`

This function scrapes a given category URL from the website.

**Parameters:**

*   `category_url` (str): The URL of the category to scrape.
*   `category_title` (str): The title of the category.
*   `csvwriter` (csv.writer): The CSV writer object.
*   `visited_urls` (set): A set to keep track of visited URLs to avoid re-scraping.

**Functionality:**

1.  **Checks for Duplicates:** Ensures the URL hasn't been visited before.
2.  **Navigates:** Uses Selenium to open the URL.
3.  **Parses HTML:** Utilizes BeautifulSoup to parse the HTML content.
4.  **Extracts Content:** Finds all `<p>` (paragraph) tags and extracts their text.
5.  **Writes to CSV:** Writes the category URL, title, and extracted text to the CSV file.
6.  **Finds Sub-links:** Looks for `<a>` (anchor) tags to find links.
7.  **Follows Sub-links:** Calls `scrape_section` on each relative (`/...`) or absolute (`http...`) link found.
8.  **Error Handling:** Catches any exception and records it in the log file.
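
The link handling in the steps above can also be sketched with the standard library's `urllib.parse`. The helper below, `resolve_link`, is hypothetical and not part of the original script. Unlike the script, which also follows absolute links to other sites, this sketch assumes you want to stay on gov.uk, and it drops `#fragment` suffixes so the same page is not visited twice.

```python
from urllib.parse import urldefrag, urljoin, urlparse

BASE_URL = "https://www.gov.uk"  # site root, as used by the script


def resolve_link(href, base_url=BASE_URL):
    """Return an absolute same-site URL for href, or None if it should be skipped.

    Hypothetical helper for illustration; the original script concatenates strings.
    """
    # Resolve relative links against the base and strip any #fragment.
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # e.g. mailto: links
    if parsed.netloc != urlparse(base_url).netloc:
        return None  # assumption: ignore links that leave the site
    return absolute


print(resolve_link("/browse/visas-immigration"))
print(resolve_link("https://example.com/page"))
```

Restricting the crawl to one host is a design choice. Without some such limit, a crawler that follows every absolute link can wander far beyond the starting site.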

### `scrape_section(section_url, category_title, csvwriter, visited_urls)`

This function scrapes a given section URL from the website.

**Parameters:**

*   `section_url` (str): The URL of the section to scrape.
*   `category_title` (str): The title of the parent category. It is only passed along to recursive calls; the CSV row uses the page's own `<h1>` title.
*   `csvwriter` (csv.writer): The CSV writer object.
*   `visited_urls` (set): A set to keep track of visited URLs to avoid re-scraping.

**Functionality:**

1.  **Checks for Duplicates:** Ensures the URL hasn't been visited before.
2.  **Navigates:** Uses Selenium to open the URL.
3.  **Parses HTML:** Utilizes BeautifulSoup to parse the HTML content.
4.  **Extracts Content:** Finds the `<h1>` tag for the title and all `<p>` tags for text.
5.  **Writes to CSV:** Writes the section URL, title, and extracted text to the CSV file.
6.  **Finds Sub-links:** Looks for `<a>` (anchor) tags to find links.
7.  **Recurses:** Calls itself on each relative (`/...`) or absolute (`http...`) link found, so the crawl continues until no unvisited links remain.
8.  **Error Handling:** Catches any exception and records it in the log file.
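
Because `scrape_section` calls itself for every link, a long chain of pages can run into Python's recursion limit. As an alternative technique (not what the script does), the same visit-once behaviour can be sketched with an explicit queue. Here `crawl` and `get_links` are hypothetical names, and a small dictionary stands in for the website so the example runs without a browser.

```python
from collections import deque


def crawl(start_url, get_links):
    """Visit every URL reachable from start_url exactly once, breadth-first.

    get_links is a hypothetical callable returning the outgoing links of a URL;
    in the real script that role is played by Selenium plus BeautifulSoup.
    """
    visited = {start_url}      # same purpose as visited_urls in the script
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)      # the real script would scrape and write a CSV row here
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# Toy link graph containing cycles, standing in for real pages.
site = {"/a": ["/b", "/c"], "/b": ["/a", "/c"], "/c": ["/a"]}
order = crawl("/a", lambda url: site.get(url, []))
print(order)  # → ['/a', '/b', '/c']
```

Marking a URL as visited when it is queued, rather than when it is processed, keeps duplicates out of the queue even when several pages link to the same target.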

### `main(url, output_csv, log_file)`

This is the main function that orchestrates the scraping process.

**Parameters:**

*   `url` (str): The starting URL for scraping.
*   `output_csv` (str): The file path for the output CSV file.
*   `log_file` (str): The file path for the log file.

**Functionality:**

1.  **Logging Setup:** Configures the logging system.
2.  **Selenium Setup:** Initializes the Selenium WebDriver for Firefox in headless mode.
    *   Ensure that `geckodriver` is installed and that the path passed to `Service` in the code points to it. In a Linux notebook environment, you can download it with `!wget https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz` and unpack it with `!tar -xvzf geckodriver*`.
3.  **CSV Setup:** Opens the specified CSV file for writing.
4.  **Initial Scraping:** Calls `scrape_category` to start scraping from the main URL.
5.  **Driver Cleanup:** Ensures the WebDriver is closed properly.
6.  **Error Handling:** Logs any exception raised while opening the CSV file or scraping to the log file.
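
One detail of the logging setup is worth a sketch. `logging.basicConfig` does nothing if the root logger already has handlers, which can happen when a notebook cell is re-run. The hypothetical helper below mirrors the configuration used in `main` and adds `force=True` (available from Python 3.8), which is an addition of this sketch and not part of the original script.

```python
import logging


def configure_logging(log_file):
    """Hypothetical helper mirroring the logging setup in main()."""
    logging.basicConfig(
        filename=log_file,
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        force=True,  # replace handlers from earlier runs; not in the original
    )


configure_logging("scrape_errors.log")
logging.info("Logging configured")
```

With this setup, both informational messages and errors end up in the same file, each prefixed with a timestamp and level.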

## Usage

1.  **Run in Google Colab:** Execute the provided code in a Google Colab notebook.
2.  **Set Parameters:** Modify the `url`, `output_csv`, and `log_file` variables at the end of the script to change the starting URL, output file name, and log file name, respectively.
3.  **Set geckodriver:** Make sure the path passed to `Service(...)` in `main` points to your geckodriver executable.
4.  **Execute:** Run the script to start the scraping process.