# Web Scraper for UK Immigration Legislation

This Python script scrapes visa and immigration content from the UK government's website (GOV.UK). It follows the links found on the `/browse/visas-immigration` page, extracts text from each linked page, and appends that text to a file.

## Overview

The script performs the following key tasks:

1.  **Initialization**:
    *   Imports the required libraries (`requests`, `BeautifulSoup`, `time`, `os`).
    *   Mounts Google Drive for storing the scraped data.
    *   Defines the base URL and request headers.

2.  **Base Page Scraping**:
    *   Fetches the content of the base page (`https://www.gov.uk/browse/visas-immigration`).
    *   Parses the HTML content using BeautifulSoup.
    *   Finds all links (anchor tags) on the base page.

3.  **Content Extraction**:
    *   A function `scrape_immigration_content` handles the scraping of individual topic pages.
    *   It fetches the page content, parses the HTML, and extracts text from `div` elements with the class `govuk-body`.
    *   The extracted text is appended to a file named `immigration_data.txt` on Google Drive.

4.  **Looping Through Topics**:
    *   The script iterates through the links found on the base page.
    *   It keeps only links whose `href` contains `immigration`, and skips any that contain `/info/` or `/search/` or that end with `/print`.
    *   For each valid link, it determines if it's an absolute or relative URL and constructs a full URL if necessary.
    *   Calls `scrape_immigration_content` to scrape each topic page.
    *   Implements a 5-second delay between requests to be respectful to the server.
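
The filtering and URL-building described in step 4 can be sketched as a small pure function that runs without any network access. This is a minimal illustration, not the notebook's own code: the names `select_topic_urls`, `SITE_ROOT`, and `EXCLUDED_PARTS` are hypothetical, `urljoin` stands in for the string concatenation used in the notebook, and the de-duplication step is an addition the notebook does not perform.

```python
from urllib.parse import urljoin

# Assumption: same site root the notebook prepends to relative links.
SITE_ROOT = "https://www.gov.uk"
# Path fragments that mark sections the scraper skips.
EXCLUDED_PARTS = ("/info/", "/search/")


def select_topic_urls(hrefs):
    """Return absolute, de-duplicated URLs for hrefs that pass the filter."""
    selected = []
    seen = set()
    for href in hrefs:
        if "immigration" not in href:
            continue  # not an immigration-related link
        if any(part in href for part in EXCLUDED_PARTS):
            continue  # skipped section
        if href.endswith("/print"):
            continue  # print view of a page
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        url = urljoin(SITE_ROOT, href)
        if url not in seen:  # de-duplication: an addition, not in the notebook
            seen.add(url)
            selected.append(url)
    return selected


# Made-up hrefs for illustration only.
sample_hrefs = [
    "/browse/visas-immigration/work-visas",
    "https://www.gov.uk/browse/visas-immigration/asylum",
    "/browse/visas-immigration/work-visas",    # duplicate
    "/search/all?keywords=immigration",        # skipped section
    "/browse/visas-immigration/asylum/print",  # print view
    "/browse/benefits",                        # no 'immigration' in the href
]
topic_urls = select_topic_urls(sample_hrefs)
print(topic_urls)
```

Keeping the filter separate from the network calls makes it possible to check which pages would be requested before any request is sent.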

In [None]:
import requests
import os
import time
from bs4 import BeautifulSoup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Base URL for immigration-related pages
base_url = 'https://www.gov.uk/browse/visas-immigration'

# Headers to mimic a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

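# Request the base page content and fail early on HTTP errors
response = requests.get(base_url, headers=headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Find all anchor tags with an href; immigration links are filtered later
topics = soup.find_all('a', href=True)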

In [None]:
# Function to scrape immigration content from a single topic page
def scrape_immigration_content(topic_url):
    try:
        response = requests.get(topic_url, headers=headers, timeout=30)
        response.raise_for_status()  # Treat HTTP errors (4xx/5xx) as failures
        topic_soup = BeautifulSoup(response.text, 'html.parser')

        # Extract the relevant content: div elements with class govuk-body
        sections = topic_soup.find_all('div', class_='govuk-body')

        # Open the output file once per page and append each section's text
        with open('/content/drive/MyDrive/immigration_data.txt', 'a', encoding='utf-8') as file:
            for section in sections:
                file.write(section.get_text() + '\n')
        print(f"Scraped content from {topic_url}")

    except Exception as e:
        print(f"Error scraping {topic_url}: {e}")
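
The loop in the next cell hard-codes the sections it skips. One alternative is to ask a robots.txt parser whether a URL may be fetched, using the standard-library `urllib.robotparser`. The sketch below is an assumption-laden illustration: the rules in `SAMPLE_ROBOTS_TXT` are hypothetical and are not the real GOV.UK robots.txt, and `is_allowed` is a helper name invented here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real gov.uk robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /info/
"""

parser = RobotFileParser()
# parse() accepts the robots.txt content as a list of lines.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://www.gov.uk/browse/visas-immigration/asylum"))
print(is_allowed("https://www.gov.uk/search/all?keywords=visa"))
```

Against a live site, the rules would be loaded with `parser.set_url(...)` followed by `parser.read()` instead of an inline string. The standard-library parser appears to match rules by plain path prefix, so a pattern such as `/print$` would probably still need a separate check like the `endswith` test used in the loop.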


In [None]:
# Loop through each topic and scrape immigration-related pages
for topic in topics:
    href = topic['href']
    # Keep immigration links, skipping the /info/ and /search/ sections and print views
    if 'immigration' in href and '/info/' not in href and '/search/' not in href and not href.endswith('/print'):
        # Use absolute URLs as-is; prefix relative links with the site root
        if href.startswith('http'):
            topic_url = href
        else:
            topic_url = 'https://www.gov.uk' + href

        scrape_immigration_content(topic_url)

        # Respectful crawling: add delay between requests
        time.sleep(5)  # Adjust delay as needed (5 seconds here)


Scraped content from https://www.gov.uk/browse/visas-immigration
Scraped content from https://www.gov.uk/browse/visas-immigration/what-you-need-to-do
Scraped content from https://www.gov.uk/browse/visas-immigration/tourist-short-stay-visas
Scraped content from https://www.gov.uk/browse/visas-immigration/work-visas
Scraped content from https://www.gov.uk/browse/visas-immigration/student-visas
Scraped content from https://www.gov.uk/browse/visas-immigration/family-visas
Scraped content from https://www.gov.uk/browse/visas-immigration/eu-eea-swiss
Scraped content from https://www.gov.uk/browse/visas-immigration/ukrainian-nationals
Scraped content from https://www.gov.uk/browse/visas-immigration/commonwealth-british-nationals-overseas
Scraped content from https://www.gov.uk/browse/visas-immigration/settle-in-the-uk
Scraped content from https://www.gov.uk/browse/visas-immigration/asylum
Scraped content from https://www.gov.uk/browse/visas-immigration/immigration-appeals
Scraped content from