# **National Archives Web Scraper**

! pip install feedparser

### Libraries and Modules Description

In [2]:
from IPython import get_ipython
from IPython.display import display
import requests
import feedparser
import os
import time
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import csv
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


1. **IPython**
   - **`get_ipython`**: Provides an interface to interact with the IPython runtime environment. It allows access to IPython's configuration and utilities for working with IPython sessions.
   - **`display`**: A function used to display rich representations of objects (like images, HTML, and more) in IPython environments such as Jupyter Notebooks.

2. **`requests`**
   - A Python library for making HTTP requests. It simplifies sending HTTP requests to interact with web services or fetch data from URLs.
   - Example use: Sending a GET or POST request to a web API or downloading data from a URL.

3. **`feedparser`**
   - A module used for parsing RSS and Atom feeds. It allows extracting data like blog posts, articles, or news from XML feeds provided by websites.
   - Example use: Reading and parsing RSS feeds from news websites.

4. **`os`**
   - A built-in Python library for interacting with the operating system. It provides functionalities for working with directories, files, environment variables, and other OS-level operations.
   - Example use: Accessing file paths or managing directories programmatically.

5. **`time`**
   - A built-in Python module that provides various time-related functions, such as pausing execution (`sleep`), getting the current time (`time`), or measuring elapsed time.
   - Example use: Implementing delays or time-related calculations.

6. **`bs4` (BeautifulSoup)**
   - A popular library used for parsing HTML and XML documents. BeautifulSoup makes it easy to navigate and extract data from web pages by using tag and attribute searches.
   - Example use: Scraping information from web pages by parsing the HTML content.

7. **`xml.etree.ElementTree` (ET)**
   - A built-in Python module for parsing and creating XML documents. It provides a tree structure for manipulating XML data and extracting specific elements.
   - Example use: Reading, parsing, and modifying XML documents.

8. **`csv`**
   - A Python module for handling CSV (Comma-Separated Values) files. It provides methods to read from and write to CSV files, making it easy to work with structured tabular data.
   - Example use: Reading a dataset from a CSV file or writing data to a CSV file for storage.

9. **`google.colab.drive` (Colab Drive)**
   - **`drive.mount()`**: A function to mount Google Drive in a Colab environment, allowing access to files stored in your Google Drive. This is commonly used to load or save data from Google Drive while working in Colab.
   - Example use: Mounting Google Drive to read and save files in Colab.


In [3]:
# Updated Atom feed URL
feed_url = "https://caselaw.nationalarchives.gov.uk/atom.xml?"

In [4]:
# Headers for requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

### Function: `scrape_case_content(topic_url)`

#### Purpose:
This function scrapes the text of every paragraph (`<p>` element) on the case law page at `topic_url`, wraps it in a `<case>` XML element, and appends that element to `all_case_data.xml` in Google Drive. If the file does not exist yet, it is created.

#### Parameters:
- **`topic_url`**: A string representing the URL of the webpage to be scraped for case law content.

In [5]:
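def scrape_case_content(topic_url):
    """Scrape every <p> from topic_url and append it as a <case> to the XML file."""
    xml_path = '/content/drive/MyDrive/all_case_data.xml'
    try:
        # Fetch the page; the timeout stops a stalled connection from hanging the loop
        response = requests.get(topic_url, headers=headers, timeout=30)
        # Treat HTTP errors (404, 500, ...) as failures instead of scraping the error page
        response.raise_for_status()
        topic_soup = BeautifulSoup(response.text, 'html.parser')

        # Every <p> element is treated as one paragraph of the judgment
        sections = topic_soup.find_all('p')

        # Build <case><text>...</text>...</case>
        case_element = ET.Element("case")
        for section in sections:
            text_element = ET.SubElement(case_element, "text")
            text_element.text = section.get_text()

        try:
            # Append to the existing file if there is one
            tree = ET.parse(xml_path)
            root = tree.getroot()
            root.append(case_element)
        except FileNotFoundError:
            # First run: create the <cases> root element
            root = ET.Element("cases")
            root.append(case_element)
            tree = ET.ElementTree(root)

        tree.write(xml_path, encoding='utf-8', xml_declaration=True)

        print(f"Scraped content from {topic_url}")

    except Exception as e:
        print(f"Error scraping {topic_url}: {e}")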

#### Steps:
1. **Make HTTP Request:**
   - The function uses the `requests.get()` method to send an HTTP request to the given `topic_url`. The response is fetched with the headers defined earlier in the code.
   
2. **Parse HTML with BeautifulSoup:**
   - The response content is parsed using BeautifulSoup, specifying the `'html.parser'` as the parsing engine.
   - The function looks for all `<p>` tags in the HTML (which typically represent paragraphs) using `find_all('p')`.

3. **Create XML Elements:**
   - A new XML element named `<case>` is created using `xml.etree.ElementTree`.
   - For each paragraph found in the page (`sections`), a new `<text>` sub-element is created within the `<case>` element. The text content from each paragraph is added as the value of the `<text>` element.

4. **Append to or Create XML File:**
   - The function attempts to open and parse an existing XML file (`all_case_data.xml`) located in Google Drive.
     - If the file is found, it retrieves the root element and appends the new `<case>` element.
     - If the file is not found (i.e., `FileNotFoundError`), a new XML file is created with a root element `<cases>`, and the new `<case>` element is added.
   
5. **Write XML File:**
   - The XML tree is written back to the `all_case_data.xml` file in Google Drive, with UTF-8 encoding and an XML declaration.

6. **Error Handling:**
   - If an error occurs during the scraping process (e.g., network issues or parsing errors), the exception is caught, and an error message is printed to indicate that the scraping failed for the given URL.
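
Steps 2 to 5 can be tried offline with a minimal sketch like the one below. It is not the notebook's implementation: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without extra packages, and the names `ParagraphCollector`, `append_case`, `sample_html`, and `demo_path` are hypothetical, introduced only for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (stdlib stand-in for find_all('p'))."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None


def append_case(paragraphs, xml_path):
    """Append one <case> to xml_path, creating the <cases> root on first use."""
    case_element = ET.Element("case")
    for paragraph in paragraphs:
        ET.SubElement(case_element, "text").text = paragraph
    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
    except FileNotFoundError:
        root = ET.Element("cases")
        tree = ET.ElementTree(root)
    root.append(case_element)
    tree.write(xml_path, encoding="utf-8", xml_declaration=True)
    return len(root.findall("case"))  # number of cases now in the file


# Hypothetical sample page; the real input would be response.text
sample_html = (
    "<html><body><p>First <b>point</b>.</p>"
    "<div>skip</div><p>Second point.</p></body></html>"
)
collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

# A temporary file stands in for the Google Drive path
demo_path = os.path.join(tempfile.mkdtemp(), "demo_cases.xml")
first_total = append_case(collector.paragraphs, demo_path)
second_total = append_case(["Another case."], demo_path)
```

Calling `append_case` twice on the same path shows the append behaviour described in step 4: the first call creates the `<cases>` root, and the second adds a further `<case>` to it.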

#### Output:
- The function prints a message indicating the successful scraping of content from the `topic_url`. In the case of an error, an error message is displayed.


### Code Snippet: Fetch and Scrape Case Content from the Atom Feed

#### Purpose:
This code snippet retrieves legal case URLs from the Atom feed, then scrapes each case page and stores its content in an XML file using the previously defined `scrape_case_content()` function. A delay is added between requests to avoid overloading the server.

In [6]:
response = requests.get(feed_url, headers=headers)
feed = feedparser.parse(response.content)

judgment_links = [entry.link for entry in feed.entries]

for link in judgment_links:
    scrape_case_content(link)  
    time.sleep(5)

Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/kb/2025/138
Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/ch/2025/135
Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/admin/2025/123
Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/comm/2025/140
Scraped content from https://caselaw.nationalarchives.gov.uk/ewca/crim/2025/52
Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/ch/2025/136
Scraped content from https://caselaw.nationalarchives.gov.uk/ewca/crim/2025/51
Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/admin/2025/137
Scraped content from https://caselaw.nationalarchives.gov.uk/ukftt/tc/2025/66
Scraped content from https://caselaw.nationalarchives.gov.uk/ukftt/grc/2025/73
Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/admin/2025/113
Scraped content from https://caselaw.nationalarchives.gov.uk/ewhc/tcc/2025/100
Scraped content from https://caselaw.nationalarch

#### Steps:

1. **Fetch the Atom Feed:**
   - The `requests.get()` method sends an HTTP request to `feed_url`, passing the `headers` dictionary so that the request carries a browser-like `User-Agent`.
   - The response body (the Atom feed XML) is passed to `feedparser.parse()`, which turns it into a structured object.

2. **Extract Judgment Links:**
   - The `feed.entries` attribute contains the parsed feed entries. A list comprehension extracts the `link` attribute from each entry, which points to the corresponding judgment page.
   - These links are stored in the `judgment_links` list.

3. **Scrape Case Content:**
   - The code iterates through the list of `judgment_links`, and for each link:
     - It calls the `scrape_case_content(link)` function to scrape and save the case content from the respective URL.
     - A delay of 5 seconds (`time.sleep(5)`) is added between each request to prevent overwhelming the server with rapid successive requests, ensuring responsible scraping behavior.

#### Output:
- The content of each case linked in the Atom feed is scraped and appended to an XML file located in Google Drive.
- A 5-second delay is introduced between requests to avoid hitting server rate limits.
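
The link extraction and the throttled loop can be illustrated without network access. The sketch below is an assumption-laden stand-in, not what `feedparser` does internally: it reads a small hand-written Atom document with `xml.etree.ElementTree`, and the names `extract_entry_links`, `process_politely`, `sample_feed`, and the `example.org` URLs are hypothetical.

```python
import time
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical two-entry feed; the real feed comes from feed_url
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>Case A</title>
    <link rel="self" href="https://example.org/feed/a"/>
    <link rel="alternate" href="https://example.org/case/a"/>
  </entry>
  <entry>
    <title>Case B</title>
    <link href="https://example.org/case/b"/>
  </entry>
</feed>"""


def extract_entry_links(atom_xml):
    """Return one page link per entry (first link whose rel is 'alternate' or absent)."""
    root = ET.fromstring(atom_xml)
    links = []
    for entry in root.findall("atom:entry", ATOM_NS):
        for link in entry.findall("atom:link", ATOM_NS):
            # In Atom, a missing rel attribute means rel="alternate"
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
                break
    return links


def process_politely(links, handler, delay_seconds=5.0, sleep=time.sleep):
    """Call handler(link) for each link, pausing between consecutive calls."""
    for index, link in enumerate(links):
        if index:  # no pause before the first request
            sleep(delay_seconds)
        handler(link)


links = extract_entry_links(sample_feed)
visited = []
# delay_seconds=0 keeps the demo fast; the notebook uses 5 seconds
process_politely(links, visited.append, delay_seconds=0)
print(visited)
```

Unlike the notebook's loop, which also sleeps after the final link, this sketch pauses only between requests. Passing `sleep` as a parameter is a design choice that lets the delay be checked in a test without actually waiting.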

In [7]:
try:
    tree = ET.parse('/content/drive/MyDrive/all_case_data.xml')
    root = tree.getroot()
    case_count = len(root.findall('case'))
    print(f"Total number of case laws downloaded: {case_count}")
except FileNotFoundError:
    print("XML file not found.")

Total number of case laws downloaded: 100
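
The count above includes every `<case>` element in the file. Because `scrape_case_content()` appends on each run, re-running the scraping cell adds the same cases again, and a page with no `<p>` elements still contributes an empty `<case>`. A minimal sketch of a slightly richer check follows; the function name `summarize_cases` and the demo file are hypothetical, and a temporary file stands in for the Google Drive path.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def summarize_cases(xml_path):
    """Count cases, paragraphs, and cases that contain no <text> elements."""
    root = ET.parse(xml_path).getroot()
    paragraph_counts = [len(case.findall("text")) for case in root.findall("case")]
    return {
        "cases": len(paragraph_counts),
        "paragraphs": sum(paragraph_counts),
        "empty_cases": paragraph_counts.count(0),
    }


# Build a small demo file with the same <cases>/<case>/<text> layout
demo_root = ET.Element("cases")
first_case = ET.SubElement(demo_root, "case")
ET.SubElement(first_case, "text").text = "Paragraph one."
ET.SubElement(first_case, "text").text = "Paragraph two."
ET.SubElement(demo_root, "case")  # a case whose page yielded no <p> elements

demo_path = os.path.join(tempfile.mkdtemp(), "demo_cases.xml")
ET.ElementTree(demo_root).write(demo_path, encoding="utf-8", xml_declaration=True)

summary = summarize_cases(demo_path)
print(summary)
```

A non-zero `empty_cases` value would suggest that some pages returned no paragraph text and may be worth re-checking.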
