# Converting Case Law from XML to CSV

### Libraries and Modules Description

In [1]:
import csv
import xml.etree.ElementTree as ET
import re

1. **`csv`**
   - A Python module for working with CSV (Comma-Separated Values) files. It allows for easy reading and writing of structured data in CSV format.
   - **Common Usage**: 
     - Reading CSV files and parsing their content into lists or dictionaries.
     - Writing lists or dictionaries to a CSV file.

2. **`xml.etree.ElementTree` (ET)**
   - A standard-library module for parsing and building XML documents. It exposes a document as a tree of elements that can be read, modified, and written back out.
   - **Common Usage**:
     - Reading and writing XML files.
     - Modifying XML elements and attributes.

3. **`re`**
   - The **regular expressions (regex)** module in Python. It allows for matching, searching, and manipulating strings using regular expression patterns.
   - **Common Usage**:
     - Searching for patterns in strings.
     - Replacing parts of a string based on a regex pattern.
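
As a quick orientation, the three modules can be exercised together on an in-memory sample. This is only a sketch: the XML string, the `<cases>` root element, and the variable names are made up for illustration and are not taken from the real data file.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical one-case document, used only for illustration.
sample_xml = "<cases><case><text>  Heard on   3rd March 2021  </text></case></cases>"

# ElementTree: parse a string and read the text of the first <text> element.
root = ET.fromstring(sample_xml)
raw = root.find(".//text").text

# re: collapse runs of whitespace into single spaces, then trim the ends.
tidy = re.sub(r"\s+", " ", raw).strip()

# csv: write a header and one row to an in-memory buffer instead of a file.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["cleaned_text"])
writer.writerow([tidy])

print(buffer.getvalue())
```

Using `io.StringIO` in place of a real file keeps the example free of any dependency on Google Drive or the local file system.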

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
xml_file_path = '/content/drive/MyDrive/all_case_data.xml'
csv_file_path = 'cleaned_cases_two.csv'

### Function: `clean_text(text)`

#### Purpose:
The `clean_text` function cleans a text string by discarding boilerplate strings, rewriting ordinal dates in year-month-day order, and removing leftover `<text />` markers. It is intended for cleaning case law text before it is stored or processed further.

#### Parameters:
- **`text`**: A string representing the text to be cleaned. This could be case law content or other textual data. The function handles cases where the input text is `None`.

In [4]:
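def clean_text(text):
    """Return a cleaned copy of `text`, or "" if it is missing or boilerplate."""
    # An element without text content yields None.
    if text is None:
        return ""
    # Discard the whole string if it looks like site boilerplate.
    lowered = text.lower()
    if "cookies" in lowered or "find case law" in lowered:
        return ""

    text = text.strip()
    # Rewrite ordinal dates, e.g. "1st January 2020" -> "2020-January-1".
    text = re.sub(r'(\d{1,2})(st|nd|rd|th)\s([A-Za-z]+)\s(\d{4})', r'\4-\3-\1', text)
    # Remove literal "<text />" markers left in the content.
    text = re.sub(r"<text />", "", text)
    return text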

#### Steps:

1. **Handle `None` Input**:
   - If the `text` is `None`, the function returns an empty string (`""`), ensuring that `None` values do not cause errors in further processing.

2. **Filter Unwanted Text**:
   - If the text contains `"cookies"` or `"find case law"` (both matched case-insensitively), the function returns an empty string (`""`). The entire string is discarded, not just the matching phrase.
   - This removes common boilerplate (e.g., cookie notices or site banner text) from the legal text. The trade-off is that a genuine passage that happens to mention either phrase is dropped as well.

3. **Remove Leading and Trailing Whitespace**:
   - The `strip()` method is used to remove any leading or trailing whitespace from the text.

4. **Format Dates**:
   - The function uses `re.sub()` to rewrite dates written with an ordinal day, such as "1st January 2020", in year-month-day order: "2020-January-1".
   - The month name is kept as written and the day is not zero-padded, so the result is not an ISO 8601 date. Dates without an ordinal suffix (e.g., "1 January 2020") do not match the pattern and are left unchanged.

5. **Remove Empty Text Tags**:
   - A second `re.sub()` call removes every occurrence of the literal string `"<text />"`, which might represent empty or unwanted XML tags in the input. The pattern contains no regex metacharacters, so it matches that exact string only.

6. **Return Cleaned Text**:
   - After the cleaning process, the function returns the cleaned version of the input text, with irrelevant content removed, dates reformatted, and unnecessary tags stripped.
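
The steps above can be traced on a few sample strings. The inputs below are hypothetical, and the function body is repeated from the cell above only so that the snippet runs by itself.

```python
import re

def clean_text(text):
    # Same logic as the notebook cell, repeated to keep this snippet self-contained.
    if text is None:
        return ""
    if "cookies" in text.lower() or "find case law" in text.lower():
        return ""
    text = text.strip()
    text = re.sub(r'(\d{1,2})(st|nd|rd|th)\s([A-Za-z]+)\s(\d{4})', r'\4-\3-\1', text)
    text = re.sub(r"<text />", "", text)
    return text

# Hypothetical inputs, one per step described above.
samples = [
    None,                                          # step 1: missing text
    "We use cookies to improve this site.",        # step 2: boilerplate
    "  Judgment handed down on 1st January 2020  ",  # steps 3 and 4
    "Hearing on 1 January 2020",                   # no ordinal suffix: unchanged
    "Before<text /> the court",                    # step 5
]
for sample in samples:
    print(repr(sample), "->", repr(clean_text(sample)))
```

The fourth sample shows the limit of the date pattern: without `st`, `nd`, `rd`, or `th` after the day, the date passes through untouched.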

#### Output:
- The cleaned text, with ordinal dates rewritten in year-month-day order and `<text />` markers removed, or an empty string when the input is `None` or contains one of the filtered phrases.
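
If sortable, numeric dates are wanted instead of strings like "2020-January-1", one possible variation is to pass a replacement function to `re.sub()` and let `datetime` do the conversion. This is not part of the notebook's method; the names `DATE_PATTERN`, `to_iso`, and `iso_dates` are hypothetical, and the `%B` directive assumes English month names under the active locale.

```python
import re
from datetime import datetime

# Same shape as the notebook's pattern, with the ordinal suffix left uncaptured.
DATE_PATTERN = re.compile(r"(\d{1,2})(?:st|nd|rd|th)\s([A-Za-z]+)\s(\d{4})")

def to_iso(match):
    """Convert one matched date to YYYY-MM-DD, or leave it alone if it is not a date."""
    day, month, year = match.groups()
    try:
        parsed = datetime.strptime(f"{day} {month} {year}", "%d %B %Y")
    except ValueError:
        # e.g. "5th Floor 2020": the middle word is not a month name.
        return match.group(0)
    return parsed.strftime("%Y-%m-%d")

def iso_dates(text):
    # re.sub accepts a function that receives each match object.
    return DATE_PATTERN.sub(to_iso, text)

print(iso_dates("Handed down on 1st January 2020"))
print(iso_dates("Office on the 5th Floor 2020"))
```

The `try`/`except` guard matters because `[A-Za-z]+` accepts any word, so the pattern alone cannot tell a month name from other text.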

### Processing and Saving Cleaned Case Law Data to CSV

#### Purpose:
This code reads case law data from an XML file, cleans the text using the `clean_text` function, and writes the cleaned content to a CSV file. Each row in the CSV file holds the cleaned text of one case, with the contents of its `<text>` elements joined by newlines.

In [5]:
with open(csv_file_path, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)  # Use csv.writer for single column
    writer.writerow(['cleaned_text'])  # Write the header

    # Load the whole XML document into memory.
    tree = ET.parse(xml_file_path)
    root = tree.getroot()

    # One CSV row per <case> element.
    for case in root.findall(".//case"):
        cleaned_texts = []
        for text_element in case.findall(".//text"):
            cleaned = clean_text(text_element.text)
            if cleaned:  # Skip strings that were filtered out or empty
                cleaned_texts.append(cleaned)

        # Join the surviving pieces into one multi-line field.
        writer.writerow(["\n".join(cleaned_texts)])

print(f"Processing complete. Cleaned data saved to '{csv_file_path}'.")

Processing complete. Cleaned data saved to 'cleaned_cases_two.csv'.


#### Steps:

1. **Open the CSV File for Writing**:
   - The `open()` function is used to open or create a CSV file specified by `csv_file_path` in write mode (`'w'`).
   - `newline=''` lets the `csv` module control line endings itself, which prevents extra blank lines between rows on Windows, and `encoding='utf-8'` ensures that the CSV file can handle Unicode characters.
   - A `csv.writer` object is created to write rows to the CSV file, which is expected to contain only one column (for the cleaned case text).
   
2. **Write Header to CSV**:
   - The `writer.writerow()` method is used to write the header row, which contains the column name `'cleaned_text'`. This will label the column in the CSV file.

3. **Parse XML File**:
   - The XML file at `xml_file_path` is parsed using `ET.parse()`, and the root element of the XML document is retrieved using `getroot()`. This provides access to all the `<case>` elements within the XML structure.

4. **Iterate Through Case Elements**:
   - The code iterates over all `<case>` elements found in the XML file using `findall(".//case")`. Each case represents a collection of `<text>` elements.

5. **Clean Text Elements**:
   - For each `<case>`, the code initializes an empty list `cleaned_texts` to store the cleaned text content.
   - It then iterates through all `<text>` elements within the case using `findall(".//text")`.
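   - The text content of each `<text>` element is passed to the `clean_text()` function for cleaning (e.g., removing irrelevant content, formatting dates). Note that `text_element.text` holds only the text that appears before the element's first child, so text nested inside child tags is not captured.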
   - If the cleaned text is not empty, it is appended to the `cleaned_texts` list.

6. **Write Cleaned Text to CSV**:
   - After processing all the `<text>` elements for a given case, the `cleaned_texts` list is joined into a single string, with each cleaned text entry separated by a newline (`"\n"`).
   - This string is written to the CSV file as a single row using `writer.writerow()`. Because the field contains newlines, `csv.writer` wraps it in quotes, so it still reads back as one field. A case with no surviving text produces a row with an empty field.

7. **Completion Message**:
   - After processing all cases, the code prints a message indicating that the processing is complete and the cleaned data has been saved to the specified CSV file.
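
The steps above can be sketched end to end on a small in-memory document. The layout of the real XML file is not shown in this notebook, so the sample below (a `<cases>` root holding `<case>` elements with `<text>` children) is an assumption, and a simple `strip()` stands in for `clean_text`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: the real file's layout is assumed, not confirmed.
sample_xml = """
<cases>
  <case><text>First paragraph</text><text>Second paragraph</text></case>
  <case><text>Only paragraph</text></case>
</cases>
"""

root = ET.fromstring(sample_xml)

# In-memory stand-in for the CSV file opened with newline=''.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["cleaned_text"])

for case in root.findall(".//case"):
    # Stand-in for clean_text: trim whitespace and treat None as empty.
    parts = [(t.text or "").strip() for t in case.findall(".//text")]
    parts = [p for p in parts if p]  # keep non-empty pieces only
    writer.writerow(["\n".join(parts)])

print(buffer.getvalue())
```

In the printed output the first case appears as a quoted field spanning two lines, which is how `csv.writer` keeps a multi-line value inside a single row.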

#### Output:
- The cleaned case law text is saved in a CSV file at `csv_file_path`. Each row contains the cleaned text of one case, with the individual sections of text separated by newlines within that row.
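
To check the result, the file can be read back with `csv.DictReader`, which handles the quoted multi-line fields. The sketch below uses an in-memory string in place of the file at `csv_file_path`; the sample contents are hypothetical.

```python
import csv
import io

# In-memory stand-in for the saved file; a real check would open csv_file_path
# with newline='' and encoding='utf-8' instead.
csv_data = 'cleaned_text\r\n"First paragraph\nSecond paragraph"\r\nOnly paragraph\r\n'

rows = list(csv.DictReader(io.StringIO(csv_data, newline="")))

print(len(rows))  # number of cases, not number of physical lines
for row in rows:
    # Split each case back into its individual text sections.
    print(row["cleaned_text"].split("\n"))
```

Counting rows this way is more reliable than counting lines in the file, since one case can occupy several physical lines.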