In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Explanation:**
The code reads XML content from a file and prints its first 100 lines. The **read_xml_from_file** function takes the file path as a parameter and wraps the read in a **try-except** block. Inside the try block, it opens the file in read mode using the with statement, reads the content with the **readlines()** method, which returns a list of lines, and returns that list. If a FileNotFoundError occurs, it prints "File not found." If any other exception occurs, it prints "An error occurred:" along with the exception details. In both error cases the function falls through and returns None.
The **file_path** variable is assigned the path to the .txt file containing the XML content. The read_xml_from_file function is called with file_path as its argument, and the returned list is stored in the **xml_lines** variable. If xml_lines is neither None nor empty (i.e., the file was read successfully and has content), the code prints the first 100 lines. It uses list slicing **(xml_lines[:100])** to select them and the **join()** method to concatenate them into a single string for printing. Finally, the code joins all the XML lines into a single string using join() and stores it in the xml_data variable.

In [2]:
def read_xml_from_file(file_path):
    try:
        # Open the file in read mode
        with open(file_path, 'r', encoding='utf-8') as file:
            # Read the XML content from the file
            xml_content = file.readlines()

            # Return the XML content as a list of lines
            return xml_content
    except FileNotFoundError:
        print("File not found.")
    except Exception as e:
        print("An error occurred:", e)

# Provide the path to your .txt file containing XML content
file_path = '/content/drive/Shareddrives/FIT5196_S1_2024/A1/Students data/Task 1/Group065.txt'

# Read XML content from the file
xml_lines = read_xml_from_file(file_path)

# If the file was read and is not empty, print the first 100 lines
if xml_lines:
    print("First 100 lines of XML content:")
    print(''.join(xml_lines[:100]))

    # Join the XML lines into a single string
    xml_data = ''.join(xml_lines)

First 100 lines of XML content:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE trademark-assignments [<!-- DOCUMENT TYPE DEFINITION FOR TRADEMARK ASSIGNMENTS
Reference this DTD as PUBLIC "-//USPTO//DTD TRADEMARK_ASSIGNMENTS V1.0 2013-09-05//EN" 

Contact:
U.S. Patent and Trademark Office
Electronic Information Products Division
P.O. Box 1450
Alexandria, VA 22313
ipd@uspto.gov

***** START REVISION HISTORY *****
12/01/2000 Added ACK_DATE (Date Acknowledged) to PARTY element.
03/06/2002 Updated element names to conform to PTO Enterprise standard element names.
05/14/2002 Updated property element to include Trademark Law Treaty values
05/19/2003 Added LAST-UPDATE-DATE element to the group element <assignment>.
08/28/2003 Updated property group element to add optional element International-Registration-Number.
Updated property group elements serial-no, registration-no with optional parameter
(These fields may not be available when an international registration number is present).
10/10/2

**Explanation:**
The **get_country** function extracts the country from the assignor or assignee data in the XML. It takes **party_data** (the XML fragment for one assignor or assignee) as input and returns the corresponding country based on the available information.

It first searches for the country-name element using a regular expression. If a country name is found, the function checks whether it contains one of the predefined variations for the United States, United Kingdom, or Canada. If so, it returns the corresponding country code **("USA", "UK", or "CANADA")**. If the country name doesn't match any of these variations, it is returned as is.

If no country name is found, the function searches **first for state and then nationality** information, again with regular expressions. Each value is compared, in upper case, against the predefined lists of US states, UK countries, and Canadian provinces/territories. If a match is found, the function returns the corresponding country code ("USA", "UK", or "CANADA"). A state that matches none of the lists is ignored and the search moves on to nationality; a nationality that matches none of the lists is returned as is, unless it is "NOT PROVIDED".

If no usable country, state, or nationality information is found, it returns **"NA"** to indicate that the country information is not available.
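
One detail worth checking when matching short country codes with re.search is that a bare alternative such as uk also matches inside longer names. The sketch below contrasts a bare pattern with a word-boundary version. The helper names **bare_match_uk** and **bounded_match_uk** are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical helpers for illustration; they are not part of the notebook.
def bare_match_uk(country_name):
    # Bare alternatives: "uk" can match anywhere, including inside a longer word
    return bool(re.search(r'united kingdom|uk|u\.k\.', country_name, re.IGNORECASE))

def bounded_match_uk(country_name):
    # \b requires a word boundary, so "uk" must stand on its own
    return bool(re.search(r'\bunited kingdom\b|\buk\b|\bu\.k\.', country_name, re.IGNORECASE))

for name in ["UNITED KINGDOM", "UK", "U.K.", "UKRAINE"]:
    print(f"{name:15} bare={bare_match_uk(name)} bounded={bounded_match_uk(name)}")
```

Only the last name differs between the two helpers: the bare pattern accepts "UKRAINE" while the bounded one rejects it.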

In [3]:
import re
import json

# Function to extract the country information from an assignor or assignee
def get_country(party_data):
    """
    This function takes the assignor or assignee data as input and extracts the country information.
    It searches for the country name, state, and nationality in the data and returns the corresponding country.
    If no country information is found, it returns "NA".
    """
    # Search for the country name in the assignor or assignee data
    country_name_match = re.search(r'<country-name>(.*?)</country-name>', party_data)
    if country_name_match:
        country_name = country_name_match.group(1)
        # Check if the country name matches United States or its variations
        # (case-insensitive; \b stops short codes from matching inside longer words)
        if re.search(r'\bunited states\b|\busa\b|\bu\.s\.a\.|\bu\.s\.', country_name, re.IGNORECASE):
            return "USA"
        # Check if the country name matches United Kingdom or its variations
        elif re.search(r'\bunited kingdom\b|\buk\b|\bu\.k\.', country_name, re.IGNORECASE):
            return "UK"
        # Check if the country name matches Canada or its variations
        elif re.search(r'\bcanada\b', country_name, re.IGNORECASE):
            return "CANADA"
        else:
            return country_name
    # Region names used to infer the country from <state> or <nationality>
    usa_states = [
        'ALABAMA', 'ALASKA', 'ARIZONA', 'ARKANSAS', 'CALIFORNIA', 'COLORADO', 'CONNECTICUT', 'DELAWARE', 'FLORIDA',
        'GEORGIA', 'HAWAII', 'IDAHO', 'ILLINOIS', 'INDIANA', 'IOWA', 'KANSAS', 'KENTUCKY', 'LOUISIANA', 'MAINE',
        'MARYLAND', 'MASSACHUSETTS', 'MICHIGAN', 'MINNESOTA', 'MISSISSIPPI', 'MISSOURI', 'MONTANA', 'NEBRASKA',
        'NEVADA', 'NEW HAMPSHIRE', 'NEW JERSEY', 'NEW MEXICO', 'NEW YORK', 'NORTH CAROLINA', 'NORTH DAKOTA', 'OHIO',
        'OKLAHOMA', 'OREGON', 'PENNSYLVANIA', 'RHODE ISLAND', 'SOUTH CAROLINA', 'SOUTH DAKOTA', 'TENNESSEE', 'TEXAS',
        'UTAH', 'VERMONT', 'VIRGINIA', 'WASHINGTON', 'WEST VIRGINIA', 'WISCONSIN', 'WYOMING', 'DISTRICT OF COLUMBIA'
    ]
    uk_countries = ['ENGLAND', 'SCOTLAND', 'WALES', 'NORTHERN IRELAND']
    canada_provinces_territories = [
        'ALBERTA', 'BRITISH COLUMBIA', 'MANITOBA', 'NEW BRUNSWICK', 'NEWFOUNDLAND AND LABRADOR',
        'NOVA SCOTIA', 'ONTARIO', 'PRINCE EDWARD ISLAND', 'QUEBEC', 'SASKATCHEWAN',
        'NORTHWEST TERRITORIES', 'NUNAVUT', 'YUKON'
    ]

    # Search for the state in the assignor or assignee data
    state_match = re.search(r'<state>(.*?)</state>', party_data)
    if state_match:
        state = state_match.group(1).upper()
        # Check if the state matches any US state or district
        if state in usa_states:
            return "USA"
        # Check if the state matches any constituent country of the UK
        if state in uk_countries:
            return "UK"
        # Check if the state matches any province or territory of Canada
        if state in canada_provinces_territories:
            return "CANADA"

    # Search for the nationality in the assignor or assignee data
    nationality_match = re.search(r'<nationality>(.*?)</nationality>', party_data)
    if nationality_match:
        nationality = nationality_match.group(1)
        # Check if the nationality matches any US state or district
        if nationality.upper() in usa_states:
            return "USA"
        # Check if the nationality matches any constituent country of the UK
        if nationality.upper() in uk_countries:
            return "UK"
        # Check if the nationality matches any province or territory of Canada
        if nationality.upper() in canada_provinces_territories:
            return "CANADA"
        # Otherwise return the nationality as is, unless it is a placeholder
        if nationality.upper() != "NOT PROVIDED":
            return nationality

    # Return "NA" if no country information is found
    return "NA"


**Explanation:**
**re.search()** finds the first occurrence of each pattern within the assignment_data string. It is used to locate the reel number, frame number, last update date, conveyance text, and correspondent party in the XML data.

The r prefix before the string marks a raw string literal, in which backslashes are kept as literal characters instead of starting escape sequences. **(\d+)** is a capturing group that matches one or more digits. The capturing group **(\d{8})** matches exactly eight digits and is used to extract the date value. **(.*?)** is a non-greedy capturing group used to extract the text between an opening tag and the nearest closing tag.
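
As a small illustration of these three capturing groups, the sketch below applies them to a made-up fragment. The sample strings and values are assumptions chosen only to have the same shape as the tags described above; they are not taken from the data file.

```python
import re

# Hypothetical fragment with the same tag shapes as the real data
sample = (
    "<reel-no>2345</reel-no><frame-no>0678</frame-no>"
    "<last-update-date>20030519</last-update-date>"
    "<conveyance-text>ASSIGNS THE ENTIRE INTEREST</conveyance-text>"
)

reel = re.search(r'<reel-no>(\d+)</reel-no>', sample)                         # one or more digits
date = re.search(r'<last-update-date>(\d{8})</last-update-date>', sample)     # exactly 8 digits
text = re.search(r'<conveyance-text>(.*?)</conveyance-text>', sample)         # non-greedy text

print(reel.group(1), date.group(1), text.group(1))

# Non-greedy vs greedy when the same tag appears twice on one line
twice = "<state>A</state><state>B</state>"
lazy = re.search(r'<state>(.*?)</state>', twice).group(1)    # stops at the first closing tag
greedy = re.search(r'<state>(.*)</state>', twice).group(1)   # runs to the last closing tag
print(repr(lazy), repr(greedy))
```

The second half shows why the lazy quantifier matters: the greedy form swallows the intermediate tags.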

**reel_no.group(1) if reel_no else ''** is a conditional expression that checks whether reel_no is a match object rather than None. If a match exists, it extracts the first captured group using reel_no.group(1); otherwise it yields an empty string ''. **frame_no.group(1) if frame_no else ''** does the same for frame_no. The two expressions are placed side by side within the f-string to form the rf_id.
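
The guard described above can be tried in isolation. This is a minimal sketch; the wrapper name **build_rf_id** and the sample inputs are hypothetical and only serve to show what happens when one of the tags is missing.

```python
import re

def build_rf_id(assignment_data):
    # Hypothetical wrapper around the rf_id expression, for illustration only
    reel_no = re.search(r'<reel-no>(\d+)</reel-no>', assignment_data)
    frame_no = re.search(r'<frame-no>(\d+)</frame-no>', assignment_data)
    # re.search returns None when nothing matches, so each part falls back to ''
    return f"{reel_no.group(1) if reel_no else ''}{frame_no.group(1) if frame_no else ''}"

print(build_rf_id("<reel-no>2345</reel-no><frame-no>0678</frame-no>"))
print(repr(build_rf_id("<reel-no>2345</reel-no>")))   # frame number missing
print(repr(build_rf_id("no tags here")))              # both missing
```

Without the guard, calling .group(1) on None would raise an AttributeError.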

**last_update_date_value[:4]** extracts the first 4 characters (year) from last_update_date_value.
**last_update_date_value[4:6]** extracts characters at index 4 and 5 (month) from last_update_date_value.
**last_update_date_value[6:]** extracts characters from index 6 onwards (day) from last_update_date_value.
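
The slicing above can be checked on a single value. The sketch below also shows, as an optional alternative that the notebook does not use, how datetime.strptime would reject an impossible date that plain slicing passes through. Both function names are hypothetical.

```python
from datetime import datetime

def format_by_slicing(value):
    # Same slicing as described above: YYYY, MM, DD
    return f"{value[:4]}-{value[4:6]}-{value[6:]}"

def format_validated(value):
    # Alternative sketch: parse first, so invalid dates raise ValueError
    return datetime.strptime(value, "%Y%m%d").strftime("%Y-%m-%d")

print(format_by_slicing("20030519"))
print(format_validated("20030519"))
print(format_by_slicing("20031345"))   # slicing does not validate month or day
```

Slicing is enough when the regular expression has already guaranteed eight digits and the dates are trusted; parsing is only needed if validation matters.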

**re.findall()** extracts every assignor block from the assignment_data string. The **re.DOTALL** flag is passed as the third argument to re.findall(). It allows the dot (.) metacharacter to match newline characters as well. Without this flag, the dot matches any character except a newline, so a block that spans several lines would not be matched.
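
The effect of the flag is easy to see on a block that spans several lines. In this sketch the block content, including the name ACME WIDGETS, is made up for illustration.

```python
import re

# Hypothetical multi-line assignor block
block = (
    "<assignor>\n"
    "  <person-or-organization-name>ACME WIDGETS</person-or-organization-name>\n"
    "</assignor>"
)

without_flag = re.findall(r'<assignor>(.*?)</assignor>', block)
with_flag = re.findall(r'<assignor>(.*?)</assignor>', block, re.DOTALL)

print(len(without_flag))   # the dot stops at the first newline, so nothing matches
print(len(with_flag))      # the dot crosses newlines, so the block is captured
```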

**company_indicators** is a list of strings that are commonly found in company or organization names. The expression **indicator in party_name.lower() for indicator in company_indicators** is a generator expression passed to any(), which returns True if at least one indicator occurs as a substring of the lowercase version of party_name. Because this is a substring test, a short indicator such as 'co' also matches inside longer words. **re.sub()** is a function from the re module that replaces occurrences of a pattern with a replacement string; it is applied only when the name is not classified as a company.
The first argument is the regex pattern r'(?i)(Mr|Mrs|Miss|Ms|Mx|Sir|Dame|Dr|Cllr|Lady|Lord)\. '. **(?i)** is an inline flag that makes the pattern case-insensitive. The pattern matches any of the specified titles followed by a dot and a space, and the empty replacement string removes the match.
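
The following sketch exercises the company check and the title removal on made-up names. The whole-word variant **is_company_whole_word** is not what the notebook does; it is an assumption-labelled alternative included to show how the substring test and a word-boundary test can disagree. All function names and sample names are hypothetical.

```python
import re

company_indicators = ['company', 'corporation', 'incorporated', 'limited',
                      'ltd', 'llc', 'co', 'corp', 'inc']

def is_company_substring(name):
    # Substring test on the lowercase name, as described above
    return any(indicator in name.lower() for indicator in company_indicators)

def is_company_whole_word(name):
    # Alternative sketch (not the notebook's method): indicators must be whole words
    pattern = r'\b(?:' + '|'.join(company_indicators) + r')\b'
    return re.search(pattern, name, re.IGNORECASE) is not None

def strip_title(name):
    # Remove a title followed by a dot and a space, ignoring case
    return re.sub(r'(?i)(Mr|Mrs|Miss|Ms|Mx|Sir|Dame|Dr|Cllr|Lady|Lord)\. ', '', name)

for name in ["Dr. Nicole Scott", "ACME CO", "Acme Inc."]:
    print(name, is_company_substring(name), is_company_whole_word(name))

print(strip_title("Dr. Nicole Scott"))
```

"Dr. Nicole Scott" contains the letters "co", so the substring test classifies the name as a company and the title would be kept, while the whole-word test does not.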

In [4]:
# Function to process an assignment and extract relevant information
def process_assignment(assignment_data):
    """
    This function takes an assignment entry as input and extracts relevant information from it.
    It extracts the reel number, frame number, last update date, conveyance text, correspondent party,
    assignor information, assignee information, and property count.
    It returns a JSON object containing the extracted information.
    """
    # Extract reel number, frame number, last update date, conveyance text, and correspondent party
    reel_no = re.search(r'<reel-no>(\d+)</reel-no>', assignment_data)
    frame_no = re.search(r'<frame-no>(\d+)</frame-no>', assignment_data)
    last_update_date = re.search(r'<last-update-date>(\d{8})</last-update-date>', assignment_data)
    conveyance_text = re.search(r'<conveyance-text>(.*?)</conveyance-text>', assignment_data)
    correspondent_party = re.search(r'<person-or-organization-name>(.*?)</person-or-organization-name>', assignment_data)

    # Create the rf_id by combining reel number and frame number
    rf_id = f"{reel_no.group(1) if reel_no else ''}{frame_no.group(1) if frame_no else ''}"

    # Format the last update date
    last_update_date_formatted = ""
    if last_update_date:
        last_update_date_value = last_update_date.group(1)
        last_update_date_formatted = f"{last_update_date_value[:4]}-{last_update_date_value[4:6]}-{last_update_date_value[6:]}"

    # Extract conveyance text and correspondent party values
    conveyance_text_value = conveyance_text.group(1) if conveyance_text else ""
    correspondent_party_value = correspondent_party.group(1) if correspondent_party else ""

    # Extract assignor information
    assignors_info = re.findall(r'<assignor>(.*?)</assignor>', assignment_data, re.DOTALL)
    assignors = []

    for assignor in assignors_info:
        # Extract party name and remove titles
        party_name = re.search(r'<person-or-organization-name>(.*?)</person-or-organization-name>', assignor).group(1)
        # Check if the party name contains any company or organization indicators
        company_indicators = ['company', 'corporation', 'incorporated', 'limited', 'ltd', 'llc', 'Inc', 'co', 'corp', 'inc']
        is_company = any(indicator in party_name.lower() for indicator in company_indicators)

        if not is_company:
            party_name = re.sub(r'(?i)(Mr|Mrs|Miss|Ms|Mx|Sir|Dame|Dr|Cllr|Lady|Lord)\. ', '', party_name)

        # Extract date acknowledged and format it
        date_acknowledged_match = re.search(r'<ack-date>(\d{8})</ack-date>', assignor)
        date_acknowledged = date_acknowledged_match.group(1) if date_acknowledged_match else "NA"

        if date_acknowledged != "NA":
            date_acknowledged = f"{date_acknowledged[:4]}-{date_acknowledged[4:6]}-{date_acknowledged[6:]}"

        # Extract execution date and format it
        execution_date_match = re.search(r'<execution-date>(\d{8})</execution-date>', assignor)
        execution_date = execution_date_match.group(1) if execution_date_match else "NA"

        if execution_date != "NA":
            execution_date = f"{execution_date[:4]}-{execution_date[4:6]}-{execution_date[6:]}"

        # Get the country information for the assignor
        country = get_country(assignor)

        # Extract legal entity text
        legal_entity_text_match = re.search(r'<legal-entity-text>(.*?)</legal-entity-text>', assignor)
        legal_entity_text = legal_entity_text_match.group(1) if legal_entity_text_match else "NA"

        # Append the assignor information to the assignors list
        assignors.append({
            "party-name": party_name,
            "date-acknowledged": date_acknowledged,
            "execution-date": execution_date,
            "country": country,
            "legal-entity-text": legal_entity_text
        })

    # Extract assignee information
    assignees_info = re.findall(r'<assignee>(.*?)</assignee>', assignment_data, re.DOTALL)
    assignees = []

    for assignee in assignees_info:
        # Extract party name and remove titles
        party_name = re.search(r'<person-or-organization-name>(.*?)</person-or-organization-name>', assignee).group(1)
        # Check if the party name contains any company or organization indicators
        company_indicators = ['company', 'corporation', 'incorporated', 'limited', 'ltd', 'llc', 'Inc', 'co', 'corp', 'inc']
        is_company = any(indicator in party_name.lower() for indicator in company_indicators)

        if not is_company:
            party_name = re.sub(r'(?i)(Mr|Mrs|Miss|Ms|Mx|Sir|Dame|Dr|Cllr|Lady|Lord)\. ', '', party_name)
        # Get the country information for the assignee
        country = get_country(assignee)

        # Extract legal entity text
        legal_entity_text_match = re.search(r'<legal-entity-text>(.*?)</legal-entity-text>', assignee)
        legal_entity_text = legal_entity_text_match.group(1) if legal_entity_text_match else "NA"

        # Append the assignee information to the assignees list
        assignees.append({
            "party-name": party_name,
            "country": country,
            "legal-entity-text": legal_entity_text
        })

    # Count the number of properties in the assignment
    properties = re.findall(r'<property>(.*?)</property>', assignment_data, re.DOTALL)
    property_count = len(properties)

    # Create the JSON output for the assignment
    json_output = {
        rf_id: {
            "last-update-date": last_update_date_formatted,
            "conveyance-text": conveyance_text_value,
            "correspondent_party": correspondent_party_value,
            "assignors_info": assignors,
            "assignees_info": assignees,
            "property-count": str(property_count)
        }
    }

    return json_output

**Explanation:**
The code first uses re.findall() with re.DOTALL to split xml_data into individual assignment-entry blocks and calls process_assignment on each one. It then combines the individual outputs into a single dictionary called **final_json_output**. The **update()** method merges the key-value pairs from each output into the final_json_output dictionary; if two assignments produce the same rf_id, the later entry replaces the earlier one. The final_json_output dictionary is written to the JSON file specified by output_file using the **json.dump()** function. The **indent=4** parameter pretty-prints the JSON with four-space indentation for readability.
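
The merge and the indentation can be previewed without writing a file. In this sketch the rf_id keys and values are invented, and json.dumps() is used in place of json.dump() so that the result is a string; both accept the same indent parameter.

```python
import json

# Hypothetical per-assignment outputs; the first and third share an rf_id
outputs = [
    {"23450678": {"property-count": "1"}},
    {"23450912": {"property-count": "3"}},
    {"23450678": {"property-count": "2"}},
]

merged = {}
for output in outputs:
    merged.update(output)   # a repeated key overwrites the earlier value

text = json.dumps(merged, indent=4)
print(text)
```

Three inputs yield two keys here because update() keeps only the last value seen for a repeated key.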

In [5]:
# Join the XML lines read earlier into a single string
xml_data = ''.join(xml_lines)

# Extract individual assignments using regular expressions
assignments = re.findall(r'<assignment-entry>(.*?)</assignment-entry>', xml_data, re.DOTALL)

# Process each assignment and generate JSON output
json_outputs = []
for assignment in assignments:
    json_output = process_assignment(assignment)
    json_outputs.append(json_output)

# Combine the JSON outputs into a single dictionary
final_json_output = {}
for json_output in json_outputs:
    final_json_output.update(json_output)

# Write the final JSON output to a file
output_file = "task1_65.json"
with open(output_file, "w") as file:
    json.dump(final_json_output, file, indent=4)

print(f"JSON output written to {output_file}")

JSON output written to task1_65.json
