<a href="https://colab.research.google.com/github/AnastasiiaLavre/AI4Gov2023/blob/main/5_TEI_XML_Enrichment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we enrich the TEI XML files generated by Grobid by injecting additional metadata.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install tika

Collecting tika
  Downloading tika-2.6.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-2.6.0-py3-none-any.whl size=32621 sha256=db3934bd3cf5124cb80736d919d047754da9666c43486c31a526d471bdc51f86
  Stored in directory: /root/.cache/pip/wheels/5f/71/c7/b757709531121b1700cffda5b6b0d4aad095fb507ec84316d0
Successfully built tika
Installing collected packages: tika
Successfully installed tika-2.6.0


In [3]:
# Install Java
!apt-get install -y openjdk-8-jre-headless

# Set JAVA_HOME environment variable
import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'

# Install Apache Tika Python library
!pip install tika

# Import necessary modules
from tika import parser
from tika import detector

# Test Apache Tika
try:
    parsed = parser.from_buffer("Test Apache Tika.")
    detected_type = detector.from_buffer("Test Apache Tika.")
    print("Detected MIME type:", detected_type)
    print("Extracted text:", parsed['content'])
except Exception as e:
    print("An error occurred while testing Apache Tika:", str(e))


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libxtst6
Suggested packages:
  libnss-mdns fonts-dejavu-extra fonts-nanum fonts-ipafont-gothic
  fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic
The following NEW packages will be installed:
  libxtst6 openjdk-8-jre-headless
0 upgraded, 2 newly installed, 0 to remove and 16 not upgraded.
Need to get 30.8 MB of archives.
After this operation, 104 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libxtst6 amd64 2:1.2.3-1build4 [13.4 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 openjdk-8-jre-headless amd64 8u382-ga-1~22.04.1 [30.8 MB]
Fetched 30.8 MB in 1s (47.5 MB/s)
Selecting previously unselected package libxtst6:amd64.
(Reading database ... 120901 files and directories currently installed.)
Preparing to unpack .../libxtst6_2%3a1.2.3-1build4_

2023-09-14 11:23:25,764 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
INFO:tika.tika:Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar to /tmp/tika-server.jar.
2023-09-14 11:23:26,332 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
INFO:tika.tika:Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/2.6.0/tika-server-standard-2.6.0.jar.md5 to /tmp/tika-server.jar.md5.
2023-09-14 11:23:26,617 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


Detected MIME type: text/plain
Extracted text: 








Test Apache Tika.



Adding <date> and <objectName> elements to all TEI XMLs if they do not already exist there.

In [7]:
#add <date> and <objectName> elements to all tei.xml
import os
import xml.etree.ElementTree as ET

# Set the path to the directory containing TEI XML files
tei_xml_dir = '/content/drive/MyDrive/Stanford/TEI XML'

# Set the path to the directory where updated TEI XML files will be saved
output_dir = '/content/drive/MyDrive/Stanford/TEI XML/Enriched'


# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Iterate through TEI XML files in the directory
for filename in os.listdir(tei_xml_dir):
    if filename.endswith('.tei.xml'):
        # Create the full path to the TEI XML file
        tei_xml_file = os.path.join(tei_xml_dir, filename)

        # Read the content of the TEI XML file
        tree = ET.parse(tei_xml_file)
        root = tree.getroot()

        # Find the <publicationStmt> element within the TEI XML
        namespace = {'tei': 'http://www.tei-c.org/ns/1.0'}  # Use the TEI XML namespace if available
        publication_stmt_element = root.find('.//tei:publicationStmt', namespaces=namespace)

        if publication_stmt_element is None:
            print(f"Error: <publicationStmt> element not found in {tei_xml_file}.")
        else:
            # Create the <date> element within <publicationStmt> if it is missing
            if publication_stmt_element.find('date') is None:
                publication_stmt_element.append(ET.Element('date'))

            # Create the <objectName> element within <publicationStmt> if it is missing
            if publication_stmt_element.find('objectName') is None:
                publication_stmt_element.append(ET.Element('objectName'))

            # Save the updated TEI XML to a new file in the output directory
            new_teixml_file = os.path.join(output_dir, filename)
            tree.write(new_teixml_file, encoding='utf-8', xml_declaration=True)

            print(f"Data updated and saved to {new_teixml_file}.")

Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.008_1_updated.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.008_1_enriched.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_27.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_6.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_24.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_32.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.006_2.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_26.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.006_1.tei.xml.
Data updated and saved to /c
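
The same step can be tried without Drive access on an in-memory document. The following is a minimal sketch, assuming a heavily reduced TEI header; the helper name `ensure_child` and the sample XML are illustrative and not part of the notebook. It adds un-namespaced `<date>` and `<objectName>` elements to `<publicationStmt>` only when they are missing, so running it twice does not create duplicates.

```python
import xml.etree.ElementTree as ET

TEI_NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Hypothetical, heavily reduced TEI document used only for illustration
SAMPLE_TEI = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
    '<titleStmt><title>Sample</title></titleStmt>'
    '<publicationStmt><publisher/></publicationStmt>'
    '</fileDesc></teiHeader></TEI>'
)


def ensure_child(parent, tag):
    """Return the direct child `tag` of `parent`, creating it if absent."""
    child = parent.find(tag)
    if child is None:
        child = ET.SubElement(parent, tag)
    return child


root = ET.fromstring(SAMPLE_TEI)
publication_stmt = root.find('.//tei:publicationStmt', namespaces=TEI_NS)

# Calling the helper twice shows that no duplicates are created
for _ in range(2):
    ensure_child(publication_stmt, 'date')
    ensure_child(publication_stmt, 'objectName')

child_tags = [child.tag for child in publication_stmt]
print(child_tags)
```

The existing `<publisher>` child keeps its TEI namespace, while the two new children have plain tag names, which matches how the later cells look them up.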

Injecting required values for dates and symbols from CSV to all TEI XMLs

In [8]:
#insert values from csv
import os
import csv
import xml.etree.ElementTree as ET

# Set the path to the directory containing TEI XML files
tei_xml_dir = '/content/drive/MyDrive/Stanford/TEI XML/Enriched'

# Set the path to the CSV file
csv_file = '/content/drive/MyDrive/Stanford/TEXT 1 page/formatted_documents/document_data.csv'

# Read the CSV file once and store the data in a dictionary keyed by filename
csv_data = {}
with open(csv_file, 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader)  # Skip the header row
    for row in reader:
        filename_csv = row[0]
        dates_text = row[1]
        symbols_text = row[2]
        csv_data[filename_csv] = {'dates': dates_text, 'symbols': symbols_text}

# Read the content of each TEI XML file in the directory
for filename in os.listdir(tei_xml_dir):
    if filename.endswith('.tei.xml'):
        # Create the full path to the TEI XML file
        tei_xml_file = os.path.join(tei_xml_dir, filename)

        # Read the content of the specific TEI XML file
        with open(tei_xml_file, 'r', encoding='utf-8') as tei_file:
            tei_xml_content = tei_file.read()

        # Parse the TEI XML content
        root = ET.fromstring(tei_xml_content)

        # Get the filename of the current TEI XML file without the extension
        filename_without_extension = os.path.splitext(filename)[0].replace(".tei", "")

        # Check if the filename exists in the CSV data
        if filename_without_extension in csv_data:
            dates_text = csv_data[filename_without_extension]['dates']
            symbols_text = csv_data[filename_without_extension]['symbols']

            # Find or create the <date> element within the TEI XML
            date_element = root.find('.//date')
            if date_element is None:
                date_element = ET.Element('date')
                root.insert(1, date_element)  # Insert as the second child of the root
            date_element.text = dates_text

            # Find or create the <objectName> element within the TEI XML
            object_name_element = root.find('.//objectName')
            if object_name_element is None:
                object_name_element = ET.Element('objectName')
                root.insert(2, object_name_element)  # Insert as the third child of the root
            object_name_element.text = symbols_text

            # Save the updated TEI XML to a new file
            new_teixml_file = os.path.join(tei_xml_dir, filename.replace("_updated.tei", "_enriched.tei"))
            tree = ET.ElementTree(root)
            tree.write(new_teixml_file, encoding='utf-8')

            print(f"Data updated and saved to {new_teixml_file}.")
        else:
            print(f"Error: Filename '{filename_without_extension}' not found in the CSV data.")


Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_27.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_24.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.006_2.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.006_3.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_5.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_32.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.006_1.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_33.tei.xml.
Data updated and saved to /content/drive/MyDrive/Stanford/TEI XML/Enriched/40001521.002 0017 - 0334_6.tei.xml.
Data updated and saved to /conten
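
The matching in this step depends on turning a TEI filename into the key stored in the first CSV column. Below is a small sketch of that lookup, assuming the keys are the filenames without the `.tei.xml` suffix; the helper `tei_base_name` and the sample dictionary are invented for illustration. It strips only a trailing `.tei.xml` instead of replacing every occurrence of `.tei` in the name.

```python
TEI_SUFFIX = '.tei.xml'


def tei_base_name(filename):
    """Strip a trailing '.tei.xml' from `filename`; leave other names unchanged."""
    if filename.endswith(TEI_SUFFIX):
        return filename[:-len(TEI_SUFFIX)]
    return filename


# Invented stand-in for the dictionary built from the CSV file
csv_data = {
    '40001521.006_2': {'dates': '', 'symbols': 'CG.18/1, CG.18/1, F/28'},
}

for name in ['40001521.006_2.tei.xml', '40001521.006_9.tei.xml']:
    key = tei_base_name(name)
    record = csv_data.get(key)
    if record is None:
        print(f"No CSV row for '{key}'")
    else:
        print(f"{key}: dates={record['dates']!r}, symbols={record['symbols']!r}")
```

Using `dict.get` keeps the "not found" case explicit without a separate membership test.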

This cell checks which titles, dates and symbols are now present in the TEI XMLs.

In [9]:
import os
import xml.etree.ElementTree as ET

# Define the directory containing TEI XML files
tei_dir = '/content/drive/MyDrive/Stanford/TEI XML/Enriched'

# Iterate through TEI XML files in the directory
for filename in os.listdir(tei_dir):
    if filename.endswith('.xml'):
        tei_xml_path = os.path.join(tei_dir, filename)

        # Parse the TEI XML file
        tree = ET.parse(tei_xml_path)
        root = tree.getroot()

        # Extract information from the TEI XML within the publicationStmt element
        title_element = root.find('.//ns0:titleStmt/ns0:title', namespaces={'ns0': 'http://www.tei-c.org/ns/1.0'})
        date_element = root.find('.//ns0:publicationStmt/date', namespaces={'ns0': 'http://www.tei-c.org/ns/1.0'})
        object_name_element = root.find('.//ns0:publicationStmt/objectName', namespaces={'ns0': 'http://www.tei-c.org/ns/1.0'})

        # Get the text content of the elements (if they exist)
        title = title_element.text if title_element is not None else 'N/A'
        date = date_element.text if date_element is not None else 'N/A'
        object_name = object_name_element.text if object_name_element is not None else 'N/A'

        # Print the results
        print(f"Filename: {filename}")
        print(f"Title: {title}")
        print(f"Date: {date}")
        print(f"ObjectName: {object_name}")
        print("=" * 40)  # Separate entries for better readability

Filename: 40001521.002 0017 - 0334_27.tei.xml
Title: NOTE FOR PARTICIPANTS IN THE CONSULTATIVE GROUP OF EIGHTEEN
Date: 4 March 1980
ObjectName: None
Filename: 40001521.002 0017 - 0334_24.tei.xml
Title: None
Date: 31 October 1980, 31 October 1980
ObjectName: CG.18/W/47, CG.18/INF/13, W/47, INF/13, GATT/1271
Filename: 40001521.006_2.tei.xml
Title: None
Date: None
ObjectName: CG.18/1, CG.18/1, F/28
Filename: 40001521.006_3.tei.xml
Title: ET LE COMMERCE Special Distribution
Date: 28 February 1985, 1 March 1985, 1 mars 1985
ObjectName: NF/27
Filename: 40001521.002 0017 - 0334_5.tei.xml
Title: None
Date: 21 August 1984, 6 July 1984, 6 July 1984
ObjectName: CG.18/W/82, CG.18/INF/25, W/82, INF/25
Filename: 40001521.002 0017 - 0334_32.tei.xml
Title: NOTE ON THE TENTH MEETING OF THE CONSULTATIVE GROUP OF EIGHTEEN
Date: 23 October 1979, 23 October 1979, 23 October 1979
ObjectName: CG.18/W/32, CG.16/INF/10, W/32, INF/10
Filename: 40001521.006_1.tei.xml
Title: None
Date: 9 July 1985, 9 juillet 1985
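
The check above mixes namespaced TEI elements (`ns0:publicationStmt`) with un-namespaced `date` and `objectName`, because the earlier cells create those two with plain tag names. The sketch below, built on an invented sample document, illustrates two points under that assumption: the prefix in the `namespaces` mapping is local to the query, so any prefix works as long as the URI matches, and an element that exists but is empty yields `None` as its text rather than triggering a fallback for a missing element.

```python
import xml.etree.ElementTree as ET

TEI_URI = 'http://www.tei-c.org/ns/1.0'

# Invented sample: <date> and <objectName> carry no prefix, so they are in
# no namespace, unlike the surrounding ns0-prefixed TEI elements
SAMPLE_TEI = (
    f'<ns0:TEI xmlns:ns0="{TEI_URI}"><ns0:teiHeader><ns0:fileDesc>'
    '<ns0:titleStmt><ns0:title>Sample title</ns0:title></ns0:titleStmt>'
    '<ns0:publicationStmt><date>4 March 1980</date><objectName />'
    '</ns0:publicationStmt>'
    '</ns0:fileDesc></ns0:teiHeader></ns0:TEI>'
)

root = ET.fromstring(SAMPLE_TEI)

# The prefix used in the query is arbitrary; only the URI has to match
ns = {'t': TEI_URI}
title = root.find('.//t:titleStmt/t:title', namespaces=ns)
date = root.find('.//t:publicationStmt/date', namespaces=ns)
object_name = root.find('.//t:publicationStmt/objectName', namespaces=ns)
namespaced_date = root.find('.//t:publicationStmt/t:date', namespaces=ns)

print(title.text)
print(date.text)
print(object_name.text)   # the element exists but has no text
print(namespaced_date)    # no <date> exists in the TEI namespace
```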

Option 2. Enriching TEI XMLs with BeautifulSoup

In [10]:
!pip install beautifulsoup4



Merging two CSV files: one with titles, the other with dates and symbols

In [11]:
import csv
import os

# Define the paths to the input CSV files and the output CSV file
input_file1 = '/content/drive/MyDrive/Stanford/PDF 1 page/pdf_titles.csv'
input_file2 = '/content/drive/MyDrive/Stanford/TEXT 1 page/formatted_documents/document_data.csv'
output_file = '/content/drive/MyDrive/Stanford/merged_output_final_test.csv'

# Function to extract filenames without extensions
def extract_filename_without_extension(file_path):
    return os.path.splitext(os.path.basename(file_path))[0]

# Read data from the first input CSV file
data1 = {}
with open(input_file1, 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        filename = extract_filename_without_extension(row['Filename'])
        data1[filename] = row

# Read data from the second input CSV file
data2 = {}
with open(input_file2, 'r', newline='') as csv_file:
    reader = csv.DictReader(csv_file)
    for row in reader:
        filename = row['Document Name']
        data2[filename] = row

# Merge the data from both files based on the common column ('filename')
merged_data = []
for filename, row1 in data1.items():
    if filename in data2:
        row2 = data2[filename]
        row1.update(row2)  # Merge the rows from both files
    merged_data.append(row1)

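# Get a list of all column names, keeping first-seen order
# (a set would make the column order vary between runs)
column_names = []
for row in merged_data:
    for name in row.keys():
        if name not in column_names:
            column_names.append(name)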

# Write the merged data to the output CSV file
with open(output_file, 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=column_names)
    writer.writeheader()  # Write the header row with all columns
    writer.writerows(merged_data)

print(f'Merged data saved to {output_file}')


Merged data saved to /content/drive/MyDrive/Stanford/merged_output_final_test.csv
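
The merge can be reproduced on small in-memory tables. This is a minimal sketch under stated assumptions: only the column names `Filename` and `Document Name` come from the notebook, while `Title`, `Dates`, `Symbols` and all row values are invented. It keeps every row of the titles table, adds matching data where it exists, and writes the columns in a stable, first-seen order.

```python
import csv
import io
import os

# Invented stand-ins for the two input files
titles_csv = "Filename,Title\ndoc_1.pdf,First title\ndoc_2.pdf,Second title\n"
data_csv = "Document Name,Dates,Symbols\ndoc_1,4 March 1980,W/47\n"

# Key the titles table by filename without directory or extension
titles = {}
for row in csv.DictReader(io.StringIO(titles_csv)):
    key = os.path.splitext(os.path.basename(row['Filename']))[0]
    titles[key] = row

extra = {row['Document Name']: row for row in csv.DictReader(io.StringIO(data_csv))}

# Left join: keep every title row and add matching data where it exists
merged = [{**row, **extra.get(key, {})} for key, row in titles.items()]

# dict.fromkeys keeps first-seen order, unlike a set
fieldnames = list(dict.fromkeys(name for row in merged for name in row))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(merged)  # missing columns are written as empty strings
print(out.getvalue())
```

A stable column order matters when a later step reads the merged file by column position.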


Adding elements to TEI XML

In [15]:
import os

from bs4 import BeautifulSoup

# Set the paths for the TEI XML directory, CSV file, and output directory
tei_xml_directory = '/content/drive/MyDrive/Stanford/TEI XML v2'
#csv_file = '/content/drive/MyDrive/Stanford/merged_output.csv'
#output_directory = '/content/drive/MyDrive/Stanford/TEI XML v2/Enriched'


# Function to update TEI XML files
def update_tei_xml(tei_xml_path):
    # Read the TEI XML file
    with open(tei_xml_path, 'r', encoding='utf-8') as tei_file:
        tei_content = tei_file.read()

    # Parse the TEI XML content using BeautifulSoup
    soup = BeautifulSoup(tei_content, 'xml')

    # Check if <publicationStmt> exists, and if not, create it inside <fileDesc>
    publication_stmt = soup.find('publicationStmt')
    if publication_stmt is None:
        file_desc = soup.find('fileDesc')
        if file_desc is None:
            # Nothing to attach <publicationStmt> to; leave the file unchanged
            print(f"Skipped (no <fileDesc>): {tei_xml_path}")
            return
        publication_stmt = soup.new_tag('publicationStmt')
        file_desc.append(publication_stmt)

    # Check if <date> element exists, and if not, create it
    date_element = publication_stmt.find('date')
    if date_element is None:
        date_element = soup.new_tag('date')
        date_element.string = 'Your Date Value Here'
        publication_stmt.append(date_element)

    # Check if <objectName> element exists, and if not, create it
    object_name_element = publication_stmt.find('objectName')
    if object_name_element is None:
        object_name_element = soup.new_tag('objectName')
        object_name_element.string = 'Your ObjectName Value Here'
        publication_stmt.append(object_name_element)

    # Save the updated TEI XML to the same file
    with open(tei_xml_path, 'w', encoding='utf-8') as updated_tei_file:
        updated_tei_file.write(str(soup))

# Iterate through TEI XML files in the directory
for filename in os.listdir(tei_xml_directory):
    if filename.endswith('tei.xml'):
        tei_xml_path = os.path.join(tei_xml_directory, filename)
        update_tei_xml(tei_xml_path)
        print(f"Updated TEI XML file: {filename}")

print("Processing complete.")



Updated TEI XML file: 40001521.002 0017 - 0334_33.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_6.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_26.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_27.tei.xml
Updated TEI XML file: 40001521.006_3.tei.xml
Updated TEI XML file: 40001521.006_2.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_32.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_5.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_7.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_18.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_30.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_25.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_19.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_24.tei.xml
Updated TEI XML file: 40001521.006_1.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_35.tei.xml
Updated TEI XML file: 40001521.002 0017 - 0334_31.tei.xml
Updated TEI XML file: 40001521.006_5.tei.xml
Updated TEI

Enriching TEI XML with the metadata values

In [16]:
# Set the paths for the TEI XML directory, CSV file, and output directory

tei_xml_directory = '/content/drive/MyDrive/Stanford/TEI XML v2'
csv_file = '/content/drive/MyDrive/Stanford/merged_output.csv'
output_directory = '/content/drive/MyDrive/Stanford/TEI XML v2/Enriched'

# Create the output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Read the CSV file and store the data in a dictionary
csv_data = {}
with open(csv_file, 'r', newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader)  # Read the header row
    for row in reader:
        filename = row[3]
        dates = row[4]
        symbols = row[1]
        csv_data[filename] = {'dates': dates, 'symbols': symbols}

# Process each TEI XML file in the directory
for filename in os.listdir(tei_xml_directory):
    if filename.endswith('.tei.xml'):
        tei_xml_path = os.path.join(tei_xml_directory, filename)
        base_name = os.path.splitext(filename)[0].replace('.tei',"")  # Remove extension from TEI XML filename

        # Check if the base filename exists in the CSV data
        if base_name in csv_data:
            dates = csv_data[base_name]['dates']
            symbols = csv_data[base_name]['symbols']

            # Read the TEI XML file
            with open(tei_xml_path, 'r', encoding='utf-8') as tei_file:
                tei_content = tei_file.read()

            # Parse the TEI XML content using BeautifulSoup's 'xml' parser (backed by lxml)
            soup = BeautifulSoup(tei_content, 'xml')

            # Locate <publicationStmt>; the metadata elements belong inside it
            publication_stmt = soup.find('publicationStmt')
            if publication_stmt is None:
                print(f"Skipped (no <publicationStmt>): {filename}")
                continue

            # Find or create the <date> element and set its text
            date_element = publication_stmt.find('date')
            if date_element is None:
                date_element = soup.new_tag('date')
                publication_stmt.insert(0, date_element)
            date_element.string = dates

            # Find or create the <objectName> element and set its text
            object_name_element = publication_stmt.find('objectName')
            if object_name_element is None:
                object_name_element = soup.new_tag('objectName')
                publication_stmt.insert(1, object_name_element)
            object_name_element.string = symbols

            # Save the updated TEI XML to the output directory
            output_xml_path = os.path.join(output_directory, filename)
            with open(output_xml_path, 'w', encoding='utf-8') as updated_tei_file:
                updated_tei_file.write(str(soup))

            print(f"Updated TEI XML file: {filename}")

            # Print filename, dates, and symbols
            print(f"Filename: {filename}")
            print(f"Dates: {dates}")
            print(f"Symbols: {symbols}")

print("Processing complete.")

Updated TEI XML file: 40001521.002 0017 - 0334_33.tei.xml
Filename: 40001521.002 0017 - 0334_33.tei.xml
Dates: 13 October 1978
Symbols: 
Updated TEI XML file: 40001521.002 0017 - 0334_6.tei.xml
Filename: 40001521.002 0017 - 0334_6.tei.xml
Dates: 6 April 1984
Symbols: CG.18/W/78, CG.18/INF/24, CG.18/W/79, W/78, INF/24, W/79
Updated TEI XML file: 40001521.002 0017 - 0334_26.tei.xml
Filename: 40001521.002 0017 - 0334_26.tei.xml
Dates: 15 July 1980, 15 JULY 1980, 15 July 1980
Symbols: CG.18/W/40, CG.18/INF/12, W/40, INF/12
Updated TEI XML file: 40001521.002 0017 - 0334_27.tei.xml
Filename: 40001521.002 0017 - 0334_27.tei.xml
Dates: 4 March 1980
Symbols: 
Updated TEI XML file: 40001521.006_3.tei.xml
Filename: 40001521.006_3.tei.xml
Dates: 28 February 1985, 1 March 1985, 1 mars 1985
Symbols: NF/27
Updated TEI XML file: 40001521.006_2.tei.xml
Filename: 40001521.006_2.tei.xml
Dates: 
Symbols: CG.18/1, CG.18/1, F/28
Updated TEI XML file: 40001521.002 0017 - 0334_32.tei.xml
Filename: 40001521.00
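
The cell above picks CSV columns by position (`row[3]`, `row[4]`, `row[1]`), which breaks silently if the column order of the merged file changes. A possible alternative is to look the columns up by header name with `csv.DictReader`. The sketch below assumes the headers are `Document Name`, `Dates` and `Symbols`; these names and the sample row are hypothetical, and the real headers in `merged_output.csv` may differ.

```python
import csv
import io

# Invented sample; the real column names in merged_output.csv may differ
merged_csv = (
    "Filename,Symbols,Title,Document Name,Dates\n"
    'doc_1.pdf,"W/78, INF/24",Some title,doc_1,6 April 1984\n'
)

# Build the lookup by header name, so column order no longer matters
csv_data = {}
for row in csv.DictReader(io.StringIO(merged_csv)):
    csv_data[row['Document Name']] = {
        'dates': row['Dates'],
        'symbols': row['Symbols'],
    }

print(csv_data['doc_1'])
```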