# Script for XML to txt conversion

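This script converts XML files to txt files. The XML files are read from the `resources/<file_dir>/xml` folder and the txt files are written to the `resources/<file_dir>/input` folder, where `<file_dir>` is the directory name set in the first code cell. Each txt file is named after the folder that contains its source XML files, not after the individual XML files.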

### XML partitioning

XML files are organized in logical groups, one folder per group: when the XML files are converted, each txt file contains the aggregated text of all the XML files in the same group.
This produces larger txt files that can be used for testing the Workflow.
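The grouping rule can be sketched on its own, without touching the file system. This is a minimal illustration, assuming that a group is identified by the name of the folder that directly contains the XML file; the helper `group_by_parent_dir` and the example paths are hypothetical and are not part of the script.

```python
import os
from collections import defaultdict


def group_by_parent_dir(xml_paths, txt_dir):
    """Map each output txt path to the list of XML files that feed it."""
    groups = defaultdict(list)
    for path in xml_paths:
        # Ignore anything that is not an XML file
        if not path.endswith('.xml'):
            continue
        # The group name is the name of the folder directly containing the file
        group_name = os.path.basename(os.path.dirname(path))
        groups[os.path.join(txt_dir, group_name + '.txt')].append(path)
    return dict(groups)


# Hypothetical layout, for illustration only
example_paths = [
    os.path.join('xml', 'group-a', 'first.xml'),
    os.path.join('xml', 'group-a', 'second.xml'),
    os.path.join('xml', 'group-b', 'third.xml'),
]
grouping = group_by_parent_dir(example_paths, 'input')
for txt_path, sources in grouping.items():
    print(txt_path, '<-', [os.path.basename(s) for s in sources])
```

In this sketch the two files under `group-a` end up in a single `group-a.txt`, while `group-b` gets its own output file.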

### Code

In [6]:
# Adding imports
import os
import re
import shutil
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import iterparse

# Specify the directory where XML files and input files are stored
file_dir = 'performance'
print('Directory:', file_dir)

Directory: performance


In [7]:
# Path of the directory containing the XML files
xml_dir = os.path.join('..','resources', file_dir, 'xml')
# Path of the directory where to save the text files
txt_dir = os.path.join('..','resources', file_dir, 'input')

print(f'XML directory: {xml_dir}')
print(f'Text directory: {txt_dir}')

XML directory: ../resources/performance/xml
Text directory: ../resources/performance/input


In [8]:
# List of output files we are going to overwrite
files = []
for dirpath, dirnames, filenames in os.walk(xml_dir):
    for filename in filenames:
        # Check if the file has .xml extension
        if filename.endswith('.xml'):
            # Create the name of the output file by using the parent directory name
            output_filename = os.path.join(txt_dir, os.path.basename(dirpath) + '.txt')
            # Append the file to the list
            files.append(output_filename)

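# Make sure the output directory exists before any file is written to it
os.makedirs(txt_dir, exist_ok=True)

# Clean the txt_dir directory (only the files listed above, if they already exist).
# Output files are later opened in append mode, so stale copies must be removed first.
for file in files:
    if os.path.exists(file):
        os.remove(file)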


In [9]:
# Function to extract the text from the XML file
def extract_text_from_xml(xml_file, output_file):
    namespace = '{http://www.mediawiki.org/xml/export-0.10/}'

    # Open the output file
    with open(output_file, 'a', encoding='utf-8') as f:
        # Use iterparse to create an iterator over the XML file
        for event, elem in iterparse(xml_file, events=('end',)):
            # Check if the element is a page
            if elem.tag == namespace + 'page':
                text_elements = elem.findall('.//' + namespace + 'revision/' + namespace + 'text')
                for text_element in text_elements:
                    if text_element is not None and text_element.text is not None:
                        text = text_element.text
                        # Remove <div ...>...</div> blocks (non-greedy match, spanning newlines)
                        text = re.sub(r'<div.*?>.*?</div>', '', text, flags=re.DOTALL)
                        # Write the text directly to the output file
                        f.write(text + '\n\n')
                # Clear the page element from memory once we're done with it
                elem.clear()
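
To check the extraction logic without a full dump, the same steps can be run on a small in-memory document. This is a sketch under stated assumptions: the sample XML is made up for illustration, and `page_texts` is a hypothetical helper that returns the cleaned texts as a list instead of appending them to a file.

```python
import io
import re
from xml.etree.ElementTree import iterparse

NAMESPACE = '{http://www.mediawiki.org/xml/export-0.10/}'

# Made-up sample in the MediaWiki export format, for illustration only
SAMPLE = b'''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision>
      <text>Kept text.&lt;div class="note"&gt;dropped&lt;/div&gt; More kept text.</text>
    </revision>
  </page>
</mediawiki>'''


def page_texts(source):
    """Return the cleaned revision text of every page found in source."""
    texts = []
    # 'end' events fire once an element has been read completely
    for event, elem in iterparse(source, events=('end',)):
        if elem.tag == NAMESPACE + 'page':
            path = './/' + NAMESPACE + 'revision/' + NAMESPACE + 'text'
            for text_element in elem.findall(path):
                if text_element.text is not None:
                    # Same substitution as in extract_text_from_xml
                    cleaned = re.sub(r'<div.*?>.*?</div>', '',
                                     text_element.text, flags=re.DOTALL)
                    texts.append(cleaned)
            # Release the page content once it has been processed
            elem.clear()
    return texts


result = page_texts(io.BytesIO(SAMPLE))
print(result)
```

Because the pattern is non-greedy, a match ends at the first `</div>` it meets, so nested `<div>` blocks are only partially removed.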


In [10]:
# Walk through the xml_dir directory
for dirpath, dirnames, filenames in os.walk(xml_dir):
    for filename in filenames:
        # Check if the file has .xml extension
        if filename.endswith('.xml'):
            xml_file = os.path.join(dirpath, filename)
            # Create the name of the output file by using the parent directory name
            output_file = os.path.join(txt_dir, os.path.basename(dirpath) + '.txt')
            # Extract the text from the XML file and append it to the output file
            extract_text_from_xml(xml_file, output_file)

            print(f'The content of {filename} has been appended to {output_file}')

The content of ita-XXI-ext.xml has been appended to ../resources/performance/input/performance-GB.txt
The content of ita-XX-ext.xml has been appended to ../resources/performance/input/performance-GB.txt
The content of ita-XXI.xml has been appended to ../resources/performance/input/performance-MB.txt
The content of ita-XX.xml has been appended to ../resources/performance/input/performance-MB.txt
