## Notebook to extract the annotated sound event spans from the XML files

The following code runs only if the TEI XML files are well-formed, that is, if their structure contains no errors.
If parsing fails, you can check the XML with the following online validator: https://jsonformatter.org/xml-validator (JSON Formatter 2020)
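As a local alternative to the website, well-formedness can also be checked directly in Python. The following is a minimal sketch, not part of the original notebook: the helper name `check_well_formed` and the two sample strings are hypothetical, and the check only tests whether the XML parses, not whether it conforms to a TEI schema.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return False, str(err)
    return True, None

# Hypothetical sample strings: the second one lacks a closing tag.
good = "<text><ambient_sound>Der Wind weht stark</ambient_sound></text>"
bad = "<text><ambient_sound>Der Wind weht stark</text>"

print(check_well_formed(good))
print(check_well_formed(bad))
```

Running the same check over each file before extraction would show which file and which position needs to be corrected.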

In [23]:
import xml.etree.ElementTree as ET

In [8]:
folder_path = '/Users/sguhr/Desktop/Diss_notebooks/Diss_data_notebooks_man_anno/Manually_annotated_data/20240329_Testing'

In [9]:
# This cell extracts the sound event spans and stores them in two separate lists, one for ambient sounds and one for character sounds.

import os
import xml.etree.ElementTree as ET

# extract_sound_spans parses one XML document, iterates over all of its elements,
# and collects the text of every ambient_sound and character_sound element in two
# separate lists (list.append). The helper functions further down gather these
# lists per XML file (list.extend). Tags are matched with endswith() so that
# namespace-qualified tags are found as well.

def extract_sound_spans(xml_content):
    ambient_sound_spans = []
    character_sound_spans = []
    root = ET.fromstring(xml_content)

    for elem in root.iter():
        # elem.text is None when an element is empty or starts with a child
        # element, so fall back to an empty string before stripping.
        text = (elem.text or '').strip()
        if elem.tag.endswith('ambient_sound'):
            ambient_sound_spans.append(text)
        elif elem.tag.endswith('character_sound'):
            character_sound_spans.append(text)
    return ambient_sound_spans, character_sound_spans

def process_xml_file(filepath):
    ambient_sound_spans_list = []
    character_sound_spans_list = []
    with open(filepath, 'r', encoding='utf-8') as file:
        xml_content = file.read()
        ambient_sound_spans, character_sound_spans = extract_sound_spans(xml_content)
        ambient_sound_spans_list.extend(ambient_sound_spans)
        character_sound_spans_list.extend(character_sound_spans)
    return ambient_sound_spans_list, character_sound_spans_list

def process_folder(folder_path):
    sound_spans_per_file = {}
    for filename in os.listdir(folder_path):
        if filename.endswith('.xml'):
            filepath = os.path.join(folder_path, filename)
            ambient_sound_spans_list, character_sound_spans_list = process_xml_file(filepath)
            sound_spans_per_file[filename] = {'ambient_sound_spans': ambient_sound_spans_list, 
                                              'character_sound_spans': character_sound_spans_list}
    return sound_spans_per_file


sound_spans_per_file = process_folder(folder_path)
for filename, sound_spans in sound_spans_per_file.items():
    print("File:", filename)
    print("Ambient Sound Spans:", sound_spans['ambient_sound_spans'])
    print("Character Sound Spans:", sound_spans['character_sound_spans'])
    print()


File: Viebig_Clara_Der_Osterquell.xml
Ambient Sound Spans: ['der Wind zaust in ihren Kronen', 'Der Wind weht stark', 'die Schritte raschelten im braunen Laub', 'der Klingelzug mit dem Kreuzchen als Griff gab einen undeutlich heiseren Klang', 'Wie still ist es hier', 'Zweige klopfen an die bleigefaßten Scheiben', 'das heilige Brünnlein rinnt murmelnd', 'wo Baumwipfel Schlummerlieder rauschen', 'Die Kapellenthür hatte sich lautlos geöffnet', 'in die verdunkelte Kirche tritt der Mathes mit knarrenden Stiefeln', 'Leise rauschte ein Regen nieder', 'trommelte auf das Kirchendach', 'tröpfelte von den Zweigen', 'und raschelte in den Guirlanden von weißen Papierrosen', 'Leise begann im Turm das Glöcklein zu bimmeln', 'gedämpft drang sein Schall zu dem Gebeugten nieder', 'Ein Vogel hebt sich trillernd vom Grasrain', 'mit langem, jauchzendem Geschmetter schießt sie auf in den Äther', 'Die Glocke im Turm bimmelt immer noch', 'von weitem hallt Gesang', 'hell tönt ihr geistlicher Gesang', 'sie murme
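The extraction relies on two details of `xml.etree.ElementTree` that are easy to overlook: namespace-qualified tags are stored as `{namespace}tag`, which is why the tags are matched with `endswith()`, and `elem.text` only holds the text before the first child element. The sketch below illustrates both on an in-memory string. The sample document, the TEI namespace declaration, and the nested `<hi>` element are assumptions made for illustration; the annotated files may not contain nested markup at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a namespaced document with markup nested inside a span.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><p>'
    '<ambient_sound>Der Wind <hi>weht</hi> stark</ambient_sound>'
    '</p></TEI>'
)

root = ET.fromstring(sample)
elem = next(e for e in root.iter() if e.tag.endswith('ambient_sound'))

# The tag carries the namespace in curly braces.
print(elem.tag)

# elem.text stops at the first child element.
text_only = elem.text
print(repr(text_only))

# itertext() walks the element and all of its descendants.
full_text = ''.join(elem.itertext()).strip()
print(full_text)
```

If spans in the corpus can contain nested elements, joining `itertext()` in this way would keep the full span, whereas `elem.text` alone would truncate it.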

In [10]:
# Write the output to a text file
output_file = 'sound_spans_output_testing.txt'
with open(output_file, 'w', encoding='utf-8') as f:
    for filename, sound_spans in sound_spans_per_file.items():
        f.write("File: {}\n".format(filename))
        f.write("Ambient Sound Spans: {}\n".format(sound_spans['ambient_sound_spans']))
        f.write("Character Sound Spans: {}\n".format(sound_spans['character_sound_spans']))
        f.write("\n")

In [11]:
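# Write the output to a CSV file with one row per XML file.
# Each list of spans is written to its cell as the string representation of the Python list.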
import csv
output_file = 'sound_spans_output_dataframe_testing.csv'
with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['File', 'Ambient Sound Spans', 'Character Sound Spans']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for filename, sound_spans in sound_spans_per_file.items():
        writer.writerow({'File': filename,
                         'Ambient Sound Spans': sound_spans['ambient_sound_spans'],
                         'Character Sound Spans': sound_spans['character_sound_spans']})
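Because each cell in the CSV above holds a whole list, the spans have to be parsed again before they can be counted or filtered. One possible alternative is a long format with one row per span. The following is a minimal sketch under that assumption: the function name `spans_to_rows`, the column names `Category` and `Span`, and the example dictionary are hypothetical, and the output goes to an in-memory buffer rather than to a file.

```python
import csv
import io

def spans_to_rows(spans_per_file):
    """Flatten the per-file dictionary into one row per sound event span."""
    rows = []
    for filename, spans in spans_per_file.items():
        for span in spans['ambient_sound_spans']:
            rows.append({'File': filename, 'Category': 'ambient_sound', 'Span': span})
        for span in spans['character_sound_spans']:
            rows.append({'File': filename, 'Category': 'character_sound', 'Span': span})
    return rows

# Hypothetical input with the same structure as sound_spans_per_file.
example = {
    'example.xml': {
        'ambient_sound_spans': ['Der Wind weht stark'],
        'character_sound_spans': [],
    }
}

rows = spans_to_rows(example)

# Write to an in-memory buffer; a real file would be opened with
# newline='' and encoding='utf-8' as in the cell above.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['File', 'Category', 'Span'])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In this layout every span is a separate record, so the file can be loaded into a dataframe and grouped by `File` or `Category` without further parsing.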


In [12]:
print("Finished sound event extraction.")

Finished sound event extraction.
