# Extract _Taalportaal_ sentences from XML
Hello visitor! Welcome to this notebook! 

This notebook parses the example sentences from Taalportaal into a spreadsheet. The Taalportaal data is in XML format (see the structure below). The data is parsed into an Excel sheet with five columns: the judgment of the sentence (e.g. '*' or '?'), the sentence itself, the example number, the title and the source file. The last three help to look up the context of a sentence in Taalportaal if necessary. Note that examples are numbered per topic, so the same example number occurs many times. 

The rest of this notebook consists mostly of code, which I have tried to comment as carefully as possible. Originally, the goal was to collect sentences that have a ? in their judgment, so the notebook focuses on that. The files it produces can of course also be used for other purposes.

In [1]:
# Necessary imports
from xml.dom import minidom
import csv
import pandas as pd
import xlsxwriter

# Structure: 
The Taalportaal data has the following general structure of XML tags: 

`
tp-example
   topicid
   sourcefile
   title
   ilexample
       sod_ex_index
       sod_judgment
       sod_bookmarks
           sod_bookmark
       sod_generalList
           innerExample
               [sod_ex_index]  (subnumbering, e.g. 1a)
               wordgroup
                   sod_wg_w
                       sod_judgment
                       sod_ex_index
                       sod_categorialfeature
               exampleComment 
`

There is also this alternative structure:

`
tp-example
   topicid
   sourcefile
   title
   ilexample
       sod_ex_index
       sod_judgment
       sod_bookmarks
           sod_bookmark
       sod_generalList
           innerExample
               [sod_ex_index]  (subnumbering, e.g. 1a)
               wordgroup
                   lexterm
                       word
`

The main problem with this data is that the tags are not ordered or nested consistently, probably because the file was collected from the internet. The depth at which examples are nested differs from one example to the next, which makes parsing difficult because we need to account for all possibilities. 
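
As a minimal sketch of the two layouts, the snippet below builds a tiny, hypothetical document (the tag names come from the structures above, the contents are made up) and checks which layout each `wordgroup` uses. The helper `variant_of` is not part of the notebook; it only illustrates the idea.

```python
from xml.dom import minidom

# Hypothetical miniature document; the real file is far larger and less regular.
SAMPLE = """<examples>
  <tp-example>
    <topicid>t1</topicid>
    <sourcefile>demo.xml</sourcefile>
    <title>Demo topic</title>
    <ilexample>
      <sod_ex_index>1</sod_ex_index>
      <sod_generalList>
        <innerExample>
          <sod_ex_index>a</sod_ex_index>
          <wordgroup><sod_wg_w><sod_judgment>?</sod_judgment>Jan slaapt.</sod_wg_w></wordgroup>
        </innerExample>
        <innerExample>
          <sod_ex_index>b</sod_ex_index>
          <wordgroup><lexterm><word>Marie</word></lexterm><lexterm><word>leest</word></lexterm></wordgroup>
        </innerExample>
      </sod_generalList>
    </ilexample>
  </tp-example>
</examples>"""

doc = minidom.parseString(SAMPLE)

def variant_of(wordgroup):
    """Report which of the two layouts a wordgroup uses (hypothetical helper)."""
    # getElementsByTagName returns an empty, falsy list when nothing matches
    if wordgroup.getElementsByTagName('sod_wg_w'):
        return 'sod_wg_w'
    if wordgroup.getElementsByTagName('lexterm'):
        return 'lexterm'
    return 'unknown'

variants = [variant_of(wg) for wg in doc.getElementsByTagName('wordgroup')]
print(variants)  # ['sod_wg_w', 'lexterm']
```

A check like this could also be used to count how many word groups in the real file fall outside both layouts, which gives a rough idea of how much is being missed.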

**NB: It is quite possible that some sentences are not parsed correctly or are skipped altogether. Feel free to download the code and improve it!**

In [2]:
# The name of the file we want to parse (assumption that the file has the structure as described above)
filename = "tp_publ_examples_nl_syn_feb22.xml"

## Create the sentence object

In [3]:
class Sentence:
    '''
    Defines the Sentence object, which has the sentence and judgment as a string
    and information on the origin of said sentence (sourcefile, title and example number)
    '''
    
    def __init__(self, sentence, judgment, sourcefile, title, ex_number):
        '''
        Initializes sentence object
        '''
        self.sentence = sentence
        self.judgment = judgment
        self.sourcefile = sourcefile
        self.title = title
        self.ex_number = ex_number
    
    def has_questionmark_judgment(self):
        '''
        Returns whether the judgment has a ? in it 
        Return: boolean
        '''
        return '?' in self.judgment
    
    def __repr__(self):
        '''
        Provides string representation for the sentence object
        '''
        return "%s %s, which is nr. %s from file %s with title %s" % (self.judgment, self.sentence, self.ex_number, self.sourcefile, self.title)
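
The check in `has_questionmark_judgment` is a plain substring test, so it also matches combined judgments. The sketch below shows this on a few hypothetical judgment strings and compares it with a stricter variant that is not part of the notebook and is only meant as an illustration.

```python
# Hypothetical judgment strings; '/' separates judgments that were joined for one sentence.
judgments = ['', '*', '?', '??', '*/?']

def has_questionmark(judgment):
    """Same test as Sentence.has_questionmark_judgment: a '?' anywhere in the string."""
    return '?' in judgment

def has_exact_questionmark(judgment):
    """Stricter, hypothetical variant: one of the '/'-separated parts is exactly '?'."""
    return '?' in judgment.split('/')

loose = [j for j in judgments if has_questionmark(j)]
strict = [j for j in judgments if has_exact_questionmark(j)]
print(loose)   # ['?', '??', '*/?']
print(strict)  # ['?', '*/?']
```

The notebook uses the loose test, which fits the original goal of collecting every sentence with a ? anywhere in its judgment.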

## Functions to save data 

In [4]:
def write_sentences_to_csv(filename, sentences):
    '''
    writes all sentences to a csv file
    filename: the name of the file to be written to
    sentences: the sentences to be written down
    return: None
    '''
    with open(filename, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        
        # write the header
        writer.writerow(['judgment','sentence', 'examplenumber', 'title', 'sourcefile'])
        
        for sentence in sentences:
            writer.writerow([sentence.judgment, sentence.sentence, sentence.ex_number, sentence.title, sentence.sourcefile])

def write_sentences_to_xlsx(filename, sentences):
    '''
    writes all sentences to a xlsx file
    filename: the name of the file to be written to
    sentences: the sentences to be written down
    return: None
    '''
    # xlsxwriter creates and writes the file itself when the workbook is closed,
    # so the file must not be opened separately here
    workbook = xlsxwriter.Workbook(filename)
    worksheet = workbook.add_worksheet()

    # write the header
    worksheet.write(0, 0, 'judgment')
    worksheet.write(0, 1, 'sentence')
    worksheet.write(0, 2, 'examplenumber')
    worksheet.write(0, 3, 'title')
    worksheet.write(0, 4, 'sourcefile')

    # write one row per sentence, starting below the header
    for i, sentence in enumerate(sentences):
        worksheet.write(i+1, 0, sentence.judgment)
        worksheet.write(i+1, 1, sentence.sentence)
        worksheet.write(i+1, 2, sentence.ex_number)
        worksheet.write(i+1, 3, sentence.title)
        worksheet.write(i+1, 4, sentence.sourcefile)

    workbook.close()
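
To check the five-column layout without writing anything to disk, the CSV variant can be tried on an in-memory buffer. This is only a sketch: the rows are hypothetical stand-ins for parsed `Sentence` objects, and the header is the same one the functions above write.

```python
import csv
import io

HEADER = ['judgment', 'sentence', 'examplenumber', 'title', 'sourcefile']

# Hypothetical rows standing in for parsed Sentence objects.
rows = [
    ['?', 'Jan slaapt.', '1a', 'Demo topic', 'demo.xml'],
    ['*', 'Slaapt Jan, "zei" hij.', '1b', 'Demo topic', 'demo.xml'],
]

# newline='' leaves line endings to the csv module, as in write_sentences_to_csv
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(HEADER)
writer.writerows(rows)

# Read the data back, using the header row as dictionary keys
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[1]['sentence'])  # Slaapt Jan, "zei" hij.
```

Commas and quotation marks inside a sentence survive the round trip because the `csv` module quotes such fields automatically.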

## Parsing functions
This is the actual body of the code, where the parsing happens. The main function is `parse_xml_file`; the rest are helper functions.
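
The helpers have to deal with mixed content: an element such as `sod_wg_w` can contain plain text, a `sod_judgment` element and other markup side by side. The sketch below walks the child nodes of a hypothetical fragment and labels each one. It uses `toxml()` instead of the `toprettyxml()` call in the notebook so that the output stays on one line.

```python
from xml.dom import minidom

# Hypothetical fragment: a judgment, plain text and some inline markup.
fragment = minidom.parseString(
    '<sod_wg_w><sod_judgment>?</sod_judgment>Jan slaapt <b>vaak</b>.</sod_wg_w>'
).documentElement

kinds = []
for node in fragment.childNodes:
    if node.nodeType == node.TEXT_NODE:
        # plain text between the tags
        kinds.append(('text', node.data.strip()))
    elif node.nodeName == 'sod_judgment':
        # the judgment mark is the text inside the element
        kinds.append(('judgment', node.firstChild.data.strip()))
    else:
        # any other element is kept as raw xml
        kinds.append(('markup', node.toxml()))

print(kinds)
# [('judgment', '?'), ('text', 'Jan slaapt'), ('markup', '<b>vaak</b>'), ('text', '.')]
```

Note that an empty `sod_judgment` element has no `firstChild`, so a fragment like that would raise an `AttributeError` in this sketch.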

In [5]:
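def add_word_to_string(string, word):
    '''
    Appends a word to a string, separated by a single space. No space is added if the string
    is empty or already ends in a whitespace-like character (space, '/', newline or tab).
    
    string: the string built so far
    word: the word to be added
    return: string with word appended
    '''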
    whitespace_like = [' ', '/', '\n', '\t']
    if len(string) == 0:
        string += word
    elif string[-1] in whitespace_like:
        string += word
    else:
        string += ' ' + word
    return string

def add_judgment_to_string(prev_judg, j):
    '''
    Some sentences contain multiple judgments. This function concatenates a new judgment to previous judgments.
    If there are no previous judgments, the new judgment becomes the start of the string. If there were any judgments, 
    the new judgment is added after adding a '/' to separate the judgments clearly. 
    
    prev_judg: previous judgments
    j: new judgment
    return: prev_judg + j
    '''
    if len(prev_judg) == 0:
        prev_judg += j
    else:
        prev_judg += "/" + j
    return prev_judg

def clean_sentence(sentence):
    '''
    Cleans out some of the junk in the parsed sentences. For example, there are 
    unnecessary enters, tabs, other whitespace, and there are superfluous xml tags
    
    sentence: sentence to be cleaned (string)
    return: cleaned sentence
    '''
    sentence = sentence.replace('\n', '')
    sentence = sentence.replace('\t', ' ')
    # collapse runs of spaces of any length into a single space
    while '  ' in sentence:
        sentence = sentence.replace('  ', ' ')
    sentence = sentence.replace('</sod_emphasisitalics><sod_emphasisitalics>', '')
    return sentence

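def append_word(word, sentence, judgment):
    '''
    Adds one child node of a sod_wg_w element to the sentence built so far.
    Text nodes are added as words. A sod_judgment element is added to both the
    judgment and the sentence. Any other element is added to the sentence as raw xml.
    
    word: the xml node to be added
    sentence: the sentence built so far
    judgment: the judgments collected so far
    return: the updated sentence and judgment
    '''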
    if word.nodeType == word.TEXT_NODE:
        sentence = add_word_to_string(sentence, word.data.strip())
    else:
        if word.nodeName == 'sod_judgment':
            judgment = add_judgment_to_string(judgment, word.firstChild.data.strip())
            sentence = add_word_to_string(sentence, word.firstChild.data.strip())
        else:
            sentence = add_word_to_string(sentence,  word.toprettyxml())
    return sentence, judgment

def parse_xml_file(filename):
    '''
    Parses the xml file under filename
    
    filename: name of the file to be parsed
    return: sentences that have been parsed in the format of a list of Sentence instances
    '''
    f = minidom.parse(filename)
    ex = f.getElementsByTagName('examples')[0]
    examples = ex.getElementsByTagName('tp-example')
    

    problems = 0
    sentences = []
    
    for example in examples:

        try: 
            sourcefile = example.getElementsByTagName('sourcefile')[0].firstChild.data
            xml_title = example.getElementsByTagName('title')[0]
            title = ""
            for part in xml_title.childNodes:
                try: 
                    title += part.data
                except AttributeError:
                    # element nodes have no .data, so keep their raw xml instead
                    title += part.toxml()
            ilexamples = example.getElementsByTagName('ilexample')
            for ilexample in ilexamples:
                try: 
                    il_ex_nr = ilexample.getElementsByTagName('sod_ex_index')[0].firstChild.data
                except (IndexError, AttributeError):
                    il_ex_nr = ""
                sod_generalists = ilexample.getElementsByTagName('sod_generalList')
                for sod_generalist in sod_generalists:
                    inner_examples = sod_generalist.getElementsByTagName('innerExample')

                    for inner_ex in inner_examples:
                        try: 
                            in_ex_nr = inner_ex.getElementsByTagName('sod_ex_index')[0].firstChild.data
                        except (IndexError, AttributeError):
                            in_ex_nr = ""
                        
                        #print(inner_ex)
                        wordgroups = inner_ex.getElementsByTagName('wordgroup')
                        for wordgroup in wordgroups:
                            sod_wg_w = wordgroup.getElementsByTagName('sod_wg_w')
                            sentence = ""
                            judgment = ""
                            for words in sod_wg_w:
                                for word in words.childNodes:
                                    #print(word.toprettyxml())
                                    sentence, judgment = append_word(word, sentence, judgment)
                
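                            lexterms = wordgroup.getElementsByTagName('lexterm')
                            for lexterm in lexterms:
                                # only take the words of this lexterm, otherwise every word
                                # would be added once per lexterm in the wordgroup
                                words = lexterm.getElementsByTagName('word')
                                for word in words:
                                    sentence = add_word_to_string(sentence, word.toprettyxml())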
                            sentences.append(Sentence(clean_sentence(sentence), judgment, sourcefile, title, il_ex_nr + in_ex_nr))
        except Exception as e:
            print(example.toprettyxml())
            print(str(e))
            problems += 1
    if problems > 0:
        # the counter is increased once per tp-example, not once per sentence
        print("Encountered: " + str(problems) + " problem examples out of " + str(len(examples)) + " examples")
    return sentences



In [6]:
sentences = parse_xml_file(filename)
#print(sentences)
write_sentences_to_xlsx('sentences.xlsx', sentences)


In [7]:
def filter_sentences(sentences):
    '''
    Selects only the sentences that contain at least one ? in the judgment
    
    sentences: list of Sentence instances
    return: list of the Sentence instances with a ? in their judgment
    '''
    filtered_sentences = []
    for sentence in sentences:
        if sentence.has_questionmark_judgment():
            filtered_sentences.append(sentence)
    return filtered_sentences

filtered_sentences = filter_sentences(sentences)
write_sentences_to_xlsx('sentences_filtered.xlsx', filtered_sentences)