# Preprocessing Files

Different sources and tools use different formats to represent information, so the output of one tool may not correspond directly to that of another. In this course, we will work mainly (or even exclusively) with the CoNLL format. Even within this format, files may differ in tokenization, in the class labels used, or in the number of columns they provide. Depending on the exact difference, you may want to adapt the input files or build scripts that handle such differences during processing.
In this case, we are preparing the output of two different tools for evaluation, where the annotation schemes differ. The setup lets you first convert the files so that their labels match and then run the evaluation (covered in a different notebook). Originally, the two systems used different tokenizations, and both differed from the tokenization of the training and evaluation data. The tokens have already been aligned. We left in some of the basic functions used in that process (e.g. the check of whether tokens align) as examples.
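
To make the alignment problem concrete, here is a minimal sketch that works on in-memory rows. The two row lists and the helper `first_mismatch` are hypothetical and are not part of this notebook's pipeline. The sketch only assumes, as the functions below do, that column 0 holds the token and that an empty row marks a sentence boundary.

```python
from typing import List

# Two hypothetical system outputs for the same sentence, one row per token.
# Column 0 holds the token; the remaining columns differ per tool.
rows_a = [
    ["John", "NNP", "B-PER"],
    ["lives", "VBZ", "O"],
    ["", "", ""],  # empty row marks a sentence boundary
]
rows_b = [
    ["John", "PERSON"],
    ["lives", "O"],
    ["", ""],
]

def first_mismatch(rows1: List[List[str]], rows2: List[List[str]]) -> int:
    """Return the index of the first row whose tokens differ, or -1 if aligned."""
    for i, (row1, row2) in enumerate(zip(rows1, rows2)):
        if row1[0] != row2[0]:
            return i
    # All shared rows match; a length difference is a mismatch at the shorter end
    if len(rows1) != len(rows2):
        return min(len(rows1), len(rows2))
    return -1

print(first_mismatch(rows_a, rows_b))  # -1: the tokens align
```

Returning an index rather than a boolean can make it easier to find where two tokenizations start to diverge.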

In [34]:
import csv
from typing import List, Dict
from collections import Counter

In [35]:
def matching_tokens(conll1: List, conll2: List) -> bool:
    '''
    Check whether the tokens of two conll files are aligned
    
    :param conll1: tokens (or full annotations) from the first conll file
    :param conll2: tokens (or full annotations) from the second conll file
    
    :returns boolean indicating whether tokens match or not
    '''
    # Files with a different number of rows cannot be aligned
    if len(conll1) != len(conll2):
        return False
    for row, row2 in zip(conll1, conll2):
        # The token is expected in the first column of each row
        if row[0] != row2[0]:
            return False
    
    return True

In [92]:
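def read_in_conll_file(conll_file: str, delimiter: str = '\t') -> List:
    '''
    Read in conll file and return structured object
    
    :param conll_file: path to conll_file
    :param delimiter: specifies how columns are separated. Tabs are standard in conll
    
    :returns list of split rows included in the conll file; lines without a delimiter
             (e.g. empty lines between sentences) become rows of empty strings
    '''
    conll_rows = []
    # Width used for empty rows; defaults to 1 until the first annotated row is seen,
    # so a file that starts with an empty line does not raise an error
    rowlen = 1
    with open(conll_file, 'r') as my_conll:
        for line in my_conll:
            row = line.strip("\n").split(delimiter)
            if len(row) == 1:
                # No delimiter found: treat the line as a sentence boundary
                conll_rows.append([""]*rowlen)
            else:
                rowlen = len(row)
                conll_rows.append(row)
    
    return conll_rows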

In [37]:
def alignment_okay(conll1: str, conll2: str) -> bool:
    '''
    Read in two conll files and see if their tokens align
    '''
    my_first_conll = read_in_conll_file(conll1)
    my_second_conll = read_in_conll_file(conll2)

    return matching_tokens(my_first_conll, my_second_conll)
    
    

In [38]:
def get_predefined_conversions(conversion_file: str) -> Dict:
    '''
    Read in file with predefined conversions and return structured object that maps old annotation to new annotation
    
    :param conversion_file: path to conversion file (tab-separated: old annotation, new annotation)
    
    :returns object that maps old annotations to new ones
    '''
    conversion_dict = {}
    # The with-statement closes the file once all rows have been read
    with open(conversion_file, 'r') as my_conversions:
        conversion_reader = csv.reader(my_conversions, delimiter='\t')
        for row in conversion_reader:
            conversion_dict[row[0]] = row[1]
    return conversion_dict
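
The contents of `settings/conversions.tsv` are not shown in this notebook. As a minimal sketch of the format that `get_predefined_conversions` expects (one old label and one new label per line, separated by a tab), the example below parses a hypothetical in-memory file. The label pairs are an assumption based on the label names in the counts printed at the end of this notebook.

```python
import csv
import io

# Hypothetical contents of a conversions file: old label, tab, new label
example_tsv = "B-PER\tPERSON\nI-PER\tPERSON\nB-LOC\tLOCATION\nI-LOC\tLOCATION\n"

reader = csv.reader(io.StringIO(example_tsv), delimiter="\t")
# Skip rows that do not have at least two columns (e.g. blank lines)
mapping = {row[0]: row[1] for row in reader if len(row) >= 2}

print(mapping["I-PER"])       # PERSON
print(mapping.get("O", "O"))  # O: labels without an entry stay unchanged
```

The second lookup mirrors what `create_converted_output` does: only annotations that appear as keys in the dictionary are replaced.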

In [39]:
def create_converted_output(conll_rows: List, annotation_identifier: int, conversions: Dict, outputfilename: str, delimiter: str = '\t'):
    '''
    Convert the annotations that have a predefined conversion and write all rows to a new file.
    Note that the rows in conll_rows are modified in place.
    
    :param conll_rows: rows with conll annotations
    :param annotation_identifier: index of the column that holds the annotations
    :param conversions: dictionary that maps old annotations to new ones (e.g. the output of get_predefined_conversions)
    :param outputfilename: path of the file the converted rows are written to
    :param delimiter: specifies how columns are separated in the output file
    '''
    with open(outputfilename, 'w') as outputfile:
        for row in conll_rows:
            annotation = row[annotation_identifier]
            if annotation in conversions:
                row[annotation_identifier] = conversions.get(annotation)
            if row[0] == "":
                outputfile.write("\n")
            else:
                outputfile.write(delimiter.join(row)+"\n")


In [45]:
def preprocess_files(conll1: str, conll2: str, column_identifiers: List, conversions: str):
    '''
    Guides the full process of preprocessing files and outputs the modified files.
    
    :param conll1: path to the first conll input file
    :param conll2: path to the second conll input file
    :param column_identifiers: indices of the annotation column in the first and second file, in that order
    :param conversions: path to a file that defines conversions
    '''
    if alignment_okay(conll1, conll2):
        conversions = get_predefined_conversions(conversions)
        my_first_conll = read_in_conll_file(conll1)
        my_second_conll = read_in_conll_file(conll2)
        create_converted_output(my_first_conll, column_identifiers[0], conversions, conll1.replace('.conll','-preprocessed.conll'))
        create_converted_output(my_second_conll, column_identifiers[1], conversions, conll2.replace('.conll','-preprocessed.conll'))
    else:
        print(conll1, conll2, 'do not align')
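
One detail of the output naming above: `str.replace` replaces every occurrence of `'.conll'`, so a path whose directory name also contains `.conll` would be changed in two places. The sketch below shows one possible alternative using `pathlib`. The helper `preprocessed_path` and the second example path are hypothetical and are not used elsewhere in this notebook.

```python
from pathlib import Path

def preprocessed_path(conll_path: str) -> Path:
    """Insert '-preprocessed' before the file extension (hypothetical helper)."""
    path = Path(conll_path)
    # Only the final path component is changed; parent directories stay as they are
    return path.with_name(path.stem + '-preprocessed' + path.suffix)

print(preprocessed_path('../data/conll2003.dev.conll').as_posix())
# ../data/conll2003.dev-preprocessed.conll

# A directory name that contains '.conll' is left untouched
print(preprocessed_path('runs.conll/spacy_out.dev.conll').as_posix())
# runs.conll/spacy_out.dev-preprocessed.conll
```

For the paths used in this notebook both approaches give the same file name.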

In [75]:
def count_NEs(input_list: list, NE_column: int) -> Counter:
    '''
    Counts the unique NE labels.
    
    :param input_list: list of the rows of a conll file (output of read_in_conll_file function)
    :param NE_column: identifier of the column that contains the NE labels
    
    :returns dictionary of all classes in the NE_column of the input_list and their count.
    '''
    NE_counter = Counter()
    for line in input_list:
        NE_counter.update([line[NE_column]])
    return NE_counter

### Change paths
The cell below checks whether two files align and, if they do, preprocesses both. This was used for the spaCy and Stanford conll files.
If the file paths are different on your device, replace the paths of the input files and their respective NE label column identifiers in the cell below to be able to run the code. If a different conversions file needs to be used, the path to the conversions file can also be changed.

In [101]:
# Preprocess 2 files (alignment between these files will be checked)
# Replace path to input files, their respective column identifiers and the conversions file path here.
path_conll1 = '../data/spacy_out.dev.conll' #'../data/stanford_out.dev.conll' for stanford
path_conll2 = '../data/conll2003.dev.conll'
column_identifiers = [2,3] #[2,3] for spacy & conll2003dev respectively, [3,3] for stanford & conll2003dev respectively.
path_conversions = 'settings/conversions.tsv'

preprocess_files(path_conll1, path_conll2, column_identifiers, path_conversions)

### Change paths
The cell below can be used to preprocess a single file without checking it for alignment with another file.
If the file path is different on your device, replace the path of the input file and its NE label column identifier in the cell below to be able to run the code. If a different conversions file needs to be used, the path to the conversions file can also be changed.

In [103]:
# Preprocess a single file without checking for alignment with another file.
# Replace path to input file, its column identifier and the conversions file path here.
path_conll = '../data/conll2003.dev.conll'
column_identifier = 3
path_conversions = 'settings/conversions.tsv'

conversions = get_predefined_conversions(path_conversions)
conll_rows = read_in_conll_file(path_conll)
create_converted_output(conll_rows, column_identifier, conversions, path_conll.replace('.conll','-preprocessed.conll'))


In [104]:
# Show count for each of the NE labels in path_conll file before and after preprocessing.
data_raw = read_in_conll_file(path_conll, '\t')
data_preprocessed = read_in_conll_file(path_conll.replace('.conll','-preprocessed.conll'), '\t')

label_counts_raw = count_NEs(data_raw, column_identifier)
label_counts_preprocessed = count_NEs(data_preprocessed, column_identifier)

print(f"Using file path:\n {path_conll} \n and \n {path_conll.replace('.conll','-preprocessed.conll')}")
print("\nNE labels before preprocessing:\n", label_counts_raw)
print("\nNE labels after preprocessing:\n", label_counts_preprocessed)

Using file path:
 ../data/conll2003.dev.conll 
 and 
 ../data/conll2003.dev-preprocessed.conll

NE labels before preprocessing:
 Counter({'O': 42759, '': 3250, 'B-PER': 1842, 'B-LOC': 1837, 'B-ORG': 1341, 'I-PER': 1307, 'B-MISC': 922, 'I-ORG': 751, 'I-MISC': 346, 'I-LOC': 257})

NE labels after preprocessing:
 Counter({'O': 42759, '': 3250, 'PERSON': 3149, 'LOCATION': 2094, 'ORG': 2092, 'MISC': 1268})
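
As a sanity check, the counts above can be compared with what a label mapping would predict. This is a minimal sketch: the two counters are copied from the output above, while the `assumed` mapping is inferred from the label names and is not read from `settings/conversions.tsv`.

```python
from collections import Counter

# Counts copied from the output above
before = Counter({'O': 42759, '': 3250, 'B-PER': 1842, 'B-LOC': 1837, 'B-ORG': 1341,
                  'I-PER': 1307, 'B-MISC': 922, 'I-ORG': 751, 'I-MISC': 346, 'I-LOC': 257})
after = Counter({'O': 42759, '': 3250, 'PERSON': 3149, 'LOCATION': 2094,
                 'ORG': 2092, 'MISC': 1268})

# Assumed mapping, inferred from the label names (not taken from the conversions file)
assumed = {'B-PER': 'PERSON', 'I-PER': 'PERSON', 'B-LOC': 'LOCATION', 'I-LOC': 'LOCATION',
           'B-ORG': 'ORG', 'I-ORG': 'ORG', 'B-MISC': 'MISC', 'I-MISC': 'MISC'}

# Apply the assumed mapping to the raw counts; unmapped labels keep their name
expected = Counter()
for label, count in before.items():
    expected[assumed.get(label, label)] += count

print(expected == after)                            # True
print(sum(before.values()) == sum(after.values()))  # True: no rows were lost
```

Both checks pass for the numbers shown: each converted label count equals the sum of its `B-` and `I-` counts, and the total number of rows is unchanged.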
