# Purpose

This file holds the functions that create the text input and XML check files needed to validate the performance of the Philter algorithm.

# Importing necessary libraries

## Defining a function that creates the text input file

This is the file with the raw "clinical note" (in quotes because the PHI in it is fake). When validating Philter's performance, this file is given to Philter as input, and it corresponds to the labels outlined in the XML check file.

In [6]:
def create_text(text, file_ID):
    
    filepath = "./data/faked_notes/philter_files/without_tags/" + str(file_ID) + ".txt"
    #the with-block closes the file automatically, so no explicit close() call is needed
    with open(filepath, "w") as file:
        file.write(text)
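
The tags in the XML check file refer to character positions in the note, so the text file should contain exactly the characters of the in-memory string. The following is a minimal sketch of a variant of `create_text` with that in mind. The name `write_note_text` and the `base_dir` parameter are hypothetical and not part of the original code. It also creates the output directory if it is missing, which is an added convenience rather than something the original function does.

```python
from pathlib import Path
import tempfile


def write_note_text(text, file_id, base_dir):
    """Write `text` to <base_dir>/<file_id>.txt and return the path (hypothetical helper)."""
    out_dir = Path(base_dir)
    # create the target directory (and any parents) if it does not exist yet
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{file_id}.txt"
    # newline="" disables newline translation on write, so "\n" is not turned
    # into "\r\n" on Windows and character offsets match the original string
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return out_path


if __name__ == "__main__":
    # write a fake note into a temporary directory and read it back
    with tempfile.TemporaryDirectory() as tmp:
        path = write_note_text("Seen by Dr. Doe on 01/02/2100.", 7, Path(tmp) / "without_tags")
        print(path.name, path.read_text(encoding="utf-8"))
```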

## Defining a function that creates the XML check file

This is the file containing all the tags (hence the XML format). The generated XML file is used as the "labels" when validating the Philter algorithm. The only thing that sets this function apart from the one generating the raw text file is that the tags need to be added. The code was adapted from a StackAbuse guide (StackAbuse appears to be a StackOverflow-like site, but with articles rather than forum threads): https://stackabuse.com/reading-and-writing-xml-files-in-python/.

In [29]:
def create_XML(text, tags_list, file_ID, write=False):
    '''
    Currently, this generates an XML string in the same format as the XML files given in the public GitHub version of Philter.
    Those files seem to be formatted in the following way:
    
    <?xml version='1.0' encoding='utf8'?> #declaration of the XML version and encoding of the file
    <deIdi2b2> #likely stands for "de-identified i2b2" (probably the same i2b2 dataset referenced in the paper written on Philter)
        <TEXT>
            clinical note text here...
        </TEXT>
        <TAGS>
            <DATE TYPE="DATE" comment="" end="string_end_index" id="P0" start="string_start_index" text="string_text" />
            (the number in the id can be any number and seems to just increment by 1 for each tag)
            names look much the same, but with a different tag type (NAME) and PHI type ("PATIENT" or "DOCTOR")
        </TAGS>
    </deIdi2b2>
    
    Note that this function writes <deIdMIMIC> rather than <deIdi2b2> as the root element.
    If write is True, the string is also saved to file_ID.xml; the string is returned either way.
    '''
    #header, container, and text portions of the XML file
    string = "<?xml version='1.0' encoding='utf8'?>\n<deIdMIMIC>\n<TEXT> \n" + text + "\n</TEXT>\n<TAGS>\n"
    
    #creating the tags for the XML file
    
    for ID_number, value in enumerate(tags_list):
        
        #this structure is drawn from an example i2b2 file in the public GitHub site for Philter
        tag_type = value["type"][0].upper()
        string += (
            f'<{tag_type} TYPE="{tag_type}" comment= "" end="{value["end"]}" '
            f'id= "P{ID_number}" start="{value["start"]}" text="{value["text"]}" /> \n'
        )
    
    string += "</TAGS>\n</deIdMIMIC>"
    
    if write:
        #write the formatted file_data to the file_ID.xml file
        filepath = "./data/faked_notes/philter_files/with_tags/" + str(file_ID) + ".xml"
        with open(filepath, "w") as file:
            file.write(string)
    
    return string
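
Because `create_XML` assembles the file by string concatenation, characters such as `&`, `<`, or `"` in the note or in a tag's text would end up unescaped in the output. As a hedged alternative, the sketch below builds a similar structure with the standard-library `xml.etree.ElementTree`, which escapes those characters automatically, and then parses the result back to check that each tag's offsets point at its text. The function name `build_check_xml` and the sample note are hypothetical. The sketch assumes each `type` value is a sequence whose first item is the label (which is what the `value["type"][0]` indexing suggests). It also omits the XML declaration and the newlines that `create_XML` places around the note text, so it is not a drop-in replacement.

```python
import xml.etree.ElementTree as ET


def build_check_xml(text, tags_list, root_name="deIdMIMIC"):
    """Build the check XML with ElementTree so special characters are escaped (sketch)."""
    root = ET.Element(root_name)
    ET.SubElement(root, "TEXT").text = text
    tags = ET.SubElement(root, "TAGS")
    for number, value in enumerate(tags_list):
        # assumption: value["type"] is a sequence such as ["date"]
        tag_type = value["type"][0].upper()
        ET.SubElement(tags, tag_type, {
            "TYPE": tag_type,
            "comment": "",
            "end": str(value["end"]),
            "id": f"P{number}",
            "start": str(value["start"]),
            "text": str(value["text"]),
        })
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # fake note containing characters that must be escaped in XML
    note = "Pt <John Smith> seen 01/02/2100 & discharged."
    name_start = note.index("John Smith")
    date_start = note.index("01/02/2100")
    tags = [
        {"type": ["name"], "start": name_start, "end": name_start + len("John Smith"), "text": "John Smith"},
        {"type": ["date"], "start": date_start, "end": date_start + len("01/02/2100"), "text": "01/02/2100"},
    ]
    xml_string = build_check_xml(note, tags)

    # round trip: the parsed text and offsets should agree with the inputs
    parsed = ET.fromstring(xml_string)
    parsed_text = parsed.find("TEXT").text
    for tag in parsed.find("TAGS"):
        start, end = int(tag.get("start")), int(tag.get("end"))
        print(tag.tag, tag.get("id"), parsed_text[start:end] == tag.get("text"))
```

A round-trip check like this can catch offset mistakes in `tags_list` before the files are handed to Philter.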

## Defining a function that creates a csv file

Creating individual XML and txt files works for small batches, but directories have practical limits on how many files they can comfortably hold. For something as large as the entire MIMIC-III dataset (which has ~2 million clinical notes), a more compact structure such as a single .csv file is far more practical. This function builds such a .csv file.

In [1]:
import pandas as pd

def create_csv(text, XML, filename, previous_data, filepath="./data/faked_notes/", write=False):

    #DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat
    new_row = pd.DataFrame([[text, XML]], columns=["altered_texts", "XMLs"])
    previous_data = pd.concat([previous_data, new_row])
        
    if write:
        #drop incomplete rows before saving
        previous_data = previous_data.dropna()
        previous_data.to_csv(filepath + str(filename) + ".csv", header=["altered_texts", "XMLs"], index=False)

    return previous_data
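
Growing a DataFrame one row at a time copies the accumulated data on every call, which may become slow at the scale of millions of notes. One possible alternative, sketched below, streams rows straight to disk with the standard-library `csv` module instead of pandas. This is a swapped-in technique, not the approach used above, and the function name `append_rows` is hypothetical. The column names are taken from `create_csv`.

```python
import csv
from pathlib import Path


def append_rows(csv_path, rows, header=("altered_texts", "XMLs")):
    """Append (text, xml) pairs to csv_path, writing the header only once (sketch)."""
    path = Path(csv_path)
    needs_header = not path.exists() or path.stat().st_size == 0
    written = 0
    # newline="" lets the csv module control line endings, which keeps
    # multi-line notes intact inside quoted fields
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if needs_header:
            writer.writerow(header)
        for text, xml in rows:
            if text is None or xml is None:
                continue  # skip incomplete pairs, similar in spirit to dropna()
            writer.writerow([text, xml])
            written += 1
    return written
```

A file written this way can still be loaded later with `pd.read_csv`, since quoted multi-line fields are part of the usual CSV conventions that pandas reads.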