# Python Implementation of a Raw Text Parser 
#### Author Name: Pouria Ebrahimnezhad

Date: 13/04/2019

Environment: Python 3.6.5 and Jupyter notebook
Libraries used: 
* re
* json

Other resources used to check the work include:
* https://pythex.org/ (for checking and developing the Regular expressions)
* http://xmlgrid.net/validator.html (for validating the XML file created)
* https://www.freeformatter.com/json-validator.html (for validating the JSON file created)



# 1. Introduction
This part of the assessment covers the very first step of analysing textual data, i.e., extracting data from semi-structured text files. I have been provided with a data set that contains information about several units at Monash University, e.g., unit code, unit title, synopsis, requirements, outcomes, chief examiner, etc. The goal is to extract this data and transform it into the XML and JSON formats with the elements given in the specification.

# 2. Building information extraction functions

In this part of the code I have built a set of functions, each dedicated to retrieving one piece of the required information. This let me design and test each output separately and gave me the flexibility to change the code when I came across test cases that needed a workaround. The functions are:

* find_unit()
* find_prerequisite()
* find_prohib()
* find_synop()
* find_req()
* find_outcome()
* find_chief()

One thing to note is that, in order to achieve these tasks, I have assumed that the main program passes the segment of the provided text file for each unit separately through the functions below. I explain the purpose of each function here and the details of the main program in the following sections.

In [1]:
import re

Here I have developed a function to find the unit code and title of each unit section, returning a list containing both. Having identified where this information sits in the text file, I step through the lines of each section to find the line that contains the unitcode span. I then extract the unit code with one regex and the title with a second regex that uses non-capturing groups around a single capturing group.

In [2]:
def find_unit(segment):
    lines = segment.splitlines()
    unit_code = []
    unit_name = []
    for i in range(0, len(lines)):
        unit = re.findall(r"""<span class="unitcode">""", lines[i])
        if unit != []:
            unit_code = re.findall(r"([A-Z]{3}\d{4})", lines[i])
            unit_name = re.findall(r"(?:</span>\s-\s)(.*?)(?:<)", lines[i])
            break
    return (unit_code + unit_name)
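
As a quick check of the two patterns, the sketch below applies them to a single made-up heading line. The sample markup, unit code and title are assumptions for illustration and are not taken from the data set.

```python
import re

# Hypothetical heading line; the real file may wrap it in other tags
sample_line = '<h1><span class="unitcode">ABC1234</span> - Sample unit title</h1>'

# Unit code: three capital letters followed by four digits
code = re.findall(r"([A-Z]{3}\d{4})", sample_line)
# Title: the text between "</span> - " and the next opening "<"
title = re.findall(r"(?:</span>\s-\s)(.*?)(?:<)", sample_line)

print(code + title)  # ['ABC1234', 'Sample unit title']
```

Because the non-capturing groups are not returned by findall(), only the title text itself ends up in the list.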

The next function extracts the prerequisites and co-requisites of each unit and returns a list of unique unit codes. While checking the document and the test files I noticed that only unique codes are required; because some unit codes were repeated, I use set() to remove the duplicates. I have also assumed that we are only interested in the first paragraph of each of these sections, similar to the sample output provided. I therefore take the p-tag that follows each heading, search it for anything matching the unit code format, and collect the codes in one list per section. Finally I join the two lists and return them as one list, or NA when neither section yields a code.

In [3]:
def find_prerequisite(segment):
    prereq = []
    coreq = []
    final = []
    regex1 = re.compile(r"(?:>Prerequisites</p>)(.*?</p>)", re.DOTALL)
    temp_pre = regex1.findall(segment)
    regex2 = re.compile(r"(?:>Co-requisites</p>)(.*?</p>)", re.DOTALL)
    temp_co = regex2.findall(segment)
    if (temp_pre == []) and (temp_co == []):
        final.append('NA')
        return(final)
    for i in range(0,len(temp_pre)):
        item = re.findall(r"([A-Z]{3}\d{4})", temp_pre[i])
        prereq.extend(item)
    for i in range(0,len(temp_co)):
        item = re.findall(r"([A-Z]{3}\d{4})", temp_co[i])
        coreq.extend(item)
    if (prereq != []) or (coreq != []):
        final = prereq + coreq
        return(list(set(final)))
    else:
        final.append('NA')
        return(final)
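
One side effect of set() is that the order of the returned codes is not guaranteed. If the order of appearance matters, a possible alternative is dict.fromkeys(), which removes duplicates while keeping the first occurrence of each code (dictionaries keep insertion order in CPython 3.6 and in every version from Python 3.7). The helper name and the sample markup below are assumptions for illustration.

```python
import re

def unique_in_order(codes):
    # Hypothetical helper: drop duplicates but keep first-seen order
    return list(dict.fromkeys(codes))

# Made-up prerequisites paragraph with a repeated unit code
sample = ('<p class="heading">Prerequisites</p>\n'
          '<p><a href="#">ABC1234</a> or ABC1234 and XYZ5678</p>')

# Same two-step extraction as find_prerequisite(): paragraph first, then codes
paragraphs = re.findall(r"(?:>Prerequisites</p>)(.*?</p>)", sample, re.DOTALL)
codes = []
for paragraph in paragraphs:
    codes.extend(re.findall(r"([A-Z]{3}\d{4})", paragraph))

print(unique_in_order(codes))  # ['ABC1234', 'XYZ5678']
```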

Next I develop a function to find and return a list of prohibited units for each section. It finds the line that includes the p-tag for Prohibitions and then scans the lines below it until it reaches the closing div-tag, which signals the end of the section. I use the same unit code regex to extract the codes and append them to a list. Where there is no Prohibitions section, or no unit code under it, the function returns NA as per the specification.

In [4]:
def find_prohib(segment):
    lines = segment.splitlines()
    prohib_code = []
    for i in range(0, len(lines)):
        if re.search(r">Prohibitions</p>", lines[i]):
            # Scanning the lines below the heading up to the closing div-tag
            # (or the end of the segment, so a missing </div> cannot raise an IndexError)
            for line in lines[i + 1:]:
                prohib_code.extend(re.findall(r"([A-Z]{3}\d{4})", line))
                if "</div>" in line:
                    break
            break
    # Returning NA when there is no Prohibitions section or no unitcode under it
    if prohib_code == []:
        return ['NA']
    return list(set(prohib_code))

The next function extracts the synopsis of each unit, cleans any hyperlinks and returns the result as a string. The issue I faced here was that some synopses include hyperlinked unit codes. These had to be removed because the tags would cause inconsistencies in the output and make the XML and JSON output unparsable. I replace each hyperlink with its unit code so that we don't lose the information, and again return NA where required.

In [5]:
def find_synop(segment):
    lines = segment.splitlines()
    temp = []
    for i in range(0, len(lines)):
        if re.search(r">Synopsis</h2>", lines[i]):
            # Collecting the p-tag contents below the heading, up to the closing div-tag
            for line in lines[i + 1:]:
                temp.extend(re.findall(r"(?:<p>)(.*?)(?=</p>)", line))
                if "</div>" in line:
                    break
            break
    if temp == []:
        return 'NA'
    synopsis = temp[0]
    # Cleaning any hyperlink information in this section but keeping the unitcodes
    # (each span is handled on its own, so a span without a unitcode cannot shift the replacements)
    for sec in re.findall(r"(<span.*?(?:</span>)+)", synopsis):
        code = re.search(r"[A-Z]{3}\d{4}", sec)
        if code:
            synopsis = synopsis.replace(sec, code.group())
    return synopsis
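
A simpler way to clean such hyperlinks, sketched here as an alternative rather than as the method used in find_synop(), is to strip every tag with re.sub() and keep the text between the tags. This assumes that the visible link text is itself the unit code, which is the case in the made-up synopsis below.

```python
import re

# Made-up synopsis with a hyperlinked unit code
synopsis = ('Builds on <span><a href="http://example.org/ABC1234">ABC1234</a></span> '
            'and covers parsing.')

# Remove every tag but keep the text between the tags
clean = re.sub(r"<[^>]+>", "", synopsis)

print(clean)  # Builds on ABC1234 and covers parsing.
```

If the link text were something other than the unit code, this approach would keep that text instead, so the span-by-span replacement remains the safer choice for guaranteeing that the code survives.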

Very similar to the previous function, this function finds the Assessment section and collects the content of all p-tags below the Assessment heading before the closing div-tag. I also wrote a regex to find any tags inside the extracted text and remove them, for the same reason as the hyperlinks above.
Again, please note the assumption that we are only interested in the p-tag information. The input file holds further requirement information in li-tags, but I have tried to keep my output similar to the test file provided in the assessment.

In [6]:
def find_req(segment):
    lines = segment.splitlines()
    temp = []
    for i in range(0, len(lines)):
        if re.search(r">Assessment</h2>", lines[i]):
            # Collecting the content of every p-tag below the heading, up to the closing div-tag
            for line in lines[i + 1:]:
                temp.extend(re.findall(r"(?:<p>)(.*?)(?=</p>)", line))
                if "</div>" in line:
                    break
            break
    if temp == []:
        return ['NA']
    # Removing any remaining tags (e.g. hyperlinks) from the extracted text
    return [re.sub(r"<.*?>", "", item) for item in temp]
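
find_prohib(), find_synop() and find_req() all share one step: find a heading line, then read the lines below it until the closing div-tag. That step could be factored into a single helper, as in the sketch below. The helper name, its parameters and the sample segment are hypothetical and only illustrate the idea.

```python
import re

def lines_below(segment, heading_pattern, end_marker="</div>"):
    # Hypothetical helper: return the lines after the first line matching
    # heading_pattern, up to and including the first line containing end_marker
    lines = segment.splitlines()
    for i, line in enumerate(lines):
        if re.search(heading_pattern, line):
            block = []
            for following in lines[i + 1:]:
                block.append(following)
                if end_marker in following:
                    break
            return block
    # Heading not found
    return []

# Made-up segment for illustration
sample = "\n".join([
    '<h2 class="heading">Assessment</h2>',
    '<p>Exam: 60%</p>',
    '<p>Assignments: 40%</p></div>',
    '<p>Not part of the section</p>',
])

block = lines_below(sample, r">Assessment</h2>")
print(block)  # ['<p>Exam: 60%</p>', '<p>Assignments: 40%</p></div>']
```

Each extraction function would then only need to apply its own regex to the returned lines and fall back to NA when the list is empty.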

Here I define a function to extract the outcomes of each unit, clean any hyperlinks and return them as a list. I have assumed and verified that the outcomes in the main text file are located under the Outcomes section and are enclosed in li-tags. The first regex isolates the Outcomes section, the second extracts each list item, and a third removes any remaining tags so that they are not reflected in the outcome list.

In [7]:
def find_outcome(segment):
    out = []
    regex1 = re.compile(r"(?:>Outcomes</h2>)(.*?</div>)", re.DOTALL)
    temp = regex1.findall(segment)
    if temp == []:
        out.append('NA')
        return(out)
    regex2 = re.compile(r"(?:<li>)(.*?)(?:</li>)", re.DOTALL)
    temp = regex2.findall(temp[0])
    if temp == []:
        out.append('NA')
        return(out)
    for i in range(0, len(temp)):
        remove = re.findall(r"(?:<.*?>)|(?:<p>)|(?:</p>)", temp[i])
        for j in range(0, len(remove)):
            temp[i] = temp[i].replace(remove[j], "")
    return(temp)
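
The three stages can be traced on a small made-up Outcomes section. The markup below is an assumption for illustration, and re.sub() is used for the cleaning step as a shorter stand-in for the findall-and-replace loop.

```python
import re

# Made-up Outcomes section with nested p-tags and an inline tag
sample = """<h2 class="heading">Outcomes</h2>
<div>
<ol><li><p>Parse semi-structured text.</p></li>
<li>Write <em>valid</em> XML.</li></ol>
</div>"""

# Stage 1: everything between the heading and the first closing div-tag
section = re.findall(r"(?:>Outcomes</h2>)(.*?</div>)", sample, re.DOTALL)
# Stage 2: the content of each list item
items = re.findall(r"(?:<li>)(.*?)(?:</li>)", section[0], re.DOTALL)
# Stage 3: strip any tags left inside the items
outcomes = [re.sub(r"<.*?>", "", item) for item in items]

print(items)     # ['<p>Parse semi-structured text.</p>', 'Write <em>valid</em> XML.']
print(outcomes)  # ['Parse semi-structured text.', 'Write valid XML.']
```

The re.DOTALL flag matters in the first two stages because the section and its list items can span several lines.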

Next I have defined a function to extract the chief examiner(s) of each unit. Looking through the data, I have assumed and verified that the names sit within a-tags in the paragraph that follows the Chief examiner(s) heading. As before, the hyperlinks to the faculty members' pages would cause issues, so I remove any remaining tags with a regular expression. Where no name is found the function returns TBA.

In [8]:
def find_chief(segment):
    chief = []
    regex = re.compile(r"(?:>Chief examiner\(s\)</p>)(.*?</p>)", re.DOTALL)
    temp = regex.findall(segment)
    #print(temp[0])
    if temp == []:
        chief.append('TBA')
        return(chief)
    temp = re.findall(r"(?:>)(.*?)(?:</a>)", temp[0])
    if temp == []:
        chief.append('TBA')
        return(chief)
    for i in range(0, len(temp)):
        remove = re.findall(r"(?:<.*?>)|(?:<p>)|(?:</p>)", temp[i])
        for j in range(0, len(remove)):
            temp[i] = temp[i].replace(remove[j], "")
    chief = temp
    return(chief)
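
The example below, on a made-up paragraph with a placeholder name and URL, shows why the final cleaning step is needed: the pattern that captures up to each closing a-tag starts at the first `>` it finds, so the capture still contains the opening a-tag.

```python
import re

# Made-up Chief examiner paragraph; the name and URL are placeholders
sample = ('<p class="heading">Chief examiner(s)</p>\n'
          '<p><a href="http://example.org/staff/1">Dr Jane Doe</a></p>')

# The paragraph that follows the heading
paragraph = re.findall(r"(?:>Chief examiner\(s\)</p>)(.*?</p>)", sample, re.DOTALL)
# Everything from the first ">" up to each closing a-tag
links = re.findall(r"(?:>)(.*?)(?:</a>)", paragraph[0])
# Strip the tags that were captured along with the name
names = [re.sub(r"<.*?>", "", link) for link in links]

print(links)  # ['<a href="http://example.org/staff/1">Dr Jane Doe']
print(names)  # ['Dr Jane Doe']
```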

# 3. XML and JSON builders

In the following section I have developed two functions:

* xtag_builder()
* jtag_builder()

The aim of these two functions is to take a dictionary of information for one unit as argument and build the structure required for the XML and JSON files respectively, without using any Python libraries to do so.

The output of xtag_builder() is a list of strings, one per element of the unit, and the output of jtag_builder() is a dictionary structured to match the requested JSON format.

In [9]:
# defining a function to take the dictionary containing each unit's information and building the xml required tags for each one
def xtag_builder(dic):
    
    unit_code = dic['unit_id']
    title = dic['title']
    synopsis = dic['synopsis']
    prereq = dic['pre']
    prohib = dic['pro']
    req = dic['requi']
    outcome = dic['out']
    chief = dic['chief']
    
    unit_id = f"<unit id='{unit_code}'>"
    
    title = "<title> " + title + "</title>"
    
    synopsis = "<synopsis> " + synopsis + "</synopsis>"
    
    prerequisites = ""
    if prereq == ['NA']:
        prerequisites = "<pre_requistics> NA </pre_requistics>"
    else:
        for i in range(0, len(prereq)):
            prerequisites = prerequisites + "<pre_requistic>" + prereq[i] + "</pre_requistic>"
            if i == (len(prereq)-1):
                prerequisites = "<pre_requistics>" + "\n" + prerequisites + "</pre_requistics>"
    
    prohibitions = ""
    if prohib == ['NA']:
        prohibitions = "<prohibisions> NA </prohibisions>"
    else:
        for i in range(0, len(prohib)):
            prohibitions = prohibitions + "<prohibision>" + prohib[i] + "</prohibision>"
            if i == (len(prohib)-1):
                prohibitions = "<prohibisions>" + "\n" + prohibitions + "</prohibisions>"
    
    requirements = ""
    if req == ['NA']:
        requirements = "<requirements> NA </requirements>"
    else:
        for i in range(0, len(req)):
            requirements = requirements + "<requirement>" + req[i] + "</requirement>"
            if i == (len(req)-1):
                requirements = "<requirements>" + "\n" + requirements + "</requirements>"
    
    outcomes  = ""
    if outcome == ['NA']:
        outcomes = "<outcomes> NA </outcomes>"
    else:
        for i in range(0, len(outcome)):
            outcomes = outcomes + "<outcome>" + outcome[i] + "</outcome>"
            if i == (len(outcome)-1):
                outcomes = "<outcomes>" + "\n" + outcomes + "</outcomes>"
    
    examiner = ""
    if chief == ['TBA']:
        examiner = "<chief_examiners> TBA </chief_examiners>"
    else:
        for i in range(0, len(chief)):
            examiner = examiner + "<chief_examiner>" + chief[i] + "</chief_examiner>"
            if i == (len(chief)-1):
                examiner = "<chief_examiners>" + "\n" + examiner + "</chief_examiners>"
    
    output = [unit_id, title, synopsis, prerequisites, prohibitions, requirements, outcomes, examiner]
    return(output)
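
The five list-valued elements in xtag_builder() follow one pattern, so they could be produced by a single helper. The sketch below is a hypothetical refactoring, not the code used above. It also escapes the characters `&`, `<` and `>` with escape() from the standard library module xml.sax.saxutils, since text containing them would otherwise make the XML unparsable; if library use is not permitted, chained str.replace() calls could do the same job.

```python
from xml.sax.saxutils import escape

def build_group(parent, child, items, placeholder="NA"):
    # Hypothetical helper: wrap each item in <child> tags inside one <parent> element
    if items == [placeholder]:
        return f"<{parent}> {placeholder} </{parent}>"
    # escape() replaces &, < and > so the text cannot break the XML
    children = "".join(f"<{child}>{escape(item)}</{child}>" for item in items)
    return f"<{parent}>\n{children}</{parent}>"

print(build_group("outcomes", "outcome", ["Research & development", "Use a < b"]))
print(build_group("chief_examiners", "chief_examiner", ["TBA"], placeholder="TBA"))
```

The element names are passed in as arguments, so the spelling required by the specification stays under the caller's control.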

In [10]:
def jtag_builder(dic):
    
    unit_code = dic['unit_id']
    title = dic['title']
    synopsis = dic['synopsis']
    prereq = dic['pre']
    prohib = dic['pro']
    req = dic['requi']
    outcome = dic['out']
    chief = dic['chief']
    
    unit_dic = {"@id": unit_code, "title": title, "synopsis": synopsis}
    
    if prereq == ['NA']:
        unit_dic["pre_requistics"] = "NA"
    elif (len(prereq) == 1) and (prereq != ['NA']):
        unit_dic["pre_requistics"] = {"pre_requistic": prereq[0]}
    else:
        unit_dic["pre_requistics"] = {"pre_requistic": prereq}
    
    if prohib == ['NA']:
        unit_dic["prohibisions"] = "NA"
    elif (len(prohib) == 1) and (prohib != ['NA']):
        unit_dic["prohibisions"] = {"prohibision": prohib[0]}
    else:
        unit_dic["prohibisions"] = {"prohibision": prohib}
        
    if req == ['NA']:
        unit_dic["requirements"] = "NA"
    elif (len(req) == 1) and (req != ['NA']):
        unit_dic["requirements"] = {"requirement": req[0]}
    else:
        unit_dic["requirements"] = {"requirement": req}
        
    if outcome == ['NA']:
        unit_dic["outcomes"] = "NA"
    elif (len(outcome) == 1) and (outcome != ['NA']):
        unit_dic["outcomes"] = {"outcome": outcome[0]}
    else:
        unit_dic["outcomes"] = {"outcome": outcome}
    
    if chief == ['TBA']:
        unit_dic["chief_examiners"] = "TBA"
    elif (len(chief) == 1) and (chief != ['TBA']):
        unit_dic["chief_examiners"] = {"chief_examiner": chief[0]}
    else:
        unit_dic["chief_examiners"] = {"chief_examiner": chief}
    
    return(unit_dic)    
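
The same three cases repeat for every list-valued key in jtag_builder(): a placeholder string, a single value, or a list of values. A compact way to express them, shown here as a hypothetical helper with made-up unit data, could look like this:

```python
import json

def group_value(items, child, placeholder="NA"):
    # Hypothetical helper mirroring the three cases in jtag_builder()
    if items == [placeholder]:
        return placeholder            # nothing found: plain placeholder string
    if len(items) == 1:
        return {child: items[0]}      # one value: stored as a string
    return {child: items}             # several values: stored as a list

# Made-up unit for illustration
unit = {
    "@id": "ABC1234",
    "pre_requistics": group_value(["NA"], "pre_requistic"),
    "outcomes": group_value(["Parse text."], "outcome"),
    "prohibisions": group_value(["XYZ5678", "XYZ5679"], "prohibision"),
}

print(json.dumps(unit, indent=5))
```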

# 4. Main Program section

Here I use the above functions to deliver the two requested output files. I first read the input text file and split it into sections, each containing the information for one unit. The split uses the class tag that appears consistently throughout the file and marks the start of a unit.

I then pass each section through my functions and generate a dictionary containing all required information for that unit. This dictionary is passed to xtag_builder() to build the required XML tags for the unit, and I write the output to a file named 30035678.xml.

Finally I pass the same dictionary to jtag_builder() to create the JSON structure, and I use the json.dump() method from the json library to write the information to a JSON file with an indent level of 5 to match the indentation of the provided sample.

In [11]:
import json
# reading the input file and breaking it to sections for each unit
with open("unit_data.txt", "r") as raw_file:
    data = raw_file.read()
section = data.split("""<div class="hbk-banner-box">""")

unit_list = []
# opening the xml file in write mode so that re-running the cell replaces the file instead of appending to it
with open("30035678.xml", "w") as output:
    output.write("""<?xml version="1.0" encoding="UTF-8" ?>\n<units>\n""")
    # passing each unit section through the functions to produce each unit's required tags and produce a dictionary to store them
    for i in range(1, len(section)):
        unit_code = find_unit(section[i])
        synop = find_synop(section[i])
        prereq = find_prerequisite(section[i])
        prohib = find_prohib(section[i])
        req = find_req(section[i])
        outcome = find_outcome(section[i])
        chief = find_chief(section[i])

        sec_dic = {'unit_id': unit_code[0], 'title': unit_code[1], 'synopsis': synop, 'pre': prereq, 'pro': prohib, 'requi': req, 'out': outcome, 'chief': chief}

        # building the xml tags for the unit and writing them, followed by the closing unit tag
        xtags = xtag_builder(sec_dic)
        output.write("\n".join(xtags) + "\n</unit>\n")

        # creating the tags for the json file and building the inner list required for the final json file
        jtags = jtag_builder(sec_dic)
        unit_list.append(jtags)
    # closing the root element once all units are written
    output.write("</units>\n")

# creating a json file and writing the extracted json tags to the file
json_form = {"units": {"unit": unit_list}}
with open("30035678.json", "w") as outfile:
    json.dump(json_form, outfile, indent=5)
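
The loop above starts at index 1 because str.split() returns everything before the first marker as element 0, and that part does not belong to any unit. The small example below uses made-up file content to show this.

```python
# The marker that opens each unit in the input file
marker = '<div class="hbk-banner-box">'

# Made-up file content: some page header followed by two units
data = ("page header\n"
        + marker + "\nunit one markup\n"
        + marker + "\nunit two markup\n")

sections = data.split(marker)
units = sections[1:]  # skip the text before the first marker

print(len(sections))  # 3
print(units)          # ['\nunit one markup\n', '\nunit two markup\n']
```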

# 5. Summary

I have created both required files according to the specification. I then tested both files using online format checkers to make sure they are parsable.

Please refer to
* http://xmlgrid.net/validator.html (for validating the XML file created)
* https://www.freeformatter.com/json-validator.html (for validating the JSON file created)
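
The same well-formedness check can also be run locally with the standard library, as a complement to the online validators. The sketch below parses small made-up samples shaped like the two output files; it only confirms that the text is parsable and does not check it against the specification.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample shaped like the XML output
xml_text = """<?xml version="1.0" encoding="UTF-8" ?>
<units>
<unit id='ABC1234'>
<title> Sample unit title</title>
<outcomes>
<outcome>Parse text.</outcome></outcomes>
</unit>
</units>
"""

# fromstring() raises xml.etree.ElementTree.ParseError on malformed XML
root = ET.fromstring(xml_text.encode("utf-8"))
unit_ids = [unit.get("id") for unit in root.findall("unit")]

# Made-up sample shaped like the JSON output
json_text = json.dumps({"units": {"unit": [{"@id": "ABC1234"}]}}, indent=5)

# loads() raises json.JSONDecodeError on malformed JSON
parsed = json.loads(json_text)

print(root.tag, unit_ids)                 # units ['ABC1234']
print(parsed["units"]["unit"][0]["@id"])  # ABC1234
```

For the real files, ET.parse("30035678.xml") and json.load() on the opened 30035678.json file would perform the same checks.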
