# Task 1

Date: 30.03.2019

Environment: Python 3.6.8 and Anaconda 4.6.7 (64-bit)

Libraries used:
* re 2.2.1 (for regular expressions, included in Anaconda Python 3.6)
* json 2.6  (for working with JSON data, included in Anaconda Python 3.6)

In [19]:
from IPython.core.display import HTML
css = open('style/style-table.css').read() + open('style/style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

## 1. Introduction
This report describes the logic and implementation of parsing text files and pre-processing text for the Unit Info task using Python. Its main purpose is to document the methodology and the process used to solve the problem.

>The required tasks are the following:
>1. Extract the data for each unit from the `.txt` file.
>2. Transform the data into a `.json` file.
>3. Transform the data into an `.xml` file.

>The processes implemented in this report are:
>1. Understand the format and hierarchical structure of the output file.
>2. Load the input data and split it into unit blocks.
>3. Extract the data for each keyword and store it in the correct format.
>4. Write the results into `.json` and `.xml` files.


## 2.  Import libraries 

In [2]:
import re
import json

## 3. Understand the format and hierarchical structure of output file

> * Read output json file 
> * Explore the structure of the output
> * Use output file to get the right type and format of data container 
> * Visualize the whole structure and logic of storing different info

In [3]:
with open('style/test_output.json', 'r') as f:
    test_output = json.load(f)
print(f'Top-level is a dict with one key: {test_output.keys()}')
print(f"2nd-level and 3-rd level are:  {type(test_output['units']['unit'][0])}")

Top-level is a dict with one key: dict_keys(['units'])
2nd-level and 3-rd level are:  <class 'dict'>


In [4]:
# explore the type of data container and string format for each keyword
check_type_dict = {}
for i in range(100):
    for key,value in test_output['units']['unit'][i].items():
        # setdefault creates the list the first time a key is seen,
        # so the type of the first value is recorded as well
        check_type_dict.setdefault(key, []).append(str(type(value)))
            
for key,value in check_type_dict.items():
    check_type_dict[key] = set(check_type_dict.get(key))
check_type_dict

{'@id': {"<class 'str'>"},
 'title': {"<class 'str'>"},
 'synopsis': {"<class 'str'>"},
 'pre_requistics': {"<class 'dict'>", "<class 'str'>"},
 'prohibisions': {"<class 'dict'>", "<class 'str'>"},
 'requirements': {"<class 'dict'>", "<class 'str'>"},
 'outcomes': {"<class 'dict'>"},
 'chief_examiners': {"<class 'dict'>", "<class 'str'>"}}

<img src = "style/img_2_json_structure.png" height = "500" width = "900" style="float: left;">
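
> The type summary above suggests the shape of a single unit record. A minimal sketch of that shape follows. Every value in it (unit codes, title, outcomes) is invented for illustration and is not taken from the data; only the keys and the container types follow the summary.

```python
import json

# Hypothetical record: all values below are made up for illustration.
sample_unit = {
    '@id': 'ABC1234',
    'title': 'Example unit',
    'synopsis': 'NA',                                          # plain string
    'pre_requistics': {'pre_requistic': 'ABC1000'},            # one item -> string
    'prohibisions': {'prohibision': ['ABC2000', 'ABC3000']},   # several items -> list
    'requirements': 'NA',                                      # nothing found -> 'NA'
    'outcomes': {'outcome': ['First outcome', 'Second outcome']},
    'chief_examiners': 'TBA',                                  # nothing found -> 'TBA'
}
sample_output = {'units': {'unit': [sample_unit]}}
print(json.dumps(sample_output, indent=2))
```

> A keyword therefore holds either a plain string (`NA` or `TBA`) or a dict whose single key maps to a string or a list.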

## 4. Loading input data and split it into different unit blocks

> * As a first step, the `.txt` file is loaded as a single string containing all the info.
> * After exploring the test input file and the real data file, we found that the **<font color=blue>navigation links</font>** section can be used as a separator to split the whole data string into blocks.
> * We use **<font color=blue>&lt;nav</font>** and **<font color=blue>&lt;/nav&gt;</font>** to locate the navigation section that serves as the separator.
> * After splitting, we drop the first element, which is blank.

In [5]:
# the with-statement closes the file even if reading fails
with open('style/30086434.txt', 'r') as f:
    data_str = f.read()

In [6]:
nav_pattern = re.compile(r'<nav.*?</nav>',re.S)
nav_str_list = re.findall(nav_pattern,data_str)
# Check whether the nav string is unique for each unit
print(f'There are {len(nav_str_list)} nav section between each unit block.')
print(f'The # of unique section is {len(set(nav_str_list))}.')
print(f'Section of navigation: \n------\n {nav_str_list[0]}\n------')

There are 400 nav section between each unit block.
The # of unique section is 1.
Section of navigation: 
------
 <nav class="breadcrumbs mobile-hidden" id="breadcrumbs">
<p class="visuallyhidden" id="breadcrumb__label">You are here:</p>
<ul aria-labelledby="breadcrumb__label" class="breadcrumbs__list">
<li class="breadcrumbs__item home">
<a class="breadcrumbs__link" href="https://www.monash.edu/">Home</a>
<span aria-hidden="true" class="breadcrumbs__divider">|</span>
</li>
<li class="breadcrumbs__item">
<a class="breadcrumbs__link" href="https://www.monash.edu/study">Study</a>
<span aria-hidden="true" class="breadcrumbs__divider">|</span>
</li>
<li class="breadcrumbs__item">
<a class="breadcrumbs__link" href="/pubs/2019handbooks/">2019 Handbooks</a>
<span aria-hidden="true" class="breadcrumbs__divider">|</span>
</li>
<li class="breadcrumbs__item breadcrumbs__current"><a class="breadcrumbs__link" href="/pubs/2019handbooks/units">Units</a></li>
</ul>
</nav>
------


In [7]:
# Split the str into different unit blocks and drop the first blank one
nav_str = nav_str_list[0]
unit_block_split = data_str.split(nav_str)
unit_block_list = unit_block_split[1:]
print(f'There are {len(unit_block_list)} unit blocks in the whole dataset.')

There are 400 unit blocks in the whole dataset.
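
> The splitting step can be tried on a toy string. This is a minimal sketch: the nav markup and block texts below are invented, and the names `toy_nav`, `toy_data` and `toy_blocks` are hypothetical.

```python
import re

# Toy input: two unit blocks, each preceded by an identical nav section.
toy_nav = '<nav class="x">\n<ul><li>Home</li></ul>\n</nav>'
toy_data = toy_nav + 'unit one' + toy_nav + 'unit two'

# re.S lets '.' match newlines, so the lazy pattern spans the whole nav section.
toy_nav_list = re.findall(r'<nav.*?</nav>', toy_data, flags=re.S)

# Splitting on a single string is only safe if every nav section is identical.
assert len(set(toy_nav_list)) == 1

# The text before the first separator is empty, so it is dropped.
toy_blocks = toy_data.split(toy_nav_list[0])[1:]
print(toy_blocks)  # ['unit one', 'unit two']
```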


## 5. Extract info for each keyword and store into correct format

> For each keyword, the function completes the following sub-tasks:
1. Extract the data
2. Write it into the correct data container (dict, list or string) for the `.json` output
3. Write it into a string for the `.xml` output
  * Some special characters cannot be written directly in XML text.<br> 
    They are written as entities introduced by XML's escape character `&`.<br> 
    XML representations for **<font color=blue>&amp;, &gt;, &lt;</font>**  --->  `&amp;`, `&gt;`, `&lt;` (the code below writes `&` as the equivalent numeric reference `&#38;`).<br> 
    Therefore, we use `re.sub` to do the replacement in both directions.<br> 

In [8]:
# function to decode XML entities back to plain characters for the json data
def sub_xml_cha(s):
    s = re.sub('&gt;','>',s)
    s = re.sub('&lt;','<',s)
    # decode '&amp;' last so that '&amp;lt;' becomes '&lt;' rather than '<'
    s = re.sub('&amp;','&',s)
    return s

In [9]:
# function to transform into special character for xml file
def to_xml_cha(s):
    s = re.sub('&','&#38;',s)
    s = re.sub('>','&gt;',s)
    s = re.sub('<','&lt;',s)
    return s
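
> As a cross-check, the standard library offers the same two conversions in `xml.sax.saxutils`. The sketch below is an alternative to the two helpers above, not the method used in this report, and the sample string is invented. Note that `escape` writes `&` as `&amp;`, whereas `to_xml_cha` writes `&#38;`; both are valid XML.

```python
from xml.sax.saxutils import escape, unescape

raw = 'R&D: x < y > z'          # made-up sample text
escaped = escape(raw)           # '&' is replaced first, so '&lt;' is not escaped twice
restored = unescape(escaped)    # '&amp;' is replaced last, so '&amp;lt;' stays '&lt;'

print(escaped)
print(restored == raw)
```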

### 5.1 Function for Unit code & Title

> We extract the unit code and title together because they are located in the same section.<br>
**<font color=blue>Special case:</font>** some unit codes have a **4-letter prefix**, which the original pattern cannot capture.<br>
**<font color=blue>Regex adjustment:</font>**  `[A-Z]{3}\d{4} ---> [A-Z]{3,4}\d{4}`

In [10]:
def unit_code_title_extraction(unit_dict, unit_block, unit_str):
    # unitcode & title extraction
    pattern_unitc_title = re.compile(r'<span class="unitcode">(?P<unitc>[A-Z]{3,4}\d{4})</span> - (?P<title>.*)<span',re.M)
    # search once and read both named groups from the same match
    code_title_search = re.search(pattern_unitc_title,unit_block)
    unitc = code_title_search.group('unitc')
    title = code_title_search.group('title')
    title = sub_xml_cha(title)
    # add to the dict of each unit
    unit_dict['@id'] = unitc
    unit_dict['title'] = title
    # for xml string
    title_xml = to_xml_cha(title)
    unit_str += f"<unit id='{unitc}'>\n"
    unit_str += f"<title> {title_xml}</title>\n"
    return (unit_dict, unit_str)
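
> The effect of the quantifier change can be checked in isolation. A minimal sketch, using two invented unit codes (they are not taken from the data):

```python
import re

# Hypothetical codes, invented for illustration only.
codes = ['ABC1234', 'ABCD1234']

narrow = re.compile(r'[A-Z]{3}\d{4}')     # exactly three letters
wide = re.compile(r'[A-Z]{3,4}\d{4}')     # three or four letters

for code in codes:
    print(code, bool(narrow.fullmatch(code)), bool(wide.fullmatch(code)))
```

> Only the adjusted pattern accepts the code with a four-letter prefix.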

### 5.2 Function for synopsis

> The `<p>` and `</p>` tags define a paragraph in HTML. <br>
We use the regex anchors `>Synopsis</h2>\n<div>\n<p>` and `</p>` to locate the synopsis between them.<br>
**<font color=blue>Special case:</font>** units without a synopsis <br>
**<font color=blue>Logic adjustment:</font>** <br>
    1. First use `re.search` to match the pattern 
    2. If there is a match ---> store it in the dict 
    3. If there is no match ---> store `NA`    

In [11]:
# synopsis extraction
def synopsis_extraction(unit_dict, unit_block, unit_str):
    pattern_synopsis = re.compile(r'(?<=>Synopsis</h2>\n<div>\n<p>)(.*?)(?=</p>)',re.S)
    synopsis_search = re.search(pattern_synopsis,unit_block)
    if synopsis_search:
        synopsis = synopsis_search.group()
        # remove tags first, then decode entities, so a decoded '<' is not mistaken for a tag
        pattern_link = re.compile(r'(?:(<.*?>))',re.S)
        synopsis = re.sub(pattern_link,'',synopsis)
        synopsis = sub_xml_cha(synopsis)
        # for xml string
        synopsis_xml = to_xml_cha(synopsis)
        unit_str += f"<synopsis> {synopsis_xml}</synopsis>\n"
    else:
        synopsis = 'NA'
        unit_str += f"<synopsis> NA </synopsis>\n"
    unit_dict['synopsis'] = synopsis
    return (unit_dict, unit_str)
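
> The look-behind / look-ahead pattern and the `NA` fallback can be seen on two small fragments. This is a sketch only: both HTML fragments are invented, and `toy_synopsis` is a hypothetical helper that mirrors the logic above without building the dict or the XML string.

```python
import re

# Hypothetical HTML fragments, invented for illustration.
with_synopsis = ('<h2 class="x">Synopsis</h2>\n<div>\n'
                 '<p>Covers <a href="#">regex</a> basics.</p>\n</div>')
without_synopsis = '<h2 class="x">Outcomes</h2>\n<div>\n<p>None listed.</p>\n</div>'

# The look-behind and look-ahead anchor the match without being part of it.
pattern = re.compile(r'(?<=>Synopsis</h2>\n<div>\n<p>)(.*?)(?=</p>)', re.S)

def toy_synopsis(block):
    found = pattern.search(block)
    if not found:
        return 'NA'                      # no synopsis section in this block
    return re.sub(r'<.*?>', '', found.group(), flags=re.S)  # drop link tags

print(toy_synopsis(with_synopsis))       # Covers regex basics.
print(toy_synopsis(without_synopsis))    # NA
```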

### 5.2 Function for pre_requistics and prohibisions

> We handle these two together because their data is located in the same section: <br>
1. Search for the keyword in the unit block; if it is absent, store `NA`, otherwise continue
2. Match the paragraph that holds the info (prerequisites, co-requisites or prohibitions)
3. Within the matched paragraph, match the unit code pattern 
4. If more than one unit code is returned, remove the duplicates <br>
>The diagram below shows the basic logic applied here.<br>

> **<font color=blue>Special consideration:</font>** the sample output only captures the first `<p>...</p>` section <br>
**<font color=blue>Potential improvement:</font>** adjust the regex in step 2 to capture a wider range of paragraphs<br>    
<img src = "style/img_3_procedure.png" height = "300" width = "500" style="float: left;">

In [12]:
# function to capture unit codes for one keyword (Prerequisites, Co-requisites or Prohibitions)
def pre_pro_capture(keyword,unit_block):
    # first paragraph that follows the "<keyword></p>" label
    pattern_paragraph = re.compile(r'(?<={}</p>)(.*?)(?=</p>)'.format(keyword),re.S)
    paragraph_search = re.search(pattern_paragraph,unit_block)
    # guard against the keyword appearing in the block without the expected label markup
    if paragraph_search:
        pattern_code = re.compile(r'[A-Z]{3,4}\d{4}',re.S)
        code_list = re.findall(pattern_code,paragraph_search.group())
        code_list_uniq = list(set(code_list))
    else:
        code_list_uniq = []
    return code_list_uniq

In [13]:
# pre_requistics & prohibisions extraction
def pre_pro_extraction(unit_dict, unit_block, unit_str):
    # combine prerequisites and co-requisites info 
    pre_list = pre_pro_capture('Prerequisites',unit_block)+ pre_pro_capture('Co-requisites',unit_block)
    pre_list = list(set(pre_list)) # remove duplicate units
    pro_list = pre_pro_capture('Prohibitions',unit_block) # extract prohibitions
    
    # write the combined prerequisites into the dict and the xml string
    if not pre_list:
        unit_dict['pre_requistics'] = 'NA'
        unit_str += f"<pre_requistics> NA </pre_requistics>\n"
    elif len(pre_list)==1:
        unit_dict['pre_requistics'] = {'pre_requistic':pre_list[0]}
        unit_str += f"<pre_requistics>\n"\
                    f"<pre_requistic>{pre_list[0]}</pre_requistic>"\
                    f"</pre_requistics>\n"
    else:
        unit_dict['pre_requistics'] = {'pre_requistic':pre_list}
        unit_str += f"<pre_requistics>\n"
        for each_pre in pre_list:
            unit_str += f"<pre_requistic>{each_pre}</pre_requistic>"
        unit_str += f"</pre_requistics>\n"
    
    # apply the writing logic above to prohibition
    if not pro_list:
        unit_dict['prohibisions'] = 'NA'
        unit_str += f"<prohibisions> NA </prohibisions>\n"
    elif len(pro_list)==1:
        unit_dict['prohibisions'] = {'prohibision':pro_list[0]}
        unit_str += f"<prohibisions>\n"\
                    f"<prohibision>{pro_list[0]}</prohibision>"\
                    f"</prohibisions>\n"
    else:
        unit_dict['prohibisions'] = {'prohibision':pro_list}
        unit_str += f"<prohibisions>\n"
        for each_pro in pro_list:
            unit_str += f"<prohibision>{each_pro}</prohibision>"
        unit_str += f"</prohibisions>\n"
    return (unit_dict, unit_str)
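
> The same three-way branch (nothing found, one item, several items) recurs in the extraction functions of this section. A minimal sketch of how it could be factored into one helper is shown below. `build_container` is a hypothetical name, not part of the notebook, and it uses `xml.sax.saxutils.escape` (which writes `&amp;` instead of `&#38;`) as an assumption in place of `to_xml_cha`.

```python
from xml.sax.saxutils import escape

def build_container(items, plural, singular, empty='NA'):
    """Return (json_value, xml_string) for a list of extracted strings."""
    if not items:
        return empty, f"<{plural}> {empty} </{plural}>\n"
    # One item is stored as a string, several items as a list.
    json_value = {singular: items[0] if len(items) == 1 else items}
    children = ''.join(f"<{singular}>{escape(i)}</{singular}>" for i in items)
    return json_value, f"<{plural}>\n{children}</{plural}>\n"

# 'ABC1234' is an invented unit code.
value, xml = build_container(['ABC1234'], 'prohibisions', 'prohibision')
print(value)
print(xml)
```

> Passing `empty='TBA'` would cover the chief examiner case.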

### 5.3 Function for requirement

> We again use the `<p>` and `</p>` paragraph tags, together with the keyword `Assessment`, to locate the paragraphs first. <br>
> The remaining steps are similar to those above. <br>
> **<font color=blue>Special case:</font>** links are mixed with the desired data, e.g. `<...>` <br>
> **<font color=blue>Regex adjustment:</font>** use `(?:(<.*?>))` with `re.sub` to remove all links<br> 


In [14]:
def req_extraction(unit_dict, unit_block, unit_str):
    # Requirements extraction
    pattern_req_paragraph = re.compile(r'(?<=>Assessment</h2>)(.*?)(?=</div>)',re.S)
    req_paragraph_search = re.search(pattern_req_paragraph,unit_block)
    if req_paragraph_search:
        req_paragraph = req_paragraph_search.group()
        pattern_req = re.compile(r'(?<=<p>)(.*?)(?=</p>)',re.S)
        req_list = re.findall(pattern_req,req_paragraph)
        pattern_link = re.compile(r'(?:(<.*?>))',re.S)
        req_list = [re.sub(pattern_link,'',i) for i in req_list]
        req_list = [sub_xml_cha(i) for i in req_list]
        if not req_list:
            unit_dict['requirements'] = 'NA'
            unit_str += f"<requirements> NA </requirements>\n"
        elif len(req_list)==1:
            unit_dict['requirements'] = {'requirement':req_list[0]}
            # for xml string
            req_xml = to_xml_cha(req_list[0])
            unit_str += f"<requirements>\n"\
                        f"<requirement>{req_xml}</requirement>"\
                        f"</requirements>\n"
        else:
            unit_dict['requirements'] = {'requirement':req_list}
            # for xml string
            unit_str += f"<requirements>\n"
            for each_req in req_list:
                req_xml = to_xml_cha(each_req)
                unit_str += f"<requirement>{req_xml}</requirement>"
            unit_str += f"</requirements>\n"
    else:
        unit_dict['requirements'] = 'NA'
        unit_str += f"<requirements> NA </requirements>\n"
    return (unit_dict, unit_str)
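
> In `req_extraction` the tags are removed before the entities are decoded, and the order matters. A minimal sketch on an invented string; `decode_entities` is a hypothetical stand-in for `sub_xml_cha`:

```python
import re

tag_pattern = re.compile(r'<.*?>', re.S)

def decode_entities(s):
    # '&amp;' is decoded last so that '&amp;lt;' ends up as '&lt;'
    return s.replace('&gt;', '>').replace('&lt;', '<').replace('&amp;', '&')

# Hypothetical assessment text, invented for illustration.
raw = 'Exam &lt; 50% fails; <a href="#">hurdle</a> mark &gt; 40%'

strip_then_decode = decode_entities(tag_pattern.sub('', raw))
decode_then_strip = tag_pattern.sub('', decode_entities(raw))

print(strip_then_decode)   # Exam < 50% fails; hurdle mark > 40%
print(decode_then_strip)   # Exam hurdle mark > 40%
```

> Decoding first turns `&lt;` into a literal `<`, which the tag pattern then treats as the start of a tag and deletes together with the text that follows it.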

### 5.4 Function for outcomes

> Again, use the keyword `Outcomes` to locate the section first; each outcome sits in a `<li>...</li>` item. <br>
> Then extract the data and remove the links `<...>`. <br>

In [15]:
def outcome_extraction(unit_dict, unit_block, unit_str):
    # Outcomes extraction
    pattern_outc_paragraph = re.compile(r'(?<=>Outcomes</h2>)(.*?)(?=</div>)',re.S)
    outc_paragraph_search = re.search(pattern_outc_paragraph,unit_block)
    if outc_paragraph_search:
        outc_paragraph = outc_paragraph_search.group()
        pattern_outc = re.compile(r'(?<=<li>)(.*?)(?=</li>)',re.S)
        pattern_link = re.compile(r'(?:(<.*?>))',re.S)
        outc_list = re.findall(pattern_outc,outc_paragraph)
        # remove tags first, then decode entities (same order as in req_extraction)
        outc_list = [re.sub(pattern_link,'',i) for i in outc_list]
        outc_list = [sub_xml_cha(i) for i in outc_list]
        if not outc_list:
            unit_dict['outcomes'] = 'NA'
            unit_str += f"<outcomes> NA </outcomes>\n"
        elif len(outc_list)==1:
            unit_dict['outcomes'] = {'outcome':outc_list[0]}
            # for xml string
            outc_xml = to_xml_cha(outc_list[0])
            unit_str += f"<outcomes>\n"\
                        f"<outcome>{outc_xml}</outcome>"\
                        f"</outcomes>\n"
        else:
            unit_dict['outcomes'] = {'outcome':outc_list}
            # for xml string
            unit_str += f"<outcomes>\n"
            for each_outc in outc_list:
                outc_xml = to_xml_cha(each_outc)
                unit_str += f"<outcome>{outc_xml}</outcome>"
            unit_str += f"</outcomes>\n"
    else:
        #print(unit_dict['@id'],'outc')
        unit_dict['outcomes'] = 'NA'
        unit_str += f"<outcomes> NA </outcomes>\n"
    return (unit_dict, unit_str)

### 5.5 Function for Chief examiner

>1. Use the `<p>` and `</p>` paragraph tags and the keyword `Chief examiner(s)` to locate the paragraph first.<br>
>2. Use regex to extract the names.
>3. Store `TBA` instead of `NA` when no examiner is found.<br>
> **<font color=blue>Special consideration:</font>** names containing `.` or `-`, e.g.<br> 
Assoc. Professor Bernard Flynn<br>
Dr Rose-Marie Bezuidenhout<br>
> **<font color=blue>Regex improvement:</font>** `[A-Za-z ]*` ---> `[A-Z\'a-z-\. ]*`<br> 

In [16]:
def ce_extraction(unit_dict, unit_block, unit_str):
    # Chief_examiners extraction
    pattern_ce_pragraph = re.compile(r'(?<=Chief examiner\(s\)</p>).*?(?=</p>)',re.S)
    ce_pragraph_search = re.search(pattern_ce_pragraph,unit_block)
    if ce_pragraph_search:
        ce_pragraph = ce_pragraph_search.group()
        pattern_ce = re.compile(r"(?<=>)([A-Z\'a-z-\. ]*)(?=</a>)",re.S)
        ce_list = re.findall(pattern_ce,ce_pragraph)
        if not ce_list:
            unit_dict['chief_examiners'] = 'TBA'
            unit_str += f"<chief_examiners> TBA </chief_examiners>\n"
        elif len(ce_list)==1:
            unit_dict['chief_examiners'] = {'chief_examiner':ce_list[0]}
            unit_str += f"<chief_examiners>\n"\
                    f"<chief_examiner>{ce_list[0]}</chief_examiner>"\
                    f"</chief_examiners>\n"
        else:
            unit_dict['chief_examiners'] = {'chief_examiner':ce_list}
            unit_str += f"<chief_examiners>\n"
            for each_ce in ce_list:
                unit_str += f"<chief_examiner>{each_ce}</chief_examiner>"
            unit_str += f"</chief_examiners>\n"
    else:
        unit_dict['chief_examiners'] = 'TBA'
        unit_str += f"<chief_examiners> TBA </chief_examiners>\n"
    return (unit_dict, unit_str)
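
> The two special names show why the character class was widened. In this minimal sketch the names are the ones listed above, while the surrounding link markup is invented:

```python
import re

simple = re.compile(r"(?<=>)([A-Za-z ]*)(?=</a>)")
extended = re.compile(r"(?<=>)([A-Z\'a-z-\. ]*)(?=</a>)")

# Hypothetical link markup around the two special-case names.
html = ('<a href="#">Assoc. Professor Bernard Flynn</a>'
        '<a href="#">Dr Rose-Marie Bezuidenhout</a>')

print(simple.findall(html))     # []
print(extended.findall(html))
```

> The simple class stops at the first `.` or `-`, so neither name reaches the closing `</a>` and nothing is returned; the extended class captures both names in full.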

### 5.6 Combine ALL data into appropriate container ( dict for .json & str for .xml )

> Finally, we call each extraction function in turn to build the dict and the string for every unit.

In [17]:
unit_list = []
unit_str_all = '<?xml version="1.0" encoding="UTF-8" ?>\n<units>\n'

for unit_block in unit_block_list:
    unit_dict = {}
    unit_str = ''
    for keyword in test_output['units']['unit'][0].keys():
        # construct an dict with keys and empty values
        unit_dict[keyword] = None
    # for each unit block, extract the info sequentially
    # each function will return a dict and a string
    unit_dict_1,unit_str_1 = unit_code_title_extraction(unit_dict, unit_block, unit_str)
    unit_dict_2,unit_str_2 = synopsis_extraction(unit_dict_1, unit_block, unit_str_1)
    unit_dict_3,unit_str_3 = pre_pro_extraction(unit_dict_2, unit_block, unit_str_2)
    unit_dict_4,unit_str_4 = req_extraction(unit_dict_3, unit_block, unit_str_3)
    unit_dict_5,unit_str_5 = outcome_extraction(unit_dict_4, unit_block, unit_str_4)
    unit_dict_6,unit_str_6 = ce_extraction(unit_dict_5, unit_block, unit_str_5)
    
    unit_list.append(unit_dict_6)
    unit_str_6 += "</unit>\n"
    unit_str_all += unit_str_6
unit_str_all += "</units>"
unit_dict_all = {'unit':unit_list}
final_dict = {'units':unit_dict_all}
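
> As an alternative to concatenating strings, the XML could also be built with `xml.etree.ElementTree`, which escapes special characters itself. This is a sketch under assumptions, not the approach used in this report: the record below is invented and only two of the keywords are shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit record, invented for illustration.
toy_unit = {'@id': 'ABC1234', 'title': 'Data & Text <Wrangling>',
            'outcomes': {'outcome': ['First outcome', 'Second outcome']}}

unit_el = ET.Element('unit', id=toy_unit['@id'])
ET.SubElement(unit_el, 'title').text = toy_unit['title']
outcomes_el = ET.SubElement(unit_el, 'outcomes')
for text in toy_unit['outcomes']['outcome']:
    ET.SubElement(outcomes_el, 'outcome').text = text

# tostring escapes '&', '<' and '>' in element text automatically.
xml_text = ET.tostring(unit_el, encoding='unicode')
print(xml_text)
```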

> We can check the number of duplicate unit codes in case further processing is needed.

In [18]:
unit_code_all = []
for each in unit_list:
    unit_code_all.append(each['@id'])
print(f'The number of unit code are {len(unit_code_all)}')
print(f'The number of unique unit code are {len(set(unit_code_all))}')

The number of unit code are 400
The number of unique unit code are 379
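
> To see which codes are repeated, and how often, `collections.Counter` can be used. A minimal sketch on an invented list of codes (with the real data, `unit_code_all` would take the place of `toy_codes`):

```python
from collections import Counter

# Hypothetical unit codes, invented for illustration.
toy_codes = ['ABC1234', 'ABC2000', 'ABC1234', 'XYZ3000', 'ABC2000', 'ABC1234']

code_counts = Counter(toy_codes)
# keep only the codes that occur more than once
duplicates = {code: n for code, n in code_counts.items() if n > 1}
print(duplicates)
```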


## 6. Write into json and xml file

In [18]:
with open("json_file.json","w") as f:
    json.dump(final_dict,f,indent=4)

In [19]:
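# the XML declaration states UTF-8, so write the file with that encoding explicitly
with open("xml_file.xml","w",encoding="utf-8") as f:
    f.write(unit_str_all)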
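
> Before submitting, both outputs can be parsed again as a sanity check. The sketch below works on small invented stand-ins for `final_dict` and `unit_str_all` (the names `toy_dict` and `toy_xml` are hypothetical); the same two checks could be run on the real objects.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for final_dict and unit_str_all.
toy_dict = {'units': {'unit': [{'@id': 'ABC1234', 'title': 'Example'}]}}
toy_xml = ('<?xml version="1.0" encoding="UTF-8" ?>\n<units>\n'
           "<unit id='ABC1234'>\n<title> Example</title>\n</unit>\n</units>")

# JSON: serialising and parsing again should give back the same structure.
assert json.loads(json.dumps(toy_dict, indent=4)) == toy_dict

# XML: fromstring raises ParseError if the text is not well-formed.
root = ET.fromstring(toy_xml.encode('utf-8'))
print(root.tag, [u.get('id') for u in root.iter('unit')])
```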

## 7. Conclusion

> What we have done in this task:
1. We created a **logic map** to break the data extraction problem into component parts.
2. We used an **HTML tag-based approach** to identify common sections within a limited domain.
3. We used **regex** as our text-matching tool to identify the small-scale structure in each section.
4. Finally, we retrieved data from the semi-structured input for further processing or text analytics.
<img src = "style/img_1_overview.png" height = "500" width = "500" style="float: left;">