# Parsing text files and converting into XML and JSON
## Task 1

#### Name: Aniruddha Indurkar
#### Date: 14/04/2019

Environment: Python 3.6.3 and Jupyter notebook

Libraries used:

* json
* re



## 1. Introduction
This analysis extracts data from an HTML-based text corpus containing 400 Monash University units. Data was extracted by reading the input file 'Data.txt', splitting it at the unit boundaries (i.e. the HTML container of each unit) and then creating a list in which each item holds the full HTML for one unit.

Data relating to each unit, such as its prerequisites and prohibitions, was extracted using the package `re` to pull values from specific HTML tags. Similarly, the text of the synopsis, outcomes and assessment requirements was extracted for further pre-processing.

Text pre-processing was performed with the objective of producing an XML and a JSON document for the unit information. The pre-processing included extracting the information for unit code, synopsis, outcomes etc.

### Methodology
- After reading the file, we look for the recurring patterns in it.
- We extract the containers that hold the information about each unit.
- We use the lazy regex `(.|\n)*?`, which extracts the shortest text between the identified tags for each unit.
- Then, we remove the unwanted characters and format the data.
- To export the data in XML and JSON format, we adapt the dictionary created to store the information for each unit.

We define one function per field and loop over the units to extract the information. We then pre-process the results and write the XML file manually; the `json` library is used to dump the JSON data.
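
The lazy quantifier in `(.|\n)*?` is what keeps each match inside a single pair of tags. A minimal sketch of the difference, using a made-up two-paragraph string rather than the real `Data.txt` (the variable names are illustrative only, and the inner group is written as non-capturing so that `re.findall` returns plain strings):

```python
import re

# Hypothetical snippet standing in for part of one unit's HTML
html = "<p>first</p>\n<p>second</p>"

# Greedy: runs on to the last closing tag, swallowing both paragraphs
greedy = re.findall(r"<p>((?:.|\n)*)</p>", html)

# Lazy: stops at the first closing tag, giving one match per paragraph
lazy = re.findall(r"<p>((?:.|\n)*?)</p>", html)

print(greedy)
print(lazy)
```

With the greedy form the two paragraphs come back as one match; with the lazy form each paragraph is matched separately.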

## Importing the given modules and opening the file

In [20]:
import re
import json

# Keep the file in the same working directory
with open('Data.txt','r') as file:
    extract=file.read()

## Test file
#with open('test.txt','r') as file:
#    extract=file.read()
    
# Observing the pattern

We first extract the information between the tags and store it as a list, with one entry per unit.

In [21]:
#information for a subject is located in the container. Extracting the container by finding pattern in the tags
#Performing the lazy search to search for shortest string between the two tags
information=re.findall('<div class="content-inner__main">((.|\n)*?)<!-- /.content_container--> </div>',extract)
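
Because the pattern above contains two capturing groups, `re.findall` returns a list of tuples rather than a list of strings. That appears to be why the functions that follow wrap their argument in `str()` and search for the two characters `\` and `n` instead of a real newline. A small sketch on a made-up container (the sample string is an assumption, not taken from `Data.txt`):

```python
import re

# Hypothetical container standing in for one unit
sample = ('<div class="content-inner__main">\n<h1>Unit A</h1>\n'
          '<!-- /.content_container--> </div>')

pattern = '<div class="content-inner__main">((.|\n)*?)<!-- /.content_container--> </div>'

# Two groups: each match is a tuple (whole container, last character matched)
containers = re.findall(pattern, sample)
print(type(containers[0]))

# str() of the tuple shows each newline as the two characters backslash + n
as_text = str(containers[0])
print('\\n' in as_text, '\n' in as_text)

# With a single group and re.DOTALL the matches would be plain strings
plain = re.findall('<div class="content-inner__main">(.*?)<!-- /.content_container--> </div>',
                   sample, flags=re.DOTALL)
print(type(plain[0]))
```

Working on plain strings would remove the need for the doubled backslashes in the later patterns, at the cost of rewriting those patterns.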

## Defining functions to extract the required fields from the observed patterns

### 1. Extracting the unit code
We find the below pattern for the unit code and the unit title:

`<span class="unitcode">RTS4104</span> - Radiation therapy principles and practice 1<span class="hbk-archive-only"> - 2019</span>`

In [22]:
# Pass the container of one unit; str() is applied because each container is a tuple

def Unit_code(subject):
    
    # Apply the regex to find the unit code: 3-5 word characters followed by 4 digits
    # The group is optional, so an empty unitcode tag yields '' instead of no match
    a=re.findall(r'<span class="unitcode">(\w{3,5}\d{4})?</span>',str(subject))
    
    if len(a)!=0:
    
        return a[0]
    
    else:
        #Exception check
        return 'Unitcode inappropriate'
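
A compact variant of the same lookup, sketched with a compiled pattern and `re.search`. `find_unit_code` is a hypothetical name, and unlike `Unit_code` the group here is mandatory, so an empty tag falls through to the default:

```python
import re

# 3-5 word characters followed by 4 digits, inside the unitcode span
UNIT_CODE = re.compile(r'<span class="unitcode">(\w{3,5}\d{4})</span>')

def find_unit_code(html, default='Unitcode inappropriate'):
    """Return the first unit code in html, or default when none is present."""
    match = UNIT_CODE.search(html)
    return match.group(1) if match else default

sample = ('<span class="unitcode">RTS4104</span> - Radiation therapy principles '
          'and practice 1<span class="hbk-archive-only"> - 2019</span>')
print(find_unit_code(sample))
print(find_unit_code('<span class="unitcode"></span>'))
```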

### 2. Title 
We find the below pattern for the unit code and the unit title:

`<span class="unitcode">RTS4104</span> - Radiation therapy principles and practice 1<span class="hbk-archive-only"> - 2019</span>`

In [23]:
# Function to extract the title of the unit

def titl(sent):
    
    # Title information is in the container with the unit code
    a=re.findall(r'<span class="unitcode">(\w{3,5}\d{4})?</span>',str(sent))
    
    
    if len(a)!=0:
        # Extracting the text between the unit code and the archive-only span
        b=re.findall(r'<span class="unitcode">\w{3,5}\d{4}?(.*)<span class="hbk-archive-only">',str(sent))
        
        # Removing the leading "</span> - " and the list brackets added by str()
        c=re.findall("- (.*)']?",str(b))
        if len(c)!=0:
            return c[0]
    
    # Fallback so that the XML writer never receives None
    return 'NA'
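
The title can also be captured in a single pass, which avoids the round trip through `str()` of a list; with that round trip, a title containing an apostrophe can be cut short by the `']?` clean-up. A sketch under the assumption that the title always sits between the unit code span and the archive-only span (`find_title` is a hypothetical helper):

```python
import re

# Capture the text between "</span> - " and the archive-only span
TITLE = re.compile(
    r'<span class="unitcode">\w{3,5}\d{4}</span>\s*-\s*(.*?)\s*<span class="hbk-archive-only">'
)

def find_title(html, default='NA'):
    """Return the unit title, or default when the pattern is absent."""
    match = TITLE.search(html)
    return match.group(1) if match else default

sample = ('<span class="unitcode">RTS4104</span> - Radiation therapy principles '
          'and practice 1<span class="hbk-archive-only"> - 2019</span>')
# Made-up title with an apostrophe and an inner hyphen
tricky = ('<span class="unitcode">ABC1234</span> - Children\'s law - an introduction'
          '<span class="hbk-archive-only"> - 2019</span>')
print(find_title(sample))
print(find_title(tricky))
```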
    

### 3. Synopsis


In [24]:
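#Synopsis
def Syno(extract):
    
    # Container with the information of synopsis
    # str() of the container escapes newlines, so we match a literal backslash followed by n
    a=re.findall("Synopsis</h2>\\\\n<div>\\\\n<p>(.*?)</p>",str(extract))
    
    # Case when there is no synopsis: return early, otherwise a[0] would be just "N"
    if len(a)==0:
        return "NA"
    
    #removing unwanted characters
    b=a[0].replace("\\","")
    
    #returning the synopsis without any remaining HTML tags
    return re.sub('<.*?>',"",b)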

### 4. Pre-requisites

In [25]:
#Pre-requisites
def PreReq(extract):
    
    # initialising required objects
    mylist=list()
    a_list=list()
    
    # The case to check whether there are pre-requisites
    if len(re.findall("Prerequisites",str(extract)))!=0:
        
        # Pre-requisites are in between in the container with this pattern
        a=re.findall("Prerequisites</p>((.|\n)*?)</p>",str(extract))
        
        # Finding unique pre-requisites, we make use of set
        for i in set(re.findall('\w{3,5}\d{4}',str(a))):
            a_list.append(i)
            
        # Substitute unwanted characters
        for i in a_list:
            i=re.sub(">","",i)
            i=re.sub("<","",i)
            mylist.append(i)
            
    # Similar approach for co-requisites
    a_list=list()
    if len(re.findall("Co-requisites",str(extract)))!=0:
        a=re.findall("Co-requisites</p>((.|\n)*?)</p>",str(extract))

        for i in set(re.findall('\w{3,5}\d{4}',str(a))):
            a_list.append(i)
        for i in a_list:
            i=re.sub(">","",i)
            i=re.sub("<","",i)
            mylist.append(i)
    
    # If no valid unit code was found, either because there are no pre-requisites
    # or co-requisites, or because they do not satisfy the unit code pattern
    if len(mylist)==0:
        mylist.append('NA')
            
    return mylist
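
`set` removes duplicates but does not keep the order in which the codes appear, so the order of the output list can differ between runs. A hedged alternative, assuming the order of appearance is worth keeping; `codes_after` is a hypothetical helper and the sample HTML is made up:

```python
import re

def codes_after(label, html):
    """Unique unit codes from the paragraph that follows label, in order of appearance."""
    block = re.search(re.escape(label) + r'</p>(.*?)</p>', html, flags=re.DOTALL)
    if block is None:
        return ['NA']
    codes = re.findall(r'\w{3,5}\d{4}', block.group(1))
    # dict.fromkeys keeps the first occurrence of each code and preserves order
    # (guaranteed from Python 3.7; an implementation detail of CPython 3.6)
    unique = list(dict.fromkeys(codes))
    return unique or ['NA']

sample = ('<p>Prerequisites</p>\n'
          '<p><a href="x">DIS2601</a> and <a href="y">OHS1000</a> '
          'or <a href="x">DIS2601</a></p>')
print(codes_after('Prerequisites', sample))
print(codes_after('Co-requisites', sample))
```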

### 5. Prohibitions

In [26]:
# Prohibitions
def Prohibitions(extract):
    # Initiating the objects required
    mylist=list()
    a_list=list()
    
    # Condition to check whether there are prohibitions or not
    if len(re.findall("Prohibitions",str(extract)))!=0:
        
        # Container with the Information of prohibitions
        a=re.findall("Prohibitions</p>((.|\n)*?)</p>",str(extract))
        
        # Looping to get all the prohibitions
        for i in set(re.findall('>\w{3,5}\d{4}<',str(a))):
            
            a_list.append(i)
        
        # Removing unwanted characters
        for i in a_list:

            i=re.sub(">","",i)
            i=re.sub("<","",i)
            mylist.append(i)
            
    # Case when there are no prohibitions
    if len(re.findall("Prohibitions",str(extract)))==0:
        
            mylist.append('NA')
    else:
        # Case when it has garbage values
        if len(mylist)==0:
            
            mylist.append('NA')
            
    return mylist

### 6. Requirements

In [27]:
# Requirements
def Require(extract):
    
    # Case when there is no assessment
    if len(re.findall("Assessment",str(extract)))==0:
        
        return ['NA']
    
    # For assessment information
    else:
        # Container with assessment Information
        a=re.findall("Assessment</h2>((.|\n)*?)<h2",str(extract))
        
        #Information required is between the two tags
        b=re.findall('<p>(.*?)</p>',str(a))
        
        # Initiating a list to append and return
        c=list()
        for i in b:
            # Remove unwanted links in between such as <http://www.monash.edu/timetables>
            i=re.sub("<(.*?)>","",i)
            c.append(i)

        return c

### 7. Outcomes

In [28]:
# Outcomes
def Outcomes(extract):
    
    # Case for no outcomes
    if len(re.findall("Outcomes",str(extract)))==0:
        
        return ['NA']
    
    else:
        # Container with the information of outcomes
        a=re.findall("<ol.*?>((.|\n)*?)</ol>",str(extract))
        
        # Extracting information between the multiple tags
        b=re.findall('>(.*?)<',str(a))
        
        #Initiating the object
        c=list()
        
        for i in b:
            if "\\\\n" not in i:
                
                # Replacing the unwanted characters (str.replace returns a new string)
                i=i.replace('\\','')
                
                if i!='':
                    c.append(i)
                
        return c
    

### 8. Chief Examiner

In [29]:
def CheExam(extract):
    
    # Checking whether there is chief examiner
    if len(re.findall("Chief examiner",str(extract)))==0:
        return ['TBA']
    
    else:
        
        a_list=list()
        
        if len(re.findall("Chief examiner",str(extract)))!=0:
            
            # Information of chief examiner
            a=re.findall("Chief examiner...</p>((.|\n)*?)</p>",str(extract))
            
            # Pattern for chief examiners in the container
            for i in set(re.findall('>(.*?)<',str(a))):

                # Removing unwanted characters
                if "\\\\n" not in i:
                    i=i.replace("\\","")
                    
                    if i!='':
                        a_list.append(i)
                        
            # Case for no chief examiner
            if len(a_list)==0:
                a_list=["TBA"]
                
        return a_list

### Compiling all the information

In [30]:
# Compiling all the information to a dictionary to print in the XML
def Compiler_xml(info):
    # Initiating the dictionary
    a_dict=dict()
    
    #Combining all the information in a single dictionary
    a_dict['Unitcode']=Unit_code(info)
    a_dict['Prerequisite']=PreReq(info)
    a_dict['Prohibitions']=Prohibitions(info)
    a_dict['Outputs']=Outcomes(info)
    a_dict['Chief Examiner']=CheExam(info)
    a_dict['Synopsis']=Syno(info)
    a_dict['Requirements']=Require(info)
    a_dict['Title']=titl(info)

    return a_dict

### Sample output

In [31]:
Compiler_xml(information[0])

{'Unitcode': 'DIS3905',
 'Prerequisite': ['DIS2601', 'OHS1000'],
 'Prohibitions': ['NA'],
 'Outputs': ['Utilise a range of software applications (including sound and media authoring) to an advanced level in the creation, editing and realization of conceptual digital audio video works;',
  'Have an advanced practical and critical understanding of video production (including various lighting conditions, advanced camera operations and audio recording techniques);',
  'Be able to produce a major body of work which demonstrates advanced technique integrally linked with ambitious ideas;',
  'Have an advanced understanding of video encoding for a range a distribution methods (web, DVD authoring, digital media players);',
  'Be able to apply highly polished audio video production skills to a range of potential output considerations such as: film, video, short film, television, documentary, web, documentation and video art',
  'Have considered how their work is placed in the context of current 

### Creating a list of all the unit code dictionaries for XML

In [32]:
my_list=list()
for i in information:
    # Appending it to an iterable object to easily access the dictionary
    my_list.append(Compiler_xml(i))

# Checking the number of units
len(my_list)

400

### Testing

In [33]:
test_list=list()
for i in my_list:
    test_list.append(i["Unitcode"])
len(set(test_list)) ## Check the unique units
## Thus, we have units that are repeated.


373

### Explanation
As this task requires only extracting the information, not processing it further, we keep the duplicated units as they are and continue.
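
To see which unit codes are repeated, and how often, `collections.Counter` can be applied to the list of codes. A minimal sketch on made-up data standing in for `test_list`:

```python
from collections import Counter

# Hypothetical unit codes standing in for test_list
codes = ['DIS3905', 'OPM5003', 'DIS3905', 'RTS4104', 'OPM5003', 'DIS3905']

counts = Counter(codes)

# Keep only the codes that occur more than once
repeated = {code: n for code, n in counts.items() if n > 1}
print(repeated)
print(len(codes), len(set(codes)))
```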

### Creating the XML file

In [34]:
## Manually creating XML Output

##============================================
## Testing code
#with open("test_output1.xml", "w") as a_file:

##============================================

with open("Output.xml", "w") as a_file:
    a_file.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>"+"\n"+"<units>")
    for j in range(0,len(my_list)):
        
        # XML Unit format
        unit="<unit id=\""+my_list[j]['Unitcode']+"\">"
        
        # Title
        title="<title>"+my_list[j]['Title']+"</title>"
        
        # Synopsis
        syno="<synopsis>"+my_list[j]['Synopsis']+"</synopsis>"
        
        # Pre-requisites
        # Only the placeholder 'NA' is written bare; a single real unit code
        # is still wrapped in its tag, matching the JSON output
        in_pr=""
        if my_list[j]["Prerequisite"]==['NA']:
            in_pr="NA"
        else:
            for i in my_list[j]["Prerequisite"]:
                in_pr+="<pre_requistic>"+i+"</pre_requistic>"
        prerequisite="<pre_requistics>"+"\n"+in_pr+"\n"+"</pre_requistics>"
        
        # Prohibitions
        in_po=""
        if my_list[j]["Prohibitions"]==['NA']:
            in_po="NA"
        else:
            for i in my_list[j]["Prohibitions"]:
                in_po+="<prohibision>"+i+"</prohibision>"
        prohibision="<prohibisions>"+"\n"+in_po+"\n"+"</prohibisions>"
        
        # Requirements
        in_req=""
        for i in my_list[j]["Requirements"]:
            in_req+="<requirement>"+i+"</requirement>"
        req="<requirements>"+"\n"+in_req+"\n"+"</requirements>"
        
        # Outcomes
        in_out=""
        for i in my_list[j]["Outputs"]:
            in_out+="<outcome>"+i+"</outcome>"
        outcome="<outcomes>"+"\n"+in_out+"\n"+"</outcomes>"
        
        # Chief Examiner
        in_ce=""
        for i in my_list[j]["Chief Examiner"]:
            if my_list[j]["Chief Examiner"]==['TBA']:
                in_ce="TBA"
            else:
                in_ce+="<chief_examiner>"+i+"</chief_examiner>"
        Chief_examiner="<chief_examiners>"+in_ce+"</chief_examiners>"
        
        # Writing the xml file
        a_file.write("\n"+unit+"\n")
        a_file.write(title+"\n")
        a_file.write(syno+"\n")
        a_file.write(prerequisite+"\n")
        a_file.write(prohibision+"\n")
        a_file.write(req+"\n")
        a_file.write(outcome+"\n")
        a_file.write(Chief_examiner+"\n"+"</unit>")
        
    a_file.write("</units>")
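
The writer above concatenates the extracted text straight into the markup, so a synopsis or title containing `&` or `<` would make the file unparsable. One possible safeguard, sketched here with the standard-library `xml.sax.saxutils` helpers; `make_element` is a hypothetical helper that is not used in the notebook, and the title is made up:

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

def make_element(tag, text):
    """Wrap text in <tag>...</tag>, escaping &, < and >."""
    return "<" + tag + ">" + escape(text) + "</" + tag + ">"

# quoteattr adds the surrounding quotes and escapes the attribute value
unit = "<unit id=" + quoteattr("RTS4104") + ">"
title = make_element("title", "Research & practice <advanced>")
snippet = unit + title + "</unit>"
print(snippet)

# The escaped snippet parses, and the parser gives back the original text
root = ET.fromstring(snippet)
print(root.find("title").text)
```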


### Creating a json compiler for the format specific to json file

In [35]:
## Json file output
def Compiler_json(info):
    udict=dict()
    
    udict["@id"]=Unit_code(info)
    udict['title']=titl(info)
    udict['synopsis']=Syno(info)
    
    if PreReq(info)==['NA']:
        udict['pre_requistics']="NA"
    else:
        sdict=dict()
        if len(PreReq(info))==1:
            sdict["pre_requistic"]=PreReq(info)[0]
        else:
            sdict["pre_requistic"]=PreReq(info)
        udict['pre_requistics']=sdict
        
    if Prohibitions(info)==['NA']:
        udict['prohibisions']="NA"
    else:
        sdict=dict()
        if len(Prohibitions(info))==1:
            sdict["prohibision"]=Prohibitions(info)[0]
        else:
            sdict["prohibision"]=Prohibitions(info)
        udict['prohibisions']=sdict
    
    if Require(info)==['NA']:
        udict['requirements']="NA"
    else:
        sdict=dict()
        if len(Require(info))==1:
            sdict["requirement"]=Require(info)[0]
        else:
            sdict["requirement"]=Require(info)
        udict['requirements']=sdict
    
    if Outcomes(info)==['NA']:
        udict['outcomes']="NA"
    else:
        sdict=dict()
        if len(Outcomes(info))==1:
            sdict["outcome"]=Outcomes(info)[0]
        else:
            sdict["outcome"]=Outcomes(info)
        udict['outcomes']=sdict
    
    if CheExam(info)==['TBA']:
        udict['chief_examiners']="TBA"
    else:
        sdict=dict()
        if len(CheExam(info))==1:
            sdict["chief_examiner"]=CheExam(info)[0]
        else:
            sdict["chief_examiner"]=CheExam(info)
        udict['chief_examiners']=sdict
    
    return udict
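
The five blocks in `Compiler_json` follow the same rule: the placeholder becomes a bare string, a single value becomes a string inside a dictionary, and several values become a list inside a dictionary. That rule could be factored into one helper; `wrap` below is a hypothetical name, shown only as a sketch of the pattern:

```python
import json

def wrap(values, child, empty='NA'):
    """Shape a list of values the way Compiler_json does for each field."""
    if values == [empty]:
        return empty
    # A single value is stored as a string, several values as a list
    return {child: values[0] if len(values) == 1 else values}

print(json.dumps(wrap(['NA'], 'pre_requistic')))
print(json.dumps(wrap(['FIT5178'], 'prohibision')))
print(json.dumps(wrap(['FIT5178', 'FIT5194'], 'prohibision')))
print(json.dumps(wrap(['TBA'], 'chief_examiner', empty='TBA')))
```

Calling each extraction function once and passing its result to such a helper would also avoid running the same regular expressions up to three times per field.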

### Sample json output

In [36]:
Compiler_json(information[80])

{'@id': 'OPM5003',
 'title': 'Managing project knowledge',
 'synopsis': "The unit explores the role of formal and informal knowledge networks in project delivery. It focuses on developing an understanding of how project managers acquire and exchange knowledge, and how this knowledge impacts on their actions. The unit incorporates concepts and theories from various disciplines aligned to project management and knowledge management to enrich students' understanding of the socio-technical aspects of project management.",
 'pre_requistics': 'NA',
 'prohibisions': {'prohibision': ['FIT5178', 'FIT5194', 'FIT5057']},
 'requirements': {'requirement': 'In-semester assessment: 100%'},
 'outcomes': {'outcome': ['Evaluate, assess and communicate different project management perspectives to meet contextual demands.',
   'Critically reflect on and improve practice through effective management of knowledge in a project context.',
   'Contrast and assess the most effective communication techniques to 

### Creating a json output

In [37]:
unit_list=list()
for i in information:

    unit_list.append(Compiler_json(i))
adict=dict()

adict['unit']=unit_list

json_dict=dict()
# Creating a dictionary to dump the json output
json_dict['units']=adict

## Writing the json output
with open('Output.json', 'w') as test_json_output:
    json.dump(json_dict, test_json_output,indent=4)

#with open('test_output1.json', 'w') as test_json_output:
#    json.dump(json_dict, test_json_output,indent=4)

### Testing to check whether the documents generated are parsable or not

In [38]:
#import xml.etree.ElementTree as ET
#root = ET.parse('29389429.xml').getroot()
#len(root)

#with open('29389429.json', 'r') as f:
  #  distros_dict = json.load(f)
#print(distros_dict)
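
The same check can be tried on in-memory strings before pointing it at the generated files. A minimal sketch; `is_valid_xml` and `is_valid_json` are hypothetical helpers and the sample strings are made up:

```python
import json
import xml.etree.ElementTree as ET

def is_valid_xml(text):
    """True when text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def is_valid_json(text):
    """True when text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

good_xml = '<units><unit id="A1234"><title>T</title></unit></units>'
bad_xml = '<units><unit id="A1234"><title>R & D</title></unit></units>'  # bare ampersand
print(is_valid_xml(good_xml), is_valid_xml(bad_xml))
print(is_valid_json('{"units": {"unit": []}}'))
```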