# Parsing Data
#### XML

Version: 2.0

Environment: Python 3.6.3 and Anaconda 4.3.0 (64-bit)

Libraries used:
* re (for regular expressions, included in Anaconda Python 3.6)
* json (to create JSON data)

## 1.  Import libraries 

In [None]:
import re
import json


## 2. Parse XML File

In this section, we will perform the following tasks:
* Read the file
* Examine the contents of the file
* Convert the data into a form that can be more easily manipulated
* Parse the data to extract it in a specific format in order to build the sport dictionary

#### 2.1 Read the file

In [None]:
xmlFile = 'australian-sport-thesaurus.xml'
xmlobj = open(xmlFile, "r", encoding="utf8")

#### 2.2 Examine the contents

The root node is <Terms>, which contains <Term> child nodes. Each <Term> has the children <Title>, <Description> and <RelatedTerms>; <RelatedTerms> in turn contains nested <Term> nodes, each with its own <Title> and <Relationship>. Some <RelatedTerms> nodes contain several nested <Term> nodes, but not all do.

In [None]:
for line in xmlobj:
    # each line already ends with \n, so end="" avoids printing an extra blank line
    print(line, end="")
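
Printing every line works for a small file, but a count of the opening tags gives a quicker overview of the structure. The sketch below is only an illustration: `SAMPLE_XML` is made-up data that mimics the node layout described above (it is not taken from the real file), and `count_opening_tags` is a helper name introduced here.

```python
import re
from collections import Counter

# Made-up sample that mimics the described layout; not taken from the real file.
SAMPLE_XML = """<Terms>
  <Term>
    <Title>Example sport</Title>
    <Description>An example description</Description>
    <RelatedTerms>
      <Term>
        <Title>Example related term A</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
      <Term>
        <Title>Example related term B</Title>
        <Relationship>Related Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
</Terms>"""


def count_opening_tags(text):
    """Count how often each opening tag such as <Term> appears in text."""
    # Closing tags start with "</", so this pattern skips them.
    return Counter(re.findall(r"<(\w+)>", text))


tag_counts = count_opening_tags(SAMPLE_XML)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

In the sample, <Term> and <Title> each appear three times although there is only one main term, which already hints that <Title> occurs under two different parents.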
    


#### 2.3 Flatten the xml file into a single list

In [None]:
# the file object above was consumed by the loop in 2.2, so we must read the file again
xmlFile = 'australian-sport-thesaurus.xml'

# the next step is to build a list of all the lines in the file;
# the with block closes the file once the lines have been read
with open(xmlFile, "r", encoding="utf8") as xmlobj:
    List_lines = xmlobj.readlines()


In [None]:
print(List_lines[0:50])
# we immediately see various leading spaces and \n characters that need to be removed in order to parse the data more easily

In [None]:
# remove all white space characters at the front or end of each line

for x in range(len(List_lines)):
    # str.strip() removes leading and trailing white space, including \n,
    # and leaves lines without any such characters unchanged
    List_lines[x] = List_lines[x].strip()

print(List_lines[0:75])
print(len(List_lines))
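
The effect of this clean-up step can be checked on a few made-up lines. This is a minimal sketch: `raw_lines` is hypothetical input that only imitates how indented lines might come out of the file.

```python
# Hypothetical lines as they might be read from the file, with indentation and newlines.
raw_lines = ["<Terms>\n", "  <Term>\n", "    <Title>Example sport</Title>\n", "\n"]

# str.strip() with no arguments removes leading and trailing white space,
# including spaces, tabs and newline characters.
clean_lines = [line.strip() for line in raw_lines]

print(clean_lines)
```

Blank lines become empty strings and stay in the list, so the position of every other line is unchanged.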

#### 2.4 Build a dictionary and parse the data into a specific format

In [None]:
# Use a while loop to build the specific format required for the JSON file

y = 0

# Build a dictionary
sport = {}
sport["Thesaurus"] = []

# Patterns that capture the text between the opening and closing tags
title_re = re.compile(r'<Title>(.*)</Title>')
desc_re = re.compile(r'<Description>(.*)</Description>')
rel_re = re.compile(r'<Relationship>(.*)</Relationship>')


def line_at(i):
    # return line i, or an empty string when i is past the end of the list,
    # so that looking ahead can never raise an IndexError
    return List_lines[i] if i < len(List_lines) else ''


while y < len(List_lines):
    # A main title is the <Title> that directly follows <Term> and is directly followed by
    # <Description>; this separates it from the titles of the related terms
    if re.search('<Term>', List_lines[y]):
        title_match = title_re.search(line_at(y + 1))
        if title_match and re.search('<Description>', line_at(y + 2)):
            Title = title_match.group(1)
    # There is only one Description per main term, so we can parse it directly
    desc_match = desc_re.search(List_lines[y])
    if desc_match:
        Description = desc_match.group(1)
    # Related terms have their own titles and a <RelatedTerms> node can hold several of them
    # (up to 6 in this file). After stripping, each related term takes four lines:
    # <Term>, <Title>, <Relationship>, </Term>
    if re.search('<RelatedTerms>', List_lines[y]):
        # start from an empty list so nothing is carried over from the previous term
        Related_Terms = []
        k = y + 1
        while re.search('<Term>', line_at(k)):
            rel_title_match = title_re.search(line_at(k + 1))
            rel_match = rel_re.search(line_at(k + 2))
            if not (rel_title_match and rel_match):
                break
            Related_Terms.append({"Relationship": rel_match.group(1), "Title": rel_title_match.group(1)})
            k += 4
        # Once the related terms are collected, the entry is added to the dictionary
        sport["Thesaurus"].append({"Description": Description, "Related Terms": Related_Terms, "Title": Title})
    y += 1
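
For comparison, the same dictionary shape can also be built with an XML parser instead of line offsets. The sketch below swaps in `xml.etree.ElementTree` from the standard library, which is not the method used in this notebook. `SAMPLE_XML` is made-up data that mimics the described layout, and `build_thesaurus` is a hypothetical helper name.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample that mimics the described layout; not taken from the real file.
SAMPLE_XML = """<Terms>
  <Term>
    <Title>Example sport</Title>
    <Description>An example description</Description>
    <RelatedTerms>
      <Term>
        <Title>Example related term A</Title>
        <Relationship>Narrower Term</Relationship>
      </Term>
      <Term>
        <Title>Example related term B</Title>
        <Relationship>Related Term</Relationship>
      </Term>
    </RelatedTerms>
  </Term>
  <Term>
    <Title>Another example</Title>
    <Description>No related terms here</Description>
  </Term>
</Terms>"""


def build_thesaurus(xml_text):
    """Build the {"Thesaurus": [...]} dictionary from an XML string."""
    root = ET.fromstring(xml_text)
    entries = []
    # findall("Term") returns only direct children of <Terms>, so <Term> nodes
    # nested under <RelatedTerms> are not mistaken for main terms.
    for term in root.findall("Term"):
        related = [
            {
                "Relationship": rel.findtext("Relationship", default=""),
                "Title": rel.findtext("Title", default=""),
            }
            for rel in term.findall("RelatedTerms/Term")
        ]
        entries.append({
            "Description": term.findtext("Description", default=""),
            "Related Terms": related,
            "Title": term.findtext("Title", default=""),
        })
    return {"Thesaurus": entries}


result = build_thesaurus(SAMPLE_XML)
print(json.dumps(result, indent=2))
```

Unlike the loop above, which appends an entry only when it reaches a <RelatedTerms> node, this sketch also keeps terms without related terms and gives them an empty list.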

In [None]:
# print out the first 100 items of the dictionary to ensure that the pattern of the xml data has been properly parsed
print(sport["Thesaurus"][0:100])

In [None]:
# print out the last 100 items of the dictionary to ensure that the pattern of the xml data has been properly parsed
print(sport["Thesaurus"][-100: ])

## 3. Save Data to JSON file

In [None]:
import json
s = json.dumps(sport)

with open("C://Data//sport.dat", "w") as f:
    f.write(s)
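
A quick way to confirm that the saved file is valid JSON is to read it back and compare it with the dictionary in memory. The sketch below assumes a small made-up dictionary, `sport_example`, and writes to a temporary directory rather than the path used above.

```python
import json
import os
import tempfile

# Made-up dictionary with the same shape as the sport dictionary.
sport_example = {
    "Thesaurus": [
        {
            "Description": "An example description",
            "Related Terms": [{"Relationship": "Related Term", "Title": "Example related term"}],
            "Title": "Example sport",
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sport_example.json")
    # json.dump writes the dictionary straight to the open file.
    with open(out_path, "w", encoding="utf8") as f:
        json.dump(sport_example, f)
    # json.load parses the file back into Python objects.
    with open(out_path, "r", encoding="utf8") as f:
        reloaded = json.load(f)

print(reloaded == sport_example)
```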

## 4. Summary

The most difficult aspect of parsing this XML file was that Title nodes appear under two different parents: each main Term has a Title, and so does each Term nested inside RelatedTerms. To extract the main Title, a pattern specific to it had to be found. This was hard to do with regex on the raw file because of all the white space characters, so the data was flattened into a list of lines and the unnecessary white space was removed. With every node on its own clean line, it was much easier to find a pattern and collect the required nodes.

The second most difficult aspect was collecting multiple Titles and Relationships within a RelatedTerms node. This was also solved by looking for a specific pattern in the surrounding lines.
