## FIT5196 Task 3 in Assessment 1

### Student Name: Zhiqing Shu
### Student ID: 28217551

#### Date: 04/04/2018

Version: 3.0

Environment: Python 3.6.4 and Anaconda 5.1.0 (64-bit)

Libraries used: 
* json (for json, included in Anaconda Python 3.6) 
* re (for regular expression, included in Anaconda Python 3.6) 
* html (for html, included in Anaconda Python 3.6) 

## 1. Introduction

This task focuses on converting the Australian Sport Thesaurus stored in an XML file ("australian-sport-thesaurus-student.xml") into a JSON file. The JSON file should look like the given figure.

The detailed requirements of this task are as follows:
- must correctly extract the thesaurus in the XML file and store it in the JSON file;
- while extracting the thesaurus from the XML file, **existing Python packages that are written to parse XML files (e.g., BeautifulSoup, lxml and ElementTree) must not be used.** You must write your own Python script to extract the thesaurus. **Hint: Regular Expressions can be used**.
- Python packages, **like json**, can be used to save the extracted thesaurus;
- script must be written in a Jupyter notebook named as **"xml_json.ipynb"**;
- the JSON data should be saved in a file named as **"sport.dat"**; 
- the input file must only be **"australian-sport-thesaurus-student.xml"**.

## 2. Import libraries

In [None]:
import html
import re
import json

## 3. Parse XML File

According to the task requirements, we cannot use any existing Python package to parse the XML file, so Regular Expressions are the main tool available.

Regular Expressions work on strings, so the first step is to read the complete content of the original XML file and load it as a single string.

Looking at the XML file, there is a lot of whitespace before the tags, and it is not needed in the string we will work on. We can read the file line by line with the `file.readlines()` [function](https://docs.python.org/3/tutorial/inputoutput.html) and remove that whitespace from each line with `str.strip()` at this step.

In [None]:
with open('australian-sport-thesaurus-student.xml') as file: # the file is closed automatically
    lines = file.readlines() # read each line in xml file
data = '' # create an empty string
for i in lines:  # i is string
    data += i.strip() # remove space before tags and add processed string to data
data
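
As a minimal sketch of the same idea, the snippet below strips and joins the lines of an in-memory sample instead of the real file. The `sample` string and the names `handle` and `flat` are made up for illustration; `io.StringIO` only stands in for an opened file.

```python
import io

# Hypothetical one-term sample standing in for the real XML file.
sample = "<Terms>\n    <Term>\n        <Title>Archery</Title>\n    </Term>\n</Terms>\n"

# io.StringIO behaves like an open text file, so iterating yields one line at a time.
with io.StringIO(sample) as handle:
    # strip() removes the leading indentation and the trailing newline of each line
    flat = ''.join(line.strip() for line in handle)

print(flat)
```

Joining with `''.join()` builds the string in one pass, which tends to matter more as the file grows.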

XML has its own special characters (character entities such as `&amp;`), which may confuse the reader, so it is better to convert them into readable characters.

First, check whether the data contains any of them.

In [None]:
print('&amp' in data) # &
print('&lt' in data) # <
print('&gt' in data) # >
print('&quot' in data) # "
print('&apos' in data) # '
print('&#13' in data) # Carriage return

We can use `html.unescape()` [function](https://docs.python.org/3/library/html.html) to convert all special characters in the string to the corresponding unicode characters. Then, check the result.

In [None]:
data = html.unescape(data)

In [None]:
print('&amp' in data)
print('&lt' in data)
print('&gt' in data)
print('&#13' in data)
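
To illustrate what `html.unescape()` does, here is a small sketch on a made-up string (the `raw` value is an assumption, not text from the thesaurus). It shows both named entities and a numeric reference being converted.

```python
import html

# Hypothetical string containing named and numeric character references.
raw = "Track &amp; field &lt;relay&gt; &quot;4x100&quot;&#13;"
clean = html.unescape(raw)

# repr() makes the carriage return produced by &#13; visible as \r
print(repr(clean))
```

One thing to keep in mind: after unescaping, a literal `<` or `>` inside a description would look like markup to the regular expressions used later. Whether that actually happens in this file is not checked here.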

There are also some whitespace control characters, such as carriage return, tab and newline, which are likely to exist in our data. We need to check for them.

In [None]:
print('\r' in data) # carriage return
print('\t' in data) # tab
print('\n' in data) # new line

Replace the ones that are present with spaces.

In [None]:
data = data.replace('\r',' ')
data = data.replace('\t',' ')

In [None]:
print('\r' in data)
print('\t' in data)

View the cleaned data.

In [None]:
data[0:1000]

After doing the basic data cleaning, we can try to parse the XML file.

The structure of the original XML: 
```XML
<Terms>
    <Term>
        <Title>??????</Title>
        <Description>??????</Description>
        <RelatedTerms>
            <Term>
               <Title>??????</Title>
               <Relationship>??????</Relationship>
            </Term>
            ......
        </RelatedTerms>
    </Term>
    ......
</Terms>
```

XML files are structured as element trees. `Terms` is the root element and each `Term` is its child. `Title`, `Description` and `RelatedTerms` are the children of `Term`. `RelatedTerms` has its own child elements, also named `Term`, and each of these has the child elements `Title` and `Relationship`.

According to the sample figure showing what the JSON file should look like, we can get the basic structure of the JSON file (spaces and new lines are used here only to clarify the structure):
```JSON
{"thesaurus":[
 {"Description": "......",  
  "RelatedTerms": 
   [{"Relationship": "......", 
     "Title": "......"},
    ......],
  "Title":"......"}
  ......
 ]}
```

From this, the Python data type of each element is easy to see:
- `Title` and `Relationship` are the keys of the innermost level dictionary; there can be one dictionary or several dictionaries, and a list is used to store it/them. 
  In the XML file:
  ```XML
    <Term>
        <Title>??????</Title>
        <Relationship>??????</Relationship>
    </Term> 
  ```
- This list is stored as the value of key `RelatedTerms` in a dictionary. This dictionary has another two key-value pairs and the keys are `Description` and `Title`. 
  In the XML file:
  ```XML
    <Term>
        <Title>??????</Title>
        <Description>??????</Description>
        <RelatedTerms>
            <Term>
               <Title>??????</Title>
               <Relationship>??????</Relationship>
            </Term>
            ......
        </RelatedTerms>
    </Term>
  ```
- This outer dictionary is stored in a list and this list will be stored in a dictionary as the value of key `thesaurus`.
  In the XML file:
  ```XML
    <Terms>
        <Term>
            <Title>??????</Title>
            <Description>??????</Description>
            <RelatedTerms>
                <Term>
                   <Title>??????</Title>
                   <Relationship>??????</Relationship>
                </Term>
                ......
            </RelatedTerms>
        </Term>
        ......
    </Terms>
  ```
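
The nesting described in the list above can be sketched directly with Python literals. All values below are invented for illustration; only the shape (dictionary, list, dictionary, list, dictionary) mirrors the target structure.

```python
import json

# Innermost level: a list of dictionaries with the keys Title and Relationship.
example_related = [{"Title": "Target sports", "Relationship": "Broader Term"}]

# Middle level: one dictionary per term, holding the list above under RelatedTerms.
example_term = {
    "Title": "Archery",
    "Description": "Shooting arrows at a target.",
    "RelatedTerms": example_related,
}

# Outer level: every term dictionary goes into one list under the key "thesaurus".
example_thesaurus = {"thesaurus": [example_term]}

print(json.dumps(example_thesaurus, sort_keys=True, indent=1))
```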

The data needed is retrieved from the outside in.

The `re.findall()` [function](https://docs.python.org/3/library/re.html) is used to match the data because it returns all non-overlapping matches of the pattern in the string as a list, so extracting each string is easy.

First, match each top-level `Term`, capturing its `Title` and `Description` part in one group and the content of `RelatedTerms` (if any) in another.

In [None]:
values = re.findall('<Term>(.*?)(?:<RelatedTerms>(.*?)</RelatedTerms>)*</Term>', data)
values[:5]

Getting the length of the returned list by `re.findall()`.

In [None]:
len(values)

In [None]:
len(values[0])

Each `( )` in the returned list is a tuple, and each tuple stores two strings.

The first string contains two key-value pairs (key: `Description` and `Title`) of the middle dictionary which has three key-value pairs.
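
To make the shape of these tuples concrete, here is a small sketch that runs the same pattern on a made-up, already flattened sample (the `sample` text is an assumption). It contains one term with a related term and one without.

```python
import re

# Hypothetical flattened sample: the first term has RelatedTerms, the second does not.
sample = ("<Terms>"
          "<Term><Title>Archery</Title><Description>Bow sport</Description>"
          "<RelatedTerms><Term><Title>Target sports</Title>"
          "<Relationship>Broader Term</Relationship></Term></RelatedTerms></Term>"
          "<Term><Title>Zorbing</Title><Description>Rolling downhill</Description></Term>"
          "</Terms>")

pattern = r"<Term>(.*?)(?:<RelatedTerms>(.*?)</RelatedTerms>)*</Term>"
pairs = re.findall(pattern, sample)

# Each tuple is (Title/Description part, RelatedTerms content or '' if absent).
for head, related in pairs:
    print(repr(head), repr(related))
```

When a term has no `RelatedTerms` element, the second group does not take part in the match and `re.findall()` reports it as an empty string.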

The second string contains the key-value pairs of the innermost dictionaries. These pairs can be extracted with the regular expression `<Term><(.*?)>(.*?)</.*?><(.*?)>(.*?)</.*?></Term>`, which returns each related term as a tuple of four strings (key, value, key, value). Because `re.findall()` does not stop after the first match, all of these tuples are returned together in a list.

In [None]:
inner_match = re.findall('<Term><(.*?)>(.*?)</.*?><(.*?)>(.*?)</.*?></Term>',values[1][1])

In [None]:
print(inner_match)

In [None]:
print(inner_match[0][0:2])
print(inner_match[0][2:4])

Since `Title` and `Relationship` are the keys of the innermost dictionary, the original tuple must first be split into two (key, value) pairs before it can be converted to a [dictionary](https://docs.python.org/3.6/tutorial/datastructures.html):

In [None]:
inner = [inner_match[0][0:2],inner_match[0][2:4]]

In [None]:
inner

Converting to dictionary.

In [None]:
inner = dict(inner)
inner
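
The conversion from a four-item tuple to a dictionary can be sketched on its own. The tuple below is hypothetical but has the shape produced by the inner pattern; the `zip` variant is just an equivalent alternative, not the method used in this notebook.

```python
# Hypothetical 4-tuple: (key, value, key, value).
match = ('Title', 'Target sports', 'Relationship', 'Broader Term')

# Slicing gives two (key, value) tuples; dict() accepts any iterable of pairs.
by_slices = dict([match[0:2], match[2:4]])

# Equivalent: pair every even-indexed item with the item that follows it.
by_zip = dict(zip(match[0::2], match[1::2]))

print(by_slices)
print(by_zip == by_slices)
```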

After figuring out the logic, the original string `data` can be converted to the desired data type.

In [None]:
regex_1 = r"<Term>(.*?)(?:<RelatedTerms>(.*?)</RelatedTerms>)*</Term>" # Description and Title, Title and Relationship
regex_2 = r"<(.*?)>(.*?)</.*?>" # Description and Title
regex_3 = r"<Term><(.*?)>(.*?)</.*?><(.*?)>(.*?)</.*?></Term>" # Title and Relationship
result = []
content = re.findall(regex_1, data)
for i in content:
    middle = re.findall(regex_2, i[0])
    middle_dict = dict(middle)
    inner = re.findall(regex_3, i[1])
    inner_list = []
    for x in inner: # there may be several matches
        tuple1 = x[0:2]
        tuple2 = x[2:4]
        inner_match = [tuple1,tuple2] # tuple store in list
        inner_dict = dict(inner_match) # list to dict
        inner_list.append(inner_dict) 
    middle_dict['RelatedTerms'] = inner_list 
    result.append(middle_dict)

Check the result.

In [None]:
result[0:50]

The result looks correct; double-check its length to make sure.

In [None]:
len(result)
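
For reuse or testing, the conversion loop could be wrapped in a function. The sketch below uses the same three patterns; the function name `parse_terms`, the constant names and the sample text are all hypothetical and not part of the assignment.

```python
import re

TERM_RE = re.compile(r"<Term>(.*?)(?:<RelatedTerms>(.*?)</RelatedTerms>)*</Term>")
FIELD_RE = re.compile(r"<(.*?)>(.*?)</.*?>")
RELATED_RE = re.compile(r"<Term><(.*?)>(.*?)</.*?><(.*?)>(.*?)</.*?></Term>")

def parse_terms(xml_text):
    """Return a list of term dictionaries from flattened XML text."""
    terms = []
    for head, related in TERM_RE.findall(xml_text):
        term = dict(FIELD_RE.findall(head))  # Title and Description
        # One small dictionary per related term; an empty list if there are none.
        term['RelatedTerms'] = [
            {k1: v1, k2: v2} for k1, v1, k2, v2 in RELATED_RE.findall(related)
        ]
        terms.append(term)
    return terms

# Hypothetical sample with one related term and one term without any.
sample = ("<Terms>"
          "<Term><Title>Archery</Title><Description>Bow sport</Description>"
          "<RelatedTerms><Term><Title>Target sports</Title>"
          "<Relationship>Broader Term</Relationship></Term></RelatedTerms></Term>"
          "<Term><Title>Zorbing</Title><Description>Rolling downhill</Description></Term>"
          "</Terms>")

parsed = parse_terms(sample)
print(parsed)
```

Running the function on a tiny sample like this makes it easier to compare the output by eye than scanning the full result.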

Create a dictionary with the key "thesaurus" and store the result as its value.

In [None]:
thesaurus = {}
thesaurus["thesaurus"] = result

Use the `json.dumps()` [function](https://docs.python.org/3.6/library/json.html?highlight=json#module-json) to convert the Python dictionary to a JSON string, setting `sort_keys = True` so that the keys appear in the same alphabetical order as in the sample figure.

In [None]:
sport = json.dumps(thesaurus, sort_keys = True) #return str
with open("sport.dat","w") as file:
  file.write(sport)

In [None]:
#type(sport)

Load the exported file `sport.dat` to check that the result has been written to the JSON file properly.

In [None]:
with open("sport.dat","r") as f:
  data = f.read()
# decoding the JSON to dictionary
d = json.loads(data)
d

In [None]:
#type(d)
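
The write and read-back steps above can be checked as a round trip. The sketch below assumes a tiny made-up dictionary and writes to a temporary file (so the real `sport.dat` is not touched); it uses `json.dump()`, which writes straight to the file object, as an alternative to `json.dumps()` followed by `write()`.

```python
import json
import os
import tempfile

# Hypothetical one-term thesaurus.
example = {"thesaurus": [{"Title": "Archery", "Description": "Bow sport", "RelatedTerms": []}]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.dat")
    with open(path, "w") as out:
        json.dump(example, out, sort_keys=True)  # keys are written in alphabetical order
    with open(path) as src:
        text = src.read()

restored = json.loads(text)
print(text)
print(restored == example)
```

If the restored object equals the original, the file contains valid JSON with the same content; key order does not affect that comparison.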

## 4. Summary

This task tests the understanding of the structure of XML and JSON and of the conversion between different data types. Although using an existing package might be a more efficient way to convert XML to JSON in Python, writing my own script for the conversion is a valuable exercise for a data-wrangling beginner.

The main outcomes achieved while completing this task were:
* Figuring out the right regular expression to match the desired string.
* Using the right data type to store the extracted string.
* Understanding how to nest different data types.
* Being aware of the data type of each return value, since the right result cannot be obtained otherwise.

## 5. Reference

* Python Software Foundation. (2018). *7. Input and Output — Python 3.6.5 documentation.* Retrieved from https://docs.python.org/3/tutorial/inputoutput.html
* Python Software Foundation. (2018). *20.1. html — HyperText Markup Language support.* Retrieved from https://docs.python.org/3/library/html.html
* Python Software Foundation. (2018). *6.2. re — Regular expression operations.* Retrieved from https://docs.python.org/3/library/re.html
* Python Software Foundation. (2018). *5. Data Structures.* Retrieved from https://docs.python.org/3.6/tutorial/datastructures.html
* Python Software Foundation. (2018). *19.2. json — JSON encoder and decoder.* Retrieved from https://docs.python.org/3.6/library/json.html?highlight=json#module-json
* Elledienne. (2011, May 5). *JSON output sorting in Python.* Retrieved from https://stackoverflow.com/questions/2774361/json-output-sorting-in-python