# Task 1 Reconstruct the Original Meeting Transcripts

#### Student Name: Siyang Feng
#### Student ID: 28246993

Date: 04/06/2018

Version: 2.0

Environment: Python 3.6.2 and Anaconda 4.3.29

Libraries used:

* re 2.2.1 (for regular expressions, included in Anaconda Python 3.6)
* os (for listing files in a directory, included in Anaconda Python 3.6)
* BeautifulSoup (for the tree structure of XML files, included in Anaconda Python 3.6)


## Introduction
**Reconstruct meeting transcripts with topical boundaries.** The original meeting transcripts are stored in three types of XML files, whose names end with ".words.xml", ".topic.xml" and ".segments.xml". (Details about the three file types can be found in Section 3 below.) The task is to reconstruct the original meeting transcripts, with the corresponding topical and paragraph boundaries, from these files.
## Import Library

In [23]:
from bs4 import BeautifulSoup as bsoup
import re
import os

## Import XML data and Process
* Import all words XML files
* Import all segments XML files
* Import topic XML files one by one and process them

The next step uses variables to record the paths of the segment, word and topic XML directories and the path of the directory for the resulting text files.

In [24]:
# get xml directory path
seg_xml = "./segments"
word_xml = "./words"
topic_xml = "./topics"
# get text directory path
txt_path = './txt_files'

### - Generate hash table of words and segments
Two functions are defined, one to parse a single words XML file and one to parse a single segments XML file.

A `for` loop traverses all the files and parses them into the defined data format (hash table):
* words: {dict of each file name : {dict of each words #id : (each word, 0)}}
* segments: modify words hash table to record segment points
    - {dict of each file name : {dict of each words #id : (each word, 1)}}
   
Regular expression is used to get ID number of each word.

In [25]:
# get word id number pattern
wid_pattern = r'words(\d+)'
w_id = re.compile(wid_pattern)

The `re.compile` function creates the regular expression pattern `w_id`, which is used later to extract the ID number of each word.
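
As a quick illustration of what `w_id` captures, the sketch below applies the same pattern to ID strings of the form seen in this notebook's XML examples. The list `sample_ids` is an illustrative assumption, not data read from the corpus.

```python
import re

# same pattern as in the cell above
w_id = re.compile(r'words(\d+)')

# illustrative ID strings in the form used by the words files
sample_ids = ['ES2002a.C.words0', 'ES2002a.C.words7', 'ES2002a.D.words624']

# group(1) holds the digits after 'words'; int() makes them usable as dict keys
id_numbers = [int(w_id.search(s).group(1)) for s in sample_ids]
print(id_numbers)  # [0, 7, 624]
```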

Function `parsing_one_word` parses one words XML file into a dictionary. The key is the number in the word ID, and the value is a tuple: the first item holds the word and the second item records whether the word is a segment point ('1') or not ('0').

At this stage, `parsing_one_word` cannot know which words are segment points, so it sets the second item of every tuple to '0'. Note that this function only records `<w ...>` items in the XML file; `<gap ...>`, `<vocalsound ...>` and other non-word items are skipped.
    
Function `parsing_one_segment` parses one segments XML file in the segments folder. In a segments file, a segment range is recorded as:
``` xml
<segment nite:id="ES2002a.sync.288" transcriber_end="88.982" transcriber_start="85.509" channel="2">

    <nite:child href="ES2002a.C.words.xml#id(ES2002a.C.words0)..id(ES2002a.C.words7)"/>

</segment>
```
We only need the inner tag `<nite:child ...>`:
```xml 
    <nite:child href="ES2002a.C.words.xml#id(ES2002a.C.words0)..id(ES2002a.C.words7)"/>
```
A regular expression extracts the words XML file name, such as `ES2002a.C.words.xml`. Each segment range records the start and end word of the segment. To record the segment, we only need to mark its end word in the word dictionary by changing the second item of its value to '1'.

Note that some segment end points are not words; they may be a 'vocalsound' or another non-word item, and non-word items are not stored in the original dictionary. Because such an item is still a segment point, we still need it. To solve this problem, we add the non-word segment point to the word dictionary under its ID number with the value `('', 1)`.
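
The marking step can be sketched on a toy dictionary. The contents of `toy_words` and the second `href` (which ends on an ID that is not a word) are made-up values for illustration; only the regular expressions are taken from this notebook, and `dict.get` stands in for the `try`/`except` used in the actual function.

```python
import re

w_id = re.compile(r'words(\d+)')

# toy word table for one file: only <w> items are stored, so ID 9 is missing
toy_words = {'ES2002a.C.words.xml': {6: ('right', 0), 7: ('.', 0), 8: ('Okay', 0)}}

# hypothetical href values; the second one ends on a non-word ID
hrefs = [
    'ES2002a.C.words.xml#id(ES2002a.C.words0)..id(ES2002a.C.words7)',
    'ES2002a.C.words.xml#id(ES2002a.C.words8)..id(ES2002a.C.words9)',
]

for href in hrefs:
    word_file = re.search(r'^(.+)#', href).group(1)   # file name before '#'
    end_id = re.findall(r'id\((.+?)\)', href)[-1]     # last id(...) is the end point
    end_key = int(w_id.search(end_id).group(1))
    # keep the word if it exists, otherwise use '' for a non-word end point
    word, _ = toy_words[word_file].get(end_key, ('', 0))
    toy_words[word_file][end_key] = (word, 1)

print(toy_words['ES2002a.C.words.xml'])
```

After the loop, entry 7 carries the tag 1 and a new entry 9 holds `('', 1)`, while entries 6 and 8 are unchanged.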

In [26]:
def parsing_one_word(t):
    """
    parse a single words xml file and select the id and text parts to generate a dictionary.
    The second item of each tuple flags a segment point: 0 -> no, 1 -> yes.
    No segment points are known in this function, so every tuple gets 0.
    
    Arguments:
    t -- word file path
    
    Return:
    words -- a dict of words {'int(id)': ('word', 0)}
    """
    with open(t) as f:  # close the file handle once parsed
        xml_soup = bsoup(f, 'lxml')
    words = dict()
    ws = xml_soup.find_all('w')
    for i in ws:
        key = w_id.search(i['nite:id']).group(1)
        words[int(key)] = (i.get_text(), 0)
    return words

In [27]:
def parsing_one_segment(t, word_dict):
    """
    parse a single segments xml file and select the contents of 'nite:child'.
    label segment points in the words dictionary with tag 1,
    such as (word, 1)
    
    Arguments:
    t -- segment file path
    word_dict -- the created words dictionary 
    
    Return:
    word_dict -- relabeled segments word dictionary
    """
    with open(t) as f:  # close the file handle once parsed
        xml_soup = bsoup(f, 'lxml')
    segments = xml_soup.find_all('nite:child', href = True)
    word_file = re.search(r'^(.+)#', segments[0]['href']).group(1)
    for i in segments:
        seg_point = re.findall(r'id\((.+?)\)',i['href'])[-1]
        seg_key = int(w_id.search(seg_point).group(1))
        try:
            word_dict[word_file][seg_key] = (word_dict[word_file][seg_key][0], 1)
        except KeyError:  # segment point is a non-word item (e.g. vocalsound): add an empty entry marked as a segment point
            word_dict[word_file][seg_key] = ('', 1)
        
    return word_dict

The functions for parsing a words file and a segments file are now defined.

The next step is to generate the hash table of words and then mark the segment points with the tag 1.

In this part, we read all the words and segments XML files into memory. The first loop generates the original word dictionary (word hash table) with the structure `{dict of each file name : {dict of each words id : (each word, 0)}}`. To speed up lookups, the outer key is the words file name (for example `ES2002a.A.words.xml`, the same string that appears in each `href`) and the inner key is the word ID number.

In [28]:
# read all words.xml into a dictionary
# {dict of each file name : {dict of each words id : (each word, 0)}}
words_dict = dict()
for wfile in os.listdir(word_xml):
    wfile_path = os.path.join(word_xml, wfile)
    if os.path.isfile(wfile_path) and wfile_path.endswith('.xml'):
        words_dict[wfile] = parsing_one_word(wfile_path)

The loop below marks the segment points in the word hash table by changing the second item of the corresponding values from 0 to 1. All the segments XML files are read, and each one updates the word hash table through the function `parsing_one_segment`. The final word hash table looks like:
``` json
{
    'ES2002a.A.words.xml': 
    {
        0: ('Hi', 0),
        1: (',', 0),
        2: ("I'm", 0),
        3: ('David', 0),
        ...
    },
    ...
}
```

In [29]:
# read all segments.xml into the words dictionary : words_dict
for sfile in os.listdir(seg_xml):
    sfile_path = os.path.join(seg_xml, sfile)
    if os.path.isfile(sfile_path) and sfile_path.endswith('.xml'):
        words_dict = parsing_one_segment(sfile_path, words_dict)

## Parsing topic.xml files
The word hash table generated above is now used to look up words and generate the final text files. According to the requirements, each topic XML file produces a corresponding text file with the same name but ending with '.txt'. Thus, after parsing each topic file, the result is written into the corresponding text file before the next topic XML file is parsed.
* Define functions of parsing topic.xml files.
    - parsing each topic in a topic file
    - parsing each topic file
* Each topic file is parsed through its topics tag. 
* Each topic is separated by 10 asterisks.

Function `parse_one_topic` parses each topic in a topic XML file. Each topic in a topic file contains multiple paragraphs:
``` xml
    [<nite:child href="ES2002a.D.words.xml#id(ES2002a.D.words570)..id(ES2002a.D.words624)"></nite:child>, ... ]
```
We extract the contents of `href` in each `<nite:child >` tag of each topic. The contents record the word file name and its start and end word IDs. All this information is extracted with regular expressions. The word file name and IDs are then used to look up the words in the word hash table and generate the topic string. Each topic ends with 10 asterisks.
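
A minimal sketch of this lookup for a single `href` range, assuming a toy table `toy_words` and a made-up `href`. It covers only the range case and leaves out the empty-text handling of the full function.

```python
import re

w_id = re.compile(r'words(\d+)')

# toy word table; 573 is a non-word item and therefore absent
toy_words = {'ES2002a.D.words.xml': {
    570: ('Okay', 0), 571: ('.', 1),     # 571 closes a segment
    572: ('Thanks', 0), 574: ('.', 1),   # 574 closes a segment
}}

# hypothetical href covering IDs 570..574
href = 'ES2002a.D.words.xml#id(ES2002a.D.words570)..id(ES2002a.D.words574)'

key = re.search(r'^(.+)#', href).group(1)  # outer key: words file name
start, end = (int(w_id.search(v).group(1)) for v in re.findall(r'id\((.+?)\)', href))

txt = ''
for i in range(start, end + 1):
    if i not in toy_words[key]:   # neither a word nor a segment point
        continue
    word, is_seg_end = toy_words[key][i]
    txt += ' ' + word
    if is_seg_end:                # segment end: start a new line
        txt += '\n'
txt += '*' * 10 + '\n'            # topic terminator
print(txt)
```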

In [30]:
def parse_one_topic(a_topic):
    """
    parse all the words in one of the topics in a topic file into a string.
    each topic end with '**********'.
    
    Arguments:
    a_topic -- a list of 'nite:child' items contains under a 'topic' tag in a topic file
            eg. [<nite:child href="ES2002a.D.words.xml#id(ES2002a.D.words570)..id(ES2002a.D.words624)"></nite:child>, ... ]
    
    Return:
    txt -- a string of all words under the topic 
    """
    txt = ''
    for ln in a_topic:   # loop 'nite:child' lines in a topic
        key = re.search(r'^(.+)#', ln['href']).group(1)  # key of words dict
        values = re.findall(r'id\((.+?)\)', ln['href'])  # get start and end point in each word part
        if len(values) == 2:  # values contains 2 id: is a range
            start_num = int(w_id.search(values[0]).group(1))  # get start number dict key
            end_num = int(w_id.search(values[1]).group(1))  # get end number dict key
            for i in range(start_num, end_num+1): 
                try:
                    if words_dict[key][i][0] == '' and txt[-1] == '\n':  # all the recorded not word id is marked with segment point 1
                        continue
                    elif words_dict[key][i][0] == '':
                        txt = txt + '\n'              
                    else:  # word part
                        txt = txt + ' ' + words_dict[key][i][0]
                        if words_dict[key][i][1] == 1:   # the word is marked with segment tag
                            txt = txt + '\n'                
                except (KeyError, IndexError):  # ID not in dict (not a word, not a segment point), or txt is still empty
                    continue
        else:  # values only contains one id: a word or segment point but not word
            st_en_num = int(w_id.search(values[0]).group(1))
            try:
                if words_dict[key][st_en_num][0] == '':
                    continue
                else:
                    txt = txt + ' ' + words_dict[key][st_en_num][0] + '\n'
            except KeyError:  # not segment and not word -> not in dict
                continue
        # if a line in topic txt is not end with '\n' add '\n' 
        try:
            if txt[-1] != '\n':
                txt = txt + '\n'
        except IndexError: # the text contains no words
            continue
    # add ten '*' under the total txt topic 
    txt = txt + "*"*10 + '\n'   
    return txt


A topic file consists of multiple topics. The function `parse_topic_file` parses a whole topic file. It loops over all the topics in a topic XML file, uses `parse_one_topic` to parse each topic into a string, and returns the concatenated string.

In [31]:
def parse_topic_file(file_path):
    """
    parse a topic file into a string.
    use function 'parse_one_topic()' to parse every topic in the topic file
    
    Arguments:
    file_path -- path of a topic file. eg. "./topics/ES2002a.topic.xml"
    
    Return:
    txt -- a string of all words in the topic file
    """
    with open(file_path) as f:  # close the file handle once parsed
        xml_soup = bsoup(f, 'lxml')
    topics = xml_soup.html.body.next.find_all('topic', recursive=False)
    txt = ''
    for i in topics:
        words_field = i.find_all('nite:child', href=True)  
        txt = txt + parse_one_topic(words_field)
    return txt

### - Write string into text file
Each file is parsed into a string; the next step is to write the string into a text file. The function `output_text` does this with two parameters: the text string and the path of the destination file.

In [32]:
def output_text(string, file_path):
    """
    Write the string into a text file with defined path.
    
    Arguments:
    string -- input string which needs to be written into the text file
    file_path -- the path of the text to write
    
    Return:
    None
    """
    with open(file_path, 'w') as text_file:  # the file is closed automatically
        text_file.write(string)
    

* Retrieve all the topic.xml files in the pre-defined directory `topic_xml`.
* Parse each topic.xml file into a string variable with the previous function: `parse_topic_file`.
* Write the parsed string into a text file with function `output_text`.
* The destination file name is extracted from the corresponding topic XML file name with a regular expression, with `.txt` appended; the file is written to `txt_path`.
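
The file name step can be checked in isolation. The sketch below uses the example topic file name `ES2002a.topic.xml` and the `txt_path` value defined earlier; nothing is read from or written to disk.

```python
import os
import re

txt_path = './txt_files'      # output directory, as defined earlier
tfile = 'ES2002a.topic.xml'   # example topic file name

# the non-greedy group stops at the first '.topic'
base = re.search(r'(.+?)\.topic', tfile).group(1)
tx_path = os.path.join(txt_path, base + '.txt')
print(tx_path)
```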

In [33]:
for tfile in os.listdir(topic_xml):
    tfile_path = os.path.join(topic_xml, tfile)
    if os.path.isfile(tfile_path) and tfile_path.endswith('.xml'):
        txt = parse_topic_file(tfile_path)
        txt_file = re.search(r'(.+?)\.topic', tfile).group(1) + '.txt'  # generate text file name
        tx_path = os.path.join(txt_path, txt_file)  # generate a text file path to write in
        output_text(txt, tx_path)
        

## Summary
* Overall, the words files and segments files are combined into one hash table; each word in a topic is then looked up by its ID in that hash table to produce the final result.
* We only select the first-level topics of the topic file; nested topic tags are not considered. Inside each topic, we only need the contents of the `<nite:child >` tags. To select the first-level topics:
    ``` python
        xml_soup.html.body.next.find_all('topic', recursive=False)
    ```
    Here `recursive=False` means the search does not descend below the direct children. `xml_soup` is the XML file read by BeautifulSoup. This call only selects first-level topic tags.
* We should confirm that a paragraph is defined by the segments and by each `<nite:child >` in each topic file, and then confirm that there is only one `\n` between two words.
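
The effect of restricting the search to first-level topics can be sketched with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. This is a swapped-in parser, used here only so the example is self-contained. The XML snippet is a made-up, simplified stand-in for a topic file.

```python
import xml.etree.ElementTree as ET

# simplified stand-in for a topic file: one nested topic inside the first topic
sample = """
<root>
  <topic id="t1">
    <topic id="t1.1"/>
  </topic>
  <topic id="t2"/>
</root>
""".strip()

root = ET.fromstring(sample)

# findall with a bare tag name matches direct children only
first_level = [t.get('id') for t in root.findall('topic')]
# iter walks the whole tree, so nested topics are included
all_levels = [t.get('id') for t in root.iter('topic')]

print(first_level)  # ['t1', 't2']
print(all_levels)   # ['t1', 't1.1', 't2']
```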