# Task 1: Reconstruct the Original Meeting Transcripts

The original meeting transcripts are stored in three types of XML files, whose names end with ".words.xml", ".topic.xml" and ".segments.xml". (Details about the three file types can be found in Section 3 below.) The task here is to reconstruct the original meeting transcripts, with the corresponding topical and paragraph boundaries, from these files. Please note that:

- A meeting transcript must be generated for each "*.topic.xml" file. For example, "ES2002a.txt" will be generated for "ES2002a.topic.xml".
- All the generated meeting transcripts with the ".txt" file extension must be saved in the folder "txt_files".
- The topical boundaries must be denoted with "**********" (i.e., 10 asterisks).
- All tokens, including punctuation marks, must be separated by a white space. For example, "Alright , okay . Okay ."
- Besides the topical boundaries, the paragraph boundaries must also be reconstructed using the "*.segments.xml" files.
- The input files to your notebook "task_1.ipynb" must be the three types of XML files. The output must be the meeting transcripts saved in a set of txt files.
- A sample meeting transcript is provided in the "txt_file" folder.
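
The formatting rules above can be illustrated with a minimal sketch. The helper name `format_transcript` and the nested-list input shape are assumptions made for illustration only; they are not part of the assignment or of the notebook code below.

```python
TOPIC_BOUNDARY = "*" * 10  # ten asterisks, as required above


def format_transcript(topics):
    """Render nested token lists as transcript text (hypothetical helper).

    `topics` is a list of topics; each topic is a list of paragraphs;
    each paragraph is a list of tokens (words and punctuation marks).
    """
    lines = []
    for paragraphs in topics:
        for tokens in paragraphs:
            # every token, punctuation included, is separated by one space
            lines.append(" ".join(tokens))
        # mark the end of the topic
        lines.append(TOPIC_BOUNDARY)
    return "\n".join(lines)


print(format_transcript([[["Alright", ",", "okay", "."], ["Okay", "."]]]))
```

Placing the boundary line after every topic, including the last one, mirrors what the code cell below does; the task description itself does not say whether a trailing boundary is expected.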


In [22]:
# import python Libraries
import re
import os
import xml.etree.ElementTree as ET

In [24]:
#string to store file content
Word_string = "" 

#initialising folder path
path = 'topics/' 

for filename in os.listdir(path):
    #search for topic xml files in the folder defined above
    if not filename.endswith('.topic.xml'): continue
    fullname = os.path.join(path, filename)
    #A variable to store file name, e.g. "ES2002a" for "ES2002a.topic.xml"
    output_file=re.search(r'(.*)\.topic',filename).group(1)
    #using xml tree to parse the topic file (ET.parse opens and closes the file itself)
    tree = ET.parse(fullname)
    root = tree.getroot()
    temp = ""
    for child in root:
        for x in child.iter('{http://nite.sourceforge.net/}child'):
            for key, value in x.items():
                #Regular expression to get the filename, start and end number from topic files
                #(first alternative: a word range; second alternative: a single word)
                twords = re.search(r'(.*)#id\(.*words(\d+)\)\.\.id\(.*words(\d+)\)|(.*)#id\(.*words(\d+)\)', value)
                if twords.group(1) is not None:
                    file_name = twords.group(1)
                    start_word = int(twords.group(2))
                    end_word = int(twords.group(3))
                else:
                    file_name = twords.group(4)
                    start_word = int(twords.group(5))
                    end_word = int(twords.group(5))
                #stores file name
                new_file_name = re.search('(.*)words', file_name).group(1)
                
                #Parsing Segments XML files
                seg_parse = ET.parse('segments/'+ new_file_name + "segments.xml")
                seg_root = seg_parse.getroot()
                for seg in seg_root:
                    #Extracting the contents from Segments XML files
                    for y in seg.iter('{http://nite.sourceforge.net/}child'):
                        for k, v in y.items():
                            #Regular Expression to extract start and end numbers from Segments XML file
                            tsegs = re.search(r'(.*)#id\(.*words(\d+)\)\.\.id\(.*words(\d+)\)|(.*)#id\(.*words(\d+)\)', v)

                            if tsegs.group(1) is not None:
                                start_seg = int(tsegs.group(2))
                                end_seg = int(tsegs.group(3))
                            else:
                                start_seg = int(tsegs.group(5))
                                end_seg = int(tsegs.group(5))
                            
                            #Condition: keep the segment if it lies inside the topic range or overlaps it
                            if (start_word <= start_seg and end_word >= end_seg) or (start_seg <= end_word and end_seg >= start_word):
                                
                                #Parsing Word XML files
                                words_parse = ET.parse('words/' + new_file_name+ "words.xml")
                                words_root = words_parse.getroot()
                                
                                #Extracting the information from w-tag from words XML files
                                for xwords in words_root.iter(tag='w'):
                                    word_value = xwords.attrib.get('{http://nite.sourceforge.net/}id')
                                    
                                    #WordNumber stores the numeric index taken from the word id
                                    WordNumber = int(re.search(r'.*words(\d+)', word_value).group(1))
                                    
                                    #check for topic and segment boundaries
                                    if WordNumber >= start_seg and WordNumber <= end_seg and WordNumber <= end_word and WordNumber >= start_word:
                                        #storing the information text into string
                                        Word_string += " " + xwords.text
                                Word_string += "\n"
                                temp += Word_string
                                Word_string = ''
                                
        #Adding ********** after end of each topic
        temp += "**********\n"
    
    #Removing empty lines from the text
    text_filtered = "\n".join([text.rstrip() for text in temp.splitlines() if text.strip()])
    #Writing to text files (the with-block closes the file automatically)
    with open('txt_files/'+ output_file + ".txt", 'w') as f1:
        f1.write(text_filtered.strip())
        


Note: this task takes a few minutes to run, so please be patient.
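
Part of the running time probably comes from the code cell above parsing the same words file again for every matching segment. One possible way to reduce that, sketched under the assumption that each words file fits comfortably in memory, is to parse each file once and cache the result. The helper `load_words` is hypothetical and is not used by the code above.

```python
import re
import xml.etree.ElementTree as ET
from functools import lru_cache

NITE_ID = '{http://nite.sourceforge.net/}id'


@lru_cache(maxsize=None)
def load_words(path):
    """Parse one words file once and return {word number: token text}."""
    words = {}
    for w in ET.parse(path).getroot().iter('w'):
        # the word number is the trailing digits of the nite:id attribute
        m = re.search(r'words(\d+)$', w.attrib.get(NITE_ID, ''))
        if m and w.text:
            words[int(m.group(1))] = w.text
    return words

# Repeated calls with the same path reuse the cached dictionary, e.g.
# words = load_words('words/ES2002a.B.words.xml')
```

The cached dictionary is shared between callers, so it should be treated as read-only.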

In this task, the following steps were performed:

- All the topic XML files were read from the topics folder. The file name, start word and end word were obtained from the topic child tags using regex.
- After that, the segment XML files were read and the corresponding start and end of each segment were recorded.
- For each topic, the start and end word boundaries of each child were compared with the start and end segment boundaries. For each topic file, the corresponding segments and words files were referred to. All words that lie within both the segment and topic boundaries were stored in a string.
For example, consider the ES2002a.topic.xml file:

nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words71)"/>

which has a start word boundary of 0 and an end word boundary of 71. Its corresponding segment boundaries are:

  nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words1)"/>
  nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words2)..id(ES2002a.B.words3)"/>
  nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words4)..id(ES2002a.B.words71)"/>

Therefore, all words in the words file that lie within the topic boundaries are stored in a string. A line break is added after every segment.

The output file will look like this:

Okay .
 Right .
 Um well this is the kick-off meeting for our our project . Um and um this is just what we're gonna be doing over the next twenty five minutes . Um so first of all , just to kind of make sure that we all know each other , I'm Laura and I'm the project manager . Do you want to introduce yourself again ?
 Mm-hmm .
 Great .
 Hi , I'm David and I'm supposed to be an industrial designer .
 Okay .
 And I'm Andrew and I'm uh our marketing
 Um I'm Craig and I'm User Interface .
 expert .
"**********"

The subtopics are included in their main topic. After every topic, a line of 10 asterisks is added to mark the end of the topic. Similarly, all topic files are read, and the corresponding 139 text files are generated as output.
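
The href matching and boundary comparison described in the steps above can be sketched in isolation as follows. The names `HREF_RE`, `parse_href` and `clip` are hypothetical and chosen for illustration; the pattern is a simplified variant of the regular expression used in the code cell, not a copy of it.

```python
import re

# Matches either a word range ("...words0)..id(...words71)") or a single word.
HREF_RE = re.compile(
    r'(?P<file>[^#]+)#id\([^)]*words(?P<start>\d+)\)'
    r'(?:\.\.id\([^)]*words(?P<end>\d+)\))?$'
)


def parse_href(href):
    """Return (file name, first word number, last word number) for one href."""
    m = HREF_RE.match(href)
    if m is None:
        raise ValueError("unrecognised href: " + href)
    start = int(m.group('start'))
    # a single-word href has no end part, so the range is just that word
    end = int(m.group('end')) if m.group('end') is not None else start
    return m.group('file'), start, end


def clip(topic, segment):
    """Intersect two inclusive (start, end) ranges; None if they do not overlap."""
    lo = max(topic[0], segment[0])
    hi = min(topic[1], segment[1])
    return (lo, hi) if lo <= hi else None


name, t_start, t_end = parse_href(
    "ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words71)")
print(name, clip((t_start, t_end), (4, 71)))
```

Intersecting the two ranges up front gives the same set of word numbers as checking every word against both the topic and the segment boundaries.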
