
# Reconstruct the Original Meeting Transcripts

Version: 2.0

Environment: Python 3.6.3 and Anaconda 4.3.0 (64-bit)

Libraries used:
* re, for regular expressions (standard library, included in Anaconda Python 3.6)
* os, to list the files in a folder and join file paths
* xml.etree.ElementTree, to parse the XML files

## 1.  Import libraries 

In [None]:
# import relevant libraries
import xml.etree.ElementTree as ET
import re
import os

## 2. Segments

In this section, we will perform the following tasks:
* parse a segments XML file to understand its structure
* examine the contents of the file
* extract the relevant information, i.e. the href attribute of each segment's child element
* build a dictionary of segment ranges for a file
* loop over all segment files to build one dictionary per file
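
The href-to-range step listed above can be tried on its own before touching any files. The sketch below is a minimal illustration, not the notebook's method: the helper name parse_href_range and the two sample href strings are assumptions made up for the example, and it anchors the regex on the literal "words" prefix instead of counting digit groups.

```python
import re

def parse_href_range(href):
    """Return (start, stop) word ids from a segment href (hypothetical helper)."""
    # Match only the digits that follow "words" and precede ")",
    # so the digits in the meeting name are ignored
    ids = [int(n) for n in re.findall(r'words(\d+)\)', href)]
    return ids[0], ids[-1]

# Assumed href shapes: one covering a range of words, one covering a single word
ranged = 'ES2002a.A.words.xml#id(ES2002a.A.words0)..id(ES2002a.A.words3)'
single = 'ES2002a.A.words.xml#id(ES2002a.A.words4)'

print(parse_href_range(ranged))
print(parse_href_range(single))
```

Because the pattern captures the whole run of digits, this variant would also cope with word ids longer than four digits, should they occur.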

#### 2.1 Parse files and build dictionary

In [None]:
# use the xml ElementTree library to parse the segments xml file
treeSegments = ET.parse('ES2002a.A.segments.xml')
# find all segment tags
segments = treeSegments.findall('segment')
# collect the href attribute of every child of every segment
segment_list = []
for segment in segments:
    for child in segment:
        segment_list.append(child.get('href'))

# this logic is explained in more detail in the Segment_dictionary function below

In [None]:
# example output: the href values collected from the segments file parsed above
print(segment_list)

In [None]:
# build a dictionary of segment ranges for a single segments file


# the function takes one argument, the path to a segments file
def Segment_dictionary(file):
    Dictionary_Segment = {}
    # use element tree to parse the chosen segment file
    treeSegments = ET.parse(file)
    # find all segment tags
    segments = treeSegments.findall('segment')
    # collect the href attribute of every child of every segment
    segment_list = []
    for segment in segments:
        # the word range of a segment is stored in the href attribute
        for child in segment:
            segment_list.append(child.get('href'))
    # turn every href into a dictionary entry keyed by the first word id of the segment
    for href in segment_list:
        # find every run of one to four digits in the href
        new = re.findall(r'\d{1,4}', href)
        # an href that covers a range of words yields more than three numbers, because the four
        # digits of the meeting name are matched too: the start id is new[-3], the stop id is new[-1]
        if len(new) > 3:
            Dictionary_Segment[str(new[-3])] = [str(new[-3]), str(new[-1])]
        # otherwise the href points at a single word, so the entry holds just that id
        else:
            Dictionary_Segment[str(new[-1])] = [str(new[-1])]

    return Dictionary_Segment

    
    
    
    
    

#### 2.2 Segment dictionary for all files

In [None]:
# path to meeting transcripts file
#file_Path = "C:\\Users\\AshSu\\Downloads\\meeting_transcripts_student\\meeting_transcripts_student\\segments"
file_Path = "./segments"
# dictionary of dictionaries: maps each file key (e.g. 'ES2002a.A') to the segment dictionary of that file
Dicts_dictSegs = {}

# using the os module, loop through all the files in the segments folder
for file in os.listdir(file_Path):
    # build the full path to the current file
    file = os.path.join(file_Path, file)
    # use regex to extract the file key (meeting and speaker) from the path; it is used as the dictionary key
    Index = re.findall(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w', file)[0]
    # call the Segment_dictionary function on the current file
    seg_file = Segment_dictionary(file)
    # store the segment dictionary under its file key
    Dicts_dictSegs[Index] = seg_file
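
Indexing the result of re.findall with [0], as the loop above does, raises an IndexError as soon as the folder holds a file whose name does not match the pattern. A minimal sketch of a more forgiving lookup follows; the helper name file_key and the sample paths are hypothetical, and only the alternation itself is taken from the notebook.

```python
import os
import re

# Same alternation as the notebook, compiled once from a raw string
KEY_PATTERN = re.compile(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w')

def file_key(path):
    """Return the meeting/speaker key found in a file name, or None (hypothetical helper)."""
    match = KEY_PATTERN.search(os.path.basename(path))
    return match.group(0) if match else None

# Made-up paths: two that match and one stray file that does not
print(file_key('./segments/ES2002a.A.segments.xml'))
print(file_key('./segments/IB4001.B.segments.xml'))
print(file_key('./segments/.DS_Store'))
```

A loop could then skip any file for which the helper returns None instead of stopping with an error.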

## 3. Words

In this section, we will perform the following tasks:
* parse a words XML file to understand its structure
* examine the contents of the file
* extract all words and their id numbers
* build a dictionary that maps each id to its word
* loop over all word files to build one dictionary per file
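
To see the extraction step in isolation, the sketch below parses a small inline XML string rather than a real words file. The sample markup is made up and only mimics the layout assumed in this section (w tags carrying a namespaced nite:id attribute); the dictionary comprehension is one possible way to pair ids with words, not necessarily the approach used further down.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes a namespaced attribute as '{namespace-uri}name'
NITE_ID = '{http://nite.sourceforge.net/}id'

# Made-up sample that mimics the assumed layout of a words file
SAMPLE = '''<nite:root xmlns:nite="http://nite.sourceforge.net/">
  <w nite:id="ES2002a.A.words0">Okay</w>
  <w nite:id="ES2002a.A.words1" punc="true">.</w>
  <vocalsound nite:id="ES2002a.A.words2"/>
  <w nite:id="ES2002a.A.words3">Right</w>
</nite:root>'''

root = ET.fromstring(SAMPLE)
# Keep only the digits after "words" as the key; tags other than w are ignored
word_by_id = {
    w.get(NITE_ID).split('words')[1]: w.text
    for w in root.findall('w')
}
print(word_by_id)
```

Ids that belong to tags other than w are simply absent from the dictionary, which is why a later lookup has to tolerate missing ids.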

#### 3.1 Parse using element tree and build dictionary

In [None]:
# parsing through the words xml files
# use element tree to parse through xml file
treeWords = ET.parse('ES2002a.A.words.xml')

#find words in "w" tag
words = treeWords.findall('w')

# instantiate two lists
# a list for words
word_l = []
# a list for id numbers
id_l =[]

# for word within all of the word tags 
for word in words:
    # append word to word list
    word_l.append(word.text)
    # append corresponding id to id list
    id_l.append(word.attrib.get('{http://nite.sourceforge.net/}id'))

# must change information in the id tag to just be the number
for x in range(len(id_l)):
    # split "id" on words
    y = id_l[x].split("words")
    # now id equal to just the id digit
    id_l[x] = y[1]

# print word and id lists
print(word_l, id_l)
# both lists have the same length, so words and ids line up by index

In [None]:
# function takes in one argument, i.e. the file
def Word_Dictionary(file):
    # use element tree to parse through xml file
    treeWords = ET.parse(file)
    #find words in "w" tag
    words = treeWords.findall('w')
    # instantiate two lists
    # a list for words
    word_l = []
    # a list for id info
    id_l =[]
    # instantiate a dictionary
    Dictionary_Word = {}
    # instantiate variable to be used in a for loop to concatenate both word and id lists to build a dictionary
    i = 0
    
    # for word within all of the word tags     
    for word in words:
        # append word to word list
        word_l.append(word.text)
        # append corresponding id to id list
        id_l.append(word.attrib.get('{http://nite.sourceforge.net/}id'))
        
    # reduce each id to just the number by removing the rest of the string
    for x in range(len(id_l)):
        # split the id on "words"
        y = id_l[x].split("words")
        # the id is now equal to just the id digits
        id_l[x] = y[1]

    # both lists have the same length, and each word sits at the same index as its id,
    # so we can build the dictionary by stepping through both lists with a while loop
    while i < len(word_l):
        # use the id as the key and the word as the value
        Dictionary_Word[id_l[i]] = word_l[i]
        i += 1
     
    return Dictionary_Word

    

In [None]:
Word_Dictionary('ES2002a.B.words.xml')

#### 3.2 Create a word dictionary for all files

In [None]:
# set the path to the words folder

file_Path = "./words"
#file_Path = "C:\\Users\\AshSu\\Downloads\\meeting_transcripts_student\\meeting_transcripts_student\\words"
# new dictionary, all files are collected within this dictionary
Dicts_dictWords = {}

# using the os module, loop through all the files in the words folder
for xfile in os.listdir(file_Path):
    # build the full path to the current file
    xfile = os.path.join(file_Path, xfile)
    # use regex to extract the file key from the path; it is used as the dictionary key
    Index = re.findall(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w', xfile)[0]
    # call the Word_Dictionary function on the current file
    words_all = Word_Dictionary(xfile)
    # store the word dictionary under its file key
    Dicts_dictWords[Index] = words_all

## 4. Topics

In this section, we will perform the following tasks:
* parse a topics XML file to understand its structure
* examine the contents of the file
* extract all topics, sub-topics and sub-sub-topics together with the word ids they point to
* build a list containing all the topics of a file
* loop over all topic files
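
Since topics can nest to different depths, one way to collect the word references is to walk every descendant of a topic rather than writing one loop per level. The sketch below is an alternative illustration under assumptions: the inline XML is made up to mimic a topic file, the helper name word_hrefs is hypothetical, and it relies on Element.iter(), which visits an element and all of its descendants in document order.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one topic with a default-topics pointer, a word range and a nested sub-topic
SAMPLE = '''<nite:root xmlns:nite="http://nite.sourceforge.net/">
  <topic nite:id="t1">
    <nite:pointer role="scenario_topic_type" href="default-topics.xml#id(top.4)"/>
    <nite:child href="ES2002a.B.words.xml#id(ES2002a.B.words0)..id(ES2002a.B.words5)"/>
    <topic nite:id="t2">
      <nite:child href="ES2002a.A.words.xml#id(ES2002a.A.words7)"/>
    </topic>
  </topic>
</nite:root>'''

def word_hrefs(topic):
    """Collect the word hrefs under a topic at any nesting depth (hypothetical helper)."""
    hrefs = []
    for node in topic.iter():
        href = node.get('href')
        # Skip tags without an href and pointers into the default-topics file
        if href is not None and not href.startswith('default-topics'):
            hrefs.append(href)
    return hrefs

root = ET.fromstring(SAMPLE)
top_level_topics = root.findall('topic')
for topic in top_level_topics:
    print(word_hrefs(topic))
```

Here findall('topic') returns only the top-level topic, so the nested sub-topic's words end up in the same list as its parent's.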

#### 4.1 Understand the structure of topic files, build a list containing all the topics

In [None]:
treeTopics = ET.parse('ES2002a.topic.xml')
topics = treeTopics.findall('topic')
for topic in topics:
    for child in topic:
        for grandchild in child:
            print(grandchild.get('href'))

In [None]:
# use element tree to parse the topics xml file
treeTopics = ET.parse('ES2014d.topic.xml')
# find all topic tags
topics = treeTopics.findall('topic')
All_topics = []
# iterate over each topic using a for loop
for topic in topics:
    # one list of [key, start, stop] entries per topic
    A_list = []
    for child in topic:
        href_tag = child.get('href')
        # skip the two kinds of href we don't want: a missing href
        # and an href that points to the default-topics file
        if href_tag is None:
            pass
        elif href_tag.startswith('default-topics'):
            pass
        else:
            new = re.findall(r'\d{1,4}', href_tag)
            # build a key so we know which words file the range refers to
            key = re.findall(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w', href_tag)[0]
            # same logic as the segments dictionary
            if len(new) > 3:
                A_list.append([key, int(new[-3]), int(new[-1])])
            else:
                A_list.append([key, int(new[-1]), int(new[-1])])

        # repeat the same steps for the grandchildren (sub-topics)
        for grandchild in child:
            href_tag = grandchild.get('href')
            if href_tag is None:
                pass
            elif href_tag.startswith('default-topics'):
                pass
            else:
                new = re.findall(r'\d{1,4}', href_tag)
                # build a key so we know which words file the range refers to
                key = re.findall(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w', href_tag)[0]
                # same logic as the segments dictionary
                if len(new) > 3:
                    A_list.append([key, int(new[-3]), int(new[-1])])
                else:
                    A_list.append([key, int(new[-1]), int(new[-1])])

    All_topics.append(A_list)
        

print(All_topics)
                

In [None]:
def List_topics(file):
    # The goal is to build a list of lists: a root list containing all the topics, then one child list per
    # topic, and within that one [key, start, stop] entry per range of words, since a topic can use words
    # from different word files

    # instantiate the root list
    All_topics = []
    # use element tree to parse the xml file
    treeTopics = ET.parse(file)
    # find all topic tags
    topics = treeTopics.findall('topic')

    # iterate over each topic using a for loop
    for topic in topics:
        # a second list per topic, which lets us build a list of lists
        A_list = []
        # loop through the child tags under each topic
        for child in topic:
            # use the element tree get function to read the string stored in the href attribute
            href_tag = child.get('href')
            # skip the two kinds of href we don't want
            if href_tag is None:
                # the tag has no href; only occurs in some files
                pass
            elif href_tag.startswith('default-topics'):
                # the href points to the default-topics file; also only occurs in some instances
                pass
            else:
                # repeat the steps used in the Segment_dictionary function:
                # use regex to find every run of one to four digits in the string
                new = re.findall(r'\d{1,4}', href_tag)
                # build a key so we know which words file the range refers to
                key = re.findall(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w', href_tag)[0]
                # more than three numbers means the href covers a range of words
                if len(new) > 3:
                    # append the key with the start and stop of the range to our second list
                    A_list.append([key, int(new[-3]), int(new[-1])])
                else:
                    # a single word: start and stop are the same
                    A_list.append([key, int(new[-1]), int(new[-1])])
            # repeat the steps if sub-topics exist; as per the specifications of the assignment,
            # sub-topics are treated like topics
            for grandchild in child:
                href_tag = grandchild.get('href')
                if href_tag is None:
                    pass
                elif href_tag.startswith('default-topics'):
                    pass
                else:
                    new = re.findall(r'\d{1,4}', href_tag)
                    # build a key so we know which words file the range refers to
                    key = re.findall(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w', href_tag)[0]
                    # same logic as the segments dictionary
                    if len(new) > 3:
                        A_list.append([key, int(new[-3]), int(new[-1])])
                    else:
                        A_list.append([key, int(new[-1]), int(new[-1])])
                # repeat the steps if sub-sub-topics exist
                for greatgrandchild in grandchild:
                    href_tag = greatgrandchild.get('href')
                    if href_tag is None:
                        pass
                    elif href_tag.startswith('default-topics'):
                        pass
                    else:
                        new = re.findall(r'\d{1,4}', href_tag)
                        key = re.findall(r'ES\d{4}\w\.\w|IB\d{4}\.\w|IS\d{4}\w\.\w|TS\d{4}\w\.\w', href_tag)[0]
                        if len(new) > 3:
                            A_list.append([key, int(new[-3]), int(new[-1])])
                        else:
                            A_list.append([key, int(new[-1]), int(new[-1])])

        # append our second list to our root list
        All_topics.append(A_list)

    return All_topics
    
    

#### 4.2 Parse and extract topics from all files

In [None]:
filePath = "./topics"
#filePath = "C:\\Users\\AshSu\\Downloads\\meeting_transcripts_student\\meeting_transcripts_student\\topics"
AlltopicFiles = []

for xfile in os.listdir(filePath):
    xfile = os.path.join(filePath, xfile)
    topics = List_topics(xfile)
    AlltopicFiles.append(topics)

In [None]:
AlltopicFiles[1]

In [None]:
# explore the structure of the Alltopicfiles
for file in AlltopicFiles:
    for topic in file:
        for content in topic:
            print(content)
                    

## 5. Topics & Segments

In this section, we will perform the following tasks:
* find all of the segments within a given topic
* handle segments that begin before a topic starts or end after a topic finishes
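
The boundary handling described above amounts to intersecting each segment range with the topic range. The sketch below shows that idea on made-up numbers; the helper name clip_segment and the tuple representation are assumptions for illustration and are simpler than the string-based dictionaries used in this notebook.

```python
def clip_segment(segment, topic_start, topic_stop):
    """Return the part of (seg_start, seg_stop) inside the topic, or None (hypothetical helper)."""
    seg_start, seg_stop = segment
    # The overlap starts at the later start and ends at the earlier stop
    start = max(seg_start, topic_start)
    stop = min(seg_stop, topic_stop)
    return (start, stop) if start <= stop else None

segments = [(0, 4), (5, 9), (10, 14)]  # made-up segment ranges
topic = (3, 11)                        # made-up topic range

clipped = [clip_segment(s, *topic) for s in segments]
print(clipped)
```

The first segment loses the words before the topic starts, the last one loses the words after the topic ends, and the middle one is kept whole.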


In [None]:
# find all segments in a topic
# (in the final text each segment is written on its own line)
# the argument is the list of [key, start, stop] entries of one topic, as built by List_topics
# note: the start value of each entry is advanced in place while the function runs
def Topic_Segs(Alltopic_files):
    topic_segs = []
    # each entry is one word range of the topic within a single words file
    for topic in Alltopic_files:
        # start the result for this entry with the file key, like 'ES2002a.D'
        segment = [topic[0]]
        # keep going while the current position is not past the end of the topic range
        while topic[1] <= topic[2]:
            # check whether a segment starts exactly at the current position
            if str(topic[1]) in Dicts_dictSegs[topic[0]]:
                new = Dicts_dictSegs[topic[0]][str(topic[1])].copy()
                # move the current position to the word after the end of this segment
                topic[1] = int(Dicts_dictSegs[topic[0]][str(topic[1])][-1]) + 1
                # if the segment runs past the end of the topic, cut it off at the topic end
                if topic[1] > topic[2]:
                    new[-1] = str(topic[2])
                # add the (possibly shortened) segment to the result
                segment.append(new)
            else:
                # the current position lies inside a segment that started earlier,
                # so find the closest segment start before the current position
                a_list = []
                for element in Dicts_dictSegs[topic[0]]:
                    if int(element) < int(topic[1]):
                        a_list.append(int(element))
                new_start = max(a_list)
                new = Dicts_dictSegs[topic[0]][str(new_start)].copy()
                # the part of the segment that belongs to the topic starts at the current position
                new[0] = str(topic[1])
                # move the current position to the word after the end of this segment
                topic[1] = int(Dicts_dictSegs[topic[0]][str(new_start)][-1]) + 1
                # if the segment runs past the end of the topic, cut it off at the topic end
                if topic[1] > topic[2]:
                    new[-1] = str(topic[2])
                # add the shortened segment to the result
                segment.append(new)
        # add the result for this entry to topic_segs
        topic_segs.append(segment)

    return topic_segs
                
    

## 6. Build text

In this section, we will perform the following tasks:
* Create a text with the pattern specified in the assignment
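
As a rough picture of the target layout, the sketch below builds the text of one topic from a made-up word dictionary: each segment becomes one line and a line of asterisks closes the topic. The helper name segment_text and the sample data are hypothetical, and unlike the function in this section it joins words without a leading space.

```python
# Made-up word dictionary keyed by word id, mirroring the shape of one words-file dictionary
words = {'0': 'Okay', '1': ',', '2': 'good', '3': 'morning', '5': 'Hello', '6': '.'}

def segment_text(words, start, stop):
    """Join the words whose ids fall in [start, stop], skipping missing ids (hypothetical helper)."""
    found = (words.get(str(i)) for i in range(start, stop + 1))
    return ' '.join(w for w in found if w)

# One topic made of two segments, followed by the topic separator
topic_segments = [(0, 3), (4, 6)]
lines = [segment_text(words, a, b) for a, b in topic_segments]
topic_block = '\n'.join(line for line in lines if line) + '\n**********'
print(topic_block)
```

Id 4 is missing from the sample dictionary on purpose, to show that gaps in the ids do not break the line.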

In [None]:
def Create_text(topicSegs):
    # topicSegs is one item returned by Topic_Segs: a file key followed by [start, stop] segments
    # create an empty list to hold the text of each segment
    A_list = []
    # start at index 1 because index 0 holds the file key
    x = 1
    # while x is less than the length of our argument
    while x < len(topicSegs):
        seg = topicSegs[x]
        # the text of the current segment
        text = ''
        # loop over every word id in the segment to build the string
        for y in range(int(seg[0]), int(seg[-1]) + 1):
            Stringy = str(y)
            # only use ids that exist in the word dictionary of this file
            if Stringy in Dicts_dictWords[topicSegs[0]]:
                item = Dicts_dictWords[topicSegs[0]][Stringy]
                # add a space before every item, including punctuation
                text += ' ' + item
        # append the segment text to the list
        A_list.append(text)
        x += 1

    return A_list


## 7. Build all of the text files

In [None]:
# iterate through the topics of every topic file
for file in AlltopicFiles:
    # name for the text file: the meeting name taken from the first key, e.g. 'ES2002a' from 'ES2002a.A'
    Text_file = re.findall(r'(\w+)\.', file[0][0][0])[0]
    # collects the text of every topic in this meeting
    text = []
    for topics in file:
        # call the Topic_Segs function on the word ranges of the current topic
        section = Topic_Segs(topics)
        New = []
        for a_list in section:
            # skip entries that contain only the file key and no segments
            if len(a_list) == 1:
                continue
            New_list = Create_text(a_list)

            # drop empty segments
            while '' in New_list:
                New_list.remove('')

            # a new line separates segments
            text_word = '\n'.join(New_list)
            if text_word != '':
                New.append(text_word)
        # a line of asterisks marks the end of each topic paragraph
        New = '\n'.join(New)
        text.append(New)
        text.append('**********')

    text = '\n'.join(text)

    # make sure the output folder exists before writing
    os.makedirs('txt_files', exist_ok=True)
    Text_Name = 'txt_files/' + Text_file + '.txt'
    #fileName = "C:\\Users\\AshSu\\Downloads\\meeting_transcripts_student\\meeting_transcripts_student\\txt_files\\" + Text_file + '.txt'

    # write the text file to the output folder
    with open(Text_Name, 'w') as Final_File:
        Final_File.write(text)
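
The writing step can be tested without the real data. The sketch below is a minimal stand-alone version under assumptions: the helper name write_transcript is hypothetical, the topic texts are made up, and it writes into a temporary folder so that nothing in the working directory is touched.

```python
import os
import tempfile

def write_transcript(folder, meeting, topic_blocks):
    """Write one meeting's topic blocks to <folder>/<meeting>.txt (hypothetical helper)."""
    os.makedirs(folder, exist_ok=True)  # create the output folder if it is missing
    path = os.path.join(folder, meeting + '.txt')
    # Each topic is followed by a line of asterisks, as in the loop above
    text = '\n'.join(block + '\n**********' for block in topic_blocks)
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)
    return path

out_dir = os.path.join(tempfile.mkdtemp(), 'txt_files')
path = write_transcript(out_dir, 'ES2002a', ['Okay .', 'Right .'])
with open(path, encoding='utf-8') as handle:
    written = handle.read()
print(written)
```

Passing an explicit encoding is a choice made for this sketch; it keeps the output independent of the platform default.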
        


## 8. Conclusion

This task proved difficult because the structure of the files varies: for example, some topic files contain sub-sub-topics while most do not. The most practical approach was to build a series of dictionaries where applicable, so that the relevant information could be easily accessed. The final hurdle was handling the cases where segment boundaries and topic boundaries do not line up.
