# Task 2: Generate Sparse Representations 

## *Task 2.1*

The aim of this task is to build sparse representations for the meeting transcripts generated in task 1, which includes word tokenization, vocabulary generation, and the generation of sparse representations. Please note that 
- The word tokenization must use the following regular expression, "\w+(?:[-']\w+)?", and all the words must be converted into the lower case.
- The stop words list (i.e, stopwords_en.txt) provided in the zip file must be used.
- The words, whose document frequencies are greater than 132, must be removed.
- Generating multi-word phrases (i.e., collocations) are not needed.

In [155]:
# import python Libraries
import glob
import os
from nltk.tokenize import RegexpTokenizer 
from nltk.corpus import stopwords
from collections import Counter
import pickle
import re
import json

In [170]:

#Initializing path 
path="txt_files/"

#initializing variables to store data
final_data=[]
#count=0
stopwords_list=[]

#Extracting stopwords file from stopwords_en.txt file and appending it to list
stopWords_file = open("stopwords_en.txt","r")
for word in stopWords_file.read().split():
    stopwords_list.append(word)
stopWords_file.close()

#Taking only unique stop words from stopwords list
stopwords_set = set(stopwords_list)

#Reading all txt files
for filename in os.listdir(path):
    #string to store file content
    data = ''
    #count=count+1
    filename = os.path.join(path, filename)
    #Reading all files in txt_files folder
    file = open(filename,"r")   
    file_name = file.read().replace('\n', '')
    data = data + file_name 
    #closing the file
    file.close()
    #Converting data into lower case
    data = data.lower()
    #Tokenizing the data using RegexpTokenizer
    tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
    unigram_tokens = tokenizer.tokenize(data)
    #Removing stopwords from the unigram tokens
    stopwords_removed = [w for w in unigram_tokens if w not in stopwords_set]
    Removed_tokens = [w for w in stopwords_removed if len(w)>=3] 
    final_data.extend(list(set(Removed_tokens)))


index_dict={}
#Counter to assign index value to tokens
index_dict=Counter(final_data)
index_dict_copy=dict(index_dict)

#Removing words, whose document frequencies are greater than 132
for key,value in index_dict.items():
    if value>132:
        del index_dict_copy[key]
        
final_tokens=[]
for key in index_dict_copy:
    final_tokens.append(key)

#Sorting final tokens
final_tokens.sort()

final_tokens

['a-hold',
 'a_a_',
 'a_a_s',
 'a_m_i_',
 'a_n_',
 'a_p_o_g_e_e_',
 'a_s',
 'a_s_r_',
 'a_v_',
 'abandon',
 'abandoned',
 'abbie',
 'abbing',
 'abbreviations',
 'abdul',
 'abigail',
 'abilities',
 'ability',
 'abo',
 'abou',
 'abrupt',
 'abs',
 'absolute',
 'absolutely',
 'absorb',
 'absorbed',
 'abstract',
 'abstraction',
 'abused',
 'abut',
 'academy',
 'acc',
 'acce',
 'accent',
 'accents',
 'accentu',
 'accentuate',
 'accept',
 'acceptability',
 'acceptable',
 'acceptance',
 'accepted',
 'accepting',
 'accepts',
 'access',
 'accessed',
 'accessible',
 'accessoire',
 'accessory',
 'accident',
 'accidentally',
 'acco',
 'accommodate',
 'accommodated',
 'accommodating',
 'accompanying',
 'accomplish',
 'accomplishing',
 'account',
 'accountant',
 'accountants',
 'accounted',
 'accounting',
 'accounts',
 'accu',
 'accumulate',
 'accuracy',
 'accurate',
 'accurately',
 'accustomed',
 'ach',
 'ache',
 'achieve',
 'achieved',
 'achieving',
 'acknowledge',
 'acknowledged',
 'acknowledgemen

In [171]:
#Adding index
vocab = []
count = 0
tokens = ''

for i in range(0, len(final_tokens)):
    tokens = final_tokens[i] + ':' + str(count)
    vocab.append(tokens)
    count = count + 1
    
#Writing to file
vocab_file = open('vocab.txt', 'w')

for item in vocab:
    vocab_file.writelines("%s\n" % item)
    

Note: In case the count in output vocab.txt is not 10304, please run second tab again.

In above task, following steps were performed:
    

- Reading stopwords file
- Reading all text files generated from task 1
- Converting the text into lower case
- Then, tokenizing the data using RegexpTokenizer given in tutorials
- Removing all stopwords from the unigram tokens
- Removing words, whose document frequencies are greater than 132 and then sorting the data
- Adding index to the tokens and writing to vocab.txt file

# *Task 2.2*

It contains the topic boundaries encoded in boolean vectors. For example, if a meeting transcript, "ES2018d.txt" contains 10 paragraphs in total after being preprocessed, and there are topic boundaries after the 2nd, 5th, and 7th paragraphs, the boolean vector must be "ES2018d:0,1,0,0,1,0,1,0,0,1". Every line in topic_seg.txt corresponds to one meeting transcript.

In [159]:
#string to store file content
final_Seg_string = " " 

path = 'txt_files/' #initialising folder path

for filename in os.listdir(path):
    #search for xml files in the folder defined above
    if not filename.endswith('.txt'): continue
    fullname = os.path.join(path, filename)
    #Reading each text file
    f = open(fullname, 'r')
    #Extracting file name
    file_name = filename.strip(".txt")
    #list to store file content
    data_list = []
    #Reading each line from text file
    for line in f.readlines():
        line=line.rstrip()
        data_list.append(line)
    paragraph = len(data_list)
    
    #Creating a list with boolean 0
    boolean_vector=[0] * paragraph
    
    #Obtaining the index position of aster
    line_index = [i for i, x in enumerate(data_list) if x == "**********"]
    
    #Assigning 1 to positions before the occurence of aster
    for index in line_index:
        boolean_vector[index-1] = 1
        
    #Adding ':' to the file name
    boolean_string=file_name+':'
    
    #Adding boolean values to the list followed by ','
    for x in boolean_vector:
        boolean_string=boolean_string + str(x) + ','
        
    boolean_string = boolean_string.rstrip(',')
    final_Seg_string += boolean_string + '\n'
    
#Writing the generated list of boolean vectors to the file
topic_seg = open("topic_seg.txt", 'w')
topic_seg.write(final_Seg_string)
topic_seg.close()
    

In this task, following steps were performed:

- All text files were read which were generated from task 1
- A boolean vector was taken with values 0 and 1. 1s were assigned to those lines succeeded with 10 *'s, while 0's were assigned to all the remaining lines.
For example, consider Es2002a.txt file. 

Okay .
 Right .
 Um well this is the kick-off meeting for our our project . Um and um this is just what we're gonna be doing over the next twenty five minutes . Um so first of all , just to kind of make sure that we all know each other , I'm Laura and I'm the project manager . Do you want to introduce yourself again ?
 Mm-hmm .
 Great .
 Hi , I'm David and I'm supposed to be an industrial designer .
 Okay .
 And I'm Andrew and I'm uh our marketing
 Um I'm Craig and I'm User Interface .
 expert .
**********

So for Okay, Right and every other line 0 was assigned, while 1 was assigned only to the line preceeding with 10 *'s. in this case 1 was assigned to expert.

Output will be:

ES2002a:0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0

Simillarly, same operation was performed for rest of the files in the txt_files folder.

## *Task 2.3*

Each txt file in the "sparse_files" folder corresponds to one of the meeting transcripts in the "txt_files" folder, and they have the same file name.  For example, "./sparse_files/ES2002a.txt" corresponds to "./txt_files/ES2002a.txt". Each file in "/sparse_files" contains the sparse representations for all its paragraphs as

In [161]:
#Initializing variables to store vocab words and count
vocab_words=[]
vocab_index=[]
new_list=[]
with open('vocab.txt') as f:
    vocab_file_list = f.read().splitlines()


for i in range(len(vocab_file_list)):
    vocab_words.append(re.search('(.*):',vocab_file_list[i]).group(1))
    vocab_index.append(re.search(':(\d+)',vocab_file_list[i]).group(1))
vocab_dict = dict(zip(vocab_words, vocab_index))


path="txt_files/"
for filename in os.listdir(path):
    file_name = filename
    filename = os.path.join(path, filename)
    
    #reading text files
    txt_files = open(filename,"r")
    lines = txt_files.readlines()
    txt_files.close()
    
    lines_list=[]
    for line in lines:
        line=line.strip().lower()
        lines_list.append(line)
    
    #String to store data
    final_string=''

    #Iterate over each line in the file.
    for line in lines_list:
            if(line=='**********'):
                pass
            else:
                lst=line.split(' ')
                #Initialize dictionary to store words count
                count_dict={}
                #Counter to count frequency of occurence
                count_dict=Counter(lst)
                count_dict_copy={}
                for key,value in count_dict.items():
                    if key in vocab_dict:
                        index=vocab_dict[key]
                        count_dict_copy[index]=value
                #Converting dictionary to string
                string_values = ' '.join('{0}:{1}'.format(key, val) for key, val in count_dict_copy.items())
                final_string += string_values +'\n'
    
    #Removing any empty spaces or lines
    final_string = "\n".join([text.rstrip() for text in final_string.splitlines() if text.strip()])
    
    #Writing output to a text file
    with open('sparse_files/'+ file_name + ".txt", 'w') as f1:
        f1.write(final_string)
    
    #closing file
    f1.close()

In this task, following steps were performed:

- vocab file genereated in above task was read and converted into dictionary
- Text files were read from the txt_files, generated from the task 1 and converted into lower case and then appended to a list
- Counter was used to count the frequency of occurence of words. The words from the vocab file were then compared with the words from the text files of the txt_files folder. 
- For each line, the frequency of occurence of words were noted.
- Dictionary was converted into string and white space or empty lines were removed.
- Final string then was written to text file with corresponding file name.