# Task 2 Generate Sparse Representations 

#### Student Name: Zhiqing Shu
#### Student ID: 28217551

Date: 03/04/2018
Version: 2.0

Environment: Python 3.6.5 and Jupyter notebook

Libraries used:
* re (for regular expressions, included in Anaconda Python 3.6)
* os (for useful functions on pathnames, included in Anaconda Python 3.6)
* itertools (for iterator building blocks such as `chain`, included in Anaconda Python 3.6)
* nltk (for building Python programs to work with human language data, included in Anaconda Python 3.6)
* collections.OrderedDict (for keeping a dictionary in sorted order, included in Anaconda Python 3.6)

### Detail Requirements

Task 2: Generate sparse representations for the meeting transcripts. The aim of this task is to build sparse representations for the meeting transcripts generated in task 1, which includes word tokenization, vocabulary generation, and the generation of sparse representations. Please note that 
* The word tokenization must use the following regular expression, "\w+(?:[-']\w+)?", and all the words must be converted into the lower case.
* The stop words list (i.e., stopwords_en.txt) provided in the zip file must be used.
* Words whose document frequency is greater than 132 must be removed.
* Generating multi-word phrases (i.e., collocations) is not needed.
* The output of this task must contain the required files.

### Import Libraries

In [1]:
import re
import os
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.probability import *
import itertools
from itertools import chain
from collections import OrderedDict
#from nltk.tokenize import MWETokenizer

### Load files

Here, I load all the meeting transcripts generated in task 1 into a dictionary, where the key is the file name and the value is the text of the meeting transcript.

In [2]:
txt_file_path = "./txt_files"

In [3]:
def parsing(t):
    # read one transcript; the with-statement closes the file afterwards
    with open(t, 'r') as file:
        fileName = os.path.basename(file.name)
        text = file.read()
    return (fileName, text)

In [4]:
meeting_raw = {}
for xfile in os.listdir(txt_file_path): 
    xfile = os.path.join(txt_file_path, xfile)
    if os.path.isfile(xfile) and xfile.endswith('.txt'):
        #(pid, text) = parsing(open(xfile))
        (fileName, text) = parsing(xfile)
        meeting_raw[fileName] = text

### Word Tokenization and Stopwords Removal

In this step, `RegexpTokenizer` is used to split a string into tokens with the required regular expression.

In [5]:
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
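
To see what this pattern keeps together, the sketch below applies it with `re.findall` to made-up text (the sample strings are assumptions, not taken from the transcripts). Because the pattern has no capturing groups, `RegexpTokenizer` in its default mode should return the same matches.

```python
import re

# The tokenization pattern required by the task.
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

# Hypothetical sample segment, lower-cased as the task requires.
sample = "Mm-hmm, it's the remote-control design".lower()
tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)

# The optional group matches at most once, so longer compounds are split.
compound = re.findall(TOKEN_PATTERN, "state-of-the-art")
print(compound)
```

The pattern joins at most two word parts with a single hyphen or apostrophe, so `mm-hmm` and `it's` stay whole while a longer compound is broken into pairs.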

At the same time, create a list to store the stop words provided in `stopwords_en.txt`.

In [6]:
stopwordsList = []
with open('./stopwords_en.txt') as f:
    stopwordsList = f.read().splitlines()

In [7]:
# check
len(stopwordsList)

571

In addition, all words in the meeting transcripts are converted to lower case before tokenization.

In [8]:
def tokenizeMeeting(fileName):
    """
        Tokenize one meeting transcript.
        The single argument is the file name of the meeting (a key of meeting_raw).
        First, normalize the case.
        Then, use the regular expression tokenizer to tokenize the transcript
        and drop the tokens that appear in the stop words list.
    """
    raw_meeting = meeting_raw[fileName].lower() # normalization
    tokenized_meeting = tokenizer.tokenize(raw_meeting)
    filtered_tokens = [token for token in tokenized_meeting if token not in stopwordsList]
    return (fileName, filtered_tokens) # return a tuple of filename and a list of tokens
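
Testing membership against a list scans the list for every token. A minimal sketch of the same filtering step, assuming the stop words are first converted to a `set`, is shown here; the names `filter_tokens` and `stopword_set` and the sample words are hypothetical, and `re.findall` stands in for the NLTK tokenizer.

```python
import re

def filter_tokens(text, stopword_set):
    """Lower-case, tokenize, and drop stop words from one transcript."""
    tokens = re.findall(r"\w+(?:[-']\w+)?", text.lower())
    return [t for t in tokens if t not in stopword_set]

# Hypothetical stop words; the notebook reads the real ones from stopwords_en.txt.
stopword_set = {"the", "is", "a", "we"}

filtered = filter_tokens("We think the Remote is a good design", stopword_set)
print(filtered)
```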

Create a dictionary to store the tokenized meeting transcripts.

In [9]:
meeting_tokenized = {}
for fileName in meeting_raw.keys():
    (fileName, filtered_tokens) = tokenizeMeeting(fileName)
    meeting_tokenized[fileName] = filtered_tokens

Check how many types we have in the whole corpus and the lexical diversity.

In [10]:
from __future__ import division

words = list(chain.from_iterable(meeting_tokenized.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab),"\nTotal number of tokens: ", len(words), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  10507 
Total number of tokens:  226313 
Lexical diversity:  21.539259541258208
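
The ratio printed above is the number of tokens divided by the number of distinct types, that is, the average number of times each type occurs. A small sketch on a made-up corpus (the file names and tokens are assumptions) shows the same computation:

```python
from itertools import chain

# Hypothetical tokenized corpus: file name -> list of tokens.
toy_corpus = {
    "a.txt": ["button", "colour", "button"],
    "b.txt": ["colour", "battery", "button"],
}

all_tokens = list(chain.from_iterable(toy_corpus.values()))
types = set(all_tokens)

# Same ratio as in the notebook: tokens per distinct type.
tokens_per_type = len(all_tokens) / len(types)
print(len(types), len(all_tokens), tokens_per_type)
```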


### Most Document Frequent Words Removal

Apply `set()` to each meeting transcript to generate the set of unique words in that transcript, so that every word is counted at most once per document.
Then put all the words into one list using `chain.from_iterable` and pass it to `FreqDist`, which gives the document frequency of each word.

In [11]:
words = list(chain.from_iterable([set(value) for value in meeting_tokenized.values()]))
fd = FreqDist(words)
fd.items()



In [12]:
MoreFreqWords = set([k for k, v in fd.items() if v > 132])
MoreFreqWords

{'bit',
 'control',
 'design',
 'good',
 'make',
 'meeting',
 'mm',
 'mm-hmm',
 'people',
 'remote',
 'thing',
 'things',
 'uh',
 'um',
 'work',
 'yeah'}
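
The document-frequency filter can be sketched with the standard library alone. This is an illustration on made-up data, using `collections.Counter` in place of `FreqDist`; the helper name `document_frequency` and the threshold of 2 are assumptions chosen for the toy example.

```python
from collections import Counter
from itertools import chain

def document_frequency(tokenized_docs):
    """Count, for each word, the number of documents that contain it."""
    return Counter(chain.from_iterable(set(tokens) for tokens in tokenized_docs.values()))

# Hypothetical tokenized documents.
toy_docs = {
    "a.txt": ["um", "button", "um"],
    "b.txt": ["um", "colour"],
    "c.txt": ["um", "button"],
}
df = document_frequency(toy_docs)

threshold = 2  # hypothetical; the task uses 132
too_frequent = {word for word, count in df.items() if count > threshold}
print(df["um"], df["button"], too_frequent)
```

Note that `um` appears twice in `a.txt` but still contributes only 1 to its document frequency, which is the reason for applying `set()` per document.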

Remove the most document-frequent words.

In [13]:
def removeMoreFreqWords(fileName):
    return (fileName, [w for w in meeting_tokenized[fileName] if w not in MoreFreqWords])

In [14]:
meeting_tokenized = dict(removeMoreFreqWords(fileName) for fileName in meeting_tokenized.keys())

Check that these words have been removed.

In [15]:
words = list(chain.from_iterable(meeting_tokenized.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab),"\nTotal number of tokens: ", len(words), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  10491 
Total number of tokens:  155740 
Lexical diversity:  14.845105328376704


### Word length filter

We also need to remove every token that is shorter than 3 characters.

In [16]:
words_gt_3 = []
for i in words:
    if len(i) >= 3:
        words_gt_3.append(i)

In [17]:
vocab_gt_3 = set(words_gt_3)

In [18]:
# recompute the lexical diversity on the length-filtered tokens
lexical_diversity = len(words_gt_3)/len(vocab_gt_3)
print ("Vocabulary size: ",len(vocab_gt_3),"\nTotal number of tokens: ", len(words_gt_3), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  10305 
Total number of tokens:  152880 
Lexical diversity:  14.83551673944687


## vocab.txt

Convert `vocab_gt_3` into a list, and then sort it to ensure it is in alphabetical order.

In [19]:
ordered_vocab = list(vocab_gt_3)

In [20]:
ordered_vocab.sort()
ordered_vocab

['a-hold',
 'a_a_',
 'a_a_s',
 'a_m_i_',
 'a_n_',
 'a_p_o_g_e_e_',
 'a_s',
 'a_s_r_',
 'a_v_',
 'abandon',
 'abandoned',
 'abbie',
 'abbing',
 'abbreviations',
 'abdul',
 'abigail',
 'abilities',
 'ability',
 'abo',
 'abou',
 'abrupt',
 'abs',
 'absolute',
 'absolutely',
 'absorb',
 'absorbed',
 'abstract',
 'abstraction',
 'abused',
 'abut',
 'academy',
 'acc',
 'acce',
 'accent',
 'accents',
 'accentu',
 'accentuate',
 'accept',
 'acceptability',
 'acceptable',
 'acceptance',
 'accepted',
 'accepting',
 'accepts',
 'access',
 'accessed',
 'accessible',
 'accessoire',
 'accessory',
 'accident',
 'accidentally',
 'acco',
 'accommodate',
 'accommodated',
 'accommodating',
 'accompanying',
 'accomplish',
 'accomplishing',
 'account',
 'accountant',
 'accountants',
 'accounted',
 'accounting',
 'accounts',
 'accu',
 'accumulate',
 'accuracy',
 'accurate',
 'accurately',
 'accustomed',
 'ach',
 'ache',
 'achieve',
 'achieved',
 'achieving',
 'acknowledge',
 'acknowledged',
 'acknowledgemen

Then store `ordered_vocab` in a dictionary, where the key is a word in `ordered_vocab` and the value is its index in `ordered_vocab`.

In [21]:
# enumerate gives each word its position directly, avoiding a list.index() lookup per word
vocab_list = list(ordered_vocab)
vocab_dic = {item: index for index, item in enumerate(vocab_list)}

In [22]:
vocab_dic

{'a-hold': 0,
 'a_a_': 1,
 'a_a_s': 2,
 'a_m_i_': 3,
 'a_n_': 4,
 'a_p_o_g_e_e_': 5,
 'a_s': 6,
 'a_s_r_': 7,
 'a_v_': 8,
 'abandon': 9,
 'abandoned': 10,
 'abbie': 11,
 'abbing': 12,
 'abbreviations': 13,
 'abdul': 14,
 'abigail': 15,
 'abilities': 16,
 'ability': 17,
 'abo': 18,
 'abou': 19,
 'abrupt': 20,
 'abs': 21,
 'absolute': 22,
 'absolutely': 23,
 'absorb': 24,
 'absorbed': 25,
 'abstract': 26,
 'abstraction': 27,
 'abused': 28,
 'abut': 29,
 'academy': 30,
 'acc': 31,
 'acce': 32,
 'accent': 33,
 'accents': 34,
 'accentu': 35,
 'accentuate': 36,
 'accept': 37,
 'acceptability': 38,
 'acceptable': 39,
 'acceptance': 40,
 'accepted': 41,
 'accepting': 42,
 'accepts': 43,
 'access': 44,
 'accessed': 45,
 'accessible': 46,
 'accessoire': 47,
 'accessory': 48,
 'accident': 49,
 'accidentally': 50,
 'acco': 51,
 'accommodate': 52,
 'accommodated': 53,
 'accommodating': 54,
 'accompanying': 55,
 'accomplish': 56,
 'accomplishing': 57,
 'account': 58,
 'accountant': 59,
 'accountants

Save the result into `vocab.txt`.

In [23]:
with open('./vocab.txt','w') as f:
    # one 'word:index' line per vocabulary entry; the with-statement closes the file
    for index, item in enumerate(vocab_list):
        record = item + ':' + str(index) + '\n'
        f.write(record)
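
As a quick check of the `word:index` format, the sketch below writes a tiny made-up vocabulary and parses it back. An in-memory `io.StringIO` buffer stands in for `vocab.txt`, and the helper names `write_vocab` and `read_vocab` are hypothetical.

```python
import io

def write_vocab(words, handle):
    """Write one 'word:index' line per word, indices following sorted order."""
    for index, word in enumerate(sorted(words)):
        handle.write(f"{word}:{index}\n")

def read_vocab(handle):
    """Parse 'word:index' lines back into a dict."""
    vocab = {}
    for line in handle:
        # split on the last ':' so the index is always the final field
        word, index = line.rstrip("\n").rsplit(":", 1)
        vocab[word] = int(index)
    return vocab

buffer = io.StringIO()  # stands in for vocab.txt
write_vocab({"battery", "button", "ability"}, buffer)
buffer.seek(0)
round_trip = read_vocab(buffer)
print(round_trip)
```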

## topic_seg.txt

Each line of a transcript is one segment, and a line of `**********` marks a topic boundary. Label the segment immediately before `**********` with 1 and every other segment with 0.

In [24]:
seg = {}
for fileName in meeting_raw.keys():
    # strip the '.txt' extension to get the meeting id, e.g. 'ES2002a.txt' -> 'ES2002a'
    # record is a string
    record = os.path.splitext(fileName)[0] + ':'
    # split the original transcript by line
    content = meeting_raw[fileName].split('\n')
    for segment in content:
        # a line of '**********' marks the previous segment as a topic boundary:
        # replace its trailing '0,' with '1,'
        if segment == '**********':
            record = record[:-2] + '1,'
        # an empty line marks the end of the file, so end the record with a line break
        elif segment == '':
            record = record + '\n'
        # otherwise the segment is not a topic boundary so far: add '0,'
        else:
            record = record + '0,'
    # store the result in a dictionary
    seg[fileName] = record

In [25]:
seg['ES2003a.txt']

'ES2003a:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,\n'
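
The labelling rule can also be sketched on a list of integers instead of a string, which avoids slicing off and re-adding characters. This is an illustrative variant rather than the code used above: the function name `boundary_labels` and the sample lines are assumptions, and empty lines are simply skipped.

```python
def boundary_labels(lines, marker="**********"):
    """Return one 0/1 label per segment; 1 means a marker line follows it."""
    labels = []
    for line in lines:
        if line == marker:
            if labels:            # ignore a marker that has no segment before it
                labels[-1] = 1
        elif line != "":          # skip empty lines such as the one at end of file
            labels.append(0)
    return labels

# Hypothetical transcript lines.
toy_lines = ["hello everyone", "let's start", "**********",
             "the budget", "**********", ""]
labels = boundary_labels(toy_lines)
record = "toy:" + ",".join(str(x) for x in labels)
print(record)
```

Joining the labels with `','` at the end produces no trailing comma, so no clean-up step is needed afterwards.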

Remove the trailing `,` that follows the last label, keeping the line break at the end of the record.

In [26]:
for i in seg.keys():
    seg[i] = seg[i][:-2] + seg[i][-1]

Sort the dictionary by file name.

In [27]:
seg_ordered = OrderedDict(sorted(seg.items(), key = lambda t: t[0])) 

In [28]:
# check result
seg_ordered['ES2003a.txt']

'ES2003a:0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1\n'

Save the result into `topic_segs.txt`.

In [29]:
with open('./topic_segs.txt','w') as f:
    for i in seg_ordered.keys():
        f.write(seg_ordered[i])

In [30]:
with open('./topic_seg.txt','w') as f:
    for i in seg_ordered.keys():
        f.write(seg_ordered[i])

## ./sparse_files/*.txt

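Convert each meeting transcript into its sparse representation and save it in the `sparse_files` folder under the same file name as the transcript in the `txt_files` folder. Each non-empty segment becomes one line of comma-separated `index:count` pairs, where `index` is the word's index in `vocab.txt` and `count` is its frequency in that segment.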

In [34]:
os.makedirs('./sparse_files', exist_ok=True)  # make sure the output folder exists
for file in meeting_raw.keys():
    with open('./sparse_files/' + file, 'w') as out_file:
        for line in meeting_raw[file].split('\n'):
            # normalize the segment and tokenize it with the required pattern
            words_dup = re.findall(r"\w+(?:[-']\w+)?", line.lower())
            # count the frequency of each word in this segment
            dic = FreqDist(words_dup)
            # one segment may contain duplicated words;
            # keep each word once, in order of first occurrence
            words_nodup = []
            for w in words_dup:
                if w not in words_nodup:
                    words_nodup.append(w)
            # keep only vocabulary words, written as 'index:count'
            entries = [str(vocab_dic[w]) + ':' + str(dic[w])
                       for w in words_nodup if w in vocab_dic]
            # segments without any vocabulary word are skipped
            if entries:
                out_file.write(','.join(entries) + '\n')
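
The encoding of a single segment can be isolated in a small function. The sketch below is a standard-library illustration, with `collections.Counter` standing in for `FreqDist`; the function name `sparse_line` and the toy vocabulary are hypothetical.

```python
import re
from collections import Counter

def sparse_line(segment, vocab_index):
    """Encode one segment as 'index:count' pairs in order of first occurrence."""
    tokens = re.findall(r"\w+(?:[-']\w+)?", segment.lower())
    counts = Counter(tokens)
    # dict.fromkeys keeps the first occurrence of each token
    # (dict ordering is guaranteed from Python 3.7).
    unique_in_order = dict.fromkeys(tokens)
    return ",".join(f"{vocab_index[t]}:{counts[t]}"
                    for t in unique_in_order if t in vocab_index)

toy_vocab = {"battery": 0, "button": 1, "colour": 2}  # hypothetical vocabulary
encoded = sparse_line("The button, the BUTTON and a battery", toy_vocab)
print(encoded)
```

Words outside the vocabulary (here `the`, `and`, `a`) are dropped, and a segment with no vocabulary words yields an empty string, which the caller can skip.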

### Reference

* Regular Expression. Retrieved from: https://docs.python.org/3/library/re.html
* os — Miscellaneous operating system interfaces. Retrieved from: https://docs.python.org/3/library/os.html
* collections — Container datatypes. Retrieved from: https://docs.python.org/3/library/collections.html
* nltk. Retrieved from: http://www.nltk.org/book/
* itertools — Functions creating iterators for efficient looping. Retrieved from: https://docs.python.org/3/library/itertools.html 