# Task 2 Generate Sparse Representations 
#### Student Name: Siyang Feng
#### Student ID: 28246993

Date: 04/06/2018

Version: 2.0

Environment: Python 3.6.2 and Anaconda 4.3.29

Libraries used:

* re 2.2.1 (for regular expressions, included in Anaconda Python 3.6)
* os (for listing files in a directory, included in Anaconda Python 3.6)
* nltk (for natural language pre-processing)


## Introduction
The aim of this task is to build sparse representations for the meeting transcripts generated in task 1, which includes word tokenization, vocabulary generation, and the generation of sparse representations. 
## Import Library

In [1]:
from bs4 import BeautifulSoup as bsoup
import re
import os
import nltk
from nltk.book import *
from itertools import chain
import itertools
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Read text file
* The sample output text file has been deleted, so we can simply read all the text files.

All the text files are loaded into a dictionary whose keys are the file names without '.txt' and whose values are the file contents as strings.

The text file path and the target sparse file path are stored in two variables.

In [2]:
# get text path
txt_path = "./txt_files"
sparse_path = "./sparse_files"

Function `parse_one_text` reads a text file and returns its contents as a string.

The next step builds a dictionary that stores the contents of all text files in `./txt_files`. Each key is a file name without '.txt' and each value is a string holding the full contents of that file. Note that all string values are converted to lower case before they are stored. The resulting dictionary looks like:
``` json
{
    'ES2002a': " okay .\n right .\n um well this is the kick-off meeting for our our project . ...",
    'ES2002b': " is that alright now ?\n okay .\n sorry ?\n okay , ...",
    ...
}
```

In [3]:
def parse_one_text(f_path):
    """
    Read a text file and return its contents as a single string.
    
    Arguments:
    f_path -- the text file path
    
    Return:
    t -- a string holding the entire contents of the text file
    """
    # the with-block closes the file even if reading fails
    with open(f_path, 'r') as r:
        t = r.read()
    return t

In [4]:
# Read every '.txt' file in directory 'txt_files' into a dict keyed by its file name (without extension)
text_dict = dict()
for tfile in os.listdir(txt_path):
    t_path = os.path.join(txt_path, tfile)
    if os.path.isfile(t_path) and t_path.endswith('.txt'):
        # escape the dot and anchor at the end so only the '.txt' suffix is stripped
        text_name = re.search(r'(.+)\.txt$', tfile).group(1)
        text_dict[text_name] = parse_one_text(t_path).lower()
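
The loading step can be tried in isolation. The following is a minimal sketch rather than the notebook's own code: the helper name `load_texts` is hypothetical, and the folder, file names and file contents are invented for illustration.

```python
import os
import tempfile


def load_texts(folder):
    """Return {file name without '.txt': lower-cased file contents}."""
    texts = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        # Only regular files ending in '.txt' are loaded.
        if os.path.isfile(path) and name.endswith('.txt'):
            with open(path, 'r') as handle:
                texts[os.path.splitext(name)[0]] = handle.read().lower()
    return texts


# Build a throw-away folder with invented sample files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'demo_a.txt': 'Okay .\nRight .\n',
        'demo_b.txt': 'Sorry ?\n',
        'notes.md': 'not a transcript',  # skipped: wrong extension
    }
    for name, content in samples.items():
        with open(os.path.join(tmp, name), 'w') as handle:
            handle.write(content)
    demo_dict = load_texts(tmp)

print(demo_dict)
```

Here `os.path.splitext` strips the extension instead of a regular expression; either approach should give the same keys for ordinary file names.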

## Tokenize the words
The text has been extracted. Next, we tokenize all the words with the regular expression tokenizer implemented in NLTK.
The token pattern is `\w+(?:[-']\w+)?`: a run of word characters, optionally followed by one hyphen or apostrophe and another run of word characters. Pass this pattern to `RegexpTokenizer()` to build the tokenizer.

In [5]:
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
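
To see what the pattern matches, it can be applied with the standard `re` module alone. This is only an illustration on invented sample sentences; it assumes that `re.findall` with the same pattern behaves like the NLTK tokenizer for this kind of input.

```python
import re

# Same pattern as the notebook's tokenizer.
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

# Invented sample sentence in the style of the transcripts.
sample = "um well this is the kick-off meeting , 'kay ? i don't know ."
demo_tokens = re.findall(TOKEN_PATTERN, sample)
print(demo_tokens)

# The optional group allows only ONE hyphen or apostrophe per token,
# so a word with two hyphens is split into two tokens.
double_hyphen = re.findall(TOKEN_PATTERN, "mother-in-law")
print(double_hyphen)
```

Punctuation is dropped, a leading apostrophe as in `'kay` is stripped, and `kick-off` and `don't` each stay in one piece.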

Function `tokenize_words()` tokenizes the text of every file. Its input is the text dictionary generated above, and the tokenizer is applied to each value of that dictionary. The output is a new dictionary whose keys are the file names and whose values are the lists of original tokens in the corresponding files. The structure looks like:
``` json
{'ES2002a': 
     ['okay',
      'right',
      'um',
      'well',
      ...
     ],
 'ES2002b':
     [...],
     ...
}
```

In [6]:
def tokenize_words(diction):
    """
    Tokenize every value in the dictionary.
    The token pattern is applied with NLTK's 'RegexpTokenizer'.
    
    Arguments:
    diction -- the dictionary whose value should be tokenized
    
    Return:
    w_dict -- token dictionary.
    """
    w_dict = dict()
    tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
    for i in list(diction.keys()):
        words = diction[i]
        w_dict[i] = list(tokenizer.tokenize(words))
    return w_dict

In [7]:
# read all the words as tokens in dictionary.
words_dict = tokenize_words(text_dict)

## Stop Words Removal
Stop words are usually function words in English that carry little meaning for the text itself. Thus, we remove them from the original tokens.

#### - Read stop words text file
Function `parse_one_text` reads the stop words file into a string. The `split('\n')` method then splits the string into a list with one stop word per entry.

In [8]:
stopstr = parse_one_text('stopwords_en.txt')
stopwords = stopstr.split('\n')

#### - Remove stop words
* convert the stop words list into a set, which removes duplicates and makes membership tests fast
* remove every token that appears in the stop words set

In [9]:
stopwords_set = set(stopwords)
stopped_tokens = dict()
for i in list(words_dict.keys()):
    stopped_tokens[i] = [w for w in words_dict[i] if w not in stopwords_set]
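
A small self-contained illustration of this filtering step, using an invented stop word string in place of `stopwords_en.txt` and an invented token list:

```python
# Invented stand-in for the contents of 'stopwords_en.txt' (one word per line).
demo_stopstr = "the\nis\nthis\nfor\nour\n"
demo_stopwords = set(demo_stopstr.split('\n'))

# The trailing newline leaves an empty string in the set; it is harmless
# because the tokenizer never produces an empty token.
has_empty_entry = '' in demo_stopwords

demo_words = {
    'demo_a': ['this', 'is', 'the', 'kick-off', 'meeting', 'for', 'our', 'project'],
}
demo_stopped = {
    name: [w for w in words if w not in demo_stopwords]
    for name, words in demo_words.items()
}
print(demo_stopped)
```

Using a set rather than a list matters mostly for speed: each membership test is a hash lookup instead of a scan over the whole stop word list.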

## High Document Frequency Words Removal
In this part, we remove all the tokens whose document frequency is greater than 132. First we need to find out which words have a high document frequency (> 132).  
Function `chain.from_iterable` is used to concatenate the per-file token sets. Expression:
``` python
    list(chain.from_iterable([set(value) for value in stopped_tokens.values()]))
```
First, each file's token list in the dictionary is converted with `set()` so that every token appears only once per file. Then `chain.from_iterable` chains the unique tokens of all files into one sequence. Finally, the result is put into a list, in which a token appears once for every file that contains it.

In [10]:
words_2 = list(chain.from_iterable([set(value) for value in stopped_tokens.values()]))
word_freq = FreqDist(words_2)

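`FreqDist`, which NLTK provides and `from nltk.book import *` brings into scope, counts how often each token occurs. Because every file contributes each of its tokens only once, the count here is the document frequency. The result behaves like a dictionary of `word : frequency` pairs. Example: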
``` json
        { 'cheery': 1,
          'ironic': 1,
          'thinking': 94,
          'product': 110,
          ...
        }
```
Filter the words by their frequency: keep every token whose document frequency is not greater than 132 and store the selected tokens in a list.

In [11]:
tokens_1 = []
for i in list(word_freq):
    if word_freq[i] <= 132:
        tokens_1.append(i)
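
The document frequency filter can be reproduced on a toy corpus. This sketch uses `collections.Counter` from the standard library as a stand-in for `FreqDist`, three invented documents, and a threshold of 2 in place of 132:

```python
from collections import Counter
from itertools import chain

# Invented toy corpus: file name -> tokens after stop word removal.
demo_docs = {
    'doc1': ['remote', 'remote', 'button', 'design'],
    'doc2': ['remote', 'design', 'battery'],
    'doc3': ['remote', 'battery', 'battery'],
}

# set() keeps each token once per document, so the counter ends up holding
# the number of documents that contain the token (document frequency).
doc_freq = Counter(chain.from_iterable(set(tokens) for tokens in demo_docs.values()))

MAX_DF = 2  # stand-in for the notebook's threshold of 132
kept = sorted(word for word, df in doc_freq.items() if df <= MAX_DF)
print(doc_freq)
print(kept)
```

`remote` occurs in all three documents and is dropped, while `battery` is kept even though it occurs twice inside `doc3`, because only the number of documents counts.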

## len() < 3 Words Removal
Some tokens are shorter than 3 characters. Most of them, such as 'hu' and 'a_', carry little meaning. In this step we therefore delete all tokens shorter than 3 characters, and then sort the remaining tokens in alphabetical order.

In [12]:
tokens = [i for i in tokens_1 if len(i) >= 3]

In [13]:
# sort token in alphabetic order
tokens.sort()

## Sub-task 1: Write Tokens into `vocab.txt` File
It contains the unigram vocabulary in the following format, word_string:integer_index. Words in the vocabulary must be sorted in alphabetic order. For example, "absolute:22" in the following figure means that the 23rd word in the vocabulary is "absolute".
**************************
'token : order number'

A dictionary maps each token (the key) to its position in the sorted token list (the value).

The contents of this dictionary are then joined into one string, one entry per line, with the structure
``` python
    dict.key + ':' + str(dict.value) + '\n'
```
Finally, write the final string into the `vocab.txt` file.

In [14]:
# generate a dict (token_dic) of tokens
token_dic = dict()
for i in range(len(tokens)):
    token_dic[tokens[i]] = i

In [15]:
voc = ''
for i in list(token_dic.keys()):
    voc = voc + i + ':' + str(token_dic[i]) + '\n'

In [16]:
vocab = open('vocab.txt', 'w')
vocab.write(voc)
vocab.close()
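
The index assignment and the line format can be checked on a few invented tokens. This sketch builds the string only and does not write a file:

```python
# Invented tokens, deliberately unsorted.
demo_tokens = ['design', 'absolute', 'button']

# Sort first, then number the words from 0 in alphabetical order.
sorted_tokens = sorted(demo_tokens)
demo_vocab = {word: index for index, word in enumerate(sorted_tokens)}

# One 'word:index' entry per line, in index order.
vocab_text = ''.join('{}:{}\n'.format(word, demo_vocab[word]) for word in sorted_tokens)
print(vocab_text)
```

Iterating over the sorted list, rather than over the dictionary, keeps the output order independent of how the dictionary orders its keys.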

## Sub-task 2: Generate `topic_segs.txt` file
It contains the topic boundaries encoded in boolean vectors. For example, if a meeting transcript, "ES2018d.txt" contains 10 paragraphs in total after being preprocessed, and there are topic boundaries after the 2nd, 5th, and 7th paragraphs, the boolean vector must be "__ES2018d:0,1,0,0,1,0,1,0,0,1__". Every line in *topic_segs.txt* corresponds to one meeting transcript.
*******************
For this task, the dictionary `text_dict` will be used.

We build a string that records the segment information, with 0 for a normal paragraph and 1 for the last paragraph of each topic. Each line of the string describes one meeting transcript file.

First, we generate a list of '0' and '1' flags for each meeting transcript file. The `join` function then connects all the values in the list into a string, separated by ','. Finally, the strings of all meeting transcript files are appended to one overall string in sequence.

In [17]:
# generate a string to record segment info
seg_str = ''
for i in list(text_dict.keys()):
    sub_ls = []
    sub_str = i + ':'
    str_ls = text_dict[i].split('\n')
    for j in str_ls:
        if j == '**********':
            sub_ls[-1] = '1'
        elif j == '':
            continue
        else:
            sub_ls.append('0')
    sub_str = sub_str + ','.join(sub_ls) + '\n'
    seg_str = seg_str + sub_str
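
The flagging logic can be checked against the example from the task description (10 paragraphs, boundaries after the 2nd, 5th and 7th). The sketch below is an illustration under assumptions: the helper name `boundary_vector` is hypothetical, the paragraph texts are invented, and the last topic is assumed to end with a `**********` line as well, which is what the trailing 1 in the example vector suggests.

```python
def boundary_vector(text, marker='**********'):
    """Return a comma-separated 0/1 string with one flag per paragraph."""
    flags = []
    for line in text.split('\n'):
        if line == marker:
            # Guard (not in the notebook's loop): a marker that comes before
            # any paragraph has no previous paragraph to flag.
            if flags:
                flags[-1] = '1'
        elif line != '':
            flags.append('0')
    return ','.join(flags)


# Invented transcript: 10 one-line paragraphs, a marker after each topic end.
boundaries_after = {2, 5, 7, 10}
lines = []
for number in range(1, 11):
    lines.append('paragraph {}'.format(number))
    if number in boundaries_after:
        lines.append('**********')
demo_text = '\n'.join(lines) + '\n'

demo_vector = boundary_vector(demo_text)
print('demo:' + demo_vector)
```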

With the string built, the next task is to write it into the text file `topic_segs.txt`.

The target file name is ambiguous: the assignment specification says to write the result into `topic_seg.txt`, but the provided zip file contains an empty `topic_segs.txt`. Thus, I generate both files and write the same result into each.

In [18]:
topic_seg = open('topic_segs.txt', 'w')
topic_seg.write(seg_str)
topic_seg.close()
topic_seg = open('topic_seg.txt', 'w')
topic_seg.write(seg_str)
topic_seg.close()

## Sub-task 3: Generate `./sparse_files/*.txt` File
Each txt file in the "sparse_files" folder corresponds to one of the meeting transcripts in the "txt_files" folder, and they have the same file name.  For example, "./sparse_files/ES2002a.txt" corresponds to "./txt_files/ES2002a.txt". Each file in "/sparse_files" contains the sparse representations for all its paragraphs.
************************
The function `write_sparse` writes the sparse result into a text file.

The function `sparse_one_line` encodes a single line of a meeting transcript text file. For each word in the line, we check whether it is in the token dictionary `token_dic` generated earlier. If so, we record the index of the word and its frequency within the line. With `index` standing for `token_dic[word]`, the record structure is:
``` python
    str(index) + ':' + str(w_freq[index])
```
Note that a word may appear in the raw text as `'kay`, whereas the tokenizer extracted it as `kay`. Splitting a line on whitespace therefore yields pieces such as `'kay`, which do not match any key in the token dictionary. Thus, before looking up the index of a word, we apply the regular expression `\w+(?:[-']\w+)?` with `re.findall` to each piece to extract the word part. Finally, all the index and frequency pairs are joined into a single string, separated by commas.

In [19]:
def write_sparse(string, file_path):
    """
    Write the sparsed string into a text file with defined path.
    
    Arguments:
    string -- input string which need to be write into text file
    file_path -- the path of the text to write
    
    Return:
    None
    """
    text_file = open(file_path, 'w')
    text_file.write(string)
    text_file.close()

In [20]:
def sparse_one_line(line, token_dic):
    """
    Build the sparse representation of one line: keep the words found in
    the token dictionary and count how often each occurs in this line.
    
    Arguments:
    line -- a string line with words.
    token_dic -- the token dictionary mapping each token to its index.
    
    Return:
    sparse_str -- comma-separated 'index:frequency' pairs ending with a
                  newline, or an empty string if no word is in token_dic
    """
    ls = []
    ws = line.split()
    for i in ws:
        for j in re.findall(r"\w+(?:[-']\w+)?", i):
            # skip tokens that were filtered out of the vocabulary
            if j in token_dic:
                ls.append(token_dic[j])
    w_freq = FreqDist(ls)
    l_ls = []
    for k in w_freq.keys():
        l_ls.append(str(k) + ':' + str(w_freq[k]))
    sparse_str = ','.join(l_ls)
    if len(sparse_str) != 0:
        sparse_str = sparse_str + '\n'
    return sparse_str
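
The encoding of a single line can be illustrated without NLTK. This is a sketch, not the notebook's function: the name `encode_line` is hypothetical, `collections.Counter` stands in for `FreqDist`, the vocabulary and sentence are invented, and the pairs are sorted by index for readability, which the notebook's version does not do.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"\w+(?:[-']\w+)?"


def encode_line(line, vocab):
    """Return 'index:count' pairs for the vocabulary words found in line."""
    # Tokenize with the vocabulary's pattern, keep only known words,
    # and count how often each vocabulary index occurs.
    counts = Counter(vocab[t] for t in re.findall(TOKEN_PATTERN, line) if t in vocab)
    return ','.join('{}:{}'.format(index, count) for index, count in sorted(counts.items()))


# Invented vocabulary and paragraph.
demo_vocab = {'battery': 0, 'button': 1, 'remote': 2}
demo_line = "the remote needs a button , 'kay ? remote battery remote"

encoded = encode_line(demo_line, demo_vocab)
print(encoded)
```

Words outside the vocabulary (`the`, `needs`, `kay`) simply vanish from the output, which is what makes the representation sparse.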

Iterate over all the keys in the text dictionary, which maps each file name to the file contents as a string. For each value, `split('\n')` splits the string into a list of lines.

Each line is then encoded with the pre-defined function `sparse_one_line()`, and the line results are concatenated.

After a file has been encoded, the result is written to a text file in the `./sparse_files` folder with the function `write_sparse()`.

In [21]:
os.makedirs(sparse_path, exist_ok=True)  # make sure the output folder exists
for i in list(text_dict.keys()):   # file name will be i.txt
    topic = ''
    str_ls = text_dict[i].split('\n')  # split lines in each topic file
    for line in str_ls:
        topic = topic + sparse_one_line(line, token_dic) # encode each line
    spr_name = i + '.txt'
    sps_path = os.path.join(sparse_path, spr_name)
    write_sparse(topic, sps_path)

## Summary


* We learned how to select word tokens. From the first sub-task, we understand what stop words are and what kinds of words should be selected as tokens. In general, words shorter than 3 characters tend to carry little meaning.
* A sparse representation records, for each paragraph, which vocabulary tokens occur and how often.
* The aim of text pre-processing is to extract features from the text. The features may be words, lines, paragraphs or even whole text files.