# FIT5196 Assessment 2 - Text Pre-Processing & Feature Generation

###  Group 103:

##### Student Name: Alan Gewerc
##### Student ID: 29961246
##### Student Name: Cristiana Garcia Gewerc
##### Student ID: 30088887


Date: 14/09/2019

Environment: Python 3.7.1 and Jupyter Notebook 5.7.4 (64-bit)

Libraries used:
* pandas 0.23.4 (for data frame, included in Anaconda Python 3.7.1) 
* re (for regular expressions, part of the Python standard library)
* requests 2.21.0 (for getting data from url, not included in Anaconda)
* pdfminer.pdfinterp (functions PDFResourceManager, PDFPageInterpreter)
* pdfminer.converter (functions HTMLConverter,TextConverter,XMLConverter) 
* pdfminer.layout (function LAParams)
* pdfminer.pdfpage (function PDFPage)
* nltk.data (to load the tokenizer)
* nltk.tokenize (functions RegexpTokenizer, MWETokenizer)
* nltk.stem (function PorterStemmer)
* nltk.util (function ngrams)
* nltk.probability (functions such as FreqDist)
* itertools (function chain)

# 1. Introduction

This assignment applies a series of text processing and analysis tasks to two hundred academic papers published at a popular AI conference. After extracting the data from the documents, which are in an unstructured format (PDF), we preprocess the text (for example, lowercase normalization and stemming) and finally convert the papers to numerical representations that are suitable inputs for NLP systems. The required tasks, implemented in Python, are the following:

1. **PDF Extraction**: The document paper-ids.pdf contains a table in which each row holds a unique paper ID and a URL. First, we download this file and extract the paper IDs and URLs from it. Then, we read the two hundred PDFs from these URLs and convert their contents into strings, which populate dictionaries containing the abstracts, titles, authors and bodies of the papers. 
2. **Sparse Feature Generation**: Focusing exclusively on the bodies dictionary, we generate two files: `vocab` and `count_vectors`. The first contains an index for every word or collocation in the dataset, and the second the count of each index for every paper. Before that, some preprocessing tasks are applied (the chosen order is explained throughout the assignment): 
    1. Normalization/segmentation
    2. Tokenization
    3. Identifying relevant bigrams
    4. Removing
        - independent stopwords
        - dependent stopwords
        - rare tokens
        - small tokens (length less than 3)
    5. Stemming
 
3. **Statistics Generation**: Generate a dataframe with three columns, `top10_terms_in_abstracts`, `top10_terms_in_titles` and `top10_authors`, after applying some preprocessing tasks to the title, abstract and author dictionaries.  

# 2. PDF Extraction

Following the guidelines provided by the assessment document, we use the `requests` package to programmatically download the PDF files, and the `pdfminer` and `re` packages to read the PDF files into text and extract the entities required to complete the tasks. Additionally, the `io` package provides `StringIO` and `BytesIO`, which let us work with in-memory text and byte buffers instead of files on disk.
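As a small illustration of the in-memory buffers mentioned above, the sketch below uses a made-up byte string (not a valid PDF) and made-up page text. It only shows that `BytesIO` can be read like a file opened in binary mode and that `StringIO` accumulates written text.

```python
import io

# BytesIO wraps raw bytes (e.g. a downloaded PDF) in a file-like object.
raw = b"%PDF-1.4 example bytes"   # made-up payload, not a valid PDF
byte_buffer = io.BytesIO(raw)
header = byte_buffer.read(8)      # reads like a file opened in 'rb' mode

# StringIO collects text written by a converter, without touching the disk.
text_buffer = io.StringIO()
text_buffer.write("first page\n")
text_buffer.write("second page\n")
collected = text_buffer.getvalue()

print(header)
print(collected)
```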


## 2.1. Import Libraries

In [1]:
# !pip install pdfminer.six
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter,TextConverter,XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import requests
import io

## 2.2. Read the PDFs

First, we need to read the `paper-ids.pdf` file to extract the pdf links from it.

We followed the approach [suggested](https://stackoverflow.com/questions/39854841/pdfminer-python-3-5?fbclid=IwAR0btjjjuzFet2zfp4Rhle3IG-ZOKP0iAAeToU7ewI7ly1-BLKcrS0MGDB8) by Haseeb (2018), using an adaptation of his `convert_pdf_to_txt` function.


In [2]:
# Now we have pdfminer installed and are ready to convert our PDF to text by running the following command:
def convert_pdf_to_txt(path_to_file):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path_to_file, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

Now, we use the above function to extract the text from the pdf file and save it into the string `links`. Then, we split it into a list.

In [3]:
# reading and converting the pdf file 
links = convert_pdf_to_txt("data/paper-ids.pdf")
# breaking it into lists items:
links_list =links.split()
# analyse the content:
links_list[0:9]

['filename',
 'url',
 'PP1861.pdf',
 'https://drive.google.com/uc?export=download&id=18BfdwBdmTd7DkE1LJUPNfTJifXzPToLU',
 'PP3203.pdf',
 'https://drive.google.com/uc?export=download&id=12IaCmFfJ7lAG7JIR0bEa-RVkgNrDmoZb',
 'PP3216.pdf',
 'https://drive.google.com/uc?export=download&id=18r6FpSWv6lkiHdDNfaTjssrXcvGUdyqO',
 'PP3252.pdf']

We can see the following pattern: every time a new page of the pdf file begins, the elements 'filename' and 'url' appear in our list. Those are useless for our purposes here. We are interested in keeping each filename associated with its corresponding link.

The filename always ends with `.pdf`, and the associated url is the subsequent list element. We are going to store them in a dictionary called `links_dic`.

In [4]:
links_dic = {}
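# a filename always ends with '.pdf' and is followed by its url
for i in range(len(links_list) - 1):
    if links_list[i].endswith('.pdf'):
        # escape the dot and anchor at the end so that only the extension is removed
        links_dic[re.sub(r'\.pdf$', "", links_list[i])] = links_list[i+1]
# checking if the dictionary is working as it is supposed to: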
for x in list(links_dic)[0:3]:
    print ("key {}, value {} ".format(x,  links_dic[x]))

key PP1861, value https://drive.google.com/uc?export=download&id=18BfdwBdmTd7DkE1LJUPNfTJifXzPToLU 
key PP3203, value https://drive.google.com/uc?export=download&id=12IaCmFfJ7lAG7JIR0bEa-RVkgNrDmoZb 
key PP3216, value https://drive.google.com/uc?export=download&id=18r6FpSWv6lkiHdDNfaTjssrXcvGUdyqO 
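An alternative way to build the same kind of mapping, shown here only as a sketch on a made-up list (the file names and `example.org` URLs are hypothetical), is to drop the repeated page headers and pair the remaining items with `zip`. It assumes that names and URLs strictly alternate and that no URL contains whitespace.

```python
# Toy list with the same shape as links_list: page headers, then name/url pairs.
sample = ['filename', 'url', 'PP0001.pdf', 'https://example.org/a',
          'filename', 'url', 'PP0002.pdf', 'https://example.org/b']

# Drop the repeated page headers, then pair even items with odd items.
items = [s for s in sample if s not in ('filename', 'url')]
pairs = dict(zip(items[0::2], items[1::2]))

# Strip the extension with slicing instead of a regular expression.
sample_dic = {name[:-len('.pdf')]: url for name, url in pairs.items()}
print(sample_dic)
```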


As [shared](https://stackoverflow.com/questions/22800100/parsing-a-pdf-via-url-with-python-using-pdfminer) by Haseeb (2018), the `pdf_from_url_to_txt` function returns the text of a pdf located at a url without saving the file to disk. We made one minor change to his function: it uses the `requests` library instead of `urllib`.

In [5]:
def pdf_from_url_to_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    f = requests.get(url).content
    #f = urllib.urlopen(url).read()
    fp = io.BytesIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    # named `text` so that the built-in `str` is not shadowed
    text = retstr.getvalue()
    retstr.close()
    return text

Now we apply the function to each url that we extracted from the `paper-ids.pdf` file, saving the text content in a similar dictionary called `contents_dic`.

In [6]:
contents_dic = {}
for name,url in links_dic.items():
    #save the content of the pdf in the url as a dictionary value
    contents_dic[name] = pdf_from_url_to_txt(url)

In [7]:
len(contents_dic) # size of the dict
# print beginning of the first paper
contents_dic['PP1861'][0:1000]

'Algorithms for Non-negative Matrix Factorization\n\nAuthored by:\n\nH. Sebastian Seung\n\nDaniel D. Lee\n\nAbstract\n\nNon-negative matrix factorization (NMF) has previously been shown\nto be a useful decomposition for multivariate data. Two diﬀerent multi-\nplicative algorithms for NMF are analyzed. They diﬀer only slightly in\nthe multiplicative factor used in the update rules. One algorithm can be\nshown to minimize the conventional least squares error while the other\nminimizes the generalized Kullback-Leibler divergence. The monotonic\nconvergence of both algorithms can be proven using an auxiliary func-\ntion analogous to that used for proving convergence of the Expectation-\nMaximization algorithm. The algorithms can also be interpreted as diag-\nonally rescaled gradient descent, where the rescaling factor is optimally\nchosen to ensure convergence.\n\n1 Paper Body\n\nUnsupervised learning algorithms such as principal components analysis and\nvector quantization can be understo

## 2.3. Extract the Bodies of the Papers, Authors and Abstract

We have created a dictionary `contents_dic` whose keys are the IDs of the files and whose values are strings holding the whole text extracted from the PDF documents. Now, with help from the `re` library, we identify the title, authors, abstract and body of each paper and store each part in its own dictionary. This is especially useful for the next sections, which first focus exclusively on the bodies of the papers and afterwards look at the other parts. <br>
These are the regex patterns that we have developed: <br>

`pattern_title = r'(.+?)Authored by'` <br>
`pattern_author = r'Authored by:(.+?)Abstract'` <br>
`pattern_abstract = r'Abstract(.+?)1 Paper Body'` <br>
`pattern_body = r'Paper Body(.+?)2 References'`<br>

We want to capture a sequence of adjacent parts: when the *title* ends, *author* begins; when *author* ends, *abstract* begins; when *abstract* ends, *body* begins. To find the `title`, we capture everything, with a lazy quantifier (`.+?`), until we reach *Authored by*. To find the `authors`, we capture everything, again lazily, starting from *Authored by:* until we reach *Abstract*. This continues until we find *2 References*, which ends the paper body. The lazy quantifier stops at the first occurrence of the closing marker, so a later repetition of a word such as *Abstract* inside the text does not extend the match.
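The difference between a lazy and a greedy capture can be seen on a small made-up string in which the marker word *Abstract* appears twice. The sample text below is hypothetical and only mimics the layout of the extracted papers.

```python
import re

# Made-up text in which the marker 'Abstract' occurs twice.
sample = ("Title\n\nAuthored by:\n\nA. Author\n\nAbstract\n\n"
          "The Abstract is short.\n\n1 Paper Body\n\nBody text.")

# Lazy: stops at the first 'Abstract' after the author block.
lazy = re.search(r'Authored by:(.+?)Abstract', sample, re.DOTALL).group(1)
# Greedy: runs on to the last 'Abstract' in the string.
greedy = re.search(r'Authored by:(.+)Abstract', sample, re.DOTALL).group(1)

print(repr(lazy.strip()))
print(repr(greedy.strip()))
```

The lazy version returns only the author block, while the greedy version also swallows the first *Abstract* marker and part of the abstract text.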


In [8]:
# Creating regex patterns to identify title, authors, abstract and body
# every pattern uses a lazy group (.+?) so that it stops at the first closing marker
pattern_title, pattern_author, pattern_abstract, pattern_body =\
r'(.+?)Authored by', r'Authored by:(.+?)Abstract', r'Abstract(.+?)1 Paper Body', r'Paper Body(.+?)2 References'

# creating empty dictionaries for each part of the papers
dict_paper_title, dict_paper_author, dict_paper_abstract, dict_paper_body  = {}, {}, {}, {} 

In [9]:
contents_dic['PP3203'][0:1000]

'Predictive Matrix-Variate t Models\n\nAuthored by:\n\nKai Yu\n\nShenghuo Zhu\nYihong Gong\n\nAbstract\n\nIt is becoming increasingly important to learn from a partially-observed\nrandom matrix and predict its missing elements. We assume that the en-\ntire matrix is a single sample drawn from a matrix-variate t distribution\nand suggest a matrix-variate t model (MVTM) to predict those missing\nelements. We show that MVTM generalizes a range of known probabilistic\nmodels, and automatically performs model selection to encourage sparse\npredictive models. Due to the non-conjugacy of its prior, it is diﬃcult to\nmake predictions by computing the mode or mean of the posterior distri-\nbution. We suggest an optimization method that sequentially minimizes\na convex upper-bound of the log-likelihood, which is very eﬃcient and\nscalable. The experiments on a toy data and EachMovie dataset show a\ngood predictive accuracy of the model.\n\n1 Paper Body\n\nMatrix analysis techniques, e.g., singul

Now, using the patterns just created, we identify each part of every paper and populate the new dictionaries.
We also clean each of the extracted strings so that the next steps work on proper text. This is done by a function called `clean()`, which makes the following corrections:  
- Strip the strings, removing spaces at the beginning and the end.
- Remove the '-\n' pattern, which usually marks a non-compound word split across a line break. 
- Replace the page breaks **('\x0c')** by spaces. To understand this character (a form feed, i.e. **next sheet of paper**), we had help from the [enigma website](https://www.enigma.com/blog/the-secret-world-of-newline-characters) (Yang Yang, 2018).
- Replace the page number at the bottom of each page by a space, using the regex `pattern_pagebreak`.  
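The corrections listed above can be traced step by step on a small made-up fragment. The sample string is hypothetical, and the steps are applied one at a time only to make each effect visible. Note that in `re.sub` the fourth positional parameter is `count`, so flags, when they are needed, have to be passed as `flags=...`; this pattern needs none.

```python
import re

# Made-up fragment: a hyphenated line break, a page number between blank
# lines, and a form feed ('\x0c') marking a new page.
sample = "  the algo-\nrithm converges\n\n3\n\n\x0cto a local optimum  "

step1 = re.sub(r'\n\n(\d+)?\n\n', ' ', sample)   # page number -> space
step2 = step1.replace('-\n', '')                 # re-join the split word
step3 = step2.replace('\x0c', ' ').strip()       # page break -> space, trim ends

print(repr(step3))
```

The result keeps a double space where the page number and the page break were. That is harmless here, because the later tokenization step ignores whitespace.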


In [10]:
pattern_pagebreak = r'\n\n(\d+)?\n\n'

def clean(item):
    # replace page numbers surrounded by blank lines with a space
    # (no flags are needed for this pattern)
    item = re.sub(pattern_pagebreak, ' ', item)
    # remove '-\n', replace page breaks ('\x0c') by spaces and strip both ends
    item = item.replace('-\n', '').replace('\x0c', ' ').strip()
    return item

for paper_id, content in contents_dic.items():

    # populating the dictionaries with each part of the paper, after cleaning
    dict_paper_title[paper_id] = clean(re.search(pattern_title, content,  re.MULTILINE|re.DOTALL).group(1))
    dict_paper_author[paper_id] = clean(re.search(pattern_author, content, re.MULTILINE|re.DOTALL).group(1))
    dict_paper_abstract[paper_id] = clean(re.search(pattern_abstract, content,  re.MULTILINE|re.DOTALL).group(1))
    dict_paper_body[paper_id] = clean(re.search(pattern_body, content,  re.MULTILINE|re.DOTALL).group(1))   


In [11]:
# printing items of first paper to make sure process worked
print('Paper Title:', dict_paper_title['PP1861'], '\n')
print('Paper Author:', dict_paper_author['PP1861'], '\n')

Paper Title: Algorithms for Non-negative Matrix Factorization 

Paper Author: H. Sebastian Seung

Daniel D. Lee 



In [12]:
print('Number of Papers:', len(dict_paper_body))
print('Length of First Article Body:', len(dict_paper_body['PP1861']))

Number of Papers: 200
Length of First Article Body: 13291


# 3. Sparse Feature Generation 

The next task aims to generate numerical representations of each paper body from the relevant words/collocations found in it. First, however, we must apply some preprocessing tasks following the proposed guidelines. 

A challenge faced in this assignment was deciding the order of the tasks, as it can deeply affect the final result. Every step, and the rationale behind its position, is described below: 

1. **Normalization/segmentation**: The first step is to lowercase the first word of every sentence. Without this step, later stages would treat words such as 'House' and 'house' as different tokens, but we want them to be the same. 
2. **Tokenization**: We transform every word into a token (an item of a list). To generate bigrams, it is more practical that each word is already a token.
3. **Unify the 200 most relevant bigrams**: It is crucial that we do this step before the next one, which removes unwanted tokens. If we don't, words that were previously separated by a stopword, for instance, would appear as neighbours. Note that we do not want bigrams that contain stopwords or words with length less than 3, so many bigrams are discarded before getting to the next step.
4. **Remove Stop Words and Small Tokens**: context-dependent and context-independent stopwords, rare tokens and small tokens. The order among these removals is not relevant, but they must be done before stemming, the last step.
5. **Stemming**: The last step is to stem words, that is, to reduce the remaining words to their root. Since stemming transforms most words, it is essential that it comes after removing unwanted tokens. Stemming is not applied to bigrams.
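The reason for forming bigrams before removing stop words can be checked on a toy token list. In this sketch the tokens and the two-word stop list are made up, and `adjacent_pairs` is a hypothetical helper, not part of the assignment code.

```python
# Toy token list; 'of' and 'the' play the role of stop words.
tokens = ['state', 'of', 'the', 'art', 'neural', 'network']
stop = {'of', 'the'}

def adjacent_pairs(seq):
    """Return all pairs of neighbouring tokens, in order."""
    return list(zip(seq, seq[1:]))

# Bigrams first, then discard pairs that contain a stop word.
bigrams_first = [p for p in adjacent_pairs(tokens)
                 if p[0] not in stop and p[1] not in stop]

# Stop words removed first: 'state' and 'art' become neighbours.
filtered = [t for t in tokens if t not in stop]
removal_first = adjacent_pairs(filtered)

print(bigrams_first)
print(removal_first)
```

Only the second ordering produces the pair `('state', 'art')`, two words that were never adjacent in the original text.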

**Relevant observation**: We work with only one dictionary throughout task 2 (Sparse Feature Generation), `dict_paper_body`. We perform many transformations on it, but we decided not to create a new dictionary after every transformation, because that would be inefficient in memory usage and confusing to document. As a consequence, the cells must be run in order from the top: it is not possible to jump to the middle of the assignment and run a single cell on its own.  

## 3.1. Import Libraries

In [13]:
import nltk.data
from nltk.tokenize import RegexpTokenizer, MWETokenizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from nltk.probability import *
from itertools import chain

## 3.2. Lowercase Normalization
Tokens must be normalized to lowercase except the capital tokens appearing in the middle of a sentence/line. (use sentence segmentation to achieve this)

This step must be done before tokenization because we need to use sentence segmentation in order to recognize if the word is in the middle of a sentence or not. We are going to use the Punkt Sentence Tokenizer, as seen in the tutorials.

*The NLTK's [Punkt Sentence Tokenizer](http://www.nltk.org/api/nltk.tokenize.html) was designed to split 
text into sentences "by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences." It contains a pre-trained sentence tokenizer for English.* (NLTK Project, 2019)

First we must load the pre-trained English sentence tokenizer:

In [14]:
sentence_detect = nltk.data.load("tokenizers/punkt/english.pickle")

Then, we can split all the 200 articles in our `dict_paper_body` into sentences:

In [15]:
dict_paper_body = {k:sentence_detect.tokenize(v) for (k,v) in dict_paper_body.items()}

For each sentence, we need to lowercase its first word. To do so, we define the `lower_first_word` function, which is applied to each list inside the `dict_paper_body` dictionary. 

We overwrite the dictionary with the output. It still links the paper IDs with their bodies, but now with the tokens that appear at the beginning of sentences normalized.

In [16]:
# function that takes a list of strings as input and returns it with the first character of each string lowercased.
def lower_first_word(text_list):
    # text[:1] (rather than text[0]) also handles an empty string safely
    lower_cased = [text[:1].lower() + text[1:] for text in text_list]
    return lower_cased
# apply the function and transform the list back into strings.
dict_paper_body = {k:' '.join(lower_first_word(v)) for (k,v) in dict_paper_body.items()}

In [17]:
print('Number of Papers:', len(dict_paper_body))
print('Length of First Article Body:', len(dict_paper_body['PP1861']))
dict_paper_body['PP1861'][0:1000] # body, after lowercase normalization

Number of Papers: 200
Length of First Article Body: 13274


'unsupervised learning algorithms such as principal components analysis and\nvector quantization can be understood as factorizing a data matrix subject to\ndiﬀerent constraints. depending upon the constraints utilized, the resulting factors can be shown to have very diﬀerent representational properties. principal\ncomponents analysis enforces only a weak orthogonality constraint, resulting in\na very distributed representation that uses cancellations to generate variability [1, 2]. on the other hand, vector quantization uses a hard winnertake-all\nconstraint that results in clustering the data into mutually exclusive prototypes\n[3]. we have previously shown that nonnegativity is a useful constraint for\nmatrix factorization that can learn a parts representation of the data [4, 5]. the nonnegative basis vectors that are learned are used in distributed, yet still\nsparse combinations to generate expressiveness in the reconstructions [6, 7]. in\nthis submission, we analyze in detail two 

## 3.3. Tokenization

The word tokenization must use the following regular expression, <b>r"[A-Za-z]\w+(?:[-'?]\w+)?"</b>

We are going to use `RegexpTokenizer` from [nltk.tokenize](http://www.nltk.org/api/nltk.tokenize.html) (NLTK Project, 2019). Again, we are overwriting `dict_paper_body`.
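To see what the required pattern keeps and drops, it can be tried on a made-up sentence with the standard `re` module. This is only a sketch: `re.findall` is used here as a stand-in, on the assumption that `RegexpTokenizer` with this pattern yields the same matches.

```python
import re

# The pattern required by the task; it has no capturing groups,
# so re.findall returns the full text of each match.
TOKEN_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

sample = "The EM algorithm isn't a winner-take-all rule [1, 2]."
sample_tokens = re.findall(TOKEN_PATTERN, sample)
print(sample_tokens)
```

Three effects are visible: single-letter words and numbers are dropped because a token needs a letter followed by at least one more word character, the apostrophe in *isn't* is kept, and only one hyphenated part is attached, so *winner-take-all* becomes `winner-take` and `all`.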

In [18]:
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")

# tokenize the texts 
dict_paper_body = {k:tokenizer.tokenize(v) for (k,v) in dict_paper_body.items()}

# display the output to analyse eventual mistakes
print('Number of Papers:', len(dict_paper_body))
print('Tokens in First Article Body:', len(dict_paper_body['PP1861']))
print(dict_paper_body['PP1861'][0:100])

Number of Papers: 200
Tokens in First Article Body: 1928
['unsupervised', 'learning', 'algorithms', 'such', 'as', 'principal', 'components', 'analysis', 'and', 'vector', 'quantization', 'can', 'be', 'understood', 'as', 'factorizing', 'data', 'matrix', 'subject', 'to', 'diﬀerent', 'constraints', 'depending', 'upon', 'the', 'constraints', 'utilized', 'the', 'resulting', 'factors', 'can', 'be', 'shown', 'to', 'have', 'very', 'diﬀerent', 'representational', 'properties', 'principal', 'components', 'analysis', 'enforces', 'only', 'weak', 'orthogonality', 'constraint', 'resulting', 'in', 'very', 'distributed', 'representation', 'that', 'uses', 'cancellations', 'to', 'generate', 'variability', 'on', 'the', 'other', 'hand', 'vector', 'quantization', 'uses', 'hard', 'winnertake-all', 'constraint', 'that', 'results', 'in', 'clustering', 'the', 'data', 'into', 'mutually', 'exclusive', 'prototypes', 'we', 'have', 'previously', 'shown', 'that', 'nonnegativity', 'is', 'useful', 'constraint', 'for', 

## 3.4. Bigrams

The next task is to generate 200 bigram collocations, based on the highest total frequency in the corpus, given the tokenized, lowercased-where-appropriate dictionary of paper bodies and excluding context-independent stop words. They should be joined with a double underscore (example: "artificial__intelligence").

The first step is to concatenate all the tokenized bodies using the `chain.from_iterable` function from the package [itertools](https://docs.python.org/2/library/itertools.html) (Python Software Foundation, 2019), as done in the tutorials. The output of the function is a flattened list that contains all the words in the corpus.

In [19]:
all_words = list(chain.from_iterable(dict_paper_body.values()))

The next step is to generate the 200 bigram collocations. To do so, we use the `ngrams` function from [NLTK](http://www.nltk.org/api/nltk.html#nltk.util.ngrams), imported above.
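The counting idea can be sketched with the standard library alone, using `collections.Counter` and `zip` as rough stand-ins for `FreqDist` and `ngrams`. The two toy documents are made up. This sketch counts inside each document separately, a design choice that avoids pairing the last token of one paper with the first token of the next, which a single flattened list would allow.

```python
from collections import Counter

# Hypothetical tokenized bodies, standing in for dict_paper_body.
toy_bodies = {
    'P1': ['gradient', 'descent', 'converges'],
    'P2': ['stochastic', 'gradient', 'descent'],
}

bigram_counts = Counter()
for tokens in toy_bodies.values():
    # zip pairs each token with its successor inside one document only,
    # so 'converges' (end of P1) is never paired with 'stochastic' (start of P2).
    bigram_counts.update(zip(tokens, tokens[1:]))

print(bigram_counts.most_common(1))
```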

In [20]:
bigrams = ngrams(all_words, n = 2)
fdbigram = FreqDist(bigrams)
top_bigrams = fdbigram.most_common(len(fdbigram))

In [21]:
top_bigrams[0:5]

[(('of', 'the'), 6856),
 (('in', 'the'), 3706),
 (('to', 'the'), 2705),
 (('on', 'the'), 2177),
 (('can', 'be'), 2119)]

However, most of these bigrams contain stopwords, which is not of interest to us. So, we select the 200 most frequent bigram collocations that contain no stopword, by excluding the others from our list of bigrams.

In [22]:
bigram_list = list(fdbigram)

# importing stopwords; the `with` block closes the file automatically
with open('stopwords_en.txt') as stopwords_file:
    stopwords_list = [i.strip() for i in stopwords_file]
stopwords_set = set(stopwords_list)

In [23]:
# removing bigrams that contain a stopword or a word with length less than 3
# (membership tests use the set, which is faster than scanning the list)
top_bigrams = [token for token in top_bigrams if token[0][0] not in stopwords_set and token[0][1] not in stopwords_set]
top_bigrams = [token for token in top_bigrams if len(token[0][0]) > 2 and len(token[0][1]) > 2]
top_200_bigrams = top_bigrams[0:200]
top_200_bigrams[0:10]

[(('log', 'log'), 225),
 (('optimization', 'problem'), 223),
 (('lower', 'bound'), 188),
 (('training', 'data'), 183),
 (('loss', 'function'), 179),
 (('objective', 'function'), 174),
 (('upper', 'bound'), 150),
 (('gradient', 'descent'), 145),
 (('machine', 'learning'), 142),
 (('posterior', 'distribution'), 132)]

Now we have achieved a bigram list that is in line with our goals, with no stopwords or words with length less than 3.

### 3.4.1. Re-tokenize the Paper Bodies

Now, we introduce the 200 collocations into the token lists. We need to make sure those collocations are not split into two individual words. The tokenizer that we need is <a href="http://www.nltk.org/api/nltk.tokenize.html">MWETokenizer</a> (NLTK Project, 2019). We can use it to transform each bigram into a single string joined by "__".

First, in order to apply the MWETokenizer, we must reshape `top_200_bigrams`, which currently is a list of (bigram, frequency) tuples. What we need is a list of bigram tuples. For instance:

[(('log', 'log'), 225),
 (('optimization', 'problem'), 223)]
 
Should become:

[('log', 'log'),
 ('optimization', 'problem')]


In [24]:
# keep only the bigram of each (bigram, frequency) pair
top_200_list = [bigram for bigram, freq in top_200_bigrams]
print(top_200_list[0:10])

[('log', 'log'), ('optimization', 'problem'), ('lower', 'bound'), ('training', 'data'), ('loss', 'function'), ('objective', 'function'), ('upper', 'bound'), ('gradient', 'descent'), ('machine', 'learning'), ('posterior', 'distribution')]


Now that we have the list, we apply the `MWETokenizer` to the bodies in `dict_paper_body` to obtain a dictionary with the original tokens in which the selected bigrams are merged into single tokens.
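The effect of this merging step can be pictured with a simplified pure-Python stand-in. `merge_bigrams` below is a hypothetical helper written for illustration, not NLTK's implementation; it scans left to right and joins any listed pair of neighbouring tokens.

```python
def merge_bigrams(tokens, bigrams, separator='__'):
    """Greedily join listed adjacent pairs, scanning left to right."""
    wanted = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in wanted:
            merged.append(separator.join(pair))
            i += 2          # both tokens are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_tokens = ['the', 'loss', 'function', 'of', 'gradient', 'descent']
toy_bigrams = [('loss', 'function'), ('gradient', 'descent')]
print(merge_bigrams(toy_tokens, toy_bigrams))
```

Because matched tokens are consumed, a run such as `log log log` yields one `log__log` followed by a single `log`, which is why the number of tokens per paper drops after this step.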

In [25]:
mwetokenizer = MWETokenizer(top_200_list, separator='__')
dict_paper_body =  dict((id, mwetokenizer.tokenize(body)) for id,body in dict_paper_body.items())
all_words_colloc = list(chain.from_iterable(dict_paper_body.values()))
colloc_voc = list(set(all_words_colloc))
print('Bigrams vocabulary size: ',len(colloc_voc))

Bigrams vocabulary size:  27196


In [26]:
# display the output to analyse eventual mistakes
print('Tokens in First Article Body:', len(dict_paper_body['PP1861']))
print(dict_paper_body['PP1861'][1:100])

Tokens in First Article Body: 1905
['learning__algorithms', 'such', 'as', 'principal__components', 'analysis', 'and', 'vector', 'quantization', 'can', 'be', 'understood', 'as', 'factorizing', 'data__matrix', 'subject', 'to', 'diﬀerent', 'constraints', 'depending', 'upon', 'the', 'constraints', 'utilized', 'the', 'resulting', 'factors', 'can', 'be', 'shown', 'to', 'have', 'very', 'diﬀerent', 'representational', 'properties', 'principal__components', 'analysis', 'enforces', 'only', 'weak', 'orthogonality', 'constraint', 'resulting', 'in', 'very', 'distributed', 'representation', 'that', 'uses', 'cancellations', 'to', 'generate', 'variability', 'on', 'the', 'other', 'hand', 'vector', 'quantization', 'uses', 'hard', 'winnertake-all', 'constraint', 'that', 'results', 'in', 'clustering', 'the', 'data', 'into', 'mutually', 'exclusive', 'prototypes', 'we', 'have', 'previously', 'shown', 'that', 'nonnegativity', 'is', 'useful', 'constraint', 'for', 'matrix__factorization', 'that', 'can', 'learn

## 3.5. Clean the Body from Unwanted Tokens 


### 3.5.1. Remove Context Independent Stop Words
The provided context-independent stop word list (i.e., stopwords_en.txt) is filtered out of our lists of tokens, as these words carry little meaning.

In [27]:
with open('stopwords_en.txt') as stopwords_file:
    stopwords_list = [i.strip() for i in stopwords_file]
stopwords_set = set(stopwords_list)
# we check if the lowercased token is in the stopwords set because we still have some 
# incorrectly uppercase "The", "Therefore", etc
dict_paper_body = {k:[token for token in v if token.lower() not in stopwords_set] for (k,v) in dict_paper_body.items()}

In [28]:
# analyse filtered_tokens
print('Tokens in First Filtered Article Body:', len(dict_paper_body['PP1861']))
print(dict_paper_body['PP1861'][0:100])

Tokens in First Filtered Article Body: 973
['unsupervised', 'learning__algorithms', 'principal__components', 'analysis', 'vector', 'quantization', 'understood', 'factorizing', 'data__matrix', 'subject', 'diﬀerent', 'constraints', 'depending', 'constraints', 'utilized', 'resulting', 'factors', 'shown', 'diﬀerent', 'representational', 'properties', 'principal__components', 'analysis', 'enforces', 'weak', 'orthogonality', 'constraint', 'resulting', 'distributed', 'representation', 'cancellations', 'generate', 'variability', 'hand', 'vector', 'quantization', 'hard', 'winnertake-all', 'constraint', 'results', 'clustering', 'data', 'mutually', 'exclusive', 'prototypes', 'previously', 'shown', 'nonnegativity', 'constraint', 'matrix__factorization', 'learn', 'parts', 'representation', 'data', 'nonnegative', 'basis', 'vectors', 'learned', 'distributed', 'sparse', 'combinations', 'generate', 'expressiveness', 'reconstructions', 'submission', 'analyze', 'detail', 'numerical', 'algorithms', 'learn

### 3.5.2. Remove Context Dependent Stop Words

Context-dependent stop words (with the threshold set to 95%) must be removed from the vocab, that is, tokens that appear in at least 95% of the papers (190 of the 200). The following [article](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html) (Cambridge University Press, 2009) was of great help in this step.
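The threshold is applied to document frequency (the number of papers that contain a token), not to the raw number of occurrences. A minimal sketch on a hypothetical four-paper corpus, using `collections.Counter` in place of `FreqDist`:

```python
from collections import Counter

# Hypothetical mini-corpus of four tokenized papers.
toy_corpus = {
    'P1': ['model', 'data', 'model', 'kernel'],
    'P2': ['model', 'data', 'graph'],
    'P3': ['model', 'data', 'bound'],
    'P4': ['model', 'bound'],
}

# set(tokens) makes each paper contribute at most 1 per token,
# so the counter holds document frequencies, not raw counts.
doc_freq = Counter()
for tokens in toy_corpus.values():
    doc_freq.update(set(tokens))

n_docs = len(toy_corpus)
too_common = sorted(t for t, df in doc_freq.items() if df >= 0.95 * n_docs)
print(doc_freq['model'], too_common)
```

Here `model` occurs five times but is counted in four papers, so it is the only token above the 95% line. With 200 papers, the same rule gives the cut-off of 190 documents.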

In [29]:
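# set(value) counts each token once per paper, so fd_2 holds document frequencies
words_2 = list(chain.from_iterable([set(value) for value in dict_paper_body.values()]))
fd_2 = FreqDist(words_2)
common_tokens = fd_2.most_common(25)

# 190 papers correspond to 95% of the 200 papers
content_dependent = [token for token, doc_freq in common_tokens if doc_freq >= 190]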

Let's check the context dependent stopwords:

In [30]:
content_dependent

['results', 'set', 'number', 'rst']

Now, we will filter those words from all our paper bodies in `dict_paper_body`:

In [31]:
dict_paper_body = dict((id, [token for token in body if token not in content_dependent]) for id,body in dict_paper_body.items())
# filtered_tokens
print('Tokens in First Filtered Article Body:', len(dict_paper_body['PP1861']))
print(dict_paper_body['PP1861'][0:100])

Tokens in First Filtered Article Body: 961
['unsupervised', 'learning__algorithms', 'principal__components', 'analysis', 'vector', 'quantization', 'understood', 'factorizing', 'data__matrix', 'subject', 'diﬀerent', 'constraints', 'depending', 'constraints', 'utilized', 'resulting', 'factors', 'shown', 'diﬀerent', 'representational', 'properties', 'principal__components', 'analysis', 'enforces', 'weak', 'orthogonality', 'constraint', 'resulting', 'distributed', 'representation', 'cancellations', 'generate', 'variability', 'hand', 'vector', 'quantization', 'hard', 'winnertake-all', 'constraint', 'clustering', 'data', 'mutually', 'exclusive', 'prototypes', 'previously', 'shown', 'nonnegativity', 'constraint', 'matrix__factorization', 'learn', 'parts', 'representation', 'data', 'nonnegative', 'basis', 'vectors', 'learned', 'distributed', 'sparse', 'combinations', 'generate', 'expressiveness', 'reconstructions', 'submission', 'analyze', 'detail', 'numerical', 'algorithms', 'learning', 'opti

### 3.5.3. Removing Tokens with Length Smaller than 3
Tokens with length less than 3 should be removed from the vocab. This must be done after finding the collocations; otherwise, words that were originally separated by a short token would become neighbours and collocations would be misidentified.

In [32]:
dict_paper_body = {k:[token for token in v if len(token) >=3] for (k,v) in dict_paper_body.items()}
# checking output
print('Tokens in First Filtered Article Body:', len(dict_paper_body['PP1861']))
print(dict_paper_body['PP1861'][0:100])

Tokens in First Filtered Article Body: 841
['unsupervised', 'learning__algorithms', 'principal__components', 'analysis', 'vector', 'quantization', 'understood', 'factorizing', 'data__matrix', 'subject', 'diﬀerent', 'constraints', 'depending', 'constraints', 'utilized', 'resulting', 'factors', 'shown', 'diﬀerent', 'representational', 'properties', 'principal__components', 'analysis', 'enforces', 'weak', 'orthogonality', 'constraint', 'resulting', 'distributed', 'representation', 'cancellations', 'generate', 'variability', 'hand', 'vector', 'quantization', 'hard', 'winnertake-all', 'constraint', 'clustering', 'data', 'mutually', 'exclusive', 'prototypes', 'previously', 'shown', 'nonnegativity', 'constraint', 'matrix__factorization', 'learn', 'parts', 'representation', 'data', 'nonnegative', 'basis', 'vectors', 'learned', 'distributed', 'sparse', 'combinations', 'generate', 'expressiveness', 'reconstructions', 'submission', 'analyze', 'detail', 'numerical', 'algorithms', 'learning', 'opti

### 3.5.4. Remove Rare Words
Rare tokens (with the threshold set to 3%) must be removed from the vocabulary, that is, tokens that appear in fewer than 3% of the papers (fewer than 6 of the 200). Let's first find out which tokens are the `rare_tokens`:

In [33]:
rare_tokens = [key for key,value in fd_2.items() if (value<6)]
# checking output
print('Rare Tokens:', len(rare_tokens))
print(rare_tokens[0:10])

Rare Tokens: 21957
['abhahb', 'Vi', 'nonnegativity', 'ilt', 'Proofs', 'Kaufman', 'itt', 'Wh', 'tomography', 'Frobenius-Perron']
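
The sketch below shows one way the threshold of 6 can be derived from the 3% rule instead of being hard-coded. It assumes that the rule refers to document frequency (the number of papers containing a token); `find_rare_tokens` and the toy corpus are hypothetical and only illustrate the idea.

```python
from collections import Counter


def find_rare_tokens(docs, percent=3):
    """Return (rare_tokens, min_docs) for a dict {paper_id: [tokens]}.

    A token is rare when it appears in fewer than percent% of the papers.
    """
    min_docs = len(docs) * percent / 100
    doc_freq = Counter()
    for tokens in docs.values():
        # set() so that each paper counts a token at most once
        doc_freq.update(set(tokens))
    rare = sorted(t for t, df in doc_freq.items() if df < min_docs)
    return rare, min_docs


# Toy corpus of 200 papers: 'rare' occurs in 5 papers, 'edge' in 6, 'common' in all.
toy_docs = {
    'P%d' % i: ['common'] + (['rare'] if i < 5 else []) + (['edge'] if i < 6 else [])
    for i in range(200)
}
rare, min_docs = find_rare_tokens(toy_docs)
print(min_docs, rare)
```

A token present in exactly 6 papers is kept, matching the strict `value<6` comparison used above.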


Removing them from our bodies vocabulary:

In [34]:
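rare_tokens_set = set(rare_tokens)  # set membership is much faster than scanning a list of ~22k tokens
dict_paper_body = {paper_id: [token for token in body if token not in rare_tokens_set] for paper_id, body in dict_paper_body.items()}
# checking output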
print('Tokens in First Filtered Article Body:', len(dict_paper_body['PP1861']))
print(dict_paper_body['PP1861'][0:100])

Tokens in First Filtered Article Body: 654
['unsupervised', 'learning__algorithms', 'principal__components', 'analysis', 'vector', 'understood', 'data__matrix', 'subject', 'diﬀerent', 'constraints', 'depending', 'constraints', 'utilized', 'resulting', 'factors', 'shown', 'diﬀerent', 'representational', 'properties', 'principal__components', 'analysis', 'enforces', 'weak', 'constraint', 'resulting', 'distributed', 'representation', 'generate', 'variability', 'hand', 'vector', 'hard', 'constraint', 'clustering', 'data', 'mutually', 'exclusive', 'previously', 'shown', 'constraint', 'matrix__factorization', 'learn', 'parts', 'representation', 'data', 'nonnegative', 'basis', 'vectors', 'learned', 'distributed', 'sparse', 'combinations', 'generate', 'analyze', 'detail', 'numerical', 'algorithms', 'learning', 'optimal', 'nonnegative', 'factors', 'data', 'matrix__factorization', 'formally', 'algorithms', 'solving', 'problem', 'matrix__factorization', 'non-negative', 'matrix', 'non-negative', '

## 3.6. Stemming
Unigram tokens should be stemmed using the Porter stemmer (be careful: the stemmer performs lowercasing by default).
The Porter stemming algorithm is one of the most common stemming algorithms.
It makes use of a series of heuristic replacement rules.

According to [Stemming and lemmatization](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) (2009), it is *"The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective"*.

In [35]:
stemmer = PorterStemmer()

We define the `stemming_lowercase()` function, which applies the `stemmer` to all tokens except bigrams and tokens with uppercase characters (presumed proper nouns):

In [36]:
def stemming_lowercase(token):
    # we only do the stemming if it's not a proper noun or a bigram
    if (token.lower() == token) and ("__" not in token):
        new_token = stemmer.stem(token)
    else: 
        new_token = token
    return new_token

Applying the above defined function:

In [37]:
dict_paper_body = {paper_id: list(map(stemming_lowercase, body)) for paper_id, body in dict_paper_body.items()}
# checking output
print('Tokens in First Filtered Article Body:', len(dict_paper_body['PP1861']))
print(dict_paper_body['PP1861'][0:100])

Tokens in First Filtered Article Body: 654
['unsupervis', 'learning__algorithms', 'principal__components', 'analysi', 'vector', 'understood', 'data__matrix', 'subject', 'diﬀer', 'constraint', 'depend', 'constraint', 'util', 'result', 'factor', 'shown', 'diﬀer', 'represent', 'properti', 'principal__components', 'analysi', 'enforc', 'weak', 'constraint', 'result', 'distribut', 'represent', 'gener', 'variabl', 'hand', 'vector', 'hard', 'constraint', 'cluster', 'data', 'mutual', 'exclus', 'previous', 'shown', 'constraint', 'matrix__factorization', 'learn', 'part', 'represent', 'data', 'nonneg', 'basi', 'vector', 'learn', 'distribut', 'spars', 'combin', 'gener', 'analyz', 'detail', 'numer', 'algorithm', 'learn', 'optim', 'nonneg', 'factor', 'data', 'matrix__factorization', 'formal', 'algorithm', 'solv', 'problem', 'matrix__factorization', 'non-neg', 'matrix', 'non-neg', 'matrix', 'factor', 'appli', 'statist', 'analysi', 'multivari', 'data', 'manner', 'multivari', 'dimension', 'data', 'vecto
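
The selection rule used above (stem only all-lowercase unigrams) can be isolated as in the following sketch. To keep it free of external dependencies, `toy_stem` is a hypothetical stand-in that only strips a trailing "s"; it is not the Porter algorithm.

```python
def should_stem(token):
    """True for all-lowercase unigrams; False for bigrams ('__') or tokens with uppercase letters."""
    return token.lower() == token and "__" not in token


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's' from longer tokens."""
    return token[:-1] if token.endswith('s') and len(token) > 3 else token


def stem_selectively(tokens, stem=toy_stem):
    """Apply `stem` only to the tokens selected by should_stem()."""
    return [stem(t) if should_stem(t) else t for t in tokens]


sample = ['constraints', 'Kullback-Leibler', 'learning__algorithms', 'vectors']
result = stem_selectively(sample)
print(result)
```

Passing the stemmer as a parameter means a real stemmer's `stem` method could be supplied in place of the toy function without changing the rule.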

## 3.7. Paper Bodies - Sparse Feature Generation 

Each group is required to complete the following two tasks:
Generate a sparse representation for Paper Bodies (i.e. paper text without Title, Authors, Abstract and References). The sparse representation consists of two files:
### 3.7.1. Vocabulary index file
We will create a list that contains all unique words:

In [38]:
vocab_list = sorted(set(chain.from_iterable(dict_paper_body.values())))
vocab_list[0:10]

['ADMM',
 'AFOSR',
 'ARO',
 'AUC',
 'Acc',
 'Accuracy',
 'Acknowledgements',
 'Acknowledgments',
 'Action',
 'Adam']

In [39]:
len(vocab_list)

2600

Creating a dictionary that maps every token in the list to an index number, and exporting it to the file `Group103_vocab.txt`.

In [40]:
index_dict = {}
with open('Group103_vocab.txt', 'w+', encoding="utf-8") as vocab_index_file:
    for i, token in enumerate(vocab_list):
        index_dict[token] = i
        vocab_index_file.write(token + ':' + str(i) + "\n")  # exporting the token and index
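
As a quick check of the `token:index` format, the sketch below writes a small vocabulary and reads it back. It uses an in-memory buffer instead of a file, and `write_vocab` and `read_vocab` are hypothetical helper names.

```python
import io


def write_vocab(tokens, handle):
    """Write one 'token:index' line per unique token (sorted) and return the index dict."""
    index = {}
    for i, token in enumerate(sorted(set(tokens))):
        index[token] = i
        handle.write(token + ':' + str(i) + "\n")
    return index


def read_vocab(handle):
    """Parse 'token:index' lines back into a dict; split on the last ':' only."""
    index = {}
    for line in handle:
        token, _, idx = line.rstrip("\n").rpartition(':')
        index[token] = int(idx)
    return index


buffer = io.StringIO()
written = write_vocab(['data', 'ADMM', 'vector', 'data'], buffer)
buffer.seek(0)
restored = read_vocab(buffer)
print(restored)
```

Uppercase tokens receive the lowest indices because Python sorts strings by code point, which is also why the first entries of `vocab_list` above are all capitalized.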

### 3.7.2. Sparse count vectors file

In this next step, we will create the *count_vectors_file*, named `output/count_vectors.txt`. We will iterate over the final dictionary **dict_paper_body**, which holds the final tokenized structure of each paper. For each paper, we count the frequency of every token and write it into a string called **str_key_value**, which is exported to the file at every iteration over **dict_paper_body**.

In [41]:
with open('output/count_vectors.txt', 'w+', encoding="utf-8") as count_vectors_file:

    for key, value in dict_paper_body.items():

        # dictionary with the frequency of every token (keys keep first-occurrence order)
        dict_word_freq = dict(FreqDist(value))

        # string with one "index:count" entry per distinct token, so no duplicates
        str_value = ','.join(str(index_dict[token]) + ':' + str(count)
                             for token, count in dict_word_freq.items())

        # final string that will be exported: the paper id followed by the counts
        str_key_value = str(key) + ',' + str_value

        count_vectors_file.write(str_key_value + "\n")  # exporting the result to a file

In [42]:
# analysing last paper's sparse representation:
str_key_value

'PP7219,903:7,1661:1,471:2,2269:4,1806:27,2567:3,2529:4,1519:1,2372:9,1330:1,1843:6,987:3,777:1,2245:1,1641:1,1430:1,1899:1,640:1,470:14,2338:2,626:5,1840:1,2381:1,1901:10,1863:61,1328:3,1911:13,2476:17,866:2,1856:10,538:8,1063:28,1941:10,703:2,1230:2,1062:20,2575:7,1156:2,1560:11,1616:2,1154:1,1071:1,678:6,1703:2,965:1,1029:4,1177:1,2118:26,2094:19,1487:10,1893:10,1701:4,1747:1,783:2,469:1,2540:1,2535:1,1196:7,2162:3,2000:4,974:24,1072:20,1658:26,2427:70,887:38,765:1,979:10,1755:5,1260:6,1353:6,2428:9,2082:7,1603:26,1313:2,1563:1,945:1,73:1,283:1,324:1,271:1,231:1,37:1,439:1,1507:1,1982:2,1368:4,482:2,1891:6,1574:33,1526:1,36:13,1162:1,1885:2,2240:2,1776:10,1851:2,774:12,1942:1,1070:2,1345:2,1379:1,542:8,937:1,1322:3,219:1,248:1,268:2,236:3,1983:1,2528:2,571:6,919:2,1517:1,1088:7,1910:1,2286:1,2527:5,982:1,1712:5,920:3,910:1,1004:4,806:3,840:2,1884:3,821:1,1445:5,1245:1,1308:4,2206:1,757:10,1951:1,1935:1,2221:8,2035:5,1589:1,1610:3,2263:4,1821:2,1915:9,1001:9,2385:18,2402:4,1376:5,218
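
To make the line format explicit, here is a small sketch that encodes one paper as `paper_id,index:count,...` and decodes it again. The paper id `PP0001`, the three-word index and the helper names are made up for illustration.

```python
from collections import Counter


def encode_counts(paper_id, tokens, index):
    """Build 'paper_id,index:count,...' with one entry per distinct token."""
    counts = Counter(tokens)  # keeps first-occurrence order of the tokens
    pairs = [str(index[t]) + ':' + str(c) for t, c in counts.items()]
    return ','.join([paper_id] + pairs)


def decode_counts(line):
    """Parse a line produced by encode_counts into (paper_id, {index: count})."""
    paper_id, *pairs = line.split(',')
    return paper_id, {int(i): int(c) for i, c in (p.split(':') for p in pairs)}


index = {'data': 0, 'learn': 1, 'vector': 2}
line = encode_counts('PP0001', ['vector', 'data', 'vector', 'learn', 'vector'], index)
decoded = decode_counts(line)
print(line)
print(decoded)
```

Only tokens that occur in a paper appear on its line, which is what makes the representation sparse.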

# 4. Statistics Generation

------------------------------------------------------------------------------------------------------------------------------

To complete this second task, we need to perform the following preprocessing steps on
the Titles and Abstracts before extracting the required stats:<br>

- <b>A.</b> The word tokenization must use the following regular expression, `r"[A-Za-z]\w+(?:[-'?]\w+)?"`
- <b>B.</b> The context-independent stop words (i.e, stopwords_en.txt) must be removed
- <b>C.</b> For Abstracts, tokens must be normalized to lowercase, except the capitalized tokens appearing in the middle of a sentence/line (use sentence segmentation to achieve this). For Titles, all tokens must be normalized to lowercase.

We will also tokenize the Authors in the `Tokenization` step below.

## 4.1. Import Libraries

In [43]:
import pandas as pd

## 4.2. Lowercase Normalization
For Abstracts, tokens must be normalized to lowercase, except the capitalized tokens appearing in the middle of a sentence/line (use sentence segmentation to achieve this). For Titles, all tokens must be normalized to lowercase.

### 4.2.1. Abstracts
We are going to apply to the abstracts the same process that we applied before to the paper bodies.

First, we use NLTK's [Punkt Sentence Tokenizer](http://www.nltk.org/api/nltk.tokenize.html) (NLTK Project, 2019) to split the abstracts contained in the dictionary `dict_paper_abstract` into sentences:

In [44]:
dict_paper_abstract = {k:sentence_detect.tokenize(v) for (k,v) in dict_paper_abstract.items()}
dict_paper_abstract['PP1861']

['Non-negative matrix factorization (NMF) has previously been shown\nto be a useful decomposition for multivariate data.',
 'Two diﬀerent multiplicative algorithms for NMF are analyzed.',
 'They diﬀer only slightly in\nthe multiplicative factor used in the update rules.',
 'One algorithm can be\nshown to minimize the conventional least squares error while the other\nminimizes the generalized Kullback-Leibler divergence.',
 'The monotonic\nconvergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the ExpectationMaximization algorithm.',
 'The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally\nchosen to ensure convergence.']

For each sentence, we need to lowercase its first word. To do so, we use the `lower_first_word` function, which is applicable to each list inside the `dict_paper_abstract` dictionary. This is the same function that we defined and used previously for the paper bodies.

The `dict_paper_abstract` dictionary will be overwritten with the output. This dictionary still links the paper IDs with their abstracts, but now with the tokens that appear at the beginning of sentences normalized.

In [45]:
# apply the function and transform the list back into strings.
dict_paper_abstract = {k:' '.join(lower_first_word(v)) for (k,v) in dict_paper_abstract.items()}

In [46]:
dict_paper_abstract['PP1861'] # abstracts, after lowercase normalization

'non-negative matrix factorization (NMF) has previously been shown\nto be a useful decomposition for multivariate data. two diﬀerent multiplicative algorithms for NMF are analyzed. they diﬀer only slightly in\nthe multiplicative factor used in the update rules. one algorithm can be\nshown to minimize the conventional least squares error while the other\nminimizes the generalized Kullback-Leibler divergence. the monotonic\nconvergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the ExpectationMaximization algorithm. the algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally\nchosen to ensure convergence.'
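
The definition of `lower_first_word` is not repeated in this section. As an illustration only, a function with the same observable effect could look like the sketch below; `lower_first_word_sketch` is a hypothetical name, and the actual implementation used earlier may differ.

```python
import re


def lower_first_word_sketch(sentences):
    """Lowercase only the first whitespace-delimited word of each sentence."""
    return [re.sub(r"^\s*\S+", lambda m: m.group(0).lower(), s) for s in sentences]


sentences = [
    'Two algorithms for NMF are analyzed.',
    'The other minimizes the Kullback-Leibler divergence.',
]
normalized = ' '.join(lower_first_word_sketch(sentences))
print(normalized)
```

Capitalized tokens in the middle of a sentence, such as `NMF` and `Kullback-Leibler`, are left untouched.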

### 4.2.2. Titles
For Titles, all tokens must be normalized to lowercase. So, we just need to apply the `lower()` string method to all values of the `dict_paper_title` dictionary.

In [47]:
dict_paper_title = {id: title.lower() for (id, title) in dict_paper_title.items()}
dict_paper_title['PP1861']

'algorithms for non-negative matrix factorization'

## 4.3. Tokenization
The word tokenization must use the following regular expression: `r"[A-Za-z]\w+(?:[-'?]\w+)?"`.
Again, we are going to use `RegexpTokenizer()` from NLTK's [nltk.tokenize module](http://www.nltk.org/api/nltk.tokenize.html) (NLTK Project, 2019).

We need to tokenize `dict_paper_title`, `dict_paper_abstract` and `dict_paper_author`. The regular expression for abstracts and titles is the same, the one indicated in the assignment guidelines, <b>r"[A-Za-z]\w+(?:[-'?]\w+)?"</b>.

In [48]:
## Tokenizer for titles and abstracts
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
# tokenize the titles: 
dict_paper_title = {k:tokenizer.tokenize(v) for (k,v) in dict_paper_title.items()}
# tokenize the abstracts: 
dict_paper_abstract = {k:tokenizer.tokenize(v) for (k,v) in dict_paper_abstract.items()}
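
To see what this pattern keeps and drops, the sketch below applies it with the standard library's `re.findall`, which should give the same tokens as `RegexpTokenizer` for a pattern without capturing groups. The sample sentence is made up.

```python
import re

# Same pattern as in the assignment guidelines.
TOKEN_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

text = "non-negative matrix factorization (NMF) is a useful tool for 2D data."
tokens = re.findall(TOKEN_PATTERN, text)
print(tokens)
```

The pattern requires a leading letter followed by at least one more word character, so the single-letter word "a" and the fragment "2D" produce no token, while the optional group keeps "non-negative" together as one token.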

However, this tokenizer does not work for authors. One example:

"H. Sebastian Seung  Daniel D. Lee" must be tokenized as ["H. Sebastian Seung", "Daniel D. Lee"] and not as ["H.", "Sebastian", "Seung", "Daniel", "D.", "Lee"]

To perform this task, we split each author string on the separator between authors instead of using the regular expression above. The pattern is the following:

- If there is one single space " " between two names in `dict_paper_author`, it only separates the parts of one author's name (first name, initials, last name). So, they should be kept together as one token.
- The boundary between two authors comes from the original line break "\n" (when line breaks are shown as spaces, it appears as two space characters, as in the example above). So, the code below splits on "\n", and `filter(None, ...)` drops the empty strings produced by consecutive or trailing line breaks.

In [49]:
# tokenize the authors: 
dict_paper_author= {k:list(filter(None,v.split("\n"))) for (k,v) in dict_paper_author.items()}
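
The following sketch shows the author-splitting rule on two made-up raw strings, one with line breaks and one where the breaks were already converted to double spaces. `split_authors` is a hypothetical helper that accepts both separators; the cell above only splits on "\n".

```python
import re


def split_authors(raw):
    """Split on line breaks or on runs of 2+ spaces; single spaces stay inside a name."""
    parts = re.split(r"\n+| {2,}", raw)
    return [p.strip() for p in parts if p.strip()]


with_newlines = split_authors("H. Sebastian Seung\nDaniel D. Lee\n")
with_double_spaces = split_authors("H. Sebastian Seung  Daniel D. Lee")
print(with_newlines)
print(with_double_spaces)
```

Both inputs give the same two author tokens, and the empty string after the trailing line break is dropped.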

## 4.4. Remove context-independent stop words
The context-independent stop words (i.e., stopwords_en.txt) must be removed from the titles and abstracts. As we have previously done this for the paper bodies, we already have the `stopwords_set` based on the words contained in the given file.

In [50]:
# remove stop words from titles:
dict_paper_title = {k:[token for token in v if token not in stopwords_set] for (k,v) in dict_paper_title.items()}
# remove stop words from abstracts:
dict_paper_abstract = {k:[token for token in v if token not in stopwords_set] for (k,v) in dict_paper_abstract.items()}

In [51]:
print(dict_paper_title)

{'PP1861': ['algorithms', 'non-negative', 'matrix', 'factorization'], 'PP3203': ['predictive', 'matrix-variate', 'models'], 'PP3216': ['bayesian', 'binning', 'beats', 'approximate', 'alternatives', 'estimating', 'peri-stimulus', 'time', 'histograms'], 'PP3252': ['multiple-instance', 'active', 'learning'], 'PP3282': ['variational', 'inference', 'diﬀusion', 'processes'], 'PP3284': ['gaussian', 'process', 'models', 'link', 'analysis', 'transfer', 'learning'], 'PP3295': ['discriminative', 'batch', 'mode', 'active', 'learning'], 'PP3309': ['inﬁnite', 'gamma-poisson', 'feature', 'model'], 'PP3391': ['hebbian', 'learning', 'bayes', 'optimal', 'decisions'], 'PP3408': ['adaptive', 'template', 'matching', 'shift-invariant', 'semi-nmf'], 'PP3435': ['shared', 'segmentation', 'natural', 'scenes', 'dependent', 'pitman-yor', 'processes'], 'PP3439': ['bio-inspired', 'real', 'time', 'sensory', 'map', 'realignment', 'robotic', 'barn', 'owl'], 'PP3460': ['learning', 'consistency', 'inductive', 'functions

## 4.5. Generate a CSV file (stats.csv) containing three columns:
<b>a.</b> Top 10 most frequent terms appearing in all Titles

<b>b.</b> Top 10 most frequent Authors

<b>c.</b> Top 10 most frequent terms appearing in all Abstracts

Note: In case of ties in any of the above fields, settle the tie based on alphabetical ascending order. (example: if the author named John appeared as many times as Mark, then John shall be selected over Mark)

First, we define a function that returns a list of the top 10 values in a dictionary of tokens, ordered as instructed. The sorting should be first by descending order of counts and second by ascending alphabetical order of names. To do so, we use a `lambda` function as the `key` argument of `sort()`, as [explained](https://stackoverflow.com/questions/14466068/sort-a-list-of-tuples-by-second-value-reverse-true-and-then-by-key-reverse-fal) by mgilson (2013).

In [52]:
def get_top10_sorted(token_dict):
    # first, make a flat list with all the tokens in the dictionary values
    words_list = list(chain.from_iterable(token_dict.values()))
    # get the tokens' distribution
    fd_words = FreqDist(words_list)
    # list of tuple pairs (token, number_occurrences) ordered by number_occurrences
    top_words = fd_words.most_common()
    # sort by descending count (-x[1]) and break ties in ascending alphabetical order (x[0])
    top_words.sort(key=lambda x: (-x[1], x[0]))
    # returning top 10
    return top_words[0:10]
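
The tie-breaking rule can be checked on a tiny example. This sketch uses `collections.Counter` from the standard library instead of `FreqDist`, and the names are made up, echoing the John/Mark example in the note above.

```python
from collections import Counter


def top_n_sorted(tokens, n=10):
    """Top n (token, count) pairs: highest count first, ties in ascending alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


sample = ['Mark', 'John', 'Zoe', 'Zoe', 'Zoe', 'John', 'Mark', 'Ann']
top = top_n_sorted(sample, n=3)
print(top)
```

John and Mark both appear twice, and the sort key places John first because of the alphabetical tie-break.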

Now, we just need to apply the above defined `get_top10_sorted()` function to the `dict_paper_title`, `dict_paper_abstract` and `dict_paper_author`, generating lists to be further added as columns of a dataframe.
### 4.5.1 Top 10 most frequent terms appearing in all Titles

In [53]:
top_title = get_top10_sorted(dict_paper_title)

In [54]:
top_title

[('learning', 48),
 ('neural', 17),
 ('models', 16),
 ('optimization', 12),
 ('networks', 11),
 ('deep', 10),
 ('inference', 10),
 ('data', 9),
 ('reinforcement', 9),
 ('estimation', 8)]

### 4.5.2 Top 10 most frequent Authors

In [55]:
top_author = get_top10_sorted(dict_paper_author)
top_author

[('Jakob H. Macke', 4),
 ('Tomer Koren', 4),
 ('Dale Schuurmans', 3),
 ('Daniel D. Lee', 3),
 ('Huan Xu', 3),
 ('Michael I. Jordan', 3),
 ('Pradeep K. Ravikumar', 3),
 ('Remi Munos', 3),
 ('Alan A. Stocker', 2),
 ('Alexander T. Ihler', 2)]

### 4.5.3 Top 10 most frequent terms appearing in all Abstracts

In [56]:
top_abstract = get_top10_sorted(dict_paper_abstract)
top_abstract

[('learning', 191),
 ('data', 169),
 ('model', 153),
 ('algorithm', 140),
 ('show', 114),
 ('problem', 109),
 ('approach', 93),
 ('models', 86),
 ('algorithms', 84),
 ('methods', 84)]

### 4.5.4 Generate the CSV
First, we will organize the top 10 words of each section in columns of a dataframe and then write the dataframe to a CSV file.

We create a `zipped_list` with the abstract, title and author lists of words, following the approach [suggested](https://thispointer.com/python-pandas-how-to-convert-lists-to-a-dataframe/) on thispointer.com (Varun, 2018). Then, we use it to generate our `df_statistics` dataframe.



In [57]:
zipped_list =  list(zip([i[0] for i in top_abstract], [i[0] for i in top_title], [i[0] for i in top_author]))
df_statistics = pd.DataFrame(zipped_list, columns = ['top10_terms_in_abstracts' , 'top10_terms_in_titles', 'top10_authors'])
df_statistics

| | top10_terms_in_abstracts | top10_terms_in_titles | top10_authors |
|---|---|---|---|
| 0 | learning | learning | Jakob H. Macke |
| 1 | data | neural | Tomer Koren |
| 2 | model | models | Dale Schuurmans |
| 3 | algorithm | optimization | Daniel D. Lee |
| 4 | show | networks | Huan Xu |
| 5 | problem | deep | Michael I. Jordan |
| 6 | approach | inference | Pradeep K. Ravikumar |
| 7 | models | data | Remi Munos |
| 8 | algorithms | reinforcement | Alan A. Stocker |
| 9 | methods | estimation | Alexander T. Ihler |


Finally, we use the `pandas` [function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv) `to_csv()` (The `pandas` Project, 2019) to write our final CSV file with all the required statistics.

In [58]:
df_statistics.to_csv("output/stats.csv", index=False)
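
For readers who want to check the resulting file layout without `pandas`, the same three-column structure can be produced with the standard library's `csv` module. This is only a sketch: `write_stats` is a hypothetical helper, it writes to an in-memory buffer, and it uses just the first two rows of the table above.

```python
import csv
import io


def write_stats(columns, handle):
    """Write a dict {column_name: [values]} as CSV, one column per key."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(columns.keys())            # header row
    writer.writerows(zip(*columns.values()))   # transpose the columns into rows


columns = {
    'top10_terms_in_abstracts': ['learning', 'data'],
    'top10_terms_in_titles': ['learning', 'neural'],
    'top10_authors': ['Jakob H. Macke', 'Tomer Koren'],
}
buffer = io.StringIO()
write_stats(columns, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

As with `index=False` in the `to_csv()` call, no row index column is written.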

# 5. Summary


This assessment measured the understanding of text file preprocessing techniques, file handling, PDF data extraction, and the generation of numerical representations of texts (for instance, sparse representations) in the Python programming language. The main outcomes achieved while applying these techniques were:

- **Download PDF files from URLs and convert them into strings**. Using libraries such as pdfminer and requests, we downloaded 200 PDF files and converted them to text that populated a dictionary with the structure `Paper_ID: Extracted_text`. After this step was completed, we used regex to break this dictionary into 4 different dictionaries: one for authors, one for titles, one for abstracts and one for bodies. This was useful, especially because the first part of the assignment was exclusively related to the bodies of the papers.
- **Text Pre-Processing**. A considerable number of tasks were done in order to achieve the proper format to proceed to the modelling tasks. The biggest challenge faced in this part of the assignment was not any individual task, but understanding the correct order in which to apply the steps. We decided to group all activities related to word/token removal together, after the processes of segmentation, tokenization and bigram generation. The last step was to stem the tokens.
- **Generating Numerical Representations**. Three different documents were generated in this assignment. The first contains a numerical index for each token found in the corpus. The second is a sparse representation, where the count of each index in each paper can be found. The third contains ranking statistics about the corpus. The `FreqDist` tool from the NLTK library was especially useful for generating numerical forms from text.

# 6. References
- Haseeb M. (2018, November 17). *Parsing a PDF via URL with Python using pdfminer* [Response to]. Retrieved from https://stackoverflow.com/questions/22800100/parsing-a-pdf-via-url-with-python-using-pdfminer

- Haseeb M. (2018, October 14). *Pdfminer python 3.5* [Response to]. Retrieved from https://stackoverflow.com/questions/39854841/pdfminer-python-3-5

- Aryan A. (2017, April 17). *Downloading Files from URLs in Python*. Retrieved from https://www.codementor.io/aviaryan/downloading-files-from-urls-in-python-77q3bs0un

- Guru99. (Accessed on 2019, September 09). *Python Regex: re.match(), re.search(), re.findall() with Example*. Retrieved from https://www.guru99.com/python-regular-expressions-complete-tutorial.html

- Yang Yang (2018, June 19). *The Secret World of Newline Characters*. Retrieved from https://www.enigma.com/blog/the-secret-world-of-newline-characters

- Cambridge University Press (2009, April 07). *Dropping common terms: stop words*. Retrieved from http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html

- Cambridge University Press (2009, April 07). *Stemming and lemmatization*. Retrieved from http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

- World Intellectual Property Organization (2016). *Python Pandas : How to convert lists to a dataframe*. Retrieved from http://www.wipo.int/export/sites/www/classifications/ipc/en/guide/guide_ipc.pdf

- Varun (2018, September 25). *Python Pandas : How to convert lists to a dataframe*. Retrieved from https://thispointer.com/python-pandas-how-to-convert-lists-to-a-dataframe/

- Python Software Foundation. (2019). *itertools: Functions creating iterators for efficient looping*. Retrieved from https://docs.python.org/2/library/itertools.html

- NLTK Project. (2019). *NLTK 3.4.5 documentation: nltk.tokenize module*. Retrieved from http://www.nltk.org/api/nltk.tokenize.html

- The pandas Project. (2019). *pandas 0.25.1 documentation: pandas.DataFrame.to_csv*. Retrieved from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv

- mgilson (2013, January 22). *Sort a list of tuples by second value, reverse=True and then by key, reverse=False* [Response to]. Retrieved from https://stackoverflow.com/questions/14466068/sort-a-list-of-tuples-by-second-value-reverse-true-and-then-by-key-reverse-fal