
#### Student Name: Rajath Akshay Vanikul
#### Student ID: 29498724

Date: 14/03/2019

Version: 2.0

Environment: Python 3.6.4 and Anaconda 5.7.6 (64-bit)

Libraries used: 
* re 2.2.1 (for regular expressions, included in Anaconda Python 3.6) 
* nltk 3.2.2 (Natural Language Toolkit, included in Anaconda Python 3.6)
* itertools 2.3 (iterator building blocks, included in Anaconda Python 3.6)
* nltk.collocations (for finding bigrams, included in Anaconda Python 3.6)
* nltk.tokenize (for tokenization, included in Anaconda Python 3.6)


## 1. Introduction
The task is to extract data from a PDF file into a proper format. The PDF file contains a table in which each row holds the information about one unit: unit code, synopsis, and outcomes. The information for each unit has to be extracted and transformed into a vector space model.

The methodology used to perform this task is as follows:

1. Extract the information from the PDF file to a text file. (Used https://pdftotext.com/) 
2. Normalise the text to lowercase, except for capitalised tokens that appear in the middle of a sentence/line.
3. Perform word tokenization using the regular expression `\w+(?:[-']\w+)?`
4. Remove the context-independent and context-dependent stop words from the vocab. The stop words file is provided (i.e., stopwords_en.txt).
5. Remove tokens with a length of less than 3 from the vocab.
6. Determine the first 200 meaningful bigrams (i.e., collocations) using the PMI measure and include them in the vocab. This is done after the removal of stopwords so that unnecessary bigrams of stopwords are eliminated.
7. Stem the tokens using the Porter stemmer. Stemming must be performed after generating bigrams to ensure the least loss of information.
8. Remove rare tokens (with the threshold set to 5%) from the vocab.
9. Find the set of all unique vocab, sort it in alphabetical order and index it.
10. Process each document (each unit observation) and create a sparse matrix.


More details for each task will be given in the following sections.

## 2.  Import libraries

We import the regular expression module, nltk and itertools to perform the tasks.

In [1]:
import re
import nltk
import itertools
from nltk.collocations import *
from itertools import chain
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from nltk import FreqDist

## 3. Converting and loading the data

I used an online platform to convert my PDF file to a text file. I uploaded the PDF to [https://pdftotext.com/], which converted the file to text, and downloaded the result as `pdftotext.txt`.
We will be using this text file for the rest of the tasks.

The following code opens the text file and reads its first lines:

In [2]:
pdf_data = open("pdftotext.txt", 'r') # open the file; pdf_data is a file object that yields one line at a time.

# reading the first 30 lines of the file
[next(pdf_data) for x in range(30)]

['Title\n',
 '\n',
 'Synopsis\n',
 '\n',
 'Outcomes\n',
 '\n',
 'ATS3221\n',
 '\n',
 'In this unit students consider the central production,\n',
 'consumption and policy contexts of popular music.\n',
 'The unit examines how popular music remains a\n',
 'significant media and cultural industry in the\n',
 'production of content and meaning. It assesses the\n',
 'core music-media output across print, broadcasting,\n',
 'mobile media, film, internet and related media\n',
 'industries. The unit also looks at how government\n',
 'policy shapes music production and consumption, and\n',
 'how local music-making and listening is shaped by\n',
 'global media practices. This includes examination of\n',
 'key debates about music-media technologies,\n',
 'intellectual property frameworks and the impact of\n',
 'music across different media content.\n',
 '\n',
 "['discuss key media studies, popular music and\n",
 'cultural studies theories associated with popular\n',
 "music activity;', 'assess ho

## 4. Extracting the data from text file

Inspecting the file, we can notice the title, synopsis and outcome pattern.
The title, synopsis and outcome of a unit appear in sequence, separated by lines containing only `\n`. Looking carefully at the complete data set, a unit title appears after every 3 `\n` lines. We see the same pattern with the synopsis and the outcome, each of which also recurs after every 3 `\n` lines.

We perform the following steps:
1. Extract all the unit titles in the text using a pattern.
2. Extract the synopsis and outcome using the same logic and pattern recognition.
3. Normalise the first letter of each sentence in the synopsis and outcome. 
4. Combine the synopsis list and the outcome list to form the text value for each unit title.
5. Create a dictionary with the unit title as key and the combined text as value.

I have written code to capture all the titles, synopses and outcomes in three separate lists.
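
The blank-line pattern described above can also be sketched with `itertools.groupby`. This is a minimal illustration rather than the cells used below: the helper name `split_blocks`, the sample lines and the unit code `ABC1234` are made up, and it assumes that blocks are separated by blank lines and cycle through title, synopsis and outcomes after three header labels.

```python
from itertools import groupby

def split_blocks(lines):
    """Group consecutive non-blank lines into blocks, joined with single spaces."""
    blocks = []
    for is_blank, group in groupby(lines, key=lambda ln: ln.strip() == ""):
        if not is_blank:
            blocks.append(" ".join(ln.strip() for ln in group))
    return blocks

# Hypothetical sample that mimics the layout of the converted text file.
sample = [
    "Title\n", "\n", "Synopsis\n", "\n", "Outcomes\n", "\n",
    "ABC1234\n", "\n",
    "First line of the synopsis,\n", "second line.\n", "\n",
    "['outcome one;', 'outcome\n", "two.']\n", "\n",
]

blocks = split_blocks(sample)[3:]  # drop the three header labels
titles, synopses, outcomes = blocks[0::3], blocks[1::3], blocks[2::3]
```

Joining the lines of a block with a space is a design choice of this sketch: it keeps words from adjacent lines from fusing together, at the cost of assuming that no word is split across a line break.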

### 4.1 Extracting unit titles

As noted above, a unit title appears after every 3 `\n` lines. I have leveraged this and written code to capture all the unit codes.

Below is the code to extract all the unit titles in the text.

In [3]:
pdf_data = open("pdftotext.txt", 'r') # open the text file for line-by-line reading.

# compile the regex pattern used to recognise a unit title.
reg = re.compile(r"([A-Z]+[0-9]+)")
# initialise a title list to hold the values.
title_lst=[]

# initialise a flag that marks when a unit title is expected.
line_check = False

# initialise a count to keep track of "\n" lines.
count = 0

# iterate through the complete text.
for line in pdf_data:
    
    # search for the unit title pattern (group 1 is None when nothing matches).
    match_line = re.search(r"([A-Z]+[0-9]+)|$",line)[1]
    
    # count the number of "\n" lines.
    if line == '\n':
        count += 1  
    
    # when the count is a multiple of 3, turn the flag on.
    if count%3 == 0:
        line_check = True
    
    # when the flag is on and a unit title is found, append the result to the list.
    if line_check and reg.match(str(match_line)):
        title_lst.append(match_line)
        line_check = False # switch the flag off.

# print the number of unit titles captured.        
print("length of the list:",len(title_lst))

# show a sample of the list of unit titles.
title_lst[:6]

length of the list: 200


['ATS3221', 'APG5401', 'DIS2907', 'BFF3331', 'AMU1326', 'RAD4501']

### 4.2 Extracting unit synopsis

A similar procedure based on counting the `\n` lines is used to obtain the synopsis. This time we check for a count that leaves a remainder of 1 when divided by 3 (`count%3 == 1`) and then append the elements to the final list.
The code is as below:

In [4]:
pdf_data = open("pdftotext.txt", 'r') # open the text file for line-by-line reading.

syn_lst=[] # initial list, still containing "\n" entries.
syn_final=[] # final wrangled list

# initialise a flag that marks when a line should be appended.
line_check = False

# initialise a count to keep track of "\n" lines.
count = 0

# variable to store the previous line.
prev_line = ""

# iterate through the complete text.
for line in pdf_data:
    # count the number of "\n" lines.
    if line == '\n':
        count += 1
        
    # when the count leaves a remainder of 1 on division by 3, turn the flag on.
    if count%3 == 1:
        line_check = True
    
    # if the line is not "\n", join it to the previous text (used to concatenate and store paragraphs)
    if line != '\n':
        line = ''.join([prev_line, line])
    
    # if the flag is on, append the line to a list
    if line_check:
        syn_lst.append(line)
        line_check = False # switch off the flag
    prev_line = line # store the current line.

# iterate through the list, skipping the first 3 entries that come from the header labels
for i in syn_lst[3:]:
    # on a '\n' entry, strip the '\n' characters from the previous entry and keep it. 
    if i == '\n':
        last = re.sub(r'\n','',last)
        syn_final.append(last)
    last = i
# append the last element
syn_final.append(re.sub(r'\n','',i))

# print the number of synopses captured.        
print("length of the list:",len(syn_final))

# show a sample of the list of synopses.
syn_final[0]

length of the list: 200


'In this unit students consider the central production,consumption and policy contexts of popular music.The unit examines how popular music remains asignificant media and cultural industry in theproduction of content and meaning. It assesses thecore music-media output across print, broadcasting,mobile media, film, internet and related mediaindustries. The unit also looks at how governmentpolicy shapes music production and consumption, andhow local music-making and listening is shaped byglobal media practices. This includes examination ofkey debates about music-media technologies,intellectual property frameworks and the impact ofmusic across different media content.'

### 4.3 Extracting unit outcomes

A similar procedure based on counting the `\n` lines is used to obtain the outcome. This time we check for a count that leaves a remainder of 2 when divided by 3 (`count%3 == 2`) and then append the elements to the final list.
The code is as below:

In [5]:
pdf_data = open("pdftotext.txt", 'r') # open the text file for line-by-line reading.

outcome_lst=[] # initial list, still containing "\n" entries.
outcome_final=[] # final wrangled list

# initialise a flag that marks when a line should be appended.
line_check = False 

# initialise a count to keep track of "\n" lines.
count = 0

# variable to store the previous line.
prev_line = ""

# iterate through the complete text.
for line in pdf_data:
    # count the number of "\n" lines.
    if line == '\n':
        count += 1
        
    # when the count leaves a remainder of 2 on division by 3, turn the flag on.
    if count%3 == 2:
        line_check = True
        
    # if the line is not "\n", join it to the previous text (used to concatenate and store paragraphs)
    if line != '\n':
        line = ''.join([prev_line, line])
        
    # if the flag is on, append the line to a list
    if line_check:
        outcome_lst.append(line)
        line_check = False # switch off the flag
    prev_line = line # store the current line.

# iterate through the list, skipping the first 3 entries that come from the header labels
for i in outcome_lst[3:]:
    if i == '\n':
        # on a '\n' entry, strip the '\n' characters from the previous entry and keep it. 
        last = re.sub(r'\n','',last)
        outcome_final.append(last)
    last = i
# append the last element
outcome_final.append(re.sub(r'\n','',i))

# print the number of outcomes captured.        
print("length of the list:",len(outcome_final))

# show a sample of the list of outcomes.
outcome_final[0]

length of the list: 200


"['discuss key media studies, popular music andcultural studies theories associated with popularmusic activity;', 'assess how popular music operatesas part of local and global media industries;', 'criticallyand independently engage with key debates andissues within the popular music industries;', 'engagewith music industry stakeholders;', 'explain andanalyse course concepts and debates in written andoral forms, and undertake independent research.']"

### 4.4 Normalising the first letter of a sentence

I have combined the synopsis and outcome texts so that the first letter of each sentence can be normalised. The code below is intended to split the text on ".", "?" or "!" and convert the first letter of each sentence to lower case. 

In [6]:
# concatenating both lists into a single list called "text".
text = [syn_final[i] + outcome_final[i][1:-1] for i in range(len(outcome_final))]

# initialising an empty list to store the final normalised result. 
final_text=[]
# iterating through the text list with synopsis and outcome.
for i in text:
    # str.split treats ".|?|!" as one literal separator (not a regex),
    # so "i" normally comes back as a single-element list.
    j = i.split(".|?|!")
    temp=[]
    
    # iterating through the list of pieces
    for k in j:
        # str.replace lowercases every occurrence of the first character of k.
        temp.append(k.replace(k[0],k[0].lower()))
    
    # appends the complete line with multiple sentences to the final list.
    final_text.append(".".join(temp))
    
# printing a sample output
final_text[0]

"in this unit students consider the central production,consumption and policy contexts of popular music.The unit examines how popular music remains asignificant media and cultural industry in theproduction of content and meaning. it assesses thecore music-media output across print, broadcasting,mobile media, film, internet and related mediaindustries. The unit also looks at how governmentpolicy shapes music production and consumption, andhow local music-making and listening is shaped byglobal media practices. This includes examination ofkey debates about music-media technologies,intellectual property frameworks and the impact ofmusic across different media content.'discuss key media studies, popular music andcultural studies theories associated with popularmusic activity;', 'assess how popular music operatesas part of local and global media industries;', 'criticallyand independently engage with key debates andissues within the popular music industries;', 'engagewith music industry stak

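Splitting on several sentence delimiters needs a regular expression, because `str.split` only accepts a literal separator. The following is a minimal sketch of one regex-based way to lowercase sentence-initial letters; the function name `lowercase_sentence_starts` is hypothetical and this is not the cell that produced the output above. It assumes that a sentence starts at the beginning of the text or directly after ".", "?" or "!" (with optional whitespace).

```python
import re

def lowercase_sentence_starts(text):
    """Lowercase a capital letter at the start of the text or right after . ? !"""
    return re.sub(
        r"(^|[.?!]\s*)([A-Z])",
        lambda m: m.group(1) + m.group(2).lower(),
        text,
    )

normalised = lowercase_sentence_starts("In this unit. It uses Python.The end")
```

Capitalised tokens in the middle of a sentence (`Python` in the example) are left untouched. A limitation of this sketch is that a capital letter following an abbreviation such as "e.g." would also be lowercased.
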
### 4.5 Final Unit dictionary

We use the `zip` function to pair the two lists and build a dictionary from them.
A dictionary keeps only one entry per key, so converting to a dictionary drops the duplicate unit codes. Thus we are left with information for `194` units.

In [7]:
# creating a dictionary with unique keys using the zip function on 2 lists
Units_pdf = dict(zip(title_lst,text))

# length of the dictionary
print("length of the dict:",len(Units_pdf))

# sample output of dict values.
next(iter(Units_pdf.values()))

length of the dict: 194


"In this unit students consider the central production,consumption and policy contexts of popular music.The unit examines how popular music remains asignificant media and cultural industry in theproduction of content and meaning. It assesses thecore music-media output across print, broadcasting,mobile media, film, internet and related mediaindustries. The unit also looks at how governmentpolicy shapes music production and consumption, andhow local music-making and listening is shaped byglobal media practices. This includes examination ofkey debates about music-media technologies,intellectual property frameworks and the impact ofmusic across different media content.'discuss key media studies, popular music andcultural studies theories associated with popularmusic activity;', 'assess how popular music operatesas part of local and global media industries;', 'criticallyand independently engage with key debates andissues within the popular music industries;', 'engagewith music industry stak

## 5. Word tokenization

The tokenization function tokenizes all the words in a dictionary value. Its argument is the unit title, which is used as the key to look up the corresponding text.
The regular expression for tokenising is given in the specification, `\w+(?:[-']\w+)?`, and the same expression is used in the function.

The function returns a tuple of the unit title and its list of tokens.
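
To see what this regular expression captures, here is a small sketch that applies it with `re.findall` from the standard library. The sample sentence is made up for illustration.

```python
import re

# Pattern from the specification: a run of word characters, optionally
# followed by one hyphen or apostrophe and another run of word characters.
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

sample = "The unit's music-media output, across print."
tokens = re.findall(TOKEN_PATTERN, sample)
```

Hyphenated words and words with an apostrophe stay together as single tokens, while punctuation such as commas and full stops is dropped.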

In [8]:
vocab_tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
# define the function to tokenise.
def tokenize_words(title):
    tokenized_words = vocab_tokenizer.tokenize(Units_pdf[title])
    return (title, tokenized_words) # return a tuple of unit title and a list of tokens

# call the function to tokenise the words and create a dictionary
text_tokenized = dict(tokenize_words(title) for title in Units_pdf.keys())

# length of the dictionary
print("length of the dict:",len(text_tokenized))

# sample output of dict values.
next(iter(text_tokenized.values()))

length of the dict: 194


['In',
 'this',
 'unit',
 'students',
 'consider',
 'the',
 'central',
 'production',
 'consumption',
 'and',
 'policy',
 'contexts',
 'of',
 'popular',
 'music',
 'The',
 'unit',
 'examines',
 'how',
 'popular',
 'music',
 'remains',
 'asignificant',
 'media',
 'and',
 'cultural',
 'industry',
 'in',
 'theproduction',
 'of',
 'content',
 'and',
 'meaning',
 'It',
 'assesses',
 'thecore',
 'music-media',
 'output',
 'across',
 'print',
 'broadcasting',
 'mobile',
 'media',
 'film',
 'internet',
 'and',
 'related',
 'mediaindustries',
 'The',
 'unit',
 'also',
 'looks',
 'at',
 'how',
 'governmentpolicy',
 'shapes',
 'music',
 'production',
 'and',
 'consumption',
 'andhow',
 'local',
 'music-making',
 'and',
 'listening',
 'is',
 'shaped',
 'byglobal',
 'media',
 'practices',
 'This',
 'includes',
 'examination',
 'ofkey',
 'debates',
 'about',
 'music-media',
 'technologies',
 'intellectual',
 'property',
 'frameworks',
 'and',
 'the',
 'impact',
 'ofmusic',
 'across',
 'different',
 

#### Initial vocab count
We compute the ratio of the total number of words (tokens) to the number of unique words, which is reported here as lexical diversity. It is the average number of times each unique word is used.
This is an effective way to check our output.

In [9]:
words = list(chain.from_iterable(text_tokenized.values())) # all the words in the dictionary
vocab = set(words) # set of unique words.
lexical_diversity = len(words)/len(vocab) # ratio of the total number of words (tokens) to the number of unique words
print ("Vocabulary size: ",len(vocab),
       "\nTotal number of tokens: ", len(words),
       "\nLexical diversity: ", lexical_diversity) 

Vocabulary size:  7074 
Total number of tokens:  28484 
Lexical diversity:  4.026576194515126
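
Since the same three statistics are reported after every stage, they could be wrapped in a small helper. The sketch below is only an illustration: the function name `vocab_stats` and the sample dictionary are hypothetical, and the ratio follows the definition used in this notebook (tokens divided by unique words).

```python
from itertools import chain

def vocab_stats(tokenized):
    """Return (vocabulary size, total tokens, tokens per unique word)."""
    words = list(chain.from_iterable(tokenized.values()))
    vocab = set(words)
    return len(vocab), len(words), len(words) / len(vocab)

# Hypothetical two-document example.
sample = {"U1": ["media", "music", "music"], "U2": ["music", "policy"]}
size, total, ratio = vocab_stats(sample)
```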


## 6. Removal of stopwords and tokens with less than 3 characters

We are given a text document with all the stopwords that need to be removed.
1. Read the file into a variable.
2. Create a list and append the contents of the stopword file to it.
3. Use the set of stopwords to check each text token in our dictionary.
4. Remove all the tokens with less than 3 characters.
5. Store the result in the same dictionary.

In [10]:
# read the text file into a variable.
stpwords = open('stopwords_en.txt', 'r')

# append each line of the file to a list of stopwords without the "\n" character.
stpwords_lst = []
for line in stpwords:
    stpwords_lst.append(re.sub(r'\n','',line))

# sample output of the list of stop words.    
stpwords_lst[:5]

['a', "a's", 'able', 'about', 'above']

In [11]:
# make a set of all stopwords without repetition.
stopwords_set = set(stpwords_lst)

# iterate through our unit dictionary and keep only the tokens which are not stop words. 
for i,j in text_tokenized.items():
    text_tokenized[i] = [each for each in j if each not in stopwords_set]

In [12]:
# removing all the tokens with a length of less than 3.  
for i,j in text_tokenized.items():
    for k in j:
        if len(k) < 3:
            j.remove(k)
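
One thing to be aware of with this pattern: removing items from a list while iterating over it shifts the remaining items, so the element that follows a removed one is skipped. The sketch below, using a made-up token list, contrasts the in-place loop with a list comprehension that checks every token. It is an illustration only and was not used to produce the statistics reported in this notebook.

```python
# Hypothetical token list with two adjacent short tokens.
tokens = ["is", "at", "media", "of"]

# In-place removal while iterating: "at" moves into the slot that was
# just visited and is never checked.
in_place = list(tokens)
for tok in in_place:
    if len(tok) < 3:
        in_place.remove(tok)

# A list comprehension builds a new list and checks every token.
filtered = [tok for tok in tokens if len(tok) >= 3]
```

Here `in_place` still contains `"at"`, while `filtered` keeps only `"media"`.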

#### Vocab after removal of stop words and tokens less than 3
We compute the ratio of the total number of words (tokens) to the number of unique words, which is reported here as lexical diversity. This is an effective way to check our output.

In [13]:
words = list(chain.from_iterable(text_tokenized.values())) # all the words in the dictionary
vocab = set(words) # set of unique words.
lexical_diversity = len(words)/len(vocab) # ratio of the total number of words (tokens) to the number of unique words
print ("Vocabulary size: ",len(vocab),
       "\nTotal number of tokens: ", len(words),
       "\nLexical diversity: ", lexical_diversity) 

Vocabulary size:  6820 
Total number of tokens:  18636 
Lexical diversity:  2.7325513196480937


## 7. Generating Bigrams

Bigrams are pairs of consecutive words, which is why we need to determine them after removing stopwords.
If we performed this step before stop word removal, we would end up with a lot of stop word bigrams like "of_the". Thus, we remove the stop words and then generate the bigrams.

We perform the following:
1. Use a function from itertools to chain all the values in the dictionary.
2. Use NLTK collocations to determine 200 bigrams.
3. Use the MWE tokeniser to add the bigrams to the existing vocab.

The code is as below:

In [14]:
# using a function from itertools, chain all the values in the dictionary.
all_words = list(chain.from_iterable(text_tokenized.values()))

# printing the length of the list
len(all_words)

18636

In [15]:
# using the nltk library to find bigrams.
bigram_measures = nltk.collocations.BigramAssocMeasures() # creating a bigram association measures instance
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # Top-200 bigrams

#printing the length of the list
len(top_200_bigrams)

200
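
To make the PMI ranking more concrete, here is a minimal standard-library sketch of pointwise mutual information for bigrams, computed as log2 of count(x, y) * N / (count(x) * count(y)) with N the number of tokens. The function name `pmi_bigrams` and the toy token list are hypothetical, and NLTK's own implementation may differ in detail.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, n):
    """Return the n bigrams with the highest PMI (ties broken alphabetically)."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def pmi(pair):
        x, y = pair
        return math.log2(bigram[pair] * total / (unigram[x] * unigram[y]))

    return sorted(bigram, key=lambda pair: (-pmi(pair), pair))[:n]

# Hypothetical token stream.
toy = ["popular", "music", "media", "popular", "music",
       "industry", "media", "media"]
top_2 = pmi_bigrams(toy, 2)
```

PMI rewards pairs whose words rarely occur apart, so a pair seen only once can score as high as a frequent one. That is one reason why the ranking tends to favour rare word pairs.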

### 7.1 Retokenise the vocab

Now that we have determined the top 200 bigrams, we need to add them and retokenise our vocab list. We use the MWE tokeniser so that each of these bigrams is kept as a single token. 

In [16]:
# using MWE tokeniser to compare and append the generated bigrams into the original vocab dictionary.
mwetokenizer = MWETokenizer(top_200_bigrams)
units_dict =  dict((title, mwetokenizer.tokenize(text)) for title,text in text_tokenized.items())

#printing the length of the dict
len(units_dict)

194
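
The effect of this retokenisation can be sketched without NLTK as follows. The helper `merge_bigrams` is hypothetical and only mimics the idea of joining a known bigram into one token with an underscore, which is the default separator of `MWETokenizer`.

```python
def merge_bigrams(tokens, bigrams, sep="_"):
    """Join adjacent tokens that form a known bigram into a single token."""
    known = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in known:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # both words of the bigram are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical example.
merged = merge_bigrams(["popular", "music", "industry"], [("popular", "music")])
```

Each merged bigram replaces two tokens with one, so the total token count drops by one for every bigram occurrence in the text.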

#### Vocab stat after generating 200 bigrams and retokenising
We compute the ratio of the total number of words (tokens) to the number of unique words, which is reported here as lexical diversity. This is an effective way to check our output.

In [17]:
words = list(chain.from_iterable(units_dict.values())) # all the words in the dictionary
vocab = set(words) # set of unique words.
lexical_diversity = len(words)/len(vocab) # ratio of the total number of words (tokens) to the number of unique words
print ("Vocabulary size: ",len(vocab),
       "\nTotal number of tokens: ", len(words),
       "\nLexical diversity: ", lexical_diversity) 

Vocabulary size:  6653 
Total number of tokens:  18469 
Lexical diversity:  2.776040883811814


## 8. Stemming the tokens using Porter Stemmer

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.
Stemming limits the look-up. For our task, only lower-case tokens are meant to be stemmed, so that we do not lose the information carried by words with capital letters.

We have been asked to use the Porter stemmer to stem our vocab data and generate more meaningful vocab data. We use the `PorterStemmer` class from the nltk library to perform this task.


In [18]:
# using the function PorterStemmer from nltk.stem
stemmer = PorterStemmer()
for k,v in units_dict.items(): # iterating through the dictionary
    units_dict[k] = [stemmer.stem(token) if re.search('([a-z]+[_][a-z]+)',token)==False & token.islower() else token for token in v]

#printing the length of the dict
len(units_dict)

194
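
A note on the condition used in this cell: in Python, `&` binds more tightly than `==`, so `a == False & b` is evaluated as `a == (False & b)`. The sketch below shows one way to express the intended test (stem only lower-case tokens that are not underscore-joined bigrams) with `is None` and `and`. The helper name `stem_unigrams` and the toy stemmer are hypothetical. In this notebook the `stem` argument would be `PorterStemmer().stem`.

```python
import re

BIGRAM = re.compile(r"[a-z]+_[a-z]+")

def stem_unigrams(tokens, stem):
    """Apply `stem` to lower-case tokens that are not underscore-joined bigrams."""
    return [
        stem(tok) if tok.islower() and BIGRAM.search(tok) is None else tok
        for tok in tokens
    ]

# Toy stand-in for a real stemmer: strip one trailing "s".
def toy_stem(word):
    return word[:-1] if word.endswith("s") else word

stemmed = stem_unigrams(["debates", "popular_music", "ATS3221"], toy_stem)
```

Applying a condition like this would change the vocabulary statistics reported in the following sections, so it is shown here as a sketch only.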

#### Vocab stat after stemming the tokens
We compute the ratio of the total number of words (tokens) to the number of unique words, which is reported here as lexical diversity. This is an effective way to check our output.

We can notice that the vocab count has not changed. With the cell above as written, this is expected: `&` binds more tightly than `==`, so the result of `re.search` is compared with `False & token.islower()`, which never evaluates to true, and every token is kept as it is rather than being stemmed.

In [19]:
words = list(chain.from_iterable(units_dict.values())) # all the words in the dictionary
vocab = set(words) # set of unique words.
lexical_diversity = len(words)/len(vocab) # ratio of the total number of words (tokens) to the number of unique words
print ("Vocabulary size: ",len(vocab),
       "\nTotal number of tokens: ", len(words),
       "\nLexical diversity: ", lexical_diversity) 

Vocabulary size:  6653 
Total number of tokens:  18469 
Lexical diversity:  2.776040883811814


## 9. Removal of rare and frequent tokens

We need to remove rare tokens, defined as words that occur in fewer than 5% of the documents/observations. This is essential to limit our look-up to significant vocab only. The code below also removes very frequent tokens, namely those found in more than 190 documents.

We perform this step at the end so that we do not miss out on potential bigrams earlier in the task. Removing the rare/frequent tokens at the beginning would reduce the chances of finding potential bigrams.

In [20]:
# calculating the document frequency of each token using FreqDist
# (each token is counted once per unit).  
unique=[]
for i,j in units_dict.items():
    unique+=list(set(j))
freq_words = FreqDist(unique)    

# collecting all the tokens which are found in less than 5% of the documents
rare_tokens=[]   
for i,j in freq_words.items():
    # word occurs in fewer than 10 documents
    if j < 10:
        rare_tokens.append(i)
# collecting all the very frequent tokens
freq_tokens=[]   
for i,j in freq_words.items():
    # word occurs in more than 190 documents
    if j>190:
        freq_tokens.append(i)

# create a list of rare and frequent tokens
threshold = rare_tokens + freq_tokens

# removing both the rare and frequent tokens from vocab.
for i,j in units_dict.items():
    units_dict[i] = [each for each in j if each not in threshold]
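
The same idea can be sketched with thresholds expressed as ratios of the number of documents instead of fixed counts. The function name `filter_by_doc_freq` and its default ratios are assumptions made for illustration. With 194 documents, a lower ratio of 0.05 gives the same cut-off as `j < 10`, whereas an upper ratio of 0.95 would correspond to about 184 documents rather than the 190 used above.

```python
from collections import Counter

def filter_by_doc_freq(docs, min_ratio=0.05, max_ratio=0.95):
    """Keep tokens whose document frequency lies within the given ratios."""
    n_docs = len(docs)
    # Count each token once per document.
    doc_freq = Counter(tok for toks in docs.values() for tok in set(toks))
    keep = {
        tok for tok, df in doc_freq.items()
        if min_ratio * n_docs <= df <= max_ratio * n_docs
    }
    return {unit: [t for t in toks if t in keep] for unit, toks in docs.items()}

# Hypothetical four-document example with wider thresholds.
toy_docs = {"A": ["x", "y"], "B": ["x", "z"], "C": ["x", "y"], "D": ["x", "y"]}
filtered_docs = filter_by_doc_freq(toy_docs, min_ratio=0.5, max_ratio=0.9)
```

Using a set for `keep` also makes each membership test a constant-time lookup, which helps when the vocabulary is large.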

#### Vocab stat after removing the rare and frequent tokens
We compute the ratio of the total number of words (tokens) to the number of unique words, which is reported here as lexical diversity. 

This is an effective way to check our output.

In [21]:
words = list(chain.from_iterable(units_dict.values())) # all the words in the dictionary
vocab = set(words) # set of unique words.
lexical_diversity = len(words)/len(vocab) # ratio of the total number of words (tokens) to the number of unique words
print ("Vocabulary size: ",len(vocab),
       "\nTotal number of tokens: ", len(words),
       "\nLexical diversity: ", lexical_diversity) 

Vocabulary size:  233 
Total number of tokens:  6726 
Lexical diversity:  28.86695278969957


## 10. Vectorising and building a sparse vector

We will create a file `29498724_vocab.txt` to write the set of vocab with their index.
1. Sort the set of unique vocab that we obtained in the end.
2. Create a file with write permission and write these words with their index.

We will also need to create a file `29498724_countVec.txt` to write the sparse matrix.
1. For each unit, map its tokens to their vocab indices, count them, and write the unit code followed by the `index:count` pairs.


In [22]:
# create a file with write permission
output_dict = open("29498724_vocab.txt", 'w') 
# sorting the set into an alphabetically ordered list.
vocab = sorted(vocab)

# indexing every token by its position in the sorted list, so that the
# indices written here match the ones used for the count vectors.
vocab_dict = {}
for i, w in enumerate(vocab):
    vocab_dict[w] = i

# Writing it to file 
# iterate through the sorted vocab.
for w in vocab:
    output_dict.write("{}:{}".format(w, vocab_dict[w]))
    output_dict.write('\n')

# close the file
output_dict.close()

In [23]:
#Building a Sparse Matrix
output_vector = open("29498724_countVec.txt", 'w')
#Vectorizing
for i,j in units_dict.items():
    output_vector.write(i +',')
    temp = [vocab_dict[w] for w in j]    
    for k, v in FreqDist(temp).items(): 
        output_vector.write("{}:{},".format(k,v))
    output_vector.write('\n\n')
# close the file    
output_vector.close()
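
The mapping from tokens to `index:count` pairs can be illustrated in memory, without writing files. The function name `count_vector_lines`, the sample units and the exact line layout (no trailing comma, pairs sorted by index) are assumptions of this sketch and not the required output format.

```python
from collections import Counter

def count_vector_lines(docs):
    """Build a sorted vocab index and one 'unit,index:count,...' line per unit."""
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    lines = []
    for unit, toks in docs.items():
        counts = Counter(index[tok] for tok in toks)
        pairs = ",".join("{}:{}".format(i, c) for i, c in sorted(counts.items()))
        lines.append("{},{}".format(unit, pairs))
    return index, lines

# Hypothetical two-unit example.
index, lines = count_vector_lines({"U1": ["media", "music", "music"],
                                   "U2": ["music"]})
```

Only the indices of tokens that actually occur in a unit are written, which is what makes the representation sparse.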

## 11. Summary

The following methodology was performed in this task:

1. Extract the information from the PDF file to a text file. (Used https://pdftotext.com/)
2. Normalise the text to lowercase, except for capitalised tokens that appear in the middle of a sentence/line.
3. Perform word tokenization using the regular expression `\w+(?:[-']\w+)?`
4. Remove the context-independent and context-dependent stop words from the vocab. The stop words file is provided (i.e., stopwords_en.txt).
5. Determine the first 200 meaningful bigrams (i.e., collocations) using the PMI measure and include them in the vocab. This is done after the removal of stopwords so that unnecessary bigrams of stopwords are eliminated.
6. Stem the tokens using the Porter stemmer. Stemming must be performed after generating bigrams to ensure the least loss of information.
7. Remove rare tokens (with the threshold set to 5%) from the vocab.
8. Find the set of all unique vocab, sort it in alphabetical order and index it.
9. Process each document (each unit observation) and create a sparse matrix.

We can also compare the vocab statistics and see the changes in vocab during each stage:

Initial vocab stat:

Vocabulary size:  7074 <br/>
Total number of tokens:  28484 <br/>
Lexical diversity:  4.026576194515126<br/>


Vocab stat after removing stopwords:

Vocabulary size:  6820 <br/>
Total number of tokens:  18636 <br/>
Lexical diversity:  2.7325513196480937<br/>


Vocab stat after generating 200 bigrams and re-tokenising:

Vocabulary size:  6653 <br/>
Total number of tokens:  18469 <br/>
Lexical diversity:  2.776040883811814<br/>

Vocab stat after stemming:

Vocabulary size:  6653 <br/>
Total number of tokens:  18469<br/>
Lexical diversity:  2.776040883811814<br/>


**Final vocab stat after removing rare tokens:**

**Vocabulary size:  233**<br/>
**Total number of tokens:  6726**<br/>
**Lexical diversity:  28.86695278969957**<br/>


## 12. References
* https://pdftotext.com/
* https://stackoverflow.com/questions/26320697/capitalization-of-each-sentence-in-a-string-in-python-3
* https://www.nltk.org/
* http://www.nltk.org/howto/collocations.html