# <center>Natural Language Processing Using NLTK (I)</center>

References:
 - http://www.nltk.org/book_1ed/
 - https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
 - https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
 - http://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization
 - https://web.stanford.edu/class/cs124/lec/Information_Extraction_and_Named_Entity_Recognition.pdf

## 1. NLTK installation
 1. Install the NLTK package using: pip install nltk 
 2. Open your Python editor (Jupyter Notebook, Spyder, etc.) and run the commands below. Select "all packages" to install the data included in NLTK, including corpora and books. The download may take a few minutes.

In [2]:
import nltk
nltk.download()

## 2. NLP Objectives and Basic Steps

 - Objectives:
   * Split documents into tokens or segments
   * Clean up tokens and annotate tokens
   * Extract features from tokens for further text mining tasks
 - Basic processing steps:
   * Tokenization: split documents into individual words or segments
   * Remove stop words and filter tokens
   * POS (part of speech) Tagging
   * Normalization: Stemming, Lemmatization
   * Named Entity Recognition (NER)
   * Term Frequency and Inverse Document Frequency (TF-IDF)
   * Create document-to-term matrix (bag of words)
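
To make the last step concrete, here is a minimal sketch of a document-to-term matrix built with the standard library only. The two documents, the stop-word set, and the variable names are hypothetical and chosen for illustration; they are not part of the exercises that follow.

```python
import re
from collections import Counter

# two hypothetical mini documents and a hypothetical stop-word set
docs = ["the film fails badly", "a good film with good performances"]
stop_words = {"the", "a", "with"}

# tokenize each document and drop stop words
tokenized = [[t for t in re.findall(r"\w+", d.lower()) if t not in stop_words]
             for d in docs]

# the shared, sorted vocabulary defines the matrix columns
vocabulary = sorted(set(t for doc in tokenized for t in doc))

# one row of term counts per document
matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term, so the second row records that "good" appears twice in the second document.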


In [21]:
# Exercise 2.1. Load the text for analysis
import re
import nltk

# load the document
with open('movie_reivew.txt', 'r') as f:
    lines=f.readlines()
    
# merge list into one string
text=" ".join(lines)

print(text)


`strange days' chronicles the last two days of 1999 in los angeles . 
 as the locals gear up for the new millenium , lenny nero ( ralph fiennes ) goes about his business of peddling erotic memory clips . 
 he pines for his ex-girlfriend , faith ( juliette lewis ) , not noticing that another friend , mace ( angela bassett ) really cares for him . 
 this film features good performances , impressive film-making technique and breath-taking crowd scenes . 
 director kathryn bigelow knows her stuff and does not hesitate to use it . 
 but as a whole , this is an unsatisfying movie . 
 the problem is that the writers , james cameron and jay cocks , were too ambitious , aiming for a film with social relevance , thrills , and drama . 
 not that ambitious film-making should be discouraged ; just that when it fails to achieve its goals , it fails badly and obviously . 
 the film just ends up preachy , unexciting and uninvolving . 



## 3. Tokenization
 - **Definition**: the process of breaking a stream of textual content up into words, terms, symbols, or some other meaningful elements called tokens.
    * Word (Unigram)
    * Bigram (Two consecutive words)
    * Trigram (Three consecutive words)
    * Sentence
 - Different methods exist:
    * Split by regular expression patterns
    * NLTK's word tokenizer
    * NLTK's regular expression tokenizer (customizable)
 - No single method is perfect for every tokenization task. 
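
Bigrams and trigrams can be derived from a list of unigrams. The sketch below assumes a hypothetical helper named `ngrams` and a made-up sample sentence; it uses plain Python rather than any particular NLTK call.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    # zip n shifted views of the list; zip stops at the shortest view
    return list(zip(*[tokens[i:] for i in range(n)]))

# hypothetical sample sentence, tokenized by whitespace for brevity
sample_tokens = "this is an unsatisfying movie".split()

bigrams = ngrams(sample_tokens, 2)
trigrams = ngrams(sample_tokens, 3)

print(bigrams)
print(trigrams)
```

NLTK also provides `nltk.bigrams` and `nltk.ngrams`, which return generators over the same kind of tuples.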

### 3.1. Unigram

In [27]:
# Exercise 3.1.1. Simply split the text by one or more non-word characters

# \W+: one or more non-word characters
tokens = re.split(r"\W+", text)   

# get the number of tokens
print(len(tokens))
print(tokens)

# Pros: no punctuation, just words
# Cons: hyphenated words such as breath-taking and film-making
# are each split into two tokens, and empty strings can appear
# at the start or end of the list

148
['', 'strange', 'days', 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', 'lenny', 'nero', 'ralph', 'fiennes', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', 'he', 'pines', 'for', 'his', 'ex', 'girlfriend', 'faith', 'juliette', 'lewis', 'not', 'noticing', 'that', 'another', 'friend', 'mace', 'angela', 'bassett', 'really', 'cares', 'for', 'him', 'this', 'film', 'features', 'good', 'performances', 'impressive', 'film', 'making', 'technique', 'and', 'breath', 'taking', 'crowd', 'scenes', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', 'but', 'as', 'a', 'whole', 'this', 'is', 'an', 'unsatisfying', 'movie', 'the', 'problem', 'is', 'that', 'the', 'writers', 'james', 'cameron', 'and', 'jay', 'cocks', 'were', 'too', 'ambitious', 'aiming', 'for', 'a', 'film', 'with', 'social', 'relevance', '
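
The empty string at the start of the list appears because the text begins with a non-word character. One way to avoid it, sketched here on a hypothetical one-line sample, is to match the words with `re.findall` instead of splitting on the separators.

```python
import re

# hypothetical one-line sample in the same style as the review text
sample = "`strange days' chronicles the last two days of 1999 . "

split_tokens = re.split(r"\W+", sample)     # splits on the separators
found_tokens = re.findall(r"\w+", sample)   # matches the words themselves

print(split_tokens)
print(found_tokens)
```

Here `split_tokens` starts and ends with an empty string, while `found_tokens` contains only the words.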

In [11]:
# Exercise 3.1.2. NLTK's word tokenizer

# break text down into words and punctuation marks

# invoke NLTK's word tokenizer
tokens = nltk.word_tokenize(text)    
print(len(tokens))
print(tokens)

# Pros: words are well tokenized, 
# e.g. breath-taking and film-making are each captured as one word
# Cons: punctuation still needs to be removed

172
['`strange', 'days', "'", 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', '.', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', ',', 'lenny', 'nero', '(', 'ralph', 'fiennes', ')', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', '.', 'he', 'pines', 'for', 'his', 'ex-girlfriend', ',', 'faith', '(', 'juliette', 'lewis', ')', ',', 'not', 'noticing', 'that', 'another', 'friend', ',', 'mace', '(', 'angela', 'bassett', ')', 'really', 'cares', 'for', 'him', '.', 'this', 'film', 'features', 'good', 'performances', ',', 'impressive', 'film-making', 'technique', 'and', 'breath-taking', 'crowd', 'scenes', '.', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', '.', 'but', 'as', 'a', 'whole', ',', 'this', 'is', 'an', 'unsatisfying', 'movie', '.', 'the', 'problem', 'is', 'that', 'the', 'writers', ',', 'james', 'cameron', 'and', 'jay', 'cocks', ',

In [18]:
# remove leading or trailing punctuation

import string

tokens=[token.strip(string.punctuation) for token in tokens]

# remove empty tokens
tokens=[token.strip() for token in tokens if token.strip()!='']
print(tokens)


['strange', 'days', 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', 'lenny', 'nero', 'ralph', 'fiennes', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', 'he', 'pines', 'for', 'his', 'ex-girlfriend', 'faith', 'juliette', 'lewis', 'not', 'noticing', 'that', 'another', 'friend', 'mace', 'angela', 'bassett', 'really', 'cares', 'for', 'him', 'this', 'film', 'features', 'good', 'performances', 'impressive', 'film-making', 'technique', 'and', 'breath-taking', 'crowd', 'scenes', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', 'but', 'as', 'a', 'whole', 'this', 'is', 'an', 'unsatisfying', 'movie', 'the', 'problem', 'is', 'that', 'the', 'writers', 'james', 'cameron', 'and', 'jay', 'cocks', 'were', 'too', 'ambitious', 'aiming', 'for', 'a', 'film', 'with', 'social', 'relevance', 'thrills', 'and', 

In [29]:
# Exercise 3.1.3. NLTK's regular expression tokenizer (customizable)

# The pattern can be customized to your needs

# a token starts with one or more word characters, 
# optionally followed by word characters, "-", or "." 
# e.g. film-making, L.A.

pattern=r'\w+[\w\-\.]*'                        


# call NLTK's regular expression tokenizer
tokens=nltk.regexp_tokenize(text, pattern)

print(len(tokens))
print(tokens)

142
['strange', 'days', 'chronicles', 'the', 'last', 'two', 'days', 'of', '1999', 'in', 'los', 'angeles', 'as', 'the', 'locals', 'gear', 'up', 'for', 'the', 'new', 'millenium', 'lenny', 'nero', 'ralph', 'fiennes', 'goes', 'about', 'his', 'business', 'of', 'peddling', 'erotic', 'memory', 'clips', 'he', 'pines', 'for', 'his', 'ex-girlfriend', 'faith', 'juliette', 'lewis', 'not', 'noticing', 'that', 'another', 'friend', 'mace', 'angela', 'bassett', 'really', 'cares', 'for', 'him', 'this', 'film', 'features', 'good', 'performances', 'impressive', 'film-making', 'technique', 'and', 'breath-taking', 'crowd', 'scenes', 'director', 'kathryn', 'bigelow', 'knows', 'her', 'stuff', 'and', 'does', 'not', 'hesitate', 'to', 'use', 'it', 'but', 'as', 'a', 'whole', 'this', 'is', 'an', 'unsatisfying', 'movie', 'the', 'problem', 'is', 'that', 'the', 'writers', 'james', 'cameron', 'and', 'jay', 'cocks', 'were', 'too', 'ambitious', 'aiming', 'for', 'a', 'film', 'with', 'social', 'relevance', 'thrills', 'an
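
To see how this pattern treats hyphens and periods, the sketch below applies it with `re.findall` to two hypothetical sentences. It assumes that `re.findall` yields the same tokens as `nltk.regexp_tokenize` for a pattern without capturing groups.

```python
import re

pattern = r'\w+[\w\-\.]*'

# hypothetical sample sentences, not taken from the review
spaced = "film-making in l.a. is breath-taking ."
unspaced = "a good movie."

print(re.findall(pattern, spaced))
print(re.findall(pattern, unspaced))
```

A standalone period is dropped because a token must start with a word character, but a period attached to a word stays with it, so the second sentence yields `movie.` rather than `movie`. The review text is unaffected because its punctuation is separated by spaces.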

### 3.2. Vocabulary 
 - Vocabulary: the set of unique tokens  
 - Dictionary: typically, the vocabulary of a text can be represented as a dictionary 
    * Key: word
    * Value: count of the word
 - Word counts show which words are used most frequently (often stop words)

In [25]:
# Exercise 3.2.1 
# Get vocabulary and dictionary of text

vocabulary= set(tokens)                                        
# set() converts a list to a set without any duplicates
print (vocabulary)

# tokens.count(word) returns the count of the word in tokens (list)
dictionary={word: tokens.count(word) for word in vocabulary}
# a dictionary is not sorted by key, so sort the items explicitly
print("\nsort by word")
print(sorted(dictionary.items()))

# find the most frequent words by
# sorting the dictionary by value
# sorted(iterable, key) sorts an iterable object by the comparison key
# lambda: anonymous function defined without a name. 
# lambda item:-item[1] sorts the items by the 2nd element in descending order
print("\nsort by frequency")
print(sorted(dictionary.items(), key=lambda item:-item[1]))



set(['just', 'lewis', 'cameron', 'its', 'cocks', 'bassett', 'jay', 'technique', 'breath-taking', 'should', 'to', 'writers', 'relevance', 'achieve', 'good', 'unsatisfying', 'noticing', 'not', 'aiming', 'him', 'impressive', 'james', 'millenium', 'this', 'stuff', 'unexciting', 'crowd', 'up', 'erotic', 'locals', 'really', 'fails', 'for', 'movie', 'ralph', 'does', 'goes', 'new', 'be', 'ends', 'business', 'drama', 'faith', 'about', 'last', 'of', 'days', 'angeles', 'peddling', 'social', 'whole', 'features', 'pines', 'another', 'ex-girlfriend', 'use', 'her', 'cares', 'two', 'angela', 'los', 'too', 'memory', 'friend', 'knows', 'that', 'but', 'juliette', 'goals', 'film-making', 'with', 'he', '1999', 'scenes', 'clips', 'were', 'problem', 'and', 'is', 'it', 'an', 'kathryn', 'as', 'his', 'nero', 'in', 'bigelow', 'thrills', 'film', 'gear', 'hesitate', 'when', 'strange', 'ambitious', 'uninvolving', 'badly', 'director', 'mace', 'fiennes', 'preachy', 'discouraged', 'a', 'lenny', 'performances', 'obviou
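
Calling `tokens.count(word)` once per vocabulary word rescans the whole token list each time. As an alternative sketch, `collections.Counter` from the standard library builds the same word-to-count mapping in a single pass. The short token list below is hypothetical and stands in for `tokens`.

```python
from collections import Counter

# hypothetical short token list standing in for `tokens`
sample_tokens = ["the", "film", "fails", "badly", "the", "film", "the"]

# one pass over the tokens instead of one tokens.count() call per word
counts = Counter(sample_tokens)

vocabulary = set(counts)          # unique tokens
print(vocabulary)
print(counts.most_common(2))      # two most frequent (word, count) pairs
```

`most_common` returns the pairs already sorted by descending count, so no separate `sorted` call with a lambda is needed.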

### 3.3. Stop words and word filtering

 - Stop words: a set of commonly used words, such as "and" and "the", that carry very little meaning and cannot differentiate one text from another. 
 - Stop words are typically ignored in NLP processing and by search engines
 - Stop words are usually application-specific. You can define your own stop words!

In [26]:
# Exercise 3.3.1
# get NLTK English stop words
# You can modify this list by adding or removing stop words

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words+=["film", "films"]
print (stop_words)

# filter stop words out of the dictionary
# by creating a new dictionary

filtered_dictionary={word: dictionary[word] for word in dictionary if word not in stop_words}
print("\nsort dictionary without stop words by frequency")
print(sorted(filtered_dictionary.items(), key=lambda item:-item[1]))


[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u
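
Because the stop-word list is searched once per word, it can help to store it in a set, where membership tests do not scan the whole list. The sketch below uses a hypothetical mini stop-word list and hypothetical word counts in place of NLTK's list and the dictionary built earlier.

```python
# hypothetical mini stop-word list; NLTK's English list is much longer
base_stop_words = ["the", "a", "is", "and", "this"]

# application-specific additions, stored in a set for fast membership tests
stop_words = set(base_stop_words) | {"film", "films"}

# hypothetical word counts standing in for `dictionary`
dictionary = {"the": 3, "film": 2, "fails": 2, "badly": 1, "is": 1}

filtered = {w: c for w, c in dictionary.items() if w not in stop_words}
print(sorted(filtered.items(), key=lambda item: -item[1]))
```

Only "fails" and "badly" survive the filter, since "film" was added as an application-specific stop word.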

In [15]:
# Exercise 3.3.2
# Find positive words 

with open("positive-words.txt",'r') as f:
    positive_words=[line.strip() for line in f]
    
positive_tokens=[token for token in tokens if token in positive_words]

print(positive_tokens)

['faith', 'good', 'impressive', 'ambitious', 'thrills', 'ambitious']
