# <center>Natural Language Processing Using NLTK (I)</center>

References:
 - http://www.nltk.org/book_1ed/
 - https://web.stanford.edu/class/cs124/lec/Information_Extraction_and_Named_Entity_Recognition.pdf

## 1. NLTK installation
 1. Install the NLTK package using: `pip install nltk`
 2. Open your Python editor (Jupyter Notebook, Spyder, etc.) and run the commands below (uncomment `nltk.download()` the first time). Select "all packages" to install the data included in NLTK, including corpora and books. It may take a few minutes to download all the data.

In [1]:
import nltk
#nltk.download()

## 2. NLP Objectives and Basic Steps

 - Objectives:
   * Split documents into tokens, phrases, or segments
   * Clean up tokens and annotate tokens
   * Extract features from tokens for further text mining tasks
 - Basic processing steps:
   * Tokenization: split documents into individual words, phrases, or segments
   * Remove stop words and filter tokens
   * POS (part of speech) Tagging
   * Normalization: Stemming, Lemmatization
   * Named Entity Recognition (NER)
   * Term Frequency and Inverse Document Frequency (TF-IDF)
   * Document-to-term matrix (bag of words)
 - NLP packages: NLTK, Gensim, spaCy


In [46]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re    # import re module
import nltk

In [47]:
# this extract is from https://www.sciencenews.org/article/coronavirus-what-does-covid-19-vaccine-efficacy-mean

text = "The FDA setting a minimum recommendation for efficacy doesn't mean vaccines \
couldn't perform better. The benchmark is also a reminder that COVID-19 vaccine \
development is in its early days. If the first vaccines made available only meet \
the minimum, they may be replaced by others that prove to protect more people. \
But with more than 1 million deaths from COVID-19 worldwide — \
and U.S. deaths surpassing 200,000 — the urgency in finding a \
vaccine that safely helps at least some people is at the forefront."

text

"The FDA setting a minimum recommendation for efficacy doesn't mean vaccines couldn't perform better. The benchmark is also a reminder that COVID-19 vaccine development is in its early days. If the first vaccines made available only meet the minimum, they may be replaced by others that prove to protect more people. But with more than 1 million deaths from COVID-19 worldwide — and U.S. deaths surpassing 200,000 — the urgency in finding a vaccine that safely helps at least some people is at the forefront."

## 3. Tokenization
 - **Definition**: the process of breaking a stream of textual content up into words, terms, symbols, or some other meaningful elements called tokens.
    * Word (Unigram)
    * Bigram (Two consecutive words)
    * Trigram (Three consecutive words)
    * Sentence
 - Different methods exist:
    * Split by regular expression patterns
    * NLTK's word tokenizer
    * NLTK's regular expression tokenizer (customizable)
 - No single method is perfect for every tokenization task.

### 3.1. Unigram

#### Regular Expression

In [48]:
# Exercise 3.1.1. Simply split the text by one or more non-word characters

# \W+: one or more non-words
tokens = re.split(r"\W+", text)   

# get the number of tokens

print(len(tokens))                   
print(tokens)                     

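# Pros: no punctuation, just words
# Cons: COVID-19, doesn't, couldn't, 200,000, and U.S.
# are each split into two tokens; re.split also leaves an
# empty string at the end because the text ends with "."

# re.findall returns the same words without the empty string
re.findall(r"\w+", text) 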

90
['The', 'FDA', 'setting', 'a', 'minimum', 'recommendation', 'for', 'efficacy', 'doesn', 't', 'mean', 'vaccines', 'couldn', 't', 'perform', 'better', 'The', 'benchmark', 'is', 'also', 'a', 'reminder', 'that', 'COVID', '19', 'vaccine', 'development', 'is', 'in', 'its', 'early', 'days', 'If', 'the', 'first', 'vaccines', 'made', 'available', 'only', 'meet', 'the', 'minimum', 'they', 'may', 'be', 'replaced', 'by', 'others', 'that', 'prove', 'to', 'protect', 'more', 'people', 'But', 'with', 'more', 'than', '1', 'million', 'deaths', 'from', 'COVID', '19', 'worldwide', 'and', 'U', 'S', 'deaths', 'surpassing', '200', '000', 'the', 'urgency', 'in', 'finding', 'a', 'vaccine', 'that', 'safely', 'helps', 'at', 'least', 'some', 'people', 'is', 'at', 'the', 'forefront', '']


['The',
 'FDA',
 'setting',
 'a',
 'minimum',
 'recommendation',
 'for',
 'efficacy',
 'doesn',
 't',
 'mean',
 'vaccines',
 'couldn',
 't',
 'perform',
 'better',
 'The',
 'benchmark',
 'is',
 'also',
 'a',
 'reminder',
 'that',
 'COVID',
 '19',
 'vaccine',
 'development',
 'is',
 'in',
 'its',
 'early',
 'days',
 'If',
 'the',
 'first',
 'vaccines',
 'made',
 'available',
 'only',
 'meet',
 'the',
 'minimum',
 'they',
 'may',
 'be',
 'replaced',
 'by',
 'others',
 'that',
 'prove',
 'to',
 'protect',
 'more',
 'people',
 'But',
 'with',
 'more',
 'than',
 '1',
 'million',
 'deaths',
 'from',
 'COVID',
 '19',
 'worldwide',
 'and',
 'U',
 'S',
 'deaths',
 'surpassing',
 '200',
 '000',
 'the',
 'urgency',
 'in',
 'finding',
 'a',
 'vaccine',
 'that',
 'safely',
 'helps',
 'at',
 'least',
 'some',
 'people',
 'is',
 'at',
 'the',
 'forefront']

#### NLTK's word tokenizer does the following steps:
* split standard contractions, e.g. don't -> do n't and they'll -> they 'll
* treat most punctuation characters as separate tokens
* split off commas and single quotes, when followed by whitespace
* separate periods that appear at the end of line

In [49]:
# Exercise 3.1.2 NLTK's word tokenizer: 

# break down text into words and punctuations

# invoke NLTK's word tokenizer
tokens = nltk.word_tokenize(text)    
print(len(tokens) )                   
print (tokens)       

# Pros: words are well tokenized, 
# e.g. COVID-19, 200,000, and U.S. are not split at punctuation
# Note: doesn't becomes does + n't
# Cons: punctuation tokens still need to be removed

92
['The', 'FDA', 'setting', 'a', 'minimum', 'recommendation', 'for', 'efficacy', 'does', "n't", 'mean', 'vaccines', 'could', "n't", 'perform', 'better', '.', 'The', 'benchmark', 'is', 'also', 'a', 'reminder', 'that', 'COVID-19', 'vaccine', 'development', 'is', 'in', 'its', 'early', 'days', '.', 'If', 'the', 'first', 'vaccines', 'made', 'available', 'only', 'meet', 'the', 'minimum', ',', 'they', 'may', 'be', 'replaced', 'by', 'others', 'that', 'prove', 'to', 'protect', 'more', 'people', '.', 'But', 'with', 'more', 'than', '1', 'million', 'deaths', 'from', 'COVID-19', 'worldwide', '—', 'and', 'U.S.', 'deaths', 'surpassing', '200,000', '—', 'the', 'urgency', 'in', 'finding', 'a', 'vaccine', 'that', 'safely', 'helps', 'at', 'least', 'some', 'people', 'is', 'at', 'the', 'forefront', '.']


In [50]:
# Exercise 3.1.3 remove leading or trailing punctuations

import string

string.punctuation

tokens=[token.strip(string.punctuation) for token in tokens]

# remove empty tokens
tokens=[token.strip() for token in tokens \
        if token.strip()!='']
print(len(tokens) )
print(tokens)  

# Note: '—' is kept because it is not in string.punctuation,
# and 'U.S.' loses its trailing period and becomes 'U.S'

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

87
['The', 'FDA', 'setting', 'a', 'minimum', 'recommendation', 'for', 'efficacy', 'does', "n't", 'mean', 'vaccines', 'could', "n't", 'perform', 'better', 'The', 'benchmark', 'is', 'also', 'a', 'reminder', 'that', 'COVID-19', 'vaccine', 'development', 'is', 'in', 'its', 'early', 'days', 'If', 'the', 'first', 'vaccines', 'made', 'available', 'only', 'meet', 'the', 'minimum', 'they', 'may', 'be', 'replaced', 'by', 'others', 'that', 'prove', 'to', 'protect', 'more', 'people', 'But', 'with', 'more', 'than', '1', 'million', 'deaths', 'from', 'COVID-19', 'worldwide', '—', 'and', 'U.S', 'deaths', 'surpassing', '200,000', '—', 'the', 'urgency', 'in', 'finding', 'a', 'vaccine', 'that', 'safely', 'helps', 'at', 'least', 'some', 'people', 'is', 'at', 'the', 'forefront']


#### NLTK's regular expression tokenizer (customizable)

In [51]:
# Exercise 3.1.4 NLTK's regular expression tokenizer 

# Pattern can be customized to your need

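# a word is defined as:
# (1) must start with a word character  \w
# (2) then contain zero or more word characters, "'", ",",
#     or "-" in the middle [\w\',-]*
# (3) must end with a word character \w
# e.g. film-making, doesn't, 200,000
# Note: a token needs at least two characters, so
# single-character words such as "a" and "1" are dropped

pattern=r'\w[\w\',-]*\w'                        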

# call NLTK's regular expression tokenization
tokens=nltk.regexp_tokenize(text, pattern)

print(len(tokens))
print (tokens)

78
['The', 'FDA', 'setting', 'minimum', 'recommendation', 'for', 'efficacy', "doesn't", 'mean', 'vaccines', "couldn't", 'perform', 'better', 'The', 'benchmark', 'is', 'also', 'reminder', 'that', 'COVID-19', 'vaccine', 'development', 'is', 'in', 'its', 'early', 'days', 'If', 'the', 'first', 'vaccines', 'made', 'available', 'only', 'meet', 'the', 'minimum', 'they', 'may', 'be', 'replaced', 'by', 'others', 'that', 'prove', 'to', 'protect', 'more', 'people', 'But', 'with', 'more', 'than', 'million', 'deaths', 'from', 'COVID-19', 'worldwide', 'and', 'deaths', 'surpassing', '200,000', 'the', 'urgency', 'in', 'finding', 'vaccine', 'that', 'safely', 'helps', 'at', 'least', 'some', 'people', 'is', 'at', 'the', 'forefront']


In [52]:
# Exercise 3.1.5 Use NLTK's regular expression tokenizer 
# to define sentences, i.e. 
# (1) starts with non-space character (i.e. \S), 
# (2) contains any number of characters in the middle, 
#     as long as they are not "!?."
# (3) ends with !?.
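
One possible answer to the exercise above, as a minimal sketch: it uses Python's `re.findall` on a made-up sample string (`sample` and the variable names are illustrative, not from the original notebook). The same pattern could presumably be passed to `nltk.regexp_tokenize`, but only the standard library is used here.

```python
import re

# Pattern from the exercise: one non-space character, then any run of
# characters other than "!", "?" or ".", then one terminating "!", "?" or "."
sentence_pattern = r"\S[^!?.]*[!?.]"

# Hypothetical sample text, chosen so that each sentence ends differently
sample = "Is the vaccine safe? Early data look promising. More trials are needed!"
sentences = re.findall(sentence_pattern, sample)
print(sentences)

# Limitation: every period ends a "sentence", so abbreviations are broken up
abbreviation_split = re.findall(sentence_pattern, "U.S. deaths rose.")
print(abbreviation_split)
```

The second call shows why a plain pattern is not enough for text containing abbreviations such as "U.S.", which is the kind of case a trained sentence tokenizer is meant to handle.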


### 3.2. Sentence

In [53]:
# Exercise 3.2.1. Segmentation by Sentences

sentences = nltk.sent_tokenize(text)
len(sentences)
sentences

# what patterns can be used to segment 
# text into sentences?

4

["The FDA setting a minimum recommendation for efficacy doesn't mean vaccines couldn't perform better.",
 'The benchmark is also a reminder that COVID-19 vaccine development is in its early days.',
 'If the first vaccines made available only meet the minimum, they may be replaced by others that prove to protect more people.',
 'But with more than 1 million deaths from COVID-19 worldwide — and U.S. deaths surpassing 200,000 — the urgency in finding a vaccine that safely helps at least some people is at the forefront.']

### 3.3 Phrases: Bigrams (2 consecutive words),  Trigrams (3 consecutive words), or in general n-grams
 - Why bigrams and trigrams?
 - How to get bigrams or trigrams:
    1. First tokenize text into unigrams
    2. Slice through the list of unigrams to get bigrams
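
The slicing step can be sketched without NLTK by zipping shifted copies of the unigram list. The helper name `make_ngrams` and the sample tokens are illustrative assumptions, not part of the original notebook.

```python
def make_ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of unigram tokens."""
    # Build n shifted views of the list: tokens[0:], tokens[1:], ..., tokens[n-1:]
    # zip stops at the shortest view, so only complete n-grams are produced
    shifted = [tokens[i:] for i in range(n)]
    return list(zip(*shifted))

# Hypothetical unigrams for illustration
sample_unigrams = ["COVID-19", "vaccine", "development", "is", "early"]

sample_bigrams = make_ngrams(sample_unigrams, 2)
sample_trigrams = make_ngrams(sample_unigrams, 3)

print(sample_bigrams)
print(sample_trigrams)
```

A list of k unigrams yields k-1 bigrams and k-2 trigrams, which is a quick sanity check on any n-gram routine.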

In [54]:
# Exercise 3.3.1. Get bigrams from the text                       

# bigrams are formed from unigrams
# nltk.bigrams returns a generator, so wrap it in list()

bigrams=list(nltk.bigrams(tokens))  # tokens are created in Exercise 3.1.4
print(bigrams)

# trigrams
list(nltk.trigrams(tokens))

[('The', 'FDA'), ('FDA', 'setting'), ('setting', 'minimum'), ('minimum', 'recommendation'), ('recommendation', 'for'), ('for', 'efficacy'), ('efficacy', "doesn't"), ("doesn't", 'mean'), ('mean', 'vaccines'), ('vaccines', "couldn't"), ("couldn't", 'perform'), ('perform', 'better'), ('better', 'The'), ('The', 'benchmark'), ('benchmark', 'is'), ('is', 'also'), ('also', 'reminder'), ('reminder', 'that'), ('that', 'COVID-19'), ('COVID-19', 'vaccine'), ('vaccine', 'development'), ('development', 'is'), ('is', 'in'), ('in', 'its'), ('its', 'early'), ('early', 'days'), ('days', 'If'), ('If', 'the'), ('the', 'first'), ('first', 'vaccines'), ('vaccines', 'made'), ('made', 'available'), ('available', 'only'), ('only', 'meet'), ('meet', 'the'), ('the', 'minimum'), ('minimum', 'they'), ('they', 'may'), ('may', 'be'), ('be', 'replaced'), ('replaced', 'by'), ('by', 'others'), ('others', 'that'), ('that', 'prove'), ('prove', 'to'), ('to', 'protect'), ('protect', 'more'), ('more', 'people'), ('people',

[('The', 'FDA', 'setting'),
 ('FDA', 'setting', 'minimum'),
 ('setting', 'minimum', 'recommendation'),
 ('minimum', 'recommendation', 'for'),
 ('recommendation', 'for', 'efficacy'),
 ('for', 'efficacy', "doesn't"),
 ('efficacy', "doesn't", 'mean'),
 ("doesn't", 'mean', 'vaccines'),
 ('mean', 'vaccines', "couldn't"),
 ('vaccines', "couldn't", 'perform'),
 ("couldn't", 'perform', 'better'),
 ('perform', 'better', 'The'),
 ('better', 'The', 'benchmark'),
 ('The', 'benchmark', 'is'),
 ('benchmark', 'is', 'also'),
 ('is', 'also', 'reminder'),
 ('also', 'reminder', 'that'),
 ('reminder', 'that', 'COVID-19'),
 ('that', 'COVID-19', 'vaccine'),
 ('COVID-19', 'vaccine', 'development'),
 ('vaccine', 'development', 'is'),
 ('development', 'is', 'in'),
 ('is', 'in', 'its'),
 ('in', 'its', 'early'),
 ('its', 'early', 'days'),
 ('early', 'days', 'If'),
 ('days', 'If', 'the'),
 ('If', 'the', 'first'),
 ('the', 'first', 'vaccines'),
 ('first', 'vaccines', 'made'),
 ('vaccines', 'made', 'available'),
 (

### 3.4. Collocation
 - Most bigrams or trigrams may sound odd. However, we need to pay attention to frequent bigrams or trigrams
 - **Collocation**: an expression consisting of two or more words that correspond to some conventional way of saying things, e.g. red wine, United States, balance sheet etc.
    - Collocations are not fully compositional in that there is usually an element of meaning added to the combination.
 - Question: how to find collocations?
    - Suppose you have a rich collection of text, e.g. english-web.txt
    - How to find good collocations from this file?

In [65]:
# Exercise 3.4.1.
# construct bigrams using words from a large built-in NLTK corpus

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# bigram association measures
# different measures, e.g. frequency, are implemented

bigram_measures = BigramAssocMeasures()

# First load text from the file and create unigram tokens
# Then create bigrams from the tokens
words=nltk.corpus.genesis.words('english-web.txt')

finder = BigramCollocationFinder.from_words(words)

# find the top 10 bigrams by frequency
finder.nbest(bigram_measures.raw_freq, 10) 

# Note that the most frequent bigrams are very odd
# how to fix it?

[(',', 'and'),
 (',', '"'),
 ('of', 'the'),
 ("'", 's'),
 ('in', 'the'),
 ('said', ','),
 ('said', 'to'),
 ('.', 'He'),
 ('the', 'land'),
 ('.', 'The')]

In [66]:
# Exercise 3.4.2. Find collocation by filter

import string
# construct bigrams using words from a NLTK corpus

stop_words = nltk.corpus.stopwords.words('english')
#print(stop_words)
finder.apply_word_filter(lambda w: w.lower() in stop_words\
                         or w.strip(string.punctuation)=='')

finder.nbest(bigram_measures.raw_freq, 10) 

# better?
# most of them are in the pattern of "xxx said"

[('God', 'said'),
 ('one', 'hundred'),
 ('Jacob', 'said'),
 ('Yahweh', 'God'),
 ('Yahweh', 'said'),
 ('years', 'old'),
 ('seven', 'years'),
 ('Joseph', 'said'),
 ('every', 'man'),
 ('five', 'years')]

#### 3.4.1 How to find collocations - PMI
- By **frequency** (perhaps with a filter)
- **Pointwise Mutual Information (PMI)**
  - Given two words $w_1, w_2$, $$PMI(w_1,w_2)=\log{\frac{p(w_1,w_2)}{p(w_1)*p(w_2)}}$$
  - Some observations:
    - If $w_1$ and $w_2$ are independent, then $p(w_1,w_2)=p(w_1)*p(w_2)$, so $PMI(w_1,w_2)=0$
    - If $w_1$ is completely dependent on $w_2$, i.e. $p(w_1,w_2)=p(w_2)$, then $PMI(w_1,w_2)=\log\frac{1}{p(w_1)}$. In this case, what if $w_1$ appears just once in the corpus? 
    - PMI favors less frequent collocations, because a rare $w_1$ makes $\log\frac{1}{p(w_1)}$ large
    - How to fix it?
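
To make the observations above concrete, here is a minimal pure-Python sketch of PMI over adjacent word pairs. It assumes a base-2 logarithm and estimates every probability as a count divided by the number of tokens, which is a simplification; the function name `pmi_scores`, the `min_count` parameter, and the toy tokens are all illustrative.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=1):
    """Score each adjacent word pair by PMI (log base 2, an assumption)."""
    total = len(tokens)
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        # Optional fix for the rare-pair bias: skip infrequent pairs
        if pair_count < min_count:
            continue
        # p(w1,w2) / (p(w1) * p(w2)), each probability estimated as count / total
        ratio = pair_count * total / (unigram_counts[w1] * unigram_counts[w2])
        scores[(w1, w2)] = math.log2(ratio)
    return scores

# Toy corpus: "red wine" occurs twice, "salt sea" only once
toy_tokens = ["red", "wine", "is", "good", "red", "wine", "is", "red", "salt", "sea"]

all_scores = pmi_scores(toy_tokens)
best_pair = max(all_scores, key=all_scores.get)
print(best_pair, round(all_scores[best_pair], 2))

# With a minimum pair count, one-off pairs are no longer considered
frequent_scores = pmi_scores(toy_tokens, min_count=2)
print(sorted(frequent_scores))
```

In this toy corpus the pair seen only once gets the highest PMI, which illustrates the bias toward rare collocations; requiring a minimum pair count is one simple remedy.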


In [67]:
# Exercise 3.4.1.1 Metrics for Collocations

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# construct bigrams using words from a NLTK corpus
finder = BigramCollocationFinder.from_words(words)

# find top-n bigrams by pmi
finder.nbest(bigram_measures.pmi, 10) 

[('Allon', 'Bacuth'),
 ('Ashteroth', 'Karnaim'),
 ('Ben', 'Ammi'),
 ('En', 'Mishpat'),
 ('Jegar', 'Sahadutha'),
 ('Salt', 'Sea'),
 ('Whoever', 'sheds'),
 ('appoint', 'overseers'),
 ('aromatic', 'resin'),
 ('cutting', 'instrument')]

In [68]:
# Exercise 3.4.1.2 filter bigrams by frequency

# ignore bigrams that occur fewer than 20 times
# (5 is another threshold to try)
finder.apply_freq_filter(20)
finder.nbest(bigram_measures.pmi, 10) 

[('It', 'happened'),
 ('lifted', 'up'),
 ('You', 'shall'),
 ('These', 'are'),
 ('years', 'old'),
 ('one', 'hundred'),
 ('shall', 'not'),
 ('This', 'is'),
 ('my', 'lord'),
 ('I', 'am')]

#### 3.4.2 How to find collocations - NPMI and others
- **Normalized Pointwise Mutual Information (`NPMI`)**
   - If $w_1$ and $w_2$ always occur together, i.e., $p(w_1)=p(w_2)=p(w_1,w_2)$, PMI reaches its maximum: $$PMI(w_1,w_2)=-\log{p(w_1)}=-\log{p(w_2)}=-\log{p(w_1,w_2)}$$
   - Normalized PMI is the PMI divided by this upper bound, so it equals $1$ when the words always occur together and $0$ when they are independent:
   $$NPMI(w_1,w_2)=\frac{\log{\frac{p(w_1,w_2)}{p(w_1)*p(w_2)}}}{-\log{p(w_1,w_2)}}$$
   
- Another simple method by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf):

    - $Score(w_1, w_2)=\frac{count(w_1,w_2)-\delta}{count(w_1)*count(w_2)}, \text{where}~\delta~\text{is a discounting coefficient, e.g. the minimum collocation frequency} $ 

    - This is closely related to PMI without the logarithm: up to a constant factor, it is the ratio inside the PMI formula with the bigram count discounted by $\delta$, which penalizes pairs of very infrequent words
- Both methods are implemented in the `gensim` package
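
The two formulas above can be sketched directly from raw counts. This is an illustrative standard-library version, not the `gensim` implementation; the function names `npmi` and `mikolov_score` and all counts are made up for the example.

```python
import math

def npmi(pair_count, count_1, count_2, total):
    """Normalized PMI from raw counts; probabilities are count / total."""
    p_pair = pair_count / total
    p_1 = count_1 / total
    p_2 = count_2 / total
    pmi = math.log(p_pair / (p_1 * p_2))
    # Divide by the upper bound -log p(w1, w2); undefined when p_pair == 1
    return pmi / -math.log(p_pair)

def mikolov_score(pair_count, count_1, count_2, delta):
    """Discounted co-occurrence score from the formula above."""
    return (pair_count - delta) / (count_1 * count_2)

# Two words that always occur together (hypothetical counts)
always_together = npmi(pair_count=5, count_1=5, count_2=5, total=100)

# Two words whose co-occurrence matches independence: 1/100 == (10/100) * (10/100)
independent = npmi(pair_count=1, count_1=10, count_2=10, total=100)

# A pair seen exactly delta times scores zero; more frequent pairs score higher
rare_pair = mikolov_score(pair_count=5, count_1=5, count_2=5, delta=5)
common_pair = mikolov_score(pair_count=20, count_1=20, count_2=20, delta=5)

print(round(always_together, 3), round(independent, 3))
print(rare_pair, common_pair)
```

Unlike raw PMI, NPMI puts the always-together case at a fixed value of 1 regardless of how rare the words are, and the discount in the second score removes pairs that barely reach the minimum count.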

### 3.5. Vocabulary 
 - Vocabulary: the set of unique tokens (unigrams/phrases)  
 - Dictionary: typically, the vocabulary of a text can be represented as a dictionary 
    * Key: word, Value: count of the word
    * **nltk.FreqDist()**: a convenient class for calculating the frequency of words/phrases
        - Gets the frequency of items in the parameter list 
        - Returns an object similar to a dictionary
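
As a small illustration of the word-to-count dictionary described above, the sketch below builds it twice with the standard library: once by hand and once with `collections.Counter`, which `nltk.FreqDist` subclasses in NLTK 3. The sample tokens are made up.

```python
from collections import Counter

# Hypothetical tokens for illustration
sample_tokens = ["the", "vaccine", "helps", "the", "people", "the"]

# Vocabulary: the set of unique tokens
vocabulary = set(sample_tokens)

# Dictionary built by hand: key = word, value = count of the word
manual_counts = {token: sample_tokens.count(token) for token in vocabulary}

# Counter produces the same mapping in a single pass over the tokens
counter_counts = Counter(sample_tokens)

print(sorted(vocabulary))
print(counter_counts.most_common(1))
```

The hand-built version rescans the token list once per unique word, so the `Counter` form is the better fit for large texts.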

In [69]:
# 3.5.1 Get token frequency

# first tokenize the text
pattern=r'\w[\w\',-]*\w'                        
tokens=nltk.regexp_tokenize(text.lower(), pattern)


# get unigram frequency 
# recall, you can also get the dictionary by 
# {token: tokens.count(token) for token in set(tokens)}

word_dist=nltk.FreqDist(tokens)
word_dist

# get the most frequent items
print("top 5 words:", word_dist.most_common(5))

# what kind of words usually have high frequency?

# it behaves as a dictionary
for word in word_dist:
    print(word,":", word_dist[word])
    

FreqDist({'the': 2260, 'of': 1587, 'and': 1286, 'in': 790, 'to': 767, 'company': 400, 'for': 362, "company's": 348, 'as': 307, 'on': 300, ...})

top 5 words: [('the', 2260), ('of', 1587), ('and', 1286), ('in', 790), ('to', 767)]
document : 3
type : 1
10-k : 50
sequence : 1
filename : 1
a2032880z10-k : 1
txt : 1
description : 5
form : 91
text : 2
page : 81
united : 26
states : 25
securities : 57
and : 1286
exchange : 78
commission : 9
washington : 1
20549 : 1
mark : 3
one : 23
table : 127
annual : 23
report : 32
pursuant : 14
to : 767
section : 17
13 : 27
or : 204
15 : 35
of : 1587
the : 2260
act : 8
1934 : 6
for : 362
fiscal : 89
year : 66
ended : 47
september : 161
30 : 107
2000 : 274
transition : 9
period : 37
from : 138
______________ : 2
file : 8
number : 45
0-10030 : 1
apple : 62
computer : 59
inc : 69
exact : 1
name : 7
registrant : 34
as : 307
specified : 7
in : 790
its : 175
charter : 1
california : 12
942404110 : 1
state : 10
other : 136
jurisdiction : 1
employer : 1
identification : 2
no : 73
incorporation : 3
organization : 1
infinite : 2
loop : 2
95014 : 2
cupertino : 4
zip : 1
code : 6
address : 2
principal : 25
ex

5,941 : 6
7,081 : 1
9,833 : 1
income : 155
786 : 13
601 : 8
309 : 8
1,045 : 1
816 : 1
basic : 19
42 : 14
diluted : 20
18 : 19
81 : 7
05 : 9
declared : 2
324,568 : 3
286,314 : 3
263,948 : 3
252,124 : 2
247,468 : 2
360,324 : 3
348,328 : 3
335,834 : 3
equivalents : 24
short-term : 29
4,027 : 2
3,226 : 2
2,300 : 2
1,459 : 1
1,745 : 1
assets : 80
6,803 : 4
5,161 : 4
4,289 : 2
4,233 : 1
5,364 : 1
long-term : 15
debt : 37
300 : 10
954 : 1
951 : 1
949 : 1
4,107 : 3
3,104 : 3
1,642 : 2
1,200 : 2
2,058 : 1
gains : 55
taxes : 35
investment : 36
367 : 6
230 : 6
recognized : 28
restructuring : 38
27 : 30
217 : 2
179 : 3
cost : 36
bonus : 21
chief : 39
aircraft : 8
next : 20
allocation : 1
charge : 21
375 : 3
acquire : 13
pcc : 3
110 : 3
expensed : 3
termination : 4
license : 2
sets : 5
unit : 66
shipment : 1
cpu : 3
4,558 : 3
32 : 11
3,448 : 3
2,763 : 3
gross : 22
margin : 18
2,166 : 2
1,696 : 2
1,479 : 2
percentage : 6
administrative : 9
1,166 : 3
996 : 3
908 : 2
620 : 2
61 : 11
386 : 2
44 : 11
26

enterprise : 2
facts : 1
suggest : 1
assesses : 1
determining : 1
translates : 1
year-end : 8
translations : 2
credited : 1
charged : 4
entities : 1
remeasure : 1
monetary : 1
nonmonetary : 1
persuasive : 1
exists : 1
occurred : 1
determinable : 1
collectibility : 1
criteria : 1
met : 1
shipped : 2
returns : 1
rebates : 1
warranties : 1
takes : 1
281 : 1
product's : 1
feasibility : 3
soon : 1
subsequent : 5
achieving : 1
stock-based : 2
compensation : 48
measures : 1
employee : 25
intrinsic : 1
forma : 8
value-based : 1
measuring : 1
dividing : 2
dilutive : 6
if-converted : 1
consists : 4
refers : 1
element : 1
approach : 2
designates : 1
reportable : 4
company-wide : 1
disclosed : 1
2--financial : 5
approximates : 1
government : 1
attempt : 1
flooring : 1
insurance : 2
latin : 1
considerable : 1
retail : 1
shows : 1
800 : 1
790 : 1
615 : 1
1,305 : 1
185 : 2
1,051 : 1
645 : 1
represent : 3
shown : 1
counterparties : 6
failed : 2
according : 1
then-current : 1
respective : 2
prevailing 

#### 3.5.1 Stop words and word filtering

 - Stop words: commonly used words that carry very little meaning and cannot differentiate one text from another, such as "and", "the", etc. 
 - Stop words are typically ignored in NLP processing and by search engines
 - Stop words are usually application-specific. You can define your own stop words!
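
A minimal sketch of filtering with a customized stop word list and then ranking the remaining words by count. The tiny stop word set and the counts are made up for illustration; in practice the set would come from a standard list extended with application-specific words.

```python
# Hypothetical stop words; a set makes membership tests fast
custom_stop_words = {"the", "a", "is", "in", "its", "that"}

# Application-specific addition: a word too common in this domain to be useful
custom_stop_words.add("covid-19")

# Hypothetical word counts, e.g. taken from a frequency distribution
word_counts = {"the": 3, "vaccine": 2, "is": 2, "minimum": 1, "covid-19": 2}

# Keep only the words that are not stop words
filtered_counts = {word: count for word, count in word_counts.items()
                   if word not in custom_stop_words}

# Sort the remaining words by count, highest first
ranked = sorted(filtered_counts.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Sorting the `(word, count)` pairs with a `key` on the count is one common way to order a dictionary by value.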

In [70]:
# Exercise 3.5.1.1
# get NLTK English stop words
# You can modify this list by adding or removing stop words

from nltk.corpus import stopwords
import string

stop_words = stopwords.words('english')
stop_words+=["covid-19", "virus"]
#print (stop_words)

# filter stop words out of the dictionary
# by creating a new dictionary

filtered_dict={word: word_dist[word] \
                     for word in word_dist \
                     if word not in stop_words}


filtered_dict

# how to sort the dictionary by value?

{'document': 3,
 'type': 1,
 '10-k': 50,
 'sequence': 1,
 'filename': 1,
 'a2032880z10-k': 1,
 'txt': 1,
 'description': 5,
 'form': 91,
 'text': 2,
 'page': 81,
 'united': 26,
 'states': 25,
 'securities': 57,
 'exchange': 78,
 'commission': 9,
 'washington': 1,
 '20549': 1,
 'mark': 3,
 'one': 23,
 'table': 127,
 'annual': 23,
 'report': 32,
 'pursuant': 14,
 'section': 17,
 '13': 27,
 '15': 35,
 'act': 8,
 '1934': 6,
 'fiscal': 89,
 'year': 66,
 'ended': 47,
 'september': 161,
 '30': 107,
 '2000': 274,
 'transition': 9,
 'period': 37,
 '______________': 2,
 'file': 8,
 'number': 45,
 '0-10030': 1,
 'apple': 62,
 'computer': 59,
 'inc': 69,
 'exact': 1,
 'name': 7,
 'registrant': 34,
 'specified': 7,
 'charter': 1,
 'california': 12,
 '942404110': 1,
 'state': 10,
 'jurisdiction': 1,
 'employer': 1,
 'identification': 2,
 'incorporation': 3,
 'organization': 1,
 'infinite': 2,
 'loop': 2,
 '95014': 2,
 'cupertino': 4,
 'zip': 1,
 'code': 6,
 'address': 2,
 'principal': 25,
 'executiv

#### 3.5.2 Positive/negative words: sentiment analysis
- Sentiment analysis often relies on **lists of words and phrases with positive and negative connotations**. 
- Many dictionaries of positive and negative opinion words were already developed:

  - **Hu and Liu's lexicon**: http://www.cs.uic.edu/~liub/FBS/
  - **SentiWordNet**: an excellent publicly available lexicon (http://sentiwordnet.isti.cnr.it/) 
  - **SentiWords**: contains 155,000 English words (https://hlt-nlp.fbk.eu/technologies/sentiwords)
  - **WordStat**: contains more than 9164 negative and 4847 positive word patterns (https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/sentiment-dictionaries/)
  - **SenticNet**: provides polarity associated with 50,000 natural language concepts (https://sentic.net)
  - **Sentiment140**:  created from 1.6 million tweets and contains a list of words and their associations with positive and negative sentiment (https://github.com/felipebravom/StaticTwitterSent/tree/master/extra/Sentiment140-Lexicon-v0.1)
- Opinion words are <b>domain-specific</b> (e.g. "power" in the political domain vs. in the energy sector).
  - For example, for the financial industry, there are a number of dictionaries for opinion words:
     * Harvard's General Inquirer (GI): http://www.wjh.harvard.edu/~inquirer/
     * Loughran and McDonald (2015):  https://sraf.nd.edu/textual-analysis/resources/
- For description of these lexicons, check https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c
- Question: **How to select the right lexicon**?


In [71]:
# Exercise 3.5.2.1
# Find positive words 
text = '''the problem is that the writers, james cameron and jay cocks , were too ambitious, aiming for a film with social relevance, thrills, and drama. 
 not that ambitious film-making should be discouraged; just that when it fails to achieve its goals, it fails badly and obviously. 
 the film just ends up preachy, unexciting and uninvolving.'''

pattern=r'\w[\w\',-]*\w'                        
tokens=nltk.regexp_tokenize(text.lower(), pattern)


with open("positive-words.txt",'r') as f:
    positive_words=[line.strip() for line in f]

#positive_words
#print(positive_words)

positive_tokens=[token for token in tokens \
                 if token in positive_words]

print(positive_tokens)

['ambitious', 'thrills', 'ambitious']


- **Naive sentiment analysis**:
  - Find positive/negative words
  - If more positive words than negative, then positive
  - Otherwise, negative
- Note the sentence: 
  -  "the problem is that the writers, james cameron and jay cocks , were **<font color="red">too ambitious</font>**, aiming for a film with social relevance, thrills, and drama. **<font color="red">not that ambitious</font>** film-making should be discouraged; just that when it fails to achieve its goals ..."
- How to deal with **negation**?
- Some useful rules:
    - Negative sentiment: 
      - negative words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - positive words preceded by a negation within $n$ (e.g. three) words in the same sentence.
    - Positive sentiment (in a similar fashion):
      - positive words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - negative words preceded by a negation within $n$ (e.g. three) words in the same sentence.
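
The rules above can be sketched as a small counting function. Everything here is illustrative: the function name `count_sentiment`, the tiny word lists, and the sample sentences are assumptions, and each token list is assumed to come from a single sentence.

```python
def count_sentiment(tokens, positive_words, negative_words, negations, window=3):
    """Count positive and negative hits in one sentence, flipping the
    polarity of a word when a negation appears within `window` tokens before it."""
    positive, negative = 0, 0
    for idx, token in enumerate(tokens):
        if token not in positive_words and token not in negative_words:
            continue
        # Look back at most `window` tokens for a negation word
        preceding = tokens[max(0, idx - window):idx]
        negated = any(word in negations for word in preceding)
        is_positive = token in positive_words
        # positive and not negated, or negative and negated -> positive
        if is_positive != negated:
            positive += 1
        else:
            negative += 1
    return positive, negative

# Tiny hypothetical lexicons, for illustration only
positive_lexicon = {"happy", "good"}
negative_lexicon = {"bad"}
negation_words = {"not", "no", "n't", "cannot"}

negated_positive = count_sentiment("it does not make customers happy".split(),
                                   positive_lexicon, negative_lexicon, negation_words)
negated_negative = count_sentiment("the plot is not bad".split(),
                                   positive_lexicon, negative_lexicon, negation_words)
plain_positive = count_sentiment("a good film".split(),
                                 positive_lexicon, negative_lexicon, negation_words)

print(negated_positive, negated_negative, plain_positive)
```

Following the naive rule, a sentence would then be labeled positive when the first count exceeds the second and negative otherwise. The window size trades off missed negations against false flips.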


In [72]:
# Exercise 3.5.2.2
# check if a positive word is preceded by negation words
# e.g. not, too, n't, no, cannot

# this is not an exhaustive list of negation words!
negations=['not', 'too', 'n\'t', 'no', 'cannot', 'neither','nor']
tokens = nltk.word_tokenize(text)  

#print(tokens)

positive_tokens=[]
for idx, token in enumerate(tokens):
    if token in positive_words:
        # keep the word if it is the first token or if the
        # token right before it is not a negation word
        if idx == 0 or tokens[idx-1] not in negations:
            positive_tokens.append(token)


print(positive_tokens)

# what if a positive word is preceded 
# by a negation within N words? 
# e.g. 'does not make any customer happy'

['thrills', 'ambitious']
