# Tokenization

**Tokenization** is the division of a sentence into smaller parts (tokens). One obvious choice of tokens:
- words, punctuation marks, numbers, etc.

A rough version of this can be done with `str.split()`.

However, is
- "ice cream" the same as ["ice", "cream"]?
- "Mr. Smith" the same as ["Mr.", "Smith"]?

Therefore, a token can sometimes contain more than one word. Such tokens are called *n-grams*:
- 2-grams (bigrams): "Mr. Smith", "Johny Walker",
- 3-grams (trigrams),
- 4-grams.


It is also possible to divide text into even smaller parts, such as [graphemes](https://en.wikipedia.org/wiki/Grapheme), the smallest functional units of a writing system.
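As a rough illustration of character-level splitting, the sketch below splits a string into Unicode code points with `list()`. This only approximates graphemes: a grapheme such as "é" may be stored as two code points, and the standard library has no grapheme segmenter. The sample string is an assumption chosen for illustration.

```python
import unicodedata

# "cafe" followed by a combining acute accent (U+0301): 4 graphemes, 5 code points
text = "cafe\u0301"

code_points = list(text)                             # character-level tokens
composed = list(unicodedata.normalize("NFC", text))  # NFC merges "e" + accent into one code point

print(len(code_points), code_points)
print(len(composed), composed)
```

Normalizing to NFC before splitting brings code points closer to graphemes for many European scripts, but it is not a general solution.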

**Stemming** combines words with similar meaning under a common stem, e.g.,
- singular-plural: 
    - work = works
    - buses = bus
    - lenses = lens
- conjugation
    - sing = singing
    - run = running

Stemming does not always work:
don't $\neq$ do

## Some definitions

A tokenizer used for compiling computer languages is often called a *scanner or lexer*.

The vocabulary (the set of all the valid tokens) for a computer language is often called a *lexicon*, and that term is still used in academic articles about NLP.

Some NLP terms have equivalents in computer-language compilers (NLP term first, compiler term second):
- tokenizer—scanner, lexer, lexical analyzer
- vocabulary—lexicon
- parser—compiler
- token, term, word, or n-gram—token, symbol, or terminal symbol

**Tokenization is the first step in an NLP pipeline, so it can have a big impact on the
rest of your pipeline.**

## Simplest tokenizer

In [1]:
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
sentence.split()

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

In [2]:
str.split(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

Note the token '26.' with its trailing period. In other sentences we could just as well get "26!" or "26?", and each of them would count as a different token.
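One quick remedy, shown here only as a sketch, is to strip common punctuation from the ends of each token after splitting. The set of characters to strip is an assumption; it also removes the period from abbreviations such as "Mr.", which may or may not be wanted.

```python
sentence = "Thomas Jefferson began building Monticello at the age of 26."

# Strip common punctuation from both ends of every whitespace-separated token
tokens = [tok.strip(".,;!?") for tok in sentence.split()]
print(tokens)
```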

### One-hot encoding

Now we create a one-hot vector for each word: a vector with a single 1 at the position of that word in the vocabulary and 0 everywhere else.

In [3]:
import numpy as np
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
token_sequence = str.split(sentence)
print(token_sequence)

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26.']


In [10]:
# build the vocabulary list: all the unique tokens (words) that you want to keep track of
vocab = sorted(set(token_sequence))  # set() removes duplicates
print(', '.join(vocab))

26., Jefferson, Monticello, Thomas, age, at, began, building, of, the


In [8]:
num_tokens = len(token_sequence)
print(num_tokens)

10


In [9]:
vocab_size = len(vocab)
print(vocab_size)

10


In [12]:
onehot_vectors = np.zeros((num_tokens, vocab_size), int)
print(onehot_vectors)

[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]]


In [15]:
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1
print(' '.join(vocab))
print(onehot_vectors)

26. Jefferson Monticello Thomas age at began building of the
[[0 0 0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0 0]
 [0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0]
 [1 0 0 0 0 0 0 0 0 0]]


The vectors are very sparse. We can display them more neatly:

In [16]:
import pandas as pd
pd.DataFrame(onehot_vectors, columns=vocab)

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,0,0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0
4,0,0,1,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0
6,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,0
9,1,0,0,0,0,0,0,0,0,0


An even prettier version:

In [17]:
df = pd.DataFrame(onehot_vectors, columns=vocab)
df[df == 0] = ''
df

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,,,,1.0,,,,,,
1,,1.0,,,,,,,,
2,,,,,,,1.0,,,
3,,,,,,,,1.0,,
4,,,1.0,,,,,,,
5,,,,,,1.0,,,,
6,,,,,,,,,,1.0
7,,,,,1.0,,,,,
8,,,,,,,,,1.0,
9,1.0,,,,,,,,,


One nice feature of this vector representation of words and tabular representation
of documents is that no information is lost. As long as you keep track of which words
are indicated by which column, you can reconstruct the original document from this
table of one-hot vectors. And this reconstruction process is 100% accurate, even
though your tokenizer was only 90% accurate at generating the tokens you thought
would be useful. As a result, one-hot word vectors like this are typically used in neural
nets, sequence-to-sequence language models, and generative language models.
They’re a good choice for any model or NLP pipeline that needs to retain all the
meaning inherent in the original text.
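The reconstruction described above can be sketched in a few lines. The example uses plain Python lists instead of NumPy so that it is self-contained; note that the original whitespace is reduced to single spaces, which makes no difference for this sentence.

```python
sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

# One row per token, one column per vocabulary entry
onehot_vectors = [[int(word == v) for v in vocab] for word in token_sequence]

# Each row contains a single 1; its column index identifies the token
reconstructed = " ".join(vocab[row.index(1)] for row in onehot_vectors)
print(reconstructed)
```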

The volume of data needed to store one-hot vectors grows enormously.

Suppose we have 3000 books with 3500 sentences each and about 15 words per sentence (short books).

In [20]:
num_rows=3000*3500*15
print(num_rows)

157500000


In [22]:
num_bytes = num_rows * 1000000  # one byte per cell, assuming a vocabulary of 1,000,000 tokens
print(num_bytes)

157500000000000


In [23]:
num_gigabytes = num_bytes/1e9
print(num_gigabytes)

157500.0


In [25]:
num_terabytes = num_gigabytes/1000
print(num_terabytes)

157.5


So we soon run out of RAM.
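For comparison, a common alternative (an assumption here, not part of the calculation above) is to store each token as a single integer index into the vocabulary rather than as a full one-hot row. A minimal sketch of the same estimate, assuming one 32-bit integer per token:

```python
num_tokens = 3000 * 3500 * 15   # same corpus size as above
bytes_per_index = 4             # assumption: one 32-bit integer per token

num_bytes = num_tokens * bytes_per_index
num_gigabytes = num_bytes / 1e9
print(num_gigabytes)
```

Under these assumptions the corpus fits in well under a gigabyte, which is why one-hot vectors are usually generated on the fly instead of being stored.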

### Bag of words

We sum all the one-hot vectors into one vector. Even for documents several pages long, a bag-of-words vector is still useful for summarizing the essence of a document. You can see that for your sentence about Jefferson, even after you sorted all the words lexically, a human can still guess what the sentence was about. So can a machine. You can use this new bag-of-words vector approach to compress the information content for each document into a data structure that’s easier to work with.

We use a dictionary:

In [26]:
sentence_bow = {}
sentence = """Thomas Jefferson began building Monticello at the age of 26."""
for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())

[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]
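The dictionary above only records presence (every value is 1). If a word occurs more than once, a true sum of one-hot vectors would hold its count. A minimal sketch with `collections.Counter`, using a sentence chosen here because it repeats a word:

```python
from collections import Counter

sentence = "I scream, you scream, we all scream for ice cream."

# Counter maps each token to the number of times it occurs
bow = Counter(sentence.lower().split())
print(bow.most_common(2))
print(bow["scream"], bow["scream,"])
```

Because `split()` keeps the commas, "scream," and "scream" are counted as different tokens, which is one more reason to improve the tokenizer.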

Prettier output:

In [27]:
import pandas as pd
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()])), columns=['sent']).T
df

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.
sent,1,1,1,1,1,1,1,1,1,1


A more elaborate example:

In [31]:
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""
sentences += """Construction was done mostly by local masons and\
 carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece\
 was Jefferson's obsession."""

corpus = {}

# split the text into sentences on '\n'
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

df

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.,...,South,Pavilion,in,1770.,Turning,a,neoclassical,masterpiece,Jefferson's,obsession.
sent0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
sent1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent2,0,0,0,0,0,0,1,0,0,0,...,1,1,1,1,0,0,0,0,0,0
sent3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,1,1,1,1,1


### Dot product and overlap of sentences

The dot product of two vectors measures their overlap:
$$\vec{u}\cdot\vec{v}=|\vec{u}||\vec{v}|\cos(\theta).$$

<img src="Dot_Product.svg.png">

In [33]:
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([2, 3, 4])

#first version
print(v1.dot(v2))

#second version
print((v1 * v2).sum())

#third version
print(sum([x1 * x2 for x1, x2 in zip(v1, v2)]))

20
20
20
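Rearranging the formula gives the cosine of the angle between the two vectors, a measure of overlap that does not depend on their lengths. A minimal sketch in plain Python for the same `v1` and `v2`:

```python
import math

v1 = [1, 2, 3]
v2 = [2, 3, 4]

dot = sum(x1 * x2 for x1, x2 in zip(v1, v2))
norm1 = math.sqrt(sum(x * x for x in v1))
norm2 = math.sqrt(sum(x * x for x in v2))

# cos(theta) = (u . v) / (|u| |v|)
cos_theta = dot / (norm1 * norm2)
print(cos_theta)
```

A value close to 1 means that the vectors point in almost the same direction.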


The overlap of sentences can be measured by means of the dot product (or, more generally, some suitably chosen [inner product](https://en.wikipedia.org/wiki/Inner_product_space)).

In [34]:
df = df.T
df

Unnamed: 0,sent0,sent1,sent2,sent3
Thomas,1,0,0,0
Jefferson,1,0,0,0
began,1,0,0,0
building,1,0,0,0
Monticello,1,0,0,1
at,1,0,0,0
the,1,0,1,0
age,1,0,0,0
of,1,0,0,0
26.,1,0,0,0


In [35]:
df.sent0.dot(df.sent1)

0

In [36]:
df.sent0.dot(df.sent2)

1

In [37]:
df.sent0.dot(df.sent3)

1

We can also extract the words shared by two sentences:

In [38]:
[(k, v) for (k, v) in (df.sent0 & df.sent3).items() if v]

[('Monticello', 1)]

### Improved tokenization

Using regular expressions, punctuation marks can be removed as well:

In [41]:
import re
sentence = """Thomas Jefferson began building Monticello at the\
 age of 26."""

tokens = re.split(r'[-\s.,;!?]+', sentence)

print(tokens)

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']


The split leaves an empty string at the end, so we filter out empty and separator-only tokens:

In [43]:
tokens = [x for x in tokens if x and x not in '- \t\n.,;!?']
print(tokens)

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']


However, when tokenizing large amounts of text it is better to **compile** the regex:

In [45]:
import re
sentence = """Thomas Jefferson began building Monticello at the\
 age of 26."""

# the capturing group makes split() return the separators too; they are filtered out below
pattern = re.compile(r"([-\s.,;!?])+")

tokens = pattern.split(sentence)

tokens = [x for x in tokens if x and x not in '- \t\n.,;!?']

print(tokens[:10])

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']


### More advanced tokenizers

There are more advanced tokenizers:
- spaCy: accurate, flexible, fast, Python
- Stanford CoreNLP: more accurate, less flexible, fast, depends on Java 8
- NLTK: standard used by many NLP contests and comparisons, popular, Python

#### RegexpTokenizer

First we use the NLTK tokenizer based on regexes, [RegexpTokenizer](https://www.nltk.org/_modules/nltk/tokenize/regexp.html):

In [47]:
from nltk.tokenize import RegexpTokenizer
sentence = """Monticello wasn't designated as UNESCO World Heritage\
 Site until 1987."""

# \$ matches a literal dollar sign; an unescaped $ would be the end-of-string anchor
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')
tokens = tokenizer.tokenize(sentence)

print(tokens)


['Monticello', 'wasn', "'t", 'designated', 'as', 'UNESCO', 'World', 'Heritage', 'Site', 'until', '1987', '.']


#### Treebank Word Tokenizer

An even better tokenizer is the [Treebank Word Tokenizer](https://www.nltk.org/api/nltk.tokenize.treebank.html#module-nltk.tokenize.treebank) from the NLTK package. It incorporates a variety of common rules for English word tokenization. For example, it separates phrase-terminating punctuation (?!.;,) from adjacent tokens and retains decimal numbers containing a period as a single token. In addition, it contains rules for English contractions. For example, “don’t” is tokenized as ["do", "n’t"]. This tokenization will help with subsequent steps in the NLP pipeline, such as stemming.

In [49]:
from nltk.tokenize import TreebankWordTokenizer
sentence = """Monticello wasn't designated as UNESCO World Heritage\
 Site until 1987."""

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence)

print(tokens)

['Monticello', 'was', "n't", 'designated', 'as', 'UNESCO', 'World', 'Heritage', 'Site', 'until', '1987', '.']


#### Casual tokenizer - informal text from social networks

The NLTK library includes a tokenizer, [casual_tokenize](https://www.nltk.org/_modules/nltk/tokenize/casual.html), that was built to deal with short, informal, emoticon-laced texts from social networks where grammar and spelling conventions vary widely.

In [51]:
from nltk.tokenize.casual import casual_tokenize
message = """RT @TJMonticello Best day everrrrrrr at Monticello.\
 Awesommmmmmeeeeeeee day :*)"""

print(casual_tokenize(message))

['RT', '@TJMonticello', 'Best', 'day', 'everrrrrrr', 'at', 'Monticello', '.', 'Awesommmmmmeeeeeeee', 'day', ':*)']


In [52]:
print(casual_tokenize(message, reduce_len=True, strip_handles=True))

['RT', 'Best', 'day', 'everrr', 'at', 'Monticello', '.', 'Awesommmeee', 'day', ':*)']


For comparison:

In [54]:
from nltk.tokenize import TreebankWordTokenizer


message = """RT @TJMonticello Best day everrrrrrr at Monticello.\
 Awesommmmmmeeeeeeee day :*)"""

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(message)

print(tokens)


['RT', '@', 'TJMonticello', 'Best', 'day', 'everrrrrrr', 'at', 'Monticello.', 'Awesommmmmmeeeeeeee', 'day', ':', '*', ')']


We see that standard tokenizers break up emoticons.

## n-grams

In the sentence:

'I scream, you scream, we all scream for ice cream.'

the phrase 'ice cream' should not be split into separate tokens.


Similarly, the phrase

'was not'

carries more meaning when treated as a single 2-gram.



A name, e.g., 'Thomas Jefferson', carries more meaning than 'Thomas' and 'Jefferson' separately.

We can use `ngrams` from NLTK.

`ngrams()` returns an iterator, so to get the whole list we convert it with `list()`.

In [58]:
sentence = """Thomas Jefferson began building Monticello at the\
 age of 26."""
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
tokens = [x for x in tokens if x and x not in '- \t\n.,;!?']
print(tokens)

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26']


In [59]:
from nltk.util import ngrams
list(ngrams(tokens, 2))

[('Thomas', 'Jefferson'),
 ('Jefferson', 'began'),
 ('began', 'building'),
 ('building', 'Monticello'),
 ('Monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', '26')]

In [60]:
list(ngrams(tokens, 3))

[('Thomas', 'Jefferson', 'began'),
 ('Jefferson', 'began', 'building'),
 ('began', 'building', 'Monticello'),
 ('building', 'Monticello', 'at'),
 ('Monticello', 'at', 'the'),
 ('at', 'the', 'age'),
 ('the', 'age', 'of'),
 ('age', 'of', '26')]

We can join each n-gram into a single string:

In [61]:
two_grams = list(ngrams(tokens, 2))
[" ".join(x) for x in two_grams]

['Thomas Jefferson',
 'Jefferson began',
 'began building',
 'building Monticello',
 'Monticello at',
 'at the',
 'the age',
 'age of',
 'of 26']
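If NLTK is not at hand, n-grams are easy to build in plain Python. The helper name `make_ngrams` below is hypothetical; it is a sketch based on `zip`, not the NLTK implementation.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello']
bigrams = make_ngrams(tokens, 2)
print([" ".join(gram) for gram in bigrams])
```

`zip` stops at the shortest slice, so a list with fewer than `n` tokens simply yields no n-grams.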

## Stop words

**Stop words** are common words in any language that occur with a high frequency but
carry much less substantive information about the meaning of a phrase. Examples of
some common stop words include:

- a, an
- the, this
- and, or
- of, on

Note that stop words sometimes carry important information:
- Mark reported **to** the CEO
- Suzanne reported **as** the CEO **to** the board

These relationships can be extracted using 4-grams:

- reported to the CEO
- reported as the CEO
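The extraction of these 4-grams can be sketched as follows. The helper `word_ngrams` is hypothetical, and the filter on the word "reported" is chosen only for this example.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentences = [
    "Mark reported to the CEO",
    "Suzanne reported as the CEO to the board",
]

for sent in sentences:
    four_grams = word_ngrams(sent.split(), 4)
    # keep only the 4-grams that describe the reporting relationship
    print([gram for gram in four_grams if gram.startswith("reported")])
```

Had the stop words "to", "as" and "the" been removed first, both sentences would reduce to "reported CEO" and the difference between them would be lost.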

A simple example of filtering stop words:

In [63]:
stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']
tokens = ['the', 'house', 'is', 'on', 'fire']
tokens_without_stopwords = [x for x in tokens if x not in stop_words]
print("tokens = ", tokens)
print("tokens without stop words = ", tokens_without_stopwords)

tokens =  ['the', 'house', 'is', 'on', 'fire']
tokens without stop words =  ['house', 'fire']


[NLTK](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip) contains stop-word lists for several languages; Polish is not among them.

In [1]:
import nltk
#nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
print(len(stop_words))
print(stop_words[:7])

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours']


The set of English stop words that sklearn uses is quite different
from those in NLTK. At the time of this writing, sklearn has 318 stop words.
Even NLTK upgrades its corpora periodically, including the stop words list.

In [4]:
from sklearn.feature_extraction.text import\
ENGLISH_STOP_WORDS as sklearn_stop_words

print(len(sklearn_stop_words))
print(len(stop_words))
print(len(set(stop_words).union(sklearn_stop_words)))
print(len(set(stop_words).intersection(sklearn_stop_words)))

list(sklearn_stop_words)[:20]

318
179
378
119


['there',
 'take',
 'my',
 'now',
 'been',
 'throughout',
 'her',
 'had',
 'eg',
 'is',
 'here',
 'although',
 'herein',
 'until',
 'no',
 'per',
 'also',
 'already',
 'even',
 'afterwards']

Since the lists differ, use them with caution!

## Normalization of vocabulary

In order to reduce the size of the bag of words, we can do **case folding**, i.e., normalize the capitalization of letters, e.g.,
- doctor = Doctor

It can enormously reduce the size of the BoW; e.g., all sentence-initial words will be normalized.

A simple normalization can be done by converting all tokens to lowercase:

In [71]:
tokens = ['House', 'Visitor', 'Center']
normalized_tokens = [x.lower() for x in tokens]
print(normalized_tokens)

['house', 'visitor', 'center']
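The effect on the vocabulary size can be seen in a small sketch. The token list is made up for illustration.

```python
tokens = ['The', 'doctor', 'called', 'the', 'Doctor']

vocab = set(tokens)                              # case-sensitive vocabulary
folded_vocab = {tok.lower() for tok in tokens}   # vocabulary after case folding

print(len(vocab), sorted(vocab))
print(len(folded_vocab), sorted(folded_vocab))
```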


**Note:** There are problems with some specific names such as:
- FedEx
- WordPerfect
- stringVariableName

written in [camel case](https://en.wikipedia.org/wiki/Camel_case).


A similar problem appears with [homonyms](https://en.wikipedia.org/wiki/Homonym) that differ only in capitalization, e.g.,
- Joe in Joe Smith
- joe as a [coffee company](https://joecoffeecompany.com/)


To avoid this potential loss of information, many NLP pipelines don’t normalize for case at all. For many applications, the efficiency gain (in storage and processing) for reducing one’s vocabulary size by about half is outweighed by the loss of information for proper nouns.


# Stemming

Another common vocabulary normalization technique is to eliminate the small meaning differences of pluralization or possessive endings of words, or even various verb forms. This normalization, identifying a common stem among various forms of a word, is called stemming. For example, the words housing and houses share the *same stem*, house. 

**Stemming removes suffixes from words in an attempt to combine words with similar meanings together under their common stem.**

A stem isn’t required to be a properly spelled word, but merely a token, or label, representing several possible spellings of a word.

Examples:

- singular-plural: house = houses
- conjugation: sing = singing

Stemming can change the meaning of the sentence:

“Dr. House’s calls” $\neq$ “dr house call”


A simple stemmer could look like:

In [73]:
def stem(phrase):
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'") for word in phrase.lower().split()])

print(stem('houses'))

house


In [74]:
print(stem("Doctor House's calls"))

doctor house call
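To see where such a simple rule breaks down, the sketch below repeats the same regular expression in a self-contained form and applies it to a few more words. The word list is chosen here for illustration.

```python
import re

def stem(phrase):
    # same rule as above: strip one trailing "s" unless the word ends in "ss"
    return ' '.join(re.findall(r'^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
                    for word in phrase.lower().split())

for word in ['houses', 'glass', 'bus', 'lens', 'running']:
    print(word, '->', stem(word))
```

The rule handles "houses" and "glass", but it cuts "bus" and "lens" down to "bu" and "len" and leaves "running" untouched, which is why real stemmers need many more rules.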


### Porter stemmer

Two of the most popular stemming algorithms are the Porter and Snowball stemmers. The Porter stemmer is named for the computer scientist [Martin Porter](https://en.wikipedia.org/wiki/Martin_Porter). 

Porter is also responsible for enhancing the Porter stemmer to create the Snowball stemmer. Porter dedicated much of his lengthy career to documenting and improving stemmers, due to their value in information retrieval (keyword search). These stemmers implement more complex rules than our simple regular expression. This enables the stemmer to handle the complexities of English spelling and word ending rules.

The Porter and Snowball stemmers are implemented in [NLTK](https://www.nltk.org/howto/stem.html):

In [76]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
# named `stemmed` so that the stem() function defined above is not overwritten
stemmed = ' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])
print(stemmed)

dish washer wash dish


Snowball stemmer:

In [6]:
from nltk.stem.snowball import SnowballStemmer
print(" ".join(SnowballStemmer.languages))

arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish


In [7]:
stemmer = SnowballStemmer("english")
stemmed = ' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])
print(stemmed)

dish washer wash dish


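The Snowball stemmer can also ignore stop words (`ignore_stopwords=True`, which needs the NLTK stop-word corpus). With the option commented out, as here, both stemmers behave identically: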

In [14]:
#stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
stemmer2 = SnowballStemmer("english")
print(stemmer.stem("I am robots"))
print(stemmer2.stem("I am robots"))

i am robot
i am robot


[More on Porter stemmer](https://github.com/jedijulia/porter-stemmer/blob/master/stemmer.py):
- It contains about 300 lines of code.
- It contains a few steps: 1a, 1b, 1c, 2, 3, 4, 5a, 5b.

The steps are as follows:

- Step 1a: “s” and “es” endings
- Step 1b: “ed,” “ing,” and “at” endings
- Step 1c: “y” endings
- Step 2: “nounifying” endings such as “ational,” “tional,” “ence,” and “able”
- Step 3: adjective endings such as “icate,” “ful,” and “alize”
- Step 4: adjective and noun endings such as “ive,” “ible,” “ent,” and “ism”
- Step 5a: stubborn “e” endings, still hanging around
- Step 5b: trailing double consonants for which the stem will end in a single “l”

[Example of step 1a](https://github.com/jedijulia/porter-stemmer/blob/master/stemmer.py):

In [80]:
def step1a(self, word):
        if word.endswith('sses'):
            word = self.replace(word, 'sses', 'ss')
        elif word.endswith('ies'):
            word = self.replace(word, 'ies', 'i')
        elif word.endswith('ss'):
            word = self.replace(word, 'ss', 'ss')
        elif word.endswith('s'):
            word = self.replace(word, 's', '')
        else:
            pass
        return word
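The excerpt above is a method and relies on `self.replace`. A stand-alone sketch of the same step, assuming that `replace` simply swaps the given suffix, could look like this:

```python
def step1a(word):
    """Porter step 1a as a plain function: normalize "s" and "es" endings."""
    if word.endswith('sses'):
        return word[:-4] + 'ss'   # caresses -> caress
    if word.endswith('ies'):
        return word[:-3] + 'i'    # ponies -> poni
    if word.endswith('ss'):
        return word               # caress -> caress
    if word.endswith('s'):
        return word[:-1]          # cats -> cat
    return word

for word in ['caresses', 'ponies', 'caress', 'cats', 'sing']:
    print(word, '->', step1a(word))
```

The order of the tests matters: the longer suffixes have to be checked before the bare "s".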

## Lemmatization

If you have access to information about connections between the meanings of various
words, you might be able to associate several words together even if their spelling is
quite different. This more extensive normalization down to the semantic root of a
word—its lemma—is called **lemmatization**.

**Warning:** Lemmatization is a very aggressive procedure and can remove a lot of a word's meaning:
- “chat,” “chatter,” “chatty,” “chatting,” “chatbot” = chat ?
- bank, banking, banked = bank ?

A badly designed lemmatizer can make it easier to **spoof** a chatbot, i.e., to force it to generate a wrong response.


Some lemmatizers use the word’s part of speech (POS) tag in addition to its spelling to help improve accuracy. The POS tag for a word indicates its role in the grammar of a phrase or sentence. For example, the noun POS is for words that refer to “people, places, or things” within a phrase. An adjective POS is for a word that modifies or describes a noun. A verb refers to an action. The POS of a word in isolation cannot be determined; the context of a word must be known for its POS to be identified. So some advanced lemmatizers can’t be run on words in isolation.
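Why the POS tag matters can be illustrated with a toy lemmatizer. Everything in this sketch is hypothetical: the lookup table, its three entries and the tags ("v" for verb, "n" for noun, "a" for adjective) are made up for illustration and are not how a real lemmatizer stores its data.

```python
# Hypothetical lookup table: (word, POS tag) -> lemma
LEMMAS = {
    ("meeting", "v"): "meet",     # "we are meeting today"
    ("meeting", "n"): "meeting",  # "the meeting was long"
    ("better", "a"): "good",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMAS.get((word, pos), word)

print(lemmatize("meeting", "v"))
print(lemmatize("meeting", "n"))
```

The same spelling maps to different lemmas depending on the tag, so without context the lemmatizer has to guess.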


We use the [WordNetLemmatizer from NLTK](https://www.nltk.org/api/nltk.stem.wordnet.html).
Its `pos` (part of speech) argument can be set to:
- “n” for nouns, 
- “v” for verbs, 
- “a” for adjectives, 
- “r” for adverbs,
- “s” for satellite adjectives.

In [15]:
#nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better"))
print(lemmatizer.lemmatize("good", pos="a"))
print(lemmatizer.lemmatize("goods", pos="a"))
print(lemmatizer.lemmatize("goods", pos="n"))
print(lemmatizer.lemmatize("goodness", pos="n"))
print(lemmatizer.lemmatize("best", pos="a"))

better
good
goods
good
goodness
best


Note that the lemmatizer leaves 'goodness' unchanged and returns 'goods' (a plural) as it is when the word is tagged as an adjective. The Porter stemmer reduces 'goodness' to 'good':

In [88]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('goodness'))

good


One must consider whether lemmatization or stemming is better for a given application.

## Use of NLTK

From NLTK webpage:

 If you publish work that uses NLTK, please cite the NLTK book as follows:

Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.


# Sentiment analysis

An important piece of information carried by a word is its sentiment, the overall feeling or emotion that the word evokes.

**Sentiment analysis**, measuring the sentiment of phrases or chunks of text, is a common application of NLP.


*In many companies it’s the main thing an NLP engineer is asked to do.*

Applications:
- scoring (products)
- mail filtering
- chatbot understanding of mood

There are two approaches:
- a rule-based algorithm composed by a human, e.g., the VADER algorithm,
- a machine learning model learned from data by a machine: any supervised learning algorithm, e.g., naive Bayes.

## VADER

[VADER (Valence Aware Dictionary for sEntiment Reasoning)](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf) is a rule-based algorithm, and several NLTK functions implement parts of it. The algorithm was designed by Hutto and Gilbert at Georgia Tech.

The source code is available on [GitHub](https://github.com/cjhutto/vaderSentiment).


NLTK implementation:

`nltk.sentiment.vader`

Installation of the original implementation:

`pip install vaderSentiment`

In [116]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sa = SentimentIntensityAnalyzer()

print("Items in lexicon = ", len(sa.lexicon.items()))

print(sa.lexicon)


Items in lexicon =  7506


VADER deals well with emoticons. However, if you use a stemmer (or lemmatizer) in your pipeline, you’ll need to apply that stemmer to the VADER lexicon, too, for all the words that go together in a single stem or lemma.

Each occurrence of an emotional word or phrase from the lexicon is scored.

NLTK example:

In [115]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
#nltk.download('vader_lexicon')

sa = SentimentIntensityAnalyzer()

print("Items in lexicon = ", len(sa.lexicon.items()))

print(sa.lexicon)



Items in lexicon =  7502


There are also multi-word entries (such as 2-grams) in the lexicon:

In [104]:
[(tok, score) for tok, score in sa.lexicon.items() if " " in tok]

[("( '}{' )", 1.6),
 ("can't stand", -2.0),
 ('fed up', -1.8),
 ('screwed up', -1.5)]

The VADER algorithm considers the intensity of sentiment polarity in three separate scores (positive, negative, and neutral) and then combines them together into a compound positivity sentiment.

In [106]:
sa.polarity_scores(text="Python is very readable and it's great for NLP.")

{'neg': 0.0, 'neu': 0.661, 'pos': 0.339, 'compound': 0.6249}

In [107]:
sa.polarity_scores(text="Python is not a bad choice for most applications.")

{'neg': 0.0, 'neu': 0.711, 'pos': 0.289, 'compound': 0.431}
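How a compound score can arise from a lexicon is sketched below. The lexicon and its values are made up, and the normalization x / sqrt(x² + α) with α = 15 is our reading of the reference implementation, so it is worth checking against the source. The sketch ignores negation, intensifiers, punctuation emphasis and emoticons, which the real algorithm handles, so its numbers will differ from the outputs above.

```python
import math

# Tiny made-up lexicon for illustration; the real VADER lexicon has about 7,500 entries
toy_lexicon = {"great": 3.0, "readable": 1.0, "horrible": -2.5, "useless": -2.0}

def compound(text, alpha=15):
    """Sum the valence of known tokens and squash the total into (-1, 1)."""
    tokens = [tok.strip(".,;!?").lower() for tok in text.split()]
    total = sum(toy_lexicon.get(tok, 0.0) for tok in tokens)
    return total / math.sqrt(total * total + alpha)

print(compound("Python is very readable and it's great for NLP."))
print(compound("Horrible! Completely useless."))
```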

Another example:

In [109]:
corpus = ["Absolutely perfect! Love it! :-) :-) :-)", "Horrible! Completely useless. :(", "It was OK. Some good and some bad things."]
for doc in corpus:
    scores = sa.polarity_scores(doc)
    print('{:+}: {}'.format(scores['compound'], doc))

+0.9428: Absolutely perfect! Love it! :-) :-) :-)
-0.8768: Horrible! Completely useless. :(
+0.3254: It was OK. Some good and some bad things.


**Note:** VADER looks only for about 7500 phrases that can determine sentiment. However, new phrases and new [collocations](https://en.wikipedia.org/wiki/Collocation) that express sentiment are coined all the time, so the VADER lexicon must be updated constantly.

## Machine Learning - Naive Bayesian

The [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is one of the simplest supervised learning classifiers.

[We use the NB classifier from scikit-learn](https://scikit-learn.org/stable/modules/naive_bayes.html#)

#### Movies

In [179]:
#reading data
import pandas as pd
movies = pd.read_csv('MoviesAmazon.txt', sep = '\t')
movies.head()

Unnamed: 0,id,sentiment,text
0,1,2.266667,The Rock is destined to be the 21st Century's ...
1,2,3.533333,The gorgeously elaborate continuation of ''The...
2,3,-0.6,Effective but too tepid biopic
3,4,1.466667,If you sometimes like to go to the movies to h...
4,5,1.733333,"Emerges as something rare, an issue movie that..."


In [180]:
bags_of_words = []
from collections import Counter
for text in movies.text:
    bags_of_words.append(Counter(casual_tokenize(text)))

print(bags_of_words)



In [181]:
df_bows = pd.DataFrame.from_records(bags_of_words)
df_bows = df_bows.fillna(0).astype(int)

print(df_bows.shape)
df_bows.head()

(10605, 20756)


Unnamed: 0,The,Rock,is,destined,to,be,the,21st,Century's,new,...,Ill,slummer,Rashomon,dipsticks,Bearable,Staggeringly,’,ve,muttering,dissing
0,1,1,1,1,2,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,4,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [182]:
df_bows.head()[list(bags_of_words[0].keys())]

Unnamed: 0,The,Rock,is,destined,to,be,the,21st,Century's,new,...,Schwarzenegger,",",Jean,Claud,Van,Damme,or,Steven,Segal,.
0,1,1,1,1,2,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,2,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,4,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1


In [183]:
print(movies.sentiment.shape[0])
print(df_bows.shape[0])
print(movies.sentiment[:10])
print(df_bows[:10])

10605
10605
0    2.266667
1    3.533333
2   -0.600000
3    1.466667
4    1.733333
5    2.533333
6    2.466667
7    1.266667
8    1.933333
9    1.733333
Name: sentiment, dtype: float64
   The  Rock  is  destined  to  be  the  21st  Century's  new  ...  Ill  \
0    1     1   1         1   2   1    1     1          1    1  ...    0   
1    2     0   1         0   0   0    1     0          0    0  ...    0   
2    0     0   0         0   0   0    0     0          0    0  ...    0   
3    0     0   1         0   4   0    1     0          0    0  ...    0   
4    0     0   0         0   0   0    0     0          0    0  ...    0   
5    1     0   0         0   0   0    3     0          0    0  ...    0   
6    0     0   0         0   0   0    0     0          0    0  ...    0   
7    0     0   1         0   1   0    1     0          0    0  ...    0   
8    0     0   0         0   1   0    1     0          0    0  ...    0   
9    0     0   0         0   0   0    0     0          0    0  ...

In [184]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

nb = nb.fit(df_bows, movies.sentiment > 0)

#print( nb.predict_proba(df_bows)[:,0] + nb.predict_proba(df_bows)[:,1])
#print(nb.predict_proba(df_bows))

movies['predicted_sentiment'] = nb.predict_proba(df_bows)[:,1] * 8 - 4 #scale positive probability to [-4;4]

movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()

movies.error.mean().round(1)

movies['sentiment_ispositive'] = (movies.sentiment > 0).astype(int)
movies['predicted_ispositive'] = (movies.predicted_sentiment > 0).astype(int)
movies['''sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'''.split()].head(8)

print((movies.predicted_ispositive == movies.sentiment_ispositive).sum() / len(movies))

print(nb.score(df_bows, movies.sentiment > 0))

0.9344648750589345
0.9344648750589345


Evaluating the classifier on the same data that was used for training is not a correct approach. However, even this shows that NB makes quite good predictions.

Exercise: do a train-test split and then cross-validate the classifier.
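As a starting point for the exercise, the sketch below shows the idea of a held-out test set with a tiny multinomial naive Bayes written in plain Python. The eight labeled documents are made up, the split is deterministic rather than random, and the helpers `train_nb` and `predict` are hypothetical. With scikit-learn one would typically use `train_test_split` and `cross_val_score` from `sklearn.model_selection` instead.

```python
import math
from collections import Counter

# Made-up labeled documents (True = positive); stand-ins for the review data
docs = [
    ("great movie loved it", True), ("wonderful acting great story", True),
    ("loved the wonderful cast", True), ("great fun loved every minute", True),
    ("terrible movie hated it", False), ("boring story terrible acting", False),
    ("hated the boring plot", False), ("terrible waste hated every minute", False),
]

test = docs[3::4]                                      # hold out every fourth document
train = [d for i, d in enumerate(docs) if i % 4 != 3]  # fit only on the rest

def train_nb(examples):
    """Collect per-class token counts, per-class document counts and the vocabulary."""
    token_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        token_counts[label].update(text.split())
        doc_counts[label] += 1
    vocab = set(token_counts[True]) | set(token_counts[False])
    return token_counts, doc_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    token_counts, doc_counts, vocab = model
    scores = {}
    for label in (True, False):
        total = sum(token_counts[label].values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for tok in text.split():
            score += math.log((token_counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb(train)
accuracy = sum(predict(text, model) == label for text, label in test) / len(test)
print(accuracy)
```

The accuracy is measured only on documents the model has never seen, which is the point of the split. On a toy data set this small the number itself says little.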

#### Products

In [188]:
#reading data
import pandas as pd
products = pd.read_csv('ProductsAmazon.txt', sep = '\t')
#print(products.head())

bags_of_words = []
for text in products.text:
    bags_of_words.append(Counter(casual_tokenize(text)))
    
df_bows = pd.DataFrame.from_records(bags_of_words)
df_bows = df_bows.fillna(0).astype(int)

print(df_bows.shape)
df_bows.head()

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

nb = nb.fit(df_bows, products.sentiment > 0)

#print( nb.predict_proba(df_bows)[:,0] + nb.predict_proba(df_bows)[:,1])
#print(nb.predict_proba(df_bows))

products['predicted_sentiment'] = nb.predict_proba(df_bows)[:,1] * 8 - 4 #scale positive probability to [-4;4]

products['error'] = (products.predicted_sentiment - products.sentiment).abs()

products.error.mean().round(1)

products['sentiment_ispositive'] = (products.sentiment > 0).astype(int)
products['predicted_ispositive'] = (products.predicted_sentiment > 0).astype(int)
products['''sentiment predicted_sentiment sentiment_ispositive predicted_ispositive'''.split()].head(8)

print((products.predicted_ispositive == products.sentiment_ispositive).sum() / len(products))

print(nb.score(df_bows, products.sentiment > 0))

(3546, 5687)
0.8846587704455725
0.8846587704455725


Again, do a train-test split and cross-validation.