# Text Preprocessing

Text preprocessing is an important step in the use of unstructured text documents for any type of data mining, information retrieval, or text analytics.
This lab uses the Python Natural Language Toolkit (NLTK), together with scikit-learn, to demonstrate the tools available for text preprocessing.
Specifically, we are looking at the concepts of
  1. Stop Words
  1. Stemming
  1. Lemmatization
  
In later labs, these steps will be handled for us automatically as we build upon information retrieval.
However, these are still key concepts to see in action.
You will see them again as we continue to move forward with our text analytics in future modules.


## Stop Words

Text documents often contain many occurrences of the same word. 
For example, in a document written in *English*, words such as *a, the, of, and it* are likely to occur very frequently. 
When classifying a document based on the number of times specific words occur in the text document, 
these words can lead to biases, especially since they are generally common in **all** text documents you might want to classify. 
As a result, the concept of [_stop words_](https://en.wikipedia.org/wiki/Stop_words) was invented. 
Basically, these words are the most commonly occurring words that should be removed during the tokenization process in order to improve subsequent text analytics efforts. 
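
To make the frequency bias concrete, the short sketch below counts words in a made-up sentence using only the Python standard library. The sample text is an illustrative assumption, not data from this lab.

```python
from collections import Counter

# Made-up sample text (an assumption for illustration only)
sample_text = "the cat sat on the mat and the dog sat on the rug"

# Count how often each whitespace-separated word occurs
word_counts = Counter(sample_text.split())

# The most frequent words say little about the topic of the text
print(word_counts.most_common(3))
```

The word *the* dominates the counts even though it tells us nothing about what the sentence describes.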

We can easily specify that the __English__ stop words should be excluded during tokenization by setting the `stop_words` parameter of `CountVectorizer` to `'english'`. 
Note, _stop word_ dictionaries for other languages, or even specific domains, exist and can be used instead. 
We demonstrate the removal of stop words by using a `CountVectorizer` in the following simple example and compare it to using `CountVectorizer` without stop words removed.

-----

In [3]:
# Define our vectorizer

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(analyzer='word', lowercase=True)

# Sample sentence to tokenize
my_text = 'This module introduced many concepts in text analysis.'

cv1 = CountVectorizer(lowercase=True)
cv2 = CountVectorizer(stop_words = 'english', lowercase=True)

tk_func1 = cv1.build_analyzer()
tk_func2 = cv2.build_analyzer()

import pprint
pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

print('Tokenization:')
pp.pprint(tk_func1(my_text))

print()

print('Tokenization (stop words removed):')
pp.pprint(tk_func2(my_text))

Tokenization:
['this', 'module', 'introduced', 'many', 'concepts', 'in', 'text', 'analysis']

Tokenization (stop words removed):
['module', 'introduced', 'concepts', 'text', 'analysis']
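
As noted above, a domain-specific stop word list can replace the built-in English one. The sketch below illustrates the idea with the standard library only. The regular expression approximates the default `CountVectorizer` token pattern (tokens of two or more word characters), and `domain_stop_words` is a hypothetical list chosen just for this example.

```python
import re

my_text = 'This module introduced many concepts in text analysis.'

# Hypothetical domain-specific stop words (an assumption for this example)
domain_stop_words = {'this', 'in', 'many', 'module', 'text'}

def tokenize(text, stop_words):
    """Lowercase, split into tokens of 2+ word characters, drop stop words."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return [t for t in tokens if t not in stop_words]

filtered = tokenize(my_text, domain_stop_words)
print(filtered)
```

With scikit-learn, a list like this can be passed directly, for example `CountVectorizer(stop_words=list(domain_stop_words))`.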


 

--- 
## Stemming


We have looked at the removal of redundant or unimportant words, i.e., _stop words_. 
However, an issue still exists because of different word forms of the same base term, for example, compute, computer, computed, and computing. 
The process of reducing words to their root term or basic form (by removing prefixes and suffixes) is known as [stemming](https://en.wikipedia.org/wiki/Stemming). 
After stemming, token frequencies reflect the use of the root token rather than being spread across multiple similar tokens. 

The most widely used stemmer, or program/method that performs stemming, is the _Porter Stemmer_, which was originally published in 1980 by Martin Porter. 
An improved version was released in 2000, which fixed a number of errors. 
NLTK includes the Porter Stemmer.
To use it with scikit-learn, we can write a custom function that tokenizes a text document and stems each token, and then pass this function to the `CountVectorizer` via the `tokenizer` parameter. 
By performing stemming inside this tokenize function, we can return a list of tokens for a document that have been stemmed. 
In the following code cells, we apply the Porter Stemmer directly: first to a list of example words, and then to the list of tokens that nltk builds from a sentence.

-----


In [4]:
import string
import nltk
from nltk.stem.porter import PorterStemmer

## See how the PorterStemmer behaves with a list of example words

example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
stemmer = PorterStemmer()

for w in example_words:
    print(stemmer.stem(w))

python
python
python
python
pythonli


In [5]:
## See how the PorterStemmer works with tokenizing on a very pythony sentence

new_text = "It is important to be very pythonly while you are pythoning with python. \
All pythoners have pythoned poorly at least once."

tokens = nltk.word_tokenize(new_text)
tokens = [token for token in tokens if token not in string.punctuation]

for w in tokens:
    print(stemmer.stem(w))

It
is
import
to
be
veri
pythonli
while
you
are
python
with
python
all
python
have
python
poorli
at
least
onc
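
The cells above call the stemmer directly. The custom tokenizer idea described earlier could be sketched roughly as follows. The function name `stem_tokenize` and the simple regular-expression tokenizer are assumptions made for this sketch (the lab itself uses `nltk.word_tokenize`); only `PorterStemmer` comes from NLTK.

```python
import re
from collections import Counter
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokenize(text):
    """Split text into lowercase word tokens and stem each one."""
    tokens = re.findall(r"\w+", text.lower())
    return [stemmer.stem(token) for token in tokens]

stems = stem_tokenize("All pythoners have pythoned poorly at least once.")
print(stems)

# Different surface forms ("pythoners", "pythoned") now share one count
stem_counts = Counter(stems)
print(stem_counts["python"])
```

A function like this can be passed as `CountVectorizer(tokenizer=stem_tokenize)` so that the vectorizer counts stems instead of raw tokens.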


-----

## Lemmatization


In linguistics, lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. 
By "inflected" we mean the form of a word has been changed to express a particular grammatical function or attribute, typically tense, mood, person, number, case, and gender.

In computational linguistics, lemmatization is the algorithmic process of determining the lemma for a given word. 
The process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence, which requires knowledge of the grammar of a language.

In many languages, words appear in several inflected forms. 
For example, in English, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’, ‘walking’. 
The base form, ‘walk’, that one might look up in a dictionary, is called the lemma for the word. 

Lemmatization is closely related to stemming. 
The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. 
However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

-----

In [6]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('dogs')

'dog'

Let's try to understand the difference between **Stemming** and **Lemmatization**. Look at the output of the following code, and then read the explanation that follows.

In [7]:
from nltk import pos_tag

print("Stem %s: %s" % ("going", stemmer.stem("going")))
print("Stem %s: %s" % ("gone", stemmer.stem("gone")))
print("Stem %s: %s" % ("goes", stemmer.stem("goes")))
print("Stem %s: %s" % ("went", stemmer.stem("went")))
print("\n")

print("Without context")
print("Lemmatise %s: %s" % ("going", lemmatizer.lemmatize("going")))
print("Lemmatise %s: %s" % ("gone", lemmatizer.lemmatize("gone")))
print("Lemmatise %s: %s" % ("goes", lemmatizer.lemmatize("goes")))
print("Lemmatise %s: %s" % ("went", lemmatizer.lemmatize("went")))
print("\n")

print("With context")
print("Lemmatise %s: %s" % ("going", lemmatizer.lemmatize("going", pos="v")))
print("Lemmatise %s: %s" % ("gone", lemmatizer.lemmatize("gone", pos="v")))
print("Lemmatise %s: %s" % ("goes", lemmatizer.lemmatize("goes", pos="v")))
print("Lemmatise %s: %s" % ("went", lemmatizer.lemmatize("went", pos="v")))

Stem going: go
Stem gone: gone
Stem goes: goe
Stem went: went


Without context
Lemmatise going: going
Lemmatise gone: gone
Lemmatise goes: go
Lemmatise went: went


With context
Lemmatise going: go
Lemmatise gone: go
Lemmatise goes: go
Lemmatise went: go


We can observe that the stemming process does not always generate a real word (for example, `goe`), only a root form. 
The lemmatizer, on the other hand, generates real words, 
but without a part-of-speech (POS) hint it treats every word as a noun by default and cannot recognize verb forms. 
Hence, in most of the examples above, lemmatization without context leaves the word unchanged. 

The context is provided by the POS tag ("v" for verb in this example). 
We cannot specify a POS tag by hand every time we lemmatize words in a text. 
Instead, NLTK can generate POS tags automatically with the `pos_tag()` function.

In [8]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
 
s = "This is a simple sentence"
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens) 
 
print(tokens_pos)

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN')]


So the `pos_tag()` function assigns a part-of-speech tag to each token in the text. 
The outputs 'DT', 'VBZ', etc. are tags representing parts of speech from the [Penn Treebank tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
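
The lemmatizer expects WordNet-style POS codes (`n`, `v`, `a`, `r`) rather than Penn Treebank tags, so the two are usually bridged with a small mapping. The helper below is one possible sketch. The name `penn_to_wordnet` and the fall-back to noun are assumptions for illustration, and the tagged tokens are copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('RB'):
        return 'r'   # adverb
    return 'n'       # fall back to noun for everything else

# Tagged tokens copied from the pos_tag() output above
tokens_pos = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
              ('simple', 'JJ'), ('sentence', 'NN')]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tokens_pos]
print(wordnet_pos)
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=code)`.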

## Check Point
Stop Words, Stemming, and Lemmatization are important pre-processing steps in text analytics applications. 
You can incorporate the off-the-shelf solutions offered by NLTK into your text analysis applications.
Additionally, many code libraries and applications that perform more advanced text analytical processes incorporate these techniques in them by default.

Below is some practice coding for you to experiment with the NLTK functionality above.


In [9]:
StringAction = "We are meeting"
StringNoun =  "We had a meeting"

In [11]:
# 1) Compare the result of parsing the two 
# variables, StringAction and StringNoun 
# using the tokenizer with the english stop words removed
# ----------------------------------------

# Import libraries
import pprint
from sklearn.feature_extraction.text import CountVectorizer

# Building the vectorizing tokenizer
cv = CountVectorizer(stop_words = 'english', lowercase=True)
tk_function = cv.build_analyzer()

# Print out the comparison of stop-word enabled tokenizing the two variables
pp = pprint.PrettyPrinter(indent=2, depth=1, width=80, compact=True)

print('Tokenization:')
print("'{}':".format(StringAction))
# -------- EDIT NEXT LINE --------
pp.pprint(tk_function(StringAction))

print("  ... vs ...  ")

print('Tokenization:')
print("'{}':".format(StringNoun))
# -------- EDIT NEXT LINE --------
pp.pprint(tk_function(StringNoun))





Tokenization:
'We are meeting':
['meeting']
  ... vs ...  
Tokenization:
'We had a meeting':
['meeting']


In [23]:
# 2) Compare the result of parsing the two 
# variables, StringAction and StringNoun 
# using the stemmer with the english stop words removed
# ----------------------------------------

# Import libraries
import string
import nltk
from nltk.stem.porter import PorterStemmer

### Add your code below to parse and stem the two variables
from nltk.corpus import stopwords
stop = stopwords.words('english')

Action_tokens = nltk.word_tokenize(StringAction.lower())
Action_tokens = [token for token in Action_tokens if token not in stop]

Noun_tokens = nltk.word_tokenize(StringNoun.lower())
Noun_tokens = [token for token in Noun_tokens if token not in stop]

print("'{}':".format(StringAction))
for w in Action_tokens:
    print(stemmer.stem(w))

print("    ")
print("  ... vs ...  ")
print("    ")


print("'{}':".format(StringNoun))
for w in Noun_tokens:
    print(stemmer.stem(w))


'We are meeting':
meet
    
  ... vs ...  
    
'We had a meeting':
meet


In [24]:
# 3) Compare the result of parsing the two 
# variables, StringAction and StringNoun 
# using the lemmatization with the english stop words removed
# ----------------------------------------

# Import libraries
from nltk.stem import WordNetLemmatizer


### Add your code below to parse and lemmatize the two variables
from nltk.corpus import stopwords
from nltk import word_tokenize

stop = stopwords.words('english')

Action_tokens = [token for token in word_tokenize(StringAction.lower()) if token not in stop]
Noun_tokens = [token for token in word_tokenize(StringNoun.lower()) if token not in stop]

print("Lemmatization")
print(40*'-')
print("'{}':".format(StringAction))
for w in Action_tokens:
    print("No pos: ", lemmatizer.lemmatize(w))
    # use part of speech verb
    print("With pos v: ", lemmatizer.lemmatize(w, pos='v'))
    
print("    ")
print("  ... vs ...  ")
print("    ")

print("'{}':".format(StringNoun))
for w in Noun_tokens:
    print("No pos: ", lemmatizer.lemmatize(w))
    # use part of speech noun
    print("With pos n: ", lemmatizer.lemmatize(w, pos="n"))






Lemmatization
----------------------------------------
'We are meeting':
No pos:  meeting
With pos v:  meet
    
  ... vs ...  
    
'We had a meeting':
No pos:  meeting
With pos n:  meeting


## Text Preprocessing in Full Text Search

Now that we have seen these concepts in isolation, let's revisit the PostgreSQL Full Text Search.

Specifically, you have loaded a new table that breaks each document from the book in the lab into individual lines.  
You can review the load process for this data in [this notebook](./Practice_Load_BookLines.ipynb).
The notebook includes a few queries, showing how the loaded lines may be a little more useful than just document matching.

<span style="background-color:yellow">For the commands below, replace the schema name `sebcq5` with your own pawprint.</span>

```SQL
dsa_student=# select count(*),sum(length(line)) from sebcq5.booklines;
 count |   sum
-------+---------
 31259 | 4315223
(1 row)
```

#### 31K lines

#### Looking at a random line that was added:

```SQL
dsa_student=# \x
Expanded display is on.
dsa_student=# select * from sebcq5.booklines where id = 31236;
-[ RECORD 1 ]-+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id            | 31236
name          | /dsa/data/all_datasets/book/zeph.txt
line_no       | 34
line          | 2:14: And flocks shall lie down in the midst of her, all the beasts of the nations: both the cormorant and the bittern shall lodge in the upper lintels of it; their voice shall sing in the windows; desolation shall be in the thresholds: for he shall uncover the cedar work.
line_tsv_gin  | '14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 'uncov':49 'upper':29 'voic':34 'window':39 'work':52
line_tsv_gist | '14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 'uncov':49 'upper':29 'voic':34 'window':39 'work':52
```

Notice that we have built a document vector that is stemmed and has had common (stop) words removed.


We see that the line 

    2:14: And flocks shall lie down in the midst of her, 
    all the beasts of the nations: both the cormorant and 
    the bittern shall lodge in the upper lintels of it; 
    their voice shall sing in the windows; desolation shall 
    be in the thresholds: for he shall uncover the cedar work.

is tokenized into the **_text search vector_ (tsv)**: 
```
'14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 
'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 
'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 
'uncov':49 'upper':29 'voic':34 'window':39 'work':52
```

Let's compare this line to the Python tokenization:


In [25]:
line = "2:14: And flocks shall lie down in the midst of her, " + \
    "all the beasts of the nations: both the cormorant and  " + \
    "the bittern shall lodge in the upper lintels of it;  " + \
    "their voice shall sing in the windows; desolation shall  " + \
    "be in the thresholds: for he shall uncover the cedar work."
print(line)

2:14: And flocks shall lie down in the midst of her, all the beasts of the nations: both the cormorant and  the bittern shall lodge in the upper lintels of it;  their voice shall sing in the windows; desolation shall  be in the thresholds: for he shall uncover the cedar work.


### Compare processing of the line:

In the cell below, use each of the Python techniques we discussed above to process the `line` variable.
Then, answer the questions in the cells below the code to compare and contrast the Python methods versus the apparent techniques applied by PostgreSQL.

In [30]:
# Import libraries
import pprint
import string
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# 1) Add code to process the line variable with 
# Stop Word tokenization, Stemming, and Lemmatization
# ---------------------------------------------

#stop = stopwords.words('english')

cv = CountVectorizer(stop_words = 'english', lowercase=True)
tk_function = cv.build_analyzer()

line_tokens = tk_function(line)
print(line_tokens)
print(' ')
stem = []
stemmer = PorterStemmer()
for w in line_tokens:
    stem.append(stemmer.stem(w))

print('Stem: ')
print(stem)

print(' ')
print('Lemma: ')

lemma = []
lemmatizer = WordNetLemmatizer()
for w in line_tokens:
    lemma.append(lemmatizer.lemmatize(w))
print(lemma)

['14', 'flocks', 'shall', 'lie', 'midst', 'beasts', 'nations', 'cormorant', 'bittern', 'shall', 'lodge', 'upper', 'lintels', 'voice', 'shall', 'sing', 'windows', 'desolation', 'shall', 'thresholds', 'shall', 'uncover', 'cedar', 'work']
 
Stem: 
['14', 'flock', 'shall', 'lie', 'midst', 'beast', 'nation', 'cormor', 'bittern', 'shall', 'lodg', 'upper', 'lintel', 'voic', 'shall', 'sing', 'window', 'desol', 'shall', 'threshold', 'shall', 'uncov', 'cedar', 'work']
 
Lemma: 
['14', 'flock', 'shall', 'lie', 'midst', 'beast', 'nation', 'cormorant', 'bittern', 'shall', 'lodge', 'upper', 'lintel', 'voice', 'shall', 'sing', 'window', 'desolation', 'shall', 'threshold', 'shall', 'uncover', 'cedar', 'work']
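
One way to compare the two results is to parse the PostgreSQL text search vector into a dictionary and compare its lexemes with the Python stems. The sketch below does this with the standard library. `parse_tsvector` is a hypothetical helper that assumes the simple `'lexeme':pos,pos` layout shown earlier (no weights or escaped quotes), and both inputs are copied from outputs in this lab.

```python
import re

# Text search vector copied from the PostgreSQL output in this lab
tsv = ("'14':2 '2':1 'beast':15 'bittern':24 'cedar':51 'cormor':21 "
       "'desol':40 'flock':4 'lie':6 'lintel':30 'lodg':26 'midst':10 "
       "'nation':18 'shall':5,25,35,41,48 'sing':36 'threshold':45 "
       "'uncov':49 'upper':29 'voic':34 'window':39 'work':52")

# Porter stems copied from the Python output above
python_stems = ['14', 'flock', 'shall', 'lie', 'midst', 'beast', 'nation',
                'cormor', 'bittern', 'shall', 'lodg', 'upper', 'lintel',
                'voic', 'shall', 'sing', 'window', 'desol', 'shall',
                'threshold', 'shall', 'uncov', 'cedar', 'work']

def parse_tsvector(text):
    """Return {lexeme: [positions]} for a simple tsvector string."""
    return {lexeme: [int(p) for p in positions.split(',')]
            for lexeme, positions in re.findall(r"'([^']*)':([\d,]+)", text)}

pg = parse_tsvector(tsv)
only_in_postgres = sorted(set(pg) - set(python_stems))
only_in_python = sorted(set(python_stems) - set(pg))

print('Only in PostgreSQL:', only_in_postgres)
print('Only in Python:    ', only_in_python)
```

For this line the stems agree except for the single-character token `2`, which the default `CountVectorizer` token pattern drops.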


## Run some Queries against the loaded books.

  1. In the cells below, try a couple of queries until you find an interesting line in the results.
  1. Then copy and paste that line into the `line` variable in the appropriate cell below.
  1. Compare the vectorization of the line to PostgreSQL.
  1. Answer the question
  

In [31]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dsa_student

'Connected: dsa_ro_user@dsa_student'

### 1) Fill in "`<write your query here>`" with a query or two.

In [33]:
%%sql
SELECT id,name,line_no,line, ts_rank_cd(line_tsv_gin, query) AS rank
FROM dlfy6.booklines, plainto_tsquery('lie') query
WHERE query @@ line_tsv_gin
ORDER BY rank DESC LIMIT 20;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_student
20 rows affected.


id,name,line_no,line,rank
11025,/dsa/data/all_datasets/book/ezekiel.txt,264,"13:19: And will ye pollute me among my people for handfuls of barley and for pieces of bread, to slay the souls that should not die, and to save the souls alive that should not live, by your lying to my people that hear your lies?",0.2
21458,/dsa/data/all_datasets/book/levit.txt,533,18:23: Neither shalt thou lie with any beast to defile thyself therewith: neither shall any woman stand before a beast to lie down thereto: it is confusion.,0.2
27145,/dsa/data/all_datasets/book/proverbs.txt,378,14:5: A faithful witness will not lie: but a false witness will utter lies.,0.2
10831,/dsa/data/all_datasets/book/ezekiel.txt,70,"4:4: Lie thou also upon thy left side, and lay the iniquity of the house of Israel upon it: according to the number of the days that thou shalt lie upon it thou shalt bear their iniquity.",0.2
31250,/dsa/data/all_datasets/book/zeph.txt,48,"3:13: The remnant of Israel shall not do iniquity, nor speak lies; neither shall a deceitful tongue be found in their mouth: for they shall feed and lie down, and none shall make them afraid.",0.2
1669,/dsa/data/all_datasets/book/1john.txt,32,"2:21: I have not written unto you because ye know not the truth, but because ye know it, and that no lie is of the truth.",0.1
1644,/dsa/data/all_datasets/book/1john.txt,7,"1:6: If we say that we have fellowship with him, and walk in darkness, we lie, and do not the truth:",0.1
1116,/dsa/data/all_datasets/book/ruth.txt,53,"3:7: And when Boaz had eaten and drunk, and his heart was merry, he went to lie down at the end of the heap of corn: and she came softly, and uncovered his feet, and laid her down.",0.1
2530,/dsa/data/all_datasets/book/1kings.txt,786,"22:22: And the LORD said unto him, Wherewith? And he said, I will go forth, and I will be a lying spirit in the mouth of all his prophets. And he said, Thou shalt persuade him, and prevail also: go forth, and do so.",0.1
2230,/dsa/data/all_datasets/book/1kings.txt,486,"13:18: He said unto him, I am a prophet also as thou art; and an angel spake unto me by the word of the LORD, saying, Bring him back with thee into thine house, that he may eat bread and drink water. But he lied unto him.",0.1


In [34]:
line = "22:8: That all of you have conspired against me, and there is none that sheweth me that my son hath made a league with the son of Jesse, and there is none of you that is sorry for me, or sheweth unto me that my son hath stirred up my servant against me, to lie in wait, as at this day?"

# 2) Tokenize, Stem, and/or Lemmatize with Python
#---------------------------------------------------------

line_tokens = tk_function(line)
print(line_tokens)
print(' ')
stem = []
stemmer = PorterStemmer()
for w in line_tokens:
    stem.append(stemmer.stem(w))

print('Stem: ')
print(stem)

print(' ')
print('Lemma: ')

lemma = []
lemmatizer = WordNetLemmatizer()
for w in line_tokens:
    lemma.append(lemmatizer.lemmatize(w))
print(lemma)




['22', 'conspired', 'sheweth', 'son', 'hath', 'league', 'son', 'jesse', 'sorry', 'sheweth', 'unto', 'son', 'hath', 'stirred', 'servant', 'lie', 'wait', 'day']
 
Stem: 
['22', 'conspir', 'sheweth', 'son', 'hath', 'leagu', 'son', 'jess', 'sorri', 'sheweth', 'unto', 'son', 'hath', 'stir', 'servant', 'lie', 'wait', 'day']
 
Lemma: 
['22', 'conspired', 'sheweth', 'son', 'hath', 'league', 'son', 'jesse', 'sorry', 'sheweth', 'unto', 'son', 'hath', 'stirred', 'servant', 'lie', 'wait', 'day']


In [36]:
# Alternative: use the NLTK tokenizer and stop word list instead of CountVectorizer
# 2) Tokenize, Stem, and/or Lemmatize with Python
#---------------------------------------------------------

stop = stopwords.words('english')

tokens = [token for token in word_tokenize(line.lower()) if token not in stop]
tokens = [token for token in tokens if token not in string.punctuation]

print("After removing stop words")
print(40*'-')
print(tokens)

stems=[]
for token in tokens:
    stems.append(stemmer.stem(token))
print("\n After Stemming")
print(40*'-')
print(stems)

lemmas=[]
print("\n After Lemmatization")
print(40*'-')
for w in tokens:
    lemmas.append(lemmatizer.lemmatize(w))
print(lemmas)


After removing stop words
----------------------------------------
['22:8', 'conspired', 'none', 'sheweth', 'son', 'hath', 'made', 'league', 'son', 'jesse', 'none', 'sorry', 'sheweth', 'unto', 'son', 'hath', 'stirred', 'servant', 'lie', 'wait', 'day']

 After Stemming
----------------------------------------
['22:8', 'conspir', 'none', 'sheweth', 'son', 'hath', 'made', 'leagu', 'son', 'jess', 'none', 'sorri', 'sheweth', 'unto', 'son', 'hath', 'stir', 'servant', 'lie', 'wait', 'day']

 After Lemmatization
----------------------------------------
['22:8', 'conspired', 'none', 'sheweth', 'son', 'hath', 'made', 'league', 'son', 'jesse', 'none', 'sorry', 'sheweth', 'unto', 'son', 'hath', 'stirred', 'servant', 'lie', 'wait', 'day']


#### 3) Now use the ID column to look up the search vector that PostgreSQL built.

In [37]:
%%sql
SELECT id,name,line_no,line, line_tsv_gin
FROM dlfy6.booklines
WHERE id = 3252;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_student
1 rows affected.


id,name,line_no,line,line_tsv_gin
3252,/dsa/data/all_datasets/book/1samuel.txt,584,"22:8: That all of you have conspired against me, and there is none that sheweth me that my son hath made a league with the son of Jesse, and there is none of you that is sorry for me, or sheweth unto me that my son hath stirred up my servant against me, to lie in wait, as at this day?","'22':1 '8':2 'conspir':8 'day':62 'hath':21,48 'jess':29 'leagu':24 'lie':56 'made':22 'none':14,33 'servant':52 'sheweth':16,42 'son':20,27,47 'sorri':38 'stir':49 'unto':43 'wait':58"


# SAVE YOUR NOTEBOOK, then `File > Close and Halt`