# TEXT NORMALIZATION PRACTICE<br>


**<font color=green>INSTRUCTIONS:</font>** <br> <br>
    **<font color=green>1. Look for EXERCISES and QUESTIONS in this script. </font>** <br> <br>
    **<font color=green>2. Each student INDIVIDUALLY uploads this script with their answers embedded (and other materials if requested) to Canvas by the deadline indicated on Canvas.</font>** <br>
## SESSION PREP

### How to install any module from inside Jupyter

To install any module from inside Jupyter, we need a module called sys:

In [1]:
import sys

Now, you can install any module from Jupyter by running a line such as: <br> <br> !{sys.executable} -m pip install module_name
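
Before installing, it can be handy to check whether a module is already available. The following is a minimal sketch under that assumption; the helper names `is_installed` and `pip_install_command` are hypothetical and not part of the course script, while `importlib.util.find_spec` comes from the standard library.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

def pip_install_command(module_name):
    """Build the same pip command that the notebook line above runs."""
    return [sys.executable, "-m", "pip", "install", module_name]

print(is_installed("json"))         # json ships with Python itself
print(pip_install_command("nltk"))  # the command as a list of arguments
```

Using `sys.executable` ensures that pip installs into the same Python environment that the notebook kernel is running.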

### Install Natural Language ToolKit (NLTK) module (and some other modules)

The NLTK module performs text normalization, among other functions. We'll install NLTK, as well as numpy and pandas, from inside Jupyter (you might see deprecation warnings in pink about future changes in the modules, but you do not need to pay attention to them at this time):

In [None]:
!{sys.executable} -m pip install nltk
import nltk

!{sys.executable} -m pip install numpy
import numpy as np 

!{sys.executable} -m pip install pandas
import pandas as pd

## Download text data

In what follows, we'll use an electronic archive of books from Project Gutenberg that the Natural Language ToolKit has access to. In particular, we'll use "Alice's Adventures in Wonderland" by Lewis Carroll. Our corpus will be just one file called carroll-alice.txt (it's in .txt format):

In [3]:
nltk.download('gutenberg') 
from nltk.corpus import gutenberg 

alice = gutenberg.raw(fileids='carroll-alice.txt') # we name the corpus 'alice'
from pprint import pprint #function for pretty printing
pprint(alice[0:35]) #print the first 35 characters of the corpus

[nltk_data] Downloading package gutenberg to /Users/kkee/nltk_data...


"[Alice's Adventures in Wonderland b"


[nltk_data]   Unzipping corpora/gutenberg.zip.


## TEXT TOKENIZATION
**Tokenization** is splitting text into semantically meaningful chunks, such as sentences or words. Tokenizing into words is the most common choice. You might be interested in tokenizing into sentences if you plan to analyze text sentence by sentence.
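
To make the two granularities concrete, here is a rough sketch that uses only Python's `re` module. The sample text and both regular expressions are simplifying assumptions for illustration; they are not what NLTK does internally.

```python
import re

text = "Alice was tired. She saw a rabbit! Where did it go?"

# Naive sentence split: break after ., ! or ? when whitespace follows.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word split: keep runs of letters, digits, and apostrophes.
words = re.findall(r"[A-Za-z0-9']+", text)

print(sentences)
print(words)
```

The same text yields three sentence tokens but eleven word tokens, which is why the choice of unit depends on the analysis you plan to run.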

### Tokenization by Sentence
From the NLTK module, we'll use a sentence tokenizer 'punkt':

In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/kkee/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's now tokenize the Alice corpus by sentence:

In [5]:
alice_sentences = nltk.sent_tokenize(text=alice)
print('\nTotal sentences in the corpus:', len(alice_sentences))


Total sentences in the corpus: 1625


Let's have a look at the first sentence in the Alice corpus:

In [6]:
print('\nFirst sentence in alice:', alice_sentences[0])


First sentence in alice: [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.


Let's now look at what the second sentence looks like:

In [7]:
print('\nSecond sentence in alice:', alice_sentences[1])


Second sentence in alice: Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'


### <font color=green>QUESTION 1: Why do you think the first and second tokenized sentences (above) look like that? (look at what Python printed out)</font>

### <font color=green> Answer:

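Because the sentence tokenizer decides where a sentence ends mainly from sentence-ending punctuation (such as a period), not from line breaks. The title line has no final punctuation, so it is merged with "CHAPTER I.", which ends with a period. In the same way, the chapter title "Down the Rabbit-Hole" has no final punctuation, so it is combined with the first sentence of the main text; only line breaks separate them.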
  
### <font color=green> End of Answer
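
The behaviour discussed in Question 1 can be reproduced with a naive splitter that only breaks after terminal punctuation. This is a sketch under stated assumptions: the regular expression is a stand-in rather than the Punkt model, the sample text is a shortened version of the corpus opening, and `naive_sentences` is a hypothetical helper. Splitting on blank lines first is one possible way to keep headings apart from body text.

```python
import re

# Shortened sample modeled on the opening of the corpus.
text = ("[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\n"
        "CHAPTER I.\n\n"
        "Down the Rabbit-Hole\n\n"
        "Alice was tired. She sat down.")

def naive_sentences(s):
    """Split only after ., ! or ? followed by whitespace (line breaks are ignored)."""
    return re.split(r"(?<=[.!?])\s+", s.strip())

# Headings without final punctuation get glued to the text that follows them.
merged = naive_sentences(text)

# Alternative: treat each blank-line-separated block as its own unit first.
blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
separated = [sent for b in blocks for sent in naive_sentences(b)]

print(len(merged), merged)
print(len(separated), separated)
```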

### Tokenization into Words
Let's do some tokenization into words now. You can tokenize into words in several ways: with NLTK's default word tokenizer (labeled "words" below), by splitting at punctuation signs, or by splitting at white spaces.

We'll tokenize a corpus consisting of one sentence shown below:

In [None]:
sentence = "The brown fox wasn't that quick and he couldn't win the races"

Let's tokenize **using "words"**:

In [None]:
words = nltk.word_tokenize(sentence)
print(words)  

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'races']


Let's tokenize **using punctuation signs** now. Do you see any difference between this tokenization and the previous one?

In [None]:
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'races']


Let's tokenize **using white spaces**:

In [None]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']
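
The difference between the punctuation-based and whitespace-based results can be sketched with plain Python. The pattern `\w+|[^\w\s]+` is, to the best of my knowledge, the one behind `WordPunctTokenizer`, but treat that as an assumption and rely on the NLTK classes in practice.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the races"

# Whitespace tokenization: split on runs of spaces, tabs, or newlines.
whitespace_tokens = sentence.split()

# Punctuation-based tokenization: runs of word characters OR runs of punctuation.
punct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

print(whitespace_tokens)
print(punct_tokens)
```

Whitespace splitting keeps contractions such as "wasn't" whole, while the punctuation-based pattern breaks them into three tokens at the apostrophe.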


## STOPWORDS

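Stopwords are very common words ("it's", "is", "the", etc.) that carry little meaning on their own. Let's load NLTK's list of English stopwords so that we can remove them: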

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words=set(stopwords.words("english"))
print(stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{'can', 'mightn', 'same', 'than', 'all', "aren't", 'who', 'whom', 'out', 'will', 'which', 'your', 'about', 'own', "hadn't", 'yours', 'most', 'now', 'ours', 'from', 'during', 'wouldn', 'again', 'hers', 'some', 'nor', 'those', 'were', 'been', 'after', 'off', 'between', 'he', 'other', 'theirs', 'ma', "shan't", 'you', 'i', 'no', 'itself', 'few', "you'd", 'of', "mustn't", 'doing', "wouldn't", 'only', 'does', 'ourselves', 'a', 'his', 'just', 'her', 'with', 'once', 'their', 'doesn', 'o', 'before', "weren't", 'needn', "that'll", 'and', 'while', "didn't", 'shan', 'weren', 'we', 't', 'when', "you've", 'myself', 'was', 'herself', 'll', 'couldn', 'both', 'being', 'under', "she's", 'its', 'having', 'above', 'over', 'not', 'more', 'hasn', 've', 'yourself', 'through', 'down', 'against', "won't", 'in', "mightn't", 'are', 's', 'so', 'y', 'what', 'to', 'himself', 'very', 'such', 'did', 'them', 

You can (and should consider) amending the list of stopwords given your data and project objectives. For example, we can add more stopwords to the standard list:

In [None]:
add_stopwords ={'so','NYC'}
stop_words_new = add_stopwords.union(stop_words)
print(stop_words_new)

{'can', 'mightn', 'same', 'than', 'all', "aren't", 'who', 'whom', 'out', 'will', 'which', 'your', 'about', 'own', "hadn't", 'yours', 'most', 'now', 'ours', 'from', 'during', 'wouldn', 'again', 'hers', 'some', 'nor', 'those', 'were', 'been', 'after', 'off', 'between', 'he', 'other', 'theirs', 'ma', "shan't", 'you', 'i', 'no', 'itself', 'few', "you'd", 'of', "mustn't", 'doing', "wouldn't", 'only', 'does', 'ourselves', 'a', 'his', 'just', 'her', 'with', 'once', 'their', 'doesn', 'o', 'before', "weren't", 'needn', "that'll", 'and', 'while', "didn't", 'shan', 'weren', 'we', 't', 'when', "you've", 'myself', 'was', 'herself', 'll', 'couldn', 'both', 'under', "she's", 'its', 'having', 'above', 'over', 'not', 'more', 'hasn', 've', 'yourself', 'through', 'down', 'against', "won't", 'in', "mightn't", 'are', 's', 'so', 'y', 'what', 'to', 'himself', 'because', 'very', 'such', 'did', 'them', 'didn', 'up', 'why', "you're", 'until', 'm', 'any', 'hadn', 're', "isn't", 'she', 'too', 'this', 'on', 'yours
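
Amending the list can also mean removing entries. For example, negations may matter in a sentiment project. The sketch below uses a small stand-in set (`stop_words_demo`, a hypothetical name) rather than the full NLTK list, so the words in it are assumptions for illustration only.

```python
# Small stand-in for the NLTK list, used here only for illustration.
stop_words_demo = {"the", "and", "he", "that", "not", "no", "wasn't", "couldn't"}

# Words we want to keep in the text, so they must leave the stopword set.
keep_words = {"not", "no", "wasn't", "couldn't"}

stop_words_trimmed = stop_words_demo - keep_words    # set difference
stop_words_extended = stop_words_trimmed | {"nyc"}   # set union with a lowercase entry

print(sorted(stop_words_trimmed))
print(sorted(stop_words_extended))
```

Set membership is case-sensitive, so an entry such as 'NYC' only matches tokens written in exactly that form.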

Now, compare the tokenized sentence before and after removing the stopwords:

In [None]:
filtered_tokens=[]

for w in words:
    if w not in stop_words:
        filtered_tokens.append(w)
        
print("Tokenized Sentence:",words)
print("Filtered Sentence (without stopwords):",filtered_tokens)

Tokenized Sentence: ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']
Filtered Sentence (without stopwords): ['The', 'brown', 'fox', 'quick', 'win', 'races']
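
Note that 'The' survives the filter above: the entries in the stopword list are lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows; `stop_words_demo` is a small stand-in set assumed for this example, not the NLTK list.

```python
# Stand-in stopword set (an assumption for this sketch; the notebook uses NLTK's list).
stop_words_demo = {"the", "that", "and", "he", "wasn't", "couldn't"}

words = ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he',
         "couldn't", 'win', 'the', 'races']

# Compare the lowercased token against the list, but keep the original token.
filtered_tokens = [w for w in words if w.lower() not in stop_words_demo]

print(filtered_tokens)
```

Lowercasing only for the comparison leaves the original capitalization of the remaining tokens untouched.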


## STEMMING AND LEMMATIZATION

Let's stem the sentence first:

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_tokens=[]
for w in filtered_tokens:
    stemmed_tokens.append(ps.stem(w))

print("Filtered Sentence:",filtered_tokens)
print("Stemmed Sentence:",stemmed_tokens)

Filtered Sentence: ['The', 'brown', 'fox', 'quick', 'win', 'races']
Stemmed Sentence: ['the', 'brown', 'fox', 'quick', 'win', 'race']
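
Stemming strips suffixes by rule, without consulting a dictionary. As a rough illustration, the sketch below implements only the plural rules (step 1a) described in Porter's original algorithm; the full stemmer has many more steps, and `strip_plural` is a hypothetical helper, not an NLTK function.

```python
def strip_plural(word):
    """Apply only the plural rules (step 1a) of Porter's original algorithm."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # races -> race
    return word

tokens = ['The', 'brown', 'fox', 'quick', 'win', 'races']
print([strip_plural(t) for t in tokens])
```

A result such as "poni" shows that a stem does not have to be a real word; it only has to be the same for related word forms.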


Compare stemming to lemmatization for the word "running": 

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

word = "running"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb
print("Stemmed Word:",ps.stem(word))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Lemmatized Word: run
Stemmed Word: run


One more comparison for the word "bought":

In [None]:
word = "bought"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb (part-of-speech)
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: buy
Stemmed Word: bought


### <font color=green>EXERCISE 1: What result would you get if you change the part-of-speech tag in the lemmatization line above to "n", which means "noun"? (look at what Python printed out)</font> <br>
### <font color=green> Answer:

In [None]:
word = "bought"
print("Lemmatized Word:",lem.lemmatize(word,"n")) 
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: bought
Stemmed Word: bought


With the tag "n", the lemmatizer looks up "bought" as a noun, finds no noun base form for it, and returns the word unchanged. With the tag "v", it recognizes "bought" as a form of the irregular verb "buy". The stemmer only strips suffixes by rule and does not use the part of speech, so it handles regular forms such as "running" but leaves irregular forms such as "bought" unchanged.

### <font color=green> End of Answer
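
The role of the part-of-speech tag can be pictured as a lookup keyed by both the word and its tag. The sketch below is a toy model with a tiny hand-made table (`LEMMA_TABLE` and `toy_lemmatize` are hypothetical names); the real `WordNetLemmatizer` relies on WordNet and is more elaborate than this.

```python
# Tiny hand-made lookup table standing in for WordNet (illustrative entries only).
LEMMA_TABLE = {
    ("bought", "v"): "buy",
    ("running", "v"): "run",
    ("races", "n"): "race",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged if absent."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("bought", "v"))  # found as a verb
print(toy_lemmatize("bought", "n"))  # no noun entry, so the word comes back as is
```

The default tag in this sketch is "n", so forgetting to pass "v" for a verb silently returns the input word.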