## **Lab 2: TEXT NORMALIZATION AND VECTORIZATION** <br>


**<font color=green>INSTRUCTIONS:</font>** <br>
    **<font color=green>1. Look for EXERCISES and QUESTIONS in this script. </font>** <br>
    **<font color=green>2. Upload this script with your answers embedded by the deadline.</font>** <br>
## **Lab Objectives:**<br>
1. Learn how to prepare text for analysis
2. See how to perform text normalization and vectorization in Python
3. Observe how pre-processing operations change your text data<br>

## **Session Prep:**

### **Installing a Python module from a Jupyter notebook**

We will need some Python modules/packages to do text mining. It is convenient to install modules without leaving your Jupyter notebook. To install a module from a Jupyter notebook, we need the built-in module **sys**, which tells us which Python interpreter the notebook is running:

In [1]:
import sys

Now, you can install any module from Jupyter by running a line such as: <br> <br> !{sys.executable} -m pip install module_name
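
Before installing, it can help to check whether a module is already available. The following is a minimal sketch and not part of the original lab; the helper name `is_installed` is made up for illustration and relies only on the standard-library function `importlib.util.find_spec`:

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# 'json' ships with Python; the second name is deliberately bogus.
print(is_installed("json"))
print(is_installed("module_that_does_not_exist_xyz"))
```

If the check returns False for a module you need, run the pip line shown above for that module.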

### **Install Natural Language ToolKit (NLTK) module (and some other modules)**

The NLTK module does text normalization, among other things. We'll install module NLTK, as well as modules numpy and pandas, from Jupyter.

**Note**: You might see deprecation warnings (in pink) about future changes in the module but you do not need to pay attention to them.

In [2]:
!{sys.executable} -m pip install nltk
import nltk

!{sys.executable} -m pip install numpy
import numpy as np

!{sys.executable} -m pip install pandas
import pandas as pd



## **Downloading Text Data**

In what follows, we'll use an electronic archive of books from Project Gutenberg that Natural Language ToolKit has access to. In particular, we'll use the book called "Alice in Wonderland" by Lewis Carroll. Our text corpus will be one file called carroll-alice.txt (it's in .txt format):

In [None]:
nltk.download('gutenberg')
from nltk.corpus import gutenberg

alice = gutenberg.raw(fileids='carroll-alice.txt') # we name the corpus 'alice'
from pprint import pprint #function for pretty printing
pprint(alice[0:35]) #print the first 35 characters of the corpus

## **Text Tokenization**
**Tokenization** is splitting text into semantically meaningful chunks, such as sentences or words. Tokenizing into words is most common. You might be interested in tokenizing into sentences if you plan to analyze text sentence-by-sentence (common in sentiment analysis).

### **Tokenization by Sentence**
From the NLTK module, we'll use sentence tokenizer 'punkt':

In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's now tokenize the Alice corpus by sentence and see how many tokens (sentences) we get:

In [5]:
alice_sentences = nltk.sent_tokenize(text=alice)
print('\nTotal sentences in the corpus:', len(alice_sentences))


Total sentences in the corpus: 1625


Let's have a look at the first sentence in the Alice corpus:

In [6]:
print('\nFirst sentence in alice:', alice_sentences[0])


First sentence in alice: [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.


Let's now look at what the second sentence looks like:

In [7]:
print('\nSecond sentence in alice:', alice_sentences[1])


Second sentence in alice: Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'


### <font color=green>**QUESTION 1: <br> 1A. Look closely at the two tokens that the machine created. Remember, we intended to tokenize by sentence. Did we achieve our goal? Why or why not?<br> 1B. What rules did the machine use to create the tokens? (Can you reverse-engineer at least one of the rules?)<br>1C. Should you always expect the machine to produce the correct result when working with text? Why or why not?** </font>

1. No, not quite. The first token ends at "CHAPTER I.", which is a heading rather than a sentence, and the second token glues the chapter title "Down the Rabbit-Hole" onto the first real sentence.

2. The machine appears to treat a period or a question mark as the end of a sentence. A closing quotation mark that directly follows the period or question mark is kept in the same sentence.

3. No, because text can contain typos, headings, abbreviations, or other cases that the machine's rules do not cover.
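
To make the kind of rule discussed in Question 1 concrete, here is a deliberately naive sentence splitter. It is only a sketch: the 'punkt' tokenizer is a trained model and does not work this way internally, and the variable `excerpt` is a hand-typed, shortened stand-in for the opening of the corpus. The rule is simply "split on white space that follows a period, question mark, or exclamation mark":

```python
import re

# Hand-typed, shortened excerpt (an assumption for this sketch).
excerpt = ("[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\n"
           "CHAPTER I. Down the Rabbit-Hole\n\n"
           "Alice was beginning to get very tired.")

# Split wherever white space follows '.', '?' or '!'.
naive_sentences = re.split(r"(?<=[.?!])\s+", excerpt)

for i, s in enumerate(naive_sentences):
    print(i, repr(s))
```

Even this simple rule cuts the text right after "CHAPTER I.", which suggests why a period inside a heading can mislead a sentence tokenizer.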

### **Tokenization by Words**
Let's do some tokenization into words now. You can tokenize into words using punctuation marks, white spaces, or "recognized words" (see below).

We'll tokenize a text corpus consisting of one sentence shown below:

In [8]:
sentence = "The brown fox wasn't that quick and he couldn't win the races"

Let's tokenize **using recognized "words"**:

In [9]:
words = nltk.word_tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'races']


Let's tokenize **using punctuation marks** now. Do you see any difference between this tokenization and the previous one?

In [10]:
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'races']


Let's tokenize **using white spaces**:

In [11]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']


As you can see, the differences in results are quite subtle. For example, the way we catch negation is different. These differences can be important depending on your **project objectives**.
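
Two of the three strategies can be approximated with the standard library alone, which makes the splitting rules explicit. This is a rough sketch rather than NLTK's own code: the regular expression mirrors the pattern that `WordPunctTokenizer` is built on, and `str.split()` stands in for the white-space tokenizer.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the races"

# White-space tokenization: split wherever there is white space.
whitespace_tokens = sentence.split()

# Punctuation-based tokenization: runs of word characters, or runs of
# characters that are neither word characters nor white space.
punct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

print(whitespace_tokens)
print(punct_tokens)
```

The white-space version keeps "wasn't" as one token, while the punctuation version breaks it into three pieces, so the negation is much harder to recover afterwards.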

### **Stopwords**

Let's get rid of stopwords ("it's", "is", "the", etc.):

In [12]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words=set(stopwords.words("english"))
print(stop_words)

{'him', 'my', 'up', 'again', 't', 'aren', 'such', 'all', 'being', 'will', 'out', 'needn', 'our', 'not', 'from', 'very', 'what', 'doing', "hadn't", 'weren', 'between', 'yourselves', 'because', 'it', 'on', 'then', 'hadn', 'we', 'now', 'nor', 'than', 'll', 'ma', "you'd", 'above', 'why', 'or', 'most', "aren't", 'about', "weren't", "you've", "mightn't", "won't", 'there', 'some', 'me', 'own', 'you', 'below', 'isn', 'into', 'of', 'theirs', 'each', 'at', 'which', 'yourself', 're', 'has', 'should', 'i', 'he', 'once', 'm', 'can', 'your', 'was', 's', 'his', 'whom', 'shan', 'wouldn', 'an', 'those', 'd', 'himself', "you're", "it's", "doesn't", 'themselves', 'been', 'haven', "shan't", 'wasn', 'during', 'mustn', 've', 'that', 'be', 'ours', 'for', 'who', 'are', 'with', 'is', "mustn't", 'further', 'am', 'in', 'just', 'any', 'after', 'more', 'didn', 'both', "haven't", 'does', 'how', 'this', 'don', 'them', 'no', 'before', 'here', 'o', 'hasn', 'few', 'to', 'ourselves', 'their', 'couldn', 'while', "should'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


You can amend the list of stopwords given your data and project objectives. For example, we can add more stopwords to the standard list:

In [13]:
add_stopwords ={'so','NYC'}
stop_words_new = add_stopwords.union(stop_words)
print(stop_words_new)

{'him', 'my', 'up', 'again', 't', 'aren', 'such', 'all', 'being', 'will', 'out', 'needn', 'our', 'not', 'from', 'very', 'what', 'doing', "hadn't", 'weren', 'between', 'yourselves', 'because', 'it', 'on', 'then', 'hadn', 'we', 'now', 'nor', 'than', 'll', 'ma', "you'd", 'above', 'why', 'or', 'most', "aren't", 'about', "weren't", "you've", "mightn't", "won't", 'there', 'some', 'me', 'own', 'you', 'below', 'isn', 'into', 'of', 'theirs', 'NYC', 'each', 'at', 'which', 'yourself', 're', 'has', 'should', 'i', 'he', 'once', 'm', 'were', 'can', 'your', 'was', 's', 'his', 'whom', 'shan', 'wouldn', 'an', 'those', 'd', 'himself', "you're", "it's", "doesn't", 'themselves', 'been', 'haven', "shan't", 'wasn', 'during', 'mustn', 've', 'that', 'be', 'ours', 'for', 'who', 'are', 'with', 'is', "mustn't", 'further', 'am', 'in', 'just', 'any', 'after', 'more', 'didn', 'both', "haven't", 'how', 'this', 'don', 'them', 'no', 'before', 'here', 'o', 'hasn', 'few', 'to', 'ourselves', 'their', 'couldn', 'while', "

Now, compare the tokenized sentence before and after removing the stopwords:

In [14]:
filtered_tokens=[]

for w in words:
    if w not in stop_words:
        filtered_tokens.append(w)

print("Tokenized Sentence:",words)
print("Filtered Sentence (without stopwords):",filtered_tokens)

Tokenized Sentence: ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']
Filtered Sentence (without stopwords): ['The', 'brown', 'fox', 'quick', 'win', 'races']


Should you consider adding the apostrophe mark (') to the list of stopwords? :)
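
Notice that 'The' survived the filter above: the stopword list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive filter follows. To keep it self-contained, `demo_stop_words` is a small hand-picked set standing in for the NLTK list, not the full list.

```python
# Small stand-in for the NLTK stopword list (an assumption for this sketch).
demo_stop_words = {"the", "that", "and", "he", "wasn't", "couldn't"}

tokens = ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and',
          'he', "couldn't", 'win', 'the', 'races']

# Lowercase each token only for the comparison; keep the original spelling.
filtered = [t for t in tokens if t.lower() not in demo_stop_words]
print(filtered)
```

Whether to lowercase before filtering depends on the project: capitalization sometimes carries meaning (for example, in names).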

### **Stemming and Lemmatization**

Do you remember the difference between stemming and lemmatization from our video lecture? Let's stem one sentence first:

In [15]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_tokens=[]
for w in filtered_tokens:
    stemmed_tokens.append(ps.stem(w))

print("Filtered Sentence:",filtered_tokens)
print("Stemmed Sentence:",stemmed_tokens)

Filtered Sentence: ['The', 'brown', 'fox', 'quick', 'win', 'races']
Stemmed Sentence: ['the', 'brown', 'fox', 'quick', 'win', 'race']


Compare stemming to lemmatization for the word "running":

In [16]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

word = "running"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb
print("Stemmed Word:",ps.stem(word))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Lemmatized Word: run
Stemmed Word: run


One more comparison for the word "bought":

In [17]:
word = "bought"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb (part-of-speech)
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: buy
Stemmed Word: bought


### <font color=green>**EXERCISE 1: What result would you get if you change the part-of-speech tag in the lemmatization line above to "n" (which means "noun")? Look at what Python prints out. Why did the change happen?**</font> <br>
### <font color=green> Answer:

In [18]:
word = "bought"
print("Lemmatized Word:",lem.lemmatize(word,"n")) # 'n' indicates that the word is a noun
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: bought
Stemmed Word: bought


The new output is:
Lemmatized Word: bought
Stemmed Word: bought

The reason is that "bought" is not a noun: the lemmatizer looks the word up as a noun, finds no matching base form, and returns it unchanged.

### **Text Vectorization**

Text vectorization is the process of feature extraction from text data. In other words, it is the process of creating variables for each observation, where an observation is a text document. We'll consider the **bag-of-words**, the **TF-IDF** and the **n-grams** vectorized representations of text. <br>

Let's vectorize the text corpus about "blue skies and blue cheese" similar to one used in the video lecture:

In [19]:
corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

We'll use built-in vectorizers from the Scikit-Learn module for machine learning.

#### **Bag-of-Words Representation**

We'll do the bag-of-words representation first. For that we'll use a function called **CountVectorizer()**.  <br><br> **Note**: You can see the documentation for the CountVectorizer function here to explore it further:<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

It is convenient to "define" a vectorizer first before applying it. You can specify all the parameters (arguments) of the function in the definition. For example, the **max_features** parameter below drops all features except for the selected number of most frequent terms in the corpus:

In [21]:
vectorizer_BOW = CountVectorizer(max_features=1000) #BOW = bag-of-words

Now let's extract features using the defined vectorizer function. Note the **.fit_transform** method below. It creates the dictionary of the corpus and does the vectorization:

In [22]:
BOW_matrix = vectorizer_BOW.fit_transform(corpus).toarray()
pd.DataFrame(np.round(BOW_matrix,2))

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0,0,1,0,1,0,1,0,1
1,1,1,1,0,2,0,2,0,0
2,0,1,1,0,1,0,1,1,1
3,0,0,1,1,0,1,0,0,0


We want to see the names of the features, right? We should use the method **get_feature_names_out()** to see the names:

In [23]:
vectorizer_BOW.get_feature_names_out()

array(['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'so',
       'the'], dtype=object)

Let's get a more useful looking bag-of-words representation, with the feature names attached:

In [24]:
pd.DataFrame(np.round(BOW_matrix,2),columns=vectorizer_BOW.get_feature_names_out())

Unnamed: 0,and,beautiful,blue,cheese,is,love,sky,so,the
0,0,0,1,0,1,0,1,0,1
1,1,1,1,0,2,0,2,0,0
2,0,1,1,0,1,0,1,1,1
3,0,0,1,1,0,1,0,0,0
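
To see where these counts come from, the same table can be rebuilt by hand with the standard library. This is only a sketch of the idea, not scikit-learn's implementation; it assumes CountVectorizer's default settings (lowercasing, and tokens of at least two word characters), which is also why the one-letter word "i" does not appear among the features.

```python
import re
from collections import Counter

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

# Tokens are runs of two or more word characters (assumed default pattern).
token_pattern = re.compile(r"\b\w\w+\b")
docs = [token_pattern.findall(doc.lower()) for doc in corpus]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocab = sorted({token for doc in docs for token in doc})

# One row per document, one column per vocabulary term.
bow_rows = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)
for row in bow_rows:
    print(row)
```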


#### **Vectorization Using N-grams**
<br>
Let's use bi-grams in our vectorized representation of text. First, we define the vectorizer (we need the same CountVectorizer() function) using a parameter for specifying n-grams. Then we apply it:

In [25]:
vectorizer_Bi_Grams = CountVectorizer(max_features=1000, ngram_range=(2, 2))
# max_features=1000 -> keep at most the 1000 most frequent bi-grams
# ngram_range=(2, 2) -> lower and upper limit on n, so bi-grams only

Bi_Grams_matrix = vectorizer_Bi_Grams.fit_transform(corpus).toarray()
pd.DataFrame(np.round(Bi_Grams_matrix,2),columns=vectorizer_Bi_Grams.get_feature_names_out())

Unnamed: 0,and sky,beautiful sky,blue and,blue cheese,is beautiful,is blue,is so,love blue,sky is,so blue,the beautiful,the sky
0,0,0,0,0,0,1,0,0,1,0,0,1
1,1,0,1,0,1,1,0,0,2,0,0,0
2,0,1,0,0,0,0,1,0,1,1,1,0
3,0,0,0,1,0,0,0,1,0,0,0,0
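
A bi-gram is just a pair of adjacent tokens. The short sketch below builds the bi-grams of the second document by hand; the helper name `bigrams` is made up for illustration and is not part of scikit-learn.

```python
from collections import Counter

def bigrams(tokens):
    """Return the adjacent token pairs, each joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

doc = "sky is blue and sky is beautiful".split()
bigram_counts = Counter(bigrams(doc))
print(bigram_counts)
```

The pair "sky is" occurs twice in this document, which matches the 2 in the "sky is" column of row 1 above.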


### <font color=green>**EXERCISE 2: Create a Bi-Grams vectorizer that uses the mix of bi-grams and uni-grams. To complete the Exercise you may need to look up CountVectorizer's documentation, see link below.**</font> <br>

Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html <br>

### <font color=green> Answer:


In [26]:
from sklearn.feature_extraction.text import CountVectorizer

In [41]:
vectorizer_uni_bi_Grams = CountVectorizer(max_features=1000, ngram_range=(1, 2))
uni_bi_Grams_matrix = vectorizer_uni_bi_Grams.fit_transform(corpus).toarray()
pd.DataFrame(np.round(uni_bi_Grams_matrix,3),columns=vectorizer_uni_bi_Grams.get_feature_names_out())




Unnamed: 0,and,and sky,beautiful,beautiful sky,blue,blue and,blue cheese,cheese,is,is beautiful,...,is so,love,love blue,sky,sky is,so,so blue,the,the beautiful,the sky
0,0,0,0,0,1,0,0,0,1,0,...,0,0,0,1,1,0,0,1,0,1
1,1,1,1,0,1,1,0,0,2,1,...,0,0,0,2,2,0,0,0,0,0
2,0,0,1,1,1,0,0,0,1,0,...,1,0,0,1,1,1,1,1,1,0
3,0,0,0,0,1,0,1,1,0,0,...,0,1,1,0,0,0,0,0,0,0


### **Vectorization using Term Frequency – Inverse Document Frequency (TF-IDF)**

Now, let's do feature extraction (vectorization) using the TF-IDF approach. We'll use a function called **TfidfVectorizer()**. <br> <br> **Note**: See full documentation for the function TfidfVectorizer() here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html <br> <br>
Let's start. First, we import the vectorizer and then define it by specifying the parameters (look up the specified parameters in the documentation):

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_TF_IDF = TfidfVectorizer(norm = None, smooth_idf = True)

Let's vectorize our text corpus now using TF-IDF:

In [29]:
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix, 2), columns=vectorizer_TF_IDF.get_feature_names_out())

Unnamed: 0,and,beautiful,blue,cheese,is,love,sky,so,the
0,0.0,0.0,1.0,0.0,1.22,0.0,1.22,0.0,1.51
1,1.92,1.51,1.0,0.0,2.45,0.0,2.45,0.0,0.0
2,0.0,1.51,1.0,0.0,1.22,0.0,1.22,1.92,1.51
3,0.0,0.0,1.0,1.92,0.0,1.92,0.0,0.0,0.0


Have a look at the **IDF weights** (we talked about them in the video lecture):

In [30]:
print(np.round(vectorizer_TF_IDF.idf_,2))

[1.92 1.51 1.   1.92 1.22 1.92 1.22 1.92 1.51]
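
These weights can be reproduced by hand. With `smooth_idf = True`, scikit-learn documents the weight of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below applies that formula with the standard library only; dropping one-letter tokens is an assumption that mimics the vectorizer's default tokenization.

```python
import math

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']

n_docs = len(corpus)

# Unique tokens per document, ignoring one-letter tokens such as 'i'.
docs = [{t for t in doc.split() if len(t) > 1} for doc in corpus]
vocab = sorted(set().union(*docs))

idf = {}
for term in vocab:
    doc_freq = sum(term in doc for doc in docs)  # documents containing the term
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

print({term: round(weight, 2) for term, weight in idf.items()})
```

A term that appears in every document ("blue") gets the minimum weight of 1, while a term that appears in only one document ("and", "cheese") gets the largest weight.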


It's a good idea to normalize the TF-IDF matrix, i.e. rescale each row (document) to unit length. With the L2 norm, the squared entries of each row sum to 1, so all entries fall between 0 and 1. Some text mining models require normalized matrices. The **norm** parameter is used for this purpose (you can look it up in the documentation):

In [31]:
vectorizer_TF_IDF = TfidfVectorizer(norm = 'l2', smooth_idf = True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix,2), columns=vectorizer_TF_IDF.get_feature_names_out())

Unnamed: 0,and,beautiful,blue,cheese,is,love,sky,so,the
0,0.0,0.0,0.4,0.0,0.49,0.0,0.49,0.0,0.6
1,0.44,0.35,0.23,0.0,0.56,0.0,0.56,0.0,0.0
2,0.0,0.43,0.29,0.0,0.35,0.0,0.35,0.55,0.43
3,0.0,0.0,0.35,0.66,0.0,0.66,0.0,0.0,0.0
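
To see what the L2 norm does, the first row can be normalized by hand. This is a sketch under the assumption that the unnormalized weight is the term count times the smoothed IDF, ln((1 + n) / (1 + df)) + 1; the helper name `smoothed_idf` is made up for illustration.

```python
import math

def smoothed_idf(doc_freq, n_docs=4):
    """Smoothed IDF weight for a term found in doc_freq of n_docs documents."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Document 0 is 'the sky is blue': each term occurs once.
# Document frequencies in the corpus: blue=4, is=3, sky=3, the=2.
row = {"blue": 1 * smoothed_idf(4),
       "is":   1 * smoothed_idf(3),
       "sky":  1 * smoothed_idf(3),
       "the":  1 * smoothed_idf(2)}

# L2 normalization: divide every entry by the Euclidean length of the row.
length = math.sqrt(sum(w * w for w in row.values()))
normalized = {term: w / length for term, w in row.items()}

print({term: round(w, 2) for term, w in normalized.items()})
```

After the division, the squared entries of the row add up to 1, so documents of different lengths become directly comparable.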


### **<font color=green> EXERCISE 3: You are given a new small corpus called corpus_exercise (see below). Your ultimate task is to normalize (pre-process) the corpus and produce the TF-IDF and the Bag-of-Words representations of the data. Follow the steps below to complete this exercise:</font>**

**Step 1.** Download a file called Text_Normalization_Function.ipynb provided with this assignment. That file defines a relatively sophisticated text normalization function. (OPTIONAL: you can explore what that file does when you are done with this exercise.)

**Step 2.** Run the file Text_Normalization_Function.ipynb to define the text normalization function:

In [None]:
%run ./Text_Normalization_Function.ipynb

**Step 3.** Define the **corpus_exercise** text corpus:

In [35]:
corpus_exercise = ['python is great for text mining',
          'anyone can learn python and do text mining',
          'python can go without eating for days',
          'python can be a great pet']



**Step 4.** Normalize the *corpus_exercise* text corpus using the text normalization function defined in the Text_Normalization_Function.ipynb file and call the normalized version **NORM_corpus**. Note that to call the function you need to use normalize_corpus():

In [36]:
NORM_corpus = normalize_corpus(corpus_exercise)
NORM_corpus



['python great text mining',
 'anyone learn python text mining',
 'python without eat day',
 'python great pet']

**Step 5**. Compute and print out the TF-IDF and the Bag-of-Words representations for NORM_corpus (WRITE the lines of code needed in the cell below):

In [37]:
#BOW
vectorizer_BOW = CountVectorizer(max_features=1000) #BOW = bag-of-words
BOW_matrix = vectorizer_BOW.fit_transform(NORM_corpus).toarray()
vectorizer_BOW.get_feature_names_out()
pd.DataFrame(np.round(BOW_matrix,2),columns=vectorizer_BOW.get_feature_names_out())



Unnamed: 0,anyone,day,eat,great,learn,mining,pet,python,text,without
0,0,0,0,1,0,1,0,1,1,0
1,1,0,0,0,1,1,0,1,1,0
2,0,1,1,0,0,0,0,1,0,1
3,0,0,0,1,0,0,1,1,0,0


In [39]:
#TF-IDF
vectorizer_TF_IDF = TfidfVectorizer(norm = None, smooth_idf = True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(NORM_corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix, 2), columns=vectorizer_TF_IDF.get_feature_names_out())




Unnamed: 0,anyone,day,eat,great,learn,mining,pet,python,text,without
0,0.0,0.0,0.0,1.51,0.0,1.51,0.0,1.0,1.51,0.0
1,1.92,0.0,0.0,0.0,1.92,1.51,0.0,1.0,1.51,0.0
2,0.0,1.92,1.92,0.0,0.0,0.0,0.0,1.0,0.0,1.92
3,0.0,0.0,0.0,1.51,0.0,0.0,1.92,1.0,0.0,0.0


In [40]:
#TF-IDF Normalized ver.
vectorizer_TF_IDF = TfidfVectorizer(norm = 'l2', smooth_idf = True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(NORM_corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix,2), columns=vectorizer_TF_IDF.get_feature_names_out())



Unnamed: 0,anyone,day,eat,great,learn,mining,pet,python,text,without
0,0.0,0.0,0.0,0.54,0.0,0.54,0.0,0.36,0.54,0.0
1,0.53,0.0,0.0,0.0,0.53,0.42,0.0,0.28,0.42,0.0
2,0.0,0.55,0.55,0.0,0.0,0.0,0.0,0.29,0.0,0.55
3,0.0,0.0,0.0,0.57,0.0,0.0,0.73,0.38,0.0,0.0


### **<font color=green> OPTIONAL EXERCISE 4: Explore the Text_Normalization_Function.ipynb notebook that defines a text normalization function. </font>**