# Lab 2: TEXT NORMALIZATION and VECTORIZATION <br>


**<font color=green>INSTRUCTIONS:</font>** <br> <br>
    **<font color=green>1. Look for EXERCISES and QUESTIONS in this script. </font>** <br> <br>
    **<font color=green>2. Each student INDIVIDUALLY uploads this script with their answers embedded (and other materials if requested) to Canvas by the deadline indicated on Canvas.</font>** <br>
## SESSION PREP

### How to install any module from inside Jupyter

To install a module from inside Jupyter, we first need the built-in module called sys:

In [1]:
import sys

Now, you can install any module from Jupyter by running a line such as: <br> <br> !{sys.executable} -m pip install module_name
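A minimal sketch of how the install step could be skipped when a module is already available, using only the standard library. The helper names `is_installed` and `ensure_installed` are hypothetical (they are not part of Jupyter or pip), and the sketch assumes the pip package name equals the import name, which is not true for every package:

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name):
    """Install `module_name` with pip only if it is not importable yet.

    Assumption: the pip package name is the same as the import name.
    """
    if not is_installed(module_name):
        # Same command as the `!{sys.executable} -m pip install ...` line,
        # run through subprocess instead of the Jupyter `!` shortcut.
        subprocess.check_call([sys.executable, "-m", "pip", "install", module_name])


print(is_installed("json"))  # a standard-library module, always importable
```

Calling `ensure_installed("nltk")` would then run pip only on a machine where NLTK is missing.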

### Install Natural Language ToolKit (NLTK) module (and some other modules)

The NLTK module does text normalization, among other functions. We'll install the NLTK module, as well as numpy and pandas, from inside Jupyter (you might see deprecation warnings in pink about future changes in the modules, but you do not need to pay attention to them at this time):

In [2]:
!{sys.executable} -m pip install nltk
import nltk

!{sys.executable} -m pip install numpy
import numpy as np 

!{sys.executable} -m pip install pandas
import pandas as pd



## Download text data

In what follows, we'll use an electronic archive of books from Project Gutenberg that the Natural Language ToolKit has access to. In particular, we'll use "Alice in Wonderland" by Lewis Carroll. Our corpus will be just one file called carroll-alice.txt (it's in .txt format):

In [3]:
nltk.download('gutenberg') 
from nltk.corpus import gutenberg 

alice = gutenberg.raw(fileids='carroll-alice.txt') # we name the corpus 'alice'
from pprint import pprint #function for pretty printing
pprint(alice[0:35]) #print the first 35 characters of the corpus

[nltk_data] Downloading package gutenberg to /Users/lilia/nltk_data...


"[Alice's Adventures in Wonderland b"


[nltk_data]   Unzipping corpora/gutenberg.zip.


## TEXT TOKENIZATION
**Tokenization** is splitting text into semantically meaningful chunks, such as sentences or words. Tokenizing into words is most common. You might be interested in tokenizing into sentences if you plan to analyze text sentence by sentence.
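To make the two levels of tokenization concrete, here is a rough sketch that uses only regular expressions from the standard library. It is a naive illustration, not how NLTK's tokenizers work internally, and the example text is made up:

```python
import re

text = "Alice was tired. She saw a rabbit! Where did it go?"

# Naive sentence split: break after '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word split: runs of letters, digits, or apostrophes.
words = re.findall(r"[A-Za-z0-9']+", sentences[0])

print(sentences)
print(words)
```

Real tokenizers handle abbreviations, quotes, and contractions far more carefully than these two patterns do.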

### Tokenization by Sentence
From the NLTK module, we'll use a sentence tokenizer 'punkt':

In [4]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/lilia/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's now tokenize the Alice corpus by sentence:

In [5]:
alice_sentences = nltk.sent_tokenize(text=alice)
print('\nTotal sentences in the corpus:', len(alice_sentences))


Total sentences in the corpus: 1625


Let's have a look at the first sentence in the Alice corpus:

In [6]:
print('\nFirst sentence in alice:', alice_sentences[0])


First sentence in alice: [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I.


Let's now look at what the second sentence looks like:

In [7]:
print('\nSecond sentence in alice:', alice_sentences[1])


Second sentence in alice: Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'


### <font color=green>QUESTION 1: Why do you think the first and second tokenized sentences (above) look like that? (look at what Python printed out)</font>

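The sentence tokenizer splits the text at sentence-ending punctuation such as "." and "?", not at line breaks. The title and the chapter headings have no closing punctuation of their own, so the first "sentence" runs from the title to the period in "CHAPTER I.", and the second one starts with the heading "Down the Rabbit-Hole" and ends at the "?" in "conversation?'".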
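The effect can be reproduced with a toy splitter. The sketch below is only an approximation built on a regular expression (it is not the punkt algorithm), and the short text is invented to mimic the layout of the Alice file:

```python
import re

text = "[A Title 1865]\n\nCHAPTER I.\nDown the Hole\n\nAlice was bored. She fell asleep."

# Toy rule: a sentence ends at '.', '!' or '?' followed by whitespace.
# Line breaks alone never end a sentence.
pieces = re.split(r"(?<=[.!?])\s+", text)

for i, piece in enumerate(pieces):
    print(i, repr(piece))
```

The title and "CHAPTER I." end up in the same piece, and the heading "Down the Hole" is glued to the sentence that follows it, which is the same pattern seen in the first two tokenized sentences of the corpus.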

### Tokenization into Words
Let's do some tokenization into words now. You can tokenize into words using punctuation signs, white spaces, or "words".

We'll tokenize a corpus consisting of one sentence shown below:

In [8]:
sentence = "The brown fox wasn't that quick and he couldn't win the races"

Let's tokenize **using "words"**:

In [9]:
words = nltk.word_tokenize(sentence)
print(words)  

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'races']


Let's tokenize **using punctuation signs** now. Do you see any difference between this tokenization and the previous one?

In [10]:
wordpunkt_wt = nltk.WordPunctTokenizer()
words = wordpunkt_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'races']


Let's tokenize **using white spaces**:

In [11]:
whitespace_wt = nltk.WhitespaceTokenizer()
words = whitespace_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']
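The difference between the last two tokenizers can be sketched with the standard library alone. The pattern `\w+|[^\w\s]+` is the one NLTK documents for `WordPunctTokenizer`; `str.split()` is used here as a stand-in for `WhitespaceTokenizer`, which is an approximation rather than NLTK's own code:

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the races"

# Punctuation-based: runs of word characters OR runs of punctuation.
# The apostrophe becomes a token of its own.
punct_tokens = re.findall(r"\w+|[^\w\s]+", sentence)

# Whitespace-based: split on spaces only, so contractions stay intact.
space_tokens = sentence.split()

print(punct_tokens)
print(space_tokens)
```

Which variant is preferable depends on the next step: if contractions such as "wasn't" are to be matched against a stopword list, the whitespace tokens keep them in one piece.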


## STOPWORDS

Let's get rid of stopwords ("it's", "is", "the", etc.):

In [12]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words=set(stopwords.words("english"))
print(stop_words)

{'into', 'couldn', 'theirs', 'than', 'those', 'below', "aren't", 'my', 'be', 'wouldn', 'having', 'all', "couldn't", 'did', 'after', "mightn't", 'but', 'itself', 'in', 'doing', 'won', 'what', 'm', 'them', "wasn't", 'with', "needn't", 'ma', "it's", 'how', 'didn', 'our', 'will', 'have', 'i', 'such', 'shouldn', 'haven', 'her', 'to', 'as', 'she', "you'd", 'once', 'over', 'until', "should've", 'been', 'ourselves', 'll', 'their', 'very', 've', 'mustn', 'yourself', 'from', 'has', "you'll", 'on', 'these', 'now', 'its', 're', 'do', 'up', 'any', 'each', "doesn't", 'yours', "shan't", 'same', 'under', 'before', "that'll", 'too', 'd', 'he', 't', 'only', 'more', 'doesn', 'other', 'themselves', 'at', 'where', 'off', 'does', 'by', 'for', 's', 'when', "you're", 'hadn', 'just', 'needn', 'through', 'again', 'further', 'a', 'yourselves', 'both', 'so', 'you', 'weren', 'being', 'ain', 'whom', 'that', 'me', 'then', "haven't", 'above', 'own', 'which', "hadn't", 'some', 'is', "you've", 'had', 'are', 'shan', 'wh

[nltk_data] Downloading package stopwords to /Users/lilia/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


You can (and should consider) amending the list of stopwords given your data and project objectives. For example, we can add more stopwords to the standard list:

In [13]:
add_stopwords ={'so','NYC'}
stop_words_new = add_stopwords.union(stop_words)
print(stop_words_new)

{'into', 'couldn', 'NYC', 'theirs', 'than', 'those', 'below', "aren't", 'my', 'be', 'wouldn', 'having', 'all', "couldn't", 'did', 'after', "mightn't", 'but', 'itself', 'in', 'doing', 'won', 'what', 'm', 'them', "wasn't", 'with', "needn't", 'ma', "it's", 'how', 'didn', 'our', 'will', 'have', 'i', 'such', 'shouldn', 'haven', 'her', 'to', 'as', 'she', "you'd", 'once', 'over', 'until', "should've", 'been', 'ourselves', 'll', 'their', 'very', 've', 'mustn', 'yourself', 'from', 'has', "you'll", 'on', 'these', 'now', 'its', 're', 'do', 'up', 'any', 'each', "doesn't", 'yours', "shan't", 'same', 'under', 'before', "that'll", 'too', 'd', 'he', 't', 'only', 'more', 'doesn', 'other', 'themselves', 'at', 'where', 'off', 'does', 'by', 'for', 's', 'when', "you're", 'hadn', 'just', 'needn', 'through', 'again', 'further', 'a', 'yourselves', 'both', 'so', 'you', 'weren', 'being', 'ain', 'whom', 'that', 'me', 'then', "haven't", 'above', 'own', 'which', "hadn't", 'some', 'is', "you've", 'had', 'are', 'sha

Now, compare the tokenized sentence before and after removing the stopwords:

In [14]:
filtered_tokens=[]

for w in words:
    if w not in stop_words:
        filtered_tokens.append(w)
        
print("Tokenized Sentence:",words)
print("Filtered Sentence (without stopwords):",filtered_tokens)

Tokenized Sentence: ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'races']
Filtered Sentence (without stopwords): ['The', 'brown', 'fox', 'quick', 'win', 'races']
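Note that "The" survives the filter: the stopword list is all lowercase and the comparison is case-sensitive. A possible fix is to lowercase each token before the lookup. The sketch below uses a small hand-picked stopword set (an assumption made to keep the example self-contained) instead of the full NLTK list:

```python
words = ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he',
         "couldn't", 'win', 'the', 'races']

# Hand-picked subset of the English stopword list, for illustration only.
small_stop_words = {'the', 'that', 'and', 'he', "wasn't", "couldn't"}

# Compare the lowercased token so that 'The' is treated like 'the'.
filtered_lower = [w for w in words if w.lower() not in small_stop_words]

print(filtered_lower)
```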


## STEMMING AND LEMMATIZATION

Let's stem the sentence first:

In [15]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_tokens=[]
for w in filtered_tokens:
    stemmed_tokens.append(ps.stem(w))

print("Filtered Sentence:",filtered_tokens)
print("Stemmed Sentence:",stemmed_tokens)

Filtered Sentence: ['The', 'brown', 'fox', 'quick', 'win', 'races']
Stemmed Sentence: ['the', 'brown', 'fox', 'quick', 'win', 'race']


Compare stemming to lemmatization for the word "running": 

In [17]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

word = "running"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: run
Stemmed Word: run


[nltk_data] Downloading package wordnet to /Users/lilia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


One more comparison for the word "bought":

In [18]:
word = "bought"
print("Lemmatized Word:",lem.lemmatize(word,"v")) # 'v' indicates that the word is a verb (part-of-speech)
print("Stemmed Word:",ps.stem(word))

Lemmatized Word: buy
Stemmed Word: bought


### <font color=green>EXERCISE 1: What result would you get if you change the part-of-speech tag in the lemmatization line above to "n", which means "noun"? (look at what Python printed out)</font> <br>
Your Answer in the cell below:

In [19]:
print("Lemmatized Word:",lem.lemmatize(word,"n")) 

Lemmatized Word: bought
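The contrast between the two approaches can be illustrated with a toy example. The functions below are deliberately crude stand-ins (they are neither the Porter algorithm nor WordNet), and the tiny lemma table is made up for this sketch: the stemmer only strips suffixes, while the lemmatizer looks up a (word, part-of-speech) pair in a dictionary.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for rule-based stemming."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary keyed by (word, part of speech).
TOY_LEMMAS = {("bought", "v"): "buy", ("running", "v"): "run"}


def toy_lemmatize(word, pos):
    """Return the dictionary form if the (word, pos) pair is known."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_stem("walking"), toy_stem("bought"))
print(toy_lemmatize("bought", "v"), toy_lemmatize("bought", "n"))
```

No suffix rule turns "bought" into "buy", so only the dictionary lookup finds it, and only when the part-of-speech tag matches an entry.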


## VECTORIZATION

Text vectorization is the process of feature extraction from text data, that is the process of creating variables for each observation, where an observation is a text document. We'll consider the **bag-of-words**, the **TF-IDF** and the **n-grams** vectorized representations of text. <br>

Let's vectorize the corpus about "blue skies and blue cheese" similar to one used in the video lecture: 

In [20]:
corpus = ['the sky is blue',
          'sky is blue and sky is beautiful', 
          'the beautiful sky is so blue',
          'i love blue cheese']

We'll use the built-in vectorizers from the scikit-learn machine learning module. 

### Bag-of-Words Representation

We'll use bag-of-words representation (CountVectorizer) first. You can see the documentation here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

It is convenient to "define" a vectorizer first before applying it. You can specify all the parameters (arguments) of the function in the definition. For example, the max_features parameter below drops all features except for the selected number of most frequent terms in the corpus:

In [22]:
vectorizer_BOW = CountVectorizer(max_features=1000) #BOW = bag-of-words

Now let's extract features using the vectorizer function. Note the .fit_transform function below. It creates the dictionary of the corpus and does the vectorization: 

In [23]:
BOW_matrix = vectorizer_BOW.fit_transform(corpus).toarray()
pd.DataFrame(np.round(BOW_matrix,2))

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0,0,1,0,1,0,1,0,1
1,1,1,1,0,2,0,2,0,0
2,0,1,1,0,1,0,1,1,1
3,0,0,1,1,0,1,0,0,0


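We want to attach the names of the features, right? Here are the names of the features from the dictionary of the corpus (note the function get_feature_names(); in scikit-learn 1.0 and later it is called get_feature_names_out(), and the old name was removed in version 1.2):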

In [24]:
vectorizer_BOW.get_feature_names()

['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'so', 'the']
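Because the method name depends on the installed scikit-learn version, a small wrapper can make the notebook run on both old and new releases. The helper `feature_names` and the two stand-in classes are hypothetical names used only for this sketch; with a real vectorizer you would pass the fitted object itself:

```python
def feature_names(vectorizer):
    """Return feature names as a list, whichever method the object offers."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn: returns an array, so convert it to a list.
        return list(vectorizer.get_feature_names_out())
    # Older scikit-learn: already returns a list.
    return list(vectorizer.get_feature_names())


# Stand-ins that mimic the two API generations (not real vectorizers).
class OldStyle:
    def get_feature_names(self):
        return ["blue", "sky"]


class NewStyle:
    def get_feature_names_out(self):
        return ("blue", "sky")


print(feature_names(OldStyle()), feature_names(NewStyle()))
```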

Let's get a more useful looking bag-of-words representation, with feature names attached:

In [25]:
pd.DataFrame(np.round(BOW_matrix,2),columns=vectorizer_BOW.get_feature_names())

Unnamed: 0,and,beautiful,blue,cheese,is,love,sky,so,the
0,0,0,1,0,1,0,1,0,1
1,1,1,1,0,2,0,2,0,0
2,0,1,1,0,1,0,1,1,1
3,0,0,1,1,0,1,0,0,0
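To see where these counts come from, the table can be rebuilt by hand with the standard library. This is a sketch of the idea rather than scikit-learn's implementation; it assumes lowercase text split on spaces and keeps only tokens of two or more characters, which mirrors CountVectorizer's default token pattern (that is why "i" is missing from the vocabulary):

```python
from collections import Counter

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']


def tokenize(doc):
    # Drop one-character tokens such as 'i', as the default vectorizer does.
    return [t for t in doc.lower().split() if len(t) >= 2]


# The "dictionary" of the corpus: all distinct tokens in alphabetical order.
vocabulary = sorted({t for doc in corpus for t in tokenize(doc)})

# One row per document, one column per vocabulary term.
bow = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```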


### Vectorization Using N-grams
<br>
Let's use bi-grams in our vectorized representation of text. First, we define the vectorizer (we need the same CountVectorizer() function) using a parameter for specifying n-grams. Then we apply it:

In [26]:
vectorizer_Bi_Grams = CountVectorizer(max_features=1000, ngram_range=(2, 2))
Bi_Grams_matrix = vectorizer_Bi_Grams.fit_transform(corpus).toarray()
pd.DataFrame(np.round(Bi_Grams_matrix,2),columns=vectorizer_Bi_Grams.get_feature_names())

Unnamed: 0,and sky,beautiful sky,blue and,blue cheese,is beautiful,is blue,is so,love blue,sky is,so blue,the beautiful,the sky
0,0,0,0,0,0,1,0,0,1,0,0,1
1,1,0,1,0,1,1,0,0,2,0,0,0
2,0,1,0,0,0,0,1,0,1,1,1,0
3,0,0,0,1,0,0,0,1,0,0,0,0
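The column names above are simply all pairs of adjacent words. A minimal sketch of how n-grams can be generated from a token list follows; the function `ngrams` and its parameters are illustrative names, chosen to echo the two numbers passed to `ngram_range`:

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams with n_min <= n <= n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = "the sky is blue".split()
print(ngrams(tokens, 2, 2))  # bi-grams only
print(ngrams(tokens, 1, 2))  # uni-grams followed by bi-grams
```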


### <font color=green>EXERCISE 2: Create a Bi-Grams vectorizer that uses a mix of bi-grams and uni-grams. To complete the Exercise you may need to look up CountVectorizer's documentation, see link below.</font> <br>

Documentation: 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html <br>

Your Answer in the cell below:

In [27]:
vectorizer_Uni_Bi_Grams = CountVectorizer(max_features=1000, ngram_range=(1, 2))
Uni_Bi_Grams_matrix = vectorizer_Uni_Bi_Grams.fit_transform(corpus).toarray()
pd.DataFrame(np.round(Uni_Bi_Grams_matrix,2),columns=vectorizer_Uni_Bi_Grams.get_feature_names())

Unnamed: 0,and,and sky,beautiful,beautiful sky,blue,blue and,blue cheese,cheese,is,is beautiful,...,is so,love,love blue,sky,sky is,so,so blue,the,the beautiful,the sky
0,0,0,0,0,1,0,0,0,1,0,...,0,0,0,1,1,0,0,1,0,1
1,1,1,1,0,1,1,0,0,2,1,...,0,0,0,2,2,0,0,0,0,0
2,0,0,1,1,1,0,0,0,1,0,...,1,0,0,1,1,1,1,1,1,0
3,0,0,0,0,1,0,1,1,0,0,...,0,1,1,0,0,0,0,0,0,0


### Vectorization with Term Frequency – Inverse Document Frequency (TF-IDF)

Now, let's do feature extraction (vectorization) using the TF-IDF approach. <br> <br> See full documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer <br> <br>
Import the vectorizer first and define it by specifying its parameters (look up the specified parameters in the documentation):

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vectorizer_TF_IDF = TfidfVectorizer(norm = None, smooth_idf = True)




Let's vectorize our corpus now:

In [29]:
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).toarray()
pd.DataFrame(np.round(TF_IDF_matrix, 2), columns=vectorizer_TF_IDF.get_feature_names())

Unnamed: 0,and,beautiful,blue,cheese,is,love,sky,so,the
0,0.0,0.0,1.0,0.0,1.22,0.0,1.22,0.0,1.51
1,1.92,1.51,1.0,0.0,2.45,0.0,2.45,0.0,0.0
2,0.0,1.51,1.0,0.0,1.22,0.0,1.22,1.92,1.51
3,0.0,0.0,1.0,1.92,0.0,1.92,0.0,0.0,0.0


Have a look at the IDF weights:

In [30]:
print(np.round(vectorizer_TF_IDF.idf_,2))

[1.92 1.51 1.   1.92 1.22 1.92 1.22 1.92 1.51]
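These weights can be checked by hand. With `smooth_idf=True`, scikit-learn documents the weight of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below applies that formula with the standard library; the vocabulary list is copied from the feature names printed earlier:

```python
import math

corpus = ['the sky is blue',
          'sky is blue and sky is beautiful',
          'the beautiful sky is so blue',
          'i love blue cheese']
vocabulary = ['and', 'beautiful', 'blue', 'cheese', 'is', 'love', 'sky', 'so', 'the']

n_docs = len(corpus)
idf = []
for term in vocabulary:
    # Document frequency: in how many documents does the term occur?
    df = sum(term in doc.split() for doc in corpus)
    idf.append(math.log((1 + n_docs) / (1 + df)) + 1)

print([round(w, 2) for w in idf])
```

"blue" occurs in every document, so its weight is the minimum value of 1, while terms that occur in a single document get the largest weight.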


It's a good idea to normalize the TF-IDF matrix, i.e. rescale each document's row so that it has unit length; since TF-IDF values are non-negative, all entries then fall between 0 and 1. Some text mining models require normalized matrices. The norm parameter is used for this purpose (you can look it up in the documentation):

In [31]:
vectorizer_TF_IDF = TfidfVectorizer(norm = 'l2', smooth_idf = True)
TF_IDF_matrix = vectorizer_TF_IDF.fit_transform(corpus).todense()
pd.DataFrame(np.round(TF_IDF_matrix,2), columns=vectorizer_TF_IDF.get_feature_names())

Unnamed: 0,and,beautiful,blue,cheese,is,love,sky,so,the
0,0.0,0.0,0.4,0.0,0.49,0.0,0.49,0.0,0.6
1,0.44,0.35,0.23,0.0,0.56,0.0,0.56,0.0,0.0
2,0.0,0.43,0.29,0.0,0.35,0.0,0.35,0.55,0.43
3,0.0,0.0,0.35,0.66,0.0,0.66,0.0,0.0,0.0
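What `norm='l2'` does to a single row can be sketched in a few lines: divide every entry by the row's Euclidean length. The input below is the first row of the unnormalized TF-IDF table shown earlier, using its rounded values, so the result is only accurate to about two decimals:

```python
import math

# First document, unnormalized TF-IDF values (rounded, from the earlier table).
row = [0.0, 0.0, 1.0, 0.0, 1.22, 0.0, 1.22, 0.0, 1.51]

# Euclidean (L2) length of the row.
length = math.sqrt(sum(x * x for x in row))

# After division the squared entries sum to 1.
unit_row = [x / length for x in row]

print([round(x, 2) for x in unit_row])
```

The rounded result agrees with the first row of the normalized table above.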


### **<font color=green> EXERCISE 3: You are given a new small corpus called corpus_exercise (see below). Your ultimate task is to normalize (pre-process) the corpus and produce the TF-IDF and the Bag-of-Words representations of the data. Follow the steps below to complete this exercise:</font>**

Step 1. Download a file Text_Normalization_Function.ipynb from Canvas and put it into the same directory(!) as the current Jupyter notebook. That file defines a relatively sophisticated text normalization function. (OPTIONAL: you can explore what that file does when you are done with this exercise.)

Step 2. Run the file Text_Normalization_Function.ipynb to define the text normalization function:

In [32]:
%run ./Text_Normalization_Function.ipynb


[nltk_data] Downloading package stopwords to /Users/lilia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/lilia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lilia/nltk_data...


Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  ['<', 'p', '>', 'The', 'circus', 'dog', 'in', 'a', 'plissé', 'skirt', 'jumped', 'over', 'Python', 'who', 'was', "n't", 'that', 'large', ',', 'just', '3', 'feet', 'long.', '<', '/p', '>']
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  <p>The circus dog in a plissé skirt jumped over Python who was not that large, just 3 feet long.</p>
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  [('<', 'a'), ('p', 'n'), ('>', 'v'), ('the', None), ('circus', 'n'), ('dog', 'n'), ('in', None), ('a', None), ('plissé', 'n'), ('skirt', 'n'), ('jumped', 'v'), ('over', None), ('python', 'n'), ('who', None), ('was', 'v'), ("n't", 'r'), ('that', None), ('large', 'a'), (',', None), ('just', 'r'), ('3', None), ('feet', 'n'), ('long.', 'a'), 

[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /Users/lilia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Step 3. Define the corpus_exercise text corpus:

In [33]:
corpus_exercise = ['python is great for text mining',
          'anyone can learn python and do text mining', 
          'python can go without eating for days',
          'python can be a great pet']



Step 4. Normalize the corpus_exercise text corpus and call its normalized version NORM_corpus:

In [34]:
NORM_corpus = normalize_corpus(corpus_exercise)
NORM_corpus



['python great text mining',
 'anyone learn python text mining',
 'python without eat day',
 'python great pet']

Step 5. Compute and print out the TF-IDF and the Bag-of-Words representations for NORM_corpus (WRITE the lines of code needed in the cell below):

In [41]:
NORM_vectorizer_TF_IDF = TfidfVectorizer(norm = 'l2', smooth_idf = True)
# norm='l2' scales each row to unit length, so every entry lies between 0 and 1
NORM_TF_IDF_matrix = NORM_vectorizer_TF_IDF.fit_transform(NORM_corpus).todense()
pd.DataFrame(np.round(NORM_TF_IDF_matrix, 2), columns=NORM_vectorizer_TF_IDF.get_feature_names())



Unnamed: 0,anyone,day,eat,great,learn,mining,pet,python,text,without
0,0.0,0.0,0.0,0.54,0.0,0.54,0.0,0.36,0.54,0.0
1,0.53,0.0,0.0,0.0,0.53,0.42,0.0,0.28,0.42,0.0
2,0.0,0.55,0.55,0.0,0.0,0.0,0.0,0.29,0.0,0.55
3,0.0,0.0,0.0,0.57,0.0,0.0,0.73,0.38,0.0,0.0


In [44]:
NORM_BOW_matrix = vectorizer_BOW.fit_transform(NORM_corpus).toarray()

pd.DataFrame(np.round(NORM_BOW_matrix,2),columns=vectorizer_BOW.get_feature_names())



Unnamed: 0,anyone,day,eat,great,learn,mining,pet,python,text,without
0,0,0,0,1,0,1,0,1,1,0
1,1,0,0,0,1,1,0,1,1,0
2,0,1,1,0,0,0,0,1,0,1
3,0,0,0,1,0,0,1,1,0,0


### **<font color=green> OPTIONAL EXERCISE 4: Explore the Text_Normalization_Function.ipynb notebook that defines a text normalization function. The file is available on Canvas in the Lab 2 Assignment (no answer is needed). </font>**