![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)
***
# *Practicum AI:* NLP - Data Cleaning

These exercises are adapted from Baig et al. (2020) <i>The Deep Learning Workshop</i> from <a href="https://www.packtpub.com/product/the-deep-learning-workshop/9781839219856">Packt Publishers</a> (Exercises 4.01 - 4.06, page 159).

(15 Minutes: Exercises 4.01 - 4.03)

#### Introduction
Data cleaning and preparation are critical to the success of any AI project.  By a common rule of thumb, roughly 80% of the time on a typical AI project is spent on data-related tasks.  The initial section of this notebook provides a short overview of text pre-processing, which is the work you do to prepare the data for your primary analysis or model.  Take some time to review this content and execute the code.

#### Import Supporting Libraries

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))

#### Getting Started with Text Data Handling
Create some test data for demonstration purposes. 

In [2]:
raw_txt = """Welcome to the world of Deep Learning for NLP! We're in this together, and we'll learn together. 
NLP is amazing, and Deep Learning makes it even more fun. Let's learn!"""

##### Tokenization
Tokenization is the splitting of raw input text into **tokens**.  A token can be a paragraph, sentence, word, or character.
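
To make the different token granularities concrete, here is a minimal sketch that uses only Python's standard `re` module. The regular expressions are rough assumptions chosen for illustration; they are not the rules nltk applies, and they will split contractions differently from the nltk tokenizers used in the cells that follow.

```python
import re

text = "NLP is amazing. Let's learn!"

# Sentence-level tokens: naive split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word-level tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens: every character, including spaces
chars = list(text)

print(sentences)
print(words)
print(len(chars))
```

The same text yields two sentence tokens, a handful of word tokens, and one character token per character, so the choice of granularity directly changes the size of the data you work with.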

In [3]:
import nltk
nltk.download('punkt')
from nltk import tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     /home/danielmaxwell/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
tokenize.sent_tokenize(raw_txt)

['Welcome to the world of Deep Learning for NLP!',
 "We're in this together, and we'll learn together.",
 'NLP is amazing, and Deep Learning makes it even more fun.',
 "Let's learn!"]

In [5]:
txt_sents = tokenize.sent_tokenize(raw_txt)

In [6]:
type(txt_sents), len(txt_sents)

(list, 4)

> It looks like our sentence tokenizer (`sent_tokenize`) has done a good job of identifying the 4 sentences in our text and returning them in a list object.  Let's check out the word tokenizer.

In [7]:
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]
type(txt_words), type(txt_words[0])

(list, list)

In [8]:
print(txt_words[:2])

[['Welcome', 'to', 'the', 'world', 'of', 'Deep', 'Learning', 'for', 'NLP', '!'], ['We', "'re", 'in', 'this', 'together', ',', 'and', 'we', "'ll", 'learn', 'together', '.']]


> This is just what we expected.  Our text data has been tokenized into individual words.

##### Normalizing case
We usually don't want the words "cat", "CAT", "Cat", and "CaT" to be treated as distinct entities.  Python string comparison is case-sensitive, so without intervention these would count as four different tokens.  To avoid this, we typically convert all text to lower or upper case.
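
As a quick illustration of why this matters, the sketch below counts a small made-up token list before and after lowercasing. The token list is an assumption for demonstration only.

```python
from collections import Counter

# Hypothetical tokens: four spellings of "cat" plus one other word
tokens = ["cat", "CAT", "Cat", "CaT", "dog"]

# Without normalization every spelling is its own entry
raw_counts = Counter(tokens)

# After lowercasing, the four spellings collapse into one entry
norm_counts = Counter(token.lower() for token in tokens)

print(raw_counts)
print(norm_counts)
```

Before normalization the counter holds five distinct keys; afterwards it holds two, with "cat" counted four times.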

One way of normalizing case is shown here.  We could have executed this code at the start, before tokenization.  But since we've already come this far, we'll continue using the txt_sents variable. 

```python
raw_txt = raw_txt.lower()
```

In [10]:
txt_sents = [sent.lower() for sent in txt_sents]
txt_sents

['welcome to the world of deep learning for nlp!',
 "we're in this together, and we'll learn together.",
 'nlp is amazing, and deep learning makes it even more fun.',
 "let's learn!"]

In [11]:
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]

In [12]:
print(txt_words[:2])

[['welcome', 'to', 'the', 'world', 'of', 'deep', 'learning', 'for', 'nlp', '!'], ['we', "'re", 'in', 'this', 'together', ',', 'and', 'we', "'ll", 'learn', 'together', '.']]


> This looks pretty good!  All our words are now in lowercase. So up to this point, we've taken raw input text, broken it into sentences, normalized the case, and then split that into words. 

##### Removing punctuation
As the output from the last code block shows, punctuation tokens are present.  In some cases, punctuation is important and ought to be left in place.  Consider, for example, a sentiment analysis project where an exclamation point conveys important information.  But here, that is not the case.  So, let's remove the punctuation.

In [13]:
from string import punctuation

> Let's take a look at our punctuation list.

In [14]:
list_punct = list(punctuation)
print(list_punct)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [15]:
def drop_punct(input_tokens):
    return [token for token in input_tokens if token not in list_punct]

In [16]:
drop_punct(["let",".","us",".","go","!"])

['let', 'us', 'go']

In [17]:
txt_words_nopunct = [drop_punct(sent) for sent in txt_words]
print(txt_words_nopunct)

[['welcome', 'to', 'the', 'world', 'of', 'deep', 'learning', 'for', 'nlp'], ['we', "'re", 'in', 'this', 'together', 'and', 'we', "'ll", 'learn', 'together'], ['nlp', 'is', 'amazing', 'and', 'deep', 'learning', 'makes', 'it', 'even', 'more', 'fun'], ['let', "'s", 'learn']]


> Our data now looks much cleaner with the punctuation removed!
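
One caveat worth sketching: `drop_punct` compares whole tokens against single punctuation characters, so a multi-character token such as "..." or "--" would survive the filter. The variant below, with the hypothetical name `drop_punct_all`, drops any token made up entirely of punctuation characters. It is one possible approach under that assumption, not the textbook's method.

```python
from string import punctuation

# Set lookup is faster than scanning a list for every character
punct_chars = set(punctuation)

def drop_punct_all(input_tokens):
    """Drop tokens that consist only of punctuation characters."""
    return [
        token for token in input_tokens
        if not all(ch in punct_chars for ch in token)
    ]

# Hypothetical tokens, including multi-character punctuation
tokens = ["wait", "...", "what", "?", "!", "'re", "--"]
print(drop_punct_all(tokens))
```

Tokens that mix letters and punctuation, such as "'re", are kept, which matches the behavior seen in the output above.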

##### Removing stop words
Natural language contains many words that carry little information on their own.  These are called *stop words*.  Stop words fall into two broad categories:

- General / Functional: These are filler words like "the", "an", "of", and so on.
- Contextual: These are words that don't add much value in a particular context.  For example, the word "phone" may not add much value when analyzing mobile phone reviews.

The nltk library conveniently provides a list of common stop words.  Let's import it and download stop words (p. 164). 

In [18]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/danielmaxwell/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [19]:
from nltk.corpus import stopwords
list_stop = stopwords.words("english")
len(list_stop)

179

In [20]:
print(list_stop[:50])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']


> As we see above, these commonly used "filler" words play a functional role in the language but don't add a lot of information.  Stop words are removed in the same way punctuation was removed.
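
The sketch below shows that removal step, combining a general list with a contextual one. Both word sets are small hand-written assumptions for illustration (they are not the nltk list), and `drop_stop_words` is a hypothetical helper name.

```python
# Illustrative stop word sets; NOT the nltk stop word list
general_stop = {"the", "an", "of", "is", "and", "this", "it", "to"}
contextual_stop = {"phone"}  # low-value word when analyzing phone reviews

# Union of both categories; a set gives fast membership tests
stop_words = general_stop | contextual_stop

def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]

review = ["this", "phone", "is", "fast", "and", "the", "battery", "lasts"]
print(drop_stop_words(review, stop_words))
```

Only "fast", "battery", and "lasts" remain, which are the words that actually distinguish this review from others.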

#### Exercise 4.01 (Tokenizing, Case Normalization, Punctuation and Stop Word...) - Page 166

***

#### 1. Import the natural language toolkit & tokenizer

```python
import nltk
from nltk import tokenize
```

In [1]:
# Code it!

#### 2. Create a raw text variable

In [22]:
raw_txt = """Welcome to the world of Deep Learning for NLP! We're in this together, and we'll learn together. NLP is amazing, and Deep Learning makes it even more fun. Let's learn!"""

#### 3. Tokenize the text into sentences and convert to lower-case

```python
txt_sents = tokenize.sent_tokenize(raw_txt.lower())
```

In [23]:
# Code it!

#### 4. Convert sentences into words

```python
txt_words = [tokenize.word_tokenize(sent) for sent in txt_sents]
```

In [24]:
# Code it!

#### 5. Import punctuation and convert to a list

```python
from string import punctuation
stop_punct = list(punctuation)
```

In [25]:
# Code it!

#### 6. Import the stopwords and assign them to a variable

```python
from nltk.corpus import stopwords
stop_nltk = stopwords.words("english")
```

In [26]:
# Code it!

#### 7. Assign the punctuation and stop words to a variable 

```python
stop_final = stop_punct + stop_nltk
```

In [2]:
# Code it!

#### 8. Define a function to remove punctuation and stopwords

In [28]:
def drop_stop(input_tokens):
    return [token for token in input_tokens if token not in stop_final]

#### 9. Remove redundant tokens using the drop_stop() function

```python
txt_words_nostop = [drop_stop(sent) for sent in txt_words]
```

In [29]:
# Code it!

#### 10. Print the words in the first sentence

```python
print(txt_words_nostop[0])
```

In [31]:
# Code it!

Now that we've completed an initial data cleaning exercise, there are some additional things we can do to prepare our data.

##### Stemming

"Eat", "eats", "eating", and "ate" are all just variations of the same word, referring to the same action. In most text and spoken language, we have multiple forms of the same word. Typically, we don't want these to be considered as separate tokens.  Stemming is a rule-based approach to achieve normalization by reducing a word to its "stem". The stem is the root of the word before any affixes (elements added to make a variant) are attached. This approach is rather simple: chop off the suffix to get the stem. A popular algorithm is the **Porter stemming algorithm**, which applies a series of rules like those below (p. 170).

| **Rule**     | **Term**    | **Stem**    |
|--------------|-------------|-------------|
| s->          | cats        | cat         |
| ies->i       | trophies    | trophi      |
| es->e        | drives      | drive       |
| ing->e       | driving     | drive       |
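
To show how suffix rules of this kind operate, here is a toy sketch that applies only the four rewrites from the table, longest suffix first. It is deliberately not the Porter algorithm: `SUFFIX_RULES` and `toy_stem` are hypothetical names, and the real algorithm adds conditions on the remaining stem that this sketch omits.

```python
# (suffix, replacement) pairs, checked in order so longer suffixes win
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", "e"),
    ("es", "e"),
    ("s", ""),
]

def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

for term in ["cats", "trophies", "drives", "driving"]:
    print(term, "->", toy_stem(term))
```

The four terms from the table come out as "cat", "trophi", "drive", and "drive". The missing conditions show up quickly, though: this toy version turns "sing" into "se", which is why a tested implementation such as nltk's `PorterStemmer` is preferable in practice.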

First, we import the PorterStemmer.

In [76]:
from nltk.stem import PorterStemmer

In [77]:
stemmer_p = PorterStemmer()

> Let's test drive the Porter stemmer with a single word.  Test out some others as well.

In [78]:
print(stemmer_p.stem("driving"))

drive


> Let's now check out the stemmer on a sentence.  Note: the sentence is tokenized first.

In [79]:
txt = "I mustered all my drive, drove to the driving school!"

In [80]:
tokens = tokenize.word_tokenize(txt)
print([stemmer_p.stem(word) for word in tokens])

['I', 'muster', 'all', 'my', 'drive', ',', 'drove', 'to', 'the', 'drive', 'school', '!']


> Notice how the stemmer correctly changed 'mustered' to 'muster' and 'driving' to 'drive'.  The irregular form 'drove' is left unchanged, because suffix rules cannot relate it to 'drive'.

##### Lemmatization
Lemmatization is a more sophisticated approach that uses a dictionary to find a valid root form (lemma) of the word.  Unlike a stem such as "trophi", the output from a lemmatization step is always a valid English word.  However, the dictionary lookup makes lemmatization more computationally expensive than stemming (p. 171).

Let's import WordNetLemmatizer and instantiate it.

In [81]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rahim.baig\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [82]:
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()

> And try it out on a single word...

In [83]:
lemmatizer.lemmatize("ponies")

'pony'
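
To contrast the dictionary idea with suffix stripping, the toy sketch below looks words up in a tiny hand-made table. The table and the name `toy_lemmatize` are assumptions for illustration; a real lemmatizer such as the WordNet one relies on a far larger lexical database.

```python
# Hypothetical mini-dictionary mapping word forms to their lemma
LEMMA_TABLE = {
    "ate": "eat",
    "eats": "eat",
    "eating": "eat",
    "ponies": "pony",
    "drove": "drive",
}

def toy_lemmatize(word):
    """Return the lemma if the word is in the table, else the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ["ate", "ponies", "drove", "school"]:
    print(word, "->", toy_lemmatize(word))
```

Irregular forms like "ate" and "drove" resolve to valid words here because the mapping is looked up rather than derived from the spelling, which is the property that rule-based stemming lacks.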

#### Exercise 4.02 (Stemming Our Data) - Page 172

***

#### 1. Import the Porter Stemmer

```python
from nltk.stem import PorterStemmer
```

In [84]:
# Code it!

#### 2. Instantiate the stemmer

```python
stemmer_p = PorterStemmer()
```

In [85]:
# Code it!

#### 3. Apply the stemmer to the first sentence

```python
print([stemmer_p.stem(token) for token in txt_words_nostop[0]])
```

In [9]:
# Code it!

#### 4. Apply the stemmer to all the sentences

```python
txt_words_stem = [[stemmer_p.stem(token) for token in sent] for sent in txt_words_nostop]
```

In [87]:
# Code it!

#### 5. Examine the stemmed data

```python
txt_words_stem
```

In [8]:
# Code it!

***
##### Downloading Text Corpora using NLTK

Up to this point, we've used dummy data to test a variety of data prep techniques.  But what if we want to use real text?  Well, the **gutenberg** text corpus is available via nltk. 

In [1]:
import nltk

<div style="padding: 10px;margin-bottom: 20px;border: thin solid #30335D;border-left-width: 10px;background-color: #fff"><strong>Note:</strong> The nltk.download() command, when executed without arguments, does not open the NLTK downloader in a new window, as pictured in the textbook.  The Unix version has a command line interface.  Type 'l' in the NLTK field and hit enter to view the Packages available for install.  As this list is rather long, you will need to hit enter multiple times to scroll through it.  The gutenberg package is in this list.  To install it, type 'd' and hit enter.  Then type 'gutenberg' in the NLTK field and hit enter again.  The software will respond with an installation message.  And finally, type 'q' to exit the downloader.</div>

In [5]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------


Downloader>  q


True

In [3]:
alice_raw = nltk.corpus.gutenberg.raw('carroll-alice.txt')

In [4]:
alice_raw[:800]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit"

#### Text Representation
##### One hot encoding
One-hot encoding is one of the most intuitive approaches toward text representation. A one-hot encoded feature is a binary indicator of a term being present in the text. It's a simple approach that is easy to interpret: the presence or absence of a word.  Here we see that the term "nlp" appears in the input text in the first and third rows.  So for those rows, we say it's "one-hot encoded", with its indicator variable set to one.  Otherwise, it's zero.  And the same holds true for the other words (p. 179).

In [92]:
txt_words_nostop

[['welcome', 'world', 'deep', 'learning', 'nlp'],
 ["'re", 'together', "'ll", 'learn', 'together'],
 ['nlp', 'amazing', 'deep', 'learning', 'makes', 'even', 'fun'],
 ['let', "'s", 'learn']]
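
As a minimal sketch of the idea, the block below builds indicator rows over the full vocabulary of the four token lists shown above, rather than over a hand-picked set of terms. The variable names are assumptions, and the sorted vocabulary order is just one convenient choice.

```python
# Token lists copied from the output above
sentences = [
    ["welcome", "world", "deep", "learning", "nlp"],
    ["'re", "together", "'ll", "learn", "together"],
    ["nlp", "amazing", "deep", "learning", "makes", "even", "fun"],
    ["let", "'s", "learn"],
]

# Vocabulary: every distinct token, sorted for a stable column order
vocab = sorted({token for sent in sentences for token in sent})

# One row per sentence, one 0/1 column per vocabulary term
one_hot = [[1 if term in sent else 0 for term in vocab] for sent in sentences]

# Inspect the column for "nlp" across the four sentences
nlp_col = vocab.index("nlp")
print([row[nlp_col] for row in one_hot])
```

The "nlp" column is 1 for the first and third sentences only. Note that "together" occurs twice in the second sentence but still gets a 1, since one-hot encoding records presence, not frequency.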

#### Exercise 4.03 (Creating One-Hot Encoding for Our Data) - Page 181

***

#### 1. Examine the txt_words_nostop variable

```python
print(txt_words_nostop)
```

In [1]:
# Code it!

#### 2. Define list of target terms

```python
target_terms = ["nlp","deep","learn"]
```

In [94]:
# Code it!

#### 3. Define a one-hot function

```python
def get_onehot(sent):
    return [1 if term in  sent else 0 for term in target_terms]
```

In [95]:
# Code it!

#### 4. Apply the function to each sentence in our text

```python
one_hot_mat = [get_onehot(sent) for sent in txt_words_nostop]
```

In [3]:
# Code it!

#### 5. Import numpy and print the one-hot matrix

```python
import numpy as np

np.array(one_hot_mat)
```

In [97]:
# Code it!

***

<div style="padding: 10px;margin-bottom: 20px;border: thin solid #30335D;border-left-width: 10px;background-color: #fff"><strong>Note:</strong> This section and Exercise 4.04 are optional.</div>

##### Term Frequencies

We discussed that one-hot encoding merely indicates the presence or absence of a term. A reasonable argument here is that the frequency of terms is also important. A term that appears more often in a document may be more important for that document, so representing the term by its frequency may be a better approach than the simple indicator. The frequency approach is straightforward: for each term, count the number of times it appears in a particular text. If a term is absent from the document/text, it gets a 0. We do this for all the terms in our vocabulary (p. 183).

In [99]:
from sklearn.feature_extraction.text import CountVectorizer

Let's take a look at the contents of our txt_sents variable.  If the *txt_sents* variable has been overwritten, redefine its contents so it matches this output:

```python
['welcome to the world of deep learning for nlp!',
 "we're in this together, and we'll learn together.",
 'nlp is amazing, and deep learning makes it even more fun.',
 "let's learn!"]
```

In [None]:
txt_sents

> Instantiate a vectorizer to identify the top 5 most frequently occurring words in the dataset.

In [100]:
vectorizer = CountVectorizer(max_features = 5)

> Fit the vectorizer to the data so that it learns the vocabulary.

In [101]:
vectorizer.fit(txt_sents)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=5, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

> Let's take a look at our top 5 words.

In [102]:
vectorizer.vocabulary_

{'deep': 1, 'we': 4, 'together': 3, 'and': 0, 'learn': 2}

> Create a Document Term Matrix (DTM).  The result is a sparse matrix.  To view it, we will convert it to an array.

In [103]:
txt_dtm = vectorizer.fit_transform(txt_sents)

In [104]:
txt_dtm.toarray()

array([[0, 1, 0, 0, 0],
       [1, 0, 1, 2, 2],
       [1, 1, 0, 0, 0],
       [0, 0, 1, 0, 0]], dtype=int64)

> The second document (row 2) has a frequency of **2** for the last two terms.  What are those terms?  Indices 3 and 4 correspond to the terms **'together'** and **'we'**.  This is just what we expect, since both words occur twice in the second sentence ("we" comes from "we're" and "we'll").
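
To make the counting explicit, the sketch below rebuilds the matrix above by hand with the standard library. It assumes the five-term vocabulary reported by the vectorizer and a token pattern equivalent to the `token_pattern` shown in the vectorizer's printed parameters (words of two or more characters); the variable names are illustrative.

```python
import re
from collections import Counter

# Lowercased sentences, as shown for txt_sents earlier
sents = [
    "welcome to the world of deep learning for nlp!",
    "we're in this together, and we'll learn together.",
    "nlp is amazing, and deep learning makes it even more fun.",
    "let's learn!",
]

# Vocabulary in column order, taken from vectorizer.vocabulary_ above
vocab = ["and", "deep", "learn", "together", "we"]

# Tokens are runs of at least two word characters
token_re = re.compile(r"\b\w\w+\b")

dtm = []
for sent in sents:
    counts = Counter(token_re.findall(sent))
    # A Counter returns 0 for terms that never occur
    dtm.append([counts[term] for term in vocab])

for row in dtm:
    print(row)
```

The printed rows match the array above. The second row shows why "we" is counted twice: the token pattern splits "we're" and "we'll" at the apostrophe.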

In [105]:
txt_sents

['welcome to the world of deep learning for nlp!',
 "we're in this together, and we'll learn together.",
 'nlp is amazing, and deep learning makes it even more fun.',
 "let's learn!"]

#### Advanced Term Frequency

The documentation for this last section begins on page 186 of *The Deep Learning Workshop*.  Review that content and then document the following code blocks.

In [106]:
def do_nothing(doc):
    return doc

In [107]:
vectorizer = CountVectorizer(max_features = 5, 
                             preprocessor = do_nothing, 
                             tokenizer    = do_nothing)

In [108]:
txt_dtm = vectorizer.fit_transform(txt_words_stem)

In [109]:
txt_dtm.toarray()

array([[0, 1, 1, 1, 0],
       [1, 0, 1, 0, 2],
       [0, 1, 1, 1, 0],
       [0, 0, 1, 0, 0]], dtype=int64)

In [110]:
vectorizer.vocabulary_

{'deep': 1, 'learn': 2, 'nlp': 3, 'togeth': 4, "'ll": 0}

In [111]:
txt_words_stem

[['welcom', 'world', 'deep', 'learn', 'nlp'],
 ["'re", 'togeth', "'ll", 'learn', 'togeth'],
 ['nlp', 'amaz', 'deep', 'learn', 'make', 'even', 'fun'],
 ['let', "'s", 'learn']]

#### Exercise 4.04 (Document Term Matrix with TF-IDF) - Page 188

***
#### 1. Import vectorizer from the scikit-learn library

```python
from sklearn.feature_extraction.text import TfidfVectorizer
```

In [112]:
# Code it!

#### 2. Instantiate the vectorizer with a vocabulary size of 5

```python
vectorizer_tfidf = TfidfVectorizer(max_features = 5)
```

In [113]:
# Code it!

#### 3. Vectorize the raw data

```python
vectorizer_tfidf.fit(txt_sents)
```

In [6]:
# Code it!

#### 4. Examine the vocabulary learned by the vectorizer

```python
vectorizer_tfidf.vocabulary_
```

In [5]:
# Code it!

#### 5. Transform the data

```python
txt_tfidf = vectorizer_tfidf.transform(txt_sents)
```

In [116]:
# Code it!

#### 6. Examine the output

```python
txt_tfidf.toarray()
```

In [3]:
# Code it!

#### 7. Print the IDF values for each term

```python
vectorizer_tfidf.idf_
```

In [4]:
# Code it!
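
As a cross-check for step 7, the sketch below computes inverse document frequencies by hand. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and assumes the vectorizer keeps the same five terms that the `CountVectorizer` selected earlier; the variable names are illustrative.

```python
import math
import re

sents = [
    "welcome to the world of deep learning for nlp!",
    "we're in this together, and we'll learn together.",
    "nlp is amazing, and deep learning makes it even more fun.",
    "let's learn!",
]

# Assumed vocabulary (same five terms as the CountVectorizer section)
vocab = ["and", "deep", "learn", "together", "we"]

# Tokens are runs of at least two word characters
token_re = re.compile(r"\b\w\w+\b")
doc_tokens = [set(token_re.findall(sent)) for sent in sents]
n_docs = len(sents)

idf = {}
for term in vocab:
    # Document frequency: number of sentences containing the term
    df = sum(term in tokens for tokens in doc_tokens)
    # Smoothed IDF, assuming smooth_idf=True (the scikit-learn default)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

for term, value in idf.items():
    print(term, round(value, 4))
```

Terms found in two of the four sentences ("and", "deep", "learn") receive a lower IDF than terms found in only one ("together", "we"), which is how TF-IDF down-weights words that are spread across many documents.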