In [1]:
from IPython.display import clear_output

In [2]:
# %pip install torch torchtext tqdm spacy


%pip install --upgrade portalocker

clear_output()

In [10]:
import string

import torch
from torchtext.datasets import IMDB
from tqdm import tqdm

# Content

In this notebook, we'll look at how to process text so that it can be fed to models. Specifically, we'll cover:

1. Removing stop words from the text
2. Tokenizing the text
3. Converting the text to numerical form for use by models



## Downloading the Dataset

We will work with the IMDB review dataset.

It's a sentiment analysis dataset built by labelling IMDB movie reviews as positive or negative.

In [4]:
# Download and load the IMDB dataset
train_data = IMDB(split='train')

## Custom text processing functions

First, we'll look at how to write functions for tokenization and the related steps by hand.

The comments also point out some limitations of each function, which are intentionally left unaddressed for now.

In [24]:
# Define a function to remove punctuation.
# This is often needed because punctuation attached to words inflates the vocabulary.
# For example, we want a single token for the word "close", but if "close," and "close." also appear in the corpus,
# we end up with 3 tokens that all represent the same word ("close" | "close," | "close.").
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Define a function to remove stop words
def remove_stopwords(text):

    # This list could keep growing. Also, if punctuation isn't removed beforehand, words like "the," won't match.
    # The list only contains lower-case words, so each word is lower-cased before the lookup.
    stopwords = set([
        "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",
        "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
        "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",
        "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do",
        "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
        "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
        "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
        "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
        "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
        "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"
    ])
    return ' '.join(word for word in text.split() if word.lower() not in stopwords)

# Tokenize the text. For now, we simply split on whitespace.
def tokenize(text):
    return text.split()
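
As a quick check, the three steps can be chained on a single sentence. This is a minimal sketch: the helpers are restated under `demo_` names (with a shortened, illustrative stop-word list) so the snippet runs on its own and does not overwrite the functions above, and the sample sentence is made up. It also surfaces one of the limitations mentioned in the comments: once the apostrophe is stripped, "don't" becomes "dont", which is not in the stop-word list and therefore survives.

```python
import string

# Shortened, illustrative subset of the full stop-word list (an assumption for this demo).
DEMO_STOPWORDS = {"i", "the", "to", "it", "don", "t", "so"}

def demo_remove_punctuation(text):
    # Delete every character listed in string.punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

def demo_remove_stopwords(text, stopwords=DEMO_STOPWORDS):
    # Keep only the words whose lower-cased form is not a stop word.
    return ' '.join(w for w in text.split() if w.lower() not in stopwords)

def demo_tokenize(text):
    # Whitespace tokenization, as in the notebook.
    return text.split()

sample = "I don't want to close the door, so close it."
cleaned = demo_remove_stopwords(demo_remove_punctuation(sample.lower()))
demo_tokens = demo_tokenize(cleaned)
print(demo_tokens)
```

Both occurrences of "close" map to the same token because the comma and the full stop were removed before splitting.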


## Processing the IMDB dataset

To save time, we will use a small part of the dataset.

In [25]:
n_samples = 100
data = []

for record in tqdm(train_data, total=n_samples):

    data.append(record[1])  # record[0] is the sentiment label; we don't need it here.

    if len(data) >= n_samples:
        break

 99%|█████████▉| 99/100 [00:00<00:00, 506.84it/s]
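
The bar above stops at 99/100 because the loop breaks as soon as the 100th review is appended, before tqdm counts that item. One alternative, sketched here with `itertools.islice`, is to cap the iterator itself so that no manual `break` is needed. The `fake_records` generator is a hypothetical stand-in for `train_data`, assumed to yield `(label, text)` pairs.

```python
from itertools import islice

# Hypothetical stand-in for train_data: an iterable of (label, text) pairs.
fake_records = ((i % 2, f"review number {i}") for i in range(1000))

n_demo = 5
# islice stops after n_demo records without consuming the rest of the iterator.
demo_data = [text for _label, text in islice(fake_records, n_demo)]
print(demo_data)
```

With the real dataset, wrapping the sliced iterator as `tqdm(islice(train_data, n_samples), total=n_samples)` should let the bar reach 100%.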


In [26]:
# Create vocabulary from training data
vocab = set()
for text in tqdm(data, desc='Creating Vocab'):
    text = text.lower()
    text = remove_punctuation(text)
    text = remove_stopwords(text)
    tokens = tokenize(text)
    vocab.update(tokens)

# Create the word-to-index mapping. It converts words to numerical form so they can be fed to models.
# Each word is replaced by the integer stored against it in the dict, which comes from its index in the vocabulary (hence word_to_idx).
word_to_idx = {word: idx for idx, word in tqdm(enumerate(vocab), desc='Building Word 2 Idx')}

# Print the first few words and their indices
print("Word to Index Mapping:")
for word in list(vocab)[:10]:
    print(f"{word}: {word_to_idx[word]}")

Creating Vocab: 100%|██████████| 100/100 [00:00<00:00, 6952.27it/s]
Building Word 2 Idx: 4476it [00:00, 844293.25it/s]

Word to Index Mapping:
anything: 0
chocolate: 1
pointed: 2
incompetent: 3
catalan: 4
ahead: 5
incredible: 6
dons: 7
somebody: 8
effectbr: 9
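
Entries such as `effectbr` come from the `<br />` tags in the raw reviews: stripping punctuation deletes `.`, `<`, `/` and `>` but leaves the letters `br` glued to the preceding word. A minimal sketch of one possible fix, assuming the only markup in the reviews is `<br>`-style line breaks, is to replace those tags with a space before removing punctuation. The helper name `strip_br_tags` is hypothetical and the sample string is made up.

```python
import re
import string

def strip_br_tags(text):
    """Replace <br>, <br/> and <br /> tags with a single space."""
    return re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)

raw = "a nice effect.<br /><br />The film appears to have some fans"
table = str.maketrans('', '', string.punctuation)

# Without the fix, "br" sticks to the previous word and also appears on its own.
without_fix = raw.lower().translate(table).split()
# With the fix, the tags are gone before punctuation is stripped.
with_fix = strip_br_tags(raw).lower().translate(table).split()

print(without_fix[:5])
print(with_fix[:5])
```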





In [27]:
# Let's take a look at an example
test_idx = 10

text = data[test_idx]

processed_text = remove_punctuation(text.lower())
processed_text = remove_stopwords(processed_text)
tokens = tokenize(processed_text)

idxs = [word_to_idx[token] for token in tokens]

print('Original Text:')
print(text)
print('-'*30)
print('Tokens:')
print(tokens)
print('-'*30)
print('Numerical form (Idxs)')
print(idxs)

Original Text:
It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn't go on to star in more and better films. Sadly, I didn't think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat's Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is
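
One thing to keep in mind with `word_to_idx[token]`: it raises a `KeyError` for any token that was not seen while building the vocabulary, and iterating over a `set` of strings can produce a different order (and therefore different indices) from one Python session to the next. A common workaround, sketched here under the assumption that reserving index 0 for padding and index 1 for unknown words is acceptable, is to sort the vocabulary and fall back to an `<unk>` entry. The names `build_word_to_idx`, `encode`, `<pad>` and `<unk>` are illustrative and not part of the notebook above.

```python
PAD_TOKEN, UNK_TOKEN = "<pad>", "<unk>"

def build_word_to_idx(vocab):
    """Map the special tokens to 0 and 1, then the sorted words to 2, 3, ..."""
    mapping = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for word in sorted(vocab):
        mapping[word] = len(mapping)
    return mapping

def encode(tokens, mapping, max_len=None):
    """Convert tokens to indices; optionally truncate or pad to max_len."""
    idxs = [mapping.get(tok, mapping[UNK_TOKEN]) for tok in tokens]
    if max_len is not None:
        # A negative pad count yields an empty list, so long inputs are only truncated.
        idxs = idxs[:max_len] + [mapping[PAD_TOKEN]] * (max_len - len(idxs))
    return idxs

demo_vocab = {"great", "film", "boring"}
demo_mapping = build_word_to_idx(demo_vocab)
demo_idxs = encode(["great", "film", "zzz"], demo_mapping, max_len=5)
print(demo_mapping)
print(demo_idxs)
```

Fixed-length lists like `demo_idxs` can then be stacked into a tensor, for example with `torch.tensor`.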

## Using spaCy

As we saw, doing this manually has quite a few limitations.

One that hasn't been mentioned above is that our basic implementation only handles English. To work with multiple languages, we would need to build the same functionality for each of them, which can be challenging if the developer has a limited understanding of the other language.

We can work around these limitations, but that adds a lot of needless development overhead, which is why it's better to let a library take care of things like this.

Two of the most popular libraries are spaCy and NLTK. In this demo, we will use spaCy.

In [17]:
import spacy

In [18]:
# Load spaCy's small English pipeline (model and tokenizer).
# It has to be downloaded once beforehand, e.g. with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [29]:
# Define a function to remove stop words using spaCy
def remove_stopwords_sp(text):  # _sp to keep it separate from our own implementation.
    doc = nlp(text)
    return ' '.join(token.text for token in doc if not token.is_stop)

# Tokenize the text using spaCy
def tokenize_sp(text):
    doc = nlp(text)
    return [token.text for token in doc]

In [43]:
# Create vocabulary from training data
vocab = set()
for text in tqdm(data, desc='Creating Vocab'):
    text = remove_stopwords_sp(text.lower())  # .lower to maintain similarity with our custom implementation
    tokens = tokenize_sp(text)
    vocab.update(tokens)

# Create word to index mapping
word_to_idx = {word: idx for idx, word in tqdm(enumerate(vocab), desc='Building Word 2 Idx')}

# Print the first few words and their indices
print("Word to Index Mapping:")
for word in list(vocab)[:10]:
    print(f"{word}: {word_to_idx[word]}")

Creating Vocab: 100%|██████████| 100/100 [00:08<00:00, 12.03it/s]
Building Word 2 Idx: 4331it [00:00, 748630.98it/s]

Word to Index Mapping:
chocolate: 0
catalan: 1
pointed: 2
incompetent: 3
mess.i: 4
ahead: 5
incredible: 6
dons: 7
somebody: 8
ended: 9
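
The vocabulary cell above runs the pipeline twice per review (once in `remove_stopwords_sp`, once in `tokenize_sp`), which likely contributes to it being much slower than the manual version. A possible alternative, sketched below, tokenizes each text once with `nlp.pipe` and filters stop words on the resulting tokens. It assumes a blank English pipeline from `spacy.blank("en")` is sufficient, since only the tokenizer and the `is_stop` flag are used; `nlp_light` and `tokenize_without_stopwords` are hypothetical names, and the output is not guaranteed to match the two-pass version token for token.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because stop-word flags are lexical attributes and need no trained model.
nlp_light = spacy.blank("en")

def tokenize_without_stopwords(texts):
    """Tokenize each text once and drop stop words and whitespace tokens."""
    results = []
    for doc in nlp_light.pipe(texts, batch_size=50):
        results.append([t.text for t in doc if not t.is_stop and not t.is_space])
    return results

sp_demo_tokens = tokenize_without_stopwords(
    ["hi, i love machine learning and i want to be a pro at it."]
)[0]
print(sp_demo_tokens)
```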





In [44]:
# Let's take a look at an example
test_idx = 10

text = data[test_idx]

processed_text = remove_stopwords_sp(text.lower())
tokens = tokenize_sp(processed_text)

idxs = [word_to_idx[token] for token in tokens]

print('Original Text:')
print(text)
print('-'*30)
print('Tokens:')
print(tokens)
print('-'*30)
print('Numerical form (Idxs)')
print(idxs)

Original Text:
It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn't go on to star in more and better films. Sadly, I didn't think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat's Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is

### Something to notice:

As the examples below show, we don't have to remove punctuation ourselves. spaCy creates a separate token for each punctuation mark. This lets our data retain punctuation and, in theory, should allow the model to learn what these punctuation marks mean.

In [45]:
tokenize("Hi, I Love Machine Learning and I want to be a pro at it.")

['Hi,',
 'I',
 'Love',
 'Machine',
 'Learning',
 'and',
 'I',
 'want',
 'to',
 'be',
 'a',
 'pro',
 'at',
 'it.']

In [46]:
tokenize_sp("Hi, I Love Machine Learning and I want to be a pro at it.")  # different tokens for punctuation marks

['Hi',
 ',',
 'I',
 'Love',
 'Machine',
 'Learning',
 'and',
 'I',
 'want',
 'to',
 'be',
 'a',
 'pro',
 'at',
 'it',
 '.']
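
To make the difference concrete without spaCy, the same splitting behaviour can be roughly imitated with a regular expression that treats every punctuation character as its own token. This is only an illustrative sketch: spaCy's tokenizer relies on language-specific rules and exceptions and is not implemented this way, and `tokenize_regex` is a hypothetical helper.

```python
import re

def tokenize_regex(text):
    """Return runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", text)

regex_tokens = tokenize_regex("Hi, I Love Machine Learning and I want to be a pro at it.")
print(regex_tokens)
```

For this sentence the result matches the spaCy output above, but the sketch is cruder elsewhere: a contraction such as "didn't" would be split into `didn`, `'` and `t`.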