# Python Text Analysis Part 1: Preprocessing

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Learn common and task-specific operations of preprocessing.
* Know commonly used NLP packages and what they are capable of.
* Understand the differences between tokenizers before and after LLMs.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br> 

### Sections
1. [Preprocessing](#section1)
2. [Tokenization](#section2)

In this three-part workshop series, we will learn the building blocks for performing text analysis in Python. These techniques lie in the domain of Natural Language Processing (NLP). NLP is a field that deals with identifying and extracting patterns of language, primarily in written texts. Throughout the workshop series, we will interact with various packages for performing text analysis: from simple string methods to dedicated NLP packages, including `nltk`, `spaCy`, and the more recent `transformers` package for Large Language Models.

Now, let's have these packages properly installed before diving into the materials.

In [1]:
# Uncomment to install the following packages
# %pip install nltk
# %pip install transformers
# %pip install spacy
# !python -m spacy download en_core_web_sm

<a id='section1'></a>

# Preprocessing

In Part 1 of this workshop, we address the first step of text analysis. Our goal is to convert the raw, messy text data into a consistent format. This ensures that our text data preserves the necessary information for subsequent analysis while stripping away less relevant details. This process is often called **preprocessing**, **text cleaning**, or **text normalization**.

You will notice that at the end of preprocessing, our data is still in a format that we can read and understand. In Parts 2 and 3, we will begin our foray into converting the text data into a numerical representation---a format that can be handled by computational analysis. 

🔔 **Question**: Let's pause for a minute to reflect on previous experiences working on text data. 
- What is the format of the text data you have interacted with (CSV, XML, or plain text)?
- Where does it come from (structured corpus, scraped from the web, survey data)?
- Is it messy (i.e., is the data formatted consistently)?

## Common Processes

Preprocessing is not something we can accomplish in one fell swoop with a single line of code. We often start by familiarizing ourselves with the data, and along the way we get a clearer sense of how fine-grained our preprocessing needs to be.

Typically, we begin with a set of commonly used processes to clean the data. These operations will not substantially alter the form or meaning of the data; instead, the goal is to convert the text data into a standard format. 

Oftentimes, these operations can be done with built-in Python functionality, such as string methods and regular expressions. For example, the following processes are commonly applied to English texts.
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters
- Remove stop words

Afterwards, we may choose to perform task-specific preprocessing operations, which depend on the text analysis task we want to perform later and on the source of our data.

Before we jump into these operations, let's take a look at our data first!

### Import the Text Data

The text data we will be working with is a .csv file. It contains tweets about U.S. airlines, scraped in February 2015.

Let's read in the file `airline_tweets.csv` with `pandas`.

In [2]:
# Import pandas
import pandas as pd

# File path to data
csv_path = '../data/airline_tweets.csv'

# Specify the separator to be comma
tweets = pd.read_csv(csv_path, sep=',')

In [3]:
# Display the first five rows
tweets.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


The dataframe has one row per tweet. The `text` column contains the tweet itself, and the other columns contain metadata about the tweet. The columns we care most about are:

- `airline_sentiment` (`str`): sentiment of the tweet, labeled as "neutral", "positive", or "negative". 
- `airline` (`str`): the airline that is tweeted about.
- `retweet_count` (`int`): how many times the tweet was retweeted.
- `text` (`str`): the text of the tweet.

Let's take a look at some of the Tweets:

In [4]:
print(tweets['text'].iloc[0])
print(tweets['text'].iloc[1])
print(tweets['text'].iloc[2])

@VirginAmerica What @dhepburn said.
@VirginAmerica plus you've added commercials to the experience... tacky.
@VirginAmerica I didn't today... Must mean I need to take another trip!


🔔 **Question**: What have you noticed? What are the characteristics of tweet data?

### Lowercasing

While we need to acknowledge that the **casing** of words is informative, we often don't work in contexts where we can properly utilize this information.

More often, the subsequent analysis we perform is **case-insensitive**. For instance, in frequency analysis, we count words and want differently cased forms of the same word (e.g., "Flight" and "flight") to be counted together. Lowercasing the text data aids in this process and simplifies our analysis.

We can easily achieve text lowercasing with the built-in string function [`lower()`](https://docs.python.org/3/library/stdtypes.html#str.lower). See [string methods](https://www.w3schools.com/python/python_ref_string.asp) for more useful functions.

Let's take a look at the following example!

In [5]:
# Print the first example
first_example = tweets['text'][108]
print(first_example)

@VirginAmerica I was scheduled for SFO 2 DAL flight 714 today. Changed to 24th due weather. Looks like flight still on?


In [6]:
# Check if the example is all lowercased
print(first_example.islower())
print(f"{'=' * 50}")

# Convert it to lowercase
print(first_example.lower())
print(f"{'=' * 50}")

# Convert it to uppercase
print(first_example.upper())

False
@virginamerica i was scheduled for sfo 2 dal flight 714 today. changed to 24th due weather. looks like flight still on?
@VIRGINAMERICA I WAS SCHEDULED FOR SFO 2 DAL FLIGHT 714 TODAY. CHANGED TO 24TH DUE WEATHER. LOOKS LIKE FLIGHT STILL ON?
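
To see why lowercasing helps with frequency analysis, here is a minimal sketch using `collections.Counter` from the standard library. The sample sentence is made up for illustration and does not come from the dataset; splitting on whitespace is only a stand-in for proper tokenization.

```python
from collections import Counter

# A made-up sentence (not from the dataset) with one word in three casings
sample = "Flight delayed. flight cancelled. FLIGHT rebooked."

# Naive whitespace split; punctuation stays attached to some words
raw_counts = Counter(sample.split())
lower_counts = Counter(sample.lower().split())

print(raw_counts["flight"])    # counts only the exact lowercase form
print(lower_counts["flight"])  # counts all three casings together
```

Without lowercasing, "Flight", "flight", and "FLIGHT" are counted as three different words.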


### Remove Extra Whitespace Characters

Sometimes we might come across texts with extraneous whitespace. This is particularly common when the text is scraped from webpages. Before we dive into the details, let's briefly introduce Regular Expressions (regexes) and the `re` package.

Regular expressions are a powerful way of searching for specific string patterns in large corpora. They have an infamously steep learning curve, but are very efficient once we get a handle on them. Many NLP packages make heavy use of regexes under the hood. Regex testers are a useful tool for both understanding and writing regular expressions. An example is [regex101](https://regex101.com).

Our goal in this workshop is not to provide a deep (or even shallow) dive into regexes; instead, we want to expose you to them so that you are better prepared to do deep dives in the future!

The following example is a poem by T. S. Eliot. Like many other poems, the text may contain extra line breaks (or newline characters, `\n`) that we want to remove.

Let's read the data in!

In [7]:
# File path to the poem
text_path = '../data/morning_at_the_window.txt'

# Read the poem in (the with-block closes the file automatically)
with open(text_path, 'r') as file:
    text = file.read()

# Keep the text from character 660 onward, where the poem starts
text_excerpt = text[660:]

The text excerpt is the poem as one continuous string, with a newline character at the end of each line, which makes it difficult to read.

In [8]:
text_excerpt

'Morning at the Window\n\nThey are rattling breakfast plates in basement kitchens,\nAnd along the trampled edges of the street\nI am aware of the damp souls of housemaids\nSprouting despondently at area gates.\n\nThe brown waves of fog toss up to me\nTwisted faces from the bottom of the street,\nAnd tear from a passer-by with muddy skirts\nAn aimless smile that hovers in the air\nAnd vanishes along the level of the roofs.'

One handy function we can use to display the poem properly is `splitlines()`. As the name suggests, it splits a long text sequence into a list of lines, effectively getting rid of line boundaries. 

In [9]:
# Split the single string into a list of lines
text_excerpt.splitlines()

['Morning at the Window',
 '',
 'They are rattling breakfast plates in basement kitchens,',
 'And along the trampled edges of the street',
 'I am aware of the damp souls of housemaids',
 'Sprouting despondently at area gates.',
 '',
 'The brown waves of fog toss up to me',
 'Twisted faces from the bottom of the street,',
 'And tear from a passer-by with muddy skirts',
 'An aimless smile that hovers in the air',
 'And vanishes along the level of the roofs.']
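
Notice that the blank lines in the poem show up as empty strings in the list. As a small sketch of one way to handle them, we can filter out the empty entries and, if desired, join the remaining lines back into a single string. The excerpt below is shortened for illustration, and the variable names are our own.

```python
# Shortened excerpt of the poem, for illustration
excerpt = (
    "Morning at the Window\n\n"
    "They are rattling breakfast plates in basement kitchens,\n"
    "And along the trampled edges of the street"
)

# splitlines() keeps blank lines as empty strings
lines = excerpt.splitlines()

# Keep only lines that contain something other than whitespace
non_empty = [line for line in lines if line.strip()]

# Join the remaining lines into one string separated by single spaces
one_line = " ".join(non_empty)
print(one_line)
```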

Let's return to our tweets data for an example.

In [10]:
# Print the second example
second_example = tweets['text'][5]
second_example

"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA"

Now, for the tweets data, we do not really want to split the text into a list of strings. We still expect a single string of text, but we would like to remove the line breaks from it completely.

The string method `strip()` removes whitespace at both ends of the text. However, it won't work in our example because the newline character is inside the string.

In [11]:
# strip() only removes whitespace at both ends
second_example.strip()

"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.\nit's really the only bad thing about flying VA"

This is where regex could be really helpful.

Let's load the package in first. 

In [12]:
# Import regex
import re

With regex, we match a pattern in the text data and then perform an operation on the match: extract it, replace it with something else, or remove it completely. Using regex can therefore be broken down into the following steps:

- Identify the pattern and write the pattern in regex: `r''`
- Write the replacement for the pattern 
- Call the specific regex function
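
The three operations mentioned above (extract, replace, remove) can be sketched with two functions from the `re` module. The sample string and the digit pattern `\d+` are made up for illustration only; they are not the pattern used for whitespace in this section.

```python
import re

# Made-up example string; the digit pattern is only for illustration
sample = "Flight 714 moved to gate 24"
number_pattern = r'\d+'

# Extract: collect every match in a list
extracted = re.findall(number_pattern, sample)

# Replace: substitute each match with a placeholder
replaced = re.sub(number_pattern, 'NUM', sample)

# Remove: substitute each match with an empty string
removed = re.sub(number_pattern, '', sample)

print(extracted)
print(replaced)
print(removed)
```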

In our example, the pattern we are looking for is `\s`, the regex shorthand for any whitespace character (`\n` and `\t` included). We also add the quantifier `+` at the end, giving `\s+`, which matches one or more consecutive whitespace characters.

In [13]:
# Write a pattern in regex
blankspace_pattern = r'\s+'

The replacement we have for one or more whitespace characters is exactly one single whitespace---which is the canonical word boundary in English. Any more whitespace will be reduced to one single whitespace. 

In [14]:
# Write a replacement for the pattern identified
blankspace_repl = ' '

Now, in the final step, let's put everything together with `re.sub`: substitute a pattern with a replacement. 

In [15]:
# Replace whitespace with ' '
clean_text = re.sub(blankspace_pattern, blankspace_repl, second_example)
print(clean_text)

@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing. it's really the only bad thing about flying VA


Ta-da! The newline character is no longer there. 
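
For comparison, here is a sketch of a regex-free alternative: calling `split()` with no arguments splits on runs of whitespace, and `' '.join(...)` puts the pieces back together with single spaces. The sample string is made up. For ordinary spaces, tabs, and newlines, the two approaches should give the same result once the regex version is also stripped at both ends.

```python
import re

# Made-up string with leading/trailing spaces, a newline, and tabs
sample = "  seats that didn't have this playing.\nit's really\t\tthe only bad thing  "

# Regex version: collapse whitespace runs, then trim both ends
collapsed_regex = re.sub(r'\s+', ' ', sample).strip()

# String-method version: split on whitespace runs, then rejoin
collapsed_split = ' '.join(sample.split())

print(collapsed_regex)
print(collapsed_regex == collapsed_split)
```

The regex version is more flexible (for example, it can target only newlines), while the `split`/`join` version is handy when all whitespace should be treated the same way.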

### Remove Punctuation Marks

Sometimes we might only be interested in **alphanumeric characters** (i.e., the letters and numbers), in which case we might want to remove punctuation marks. This process becomes less common when we consider more advanced NLP algorithms. 

In [16]:
# Load in a predefined list of punctuation marks
from string import punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In practice, we can iterate over the text and keep only the characters that are not in the punctuation list, as shown below in the `remove_punct` function.

In [17]:
def remove_punct(text):
    # Select characters not in punctuation
    no_punct = []
    for char in text:
        if char not in punctuation:
            no_punct.append(char)

    # Join the characters into a string
    text_no_punct = ''.join(no_punct)   
    return text_no_punct

Let's apply the function to the example below. 

In [18]:
# Print the third example
third_example = tweets['text'][20]
print(third_example)
print(f"{'=' * 50}")

# Apply the function 
remove_punct(third_example)

@VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select???


'VirginAmerica why are your first fares in May over three times more than other carriers when all seats are available to select'

Let's give it a try with another tweet. What have you noticed?

In [19]:
# Print another tweet
print(tweets['text'][100])
print(f"{'=' * 50}")

# Apply the function
remove_punct(tweets['text'][100])

@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM


'VirginAmerica trying to add my boy Prince to my ressie SF this Thursday VirginAmerica from LAX httptcoGsB2J3c4gM'

How about the following example?

In [20]:
# Print a text with contraction
contraction_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."

# Apply the function
remove_punct(contraction_text)

'Weve got quite a bit of punctuation here dont we Python DLab'
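
As an aside, the same character-level removal can be sketched with `str.maketrans` and `str.translate`, which avoid the explicit loop. The function name `remove_punct_translate` is hypothetical and not part of the workshop materials.

```python
from string import punctuation

# Translation table that maps every punctuation character to None (deletion)
punct_table = str.maketrans('', '', punctuation)

def remove_punct_translate(text):
    """Hypothetical alternative to remove_punct using str.translate."""
    return text.translate(punct_table)

contraction_text = "We've got quite a bit of punctuation here, don't we?!? #Python @D-Lab."
print(remove_punct_translate(contraction_text))
```

This produces the same string as `remove_punct` on this example, including the merged contractions "Weve" and "dont".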

⚠️ **Warning:** In many cases, we want to remove punctuation **after** tokenization, which we will discuss in the next section. This tells us that the **order** of preprocessing operations is a matter of importance!
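
To illustrate why order matters, here is a small sketch that compares two orders of operations on part of the tweet above. Splitting on whitespace is used as a naive stand-in for tokenization, and the `startswith('http')` check is a simplifying assumption for spotting URLs.

```python
from string import punctuation

sample = "SF this Thursday from LAX http://t.co/GsB2J3c4gM"

# Order A: remove punctuation first, then split on whitespace
no_punct = ''.join(ch for ch in sample if ch not in punctuation)
tokens_a = no_punct.split()

# Order B: split first, then strip punctuation only from non-URL tokens
tokens_b = [
    tok if tok.startswith('http') else tok.strip(punctuation)
    for tok in sample.split()
]

print(tokens_a[-1])  # the URL has been mangled into one opaque string
print(tokens_b[-1])  # the URL is still intact and recognizable
```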

## 🥊 Challenge 1: Preprocessing with Multiple Steps

So far we've learned a few preprocessing operations. Let's put them together in a function!

The example text data for challenge 1 has been read in. Write a function to:

- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters

In [21]:
challenge1_path = '../data/example1.txt'

with open(challenge1_path, 'r') as file:
    challenge1 = file.read()
    
print(challenge1)



This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches blankspace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.





In [22]:
# def clean_text(text):
    
#     # YOUR CODE HERE
    
#     return text

# clean_text(challenge1)

In [23]:
# Solution to challenge 1 
def clean_text(text):
    
    text = text.lower()
    text = remove_punct(text)
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    text = text.strip()
    return text
    
clean_text(challenge1)

'this is a text file that has some extra blankspace at the start and end blankspace is a catchall term for spaces tabs newlines and a bunch of other things that computers distinguish but to us all look like spaces tabs and newlines the python method called strip only catches blankspace at the start and end of a string but it wont catch it in the middle for example in this sentence once again regular expressions will help us with this'

## Task-specific Processes

Now that we understand the common preprocessing steps, there are still a few additional operations to consider. Our text data might require further normalization depending on the language, source, and content of the data.

For example, if we are dealing with financial documents, we might want to standardize monetary symbols by converting them to digits. In the airline tweet data we've been using, there are numerous hashtags and URLs. These can be replaced with placeholders to simplify subsequent analysis.
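
As a rough sketch of the placeholder idea, the helper below swaps URLs and hashtags for fixed strings. The two patterns are deliberately simplified and hypothetical, and the function name `replace_placeholders` is our own.

```python
import re

# Simplified, hypothetical patterns for illustration only
simple_url_pattern = r'https?://\S+'
simple_hashtag_pattern = r'#\w+'

def replace_placeholders(text):
    """Hypothetical helper: swap URLs and hashtags for placeholder strings."""
    # URLs first, so a '#fragment' inside a URL is not treated as a hashtag
    text = re.sub(simple_url_pattern, 'URL', text)
    text = re.sub(simple_hashtag_pattern, 'HASHTAG', text)
    return text

sample = "Flying your #fabulous skies again! http://t.co/ahlXHhKiyn"
print(replace_placeholders(sample))
```

Simple patterns like these will miss edge cases (for instance, URLs without a scheme), which is why more thorough patterns are often used in practice.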

### 🎬 **Demo**: Remove Hashtags and URLs 

Although URLs, hashtags, and numbers are informative in their own right, oftentimes we don't necessarily care about the exact meaning of each of them. 

While we could remove them completely, it's often informative to know that there **exists** a URL or a hashtag. So, we replace individual URLs and hashtags with a "symbol" that preserves the fact these structures exist in the text. It's standard to just use the strings "URL" and "HASHTAG".

Since these types of text often contain precise structure, they're an apt case for using regular expressions. Let's apply these patterns to the Tweets above.

In [24]:
# Print the example tweet 
url_tweet = tweets['text'][13]
print(url_tweet)

@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel http://t.co/ahlXHhKiyn


In [25]:
# URL 
url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
url_repl = ' URL '
re.sub(url_pattern, url_repl, url_tweet)

"@VirginAmerica @virginmedia I'm flying your #fabulous #Seductive skies again! U take all the #stress away from travel  URL "

In [26]:
# Hashtag
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
hashtag_repl = ' HASHTAG '
re.sub(hashtag_pattern, hashtag_repl, url_tweet)

"@VirginAmerica @virginmedia I'm flying your HASHTAG  HASHTAG  skies again! U take all the HASHTAG  away from travel http://t.co/ahlXHhKiyn"

<a id='section2'></a>

# Tokenization

## Tokenizers before LLMs

One of the most important steps in text analysis is tokenization. This is the process of breaking down the text into "tokens," which are distinct chunks that we recognize as unique in whatever corpus we're working in.

Once we have broken down a sequence of text into individual tokens, we are ready to perform word-level analysis. For instance, we can filter out tokens that do not contribute to the core meaning of the text.

In this section, we will introduce how to perform tokenization with `nltk` and `spaCy`, as well as tokenization with a Large Language Model (`bert`). The purpose is to expose you to different NLP packages and help you understand what functionality each of them provides and how to access functions within each.

### `nltk`

The first package we will be using is called **Natural Language Toolkit**, or `nltk`. 

Let's download a few data resources that the package relies on.

In [27]:
import nltk

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /Users/mingyu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mingyu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mingyu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

`nltk` has a function called `word_tokenize`, which tokenizes a string for us in an intelligent fashion. 

It takes one argument, which is the text to be tokenized, and returns a list of tokens.

In [28]:
# Load word_tokenize in
from nltk.tokenize import word_tokenize

# Print the example
text = tweets['text'][7]
print(text)

@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP


In [29]:
# Apply the NLTK tokenizer
nltk_tokens = word_tokenize(text)
nltk_tokens

['@',
 'VirginAmerica',
 'Really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 'there',
 '.',
 'https',
 ':',
 '//t.co/mWpG7grEZP']

Here we are, with a list of tokens identified by `nltk`. Let's take a minute to inspect the tokens. Do the word boundaries decided by `nltk` make sense to you? Pay attention to the twitter handle and the URL in the example tweet. 
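
For contrast, here is a minimal regex-based sketch of a tokenizer that keeps handles and URLs intact. This is not how `word_tokenize` works; the pattern and the function name `simple_tweet_tokenize` are hypothetical and cover only the cases seen in this tweet. (`nltk` also provides a `TweetTokenizer` class in `nltk.tokenize` for this kind of text.)

```python
import re

# Hypothetical pattern; alternatives are tried left to right at each position:
#   1. URLs starting with http:// or https://
#   2. @handles and #hashtags
#   3. words, optionally followed by an apostrophe part (e.g., didn't)
#   4. any single character that is neither a word character nor whitespace
tweet_token_pattern = r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]"

def simple_tweet_tokenize(text):
    """Return all non-overlapping matches of the pattern, in order."""
    return re.findall(tweet_token_pattern, text)

text = "@VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP"
tokens = simple_tweet_tokenize(text)
print(tokens)
```

Because the URL and handle alternatives come first, they are matched as whole units before the more general word and punctuation alternatives get a chance to split them.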

You may feel that accessing functions in `nltk` is pretty straightforward. The function we used just now was imported from the `nltk.tokenize` module, which, as the name suggests, primarily does the job of tokenization!

Under the hood, `nltk` has [a collection of modules](https://www.nltk.org/api/nltk.html) that serve different purposes. To name a few:

| NLTK module   | Function                  | Link                                                         |
|---------------|---------------------------|--------------------------------------------------------------|
| nltk.tokenize | Tokenization              | [Documentation](https://www.nltk.org/api/nltk.tokenize.html) |
| nltk.corpus   | Retrieve built-in corpora | [Documentation](https://www.nltk.org/nltk_data/)             |
| nltk.tag      | Part-of-speech tagging    | [Documentation](https://www.nltk.org/api/nltk.tag.html)      |
| nltk.stem     | Stemming                  | [Documentation](https://www.nltk.org/api/nltk.stem.html)     |
| ...           | ...                       | ...                                                          |

Let's import `stopwords` from the `nltk.corpus` module, which hosts various built-in corpora available in `nltk`. 


In [30]:
# Load in a list of predefined stop words
from nltk.corpus import stopwords

Let's specify that we want to retrieve English stop words. The function simply returns a list of stop words, mostly function words, that `nltk` defines.

In [31]:
# Print the first 10 stopwords
stop = stopwords.words('english')
stop[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

### `spaCy`
Other than `nltk`, we have another widely used package called `spaCy`. Functions in `spaCy` are organized differently.

`spaCy` has its own processing pipeline. It takes in a string of text, runs the `nlp` pipeline on it, and stores the processed text and its annotations in an object called `doc`. The `nlp` pipeline always performs tokenization, followed by [a number of components](https://spacy.io/usage/processing-pipelines#custom-components) as specified by the user. These components are, in fact, pretty similar to the modules in the `nltk` package. 

<img src='../images/spacy.png' alt="spacy pipeline" width="700">

Note that we always start by initializing the `nlp` pipeline, depending on the language of the text. Here, we are loading a pretrained language model for English: `en_core_web_sm`. The name indicates a small (`sm`) English (`en`) model trained on web text.

This is the first time we introduce the concept of **pretraining**, though you may have heard it elsewhere. In the context of NLP, pretraining means that the tools we are using are trained, or in other words, optimized on tons of English text. Therefore, when we apply them to our own data, we can expect them to be somewhat efficient and accurate. Much of the functionality offered in `spaCy`, as outlined below, relies on pretraining.

Let's dive in!

In [32]:
import spacy
nlp = spacy.load('en_core_web_sm')

The `nlp` pipeline by default includes a number of components, which we can access via the `pipe_names` attribute. 

You may notice that it doesn't include the tokenizer. Don't worry! The tokenizer is a special component that the pipeline always includes, so it is not listed among the added components.

In [33]:
# Retrieve components included in NLP pipeline
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Let's run the `nlp` pipeline on our example tweet data, and assign it to a variable `doc`.

In [34]:
# Apply the pipeline to example tweet
doc = nlp(tweets['text'][7])

Under the hood, the `doc` object contains the tokens (done by the tokenizer) and their annotations (done by other components), which are [linguistic features](https://spacy.io/usage/linguistic-features) useful for text analysis. We retrieve the token and its annotations by accessing its attributes.

| Attribute      | Annotation                              | Link                                                                      |
|----------------|-----------------------------------------|---------------------------------------------------------------------------|
| token.text     | the token in text                       | [Documentation](https://spacy.io/api/token#attributes)                    |
| token.is_stop  | whether the token is a stop word        | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.is_punct | whether the token is a punctuation mark | [Documentation](https://spacy.io/api/attributes#_title)                   |
| token.lemma_   | the base form of the token              | [Documentation](https://spacy.io/usage/linguistic-features#lemmatization) |
| token.pos_     | the simple POS-tag of the token         | [Documentation](https://spacy.io/usage/linguistic-features#pos-tagging)   |
| ...            | ...                                     | ...                                                                       |

In [35]:
# Get the verbatim texts of tokens
spacy_tokens = [token.text for token in doc]

# The tokens are in a list
spacy_tokens

['@VirginAmerica',
 'Really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 'there',
 '.',
 'https://t.co/mWpG7grEZP']

In [36]:
# Get the NLTK tokens
nltk_tokens

['@',
 'VirginAmerica',
 'Really',
 'missed',
 'a',
 'prime',
 'opportunity',
 'for',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 'there',
 '.',
 'https',
 ':',
 '//t.co/mWpG7grEZP']

🔔 **Question**: Let's pause for a minute to compare the tokens from `nltk` and `spaCy`. What have you noticed?

`spaCy` conveniently encodes whether a token is part of the stop word list as an annotation, making it more straightforward for us to check it.

In [37]:
# Retrieve the is_stop annotation
spacy_stops = [token.is_stop for token in doc]

# The results are boolean values
spacy_stops

[False,
 True,
 False,
 True,
 False,
 False,
 True,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False]

## 🥊 Challenge 2: Remove Stop Words

Now that we know how `nltk` and `spaCy` work as NLP packages, let's write two functions to remove stop words from our text data.
- Complete the function for stop words removal using `nltk`
    - Requires two arguments: the tokenized text and a list of stop words
- Complete the function for stop words removal using `spaCy`
    - Requires one argument: the doc object which contains the processed text 

In [38]:
def remove_stopword_nltk(text, stopword):

    # YOUR CODE HERE
    text = [token for token in text if token not in stopword]
    
    return text

In [39]:
def remove_stopword_spacy(doc):
    
    # YOUR CODE HERE
    doc = [token.text for token in doc if not token.is_stop]

    return doc

In [40]:
remove_stopword_nltk(word_tokenize(text), stop)

['@',
 'VirginAmerica',
 'Really',
 'missed',
 'prime',
 'opportunity',
 'Men',
 'Without',
 'Hats',
 'parody',
 ',',
 '.',
 'https',
 ':',
 '//t.co/mWpG7grEZP']

In [41]:
remove_stopword_spacy(doc)

['@VirginAmerica',
 'missed',
 'prime',
 'opportunity',
 'Men',
 'Hats',
 'parody',
 ',',
 '.',
 'https://t.co/mWpG7grEZP']
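
One detail worth keeping in mind when filtering with a plain list: membership checks are case-sensitive, and the `nltk` stop words printed earlier are all lowercase. The sketch below uses a small hand-picked subset of that list and made-up tokens to show the difference; the variable names are our own.

```python
# Small hand-picked subset of stop words, for illustration only
toy_stopwords = {'i', 'me', 'my', 'we', 'our', 'you'}

# Made-up tokens, one of them capitalized
tokens = ['We', 'missed', 'our', 'flight']

# Case-sensitive check: 'We' is kept because only 'we' is in the set
case_sensitive = [tok for tok in tokens if tok not in toy_stopwords]

# Lowercase each token before the check: 'We' is now filtered out
case_insensitive = [tok for tok in tokens if tok.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Using a `set` instead of a list also makes each membership check faster, which can matter on large corpora.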

## 🎬 **Demo**: Powerful Features from `spaCy`

As mentioned above, `spaCy`'s `nlp` pipeline includes a number of linguistic annotations, which could be very useful for text analysis. 

For instance, we can access more annotations such as the lemma, the part-of-speech tag and its meaning, and whether the token looks like a URL.

In [42]:
# Print tokens and their annotations
for token in doc:
    print(f"{token.text:<24} | {token.lemma_:<24} | {token.pos_:<12} | {spacy.explain(token.pos_):<12} | {token.like_url:<12} |")

@VirginAmerica           | @VirginAmerica           | PROPN        | proper noun  | 0            |
Really                   | really                   | ADV          | adverb       | 0            |
missed                   | miss                     | VERB         | verb         | 0            |
a                        | a                        | DET          | determiner   | 0            |
prime                    | prime                    | ADJ          | adjective    | 0            |
opportunity              | opportunity              | NOUN         | noun         | 0            |
for                      | for                      | ADP          | adposition   | 0            |
Men                      | Men                      | PROPN        | proper noun  | 0            |
Without                  | without                  | ADP          | adposition   | 0            |
Hats                     | Hats                     | PROPN        | proper noun  | 0            |
parody    

As you can imagine, tweets in this dataset often contain place names and airport codes. It would be useful to identify them and extract them from the tweets.

In [43]:
# Print example tweets with place names and airport codes
tweet_city = tweets['text'][8273]
tweet_airport = tweets['text'][502]
print(tweet_city)
print(f"{'=' * 50}")
print(tweet_airport)

@JetBlue Vegas, San Francisco, Baltimore, San Diego and Philadelphia so far! I'm a very frequent business traveler.
@VirginAmerica Flying LAX to SFO and after looking at the awesome movie lineup I actually wish I was on a long haul.


We can use the "ner" (Named Entity Recognition) component to identify entities and their categories.

In [44]:
# Print entities identified from the text
doc_city = nlp(tweet_city)
for ent in doc_city.ents:
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")

Vegas           | 9          | 14         | GPE       
San Francisco   | 16         | 29         | GPE       
Baltimore       | 31         | 40         | GPE       
San Diego       | 42         | 51         | GPE       
Philadelphia    | 56         | 68         | GPE       


We can also use `displacy` to highlight entities identified in the text, and at the same time, annotate the entity category. 

In the following example, we have five `GPE` entities (i.e., geopolitical entities, usually countries and cities) identified.

In [45]:
# Visualize the identified entities
from spacy import displacy
displacy.render(doc_city, style='ent', jupyter=True)

Let's give it a try with another example.

In [46]:
# Print entities identified from the text
doc_airport = nlp(tweet_airport)
for ent in doc_airport.ents:
    print(f"{ent.text:<15} | {ent.start_char:<10} | {ent.end_char:<10} | {ent.label_:<10}")

@VirginAmerica  | 0          | 14         | CARDINAL  
Flying LAX      | 15         | 25         | ORG       
SFO             | 29         | 32         | ORG       


Interestingly, the airport codes are labeled as `ORG` (organizations) and the Twitter handle as `CARDINAL`.

In [47]:
# Visualize the identified entities
displacy.render(doc_airport, style='ent', jupyter=True)

## Tokenizers since LLMs

So far, we've seen what tokenization looks like with two widely used NLP packages. They work quite well in some settings, but not others. Recall that `nltk`'s `word_tokenize` split the URL in our example tweet into three tokens. Now, imagine the data we have is even messier, containing misspellings, recently coined words, foreign names, and so on (collectively called "out of vocabulary" or OOV words). In such circumstances, we might need a more powerful model to handle these complexities.

In fact, tokenization schemes change substantially with **Large Language Models** (LLMs), which are models trained on vast amounts of data from mixed sources. Rather than splitting text only at word boundaries, their tokenizers learn a fixed vocabulary from that data and use it to chunk a longer sequence into tokens and tokens into **subtokens**, so a rare or unseen word can still be represented as a sequence of known pieces. These subtokens could be morphological units of a word, such as a prefix, but they could also be parts of a word where the model sets a "meaningful" boundary. 

In this section, we will demonstrate tokenization in **BERT** (Bidirectional Encoder Representations from Transformers), which utilizes a tokenization algorithm called [**WordPiece**](https://huggingface.co/learn/nlp-course/en/chapter6/6). 

We will load the BERT tokenizer from the `transformers` package, which hosts a number of Transformer-based LLMs (e.g., GPT-2). We will not go into depth on the Transformer architecture in this workshop, but feel free to check out the D-Lab workshop on [GPT Fundamentals](https://github.com/dlab-berkeley/GPT-Fundamentals)!

### WordPiece tokenization

Note that BERT comes in a variety of versions. The one we will explore today is `bert-base-uncased`. This model is of moderate size (referred to as `base`) and is case-insensitive (`uncased`), meaning the input text will be converted to lowercase by default.

In [48]:
# Load BERT tokenizer in
from transformers import BertTokenizer

# Initialize the tokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



The tokenizer offers several methods, as we will see in a minute. For now, we want its `.tokenize()` method. 

Let's tokenize an example tweet below. What do you notice?

In [49]:
# Select an example tweet from dataframe
text = tweets['text'][194]
print(f"Text: {text}")
print(f"{'=' * 50}")

# Apply tokenizer
tokens = tokenizer.tokenize(text)
print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

Text: @VirginAmerica Just DM'd. Same issue persisting.
Tokens: ['@', 'virgin', '##ame', '##rica', 'just', 'd', '##m', "'", 'd', '.', 'same', 'issue', 'persist', '##ing', '.']
Number of tokens: 15


The double "hashtag" symbols (`##`) mean that the token is a subword that continues the preceding token, i.e., a non-initial segment of a longer word.

🔔 **Question**: Do these subwords make sense to you? 
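
To see where splits like `persist` + `##ing` come from, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece applies to each word at tokenization time. The vocabulary `TOY_VOCAB` and the function `wordpiece_tokenize` are hypothetical and written for illustration only; BERT's real vocabulary has roughly 30,000 entries, and the real tokenizer also handles lowercasing and punctuation before this step.

```python
# Toy vocabulary (hypothetical, NOT BERT's real vocabulary)
TOY_VOCAB = {"persist", "##ing", "hug", "##ga", "##ble", "just", "[UNK]"}

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into subwords by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the ## prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_tokenize("persisting", TOY_VOCAB))
print(wordpiece_tokenize("huggable", TOY_VOCAB))
print(wordpiece_tokenize("just", TOY_VOCAB))
print(wordpiece_tokenize("xyz", TOY_VOCAB))
```

Because the lookup always takes the longest piece that fits, the split depends entirely on what happens to be in the vocabulary, not on linguistic rules. This is one reason some subword boundaries look unintuitive.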

A key step in LLM tokenization is that each token is assigned an ID in the model's vocabulary. This is important because computational analysis does not operate directly on strings of text: a model works with numbers, so each token is translated to an ID. These IDs are the inputs that the model can access and operate on.

Tokens and IDs can be converted bidirectionally, for example:

In [50]:
# Get the input ID of the word 
print(f"ID of just is: {tokenizer.vocab['just']}")

# Get the text of the input ID
print(f"Token 2074 is: {tokenizer.decode([2074])}")

ID of just is: 2074
Token 2074 is: just


Let's convert tokens to input IDs!

In [51]:
# Convert a list of tokens to a list of input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Input IDs of text: {input_ids}")
print(f"The number of input IDs: {len(input_ids)}")

Input IDs of text: [1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012]
The number of input IDs: 15


### Special tokens

In addition to the tokens and subtokens discussed above, BERT also makes use of special tokens, including `[SEP]`, `[CLS]`, and `[UNK]`. The `[SEP]` token marks the end of a sentence, much like what is commonly known as an `EOS` (End of Sentence) token, and it also separates the two sentences when a sentence pair is encoded. The `[UNK]` token stands in for any token that cannot be built from the vocabulary, hence "unknown" token. The `[CLS]` token is automatically added to the beginning of the input. It originates from classification tasks, where people found it useful to have a token that aggregates the information of the entire sentence for classification purposes.

When we apply `tokenizer()` directly to our text data, we are asking BERT to **encode** the text for us. This involves multiple steps: 
- Tokenize the text
- Add special tokens
- Convert tokens to input IDs
- Other model-specific processes

The results are kept in a dictionary. 

Let's print them out!

In [52]:
# The results of encoding are formatted as a dictionary
for item, values in tokenizer(text).items():
    print(item+':', values)
    print('\n')

input_ids: [101, 1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012, 102]


token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]




You may have noticed that we have two special input IDs added: 101 and 102. 

Let's convert them back to tokens!

In [53]:
# We can also get the input IDs by providing the key 
input_ids_from_tokenizer = tokenizer(text)['input_ids']
print(f"The number of input IDs: {len(input_ids_from_tokenizer)}")
print(f"IDs from tokenizer: {input_ids_from_tokenizer}")

The number of input IDs: 17
IDs from tokenizer: [101, 1030, 6261, 14074, 14735, 2074, 1040, 2213, 1005, 1040, 1012, 2168, 3277, 29486, 2075, 1012, 102]


In [54]:
# Convert the special input IDs back to tokens
print(f"Token with ID 101: {tokenizer.convert_ids_to_tokens(101)}")
print(f"Token with ID 102: {tokenizer.convert_ids_to_tokens(102)}")

Token with ID 101: [CLS]
Token with ID 102: [SEP]
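
The encoding steps described in this section can be sketched in plain Python. This is a toy illustration under stated assumptions, not how `transformers` implements encoding: `TOY_VOCAB`, its IDs, and `toy_encode` are hypothetical, and the input is assumed to be a single sentence that has already been split into tokens.

```python
# Toy vocabulary with made-up IDs (hypothetical; BERT's real IDs differ)
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "same": 4, "issue": 5, "persist": 6, "##ing": 7, ".": 8}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}

def toy_encode(tokens, vocab):
    """Mimic the encoding steps for a single, already tokenized sentence."""
    # Step 1: wrap the tokens in the special tokens
    with_special = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 2: map tokens to IDs, falling back to [UNK] for unseen tokens
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in with_special]
    return {
        "input_ids": input_ids,
        # One segment only, so every position belongs to segment 0
        "token_type_ids": [0] * len(input_ids),
        # No padding here, so every position should be attended to
        "attention_mask": [1] * len(input_ids),
    }

# "qwerty" is deliberately missing from the toy vocabulary
encoded = toy_encode(["same", "issue", "persist", "##ing", ".", "qwerty"], TOY_VOCAB)
for key, values in encoded.items():
    print(f"{key}: {values}")

# Map the IDs back to tokens to see the special tokens and the [UNK] fallback
print([ID_TO_TOKEN[i] for i in encoded["input_ids"]])
```

In real use, `token_type_ids` would contain 1s for the second sentence of a sentence pair, and `attention_mask` would contain 0s at padded positions; neither occurs when encoding a single unpadded sentence, which is why the output earlier in this section shows all 0s and all 1s.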


## 🥊 Challenge 3: Find the Word Boundary

Now that we know BERT tokenization often returns subwords, let's try a few more examples! 

Does the result make sense to you? What do you think is the correct word boundary to split the following words into subwords? 

Also feel free to read more about the limitations of the WordPiece algorithm. For instance, [this blog post](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99) dives into why it can fail, and [this one](https://tinkerd.net/blog/machine-learning/bert-tokenization/#demo-bert-tokenizer) introduces the mechanism underlying the algorithm. 

In [55]:
def get_tokens(string):
    """Print the WordPiece tokens of a string."""
    tokens = tokenizer.tokenize(string)
    print(tokens)

In [56]:
# Abbreviations
get_tokens('dlab')

# OOV
get_tokens('covid')

# Suffix
get_tokens('huggable')

# Digits
get_tokens('378')

# YOUR EXAMPLE

['dl', '##ab']
['co', '##vid']
['hug', '##ga', '##ble']
['37', '##8']


<div class="alert alert-success">

## ❗ Key Points

* Preprocessing includes multiple steps; some are common to most text data, and others are task-specific. 
* Both `nltk` and `spaCy` can be used for tokenization and stop word removal. The latter is more powerful because it provides additional linguistic features. 
* Tokenization works differently in BERT, where a whole word is often broken down into subwords. 

</div>