
# Text Analysis

## Why This Matters

* Text data powers search engines, chatbots, sentiment analysis, and more
* Raw text is messy — preprocessing is essential for accuracy
* Translation expands applications across languages and cultures

## The Text Analysis Pipeline

1. **Collection** – get raw text (web, files, APIs)
2. **Preprocessing** – clean and normalize
3. **Feature Extraction** – convert to numerical form
4. **Analysis** – classification, clustering, sentiment, etc.
5. **Translation (if needed)** – convert between languages
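
As a rough sketch of steps 1 to 4, the snippet below pushes two hard-coded sentences through a tiny pipeline. The sample documents, the `preprocess` helper, and the choice of bag-of-words counts are illustrative assumptions, not a fixed recipe.

```python
import re
from collections import Counter

# 1. Collection: two hard-coded documents stand in for web/file/API data
docs = ["The cat sat on the mat.", "The dog chased the cat!"]

# 2. Preprocessing: lowercase and keep only runs of letters
def preprocess(doc):
    return re.findall(r"[a-z]+", doc.lower())

tokenized = [preprocess(d) for d in docs]

# 3. Feature extraction: bag-of-words counts over a shared vocabulary
vocab = sorted({tok for toks in tokenized for tok in toks})
vectors = [[Counter(toks)[word] for word in vocab] for toks in tokenized]

# 4. Analysis: here, simply count the vocabulary words both documents share
shared = sum(1 for a, b in zip(*vectors) if a and b)

print(vocab)
print(vectors)
print(shared)
```

Each document becomes a list of counts aligned with `vocab`, which is the "numerical form" that later analysis steps work on.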

## Common Preprocessing Steps

* Lowercasing
* Removing punctuation, numbers, and stopwords
* Tokenization (splitting into words/tokens)
* Stemming & Lemmatization
* Handling special characters & emojis
* Language detection
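
Several of these steps can be chained in a few lines. The sketch below is a minimal illustration that assumes English text and uses a hypothetical, deliberately tiny stopword list; `basic_preprocess` is a made-up helper name.

```python
import re

# Hypothetical mini stopword list (real lists are much longer)
STOPWORDS = {"the", "is", "a", "and", "to"}

def basic_preprocess(text):
    text = text.lower()                      # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)    # drop punctuation, digits, emojis
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

result = basic_preprocess("The launch is at 9 AM, and it is going to be GREAT 🚀!")
print(result)
```

Note that `[^a-z\s]` also removes accented letters, so this particular pattern only suits plain ASCII English text.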

# Regular Expression (regex)



Metacharacters Cheat Sheet

| Symbol              | Meaning                                                                |
| ------------------- | ---------------------------------------------------------------------- |
| `.`                 | Matches any character except newline (`\n`)                            |
| `^`                 | Matches the start of a string (or line in multiline mode)              |
| `$`                 | Matches the end of a string (or line in multiline mode)                |
| `*`                 | Matches 0 or more repetitions of the preceding pattern                 |
| `+`                 | Matches 1 or more repetitions of the preceding pattern                 |
| `?`                 | Matches 0 or 1 repetition of the preceding pattern (makes it optional) |
| `{m}`               | Matches exactly `m` repetitions                                        |
| `{m,}`              | Matches `m` or more repetitions                                        |
| `{m,n}`             | Matches between `m` and `n` repetitions                                |
| `[]`                | Matches any single character inside the brackets (character class)     |
| `[^ ]`              | Matches any single character **not** inside the brackets               |
| `[a-t]`             | Matches lowercase letters a through t                                  |
| `[a-tB-P]`          | Matches lowercase letters a through t or uppercase letters B through P |
| `[0-4]`             | Matches digits 0 through 4                                             |
| `\|`                | Acts as an OR operator between patterns                                |
| `()`                | Groups patterns and captures the matched text                          |
| `(?: )`             | Groups patterns **without capturing** (non-capturing group)            |
| `(?P<name>pattern)` | Creates a named capturing group                                        |
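
Named groups are easier to understand with a concrete case. The following minimal sketch uses a made-up date string and shows how the group names become keys you can look up.

```python
import re

# Named groups let you refer to parts of a match by name instead of position
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, "Order placed on 2025-09-17.")

print(match.group("year"))
print(match.groupdict())
```

`match.groupdict()` returns every named group at once as a dictionary, which is convenient when a pattern has many parts.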

Special Character Cheat Sheet

| Symbol | Meaning                                                          |
| ------ | ---------------------------------------------------------------- |
| `\d`   | Matches any digit (`0-9`)                                        |
| `\D`   | Matches any non-digit                                            |
| `\w`   | Matches any “word” character (letters, digits, underscore)       |
| `\W`   | Matches any non-word character                                   |
| `\s`   | Matches any whitespace (space, tab, newline)                     |
| `\S`   | Matches any non-whitespace                                       |
| `\b`   | Matches word boundary (between word and non-word)                |
| `\B`   | Matches position **not** at a word boundary                      |
| `\`    | Escapes a special character (e.g. `\.` matches a literal period, `\\` matches a literal backslash) |


Some examples

| Symbol | Example                              | Matches                   |
| ------ | ------------------------------------ | ------------------------- |
| `^`    | `re.findall(r"^cat", "cat dog cat")` | `['cat']` (only at start) |
| `$`    | `re.findall(r"cat$", "dog cat")`     | `['cat']` (only at end)   |
| `.`     | `re.findall(r"c.t", "cat cot cut cit")`  | `['cat','cot','cut','cit']` |
| `*`     | `re.findall(r"ab*", "a ab abb abbb")`    | `['a','ab','abb','abbb']`   |
| `+`     | `re.findall(r"ab+", "a ab abb abbb")`    | `['ab','abb','abbb']`       |
| `?`     | `re.findall(r"ab?", "a ab abb")`         | `['a','ab','ab']`           |
| `{m}`   | `re.findall(r"a{3}", "aa aaa aaaa")`     | `['aaa','aaa']`             |
| `{m,}`  | `re.findall(r"a{2,}", "a aa aaa aaaa")`  | `['aa','aaa','aaaa']`       |
| `{m,n}` | `re.findall(r"a{2,3}", "a aa aaa aaaa")` | `['aa','aaa','aaa']`        |
| `[abc]`  | `re.findall(r"[abc]", "apple bat cat")` | `['a','b','a','c','a']` |
| `[^abc]` | `re.findall(r"[^abc]", "abcxyz")`       | `['x','y','z']`         |
| `[0-9]`  | `re.findall(r"[0-9]", "Room 101")`      | `['1','0','1']`         |
| `(...)`   | `re.findall(r"(cat)", "cat dog cat")`   | `['cat','cat']`                    |
| `(?:...)` | `re.findall(r"(?:cat)", "cat dog cat")` | `['cat','cat']` (no capture group) |
| `\|`      | `re.findall(r"cat\|dog", "cat dog bird")` | `['cat','dog']`                  |
| `\d`   | `re.findall(r"\d", "A1B2C3")`                   | `['1','2','3']`               |
| `\D`   | `re.findall(r"\D", "A1B2")`                     | `['A','B']`                   |
| `\w`   | `re.findall(r"\w", "Hi_5!")`                    | `['H','i','_','5']`           |
| `\W`   | `re.findall(r"\W", "Hi_5!")`                    | `['!']`                       |
| `\s`   | `re.findall(r"\s", "a b\tc\n")`                 | `[' ','\t','\n']`             |
| `\S`   | `re.findall(r"\S", "a b")`                      | `['a','b']`                   |
| `\b`   | `re.findall(r"\bcat\b", "a cat scatter cater")` | `['cat']`                     |
| `\B`   | `re.findall(r"\Bcat\B", "scat cat scatter")`    | `['cat']` (inside words only) |
| `(?=...)`  | `re.findall(r"cat(?=dog)", "catdog catfish")`           | `['cat']`             |
| `(?!...)`  | `re.findall(r"cat(?!dog)", "catdog catfish")`           | `['cat']`             |
| `(?<=...)` | `re.findall(r"(?<=Mr\.)\s\w+", "Mr. Smith, Mr. Jones")` | `[' Smith',' Jones']` |
| `(?<!...)` | `re.findall(r"(?<!Mr\.)\s\w+", "Mr. Smith Ms. Taylor")` | `[' Ms',' Taylor']` (skips ` Smith`) |
| `\.`   | `re.findall(r"\.", "a.b c.d")`   | `['.','.']` |
| `\\`   | `re.findall(r"\\", r"a\b c\\d")` | `['\\','\\','\\']` (the input contains three backslashes) |
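
Two behaviors mentioned in the tables are hard to see from one-line examples: what `findall` returns when a pattern contains capturing groups, and how multiline mode changes `^`. The sketch below uses made-up strings to show both.

```python
import re

text = "catfish catdog"

# One capturing group: findall returns only the group's text
captured = re.findall(r"(cat)fish", text)

# Non-capturing group: findall returns the whole match
whole = re.findall(r"(?:cat)fish", text)

# Two capturing groups: findall returns one tuple per match
pairs = re.findall(r"(cat)(fish|dog)", text)

# Multiline mode: ^ matches at the start of every line, not just the string
line_starts = re.findall(r"^cat", "cat\ndog\ncat", flags=re.MULTILINE)

print(captured, whole, pairs, line_starts)
```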



## Example: Basic Cleaning

In [None]:
import re

text = "Hello World! 123 :)"

# [^...]: a character class that matches anything NOT listed inside it
# a-zA-Z: any ASCII letter
# \s: any whitespace character
# "": the replacement, i.e. delete the match
# Together: delete everything that is not a letter or whitespace, then lowercase
cleaned = re.sub(r"[^a-zA-Z\s]", "", text).lower()
print(cleaned)  # "hello world  " (the spaces around "123" are kept, so two trailing spaces remain)

### Let's practice!

Here is a reminder of what we already know

| Method      | Purpose                                   | Example                                                      | Output                  |
|------------|-------------------------------------------|-------------------------------------------------------------|------------------------|
| `match()`   | Check pattern at **start** of string     | `re.match(r"\d+", "123abc")`                                | Match object (matches '123') |
| `search()`  | Find pattern **anywhere** in string      | `re.search(r"\d+", "abc123")`                               | Match object (matches '123') |
| `findall()` | Find **all matches** in string           | `re.findall(r"\d+", "I have 2 cats and 3 dogs")`           | `['2', '3']`           |
| `sub()`     | **Replace matches** with another string  | `re.sub(r"\d", "#", "123-456")`                             | `"###-###"`            |
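
The difference between `match()` and `search()` shows up when the pattern is not at the start of the string. A small sketch on a made-up string:

```python
import re

text = "abc123"

at_start = re.match(r"\d+", text)    # None: the string does not start with a digit
anywhere = re.search(r"\d+", text)   # Match object for "123"

# Both functions return None on failure, so check before calling .group()
if anywhere:
    print(anywhere.group(), anywhere.span())
```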


In [None]:
voicemail = """Hi, this is Alice. I’m calling about order A123.
I placed it on 2025-09-17, and I think the price was $19.99.
My coworker Bob also ordered, his ID was B456 on 2024-12-01, and it cost about $250.00.
If you need to reach me, my email is alice@example.com.
Bob’s email is bob.smith@school.edu.
Oh, and my friend Carol had order C789 on 1999-07-04 for only $7.5. Thanks!"""


Tasks:

* Write a regex to find all the order IDs (like A123, B456).

* How would you match all the dates in YYYY-MM-DD format?

* How can you extract just the letters from the order IDs?

* How can you extract just the numbers from the order IDs?

* Write a regex that matches four-digit years in the dates.

* Modify it so it only matches years starting with 202.

* Create a regex that extracts the username and domain separately from each email (e.g., alice and example.com).

* How would you capture the month part of the dates?

* Write a regex that finds all the prices (including decimals).

* How would you anonymize all email addresses in the voicemail by replacing them with [EMAIL REDACTED]?

* Replace all order IDs (e.g., A123, B456) with the word ORDER.

* Check to see if the voicemail includes an email address.

* Convert each date in YYYY-MM-DD form to DD-MM-YY


In [None]:
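# Note: the answers are not in the same order as the tasks; each comment names its task
ans1 = re.findall(r"[A-Z]\d{3}", voicemail)                      # order IDs (A123, B456, ...)
ans2 = re.findall(r"\d{4}-(\d{2})-\d{2}", voicemail)             # month part of each date
ans3 = re.findall(r"[A-Z](\d{3})", voicemail)                    # numbers from the order IDs
ans4 = re.sub(r"[A-Z]\d{3}", "ORDER", voicemail)                 # replace order IDs with ORDER
ans5 = re.findall(r"\b202\d\b", voicemail)                       # years starting with 202
ans6 = re.findall(r"\b\d{4}\b", voicemail)                       # four-digit years
ans7 = re.findall(r"([\w\.\-]+)@([\w\.\-]+\.\w+)", voicemail)    # (username, domain) pairs
ans8 = re.findall(r"\$\d+(?:\.\d+)?", voicemail)                 # prices, decimals optional
ans9 = re.sub(r"[\w\.\-]+@[\w\.\-]+\.\w+", "[EMAIL REDACTED]", voicemail)  # anonymize emails
ans10 = re.search(r"[\w\.\-]+@[\w\.\-]+\.\w+", voicemail)        # Match object if an email exists, else None
ans11 = re.findall(r"([A-Z])\d{3}", voicemail)                   # letters from the order IDs
ans12 = re.findall(r"\d{4}-\d{2}-\d{2}", voicemail)              # dates in YYYY-MM-DD format
# Convert YYYY-MM-DD to DD-MM-YY: reorder the groups and keep the last two digits of the year
ans13 = re.sub(r"(\d{4})-(\d{2})-(\d{2})", lambda m: f"{m.group(3)}-{m.group(2)}-{m.group(1)[2:]}", voicemail)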
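
As an alternative to the lambda in `ans13`, the date conversion can also be written with backreferences if the pattern captures only the last two digits of the year. This is a sketch on a short made-up string rather than the full voicemail; the variable names are illustrative.

```python
import re

sample = "Placed on 2025-09-17 and 1999-07-04."

# \d{2} skips the century; the three groups are YY, MM, DD
# In the replacement, \3, \2, \1 refer back to those groups
converted = re.sub(r"\b\d{2}(\d{2})-(\d{2})-(\d{2})\b", r"\3-\2-\1", sample)
print(converted)
```

A replacement string is enough whenever the output only reorders captured text; a function is needed once the replacement requires computation.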

# Tokenization

* Tokenization breaks text into units: words, sentences, characters, subwords, etc.
* Common subword tokenization methods:
  * [SentencePiece](https://arxiv.org/abs/1808.06226)
  * [WordPiece](https://arxiv.org/abs/2012.15524)
  * [byte-level BPE (Byte Pair Encoding)](https://arxiv.org/abs/1909.03341)

| Model Family         | Example Models                 | Tokenization Method  |
| -------------------- | ------------------------------ | -------------------- |
| **BERT family**      | BERT, DistilBERT, ELECTRA      | WordPiece            |
| **RoBERTa family**   | RoBERTa (byte-BPE), XLM-R (SP) | Byte-level BPE / SentencePiece |
| **Google Seq2Seq**   | T5, mT5, XLNet, ALBERT         | SentencePiece        |
| **OpenAI GPT**       | GPT-2, GPT-3, GPT-4, CLIP      | Byte-level BPE       |
| **Meta**             | OPT (byte-BPE), LLaMA (SP-BPE) | BPE / SentencePiece  |
| **Anthropic Claude** | Claude 1–3                     | Byte-level BPE       |
| **Mistral**          | Mistral, Mixtral               | SentencePiece BPE    |
| **BigScience**       | BLOOM                          | Byte-level BPE       |
| **Microsoft**        | DeBERTa (byte-BPE; v2/v3 use SentencePiece), Phi (BPE) | Byte-BPE / SentencePiece |
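
To give a feel for how BPE-style vocabularies are learned, here is a toy sketch of the merge loop: repeatedly find the most frequent adjacent pair of symbols and merge it. The corpus and the helper names (`most_frequent_pair`, `merge_pair`) are made up for illustration; real tokenizers add details such as byte-level symbols and word-boundary markers.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    # max() returns the first pair encountered among ties
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After three merges the frequent ending "est" has become a single symbol, which is how common word fragments end up in the vocabulary.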


## Tokenization with regex

In [None]:
# Why do we need to learn all of this?
# Why can't we just use .split?

text = "Data Science is fun! Once you learn it, you will never forget it"
tokens = text.split()
print(tokens)

In [None]:
import re

# \w+: match one or more word characters (letters, digits, underscore)
text = "Data Science is fun! Once you learn it, you will never forget it"
text = text.lower()
tokens = re.findall(r"\w+", text)
characters = re.findall(r"[^\w\s]", text)
print(tokens)
print(characters)


In [None]:
# Tokenization can get tricky quickly
# What are some ways we can hand'le It's?

text = "I don't think it's going to get tough quick, do you?"
text = text.lower()
tokens = re.findall(r"\w+", text) # <-- for ASCII text, \w is the same as [a-zA-Z0-9_]; on str patterns it also matches Unicode letters and digits
print(tokens)

In [None]:

text = "I don't think it's going to get tough quick, do you?"
text = text.lower()
tokens = re.findall(r"[\w']+", text) # <-- captures repetitions of both word characters and apostrophe

print(tokens)

In [None]:
# We could do a substitution manually for each contraction

text = "I don't think it's going to get tough quick, do you?"
text = text.lower()
text = re.sub(r"\bit's\b", "it is", text)
text = re.sub(r"\bdon't\b", "do not", text)
tokens = re.findall(r"[\w']+", text)
print(tokens)

In [None]:
# We could also create a dictionary of contractions

text = "I don't think it's going to get tough quick, do you?"
text = text.lower()

contractions_dict = {
    "don't": "do not",
    "it's": "it is",
    "we're": "we are"
}

# Collect the contractions (the dictionary keys) in a list
key_list = list(contractions_dict.keys())

# Join them with | so the regex reads "don't|it's|we're" (match any one of them)
pattern_text = "|".join(key_list)

# Wrap the alternation in a non-capturing group between word boundaries
pattern = r"\b(?:{})\b".format(pattern_text)

# re.sub calls the lambda once per match:
# match.group() is the contraction that was found,
# and the dictionary lookup returns its expansion, which replaces the match
expanded_text = re.sub(pattern, lambda match: contractions_dict[match.group()], text)

print(expanded_text)
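
The dictionary approach can be wrapped in a reusable function. The sketch below is one possible variant, not the only way to do it: `expand_contractions` and `contractions_map` are hypothetical names, and it adds case-insensitive matching and `re.escape` as extra safeguards.

```python
import re

def expand_contractions(text, mapping):
    """Replace each contraction in `mapping` with its expansion, ignoring case."""
    # Longest keys first so a longer contraction wins over a shorter one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, keys)) + r")\b",
        flags=re.IGNORECASE,
    )
    # Look up the lowercase form because the mapping keys are lowercase
    return pattern.sub(lambda m: mapping[m.group().lower()], text)

contractions_map = {"don't": "do not", "it's": "it is", "we're": "we are"}
expanded = expand_contractions("I DON'T think it's broken.", contractions_map)
print(expanded)
```

The original capitalization of a contraction is lost in the replacement, which is usually acceptable when the text is lowercased anyway.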


In [None]:
!pip install contractions

In [None]:
# There is a Python package called contractions that does this for us!

import contractions
import re
text = "I don't think it's going to get tough quick, do you?"
text = text.lower()

# Expand contractions automatically
text = contractions.fix(text)

print(text)


In [None]:
tokens = re.findall(r"[\w']+", text)
print(tokens)

## Tokenization Using NLTK (Natural Language Toolkit)

The regex approach above may not be the best way to tokenize text that contains contractions.

It also drops crucial information, such as punctuation.

`word_tokenize` is NLTK's recommended word tokenizer.

* It does a job similar to a regex such as `[\w']+`, which splits text into words.

* The key difference is that `word_tokenize` is language-aware. It first splits the text into sentences with the `Punkt` tokenizer, then splits each sentence into words with Treebank-style rules. As a result, it:

  * Handles contractions (can't → 'ca' + "n't")

  * Separates punctuation into individual tokens

  * Deals with abbreviations (Dr., Mr., etc.)

* A regex like `[\w']+` is simpler, but won't handle all of these linguistic nuances.
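
A regex can recover part of this behavior. The sketch below keeps punctuation as separate tokens by adding a second alternative to the pattern; it is a plain-regex approximation, not what `word_tokenize` does internally, and it still leaves contractions unsplit.

```python
import re

text = "I don't think it's tough, do you?"

# \w+(?:'\w+)?  a word, optionally followed by an apostrophe and more word characters
# [^\w\s]       otherwise, any single character that is neither word nor whitespace
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)
```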

In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download required NLTK data (only the first time)
nltk.download('punkt_tab')




In [None]:
text = "I don't think it's going to get tough quick, do you?"
text = text.lower()
tokens = word_tokenize(text)
print(tokens)


### Stopword Removal

* Removes common but uninformative words (e.g., “the”, “and”)

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
display(stopwords.words('english'))

In [None]:
text = "I don't think it's going to get tough quick, do you?"
text = text.lower()

tokens = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]

print(tokens)
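
One caveat: stopword lists, including NLTK's English list, contain negations such as "not", and removing them can change the meaning of a sentence. The sketch below illustrates this with a hypothetical mini stopword list and a pre-tokenized sentence, so it runs without any downloads.

```python
# Hypothetical mini stopword list; NLTK's real English list is far longer
stop_words = {"i", "do", "not", "it", "is", "this", "the", "a"}

tokens = ["i", "do", "not", "like", "this", "movie"]

# Plain removal drops "not" and flips the apparent sentiment
plain = [t for t in tokens if t not in stop_words]

# Excluding a few negations from the stopword set preserves the meaning
negations = {"not", "no", "never"}
keep_negation = [t for t in tokens if t not in stop_words - negations]

print(plain)
print(keep_negation)
```

Whether to keep negations depends on the task; for sentiment analysis they usually matter.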

### More fun with NLTK

In [None]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')


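* WordNet is a large lexical database of English; NLTK's `WordNetLemmatizer` looks words up in it.

* Words can be stemmed

* Words can be lemmatized

Stemming: the process of reducing a word to a root form, called a stem, by stripping affixes with fixed rules. The stem is not always a real word.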

In [None]:
# Initialize stemmer
ps = PorterStemmer()

# What issues do we see?
print(ps.stem("running"))
print(ps.stem("flies"))
print(ps.stem("easily"))
print(ps.stem("better"))
print(ps.stem("studies"))
print(ps.stem("organizational"))
print(ps.stem("unbelievable"))
print(ps.stem("geese"))
print(ps.stem("running-away"))
print(ps.stem("happily"))
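
To see where such issues come from, it helps to look at a much cruder rule-based stemmer. The `toy_stem` function below is a made-up illustration, not the Porter algorithm: it strips the first matching suffix and knows nothing about the words themselves.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "ly", "es", "s"):
        # Require at least three remaining characters so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: toy_stem(w) for w in ["running", "studies", "easily", "geese", "better"]}
print(stems)
```

Rules like these produce non-words ("runn", "easi") and cannot relate irregular forms such as "geese" or "better" to their base words, which is the gap lemmatization aims to fill.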

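Lemmatization: the process of reducing a word to its dictionary form (its lemma). `WordNetLemmatizer` works best when given the part of speech; by default it treats every word as a noun.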

In [None]:
# Initialize lemmatizer
lem = WordNetLemmatizer()
words_with_pos = [
    ("running", "v"),
    ("flies", "v"),
    ("easily", "r"),  # adverb
    ("better", "a"),  # adjective
    ("studies", "n"),
    ("organizational", "a"),
    ("unbelievable", "a"),
    ("geese", "n"),
    ("running-away", "v"),
    ("happily", "r")
]

print(f"{'Word':<15} {'Stemmed':<15} {'Lemmatized':<15}")
print("=" * 47)  # 3 columns of width 15 plus 2 separating spaces

for word, pos in words_with_pos:
    stemmed = ps.stem(word)
    lemmatized = lem.lemmatize(word, pos=pos)
    print(f"{word:<15} {stemmed:<15} {lemmatized:<15}")


## What does tokenization look like for different subword tokenization methods?

In [None]:
from transformers import BertTokenizer, GPT2Tokenizer, T5Tokenizer
import sentencepiece as spm

text = "I don't think it's going to get tough quick, do you?"

# 1. SentencePiece
# For demo purposes, we'll use a pretrained SentencePiece model (t5-small)
# We use from_pretrained for tokenizers because the tokenization rules and vocabularies are learned during model training.

sp_tokenizer = T5Tokenizer.from_pretrained("t5-small")
sp_tokens = sp_tokenizer.tokenize(text)
print("SentencePiece tokens:", sp_tokens)

# The ▁ character (U+2581, not a regular underscore) is a special SentencePiece marker:
# it stands for the space before a token, so it shows where a new word starts

In [None]:
# 2. WordPiece (BERT)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
wp_tokens = bert_tokenizer.tokenize(text)
print("WordPiece tokens:", wp_tokens)

In [None]:
# 3. Byte-level BPE (GPT-2)
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
bpe_tokens = gpt2_tokenizer.tokenize(text)
print("Byte-level BPE tokens:", bpe_tokens)

# spaces are included in the tokens and are represented by Ġ

In [None]:
# Let's highlight some additional differences between these three methods of tokenization
text = "Unbelievable! It's going to rock the AI-world 🚀."
sp_tokens = sp_tokenizer.tokenize(text)
print("SentencePiece tokens:", sp_tokens)

wp_tokens = bert_tokenizer.tokenize(text)
print("WordPiece tokens:", wp_tokens)

bpe_tokens = gpt2_tokenizer.tokenize(text)
print("Byte-level BPE tokens:", bpe_tokens)
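
The emoji is a useful test case because byte-level BPE starts from raw UTF-8 bytes rather than characters, so any string can be represented, while a tokenizer with a fixed character vocabulary may have to fall back to an unknown token. The short sketch below only shows the bytes involved, not the tokenizer itself.

```python
emoji = "🚀"

# One visible character, but four bytes in UTF-8
utf8_bytes = list(emoji.encode("utf-8"))
print(len(emoji), utf8_bytes)
```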


In [None]:
text = "oivaohuqonfmoimw surely isn't a word"
sp_tokens = sp_tokenizer.tokenize(text)
print("SentencePiece tokens:", sp_tokens)

wp_tokens = bert_tokenizer.tokenize(text)
print("WordPiece tokens:", wp_tokens)

bpe_tokens = gpt2_tokenizer.tokenize(text)
print("Byte-level BPE tokens:", bpe_tokens)
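
To make the WordPiece output easier to read, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece tokenization. The function name and the toy vocabulary are hypothetical; BERT's real vocabulary has roughly 30,000 entries and its implementation handles more edge cases.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, not BERT's real one
toy_vocab = {"un", "##believ", "##able", "token", "##ization", "##s"}

print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("tokenizations", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With a vocabulary this small, a nonsense word maps to `[UNK]`; a real vocabulary also contains short pieces and single characters, so most strings can still be split.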