In [28]:
import re

# Natural Language Processing: Words, Tokens, and Regular Expressions
## Exercises Notebook - Session 3

This notebook contains exercises covering:
- Tokenization concepts
- Advanced regex
- Morphology

---
## Section 1: Tokenization Concepts
---

### Exercise 1.1: Types vs Instances

The slides distinguish between types (the distinct words in a text) and instances (every running occurrence of a word).
For the given text, calculate:
1. Number of instances (total tokens)
2. Number of types (vocabulary size |V|)
3. Type-token ratio

In [29]:
# YOUR CODE HERE
text = "the cat sat on the mat the cat was fat"
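
One possible approach, sketched under the assumption that whitespace splitting is an acceptable tokenizer for this toy sentence (real text would need punctuation and case handling). The variable names below are illustrative only.

```python
text = "the cat sat on the mat the cat was fat"

# Instances: every running token, duplicates included
tokens = text.split()
num_instances = len(tokens)

# Types: the distinct tokens, i.e. the vocabulary V
types = set(tokens)
num_types = len(types)

# Type-token ratio: types divided by instances
ttr = num_types / num_instances

print(f"instances = {num_instances}")
print(f"types |V| = {num_types}")
print(f"type-token ratio = {ttr:.2f}")
```

A ratio close to 1 means almost every token is new; repeated function words such as "the" pull it down.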

### Exercise 1.2: Heaps' Law Demonstration

The slides mention Heaps' Law: vocabulary size grows with √N.

Generate text of increasing length and observe vocabulary growth.

In [30]:
# YOUR CODE HERE
import random

# Use a simple word list to simulate text
word_list = ['the', 'a', 'is', 'are', 'was', 'were', 'be', 'been',
             'cat', 'dog', 'bird', 'fish', 'tree', 'house', 'car',
             'run', 'walk', 'jump', 'eat', 'sleep', 'read', 'write',
             'big', 'small', 'fast', 'slow', 'red', 'blue', 'green']
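
A minimal sketch of the simulation, assuming uniform random sampling from the word list with a fixed seed. The helper `vocab_growth` and the checkpoint values are hypothetical choices, not taken from the slides. Note the caveat: a closed list of 29 words saturates quickly, so this only shows the early part of the growth curve, whereas real text keeps producing new types.

```python
import random

word_list = ['the', 'a', 'is', 'are', 'was', 'were', 'be', 'been',
             'cat', 'dog', 'bird', 'fish', 'tree', 'house', 'car',
             'run', 'walk', 'jump', 'eat', 'sleep', 'read', 'write',
             'big', 'small', 'fast', 'slow', 'red', 'blue', 'green']

def vocab_growth(tokens, checkpoints):
    """Return {N: |V|} for each checkpoint N over the first N tokens."""
    targets = set(checkpoints)
    seen = set()
    sizes = {}
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n in targets:
            sizes[n] = len(seen)
    return sizes

# Fixed seed so the run is repeatable (an assumption for illustration)
rng = random.Random(0)
stream = rng.choices(word_list, k=10_000)

growth = vocab_growth(stream, [10, 100, 1_000, 10_000])
for n, v in growth.items():
    print(f"N = {n:>6}   |V| = {v:>2}   sqrt(N) = {n ** 0.5:.1f}")
```

Swapping the simulated stream for the tokens of a real corpus should give a curve that keeps rising instead of flattening at the size of the word list.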

### Exercise 1.3: BPE Simulation

The slides explain Byte Pair Encoding (BPE).
Implement a simple BPE token learner that:
1. Starts with a character vocabulary
2. Finds the most frequent adjacent pair
3. Merges them into a new token
4. Repeats k times

In [31]:
# YOUR CODE HERE
corpus = "low lower newest widest"

def simple_bpe(corpus, num_merges):
    pass
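
The four steps above can be sketched as follows. This is one possible reading, not a reference implementation. The function name `learn_bpe_sketch`, the end-of-word marker `"_"`, and the tie-breaking rule (the first pair seen wins) are all assumptions made for illustration.

```python
from collections import Counter

def learn_bpe_sketch(corpus, num_merges):
    """Learn up to num_merges BPE merges from a whitespace-separated corpus."""
    # Each word becomes a tuple of symbols; "_" marks the end of the word (assumption)
    words = {tuple(w) + ("_",): f for w, f in Counter(corpus.split()).items()}
    # Step 1: start with the character vocabulary
    vocab = sorted({s for symbols in words for s in symbols})
    merges = []

    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties go to the pair that was counted first (assumption)
        best = max(pair_counts, key=pair_counts.get)

        # Step 3: merge the best pair into a new token everywhere it occurs
        merges.append(best)
        vocab.append("".join(best))
        new_words = {}
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words  # Step 4: repeat on the updated corpus

    return merges, vocab

corpus = "low lower newest widest"
merges, vocab = learn_bpe_sketch(corpus, 3)
print("merges:", merges)
print("vocab: ", vocab)
```

Because several pairs tie at the same count in such a small corpus, a different tie-breaking rule would give a different but equally valid merge order.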

---
## Section 2: Advanced Regex
---

### Exercise 2.1: Lookahead Assertions

The slides introduce lookahead: (?=pattern) and (?!pattern)

Write patterns to:
1. Find words followed by a comma (without capturing the comma)
2. Find the first word of a line only if it doesn't start with 'T'

In [32]:
# YOUR CODE HERE
text = "The quick, brown fox jumps over the lazy dog."
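
A hedged sketch of both patterns. The two-line string `sample_lines` is a hypothetical addition so that the negative lookahead has a line it can match, since the given sentence starts with 'T'.

```python
import re

text = "The quick, brown fox jumps over the lazy dog."

# 1. Positive lookahead: a word that is followed by a comma.
#    (?=,) checks for the comma without consuming it.
words_before_comma = re.findall(r"\w+(?=,)", text)
print(words_before_comma)

# 2. Negative lookahead: the first word of a line, unless it starts with 'T'.
#    re.MULTILINE makes ^ match at the start of every line.
first_word_not_t = re.compile(r"^(?!T)\w+", re.MULTILINE)

sample_lines = "The quick, brown fox\njumps over the lazy dog."  # hypothetical input
print(first_word_not_t.findall(text))          # the only line starts with 'T'
print(first_word_not_t.findall(sample_lines))  # second line starts with 'j'
```

Lookaheads are zero-width, so the comma and the checked first letter never become part of the match unless the rest of the pattern consumes them.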

### Exercise 2.2: Non-capturing Groups

The slides explain (?:...) for grouping without capturing.

Write a pattern that matches "some cats" or "a few cats" 
but only captures "cats" (not "some" or "a few").

In [33]:
# YOUR CODE HERE
texts = [
    "some cats like fish",
    "a few cats play outside", 
    "some dogs bark"
]
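
One way to approach this, as a sketch: the quantifier phrase goes in a non-capturing group, so the only numbered group is the one around "cats". The names `pattern` and `captured` are illustrative.

```python
import re

texts = [
    "some cats like fish",
    "a few cats play outside",
    "some dogs bark",
]

# (?:some|a few) groups the alternatives without creating a capture group,
# so group(1) refers to (cats)
pattern = re.compile(r"(?:some|a few) (cats)")

captured = []
for t in texts:
    m = pattern.search(t)
    captured.append(m.group(1) if m else None)
    print(f"{t!r:30} -> {captured[-1]}")
```

`pattern.groups` reports 1 here; writing `(some|a few)` instead would make it 2 and shift "cats" to `group(2)`.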

### Exercise 2.3: GPT-2 Pre-tokenization

The slides show the GPT-2 pre-tokenization regex.
Test the pattern and understand what each part does.

In [None]:
# YOUR CODE HERE
gpt2_pattern = r"'s|'t|'re|'ve|'m|'ll|'d|\w+|\d+|[^\s\w]+"
test = "I'm learning NLP! It's fascinating. I've got 100 examples."

pretoken = re.findall(gpt2_pattern, test)

print(pretoken)

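
To see what each part of the pattern does, a sketch that rewrites the same alternatives with named groups (the group names are made up for illustration) and prints which branch produced each pre-token. Two things to keep in mind: `re.findall` returns a plain list of strings, so calling `.group()` on its result raises `AttributeError`, and `re.finditer` is the call that yields match objects. Also, the pattern here is a simplified form; the original GPT-2 pattern uses Unicode classes such as `\p{L}` and `\p{N}` with an optional leading space, which requires the third-party `regex` module.

```python
import re

gpt2_pattern = r"'s|'t|'re|'ve|'m|'ll|'d|\w+|\d+|[^\s\w]+"
test = "I'm learning NLP! It's fascinating. I've got 100 examples."

pretokens = re.findall(gpt2_pattern, test)
print(pretokens)

# Same alternatives in the same order, with hypothetical group names
labelled = re.compile(
    r"(?P<contraction>'s|'t|'re|'ve|'m|'ll|'d)"
    r"|(?P<word>\w+)"
    r"|(?P<number>\d+)"
    r"|(?P<punct>[^\s\w]+)"
)

# lastgroup names the branch that matched each pre-token
labels = [(m.lastgroup, m.group()) for m in labelled.finditer(test)]
for name, tok in labels:
    print(f"{name:12} {tok!r}")
```

In this simplified pattern the `\d+` branch never fires, because `\w` already matches digits and is tried first, so "100" is labelled as a word.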

---
## Section 3: Morphology
---

### Exercise 3.1: Identifying Morphemes

The slides define morphemes as minimal meaning-bearing units.

Write code to identify potential morphemes by finding:
1. Common suffixes (-ed, -ing, -s, -ly, -ful)
2. Common prefixes (un-, re-, pre-, dis-)

In [35]:
# YOUR CODE HERE
words = ["working", "unhappy", "carefully", "reworked", "glasses", "preprocessing"]

morphemes = []
for w in words:
    # Optional prefix, lazy stem, optional suffix anchored at the end of the word
    w_morph = re.match(r"(un|re|pre|dis)?(.*?)(ed|ing|s|ly|ful)?$", w).groups()
    morphemes.append(f"{w} : {w_morph}")

for m in morphemes:
    print(m)


working : (None, 'work', 'ing')
unhappy : ('un', 'happy', None)
carefully : (None, 'careful', 'ly')
reworked : ('re', 'work', 'ed')
glasses : (None, 'glasse', 's')
preprocessing : ('pre', 'process', 'ing')
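
The output above splits "glasses" into "glasse" + "s" because "-es" is not in the suffix list. The sketch below adds "es" as an assumption beyond the exercise's list and shows that a pure pattern-based splitter then over-segments other words. The helper `split_morphemes` is hypothetical.

```python
import re

# Same shape as the pattern above, with "es" added to the suffixes (assumption)
MORPH_RE = re.compile(r"(un|re|pre|dis)?(.*?)(ed|ing|es|s|ly|ful)?$")

def split_morphemes(word):
    """Return (prefix, stem, suffix); missing parts are None."""
    return MORPH_RE.match(word).groups()

for w in ["glasses", "unhappy", "cakes", "red"]:
    print(f"{w} : {split_morphemes(w)}")
```

"glasses" now yields the stem "glass", but "cakes" loses its final "e" and "red" is wrongly given the prefix "re". Affix stripping by regex only proposes candidate morphemes; confirming them needs a lexicon or a trained segmenter.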
