
## Exercise 1: Tokenization and Vocabulary
Consider the following corpus: “Apple is a high-tech company; it produces mobile phones and laptops. It's products often feature an apple design on their backs.”

**a.** Apply simple whitespace tokenization to the corpus, list all resulting unique tokens, and compute the vocabulary size.
**b.** Lowercase the corpus, apply simple whitespace tokenization, list all resulting unique tokens, and compute the vocabulary size.
**c.** Compare the vocabularies obtained in parts a and b. Identify one concrete problem caused by lowercasing.
**d.** Lowercase the corpus, apply Penn Treebank tokenization, list all resulting unique tokens, and compute the vocabulary size.
**e.** Compare the vocabularies obtained in parts a and d. Identify one concrete problem solved by Penn Treebank tokenization.


### Solution


In [None]:
corpus = "Apple is a high-tech company; it produces mobile phones and laptops. It's products often feature an apple design on their backs."

**a.** {'Apple', 'is', 'a', 'high-tech', 'company;', 'it', 'produces', 'mobile', 'phones', 'and', 'laptops.', "It's", 'products', 'often', 'feature', 'an', 'apple', 'design', 'on', 'their', 'backs.'}

$|V| = 21$

In [None]:
x = corpus.split()
vocab = set(x)
print("Tokens: ", vocab)
print("|V| = ", len(vocab))

**b.** {'is', 'a', 'high-tech', 'company;', 'it', 'produces', 'mobile', 'phones', 'and', 'laptops.', "it's", 'products', 'often', 'feature', 'an', 'apple', 'design', 'on', 'their', 'backs.'}

$|V| = 20$

In [None]:
lowered = corpus.lower()
x = lowered.split()
vocab = set(x)
print("Tokens: ", vocab)
print("|V| = ", len(vocab))

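**c.** We lose semantic information: lowercasing collapses *Apple* (the company) and *apple* (the fruit) into the single token *apple*, so the two meanings can no longer be told apart.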
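
The comparison in part c can also be checked with set differences. The following is a minimal sketch rather than part of the original solution; the variable names `cased_vocab` and `lowercased_vocab` are chosen here for illustration.

```python
corpus = (
    "Apple is a high-tech company; it produces mobile phones and laptops. "
    "It's products often feature an apple design on their backs."
)

# Vocabularies from parts a and b (whitespace tokenization only)
cased_vocab = set(corpus.split())
lowercased_vocab = set(corpus.lower().split())

# Tokens that exist in one vocabulary but not in the other
only_cased = sorted(cased_vocab - lowercased_vocab)
only_lowercased = sorted(lowercased_vocab - cased_vocab)

print("Only in cased vocabulary:", only_cased)
print("Only in lowercased vocabulary:", only_lowercased)
```

'Apple' appears only in the cased vocabulary, while its lowercased form was already present as 'apple', which is why the vocabulary shrinks by one.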

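**d.** {'laptops.', 'backs', 'high-tech', 'their', 'a', 'produces', "'s", 'is', 'products', 'often', 'an', 'on', '.', 'design', 'and', 'mobile', 'apple', 'phones', 'feature', 'company', 'it', ';'}

$|V| = 22$

Note that 'laptops.' keeps its period: NLTK's `TreebankWordTokenizer` expects input that is already split into sentences, so it only separates a period at the very end of the string.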

In [None]:
from nltk.tokenize.treebank import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
lowered = corpus.lower()


tokens = tokenizer.tokenize(lowered)
vocab = set(tokens)
print("tokens: ", vocab)
print("|V| = ", len(vocab))

## Exercise 2: Stemming vs Lemmatization

You are given the following list of words and the form they were reduced to by a text normalization process.

**a.** For each word, indicate whether the transformation was produced by stemming or lemmatization, and justify your answer.

| Original Word | Normalized Word |
| ------------ | ------------ |
| Studies | Studi |
| Better | Good |
| Caring | Care |
| Wolves | Wolf |
| Organization | Organ |

**b.** Identify **one transformation** in the table that clearly distinguishes stemming from lemmatization, and explain why.


## Exercise 3: Morphological Analysis

Below is a list of **morphemes** (the smallest units of meaning) highlighted within specific words.
For each example, identify whether the highlighted part is an **Inflectional Morpheme**, a **Derivational Morpheme**, or a **Clitic**, and justify your answer.

| Word Example | Highlighted Morpheme |
| -------------- | -------------- |
| Played | ed |
| unhappy | un |
| cats | s |
| I'm | 'm |
| Slowly | ly |
| she'll | 'll |
| teacher | teach |



In [None]:
from collections import defaultdict, Counter
from typing import Dict, List, Tuple


def get_vocab(corpus: List[str]) -> Dict[Tuple[str, ...], int]:
    """
    Build initial vocabulary from corpus.
    Each word is split into characters + </w>
    """
    vocab = defaultdict(int)
    for word in corpus:
        tokens = tuple(word) + ("</w>",)
        vocab[tokens] += 1
    return vocab


def get_pair_frequencies(vocab: Dict[Tuple[str, ...], int]) -> Counter:
    """
    Count frequency of adjacent symbol pairs
    """
    pairs = Counter()
    for word, freq in vocab.items():
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += freq
    return pairs


def merge_pair(
    pair: Tuple[str, str], vocab: Dict[Tuple[str, ...], int]
) -> Dict[Tuple[str, ...], int]:
    """
    Merge a given pair in the vocabulary
    """
    merged_vocab = {}

    bigram = pair
    replacement = "".join(pair)

    for word, freq in vocab.items():
        new_word = []
        i = 0
        while i < len(word):
            # If we see the bigram, merge it
            if i < len(word) - 1 and (word[i], word[i + 1]) == bigram:
                new_word.append(replacement)
                i += 2
            else:
                new_word.append(word[i])
                i += 1

        merged_vocab[tuple(new_word)] = freq

    return merged_vocab


def train_bpe(
    corpus: List[str], num_merges: int
) -> Tuple[List[Tuple[str, str]], Dict[Tuple[str, ...], int]]:
    """
    Train BPE and return the learned merge rules and the final vocabulary
    """
    vocab = get_vocab(corpus)
    merges = []

    for _ in range(num_merges):
        pair_freqs = get_pair_frequencies(vocab)
        if not pair_freqs:
            break

        best_pair = pair_freqs.most_common(1)[0][0]
        merges.append(best_pair)

        vocab = merge_pair(best_pair, vocab)

    return merges, vocab


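# Toy corpus: each word is repeated according to its frequency
word_counts = {"pun": 12, "hug": 10, "pug": 5, "bun": 4, "bus": 2}
corpus = [word for word, count in word_counts.items() for _ in range(count)]

# train_bpe returns both the ordered merge rules and the final vocabulary.
# Training stops early once every word is a single symbol, so fewer than
# 100 merges are learned on this corpus.
merges, vocab = train_bpe(corpus, 100)
print("Merges:", merges)
print("Vocabulary:", dict(vocab))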
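
Once merge rules have been learned, a new word can be segmented by replaying them in the order they were learned. The sketch below is an illustrative addition, not part of the original notebook: `apply_merges` and `example_merges` are hypothetical names, and the merge list is hard-coded (the first three merges for the toy corpus above, traced by hand) so that the block runs on its own.

```python
from typing import List, Tuple


def apply_merges(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a word by replaying merge rules in the order they were learned."""
    # Start from characters plus the end-of-word marker, as in get_vocab
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge every adjacent occurrence of the current pair
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Assumption: hard-coded for illustration instead of calling train_bpe
example_merges = [("p", "u"), ("n", "</w>"), ("g", "</w>")]

for word in ["pug", "bun", "mug"]:
    print(word, "->", apply_merges(word, example_merges))
```

The word "mug" never occurs in the training corpus, yet it is still segmented into known symbols; with only three merges, most of it stays at the character level.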