# Week 01

Content:
1. Softmax
2. Byte-Pair
3. Tokenizer
4. NLP-Pipeline

## Softmax

Remark: you get a bonus for this exercise if you answer at least 3 out of 4 questions.

Here we are going to answer the question of why softmax is called "softmax" and investigate its characteristics. The $i \text{-th}$ component of the softmax is given by:

$$f(x_i) := \text{softmax}(x_i) = \frac{\exp x_i}{ \sum_{j = 1}^k \exp x_j}$$


And thus the vector is $f(\mathbf x) = [f(x_1), \dots, f(x_k)]$ where $\mathbf x = (x_1, \dots, x_k)$.


1. show that $f(\mathbf x)$ can be interpreted as a probability distribution. To do so, show that $\sum_{i=1}^k f(x_i) = 1$ and $f(x_i) > 0$ for all $i$ and all $\mathbf x \in \mathbb{R}^k$
2. show that softmax is $C^\infty$, i.e. that you can calculate the derivative of $f(x_i)$ as often as you want with respect to $x_i$. You can use the fact that $\exp x$ is smooth (smooth means $C ^ \infty$)
3. show that if $x_i \lt x_j$ then $f(x_i) \lt f(x_j)$. Does the converse hold as well?
4. what is the limit of $f(x_i)$ for $x_i \to + \infty$ (with all other components fixed)? Same question for $x_i \to - \infty$.


![ex01](week_01_ex01.png)
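
The four properties can be checked numerically before proving them. The following is a minimal sketch using only the standard library; the function name `softmax` and the example inputs are illustrative, and subtracting the maximum is a common numerical-stability trick that is not part of the definition above (it leaves the result unchanged because the factor cancels in the fraction).

```python
import math

def softmax(xs):
    """Softmax of a list of floats, following the formula above."""
    m = max(xs)  # assumption: shift by the maximum to avoid overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([1.0, 2.0, 3.0])
print(sum(p))                # question 1: components sum to 1 (up to rounding)
print(all(v > 0 for v in p)) # question 1: every component is positive
print(p[0] < p[1] < p[2])    # question 3: the order of the inputs is preserved

# Question 4: let the first component grow or shrink while the others stay fixed.
print(softmax([50.0, 0.0, 0.0])[0])   # close to 1
print(softmax([-50.0, 0.0, 0.0])[0])  # close to 0
```

A numerical check like this does not replace the proofs, but it helps to catch a wrong conjecture early.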

## Byte-pair encoding

Implement byte-pair encoding and reproduce the example in chapter 2.5.2 of [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/):

!["Byte Pair Encoding"](byte_pair_encoding.png)

```python
def bytePairEncoding(corpus, k):
  # Start with the set of individual characters (the space counts as one).
  vocab = set(corpus)
  corpus = list(corpus)

  for _ in range(k):
    # Count every pair of adjacent tokens in the current corpus.
    pairs = {}
    for tokenLeft, tokenRight in zip(corpus[:-1], corpus[1:]):
      if (tokenLeft, tokenRight) in pairs:
        pairs[(tokenLeft, tokenRight)] += 1
      else:
        pairs[(tokenLeft, tokenRight)] = 1

    # Pick the most frequent pair (ties: the pair seen first wins).
    (tokenLeft, tokenRight) = max(pairs, key=pairs.get)

    vocab.add(tokenLeft + tokenRight)

    # Replace each occurrence of the pair by the merged token.
    i = 0
    toRemoveIndices = []
    while i < len(corpus) - 1:
      if corpus[i] == tokenLeft and corpus[i + 1] == tokenRight:
        corpus[i] = tokenLeft + tokenRight
        toRemoveIndices.append(i + 1)
        i += 2
      else:
        i += 1

    # Delete the right halves from back to front so the indices stay valid.
    toRemoveIndices.reverse()
    for i in toRemoveIndices:
      del corpus[i]

  return vocab

corpus = "fred fed ted bread and ted fed fred bread"
bytePairEncoding(corpus, 6)
```

Output:

```
{' ',
 'a',
 'b',
 'd',
 'd ',
 'e',
 'ed ',
 'ed b',
 'ed br',
 'ed bre',
 'ed f',
 'f',
 'n',
 'r',
 't'}
```
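
The output above contains merges such as `'ed b'` that span a space, because the function treats the whole corpus as one character sequence. The sketch below is a word-level variant that counts pairs only inside words and appends an end-of-word marker. The names `learn_bpe` and `end_marker` are hypothetical, and the toy corpus (5x `low`, 2x `lowest`, 6x `newer`, 3x `wider`, 2x `new`) is an assumption about the textbook example that should be checked against the figure.

```python
from collections import Counter

def learn_bpe(corpus, num_merges, end_marker="_"):
    """Learn BPE merges word by word; pairs never cross a word boundary."""
    # Each distinct word becomes a tuple of symbols plus the end-of-word marker.
    word_counts = Counter(corpus.split())
    words = {tuple(w) + (end_marker,): c for w, c in word_counts.items()}
    vocab = {symbol for word in words for symbol in word}
    merges = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by the word frequency.
        pairs = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol

        # Most frequent pair; on ties the pair counted first wins.
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        merges.append(merged)
        vocab.add(merged)

        # Rewrite every word with the chosen pair merged into one symbol.
        new_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] = count
        words = new_words

    return merges, vocab

# Assumed toy corpus (see the note above).
toy_corpus = "low " * 5 + "lowest " * 2 + "newer " * 6 + "wider " * 3 + "new " * 2
merges, vocab = learn_bpe(toy_corpus, 8)
print(merges)
```

With this corpus and first-come tie-breaking, the eight merges come out as `er`, `er_`, `ne`, `new`, `lo`, `low`, `newer_`, `low_`. Several steps involve ties (for example `e r` and `r _` both occur 9 times at the start), so a different tie-breaking rule can produce a different but equally valid sequence.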

## Tokenizer

Go through the HuggingFace tutorial on [tokenizers](https://huggingface.co/learn/nlp-course/chapter2/4) and answer:

- **why do we need tokenizers?** <br>
  Tokenizers split a text into tokens and convert them into a format that can be understood and processed by machines.
- **what is the difference between a character-based, word-based and a subword-based tokenizer? What are the advantages and disadvantages of each?** <br>
  Tokenizers can split a text into tokens by various criteria. They can treat every character as a token, split a text into its words, or split a text into subwords. While the first two approaches are self-explanatory, the subword-based tokenizer is a bit more advanced: frequently used words are not split into subwords, but rare words are decomposed into meaningful, more frequent subwords.
  Character-based tokenizers lead to a small vocabulary, because every word can be composed from single characters. But a single vocabulary item carries little semantic meaning, since it is only one character, and the resulting token sequences are long. Word-based tokenizers use whole words as vocabulary items, so each item carries meaning, but the vocabulary becomes very large. Subword tokenizers are a compromise between the two. They typically have a smaller vocabulary than word-based tokenizers. Ideally, unknown words and typos can still be handled by subword tokenizers, since they can be decomposed into known subwords.
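
The three strategies can be compared on a toy sentence. This is a minimal sketch, not how the HuggingFace tokenizers work internally: `greedy_subwords` and `toy_vocab` are hypothetical, the subword vocabulary is hand-picked rather than learned, and the longest-prefix matching is only one simple way to apply such a vocabulary.

```python
def greedy_subwords(word, vocab):
    """Split `word` by repeatedly taking the longest prefix found in `vocab`."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

text = "the tokenizer tokenizes tokens"
toy_vocab = {"the", "token", "izer", "izes", "s"}  # hypothetical, hand-picked

char_tokens = [c for c in text if c != " "]  # character-based
word_tokens = text.split()                   # word-based (whitespace split)
subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, toy_vocab)]

print(len(char_tokens), len(word_tokens), len(subword_tokens))
print(subword_tokens)
# A word that is not in the toy vocabulary is still decomposed into known pieces.
print(greedy_subwords("tokenizing", toy_vocab))
```

The character-based split yields the longest sequence and the word-based split the shortest, with the subword split in between; the last line shows how an unseen word falls back to a known stem plus single characters.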

## NLP Pipeline

Recall the NLP-Pipeline:

```{mermaid}
%%| echo: false
flowchart TD
    A[Data Acquisition] --> B[Preprocessing and Normalization]
    B --> C[Modelling]
    C --> C
    C --> D[Model evaluation]
    D --> |more preprocessing needed| B
    D --> |more/different data needed| A
    D --> E[Added Value]
    E --> |reiterate| A
```

Here we are going to do some preprocessing and normalization steps. Your tasks are:



1. **Data Collection**: Collect a corpus of text data. It is completely up to you.
2. **Data Cleaning**: Clean the collected data by removing any irrelevant information such as HTML tags, URLs, numbers, etc. This step depends on the corpus you chose:
    - count the vocabulary size before and after cleaning
3. **Tokenization**: Apply a tokenizer (e.g. using https://www.nltk.org/):
    - count the vocabulary size before and after tokenization
    - how much time is needed per word on average?
4. **Stopwords Removal**: Identify and remove stopwords from the tokens. Stopwords are common words that do not contribute much to the meaning of a sentence, such as 'the', 'is', 'in', etc.
    - count the vocabulary size before and after stopword removal
5. **Stemming and Lemmatization**: Apply stemming and lemmatization techniques to the tokens and observe the differences. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, usually by stripping affixes with simple rules. Lemmatization, on the other hand, reduces words to their linguistically correct base form (the lemma), using a vocabulary and morphological analysis.
    - count the vocabulary size for the stemmed and lemmatized vocabulary
6. **Compare**: 
    - create a barplot of the results you have collected above (https://plotly.com/python/bar-charts/)

Useful libraries:

- NLTK, Spacy
- plotly or matplotlib for plotting graphs

### 1. Get data

### 2. Data Cleaning
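
A possible starting point for the cleaning step, as a minimal sketch with the standard library only. The helper names `clean_text` and `vocabulary_size`, the regular expressions, and the example string are assumptions; which patterns are worth removing depends on the corpus you chose.

```python
import re

def clean_text(raw):
    """Remove HTML tags, URLs and numbers, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"\d+", " ", text)           # numbers
    return re.sub(r"\s+", " ", text).strip()

def vocabulary_size(text):
    """Number of distinct lowercase whitespace-separated strings."""
    return len(set(text.lower().split()))

# Made-up example input, only for illustration.
raw = "<p>Visit https://example.com for 3 free NLP tips!</p> <p>NLP tips are free.</p>"
cleaned = clean_text(raw)

print(cleaned)
print(vocabulary_size(raw), vocabulary_size(cleaned))
```

Punctuation is still attached to words at this point (`tips!` and `tips` count as different entries), which is one reason the vocabulary size changes again after tokenization.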


### 3. Tokenization
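
The sketch below shows how the vocabulary comparison and the timing could be set up. It uses a simple regular expression as a stand-in tokenizer so that it runs without extra downloads; this is an assumption for illustration, and a library tokenizer such as NLTK's `word_tokenize` will split some cases (for example contractions) differently. The function name `tokenize` and the example text are hypothetical.

```python
import re
import time

def tokenize(text):
    """Stand-in tokenizer: words (with an optional apostrophe part) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower())

text = "Tokenizers split text. Don't they split text?"  # made-up example

start = time.perf_counter()
tokens = tokenize(text)
elapsed = time.perf_counter() - start

vocab_before = set(text.lower().split())  # whitespace split only
vocab_after = set(tokens)
seconds_per_word = elapsed / len(text.split())

print(tokens)
print(len(vocab_before), len(vocab_after))
print(f"{seconds_per_word:.2e} s per word")
```

On a text this short the timing is dominated by measurement noise; for a meaningful average, time the tokenizer on the full corpus.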

### 4. Stopwords
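
Stopword removal is a filter over the token list. In this minimal sketch the stopword set `STOPWORDS` is a tiny hand-written list and the tokens are a made-up example; in practice you would plug in the stopword list of the library you use (NLTK and spaCy both ship one).

```python
# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"the", "is", "in", "a", "of", "and", "to"}

tokens = ["the", "cat", "is", "in", "the", "garden", "and", "the", "dog", "is", "asleep"]
filtered = [t for t in tokens if t not in STOPWORDS]

print(filtered)
print(len(set(tokens)), len(set(filtered)))  # vocabulary size before and after
```

Using a set for the stopwords keeps each membership test at constant time, which matters once the corpus is large.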

### 5. Stemming and Lemmatization
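
To make the difference concrete before using a library, here is a toy illustration. `toy_stem` is a crude suffix stripper and `TOY_LEMMAS` a hand-written lookup table; both are hypothetical and are not the Porter stemmer or the WordNet lemmatizer that NLTK provides. They only mimic the two ideas: rule-based truncation versus dictionary-based mapping to a valid word.

```python
def toy_stem(word):
    """Strip one common suffix by rule; the result need not be a real word."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

# Hypothetical lemma dictionary; a real lemmatizer uses a full vocabulary.
TOY_LEMMAS = {"studies": "study", "studying": "study", "mice": "mouse", "ran": "run"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

tokens = ["studies", "studying", "mice", "ran", "study"]
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print(stems)
print(lemmas)
print(len(set(stems)), len(set(lemmas)))  # vocabulary sizes to compare
```

The stemmer produces the non-word `studi` and leaves irregular forms such as `mice` and `ran` untouched, while the lookup maps them to proper lemmas, so the lemmatized vocabulary ends up smaller in this example.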

### 6. Compare