## 2022/09/05 Lecture 
- Lecturer: 周子軒 (senior student)
- Content: Language Model, Perplexity, Pytorch RNN Architecture (Text Generation)
- Scripts: 
    - https://github.com/ProFatXuanAll/language-model-playground
    - `lmp/script/train_model.py`


In [39]:
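# Character-level trigram model
from collections import Counter, defaultdict


n = 3

corpus = [
    "I like apple.",
    "I like banana."
]

def preproc(s, n):
    # Pad the start with n-1 '#' so the first character has a full context.
    return '#' * (n - 1) + s
corpus = [preproc(x, n) for x in corpus]
# CondFreq[context] counts how often each (n-1)-character context occurs.
CondFreq = defaultdict(int)
def to_ngram(sent, n = 3):
    tokens = []
    # NOTE: range(len(sent) - n) stops one short, so the final n-gram
    # (e.g. 'le.') is skipped; the recorded output below reflects this.
    for i in range(len(sent)-n):
        token = sent[i:i+n]
        tokens.append(token)
        CondFreq[token[:-1]] += 1
    return tokens
corpus = [to_ngram(sent, n) for sent in corpus]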

In [40]:
Freq = Counter([item for sublist in corpus for item in sublist])

In [41]:
def conditional_freq(x):
    # Estimate P(last char | preceding n-1 chars) = count(n-gram) / count(context).
    if x not in Freq: return 0
    return Freq[x]/CondFreq[x[:-1]]

In [45]:
import math

for sentence in corpus:
    print(sentence)
    prob = 1
    for token in sentence:
        prob *= conditional_freq(token)
    # Normalize by the number of predicted characters (one per trigram).
    bpc = - math.log2(prob) / len(sentence)
    ppl = math.pow(2, bpc)
    print(f"the sentence's occurring prob {prob}")
    print(f"bits per char: {bpc:.4f}")
    print(f"perplexity: {ppl:.4f}")

['##I', '#I ', 'I l', ' li', 'lik', 'ike', 'ke ', 'e a', ' ap', 'app', 'ppl', 'ple']
the sentence's occurring prob 0.5
bits per char: 0.0833
perplexity: 1.0595
['##I', '#I ', 'I l', ' li', 'lik', 'ike', 'ke ', 'e b', ' ba', 'ban', 'ana', 'nan', 'ana']
the sentence's occurring prob 0.5
bits per char: 0.0769
perplexity: 1.0548



### Intro
- Byte-Pair Encoding
- Character:
    - 'I': 1 byte
    - '我': 3 bytes in UTF-8 (a variable-length encoding)
- Splitting:
    - Tokenization: splitting a sentence into tokens; the split need not be linguistically meaningful (in English, splitting on spaces works)
    - (Word) Segmentation: tokenization that yields meaningful words (needed e.g. in Chinese)
- Affix/Prefix/Suffix/(Infix)
- https://github.com/ProFatXuanAll/language-model-playground provides only tokenization, not word segmentation (WS).
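
To make the byte counts and the tokenization vs. segmentation distinction above concrete, here is a minimal sketch. It assumes UTF-8 encoding and uses plain `str.split()` as a stand-in whitespace tokenizer; the example sentences are made up, and this is not how `language-model-playground` tokenizes.

```python
# Byte length depends on the encoding; UTF-8 is assumed here.
ascii_len = len("I".encode("utf-8"))   # ASCII characters take 1 byte
cjk_len = len("我".encode("utf-8"))    # this CJK character takes 3 bytes

# Whitespace tokenization works for English because words are space-delimited.
en_tokens = "I like apple.".split()

# Chinese has no spaces, so whitespace splitting returns the whole sentence.
# Character-level splitting is a simple fallback, but it is NOT word segmentation.
zh_sentence = "我喜歡蘋果"
zh_whitespace = zh_sentence.split()
zh_chars = list(zh_sentence)

print(ascii_len, cjk_len)
print(en_tokens)
print(zh_whitespace, zh_chars)
```

Real word segmentation for Chinese needs a dictionary or a trained model, which is outside the scope of this sketch.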

- Long-tail phenomenon (Zipf's Law): cut the tail with min_count, or cap the vocabulary size with max_vocab
- Normalize (lower/upper-case, etc.)
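
A rough sketch of how the long tail can be trimmed when building a vocabulary. `build_vocab` is a hypothetical helper written for these notes, and the exact semantics of `min_count` and `max_vocab` in `language-model-playground` may differ from what is assumed here.

```python
from collections import Counter

def build_vocab(tokens, min_count=1, max_vocab=None):
    """Hypothetical helper: keep tokens seen at least `min_count` times,
    then keep at most `max_vocab` of the most frequent ones."""
    counts = Counter(tokens)
    # most_common() sorts by frequency, highest first.
    kept = [tok for tok, c in counts.most_common() if c >= min_count]
    if max_vocab is not None:
        kept = kept[:max_vocab]
    return kept

# Lower-casing is the normalization step; the sentence is a toy example.
tokens = "The cat and the dog and the bird".lower().split()
vocab_min = build_vocab(tokens, min_count=2)   # drops tokens seen only once
vocab_max = build_vocab(tokens, max_vocab=1)   # keeps only the top token
print(vocab_min, vocab_max)
```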

### Train 
- Approximate the generation probability with a parametric distribution
- $P(x;\theta)$, where $\theta$ denotes the model parameters (e.g. for a Gaussian, $\theta$ is the mean and std)
    - Hyperparameters: the settings we tune ourselves
    - Parameters: the values the model updates through training
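
As a small illustration of the Gaussian case, the sketch below "trains" $\theta = (\mu, \sigma)$ by maximum likelihood on synthetic data. The true mean 5.0, true std 2.0, the sample size, and the seed are arbitrary assumptions, not values from the lecture.

```python
import math
import random
import statistics

random.seed(0)
# Synthetic data drawn from a Gaussian with mean 5.0 and std 2.0 (assumed values).
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]

# "Training": estimate theta = (mu, sigma) from the data.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)  # the population std is the maximum-likelihood estimate

def gaussian_pdf(x, mu, sigma):
    """Density P(x; theta) of a Gaussian with theta = (mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

print(mu, sigma, gaussian_pdf(5.0, mu, sigma))
```

Here the choice of a Gaussian and the sample size play the role of hyperparameters, while `mu` and `sigma` are the parameters fitted from data.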


In [49]:
import random 
classes = {1:2, 2:4, 3:6}
# normalization
norm = {k:v/sum(classes.values()) for k, v in classes.items()}

In [50]:
norm

{1: 0.16666666666666666, 2: 0.3333333333333333, 3: 0.5}

In [53]:
cdf = defaultdict(int)
for k in classes:
    cdf[k] = norm[k]
    # Running sum; reading cdf[k-1] on a defaultdict also creates key 0 with value 0.
    cdf[k] += cdf[k-1]

In [54]:
cdf

defaultdict(int, {1: 0.16666666666666666, 0: 0, 2: 0.5, 3: 1.0})

In [57]:

n_sample = 10000
results = []

for i in range(n_sample):
    rn = random.uniform(0,1)
    # Pick the first class whose cumulative probability exceeds rn.
    for k in sorted(classes):
        if rn < cdf[k]:
            results.append(k)
            break
c = Counter(results)
print(c)
# Inverse transform sampling (inverse CDF):
# the counts show how large each class's interval on [0, 1) is.

Counter({3: 5098, 2: 3224, 1: 1678})
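
The same inverse-CDF idea can be sketched more compactly with the standard library's `itertools.accumulate` and `bisect`. This is an alternative formulation for comparison, not the code from the lecture; `sample`, the seed, and the sample size are assumptions.

```python
import bisect
import itertools
import random
from collections import Counter

weights = {1: 2, 2: 4, 3: 6}
keys = sorted(weights)
total = sum(weights.values())
# The running sum of normalized weights is the CDF, roughly [1/6, 1/2, 1.0].
cdf_values = list(itertools.accumulate(weights[k] / total for k in keys))

def sample(rng):
    """Draw one class by inverting the CDF with binary search."""
    u = rng.random()                          # uniform in [0, 1)
    i = bisect.bisect_right(cdf_values, u)    # first index whose CDF value exceeds u
    return keys[min(i, len(keys) - 1)]        # guard against float round-off

rng = random.Random(0)
counts = Counter(sample(rng) for _ in range(10_000))
print(counts)
```

Binary search makes each draw O(log K) for K classes instead of a linear scan, which matters once K is the size of a vocabulary.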


### Model

Registering a tensor on the model with `register_buffer()` makes PyTorch move that tensor to the correct device together with the model.
```
self.register_buffer(name=..., tensor=torch.tensor(...))
```
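
A minimal sketch of what `register_buffer()` does. `Scaler` and the buffer name `scale` are hypothetical names made up for illustration.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):
    """Hypothetical module, for illustration only."""

    def __init__(self):
        super().__init__()
        # A buffer is stored with the module but is not a trainable parameter.
        self.register_buffer(name="scale", tensor=torch.tensor([2.0]))
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x) * self.scale

model = Scaler()
param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]
in_state_dict = "scale" in model.state_dict()
print(param_names, buffer_names, in_state_dict)
```

Because `scale` is a registered buffer, a later `model.to(device)` moves it along with the parameters, and it is saved in the `state_dict`, yet the optimizer never sees it.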
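Slicing a tensor is very slow (it involves checks such as whether the memory is contiguous), so it is best to operate on the whole tensor at once.
In recurrent sequence models such as LSTM and RNN, each hidden state depends on the previous hidden state, so there is a for loop over time steps that cannot be parallelized. This is why these networks get slower as the sequence gets longer, and why that part cannot be sped up.
Transformers are free of this constraint.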
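
The sequential dependency can be seen in a toy scalar RNN. The weights `w_h`, `w_x` and the inputs below are arbitrary assumptions, and `per_step_independent` is only a stand-in for position-wise computation, not a Transformer.

```python
import math

def rnn_hidden_states(xs, w_h=0.5, w_x=1.0, h0=0.0):
    """Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = h0
    states = []
    for x in xs:  # step t needs h from step t-1, so this loop must run in order
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def per_step_independent(xs, w_x=1.0):
    """Without recurrence each position depends only on its own input,
    so all positions could be computed at the same time."""
    return [math.tanh(w_x * x) for x in xs]

states = rnn_hidden_states([1.0, 0.0, -1.0])
states_changed = rnn_hidden_states([2.0, 0.0, -1.0])
print(states)
print(states_changed)
```

Changing only the first input changes every later hidden state, which is exactly why the time steps cannot be computed independently.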

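Text generation is conceptually the same as classification, e.g. computing similarity scores.
The only difference is that, before the output, the hidden states interact with the embedding matrix once more, and the number of predicted categories equals `vocab_size`.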
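
One way to read this is as a dot-product similarity between each hidden state and every token embedding. The sketch below assumes that reading; the sizes are arbitrary, the embeddings themselves stand in for the RNN output, and the actual model in `language-model-playground` may be wired differently.

```python
import torch
import torch.nn as nn

vocab_size, d_emb = 10, 4              # arbitrary sizes for illustration
emb = nn.Embedding(vocab_size, d_emb)

token_ids = torch.tensor([[1, 2, 3]])  # shape (batch=1, seq_len=3)
hidden = emb(token_ids)                # stand-in for the RNN output, shape (1, 3, 4)

# Similarity score between each hidden state and every token embedding.
logits = hidden @ emb.weight.T         # shape (1, 3, vocab_size)
probs = torch.softmax(logits, dim=-1)  # one distribution over vocab_size classes per position
print(logits.shape, probs.sum(dim=-1))
```

Each position is then an ordinary classification problem with `vocab_size` classes.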

Go read `lmp/script/train_model.py` to see how `get_optimizer` is written!!
The lecturer said the only options are if/else or a dictionary lookup... (sob).
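
A dictionary lookup for choosing an optimizer might look like the sketch below. `OPTIMIZERS` and `pick_optimizer` are hypothetical names invented for these notes; this is not the `get_optimizer` from `lmp/script/train_model.py`.

```python
import torch
import torch.nn as nn

# Hypothetical registry; the keys and the choice of optimizers are assumptions.
OPTIMIZERS = {
    "sgd": torch.optim.SGD,
    "adam": torch.optim.Adam,
    "adamw": torch.optim.AdamW,
}

def pick_optimizer(name, params, lr):
    """Look up the optimizer class by name instead of chaining if/else."""
    try:
        opt_cls = OPTIMIZERS[name.lower()]
    except KeyError:
        raise ValueError(f"unknown optimizer: {name!r}") from None
    return opt_cls(params, lr=lr)

model = nn.Linear(3, 2)
optimizer = pick_optimizer("adam", model.parameters(), lr=1e-3)
print(type(optimizer).__name__)
```

Adding another optimizer then means adding one dictionary entry rather than another branch.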

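weight decay:
$$y = Wx+b$$
$$y + \Delta y = W(x+\Delta x) + b$$
Subtracting the first equation from the second gives $\Delta y = W\Delta x$. We want the magnitude of $\Delta y$ to stay comparable to that of $\Delta x$, so the size of the parameters in $W$ has to be kept under control.
Note that $b$ cancels out of $\Delta y$. Do not apply weight decay to b (the bias vector)!!!
See `lmp/script/train_model.py`.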
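
One common way to exclude biases from weight decay in PyTorch is to use per-parameter groups. The sketch below assumes a toy model, `AdamW`, and a decay of 0.01, all chosen for illustration; it is not taken from `lmp/script/train_model.py`.

```python
import torch
import torch.nn as nn

# Toy model, assumed for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Bias vectors go into the group whose weight_decay is 0.
    if name.endswith("bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
print([group["weight_decay"] for group in optimizer.param_groups])
```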