<a href="https://colab.research.google.com/github/zaaabik/hse/blob/master/application_dl/nlp_hw_1/HW_1_HLP_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text generation

In this first homework assignment we will try our hand at being writers. Your task is to build an algorithm that generates text resembling the writing of a well-known author.

**Assignment**

N-gram model (3 points)
* Preprocess the text with NLTK (or any similar library)
* Split the data into train/test
* Build an n-gram model on the train set
* Generate text and evaluate generation quality on the test set (perplexity)

GPT2 (5 points)
* Split the data into train/test
* Take the pretrained gpt2 tokenizer
* Fine-tune a gpt2-based model on the train set
* Generate text and evaluate generation quality on the test set (perplexity)

Submitting your solution:
* Send a link to the trained model in the format https://huggingface.co/*your_username*/gpt2-arxiv-clm
* Send a GitHub link to the notebook with your solution: https://github.com/*your_username*/hse_application_dl


!! **Most of the code is already written in the seminar notebook Language_modeling_solved.ipynb** !!

### Find a text
The first step in building any machine learning model is finding data. You need to find a corpus of text by a single writer (in English). The size of the source is up to you, but the bigger the better. Ideally, find a book or a set of books in plain-text format.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
%matplotlib inline
from sklearn.model_selection import train_test_split

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!find /content -name "Metamorphosis.txt"

/content/drive/MyDrive/IT/Metamorphosis.txt


In [4]:
file_path = '/content/drive/MyDrive/IT/Metamorphosis.txt'

In [5]:
import shutil

shutil.copy2(file_path, '/content')

'/content/Metamorphosis.txt'

In [6]:
import requests
import nltk

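# Note: read_csv treats the first line of the file as the header row,
# so that line becomes a column name instead of a data row.
df = pd.read_csv(file_path, delimiter='\t')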

In [7]:
df.head()

Unnamed: 0,"One morning, when Gregor Samsa woke from troubled dreams, he found"
0,himself transformed in his bed into a horrible...
1,"armour-like back, and if he lifted his head a ..."
2,"brown belly, slightly domed and divided by arc..."
3,The bedding was hardly able to cover it and se...
4,"any moment. His many legs, pitifully thin comp..."


## N-gram language models
As in the seminar, where we tried to generate the text of scientific papers, we start by preprocessing the text. Depending on the source you use, this means:
* Removing junk from the text
* Dropping stop words
* Applying a tokenizer

In [8]:
rows_string_list = df.apply(lambda row: '\t'.join(row.values.astype(str)), axis=1).tolist()

In [9]:
print(rows_string_list)



In [10]:
from nltk.corpus import stopwords
from nltk import WordPunctTokenizer
import nltk
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
tokenizer = WordPunctTokenizer()

lines = []
for line in tqdm(rows_string_list):
    tokens = tokenizer.tokenize(line.lower())
    clean_tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    clean_line = ' '.join(clean_tokens)
    lines.append(clean_line)


In [11]:
lines[:10]

['transformed bed horrible vermin lay',
 'armour like back lifted head little could see',
 'brown belly slightly domed divided arches stiff sections',
 'bedding hardly able cover seemed ready slide',
 'moment many legs pitifully thin compared size',
 'rest waved helplessly looked',
 'happened thought dream room',
 'proper human room although little small lay peacefully',
 'four familiar walls collection textile samples lay spread',
 'table samsa travelling salesman hung']

### Training
Next, we train an n-gram model to estimate the probabilities of word sequences.

### N-Gram Language Model

A language model is a probabilistic model that estimates text probability: the joint probability of all tokens $w_t$ in text $X$: $P(X) = P(w_1, \dots, w_T)$.

It can do so by following the chain rule:
$$P(w_1, \dots, w_T) = P(w_1)P(w_2 \mid w_1)\dots P(w_T \mid w_1, \dots, w_{T-1}).$$ 

The problem with this approach is that the final term $P(w_T \mid w_1, \dots, w_{T-1})$ depends on all $T-1$ previous words. This probability is impractical to estimate for long texts, e.g. $T = 1000$.

One popular approximation is to assume that the next word depends only on a fixed number of previous words:

$$P(w_t \mid w_1, \dots, w_{t - 1}) = P(w_t \mid w_{t - n + 1}, \dots, w_{t - 1})$$

Such a model is called an __n-gram language model__, where $n$ is a parameter. For example, in a 3-gram language model, each word depends only on the 2 previous words.

$$
    P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_{t - n + 1}, \dots, w_{t - 1}).
$$

You can also sometimes see this approximation called the _(n-1)-th order Markov assumption_.
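To make the truncated context concrete, here is a minimal sketch that lists, for every token of a sentence, the $n - 1$ preceding tokens that a 3-gram model conditions on. The helper name `ngram_contexts` and the example sentence are illustrative assumptions; the padding token mirrors the `_UNK_` convention used in this notebook.

```python
def ngram_contexts(tokens, n, pad="_UNK_"):
    """Return (context, token) pairs; context holds the n - 1 previous tokens."""
    # Pad the start so that the first tokens also have a full-length context
    padded = [pad] * (n - 1) + list(tokens)
    pairs = []
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        pairs.append((context, padded[i]))
    return pairs


pairs = ngram_contexts("gregor woke from troubled dreams".split(), n=3)
for context, token in pairs:
    print(context, "->", token)
```

Each printed pair corresponds to one factor $P(w_t \mid w_{t-2}, w_{t-1})$ in the product above.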

In [12]:
### Here you need to define a class that computes n-gram model probabilities

In [13]:
from tqdm import tqdm
from collections import defaultdict, Counter

UNK, EOS = "_UNK_", "_EOS_"

def count_ngrams(lines, n):
    """
    Count how many times each word occurred after (n - 1) previous words
    :param lines: an iterable of strings with space-separated tokens
    :returns: a dictionary { tuple(prefix_tokens): {next_token_1: count_1, next_token_2: count_2}}

    When building counts, please consider the following two edge cases
    - if prefix is shorter than (n - 1) tokens, it should be padded with UNK. For n=3,
      empty prefix: "" -> (UNK, UNK)
      short prefix: "the" -> (UNK, the)
      long prefix: "the new approach" -> (new, approach)
    - you should add a special token, EOS, at the end of each sequence
      "... with deep neural networks ." -> (..., with, deep, neural, networks, ., EOS)
      count the probability of this token just like all others.
    """
    counts = defaultdict(Counter)
    # counts[(word1, word2)][word3] = how many times word3 occurred after (word1, word2)

    for line in tqdm(lines, desc='N-grams'):
        unk_prefix = ' '.join([UNK] * (n - 1))
        eos_suffix = EOS
        tokens = f'{unk_prefix} {line} {eos_suffix}'.split()
        for i in range(n - 1, len(tokens)):
            n_gram = tuple(tokens[i - n + 1: i])
            counts[n_gram].update([tokens[i]])

    
    return counts

In [14]:
# let's test it
dummy_lines = sorted(lines, key=len)[:100]
dummy_counts = count_ngrams(dummy_lines, n=3)



In [15]:
dummy_counts[('_UNK_', 'a')]

Counter()

Once we can count N-grams, we can build a probabilistic language model.
The simplest way to compute probabilities is in proportion to counts:

$$ P(w_t | prefix) = { Count(prefix, w_t) \over \sum_{\hat w} Count(prefix, \hat w) } $$
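As a quick illustration of the formula, the sketch below turns a small table of counts for a single prefix into probabilities. The counts are made up for the example and do not come from the corpus.

```python
from collections import Counter

# Hypothetical counts of the tokens observed after the prefix ("gregor",)
next_token_counts = Counter({"samsa": 3, "father": 1})

# Divide each count by the total number of continuations of this prefix
total = sum(next_token_counts.values())
probs = {token: count / total for token, count in next_token_counts.items()}
print(probs)
```

Any token that never followed the prefix gets probability zero under this estimate, which is why perplexity on unseen text needs a floor on the log-probability.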

In [16]:
class NGramLanguageModel:    
    def __init__(self, lines, n):
        """ 
        Train a simple count-based language model: 
        compute probabilities P(w_t | prefix) given ngram counts
        
        :param n: computes probability of next token given (n - 1) previous words
        :param lines: an iterable of strings with space-separated tokens
        """
        assert n >= 1
        self.n = n
    
        counts = count_ngrams(lines, self.n)
        
        # compute token probabilities given counts
        self.probs = defaultdict(Counter)
        # probs[(word1, word2)][word3] = P(word3 | word1, word2)
        
        # populate self.probs with actual probabilities
        for prefix, next_counts in tqdm(counts.items()):
            total = sum(next_counts.values())
            for word, count in next_counts.items():
                self.probs[prefix][word] = count / total
            
    def get_possible_next_tokens(self, prefix):
        """
        :param prefix: string with space-separated prefix tokens
        :returns: a dictionary {token : its probability} for all tokens with positive probabilities
        """
        prefix = prefix.split()
        prefix = prefix[max(0, len(prefix) - self.n + 1):]
        prefix = [ UNK ] * (self.n - 1 - len(prefix)) + prefix
        return self.probs[tuple(prefix)]
    
    def get_next_token_prob(self, prefix, next_token):
        """
        :param prefix: string with space-separated prefix tokens
        :param next_token: the next token to predict probability for
        :returns: P(next_token|prefix) a single number, 0 <= P <= 1
        """
        return self.get_possible_next_tokens(prefix).get(next_token, 0)

In [17]:
lm = NGramLanguageModel(lines, n=3)



The process of generating sequences is... well, it's sequential. You maintain a list of tokens and iteratively add next token by sampling with probabilities.

$ X = [] $

__forever:__
* $w_{next} \sim P(w_{next} | X)$
* $X = concat(X, w_{next})$


Instead of sampling with probabilities, one can also try always taking the most likely token, sampling among the top-K most likely tokens, or sampling with temperature. In the latter case (temperature), one samples from

$$w_{next} \sim {P(w_{next} | X) ^ {1 / \tau} \over \sum_{\hat w} P(\hat w | X) ^ {1 / \tau}}$$

where $\tau > 0$ is the model temperature. If $\tau \ll 1$, more likely tokens will be sampled with even higher probability, while less likely tokens will all but vanish.
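The effect of temperature is easy to see on a toy distribution. The following is a minimal sketch; the function name `apply_temperature` and the three-token distribution are assumptions made for the example.

```python
def apply_temperature(probs, temperature):
    """Rescale a {token: probability} dict as p ** (1 / temperature), then renormalise."""
    scaled = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    norm = sum(scaled.values())
    return {token: value / norm for token, value in scaled.items()}


base = {"door": 0.5, "room": 0.3, "father": 0.2}
sharp = apply_temperature(base, temperature=0.5)  # exponent 2: favours likely tokens
flat = apply_temperature(base, temperature=2.0)   # exponent 0.5: closer to uniform
print(sharp)
print(flat)
```

With a low temperature the most likely token gains probability mass, and with a high temperature the distribution moves toward uniform.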

In [18]:
### Training the n-gram model

In [19]:
def get_next_token(lm, prefix, temperature=1.0):
    """
    return next token after prefix;
    :param temperature: samples proportionally to lm probabilities ^ (1 / temperature)
        if temperature == 0, always takes most likely token. Break ties arbitrarily.
    """
    next_tokens = lm.get_possible_next_tokens(prefix)
    if temperature == 0:
        # greedy decoding: pick the single most likely token
        next_token = max(next_tokens, key=next_tokens.get)
    else:
        sum_probs = sum([
            prob ** (1 / temperature) for prob in next_tokens.values()
        ])

        next_tokens = {
            token: prob ** (1 / temperature) / sum_probs
            for token, prob in next_tokens.items()
        }
        tokens = list(next_tokens.keys())
        probs = list(next_tokens.values())
        next_token = np.random.choice(tokens, 1, p=probs)[0]
    return next_token

## Computing perplexity

### Evaluating language models: perplexity

Perplexity measures how well your model approximates the true probability distribution behind the data. __Smaller perplexity = better model__.

To compute perplexity on one sentence, use:
$$
    {\mathbb{P}}(w_1 \dots w_N) = P(w_1, \dots, w_N)^{-\frac1N} = \left( \prod_t P(w_t \mid w_{t - n + 1}, \dots, w_{t - 1})\right)^{-\frac1N}.
$$


On the corpus level, perplexity is the product of the probabilities of all tokens in all sentences, raised to the power of $-1/N$, where $N$ is now the __total length of all sentences__ in the corpus.

This number can quickly get too small for float32/float64 precision, so we recommend you to first compute log-perplexity (from log-probabilities) and then take the exponent.
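The underflow problem and the log-space workaround can be sketched as follows. The per-token probabilities are invented for the illustration, and `perplexity_value` is just a local variable name chosen for the example.

```python
import math

# Hypothetical per-token probabilities assigned by some model to a long text
token_probs = [0.01] * 1000

# The naive product is far below the smallest positive float64 and underflows to zero
naive_product = math.prod(token_probs)

# Summing log-probabilities stays comfortably within floating-point range
log_prob_sum = sum(math.log(p) for p in token_probs)
perplexity_value = math.exp(-log_prob_sum / len(token_probs))

print(naive_product)
print(perplexity_value)
```

Because every token here has probability 0.01, the perplexity comes out at about 100, the inverse of the per-token probability.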

In [20]:
### Example from the seminar
import numpy as np
def perplexity(lm, lines, min_logprob=np.log(10 ** -7.)):
    """
    :param lines: a list of strings with space-separated tokens
    :param min_logprob: if log(P(w | ...)) is smaller than min_logprob, set it equal to min_logprob
    :returns: corpora-level perplexity - a single scalar number from the formula above
    
    Note: do not forget to compute P(w_first | empty) and P(eos | full_sequence)
    
    PLEASE USE lm.get_next_token_prob and NOT lm.get_possible_next_tokens
    """
    total_length = 0
    log_pp = 0

    for line in tqdm(lines):
        tokens = [''] + line.split(' ') + [EOS]

        for t in range(1, len(tokens)):
            prefix = ' '.join(tokens[:t])
            log_pp += max(
                min_logprob, np.log(lm.get_next_token_prob(prefix, tokens[t]))
            )
            total_length += 1
    
    return np.exp(-( 1 / total_length) * log_pp)

In [21]:
from sklearn.model_selection import train_test_split
train_lines, test_lines = train_test_split(lines, test_size=0.25, random_state=42)

n = 2
lm = NGramLanguageModel(n=n, lines=train_lines)

ppx = perplexity(lm, test_lines)
print("N = %i, Perplexity = %.5f" % (n, ppx))


N = 2, Perplexity = 308871.32018





### Text generation example

In [22]:
prefix = 'The long story short' # Come up with the first few words of your story

for i in range(100):
    prefix += ' ' + get_next_token(lm, prefix, temperature=0.5)
    if prefix.endswith(EOS) or len(lm.get_possible_next_tokens(prefix)) == 0:
        break
        
print(prefix)

The long story short breath flowed face door gregor father _EOS_


The text should resemble something meaningful.

## Generation with neural networks

As in the seminar, we will use the transformers library and the pretrained gpt2 model.

In [23]:
!pip install datasets transformers


At the end you need to push the model to the Hugging Face Hub; for this you need to obtain an access token:
https://huggingface.co/docs/hub/security-tokens

In [24]:
# The token is read from the HF_TOKEN environment variable (an assumed name); never paste a real token into a shared notebook
!huggingface-cli login --token $HF_TOKEN

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [25]:
# Load the data once more without any preprocessing, since we will use a ready-made tokenizer

In [26]:
clm_model_checkpoint = "gpt2"
clm_tokenizer_checkpoint = "gpt2"

from transformers import GPT2Tokenizer, AutoModelForCausalLM
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('zaaabik/gpt2-arxiv-clm')


The dataset can be left without any preprocessing, because later on we will use the pretrained tokenizer
``` python
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
```

In [27]:
train, valid = train_test_split(lines, test_size=0.2)
lm_datasets = {'train' : train, 'valid' : valid}

In [28]:
!pip install datasets



In [29]:
from datasets import Dataset

my_dict = {"text": lines}
datasets = Dataset.from_dict(my_dict)
tr_test_datasets = datasets.train_test_split(test_size=0.1)

In [30]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

In [31]:
tokenized_datasets = tr_test_datasets.map(tokenize_function, 
                                          batched=True, num_proc=4, 
                                          remove_columns=["text"])


In [32]:
block_size = 128

In [33]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
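To see what the grouping step does, here is a toy sketch on made-up token ids with a block size of 4. `chunk_ids` is a hypothetical, simplified helper written for this illustration; it is not the notebook's `group_texts` and it handles only a single list of id sequences.

```python
def chunk_ids(list_of_id_lists, block_size):
    """Concatenate token id lists and cut them into equal blocks, dropping the remainder."""
    flat = [tok for ids in list_of_id_lists for tok in ids]
    # Keep only as many tokens as fit into whole blocks
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


blocks = chunk_ids([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], block_size=4)
print(blocks)
```

Note that blocks cross the boundaries of the original lines, and the last two ids are discarded because they do not fill a whole block.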

In [34]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)    


In [36]:
from transformers import Trainer, TrainingArguments

In [37]:
training_args = TrainingArguments(
    "gpt2-author-clm",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True
)



In [38]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets['train'],
    eval_dataset=lm_datasets['test'],
)

Cloning https://huggingface.co/IvyPo/gpt2-author-clm into local empty directory.



In [39]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 7.724539 |
| 2 | No log | 7.513474 |
| 3 | No log | 7.453012 |


TrainOutput(global_step=30, training_loss=7.717611694335938, metrics={'train_runtime': 54.3293, 'train_samples_per_second': 4.418, 'train_steps_per_second': 0.552, 'total_flos': 15677521920000.0, 'train_loss': 7.717611694335938, 'epoch': 3.0})

In [40]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 1725.05
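The perplexity reported by the evaluation cell is simply the exponential of the mean per-token cross-entropy loss (measured in nats). A small check, assuming the final evaluation loss equals the validation loss logged after epoch 3:

```python
import math

# Validation loss after epoch 3, copied from the training log in this notebook
eval_loss = 7.453012

# Perplexity is exp of the average negative log-likelihood per token
ppl = math.exp(eval_loss)
print(f"Perplexity: {ppl:.2f}")
```

The result should land close to the value printed by the evaluation cell, since both come from the same loss.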


### Generating text with the trained model

In [41]:
!pip install huggingface_hub



In [42]:
# put your Hugging Face username here
nick_name = 'IvyPo'

In [43]:
tokenizer.push_to_hub(
    'gpt2-author-clm_3'
)

CommitInfo(commit_url='https://huggingface.co/IvyPo/gpt2-author-clm_3/commit/1eaeaf20f6fc8d9c8d81be29c6227a0ce4b9c51e', commit_message='Upload tokenizer', commit_description='', oid='1eaeaf20f6fc8d9c8d81be29c6227a0ce4b9c51e', pr_url=None, pr_revision=None, pr_num=None)

In [44]:
model.push_to_hub("gpt2-author-clm_3")


CommitInfo(commit_url='https://huggingface.co/IvyPo/gpt2-author-clm_3/commit/76c983170e5ca661730172b1d1e274604ce95e0b', commit_message='Upload model', commit_description='', oid='76c983170e5ca661730172b1d1e274604ce95e0b', pr_url=None, pr_revision=None, pr_num=None)

In [45]:
from transformers import pipeline
generator = pipeline(
    'text-generation', 
    model = f'{nick_name}/gpt2-author-clm_3',
    tokenizer = tokenizer
)


In [46]:
generator('The long story short')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The long story short : in this paper, we investigate whether cnn could be used to enhance object recognition systems in real games rather than simply in the novel ; we consider the problems arising from such game models, and give examples of the many applications of'}]

After completing the homework, send the notebook to me on Telegram: @zaaabik

Also send a link to the model on Hugging Face there, in the format: \*your_username\*/gpt2-arxiv-clm