

## 1. Tokenization

### What is tokenization?
Tokenization is the process of splitting text into smaller units (tokens). The text is first split into sentences, and each sentence is then split into individual words.

In your code:
```python
sentences = sent_tokenize(text)  # Split the text into sentences
words = word_tokenize(sentence)  # Split a sentence into words
```

It is like cutting a long string of characters into logical pieces that a computer can process.
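
To make the two-level split concrete, here is a minimal regex-based sketch. It is not how NLTK's `sent_tokenize` or `word_tokenize` work internally; the helper names `naive_sent_tokenize` and `naive_word_tokenize` and the sample sentence are assumptions made up for illustration.

```python
import re

def naive_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def naive_word_tokenize(sentence):
    # A token is a word (optionally with inner hyphens/apostrophes) or one punctuation mark.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

sample = "Tokenization splits text. Each sentence is then split into words!"
sentences = naive_sent_tokenize(sample)
tokens = [naive_word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

A rule this simple misses a sentence boundary whenever the whitespace after the period is absent, which is one reason to prefer a trained tokenizer for real text.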

### Token normalization

Normalization reduces words to a base form. There are two main approaches:

1. **Stemming** strips word endings with simple rules ("running" → "run"); the result is not always a dictionary word
2. **Lemmatization** is a more careful approach that maps a word to its dictionary form ("running" → "run", "better" → "good"), which usually requires knowing the word's part of speech

The code uses both methods:
```python
stemmer.stem(word)  # Stemming
lemmatizer.lemmatize(word)  # Lemmatization
```

Lemmatization usually gives cleaner results, so it is the one used in the end.
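
The difference between the two approaches can be sketched with a toy example. This is only an illustration: the suffix list and the lookup table are hypothetical, real stemmers such as Porter apply many ordered rules, and real lemmatizers consult a full dictionary such as WordNet.

```python
# Toy illustration only; SUFFIXES and LEMMA_TABLE are hypothetical.
SUFFIXES = ("ing", "ed", "s")
LEMMA_TABLE = {"better": "good", "running": "run"}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up in a dictionary of known forms; fall back to the word itself.
    return LEMMA_TABLE.get(word, word)

for w in ["running", "better", "tasks"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The toy stemmer turns "running" into the non-word "runn" (a real stemmer has an extra rule for doubled consonants) and cannot relate "better" to "good", while the lookup handles both but only for words it knows about.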

### Building the vocabulary

Once the text has been tokenized and normalized, a vocabulary is built that counts how many times each word occurs in the text:

```python
vocabulary = {}
for sentence in lemmatized_text:
    for word in sentence:
        vocabulary[word] = vocabulary.get(word, 0) + 1
```

The resulting vocabulary looks roughly like this: `{'word1': 10, 'word2': 5, ...}`
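
The same counting can also be written with `collections.Counter` from the standard library. The sketch below uses a small made-up stand-in (`lemmatized_sample`) for the notebook's `lemmatized_text`.

```python
from collections import Counter

# Hypothetical stand-in for lemmatized_text: a list of token lists.
lemmatized_sample = [
    ["machine", "learning", "is", "a", "field"],
    ["machine", "learning", "model", "learn", "from", "data"],
]

vocabulary_counts = Counter()
for sentence in lemmatized_sample:
    # update() adds one count for every token in the sentence.
    vocabulary_counts.update(sentence)

print(vocabulary_counts.most_common(3))
```

`Counter` behaves like a dictionary, so lookups such as `vocabulary_counts["machine"]` work the same way, and `most_common` replaces the manual sort by frequency.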

## 2. Byte-Pair Encoding (BPE)

### How BPE works
BPE originated as a data compression method, but in NLP it is used to build subword tokens. In essence, the algorithm repeatedly finds the most frequent pair of adjacent symbols and merges it.

In simple terms:
1. Start with individual characters ("h", "e", "l", "l", "o")
2. Find the most frequent pair of adjacent symbols (for example, "l" and "l")
3. Merge them into a new token ("ll")
4. Repeat until the desired number of merges is reached

In the code this is implemented by the functions:
```python
get_stats(vocab)  # Counts how often each pair of adjacent symbols occurs
merge_vocab(pair, vocab)  # Merges the most frequent pair
```
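
A single pass through steps 1 to 3 can be sketched as follows. This is a simplified illustration, not the notebook's implementation: words are stored as tuples of symbols without an end-of-word marker, and the corpus and the names `count_pairs` and `merge_pair` are assumptions.

```python
from collections import Counter

def count_pairs(words):
    # words maps a tuple of symbols to its frequency in the corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, words):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical toy corpus, each word seen once.
words = {tuple("hello"): 1, tuple("ball"): 1, tuple("tell"): 1}
pairs = count_pairs(words)
best = max(pairs, key=pairs.get)   # ('l', 'l') occurs three times
words = merge_pair(best, words)
print(best)
print(words)
```

Repeating the last three lines in a loop gives step 4: each iteration adds one new symbol and one entry to the list of merge rules.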

### Applying BPE

Once the vocabulary and its merge rules have been built, they can be used to tokenize new text:

```python
apply_bpe(word, merges)  # Applies the learned BPE merges to a word
```

The result is text split into efficient subword units rather than just whole words or single characters.

## Key points of the assignment:

1. **Flexible tokenization**: BPE can handle unknown words by splitting them into known parts
2. **Efficient encoding**: Frequent character sequences get their own tokens
3. **Practical use**: This technique underlies tokenization in modern language models (such as GPT)
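
The first point can be illustrated with a short sketch that encodes words never seen during training. The merge list `MERGES` is hand-written for illustration (in the spirit of a tiny "low lower lowest" corpus), and `encode_word` is a hypothetical helper that simply replays the merges in the order they were learned.

```python
# Hypothetical merge rules, listed in the order they were learned.
MERGES = [("l", "o"), ("lo", "w"), ("low", "</w>")]

def encode_word(word, merges):
    # Start from characters plus an end-of-word marker, then replay each merge.
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(encode_word("slow", MERGES))   # unseen word built from known pieces
print(encode_word("lowly", MERGES))  # unseen word that falls back to characters
```

"slow" becomes `['s', 'low</w>']` and "lowly" becomes `['low', 'l', 'y', '</w>']`: the known piece "low" is reused, and the rest falls back to single characters. A character that never appeared in the training text (such as "y" here) still has no ID, so a practical tokenizer needs a base alphabet that covers it.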

In [None]:
with open("./data/ml_text.txt") as f:
    text = f.read()
text[0:100]

'Machine learning  is a field of study in artificial intelligence concerned with the development and '

## Sentence tokenization

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/olga/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:")
for idx, sentence in enumerate(sentences[0:10]):
    print(f"{idx+1}: {sentence}")

Sentence Tokenization:
1: Machine learning  is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.
2: Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine.
3: The application of ML to business problems is known as predictive analytics.Statistics and mathematical optimization  methods comprise the foundations of machine learning.
4: Data mining is a related field of study, focusing on exploratory data analysis  via unsupervised learning.
5: From a theoretical viewpoint, probably approximately correct learnin

In [None]:
print("\nWord Tokenization:")
tokenized_text = []
for idx, sentence in enumerate(sentences):
    words = word_tokenize(sentence)
    tokenized_text.append(words)
    print(f"Sentence {idx+1} Words: {words}")


Word Tokenization:
Sentence 1 Words: ['Machine', 'learning', 'is', 'a', 'field', 'of', 'study', 'in', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'development', 'and', 'study', 'of', 'statistical', 'algorithms', 'that', 'can', 'learn', 'from', 'data', 'and', 'generalize', 'to', 'unseen', 'data', ',', 'and', 'thus', 'perform', 'tasks', 'without', 'explicit', 'instructions', '.']
Sentence 2 Words: ['Within', 'a', 'subdiscipline', 'in', 'machine', 'learning', ',', 'advances', 'in', 'the', 'field', 'of', 'deep', 'learning', 'have', 'allowed', 'neural', 'networks', ',', 'a', 'class', 'of', 'statistical', 'algorithms', ',', 'to', 'surpass', 'many', 'previous', 'machine', 'learning', 'approaches', 'in', 'performance.ML', 'finds', 'application', 'in', 'many', 'fields', ',', 'including', 'natural', 'language', 'processing', ',', 'computer', 'vision', ',', 'speech', 'recognition', ',', 'email', 'filtering', ',', 'agriculture', ',', 'and', 'medicine', '.']
Sentence 3 Words: ['The',

In [None]:
tokenized_text

[['Machine',
  'learning',
  'is',
  'a',
  'field',
  'of',
  'study',
  'in',
  'artificial',
  'intelligence',
  'concerned',
  'with',
  'the',
  'development',
  'and',
  'study',
  'of',
  'statistical',
  'algorithms',
  'that',
  'can',
  'learn',
  'from',
  'data',
  'and',
  'generalize',
  'to',
  'unseen',
  'data',
  ',',
  'and',
  'thus',
  'perform',
  'tasks',
  'without',
  'explicit',
  'instructions',
  '.'],
 ['Within',
  'a',
  'subdiscipline',
  'in',
  'machine',
  'learning',
  ',',
  'advances',
  'in',
  'the',
  'field',
  'of',
  'deep',
  'learning',
  'have',
  'allowed',
  'neural',
  'networks',
  ',',
  'a',
  'class',
  'of',
  'statistical',
  'algorithms',
  ',',
  'to',
  'surpass',
  'many',
  'previous',
  'machine',
  'learning',
  'approaches',
  'in',
  'performance.ML',
  'finds',
  'application',
  'in',
  'many',
  'fields',
  ',',
  'including',
  'natural',
  'language',
  'processing',
  ',',
  'computer',
  'vision',
  ',',
  'speech',

## Text normalization

- Lemmatization maps a word to its dictionary form (lemma), so that different inflected forms are recognized as the same word
- Stemming is a simpler alternative that mainly strips suffixes from the end of the word


In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /Users/olga/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/olga/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/olga/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [None]:
words = tokenized_text[0]
# Perform stemming
print("Stemming:")
for word in words:
    print(f"{word} -> {stemmer.stem(word)}")

Stemming:
Machine -> machin
learning -> learn
is -> is
a -> a
field -> field
of -> of
study -> studi
in -> in
artificial -> artifici
intelligence -> intellig
concerned -> concern
with -> with
the -> the
development -> develop
and -> and
study -> studi
of -> of
statistical -> statist
algorithms -> algorithm
that -> that
can -> can
learn -> learn
from -> from
data -> data
and -> and
generalize -> gener
to -> to
unseen -> unseen
data -> data
, -> ,
and -> and
thus -> thu
perform -> perform
tasks -> task
without -> without
explicit -> explicit
instructions -> instruct
. -> .


In [None]:
# Perform lemmatization (default POS is noun)
print("\nLemmatization:")
for word in words:
    print(f"{word} -> {lemmatizer.lemmatize(word)}")


Lemmatization:
Machine -> Machine
learning -> learning
is -> is
a -> a
field -> field
of -> of
study -> study
in -> in
artificial -> artificial
intelligence -> intelligence
concerned -> concerned
with -> with
the -> the
development -> development
and -> and
study -> study
of -> of
statistical -> statistical
algorithms -> algorithm
that -> that
can -> can
learn -> learn
from -> from
data -> data
and -> and
generalize -> generalize
to -> to
unseen -> unseen
data -> data
, -> ,
and -> and
thus -> thus
perform -> perform
tasks -> task
without -> without
explicit -> explicit
instructions -> instruction
. -> .


In [None]:
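# Build the lemmatized text: lowercase each token, then lemmatize it.
# With the default noun POS, 'has' and 'was' come out as 'ha' and 'wa' (see the vocabulary counts).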
lemmatized_text = []
for sentence in tokenized_text:
    lemmatized_text.append([lemmatizer.lemmatize(word.lower()) for word in sentence])

lemmatized_text


[['machine',
  'learning',
  'is',
  'a',
  'field',
  'of',
  'study',
  'in',
  'artificial',
  'intelligence',
  'concerned',
  'with',
  'the',
  'development',
  'and',
  'study',
  'of',
  'statistical',
  'algorithm',
  'that',
  'can',
  'learn',
  'from',
  'data',
  'and',
  'generalize',
  'to',
  'unseen',
  'data',
  ',',
  'and',
  'thus',
  'perform',
  'task',
  'without',
  'explicit',
  'instruction',
  '.'],
 ['within',
  'a',
  'subdiscipline',
  'in',
  'machine',
  'learning',
  ',',
  'advance',
  'in',
  'the',
  'field',
  'of',
  'deep',
  'learning',
  'have',
  'allowed',
  'neural',
  'network',
  ',',
  'a',
  'class',
  'of',
  'statistical',
  'algorithm',
  ',',
  'to',
  'surpass',
  'many',
  'previous',
  'machine',
  'learning',
  'approach',
  'in',
  'performance.ml',
  'find',
  'application',
  'in',
  'many',
  'field',
  ',',
  'including',
  'natural',
  'language',
  'processing',
  ',',
  'computer',
  'vision',
  ',',
  'speech',
  'recogn

In [None]:
import nltk
# Requires the 'stopwords' corpus; run nltk.download('stopwords') once beforehand.
stopwords = nltk.corpus.stopwords.words('english')
stopwords

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

## Build vocabulary

In [None]:
vocabulary = {}
for sentence in lemmatized_text:
    for word in sentence:
        vocabulary[word] = vocabulary.get(word, 0) + 1

{k: v for k,v in sorted(vocabulary.items(), key=lambda x: x[1], reverse=True)}

{',': 445,
 'the': 364,
 'a': 313,
 '.': 260,
 'of': 256,
 'to': 223,
 'and': 201,
 'learning': 190,
 'in': 173,
 'is': 119,
 'machine': 115,
 'that': 103,
 'data': 95,
 'model': 78,
 'for': 73,
 'algorithm': 66,
 'by': 65,
 'are': 59,
 'it': 58,
 'on': 56,
 'can': 55,
 'with': 52,
 'from': 52,
 'an': 52,
 'or': 52,
 'be': 45,
 'training': 45,
 '``': 41,
 "''": 37,
 'system': 36,
 'method': 34,
 'used': 33,
 'example': 33,
 'set': 32,
 'artificial': 28,
 'have': 28,
 'network': 28,
 'not': 26,
 'feature': 25,
 'such': 25,
 'input': 25,
 'field': 24,
 'this': 24,
 "'s": 24,
 'been': 23,
 'decision': 22,
 'ha': 22,
 'which': 21,
 'ai': 21,
 'neural': 20,
 'also': 20,
 'these': 20,
 'their': 20,
 'wa': 19,
 'approach': 18,
 'other': 18,
 'into': 18,
 'between': 18,
 'theory': 18,
 'classification': 17,
 'prediction': 17,
 'technique': 17,
 'function': 17,
 'process': 16,
 'but': 16,
 'output': 16,
 'bias': 16,
 'task': 15,
 'many': 15,
 'problem': 15,
 'one': 15,
 'image': 15,
 'variable'

In [None]:
vocabulary["machine"]

115

In [None]:
len(vocabulary)

1987
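
The most frequent entries are punctuation marks and function words. A rough sketch of filtering them out is shown below; `counts` is a short excerpt of the frequencies listed above, and `stop_set` is a tiny hand-picked set used here as an assumption in place of the full NLTK stopword list.

```python
import string

# Excerpt of the vocabulary counts shown above.
counts = {",": 445, "the": 364, "a": 313, ".": 260,
          "learning": 190, "machine": 115, "data": 95}
# Hypothetical mini stopword set; the NLTK list could be used instead.
stop_set = {"the", "a", "of", "to", "and", "in", "is"}

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark.
    return all(ch in string.punctuation for ch in token)

content_counts = {
    token: n
    for token, n in counts.items()
    if token not in stop_set and not is_punctuation(token)
}
print(content_counts)
```

Whether to drop stopwords depends on the downstream task; for subword methods such as BPE they are usually kept.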

## Byte-pair encoding

- Start with a character-level vocabulary.

- Count all pairs of adjacent symbols and find the most frequent one (like 't' and 'h' in 'the').

- Merge the most frequent pair into a new symbol ('th').

- Repeat the merge steps for a fixed number of iterations or until no more merges are possible.

In [None]:
from collections import defaultdict, Counter
from pprint import pprint

text = "low lower lowest"

In [None]:
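import re

def get_stats(vocab):
    # Count each pair of adjacent symbols, weighted by word frequency
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    new_vocab = {}
    # Match the pair only as whole symbols: a plain str.replace would also
    # merge ('a', 'b') inside 'ca b', where the symbols are 'ca' and 'b'
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    for word in vocab:
        # A function replacement keeps any backslashes in symbols literal
        new_word = pattern.sub(lambda m: replacement, word)
        new_vocab[new_word] = vocab[word]
    return new_vocab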
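
When symbols are stored as a space-joined string, a merge has to respect symbol boundaries. The sketch below contrasts a plain substring replacement with a boundary-aware regular expression; the symbol sequence and the helper name `merge_symbols` are made up for illustration.

```python
import re

def merge_symbols(pair, word):
    # Match the pair only when it is delimited by whitespace or the string edges.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return pattern.sub(lambda m: "".join(pair), word)

word = "lo w low </w>"   # hypothetical symbols: 'lo', 'w', 'low', '</w>'
pair = ("w", "lo")       # this pair is NOT adjacent: 'w' is followed by 'low'

naive = word.replace(" ".join(pair), "".join(pair))
safe = merge_symbols(pair, word)
print(repr(naive))
print(repr(safe))
```

The naive replacement produces `'lo wlow </w>'`, gluing "w" onto the first part of "low", while the boundary-aware version leaves the string unchanged. A pair that really is adjacent, such as `("lo", "w")`, is still merged as expected.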

In [None]:
def byte_pair_encoding(text, num_merges=10):
    words = text.split()
    vocab = {}
    for word in words:
        chars = ' '.join(list(word)) + ' </w>'
        vocab[chars] = vocab.get(chars, 0) + 1

    print("Initial vocabulary:")
    pprint(vocab)

    symbols = set()

    for word in vocab:
        symbols.update(word.split())

    print("Symbols set:", symbols)

    merges = []
    for _ in range(num_merges):
        print("-" * 30)
        pairs = get_stats(vocab)
        print("Pairs:")
        pprint(pairs)
        if not pairs:
            break
        best_pair = max(pairs, key=pairs.get)
        print("Best pair:", best_pair)
        merges.append(best_pair)
        vocab = merge_vocab(best_pair, vocab)
        print("Updated vocabulary:")
        pprint(vocab)
        new_token = ''.join(best_pair)
        symbols.add(new_token)

    # Build final vocabulary: token -> id
    final_vocab = {token: idx for idx, token in enumerate(sorted(symbols))}
    return merges, final_vocab

In [None]:
text = "low lower lowest"
merges, vocab = byte_pair_encoding(text, num_merges=10)

Initial vocabulary:
{'l o w </w>': 1, 'l o w e r </w>': 1, 'l o w e s t </w>': 1}
Symbols set: {'w', 't', 'l', 'o', 'r', '</w>', 's', 'e'}
------------------------------
Pairs:
defaultdict(<class 'int'>,
            {('e', 'r'): 1,
             ('e', 's'): 1,
             ('l', 'o'): 3,
             ('o', 'w'): 3,
             ('r', '</w>'): 1,
             ('s', 't'): 1,
             ('t', '</w>'): 1,
             ('w', '</w>'): 1,
             ('w', 'e'): 2})
Best pair: ('l', 'o')
Updated vocabulary:
{'lo w </w>': 1, 'lo w e r </w>': 1, 'lo w e s t </w>': 1}
------------------------------
Pairs:
defaultdict(<class 'int'>,
            {('e', 'r'): 1,
             ('e', 's'): 1,
             ('lo', 'w'): 3,
             ('r', '</w>'): 1,
             ('s', 't'): 1,
             ('t', '</w>'): 1,
             ('w', '</w>'): 1,
             ('w', 'e'): 2})
Best pair: ('lo', 'w')
Updated vocabulary:
{'low </w>': 1, 'low e r </w>': 1, 'low e s t </w>': 1}
------------------------------
Pai

In [None]:
vocab

{'</w>': 0,
 'e': 1,
 'l': 2,
 'lo': 3,
 'low': 4,
 'low</w>': 5,
 'lowe': 6,
 'lower': 7,
 'lower</w>': 8,
 'lowes': 9,
 'lowest': 10,
 'lowest</w>': 11,
 'o': 12,
 'r': 13,
 's': 14,
 't': 15,
 'w': 16}

In [None]:
# Apply the learned BPE merges to a single word
def apply_bpe(word, merges):
    # Rank each merge by the order in which it was learned (lower = earlier)
    pair_ranks = {pair: idx for idx, pair in enumerate(merges)}
    word = list(word) + ['</w>']
    while True:
        pairs = [(word[i], word[i+1]) for i in range(len(word)-1)]
        ranked = [(pair, pair_ranks[pair]) for pair in pairs if pair in pair_ranks]
        if not ranked:
            break
        # Merge the earliest-learned pair that is present in the word
        best_pair = min(ranked, key=lambda x: x[1])[0]
        new_word = []
        i = 0
        while i < len(word):
            if i < len(word)-1 and (word[i], word[i+1]) == best_pair:
                new_word.append(word[i] + word[i+1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = new_word
    return word

In [None]:
print("Vocabulary (token -> id):")
for token, idx in vocab.items():
    print(f"{token}: {idx}")

Vocabulary (token -> id):
</w>: 0
e: 1
l: 2
lo: 3
low: 4
low</w>: 5
lowe: 6
lower: 7
lower</w>: 8
lowes: 9
lowest: 10
lowest</w>: 11
o: 12
r: 13
s: 14
t: 15
w: 16


In [None]:
# Tokenize new text and convert it to token IDs
new_text = "lower lowest"
token_ids = []
for word in new_text.strip().split():
    bpe_tokens = apply_bpe(word, merges)
    # Tokens that are missing from the vocabulary are skipped silently
    ids = [vocab[token] for token in bpe_tokens if token in vocab]
    token_ids.extend(ids)
print("\nToken IDs:")
print(token_ids)


Token IDs:
[8, 11]
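
Going the other way, token IDs can be mapped back to text by inverting the vocabulary. This is a minimal sketch: `vocab_excerpt` copies three entries from the token table printed above, and `decode` is a hypothetical helper that relies on the `</w>` end-of-word marker.

```python
# Excerpt of the token -> id table printed above.
vocab_excerpt = {"low</w>": 5, "lower</w>": 8, "lowest</w>": 11}
id_to_token = {idx: token for token, idx in vocab_excerpt.items()}

def decode(token_ids):
    # Concatenate the tokens, then turn each end-of-word marker into a space.
    joined = "".join(id_to_token[i] for i in token_ids)
    return joined.replace("</w>", " ").strip()

print(decode([8, 11]))
print(decode([5, 8, 11]))
```

Because every word ends in `</w>`, word boundaries survive the round trip even when a word is split into several subword tokens.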
