Saint Petersburg, 2021

# Task

1. Build a text generator based on an n-gram language model and a neural language model.
2. Find a corpus (e.g. http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt ), but you are free to use any other corpus of interest.
3. Preprocess it if necessary (we suggest using nltk for that)
4. Build an n-gram model
5. Try out different values of n, calculate perplexity on a held-out set
6. Build a simple neural network model for text generation. Implement recurrent NN (using LSTM and/or GRU cells). We suggest using tensorflow / keras / pytorch for this task.
7. Optional: try transformer NN (BERT/ELECTRA) and compare the results with RNN.

**Criteria:**
*   Data is split into train / validation / test, motivation for the split method is given
*   N-gram model is implemented
*   Unknown words are handled
*   Add-k Smoothing is implemented
*   Neural network for text generation is implemented
*   Perplexity is calculated for both models
*   Examples of texts generated with different models are present and compared
*   Optional: Try both character-based and word-based approaches.

# Imports

In [None]:
import re
import pandas as pd
import numpy as np
import math
import warnings
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
import string
from sklearn.feature_extraction.text import CountVectorizer

import requests
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('notebook')
%config InlineBackend.figure_format = 'retina'

nltk.download('punkt')
nltk.download('stopwords')

plt.rcParams['figure.figsize'] = 10, 8

warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Auxiliary functions

In [None]:
def step_1_7(sentence):
    '''Normalise one sentence: lowercase, drop digits and parentheses, expand contractions.'''

    translate_table = dict((ord(char), None) for char in '0123456789')

    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = sentence.replace('--', ' ')
    sentence = sentence.replace('`', ' ')
    sentence = sentence.replace('\n', ' ')
    sentence = re.sub("[()]", "", sentence)
    sentence = sentence.translate(translate_table)
    sentence = decontracted(sentence)

    return sentence

In [None]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    # note: this also rewrites possessives ("king's" becomes "king is")
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)

    return phrase

# Download & Overview

In [None]:
response = requests.get(
    'http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt')
text = response.text
text = nltk.sent_tokenize(text)

# Data Preprocessing

In [None]:
text = [step_1_7(row) for row in text]

# Ngrams

## Train - test - split

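We split the data into three parts, train - validation - test, in an 80 - 10 - 10 proportion (as suggested in the recommended reading: https://web.stanford.edu/~jurafsky/slp3/3.pdf). The split is contiguous: the first 80% of sentences go to train, the next 10% to validation, and the last 10% to test.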

In [None]:
print('Total sentences:', len(text))

Total sentences: 52482


In [None]:
train_size = np.ceil(len(text)*0.8).astype(int)
train_and_val_size = np.ceil(len(text)*0.9).astype(int)

In [None]:
train = text[:train_size]
validation = text[train_size: train_and_val_size]
test = text[train_and_val_size:]

In [None]:
print('Train set:', len(train))
print('Validation set:', len(validation))
print('Test set:', len(test))

Train set: 41986
Validation set: 5248
Test set: 5248


In [None]:
train_tokenized = []
for sentence in train:
    sent_token = word_tokenize(sentence)
    train_tokenized.append(sent_token)

## Count-based LM

In [None]:
class NGramModel(object):

    def __init__(self,
                 n: int,
                 k: float
                 ):
        '''
        self.n - size of the n-gram
        self.k - parameter of add-k smoothing
        self.ngrams - dictionary that maps every prefix of length n-1 from the
        train set to its possible next tokens and their counts.
        '''
        self.n = n
        self.k = k
        self.ngrams = defaultdict(Counter)
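        # Unigram fallback: smoothed word frequencies over the training sentences.
        # Note: this reads the global `train` list rather than a constructor argument.
        # CountVectorizer's default token pattern keeps only tokens of two or more
        # word characters, so punctuation and one-letter words are not in this vocabulary.
        self.counter = CountVectorizer()
        self.counter.fit(train)
        self.bag_of_words = self.counter.transform(train)
        self.sum_words = self.bag_of_words.sum(axis=0)
        self.all = 0
        self.words_freq = [(word, self.sum_words[0, idx])
                           for word, idx in self.counter.vocabulary_.items()]
        for _, freq in self.words_freq:
            self.all += freq
        vocab_size = len(self.counter.vocabulary_)
        self.dictionary = {word: (self.sum_words[0, idx] + self.k) / (self.all + self.k * vocab_size)
                           for word, idx in self.counter.vocabulary_.items()}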

    
    def special_symbols(self, data):
        '''
        Append the end-of-sentence marker '</s>' to every sentence.
        '''
        prepared = []
        for sentence in data:
             sentence = sentence + ['</s>']
             prepared.append(sentence)
        return prepared

    def compute_ngrams(self, data):
        '''
        Fill the self.ngrams dictionary from tokenized sentences.
        '''
        data = self.special_symbols(data)
        self.ngrams = defaultdict(Counter)
        for row in data:
            ngram = ['<s>']*self.n
            for token in row:
                ngram[:-1] = ngram[1:]
                ngram[-1] = token
                self.ngrams[tuple(ngram[:-1])].update([ngram[-1]])

    def probs_for_generating(self, prefix):
        '''
        Return the possible tokens after the given prefix together with
        their k-smoothed probabilities.
        '''
        if (prefix != ['</s>']) and (tuple(prefix) in self.ngrams):
            endings = self.ngrams[tuple(prefix)]
            sum_freq = sum(endings[e] for e in endings)
            # Normalised over the continuations seen after this prefix, so the
            # values sum to 1 and can be passed to np.random.choice.
            return {e: (endings[e] + self.k) / (sum_freq + self.k*len(endings)) for e in endings}
        else:
            # Unseen prefix: fall back to the smoothed unigram distribution.
            return self.dictionary

    def generate_next_word(self, prefix):
        '''
        Pick the next word for the given prefix by sampling with np.random.choice.
        '''
        if isinstance(prefix, str):
            prefix = word_tokenize(prefix)

        if len(prefix) > self.n-1:
            return self.generate_next_word(prefix[-self.n+1:])
        else:
            prefix = ['<s>']*(self.n-1-len(prefix)) + prefix
            variants = self.probs_for_generating(prefix)
            end = np.random.choice(
                list(variants.keys()), p=list(variants.values()))
        return end

    def generate_text(self, length = 150):
        '''
        Generate a text with the given number of tokens.
        '''
        tokens = ['<s>'] * (self.n-1)
        for i in range(length):
            token = self.generate_next_word(tokens)
            tokens.append(token)

        s = ' '.join([i for i in tokens if (i != '<s>') and (i != '</s>')])
        return s

    def probs_with_k_smoothing(self, x):
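        '''
        Compute the k-smoothed probability of an n-gram x (prefix + last word).
        '''
        prefix = x[:-1]
        word = x[-1]
        if tuple(prefix) in self.ngrams:
            endings = self.ngrams[tuple(prefix)]
            sum_freq = sum(endings[e] for e in endings)
            # Note: k is added per continuation seen after this prefix, not per
            # vocabulary word, so mass given to unseen words is not normalised.
            return (endings[word] + self.k) / (sum_freq + self.k*len(endings))
        else:
            # Unseen prefix: uniform probability 1/V over the unigram vocabulary.
            return self.k/(self.k*len(self.counter.vocabulary_))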

    def perplexity(self, sequence):
        '''
        Compute the perplexity of a text. The input is either a str or a list of sentences.
        '''
        if isinstance(sequence, str):
            sequence = sent_tokenize(sequence)
            sequence = [step_1_7(row) for row in sequence]
        PP = 0
        N = 0
        for row in sequence:
            row = word_tokenize(row)
            row = row + ['</s>']
            N += len(row)
            ngram = ['<s>']*self.n
            for token in row:
                ngram[:-1] = ngram[1:]
                ngram[-1] = token
                p = self.probs_with_k_smoothing(ngram)
                PP += math.log(p, 2)  # sum log probabilities; a raw product would quickly underflow to zero
        PP = math.pow(2, -1*(PP/N))
        return PP
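For reference, textbook add-k smoothing normalises by the full vocabulary size $V$ rather than by the number of continuations seen after the prefix: $P(w \mid \text{prefix}) = \frac{c(\text{prefix}, w) + k}{c(\text{prefix}) + kV}$. The sketch below is a standalone illustration under that assumption, not the class above; `add_k_prob` and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def add_k_prob(ngrams, prefix, word, k, vocab_size):
    """Add-k estimate of P(word | prefix), normalised over the whole vocabulary."""
    endings = ngrams.get(tuple(prefix), Counter())
    total = sum(endings.values())
    return (endings[word] + k) / (total + k * vocab_size)


# Toy bigram counts (hypothetical data, for illustration only).
vocab = ['to', 'be', 'or', 'not', '</s>']
ngrams = defaultdict(Counter)
ngrams[('to',)].update(['be', 'be', 'not'])

# Probabilities of every vocabulary word after the prefix 'to'.
probs = {w: add_k_prob(ngrams, ['to'], w, k=0.1, vocab_size=len(vocab)) for w in vocab}
total_mass = sum(probs.values())
```

With this normalisation the probabilities over the vocabulary sum to 1 for any prefix, and an unseen prefix automatically gets the uniform value $1/V$.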

### Build LM

We compute perplexity on the validation set for different values of $k$ and $n$ and choose the best combination.

In [None]:
for n in range(2, 10):
    for k in [0.001, 0.01, 0.1]:
        model = NGramModel(n, k)
        model.compute_ngrams(train_tokenized)
        print(f'n = {n}, k = {k}, perplexity = {model.perplexity(validation)}')

n = 2, k = 0.001, perplexity = 378.69058904534916
n = 2, k = 0.01, perplexity = 237.5903819267214
n = 2, k = 0.1, perplexity = 150.69675874194724
n = 3, k = 0.001, perplexity = 1034.2753982379913
n = 3, k = 0.01, perplexity = 450.9932393397054
n = 3, k = 0.1, perplexity = 201.28658173842416
n = 4, k = 0.001, perplexity = 3888.7122777706295
n = 4, k = 0.01, perplexity = 2301.4105767937854
n = 4, k = 0.1, perplexity = 1382.6321521991583
n = 5, k = 0.001, perplexity = 8534.181658663489
n = 5, k = 0.01, perplexity = 6720.180513063622
n = 5, k = 0.1, perplexity = 5325.312617283351
n = 6, k = 0.001, perplexity = 11124.22615507342
n = 6, k = 0.01, perplexity = 9709.085509740818
n = 6, k = 0.1, perplexity = 8500.243033341438
n = 7, k = 0.001, perplexity = 11931.934013892738
n = 7, k = 0.01, perplexity = 10677.875113580234
n = 7, k = 0.1, perplexity = 9577.659104090362
...
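The quantity compared here is $PP = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 p_i}$, where $p_i$ is the model probability of the $i$-th token and $N$ is the number of tokens. A minimal sketch on hand-picked probabilities follows; the function name `perplexity_from_probs` is hypothetical and not part of the notebook.

```python
import math


def perplexity_from_probs(probs):
    """Perplexity of a sequence given the model probability of each token."""
    log_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_sum / len(probs))


# A model that assigns 1/4 to every token has perplexity 4.
uniform = perplexity_from_probs([0.25] * 8)
# Lower probabilities on some tokens push perplexity up.
mixed = perplexity_from_probs([0.5, 0.25, 0.125, 0.125])
```

One plausible reason perplexity grows with $n$ in this notebook: the longer the prefix, the more often it was never seen in training, and the model then falls back to a uniform probability over the vocabulary.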

We pick the parameters with the lowest perplexity (n = 2, k = 0.1) and train the model with them.



In [None]:
model = NGramModel(n=2, k=0.1)

In [None]:
model.compute_ngrams(train_tokenized)

Let us generate a text 150 tokens long.

In [None]:
%%time
model.generate_text()

CPU times: user 258 ms, sys: 8.01 ms, total: 266 ms
Wall time: 270 ms


"i will stand indebted , to march on you hear ? as much like a death is-head with a rarer action , as i hear this it is a prince henry bolingbroke : what masquing stuff : this spite ! all in particular wrongs both murderers : i know thou'rt condemn would make this night , he proclaim would be an hour . hermia is nice and pipe for eggs , by'r our eyes the matter , we by thy master capering : an extemporal epitaph . joy , doing . richer than i had been ; and fortune and delivered your dwelling in a praise : he danced withal , is a-birding together , whom he is but , we will be found most learned reverend gentleman : beside were those of orleans is in your pleasures of york : my mind , at the"

At a glance, the text looks coherent.

Now we check the perplexity on the test set.

In [None]:
%%time
model.perplexity(test)

CPU times: user 1min 1s, sys: 14.4 ms, total: 1min 1s
Wall time: 1min 2s


162.4229118083138

### Additional options

What else can the model do?

*   Generate the next word given a prefix



In [None]:
model.generate_next_word('i shall not')

'?'

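*   Return the smoothed distribution over next tokens for a prefix (the model here is a bigram, so a three-token prefix is not in its table and the smoothed unigram distribution is returned)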

In [None]:
model.probs_for_generating(['i', 'shall', 'not'])

{'first': 0.0016052537039166735,
 'citizen': 0.00026328575357846513,
 'before': 0.0008235999381487436,
 'we': 0.004411468114485499,
 'proceed': 9.30245372725794e-05,
 'any': 0.0009830263497806184,
 'further': 0.00026019009510017626,
 'hear': 0.001114591835107894,
 'me': 0.009219025731268055,
 'speak': 0.00150928829108972,
 'all': 0.004549224916769352,
 'you': 0.017229042043840407,
 'are': 0.004282998287636512,
 'resolved': 5.8972294011402244e-05,
 'rather': 0.0003886599219491628,
 'to': 0.02306745393389314,
 'die': 0.0005434428458636044,
 'than': 0.0022615333013139057,
 'famish': 1.56330753153586e-05,
 'know': 0.0021794983516392518,
 'caius': 0.00014565073140348952,
 'marcius': 0.00018898995009953315,
 'is': 0.019852612604190193,
 'chief': 4.6589660098246914e-05,
 'enemy': 0.00018279863314295548,
 'the': 0.031736845502341016,
 'people': 0.000274120558252476,
 'not': 0.011687813367703399,
 'let': 0.002790890901101296,
 'us': 0.0020076893060942216,
 'kill': 0.0003623468248837077,
 'him':

# LSTM

## Imports

In [None]:
import re
import pandas as pd
import numpy as np
import math
import warnings
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from collections import Counter, defaultdict
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dense, Dropout, Embedding, Flatten, LSTM
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
from keras import backend as K
from tensorflow import keras
from keras import layers
from keras import optimizers
from keras.preprocessing import sequence
from keras.callbacks import ModelCheckpoint
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Model
from keras.utils import to_categorical
from keras.wrappers.scikit_learn import KerasRegressor

import requests
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context('notebook')
%config InlineBackend.figure_format = 'retina'

nltk.download('punkt')
nltk.download('stopwords')

plt.rcParams['figure.figsize'] = 10, 8

warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Train - test - split

The task requires building an LSTM. It usually copes better with long texts because it is not limited to a fixed number of tokens it can "look at".

From the start, we decided to handle unk differently in each approach. As a result, the two models have different vocabularies and cannot be compared by perplexity, so I decided to move away from splitting into standard n-grams.

The text was split as follows: first into sentences, and then, within each sentence, into prefixes of increasing length, where the last token of each prefix is the prediction target.
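The prefix construction can be sketched in plain Python. This is an illustration only: the helper names `prefix_sequences` and `pad_left` and the token ids are hypothetical, and the notebook itself relies on Keras `pad_sequences` for the padding step.

```python
def prefix_sequences(token_ids):
    """All prefixes of length >= 2: the last id is the target, the rest is the input."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]


def pad_left(seq, maxlen, pad_id=0):
    """Left-pad with pad_id, keeping the last maxlen ids if the sequence is longer."""
    seq = seq[-maxlen:]
    return [pad_id] * (maxlen - len(seq)) + seq


sentence = [5, 12, 7, 3]  # hypothetical token ids for one sentence
sequences = prefix_sequences(sentence)
maxlen = max(len(s) for s in sequences)
padded = [pad_left(s, maxlen) for s in sequences]

# Inputs are everything but the last id; targets are the last id.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]
```

A sentence of $m$ tokens yields $m - 1$ training samples, each padded to the length of the longest prefix, which is part of why memory use grows quickly on long sentences.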

I worked in Google Colab, which has a fairly limited amount of RAM, so the dataset had to be cut down. This strongly affected the result.

In [None]:
print('Total sentences:', len(text))

Total sentences: 5249


In [None]:
text_size = np.ceil(len(text)*0.1).astype(int)
text = text[:text_size]

In [None]:
len(text)

525

We split the data into two parts, train and test, in a 90 - 10 proportion.

In [None]:
train_size = np.ceil(len(text)*0.90).astype(int)

In [None]:
train = text[:train_size]
test = text[train_size:]

In [None]:
print('Train set:', len(train))
print('Test set:', len(test))

Train set: 473
Test set: 52


## LSTM Model

In [None]:
class LSTM():
    def __init__(self, corpus):
         corpus = corpus.copy()
         corpus = self.special_symbols(corpus)
         self.tokenizer = Tokenizer(oov_token='<UNK>', filters='#$%&()*+/=@[\\]^_`{|}~\t\n')
         self.tokenizer.fit_on_texts(corpus)
         # creating vocabulary
         self.word_index = self.tokenizer.word_index
         self.vocab_len = len(self.word_index)
         #create X, y, maxlen
         self.X, self.y, self.maxlen = self.create_x_y(corpus, 0)

    def create_x_y(self, data, maksimum):
        '''
        Build X and y: every prefix of a sentence (length >= 2) becomes one
        sample whose last token is the target. Pass maksimum=0 to compute
        maxlen from the data; any other value reuses self.maxlen.
        '''
        input_sequences = []
        for row in data:
            token_list = self.tokenizer.texts_to_sequences([row])[0]
            for i in range(1, len(token_list)):
                n_gram = token_list[:i+1]
                input_sequences.append(n_gram)
        if maksimum == 0:
            maxlen = max(len(row) for row in input_sequences)
        else:
            maxlen = self.maxlen
        sequences = np.array(pad_sequences(input_sequences, maxlen=maxlen, padding="pre"))
        X = sequences[:, :-1]
        # Index 0 is reserved for padding, hence vocab_len + 1 classes.
        y = to_categorical(sequences[:, -1], num_classes=self.vocab_len+1)
        return X, y, maxlen

    def fit(self, epochs=50):
        '''
        Build, compile and train the model.
        '''
        # Token ids run from 1 to vocab_len (0 is padding), so both the
        # embedding input and the softmax output need vocab_len + 1 entries
        # to match the one-hot targets built in create_x_y.
        self.model = Sequential([
            layers.Embedding(self.vocab_len + 1, 8, input_length=self.maxlen-1),
            layers.LSTM(64),
            layers.Dropout(0.2),
            layers.Dense(self.vocab_len + 1, activation='softmax')
        ])
        self.model.compile(loss='categorical_crossentropy',
                           optimizer='adam',
                           metrics=['acc', self.perplexity])
        self.model.summary()

        # RNNs overfit quickly, so stop once validation accuracy stops improving.
        early_stop = keras.callbacks.EarlyStopping(monitor='val_acc', patience=3)
        self.model.fit(self.X,
                       self.y,
                       verbose=1,
                       batch_size=64,
                       epochs=epochs,
                       validation_split=0.2,
                       callbacks=[early_stop])

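    def perplexity(self, y_true, y_pred):
        '''
        Perplexity metric: exp of the categorical cross-entropy.
        '''
        cross_entropy = K.categorical_crossentropy(y_true, y_pred)
        # Return the value without assigning it to self.perplexity, which
        # would overwrite this method. exp is taken per sample and Keras then
        # averages, which is never smaller than exp of the mean cross-entropy.
        return K.exp(cross_entropy)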

    def generate_text(self, seed, num_of_words):
        '''
        Generate num_of_words tokens that continue the given seed text.
        '''
        token_list = self.tokenizer.texts_to_sequences([seed])[0]
        result = list(token_list)
        # pad_sequences left-pads short seeds and keeps only the last
        # maxlen-1 tokens of long ones (truncating='pre' is the default).
        token_list = pad_sequences([token_list], maxlen=self.maxlen-1, padding='pre')

        for i in range(num_of_words):
            # Greedy decoding. np.argmax over predict() replaces
            # predict_classes, which newer Keras releases no longer provide.
            probs = self.model.predict(token_list, verbose=0)
            predicted = int(np.argmax(probs, axis=-1)[0])
            result.append(predicted)
            # Shift the window left by one position and append the new token.
            token_list = np.append(token_list[:, 1:], [[predicted]], axis=1)

        # Map ids back to words, skipping the '<s>' and '</s>' markers.
        words = [self.tokenizer.index_word.get(idx, '') for idx in result]
        seed = ''
        for output in words:
            if (output != '<s>') and (output != '</s>'):
                seed += " " + output

        return seed

    def preparation(self, test):
        '''
        Evaluate the main metrics on the test set.
        '''
        if isinstance(test, str):
            test = sent_tokenize(test)
        # add the sentence boundary markers
        test = self.special_symbols(test)
        # convert to sequences and split into X and y, reusing the training maxlen
        X, y, ln = self.create_x_y(test, 1)
        # the third positional argument of evaluate is batch_size, so name it explicitly
        self.model.evaluate(X, y, batch_size=64)

    def special_symbols(self, data):
        prepared = []
        for sentence in data:
            sentence = '<s> ' + sentence + ' </s>'
            prepared.append(sentence)

        return prepared

In [None]:
model = LSTM(train)

In [None]:
model.fit(10)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 212, 8)            97016     
_________________________________________________________________
lstm (LSTM)                  (None, 64)                18688     
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 12127)             788255    
Total params: 903,959
Trainable params: 903,959
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
model.generate_text('i shall not be ', 100)

' i shall not be not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter, uglier the soul in not to be bitter,'

The looping is probably caused by the small amount of data, and also by the use of `predict_classes`, which always picks the single most probable token (greedy decoding).
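Greedy decoding can be contrasted with sampling from the predicted distribution, which is one common way to break such loops. The sketch below is independent of Keras; the function names and the toy distribution `probs` are hypothetical, and all probabilities are assumed to be strictly positive.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling log-probabilities by temperature."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


def greedy(probs):
    """Index of the most probable token (what argmax decoding does)."""
    return max(range(len(probs)), key=probs.__getitem__)


probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
rng = random.Random(0)
greedy_picks = {greedy(probs) for _ in range(20)}
sampled_picks = {sample_with_temperature(probs, 1.0, rng) for _ in range(200)}
```

Greedy decoding returns the same index every time for the same input, whereas sampling visits the less probable tokens as well; a temperature below 1 moves sampling back towards the greedy choice.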

The small amount of data also inevitably affects val_perplexity, which is very large.

In [None]:
model.preparation(test)



Comparing the results of this model with the previous one would not be meaningful here: the vocabularies differ, unknown words were handled differently, and the amount of data differs.
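One way to make such a comparison possible would be to give both models the same vocabulary and the same unknown-word rule. A minimal sketch, assuming a frequency threshold `min_count` and helper names that are hypothetical and not part of the notebook:

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_count=2):
    """Keep tokens seen at least min_count times; everything else maps to '<UNK>'."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count} | {'<UNK>'}


def replace_unknown(tokens, vocab):
    """Replace out-of-vocabulary tokens with '<UNK>'."""
    return [tok if tok in vocab else '<UNK>' for tok in tokens]


# Toy training data (hypothetical, for illustration only).
train_toy = [['to', 'be', 'or', 'not', 'to', 'be'], ['be', 'gone']]
vocab = build_vocab(train_toy, min_count=2)
mapped = replace_unknown(['to', 'be', 'king'], vocab)
```

Applying the same mapping to the train, validation, and test sets before fitting either model would put their perplexities on the same token inventory.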

Although the neural network looks more promising in theory, its results at this point are poor.

To improve it, the next steps are to tune the network's hyperparameters and to train on a sufficiently large dataset.