Task 1: Compare LSTM, RNN, and GRU on a part-of-speech tagging task (prediction quality, training speed, model inference time)
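
Inference time is one of the three comparison criteria in this task. A single timed call can be dominated by one-off setup cost (in TensorFlow, the first `predict` call usually includes graph tracing), so a steadier estimate discards a warm-up call and averages several repetitions. The sketch below illustrates that idea only; `time_inference` and `dummy_predict` are hypothetical names that do not appear in the notebook, and the stand-in function replaces a real model so the example runs without TensorFlow.

```python
import time


def time_inference(predict_fn, inputs, warmup=1, repeats=5):
    """Return the mean wall-clock seconds per call of predict_fn(inputs)."""
    # Warm-up calls are discarded: they absorb one-off setup cost.
    for _ in range(warmup):
        predict_fn(inputs)

    # Time the remaining calls and average them.
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(inputs)
    return (time.perf_counter() - start) / repeats


# Stand-in for a model's predict method (hypothetical, for illustration only).
calls = []


def dummy_predict(x):
    calls.append(x)
    return [v * 2 for v in x]


mean_seconds = time_inference(dummy_predict, [1, 2, 3], warmup=2, repeats=4)
print(f"{mean_seconds * 1000:.4f} ms per call")
```

With the notebook's objects, the equivalent call would be something like `time_inference(lambda x: model.predict(x, verbose=0), X_test[:1])`, assuming `model` and `X_test` are defined as in the cell below.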

In [1]:
import numpy as np
import time
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, SimpleRNN, GRU, TimeDistributed, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import treebank

nltk.download('treebank')
nltk.download('universal_tagset')

# Load and prepare the data
sentences = treebank.tagged_sents(tagset='universal')
X = [[word for word, tag in sent] for sent in sentences]
y = [[tag for word, tag in sent] for sent in sentences]

word_tokenizer = Tokenizer(lower=False, oov_token='OOV')  # Tokenizer for words
word_tokenizer.fit_on_texts(X)
X_encoded = word_tokenizer.texts_to_sequences(X)

tag_tokenizer = Tokenizer(lower=False)  # Tokenizer for tags
tag_tokenizer.fit_on_texts(y)
y_encoded = tag_tokenizer.texts_to_sequences(y)

X_padded = pad_sequences(X_encoded, padding='post')  # Pad word sequences to equal length
y_padded = pad_sequences(y_encoded, padding='post')  # Pad tag sequences to equal length

y_padded = np.expand_dims(y_padded, -1)  # Shape suitable for sparse_categorical_crossentropy

X_train, X_test, y_train, y_test = train_test_split(X_padded, y_padded, test_size=0.2)

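input_dim = len(word_tokenizer.word_index) + 1  # Vocabulary size (+1 for padding index 0)
output_dim = 64  # Embedding dimension
n_tags = len(tag_tokenizer.word_index) + 1  # Number of tag classes (+1 for padding index 0)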

def create_model(model_type):
    model = Sequential([
        Embedding(input_dim=input_dim, output_dim=output_dim, mask_zero=True),
        model_type(64, return_sequences=True),
        TimeDistributed(Dense(n_tags, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model_types = {
    "RNN": SimpleRNN,
    "LSTM": LSTM,
    "GRU": GRU,
}

results = {}

for name, model_type in model_types.items():
    print(f"\nTraining {name} model...")
    model = create_model(model_type)
  
    start_time = time.time()
    model.fit(X_train, y_train, batch_size=128, epochs=5, validation_split=0.1, verbose=1)
    training_time = time.time() - start_time
  
    loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(f"{name} Accuracy: {accuracy*100:.2f}%")
  
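    # Time a single-sentence prediction. This is the first predict() call for this model,
    # so the measurement likely includes one-off setup cost, not only steady-state latency.
    start_time = time.time()
    _ = model.predict(X_test[:1])
    inference_time = time.time() - start_time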
  
    results[name] = {"accuracy": accuracy, "training_time": training_time, "inference_time": inference_time}

# Print the comparison results
print("\nComparison results:")
for name, metrics in results.items():
    print(f"Model: {name}")
    print(f"\tAccuracy: {metrics['accuracy']*100:.2f}%")
    print(f"\tTraining time: {metrics['training_time']:.2f} seconds")
    print(f"\tInference time: {metrics['inference_time']*1000:.2f} ms")

# Example prediction
sample_test = ['This', 'is', 'a', 'simple', 'test']  # Example sentence
idx2tag = {value: key for key, value in tag_tokenizer.word_index.items()}
sample_test_seq = word_tokenizer.texts_to_sequences([sample_test])
# Note that we use maxlen=X_test.shape[1] so that the shape matches the model's earlier predictions
padded_sample_test = pad_sequences(sample_test_seq, maxlen=X_test.shape[1], padding='post')
predictions = model.predict(padded_sample_test)  # `model` is the last one trained in the loop above (GRU)
pred_index = np.argmax(predictions, axis=-1)[0]  # Take the first element, since the prediction is for a single sentence

# Trim the predictions to the length of the original sentence
pred_tags = [idx2tag.get(index) for index in pred_index][:len(sample_test)]  # Keep only the positions of the words in sample_test

print(f'\nOriginal sentence: {" ".join(sample_test)}')
print(f'Predicted POS tags: {pred_tags}')
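
The cell pads every sentence to the length of the longest one, so most positions in `y_test` hold the padding label 0. If the reported accuracy counts those positions, it will understate how well the real tokens are tagged; whether that happens depends on the Keras version, so treat this as an assumption to check rather than a statement about the run below. One way to check is to compute token-level accuracy over non-padded positions only. The following is a minimal sketch with NumPy; `masked_accuracy` is a hypothetical helper, not part of the notebook or of Keras, and the arrays are toy data.

```python
import numpy as np


def masked_accuracy(y_true, y_pred_probs, pad_id=0):
    """Token-level accuracy over positions whose true label is not pad_id."""
    y_true = np.asarray(y_true)
    y_pred_probs = np.asarray(y_pred_probs)
    # Accept labels shaped (batch, time, 1), as produced by np.expand_dims.
    if y_true.ndim == y_pred_probs.ndim:
        y_true = np.squeeze(y_true, axis=-1)
    y_pred = np.argmax(y_pred_probs, axis=-1)
    mask = y_true != pad_id
    if not mask.any():
        return float("nan")
    return float((y_pred[mask] == y_true[mask]).mean())


# Toy data: one sentence with two real tokens followed by two padded positions.
y_true = np.array([[[1], [2], [0], [0]]])
probs = np.array([[[0.1, 0.8, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]]])

masked = masked_accuracy(y_true, probs)
unmasked = float((np.argmax(probs, axis=-1) == y_true[..., 0]).mean())
print(masked, unmasked)  # 1.0 0.5
```

In the toy data both real tokens are predicted correctly, yet the unmasked figure is halved because the two padded positions count as errors. With the notebook's variables the call would look like `masked_accuracy(y_test, model.predict(X_test))`.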

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\Roadmarshal\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\Roadmarshal\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!



Training RNN model...
Epoch 1/5
23/23 ━━━━━━━━━━━━━━━━━━━━ 168s 3s/step - accuracy: 0.0328 - loss: 2.3823 - val_accuracy: 0.0400 - val_loss: 1.8927
Epoch 2/5
23/23 ━━━━━━━━━━━━━━━━━━━━ 34s 1s/step - accuracy: 0.0462 - loss: 1.7528 - val_accuracy: 0.0591 - val_loss: 1.4247
Epoch 3/5
23/23 ━━━━━━━━━━━━━━━━━━━━ 41s 2s/step - accuracy: 0.0642 - loss: 1.2847 - val_accuracy: 0.0722 - val_loss: 1.0205
Epoch 4/5
23/23 ━━━━━━━━━━━━━━━━━━━━ 27s 1s/step - accuracy: 0.0743 - loss: 0.9002 - val_accuracy: 0.0768 - val_loss: 0.7500
Epoch 5/5
23/23 ━━━━━━━━━━━━━━━━━━━━ 25s 1s/step - accuracy: 0.0803 - loss: 0.6407 - val_accuracy: 0.0813 - val_loss: 0.5703
RNN Accuracy: 8.11%
1/1 ━━━━━━━━━━━━━━━━━━━━ 30s 30s/step

Training LSTM model...
Epoch 1/5
[1m23/23[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [