# Hometask

1) Find a text to train on (any book)<br>
2) Build training and validation sets<br>
3) Train a bidirectional language model that predicts the POS tag of a word from its `n_context = 3` neighbours on the left and `n_context = 3` neighbours on the right<br>
4) Evaluate the model

# Solution 
<hr/>

In [1]:
import re
import nltk
import numpy as np
from nltk import pos_tag
from sklearn.model_selection import train_test_split
# Import every Keras component from tensorflow.keras so the two packages are not mixed
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.sequence import pad_sequences


## Preprocess Text Data 

In [None]:
def preprocess_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read().lower()
        text = re.sub(r'\[.*?\]', '', text)  # Remove text within square brackets
        text = re.sub(r'\d+', '', text)      # Remove digits
        text = re.sub(r'«|»', '', text)      # Remove guillemet quotation marks
        text = re.sub(r'\n', ' ', text)      # Replace newline characters with space
        text = re.sub(r'\s+', ' ', text.strip())  # Remove extra whitespace
    return text

# Load and preprocess the text data
text = preprocess_text('data/Rouling_Harry_Potter_1_Harry_Potter_and_the_Sorcerers_Stone.txt')
text[:500]

'harry potter and the sorcerer’s stone for jessica, who loves stories, for anne, who loved them too; and for di, who heard this one first. . the boy who lived mr. and mrs. dursley, of number four, privet drive, were proud to say that they were perfectly normal, thank you very much. they were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. mr. dursley was the director of a firm called grunnings, which made drills. he '

### Tokenize the raw text and prepare the sequences

In [3]:
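# Tokenize and POS-tag the whole text (requires the NLTK tokenizer and tagger data)
tokens = nltk.word_tokenize(text)
tagged = pos_tag(tokens)

X = []
y = []

n_context = 3

# Each sample is a window of 2*n_context+1 words; the label is the tag of the middle word
for i in range(n_context, len(tagged) - n_context):
    X.append([word for word, _ in tagged[i-n_context:i+n_context+1]])
    y.append(tagged[i][1])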
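
To make the windowing concrete, here is a minimal sketch of the same loop on a toy sentence. The `toy_tagged` pairs and the helper name `make_windows` are illustrative assumptions, not part of the notebook. Note that each window includes the target word itself, so the model sees the word whose tag it predicts.

```python
# Toy (word, tag) pairs standing in for the output of pos_tag; tags are illustrative
toy_tagged = [
    ("the", "DT"), ("boy", "NN"), ("who", "WP"), ("lived", "VBD"),
    ("in", "IN"), ("a", "DT"), ("cupboard", "NN"), ("slept", "VBD"),
]

def make_windows(tagged, n_context):
    """Return (windows, targets): each window holds 2*n_context+1 words
    and the target is the tag of the word in the middle."""
    windows, targets = [], []
    for i in range(n_context, len(tagged) - n_context):
        windows.append([word for word, _ in tagged[i - n_context:i + n_context + 1]])
        targets.append(tagged[i][1])
    return windows, targets

toy_X, toy_y = make_windows(toy_tagged, n_context=3)
for window, tag in zip(toy_X, toy_y):
    # window[3] is the middle word when n_context is 3
    print(window, "->", window[3], tag)
```

With 8 tokens and `n_context=3`, only positions 3 and 4 have a full context on both sides, so the sketch yields two windows.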

### Build train and validation sets


In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=47)
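
Because neighbouring windows overlap, a random split places validation windows next to training windows that share most of their words. One possible alternative, sketched here as an assumption rather than what the notebook does, is to cut the token sequence by position and build windows separately on each part. The helper name `contiguous_split` and the toy tokens are hypothetical.

```python
def contiguous_split(items, val_fraction=0.2):
    """Split a sequence by position: the head for training, the tail for validation."""
    cut = len(items) - int(len(items) * val_fraction)
    return items[:cut], items[cut:]

# Toy token list standing in for the tagged book text
toy_tokens = [f"w{i}" for i in range(20)]
train_tokens, val_tokens = contiguous_split(toy_tokens, val_fraction=0.2)

print(len(train_tokens), len(val_tokens))
print(val_tokens[0])
```

Windows built from `train_tokens` and `val_tokens` separately would then never share a position in the text, which may give a more conservative validation score.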

In [5]:
# Convert words and tags into numerical IDs.
# ID 0 is reserved for padding and ID 1 for words never seen in training;
# sorting keeps the mapping reproducible across runs.
word2idx = {'<PAD>': 0, '<UNK>': 1}
for word in sorted({word for word_list in X_train for word in word_list}):
    word2idx[word] = len(word2idx)
tag2idx = {tag: i for i, tag in enumerate(sorted(set(y_train) | set(y_val)))}

# Update the sequences with the numerical IDs
unk = word2idx['<UNK>']
X_train = [[word2idx[word] for word in word_list] for word_list in X_train]
X_val = [[word2idx.get(word, unk) for word in word_list] for word_list in X_val]
y_train = [tag2idx[tag] for tag in y_train]
y_val = [tag2idx[tag] for tag in y_val]

# Pad the sequences
X_train = pad_sequences(X_train, maxlen=n_context*2+1)
X_val = pad_sequences(X_val, maxlen=n_context*2+1)

# One-hot encode
y_train = to_categorical(y_train, num_classes=len(tag2idx))
y_val = to_categorical(y_val, num_classes=len(tag2idx))


In [6]:
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of X_val:", X_val.shape)
print("Shape of y_val:", y_val.shape)

Shape of X_train: (79354, 7)
Shape of y_train: (79354, 38)
Shape of X_val: (19839, 7)
Shape of y_val: (19839, 38)


## Create and train the model


In [7]:
model = Sequential()
model.add(Embedding(input_dim=len(word2idx), output_dim=64, input_length=n_context*2+1))
model.add(Bidirectional(LSTM(64, return_sequences=False)))
model.add(Dense(len(tag2idx), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
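
As a rough size check for this architecture, the trainable parameters can be counted by hand. The sketch below assumes the Keras defaults (LSTM with bias, `Bidirectional` concatenating both directions); the `vocab_size` value is a hypothetical placeholder, since the notebook does not print the vocabulary size, while 38 is the number of tag classes shown in the shapes above.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size = 6000  # hypothetical; use len(word2idx) in the notebook
embed_dim, lstm_units, n_tags = 64, 64, 38

embedding = embedding_params(vocab_size, embed_dim)
bilstm = 2 * lstm_params(embed_dim, lstm_units)   # forward + backward LSTM
dense = dense_params(2 * lstm_units, n_tags)      # concatenated 128-dim input

print(embedding, bilstm, dense)
```

Under these assumptions the embedding table dominates the parameter count, so the vocabulary size largely determines the model size.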

In [8]:
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1fad31d6880>

## Evaluate the model

### Accuracy and loss

In [9]:
loss, accuracy = model.evaluate(X_val, y_val)
print(f'Validation Loss: {loss}')
print(f'Validation Accuracy: {accuracy}')

Validation Loss: 0.5859730839729309
Validation Accuracy: 0.8776652216911316
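
An accuracy figure is easier to read next to a trivial baseline. The sketch below, which is not part of the original notebook, computes the accuracy of always predicting the most frequent training tag. The helper name `majority_baseline_accuracy` and the toy arrays are assumptions; in the notebook it could be called with `y_train` and `y_val`.

```python
import numpy as np

def majority_baseline_accuracy(y_train_onehot, y_val_onehot):
    """Accuracy obtained by always predicting the most frequent training class."""
    train_labels = np.argmax(y_train_onehot, axis=1)
    val_labels = np.argmax(y_val_onehot, axis=1)
    majority = np.bincount(train_labels).argmax()
    return float(np.mean(val_labels == majority))

# Toy one-hot labels with 3 classes; class 0 is the most frequent in training
toy_train = np.eye(3)[[0, 0, 1, 2, 0]]
toy_val = np.eye(3)[[0, 1, 0, 0]]

baseline = majority_baseline_accuracy(toy_train, toy_val)
print(baseline)  # 0.75: three of the four validation labels are class 0
```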


### Samples

In [10]:
# After training the model, make predictions on the validation set
y_pred = model.predict(X_val)

idx2word = {i: word for word, i in word2idx.items()}
idx2tag = {i: tag for tag, i in tag2idx.items()}

# Choose 10 random samples from the validation set
indices = np.random.choice(range(len(X_val)), size=10, replace=False)

for i in indices:
    # Get the true and predicted tags
    true_tag = np.argmax(y_val[i])
    pred_tag = np.argmax(y_pred[i])

    # Get the corresponding words
    words = [idx2word[idx] for idx in X_val[i]]

    print(f'Sample: [{" | ".join(words)}]')
    print(f'True word: {words[n_context]}')  # the target word sits in the middle of the window
    print(f'True tag: {idx2tag[true_tag]}')
    print(f'Predicted tag: {idx2tag[pred_tag]}')
    print("")


Sample: [and | daddy. | ” | “ | all | right | ,]
True word: “
True tag: NNP
Predicted tag: NNP

Sample: [was | carrying | a | large | wooden | crate | under]
True word: large
True tag: JJ
Predicted tag: JJ

Sample: [gives | you | the | right | to | walk | around]
True word: right
True tag: NN
Predicted tag: NN

Sample: [movement | we | ’ | ve | been | practicing | !]
True word: ve
True tag: RB
Predicted tag: RB

Sample: [if | they | could | make | a | pineapple | tapdance]
True word: make
True tag: VB
Predicted tag: VB

Sample: [together | . | “ | i | ’ | d | not]
True word: i
True tag: JJ
Predicted tag: NN

Sample: [the | floor | in | fright | ; | ron | pulled]
True word: fright
True tag: NN
Predicted tag: NN

Sample: [each | other | , | seeming | to | have | forgotten]
True word: seeming
True tag: VBG
Predicted tag: RB

Sample: [lessons | , | we | ’ | ll | get | into]
True word: ’
True tag: VBP
Predicted tag: VBP

Sample: [could | remember | , | ever | since | he | ’]
True word: ever