<a href="https://colab.research.google.com/github/ChiaoYunTing/Text-Analytics/blob/main/Language_Modeling_with_Positional_Embeddings_and_Some_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Language Model with Positional Embeddings and Elementary 'Attention'

We will expand the previous simple language model by adding  positional embeddings and simplified attention. As before we will use the Jane Austen novel 'Sense and Sensibility' to train the language model.

First we go through the preliminaries to get the model ready to train.

In [1]:
import nltk
nltk.download('gutenberg')  # Make sure the Gutenberg corpus is downloaded
from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


#Preliminaries

In [2]:
# Load "Sense and Sensibility" text
sas = gutenberg.raw('austen-sense.txt')

# Print the first 500 characters of "Sense and Sensibility"
print(sas[:500])


[Sense and Sensibility by Jane Austen 1811]

CHAPTER 1


The family of Dashwood had long been settled in Sussex.
Their estate was large, and their residence was at Norland Park,
in the centre of their property, where, for many generations,
they had lived in so respectable a manner as to engage
the general good opinion of their surrounding acquaintance.
The late owner of this estate was a single man, who lived
to a very advanced age, and who for many years of his life,
had a constant companion an


## Embeddings
We will use the `gensim` library, which provides straightforward implementations of `word2vec`, to create word embeddings using the CBOW model. To keep it simple, we will create embeddings of size 5.

In [3]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Tokenize
tokens = word_tokenize(sas)

# Organize the tokens into sentences, Word2Vec needs data in format of list of lists of tokens
sentences = [tokens[i:i+100] for i in range(0, len(tokens), 100)]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
# Train the CBOW model
model = Word2Vec(sentences, vector_size=5, window=5, min_count=1, sg=0)  # sg=0 specifies CBOW


##Tokenizer
Once we get the embeddings, we will store the words (tokens), token ids, and embeddings in a dataframe.

Note that the number of distinct words (tokens) is 7111. This is the size of our vocabulary.

In [5]:
import pandas as pd
# Create a DataFrame to store word, token_id, and embedding
data = {
    'word': [],
    'token_id': [],
    'embedding': []
}

for idx, word in enumerate(model.wv.index_to_key):
    data['word'].append(word)
    data['token_id'].append(idx)
    data['embedding'].append(model.wv[word].tolist())  # convert numpy array to list for easier handling in DataFrame

df = pd.DataFrame(data)
print(df)

              word  token_id  \
0                ,         0   
1               to         1   
2                .         2   
3              the         3   
4               of         4   
...            ...       ...   
7106        other.      7106   
7107   incessantly      7107   
7108  exclamations      7108   
7109     hartshorn      7109   
7110       utility      7110   

                                              embedding  
0     [2.620171546936035, 1.8446730375289917, 3.9585...  
1     [1.5974335670471191, 1.9683901071548462, 4.129...  
2     [5.972449779510498, 2.2278716564178467, 3.3245...  
3     [-0.6165288686752319, 1.8960069417953491, 4.91...  
4     [1.085107445716858, 1.1664472818374634, 5.4929...  
...                                                 ...  
7106  [0.1374606043100357, 0.206727996468544, 0.1856...  
7107  [-0.14069943130016327, -0.0452144593000412, -0...  
7108  [-0.05511047691106796, 0.053439486771821976, -...  
7109  [0.03255762904882431, -0.1556

##Positional Embeddings
Positional embeddings are used primarily in models like the Transformer to incorporate information about the position of tokens in the input sequence into the model. The idea is to add a vector to each token's embedding that represents its position in the sequence, ensuring that the order of tokens contributes to the model's understanding.

In the original Transformer architecture, positional embeddings are computed using sine and cosine functions of different frequencies:

$$
\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\
\text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$


Where:

$pos$ is the position of the token in the sequence.

$i$ is the dimension index.

$
d_{model}
$ is the dimensionality of the token embeddings.

This formula helps the model to differentiate positions by providing a unique signal for each position, and the repeating patterns allow the model to generalize to sequence lengths that it has not seen before.

In [6]:
def get_positional_embeddings(sequence_length, embedding_dim):
    positional_embeddings = np.zeros((sequence_length, embedding_dim))
    for pos in range(sequence_length):
        for i in range(embedding_dim):
            if i % 2 == 0:
                positional_embeddings[pos, i] = np.sin(pos / (10000 ** ((2 * i) / embedding_dim)))
            else:
                positional_embeddings[pos, i] = np.cos(pos / (10000 ** ((2 * i) / embedding_dim)))
    return positional_embeddings


##Training Data
Our main objective is to predict the next word (token) based on the previous 5 words (tokens). Thus, our context length is 5.

We will prepare the training data such that inputs are 5 consecutive words (token) and the output to be predicted is the 6th word (token). If the input has less than 5 words (tokens), we will pad it with \<pad>.

Add stuff about positional embeddings

In [9]:
import numpy as np
def generate_training_data(sentences, model_wv, window_size=5):
    embedding_dim = model_wv.vector_size
    X, y = [], []
    sequence_texts = []  # For storing the actual sequences of words
    next_words = []  # For storing the actual next word
    positional_embeddings = get_positional_embeddings(window_size, embedding_dim)

    for sentence in sentences:
        # Embed words using the Word2Vec model
        embedded_sentence = [model_wv[word] for word in sentence if word in model_wv]
        word_sentence = [word for word in sentence if word in model_wv]  # Keep the actual words for viewing

        # Create sequences
        for i in range(len(embedded_sentence)):
            end_ix = i + window_size
            if end_ix >= len(embedded_sentence):
                break
            seq_x, seq_y = embedded_sentence[i:end_ix], embedded_sentence[end_ix]
            seq_text, next_word = word_sentence[i:end_ix], word_sentence[end_ix]

            # Pad sequence if necessary
            padding_length = window_size - len(seq_x)
            seq_x += [np.zeros(embedding_dim)] * padding_length
            seq_text += ['<pad>'] * padding_length

            # Add positional embeddings
            modified_seq_x = np.array(seq_x) + positional_embeddings[:len(seq_x)]

            X.append(modified_seq_x.flatten())
            y.append(seq_y)
            sequence_texts.append(' '.join(seq_text))
            next_words.append(next_word)

    return np.array(X), np.array(y), sequence_texts, next_words


In [10]:
X_train, y_train, train_sequences, train_next_words = generate_training_data(sentences, model.wv)

In [11]:
# Create DataFrame
train_df = pd.DataFrame({
    'Sequence': train_sequences,
    'Next Word': train_next_words,
    'X_train (Flattened Embeddings)': list(X_train),
    'y_train (Embedding)': list(y_train)})

In [12]:
train_df.head()

Unnamed: 0,Sequence,Next Word,X_train (Flattened Embeddings),y_train (Embedding)
0,[ Sense and Sensibility by,Jane,"[-0.05941634625196457, 1.114052876830101, 0.03...","[-0.15103865, -0.13794702, 0.23254584, -0.1802..."
1,Sense and Sensibility by Jane,Austen,"[0.21857543289661407, 1.2241209894418716, 0.20...","[0.2371671, -0.035784535, -0.07405981, -0.2451..."
2,and Sensibility by Jane Austen,1811,"[0.6344841718673706, 3.568253993988037, 4.1085...","[0.13259846, -0.08520899, 0.16910738, -0.24759..."
3,Sensibility by Jane Austen 1811,],"[0.16189710795879364, 0.9867487233132124, 0.21...","[0.2779051, 0.39517418, 0.32963225, -0.3987266..."
4,by Jane Austen 1811 ],CHAPTER,"[0.7093105912208557, 3.1029014587402344, 4.897...","[1.2434641, 1.530862, 2.7463584, -1.8938402, -..."


Our training dataset has 134,306 data points. Each data point has 6 words (tokens), and thus the total number of words (tokens) for training is 134,306 * 6 = 805,836.

In [13]:
train_df.shape
# train_df.head()

(134306, 4)

#Neural Network Design
This time we will add two more layers to have the following structure.

1. Input layer: 25 nodes (5 words(tokens) with embedding size of 5 for each).
2. Attention-Like Layer: 25 nodes. A dense layer can serve as a basic form of attention by processing input features, although true attention mechanisms (like those in transformers) are more complex.
3. Hidden layer: 10 nodes.
4. Output layer: 5 nodes (for the predicted next word (token)).
_________________________________________________________________

![My Image](https://drive.google.com/uc?export=view&id=1lcTDUrhvAgz85zOgFPK2Nzdeiqou3jlP)


In [14]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, Add, Input, Lambda
import tensorflow as tf
from keras import Model


In [15]:
from keras.layers import Dense, Activation, InputLayer, Lambda
import keras.backend as K
from keras.models import Sequential

def build_attention_model(input_dim, attention_nodes, hidden_nodes, output_dim):
    model = Sequential([
        InputLayer(input_shape=(input_dim,)),  # Input layer specifying input shape
        Dense(attention_nodes),  # Attention-like layer without activation
        Activation('softmax'),  # Apply softmax to simulate attention properly
        Dense(hidden_nodes, activation='relu'),  # Hidden layer
        Dense(output_dim, activation='linear')  # Output layer for the embeddings
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    return model

# Build and summarize the model
input_dim = 25  # Total input dimensions (5 words * 5 dimensions each)
attention_nodes = 25  # Nodes in the attention layer
hidden_nodes = 10  # Nodes in the hidden layer
output_dim = 5  # Dimensions of the output embeddings

attention_model = build_attention_model(input_dim, attention_nodes, hidden_nodes, output_dim)
attention_model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 25)                650       
                                                                 
 activation (Activation)     (None, 25)                0         
                                                                 
 dense_1 (Dense)             (None, 10)                260       
                                                                 
 dense_2 (Dense)             (None, 5)                 55        
                                                                 
Total params: 965 (3.77 KB)
Trainable params: 965 (3.77 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


# Build and train the model
**Takes a long time - about an hour on my machine**

**I have saved the trained model (`at_model.h5`). To run the trained model you simply have to load the saved model and run it. **

In [16]:
# Example: Train the model with the training data
attention_model.fit(X_train, y_train, epochs=50, batch_size=32)
attention_model.save('at_model.h5')


Epoch 1/50
Epoch 2/50
Epoch 3/50

KeyboardInterrupt: 

In [17]:
from keras.models import load_model
at_model = load_model('at_model.h5')


#Next word prediction

In [18]:
from keras.models import load_model
from nltk.tokenize import word_tokenize
import numpy as np
from scipy.spatial.distance import cosine
import nltk
nltk.download('punkt')

# Assuming `model` and `at_model` are loaded elsewhere and correctly:
word2vec_model = model  # Assuming `model` is your Word2Vec model loaded correctly
at_model = load_model('at_model.h5')  # Ensure this is loaded correctly as well

def get_positional_embeddings(sequence_length, embedding_dim):
    positional_embeddings = np.zeros((sequence_length, embedding_dim))
    for pos in range(sequence_length):
        for i in range(embedding_dim):
            if i % 2 == 0:
                positional_embeddings[pos, i] = np.sin(pos / (10000 ** ((2 * i) / embedding_dim)))
            else:
                positional_embeddings[pos, i] = np.cos(pos / (10000 ** ((2 * i) / embedding_dim)))
    return positional_embeddings

def preprocess_input_text(text, word2vec_model, sequence_length=5):
    tokens = word_tokenize(text.lower())
    embeddings = [word2vec_model.wv[token] for token in tokens if token in word2vec_model.wv]
    embeddings += [np.zeros(word2vec_model.vector_size)] * (sequence_length - len(embeddings))
    positional_embeddings = get_positional_embeddings(sequence_length, word2vec_model.vector_size)
    modified_embeddings = [embed + pos_embed for embed, pos_embed in zip(embeddings, positional_embeddings)]
    return np.concatenate(modified_embeddings)

def find_top_five_words(embedding, word2vec_model):
    word_distances = [(word, cosine(embedding, word2vec_model.wv[word])) for word in word2vec_model.wv.index_to_key]
    word_distances.sort(key=lambda x: x[1])
    return [word for word, _ in word_distances[:5]]

def next_word_prediction(input_text, neural_model, word2vec_model):
    input_vec = preprocess_input_text(input_text, word2vec_model)
    predicted_embedding = neural_model.predict(np.array([input_vec]))[0]
    return find_top_five_words(predicted_embedding, word2vec_model)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [19]:
# Example usage
input_text = "they have" # @param {type:"string"}
predicted_words = next_word_prediction(input_text, at_model, word2vec_model)



In [20]:
predicted_words

['make', 'creature', 'receipt', 'intended', 'back']

#Text Generation

In [21]:
def find_top_words(embedding, word2vec_model, top_n=20):
    word_distances = [(word, cosine(embedding, word2vec_model.wv[word])) for word in word2vec_model.wv.index_to_key]
    word_distances.sort(key=lambda x: x[1])
    return word_distances[:top_n]

import numpy as np
import random

def generate_text(initial_text, model, word2vec_model, num_words):
    generated_words = word_tokenize(initial_text.lower())
    sequence_length = 5  # Assume the same sequence length as used in training

    for _ in range(num_words):
        # Preprocess the current text to input format
        input_vec = preprocess_input_text(' '.join(generated_words[-sequence_length:]), word2vec_model, sequence_length)
        predicted_embedding = model.predict(np.array([input_vec]))[0]

        # Get top 20 predictions and randomly choose from them
        top_predictions = find_top_words(predicted_embedding, word2vec_model, top_n=20)
        next_word = random.choice(top_predictions)[0]

        # Append the randomly chosen word
        generated_words.append(next_word)

    return ' '.join(generated_words)



In [22]:
# Example usage
initial_text = "they had lived in so" # @param {type:"string"}
num_words_to_generate = 50 # @param {type:"integer"}
generated_text = generate_text(initial_text, at_model, word2vec_model, num_words_to_generate)



In [None]:
generated_text

This is a slightly more advanced language model than the previous model but still performs rather poorly.