
**Text Data Loading and Text Data Pre-Processing**

In [None]:
#Loading Required Libraries and Text Dataset and Pre-processing the Text Dataset

import numpy as np

import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer

from tensorflow.keras.utils import pad_sequences

from tensorflow.keras.layers import Layer, Dense, Embedding, LayerNormalization, Dropout

def load_data(file_path):

  with open(file_path, 'r' , encoding = 'utf-8') as f:

    text_data = f.read()

  return text_data

file_path = 'HP1.txt'

text_data = load_data(file_path).lower()

#1 - TOKENIZATION(TOKENIZING THE TEXT DATA)

tokenizer = Tokenizer(oov_token = '<OOV>')

tokenizer.fit_on_texts([text_data])

total_words = len(tokenizer.word_index) + 1

#Converting the Text into Sequences(Sentences)

input_sequence = []

tokens = tokenizer.texts_to_sequences([text_data])[0]

sequence_length = 50

#First sequence_length tokens(input) -> Used for training the model.

#Last token(target) -> Used as the label which the model tries to predict.

#Each entry of input_sequence therefore holds sequence_length + 1 = 51 tokens.

for i in range(sequence_length,len(tokens)):

  input_sequence.append(tokens[i - sequence_length : i + 1])

print(input_sequence[0])

#Padding sequences and splitting inputs and target tokens post which X will have the input tokens
#and y will have the labels for those input tokens.

input_sequence = np.array(pad_sequences(input_sequence, maxlen = sequence_length + 1, padding = 'pre'))

X = input_sequence[:, :-1]

y = input_sequence[:, -1]

#2 - ENCODING THE TEXT DATA(VIA ONE-HOT ENCODING)

#One-hot encoding the labels. Note that there are other ways to encode text,

#such as pre-trained word2vec embeddings.

y = tf.keras.utils.to_categorical(y, num_classes = total_words)

[2162, 3680, 4, 274, 224, 8, 651, 332, 652, 535, 35, 1268, 5, 164, 20, 21, 35, 1586, 973, 1587, 14, 69, 157, 21, 35, 2, 141, 128, 653, 789, 5, 32, 1588, 12, 169, 490, 110, 1416, 142, 21, 68, 55, 909, 25, 505, 1788, 151, 224, 10, 2, 2701]
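
The sliding-window construction used above can be illustrated on a tiny example. This is a minimal sketch in plain Python: `make_windows` and the toy token IDs are hypothetical names and values chosen for illustration, not part of the notebook.

```python
def make_windows(tokens, sequence_length):
    """Return (inputs, targets): each input holds sequence_length tokens,
    and the target is the single token that follows them."""
    inputs, targets = [], []
    for i in range(sequence_length, len(tokens)):
        window = tokens[i - sequence_length : i + 1]  # sequence_length inputs + 1 target
        inputs.append(window[:-1])
        targets.append(window[-1])
    return inputs, targets

toy_tokens = [10, 11, 12, 13, 14, 15]  # hypothetical token IDs
toy_X, toy_y = make_windows(toy_tokens, sequence_length=3)

for context, target in zip(toy_X, toy_y):
    print(context, "->", target)
```

Each printed row pairs a context window with the token the model is asked to predict, which mirrors how `X` and `y` are split from `input_sequence`.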


In the reference configuration, the Query, Key and Value matrices of each Self-Attention head have 64 dimensions,
and with 8 heads the combined multi-head output has 64 * 8 = 512 dimensions in total.
The model trained below uses a smaller configuration (embed_dim = 128 with 4 heads), so each head works with 128 / 4 = 32 dimensions.
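
The head arithmetic can be checked with a small NumPy sketch. The batch size and sequence length below are hypothetical toy values; only the 512 and 8 come from the text.

```python
import numpy as np

embed_dim, num_heads = 512, 8            # values quoted in the text
projection_dim = embed_dim // num_heads  # 64 dimensions per head

batch_size, seq_len = 2, 4               # hypothetical toy sizes
x = np.zeros((batch_size, seq_len, embed_dim))

# Split the last axis into (num_heads, projection_dim),
# then move the head axis ahead of the sequence axis.
heads = x.reshape(batch_size, seq_len, num_heads, projection_dim).transpose(0, 2, 1, 3)
print(heads.shape)   # (2, 8, 4, 64)

# Undo the split: 8 heads of 64 dimensions concatenate back to 512.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, embed_dim)
print(merged.shape)  # (2, 4, 512)
```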

**Defining the Transformer Model**

In [None]:
class Multi_Head_Attention(Layer):

  def __init__(self, embed_dim, num_heads):

    super(Multi_Head_Attention, self).__init__()

    self.embed_dim = embed_dim  # 512 Dimensions

    self.num_heads = num_heads  # 8 Self-Attention Layers

    self.projection_dim = embed_dim // num_heads #Query, Key and Value Matrices will be of 64 Dimensions in each Self-Attention Layer/Head.

    self.query_dense = Dense(embed_dim)  #Dense layer for Query Matrix

    self.key_dense = Dense(embed_dim)  #Dense layer for Key Matrix

    self.value_dense = Dense(embed_dim) #Dense layer for Value Matrix

    self.combine_all_layers = Dense(embed_dim) #Dense layer for the Combined Self-Attention Layers(512 Dimensions)


  def compute_attention_score(self, query, key, value):

    attention_score = tf.matmul(query, key, transpose_b = True) #Computing Dot Product of Query and Key(Transposed) Matrix

    attention_score = attention_score / tf.math.sqrt(tf.cast(self.projection_dim, tf.float32)) #Scaling the dot product by the square root of the per-head dimension (projection_dim), which is cast from int to float first

    attention_score_probability = tf.nn.softmax(attention_score, axis = -1) #Each row holds the attention weights of one token(word), and these sum up to 1

    final_attention_score = tf.matmul(attention_score_probability, value) #Weighted sum of the Value vectors, using the softmax weights rather than the raw scores

    return final_attention_score


  def split_layers_to_each_individual_layer(self, input, batch_size):

    # Updated Input Shape -> Batch_size, num_of words, num_of heads, projection_dim

    input = tf.reshape(input, (batch_size, -1, self.num_heads, self.projection_dim))

    #Shape we want thus updated -> batch_size, num_of heads, num_of_words(sequence_length), projection

    #batch_size of (8 Self-attention layers of (4 words * 64 dimensions))

    return tf.transpose(input, perm = [0, 2, 1, 3])


  def call(self, inputs):

    # Current Input Shape -> Batch_size, Num of words(sequence_length), Embedded Dims

    batch_size = tf.shape(inputs)[0]

    query = self.query_dense(inputs)

    key = self.key_dense(inputs)

    value = self.value_dense(inputs)

    query = self.split_layers_to_each_individual_layer(query, batch_size)

    key = self.split_layers_to_each_individual_layer(key, batch_size)

    value = self.split_layers_to_each_individual_layer(value, batch_size)

    attention_score = self.compute_attention_score(query, key, value)

    #Current Input Shape I have -> batch_size, num_of_heads, num_of_words(sequence_length), projection_dim

    #Input Shape I want -> batch_size, num_of words(sequence_length), num_of heads, projection_dim

    attention_score = tf.transpose(attention_score, perm = [0, 2, 1, 3])

    final_combined_attention_score = tf.reshape(attention_score, (batch_size, -1, self.embed_dim))

    return self.combine_all_layers(final_combined_attention_score)
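
Scaled dot-product attention for a single head, the operation that `compute_attention_score` is built around, can be sketched in NumPy as follows. The function name and the random toy inputs are assumptions made for illustration.

```python
import numpy as np

def scaled_dot_product_attention(query, key, value):
    """Compute softmax(Q K^T / sqrt(d)) V for arrays of shape (seq_len, d)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 64)) for _ in range(3))  # 4 tokens, 64 dims per head

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape)          # (4, 64)
print(weights.sum(axis=-1))  # each row sums to 1, up to rounding
```

The output is computed from the softmax weights, not from the raw scores; the weights are what make each output row a weighted average of the Value vectors.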

In [None]:
class Transformer_Block(Layer):

  def __init__(self, embed_dim, num_heads, simple_feed_forward_nn_dim, dropout_rate):

      super(Transformer_Block, self).__init__()

      self.attention = Multi_Head_Attention(embed_dim, num_heads)

      self.simple_feed_forward_nn = tf.keras.Sequential([

                                              Dense(simple_feed_forward_nn_dim, activation = 'relu'),

                                              Dense(embed_dim)

                                              ])

      #Layer normalization: (input - mean) / standard deviation over the feature axis, followed by a learned scale and offset

      self.normalization_layer_1 = LayerNormalization(epsilon = 1e-6)

      self.normalization_layer_2 = LayerNormalization(epsilon = 1e-6)

      self.dropout_layer_1 = Dropout(dropout_rate)

      self.dropout_layer_2 = Dropout(dropout_rate)


  def call(self, inputs, training = False):

     attention_output = self.attention(inputs)

     attention_output = self.dropout_layer_1(attention_output, training = training)

     output_1 = self.normalization_layer_1(inputs + attention_output)  #Residual Connection

     feed_forward_nn_output = self.simple_feed_forward_nn(output_1)

     feed_forward_nn_output = self.dropout_layer_2(feed_forward_nn_output, training = training)

     return self.normalization_layer_2(output_1 + feed_forward_nn_output) #Residual Connection
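
The "add the residual, then normalize" step can be sketched in NumPy. This simplified version omits the learned scale and offset that Keras's `LayerNormalization` also applies, and the random arrays are stand-ins for real activations.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize over the last axis: (x - mean) / sqrt(variance + epsilon)."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 128))           # 4 tokens, 128-dim embeddings (toy values)
sublayer_output = rng.normal(size=(4, 128))  # stand-in for attention or feed-forward output

# Residual connection first, then normalization.
normalized = layer_norm(inputs + sublayer_output)
print(normalized.mean(axis=-1).round(6))  # close to 0 for every token
print(normalized.std(axis=-1).round(3))   # close to 1 for every token
```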

In [None]:
class Tokenization_And_Positional_Embedding(Layer):

  def __init__(self, max_len, vocab_size, embed_dim):

    super(Tokenization_And_Positional_Embedding, self).__init__()

    self.tokenization_layer = Embedding(input_dim = vocab_size, output_dim = embed_dim)

    self.positional_embedding_layer = Embedding(input_dim = max_len, output_dim = embed_dim)


  def call(self, word_input):

    #The max sequence length the model can handle

    max_len = tf.shape(word_input)[-1] #Sets max_len to the length of the input sequence

    word_positions = tf.range(start = 0, limit = max_len, delta = 1) #Generating unique positions [0, 1, 2  up to max_len - 1]

    word_positions = self.positional_embedding_layer(word_positions) #Each word position index is mapped to a trainable embedding of shape (max_len, embed_dim)

    word_input = self.tokenization_layer(word_input) #Each token ID in word input is mapped to an embedding of shape batch_size, max_len, and embed_dim

    return word_input + word_positions
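
How the token and positional embeddings combine can be sketched with plain NumPy lookup tables. The table sizes and token IDs are hypothetical toy values; in the layer above both tables are trainable `Embedding` weights.

```python
import numpy as np

vocab_size, max_len, embed_dim = 20, 5, 8  # hypothetical toy sizes
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, embed_dim))  # one row per token ID
position_table = rng.normal(size=(max_len, embed_dim))  # one row per position

token_ids = np.array([[3, 7, 7, 1, 0],
                      [4, 4, 2, 9, 5]])                 # (batch_size, max_len)

token_vectors = token_table[token_ids]                  # (2, 5, 8)
position_vectors = position_table[np.arange(max_len)]   # (5, 8)
embedded = token_vectors + position_vectors             # broadcast over the batch axis
print(embedded.shape)  # (2, 5, 8)

# The same token (ID 7) at positions 1 and 2 ends up with different vectors.
print(np.allclose(embedded[0, 1], embedded[0, 2]))  # False
```

Adding the position vector is what lets the attention layers distinguish repeated words by where they occur.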


**Modelling the Whole Transformer Architecture, Compiling and Training the Model**

In [None]:
#Model Parameters

embed_dim = 128 #Embedding Size

num_heads = 4 #Number of attention heads

simple_feed_forward_nn_dim = 512 #Feed Forward layer size

max_len = sequence_length #Already previously defined as 50

#total_words = 6663 (6662 entries in tokenizer.word_index, plus index 0 which is reserved for padding)

#Building the Model

inputs = tf.keras.Input(shape =(max_len,))

word_embedding_layer = Tokenization_And_Positional_Embedding(max_len, total_words, embed_dim)

x = word_embedding_layer(inputs)

print(x.shape)

transformer_block = Transformer_Block(embed_dim, num_heads, simple_feed_forward_nn_dim, dropout_rate = 0.2)

x = transformer_block(x) #Keras supplies the training flag itself, so dropout is active during fit() and disabled during predict()

print(x.shape)

x = x[:,-1,:]

print(x.shape)

x = Dense(total_words, activation = 'softmax')(x)

print(x.shape)

transformer_model = tf.keras.Model(inputs = inputs, outputs = x)

#Compiling the Model

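transformer_model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy']) #categorical_crossentropy matches the softmax output and one-hot labels; binary_crossentropy would average over all classes and report a misleadingly small loss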

transformer_model.summary()

(None, 50, 128)
(None, 50, 128)
(None, 128)
(None, 6663)
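
Because the model ends in a softmax over `total_words` classes and the labels are one-hot, the choice of loss function changes the scale of the reported loss considerably. The NumPy sketch below scores the same uniform prediction with both losses; the helper functions are simplified stand-ins written for illustration, not the Keras implementations.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """-sum(y_true * log(y_pred)): only the true class contributes."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Treats every class as its own yes/no problem and averages over classes."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_class = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_class.mean(axis=-1)

num_classes = 6663                                # vocabulary size from the output above
y_true = np.zeros(num_classes)
y_true[42] = 1.0                                  # arbitrary target class
y_pred = np.full(num_classes, 1.0 / num_classes)  # uniform, uninformed prediction

cce = categorical_crossentropy(y_true, y_pred)
bce = binary_crossentropy(y_true, y_pred)
print(cce)  # ln(6663), roughly 8.8
print(bce)  # far smaller, because thousands of easy "not this word" terms dilute the average
```

A very small loss value under binary cross-entropy therefore does not indicate that the model predicts the next word well.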


In [None]:
transformer_model_performance = transformer_model.fit(X, y, batch_size = 32, epochs = 17)

Epoch 1/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 319s 124ms/step - accuracy: 0.0433 - loss: 0.0222
Epoch 2/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 330s 127ms/step - accuracy: 0.1249 - loss: 0.0011
Epoch 3/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 376s 125ms/step - accuracy: 0.1662 - loss: 9.3766e-04
Epoch 4/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 322s 125ms/step - accuracy: 0.1898 - loss: 8.4480e-04
Epoch 5/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 320s 124ms/step - accuracy: 0.2112 - loss: 7.6875e-04
Epoch 6/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 316s 122ms/step - accuracy: 0.2362 - loss: 7.0413e-04
Epoch 7/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 330s 125ms/step - accuracy: 0.2756 - loss: 6.4419e-04
Epoch 8/17
2531/2531 ━━━━━━━━━━━━━━━━━━━━ 323s 126ms/step - accu

In [None]:
#Defining the Function to Generate Text and Generating Text Post Training

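def generate_text(seed_text, next_words, max_sequence_length):

  for _ in range(next_words):

    token_list = tokenizer.texts_to_sequences([seed_text])[0] #Token IDs of the text generated so far

    #The model expects exactly max_sequence_length tokens: short inputs are padded on the left and long inputs keep only their most recent tokens

    token_list = pad_sequences([token_list], maxlen = max_sequence_length, padding = 'pre')

    predicted_text = transformer_model.predict(token_list, verbose = 0)

    predicted_index = int(np.argmax(predicted_text)) #Greedy decoding: pick the most probable next token

    predicted_word = tokenizer.index_word.get(predicted_index, '<OOV>') #Index 0 (padding) has no entry in index_word

    seed_text = seed_text + " " + predicted_word

  return seed_text #Returned after the loop so that all next_words tokens are generated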

seed_text = "harry looked at"

model_generated_text = generate_text(seed_text, 25, sequence_length)

print(len(model_generated_text))

print(model_generated_text)

Differences between this Simple Transformer Model and ChatGPT:

*   Masked Attention:

ChatGPT uses causal masking so that a token cannot attend to future tokens during training, whereas this model uses
unmasked attention, so every position can attend to the entire input sequence.

*   Multiple Stacked Transformer Blocks:

GPT-style models stack many Transformer blocks (e.g. 12, 24 or 96 layers), whereas our current model has only one Transformer block.


*   Tokenization and Byte-Pair Encoding(BPE):

ChatGPT does not use simple word-level tokenization; it uses Byte-Pair Encoding (BPE), a subword technique, for better vocabulary handling (WordPiece is a similar subword technique used by other models). Our current model, on the other hand, uses only basic word tokenization.

*   Training on Large Datasets:

ChatGPT is trained on hundreds of GBs of text data, whereas our current model is trained on a single Harry Potter book.


*   Decoding Strategies for Text Generation:

ChatGPT uses sampling (Top-K, Nucleus Sampling) or Beam Search to generate text, whereas our current model uses only greedy decoding, always picking the single most probable next word.
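
Two of the differences listed above, causal masking and Top-K sampling, can be sketched in NumPy. The function names and toy inputs are hypothetical, and this illustrates the general techniques rather than how ChatGPT implements them.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, with masked-out positions given zero weight."""
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def sample_top_k(probabilities, k, rng):
    """Sample a token ID from the k most probable tokens, renormalized."""
    top_ids = np.argsort(probabilities)[-k:]
    top_probs = probabilities[top_ids] / probabilities[top_ids].sum()
    return int(rng.choice(top_ids, p=top_probs))

rng = np.random.default_rng(0)

# Causal masking: attention weights to future positions become 0.
weights = masked_softmax(rng.normal(size=(4, 4)), causal_mask(4))
print(np.round(weights, 2))

# Top-K sampling: with k=2 only the two most probable tokens (IDs 1 and 3) can be drawn.
probabilities = np.array([0.05, 0.4, 0.1, 0.3, 0.15])  # toy next-token distribution
print(sample_top_k(probabilities, k=2, rng=rng))
```

Unlike greedy `argmax` decoding, sampling can return different continuations for the same seed text, which tends to reduce repetitive output.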