**Practical 10**

**Aim:** To build a simple text generation model using a neural network that can predict the next word in a sequence based on a given corpus of text.

**Theory:**

* Tokenization: The text corpus is converted into a sequence of numerical tokens, where each unique word is assigned a unique integer ID.
* N-grams: The training data is built from n-grams, which are contiguous sequences of n items (here, words). For each sentence, every prefix of two or more words becomes a training sequence: "jack likes apples" yields "jack likes" (2-gram) and "jack likes apples" (3-gram). The model is trained to predict the last word of each sequence from the preceding words.
* Padding: Since the sequences have different lengths, shorter ones are padded at the front with zeros (padding='pre') so that all input sequences have the same length.
* Embedding Layer: This layer converts the integer token IDs into dense vectors (embeddings) that capture semantic relationships between words.
* Flatten Layer: This layer flattens the output of the embedding layer into a 1D vector for feeding into the dense layers.
* Dense Layers: These are fully connected layers that learn complex patterns in the data.
* Softmax Activation: The final dense layer uses a softmax activation function to output a probability distribution over the vocabulary, indicating the likelihood of each word being the next word in the sequence.
* Categorical Crossentropy Loss: This loss function is used to measure the difference between the predicted probability distribution and the actual next word (represented as a one-hot encoded vector).
* Adam Optimizer: This optimizer is used to update the model's weights during training to minimize the loss function.
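
The first three steps above (tokenization, n-gram prefixes, padding) can be sketched in plain Python without Keras. This is a minimal illustration, not the Keras implementation: the helper names `build_word_index`, `prefix_ngrams` and `pad_pre` are hypothetical, and the tokenizer here only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation. IDs are assigned most-frequent-first, which is assumed to mimic the ordering seen in the vocabulary printed later in this practical.

```python
from collections import Counter

corpus = [
    "jack likes apples", "jill likes oranges", "jack eats food",
    "jill eats fruits", "apples are tasty", "oranges are sweet",
]

def build_word_index(lines):
    """Assign IDs from 1 upward, most frequent word first (ties keep first-seen order)."""
    counts = Counter(word for line in lines for word in line.lower().split())
    ordered = sorted(counts, key=lambda word: -counts[word])  # sorted() is stable
    return {word: i + 1 for i, word in enumerate(ordered)}

def prefix_ngrams(token_ids):
    """Return every prefix of length >= 2; the last ID of each prefix is the label."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

def pad_pre(sequence, length):
    """Left-pad with 0, the ID reserved for padding."""
    return [0] * (length - len(sequence)) + sequence

word_index = build_word_index(corpus)
token_ids = [word_index[w] for w in "jack likes apples".split()]  # [1, 2, 3]
ngrams = prefix_ngrams(token_ids)                                  # [[1, 2], [1, 2, 3]]
max_len = max(len(s) for s in ngrams)
padded = [pad_pre(s, max_len) for s in ngrams]                     # [[0, 1, 2], [1, 2, 3]]

# Inputs are all but the last token; the label is the last token.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]
print(X, y)
```

For the sentence "jack likes apples" this gives two training pairs: (padding, jack) → likes and (jack, likes) → apples.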

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

In [None]:
# 1. Sample corpus
corpus = [
"jack likes apples",
"jill likes oranges",
"jack eats food",
"jill eats fruits",
"apples are tasty",
"oranges are sweet"
]

In [None]:
# 2. Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1 # vocab size

print("Vocabulary:", tokenizer.word_index)

Vocabulary: {'jack': 1, 'likes': 2, 'apples': 3, 'jill': 4, 'oranges': 5, 'eats': 6, 'are': 7, 'food': 8, 'fruits': 9, 'tasty': 10, 'sweet': 11}


In [None]:
# 3. Generate training sequences
input_sequences = []
for line in corpus:
  token_list = tokenizer.texts_to_sequences([line])[0]
  print("Token List",token_list)
for i in range(1, len(token_list)):
  n_gram_seq = token_list[:i+1]
input_sequences.append(n_gram_seq)
print("Input Seq",input_sequences)
# Pad sequences
max_seq_len = max(len(x) for x in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')

Token List [1, 2, 3]
Token List [4, 2, 5]
Token List [1, 6, 8]
Token List [4, 6, 9]
Token List [3, 7, 10]
Token List [5, 7, 11]
Input Seq [[5, 7, 11]]


In [None]:
X, y = input_sequences[:,:-1], input_sequences[:,-1]
print(X,y)
# Convert labels to one-hot
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

print("Training shape:", X.shape, y.shape)

[[5 7]] [11]
Training shape: (1, 2) (1, 12)


In [None]:
# 4. Build model
model = Sequential()
model.add(Embedding(total_words, 10, input_length=max_seq_len-1))
model.add(Flatten()) # feedforward style
model.add(Dense(64, activation='relu'))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()



In [None]:
# 5. Train
model.fit(X, y, epochs=20, verbose=1)

Epoch 1/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step - accuracy: 0.0000e+00 - loss: 2.4816
Epoch 2/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 66ms/step - accuracy: 0.0000e+00 - loss: 2.4721
Epoch 3/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 62ms/step - accuracy: 0.0000e+00 - loss: 2.4630
Epoch 4/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step - accuracy: 0.0000e+00 - loss: 2.4539
Epoch 5/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 135ms/step - accuracy: 0.0000e+00 - loss: 2.4457
Epoch 6/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 142ms/step - accuracy: 1.0000 - loss: 2.4376
Epoch 7/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 66ms/step - accuracy: 1.0000 - loss: 2.4297
Epoch 8/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 138ms/step - accuracy: 1.0000 - loss: 2.4217
Epoch 9/20
[1m1/1[0m [32m━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7878c6385250>

In [None]:
# 6. Generate text
def predict_next_word(seed_text, next_words=3):
  for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_seq_len-1, padding='pre')
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)
    print("Predicted index ",predicted)
    for word, index in tokenizer.word_index.items():
      print('word ', word,'index ',index)
      if index == predicted:
        seed_text += " " + word
        break
  return seed_text
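
The word lookup in the function above scans the whole vocabulary on every prediction. A reverse mapping from ID to word does the same job in one step. The sketch below is plain Python and assumes the vocabulary printed in step 2; `decode` and `append_prediction` are hypothetical helper names. The Keras `Tokenizer` also exposes an `index_word` dictionary that could serve as the reverse mapping directly.

```python
# Vocabulary as printed in step 2.
word_index = {
    'jack': 1, 'likes': 2, 'apples': 3, 'jill': 4, 'oranges': 5, 'eats': 6,
    'are': 7, 'food': 8, 'fruits': 9, 'tasty': 10, 'sweet': 11,
}

# Invert the mapping once: ID -> word.
index_word = {index: word for word, index in word_index.items()}

def decode(predicted_id):
    """Return the word for an ID, or None for 0 (padding) or an unknown ID."""
    return index_word.get(int(predicted_id))

def append_prediction(seed_text, predicted_id):
    """Append the decoded word to the seed text; leave it unchanged if there is none."""
    word = decode(predicted_id)
    return seed_text if word is None else seed_text + " " + word

print(append_prediction("jack likes", 11))
```

ID 0 is reserved for padding and has no word, so a prediction of 0 leaves the seed text unchanged, which matches the behaviour of the original loop.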

In [None]:
print(predict_next_word("jack likes"))
print(predict_next_word("jill"))

Predicted index  [11]
word  jack index  1
word  likes index  2
word  apples index  3
word  jill index  4
word  oranges index  5
word  eats index  6
word  are index  7
word  food index  8
word  fruits index  9
word  tasty index  10
word  sweet index  11
Predicted index  [11]
word  jack index  1
word  likes index  2
word  apples index  3
word  jill index  4
word  oranges index  5
word  eats index  6
word  are index  7
word  food index  8
word  fruits index  9
word  tasty index  10
word  sweet index  11
Predicted index  [11]
word  jack index  1
word  likes index  2
word  apples index  3
word  jill index  4
word  oranges index  5
word  eats index  6
word  are index  7
word  food index  8
word  fruits index  9
word  tasty index  10
word  sweet index  11
jack likes sweet sweet sweet
Predicted index  [11]
word  jack index  1
word  likes index  2
word  apples index  3
word  jill index  4
word  oranges index  5
word  eats index  6
word  are index  7
word  food index  8
word  fruits index  9
wor

**Observation:**

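* The training data (X, y) consists of only one sample ([[5, 7]], [11]). In step 3, the inner loop and the input_sequences.append call sit outside the "for line in corpus" loop, so they run only once, on the token list of the last line, and only the final n-gram [5, 7, 11] is stored.
* The model reached 100% training accuracy at epoch 6. With a single training sample, accuracy can only be 0 or 1, so this reflects memorizing one label rather than learning the task. The loss fell only from 2.4816 to 2.4217 over the first eight epochs, so the correct prediction is still made with low confidence.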
* When predicting the next word for "jack likes" and "jill", the model consistently predicts "sweet". This is because the word "sweet" (token ID 11) is the only word that appears as the last word in the generated training sequence [[5, 7, 11]].
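
The loss values in the training log can be put in context with a short calculation. The sketch below implements softmax and categorical crossentropy in plain Python (the function names are illustrative, not the Keras internals) and computes the loss of a model that assigns equal probability to all 12 output classes, which is ln(12) ≈ 2.485. The recorded epoch-1 loss of 2.4816 sits almost exactly at this baseline.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

vocab_size = 12   # total_words: 11 words plus the padding ID 0
target = 11       # token ID of "sweet"
one_hot = [1.0 if i == target else 0.0 for i in range(vocab_size)]

# Equal logits give a uniform distribution, i.e. an untrained guess.
uniform = softmax([0.0] * vocab_size)
baseline = categorical_crossentropy(one_hot, uniform)   # equals ln(12)

# For a single sample, loss = -ln(p), so p = exp(-loss).
implied_p = math.exp(-2.4217)   # probability of "sweet" implied by the epoch-8 loss

print(round(baseline, 4), round(implied_p, 3))
```

The epoch-8 loss corresponds to a probability of roughly 0.089 for "sweet", only slightly above the uniform 1/12 ≈ 0.083. The prediction becomes "correct" as soon as that class edges ahead of the others, which is why accuracy jumps to 1.0 while the loss has barely moved.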

**Conclusion:** The model memorized the single training sample and predicts "sweet" for any input, so it is not usable for general text generation. Correcting the sequence generation so that every n-gram of every line is used, and training on a larger corpus, are needed for a useful model.
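
The effect of the loop placement can be reproduced without Keras by starting from the token lists printed in step 3. In the sketch below, `sequences_as_written` mirrors the control flow of the original cell and `sequences_nested` is one possible correction with the inner loop and the append nested inside the per-line loop. Both function names are hypothetical, and the corrected version is an assumption about the intended behaviour, based on the n-gram description in the theory section.

```python
# Token lists as printed by step 3.
token_lists = [[1, 2, 3], [4, 2, 5], [1, 6, 8], [4, 6, 9], [3, 7, 10], [5, 7, 11]]

def sequences_as_written(lists):
    """Same control flow as the original cell: the loops are not nested."""
    sequences = []
    for token_list in lists:
        pass  # the original only prints here
    # Runs once, on the last token_list left over from the loop above.
    for i in range(1, len(token_list)):
        n_gram_seq = token_list[: i + 1]
    sequences.append(n_gram_seq)  # only the final n-gram is kept
    return sequences

def sequences_nested(lists):
    """Possible fix: build and store every prefix n-gram of every line."""
    sequences = []
    for token_list in lists:
        for i in range(1, len(token_list)):
            sequences.append(token_list[: i + 1])
    return sequences

print(sequences_as_written(token_lists))   # [[5, 7, 11]]
fixed = sequences_nested(token_lists)
print(len(fixed), fixed[:2])               # 12 [[1, 2], [1, 2, 3]]
```

With the nested version, the six three-word sentences yield 12 training sequences instead of one, and every word except the sentence-initial ones appears as a label at least once.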