
# LSTM + Skip-gram Summarizer and Quiz Generator: Essential Knowledge

---

## What is this project?

Build a system that:
- Uses **LSTM with Skip-gram embeddings**.
- Performs **text summarization**.
- Generates **fill-in-the-blank MCQs** from the summaries.

---

## Why Skip-gram Embeddings?

- Converts words into **dense, meaningful vectors**.
- Words with similar meanings have **similar vectors**.
- Gives the LSTM informative inputs from the start, so it does not have to learn word meaning from scratch.

---

## Why LSTM?

- Handles **sequential data (text)** with memory.
- Useful for **sequence-to-sequence tasks** like summarization.
- Learns long-range dependencies in text.

---

## Why Encoder-Decoder LSTM?

- Summarization needs **variable-length outputs**.
- **Encoder**:
   - Reads the entire input sentence.
   - Converts it into hidden and cell states (context).
- **Decoder**:
   - Uses encoder’s context to generate the summary word-by-word.
   - Produces an output sequence whose length does not depend on the input length.

**Analogy**:
- Encoder: Reading and understanding a paragraph.
- Decoder: Explaining it in your own words.

---

## Pipeline Recap

1. **Preprocess text**: clean, tokenize, pad.
2. **Train/load Skip-gram Word2Vec embeddings**.
3. **Build embedding matrix** for your tokenizer vocabulary.
4. **Encoder-Decoder LSTM**:
   - Encoder: Embedding + LSTM → states.
   - Decoder: Embedding + LSTM (with encoder states) → Dense softmax.
5. **Train with teacher forcing** for sequence generation.
6. **Inference**:
   - Use encoder to get states.
   - Use decoder to generate summaries one word at a time.
7. **Generate MCQs**:
   - Extract key terms (nouns/entities) from summaries.
   - Replace with blanks for fill-in-the-blank questions.
8. **Export as JSON** for quiz use.
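
Step 8 is not shown in the cells that follow, so here is a minimal sketch of what the JSON export could look like. The record fields (`question`, `options`, `answer`), the top-level `questions` key, and the helper name `export_mcqs` are assumptions for illustration, not a fixed schema.

```python
import json

# Hypothetical MCQ records; the field names are an assumption for illustration.
mcqs = [
    {
        "question": "Plants make food from ______.",
        "options": ["water", "sunlight", "soil", "air"],
        "answer": "sunlight",
    },
]

def export_mcqs(mcqs, path=None):
    """Serialize MCQs to a JSON string and optionally write it to `path`."""
    payload = json.dumps({"questions": mcqs}, indent=2, ensure_ascii=False)
    if path is not None:
        with open(path, "w", encoding="utf-8") as f:
            f.write(payload)
    return payload

quiz_json = export_mcqs(mcqs)
print(quiz_json)
```

Passing a `path` writes the same string to disk, which a quiz front end could then load with `json.load`.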

---

## Key Terms

- **Tokenization**: Splitting text into words/tokens.
- **Embedding**: Converting words into numeric vectors.
- **Cosine Similarity**: Measures how similar two vectors are.
- **Teacher Forcing**: Feeding the decoder the true previous word during training instead of its own prediction.
- **Inference**: Generating output one token at a time from the model’s own predictions.
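
Cosine similarity is the one term above with a compact formula: the dot product of two vectors divided by the product of their norms. The sketch below computes it with NumPy; the helper name `cosine_similarity` and the zero-vector convention (return 0.0) are choices made for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return cos(angle) between vectors a and b, in the range [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))             # orthogonal vectors
```

With trained embeddings, the same function can be applied to two word vectors to check whether related words ended up close together.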

---

## Why this project is valuable

- Reinforces **practical NLP** (tokenization, embeddings).
- Gives experience with **sequence models (LSTM)**.
- Builds a **real, educational tool** you can demo or deploy.
- Covers the **full pipeline, from data preparation to generation**.

---



In [None]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

%pip install gensim
from gensim.models import Word2Vec






In [None]:
import nltk
nltk.download('punkt')  # the resource name is 'punkt' (the misspelling 'puntk' is not in the NLTK index)

# Data Preparation

In [None]:
texts = [
    "Photosynthesis is the process by which plants make their food using sunlight.",
    "Mitochondria are the powerhouse of the cell and produce energy.",
    "Water boils at 100 degrees Celsius under normal conditions."
]

summaries = [
    "Plants make food from sunlight.",
    "Mitochondria produce energy.",
    "Water boils at 100 degrees."
]


In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts+summaries)

input_sequences = tokenizer.texts_to_sequences(texts)
target_sequences = tokenizer.texts_to_sequences(summaries)


max_input_len = max(len(seq) for seq in input_sequences)
max_target_len = max(len(seq) for seq in target_sequences)

encoder_input = pad_sequences(input_sequences, maxlen=max_input_len, padding='post')
decoder_input = pad_sequences(target_sequences, maxlen=max_target_len, padding='post')

In [None]:
print(tokenizer.word_index)

{'the': 1, 'plants': 2, 'make': 3, 'food': 4, 'sunlight': 5, 'mitochondria': 6, 'produce': 7, 'energy': 8, 'water': 9, 'boils': 10, 'at': 11, '100': 12, 'degrees': 13, 'photosynthesis': 14, 'is': 15, 'process': 16, 'by': 17, 'which': 18, 'their': 19, 'using': 20, 'are': 21, 'powerhouse': 22, 'of': 23, 'cell': 24, 'and': 25, 'celsius': 26, 'under': 27, 'normal': 28, 'conditions': 29, 'from': 30}



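`fit_on_texts` builds `word_index` from word frequencies:
- It counts how often each word appears in the dataset.
- More frequent words get **smaller indices** (`'the'` appears three times, so it gets 1).
- Less frequent words get **larger indices**; index 0 is reserved for padding.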
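
The ranking can be reproduced without Keras. The sketch below is a simplified stand-in for the tokenizer, not its actual implementation: it lowercases, keeps runs of letters and digits (an assumption that happens to be sufficient for this data), counts words, and assigns indices by descending frequency with ties kept in first-seen order. The helper name `build_word_index` is hypothetical.

```python
import re
from collections import Counter

texts = [
    "Photosynthesis is the process by which plants make their food using sunlight.",
    "Mitochondria are the powerhouse of the cell and produce energy.",
    "Water boils at 100 degrees Celsius under normal conditions.",
]
summaries = [
    "Plants make food from sunlight.",
    "Mitochondria produce energy.",
    "Water boils at 100 degrees.",
]

def build_word_index(corpus):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter()
    for text in corpus:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index(texts + summaries)
print(word_index["the"], word_index["plants"], word_index["from"])
```

On these six sentences the result should match the `tokenizer.word_index` printed above, with `'the'` first and `'from'` last.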


In [None]:
print(input_sequences)
print(target_sequences)

[[14, 15, 1, 16, 17, 18, 2, 3, 19, 4, 20, 5], [6, 21, 1, 22, 23, 1, 24, 25, 7, 8], [9, 10, 11, 12, 13, 26, 27, 28, 29]]
[[2, 3, 4, 30, 5], [6, 7, 8], [9, 10, 11, 12, 13]]


In [None]:
print(max_input_len,max_target_len)

12 5


In [None]:
print(encoder_input)

[[14 15  1 16 17 18  2  3 19  4 20  5]
 [ 6 21  1 22 23  1 24 25  7  8  0  0]
 [ 9 10 11 12 13 26 27 28 29  0  0  0]]


In [None]:
print(decoder_input)

[[ 2  3  4 30  5]
 [ 6  7  8  0  0]
 [ 9 10 11 12 13]]


- `pad_sequences` appends zeros to the end of each sequence (`padding='post'`) so that all sequences have the same length.
- The result is a set of **fixed-shape arrays** that can be batched and fed to the model.
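
To make the padding step concrete, here is a small NumPy sketch of post-padding. It is an illustration rather than the Keras implementation; the helper name `pad_post` is hypothetical, and over-long sequences keep their last `maxlen` items, mirroring the Keras default `truncating='pre'`.

```python
import numpy as np

def pad_post(sequences, maxlen=None, value=0):
    """Right-pad integer sequences with `value` to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # drop items from the front if too long
        out[row, :len(kept)] = kept   # real tokens first, padding after
    return out

target_sequences = [[2, 3, 4, 30, 5], [6, 7, 8], [9, 10, 11, 12, 13]]
padded = pad_post(target_sequences)
print(padded)
```

Applied to the target sequences, this should give the same array as `decoder_input` above.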

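Teacher forcing means feeding the decoder the correct previous word at each training step, rather than its own prediction, so that early mistakes do not compound. The training target is therefore the decoder input shifted one step to the left.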

In [None]:
decoder_target = np.zeros_like(decoder_input)

In [None]:
decoder_target[:,:-1]=decoder_input[:,1:]

In [None]:
decoder_target[:, -1] = 0

In [None]:
print(decoder_target)

[[ 3  4 30  5  0]
 [ 7  8  0  0  0]
 [10 11 12 13  0]]
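
In the cells above, the decoder input is the summary itself, so the first summary word never appears as a target and there is no dedicated token to start decoding from. A common variant wraps each summary in start and end markers before shifting. The sketch below illustrates that variant on the notebook's target sequences; the function name `make_decoder_pairs` and the ids 31 and 32 are hypothetical, and in practice the markers would usually be added as words to the summaries before fitting the tokenizer.

```python
import numpy as np

def make_decoder_pairs(target_sequences, start_id, end_id, pad_id=0):
    """Build teacher-forcing (input, target) arrays with start/end markers."""
    wrapped = [[start_id] + list(seq) + [end_id] for seq in target_sequences]
    steps = max(len(seq) for seq in wrapped) - 1
    dec_in = np.full((len(wrapped), steps), pad_id, dtype="int32")
    dec_out = np.full((len(wrapped), steps), pad_id, dtype="int32")
    for row, seq in enumerate(wrapped):
        dec_in[row, :len(seq) - 1] = seq[:-1]   # start, w1, ..., wn
        dec_out[row, :len(seq) - 1] = seq[1:]   # w1, ..., wn, end
    return dec_in, dec_out

target_sequences = [[2, 3, 4, 30, 5], [6, 7, 8], [9, 10, 11, 12, 13]]
# Hypothetical ids for the start and end markers (outside the 1..30 vocabulary).
dec_in, dec_out = make_decoder_pairs(target_sequences, start_id=31, end_id=32)
print(dec_in)
print(dec_out)
```

With this layout, every summary word is predicted at some step, and the end marker gives inference a learned stopping signal. The vocabulary size would need to grow to cover the two extra ids.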


In [None]:
from nltk.tokenize import word_tokenize

In [None]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
tokenized_texts = [nltk.word_tokenize(text.lower()) for text in texts + summaries]
w2v_model = Word2Vec(sentences=tokenized_texts, vector_size=50, window=2, min_count=1, sg=1, epochs=200)


In [None]:
print('cat' in w2v_model.wv)  # check whether a word exists in the Word2Vec vocabulary

False


In [None]:
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
    else:
        # Not strictly required, but a random vector keeps an out-of-vocabulary
        # word distinct from padding (the all-zero row 0).
        embedding_matrix[i] = np.random.normal(scale=0.6, size=(embedding_dim,))


In [None]:
print(embedding_matrix.shape)

(31, 50)


In [None]:
encoder_inputs = Input(shape=(max_input_len,))


In [None]:
enc_emb = Embedding(vocab_size,embedding_dim,weights=[embedding_matrix],trainable=False)(encoder_inputs)

In [None]:
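# The first return value is the encoder's final output, not the layer; only the states are used later.
encoder_outputs, state_h, state_c = LSTM(128, return_state=True)(enc_emb)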

In [None]:
encoder_states = [state_h, state_c]


In [None]:
print (encoder_states)

[<KerasTensor shape=(None, 128), dtype=float32, sparse=False, name=keras_tensor_3>, <KerasTensor shape=(None, 128), dtype=float32, sparse=False, name=keras_tensor_4>]


In [None]:
decoder_inputs = Input(shape=(max_target_len,))


In [None]:
dec_emb = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(decoder_inputs)


In [None]:
decoder_lstm = LSTM(128, return_sequences=True, return_state=True)


In [None]:
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
# The decoder's returned state_h and state_c are not needed during training.

In [None]:
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)


In [None]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')


In [None]:
model.summary()


In [None]:
model.fit([encoder_input, decoder_input], decoder_target[..., np.newaxis], epochs=200, batch_size=2)


Epoch 1/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 36ms/step - loss: 3.4329
Epoch 2/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step - loss: 3.4147
Epoch 3/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step - loss: 3.3972 
Epoch 4/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step - loss: 3.3692
Epoch 5/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step - loss: 3.3257
Epoch 6/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step - loss: 3.2562
Epoch 7/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step - loss: 3.1428
Epoch 8/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step - loss: 2.9209
Epoch 9/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step - loss: 2.5665
Epoch 10/200
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step - loss: 2.5273
Epoch 11

<keras.src.callbacks.history.History at 0x7fa0b8c32050>

In [None]:
# ==================== INFERENCE SETUP FOR SUMMARY GENERATION ====================

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input

# 1️⃣ Build Encoder Inference Model
encoder_model = Model(encoder_inputs, encoder_states)

# 2️⃣ Build Decoder Inference Model
decoder_state_input_h = Input(shape=(128,))
decoder_state_input_c = Input(shape=(128,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2 = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs2] + decoder_states2)

# 3️⃣ Define decode_sequence function
def decode_sequence(input_seq):
    # Encode the input once; the returned [h, c] seed the decoder.
    states_value = encoder_model.predict(input_seq)

    # The vocabulary has no 'start' token (none was added to the summaries),
    # so this lookup falls back to index 1, the most frequent word.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer.word_index.get('start', 1)

    decoded_words = []

    while len(decoded_words) < max_target_len:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Greedy decoding: take the most probable token at the last time step.
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        # index_word has no entry for 0 (padding), so .get returns None there.
        sampled_word = tokenizer.index_word.get(sampled_token_index)

        if sampled_word is None or sampled_word == 'end':
            break

        decoded_words.append(sampled_word)

        # Feed the predicted token and the updated states back into the decoder.
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        states_value = [h, c]

    return ' '.join(decoded_words)

# 4️⃣ Test Cell: Generate Summary
test_text = """Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth."""

test_seq = tokenizer.texts_to_sequences([test_text])
test_seq = pad_sequences(test_seq, maxlen=max_input_len, padding='post')

generated_summary = decode_sequence(test_seq)

print("Input Text:")
print(test_text)
print("\nGenerated Summary:")
print(generated_summary)

# ==================== END OF INFERENCE SETUP ====================


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 188ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 195ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 40ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step
Input Text:
Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.

Generated Summary:
make food from


In [None]:
import nltk
# Recent NLTK releases load the tagger from 'averaged_perceptron_tagger_eng';
# downloading only 'averaged_perceptron_tagger' raises a LookupError in pos_tag.
nltk.download('averaged_perceptron_tagger_eng')

summary = "Plants make food from sunlight."
tokens = nltk.word_tokenize(summary)
pos_tags = nltk.pos_tag(tokens)

# Blank out the first noun to form a fill-in-the-blank question.
for word, tag in pos_tags:
    if tag.startswith('NN'):
        question = summary.replace(word, "____", 1)  # replace only the first occurrence
        answer = word
        print("Question:", question)
        print("Answer:", answer)
        break


In [None]:
# ==================== MCQ GENERATION BLOCK ====================

!pip install nltk transformers --quiet

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import pipeline
import random

# 1️⃣ Extract keywords and generate blanks
def generate_mcq_from_text(text, num_questions=5):
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    words = [w for w in words if w.isalpha() and len(w) > 4]
    words = list(set(words))

    if len(words) < num_questions:
        num_questions = len(words)

    selected_words = random.sample(words, num_questions)
    mcqs = []

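    for word in selected_words:
        for sent in sentences:
            if word in sent:
                # Blank only the first occurrence so each question has a single gap
                # (and later a single [MASK] for the fill-mask pipeline).
                question = sent.replace(word, '______', 1)
                mcqs.append({
                    'question': question,
                    'answer': word
                })
                break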
    return mcqs

# 2️⃣ Distractor Generation using Masked Language Modeling (MLM)
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

def add_distractors(mcqs, num_distractors=3):
    for mcq in mcqs:
        masked_sent = mcq['question'].replace('______', '[MASK]')
        # Request more candidates than needed, since some are filtered out below.
        predictions = fill_mask(masked_sent, top_k=10)
        distractors = []
        for pred in predictions:
            token = pred['token_str']
            if token.lower() != mcq['answer'].lower() and token.isalpha() and token not in distractors:
                distractors.append(token)
            if len(distractors) >= num_distractors:
                break
        mcq['options'] = distractors + [mcq['answer']]
        random.shuffle(mcq['options'])
    return mcqs

# 3️⃣ Usage Example
text = """
Photosynthesis allows plants to convert sunlight into food, producing oxygen as a byproduct and supporting life on Earth.
The mitochondria is the powerhouse of the cell, producing ATP for cellular activities.
Water boils at 100 degrees Celsius under normal atmospheric pressure.
"""

mcqs = generate_mcq_from_text(text, num_questions=3)
mcqs = add_distractors(mcqs)

# 4️⃣ Display MCQs
for idx, mcq in enumerate(mcqs, 1):
    print(f"\nQ{idx}: {mcq['question']}")
    for opt_idx, option in enumerate(mcq['options'], ord('A')):
        print(f"   {chr(opt_idx)}. {option}")
    print(f"Answer: {mcq['answer']}")

# ==================== END OF MCQ GENERATION BLOCK ====================


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
