🔍 Code Explanation – Step by Step

Step 1: Dataset Collection

import pandas as pd
import numpy as np
import re
import string
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


These are the essential libraries:

pandas and numpy: for data manipulation

re and string: for text cleaning

matplotlib: for optional data visualization

train_test_split: to divide data into training and testing sets

news_df = pd.read_csv("news_summary.csv", encoding='latin-1')
news_df = news_df[['text', 'headlines']].dropna()


Load the CSV dataset.

Use only the relevant columns: text (news body) and headlines (titles).

Drop rows with missing values.

In [2]:
# Auto Headline Generator – Using LSTM & GRU

# Step 1: Dataset Collection
import pandas as pd
import numpy as np
import re
import string
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the dataset
news_df = pd.read_csv("/content/news_summary.csv.zip", encoding='latin-1')
news_df = news_df[['text', 'headlines']].dropna()

# Display a few rows
print(news_df.head())

                                                text  \
0  The Administration of Union Territory Daman an...   
1  Malaika Arora slammed an Instagram user who tr...   
2  The Indira Gandhi Institute of Medical Science...   
3  Lashkar-e-Taiba's Kashmir commander Abu Dujana...   
4  Hotels in Maharashtra will train their staff t...   

                                           headlines  
0  Daman & Diu revokes mandatory Rakshabandhan in...  
1  Malaika slams user who trolled her for 'divorc...  
2  'Virgin' now corrected to 'Unmarried' in IGIMS...  
3  Aaj aapne pakad liya: LeT man Dujana before be...  
4  Hotel staff to get training to spot signs of s...  


Step 2: Data Preprocessing
def clean_text(text):
    text = text.lower()
    text = re.sub("\([^)]*\)", "", text)
    text = re.sub("\d", "", text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub("\s+", " ", text)
    return text.strip()


A function that:

Converts text to lowercase

Removes digits and punctuation

Cleans up extra spaces


news_df['clean_text'] = news_df['text'].apply(clean_text)
news_df['clean_headlines'] = news_df['headlines'].apply(clean_text)
news_df['decoder_input'] = '<sos> ' + news_df['clean_headlines']
news_df['decoder_target'] = news_df['clean_headlines'] + ' <eos>'


Apply the cleaning function

Add special tokens <sos> (start of sequence) and <eos> (end of sequence) for sequence generation



In [3]:

# Step 2: Data Preprocessing
def clean_text(text):
    text = text.lower()
    text = re.sub("\([^)]*\)", "", text)
    text = re.sub("\d", "", text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub("\s+", " ", text)
    return text.strip()

news_df['clean_text'] = news_df['text'].apply(clean_text)
news_df['clean_headlines'] = news_df['headlines'].apply(clean_text)

# Add start and end tokens for decoder
news_df['decoder_input'] = '<sos> ' + news_df['clean_headlines']
news_df['decoder_target'] = news_df['clean_headlines'] + ' <eos>'

Step 3: Tokenization and Padding


Converts words to integers using top 10,000 frequent words

<OOV> handles out-of-vocab words

Transforms and pads text sequences to fixed length (50)

Tokenizer for decoder input (headlines)

Pad both input and target sequences for training the model


In [6]:
# Step 3: Tokenization and Padding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

num_words = 10000
max_text_len = 50
max_headline_len = 15

text_tokenizer = Tokenizer(num_words=num_words, oov_token="<OOV>")
text_tokenizer.fit_on_texts(news_df['clean_text'])
text_seq = text_tokenizer.texts_to_sequences(news_df['clean_text'])
text_pad = pad_sequences(text_seq, maxlen=max_text_len, padding='post')

headline_tokenizer = Tokenizer(num_words=num_words, oov_token="<OOV>")
headline_tokenizer.fit_on_texts(news_df['decoder_input'])
decoder_input_seq = headline_tokenizer.texts_to_sequences(news_df['decoder_input'])
decoder_input_pad = pad_sequences(decoder_input_seq, maxlen=max_headline_len, padding='post')

decoder_target_seq = headline_tokenizer.texts_to_sequences(news_df['decoder_target'])
decoder_target_pad = pad_sequences(decoder_target_seq, maxlen=max_headline_len, padding='post')

# Save tokenizer
import pickle
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump({'text': text_tokenizer, 'headline': headline_tokenizer}, handle, protocol=pickle.HIGHEST_PROTOCOL)

Step 4: Train-Test Split

90% training and 10% test split

In [7]:
# Step 4: Train-Test Split
x_train, x_test, y_train_in, y_test_in, y_train_out, y_test_out = train_test_split(
    text_pad, decoder_input_pad, decoder_target_pad, test_size=0.1, random_state=42)

Step 5 & 6: Model Building and Training (LSTM & GRU)

Encoder: Embedding layer + LSTM which outputs hidden state h and cell state c

Decoder: Uses the states from the encoder to generate a sequence

Output layer with softmax to classify each time step word

GRU (same logic, fewer parameters)

In [8]:
# Step 5: Model Building (LSTM)
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, GRU, Dense
from tensorflow.keras.callbacks import EarlyStopping

embed_dim = 100

# LSTM Model
text_input = Input(shape=(max_text_len,))
embedding_text = Embedding(num_words, embed_dim)(text_input)
encoder_lstm = LSTM(100, return_state=True)
_, h, c = encoder_lstm(embedding_text)

decoder_input = Input(shape=(max_headline_len,))
embedding_headline = Embedding(num_words, embed_dim)(decoder_input)
decoder_lstm = LSTM(100, return_sequences=True)(embedding_headline, initial_state=[h, c])
output = Dense(num_words, activation='softmax')(decoder_lstm)

lstm_model = Model([text_input, decoder_input], output)
lstm_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
lstm_model.summary()


In [9]:
# Step 6: Model Training (LSTM)
es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
lstm_model.fit([x_train, y_train_in], y_train_out.reshape(*y_train_out.shape, 1),
               epochs=30, batch_size=128,
               validation_split=0.1,
               callbacks=[es])

# Save the LSTM model
lstm_model.save('headline_lstm_model.h5')

Epoch 1/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m53s[0m 2s/step - loss: 8.9056 - val_loss: 6.6746
Epoch 2/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m73s[0m 1s/step - loss: 6.0772 - val_loss: 5.4696
Epoch 3/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 1s/step - loss: 5.1838 - val_loss: 5.2304
Epoch 4/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 1s/step - loss: 4.9786 - val_loss: 5.1460
Epoch 5/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 1s/step - loss: 4.8770 - val_loss: 5.0918
Epoch 6/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m40s[0m 1s/step - loss: 4.8061 - val_loss: 5.0468
Epoch 7/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m40s[0m 1s/step - loss: 4.7362 - val_loss: 5.0106
Epoch 8/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m36s[0m 1s/step - loss: 4.6606 - val_loss: 4.9852
Epoch 9/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[



Step 7: Save Models

Save trained models for later use

In [10]:

# Step 7: GRU Model
encoder_gru = GRU(100, return_state=True)
_, h_gru = encoder_gru(embedding_text)

decoder_gru_layer = GRU(100, return_sequences=True)
decoder_gru_output = decoder_gru_layer(embedding_headline, initial_state=[h_gru])
output_gru = Dense(num_words, activation='softmax')(decoder_gru_output)

gru_model = Model([text_input, decoder_input], output_gru)
gru_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
gru_model.summary()

gru_model.fit([x_train, y_train_in], y_train_out.reshape(*y_train_out.shape, 1),
              epochs=30, batch_size=128,
              validation_split=0.1,
              callbacks=[es])

# Save the GRU model
gru_model.save('headline_gru_model.h5')

Epoch 1/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m48s[0m 1s/step - loss: 8.7258 - val_loss: 6.5035
Epoch 2/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m39s[0m 1s/step - loss: 5.9726 - val_loss: 5.4335
Epoch 3/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m40s[0m 1s/step - loss: 5.1392 - val_loss: 5.1562
Epoch 4/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 1s/step - loss: 4.8862 - val_loss: 5.0526
Epoch 5/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m40s[0m 1s/step - loss: 4.7576 - val_loss: 5.0062
Epoch 6/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 1s/step - loss: 4.6561 - val_loss: 4.9790
Epoch 7/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 1s/step - loss: 4.5739 - val_loss: 4.9611
Epoch 8/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m42s[0m 1s/step - loss: 4.5245 - val_loss: 4.9475
Epoch 9/30
[1m29/29[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[



Step 8: Headline Generation Functions

decode_sequence_argmax(): picks the most likely word at each step

decode_sequence_topk(): randomly picks from top K likely words to create variety

In [13]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def decode_sequence(model, input_seq, tokenizer, max_len, start_token="<sos>", end_token="<eos>", temperature=1.0, top_k=50):
    # Prepare the input sequence
    encoded_input_seq = pad_sequences(tokenizer.texts_to_sequences([input_seq]), maxlen=max_text_len, padding='post')
    target_seq = tokenizer.texts_to_sequences([start_token])[0]

    output = []

    for _ in range(max_len):
        # Prepare decoder input for the current step
        decoder_input_for_prediction = pad_sequences([target_seq], maxlen=max_headline_len, padding='post')

        # Predict next token probabilities
        predictions = model.predict([encoded_input_seq, decoder_input_for_prediction], verbose=0)


        # Apply temperature scaling to adjust the randomness
        predictions = np.squeeze(predictions, axis=0)[-1, :] / temperature # Get predictions for the last time step
        predictions = np.exp(predictions) / np.sum(np.exp(predictions))  # Softmax function

        # Top-k sampling: choose the top k predictions
        top_indices = predictions.argsort()[-top_k:][::-1]
        sampled_token_index = np.random.choice(top_indices, p=predictions[top_indices] / np.sum(predictions[top_indices]))

        # Convert the sampled token to a word
        sampled_word = tokenizer.index_word.get(sampled_token_index, '')

        if sampled_word == end_token or sampled_word == '':
            break

        output.append(sampled_word)
        target_seq.append(sampled_token_index)


    return ' '.join(output)

# Test the model on unseen paragraphs
for i in range(5):
    test_text = news_df['clean_text'].iloc[i + 1]
    print(f"Input Text: {test_text}")
    lstm_headline = decode_sequence(lstm_model, test_text, headline_tokenizer, max_headline_len)
    gru_headline = decode_sequence(gru_model, test_text, headline_tokenizer, max_headline_len)

    print(f"LSTM Headline: {lstm_headline}")
    print(f"GRU Headline: {gru_headline}")

Input Text: malaika arora slammed an instagram user who trolled her for divorcing a rich man and having fun with the alimony her life now is all about wearing short clothes going to gym or salon enjoying vacations the user commented malaika responded you certainly got to get your damn facts right before spewing sht on mewhen you know nothing about me
LSTM Headline: post says at years post study ban trump ganguly against study over of on yrs
GRU Headline: ban at him <OOV> to ban govt assembly kejriwal over by says says him remark
Input Text: the indira gandhi institute of medical sciences in patna on thursday made corrections in its marital declaration form by changing virgin option to unmarried earlier bihar health minister defined virgin as being an unmarried woman and did not consider the term objectionable the institute however faced strong backlash for asking new recruits to declare their virginity in the form
LSTM Headline: airport
GRU Headline: post of for on as reports report st

In [None]:
📄 LSTM vs GRU – Performance Comparison Summary

Sections to cover:

| Criteria                   | LSTM                                 | GRU                            |
| -------------------------- | ------------------------------------ | ------------------------------ |
| Architecture               | Uses 3 gates (input, forget, output) | Uses 2 gates (reset, update)   |
| Training Time              | Slightly longer                      | Faster due to fewer parameters |
| Accuracy (Validation Loss) | Example: 0.47                        | Example: 0.45                  |
| Output Quality (Argmax)    | Good, sometimes more formal          | Often more concise             |
| Output Quality (Top-K)     | Slightly repetitive                  | More creative and readable     |
| Model Size                 | \~5.6 MB                             | \~4.8 MB                       |
| Ideal For                  | Long-sequence learning               | Faster training on short text  |


Conclusion:

Both models perform well, but GRU provides comparable accuracy with faster training and smaller size, making it a good choice for real-time applications. LSTM, however, may capture complex dependencies better in longer contexts.



Bonus Summary: Headline Decoding Techniques


In this bonus section, we replaced the traditional argmax decoding strategy with two advanced
methods: Top-K Sampling and Beam Search. These techniques improve the diversity and quality of
generated headlines.
1. Argmax (Greedy Decoding):
Selects the most probable word at each time step. Fast and deterministic but may miss better
long-term sequences.
2. Top-K Sampling:
Samples randomly from the top K highest probability words. Introduces diversity and creativity in the
output.
3. Beam Search:
Explores multiple possible sequences by keeping the top N candidates (beam width). Improves
output quality and handles context better than argmax