# Medium Article Title Generator using LSTM

This notebook implements a neural language model for generating Medium article titles using a Bidirectional LSTM architecture. The model learns patterns from existing Medium article titles and generates new title suggestions based on seed text.


## Import Required Libraries

Loading all necessary libraries for data processing, text preprocessing, and neural network implementation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re
import os

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.optimizers import Adam


## Load and Explore Dataset

Loading the Medium articles dataset and examining its structure to understand the data we're working with.

In [None]:
medium_data = pd.read_csv("/kaggle/input/medium-articles-dataset/medium_data.csv")
print("First 10 rows of the dataset:")
medium_data.head(10)

In [None]:
print(f"Dataset shape: {medium_data.shape}")
print(f"Total articles: {medium_data.shape[0]}, Features: {medium_data.shape[1]}")
medium_data.shape

## Data Backup and Preprocessing

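Creating a backup copy of the DataFrame, then cleaning the title text: runs of non-breaking and thin spaces, zero-width characters, byte-order marks, tabs, and line breaks are each replaced with a single space, and leading and trailing whitespace is stripped.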
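
As a quick illustration of the cleaning step, the sketch below applies the same character class used in this notebook to a made-up title. The `clean_title` helper and the sample string are hypothetical and exist only for demonstration.

```python
import re

# Same character class as the notebook's cleaning cell: non-breaking and thin
# spaces, zero-width characters (including the BOM), tabs, and line breaks.
clean_pattern = re.compile(r'[\u00a0\u200a\u200b\u200c\u200d\u202f\u2060\ufeff\t\r\n]+')

def clean_title(text):
    """Replace each run of the listed characters with one space, then trim the ends."""
    return clean_pattern.sub(' ', text).strip()

sample = "\ufeffDeep\u00a0Learning\t101\n"  # made-up title with invisible characters
print(repr(clean_title(sample)))
```

Note that ordinary spaces are not part of the character class, so a title that already contains two consecutive regular spaces keeps them.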

In [None]:
data_copy = medium_data.copy()
print("Backup copy created successfully")

In [None]:
# Uncomment the next line to restore the original data from the backup
# medium_data = data_copy.copy()

In [None]:
# Match runs of non-breaking/thin spaces, zero-width characters (including the BOM),
# tabs, and line breaks
clean_pattern = re.compile(r'[\u00a0\u200a\u200b\u200c\u200d\u202f\u2060\ufeff\t\r\n]+')

# Replace each run with a single space, then trim leading/trailing whitespace
medium_data['title'] = medium_data['title'].apply(
    lambda x: clean_pattern.sub(' ', x).strip()
)
print("Text cleaning completed - replaced invisible and irregular whitespace with single spaces")

In [None]:
print("Sample cleaned titles:")
medium_data['title']

## Text Tokenization

Converting text data into numerical sequences using Keras Tokenizer. This creates a vocabulary mapping each unique word to a numerical index.
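
To make the word-to-index mapping concrete, here is a simplified pure-Python approximation of how such a vocabulary can be built. It is not the Keras implementation: `build_word_index` is a hypothetical helper that only lowercases and splits on whitespace, whereas the real `Tokenizer` also strips punctuation by default. The titles are made up.

```python
from collections import Counter

def build_word_index(texts, oov_token="<oov>"):
    """Map words to integer ids: id 1 goes to the OOV token and the
    remaining ids follow descending word frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    word_index = {oov_token: 1}
    # most_common() sorts by count; ties keep first-seen order
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

titles = ["how to learn python", "how to write", "learn to code"]  # made-up titles
word_index = build_word_index(titles)
print(word_index)
print("vocabulary size incl. padding id 0:", len(word_index) + 1)
```

Index 0 is never assigned to a word because it is reserved for padding, which is why the vocabulary size is computed as `len(word_index) + 1`.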

In [None]:
tokenizer = Tokenizer(oov_token = '<oov>')
tokenizer.fit_on_texts(medium_data['title'])
print("Tokenizer fitted on title texts")

In [None]:
print("Word index mapping (first 10 entries):")
word_items = list(tokenizer.word_index.items())[:10]
for word, index in word_items:
    print(f"'{word}': {index}")

In [None]:
total_words = len(tokenizer.word_index)+1
print(f"Total vocabulary size: {total_words}")
total_words

## N-gram Sequence Generation

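Creating training sequences by expanding each tokenized title into its prefixes: the first two words, the first three, and so on up to the full title. Each prefix later becomes one training example in which the model predicts the final word from the words before it.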
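
A toy version of this expansion, using hypothetical token ids for a four-word title, shows that a title with n tokens contributes n - 1 sequences. The `prefix_sequences` helper is an illustrative name, not part of the notebook.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2, shortest first."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

tokens = [12, 7, 45, 3]  # hypothetical ids for a four-word title
for seq in prefix_sequences(tokens):
    print(seq)
```

A one-word title produces no sequences at all, since there is no next word to predict.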

In [None]:
tokenized_sequences = []
for title in medium_data['title']:
    sequences = tokenizer.texts_to_sequences([title])[0]    
    for i in range(1, len(sequences)):
        n_gram_sequence = sequences[:i+1]
        tokenized_sequences.append(n_gram_sequence)

print("Total input sequences: ", len(tokenized_sequences))

In [None]:
print("Sample tokenized sequences (first 5):")
for i, seq in enumerate(tokenized_sequences[:5]):
    print(f"Sequence {i+1}: {seq}")

## Sequence Padding

Padding sequences to a uniform length for the neural network. Shorter sequences are left-padded with zeros (`padding="pre"`) up to the length of the longest sequence, which keeps the word to be predicted in the last position.
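
The effect of pre-padding can be sketched in plain Python. The `pad_pre` helper below is a hypothetical stand-in that mimics `pad_sequences` with `padding="pre"` (and its default pre-truncation) on lists; it is meant for intuition, not as a replacement for the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; longer ones keep their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[12, 7], [12, 7, 45], [12, 7, 45, 3]]  # hypothetical token ids
for row in pad_pre(toy_sequences, maxlen=4):
    print(row)
```

Because the zeros go on the left, the last column always holds a real word, which is what the later input-output split relies on.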

In [None]:
maxlen = max([len(x) for x in tokenized_sequences])
print(f"Maximum sequence length: {maxlen}")
maxlen

In [None]:
padded_sequences = pad_sequences(tokenized_sequences, maxlen = maxlen, padding="pre")
print("Sample padded sequence:")
print(padded_sequences[0])

In [None]:
print(f"Original sequence length: {len(tokenized_sequences[0])}")
print(f"Padded sequence length: {len(padded_sequences[0])}")

## Input-Output Split

Splitting sequences into input features (X) and target labels (y). The last word of each sequence becomes the target that the model should predict.
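
On a few made-up pre-padded rows, the split looks like this. The list comprehensions mirror the NumPy slices `[:, :-1]` and `[:, -1]`; the names `X_toy` and `y_toy` are illustrative only.

```python
padded = [
    [0, 0, 12, 7],
    [0, 12, 7, 45],
    [12, 7, 45, 3],
]  # made-up pre-padded rows

X_toy = [row[:-1] for row in padded]  # context: everything except the last word
y_toy = [row[-1] for row in padded]   # target: the last word

for context, target in zip(X_toy, y_toy):
    print(context, "->", target)
```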

In [None]:
X = padded_sequences[:, :-1]
y = padded_sequences[:, -1]

print(f"Input shape (X): {X.shape}")
print(f"Target shape (y): {y.shape}")
X, y

## One-Hot Encoding

Converting target labels to categorical format for multi-class classification. Each target word is represented as a one-hot vector.
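
A minimal sketch of what one-hot encoding produces, together with a rough memory estimate, is given below. The helpers `one_hot` and `one_hot_bytes`, the sample counts, and the assumption of 4 bytes per value are all illustrative and not taken from this dataset.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def one_hot_bytes(num_samples, num_classes, bytes_per_value=4):
    """Rough size of a dense one-hot target matrix (assumes 4-byte floats)."""
    return num_samples * num_classes * bytes_per_value

print(one_hot(2, 5))
# Hypothetical sizes: 50,000 training sequences and an 8,000-word vocabulary
print(one_hot_bytes(50_000, 8_000) / 1e9, "GB")
```

If the dense target matrix becomes too large, Keras also provides the `sparse_categorical_crossentropy` loss, which accepts the integer labels directly and makes the one-hot step unnecessary.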

In [None]:
y = to_categorical(y, num_classes=total_words)
print("Target labels converted to categorical format")

In [None]:
print("Sample one-hot encoded target:")
print(y[0])
print(f"One-hot vector length: {len(y[0])}")

In [None]:
input_length = X.shape[1]
print(f"Input sequence length for model: {input_length}")
input_length

## Model Architecture

Building a Bidirectional LSTM model for next-word prediction. The architecture includes:
- Embedding layer for word representations
- Bidirectional LSTM for capturing context from both directions
- Dense output layer with softmax activation for word probability distribution
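
The size of this architecture can be estimated by hand. The sketch below counts trainable parameters for the three layers listed above, assuming an embedding size of 100, 100 LSTM units per direction, and the default concatenation of the two directions. The vocabulary size of 8,000 is a placeholder, not the value from this dataset, and the function names are hypothetical.

```python
def lstm_params(input_dim, units):
    """One LSTM direction: 4 gates, each with input, recurrent, and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def model_params(vocab_size, embed_dim=100, units=100):
    """Parameter counts for Embedding -> Bidirectional(LSTM) -> Dense(softmax)."""
    embedding = vocab_size * embed_dim
    bilstm = 2 * lstm_params(embed_dim, units)      # forward and backward copies
    dense = (2 * units) * vocab_size + vocab_size   # concatenated states -> vocabulary
    return {
        "embedding": embedding,
        "bilstm": bilstm,
        "dense": dense,
        "total": embedding + bilstm + dense,
    }

print(model_params(vocab_size=8000))  # 8000 is a hypothetical vocabulary size
```

The counts can be compared against the output of `model.summary()`; with a large vocabulary, the embedding and output layers account for most of the parameters.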

In [None]:
model = Sequential([
    Embedding(input_dim=total_words, output_dim=100, input_shape=(input_length,)),
    Bidirectional(LSTM(100)),
    Dense(total_words, activation="softmax")
])

model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
print("Model architecture:")
model.summary()

## Model Training

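Training the model for 50 epochs with a batch size of 32. Keras holds out the last 10% of the samples (`validation_split=0.1`) and reports loss and accuracy on them after each epoch; this makes overfitting visible but does not by itself prevent it.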
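
One detail worth knowing is that `validation_split` takes the held-out samples from the end of the arrays, before any shuffling. The sketch below imitates that slicing on a plain list; `split_like_validation_split` is a hypothetical helper written for illustration, not a Keras function.

```python
import math

def split_like_validation_split(samples, validation_split=0.1):
    """Hold out the trailing fraction of `samples`, without shuffling first."""
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]

rows = list(range(20))  # stand-in for 20 training sequences
train, val = split_like_validation_split(rows, validation_split=0.1)
print("train:", train)
print("validation:", val)
```

Since the sequences here are generated title by title, the validation set consists of prefixes from the last titles in the dataset rather than a random sample.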

In [None]:
print("Starting model training...")
history = model.fit(X, y, epochs=50, batch_size=32, validation_split=0.1, verbose=True)
print("Training completed!")

## Training Visualization

Plotting training history to visualize model performance over epochs and identify potential overfitting.
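
A common overfitting signature is training loss that keeps falling while validation loss starts to rise. Because training uses `validation_split`, `history.history` also contains `val_loss` and `val_accuracy`, and the turning point can be read off numerically. The sketch below uses made-up loss values and a hypothetical `best_epoch` helper.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    lowest = min(range(len(val_losses)), key=val_losses.__getitem__)
    return lowest + 1

toy_history = {
    "loss": [2.9, 2.1, 1.6, 1.2, 0.9],
    "val_loss": [3.0, 2.6, 2.5, 2.7, 3.1],
}  # made-up numbers shaped like history.history

epoch = best_epoch(toy_history["val_loss"])
print(f"Lowest validation loss at epoch {epoch}")
```

With a real run, the same helper could be applied to `history.history["val_loss"]`.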

In [None]:
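def plot_graphs(history, string):
    """
    Plot a training metric and its validation counterpart over epochs.
    
    Args:
        history: Training history object from model.fit()
        string: Metric name to plot ('accuracy' or 'loss')
    """
    plt.plot(history.history[string], label=f"train {string}")
    # With validation_split set in model.fit(), 'val_<metric>' is recorded as well
    val_key = f"val_{string}"
    if val_key in history.history:
        plt.plot(history.history[val_key], label=f"validation {string}")
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend()
    plt.show()

print("Training and validation accuracy over epochs:")
plot_graphs(history, "accuracy")
print("Training and validation loss over epochs:")
plot_graphs(history, "loss")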

## Title Generation Function

Function to generate new title suggestions based on seed text using the trained model.
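
The function in this notebook picks the single most probable word at every step, so a given seed always yields the same title. One possible variation, which this notebook does not use, is to sample from the predicted distribution with a temperature. The sketch below shows the idea on a made-up three-word distribution; `apply_temperature` and `sample_index` are hypothetical helpers.

```python
import math
import random

def apply_temperature(probs, temperature=1.0):
    """Rescale a distribution: temperature < 1 sharpens it, > 1 flattens it."""
    # Work in log space; the small epsilon avoids log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, temperature=1.0, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

probs = [0.1, 0.6, 0.3]  # made-up softmax output over a 3-word vocabulary
print(apply_temperature(probs, temperature=0.5))
print(sample_index(probs, temperature=0.8, rng=random.Random(0)))
```

In a generation loop, the sampled index would take the place of the `np.argmax` result; very low temperatures behave almost like the greedy choice.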

In [None]:
def generate_title(seed_text, next_words=10):
    """
    Generate new title text based on seed input.
    
    Args:
        seed_text (str): Starting text for title generation
        next_words (int): Number of words to generate after seed text
        
    Returns:
        str: Complete generated title including seed text
    """
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        # Pre-pad (or, for long seeds, pre-truncate) to the model's input length
        token_list = pad_sequences([token_list], maxlen=input_length, padding='pre')
        predicted = model.predict(token_list, verbose=0)
        # Greedy decoding: always take the most probable next word
        predicted_word_index = int(np.argmax(predicted, axis=1)[0])
        
        # index_word is the tokenizer's reverse lookup; index 0 is padding and has no word
        word = tokenizer.index_word.get(predicted_word_index)
        if word is None:
            break
        seed_text += " " + word
    return seed_text

## Title Generation Examples

Testing the trained model with different seed phrases to generate Medium article titles.

In [None]:
print("Generated title examples:")
print("Seed: 'how to' ->", generate_title("how to", 6))
print("Seed: 'deep learning' ->", generate_title("deep learning", 7))
print("Seed: 'What are' ->", generate_title("What are", 5))

## Model and Tokenizer Saving

Saving the trained model and tokenizer for future use and deployment.

In [None]:
# tokenizer.to_json() already returns a JSON string, so it can be written as-is
token_json = tokenizer.to_json()
with open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(token_json)
print("Tokenizer saved to 'tokenizer.json'")

In [None]:
model.save("medium_title_gen.h5")
print("Model saved to 'medium_title_gen.h5'")

## Summary

This notebook implemented and trained a Bidirectional LSTM model for Medium article title generation. Given a seed phrase, the model extends it one word at a time by always choosing the most probable next word, which can give content creators a starting point for title ideas.