# LSTM-based Seq2Seq Chatbot using Cornell Movie Dialogue Corpus

**Architecture:** Encoder-Decoder LSTM (Long Short-Term Memory)  
**Model Type:** Sequence-to-Sequence (Seq2Seq)  
**Dataset:** Cornell Movie-Dialogs Corpus  
**Conversational Exchanges:** 220,579 between 10,292 pairs of movie characters  
**Total Utterances:** 304,713 from 617 movies  

**Project Goals:**
- Implement LSTM encoder-decoder architecture
- Understand LSTM's memory mechanisms (forget, input, output gates)
- Process large-scale conversational data
- Train a generative dialogue model using LSTMs

**Dataset Source:** https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html


In [1]:
# Install required libraries
!pip install tensorflow numpy pandas matplotlib scikit-learn tqdm -q

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pickle
import re
import os
import zipfile
from tqdm import tqdm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

print(f"TensorFlow Version: {tf.__version__}")
print("Setup Complete!")


TensorFlow Version: 2.19.0
Setup Complete!


## Downloading Cornell Movie Dialogue Corpus

The dataset contains:
- `movie_lines.txt` - All utterances with character and movie metadata
- `movie_conversations.txt` - Conversation structure showing dialogue flow
- `movie_characters_metadata.txt` - Character information
- `movie_titles_metadata.txt` - Movie details


In [2]:
# Download the Cornell Movie Dialogue Corpus
!wget -q http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
!unzip -q cornell_movie_dialogs_corpus.zip

# Verify download
data_dir = 'cornell movie-dialogs corpus'
if os.path.exists(data_dir):
    print("✓ Dataset downloaded successfully!")
    print("\nFiles in dataset:")
    for file in os.listdir(data_dir):
        print(f"  - {file}")
else:
    print("✗ Download failed. Please check the URL.")


✓ Dataset downloaded successfully!

Files in dataset:
  - README.txt
  - movie_titles_metadata.txt
  - .DS_Store
  - movie_conversations.txt
  - movie_lines.txt
  - chameleons.pdf
  - movie_characters_metadata.txt
  - raw_script_urls.txt


## Loading and Parsing the Dataset

We need to:
1. Load `movie_lines.txt` to create a line ID → text mapping
2. Load `movie_conversations.txt` to get conversation sequences
3. Extract question-answer pairs from consecutive utterances
4. Clean and preprocess text data
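
As a minimal sketch of steps 1 to 3, the snippet below parses a few records that use the same ` +++$+++ ` field separator as the corpus files. The sample records and the helper name `parse_record` are made up for illustration and are not taken from the dataset.

```python
import ast

SEP = " +++$+++ "

def parse_record(raw_line, expected_fields):
    """Split one corpus record; return None if the field count is off."""
    parts = raw_line.rstrip("\n").split(SEP)
    return parts if len(parts) == expected_fields else None

# Hypothetical records in the two file formats (not copied from the corpus).
line_records = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Hello there.\n",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Hi. How are you?\n",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Fine, thanks.\n",
]
conversation_record = "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']\n"

# Step 1: line ID -> text
id_to_text = {}
for raw in line_records:
    parts = parse_record(raw, 5)
    if parts:
        id_to_text[parts[0]] = parts[4].strip()

# Step 2: the last field is a Python list literal; literal_eval parses it
# without executing arbitrary code.
line_ids = ast.literal_eval(parse_record(conversation_record, 4)[3])

# Step 3: consecutive utterances become (question, answer) pairs
pairs = [(id_to_text[a], id_to_text[b]) for a, b in zip(line_ids, line_ids[1:])]
print(pairs)
```

A conversation with n utterances yields n - 1 pairs, which is why the pair count can exceed the number of conversations.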


In [3]:
def load_lines(file_path):
    """
    Load movie lines into a dictionary
    Format: lineID +++$+++ characterID +++$+++ movieID +++$+++ character +++$+++ text
    """
    lines_dict = {}
    with open(file_path, 'r', encoding='iso-8859-1') as f:
        for line in f:
            parts = line.split(' +++$+++ ')
            if len(parts) == 5:
                line_id = parts[0]
                text = parts[4].strip()
                lines_dict[line_id] = text
    return lines_dict

# Load lines
lines_file = os.path.join(data_dir, 'movie_lines.txt')
id2line = load_lines(lines_file)

print(f"Total lines loaded: {len(id2line)}")
print(f"\nSample lines:")
for i, (line_id, text) in enumerate(list(id2line.items())[:5]):
    print(f"{line_id}: {text}")


Total lines loaded: 304713

Sample lines:
L1045: They do not!
L1044: They do to!
L985: I hope so.
L984: She okay?
L925: Let's go.


In [4]:
import ast

def load_conversations(file_path, id2line):
    """
    Load conversations and create question-answer pairs
    Format: characterID1 +++$+++ characterID2 +++$+++ movieID +++$+++ ['L1', 'L2', ...]
    """
    conversations = []
    with open(file_path, 'r', encoding='iso-8859-1') as f:
        for line in f:
            parts = line.split(' +++$+++ ')
            if len(parts) == 4:
                # Parse the list literal of line IDs safely (no arbitrary code execution)
                line_ids = ast.literal_eval(parts[3].strip())
                conversations.append(line_ids)

    # Create question-answer pairs
    qa_pairs = []
    for conversation in conversations:
        for i in range(len(conversation) - 1):
            question_id = conversation[i]
            answer_id = conversation[i + 1]

            if question_id in id2line and answer_id in id2line:
                question = id2line[question_id]
                answer = id2line[answer_id]
                qa_pairs.append([question, answer])

    return qa_pairs

# Load conversations
conversations_file = os.path.join(data_dir, 'movie_conversations.txt')
qa_pairs = load_conversations(conversations_file, id2line)

print(f"Total Q&A pairs created: {len(qa_pairs)}")
print(f"\nSample conversations:")
for i in range(5):
    print(f"Q: {qa_pairs[i][0]}")
    print(f"A: {qa_pairs[i][1]}")
    print("-" * 80)


Total Q&A pairs created: 221616

Sample conversations:
Q: Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.
A: Well, I thought we'd start with pronunciation, if that's okay with you.
--------------------------------------------------------------------------------
Q: Well, I thought we'd start with pronunciation, if that's okay with you.
A: Not the hacking and gagging and spitting part.  Please.
--------------------------------------------------------------------------------
Q: Not the hacking and gagging and spitting part.  Please.
A: Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?
--------------------------------------------------------------------------------
Q: You're asking me out.  That's so cute. What's your name again?
A: Forget it.
--------------------------------------------------------------------------------
Q: No, no, it's my fault -- we didn't have a proper introdu

## Text Cleaning and Preprocessing

Steps:
- Convert to lowercase
- Remove special characters and extra spaces
- Add start and end tokens for decoder
- Filter extremely long or short sentences
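
To make the effect of these steps concrete, the sketch below applies the same two regular expressions used by `clean_text` in the next cell to a made-up sentence, then counts whitespace-separated tokens the way the length filter does. The input sentence is an assumption chosen for illustration.

```python
import re

def clean_text(text):
    """Lowercase, replace disallowed characters with spaces, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9?.!,¿']+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "No, no -- it's MY fault!  (Really.)"
cleaned = clean_text(raw)
tokens = cleaned.split()

print(cleaned)
print(len(tokens), tokens)
```

Note that punctuation stays attached to the preceding word, so `no,` and `no` count as different tokens. This affects both the word-count filter and the vocabulary size later on.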


In [5]:
def clean_text(text):
    """
    Clean and normalize text
    """
    # Lowercase
    text = text.lower()

    # Remove special characters but keep basic punctuation
    text = re.sub(r"[^a-z0-9?.!,¿']+", " ", text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def filter_pairs(pairs, max_length=20):
    """
    Filter pairs based on length
    """
    filtered_pairs = []
    for question, answer in pairs:
        q_clean = clean_text(question)
        a_clean = clean_text(answer)

        # Filter by word count
        if (len(q_clean.split()) <= max_length and
            len(a_clean.split()) <= max_length and
            len(q_clean.split()) > 0 and
            len(a_clean.split()) > 0):
            filtered_pairs.append([q_clean, a_clean])

    return filtered_pairs

# Clean and filter pairs
print("Cleaning and filtering data...")
filtered_qa_pairs = filter_pairs(qa_pairs, max_length=15)

# Limit to first 50,000 for faster training (you can increase this)
filtered_qa_pairs = filtered_qa_pairs[:50000]

print(f"Filtered pairs: {len(filtered_qa_pairs)}")
print(f"\nSample cleaned conversations:")
for i in range(3):
    print(f"Q: {filtered_qa_pairs[i][0]}")
    print(f"A: {filtered_qa_pairs[i][1]}")
    print("-" * 80)


Cleaning and filtering data...
Filtered pairs: 50000

Sample cleaned conversations:
Q: well, i thought we'd start with pronunciation, if that's okay with you.
A: not the hacking and gagging and spitting part. please.
--------------------------------------------------------------------------------
Q: not the hacking and gagging and spitting part. please.
A: okay... then how 'bout we try out some french cuisine. saturday? night?
--------------------------------------------------------------------------------
Q: you're asking me out. that's so cute. what's your name again?
A: forget it.
--------------------------------------------------------------------------------


In [6]:
# Separate questions and answers
questions = [pair[0] for pair in filtered_qa_pairs]
answers = [pair[1] for pair in filtered_qa_pairs]

# Add START and END tokens to answers
answers_with_tags = ['<START> ' + answer + ' <END>' for answer in answers]

print(f"Total questions: {len(questions)}")
print(f"Total answers: {len(answers_with_tags)}")
print(f"\nSample with tags:")
print(f"Q: {questions[0]}")
print(f"A: {answers_with_tags[0]}")


Total questions: 50000
Total answers: 50000

Sample with tags:
Q: well, i thought we'd start with pronunciation, if that's okay with you.
A: <START> not the hacking and gagging and spitting part. please. <END>


## Tokenization and Sequence Creation

Convert text to sequences of integers for RNN processing:
- Build vocabulary from all words
- Convert sentences to integer sequences
- Pad sequences to uniform length
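
The sketch below mimics these three steps in plain Python so the mechanics are visible. It is not the Keras implementation: `build_vocab`, `encode`, and `pad_post` are hypothetical helpers. Like the notebook's tokenizer, it reserves 0 for padding and 1 for out-of-vocabulary words, and it lowercases text, which the Keras `Tokenizer` also does by default (`lower=True`). As a result `<START>` and `<END>` end up in the vocabulary as `<start>` and `<end>`.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1

def build_vocab(texts):
    """Map each lowercased, whitespace-separated token to an integer ID."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    # Most frequent tokens get the smallest IDs; 0 and 1 are reserved.
    ranked = sorted(counts, key=lambda tok: -counts[tok])
    return {tok: i for i, tok in enumerate(ranked, start=2)}

def encode(text, vocab):
    """Convert a sentence to IDs, using OOV_ID for unseen tokens."""
    return [vocab.get(tok, OOV_ID) for tok in text.lower().split()]

def pad_post(seq, maxlen):
    """Right-pad with PAD_ID up to maxlen items."""
    return seq + [PAD_ID] * (maxlen - len(seq))

texts = ["<START> forget it. <END>", "<START> it is fine. <END>"]
vocab = build_vocab(texts)
encoded = [encode(t, vocab) for t in texts]
maxlen = max(len(s) for s in encoded)
padded = [pad_post(s, maxlen) for s in encoded]

print(vocab)
print(padded)
print(encode("<START> hello <END>", vocab))  # "hello" is unseen, so it maps to OOV_ID
```

Any later lookup of the start or end tag therefore has to use the lowercased key, unless the tokenizer is created with lowercasing disabled.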


In [7]:
# Create tokenizer
tokenizer = Tokenizer(filters='', oov_token='<OOV>')
tokenizer.fit_on_texts(questions + answers_with_tags)

vocab_size = len(tokenizer.word_index) + 1
print(f"Vocabulary Size: {vocab_size}")

# Convert to sequences
question_sequences = tokenizer.texts_to_sequences(questions)
answer_sequences = tokenizer.texts_to_sequences(answers_with_tags)

# Find max length
max_question_len = max([len(seq) for seq in question_sequences])
max_answer_len = max([len(seq) for seq in answer_sequences])

print(f"Max Question Length: {max_question_len}")
print(f"Max Answer Length: {max_answer_len}")

# Pad sequences
encoder_input = pad_sequences(question_sequences, maxlen=max_question_len, padding='post')
decoder_input = pad_sequences(answer_sequences, maxlen=max_answer_len, padding='post')

# Create decoder output (shifted by one position)
decoder_output = []
for seq in answer_sequences:
    decoder_output.append(seq[1:])  # Remove <START> token

decoder_output = pad_sequences(decoder_output, maxlen=max_answer_len, padding='post')

print(f"\nEncoder Input Shape: {encoder_input.shape}")
print(f"Decoder Input Shape: {decoder_input.shape}")
print(f"Decoder Output Shape: {decoder_output.shape}")


Vocabulary Size: 38743
Max Question Length: 15
Max Answer Length: 17

Encoder Input Shape: (50000, 15)
Decoder Input Shape: (50000, 17)
Decoder Output Shape: (50000, 17)


## Seq2Seq LSTM Architecture

**Encoder-Decoder LSTM Architecture:**
- **Encoder LSTM**: Processes question sequence, outputs context vectors (hidden state + cell state)
- **Decoder LSTM**: Takes context + previous word, generates next word using LSTM cells
- **LSTM Advantages**:
  - Three gates (forget, input, output) for selective memory
  - Cell state for long-term dependencies
  - Better gradient flow than vanilla RNN

This is a classic **Encoder-Decoder LSTM** model for sequence-to-sequence dialogue generation.
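
To show what the three gates do, here is a minimal sketch of one LSTM time step for a single unit with scalar input and state. It uses the standard LSTM formulation; the weights are arbitrary illustrative numbers, not values learned by this model, and `lstm_step` is a hypothetical helper rather than a Keras function.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM with scalar input and state.

    p maps each of 'f', 'i', 'o', 'g' to a tuple (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much of c_prev to keep
    i = sigmoid(pre("i"))      # input gate: how much new content to write
    o = sigmoid(pre("o"))      # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))    # candidate cell content
    c = f * c_prev + i * g     # additive cell-state update
    h = o * math.tanh(c)       # new hidden state
    return h, c

# Arbitrary illustrative weights (not learned values).
params = {"f": (0.5, 0.1, 1.0), "i": (0.4, -0.2, 0.0),
          "o": (0.3, 0.2, 0.0), "g": (0.9, 0.5, 0.0)}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.2f}  h={h:+.4f}  c={c:+.4f}")
```

The pair `(h, c)` after the last input step corresponds to the context that the encoder hands to the decoder as its initial state.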


In [10]:
# Model hyperparameters
embedding_dim = 128
lstm_units = 256
batch_size = 64
epochs = 3

# Encoder
encoder_input_layer = Input(shape=(max_question_len,))
encoder_embedding = Embedding(vocab_size, embedding_dim, mask_zero=True)(encoder_input_layer)
encoder_lstm = LSTM(lstm_units, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

# Decoder
decoder_input_layer = Input(shape=(max_answer_len,))
decoder_embedding = Embedding(vocab_size, embedding_dim, mask_zero=True)(decoder_input_layer)
decoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Complete model
model = Model([encoder_input_layer, decoder_input_layer], decoder_outputs)

# Compile
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()


## Training the Chatbot

Training time depends on dataset size and hardware. In the run logged below, each batch took about 6 s, which works out to roughly 70 minutes per epoch of 704 batches.

**Training Process:**
- Encoder processes the question
- Decoder learns to generate the answer word-by-word
- Loss function: Sparse categorical crossentropy (predicting next word)
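
The "word-by-word" training relies on teacher forcing: at each step the decoder sees the true previous token and is asked to predict the next one. The sketch below mirrors how this notebook builds `decoder_input` and `decoder_output` (the target is the answer with its first token dropped). The token IDs and the helper `teacher_forcing_pair` are assumptions for illustration.

```python
def teacher_forcing_pair(token_ids, maxlen, pad_id=0):
    """Return (decoder_input, decoder_target), both right-padded to maxlen."""
    def pad(seq):
        return seq + [pad_id] * (maxlen - len(seq))

    decoder_input = pad(token_ids)        # <start> w1 ... wn <end>
    decoder_target = pad(token_ids[1:])   # w1 ... wn <end>
    return decoder_input, decoder_target

# Hypothetical IDs: 2 = <start>, 3 = <end>, 4 and above = words
answer = [2, 7, 9, 3]
dec_in, dec_out = teacher_forcing_pair(answer, maxlen=6)

for step, (seen, target) in enumerate(zip(dec_in, dec_out)):
    print(f"step {step}: decoder sees {seen}, should predict {target}")
```

Sparse categorical crossentropy then compares the predicted distribution at each step with the single integer target for that step, so the targets do not need to be one-hot encoded.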


In [None]:
# Prepare decoder output for training (add dimension for sparse_categorical_crossentropy)
decoder_output_train = decoder_output.reshape(decoder_output.shape[0], decoder_output.shape[1], 1)

# Train
print("Starting training...")
history = model.fit(
    [encoder_input, decoder_input],
    decoder_output_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.1,
    verbose=1
)

print("\n✓ Training Complete!")


Starting training...
Epoch 1/3
467/704 ━━━━━━━━━━━━━━━━━━━━ 25:23 6s/step - accuracy: 0.2242 - loss: 6.8696

In [None]:
# Plot training history
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss', linewidth=2)
plt.plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
plt.title('Model Loss Over Epochs', fontsize=14, fontweight='bold')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy', linewidth=2)
plt.plot(history.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
plt.title('Model Accuracy Over Epochs', fontsize=14, fontweight='bold')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()


## Inference Models for Response Generation

For chatbot interaction, we need separate encoder and decoder models:
- **Encoder Model**: Processes user input, outputs context states
- **Decoder Model**: Generates response word-by-word using context


In [None]:
# Encoder inference model
encoder_model_inf = Model(encoder_input_layer, encoder_states)

# Decoder inference model
decoder_state_input_h = Input(shape=(lstm_units,))
decoder_state_input_c = Input(shape=(lstm_units,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

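# Reuse the training-graph embedding output; it is tied to decoder_input_layer,
# so this model expects decoder inputs of shape (batch, max_answer_len)
decoder_embedding_inf = decoder_embedding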
decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm(
    decoder_embedding_inf, initial_state=decoder_states_inputs
)
decoder_states_inf = [state_h_inf, state_c_inf]
decoder_outputs_inf = decoder_dense(decoder_outputs_inf)

decoder_model_inf = Model(
    [decoder_input_layer] + decoder_states_inputs,
    [decoder_outputs_inf] + decoder_states_inf
)

print("✓ Inference models created successfully!")


## Generating Chatbot Responses

The generation process:
1. Encode user input to get context states
2. Initialize decoder with the `<START>` token
3. Predict next word iteratively until `<END>` or max length
4. Return generated response
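
The loop in steps 2 to 4 is greedy decoding: at every step the most probable token is chosen and fed back in. The sketch below isolates that control flow from Keras. `greedy_decode` and `toy_step` are hypothetical; the toy step function stands in for the decoder model and simply walks a fixed chain of token IDs.

```python
def greedy_decode(step_fn, initial_state, start_id, end_id, max_steps):
    """Repeatedly feed back the most probable token until end_id or max_steps."""
    token, state, output = start_id, initial_state, []
    for _ in range(max_steps):
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_id:
            break
        output.append(token)
    return output

# Toy stand-in for the decoder: deterministically walks 2 -> 4 -> 5 -> 3.
NEXT = {2: 4, 4: 5, 5: 3}

def toy_step(token, state):
    probs = [0.0] * 6
    probs[NEXT.get(token, 3)] = 1.0
    return probs, state + 1   # the "state" here just counts steps

print(greedy_decode(toy_step, 0, start_id=2, end_id=3, max_steps=10))
```

Bounding the loop by the number of steps, rather than by the number of words emitted, guarantees termination even if the model never produces the end token.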


In [None]:
def generate_response(input_text):
    """
    Generate chatbot response for given input (greedy decoding)
    """
    # Clean input
    input_text = clean_text(input_text)

    # Convert to sequence
    input_seq = tokenizer.texts_to_sequences([input_text])
    input_seq = pad_sequences(input_seq, maxlen=max_question_len, padding='post')

    # Encode input
    states_value = encoder_model_inf.predict(input_seq, verbose=0)

    # Tokenizer lowercases by default, so the tags are stored as '<start>' / '<end>'
    start_token = '<START>'.lower() if tokenizer.lower else '<START>'
    end_token = '<END>'.lower() if tokenizer.lower else '<END>'
    start_index = tokenizer.word_index[start_token]

    # The decoder inference model reuses the training input layer, so feed a
    # sequence of length max_answer_len: one real token followed by padding
    target_seq = np.zeros((1, max_answer_len))
    target_seq[0, 0] = start_index

    # Generate response word by word
    decoded_sentence = []

    # Bound the loop by steps so skipped tokens cannot cause an infinite loop
    for _ in range(max_answer_len):
        output_tokens, h, c = decoder_model_inf.predict([target_seq] + states_value, verbose=0)

        # Pick the most probable next word (prediction for the first, unpadded step)
        sampled_token_index = int(np.argmax(output_tokens[0, 0, :]))
        sampled_word = tokenizer.index_word.get(sampled_token_index)

        # Exit conditions: end tag or padding index
        if sampled_word is None or sampled_word == end_token:
            break
        if sampled_word not in (start_token, tokenizer.oov_token):
            decoded_sentence.append(sampled_word)

        # Update target sequence
        target_seq = np.zeros((1, max_answer_len))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return ' '.join(decoded_sentence)

print("✓ Response generator ready!")


In [None]:
# Test with sample inputs
test_questions = [
    "hi",
    "how are you",
    "what is your name",
    "where are you from",
    "tell me about yourself",
    "goodbye",
    "i love you",
    "what do you think",
    "can you help me",
    "thank you"
]

print("=" * 80)
print("TESTING CHATBOT RESPONSES")
print("=" * 80)

for question in test_questions:
    response = generate_response(question)
    print(f"\nYou: {question}")
    print(f"Bot: {response}")
    print("-" * 80)


## Interactive Chatbot Session

Now you can have a real conversation with your RNN chatbot!

**Note:** The chatbot's responses depend on training quality and may sometimes be:
- Repetitive (greedy decoding tends to fall back on frequent, safe phrases)
- Generic (learned from movie dialogues)
- Creative (unexpected combinations from training data)

Type 'quit', 'exit', 'bye', or 'goodbye' to end the conversation.


In [None]:
def chat_interactive():
    """
    Interactive chatbot interface
    """
    print("\n" + "=" * 80)
    print("🤖 CORNELL MOVIE DIALOGUE RNN CHATBOT")
    print("=" * 80)
    print("Start chatting! (Type 'quit', 'exit', 'bye', or 'goodbye' to end)\n")

    while True:
        user_input = input("You: ")

        # Ignore surrounding whitespace and case when checking for exit words
        if user_input.strip().lower() in ['quit', 'exit', 'bye', 'goodbye']:
            response = generate_response(user_input)
            print(f"Bot: {response}")
            print("\n👋 Thanks for chatting! Goodbye!")
            break

        if user_input.strip() == '':
            continue

        response = generate_response(user_input)
        print(f"Bot: {response}\n")

# Start interactive chat
chat_interactive()


## Saving Your Trained Chatbot

Save your model and tokenizer for future use without retraining.


In [None]:
# Save models in the native Keras format (.h5 is a legacy format in Keras 3)
model.save('rnn_chatbot_full_model.keras')
encoder_model_inf.save('encoder_model.keras')
decoder_model_inf.save('decoder_model.keras')

# Save tokenizer
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)

# Save configuration
config = {
    'max_question_len': max_question_len,
    'max_answer_len': max_answer_len,
    'vocab_size': vocab_size,
    'embedding_dim': embedding_dim,
    'lstm_units': lstm_units
}

with open('model_config.pickle', 'wb') as f:
    pickle.dump(config, f)

print("✓ Models and configuration saved successfully!")
print("\nSaved files:")
print("  - rnn_chatbot_full_model.keras")
print("  - encoder_model.keras")
print("  - decoder_model.keras")
print("  - tokenizer.pickle")
print("  - model_config.pickle")


## LSTM Memory Mechanisms and Limitations

**How LSTM Solves RNN's Vanishing Gradient Problem:**

1. **Gating Mechanisms**: LSTM uses three gates (forget, input, output) to control information flow
2. **Cell State**: Maintains a separate memory pathway that allows gradients to flow more smoothly
3. **Additive Updates**: The cell state is updated by gated addition rather than repeated matrix multiplication, which slows gradient decay
4. **Selective Memory**: Network learns *when* to remember and *when* to forget information
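
The points above can be written out with the standard LSTM equations. The notation here (weights $W$, $U$, biases $b$, candidate $g_t$, element-wise product $\odot$) is the common textbook one and is an assumption, since the notebook does not define symbols of its own.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(candidate content)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t && \text{(additive cell update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}

% Direct path through the cell state, ignoring the indirect
% dependence of the gates on c_{t-1} via h_{t-1}:
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_T}{\partial c_t} \approx \prod_{k=t+1}^{T} \operatorname{diag}(f_k)
```

When the forget gates stay close to 1, this product stays close to the identity, so the gradient along the cell state does not shrink at every step the way it does through repeated multiplication by a recurrent weight matrix in a vanilla RNN. When the gates are well below 1 the gradient still decays, which is why the effect is mitigation rather than a guarantee.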

**Why LSTM is Better Than Vanilla RNN:**

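- ✅ Retains information over much longer spans than a vanilla RNN (roughly 100+ tokens versus ~10-15, as a rule of thumb rather than a hard limit)
- ✅ Mitigates vanishing gradients through near-constant error flow along the cell state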
- ✅ Better at capturing long-term dependencies in conversations
- ✅ More stable training with deeper networks

**However, LSTM Still Has Limitations:**

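1. **Sequential Processing**: Time steps must be computed in order, so training cannot be parallelized across a sequence the way a Transformer's can
2. **Computational Cost**: About 4x the parameters of a vanilla RNN layer of the same size (three gates plus the candidate cell update)
3. **Very Long Sequences**: Still tends to struggle once sequences reach a few hundred tokens
4. **Context Window**: Limited memory compared to attention-based models
5. **Training Time**: Each step costs more than in a vanilla RNN of the same size; the partial training log above shows about 6 s per batch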
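
The 4x figure can be checked with a little arithmetic. The sketch below uses the usual parameter-count formula for a standard LSTM layer with one bias vector per weight block, plugged with this notebook's hyperparameters (`vocab_size = 38743`, `embedding_dim = 128`, `lstm_units = 256`). The helper function names are hypothetical.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

def simple_rnn_params(input_dim, units):
    """One weight block: input weights + recurrent weights + bias."""
    return units * (input_dim + units) + units

def lstm_params(input_dim, units):
    """Four such blocks: forget, input, output gates + candidate update."""
    return 4 * simple_rnn_params(input_dim, units)

vocab_size, embedding_dim, lstm_units = 38743, 128, 256

embedding = vocab_size * embedding_dim
rnn = simple_rnn_params(embedding_dim, lstm_units)
lstm = lstm_params(embedding_dim, lstm_units)
softmax = dense_params(lstm_units, vocab_size)
total = 2 * embedding + 2 * lstm + softmax  # encoder + decoder + output layer

print(f"SimpleRNN layer (for comparison): {rnn:>12,}")
print(f"LSTM layer:                       {lstm:>12,}")
print(f"Embedding layer:                  {embedding:>12,}")
print(f"Softmax output layer:             {softmax:>12,}")
print(f"Whole model:                      {total:>12,}")
```

With a vocabulary this large, the two embedding layers and the softmax output layer account for most of the parameters, and the two LSTM layers are a small share of the total.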

**Performance Comparison:**

| Model | Typical Effective Context (rule of thumb) | Parallel Across Time Steps | Training Speed | Long-term Memory |
|-------|-------------------|-----------------|----------------|------------------|
| RNN | ~10-15 tokens | ❌ No | Fast per step | ❌ Poor |
| **LSTM (This Model)** | ~100-200 tokens | ❌ No | Moderate | ✅ Good |
| GRU | ~100-200 tokens | ❌ No | Faster than LSTM | ✅ Good |
| Transformer (BERT, GPT) | 512-4096+ tokens (fixed context window) | ✅ Yes | Fast on parallel hardware | ✅✅ Excellent |

**When to Use LSTM:**

- ✅ Small to medium datasets (an LSTM trained from scratch can be competitive with a Transformer trained from scratch when data is limited)
- ✅ Real-time applications with limited computational resources
- ✅ Sequential tasks with moderate-length dependencies (chatbots, sentiment analysis)
- ✅ Time-series prediction and forecasting

**Next Steps - Modern Alternatives:**

- **GRU (Gated Recurrent Unit)**: Simplified LSTM with fewer parameters, often similar performance
- **Attention Mechanisms**: Add attention layers to LSTM for better context focus
- **Transformers**: State-of-the-art for NLP (BERT, GPT, T5) - use attention instead of recurrence
- **Hybrid Models**: Combine LSTM with Transformers for best of both worlds

**This chatbot demonstrates LSTM's strong sequential learning capabilities. For production chatbots with massive datasets, consider Transformer-based models (GPT, DialoGPT, BlenderBot)!**


In [None]:
# Analyze model behavior with different input lengths
test_lengths = {
    "Short": "hi there",
    "Medium": "how are you doing today",
    "Long": "can you tell me what you think about the weather today and tomorrow"
}

print("=" * 80)
print("ANALYZING RNN BEHAVIOR WITH DIFFERENT INPUT LENGTHS")
print("=" * 80)

for length_type, test_input in test_lengths.items():
    response = generate_response(test_input)
    word_count = len(test_input.split())
    print(f"\n{length_type} Input ({word_count} words):")
    print(f"Input: {test_input}")
    print(f"Response: {response}")
    print("-" * 80)
