# Chatbot Using Seq2Seq Approach

## Introduction

This project focuses on building a **sequence-to-sequence (Seq2Seq) chatbot** using the Cornell Movie Dialog dataset. The Seq2Seq architecture is specifically designed to handle conversational data by converting an input sequence (a user query) into an output sequence (a chatbot response). 

The dataset contains a wealth of dialogues from various movies, making it ideal for training a chatbot to handle diverse conversational topicI We employ **LSTM layers** in both the encoder and decoder to capture the temporal dependencies and structure of the dialogues.

To reduce computational complexI willy, we work with a sample of 30,000 dialogue pairs, applying essential preprocessing steps like text cleaning, tokenization, and padding of sequences. The final chatbot is trained using this Seq2Seq approach, providing the ability to generate coherent responses based on user input.

This notebook will walk through the dataset preparation, model building, and chatbot demonstation.


In [1]:
# Import the necessary libraries
import re
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Embedding, LSTM, Input

## Loading the Dataset

We load the dialogue lines and conversations from the Cornell Movie Dialog dataset files. 

These lines represent dialogues, and conversations contain pairs of dialogues used for training.

In [2]:
# Load dialogue lines from the file
lines = open('movie_lines.txt', encoding='utf-8', errors='ignore').read().split('\n')

# Load conversation pairs from the file
conversations = open('movie_conversations.txt', encoding='utf-8', errors='ignore').read().split('\n')

## Preprocessing

### Extracting Conversations into Pairs

This section processes the conversation data to extract conversation pairs. 

We split the conversations to build a list of dialogue pairs.

A dictionary is created to map each line ID to its corresponding dialogue text.

In [3]:
# Extract conversations into pairs
conversation_pairs = []
for conversation in conversations:
    conversation_pairs.append(conversation.split(' +++$+++ ')[-1][1:-1].replace("'", " ").replace(",", "").split())

# Create a dictionary mapping line IDs to dialogue text
dialogue_dict = {}
for line in lines:
    dialogue_dict[line.split(' +++$+++ ')[0]] = line.split(' +++$+++ ')[-1]

###  Preparing Input-Response Pairs

This part prepares the input and response pairs based on the extracted conversation pairs.

In [None]:
# Prepare question-answer pairs
inputs = []
responses = []

for conversation in conversation_pairs:
    for i in range(len(conversation) - 1):
        inputs.append(dialogue_dict[conversation[i]])
        responses.append(dialogue_dict[conversation[i + 1]])

In [4]:
inputs[:5]

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.',
 "You're asking me out.  That's so cute. What's your name again?",
 "No, no, it's my fault -- we didn't have a proper introduction ---"]

In [5]:
responses[:5]

["Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.',
 "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?",
 'Forget it.',
 'Cameron.']

In [6]:
print(len(inputs))

221616


In [7]:
print(len(responses))

221616


### Filtering Inputs and Responses by Length

To ensure uniformity, we limit the length of questions and answers to less than 15 words.

In [8]:
# Set the maximum length to 15
# Filter questions and answers based on their length
filtered_inputs = []
filtered_responses = []
for i in range(len(inputs)):
    if len(inputs[i]) < 15:
        filtered_inputs.append(inputs[i])
        filtered_responses.append(responses[i])

### Text Cleaning Function

I define a function to clean text by lowercasing, removing contractions, and eliminating punctuation.

In [9]:
# Function to clean text (lowercase, remove contractions, punctuation)
def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"he's", "he is", text)
    text = re.sub(r"she's", "she is", text)
    text = re.sub(r"that's", "that is", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"where's", "where is", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"couldn't", "could not", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"wouldn't", "would not", text)
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    return text

### Cleaning Inputs and Responses

The inputs and responses are cleaned using the clean_text function.

In [33]:
# Clean the questions and answers
cleaned_inputs = []
cleaned_responses = []

for inputs in filtered_inputs:
    cleaned_inputs.append(clean_text(inputs))

for response in filtered_responses:
    cleaned_responses.append(clean_text(response))

# Trim answers to a maximum of 15 words (Through EDA)
for i in range(len(cleaned_responses)):
    cleaned_responses[i] = ' '.join(cleaned_responses[i].split()[:15])


### Sample Data

To manage computational resources, we limit the dataset to a sample of 30,000 records.

In [11]:
# Limit the dataset to 30,000 entries
cleaned_inputs = cleaned_inputs[:30000]
cleaned_responses = cleaned_responses[:30000]

### Counting Word Frequencies

This section counts the occurrences of each word in the inputs and responses to build the vocabulary.

In [12]:
# Count word occurrences
word_count = {}

for inputs in cleaned_inputs:
    for word in inputs.split():
        if word not in word_count:
            word_count[word] = 1
        else:
            word_count[word] += 1

for response in cleaned_responses:
    for word in response.split():
        if word not in word_count:
            word_count[word] = 1
        else:
            word_count[word] += 1

In [13]:
print(word_count)



### Tokenization

Words that appear less than 5 times are filtered out to reduce the vocabulary size.

In [14]:
# Remove less frequent words based on a threshold
frequency_threshold = 5
vocabulary = {}
word_index = 0

for word, count in word_count.items():
    if count >= frequency_threshold:
        vocabulary[word] = word_index
        word_index += 1
        
# Tokens are added before and after the response
# We modify the responses to add start and end tokens and ensure a uniform length of 15 tokens.
for index in range(len(cleaned_responses)):
    cleaned_responses[index] = '<SOS> ' + cleaned_responses[index] + ' <EOS>'

# Add special tokens to the vocabulary
# Special tokens like <SOS> (start), <EOS> (end), <PAD> (padding), and <OUT> (unknown) are added to the vocabulary.
# When the word is not present in the vocabulary, it is replaced with <OUT>
special_tokens = ['<PAD>', '<EOS>', '<OUT>', '<SOS>']
word_index = len(vocabulary)
for token in special_tokens:
    vocabulary[token] = word_index
    word_index += 1

# Ensure 'cameron' maps to '<PAD>' (0)
vocabulary['cameron'] = vocabulary['<PAD>']
vocabulary['<PAD>'] = 0 # As 'cameron' is assigned '0', in order to assign PAD '0', it is altered

# Create an inverse vocabulary mapping for decoding
# The inverse vocabulary is created to map indices back to words, which is useful during the decoding phase.
inverse_vocab = {index: word for word, index in vocabulary.items()}

In [15]:
print(inverse_vocab)



In [31]:
print(len(vocabulary))

3334


### Converting Text to Sequences

Inputs and responses are converted into sequences of integers based on the vocabulary.

In [17]:
# Convert questions and answers to sequences of integers
encoder_input = []
for inputs in cleaned_inputs:
    sequence = []
    for word in inputs.split():
        if word not in vocabulary:
            sequence.append(vocabulary['<OUT>'])
        else:
            sequence.append(vocabulary[word])
    encoder_input.append(sequence)

decoder_input = []
for response in cleaned_responses:
    sequence_response = []
    for word in response.split():
        if word not in vocabulary:
            sequence_response.append(vocabulary['<OUT>'])
        else:
            sequence_response.append(vocabulary[word])
    decoder_input.append(sequence_response)

### Padding the Sequences

The sequences are padded to ensure that all inputs and responses have the same length of 15 tokens.

In [18]:
# Pad sequences to a fixed length of 15
encoder_input = pad_sequences(encoder_input, 15, padding='post', truncating='post')
decoder_input = pad_sequences(decoder_input, 15, padding='post', truncating='post')

In [19]:
print(encoder_input)
print(decoder_input)

[[3329    0    0 ...    0    0    0]
 [   0    0    0 ...    0    0    0]
 [   1    0    0 ...    0    0    0]
 ...
 [ 160 1098    0 ...    0    0    0]
 [  17    9  166 ...    0    0    0]
 [ 106   31    0 ...    0    0    0]]
[[3332   86  674 ... 2472 3331 3331]
 [3332 3331 1188 ...  808 1227  108]
 [3332   47 3330 ...    0    0    0]
 ...
 [3332   17    9 ...    0    0    0]
 [3332    9  157 ...    0    0    0]
 [3332   83  178 ...    0    0    0]]


### Decoder Final Output

The decoder's final output is prepared by shifting the input sequence and one-hot encoding the result.

In [20]:
# Prepare the decoder's final output
decoder_final_output = []
for sequence in decoder_input:
    decoder_final_output.append(sequence[1:])  # Remove the start token

decoder_final_output = pad_sequences(decoder_final_output, 15, padding='post', truncating='post')
decoder_final_output = to_categorical(decoder_final_output, len(vocabulary))
print(decoder_final_output.shape)

(30000, 15, 3334)


## Model Architecture

In this section, we define the model using an Seq2Seq encoder-decoder architecture with LSTM layers for sequence modeling.

In [21]:
# Define the model architecture
encoder_input_layer = Input(shape=(15,))
decoder_input_layer = Input(shape=(15,))

# Embedding layer for input representation
embedding_layer = Embedding(len(vocabulary) + 1, output_dim=50, input_length=15, trainable=True)

# Encoder
encoder_embedding_output = embedding_layer(encoder_input_layer)
encoder_lstm_layer = LSTM(400, return_sequences=True, return_state=True)
encoder_output, hidden_state, cell_state = encoder_lstm_layer(encoder_embedding_output)
encoder_states = [hidden_state, cell_state]

# Decoder
decoder_embedding_output = embedding_layer(decoder_input_layer)
decoder_lstm_layer = LSTM(400, return_sequences=True, return_state=True)
decoder_output, _, _ = decoder_lstm_layer(decoder_embedding_output, initial_state=encoder_states)

# Output layer
dense_layer = Dense(len(vocabulary), activation='softmax')
decoder_output_final = dense_layer(decoder_output)



### Model Compilation

I compile the model with 'categorical_crossentropy' loss and 'adam' optimizer for training.

In [22]:
# Compile the model
chatbot_model_2 = Model([encoder_input_layer, decoder_input_layer], decoder_output_final)
chatbot_model_2.compile(loss='categorical_crossentropy', metrics=['acc'], optimizer='adam')

In [23]:
chatbot_model_2.summary()

### Training the Model

The model is trained using the prepared data for 100 epochs.

In [24]:
# Train the model
chatbot_model_2.fit([encoder_input, decoder_input], decoder_final_output, epochs=100)

Epoch 1/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m302s[0m 307ms/step - acc: 0.4904 - loss: 3.3592
Epoch 2/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m329s[0m 315ms/step - acc: 0.5456 - loss: 2.6667
Epoch 3/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m303s[0m 323ms/step - acc: 0.5585 - loss: 2.5219
Epoch 4/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m310s[0m 310ms/step - acc: 0.5610 - loss: 2.4532
Epoch 5/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m319s[0m 340ms/step - acc: 0.5683 - loss: 2.3745
Epoch 6/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m318s[0m 335ms/step - acc: 0.5667 - loss: 2.3482
Epoch 7/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m287s[0m 306ms/step - acc: 0.5692 - loss: 2.2941
Epoch 8/100
[1m938/938[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m279s[0m 297ms/step - acc: 0.5734 - loss: 2.2344
Epoch 9/100
[1m938/938[0m [32m━━━━━━━

<keras.src.callbacks.history.History at 0x21f9c11c850>

## Inference Model

### Encoder Model for Inference

I define the encoder model to generate states for the decoder during inference.

In [None]:
# Encoder model for inference
encoder_model = Model([encoder_input_layer], encoder_states)

### Decoder Model for Inference

This section defines the decoder model for inference, which generates the next word in the sequence based on the encoder states.

In [25]:
# Decoder model for inference
decoder_state_input_h = Input(shape=(400,))
decoder_state_input_c = Input(shape=(400,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Decoder outputs
decoder_outputs, state_h, state_c = decoder_lstm_layer(decoder_embedding_output, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

# Final decoder model
decoder_model = Model([decoder_input_layer] + decoder_states_inputs, [decoder_outputs] + decoder_states)

In [26]:
decoder_model.summary()

## Running the Chatbot

This section runs the chatbot in an interactive mode, where user inputs are processed, and responses are generated using the trained model.

In [38]:
print("*****************************************************************")
print("*                        ChatBot Ver. 1.0                       *")
print("*****************************************************************")
print("*               Enter 'quit' to exit from the chat              *")
print("*****************************************************************")
user_input = ""
while user_input != 'quit':
    user_input = input("You: ")
    
    # Clean the user input
    user_input_cleaned = clean_text(user_input)

    # Prepare input for the encoder
    user_input_sequence = [user_input_cleaned]
    sequence_list = []
    for text in user_input_sequence:
        sequence = []
        for word in text.split():
            try:
                sequence.append(vocabulary[word])
            except:
                sequence.append(vocabulary['<OUT>'])
        sequence_list.append(sequence)

    # Pad the input sequence
    sequence_list = pad_sequences(sequence_list, 15, padding='post')

    # Predict encoder states
    encoder_states_output = encoder_model.predict(sequence_list)

    # Prepare the initial input for the decoder
    initial_decoder_input = np.zeros((1, 1))
    initial_decoder_input[0, 0] = vocabulary['<SOS>']  # Start with <SOS>

    stop_condition = False
    decoded_response = ''

    while not stop_condition:
        # Predict the next word in the sequence
        decoder_outputs, hidden_state, cell_state = decoder_model.predict([initial_decoder_input] + encoder_states_output)
        decoder_output_scores = dense_layer(decoder_outputs)
        
        # Sample a word based on the predicted output
        sampled_word_index = np.argmax(decoder_output_scores[0, -1, :])
        sampled_word = inverse_vocab[sampled_word_index] + ' '

        # Stop if we hit <EOS> or exceed 13 words
        if sampled_word != '<EOS> ':
            decoded_response += sampled_word  

        if sampled_word == '<EOS> ' or len(decoded_response.split()) > 15:
            stop_condition = True 

        # Update the input for the next decoder prediction
        initial_decoder_input = np.zeros((1, 1))  
        initial_decoder_input[0, 0] = sampled_word_index
        encoder_states_output = [hidden_state, cell_state]  # Update states

    print("Chatbot: ", decoded_response)
    print("-----------------------------------------------------------------")

*****************************************************************
*                        ChatBot Ver. 1.0                       *
*****************************************************************
*               Enter 'quit' to exit from the chat              *
*****************************************************************


You:  hey


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 129ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 79ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
Chatbot:  hey 
-----------------------------------------------------------------


You:  how are you?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 39ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step
Chatbot:  fine i am fine 
-----------------------------------------------------------------


You:  I am hungry


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 41ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 40ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
Chatbot:  i am going to wait here until she comes 
-----------------------------------------------------------------


You:  who will come?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 28ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 25ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step
Chatbot:  you know what i am talking about 
-----------------------------------------------------------------


You:  I don't know


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 39ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 46ms/step
Chatbot:  i will be there and 
-----------------------------------------------------------------


You:  and what?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 39ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47

You:  don't be scared


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
Chatbot:  <OUT> 
-----------------------------------------------------------------


You:  are you there?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
Chatbot:  i am sure 
-----------------------------------------------------------------


You:  see you then


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 44ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 36ms/step
Chatbot:  i thought you were never any better for you 
-----------------------------------------------------------------


You:  which is your favourite movie?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 35ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
Chatbot:  that guy from west 
-----------------------------------------------------------------


You:  which guy?


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 34ms/step
Chatbot:  passion 
-----------------------------------------------------------------


You:  I do not understand


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 29ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
Chatbot:  i dont care 
-----------------------------------------------------------------


You:  quit


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 31ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 33ms/step
Chatbot:  mr <OUT> is good enough 
-----------------------------------------------------------------


## Model Saving

Finally, I save the trained chatbot model for later use.

In [29]:
# Save the model to a file
chatbot_model_2.save("chatbot_model_2.h5")



In [30]:
# from tensorflow.keras.models import load_model

# # Load the saved model
# loaded_model = load_model("chatbot_model_2.h5")

## Conclusion

In this project, I successfully developed a **Seq2Seq chatbot** using an LSTM-based encoder-decoder architecture. I worked with a subset of the Cornell Movie Dialog dataset, focusing on data cleaning, sequence preparation, and vocabulary generation to train the model effectively. 

The Seq2Seq model allows the chatbot to generate meaningful responses by encoding input sequences and decoding them into relevant output sequences. Although the model performs reasonably well on simple inputs, it is constrained by the size of the dataset and available computational resources.

Future work could involve training on a larger dataset and experimenting with more sophisticated architectures like transformer-based models (e.g., GPT or BERT). This project provides a strong foundation for generative chatbot development and opens up avenues for further enhancements.
