## Load and Display movie_lines.txt

In [1]:
# Define the dataset path
dataset_path = "dataset/movie_lines.txt"

# Load and display the first few lines
def load_and_inspect_data(file_path, num_lines=5):
    with open(file_path, encoding='iso-8859-1') as f:  # Using 'iso-8859-1' encoding to handle special characters
        for i in range(num_lines):
            print(f.readline().strip())

# Load and display the first 5 lines of the movie_lines.txt file
load_and_inspect_data(dataset_path, num_lines=5)


L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.


From the output above, we see that each line consists of five fields separated by ` +++$+++ `:

- L1045: Line ID
- u0: Character ID (who is speaking)
- m0: Movie ID (which movie the line is from)
- BIANCA: Character name
- They do not!: The actual line spoken

We'll keep two of these fields: the line ID, which lets us link lines into conversations later, and the dialogue (the last field), because the chatbot's responses are based on actual lines spoken in conversations.
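
To make the field layout concrete, here is a minimal sketch that splits one raw record into named fields. The names `MovieLine`, `parse_line`, and `speaker_id` are hypothetical and chosen only for illustration; they are not part of the dataset or of the cells in this notebook.

```python
from typing import NamedTuple, Optional

FIELD_SEPARATOR = " +++$+++ "  # separator seen in the sample output above


class MovieLine(NamedTuple):
    """One record of movie_lines.txt (field names are illustrative)."""
    line_id: str
    speaker_id: str
    movie_id: str
    character_name: str
    text: str


def parse_line(raw: str) -> Optional[MovieLine]:
    """Split one raw record into named fields, or return None if it is malformed."""
    parts = raw.strip().split(FIELD_SEPARATOR)
    if len(parts) != 5:
        return None
    return MovieLine(*parts)


sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
record = parse_line(sample)
print(record.line_id, record.character_name, record.text)
```

Named fields make it harder to mix up positional indexes such as `parts[0]` and `parts[4]` when more of the record is needed later.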

## Parse and Store Movie Lines

In [2]:
# Parse the movie_lines.txt file and store it in a dictionary
def parse_movie_lines(file_path):
    lines_dict = {}
    with open(file_path, encoding='iso-8859-1') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 5:
                line_id = parts[0]  # Line ID
                dialogue = parts[4]  # Actual line
                lines_dict[line_id] = dialogue
    return lines_dict

# Parse the movie_lines.txt file
movie_lines_dict = parse_movie_lines(dataset_path)

# Check the size of the dictionary and print some examples
print(f"Total lines parsed: {len(movie_lines_dict)}")
for i, (line_id, dialogue) in enumerate(movie_lines_dict.items()):
    if i < 5:  # Print the first 5 parsed lines
        print(f"{line_id}: {dialogue}")


Total lines parsed: 304446
L1045: They do not!
L1044: They do to!
L985: I hope so.
L984: She okay?
L925: Let's go.


## Parse and Link Conversations

In [3]:
import ast

# Define the path for the movie_conversations.txt file
conversations_path = "dataset/movie_conversations.txt"

# Parse the movie_conversations.txt file and store conversations as lists of line IDs
def parse_movie_conversations(file_path):
    conversations = []
    with open(file_path, encoding='iso-8859-1') as f:
        for line in f:
            parts = line.strip().split(" +++$+++ ")
            if len(parts) == 4:
                # The last part contains the list of line IDs as a string, e.g., "['L1045', 'L1044', ...]"
                line_ids_str = parts[3]
                # Parse the list literal safely; unlike eval, ast.literal_eval only accepts Python literals
                line_ids = ast.literal_eval(line_ids_str)
                conversations.append(line_ids)
    return conversations

# Parse the movie_conversations.txt file
movie_conversations = parse_movie_conversations(conversations_path)

# Check the number of conversations and print a few examples
print(f"Total conversations parsed: {len(movie_conversations)}")
for i, conversation in enumerate(movie_conversations):
    if i < 5:  # Print the first 5 parsed conversations (line IDs only)
        print(f"Conversation {i+1}: {conversation}")


Total conversations parsed: 83097
Conversation 1: ['L194', 'L195', 'L196', 'L197']
Conversation 2: ['L198', 'L199']
Conversation 3: ['L200', 'L201', 'L202', 'L203']
Conversation 4: ['L204', 'L205', 'L206']
Conversation 5: ['L207', 'L208']


## Link Line IDs to Dialogues

In [4]:
# Link the line IDs from conversations to the actual dialogues from movie_lines_dict
def link_conversations_to_dialogues(conversations, lines_dict):
    full_conversations = []
    for conversation in conversations:
        conv_dialogues = []
        for line_id in conversation:
            if line_id in lines_dict:
                conv_dialogues.append(lines_dict[line_id])
        full_conversations.append(conv_dialogues)
    return full_conversations

# Link conversations to the actual dialogues
linked_conversations = link_conversations_to_dialogues(movie_conversations, movie_lines_dict)

# Check the first few complete conversations
for i, conversation in enumerate(linked_conversations):
    if i < 3:  # Print the first 3 full conversations (actual dialogues)
        print(f"Conversation {i+1}: {conversation}")


Conversation 1: ['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.', "Well, I thought we'd start with pronunciation, if that's okay with you.", 'Not the hacking and gagging and spitting part.  Please.', "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?"]
Conversation 2: ["You're asking me out.  That's so cute. What's your name again?", 'Forget it.']
Conversation 3: ["No, no, it's my fault -- we didn't have a proper introduction ---", 'Cameron.', "The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.", 'Seems like she could get a date easy enough...']


The output above shows each line ID replaced by its actual dialogue, giving complete conversations. The chatbot can use these to learn how to handle multi-turn interactions and contextual responses.
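
One common way to turn such conversations into training examples is to pair each utterance with the one that follows it. The sketch below assumes that framing; `make_pairs` is a hypothetical helper, not something defined elsewhere in this notebook, and the toy input reuses a few of the lines printed above.

```python
def make_pairs(conversations):
    """Pair each utterance with the next one in the same conversation."""
    pairs = []
    for conversation in conversations:
        # zip stops at the shorter input, so the last utterance has no response
        for prompt, response in zip(conversation, conversation[1:]):
            pairs.append((prompt, response))
    return pairs


# Toy input built from lines shown in the output above
example_conversations = [
    ["Cameron.", "Seems like she could get a date easy enough..."],
    ["She okay?", "I hope so.", "Let's go."],
]

pairs = make_pairs(example_conversations)
for prompt, response in pairs:
    print(f"{prompt!r} -> {response!r}")
```

A conversation with n utterances yields n - 1 pairs, so single-line conversations contribute nothing.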

## Tokenize Conversations

In [6]:
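import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; if it is missing, run nltk.download('punkt') once
# (newer NLTK releases ask for 'punkt_tab' instead)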

# Tokenize each conversation
def tokenize_conversations(conversations):
    tokenized_conversations = []
    for conversation in conversations:
        tokenized_conversation = [word_tokenize(sentence.lower()) for sentence in conversation]  # Tokenize each sentence and convert to lowercase
        tokenized_conversations.append(tokenized_conversation)
    return tokenized_conversations

# Tokenize the conversations
tokenized_conversations = tokenize_conversations(linked_conversations)

# Check a few tokenized conversations
for i, conversation in enumerate(tokenized_conversations):
    if i < 3:  # Print first 3 tokenized conversations
        print(f"Tokenized Conversation {i+1}: {conversation}")


Tokenized Conversation 1: [['can', 'we', 'make', 'this', 'quick', '?', 'roxanne', 'korrine', 'and', 'andrew', 'barrett', 'are', 'having', 'an', 'incredibly', 'horrendous', 'public', 'break-', 'up', 'on', 'the', 'quad', '.', 'again', '.'], ['well', ',', 'i', 'thought', 'we', "'d", 'start', 'with', 'pronunciation', ',', 'if', 'that', "'s", 'okay', 'with', 'you', '.'], ['not', 'the', 'hacking', 'and', 'gagging', 'and', 'spitting', 'part', '.', 'please', '.'], ['okay', '...', 'then', 'how', "'bout", 'we', 'try', 'out', 'some', 'french', 'cuisine', '.', 'saturday', '?', 'night', '?']]
Tokenized Conversation 2: [['you', "'re", 'asking', 'me', 'out', '.', 'that', "'s", 'so', 'cute', '.', 'what', "'s", 'your', 'name', 'again', '?'], ['forget', 'it', '.']]
Tokenized Conversation 3: [['no', ',', 'no', ',', 'it', "'s", 'my', 'fault', '--', 'we', 'did', "n't", 'have', 'a', 'proper', 'introduction', '--', '-'], ['cameron', '.'], ['the', 'thing', 'is', ',', 'cameron', '--', 'i', "'m", 'at', 'the', '

## Build Vocabulary and Convert Tokens to IDs

In [7]:
from collections import defaultdict

# Build a vocabulary dictionary mapping each word to a unique index
def build_vocabulary(tokenized_conversations):
    vocab = defaultdict(lambda: len(vocab))  # Assigns an incrementing ID to each new word
    vocab['<PAD>'] = 0  # Reserve 0 for padding
    vocab['<UNK>'] = 1  # Reserve 1 for unknown words

    for conversation in tokenized_conversations:
        for sentence in conversation:
            for word in sentence:
                vocab[word]  # Adds word to vocab if not already present

    return dict(vocab)

# Build the vocabulary
vocab = build_vocabulary(tokenized_conversations)

# Convert tokenized conversations to sequences of word IDs
def convert_to_ids(tokenized_conversations, vocab):
    conversations_ids = []
    for conversation in tokenized_conversations:
        conv_ids = [[vocab.get(word, vocab['<UNK>']) for word in sentence] for sentence in conversation]
        conversations_ids.append(conv_ids)
    return conversations_ids

# Convert tokenized conversations to IDs
conversations_ids = convert_to_ids(tokenized_conversations, vocab)

# Check the first few conversations as IDs
for i, conversation in enumerate(conversations_ids):
    if i < 3:  # Print first 3 conversations
        print(f"Conversation {i+1} (as IDs): {conversation}")


Conversation 1 (as IDs): [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 24], [26, 27, 28, 29, 3, 30, 31, 32, 33, 27, 34, 35, 36, 37, 32, 38, 24], [39, 22, 40, 10, 41, 10, 42, 43, 24, 44, 24], [37, 45, 46, 47, 48, 3, 49, 50, 51, 52, 53, 24, 54, 7, 55, 7]]
Conversation 2 (as IDs): [[38, 56, 57, 58, 50, 24, 35, 36, 59, 60, 24, 61, 36, 62, 63, 25, 7], [64, 65, 24]]
Conversation 3 (as IDs): [[66, 27, 66, 27, 65, 36, 67, 68, 69, 3, 70, 71, 72, 73, 74, 75, 69, 76], [77, 24], [22, 78, 79, 27, 77, 69, 28, 80, 81, 22, 82, 83, 73, 84, 85, 86, 83, 87, 24, 67, 88, 24, 28, 89, 71, 90, 91, 92, 93, 24], [94, 95, 92, 96, 97, 73, 90, 98, 99, 45]]


The conversations are now converted into sequences of numerical IDs. Each token in a conversation is mapped to an ID based on the vocabulary we built, with 0 reserved for `<PAD>` and 1 for `<UNK>`.
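
The sketch below illustrates how such a mapping behaves on text it has not seen, and how to map IDs back to tokens. It uses a small hand-written `toy_vocab` rather than the real vocabulary, and `encode` and `decode` are hypothetical helper names.

```python
PAD, UNK = "<PAD>", "<UNK>"

# Hand-written toy vocabulary, for illustration only
toy_vocab = {PAD: 0, UNK: 1, "let": 2, "'s": 3, "go": 4, ".": 5}
id_to_word = {idx: word for word, idx in toy_vocab.items()}


def encode(tokens, vocab):
    """Map tokens to IDs, falling back to the <UNK> ID for unseen tokens."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


def decode(ids, inverse_vocab):
    """Map IDs back to tokens, skipping padding."""
    return [inverse_vocab.get(i, UNK) for i in ids if i != toy_vocab[PAD]]


ids = encode(["let", "'s", "dance", "."], toy_vocab)
print(ids)                       # [2, 3, 1, 5]
print(decode(ids, id_to_word))   # ['let', "'s", '<UNK>', '.']
```

Because the vocabulary in this notebook is built from every tokenized sentence, `<UNK>` only comes into play for input that was not part of the data used to build it, such as a user's message at inference time.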

## Padding the Sequences

In [9]:
# Pad or truncate every sentence to exactly max_length token IDs
def pad_conversations(conversations_ids, max_length=20):
    padded_conversations = []
    for conversation in conversations_ids:
        padded_conversation = []
        for sentence in conversation:
            # Truncate to max_length, then pad with 0s (<PAD>); a non-positive multiplier adds nothing
            padded_sentence = sentence[:max_length] + [0] * (max_length - len(sentence))
            padded_conversation.append(padded_sentence)
        padded_conversations.append(padded_conversation)
    return padded_conversations

# Pad the conversations
padded_conversations = pad_conversations(conversations_ids, max_length=20)

# Check the first few padded conversations
for i, conversation in enumerate(padded_conversations):
    if i < 3:  # Print first 3 padded conversations
        print(f"Padded Conversation {i+1}: {conversation}")


Padded Conversation 1: [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21], [26, 27, 28, 29, 3, 30, 31, 32, 33, 27, 34, 35, 36, 37, 32, 38, 24, 0, 0, 0], [39, 22, 40, 10, 41, 10, 42, 43, 24, 44, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0], [37, 45, 46, 47, 48, 3, 49, 50, 51, 52, 53, 24, 54, 7, 55, 7, 0, 0, 0, 0]]
Padded Conversation 2: [[38, 56, 57, 58, 50, 24, 35, 36, 59, 60, 24, 61, 36, 62, 63, 25, 7, 0, 0, 0], [64, 65, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
Padded Conversation 3: [[66, 27, 66, 27, 65, 36, 67, 68, 69, 3, 70, 71, 72, 73, 74, 75, 69, 76, 0, 0], [77, 24, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [22, 78, 79, 27, 77, 69, 28, 80, 81, 22, 82, 83, 73, 84, 85, 86, 83, 87, 24, 67], [94, 95, 92, 96, 97, 73, 90, 98, 99, 45, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


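All sentences now have a uniform length of 20 token IDs: shorter sentences are padded with 0 (`<PAD>`), and longer ones, such as the first sentence of Conversation 1, are truncated. Uniform lengths make the data easy to batch for model training.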
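
Since truncation discards tokens, it can help to check how many sentences actually fit within a given `max_length` before settling on a value. The following is a minimal sketch of that check; `sentence_lengths` and `coverage_at` are hypothetical helpers, and the toy input is made up for illustration.

```python
def sentence_lengths(conversations_ids):
    """Collect the token count of every sentence across all conversations."""
    return [len(sentence) for conversation in conversations_ids for sentence in conversation]


def coverage_at(lengths, max_length):
    """Fraction of sentences that fit within max_length without truncation."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_length) / len(lengths)


# Toy ID sequences (values are arbitrary; only the lengths matter here)
toy_conversations_ids = [
    [[2] * 25, [3] * 17, [4] * 11, [5] * 16],
    [[6] * 17, [7] * 3],
    [[8] * 18, [9] * 2, [10] * 30, [11] * 10],
]

lengths = sentence_lengths(toy_conversations_ids)
for candidate in (10, 20, 30):
    print(f"max_length={candidate}: {coverage_at(lengths, candidate):.0%} of sentences fit")
```

Running the same check on `conversations_ids` would show what share of the real sentences a limit of 20 leaves intact.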

## Split Data into Training and Test Sets

In [10]:
from sklearn.model_selection import train_test_split

# Flatten conversations so they can be split into training and test sets
def flatten_conversations(padded_conversations):
    all_conversations = []
    for conversation in padded_conversations:
        all_conversations.extend(conversation)  # Add each sentence to the list
    return all_conversations

# Flatten the conversations
all_sentences = flatten_conversations(padded_conversations)

# Split the data into 80% training and 20% test
train_sentences, test_sentences = train_test_split(all_sentences, test_size=0.2, random_state=42)

# Check the sizes of the train and test sets
print(f"Training set size: {len(train_sentences)}")
print(f"Test set size: {len(test_sentences)}")


Training set size: 243556
Test set size: 60890
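
The split above treats every sentence as an independent example, so sentences from the same conversation can land in both sets. If the model is later trained on prompt and response pairs, one alternative is to split whole conversations instead. The sketch below shows that idea using only the standard library; `split_by_conversation` is a hypothetical helper and the toy data is made up.

```python
import random


def split_by_conversation(conversations, test_fraction=0.2, seed=42):
    """Shuffle whole conversations, then split, so none spans both sets."""
    indices = list(range(len(conversations)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(indices) * test_fraction))
    test_indices = set(indices[:n_test])
    train = [c for i, c in enumerate(conversations) if i not in test_indices]
    test = [c for i, c in enumerate(conversations) if i in test_indices]
    return train, test


# Ten toy conversations, each a list of padded sentences (IDs are arbitrary)
toy_conversations = [[[i, 0, 0], [i + 100, 0, 0]] for i in range(10)]

train_convs, test_convs = split_by_conversation(toy_conversations)
print(f"Training conversations: {len(train_convs)}")
print(f"Test conversations: {len(test_convs)}")
```

Whether this is preferable depends on the model design chosen in the next steps; for a model that only looks at single sentences, the sentence-level split is sufficient.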


### Completed as of September 26th:
- Loaded and parsed `movie_lines.txt` and `movie_conversations.txt`.
- Linked dialogue lines to create full conversations.
- Tokenized the conversations.
- Built a vocabulary and converted the conversations to numerical IDs.
- Padded (or truncated) every sentence to a fixed length of 20.
- Split the data into training and test sets.

### Next Steps:
- Design the model architecture.
- Train the model using the processed training data.
- Evaluate the model on the test data.
- Tune the model (if needed).
- Convert the project to a chatbot accessible via a webpage using Flask and Bootstrap.
