In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## About

The Cornell Movie-Dialogs Corpus is a dataset that contains conversations between characters from over 600 movies. In this notebook we train a binary classifier that labels a conversation as romantic or not romantic, using the genre list of a movie as the source of the label.



Importing modules

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader


The following snippet declares the paths to the corpus files.

In [2]:
# Define the path to the data files
metadata_path = "/kaggle/input/cornell-moviedialog-corpus/movie_characters_metadata.txt"
conversations_path = "/kaggle/input/cornell-moviedialog-corpus/movie_conversations.txt"
lines_path = "/kaggle/input/cornell-moviedialog-corpus/movie_lines.txt"
titles_path = "/kaggle/input/cornell-moviedialog-corpus/movie_titles_metadata.txt"
raw_script_urls_path = "/kaggle/input/cornell-moviedialog-corpus/raw_script_urls.txt"


The following code defines a function called `load_lines` that loads individual lines of dialogue from the file at `file_path`. If `line_ids` is provided, only the specified lines are returned; otherwise, all lines are returned.

The function opens the file with `open()` and iterates over it with a `for` loop. Each record is stripped of surrounding whitespace and split on the string `" +++$+++ "`, which separates the fields of a record; the first field is the line ID and the last field is the text of the utterance. If `line_ids` is `None` or the current line ID is in `line_ids`, the utterance is stored in a dictionary called `lines`, keyed by its ID.

If `line_ids` is not `None` and `lines` already holds as many entries as `line_ids`, every requested line has been found and the loop stops early. The function then builds a list: the values of `lines` in file order if `line_ids` is `None`, or the utterances in the order given by `line_ids` otherwise.

Finally, the list is cut to at most 100 utterances with Python's slice notation `[:100]`. We intentionally work with a small sample of 100 here.

In [4]:
def load_lines(file_path, line_ids=None):
    """
    Load individual lines of dialogue from the given file path.
    If line_ids is provided, only the specified lines will be returned.
    """
    lines = {}
    with open(file_path, 'r', encoding='iso-8859-1') as f:
        for line in f:
            line = line.strip().split(' +++$+++ ')
            if line_ids is None or line[0] in line_ids:
                lines[line[0]] = line[-1]
                if line_ids is not None and len(lines) == len(line_ids):
                    break
    if line_ids is not None:
        lines = [lines[line_id] for line_id in line_ids]
    else:
        lines = list(lines.values())
    return lines[:100]
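
To make the record format concrete, here is a minimal sketch that parses a single record shaped like those in `movie_lines.txt`. The `parse_line` helper, the field names, and the sample string are illustrative assumptions and are not part of the notebook.

```python
# Hypothetical helper: split one movie_lines.txt-style record into named fields.
FIELD_SEP = " +++$+++ "

def parse_line(record):
    # maxsplit=4 keeps any separator-like text inside the utterance intact
    fields = record.rstrip("\n").split(FIELD_SEP, 4)
    line_id, character_id, movie_id, character_name, text = fields
    return {
        "line_id": line_id,
        "character_id": character_id,
        "movie_id": movie_id,
        "character_name": character_name,
        "text": text,
    }

# Illustrative record in the corpus format
sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
parsed = parse_line(sample)
print(parsed["line_id"], "->", parsed["text"])
```

`load_lines` only needs the first and last fields, which is why it indexes the split result with `line[0]` and `line[-1]`.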

The following code snippet defines a function called `load_conversations` that takes a file path as input and reads the conversation records stored in that file. Each record lists the IDs of the lines of dialogue that make up one conversation.

The function creates an empty list called `conversations` to store the loaded conversations. It then loops through each line of the file. For each line, it extracts the list of line IDs that belong to a single conversation, loads the text of each of those lines with `load_lines` (reading from the global `lines_path`), and appends the text to the `conversation` list. Once all lines of a conversation have been processed, the conversation is appended to the `conversations` list. Because `load_lines` is called once per line ID, the lines file is reopened and scanned for every utterance.

Note that the function stops after loading the first 100 conversations in the file, using the condition `if len(conversations) >= 100: break`.

The function returns the list of loaded conversations.

In [5]:
def load_conversations(file_path):
    """
    Load conversation data from the given file path.
    """
    conversations = []
    with open(file_path, 'r', encoding='iso-8859-1') as f:
        for line in f:
            if len(conversations) >= 100:
                break
            conversation = []
            line = line.strip().split(' +++$+++ ')
            line_ids = line[-1][1:-1].replace("'", "").split(", ")
            for line_id in line_ids:
                line_text = load_lines(lines_path, [line_id])[0]
                conversation.append(line_text)
            conversations.append(conversation)
    return conversations
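
Since every call to `load_lines` rescans the lines file, one possible alternative is to read that file once into a dictionary and then resolve each conversation by lookup. The sketch below is only an illustration under that assumption: `build_conversations` is a hypothetical helper, and the two `io.StringIO` objects hold made-up records that stand in for the real corpus files.

```python
import ast
import io

FIELD_SEP = " +++$+++ "

def build_conversations(lines_file, conversations_file, limit=100):
    """Resolve conversations to utterance text with a single pass over each file."""
    # Pass 1: map every line ID to its utterance text.
    id_to_text = {}
    for record in lines_file:
        fields = record.rstrip("\n").split(FIELD_SEP, 4)
        if len(fields) == 5:
            id_to_text[fields[0]] = fields[4]
    # Pass 2: look up the utterances of each conversation in order.
    conversations = []
    for record in conversations_file:
        if len(conversations) >= limit:
            break
        line_ids = ast.literal_eval(record.strip().split(FIELD_SEP)[-1])
        conversations.append([id_to_text[i] for i in line_ids if i in id_to_text])
    return conversations

# Tiny in-memory stand-ins for the two corpus files (made-up records)
lines_file = io.StringIO(
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Hi, Ann.\n"
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ANN +++$+++ Hello?\n"
)
conversations_file = io.StringIO("u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2']\n")
demo_conversations = build_conversations(lines_file, conversations_file)
print(demo_conversations)
```

With real files, the two arguments would be file objects opened with the same `iso-8859-1` encoding used in the notebook.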

The following code defines a function `load_labels` that takes two arguments:
- `file_path`: a string with the path to the file containing movie metadata.
- `conversation_ids`: a list of ID strings whose labels should be loaded. Despite the name, these are compared against the movie ID in the first field of each record.

The function reads the file line by line and extracts the movie ID and genre list from each record. If the movie ID is present in `conversation_ids`, it checks whether the genre list contains `'romance'` and sets the label for that movie to 1 if so, and to 0 otherwise. Finally, it returns a dictionary whose keys are movie IDs and whose values are the corresponding labels.

`ast.literal_eval()` converts the genre list from a string to a Python list. The genre list is assumed to be stored in the file as a Python list literal, i.e., enclosed in square brackets. Unlike `eval()`, `ast.literal_eval()` only accepts literals and does not execute arbitrary code found in the file.

In [6]:
import ast

def load_labels(file_path, conversation_ids):
    labels = {}
    with open(file_path, 'r', encoding='iso-8859-1') as f:
        for line in f:
            # Split line by ' +++$+++ ' and extract the movie ID and genre list
            parts = line.strip().split(' +++$+++ ')
            movie_id = parts[0]
            if movie_id in conversation_ids:
                # Parse the genre list safely; literal_eval accepts literals only
                genres = ast.literal_eval(parts[-1])
                # Set label to 1 if 'romance' is in the genre list, else set label to 0
                if 'romance' in genres:
                    label = 1
                else:
                    label = 0
                labels[movie_id] = label
    return labels
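
As a small illustration of the parsing step, the sketch below turns one record shaped like those in `movie_titles_metadata.txt` into a movie ID and a genre list using `ast.literal_eval`, which accepts Python literals only. The `parse_genres` helper and the sample record are assumptions made for this example.

```python
import ast

FIELD_SEP = " +++$+++ "

def parse_genres(record):
    """Return (movie_id, genres) from one titles-metadata-style record."""
    parts = record.strip().split(FIELD_SEP)
    genres = ast.literal_eval(parts[-1])  # parses "['a', 'b']" into a list
    return parts[0], genres

# Illustrative record in the corpus format
sample = (
    "m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90"
    " +++$+++ 62847 +++$+++ ['comedy', 'romance']"
)
movie_id, genres = parse_genres(sample)
label = int("romance" in genres)
print(movie_id, genres, label)

# literal_eval refuses anything that is not a plain literal
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("expression rejected:", rejected)
```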


In the following code, `conversations` is loaded from the file at `conversations_path` using the `load_conversations()` function. `conversation_ids` is a list of ID strings from 'm0' to 'mN-1', where N is the total number of conversations loaded.

`labels` is loaded using the `load_labels()` function, which takes two arguments: the path of the titles file and the list of IDs. The function extracts the labels from the file and returns a dictionary that maps each matching ID to its label. Note that identifiers of the form 'm' followed by a number are movie IDs in this corpus, so `labels` holds the romance flag of the first N movies. Conversation `i` is later paired with movie 'm' + `i` by position, not with the movie the conversation actually comes from, which is a simplification to keep in mind when reading the results.

In [33]:
conversations = load_conversations(conversations_path)
# Get the IDs of the conversations
conversation_ids = ['m' + str(i) for i in range(len(conversations))]

# Load only the labels for the first 100 conversations
labels = load_labels(titles_path, conversation_ids)
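
Each record in `movie_conversations.txt` carries the ID of the movie it belongs to in its third field, so a conversation can also be labelled from its own movie instead of by position. The following is a minimal sketch of that idea; the helper names and the in-memory records are hypothetical and only mimic the corpus format.

```python
import ast

FIELD_SEP = " +++$+++ "

def conversation_movie_ids(conversation_records):
    # Field index 2 of a conversation record is the movie ID
    return [r.strip().split(FIELD_SEP)[2] for r in conversation_records]

def genre_labels(title_records, genre="romance"):
    # Map every movie ID to 1 if the genre is listed for it, else 0
    labels = {}
    for r in title_records:
        parts = r.strip().split(FIELD_SEP)
        labels[parts[0]] = int(genre in ast.literal_eval(parts[-1]))
    return labels

# Made-up records in the corpus format
conv_records = [
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
    "u9 +++$+++ u7 +++$+++ m1 +++$+++ ['L2001', 'L2002']",
]
title_records = [
    "m0 +++$+++ title a +++$+++ 1999 +++$+++ 6.90 +++$+++ 100 +++$+++ ['comedy', 'romance']",
    "m1 +++$+++ title b +++$+++ 1992 +++$+++ 6.20 +++$+++ 100 +++$+++ ['adventure', 'history']",
]

movie_ids = conversation_movie_ids(conv_records)
movie_labels = genre_labels(title_records)
conv_labels = [movie_labels[m] for m in movie_ids]
print(conv_labels)
```

With this pairing, all conversations from the same movie share that movie's label.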

The following snippet imports the modules used in the next steps.

In [12]:
import random
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical




The following code creates a NumPy array `train_labels` from the `labels` dictionary, using a list comprehension over the indices 0 to `len(conversations)` (exclusive).

For each index `i`, it looks up the key `'m' + str(i)`, that is, the string "m" concatenated with the current value of `i`. The value found is placed at position `i` of `train_labels`. If a key is missing from `labels`, the lookup raises a `KeyError`.

In [35]:
train_labels = np.array([labels['m'+str(i)] for i in range(len(conversations))])
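
Before training a binary classifier it can help to check how the two classes are distributed. The sketch below shows one way to do that with `np.bincount`; the `labels` dictionary here is made up for the example and does not reflect the real corpus.

```python
import numpy as np

# Made-up labels keyed the same way as in the notebook
labels = {"m0": 1, "m1": 0, "m2": 0, "m3": 1, "m4": 0}
train_labels = np.array([labels["m" + str(i)] for i in range(len(labels))])

# Count how many samples fall into class 0 and class 1
counts = np.bincount(train_labels, minlength=2)
positive_rate = counts[1] / counts.sum()
print("class counts:", counts.tolist(), "positive rate:", positive_rate)
```

A strongly skewed positive rate would mean that accuracy alone says little about the model.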


The following code uses the Keras `Tokenizer` class to convert the conversations to sequences of integers. The `Tokenizer` is initialized with a set of filters that specify which characters are removed during tokenization. In this case, the filters cover most special characters plus tab and newline, but not `!`, `,`, `.`, `?` or `'`, so that punctuation stays attached to the neighbouring word.

Each conversation is a list of utterances, and `Tokenizer` treats a list as an already tokenized text, so the utterances of a conversation are first joined into a single string. The `fit_on_texts` method is then called with these strings. This method builds the tokenizer's internal vocabulary from the words in the input texts.

Finally, the `texts_to_sequences` method is called with the same strings. This method converts each conversation to a sequence of integers based on the tokenizer's vocabulary. The resulting sequences can be used as input to a deep learning model.

In [14]:
# Convert conversations to sequences
tokenizer = Tokenizer(filters='"#$%&()*+-/:;<=>@[\\]^_`{|}~\t\n')
# Join each conversation's utterances into one string; Tokenizer treats a list
# as already tokenized and would index whole utterances instead of words
texts = [' '.join(conversation) for conversation in conversations]
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
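
To illustrate what word-level tokenization with these filters produces, here is a simplified pure-Python stand-in. It is not the Keras implementation; the helper names are hypothetical, and it only mimics the main behaviour: lowercasing, replacing filtered characters with spaces, splitting on whitespace, and numbering words from 1 by descending frequency.

```python
from collections import Counter

FILTERS = '"#$%&()*+-/:;<=>@[\\]^_`{|}~\t\n'

def to_words(text):
    # Replace every filtered character with a space, then split on whitespace
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    counts = Counter(word for text in texts for word in to_words(text))
    # most_common orders by descending count; index 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    # Words missing from the vocabulary are skipped
    return [[word_index[w] for w in to_words(t) if w in word_index] for t in texts]

conversation = ["Hello?", "Hello, who is this?"]  # made-up utterances
texts = [" ".join(conversation)]
word_index = fit_word_index(texts)
ids = texts_to_ids(texts, word_index)
print(word_index)
print(ids)
```

Because `?` and `,` are not in the filter string, `hello?` and `hello,` end up as two different vocabulary entries.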


The following code prepares the sequences for training the neural network by padding each sequence with zeros so that they all have the same length. The `pad_sequences` function from Keras is used for this purpose.

In this case, the sequences are brought to a length of 40 tokens: any sequence with fewer than 40 tokens is padded with zeros to make it 40 tokens long. A sequence with more than 40 tokens is truncated to 40; because the `truncating` argument is left at its default of `'pre'`, the tokens are dropped from the beginning of the sequence.

The padding type is set to `'post'`, which means that the padding is added at the end of each sequence. This ensures that the actual text content of the sequence comes first and the padding comes at the end.

The resulting `padded_sequences` variable is a NumPy array of shape `(number of sequences, max_len)`, where `max_len` is the fixed length of every sequence after padding and truncation.

In [16]:
# Pad sequences to a maximum length of 40 words
max_len = 40
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post')
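
The effect of post-padding combined with the default pre-truncation can be reproduced in a few lines of NumPy. The `pad_post` function below is a simplified sketch written for this illustration, not the Keras function itself.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    # Start from an array filled with the padding value
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]           # pre-truncation: keep the last maxlen tokens
        out[row, :len(kept)] = kept    # post-padding: content first, zeros after
    return out

demo = pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(demo)
```

The short sequence becomes `[5, 3, 0, 0]`, while the long one keeps its last four tokens, `[3, 4, 5, 6]`.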


The following code snippet uses the `train_test_split` function from the `sklearn` library to split `padded_sequences` and `train_labels` into training and testing sets. The call passes four arguments:

- `padded_sequences`: The input sequences that have been padded to a fixed length.
- `train_labels`: The corresponding labels for each input sequence.
- `test_size`: The proportion of the data to be allocated for testing (in this case 20%).
- `random_state`: A seed value for the random number generator used in the split.

The function returns four output variables:

- `train_sequences`: A subset of the `padded_sequences` used for training.
- `test_sequences`: A subset of the `padded_sequences` used for testing.
- `train_targets`: The corresponding labels for `train_sequences`.
- `test_targets`: The corresponding labels for `test_sequences`.

These subsets of the data can then be used to train and evaluate the performance of a machine learning model.

In [17]:
from sklearn.model_selection import train_test_split
# Split data into training and testing sets
train_sequences, test_sequences, train_targets, test_targets = train_test_split(padded_sequences, train_labels, test_size=0.2, random_state=42)
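
Conceptually, a seeded split shuffles the sample indices once and cuts them into two groups. The NumPy sketch below illustrates that idea; `shuffled_split` is a hypothetical helper, it is not how scikit-learn is implemented, and it will not reproduce the exact rows that `train_test_split` selects.

```python
import numpy as np

def shuffled_split(features, targets, test_size=0.2, seed=42):
    # Shuffle the row indices reproducibly, then cut off the test share
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(test_size * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], targets[train_idx], targets[test_idx]

# Stand-in data with the same shape as in the notebook: 100 samples of length 40
X = np.arange(100 * 40).reshape(100, 40)
y = np.arange(100) % 2
X_train, X_test, y_train, y_test = shuffled_split(X, y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Using the same index arrays for features and targets keeps each label aligned with its sequence.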


As a sanity check, we print the shapes of the training split.

In [18]:
train_sequences.shape, train_targets.shape

((80, 40), (80,))

The following snippet defines the model architecture, which can be summarized as follows.

First, an instance of the sequential model is created with `Sequential()`.

Next, an embedding layer is added with `model.add(Embedding(len(tokenizer.word_index) + 1, 64, input_length=max_len))`. The embedding layer maps each token index in the input to a dense 64-dimensional vector that is learned during training. The `len(tokenizer.word_index) + 1` argument is the input dimension of the embedding layer: the number of unique tokens known to the tokenizer plus one, because the tokenizer numbers tokens from 1 and index 0 is reserved for padding. The `64` argument is the size of the embedding vectors, and the `input_length` argument is the length of each input sequence.

The next line adds an LSTM layer with `model.add(LSTM(64, dropout=0.1))`. The `64` argument is the number of units in the LSTM layer, and `dropout=0.1` sets the dropout rate applied to the layer's inputs to reduce overfitting.

Finally, a dense output layer is added with `model.add(Dense(1, activation='sigmoid'))`. The output layer contains a single neuron with a sigmoid activation function, so the output is a value between 0 and 1 that is read as the probability that the input belongs to the positive (romance) class.









In [36]:
# Define model architecture
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
# Build the model
model = Sequential()
model.add(Embedding(len(tokenizer.word_index) + 1, 64, input_length=max_len))
model.add(LSTM(64, dropout=0.1))
model.add(Dense(1, activation='sigmoid'))
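
The size of this architecture can be worked out by hand. The sketch below counts trainable parameters with the standard formulas for embedding, LSTM, and dense layers; the vocabulary size of 1000 is an assumption chosen for the example, since the real value depends on the fitted tokenizer.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per token index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 1000 + 1  # assumed 1000 known tokens, plus index 0 for padding
total = embedding_params(vocab_size, 64) + lstm_params(64, 64) + dense_params(64, 1)
print(embedding_params(vocab_size, 64), lstm_params(64, 64), dense_params(64, 1), total)
```

Under this assumption the embedding layer accounts for about two thirds of all parameters, which is a lot of capacity for a training set of this size.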


In the following code, the model is compiled with the Adam optimizer, binary cross-entropy loss, and accuracy as the metric.

In [20]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
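
Binary cross-entropy penalizes a prediction by the negative log of the probability assigned to the true class. The NumPy sketch below computes it for two made-up predictions; it illustrates the formula and is not the Keras implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

confident = binary_cross_entropy([1, 0], [0.9, 0.1])  # mostly right
wrong = binary_cross_entropy([1, 0], [0.1, 0.9])      # mostly wrong
print(round(confident, 4), round(wrong, 4))
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily.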


The model is trained on the 80 training samples for 50 epochs with a batch size of 32, and the 20 held-out samples are used as validation data.

In [30]:
# Train the model
batch_size = 32
epochs = 50
model.fit(train_sequences, train_targets, batch_size=batch_size, epochs=epochs, validation_data=(test_sequences, test_targets))

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f54986027d0>
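
For a sense of how much optimization this run involves, the arithmetic below derives the number of weight updates from the training set size shown in the shape check (80 samples), the batch size, and the epoch count. It is a back-of-the-envelope sketch and does not call Keras.

```python
import math

n_train = 80      # rows in train_sequences, per the shape check
batch_size = 32
epochs = 50

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch_size, total_updates)
```

Each epoch therefore consists of two full batches and one smaller final batch.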

The following code defines a function called `generate_response` that takes an input text and returns a response based on the predicted label for that input.

First, the input text is tokenized using `tokenizer.texts_to_sequences()`, which converts the input text to a sequence of integer tokens based on the tokenizer fitted during training.

Next, the input sequence is padded using `pad_sequences()` so that it has the same length as the training sequences.

The model then predicts on the input sequence using `model.predict()`. Because the model has a single sigmoid output, `label_probs` is an array holding one value: the predicted probability of the positive (romance) class.

Finally, a response is chosen based on that probability. If it is greater than or equal to 0.5, the response "That sounds romantic!" is returned. Otherwise, the response "I'm not sure what you mean." is returned.

In [31]:
# Define a function to generate a response
def generate_response(input_text,max_len = 40):
    # Tokenize the input text
    input_sequence = tokenizer.texts_to_sequences([input_text])
    # Pad the input sequence to have the same length as the training sequences
    input_sequence = pad_sequences(input_sequence, maxlen=max_len, padding='post')
    # Predict the label for the input sequence
    label_probs = model.predict(input_sequence)[0]
    # Return a response based on the predicted label
    if label_probs >= 0.5:
        return "That sounds romantic!"
    else:
        return "I'm not sure what you mean."
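
The decision step can be looked at in isolation from the model. The sketch below applies the same 0.5 threshold to made-up probabilities; `respond` is a hypothetical helper, and the arrays stand in for what `model.predict()` would return for a single input.

```python
import numpy as np

def respond(probability, threshold=0.5):
    # Flatten the prediction array and take its single value as a float
    p = float(np.ravel(probability)[0])
    if p >= threshold:
        return "That sounds romantic!"
    return "I'm not sure what you mean."

high = respond(np.array([0.73]))  # made-up probability above the threshold
low = respond(np.array([0.20]))   # made-up probability below the threshold
print(high)
print(low)
```

Keeping the threshold as a parameter makes it easy to trade false positives against false negatives later.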


Running inference with the trained model on a sample text.

In [32]:
input_text = "Hey, do you want to go see a movie tonight? I have planned a candle night dinner after that!"
response = generate_response(input_text)
print(response)


That sounds romantic!
