### Preprocessing Module Synopsis

In this preprocessing module, we clean, tokenize, and structure the raw conversational data from the **Cornell Movie Dialogs Corpus** (or any other selected dataset). This step transforms unstructured dialogue into a format suitable for training a generative chatbot. The key preprocessing tasks are removing unnecessary metadata, normalizing the text (lowercasing and removing special characters), tokenizing sentences, pairing input-response dialogues, and padding sequences to a fixed length so that all inputs have the same shape.

By completing this preprocessing, we prepare the data for the next phase: **model design and training**, where the chatbot will learn from these structured conversations. Proper preprocessing is crucial for ensuring the chatbot can generate coherent, context-aware responses during real-time conversations.

Next steps include selecting an appropriate model architecture (e.g., a Transformer-based model such as GPT-2) and training the chatbot on the preprocessed dataset.

### Preprocessing steps:

1. **Data Understanding**: Explore the structure and content of the dataset.
2. **Data Cleaning**: 
   - Remove unnecessary metadata.
   - Lowercase text.
   - Remove special characters and punctuation.
   - Remove empty or incomplete dialogues.
3. **Tokenization**: Break down text into tokens (words or subwords).
4. **Conversation Pairing**: Create (input, response) pairs for training.
5. **Context Management** (optional): Group multiple turns of conversation.
6. **Padding and Truncation**: Ensure all sequences are of fixed length.
7. **Train/Validation Split**: Divide the dataset into training and validation sets.
8. **Special Token Handling**: Add special tokens like `<PAD>`, `<START>`, and `<END>`.
9. **Vectorization/Encoding**: Convert tokens to integer token IDs.
10. **Save Preprocessed Data**: Store the cleaned and preprocessed data in a suitable format for model training.
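
Steps 4, 5, and 8 can be sketched together. The following is a minimal illustration, assuming each conversation is already a chronologically ordered list of cleaned strings. The function name `build_pairs`, the `context_turns` parameter, and the `<SEP>` separator are assumptions made for this example and are not part of the notebook.

```python
from typing import List, Tuple

# <START> and <END> come from the step list; <SEP> is an assumed extra token
START, END, SEP = "<START>", "<END>", "<SEP>"


def build_pairs(turns: List[str], context_turns: int = 1) -> List[Tuple[str, str]]:
    """Return (input, response) pairs from chronologically ordered turns.

    The input is the last `context_turns` turns before the response,
    joined with SEP. The response is wrapped in START and END.
    """
    if context_turns < 1:
        raise ValueError("context_turns must be at least 1")
    pairs = []
    for i in range(1, len(turns)):
        # Take up to `context_turns` turns that precede turn i
        context = turns[max(0, i - context_turns):i]
        pairs.append((f" {SEP} ".join(context), f"{START} {turns[i]} {END}"))
    return pairs


turns = ["hi", "hello there", "how are you"]
single_turn = build_pairs(turns)                  # one turn of context
two_turn = build_pairs(turns, context_turns=2)    # up to two turns of context
print(single_turn)
print(two_turn)
```

With `context_turns=1` this reduces to plain consecutive pairing. Larger values give the model more dialogue history at the cost of longer input sequences.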

In [1]:
!pip install convokit transformers

import json
import random
import re

from convokit import Corpus, download
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Download the Cornell Movie Dialogs Corpus
corpus = Corpus(download("movie-corpus"))







Downloading movie-corpus to /Users/obosieakioyamen/.convokit/downloads/movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done


In [2]:
# 1. Data Understanding
# Check basic information about the corpus
print(corpus)

# Access a sample conversation and utterance
for convo in corpus.iter_conversations():
    print(convo)  # Print one conversation as an example
    break  # Only show the first conversation


<convokit.model.corpus.Corpus object at 0x13bae8f10>
Conversation('id': 'L1044', 'utterances': ['L1045', 'L1044'], 'meta': ConvoKitMeta({'movie_idx': 'm0', 'movie_name': '10 things i hate about you', 'release_year': '1999', 'rating': '6.90', 'votes': '62847', 'genre': "['comedy', 'romance']"}))


In [3]:
import re

# 2. Data Cleaning
# Cleaning function: strips URLs, lowercases, and removes special characters
def clean_text(text):
    # Dimitri - Remove URLs starting with http/https
    text = re.sub(r"http\S+", "", text)  # Match 'http' up to the next whitespace
    text = text.lower()  # Convert to lowercase
    # Remove everything except letters, digits, and whitespace
    # (this also drops apostrophes, so "don't" becomes "dont")
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Clean the utterances
for convo in corpus.iter_conversations():
    for utt in convo.iter_utterances():
        utt.text = clean_text(utt.text)
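
To make the effect of the cleaning rules concrete, here is a small self-contained check. It repeats the notebook's `clean_text` and compares it with a hypothetical variant, `clean_text_keep_apostrophes`, which is an assumption added for illustration and shows one way to preserve contractions.

```python
import re


def clean_text(text):
    # Same steps as the notebook's cleaning function
    text = re.sub(r"http\S+", "", text)
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_text_keep_apostrophes(text):
    # Hypothetical variant: identical, except apostrophes survive
    text = re.sub(r"http\S+", "", text)
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "They DON'T!  See http://example.com/x for more..."
cleaned = clean_text(sample)
cleaned_keep = clean_text_keep_apostrophes(sample)
print(cleaned)       # they dont see for more
print(cleaned_keep)  # they don't see for more
```

Whether to keep apostrophes is a design choice: removing them shrinks the character set, while keeping them lets a subword tokenizer see contractions in their usual written form.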

In [4]:
# 3. Tokenization using Hugging Face's GPT-2 tokenizer for consistency with GPT-2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(text):
    return tokenizer.tokenize(text)

for convo in corpus.iter_conversations():
    for utt in convo.iter_utterances():
        utt.tokens = tokenize(utt.text)



In [5]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/obosieakioyamen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
# 4. Data Augmentation - Synonym Replacement
def synonym_replacement(text):
    words = text.split()
    new_words = words.copy()
    for i, word in enumerate(words):
        synonyms = wordnet.synsets(word)
        if synonyms:
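            # Take the first lemma of the first synset (deterministic, not random);
            # WordNet lemma names may contain underscores for multi-word entries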
            synonym = synonyms[0].lemmas()[0].name()
            if synonym != word:  # Avoid replacing with the same word
                new_words[i] = synonym
    return " ".join(new_words)

# Apply data augmentation to increase dataset diversity
augmented_pairs = []
for convo in corpus.iter_conversations():
    utterances = list(convo.iter_utterances())
    for i in range(len(utterances) - 1):
        input_text = utterances[i].text
        response_text = utterances[i + 1].text
        
        # Apply synonym replacement
        augmented_input = synonym_replacement(input_text)
        augmented_response = synonym_replacement(response_text)
        
        # Original pair
        augmented_pairs.append((input_text, response_text))
        # Augmented pair
        augmented_pairs.append((augmented_input, augmented_response))

print(f"Number of pairs after augmentation: {len(augmented_pairs)}")
print(augmented_pairs[0])

Number of pairs after augmentation: 443232
('they do not', 'they do to')
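
The augmentation cell always takes the first lemma of the first synset. A randomized variant could look like the sketch below. It assumes a caller-supplied `get_synonyms` function so the example runs without WordNet; `random_synonym_replacement`, the probability `p`, and the toy dictionary are all hypothetical names and data introduced here.

```python
import random
from typing import Callable, List, Optional


def random_synonym_replacement(
    text: str,
    get_synonyms: Callable[[str], List[str]],
    p: float = 0.3,
    rng: Optional[random.Random] = None,
) -> str:
    """Replace each word with a randomly chosen synonym with probability p."""
    rng = rng or random.Random()
    out = []
    for word in text.split():
        # Ignore candidates identical to the original word
        candidates = [s for s in get_synonyms(word) if s != word]
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


# Toy lookup standing in for WordNet (hypothetical data)
TOY_SYNONYMS = {"big": ["large", "huge"], "happy": ["glad", "cheerful"]}


def toy_lookup(word: str) -> List[str]:
    return TOY_SYNONYMS.get(word, [])


unchanged = random_synonym_replacement("a big happy dog", toy_lookup, p=0.0)
replaced = random_synonym_replacement(
    "a big happy dog", toy_lookup, p=1.0, rng=random.Random(0)
)
print(unchanged)
print(replaced)
```

In the notebook's setting, `get_synonyms` could wrap `wordnet.synsets(word)` and collect `lemma.name()` values with underscores replaced by spaces. Replacing only a fraction of words tends to keep augmented sentences closer to the original meaning.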


In [7]:
# 5. Padding and Truncation to a fixed length

# If truncation cuts off important parts of the conversation,
# increase max_length to allow longer sequences, especially for longer dialogues.
max_length = 60  # Dimitri - Increased to 60


def pad_sequence(sequence, max_length):
    # Truncate to max_length, then right-pad with 0 up to max_length.
    # Caution: 0 is not a dedicated padding ID in GPT-2's vocabulary; it is the
    # ID of "!", which is why decoding the padded sequences shows trailing "!".
    padded_sequence = sequence[:max_length]
    if len(sequence) < max_length:
        padded_sequence += [0] * (max_length - len(sequence))
    return padded_sequence

# Tokenize, pad, and truncate the sequences
for i, (input_text, response_text) in enumerate(augmented_pairs):
    input_tokens = tokenizer.encode(input_text)
    response_tokens = tokenizer.encode(response_text)
    augmented_pairs[i] = (pad_sequence(input_tokens, max_length), pad_sequence(response_tokens, max_length))

# Sample padded pair
print(augmented_pairs[0])


([9930, 466, 407, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [9930, 466, 284, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
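
Because 0 is an ordinary token ID in GPT-2's vocabulary, a model cannot tell the padding above from real tokens. One common workaround, sketched here as an assumption rather than the notebook's method, is to pad with GPT-2's end-of-text ID (50256) and return an attention mask alongside the IDs. The name `pad_with_mask` is hypothetical.

```python
from typing import List, Tuple

# GPT-2's <|endoftext|> ID, reused here as the padding value (an assumption)
GPT2_EOS_ID = 50256


def pad_with_mask(
    ids: List[int], max_length: int, pad_id: int = GPT2_EOS_ID
) -> Tuple[List[int], List[int]]:
    """Truncate or right-pad ids to max_length and return (ids, attention_mask)."""
    kept = ids[:max_length]
    n_pad = max_length - len(kept)
    # Mask is 1 for real tokens and 0 for padding positions
    return kept + [pad_id] * n_pad, [1] * len(kept) + [0] * n_pad


padded, mask = pad_with_mask([9930, 466, 407], max_length=6)
print(padded)  # [9930, 466, 407, 50256, 50256, 50256]
print(mask)    # [1, 1, 1, 0, 0, 0]
```

The mask lets a downstream model or loss function ignore padded positions regardless of which ID is used for padding.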


In [8]:
# 6. Train/Validation Split
train_pairs, val_pairs = train_test_split(augmented_pairs, test_size=0.2)

# Print sizes of train and validation sets
print(f"Train set size: {len(train_pairs)}, Validation set size: {len(val_pairs)}")


Train set size: 354585, Validation set size: 88647
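
Since the split above runs after augmentation, an original pair and its augmented copy can end up on opposite sides of the split, which may make validation scores look better than they should. One possible way to avoid this is to split the original pairs first and augment only the training portion. The sketch below uses only the standard library; `split_then_augment` and `toy_augment` are hypothetical names.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def split_then_augment(
    pairs: List[Pair],
    augment: Callable[[Pair], Pair],
    val_fraction: float = 0.2,
    seed: int = 42,
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle and split the original pairs, then augment the training part only."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    val, train = shuffled[:n_val], shuffled[n_val:]
    # Augmented copies are derived from training pairs only
    return train + [augment(p) for p in train], val


def toy_augment(pair: Pair) -> Pair:
    # Hypothetical stand-in for synonym replacement
    return (pair[0] + " (aug)", pair[1] + " (aug)")


originals = [(f"q{i}", f"a{i}") for i in range(10)]
train_pairs, val_pairs = split_then_augment(originals, toy_augment)
print(len(train_pairs), len(val_pairs))  # 16 2
```

Fixing the seed also makes the split reproducible across runs, which the notebook's `train_test_split` call would achieve with its `random_state` argument.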


In [9]:
# 7. Save Preprocessed Data
# Note: this saves all padded pairs together; train_pairs and val_pairs are not saved separately.
with open('preprocessed_data.json', 'w') as f:
    json.dump(augmented_pairs, f)

print("Preprocessed data saved successfully!")

Preprocessed data saved successfully!
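
If the training and validation sets should survive a notebook restart, they could be written to the same file along with the settings needed to interpret them. This is a minimal sketch under that assumption; the helper names `save_splits` and `load_splits`, the payload keys, and the file name are hypothetical.

```python
import json
import os
import tempfile


def save_splits(path, train, val, max_length, pad_id):
    """Write both splits plus the padding settings to one JSON file."""
    payload = {"max_length": max_length, "pad_id": pad_id, "train": train, "val": val}
    with open(path, "w") as f:
        json.dump(payload, f)


def load_splits(path):
    with open(path) as f:
        return json.load(f)


# One toy padded pair, stored as a tuple like the notebook's pairs
train = [([9930, 466, 407, 0], [9930, 466, 284, 0])]
val = []
path = os.path.join(tempfile.mkdtemp(), "preprocessed_splits.json")
save_splits(path, train, val, max_length=4, pad_id=0)
loaded = load_splits(path)
print(type(loaded["train"][0]).__name__)  # list
```

Note that JSON has no tuple type, so each pair comes back as a list of two lists.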


In [10]:
# Load the preprocessed data
with open('preprocessed_data.json', 'r') as f:
    data = json.load(f)

# Print a sample of the data to check
print(data[0])  # A list of two padded token-ID lists (JSON turns tuples into lists)


[[9930, 466, 407, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [9930, 466, 284, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


#### Preprocessed Data Explanation / Usage

1. **Tokenized Sequences**:
   - Each inner list is one tokenized sequence (an input or a response) from the preprocessed data. The numbers are token IDs produced by passing the cleaned text through the tokenizer (in this case, GPT-2's tokenizer).

2. **Padded Sequences**:
   - The sequences have been padded (or truncated) to a fixed length (`max_length = 60` in our case). Shorter sequences are filled with trailing `0` values, and longer ones are cut off after 60 token IDs.

### Explanation of Output:
- `data[0]` is a list of two lists (input and response). It was a tuple before saving, but JSON has no tuple type, so it loads as a list.
   - The first list `[9930, 466, 407, 0, ...]` is the tokenized and padded sequence for the input sentence.
   - The second list `[9930, 466, 284, 0, ...]` is the tokenized and padded sequence for the response sentence.
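
The structure described above can also be checked programmatically before training. The sketch below is one way to do that with plain Python; `validate_pairs` is a hypothetical helper, not part of the notebook.

```python
def validate_pairs(data, max_length):
    """Return the number of pairs, raising ValueError on any malformed entry."""
    for idx, entry in enumerate(data):
        if len(entry) != 2:
            raise ValueError(f"entry {idx} has {len(entry)} sequences, expected 2")
        for seq in entry:
            if len(seq) != max_length:
                raise ValueError(f"entry {idx} has a sequence of length {len(seq)}")
            if not all(isinstance(token_id, int) for token_id in seq):
                raise ValueError(f"entry {idx} contains a non-integer token ID")
    return len(data)


sample = [[[9930, 466, 407, 0], [9930, 466, 284, 0]]]
n_pairs = validate_pairs(sample, max_length=4)
print(n_pairs)  # 1
```

On the notebook's data, the equivalent call would be `validate_pairs(data, 60)`.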


In [11]:
print(data[0][0])  # Check if this is a list of integers
print(data[0][1])  # Check if this is a list of integers

[9930, 466, 407, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[9930, 466, 284, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [12]:
# These token IDs can be mapped back to text with the tokenizer to verify the round trip.
# Note: skip_special_tokens=True does not remove the 0 padding, because 0 is an
# ordinary GPT-2 token ID ("!") rather than a registered special token.
input_text = tokenizer.decode(data[0][0], skip_special_tokens=True)
response_text = tokenizer.decode(data[0][1], skip_special_tokens=True)

print("Input text:", input_text)
print("Response text:", response_text)




Input text: they do not!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Response text: they do to!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
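
The trailing `!` characters in the decoded text are the padding: ID 0 decodes to `!` in GPT-2's vocabulary. A minimal way to recover the original text, assuming padding only ever appears at the end of a sequence, is to strip trailing pad IDs before decoding. `strip_padding` is a hypothetical helper.

```python
def strip_padding(ids, pad_id=0):
    """Return ids without the trailing run of pad_id values."""
    end = len(ids)
    while end > 0 and ids[end - 1] == pad_id:
        end -= 1
    return ids[:end]


padded_ids = [9930, 466, 407] + [0] * 57
stripped = strip_padding(padded_ids)
print(stripped)  # [9930, 466, 407]
```

The stripped list can then be passed to `tokenizer.decode`. This approach would also remove a genuine trailing token with ID 0, so it is only safe when that token cannot occur at the end of real text.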
