### Preprocessing Module Synopsis

In this preprocessing module, we clean, tokenize, and structure raw conversational data from the **Cornell Movie Dialogs Corpus** (or any other selected dataset). This step transforms unstructured dialogue into a format suitable for training a generative chatbot. The key preprocessing tasks are removing unnecessary metadata, normalizing the text (lowercasing and removing special characters), tokenizing sentences, pairing input-response dialogues, and padding sequences so that all inputs have a consistent length.

By completing this preprocessing, we prepare the data for the next phase: **model design and training**, where the chatbot will learn from these structured conversations. Proper preprocessing is crucial for ensuring the chatbot can generate coherent, context-aware responses during real-time conversations.

Next steps include selecting an appropriate model architecture (e.g., Transformer, GPT) and training the chatbot using the preprocessed dataset.

### Preprocessing steps:

1. **Data Understanding**: Explore the structure and content of the dataset.
2. **Data Cleaning**: 
   - Remove unnecessary metadata.
   - Lowercase text.
   - Remove special characters and punctuation.
   - Remove empty or incomplete dialogues.
3. **Tokenization**: Break down text into tokens (words or subwords).
4. **Conversation Pairing**: Create (input, response) pairs for training.
5. **Context Management** (optional): Group multiple turns of conversation.
6. **Padding and Truncation**: Ensure all sequences are of fixed length.
7. **Train/Validation Split**: Divide the dataset into training and validation sets.
8. **Special Token Handling**: Add special tokens like `<PAD>`, `<START>`, and `<END>`.
9. **Vectorization/Encoding**: Convert tokens to integer token IDs that the model can map to embeddings.
10. **Save Preprocessed Data**: Store the cleaned and preprocessed data in a suitable format for model training.
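
As a rough illustration of steps 4, 5, and 8, the sketch below builds (input, response) pairs from a toy list of turns, optionally prepends earlier turns as context, and wraps each response in the `<START>` and `<END>` markers from the list above. The function name `make_pairs`, the `context_turns` parameter, and the sample dialogue are hypothetical and are not part of the notebook's code.

```python
# Toy sketch (hypothetical helper, not the notebook's implementation).
START, END = "<START>", "<END>"

def make_pairs(turns, context_turns=1):
    """Pair each turn with the next one; the input holds up to `context_turns` prior turns."""
    pairs = []
    for i in range(1, len(turns)):
        # Join the most recent `context_turns` turns into a single input string.
        context = " ".join(turns[max(0, i - context_turns):i])
        # Mark the response boundaries so a decoder knows where to start and stop.
        response = f"{START} {turns[i]} {END}"
        pairs.append((context, response))
    return pairs

dialogue = ["hi there", "hello how are you", "fine thanks"]  # made-up example turns
single_turn_pairs = make_pairs(dialogue)
two_turn_pairs = make_pairs(dialogue, context_turns=2)
print(single_turn_pairs)
print(two_turn_pairs)
```

With `context_turns=1` each input is just the previous turn; larger values trade longer inputs for more conversational context, which interacts with the fixed sequence length chosen in step 6.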

In [40]:
!pip install convokit transformers

import re
import json
import random

from convokit import Corpus, download
from sklearn.model_selection import train_test_split
from nltk.corpus import wordnet
from transformers import AutoTokenizer

# Download the Cornell Movie Dialogs Corpus
corpus = Corpus(download("movie-corpus"))
# NOTE: this second assignment replaces the movie corpus, so every later cell
# operates on the Conversations Gone Awry (CMV) corpus.
corpus = Corpus(download("conversations-gone-awry-cmv-corpus"))



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Downloading movie-corpus to /Users/dimitridumont/.convokit/downloads/movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done
Dataset already exists at /Users/dimitridumont/.convokit/downloads/conversations-gone-awry-cmv-corpus


In [41]:
# 1. Data Understanding
# Check basic information about the corpus
print(corpus)

# Access a sample conversation and utterance
for convo in corpus.iter_conversations():
    print(convo)  # Print one conversation as an example
    break  # Only show the first conversation


<convokit.model.corpus.Corpus object at 0x169abcf50>
Conversation('id': 'cue8y0b', 'utterances': ['cue8y0b', 'cuec5fs', 'cuect48', 'cuedf8c', 'cuedywn', 'czb942p', 'czbbocu', 'czbdh6q', 'czbe470', 'czbe8el'], 'meta': ConvoKitMeta({'pair_id': 'cue8uxd', 'has_removed_comment': True, 'split': 'train', 'summary_meta': []}))


In [42]:
import re

# 2. Data Cleaning
# Cleaning function: strip URLs, lowercase, and remove special characters
def clean_text(text):
    # Dimitri - Remove URLs starting with http/https
    text = re.sub(r"http\S+", "", text)  # Match 'http' up to the next whitespace
    text = text.lower()  # Convert to lowercase
    # Remove everything except lowercase letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Remove extra whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Clean the utterances
for convo in corpus.iter_conversations():
    for utt in convo.iter_utterances():
        utt.text = clean_text(utt.text)

In [43]:
# 3. Tokenization using Hugging Face's GPT-2 tokenizer for consistency with GPT-2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(text):
    return tokenizer.tokenize(text)

for convo in corpus.iter_conversations():
    for utt in convo.iter_utterances():
        utt.tokens = tokenize(utt.text)

Token indices sequence length is longer than the specified maximum sequence length for this model (1025 > 1024). Running this sequence through the model will result in indexing errors


In [44]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dimitridumont/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [45]:
# 4. Data Augmentation - Synonym Replacement
def synonym_replacement(text):
    words = text.split()
    new_words = words.copy()
    for i, word in enumerate(words):
        synonyms = wordnet.synsets(word)
        if synonyms:
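            # Take the first lemma of the first synset (deterministic, not random).
            # WordNet joins multi-word lemmas with underscores, so restore the spaces.
            synonym = synonyms[0].lemmas()[0].name().replace("_", " ")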
            if synonym != word:  # Avoid replacing with the same word
                new_words[i] = synonym
    return " ".join(new_words)

# Apply data augmentation to increase dataset diversity
augmented_pairs = []
for convo in corpus.iter_conversations():
    utterances = list(convo.iter_utterances())
    for i in range(len(utterances) - 1):
        input_text = utterances[i].text
        response_text = utterances[i + 1].text
        
        # Apply synonym replacement
        augmented_input = synonym_replacement(input_text)
        augmented_response = synonym_replacement(response_text)
        
        # Original pair
        augmented_pairs.append((input_text, response_text))
        # Augmented pair
        augmented_pairs.append((augmented_input, augmented_response))

print(f"Number of pairs after augmentation: {len(augmented_pairs)}")
print(augmented_pairs[0])

Number of pairs after augmentation: 72244
('okay ive seen this view come up a few times before and ive always been unsuccessful in convincing people about why theyre wrong however it seems that your view is based on studies so maybe youll respond well to evidence ucarlosriccy i cannot dispute the fact that there is a measurable iq gap bw white students and black students in the us however it is most certainly not because of genetics almost all of it can be attributed to social causes black students are often in schools that arent adequately funded poverty and its corollary effects etc the genetic basis for difference in iq has been tested you can find a comprehensive review here if you look at the science its clear that there is no evidence that justifies making inferences about race and intelligence but what about all those studies in the past that showed the evidence well its been shown that those studies mostly from the social sciences have been biased and unreliable you can read ab

In [46]:
# 5. Padding and Truncation using advanced tokenization


# If the truncation cuts off important parts of the conversation
# we might want to increase the max_length to allow longer sequences, especially for longer dialogues.
max_length = 60 # Dimitri - Increased to 60


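# GPT-2 has no dedicated padding token and ID 0 decodes to "!", so pad with the
# end-of-text ID instead; skip_special_tokens=True drops it again when decoding.
pad_id = tokenizer.eos_token_id

def pad_sequence(sequence, max_length, pad_id=pad_id):
    padded_sequence = sequence[:max_length]  # Truncate to at most max_length IDs
    if len(sequence) < max_length:
        padded_sequence += [pad_id] * (max_length - len(sequence))
    return padded_sequence

# Encode to token IDs, then pad or truncate each sequence
for i, (input_text, response_text) in enumerate(augmented_pairs):
    input_tokens = tokenizer.encode(input_text)
    response_tokens = tokenizer.encode(response_text)
    augmented_pairs[i] = (pad_sequence(input_tokens, max_length), pad_sequence(response_tokens, max_length))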

# Sample padded pair
print(augmented_pairs[0])


([482, 323, 220, 425, 1775, 428, 1570, 1282, 510, 257, 1178, 1661, 878, 290, 220, 425, 1464, 587, 23993, 287, 17101, 661, 546, 1521, 484, 260, 2642, 2158, 340, 2331, 326, 534, 1570, 318, 1912, 319, 3640, 523, 3863, 345, 297, 3031, 880, 284, 2370, 334, 66, 7063, 418, 1173, 948, 1312, 2314, 11047, 262, 1109, 326, 612, 318, 257, 40757, 1312, 80, 7625, 275, 86, 2330, 2444, 290, 2042, 2444, 287, 262, 514, 2158, 340, 318, 749, 3729, 407, 780, 286, 25862, 2048, 477, 286, 340, 460, 307, 14183, 284, 1919, 5640, 2042, 2444, 389, 1690, 287, 4266, 326], [896, 407, 655, 2042, 290, 2330, 45630, 64, 996, 6352, 993, 19173, 6580, 1173, 504, 290, 1097, 571, 44749, 286, 6580, 37189, 7709, 416, 1290, 4776, 262, 9016, 319, 2811, 286, 9450, 2628, 351, 45630, 272, 15102, 691, 9489, 44108, 1365, 290, 262, 5254, 17666, 3953, 1997, 326, 318, 7817, 484, 3953, 12531, 14607, 29294, 1521, 345, 460, 1577, 281, 1312, 80, 1332, 284, 257, 1200, 290, 635, 1521, 22527, 1751, 4143, 460, 4776, 2440, 618, 484, 260, 1862, 47

In [47]:
# 6. Train/Validation Split
train_pairs, val_pairs = train_test_split(augmented_pairs, test_size=0.2)

# Print sizes of train and validation sets
print(f"Train set size: {len(train_pairs)}, Validation set size: {len(val_pairs)}")


Train set size: 57795, Validation set size: 14449


In [48]:
# 7. Save Preprocessed Data
with open('preprocessed_data.json', 'w') as f:
    json.dump(augmented_pairs, f)

print("Preprocessed data saved successfully!")

Preprocessed data saved successfully!


In [49]:
# Load the preprocessed data
with open('preprocessed_data.json', 'r') as f:
    data = json.load(f)

# Print a sample of the data to check
print(data[0])  # Should display a two-element list: the padded input and response ID sequences (JSON stores tuples as lists)


[[482, 323, 220, 425, 1775, 428, 1570, 1282, 510, 257, 1178, 1661, 878, 290, 220, 425, 1464, 587, 23993, 287, 17101, 661, 546, 1521, 484, 260, 2642, 2158, 340, 2331, 326, 534, 1570, 318, 1912, 319, 3640, 523, 3863, 345, 297, 3031, 880, 284, 2370, 334, 66, 7063, 418, 1173, 948, 1312, 2314, 11047, 262, 1109, 326, 612, 318, 257, 40757, 1312, 80, 7625, 275, 86, 2330, 2444, 290, 2042, 2444, 287, 262, 514, 2158, 340, 318, 749, 3729, 407, 780, 286, 25862, 2048, 477, 286, 340, 460, 307, 14183, 284, 1919, 5640, 2042, 2444, 389, 1690, 287, 4266, 326], [896, 407, 655, 2042, 290, 2330, 45630, 64, 996, 6352, 993, 19173, 6580, 1173, 504, 290, 1097, 571, 44749, 286, 6580, 37189, 7709, 416, 1290, 4776, 262, 9016, 319, 2811, 286, 9450, 2628, 351, 45630, 272, 15102, 691, 9489, 44108, 1365, 290, 262, 5254, 17666, 3953, 1997, 326, 318, 7817, 484, 3953, 12531, 14607, 29294, 1521, 345, 460, 1577, 281, 1312, 80, 1332, 284, 257, 1200, 290, 635, 1521, 22527, 1751, 4143, 460, 4776, 2440, 618, 484, 260, 1862, 47

#### Preprocessed Data Explanation / Usage:

1. **Tokenized Sequences**:
   - Each inner list is a sequence of token IDs, one list for the input and one for the response. The IDs are the result of encoding the cleaned text with the tokenizer (in this case, GPT-2's tokenizer).

2. **Padded Sequences**:
   - The sequences have been padded or truncated to a fixed length, `max_length`. The padding cell above sets `max_length = 60`; the sample input sequence printed below has 100 IDs, which suggests the output was captured from a run with a larger value.
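
To make the fixed-length behaviour concrete, here is a small standalone sketch of padding and truncation on made-up ID lists. The pad value `50256` is GPT-2's end-of-text ID, used here as an assumed padding value; the helper name `pad_or_truncate` is illustrative and not taken from the notebook.

```python
# Standalone sketch with made-up token IDs; PAD_ID is an assumption (GPT-2's end-of-text ID).
PAD_ID = 50256

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Return a new list of exactly `max_length` IDs."""
    clipped = ids[:max_length]  # drop anything past max_length
    return clipped + [pad_id] * (max_length - len(clipped))  # fill the remainder, if any

short = pad_or_truncate([11, 22, 33], max_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short)  # [11, 22, 33, 50256, 50256]
print(long)   # [1, 2, 3, 4, 5]
```

Truncation silently discards the tail of long utterances, so it can help to inspect the distribution of sequence lengths before settling on a value for `max_length`.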

### Explanation of Output:
- `data[0]` is a pair of lists (input and response). It was saved as a tuple, but JSON stores it as a two-element list.
   - The first list `[482, 323, 220, ...]` is the tokenized and padded sequence for the input sentence.
   - The second list `[896, 407, 655, ...]` is the tokenized and padded sequence for the response sentence.
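
One detail worth checking when loading the file: JSON has no tuple type, so pairs saved as tuples come back as two-element lists. The sketch below shows this with made-up data and an in-memory string instead of `preprocessed_data.json`; `restore_pairs` is a hypothetical helper for converting back, not something the notebook defines.

```python
import json

# Made-up padded pairs standing in for the real data.
pairs = [([1, 2, 3], [4, 5, 6]), ([7, 8, 9], [10, 11, 12])]

# Round-trip through JSON in memory rather than through a file.
loaded = json.loads(json.dumps(pairs))

def restore_pairs(raw):
    """Convert each two-element list back into an (input, response) tuple."""
    return [(inp, resp) for inp, resp in raw]

restored = restore_pairs(loaded)
print(type(loaded[0]).__name__)    # list
print(type(restored[0]).__name__)  # tuple
```

Code that only indexes into the pairs (`data[0][0]`, `data[0][1]`) works the same either way; the conversion matters mainly if later code relies on tuple unpacking checks or hashing.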


In [50]:
print(data[0][0])  # Check if this is a list of integers
print(data[0][1])  # Check if this is a list of integers

[482, 323, 220, 425, 1775, 428, 1570, 1282, 510, 257, 1178, 1661, 878, 290, 220, 425, 1464, 587, 23993, 287, 17101, 661, 546, 1521, 484, 260, 2642, 2158, 340, 2331, 326, 534, 1570, 318, 1912, 319, 3640, 523, 3863, 345, 297, 3031, 880, 284, 2370, 334, 66, 7063, 418, 1173, 948, 1312, 2314, 11047, 262, 1109, 326, 612, 318, 257, 40757, 1312, 80, 7625, 275, 86, 2330, 2444, 290, 2042, 2444, 287, 262, 514, 2158, 340, 318, 749, 3729, 407, 780, 286, 25862, 2048, 477, 286, 340, 460, 307, 14183, 284, 1919, 5640, 2042, 2444, 389, 1690, 287, 4266, 326]
[896, 407, 655, 2042, 290, 2330, 45630, 64, 996, 6352, 993, 19173, 6580, 1173, 504, 290, 1097, 571, 44749, 286, 6580, 37189, 7709, 416, 1290, 4776, 262, 9016, 319, 2811, 286, 9450, 2628, 351, 45630, 272, 15102, 691, 9489, 44108, 1365, 290, 262, 5254, 17666, 3953, 1997, 326, 318, 7817, 484, 3953, 12531, 14607, 29294, 1521, 345, 460, 1577, 281, 1312, 80, 1332, 284, 257, 1200, 290, 635, 1521, 22527, 1751, 4143, 460, 4776, 2440, 618, 484, 260, 1862, 475,

In [51]:
# ^ These token IDs can be mapped back to their original words using the tokenizer if you want to check the original text:

# Convert token IDs back to words to verify the text
input_text = tokenizer.decode(data[0][0], skip_special_tokens=True)
response_text = tokenizer.decode(data[0][1], skip_special_tokens=True)

print("Input text:", input_text)
print("Response text:", response_text)


Input text: okay ive seen this view come up a few times before and ive always been unsuccessful in convincing people about why theyre wrong however it seems that your view is based on studies so maybe youll respond well to evidence ucarlosriccy i cannot dispute the fact that there is a measurable iq gap bw white students and black students in the us however it is most certainly not because of genetics almost all of it can be attributed to social causes black students are often in schools that
Response text: its not just black and white america though subsaharan africans and caribbeans of african decent by far score the lowest on average of ethnic groups with american blacks only performing marginally better and the tests dont measure anything that is taught they measure abstract reasoning thats why you can give an iq test to a child and also why gifted children generally can score higher when theyre young but usually see even if they are truly gifted somewhat of a drop as they grow old