### Preprocessing Module Synopsis

In this preprocessing module, we clean, tokenize, and structure the raw conversational data from the **Cornell Movie Dialogs Corpus** (or any other selected dataset). This step transforms unstructured dialogue into a format suitable for training a generative chatbot. The key tasks are removing unnecessary metadata, normalizing the text (lowercasing and removing special characters), tokenizing sentences, pairing inputs with responses, and padding or truncating sequences to a fixed length.

By completing this preprocessing, we prepare the data for the next phase: **model design and training**, where the chatbot will learn from these structured conversations. Proper preprocessing is crucial for ensuring the chatbot can generate coherent, context-aware responses during real-time conversations.

Next steps include selecting an appropriate model architecture (e.g., a sequence-to-sequence Transformer or a GPT-style decoder) and training the chatbot on the preprocessed dataset.

### Preprocessing steps:

1. **Data Understanding**: Explore the structure and content of the dataset.
2. **Data Cleaning**: 
   - Remove unnecessary metadata.
   - Lowercase text.
   - Remove special characters and punctuation.
   - Remove empty or incomplete dialogues.
3. **Tokenization**: Break down text into tokens (words or subwords).
4. **Conversation Pairing**: Create (input, response) pairs for training.
5. **Context Management** (optional): Group multiple turns of conversation.
6. **Padding and Truncation**: Ensure all sequences are of fixed length.
7. **Train/Test Split**: Divide the dataset into a training set and a held-out validation set.
8. **Special Token Handling**: Add special tokens like `<PAD>`, `<START>`, and `<END>`.
9. **Vectorization/Encoding**: Convert tokens to integer IDs that an embedding layer can consume.
10. **Save Preprocessed Data**: Store the cleaned and preprocessed data in a suitable format for model training.
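
Steps 8 and 9 are listed above but not implemented in the cells that follow. A minimal sketch of what they might look like is shown here, using only the standard library. The helper names `build_vocab` and `encode`, the extra `<UNK>` token for out-of-vocabulary words, and the `min_freq` parameter are assumptions made for illustration; they are not part of the original notebook. The sketch also assumes `max_length` is at least 2 so that `<START>` and `<END>` always fit.

```python
from collections import Counter

# Special tokens; <UNK> is an assumption added for out-of-vocabulary words.
PAD, START, END, UNK = "<PAD>", "<START>", "<END>", "<UNK>"

def build_vocab(token_lists, min_freq=1):
    """Map each token to an integer ID, reserving the lowest IDs for special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for tok, freq in sorted(counts.items()):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab, max_length):
    """Wrap tokens in <START>/<END>, truncate, pad, and convert to integer IDs."""
    body = tokens[: max_length - 2]  # leave room for <START> and <END>
    seq = [START] + body + [END]
    seq = seq + [PAD] * (max_length - len(seq))
    return [vocab.get(tok, vocab[UNK]) for tok in seq]

# Toy data standing in for tokenized utterances.
toy = [["hello", "there"], ["hello", "how", "are", "you"]]
vocab = build_vocab(toy)
print(encode(["hello", "stranger"], vocab, max_length=6))  # [1, 5, 3, 2, 0, 0]
```

Truncating the body before appending `<END>` keeps the end marker even for long utterances, which matters if a decoder is expected to learn when to stop.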

In [1]:
!pip install convokit

from convokit import Corpus, download

# Download the Cornell Movie Dialogs Corpus
corpus = Corpus(download("movie-corpus"))
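# NOTE: this second assignment replaces the movie corpus loaded above, so the
# cells below run on the Conversations Gone Awry (CMV) corpus instead.
corpus = Corpus(download("conversations-gone-awry-cmv-corpus"))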






Downloading movie-corpus to /Users/obosieakioyamen/.convokit/downloads/movie-corpus
Downloading movie-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/movie-corpus/movie-corpus.zip (40.9MB)... Done
Dataset already exists at /Users/obosieakioyamen/.convokit/downloads/conversations-gone-awry-cmv-corpus


In [2]:
# 1. Data Understanding
# Check basic information about the corpus
print(corpus)

# Access a sample conversation and utterance
for convo in corpus.iter_conversations():
    print(convo)  # Print one conversation as an example
    break  # Only show the first conversation


<convokit.model.corpus.Corpus object at 0x13f9b3ed0>
Conversation('id': 'cue8y0b', 'utterances': ['cue8y0b', 'cuec5fs', 'cuect48', 'cuedf8c', 'cuedywn', 'czb942p', 'czbbocu', 'czbdh6q', 'czbe470', 'czbe8el'], 'meta': ConvoKitMeta({'pair_id': 'cue8uxd', 'has_removed_comment': True, 'split': 'train', 'summary_meta': []}))


In [4]:
import re

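# 2. Data Cleaning
# Cleaning function: lowercase the text, then strip everything except
# letters, digits, and whitespace (this also removes apostrophes and URL punctuation)
def clean_text(text):
    text = text.lower()
    # The text is already lowercase here, so matching a-z is sufficient
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text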

# Clean the utterances
for convo in corpus.iter_conversations():
    for utt in convo.iter_utterances():
        utt.text = clean_text(utt.text)
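
The step list also mentions removing empty or incomplete dialogues, which the cleaning cell does not do. A rough sketch of one way to handle that is given here, on toy strings rather than the corpus. The names `clean_and_collapse` and `is_usable`, and the `min_tokens` threshold, are hypothetical and chosen only for this example.

```python
import re

def clean_and_collapse(text):
    """Lowercase, drop non-alphanumeric characters, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def is_usable(text, min_tokens=1):
    """Treat an utterance as usable if it still has at least min_tokens words."""
    return len(text.split()) >= min_tokens

# Toy utterances: two of them become empty once punctuation is stripped.
raw = ["Hello,   World!", "???", "  ", "I've seen this."]
cleaned = [clean_and_collapse(t) for t in raw]
kept = [t for t in cleaned if is_usable(t)]
print(kept)  # ['hello world', 'ive seen this']
```

Filtering after cleaning matters because an utterance made only of punctuation looks non-empty before cleaning and empty afterwards.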

In [5]:
# 3. Tokenization
# Tokenize with Python's built-in str.split(); libraries such as NLTK or spaCy are alternatives
def tokenize(text):
    return text.split()

for convo in corpus.iter_conversations():
    for utt in convo.iter_utterances():
        utt.tokens = tokenize(utt.text)  # Tokenize the utterances
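
Whitespace splitting works here because punctuation has already been stripped. If punctuation and contractions should be kept instead, a regex tokenizer applied to the raw text is one option. The following is only a sketch under that assumption; `TOKEN_PATTERN` and `regex_tokenize` are hypothetical names, and the pattern is a simple illustration rather than a replacement for NLTK or spaCy.

```python
import re

# Words (optionally with an apostrophe suffix such as "'t" or "'s"),
# or any single punctuation character.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?|[^\w\s]")

def regex_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(regex_tokenize("Don't panic, it's fine!"))
# ["don't", 'panic', ',', "it's", 'fine', '!']
```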


In [6]:
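# 4. Conversation Pairing
# Extract input-response pairs from consecutive utterances; in threaded data,
# iteration order may not match which utterance replies to which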
pairs = []
for convo in corpus.iter_conversations():
    utterances = list(convo.iter_utterances())
    for i in range(len(utterances) - 1):
        pairs.append((utterances[i].text, utterances[i+1].text))

print(pairs[0])


('okay ive seen this view come up a few times before and ive always been unsuccessful in convincing people about why theyre wrong however it seems that your view is based on studies so maybe youll respond well to evidence ucarlosriccy\n\ni cannot dispute the fact that there is a measurable iq gap bw white students and black students in the us however it is most certainly not because of genetics almost all of it can be attributed to social causes  black students are often in schools that arent adequately funded poverty and its corollary effects etc \n\nthe genetic basis for difference in iq has been tested you can find a comprehensive review herehttpwwwncbinlmnihgovpmcarticlespmc3341646 if you look at the science its clear that there is no evidence that justifies making inferences about race and intelligence\n\nbut what about all those studies in the past that showed the evidence well its been shown that those studies mostly from the social sciences have been biased and unreliable you c
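
Pairing consecutive utterances assumes that each utterance answers the one before it, which may not hold for threaded discussions where several comments reply to the same parent. ConvoKit utterances carry a `reply_to` field for this purpose, but the sketch below uses plain dictionaries with hypothetical `id`, `reply_to`, and `text` keys so that it runs without the library. The function name `pair_by_reply` is likewise an assumption.

```python
def pair_by_reply(utterances):
    """Build (parent_text, reply_text) pairs by following each reply_to link."""
    by_id = {u["id"]: u for u in utterances}
    pairs = []
    for u in utterances:
        parent = by_id.get(u["reply_to"])  # None for the root or a missing parent
        if parent is not None:
            pairs.append((parent["text"], u["text"]))
    return pairs

# Toy thread: "b" and "c" both reply to "a", and "d" replies to "b".
toy_thread = [
    {"id": "a", "reply_to": None, "text": "root post"},
    {"id": "b", "reply_to": "a", "text": "first reply"},
    {"id": "c", "reply_to": "a", "text": "second reply"},
    {"id": "d", "reply_to": "b", "text": "reply to first"},
]
print(pair_by_reply(toy_thread))
```

With consecutive pairing, the toy thread would yield the pair ("first reply", "second reply") even though neither comment responds to the other.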

In [7]:
# 5. Padding and Truncation
max_length = 10
def pad_sequence(sequence, max_length):
    return sequence[:max_length] + ['<PAD>'] * (max_length - len(sequence))

# Example padding; revisit max_length and the padding scheme if model training requires it
for i, (input_text, response_text) in enumerate(pairs):
    input_tokens = tokenize(input_text)
    response_tokens = tokenize(response_text)
    pairs[i] = (pad_sequence(input_tokens, max_length), pad_sequence(response_tokens, max_length))

# Sample padded pair
print(pairs[0])


(['okay', 'ive', 'seen', 'this', 'view', 'come', 'up', 'a', 'few', 'times'], ['its', 'not', 'just', 'black', 'and', 'white', 'america', 'though', 'subsaharan', 'africans'])
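
A fixed `max_length` of 10 truncates most of a long utterance, as the sample output shows. One common alternative is to choose the length from the data, for example the length that covers a given percentage of sequences. The sketch below illustrates this with the nearest-rank percentile method; `length_at_percentile` and `pad_or_truncate` are hypothetical helper names, and the 90th percentile is an arbitrary choice for the toy data.

```python
import math

def length_at_percentile(token_lists, percentile=95):
    """Return the sequence length at the given percentile (nearest-rank method)."""
    lengths = sorted(len(toks) for toks in token_lists)
    if not lengths:
        return 0
    rank = math.ceil(percentile * len(lengths) / 100)
    return lengths[max(rank, 1) - 1]

def pad_or_truncate(tokens, max_length, pad_token="<PAD>"):
    """Cut sequences longer than max_length and right-pad shorter ones."""
    return tokens[:max_length] + [pad_token] * (max_length - len(tokens))

# Toy token lists with lengths 1 through 10.
toy = [["w"] * n for n in range(1, 11)]
chosen = length_at_percentile(toy, percentile=90)
print(chosen)                               # 9
print(pad_or_truncate(["a", "b"], 4))       # ['a', 'b', '<PAD>', '<PAD>']
```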


In [8]:
# 6. Train/Test Split
from sklearn.model_selection import train_test_split
# Fix random_state so the split is reproducible across runs
train_pairs, val_pairs = train_test_split(pairs, test_size=0.2, random_state=42)

# Print sizes of train and validation sets
print(f"Train set size: {len(train_pairs)}, Validation set size: {len(val_pairs)}")


Train set size: 28897, Validation set size: 7225
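
Splitting at the pair level can place pairs from the same conversation in both sets, so validation text may overlap with training text. A possible alternative is to split by conversation. The sketch below assumes the pairs have been grouped into a dictionary keyed by conversation ID, which the notebook does not currently do; `split_by_conversation`, `val_fraction`, and `seed` are hypothetical names for this illustration.

```python
import random

def split_by_conversation(pairs_by_convo, val_fraction=0.2, seed=42):
    """Assign whole conversations to either the training or the validation set."""
    convo_ids = sorted(pairs_by_convo)
    random.Random(seed).shuffle(convo_ids)  # seeded for reproducibility
    n_val = max(1, round(len(convo_ids) * val_fraction))
    val_ids = set(convo_ids[:n_val])
    train, val = [], []
    for cid, convo_pairs in pairs_by_convo.items():
        (val if cid in val_ids else train).extend(convo_pairs)
    return train, val

# Toy data: five conversations with two pairs each.
toy = {
    f"c{i}": [(f"c{i} turn 0", f"c{i} turn 1"), (f"c{i} turn 1", f"c{i} turn 2")]
    for i in range(5)
}
train, val = split_by_conversation(toy)
print(len(train), len(val))  # 8 2
```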


In [9]:
# 7. Save Preprocessed Data
import json

with open('preprocessed_data.json', 'w') as f:
    json.dump(pairs, f)
print("Preprocessed data saved successfully!")


Preprocessed data saved successfully!
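
The cell above writes all pairs to a single file, so the train and validation split is not preserved. A sketch of saving both splits and reading them back is shown here. The function names `save_splits` and `load_splits`, the `"train"` and `"val"` keys, and the temporary file path are assumptions made for this example. Note that JSON has no tuple type, so pairs come back as lists unless converted.

```python
import json
import os
import tempfile

def save_splits(train_pairs, val_pairs, path):
    """Write both splits to one JSON file under separate keys."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"train": train_pairs, "val": val_pairs}, f)

def load_splits(path):
    """Read the splits back, converting each pair from a list to a tuple."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return ([tuple(p) for p in data["train"]], [tuple(p) for p in data["val"]])

# Toy padded pairs standing in for train_pairs and val_pairs.
toy_train = [(["hi", "<PAD>"], ["hello", "<PAD>"])]
toy_val = [(["bye", "<PAD>"], ["later", "<PAD>"])]

path = os.path.join(tempfile.mkdtemp(), "splits.json")
save_splits(toy_train, toy_val, path)
train_loaded, val_loaded = load_splits(path)
print(train_loaded == toy_train and val_loaded == toy_val)  # True
```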
