**Background:**
This work outlines the development of a conversational AI chatbot using the BERT and T5 architectures from the Hugging Face Transformers library. The project involves loading dialogue data, preprocessing it, training a Seq2Seq model, and evaluating its performance. The goal is to create a chatbot capable of generating human-like responses based on movie dialogues.

In [1]:
from google.colab import drive

drive.mount('/content/drive')



Mounted at /content/drive


**Step 1:**
**Load dialogue data from specified file paths into pandas DataFrames using custom delimiters.**
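As a minimal sketch of what the custom delimiter does, the snippet below splits one raw line on ` +++$+++ ` using the same regular expression that is passed to `pd.read_csv` as `sep`. The sample line is hypothetical: it is reconstructed from the first `movie_lines.txt` row in this step's output and assumes the raw file uses that layout.

```python
import re

# Same pattern as the `sep` argument passed to pd.read_csv in this step
DELIMITER = re.compile(r'\s*\+\+\+\$\+\+\+\s*')

# Hypothetical raw line, reconstructed from the first printed row of movie_lines.txt
raw_line = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"

# Splitting on the delimiter yields one field per column
fields = DELIMITER.split(raw_line)
print(fields)
```

The `\s*` on both sides of the pattern absorbs the spaces around the delimiter, so the resulting fields need no extra stripping.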

In [8]:
import pandas as pd
import os
# The corpus files are read directly from Google Drive; scikit-learn does not provide a loader for them.

# Specify the file path
file_path1 = '/content/drive/My Drive/Cornel Movie Dataframes/movie_characters_metadata.txt'
file_path2 = '/content/drive/My Drive/Cornel Movie Dataframes/movie_conversations.txt'
file_path3 = '/content/drive/My Drive/Cornel Movie Dataframes/movie_lines.txt'
file_path4 = '/content/drive/My Drive/Cornel Movie Dataframes/movie_titles_metadata.txt'
# Load the file using pandas with custom delimiter
df1 = pd.read_csv(file_path1, sep=r'\s*\+\+\+\$\+\+\+\s*', header=None, engine='python', encoding='ISO-8859-1')
df2 = pd.read_csv(file_path2, sep=r'\s*\+\+\+\$\+\+\+\s*', header=None, engine='python', encoding='ISO-8859-1')
df3 = pd.read_csv(file_path3, sep=r'\s*\+\+\+\$\+\+\+\s*', header=None, engine='python', encoding='ISO-8859-1')
df4 = pd.read_csv(file_path4, sep=r'\s*\+\+\+\$\+\+\+\s*', header=None, engine='python', encoding='ISO-8859-1')

# Display the first few rows of the DataFrame
print(df1.head())
print(df2.head())
print(df3.head())
print(df4.head())

# Assign column names that match the columns printed above
df1.columns = ['CharacterID', 'CharacterName', 'MovieID', 'MovieTitle', 'Gender', 'CreditPosition']
df2.columns = ['CharacterID1', 'CharacterID2', 'MovieID', 'LineIDs']
df3.columns = ['LineID', 'CharacterID', 'MovieID', 'CharacterName', 'LineText']
df4.columns = ['MovieID', 'MovieTitle', 'MovieYear', 'Rating', 'Votes', 'Genres']

# Display the DataFrame
print(df1.head())
print(df2.head())
print(df3.head())
print(df4.head())

    0         1   2                           3  4  5
0  u0    BIANCA  m0  10 things i hate about you  f  4
1  u1     BRUCE  m0  10 things i hate about you  ?  ?
2  u2   CAMERON  m0  10 things i hate about you  m  3
3  u3  CHASTITY  m0  10 things i hate about you  ?  ?
4  u4      JOEY  m0  10 things i hate about you  m  6
    0   1   2                                 3
0  u0  u2  m0  ['L194', 'L195', 'L196', 'L197']
1  u0  u2  m0                  ['L198', 'L199']
2  u0  u2  m0  ['L200', 'L201', 'L202', 'L203']
3  u0  u2  m0          ['L204', 'L205', 'L206']
4  u0  u2  m0                  ['L207', 'L208']
       0   1   2        3             4
0  L1045  u0  m0   BIANCA  They do not!
1  L1044  u2  m0  CAMERON   They do to!
2   L985  u0  m0   BIANCA    I hope so.
3   L984  u2  m0  CAMERON     She okay?
4   L925  u0  m0   BIANCA     Let's go.
    0                           1     2    3       4  \
0  m0  10 things i hate about you  1999  6.9   62847   
1  m1  1492: conquest of paradise  1

**Step 2:**
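**BERT Tokenizer & Model:**
Loaded for sequence classification tasks such as sentiment analysis. Later cells rebind `tokenizer` and `model` to T5, so this BERT classifier is not used for response generation.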

In [11]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch

#Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


1. **Create input-output pairs in which each input is a dialogue line and each output is the line that follows it.**
2. **Convert the list of dialogue lines into a Hugging Face Dataset for easier manipulation.**
3. **T5 Tokenizer & Model: loaded for text generation tasks.**
4. **Define a preprocessing function that tokenizes the input lines and creates the corresponding target labels. Inputs and targets are truncated or padded to a maximum length of 512 tokens.**
5. **Apply the preprocessing function to the entire dataset in batches for efficient processing.**
6. **Split the dataset into training (80%), validation (10%), and test (10%) sets.**
7. **After splitting, create a DatasetDict to hold these splits:**
The DatasetDict class from the Hugging Face datasets library groups the training, validation, and test splits into a single structure, so each split can be looked up by name during training and evaluation.
This is helpful when setting up the Trainer class, which takes the training and evaluation datasets as separate arguments.
A DatasetDict also improves readability, because the key of each split states what that split is used for.
The Trainer does not accept the DatasetDict itself; the individual splits are passed by key, for example `tokenized_datasets["train"]` and `tokenized_datasets["validation"]`.
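The notebook pairs consecutive rows of `movie_lines.txt`. As an alternative sketch, pairs can also be built from the `LineIDs` lists in `movie_conversations.txt`, so that a pair never crosses a conversation boundary. The function name `build_pairs`, the toy line IDs, and the toy line texts below are all hypothetical; only the string format of `LineIDs` is taken from the output printed in Step 1.

```python
import ast

def build_pairs(conversations, line_text):
    """Pair each line with the line that follows it inside the same conversation."""
    pairs = []
    for line_ids_str in conversations:
        # "['L1', 'L2']" is a string in the file, so parse it into a real list
        line_ids = ast.literal_eval(line_ids_str)
        for prev_id, next_id in zip(line_ids, line_ids[1:]):
            # Skip pairs whose text is missing from the lookup table
            if prev_id in line_text and next_id in line_text:
                pairs.append((line_text[prev_id], line_text[next_id]))
    return pairs

# Toy data with made-up IDs and texts (illustrative only)
line_text = {"L1": "Hi.", "L2": "Hello.", "L3": "How are you?", "L7": "Bye.", "L8": "See you."}
conversations = ["['L1', 'L2', 'L3']", "['L7', 'L8']"]

pairs = build_pairs(conversations, line_text)
for prompt, reply in pairs:
    print(repr(prompt), "->", repr(reply))
```

With real data, `line_text` could be a dictionary built from the `LineID` and `LineText` columns of `df3`, and `conversations` the `LineIDs` column of `df2`.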

In [12]:
!pip install datasets
import os
import re
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict, Features, Value # Import Dataset and DatasetDict
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments


# Build the list of dialogue lines. Only movie_lines.txt (df3) has a 'LineText' column.
dialogues = df3['LineText'].tolist()

# Input-output pairs: each line as input and the line that follows it as output.
# These two lists are not used below; the same shift is recomputed in preprocess_function.
input_texts = dialogues[:-1]  # All but the last line
output_texts = dialogues[1:]   # All but the first line

# Create a Hugging Face Dataset from the dialogue lines.
# The 'line' column type is declared explicitly because automatic type inference failed.
dataset = Dataset.from_dict({'line': dialogues}, features=Features({'line': Value(dtype='string')}))


# Initialize the tokenizer before it is used in the preprocessing function
tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Preprocess the data
def preprocess_function(examples):
    inputs = examples['line']  # List of dialogue lines in the current batch
    # Shift by one for next-line prediction. The last line of each batch
    # wraps around to the first line of the same batch.
    targets = inputs[1:] + [inputs[0]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # text_target tokenizes the targets; it replaces the deprecated as_target_tokenizer() context manager
    labels = tokenizer(text_target=targets, max_length=512, truncation=True, padding="max_length")

    # Padded label positions keep the pad token ID, so they count toward the training loss
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the preprocessing function
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Split the dataset into train and validation sets
train_testvalid = tokenized_datasets.train_test_split(test_size=0.2)  # 80% train, 20% test+validation
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)  # Further split test into 10% test, 10% validation

# Create a DatasetDict
tokenized_datasets = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'validation': test_valid['train']
})





Map:   0%|          | 0/1006 [00:00<?, ? examples/s]



**Training the Model:**
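Define the training parameters and start training the Seq2Seq model on the prepared dataset.
Save the trained model and tokenizer for future use. The model was trained for 3 epochs because of compute-time limits. The validation loss is already between 0.09 and 0.10 after the first epoch, but this figure should be read with care: the labels are padded to 512 tokens and the padded positions count toward the loss, so the value is likely dominated by easy-to-predict padding rather than by the dialogue text itself.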
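A common way to keep padding out of the loss is to replace the pad token ID in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists. The helper name `mask_padding` and the toy token IDs are hypothetical; the pad ID of 0 is the T5 tokenizer's pad token. This is not part of the notebook's training run, so the losses reported here were computed without it.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace pad token IDs with IGNORE_INDEX so padded positions do not affect the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy label batch: the non-pad IDs are made up, 0 is the pad ID
labels = [[71, 1712, 1, 0, 0], [363, 1, 0, 0, 0]]
masked = mask_padding(labels, pad_token_id=0)
print(masked)
```

In the preprocessing function, the same replacement would be applied to `labels["input_ids"]` before assigning it to `model_inputs["labels"]`.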
                                                                     
                                                                      
**Conclusion till Model Training**
The provided code effectively prepares a dialogue dataset for training a conversational AI chatbot using BERT and T5 architectures from Hugging Face's Transformers library. It involves loading dialogue data from text files, preprocessing it into suitable formats for training, and splitting it into distinct datasets for evaluation purposes.

In [13]:
# Load the pre-trained model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

# Train the model
trainer.train()

# Step 8: Save the trained model after optimization
model.save_pretrained("./optimized_movie_chatbot_model")
tokenizer.save_pretrained("./optimized_movie_chatbot_tokenizer")

print("Model and tokenizer saved successfully!")
print("Model training complete and saved!")








Epoch  Training Loss  Validation Loss
1      No log         0.096197
2      No log         0.094659
3      0.132100       0.09794






Model and tokenizer saved successfully!
Model training complete and saved!


This cell evaluates the trained model on the held-out test split. It imports the libraries, defines a metric function, configures the training arguments, initializes a Trainer, reports the evaluation loss on the test data, and then attempts to generate predictions for the same data.
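To make the BLEU metric mentioned in this cell more concrete, the sketch below computes the clipped n-gram precision that BLEU is built on. It is a simplified illustration, not full BLEU: it omits the brevity penalty and the geometric mean over several n-gram orders, and it tokenizes on whitespace. The function name `ngram_precision` and the example sentences are hypothetical.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate sentence against one reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return clipped / total

unigram = ngram_precision("the the the cat", "the cat sat", n=1)
bigram = ngram_precision("the the the cat", "the cat sat", n=2)
print(unigram, bigram)
```

The clipping step is what stops a candidate from scoring well by repeating one reference word many times.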

In [14]:

!pip install datasets transformers #This command installs the datasets and transformers libraries from Hugging Face, which are essential for handling datasets and working with pre-trained models in NLP.

#Below section imports the necessary libraries and prints their versions to confirm successful installation. The datasets library is used for managing datasets, while the transformers library provides access to various pre-trained models and tokenizers.
import datasets
import transformers
import numpy as np
print(datasets.__version__)
print(transformers.__version__)

# compute_metrics turns predictions and labels back into text and scores them:
# - eval_preds contains the predictions and the label IDs.
# - -100 values in the labels (positions ignored by the loss) are replaced with the
#   tokenizer's padding token ID first, because -100 cannot be decoded.
# - Predictions and labels are then decoded from token IDs into text.
# - A BLEU score (a common metric for text generation) compares predictions to references.
# Notes: `metric` is not defined anywhere in this notebook; the "score" key suggests a
# SacreBLEU-style metric object, which is an assumption. preds must be token IDs, whereas
# the plain Trainer returns logits. The function is not passed to the Trainer below,
# so it is never called.

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # Replace -100 in the labels before decoding, as -100 is not a valid token ID
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

# Define training arguments (compute_metrics is not passed to the Trainer in this cell)
# output_dir: where to save the model results.
# evaluation_strategy: how often to evaluate during training (here, at the end of each epoch).
# learning_rate: learning rate for the optimizer.
# per_device_train_batch_size: batch size for training.
# per_device_eval_batch_size: batch size for evaluation.
# num_train_epochs: number of passes over the entire training dataset.
# weight_decay: weight decay regularization to reduce overfitting.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize the Trainer with the updated training arguments
#This code initializes a Trainer object from Hugging Face:
#It takes in the model to be trained, the defined training arguments, and specifies which datasets to use for training and evaluation.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)


# Evaluate the model without the compute_metrics argument in evaluate()
#This line evaluates the trained model on a test dataset:
#The results of this evaluation (such as loss and any specified metrics) are printed out for analysis.

eval_results = trainer.evaluate(eval_dataset=tokenized_datasets["test"])
print(eval_results)  # Print the evaluation results

#This code generates predictions for the test dataset using the trained model:
#The predictions are printed out, allowing you to see how well the model performs on unseen data.
predictions = trainer.predict(tokenized_datasets["test"])
print(predictions) # Print predictions



3.0.1
4.44.2




{'eval_loss': 0.09960323572158813, 'eval_model_preparation_time': 0.003, 'eval_runtime': 3.1536, 'eval_samples_per_second': 32.027, 'eval_steps_per_second': 32.027}


OutOfMemoryError: CUDA out of memory. Tried to allocate 2.57 GiB. GPU 0 has a total capacity of 14.75 GiB of which 2.56 GiB is free. Process 9007 has 12.19 GiB memory in use. Of the allocated memory 8.38 GiB is allocated by PyTorch, and 3.66 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
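
A rough estimate suggests why `trainer.predict` may run out of memory here: the plain `Trainer` collects the logits for every test example, and each example contributes one value per label position and vocabulary entry. The sketch below does the arithmetic under stated assumptions: about 101 test examples (10% of the 1006 mapped examples, consistent with the reported runtime and throughput), labels padded to 512 tokens, a vocabulary size of 32128 for `t5-small`, and 4-byte float32 values. The function name `logits_bytes` is hypothetical.

```python
def logits_bytes(num_examples, seq_len, vocab_size, bytes_per_value=4):
    """Memory needed to hold one logit per example, position, and vocabulary entry."""
    return num_examples * seq_len * vocab_size * bytes_per_value

num_test_examples = 101  # assumed: 10% of the 1006 mapped examples
seq_len = 512            # labels are padded to max_length=512
vocab_size = 32128       # assumed: vocabulary size in the t5-small model config

total = logits_bytes(num_test_examples, seq_len, vocab_size)
per_example = logits_bytes(1, seq_len, vocab_size)
print(f"per example: {per_example / 2**20:.2f} MiB, total: {total / 2**30:.2f} GiB")
```

Under these assumptions the logits alone need roughly 6 GiB, on top of the model and any memory still held from training. Possible mitigations include padding to a shorter length or setting `eval_accumulation_steps` in `TrainingArguments` so that accumulated predictions are moved to the CPU periodically; neither was tried in this notebook.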

**This code sets up a conversational chatbot using a pre-trained T5 model. It installs and imports the necessary libraries, loads the model and tokenizer, defines a function for generating responses, and tests it with sample prompts. Note that `model_name` is set to "t5-small", so this cell queries the base checkpoint rather than the fine-tuned model saved earlier, which is consistent with the translation-like French and German replies in the output. Gradio can then provide a web interface for users to converse with the chatbot in real time.**
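
The response function in this cell relies on `temperature`, `top_k`, and `top_p` to control sampling. The sketch below illustrates what these three parameters do to a toy distribution of four made-up logits. It is a simplified illustration in plain Python, not the library's implementation: the exact order of operations and tie handling inside `model.generate` may differ, and the function name `top_k_top_p_filter` is hypothetical.

```python
import math

def top_k_top_p_filter(logits, top_k=50, top_p=0.95, temperature=0.7):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}

toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a four-token vocabulary
filtered = top_k_top_p_filter(toy_logits, top_k=3, top_p=0.85, temperature=1.0)
print(filtered)
```

Sampling then draws the next token from the surviving tokens only, which is why lower `top_k`, `top_p`, or `temperature` values make the output less varied.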

In [6]:
# Step 1: Install Required Libraries
!pip install datasets transformers gradio

#This section imports essential libraries:
#torch: The PyTorch library, which is used for tensor computations and model training.
#AutoModelForSeq2SeqLM: A class to load pre-trained sequence-to-sequence models (like T5).
#AutoTokenizer: A class to load the appropriate tokenizer for the selected model.
#gradio: A library for building interactive web applications.

# Step 2: Import Libraries
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import gradio as gr

# Step 3: Load the Trained Model and Tokenizer
model_name = "t5-small"  # Replace with your model's name or path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move the model to GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Ensure the model is in evaluation mode
model.eval()

# Step 4: Define the Response Generation Function
#This function generates responses based on user input:
#It tokenizes the input text, generates responses using the T5 model, and decodes them back into human-readable text.
#The function includes parameters for controlling response length and sampling strategies.

def generate_response(prompt_text, model, tokenizer, max_length=150, num_return_sequences=1):
    # Tokenize the input prompt with padding and attention mask
    inputs = tokenizer(prompt_text, return_tensors='pt', padding=True,
                       truncation=True, max_length=512).to(device)

    # Pass both input_ids and attention_mask to the model
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Generate responses
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )

    # Check if outputs are generated
    if outputs is not None and len(outputs) > 0:
        # Decode and return the generated text
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        return response
    else:
        return "No response generated."

# Test the chatbot with a user input prompt
prompt = "Hello, how are you?"
response = generate_response(prompt, model, tokenizer)
print(f"Chatbot response: {response}")

# Try another prompt
prompt = "Can you tell me a joke?"
response = generate_response(prompt, model, tokenizer)
print(f"Chatbot response: {response}")



Chatbot response: Bonjour, wie bist du?
Chatbot response: Können Sie mir sagen, oder wie eine joke?


**Install Gradio for UI**

In [7]:

!pip install gradio
import gradio as gr
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
# Step 1: Load the fine-tuned model and tokenizer
# Correct the paths for loading the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("./optimized_movie_chatbot_tokenizer") # load tokenizer from tokenizer path
model = AutoModelForSeq2SeqLM.from_pretrained("./optimized_movie_chatbot_model") # load model from model path
# Ensure the model is in evaluation mode
model.eval()
# Step 2: Move the model to GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
# Step 3: Function to generate chatbot responses
def generate_response_gradio(prompt_text):
    # Tokenize the input prompt with padding and attention mask
    inputs = tokenizer(prompt_text, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']
    # Generate response from the model
    outputs = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_length=150,  # Adjust maximum length of generated text
        num_return_sequences=1,  # Generate one response
        no_repeat_ngram_size=2,  # Avoid repeating the same n-grams
        do_sample=True,  # Enable sampling for varied responses
        top_k=50,  # Sample from top k tokens
        top_p=0.95,  # Use nucleus sampling
        temperature=0.7,  # Lower temperature makes output more deterministic
        pad_token_id=tokenizer.eos_token_id  # Set the pad token to eos_token_id
    )
    # Decode and return the generated response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
# Step 4: Create a Gradio interface
interface = gr.Interface(
    fn=generate_response_gradio,  # The function to call when the user submits input
    inputs="text",  # Input is a text box
    outputs="text",  # Output is also text
    title="Movie Chatbot",
    description="Chatbot based on movie dialogues. Ask it anything!"
)
# Step 5: Launch the Gradio interface
interface.launch(share=True) # Use share=True to get a public link in Colab

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://6b664c7b22dd953730.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


