## Introduction

Crafting human-like responses in real time is a challenge that has drawn a great deal of attention. Many models can understand and generate human text, but the real complexity arises when a model has to operate within a dynamic conversation flow: predicting the user's intent and crafting a contextually relevant response.

This project explores the integration of two models. The first, our 'Predictor', is trained to anticipate the likely intent of a user message based on the preceding conversation. It doesn't stop at understanding; it outputs specific labels that guide the next step of our system.

Enter the second part, DialoGPT. A variant of GPT (Generative Pre-trained Transformer) tailored for dialogue, DialoGPT has drawn attention in the conversational AI community for its capabilities. In our use case, however, it isn't left to its own devices. Guided by the labels generated by our Predictor, DialoGPT's job is to craft responses that are not just coherent but also in sync with the predicted intent.

By combining a targeted intent prediction mechanism with a state-of-the-art conversational model, we aim to bridge the gap between generic responses and those that resonate with the user's intent. This notebook chronicles our journey of integrating these models, fine-tuning DialoGPT on our dataset, and evaluating the outcomes of this symbiotic relationship.



In [1]:
# Imports
import torch
from torch import optim
from torch.utils.data import Dataset, DataLoader
import lightning.pytorch as pl
from transformers import AutoModelForCausalLM, AutoTokenizer
import json

# Constants
RANDOM_SEED = 69
DATASET_FILENAME = '../data/clean/customer_support_twitter_full.json'
MODEL_NAME = 'microsoft/DialoGPT-small'

# Setup
torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x1d5ae59eed0>

In [7]:
# Define PyTorch Dataset
class GeneratorDataset(Dataset):

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Define PyTorch Lightning DataModule
class GeneratorDataModule(pl.LightningDataModule):
    def __init__(self, filename: str, batch_size=2):
        super().__init__()
        self.filename = filename
        self.batch_size = batch_size
        self.conversations = self._load_conversations(filename)
        self.tokenizer = self._init_tokenizer()
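        data = []
        # Only the first three conversations are used here.
        for conv in self.conversations[:3]:
            # Pair each message with the message that follows it.
            for input_message, target_message in zip(conv, conv[1:]):
                # Keep only pairs whose reply is flagged as 'authored'.
                if target_message.get('authored'):
                    # Note: only the input message is tokenized; the target text is not used yet.
                    inputs = self.tokenizer(
                        f"{input_message.get('text')}",
                        padding='max_length',
                        truncation=True,
                        max_length=20,
                        return_tensors='pt',
                    )
                    # Labels are a copy of input_ids, padding positions included.
                    data.append({
                        'input_ids': inputs['input_ids'].squeeze(),
                        'attention_mask': inputs['attention_mask'].squeeze(),
                        'labels': inputs['input_ids'].squeeze(),
                    })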
        
        self.train_dataset = GeneratorDataset(data)
        
    @staticmethod
    def _load_conversations(filename):
        with open(filename) as file:
            conversations = json.load(file)
        return conversations

    @staticmethod
    def _init_tokenizer():
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        return tokenizer

    @property
    def stats(self):
        return '\n'.join([
            f'Conversation Count: {len(self.conversations)}',
            # f'Label Counts: {self.labels}',
            f'Sample Conversation:\n{json.dumps(self.conversations[0][:2], indent=2)}',
        ])

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)

dm = GeneratorDataModule(DATASET_FILENAME)
print(dm.stats)

Conversation Count: 1001
Sample Conversation:
[
  {
    "id": 698,
    "text": "@AppleSupport  URL",
    "authored": false,
    "intents": [
      "URL Share"
    ]
  },
  {
    "id": 696,
    "text": "USERNAME We're here for you. Which version of the iOS are you running? Check from Settings > General > About.",
    "authored": true,
    "intents": [
      "Question",
      "Provide Information",
      "Check Version/Details",
      "Acknowledgement"
    ]
  }
]
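
The sample above shows that every message carries its own `intents`. One way the Predictor's labels could steer the generator is to prepend them to the input text. The sketch below is an assumption, not the format used in this notebook: the `build_pairs` helper and the `[intent | intent]` prefix are hypothetical, and the sample records are abridged from the output above.

```python
def build_pairs(conversation):
    """Return (prompt, reply) pairs for every reply flagged as authored."""
    pairs = []
    for previous, reply in zip(conversation, conversation[1:]):
        if not reply.get('authored'):
            continue
        # Hypothetical prompt format: the reply's intent labels, then the previous message.
        intents = ' | '.join(reply.get('intents', []))
        prompt = f"[{intents}] {previous.get('text', '')}"
        pairs.append((prompt, reply.get('text', '')))
    return pairs


# Abridged version of the sample conversation printed above.
sample = [
    {'text': '@AppleSupport  URL', 'authored': False, 'intents': ['URL Share']},
    {'text': "USERNAME We're here for you.", 'authored': True,
     'intents': ['Question', 'Acknowledgement']},
]

for prompt, reply in build_pairs(sample):
    print(prompt, '->', reply)
```

Each prompt could then be tokenized together with its reply, so that the model learns to produce the reply given both the previous message and the intended labels.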


In [8]:
dm.train_dataset[0]

{'input_ids': tensor([   31, 16108, 15514,   220, 10289, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'labels': tensor([   31, 16108, 15514,   220, 10289, 50257, 50257, 50257, 50257, 50257,
         50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257, 50257])}
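
In the item above, `labels` still contains the padding id 50257 wherever `attention_mask` is 0, so padded positions would contribute to the loss. Hugging Face causal language models skip label positions set to -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. A minimal sketch of that masking, written with plain lists for readability; `mask_padding` is a hypothetical helper, not part of this notebook:

```python
IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]


# First eight positions of the dataset item shown above.
input_ids = [31, 16108, 15514, 220, 10289, 50257, 50257, 50257]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]

labels = mask_padding(input_ids, attention_mask)
print(labels)  # [31, 16108, 15514, 220, 10289, -100, -100, -100]
```

With tensors, the same result can be obtained with `labels.masked_fill(attention_mask == 0, -100)`.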

In [11]:
# Model Loading & Configuration
class GeneratorModel(pl.LightningModule):

    def __init__(self):
        super().__init__()
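        self.model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
        # Grow the embedding matrix to cover the added [PAD] token.
        # This relies on the global `dm` created in the DataModule cell.
        self.model.resize_token_embeddings(len(dm.tokenizer))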

    def forward(self, inputs):
        return self.model(**inputs)

    def _common_step(self, batch, batch_idx):
        outputs = self(batch)
        return outputs.logits, batch['labels'], outputs.loss

    # Not called yet: self.accuracy and self.f1_score are not defined on this class.
    def _common_log(self, outputs, labels, loss, stage: str):
        self.log_dict({
            f'{stage}_loss': loss,
            f'{stage}_acc': self.accuracy(outputs, labels),
            f'{stage}_f1': self.f1_score(outputs, labels),
        }, prog_bar=True)

    def training_step(self, batch, batch_idx):
        outputs, labels, loss = self._common_step(batch, batch_idx)
        return loss
        
    def configure_optimizers(self):
        return optim.AdamW(self.parameters(), lr=1e-5)

model = GeneratorModel()
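
`_common_log` refers to `self.accuracy`, which this notebook does not define. For a causal language model, one possible definition is next-token accuracy: the prediction made at position i is compared with the label at position i + 1, and ignored positions are skipped. The sketch below works on plain lists of token ids; the function name and the -100 ignore value are assumptions, not the notebook's own metric.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from the metric


def next_token_accuracy(predicted_ids, labels, ignore_index=IGNORE_INDEX):
    """Fraction of non-ignored positions where the shifted prediction matches the label."""
    # predicted_ids[i] is the model's guess for the token at position i + 1,
    # so drop the last prediction and the first label before comparing.
    pairs = [
        (pred, target)
        for pred, target in zip(predicted_ids[:-1], labels[1:])
        if target != ignore_index
    ]
    if not pairs:
        return 0.0
    return sum(pred == target for pred, target in pairs) / len(pairs)


# Illustrative values only: three of the four scored positions match.
predicted = [16108, 15514, 0, 10289, 9, 9]
targets = [31, 16108, 15514, 220, 10289, -100]
print(next_token_accuracy(predicted, targets))  # 0.75
```

In a training step, `predicted_ids` would typically come from taking the argmax of the logits over the vocabulary dimension.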

In [12]:
# Training Loop
logger = pl.loggers.TensorBoardLogger("runs", name="generator")
trainer = pl.Trainer(
    max_epochs=2,
    callbacks=[pl.callbacks.RichProgressBar(leave=True)],
    logger=logger,
)
trainer.fit(model, datamodule=dm)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


`Trainer.fit` stopped: `max_epochs=2` reached.


In [6]:
dm.tokenizer('Hellow world', max_length=10, truncation=True, return_tensors='pt', padding='max_length')

{'input_ids': tensor([[   39,  5037,   995, 50257, 50257, 50257, 50257, 50257, 50257, 50257]]), 'attention_mask': tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}