# Intent Predictor

Predicting user intent is a core task in conversational AI. Each response in a conversation depends not only on the last statement but on the flow and context of the entire exchange. Because a single message can carry several intents at once (a multilabel problem), predicting them calls for sequence-based models that capture conversational context.

This notebook tackles intent prediction based on the sequence of previous intents in a conversation. Our objective is to predict the intents with which we should respond, given the series of multilabel intents classified up to a given point.

## Data

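The dataset is derived from the conversational exchanges used up to this point (customer-support conversations in which each message is annotated with one or more intents). The underlying data is unchanged, but its preparation and representation are adapted to the requirements of this model and task: each message becomes a multi-hot vector over the intent labels, and the model sees a fixed-length window of previous messages.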
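As a rough illustration of that representation, the sketch below encodes a toy message history as one multi-hot vector per message, keeps the most recent messages as context, and left-pads shorter histories with zero rows. The label list, the `toy_history` data, and the `encode` and `build_window` helpers are hypothetical stand-ins chosen for this example; the notebook's own preprocessing lives in the DataModule defined further down.

```python
import torch
from torch.nn import functional as F

SEQUENCE_LENGTH = 3  # number of previous messages used as context
LABELS = ["Acknowledgement", "Provide Information", "Question"]  # hypothetical label set


def encode(intents):
    """Multi-hot encode a list of intent names against LABELS."""
    vec = torch.zeros(len(LABELS))
    for intent in intents:
        vec[LABELS.index(intent)] = 1.0
    return vec


def build_window(history):
    """Stack the last SEQUENCE_LENGTH messages and left-pad with zero rows.

    Assumes `history` contains at least one message.
    """
    window = torch.stack([encode(i) for i in history][-SEQUENCE_LENGTH:])
    missing = SEQUENCE_LENGTH - window.shape[0]
    # Pad spec: (last dim left, last dim right, first dim top, first dim bottom)
    return F.pad(window, (0, 0, missing, 0))


# Hypothetical two-message history: shorter than the window, so one zero row is prepended
toy_history = [["Question"], ["Acknowledgement", "Provide Information"]]
window = build_window(toy_history)
print(window.shape)
print(window)
```

Padding on the left keeps the most recent message in the final position of the window, which is the position a recurrent model reads last.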

## Model Architecture

We've chosen the Gated Recurrent Unit (GRU), a recurrent neural network architecture, for this task. Its gating mechanism lets it carry information across the steps of a sequence, which suits the sequence-based nature of our problem. The GRU is also simpler than an LSTM: it uses two gates instead of three and has no separate cell state, so it has fewer parameters to train.
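To make the size comparison concrete, here is a small sketch that builds a single-layer `nn.GRU` and `nn.LSTM` with the same dimensions, counts their parameters, and inspects the GRU's output shapes. The sizes (9 input features, 32 hidden units) and the `count_params` helper are illustrative choices for this example only.

```python
import torch
from torch import nn

n_labels, hidden_size = 9, 32  # illustrative sizes

gru = nn.GRU(input_size=n_labels, hidden_size=hidden_size, batch_first=True)
lstm = nn.LSTM(input_size=n_labels, hidden_size=hidden_size, batch_first=True)


def count_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


gru_params = count_params(gru)
lstm_params = count_params(lstm)

# A dummy batch: 4 sequences, 3 time steps, n_labels features per step
x = torch.zeros(4, 3, n_labels)
output, h_n = gru(x)

print(output.shape, h_n.shape)   # per-step outputs and final hidden state
print(gru_params, lstm_params)   # GRU has 3 weight blocks per layer, LSTM has 4
```

For a single-layer, unidirectional GRU, the last time step of `output` is the same tensor as the final hidden state `h_n[0]`, so either can be fed to a classification head.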

## Development Environment

To streamline development and reduce boilerplate code, we use PyTorch Lightning. The model is built as a `LightningModule`, and data handling, preprocessing, and loading are encapsulated in a `LightningDataModule`. This keeps the code modular and readable, and makes it easier to scale up and to run experiments.

In the sections that follow, we cover data preparation, model architecture, training, evaluation, and inference.

In [32]:
# Imports
import json
# Torch
import torch
from torch.utils.data import Dataset, DataLoader
from torch import nn
from torch.nn import functional as F

import lightning.pytorch as pl
from sklearn.model_selection import train_test_split # scikit-learn
import torchmetrics

In [2]:
# Setup & Configurations (constants, seeds, and devices)
RANDOM_SEED = 69
DATASET_FILENAME = '../data/clean/customer_support_twitter_full.json'
SEQUENCE_LENGTH = 3

torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x1bf8701ee90>

In [27]:
# Define PyTorch Dataset
class PredictorDataset(Dataset):

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Define PyTorch Lightning DataModule
class PredictorDataModule(pl.LightningDataModule):

    def __init__(self, filename: str, batch_size=32):
        super().__init__()
        self.filename = filename
        self.conversations = self._load_conversations(filename)
        self.batch_size = batch_size
        labels = set()
        for conversation in self.conversations:
            for message in conversation:
                for intent in message.get('intents'):
                    labels.add(intent)
        self.labels = sorted(list(labels))
        data = []
        for conversation in self.conversations:
            for j, message in enumerate(conversation):
                if message.get('authored') and j > 0:
                    inputs = torch.stack([self._get_label(m.get('intents')) for m in conversation[:j]][-SEQUENCE_LENGTH:])
                    padding_needed = max(SEQUENCE_LENGTH - inputs.shape[0], 0)
                    padded_inputs = F.pad(inputs, (0, 0, padding_needed, 0))
                    target = self._get_label(message.get('intents'))
                    data.append((padded_inputs, target))
        
        # Split the data into 80% train, 10% validation, and 10% test
        train_data, temp_data = train_test_split(data, test_size=0.2, random_state=RANDOM_SEED)
        val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=RANDOM_SEED)

        # Setup datasets
        self.train_dataset = PredictorDataset(train_data)
        self.val_dataset = PredictorDataset(val_data)
        self.test_dataset = PredictorDataset(test_data)
        

    def _get_label(self, intents: list[str]):
        label = torch.zeros(len(self.labels))
        for intent in intents:
            label[self.labels.index(intent)] = 1
        return label
                

    @staticmethod
    def _load_conversations(filename):
        with open(filename) as file:
            conversations = json.load(file)
        return conversations

    @property
    def stats(self):
        return '\n'.join([
            f'Conversation Count: {len(self.conversations)}',
            f'Label Counts: {self.labels}',
            f'Sample Conversation:\n{json.dumps(self.conversations[0], indent=2)}',
        ])

    @property
    def n_labels(self):
        return len(self.labels)
    
    def train_dataloader(self):
        # Reshuffle the training samples every epoch
        return DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size)

dm = PredictorDataModule(DATASET_FILENAME)
print(dm.stats)

Conversation Count: 1001
Label Counts: ['Acknowledgement', 'Call Center Inquiry', 'Check Version/Details', 'Direct to DM', 'Provide Information', 'Question', 'Report Problem', 'Troubleshooting', 'URL Share']
Sample Conversation:
[
  {
    "id": 698,
    "text": "@AppleSupport  URL",
    "authored": false,
    "intents": [
      "URL Share"
    ]
  },
  {
    "id": 696,
    "text": "USERNAME We're here for you. Which version of the iOS are you running? Check from Settings > General > About.",
    "authored": true,
    "intents": [
      "Question",
      "Provide Information",
      "Check Version/Details",
      "Acknowledgement"
    ]
  },
  {
    "id": 697,
    "text": "@AppleSupport The newest update. I made sure to download it yesterday.",
    "authored": false,
    "intents": [
      "Provide Information"
    ]
  },
  {
    "id": 699,
    "text": "USERNAME Lets take a closer look into this issue. Select the following link to join us in a DM and we'll go from there. URL",
    "auth

In [40]:
# Model Loading & Configuration
class PredictorModel(pl.LightningModule):
    HIDDEN_SIZE = 32
    DROPOUT = 0.2
    
    def __init__(self, n_labels):
        super().__init__()
        # Layers
        # nn.GRU applies its `dropout` argument only between stacked layers,
        # so with num_layers=1 it has no effect (and PyTorch warns about it).
        # Dropout is applied to the last GRU output in forward() instead.
        self.gru = nn.GRU(
            input_size=n_labels,
            hidden_size=self.HIDDEN_SIZE,
            num_layers=1,
            batch_first=True,
        )
        self.dropout = nn.Dropout(self.DROPOUT)
        self.fc = nn.Linear(self.HIDDEN_SIZE, n_labels)
        # Metrics
        self.accuracy = torchmetrics.classification.MultilabelAccuracy(
            num_labels=n_labels,
        )
        self.f1_score = torchmetrics.classification.MultilabelF1Score(
            num_labels=n_labels,
            average='micro',
        )
        self.hamming_loss = torchmetrics.classification.MultilabelHammingDistance(
            num_labels=n_labels,
        )

    def forward(self, x):
        # GRU: x has shape (batch_size, seq_len, input_dim),
        # so output has shape (batch_size, seq_len, hidden_dim).
        output, _ = self.gru(x)

        # We're interested in the last output for prediction,
        # which is the contextually richest. Selecting output[:, -1, :]
        # gives a shape of (batch_size, hidden_dim).
        output = self.dropout(output[:, -1, :])

        # GRU outputs already went through tanh, so no activation is required.
        # The linear layer returns raw logits; the sigmoid is applied inside the loss.
        output = self.fc(output)

        return output

    def _common_step(self, batch, batch_idx):
        inputs, labels = batch
        outputs = self(inputs)
        
        # Compute the loss
        loss = F.binary_cross_entropy_with_logits(outputs, labels)
        return outputs, labels, loss

    def _common_log(self, outputs, labels, loss, stage: str):
        self.log_dict({
            f'{stage}_loss': loss,
            f'{stage}_acc': self.accuracy(outputs, labels),
            f'{stage}_f1': self.f1_score(outputs, labels),
        }, prog_bar=True)

    def training_step(self, batch, batch_idx):
        outputs, labels, loss = self._common_step(batch, batch_idx)
        self._common_log(outputs, labels, loss, 'train')
        return loss

    def validation_step(self, batch, batch_idx):
        outputs, labels, loss = self._common_step(batch, batch_idx)
        self._common_log(outputs, labels, loss, 'val')
        self.log_dict({
            'val_hloss': self.hamming_loss(outputs, labels),
        })
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-2)

model = PredictorModel(dm.n_labels)
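Because the model returns raw logits, turning a prediction into intent names takes a sigmoid followed by a threshold. The sketch below shows one way to do that, assuming a cutoff of 0.5; the `decode` helper, the three-label subset, and the example logits are hypothetical and only meant to illustrate the step.

```python
import torch

LABELS = ["Acknowledgement", "Direct to DM", "Question"]  # hypothetical subset of labels


def decode(logits, labels, threshold=0.5):
    """Map a (batch, n_labels) tensor of logits to lists of intent names."""
    probs = torch.sigmoid(logits)  # independent probability per label
    return [
        [label for label, p in zip(labels, row) if p >= threshold]
        for row in probs
    ]


# Made-up logits for two samples, standing in for a model's output
logits = torch.tensor([[2.0, -1.0, 0.3], [-3.0, 1.5, -0.2]])
predicted = decode(logits, LABELS)
print(predicted)
```

Each label is thresholded independently, so a sample can receive several intents or none; the threshold is a tunable choice rather than a fixed part of the model.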

In [41]:
# Training Loop
logger = pl.loggers.TensorBoardLogger("runs", name="predictor")
trainer = pl.Trainer(
    max_epochs=20,
    callbacks=[pl.callbacks.RichProgressBar(leave=True)],
    logger=logger,
)
trainer.fit(model, datamodule=dm)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


`Trainer.fit` stopped: `max_epochs=20` reached.
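To help read the logged `val_f1` and `val_hloss` values, the following sketch computes a micro-averaged F1 score and a Hamming distance by hand on a tiny made-up batch. It is a plain-PyTorch illustration of the standard definitions, not the torchmetrics implementation, and the `preds` and `target` tensors are invented for the example.

```python
import torch

# Made-up binary predictions and targets: 2 samples, 3 labels
preds = torch.tensor([[1, 0, 1], [0, 1, 0]])
target = torch.tensor([[1, 1, 0], [0, 1, 0]])

# Pool true positives, false positives, and false negatives over all labels
tp = ((preds == 1) & (target == 1)).sum().item()
fp = ((preds == 1) & (target == 0)).sum().item()
fn = ((preds == 0) & (target == 1)).sum().item()

# Micro F1: a single F1 score computed from the pooled counts
micro_f1 = 2 * tp / (2 * tp + fp + fn)

# Hamming distance: fraction of label slots that are predicted wrongly
hamming = (preds != target).float().mean().item()

print(tp, fp, fn)
print(micro_f1, hamming)
```

A higher F1 is better, while a lower Hamming distance is better, so the two logged curves are expected to move in opposite directions as the model improves.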


In [26]:
len(dm.conversations)

1001