![servicedesk](servicedesk.png)

CleverSupport is a company at the forefront of AI innovation, specializing in the development of AI-driven solutions to enhance customer support services. Their latest endeavor is to engineer a text classification system that can automatically categorize customer complaints. 

As a data scientist, your task is to build a machine learning model that accurately assigns each complaint to a specific category, such as mortgage, credit card, money transfers, or debt collection.

In [14]:
!pip install torchmetrics

Defaulting to user installation because normal site-packages is not writeable


In [15]:
from collections import Counter
import nltk, json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy, Precision, Recall

In [16]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
# Import data and labels
with open("words.json", 'r') as f1:
    words = json.load(f1)
with open("text.json", 'r') as f2:
    text = json.load(f2)
labels = np.load('labels.npy')

In [18]:
# Dictionaries to store the word to index mappings and vice versa
word2idx = {o:i for i,o in enumerate(words)}
idx2word = {i:o for i,o in enumerate(words)}

# Replace each word with its vocabulary index; words missing from the vocabulary fall back to index 0
for i, sentence in enumerate(text):
    text[i] = [word2idx.get(word, 0) for word in sentence]
    
# Defining a function that either shortens sentences or pads sentences with 0 to a fixed length
def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len),dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features

text = pad_input(text, 50)
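
To make the encoding and padding behaviour concrete, here is a minimal sketch on a toy vocabulary. The names `toy_words`, `toy_word2idx`, and `toy_text` are hypothetical and not part of the dataset. It shows that short sentences are left-padded, long ones keep only their first `seq_len` tokens, and that index 0 is shared by padding, unknown words, and the first vocabulary word.

```python
import numpy as np

def pad_input(sentences, seq_len):
    # Same logic as the notebook helper: left-pad with 0, keep the first seq_len tokens
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, review in enumerate(sentences):
        if len(review) != 0:
            features[ii, -len(review):] = np.array(review)[:seq_len]
    return features

# Hypothetical toy vocabulary, for illustration only
toy_words = ["card", "loan", "fee", "late"]
toy_word2idx = {w: i for i, w in enumerate(toy_words)}

# One short sentence, one sentence longer than seq_len, one empty sentence
toy_text = [["loan", "fee"], ["late", "fee", "card", "loan", "fee"], []]
encoded = [[toy_word2idx.get(w, 0) for w in s] for s in toy_text]

padded = pad_input(encoded, 4)
print(padded)
# Row 0 is left-padded, row 1 is truncated to its first 4 tokens, row 2 is all zeros.
# The 0 in row 1 is the word "card", not padding.
```

If the collision at index 0 matters for a given dataset, one option is to reserve index 0 for padding and shift every word index up by one; that would be a change to the notebook's preprocessing, not something it does today.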

In [19]:
# Splitting dataset
train_text, test_text, train_label, test_label = train_test_split(text, labels, test_size=0.2, random_state=42)

train_data = TensorDataset(torch.from_numpy(train_text), torch.from_numpy(train_label).long())
test_data = TensorDataset(torch.from_numpy(test_text), torch.from_numpy(test_label).long())

In [20]:
# Start coding here

## Step 1: Define the CNN Classifier
We will start by defining a class for the CNN classifier in PyTorch.

In [21]:
class CNNClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes, seq_len):
        super(CNNClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv1d = nn.Conv1d(in_channels=embedding_dim, out_channels=128, kernel_size=5, stride=1)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.fc = nn.Linear(128 * ((seq_len - 4) // 2), num_classes)
        
    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(0, 2, 1)  # Permute to match input shape for Conv1d (batch_size, embedding_dim, seq_len)
        x = F.relu(self.conv1d(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Parameters
vocab_size = len(word2idx)
embedding_dim = 50
num_classes = len(set(labels))
seq_len = 50

# Instantiate the model
model = CNNClassifier(vocab_size, embedding_dim, num_classes, seq_len)
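
The input size of the linear layer, `128 * ((seq_len - 4) // 2)`, follows from the convolution and pooling output lengths. The sketch below checks that arithmetic against the actual layer outputs; the helper `conv1d_out_len` is a hypothetical name introduced here, and the batch size of 8 is arbitrary.

```python
import torch
import torch.nn as nn

def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    # Output length of Conv1d / MaxPool1d with dilation 1
    return (length + 2 * padding - kernel_size) // stride + 1

seq_len, embedding_dim = 50, 50

conv_len = conv1d_out_len(seq_len, kernel_size=5, stride=1)   # 50 - 5 + 1 = 46
pool_len = conv1d_out_len(conv_len, kernel_size=2, stride=2)  # (46 - 2) // 2 + 1 = 23
flat_features = 128 * pool_len                                # 2944

# Verify against real layers with the same hyperparameters as CNNClassifier
conv = nn.Conv1d(embedding_dim, 128, kernel_size=5, stride=1)
pool = nn.MaxPool1d(kernel_size=2, stride=2)
dummy = torch.zeros(8, embedding_dim, seq_len)  # (batch, channels, length)
out = pool(conv(dummy))

print(tuple(out.shape), flat_features)
```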

## Step 2: Define the Optimizer and Loss Function

In [22]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
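
`nn.CrossEntropyLoss` takes raw logits and integer class indices, which is why the model's `forward` returns the linear layer output without a softmax. A minimal sketch on random logits (the tensors here are made up for illustration) shows it matches log-softmax followed by negative log-likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)             # 4 samples, 5 classes, unnormalized scores
targets = torch.tensor([0, 3, 1, 4])   # integer class labels (dtype long)

loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_ce.item(), loss_manual.item())
```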

## Step 3: Train the Classifier
We will train the classifier for 3 epochs using the DataLoader to batch the training data.

In [23]:
# DataLoader
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)

# Training function
def train_model(model, train_loader, criterion, optimizer, epochs=3):
    model.train()
    for epoch in range(epochs):
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # Note: this reports the loss of the final batch in the epoch, not the epoch average
        print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item()}')

# Train the model
train_model(model, train_loader, criterion, optimizer)

Epoch 1/3, Loss: 1.2620103359222412
Epoch 2/3, Loss: 0.7263175249099731
Epoch 3/3, Loss: 0.6771326065063477
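
The values printed above are the loss of the last batch in each epoch, which can be noisy with a batch size of 32. One possible alternative, sketched here on synthetic data, is to track a sample-weighted mean loss per epoch. The function `train_with_mean_loss` and the `toy_*` objects are hypothetical and stand in for the notebook's model and loaders.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

def train_with_mean_loss(model, loader, criterion, optimizer, epochs=3):
    model.train()
    history = []
    for epoch in range(epochs):
        total_loss, total_samples = 0.0, 0
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            # criterion returns the batch mean, so weight it by the batch size
            total_loss += loss.item() * inputs.size(0)
            total_samples += inputs.size(0)
        history.append(total_loss / total_samples)
        print(f'Epoch {epoch+1}/{epochs}, Mean loss: {history[-1]:.4f}')
    return history

# Synthetic stand-in data: 64 samples, 10 features, 3 classes
torch.manual_seed(0)
toy_x = torch.randn(64, 10)
toy_y = torch.randint(0, 3, (64,))
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=16, shuffle=True)

toy_model = nn.Linear(10, 3)
toy_opt = torch.optim.Adam(toy_model.parameters(), lr=0.001)
history = train_with_mean_loss(toy_model, toy_loader, nn.CrossEntropyLoss(), toy_opt)
```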


## Step 4: Test the Classifier and Make Predictions

In [24]:
# Function to make predictions and store them in 'predicted'
def evaluate_model(model, test_loader):
    model.eval()
    predicted = []
    true_labels = []
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            predicted.extend(preds.numpy())
            true_labels.extend(labels.numpy())
    return predicted, true_labels

# Get predictions
predicted, true_labels = evaluate_model(model, test_loader)
predictions = torch.tensor(predicted)  # Ensure predictions is a tensor

## Step 5: Calculate Metrics
We'll use torchmetrics to calculate accuracy, precision, and recall.

In [25]:
# Calculate metrics
accuracy = Accuracy(task="multiclass", num_classes=num_classes)(predictions, torch.tensor(true_labels))
precision = Precision(task="multiclass", num_classes=num_classes, average='macro')(predictions, torch.tensor(true_labels))
recall = Recall(task="multiclass", num_classes=num_classes, average='macro')(predictions, torch.tensor(true_labels))

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')

Accuracy: 0.6610
Precision: 0.6798
Recall: 0.6604
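
For reference, macro averaging is the unweighted mean of the per-class scores. The sketch below recomputes accuracy, macro precision, and macro recall by hand on a tiny made-up example. The helper `per_class_precision_recall` is a hypothetical name, and scoring a class with no predictions as 0 is an assumption of this sketch.

```python
import torch

def per_class_precision_recall(preds, targets, num_classes):
    precision = torch.zeros(num_classes)
    recall = torch.zeros(num_classes)
    for c in range(num_classes):
        tp = ((preds == c) & (targets == c)).sum().item()
        predicted_c = (preds == c).sum().item()   # TP + FP
        actual_c = (targets == c).sum().item()    # TP + FN
        # Assumption: a class with an empty denominator scores 0
        precision[c] = tp / predicted_c if predicted_c else 0.0
        recall[c] = tp / actual_c if actual_c else 0.0
    return precision, recall

# Made-up predictions and labels for 3 classes
toy_preds = torch.tensor([0, 0, 1, 1, 2, 2])
toy_targets = torch.tensor([0, 1, 1, 1, 2, 0])

toy_precision, toy_recall = per_class_precision_recall(toy_preds, toy_targets, 3)
toy_accuracy = (toy_preds == toy_targets).float().mean().item()
macro_precision = toy_precision.mean().item()
macro_recall = toy_recall.mean().item()

print(toy_precision.tolist(), toy_recall.tolist())
print(toy_accuracy, macro_precision, macro_recall)
```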


## Their Solution

In [26]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu in TicketClassifier.forward
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy, Precision, Recall

batch_size = 400
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)

# Define the classifier class
class TicketClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, target_size):
        super(TicketClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(embed_dim, target_size)

    def forward(self, text):
        embedded = self.embedding(text).permute(0, 2, 1)
        conved = F.relu(self.conv(embedded))
        conved = conved.mean(dim=2) 
        return self.fc(conved)


vocab_size = len(word2idx) + 1
target_size = len(np.unique(labels))
embedding_dim = 64

# Create an instance of the TicketClassifier class
model = TicketClassifier(vocab_size, embedding_dim, target_size)

lr = 0.05
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

epochs = 3

# Train the model
model.train()
for i in range(epochs):
    running_loss, num_processed = 0,0
    for inputs, labels in train_loader:
        model.zero_grad()
        output = model(inputs)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        num_processed += len(inputs)
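    # Note: running_loss sums per-batch mean losses, so dividing by the sample count
    # gives roughly the mean loss divided by batch_size, not the mean loss per sample
    print(f"Epoch: {i+1}, Loss: {running_loss/num_processed}")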


# Use target_size rather than a hard-coded class count
accuracy_metric = Accuracy(task='multiclass', num_classes=target_size)
precision_metric = Precision(task='multiclass', num_classes=target_size, average=None)
recall_metric = Recall(task='multiclass', num_classes=target_size, average=None)

# Evaluate model on test set (gradients are not needed for inference)
model.eval()
predicted = []

with torch.no_grad():
    for inputs, labels in test_loader:
        output = model(inputs)
        cat = torch.argmax(output, dim=-1)
        predicted.extend(cat.tolist())
        # Each call accumulates batch statistics inside the metric object
        accuracy_metric(cat, labels)
        precision_metric(cat, labels)
        recall_metric(cat, labels)

accuracy = accuracy_metric.compute().item()
precision = precision_metric.compute().tolist()
recall = recall_metric.compute().tolist()
print('Accuracy:', accuracy)
print('Precision (per class):', precision)
print('Recall (per class):', recall)

Epoch: 1, Loss: 0.00380037334561348
Epoch: 2, Loss: 0.0018858914971351622
Epoch: 3, Loss: 0.0009005416296422482
Accuracy: 0.7839999794960022
Precision (per class): [0.700507640838623, 0.7253885865211487, 0.875, 0.7566137313842773, 0.8515284061431885]
Recall (per class): [0.71875, 0.7368420958518982, 0.7777777910232544, 0.7447916865348816, 0.9285714030265808]
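
TicketClassifier uses `padding=1` with `kernel_size=3`, so the convolution preserves the sequence length, and `mean(dim=2)` then collapses that dimension. A small shape check, using dummy zero tensors and an arbitrary batch size of 4, suggests the linear layer's input size therefore depends only on `embed_dim`, not on the sequence length:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 64
conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)

shapes = {}
for seq_len in (50, 30):
    embedded = torch.zeros(4, embed_dim, seq_len)  # (batch, embed_dim, seq_len)
    conved = F.relu(conv(embedded))                # length unchanged by padding=1
    pooled = conved.mean(dim=2)                    # average over sequence positions
    shapes[seq_len] = (tuple(conved.shape), tuple(pooled.shape))

print(shapes)
```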


## Comparison Summary
My Code:
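- Uses a CNN model (CNNClassifier) with an embedding layer (dimension 50), a convolutional layer (128 filters, kernel size 5), a max-pooling layer, and a fully connected layer on the flattened features.
- Defines the model with nn.Embedding, nn.Conv1d, nn.MaxPool1d, and nn.Linear.
- Utilizes torchmetrics for accuracy, precision, and recall calculations, with macro averaging for precision and recall.
- Optimizes the model using the Adam optimizer with a learning rate of 0.001.
- Data is padded to a fixed sequence length of 50.
- Trains the model for 3 epochs using a batch size of 32 and evaluates on a separate test set.
- Reaches a test accuracy of 0.6610, macro precision of 0.6798, and macro recall of 0.6604.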

Proposed Solution:
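- Uses a CNN model (TicketClassifier) with an embedding layer (dimension 64), a convolutional layer (kernel size 3, padding 1), mean pooling over the sequence positions, and a fully connected layer.
- Defines the model with nn.Embedding, nn.Conv1d, and nn.Linear.
- Utilizes torchmetrics for accuracy, precision, and recall calculations, reporting precision and recall per class (average=None).
- Optimizes the model using the Adam optimizer with a learning rate of 0.05.
- Data is padded to a fixed sequence length of 50.
- Trains the model for 3 epochs using a batch size of 400 and evaluates on a separate test set.
- Reaches a test accuracy of 0.7840, with per-class precision between 0.70 and 0.88 and per-class recall between 0.72 and 0.93.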
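
One further structural difference is model size. The sketch below counts trainable parameters for the layers of each architecture. It assumes a hypothetical vocabulary of 1,000 words (the real vocabulary size is not shown in this notebook) and 5 classes, matching the five per-class scores printed above.

```python
import torch.nn as nn

def count_params(*modules):
    # Total number of trainable values across the given layers
    return sum(p.numel() for m in modules for p in m.parameters())

# Assumptions for illustration: vocab_size is hypothetical, num_classes is taken from the output above
vocab_size, num_classes, seq_len = 1000, 5, 50

cnn_classifier_params = count_params(
    nn.Embedding(vocab_size, 50),
    nn.Conv1d(50, 128, kernel_size=5),
    nn.Linear(128 * ((seq_len - 4) // 2), num_classes),  # flattened conv features
)
ticket_classifier_params = count_params(
    nn.Embedding(vocab_size, 64),
    nn.Conv1d(64, 64, kernel_size=3, padding=1),
    nn.Linear(64, num_classes),                          # mean-pooled features
)

print(cnn_classifier_params, ticket_classifier_params)
```

Under these assumptions, the flattened head of CNNClassifier accounts for 14,725 parameters while the mean-pooled head of TicketClassifier needs 325; in both models the embedding table holds most of the parameters.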