![servicedesk](servicedesk.png)

CleverSupport is a company that builds AI-driven tools for customer support. Its latest project is a text classification system that automatically categorizes customer complaints.

As a data scientist, your task is to build a machine learning model that assigns each complaint to a specific category, such as mortgage, credit card, money transfers, or debt collection.

In [10]:
!pip install torchmetrics

Defaulting to user installation because normal site-packages is not writeable



In [11]:
from collections import Counter
import nltk, json
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from torchmetrics import Accuracy, Precision, Recall
import torch.optim as optim

In [12]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
# Import data and labels
with open("words.json", 'r') as f1:
    words = json.load(f1)
with open("text.json", 'r') as f2:
    text = json.load(f2)
labels = np.load('labels.npy')

In [14]:
# Dictionaries to store the word to index mappings and vice versa
word2idx = {o:i for i,o in enumerate(words)}
idx2word = {i:o for i,o in enumerate(words)}

# Replace each word with its index; words missing from the vocabulary map to 0
for i, sentence in enumerate(text):
    text[i] = [word2idx.get(word, 0) for word in sentence]

# Left-pad each sentence with 0 up to seq_len, or keep only its first seq_len tokens if it is longer
def pad_input(sentences, seq_len):
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for ii, sentence in enumerate(sentences):
        if len(sentence) != 0:
            features[ii, -len(sentence):] = np.array(sentence)[:seq_len]
    return features

text = pad_input(text, 50)
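
To make the encoding and padding step concrete, here is a minimal sketch on a toy example. The vocabulary `toy_words` and the sentences in `toy_text` are hypothetical and only stand in for the contents of `words.json` and `text.json`; `pad_input` follows the definition in the cell above.

```python
import numpy as np

def pad_input(sentences, seq_len):
    """Left-pad with 0 (or truncate) each index list to exactly seq_len entries."""
    features = np.zeros((len(sentences), seq_len), dtype=int)
    for row, sentence in enumerate(sentences):
        if len(sentence) != 0:
            features[row, -len(sentence):] = np.array(sentence)[:seq_len]
    return features

# Hypothetical toy vocabulary; the real one is loaded from words.json.
toy_words = ["the", "card", "was", "charged", "twice"]
toy_word2idx = {word: idx for idx, word in enumerate(toy_words)}

toy_text = [
    ["card", "charged"],                                  # shorter than seq_len
    ["the", "card", "was", "charged", "twice", "again"],  # longer, ends with an unknown word
    [],                                                   # empty sentence stays all zeros
]

# Unknown words fall back to index 0, as in the notebook.
encoded = [[toy_word2idx.get(word, 0) for word in sentence] for sentence in toy_text]
padded = pad_input(encoded, 4)
print(padded)
```

Short sentences are padded on the left, and long ones keep only their first `seq_len` tokens. In this toy vocabulary the first word also has index 0, so it cannot be told apart from padding or unknown words; whether that matters for the real vocabulary depends on what `words[0]` is.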

In [15]:
# Splitting dataset
train_text, test_text, train_label, test_label = train_test_split(text, labels, test_size=0.2, random_state=42)

train_data = TensorDataset(torch.from_numpy(train_text), torch.from_numpy(train_label).long())
test_data = TensorDataset(torch.from_numpy(test_text), torch.from_numpy(test_label).long())

In [17]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchmetrics.classification import Accuracy, Precision, Recall
import numpy as np

# DataLoader parameters
batch_size = 400
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)

# Define the TicketClassifier class
class TicketClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, target_size):
        super(TicketClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(embed_dim, target_size)

    def forward(self, text):
        embedded = self.embedding(text).permute(0, 2, 1)  # Adjust dimensions for Conv1d
        conved = F.relu(self.conv(embedded))
        pooled = conved.mean(dim=2)  # Global average pooling
        return self.fc(pooled)

# Model parameters
vocab_size = len(word2idx) + 1
embedding_dim = 64
target_size = len(np.unique(labels))

# Initialize model, loss, and optimizer
model = TicketClassifier(vocab_size, embedding_dim, target_size)
lr = 0.05
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# Training loop
epochs = 3
model.train()
print("Starting Training...")
for epoch in range(epochs):
    running_loss, num_processed = 0.0, 0
    for inputs, targets in train_loader:  # `targets` avoids overwriting the global `labels` array
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, targets)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item() * inputs.size(0)  # Accumulate total loss
        num_processed += inputs.size(0)

    epoch_loss = running_loss / num_processed
    print(f"Epoch [{epoch + 1}/{epochs}], Loss: {epoch_loss:.4f}")

# Evaluation metrics
accuracy_metric = Accuracy(task='multiclass', num_classes=target_size)
precision_metric = Precision(task='multiclass', num_classes=target_size, average=None)
recall_metric = Recall(task='multiclass', num_classes=target_size, average=None)

# Evaluation loop
model.eval()
predicted = []
print("Starting Evaluation...")
with torch.no_grad():
    for inputs, targets in test_loader:  # `targets` avoids overwriting the global `labels` array
        output = model(inputs)
        predictions = torch.argmax(output, dim=1)
        predicted.extend(predictions.tolist())

        # Update metrics
        accuracy_metric.update(predictions, targets)
        precision_metric.update(predictions, targets)
        recall_metric.update(predictions, targets)

# Compute and display metrics
accuracy = accuracy_metric.compute().item()
precision = precision_metric.compute().tolist()
recall = recall_metric.compute().tolist()

print(f"Accuracy: {accuracy:.4f}")
print("Precision (per class):", precision)
print("Recall (per class):", recall)


Starting Training...
Epoch [1/3], Loss: 1.5720
Epoch [2/3], Loss: 0.6701
Epoch [3/3], Loss: 0.3083
Starting Evaluation...
Accuracy: 0.7860
Precision (per class): [0.699999988079071, 0.7413793206214905, 0.8421052694320679, 0.7871286869049072, 0.8399999737739563]
Recall (per class): [0.6927083134651184, 0.678947389125824, 0.8148148059844971, 0.828125, 0.8999999761581421]
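
The `permute` call in `forward` is easier to follow with the tensor shapes written out. The sketch below rebuilds the same three layers outside the class and traces one random batch through them. The sizes (`vocab_size=100`, `embed_dim=8`, a batch of 4) are hypothetical values chosen for readability, not the ones used in training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical small sizes, chosen only for illustration.
vocab_size, embed_dim, target_size = 100, 8, 5
batch, seq_len = 4, 50

embedding = nn.Embedding(vocab_size, embed_dim)
conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=1, padding=1)
fc = nn.Linear(embed_dim, target_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # (batch, seq_len)
embedded = embedding(tokens)                             # (batch, seq_len, embed_dim)
channels_first = embedded.permute(0, 2, 1)               # (batch, embed_dim, seq_len), as Conv1d expects
conved = F.relu(conv(channels_first))                    # length stays seq_len: kernel 3 with padding 1
pooled = conved.mean(dim=2)                              # (batch, embed_dim), average over positions
logits = fc(pooled)                                      # (batch, target_size)

for name, tensor in [("embedded", embedded), ("channels_first", channels_first),
                     ("conved", conved), ("pooled", pooled), ("logits", logits)]:
    print(f"{name:15s} {tuple(tensor.shape)}")
```

`nn.Conv1d` treats dimension 1 as channels, which is why the embedding dimension has to be moved in front of the sequence dimension before the convolution.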


In [24]:
print("Accuracy:",accuracy)
print("Precision:",precision)
print("Recall:",recall)

Accuracy: 0.7860000133514404
Precision: [0.699999988079071, 0.7413793206214905, 0.8421052694320679, 0.7871286869049072, 0.8399999737739563]
Recall: [0.6927083134651184, 0.678947389125824, 0.8148148059844971, 0.828125, 0.8999999761581421]
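
The per-class lists above hold one value per category. As a rough illustration of what those numbers mean, the sketch below computes accuracy and per-class precision and recall from scratch on a hypothetical toy set of predictions. The helper `per_class_precision_recall` is written for this example and is not part of torchmetrics; returning 0.0 for a class with no predictions or no true examples is an assumption made here.

```python
def per_class_precision_recall(preds, targets, num_classes):
    """Return (precision, recall) lists with one entry per class."""
    precision, recall = [], []
    for c in range(num_classes):
        true_pos = sum(1 for p, t in zip(preds, targets) if p == c and t == c)
        predicted_c = sum(1 for p in preds if p == c)   # everything the model called class c
        actual_c = sum(1 for t in targets if t == c)    # everything that really is class c
        precision.append(true_pos / predicted_c if predicted_c else 0.0)
        recall.append(true_pos / actual_c if actual_c else 0.0)
    return precision, recall

# Hypothetical toy predictions and ground truth for three classes.
toy_preds   = [0, 0, 1, 1, 2, 2, 2, 0]
toy_targets = [0, 1, 1, 1, 2, 0, 2, 0]

toy_accuracy = sum(p == t for p, t in zip(toy_preds, toy_targets)) / len(toy_targets)
toy_precision, toy_recall = per_class_precision_recall(toy_preds, toy_targets, 3)

print("Accuracy:", toy_accuracy)
print("Precision:", toy_precision)
print("Recall:", toy_recall)
```

Precision for a class answers "of the complaints predicted as this class, how many were right", while recall answers "of the complaints that truly belong to this class, how many were found".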
