# DALI 2024 Winter Application - Machine Learning Track
### John Guerrerio

The Superstore.csv file is missing 999 entries in each column, incuding the Product Category and Product Sub-Category columns.  As these columns are useful features for predicting the profit of a purchase, approximating values for these missing entries would give us more data to train a machine learning model to predict profit.  This file contains code to train a deep learning based approach to predict product category from product name.  Note that one could easily modify this code to predict product sub-category from product category by changing the classifiaction layer of the model and the code to load the dataset.

Areas for expansion:
- Hyperparameter tuning (I would perform a grid search using skorch)
- Testing different model architectures for the decoder
- Testing different pre-trained transformers for the encoder



In [1]:
import pandas as pd
import transformers
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from torch import nn
from tqdm import tqdm
from sklearn.metrics import classification_report, confusion_matrix
import sys

## Prepare the Dataset

We first need to prepare the dataset for machine learning.  This involves loading it, generating correct class labels, splitting it into train, validation and test sets, and tokenizing the product names.  

Splitting the dataset into train, validation and test sets is important to prevent data leakage.  The purpose of each set is as follows:

- Train: The data to train the model on, refining it over time.
- Validation: The dataset on which to optimize performance when tuning hyperparameters.
- Test: The dataset to evaluate the model on after training and hyperparameter tuning.  This gives an indication of the model's real-world performace.

If we train on data outside the train dataset, we will artificially inflate our model's performance because the model will have seen the data in that set before.  Similarly, if we optimize our test set performance when tuning hyperparameters, the test set performance will be artifially inflated and we won't have good means of measuring the model's real world performance.  For this project, we use a 70-15-15 train test validation split.


In [2]:
RANDOM_STATE = 42 # random seed to ensure results are reproducible
BATCH_SIZE = 16 # number of documents in each minibatch for training

In [3]:
df = pd.read_csv('Superstore.csv') # requires the Superstore.csv to be uploaded if you are running this in colab

The below cell drops all entries with a null category and/or product name.  We are assuming data entries are missing completely at random, so this should not introduce bias into the dataset.

In [4]:
df.dropna(subset=["Category", "Product Name"], inplace=True)
print("Number of rows: " + str(df.shape[0]))

Number of rows: 8098


In [5]:
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-uncased') # load the tokenizer

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [6]:
textLabels = df["Category"]
print(textLabels)

0             Furniture
1             Furniture
2       Office Supplies
3             Furniture
4       Office Supplies
             ...       
9988         Technology
9990          Furniture
9991         Technology
9992    Office Supplies
9993    Office Supplies
Name: Category, Length: 8098, dtype: object


In [7]:
names = df["Product Name"].tolist()
productLabels=np.unique(df['Category'], return_inverse=True)[1].tolist() # generate numerical labels from product category names
print(np.unique(df['Category'], return_inverse=True))

(array(['Furniture', 'Office Supplies', 'Technology'], dtype=object), array([0, 0, 1, ..., 2, 1, 1]))


In [8]:
# shuffles the data and splits it into the train, test, and validation sets
train, validAndTest, trainLabels, validAndTestLabels = train_test_split(names, productLabels, test_size=0.3, random_state=RANDOM_STATE)
valid, test, validLabels, testLabels = train_test_split(validAndTest, validAndTestLabels, test_size=0.5, random_state=RANDOM_STATE)

In [9]:
# tokenize the product names - turns them into a format the BERT model can understand
trainTokenized = tokenizer(train, padding='max_length', max_length = 512, truncation=True, return_tensors='pt', return_attention_mask = True)
validTokenized = tokenizer(valid, padding='max_length', max_length = 512, truncation=True, return_tensors='pt', return_attention_mask = True)
testTokenized = tokenizer(test, padding='max_length', max_length = 512, truncation=True, return_tensors='pt', return_attention_mask = True)

trainTokens = trainTokenized["input_ids"]
trainMask = trainTokenized["attention_mask"]

validTokens = validTokenized["input_ids"]
validMask = validTokenized["attention_mask"]

testTokens = testTokenized["input_ids"]
testMask = testTokenized["attention_mask"]


In [10]:
print(len(train))
print(len(trainLabels))
print()

print(len(valid))
print(len(validLabels))
print()

print(len(test))
print(len(testLabels))


5668
5668

1215
1215

1215
1215


In [11]:
# a class to represent the train, validation, and test sets
# the Dataset class handles dividing the data into minibatches and producing the minibatches for us
class productNamesDataset(Dataset):
  def __init__(self, data, labels, mask):
    self.data = data
    self.labels = labels
    self.mask = mask

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    return self.data[idx], self.labels[idx], self.mask[idx]

In [12]:
# build the dataset objects for train, validation, and test sets
trainData = productNamesDataset(trainTokens, trainLabels, trainMask)
validData = productNamesDataset(validTokens, validLabels, validMask)
testData = productNamesDataset(testTokens, testLabels, testMask)

In [13]:
# build the dataloader objects for train, validation, and test sets
trainLoader = DataLoader(trainData, batch_size=BATCH_SIZE)
validLoader = DataLoader(validData, batch_size=BATCH_SIZE)
testLoader = DataLoader(testData, batch_size=1)

## Prepare the Model

The below are the hyperparameters I have chosen for this model.  My first area of expansion to this code would be hyperparameter tuning.  I would perform a grid search using the skorch package.  A grid search is where several possible values for each hyperparmaeter are specified beforehand, and all possible combinations of values of hyperparameters are tested.

In [14]:
DROPOUT = 0.2 # probability of dropping a node in the linear layer after the BERT encoder
ATTN_DROPOUT = 0.2 # probability of dropping a term in the attention equation within the BERT encoder
EPOCHS = 4 # number of times we train the model on the full train set
LEARNING_RATE = 0.001 # scales how much we update our parameters after each minibatch

In [15]:
# load huggingface pretrained model
dbert = transformers.DistilBertModel.from_pretrained('distilbert-base-uncased', dropout=DROPOUT, attention_dropout=ATTN_DROPOUT)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

In [16]:
# Get cpu or gpu device for training - THIS CODE WORKS BEST ON A GPU
device = "cuda" if torch.cuda.is_available() else "cpu" # need to change options to train on GPU
print(f"Using {device} device")

Using cuda device


The class below defines our model architecture for our classifier.  We can think of this model as having an encoder/decoder architecture - the encoder generates a vector capturing document context from each product name, and the decoder uses that vector to generate its predicted class.  In this case, our encoder is the pre-trained BERT model we use, and our decoder is the two linear layers we place on top of our BERT model.  We use the output of the last layer of the BERT model as our context vector, which is a standard practice for natural language processing.  We perform dropout on the first linear layer after our BERT output, which improves regularization and reduces overfitting.  We use a ReLU activation function between our two linear layers, which is one of the most efficient activation functions available for deep learning.

In [17]:
class DistilBertClassification(nn.Module):
    def __init__(self):
        super(DistilBertClassification, self).__init__()
        # BERT encoder
        self.dbert = dbert
        # Decoder
        self.dropout = nn.Dropout(p=DROPOUT)
        self.linear1 = nn.Linear(768,64)
        self.ReLu = nn.ReLU()
        self.linear2 = nn.Linear(64,3)

    def forward(self, tokens, mask):
        x = self.dbert(input_ids=tokens, attention_mask=mask)
        x = x["last_hidden_state"][:,0,:]
        x = self.dropout(x) # dropout on BERT output, prevents overfitting
        x = self.linear1(x)
        x = self.ReLu(x)
        logits = self.linear2(x)
        return logits

In [18]:
# Check the architecture of the model we have created
classifier = DistilBertClassification().to(device)
print(classifier)

DistilBertClassification(
  (dbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.2, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.2, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.2, inplace=False)
            (lin1

We set the weights and biases of the pre-trained BERT model we use as our encoder to not be updated during our training.  This avoids catastrphic forgetting - where a pretrained large language model "forgets" the knowledge it learned from its initial training when it is used for supervised transfer learning.

In [19]:
for param in classifier.dbert.parameters():
    param.requires_grad = False

In [20]:
# Check the number of total parameters and number of trainable parameters
total_params = sum(p.numel() for p in classifier.parameters())
total_params_trainable = sum(p.numel() for p in classifier.parameters() if p.requires_grad)
print("Number of parameters: ", total_params)
print("Number of trainable parameters: ", total_params_trainable)

Number of parameters:  66412291
Number of trainable parameters:  49411


## Training

We now train the model using the hyperparameters we specified above.  At each epoch, we evaluate the model on the validation set.  We use cross entropy loss as our loss function and Adaptive Moment Estimation as our optimization algorithm to determine how we update our parameters.  Both of these choices are standard for deep learning NLP work.

In [21]:
criterion = torch.nn.CrossEntropyLoss() # softmax and loss for classification layer
optimizer = torch.optim.Adam(classifier.parameters(), lr = LEARNING_RATE)

In [22]:
history = {}
history["epoch"]=[]
history["train_loss"]=[]
history["valid_loss"]=[]
history["train_accuracy"]=[]
history["valid_accuracy"]=[]

In [23]:
# training loop
for e in range(EPOCHS):
  classifier.train() # activated dropout

  train_loss = 0.0
  train_accuracy = []

  # loop over each minibatch
  for description, labels, mask in tqdm(trainLoader):

      # send minibatch to gpu for efficient training
      description = description.to(device)
      labels = labels.to(device)
      mask = mask.to(device)

      # Get prediction & loss
      prediction = classifier(description, mask)
      loss = criterion(prediction, labels)

      train_loss += loss.item()

      # determine the optimal direction to increment parameters
      loss.backward()

      # update parameters
      optimizer.step()

      # zero the gradient so we don't accumulate optimizer steps
      optimizer.zero_grad()

      train_loss += loss.item()

      prediction_index = prediction.argmax(axis=1)
      accuracy = (prediction_index==labels)
      train_accuracy += accuracy

  train_accuracy = (sum(train_accuracy) / len(train_accuracy))

  classifier.eval() # turn off dropout for evaluation
  valid_loss = 0.0
  valid_accuracy = []

  with torch.no_grad(): # turn off gradient calculation so we don't train on the validation set
    for description, labels, mask in validLoader:

      description = description.to(device)
      labels = labels.to(device)
      mask = mask.to(device)

      prediction = classifier(description, mask)
      loss = criterion(prediction, labels)

      valid_loss += loss.item()

      prediction_index = prediction.argmax(axis=1)
      accuracy = (prediction_index==labels)
      valid_accuracy += (accuracy)

  valid_accuracy = (sum(valid_accuracy) / len(valid_accuracy)) # sum sums up the boolean tensors, which themselves have a method to sum up

  # keep a record of our training results
  history["epoch"].append(e+1)
  history["train_loss"].append(train_loss / len(trainLoader))
  history["valid_loss"].append(valid_loss / len(validLoader))
  history["train_accuracy"].append(train_accuracy)
  history["valid_accuracy"].append(valid_accuracy)

  # output results
  print(f'Epoch {e+1}')
  print(f'\t\t Training Loss: {train_loss / len(trainLoader) :10.3f} \t\t Validation Loss: {valid_loss / len(validLoader) :10.3f}')
  print(f'\t\t Training Accuracy: {train_accuracy :10.3%} \t\t Validation Accuracy: {valid_accuracy :10.3%}')

100%|██████████| 355/355 [01:36<00:00,  3.68it/s]


Epoch 1
		 Training Loss:      0.966 		 Validation Loss:      0.296
		 Training Accuracy:    79.940% 		 Validation Accuracy:    89.383%


100%|██████████| 355/355 [01:35<00:00,  3.71it/s]


Epoch 2
		 Training Loss:      0.584 		 Validation Loss:      0.201
		 Training Accuracy:    89.591% 		 Validation Accuracy:    93.416%


100%|██████████| 355/355 [01:35<00:00,  3.71it/s]


Epoch 3
		 Training Loss:      0.520 		 Validation Loss:      0.174
		 Training Accuracy:    89.926% 		 Validation Accuracy:    93.992%


100%|██████████| 355/355 [01:35<00:00,  3.71it/s]


Epoch 4
		 Training Loss:      0.473 		 Validation Loss:      0.152
		 Training Accuracy:    91.196% 		 Validation Accuracy:    94.979%


## Evaluation on the Test Set

Now that we have trained the model (and optimized the hyperparameters on the validation set) we need to determine how well it would perform in a real-word environment.  This can be done by evaluating its performance on the test set.  By splitting the dataset into train, validation, and test sets, we have ensured the model has not been trained on documents in the test set and we have not "cheated" by tuning our hyperparameters to gain optimal performance on the test set.  Therefore, the model's performance on the test set is a good indication of how it would perform in the real world.\
\
0 is the furniture class\
1 is the office supplies class\
2 is the technology class

In [24]:
classifier.eval()

groundTruth = np.zeros(len(testLoader)) # holds the labels for the test set
predictions = np.zeros(len(testLoader)) # holds the model's predictions on the test set


i = 0
with torch.no_grad(): # turn off gradient calculation so we don't train on the test set
  for description, label, mask in tqdm(testLoader):
    description = description.to(device)
    label = label.to(device)
    mask = mask.to(device)

    prediction = classifier(description, mask)
    predictedClass = int(prediction.argmax(axis=1).item()) # determine the model's prediction on a test set document
    predictions[i] = predictedClass


    goldClass = int(label.item())
    groundTruth[i] = goldClass

    i+= 1

100%|██████████| 1215/1215 [00:21<00:00, 57.33it/s]


In [25]:
print(classification_report(groundTruth, predictions)) # print precision, recall, and f1 for each class and overall

              precision    recall  f1-score   support

         0.0       0.95      0.89      0.92       255
         1.0       0.95      0.97      0.96       720
         2.0       0.96      0.95      0.96       240

    accuracy                           0.95      1215
   macro avg       0.95      0.94      0.95      1215
weighted avg       0.95      0.95      0.95      1215



From the classification report, we see this model is performing well on the classification task.  The overall accuracy is well above the accuracy we would obtain by guessing the class randomly.  One thing to note is that these kind of deep learning models typically require more memory and time to perform interence than more lightweight models like logistic regression.  However, this model would only need to perform inference once as part of a data pipeline to fill in the missing cells in the superstore dataset.  This is not a task that requires us to prioritize performance, so the accuracy gained from using this type of model is worth the performance trade-off.

In [26]:
 print(confusion_matrix(groundTruth, predictions)) # print confusion matrix

[[228  27   0]
 [ 11 699  10]
 [  2   9 229]]


## Save the Trained Model

This allows us to use the model we have created for inference without having to train it again.  This would be useful in a data pipeline where we fill in missing values in the superstore dataset before using it to train another model for a different task. For instance, when predicting the profit of a purchase where product category is a useful feature, using this model to fill in the 999 rows missing product category could give us up to an additional 999 rows to train/test on

In [27]:
# This cell is only necessary if you are running this notebook in Colab
from google.colab import drive
import os

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [28]:
path = "/content/gdrive/MyDrive/ColabOutput/checkpoint.pth" # path to save to

torch.save(classifier.state_dict(), path)