# Sentiment Classification with ThirdAI's Universal Deep Transformer

To run this notebook, you will need to obtain a ThirdAI license at the following link if you have not already: https://www.thirdai.com/try-bolt/

In [None]:
!pip3 install -r requirements.txt

In [1]:
!pip3 install thirdai --upgrade

[33mDEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621[0m
You should consider upgrading via the '/opt/homebrew/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.[0m


# Dataset Download

In the next cell, we will process an Amazon review sentiment analysis dataset from the HuggingFace datasets library. This dataset consists of product review texts with associated binary labels indicated if the review is positive or negative.

In [10]:
import pandas as pd
from datasets import load_dataset
from utils import to_batch

def load_data(data, output_filename, split, return_inference_batch=False):
    
    df = pd.DataFrame(data[split])
    df = df[['title', 'label']]    
    df.to_csv(output_filename, index=False, sep='\t')
    
    if return_inference_batch:
        inference_batch = to_batch(df[["title"]].sample(frac=1).iloc[:5])
        return inference_batch

train_filename = "amazon_polarity_train.csv"
test_filename = "amazon_polarity_test.csv"

data = load_dataset('amazon_polarity')
load_data(data, train_filename, split='train')
inference_batch = load_data(data, test_filename, split='test', return_inference_batch=True)

Found cached dataset amazon_polarity (/Users/vihan/.cache/huggingface/datasets/amazon_polarity/amazon_polarity/3.0.0/a27b32b7e7b88eb274a8fa8ba0f654f1fe998a87c22547557317793b5d2772dc)


  0%|          | 0/2 [00:00<?, ?it/s]

# UDT Initialization
We can now create a UDT model by passing in the types of each column in the dataset and the target column we want to be able to predict.

In [2]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "title": bolt.types.text(),
        "label": bolt.types.categorical(),
    },
    target="label",
    n_target_classes=2,
    delimiter='\t',
)

# Training
We can now train our UDT model with just two lines! Feel free to customize the number of epochs and the learning rate; we have chosen values that give good convergence.

In [4]:
train_config = (bolt.TrainConfig(epochs=5, learning_rate=0.01)
                    .with_metrics(["categorical_accuracy"]))

model.train(train_filename, train_config)

Loading vectors from 'amazon_polarity_train.csv'
Loaded 3600000 vectors from 'amazon_polarity_train.csv' in 3 seconds.
train epoch 0:


train | epoch 0 | updates 1758 | {categorical_accuracy: 0.841919} | batches 1758 | time 136s | complete

train epoch 1:


train | epoch 1 | updates 3516 | {categorical_accuracy: 0.909367} | batches 1758 | time 126s | complete

train epoch 2:


train | epoch 2 | updates 5274 | {categorical_accuracy: 0.949447} | batches 1758 | time 123s | complete

train epoch 3:


train | epoch 3 | updates 7032 | {categorical_accuracy: 0.964999} | batches 1758 | time 263s | complete

train epoch 4:


train | epoch 4 | updates 8790 | {categorical_accuracy: 0.972298} | batches 1758 | time 122s | complete



# Evaluation
Evaluating the performance of the UDT model is also just two lines!

In [5]:
eval_config = (bolt.EvalConfig()
                   .with_metrics(["categorical_accuracy"]))

model.evaluate(test_filename, eval_config);

Loading vectors from 'amazon_polarity_test.csv'
Loaded 400000 vectors from 'amazon_polarity_test.csv' in 0 seconds.
test:


predict | epoch 5 | updates 8790 | {categorical_accuracy: 0.830342} | batches 196 | time 4188ms



# Saving and Loading
Saving and loading a trained UDT model to disk is also extremely straight forward.

In [6]:
save_location = "sentiment_analysis.model"

# Saving
model.save(save_location)

# Loading
model = bolt.UniversalDeepTransformer.load(save_location)

# Testing Predictions
The evaluation method is great for testing, but it requires labels, which don't exist in a production setting. We also have a predict method that can take in an in-memory batch of rows or a single row (without the target column), allowing easy integration into production pipelines. Note that UDT can perform inference with **sub-millisecond** latency

In [7]:
%%time
import numpy as np

prediction_batch = model.predict_batch(inference_batch)
class_names = ["Positive" if model.class_name(class_id) == 1 else 0 
               for class_id in np.argmax(prediction_batch, axis=1)]

print("Batch Prediction Results")
for input_sample, class_name in zip(inference_batch, class_names):
    print("Input:", input_sample, "Prediction:", class_name)

Batch Prediction Results
Input: {'title': 'spotty coverage of commonly used words'} Prediction: 0
Input: {'title': 'Simple and easy'} Prediction: 0
Input: {'title': 'Battlestar Galactica 1980'} Prediction: 0
Input: {'title': 'Tedious'} Prediction: 0
Input: {'title': 'History, Finction, Fantasy & Romance'} Prediction: 0
CPU times: user 1.85 ms, sys: 3.1 ms, total: 4.95 ms
Wall time: 4.51 ms
