## RoBERTa Sentiment Pipeline

This notebook will create a pipeline that uses the [`twitter-roberta-base-sentiment-latest`](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) `transformers` model to analyze the sentiment of tweets about Game of Thrones Season 8. The dataset is stored in an S3 bucket for this pipeline, but you can find it [here](https://www.kaggle.com/datasets/monogenea/game-of-thrones-twitter).

The code here is adapted from the example code on the HuggingFace website.

In the `process` function below, we limit the dataset to its first 5,000 tweets (the full dataset has about 800K tweets). Our cluster uses a `p2.xlarge` VM with a single Tesla K80, which bogs down when analyzing more than a few thousand tweets. This isn't a limitation of Aqueduct and can be solved by paying for a larger VM.

Note that this notebook makes two assumptions:

1. You have your Aqueduct server connected to a Kubernetes cluster with a GPU node group enabled. The easiest way to set this up is to use a hosted Kubernetes offering like AWS EKS or GKE. See our documentation for more details on connecting Aqueduct to Kubernetes.
2. You have an object store (e.g., AWS S3) connected, with the dataset linked above stored in it.

In [1]:
# Load the Aqueduct client.
import aqueduct as aq
from aqueduct import op, metric

client = aq.Client()

# This config tells Aqueduct to run every operator on the resource named "eks-us-east-2".
# It also activates "lazy" mode, meaning that we will only trigger compute operations
# when data is requested since some of the functions below can be expensive.
aq.global_config({"lazy": True, 'engine': 'eks-us-east-2'})

In [2]:
# Load the data from the S3 bucket and see a preview of the table.
# This is about 100MB of data and takes about 10s to load.
datasets_bucket = client.resource('datasets')
tweets = datasets_bucket.file('got_s8_tweets.csv', artifact_type="table", format="csv")

tweets.get().head()

Unnamed: 0,id,tweet
0,0,👍 on @YouTube: GAME OF THRONES 8x01 Breakdown!...
1,1,👍 on @YouTube: Ups and Downs From Game Of Thro...
2,2,Liked on YouTube: Ups and Downs From Game Of T...
3,3,Liked on YouTube: GAME OF THRONES 8x01 Breakdo...
4,4,@MrLegenDarius unpopular opinion: game of thro...


Here, we can see a preview of the tweets that we got from S3.

Next, we're going to write an Aqueduct operator that preprocesses the text. We'll clean up our tweets by replacing user handles and links with placeholders, and then we'll use the `transformers` library's `AutoTokenizer` class to tokenize our data.
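
To see what the cleanup does to a single tweet, here is a small standalone sketch of the same idea used by the helper inside `process`. The function name `normalize_tweet` and the example tweet are made up for illustration and are not part of the notebook's pipeline.

```python
def normalize_tweet(text: str) -> str:
    """Replace @-mentions with '@user' and links with 'http'."""
    cleaned = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        cleaned.append(token)
    return " ".join(cleaned)


# A made-up tweet, not taken from the dataset.
example = "@jon_snow the finale was wild https://example.com/clip"
print(normalize_tweet(example))
# prints: @user the finale was wild http
```

A lone `@` is left untouched because of the length check, which keeps stray symbols from being mistaken for handles.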

We first create batches of 1K tweets to tokenize at a time. To keep our tensors uniform, we pad them with 0s as necessary and then concatenate them. 

The `@op` decorator here has a few configuration parameters:
* First, the `engine` parameter tells Aqueduct to run this function on our EKS cluster in `us-east-2`; you can see the configuration for this resource on the Aqueduct UI.
* Second, we specify the requirements needed to run this function (`torch` and `transformers`); if necessary, we could specify the required versions as well. 
* Finally, we tell Aqueduct to give this container 15GB of RAM.

In [3]:
@op(
    engine='eks-us-east-2', 
    requirements=['torch', 'transformers'], 
    resources={
        'memory': '15GB',
    },
)
def process(inputs, model, input_limit):
    import torch
    import numpy as np
    from transformers import AutoTokenizer, AutoConfig
    from transformers.tokenization_utils_base import BatchEncoding
    
    # A simple helper function that replaces @-mentions and links in 
    # our tweets.
    def split(text):
        new_text = []
        for t in text.split(" "):
            t = '@user' if t.startswith('@') and len(t) > 1 else t
            t = 'http' if t.startswith('http') else t
            new_text.append(t)
        return " ".join(new_text)
     
    split_text = list(map(split, inputs['tweet'][:input_limit].tolist()))

    # Load the transformers configuration and tokenizer. We use `use_fast` to load
    # the fast, Rust-based tokenizer provided by HuggingFace.
    config = AutoConfig.from_pretrained(model)
    tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)

    batch_size = 1000
    
    # Create the empty tensors that we'll use to stack our tokenized data into.
    input_ids = torch.empty(0, 0, dtype=torch.int64)
    attention_masks = torch.empty(0, 0, dtype=torch.int64)
    
    # Iterate through the dataset batch by batch, tokenizing each batch and
    # appending the result to the running tensors.
    for i in range((len(split_text) // batch_size) + 1):
        if (i * batch_size) == len(split_text):
            break
            
        if (i + 1) * batch_size > len(split_text):
            end = len(split_text)
        else:
            end = (i + 1) * batch_size
            
        tokens = tokenizer(
            split_text[(i * batch_size) : end], 
            max_length=500, # This is required by the model.
            padding='max_length', 
            return_tensors='pt', 
            truncation=True,
        )
        
            
        # Pad the existing tensors if necessary.
        pad_delta = np.abs(tokens['input_ids'].size(1) - input_ids.size()[1])
        pad = (0, pad_delta)
        if pad_delta != 0: # If the dimensions are the same, we can blindly concatenate.                           
            if tokens['input_ids'].size()[1] > input_ids.size()[1]:
                input_ids = torch.nn.functional.pad(input_ids, pad, "constant", 0)
                attention_masks = torch.nn.functional.pad(attention_masks, pad, "constant", 0)
            else:
                tokens['input_ids'] = torch.nn.functional.pad(tokens['input_ids'], pad, "constant", 0)
                tokens['attention_mask'] = torch.nn.functional.pad(tokens['attention_mask'], pad, "constant", 0)
                
        input_ids = torch.cat(
            (
                input_ids,
                tokens['input_ids'],
            )
        )
        
        attention_masks = torch.cat(
            (
                attention_masks,
                tokens['attention_mask'],
            )
        )
                
    return BatchEncoding({ 'input_ids': input_ids, 'attention_mask': attention_masks })
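
The batching loop above computes each batch's start and end index by hand. As a minimal sketch of the same slicing, the boundaries can also be produced with `range` and `min`. The helper name `batch_bounds` is ours, not part of the notebook or of Aqueduct.

```python
from typing import List, Tuple


def batch_bounds(num_items: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs that cover range(num_items) in order."""
    return [
        (start, min(start + batch_size, num_items))
        for start in range(0, num_items, batch_size)
    ]


print(batch_bounds(2500, 1000))
# prints: [(0, 1000), (1000, 2000), (2000, 2500)]
print(batch_bounds(2000, 1000))
# prints: [(0, 1000), (1000, 2000)]
```

Stepping `range` by `batch_size` means there is no trailing empty batch to guard against when the number of tweets is an exact multiple of the batch size.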

We'll create an Aqueduct parameter telling us which model to use. We'll use the RoBERTa base sentiment model linked above, but if we wanted, we could swap this out in a future run.

We'll invoke the `process` function on the `tweets` dataset described above. Since we're processing a large number of tweets, this function can take around 10 minutes to complete.

In [4]:
model = client.create_param(name="model", default="cardiffnlp/twitter-roberta-base-sentiment-latest")
input_limit = client.create_param(name="input_limit", default=100)

featurized = process(tweets, model, input_limit)

Calling `.get()` on the `featurized` object will show us a preview of the tokenized features right here in our notebook:

In [5]:
featurized.get()

Operator process Logs:
stderr:
		Downloading (…)lve/main/config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]	Downloading (…)lve/main/config.json: 100%|##########| 929/929 [00:00<00:00, 828kB/s]
		Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]	Downloading (…)olve/main/vocab.json: 100%|##########| 899k/899k [00:00<00:00, 10.9MB/s]
		Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]	Downloading (…)olve/main/merges.txt: 100%|##########| 456k/456k [00:00<00:00, 6.58MB/s]
		Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]	Downloading (…)cial_tokens_map.json: 100%|##########| 239/239 [00:00<00:00, 196kB/s]

{'input_ids': tensor([[    0, 31193,  8384,  ...,     1,     1,     1],
        [    0, 31193,  8384,  ...,     1,     1,     1],
        [    0,   574, 21101,  ...,     1,     1,     1],
        ...,
        [    0,  7939,    17,  ...,     1,     1,     1],
        [    0,  5379,  6828,  ...,     1,     1,     1],
        [    0,  1922,    44,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

Next, we'll define a function that will load the RoBERTa model and make predictions on the features defined above. Again, we batch our predictions to avoid overloading memory.

Similar to the `process` function above, we use the `@op` decorator to tell Aqueduct how to run our function. The configuration here is exactly the same, except that we also ask for a GPU via the `gpu_resource_name` entry in `resources`. Based on this configuration, Aqueduct will automatically use a container with CUDA drivers installed for this function.

In [6]:
@op(
    engine='eks-us-east-2', 
    requirements=['torch', 'transformers'], 
    resources={
        'memory': '15GB',
        'gpu_resource_name': 'nvidia.com/gpu',
    }
)
def predict(features, model):
    from transformers import AutoModelForSequenceClassification
    from transformers import AutoConfig
    from transformers.tokenization_utils_base import BatchEncoding
    
    from scipy.special import softmax
    import numpy as np
    
    config = AutoConfig.from_pretrained(model)
    model = AutoModelForSequenceClassification.from_pretrained(model).to('cuda:0')
    
    batch_size = 10
    num_entries = features['input_ids'].size()[0]
    
    start = 0
    pred_batches = []
    for i in range((num_entries // batch_size) + 1):
        if start == num_entries:
            break
        
        if (i + 1) * batch_size >= num_entries:
            end = num_entries
        else:
            end = (i + 1) * batch_size
        
        batch = BatchEncoding({
            'input_ids': features['input_ids'][start:end],
            'attention_mask': features['attention_mask'][start:end]
        }).to('cuda:0')
        batch_preds = model(**batch)
        
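        # Apply softmax per row (axis=1) so each tweet's three scores sum to 1;
        # scipy's default (axis=None) would normalize over the whole batch.
        sm = softmax(batch_preds[0].to('cpu').detach().numpy(), axis=1)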
        pred_batches.append(sm)
        
        # print(f'Processed {start} to {end}')
        start += batch_size
    
    return np.concatenate(pred_batches)

In [7]:
predictions = predict(featurized, model)

When we call `.get()` on our predictions, we'll see three values for each tweet that we pass in, which correspond to the Negative, Neutral, and Positive scores, respectively.
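
Each tweet's three scores are meant to form a probability distribution, so the softmax has to be taken per tweet (per row) rather than over a whole batch at once. The sketch below shows this in plain Python. The logits are made-up numbers, not model output, and the `softmax` helper is a stand-in for the library call used in `predict`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two tweets, ordered (negative, neutral, positive).
batch_logits = [[2.0, 0.5, -1.0], [-0.5, 0.0, 3.0]]
scores = [softmax(row) for row in batch_logits]

for row in scores:
    # Each row is normalized independently, so each row sums to 1.
    print([round(s, 3) for s in row], "sum =", round(sum(row), 6))
```

With these made-up logits, the first tweet's largest score is in the Negative position and the second tweet's is in the Positive position.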

We'll postprocess our predictions to pick the maximum score for each tweet and the corresponding confidence level. 

In [8]:
@op(
    engine="eks-us-east-2",
    requirements=['numpy'],
    resources={
        'memory': '8GB'
    },
)
def process_predictions(predictions):
    import numpy as np
    labels = [-1, 0, 1] # Use numbers here for neg./neut./pos. so we can use numpy later on.
    
    results = list(map(
        lambda prediction: [labels[np.argmax(prediction)], prediction[np.argmax(prediction)]],
        predictions,
    ))
    
    return np.array(results)

In [9]:
labels = process_predictions(predictions)
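
As a rough illustration of what `process_predictions` returns, the sketch below applies the same pick-the-largest-score logic to a few made-up score rows in plain Python. The helper name `top_label` and the scores are assumptions for illustration only.

```python
LABELS = [-1, 0, 1]  # negative, neutral, positive


def top_label(scores):
    """Return [label, confidence] for the highest-scoring class."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [LABELS[best], scores[best]]


# Made-up score rows, ordered (negative, neutral, positive).
example_predictions = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
print([top_label(p) for p in example_predictions])
# prints: [[-1, 0.7], [1, 0.6], [0, 0.5]]
```

In the operator itself, the pairs are packed into a single NumPy array, so the labels come back as floats (`-1.0`, `0.0`, `1.0`) alongside the confidences.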

Finally, we'll want to know the average confidence for our three classes, so we'll create three [Aqueduct metrics](https://docs.aqueducthq.com/metrics-and-checks/metrics-measuring-your-predictions) to track the scores.
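
Each metric filters the `[label, confidence]` rows down to one class and averages the confidences. The plain-Python sketch below shows that computation on made-up rows. The helper name `avg_confidence` is hypothetical, and returning `None` for an empty class is our own choice for the sketch: with a small `input_limit`, a class may have no tweets at all, and `np.mean` over an empty list returns `nan`.

```python
from statistics import mean


def avg_confidence(rows, target):
    """Mean confidence of rows labeled `target`, or None if there are none."""
    confidences = [conf for label, conf in rows if label == target]
    return mean(confidences) if confidences else None


# Made-up [label, confidence] rows; no row is labeled neutral (0).
rows = [[-1, 0.7], [1, 0.6], [1, 0.8], [-1, 0.9]]

print(avg_confidence(rows, 1))   # average over the two positive rows
print(avg_confidence(rows, -1))  # average over the two negative rows
print(avg_confidence(rows, 0))   # prints: None
```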

In [10]:
import numpy as np

@metric(requirements=['numpy'])
def avg_positive(labels):
    return np.mean(list(map (
        lambda label: label[1],
        filter(
            lambda label: label[0] == 1,
            labels
        )
    )))

@metric(requirements=['numpy'])
def avg_negative(labels):
    return np.mean(list(map (
        lambda label: label[1],
        filter(
            lambda label: label[0] == -1,
            labels
        )
    )))

@metric(requirements=['numpy'])
def avg_neutral(labels):
    return np.mean(list(map (
        lambda label: label[1],
        filter(
            lambda label: label[0] == 0,
            labels
        )
    )))

avg_pos = avg_positive(labels)
avg_neg = avg_negative(labels)
avg_neut = avg_neutral(labels)

print(f'Average positive confidence: {avg_pos.get()}')
print(f'Average negative confidence: {avg_neg.get()}')
print(f'Average neutral confidence: {avg_neut.get()}')

And that's it! We're finished creating our workflow and are ready to publish it:

In [None]:
from textwrap import dedent

client.publish_flow(
    'RoBERTa Tweet Sentiment',
    dedent('''
    Uses the HuggingFace RoBERTa Tweet sentiment model to analyze the
    sentiment of tweets about Game of Thrones season 8. 
    '''),
    artifacts=[labels]
)