# Step 1 - Install the required dependencies and make sure the Python version is 3.10 or above

In [None]:
!pip install zenoml

In [None]:
!pip install datasets
!pip install transformers
!pip install tqdm

In [22]:
!python --version

Python 3.12.5
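
The shell command above reports the `python` found on the path, which is not always the interpreter running the notebook kernel. As a minimal sketch, the same check can be made from inside the kernel. The helper name `meets_minimum` is illustrative and not part of any library.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in this step

def meets_minimum(version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

# Warn if the interpreter running this code is older than required
if not meets_minimum(sys.version_info):
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is too old; "
          f"need {MIN_VERSION[0]}.{MIN_VERSION[1]} or above")
```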


# Step 2 - Load a dataset from Hugging Face

In [2]:
from datasets import load_dataset
import pandas as pd

ds = load_dataset("cardiffnlp/tweet_eval", "sentiment")
df = pd.DataFrame(ds['test']).head(500)
df.head(10)

Unnamed: 0,text,label
0,@user @user what do these '1/2 naked pics' hav...,1
1,OH: “I had a blue penis while I was this” [pla...,1
2,"@user @user That's coming, but I think the vic...",1
3,I think I may be finally in with the in crowd ...,2
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",0
5,Savchenko now Saakashvili took drug test live ...,1
6,How many more days until opening day? 😩,1
7,Twitter's #ThankYouObama Shows Heartfelt Grati...,2
8,All CSG and Fracking all around Australia is t...,1
9,@user @user @user @user @user @user take away ...,0


In [3]:
def label_map(x):
    if x == 0:
        return 'negative'
    elif x == 1:
        return 'neutral'
    elif x == 2:
        return 'positive'
    return x
df['label'] = df['label'].map(label_map)
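
The same mapping can be written as a dictionary lookup, which keeps the id-to-name pairs in one place. This is a sketch in plain Python; the names `ID2LABEL` and `to_label` are illustrative, and the id order is taken from `label_map` above.

```python
# Label ids and names, in the order used by label_map above
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def to_label(x):
    """Map a numeric label id to its name; unknown values pass through unchanged."""
    return ID2LABEL.get(x, x)

labels = [to_label(x) for x in [1, 1, 2, 0]]
print(labels)  # ['neutral', 'neutral', 'positive', 'negative']
```

Calling `df['label'].map(to_label)` behaves like the original function. Passing the dictionary itself to `Series.map` would instead turn unmapped values into `NaN`.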

# Step 3 - Run model inference

Warning: This step downloads two models of roughly 500 MB each.

**If you don't want to download the models, you can jump to step 4 and use the provided data in the repo instead.**

### Run inference with roberta

In [None]:
!pip3 install torch torchvision torchaudio

In [None]:
import torch
print(torch.__version__)
print(torch.cuda.is_available())  # Check if CUDA (GPU support) is enabled


In [5]:
# Use a pipeline as a high-level helper
import warnings
from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

warnings.filterwarnings("ignore")


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [6]:
import tqdm

results = []
texts = df['text'].to_list()

## Depending on your machine, this should take around 1 minute
for text in tqdm.tqdm(texts):
    results.append(pipe(text))

100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [00:37<00:00, 13.22it/s]


In [7]:
df['roberta'] = [r[0]['label'] for r in results]
df['roberta_score'] = [r[0]['score'] for r in results]
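
The loop and the two list comprehensions above assume that the pipeline returns, for each text, a list whose first element is a dictionary with `label` and `score` keys. The sketch below wraps that pattern in a helper. `collect_predictions` is a hypothetical name, and `fake_pipe` is a stand-in used only so the example runs without downloading a model.

```python
def collect_predictions(pipe, texts):
    """Run `pipe` on each text and return parallel lists of labels and scores.

    Assumes pipe(text) returns a list whose first element is a dict
    with 'label' and 'score' keys, as in the cells above.
    """
    labels, scores = [], []
    for text in texts:
        top = pipe(text)[0]
        labels.append(top["label"])
        scores.append(top["score"])
    return labels, scores

def fake_pipe(text):
    """Stand-in for a real pipeline (for illustration only)."""
    label = "positive" if "love" in text.lower() else "neutral"
    return [{"label": label, "score": 0.9}]

demo_labels, demo_scores = collect_predictions(fake_pipe, ["I love this", "Opening day?"])
print(demo_labels)
```

With the real pipeline, `df['roberta'], df['roberta_score'] = collect_predictions(pipe, texts)` would fill both columns in one call.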

### Run inference with gpt2

In [8]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="LYTinn/finetuning-sentiment-model-tweet-gpt2")

Device set to use cpu


In [9]:
import tqdm

results = []
texts = df['text'].to_list()

## Depending on your machine, this should take around 1 minute
for text in tqdm.tqdm(texts):
    results.append(pipe(text))

100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [00:32<00:00, 15.53it/s]


In [10]:
df['gpt2'] = [r[0]['label'] for r in results]
df['gpt2_score'] = [r[0]['score'] for r in results]

## map labels back
def label_map(x):
    if x == 'LABEL_0':
        return 'negative'
    elif x == 'LABEL_1':
        return 'neutral'
    elif x == 'LABEL_2':
        return 'positive'
    return x
df['gpt2'] = df['gpt2'].map(label_map)

# Step 4 - Pre-process data and add additional columns

In [11]:
## If you skipped the model inference, run the line below to load the provided data.
## If you ran the inference, comment it out so the predictions computed above are not overwritten.

df = pd.read_csv('tweets.csv')

In [12]:
df["input_length"] = df["text"].str.len()
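
`input_length` is one example of metadata that can be sliced on later. Other simple, text-derived values can be computed in the same spirit. The sketch below uses plain Python; the function and feature names are hypothetical and chosen only for illustration.

```python
import re

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")

def text_features(text):
    """Compute a few illustrative metadata values for one tweet."""
    return {
        "input_length": len(text),                   # number of characters
        "num_hashtags": len(HASHTAG.findall(text)),  # e.g. #openingday
        "num_mentions": len(MENTION.findall(text)),  # e.g. @user
        "has_question": "?" in text,
    }

features = text_features("@user How many more days until #openingday?")
print(features)
```

To attach one of these values to the DataFrame, something like `df["num_hashtags"] = df["text"].map(lambda t: text_features(t)["num_hashtags"])` should work.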

# Step 5 - Start Zeno for interactive slicing

In this step, you need to create 5 slices in the Zeno interface and derive meaningful insights.

As a starting point, try to create the two slices we provide:

1. Tweets with hashtags
2. Tweets with strong positive words (e.g., love) -- you can determine the exact words

Creating slices in Zeno is straightforward: just click the '+' button for 'create a new slice', and you can define the slice using existing column attributes, with simple value matching or even regular expressions.

![image.png](images/image.png)
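
Before typing a regular expression into the slice dialog, it can help to try it on a few tweets in Python. This is a sketch under assumptions: the word list is only an example, and whether the Zeno dialog accepts exactly the same syntax as Python's `re` module should be verified in the interface.

```python
import re

# Example word list; choose your own "strong positive" words
POSITIVE_WORDS = ["love", "great", "amazing"]

# \b word boundaries keep "love" from matching inside "glove"
positive_pattern = re.compile(r"\b(" + "|".join(POSITIVE_WORDS) + r")\b", re.IGNORECASE)
hashtag_pattern = re.compile(r"#\w+")

samples = [
    "I LOVE opening day #baseball",
    "Lost my glove again",
    "How many more days until opening day?",
]

positive_hits = [bool(positive_pattern.search(s)) for s in samples]
hashtag_hits = [bool(hashtag_pattern.search(s)) for s in samples]
print(positive_hits, hashtag_hits)
```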

There are more fun features in Zeno, including interactive metadata and model comparison. Feel free to check the teaser video in the [README](https://github.com/zeno-ml/zeno) of the Zeno repository.

In [36]:
from zeno import zeno

from zeno.api import model, distill, metric
from zeno.api import ModelReturn, MetricReturn, DistillReturn, ZenoOptions

@model
def load_model(model_name):
    
    def pred(df, ops: ZenoOptions):
        out = df[model_name]
        return ModelReturn(model_output=out)

    return pred

@distill
def label_match(df, ops: ZenoOptions):
    results = (df[ops.label_column] == df[ops.output_column]).to_list()
    return DistillReturn(distill_output=results)

@metric
def accuracy(df, ops: ZenoOptions):
    avg = df[ops.distill_columns["label_match"]].mean()
    return MetricReturn(metric=avg)

zeno({
    "metadata": df,  # Pandas DataFrame with a row for each instance
    "view": "text-classification",  # The type of view for this data/task
    "data_column": "text",
    "id_column": "text",
    "label_column": "label",
    "functions": [load_model, label_match, accuracy],
    "models": ["roberta", "gpt2"],
    "port": 8231
})


After running the code above, you should be able to access Zeno at http://localhost:8231
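
The `label_match` and `accuracy` functions above amount to comparing the label column with a model's output column and averaging the matches. The plain-Python sketch below reproduces that calculation, which can be handy for sanity-checking the numbers Zeno reports. The helper name `slice_accuracy` is hypothetical and the rows are made-up toy data.

```python
def slice_accuracy(rows, model):
    """Fraction of rows where the model's prediction equals the label."""
    matches = [row["label"] == row[model] for row in rows]
    return sum(matches) / len(matches)

# Toy rows (made up for illustration, not taken from the dataset)
rows = [
    {"label": "positive", "roberta": "positive", "gpt2": "neutral"},
    {"label": "neutral", "roberta": "neutral", "gpt2": "neutral"},
    {"label": "negative", "roberta": "neutral", "gpt2": "negative"},
    {"label": "neutral", "roberta": "neutral", "gpt2": "positive"},
]

for model_name in ["roberta", "gpt2"]:
    print(model_name, slice_accuracy(rows, model_name))
```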


After successfully creating the two slices, come up with three *additional* slices you want to check and **create** the slices in the Zeno interface.

There are two directions to identify useful slices:
- Top-down: Think about what kinds of things the model can struggle with, and come up with some slices.
- Bottom-up: Look at model (mis-)predictions, come up with hypotheses, and translate them into data slices.
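
For the bottom-up direction, one possible starting point is to count which kinds of errors a model makes most often and then read the tweets behind the largest counts. This is a sketch with made-up rows; `confusion_pairs` is a hypothetical helper, not part of Zeno.

```python
from collections import Counter

def confusion_pairs(rows, model):
    """Count (true label, predicted label) pairs among the model's errors."""
    return Counter(
        (row["label"], row[model]) for row in rows if row["label"] != row[model]
    )

# Toy rows (made up for illustration, not taken from the dataset)
rows = [
    {"text": "Oh great, another Monday!", "label": "negative", "roberta": "positive"},
    {"text": "Love this weather", "label": "positive", "roberta": "positive"},
    {"text": "Wow, so fast...", "label": "negative", "roberta": "positive"},
    {"text": "Is it open today?", "label": "neutral", "roberta": "negative"},
]

errors = confusion_pairs(rows, "roberta")
print(errors.most_common())
```

In this toy data, negative tweets predicted as positive dominate, which would suggest a hypothesis such as "the model misreads sarcasm" to turn into a slice.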

3. Sarcastic Tweets
4. Tweets with Emotion-Based Sentiment
5. Misinformation Tweets

In [34]:
## Write down descriptions of additional slices you created

custom_slice_descriptions = [
    "Tweets with Hashtags: This slice includes all tweets that contain a `#` symbol, indicating they are discussing trending topics or are part of a broader conversation.",
    
    "Tweets with Strong Positive Words: This slice includes tweets containing words such as 'love', 'great', or 'amazing', which indicate strong positive sentiment. It helps analyze how well the models recognize strongly positive expressions.",
    
    "Tweets with Strong Negative Words: This slice contains tweets with words like 'hate', 'terrible', or 'worst', which express strong negative sentiment. It evaluates whether the models can correctly classify highly negative content.",
    
    "Tweets with Mentions: This slice consists of tweets that contain '@' mentions, meaning they are direct replies or tagged conversations. These tweets may have different sentiment distributions compared to general tweets.",
    
    "Tweets Containing Questions: This slice identifies tweets that include a `?`, which indicates a question. Questions may have neutral sentiment but can imply curiosity, doubt, or concern.",
    
    "High Disagreement Tweets: This slice captures tweets where RoBERTa and GPT-2 predictions differ, helping analyze cases where models struggle to classify sentiment consistently."
]

# Step 6 - Write down three additional data slices you want to create but do not have the metadata for slicing

In the previous step, you might have already come up with some slices you wanted to create but found it hard to do with the existing metadata. Write down three such slices in this step.

Example: 
- I want to create a slice on tweets using slang
- I want to create a slice on non-English tweets (if any)

In [35]:
## Write down three additional data slices here:

additional_slice_descriptions = [
    "Sarcastic Tweets: This slice would include tweets that contain sarcasm. Sarcasm often carries a different sentiment than the literal meaning of the words used, making it difficult for models to classify correctly. However, we lack explicit metadata for detecting sarcasm without an additional NLP model or dataset.",
    
    "Tweets with Emotion-Based Sentiment: This slice would categorize tweets based on emotions like joy, anger, sadness, or fear instead of simple positive/negative/neutral sentiment. It requires an emotion detection model or additional labeled data, which is not available in the current dataset.",
    
    "Misinformation Tweets: This slice would contain tweets that potentially spread false or misleading information. Identifying misinformation requires external fact-checking metadata, which is not available in the dataset."
]

# Step 7 - Generate more test cases with Large Language Models

Select one slice from the three you wrote down and generate **10 test cases** using LLMs. These can include average, boundary, or difficult cases.

Your input can be in the following format:

> Examples:
> - OH: “I had a blue penis while I was this” [playing with Google Earth VR]
> - @user @user That’s coming, but I think the victims are going to be Medicaid recipients.
> - I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user
> 
> Generate more tweets using slang.

The first part, **Examples**, conditions the LLM on the style, length, and content of the examples. The second part, **Instructions**, tells the LLM what kind of examples you want it to generate.
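
The two-part format can also be assembled programmatically, which makes it easy to swap in different examples or instructions. This is a minimal sketch; `build_prompt` is a hypothetical helper, and the example tweets are taken from this notebook.

```python
def build_prompt(examples, instruction):
    """Assemble a prompt with an Examples part followed by an instruction."""
    lines = ["Examples:"]
    lines += [f"- {example}" for example in examples]
    lines += ["", instruction]  # blank line separates the two parts
    return "\n".join(lines)

examples = [
    "How many more days until opening day? 😩",
    "I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user",
]

prompt = build_prompt(examples, "Generate 10 more tweets that use sarcasm.")
print(prompt)
```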

Use our provided GPT to start the task: [llm-based-test-case-generator](https://chatgpt.com/g/g-982cylVn2-llm-based-test-case-generator). If you do not have access to GPTs, use plain ChatGPT or another LLM provider you have access to instead.

In [33]:
## Write down the slice you select

slice_description = "Sarcastic Tweets: This slice would include tweets that contain sarcasm. Sarcasm often carries a different sentiment than the literal meaning of the words used, making it difficult for models to classify correctly. However, we lack explicit metadata for detecting sarcasm without an additional NLP model or dataset."


## Write down all generated test cases here

generated_test_cases = [
    # Average Cases (Common sarcastic expressions)
    "Oh great, another Monday! Just what I needed to make my life complete.",
    "Wow, my internet is so fast today... I love waiting five minutes for a page to load.",
    "Oh sure, because standing in a long line is exactly how I wanted to spend my afternoon.",
    "Amazing, I just dropped my phone again. Good thing it's indestructible! (Not.)",

    # Boundary Cases (Edge cases in length, ambiguity, or exaggeration)
    "Yeah, because nothing screams ‘great day’ like stepping on a LEGO barefoot.",
    "Oh fantastic, my package was ‘delivered’—just not to my house.",
    "Love when my alarm goes off before the sun even decides to wake up.",
    
    # Difficult Cases (Subtle or structurally challenging sarcasm)
    "I really enjoy getting emails about sales after I just bought the product at full price.",
    "Yes, I totally love it when my headphones get tangled in under five seconds.",
    "Oh, sure, I'd love to hear more about your diet while I eat my pizza."
]
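
LLM output sometimes contains duplicates, empty items, or the wrong number of items, so a quick check before using the generated cases can be worthwhile. This is a sketch under assumptions: `check_test_cases` is a hypothetical helper, and the 280-character limit is an assumed default that can be changed.

```python
def check_test_cases(cases, expected_count=10, max_length=280):
    """Return a list of problems found in the generated test cases."""
    problems = []
    if len(cases) != expected_count:
        problems.append(f"expected {expected_count} cases, got {len(cases)}")
    if len(set(cases)) != len(cases):
        problems.append("duplicate cases found")
    for case in cases:
        if not case.strip():
            problems.append("empty case found")
        elif len(case) > max_length:  # max_length is an assumed limit
            problems.append(f"case longer than {max_length} characters: {case[:30]}...")
    return problems

# Deliberately flawed demo input: too few items, one duplicate, one empty string
demo_cases = ["Oh great, another Monday!", "Oh great, another Monday!", ""]
print(check_test_cases(demo_cases))
```

Calling `check_test_cases(generated_test_cases)` on the list above should return an empty list if all ten cases are distinct, non-empty, and within the length limit.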