# Step 1 - Install the required dependencies and make sure the Python version is 3.10 or above

In [1]:
import sys
print(sys.executable)


/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/bin/python


In [1]:
!pip install zenoml


In [2]:
!pip install datasets
!pip install transformers
!pip install tqdm
!pip install torch


In [4]:
!python --version

Python 3.10.12
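
The version requirement can also be checked from inside the kernel, which helps when `!python` and the notebook kernel point at different interpreters. This is a minimal sketch; the helper name `python_is_supported` is made up for illustration.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in this step


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not python_is_supported():
    # sys.version starts with the dotted version string, e.g. "3.10.12 (main, ...)"
    print(f"Python {sys.version.split()[0]} is older than 3.10; recreate the venv.")
```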


# Step 2 - Load a dataset from Hugging Face

In [3]:
from datasets import load_dataset
import pandas as pd

ds = load_dataset("cardiffnlp/tweet_eval", "sentiment")
df = pd.DataFrame(ds['test']).head(500)
df.head(5)

Unnamed: 0,text,label
0,@user @user what do these '1/2 naked pics' hav...,1
1,OH: “I had a blue penis while I was this” [pla...,1
2,"@user @user That's coming, but I think the vic...",1
3,I think I may be finally in with the in crowd ...,2
4,"@user Wow,first Hugo Chavez and now Fidel Cast...",0


In [4]:
def label_map(x):
    if x == 0:
        return 'negative'
    elif x == 1:
        return 'neutral'
    elif x == 2:
        return 'positive'
    return x
df['label'] = df['label'].map(label_map)
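
The same mapping can be written as a dictionary lookup, which keeps the id-to-name table in one place. This is only an alternative sketch of the cell above; the constant name `ID2LABEL` is an assumption, and unknown values still pass through unchanged as in the original function.

```python
# Label ids as mapped in the cell above: 0 = negative, 1 = neutral, 2 = positive.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def label_map(x):
    """Map an integer label id to its name; unknown values are returned as-is."""
    return ID2LABEL.get(x, x)


# Plain-list demo; with pandas, df['label'].map(label_map) works the same way.
labels = [label_map(x) for x in [1, 2, 0, 7]]
```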

# Step 3 - Run model inference

Warning: This step is going to download two models of ~500MB each. 

**If you don't want to download the models, you can jump to step 4 and use the provided data in the repo instead.**

### Run inference with roberta

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

In [6]:
import tqdm

results = []
texts = df['text'].to_list()

## Depending on your machine, this should take around 1 minute
for text in tqdm.tqdm(texts):
    results.append(pipe(text))

100%|█████████████████████████████████████████| 500/500 [00:08<00:00, 60.71it/s]
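
The loop above sends one tweet at a time to the pipeline. A pipeline can also be called with a list of strings (as a later cell in this notebook does with `pipe(texts)`), so inputs can be grouped into chunks. The sketch below shows only the chunking logic; `fake_classify` is a hypothetical stand-in for the pipeline so the example runs without downloading a model, and the batch size of 32 is an arbitrary choice.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(classify: Callable[[List[str]], List[dict]],
                        texts: Sequence[str],
                        batch_size: int = 32) -> List[dict]:
    """Call `classify` once per chunk and return one flat list of results."""
    results: List[dict] = []
    for chunk in batched(texts, batch_size):
        results.extend(classify(list(chunk)))
    return results


# Hypothetical stand-in for the real pipeline (assumption, for illustration only).
def fake_classify(chunk: List[str]) -> List[dict]:
    return [{"label": "neutral", "score": 1.0} for _ in chunk]


demo = classify_in_batches(fake_classify, ["a", "b", "c"], batch_size=2)
```

With this shape each result is a single dict, so the label would be read as `r['label']` rather than `r[0]['label']`.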


In [None]:
#!pip uninstall torch torchvision torchaudio


In [7]:
df['roberta'] = [r[0]['label'] for r in results]
df['roberta_score'] = [r[0]['score'] for r in results]

### Run inference with gpt2

In [8]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="LYTinn/finetuning-sentiment-model-tweet-gpt2")

Device set to use mps:0


In [9]:
import tqdm

results = []
texts = df['text'].to_list()

## Depending on your machine, this should take around 1 minute
for text in tqdm.tqdm(texts):
    results.append(pipe(text))

100%|█████████████████████████████████████████| 500/500 [00:06<00:00, 73.28it/s]


In [10]:
df['gpt2'] = [r[0]['label'] for r in results]
df['gpt2_score'] = [r[0]['score'] for r in results]

## map labels back
def label_map(x):
    if x == 'LABEL_0':
        return 'negative'
    elif x == 'LABEL_1':
        return 'neutral'
    elif x == 'LABEL_2':
        return 'positive'
    return x
df['gpt2'] = df['gpt2'].map(label_map)

# Step 4 - Pre-process the data and add additional columns

In [9]:
## If you skip the model inference, uncomment the code below and load the provided data

# df = pd.read_csv('tweets.csv')

In [11]:
df["input_length"] = df["text"].str.len()

# Step 5 - Start Zeno for interactive slicing

In this step, you need to create 5 slices in the Zeno interface and derive meaningful insights.

As a starting point, try to create the two slices we provide:

1. Tweets with hashtags
2. Tweets with strong positive words (e.g., love) -- you can determine the exact words

Creating slices in Zeno is straightforward: click the '+' button for 'create a new slice', then define the slice using existing column attributes, with simple value matching or even regular expressions.

![image.png](images/image.png)
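
Before typing a regular expression into the slice dialog, it can help to try it on a few tweets in Python. The sketch below covers the two starter slices; the positive word list is an assumption, so pick your own words.

```python
import re

# Assumed patterns; adjust the word list to whatever you consider "strong positive".
HASHTAG_RE = re.compile(r"#\w+")
POSITIVE_RE = re.compile(r"\b(love|amazing|awesome|fantastic)\b", re.IGNORECASE)


def has_hashtag(text: str) -> bool:
    """True when the tweet contains at least one #hashtag."""
    return HASHTAG_RE.search(text) is not None


def has_strong_positive_word(text: str) -> bool:
    """True when the tweet contains one of the listed words as a whole word."""
    return POSITIVE_RE.search(text) is not None


sample = [
    "I think I may be finally in with the in crowd #mannequinchallenge",
    "Love this song",
    "Traffic again... ugh",
]
flags = [(has_hashtag(t), has_strong_positive_word(t)) for t in sample]
```

The `\b` word boundaries keep "lovely" from matching "love". The same pattern can be stored as a metadata column before launching Zeno, for example with `df["text"].str.contains(r"#\w+", regex=True)`.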

There are more fun features in Zeno, including interactive metadata & model comparison -- feel free to check the teaser video in [README](https://github.com/zeno-ml/zeno) of the Zeno repository.

In [57]:
## Execute the code here to start a local Zeno server

from zeno import zeno

from zeno.api import model, distill, metric
from zeno.api import ModelReturn, MetricReturn, DistillReturn, ZenoOptions

@model
def load_model(model_name):
    
    def pred(df, ops: ZenoOptions):
        out = df[model_name]
        return ModelReturn(model_output=out)

    return pred

@distill
def label_match(df, ops: ZenoOptions):
    results = (df[ops.label_column] == df[ops.output_column]).to_list()
    return DistillReturn(distill_output=results)

@metric
def accuracy(df, ops: ZenoOptions):
    avg = df[ops.distill_columns["label_match"]].mean()
    return MetricReturn(metric=avg)

zeno({
    "metadata": df, # Pandas DataFrame with a row for each instance
    "view": "text-classification", # The type of view for this data/task
    "data_column": "text", 
    "label_column": "label",
    "functions": [load_model, label_match, accuracy],
    "models": ["roberta", "gpt2"],
    "port": 8231
})


Running predistill functions
Zeno running on http://localhost:8231


Running inference
Running postdistill functions
Done processing


ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/pyt

After running the code above, you should be able to access Zeno at http://localhost:8231


After successfully creating the two slices, come up with three *additional* slices you want to check and **create** the slices in the Zeno interface.

There are two directions to identify useful slices:
- Top-down: Think about what kinds of things the model can struggle with, and come up with some slices.
- Bottom-up: Look at model (mis-)predictions, come up with hypotheses, and translate them into data slices.
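
For the bottom-up direction, one way to start is to count which (gold, predicted) pairs occur most often among the errors and then read the tweets behind the largest pair. This is a sketch on toy labels, not output from the models above; `confusion_pairs` is a made-up helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def confusion_pairs(golds: Iterable[str],
                    preds: Iterable[str]) -> List[Tuple[Tuple[str, str], int]]:
    """Count (gold, predicted) pairs for wrong predictions, most frequent first."""
    counts = Counter((g, p) for g, p in zip(golds, preds) if g != p)
    return counts.most_common()


# Toy labels for illustration only.
golds = ["negative", "negative", "positive", "neutral", "negative"]
preds = ["neutral", "neutral", "positive", "positive", "negative"]
pairs = confusion_pairs(golds, preds)
```

On the notebook data, the inputs would be `df["label"]` and `df["roberta"]` (or `df["gpt2"]`).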

3. [YOUR CHOICE]
4. [YOUR CHOICE]
5. [YOUR CHOICE]

In [None]:
## Write down descriptions of additional slices you created

custom_slice_descriptions = [
    "",
]

# Slice Evaluation (Pre-LLM Generated Examples)

| Slice Name               | Rationale                                                                 | Observation                                                                                         | Implication                                                                                   |
|---------------------------|---------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Positive Label (Tweets)   | Hypothesis is that the model should perform strongly on clearly positive sentiment, since training data often has abundant positive examples. | Accuracy is high (~70%). Predictions were consistent, though occasionally tweets with sarcasm (“love waiting in traffic”) were misclassified. | Model is strong at detecting explicit positivity, but sarcasm or subtle positivity can cause errors. |
| Negative Tweets           | Hypothesis was that tweets with negative cues (cancel, criticism, decline) would be harder for the model. | Accuracy dropped to 0.60. Misclassifications clustered around sarcasm, political critique, and factual decline statements. | Model struggles with subtle negativity and context, defaulting to neutral or misreading factual decline as sentiment. |
| Long Tweets               | Longer tweets may introduce ambiguity or mixed sentiment, challenging the model. | Accuracy = 0.70. Misclassifications clustered around political critique, sarcasm, and factual decline statements. Clear sentiment tweets were correctly classified. | Model struggles with nuanced language in long tweets, defaulting to neutral or misinterpreting factual tone as sentiment. |
| Low Confidence Predictions | Hypothesis was that low confidence scores indicate ambiguous or difficult cases. | Accuracy = 0.55. Misclassifications clustered around emotional nuance, sarcasm, factual critique, and political framing. Correct predictions were mostly straightforward neutral or strongly negative tweets. | Confidence scores are a useful diagnostic: low confidence reliably flags examples where the model is unstable or error prone. |
| Ambiguous Tweets          | Hypothesis was that evaluative phrasing (“cozied up”) should be classified as negative. | Model predicted neutral, showing a misclassification. | The model underestimates subtle negative sentiment in political contexts. |
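
Slice accuracies like the ones in the table can be cross-checked outside the Zeno UI. The sketch below uses toy rows with the same column names as the notebook (`label`, `roberta`, `roberta_score`); the thresholds for "long" (100 characters) and "low confidence" (0.6) are assumptions, not the values used for the table.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def slice_accuracy(rows: List[Row],
                   in_slice: Callable[[Row], bool],
                   pred_key: str = "roberta") -> Optional[float]:
    """Accuracy over the rows selected by `in_slice`; None for an empty slice."""
    selected = [r for r in rows if in_slice(r)]
    if not selected:
        return None
    correct = sum(1 for r in selected if r["label"] == r[pred_key])
    return correct / len(selected)


# Toy rows for illustration only.
rows = [
    {"text": "short and clear", "label": "positive", "roberta": "positive", "roberta_score": 0.97},
    {"text": "x" * 120, "label": "negative", "roberta": "neutral", "roberta_score": 0.52},
    {"text": "y" * 130, "label": "neutral", "roberta": "neutral", "roberta_score": 0.58},
    {"text": "fine", "label": "neutral", "roberta": "neutral", "roberta_score": 0.55},
]

LONG_CHARS = 100  # assumed cut-off for a "long" tweet
LOW_CONF = 0.6    # assumed cut-off for "low confidence"

long_acc = slice_accuracy(rows, lambda r: len(r["text"]) > LONG_CHARS)
low_conf_acc = slice_accuracy(rows, lambda r: r["roberta_score"] < LOW_CONF)
```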


# Step 6 - Write down three additional data slices you want to create but do not have the metadata for

In the previous step, you might have already come up with some slices you wanted to create but found hard to build with the existing metadata. Write down three such slices in this step.

Example: 
- I want to create a slice on tweets that use slang
- I want to create a slice on non-English tweets (if any)
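
Some of these slices become possible once a new metadata column is derived from the text. As a rough sketch for the slang example, a tweet can be flagged when it contains a word from a lexicon. The lexicon below is hypothetical and deliberately tiny; a usable slice would need a curated list.

```python
import re
from typing import Set

# Hypothetical, deliberately tiny lexicon (assumption, for illustration only).
SLANG_TERMS: Set[str] = {"lol", "lit", "fam", "salty", "lowkey", "tbh"}
TOKEN_RE = re.compile(r"[a-z']+")


def uses_slang(text: str, lexicon: Set[str] = SLANG_TERMS) -> bool:
    """True when any whole word of the lower-cased text is in the lexicon."""
    tokens = TOKEN_RE.findall(text.lower())
    return any(tok in lexicon for tok in tokens)
```

Matching whole tokens avoids false hits such as "literally" for "lit". A lexicon cannot capture new or context-dependent slang, which is one reason this slice is hard to build from metadata alone.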

# Step 7 - Generate more test cases with Large Language Models

Select one slice from the three you wrote down and generate **10 test cases** using LLMs. The test cases can include average, boundary, or difficult cases.

Your input can be in the following format:

> Examples:
> - OH: “I had a blue penis while I was this” [playing with Google Earth VR]
> - @user @user That’s coming, but I think the victims are going to be Medicaid recipients.
> - I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user
> 
> Generate more tweets using slang.

The first part, **Examples**, conditions the LLM on the style, length, and content of the examples. The second part, **Instructions**, tells the LLM what kind of examples you want it to generate.
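
The two-part input can be assembled programmatically when you want to reuse the same examples with different instructions. This is a minimal sketch; `build_prompt` is a made-up helper, and the instruction text is only an example.

```python
from typing import Sequence


def build_prompt(examples: Sequence[str], instruction: str) -> str:
    """Join an 'Examples' bullet list and a final instruction into one prompt."""
    lines = ["Examples:"]
    lines.extend(f"- {ex}" for ex in examples)
    lines.append("")  # blank line between the examples and the instruction
    lines.append(instruction)
    return "\n".join(lines)


prompt = build_prompt(
    [
        "@user @user That's coming, but I think the victims are going to be Medicaid recipients.",
        "I think I may be finally in with the in crowd #mannequinchallenge  #grads2014 @user",
    ],
    "Generate 10 more tweets that use slang.",
)
```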

Use our provided GPT to start the task: [llm-based-test-case-generator](https://chatgpt.com/g/g-982cylVn2-llm-based-test-case-generator). If you do not have access to GPTs, use plain ChatGPT or another LLM provider you have access to instead.

## Write down the slice you select

slice_description = "Rationale

• Emojis often carry sentiment (😂 = positive, 😭 = negative/sad, 😡 = anger).

• Hypothesis: The model may misclassify tweets where emojis contradict or amplify the text sentiment.
"

## Write down all generated test cases here

generated_test_cases = [

1. “Best day ever 😂😂😂” → Positive
2. “I can’t believe this happened 😭” → Negative
3. “Another Monday… 😒” → Negative
4. “So proud of my team 🎉” → Positive
5. “That exam was brutal 😡” → Negative
6. “Finally finished my project 🙌” → Positive
7. “I guess it’s fine… 🤷” → Neutral
8. “Love this song ❤️” → Positive
9. “Traffic again… ugh 😩” → Negative
10. “Not sure what to think 🤔” → Neutral

]
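
LLM output in the `N. “text” → Label` shape can be split into texts and gold labels with a small parser instead of retyping each case. The sketch below assumes exactly that line shape (straight or curly quotes, `→` or `->`); lines that do not match are skipped silently, so check the counts afterwards.

```python
import re
from typing import List, Tuple

# Assumed line shape: number, dot, quoted text, arrow, one-word label.
LINE_RE = re.compile(
    r'^\s*\d+\.\s*[“"](?P<text>.+)[”"]\s*(?:→|->)\s*(?P<label>\w+)\s*$'
)


def parse_cases(raw: str) -> Tuple[List[str], List[str]]:
    """Return (texts, lower-cased labels) for every line matching LINE_RE."""
    texts: List[str] = []
    labels: List[str] = []
    for line in raw.splitlines():
        match = LINE_RE.match(line)
        if match:
            texts.append(match.group("text"))
            labels.append(match.group("label").lower())
    return texts, labels


raw = """
1. “So proud of my team” → Positive
2. “I guess it's fine...” → Neutral
"""
texts, gold_labels = parse_cases(raw)
```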

# Add LLM-Generated test cases (dataset) to Zeno

In [51]:
# Add LLM-Generated dataset to Zeno
texts = [
    "Best day ever 😂😂😂",
    "I can’t believe this happened 😭",
    "Another Monday… 😒",
    "So proud of my team 🎉",
    "That exam was brutal 😡",
    "Finally finished my project 🙌",
    "I guess it’s fine… 🤷",
    "Love this song ❤️",
    "Traffic again… ugh 😩",
    "Not sure what to think 🤔",
]

gold_labels = [
    "positive",
    "negative",
    "negative",
    "positive",
    "negative",
    "positive",
    "neutral",
    "positive",
    "negative",
    "neutral",
]

outputs = pipe(texts)
pred_labels = [o["label"] for o in outputs]

new_data = pd.DataFrame({
    "text": texts,
    "label": gold_labels,
    "prediction": pred_labels,
})
new_data["correct"] = (new_data["label"] == new_data["prediction"]).astype(int)


In [55]:
outputs = pipe(new_data["text"].tolist())
new_data["prediction"] = [o["label"] for o in outputs]
new_data["correct"] = (new_data["label"] == new_data["prediction"]).astype(int)
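
The following cells use a DataFrame named `df_extended`, whose construction does not appear in the cells shown here. One plausible way to build it is to stack the original rows and the generated rows and tag each row's origin, so the generated cases can be sliced separately in Zeno. This is an assumption, not the original code: the `10/10` progress output further down suggests the actual run may have contained only the ten generated rows. The helper name and the `source` column are made up for illustration.

```python
import pandas as pd


def extend_with_generated(df: pd.DataFrame, new_data: pd.DataFrame) -> pd.DataFrame:
    """Stack original and generated rows, tagging where each row came from."""
    original = df[["text", "label"]].assign(source="tweet_eval")
    generated = new_data[["text", "label"]].assign(source="llm_generated")
    return pd.concat([original, generated], ignore_index=True)


# Toy frames for illustration only.
df_demo = pd.DataFrame(
    {"text": ["a tweet"], "label": ["neutral"], "roberta": ["neutral"]}
)
new_demo = pd.DataFrame(
    {"text": ["Best day ever"], "label": ["positive"], "prediction": ["positive"]}
)
df_extended_demo = extend_with_generated(df_demo, new_demo)
```

Model output columns are dropped on purpose here, since the next cell recomputes `roberta` for every row.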


In [62]:
all_outputs = pipe(df_extended["text"].tolist())
df_extended["roberta"] = [o["label"] for o in all_outputs]


In [61]:
zeno({
    "metadata": df_extended,
    "view": "text-classification",
    "data_column": "text",
    "label_column": "label",
    "functions": [load_model, label_match, accuracy],
    "models": ["roberta"],  # or ["roberta", "gpt2"] if you also have that column
    "port": 8231
})



Zeno running on http://localhost:8231
Running predistill functions

Running inference


Inference on roberta: 100%|███████████████████| 10/10 [00:00<00:00, 1367.07it/s]


Running postdistill functions


postprocessing label_match on roberta: 100%|███| 10/10 [00:00<00:00, 829.88it/s]


Done processing


  filt_df.groupby([pd.cut(filt_df[str(col)], bucs)])  # type: ignore
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/lawrenceegharevba/mlip_labs/cmu-mlip-model-testing-lab/venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call_

# Slice Evaluation (Post-LLM Generated Examples)




| Slice Name              | Rationale                                                                 | Observed Behavior                                                                                   | Implication                                                                                   |
|--------------------------|---------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Positive Label Tweets    | Explicitly positive tweets, often with praise, enthusiasm, or supportive tone | Accuracy ≈ 0.95 (20 instances). Model correctly classified clear positive sentiment (e.g., praise, celebration, emoji reinforcement). Occasional misclassifications occurred with sarcasm or subtle positivity. | Model is strong at detecting explicit positivity, especially when reinforced by emojis or hashtags. Weaknesses remain in subtle or sarcastic positive phrasing. |
| Negative Tweets          | Explicitly negative tweets with strong language                          | Accuracy ≈ 0.60 (20 instances). Model correctly classified clear negativity, but struggled with sarcasm and factual decline. | Model handles explicit negativity well but misclassifies subtle or implied negative sentiment. |
| Long Tweets              | Longer tweets often contain multiple clauses, sarcasm, or mixed sentiment | Accuracy ≈ 0.60 (20 instances). Misclassifications clustered around sarcasm, factual decline, missed enthusiasm, and emoji cues. | Model struggles with nuanced or multi-clause sentiment in long tweets.                        |
| Low Confidence Predictions | Tweets flagged with low model confidence scores                         | Accuracy ≈ 0.55–0.60 (20 instances). Misclassifications clustered around sarcasm, factual decline, missed enthusiasm, and emoji cues. | Confidence scores reliably highlight unstable predictions, useful for human-in-the-loop review. |
| Ambiguous Tweets         | Tweets with unclear sentiment, sarcasm, rhetorical questions, or factual tone | Accuracy ≈ 0.70 (20 instances). Misclassifications clustered around sarcasm, factual decline, promotional tone, and ambiguous emoji use. | Model struggles with ambiguity, often defaulting to neutral or misreading tone.                |
| Slang Tweets             | Tweets with slang, informal phrasing, or culturally loaded hashtags        | Accuracy ≈ 0.71 (20 instances). Correct on slang-reinforced praise/criticism, but misclassified slang terms, promotional tone, and rhetorical critiques. | Model struggles with informal or culturally specific slang, often misinterpreting tone.        |
| Emoji Tweets             | Emojis often carry sentiment signals that may reinforce or contradict text | Accuracy = 1.00 (20 instances). Model correctly classified positive, negative, and sarcastic emoji cases. | Model distinguishes emoji sentiment in context, but more diverse emoji testing is needed.      |
| Question Tweets          | Tweets ending with or containing questions, often rhetorical or sarcastic | Accuracy ≈ 0.66 (20 instances). Correct on explicit negative rhetorical questions, but misclassified neutral/factual inquiries and emoji-laden questions. | Model struggles with distinguishing genuine inquiries from rhetorical or sarcastic sentiment.  |
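
With only 20 instances per slice, each accuracy in the table carries a wide margin of uncertainty. As a rough sketch, a Wilson score interval gives a range for the underlying accuracy; the 95% level (`z = 1.96`) is an assumed choice, and this calculation is an addition for interpretation, not part of the lab's required output.

```python
import math
from typing import Tuple


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion observed on n instances."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(0.60, 20)
```

For an accuracy of 0.60 on 20 instances the interval is roughly 0.39 to 0.78, so differences of 0.05 to 0.10 between slices in the table are within noise at this sample size.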


         