# ✂️ Snorkel Intro Tutorial: _Data Slicing_

In real-world applications, some model outcomes are often more important than others — e.g. vulnerable cyclist detections in an autonomous driving task, or, in our running **spam** application, potentially malicious link redirects to external websites.

Traditional machine learning systems optimize for overall quality, which may be too coarse-grained.
Models that achieve high overall performance might still produce unacceptable failure rates on critical *slices* of the data, that is, application-critical data subsets like the ones described above.

In this tutorial, we:
1. **Introduce _Slicing Functions (SFs)_** as a programming interface
2. **Monitor** application-critical data subsets
3. **Improve model performance** on slices

First, we'll set up our notebook for reproducibility and proper logging.

In [1]:
import logging
import os
import numpy as np
import random
import torch

# For reproducibility
os.environ["PYTHONHASHSEED"] = "0"
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# To visualize logs
logger = logging.getLogger()
logger.setLevel(logging.WARNING)

If you want to display all comment text untruncated, change `DISPLAY_ALL_TEXT` to `True` below.

In [2]:
import pandas as pd


DISPLAY_ALL_TEXT = False

pd.set_option("display.max_colwidth", 0 if DISPLAY_ALL_TEXT else 50)

_Note:_ this tutorial differs from the labeling tutorial in that we use ground truth labels in the train split for demo purposes.
SFs are intended to be used *after the training set has already been labeled* by LFs (or by hand) in the training data pipeline.

In [3]:
from utils import load_spam_dataset

df_train, df_test = load_spam_dataset(load_train_labels=True)



## 1. Write slicing functions

We leverage *slicing functions* (SFs), which output binary _masks_ indicating whether a data point is in the slice or not.
Each slice represents some noisily-defined subset of the data (corresponding to an SF) that we'd like to programmatically monitor.

In the following cells, we use the [`@slicing_function()`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/slicing/snorkel.slicing.slicing_function.html#snorkel.slicing.slicing_function) decorator to initialize an SF that identifies short comments.

You'll notice that the `short_comment` SF is a heuristic, like the other programmatic ops we've defined, and may not fully cover the slice of interest.
That's okay: in the last section, we'll show how a model can handle this in Snorkel.

In [4]:
import re
from snorkel.slicing import slicing_function


@slicing_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return len(x.text.split()) < 5


sfs = [short_comment]
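
To make the "binary mask" idea concrete, here is a minimal sketch in plain Python (no Snorkel) that applies the same length heuristic to a few made-up comments. The `toy_comments` list and the `apply_predicate` helper are illustrative assumptions, not part of the Snorkel API.

```python
from types import SimpleNamespace

# Hypothetical toy comments; each object mimics a DataFrame row with a `.text` field.
toy_comments = [
    SimpleNamespace(text="cool video!"),
    SimpleNamespace(text="please check out my channel for free gift cards"),
    SimpleNamespace(text="Nice"),
]


def is_short(x):
    """Same heuristic as `short_comment`: fewer than 5 whitespace-separated tokens."""
    return len(x.text.split()) < 5


def apply_predicate(predicate, rows):
    """Return a 0/1 mask with one entry per row (1 = row is in the slice)."""
    return [int(predicate(row)) for row in rows]


mask = apply_predicate(is_short, toy_comments)
print(mask)  # [1, 0, 1]
```

A slicing function is therefore just a boolean predicate over a single data point; the applier is what turns it into a mask over the whole dataset.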

### Visualize slices

With a utility function, [`slice_dataframe`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/slicing/snorkel.slicing.slice_dataframe.html#snorkel.slicing.slice_dataframe), we can visualize data points belonging to this slice in a `pandas.DataFrame`.

In [5]:
from snorkel.slicing import slice_dataframe

short_comment_df = slice_dataframe(df_test, short_comment)

100%|██████████| 250/250 [00:00<00:00, 49801.76it/s]


In [6]:
short_comment_df[["text", "label"]].head()

| | text | label |
|---|---|---|
| 194 | super music | 0 |
| 2 | I like shakira.. | 0 |
| 110 | subscribe to my feed | 1 |
| 263 | Awesome | 0 |
| 77 | Nice | 0 |


## 2. Monitor slice performance with [`Scorer.score_slices`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/analysis/snorkel.analysis.Scorer.html#snorkel.analysis.Scorer.score_slices)

In this section, we'll demonstrate how we might monitor slice performance on the `short_comment` slice — this approach is compatible with _any modeling framework_.

### Train a simple classifier
First, we featurize the data — as you saw in the introductory Spam tutorial, we can extract simple bag-of-words features and store them as numpy arrays.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from utils import df_to_features

vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train, Y_train = df_to_features(vectorizer, df_train, "train")
X_test, Y_test = df_to_features(vectorizer, df_test, "test")
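
As a rough picture of what bag-of-words featurization produces, the sketch below counts unigrams by hand. It is a simplified stand-in that assumes whitespace tokenization and lowercasing; `CountVectorizer` uses its own tokenizer, and the `df_to_features` helper from `utils` is not reproduced here. The mini-corpus is made up.

```python
from collections import Counter

# Hypothetical mini-corpus for illustration only.
train_texts = ["check out my channel", "cool video", "check this video out"]
test_texts = ["cool channel", "unseen words here"]


def tokenize(text):
    return text.lower().split()


# Build the vocabulary from the training texts only, in sorted order.
vocab = sorted({tok for text in train_texts for tok in tokenize(text)})


def featurize(text):
    """Map a text to a count vector over the training vocabulary."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]


X_train_toy = [featurize(t) for t in train_texts]
X_test_toy = [featurize(t) for t in test_texts]  # out-of-vocabulary tokens are dropped
```

Fitting the vocabulary on the training split only, and reusing it for the test split, mirrors the usual fit/transform pattern.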

We define a `LogisticRegression` model from `sklearn`.

In [8]:
from sklearn.linear_model import LogisticRegression

sklearn_model = LogisticRegression(C=0.001, solver="liblinear")
sklearn_model.fit(X=np.array(X_train), y=Y_train)

In [9]:
from snorkel.utils import preds_to_probs

preds_test = sklearn_model.predict(np.array(X_test))
probs_test = preds_to_probs(preds_test, 2)

In [10]:
from sklearn.metrics import f1_score

print(f"Test set F1: {100 * f1_score(Y_test, preds_test):.1f}%")

Test set F1: 92.5%


### Store slice metadata in `S`

We apply our list of `sfs` to the data using an SF applier.
For our data format, we leverage the [`PandasSFApplier`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/slicing/snorkel.slicing.PandasSFApplier.html#snorkel.slicing.PandasSFApplier).
The output of the `applier` is an [`np.recarray`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.recarray.html) which stores vectors in named fields indicating whether each of $n$ data points belongs to the corresponding slice.

In [11]:
from snorkel.slicing import PandasSFApplier

applier = PandasSFApplier(sfs)
S_test = applier.apply(df_test)

100%|██████████| 250/250 [00:00<00:00, 64199.84it/s]
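
For readers unfamiliar with record arrays, the following sketch builds a small one by hand with NumPy. The membership values are invented, and this only illustrates the data structure; it is not how `PandasSFApplier` constructs its output internally.

```python
import numpy as np

# Hypothetical slice memberships for four data points.
short_mask = np.array([1, 0, 1, 0])
please_mask = np.array([0, 0, 1, 1])

# One named field per slice, one entry per data point.
S_toy = np.rec.fromarrays(
    [short_mask, please_mask], names=["short_comment", "keyword_please"]
)

print(S_toy.dtype.names)       # field names, one per slice
print(S_toy["short_comment"])  # membership vector for a single slice

# Fields can also be read as attributes, which makes combining slices easy.
in_both = (S_toy.short_comment == 1) & (S_toy.keyword_please == 1)
```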


Now, we initialize a [`Scorer`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/analysis/snorkel.analysis.Scorer.html#snorkel.analysis.Scorer) using the desired `metrics`.

In [12]:
from snorkel.analysis import Scorer

scorer = Scorer(metrics=["f1"])

Using the [`score_slices`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/analysis/snorkel.analysis.Scorer.html#snorkel.analysis.Scorer.score_slices) method, we can see both `overall` and slice-specific performance.

In [13]:
scorer.score_slices(
    S=S_test, golds=Y_test, preds=preds_test, probs=probs_test, as_dataframe=True
)

| | f1 |
|---|---|
| overall | 0.925 |
| short_comment | 0.666667 |


Despite high overall performance, the `short_comment` slice performs poorly here!
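
The gap between overall and per-slice scores is easy to reproduce by hand. The sketch below is a plain-Python approximation of per-slice scoring, assuming binary labels with `1` as the positive class; the labels, predictions, and mask are made up, and `f1_by_slice` is a hypothetical helper rather than the `Scorer` implementation.

```python
def f1(golds, preds, positive=1):
    """Binary F1 for the positive class; returns 0.0 when there are no true positives."""
    tp = sum(g == positive and p == positive for g, p in zip(golds, preds))
    fp = sum(g != positive and p == positive for g, p in zip(golds, preds))
    fn = sum(g == positive and p != positive for g, p in zip(golds, preds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_slice(masks, golds, preds):
    """Compute F1 overall and restricted to the members of each slice."""
    scores = {"overall": f1(golds, preds)}
    for name, mask in masks.items():
        idx = [i for i, m in enumerate(mask) if m]
        scores[name] = f1([golds[i] for i in idx], [preds[i] for i in idx])
    return scores


# Hypothetical labels and predictions for six data points.
golds = [1, 1, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 0, 1]
masks = {"short_comment": [0, 0, 0, 0, 1, 1]}
scores = f1_by_slice(masks, golds, preds)
```

In this toy case both errors fall inside the slice, so the slice score collapses while the overall score stays moderate.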

### Write additional slicing functions (SFs)

Slices are dynamic — as monitoring needs grow or change with new data distributions or application needs, an ML pipeline might require dozens, or even hundreds, of slices.

We'll take inspiration from the labeling tutorial to write additional slicing functions.
We demonstrate how the same powerful preprocessors and utilities available for labeling functions can be leveraged for slicing functions.

In [14]:
from snorkel.slicing import SlicingFunction, slicing_function
from snorkel.preprocess import preprocessor


# Keyword-based SFs
def keyword_lookup(x, keywords):
    return any(word in x.text.lower() for word in keywords)


def make_keyword_sf(keywords):
    return SlicingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords),
    )


keyword_please = make_keyword_sf(keywords=["please", "plz"])


# Regex-based SFs
@slicing_function()
def regex_check_out(x):
    return bool(re.search(r"check.*out", x.text, flags=re.I))


@slicing_function()
def short_link(x):
    """Returns whether text matches common pattern for shortened ".ly" links."""
    return bool(re.search(r"\w+\.ly", x.text))


# Leverage preprocessor in SF
from textblob import TextBlob


@preprocessor(memoize=True)
def textblob_sentiment(x):
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    return x


@slicing_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    return x.polarity > 0.9
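
Because SFs are heuristics, it can help to sanity-check their predicates on a few hand-written strings before applying them to a dataset. The sketch below re-implements the two regex checks as plain functions; the example strings and expected values are assumptions chosen for illustration.

```python
import re


def matches_check_out(text):
    return bool(re.search(r"check.*out", text, flags=re.I))


def matches_short_link(text):
    return bool(re.search(r"\w+\.ly", text))


# Hypothetical examples: (text, expected check_out, expected short_link)
examples = [
    ("Check this OUT", True, False),
    ("my video: bit.ly/abc123", False, True),
    ("checkout my page", True, False),
    ("great song", False, False),
]

for text, want_check_out, want_short_link in examples:
    assert matches_check_out(text) == want_check_out, text
    assert matches_short_link(text) == want_short_link, text
```

Note that `check.*out` also matches "checkout" with nothing in between, which may or may not be the intended behavior for a given slice.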

Again, we'd like to visualize data points in a particular slice. This time, we'll inspect the `textblob_polarity` slice.

Most data points with high-polarity sentiments are strong opinions about the video — hence, they are usually relevant to the video, and the corresponding labels are $0$ (not spam).
We might define a slice here for *product and marketing reasons*: it's important to make sure that we don't misclassify very positive comments from good users.

In [15]:
polarity_df = slice_dataframe(df_test, textblob_polarity)

100%|██████████| 250/250 [00:00<00:00, 3389.83it/s]


In [16]:
polarity_df[["text", "label"]].head()

| | text | label |
|---|---|---|
| 263 | Awesome | 0 |
| 240 | Shakira is the best dancer | 0 |
| 261 | OMG LISTEN TO THIS ITS SOO GOOD!! :D | 0 |
| 14 | Shakira is very beautiful | 0 |
| 114 | awesome | 0 |


We can evaluate performance on _all SFs_ using the model-agnostic [`Scorer`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/analysis/snorkel.analysis.Scorer.html#snorkel-analysis-scorer).

In [17]:
extra_sfs = [keyword_please, regex_check_out, short_link, textblob_polarity]

sfs = [short_comment] + extra_sfs
slice_names = [sf.name for sf in sfs]

Let's see how the `sklearn` model we learned before performs on these new slices.

In [18]:
applier = PandasSFApplier(sfs)
S_test = applier.apply(df_test)

100%|██████████| 250/250 [00:00<00:00, 19059.47it/s]


In [19]:
scorer.score_slices(
    S=S_test, golds=Y_test, preds=preds_test, probs=probs_test, as_dataframe=True
)

| | f1 |
|---|---|
| overall | 0.925 |
| short_comment | 0.666667 |
| keyword_please | 1.0 |
| regex_check_out | 1.0 |
| short_link | 0.5 |
| textblob_polarity | 0.727273 |


Looks like some slices do extremely well on our small test set (e.g. `keyword_please`), while others (e.g. `short_link`) lag well behind.
At the very least, we may want to monitor these to make sure that as we iterate to improve certain slices like `short_comment`, we don't hurt the performance of others.
Next, we'll introduce a model that helps us to do this balancing act automatically.

## 3. Improve slice performance

In the following section, we demonstrate a modeling approach that we call _Slice-based Learning,_ which improves performance by adding extra slice-specific representational capacity to whichever model we're using.
Intuitively, we'd like the model to learn *representations that are better suited to handle data points in this slice*.
In our approach, we model each slice as a separate "expert task" in the style of [multi-task learning](https://github.com/snorkel-team/snorkel-tutorials/blob/master/multitask/multitask_tutorial.ipynb); for further details of how slice-based learning works under the hood, check out the [code](https://github.com/snorkel-team/snorkel/blob/master/snorkel/slicing/utils.py) (with paper coming soon)!

In other approaches, one might attempt to increase slice performance with techniques like _oversampling_ (e.g. with PyTorch's [`WeightedRandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.WeightedRandomSampler)), effectively shifting the training distribution towards certain populations.

This might work with a small number of slices, but with hundreds or thousands of production slices at scale, it could quickly become intractable to tune upsampling weights per slice.
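
To see where the tuning burden comes from, consider a minimal sketch of slice-based oversampling in plain Python. The slice memberships and the `upweight` factors are invented for illustration, and `sampling_weights` is a hypothetical helper.

```python
import random

# Hypothetical slice memberships for six training points.
masks = {
    "short_comment": [1, 0, 0, 1, 0, 0],
    "short_link": [0, 0, 1, 0, 0, 0],
}
# One hand-tuned multiplier per slice: this is the part that scales poorly.
upweight = {"short_comment": 2.0, "short_link": 5.0}


def sampling_weights(masks, upweight, n):
    """Start from uniform weights and multiply by the factor of every slice a point is in."""
    weights = [1.0] * n
    for name, mask in masks.items():
        for i, member in enumerate(mask):
            if member:
                weights[i] *= upweight[name]
    return weights


weights = sampling_weights(masks, upweight, n=6)
rng = random.Random(0)
resampled = rng.choices(range(6), weights=weights, k=1000)  # sample with replacement
```

A list like `weights` is the kind of input one could hand to `WeightedRandomSampler`. Every new slice adds another factor to tune, and overlapping slices interact multiplicatively in this scheme.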

### Constructing a [`SliceAwareClassifier`](https://snorkel.readthedocs.io/en/v0.9.3/packages/_autosummary/slicing/snorkel.slicing.SliceAwareClassifier.html)


To cope with scale, we will attempt to learn and combine many slice-specific representations with an attention mechanism.
(Please see [Section 3 of our technical report](https://arxiv.org/abs/1909.06349) for details on this approach.)

First we'll initialize a [`SliceAwareClassifier`](https://snorkel.readthedocs.io/en/v0.9.3/packages/_autosummary/slicing/snorkel.slicing.SliceAwareClassifier.html):
* `base_architecture`: We define a simple Multi-Layer Perceptron (MLP) in PyTorch to serve as the primary representation architecture. Note that the `SliceAwareClassifier` is **agnostic to the base architecture**: you might leverage a Transformer model for text, or a ResNet for images.
* `head_dim`: The final output feature dimension of the `base_architecture`.
* `slice_names`: The slices that we plan to train on with this classifier.
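
The general idea of combining slice-specific representations with attention can be sketched in a few lines. This is a deliberately simplified illustration and not the `SliceAwareClassifier` implementation: the vectors, the scores, and the `combine` helper are all assumptions, and the technical report describes the actual formulation.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(representations, membership_scores):
    """Attention-weighted average of per-slice representation vectors."""
    attention = softmax(membership_scores)
    dim = len(representations[0])
    return [
        sum(a * rep[d] for a, rep in zip(attention, representations))
        for d in range(dim)
    ]


# Hypothetical 2-d representations from a base "expert" and two slice experts.
representations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical scores: higher means the model believes the point is in that slice.
membership_scores = [0.0, 4.0, -4.0]
combined = combine(representations, membership_scores)
```

Here the second expert dominates the combined vector because its membership score is much higher than the others.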

In [20]:
from snorkel.slicing import SliceAwareClassifier
from utils import get_pytorch_mlp

# Define model architecture
bow_dim = X_train.shape[1]
hidden_dim = bow_dim
mlp = get_pytorch_mlp(hidden_dim=hidden_dim, num_layers=2)

# Initialize slice model
slice_model = SliceAwareClassifier(
    base_architecture=mlp,
    head_dim=hidden_dim,
    slice_names=[sf.name for sf in sfs],
    scorer=scorer,
)

Next, we'll generate the remaining `S` matrices with the new set of slicing functions.

In [21]:
applier = PandasSFApplier(sfs)
S_train = applier.apply(df_train)
S_test = applier.apply(df_test)

100%|██████████| 1586/1586 [00:00<00:00, 5134.88it/s]
100%|██████████| 250/250 [00:00<00:00, 40491.81it/s]


In order to train using slice information, we'd like to initialize a **slice-aware dataloader**.
To do this, we can use [`slice_model.make_slice_dataloader`](https://snorkel.readthedocs.io/en/v0.9.3/packages/_autosummary/slicing/snorkel.slicing.SliceAwareClassifier.html#snorkel.slicing.SliceAwareClassifier.predict) to add slice labels to an existing dataloader.

Under the hood, this method leverages slice metadata to add slice labels to the appropriate fields such that it will be compatible with our model, a [`SliceAwareClassifier`](https://snorkel.readthedocs.io/en/v0.9.3/packages/_autosummary/slicing/snorkel.slicing.SliceAwareClassifier.html#snorkel-slicing-slicingclassifier).

In [22]:
from utils import create_dict_dataloader

BATCH_SIZE = 64

train_dl = create_dict_dataloader(X_train, Y_train, "train")
train_dl_slice = slice_model.make_slice_dataloader(
    train_dl.dataset, S_train, shuffle=True, batch_size=BATCH_SIZE
)
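# NOTE: the split is named "train" here even though this is test data,
# which is why the score table for this dataloader reports split "train".
test_dl = create_dict_dataloader(X_test, Y_test, "train")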
test_dl_slice = slice_model.make_slice_dataloader(
    test_dl.dataset, S_test, shuffle=False, batch_size=BATCH_SIZE
)
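
Conceptually, "adding slice labels" means deriving extra label sets from the task labels and the slice masks. The sketch below shows one plausible scheme in plain Python. The `_pred` naming follows the label names reported by `score_slices`, while the `_ind` suffix, the `-1` masking value, and the `add_slice_labels_sketch` helper are assumptions for illustration rather than a description of Snorkel internals.

```python
ABSENT = -1  # assumed marker for "not in this slice, ignore in the loss"


def add_slice_labels_sketch(task_labels, masks, task_name="task"):
    """Build per-slice indicator and predictor label lists from slice masks."""
    labels = {task_name: list(task_labels)}
    for slice_name, mask in masks.items():
        # Indicator head: is the data point in the slice?
        labels[f"{task_name}_slice:{slice_name}_ind"] = [int(m) for m in mask]
        # Predictor head: the task label, kept only for slice members.
        labels[f"{task_name}_slice:{slice_name}_pred"] = [
            y if m else ABSENT for y, m in zip(task_labels, mask)
        ]
    return labels


# Hypothetical labels (1 = spam, 0 = ham) and one slice mask.
toy_labels = add_slice_labels_sketch([1, 0, 0, 1], {"short_comment": [0, 1, 1, 0]})
```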

### Representation learning with slices

Using Snorkel's [`Trainer`](https://snorkel.readthedocs.io/en/master/packages/_autosummary/classification/snorkel.classification.Trainer.html), we fit our classifier with the training set dataloader.

In [23]:
from snorkel.classification import Trainer

# For demonstration purposes, we set n_epochs=2
trainer = Trainer(n_epochs=2, lr=1e-4, progress_bar=True)
trainer.fit(slice_model, [train_dl_slice])

Epoch 0:: 100%|██████████| 25/25 [00:04<00:00,  5.83it/s, model/all/train/loss=0.508, model/all/train/lr=0.0001]
Epoch 1:: 100%|██████████| 25/25 [00:04<00:00,  6.19it/s, model/all/train/loss=0.257, model/all/train/lr=0.0001]


At inference time, the primary task head (`spam_task`) will make all final predictions.
We'd like to evaluate all the slice heads on the original task head — [`score_slices`](https://snorkel.readthedocs.io/en/v0.9.3/packages/_autosummary/slicing/snorkel.slicing.SliceAwareClassifier.html#snorkel.slicing.SliceAwareClassifier.score_slices) remaps all slice-related labels, denoted `spam_task_slice:{slice_name}_pred`, to be evaluated on the `spam_task`.

In [24]:
slice_model.score_slices([test_dl_slice], as_dataframe=True)

| | label | dataset | split | metric | score |
|---|---|---|---|---|---|
| 0 | task | SnorkelDataset | train | f1 | 0.932127 |
| 1 | task_slice:short_comment_pred | SnorkelDataset | train | f1 | 0.769231 |
| 2 | task_slice:keyword_please_pred | SnorkelDataset | train | f1 | 0.977778 |
| 3 | task_slice:regex_check_out_pred | SnorkelDataset | train | f1 | 1.0 |
| 4 | task_slice:short_link_pred | SnorkelDataset | train | f1 | 0.5 |
| 5 | task_slice:textblob_polarity_pred | SnorkelDataset | train | f1 | 0.8 |
| 6 | task_slice:base_pred | SnorkelDataset | train | f1 | 0.932127 |


In [25]:
trainer.config

TrainerConfig(seed=None, n_epochs=2, lr=0.0001, l2=0.0, grad_clip=1.0, train_split='train', valid_split='valid', test_split='test', progress_bar=True, model_config=ClassifierConfig(device=0, dataparallel=True), log_manager_config=LogManagerConfig(counter_unit='epochs', evaluation_freq=1.0), checkpointing=False, checkpointer_config=CheckpointerConfig(checkpoint_dir='checkpoints', checkpoint_factor=1, checkpoint_metric='model/all/train/loss:min', checkpoint_task_metrics=None, checkpoint_runway=0, checkpoint_clear=True), logging=False, log_writer='tensorboard', log_writer_config=LogWriterConfig(log_dir='logs', run_name=None), optimizer='adam', optimizer_config=OptimizerConfig(sgd_config=SGDOptimizerConfig(momentum=0.9), adam_config=AdamOptimizerConfig(amsgrad=False, betas=(0.9, 0.999)), adamax_config=AdamaxOptimizerConfig(betas=(0.9, 0.999), eps=1e-08)), lr_scheduler='constant', lr_scheduler_config=LRSchedulerConfig(warmup_steps=0, warmup_unit='batches', warmup_percentage=0.0, min_lr=0.0,

*Note: in this toy dataset, we see high variance in slice performance, because our dataset is so small that (i) there are few data points in the train split, giving little signal to learn over, and (ii) there are few data points in the test split, making our evaluation metrics very noisy.
For a demonstration of data slicing deployed in state-of-the-art models, please see our [SuperGLUE](https://github.com/HazyResearch/snorkel-superglue/tree/master/tutorials) tutorials.*

---
## Recap

This tutorial walked through the process of authoring slices, monitoring model performance on specific slices, and improving model performance using slice information.
This programming abstraction provides a mechanism to heuristically identify critical data subsets.
For more technical details about _Slice-based Learning,_ please see our [NeurIPS 2019 paper](https://arxiv.org/abs/1909.06349)!

# New Modifications

### Lab 3: Spam Data slicing

In [26]:
import pandas as pd
import re
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

SPAM = 1
HAM = 0
ABSTAIN = -1


### Updated Labeling Functions

In [27]:

@labeling_function()
def lf_contains_link(x):
    """Detect links (clickbait or suspicious URLs)"""
    return SPAM if re.search(r"http|www|bit\.ly|tinyurl|goo\.gl", x.text.lower()) else ABSTAIN

@labeling_function()
def lf_money_or_crypto(x):
    """Detect money, prize, or crypto-related scams"""
    keywords = ["free", "win", "offer", "cash", "prize", "crypto", "bitcoin", "investment", "profit"]
    return SPAM if any(k in x.text.lower() for k in keywords) else ABSTAIN

@labeling_function()
def lf_phone_email(x):
    """Detect phone numbers or emails"""
    pattern = r"(\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|[\w\.-]+@[\w\.-]+)"
    return SPAM if re.search(pattern, x.text) else ABSTAIN

@labeling_function()
def lf_emoji(x):
    """Detect spammy or attention-grabbing emojis"""
    emojis = ["😜", "😂", "🔥", "❤️", "💰", "💸"]
    return SPAM if any(e in x.text for e in emojis) else ABSTAIN

@labeling_function()
def lf_excessive_exclamations(x):
    """Detect excessive punctuation or shouting"""
    return SPAM if re.search(r"!{3,}|[A-Z]{5,}", x.text) else ABSTAIN

@labeling_function()
def lf_short_message(x):
    """Non-spam signal: very short messages"""
    return HAM if len(x.text.split()) < 3 else ABSTAIN

# Combine all labeling functions
lfs = [
    lf_contains_link,
    lf_money_or_crypto,
    lf_phone_email,
    lf_emoji,
    lf_excessive_exclamations,
    lf_short_message
]


### Applying Labeling Functions

In [28]:
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

print("✅ Labeling function outputs (first 10 rows):")
print(L_train[:10])


100%|██████████| 1586/1586 [00:00<00:00, 18928.96it/s]

✅ Labeling function outputs (first 10 rows):
[[ 1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1]
 [ 1 -1 -1 -1 -1  0]
 [-1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1]
 [-1 -1 -1 -1 -1 -1]]
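
Most entries in the rows printed above are `-1` (ABSTAIN), so it is worth quantifying how often each LF actually votes. The sketch below computes coverage by hand on those ten rows only; it says nothing about the rest of `L_train`. For the full matrix, Snorkel's `LFAnalysis` summary is the usual tool.

```python
ABSTAIN = -1

# The first ten rows of the label matrix printed above (one column per LF).
L_head = [
    [1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1],
    [1, -1, -1, -1, -1, 0],
    [-1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1, -1],
]

n_rows, n_lfs = len(L_head), len(L_head[0])
# Fraction of rows on which each LF votes.
lf_coverage = [
    sum(row[j] != ABSTAIN for row in L_head) / n_rows for j in range(n_lfs)
]
# Fraction of rows with at least one vote from any LF.
overall_coverage = sum(any(v != ABSTAIN for v in row) for row in L_head) / n_rows
```

On these ten rows only two data points receive any vote, which is a hint that many training points may end up without a usable label.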





### Training LabelModel

In [29]:
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=42)

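# Generate hard labels. `predict` returns -1 (ABSTAIN) on ties under its default
# tie-break policy, which can include rows where every LF abstains.
# Use `label_model.predict_proba` for probabilistic labels.
df_train["label_snorkel"] = label_model.predict(L=L_train)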

100%|██████████| 500/500 [00:00<00:00, 6916.11epoch/s]


### Training and comparing models

In [30]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(df_train.text)
y_train = df_train.label_snorkel

# Logistic Regression
clf_lr = LogisticRegression(max_iter=1000)
clf_lr.fit(X_train, y_train)

# Linear SVC
clf_svc = LinearSVC()
clf_svc.fit(X_train, y_train)

# Evaluate both models
X_test = vectorizer.transform(df_test.text)
y_test = df_test.label.values

y_pred_lr = clf_lr.predict(X_test)
y_pred_svc = clf_svc.predict(X_test)

print("\n📊 Model Comparison:")
print(f"Logistic Regression Test Accuracy: {accuracy_score(y_test, y_pred_lr):.3f}")
print(f"LinearSVC Test Accuracy: {accuracy_score(y_test, y_pred_svc):.3f}")


📊 Model Comparison:
Logistic Regression Test Accuracy: 0.120
LinearSVC Test Accuracy: 0.164
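
Both accuracies are far below 50% on a binary task. One plausible cause, stated here as an assumption rather than something verified against this run, is that rows labeled `-1` (ABSTAIN) by the label model were kept in the training set, so the classifiers learned a third class that never appears in the test labels. The sketch below shows the filtering step on made-up data; with a DataFrame, a boolean mask on `label_snorkel` or Snorkel's `filter_unlabeled_dataframe` utility would serve the same purpose.

```python
ABSTAIN = -1

# Hypothetical texts with label-model outputs; -1 means no usable vote.
texts = ["subscribe at bit.ly/x", "nice", "love this song so much", "win free cash now"]
label_snorkel = [1, 0, -1, 1]

# Keep only the rows that received a real label (0 or 1).
kept = [(t, y) for t, y in zip(texts, label_snorkel) if y != ABSTAIN]
train_texts = [t for t, _ in kept]
train_labels = [y for _, y in kept]

print(f"kept {len(kept)} of {len(texts)} rows; classes = {sorted(set(train_labels))}")
```

After filtering, the training labels contain only the two classes that the test set uses.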




### Slice-Based Evaluation

In [31]:

def evaluate_slice(name, mask):
    """Utility for evaluating model performance per slice"""
    slice_df = df_test[mask]
    if len(slice_df) > 0:
        acc = accuracy_score(slice_df.label, clf_lr.predict(vectorizer.transform(slice_df.text)))
        print(f"{name} slice accuracy: {acc:.3f} ({len(slice_df)} samples)")
    else:
        print(f"{name} slice has 0 samples. Skipping.")

print("\n🔍 Slice-Specific Evaluation:")
evaluate_slice("Short messages (<3 words)", df_test.text.str.split().apply(len) < 3)
evaluate_slice("Emoji spam", df_test.text.str.contains("😜|😂|❤️|🔥|💰"))
evaluate_slice("Money/Crypto scams", df_test.text.str.contains("free|win|cash|prize|crypto|bitcoin|offer"))
evaluate_slice("Clickbait links", df_test.text.str.contains(r"http|bit\.ly|tinyurl|goo\.gl"))
evaluate_slice("Phone/Email scams", df_test.text.str.contains(r"\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|@"))



🔍 Slice-Specific Evaluation:
Short messages (<3 words) slice accuracy: 0.000 (45 samples)
Emoji spam slice has 0 samples. Skipping.
Money/Crypto scams slice accuracy: 0.571 (7 samples)
Clickbait links slice accuracy: 0.857 (7 samples)
Phone/Email scams slice accuracy: 0.500 (2 samples)
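
With slices this small, the reported accuracies carry wide uncertainty. As a rough guide, the sketch below computes an approximate 95% Wilson score interval for a slice accuracy. The choice of the Wilson interval and the `wilson_interval` helper are assumptions made for illustration; other interval estimates would work too.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    if total == 0:
        raise ValueError("empty slice")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# 6 of 7 correct corresponds to the 0.857 reported for the "Clickbait links" slice.
low, high = wilson_interval(correct=6, total=7)
print(f"accuracy 6/7: 95% interval roughly [{low:.2f}, {high:.2f}]")
```

An interval this wide means that differences between slices of seven or fewer samples should be read with caution.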
