# Overview

In this notebook, we try to replace the LLM call in the preprocessing guardrail above with a lightweight NLP model that detects whether the user's input belongs to an allowed topic. We will use three techniques to tackle the few-shot text classification problem for the chatbot intent classification scenario:

* SetFit
* FastFit
* Semantic Router
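
To make the goal concrete, the guardrail decision can be sketched as a membership test on the predicted intent. This is a minimal illustration, not the notebook's implementation: `ALLOWED_INTENTS`, `passes_guardrail`, and `keyword_classifier` are hypothetical names, and the keyword rule is only a stand-in for a trained classifier.

```python
from typing import Callable, Set

# Hypothetical allow-list; the labels are intents from the dataset used below.
ALLOWED_INTENTS: Set[str] = {"alarm_query", "alarm_set", "calendar_set"}


def passes_guardrail(
    text: str,
    classify: Callable[[str], str],
    allowed: Set[str] = ALLOWED_INTENTS,
) -> bool:
    """Return True when the predicted intent is on the allow-list."""
    return classify(text) in allowed


def keyword_classifier(text: str) -> str:
    """Stand-in classifier for illustration only; a trained model would go here."""
    return "alarm_query" if "alarm" in text.lower() else "general_quirky"


print(passes_guardrail("please list all my alarms", keyword_classifier))
print(passes_guardrail("tell me a joke", keyword_classifier))
```

Passing the classifier in as a callable keeps the guardrail independent of which of the three techniques produces the label.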

In [1]:
!pip install -q -U transformers==4.39.3
!pip install -q -U datasets==2.18.0

!pip install -q -U setfit==1.0.3

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["DATASET"]="SetFit/amazon_massive_intent_en-US"
os.environ["FITMODEL"]="st-mpnet-v2-amazon-mi"

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Loading dataset

In [3]:
from datasets import Dataset, load_dataset

ds=load_dataset(os.getenv("DATASET"))


## Pre-processing

In [4]:
import pandas as pd

df=pd.DataFrame(ds["test"])

# Helper function to select random rows
def select_random_rows(group):
    return group.sample(n=10, random_state=42)


# keep only classes with more than 30 rows
label_counts=df["label_text"].value_counts()
label_counts=label_counts[label_counts>30]

# restrict df to those classes only
df=df[df["label_text"].isin(label_counts.index)].reset_index(drop=True)

assert set(df["label_text"].value_counts().index.to_list())==set(label_counts.index.to_list()), "Some labels were lost"

# select 10 random rows per unique value in the label_text column
train_df=df.groupby("label_text", group_keys=False).apply(select_random_rows)

# create eval dataframe by dropping the train data and selecting random rows
eval_df=(df.drop(train_df.index).groupby("label_text",group_keys=False).apply(select_random_rows))

# create test dataframe by dropping both train and eval data
test_df=df.drop(train_df.index.to_list()+eval_df.index.to_list())

# reset the index
cols_to_keep=["text", "label_text"]
train_df=train_df[cols_to_keep].reset_index(drop=True)
eval_df=eval_df[cols_to_keep].reset_index(drop=True)
test_df=test_df[cols_to_keep].reset_index(drop=True)

# save the file
test_df.to_pickle("test_df.pkl")
train_df.to_pickle("train_df.pkl")
eval_df.to_pickle("eval_df.pkl")

train_ds=Dataset.from_pandas(train_df)
eval_ds=Dataset.from_pandas(eval_df)
test_ds=Dataset.from_pandas(test_df)

print(train_df.shape, eval_df.shape, test_df.shape)
train_df.head()

(350, 2) (350, 2) (1879, 2)




|   | text | label_text |
|---|------|------------|
| 0 | do i have any alarms set for six am tomorrow | alarm_query |
| 1 | what is the wake up time for my alarm i have s... | alarm_query |
| 2 | please tell me what alarms are on | alarm_query |
| 3 | please list all my alarms | alarm_query |
| 4 | what times do my alarms go off | alarm_query |
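
The split logic above (10 rows per label for training, 10 more for evaluation, the rest for testing) can be sketched without pandas to make the disjointness of the three sets explicit. The function name `split_few_shot` and the toy data are assumptions for illustration. The shapes printed above are consistent with this scheme: 350 rows at 10 per label implies 35 labels.

```python
import random
from collections import defaultdict


def split_few_shot(rows, n_per_class=10, min_rows=30, seed=42):
    """Split (text, label) rows into train/eval/test index lists.

    Labels with min_rows rows or fewer are dropped; each remaining label
    contributes n_per_class rows to train and to eval, the rest to test.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(rows):
        by_label[label].append(idx)

    train_idx, eval_idx, test_idx = [], [], []
    for idxs in by_label.values():
        if len(idxs) <= min_rows:
            continue  # mirrors keeping only labels with more than 30 rows
        rng.shuffle(idxs)
        train_idx += idxs[:n_per_class]
        eval_idx += idxs[n_per_class:2 * n_per_class]
        test_idx += idxs[2 * n_per_class:]
    return train_idx, eval_idx, test_idx


# Toy data: 3 hypothetical intents with 40 utterances each.
rows = [(f"utterance {i}", f"intent_{i % 3}") for i in range(120)]
train_idx, eval_idx, test_idx = split_few_shot(rows)
print(len(train_idx), len(eval_idx), len(test_idx))
```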


# SetFit

SetFit is an efficient, prompt-free framework for few-shot fine-tuning of Sentence Transformers. For more detail, see the links in the Classification section. Here we train the model with SetFit.
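
SetFit first fine-tunes the sentence embedding model contrastively on pairs built from the few labeled examples, then fits a classification head on the resulting embeddings. The pair-building idea can be sketched as follows. This is a simplified illustration, not the library's sampler: `make_contrastive_pairs` is a hypothetical helper, and the two `alarm_set` utterances are made up.

```python
from itertools import combinations


def make_contrastive_pairs(examples):
    """Build (text_a, text_b, target) triples from (text, label) examples.

    The target is 1.0 when both texts share a label and 0.0 otherwise.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs


examples = [
    ("please list all my alarms", "alarm_query"),
    ("what times do my alarms go off", "alarm_query"),
    ("wake me up at six", "alarm_set"),       # made-up utterance
    ("set an alarm for noon", "alarm_set"),   # made-up utterance
]
pairs = make_contrastive_pairs(examples)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
print(len(positives), len(negatives))
```

Because the number of pairs grows quadratically with the number of examples, a handful of labeled rows per class already yields a sizeable training set for the embedding model.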

In [5]:
from setfit import SetFitModel, Trainer, TrainingArguments

model=SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    labels=train_df["label_text"].unique().tolist()
)

# setup training
args=TrainingArguments(
    batch_size=16,
    num_epochs=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    report_to="wandb",
    run_name=os.getenv('WANDB_NAME'),
    load_best_model_at_end=True
)

trainer=Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    metric="accuracy",
    # column_mapping is {"custom dataset column name": "column name SetFit expects (text or label)"}
    column_mapping={
        "text":"text",
        "label_text": "label"
    }
)

trainer.train()




model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset

***** Running training *****
  Num unique pairs = 119000
  Batch size = 16
  Num epochs = 1
  Total optimization steps = 7438


| Epoch | Training Loss | Validation Loss | Embedding Loss | Rate |
|-------|---------------|-----------------|----------------|------|
| 1 | No log | No log | 0.0755 | 0.0 |



Loading best SentenceTransformer model from step 7438.
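
The figures in the training log (`Num unique pairs = 119000`, `Total optimization steps = 7438`) can be reproduced with a short calculation. This is one reading that matches the log, under the assumption that the smaller set of pairs (positives) is oversampled to the size of the larger set (negatives); it has not been checked against the SetFit source.

```python
import math

n_classes, n_per_class, batch_size = 35, 10, 16
n_examples = n_classes * n_per_class                    # 350 training rows

all_pairs = math.comb(n_examples, 2)                    # every unordered pair
positive_pairs = n_classes * math.comb(n_per_class, 2)  # same-label pairs
negative_pairs = all_pairs - positive_pairs             # different-label pairs

# Assumption: the smaller set is oversampled to match the larger one.
unique_pairs = 2 * max(positive_pairs, negative_pairs)
steps = math.ceil(unique_pairs / batch_size)
print(unique_pairs, steps)
```

With 59,500 negative pairs against 1,575 positive ones, almost all of the 7,438 steps come from balancing the two sets, which explains why a single epoch on 350 rows takes so long.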



In [6]:
metrics=trainer.evaluate(test_ds)
print("Accuracy: {:.2f}%".format(metrics["accuracy"]*100))

Applying column mapping to the evaluation dataset
***** Running evaluation *****

Accuracy: 77.43%


In [8]:
trainer.push_to_hub('aisuko/'+os.getenv("FITMODEL"))


CommitInfo(commit_url='https://huggingface.co/aisuko/st-mpnet-v2-amazon-mi/commit/a839fef8394625689f915497ca560404b1b9f88f', commit_message='Add SetFit model', commit_description='', oid='a839fef8394625689f915497ca560404b1b9f88f', pr_url=None, pr_revision=None, pr_num=None)

# SetFit for Few Shot Text Intent Classification Model Inference

Here we run predictions on a list of texts to measure class-level precision, recall, and F1 score.
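
For reference, the class-level metrics can be computed by hand from true and predicted labels. The sketch below is a dependency-free illustration of the definitions; `per_class_metrics` is a hypothetical helper and the four labels are toy data, not results from this notebook.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1)
    return report


y_true = ["alarm_set", "alarm_set", "alarm_query", "alarm_query"]
y_pred = ["alarm_set", "alarm_query", "alarm_query", "alarm_query"]
for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"{label:12s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```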

In [9]:
model=SetFitModel.from_pretrained('aisuko/'+os.getenv("FITMODEL"))


In [10]:
from sklearn.metrics import classification_report

test_df["predictions"]=model.predict(test_df["text"].values)

print("Class Level Metrics")
print(classification_report(test_df["label_text"], test_df["predictions"]))


Class Level Metrics
                          precision    recall  f1-score   support

             alarm_query       0.92      0.79      0.85        14
               alarm_set       0.80      0.95      0.87        21
       audio_volume_mute       0.83      0.83      0.83        12
          calendar_query       0.58      0.61      0.59       106
         calendar_remove       0.80      0.94      0.86        47
            calendar_set       0.83      0.89      0.86       189
          cooking_recipe       0.86      0.94      0.90        52
          datetime_query       0.89      0.84      0.86        68
             email_query       0.93      0.90      0.91        99
         email_sendemail       0.87      0.96      0.91        94
          general_quirky       0.35      0.11      0.17       149
              iot_coffee       0.83      0.94      0.88        16
     iot_hue_lightchange       0.74      0.88      0.80        16
        iot_hue_lightoff       0.88      0.91      0.89



In [15]:
# %%timeit
inference_test=test_df.sample(1)
preds=model.predict(inference_test['text'].values)
preds


array(['general_quirky'], dtype='<U24')
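
Since the model is meant to replace an LLM call, single-request latency is worth measuring, which is presumably what the commented-out `%%timeit` was for. Outside a notebook, a rough equivalent could look like the sketch below. `time_predict` and `dummy_predict` are hypothetical; in practice `model.predict` would be passed in place of the dummy.

```python
import statistics
import time


def time_predict(predict, texts, repeats=20):
    """Return the median wall-clock time in seconds of predict(texts)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(texts)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def dummy_predict(texts):
    """Placeholder for model.predict; returns a fixed label per text."""
    return ["general_quirky" for _ in texts]


median_s = time_predict(dummy_predict, ["what times do my alarms go off"])
print(f"median latency: {median_s * 1000:.3f} ms")
```

The median is used rather than the mean so that a slow first call, for example one that includes warm-up, does not dominate the figure.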

# Intent classification
* https://medium.com/towards-artificial-intelligence/few-shot-nlp-intent-classification-d29bf85548aa


# Classification
* https://github.com/IBM/fastfit
* https://github.com/aurelio-labs/semantic-router/tree/main
* https://github.com/huggingface/setfit