# Zero-shot with LLM, SetFit and Argilla

This tutorial covers text classification using zero-shot LLM labeling through the Together API. For comparison, we also train a traditional machine learning model with TF-IDF vectorization and logistic regression. The idea is to have the LLM pre-label our dataset and then use Argilla to check whether the model provided good labels. There are variations on this approach, including BART models (multi-label zero-shot).

**Using OpenAI-style API via the Together API:**

1. **Model Configuration**: A system prompt is defined, outlining the task of categorizing news articles into predefined categories: "World", "Sports", "Business", and "Sci/Tech". The output is required to adhere strictly to the JSON format.

2. **Text Classification**: The provided text is classified into one of the predefined categories using an open-source LLM.

3. **Evaluation**: The classified data is evaluated by predicting categories for a sample of news articles and comparing the predictions with ground truth annotations.
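
Since the prompt demands strict JSON with one of four categories, it can help to validate each response before using it. The following is a minimal sketch using only the standard library; `parse_prediction` and `CATEGORIES` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import json

# The four categories named in the system prompt.
CATEGORIES = {"World", "Sports", "Business", "Sci/Tech"}

def parse_prediction(raw):
    """Return the parsed dict if it is valid JSON with an allowed category, else None."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    prediction = data.get("prediction")
    # Reject missing, non-string, or out-of-vocabulary predictions.
    if not isinstance(prediction, str) or prediction not in CATEGORIES:
        return None
    return data

# A well-formed response is returned as a dict.
print(parse_prediction('{"prediction": "Sports", "explanation": "Match report."}'))
# Extra chatter around the JSON makes the response invalid, so None is returned.
print(parse_prediction('Sure! Here is the JSON: {"prediction": "Sports"}'))
```

Returning `None` instead of raising keeps a long labeling run going when a single response is malformed.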

**Traditional Machine Learning Approach:**

We compare the results from the zero-shot/few-shot approach with a traditional ML/NLP approach that is trained on a very large labeled dataset.

**Comparison with Traditional Machine Learning Approach:**

Both approaches aim to classify news articles into predefined categories, but they differ in their underlying methodologies. The LLM-based approach leverages a state-of-the-art language model for text classification, while the traditional machine learning approach relies on TF-IDF vectorization and logistic regression, which requires much more labeled data for training.
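
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch of the textbook weighting `tf * ln(N / df)`. This is an illustration only: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ. The function name `tfidf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["stocks rally on oil", "oil prices fall", "team wins final"]
for doc_weights in tfidf(docs):
    print(doc_weights)
```

Terms that appear in fewer documents (such as "stocks") receive a higher weight than terms shared across documents (such as "oil"), which is what lets a linear model like logistic regression pick up category-specific vocabulary.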

In [None]:
!pip install openai -q

In [None]:
!pip install transformers==4.39.0

In [None]:
import json

In [None]:
from openai import OpenAI

In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional, Dict

In [None]:
from google.colab import userdata
TOGETHER_API_KEY = userdata.get('TOGETHER_API_KEY')

In [None]:
system_prompt = """

You are a sophisticated classification engine tasked with categorizing news articles.
Your primary function is to evaluate the core message of each article and assign it to one of the following categories: "World" for global news covering politics and similar topics,
"Sports" for news related to sports, "Business" for articles on business, economics, or finance,
and "Sci/Tech" for content focused on technology and science. Upon analyzing a text input, you will provide an explanation for the category chosen - very short.
 Your output will adhere strictly to the JSON format, specifically:
 {"prediction":"your selected prediction", "explanation":"your explanation"}.
 It is imperative that your output is VALID JSON and contains no other elements. Output it as string not markdown or code.

"""

In [None]:
text = """
Stocks Rally on Lower Oil Prices Stocks rallied in quiet trading Wednesday
as lower oil prices brought out buyers, countering a pair of government reports
that gave a mixed picture of the economy.
"""

In [None]:
class PredictionOutcome(BaseModel):
    prediction: str = Field(description="Your selected prediction")
    explanation: str = Field(description="Your explanation")

In [None]:
# Point the OpenAI client at the Together API endpoint
client = OpenAI(base_url="https://api.together.xyz/v1", api_key=TOGETHER_API_KEY)

completion = client.chat.completions.create(
  model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", # model served by Together
  response_format={"type": "json_object", "schema": PredictionOutcome.model_json_schema()},
  messages=[
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f'Classify following text: {text}'}
  ],
  temperature=0.2,
)

print(completion.choices[0].message.content)

In [None]:
json.loads(completion.choices[0].message.content.strip())

In [None]:
def classify(text):
    """Classify one news text; returns a dict with 'prediction' and 'explanation'."""
    completion = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",  # model served by Together
        response_format={"type": "json_object", "schema": PredictionOutcome.model_json_schema()},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f'Classify following text: {text}'}
        ],
        temperature=0.2,
    )
    json_response = completion.choices[0].message.content.strip()
    try:
        prediction = json.loads(json_response)
    except json.JSONDecodeError:
        # for some examples, the JSON is not correctly formatted
        return {"prediction": None, "explanation": f"Wrong JSON format: {json_response}"}
    return prediction

In [None]:
classify(text)
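
Because `classify` returns a `None` prediction when the response is not valid JSON, one option is to retry such cases a few times. The sketch below is an assumption about how this could be done, not part of the original notebook. `classify_with_retry` is a hypothetical helper, and a fake classifier stands in for the API call so the example runs offline.

```python
def classify_with_retry(classify_fn, text, max_attempts=3):
    """Call classify_fn(text) until it returns a non-None prediction or attempts run out."""
    result = {"prediction": None, "explanation": "No attempt made"}
    for _ in range(max_attempts):
        result = classify_fn(text)
        if result.get("prediction") is not None:
            break
    return result

# Stand-in for the real classify(): fails on the first call, succeeds afterwards.
calls = {"n": 0}

def flaky_classifier(text):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"prediction": None, "explanation": "Wrong JSON format"}
    return {"prediction": "Business", "explanation": "Stock market news."}

print(classify_with_retry(flaky_classifier, "Stocks rally on lower oil prices"))
```

In the notebook, `classify` itself would be passed as `classify_fn`. Each retry is an additional API call, so `max_attempts` should stay small.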

In [None]:
!pip install setfit datasets -qqq

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

In [None]:
import pandas as pd
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer

# Load the data
data_train = pd.read_parquet('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/ag_news_unlabelled.pq')

# Convert to Hugging Face dataset
dataset_news = Dataset.from_pandas(data_train.sample(30).reset_index(drop=True))

In [None]:
# Pre-label the sampled articles with our zero-shot classifier
train_ds_with_preds = dataset_news.map(lambda example: classify(example["text"]))

pd.set_option('display.max_colwidth', None)
train_ds_with_preds.to_pandas().head(15)
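
To judge the LLM labels, the predictions can be compared with ground-truth annotations. The sketch below assumes the ground truth has already been converted to category names; `agreement_report` is a hypothetical helper, and the two lists are made-up toy values, not results from this notebook.

```python
from collections import Counter

def agreement_report(predicted, gold):
    """Return overall accuracy and a Counter of (gold, predicted) pairs that disagree."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot evaluate an empty sample")
    correct = sum(p == g for p, g in zip(predicted, gold))
    errors = Counter((g, p) for p, g in zip(predicted, gold) if p != g)
    return correct / len(gold), errors

# Toy values for illustration only.
predicted = ["World", "Sports", "Business", "Business"]
gold = ["World", "Sports", "Business", "Sci/Tech"]

accuracy, errors = agreement_report(predicted, gold)
print(f"accuracy: {accuracy:.2f}")
for (gold_label, predicted_label), count in errors.most_common():
    print(f"{gold_label} labeled as {predicted_label}: {count}")
```

The error counter shows which categories the LLM confuses most often, which is useful when deciding which records to review first.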

In [None]:
from setfit import SetFitModel, Trainer, TrainingArguments
from sentence_transformers import SentenceTransformer

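# Initialize SetFitModel; fall back to CPU when no GPU is available
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SetFitModel.from_pretrained("intfloat/multilingual-e5-base").to(device)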


In [None]:
# turn the dataframe train_ds_with_preds.to_pandas() into a HF dataset for training
train_ds = Dataset.from_pandas(train_ds_with_preds.to_pandas())

In [None]:
# Load the hand-labelled AG News test split from the Hugging Face Hub
test_ds = load_dataset("ag_news", split="test")

In [None]:
from datasets import ClassLabel

In [None]:
train_ds

In [None]:
test_ds.features

In [None]:
# create a new feature "label_orig" and copy "label" into it
train_ds_with_preds = train_ds_with_preds.map(lambda example: {"label_orig": example["label"]})

In [None]:
# 'World', 'Sports', 'Business', 'Sci/Tech' correspond to labels from 0 to 3 - create a mapping and write the numerical values corresponding to "prediction" into it
label_mapping = {
    'World': 0,
    'Sports': 1,
    'Business': 2,
    'Sci/Tech': 3
}

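# Drop rows where the LLM returned no valid category (e.g. malformed JSON), then apply the mapping
train_ds_with_preds = train_ds_with_preds.filter(lambda example: example["prediction"] in label_mapping)
train_ds_with_preds = train_ds_with_preds.map(lambda example: {"label": label_mapping[example["prediction"]]})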

In [None]:
# Preparing the training arguments

args = TrainingArguments(
    batch_size=16,
    num_epochs=5,
)


# Create SetFitTrainer and train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds_with_preds,
    eval_dataset=test_ds,

)

trainer.train()

metrics = trainer.evaluate()
print(metrics)

In [None]:
# Predict and evaluate
predicted_labels = model.predict(test_ds['text'])

In [None]:
print(classification_report(test_ds['label'], predicted_labels))

In [None]:
# Load AG News dataset for logistic regression
dataset = load_dataset("ag_news", split={'train': 'train', 'test': 'test'})

# Training and test sets
train_texts = dataset['train']['text']
train_labels = dataset['train']['label']
test_texts = dataset['test']['text']
test_labels = dataset['test']['label']

# Create and train the logistic regression model
model_lg = make_pipeline(TfidfVectorizer(stop_words='english'), LogisticRegression(max_iter=1000))
model_lg.fit(train_texts, train_labels)

# Predict and evaluate
predicted_labels = model_lg.predict(test_texts)
print(classification_report(test_labels, predicted_labels))
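
For readers unfamiliar with the columns printed by `classification_report`, the following pure-Python sketch shows how per-class precision, recall, F1, and support can be computed. It is an illustration of the definitions, not scikit-learn's implementation. `per_class_scores` is a hypothetical helper and the label lists are toy values.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: {precision, recall, f1, support}} for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return scores

# Toy labels using the 0-3 encoding of World, Sports, Business, Sci/Tech.
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2]

scores = per_class_scores(y_true, y_pred)
for label, s in scores.items():
    print(label, {k: round(v, 2) for k, v in s.items()})
```

Support is the number of true examples per class, which helps when reading scores for classes that are rare in the evaluation set.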