<a href="https://colab.research.google.com/github/TurkuNLP/textual-data-analysis-course/blob/main/text_classification_zero_and_few_shot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zero- and few-shot text classification

Experiments with zero- and few-shot text classification using a large generative language model.

---

## Setup

Install the required Python package

In [1]:
!pip install --quiet transformers

Import the `transformers` library

In [2]:
import transformers

---

## Load model

We'll load a variant of the [Open Pretrained Transformer](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) (OPT).

Models ranging from 125M parameters (roughly BERT-base size) to 66B parameters are currently available from the [Hugging Face repository](https://huggingface.co/models?sort=downloads&search=facebook%2Fopt).


In [3]:
MODEL = 'facebook/opt-1.3b'

model = transformers.AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL)

---

## Generation

We'll use `model.generate()` in a simple function that tokenizes and vectorizes a text prompt, generates up to a given maximum number of new tokens, decodes the output back into text, and returns a string with the prompt marked with `**`:

In [4]:
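def generate(prompt, max_new_tokens=10):
    # Tokenize the prompt into PyTorch tensors (input IDs and attention mask)
    inputs = tokenizer(prompt, return_tensors='pt')
    # Generate a continuation of at most max_new_tokens tokens
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the token IDs (prompt plus continuation) back into text
    decoded = tokenizer.batch_decode(
        outputs,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )
    output_text = decoded[0]
    # Keep only the text that was generated after the prompt
    generated = output_text[len(prompt):]
    return f'**{prompt}**{generated}'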

This is a general generative model, so we can prompt it with any text we like.

In [5]:
print(generate('Hi OPT, how are you doing today?'))

**Hi OPT, how are you doing today?**
I'm doing great! How are you?


---

## Zero-shot experiment

In a zero-shot setting, the prompt is a natural language formulation of the task:

In [6]:
print(generate('''
Is this review positive or negative?
Review: "This movie sucks!"
Answer:'''
))

**
Is this review positive or negative?
Review: "This movie sucks!"
Answer:** "This movie sucks!"

Is this review


Perhaps unsurprisingly, the model does not answer as hoped. Unlike ChatGPT, for example, this model was not pre-trained with dialogue data but simply as a standard language model. (The only way the model could have learned question-answer pairs is if they happened to appear in its training data.)

---

## One-shot experiment

In a one-shot setting, the prompt includes a single example of a correct question-answer pair before the question we're actually interested in.

In a sense, here `{ 'text': 'This movie sucks!', 'label': 'Negative' }` is our one and only "training" example.
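Prompts of this shape can also be assembled from a list of labeled examples rather than typed out by hand. The following is a minimal sketch, not part of the original notebook: `build_prompt` and its parameters are hypothetical names introduced here, and the layout (a leading newline, with a blank line between question blocks) is assumed to match the hand-written prompts in this notebook.

```python
def build_prompt(examples, query,
                 question='Is this review positive or negative?'):
    """Assemble a zero-, one- or few-shot prompt (hypothetical helper).

    examples: list of dicts with 'text' and 'label' keys
    query:    the review text we want the model to classify
    """
    blocks = []
    # One complete question-answer block per labeled example
    for example in examples:
        blocks.append(
            f'{question}\n'
            f'Review: "{example["text"]}"\n'
            f'Answer: {example["label"]}\n'
        )
    # The final block leaves the answer open for the model to complete
    blocks.append(
        f'{question}\n'
        f'Review: "{query}"\n'
        f'Answer:'
    )
    # Leading newline mirrors the triple-quoted prompts used in this notebook
    return '\n' + '\n'.join(blocks)


one_shot_prompt = build_prompt(
    [{'text': 'This movie sucks!', 'label': 'Negative'}],
    'This movie is great!'
)
```

With an empty `examples` list the same function yields a zero-shot prompt, and with several examples a few-shot one. The cell below writes the one-shot prompt out by hand instead.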

In [7]:
print(generate('''
Is this review positive or negative?
Review: "This movie sucks!"
Answer: Negative

Is this review positive or negative?
Review: "This movie is great!"
Answer:'''
))

**
Is this review positive or negative?
Review: "This movie sucks!"
Answer: Negative

Is this review positive or negative?
Review: "This movie is great!"
Answer:** Positive

Is this review positive or negative?


That actually _kind_ of worked! The first word after our prompt is `"Positive"`, which is the correct answer.

Of course, the model doesn't know to stop there and instead continues with what it predicts is the most likely continuation, namely the start of a third question. (Not an unreasonable assumption.)
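
Since the answer we care about is on the first line of the continuation, one simple workaround is to cut the output off there. The sketch below is an assumption-laden illustration rather than part of the original notebook: `extract_answer` is a hypothetical helper, and it assumes the output string has the `**prompt**continuation` form returned by `generate()` above.

```python
def extract_answer(output, prompt):
    """Return the first non-empty generated line (hypothetical helper).

    Assumes `output` has the form '**' + prompt + '**' + continuation,
    as returned by generate() in this notebook.
    """
    marked_prompt = f'**{prompt}**'
    if not output.startswith(marked_prompt):
        raise ValueError('output does not start with the marked prompt')
    # Drop the marked prompt, keeping only the generated continuation
    continuation = output[len(marked_prompt):]
    # The answer is the first line that contains any non-whitespace text
    for line in continuation.splitlines():
        if line.strip():
            return line.strip()
    return ''
```

For instance, `extract_answer(generate(prompt), prompt)` would keep only the text up to the first line break, discarding the start of the third question. This does not check that the result is actually one of the expected labels.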