# Zero-Shot vs. Few-Shot Performance Analysis

## Helper functions

Dataset: the Large Movie Review Dataset (IMDB), a binary sentiment classification benchmark with 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data. To keep inference time manageable, the helper below evaluates on a random sample of 1,000 reviews drawn from the test split with a fixed seed.

In [1]:
from datasets import load_dataset

def load_imdb_dataset():
    test_dataset = load_dataset('stanfordnlp/imdb', split = 'test')
    small_dataset =  test_dataset.shuffle(seed=42).select(range(1000))
    return small_dataset

dataset = load_imdb_dataset()
print(dataset)


Dataset({
    features: ['text', 'label'],
    num_rows: 1000
})
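
Because `shuffle(seed=42).select(range(1000))` draws a random sample rather than a stratified one, the two classes are not guaranteed to be equally represented, and the binary precision, recall, and F1 used later are sensitive to that balance. A minimal sketch of a balance check follows. `label_balance` is a hypothetical helper, and the label list is made up for illustration; in the notebook one would pass `dataset['label']` instead.

```python
from collections import Counter

def label_balance(labels):
    """Return the share of each label value in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Made-up labels for illustration only; in the notebook: label_balance(dataset['label'])
example_labels = [1, 0, 0, 1, 1, 0, 1, 1]
print(label_balance(example_labels))
```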


In [2]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def calculate_metrics(predictions, labels):
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='binary')
    precision = precision_score(labels, predictions, average='binary')
    recall = recall_score(labels, predictions, average='binary')
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [3]:
# Zero-shot prompt template
def zero_shot_prompt(text):
    return [
        {
            "role": "user",
            "content": f"""Classify the sentiment of the movie review enclosed in delimiters.
            
Respond with only one word: 'positive' or 'negative'.
Review: '''{text}'''
"""
        }
    ]

In [4]:
# Few-shot prompt template (with 4 examples)
def few_shot_prompt(text):
    return [
        {
            "role": "user",
            "content": f"""Analyze the sentiment of these movie reviews enclosed in delimiters. Respond with 'positive' or 'negative' only.

Review: This movie was fantastic! The acting was superb and the plot kept me engaged throughout.
Sentiment: positive

Review: I hated this film. The story made no sense and the characters were poorly developed.
Sentiment: negative

Review: An average movie with some good moments but overall nothing special.
Sentiment: negative

Review: One of the best films I've seen this year. Highly recommended!
Sentiment: positive

Review: '''{text}'''
Sentiment:"""
        }
    ]
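
The template above hard-codes its four demonstrations, and only the target review is wrapped in `'''` delimiters. As a sketch of one alternative (not the prompt used for the results in this notebook), the demonstrations could be kept in a list and formatted uniformly. `build_few_shot_prompt` and `DEMONSTRATIONS` are hypothetical names, and only two of the four demonstrations are reproduced here to keep the example short.

```python
# Two of the demonstrations from the template above, as (review, label) pairs
DEMONSTRATIONS = [
    ("This movie was fantastic! The acting was superb and the plot kept me engaged throughout.", "positive"),
    ("I hated this film. The story made no sense and the characters were poorly developed.", "negative"),
]

def build_few_shot_prompt(text, demonstrations=DEMONSTRATIONS):
    """Build a single-turn chat message with every review wrapped in ''' delimiters."""
    header = ("Analyze the sentiment of these movie reviews enclosed in delimiters. "
              "Respond with 'positive' or 'negative' only.")
    blocks = [f"Review: '''{review}'''\nSentiment: {label}" for review, label in demonstrations]
    # The target review comes last, with the label left open for the model to fill in
    blocks.append(f"Review: '''{text}'''\nSentiment:")
    content = header + "\n\n" + "\n\n".join(blocks)
    return [{"role": "user", "content": content}]

messages = build_few_shot_prompt("A slow start, but the ending was worth it.")
print(messages[0]["content"])
```

Keeping the demonstrations in a list also makes it straightforward to vary their number or order when comparing prompts.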

In [5]:
# Generate model predictions for Phi-3-mini
def get_model_response_phi(model, tokenizer, messages, device):
    inputs = tokenizer.apply_chat_template(
        messages,
        return_tensors="pt",
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True
    ).to(device)

    outputs = model.generate(**inputs, max_new_tokens=10, use_cache=False)
    
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    
    return response.strip().lower()

In [6]:
import re
# Generate model predictions for TinyLlama
def get_model_response_tinyL(model, tokenizer, prompt, device):
    inputs = tokenizer.apply_chat_template(
        prompt,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
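    # Keep only the last word of the reply; an empty reply yields '' instead of an IndexError.
    words = response.split()
    return words[-1].strip().strip("'\"").lower() if words else ''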
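
The evaluation cells later in the notebook repeat the same pattern: build a prompt, query the model, map the reply to 1, 0, or "unparseable", and drop the unparseable cases. Because dropped replies shrink the evaluated set, it helps to track how many reviews were actually scored. The sketch below is one way to factor this out. `parse_sentiment` and `collect_predictions` are hypothetical helpers, `fake_respond` stands in for a real model call, and the parser also strips trailing punctuation, which the notebook's own loops do not.

```python
def parse_sentiment(response):
    """Map a model reply to 1 (positive), 0 (negative), or None (unparseable)."""
    cleaned = response.strip().strip("'\".").lower()
    if cleaned == 'positive':
        return 1
    if cleaned == 'negative':
        return 0
    return None

def collect_predictions(examples, prompt_fn, respond_fn):
    """Return (predictions, labels, coverage) computed over the parseable replies."""
    predictions, labels = [], []
    for example in examples:
        predict = parse_sentiment(respond_fn(prompt_fn(example['text'])))
        if predict is not None:
            predictions.append(predict)
            labels.append(example['label'])
    # Coverage is the share of examples whose reply could be mapped to a label
    coverage = len(predictions) / len(examples) if len(examples) else 0.0
    return predictions, labels, coverage

# Stand-ins for illustration only; they are not part of the notebook
examples = [
    {'text': 'great film', 'label': 1},
    {'text': 'awful film', 'label': 0},
    {'text': 'hard to say', 'label': 0},
]

def fake_respond(prompt):
    if 'great' in prompt:
        return 'Positive'
    if 'awful' in prompt:
        return "'negative'."
    return 'it depends'

preds, labels, coverage = collect_predictions(examples, lambda text: text, fake_respond)
print(preds, labels, round(coverage, 2))
```

In the notebook, `respond_fn` could be a small wrapper such as `lambda messages: get_model_response_phi(model_1, tokenizer_1, messages, device)`, with `zero_shot_prompt` or `few_shot_prompt` passed as `prompt_fn`.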

## Choosing lightweight models due to system constraints

1. **Phi-3-Mini-4K-Instruct** is a lightweight, state-of-the-art open model trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality, reasoning-dense content. The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety. *Context length = 4K, Parameters = 3.8B*
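
To make the "system constraints" concrete, the memory needed just to hold the weights can be estimated as parameter count times bytes per parameter. The sketch below applies this to the 3.8B figure quoted above at 16-bit and 32-bit precision. It ignores activations, the KV cache, and framework overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

phi3_params = 3.8e9  # parameter count quoted above

# float16 stores 2 bytes per parameter, float32 stores 4
print(f"float16: {weight_memory_gb(phi3_params, 2):.1f} GB")
print(f"float32: {weight_memory_gb(phi3_params, 4):.1f} GB")
```

This is why loading the model with `torch_dtype=torch.float16`, as in the next cell, roughly halves the weight footprint compared with 32-bit floats.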

In [7]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.random.manual_seed(0)
model_1 = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

tokenizer_1 = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")


In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [9]:
# Example
print(get_model_response_phi(model_1, tokenizer_1, zero_shot_prompt('This was a very bad movie!'), device)) # negative

negative


In [10]:
# Zero shot
predictions_phi_zero = []
labels_phi_zero = []

for example in dataset:
    text, label = example['text'], example['label']
    prompt = zero_shot_prompt(text)
    response = get_model_response_phi(model_1, tokenizer_1, prompt, device)
    
    if response == 'positive':
        predict = 1
    elif response == 'negative':
        predict = 0
    else:
        predict = -1
        
    if predict != -1:
        predictions_phi_zero.append(predict)
        labels_phi_zero.append(label)
        
if len(predictions_phi_zero) > 0:
    print(calculate_metrics(predictions_phi_zero, labels_phi_zero))
else:
    print('No response')

{'accuracy': 0.925, 'f1': 0.9192680301399354, 'precision': 0.9682539682539683, 'recall': 0.875}


In [11]:
# Few shot
predictions_phi_few = []
labels_phi_few = []

for example in dataset:
    text, label = example['text'], example['label']
    prompt = few_shot_prompt(text)
    response = get_model_response_phi(model_1, tokenizer_1, prompt, device)
    
    if response == 'positive':
        predict = 1
    elif response == 'negative':
        predict = 0
    else:
        predict = -1
        
    if predict != -1:
        predictions_phi_few.append(predict)
        labels_phi_few.append(label)
        
if len(predictions_phi_few) > 0:
    print(calculate_metrics(predictions_phi_few, labels_phi_few))
else:
    print('No response')

{'accuracy': 0.9129129129129129, 'f1': 0.9038674033149171, 'precision': 0.9784688995215312, 'recall': 0.839835728952772}


2. **TinyLlama-1.1B-Chat-v1.0** is a chat model fine-tuned on top of TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T, following Hugging Face's Zephyr training recipe. The model was "initially fine-tuned on a variant of the UltraChat dataset, which contains a diverse range of synthetic dialogues generated by ChatGPT. We then further aligned the model with 🤗 TRL's DPOTrainer on the openbmb/UltraFeedback dataset, which contain 64k prompts and model completions that are ranked by GPT-4." *Context length ~ 2K, Parameters = 1.1B*

In [12]:
tokenizer_2 = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model_2 = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")


In [13]:
# example
response_2 = get_model_response_tinyL(model_2, tokenizer_2, zero_shot_prompt('This movie was as bad Epic movie!'), device) # expected: negative (Epic Movie was not good)
print(response_2)

positive


In [14]:
# Zero shot
predictions_tinyL_zero = []
labels_tinyL_zero = []

for review in dataset:
    text, label = review['text'], review['label']
    prompt = zero_shot_prompt(text)
    response = get_model_response_tinyL(model_2, tokenizer_2, prompt, device)
    
    if response == 'positive':
        predict = 1
    elif response == 'negative':
        predict = 0
    else:
        predict = -1
        
    if predict != -1:
        predictions_tinyL_zero.append(predict)
        labels_tinyL_zero.append(label)
        
if len(predictions_tinyL_zero) > 0:
    print(calculate_metrics(predictions_tinyL_zero, labels_tinyL_zero))
else:
    print('No response')

{'accuracy': 0.43333333333333335, 'f1': 0.6046511627906976, 'precision': 0.43333333333333335, 'recall': 1.0}
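
In this result, accuracy equals precision and recall is 1.0. That pattern is what one sees when every scored prediction is 'positive': there are no true or false negatives, so recall is 1 and both accuracy and precision reduce to the share of positive labels among the scored reviews. The sketch below reproduces the pattern with hand-computed metrics. The counts (13 positive and 17 negative labels) are an assumption chosen to match the reported ratios, not figures taken from the run, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(predictions, labels):
    """Compute accuracy, precision, recall, and F1 with 1 as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == 1 and y == 1 for p, y in pairs)
    fp = sum(p == 1 and y == 0 for p, y in pairs)
    fn = sum(p == 0 and y == 1 for p, y in pairs)
    tn = sum(p == 0 and y == 0 for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': (tp + tn) / len(pairs), 'f1': f1,
            'precision': precision, 'recall': recall}

# Assumed counts for illustration: 13 positive and 17 negative labels
labels = [1] * 13 + [0] * 17
always_positive = [1] * len(labels)  # a classifier that never answers 'negative'
print(binary_metrics(always_positive, labels))
```

A high F1 alongside below-chance accuracy is therefore a sign of a degenerate classifier here, not of useful predictions.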


In [15]:
# Few shot
predictions_tinyL_few = []
labels_tinyL_few = []

for review in dataset:
    text, label = review['text'], review['label']
    prompt = few_shot_prompt(text)
    response = get_model_response_tinyL(model_2, tokenizer_2, prompt, device)
    
    if response == 'positive':
        predict = 1
    elif response == 'negative':
        predict = 0
    else:
        predict = -1
        
    if predict != -1:
        predictions_tinyL_few.append(predict)
        labels_tinyL_few.append(label)
        
if len(predictions_tinyL_few) > 0:
    print(calculate_metrics(predictions_tinyL_few, labels_tinyL_few))
else:
    print('No response')

{'accuracy': 0.5783132530120482, 'f1': 0.7058823529411765, 'precision': 0.5526315789473685, 'recall': 0.9767441860465116}
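
To compare the four runs, the accuracy and F1 values printed above can be placed side by side. The sketch below only rearranges those recorded numbers (rounded to four decimals) and computes the few-shot minus zero-shot difference per model. Because unparseable replies were dropped, each run may have been scored on a different subset of the 1,000 reviews, so the differences should be read with caution. `few_minus_zero` is a hypothetical helper.

```python
# Values copied from the cell outputs above, rounded to four decimal places
results = {
    'Phi-3-mini': {'zero': {'accuracy': 0.9250, 'f1': 0.9193},
                   'few':  {'accuracy': 0.9129, 'f1': 0.9039}},
    'TinyLlama':  {'zero': {'accuracy': 0.4333, 'f1': 0.6047},
                   'few':  {'accuracy': 0.5783, 'f1': 0.7059}},
}

def few_minus_zero(model_results):
    """Return the per-metric change when moving from zero-shot to few-shot."""
    return {metric: round(model_results['few'][metric] - model_results['zero'][metric], 4)
            for metric in model_results['zero']}

for model_name, model_results in results.items():
    print(model_name, few_minus_zero(model_results))
```

On these numbers, few-shot prompting lowers Phi-3-mini's accuracy by about 1.2 points and raises TinyLlama's by about 14.5 points.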
