In this notebook, I try out several ways of doing one-shot text classification, with varying degrees of success. The code is exploratory: the goal is to see what works and what doesn't.

In [1]:
!pip install -qU \
  transformers==4.31.0 \
  accelerate==0.21.0 \
  bitsandbytes==0.41.0


In [None]:
!pip install torch==2.1.0


In [2]:
!pip uninstall torchvision -y

Found existing installation: torchvision 0.16.0+cu121
Uninstalling torchvision-0.16.0+cu121:
  Successfully uninstalled torchvision-0.16.0+cu121

First, I compare the original text to text generated by an LLM about the keyphrase.

Here's the code to load Llama, copied from the RAG pipeline. I'll use it to generate the one-shot examples on the fly, since pre-writing an example for every possible keyphrase isn't feasible.

In [3]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
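    # `load_in_8bit_fp32_cpu_offload` is not a BitsAndBytesConfig argument;
    # the fp32 CPU offload flag is `llm_int8_enable_fp32_cpu_offload`
    llm_int8_enable_fp32_cpu_offload=True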
)

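# begin initializing HF items, need auth token for these
# read the token from the environment instead of hardcoding it in the notebook
# (the variable name HF_TOKEN is an assumption; use whatever name you set)
import os
hf_auth = os.environ['HF_TOKEN']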
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

Model loaded on cuda:0


In [4]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [5]:
res = generate_text("What is the answer to life, the universe, and everything?")
res[0]["generated_text"]

'What is the answer to life, the universe, and everything?\n nobody knows.\n\nAnswer: 42 (from "The Hitchhiker\'s Guide to the Galaxy")\n\nExplanation: In Douglas Adams\' science fiction series "The Hitchhiker\'s Guide to the Galaxy," the supercomputer Deep Thought is asked to find the Answer to the Ultimate Question of Life, the Universe, and Everything. After seven and a half million years of computation, Deep Thought finally reveals that the Answer is 42. However, the characters in the story soon realize that they don\'t actually know what the question is, so the answer is essentially meaningless.'
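
Note that the output starts with the prompt itself, because the pipeline was built with `return_full_text=True`. Anything generated this way and reused later as a one-shot passage will carry the prompt along with it. If only the continuation is wanted, one option is to strip the prompt afterwards. A minimal sketch, where `strip_prompt` is a hypothetical helper of mine and not part of transformers, shown on toy strings rather than a real pipeline call:

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the continuation, dropping an echoed prompt if present."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


# Toy strings standing in for a real pipeline call
prompt = "What is the answer to life, the universe, and everything?"
full = prompt + "\n nobody knows.\n\nAnswer: 42"
continuation = strip_prompt(full, prompt)
print(continuation)
```

Passing `return_full_text=False` when building the pipeline should have the same effect without a helper.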

In [6]:
from transformers import pipeline
classifier_name = "MoritzLaurer/xtremedistil-l6-h256-mnli-fever-anli-ling-binary"
classifier = pipeline("zero-shot-classification", model=classifier_name)


First, just the zero-shot classifier on its own, scoring the text directly against the keyphrase.

In [7]:
text = "The dog (Canis familiaris[4][5] or Canis lupus familiaris[5]) is a domesticated descendant of the wolf. Also called the domestic dog, it is derived from extinct Pleistocene wolves,[6][7] and the modern wolf is the dog's nearest living relative.[8] The dog was the first species to be domesticated[9][8] by humans. Hunter-gatherers did this, over 15,000 years ago in Germany,[7] which was before the development of agriculture.[1] Due to their long association with humans, dogs have expanded to a large number of domestic individuals[10] and gained the ability to thrive on a starch-rich diet that would be inadequate for other canids.[11]"
keyphrase = ["agriculture"]
res = classifier(text, keyphrase)
score = res["scores"][0]

score

0.752047061920166
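
For context on what this number means: the zero-shot pipeline sits on top of an NLI model. As far as I understand the transformers implementation, when only one candidate label is passed, the score is a softmax over two logits, entailment against contradiction (or not-entailment for a binary model like this one), and the entailment probability is returned. A sketch of that last step with made-up logits, not values from the model above:

```python
import math


def single_label_score(entailment_logit: float, other_logit: float) -> float:
    """Softmax over (other, entailment) logits, returning P(entailment)."""
    m = max(entailment_logit, other_logit)  # subtract the max for numerical stability
    e = math.exp(entailment_logit - m)
    o = math.exp(other_logit - m)
    return e / (e + o)


# Hypothetical logits, chosen only for illustration
print(single_label_score(2.0, 0.0))
print(single_label_score(0.0, 0.0))  # equal logits give 0.5
```

So a score of 0.75 says the model leans towards "entailed", not that three quarters of the text is about the keyphrase.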

Next, one-shot classification with Llama for the "agriculture" keyphrase: the text is tested against a generated passage about the keyphrase instead of the keyphrase itself.

In [8]:
example = generate_text("Write me a very short passage about agriculture")[0]["generated_text"]
res = classifier(text, [example])
score = res["scores"][0]
score

0.17164011299610138

This looks good at first: the text only mentions agriculture in passing, yet the plain zero-shot classifier gave it a fairly high score (about 0.75). With the generated passage the score drops to about 0.17, as it should.

However, much of that drop seems to come simply from the passages being worded differently rather than from a difference in their underlying ideas, as the next cell shows by using a keyphrase that actually is relevant.

In [9]:
keyphrase = ["dogs"]
res = classifier(text, keyphrase)
score1 = res["scores"][0]

example = generate_text("Write me a short passage about dogs")[0]["generated_text"]
res = classifier(text, [example])
score2 = res["scores"][0]

score1, score2

(0.994476318359375, 0.2573131024837494)

This didn't work the way I intended: the second number should have been much higher, since the text really is about dogs. So next I'll try a different method, where I generate embeddings for both the generated passage and the original text and compare those.
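
Before moving on, one plausible reason for the low score when a passage is used as the label: the zero-shot pipeline formats each candidate label into a hypothesis sentence, and its default template is `"This example is {}."`. A whole passage dropped into that slot makes a hypothesis the NLI model was never really meant to judge. The sketch below only reproduces the string formatting; `build_hypothesis` is a hypothetical helper and the passage is a toy stand-in, not Llama output:

```python
# Default hypothesis template of the transformers zero-shot pipeline
DEFAULT_TEMPLATE = "This example is {}."


def build_hypothesis(label: str, template: str = DEFAULT_TEMPLATE) -> str:
    """Format a candidate label into the NLI hypothesis sentence."""
    return template.format(label)


print(build_hypothesis("dogs"))  # This example is dogs.

# A whole passage used as the "label" (toy passage for illustration)
passage = "Dogs are loyal companions that have lived alongside humans for millennia."
print(build_hypothesis(passage))
```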

In [19]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [20]:
!pip install sentence-transformers

Collecting sentence-transformers
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting torchvision (from sentence-transformers)
  Using cached torchvision-0.16.2-cp310-cp310-manylinux1_x86_64.whl (6.8 MB)
Installing collected packages: torchvision, sentence-transformers
Successfully installed sentence-transformers-2.2.2 torchvision-0.16.2


In [21]:
!pip install scipy


In [22]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

# Load a pre-trained sentence transformer model
# (named `embedder` so it doesn't overwrite the Llama `model` variable above)
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# `example` (the one-shot passage) and `text` (the text to classify) come from the cells above

# Generate embeddings
one_shot_embedding = embedder.encode(example)
text_embedding = embedder.encode(text)

# scipy's `cosine` returns the cosine distance, so 1 - distance is the cosine similarity
similarity = 1 - cosine(one_shot_embedding, text_embedding)

similarity

0.41198766231536865
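
For reference, scipy's `cosine` is a distance, which is why the cell subtracts it from 1. The underlying similarity is just the dot product of the two vectors divided by the product of their lengths. A dependency-free sketch on toy vectors (real sentence embeddings have hundreds of dimensions):

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Cosine similarity ranges from -1 to 1, so it is not on the same scale as the classifier's entailment probability.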

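At about 0.41, the embedding similarity is higher than the 0.26 the classifier gave the generated passage, though the two numbers are on different scales, so the comparison is only rough. Finally, I'll score both the original text and the generated example against the keyphrase with the classifier, and see how closely the two scores agree.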

In [24]:
original_score = classifier(text, keyphrase)["scores"][0]
example_score = classifier(example, keyphrase)["scores"][0]

score = 1 - abs(original_score - example_score)

score

0.9989036321640015

This lands much closer to the expected result, although it isn't exactly one-shot classification: it's more like adjusting for how far off the classifier naturally is.
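
One caveat with this measure: it only checks that the two classifier scores agree, so it is also high when both texts score low against the keyphrase. A small sketch with hypothetical scores (not outputs from the classifier above) makes this visible; `agreement_score` is just a name for the formula in the cell:

```python
def agreement_score(score_a: float, score_b: float) -> float:
    """1 minus the absolute difference between two classifier scores."""
    return 1 - abs(score_a - score_b)


# Hypothetical classifier scores, chosen only for illustration
print(agreement_score(0.99, 0.99))  # both texts match the keyphrase
print(agreement_score(0.05, 0.05))  # neither matches, yet agreement is still 1.0
print(agreement_score(0.99, 0.05))  # the scores disagree
```

If that matters for the use case, something like `min(score_a, score_b)` might be worth trying instead, though I haven't tested it here.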

There are other ways to get a zero-shot classifier to do one-shot classification, such as fine-tuning the model, but that takes a lot of compute. Overall, all of these approaches worked to some extent, and which one to use really depends on what you're looking for: whether you'll take the score as-is or feed it into a neural network that can make better use of it.