In this notebook, I try several approaches to one-shot text classification, with varying degrees of success. The goal is to explore the options and see what works and what doesn't.

In [1]:
!pip install -qU \
  transformers==4.31.0 \
  accelerate==0.21.0 \
  bitsandbytes==0.41.0
!pip install torch==2.1.0



Just basic installs, setting up the LLM

In [2]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
    llm_int8_enable_fp32_cpu_offload=True
)

# begin initializing HF items, need auth token for these
# read the token from the environment rather than hardcoding it in the notebook
import os
hf_auth = os.environ['HF_TOKEN']
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model loaded on cuda:0


In [3]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    temperature=0.0,
    max_new_tokens=1024,
    repetition_penalty=1.1
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [4]:
res = generate_text("What color is the sky")
res[0]["generated_text"]

'What color is the sky?\n\nAnswer: The color of the sky can vary depending on the time of day and atmospheric conditions. During sunrise and sunset, the sky can take on hues of red, orange, pink, and purple. At midday, the sky is typically a pale blue or white. However, if there are clouds present, the sky can appear gray or overcast.'

Although this is the chat variant of Llama 2, a raw prompt with no chat template makes it simply continue the text rather than answer the way ChatGPT would. With a little creativity, however, we can make this work.

In [5]:
def get_text(prompt):
    res = generate_text(prompt)
    res = res[0]["generated_text"][len(prompt):]
    return res
get_text("Now we are going to create example questions and answers to study a text. The text is \"The dog (Canis familiaris[4][5] or Canis lupus familiaris[5]) is a domesticated descendant of the wolf. Also called the domestic dog, it is derived from extinct gray wolves,[6][7] and the gray wolf is the dog's closest living relative.[8] The dog was the first species to be domesticated[9][8] by humans. Hunter-gatherers did this, over 15,000 years ago in Oberkassel, Bonn,[7] which was before the development of agriculture.[1] Due to their long association with humans, dogs have expanded to a large number of domestic individuals[10] and gained the ability to thrive on a starch-rich diet that would be inadequate for other canids.[11]\" Question: What was the first species to be domesticated by humans? Answer: Dogs. \n Question:")

" What is the closest living relative of the dog? Answer: Gray wolf. \nQuestion: Where and when were dogs first domesticated? Answer: In Oberkassel, Bonn over 15,000 years ago. \nQuestion: What is unique about the diet of domestic dogs compared to other canids? Answer: They can thrive on a starch-rich diet that would be inadequate for other canids. \nNow let's try some example questions and answers to test your understanding of the text. Here's the first question: What is the scientific name of the domestic dog? Answer: Canis familiaris or Canis lupus familiaris. Good job! Here's the next question: How long ago did humans domesticate dogs? Answer: Over 15,000 years ago. Correct! Here's the final question: What is special about the diet of domestic dogs compared to other canids? Answer: They can thrive on a starch-rich diet that would be inadequate for other canids. Great job! You got all three questions right!"

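The prompt in the cell above follows a simple few-shot pattern: an instruction, the source text, one worked question/answer pair, and a trailing `Question:` for the model to continue. A minimal sketch of a helper that assembles such a prompt is below; `build_qa_prompt` and its parameter names are hypothetical and not part of the notebook.

```python
def build_qa_prompt(text, example_question, example_answer):
    """Assemble a continuation-style prompt that ends in an open 'Question:'."""
    return (
        "Now we are going to create example questions and answers to study a text. "
        f'The text is "{text}" '
        f"Question: {example_question} Answer: {example_answer} \n Question:"
    )


prompt = build_qa_prompt(
    "The dog is a domesticated descendant of the wolf.",
    "What is the dog descended from?",
    "The wolf.",
)
print(prompt)
```

The resulting string could then be passed to `get_text(prompt)`, so the model's continuation starts with a new question.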
In [6]:
# Colab workaround: force the reported encoding to UTF-8 so later `!pip` shell commands don't fail
import locale
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

In [7]:
!pip install datasets==2.14.0



In [8]:
!pip install dill==0.3.7



In [9]:
from datasets import load_dataset

def download_dataset(dataset_name):
    dataset = load_dataset(dataset_name)
    return dataset

capybara_dataset = download_dataset('LDJnr/Capybara')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
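A later cell interpolates each conversation turn directly into the prompt. Assuming each turn is a dict with `input` and `output` keys (an assumption about the Capybara schema, worth checking with `capybara_dataset['train'][0]`), the f-string would insert the dict's Python repr, braces and quotes included. A hedged sketch of rendering a turn as plain text instead; `format_turn` is a hypothetical helper:

```python
def format_turn(turn):
    """Render one conversation turn as plain text instead of a dict repr."""
    return f"User: {turn['input']}\nAssistant: {turn['output']}"


# Hypothetical turn, shaped like the assumed Capybara schema.
turn = {"input": "What is a capybara?", "output": "A large South American rodent."}
print(format_turn(turn))
```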


With some tweaking of the exact prompt wording, we can get it working pretty well!

In [None]:
number_samples = 30
data = []
ress = []
for sample in range(number_samples):
  capybara_sample = capybara_dataset['train'][sample]['conversation']
  for qa in capybara_sample:
    inp = f"Now we are going to create an example free-response question and answer based on a certain text for a test covering numerous subjects. The text will not be provided to the test takers, so we're not going to reference it at all in the test, and instead only ask general-knowledge questions from it. The format should be \"Question: <question text> Answer: <answer text>\". The text you're going to get your question from is \"{qa}\".  Remember to give both a question and an answer. Here's an example. Question:"
    res = get_text(inp)
    ress.append(res)
    # Split on the first "Answer: "; the answer runs to the end of that line.
    question, found, rest = res.partition("Answer: ")
    if found:
      data.append({'question': question, 'answer': rest.split("\n", 1)[0]})
    else:
      print("No \"Answer: \" found")
  print(sample+1)
data

1
2




3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
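The loop above pulls the question and answer out of each generation by locating `Answer: `. The same step can be isolated into a small function that is easy to test on its own. This is a sketch, not the notebook's exact logic: `parse_qa` is a hypothetical name, and unlike the loop it strips surrounding whitespace.

```python
def parse_qa(generated, marker="Answer: "):
    """Split a generation into question and answer, or return None if no marker."""
    question, found, rest = generated.partition(marker)
    if not found:
        return None
    # Keep only the first line of the answer; the model often keeps writing.
    answer = rest.split("\n", 1)[0]
    return {"question": question.strip(), "answer": answer.strip()}


sample = " What is the closest living relative of the dog? Answer: Gray wolf. \nQuestion: ..."
print(parse_qa(sample))
print(parse_qa("no marker here"))
```

Returning `None` for unparseable generations lets the caller decide whether to skip, log, or retry them.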


In [None]:
# Manually label each generated question; the classifier trained later needs numeric 0/1 labels
for qa in data:
  print(qa['question'])
  qa['good'] = input("Goodness: ")

 What was one of the main differences between Mozart and Beethoven's musical styles? 


In [None]:
data =

In [None]:
!pip install sentence_transformers
!pip install scipy

In [None]:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

# Load a pre-trained sentence transformer model
# (named `encoder` so it does not overwrite the Llama `model` loaded earlier)
encoder = SentenceTransformer('all-MiniLM-L6-v2')
question_encodings = []

for qa in data:
  question_encodings.append(encoder.encode(qa["question"]))

len(question_encodings[0])

In [None]:
goods = [int(i['good']) for i in data]

In [None]:
device = "cuda"

In [None]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

question_tensor = torch.tensor(question_encodings, dtype=torch.float32).to(device)
good_tensor = torch.tensor(goods, dtype=torch.float32).to(device)

dropout_rate = 0.5
net = nn.Sequential(
    nn.Linear(question_tensor.shape[1], 64),
    nn.ReLU(),
    nn.Dropout(dropout_rate),
    nn.Linear(64,32),
    nn.ReLU(),
    nn.Dropout(dropout_rate),
    nn.Linear(32,1),
    nn.Sigmoid()
).to(device)

epochs = 10
lr = 6e-4
criterion = nn.BCELoss()
opt = torch.optim.Adam(net.parameters(), lr=lr)
losses = []
for epoch in range(epochs):
  for question in range(len(question_tensor)):
    out = net(question_tensor[question].reshape(1,-1))
    # BCELoss expects (prediction, target), in that order
    loss = criterion(out.reshape(1), good_tensor[question].reshape(1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

plt.plot(losses)
plt.show()
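`nn.BCELoss` computes binary cross-entropy between a predicted probability and a 0/1 label, and it takes the prediction as its first argument and the target as its second. A pure-Python sketch of the per-example formula shows why the two are not interchangeable. The helper `bce` and its `eps` clamp are illustrative assumptions, not PyTorch's implementation.

```python
import math


def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for one prediction in (0, 1) and a 0/1 target."""
    p = min(max(prediction, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


print(bce(0.9, 1))  # confident and correct: small loss
print(bce(0.9, 0))  # confident and wrong: large loss
```

Because the logarithm is applied only to the prediction, passing the hard 0/1 label in that position would push the loss toward the clamp value instead of measuring how far the network's output is from the label.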