<a href="https://colab.research.google.com/github/IsaacRe/Syntactically-Constrained-Sampling/blob/main/notebooks/Examples_with_Non_IFT_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Examples with Non-Instruction-Finetuned Models
This notebook walks through constraining generation directly on base language models (models that have not been finetuned on instruction data). In many cases, imposing syntactic constraints yields accurate responses despite the limitations of such models.

In [11]:
# if colab throws an error about UTF-8, run this cell
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
%cd /content
!rm -rf transformers
!git clone -b syntactically-constrained-sampling --single-branch https://github.com/IsaacRe/transformers.git
%cd /content/transformers
!pip install --upgrade pip && pip install .
!pip install git+https://github.com/IsaacRe/Syntactically-Constrained-Sampling
%cd /content
!rm -rf alpaca-lora
!git clone https://github.com/tloen/alpaca-lora.git
%cd alpaca-lora/
!pip install -r requirements.txt

In [13]:
# check setup
from transformers.generation.output_validity import validity_check

In [None]:
!pip show sampling-constraints

Name: sampling-constraints
Version: 0.0.14
Summary: Library of incremental parsers used to force syntax constraints on next-token predictions during language model generation
Home-page: 
Author: 
Author-email: Isaac Rehg <isaacrehg@gmail.com>
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: 


We'll explore the two currently supported constraints: JSON schema enforcement and option (or "one-of") selection.



In [11]:
json_prompt = 'List US presidents in JSON format'
json_schema = """[]{
    name: string,
    age_entering_office: number,
    year_entering_office: number
}"""

one_of_prompt = 'Who is the first US president?'
options = 'George Washington,Abraham Lincoln'

### GPT-2

In [13]:
from transformers.pipelines import pipeline

pipe = pipeline(model='gpt2')

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [14]:
pipe(json_prompt, enforce_json_schema=json_schema)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


in generate

Generating with sample





[{'generated_text': 'List US presidents in JSON format[{"name":"Winchester","age_entering_office":0.00,"year_entering_office":0.00}]'}]
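The pipeline echoes the prompt and appends the constrained JSON directly after it. The following is a minimal sketch of how that output could be post-processed, assuming the prompt is a literal prefix of `generated_text` (as in the output above). The helper names `extract_json` and `matches_president_schema` are hypothetical and not part of the library; the check simply mirrors the three fields declared in `json_schema`.

```python
import json

PROMPT = "List US presidents in JSON format"
EXPECTED_KEYS = {"name", "age_entering_office", "year_entering_office"}


def extract_json(generated_text, prompt):
    """Strip the echoed prompt and parse the remainder as JSON (hypothetical helper)."""
    if not generated_text.startswith(prompt):
        raise ValueError("generated text does not start with the prompt")
    return json.loads(generated_text[len(prompt):])


def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def matches_president_schema(items):
    """Check that `items` is a list of objects with the fields from json_schema."""
    if not isinstance(items, list):
        return False
    for item in items:
        if not isinstance(item, dict) or set(item) != EXPECTED_KEYS:
            return False
        if not isinstance(item["name"], str):
            return False
        if not is_number(item["age_entering_office"]):
            return False
        if not is_number(item["year_entering_office"]):
            return False
    return True


# The GPT-2 output shown above, copied verbatim.
sample = (
    'List US presidents in JSON format'
    '[{"name":"Winchester","age_entering_office":0.00,"year_entering_office":0.00}]'
)
presidents = extract_json(sample, PROMPT)
print(matches_president_schema(presidents))  # True
```

Note that the check passes even though the content is wrong: the constraint guarantees the shape of the output, not its factual accuracy, which is why GPT-2 can return a well-formed but meaningless entry.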

In [7]:
pipe(one_of_prompt, enforce_one_of=options)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


in generate

Generating with sample



[{'generated_text': 'Who is the first US president?George Washington'}]
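To give some intuition for option selection, here is a simplified character-level sketch of prefix filtering. It is not the library's implementation (which constrains next-token predictions), and it assumes the comma-separated `options` string is split into individual choices. The function name `allowed_next_chars` is hypothetical.

```python
def allowed_next_chars(choices, generated):
    """Return the characters that keep `generated` a prefix of at least one choice."""
    return {
        choice[len(generated)]
        for choice in choices
        if choice.startswith(generated) and len(choice) > len(generated)
    }


# Assumption: options are separated by commas, as in the `options` string above.
choices = "George Washington,Abraham Lincoln".split(",")

print(allowed_next_chars(choices, ""))     # first character of each option
print(allowed_next_chars(choices, "Geo"))  # only 'r' continues "George Washington"
print(allowed_next_chars(choices, "George Washington"))  # empty: the option is complete
```

At each step, any continuation outside the allowed set would be masked out, so the model can only choose among the listed options and generation ends once one of them is complete.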

### Llama-7B

In [1]:
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

base_model = "decapoda-research/llama-7b-hf"
cache_dir = "/content/drive/MyDrive/huggingface_colab"

In [2]:
tokenizer = LlamaTokenizer.from_pretrained(base_model, cache_dir=f"{cache_dir}/hub")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.


In [3]:
model = LlamaForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, torch_dtype=torch.float16, device_map="auto", cache_dir=f"{cache_dir}/hub"
)


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...




Loading checkpoint shards:   0%|          | 0/33 [00:00<?, ?it/s]

In [None]:
# fix broken model config as per https://github.com/tloen/alpaca-lora/blob/main/generate.py#L75
model.config.pad_token_id = tokenizer.pad_token_id = 0  # unk
model.config.bos_token_id = tokenizer.bos_token_id = 1
model.config.eos_token_id = tokenizer.eos_token_id = 2
model.eval()

In [5]:
from transformers.generation.output_validity import validity_check

In [21]:
constraint_config = {"enforce_json_schema": json_schema}
valid_json = validity_check(tokenizer, constraint_config)
# tokenize the prompt and move the tensors to GPU 0
inputs = tokenizer(json_prompt, return_tensors="pt").to(0)
with torch.no_grad():
    # max_length counts the prompt tokens as well as the generated ones
    output = model.generate(inputs["input_ids"], max_length=50, output_validity_check=valid_json)
tokenizer.batch_decode(output)

in generate

Generating with sample

STOPPING gen


['<s>List US presidents in JSON format[{"name":"George Washington","year_entering_office":1789,"age_entering_office":47},{"name":"John Adams","year_entering_office']
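The decoded output above stops partway through the second object, most likely because `max_length=50` caps the total number of tokens (prompt included). A truncated result like this is a valid prefix of schema-conforming JSON but is not itself parseable. Below is a minimal sketch for detecting that case, assuming the `<s>` and `</s>` markers seen in the decoded strings; `parse_if_complete` is a hypothetical helper, not a library function.

```python
import json


def parse_if_complete(decoded, prompt):
    """Return the parsed JSON, or None if the generation was cut off mid-document."""
    text = decoded
    if text.startswith("<s>"):          # drop the beginning-of-sequence marker
        text = text[len("<s>"):]
    text = text.lstrip()
    if text.startswith(prompt):         # drop the echoed prompt
        text = text[len(prompt):]
    if text.endswith("</s>"):           # drop the end-of-sequence marker
        text = text[: -len("</s>")]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


prompt = "List US presidents in JSON format"

# The truncated Llama output shown above, copied verbatim.
truncated = (
    '<s>List US presidents in JSON format[{"name":"George Washington",'
    '"year_entering_office":1789,"age_entering_office":47},'
    '{"name":"John Adams","year_entering_office'
)
print(parse_if_complete(truncated, prompt))  # None

# A hand-written complete example, for comparison only.
complete = (
    '<s>List US presidents in JSON format[{"name":"George Washington",'
    '"year_entering_office":1789,"age_entering_office":47}]</s>'
)
print(parse_if_complete(complete, prompt))
```

If the helper returns `None`, raising the length limit and regenerating is one option to try.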

In [22]:
constraint_config = {"enforce_one_of": options}
valid_option = validity_check(tokenizer, constraint_config)
# tokenize the prompt and move the tensors to GPU 0
inputs = tokenizer(one_of_prompt, return_tensors="pt").to(0)
with torch.no_grad():
    output = model.generate(inputs["input_ids"], max_length=50, output_validity_check=valid_option)
tokenizer.batch_decode(output)

in generate

Generating with sample



['<s> Who is the first US president?George Washington</s>']