
fix array generation stopping criteria #47

Open · wants to merge 1 commit into base: main
Conversation

botka1998

Issue

According to Issue #46, array generation has some problems: mainly, it usually stops generating earlier than intended.
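
For context, a minimal sketch of how the symptom shows up, assuming a Hugging Face causal LM and the standard Jsonformer entry point (the model name below is only an example, not taken from the issue):

from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

# Example model only; any causal LM used with jsonformer shows the same symptom
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

schema = {
    "type": "object",
    "properties": {
        "skills": {"type": "array", "items": {"type": "string"}},
    },
}

builder = Jsonformer(model, tokenizer, schema, "List the candidate's skills.")
print(builder())  # with the old stopping criteria the array frequently ends after one element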

How it used to be done

Once we enter array generation, we generate the first value in the array and immediately check whether to continue generating by asking the LLM for the next token. We do this with the .forward method: we sort the top 30 results from the logits and check whether we encounter a "," or a "]". The "," indicates we want to keep generating, while the "]" tells us the array is done and we can move on.

# Run a single forward pass and look at the logits for the next token
output = self.model.forward(input_tensor.to(self.model.device))
logits = output.logits[0, -1]

# Take the 30 highest-scoring token ids, ordered from most to least likely
top_indices = logits.topk(30).indices
sorted_token_ids = top_indices[logits[top_indices].argsort(descending=True)]

found_comma = False
found_close_bracket = False

# Scan the candidates in rank order and stop at the first "," or "]"
for token_id in sorted_token_ids:
    decoded_token = self.tokenizer.decode(token_id)
    if ',' in decoded_token:
        found_comma = True
        break
    if ']' in decoded_token:
        found_close_bracket = True
        break

# No comma (or an explicit close bracket) means: stop generating array elements
if found_close_bracket or not found_comma:
    break

The problem with this approach was that tokens such as "]);", "]@", etc. would often appear among the top candidates and terminate the array generation too early.
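
A small standalone illustration of that failure mode (the token ranking below is made up, but mirrors what the loop above does):

# The old loop scans the top-k candidates in rank order and stops at the first
# one containing ',' or ']'. A multi-character token such as ']);' that merely
# *contains* ']' therefore ends the array, even when the model was not emitting
# a bare ']'.
top_tokens_by_rank = ["]);", ",", " next"]  # hypothetical ranking

found_comma = False
found_close_bracket = False
for decoded_token in top_tokens_by_rank:
    if ',' in decoded_token:
        found_comma = True
        break
    if ']' in decoded_token:
        found_close_bracket = True
        break

print(found_close_bracket)  # True -> array generation stops too early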

The solution

I simplified the process by prompting the LLM with the .generate method and passing a LogitsWarper that only allows "," and "]" to be generated. This forces the LLM to choose whether to keep generating the array or end it. We then check the generated token and act accordingly. This ends up being much more reliable.

LogitsWarper masks the output tokens

import torch
from transformers import LogitsWarper, PreTrainedTokenizer


class OutputCommaAndBracketTokens(LogitsWarper):
    def __init__(self, tokenizer: PreTrainedTokenizer, prompt: str):
        self.tokenizer = tokenizer
        self.tokenized_prompt = tokenizer(prompt, return_tensors="pt")
        vocab_size = len(tokenizer)
        # Boolean mask over the vocabulary: True only for the tokens we allow
        self.allowed_mask = torch.zeros(vocab_size, dtype=torch.bool)

        for _, token_id in tokenizer.get_vocab().items():
            token_str = tokenizer.decode(token_id).strip()

            if token_str in [",", "]"]:
                self.allowed_mask[token_id] = True

    def __call__(self, _, scores):
        # Set every disallowed token's score to -inf so generate() can only
        # pick "," (continue the array) or "]" (close the array)
        mask = self.allowed_mask.expand_as(scores)
        scores[~mask] = -float("inf")

        return scores

LLM decides what to do

# Ask the model for exactly one token; the logits processor guarantees that
# token is either "," or "]"
response = self.model.generate(
    input_tokens,
    max_new_tokens=1,
    num_return_sequences=1,
    logits_processor=[self.array_end_logit_processor],
    pad_token_id=self.tokenizer.eos_token_id,
)
last_token = self.tokenizer.decode(response[0][-1])

# "]" means the model wants to close the array; otherwise keep generating
if "]" in last_token:
    break

return obj
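
For context, a sketch of how the warper could be hooked up around the snippet above; the attribute name matches the generate() call, but the surrounding loop and the generate_value helper are assumptions, not lines copied from the diff:

# Assumed wiring (not copied from the diff): build the mask once, then alternate
# between generating an element and letting the masked generate() call decide
# whether to continue (',') or close the array (']').
self.array_end_logit_processor = OutputCommaAndBracketTokens(self.tokenizer, prompt)

obj = []
for _ in range(self.max_array_length):
    element = self.generate_value(item_schema, obj)  # assumed Jsonformer helper
    obj.append(element)
    # ...the masked generate() call shown above runs here and may break out of the loop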

@thomasBontempsTecrisViattech

Hello,
I am currently using this and I would like to know whether I need to specify the size of the array?

Because today, when I parse a resume to get a JSON file, I only get 1 element in my array (95% of the time) where I should have 2 or 3.

@botka1998
Author

Hello @thomasBontempsTecrisViattech
What exactly is your application? You’re trying to summarize a resume into a predefined json schema?

@thomasBontempsTecrisViattech


Yes, exactly, that is the goal of my application.

I have a PDF resume that I convert to TXT and then to JSON.

So, to simplify, I use the TXT file in the prompt along with the JSON schema.

I attached images to show you (schema, response, choices): choice_hardskills, debug_json_schema, debug_response_json.

When I specify the size of the array in the JSON schema file, I can sometimes get more possibilities, but it is totally random.

@botka1998
Author

@thomasBontempsTecrisViattech Yeah, it extracted Python as a soft skill; I'm sure that is not the intended behavior? So, a couple of things I suggest you try:

  • I haven't used the "description" tag inside the schema, and I'm pretty sure there is nothing in the jsonformer code that does anything special with it.
  • I would focus on prompt engineering. My high-performing prompts are explicit, detailed, and cover the given task very well; I try to give as much context as possible as well as domain explanations. Also give explicit restrictions and commands such as "make sure to include all the properties mentioned... never add properties that aren't mentioned in the text...". When I say properties I mean whatever your task is about, soft skills in this example. So try some more advanced prompts.
  • Using a more powerful LLM should also help; I've had great results with Mistral-7B-Instruct-v0.2-GPTQ.
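
If it helps, a minimal sketch of loading a quantized instruct model for jsonformer; the repo id below is an assumption for the model named above, and GPTQ checkpoints additionally require the optimum/auto-gptq packages:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id for the GPTQ build mentioned above
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # requires accelerate
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)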

@thomasBontempsTecrisViattech


Ok, thanks! That's what I thought.
I will reply to you if it doesn't work.

@noobmldude

Happy to see there is a solution for the issue with generating arrays, thanks!
Any plans to merge this in?

@thomasBontempsTecrisViattech


Hello,

To improve array predictions:

  • Firstly, I changed the model to this one: 'openbmb/MiniCPM-2B-sft-fp32'.

  • Then, use this prompt:

    template = """
    Context:
    The resume is below and delimited by --START-- and --END--. You will use this resume to extract information.
    You have to follow the sections Goal and Criteria after --END--.

    --START--

    {content}

    --END--

    IMPORTANT THINGS TO DO BELOW:
    Goal:
    Your goal is to provide: {goal}

    Criteria:

    • Make sure to include all the properties mentioned
    • Recreate paragraphs with only the information wanted
    • You will decompose the resume into distinct parts to extract
    • Use paragraphs to extract the most precise information or context
    • Process multiple choices as the answer; create a list of answers
    • If there are multiple responses, try to find one more

    Response format:

    • No accents at all
    • Consistent phrasing
    """

You don't need to mention the JSON schema in the prompt.

So you will have something like this:

# Fill in the placeholders in the template
formatted_template = template.format(content=content, goal=json_schema["description"])

# Generate text using the formatted template
results = llm.generate(prompt=formatted_template,
                       json_schema=json_schema)

And the generate function of the LLM class is the same, so:

def generate(self, prompt, json_schema, temperature=0.0001):  # strictly positive
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    builder = Jsonformer(
        model=self.model,
        tokenizer=self.tokenizer,
        json_schema=json_schema,
        prompt=prompt,
        temperature=temperature,
        max_array_length=10,
        max_string_token_length=100,
        max_number_tokens=100,
        device=device,
    )

    print("Generating...")
    output = builder()
    return output

I hope this helps you.

  • In addition, try to cut your JSON schema down to make the prediction less complex:

    {
      "name": "computer_languages",
      "description": "List of computer languages that the person has used during their career.",
      "type": "object",
      "properties": {
        "computer_languages": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }

    That way it is simpler for the model to understand. For my own use, I have 11 different JSON schemas, so I need to run the LLM prediction eleven times. I then get 11 JSON files that I can use however I want (see the sketch after this list).

  • Lastly, to improve my JSON output: you can add an example of the desired output to your prompt, like this:

    {"computer_languages": []}

    It will ensure that the keys are the same and the output format too.
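
As mentioned in the list above, here is a rough sketch of running one small schema at a time and combining the results; the schema variable names are illustrative and llm.generate is the wrapper shown earlier (I merge into one dict here instead of writing 11 separate files):

# Illustrative schema list; in practice each entry is one small schema like the
# computer_languages example above
schemas = [computer_languages_schema, hardskills_schema, softskills_schema]

merged = {}
for schema in schemas:
    prompt = template.format(content=content, goal=schema["description"])
    result = llm.generate(prompt=prompt, json_schema=schema)
    merged.update(result)  # e.g. {"computer_languages": [...], "hardskills": [...]}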

@botka1998
Author


Hey @noobmldude
Yes, I would love to merge this, but I am not a contributor and do not have access rights. I would urge @1rgs to take a look at the PR and merge if they're happy with it.
I am also using jsonformer in a project and I don't like having to install it from my forked version just because I need this feature. I think it's a great and simple addition to jsonformer. Hopefully the owner agrees!

@noobmldude

Ok, looks like the latest commits are from last year.
I'm not sure if @1rgs is interested in maintaining it.
In that case, I would just use it from your fork, @botka1998.
For anyone else interested, I'm referring to this fork: https://github.com/botka1998/jsonformer

@wassname

I merged it into this branch, where I also added probabilities: https://github.com/wassname/prob_jsonformer
