
fix array generation stopping criteria #47

Open · wants to merge 1 commit into base: main
Conversation

botka1998

Issue

According to Issue #46, array generation has some problems: mainly, it usually stops generating earlier than intended.
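
For context, a minimal sketch of how the symptom shows up, assuming a Hugging Face causal LM and the standard Jsonformer entry point (the model name below is only an example, not taken from the issue):

from transformers import AutoModelForCausalLM, AutoTokenizer
from jsonformer import Jsonformer

# Example model only; any causal LM used with jsonformer shows the same symptom
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

schema = {
    "type": "object",
    "properties": {
        "skills": {"type": "array", "items": {"type": "string"}},
    },
}

builder = Jsonformer(model, tokenizer, schema, "List the candidate's skills.")
print(builder())  # with the old stopping criteria the array frequently ends after one element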

How it used to be done

Once we enter array generation, we generate the first value in the array and immediately check whether to continue generating by asking the LLM for the next token. We do this with the .forward method: we sort the top 30 results from the logits and check whether we encounter a "," or a "]". The "," indicates we want to keep generating, while the "]" tells us the array is done and we can move on.

# Run a single forward pass and look at the logits for the next token
output = self.model.forward(input_tensor.to(self.model.device))
logits = output.logits[0, -1]

# Take the 30 highest-scoring token ids, ordered from most to least likely
top_indices = logits.topk(30).indices
sorted_token_ids = top_indices[logits[top_indices].argsort(descending=True)]

found_comma = False
found_close_bracket = False

# Scan the candidates in rank order and stop at the first "," or "]"
for token_id in sorted_token_ids:
    decoded_token = self.tokenizer.decode(token_id)
    if ',' in decoded_token:
        found_comma = True
        break
    if ']' in decoded_token:
        found_close_bracket = True
        break

# No comma (or an explicit close bracket) means: stop generating array elements
if found_close_bracket or not found_comma:
    break

The problem with this approach was that tokens such as "]);", "]@", etc. would often appear among the top candidates and terminate the array generation too early.
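
A small standalone illustration of that failure mode (the token ranking below is made up, but mirrors what the loop above does):

# The old loop scans the top-k candidates in rank order and stops at the first
# one containing ',' or ']'. A multi-character token such as ']);' that merely
# *contains* ']' therefore ends the array, even when the model was not emitting
# a bare ']'.
top_tokens_by_rank = ["]);", ",", " next"]  # hypothetical ranking

found_comma = False
found_close_bracket = False
for decoded_token in top_tokens_by_rank:
    if ',' in decoded_token:
        found_comma = True
        break
    if ']' in decoded_token:
        found_close_bracket = True
        break

print(found_close_bracket)  # True -> array generation stops too early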

The solution

I simplified the process by prompting the LLM with the .generate method and passing a LogitsWarper that only allows "," and "]" to be generated. This forces the LLM to choose whether to keep generating the array or end it. We then check the generated token and act accordingly. This ends up being much more reliable.

LogitsWarper masks the output tokens

import torch
from transformers import LogitsWarper, PreTrainedTokenizer


class OutputCommaAndBracketTokens(LogitsWarper):
    def __init__(self, tokenizer: PreTrainedTokenizer, prompt: str):
        self.tokenizer = tokenizer
        self.tokenized_prompt = tokenizer(prompt, return_tensors="pt")
        vocab_size = len(tokenizer)
        # Boolean mask over the vocabulary: True only for the tokens we allow
        self.allowed_mask = torch.zeros(vocab_size, dtype=torch.bool)

        for _, token_id in tokenizer.get_vocab().items():
            token_str = tokenizer.decode(token_id).strip()

            if token_str in [",", "]"]:
                self.allowed_mask[token_id] = True

    def __call__(self, _, scores):
        # Set every disallowed token's score to -inf so generate() can only
        # pick "," (continue the array) or "]" (close the array)
        mask = self.allowed_mask.expand_as(scores)
        scores[~mask] = -float("inf")

        return scores

LLM decides what to do

# Ask the model for exactly one token; the logits processor guarantees that
# token is either "," or "]"
response = self.model.generate(
    input_tokens,
    max_new_tokens=1,
    num_return_sequences=1,
    logits_processor=[self.array_end_logit_processor],
    pad_token_id=self.tokenizer.eos_token_id,
)
last_token = self.tokenizer.decode(response[0][-1])

# "]" means the model wants to close the array; otherwise keep generating
if "]" in last_token:
    break

return obj
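
For context, a sketch of how the warper could be hooked up around the snippet above; the attribute name matches the generate() call, but the surrounding loop and the generate_value helper are assumptions, not lines copied from the diff:

# Assumed wiring (not copied from the diff): build the mask once, then alternate
# between generating an element and letting the masked generate() call decide
# whether to continue (',') or close the array (']').
self.array_end_logit_processor = OutputCommaAndBracketTokens(self.tokenizer, prompt)

obj = []
for _ in range(self.max_array_length):
    element = self.generate_value(item_schema, obj)  # assumed Jsonformer helper
    obj.append(element)
    # ...the masked generate() call shown above runs here and may break out of the loop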

@thomasBontempsTecrisViattech

Hello,
I am currently using this and I would like to know whether I need to specify the size of the array?

Because today, when I parse a resume to get a JSON file, I only get 1 element in my array (95% of the time) where I should have 2 or 3.

@botka1998
Author

Hello @thomasBontempsTecrisViattech
What exactly is your application? You’re trying to summarize a resume into a predefined json schema?

@thomasBontempsTecrisViattech


Yes, exactly, that is the goal of my application.

I have a PDF resume that I convert to TXT and then to JSON.

So, to simplify, I use the TXT file in the prompt along with the JSON schema.

I attached images to show you (schema, response, choices): choice_hardskills, debug_json_schema, debug_response_json.

When I specify the size of the array in the JSON schema file, I can sometimes get more possibilities, but it is totally random.

@botka1998
Author

@thomasBontempsTecrisViattech Yeah, it extracted Python as a soft skill; I'm sure that is not the intended behavior? So, a couple of things I suggest you try:

  • I haven't used the "description" tag inside the schema, and I'm pretty sure there is nothing in the jsonformer code that does anything special with it.
  • I would focus on prompt engineering. My high-performing prompts are explicit, detailed, and cover the given task very well; I try to give as much context as possible as well as domain explanations. Also give explicit restrictions and commands such as "make sure to include all the properties mentioned... never add properties that aren't mentioned in the text...". When I say properties I mean whatever your task is about, soft skills in this example. So try some more advanced prompts.
  • Using a more powerful LLM should also help; I've had great results with Mistral-7B-Instruct-v0.2-GPTQ.
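
If it helps, a minimal sketch of loading a quantized instruct model for jsonformer; the repo id below is an assumption for the model named above, and GPTQ checkpoints additionally require the optimum/auto-gptq packages:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repo id for the GPTQ build mentioned above
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # requires accelerate
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)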

@thomasBontempsTecrisViattech


Ok, thanks! That's what I thought.
I will reply to you if it doesn't work.

@noobmldude

Happy to see there is a solution for the issue with generating arrays, thanks!
Any plans to merge this in?

@thomasBontempsTecrisViattech


Hello,

To improve array predictions:

  • Firstly, I changed the model to this one: 'openbmb/MiniCPM-2B-sft-fp32'.

  • Then, use this prompt:

    template = """
    Context:
    The resume is below and delimited by --START-- and --END--. You will use this resume to extract information.
    You have to follow the sections Goal and Criteria after --END--.

    --START--

    {content}

    --END--

    IMPORTANT THINGS TO DO BELOW:
    Goal:
    Your goal is to provide: {goal}

    Criteria:

    • Make sure to include all the properties mentioned
    • Recreate paragraphs with only the information wanted
    • You will decompose the resume into distinct parts to extract
    • Use paragraphs to extract the most precise information or context
    • Process multiple choices as the answer; create a list of answers
    • If there are multiple responses, try to find one more

    Response format:

    • No accents at all
    • Consistent phrasing
    """

You don't need to mention the JSON schema in the prompt.

So you will have something like this:

# Fill in the placeholders in the template
formatted_template = template.format(content=content, goal=json_schema["description"])

# Generate text using the formatted template
results = llm.generate(prompt=formatted_template,
                       json_schema=json_schema)

And the generate function of the LLM class is the same, so:

def generate(self, prompt, json_schema, temperature=0.0001):  # strictly positive
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    builder = Jsonformer(
        model=self.model,
        tokenizer=self.tokenizer,
        json_schema=json_schema,
        prompt=prompt,
        temperature=temperature,
        max_array_length=10,
        max_string_token_length=100,
        max_number_tokens=100,
        device=device,
    )

    print("Generating...")
    output = builder()
    return output

I hope this helps you.

  • In addition, try to cut your JSON schema down to make the prediction less complex:

    {
      "name": "computer_languages",
      "description": "List of computer languages that the person has used during their career.",
      "type": "object",
      "properties": {
        "computer_languages": {
          "type": "array",
          "items": { "type": "string" }
        }
      }
    }

    That way it is simpler for the model to understand. For my own use, I have 11 different JSON schemas, so I need to run the LLM prediction eleven times. I then get 11 JSON files that I can use however I want (see the sketch after this list).

  • Lastly, to improve my JSON output: you can add an example of the desired output to your prompt, like this:

    {"computer_languages": []}

    It will ensure that the keys are the same and the output format too.
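
As mentioned in the list above, here is a rough sketch of running one small schema at a time and combining the results; the schema variable names are illustrative and llm.generate is the wrapper shown earlier (I merge into one dict here instead of writing 11 separate files):

# Illustrative schema list; in practice each entry is one small schema like the
# computer_languages example above
schemas = [computer_languages_schema, hardskills_schema, softskills_schema]

merged = {}
for schema in schemas:
    prompt = template.format(content=content, goal=schema["description"])
    result = llm.generate(prompt=prompt, json_schema=schema)
    merged.update(result)  # e.g. {"computer_languages": [...], "hardskills": [...]}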

@botka1998
Author


Hey @noobmldude
Yes, I would love to merge this, but I am not a contributor and do not have access rights. I would urge @1rgs to take a look at the PR and merge if they're happy with it.
I am also using jsonformer in a project and I don't like having to install it from my forked version just because I need this feature. I think it's a great and simple addition to jsonformer. Hopefully the owner agrees!

@noobmldude

Ok, looks like the latest commits are from last year.
I'm not sure if @1rgs is interested in maintaining it.
In that case, I would just use it from your fork, @botka1998.
For anyone else interested, I'm referring to this fork: https://github.com/botka1998/jsonformer

@wassname

I merged it into this branch, where I also added probabilities: https://github.com/wassname/prob_jsonformer
