Cannot generate parentheses in JSON strings #838

Closed
posionus opened this issue Apr 25, 2024 · 0 comments · Fixed by #899

Comments

@posionus
Contributor

Describe the issue as clearly as possible:

I am fine-tuning a model to convert non-standard JSONs to a standard schema. I'm using vLLM with Outlines to constrain the output generation so that it follows my desired schema. After a recent Outlines update, the model no longer seems able to generate parentheses in JSON strings. Some example diffs are below (a minus sign marks the expected line and a plus sign marks the actual output):

{
   "call_date": null,
   "cost": null,
   "coupon_max": null,
   "coupon_min": 0.05,
   "currency_code": null,
-  "description": "Bank of Nova Scotia (Dated 1/31/22, Repurchase Value $182,200,000, collateralized by U.S. Treasury Bill 0.000%, 2/24/22–6/16/22, and U.S. Treasury Note/Bond 0.125%–2.875%, 8/31/23–11/15/51, with a value of $185,844,000)",
+  "description": "Bank of Nova Scotia, Dated 1/31/22, Repurchase Value $182,200,000, collateralized by U.S. Treasury Bill 0.000%, 2/24/22–6/16/22, and U.S. Treasury Note/Bond 0.125%–2.875%, 8/31/23–11/15/51, with a value of $185,844,000",
   "face_amount": 182200000,
   "footnotes": [],
   "maturity_date_max": null,
   "maturity_date_min": "2/1/22",
   "percent_net_assets": null,
   "quantity": null,
   "reference_rate": null,
   "spread": null,
   "value": 182200000,
   "yield_max": null,
   "yield_min": null
}
 
{
   "call_date": null,
   "cost": null,
   "coupon_max": null,
   "coupon_min": 0.05,
   "currency_code": null,
-  "description": "Barclays Capital Inc. (Dated 1/31/22, Repurchase Value $2,900,000, collateralized by U.S. Treasury Note/Bond 3.000%, 5/15/47, with a value of $2,958,000)",
+  "description": "Barclays Capital Inc. [Dated 1/31/22, Repurchase Value $2,900,000, collateralized by U.S. Treasury Note/Bond 3.000%, 5/15/47, with a value of $2,958,000]",
   "face_amount": 2900000,
   "footnotes": [],
   "maturity_date_max": null,
   "maturity_date_min": "2/1/22",
   "percent_net_assets": null,
   "quantity": null,
   "reference_rate": null,
   "spread": null,
   "value": 2900000,
   "yield_max": null,
   "yield_min": null
}

{
   "call_date": null,
   "cost": null,
   "coupon_max": null,
   "coupon_min": null,
   "currency_code": null,
-  "description": "Other Assets and Liabilities (net)",
+  "description": "Other Assets and Liabilities",
   "face_amount": null,
   "footnotes": [],
   "maturity_date_max": null,
   "maturity_date_min": null,
   "percent_net_assets": 4.3,
   "quantity": null,
   "reference_rate": null,
   "spread": null,
   "value": 141357139,
   "yield_max": null,
   "yield_min": null
}

As you can see above, it will either drop parentheses, replace them with a similar character like a square bracket, or completely skip the text in parentheses. I have seen about 100 other similar examples.

This was not an issue for me previously, so I checked recent PRs in this repo to see if one of them affected the relevant code. This one seems to be the culprit: #829

Specifically the following change:

- STRING_INNER = r'(?:[^"\\\x00-\x1f\x7f-\x9f]|\\.)'
+ STRING_INNER = r'([^("\\\x00-\x1f\x7f-\x9f)]|\\\\)'

It looks like for some reason, opening and closing parentheses were added to the prohibited characters for strings.
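To make the difference concrete, here is a minimal sketch that compares which single characters each pattern accepts. The two patterns are copied from the change above; the helper `allowed_chars` and the candidate list are illustrative only and are not part of Outlines. Inside a character class, `(` and `)` are literal characters rather than grouping syntax, so in the new pattern they become part of the negated set.

```python
import re

# Patterns copied from the change quoted above.
OLD_STRING_INNER = r'(?:[^"\\\x00-\x1f\x7f-\x9f]|\\.)'
NEW_STRING_INNER = r'([^("\\\x00-\x1f\x7f-\x9f)]|\\\\)'


def allowed_chars(pattern, candidates):
    """Hypothetical helper: return the candidates the pattern fully matches."""
    compiled = re.compile(pattern)
    return [c for c in candidates if compiled.fullmatch(c)]


# A few characters that appear in the affected descriptions.
candidates = ["a", "(", ")", "[", "]", ","]

old_ok = allowed_chars(OLD_STRING_INNER, candidates)
new_ok = allowed_chars(NEW_STRING_INNER, candidates)

# Characters the old pattern accepted but the new one rejects.
rejected_only_by_new = [c for c in old_ok if c not in new_ok]
print(rejected_only_by_new)  # ['(', ')']
```

Square brackets and commas are still accepted, which would be consistent with the model substituting them for parentheses in the diffs above.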

I'm not confident enough to submit a PR to fix this bug because I'm not sure of the motivation behind the PR that caused it.

Steps/code to reproduce the bug:

import re

# new version (doesn't match parentheses): "(" and ")" are inside the negated class
print(re.match(r'([^("\\\x00-\x1f\x7f-\x9f)]|\\\\)', "("))  # prints None
# old version (matches parentheses)
print(re.match(r'(?:[^"\\\x00-\x1f\x7f-\x9f]|\\.)', "("))  # prints a Match object

Expected result:

The regex should match parentheses
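The expectation can also be checked on a whole JSON string rather than a single character. The sketch below assumes a full string is matched as a quote, zero or more `STRING_INNER` repetitions, and a closing quote; the helper `string_regex` is hypothetical and only stands in for however Outlines builds its string pattern. The test values are descriptions taken from the diffs in this report.

```python
import re

# Patterns copied from the change quoted in this report.
OLD_STRING_INNER = r'(?:[^"\\\x00-\x1f\x7f-\x9f]|\\.)'
NEW_STRING_INNER = r'([^("\\\x00-\x1f\x7f-\x9f)]|\\\\)'


def string_regex(inner):
    """Hypothetical helper: wrap an inner pattern into a quoted-string pattern."""
    return f'"{inner}*"'


expected = '"Other Assets and Liabilities (net)"'  # desired output
actual = '"Other Assets and Liabilities"'          # what the model produced

old_pattern = re.compile(string_regex(OLD_STRING_INNER))
new_pattern = re.compile(string_regex(NEW_STRING_INNER))

old_accepts_expected = old_pattern.fullmatch(expected) is not None
new_accepts_expected = new_pattern.fullmatch(expected) is not None
new_accepts_actual = new_pattern.fullmatch(actual) is not None

print(old_accepts_expected, new_accepts_expected, new_accepts_actual)
```

Under these assumptions the old pattern accepts the expected string, while the new pattern only accepts the version without parentheses, so constrained generation cannot emit the expected text.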

Error message:

No response

Outlines/Python version information:

0.0.40
Python 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]

Context for the issue:

No response

@posionus posionus added the bug label Apr 25, 2024
lapp0 added a commit to lapp0/outlines that referenced this issue May 17, 2024
lapp0 added a commit to lapp0/outlines that referenced this issue May 18, 2024
lapp0 added a commit to lapp0/outlines that referenced this issue May 18, 2024
rlouf pushed a commit that referenced this issue May 18, 2024
Fix #838


06d5654
erroneously disallowed parenthesis in strings. This PR allows
parenthesis in strings.