Created unified text file tokenizer #2953

Merged
merged 12 commits into from Jan 5, 2023

Conversation

ohlidalp
Member

@ohlidalp ohlidalp commented Oct 16, 2022

We have multiple formats with very similar syntax: rig def (truck), odef, tobj, character (see #2942). This new parser supports all these formats and adds new features:

  • Support for "quoted strings with spaces". I first implemented them in the new .character fileformat and I decided I want them everywhere, notably .tobj so that procedural roads can have custom names and descriptions for a GPS-like navigation system.
  • Simple editing in memory. The document is a vector of Tokens (types: linebreak, comment, keyword, string, number, boolean.) which keeps the exact sequence as in the file. Tokens can be easily added/modified/removed without writing any extra custom code.
  • Serializability - saving the (possibly modified) file is as simple as looping through the Tokens array and writing them to a file. No custom code needs to be written for any file format.
  • Ease of binding to AngelScript: a single API can modify any fileformat. The only operations the user needs are 1. insert token 2. modify token 3. delete token.
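
The token-vector model from the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `Token`, `Document`, `InsertToken`, `ModifyToken`, `DeleteToken` and `Serialize` are hypothetical, and the single-space separator and `;` comment prefix are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token model; all names here are illustrative.
enum class TokenType { LineBreak, Comment, Keyword, String, Number, Boolean };

struct Token
{
    TokenType   type;
    std::string text; // Raw text as read from the file; empty for LineBreak.
};

using Document = std::vector<Token>;

// The three editing operations reduce to plain vector operations.
void InsertToken(Document& doc, std::size_t pos, Token tok)
{
    assert(pos <= doc.size());
    doc.insert(doc.begin() + static_cast<std::ptrdiff_t>(pos), std::move(tok));
}

void ModifyToken(Document& doc, std::size_t pos, const std::string& text)
{
    assert(pos < doc.size());
    doc[pos].text = text;
}

void DeleteToken(Document& doc, std::size_t pos)
{
    assert(pos < doc.size());
    doc.erase(doc.begin() + static_cast<std::ptrdiff_t>(pos));
}

// Format-agnostic serialization: one pass over the tokens in file order.
std::string Serialize(const Document& doc)
{
    std::ostringstream out;
    bool atLineStart = true;
    for (const Token& tok : doc)
    {
        if (tok.type == TokenType::LineBreak)
        {
            out << '\n';
            atLineStart = true;
            continue;
        }
        if (!atLineStart)
            out << ' '; // Assumed separator; real formats also accept commas.
        if (tok.type == TokenType::Comment)
            out << ';' << tok.text;        // Assumed ';' comment prefix.
        else if (tok.type == TokenType::String)
            out << '"' << tok.text << '"'; // Quoting keeps embedded spaces intact.
        else
            out << tok.text;
        atLineStart = false;
    }
    return out.str();
}
```

Because line breaks and comments are tokens too, a load/edit/save round trip keeps the file's layout without any per-format writer code.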

Update: I added AngelScript bindings and extended the bundled 'demo_script.as' to showcase them. There's a new button "View document" (changes to "Close document" after pressing) which opens a separate window with a syntax-highlighted truck file. Update2: all glitches were fixed. It's ready for review.

New script API:

  • class GenericDocumentClass - loads/saves a document from OGRE resource system
  • class GenericDocReaderClass - traverses document tokens
  • enum TokenType (TOKEN_TYPE_*)
  • enum GenericDocumentOptions (GENERIC_DOCUMENT_OPTION_*)
  • void ImGui::AlignTextToFramePadding()
  • string Terrain::getTerrainFileName()
  • string Terrain::getTerrainFileResourceGroup()

To test, open game console and say loadscript demo_script.as.

@ohlidalp
Member Author

ohlidalp commented Oct 17, 2022

Looking at the serialization loop I wrote, I realize that you recognize good code not by diving into the details but by looking from a bird's-eye view, where you can barely read the characters, observing the overall shape and using common sense to guess what it probably does. If that journey is a pleasure and your guess turns out to be mostly correct, you've met good code.
Yes, I've had a 🤓 moment, judge me all you want.

@ohlidalp
Member Author

Purpose: to be able to read, edit and export any fileformat (TRUCK DEF, ODEF, TOBJ, CHARACTER) directly in game using AngelScript.

Estimate: Blocked by #2930. As soon as #2930 is done, this will take at most 1 man day to finish.

Work to be done:

  • Export elements in 'GenericFileFormat.h' to AngelScript.
  • Add AngelScript API Document@ game.loadDocument(filename, rg_name) and void game.saveDocument(Document@) that will check if the file is unzipped and if yes, load/save the document.
  • Extend the 'demo_script.as' to showcase this new feature.

@CuriousMike56
Collaborator

Viewing truck file of any 'complex' vehicle -> ~30 fps drop:
No frame drops with the DAF Semi.
Also much of the truck file doesn't match the actual file:

@ohlidalp
Member Author

ohlidalp commented Dec 11, 2022

@CuriousMike56 Thanks for testing.

> Viewing truck file of any 'complex' vehicle -> ~30 fps drop:

Makes sense: the script traverses the entire document every frame, so you get the combined overhead of DearIMGUI and string allocation by AngelScript. I'd have to optimize it like in the Console UI: artificially scale the scrollbar and only populate the visible part of the document. But I'd like to do that in another PR since it's just a demo.
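
The "only populate the visible part" idea boils down to mapping the scroll offset to a range of lines. Below is a minimal sketch of that arithmetic, assuming fixed-height lines; `VisibleRange` and `ComputeVisibleLines` are hypothetical names and not part of the game's or DearIMGUI's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Half-open range [first, last) of document lines worth submitting this frame.
struct VisibleRange
{
    std::size_t first;
    std::size_t last;
};

// Hypothetical helper: maps a scroll offset (in pixels) to the visible lines.
VisibleRange ComputeVisibleLines(std::size_t totalLines, float scrollY,
                                 float viewHeight, float lineHeight)
{
    assert(lineHeight > 0.f);
    assert(scrollY >= 0.f && viewHeight >= 0.f);

    std::size_t first = static_cast<std::size_t>(scrollY / lineHeight);
    // +2 covers lines that are only partially visible at the top and bottom.
    const std::size_t count = static_cast<std::size_t>(viewHeight / lineHeight) + 2;

    first = std::min(first, totalLines);
    const std::size_t last = std::min(totalLines, first + count);
    return {first, last};
}
```

The scrollbar would still be sized as `totalLines * lineHeight`, but only the tokens on lines in the returned range would be turned into strings and drawn, so the per-frame cost no longer grows with the size of the file.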

> Also much of the truck file doesn't match the actual file:

Interesting, I see 3 problems:

  • [fixed] For each number, only the first digit is displayed - probably a tokenizer bug.
  • [fixed] set_*_defaults are omitted entirely - absolutely no idea why. UPDATE: it was a tokenizer bug, keywords containing underscore got discarded.
  • Empty lines are not preserved - this is a script issue, it relies on imgui not doing SameLine() which works for single linebreaks but squashes multiple linebreaks.

@ohlidalp ohlidalp force-pushed the unitok branch 4 times, most recently from d5c67fa to 179bd86 Compare December 13, 2022 00:44
We have multiple formats with very similar syntax: rig def (truck), odef, tobj, character (see RigsOfRods#2942). This new parser supports all these formats and adds new features:
 - support for "quoted strings with spaces". I first implemented them in the new .character fileformat and I decided I want them everywhere, notably .tobj so that procedural roads can have custom names and descriptions for a GPS-like navigation system.
 - Editable memory representation. The document is a vector of Tokens (types: linebreak, comment, keyword, string, number, boolean.) which keeps the exact sequence as in the file. Tokens can be easily added/modified/removed without writing any extra custom code.
 - Serializability - saving the (possibly modified) file is as simple as looping through the Tokens array and writing them to a file. No custom code needs to be written for any file format.
 - Ease of binding to AngelScript: a single API can modify any fileformat. The only operations the user needs are 1. insert token 2. modify token 3. delete token.
New script API:
* class GenericDocumentClass - loads/saves a document from OGRE resource system
* class GenericDocReaderClass - traverses document tokens
* enum TokenType (TOKEN_TYPE_*)
* enum GenericDocumentOptions (GENERIC_DOCUMENT_OPTION_*)
* function ImGui::AlignTextToFramePadding()

New features of demo_script.as:
* a "View document" button next to the vehicle name - it will tokenize the truck definition file and open a separate window with syntax-highlighted file contents.

Known issues:
* DearIMGUI windows opened by script can't be closed with X button. This is a global flaw in our DearIMGUI integration
* GenericDocument doesn't understand the truck title (first nonempty noncomment line).
* There are glitches in parsing naked strings - an extra bool is emitted instead.
* There are glitches in parsing keywords - a string is emitted instead.

Codechanges:
* GenericFileFormat: added RefCountingObject logic. Renamed Document to GenericDocument, added funcs {LoadFrom/SaveTo}Resource(). Renamed Reader to GenericDocReader, GetArg*() funcs renamed to GetTok*() - the Arg naming was to match existing parsers which is now moot. Added {Get/Is}TokComment(). Renamed GetType() to GetTokType().
* Actor + ActorAngelscript: added func getTruckFileResourceGroup() - name chosen to match existing getTruckFileName().
* ImGuiAngelscript - added binding of ImGui::AlignTextToFramePadding()
Codechanges:
* GenericFileFormat: new option "first line is title", new func GetPos()
* GenericFileFormatAngelscript: bindings for new stuff
* demo_script.as: use new feats to correctly display title string.
fixup FIRST LINE IS TITLE: skip empty lines
fixup DiscontinueBool() - missing linePos++
fixup "number first digit" - missing linePos++
This is to support the quirky syntax of 'axles' and 'interaxles' in truck file format
```
axles
w1(91 95), w2(90 94), d(s)
```
With the NAKED_STRINGS | PARENTHESES_CAPTURE_SPACES options enabled, this will parse as 
```
axles
"w1(91 95)", "w2(90 94)", "d(s)"
```
@ohlidalp
Member Author

ohlidalp commented Dec 15, 2022

Truck definition format parsing is complete.

Note the option 'ALLOW_NAKED_STRINGS' must be on at all times, because otherwise the parser requires all strings to be in quotes. Also see these special cases:

  1. The first line is the actor name, with spaces and special characters, possibly including multiple quotes. I added an option "FIRST_LINE_IS_TITLE" which parses it as one string.
  2. In forset, single node numbers are parsed as NUMBER. Node ranges are not valid numbers, so they decay to STRINGs.
  3. The axle and interaxle keywords have a quirky abc(1 2) syntax. I added an option 'PARENTHESES_CAPTURE_SPACES' which makes the whole expression parse as one string.
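
Special cases 2 and 3 can be illustrated with a simplified splitter. This is only a sketch under assumptions: `SplitLine` and `IsNumber` are hypothetical names, separators are limited to space, tab and comma, and "valid number" is approximated with `std::strtod`, which is more permissive than the real tokenizer is likely to be.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

enum class TokenType { Number, String };

struct Token
{
    TokenType   type;
    std::string text;
};

// A token counts as a number only if the whole text parses as one,
// so a node range such as "3-6" decays to a string.
bool IsNumber(const std::string& text)
{
    if (text.empty())
        return false;
    char* end = nullptr;
    std::strtod(text.c_str(), &end);
    return end == text.c_str() + text.size();
}

// Splits one line into naked (unquoted) tokens. With parenthesesCaptureSpaces
// enabled, separators inside '(' ... ')' become part of the token.
std::vector<Token> SplitLine(const std::string& line, bool parenthesesCaptureSpaces)
{
    std::vector<Token> tokens;
    std::string cur;
    int depth = 0;

    auto flush = [&]() {
        if (!cur.empty())
        {
            tokens.push_back({IsNumber(cur) ? TokenType::Number : TokenType::String, cur});
            cur.clear();
        }
    };

    for (char c : line)
    {
        if (parenthesesCaptureSpaces && c == '(')
            ++depth;
        if (parenthesesCaptureSpaces && c == ')' && depth > 0)
            --depth;

        const bool isSeparator = (c == ' ' || c == '\t' || c == ',');
        if (isSeparator && depth == 0)
            flush();
        else
            cur += c;
    }
    flush();
    assert(cur.empty()); // flush() leaves no pending text behind.
    return tokens;
}
```

With the option on, `w1(91 95), w2(90 94), d(s)` yields three string tokens; with it off, the same line splits into five fragments. A line such as `1, 3-6, 9` yields NUMBER, STRING, NUMBER.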

@ohlidalp
Member Author

Originally I only planned to cover fileformats which are similar to each other (truck, odef, tobj). Then I wanted the demo script to display TOBJ files too, without clogging the Terrain API with getFileWhatever() funcs. The idea was to use the GenericDocument to parse the TERRN2 file, get the TOBJ filenames from there and parse those as well. It turned out to be pretty straightforward: the tokenizer was already robust, so it only needed a few extra options.

  • Parsing TERRN2 format needs these special options: "ALLOW_HASH_COMMENTS", "ALLOW_SEPARATOR_EQUALS", "ALLOW_BRACED_KEYWORDS" and of course "ALLOW_NAKED_STRINGS".

  • Parsing TOBJ format needs just "ALLOW_NAKED_STRINGS" and "ALLOW_SLASH_COMMENTS".
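
The per-format option sets above behave like bit flags that get OR-ed together. A minimal sketch follows; the option names mirror the ones mentioned in this thread, but the numeric values, the `FileFormat` enum and the helpers `OptionsFor` and `HasOption` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Option names follow this thread; the bit values are assumed.
using OptionFlags = std::uint32_t;
constexpr OptionFlags OPTION_ALLOW_NAKED_STRINGS    = 1u << 0;
constexpr OptionFlags OPTION_ALLOW_SLASH_COMMENTS   = 1u << 1;
constexpr OptionFlags OPTION_ALLOW_HASH_COMMENTS    = 1u << 2;
constexpr OptionFlags OPTION_ALLOW_SEPARATOR_EQUALS = 1u << 3;
constexpr OptionFlags OPTION_ALLOW_BRACED_KEYWORDS  = 1u << 4;

// Hypothetical enum naming the two formats discussed above.
enum class FileFormat { Terrn2, Tobj };

constexpr bool HasOption(OptionFlags flags, OptionFlags opt)
{
    return (flags & opt) == opt;
}

// Returns the tokenizer options each format needs, as listed above.
OptionFlags OptionsFor(FileFormat format)
{
    switch (format)
    {
    case FileFormat::Terrn2:
        return OPTION_ALLOW_NAKED_STRINGS | OPTION_ALLOW_HASH_COMMENTS |
               OPTION_ALLOW_SEPARATOR_EQUALS | OPTION_ALLOW_BRACED_KEYWORDS;
    case FileFormat::Tobj:
        return OPTION_ALLOW_NAKED_STRINGS | OPTION_ALLOW_SLASH_COMMENTS;
    }
    assert(false && "unhandled FileFormat");
    return 0;
}
```

Keeping each quirk behind its own flag means a new format is usually just a new combination of existing options rather than a new parser.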

New script API:
    * GENERIC_DOCUMENT_OPTION_ALLOW_BRACED_KEYWORDS, //!< Allow INI-like '[keyword]' tokens.
    * GENERIC_DOCUMENT_OPTION_ALLOW_SEPARATOR_EQUALS, //!< Allow '=' as separator between tokens.
    * GENERIC_DOCUMENT_OPTION_ALLOW_HASH_COMMENTS //!< Allow comments starting with `#`.     
    * string Terrain::getTerrainFileName()
    * string Terrain::getTerrainFileResourceGroup()

'demo_script.as' extended to allow viewing TERRN2 and TOBJ files.
Collaborator

@CuriousMike56 CuriousMike56 left a comment

Seems to work fine 👍

@ohlidalp ohlidalp merged commit e866683 into RigsOfRods:master Jan 5, 2023
@ohlidalp ohlidalp deleted the unitok branch January 5, 2023 12:51