[wip] Add: first draft of parsing basic structure elements #7

Closed
wants to merge 1 commit into from

gcentauri

I realized that the existing org.ebnf file is just aiming at parsing individual lines. I'm sure this will be useful, but as I started looking at property drawers I decided to try working top-down based on the org specification.

This is just a start, but is one step on the way of building the tree structure. Currently, it does not properly switch heading levels while reading the tree. I kept everything in separate files for the time being, as it was easier for me to deal with while learning how everything works.

@munen
Contributor

munen commented Feb 10, 2020

Looks like a great start! 👍🙏🏻

@gcentauri
Author

gcentauri commented Feb 18, 2020

I haven't had a lot of time to work on this recently, but I was getting stuck on how to properly define the nested structure of an org file with instaparse. Should it be possible to describe this with EBNF? I was assuming it can be due to this part of the syntax specification:

> A core concept in this syntax is that only headlines, sections, planning lines and property drawers are context-free. Every other syntactical part only exists within specific environments.

But that may assume we just take the headlines and count stars, and perhaps process the data structure into a tree after parsing the document. I was having trouble figuring out how to determine if we're jumping up just one level when the next headline comes or if we're coming out multiple levels of nesting, for example:

* One
** Two
*** Three
* One, Again

My original attempt worked to go from one to two and back, but adding a third level subheading showed me it was naive. The parser just goes up one level.
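To make the "how many levels do we leave?" question concrete, here is a minimal sketch in plain Clojure. It assumes the line-based parse already gives us the raw headline lines; `headline-level` and `open-levels-after` are hypothetical helper names, not part of this PR or of instaparse. The idea is to keep the list of currently open levels and, on each new headline, close every open level that is not strictly smaller than the new one.

```clojure
(defn headline-level
  "Returns the number of leading stars of an Org headline line, or nil
  when the line is not a headline."
  [line]
  (when-let [[_ stars] (re-matches #"(\*+) .*" line)]
    (count stars)))

(defn open-levels-after
  "Given the vector of currently open levels (ascending) and the level of
  the next headline, closes every open level >= the new level and then
  opens the new one."
  [open level]
  (conj (vec (take-while #(< % level) open)) level))

(def lines ["* One" "** Two" "*** Three" "* One, Again"])

;; Open levels after each headline; the first element is the initial state.
(reductions open-levels-after [] (map headline-level lines))
;; => ([] [1] [1 2] [1 2 3] [1])
```

The last step goes from `[1 2 3]` straight to `[1]`, which is the multi-level jump that a grammar rule stepping up a single level misses.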

I don't want to invest too much time into it if it isn't going to work, so if anyone here has thoughts I'd love to hear them :) Thank you!

@schoettl
Collaborator

Hi @gcentauri ,
I did not look into your PR in great detail. But I think the missing link is the transform function from instaparse.

http://xahlee.info/clojure/clojure_instaparse.html -> Function: transform

Haven't tried it yet, but it seems like the map argument to this function is the place where we transform the very basic parsed structure to a higher-level structure.

E.g. from

[:headline [:stars "**"] [:title "test"]]

to

{title: "test", level: 2} // JS hash, I don't know yet the corresponding clojure syntax ^^
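In Clojure the target would be a map with keyword keys. The following is a hand-written sketch of that mapping using plain destructuring; `headline->map` is a hypothetical name. instaparse's `transform` takes a map from node keyword to function and could express the same thing, but this version avoids the dependency so it runs on its own.

```clojure
(defn headline->map
  "Turns a hiccup-style [:headline [:stars s] [:title t]] node into a map
  holding the title and the headline level (number of stars)."
  [[_ [_ stars] [_ title]]]
  {:title title
   :level (count stars)})

(headline->map [:headline [:stars "**"] [:title "test"]])
;; => {:title "test", :level 2}
```

This assumes the node always has exactly the `:stars` and `:title` children in that order; a real headline node with keywords, priorities or tags would need a more tolerant function.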

Currently I'm working on the timestamps PR. I probably need to transform them to a higher-level structure, too.

@gcentauri
Author

@schoettl - thanks for the insight! i was getting that feeling too, i'm just very new to parsing stuff.

so it seems like indeed the line-based approach might be the first pass, and then we have another pass to take the structure generated by that and turn it into the proper tree structure?

i'd like to get back to this soon. i just felt stuck.

@schoettl
Collaborator

I think you're right about that second transforming pass on the parse tree. I opened #8 to discuss how far this transformation is in scope of this project.

Anyway, I think the plain parsing to a flat list of headers is a very important first step!

@branch14
Member

@gcentauri, @schoettl First of all, let me thank you for your interest and work in this project. Your thoughts and efforts are greatly appreciated!

In my first attempts I actually did follow the idea to identify semantic blocks rather than "only" lines. But it turned out to get tricky rather quickly. While I didn't encounter any formal reason not to continue with semantic blocks, I felt that it would make it really hard for others to contribute. Hence I decided to proceed with the much simpler line based approach (or as I put it in 4a4563f "the sane way"). Org-mode is a line based format where greater blocks and other semantic units are made up of lines after all. I expect following that observation for building a parser will keep things simple.

As the parse tree that results from a line based approach does not yield the data structure that resembles the document nicely (i.e. is the structure one would like to work with) a 2nd step "transform" will be required, much like @schoettl pointed out.

While Instaparse's transform function is nice, I don't feel there is much gain in using it on a line based parse tree. Instead a couple of classical map and reduce should just do the job. I'll be happy to provide an example how to do that.
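As a rough illustration of such a second pass, here is one possible shape in plain Clojure. It is only a sketch: it assumes the first pass has already produced a flat sequence of headline maps with `:level` and `:title` keys, the name `nest` is hypothetical, and it uses recursion with `split-with` rather than a literal `reduce`.

```clojure
(defn nest
  "Nests a flat seq of {:level n, :title s} maps into a tree. Every
  headline collects the following deeper headlines as its :children."
  [headlines]
  (when-let [[h & more] (seq headlines)]
    (let [[children siblings] (split-with #(> (:level %) (:level h)) more)]
      (cons (assoc h :children (vec (nest children)))
            (nest siblings)))))

(def flat
  [{:level 1 :title "One"}
   {:level 2 :title "Two"}
   {:level 3 :title "Three"}
   {:level 1 :title "One, Again"}])

(nest flat)
;; => ({:level 1, :title "One",
;;      :children [{:level 2, :title "Two",
;;                  :children [{:level 3, :title "Three", :children []}]}]}
;;     {:level 1, :title "One, Again", :children []})
```

The recursion is not tail-recursive, so a very long or very deep document would call for a stack-based variant; for a sketch of the idea it keeps the code short.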

Having said that and looking at the progress you made with #11, I'm totally open to other approaches. The PR reminds me of my attempts just before I gave up on the idea to have the grammar do the heavy lifting and decided to go with the simpler line based and a subsequent transformation. So I wonder where you're at.

@branch14
Member

Here I laid out what the code for the transformation could look like: #15

@schoettl
Collaborator

> Having said that and looking at the progress you made with #11, I'm totally open to other approaches. The PR reminds me of my attempts just before I gave up on the idea to have the grammar do the heavy lifting and decided to go with the simpler line based and a subsequent transformation. So I wonder where you're at.

I think that it's good to combine both approaches: parsing of semantic blocks where possible, and line-based parsing where it gets messy with EBNF.

I already wrote EBNF for property-drawers and I think it's pretty clean. For example, tables should be easy to implement as semantic objects in EBNF, too. Same goes for "verbatim containers" like #+BEGIN_EXAMPLE. Here it makes sense to parse the contents directly as raw text.

On the other hand, if we do that stuff in the transformation step it's much more coding with conditionals, map/reduce, ... Similar to what is implemented in organice or other orgmode parser libraries.

But I agree, using EBNF can get messy or impractical. One example is #+BEGIN_xxx and #+END_xxx, where the xxx can be anything but must match. AFAIK this cannot be accomplished in our EBNF unless we hardcode all possible xxx (src, example, center, quote, ...).
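For comparison, the matching check is straightforward once it moves out of the grammar and into code. The sketch below works on plain lines and is only an illustration: `block-marker` and `unclosed-block` are hypothetical names, blocks are assumed not to nest, and a stray `#+END_xxx` with no open block is simply ignored.

```clojure
(require '[clojure.string :as str])

(defn block-marker
  "Returns [:begin name] or [:end name] for a #+BEGIN_xxx / #+END_xxx
  line (case-insensitive, name lower-cased), or nil for any other line."
  [line]
  (when-let [[_ kind block-name]
             (re-matches #"(?i)\s*#\+(BEGIN|END)_(\S+).*" line)]
    [(keyword (str/lower-case kind)) (str/lower-case block-name)]))

(defn unclosed-block
  "Walks the lines and returns the name of a block that is still open at
  the end, or nil. While a block is open, only the END marker with the
  same name closes it; everything else counts as block content."
  [lines]
  (reduce (fn [open line]
            (let [[kind block-name] (block-marker line)]
              (cond
                (and (nil? open) (= kind :begin))               block-name
                (and open (= kind :end) (= block-name open))    nil
                :else                                           open)))
          nil
          lines))

(unclosed-block ["#+BEGIN_SRC clojure" "(+ 1 2)" "#+END_SRC"])
;; => nil

(unclosed-block ["#+BEGIN_QUOTE" "some text" "#+END_SRC"])
;; => "quote"
```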

So I'd vote for putting as much "syntax comprehension" in the EBNF as can be expressed cleanly. The rest can be done in the transformation.

@munen
Contributor

munen commented May 17, 2020

> I think that it's good to combine both approaches: parsing of semantic blocks where possible, and line-based parsing where it gets messy with EBNF.
>
> ...
>
> So I'd vote for putting as much "syntax comprehension" in the EBNF as can be expressed cleanly. The rest can be done in the transformation.

I discussed this with branch14 and we both agree. This is a sane and pragmatic approach. Let's continue like this.

It's also very nice that there are good examples for both options now^^

@munen munen mentioned this pull request May 17, 2020
@gcentauri
Author

I'd like to get back to this sometime :) been busy with the crazy year that has been 2020. But Lisp keeps coming back to me and org mode has always been a love of mine too. i'll keep watching the repo and see where i can help. it was probably a bit impetuous of me to think i could figure out how to do the top down parsing over the line-based approach already begun :)

@schoettl
Collaborator

You triggered a very good discussion @gcentauri :) A lot has happened since then. I'll get back to #11 as soon as I can. It probably makes sense to build on that one to prevent conflicts.

@munen
Contributor

munen commented Jun 18, 2020

Thank you for your contribution, @gcentauri! All the best to you and your family 🙏

@schoettl
Collaborator

Hey @gcentauri,

I suggest we close this PR.

A lot has changed since last year. We have now laid out a structure for the parse result (#31). I've also implemented parsing some block-like elements as semantic units (instead of line-based parsing). This semantic parsing has to be enabled step-by-step (#32). I'll start with that after other open PRs are merged.

@gcentauri gcentauri closed this May 27, 2021
@branch14
Member

🙏


4 participants