GSoC 2023 Report Tirthankar Mazumder: Rewriting the LaTeX parser in Lark
At the time of writing this, I am in my final year as an undergraduate mathematics student at the Indian Institute of Technology, Bombay.
In this page, I'll talk about all the work that I did in my GSoC. This report is cross-posted from my blog, where the final report was originally posted.
The original plan was to build off the work done by costrouc in #19825, and finish up the Lark port of the ANTLR-based parser.
There were three issues:
- The PR removed the ANTLR-based
$\mathrm{\LaTeX}$ parser completely, even though the Lark-based parser was still a WIP (work-in-progress) at the time. - Related to the first point, the commit history was a mess. There were commits which did many things at once.
- This was more of an issue on my side, but the way the work had been done made it very hard to follow and made it really hard for me to make changes and modifications.
I started my GSoC off with #25210, where I simply took the original PR, #19825, and rebased the current master
on top of the 3 year old branch.
Unfortunately, there were many many merge conflicts when I rebased master
on top of the branch. Fixing them took around an hour for me, and I am fairly certain that I had made a few mistakes while fixing merge conflicts, so this attempt would have been a non-starter anyways.
When my GSoC mentor Francesco Bonazzi looked over the work I had done in the first week during a meet, he noticed that the PR removed all the ANTLR-based parser support. He stressed that we should not remove what already works until we have a well-tested and feature-complete alternative. I mentioned that the original PR, #19825, already removed all the ANTLR-based
After we looked at the commit history of #19825, we realized that a git cherry-pick
would not work because of the 2nd point I mentioned above: The commit history was a mess, and there were commits which did many things at once. For example, along with adding the initial scaffolding for the Lark based parser, he also removed some ANTLR-related files. Then, in another commit, he modified the documentation to mention the Lark-based parser instead of the ANTLR-based parser.
Thus, this branch was a false start. I started over on a new branch, which led to my...
Initially, the work on this branch was going great: I had manually copied over the relevant files from #19825, and manually added costrouc as a coauthor in the commit. My mentor was also happy with this branch at first, as can be seen in this comment on the PR.
In 56b4f46
, I regenerated the standalone grammar file for Lark, which allowed SymPy users to use the parser without having Lark installed on their computer. This ended up being a mistake on my part, because as Aaron Meurer commented in this comment, the license that Lark used for its standalone grammar file was incompatible with SymPy's own BSD 3-clause license.
He was a little annoyed that I hadn't done my due diligence and gone through the relevant issues and discussions in the PRs before starting work on the project. After my faux pas, I made sure to go through the important issues related to my project like #19528, #14004, and the long discussion in #13706, which is the PR that originally added the ANTLR-based parser to SymPy.
Work was going great on this PR until I realized that, for some mysterious reason, I couldn't add even simple things like allowing
\ne
and \ge
as alternatives for \neq
and \geq
. What was even weirder was the fact that when I tried out the same changes on my own branch, it worked.
Moreover, there was no rhyme or reason to which expressions were failing and which expressions were passing. For example, the partially-finished parser was able to parse \binom{n}{k}
correctly, but was failing on \binom{n}{0}
. Similarly, it was able to parse 2x
but was choking on 3x - 1
.
Hence, this attempt was also a non-starter, and I started another branch.
This time, I decided to start from scratch because there were some oddities in the previous work that were preventing me from making even simple changes. This new branch became PR #25324 over time, which was eventually merged into master
.
Making a baseline
As can be seen in commit 7863b84
, this PR had truly humble origins. It only supported things like expressions with addition, subtraction, multiplication, division, and relations. The atoms on either side could only be integers or single-letter symbols. In essence, it only supported things like:
$x + y$ $3 * y$ $z / 9$ $3 < t$ $x \geq y$
I had learned from my previous GSoC that a big and extensive test suite can greatly simplify the development experience, especially when adding new features. For this reason, I added a test suite early in the PR. In commit 1f5f790
, I copied over the entire ANTLR test suite and ran it on the Lark parser.
Having the test suite had another benefit: I now had a concrete way of measuring my progress over time. As I added more and more to the Lark-based parser, fewer tests would fail and I could use this to quantify my progress per week and report back to Francesco.
After that, I copied over the entire tokens list from #19825. Even though its grammar and transformer code might have been impenetrable for me, its tokens list was correct and perfectly reusable. I also made sure to not reinvent the wheel and opted to use built-in tokens for integers and floats from the common Lark token library. I made these changes in 2aa0a2e
.
I had a strong base set up, but I was unsure about how to continue and was dragging my feet. During one of the early team meetings, Francesco shared the brilliant idea of having the Lark grammar file closely match
I was initially opposed to the idea, but I slowly came to see its merits. After that, I was able to increase my progress greatly and I added support for many things, like binomial expressions (added in 50d7706
) and logarithms (added in 465c905
), most integral expressions (added in f966dab
), and more.
I soon ran into another issue: Rules for things like addition must be defined recursively. However, we were using the Earley parser, which tries all the possibilities and returns parse trees which match every possible interpretation of the grammar rules, when multiple such interpretations exist. This lead to many parse trees for simple things like addition and multiplication, even though all of them would return the same result ultimately.
My rule for addition was something like add: expression ADD expression
. With an input expression like
flowchart TB
subgraph Interpretation 2
direction TB
A(+)-->B(a)
A(+)-->C(+)
C-->D(b)
C-->E(c)
end
subgraph Interpretation 1
direction TB
G(+)-->H(+)
H-->I(a)
H-->J(b)
G-->K(c)
end
where the first interpretation corresponds to
In general, in an expression with
The way to handle this is well-known within the parsing community, but when my mentor first demonstrated the idea, it blew my mind. The solution is so simple, and yet so profound: Use a hierarchy of expressions.
In essence, you first choose whether you want the expression to be left- or right-associative. Once you have made your choice, you make the other side of the expression something of higher priority.
For example, I wanted addition to be left-associative in the add
rule to have something like expression_mul
, where expression_mul
can only bind to things with a priority equal to or greater than that of multiplication.
This is an important idea to understand, because this idea was used liberally in the Lark-based
An important change I made was when I refactored the tests. In 14fffa6
, I broke up the tests into many different functions, each of which stress tests the parser against a specific class of expressions.
There were multiple reasons I made the change:
- Before the change, the entire test suite was in a list of tuples where the first element is the input
$\mathrm{\LaTeX}$ string and the second element is the expected SymPy object. while this is a reasonable way to organize the tests overall, breaking up the test cases into separate categories and testing the$\mathrm{\LaTeX}$ parser against each category allows the developer to have a better understanding of where the parser is failing. - There was no rhyme or reason to where each test was placed. Before my change, the tests layout has most of its integral-related tests here with a lone integral test here.
- Having one function for each type of test allows us to easily see which aspects of the parser are sufficiently tested, and which parts have barely any tests. For example, while I broke out the trigonometric function tests into its own function, I realized that things like
\sin^2 x
and hyperbolic functions like\sinh x
are not tested at all.
So far, I had only been working on shaping the parse tree that was generated from the input
The full test suite which I ported over from the ANTLR-based parser's test suite had 185 test expressions. Of them, the original PR was parsing and returning the correct SymPy expression for 60 test cases. Once I had surpassed the old PR in terms of number of passing tests, I started working on the transformer.
Even though the parse tree was correct for these tests, I still had to write the transformer which would walk the tree in a bottom-up fashion and generate the SymPy expression. So, I got to work on the transformer class and started filling in the nodes one by one.
Once I had most of the parse tree the way I wanted it, I folded many of the nodes for performance and because there was no need to write special code to handle them. I was first introduced to the idea of "folded nodes" by this comment in one of the false start PRs. I had kept this idea in mind the whole time because it would simplify the transformer greatly.
Any node which did not need special handling, like most of the expression_*
nodes, were folded.
After this, and a bunch of smaller changes like allowing Greek symbols and allowing subscripts for variable names, the PR was finally merged, which gave us a baseline implementation for the Lark-based
The rest of the work that went into the subsequent PRs to the Lark
I decided to not talk about them in this blog post because the exposition for those PRs would be dry reading, and because my explanations would mainly touch upon implementation details which are not of interest to anyone except prospective developers for the Lark
During my GSoC period, most of my time went into working on the
- #25324: Added a preliminary Lark $\mathrm{\LaTeX}$ parser implementation.
-
#25515: Removed
evaluate=False
from the Lark-based $\mathrm{\LaTeX}$ parser. - #25535: Refactored Transformer code and fixed a minor inconsistency.
-
#25552: Added support for
Min
,Max
,Bra
, andKet
. - #25269: Simplified Lark $\mathrm{\LaTeX}$ parser testing framework.
- #25622: Added error messages to the Lark $\mathrm{\LaTeX}$ parser.
- #25626: Made many miscellaneous improvements to Lark $\mathrm{\LaTeX}$ parser.
After the baseline implementation of the Lark-based
To that end, #24116 caught my eye, because it was an issue which had been stalled by the ongoing but incomplete work of creating the Lark-based {x + y}z
.
So, I made a PR which addressed #24116 and changed the default behavior of the
While working on the evaluate=False
PR, there was a small issue which was pointed out. I was returning a Python int
instead of a SymPy Integer
for expressions that the latex_parser.py
.
After that, I shifted my focus to adding new features to the Lark-based Min
, Max
, Bra
and Ket
. While the ANTLR parser already supported the latter two features, there was an issue here which asked for Min
and Max
. Since, in the Lark-based parser Min
and Max
could use the same scaffolding that applied functions (i.e. expressions like f(x, y, z)
) use, I decided to add the feature.
Solving existing issues that were opened for the ANTLR-based parser in the Lark-based was a recurring theme in my GSoC journey, which I'll talk about in an upcoming section.
After that PR, I noticed that many things which I was using in the tests were duplicated, so I opened #25569 to minimize the amount of code duplication.
#25622 added descriptive error messages to the Lark
Finally, the last PR I made during the GSoC period was #25626, which was a pretty substantial PR. I addressed many things which had been bugging me, and also incorporated suggestions like the one in this comment in the PR. I fixed many of the names (in variables and rules in the grammar, for example) that were bothering me and swapped them out for more appropriate ones. In addition that, I also added some important features that the ANTLR-based parser Lark
where the differential is in the numerator.
There were a bunch of other changes in #25626 but they were implementation related, so I won't talk about them in this blog post.
This section lists all the PRs that I made which had no direct relevance to the
The first PR was something I found while going through old codegen related issues, and noticed that it was already fixed in master
. I left a comment on the corresponding issue, and was told that we could close the issue after I wrote a regression test to prevent this issue from reoccurring in the future, which is what I did in #25189.
I was still looking at codegen issues at the time because I had enjoyed working on #24954, and wanted to contribute to the codegen submodule a bit more.
After that, I submitted #25205 because it was a very easy issue and it also allowed me to get familiar with the way the documentation works in SymPy. I knew that I would eventually have to write some documentation for the Lark-based
The reason why #25292 was made is because I was looking at the MathML printer due to some issue that had been raised, and I noticed that there was a hack to workaround a bug in the implementation of the xml
module in the Python standard library. However, the Python versions that were affected by the bug were Python 3.2 and 3.3. Since the minimum version of Python that was supported at the time of the PR was 3.8, the fix was unnecessary now and removing the code simplified the MathML printer, so I removed it.
Finally, I submitted #25295 because I myself needed to find the SymPy logo in order to create the cover image for this blog post and the previous one. Since there was no page in the SymPy docs with the logo, I had a difficult time finding it at first. When I came across the relevant issue, I decided to make a PR for it so that others wouldn't have to struggle with finding the logo image.
At some point during the GSoC, my mentor asked me to open issues in the SymPy Issue Tracker so that we could keep track of all the progress and keep track of what's left to do in the project.
To that end, I opened a bunch of issues for different things: I opened some issues to keep track of the work that remained to be done in the testing infrastructure, features that needed to be added to the new
- #25365: Master Issue for my GSoC Project, Rewriting the $\mathrm{\LaTeX}$ Parser in Lark.
- #25364: Adding features to the Lark $\mathrm{\LaTeX}$ Parser.
- #25366: Improve the $\mathrm{\LaTeX}$ Parser tests.
-
#25419:
\leqq
and\leqslant
for $\leq$,\geqq
and\geqslant
for $\geq$. -
#25420: Logarithm bases for
\lg
and\log
. - #25482: Ambiguous Expressions in the Lark-based $\mathrm{\LaTeX}$ Parser.
- #25484: Documenting the Lark $\mathrm{\LaTeX}$ Parser.
- #25676: Adding features to the Lark $\mathrm{\LaTeX}$ Parser.
Since most of these issues are self-explanatory, I won't be adding a separate breakdown section. If any reader is interested in any of the issues, they can simply go and check out the issue, which will have a detailed description.
#25364 and #25676 have the same issue title because the former was written near the beginning of the project. As such, it is now outdated because many of the features listed have been implemented. For this reason, I created the latter issue to keep track of the features that are still not implemented.
As I had mentioned above, solving existing issues that were opened for the ANTLR-based parser in the Lark-based was a recurring theme in my GSoC project. I used the following search query on SymPy's GitHub issue tracker to find problems that people had found in the ANTLR-based parser, and I did my best to fix or address them in the Lark-based parser.
Here's a list of the things which are fixed in the Lark-based parser. Most of these issues are still not marked as resolved because they still exist in the ANTLR-based parser.
Once again, these issues are mostly self-explanatory so I won't add a separate breakdown section for this.
The Lark-based
There are a couple of things that could be improved upon, however.
The Lark parser has an issue with expressions like f(x)
, f
is a variable name or a function. Technically, both interpretations are valid absent any other context. However, due to conventions, most people would interpret it as a function
Currently, this is not handled in the Lark-based parser. A discussion about the best and most reasonable way to handle it needs to occur, and then changes need to be made.
When I brought this up, one of the solutions that were proposed were to set aside specific single-letter names like
Parsing
There have been solutions suggested, such as making
I think that any future work mostly be directed towards handling the two issues I pointed out above. Another viable area which would benefit from further work is adding as much user customizability to the parser as possible, in a sensible fashion.