Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils: <math> formulas rendered to SVGs without using LaTeX tools #1432

Merged
merged 25 commits into from
Dec 14, 2022

Conversation

BoboTiG
Copy link
Owner

@BoboTiG BoboTiG commented Nov 16, 2022

Fixes #1427.
Fixes #1198.
Closes #1209.

Tests to pass before merging (the rendering is good, but not the display):

  • $ python -m wikidict fr --gen-dict "cercle unité" --output issue-1427
  • $ python -m wikidict en --gen-dict "Wallis product,primitive recursion,Horner's rule" --output issue-1427

@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Nov 16, 2022

Sourcery Code Quality Report

❌  Merging this PR will decrease code quality in the affected files by 1.86%.

Quality metrics Before After Change
Complexity 2.80 ⭐ 3.54 ⭐ 0.74 👎
Method Length 104.38 🙂 99.12 🙂 -5.26 👍
Working memory 7.60 🙂 8.36 🙂 0.76 👎
Quality 68.24% 🙂 66.38% 🙂 -1.86% 👎
Other metrics Before After Change
Lines 640 651 11
Changed files Quality Before Quality After Quality Change
wikidict/constants.py 86.39% ⭐ 85.77% ⭐ -0.62% 👎
wikidict/utils.py 67.18% 🙂 64.93% 🙂 -2.25% 👎

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Working Memory Quality Recommendation
wikidict/utils.py clean 4 ⭐ 304 ⛔ 10 😞 49.10% 😞 Try splitting into smaller methods. Extract out complex expressions
wikidict/utils.py transform 8 ⭐ 132 😞 10 😞 57.08% 🙂 Try splitting into smaller methods. Extract out complex expressions
wikidict/utils.py process_templates 9 🙂 148 😞 8 🙂 58.61% 🙂 Try splitting into smaller methods
wikidict/utils.py table2html 6 ⭐ 82 🙂 11 😞 63.73% 🙂 Extract out complex expressions
wikidict/utils.py get_word_of_the_day 1 ⭐ 142 😞 9 🙂 64.85% 🙂 Try splitting into smaller methods

Legend and Explanation

The emojis denote the absolute quality of the code:

  • ⭐ excellent
  • 🙂 good
  • 😞 poor
  • ⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Help us improve this quality report!

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

Question 1: Should we use a LRU cache on render_formula()?

Question 2: Should we store optimized SVGs for later runs? So we could run tests without calling the REST API.

@lasconic
Copy link
Collaborator

lru_cache: My understanding is that it will cache the results of a function when the function is called several times with the same args during a program run. It will not persists between run. I believe we don't call this function often with the same argument. Right ? So I would say lru_cache will not really speed up the process, but hey, we could try it.

store svgs: if it doesn't make the test moot, sure, we would store them.


svg_optimized = scourString(svg_raw, options=SCOUR_OPTS)
return (
f'<img src="data:image/svg+xml;utf8,{quote(svg_optimized)}"/>'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You tried to embed the SVG directly ? without img tag ?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, good idea 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size is better but the display is messy :-/
screen_002

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, it displays perfectly here:
screen_001

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "cercle unité" contains raw text that scales badly. Will check the SVG content.

Copy link
Owner Author

@BoboTiG BoboTiG Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked english words, the display is just perfect using <svg>. Let's hope we can fix the display issue for "cercle unité" (and potentially other words, but we made a good progress still).

This comment was marked as outdated.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm interesting ... Saving the SVG to a file: it is displayed properly via the OS viewer.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it's scoup that is doing something with IDs, and brwosers don't like.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

screen_001

screen_002

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

lru_cache: My understanding is that it will cache the results of a function when the function is called several times with the same args during a program run. It will not persists between run. I believe we don't call this function often with the same argument. Right ? So I would say lru_cache will not really speed up the process, but hey, we could try it.

I was thinking about duplicate formulas, but it will be near to zero at the end, mostly because the cache won't be shared accross locales. Let's forget that idea.

store svgs: if it doesn't make the test moot, sure, we would store them.

It would make sense to store them: 2 HTTP calls saved for each formula. I'll implement it right now.

@BoboTiG BoboTiG marked this pull request as ready for review November 16, 2022 15:37
@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

@lasconic ready for final review.

Only one tiny concern about the potential size of svg_cache.py, but I guess there are not thousands of formulas to keep in the cache.

@lasconic
Copy link
Collaborator

PNG master
screen_003
SVG this PR.
screen_004

It's crispy but it's too large no ? the last one is cut off.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

Let's see if other formulas are cut off too. I would say it's quite correct now (far better than what we used to have).

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

Do you want to dig into scaling down the SVGs, maybe?

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

I updated the cache with all formulas used in french. I'll update with all locales shortly.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

cb61ade fixed the issue where we altered <math> formulas, especially on the SV dict.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

I'm finished with the cache.

Maybe before merging we could find a way to compress svg_cache.py? An idea:

  • using md5(formula) for the keys
  • compressing SVG data as .tar.bz2 or something more powerfull from the standard modules. FTR gzip is badly perfoming.

Maybe not a good idea. The file is only 2 MiN, we can live with that.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

Ah, about the cut off, I didn't a big part of the picture was hidden. That's a shame.

@lasconic
Copy link
Collaborator

lasconic commented Nov 16, 2022

instead of md5, maybe SHA1 ? It seems wikimedia does this : https://github.com/wikimedia/restbase/blob/ecef17bda6f4efc0d6e187fb05b1eeb389bf7120/sys/post_data.js#L13

SVGZ is gzip no ? It doesn't perform well in term of speed ?

For the cut off, it's a pity but the real estate on kobo is so small... and we can't change the margins... We could change the font size but it would need style for every word.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

instead of md5, maybe SHA1 ? It seems wikimedia does this : https://github.com/wikimedia/restbase/blob/ecef17bda6f4efc0d6e187fb05b1eeb389bf7120/sys/post_data.js#L13

Weird, I can't replicate the hash.

formula = "V^n"
print(sha1(formula.encode()).hexdigest())

It gives 1617fcdcebf4aa277a74b0c3bb4772ea65d8966f, but eb483cb4ed353a8b15938aa2d994cc5f0fb9055a is returned by the service.
EDIT: Oh, it must be because of the normalization.

SVGZ is gzip no ? It doesn't perform well in term of speed ?

I didn't try yet, but the issue is not about perfs but storage on our side. Maybe not even a issue, I just raised the concern.

For the cut off, it's a pity but the real estate on kobo is so small... and we can't change the margins... We could change the font size but it would need style for every word.

What about we live with it? :)

@lasconic
Copy link
Collaborator

They do the hash on a normalized json string it includes the type and the formula https://github.com/wikimedia/restbase/blob/ecef17bda6f4efc0d6e187fb05b1eeb389bf7120/sys/mathoid.js#L33

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

No luck so far. I would just let the cache as-is until it becomes problematic.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Nov 16, 2022

I finally could replicate the hash.

We need to use the normalized TeX, it's available in the 1st call (formula is V^n for the test):

{
    "success": true,
    "checked": "V^{n}",
    "requiredPackages": [],
    "identifiers": ["V", "n"],
    "endsWithDot": false,
}

(See the checked key.)

Then, we need to use the full query, as you found out, but also without spaces between items:

d = json.dumps({"q": "V^{n}", "type": "tex"}, indent=None)
d = d.replace(" ", "")
print(sha1(d.encode()).hexdigest())

I know, the POC is ugly, and will not work for all cases :)

@BoboTiG
Copy link
Owner Author

BoboTiG commented Dec 13, 2022

Well, we "just" need to fix the display issue, then we will be good to merge.
Do you have an idea on how to do that? Maybe using some CSS (sadly)?

@lasconic
Copy link
Collaborator

Which display issue ? If it's the cut out formula, then I don't think there is a fix... The screen of the kobo is just too small for some formula. If we make it smaller, it will become not readable.

@BoboTiG
Copy link
Owner Author

BoboTiG commented Dec 13, 2022

I was hoping to have such CSS hack: svg {max-width:100%} :D

I will still try one thing or two. In the eventuallity it's not possible to fix, do you propose to close the PR, and forget SVG stuff?

@BoboTiG
Copy link
Owner Author

BoboTiG commented Dec 13, 2022

Formulas are not so present in dictionaries. If some are cut, I'm OK with that.

@BoboTiG BoboTiG merged commit 7a6ffbf into master Dec 14, 2022
@BoboTiG BoboTiG deleted the fix-1427 branch December 14, 2022 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use a custom docker image for tests Generate SVG rather than GIF for embedded pictures
2 participants