
[Tooling] Refactor CLDR Data Processing Script #11

Closed
GrammAcc opened this issue Jan 21, 2024 · 0 comments · Fixed by #20
Comments

@GrammAcc
Owner

linearmoney uses a simple Python script to parse the CLDR-JSON data and output the relevant portions to JSON files that can be distributed with the PyPI package. This has the advantage of decoupling the library's data-driven implementation from the actual CLDR data, as well as reducing the data distributed with the package to only the pieces relevant to currency.

Because the script's output is also the structure of the data that linearmoney consumes at runtime, any change to how the library structures data requires a corresponding change in the CLDR parsing script in order to avoid the cost of transforming data at runtime. For this reason, the script needs to be just as readable and amenable to change as the rest of the library, but it currently is not.

There are a few problems with the current script:

  1. It's a complete mess.
  • I wrote this script in a hurry while developing the formatting system, in order to get something working as quickly as possible, so it is effectively unreadable and unmaintainable.
  2. There is no automated way to track the version of CLDR data used by the library.
  • The parsing script parses whatever data is available in the repo and outputs the formatted JSON, but that data changes whenever the upstream CLDR repo releases a new version. This does not constitute a change of behavior or interface in the linearmoney library, but it will produce different output from certain localization and/or rounding functions, which might be unexpected for the end user. We need a way to track the version of the CLDR data used by the library automatically.
  3. The output is not optimized for size.
  • I used a DEFAULT key in the fractions data for the common case of 2 places and 0 denomination, mirroring how the CLDR-JSON data is organized, but this is not the most compact structure. I need to find a better structure that maximizes deduplication of the values shared between different currencies and locales.

Solving the first problem is just a matter of refactoring the script into procedural blocks of self-documenting code, so that each stage of the overall procedure can be understood and modified independently when the library's data structures change.
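A sketch of what that might look like (function names here are hypothetical; the `supplemental.currencyData.fractions` path and `_digits` key follow the CLDR-JSON layout): each stage is a small, independently testable function composed by a top-level driver.

```python
# Hypothetical refactored layout: one function per pipeline stage.

def extract_currency_data(raw_json: dict) -> dict:
    """Stage 1: pull the currency-relevant subtree out of the CLDR-JSON."""
    return raw_json["supplemental"]["currencyData"]

def build_fractions_data(currency_data: dict) -> dict:
    """Stage 2: reshape the fractions block into the library's structure."""
    return {
        code: {"places": int(entry.get("_digits", 2))}
        for code, entry in currency_data["fractions"].items()
    }

def run_pipeline(raw_json: dict) -> dict:
    """Driver: composing stages means each one can change in isolation."""
    return build_fractions_data(extract_currency_data(raw_json))
```

With this shape, a change to the output format touches only `build_fractions_data`, and a change in the upstream layout touches only `extract_currency_data`.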

The second problem can probably be solved by reading some kind of version identifier in the parsing script and writing it to a dedicated file, which the linearmoney library can then read on import and store in a CLDR_VERSION variable. I would prefer a solution that incurs no runtime cost, but reading a 50-byte file at import time is negligible, so this kind of solution is fine for now unless I come up with something better.
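A minimal sketch of that idea, assuming the upstream CLDR-JSON checkout carries its version in a `package.json` (the file and variable names below are hypothetical):

```python
import json
from pathlib import Path

def pin_cldr_version(cldr_root: Path, out_file: Path) -> str:
    """Read the upstream version and pin it to a tiny file in the package."""
    version = json.loads((cldr_root / "package.json").read_text())["version"]
    out_file.write_text(version)
    return version

# In the library, at import time (hypothetical):
# CLDR_VERSION = (_DATA_DIR / "cldr_version.txt").read_text().strip()
```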

Solving the third problem is more difficult and requires more thought. In Python, we can use references to reduce memory usage when two different locales/currencies use the same value for the same field, but JSON serialization converts everything to text, so I believe we lose these references when reading the data back in at runtime. I have not tested this yet, so I will start by checking whether Python's JSON parser caches equal values as shared references when deserializing a nested JSON object, but I doubt it does, since that would likely cause a lot of bugs when mutating keys in a parsed JSON object. In any case, this will require some experimentation and careful thought to get right.
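A quick check bears out the suspicion: `json.loads` builds a fresh object for every sub-structure, so equal values do not share a reference after deserialization. A post-load interning pass (a hypothetical helper, not current linearmoney code) can restore the sharing:

```python
import json

# Two currencies with identical fraction data (field names illustrative).
raw = '{"USD": {"places": 2, "cash": 2}, "EUR": {"places": 2, "cash": 2}}'
data = json.loads(raw)

# The parsed dicts are equal but are distinct objects:
assert data["USD"] == data["EUR"]
assert data["USD"] is not data["EUR"]

def intern_values(mapping: dict) -> dict:
    """Collapse equal values to a single shared object after loading."""
    seen: dict = {}
    for key, value in mapping.items():
        # Canonical serialization serves as the deduplication key.
        marker = json.dumps(value, sort_keys=True)
        mapping[key] = seen.setdefault(marker, value)
    return mapping

interned = intern_values(data)
assert interned["USD"] is interned["EUR"]  # now one shared dict
```

Note that once values are shared this way, they must be treated as immutable, which is consistent with the concern about mutation bugs above.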

We should also have some documentation explaining how the CLDR data is used and the process of parsing and maintaining it. This information will likely end up in the contributing guidelines once they are written, but adding it to the docstring of the parsing script while the concepts are fresh in my mind is a good way to get it down on paper. It might also be good to add a recipe for end users who want to use a different version of the CLDR data, or custom data in its place, and having this process well documented should make writing such a recipe easier.

@GrammAcc GrammAcc added this to the v1.0.0 milestone Jan 21, 2024
@GrammAcc GrammAcc self-assigned this Jan 21, 2024
@GrammAcc GrammAcc changed the title Refactor CLDR Data Processing Script [Tooling] Refactor CLDR Data Processing Script Jan 21, 2024
GrammAcc added a commit that referenced this issue Apr 30, 2024
fix #11

I finished the primary refactoring of the `process_cldr_data.py` script.

It's still not great, but it's broken up into more clearly delineated
functions now, so it should be much easier to change the way we
parse the data or to change the output format.

Through this refactoring process, I also found a couple of bugs in the
script that were causing incorrect data to be parsed, so this commit
also includes updated currencies and locales data.

Specifically, several currencies were being parsed as having
`cash_places = 2` when they should have been `cash_places = 0`. This was
caused by using `j` instead of `i` in the loop in the old
`_build_fractions_data` function. This caused the branch that read the
data for `cash_places` to get skipped for all currencies, so any that
had specific `cash_places` would instead get the value of the `DEFAULT`
currency.

The other problem was that the comparisons with `" "` weren't working
for locales that use some other unicode whitespace character for their
spaces in the formatting patterns. This caused several locales to have
incorrect currency symbol and positive/negative sign placement.

I added a helper function for normalizing whitespace with the
`unicodedata` module, and this is now fixed. The new data should have
the correct sign and symbol placement and spacing.
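The helper presumably looks something like this (a sketch of the approach described, not the exact code from the commit): any character in Unicode category `Zs` (space separators, e.g. U+00A0 NO-BREAK SPACE or U+202F NARROW NO-BREAK SPACE, both common in CLDR formatting patterns) is mapped to a plain ASCII space before any comparison:

```python
import unicodedata

def normalize_whitespace(pattern: str) -> str:
    """Map every Unicode space separator (category Zs) to an ASCII space."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in pattern
    )

# A pattern that places the currency sign after a narrow no-break space
# now compares equal to its ASCII-space form:
normalize_whitespace("#,##0.00\u202f\u00a4")  # -> "#,##0.00 \u00a4"
```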
@GrammAcc GrammAcc linked a pull request Apr 30, 2024 that will close this issue