
[Tooling] Refactor CLDR Data Processing Script #11

Closed
GrammAcc opened this issue Jan 21, 2024 · 0 comments · Fixed by #20
Comments

@GrammAcc
Owner

linearmoney uses a simple Python script to parse the CLDR-JSON data and output the relevant portions to JSON files that can be distributed with the PyPI package. This has the advantage of decoupling the library's data-driven implementation from the actual CLDR data, as well as reducing the data distributed with the package to only the pieces relevant to currency.

Because the script's output is also the structure of the data that linearmoney consumes at runtime, any change to how the library structures data requires a corresponding change in the CLDR parsing script in order to avoid the cost of transforming data at runtime. For this reason, the script needs to be just as readable and amenable to change as the rest of the library, but it currently is not.

There are a few problems with the current script:

  1. It's a complete mess.
  • I wrote this script in a hurry while developing the formatting system, in order to get something working as quickly as possible, so it is effectively unreadable and unmaintainable.
  2. There is no automated way to track the version of CLDR data used by the library.
  • The parsing script parses whatever data is available in the repo and outputs the formatted JSON, but that data changes whenever the upstream CLDR repo releases a new version. This does not constitute a change of behavior or interface in the linearmoney library, but it will produce different output from certain localization and/or rounding functions, which might be unexpected for the end user. We need a way to track the version of the CLDR data used by the library automatically.
  3. The output is not optimized for size.
  • I used a DEFAULT key in the fractions data for the common case of 2 places and 0 denomination, mirroring how the CLDR-JSON data is organized, but this is not the most compact structure. I need to find a better structure that maximizes deduplication of the values shared between different currencies and locales.

Solving the first problem is just a matter of refactoring the script into procedural blocks of self-documenting code, so that each stage of the overall procedure can be understood and modified independently when the library's data structures change.
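A sketch of what that might look like (function names here are hypothetical; the `supplemental.currencyData.fractions` path and `_digits` key follow the CLDR-JSON layout): each stage is a small, independently testable function composed by a top-level driver.

```python
# Hypothetical refactored layout: one function per pipeline stage.

def extract_currency_data(raw_json: dict) -> dict:
    """Stage 1: pull the currency-relevant subtree out of the CLDR-JSON."""
    return raw_json["supplemental"]["currencyData"]

def build_fractions_data(currency_data: dict) -> dict:
    """Stage 2: reshape the fractions block into the library's structure."""
    return {
        code: {"places": int(entry.get("_digits", 2))}
        for code, entry in currency_data["fractions"].items()
    }

def run_pipeline(raw_json: dict) -> dict:
    """Driver: composing stages means each one can change in isolation."""
    return build_fractions_data(extract_currency_data(raw_json))
```

With this shape, a change to the output format touches only `build_fractions_data`, and a change in the upstream layout touches only `extract_currency_data`.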

The second problem can probably be solved by reading some kind of version identifier in the parsing script and writing it to a dedicated file, which the linearmoney library can then read on import and store in a CLDR_VERSION variable. I would prefer a solution that incurs no runtime cost, but reading a 50-byte file at import time is negligible, so this kind of solution is fine for now unless I come up with something better.
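A minimal sketch of that idea, assuming the upstream CLDR-JSON checkout carries its version in a `package.json` (the file and variable names below are hypothetical):

```python
import json
from pathlib import Path

def pin_cldr_version(cldr_root: Path, out_file: Path) -> str:
    """Read the upstream version and pin it to a tiny file in the package."""
    version = json.loads((cldr_root / "package.json").read_text())["version"]
    out_file.write_text(version)
    return version

# In the library, at import time (hypothetical):
# CLDR_VERSION = (_DATA_DIR / "cldr_version.txt").read_text().strip()
```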

Solving the third problem is more difficult and requires more thought. In Python, we can use references to reduce memory usage when two different locales/currencies use the same value for the same field, but JSON serialization converts everything to text, so I believe we lose these references when reading the data back in at runtime. I have not tested this yet, so I will start by checking whether Python's JSON parser caches equal values as shared references when deserializing a nested JSON object, but I doubt it does, since that would likely cause a lot of bugs when mutating keys in a parsed JSON object. In any case, this will require some experimentation and careful thought to get right.
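A quick check bears out the suspicion: `json.loads` builds a fresh object for every sub-structure, so equal values do not share a reference after deserialization. A post-load interning pass (a hypothetical helper, not current linearmoney code) can restore the sharing:

```python
import json

# Two currencies with identical fraction data (field names illustrative).
raw = '{"USD": {"places": 2, "cash": 2}, "EUR": {"places": 2, "cash": 2}}'
data = json.loads(raw)

# The parsed dicts are equal but are distinct objects:
assert data["USD"] == data["EUR"]
assert data["USD"] is not data["EUR"]

def intern_values(mapping: dict) -> dict:
    """Collapse equal values to a single shared object after loading."""
    seen: dict = {}
    for key, value in mapping.items():
        # Canonical serialization serves as the deduplication key.
        marker = json.dumps(value, sort_keys=True)
        mapping[key] = seen.setdefault(marker, value)
    return mapping

interned = intern_values(data)
assert interned["USD"] is interned["EUR"]  # now one shared dict
```

Note that once values are shared this way, they must be treated as immutable, which is consistent with the concern about mutation bugs above.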

We should also have some documentation explaining how the CLDR data is used and the process of parsing and maintaining it. This information will likely end up in the contributing guidelines once they are written, but adding it to the docstring of the parsing script while the concepts are fresh in my mind is a good way to get it down on paper. It might also be good to add a recipe for end users who want to use a different version of the CLDR data, or custom data in its place, and having this process well documented should make writing such a recipe easier.

@GrammAcc GrammAcc added this to the v1.0.0 milestone Jan 21, 2024
@GrammAcc GrammAcc self-assigned this Jan 21, 2024
@GrammAcc GrammAcc changed the title Refactor CLDR Data Processing Script [Tooling] Refactor CLDR Data Processing Script Jan 21, 2024
GrammAcc added a commit that referenced this issue Apr 30, 2024
fix #11

I finished the primary refactoring of the `process_cldr_data.py` script.

It's still not great, but it's broken up into more clearly delineated
functions now, so it should be much easier to change the way we
parse the data or to change the output format.

Through this refactoring process, I also found a couple of bugs in the
script that were causing incorrect data to be parsed, so this commit
also includes updated currencies and locales data.

Specifically, several currencies were being parsed as having
`cash_places = 2` when they should have been `cash_places = 0`. This was
caused by using `j` instead of `i` in the loop in the old
`_build_fractions_data` function. This caused the branch that read the
data for `cash_places` to get skipped for all currencies, so any that
had specific `cash_places` would instead get the value of the `DEFAULT`
currency.

The other problem was that the comparisons with `" "` weren't working
for locales that use some other unicode whitespace character for their
spaces in the formatting patterns. This caused several locales to have
incorrect currency symbol and positive/negative sign placement.

I added a helper function for normalizing whitespace with the
`unicodedata` module, and this is now fixed. The new data should have
the correct sign and symbol placement and spacing.
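The helper presumably looks something like this (a sketch of the approach described, not the exact code from the commit): any character in Unicode category `Zs` (space separators, e.g. U+00A0 NO-BREAK SPACE or U+202F NARROW NO-BREAK SPACE, both common in CLDR formatting patterns) is mapped to a plain ASCII space before any comparison:

```python
import unicodedata

def normalize_whitespace(pattern: str) -> str:
    """Map every Unicode space separator (category Zs) to an ASCII space."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in pattern
    )

# A pattern that places the currency sign after a narrow no-break space
# now compares equal to its ASCII-space form:
normalize_whitespace("#,##0.00\u202f\u00a4")  # -> "#,##0.00 \u00a4"
```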
@GrammAcc GrammAcc linked a pull request Apr 30, 2024 that will close this issue