[Meta] Project refactoring #458
Comments
About the final step, converting to the Kobo dictionary: that would be the "save" step, and we would need another step, "convert", that would handle the calls to pyglossary.
Well, we are using custom HTML code for the Kobo; we will need to test the pyglossary output before making such a move.
I would keep the Kobo output as is and add df. I'm not sure if pyglossary can be called from Python or should be called from the CI. Rendering with 7 threads is twice as fast. PR to come.
Mostly done. Only testing is missing... How do you want to tackle it?
Hmm. If you want to tackle another issue, go ahead too :)
I can close the issue now. Thanks a lot for your help, it was awesome 💪
The refactoring is finished 🍾
Note: the description has been updated with the changes requested in the comments.
The goal is to rework the script module to allow more flexibility and clearly separate concerns.
First, about the module name: script. It has been decided to change it to wikidict.
Overview
I would like to see the module split into 4 parts (download, parse, render, and convert); each part will be independent from the others and can be replayed & extended easily.
This will also help leverage multithreading to speed up the whole process.
I have in mind a SQLite database where the raw data will be stored and updated when needed.
Then, the parts will only use the data from the database. It should speed up regenerating a whole dictionary when we update a template.
Then, each and every part will have its own CLI (see the sketch just below).
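Purely as an illustration, the per-part invocations could look like this (the locale argument and the flag names are assumptions, not decisions from this issue):
```
python -m wikidict fr --download
python -m wikidict fr --parse
python -m wikidict fr --render
python -m wikidict fr --convert
```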
And there would be an all-in-one operation chaining them (again, sketched below).
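Equally hypothetical, something along the lines of:
```
python -m wikidict fr --all
```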
Side note: we could use an entry point so we only have to type wikidict instead of python -m wikidict.
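A minimal sketch of how such an entry point could be declared with setuptools; the wikidict.__main__:main target is an assumption about where the CLI's main() would live:
```python
# setup.py (sketch)
from setuptools import find_packages, setup

setup(
    name="wikidict",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            # Installs a `wikidict` command that calls the module's main() function.
            "wikidict = wikidict.__main__:main",
        ],
    },
)
```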
Splitting get.py
Here we are talking about parts 1 and 2.
Part 1 is already almost fine as-is; we just need to move the code into its own submodule.
We could improve the CLI by allowing the Wiktionary dump date to be passed as an argument instead of relying on an environment variable.
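As an illustration only (the option name and the fallback environment variable are assumptions):
```python
import argparse
import os

parser = argparse.ArgumentParser(prog="python -m wikidict")
parser.add_argument(
    "--date",
    default=os.environ.get("WIKI_DUMP", ""),  # hypothetical envvar, kept as a fallback
    help="Wiktionary dump date (YYYYMMDD); use the latest dump when empty",
)
args = parser.parse_args()
```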
Part 2 is only a matter of parsing the big XML file and storing the raw data into a SQLite database. I am thinking of using a schema along the lines of the sketch below.
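Only a sketch; the table and column names are placeholders, not a final design:
```python
import sqlite3

# Hypothetical layout: one row per word in `word`, and one row per rendered
# text in `render`, so a single word can have several entries
# (noun 1, noun 2, verb, adjective, ...).
SCHEMA = """
CREATE TABLE IF NOT EXISTS word (
    word     TEXT PRIMARY KEY,   -- the headword
    wikicode TEXT NOT NULL       -- raw wikicode taken from the dump
);
CREATE TABLE IF NOT EXISTS render (
    id       INTEGER PRIMARY KEY,
    word     TEXT NOT NULL REFERENCES word(word),
    rendered TEXT NOT NULL       -- cleaned-up text with templates processed
);
"""

with sqlite3.connect("data/fr/20201001.db") as conn:  # example path, see the naming pattern below
    conn.executescript(SCHEMA)
```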
The Word table will contain the raw data from the Wiktionary.
The Render table will be used to store the transformed text for a given word (after it has been cleaned up and its templates processed). It will allow having multiple texts for a given word (noun 1, noun 2, verb, adjective, ...).
We will have one database per locale, located at data/$LOCALE/$WIKIDUMP_DATE.db.
At the download step, if no database exists locally, it will be retrieved from the GitHub releases, where the databases will be saved alongside the dictionaries.
This is a cool thing IMO: everyone will have a good, up-to-date local database.
Of course, we will have options to skip it if the local file already exists, or to force the download.
At the parse step, we will have to find a way to prevent parsing again if we run the command twice on the same Wiktionary dump.
I was thinking of using the PRAGMA user_version, which would contain the Wiktionary dump date as an integer.
It would be set only after the full parsing has completed successfully.
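A small sketch of that idea, assuming the database layout above; parse_dump() is a hypothetical helper and the dump date is just an example value:
```python
import sqlite3

DUMP_DATE = 20201001  # example: the Wiktionary dump date as an integer

conn = sqlite3.connect("data/fr/20201001.db")  # example path

# Skip the whole parse if this dump has already been fully processed.
(already_parsed,) = conn.execute("PRAGMA user_version").fetchone()
if already_parsed == DUMP_DATE:
    print("Already parsed, nothing to do.")
else:
    parse_dump(conn)  # hypothetical helper doing the actual XML parsing
    # Record the dump date only once parsing succeeded.
    conn.execute(f"PRAGMA user_version = {DUMP_DATE}")

conn.close()
```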
Splitting convert.py
Here we are talking about parts 3 and 4.
Part 3 will call clean() and process_templates() on the wikicode and store the result into the rendered field. This is the most time- and CPU-consuming part, so it will be parallelized (see the sketch below).
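One possible shape for that step, using multiprocessing; the import path, the helpers' signatures and the table names follow the sketches above and are assumptions, not the final code:
```python
import multiprocessing
import sqlite3

from wikidict.utils import clean, process_templates  # hypothetical import path


def render_one(row):
    word, wikicode = row
    return word, process_templates(clean(wikicode))  # signatures are assumptions


def render_all(db_path: str, workers: int = 7) -> None:
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT word, wikicode FROM word").fetchall()

    # CPU-bound template processing goes to the worker processes; all SQLite
    # writes stay in the main process so the connection is never shared.
    with multiprocessing.Pool(workers) as pool:
        for word, rendered in pool.imap_unordered(render_one, rows, chunksize=128):
            conn.execute(
                "INSERT INTO render (word, rendered) VALUES (?, ?)", (word, rendered)
            )

    conn.commit()
    conn.close()
```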
Part 4 will rethink how we are handling the dictionary output, to easily add more formats.
I was thinking of using a class along these lines (I have not really thought it through, I am just proposing the idea):
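Only a rough sketch; the class name, the constructor arguments and the method names are placeholders:
```python
from pathlib import Path


class BaseFormat:
    """One subclass per output format (Kobo, StarDict, DictFile, ...)."""

    def __init__(self, locale: str, output_dir: Path, words: dict) -> None:
        self.locale = locale
        self.output_dir = output_dir
        self.words = words  # rendered entries coming from the database

    def process(self) -> None:
        """Turn the rendered words into the format-specific structure."""
        raise NotImplementedError

    def save(self) -> Path:
        """Write the final dictionary file and return its path."""
        raise NotImplementedError


class KoboFormat(BaseFormat):
    def process(self) -> None:
        ...  # build the custom HTML used for the Kobo output

    def save(self) -> Path:
        return self.output_dir / f"dicthtml-{self.locale}.zip"  # Kobo naming convention
```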
That part is far from being finished, but once we have a fully working format, our code will use something like this to generate the dict file:
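Again a sketch, reusing the hypothetical names from above:
```python
formatter = KoboFormat(locale="fr", output_dir=Path("data/fr"), words=words)  # words: rendered entries from the DB
formatter.process()
dict_file = formatter.save()
```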