
[Meta] Project refactoring #458

Closed
3 of 4 tasks
BoboTiG opened this issue Dec 6, 2020 · 26 comments
BoboTiG commented Dec 6, 2020

Note: this description is updated as changes are requested in the comments below.

The goal is to rework the script module to allow more flexibility and clearly separate concerns.

First, about the module name: it has been decided to rename script to wikidict.

Overview

I would like to see the module split into 4 parts (each part will be independent from the others and can be replayed & extended easily).
This will also make it easier to leverage multithreading to speed up the whole process.

  1. Download the data (Project refactoring: download step (related to #458) #466)
  2. Parse and store raw data (start --parse and --render #469)
  3. Render templates and store results (start --parse and --render #469)
  4. Output to the proper eBook reader format

I have in mind a SQLite database where raw data will be stored and updated when needed.
Then, the parts will only use the data from the database. It should speed up regenerating a whole dictionary when we update a template.

Then, each and every part will have its own CLI:

$ python -m wikidict --download ...
$ python -m wikidict --parse ...
$ python -m wikidict --render ...
$ python -m wikidict --output ...

And the all-in-one operation would be:

$ python -m wikidict --run ...

Side note: we could use an entry point so that we only have to type wikidict instead of python -m wikidict.
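A minimal sketch of that per-step CLI could look like the following (the option names match the proposal above, but the selected_steps helper and the use of argparse are my assumptions, not a settled design):

```python
import argparse

STEPS = ("download", "parse", "render", "output")

def selected_steps(argv=None) -> list:
    # One boolean flag per step, plus --run as the all-in-one shortcut
    parser = argparse.ArgumentParser(prog="wikidict")
    for step in STEPS:
        parser.add_argument(f"--{step}", action="store_true")
    parser.add_argument("--run", action="store_true", help="run all steps in order")
    args = parser.parse_args(argv)
    if args.run:
        return list(STEPS)
    return [step for step in STEPS if getattr(args, step)]
```

An entry point (e.g. a console_scripts entry mapping wikidict to a main() built on this) would then give us the short wikidict command.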

Splitting get.py

Here we are talking about parts 1 and 2.

Part 1 is already almost fine as-is, we just need to move the code into its own submodule.
We could improve the CLI by allowing the Wiktionary dump date to be passed as an argument, instead of relying on an environment variable.

Part 2 is only a matter of parsing the big XML file and storing raw data into a SQLite database. I am thinking of using this schema:

table: Word
fields:
    - word: varchar(256)
    - code: text
index on: word

table: Render
fields:
    - word_id: int
    - nature: varchar(16)
    - text: text
foreign key: word_id (Word._rowid_)
  • The Word table will contain raw data from the Wiktionary.
  • The Render table will be used to store the transformed text for a given word (after being cleaned up and after templates were processed). It will make it possible to have multiple texts for a given word (noun 1, noun 2, verb, adjective, ...).

We will have one database per locale, located at data/$LOCALE/$WIKIDUMP_DATE.db.
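As a sketch, the proposed schema could be created with the stdlib sqlite3 module. The locale and dump date below are hypothetical examples of the data/$LOCALE/$WIKIDUMP_DATE.db layout, and note that SQLite only enforces foreign keys when PRAGMA foreign_keys is on (and referencing _rowid_ would then need an explicit INTEGER PRIMARY KEY on Word):

```python
import sqlite3
from pathlib import Path

# Hypothetical locale and dump date, following data/$LOCALE/$WIKIDUMP_DATE.db
locale, dump_date = "fr", "20201201"
db_path = Path("data") / locale / f"{dump_date}.db"
db_path.parent.mkdir(parents=True, exist_ok=True)

conn = sqlite3.connect(db_path)
conn.executescript("""
    CREATE TABLE IF NOT EXISTS Word (
        word VARCHAR(256),
        code TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_word ON Word (word);
    CREATE TABLE IF NOT EXISTS Render (
        word_id INTEGER,
        nature  VARCHAR(16),
        text    TEXT,
        -- not enforced unless PRAGMA foreign_keys = ON
        FOREIGN KEY (word_id) REFERENCES Word (_rowid_)
    );
""")
conn.commit()
```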

At the download step, if no database exists, it will be retrieved from GitHub releases where they will be saved alongside dictionaries.
This is a cool thing IMO: everyone will get a correct, up-to-date local database.
Of course, we will have options to skip it if the local file already exists or if we would like to force the download.

At the parse step, we will have to find a way to prevent parsing again if we run the command twice on the same Wiktionary dump.
I was thinking of using the PRAGMA user_version, which would contain the Wiktionary dump date as an integer.
It would be set only after the full parsing is done with success.
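A sketch of that guard, using the stdlib sqlite3 module (an in-memory database and the helper names stand in for the real file and code):

```python
import sqlite3

def already_parsed(conn: sqlite3.Connection, dump_date: int) -> bool:
    # user_version is 0 on a fresh database, so a new dump is never skipped
    return conn.execute("PRAGMA user_version").fetchone()[0] == dump_date

def mark_parsed(conn: sqlite3.Connection, dump_date: int) -> None:
    # PRAGMA does not accept bound parameters, hence the f-string
    conn.execute(f"PRAGMA user_version = {int(dump_date):d}")

conn = sqlite3.connect(":memory:")  # stands in for data/$LOCALE/$WIKIDUMP_DATE.db
dump_date = 20201201  # hypothetical Wiktionary dump date as an integer

if not already_parsed(conn, dump_date):
    # ... run the full parse here, then, only on success:
    mark_parsed(conn, dump_date)
```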

Splitting convert.py

Here we are talking about parts 3 and 4.

Part 3 will call clean() and process_templates() on the wikicode, and store the result into the rendered field. This is the most time- and CPU-consuming part; it will be parallelized.
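A rough sketch of that parallelization with multiprocessing.Pool; the clean() and process_templates() bodies below are trivial placeholders for the real functions, only the shape of the pipeline matters (on spawn-based platforms the Pool call would need a __main__ guard):

```python
from multiprocessing import Pool

def clean(wikicode: str) -> str:
    # Placeholder for the real clean(): here it only trims whitespace
    return wikicode.strip()

def process_templates(wikicode: str) -> str:
    # Placeholder for the real process_templates(): here it only drops braces
    return wikicode.replace("{{", "").replace("}}", "")

def render(entry):
    # One (word, code) row in, one (word, rendered text) row out
    word, code = entry
    return word, process_templates(clean(code))

# Hypothetical raw rows, as they would come out of the Word table
raw = [("chat", " {{nom}} animal félin "), ("chien", " {{nom}} animal canin ")]
with Pool(2) as pool:
    rendered = pool.map(render, raw)
```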

Part 4 will rethink how we are handling dictionary output to easily add more formats.

I was thinking of using a class with these methods (I have not really thought it through; I am just proposing the idea):

class BaseFormat:

    __slots__ = ("locale", "output_dir")

    def __init__(self, locale: str, output_dir: Path) -> None:
        self.locale = locale
        self.output_dir = output_dir

    def process(self, words) -> None:
        raise NotImplementedError()

    def save(self, *args) -> None:
        raise NotImplementedError()


class KoboFormat(BaseFormat):
    def process(self, words) -> None:
        groups = self.make_groups(words)
        variants = self.make_variants(words)

        wordlist = [self.process_word(word) for word in words]

        self.save(wordlist, groups, variants)

    def save(self, wordlist, groups, variants) -> None:
        ...

That part is far from finished, but once we have a fully working format, we will use this kind of code to generate the dict files:

# Get all registered formats
formaters = get_formaters()

# Get all words from the database
words = get_words()

# And distribute the workload
from multiprocessing import Pool

def run(cls):
    formater = cls(locale, output_dir)
    formater.process(words)

with Pool(len(formaters)) as pool:
    pool.map(run, formaters)
BoboTiG commented Dec 14, 2020

About the final step: converting to the Kobo dictionary.

You talked about .df files in #409. I was wondering if, for that ticket only, we could just implement the code that outputs to that format. Doing so would let pyglossary do all the work (Kobo, Kindle, ...). The last step would then just be a matter of outputting to the right .df format.

That would be the "save" step. And we would need another step, "convert", that would handle the calls to pyglossary for the different dictionaries.

BoboTiG commented Dec 14, 2020

Well, we are using custom HTML code for the Kobo, so we will need to test the pyglossary output before making such a move.

lasconic commented:
I would keep the Kobo output as it is and add df. I'm not sure whether pyglossary can be called from Python or should be called from the CI.

Rendering with 7 threads is twice as fast. PR to come.

lasconic commented Dec 15, 2020

Mostly done. Only testing is missing... How do you want to tackle it?
A big project is starting tomorrow or later this week, so the last PR could be my last push for a while.

BoboTiG commented Dec 15, 2020

Hmm, the test_N_*.py files will be a pain; I will handle them. If you can migrate test_$LOCALE.py, that would be great. But again, if you do not have time, I will have some on my side, so it is not a big deal.

If you want to tackle another issue, go ahead too :)

lasconic commented Dec 15, 2020

test_$LOCALE.py was easy ;) see PR #478.
Good luck with the test_N_*.py files!

BoboTiG commented Dec 15, 2020

I can close the issue now, thanks a lot for your help, it was awesome 💪

BoboTiG commented Dec 18, 2020

The refactoring is finished 🍾
Test coverage is at 100% (except for arabiser.py). I think we should rework the release descriptions to include all dictionaries (when available), or simply list the download files.

@BoboTiG BoboTiG added the QA/CI label Sep 19, 2024