
[Meta] Project refactoring #458

Closed
3 of 4 tasks
BoboTiG opened this issue Dec 6, 2020 · 26 comments
BoboTiG commented Dec 6, 2020

Note: this description is updated as changes are requested in the comments below.

The goal is to rework the script module to allow more flexibility and clearly separate concerns.

First, about the module name: it has been decided to rename script to wikidict.

Overview

I would like to see the module split into 4 parts (each part will be independent from the others and can be replayed & extended easily).
This will also make it easier to leverage multithreading to speed up the whole process.

  1. Download the data (Project refactoring: download step (related to #458) #466)
  2. Parse and store raw data (start --parse and --render #469)
  3. Render templates and store results (start --parse and --render #469)
  4. Output to the proper eBook reader format

I have in mind a SQLite database where raw data will be stored and updated when needed.
Then, the parts will only use the data from the database. It should speed up regenerating a whole dictionary when we update a template.

Then, each and every part will have its own CLI:

$ python -m wikidict --download ...
$ python -m wikidict --parse ...
$ python -m wikidict --render ...
$ python -m wikidict --output ...

And the all-in-one operation would be:

$ python -m wikidict --run ...

Side note: we could use an entry point so that we only have to type wikidict instead of python -m wikidict.
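A minimal sketch of that per-step CLI could look like the following (the option names match the proposal above, but the selected_steps helper and the use of argparse are my assumptions, not a settled design):

```python
import argparse

STEPS = ("download", "parse", "render", "output")

def selected_steps(argv=None) -> list:
    # One boolean flag per step, plus --run as the all-in-one shortcut
    parser = argparse.ArgumentParser(prog="wikidict")
    for step in STEPS:
        parser.add_argument(f"--{step}", action="store_true")
    parser.add_argument("--run", action="store_true", help="run all steps in order")
    args = parser.parse_args(argv)
    if args.run:
        return list(STEPS)
    return [step for step in STEPS if getattr(args, step)]
```

An entry point (e.g. a console_scripts entry mapping wikidict to a main() built on this) would then give us the short wikidict command.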

Splitting get.py

Here we are talking about parts 1 and 2.

Part 1 is already almost fine as-is, we just need to move the code into its own submodule.
We could improve the CLI by allowing the Wiktionary dump date to be passed as an argument, instead of relying on an environment variable.

Part 2 is only a matter of parsing the big XML file and storing raw data into a SQLite database. I am thinking of using this schema:

table: Word
fields:
    - word: varchar(256)
    - code: text
index on: word

table: Render
fields:
    - word_id: int
    - nature: varchar(16)
    - text: text
foreign key: word_id (Word._rowid_)
  • The Word table will contain raw data from the Wiktionary.
  • The Render table will be used to store the transformed text for a given word (after being cleaned up and after templates were processed). It will make it possible to have multiple texts for a given word (noun 1, noun 2, verb, adjective, ...).

We will have one database per locale, located at data/$LOCALE/$WIKIDUMP_DATE.db.
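As a sketch, the proposed schema could be created with the stdlib sqlite3 module. The locale and dump date below are hypothetical examples of the data/$LOCALE/$WIKIDUMP_DATE.db layout, and note that SQLite only enforces foreign keys when PRAGMA foreign_keys is on (and referencing _rowid_ would then need an explicit INTEGER PRIMARY KEY on Word):

```python
import sqlite3
from pathlib import Path

# Hypothetical locale and dump date, following data/$LOCALE/$WIKIDUMP_DATE.db
locale, dump_date = "fr", "20201201"
db_path = Path("data") / locale / f"{dump_date}.db"
db_path.parent.mkdir(parents=True, exist_ok=True)

conn = sqlite3.connect(db_path)
conn.executescript("""
    CREATE TABLE IF NOT EXISTS Word (
        word VARCHAR(256),
        code TEXT
    );
    CREATE INDEX IF NOT EXISTS idx_word ON Word (word);
    CREATE TABLE IF NOT EXISTS Render (
        word_id INTEGER,
        nature  VARCHAR(16),
        text    TEXT,
        -- not enforced unless PRAGMA foreign_keys = ON
        FOREIGN KEY (word_id) REFERENCES Word (_rowid_)
    );
""")
conn.commit()
```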

At the download step, if no database exists, it will be retrieved from GitHub releases where they will be saved alongside dictionaries.
This is a cool thing IMO: everyone will get a correct, up-to-date local database.
Of course, we will have options to skip it if the local file already exists or if we would like to force the download.

At the parse step, we will have to find a way to prevent parsing again if we run the command twice on the same Wiktionary dump.
I was thinking of using the PRAGMA user_version, which would contain the Wiktionary dump date as an integer.
It would be set only after the full parsing is done with success.
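A sketch of that guard, using the stdlib sqlite3 module (an in-memory database and the helper names stand in for the real file and code):

```python
import sqlite3

def already_parsed(conn: sqlite3.Connection, dump_date: int) -> bool:
    # user_version is 0 on a fresh database, so a new dump is never skipped
    return conn.execute("PRAGMA user_version").fetchone()[0] == dump_date

def mark_parsed(conn: sqlite3.Connection, dump_date: int) -> None:
    # PRAGMA does not accept bound parameters, hence the f-string
    conn.execute(f"PRAGMA user_version = {int(dump_date):d}")

conn = sqlite3.connect(":memory:")  # stands in for data/$LOCALE/$WIKIDUMP_DATE.db
dump_date = 20201201  # hypothetical Wiktionary dump date as an integer

if not already_parsed(conn, dump_date):
    # ... run the full parse here, then, only on success:
    mark_parsed(conn, dump_date)
```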

Splitting convert.py

Here we are talking about parts 3 and 4.

Part 3 will call clean() and process_templates() on the wikicode, and store the result into the rendered field. This is the most time- and CPU-consuming part; it will be parallelized.
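A rough sketch of that parallelization with multiprocessing.Pool; the clean() and process_templates() bodies below are trivial placeholders for the real functions, only the shape of the pipeline matters (on spawn-based platforms the Pool call would need a __main__ guard):

```python
from multiprocessing import Pool

def clean(wikicode: str) -> str:
    # Placeholder for the real clean(): here it only trims whitespace
    return wikicode.strip()

def process_templates(wikicode: str) -> str:
    # Placeholder for the real process_templates(): here it only drops braces
    return wikicode.replace("{{", "").replace("}}", "")

def render(entry):
    # One (word, code) row in, one (word, rendered text) row out
    word, code = entry
    return word, process_templates(clean(code))

# Hypothetical raw rows, as they would come out of the Word table
raw = [("chat", " {{nom}} animal félin "), ("chien", " {{nom}} animal canin ")]
with Pool(2) as pool:
    rendered = pool.map(render, raw)
```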

Part 4 will rethink how we are handling dictionary output to easily add more formats.

I was thinking of using a class with these methods (I have not really thought it through; I am just proposing the idea):

class BaseFormat:

    __slots__ = ("locale", "output_dir")

    def __init__(self, locale: str, output_dir: Path) -> None:
        self.locale = locale
        self.output_dir = output_dir

    def process(self, words) -> None:
        raise NotImplementedError()

    def save(self, *args) -> None:
        raise NotImplementedError()


class KoboFormat(BaseFormat):
    def process(self, words) -> None:
        groups = self.make_groups(words)
        variants = self.make_variants(words)

        wordlist = [self.process_word(word) for word in words]

        self.save(wordlist, groups, variants)

    def save(self, wordlist, groups, variants) -> None:
        ...

That part is far from finished, but once we have a fully working format, we will use this kind of code to generate the dict files:

# Get all registered formats
formaters = get_formaters()

# Get all words from the database
words = get_words()

# And distribute the workload
from multiprocessing import Pool

def run(cls):
    formater = cls(locale, output_dir)
    formater.process(words)

with Pool(len(formaters)) as pool:
    pool.map(run, formaters)
BoboTiG commented Dec 14, 2020

About the final step: converting to the Kobo dictionary.

You talked about .df files in #409. I was wondering if, for that ticket only, we could just implement the code that outputs to that format. Doing so would let pyglossary do all the work (Kobo, Kindle, ...). The last step would then just be a matter of outputting to the right .df format.

That would be the "save" step. And we would need another step, "convert", that would handle the calls to pyglossary for the different dictionaries.

BoboTiG commented Dec 14, 2020

Well, we are using custom HTML code for the Kobo, so we will need to test the pyglossary output before making such a move.

lasconic commented:
I would keep the Kobo output as it is and add df. I'm not sure whether pyglossary can be called from Python or should be called from the CI.

Rendering with 7 threads is twice as fast. PR to come.

lasconic commented Dec 15, 2020

Mostly done. Only testing is missing... How do you want to tackle it?
A big project is starting tomorrow or later this week, so the last PR could be my last push for a while.

BoboTiG commented Dec 15, 2020

Hmm, the test_N_*.py files will be a pain; I will handle them. If you can migrate test_$LOCALE.py, that would be great. But again, if you do not have time, I will have some on my side, so it is not a big deal.

If you want to tackle another issue, go ahead too :)

lasconic commented Dec 15, 2020

test_$LOCALE.py was easy ;) see PR #478.
Good luck with the test_N_*.py files!

BoboTiG commented Dec 15, 2020

I can close the issue now, thanks a lot for your help, it was awesome 💪

BoboTiG commented Dec 18, 2020

The refactoring is finished 🍾
Test coverage is at 100% (except for arabiser.py). I think we should rework the release descriptions to include all dictionaries (when available), or simply list the download files.

@BoboTiG BoboTiG added the QA/CI label Sep 19, 2024