This repository has been archived by the owner on Mar 3, 2020. It is now read-only.

Metrics to be included (proposal) #8

Closed
bact opened this issue May 20, 2019 · 2 comments

Comments

@bact
Member

bact commented May 20, 2019

Metrics to be included and other discussions, moved over from PyThaiNLP/pythainlp#62

Quality

  • Does the tokenizer preserve every character, or is it destructive?
  • Character-level
    • Does it violate Thai Character Cluster (TCC) rules? (heavier penalty)
  • Word-level
    • Word-level, broken down by word length
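
To make the quality items above concrete, here is a minimal sketch of a losslessness check plus character-level (boundary) and word-level F1 scores. All function names are hypothetical, the tokens are ASCII placeholders rather than real Thai text, and the sketch assumes both tokenizations are lossless so that character offsets line up. The TCC penalty and the breakdown by word length are not covered.

```python
def is_lossless(text, tokens):
    """True if concatenating the tokens reproduces the input exactly."""
    return "".join(tokens) == text


def boundaries(tokens):
    """Character offsets where one token ends and the next begins."""
    ends, pos = set(), 0
    for tok in tokens[:-1]:  # the end of the text is not a decision point
        pos += len(tok)
        ends.add(pos)
    return ends


def word_spans(tokens):
    """(start, end) character spans of every token."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def f1(expected, predicted):
    """F1 score between two sets of items (boundaries or spans)."""
    if not expected and not predicted:
        return 1.0
    true_positives = len(expected & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


# Placeholder data: a reference segmentation and one tokenizer's output.
text = "abcde"
reference = ["ab", "cd", "e"]
predicted = ["ab", "c", "de"]

lossless = is_lossless(text, predicted)
char_level_f1 = f1(boundaries(reference), boundaries(predicted))
word_level_f1 = f1(word_spans(reference), word_spans(predicted))
```

A word counts as correct here only if both its start and its end match the reference, which is why the word-level score is usually lower than the character-level one.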

Speed

  • Characters per second (on standardized machine)
    • May be tested with texts of different sizes (small and large), to expose the "boot time" (start-up cost) of a tokenizer
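
One possible way to measure this, sketched with the standard library only: time the first call separately from later calls, so that any lazy loading of dictionaries or models shows up as the difference between the two. The `benchmark` helper is hypothetical, and `str.split` is only a stand-in for a real tokenizer.

```python
import time


def time_call(tokenize, text):
    """Wall-clock seconds taken by a single call to tokenize(text)."""
    start = time.perf_counter()
    tokenize(text)
    return time.perf_counter() - start


def benchmark(tokenize, text, repeats=5):
    """Time the first (cold) call, then take the best of several warm calls."""
    first = time_call(tokenize, text)  # includes any lazy start-up work
    warm = min(time_call(tokenize, text) for _ in range(repeats))
    rate = len(text) / warm if warm > 0 else float("inf")
    return {
        "chars": len(text),
        "first_call_s": first,
        "warm_s": warm,
        "warm_chars_per_s": rate,
    }


# Stand-in tokenizer and texts of two sizes (small and large).
small = benchmark(str.split, "a b c " * 10)
large = benchmark(str.split, "a b c " * 100_000)
```

Within one process only the very first call can include start-up cost, so comparing tokenizers fairly on "boot time" would likely need a fresh process per measurement.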

Memory footprint

  • Memory used by the tokenizer at run time, while tokenizing a fixed amount of text
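
As a rough sketch, Python's `tracemalloc` can report the peak memory allocated while a tokenizer runs. The `peak_python_memory` helper is hypothetical and `str.split` is a stand-in tokenizer. Note the limitation: `tracemalloc` only sees allocations made through Python's memory allocator, so tokenizers backed by native code or running outside Python would need to be measured from outside the process instead.

```python
import tracemalloc


def peak_python_memory(tokenize, text):
    """Peak bytes allocated by Python while running tokenize(text)."""
    tracemalloc.start()
    try:
        tokenize(text)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Stand-in tokenizer on a fixed amount of text.
peak_bytes = peak_python_memory(str.split, "a b c " * 10_000)
```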

Disk size

  • Total size of the tokenizer, including its dictionary, models, and all non-standard dependencies (excluding the runtime environment, such as the interpreter or VM)
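
If a tokenizer and its data are installed under one directory, its disk size could be approximated by summing file sizes, as in the sketch below. The `directory_size_bytes` helper is hypothetical, and deciding which directories count as "non-standard dependencies" is left open. The demo builds a throwaway directory so that it does not depend on any real installation path.

```python
import os
import tempfile


def directory_size_bytes(root):
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Demo on a temporary directory standing in for an installed tokenizer.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "models"))
    with open(os.path.join(root, "dictionary.txt"), "wb") as f:
        f.write(b"0123456789")  # 10 bytes
    with open(os.path.join(root, "models", "weights.bin"), "wb") as f:
        f.write(b"abcde")  # 5 bytes
    demo_size = directory_size_bytes(root)
```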
@bact bact changed the title Metrics to be included Metrics to be included (proposal) May 20, 2019
@p16i
Collaborator

p16i commented May 27, 2019

@bact thanks for the proposal.

Because this repository is mainly for comparing end results, i.e. tokenised texts, only quality metrics can be measured here.

Speed, memory, and disk footprint, however, have to be measured while executing the algorithms. I think we can measure them via https://github.com/heytitle/thai-tokenisers.

I'm going to close this issue and we will discuss further details in #9 and PyThaiNLP/docker-thai-tokenizers#1.

@bact I've also written down some thoughts there on how to implement those performance metrics. Could you please comment on my idea?

@bact
Member Author

bact commented May 27, 2019

Thanks krub, will see it over there.
