
Replace pycld3 dependency? #593

Closed

osma opened this issue Jun 21, 2022 · 27 comments · Fixed by #626
@osma commented Jun 21, 2022

The pycld3 language detection library which we depend on seems to have install issues on Python 3.10 (see #589). The last release 0.22 was in March 2021.

I think we should consider switching to a more actively maintained library. This should be easy now that we are only using language detection for language filtering but not in other parts of the Annif codebase.

A promising candidate would be Lingua, but there are others.

osma added this to the Long term milestone Jun 21, 2022
@osma commented Jun 21, 2022

It should be noted that Lingua is a fairly new library and so has a very short track record, with only two releases so far.

@osma commented Jun 22, 2022

There is an issue asking about Python 3.10 support for pycld3: bsolomon1124/pycld3#31

@osma commented Aug 4, 2022

As pointed out by @adulau in this comment, Lingua can use huge amounts of memory. I tested it in the default lazy loading configuration, and detecting the language of the example sentence "languages are awesome" required 1.8GB of memory; I assume languages like Russian, Arabic and Chinese were excluded because they are written in non-Latin script. When I preloaded all languages, the memory usage was 2.6GB.
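For reference, the test above can be reproduced with just a few lines; this is a minimal sketch assuming the lingua-language-detector package's builder API (models are lazy-loaded by default, and with_preloaded_language_models() loads them all up front):

```python
from lingua import LanguageDetectorBuilder

# Default (lazy) configuration: models are loaded on demand during detection.
lazy_detector = LanguageDetectorBuilder.from_all_languages().build()
print(lazy_detector.detect_language_of("languages are awesome"))

# Preloading all language models up front, which increases resident memory accordingly.
preloaded_detector = (
    LanguageDetectorBuilder.from_all_languages()
    .with_preloaded_language_models()
    .build()
)
print(preloaded_detector.detect_language_of("languages are awesome"))
```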

@pemistahl commented

Hello, I'm the author of Lingua. I've managed to reduce the memory consumption of the library. All language models together now just take around 800 MB in memory. Perhaps you want to re-evaluate Lingua for this project once again. If you have any questions, feel free to ask here or join this discussion that @osma has opened.

@osma commented Aug 19, 2022

Thanks @pemistahl , that is excellent news! We will take a new look at Lingua.

@pemistahl commented

@osma I have just released Lingua 1.1.0. In high accuracy mode, memory consumption is now at 800 MB. In low accuracy mode, it's even just 60 MB.

[Plot: memory consumption of Lingua in high vs. low accuracy mode]
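For context, switching between the two modes is a single builder option; a minimal sketch, again assuming the lingua-language-detector 1.1+ API:

```python
from lingua import LanguageDetectorBuilder

# High accuracy mode is the default; low accuracy mode trades detection quality
# for a much smaller memory footprint, as shown in the plot above.
detector = LanguageDetectorBuilder.from_all_languages().with_low_accuracy_mode().build()
print(detector.detect_language_of("languages are awesome"))
```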

@osma commented Aug 23, 2022

@pemistahl Whoa, that's quite an improvement!

@pemistahl commented

Yes, that's because the models are now stored in NumPy arrays instead of dictionaries. Querying the arrays is slower than querying dictionaries, that's the downside. But I still use a dictionary as a cache for already looked-up ngrams to speed up the process again.
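Illustrative only (not Lingua's actual code): the pattern described here amounts to a plain dict acting as a memo in front of a slower lookup into a sorted NumPy array.

```python
import numpy as np

ngrams = np.array(["aa", "ab", "ba", "bb"])     # sorted ngram keys
log_probs = np.array([-1.2, -2.3, -0.7, -3.1])  # corresponding log-probabilities
cache = {}                                      # dict cache for already looked-up ngrams

def lookup(ngram):
    if ngram in cache:                          # fast path: dictionary hit
        return cache[ngram]
    idx = np.searchsorted(ngrams, ngram)        # slower path: binary search in the array
    value = float(log_probs[idx]) if idx < len(ngrams) and ngrams[idx] == ngram else None
    cache[ngram] = value                        # memoize for the next occurrence
    return value
```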

@pemistahl commented

FYI: There was a little bug in version 1.1.0 that caused wrong probabilities to be returned for certain ngrams. I've just fixed that. So please use version 1.1.1 now for your tests. Thank you.

@osma commented Aug 26, 2022

I did some testing of Lingua in a draft PR #615, you may want to check that out @pemistahl

@osma commented Sep 1, 2022

@adbar suggested these other language detection approaches in #617 (comment) :

  • Simplemma should be good enough and especially good on noisy text.
  • I've used langid.py ever since it has been made available and I recently released a Python3 port (py3langid). It works on N-Grams instead of whole words so both approaches are complementary.
  • Fasttext works fast and well for most languages; you can look at related work, for example whatthelang, and luga looks promising.
  • For yet another approach (hunspell + fasttext) see fastspell.

We could take a look at these and compare how well they work, similar to the Lingua experiments in PR #615 but testing on a different data set (e.g. Finnish language jyu-theses) where filtering by language actually improves results.

@adbar commented Sep 1, 2022

Hi, just a quick evaluation on my side:

  • WiLI-2018 dataset (Wikipedia sentences, so pretty regular, rather short input, noisy with named entities)
  • A few Germanic languages not too far from one another (da, de, en, lb, nl, sv) because of my current focus, 3000 texts in total
  • Packages initialized beforehand, with the right language subset where possible (marked with * in the table below)
  • Software notes: I couldn't install fastspell, lingua-language-detector doesn't feature Luxemburgish, lplangid and langdetect are left out because of their poor performance (maybe I missed something), latest simplemma from the repository
  • Python 3.8.10
| Package | Accuracy | Time |
| --- | --- | --- |
| luga | 0.954 | 0.184 |
| pycld2 | 0.945 | 0.068 |
| pycld3 | 0.967 | 1.599 |
| py3langid | 0.961 | 0.709 |
| py3langid* | 0.981 | 0.404 |
| simplemma | 0.945 | 5.471 |
| simplemma* | 0.948 | 1.102 |
| whatthelang | 0.967 | 0.228 |

Quick and dirty approach, many questions left out here! I'm open for discussion and for wider benchmarks.
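For readers who want to reproduce something similar, the rough shape of such a quick accuracy/time comparison might look like this; a sketch only, not @adbar's actual script. py3langid's classify() is used as one example detector, and the sample data here is just a placeholder:

```python
import time

import py3langid as langid  # one of the detectors from the table above

def evaluate(classify, samples):
    """classify: text -> ISO 639-1 code; samples: list of (text, gold_language) pairs."""
    start = time.perf_counter()
    correct = sum(1 for text, gold in samples if classify(text) == gold)
    elapsed = time.perf_counter() - start
    return correct / len(samples), elapsed

samples = [("Das ist ein Beispiel.", "de"), ("This is an example.", "en")]
accuracy, seconds = evaluate(lambda text: langid.classify(text)[0], samples)
print(f"accuracy={accuracy:.3f}, time={seconds:.3f}s")
```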

@osma commented Sep 28, 2022

I created PR #626 which uses Simplemma for language detection instead of pycld3 (or Lingua in PR #615). I intend to benchmark these three approaches in the near future.

@adulau commented Sep 29, 2022

@adbar Thank you very much for the benchmark. Do you have some statistics on memory usage too? I remember some libs were pretty fast but with significant memory usage compared to some others like pycld3.

@osma commented Sep 29, 2022

I have now redone the benchmarks described in #615 (comment) with some changes.

This time I used the Finnish language parts of the Finto AI data set, with YSO Filolaos as the vocabulary. Again I used two project configurations with two backend algorithms, MLLM and Omikuji Parabel. For training MLLM, I used the fulltext-train collection of fulltext documents from different sources (n=2788; sorry, I cannot share these). For training Omikuji, I used the file yso-finna-fi-01.tsv.gz (n=2M short-text documents; in practice these will not be filtered during training as the language filter isn't applied on very short documents). Both projects were evaluated using Finnish language test documents (fin-test subset) from the jyu-theses collection (n=766).

I compared current master (which uses pycld3) to the PR #615 branch, which uses Lingua 1.1.3 (mostly in low-accuracy (LA) mode, but the evaluations were also rerun in high-accuracy (HA) mode), and to the PR #623 branch, which uses Simplemma 0.8.2 for language filtering. As a baseline, I also used project configurations with no language filtering. Here are the project configurations:

```
[yso-mllm-fi-filter]
language=fi
backend=mllm
vocab=yso
analyzer=voikko(fi)
transform=limit(10000),filter_lang,limit(5000)
```

```
[yso-omikuji-parabel-fi-filter]
language=fi
backend=omikuji
analyzer=voikko(fi)
vocab=yso
transform=limit(10000),filter_lang,limit(5000)
```

Again, for the baseline case with no filter I used transform=limit(5000) instead. For performance stats, I used /usr/bin/time -v. "time" means total user time in seconds over all CPU cores and "mem" means maximum resident set size in kilobytes. All train and eval operations were performed using 8 parallel jobs (-j 8).

| operation | no filter time | no filter mem | pycld3 time | pycld3 mem | lingua LA time | lingua LA mem | lingua HA time | lingua HA mem | simplemma time | simplemma mem |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train mllm | 2797 | 1924600 | 2851 | 1940868 | 6242 | 1952924 | - | - | 2686 | 2844544 |
| eval mllm | 359 | 531572 | 368 | 531064 | 1133 | 530776 | 24657 | 759592 | 392 | 1335972 |
| train omikuji | 3777 | 6230084 | 4004 | 6495436 | 3770 | 6428592 | - | - | 3657 | 7082264 |
| eval omikuji | 83 | 2644828 | 96 | 2649708 | 856 | 2641988 | 23418 | 2897788 | 124 | 3482256 |

Evaluation results (higher is better):

| project type | no filter F1@5 | no filter nDCG | pycld3 F1@5 | pycld3 nDCG | lingua LA F1@5 | lingua LA nDCG | lingua HA F1@5 | lingua HA nDCG | simplemma F1@5 | simplemma nDCG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mllm | 0.4646 | 0.6091 | 0.4653 | 0.6132 | 0.4633 | 0.6113 | 0.4639 | 0.6121 | 0.4614 | 0.6107 |
| omikuji | 0.3453 | 0.4605 | 0.3697 | 0.4923 | 0.3661 | 0.4899 | 0.3663 | 0.4898 | 0.3733 | 0.4921 |

Observations:

  • speed: pycld3 is the fastest, Simplemma is not far behind, Lingua is much much slower, even more so in high accuracy mode
  • memory overhead of pycld3 and Lingua LA appears to be very small, while Simplemma adds +800MB to the resident set size; Lingua HA adds around 200-250MB
  • MLLM results did not improve with language filtering - the evaluation scores are all very close and this is probably just random variation
  • Omikuji results did improve by 2-3 percentage points in terms of F1@5 score and similarly for the nDCG scores with all three kinds of language filtering (when compared to the baseline, no filtering). The best results were obtained with Simplemma, pycld3 came second, Lingua in third place (both HA and LA). The differences between these three are not very dramatic, though.
  • There is almost no difference in the results between Lingua HA and LA modes.

Some preliminary conclusions:

  • Lingua isn't doing very well in this comparison. It's much slower than the alternatives, and gives the least benefit in terms of quality.
  • Simplemma is very promising, but I'm worried about the extra 800MB of memory it consumes. I wonder if this is necessary, given that it is just asked to identify which sentences are in Finnish and which are not (basically testing in_target_language(sentence, lang='fi') >= 0.5; see the sketch below). Is there some bug that makes it use more memory than it should?
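For illustration, the kind of sentence-level filtering discussed above might look roughly like this; a sketch only, built around the in_target_language call mentioned in the conclusions. The import path, splitting logic and short-sentence handling are assumptions, not Annif's actual filter_lang implementation:

```python
from simplemma import in_target_language  # older releases may expose this under simplemma.langdetect

def filter_sentences(text, lang="fi", threshold=0.5, min_length=50):
    """Keep sentences whose words are mostly in the target language."""
    kept = []
    for sentence in text.splitlines():  # naive splitting, for illustration only
        sentence = sentence.strip()
        if not sentence:
            continue
        # Very short sentences are passed through unchanged here (an assumption;
        # the thread only notes that very short documents are not filtered).
        if len(sentence) < min_length or in_target_language(sentence, lang=lang) >= threshold:
            kept.append(sentence)
    return "\n".join(kept)
```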

@osma commented Sep 29, 2022

Reported the huge memory usage in Simplemma as adbar/simplemma#19

@pemistahl commented

@osma My library is slower because it is written in pure Python. pycld3 is written in C++ and simplemma uses mypyc to compile the Python modules to C extensions. I've already experimented with Cython and mypyc within Lingua, resulting in performance improvements which I will release in a later version.

You should also add Lingua's high accuracy mode to this comparison because this is what makes the library superior to most other language detection libraries. Memory consumption and running time will be higher but accuracy should be much better. It is kind of unfair to leave out the high accuracy mode and then state that it gives the least benefit in terms of quality.

I've just released Lingua 1.1.3 which improves performance by roughly 30% compared to 1.1.2. So maybe you want to update your evaluation again.

@osma commented Sep 30, 2022

> @osma My library is slower because it is written in pure Python. pycld3 is written in C++ and simplemma uses mypyc to compile the Python modules to C extensions. I've already experimented with Cython and mypyc within Lingua, resulting in performance improvements which I will release in a later version.

I understand. Though Simplemma is also pure Python, and it was almost as fast as pycld3 in the comparison. EDIT: Apologies, I read your text carelessly. Yes, Simplemma can use mypyc. But it seems to me that the Simplemma package files on PyPI, which is what I was using for the benchmark, are just pure Python. There is just a single "any" wheel, not the architecture-specific ones you'd expect from compiled C extensions.

Simplemma uses a different, vocabulary-based approach for language detection though, not n-grams like most other language detectors including Lingua.

> You should also add Lingua's high accuracy mode to this comparison because this is what makes the library superior to most other language detection libraries. Memory consumption and running time will be higher but accuracy should be much better. It is kind of unfair to leave out the high accuracy mode and then state that it gives the least benefit in terms of quality.

I apologize for the harsh wording. I was focused on the downstream results - how the language detection, when applied as a filter for training and evaluation data, affects the quality of automated subject indexing. This may or may not correlate with quality benchmarks that focus purely on the accuracy of language detection. It is entirely possible that even a perfect language detector with 100% accuracy would achieve a low score on this downstream benchmark because there are so many confounding factors. As I also noted above, the differences between the three language detection approaches are quite small ("not very dramatic"). Using Simplemma instead of Lingua (low accuracy mode) with Omikuji improved the F1 score by 0.7 points (pycld3 was halfway between those) and some of these differences could well be just random variation.

I understand that Lingua's strong point is the high accuracy it achieves. But for an application like input preprocessing in Annif, it just doesn't make sense to spend so much computing resources (even just the low accuracy mode tested here) on the language detection part, when the maximum possible benefit is something like half a percentage point in F1 score compared to other, more lightweight approaches. Those resources would likely be better spent in other parts of the process, for example the classification algorithms themselves rather than the preprocessing.

> I've just released Lingua 1.1.3 which improves performance by roughly 30% compared to 1.1.2. So maybe you want to update your evaluation again.

That is great news, congratulations!

I might consider doing another round (also including py3langid for example, or a possible new version of Simplemma with lower memory use) but for now I have other, more urgent tasks.

@osma commented Sep 30, 2022

I realized that I can just run the Omikuji evaluation part again with Lingua 1.1.3, without redoing the whole benchmark. Hang on...

@osma commented Sep 30, 2022

@pemistahl I upgraded to Lingua 1.1.3 and reran the Omikuji and MLLM evaluations. The Omikuji evaluation runtime decreased from 935 to 856 seconds and the MLLM runtime from 1210 to 1133 seconds. So it's an improvement for sure, but not super dramatic. I updated the table above. Evaluation scores didn't change at all.

Benchmark with Lingua in high accuracy mode is currently running, but as expected, it's taking a while...

@osma commented Sep 30, 2022

I finished the (partial) benchmark of Lingua in high-accuracy mode and edited the results table above accordingly. The runtime was at least an order of magnitude larger than in low-accuracy mode. Sorry @pemistahl, but the result quality hardly changed at all. I don't think this is because Lingua is less accurate than the others; it's just, for some reason, not very well suited to this particular task (and it's possible that tweaking the way it's used could improve the results).

@adbar commented Oct 4, 2022

@adulau I assume osma's comment answered your question.

@osma As you say, mypyc can be used locally, but I didn't enable it in the package release. I can confirm the open question regarding memory usage in Simplemma.

As a side note, you could use hyperfine instead of /usr/bin/time for the benchmarks.

@osma commented Oct 4, 2022

Thanks for the tip @adbar , I wasn't aware of hyperfine. Though it seems to me it will only measure execution time, not memory usage.

@pemistahl commented

> I apologize for the harsh wording.

No worries, @osma. I'm not resentful. :)

> I understand that Lingua's strong point is the high accuracy it achieves. But for an application like input preprocessing in Annif, it just doesn't make sense to spend so much computing resources (even just the low accuracy mode tested here) on the language detection part, when the maximum possible benefit is something like half a percentage point in F1 score compared to other, more lightweight approaches. Those resources would likely be better spent in other parts of the process, for example the classification algorithms themselves rather than the preprocessing.

This is absolutely reasonable. Then Lingua is simply not the right tool for your job. That's ok. Luckily, there are enough language detectors to choose from, especially in the Python ecosystem.

I was curious and added Simplemma to my own evaluation of language detectors. As expected, the vocabulary-based approach is not as good as the ngram-based approach. The detection accuracy differs significantly between the languages. For Finnish, Simplemma is pretty accurate with 81% on average. But other languages, such as Spanish, for instance, do not perform so well. You can find the accuracy reports in the Lingua repo.

[Plot: average detection performance of the evaluated detectors]

@osma commented Oct 6, 2022

> This is absolutely reasonable. Then Lingua is simply not the right tool for your job. That's ok. Luckily, there are enough language detectors to choose from, especially in the Python ecosystem.

Yes, right. There's also the issue of API design - Simplemma provides the in_target_language function which is well suited for this specific task of filtering by language. It gives the estimated proportion of words in the text that are in the expected target language, and it only needs to load and make use of a single language model. I couldn't find anything similar in Lingua, so what I did in PR #615 was to use Lingua to detect the language of a sentence out of all languages it knows, which requires loading and using all 75 available language models (or at least a significant proportion of them). This means Lingua has to do a lot more work than Simplemma to accomplish the same thing, which at least partly explains the difference in performance.
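For comparison, the Lingua-based approach described here (detecting the single most likely language and comparing it to the target) might be sketched like this, assuming the lingua-language-detector builder API; this is not the exact code from PR #615:

```python
from lingua import Language, LanguageDetectorBuilder

# A detector built over all supported languages, so many models may need to be loaded.
detector = LanguageDetectorBuilder.from_all_languages().with_low_accuracy_mode().build()

def is_in_target_language(sentence, target=Language.FINNISH):
    # detect_language_of() returns the most likely Language, or None if undecided.
    return detector.detect_language_of(sentence) == target
```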

> I was curious and added Simplemma to my own evaluation of language detectors. As expected, the vocabulary-based approach is not as good as the ngram-based approach. The detection accuracy differs significantly between the languages. For Finnish, Simplemma is pretty accurate with 81% on average. But other languages, such as Spanish, for instance, do not perform so well. You can find the accuracy reports in the Lingua repo.

This is great, thanks a lot! It's very useful to have a benchmark that is evaluated on many different detectors. I didn't expect Simplemma to be super accurate, as language detection is just an extra feature and the main purpose of the library is lemmatization. Also, there seem to be large differences in the size of the vocabularies Simplemma knows about across its supported languages. It's quite natural that Simplemma has difficulties detecting languages with small vocabularies. Finnish happens to be the language with the largest included vocabulary, though this also has a lot to do with the complex morphology of the language.

Would it be possible for you to also include the spent CPU time and memory for each detector in the benchmark results? At least for me those are important considerations, and also @adulau asked about it above, so others would likely be interested too. Since you run the same tests on every detector, the resource usage should be quite easily comparable, right?
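One simple way to capture both numbers from inside a Python benchmark run, as a sketch only: this assumes a Linux system where ru_maxrss is reported in kilobytes, and py3langid is just a stand-in for whichever detector is under test.

```python
import resource
import time

import py3langid as langid  # stand-in for the detector being measured

start_cpu = time.process_time()
for text in ["languages are awesome", "kielet ovat mahtavia"]:  # placeholder test data
    langid.classify(text)
cpu_seconds = time.process_time() - start_cpu

# Maximum resident set size of the current process (kilobytes on Linux).
max_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"CPU time: {cpu_seconds:.3f}s, max RSS: {max_rss_kb} KB")
```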

@adbar commented Oct 6, 2022

Thanks @pemistahl for the detailed evaluation! I also like the bar plots you made to compare the results by language.

A quick remark on the methodology, you write that "a random unsorted subset of 1000 single words, 1000 word pairs and 1000 sentences has been extracted".
You could control sentence length and punctuation to make sure that your random sentences are all (1) actual sentences and (2) not too short (since you evaluate word pairs already). You could also take more instances of each language to make the evaluation more reliable.
You may have a data issue with certain languages (e.g. Catalan or Malay); maybe the Projekt Wortschatz data isn't completely reliable, since all detectors get them consistently wrong.

The fact that an n-gram approach works well on single words and on word pairs explains the overall performance of Lingua and others but not the relatively poor performance of CLD, that's interesting.

Simplemma works as expected IMO, it's a meaningful baseline or a good trade-off between simplicity and accuracy, and as @osma says language detection isn't its main purpose anyway.

@pemistahl commented

> There's also the issue of API design - Simplemma provides the in_target_language function which is well suited for this specific task of filtering by language. It gives the estimated proportion of words in the text that are in the expected target language, and it only needs to load and make use of a single language model.

I think it is not too difficult to implement something like this in Lingua. I will try to do that.

> Would it be possible for you to also include the spent CPU time and memory for each detector in the benchmark results?

I have to rewrite some parts of the accuracy reports script to do so, but yes, it is surely possible. I don't know when I will have the time, though.

> You may have a data issue with certain languages (e.g. Catalan or Malay); maybe the Projekt Wortschatz data isn't completely reliable, since all detectors get them consistently wrong.

Maybe I will try to find a better source of test data for certain languages, but that is not on my todo list at the moment. Perhaps later on.

osma closed this as completed in #626 Nov 15, 2022
osma modified the milestones: Long term, 0.60 Dec 21, 2022