
Add language detection #642

Open
tfmorris opened this issue Dec 3, 2012 · 33 comments · Fixed by #4651
Assignees
Labels
grel (the default expression language, GREL, could be improved in many ways) · localization (anything to do with i18n/l10n) · Type: Feature Request (identifies requests for new features or enhancements)

Comments

@tfmorris
Member

tfmorris commented Dec 3, 2012

A function to detect the language of a piece of text would be useful. There's a high-performance (99% accuracy), Apache-licensed implementation in Java here: https://github.com/shuyo/language-detection (originally http://code.google.com/p/language-detection/)
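For intuition, here is a toy, self-contained sketch of the character n-gram profile idea such detectors build on. This is illustrative only: the shuyo library actually uses naive Bayes over n-gram frequency profiles trained on a large corpus, and the class and training samples below are made up.

```java
import java.util.*;

// Toy character-trigram language identifier. Real detectors use
// probabilistic n-gram frequency profiles; this version just counts
// trigram-set overlap, which is enough to show the idea.
class TinyLangDetect {
    private final Map<String, Set<String>> profiles = new HashMap<>();

    // Build a trigram "profile" for a language from a sample text.
    void train(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    // Return the trained language whose profile overlaps the input most.
    String detect(String text) {
        Set<String> input = trigrams(text);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> overlap = new HashSet<>(input);
            overlap.retainAll(e.getValue());
            if (overlap.size() > bestScore) {
                bestScore = overlap.size();
                best = e.getKey();
            }
        }
        return best;
    }

    private static Set<String> trigrams(String s) {
        s = " " + s.toLowerCase().replaceAll("[^\\p{L} ]", "") + " ";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }
}
```

A production library replaces the hand-fed samples with pre-built models for dozens of languages and returns calibrated probabilities rather than raw overlap counts.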

@kiminoa

kiminoa commented Dec 3, 2012

One interesting use case to note for fine tuning ... I use OpenRefine for work on alphasyllabaries. (The clustering algorithms are fantastic for a lot of language / inflection-identifying / etc. work.) At least for the Mycenaean Greek (Linear B) and Cypriot Greek alphasyllabaries, these are represented by delimiting between morphemes with white space and internally with dashes, i.e. ti-ri-to represents a 3-symbol word. It would be slick if any language detection in OpenRefine was also wise enough to say "oh, hey, this is an alphabetic transliteration, isn't it? huh."

@tfmorris
Member Author

@kiminoa I'm not sure whether you are requesting detection of transliterations or just making sure it doesn't inappropriately try to detect a transliteration as a language. We can do the latter with a confidence threshold, but the language models are all pre-built for a fixed set of languages.

@kiminoa

kiminoa commented Dec 11, 2012

The latter, for starters. Chrome's language detection (Compact Language Detection (CLD)) usually matches Linear B transliterations to Croatian, amusingly. Are the CLD and Apache's implementation using the same underlying code?

@magdmartin
Member

The Google Language API can also do this job; see the old wiki for details: http://code.google.com/p/google-refine/wiki/FetchingURLsFromWebServices

@tfmorris
Member Author

@kiminoa I'm pretty sure Apache's implementation is different from the Chrome implementation. We'll try to make sure that the confidence thresholds get set appropriately, but I think Linear B transliterations might be considered an edge case.

@magdmartin That page is available in the current wiki as well.

@symac
Contributor

symac commented Apr 29, 2020

I found this issue while trying to find a solution to this problem. Seeing that it was still open, I tried a solution using Jython, and it proved to work quite well: http://www.geobib.fr/blog/2020-04-29-openrefine-detect-lang

@wetneb wetneb added the Good First Issue Indicates issues suitable for newcomers to design or coding, providing a gentle introduction. label Apr 29, 2020
@elroykanye
Member

Hello @tfmorris @wetneb, I wish to ask if it is okay for me to work on this.

@wetneb
Member

wetneb commented Mar 27, 2022

Sure! Actually, it seems a bit difficult for a first issue; I am not sure I agree with my previous assessment (it might involve adding a new dependency, and that is not so easy since we need to watch out for licensing issues and size considerations).

@wetneb wetneb removed the Good First Issue Indicates issues suitable for newcomers to design or coding, providing a gentle introduction. label Mar 27, 2022
@elroykanye
Member

Ah I see 🤔
Worth the try anyway. I'll let you know of every step I take 😄🙏🏾

@elroykanye
Member

elroykanye commented Mar 27, 2022

@wetneb
Just to be clear, I'd like to ask: which piece of text will this feature detect the language of? The data source? Or is it to detect from the browser settings?
I am a bit confused. Could I get some pointers, perhaps with an example showing the use case?

@wetneb
Member

wetneb commented Mar 27, 2022

I imagine this could be implemented as a GREL function, for instance called detectLanguage, which would work like this:

  • detectLanguage("hello world, what a beautiful day") evaluates to "en"
  • detectLanguage("bonjour le monde, quel jour magnifique") evaluates to "fr"
  • and so on
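Such a function could be sketched roughly as follows. Note this is a hypothetical shape, not OpenRefine code: the Function interface below is a simplified stand-in for com.google.refine.grel.Function (the real one also receives evaluation bindings), and the one-word heuristic merely stands in for a real detection library.

```java
// Simplified stand-in for com.google.refine.grel.Function;
// the real interface also takes an evaluation context.
interface Function {
    Object call(Object[] args);
}

// Hypothetical detectLanguage function. A real implementation would
// delegate to a detection library; this toy heuristic distinguishes
// French from English by a single stop word, purely for illustration.
class DetectLanguage implements Function {
    @Override
    public Object call(Object[] args) {
        if (args.length != 1 || !(args[0] instanceof String)) {
            return new IllegalArgumentException("detectLanguage expects one string argument");
        }
        String text = ((String) args[0]).toLowerCase();
        return text.contains("bonjour") ? "fr" : "en";
    }
}
```

In GREL this would then be callable as `value.detectLanguage()` or `detectLanguage(value)` on a cell's text, like any other function.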

@elroykanye
Member

Okay. From what I understand, it should be something that implements this interface:

https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/grel/Function.java

Am I on the right track?

@wetneb
Member

wetneb commented Mar 28, 2022

Correct!

@elroykanye
Member

Awesome! Thanks, I will head on to that 😎

@elroykanye
Member

> A function to detect the language of a piece of text would be useful. There's a high-performance (99% accuracy), Apache-licensed implementation in Java here: https://github.com/shuyo/language-detection (originally http://code.google.com/p/language-detection/)

Hello @tfmorris , @wetneb
Following this, I realised that this project has quite a tree. I travelled down it and ended up at this implementation. So I made a branch to play around with the API, and it works pretty well.

  • Added dependencies:
    lang_detect_deps

  • Wrote a test for the utils class and a utils class
    lang-detect

  • Test results with some simple text
    test results

I can add a function class for this, as shown below:
function

I just wish to ask if there are any concerns.

@thadguidry
Member

thadguidry commented Mar 28, 2022

@elroykanye Nicely done thus far!

  1. My only concern is what output we give to the user as the language code. My hunch is that the output should be an ISO 639-3 code, but it seems you have output an ISO 639-2 code?

  2. Hmm, what about transforms into the user's locale if they wish? For instance, Japanese is jpn, and in a Chinese locale it could alternatively be displayed or transformed into the string 日語, which is how you say "Japanese language" in written Mandarin. Ref: https://www.wikidata.org/wiki/Q5287

Reconciling would probably be the way that 2. would be accomplished: a user would output the language code into a new column, then reconcile it against Wikidata or some other service, and finally ask for the language's label in their chosen locale. It would be "nice" to skip reconciling if we had the language label matrices built in, but that is probably too much for us to absorb, and I doubt there's a library that maintains the "complete" labels of all languages for ISO 639-3 codes. So reconciling is probably the best way to accomplish 2., as a second step if the user wishes. Although it would be cool to have a built-in GREL reconciling transformer function that looks up Wikidata and does it for you in one step, such as "This is a box".detectLanguage().getLocaleLabel("zho"), which would detect eng, do a Wikidata lookup for that ISO 639-3 code, get the Chinese label for it, and return 英語.

@elroykanye
Member

This is definitely a lot of information to take in 😬
Okay, from what I understand, the language code should not be limited to a code understood only by Anglophones, but should be able to work with other languages as well?

@thadguidry
Member

thadguidry commented Mar 28, 2022

@elroykanye To work on this issue, the focus is 1. only: to output a language code. But we need to make a decision between A. outputting an ISO 639-2 code or B. outputting an ISO 639-3 code.

You don't have to concern yourself with my ideas for 2. in this issue, since it would be a different new issue that we may or may not create in the future. I write things up briefly so that we don't lose track of them; even after issues are closed, they can still be searched.

If we had Discussions enabled, then my 2. comment would not have had to pollute this issue and confuse you. I continue to hope that @wetneb enables Discussions in our GitHub repo so that we have an easier way to correspond like this about ideas and plausible features until they are solidified and can become a focused issue to work on without question. Email threads on our mailing list just die; they are not easily searchable and cannot be categorized or linked to issues by the community.

@elroykanye
Member

Okay, I understand clearly. Discussions is actually a good idea, to pour out as much as I can, and I will need that to understand cases like this. Thanks very much for the explanation; it helped me get some new ideas :)

Concerning 1., it is an ISO 639-2 code returned from the language profile of the library I included.

@wetneb
Member

wetneb commented Mar 28, 2022

I write things up briefly so that we don't lose track of them even after issues are closed, they can still be searched.

@thadguidry it would be amazing if you could stay focused on the issue at hand. GitHub issues (or discussions, for that matter) are not an appropriate place to write down distracting wandering thoughts. It is important that we go straight to the point to help newcomers in their first contributions to the project.

If you have wandering thoughts about the project and feel the need to write them down publicly, then perhaps a blog could do?

@thadguidry
Member

@elroykanye Ah, then we have no choice...that's fine.
@wetneb splitting the community does not seem wise. "Lone wolves die alone."

@elroykanye
Member

Thanks @thadguidry and @wetneb

@wetneb
Member

wetneb commented Mar 28, 2022

@elroykanye to help you further it would be great if you could just open a pull request with your changes. That is easier to review than screenshots of your code.

@elroykanye
Member

Noted

@thadguidry
Member

thadguidry commented Mar 28, 2022

@wetneb I am staying focused on the issue at hand. Expanding or narrowing scope on an issue has always happened in our community, and we really only have issues, the mailing list, chat, or phone calls to leverage and tie conversation threads together in a non-optimal way. I also don't want to muddy the waters of an issue, but when an issue is not well solidified, discussion should take place within it, as well as on our mailing list, to get user feedback.

Additionally, "distracting wandering thoughts" was not what I wrote up, but rather thought-provoking questions and ideas directly related to the functionality of this issue, as witnessed by @elroykanye's own comment that it gave him some ideas. Those ideas should be publicly captured for review by all and allowed to be commented on or voted upon, as we already agreed in previous meetings.

@antoine2711
Member

I write things up briefly so that we don't lose track of them even after issues are closed, they can still be searched.

Maybe the Dev mailing list would be a better place to put your extra thoughts.

With all the potential Outreachy interns, the need to keep things simple and straightforward is greater.

I know I'm working to be as simple and short as possible in my writing on GitHub.

Regards, Antoine

@antoine2711 antoine2711 added the grel and localization labels Apr 12, 2022
@tfmorris
Member Author

I agree that:

  1. This is a difficult first issue, and that
  2. Comments here should remain focused on the issue at hand

The difficulty with this isn't in the coding, which is likely to be trivial once an appropriate algorithm/implementation has been chosen, but in doing the evaluation and choosing of the correct solution to start with.

Although I suggested language-detection a decade ago, for most of the last decade I probably would have advocated for CLD2 instead. In the last couple of years, the landscape has shifted again. There is a recent survey and benchmark available here and it links to some earlier evaluations.

I would suggest drawing up a list of evaluation criteria by which to measure implementations and then scoring all the current reasonable alternatives against that list of criteria. Criteria might include # of languages supported, model size, run time performance, implementation language, etc. You might also want to solicit input from the user community, since it's a relatively important implementation decision.
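To make the scoring step concrete, a weighted sum over criteria is the simplest form it could take; the criteria names, weights, and scores below are placeholders, not a proposal, for whatever the community settles on.

```java
import java.util.Map;

// Score one detector candidate as the sum of
// (criterion weight x criterion score). Missing scores count as zero.
class DetectorScore {
    static double score(Map<String, Double> weights, Map<String, Double> scores) {
        double total = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            total += w.getValue() * scores.getOrDefault(w.getKey(), 0.0);
        }
        return total;
    }
}
```

Each candidate (Optimaize, Lingua, etc.) would get one score map filled in from the evaluation, and the weights would encode the community's priorities.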

@wetneb wetneb added this to the 3.6 milestone Apr 25, 2022
@tfmorris
Member Author

tfmorris commented Oct 25, 2022

So, neither the establishment of evaluation criteria mentioned above nor the associated evaluation ever really happened, but you can find a related post hoc discussion on the pull request #4651 which was generated.

I still think it would be valuable to generate a set of evaluation criteria which reflect the needs of the user community and evaluate alternatives against it.

@thadguidry
Member

thadguidry commented Oct 25, 2022

Reopening because...

@tfmorris I honestly think we can skip the evaluation criteria. I use Lingua specifically because it addresses the needs that I have, which cover both long and shorter text. The current implementation we merged, using the older Optimaize library, fails miserably with just detectLanguage("psychological test") or even detectLanguage("美国") ("America"), as an example. For my usage needs, both short phrases (often entities) and longer sentences have their language detected nearly 100% of the time with Lingua, as you mentioned in #4651, and it's because of this. With our Optimaize-based implementation it's less than 70% of the time, since I'm often doing shorter phrases/entities.

@wetneb I think we should reopen this because our implementation is not useful enough for a majority of cases; we should instead just flip to using Lingua. And as a bonus, we can add a 2nd argument (true/false) to output confidence scores as JSON. Hack-ish example:

{"zh": {"MANDARIN": 1.0, "GERMAN": 0.8665738136456169, "FRENCH": 0.8249537317466078, "SPANISH": 0.7792362923625288}}

@thadguidry thadguidry reopened this Oct 25, 2022
@tfmorris
Member Author

I strongly disagree with using an N of 1 approach to requirements definition. It should be a solution which meets the needs of the entire user community. I don't know how you figure out what those needs are unless you ask. Some people will want more languages. Others will want smaller memory footprint. Others will want/need higher performance.

If there's a big enough spread in needs across the community, it may indicate the need to support more than one language detector, which, of course, would have implications for the design of the function/API.

If we figure out what the right rows and columns are for this table, we can then work on filling it in and figuring out how to weight the criteria.

|                    | Optimaize | OpenNLP | Tika | Lingua Low | Lingua Hi | fasttext | CLD2 | CLD3 |
|--------------------|-----------|---------|------|------------|-----------|----------|------|------|
| Languages          | 56        | 68      | 56   | 76         | 76        | 176      | 83   | 107  |
| Sentence accuracy  | 93.4      | 94.9    | 96.1 | 93.3       | 95.9      | 97.9     | 87.1 | 87.0 |
| Word pair accuracy | 60.8      | 74.1    | 80.8 | 78.5       | 89.2      |          |      |      |
| Memory usage       |           |         |      |            |           |          |      |      |
| Performance        |           |         |      |            |           |          |      |      |

@thadguidry
Member

The Lingua Java project readme doesn't say in sections 9.1+, but I found some in section 7.5 here: https://pemistahl.github.io/lingua-py/lingua.html

Unsure of Python versus Java there.

@wetneb
Member

wetneb commented Oct 26, 2022

Given all this I would be tempted to say that this feature should probably not be in the core but rather an extension, or possibly more extensions if different libraries cover various use cases.

@tfmorris
Member Author

I think (hope) it should be possible to find a reasonable default solution which could be integrated into the core, but I'm not sure we have enough information yet to make that call.

@wetneb wetneb removed this from the 3.6 milestone Jan 4, 2023