Add language detection #642
One interesting use case to note for fine-tuning... I use OpenRefine for work on alphasyllabaries. (The clustering algorithms are fantastic for a lot of language / inflection-identifying / etc. work.) At least for Mycenaean Greek (Linear B) and Cypriot Greek alphasyllabaries, transliterations are represented by delimiting morphemes with white space and separating symbols internally with dashes, i.e. ti-ri-to represents a 3-symbol word. It would be slick if any language detection in OpenRefine was also wise enough to say "oh, hey, this isn't an alphabetic transliteration, is it? huh."
@kiminoa I'm not sure whether you are requesting detection of transliterations or just making sure it doesn't inappropriately try to detect a transliteration as a language. We can do the latter with a confidence threshold, but the language models are all pre-built for a fixed set of languages.
The latter, for starters. Chrome's language detection (Compact Language Detection, CLD) usually matches Linear B transliterations to Croatian, amusingly. Are CLD and Apache's implementation using the same underlying code?
The Google Language API can also do that job; see the old wiki for details: http://code.google.com/p/google-refine/wiki/FetchingURLsFromWebServices
@kiminoa I'm pretty sure Apache's implementation is different from the Chrome implementation. We'll try to make sure that the confidence thresholds get set appropriately, but I think Linear B transliterations might be considered an edge case. @magdmartin That page is available in the current wiki as well.
I found this issue while trying to find a solution to this problem. Seeing that the issue was still open, I tried a solution using Jython, and it proved to work quite well: http://www.geobib.fr/blog/2020-04-29-openrefine-detect-lang
Sure! It actually seems a bit difficult for a first issue; I am not sure I agree with my previous assessment (it might involve adding a new dependency, and that is not so easy since we need to watch out for licensing issues and size considerations).
Ah I see 🤔
@wetneb |
I imagine this could be implemented as a GREL function, for instance called
Okay. From what I understand, it should be something that implements this interface:
Am I on the right track?
Correct! |
Awesome! Thanks, I will head on to that 😎 |
Hello @tfmorris, @wetneb. I have added a function class for this, as shown below; I just wish to ask if there are any concerns.
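(The screenshot from that comment is not reproduced here. Purely as an illustration, this is a minimal sketch of what such a function class might look like, assuming OpenRefine's `com.google.refine.grel.Function` interface and the optimaize language-detector library; the class name `DetectLanguage` and all of the wiring are assumptions, not the code that was actually posted.)

```java
package com.google.refine.expr.functions;

import java.io.IOException;
import java.util.Properties;

import com.google.common.base.Optional;
import com.google.refine.expr.EvalError;
import com.google.refine.grel.Function;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfileReader;

public class DetectLanguage implements Function {

    // Build the detector lazily and only once: loading the built-in language profiles is expensive.
    private static LanguageDetector detector;

    private static synchronized LanguageDetector getDetector() throws IOException {
        if (detector == null) {
            detector = LanguageDetectorBuilder.create(NgramExtractors.standard())
                    .withProfiles(new LanguageProfileReader().readAllBuiltIn())
                    .build();
        }
        return detector;
    }

    @Override
    public Object call(Properties bindings, Object[] args) {
        if (args.length != 1 || args[0] == null) {
            return new EvalError("detectLanguage expects one string argument");
        }
        try {
            Optional<LdLocale> locale = getDetector().detect(args[0].toString());
            // Return the detected language code, or null if the detector is not confident enough
            return locale.isPresent() ? locale.get().getLanguage() : null;
        } catch (IOException e) {
            return new EvalError("Could not load language profiles: " + e.getMessage());
        }
    }

    @Override
    public String getDescription() {
        return "Returns the code of the most likely language of the given text";
    }

    @Override
    public String getParams() {
        return "string s";
    }

    @Override
    public String getReturns() {
        return "string";
    }
}
```

Registering it under a GREL name would then presumably go through `ControlFunctionRegistry.registerFunction("detectLanguage", new DetectLanguage())`, and the choice of detection dependency is exactly the licensing and size question raised earlier.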
@elroykanye Nicely done thus far!
Reconciling would probably be the way that 2. would be accomplished? A user would output the language code into a new column, and then reconcile that code against Wikidata or some other service. Then they can ask for the language's label in their chosen locale. I could see it being a "nice" thing to have where reconciling could be skipped and we had the language label tables built in, but that is probably too much for us to absorb, and it's doubtful there's a library that maintains the "complete" labels of all languages for ISO 639-3 codes. So reconciling is probably the best way to accomplish 2., as a second step, if the user wishes. Although it would be cool to have a built-in GREL reconciling transformer function that looks up Wikidata and does it for you in one step, such as
This is definitely a lot of information to take in 😬 |
@elroykanye To work on this issue, the focus is 1. only: output a language code. But we need to make a decision: A. output an ISO 639-2 code, or B. output an ISO 639-3 code. You don't have to concern yourself with my ideas for 2. in this issue, since it would be a different new issue that we may or may not create in the future. I write things up briefly so that we don't lose track of them; even after issues are closed, they can still be searched. If we had Discussions enabled, then my comment about 2. would not have to pollute this issue and confuse you. I continue to hope that @wetneb enables Discussions in our GitHub repo so that we have an easier way to correspond like this about ideas and plausible features until they are solidified and can become a focused issue to work on without question. Email threads on our mailing list just die and are problematic because they are not easily searchable and cannot be categorized or linked to issues by the community.
Okay, I understand clearly. Discussions is actually a good idea, a place to pour out as much as I can; I will need that to understand cases like this. Thanks very much for the explanation, it helped me get some new ideas :) Concerning 1., it is an ISO 639-2 code returned from the language profile of the library I included.
@thadguidry it would be amazing if you could stay focused on the issue at hand. GitHub issues (or discussions, for that matter) are not an appropriate place to write down distracting wandering thoughts. It is important that we go straight to the point to help newcomers in their first contributions to the project. If you have wandering thoughts about the project and feel the need to write them down publicly, then perhaps a blog could do? |
@elroykanye Ah, then we have no choice then...that's fine. |
Thanks @thadguidry and @wetneb |
@elroykanye to help you further it would be great if you could just open a pull request with your changes. That is easier to review than screenshots of your code. |
Noted |
@wetneb I am staying focused on the issue at hand. Expanding or narrowing scope on an issue has always happened in our community, and we really only have issues, the mailing list, chat, or phone calls to leverage and tie conversation threads together in a non-optimal way. I also don't want to muddy the waters of an issue, but when an issue is not well solidified, discussion should take place within it, as well as on our mailing list, to get user feedback. Additionally, "distracting wandering thoughts" is not what I wrote up; rather, they were thought-provoking questions and ideas directly related to the functionality of this issue, as witnessed by @elroykanye's own comment that it gave him some ideas. Those ideas should be publicly captured for review by all and allowed to be commented on or voted upon, as we already agreed in previous meetings.
Maybe the dev mailing list would be a better place to put your extra thoughts. With all the potential Outreachy interns, the need to keep things simple and straightforward is greater. I know I'm working to be as simple and short as possible in my writing on GitHub. Regards, Antoine
I agree that:
The difficulty with this isn't in the coding, which is likely to be trivial once an appropriate algorithm/implementation has been chosen, but in evaluating and choosing the right solution to start with. Although I made a suggestion above, I would recommend drawing up a list of evaluation criteria by which to measure implementations and then scoring all the current reasonable alternatives against that list. Criteria might include the number of languages supported, model size, runtime performance, implementation language, etc. You might also want to solicit input from the user community, since it's a relatively important implementation decision.
So, neither the establishment of evaluation criteria mentioned above nor the associated evaluation ever really happened, but you can find a related post hoc discussion on the pull request #4651 which was generated. I still think it would be valuable to generate a set of evaluation criteria which reflect the needs of the user community and evaluate alternatives against it. |
Reopening because... @tfmorris I honestly think we can skip the evaluation criteria. I use Lingua specifically because it addresses the needs that I have, which cover both long and shorter text. The current implementation we merged, using the older optimaize library, fails miserably with just

@wetneb I think we should reopen this because our implementation is not useful enough for a majority of cases; we should instead just flip to using Lingua. And as a bonus, we can add a 2nd argument (true/false) to output the confidence scores in a JSON array. Hack-ish example: ["zh":{["MANDARIN":1.0,"GERMAN":0.8665738136456169,"FRENCH":0.8249537317466078,"SPANISH":0.7792362923625288]}]
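For context, a rough sketch (not part of any merged code) of how Lingua's Java API could back both the plain call and the hypothetical confidence-score variant described above; the class name and sample text are made up for illustration:

```java
import java.util.SortedMap;

import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class LinguaSketch {
    public static void main(String[] args) {
        // Loading all language models is memory-hungry; a production integration would
        // probably restrict the language set or load models lazily.
        LanguageDetector detector = LanguageDetectorBuilder.fromAllLanguages().build();

        String text = "languages are awesome";

        // Single best guess, e.g. for detectLanguage(value)
        Language best = detector.detectLanguageOf(text);
        System.out.println(best + " / ISO 639-3: " + best.getIsoCode639_3());

        // Full confidence ranking, e.g. for the proposed detectLanguage(value, true)
        SortedMap<Language, Double> confidences = detector.computeLanguageConfidenceValues(text);
        confidences.forEach((lang, score) -> System.out.println(lang + " -> " + score));
    }
}
```

Whether the second form should return a JSON string or a structured value would still need to be decided.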
I strongly disagree with using an N of 1 approach to requirements definition. It should be a solution which meets the needs of the entire user community. I don't know how you figure out what those needs are unless you ask. Some people will want more languages. Others will want smaller memory footprint. Others will want/need higher performance. If there's a big enough spread in needs across the community, they may indicate the need to support more than one language detector, which, of course, would have implications on the design of the function/API. If we figure out what the right rows and columns are for this table, we can then work on filling it in and figuring out how to weight the criteria.
The Lingua Java project readme doesn't say in sections 9.1+, but I found some in section 7.5 here: https://pemistahl.github.io/lingua-py/lingua.html. I'm unsure whether the Python numbers carry over to Java.
Given all this, I would be tempted to say that this feature should probably not be in the core but rather in an extension, or possibly in several extensions if different libraries cover different use cases.
I think (hope) it should be possible to find a reasonable default solution which could be integrated into the core, but I'm not sure we have enough information yet to make that call. |
A function to detect the language of a piece of text would be useful. There's a high-performance (99% accuracy), Apache-licensed implementation in Java here: https://github.com/shuyo/language-detection (originally http://code.google.com/p/language-detection/)
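For reference, a minimal usage sketch of that library based on its documented API; the "profiles" directory path and the sample sentence are assumptions:

```java
import java.util.List;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class LangDetectSketch {
    public static void main(String[] args) throws LangDetectException {
        // Load the n-gram language profiles once per JVM from a local directory.
        DetectorFactory.loadProfile("profiles");

        Detector detector = DetectorFactory.create();
        detector.append("OpenRefine est un outil puissant pour nettoyer des données.");

        // Best single guess, returned as a language code such as "fr"
        System.out.println(detector.detect());

        // Ranked candidates with probabilities
        List<Language> candidates = detector.getProbabilities();
        for (Language candidate : candidates) {
            System.out.println(candidate.lang + " : " + candidate.prob);
        }
    }
}
```

A GREL wrapper around this would mostly be a thin adapter, plus a decision about where to ship the profile data.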