Mass import of sentences #1762
Comments
I wonder if it might not be dangerous to make this available to regular visitors as well. |
I agree. I feel strongly that this feature should not be available to regular users. I think the right level of accessibility is the admin level, though perhaps I could be persuaded that it could be opened to corpus maintainers as well. |
I totally understand your feelings about this, but I think we should analyze
other points:
》In fact, a lot of potentially bad or wrong sentences can be added. However, we
can check whether the user has been contributing "properly" (sentences in their
native language, not too many comments on their sentences).
》This feature does need some training before use. On dev Tatoeba, I am an
admin, and from time to time I use this feature to check whether I'm doing it
the right way (punctuation, right code, etc.).
》Not making this feature available to *some* users might sound like "I don't trust you
since you're not an admin". Everybody can make mistakes. Also, being an
admin on Tatoeba means (among other things) that the user has access to some tools that
they need in order to keep Tatoeba "healthy". CMs basically take care of
sentences. And so on. So being just a contributor for 5, 7 years doesn't mean
that this person isn't doing a good job. It just means that he or she doesn't
need features like linking, for example.
|
Perhaps I should have stressed that making this available to regular users doesn't mean making it available to every user. Similarly, the CC0 license is now available to regular users, but the access to this feature is still locked behind a permission that needs to be granted to the user by an admin. |
Is this likely to be ready any time soon? |
It's not likely to be ready anytime soon. |
Perhaps, to avoid overloading the website, we could restrict the number of sentences it's possible to add at a time. |
Just out of curiosity, are there any plans to bring this feature back on the latest CakePHP version in the near future? |
There are no plans for that at the moment. But even if we had plans, it would take a little while to get this done. My requirement for reintroducing this feature is to make it open to regular users instead of just admins. This doesn't mean mass import should be accessible from the start to absolutely anyone, but it should no longer be locked behind the condition "you must be an admin". We need a better process so that admins are no longer the middle person in charge of executing the mass import of sentences, and we should implement whatever is necessary to avoid or handle mass import of unwanted sentences. |
I truly believe that it should be clear that everyone can have access to it once they ask for it, as we do for converting sentences from CC-BY to CC0. But there would be some "steps": 1st, the user who requested it would have to participate in a "training" program. They would get instructions on how to do so and do "some exercises" on devTatoeba, which would be reviewed by their instructor or instructors. 2nd, after some training, this user would be able to import some sentences (maybe 50 at a time?) on the prod website. 3rd, if this user and their instructor both feel comfortable, no more restrictions would be necessary. |
Hi! Sorry for bothering you about this, but is it scheduled to be implemented? I mean, do you guys have an idea? I have thousands of sentences which I'd like to import. However, it would take so much time to copy and paste them. |
I'd be willing to import your Portuguese sentences for you once this gets implemented. |
@RyckRichards There are still no plans for this feature, sorry. |
It happens that as part of my "holiday notebook", I wrote a Python script to add sentences (in the development environment). I didn't really plan to use it on the real Tatoeba, but, unless there is an official demand to not do so, I guess I could share it with Ricardo and CK once it is complete so they can add their sentences while waiting for the official feature. |
What great news, @agrodet! Thanks a million! |
@agrodet You're free to share your script with anyone you trust enough :) |
I actually think that mass import is problematic, and I was hoping that the request would die quietly without being fulfilled. It seems to me that the sentences that are added en masse are the least valuable. Even if they are not automatically generated, they might as well be. Also, as someone who spends a lot of time correcting English sentences that have been posted, I'm leery of anything that can give the edge to people who post lots of bad sentences. I hope we can give this topic some more discussion. |
@alanfgh Trust me, I feel your pain. I think there are several issues with mass-importing and I didn't want to share anything at first, but then I considered the following points. As is often the case, I think that the problem is in people, not in the functionality. You have to choose between providing a functionality, at the risk that some idiots will do bad things with it, or depriving everyone of this possibility to avoid the risk... Kind of a nice philosophical debate. Actually, there is a third option: making mass importing SLOWER than human contributions and restricting any access to other functionalities while it works. For example, let's say that I type pretty fast, so when I contribute original sentences I can do it in 5 seconds on average. Copy-pasting would probably take less time. But the mass-importing function would add a sentence only every 8 or 10 seconds. During that time, the user would be restricted from contributing (but could still navigate the website in "read-only" mode). |
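The "slower than a human" idea above could be sketched roughly as follows. This is only an illustration, not agrodet's script or any Tatoeba code: `throttled_import` and the `add_sentence` callback are hypothetical names, and the 8-second default simply echoes the figure mentioned in the comment. The "read-only" restriction is not covered here.

```python
import time
from typing import Callable, Iterable, Optional


def throttled_import(
    sentences: Iterable[str],
    add_sentence: Callable[[str], None],  # hypothetical: whatever actually stores a sentence
    min_interval: float = 8.0,            # assumed value, taken from "every 8 or 10 seconds"
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Add sentences one by one, never faster than one per min_interval seconds."""
    added = 0
    last_added: Optional[float] = None
    for text in sentences:
        text = text.strip()
        if not text:
            continue  # skip blank lines in the input
        if last_added is not None:
            # Wait out whatever is left of the interval since the previous addition.
            wait = min_interval - (clock() - last_added)
            if wait > 0:
                sleep(wait)
        add_sentence(text)
        last_added = clock()
        added += 1
    return added
```

The `sleep` and `clock` parameters are only there so that the pacing can be checked without actually waiting.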
Well, from a technical point of view, mass importing is already possible whether we like it or not. Just write a script/program in your favorite programming language (as demonstrated by agrodet). And you are all probably aware of some bots in the past. I think a working import feature would give us a little bit more control over the process (e.g. rate limiting, automatic tagging, ...). |
Just to be clear, when I'm talking about poor-quality sentences added in bulk, I'm not necessarily referring to grammatically incorrect sentences or even sentences that are bad on their own. I'm talking about a mass of sentences that as a group has very little diversity and tends to crowd out sentences that would be more interesting and more useful: I went to the beach. and so on, iterated over a hundred replacements for "beach" ("sea", "mountains", etc.). In my fantasies, people would get so tired of writing such groups of sentences that they wouldn't do it in the first place, but apparently that's not true. But it seems to me that the more we facilitate the adding of sentences in bulk, the more people will be tempted to add such groups of sentences. |
I think a much bigger factor is the fact that people can seemingly add an infinite number of sentences. When people are not limited, they will tend to care less. For instance, if you have an infinite amount of money, you will more likely end up buying a lot of junk that you have absolutely no use for, but that you bought "just in case" or because "why not". In the case of Tatoeba, there is no limit on the number of sentences that one can add, so some people will add sentences just for the sake of adding them, even if these sentences might not be very useful in the end. That's why for me the problem of low creativity is not a problem we can efficiently prevent or attenuate by discouraging mass-import. To address your issue, I see several other possibilities:
I'm completely okay to put on standby the reintroduction of mass-import until we have a mature solution for preventing mass-import of undesirable sentences and also a more mature definition of what is undesirable. This is actually one of the reasons why there is still no plan for this feature. But I think the feature itself is worth the effort and we shouldn't just let it quietly die. I'm quite convinced there are other people out there who have compiled thousands of interesting sentences into text files or spreadsheets (maybe because it was a lot more convenient for them or maybe because they simply didn't know Tatoeba). Asking them to manually copy-paste these sentences one by one would be torture. |
@trang, that's a good analysis. One thing to note is that adopted sentences count towards one's number of owned sentences. Thus, adding a constraint on the total number of sentences one can own might be a disincentive to adopting. |
As often, Trang made some very good points. Let me draw on one of them. Personally, and it's frustrating, I don't think we should restrict the addition of the sentences you mention, because they do have value. They can be used out of the box by learning systems, for example, or even by humans. However, I do think that we can change how we serve sentences. Gillux mentioned a similar wish a while ago. Alan, if you have a look at the "word analysis" section of one of my notebooks, you can see how repeated stuff influenced other corpora (not that one needs it to realize it...). For example, "Tom" is the most used French word (except stop words). However, that also gives us some insight into how we can adapt our tools. For example, we can set up different types of search. The totally random one, like the current one, will be weighed down by low-quality sentences. But the "one in the middle" could give more interesting sentences using words, or even grammar, that are less commonly used, and the "challenging one" could go to the bottom to get words or topics that appear very few times. So I think that a good solution is a mix of "education" / communication and some nicely biased tools ^^ |
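One way to picture the "challenging" search described above is to rank sentences by how rare their words are across the corpus. This is a minimal sketch under that assumption; `rarity_ranking` is a hypothetical helper, not an existing Tatoeba search mode, and the tokenization is deliberately naive.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Naive word splitting; a real tool would need per-language tokenization.
    return re.findall(r"\w+", text.lower())


def rarity_ranking(sentences: List[str]) -> List[str]:
    """Return sentences ordered from rarest vocabulary to most common vocabulary."""
    # Count in how many sentences each word appears.
    counts = Counter(word for s in sentences for word in set(tokenize(s)))

    def rarity(sentence: str) -> float:
        words = tokenize(sentence)
        if not words:
            return 0.0
        # Mean inverse frequency: words seen in few sentences weigh more.
        return sum(1.0 / counts[w] for w in words) / len(words)

    return sorted(sentences, key=rarity, reverse=True)
```

With a corpus dominated by "Tom went to the ..." patterns, a sentence built from words that appear nowhere else would come first, while the pattern sentences would sink toward the end.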
Or they simply create another account and continue adding sentences. :-)
I guess CK (to name the elephant in the room) thinks his contributions are valuable. ;-)
I think that's the only workable solution: Let editorial staff (corpus maintainers?) decide which sentences are of good quality and mark them with some approval label (e.g. a tag only they can set). Then we can easily publish these two corpora. |
I just scanned the above and haven't read all the comments carefully, but since CK was mentioned in the last comment, I'll leave a quick comment. This perhaps doesn't really apply to the "Mass import of sentences," unless it just points out that one of the arguments against it is suspect, and that work on this shouldn’t be stopped.
To say one perfectly-good sentence is of lower quality than another perfectly-good sentence is wrong. You could perhaps argue that they are redundant or time-wasting. Sentences that are different only because of a pronoun can often be used by some languages as alternative translations of a sentence, so asking that they not be included is not a wise thing to do. At least a few members in this discussion know languages that often don't use pronouns and maybe some of you know languages that use genderless pronouns. Some of you likely know sentences in languages that can be translated into English as both past tense and present tense, and maybe other tenses. Some of you know languages in which the English "you" has various translations: singular, plural, by who is being spoken to and by what politeness level is being used. Should we discourage all the possibilities? I don't think so. If you are going to judge the quality of sentences, the following would be something to consider that may have more relevance to the Tatoeba Project. For students and researchers, the same sentence owned by a native speaker is of higher quality than one owned by a non-native speaker since it is more trustworthy. In other words, we can more likely trust that it is a good sentence. Data that can be trusted is of higher quality. Perhaps AI people would create systems to throw out all non-native data to improve results. When a non-native language user contributes a sentence, it prevents a native speaker from contributing the same sentence that would have a higher trustworthiness level, so they are in some ways hurting the “trustworthiness” quality of the corpus. A few randomly chosen examples: https://tatoeba.org/eng/sentences/show/1008592 https://tatoeba.org/eng/sentences/show/6158104 https://tatoeba.org/eng/sentences/show/3146657 https://tatoeba.org/eng/sentences/show/8497895 |
@ckjpn wrote:
That's clear. Don't waste our time by posting your own comments until you have. |
(1) I'd like to elaborate more on the idea of limiting the number of contributions. This limit actually exists implicitly already. Technically, if someone decided to add millions of sentences per minute, that would crash Tatoeba because the server cannot manage this. Now even if the server doesn't crash, millions of sentences per minute is humanly just not manageable for our community. Setting an explicit limit is, I think, in any case necessary for the well-being of the server and of the community. And as an extra side effect, it can help some contributors channel their efforts on creating more meaningful sentences. If we ever set a limit that degenerates into workarounds, where many people start to create new accounts because that's the only way they can fulfill their appetite for contributing to the project, then clearly we have chosen the wrong limit. As always, we have to find the right balance and it has to make sense. With a limit of one sentence only per account, it's obvious that a lot of people will try to cheat by creating new accounts. On the other hand, with no limit at all, there will be people who won't care about overwhelming Tatoeba or won't be aware of it. They might realize it only when the server crashes or when people in the community start to heavily complain. This limit doesn't have to be the same for everyone and all the time. For instance, it would make sense that contributors who are completely new are more limited than contributors who have been here for a longer time and have gained the trust of others. (2) I'd like to say a few things about quality. First, we have to distinguish between quality of a sentence and quality of the corpus. Compiling a list of perfectly-good sentences doesn't necessarily make a good quality corpus. Second, we can't really argue about corpus quality if we haven't defined what kind of corpus we want to build.
If we are building a corpus for children who are just starting to learn a new language, we won't be able to claim that we have a good quality corpus if there are extremely complex (yet perfectly-good) sentences or sentences that are very obscene or sexual. If we are building a corpus to ship to other galaxies as a meaningful sample of data for aliens to study all the characteristics of the languages on planet Earth, then we would have a bad quality corpus if our sentences all look like they were made for children learning new languages and completely excluded obscene and sexual sentences. Third, assuming we agree on what kind of corpus we want to build, there's one more thing that defines the quality of the corpus, which is its size. Between a corpus that is 10 MB and another that is 100 MB, if both fulfill the same goals, if there's absolutely no additional information we can extract from the bigger corpus than from the smaller corpus in the scope of what they have been compiled for, then I am very much inclined to say that the smaller corpus is a better quality corpus than the bigger one. There's no arguing that redundancy is essential in the corpus but there is also a point where redundancy becomes pollution. Where we draw the line is a difficult question, though. |
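Point (1) above, an explicit limit that is stricter for brand-new contributors than for long-standing, trusted ones, could be sketched like this. Every number and name here is a placeholder invented for illustration; the thread does not propose any specific values, and nothing like `QUOTA_TIERS` exists in Tatoeba as far as this discussion shows.

```python
from datetime import date

# Hypothetical tiers: (minimum account age in days, sentences allowed per day),
# listed from oldest to newest accounts. The values are placeholders only.
QUOTA_TIERS = [(365, 500), (30, 100), (0, 20)]


def daily_quota(joined: date, today: date, trusted: bool = False) -> int:
    """Daily sentence limit for an account, based on its age and a trust flag."""
    age_days = (today - joined).days
    for min_age, quota in QUOTA_TIERS:
        if age_days >= min_age:
            # Assumption: accounts marked as trusted get twice the base quota.
            return quota * 2 if trusted else quota
    return 0  # join date lies in the future: allow nothing


def remaining_today(joined: date, today: date, added_today: int, trusted: bool = False) -> int:
    """How many more sentences the account may add today (never negative)."""
    return max(0, daily_quota(joined, today, trusted) - added_today)
```

Keeping the tiers in one table makes it easy to adjust the limits if a chosen value turns out to push people toward workarounds such as creating extra accounts.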
Since we're not trying to fix the old mass-import feature, I've removed the |
Corpora of popular languages will eventually grow into huge mudballs. It will be up to us to extract the useful parts and provide the right sample to the right people. Just my two cents on the long run :) |
Although we did discuss in the past keeping the procedure "human"... And here I was, hoping that Tatoeba would not become a stupid dictionary, brrrr. Not sure there's a benefit in taking "I like monkeys." and replacing "monkeys" with every word in the dictionary, nor in taking "x says / denies / tells that y VERB z" to use every verb in the dictionary... But I guess people will stay people, as @alanfgh thought. When I think of other corpora becoming a tasteless soup of useless crap with no identity, that makes me a little sad. PS: My remarks are clearly pointed to CH's sentences. |
10 days ago:
Yesterday:
Just teasing you @agrodet :P What I really want to point out is that it would be actually fine if such sentences were added by different people at different times. It's only a problem (or at least feels like a problem) when they all come from the same person and are being added in bulk as the result of what looks like a mass-production process. |
Yeah, I know! But there is sense in what I say, I promise! I share your opinion. But people don't want to be patient :( |
Well I spent half of my comment above giving arguments on why I think we should do that. But finding the magic number(s) isn't something we should decide here arbitrarily. Restricting the number of sentences per day was also suggested in #1492. |
My apologies, I didn't explain myself well. I wanted to say, how about restricting the number of sentences added by the mass-import script in a day? Not a general limit. That's quite different. |
Would there be a reason for having different restrictions between mass-import and regular sentence contributions? |
Perhaps, we should restrict contributions to native language contributions when mass importing. This could help avoid a mass of less-than-natural-sounding sentences and sentences with errors being added to our corpus. Of course, a member could lie about what his/her native language is, but if admins are careful about who they give permission to use this function, this may not become a problem. |
I think one reason would be
Another one could be that regular contributions are regular, mass-imported are supposed to be exceptional. Of course, people could still copy-paste. |
What TRANG is referring to is this kind of thing, I think. All these patterns (over 700 of them), ... 7632198 ita Non vanno a costruire università in Australia? Guybrush88 [snip] down to here. 7544982 ita Che stai facendo in Australia? Guybrush88 ... with a substitution of all the countries shown with this search. "^Andiamo a cercare qualcosa in" by Guybrush88 These turita's Turkish sentences were imported and linked to Guybrush88's Italian sentences at approximately the same time. Perhaps these are the only sentences owned by turita. |
That's because the English pronouns "he", "she" and the formal "you" all take the same verb form in Italian, which is why the Italian sentence is linked to multiple English ones. In addition to this, "isn't" and "'s not" have the same meaning, which is why the Italian sentence is linked to both. |
Back then, I wanted to provide similar patterns of sentences, and I'm sorry if this clogged up the corpus. |
You can always fix that by going through the sentences that have no translations and changing them into something more original :) |
I should report that this feature seems to be needed more and more.
|
@trang Thanks for pointing this thread out to me. I am the person who is working with a tutor and wished to import sentences. I wanted to describe my scenario a bit. I am currently taking German classes with a tutor over Zoom. She very graciously corrects my grammar via chat. At the end of each day she sends me a transcript, and I put the newly created sentences into Tatoeba and give my translation. This practice, I believe, is beneficial to my tutor, Tatoeba, and myself. The tutor gets help with directed learning scenarios where they explain a word or concept in the context of many other examples, Tatoeba gets "high" quality sentences from a native speaker and teacher, and I get the practice of translating into my own words, all while receiving helpful corrections. I am a developer, so I believe a mass import feature need not be for the masses. But I do think there is some use in a developer token and some sort of program to check programs. I think we can compromise on the above concerns in a couple of ways (apologies for jumping in at the end and thinking I know everything. I don't). I would like to make an app to upload sentences en masse. I only want to do it for one language and I don't need to add more than 100 a day. Could we perhaps approve developer tokens to interact with an API that meets certain standards for each language? And perhaps these standards per language could be worked on by a group of super users per language. I think there is a way to split the baby and have a way to increase Tatoeba quality sentences and keep too much garbage data out. I think the answer is a very limited, protected developer API and a robust system for review of client systems. Also, you can put the work on third-party developers to meet standards. |
If I were making an app to submit German sentences to Tatoeba, here are a few things that I would do to ensure good data makes it in.
In my vision, my application would also have a mode requiring me to provide translations as I do the upload. I expect that a group of Tatoeba German super users could provide input on common errors made while adding sentences that could possibly be avoided before submission. I don't think this is a system that will work globally, but perhaps Tatoeba can start small and see if it spreads. |
Controlled mass import by Tatoeba staff only (like audio) would make it easy for e.g. CMs to just glance over packages of e.g. 100-200 sentences and approve them for an internal mass import tool, which can execute at a reasonable rate at "downtime" of the server... One can easily tell the contributor's intentions or the diversity or quality of the contributions, and if need be, give constructive feedback to the contributor. I'm also no friend of restrictions on the frontend regarding the user's already existing number of contributions, because users tend to be much more productive when they join Tatoeba than later on. So, such a limitation would work counterproductively against their contribution curve... Merry Christmas! |
This would cut out the possibility of submitting a native sentence on behalf of another. |
Would it be possible to limit an API to users who are "educators", perhaps? I believe having access to a corpus such as Tatoeba is an awesome resource for tutors. I can imagine creating a front end designed to help tutors create exercises for students to translate or use as writing prompts. Perhaps there is a way to turn these exercises into crowd-sourced translation. |
Hi, I'm writing after an email exchange with Gilles Bedel, on the same subject as this thread. I came to Tatoeba through the wonderful book by Hagiwara on Real-world NLP. It is a great initiative. I understand the principle of curated translations, as opposed to Google Translate, where the quality is sometimes low and you have no guarantee of it. But I must confess I saw many Tatoeba translations (in French and Greek) that were bad, very bad: words missing, bad inflectional morphology. And I found no means to correct them (I added a comment to one of them, but I don't think it will ever be read by anyone). I'm saying this to argue that it is not the effort you require of users that makes translations better. So first of all you need a system allowing corrections (maybe it exists and I have missed it). Maybe you need a scoring system which would attribute an initial score to every sentence (not visible to the user) and then would increase that score whenever there is a correction (unless corrections are back-and-forth, in which case an administrator should be alerted). Then you should have a validation function of the type "Show me sentences of low confidence score", and validating by a single click would then increase the score. And if you implement a user's confidence score, then eir validating would increase the score in a way proportional to er personal score (also not known to the user, we don't want to become stackoverflow). Mass import would then provide sentences of very low score, that would either need to be validated (by somebody other than the person having imported them) or corrected, or both. The conclusion is the same as yours: for a sentence to be trustworthy it will need effort and time. But in my idea, this effort and time is shared by all users and the data are there.
Mass import occurs, and the result has a low score: somebody wanting only trustworthy sentences will ask for sentences with a high score, somebody wanting any kind of sentence will take them all. And then, slowly, progressively, the scores of the correct mass-imported sentences will rise and they will become like the others, possibly after some corrections. Example: Imagine I supply a mass import containing, among other translations, the following pair:
ENG I'm very honored
the fourth word of whose French translation has an error (a very common one: writing an infinitive instead of a participle). After the import, the score of the translation can be zero:
FRE Je suis très honorer score=0
I go through the "Validate low-score sentences" page and correct it:
FRE Je suis très honoré score=0.5
The score rises automatically. Somebody else goes through the same page and validates it:
FRE Je suis très honoré score=1.0
It has now become a medium-scored (or high-scored) sentence. I think that this approach combines your security concerns and the need for mass import. |
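The scoring idea in the comment above can be traced with a small sketch. It only mirrors the worked example (0, then 0.5 after a correction, then 1.0 after someone else validates); the class and function names are hypothetical, the 0.5 step is taken from that example, and the per-user confidence weighting and the back-and-forth alert are left out.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ScoredSentence:
    lang: str
    text: str
    importer: str
    score: float = 0.0  # mass-imported sentences start with the lowest score
    validators: Set[str] = field(default_factory=set)

    def correct(self, user: str, new_text: str, step: float = 0.5) -> None:
        """A correction replaces the text and raises the score (capped at 1.0)."""
        self.text = new_text
        self.score = min(1.0, self.score + step)

    def validate(self, user: str, step: float = 0.5) -> bool:
        """One-click validation; ignored for the importer and for repeat validators."""
        if user == self.importer or user in self.validators:
            return False
        self.validators.add(user)
        self.score = min(1.0, self.score + step)
        return True


def low_confidence(sentences: List[ScoredSentence], threshold: float = 0.5) -> List[ScoredSentence]:
    """Sentences that would appear on a 'Validate low-score sentences' page."""
    return [s for s in sentences if s.score < threshold]
```

Replaying the example: a sentence imported as "Je suis très honorer" starts at 0, reaches 0.5 once corrected to "Je suis très honoré", and reaches 1.0 when a different user validates it.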
Has not been migrated yet (#1733).
Note that this feature is currently only available to admins. It may be worth just rewriting it instead of migrating the code, so that it's available to regular users as well.
The new mass import feature could use the queue plugin (used in the license feature).