Mass import of sentences #1762

Open
trang opened this issue Jan 27, 2019 · 48 comments
Comments

@trang
Copy link
Member

trang commented Jan 27, 2019

Has not been migrated yet (#1733).

Note that this feature is currently only available to admins. It may be worth just rewriting it instead of migrating the code, so that it's available to regular users as well.

The new mass import feature could use the queue plugin (used in the license feature).

@ckjpn
Copy link

ckjpn commented Feb 2, 2019

I wonder if it might not be dangerous to make this available to regular visitors as well.
It would open up the possibility of really getting a lot of bad sentences quickly, especially if non-native sentence imports were allowed.

@alanfgh
Copy link
Contributor

alanfgh commented Feb 2, 2019

I agree. I feel strongly that this feature should not be available to regular users. I think the right level of accessibility is the admin level, though perhaps I could be persuaded that it could be opened to corpus maintainers as well.

@RyckRichards
Copy link
Member

RyckRichards commented Feb 2, 2019 via email

@trang
Copy link
Member Author

trang commented Feb 4, 2019

Perhaps I should have stressed that making this available to regular users doesn't mean making it available to every user. Similarly, the CC0 license is now available to regular users, but the access to this feature is still locked behind a permission that needs to be granted to the user by an admin.

@jiru added the "regression" label (issue that describes a bug for a feature that used to work just fine) on Feb 18, 2019
@ckjpn
Copy link

ckjpn commented Feb 26, 2019

Is this likely to be ready any time soon?
One member asked about translating offline.
I've sent him some sentences to translate with the warning that I may not be able to import them any time soon.
In the past, I've been able to send sentences to members to translate and then import the translations.

@trang
Copy link
Member Author

trang commented Feb 26, 2019

It's not likely to be ready anytime soon.

@RyckRichards
Copy link
Member

Perhaps, to avoid overloading the website, we could restrict the number of sentences it's possible to add at a time.

@Guybrush88
Copy link

Just out of curiosity, are there any plans to bring this feature back on the latest CakePHP version in the near future?

@trang
Copy link
Member Author

trang commented Nov 1, 2019

There are no plans on that at the moment. But even if we had plans, it would take a little while to get this done.

My requirement for reintroducing this feature is that it be open to regular users instead of just admins. This doesn't mean mass import should be accessible from the start to absolutely anyone, but it should no longer be locked behind the condition "you must be an admin".

We need a better process so that admins no longer have to be the middle person who executes the mass import of sentences, and we should implement whatever is necessary to avoid/handle mass import of unwanted sentences.

@RyckRichards
Copy link
Member

I truly believe it should be clear that everyone can get access to it once they ask for it, as we do for converting sentences from CC-BY to CC0. But there would be some "steps":

1st - the user who requested it would have to take part in a "training" program. They would get instructions on how to do the import and do "some exercises" on devTatoeba, which would be reviewed by their instructor or instructors

2nd - after some training, this user would be able to import some sentences (maybe 50 at a time?) on the prod website

3rd - if this user and their instructor both feel comfortable, no more restrictions would be necessary

@RyckRichards
Copy link
Member

Hi! Sorry for bothering you about this, but is it scheduled to be implemented? I mean, do you guys have an idea of when?

I have thousands of sentences which I'd like to import. However, it would take so much time to copy and paste them.

@ckjpn
Copy link

ckjpn commented Jan 8, 2020

I'd be willing to import your Portuguese sentences for you once this gets implemented.

@trang
Copy link
Member Author

trang commented Jan 18, 2020

@RyckRichards There are still no plans for this feature, sorry.

@agrodet
Copy link
Contributor

agrodet commented Jan 23, 2020

It happens that as part of my "holiday notebook", I wrote a Python script to add sentences (in the development environment). I didn't really plan to use it on the real Tatoeba, but, unless there is an official demand not to do so, I guess I could share it with Ricardo and CK once it is complete so they can add their sentences while waiting for the official feature.

@RyckRichards
Copy link
Member

What great news, @agrodet! Thanks a million!

@trang
Copy link
Member Author

trang commented Jan 23, 2020

@agrodet You're free to share your script with anyone you trust enough :)

@alanfgh
Copy link
Contributor

alanfgh commented Jan 23, 2020

I actually think that mass import is problematic, and I was hoping that the request would die quietly without being fulfilled. It seems to me that the sentences that are added en masse are the least valuable. Even if they are not automatically generated, they might as well be. Also, as someone who spends a lot of time correcting English sentences that have been posted, I'm leery of anything that can give the edge to people who post lots of bad sentences. I hope we can give this topic some more discussion.

@agrodet
Copy link
Contributor

agrodet commented Jan 24, 2020

Also, as someone who spends a lot of time correcting English sentences that have been posted, I'm leery of anything that can give the edge to people who post lots of bad sentences.

@alanfgh Trust me, I feel your pain. I think there are several issues in mass-importing and I didn't want to share anything at first, but then I considered the following points.
First, if someone wants to add loads of crappy sentences, they can already do so by simply copy-pasting. It takes time, but if they have a list of 1,000 sentences they want to add, eventually they will have them (and faster than we can check them). Of course the problem arises that with such a script, it might be even more tempting to add crappy stuff. Process any resource you find, send the result to the mass-importing function, and boom, you're no more than a zombie monkey robot.
But then, the second point is that if there are such users, they know the rules, so we don't have to go easy on them. Delete all the sentences they added, revoke their rights, done. Of course, that's only my personal point of view.

As is often the case, I think that the problem is in people, not in the functionality. You have to choose between providing a functionality, at the risk that some idiots will do bad things with it, or depriving everyone of this possibility to avoid the risk... Kind of a nice philosophical debate.

Actually, there is a third point: making mass importing SLOWER than human contributions and restricting access to other functionalities while it runs. For example, let's say that I type pretty fast, so when I contribute original sentences I can do it in 5 seconds on average. Copy-pasting would probably take less time. But the mass-importing function would add a sentence only every 8 or 10 seconds. During that time, the user would be restricted from contributing (they could still navigate the website in "read-only" mode).
It kind of mitigates the problem I mentioned above. It is just one not-so-good option. It is not at all user-friendly, but since it is an operation that can be abused, the user has to clearly accept the trade-off.
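
To make the throttling idea concrete, here is a minimal Python sketch of an importer that never submits faster than one sentence per interval. Everything in it is an assumption for illustration: `throttled_import`, the `submit` callback and the 8-second default are hypothetical names and values, not part of Tatoeba's code or of any existing script.

```python
import time
from typing import Callable, Iterable, Optional


def throttled_import(
    sentences: Iterable[str],
    submit: Callable[[str], None],
    min_interval_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Submit sentences one by one, never faster than one per min_interval_s.

    `submit` is a hypothetical callback that stores a single sentence.
    `sleep` and `clock` are injectable so the pacing can be tested quickly.
    Returns the number of sentences submitted.
    """
    count = 0
    last_sent: Optional[float] = None
    for text in sentences:
        if last_sent is not None:
            # Wait out whatever remains of the interval since the last submit.
            wait = min_interval_s - (clock() - last_sent)
            if wait > 0:
                sleep(wait)
        submit(text)
        last_sent = clock()
        count += 1
    return count
```

With the default interval, 1,000 sentences would take a bit over two hours, which is slower than the 5 seconds per sentence mentioned above for manual typing. The "read-only while importing" part is not covered here, since it depends on how permissions are handled on the server.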

@AndiPersti
Copy link
Contributor

Well, from a technical point of view, mass importing is already possible whether we like it or not. Just write a script/program in your favorite programming language (as demonstrated by agrodet). And you are all probably aware of some bots in the past.

I think a working import feature would give us a little bit more control over the process (e.g. rate limiting, automatic tagging, ...).

@alanfgh
Copy link
Contributor

alanfgh commented Jan 25, 2020

Just to be clear, when I'm talking about poor-quality sentences added in bulk, I'm not necessarily referring to grammatically incorrect sentences or even sentences that are bad on their own. I'm talking about a mass of sentences that as a group has very little diversity and tends to crowd out sentences that would be more interesting and more useful:

I went to the beach.
You went to the beach.
He went to the beach.
We went to the beach.
They went to the beach.
Tom and Mary went to the beach.
Tom told Mary that he went to the beach.
Mary told Tom that she went to the beach.

and so on, iterated over a hundred replacements for "beach" ("sea", "mountains", etc.). In my fantasies, people would get so tired of writing such groups of sentences that they wouldn't do it in the first place, but apparently that's not true. But it seems to me that the more we facilitate the adding of sentences in bulk, the more people will be tempted to add such groups of sentences.
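
For what it's worth, groups like the one above are fairly easy to spot mechanically. The sketch below is only an illustration, not an existing Tatoeba tool: `near_duplicate_groups` is a hypothetical helper that masks one word at a time and groups sentences that become identical, assuming plain whitespace tokenization.

```python
from collections import defaultdict
from typing import Dict, List, Set


def near_duplicate_groups(sentences: List[str], min_size: int = 3) -> Dict[str, Set[str]]:
    """Group sentences that differ from each other in exactly one word.

    Each sentence is split on whitespace; every position is masked in turn
    with "_" and the masked string is used as a grouping key. Keys shared by
    at least `min_size` distinct sentences are returned.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens)):
            key = " ".join(tokens[:i] + ["_"] + tokens[i + 1:])
            groups[key].add(sentence)
    return {key: members for key, members in groups.items() if len(members) >= min_size}
```

On the eight example sentences, this returns a single group under the key "_ went to the beach." containing the five pronoun variants. It only catches one-word substitutions, so the "Tom told Mary that he..." pair would need a looser comparison.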

@trang
Copy link
Member Author

trang commented Jan 26, 2020

But it seems to me that the more we facilitate the adding of sentences in bulk, the more people will be tempted to add such groups of sentences.

I think a much bigger factor is the fact that people can seemingly add an infinite number of sentences. When people are not limited, they will tend to care less. For instance if you have an infinite amount of money, you will more likely end up buying a lot of junk that you have absolutely no use for, but you just bought it "just in case" or because "why not". In the case of Tatoeba, there is no limit on the number of sentences that one can add, so some people will add sentences just for the sake of adding them, even if these sentences might not be very useful in the end.

That's why for me the problem of low-creativity is not a problem we can efficiently prevent or attenuate by discouraging mass-import. To address your issue, I see several other possibilities:

  • We set some constraints on how many sentences per day one can add and possibly as well how many sentences in total one can own. This would make people think a bit more about what they are adding.
  • We find a way to make people become more self-conscious about the creative value of their sentences and elaborate tips/guidelines on how they can improve. That's a bit more difficult.
  • Or we accept the abundance of uncreative and undiverse sentences as an unavoidable fate for the full Tatoeba corpus. Rather than fighting against our fate, we create tools that can help us extract a diverse and interesting sub-corpus out of this full corpus. We would then have a "raw" corpus and a "curated" corpus.
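
As a rough illustration of the first option, a combined per-day and total-ownership check could look like the Python sketch below. The `Quota` class, `remaining_allowance` and all numbers are hypothetical; nothing here reflects limits that have actually been decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    per_day: int      # hypothetical cap on sentences added per day
    total_owned: int  # hypothetical cap on sentences owned overall


def remaining_allowance(quota: Quota, added_today: int, owned_total: int) -> int:
    """Return how many more sentences the user may add right now.

    The tighter of the two constraints wins, and the result never goes
    below zero.
    """
    left_today = quota.per_day - added_today
    left_overall = quota.total_owned - owned_total
    return max(0, min(left_today, left_overall))
```

For example, with `Quota(per_day=100, total_owned=5000)`, a user who has added 30 sentences today and owns 4,990 could add 10 more. A mass import would simply be truncated or rejected once the allowance reaches zero.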

I'm completely okay to put on standby the reintroduction of mass-import until we have a mature solution for preventing mass-import of undesirable sentences and also a more mature definition of what is undesirable. This is actually one of the reasons why there is still no plan for this feature. But I think the feature itself is worth the effort and we shouldn't just let it quietly die. I'm quite convinced there are other people out there who have compiled thousands of interesting sentences into text files or spreadsheets (maybe because it was a lot more convenient for them or maybe because they simply didn't know Tatoeba). Asking them to manually copy-paste these sentences one by one would be torture.

@alanfgh
Copy link
Contributor

alanfgh commented Jan 26, 2020

@trang, that's a good analysis.

One thing to note is that adopted sentences count towards one's number of owned sentences. Thus, adding a constraint on the total number of sentences one can own might be a disincentive to adopting.

@agrodet
Copy link
Contributor

agrodet commented Jan 27, 2020

As often, Trang gave some very good points. Let me draw on one of them.

Personally, and it's frustrating, I don't think we should restrict the addition of the sentences you mention, because they do have value. They can be used out of the box by learning systems, for example, or even humans. However, I do think that we can change how we serve sentences. Gillux mentioned a similar wish a while ago.

Alan, if you have a look at the "word analysis" section of one of my notebooks, you can see how repeated stuff influenced other corpora (not that one needs it to realize it...). For example, "Tom" is the most used French word (except stop words). However, that also gives us some insight into how we can adapt our tools. For example, we can set up different types of search. The totally random one, like the current one, will be weighed by low quality sentences. But the "one in the middle" could give more interesting sentences using words, or even grammar, less commonly used, and the "challenging one" could go to the bottom to get words or topics that appear very few times.
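
A very small Python sketch of the kind of biased ranking described above: each sentence gets a score that grows when its words are rare in the corpus. This is only one possible scoring, chosen for illustration; `tokenize` and `rarity_scores` are hypothetical names and the formula is an assumption, not something taken from the notebooks.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def rarity_scores(sentences: List[str]) -> Dict[str, float]:
    """Score each sentence by the average rarity of its distinct words.

    A word's rarity is 1 / (number of sentences containing it), so a
    sentence made only of words seen nowhere else scores 1.0, and a
    sentence made of very common words scores close to 0.
    """
    doc_freq = Counter(word for s in sentences for word in set(tokenize(s)))
    scores: Dict[str, float] = {}
    for s in sentences:
        words = set(tokenize(s))
        if not words:
            scores[s] = 0.0
            continue
        scores[s] = sum(1.0 / doc_freq[w] for w in words) / len(words)
    return scores
```

A "challenging" search could then sort by descending score, a "middle" one could sample from the mid-range, and the fully random one would ignore the score altogether.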

So I think that a good solution is a mix of "education" / communication and some nicely biased tools ^^

@AndiPersti
Copy link
Contributor

  • We set some constraints on how many sentences per day one can add and possibly as well how many sentences in total one can own. This would make people think a bit more about what they are adding.

Or they simply create another account and continue adding sentences. :-)
I don't think it's possible to solve the quality issues with technical constraints like this.

  • We find a way to make people become more self-conscious about the creative value of their sentences and elaborate tips/guidelines on how they can improve. That's a bit more difficult.

I guess CK (to name the elephant in the room) thinks his contributions are valuable. ;-)

  • Or we accept the abundance of uncreative and undiverse sentences as an unavoidable fate for the full Tatoeba corpus. Rather than fighting against our fate, we create tools that can help us extract a diverse and interesting sub-corpus out of this full corpus. We would then have a "raw" corpus and a "curated" corpus.

I think that's the only workable solution: let editorial staff (corpus maintainers?) decide which sentences are of good quality and mark them with some approval label (e.g. a tag only they can set). Then we can easily publish these two corpora.
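
Publishing the two corpora from such a label would be straightforward. The Python sketch below assumes a hypothetical `Sentence` record with a set of tags and a hypothetical tag name `"curated"`; neither is an existing Tatoeba structure.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class Sentence:
    id: int
    text: str
    tags: Set[str] = field(default_factory=set)


def split_corpus(
    sentences: Iterable[Sentence], approval_tag: str = "curated"
) -> Tuple[List[Sentence], List[Sentence]]:
    """Return (raw, curated): everything, and the subset carrying the tag."""
    raw = list(sentences)
    curated = [s for s in raw if approval_tag in s.tags]
    return raw, curated
```

The raw export stays complete, and the curated export is just a filtered view, so nothing has to be deleted to get the smaller corpus.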

@ckjpn
Copy link

ckjpn commented Jan 31, 2020

I just scanned the above and haven't read all the comments carefully, but since CK was mentioned in the last comment, I'll leave a quick comment. This perhaps doesn't really apply to the "Mass import of sentences," unless it just points out that one of the arguments against it is suspect, and that work on this shouldn’t be stopped.

poor-quality sentences
#1762 (comment)

To say one perfectly-good sentence is of lower quality than another perfectly-good sentence is wrong. You could perhaps argue that they are redundant or time-wasting.

Sentences that are different only because of a pronoun can often be used by some languages as alternative translations of a sentence, so asking that they not be included is not a wise thing to do. At least a few members in this discussion know languages that often don't use pronouns and maybe some of you know languages that use genderless pronouns. Some of you likely know sentences in languages that can be translated into English as both past tense and present tense, and maybe other tenses. Some of you know languages in which the English "you" has various translations, singular, plural, by who is being spoken to and what politeness level is being used. Should we discourage all the possibilities? I don't think so.

If you are going to judge the quality of sentences, the following would be something to consider that may have more relevance to the Tatoeba Project.

For students and researchers, the same sentence owned by a native speaker is of higher quality than one owned by a non-native speaker since it is more trustworthy. In other words, we can more likely trust that it is a good sentence. Data that can be trusted is of higher quality. Perhaps AI people would create systems to throw out all non-native data to improve results. When a non-native language user contributes a sentence, it prevents a native speaker from contributing the same sentence that would have a higher trustworthiness level, so they are in some ways hurting the “trustworthiness” quality of the corpus.

A few randomly chosen examples

https://tatoeba.org/eng/sentences/show/1008592
ita
Non è serio.
eng
He's not serious.
eng
You're not serious.
eng
You aren't serious.
eng
It isn't serious.
eng
He isn't selfish.
eng
He isn't serious.

https://tatoeba.org/eng/sentences/show/6158104
ita
Ha buon gusto.
eng
She has good taste.
eng
He has good taste.

https://tatoeba.org/eng/sentences/show/3146657
ita
Provò.
eng
She tried.
eng
He tried.

https://tatoeba.org/eng/sentences/show/8497895
deu
Sie sind nicht dick.
eng
You aren't fat.
eng
They aren't fat.

@alanfgh
Copy link
Contributor

alanfgh commented Jan 31, 2020

@ckjpn wrote:

I just scanned the above and haven't read all the comments carefully...

That's clear. Don't waste our time by posting your own comments until you have.

@trang
Copy link
Member Author

trang commented Feb 1, 2020

(1) I'd like to elaborate more on the idea of limiting the amount of contributions.

This limit actually exists implicitly already. Technically, if someone decided to add millions of sentences per minute, that would crash Tatoeba because the server cannot handle this. Now even if the server didn't crash, millions of sentences per minute is just not humanly manageable for our community.

Setting an explicit limit is, I think, in any case necessary for the well-being of the server and of the community. And as an extra side effect, it can help some contributors channel their efforts on creating more meaningful sentences.

If we ever set a limit that degenerates into workarounds, where many people start to create new accounts because that's the only way they can fulfill their appetite for contributing to the project, then clearly we have chosen the wrong limit. As always, we have to find the right balance and it has to make sense.

With a limit of one sentence only per account, it's obvious that a lot of people will try to cheat by creating new accounts. On the other hand, with no limit at all, there will be people who won't care about overwhelming Tatoeba or won't be aware of it. They might realize it only when the server crashes or when people in the community start to heavily complain.

This limit doesn't have to be the same for everyone and all the time. For instance, it would make sense that contributors who are completely new are more limited than contributors who have been here for a longer time and have gained the trust of others.
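
To illustrate what a non-uniform limit could look like, here is a small Python sketch with tiers based on account age and a trust flag. The tier boundaries and numbers are placeholders made up for the example, and `daily_limit` is a hypothetical function; the actual values would need to be worked out separately.

```python
from typing import List, Tuple

# (minimum account age in days, sentences allowed per day).
# Placeholder values only; must be sorted by ascending age.
TIERS: List[Tuple[int, int]] = [(0, 20), (30, 100), (365, 500)]


def daily_limit(account_age_days: int, trusted: bool = False,
                tiers: List[Tuple[int, int]] = TIERS) -> int:
    """Return the per-day sentence limit for an account.

    The limit is taken from the highest tier whose minimum age has been
    reached. Accounts flagged as trusted get the top tier regardless of age.
    """
    if trusted:
        return tiers[-1][1]
    limit = 0
    for min_age, per_day in tiers:
        if account_age_days >= min_age:
            limit = per_day
    return limit
```

With these placeholder tiers, a brand-new account gets 20 sentences per day, a 45-day-old account gets 100, and a trusted account gets 500 from day one.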

(2) I'd like to say a few things about quality.

First, we have to distinguish between quality of a sentence and quality of the corpus. Compiling a list of perfectly-good sentences doesn't necessarily make a good quality corpus.

Second, we can't really argue about corpus quality if we haven't defined what kind of corpus we want to build. If we are building a corpus for children who are just starting to learn a new language, we won't be able to claim that we have a good quality corpus if there are extremely complex (yet perfectly-good) sentences or sentences that are very obscene or sexual. If we are building a corpus to ship to other galaxies as a meaningful sample of data for aliens to study all the characteristics of the languages on planet Earth, then we would have a bad quality corpus if our sentences all look like they were made for children learning new languages and completely excluded obscene and sexual sentences.

Third, assuming we agree on what kind of corpus we want to build, there's one more thing that defines the quality of the corpus, which is its size. Between a corpus that is 10 MB and another that is 100 MB, if both fulfill the same goals, if there's absolutely no additional information we can extract from the bigger corpus than from the smaller corpus in the scope of what they have been compiled for, then I am very much inclined to say that the smaller corpus is a better quality corpus than the bigger one. There's no arguing that redundancy is essential in the corpus but there is also a point where redundancy becomes pollution. Where we draw the line is a difficult question, though.

@trang removed the "regression" label (issue that describes a bug for a feature that used to work just fine) on Feb 1, 2020
@trang
Copy link
Member Author

trang commented Feb 1, 2020

Since we're not trying to fix the old mass-import feature, I've removed the regression label.

@agrodet
Copy link
Contributor

agrodet commented Feb 3, 2020

Corpora of popular languages will eventually grow into huge mudballs. It will be up to us to extract useful parts and provide a correct sample to the correct people. Just my two cents on the long run :)

@agrodet
Copy link
Contributor

agrodet commented Feb 5, 2020

Although we did discuss in the past keeping the procedure "human"... And here I was, hoping that Tatoeba would not become a stupid dictionary, brrrr.

Not sure there's a benefit in taking "I like monkeys." and replacing "monkeys" with every word in the dictionary, nor in taking "x says / denies / tells that y VERB z" to use every verb in the dictionary... But I guess people will stay people, as @alanfgh thought. When I think of other corpora becoming tasteless soup of useless crap with no identity, that makes me a little sad.

PS: My remarks are clearly pointed to CH's sentences.
PPS: I know I don't suggest any improvement or argument.

@trang
Copy link
Member Author

trang commented Feb 6, 2020

10 days ago:

Personally, and it's frustrating, I don't think we should restrict the addition of the sentences you mention, because they do have value.

Yesterday:

When I think of other corpora becoming tasteless soup of useless crap with no identity, that makes me a little sad.

Just teasing you @agrodet :P

What I really want to point out is that it would be actually fine if such sentences were added by different people at different times. It's only a problem (or at least feels like a problem) when they all come from the same person and are being added in bulk as the result of what looks like a mass-production process.

@agrodet
Copy link
Contributor

agrodet commented Feb 7, 2020

Yeah, I know! But there is sense in what I say, I promise!

I share your opinion. But people don't want to be patient :(
About a future mass-import function, what about restricting the number of sentences per day? Like a maximum of 200 sentences a day or something like that.

@trang
Copy link
Member Author

trang commented Feb 7, 2020

what about restricting the number of sentences per day?

Well I spent half of my comment above giving arguments on why I think we should do that. But finding the magic number(s) isn't something we should decide here arbitrarily.

Restricting the number of sentences per day was also suggested in #1492.

@agrodet
Copy link
Contributor

agrodet commented Feb 8, 2020

My apologies, I didn't explain myself well. I wanted to say, how about restricting the number of sentences added by the mass-import script in a day? Not a general limit. That's quite different.

@trang
Copy link
Member Author

trang commented Feb 9, 2020

Would there be a reason for having different restrictions between mass-import and regular sentence contributions?

@ckjpn
Copy link

ckjpn commented Feb 10, 2020

Would there be a reason for having different restrictions between mass-import and regular sentence contributions?

Perhaps, we should restrict contributions to native language contributions when mass importing.

This could help avoid a mass of less-than-natural-sounding sentences and sentences with errors being added to our corpus.

Of course, a member could lie about what his/her native language is, but if admins are careful about who they give permission to use this function, this may not become a problem.

@agrodet
Copy link
Contributor

agrodet commented Feb 10, 2020

Would there be a reason for having different restrictions between mass-import and regular sentence contributions?

I think one reason would be

What I really want to point out is that it would be actually fine if such sentences were added by different people at different times. It's only a problem (or at least feels like a problem) when they all come from the same person and are being added in bulk as the result of what looks like a mass-production process.

Another one could be that regular contributions are regular, while mass imports are supposed to be exceptional.

Of course, people could still copy-paste.

@ckjpn
Copy link

ckjpn commented Feb 10, 2020

...all come from the same person and are being added in bulk as the result of what looks like a mass-production process.

What TRANG is referring to is this kind of thing, I think.
(I wrote to TRANG on March 13, 2018 and once again on December 22, 2018 about this.)

All these patterns (over 700 of them), ...

7632198 ita Non vanno a costruire università in Australia? Guybrush88
7632138 ita Non andate a costruire università in Australia? Guybrush88
7632078 ita Non andiamo a costruire università in Australia? Guybrush88
7632018 ita Non va a costruire università in Australia? Guybrush88
7631958 ita Non vai a costruire università in Australia? Guybrush88
7631898 ita Non vado a costruire università in Australia? Guybrush88

[snip] down to here.

7544982 ita Che stai facendo in Australia? Guybrush88
7544981 ita Che cosa stai facendo in Australia? Guybrush88
7544980 ita Cosa stai facendo in Australia? Guybrush88
7544933 ita A Tom piace l'Australia. Guybrush88
7530569 ita Dov'è andato in Australia? Guybrush88
7530568 ita Dov'è andata in Australia? Guybrush88

... with a substitution of all the countries shown with this search.

"^Andiamo a cercare qualcosa in" by Guybrush88

https://tatoeba.org/eng/sentences/search?query=%22%5EAndiamo+a+cercare+qualcosa+in%22&from=ita&to=none&user=Guybrush88&orphans=no&unapproved=no&has_audio=&tags=&list=&native=&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=created&sort_reverse=

These Turkish sentences of turita's were imported and linked to Guybrush88's Italian sentences at approximately the same time. Perhaps these are the only sentences owned by turita.

https://tatoeba.org/eng/sentences/search?query=&from=und&to=und&user=turita&orphans=no&unapproved=no&has_audio=&tags=&list=&native=&trans_filter=limit&trans_to=und&trans_link=&trans_user=Guybrush88&trans_orphan=&trans_unapproved=&trans_has_audio=&sort_reverse=&sort=relevance

@Guybrush88
Copy link

ita
Non è serio.
eng
He's not serious.
eng
You're not serious.
eng
You aren't serious.
eng
It isn't serious.
eng
He isn't selfish.
eng
He isn't serious.

https://tatoeba.org/eng/sentences/show/6158104
ita
Ha buon gusto.
eng
She has good taste.
eng
He has good taste.

That's because the English pronouns "he", "she" and the formal "you" all take the same verb form in Italian, which is why the Italian sentence is linked to multiple English ones. In addition, "isn't" and "'s not" have the same meaning, which is why the Italian sentence is linked to both.

@Guybrush88
Copy link

...all come from the same person and are being added in bulk as the result of what looks like a mass-production process.

What TRANG is referring to is this kind of thing, I think.
(I wrote to TRANG on March 13, 2018 and once again on December 22, 2018 about this.)

All these patterns (over 700 of them), ...

7632198 ita Non vanno a costruire università in Australia? Guybrush88
7632138 ita Non andate a costruire università in Australia? Guybrush88
7632078 ita Non andiamo a costruire università in Australia? Guybrush88
7632018 ita Non va a costruire università in Australia? Guybrush88
7631958 ita Non vai a costruire università in Australia? Guybrush88
7631898 ita Non vado a costruire università in Australia? Guybrush88

[snip] down to here.

7544982 ita Che stai facendo in Australia? Guybrush88
7544981 ita Che cosa stai facendo in Australia? Guybrush88
7544980 ita Cosa stai facendo in Australia? Guybrush88
7544933 ita A Tom piace l'Australia. Guybrush88
7530569 ita Dov'è andato in Australia? Guybrush88
7530568 ita Dov'è andata in Australia? Guybrush88

... with a substitution of all the countries shown with this search.

"^Andiamo a cercare qualcosa in" by Guybrush88

https://tatoeba.org/eng/sentences/search?query=%22%5EAndiamo+a+cercare+qualcosa+in%22&from=ita&to=none&user=Guybrush88&orphans=no&unapproved=no&has_audio=&tags=&list=&native=&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=created&sort_reverse=

These Turkish sentences of turita's were imported and linked to Guybrush88's Italian sentences at approximately the same time. Perhaps these are the only sentences owned by turita.

https://tatoeba.org/eng/sentences/search?query=&from=und&to=und&user=turita&orphans=no&unapproved=no&has_audio=&tags=&list=&native=&trans_filter=limit&trans_to=und&trans_link=&trans_user=Guybrush88&trans_orphan=&trans_unapproved=&trans_has_audio=&sort_reverse=&sort=relevance

back then, I wanted to provide similar patterns of sentences, and I'm sorry if this clogged up the corpus

@trang
Copy link
Member Author

trang commented Dec 19, 2020

@Guybrush88,

back then, I wanted to provide similar patterns of sentences, and I'm sorry if this clogged up the corpus

You can always fix that by going through the sentences that have no translations and changing them into something that is more original :)

@trang
Copy link
Member Author

trang commented Dec 19, 2020

I should report that this feature seems to be needed more and more.

  1. A user recently asked on the Wall:

I am collecting sentences in chat between myself and a tutor. Is there not a way to batch upload new sentences? I see that there is no API but surely there must be a way to contribute easily.

  2. A few days ago, someone emailed me asking if there was a way to add thousands of translated sentences at once. They want to import Crimean Tatar sentences and stressed that it is important because it's an endangered language. They pointed out that they have a dictionary with translated examples. If I understood properly, the sentences they want to import would be the sentences used in that dictionary, and they want to have them on Tatoeba so that the content is available in more places, to help keep the language alive.

  3. Last month, someone contacted the team to ask if it's possible to use a script to add sentences to Tatoeba. The reason was to import 7000 sentences written by a contributor who doesn't have a stable internet connection.

  4. Also, back in July, I was informed in private that a new member, who wanted to improve their Latin skills with classical text, asked if there was any tool to import multiple sentences from public domain works into Tatoeba. I suggested that this member write to us at the team email address or on the Wall about what exactly they want to import, but I don't think we received any follow-up on that.

@jayrod
Copy link

jayrod commented Dec 19, 2020

@trang Thanks for pointing this thread out to me. I am the person who is working with a tutor and wished to import sentences. I wanted to describe my scenario a bit. I am currently taking German classes with a tutor over Zoom. She very graciously corrects my grammar via chat. At the end of each day she sends me a transcript and I will put the newly created sentences into Tatoeba and give my translation.

I believe this practice benefits my tutor, Tatoeba, and me. The tutor gets help with directed learning scenarios where they explain a word or concept in the context of many other examples; Tatoeba gets "high" quality sentences from a native speaker and teacher; and I get the practice of translating into my own words while receiving helpful corrections.

I am a developer, so I believe a mass import feature need not be for the masses. But I do think there is some use in a developer token and some sort of process to check client programs.

I think we can compromise on the above concerns in a couple of ways (apologies for jumping in at the end and thinking I know everything; I don't). I would like to make an app to upload sentences en masse. I only want to do it for one language, and I don't need to add more than 100 a day.

Could we perhaps approve developer tokens to interact with an API that meets certain standards for each language? These per-language standards could be worked out by a group of super users for that language.

I think there is a middle ground: a way to increase the number of quality sentences on Tatoeba while keeping garbage data out.

I think the answer is a very limited, protected developer API and a robust system for reviewing client systems. Also, you can put the work of meeting the standards on third-party developers.

@jayrod

jayrod commented Dec 19, 2020

If I were making an app to submit German sentences to Tatoeba, here are a few things that I would do to ensure good data makes it in.

  • Download Tatoeba sentences weekly to make comparisons.
  • Ensure that my sentence doesn't match anything in the offline database prior to submission.
  • Ensure each sentence ends with punctuation.
  • Autocorrect commonly misspelled words (those that require ß or ä, ü, etc.).
  • Upload no sentences with fewer than 4 parts.
  • Ensure sentences are in the target language, German.
  • Ensure each sentence is seen by human eyes prior to submission.
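
A few of the checks above could be sketched roughly as follows. This is only an illustration, not an existing Tatoeba tool: the function names, the set of accepted final punctuation marks, and the reading of "parts" as whitespace-separated words are all assumptions. Language detection, autocorrection, and human review are left out.

```python
MIN_PARTS = 4              # threshold from the list above; "parts" is read here as words (assumption)
FINAL_PUNCTUATION = ".!?…"  # assumed set of acceptable sentence-final marks


def normalize(sentence):
    """Collapse whitespace and lowercase so near-identical sentences compare equal."""
    return " ".join(sentence.split()).lower()


def check_sentence(sentence, known):
    """Return a list of problems found; an empty list means these checks passed.

    `known` is a set of normalized sentences from the weekly offline download.
    """
    text = sentence.strip()
    if not text:
        return ["empty sentence"]
    problems = []
    if normalize(text) in known:
        problems.append("duplicate of an existing sentence")
    if text[-1] not in FINAL_PUNCTUATION:
        problems.append("missing final punctuation")
    if len(text.split()) < MIN_PARTS:
        problems.append("fewer than 4 parts")
    return problems


# Hypothetical offline database containing a single sentence.
known = {normalize("Ich lerne jeden Tag Deutsch.")}
print(check_sentence("Ich lerne jeden Tag Deutsch.", known))
print(check_sentence("Guten Morgen", known))
print(check_sentence("Das Wetter ist heute schön.", known))
```

Sentences that come back with an empty list would still go to a human for review before submission.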

In my vision, my application would also have a mode requiring me to provide translations as I do the upload. I expect that a group of Tatoeba German super users could provide input on common errors made while adding sentences, so that they could be avoided before submission.

I don't think this is a system that will work globally but perhaps Tatoeba can start small and see if it spreads.

@mramosch

mramosch commented Dec 25, 2020

Controlled mass import by Tatoeba staff only (as with audio) would make it easy for, e.g., CMs to glance over packages of 100-200 sentences and approve them for an internal mass import tool, which could run at a reasonable rate during server "downtime".
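
As a rough sketch of that workflow, a contribution could be split into review packages, with only approved packages passed on to the import tool. The `Package` class, the function names, and the default package size are hypothetical; nothing here reflects an existing Tatoeba tool.

```python
from dataclasses import dataclass


@dataclass
class Package:
    sentences: list
    approved: bool = False  # set to True by a reviewer, e.g. a corpus maintainer


def make_packages(sentences, size=200):
    """Split a contribution into packages of at most `size` sentences."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [Package(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def importable(packages):
    """Collect the sentences of approved packages only, in their original order."""
    return [s for p in packages if p.approved for s in p.sentences]


# 450 placeholder sentences end up in packages of 200, 200 and 50.
packages = make_packages([f"Sentence {n}." for n in range(450)])
packages[0].approved = True
print(len(packages), len(importable(packages)))
```

The actual import of the approved sentences, and its pacing during quiet server hours, is deliberately left out.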

One can easily tell the contributor’s intentions or the diversity or quality of the contributions, and if need be, give constructive feedback to the contributor.

I'm also no friend of frontend restrictions based on a user's existing number of contributions, because users tend to be much more productive when they first join Tatoeba than later on. Such a limitation would work against their contribution curve.

Merry Christmas!

@jayrod

jayrod commented Dec 25, 2020

Would there be a reason for having different restrictions between mass-import and regular sentence contributions?

Perhaps, we should restrict contributions to native language contributions when mass importing.

This could help avoid a mass of less-than-natural-sounding sentences and sentences with errors being added to our corpus.

Of course, a member could lie about what his/her native language is, but if admins are careful about who they give permission to use this function, this may not become a problem.

This would cut out the possibility of submitting a native sentence on behalf of another.

@jayrod

jayrod commented Jan 16, 2021

Would it be possible to limit an API to users who are "educators", perhaps? I believe having access to a corpus such as Tatoeba is an awesome resource for tutors. I can imagine creating a front end designed to help tutors create exercises for students to translate or use as writing prompts.

Perhaps there is a way to turn these exercises into crowdsourced translation.

@yannis1962

Hi, I'm writing after an email exchange with Gilles Bedel, on the same subject as this thread.

I came to Tatoeba through the wonderful book by Hagiwara on Real-world NLP. It is a great initiative.

I understand the principle of curated translations, as opposed to Google Translate, where the quality is sometimes low and nothing guarantees it.

But I must confess I saw many Tatoeba translations (in French and Greek) that were bad, very bad: words missing, bad inflectional morphology. And I found no means to correct them (I added a comment to one of them, but I don't think it will ever be read by anyone).

I'm saying this to argue that the effort you require from users is not what makes translations better.

So first of all, you need a system allowing corrections (maybe it exists and I have missed it).

Maybe you need a scoring system which would attribute an initial score to every sentence (not visible to the user) and then would increase that score whenever there is a correction (unless corrections are back-and-forth, in which case an administrator should be alerted).

Then you should have a validation function of the type "Show me sentences of low confidence score", and validating by a single click would then increase the score. And if you implement a user's confidence score, then eir validating would increase the score in a way proportional to er personal score (also not known to the user, we don't want to become stackoverflow).

Mass import would then provide sentences with a very low score, which would need to be validated (by someone other than the person who imported them), corrected, or both.

The conclusion is the same as yours: for a sentence to be trustworthy, it needs effort and time. But in my idea, this effort and time are shared by all users, and the data are there. Mass import occurs and the result has a low score: somebody wanting only trustworthy sentences will ask for sentences with a high score, and somebody wanting any kind of sentence will take them all. Then, slowly and progressively, the scores of the correct mass-imported sentences will rise and they will become like the others, possibly after some corrections.

Example: Imagine I supply a mass import containing among other translations the following pair:

ENG I'm very honored
FRE Je suis très honorer

the fourth word of which has an error (a very common one: writing an infinitive instead of a participle). After the import, the score of the translation can be zero:

FRE Je suis très honorer score=0

I go through the "Validate low-score sentences" page and correct it:

FRE Je suis très honoré score=0.5

Automatically the score rises. Somebody else goes through the same page and validates it:

FRE Je suis très honoré score=1.0

It has now become a medium-scored (or high-scored) sentence.
If nobody has the time to correct or validate it, it remains with score 0. To retrieve it people must click on a checkbox "Download also low-scored sentences at your own risk".
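
The example above could be modelled with a minimal sketch like the one below. The class and function names are hypothetical, and the fixed 0.5 increment is simply read off the example; the proposal does not fix actual values. Detection of back-and-forth corrections is not modelled.

```python
from dataclasses import dataclass

STEP = 0.5  # increment taken from the example; the real value is an open design choice


@dataclass
class ScoredSentence:
    text: str
    importer: str
    score: float = 0.0  # mass-imported sentences start at zero

    def correct(self, new_text):
        """Replace the text and raise the score, as in the correction step."""
        self.text = new_text
        self.score += STEP

    def validate(self, user, user_confidence=1.0):
        """Raise the score in proportion to the validator's own confidence score."""
        if user == self.importer:
            raise ValueError("validation must come from someone other than the importer")
        self.score += STEP * user_confidence


def trusted_only(sentences, threshold=1.0):
    """Keep only sentences whose score has reached the threshold."""
    return [s for s in sentences if s.score >= threshold]


s = ScoredSentence("Je suis très honorer", importer="importer_a")
s.correct("Je suis très honoré")   # score is now 0.5
s.validate("validator_b")          # score is now 1.0
print(s.text, s.score)
```

A download option such as "Download also low-scored sentences at your own risk" would then amount to skipping the `trusted_only` filter.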

I think that this approach combines your security concerns and the need of mass import.
