Duplicate-merging script is still missing some obvious duplicates with non-matching spaces in French sentences. #770
Comments
+1 |
Can't Trang just make the decision and choose the one she likes? It's her website and it's her language. Anyone who uses the data for other purposes can easily do a mass find-and-replace to whatever space they prefer. |
Agreed. |
There are certain things that I consider low priority, the problem of spaces is one of them. Outside of Tatoeba Days I work on things that I care about personally. But during Tatoeba Days I take the time to work on things that I don't necessarily prioritize but that other members of the community do. |
This problem still exists. https://tatoeba.org/eng/sentences/show/467373#comment-849583 Quel est ton numéro de téléphone ? The duplicate-merging script didn't delete the duplicate because these aren't exactly the same. |
This has been mentioned again on the website. https://tatoeba.org/eng/sentences/show/1567400#comment-857922 I just checked last week's exported data. We have 3 types of spaces being used in front of the question mark in French sentences. ? = 26,396 ? = 9,756 ? (with no space) = 1821 ? = 76 |
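A tally like the one above could be reproduced roughly as follows. This is only a minimal sketch, not the script actually used for those numbers: the helper name `space_before_final_mark` and the sample sentences are hypothetical, and it assumes the exported sentences are already available as plain Python strings.

```python
import unicodedata
from collections import Counter


def space_before_final_mark(sentence: str, mark: str = "?") -> str:
    """Label the character that sits right before a sentence-final mark."""
    text = sentence.rstrip()
    if len(text) < 2 or not text.endswith(mark):
        return "not applicable"
    previous = text[-2]
    if not previous.isspace():
        return "no space"
    # Use the official Unicode name so each kind of space is unambiguous.
    return unicodedata.name(previous, f"U+{ord(previous):04X}")


# Hypothetical sample; real input would come from the weekly export.
samples = [
    "Quel est ton numéro de téléphone ?",        # regular space
    "Quel est ton numéro de téléphone\u00a0?",   # no-break space
    "Quel est ton numéro de téléphone\u202f?",   # narrow no-break space
    "Quel est ton numéro de téléphone?",         # no space
]
counts = Counter(space_before_final_mark(s) for s in samples)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```

Reporting the Unicode name rather than the character itself avoids the ambiguity of spaces that all look the same once pasted into a comment.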
This is a related problem, so I'll add a note here. These are duplicates that the software I normally use has found. CK wrote:
|
I have written a code to deal with spaces in the French corpus. There are three possible spaces in French:
First of all, I've run some scripts to check the numbers, considering nnsp the adequate ones (as many other people out there do). Here are the results:
The small differences between Corrected and the sum of sentences to be corrected are due to some sentences containing several kinds of spaces. In summary, running such a script would edit around 50,000 sentences and create around 200 duplicates that Horus will be able to eradicate. However, I can guarantee this script only for the French corpus, as I wouldn't take responsibility for a language whose rules and typography I do not understand. |
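For readers who want a concrete picture of what such a normalization involves, here is a minimal sketch. It is not the script described in the comment above: the function name `normalize_french_spacing`, the set of spaces handled, and the set of punctuation marks are all assumptions. It only rewrites a space that is already present and leaves "no space" sentences untouched.

```python
import re

NNBSP = "\u202f"  # narrow no-break space, the "nnsp" target assumed here

# Assumption: regular space, no-break space, thin space and narrow
# no-break space are the variants worth unifying before ? ! ; :
_SPACE_BEFORE_MARK = re.compile(r"[ \u00a0\u2009\u202f]+(?=[?!;:])")


def normalize_french_spacing(sentence: str) -> str:
    """Replace any existing space before ? ! ; : with a narrow no-break space."""
    return _SPACE_BEFORE_MARK.sub(NNBSP, sentence)


variants = [
    "Pourquoi demandes-tu ?",        # regular space
    "Pourquoi demandes-tu\u00a0?",   # no-break space
    "Pourquoi demandes-tu\u202f?",   # already narrow no-break space
]
normalized = {normalize_french_spacing(v) for v in variants}
print(len(normalized))  # the variants collapse to one exact duplicate
```

Once the variants are byte-for-byte identical, an exact-match duplicate merger can pick them up without any change to its own logic.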
The longer this issue is postponed, the more likely problems like the following will occur. Sentence #451458 is older, but Sentence #510709 has an audio file. This will likely be a problem when the spaces before final punctuation in French are standardized so our duplicate-merging script can merge sentences and their translations. https://tatoeba.org/eng/sentences/show/451458 It is sort of unfair to take away the ownership of a sentence from the original owner that was contributed first and didn't need corrections. If the older sentence was corrected and then matched the newer sentence, I don't feel it's a problem to give ownership to the owner of the newer sentence that was contributed error-free. |
I don’t think losing the ownership of such a simple sentence is a big deal to begin with. If people really think it is, how about we make it so that Horus moves the audio to the first-contributed sentence? |
This is still a problem. I wonder if it might not be time to prioritize this. Related recent comment. |
We did not reach a consensus on having standard rules regarding spaces in French. This is the last thread I remember: This means that by default, each contributor gets to decide how they want to do the spaces. And if we happen to have two contributors who have the same sentence with different spaces and firmly want to keep things their way, then so be it, we will consider these two sentences as distinct sentences and not duplicate ones. It is currently not Horus' job to "fix" sentences to match a certain set of standard rules. Horus' job is only to do all the necessary tasks to merge exact duplicate sentences. If we want to extend Horus' responsibilities to also take care of fixing sentences based on a specific set of rules, we could. But it's a new task that should be treated independently from merging duplicates. It could be done by another bot as well. What I want to stress is that the rules that the bot would follow cannot be decided without a consensus. This is not something that one person will decide on their own. I will not decide on my own that every French sentence should always use a non-breakable space before a question mark. Similarly, I will not decide on my own that every English sentence should always start with a capital letter. And I will not decide on my own that Japanese sentences should only use full-width characters for numbers (for instance １ instead of 1). I think no one should be deciding such things on their own. It needs to be discussed and agreed on by those who speak these languages. Before we implement any additional automatic rules, there has to be sufficient analysis and the results have to be documented.
So until someone takes the time to conduct an analysis and based on this analysis, comes up with a decision on what should be the rules for spaces in each language (or if there should be rules at all), we should just handle the present issue manually: reach out to contributors who have duplicate sentences due to space difference and ask them whether or not they care about the type of space they use in their sentences.
Basically, just use common sense to resolve a conflicting situation. This does not require any change in the source code and if no one objects to that, then I will close this issue. In the longer term, we can implement some sort of mass-editing so that people don't have to waste time fixing their sentences one by one. We can also implement a way to resolve near-duplicate issues. This implies:
For these longer-term solutions, we can create new issues. |
If this is not a matter that can be settled easily, I wonder if perhaps the Horus script could directly link French sentences that only differ in what space is used before the final punctuation. Are there any cases where a difference in which space is used results in a different meaning? If not, it would likely be useful to link such sentences, since people could at least see translations as indirect translations. Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations. This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations. |
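The second idea, giving every sentence in a near-duplicate group the translations of all the others, could look roughly like this. It is a sketch under stated assumptions and not part of Horus: the function names, the in-memory dictionaries, and the translation IDs 101 to 103 are hypothetical, and which kind of space each example sentence uses is assumed for illustration.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Whitespace (any Unicode space) directly before the final ? or !
_SPACE_BEFORE_FINAL = re.compile(r"\s+(?=[?!]+$)")


def near_duplicate_key(sentence: str) -> str:
    """Comparison key that ignores the space before final punctuation."""
    return _SPACE_BEFORE_FINAL.sub("", sentence.strip())


def group_near_duplicates(sentences: dict[int, str]) -> list[list[int]]:
    """Return lists of sentence IDs whose texts share the same key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for sentence_id, text in sentences.items():
        groups[near_duplicate_key(text)].append(sentence_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def pooled_translations(
    groups: list[list[int]], translations: dict[int, set[int]]
) -> dict[int, set[int]]:
    """Give each sentence in a group the union of the group's translations."""
    pooled: dict[int, set[int]] = {}
    for ids in groups:
        union: set[int] = set()
        for sentence_id in ids:
            union |= translations.get(sentence_id, set())
        for sentence_id in ids:
            pooled[sentence_id] = set(union)
    return pooled


sentences = {
    3184: "Pourquoi demandes-tu ?",
    3951157: "Pourquoi demandes-tu\u00a0?",
    1: "Bonjour.",  # hypothetical unrelated sentence
}
translations = {3184: {101}, 3951157: {102, 103}}  # hypothetical link data
groups = group_near_duplicates(sentences)
print(pooled_translations(groups, translations))
```

Note that this pools links mechanically; it makes no judgment about whether each translation is actually a good one.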
Spaces don't change the meaning and your suggestions are good, in my opinion. I would see no issue linking two French sentences that differ only by a space and making the indirect translations direct ones on both sides. But considering the number of sentences involved (unless the matter grew drastically since @Poulpisator posted his stats), I would suggest still handling this manually rather than automatically. The linking of the two near-duplicate sentences could be done by anyone with linking permission, but the linking of the translations should be done by someone who speaks both languages. It would be an occasion to double-check some translations. I will mark this
From last week's download. There are 2440 sentences in this file, so roughly half that number would be duplicates. Perhaps some have more than one duplicate. http://tatoeba.byethost3.com/fra-very-near-duplicates-2019-11-30.txt You didn't comment on this possibility.
Wouldn't it make sense to attach all possible translations for each of these French sentences? |
CK, just mind your business. You have absolutely NO added value into this matter, once and for all ! |
I take it that sacredceltic doesn't see any value in attaching existing translations from one French sentence to another French sentence that has the same meaning, but only differs because of the space before the final punctuation. I wonder if others share this view. |
As a record, let me put an answer I wrote when I was asked the following. This only reflects my personal experience and opinion.
There are two main problems when it comes to online contributions: This problem is reflected on websites. Dictionaries, by their core mission, do handle the thing correctly. Serious websites, like "Le Monde", also handle the thing pretty well (I guess they programmatically solve the problem but I do not know). Crappy websites, "la presse people" and others don't care a bit. They often don't even use the right quotation mark. But again, a part of the problem is that the French quotation mark does not directly appear on the keyboard, so people don't care. However, it is possible to write them on Mac and Windows if you know the ALT code. |
I disagree for several reasons. First of all, on Tatoeba we call links "translations". Many people may not care but I consider it a very important point, as it is one of the basic components of the tool (the other one being contributions, that we call "sentences"). Of course, I say "we" but that does not include anyone really, it is just a tendency that I noticed by experience. However some people do link their own synonym contributions. That is their choice, and there is no problem in that. But linking synonym sentences would deeply change the French corpus (in my personal opinion, in a bad way), because then one could ask "What about But let me take the problem from the other side. @ckjpn said
Let's suppose that I know that it will work because I did it SO many times on your own sentences. "that" "this" "it", "I know that..." "I know that that..." So many patterns that end in the same French sentence. If I translate "in order" then the group of sentences likely appears on the same page so I can link directly by the sentence number. But if I translate "by random", I would contribute the second or third identical translation days later, and Horus will merge it. Of course, during those few days Synonym1 does not see my French translation of Synonym2, but in the long run, they all end up with their translations. |
Very interesting list. I wonder if you could make a similar one for dashes (hyphen, – and —) for Dutch.
I have been doing this manually for many years but I doubt that making automatic scripts is a good idea. Regularly publishing lists, and sending them to the CMs of the languages concerned, is a better idea IMHO. |
But do you solve any problem doing that, except your own self preference? But let me express something else here. From a user point of view, I cannot see any problem you're solving with the solution you propose. I'll use simple sentences to illustrate. Situation 2 Situation 3 Please let us know what situations or problems your solution solves. Then of course, if you're well versed in the art of debate, you will ask me what is the added value of the non-solution I do not propose. Well, as I wrote above, I think doing what you suggest is in contradiction with the basic design bricks of Tatoeba. Of course, that is only my personal opinion, and I'm waiting for someone with a different opinion to bring arguments to the table. This contradiction is enough for me to keep things as they are until the U.I. problem is solved and a solution for inputting / replacing space by thin space is decided (Really, if the U.I. problem is solved, I think most contributors will agree to a two-way policy: no space or thin space). To summarize, I think the solution you propose solves a non-problem, that it relates to a very small part of the corpus, and that it is more to satisfy a personal point of view than to really address any issue. (PS: The agrodet above and me are just two faces of the same coin... Yeah, I mistook accounts.) |
You are right if someone uses the search engine. But what about translators who browse and translate sentence by sentence, like me and like many others? Let's imagine that there is a complicated sentence with a " !" at the end. I start translating it and only after I have finished do I see that somebody else made the effort of translating the near duplicate. If the near duplicate is linked, there is no waste of time. |
Dear Poulpisator, |
What is the probability that 2 different users translate the same duplicate (which again is a tiny fraction of the corpus...) at the same time ? |
I asked CK to make a script for Dutch. Not for your language. So please shut up. |
Except the title of this thread is about « ...spaces in French sentences » |
I’m going to summarise the dual problem now :
So guess which one we’re going to address first ? |
First, unless I missed something, no one was even aware of the issue that non-breakable spaces are not properly displayed on iPhones until a couple of days ago. I mean, the GitHub issue (#2026) was opened only two days ago. Saying that nobody cares is a bit too dramatic. Second, @Poulpisator has a good point which I forgot about. Linking two sentences of the same language makes sense on an abstract level, but from the UI point of view, we label linked sentences as "Translations" (it's very explicit in the new sentence design) and it is indeed confusing to have a sentence being defined as "translation" of another sentence in the same language. So I take back what I said when I said "I would see no issue linking two French sentences that differ only by a space". There is actually an issue, which is a general issue with same-language linking. And I agree that before we think about linking more sentences in a same language, we should first think about how we display and label linked sentences of the same language. The suggestion made in #1902 is a possible solution. Lastly, to be clear about this suggestion:
As I've said, I doubt it will ever be high priority enough to be implemented as an automated process. So no, we will not adapt the duplicate-merging script for that. It's not worth the effort and as @Poulpisator has described it, we can let sentences get linked to each other in an organic way (through human contributions), rather than in an automatic way (through a bot/script). And the more I think about it, the more I feel we're better off that way. It may be less productive to handle these near-duplicates manually, but it should lead to better quality in the end. A bot or a script will never ask itself if the sentences are really good translations or not, before linking them. There could be nonsense translations and they will be linked without question. A human will (or at least should) put some thought into it. We should really move towards a mindset where we embrace duplicates as something that can help us improve the quality of the corpus, rather than just seeing it as something that pollutes the corpus and wastes our time. |
Alleluiah, pray the sun, my boys!! \o/ Can we write this in rainbow colors in the top menu of Tatoeba? ^^ @sacredceltic A non-negligible part of my life consists of getting people who refuse to see it to understand that their design is far from correct. Tatoeba is a bit like my training room for real life :) @PaulPeer Okay, that's one argument for. Is there a second one? Although I could say that you're making one more assumption when you say With your own assumption, I arrive at a very different conclusion. I cannot see any case where "wasting" time providing translation to complex sentences is harmful to the project :) Better, I cannot see any case where it is not beneficial to contribute translations to complex sentences. As TRANG said, embrace the duplicates, don't fight them, let them flow through you, seize their power! :D |
OK. That is clear. But do all advanced contributors and CM's know about this? I doubt it very much. Just one example: The sentences "I know you know" and "I know that you know" and hundreds of similar ones have been linked together. Maybe in the FAQ the article about linking should be more clear? |
I had already reported this issue at least 4 years ago https://tatoeba.org/fra/wall/show_message/24679#!%23message_24679 |
Probably not. In itself it is not a huge problem either if people aren't aware and link sentences in the same language. It's not killing anyone. We will eventually implement a solution to the problem of same-language sentences being labelled as "Translations". But until the solution is implemented, there's no need either to add to the problem if you're aware it's a problem, unless it is to solve another, much worse problem.
You didn't exactly report any issue there. You only mentioned that the font used could be a problem. The problem that non-breakable spaces are invisible specifically with the default sans-serif font on iPhones was not a known issue until recently. I personally don't have an iPhone, so unless someone actually tells me that something looks wrong on iPhone, I wouldn't know. I don't think it's intuitive to think that the default sans-serif font on a major operating system doesn't display spaces correctly... |
Closing this now. Summary:
|
There is another discussion about this on the Wall. |
Here's an example left in a comment.
https://tatoeba.org/eng/sentences/show/3184#comment-702160
Here is the text of the comment
https://tatoeba.org/eng/sentences/show/3184
https://tatoeba.org/eng/sentences/show/3951157
Pourquoi demandes-tu ?
Pourquoi demandes-tu ?
The problem is that one of these sentences apparently doesn't use the standard space in front of the question mark, so the duplicate-merging script doesn't see these as duplicates.
I think we may just have to wait for one of the programmers to create a program that fixes all these French duplicates. Maybe at the same time, they can also automatically fix all English sentences in which French speakers have mistakenly put spaces in front of question marks.