Duplicate-merging script is still missing some obvious duplicates with non-matching spaces in French sentences. #770

Closed
ckjpn opened this issue Sep 13, 2015 · 34 comments
Labels
out-of-scope Issue that we decided we won't do and will close within a few weeks.

Comments

@ckjpn

ckjpn commented Sep 13, 2015

Here's an example left in a comment.
https://tatoeba.org/eng/sentences/show/3184#comment-702160

Here is the text of the comment
https://tatoeba.org/eng/sentences/show/3184
https://tatoeba.org/eng/sentences/show/3951157

Pourquoi demandes-tu ?
Pourquoi demandes-tu ?

The problem is that one of these sentences apparently doesn't use the standard space in front of the question mark, so the duplicate-merging script doesn't see these as duplicates.

I think we may just have to wait for one of the programmers to create a program that fixes all these French duplicates. Maybe at the same time, they can also automatically fix all English sentences in which French speakers have mistakenly put spaces in front of question marks.

@ckjpn ckjpn added the bug Issue that describes a problem with a feature that doesn't work as expected. label Sep 13, 2015
@PaulPeer

+1
There are hundreds of these cases. Also in front of ! ; and : as far as I remember. But are we going to wait until the French users stop fighting about which space to use? Or do we just decide on the standard space?
The "mistakenly put spaces" is not just a phenomenon in English sentences. I corrected many in Dutch and Esperanto too.

@ckjpn
Author

ckjpn commented Sep 13, 2015

But are we going to wait until the French users stop fighting about which space to use?

Can't Trang just make the decision and choose the one she likes? It's her website and it's her language.

Anyone who uses the data for other purposes can easily do a mass find-and-replace to whatever space they prefer.

@PaulPeer

Agreed.

@trang
Member

trang commented Sep 13, 2015

There are certain things that I consider low priority, the problem of spaces is one of them.
If this is however a big problem for you, then you can mention it during Tatoeba Day so that it can be discussed and a solution can be implemented.

Outside of Tatoeba Days I work on things that I care about personally. But during Tatoeba Days I take the time to work on things that I don't necessarily prioritize but that other members of the community do.

@trang trang added enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba. and removed bug Issue that describes a problem with a feature that doesn't work as expected. labels Sep 13, 2015
@ckjpn
Author

ckjpn commented May 20, 2016

This problem still exists.

https://tatoeba.org/eng/sentences/show/467373#comment-849583

Quel est ton numéro de téléphone ?
Quel est ton numéro de téléphone ?

The duplicate-merging script didn't delete the duplicate because these aren't exactly the same.
The space before the ? is different.

@ckjpn
Author

ckjpn commented Jun 2, 2016

This has been mentioned again on the website.

https://tatoeba.org/eng/sentences/show/1567400#comment-857922

I just checked last week's exported data.

We have 3 types of spaces being used in front of the question mark in French sentences.
For examples, I included the highest-numbered sentences.
Perhaps, if there is nothing wrong with the most commonly used one, it would be a good idea just to convert all French sentences to that one.

? = 26,396
Example: [#5169195] Pourrais-tu jeter un œil sur ma composition avant que je ne la remette ? (sacredceltic)

 ? = 9,756
Example: [#4875430] Avez-vous vu tous ces films ? (sacredceltic)

? (with no space) = 1821
Example: [#5164529] Comment aller à la fête? (martin9)

 ? = 76
Example: [#5158480] Vas-tu bien m'entendre, à la fin ? (sacredceltic)
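
A count like the one above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, the sample sentences are constructed for illustration (they are not taken from the export), and the real exported file would have to be read and filtered to French first.

```python
from collections import Counter
import unicodedata


def classify_space_before(sentence: str, mark: str = "?") -> str:
    """Name the character that precedes the last `mark` in the sentence."""
    idx = sentence.rfind(mark)
    if idx == -1:
        return "no mark"
    if idx == 0:
        return "no space"
    prev = sentence[idx - 1]
    if prev.isspace():
        # Report the Unicode name so that look-alike spaces are told apart.
        return unicodedata.name(prev, f"U+{ord(prev):04X}")
    return "no space"


def tally(sentences, mark: str = "?") -> Counter:
    """Count how often each kind of space precedes `mark`."""
    return Counter(classify_space_before(s, mark) for s in sentences)


# Constructed sample, one sentence per variant (escapes make the spaces visible).
sample = [
    "Pourquoi demandes-tu ?",        # regular space, U+0020
    "Pourquoi demandes-tu\u00a0?",   # no-break space, U+00A0
    "Pourquoi demandes-tu\u202f?",   # narrow no-break space, U+202F
    "Comment aller à la fête?",      # no space at all
]

for kind, count in tally(sample).items():
    print(f"{kind}: {count}")
```

Reporting the Unicode name rather than the character itself avoids the problem seen in the list above, where the different spaces are indistinguishable on screen.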

@ckjpn
Author

ckjpn commented Jan 5, 2017

This is a related problem, so I'll add a note here.

These are duplicates that the software I normally use has found.
Sharptoothed verified that these are duplicates except that the spaces on either side of the dash are different characters.

CK wrote:

[#464878] Тише едешь — дальше будешь. (al_ex_an_der)
[#2734405] Тише едешь — дальше будешь. (carlosalberto)

[#410633] Слово — серебро, молчание — золото. (al_ex_an_der)
[#5404160] Слово — серебро, молчание — золото. (anki)

[#598652] Семь — счастливое число. (kobylkin)
[#825700] Семь — счастливое число. (ae5s)

[#1456323] Мать Тереза ​​родилась в Югославии в 1910 году. (Balamax)
[#1525503] Мать Тереза родилась в Югославии в 1910 году. (corvard)

[#4340200] В здоровом теле — здоровый дух. (savella)
[#2735348] В здоровом теле — здоровый дух. (carlosalberto)

[#338479] Боб — мой друг. (rednaxela)
[#503431] Боб — мой друг. (drnm2)

[#580689] «Кажется, это очень интересно», — говорит Хироси. (al_ex_an_der)
[#2499885] «Кажется, это очень интересно», — говорит Хироси. (paul_lingvo)
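
Since these pairs look identical on screen, one way to see what actually differs is to list the non-standard space and format characters in each sentence. The following is a rough sketch with a hypothetical helper name; the two test strings are constructed to mimic the cases above and do not claim to reproduce the exact characters stored in the corpus.

```python
import unicodedata


def unusual_spacing(text: str):
    """Return (index, code point, name) for every space-like character
    other than the plain space U+0020, plus invisible format characters."""
    found = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        # "Zs" covers space separators such as U+00A0 and U+202F;
        # "Cf" covers format characters such as the zero width space U+200B.
        if (category == "Zs" and ch != " ") or category == "Cf":
            found.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return found


plain = "Боб \u2014 мой друг."           # regular spaces around the dash
variant = "Боб\u00a0\u2014 мой друг."    # assumed: no-break space before the dash

print(unusual_spacing(plain))    # nothing unusual
print(unusual_spacing(variant))  # reports the no-break space at index 3
```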

@Poulpisator

I have written some code to deal with spaces in the French corpus. There are three possible spaces in French:

  • regular space (space)
  • non-break space (nbsp)
  • narrow non-break space (nnsp)

First of all, I've run some scripts to check the numbers, treating nnsp as the correct one (as many other people out there do). Here are the results:

Sign   Total   No space   Space   nbsp   Corrected      Duplicates
?      48098   1947       32386   3246   37575 (78%)    180
!      16764   649        13143   207    13997 (83%)    38
:      1679    211        1273    31     1515 (90%)     1
;      1509    325        995     31     1351 (90%)     0

The small differences between Corrected and the sum of the sentences to be corrected are due to some sentences containing several kinds of spaces.
Also, my code does not handle a space in front of a long series of the same punctuation mark, like !!!, but the number of such sentences is negligible.

In summary, running such a script would edit around 50 000 sentences and create around 200 duplicates that Horus will be able to eradicate.

However, I can guarantee this script only for the French corpus, as I wouldn't take responsibility for a language whose rules and typography I do not understand.
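
The script itself is not shown in this thread, so what follows is only a hypothetical sketch of the kind of normalization described: any regular, no-break, or missing space before a single ? ! : ; is replaced with a narrow no-break space (U+202F). The pattern, the constant names, and the choice to skip runs such as !!! and marks that are not followed by whitespace (for example the colon in 10:30) are assumptions made for illustration.

```python
import re

NNBSP = "\u202f"                 # narrow no-break space
SPACES = " \u00a0\u202f"         # regular, no-break, narrow no-break

# Optional run of spaces, then one mark that is neither preceded nor
# followed by another mark, and that ends a word (whitespace or end follows).
_PATTERN = re.compile(
    rf"(?<![?!:;])[{SPACES}]*([?!:;])(?![?!:;])(?=\s|$)"
)


def normalize_french_spacing(text: str) -> str:
    """Put exactly one narrow no-break space before ? ! : ; (sketch only)."""
    return _PATTERN.sub(NNBSP + r"\1", text)


a = normalize_french_spacing("Pourquoi demandes-tu ?")        # regular space
b = normalize_french_spacing("Pourquoi demandes-tu\u00a0?")   # no-break space
print(a == b)                                   # the two variants now match
print(normalize_french_spacing("Quoi !!!"))     # left untouched
print(normalize_french_spacing("Il est 10:30."))  # left untouched
```

After such a pass, sentences that differed only in the space become exact duplicates, which is what would let an exact-match merger pick them up. Cases like a mark followed by a closing guillemet are not covered by this sketch.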

@ckjpn
Author

ckjpn commented May 27, 2019

The longer this issue is postponed, the more likely problems like the following will occur.

Sentence #451458 is older, but Sentence #510709 has an audio file.

This will likely be a problem when the spaces before final punctuation in French are standardized so our duplicate-merging script can merge sentences and their translations.

https://tatoeba.org/eng/sentences/show/451458

It is sort of unfair to take away the ownership of a sentence from the original owner that was contributed first and didn't need corrections.

If the older sentence was corrected and then matched the newer sentence, I don't feel it's a problem to give ownership to the owner of the newer sentence that was contributed error-free.

@jiru
Member

jiru commented May 27, 2019

I don’t think losing the ownership of such a simple sentence is a big deal to begin with. If people really think it is, how about we make it so that Horus moves the audio to the first-contributed sentence?

@ckjpn
Author

ckjpn commented Nov 30, 2019

This is still a problem. I wonder if it might not be time to prioritize this.

Related recent comment.
https://tatoeba.org/eng/sentences/show/1996169#comment-1143868

@trang
Member

trang commented Nov 30, 2019

We did not reach a consensus on having standard rules regarding spaces in French. This is the last thread I remember:
https://tatoeba.org/eng/wall/show_message/29619
(but I know there were other discussions before)

This means that by default, each contributor gets to decide how they want to do the spaces. And if we happen to have two contributors who have the same sentence with different spaces and firmly want to keep things their way, then so be it, we will consider these two sentences as distinct sentences and not duplicate ones.

It is currently not Horus' job to "fix" sentences to match a certain set of standard rules. Horus' job is only to do all the necessary tasks to merge exact duplicate sentences.

If we want to extend Horus' responsibilities to also take care of fixing sentences based on a specific set of rules, we could. But it's a new task that should be treated independently from merging duplicates. It could be done by another bot as well.

What I want to stress is that the rules that the bot would follow cannot be decided without a consensus. This is not something that one person will decide on their own.

I will not decide on my own that every French sentence should always use a non-breakable space before a question mark. Similarly, I will not decide on my own that every English sentence should always start with a capital letter. And I will not decide on my own that Japanese sentences should only use full-width characters for numbers (for instance １ instead of 1). I think no one should be deciding such things on their own. It needs to be discussed and agreed on by those who speak these languages. Before we implement any additional automatic rules, there has to be sufficient analysis and the results have to be documented.

So until someone takes the time to conduct an analysis and based on this analysis, comes up with a decision on what should be the rules for spaces in each language (or if there should be rules at all), we should just handle the present issue manually: reach out to contributors who have duplicate sentences due to space difference and ask them whether or not they care about the type of space they use in their sentences.

  • If both sides don't want to change, then do nothing.
  • If someone doesn't reply, then do nothing.
  • If one side is okay to change but the other side doesn't want to, then the person who is okay to change should edit their sentences, to match the space in the other duplicate sentence.
  • If both are fine with changing the space, then they can agree together which one they prefer and go for that one.
  • If they can't decide, let a French corpus maintainer decide.

Basically, just use common sense to resolve a conflicting situation. This does not require any change in the source code and if no one objects to that, then I will close this issue.

In the longer term, we can implement some sort of mass-editing so that people don't have to waste time fixing their sentences one by one.

We can also implement a way to resolve near-duplicate issues. This implies:

  • detecting near-duplicates
  • listing the near-duplicates to the corresponding contributors and let them decide what they want to do
  • keep track of the decision so that the sentences won't appear again in the listing

For these longer-term solutions, we can create new issues.
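
The three steps listed above (detect, list, remember the decision) could look roughly like this. It is a sketch under stated assumptions, not a description of any existing Tatoeba code: the function names, the sentence IDs, and the representation of past decisions as a set of ID groups are all hypothetical.

```python
from collections import defaultdict


def comparison_key(text: str) -> str:
    """Key used only for comparison: drop zero width spaces and map every
    whitespace variant (including U+00A0 and U+202F) to a single space."""
    return " ".join(text.replace("\u200b", "").split())


def find_near_duplicates(sentences, decided=frozenset()):
    """sentences: dict of id -> text.
    decided: set of frozensets of ids that contributors already reviewed.
    Returns groups of ids whose texts differ only in their spaces."""
    groups = defaultdict(list)
    for sentence_id, text in sentences.items():
        groups[comparison_key(text)].append(sentence_id)

    pending = []
    for ids in groups.values():
        distinct_texts = {sentences[i] for i in ids}
        # Exact duplicates are left to the existing merge script.
        if len(distinct_texts) > 1 and frozenset(ids) not in decided:
            pending.append(sorted(ids))
    return pending


# Hypothetical ids and constructed texts.
corpus = {
    1: "Pourquoi demandes-tu ?",
    2: "Pourquoi demandes-tu\u00a0?",
    3: "Avez-vous vu tous ces films\u202f?",
}
print(find_near_duplicates(corpus))                                # [[1, 2]]
print(find_near_duplicates(corpus, decided={frozenset({1, 2})}))   # []
```

Note that this key treats a sentence with no space before the mark as a different sentence; only the kind of space is ignored.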

@trang trang added unclear The issue, its scope or the goal are not clearly identified and removed enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba. labels Nov 30, 2019
@ckjpn
Author

ckjpn commented Dec 1, 2019

If this is not a matter that can be settled easily, I wonder if perhaps the Horus script could directly link French sentences that only differ in what space is used before the final punctuation.

Are there any cases where a difference in which space is used results in a different meaning?

If not, it would likely be useful to link such sentences, since people could at least see translations as indirect translations.

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations. This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.

@trang
Member

trang commented Dec 1, 2019

Spaces don't change the meaning and your suggestions are good, in my opinion.

I would see no issue linking two French sentences that differ only by a space and making the indirect translations direct ones on both sides.

But considering the number of sentences involved (unless the matter grew drastically since @Poulpisator posted his stats), I would still suggest handling this manually rather than automatically.

The linking of the two near-duplicate sentences could be done by anyone with linking permission, but the linking of the translations should be done by someone who speaks both languages. It would be an occasion to double check some translations.

I will mark this out-of-scope because I doubt it will ever be high priority enough to be implemented as an automated process.

@trang trang added out-of-scope Issue that we decided we won't do and will close within a few weeks. and removed unclear The issue, its scope or the goal are not clearly identified labels Dec 1, 2019
@ckjpn
Author

ckjpn commented Dec 1, 2019

From last week's download.

There are 2440 sentences in this file, so roughly half that number would be duplicates. Perhaps some have more than one duplicate.

http://tatoeba.byethost3.com/fra-very-near-duplicates-2019-11-30.txt

You didn't comment on this possibility.

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations. This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.

Wouldn't it make sense to attach all possible translations for each of these French sentences?

@sacredceltic

CK, just mind your business. You have absolutely NO added value into this matter, once and for all !

@ckjpn
Author

ckjpn commented Dec 1, 2019

I take it that sacredceltic doesn't see any value in attaching existing translations from one French sentence to another French sentence that has the same meaning, but only differs because of the space before the final punctuation.

I wonder if others share this view.

@agrodet
Contributor

agrodet commented Dec 2, 2019

As a record, let me put an answer I wrote when I was asked the following. This only reflects my personal experience and opinion.

How do major French news websites handle this?

  • Does each site do it the same way for all their articles?
  • Do a large majority of major news websites do it the same way?

How do major online French dictionaries handle this?
How about other major websites in French?

There are two main problems when it comes to online contributions:
1 - People don't know their own language.
2 - Keyboards, computers, and browsers are very bad at it, even in 2019, and that makes 1. worse.
Software like Word or LibreOffice automatically inserts a space when you type your punctuation if you set your paragraph to French. But browsers, and consequently every Internet tool like Gmail, Tatoeba contributions, your search engine, etc., do not. So some people write mail with wrong typography to avoid strange output, like punctuation alone on the next line. Others know there is a space but the correct one can only be inserted by ALT+0155 on Windows and is simply impossible to input on a Mac, so they write a regular space.

This problem is reflected on websites. Dictionaries, by their core mission, do handle the thing correctly. Serious websites, like "Le Monde", also handle the thing pretty well (I guess they programmatically solve the problem but I do not know). Crappy websites, "la presse people" and others don't care a bit. They often don't even use the right quotation mark. But again, part of the problem is that the French quotation mark does not directly appear on the keyboard, so people don't bother. However, it is possible to write them on Mac and Windows if you know the ALT code.

@agrodet
Contributor

agrodet commented Dec 2, 2019

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations. This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.

I disagree for several reasons. First of all, on Tatoeba we call links "translations". Many people may not care but I consider it a very important point, as it is one of the basic components of the tool (the other one being contributions, that we call "sentences").
As long as we call links "translations" I do not see any value in linking two identical sentences. I know that it is sometimes done (for variations) in the English corpus; for example, Mr Jack sleeps. and Mr. Jack sleeps. are linked. But in the French corpus there is a kind of tacit tendency that goes the opposite way. Somewhat surprisingly, even synonymous sentences are not so often linked. However, we very often link synonymous sentences when they involve uncommon vocabulary, regionalisms, or expressions.
For example, we wouldn't link J'aurai ta peau and Je vais te buter, we wouldn't link Je veux pas. and Je ne veux pas. but maybe we would link Y en a pas bézef. and Il n'y en a pas beaucoup.

Of course, I say "we" but that does not include anyone really, it is just a tendency that I noticed by experience. However some people do link their own synonymous contributions. That is their choice, and there is no problem in that. But linking synonymous sentences would deeply change the French corpus (in my personal opinion, in a bad way), because then one could ask "What about Tu es mort, Tu es morte, Vous êtes mort, Vous êtes morte?" and that would be quite something...

But let me take the problem from the other side. @ckjpn said

This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.

Let's suppose that Viens! and Viens ! are both available.
User1 contributes Come! to Viens!.
Of course, we could see the Viens ! => => Come! indirect link if we were to link the two French sentences. But my view is different, because eventually
User2, or User1 for that matter, will contribute Come! to Viens ! and Horus will then merge Come!, making it a DIRECT translation of the two sentences, as it should be. Isn't the problem solved then?
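
To make the scenario concrete, here is a toy model of an exact-duplicate merge. Horus's actual implementation is not shown in this thread; the data layout, the function name, and the rule of keeping the lowest ID are assumptions for illustration only.

```python
def merge_exact_duplicates(sentences, links):
    """sentences: dict of id -> (language, text).
    links: set of frozenset({id_a, id_b}) translation links.
    Sentences with the same language and exactly the same text are merged
    into the lowest id (an assumed rule), and their links are pooled."""
    canonical = {}   # (language, text) -> id that is kept
    remap = {}       # every id -> id that is kept
    for sentence_id in sorted(sentences):
        key = sentences[sentence_id]
        remap[sentence_id] = canonical.setdefault(key, sentence_id)

    kept = {sid: sentences[sid] for sid in set(remap.values())}
    merged_links = set()
    for link in links:
        a, b = (remap[x] for x in link)
        if a != b:
            merged_links.add(frozenset((a, b)))
    return kept, merged_links


# Hypothetical ids: two French near-duplicates, each translated as "Come!".
sentences = {
    1: ("fra", "Viens!"),
    2: ("fra", "Viens\u00a0!"),
    3: ("eng", "Come!"),   # User1's translation of sentence 1
    4: ("eng", "Come!"),   # later translation of sentence 2
}
links = {frozenset((1, 3)), frozenset((2, 4))}

kept, merged = merge_exact_duplicates(sentences, links)
print(sorted(kept))                      # [1, 2, 3]
print(sorted(sorted(l) for l in merged)) # [[1, 3], [2, 3]]
```

The two French sentences stay separate because their texts differ, but the single remaining Come! ends up directly linked to both.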

I know that it will work because I did it SO many times on your own sentences. "that" "this" "it", "I know that..." "I know that that..." So many patterns that end in the same French sentence. If I translate "in order" then the group of sentences likely appears on the same page so I can link directly by the sentence number. But if I translate "at random", I would contribute the second or third identical translation days later, and Horus will merge it. Of course, during those few days Synonym1 does not see my French translation of Synonym2, but in the long run, they all end up with their translations.

@PaulPeer

PaulPeer commented Dec 2, 2019

There are 2440 sentences in this file, so roughly half that number would be duplicates. Perhaps some have more than one duplicate.

http://tatoeba.byethost3.com/fra-very-near-duplicates-2019-11-30.txt

Very interesting list. I wonder if you could make a similar one for dashes (hyphen, – and —) for Dutch.

Wouldn't it make sense to attach all possible translations for each of these French sentences?

I have been doing this manually for many years but I doubt that making automatic scripts is a good idea. Regularly publishing lists, and sending them to the CMs of the languages concerned, is a better idea IMHO.

@Poulpisator

Poulpisator commented Dec 2, 2019

But do you solve any problem doing that, except your own self preference?
Really, you (CK, Paul and others who think like you) and I (and others who think like me) seem to be unable to agree on this, so either there is a design flaw in your thinking, a flaw in mine, or in both. So it might be a good idea to seriously discuss it. Can you give us two arguments for why your idea would be a good idea? I gave some of mine above, I could give more.

But let me express something else here. From a user point of view, I cannot see any problem you're solving with the solution you propose. I'll use simple sentences to illustrate.
Let us say we have a Come! <=> Viens ! link but not yet a Come! <=> Viens! link (I put no space for the sake of illustration, but suppose it is just two different spaces).
Situation 1
Suppose the user searches for the translation of Come!. He will then see Viens ! as a direct translation. He will not see Viens! but as I mentioned above, that is only a temporary problem, as the link Come! <=> Viens! will eventually appear. But for the sake of argumentation, let us suppose that you do create a link Viens! <=> Viens !. What is the gain here? When you look at Come!, you will only see one more indirect translation, Viens!. But there was Viens ! already. I think the added value is near zero. The real added value is located in Viens! indirect translations, but by design those are not displayed. Hence, I do not think this solution brings any value to situation 1.

Situation 2
Suppose the user searches for viens or a similar form. Ah! Now we have a somewhat better added value, because when the user looks at Viens!, he will see Viens ! as a direct translation and therefore all of Viens !'s direct translations as indirect translations. Nice... Except that no user would end up on the page of Viens! directly by searching. He will first see the list of results. And although he may see Viens! first and click on it, ignoring Viens !, he will just have to go to the next search result to see Viens !. Hence I see very little added value for situation 2 (but I do admit that there is more than in situation 1).

Situation 3
The user downloads Tatoeba data and uses it in an external tool. If the user uses raw data without parsing it first, the user is a fool. That's a very basic thing to do. I will probably repeat it until I die, but the tools adapt to the source of data, not the other way around. And I know that at least one developer agrees with this idea. Would you imagine going to Twitter or Facebook and saying "Hey! Your format isn't that good for me, update your API!"? They would laugh at you (and then, they would ask what your problem is and propose a solution ^^). Therefore if your solution is to facilitate clusters, then the user needs to be a better programmer / designer. I see no added value here. Worse, I see harm done to the data source to please some fools.

Please let us know what situations or problems your solution solves.

Then of course, if you're well versed in the art of debate, you will ask me what the added value is of the non-solution I do not propose. Well, as I wrote above, I think doing what you suggest is in contradiction with the basic design bricks of Tatoeba. Of course, that is only my personal opinion, and I'm waiting for someone with a different opinion to bring arguments to the table. This contradiction is enough for me to keep things as they are until the U.I. problem is solved and a solution for inputting / replacing space by thin space is decided (Really, if the U.I. problem is solved, I think most contributors will agree to a two-way policy: no space or thin space).

To summarize, I think the solution you propose solves a non-problem, that it relates to a very small part of the corpus, and that it is more to satisfy a personal point of view than to really address any issue.

(PS: The agrodet above and me are just two faces of the same coin... Yeah, I mistook accounts.)

@PaulPeer

PaulPeer commented Dec 2, 2019

Situation 2
Suppose the user searches for viens or a similar form. Ah! Now we have a somewhat better added value, because when the user looks at Viens!, he will see Viens ! as a direct translation and therefore all of Viens !'s direct translations as indirect translations. Nice... Except that no user would end up on the page of Viens! directly by searching. He will first see the list of results. And although he may see Viens! first and click on it, ignoring Viens !, he will just have to go to the next search result to see Viens !. Hence I see very little added value for situation 2 (but I do admit that there is more than in situation 1).

You are right if someone uses the search engine. But what about translators who browse and translate sentence by sentence, like me and like many others? Let's imagine that there is a complicated sentence with a " !" at the end. I start translating it and only after I have finished do I see that somebody else made the effort of translating the near duplicate. If the near duplicate is linked, there is no waste of time.

@sacredceltic

Dear Poulpisator,
I am appalled by the amount of energy spent desperately trying to explain these non-problems to a handful of obsessive-compulsives totally devoid of common sense.
At least it has opened my eyes to your tenacity and your clear-sightedness. At last, after several years, someone who understands the problem and its non-implications! I suddenly feel less alone...
Thank you!

@sacredceltic

@PaulPeer

What is the probability that 2 different users translate the same duplicate (which again is a tiny fraction of the corpus...) at the same time ?
Next time you have imaginary problems of that size, consult a shrink and leave the rest of us in peace, please...

@PaulPeer

PaulPeer commented Dec 2, 2019

Next time you have imaginary problems of that size, consult a shrink and leave the rest of us in peace, please...

I asked CK to make a script for Dutch. Not for your language. So please shut up.

@sacredceltic

Except the title of this thread is about « ...spaces in French sentences »
Just get out of this thread, you fool !

@sacredceltic

I’m going to summarise the dual problem now :

  1. we’ve got a REAL problem that French sentences, when they are correctly created, are incorrectly displayed. And NOBODY addresses this problem. Nobody is interested. Particularly our non-French admins, and especially those who were responsible for imposing the current display of sentences.

  2. we have an IMAGINARY problem of duplicates, which is actually caused, at least partly, by the REAL problem above, and which has been obsessing the same non-French admins, impeding their sleep for so many years.

So guess which one we’re going to address first ?

@trang
Member

trang commented Dec 3, 2019

First, unless I missed something, no one was even aware of the issue that non-breakable spaces are not properly displayed on iPhones until a couple of days ago. I mean, the GitHub issue (#2026) was opened only two days ago. Saying that nobody cares is a bit too dramatic.

Second, @Poulpisator has a good point which I forgot about. Linking two sentences of the same language makes sense on an abstract level, but from the UI point of view, we label linked sentences as "Translations" (it's very explicit in the new sentence design) and it is indeed confusing to have a sentence being defined as "translation" of another sentence in the same language. So I take back what I said when I said "I would see no issue linking two French sentences that differ only by a space". There is actually an issue, which is a general issue with same-language linking. And I agree that before we think about linking more sentences in the same language, we should first think about how we display and label linked sentences of the same language. The suggestion made in #1902 is a possible solution.

Lastly, to be clear about this suggestion:

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations.

As I've said, I doubt it will ever be high priority enough to be implemented as an automated process. So no, we will not adapt the duplicate-merging script for that. It's not worth the effort and as @Poulpisator has described it, we can let sentences get linked to each other in an organic way (through human contributions), rather than in an automatic way (through a bot/script).

And the more I think about it, the more I feel we're better off that way. It may be less productive to handle these near-duplicates manually, but it should lead to better quality in the end. A bot or a script will never ask itself if the sentences are really good translations or not, before linking them. There could be nonsense translations and they will be linked without question. A human will (or at least should) put some thoughts into it.

We should really move towards a mindset where we embrace duplicates as something that can help us improve the quality of the corpus, rather than just seeing it as something that pollutes the corpus and wastes our time.

@Poulpisator

Poulpisator commented Dec 3, 2019

We should really move towards a mindset where we embrace duplicates as something that can help us improve the quality of the corpus, rather than just seeing it as something that pollutes the corpus and wastes our time.

Alleluiah, pray the sun, my boys!! \o/ Can we write this in rainbow colors in the top menu of Tatoeba? ^^

@sacredceltic A non-negligible part of my life consists of getting people to understand that their design is far from correct when they refuse to see it. Tatoeba is a bit like my training room for real life :)

@PaulPeer Okay, that's one argument for. Is there a second one? Although I could say that you're making one more assumption when you say Let's imagine that there is a complicated sentence. If we have a look at CK's file we would realize that there aren't so many such sentences. Actually, I think the workflow leading to such a situation is of pretty low probability.
Besides, you say it is a waste of time. I say this is an opportunity. Especially if it is a complex sentence, as your assumption is. If complexA and complexB are given, I'm pretty sure that two different users would translate them into different sentences. I even think that the same user would not translate them the same way on two different days. So now suppose that we have
complexA <=> complexB <=> translationA
When you explore sentences to translate and find complexA, you see translationA as a good enough translation, link it, and go to the next sentence to translate. If you do that, we now have only one translation for one (unique except for spacing) complex sentence. Whereas if this complexA <=> complexB link did not exist, when you found complexA, you would make the effort to provide a new sentence, and then we would have TWO translations for one (unique) complex sentence.

With your own assumption, I arrive at a very different conclusion. I cannot see any case where "wasting" time providing translations of complex sentences is harmful to the project :) Better, I cannot see any case where it is not beneficial to contribute translations of complex sentences.

As TRANG said, embrace the duplicates, don't fight them, let them flow through you, seize their power! :D

@PaulPeer

PaulPeer commented Dec 3, 2019

Second, @Poulpisator has a good point which I forgot about. Linking two sentences of the same language makes sense on an abstract level, but from the UI point of view, we label linked sentences as "Translations" (it's very explicit in the new sentence design) and it is indeed confusing to have a sentence being defined as "translation" of another sentence in the same language.

OK. That is clear. But do all advanced contributors and CM's know about this? I doubt it very much. Just one example: The sentences "I know you know" and "I know that you know" and hundreds of similar ones have been linked together. Maybe in the FAQ the article about linking should be more clear?

@sacredceltic

I had already reported this issue at least 4 years ago https://tatoeba.org/fra/wall/show_message/24679#!%23message_24679

@trang
Member

trang commented Dec 3, 2019

@PaulPeer

But do all advanced contributors and CM's know about this?

Probably not. In itself it is not a huge problem if people aren't aware and link sentences in the same language. It's not killing anyone. We will eventually implement a solution to the problem of same-language sentences being labelled as "Translations". But until the solution is implemented, there's no need to add to the problem if you're aware it's a problem, unless it solves another, much worse problem.

@sacredceltic

I had already reported this issue at least 4 years ago https://tatoeba.org/fra/wall/show_message/24679#!%23message_24679

You didn't exactly report any issue there. You only mentioned that the font used could be a problem. The problem that non-breakable spaces are invisible specifically with the default sans-serif font on iPhones was not a known issue until recently. I personally don't have an iPhone, so unless someone actually tells me that something looks wrong on iPhone, I wouldn't know. I don't think it's intuitive to think that the default sans-serif font on a major operating system doesn't display spaces correctly...

@trang
Member

trang commented Dec 15, 2019

Closing this now.

Summary:

  • We won't automatically merge sentences that differ only by a space. These sentences will have to be dealt with manually.
  • We can consider providing a mass-edit feature for a contributor to convert all the spaces in their sentences instead of having to edit each sentence one by one.
  • We can consider automatically converting spaces, but we will need to agree on what to do exactly.
  • We'll work on ensuring that non-breakable spaces are properly displayed (Non-breakable spaces not visible on iPhones with default sans-serif font #2026).

@ckjpn
Author

ckjpn commented Jun 28, 2023

There is another discussion about this on the Wall.
https://tatoeba.org/en/wall/show_message/39979#message_39979

7 participants