Duplicate-merging script is still missing some obvious duplicates with non-matching spaces in French sentences. #770

ckjpn · 2015-09-13T07:14:19Z

Here's an example left in a comment.
https://tatoeba.org/eng/sentences/show/3184#comment-702160

Here is the text of the comment
https://tatoeba.org/eng/sentences/show/3184
https://tatoeba.org/eng/sentences/show/3951157

Pourquoi demandes-tu ?
Pourquoi demandes-tu ?

The problem is that one of these sentences apparently doesn't use the standard space in front of the question mark, so the duplicate-merging script doesn't see these as duplicates.
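
A minimal sketch of why an exact comparison misses this pair. Which space character each of the two sentences actually contains is an assumption here (a regular space in one, a no-break space in the other); the point is only that the strings look identical but differ by one code point.

```python
import unicodedata

# Assumed for illustration: one sentence uses U+0020, the other U+00A0.
a = "Pourquoi demandes-tu ?"
b = "Pourquoi demandes-tu\u00a0?"

# An exact-match duplicate check sees two different strings.
print(a == b)

# Name the character that sits just before the question mark in each string.
for text in (a, b):
    ch = text[-2]
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Printing the code points like this is a quick way to confirm that a suspected duplicate differs only in its spacing.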

I think we may just have to wait for one of the programmers to create a program that fixes all these French duplicates. Maybe at the same time, they can also automatically fix all the English sentences in which French speakers have mistakenly put spaces in front of question marks.

PaulPeer · 2015-09-13T07:32:48Z

+1
There are hundreds of these cases, also in front of ! ; and : as far as I remember. But are we going to wait until the French users stop fighting about which space to use? Or do we just settle on the standard space?
The mistakenly added spaces are not just a phenomenon in English sentences. I have corrected many in Dutch and Esperanto too.

ckjpn · 2015-09-13T07:41:13Z

But are we going to wait until the French users stop fighting about which space to use?

Can't Trang just make the decision and choose the one she likes? It's her website and it's her language.

Anyone who uses the data for other purposes can easily do a mass find-and-replace to whatever space they prefer.
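
For someone consuming the exported data, the mass find-and-replace mentioned above could look roughly like this. It is only a sketch: the function name is hypothetical, and it assumes the two variants to replace are the no-break space (U+00A0) and the narrow no-break space (U+202F).

```python
# Map both no-break variants to a regular space (assumed set of variants).
TO_REGULAR_SPACE = str.maketrans({"\u00a0": " ", "\u202f": " "})

def use_regular_spaces(text: str) -> str:
    """Return text with no-break and narrow no-break spaces replaced by U+0020."""
    return text.translate(TO_REGULAR_SPACE)

print(use_regular_spaces("Quel est ton numéro de téléphone\u00a0?"))
```

Swapping the target character in the mapping gives whichever convention the data user prefers.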

PaulPeer · 2015-09-13T07:45:50Z

Agreed.

trang · 2015-09-13T12:51:04Z

There are certain things that I consider low priority, the problem of spaces is one of them.
If this is however a big problem for you, then you can mention it during Tatoeba Day so that it can be discussed and a solution can be implemented.

Outside of Tatoeba Days I work on things that I care about personally. But during Tatoeba Days I take the time to work on things that I don't necessarily prioritize but that other members of the community do.

ckjpn · 2016-05-20T23:29:39Z

This problem still exists.

https://tatoeba.org/eng/sentences/show/467373#comment-849583

Quel est ton numéro de téléphone ?
Quel est ton numéro de téléphone ?

The duplicate-merging script didn't delete the duplicate because these aren't exactly the same.
The space before the ? is different.

ckjpn · 2016-06-02T13:50:23Z

This has been mentioned again on the website.

https://tatoeba.org/eng/sentences/show/1567400#comment-857922

I just checked last week's exported data.

We have 3 types of spaces being used in front of the question mark in French sentences.
For examples, I included the highest-numbered sentences.
Perhaps, if there is nothing wrong with the most commonly used one, it would be a good idea just to convert all French sentences to that one.

? = 26,396
Example: [#5169195] Pourrais-tu jeter un œil sur ma composition avant que je ne la remette ? (sacredceltic)

 ? = 9,756
Example: [#4875430] Avez-vous vu tous ces films ? (sacredceltic)

? (with no space) = 1821
Example: [#5164529] Comment aller à la fête? (martin9)

? = 76
Example: [#5158480] Vas-tu bien m'entendre, à la fin ? (sacredceltic)
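
A count like the one above can be reproduced along these lines. This is a sketch, not the script that produced the numbers: the function names are hypothetical, and it assumes the sentences have already been loaded from the export as a list of strings.

```python
from collections import Counter

# Labels for the characters that may precede the final punctuation mark.
SPACE_LABELS = {
    " ": "regular space (U+0020)",
    "\u00a0": "no-break space (U+00A0)",
    "\u202f": "narrow no-break space (U+202F)",
}

def classify(sentence, mark="?"):
    """Label what precedes a final `mark`, or return None if the sentence does not end with it."""
    if len(sentence) < 2 or not sentence.endswith(mark):
        return None
    before = sentence[-2]
    if before in SPACE_LABELS:
        return SPACE_LABELS[before]
    return "other whitespace" if before.isspace() else "no space"

def tally(sentences, mark="?"):
    """Count sentences per kind of spacing before the final `mark`."""
    labels = (classify(s, mark) for s in sentences)
    return Counter(label for label in labels if label is not None)

sample = ["Ça va ?", "Ça va\u00a0?", "Ça va\u202f?", "Ça va?", "Oui."]
for label, count in tally(sample).items():
    print(label, "=", count)
```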

ckjpn · 2017-01-05T09:59:41Z

This is a related problem, so I'll add a note here.

These are duplicates that the software I normally use has found.
Sharptoothed verified that these are duplicates except that the spaces on either side of the dash are different characters.

CK wrote:

[#464878] Тише едешь — дальше будешь. (al_ex_an_der)
[#2734405] Тише едешь — дальше будешь. (carlosalberto)

[#410633] Слово — серебро, молчание — золото. (al_ex_an_der)
[#5404160] Слово — серебро, молчание — золото. (anki)

[#598652] Семь — счастливое число. (kobylkin)
[#825700] Семь — счастливое число. (ae5s)

[#1456323] Мать Тереза родилась в Югославии в 1910 году. (Balamax)
[#1525503] Мать Тереза родилась в Югославии в 1910 году. (corvard)

[#4340200] В здоровом теле — здоровый дух. (savella)
[#2735348] В здоровом теле — здоровый дух. (carlosalberto)

[#338479] Боб — мой друг. (rednaxela)
[#503431] Боб — мой друг. (drnm2)

[#580689] «Кажется, это очень интересно», — говорит Хироси. (al_ex_an_der)
[#2499885] «Кажется, это очень интересно», — говорит Хироси. (paul_lingvo)

Poulpisator · 2018-08-18T14:02:09Z

I have written some code to deal with spaces in the French corpus. There are three possible spaces in French:

regular space (space)
non-break space (nbsp)
narrow non-break space (nnsp)

First of all, I've run some scripts to check the numbers, treating nnsp as the correct space (as many other people out there do). Here are the results:

Sign	Total	no space	space	nbsp	Corrected	Duplicates
?	48098	1947	32386	3246	37575 (78%)	180
!	16764	649	13143	207	13997 (83%)	38
:	1679	211	1273	31	1515 (90%)	1
;	1509	325	995	31	1351 (90%)	0

The small differences between Corrected and the sum of the sentences to be corrected are due to some sentences containing several kinds of spaces.
Also, my code does not handle a space in front of a long series of the same punctuation mark, like !!!, but those sentences are negligible in number.

In summary, running such a script would edit around 50 000 sentences and create around 200 duplicates that Horus will be able to eradicate.

However, I can guarantee this script only for the French corpus, as I wouldn't take responsibility for a language whose rules and typography I do not understand.
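
The script itself is not shown in this thread. As a rough sketch of the kind of normalization described, under assumptions of my own: the regular expression and function names below are hypothetical, nnsp is taken to be U+202F, a sign is only touched when it is followed by whitespace or the end of the sentence, and runs such as !!! are left alone, as in the description above.

```python
import re
from collections import Counter

NNSP = "\u202f"  # narrow no-break space, called "nnsp" in the post above

# Any run of space / nbsp / nnsp (possibly empty), then a single ? ! : ;
# The lookbehind skips signs inside a run such as "!!!"; the lookahead requires
# whitespace or end of string after the sign, so "10:30" is left alone.
BEFORE_SIGN = re.compile(r"(?<=[^\s?!:;])[ \u00a0\u202f]*([?!:;])(?=\s|$)")

def normalize(sentence):
    """Put exactly one narrow no-break space before each qualifying sign."""
    return BEFORE_SIGN.sub(NNSP + r"\1", sentence)

def created_duplicates(sentences):
    """Count the extra copies that appear once every sentence is normalized.

    Assumes the input sentences are all distinct before normalization.
    """
    counts = Counter(normalize(s) for s in sentences)
    return sum(n - 1 for n in counts.values() if n > 1)

corpus = ["Viens !", "Viens\u00a0!", "Viens!", "Quoi !!!", "Il est 10:30."]
print([normalize(s) for s in corpus])
print(created_duplicates(corpus))
```

One known gap in this sketch: a sign directly followed by a closing quotation mark is not matched, so such sentences would need separate handling.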

ckjpn · 2019-05-27T00:21:15Z

The longer this issue is postponed, the more likely problems like the following will occur.

Sentence #451458 is older, but Sentence #510709 has an audio file.

This will likely be a problem when the spaces before final punctuation in French are standardized so our duplicate-merging script can merge sentences and their translations.

https://tatoeba.org/eng/sentences/show/451458

It is sort of unfair to take ownership away from the original owner of a sentence that was contributed first and didn't need corrections.

If the older sentence was corrected and then matched the newer sentence, I don't feel it's a problem to give ownership to the owner of the newer sentence that was contributed error-free.

jiru · 2019-05-27T05:10:54Z

I don’t think losing the ownership of such a simple sentence is a big deal to begin with. If people really think it is, how about we make it so that Horus moves the audio to the first-contributed sentence?

ckjpn · 2019-11-30T01:24:45Z

This is still a problem. I wonder if it might not be time to prioritize this.

Related recent comment.
https://tatoeba.org/eng/sentences/show/1996169#comment-1143868

trang · 2019-11-30T16:34:04Z

We did not reach a consensus on having standard rules regarding spaces in French. This is the last thread I remember:
https://tatoeba.org/eng/wall/show_message/29619
(but I know there were other discussions before)

This means that by default, each contributor gets to decide how they want to do the spaces. And if we happen to have two contributors who have the same sentence with different spaces and firmly want to keep things their way, then so be it, we will consider these two sentences as distinct sentences and not duplicate ones.

It is currently not Horus' job to "fix" sentences to match a certain set of standard rules. Horus' job is only to do all the necessary tasks to merge exact duplicate sentences.

If we want to extend Horus' responsibilities to also take care of fixing sentences based on a specific set of rules, we could. But it's a new task that should be treated independently from merging duplicates. It could be done by another bot as well.

What I want to stress is that the rules that the bot would follow cannot be decided without a consensus. This is not something that one person will decide on their own.

I will not decide on my own that every French sentence should always use a non-breakable space before a question mark. Similarly, I will not decide on my own that every English sentence should always start with a capital letter. And I will not decide on my own that Japanese sentences should only use full width characters for numbers (for instance １ instead of 1). I think no one should be deciding such things on their own. It needs to be discussed and agreed by those who speak these languages. Before we implement any additional automatic rules, there has to be sufficient analysis and the results have to be documented.

So until someone takes the time to conduct an analysis and based on this analysis, comes up with a decision on what should be the rules for spaces in each language (or if there should be rules at all), we should just handle the present issue manually: reach out to contributors who have duplicate sentences due to space difference and ask them whether or not they care about the type of space they use in their sentences.

If both sides don't want to change, then do nothing.
If someone doesn't reply, then do nothing.
If one side is okay to change but the other side doesn't want to, then the person who is okay to change should edit their sentences, to match the space in the other duplicate sentence.
If both are fine with changing the space, then they can agree together which one they prefer and go for that one.
If they can't decide, let a French corpus maintainer decide.

Basically, just use common sense to resolve a conflicting situation. This does not require any change in the source code and if no one objects to that, then I will close this issue.

In the longer term, we can implement some sort of mass-editing so that people don't have to waste time fixing their sentences one by one.

We can also implement a way to resolve near-duplicate issues. This implies:

detecting near-duplicates
listing the near-duplicates for the corresponding contributors and letting them decide what they want to do
keeping track of the decision so that the sentences won't appear again in the listing
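
The three steps above can be sketched as follows. Nothing here reflects existing Tatoeba code: the function names, the comparison key, and the way decisions are stored (a set of already-reviewed groups of sentence ids) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def comparison_key(text):
    """Key that ignores which space is used, and whether one is used, before ? ! : ;"""
    text = re.sub(r"[ \u00a0\u202f]+", " ", text)  # unify the space variants
    return re.sub(r" (?=[?!:;])", "", text)        # drop the space before a sign

def near_duplicate_groups(sentences, decided=frozenset()):
    """List groups of sentence ids whose texts share a comparison key.

    sentences: dict mapping a sentence id to its text.
    decided:   set of frozensets of ids that contributors already reviewed,
               so those groups do not show up in the listing again.
    """
    by_key = defaultdict(list)
    for sentence_id, text in sentences.items():
        by_key[comparison_key(text)].append(sentence_id)
    return [
        sorted(ids)
        for ids in by_key.values()
        if len(ids) > 1 and frozenset(ids) not in decided
    ]

# Hypothetical ids and texts.
sentences = {1: "Viens !", 2: "Viens\u00a0!", 3: "Salut."}
print(near_duplicate_groups(sentences))
print(near_duplicate_groups(sentences, decided={frozenset({1, 2})}))
```

Recording a decision per group rather than per sentence means a group reappears in the listing if a new near-duplicate joins it later, which may or may not be the desired behavior.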

For these longer-term solutions, we can create new issues.

ckjpn · 2019-12-01T01:53:37Z

If this is not a matter that can be settled easily, I wonder if perhaps the Horus script could directly link French sentences that only differ in what space is used before the final punctuation.

Are there any cases where a difference in which space is used results in a different meaning?

If not, it would likely be useful to link such sentences, since people could at least see translations as indirect translations.

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations. This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.
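
To make this suggestion concrete, here is a small sketch of what giving both sentences all the translations would amount to. The data model (a dict from sentence id to the set of ids it is directly linked to) and the function name are assumptions for illustration, not how Horus or the Tatoeba database actually works.

```python
def share_translations(links, a, b):
    """Return a copy of `links` in which a and b are both linked to every translation of either.

    links: dict mapping a sentence id to the set of ids it is directly linked to.
    a, b:  ids of two sentences that differ only in spacing.
    """
    merged = (links.get(a, set()) | links.get(b, set())) - {a, b}
    result = {sentence_id: set(targets) for sentence_id, targets in links.items()}
    for sentence_id in (a, b):
        result.setdefault(sentence_id, set()).update(merged)
        for translation in merged:
            # Links are symmetric, so record the reverse direction too.
            result.setdefault(translation, set()).add(sentence_id)
    return result

# Hypothetical ids: 1 and 2 are the near-duplicates, 10 and 20 their translations.
links = {1: {10}, 2: {20}, 10: {1}, 20: {2}}
print(share_translations(links, 1, 2))
```

Note that this copies links blindly, without any check that each translation is a good one.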

trang · 2019-12-01T14:15:00Z

Spaces don't change the meaning and your suggestions are good, in my opinion.

I would see no issue linking two French sentences that differ only by a space and making the indirect translations direct ones on both sides.

But considering the number of sentences involved (unless the matter grew drastically since @Poulpisator posted his stats), I would suggest still handling this manually rather than automatically.

The linking of the two near-duplicate sentences could be done by anyone with linking permission, but the linking of the translations should be done by someone who speaks both languages. It would be an occasion to double check some translations.

I will mark this out-of-scope because I doubt it will ever be high priority enough to be implemented as an automated process.

ckjpn · 2019-12-01T22:55:04Z

From last week's download.

There are 2440 sentences in this file, so roughly half that number would be duplicates. Perhaps some have more than one duplicate.

http://tatoeba.byethost3.com/fra-very-near-duplicates-2019-11-30.txt

You didn't comment on this possibility.

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations. This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.

Wouldn't it make sense to attach all possible translations for each of these French sentences?

sacredceltic · 2019-12-01T23:05:36Z

CK, just mind your business. You have absolutely NO added value into this matter, once and for all !

ckjpn · 2019-12-01T23:35:43Z

I take it that sacredceltic doesn't see any value in attaching existing translations from one French sentence to another French sentence that has the same meaning, but only differs because of the space before the final punctuation.

I wonder if others share this view.

agrodet · 2019-12-02T01:01:25Z

As a record, let me put an answer I wrote when I was asked the following. This only reflects my personal experience and opinion.

How do major French news websites handle this?

Does each site do it the same way for all their articles?

Do a large majority of major news websites do it the same way?
How do major online French dictionaries handle this?
How about other major websites in French?

There are two main problems when it comes to online contributions:
1 - People don't know their own language.
2 - Keyboards, computers, and browsers are very bad at it, even in 2019, and that makes 1. worse.
Software like Word or LibreOffice automatically inserts a space when you type your punctuation if you set your paragraph to French. But browsers, and consequently every Internet tool like Gmail, Tatoeba contributions, your search engine, etc., do not. So some people write mail with the wrong typography to avoid strange output, like punctuation alone on the next line. Others know there is a space but the correct one can only be inserted by ALT+0155 on Windows and is simply impossible to input on a Mac, so they write a regular space.

This problem is reflected on websites. Dictionaries, by their core mission, do handle the thing correctly. Serious websites, like "Le Monde", also handle the thing pretty well (I guess they programmatically solve the problem but I do not know). Crappy websites, "la presse people" and others, don't care a bit. They often don't even use the right quotation marks. But again, part of the problem is that the French quotation marks do not appear directly on the keyboard, so people don't bother. However, it is possible to type them on Mac and Windows if you know the ALT code.

agrodet · 2019-12-02T01:26:07Z

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations. This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.

I disagree for several reasons. First of all, on Tatoeba we call links "translations". Many people may not care but I consider it a very important point, as it is one of the basic components of the tool (the other one being contributions, that we call "sentences").
As long as we call links "translations" I do not see any value in linking two identical sentences. I know that it is sometimes done (for variations) in the English corpus; for example, Mr Jack sleeps. and Mr. Jack sleeps. are linked. But in the French corpus there is a kind of tacit tendency that goes the opposite way. Somewhat surprisingly, even synonymous sentences are not so often linked. However, we very often link sentences containing uncommon vocabulary, regionalisms, or idiomatic expressions to their synonyms.
For example, we wouldn't link J'aurai ta peau and Je vais te buter, and we wouldn't link Je veux pas. and Je ne veux pas., but maybe we would link Y en a pas bézef. and Il n'y en a pas beaucoup.

Of course, I say "we" but that does not include anyone really, it is just a tendency that I have noticed from experience. However, some people do link their own synonymous contributions. That is their choice, and there is no problem in that. But linking synonymous sentences would deeply change the French corpus (in my personal opinion, in a bad way), because then one could ask "What about Tu es mort, Tu es morte, Vous êtes mort, Vous êtes morte?" and that would be quite something...

But let me take the problem from the other side. @ckjpn said

This way, it wouldn't matter which French sentence someone was looking at since they would see all available translations.

Let's suppose that Viens! and Viens ! are both available.
User1 contributes Come! to Viens!.
Of course, we could see the Viens ! => => Come! indirect link if we were to link the two French sentences. But my view is different, because eventually
User2, or User1 for that matter, will contribute Come! to Viens ! and Horus will then merge the two Come! sentences, making it a DIRECT translation of both French sentences, as it should be. Isn't the problem solved then?

I know that it will work because I have done it SO many times on your own sentences. "that" "this" "it", "I know that..." "I know that that..." So many patterns that end up as the same French sentence. If I translate "in order", then the group of sentences likely appears on the same page, so I can link directly by the sentence number. But if I translate "at random", I contribute the second or third identical translation days later, and Horus merges it. Of course, during those few days Synonym1 does not see my French translation of Synonym2, but in the long run, they all end up with their translations.

PaulPeer · 2019-12-02T11:21:50Z

There are 2440 sentences in this file, so roughly half that number would be duplicates. Perhaps some have more than one duplicate.

http://tatoeba.byethost3.com/fra-very-near-duplicates-2019-11-30.txt

Very interesting list. I wonder if you could make a similar one for dashes (hyphen, – and —) for Dutch.

Wouldn't it make sense to attach all possible translations for each of these French sentences?

I have been doing this manually for many years, but I doubt that making automatic scripts is a good idea. Regularly publishing lists, and sending them to the CMs of the languages concerned, is a better idea IMHO.

Poulpisator · 2019-12-02T14:17:36Z

But do you solve any problem doing that, except your own self preference?
Really, you (CK, Paul and others who think like you) and I (and others who think like me) seem to be unable to agree on this, so either there is a flaw in your thinking, a flaw in mine, or in both. So it might be a good idea to seriously discuss it. Can you give us two arguments for why your idea would be a good one? I gave some of mine above, and I could give more.

But let me express something else here. From a user point of view, I cannot see any problem you're solving with the solution you propose. I'll use simple sentences to illustrate.
Let us say we have a Come! <=> Viens ! link but not yet a Come! <=> Viens! link (I put no space for the sake of illustration, but suppose it is just two different spaces).
Situation 1
Suppose the user searches for the translation of Come!. He will then see Viens ! as a direct translation. He will not see Viens! but as I mentioned above, that is only a temporary problem, as the link Come! <=> Viens! will eventually appear. But for the sake of argument, let us suppose that you do create a link Viens! <=> Viens !. What is the gain here? When you look at Come!, you will only see one more indirect translation, Viens!. But there was Viens ! already. I think the added value is near zero. The real added value lies in the indirect translations of Viens!, but by design those are not displayed. Hence, I do not think this solution brings any value to situation 1.

Situation 2
Suppose the user searches for viens or a similar form. Ah! Now we have a somewhat better added value, because when the user looks at Viens!, he will see Viens ! as a direct translation and therefore all of the direct translations of Viens ! as indirect translations. Nice... Except that no user would end up on the page of Viens! directly from a search. He will first see the list of results. And although he may see Viens! first and click on it, ignoring Viens !, he will just have to go to the next result of the search to see Viens !. Hence I see very little added value for situation 2 (but I do admit that there is more than in situation 1).

Situation 3
The user downloads Tatoeba data and uses it in an external tool. If the user uses raw data without parsing it first, the user is a fool. That's a very basic thing to do. I will probably repeat it until I die, but the tools adapt to the source of data, not the other way around. And I know that at least one developer agrees with this idea. Can you imagine going to Twitter or Facebook and saying "Hey! Your formats aren't that good for me, update your API!"? They would laugh at you (and then, they would ask what your problem is and propose a solution ^^). Therefore, if your solution is meant to facilitate clusters, then the user needs to be a better programmer / designer. I see no added value here. Worse, I see harm done to the data source to please some fools.

Please let us know what situations or problems your solution solves.

Then of course, if you're well versed in the art of debate, you will ask me what the added value is of the non-solution I do not propose. Well, as I wrote above, I think doing what you suggest contradicts the basic design bricks of Tatoeba. Of course, that is only my personal opinion, and I'm waiting for someone with a different opinion to bring arguments to the table. This contradiction is enough for me to keep things as they are until the U.I. problem is solved and a solution for inputting / replacing space by thin space is decided (really, if the U.I. problem is solved, I think most contributors will agree to a two-way policy: no space or thin space).

To summarize, I think the solution you propose solves a non-problem, that it relates to a very small part of the corpus, and that it is more to satisfy a personal point of view than to really address any issue.

(PS: The agrodet above and me are just two faces of the same coin... Yeah, I mistook accounts.)

PaulPeer · 2019-12-02T15:46:29Z

Situation 2
Suppose the user searches for viens or a similar form. Ah! Now we have a somewhat better added value, because when the user looks at Viens!, he will see Viens ! as a direct translation and therefore all of the direct translations of Viens ! as indirect translations. Nice... Except that no user would end up on the page of Viens! directly from a search. He will first see the list of results. And although he may see Viens! first and click on it, ignoring Viens !, he will just have to go to the next result of the search to see Viens !. Hence I see very little added value for situation 2 (but I do admit that there is more than in situation 1).

You are right if someone uses the search engine. But what about translators who browse and translate sentence by sentence, like me and like many others? Let's imagine that there is a complicated sentence with a " !" at the end. I start translating it and only after I have finished do I see that somebody else made the effort of translating the near duplicate. If the near duplicate is linked, there is no waste of time.

sacredceltic · 2019-12-02T18:43:01Z

Dear Poulpisator,
I am appalled by the amount of energy spent desperately trying to get a handful of obsessive-compulsives totally devoid of common sense to understand these non-problems.
At least it has opened my eyes to your tenacity and clear-sightedness. At last, someone who understands the problem and its non-implications, after several years! I suddenly feel less alone...
Thank you!

sacredceltic · 2019-12-02T18:52:58Z

@PaulPeer

What is the probability that 2 different users translate the same duplicate (which again is a tiny fraction of the corpus...) at the same time ?
Next time you have imaginary problems of that size, consult a shrink and leave the rest of us in peace, please...

PaulPeer · 2019-12-02T19:05:22Z

Next time you have imaginary problems of that size, consult a shrink and leave the rest of us in peace, please...

I asked CK to make a script for Dutch. Not for your language. So please shut up.

sacredceltic · 2019-12-02T19:08:41Z

Except the title of this thread is about « ...spaces in French sentences »
Just get out of this thread, you fool !

sacredceltic · 2019-12-02T19:23:48Z

I’m going to summarise the dual problem now :

we’ve got a REAL problem: French sentences, when they are correctly created, are incorrectly displayed. And NOBODY addresses this problem. Nobody is interested. Particularly our non-French admins, and especially those who were responsible for imposing the current display of sentences.
we have an IMAGINARY problem of duplicates, which is actually caused, at least partly, by the REAL problem above, and which has been obsessing the same non-French admins, impeding their sleep for so many years.

So guess which one we’re going to address first ?

trang · 2019-12-03T00:10:42Z

First, unless I missed something, no one was even aware of the issue that non-breakable spaces are not properly displayed on iPhones until a couple of days ago. I mean, the GitHub issue (#2026) was opened only two days ago. Saying that nobody cares is a bit too dramatic.

Second, @Poulpisator has a good point which I forgot about. Linking two sentences of the same language makes sense on an abstract level, but from the UI point of view, we label linked sentences as "Translations" (it's very explicit in the new sentence design) and it is indeed confusing to have a sentence being defined as "translation" of another sentence in the same language. So I take back what I said earlier: "I would see no issue linking two French sentences that differ only by a space". There is actually an issue, which is a general issue with same-language linking. And I agree that before we think about linking more sentences in the same language, we should first think about how we display and label linked sentences of the same language. The suggestion made in #1902 is a possible solution.

Lastly, to be clear about this suggestion:

Another possible solution, perhaps, would be to adapt the duplicate-merging script to link all translations of French sentences that only differ in this space, so that both sentences get all the translations.

As I've said, I doubt it will ever be high priority enough to be implemented as an automated process. So no, we will not adapt the duplicate-merging script for that. It's not worth the effort and as @Poulpisator has described it, we can let sentences get linked to each other in an organic way (through human contributions), rather than in an automatic way (through a bot/script).

And the more I think about it, the more I feel we're better off that way. It may be less productive to handle these near-duplicates manually, but it should lead to better quality in the end. A bot or a script will never ask itself if the sentences are really good translations or not, before linking them. There could be nonsense translations and they will be linked without question. A human will (or at least should) put some thoughts into it.

We should really move towards a mindset where we embrace duplicates as something that can help us improve the quality of the corpus, rather than just seeing it as something that pollutes the corpus and wastes our time.

Poulpisator · 2019-12-03T02:50:53Z

We should really move towards a mindset where we embrace duplicates as something that can help us improve the quality of the corpus, rather than just seeing it as something that pollutes the corpus and wastes our time.

Alleluiah, pray the sun, my boys!! \o/ Can we write this in rainbow colors in the top menu of Tatoeba? ^^

@sacredceltic A significant part of my life consists of getting people to understand that their design is far from correct when they refuse to see it. Tatoeba is a bit like my training room for real life :)

@PaulPeer Okay, that's one argument for. Is there a second one? Although I could say that you're making one more assumption when you say Let's imagine that there is a complicated sentence. If we have a look at CK's file, we would realize that there aren't so many such sentences. Actually, I think the workflow leading to such a situation is of pretty low probability.
Besides, you say it is a waste of time. I say this is an opportunity. Especially if it is a complex sentence, as you assume. If complexA and complexB are given, I'm pretty sure that two different users would translate them into different sentences. I even think that the same user would not translate them the same way on two different days. So now suppose that we have
complexA <=> complexB <=> translationA
When you explore sentences to translate and find complexA, you see translationA as a good enough translation, link it, and go to the next sentence to translate. If you do that, we now have only one translation for one (unique except for spacing) complex sentence. Whereas if this complexA <=> complexB link did not exist, when you found complexA, you would make the effort to provide a new sentence, and then we would have TWO translations for one (unique) complex sentence.

With your own assumption, I arrive at a very different conclusion. I cannot see any case where "wasting" time providing translations of complex sentences is harmful to the project :) Better, I cannot see any case where it is not beneficial to contribute translations of complex sentences.

As TRANG said, embrace the duplicates, don't fight them, let them flow through you, seize their power! :D

PaulPeer · 2019-12-03T06:28:57Z

Second, @Poulpisator has a good point which I forgot about. Linking two sentences of the same language makes sense on an abstract level, but from the UI point of view, we label linked sentences as "Translations" (it's very explicit in the new sentence design) and it is indeed confusing to have a sentence being defined as "translation" of another sentence in the same language.

OK. That is clear. But do all advanced contributors and CM's know about this? I doubt it very much. Just one example: The sentences "I know you know" and "I know that you know" and hundreds of similar ones have been linked together. Maybe in the FAQ the article about linking should be more clear?

sacredceltic · 2019-12-03T06:51:35Z

I had already reported this issue at least 4 years ago https://tatoeba.org/fra/wall/show_message/24679#!%23message_24679

trang · 2019-12-03T23:45:33Z

@PaulPeer

But do all advanced contributors and CM's know about this?

Probably not. In itself it is not a huge problem either if people aren't aware and link sentences in the same language. It's not killing anyone. We will eventually implement a solution to the problem of same-language sentences being labelled as "Translations". But until the solution is implemented, there's no need to add to the problem if you're aware it's a problem, unless it is to solve another, much worse problem.

@sacredceltic

I had already reported this issue at least 4 years ago https://tatoeba.org/fra/wall/show_message/24679#!%23message_24679

You didn't exactly report any issue there. You only mentioned that the font used could be a problem. The problem that non-breakable spaces are invisible specifically with the default sans-serif font on iPhones was not a known issue until recently. I personally don't have an iPhone, so unless someone actually tells me that something looks wrong on iPhone, I wouldn't know. I don't think it's intuitive to think that the default sans-serif font on a major operating system doesn't display spaces correctly...

trang · 2019-12-15T23:58:09Z

Closing this now.

Summary:

We won't automatically merge sentences that differ only by a space. These sentences will have to be dealt with manually.
We can consider providing a mass-edit feature for a contributor to convert all the spaces in their sentences instead of having to edit each sentence one by one.
We can consider automatically converting spaces, but we will need to agree on what to do exactly.
We'll work on ensuring that non-breakable spaces are properly displayed (Non-breakable spaces not visible on iPhones with default sans-serif font #2026).
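
To make the second and third points more concrete, here is a minimal sketch in Python of what converting spaces, and listing pairs for manual review, could look like. It is not Horus and not existing Tatoeba code. The function names, the set of space characters, and the choice of U+00A0 as the default target are all assumptions for illustration, and which of the two example sentences holds which kind of space is assumed too.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Space-like characters that may appear before French punctuation (assumed set):
# U+0020 SPACE, U+00A0 NO-BREAK SPACE, U+202F NARROW NO-BREAK SPACE, U+2009 THIN SPACE
SPACE_VARIANTS = "\u0020\u00a0\u202f\u2009"

# One or more of those characters directly followed by ? ! ; or :
_SPACE_BEFORE_PUNCT = re.compile(f"[{SPACE_VARIANTS}]+(?=[?!;:])")


def convert_spaces(text: str, target: str = "\u00a0") -> str:
    """Replace any run of space variants before ? ! ; : with `target`."""
    return _SPACE_BEFORE_PUNCT.sub(target, text)


def comparison_key(text: str) -> str:
    """Key used only for comparison; the stored sentence is left untouched."""
    return convert_spaces(text, " ")


def candidate_groups(sentences: Dict[int, str]) -> List[List[int]]:
    """Group sentence ids whose text differs only by the space before punctuation."""
    groups = defaultdict(list)
    for sentence_id, text in sentences.items():
        groups[comparison_key(text)].append(sentence_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    # Hypothetical data: the ids come from this thread, the space assignment is assumed.
    sample = {
        3184: "Pourquoi demandes-tu ?",
        3951157: "Pourquoi demandes-tu\u00a0?",
    }
    print(candidate_groups(sample))
```

A mass-edit feature would call something like `convert_spaces` on a contributor's own sentences with whichever target space is agreed on. `candidate_groups` only produces a list for a human to go through, in line with the decision not to merge automatically.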

ckjpn · 2023-06-28T17:42:37Z

There is another discussion about this on the Wall.
https://tatoeba.org/en/wall/show_message/39979#message_39979

ckjpn added the "bug" label (issue that describes a problem with a feature that doesn't work as expected) on Sep 13, 2015

trang added the "enhancement" label (issue that describes a problem that requires a change in the current functionalities of Tatoeba) and removed the "bug" label on Sep 13, 2015

trang mentioned this issue Apr 20, 2017

Some sentences get duplicated even with Horus #1469

Closed

trang added the "unclear" label (the issue, its scope or the goal are not clearly identified) and removed the "enhancement" label on Nov 30, 2019

trang added the "out-of-scope" label (issue that we decided we won't do and will close within a few weeks) and removed the "unclear" label on Dec 1, 2019

trang mentioned this issue Dec 15, 2019

have a way to detect duplicates which differ only in terms of script #76

Closed

trang closed this as completed Dec 15, 2019

agrodet mentioned this issue Jan 27, 2020

When displaying linked translations, move sentences in the same language to the top. #773

Closed

trang mentioned this issue Jan 27, 2020

It's innacurate to label same language sentences as "Translations" #2107

Open

trang mentioned this issue Feb 12, 2020

Merge sentences that differ only in punctuation and spacing #642

Closed

AndiPersti mentioned this issue Apr 4, 2020

Linefeeds in sentence text are breaking the weekly export script #2250

Closed

jiru mentioned this issue Oct 6, 2020

Auto-transcribe sentences in Lingua Franca Nova's Latin orthography to Cyrillic, and vice versa #1958

Open

ckjpn mentioned this issue Jun 30, 2023

Horus, our duplicate-merging script, seems to be missing some duplicates. #3060

Open
