Search is finding sentences that do not contain the specified word #1944

Open
alanfgh opened this issue Aug 23, 2019 · 13 comments

Comments

@alanfgh
Contributor

commented Aug 23, 2019

Search is finding sentences that do not contain the specified word. This was reported 24 days ago (see link 1 below), and @Gillux made a change that he thought might have fixed the problem, but apparently it didn't. For instance, a general search for "Tom" in English (see link 2 below) brings up these sentences:

Damn.
#1078143

Beat it.
#37902

Damn you!
#1135061

We talked.
#2107672

How unfortunate!
#2111810

These sentences are owned by a variety of people, and the logs don't reveal anything that suggests that they ever contained the word "Tom". Nor do all the sentences contain a tag, or audio. However, they are all short. If I set the sort order to random or to longest sentences first, I don't see any false hits.

Could it be that the search engine has seen so many sentences with "Tom" that it now hallucinates them even in sentences that don't contain the word?

I reported this on the Wall as well:

https://tatoeba.org/eng/wall/show_message/32491#message_32491

Link 1: https://tatoeba.org/eng/wall/show_message/32265#message_32265

Link 2: https://tatoeba.org/eng/sentences/search?query=tom&from=eng&to=none&orphans=no&unapproved=no&user=&tags=&list=&has_audio=&trans_filter=limit&trans_to=und&trans_link=&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=words

@trang added the bug label Aug 23, 2019

@trang

Member

commented Aug 24, 2019

@trang

Member

commented Aug 26, 2019

The problem re-emerged. Brauchinet reported the issue with the search "islamophobia".

Below is the date and time of creation of each sentence in the results.

MariaDB [tatoeba]> select id, created, text from sentences where id in (7851749, 8113164, 8130106, 8130110, 8130111, 8130121, 8130123, 8130214) order by id;
+---------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id      | created             | text                                                                                                                                                                                              |
+---------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 7851749 | 2019-04-14 18:18:12 | Everyone knows the separatists for their rabid racism and Islamophobia.                                                                                                                           |
| 8113164 | 2019-08-17 23:57:10 | Both Ilhan and Rashida strongly oppose all forms of racism, tribal quarrels and racial hatred, whether Islamophobia, anti-Semitism or the xenophobic detention and deportation of asylum-seekers. |
| 8130106 | 2019-08-25 16:10:50 | Islamophobia is a problem that's on the rise.                                                                                                                                                     |
| 8130110 | 2019-08-25 16:13:17 | Islamophobia is a real problem.                                                                                                                                                                   |
| 8130111 | 2019-08-25 16:14:10 | A group of skinheads gathered near the mosque.                                                                                                                                                    |
| 8130121 | 2019-08-25 16:15:33 | Is this a racist statement?                                                                                                                                                                       |
| 8130123 | 2019-08-25 16:15:45 | Let's study Islamic history.                                                                                                                                                                      |
| 8130214 | 2019-08-25 16:50:07 | A driver blocked the intersection.                                                                                                                                                                |
+---------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.00 sec)

Below are the sentences that should normally have been in the results:

MariaDB [tatoeba]> select id, created, text from sentences where text like '%Islamophobia%' order by id;
+---------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id      | created             | text                                                                                                                                                                                              |
+---------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 7851749 | 2019-04-14 18:18:12 | Everyone knows the separatists for their rabid racism and Islamophobia.                                                                                                                           |
| 8113164 | 2019-08-17 23:57:10 | Both Ilhan and Rashida strongly oppose all forms of racism, tribal quarrels and racial hatred, whether Islamophobia, anti-Semitism or the xenophobic detention and deportation of asylum-seekers. |
| 8130096 | 2019-08-25 16:09:02 | This talk is about Islamophobia.                                                                                                                                                                  |
| 8130097 | 2019-08-25 16:09:11 | What is Islamophobia?                                                                                                                                                                             |
| 8130099 | 2019-08-25 16:09:20 | Islamophobia is dangerous.                                                                                                                                                                        |
| 8130106 | 2019-08-25 16:10:50 | Islamophobia is a problem that's on the rise.                                                                                                                                                     |
| 8130110 | 2019-08-25 16:13:17 | Islamophobia is a real problem.                                                                                                                                                                   |
| 8130181 | 2019-08-25 16:31:11 | Islamophobia is a danger.                                                                                                                                                                         |
+---------+---------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
8 rows in set (5.29 sec)
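
The two result sets above can also be diffed mechanically. A minimal Python sketch, with the IDs copied by hand from the two queries (nothing here talks to the database or to the search engine):

```python
# IDs returned by the search for "islamophobia" (first query above).
search_hits = {7851749, 8113164, 8130106, 8130110,
               8130111, 8130121, 8130123, 8130214}

# IDs whose text matches LIKE '%Islamophobia%' (second query above).
expected = {7851749, 8113164, 8130096, 8130097,
            8130099, 8130106, 8130110, 8130181}

# Sentences the search returned although they do not contain the word.
false_hits = sorted(search_hits - expected)
# Sentences that contain the word but are absent from the search results.
missing = sorted(expected - search_hits)

print("false hits:", false_hits)
print("missing:   ", missing)
```

This gives four false hits (8130111, 8130121, 8130123, 8130214) and four missing sentences (8130096, 8130097, 8130099, 8130181), all of them created on 2019-08-25.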

I ran indextool --check eng_main_index and there are 90 failures reported.

@alanfgh

Contributor Author

commented Aug 27, 2019

Did you see brauchinet's observation that each sentence incorrectly found in the search results is five English sentences away from a sentence that should have been found? I found the same thing.

@trang

Member

commented Aug 27, 2019

Yes, I saw it, but this is not a systematic pattern. He mentioned that it doesn't work for the sentence "Kiev is the capital of Ukraine." in the Colombia search.

I found out that "I live in Colombia" was the sentence being replaced by "Kiev is the capital of Ukraine". So I chained the searches, using the wrong search result as my next search. This is the chain I obtained:

ID       Sentence
7871467  I live in Colombia.
7909051  Kiev is the capital of Ukraine.
7967473  I wonder whose this is.
7973986  How much are a dozen eggs?
7984831  Why does Tom want to live in Boston?
8009797  Mennad doesn't dress this way.
8034851  Mennad farted again.
8047791  Mennad is so sweet.
8098829  Tom pooped his pants at his sister's wedding.
8111396  I will never eat meat again.
8117954  At four o'clock he went to the barber.
8127465  The privacy settings on your smartphone allow this application to track your whereabouts.

From what I've checked, the gap between each sentence is not consistent, even if you only count English sentences. The only thing that seems consistent is that older sentences are being replaced by more recent sentences. What triggers this replacement is still a mystery.
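
As a rough illustration of how uneven the chain is, here is a small Python sketch that computes the ID gap between consecutive sentences in the chain. Two assumptions: raw ID gaps count sentences in all languages (not only English), and a higher ID is taken as a proxy for a more recent sentence.

```python
# Sentence IDs of the chain, in the order listed above.
chain = [7871467, 7909051, 7967473, 7973986, 7984831, 8009797,
         8034851, 8047791, 8098829, 8111396, 8117954, 8127465]

# ID distance between each sentence and the one that replaced it.
gaps = [newer - older for older, newer in zip(chain, chain[1:])]

# True if every replacement has a higher ID than the sentence it replaces.
always_newer = all(gap > 0 for gap in gaps)

print("gaps:", gaps)
print("min/max gap:", min(gaps), max(gaps))
print("always replaced by a higher ID:", always_newer)
```

The gaps range from 6513 to 58422, and every one of them is positive, which fits the observation that the replacing sentence is always the more recent one.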

Also, it seems only English is affected. I'm not sure what's wrong with English specifically... The other languages that are in the top 15 do not have a corrupted index. I also randomly checked some of the less popular languages and didn't find any that had a corrupted index.

@jiru

Member

commented Aug 28, 2019

@brauchinet


commented Sep 4, 2019

Interestingly, every single word in the manticore_indextool failure list is contained in one of these six consecutive English sentences:

#8131026 I want to become a famous artist.
#8131027 Tom is getting on his horse right now.
#8131028 That's the wrong way to do that.
#8131029 Tom is getting ready to take a shower.
#8131030 I want to become a famous singer.
#8131053 The farmer feeds alfalfa to his dairy cattle. (2019-08-26 03:58:01)

@trang

Member

commented Sep 12, 2019

As suggested by gillux, I have disabled the index merge. I have also rebuilt all the indexes again, so the search results are fixed once more.

I will manually trigger the index merge this weekend and see if there are error messages in the process. If wrong search results still show up before then, it means the problem is already present in the delta indexes.

@brauchinet


commented Sep 13, 2019

I just checked some searches at random:

  • Islamophobia from the search above has only 7 results now (8130181 is missing).
  • Some of the sentences from my previous comment are not found (e.g. 8131053, which is notable for its very specific vocabulary, and 8131029).
@trang

Member

commented Sep 15, 2019

So something wrong is happening already before the index merge.

I actually noticed that searching for all sentences returns a count higher than the actual number of sentences. The difference is roughly 11k.

  • 7811029 occurrences in the search
  • 7799893 sentences from the stats

I'm starting a full reindexation again. I have disabled both delta indexing and index merge. I will check whether the number of occurrences is correct. It should match these stats: https://gist.github.com/trang/eb29e33b5133b6179739554a491c8df9

@jiru

Member

commented Sep 15, 2019

@trang

Member

commented Sep 16, 2019

From what I've checked, the number of occurrences returned by the search is always higher than it should be. The more sentences a language has, the larger the difference.

Lang  Search count  DB count (cf. gist)
eng   1231886       1220425
ita   732780        729476
rus   719079        713691
tur   679226        674888
epo   606112        602893
deu   479222        475826
fra   402139        398937
por   337983        335893
spa   313305        311332
swe   35259         35033
hin   12489         12091
tuk   6753          6724
ilo   2462          2454
nno   1332          1325
tha   1190          1190
gla   1005          1005
sah   926           926
uzb   680           680
nov   311           311

There doesn't seem to be any issue below 1000 sentences. Not sure if something wrong is happening during the main indexation or if the way Manticore counts isn't reliable for high numbers of occurrences.
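
To make the size of the discrepancy easier to see, here is a short Python sketch that computes the surplus (search count minus db count) per language. The numbers are copied from the table above; the variable names are only for illustration.

```python
# (search count, db count) per language, copied from the table above.
counts = {
    "eng": (1231886, 1220425), "ita": (732780, 729476),
    "rus": (719079, 713691),   "tur": (679226, 674888),
    "epo": (606112, 602893),   "deu": (479222, 475826),
    "fra": (402139, 398937),   "por": (337983, 335893),
    "spa": (313305, 311332),   "swe": (35259, 35033),
    "hin": (12489, 12091),     "tuk": (6753, 6724),
    "ilo": (2462, 2454),       "nno": (1332, 1325),
    "tha": (1190, 1190),       "gla": (1005, 1005),
    "sah": (926, 926),         "uzb": (680, 680),
    "nov": (311, 311),
}

# How many more results the search reports than the database holds.
surplus = {lang: search - db for lang, (search, db) in counts.items()}

# Languages for which both counts agree exactly.
exact = [lang for lang, extra in surplus.items() if extra == 0]

for lang, extra in surplus.items():
    print(f"{lang}  surplus={extra}")
print("exact match:", exact)
```

In this table the surplus is never negative; it is 11461 for eng and drops to zero for tha, gla, sah, uzb and nov.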

@AndiPersti

Contributor

commented Sep 17, 2019

How did you obtain the numbers in "search count"?

It seems to me the counting in the search code doesn't work:
If I use a simple search for Ilocano I get 2461 occurrences.
If I change the sort order to "Last created first" I get 2666 occurrences.
If I change the sort order to "Last modified first" I get 2465 occurrences.
And the sort order "Random" is really random for me (I get 2465, 2467, 2468, ...) if I reload the page.

There is also a difference when the sort order is "Relevance" and I toggle the direction using the "Reverse order" checkbox (2461 and 2466).

What numbers do you get when you query the Manticore tables directly (e.g. echo "select count(*) from ilo_main_index;" | mysql -h0 -P9306)?

Edit: I've just checked on dev.tatoeba.org and the problem is there too. So I'm pretty sure the counting issue is independent of Manticore/indexing.
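
For checking several languages in one go, the direct query above could be wrapped in a small script. This is only a sketch under a few assumptions: the `mysql` client is installed, Manticore listens on port 9306 as in the command above, and the delta index is named `{lang}_delta_index` by analogy with `{lang}_main_index`. The helper names `count_query` and `run_count` are made up for this example.

```python
import subprocess


def count_query(lang: str, kind: str = "main") -> str:
    """Build the count query for one language index (kind: 'main' or 'delta')."""
    if not lang.isalpha():
        raise ValueError(f"unexpected language code: {lang!r}")
    if kind not in ("main", "delta"):
        raise ValueError(f"unexpected index kind: {kind!r}")
    return f"select count(*) from {lang}_{kind}_index;"


def run_count(lang: str, kind: str = "main", host: str = "0", port: int = 9306) -> int:
    """Run the query through the mysql client and return the count."""
    result = subprocess.run(
        ["mysql", f"-h{host}", f"-P{port}", "-N", "-e", count_query(lang, kind)],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


# Example (needs a running Manticore instance):
# for lang in ("eng", "ilo"):
#     print(lang, run_count(lang, "main"), run_count(lang, "delta"))
```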

@trang

Member

commented Sep 17, 2019

> How did you obtain the numbers in "search count"?

From the advanced search, with "Is orphan" and "Is unapproved" both set to "Any". In other words, using this URL, but changing from=und to from={lang}.

> What numbers do you get when you query the Manticore tables directly (e.g. echo "select count(*) from ilo_main_index;" | mysql -h0 -P9306)?

I get 2454 for the count on ilo_main_index.

I ran the queries on the main and delta indexes for some of the languages. The numbers are pretty close to the db count for the *_main_index. The difference between the search count and the db count seems to match, more or less, the count of the delta_index.

Lang  Search count  DB count  main_index  delta_index
eng   1231886       1220425   1220423     11556
ita   732780        729476    729480      3332
rus   719079        713691    713700      5433
tur   679226        674888    674901      4347
epo   606112        602893    602898      3249
ilo   2462          2454      2454        21
uzb   680           680       680         6
nov   311           311       311         3

I'm a bit confused why the delta_index count is not at zero though. I tried running bin/cake sphinx_indexes merge to see if it would change anything, but it didn't.

I then ran bin/cake sphinx_indexes update delta. The main_index, delta_index and search count changed as follows.

Lang  Search count  DB count  main_index  delta_index
eng   1220884       1220425   1216163     4721
ita   729495        729476    727734      1761
rus   714302        713691    711696      2606
tur   675161        674888    673108      2053
epo   606112        602893    601670      1584
ilo   2454          2454      2444        10
uzb   680           680       677         3
nov   311           311       310         1
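
One way to read the two tables is to check how far the search count is from main_index + delta_index. A minimal Python sketch, with the numbers copied from the tables above (the variable and function names are made up for this example):

```python
# (search count, main_index, delta_index) before "update delta".
before = {
    "eng": (1231886, 1220423, 11556), "ita": (732780, 729480, 3332),
    "rus": (719079, 713700, 5433),    "tur": (679226, 674901, 4347),
    "epo": (606112, 602898, 3249),    "ilo": (2462, 2454, 21),
    "uzb": (680, 680, 6),             "nov": (311, 311, 3),
}

# Same columns after "bin/cake sphinx_indexes update delta".
after = {
    "eng": (1220884, 1216163, 4721),  "ita": (729495, 727734, 1761),
    "rus": (714302, 711696, 2606),    "tur": (675161, 673108, 2053),
    "epo": (606112, 601670, 1584),    "ilo": (2454, 2444, 10),
    "uzb": (680, 677, 3),             "nov": (311, 310, 1),
}


def residual(row):
    """Search count minus the sum of the main and delta index counts."""
    search, main, delta = row
    return search - (main + delta)


res_before = {lang: residual(row) for lang, row in before.items()}
res_after = {lang: residual(row) for lang, row in after.items()}

# Languages whose search count is not exactly main + delta after the update.
mismatch_after = [lang for lang, r in res_after.items() if r != 0]

print("before:", res_before)
print("after: ", res_after)
print("mismatch after update:", mismatch_after)
```

Before the update, the search count is slightly below main + delta for every language (for example by 93 for eng). After the update, it equals main + delta exactly for every listed language except epo, whose search count (606112) is the same in both tables. That would be consistent with the search simply adding up both indexes, although the tables alone do not prove it.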