Split unsplit export files by language #2360

Merged
16 commits merged into Tatoeba:dev on Jul 16, 2020

Conversation

LBeaudoux
Contributor

I am currently working on a Python library that will help developers to set up their Tatoeba-related projects in just a few minutes. I would like to help users to download only the data related to the languages they are interested in. Thanks to AndiPersti it is already possible with the 'sentences', 'sentences_detailed', 'sentences_CC0' and 'transcriptions' tables. But none of the other CSV files is downloadable in a monolingual version.

In the case of 'user_languages', I assume this is due to the fact that a few multi-line rows make the CSV file difficult to read. I built a custom CSV reader to overcome this issue.

Since the other files don't have any language column, it is harder to split them. But it is doable if we map the sentence id of each row to its corresponding language. I wrote a script that splits files using this method.
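
For illustration, the core of that approach looks roughly like this (a simplified sketch, not the actual split_files.py; it assumes the usual tab-separated export layouts, i.e. sentences.csv columns (id, lang, text) and links.csv columns (sentence_id, translation_id)):

import csv
from collections import defaultdict

def build_lang_index(sentences_path):
    # Map every sentence id to its language using sentences.csv (id, lang, text).
    index = {}
    with open(sentences_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            index[row[0]] = row[1]
    return index

def split_links(links_path, lang_index, out_dir="."):
    # Group links.csv rows (sentence_id, translation_id) by language pair,
    # buffering them in memory, then write one <src>-<tgt>_links.tsv file per pair.
    buckets = defaultdict(list)
    with open(links_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            pair = "{}-{}".format(lang_index.get(row[0], "unknown"),
                                  lang_index.get(row[1], "unknown"))
            buckets[pair].append(row)
    for pair, rows in buckets.items():
        with open("{}/{}_links.tsv".format(out_dir, pair), "w", newline="") as out:
            csv.writer(out, delimiter="\t").writerows(rows)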

Running these extra splits consumes about 1.5 gigabytes of memory for about five minutes. I hope it won't be a major problem. I could not locally test this commit since my virtual machine only has 1 gigabyte of RAM. But other than that, the script seems to be working.

I really hope that this pull request will be approved. In my opinion, splitting these files makes way more sense on the server side than on the client side.

@jiru
Member

jiru commented Jun 2, 2020

Thanks for bringing up this issue. But I’m not sure I understand. Could you elaborate on the benefits of having these extra files properly split?

About the file size, these extra files are currently all under 100MB so I assume it shouldn’t be a big problem.

About the fact that these extra files include data related to languages you are not interested in, I’d argue that to make use of them, you have to match them with sentences.csv anyway, which has monolingual versions, so I’m not sure it actually helps.

That said, I’ve never really used the CSV files, so I might not be aware of all their shortcomings. That’s why I’m asking you some more practical details to better understand the problem you’re trying to solve.

Running these extra splits consumes about 1.5 gigabytes of memory for about five minutes. I hope it won't be a major problem.

It’s not a major problem, but it looks quite big for the task. Technically speaking, wouldn’t it be much more efficient to produce these new files directly from the database, instead of parsing the CSV files? What about rewriting this as a PHP component of Tatoeba, or some pure SQL query if it’s simple enough?

@AndiPersti
Contributor

AndiPersti commented Jun 2, 2020

I've set up my VM with 3 GB of RAM, so I tried to run the script.

The first thing I noticed is an error:

Traceback (most recent call last):
  File "docs/cron/split_files.py", line 301, in <module>
    split_file(fp, language_index, out_dir)
  File "docs/cron/split_files.py", line 27, in split_file
    DataFile(file_path).split(**split_params)
  File "docs/cron/split_files.py", line 77, in split
    for row in self:
  File "docs/cron/split_files.py", line 45, in __iter__
    with open(self.path) as f:
TypeError: invalid file: PosixPath('exported_files/user_languages.csv')

The reason is that we still use Python 3.5, and only in Python 3.6+ can open() take a PosixPath object. But there is a server update scheduled this coming weekend which will get us Python 3.7, so this shouldn't be a problem.
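
For reference, a minimal sketch of the workaround on Python 3.5 (open() only accepts path-like objects since Python 3.6 / PEP 519, so the path has to be converted explicitly):

from pathlib import Path

path = Path("exported_files/user_languages.csv")

# Python 3.6+ accepts the PosixPath directly: open(path)
# Python 3.5 needs an explicit conversion to str:
with open(str(path)) as f:
    first_line = f.readline()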

After fixing that problem, the script ran without any further problems (it took a little bit longer than 10 minutes). But I have plenty of free memory because I've only configured a few languages for Manticore and my database is pretty small.
On the server we should still have enough free memory available:

# free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        7.8G        221M        123M        7.3G        7.1G
Swap:          8.0G        452M        7.6G

About splitting links.csv: I've noticed that the script creates language pairs like eng-abk_links.tsv, eng-chv_links.tsv, ...
But if this is the goal, then I think it would be more efficient to start with exporting the full sentences_translations table which already contains the languages for each pair. Then splitting would be as easy as with the other sentences files.

And thinking a little bit further, we could also include the sentence language in the other three files (tags.csv, sentences_in_lists.csv, sentences_with_audio.csv). As far as I can see this wouldn't be too expensive.

From the dev server:
tags.csv:

root@envy:~# echo "set profiling = 1; select distinct ts.sentence_id, t.name from tags t join tags_sentences ts on t.id = ts.tag_id into outfile '/tmp/tags.csv'; show profiles; set profiling = 0;" | mysql tatoeba
Query_ID        Duration        Query
1       35.70727262     select distinct ts.sentence_id, t.name from tags t join tags_sentences ts on t.id = ts.tag_id into outfile '/tmp/tags.csv'
root@envy:~# rm /tmp/tags.csv
root@envy:~# echo "set profiling = 1; select distinct ts.sentence_id, s.lang, t.name from tags t join tags_sentences ts on t.id = ts.tag_id join sentences s on s.id = ts.sentence_id into outfile '/tmp/tags.csv'; show profiles; set profiling = 0;" | mysql tatoeba
Query_ID        Duration        Query
1       51.76720263     select distinct ts.sentence_id, s.lang, t.name from tags t join tags_sentences ts on t.id = ts.tag_id join sentences s on s.id = ts.sentence_id into outfile '/tmp/tags.csv'

sentences_in_lists.csv:

root@envy:~# echo "set profiling = 1; SELECT sl.id, s_sl.sentence_id FROM sentences_sentences_lists s_sl JOIN sentences_lists sl ON s_sl.sentences_list_id = sl.id WHERE sl.visibility != 'private' ORDER BY sl.id ASC, s_sl.sentence_id into outfile '/tmp/sentences_in_lists.csv'; show profiles; set profiling = 0;" | mysql tatoeba
Query_ID        Duration        Query
1       6.64788314      SELECT sl.id, s_sl.sentence_id FROM sentences_sentences_lists s_sl JOIN sentences_lists sl ON s_sl.sentences_list_id = sl.id WHERE sl.visibility != 'private' ORDER BY sl.id ASC, s_sl.sentence_id into outfile '/tmp/sentences_in_lists.csv'
root@envy:~# rm /tmp/sentences_in_lists.csv
root@envy:~# echo "set profiling = 1; SELECT sl.id, s_sl.sentence_id, s.lang FROM sentences_sentences_lists s_sl JOIN sentences_lists sl ON s_sl.sentences_list_id = sl.id join sentences s on s_sl.sentence_id = s.id WHERE sl.visibility != 'private' ORDER BY sl.id ASC, s_sl.sentence_id into outfile '/tmp/sentences_in_lists.csv'; show profiles; set profiling = 0;" | mysql tatoeba
Query_ID        Duration        Query
1       11.73184253     SELECT sl.id, s_sl.sentence_id, s.lang FROM sentences_sentences_lists s_sl JOIN sentences_lists sl ON s_sl.sentences_list_id = sl.id join sentences s on s_sl.sentence_id = s.id WHERE sl.visibility != 'private' ORDER BY sl.id ASC, s_sl.sentence_id into outfile '/tmp/sentences_in_lists.csv'

sentences_with_audio.csv:

root@envy:~# echo "set profiling = 1; SELECT a.sentence_id, u.username, u.audio_license, u.audio_attribution_url FROM audios a LEFT JOIN users u on u.id = a.user_id ORDER BY sentence_id ASC into outfile '/tmp/sentences_with_audio.csv'; show profiles; set profiling = 0;" | mysql tatoeba
Query_ID        Duration        Query
1       0.71096890      SELECT a.sentence_id, u.username, u.audio_license, u.audio_attribution_url FROM audios a LEFT JOIN users u on u.id = a.user_id ORDER BY sentence_id ASC into outfile '/tmp/sentences_with_audio.csv'
root@envy:~# rm /tmp/sentences_with_audio.csv
root@envy:~# echo "set profiling = 1; SELECT a.sentence_id, s.lang, u.username, u.audio_license, u.audio_attribution_url FROM audios a LEFT JOIN users u on u.id = a.user_id join sentences s on a.sentence_id = s.id ORDER BY sentence_id ASC into outfile '/tmp/sentences_with_audio.csv'; show profiles; set profiling = 0;" | mysql tatoeba
Query_ID        Duration        Query
1       2.13130296      SELECT a.sentence_id, s.lang, u.username, u.audio_license, u.audio_attribution_url FROM audios a LEFT JOIN users u on u.id = a.user_id join sentences s on a.sentence_id = s.id ORDER BY sentence_id ASC into outfile '/tmp/sentences_with_audio.csv'

One disadvantage would be that we would probably break someone's script due to the format change. But then I would rather remove the language column before publishing the files.

@LBeaudoux
Contributor Author

Could you elaborate on the benefits of having these extra files properly split?

Let's say that as a developer, you want to get the ids of all the English sentences that have at least one translation in French. Let's say that your working environment has poor internet connectivity and that you have a maximum of 4 GB of memory at your disposal. Let's say that you run your script every week to update your app and that you don't want this task to run on your machine for a long period of time.

As @jiru said, you currently have to download "eng_sentences.tsv.bz2" (14.2 MB), "fra_sentences.tsv.bz2" (5.5 MB), and "links.tar.bz2" (87.8 MB), which makes a total of 107.5 MB.
Now if you replace "links.tar.bz2" with "eng-fra_links.tar.bz2", you only have 21.3 MB to download in total, and you can do it 5 times faster. With your low-speed internet that means a few minutes saved every week.

Now you decompress your downloaded files, extract them and suddenly the 107.5 MB become 365.1 MB. You load 'links.csv' as a Python list of tuples and realize that it consumes more than 2.3 GB of RAM. The situation becomes tricky: you still have to load the English sentences, the French sentences, your big NLP libraries, and keep some memory for your outputs. If you'd loaded 'eng-fra_links.tsv' instead, you'd have divided the memory footprint of your links by 50.

Long story short, handling these multilingual files is not always a problem but splitting them by language or language pair can greatly improve the performance in some use cases.
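
For instance, with a per-pair file the whole task above shrinks to a few lines (a sketch with hypothetical local file names, assuming eng-fra_links.tsv holds tab-separated (sentence_id, translation_id) pairs with English sources and French targets):

import csv

# Collect the ids of English sentences that have at least one French translation.
with open("eng-fra_links.tsv", newline="") as f:
    eng_ids = {row[0] for row in csv.reader(f, delimiter="\t")}

print(len(eng_ids), "English sentences have at least one French translation")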

Technically speaking, wouldn’t it be much more efficient to produce these new files directly from the database, instead of parsing the CSV files?

Not sure that it is a better option... In the link case, I guess that would involve 365*365 = 133 225 queries on the link table.

But if this is the goal, then I think it would be more efficient to start with exporting the full sentences_translations table which already contains the languages for each pair. Then splitting would be as easy as with the other sentences files.

This could indeed be a more elegant and scalable solution as long as the awk command correctly handles each line of the file to be split, including potential multi-line fields or fields containing tab delimiters.

@AndiPersti
Contributor

Not sure that it is a better option... In the link case, I guess that would involve 365*365 = 133 225 queries on the link table.

I also don't think that splitting the links into language pairs for every possible combination is scalable.
But do we really need to create all these pairs every week? I don't think that many of them are used regularly (at least not at the moment), but I may be wrong.
An alternative would be to have an "on-demand service" as we have already for the lists.

This could indeed be a more elegant and scalable solution as long as the awk command correctly handles each line of the file to be split, including potential multi-line fields or fields containing tab delimiters.

For all the officially supported downloads, I think only user_languages may contain embedded tabs and multi-line fields (i.e. for the others this would be a bug in the data). And after implementing the solution for #2250 there shouldn't be any problems.

About the awk script: It was more or less a quick prototype and it works as long as the input is clean (as it should be). If someone comes up with a better implementation I for sure won't oppose it.

But while your approach works on the client side (and I've done similar things myself) I don't think it is a good solution when it runs on the server where we have quick and efficient access to the database.

@LBeaudoux
Contributor Author

But while your approach works on the client side (and I've done similar things myself) I don't think it is a good solution when it runs on the server where we have quick and efficient access to the database.

You can't really rely on quick and efficient access to the database when you expect responses containing more than 100,000 entries. I just tried to download the 907 list (over 750,000 sentences) from the site and I got a response to my request with a 40-second delay. Since the number of links for a language pair often exceeds 200,000, generating them for a few language pairs is very slow compared to the ready-to-download file solution I proposed.

Honestly, I don't understand why it is OK to split 'sentences.csv' by language but not OK to split 'links.csv' by language pair. I provided a script that @AndiPersti earlier said was working and which just adds a few minutes to the weekly dump process. Besides, there is no need to modify the front-end since the files just have to be downloadable from here.

As I've already briefly mentioned, I proposed this pull request because I built a library that I would like to share with the Tatoeba community. Its goal is to automate and optimize the boring stuff a developer has to do when he starts building a Tatoeba-based app. For example, he could iterate through a parallel corpus with only two lines of code:

>>> from tatoebatools import ParallelCorpus
>>> parallel_corpus = [(src.text, tgt.text) for src, tgt in ParallelCorpus("eng", "fra")]
>>> print(parallel_corpus)
("Let's try something.", 'Essayons quelque chose.'),
('I have to go to sleep.', 'Je dois aller dormir.'),
('I have to go to sleep.', "Il faut que j'aille dormir."),
...

The necessary data files are automatically downloaded, decompressed, extracted and loaded into memory in an optimized way. As of today, the 'links.csv' file is split on the client side but as you know, it takes a few minutes to complete. This obviously deteriorates the developer's experience. Splitting this file on the server side once and for all would make the whole thing more seamless and efficient.

@AndiPersti
Contributor

AndiPersti commented Jun 4, 2020

But while your approach works on the client side (and I've done similar things myself) I don't think it is a good solution when it runs on the server where we have quick and efficient access to the database.

You can't really rely on quick and efficient access to the database when you expect responses containing more than 100,000 entries.

I think there's a misunderstanding. My paragraph above was about how you have to split any of these files on the client compared to how you would do it directly on the server.

On the client you need to

  1. process sentences.csv to get an id -> language mapping and keep it in memory
  2. go through any of the files you want to split line by line and create the per-language files

On the server you need to:

  1. export the data from the database with all necessary information for each exported file (containing all languages)
  2. go through any of the files you want to split line by line and create the per-language files

So the difference between both solutions is step 1, and I argue that it is very expensive on the client while on the server it's more or less negligible (especially for the links table, where all the necessary data is already available without any join operations).

I just tried to download the 907 list (over 750,000 sentences) from the site and I got a response to my request with a 40-second delay. Since the number of links for a language pair often exceeds 200,000, generating them for a few language pairs is very slow compared to the ready-to-download file solution I proposed.

You underestimate the efficiency of databases.
Currently the language pair with the most links is eng-tur. Here's what I get from the dev server (which probably has slightly fewer links than the production database):

MariaDB [tatoeba]> select sentence_id, translation_id from sentences_translations where sentence_lang = 'eng' and translation_lang = 'tur' into outfile '/tmp/eng_tur_links.tsv';
Query OK, 618214 rows affected (6.54 sec)

And based on the files your script created I get the following numbers:

vagrant@stretch:~/Tatoeba/exported_files$ find out/ -name '*links.tsv' -exec wc -l '{}' \; > lines_per_file
vagrant@stretch:~/Tatoeba/exported_files$ wc -l lines_per_file | cut -d ' ' -f 1
16056 # total number of files. Since each pair is counted twice there are roughly 8,000 pairs.
vagrant@stretch:~/Tatoeba/exported_files$ egrep '^[2-9][0-9]{5}' lines_per_file | sort -nru | wc -l
13    # number of language pairs with more than 200,000 links
vagrant@stretch:~/Tatoeba/exported_files$ egrep '^[1-9][0-9]{5}' lines_per_file | sort -nru | wc -l
20    # number of language pairs with more than 100,000 links
vagrant@stretch:~/Tatoeba/exported_files$ egrep '^[0-9]{5}' lines_per_file | sort -nru | wc -l
110   # number of language pairs with more than 10,000 links

In addition, when using the on-demand approach it would also be possible to cache the created files so that the cost of creating them is only incurred on the first request.

But mentioning the on-demand solution was mostly brainstorming and probably needs a little bit more refining.

Honestly, I don't understand why it is OK to split 'sentences.csv' by language but not OK to split 'links.csv' by language pair. I provided a script that @AndiPersti earlier said was working and which just adds a few minutes to the weekly dump process. Besides, there is no need to modify the front-end since the files just have to be downloadable from here.

I'm not against splitting per se. I just think there is a more efficient and scalable solution than your current script.

@jiru
Member

jiru commented Jun 4, 2020

Yeah, the fact that we are discussing implementation details doesn’t mean that we are rejecting your proposal. Quite the contrary actually: I think everyone here wants to make Tatoeba’s exports easier to use. I’m personally very happy about the Python library you mentioned.

It’s just that your suggested solution would use a lot of RAM on the server, and we think there might be ways around this limitation. Even though we have enough RAM on the server to run your proposed script, using this extra RAM means less RAM is available for buffers and caching, which does impact performance.

@LBeaudoux
Contributor Author

@AndiPersti @jiru Thanks for your explanations! I'm sorry I overreacted...

I'll try to implement the method AndiPersti suggested earlier:

  1. include language (or language pair) fields in links.csv, tags.csv, sentences_in_lists.csv and sentences_with_audio.csv when exporting them from the database
  2. split the files with awk (see the sketch below)
  3. remove the language fields before publishing the files to avoid a format change
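
A rough sketch of steps 2 and 3, written in Python for illustration (the actual implementation is meant to use awk; the function name, the position of the language column and the output layout are illustrative):

import csv
from pathlib import Path

def split_and_strip(tsv_path, lang_col, out_dir):
    # Split an export that carries an extra language column into per-language
    # files, dropping that column again so the published format doesn't change.
    writers, handles = {}, []
    with open(tsv_path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            lang = row[lang_col] or "unknown"
            if lang not in writers:
                lang_dir = Path(out_dir) / lang
                lang_dir.mkdir(parents=True, exist_ok=True)
                out = open(str(lang_dir / "{}_{}".format(lang, Path(tsv_path).name)), "w", newline="")
                handles.append(out)
                writers[lang] = csv.writer(out, delimiter="\t")
            writers[lang].writerow(row[:lang_col] + row[lang_col + 1:])
    for handle in handles:
        handle.close()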

@jiru
Member

jiru commented Jun 5, 2020

@l-bdx It’s alright! 😄

I’d like to suggest a variant solution. Try to directly pipe mysql into an awk script similar to what we already use. For example, for the link pairs:

mysql --skip-column-names --batch tatoeba -e \
    'SELECT sentence_id, sentence_lang, translation_id, translation_lang
     FROM sentences_translations' | \
  awk -F"\t" '{
    print "links_" \
          ($2 == "NULL" ? "unknown" : $2) \
          "-" \
          ($4 == "NULL" ? "unknown" : $4) \
          ".csv"
  }'

@AndiPersti
Contributor

Thanks for the update.

I've tried to run the script locally but unfortunately my idea of using the links table as described doesn't work because @jiru discovered a bug a few days ago which results in incorrect languages.

@jiru
Member

jiru commented Jun 13, 2020

Yeah we’ll just have to wait until #2387 is solved before we can merge your PR.

@jiru
Member

jiru commented Jul 1, 2020

@l-bdx I’m on my way to solve #2387.

Meanwhile, I tried to run the export with your code and I got the following errors:

vagrant@buster:~/Tatoeba$ sudo ./docs/cron/runner.sh ./docs/cron/export.sh
./docs/cron/export.sh: line 90: tags: command not found
./docs/cron/export.sh: line 90: tags_sentences: command not found
./docs/cron/export.sh: line 90: sentences: command not found
ERROR 1146 (42S02) at line 1: Table 'tatoeba.t' doesn't exist

@jiru
Member

jiru commented Jul 1, 2020

There is now a new file in the exports (sentences_base.csv). You might want to update your pull request to split that file too.

@jiru
Member

jiru commented Jul 3, 2020

@l-bdx Thanks for the update! 👍 I am still getting errors by the way:

vagrant@buster:~/Tatoeba$ sudo ./docs/cron/runner.sh ./docs/cron/export.sh
./docs/cron/export.sh: line 90: tags_sentences: command not found
./docs/cron/export.sh: line 90: tags: command not found
./docs/cron/export.sh: line 90: sentences: command not found
ERROR 1146 (42S02) at line 1: Table 'tatoeba.ts' doesn't exist

It’s because you are using backticks in your SQL queries. Backticks are interpreted by Bash as command substitution characters. For example, you can do:

vagrant@buster:~/Tatoeba$ whoami
vagrant
vagrant@buster:~/Tatoeba$ echo "I am `whoami`"
I am vagrant

I think you can just use single quotes instead, or just remove the backticks altogether.

@LBeaudoux
Contributor Author

LBeaudoux commented Jul 3, 2020

@jiru Thanks for your help! Sorry for the bugs, I haven't found any easy way to test this code locally.

@jiru
Member

jiru commented Jul 3, 2020

I tried your updated code and the errors I mentioned are gone.

Here is how to test your code locally. From inside the VM:

# Just run this command once
sudo ln -s /home/vagrant/Tatoeba /var/www-prod
# This creates the files in /var/www-downloads/exports/
sudo ./docs/cron/runner.sh ./docs/cron/export.sh

@jiru
Member

jiru commented Jul 3, 2020

It looks like your script is having trouble with null values.

If there is a sentence in unknown language (lang is null) that is based on another sentence, I get this error:

vagrant@buster:~/Tatoeba$ sudo ./docs/cron/runner.sh ./docs/cron/export.sh
awk: cmd. line:4: (FILENAME=- FNR=57) fatal: can't redirect to `/var/tmp/per_language/N/N_sentences_base.tsv' (No such file or directory)

If there is a sentence in unknown language having a translation, I get this error:

vagrant@buster:~/Tatoeba$ sudo ./docs/cron/runner.sh ./docs/cron/export.sh
awk: cmd. line:5: (FILENAME=- FNR=3) fatal: can't redirect to `/var/tmp/per_language/N/N-fra_links.tsv' (No such file or directory)

If there is a sentence in unknown language having a tag, I get this error:

vagrant@buster:~/Tatoeba$ sudo ./docs/cron/runner.sh ./docs/cron/export.sh
awk: cmd. line:4: (FILENAME=- FNR=5) fatal: can't redirect to `/var/tmp/per_language/N/N_tags.tsv' (No such file or directory)

If a user has a profile language with unspecified level (level is null), files *_user_languages.tsv.bz2 contain lines with N instead of \N, such as:

jpn	N	advanced_contributor	
jpn	N	contributor	

Same for *_sentences_base.tsv.bz2:

45	0
46	0
47	N
48	0
49	0

Same for *_sentences_with_audio.tsv.bz2:

62	admin	N	N

Considering all the things you have to check, it might be a good idea to start writing unit tests. @l-bdx Are you familiar with unit testing?

@LBeaudoux
Contributor Author

LBeaudoux commented Jul 3, 2020

Here is how to test your code locally.

Actually, the main difficulty is the fact that my local database is empty. I need to figure out a way to populate it with real data.

Are you familiar with unit testing?

I have already unit tested python code, but never bash scripts.

@jiru
Member

jiru commented Jul 6, 2020

About the empty database, we don’t have a mechanism to automatically populate it yet. (See also issue #2010.) For now, you have to add sentences manually.

About testing bash scripts, I don’t have experience either. I wouldn’t venture into proper shell unit testing because we might as well rewrite the whole export script in PHP, but for now I think there is a simple thing you can do that would already help greatly: concatenate all the per_language files of the same kind and compare them with the full file of that kind. For example:

# Check tags
diff -u \
  <(tar xOf /var/www-downloads/exports/tags.tar.bz2 tags.csv | sort) \
  <(bzcat /var/www-downloads/exports/per_language/*/*_tags.tsv.bz2 | sort)

# Check user languages
diff -u \
  <(tar xOf /var/www-downloads/exports/user_languages.tar.bz2 user_languages.csv | sort) \
  <(bzcat /var/www-downloads/exports/per_language/*/*_user_languages.tsv.bz2 | sort)

COALESCE(u.username, '\N'),
COALESCE(ul.language_code, '\\N'),
COALESCE(ul.level, '\\N'),
COALESCE(u.username, '\\N'),
Contributor


We currently don't have a username that is NULL (and I'm pretty sure we don't want to have one in the future) so I don't think you need to check for NULL here:

root@sloth:~# mariadb tatoeba -e "select count(*) from users where username is null;"
+----------+
| count(*) |
+----------+
|        0 |
+----------+

@jiru
Member

jiru commented Jul 9, 2020

@LBeaudoux I ran your updated export script on dev.tatoeba.org. The whole export (including files containing all languages) took 22 minutes and 52 seconds to complete, which is alright. You can have a look at the resulting files.

I wrote a simple consistency-check script that found a bunch of problems. Some of the problems are false positives due to the line returns. But it looks like all the user_languages files are wrong because they use a space as field delimiter. There are also problems with empty languages, as AndiPersti mentioned, producing files like -ber_links.tsv.bz2. @LBeaudoux I’ll let you investigate.

@LBeaudoux
Contributor Author

@jiru thanks for your consistency-check script. It's really helpful and has already helped me correct two major mistakes in the script. Besides that, I spotted two other sources of differences in the log.

1/ 9 sentences on the test server (of which only 2 are also in production) have an empty string as language code. This leads to the anomalies already mentioned by jiru. Perhaps 7457556 and 7460541 should be deleted from the database. In the future, it could be useful not to allow this type of sentence.

2/ Sentences that have been removed from the sentences table are still present in tags_sentences and audios, so they can still be found in the tags.csv and sentences_with_audio.csv exports. On the other hand, an additional join with the sentences table discards them from the split files. To avoid the differences between the two types of files, I suggest also joining with sentences when exporting sentences_in_lists.csv, sentences_with_audio.csv and tags.csv.

@AndiPersti
Contributor

1/ 9 sentences on the test server (of which only 2 are also in production) have an empty string as language code. This leads to the anomalies already mentioned by jiru. Perhaps 7457556 and 7460541 should be deleted from the database. In the future, it could be useful not to allow this type of sentence.

This is only a problem in the dev database due to an old bug. The production database should be fine:

MariaDB [tatoeba]> select id, lang from sentences where id in (7457556, 7460541);
+---------+------+
| id      | lang |
+---------+------+
| 7457556 | NULL |
| 7460541 | NULL |
+---------+------+
2 rows in set (0.001 sec)

MariaDB [tatoeba]> select count(*) from sentences where lang = '';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.001 sec)

2/ Sentences that have been removed from the sentences table are still present in tags_sentences and audios, so they can still be found in the tags.csv and sentences_with_audio.csv exports. On the other hand, an additional join with the sentences table discards them from the split files. To avoid the differences between the two types of files, I suggest also joining with sentences when exporting sentences_in_lists.csv, sentences_with_audio.csv and tags.csv.

  • There are 4 deleted sentences in tags_sentences in the production database:
MariaDB [tatoeba]> select tags_sentences.* from (tags_sentences left outer join sentences on tags_sentences.sentence_id = sentences.id) where sentences.id is null;
+---------+--------+---------+-------------+---------------------+
| id      | tag_id | user_id | sentence_id | added_time          |
+---------+--------+---------+-------------+---------------------+
| 1954417 |    493 |    1396 |     7653911 | 2018-12-25 08:40:07 |
| 1997636 |   8346 |   50371 |     7917403 | 2019-05-18 00:43:13 |
| 2014163 |  10391 |   55525 |     8005271 | 2019-06-28 02:30:11 |
| 2673392 |    841 |   80371 |     6773205 | 2020-02-05 16:33:46 |
+---------+--------+---------+-------------+---------------------+
4 rows in set (4.789 sec)

The first was deleted by a user, the other three were deleted by Horus.
We have an SQL trigger for this case, so it looks like for some reason that didn't work.

  • There are 622 deleted sentences in audios in the production database:
MariaDB [tatoeba]> select count(*) from (audios left outer join sentences on audios.sentence_id = sentences.id) where sentences.id is null;
+----------+
| count(*) |
+----------+
|      622 |
+----------+
1 row in set (2.409 sec)

Most of them were deleted by Horus, the rest by a user:

MariaDB [tatoeba]> select user_id, count(*) from contributions c join (select audios.sentence_id as sentence_id from (audios left outer join sentences on audios.sentence_id = sentences.id) where sentences.id is null) a on c.sentence_id = a.sentence_id where c.action = 'delete' and c.type = 'sentence' group by c.user_id;
+---------+----------+
| user_id | count(*) |
+---------+----------+
|    1396 |        8 |
|   63758 |      614 |
+---------+----------+
2 rows in set (4.342 sec)

We currently don't have an SQL trigger, but the code should prevent deleting a sentence with audio. It looks like there is a way to circumvent this check. (Horus doesn't update the audios table, so that needs to be fixed.)

All in all your suggestion for adding a join should only be a short-term fix IMHO.

@jiru
Member

jiru commented Jul 13, 2020

@LBeaudoux Thanks for the update. 👍 I executed the export with your updated version. Here is the output of the consistency-check script.

I think the export script should be more solid in general.

About sentences having an empty string as language code: while the current code base should prevent the addition of such sentences, I wouldn’t be surprised if it happens again in the future because of some regression. So I think it would be better to make the script handle them correctly.

About deleted sentences in other tables like audios: this is a bug in the existing export files, not your new per_language files, so you don’t need to solve that problem in this pull request.

@alanfgh
Contributor

alanfgh commented Jul 13, 2020

I suggest changing the title of the ticket from "Split not split export files by language" to "Split unsplit export files by language".

@LBeaudoux changed the title from "Split not split export files by language" to "Split unsplit export files by language" on Jul 14, 2020
@jiru
Member

jiru commented Jul 15, 2020

@LBeaudoux Thanks for the update. I ran the export with your updated version. You can check the output at https://downloads.dev.tatoeba.org/exports/ and also see the consistency-check logs. I think they only contain false positives now, so we are good to merge your pull request. Thank you very much for all your hard work! 😃

@jiru added this to the 2020-07-19 milestone on Jul 15, 2020
@jiru
Member

jiru commented Jul 15, 2020

Oh, there is one last minor problem: your script creates an empty NULL directory.

@jiru
Member

jiru commented Jul 15, 2020

I confirm your fix solves the problem 👍

@jiru merged commit 5a8498f into Tatoeba:dev on Jul 16, 2020
@LBeaudoux
Contributor Author

😄 I look forward to finally seeing these split files on the production server!

@jiru @AndiPersti thank you both for the time you spent guiding me and testing this code. It's been very nice working with you on this pull request.

@jiru
Member

jiru commented Jul 17, 2020

@LBeaudoux It’s my pleasure 😊 I installed your script on the production server already. The next weekly export, which will be produced in about 9 hours, will include your changes.

jiru added a commit that referenced this pull request Mar 22, 2025
The export script has grown bigger since we are providing
per-language export files (implemented in #2163 and #2360).
While the priority of mysql queries cannot be changed (because
they are performed by the mysql daemon), awk and compression
tasks do bring some extra load on the server that may affect
web clients, especially if the server load happens to be
higher than usual exactly when an export is triggered.