Split unsplit export files by language #2360
Conversation
Thanks for bringing up this issue. But I’m not sure I understand. Could you elaborate on the benefits of having these extra files properly split? About the file size, these extra files are currently all under 100MB so I assume it shouldn’t be a big problem. About the fact that these extra files include data related to languages you are not interested in, I’d argue that to make use of them, you have to match them with sentences.csv anyway, which has monolingual versions, so I’m not sure it actually helps. That said, I’ve never really used the CSV files, so I might not be aware of all their shortcomings. That’s why I’m asking you some more practical details to better understand the problem you’re trying to solve.
It’s not a major problem, but it looks quite big for the task. Technically speaking, wouldn’t it be much more efficient to produce these new files directly from the database, instead of parsing the CSV files? What about rewriting this as a PHP component of Tatoeba, or some pure SQL query if it’s simple enough? |
I've set up my VM with 3GB RAM and so I've tried to run it. First thing I've noticed is an error:
The reason is that we still use Python 3.5 while the script relies on features that are only available in Python 3.6+. After fixing that problem, the script ran without any further issues (it took a little longer than 10 minutes). But I have plenty of free memory because I've only configured a few languages for Manticore and my database is pretty small.
About splitting: thinking a little bit further, we could also include the sentence language in the other three files. From the dev server:
One disadvantage would be that we would probably break someone's script due to the format change. But then I would rather remove the language before publishing the files. |
Let's say that as a developer, you want to get the ids of all the English sentences that have at least one translation in French. Let's say that your working environment has poor internet connectivity and that you have a maximum of 4 GB of memory at your disposal. Let's say that you have to run your script every week to update your app and that you don't want this task to run on your machine for a long period of time. As @jiru said, you currently have to download "eng_sentences.tsv.bz2" (14.2 MB), "fra_sentences.tsv.bz2" (5.5 MB), and "links.tar.bz2" (87.8 MB), which makes a total of 107.5 MB. Now you decompress your downloaded files, extract them, and suddenly the 107.5 MB become 365.1 MB. You load 'links.csv' as a Python list of tuples and you realize that it consumes more than 2.3 GB of RAM. The situation becomes tricky: you still have to load the English sentences, the French sentences, your big NLP libraries, and keep some memory for your outputs... If you'd loaded 'eng-fra_links.tsv' instead, you'd have divided the memory footprint of your links by 50. Long story short, handling these multilingual files is not always a problem, but splitting them by language or language pair can greatly improve the performance in some use cases.
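For illustration, the loading step I have in mind looks roughly like this (a simplified sketch, not the exact code of my script):

```python
import csv

# links.csv: one "sentence_id<TAB>translation_id" pair per line.
# Loading every pair at once is what blows up the memory footprint.
with open("links.csv", newline="", encoding="utf-8") as f:
    links = [(int(s), int(t)) for s, t in csv.reader(f, delimiter="\t")]

# Reading the much smaller eng-fra_links.tsv instead would keep only the
# English-French pairs in memory.
```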
Not sure that it is a better option... In the link case, I guess that would involve 365 × 365 = 133,225 queries on the link table.
This could indeed be a more elegant and scalable solution as long as the awk command correctly handles each line of the file to be split, including potential multi-line fields or fields containing tab delimiters. |
I also don't think that splitting the links into language pairs for every possible combination is scalable.
For all the officially supported downloads, I think there is only a limited set of files to generate. But while your approach works on the client side (and I've done similar things myself), I don't think it is a good solution when it runs on the server, where we have quick and efficient access to the database. |
You can't really rely on quick and efficient access to the database when you expect responses containing more than 100,000 entries. I just tried to download the 907 list (over 750,000 sentences) from the site and I got the response to my request after a 40-second delay. Since the number of links for a language pair often exceeds 200,000, generating them for a few language pairs is very slow compared to the ready-to-download file solution I proposed. Honestly, I don't understand why it is OK to split 'sentences.csv' by language but not OK to split 'links.csv' by language pair. I provided a script that @AndiPersti earlier said was working and which just adds a few minutes to the weekly dump process. Besides, there is no need to modify the front-end since the files just have to be downloadable from here. As I've already briefly mentioned, I proposed this pull request because I built a library that I would like to share with the Tatoeba community. Its goal is to automate and optimize the boring stuff a developer has to do when he starts building a Tatoeba-based app. For example, he could iterate through a parallel corpus with only two lines of code:
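Something along these lines, where the module, class, and attribute names are only placeholders and not the library's final API:

```python
from tatoeba_corpus import ParallelCorpus  # hypothetical package and class names

for sentence, translation in ParallelCorpus("eng", "fra"):
    print(sentence.text, translation.text)
```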
The necessary data files are automatically downloaded, decompressed, extracted and loaded into memory in an optimized way. As of today, the 'links.csv' file is split on the client side but as you know, it takes a few minutes to complete. This obviously deteriorates the developer's experience. Splitting this file on the server side once and for all would make the whole thing more seamless and efficient. |
I think there's a misunderstanding. My paragraph above was about how you have to split any of these files on the client compared to how you would do it directly on the server. On the client you need to
On the server you need to:
So the difference between both solutions is step 1, and I argue that it is very expensive on the client while on the server it's more or less negligible (especially for the links table, where all the necessary data is already available without any join operations).
You underestimate the efficiency of databases.
And based on the files your script created I get the following numbers:
In addition, when using the on-demand approach it would also be possible to cache the created files so that the cost of creating them is only paid on the first request. But mentioning the on-demand solution was mostly brainstorming and it probably needs a little bit more refining.
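Roughly, what I have in mind is something like this (the paths, file names, and the generation step below are just placeholders):

```python
import os
import time

CACHE_DIR = "/var/www-downloads/exports/per_language"  # assumed cache location
MAX_AGE = 7 * 24 * 3600  # seconds; regenerate at most once per weekly export cycle

def links_file_on_demand(lang1, lang2, generate):
    """Return the per-pair links file, creating it only on the first request.

    `generate(path)` stands for whatever actually produces the file
    (an SQL query piped through awk, a PHP component, ...).
    """
    path = os.path.join(CACHE_DIR, f"{lang1}-{lang2}_links.tsv")
    fresh = os.path.exists(path) and time.time() - os.path.getmtime(path) < MAX_AGE
    if not fresh:
        generate(path)  # only the first request pays the generation cost
    return path
```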
I'm not against splitting per se. I just think there is a more efficient and scalable solution than your current script. |
Yeah, the fact that we are discussing implementation details doesn’t mean that we are rejecting your proposal. Quite the contrary actually: I think everyone here wants to make Tatoeba’s exports easier to use. I’m personally very happy about the Python library you mentioned. It’s just that your suggested solution would use a lot of RAM on the server, and we think there might be ways around this limitation. Even though we have enough RAM on the server to run your proposed script, using this extra RAM means less RAM is available for buffers and caching, which does impact performance. |
@AndiPersti @jiru Thanks for your explanations! I'm sorry I overreacted... I'll try to implement the method AndiPersti suggested earlier:
|
@l-bdx It’s alright! 😄 I’d like to suggest a variant solution: try to directly pipe mysql into an awk script similar to the one we already use. For example, for the link pairs:
mysql --skip-column-names --batch tatoeba -e \
'SELECT sentence_id, sentence_lang, translation_id, translation_lang
FROM sentences_translations' | \
awk -F"\t" '{
print "links_" \
($2 == "NULL" ? "unknown" : $2) \
"-" \
($4 == "NULL" ? "unknown" : $4) \
".csv"
}' |
Thanks for the update. I've tried to run the script locally but unfortunately my idea of using the |
Yeah we’ll just have to wait until #2387 is solved before we can merge your PR. |
@l-bdx I’m on my way to solve #2387. Meanwhile, I tried to run the export with your code and I got the following errors:
|
There is now a new file in the exports (sentences_base.csv). You might want to update your pull request to split that file too. |
@l-bdx Thanks for the update! 👍 I am still getting errors by the way:
It’s because you are using backticks in your SQL queries. Backticks are interpreted by Bash as the command substitution character. For example you can do:
I think you can just use single quotes instead, or just remove the backticks altogether. |
@jiru Thanks for your help! Sorry for the bugs, I haven't found any easy way to test this code locally. |
I tried your updated code and the errors I mentioned are gone. Here is how to test your code locally. From inside the VM:
# Just run this command once
sudo ln -s /home/vagrant/Tatoeba /var/www-prod
# This creates the files in /var/www-downloads/exports/
sudo ./docs/cron/runner.sh ./docs/cron/export.sh
It looks like your script is having trouble with null values. If there is a sentence in unknown language (lang is null) that is based on another sentence, I get this error:
If there is a sentence in unknown language having a translation, I get this error:
If there is a sentence in unknown language having a tag, I get this error:
If a user has a profile language with unspecified level (level is null), files
Same for
Same for
Considering all the things you have to check, it might be a good idea to start writing unit tests. @l-bdx Are you familiar with unit testing? |
Actually, the main difficulty is the fact that my local database is empty. I need to figure out a way to populate it with real data.
I have already unit tested Python code, but never bash scripts. |
About the empty database, we don’t have a mechanism to automatically populate it yet. (See also issue #2010.) For now, you have to add sentences manually. About testing bash scripts, I don’t have experience with that either. I wouldn’t venture into proper shell unit testing because we might as well rewrite the whole export script in PHP, but for now I think there is a simple thing you can do that would greatly help already: concatenate all the split files and diff them against the corresponding unsplit exports, for example:
# Check tags
diff -u \
<(tar xOf /var/www-downloads/exports/tags.tar.bz2 tags.csv | sort) \
<(bzcat /var/www-downloads/exports/per_language/*/*_tags.tsv.bz2 | sort)
# Check user languages
diff -u \
<(tar xOf /var/www-downloads/exports/user_languages.tar.bz2 user_languages.csv | sort) \
<(bzcat /var/www-downloads/exports/per_language/*/*_user_languages.tsv.bz2 | sort)
docs/cron/export.sh
COALESCE(u.username, '\N'),
COALESCE(ul.language_code, '\\N'),
COALESCE(ul.level, '\\N'),
COALESCE(u.username, '\\N'),
We currently don't have a username that is NULL (and I'm pretty sure we don't want to have one in the future), so I don't think you need to check for NULL here:
root@sloth:~# mariadb tatoeba -e "select count(*) from users where username is null;"
+----------+
| count(*) |
+----------+
|        0 |
+----------+
@LBeaudoux I ran your updated export script on dev.tatoeba.org. The whole export (including the files containing all languages) took 22 minutes and 52 seconds to complete, which is alright. You can have a look at the resulting files. I wrote a simple consistency-check script that found a bunch of problems. Part of the problems are false positives due to line returns. But it looks like all the user_languages files are wrong because they use a space as field delimiter. There are also problems with empty languages, as AndiPersti mentioned, producing files like -ber_links.tsv.bz2. @LBeaudoux I’ll let you investigate. |
sentences_with_audio split files
@jiru Thanks for your consistency-check script, it's really helpful: it already helped me correct two major mistakes in the script. Besides that, I spotted two other sources of difference in the log. 1/ 9 sentences on the test server (of which only 2 are also in production) have an empty string as language code. This leads to the anomalies already mentioned by jiru. Perhaps 7457556 and 2/ Sentences that have been removed from the |
This is only a problem in the dev database due to an old bug. The production database should be fine:
The first was deleted by a user, the other three were deleted by Horus.
Most of them were deleted by Horus, the rest by a user:
We currently don't have an SQL trigger, but the code should prevent deleting a sentence with audio. It looks like there is a way to circumvent this check. (Horus doesn't update the corresponding table.) All in all, your suggestion of adding a join should only be a short-term fix IMHO. |
@LBeaudoux Thanks for the update. 👍 I executed the export with your updated version. Here is the output of the consistency-check script. I think the export script should be more solid in general. About sentences having an empty string as language code: while the current code base should prevent the addition of such sentences, I wouldn’t be surprised if it happens again in the future because of some regression. So I think it would be better to make the script handle them correctly. About deleted sentences in other tables like |
I suggest changing the title of the ticket from "Split not split export files by language" to "Split unsplit export files by language". |
@LBeaudoux Thanks for the update. I ran the export with your updated version. You can check the output on https://downloads.dev.tatoeba.org/exports/. Also see the consistency-check logs; I think they only contain false positives now, so we are good to merge your pull request. Thank you very much for all your hard work! 😃 |
Oh, there is one last minor problem: your script creates an empty NULL directory. |
I confirm your fix solves the problem 👍 |
😄 I look forward to finally seeing these split files on the production server! @jiru @AndiPersti thank you both for the time you spent guiding me and testing this code. It's been very nice working with you on this pull request. |
@LBeaudoux It’s my pleasure 😊 I installed your script on the production server already. The next weekly export, which will be produced in about 9 hours, will include your changes. |
The export script has grown bigger since we started providing per-language export files (implemented in #2163 and #2360). While the priority of mysql queries cannot be changed (because they are performed by the mysql daemon), the awk and compression tasks do bring some extra load on the server that may affect web clients, especially if the server load happens to be higher than usual exactly when an export is triggered.
I am currently working on a Python library that will help developers to set up their Tatoeba-related projects in just a few minutes. I would like to help users to download only the data related to the languages they are interested in. Thanks to AndiPersti it is already possible with the 'sentences', 'sentences_detailed', 'sentences_CC0' and 'transcriptions' tables. But none of the other CSV files is downloadable in a monolingual version.
In the case of 'user_languages', I assume this is due to the fact that a few multi-line rows make the CSV file difficult to read. I built a custom CSV reader to overcome this issue.
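The idea behind that reader is roughly the following (a simplified sketch assuming a four-column layout in which only the last, free-text field can contain line breaks; it is not the exact code of my library):

```python
def read_rows(path, n_fields=4, delimiter="\t"):
    """Yield logical rows of a TSV file whose last field may span several
    physical lines. A line is treated as the start of a new row when it
    contains at least n_fields - 1 delimiters; anything else is appended
    to the previous row's last field."""
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.count(delimiter) >= n_fields - 1:
                if current is not None:
                    yield current
                current = line.split(delimiter, n_fields - 1)
            elif current is not None:
                current[-1] += "\n" + line
    if current is not None:
        yield current
```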
Since the other files don't have any language column, it is harder to split them. But it is doable if we map the sentence id of each row to its corresponding language. I wrote a script that splits the files using this method.
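In outline, the method works like this (a simplified sketch of the approach, not the exact code of the script):

```python
import csv

# 1. Map each sentence id to its language using sentences.csv
#    (columns: id, lang, text). A "\N" or empty language is treated as unknown.
sentence_lang = {}
with open("sentences.csv", newline="", encoding="utf-8") as f:
    for sentence_id, lang, _text in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        sentence_lang[sentence_id] = lang if lang not in ("\\N", "") else "unknown"

# 2. Route each row of a file that has no language column (here links.csv,
#    columns: sentence_id, translation_id) to a per-language-pair file.
#    A real script would also have to limit the number of files kept open at once.
writers = {}
with open("links.csv", newline="", encoding="utf-8") as f:
    for sentence_id, translation_id in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        pair = (sentence_lang.get(sentence_id, "unknown"),
                sentence_lang.get(translation_id, "unknown"))
        if pair not in writers:
            writers[pair] = open(f"{pair[0]}-{pair[1]}_links.tsv", "w", encoding="utf-8")
        writers[pair].write(f"{sentence_id}\t{translation_id}\n")

for handle in writers.values():
    handle.close()
```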
Running these extra splits consumes about 1.5 gigabytes of memory for about five minutes. I hope it won't be a major problem. I could not test this commit locally since my virtual machine only has 1 gigabyte of RAM, but other than that, the script seems to be working.
I really hope that this pull request will be approved. In my opinion, splitting these files makes way more sense on the server side than on the client side.