Deduplication script #424

Closed
trang opened this Issue Sep 4, 2014 · 30 comments

@trang
Member

trang commented Sep 4, 2014

Below is an email from sysko about the deduplication script.
Attached file: https://dl.dropboxusercontent.com/u/953908/tatoeba/duplicate.tar.gz

Hello

Attached is the deduplication script;
it is currently run by launching do_all.sh.

From a logical perspective, here is what it does (do_all.sh):

1 - download sentences.csv and link.csv
2 - import them into an SQLite3 database
3 - export to a file the list of sentences that belong to a non-unique (lang, text) pair
4 - launch a python script

The Python script then takes the exported file of duplicates and iterates over it to generate a list of SQL "merge" instructions, which are dumped into an SQL script.

Finally, the SQL script is copied to the server (through scp) and run in MySQL.
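
As an illustration of step 3, the duplicate groups can be listed with a single query against the SQLite database built in step 2. This is only a sketch; the file, table, and column names (sentences, id, lang, text) are assumed from the CSV layout rather than taken from the actual script:

# Sketch of step 3: list every sentence whose (lang, text) pair occurs more
# than once in the SQLite database built from sentences.csv.
# Assumed schema: sentences(id, lang, text).
import sqlite3

conn = sqlite3.connect('sentences.sqlite3')  # hypothetical file name
duplicates = conn.execute("""
    SELECT s.id, s.lang, s.text
    FROM sentences s
    JOIN (SELECT lang, text
          FROM sentences
          GROUP BY lang, text
          HAVING COUNT(*) > 1) d
      ON s.lang = d.lang AND s.text = d.text
    ORDER BY s.lang, s.text, s.id
""").fetchall()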

Shortcoming:

This version of the script does not handle audio/ownership priority, i.e. within a group of duplicates we may want to keep the sentence that has audio, and the one that has belonged to somebody for the longest time (to prevent people from creating a duplicate in order to take ownership of a sentence, for example to force a correction).

The script needs to be run as soon as possible after the dump has been created in order to avoid inconsistencies (sentences that have been deleted, links that have changed, etc.).

So I strongly advise that people rewrite that script before running it, though the basic idea is here.

Regards

@alanfgh
Contributor

alanfgh commented Sep 6, 2014

See the wiki for a list of requirements for duplicate-merging. Priority based on length of ownership and presence of audio are mentioned in that list.

Writing a duplicate-merging script was a task that @pallavshah was considering. He or @jiru should know the status of that task. If nothing has been done in that regard, it's certainly possible to use sysko's script as a starting point. I've already written Python scripts that work with MySQL, and they're quite similar to sysko's, which work with SQLite, so this is a task that I could take on if it hasn't already been addressed.


@alanfgh
Contributor

alanfgh commented Sep 6, 2014

By the way, I marked #182 as a duplicate of this ticket, even though #182 is older, because this one contains more information.


@jiru
Member

jiru commented Sep 6, 2014

@pallavshah wasn’t able to get it done by the end of the GSOC. @loolmeh told me he’s been working on it (because pytoeba can’t have duplicates by design, so he needs to deduplicate prior to import into pytoeba).


@loolmeh loolmeh self-assigned this Oct 5, 2014

@loolmeh loolmeh added priority:high and removed priority:high labels Oct 5, 2014

loolmeh added a commit that referenced this issue Oct 7, 2014

@loolmeh
Contributor

loolmeh commented Oct 7, 2014

Hi, I've committed something. For now it:

  • merges comments
  • merges links
  • merges tags
  • prioritizes duplicates: audio > owner > lowest id (see the sketch below)
  • wraps deduplication in a transaction per set of duplicates with subtransactions for merging operations
  • has a cooldown interval
  • logs operations in the contributions table using a user account
  • handles correctness
  • has 5 sets of fixtures each having about 4 sentences, covering cases with audio, owners, negative correctness, and none of those things.
  • all current functional units are tested
  • can handle incremental deduplication, but that needs a (text, lang) index on Sentences to be fast enough for practical use; the index will probably slow inserts a little and double the size of the database on disk.

Not covered:

  • the wiki notes talk about preserving comment order when merging, which I don't think is possible, since comments are just arranged by date on the website
  • the wiki notes talk about merging logs, which doesn't quite make sense: different owners would have their edits displayed in a chaotic order, probably because of the dates
  • no logging output is produced at the moment; it would probably be nice to see the script's progress in real time, plus a small report on how many duplicates were dealt with once it's done.
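
To make the priority rule above concrete, here is a minimal sketch of how a group of duplicates could be reduced to the one sentence that survives the merge; the attribute names (hasaudio, user_id, id) are assumptions, not necessarily the script's actual field names:

# Sketch of the "audio > owner > lowest id" rule: pick the sentence that
# survives the merge out of a group of duplicates.
def pick_main(sentences):
    return min(
        sentences,
        key=lambda s: (
            0 if s.hasaudio else 1,  # a sentence with audio wins first
            0 if s.user_id else 1,   # then a sentence that has an owner
            s.id,                    # then the lowest (oldest) id
        ))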

loolmeh added a commit that referenced this issue Oct 7, 2014

@jiru
Member

jiru commented Oct 7, 2014

Yay, nice job!


@alanfgh
Contributor

alanfgh commented Oct 7, 2014

Excellent! I posted your notes on the Wall.


@alanfgh
Contributor

alanfgh commented Oct 7, 2014

sc wrote:

ideally, lists should also be taken in charge [I think he means "into account" rather than "in charge"]. When you merge sentences, the resulting sentence must replace the others in all the lists where they were present, otherwise, lists are being corrupted, with items disappearing.


@trang
Member

trang commented Oct 8, 2014

So as not to forget anything, I'd recommend going through the structure of each table and checking where there is a reference to a sentence id.

I did it based on the doc here, but maybe I missed something.


contributions (you don't want to touch those, but I'm still listing it here)
- sentence_id
- translation_id

favorites_users
- favorite_id

sentence_annotations
- sentence_id

sentence_comments
- sentence_id

sentences
- id

sentences_sentences_lists
- sentence_id

sentences_translations
- sentence_id
- translation_id

tags_sentences
- sentence_id


Anyway, good to see some progress on this :)
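
As a cross-check of the table list above, one could ask MySQL's information_schema for every column whose name looks like a sentence reference. This is only a rough sketch; the connection parameters and the column-name filter are assumptions:

# Sketch: list every (table, column) in the tatoeba schema whose column name
# suggests it holds a sentence id.
import MySQLdb

db = MySQLdb.connect(user='root', db='tatoeba')  # placeholder credentials
cur = db.cursor()
cur.execute("""
    SELECT table_name, column_name
    FROM information_schema.columns
    WHERE table_schema = 'tatoeba'
      AND column_name IN ('sentence_id', 'translation_id', 'favorite_id')
""")
for table, column in cur.fetchall():
    print("%s.%s" % (table, column))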


loolmeh added a commit that referenced this issue Oct 13, 2014

loolmeh added a commit that referenced this issue Oct 13, 2014

#424 add a logging framework for the script with tests, covers stdout, string, and file streams. also moved link logging to json serialization on file stream.

loolmeh added a commit that referenced this issue Oct 13, 2014

#424 refactored merging code into two abstract types: insert merging and update merging. all tests still pass.

loolmeh added a commit that referenced this issue Oct 14, 2014

#424 all operations are logged to stdout and string stream. also, added options to dump string stream to the wall and post comments on merged sentences

@jiru
Member

jiru commented Oct 14, 2014

I accidentally removed loolmeh’s comment that was just before this one:

ok, the last round of commits covers:

  • merging all tables referencing sentence ids
  • logs and comments are merged by duplicating the rows needed and pointing them to the new merged sentence
  • code refactoring which separates duplicating and updating records as merge methods
  • logging to 3 streams: file, stdout (with coloring), and string (which can end up on the wall)
  • json serialization of any rows affected during deduplication and dumping of them to a file log
  • wall report and comment posting

  Limitations:

  • file logging is not transactional, so maybe change this to direct logging in the database, which needs reviewing of the schema to allow an action called 'merge' where we can just dump the json serialization in the text field.
  • logging doubles the queries needed, as snapshots are taken before merge operations happen
  • json serialization slows down the script a bit.

What's left now is for someone to run the tests on their VM and try a full deduplication run on their local database. Also, I think I won't be working on this script any further.
(end of loolmeh’s comment)

Awesome! Hats off! Can’t wait to try this out.


loolmeh added a commit that referenced this issue Oct 14, 2014

#424 ram optimization for full scan deduplication + minor improvements to report messages + fix for duplicate verbose option

@jiru
Member

jiru commented Oct 14, 2014

I tried to run it on my VM. Here is a report.

I had to install python headers (python-dev package) before running pip install.
I had to run pip install as root.
It installed Django 1.7 (@loolmeh looking at the troubles I ran into, you may have a different version).

First, I got that error: optparse.OptionConflictError: option -v/--verbose-stdout: conflicting option string(s): -v. Sounds like Django already has its -v option that conflicts with the script’s. So I just renamed the script’s -v to -V.

Then I got django.db.utils.ProgrammingError: (1102, "Incorrect database name '/home/vagrant/Tatoeba/versions/2014-08-16--08-52-50/docs/tatoeba2-django/tatoeba.db'"). In settings.py, the DATABASE key is ignored. One must set the database name in NAME.

Then I got django.db.utils.OperationalError: (1054, "Unknown column 'sentences_sentences_lists.id' in 'field list'"). Not sure why it needs that, but I added that column with ALTER TABLE sentences_sentences_lists ADD id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST; and it solved the problem.

Then I got _mysql_exceptions.Warning: Some non-transactional changed tables couldn't be rolled back, which is an error while recovering from an error, so the initial error was hidden and it was difficult to debug. The traceback indicated that it failed within cls.update_merge('FavoritesUsers', main_sent.id, ids, 'favorite_id'), so just to be sure I ran the same SQL command for the favorites_users table and it solved the problem (ALTER TABLE favorites_users ADD id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;).

Then I got that meaningless error again, which this time came from around log_sents_del(). cls.bot.id wasn’t defined. I’m not very sure because I don’t know Python well, but it sounds like the problem was that cls.bot is looked up as a class attribute whereas it’s defined as an object attribute in handle(). So I changed its definition from self.bot = to Dedup.bot = and the script finally completed successfully. 😋
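
A stripped-down illustration of that class-versus-instance attribute difference (the names here only loosely follow the script): an attribute assigned via self inside handle() lives on that one Command instance, so a classmethod on the parent class that looks up cls.bot finds nothing.

# Illustration only: why self.bot set in handle() is invisible to cls.bot.
class Dedup(object):
    @classmethod
    def log_sents_del(cls):
        return cls.bot          # AttributeError unless bot was set on the class

class Command(Dedup):
    def handle(self):
        self.bot = 'bot user'   # instance attribute: not visible as cls.bot
        Dedup.bot = 'bot user'  # class attribute: what the fix does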


@loolmeh
Contributor

loolmeh commented Oct 14, 2014

I had to install python headers (python-dev package) before running pip install.
I had to run pip install as root.

Hmm, it's only needed if a package needs to compile against Python; I guess the MySQL package wants that. And no, you don't need root if you're running things inside a virtualenv. For the record, I have the exact same versions noted in requirements.txt, as it's just frozen from the virtualenv I use for this script.

First, I got that error: optparse.OptionConflictError: option -v/--verbose-stdout: conflicting option string(s):

yeah I fixed that one, noticed it pretty late...

Then I got django.db.utils.ProgrammingError: (1102, "Incorrect database name '/home/vagrant/Tatoeba/versions/2014-08-16--08-52-50/docs/tatoeba2-django/tatoeba.db'")

ah, yeah tatoeba.db is the sqlite file I was running the tests on... yes, it should be the database name I think

Then I got django.db.utils.OperationalError: (1054, "Unknown column 'sentences_sentences_lists.id' in 'field list'")

huh, I wonder why none of my tests caught this. Django needs a primary key defined on the model or it will try to use a default id field for this, so I guess I messed up on one of the models.py files I committed in the repo, looking into this.

cls.bot.id wasn’t defined

hmm, also wondering why this was never caught.... you're right in your assumption, self.bot defined inside Command won't actually be accessible from the parent class Dedup, looking into this.


loolmeh added a commit that referenced this issue Oct 14, 2014

#424 few fixes: NAME setting, unmanaged models handle primary key on lists table, bot is bound to Dedup instead of Command

@loolmeh
Contributor

loolmeh commented Oct 14, 2014

I hope the last 2 commits fix all the issues you mentioned...


@alanfgh
Contributor

alanfgh commented Oct 14, 2014

Would it be possible to add a "dry_run" option to the python script so that it can display/log what it would do without actually making permanent changes to the database? Otherwise, we need to back up the database before running the script and restore it afterwards. We can do that, of course, but it would be nice not to have to, especially if we'll be testing it multiple times.
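
One possible shape for such an option, purely as a sketch (the flag handling here is hypothetical, not the script's actual implementation): every merge helper records what it would do and returns early when the flag is set, so the report is produced without touching the database.

# Hypothetical sketch of a dry-run flag: each write path logs the intended
# change and skips the database update when dry_run is enabled.
class Dedup(object):
    dry_run = False   # set from the command-line option
    report = []       # messages collected for the final report

    @classmethod
    def update_merge(cls, model, main_id, ids, field='sentence_id'):
        cls.report.append('would repoint %s.%s of rows %s to sentence %s'
                          % (model, field, ids, main_id))
        if cls.dry_run:
            return    # report only, leave the database untouched
        # ... the real UPDATE would run here ...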


loolmeh added a commit that referenced this issue Oct 15, 2014

@loolmeh
Contributor

loolmeh commented Oct 15, 2014

Would it be possible to add a "dry_run" option

Done. There's a simple test that confirms it doesn't do anything, but it's possible that I missed logging something important. Let me know if it works in a real run, as I don't have any machine at the moment that won't just hang on me during the table scan step...


@jiru
Member

jiru commented Oct 15, 2014

I hope the last 2 commits fix all the issues you mentioned...

Yes. Can’t we just remove that DATABASE option from settings.py? It’s pretty confusing, because the database name must actually be set in NAME.

I found a new problem. The post produced by the bot on the Wall is not displayed on the Wall because it lacks lft and rght values. As far as I understand, you can just set lft to max(rght)+1, and rght to lft+1.
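
In Django terms, that suggestion could look roughly like the following; Wall is assumed here to be the model behind the wall table, and the import path is hypothetical:

# Sketch of the suggested fix: append the bot's post as a new root node of
# the nested-set tree by giving it lft = max(rght) + 1 and rght = lft + 1.
from django.db.models import Max
from tatoeba2.models import Wall   # hypothetical import path

def place_on_wall(post):
    max_rght = Wall.objects.aggregate(m=Max('rght'))['m'] or 0
    post.lft = max_rght + 1
    post.rght = post.lft + 1
    post.save()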


@jiru
Member

jiru commented Oct 15, 2014

Actually, the favorites_users table had the same double primary key problem, and I fixed it just like you did with sentences_sentences_lists.

I got the following warning during deduplication; it’s harmless, I think, but it can easily spam the logs:

/usr/local/lib/python2.7/dist-packages/django/db/models/fields/__init__.py:1278: RuntimeWarning: DateTimeField Contributions.datetime received a naive datetime (2014-10-15 06:18:32.004965) while time zone support is active.
  RuntimeWarning)

I tested the dry run option, and it still posts comments on sentences if you use -c, and it still posts on the wall if you use -w. It also says "All duplicates removed? NO" in the verification step, which is accurate because it’s a dry run, but it will print YES when used for real. This may be a bit confusing; you may want to rephrase the messages.

Also, I suggest you add a message at the very end which says "this was a DRY RUN" when it is. I think it would be safer to run in dry run mode by default, and to add an option to make it run for real.


@jiru
Member

jiru commented Oct 15, 2014

Oh, and the dry run mode also adds logs to the contributions table that say the sentences were removed, whereas they actually weren’t, since it’s a dry run.


@jiru
Member

jiru commented Oct 15, 2014

After deduplication, the contributions table contains useless records about added sentences.

Here is the relevant part of the contributions table before running the deduplication script:

mysql> select sentence_id, sentence_lang, text, action, user_id, datetime, ip, type, id from contributions where text = 'Duplicated sentence !' order by id desc;
+-------------+---------------+-----------------------+--------+---------+---------------------+----------+----------+-----+
| sentence_id | sentence_lang | text                  | action | user_id | datetime            | ip       | type     | id  |
+-------------+---------------+-----------------------+--------+---------+---------------------+----------+----------+-----+
|          75 | eng           | Duplicated sentence ! | insert |       4 | 2014-10-15 10:12:59 | 10.0.2.2 | sentence | 154 |
|          74 | eng           | Duplicated sentence ! | insert |       1 | 2014-10-15 10:12:33 | 10.0.2.2 | sentence | 153 |
+-------------+---------------+-----------------------+--------+---------+---------------------+----------+----------+-----+

After running the deduplication script:

mysql> select sentence_id, sentence_lang, text, action, user_id, datetime, ip, type, id from contributions where text = 'Duplicated sentence !' order by id desc;
+-------------+---------------+-----------------------+--------+---------+---------------------+----------+----------+-----+
| sentence_id | sentence_lang | text                  | action | user_id | datetime            | ip       | type     | id  |
+-------------+---------------+-----------------------+--------+---------+---------------------+----------+----------+-----+
|          75 | eng           | Duplicated sentence ! | delete |       7 | 2014-10-15 10:13:47 |          | sentence | 156 |
|          74 | eng           | Duplicated sentence ! | insert |       4 | 2014-10-15 10:12:59 | 10.0.2.2 | sentence | 155 |
|          75 | eng           | Duplicated sentence ! | insert |       4 | 2014-10-15 10:12:59 | 10.0.2.2 | sentence | 154 |
|          74 | eng           | Duplicated sentence ! | insert |       1 | 2014-10-15 10:12:33 | 10.0.2.2 | sentence | 153 |
+-------------+---------------+-----------------------+--------+---------+---------------------+----------+----------+-----+

The contribution with id 155 shouldn’t be there.


@loolmeh
Contributor

loolmeh commented Oct 15, 2014

After deduplication, the contributions table contains useless records about added sentences.

This is by design: comments and logs are duplicated instead of updated.

About the dry run option: I'll take care of the deletion records and other duplicate records, but I'm not sure if I should bother handling -w or -c, or the verification step saying yes or no; that would probably make the code look even more complicated. I'm also not sure about dry run being the default; you can go ahead and change that, I guess...

The timezone warning should be fixed by using utcnow() instead of now(), I guess, but I'm not sure whether MySQL messes up when storing timezone-aware records...
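
For the record, the warning disappears once the value handed to Contributions.datetime is timezone-aware; a small sketch of the usual options, assuming pytz is installed:

# Sketch: producing a timezone-aware datetime so Django (with USE_TZ = True)
# stops warning about naive values.
from datetime import datetime
import pytz
from django.utils import timezone

naive = datetime.utcnow()          # still naive: triggers the RuntimeWarning
aware = datetime.now(pytz.UTC)     # timezone-aware, via pytz
also_aware = timezone.now()        # Django's helper, aware when USE_TZ = True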


@loolmeh
Contributor

loolmeh commented Oct 15, 2014

Actually, hmm... looking at your SQL query, I hadn't thought about this. I guess I need to filter only links and updates for duplication in the contributions table; working on it. Or maybe updates (edits) shouldn't be copied either?


loolmeh added a commit that referenced this issue Oct 15, 2014

#424 more fixes: insert merge supports complex filtering, merges on the contributions table is tightly filtered and tightly tested, deletion logging obeys dry run option, datetime objects are all timezone aware, and useless DATABASE setting removed

@loolmeh
Contributor

loolmeh commented Oct 15, 2014

Everything should be in order now. If you guys wanna change how the options work or what exactly is copied from the Contributions table, etc., it's all yours.


@loolmeh
Contributor

loolmeh commented Oct 15, 2014

Oh right, I forgot about the wall post rght/lft thing, someone figure it out for me kthxbye


loolmeh added a commit that referenced this issue Oct 18, 2014

@loolmeh loolmeh added this to the 2014-10-19 milestone Oct 18, 2014

@loolmeh loolmeh closed this Oct 18, 2014

loolmeh added a commit that referenced this issue Oct 19, 2014

#424 handle duplicate rows integrity error bug + more ram optimizations: switch to sha1 hashes as long ints instead of keeping sentence content as strings

@loolmeh
Contributor

loolmeh commented Oct 19, 2014

@alanfgh, can you confirm that the bug you reported yesterday is fixed now? Also, the RAM usage probably changed; it would be nice if you could measure it again. Thanks.


loolmeh added a commit that referenced this issue Oct 19, 2014

#424 reorganize the duplicate row bug fix to cover all update merges and be air tight with tests, also more lying to django

@alanfgh
Contributor

alanfgh commented Oct 20, 2014

@loolmeh, I got this:

File "/var/http/tatoeba/docs/tatoeba2-django/tatoeba2/management/commands/dedu plicate.py", line 18, in
from pytz import UTC as utc
ImportError: No module named pytz
tatoeba@tatovm:~/tatoeba-www/docs/tatoeba2-django$

I only had about one minute to look at this, so I couldn't determine whether I already had it on my machine and, if not, the best way to get it.


@loolmeh
Contributor

loolmeh commented Oct 20, 2014

@alanfgh

pip install -r requirements.txt

I updated the dependencies.
Also, there's some procedure that needs a prod_admin user with lots of privileges, so make sure you have it:

create user prod_admin;
grant all on tatoeba.* to 'prod_admin'@'localhost';

I ran the script multiple times using the -i flag with no problems. Although I don't have any index on (lang, text), it was definitely pretty fast. Once the script is deemed fit, I'd recommend doing a full run on the server and then setting up a cron job for merging every hour or so. Also, logrotate will probably have to be configured for the chosen log directory, as it will probably grow pretty quickly.


@alanfgh
Contributor

alanfgh commented Oct 21, 2014

Pulled from the repository and set up prod_admin. However, the script failed again. Here's the tail of the output:
https://gist.github.com/711025c2a791194442f2.git

Here's the command:
/usr/bin/time python manage.py deduplicate --log-path ~/af-work/ -s -u http://localhost:8080/


@alanfgh
Contributor

alanfgh commented Oct 21, 2014

I think we should reopen the ticket until we're sure the script is running without a problem.


@alanfgh alanfgh reopened this Oct 21, 2014

@loolmeh
Contributor

loolmeh commented Oct 21, 2014

This doesn't really help me know what happened; please remove the transactional decorators and see what actual error makes it fail. As far as I'm concerned, I'm not working on this script anymore. I already spent a day testing on my VM; if there are any more problems with it, it's probably an issue of misuse.


@loolmeh
Contributor

loolmeh commented Oct 21, 2014

Closing this. If anyone can actually show me a bug, have some courtesy, make sure it's a bug, and open a new ticket that provides specific information.


@alanfgh
Contributor

alanfgh commented Oct 23, 2014

jiru fixed the issue with this commit: b31be84


loolmeh added a commit that referenced this issue Oct 24, 2014

#424 #476 all tests run on a freshly synced mysql database: handled yet another mysql quantum behavior by resetting and truncating autoincrements for all tables in the fixtures as transactions don't seem to roll them back... + rearranged dedup and bot fixtures + fixed a very subtle bug caused by ids being returned as long ints from mysql which causes /is/ to fail because of comparisons against default python int (which is short)

loolmeh added a commit that referenced this issue Nov 23, 2014

#424 #504 handle unique row bug without the moronic try/except, this also adds passing test cases for unique rows in the translations, favorites, and list tables

loolmeh added a commit that referenced this issue Nov 26, 2014

#424 #512 #513 more ugly code to handle edge cases: moved things around in update_merge so unique collisions and deletions of them are done in separate functions now + merging links now uses its own thing instead of update_merge, making update_merge lean again as special cases are handled somewhere else + used the dedup fixture inside the new test cases so running them alone with the -k option makes them pass + refactored log merging to use insert_merge indirectly and added an exclusion Q filter to weed out self links in logs

loolmeh added a commit that referenced this issue Dec 14, 2014

#424 #512 #513 refactoring of unique collisions left update_merge vulnerable to lists having duplicates of the same main id being in the same list or favorites list, should be fixed now, also test cases have been updated to catch it

loolmeh added a commit that referenced this issue Dec 14, 2014

#424 #534 and it passes: permute all possible links to form a dense graph then remove all those links if they exist