Add index to speed up incremental deduplication
The deduplication script says that we need that index
when using the -i option:

$ ./manage.py deduplicate --help
[...]
  -i SINCE, --incremental-scan=SINCE
                        attempts deduplication using an incremental table scan
                        with 1 query filtering sentences
                        added between now and date `d` in `yyyy-mm-dd` format
                        or as a time delta `{n}y {n}m {n}d {n}h {n}min {n}s
                        ago`                  , then a query per row to find
                        duplicates. DO NOT USE THIS WITHOUT A (text, lang)
                        INDEX.

I tried to deduplicate 325 sentences after running 'RESET QUERY CACHE;'.
Without the index, it took about 8m40s. With the index, about 1s.

Refs #1722.
jiru committed Mar 4, 2019
1 parent 15d2d3b commit d7ba14d
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions docs/database/updates/2019-03-04.sql
@@ -0,0 +1 @@
CREATE INDEX dedup_idx on sentences (text, lang);
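
The effect of this index on the per-row duplicate lookup can be illustrated with a small sketch. It is not the project's code: it uses Python's built-in sqlite3 module rather than the production database, a hypothetical three-column `sentences` table, and an assumed shape for the lookup query (equality on `text` and `lang`). It only shows that the query planner switches from a full table scan to an index search once `dedup_idx` exists.

```python
import sqlite3

# Hypothetical, minimal stand-in for the real `sentences` table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT, lang TEXT)"
)
conn.executemany(
    "INSERT INTO sentences (text, lang) VALUES (?, ?)",
    [("Hello.", "eng"), ("Bonjour.", "fra"), ("Hello.", "eng")],
)

# Assumed shape of the per-row lookup issued during an incremental scan.
LOOKUP = "SELECT id FROM sentences WHERE text = ? AND lang = ?"


def query_plan(connection):
    """Return SQLite's textual query plan for the lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN " + LOOKUP, ("Hello.", "eng")
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)

# Same statement as in the commit, applied to the stand-in table.
conn.execute("CREATE INDEX dedup_idx ON sentences (text, lang)")

plan_after = query_plan(conn)
duplicate_ids = sorted(
    row[0] for row in conn.execute(LOOKUP, ("Hello.", "eng"))
)

print("before:", plan_before)
print("after: ", plan_after)
print("ids sharing (text, lang):", duplicate_ids)
```

Without the index, every lookup has to read the whole table, so the cost of the incremental scan grows with the number of new sentences multiplied by the table size. That is consistent with the timings reported above, although the exact plans and numbers on the real database will differ from this SQLite sketch.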
