Add index to speed up incremental deduplication · Tatoeba/tatoeba2@d7ba14d

Add index to speed up incremental deduplication

The deduplication script says that we need that index
when using the -i option:

$ ./manage.py deduplicate --help
[...]
  -i SINCE, --incremental-scan=SINCE
                        attempts deduplication using an incremental table scan
                        with 1 query filtering sentences added between now and
                        date `d` in `yyyy-mm-dd` format or as a time delta
                        `{n}y {n}m {n}d {n}h {n}min {n}s ago`, then a query
                        per row to find duplicates. DO NOT USE THIS WITHOUT A
                        (text, lang) INDEX.
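
The scan pattern described in the help text (one filtering query, then one lookup per row) can be sketched roughly as below. This is only an illustration using Python's built-in `sqlite3` module and an in-memory table, not the project's actual deduplication code: the `created` column, the sample rows, and the `find_duplicates` helper are all assumptions made for the example.

```python
import sqlite3


def find_duplicates(conn, since):
    """Hypothetical helper: one query selects recently added sentences,
    then one query per row looks for other rows with the same (text, lang)."""
    recent = conn.execute(
        "SELECT id, text, lang FROM sentences WHERE created >= ? ORDER BY id",
        (since,),
    ).fetchall()

    duplicates = {}
    for sent_id, text, lang in recent:
        # This per-row lookup is the query that benefits from a (text, lang) index.
        rows = conn.execute(
            "SELECT id FROM sentences "
            "WHERE text = ? AND lang = ? AND id != ? ORDER BY id",
            (text, lang, sent_id),
        ).fetchall()
        if rows:
            duplicates[sent_id] = [row[0] for row in rows]
    return duplicates


# Assumed toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences "
    "(id INTEGER PRIMARY KEY, text TEXT, lang TEXT, created TEXT)"
)
conn.executemany(
    "INSERT INTO sentences (id, text, lang, created) VALUES (?, ?, ?, ?)",
    [
        (1, "Hello.", "eng", "2019-01-01"),
        (2, "Bonjour.", "fra", "2019-01-02"),
        (3, "Hello.", "eng", "2019-03-03"),
        (4, "Hello.", "deu", "2019-03-03"),
    ],
)
conn.execute("CREATE INDEX dedup_idx ON sentences (text, lang)")

result = find_duplicates(conn, "2019-03-01")
print(result)
```

In this toy data set, sentence 3 duplicates sentence 1, while sentence 4 has the same text in a different language and is not reported. Without an index, each per-row lookup has to scan the whole table, which is why the help text insists on the index.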

I tried to deduplicate 325 sentences after running 'RESET QUERY CACHE;'.
Without the index, it took about 8m40s. With the index, about 1s.
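
As a rough back-of-the-envelope reading of these figures, the per-sentence cost can be worked out as follows. The numbers are the approximate timings quoted above; attributing all of the elapsed time to the per-row duplicate queries is an assumption made for this sketch.

```python
sentences = 325
without_index_s = 8 * 60 + 40  # "about 8m40s", expressed in seconds
with_index_s = 1               # "about 1s"

# Assumption: the whole run time is spent in the per-row duplicate lookups.
per_row_without = without_index_s / sentences
per_row_with = with_index_s / sentences
speedup = without_index_s / with_index_s

print(f"{per_row_without:.2f} s per sentence without the index")
print(f"{per_row_with * 1000:.1f} ms per sentence with the index")
print(f"about {speedup:.0f}x faster overall")
```

Under that assumption, each lookup drops from roughly 1.6 s to roughly 3 ms.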

Refs #1722.

jiru committed Mar 4, 2019

1 parent 15d2d3b commit d7ba14d

docs/database/updates/2019-03-04.sql

@@ -0,0 +1 @@
+CREATE INDEX dedup_idx on sentences (text, lang);
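
One way to check that such an index is actually picked up by the duplicate lookup is to compare the query plan before and after creating it. The sketch below does this with Python's `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`, purely as a stand-in: the table layout and the `plan` helper are assumptions, and on the MySQL side the comparable tool would be `EXPLAIN SELECT ...`, whose output looks different.

```python
import sqlite3


def plan(conn, sql, params):
    """Hypothetical helper: return SQLite's query plan for `sql` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


# Assumed minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT, lang TEXT)")

lookup = "SELECT id FROM sentences WHERE text = ? AND lang = ?"
params = ("Hello.", "eng")

plan_before = plan(conn, lookup, params)  # expected to be a full table scan
conn.execute("CREATE INDEX dedup_idx ON sentences (text, lang)")
plan_after = plan(conn, lookup, params)   # expected to mention dedup_idx

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, so the thing to look for is simply whether `dedup_idx` appears in the second plan.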
