Benchmark near duplicate detection on the Quora dataset #98

rth · 2017-02-16T14:03:39Z

Quora recently released a dataset of 400,000 potential near-duplicate sentence pairs, with labels indicating whether each pair is indeed a near-duplicate. There is also some work on this dataset done here using deep learning approaches.

It could be interesting to use this dataset as ground truth to evaluate to what extent the near-duplicates can be detected with FreeDiscovery, and to estimate the optimal similarity threshold.
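As a rough illustration of the threshold estimation, here is a minimal sketch of the evaluation loop. It does not use FreeDiscovery's API: a plain word-set Jaccard similarity stands in for whatever similarity score the near-duplicate detection would produce, the toy pairs and labels are made up, and picking the threshold by F1 is only one possible criterion. All function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two sentences.

    Stand-in similarity measure (an assumption, not FreeDiscovery's method).
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def f1_at_threshold(scores, labels, threshold):
    """F1 score when pairs with score >= threshold are predicted duplicates."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pairs, labels, candidates=None):
    """Return the (threshold, f1) pair with the highest F1 among candidates."""
    if candidates is None:
        # Hypothetical grid: 0.05, 0.10, ..., 0.95
        candidates = [i / 20 for i in range(1, 20)]
    scores = [jaccard(q1, q2) for q1, q2 in pairs]
    results = [(t, f1_at_threshold(scores, labels, t)) for t in candidates]
    # On ties, max() keeps the first (lowest) threshold
    return max(results, key=lambda r: r[1])


# Toy, made-up examples; the real dataset would supply the pairs and labels
pairs = [
    ("how do i learn python", "how do i learn python quickly"),
    ("what is machine learning", "is java better than go"),
    ("how to cook rice", "best way to cook rice"),
    ("where is paris", "who wrote hamlet"),
]
labels = [1, 0, 1, 0]  # 1 = near-duplicate, 0 = not

threshold, f1 = best_threshold(pairs, labels)
print(f"best threshold: {threshold:.2f}, F1: {f1:.2f}")
```

With the real dataset, the scores would come from the near-duplicate detection being benchmarked, and the threshold should be chosen on a held-out split rather than on the pairs used to report the final score.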
