Feature/doc similarity ranker #13858

Merged

Conversation

wolliq
Contributor

@wolliq wolliq commented Jun 19, 2023

Description

DocumentSimilarityRanker is a new annotator that uses the LSH (locality-sensitive hashing) techniques available in Spark MLlib to run approximate nearest-neighbour search on top of sentence embeddings. The embeddings capture the semantic meaning of a document in a dense, continuous vector space, and the ranker search operates on that representation.
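For readers unfamiliar with LSH-based approximate nearest-neighbour search, the idea can be sketched in a few lines of plain Python. This is an illustrative toy in the style of bucketed random projection, not the annotator's implementation and not the Spark MLlib API: every function name, the toy embeddings, and the parameters `num_tables` and `bucket_length` are assumptions made for this example. Each hash table projects a vector onto a random unit direction and floors the result into a bucket; documents that share at least one bucket with the query become candidates, and only those candidates are ranked by exact Euclidean distance.

```python
import math
import random
from collections import defaultdict


def random_unit_vectors(dim, num_tables, seed=0):
    """Draw one random unit direction per hash table (hypothetical helper)."""
    rng = random.Random(seed)
    directions = []
    for _ in range(num_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        directions.append([x / norm for x in v])
    return directions


def bucket_ids(vec, directions, bucket_length):
    """Hash a vector: one bucket id per table, floor(dot(vec, direction) / bucket_length)."""
    return [
        math.floor(sum(a * b for a, b in zip(vec, d)) / bucket_length)
        for d in directions
    ]


def build_index(embeddings, directions, bucket_length):
    """Map each bucket id to the document ids that fall into it, per table."""
    tables = [defaultdict(list) for _ in directions]
    for doc_id, vec in embeddings.items():
        for table, b in zip(tables, bucket_ids(vec, directions, bucket_length)):
            table[b].append(doc_id)
    return tables


def approx_neighbours(query, embeddings, tables, directions, bucket_length, k):
    """Return up to k (doc_id, distance) pairs among documents sharing a bucket with the query."""
    candidates = set()
    for table, b in zip(tables, bucket_ids(query, directions, bucket_length)):
        candidates.update(table.get(b, []))
    # Exact distances are computed only for the candidate set, not the whole corpus.
    ranked = sorted((math.dist(query, embeddings[d]), d) for d in candidates)
    return [(d, dist) for dist, d in ranked[:k]]


# Toy 3-dimensional "sentence embeddings" (made-up values for illustration).
embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [-0.7, 0.6, 0.3],
}
directions = random_unit_vectors(dim=3, num_tables=4)
tables = build_index(embeddings, directions, bucket_length=0.5)
print(approx_neighbours(embeddings["doc_a"], embeddings, tables, directions, 0.5, k=2))
```

A larger `bucket_length` or more tables widens the candidate set (better recall, more distance computations); smaller values do the opposite. The search is approximate because a true neighbour that shares no bucket with the query is never scored.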

Motivation and Context

A useful algorithm that scales semantic similarity search on top of well-known embedding representations such as RoBERTa.

How Has This Been Tested?

Locally, and in distributed mode on Databricks 12.2 LTS (Spark 3.3.2).

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Stefano Lori added 30 commits January 21, 2023 14:22
@maziyarpanahi maziyarpanahi changed the base branch from master to release/500-release-candidate June 20, 2023 07:16
@maziyarpanahi maziyarpanahi self-assigned this Jul 1, 2023
@maziyarpanahi maziyarpanahi added the new-feature Introducing a new feature label Jul 1, 2023
@coveralls
coveralls commented Jul 1, 2023

Pull Request Test Coverage Report for Build 5432842439

  • 1 of 163 (0.61%) changed or added relevant lines in 4 files are covered.
  • 117 unchanged lines in 34 files lost coverage.
  • Overall coverage decreased (-0.8%) to 65.106%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/main/scala/com/johnsnowlabs/nlp/annotators/similarity/DocumentSimilarityRankerModel.scala | 0 | 19 | 0.0% |
| src/main/scala/com/johnsnowlabs/nlp/finisher/DocumentSimilarityRankerFinisher.scala | 0 | 61 | 0.0% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/similarity/DocumentSimilarityRankerApproach.scala | 0 | 82 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| src/main/scala/com/johnsnowlabs/nlp/annotators/common/TableData.scala | 1 | 85.19% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/cv/util/io/ImageIOUtils.scala | 1 | 51.35% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcher.scala | 1 | 96.26% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/DateMatcherUtils.scala | 1 | 87.96% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/er/AhoCorasickAutomaton.scala | 1 | 96.7% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/ld/dl/LanguageDetectorDL.scala | 1 | 66.27% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfModel.scala | 1 | 69.23% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/ner/dl/LoadsContrib.scala | 1 | 11.54% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/ner/NerTagsEncoding.scala | 1 | 70.45% |
| src/main/scala/com/johnsnowlabs/nlp/annotators/pos/perceptron/PerceptronApproachDistributed.scala | 1 | 82.24% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 5308728546: | -0.8% |
| Covered Lines: | 8652 |
| Relevant Lines: | 13289 |

💛 - Coveralls

@maziyarpanahi maziyarpanahi merged commit 08cad55 into release/500-release-candidate Jul 3, 2023
7 checks passed