# Document Similarity Ranker for Spark NLP
### Efficient approximate nearest neighbor search on top of sentence embeddings

In [1]:
# Import Spark NLP classes
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline

In [2]:
# Create the PySpark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.0.0")\
    .getOrCreate()




In [3]:
# Use a small dataset where similarity can be checked by eye
# Documents come in pairs (1-2, 3-4, 5-6, 7-8) that were deliberately written to be similar
data = spark.createDataFrame(
        [
            ["First document, this is my first sentence. This is my second sentence."],
            ["Second document, this is my second sentence. This is my second sentence."],
            ["Third document, climate change is arguably one of the most pressing problems of our time."],
            ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
            ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
            ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
            ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
            ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
        ]
    ).toDF("text")

In [4]:
data.show(10, False)

                                                                                

+------------------------------------------------------------------------------------------------------+
|text                                                                                                  |
+------------------------------------------------------------------------------------------------------+
|First document, this is my first sentence. This is my second sentence.                                |
|Second document, this is my second sentence. This is my second sentence.                              |
|Third document, climate change is arguably one of the most pressing problems of our time.             |
|Fourth document, climate change is definitely one of the most pressing problems of our time.          |
|Fifth document, Florence in Italy, is among the most beautiful cities in Europe.                      |
|Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France.            |
|Seventh document, the French Riviera is the Mediterran

## A document similarity ranker pipeline
### The document similarity ranker works downstream of annotators that generate sentence embeddings. In this example we'll use RoBertaSentenceEmbeddings.
The pipeline uses the following stages:
- document_assembler to turn the raw text into document annotations
- sentence_embeddings to create the sentence embeddings the ranker consumes
- document_similarity_ranker to find the similar documents, as configured on the annotator
- document_similarity_ranker_finisher to extract the columns of interest from the ranker output

No sentence_detector or tokenizer stage is used here, because the embeddings are computed directly on the document column.

## DocumentSimilarityRankerApproach: input parameter setters overview
- setInputCols("sentence_embeddings") : sets the input column holding the sentence embeddings
- setOutputCol("doc_similarity_rankings") : sets the output column
- setSimilarityMethod("brp") : selects the LSH method (brp|mh) used for the approximate nearest neighbours search
- setNumberOfNeighbours(10) : sets the desired number of similar documents for each document in the set
- setBucketLength(2.0) : LSH parameter that controls the average size of the hash buckets and can be tuned to improve recall
- setNumHashTables(3) : LSH parameter that controls the number of hash tables used in LSH OR-amplification and can be tuned to improve recall
- setVisibleDistances(True) : makes the distances visible in the result, useful for debugging
- setIdentityRanking(False) : when set to True, makes each document's identity distance (0.0) to itself visible in the result, useful for debugging
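
To make the roles of the bucket length and the number of hash tables more concrete, here is a minimal pure-Python sketch of bucketed random projection hashing. It is not Spark NLP's implementation: the names `make_projections`, `brp_hash` and `is_candidate` are hypothetical, the vectors are toy values, and the hash formula (floor of the projection divided by the bucket length) is assumed to mirror the usual definition of this LSH family.

```python
import math
import random


def make_projections(dim, num_hash_tables, seed=42):
    """Draw one random unit vector per hash table (hypothetical helper)."""
    rng = random.Random(seed)
    projections = []
    for _ in range(num_hash_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        projections.append([c / norm for c in v])
    return projections


def brp_hash(vector, projections, bucket_length):
    """Return one bucket index per hash table: floor(dot(x, v) / bucket_length)."""
    return [
        math.floor(sum(x * p for x, p in zip(vector, proj)) / bucket_length)
        for proj in projections
    ]


def is_candidate(hash_a, hash_b):
    """OR-amplification: a pair is kept if ANY table puts both in the same bucket."""
    return any(a == b for a, b in zip(hash_a, hash_b))


# Toy 4-dimensional "embeddings", assumed for illustration only
projections = make_projections(dim=4, num_hash_tables=3)
doc_a = [0.10, 0.20, 0.30, 0.40]
doc_b = [0.11, 0.19, 0.31, 0.41]

hash_a = brp_hash(doc_a, projections, bucket_length=2.0)
hash_b = brp_hash(doc_b, projections, bucket_length=2.0)
print(hash_a, hash_b, is_candidate(hash_a, hash_b))
```

In this sketch a larger `bucket_length` makes each bucket wider, so more pairs collide in a given table, while a larger `num_hash_tables` gives every pair more chances to collide at least once. Both raise recall at the cost of more candidate pairs to compare.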

## DocumentSimilarityRankerFinisher: output parameters overview
- setInputCols("doc_similarity_rankings") : reads the ranker's result column to extract IDs and distances
- setOutputCols(
            "finished_doc_similarity_rankings_id",
            "finished_doc_similarity_rankings_neighbors") : names the output columns holding the query document ID and the neighbor documents returned by the search

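If the neighbors need further processing in plain Python, the rendered values can be parsed into tuples. The sketch below assumes the neighbors column is rendered as a string of `(id,distance)` pairs, as it appears in the tables in this notebook. `parse_neighbors` is a hypothetical helper, not part of Spark NLP.

```python
import re

# Matches "(id,distance)" pairs such as "(1634839239,0.12448559273510636)"
_PAIR = re.compile(
    r"\(\s*(-?\d+)\s*,\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*\)"
)


def parse_neighbors(rendered):
    """Turn a rendered neighbors value into a list of (id, distance) tuples."""
    return [(int(doc_id), float(dist)) for doc_id, dist in _PAIR.findall(rendered)]


print(parse_neighbors("[(1634839239,0.12448559273510636)]"))
```
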
In [7]:
from pyspark.ml import Pipeline
from sparknlp.annotator.similarity.document_similarity_ranker import *

document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")

sentence_embeddings = RoBertaSentenceEmbeddings.pretrained() \
            .setInputCols(["document"]) \
            .setOutputCol("sentence_embeddings")

document_similarity_ranker = DocumentSimilarityRankerApproach() \
            .setInputCols("sentence_embeddings") \
            .setOutputCol("doc_similarity_rankings") \
            .setSimilarityMethod("brp") \
            .setNumberOfNeighbours(1) \
            .setBucketLength(2.0) \
            .setNumHashTables(3) \
            .setVisibleDistances(True) \
            .setIdentityRanking(False)

document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
        .setInputCols("doc_similarity_rankings") \
        .setOutputCols(
            "finished_doc_similarity_rankings_id",
            "finished_doc_similarity_rankings_neighbors") \
        .setExtractNearestNeighbor(True)

pipeline = Pipeline(stages=[
            document_assembler,
            sentence_embeddings,
            document_similarity_ranker,
            document_similarity_ranker_finisher
        ])

# Fit the pipeline and transform the data; the result is a DataFrame with the rankings
docSimRankerPipeline = pipeline.fit(data).transform(data)
# TODO add write/read pipeline
(
    docSimRankerPipeline
        .select(
               "finished_doc_similarity_rankings_id",
               "finished_doc_similarity_rankings_neighbors"
        ).show(10, False)
)

sent_roberta_base download started this may take some time.
Approximate size to download 284.8 MB
[OK!]


                                                                                

+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1510101612,0.0)]                        |
|1634839239                         |[(1634839239,0.0)]                        |
|-612640902                         |[(-612640902,0.0)]                        |
|1274183715                         |[(1274183715,0.0)]                        |
|-1320876223                        |[(-1320876223,0.0)]                       |
|1293373212                         |[(1293373212,0.0)]                        |
|-1548374770                        |[(-1548374770,0.0)]                       |
|-1719102856                        |[(-1719102856,0.0)]                       |
+-----------------------------------+------------------------------------------+



## Result analysis for consistent result confirmation
#### The test confirms the initial hypothesis. The documents were created to be similar in pairs: 1-2, 3-4, 5-6, 7-8.
For instance, documents 1 and 2 are detected as mutual best neighbors, at exactly the same distance:
- document ID 1510101612 has its most similar document in (1634839239,0.12448559273510636), at distance 0.12448559273510636
- document ID 1634839239 has its most similar document in (1510101612,0.12448559273510636), at distance 0.12448559273510636
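
The mutual-neighbor check described above can also be done programmatically. The following is a minimal sketch in plain Python, assuming the nearest neighbor of each document has already been collected into a dictionary. `mutual_pairs` is a hypothetical helper, and the only data used are the two rankings quoted above.

```python
def mutual_pairs(nearest):
    """Return the sorted list of ID pairs that are each other's nearest neighbor.

    `nearest` maps a document ID to a (neighbor ID, distance) tuple.
    """
    pairs = set()
    for doc_id, (neighbor_id, _distance) in nearest.items():
        back = nearest.get(neighbor_id)
        # Skip identity rankings and one-directional matches
        if back is not None and back[0] == doc_id and doc_id != neighbor_id:
            pairs.add(tuple(sorted((doc_id, neighbor_id))))
    return sorted(pairs)


nearest = {
    1510101612: (1634839239, 0.12448559273510636),
    1634839239: (1510101612, 0.12448559273510636),
}
print(mutual_pairs(nearest))  # [(1510101612, 1634839239)]
```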

#### If we set the ranker like so
```
document_similarity_ranker = DocumentSimilarityRankerApproach() \
            .setInputCols("sentence_embeddings") \
            .setOutputCol("doc_similarity_rankings") \
            .setSimilarityMethod("brp") \
            .setNumberOfNeighbours(1) \
            .setBucketLength(2.0) \
            .setNumHashTables(3) \
            .setVisibleDistances(True) \
            .setIdentityRanking(True)
```

we can also check that each document is at distance 0.0 from itself:

```
+-----------------------------------+------------------------------------------+
|finished_doc_similarity_rankings_id|finished_doc_similarity_rankings_neighbors|
+-----------------------------------+------------------------------------------+
|1510101612                         |[(1510101612,0.0)]                        |
|1634839239                         |[(1634839239,0.0)]                        |
|-612640902                         |[(-612640902,0.0)]                        |
|1274183715                         |[(1274183715,0.0)]                        |
|-1320876223                        |[(-1320876223,0.0)]                       |
|1293373212                         |[(1293373212,0.0)]                        |
|-1548374770                        |[(-1548374770,0.0)]                       |
|-1719102856                        |[(-1719102856,0.0)]                       |
+-----------------------------------+------------------------------------------+
```