![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/21.1.Document_Similarity_Ranker.ipynb)

# **Document Similarity Ranker**

This notebook will cover the different parameters and usages of `DocumentSimilarityRankerApproach` and `DocumentSimilarityRankerFinisher`.

**📖 Learning Objectives:**

1. Background: Understand the `DocumentSimilarityRankerApproach` and `DocumentSimilarityRankerFinisher` annotators.

2. Colab setup.

3. Become comfortable with using the different parameters of the annotator.




## **🎬 Colab Setup**

In [None]:
# Installing pyspark and spark-nlp
!pip install --upgrade -q pyspark==3.4.1 spark-nlp

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.annotator.similarity.document_similarity_ranker import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp.start(params=params)

print("Spark NLP Version :", sparknlp.version())

spark

Spark NLP Version : 5.3.3


# **Document Similarity Ranker Approach**

## **📜 Background**

`DocumentSimilarityRankerApproach` is an annotator that uses the locality-sensitive hashing (LSH) techniques available in Spark MLlib to run approximate nearest-neighbour search over sentence embeddings.

The sentence embeddings capture the semantic meaning of each document in a dense, continuous vector space, and the ranker searches that space for the documents closest to each one.
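
To make the LSH idea more concrete, the following is a minimal pure-Python sketch of the kind of hash that Bucketed Random Projection relies on: project a vector onto a direction, divide by the bucket length, and take the floor. The helper name `brp_hash`, the fixed `direction`, and the toy 2-D vectors are assumptions made for illustration; they are not part of the Spark NLP or Spark ML API, and Spark ML draws its projection directions at random.

```python
import math

def brp_hash(vector, direction, bucket_length):
    """Return the bucket index of `vector` for one projection direction."""
    projection = sum(x * d for x, d in zip(vector, direction))
    return math.floor(projection / bucket_length)

# Hypothetical unit-length direction (Spark ML would pick this randomly).
direction = (0.6, 0.8)

# Toy 2-D "embeddings": doc_a and doc_b are close, doc_c is far away.
doc_a = (1.0, 1.0)
doc_b = (1.1, 0.9)
doc_c = (5.0, 5.0)

for name, vec in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, "-> bucket", brp_hash(vec, direction, bucket_length=2.0))
```

In this toy setup the two nearby vectors fall into the same bucket while the distant one does not, which is the property that lets LSH restrict the neighbour search to a small set of candidates.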

**🔗 Helpful Links:**

- Python Docs : [DocumentSimilarityRankerApproach](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/similarity/document_similarity_ranker/index.html#sparknlp.annotator.similarity.document_similarity_ranker.DocumentSimilarityRankerApproach)

- Scala Docs: [DocumentSimilarityRankerApproach](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/similarity/DocumentSimilarityRankerApproach)

- For extended examples of usage, see [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master).

## **🖨️ Input/Output Annotation Types**



- Input: `SENTENCE_EMBEDDINGS`

- Output: `DOC_SIMILARITY_RANKINGS`

## **🔎 Parameters**


- `similarityMethod`: The similarity method used to find the neighbours. (Default: `brp`, Bucketed Random Projection for Euclidean distance)
- `numberOfNeighbours`: The number of neighbours the model returns for each document. (Default: `10`)
- `visibleDistances`: Whether to include the distance to each neighbour in the ranking output. (Default: `false`)
- `identityRanking`: Whether to include the document itself in its own ranking result. Useful for debugging. (Default: `false`)
- `numHashTables`: Number of hash tables. Increasing it lowers the false negative rate; decreasing it improves running performance. (Default: `3`)
- `enableCaching`: Whether to cache DataFrames or RDDs during training.
- `bucketLength`: Controls the average size of hash buckets. A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket, which increases the number of both true and false positives. (Default: `2.0`)
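
The effect of `bucketLength` can be illustrated with a small pure-Python sketch. It assumes each document has already been reduced to a single projected value (the `projections` dictionary below is made up for illustration) and counts how many document pairs share a bucket for a narrow and a wide bucket length. The helper names are hypothetical and not part of any library.

```python
import math

def bucket_index(projection, bucket_length):
    """Bucket that a projected value falls into."""
    return math.floor(projection / bucket_length)

# Hypothetical 1-D projections of four document embeddings.
projections = {"doc1": 0.3, "doc2": 0.9, "doc3": 2.4, "doc4": 5.1}

def colliding_pairs(bucket_length):
    """All pairs of documents that hash to the same bucket."""
    names = sorted(projections)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if bucket_index(projections[a], bucket_length)
        == bucket_index(projections[b], bucket_length)
    ]

narrow = colliding_pairs(0.5)  # many small buckets
wide = colliding_pairs(3.0)    # few large buckets

print("bucketLength=0.5:", narrow)
print("bucketLength=3.0:", wide)
```

With the narrow setting no pair collides, while the wide setting groups three of the four documents together: more candidates are found, at the cost of more pairs that may turn out not to be similar.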

In [3]:
document_similarity_ranker = DocumentSimilarityRankerApproach()
print(document_similarity_ranker.explainParams())

asRetrieverQuery: Whether to set the model as retriever RAG with a specific query string.(Default: `empty`) (default: )
bucketLength: The bucket length that controls the average size of hash buckets. A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket (increasing the numbers of true and false positives). (default: 2.0)
enableCaching: Whether to enable caching DataFrames or RDDs during the training (undefined)
identityRanking: Whether to include identity in ranking result set. Useful for debug. (Default: `false`). (default: False)
inputCols: previous annotations columns, if renamed (undefined)
lazyAnnotator: Whether this AnnotatorModel acts as lazy in RecursivePipelines (default: False)
numHashTables: number of hash tables, where increasing number of hash tables lowers the false negative rate,and decreasing it improves the running performance. (default: 3)
numberOfNeighbours: The number of neighbours the model will return (D

## **💻Pipeline**

In [4]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_embeddings = E5Embeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setVisibleDistances(False)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        document_similarity_ranker,
])



e5_small download started this may take some time.
Approximate size to download 76.2 MB
[OK!]


In [5]:
data = spark.createDataFrame([
    ["First document, this is my first sentence. This is my second sentence."],
    ["Second document, this is my second sentence. This is my second sentence."],
    ["Third document, climate change is arguably one of the most pressing problems of our time."],
    ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
    ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
    ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
    ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
    ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
]).toDF("text")

In [6]:
result = pipeline.fit(data).transform(data)

result.selectExpr("text", "doc_similarity_rankings.metadata[0].lshId as id",
                  "doc_similarity_rankings.metadata[0].lshNeighbors as neighbors ").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------+---------------------------------------------------------------------------------+
|                                                                                                text|         id|                                                                        neighbors|
+----------------------------------------------------------------------------------------------------+-----------+---------------------------------------------------------------------------------+
|                              First document, this is my first sentence. This is my second sentence.| 1510101612|[1634839239,-612640902,1274183715,-1320876223,-1719102856,1293373212,-1548374770]|
|                            Second document, this is my second sentence. This is my second sentence.| 1634839239|[1510101612,-612640902,1274183715,-1320876223,1293373212,-1719102856,-1548374770]|
|           Thi

### ▶ `numberOfNeighbours`

 The number of neighbours the model will return (Default:`10`)
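
As a point of reference, the sketch below computes the exact `k` nearest neighbours by brute force using Euclidean distance. The ranker approximates this kind of result with LSH instead of comparing every pair. The function name `k_nearest` and the toy 2-D embeddings are assumptions for illustration only.

```python
import math

def k_nearest(query_id, embeddings, k):
    """Ids of the k embeddings closest to `query_id`, excluding itself."""
    query = embeddings[query_id]
    ranked = sorted(
        (math.dist(query, vec), other_id)
        for other_id, vec in embeddings.items()
        if other_id != query_id
    )
    return [other_id for _, other_id in ranked[:k]]

# Toy 2-D embeddings keyed by a hypothetical document id.
embeddings = {
    "a": (0.0, 0.0),
    "b": (0.1, 0.0),
    "c": (1.0, 1.0),
    "d": (3.0, 4.0),
}

print(k_nearest("a", embeddings, k=2))  # ['b', 'c']
```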

In [7]:
# We use a small dataset in which similar documents are easy to spot by eye.
# Documents are deliberately grouped by meaning: 1-2-9, 3-4, 5-6, 7-8

data = spark.createDataFrame([
    ["The sun rose above the horizon, casting its golden light across the landscape."],
    ["The dawn broke, painting the sky with hues of gold as the sun ascended."],
    ["The children played gleefully in the meadow, their laughter echoing through the open fields."],
    ["The kids romped happily in the grassy field, their giggles resonating across the countryside"],
    ["The old oak tree stood tall and proud, its branches reaching towards the sky in a display of strength."],
    ["The ancient oak towered with dignity, its boughs stretching skyward, a symbol of resilience."],
    ["The cat purred contentedly as it curled up on the cozy blanket, enjoying the warmth."],
    ["The kitty happily snuggled on the soft blanket, relishing the comfort it provided."],
    ["The sun is rising above the horizon, casting its golden light across the landscape."]
]).toDF("text")

In [8]:
document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        document_similarity_ranker
])

result = pipeline.fit(data).transform(data)

result.selectExpr("text", "doc_similarity_rankings.metadata[0].lshId as id",
                  "doc_similarity_rankings.metadata[0].lshNeighbors as neighbors ").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------+-----------------------------------+
|                                                                                                text|         id|                          neighbors|
+----------------------------------------------------------------------------------------------------+-----------+-----------------------------------+
|                      The sun rose above the horizon, casting its golden light across the landscape.| -726629134| [-304978848,1841831092,1913467945]|
|                             The dawn broke, painting the sky with hues of gold as the sun ascended.| 1841831092| [-726629134,-304978848,1913467945]|
|        The children played gleefully in the meadow, their laughter echoing through the open fields.| -508401286| [-651225601,-304978848,-726629134]|
|        The kids romped happily in the grassy field, their giggles resonating across the coun

### ▶ `visibleDistances`:

Whether to include the distance to each neighbour in the ranking output (Default: `false`)
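
In the printed outputs of this notebook, neighbours appear either as bare ids or, with distances visible, as `(id, distance)` pairs. The sketch below turns either form into a uniform Python list. It assumes the `lshNeighbors` metadata value arrives as a string shaped like those printed outputs; the helper name `parse_neighbors` is hypothetical.

```python
import ast

def parse_neighbors(raw):
    """Parse a neighbours string into a list of (id, distance) tuples.

    Bare ids (no distances) are paired with None.
    """
    parsed = ast.literal_eval(raw)
    return [item if isinstance(item, tuple) else (item, None) for item in parsed]

# Strings copied from the shape of the outputs shown in this notebook.
with_distances = parse_neighbors(
    "[(-304978848,0.17535143570234635),(1841831092,0.4471054276126468)]"
)
ids_only = parse_neighbors("[1634839239,-612640902]")

print(with_distances)
print(ids_only)
```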

In [9]:
document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3) \
    .setVisibleDistances(True)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        document_similarity_ranker,
])

result = pipeline.fit(data).transform(data)

result.selectExpr("text", "doc_similarity_rankings.metadata[0].lshId as id",
                  "doc_similarity_rankings.metadata[0].lshNeighbors as neighbors ").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------+--------------------------------------------------------------------------------------------------+
|                                                                                                text|         id|                                                                                         neighbors|
+----------------------------------------------------------------------------------------------------+-----------+--------------------------------------------------------------------------------------------------+
|                      The sun rose above the horizon, casting its golden light across the landscape.| -726629134| [(-304978848,0.17535143570234635),(1841831092,0.4471054276126468),(1913467945,0.538092155447108)]|
|                             The dawn broke, painting the sky with hues of gold as the sun ascended.| 1841831092| [(-726629134,0.44710542761264

### ▶ `identityRanking`:

Whether to include identity in ranking result set. Useful for debug. (Default: `false`)

In [10]:
document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(True)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        document_similarity_ranker
])

result = pipeline.fit(data).transform(data)

result.selectExpr("text", "doc_similarity_rankings.metadata[0].lshId as id",
                  "doc_similarity_rankings.metadata[0].lshNeighbors as neighbors ").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------------------------+
|                                                                                                text|         id|                                                                          neighbors|
+----------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------------------------+
|                      The sun rose above the horizon, casting its golden light across the landscape.| -726629134|[(-726629134,0.0),(-304978848,0.17535143570234635),(1841831092,0.4471054276126468)]|
|                             The dawn broke, painting the sky with hues of gold as the sun ascended.| 1841831092| [(1841831092,0.0),(-726629134,0.4471054276126468),(-304978848,0.4501290794071544)]|
|    

The `neighbors` column now includes the item itself as a tuple in the result list, with its own id and a distance of 0.0.

### ▶ `numHashTables`:

Number of hash tables, where increasing number of hash tables lowers the
false negative rate, and decreasing it improves the running performance.
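
The trade-off can be sketched in pure Python. The example assumes that two documents become candidate neighbours when they share a bucket in at least one hash table, and it uses fixed, hand-picked projection directions instead of random ones so the result is reproducible. All names here are hypothetical.

```python
import math

def hash_signature(vector, directions, bucket_length):
    """One bucket index per hash table (one table per direction)."""
    return [
        math.floor(sum(x * d for x, d in zip(vector, direction)) / bucket_length)
        for direction in directions
    ]

def is_candidate_pair(sig_a, sig_b):
    """A pair is a candidate if any hash table puts both in the same bucket."""
    return any(a == b for a, b in zip(sig_a, sig_b))

# Two nearby toy embeddings that straddle a bucket boundary on the x-axis.
doc_a = (0.9, 0.5)
doc_b = (1.1, 0.6)

directions = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]  # hypothetical directions

one_table = is_candidate_pair(
    hash_signature(doc_a, directions[:1], 1.0),
    hash_signature(doc_b, directions[:1], 1.0),
)
three_tables = is_candidate_pair(
    hash_signature(doc_a, directions, 1.0),
    hash_signature(doc_b, directions, 1.0),
)

print("1 hash table :", one_table)
print("3 hash tables:", three_tables)
```

With a single table the close pair is missed (a false negative); with three tables one of them catches it, at the cost of computing and comparing more hashes.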
    

In [11]:
# numHashTables is decreased from its default of 3 to 1
document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(2) \
    .setNumHashTables(1) \
    .setVisibleDistances(True) \
    .setIdentityRanking(False)


pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        document_similarity_ranker,
])

result = pipeline.fit(data).transform(data)


result.selectExpr("text", "doc_similarity_rankings.metadata[0].lshId as id",
                  "doc_similarity_rankings.metadata[0].lshNeighbors as neighbors ").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-----------+------------------------------------------------------------------+
|                                                                                                text|         id|                                                         neighbors|
+----------------------------------------------------------------------------------------------------+-----------+------------------------------------------------------------------+
|                      The sun rose above the horizon, casting its golden light across the landscape.| -726629134|[(-304978848,0.17535143570234635),(1841831092,0.4471054276126468)]|
|                             The dawn broke, painting the sky with hues of gold as the sun ascended.| 1841831092| [(-726629134,0.4471054276126468),(-304978848,0.4501290794071544)]|
|        The children played gleefully in the meadow, their laughter echoing through the o

# **DocumentSimilarityRankerFinisher**



## **📜 Background**

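`DocumentSimilarityRankerFinisher` post-processes the output of `DocumentSimilarityRankerApproach`: it reads the document similarity ranker results and extracts them into plain DataFrame columns, such as the document id and its ranked neighbours. For usage and examples, see the documentation of the main class.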

## **🖨️ Input/Output Annotation Types**

- Input: `DOC_SIMILARITY_RANKINGS`

- Output: plain DataFrame columns named by `outputCols` (in the pipeline below, the document id and its ranked neighbours)


## **🔎 Parameters**

- `extractNearestNeighbor`: Whether to extract the nearest neighbor document
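
As an illustration of what "nearest neighbor" means for a single row, the sketch below picks the closest entry from a list of `(id, distance)` pairs while skipping the document itself. It is a plain-Python illustration under that assumption, not the finisher's actual implementation, and the helper name `nearest_neighbor` is hypothetical. The sample values are copied from the first output row shown in this notebook.

```python
def nearest_neighbor(doc_id, neighbors):
    """Return (id, distance) of the closest neighbour other than doc_id."""
    candidates = [(dist, nid) for nid, dist in neighbors if nid != doc_id]
    if not candidates:
        return None
    dist, nid = min(candidates)
    return nid, dist

# Values taken from the first output row printed in this notebook.
doc_id = -726629134
neighbors = [
    (-726629134, 0.0),
    (-304978848, 0.17535143570234635),
    (1841831092, 0.4471054276126468),
]

print(nearest_neighbor(doc_id, neighbors))
```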

In [12]:
document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher()
print(document_similarity_ranker_finisher.explainParams())

extractNearestNeighbor: whether to extract the nearest neighbor document (default: False)
inputCols: name of input annotation cols containing document similarity ranker results (undefined)
outputCols: output DocumentSimilarityRankerFinisher output cols (undefined)


## **💻Pipeline**

In [13]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_embeddings = E5Embeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

document_similarity_ranker = DocumentSimilarityRankerApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("doc_similarity_rankings") \
    .setSimilarityMethod("brp") \
    .setNumberOfNeighbours(3) \
    .setBucketLength(2.0) \
    .setNumHashTables(3) \
    .setVisibleDistances(True) \
    .setIdentityRanking(True)

document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
    .setInputCols("doc_similarity_rankings") \
    .setOutputCols(
        "doc_similarity_rankings_id",
        "doc_similarity_rankings_neighbors") \
    .setExtractNearestNeighbor(True)

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        document_similarity_ranker,
        document_similarity_ranker_finisher
])

result = pipeline.fit(data).transform(data)

result.select("text",
              "doc_similarity_rankings_id",
              "doc_similarity_rankings_neighbors")\
      .show(truncate=False)

e5_small download started this may take some time.
Approximate size to download 76.2 MB
[OK!]
+------------------------------------------------------------------------------------------------------+--------------------------+-----------------------------------------------------------------------------------+
|text                                                                                                  |doc_similarity_rankings_id|doc_similarity_rankings_neighbors                                                  |
+------------------------------------------------------------------------------------------------------+--------------------------+-----------------------------------------------------------------------------------+
|The sun rose above the horizon, casting its golden light across the landscape.                        |-726629134                |[(-726629134,0.0),(-304978848,0.17535143570234635),(1841831092,0.4471054276126468)]|
|The dawn broke, painting the sky with hue



The `doc_similarity_rankings_neighbors` column now holds, for each document, the document itself (distance 0.0, because `identityRanking` is enabled) followed by its two nearest neighbours and their distances.



