![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/17.0.Vector_Store_Integration.ipynb)

In [None]:
!pip install johnsnowlabs

from johnsnowlabs import nlp

nlp.install(force_browser=True)

In [None]:
import json
import pandas as pd
from johnsnowlabs import nlp

params = {"spark.driver.memory": "32G",
          "spark.kryoserializer.buffer.max": "2000M",
          "spark.driver.maxResultSize": "20000M"}


spark = nlp.start(spark_conf=params)

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7163 (22).json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.1, 💊Spark-Healthcare==5.0.1, running on ⚡ PySpark==3.1.2


## Generate Embeddings

### Loading the data

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/legal-nlp/data/mnda_sample.csv

In [5]:
df = pd.read_csv('mnda_sample.csv')

documents = df['text'].tolist()

In [6]:
df.shape

(263, 2)

In [None]:
print(f'Total documents: {len(documents)}')

Total documents: 263


### Creating the pipeline

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal", "en")\
  .setInputCols(["document"])\
  .setOutputCol("sbert_embeddings")

embeddingsFinisher = nlp.EmbeddingsFinisher() \
    .setInputCols("sbert_embeddings") \
    .setOutputCols("finished_sentence_embeddings")

sent_bert_base_uncased_legal download started this may take some time.
Approximate size to download 390.8 MB
[OK!]


In [None]:
data = [[i] for i in documents]
len(data)

263

In [None]:
data = spark.createDataFrame(data) \
    .toDF("text")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_embeddings,
    embeddingsFinisher
]).fit(data)

result = pipeline.transform(data)

In [None]:
resultWithSize = result.selectExpr("document.result","explode(finished_sentence_embeddings) as embeddings").toPandas()

In [None]:
resultWithSize

| | result | embeddings |
|---|---|---|
| 0 | [The recipient shall not, during the term of t... | [0.26616281270980835, 0.341601699590683, 0.934... |
| 1 | [The recipient shall not, during the term of t... | [-0.11213510483503342, 0.10545776784420013, -0... |
| 2 | [This Agreement shall inure to the benefit of ... | [-0.46788308024406433, 0.47307464480400085, 0.... |
| 3 | [Unless otherwise provided herein, your obliga... | [-0.49271631240844727, 0.27636203169822693, 0.... |
| 4 | [The recipient shall not, during the term of t... | [-0.7534305453300476, 0.14203622937202454, 1.0... |
| ... | ... | ... |
| 258 | [Evaluation Material. The term\n “Evaluati... | [-0.2982212007045746, 0.493274062871933, -0.95... |
| 259 | [Notwithstanding any other provision hereof, t... | [-0.898460865020752, 0.7281344532966614, 0.999... |
| 260 | [In addition, each Party agrees that, without ... | [-0.7481690645217896, 0.9281681180000305, 0.99... |
| 261 | [Return and Destruction of Evaluation\n Mat... | [-0.7783533334732056, 0.9506694078445435, 0.99... |


In [None]:
df_processed = resultWithSize.copy()



## Let's use Weaviate to perform Vector search!

**What is vector search?**

Vector search refers to a search method that utilizes vector representations (vector embeddings) of data items to perform similarity-based searches. In vector search, data items such as documents, images, or other objects are transformed into high-dimensional vectors, where each dimension represents a specific feature or attribute of the item.

The core idea behind vector search is that similar items will have similar vector representations, making it possible to measure the similarity between items by calculating the distance between their corresponding vectors. The closer the vectors are to each other, the more similar the items are considered to be.
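
To make the distance idea concrete, here is a minimal sketch in plain Python of cosine similarity, one common way to compare embeddings. The function name `cosine_similarity` and the three toy 3-dimensional vectors are illustrative assumptions, not output of the legal model used in this notebook; real sentence embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, for illustration only)
query = [1.0, 0.0, 1.0]
similar = [0.9, 0.1, 1.1]
different = [-1.0, 1.0, 0.0]

# The vector pointing in nearly the same direction scores higher
print(cosine_similarity(query, similar))
print(cosine_similarity(query, different))
```

A score close to 1 means the two vectors point in almost the same direction, while a score near 0 or below means the items are unrelated or opposed under this measure.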


In [None]:
# The cells below use the v3-style Python client API (weaviate.Client), so pin the client below v4
!pip install "weaviate-client<4"
import weaviate



### Register on the Weaviate website and create a cluster

In [None]:
auth_config = weaviate.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY")  # Replace with your Weaviate instance API key

# Instantiate the client
client = weaviate.Client(
    url="https://jsl-ilzkw6m2.weaviate.network", # Replace with your Weaviate cluster URL
    auth_client_secret=auth_config
)

In [None]:
client.is_ready()

True

In [None]:
# Uncomment the next line to delete ALL existing schemas (and their data) before starting
# client.schema.delete_all()

### Import data with vectors

In [None]:
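jsl_data = []

# Build one {'Text', 'Vector'} record per row of the processed DataFrame
for _, row in df_processed.iterrows():
  record = {}
  record['Text'] = row['result']        # list of strings from document.result
  record['Vector'] = row['embeddings']  # sentence embedding for that text
  jsl_data.append(record)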

In [None]:
jsl_data[0].keys()

dict_keys(['Text', 'Vector'])

In [None]:
# Class definition object. Weaviate's autoschema feature will infer properties when importing.
class_obj = {
    "class": "JSL_Document",
    "vectorizer": "none"
}

# Add the class to the schema
client.schema.create_class(class_obj)
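
If you would rather not rely on autoschema, the property can be declared up front. The sketch below only builds the class definition dictionary and does not contact Weaviate. The name `explicit_class_obj` is hypothetical, and the `text[]` data type is an assumption based on each `Text` value above being a list of strings; the `answer` property name matches the one used in the import step.

```python
import json

# Hypothetical explicit variant of the class definition above
explicit_class_obj = {
    "class": "JSL_Document",
    "vectorizer": "none",  # vectors are supplied by us, not computed by Weaviate
    "properties": [
        {
            "name": "answer",
            "dataType": ["text[]"],  # assumption: each value is a list of strings
        }
    ],
}

# Inspect the definition before sending it anywhere
print(json.dumps(explicit_class_obj, indent=2))
```

It would be passed to `client.schema.create_class` in place of `class_obj`.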

In [None]:
# Configure a batch process
with client.batch as batch:
    batch.batch_size = 256
    # Batch import all documents
    for i, d in enumerate(jsl_data):
        print(f"importing question: {i+1}")

        # The text is stored under the "answer" property, which the query below retrieves
        properties = {
            "answer": d["Text"]
        }

        # Attach our own precomputed embedding as the object's vector
        batch.add_data_object(properties, "JSL_Document", vector=d["Vector"])

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
importing question: 6
importing question: 7
importing question: 8
importing question: 9
importing question: 10
importing question: 11
importing question: 12
importing question: 13
importing question: 14
importing question: 15
importing question: 16
importing question: 17
importing question: 18
importing question: 19
importing question: 20
importing question: 21
importing question: 22
importing question: 23
importing question: 24
importing question: 25
importing question: 26
importing question: 27
importing question: 28
importing question: 29
importing question: 30
importing question: 31
importing question: 32
importing question: 33
importing question: 34
importing question: 35
importing question: 36
importing question: 37
importing question: 38
importing question: 39
importing question: 40
importing question: 41
importing question: 42
importing question: 43
importing question: 

Let's say you want to find documents related to `Confidential Information`. We can do that by obtaining a vector embedding for `Confidential Information` and finding the objects nearest to it. In this example, we use Spark NLP to create the legal embeddings. Then, in the following query, we pass that vector to the `nearVector` operator:
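
Conceptually, a nearest-vector query ranks the stored objects by their distance to the query vector and returns the closest ones. The brute-force sketch below illustrates that idea in plain Python. It is not how Weaviate performs the search (vector databases typically rely on approximate indexes), and the helpers `cosine_distance` and `top_k_nearest` as well as the toy records are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k_nearest(query_vector, records, k=2):
    """Return the k records whose 'Vector' is closest to query_vector."""
    ranked = sorted(records, key=lambda r: cosine_distance(query_vector, r["Vector"]))
    return ranked[:k]

# Toy records shaped like the {'Text', 'Vector'} dictionaries built earlier
records = [
    {"Text": "clause A", "Vector": [1.0, 0.0]},
    {"Text": "clause B", "Vector": [0.0, 1.0]},
    {"Text": "clause C", "Vector": [0.8, 0.2]},
]

# Print the texts of the two records closest to the query vector
for hit in top_k_nearest([1.0, 0.1], records, k=2):
    print(hit["Text"])
```

Scanning every record like this costs time proportional to the collection size, which is why a dedicated vector store is preferable beyond small datasets.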

In [None]:
# Query text to embed and search for

text = 'Confidential information'

In [None]:
data = spark.createDataFrame([[text]]) \
    .toDF("text")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_embeddings,
    embeddingsFinisher
]).fit(data)

result = pipeline.transform(data)

In [None]:
resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings").toPandas()

In [None]:
df_test = resultWithSize.copy()

In [None]:
# Convert the embeddings from Dense Vector to a list using apply and tolist()
# df['embeddings'] = df['embeddings'].apply(lambda x: x.values.tolist())

emb = df_test['embeddings']
emb

0    [-0.8694421052932739, 0.5934111475944519, 1.0,...
Name: embeddings, dtype: object

In [None]:
dicti = {'vector':emb[0]}


In [None]:
response = (
    client.query
    .get("JSL_Document", ["answer"])
    .with_near_vector(dicti)
    .with_limit(10)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "JSL_Document": [
        {
          "answer": [
            "In addition, this\n     Agreement and its terms, and the fact and substance of discussions between\n     the parties concerning the Subject Matter, shall be deemed to be\n     Confidential Information"
          ]
        },
        {
          "answer": [
            "Company may\nnot sell, transfer, assign, sublicense, or subcontract any right or obligation\nhereunder without the prior written consent of ASG."
          ]
        },
        {
          "answer": [
            "Assignment. WWC shall not assign this Agreement or its rights or\n    obligations herein without the prior written consent of WWI."
          ]
        },
        {
          "answer": [
            "This Agreement shall be governed by and construed in accordance with the\n     laws of the State of Oklahoma, without regard to their conflict of laws\n     provisions"
          ]
        },
        {
          "answer"