![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Colab Setup

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/16.0.Vector_Store_Integration.ipynb)

In [None]:
!pip install johnsnowlabs

from johnsnowlabs import nlp,finance

nlp.install(force_browser=True)

In [None]:
import json
import pandas as pd
from johnsnowlabs import nlp,finance

params = {"spark.driver.memory":"32G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"20000M"}


spark = nlp.start(spark_conf=params)

📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.0.1, 💊Spark-Healthcare==5.0.1, running on ⚡ PySpark==3.1.2


## Generate Embeddings

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/earning_calls_sample.csv

--2023-08-07 07:19:29--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/earning_calls_sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261030 (255K) [text/plain]
Saving to: ‘earning_calls_sample.csv’


2023-08-07 07:19:29 (13.0 MB/s) - ‘earning_calls_sample.csv’ saved [261030/261030]



### Loading the data

In [None]:
df = pd.read_csv('earning_calls_sample.csv')

documents = df['text'].tolist()

### Splitting into paragraphs

In [None]:
paragraphs = []

# Keep only paragraphs longer than 100 space-separated words
for doc in documents:
  for para in doc.split('\n\n'):
    if len(para.split(' ')) > 100:
      paragraphs.append(para)

In [None]:
print(f'Total paragraphs generated: {len(paragraphs)}')

Total paragraphs generated: 137


### Creating the pipeline

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbert_setfit_finetuned_financial_text_classification", "en")\
  .setInputCols(["document"])\
  .setOutputCol("sbert_embeddings")

embeddingsFinisher = nlp.EmbeddingsFinisher() \
    .setInputCols("sbert_embeddings") \
    .setOutputCols("finished_sentence_embeddings")

sbert_setfit_finetuned_financial_text_classification download started this may take some time.
Approximate size to download 390.1 MB
[OK!]


In [None]:
data = [[i] for i in paragraphs]
len(data)

137

In [None]:
data = spark.createDataFrame(data) \
    .toDF("text")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_embeddings,
    embeddingsFinisher
]).fit(data)

result = pipeline.transform(data)

In [None]:
resultWithSize = result.selectExpr("document.result","explode(finished_sentence_embeddings) as embeddings").toPandas()

In [None]:
resultWithSize

Unnamed: 0,result,embeddings
0,"[Aon (AON 1.07%)Q3 2022 Earnings CallOct 28, 2...","[0.11219091713428497, -0.04963115230202675, -0..."
1,"[ Good morning, everyone. Welcome to our third...","[0.10787619650363922, -0.04971027374267578, -0..."
2,"[ And third, they value expert insight and are...","[0.11406604945659637, -0.04644547775387764, -0..."
3,[ These results are consistent with our full y...,"[0.09470720589160919, -0.04397454857826233, -0..."
4,[ Christa Davies -- Chief Financial Officer an...,"[0.10794825851917267, -0.04399846866726875, -0..."
...,...,...
132,"[ UBS -- Analyst Thank you, management, for ta...","[0.12066387385129929, -0.047519147396087646, -..."
133,[Joey Wat -- Chief Executive Officer Thank you...,"[0.10857225209474564, -0.04651939496397972, -0..."
134,"[Andy Yeung -- Chief Financial Officer OK. So,...","[0.1192004457116127, -0.04519885778427124, -0...."
135,[ make and train -- we also see the [Inaudible...,"[0.1164017990231514, -0.053716566413640976, -0..."


In [None]:
df_processed = resultWithSize.copy()



## Let's use Weaviate to perform Vector search!

**What is vector search?**

Vector search is a search method that uses vector representations (vector embeddings) of data items to perform similarity-based searches. Data items such as documents, images, or other objects are transformed into high-dimensional vectors whose dimensions together encode features of the item.

The core idea behind vector search is that similar items will have similar vector representations, making it possible to measure the similarity between items by calculating the distance between their corresponding vectors. The closer the vectors are to each other, the more similar the items are considered to be.
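
To make the idea concrete, here is a minimal sketch of similarity-based ranking using cosine similarity, one common way to compare vectors. The helper names (`cosine_similarity`, `nearest`) and the toy 3-dimensional vectors are made up for illustration; they are not real embeddings and not part of this notebook's pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, items, k=2):
    """Return the k (label, score) pairs most similar to the query."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for this example (assumption: not real embeddings)
toy_items = {
    "credit losses": [0.9, 0.1, 0.0],
    "loan provisions": [0.8, 0.2, 0.1],
    "vaccine trials": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

top_matches = nearest(query_vector, toy_items, k=2)
print(top_matches)
```

A vector database performs the same kind of ranking, but over many more vectors and typically with an approximate index rather than this brute-force scan.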


In [None]:
import sys
!pip install weaviate-client
import weaviate



### Register on the Weaviate website and create a cluster.

In [None]:
auth_config = weaviate.AuthApiKey(api_key="xxxxxxxxxxxxxxxxxxxxxxxxx")  # Replace w/ your Weaviate instance API key

# Instantiate the client
client = weaviate.Client(
    url="xxxxxxxxxxxxxxxxxxxxxxxxx", # Replace w/ your Weaviate cluster URL
    auth_client_secret=auth_config
)

In [None]:
client.is_ready()

True

In [None]:
# Uncomment the next line to delete every class that already exists in the schema
# client.schema.delete_all()

### Import data with vectors

In [None]:
jsl_data = []

# One dict per paragraph: the text and its embedding vector
for _, row in df_processed.iterrows():
  jsl_data.append({'Text': row['result'], 'Vector': row['embeddings']})

In [None]:
jsl_data[0].keys()

dict_keys(['Text', 'Vector'])

In [None]:
# Class definition object. Weaviate's autoschema feature will infer properties when importing.
class_obj = {
    "class": "JSL_Document",
    "vectorizer": "none"
}

# Add the class to the schema
client.schema.create_class(class_obj)
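
If you would rather not rely on autoschema, the property can be declared up front. The following is a minimal sketch under stated assumptions: `class_obj_explicit` is a hypothetical name, and the `answer` property is typed as `text[]` because each row's text in this notebook is a list of strings. It could be passed to `client.schema.create_class` in place of `class_obj` above.

```python
# Hypothetical explicit class definition (assumption: not part of the original notebook)
class_obj_explicit = {
    "class": "JSL_Document",
    "vectorizer": "none",  # vectors are supplied at import time, not computed by Weaviate
    "properties": [
        {
            "name": "answer",        # property name used at import and query time
            "dataType": ["text[]"],  # each row's text is a list of strings
        }
    ],
}

property_names = [p["name"] for p in class_obj_explicit["properties"]]
print(property_names)
```

Declaring the property explicitly fixes its data type before any object is imported instead of leaving the type to inference.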

In [None]:
# Configure a batch process
with client.batch as batch:
    batch.batch_size = 256
    # Batch import all paragraphs together with their precomputed vectors
    for i, d in enumerate(jsl_data):
        print(f"importing question: {i+1}")

        # The paragraph text is stored under the "answer" property
        properties = {
            "answer": d["Text"]
        }

        batch.add_data_object(properties, "JSL_Document", vector=d["Vector"])

importing question: 1
importing question: 2
importing question: 3
importing question: 4
importing question: 5
importing question: 6
importing question: 7
importing question: 8
importing question: 9
importing question: 10
importing question: 11
importing question: 12
importing question: 13
importing question: 14
importing question: 15
importing question: 16
importing question: 17
importing question: 18
importing question: 19
importing question: 20
importing question: 21
importing question: 22
importing question: 23
importing question: 24
importing question: 25
importing question: 26
importing question: 27
importing question: 28
importing question: 29
importing question: 30
importing question: 31
importing question: 32
importing question: 33
importing question: 34
importing question: 35
importing question: 36
importing question: 37
importing question: 38
importing question: 39
importing question: 40
importing question: 41
importing question: 42
importing question: 43
importing question: 

Let's say you want to find paragraphs related to `provision for credit losses`. We can do that by obtaining a vector embedding for `provision for credit losses` and finding the objects nearest to it. In this example, we use Spark NLP to create the financial embeddings, then pass the query vector to the `nearVector` operator in the following query:

In [None]:
# text for checking

text = 'provision for credit losses'

In [None]:
data = spark.createDataFrame([[text]]) \
    .toDF("text")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_embeddings,
    embeddingsFinisher
]).fit(data)

result = pipeline.transform(data)

In [None]:
resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings").toPandas()

In [None]:
df_test = resultWithSize.copy()

In [None]:
# Convert the embeddings from Dense Vector to a list using apply and tolist()
# df['embeddings'] = df['embeddings'].apply(lambda x: x.values.tolist())

emb = df_test['embeddings']
emb

0    [0.11180341243743896, -0.07125287503004074, -0...
Name: embeddings, dtype: object

In [None]:
dicti = {'vector':emb[0]}


In [None]:
response = (
    client.query
    .get("JSL_Document", ["answer"])
    .with_near_vector(
            dicti)
    .with_limit(10)
    .do()
)

print(json.dumps(response, indent=2))

{
  "data": {
    "Get": {
      "JSL_Document": [
        {
          "answer": [
            " We have continued to advance potentially game-changing vaccines in the fight against respiratory disease by entering into a Phase 3 study for our mRNA flu vaccine candidate and initiating a Phase 1 study for a vaccine candidate that combines our mRNA flu and COVID-19 vaccine in one sort. We completed the acquisitions of Biohaven Pharmaceuticals and Global Blood Therapeutics, giving us market-leading franchises in both migraine and sickle cell disease, respectively. Less than six months ago after launching in Accor for a healthier world, a breakthrough initiative designed to close the health equity gap for 1.2 billion people living in 45 lower-income countries, I'm proud to say that the first shipments of our products have arrived to these countries, and we are working with governments on health system improvements that can help make sure these products reach those in need. And of course, we