![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

<img src="https://haystack.deepset.ai/images/haystack-ogimage.png" width="300"/>            

https://haystack.deepset.ai [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/johnsnowlabs/blob/master/notebooks/haystack_with_johnsnowlabs.ipynb)


This tutorial shows how to use [John Snow Labs components with Haystack](https://nlp.johnsnowlabs.com/docs/en/jsl/haystack-utils) for scalable pre-processing and embedding computation on Spark clusters.

If you want to scale this up, you can reuse this code on a Spark cluster created with [nlp.install_to_databricks()](https://nlp.johnsnowlabs.com/docs/en/jsl/install_advanced#into-a-freshly-created-databricks-cluster-automatically).

In [None]:
! pip install johnsnowlabs
from johnsnowlabs import nlp
! pip install 'farm-haystack[all]'
! pip install tensorflow==2.14
nlp.start()

# Restart the session after installing everything
import os
os.kill(os.getpid(), 9)

## Download some Sample Data and Convert to Haystack Documents

In [3]:
# Download some sample data we use as a mini-db
! wget https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt

from haystack import Document
def create_documents_from_file(file_path):
    # Helper for reading a file into Haystack Document objects.
    # Returns a list with one Document per line in file_path.
    documents = []
    with open(file_path, 'r') as file:
        for line_no, line in enumerate(file):
            # Use the line number as the id; avoid shadowing the built-in `id`
            documents.append(Document(content=line.strip(), content_type="text", id=str(line_no)))
    return documents


--2023-11-17 03:16:53--  https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39028 (38K) [text/plain]
Saving to: ‘state_of_the_union.txt.3’


2023-11-17 03:16:53 (3.07 MB/s) - ‘state_of_the_union.txt.3’ saved [39028/39028]



## Create a Haystack pipe
We will add a `JohnSnowLabsHaystackProcessor` and a `JohnSnowLabsHaystackEmbedder` for fully distributed computation on Spark clusters.

In this simple example, we split the documents into chunks and store their embeddings in the document store for RAG applications.
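
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of overlapping character windows. It is only an illustration: the helper `chunk_text` is hypothetical and not part of johnsnowlabs or Haystack, and the actual JSL splitter also honors `split_patterns` and the other options, so its chunk boundaries will generally differ.

```python
def chunk_text(text, chunk_size=20, chunk_overlap=2):
    """Split text into character windows that share chunk_overlap characters.

    Hypothetical helper for illustration only; this is NOT the JSL splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks repeat a little context at their boundary.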

In [4]:
from johnsnowlabs.llm import embedding_retrieval
from haystack.nodes import PreProcessor
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore

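def get_hay_jsl_pipe(model_name='en.embed_sentence.bert_base_uncased', embed_dim=768):
    # Small example Haystack pipeline demonstrating JohnSnowLabsHaystackProcessor.
    # embed_dim must match the output size of model_name
    # (768 for en.embed_sentence.bert_base_uncased, 512 for en.embed_sentence.use).

    # JohnSnowLabsHaystackProcessor supports all parameters of the JSL DocumentSplitter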
    processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
        chunk_overlap=2,
        chunk_size=20,
        explode_splits=True,
        keep_seperators=True,
        patterns_are_regex=False,
        split_patterns=["\n\n", "\n", " ", ""],
        trim_whitespace=True,
    )

    # Write some processed data to Doc store, so we can retrieve it later
    document_store = InMemoryDocumentStore(embedding_dim=embed_dim)
    document_store.write_documents(processor.process(create_documents_from_file("state_of_the_union.txt")))

    # If you want to use a GPU, make sure you ran nlp.start(hardware_target='gpu')!
    retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
        embedding_model=model_name,
        document_store=document_store,
        use_gpu=False,
    )

    document_store.update_embeddings(retriever)

    pipe = Pipeline()
    pipe.add_node(component=processor, name="Preprocess", inputs=["Query"])
    pipe.add_node(component=retriever, name="Embed&Retrieve", inputs=["Query"])
    return pipe



## Create & Query the pipe
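We will get the top K results most similar to our query.
You can use [any Sentence Embedding](https://nlp.johnsnowlabs.com/models?task=Embeddings) from John Snow Labs by passing the **nlu_reference** of any Sentence Embedder. Set `embed_dim` to the output size of the chosen model: 512 for `en.embed_sentence.use` and 768 for `en.embed_sentence.bert_base_uncased` in the cells below.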
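
As a rough illustration of what "top K most similar" means, the sketch below ranks toy vectors by cosine similarity. The helpers `cosine_similarity` and `top_k` and the two-dimensional vectors are made up for this example; the score that Haystack actually reports depends on how the document store is configured and may be computed differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, score) pairs with the highest similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_docs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
ranking = top_k([1.0, 0.1], toy_docs, k=2)
print(ranking)
```

In the pipeline above, the embedder and the document store handle this ranking for you; the sketch only shows the idea behind it.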


In [5]:
use_pipe = get_hay_jsl_pipe('en.embed_sentence.use', embed_dim=512)
result = use_pipe.run(query="Who is the first lady")
for r in result['documents']:
  print(r.to_dict())

Spark Session already created, some configs may not take.


Preprocessing:   0%|          | 0/723 [00:00<?, ?docs/s]

Spark Session already created, some configs may not take.
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Updating Embedding:   0%|          | 0/21 [00:00<?, ? docs/s]



Documents Processed: 10000 docs [00:08, 1159.41 docs/s]


{'content': 'First Lady and', 'content_type': 'text', 'score': 0.5019346848675483, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '2'}
{'content': 'Madam Speaker, Madam', 'content_type': 'text', 'score': 0.5013201007373176, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '0'}
{'content': 'Vice President, our', 'content_type': 'text', 'score': 0.5011079601534563, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '1'}
{'content': 'Members of Congress', 'content_type': 'text', 'score': 0.5010618177687746, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '4'}
{'content': 'Republicans.', 'content_type': 'text', 'score': 0.5007644888755247, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '20'}
{'content': 'Supreme Court. My', 'content_type': 'text', 'score': 0.500718500236837, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '7'}
{'content': 'Second Gentleman.', 'content_type': 'text', 's

In [8]:
bert_pipe = get_hay_jsl_pipe('en.embed_sentence.bert_base_uncased', embed_dim=768)
result = bert_pipe.run(query="Who is the first lady")
for r in result['documents']:
  print(r.to_dict())

Spark Session already created, some configs may not take.


Preprocessing:   0%|          | 0/723 [00:00<?, ?docs/s]

Spark Session already created, some configs may not take.
sent_bert_base_uncased download started this may take some time.
Approximate size to download 392.5 MB
[OK!]


Updating Embedding:   0%|          | 0/21 [00:00<?, ? docs/s]



Documents Processed: 10000 docs [00:02, 3443.47 docs/s]


{'content': 'First Lady and', 'content_type': 'text', 'score': 0.6348900483227143, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '2'}
{'content': 'Vice President, our', 'content_type': 'text', 'score': 0.6224868949875735, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '1'}
{'content': 'Second Gentleman.', 'content_type': 'text', 'score': 0.6224488513925066, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '3'}
{'content': 'Madam Speaker, Madam', 'content_type': 'text', 'score': 0.621548252770467, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '0'}
{'content': 'Supreme Court. My', 'content_type': 'text', 'score': 0.6192054605036265, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '7'}
{'content': 'and the Cabinet.', 'content_type': 'text', 'score': 0.6167082973209767, 'meta': {}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '5'}
{'content': 'Zealand, and many', 'content_type': 'text', '