# Use SparkNLP with Langchain

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2023/07/langchain3.png" width="300"/>         
https://www.langchain.com
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/johnsnowlabs/blob/master/notebooks/langchain_with_johnsnowlabs.ipynb)


This tutorial shows how to use [John Snow Labs components with LangChain](https://nlp.johnsnowlabs.com/docs/en/jsl/langchain-utils) for scalable pre-processing and embedding computation on Spark clusters.

To scale this up, you can reuse the same code on a Spark cluster created with [nlp.install_to_databricks()](https://nlp.johnsnowlabs.com/docs/en/jsl/install_advanced#into-a-freshly-created-databricks-cluster-automatically).

# Installing dependencies & Downloading the jsl_embedder

In [None]:
! pip install johnsnowlabs
from johnsnowlabs import nlp
nlp.start()
! pip install langchain openai tiktoken faiss-cpu

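# Restart the session after installing everything so the new packages are picked up
import os
os.kill(os.getpid(), 9)  # kills the current kernel process; the notebook then reconnects to a fresh one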


# Langchain based JSL-Embedder and Text Splitters
This section builds a mini RAG system, based on the [conversational_retrieval_agents tutorial](https://python.langchain.com/docs/use_cases/question_answering/conversational_retrieval_agents).

## Download some Sample Data

In [1]:
# Download some sample data we use as a mini-db
! wget https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt

--2023-11-17 03:56:24--  https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39028 (38K) [text/plain]
Saving to: ‘state_of_the_union.txt’


2023-11-17 03:56:25 (5.11 MB/s) - ‘state_of_the_union.txt’ saved [39028/39028]



## Load data as Langchain Docs

In [None]:
from langchain.document_loaders import TextLoader
loader = TextLoader('/content/state_of_the_union.txt')
documents = loader.load()


## Create a Pre-Processor connected to the Spark cluster that **pre-processes documents** in a distributed fashion

In [9]:
from johnsnowlabs.llm import embedding_retrieval
jsl_splitter = embedding_retrieval.JohnSnowLabsLangChainCharSplitter(
    chunk_overlap=2,
    chunk_size=20,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)
texts = jsl_splitter.split_documents(documents)

Spark Session already created, some configs may not take.
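
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal pure-Python sketch of fixed-size windows with overlap. It is not how `JohnSnowLabsLangChainCharSplitter` is implemented (that splitter also honours `split_patterns` and runs on Spark); `split_with_overlap` is a hypothetical helper, and it assumes sizes are counted in characters.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Hypothetical helper: cut `text` into windows of at most `chunk_size`
    characters, where consecutive windows share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(repr(chunk))
```

With `chunk_size=20`, as configured above, each chunk can hold only a few words. The fragments retrieved in the agent trace at the end of this notebook are all shorter than 20 characters, so a larger `chunk_size` may be worth trying if retrieved passages look too short to answer questions.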


## Create an Embedder connected to the Spark cluster that **embeds documents** in a distributed fashion

In [8]:
from langchain.vectorstores import FAISS
embeddings = embedding_retrieval.JohnSnowLabsLangChainEmbedder('en.embed_sentence.bert_base_uncased')
db = FAISS.from_documents(texts, embeddings)
retriever = db.as_retriever()

Spark Session already created, some configs may not take.
sent_bert_base_uncased download started this may take some time.
Approximate size to download 392.5 MB
[OK!]
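
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that nearest-neighbour step with made-up 2-D vectors and Euclidean distance. `top_k` and `toy_vectors` are hypothetical names for illustration only; a real FAISS index performs this kind of search in optimised native code over the high-dimensional vectors produced by the embedder.

```python
import math
from typing import List, Tuple


def top_k(query: List[float], vectors: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Hypothetical helper: return (index, distance) for the k vectors nearest to `query`."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Made-up "embeddings" standing in for the stored document chunks
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
nearest = top_k([0.9, 0.1], toy_vectors, k=2)
for index, distance in nearest:
    print(f"chunk {index} at distance {distance:.3f}")
```

The indices returned play the role of the documents the retriever hands to the agent; the quality of those matches depends on both the embedding model and how the text was chunked.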


## Create a Tool with the Distributed Embedding Retriever

In [5]:
from langchain.agents.agent_toolkits import create_retriever_tool
tool = create_retriever_tool(
    retriever,
    "search_state_of_union",
    "Searches and returns documents regarding the state-of-the-union."
)
tools = [tool]


## Create an agent with access to the Tool

In [12]:
from langchain.agents.agent_toolkits import create_conversational_retrieval_agent
from langchain.chat_models import ChatOpenAI

open_api_key = 'YOUR API KEY'  # replace with your own OpenAI API key
llm = ChatOpenAI(temperature=0, openai_api_key=open_api_key)
agent_executor = create_conversational_retrieval_agent(llm, tools, verbose=True)
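
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from the `OPENAI_API_KEY` environment variable; `read_api_key` is a hypothetical helper, not part of LangChain or the johnsnowlabs library.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Hypothetical helper: fetch the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the LLM")
    return key


# Usage with the cell above (left commented out so this sketch runs without a key):
# llm = ChatOpenAI(temperature=0, openai_api_key=read_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later, in the middle of an agent run.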

## Query the Agent

In [7]:
result = agent_executor({"input": "what did the president say about going to east of Columbus?"})
result['output']



> Entering new AgentExecutor chain...

Invoking: `search_state_of_union` with `{'query': 'going to east of Columbus'}`

[Document(page_content='miles east of', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='in America.', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='out of America.', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='upside down.', metadata={'source': '/content/state_of_the_union.txt'})]

I'm sorry, but I couldn't find any specific information about the president's statement regarding going to the east of Columbus in the State of the Union address.

> Finished chain.


"I'm sorry, but I couldn't find any specific information about the president's statement regarding going to the east of Columbus in the State of the Union address."