# Pinecone + Langchain Demo

In this notebook I will show you how to build a semantic search engine for your documents using Pinecone and LangChain. The goal is to search through your documents and find the ones most relevant to a query. We will also ask questions about the documents and get answers back.

[Pinecone signup](https://www.pinecone.io/)
After logging in you can see your projects, indexes, and collections.


[Pinecone documentation](https://docs.pinecone.io/docs/python-client)

[Pinecone Langchain documentation](https://www.pinecone.io/learn/series/langchain/langchain-intro/)

[Langchain documentation](https://python.langchain.com/docs/get_started/introduction.html)


### Install the packages

In [1]:
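!pip install langchain --upgrade
!pip install pypdf
# also needed by the cells below: the Pinecone client (pinecone.init requires a 2.x release), openai and tiktoken
!pip install "pinecone-client<3" openai tiktoken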

Collecting langchain
  Obtaining dependency information for langchain from https://files.pythonhosted.org/packages/1f/46/d82192ebc8d1f0e42b03b5c8a078737fba9fe8ec416722f5865ed9424d49/langchain-0.0.311-py3-none-any.whl.metadata
  Using cached langchain-0.0.311-py3-none-any.whl.metadata (15 kB)

In [2]:
# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

### Load your data

The PDF file that I will use is a summary of a strategy course that I took during my bachelor's degree.

The PDF discusses various perspectives in the field of strategy, including the ideas of Clausewitz, Jomini, Marx, Tolstoy, Weber, Taylor, Follett, Rockefeller, Sloan, Ansoff, and game theory. It highlights the relevance of these perspectives in today's business environment and their impact on strategic thinking. The passage also mentions the different schools of thought in strategy, including the prescriptive and descriptive schools, and raises questions about their relationship to each other in the strategic process. Overall, it provides a comprehensive overview of different strategic perspectives and their implications.

In [76]:
# create a loader
loader = PyPDFLoader("../data/summary_strategy.pdf")

### Other options for loaders

In [77]:
# loader = UnstructuredPDFLoader("../data/summary_strategy.pdf")
# loader = OnlinePDFLoader("...")

In [78]:
# load your data
data = loader.load()

Note: If you're using PyPDFLoader, the text is already split by page, so each page becomes one document.

In [79]:
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in the first document')

You have 18 document(s) in your data
There are 3659 characters in the first document


### Split your data up into smaller chunks

The chunk size sets the maximum number of characters per chunk. Smaller chunks make retrieval more precise but carry less context per result; larger chunks keep more context but can dilute the match. As a rule of thumb, long documents can tolerate a larger chunk size, while short documents call for a smaller one.

The chunk overlap is the number of characters shared between consecutive chunks. This is useful for keeping a sentence or idea that straddles a chunk boundary intact in at least one chunk.

Play around with these parameters to see what works best for your data.

Note: If you're using PyPDFLoader then we'll be splitting for the 2nd time.
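To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification for illustration only: the helper `chunk_text` is hypothetical, and RecursiveCharacterTextSplitter itself works differently, since it tries to split on separators such as paragraphs and sentences before falling back to raw characters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"

no_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=0)
print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']

with_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(with_overlap)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With an overlap of 3, the last three characters of each chunk reappear at the start of the next one, at the cost of producing more chunks to embed and store.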

In [80]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [81]:
print(f'Now you have {len(texts)} documents')

Now you have 41 documents


### Create embeddings of your documents

Here we import the LangChain and Pinecone libraries. We will use the OpenAIEmbeddings class to create embeddings of our documents, then create a Pinecone index and add the documents to it. Finally, we will query the index to get back the documents most similar to our query.

In [121]:
# import libraries
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone

I chose to store my API keys in a file called credentials.py. You can also store them in your environment variables. You can find your Pinecone API key and environment in the Pinecone console; the OpenAI key comes from your OpenAI account.

In [122]:
# import your API keys from a file called credentials.py
from credentials import OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_API_ENV

In [123]:
# you can also store the keys in your environment variables
# OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY', 'sk-...')
#
# PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', '...')
# PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', '...')
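If you go the environment-variable route, it can help to fail early when a key is missing instead of passing a placeholder on to the API. The sketch below shows one way to do that; the helper `require_env` and the variable name `DEMO_API_KEY` are made up for illustration and are not part of LangChain or Pinecone.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder value set here only so the sketch runs on its own;
# in practice, export the real key in your shell before starting the notebook.
os.environ.setdefault("DEMO_API_KEY", "demo-value")

DEMO_API_KEY = require_env("DEMO_API_KEY")
```

The same helper could be called with 'OPENAI_API_KEY', 'PINECONE_API_KEY' and 'PINECONE_API_ENV' in place of the demo name.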

I will use the OpenAI embeddings model to create embeddings of my documents. You can use any of the models that are available in the OpenAIEmbeddings class. You can also use any of the other embeddings models that are available in the langchain library.

In [85]:
# create embeddings
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

### Create a Pinecone index and add your documents to it

Here we create a Pinecone index and add our documents to it. We will then query the index to get back the documents most similar to our query. I chose a dimension of 1536 and the cosine metric.

The dimension is not a free parameter: it must match the output size of the embedding model. For the OpenAI text-embedding-ada-002 embeddings, the output dimension is 1536, hence the dimension parameter.


The metric can be cosine, euclidean (L2), or dotproduct, depending on the type of data you have.

#### Cosine Distance
Description: Measures the cosine of the angle between two vectors, so only their direction matters and not their length. Often used with normalized vectors.
Use Cases: Document classification, semantic search, recommendation systems, and any other task involving high-dimensional and normalized data.

#### Euclidean Distance (L2)
Description: Calculates the straight-line distance between two vectors in a multidimensional space.
Use Cases: Image recognition, speech recognition, handwriting analysis.

#### Inner Product (Dot Product)
Description: Computes the sum of the products of the vectors' corresponding components.
Use Cases: Recommendation systems, collaborative filtering, matrix factorization.

[source](https://www.imaurer.com/which-vector-similarity-metric-should-i-use/)
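The three metrics above are easy to compute by hand. The following sketch implements them in plain Python on two toy vectors; the function names are chosen here for illustration and this is not how Pinecone computes them internally.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner product: sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; ignores their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # 1.0: identical direction
print(euclidean_distance(a, b))  # 3.0: the points are still far apart
print(dot(a, b))                 # 18.0: grows with vector length
```

The example shows why the choice matters: two vectors pointing the same way are a perfect match under cosine, yet have a non-zero Euclidean distance and a dot product that depends on their magnitude.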

In the free trial of Pinecone you can only create one index. If you want to create more, you can upgrade to a paid plan.

In [86]:
# initialize pinecone (this must run before any other pinecone call)
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to API key in console
)

index_name = "python-index" # put in the name of your pinecone index here

In [None]:
# create the pinecone index if it does not exist yet
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")

In [None]:
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

In [None]:
# if you already have an index, you can load it like this
#docsearch = Pinecone.from_existing_index(index_name, embeddings)

In [115]:
query = "Who was Von Clausewitz?"
docs = docsearch.similarity_search(query)

The similarity_search method is defined as follows:


def similarity_search(
    self,
    query: str,
    k: int = 4,
    filter: dict | None = None,
    namespace: str | None = None,
    **kwargs: Any) -> list[Document]


The query is the text that you want to search for. k is the number of documents to return. filter is a dictionary of metadata filters used to narrow the results. namespace is the namespace within your index to search. Any other **kwargs are passed on to the Pinecone library.

By default, it will return the top 4 documents that are most similar to your query. You can change this by changing the k parameter.
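Conceptually, a similarity search embeds the query, scores it against every stored vector, and keeps the k best matches. The sketch below does this by brute force on toy 2-dimensional vectors. It is only meant to illustrate the role of k: the helper `top_k` and the sample data are invented for this example, and Pinecone does not scan every vector like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the k (text, score) pairs most similar to query_vector, best first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real ada-002 vectors have 1536 dimensions.
stored = [
    ("chunk about game theory", [0.1, 0.9]),
    ("chunk about Clausewitz", [0.9, 0.1]),
    ("chunk about Jomini", [0.7, 0.3]),
]
query_vector = [1.0, 0.0]

results = top_k(query_vector, stored, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

With k=2 only the Clausewitz and Jomini chunks come back, in that order; raising k returns more, progressively less similar, chunks.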


In [89]:
docs

[Document(page_content='9RQ\x03&ODXVHZLW]  Historic perspectives can still be seen in business perspective today. Clausewitz his perspective seems timeless. He is in the same timeframe as Jomini. French was in a revolution; Eu was ruled by royal houses and the Napoleonic war was going on. When was 12 he joined the army and when he turned 21 he joined the military academy as a scholar. There he met Gerhard von Scharnhorst (lecturer) and Marie von Bruhl (married Clausewitz). The was a grave and was Clausewitz his ticket to the higher circles in Prussia. She played an important role in Clausewitz his career progress and development of perspective in strategy. Von Clausewitz served in an old-fashioned army, Clausewitz was captured by the French and Prussia was conquered. After the release Clausewitz joined the Russian army and fought Napoleon. They defeated Napoleon. Clausewitz tried to finish a book but died of Cholera. 7KH\x03ERRN\x03KDV\x03WKH\x03WLWOH\x03µ¶RQ\x03ZDU¶¶\x11\x03', metadat

In [90]:
# Here's an example of the first document that was returned
print(docs[0].page_content[:450])

9RQ&ODXVHZLW]  Historic perspectives can still be seen in business perspective today. Clausewitz his perspective seems timeless. He is in the same timeframe as Jomini. French was in a revolution; Eu was ruled by royal houses and the Napoleonic war was going on. When was 12 he joined the army and when he turned 21 he joined the military academy as a scholar. There he met Gerhard von Scharnhorst (lecturer) and Marie von Bruhl (married Clausewitz).


### Query those docs to get your answer back

Here we will use the langchain library to create a question answering chain. We will then use the chain to query our documents and get back the answer to our question.

If you want to know more about how to use the question answering chain, I wrote a Medium article about it here: https://medium.com/@rubentak/langchain-using-different-langchain-chains-to-write-a-new-episode-for-the-office-us-7c45d869d895

In [117]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

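Use the GPT-4 model to answer questions. GPT-4 is a chat model, so we load it with ChatOpenAI rather than the completion-style OpenAI class.

In [118]:
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name='gpt-4')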
chain = load_qa_chain(llm, chain_type="stuff")
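The chain_type="stuff" setting used above means that all retrieved documents are "stuffed" into a single prompt together with the question. The sketch below illustrates that idea in plain Python. The function `build_stuff_prompt` and the prompt wording are assumptions made for illustration; LangChain's actual prompt template is different.

```python
def build_stuff_prompt(documents: list[str], question: str) -> str:
    """Concatenate every retrieved document into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-ins for the page_content of the documents returned by similarity_search
retrieved = [
    "Clausewitz joined the army when he was 12.",
    "Clausewitz tried to finish a book but died of cholera.",
]

prompt_text = build_stuff_prompt(retrieved, "Who was Von Clausewitz?")
print(prompt_text)
```

Because everything goes into one prompt, this approach is simple and needs only one model call, but the combined text of the k retrieved chunks has to fit within the model's context window.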

In [119]:
query = "Who was Von Clausewitz?"
docs = docsearch.similarity_search(query)

In [120]:
chain.run(input_documents=docs, question=query)

'Von Clausewitz was a military theorist who served in the Prussian and Russian armies. He joined the army at the age of 12 and later attended a military academy where he met influential figures such as Gerhard von Scharnhorst and Marie von Bruhl, the latter of whom he married. His wife played a significant role in his career progress and development of perspective in strategy. He was captured by the French during the Napoleonic war, and after his release, he joined the Russian army and fought against Napoleon. He attempted to finish a book titled "On War" but died of Cholera before he could complete it. His perspectives on war and strategy are still influential today.'

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
import pinecone
from credentials import OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_API_ENV

In [113]:
def create_qa_bot(prompt):
    # Load the PDF data
    loader = PyPDFLoader("../data/summary_strategy.pdf")
    data = loader.load()

    # Split the data into smaller documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(data)

    # Create embeddings
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

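    # Connect to Pinecone and add the documents to the existing index.
    # Note: from_texts embeds and uploads the chunks on every call, so calling this
    # function repeatedly adds duplicate vectors; once the index is populated,
    # Pinecone.from_existing_index(index_name, embeddings) avoids that.
    #pinecone.create_index("python-index", dimension=1536, metric="cosine")
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_API_ENV)
    index_name = "python-index"
    docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)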

    # Perform similarity search
    docs = docsearch.similarity_search(prompt)

    # Load the question answering chain
    llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
    chain = load_qa_chain(llm, chain_type="stuff")

    # Query the documents and get the answer
    answer = chain.run(input_documents=docs, question=prompt)

    return answer

In [114]:
# Usage example
prompt = "Who was Von Clausewitz?"
answer = create_qa_bot(prompt)
print(answer)

 Von Clausewitz was a Prussian military theorist and soldier who served in the Prussian army and the Russian army during the Napoleonic Wars. He is best known for his book On War, which is still studied today for its insights into military strategy and tactics.
