# Pinecone + Langchain Demo

In this notebook I will show you how to create a semantic search engine for your documents using Pinecone and LangChain. The goal is to search through your documents and find the ones most relevant to your query. We will also ask questions about the documents and get answers back.

[Pinecone signup](https://www.pinecone.io/)
After logging in you can see your projects, indexes and collections.


[Pinecone documentation](https://docs.pinecone.io/docs/python-client)

[Pinecone Langchain documentation](https://www.pinecone.io/learn/series/langchain/langchain-intro/)

[Langchain documentation](https://python.langchain.com/docs/get_started/introduction.html)


### Install the packages

In [None]:
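!pip install langchain --upgrade
!pip install pypdf
# the later cells also need the Pinecone client (pinecone.init is the 2.x API), openai and tiktoken
!pip install "pinecone-client<3" openai tiktoken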

In [27]:
# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

### Load your data

The PDF file that I will use is a summary of one of my courses on strategy that I took during my bachelor's degree.

The PDF discusses various perspectives in the field of strategy, including the ideas of Clausewitz, Jomini, Marx, Tolstoy, Weber, Taylor, Follett, Rockefeller, Sloan, Ansoff, and game theory. It highlights the relevance of these perspectives in today's business environment and their impact on strategic thinking. The passage also mentions the different schools of thought in strategy, including the prescriptive and descriptive schools, and raises questions about their relationship to each other in the strategic process. Overall, it provides a comprehensive overview of different strategic perspectives and their implications.

In [28]:
# create a loader
loader = PyPDFLoader("DM_Dir_NOF_CSM_14_set_2023.pdf")

### Other options for loaders

In [29]:
# loader = UnstructuredPDFLoader("../data/summary_strategy.pdf")
# loader = OnlinePDFLoader("...")

In [30]:
# load your data
data = loader.load()

Note: If you're using PyPDFLoader, the text will be split by page for you already

In [31]:
print (f'You have {len(data)} document(s) in your data')
print (f'There are {len(data[0].page_content)} characters in your document')

You have 88 document(s) in your data
There are 148 characters in your document


### Split your data up into smaller documents with Chunks

The chunk size should be chosen with your documents and your questions in mind. Smaller chunks give more precise matches but carry less context; larger chunks carry more context but dilute the match and take up more of the model's context window when they are passed to the LLM.

The chunk overlap is the number of characters shared between consecutive chunks. This is useful if you want to make sure that a sentence or idea that straddles a chunk boundary is not cut off from its context.

Play around with these parameters to see what works best for your data.
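To make the two parameters concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper used only for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_text(sample, chunk_size=10)
with_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With an overlap of 3, the last three characters of each chunk are repeated at the start of the next one, at the cost of producing more chunks (4 instead of 3 here).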

Note: If you're using PyPDFLoader then we'll be splitting for the 2nd time.

In [32]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [33]:
print (f'Now you have {len(texts)} documents')

Now you have 303 documents


### Create embeddings of your documents

Here we import the LangChain and Pinecone libraries. We will use the OpenAIEmbeddings class to create embeddings of our documents. We will then use the Pinecone library to create a Pinecone index and add our documents to it. Finally, we will use the Pinecone library to query our index and get back the most similar documents to our query.

In [34]:
# import libraries
from langchain.vectorstores import  Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone

I chose to store my API keys in a file called credentials.py. You can also store them in your environment variables. You can find your Pinecone API key and environment in the Pinecone console; the OpenAI API key comes from your OpenAI account.

In [35]:
# import your API keys from a file called credentials.py
from credentials import OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_API_ENV

In [36]:
# you can also store the keys in your environment variables
# OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY', 'sk-...')
#
# PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', '...')
# PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', '...')
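One caveat with the commented-out cell above: passing a placeholder such as 'sk-...' as the default means a missing variable silently becomes an invalid key. A small sketch of a stricter alternative follows; `require_env` and the `DEMO_API_KEY` variable are hypothetical names used only for this example.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Set here only so the example runs; normally the variable is set outside the notebook.
os.environ["DEMO_API_KEY"] = "demo-value"
demo_key = require_env("DEMO_API_KEY")
print(demo_key)
```

In the notebook you would call it as `OPENAI_API_KEY = require_env('OPENAI_API_KEY')`, so a missing key fails immediately instead of at the first API call.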

I will use the OpenAI embeddings model to create embeddings of my documents. You can use any of the models that are available in the OpenAIEmbeddings class. You can also use any of the other embeddings models that are available in the langchain library.
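What makes the embedding models interchangeable is that LangChain embedding classes expose the same two methods, `embed_documents` and `embed_query`. The sketch below is a toy stand-in with that shape, built on word hashing so that it runs without an API key. `ToyHashEmbeddings` is a hypothetical class, and its vectors carry no real semantic meaning.

```python
import hashlib
import math


class ToyHashEmbeddings:
    """Toy stand-in exposing the same two methods as LangChain embedding classes."""

    def __init__(self, dimension: int = 8):
        self.dimension = dimension

    def _embed(self, text: str) -> list[float]:
        # Count words into hashed buckets, then scale the vector to unit length.
        vector = [0.0] * self.dimension
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[digest[0] % self.dimension] += 1.0
        norm = math.sqrt(sum(v * v for v in vector)) or 1.0
        return [v / norm for v in vector]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return self._embed(text)


toy = ToyHashEmbeddings(dimension=8)
doc_vectors = toy.embed_documents(["strategy and tactics", "school regulations"])
query_vector = toy.embed_query("strategy and tactics")
print(len(doc_vectors), len(doc_vectors[0]))  # 2 8
```

Whatever model you pick, the length of the vectors it returns is what the index dimension has to match.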

In [37]:
# create embeddings
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

### Create a Pinecone index and add your documents to it

Here we create a Pinecone index and add our documents to it. We will then use the Pinecone library to query our index and get back the most similar documents to our query. I chose a dimension of 1536 and a metric of cosine. The dimension must match the output size of the embedding model; the metric is the parameter you can experiment with to see what works best for your data.

For the OpenAI text-embedding-ada-002 embeddings, the output dimension is 1536, hence the dimension parameter.
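As a small illustration of why the dimension matters: every vector stored in an index has to have exactly the dimension the index was created with. The check below is a pure-Python sketch; `check_dimension` is a hypothetical helper, not part of the Pinecone client.

```python
INDEX_DIMENSION = 1536  # output size of text-embedding-ada-002


def check_dimension(vectors: list[list[float]], expected: int = INDEX_DIMENSION) -> None:
    """Raise ValueError if any vector does not match the index dimension."""
    for i, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {i} has dimension {len(vector)}, expected {expected}"
            )


check_dimension([[0.0] * 1536, [0.1] * 1536])  # passes silently

try:
    check_dimension([[0.0] * 768])  # e.g. vectors from a different model
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
print(mismatch_message)  # vector 0 has dimension 768, expected 1536
```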


The metric can be cosine, euclidean, or dotproduct, depending on the type of data you have.

#### Cosine Distance
Description: Measures the cosine of the angle between two vectors, so it compares direction and ignores magnitude; often used with normalized, high-dimensional vectors such as text embeddings.
Use Cases: Document classification, semantic search, recommendation systems, and any other task involving high-dimensional and normalized data.

#### Euclidean Distance (L2)
Description: Calculates the straight-line distance between two vectors in a multidimensional space.
Use Cases: Image recognition, speech recognition, handwriting analysis.

#### Inner Product (Dot Product)
Description: Computes the sum of the products of the vectors' corresponding components.
Use Cases: Recommendation systems, collaborative filtering, matrix factorization.

[source](https://www.imaurer.com/which-vector-similarity-metric-should-i-use/)
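The three metrics can be compared on tiny vectors. The following is a plain-Python sketch for illustration only; Pinecone computes these server-side, and the function names here are my own.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # orthogonal to a

print(cosine_similarity(a, b), euclidean_distance(a, b), dot_product(a, b))
print(cosine_similarity(a, c), euclidean_distance(a, c), dot_product(a, c))
```

Cosine treats a and b as identical (similarity 1.0) because they point in the same direction, while the Euclidean distance and the dot product both react to the difference in length. For vectors that are all scaled to unit length, the three metrics produce the same ranking.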

In the free trial of Pinecone you can only create one index. If you want to create more, you can upgrade to a paid plan.

In [38]:
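# create a pinecone index
# (call pinecone.init(...) first; this raises an error if the index already exists or the project quota is reached)
pinecone.create_index("python-index", dimension=1536, metric="cosine")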

ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=UTF-8', 'date': 'Tue, 10 Oct 2023 19:53:29 GMT', 'x-envoy-upstream-service-time': '779', 'content-length': '131', 'server': 'envoy'})
HTTP response body: The index exceeds the project quota of 1 pods by 1 pods. Upgrade your account or change the project settings to increase the quota.


In [39]:
# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to API key in console
)

index_name = "python-index" # put in the name of your pinecone index here

In [40]:
docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

In [23]:
# if you already have an index, you can load it like this
#docsearch = Pinecone.from_existing_index(index_name, embeddings)

In [41]:
query = "What is article 1 about?"
docs = docsearch.similarity_search(query)

The similarity_search function is defined as follows:


def similarity_search(
    self,
    query: str,
    k: int = 4,
    filter: dict | None = None,
    namespace: str | None = None,
    **kwargs: Any) -> list[Document]


The query is the text that you want to search for. The k is the number of documents to return. The filter is a dictionary of metadata filters used to narrow the results. The namespace is the namespace within your index to search. The **kwargs are any other arguments to pass through to the Pinecone library.

By default, it will return the top 4 documents that are most similar to your query. You can change this by changing the k parameter.
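Conceptually, a top-k similarity search scores the query vector against the stored vectors and keeps the k best. The sketch below does this by brute force in plain Python; `toy_similarity_search` is a hypothetical function, not the LangChain or Pinecone implementation, and a vector database typically uses an approximate index instead of scoring every vector.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_similarity_search(query_vector, stored, k=4):
    """Return the k texts whose vectors are most similar to query_vector.

    stored is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [text for _, text in scored[:k]]


stored = [
    ("doc A", [1.0, 0.0]),
    ("doc B", [0.9, 0.1]),
    ("doc C", [0.0, 1.0]),
    ("doc D", [0.7, 0.7]),
    ("doc E", [-1.0, 0.0]),
]
top_default = toy_similarity_search([1.0, 0.0], stored)
top_two = toy_similarity_search([1.0, 0.0], stored, k=2)
print(top_default)  # ['doc A', 'doc B', 'doc D', 'doc C']
print(top_two)      # ['doc A', 'doc B']
```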


In [42]:
docs

[Document(page_content='ÍNDEX\nI\nPART:\nREGLAMENT\nDE\nRÈGIM\nINTERN\n...........................................................................................\n6\nPREÀMBUL\n.......................................................................................................................................\n6\nTítol\npreliminar \nNATURALESA\nI\nFINALITAT\nDEL\nCOL·LEGI\n.........................................................................................\n7\nCapítol\n1r:\nDEFINICIÓ\nDEL\nCOL·LEGI\n.............................................................................................\n8\nCapítol\n2n:\nEL\nMODEL\nEDUCATIU\nDE\nL’ESCOLA\n..........................................................................\n8\nCapítol\n3r:\nLA\nCOMUNITAT\nEDUCATIVA\nDE\nL’ESCOLA\n.................................................................\n9\nTítol\nprimer \nÒRGANS\nDE\nGOVERN\nI\nGESTIÓ\nDEL\nCOL·LEGI\n............................................................................

In [43]:
# Here's an example of the first document that was returned
print(docs[0].page_content[:450])

ÍNDEX
I
PART:
REGLAMENT
DE
RÈGIM
INTERN
...........................................................................................
6
PREÀMBUL
.......................................................................................................................................
6
Títol
preliminar 
NATURALESA
I
FINALITAT
DEL
COL·LEGI
.........................................................................................
7
Capítol
1r:
DEFINICIÓ
D


### Query those docs to get your answer back

Here we will use the langchain library to create a question answering chain. We will then use the chain to query our documents and get back the answer to our question.

If you want to know more about how to use the question answering chain, I wrote a Medium article about it here: https://medium.com/@rubentak/langchain-using-different-langchain-chains-to-write-a-new-episode-for-the-office-us-7c45d869d895
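The chain in this notebook is loaded with chain_type="stuff", which means all retrieved documents are placed ("stuffed") into a single prompt together with the question. The sketch below shows the idea in plain Python; `build_stuff_prompt` is a hypothetical helper, and the instruction wording only approximates LangChain's default template.

```python
def build_stuff_prompt(documents: list[str], question: str) -> str:
    """Concatenate all documents into one context block followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


stuffed = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What do the chunks say?",
)
print(stuffed)
```

Because everything goes into one prompt, the retrieved chunks together have to fit into the model's context window, which is one reason to keep k and the chunk size moderate.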

In [44]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

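Use the GPT-4 model to answer questions. GPT-4 is a chat model, so it is loaded with ChatOpenAI rather than the completion-style OpenAI class.

In [45]:
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name='gpt-4')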
chain = load_qa_chain(llm, chain_type="stuff")



In [51]:
query = "What is El coordinador de pastoral educativa?"
docs = docsearch.similarity_search(query)

In [52]:
chain.run(input_documents=docs, question=query)

"El coordinador de pastoral educativa is a role in an educational institution. This person is appointed by the entity that owns the institution and works closely with the owner, the director, and the heads of studies to help achieve the educational objectives of the institution. The coordinator is part of the institution's management team and has various responsibilities, including leading and coordinating the action of the educational pastoral team, promoting the educational pastoral of the entire educational community, facilitating the integration of the school's Christian community and its evangelizing action in the pastoral reality of the diocesan Church, and keeping the management, heads of studies, and tutors informed of all activities that affect the students in their area or the normal development of the course."

In [53]:
docs

[Document(page_content='projecte\neducatiu.\n2.\nEl\ncoordinador\nde\npastoral\nés\nnomenat\ni\ncessat\nper\nl’entitat\ntitular\ndel\ncentre\ni\nrealitza\nles\nseves \nfuncions\nen\nestreta\nrelació\namb\nel\ntitular,\nel\ndirector\ni\nels\ncaps\nd’estudis,\nen\nordre\na\ncol·laborar\nper \nfer\nrealitat\nels\nobjectius\neducatius\ndel\ncentre\nen\ntotes\nles\netapes.\n3.\nEl\nnomenament\ndel\ncoordinador\nde\npastoral\nserà\npel\ntemps\nque\nestipuli\nl’entitat\ntitular\ni\npodrà\nser \nrenovat.\n4.\nEl\ncoordinador\nde\npastoral\neducativa\nforma\npart\nde\nl’equip\ndirectiu\ndel\ncentre.\nArtícle\n33\nLes\nfuncions\ndel\ncoordinador\nde\npastoral\neducativa\nsón\nles\nsegüents:\na)\nAnimar\ni\ncoordinar\nl’acció\nde\nl’equip\nde\npastoral\neducativa\ni\nde\ntots\nels\nseus\nmembres,\ni \nconvocar\ni\npresidir\nles\nreunions.\nb)\nImpulsar\nla\nprogramació\ni\nrealització\nde\nles\niniciatives\ni\nactivitats\ntendents\na\nl’animació \npastoral\nde\nl’escola\ni\nvetllar\nperquè\nel\nc

In [54]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
import pinecone
from credentials import OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_API_ENV

In [63]:
def create_qa_bot(prompt):
    # Load the PDF data
    loader = PyPDFLoader("DM_DIR_PEC_JUNY_23.pdf")
    data = loader.load()

    # Split the data into smaller documents
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(data)

    # Create embeddings
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

    # Create a Pinecone index and add the documents to it
    #pinecone.create_index("python-index", dimension=1536, metric="cosine")
    pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_API_ENV)
    index_name = "python-index"
    docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

    # Perform similarity search
    docs = docsearch.similarity_search(prompt)

    # Load the question answering chain
    llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
    chain = load_qa_chain(llm, chain_type="stuff")

    # Query the documents and get the answer
    answer = chain.run(input_documents=docs, question=prompt)

    return answer
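Note that create_qa_bot reloads the PDF, re-embeds every chunk and uploads everything to the index again on each call, which costs API calls and can add duplicate vectors. One way around this, sketched here with stand-in functions instead of real Pinecone and OpenAI clients, is to build the retriever once and reuse it for every question. All names in this block (`get_retriever`, `answer_question`, `example.pdf`) are hypothetical.

```python
from functools import lru_cache

stats = {"uploads": 0}  # counts how often the expensive indexing step runs


@lru_cache(maxsize=None)
def get_retriever(pdf_path: str):
    """Stand-in for load + split + embed + upload; runs once per pdf_path."""
    stats["uploads"] += 1
    chunks = [f"chunk {i} of {pdf_path}" for i in range(3)]  # placeholder data

    def retrieve(question: str, k: int = 4) -> list[str]:
        # Placeholder retrieval: a real version would call similarity_search.
        return chunks[:k]

    return retrieve


def answer_question(question: str, pdf_path: str = "example.pdf") -> list[str]:
    retrieve = get_retriever(pdf_path)  # cached after the first call
    return retrieve(question)


first = answer_question("When was the school founded?")
second = answer_question("Who appoints the coordinator?")
print(stats["uploads"])  # 1
```

In the real notebook, the same effect can be had by uploading once with Pinecone.from_texts and using Pinecone.from_existing_index for later questions.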

In [64]:
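# Usage example
prompt = "When was the school founded?"
answer = create_qa_bot(prompt)
print(answer)
# Note: `docs` is the global variable from the earlier similarity search,
# not the documents retrieved inside create_qa_bot
print(docs)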

 The school was founded in 1898 by the Missioners del Sagrat Cor.
[Document(page_content='projecte\neducatiu.\n2.\nEl\ncoordinador\nde\npastoral\nés\nnomenat\ni\ncessat\nper\nl’entitat\ntitular\ndel\ncentre\ni\nrealitza\nles\nseves \nfuncions\nen\nestreta\nrelació\namb\nel\ntitular,\nel\ndirector\ni\nels\ncaps\nd’estudis,\nen\nordre\na\ncol·laborar\nper \nfer\nrealitat\nels\nobjectius\neducatius\ndel\ncentre\nen\ntotes\nles\netapes.\n3.\nEl\nnomenament\ndel\ncoordinador\nde\npastoral\nserà\npel\ntemps\nque\nestipuli\nl’entitat\ntitular\ni\npodrà\nser \nrenovat.\n4.\nEl\ncoordinador\nde\npastoral\neducativa\nforma\npart\nde\nl’equip\ndirectiu\ndel\ncentre.\nArtícle\n33\nLes\nfuncions\ndel\ncoordinador\nde\npastoral\neducativa\nsón\nles\nsegüents:\na)\nAnimar\ni\ncoordinar\nl’acció\nde\nl’equip\nde\npastoral\neducativa\ni\nde\ntots\nels\nseus\nmembres,\ni \nconvocar\ni\npresidir\nles\nreunions.\nb)\nImpulsar\nla\nprogramació\ni\nrealització\nde\nles\niniciatives\ni\nactivitats\ntendents\