# LangChain Pinecone OpenAI - Query Your Own Text/PDF File - The Basics

#### This notebook walks through the basics of using Pinecone, OpenAI, and LangChain to query your own text or PDF document


## Install dependencies with pip

In [None]:
pip install langchain

In [None]:
pip install openai

In [None]:
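# pinecone.init(), used later in this notebook, was removed in pinecone-client 3.0, so stay on 2.x
pip install "pinecone-client<3"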

In [None]:
pip install tiktoken

In [None]:
# PyPDFLoader, used in the PDF section, depends on pypdf
pip install pypdf

### Set environment variables and keys

In [1]:
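# Keys, models, and environment settings
import os

# Paste your OpenAI API key between the quotes
os.environ["OPENAI_API_KEY"] = ""
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# OpenAI embedding model name
embed_model = "text-embedding-ada-002"

# Paste your Pinecone API key and environment name between the quotes
os.environ["PINECONE_API_KEY"] = ""
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENV = ""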
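
Hardcoding keys in a notebook makes them easy to leak. As a minimal sketch of an alternative, the helper below reads a key that was already exported in your shell and fails early if it is missing. The name `require_env` is hypothetical and is not part of LangChain, OpenAI, or Pinecone.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (uncomment once the variables are exported in your shell):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
# PINECONE_API_KEY = require_env("PINECONE_API_KEY")
```

Failing at this point gives a clearer error than an authentication failure several cells later.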


### Import required modules

In [None]:
import openai, langchain, pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI

### Import your own text file

In [2]:
# Open the data file, read its content, and close the file when done

with open('../data/WizardOfOz.txt', 'r') as file_data:
    file_content = file_data.read()
len(file_content)

227191

### Split the text with RecursiveCharacterTextSplitter to stay within the 4096-token OpenAI limit

In [5]:
# Set up the RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # chunk_size is measured in characters here because length_function is len
    chunk_size=2000,
    # No characters are shared between neighbouring chunks
    chunk_overlap=0,
    length_function=len,
)
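
To give a feel for what a recursive character splitter does, here is a simplified pure-Python sketch. It is not LangChain's implementation: it ignores `chunk_overlap`, drops the separators it splits on, and the name `recursive_split` is hypothetical. The idea it illustrates is to try the coarsest separator first (paragraphs), and fall back to finer ones (lines, words, single characters) only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring the coarsest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # The part still fits, so keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry it with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee fff ggg", 11))
# ['aaa bbb ccc', 'ddd eee fff', 'ggg']
```

The first paragraph fits in one chunk, while the second is too long and gets split again on spaces.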

In [6]:
# Split the file content into Document chunks

book_texts = text_splitter.create_documents([file_content])
print(len(book_texts))
type(book_texts)

124


list
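
A quick sanity check on these numbers, using only the values printed above. The splitter cuts at separators, not at exactly 2000 characters, so it produces somewhat more chunks than the theoretical minimum. The average is approximate, since characters dropped at the split points are not accounted for.

```python
import math

total_chars = 227191   # len(file_content) reported above
chunk_size = 2000      # characters, as configured in the splitter
num_chunks = 124       # len(book_texts) reported above

# Fewest chunks possible if every chunk were completely full
min_chunks = math.ceil(total_chars / chunk_size)

# Approximate average chunk length actually produced
avg_chunk_len = total_chars / num_chunks

print(min_chunks)            # 114
print(round(avg_chunk_len))  # 1832
```

Using the common rule of thumb of roughly four characters per English token, a 2000-character chunk is on the order of 500 tokens, so a handful of retrieved chunks should fit within the 4096-token limit mentioned above.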

In [7]:
print(book_texts[31])

page_content='But the Lion went away into the forest and found his own supper, and no\none ever knew what it was, for he didn’t mention it. And the Scarecrow\nfound a tree full of nuts and filled Dorothy’s basket with them, so\nthat she would not be hungry for a long time. She thought this was very\nkind and thoughtful of the Scarecrow, but she laughed heartily at the\nawkward way in which the poor creature picked up the nuts. His padded\nhands were so clumsy and the nuts were so small that he dropped almost\nas many as he put in the basket. But the Scarecrow did not mind how\nlong it took him to fill the basket, for it enabled him to keep away\nfrom the fire, as he feared a spark might get into his straw and burn\nhim up. So he kept a good distance away from the flames, and only came\nnear to cover Dorothy with dry leaves when she lay down to sleep. These\nkept her very snug and warm, and she slept soundly until morning.\n\nWhen it was daylight, the girl bathed her face in a little ri

### Pinecone and OpenAI Embedding setup

In [9]:
# Pinecone related setup

pinecone.init(
        api_key = PINECONE_API_KEY,
        environment = PINECONE_ENV
)

# Name of the Pinecone index used for this project

index_name = 'testsearchbook'


In [10]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [11]:
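# Create the index if it does not exist yet.
# text-embedding-ada-002 returns 1536-dimensional vectors.
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")

# Embed every chunk and upsert it into the index
book_docsearch = Pinecone.from_texts([t.page_content for t in book_texts], embeddings, index_name=index_name)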


In [12]:
type(book_docsearch)

langchain.vectorstores.pinecone.Pinecone

### Import load_qa_chain from LangChain

In [13]:
from langchain.chains.question_answering import load_qa_chain

In [14]:
# Set up the LLM for the QA session (temperature=0 keeps the answers as deterministic as possible)

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

In [15]:
# Set up the query and retrieve the chunks most similar to it

query = "Who is Dorothy?"
docs = book_docsearch.similarity_search(query)
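
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with exact cosine similarity on toy 2-dimensional vectors. It is an illustration only: Pinecone runs the search server-side, the metric depends on how the index was created, and the helper names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy "embeddings": vectors 0 and 2 point almost the same way as the query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # [0, 2]
```

In the notebook, the real vectors come from the OpenAI embedding model and have 1536 dimensions instead of 2.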

### Ask questions about your document and get answers

In [20]:
# Run the QA chain with your query to get the answer

chain = load_qa_chain(llm, chain_type="stuff")
chain.run(input_documents=docs, question=query)

' Dorothy is a young girl who was carried away by a cyclone from her home. She is innocent and harmless and has never killed anything in her life.'
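
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows the general shape of that step. The prompt wording and the function name `build_stuff_prompt` are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Dorothy lived in Kansas.", "A cyclone carried the house away."],
    "Who is Dorothy?",
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the combined chunks stay under the model's token limit, which is why the chunk size chosen earlier matters.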

## We can also query our own PDF files

In [21]:
# Import the PDF loader and load the file (PyPDFLoader requires the pypdf package)

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("../data/Scale-Zeitgeist-AI-Readiness-Report-2023.pdf")


In [22]:
type(loader)

langchain.document_loaders.pdf.PyPDFLoader

In [23]:
file_content = loader.load()

In [24]:
type(file_content)

list

In [25]:
len(file_content)

24

In [26]:
file_content[15]

Document(page_content='Scale 28 29 Zeitgeist: 2023 AI Readiness ReportLogistics and Supply Chain\nLogistics and supply chain com -\npanies adopt AI to help them \nimprove operational eﬃciency, \nimprove customer experience, and \ngrow revenue.\nTo help achieve these goals, \nlogistics and supply chain com -\npanies are looking to adopt AI for \nbetter inventory management and \ndemand forecasting, improved \nroute optimization, to deploy au -\ntonomous vehicles, and improve \ndocument processing throughput \nand quality. These tools directly \nimpact operational eﬃciency, \nwhich has downstream impacts on \nthe overall customer experience, \nwith reliable delivery and fewer \ndelays.\nFor inventory management and \ndemand forecasting, logistics \nand supply chain companies are \nadopting AI to help reduce costs, \nimprove customer satisfaction, \nand improve forecast accuracy.TOP USE CASES BY INDUSTRYLogistics and Supply Chain\nFor route planning, logistics and \nsupply chain companies

In [27]:

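book_texts = text_splitter.split_documents(file_content)

# Embed the PDF chunks and add them to the index; without this step the
# queries below would only search the text that was indexed earlier
book_docsearch = Pinecone.from_documents(book_texts, embeddings, index_name=index_name)

print(len(book_texts))
type(book_texts)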

33


list

In [28]:
# Let's set up the query 

query = "How are enterprises working with generative ai?"
docs = book_docsearch.similarity_search(query)

# Run the QA chain with your query to get the answer

chain = load_qa_chain(llm, chain_type="stuff")
chain.run(input_documents=docs, question=query)

' Enterprises are mostly looking to leverage open-source generative models (41%) or Cloud API generative models (37%), while very few are looking to build their own generative models (22%). Furthermore, 28% are exclusively using open-source models, while 26% use cloud APIs and only 15% are exclusively building their own.'

In [29]:
# Let's set up a different query 

query = "What outcomes have companies seen from AI adoption?"
docs = book_docsearch.similarity_search(query)

# Run the QA chain with your query to get the answer

chain = load_qa_chain(llm, chain_type="stuff")
chain.run(input_documents=docs, question=query)

' Companies adopting AI are seeing positive outcomes from improved customer experiences, the ability to develop new products or services and improve existing products, and improved collaboration across business functions.'
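
The last few cells repeat the same two steps: retrieve similar chunks, then run the QA chain on them. A small helper can wrap that pattern. This is a sketch that assumes objects shaped like the `book_docsearch` and `chain` used above. The name `answer_question` is hypothetical.

```python
def answer_question(docsearch, chain, query, k=4):
    """Retrieve the k chunks most similar to the query and run the QA chain on them."""
    docs = docsearch.similarity_search(query, k=k)
    return chain.run(input_documents=docs, question=query)


# Example usage with the objects created in this notebook:
# chain = load_qa_chain(llm, chain_type="stuff")
# answer_question(book_docsearch, chain, "What outcomes have companies seen from AI adoption?")
```

Creating the chain once and reusing it also avoids calling `load_qa_chain` for every question.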