<a href="https://colab.research.google.com/gist/Daethyra/3c2a1ab8bda6e326513d52a77d6b5ea7/ask-a-book-questions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install langchain --upgrade
# Version at the time of writing: 0.0.164

# Install necessary packages and upgrade outdated packages
!pip install -qU pinecone-client python-dotenv pypdf openai chromadb tiktoken unstructured

# Clone Greg's LangChain tutorials repository, which contains the data/ folder used below.
# The folder must be arranged in your Google Drive directory.
!git clone https://github.com/gkamradt/langchain-tutorials.git




ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
types-requests 2.31.0.10 requires urllib3>=2, but you have urllib3 1.26.18 which is incompatible.
Cloning into 'langchain-tutorials'...


In [2]:
# PDF Loaders. If unstructured gives you a hard time, try PyPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
from dotenv import load_dotenv

# Load variables from a local .env file; returns False when no values were loaded
load_dotenv()

False

### Load your data

In [7]:
# Basic PDF loader
# loader = PyPDFLoader("./field-guide-to-data-science.pdf")

## Other options for loaders
# loader = UnstructuredPDFLoader("../data/field-guide-to-data-science.pdf")
loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")

In [8]:
data = loader.load()

ValueError: unstructured package not found, please install it with `pip install unstructured`

In [6]:
# Note: If you're using PyPDFLoader then it will split by page for you already
print(f'You have {len(data)} document(s) in your data')
# Use index 0 so this also works when the loader returns the whole PDF as a single document
print(f'There are {len(data[0].page_content)} characters in your first document')

NameError: name 'data' is not defined

### Chunk your data up into smaller documents

In [None]:
# Note: If you're using PyPDFLoader then we'll be splitting for the 2nd time.
# This is optional, test out on your own data.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [None]:
print (f'Now you have {len(texts)} documents')

Now you have 162 documents
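
To get a feel for what `chunk_size` and `chunk_overlap` control, here is a minimal standard-library sketch of fixed-window chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries separators such as paragraph and line breaks before falling back to raw characters); `split_text` and `sample` are hypothetical names made up for this illustration.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters of stand-in text
chunks = split_text(sample, chunk_size=12, chunk_overlap=4)
print(len(chunks), [len(c) for c in chunks])
```

With `chunk_overlap=0`, as in the cell above, the windows simply sit back to back. A small overlap can help when an answer straddles a chunk boundary, at the cost of embedding some text twice.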


### Create embeddings of your documents to get ready for semantic search

In [None]:
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone

Check to see if there is an environment variable with your API keys; if not, use what you put below.

In [None]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'sk-')
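
One catch with the fallback above: if the variable is missing, the `'sk-'` placeholder is passed along silently and you only find out when the first API call fails. Here is a small sketch of failing early instead. `resolve_api_key` is a hypothetical helper, not part of LangChain or python-dotenv, and the `env` argument exists only so the example runs without touching your real environment.

```python
import os


def resolve_api_key(name, placeholder, env=None):
    """Return the key stored under `name`, or raise if only the placeholder is available."""
    env = os.environ if env is None else env
    value = env.get(name, placeholder)
    if not value or value == placeholder:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value


# Demo with an empty stand-in environment, so the missing-key path is exercised
try:
    resolve_api_key("OPENAI_API_KEY", "sk-", env={})
except RuntimeError as err:
    print(err)
```

In the notebook you would call it without `env`, so it reads from `os.environ` after `load_dotenv()` has run.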

In [None]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

### Option #1: Pinecone
If you want to use Pinecone, run the code below; if not, skip to the Chroma section below it. You must go to [Pinecone.io](https://www.pinecone.io/) and set up an account.

In [None]:
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY', 'YourAPIKey')
PINECONE_API_ENV = os.getenv('PINECONE_API_ENV', 'us-east1-gcp') # You may need to switch with your env

# initialize pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_API_ENV  # next to api key in console
)
index_name = "langchaintest" # put in the name of your pinecone index here

docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

### Option #2: Chroma

I like Chroma because it's local and easy to set up without an account.

In [None]:
# load it into Chroma
docsearch = Chroma.from_documents(texts, embeddings)

In [None]:
query = "What is the top priority of a good data science team?"
docs = docsearch.similarity_search(query)

In [None]:
# Here's an example of the first document that was returned
print(docs[0].page_content[:450])

imagination should be the 
hallmarks of Data Science. They 
are fundamental to the success 
of every Data Science project.
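
If you're wondering what `similarity_search` is doing conceptually, here is a rough sketch: embed the query, score it against every stored vector, and return the closest ones. The sketch uses made-up 3-dimensional vectors and plain cosine similarity. Real vector stores use much larger embeddings, approximate indexes, and sometimes a different distance metric, so treat `cosine_similarity`, `top_k`, and the toy vectors as illustrative assumptions only. The default `k=4` mirrors what I understand LangChain's default to be.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings": three documents and one query
toy_doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query_vec = [0.9, 0.1, 0.0]

ranked = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(ranked)  # indices of the two closest documents
```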


### Query those docs to get your answer back

In [None]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [None]:
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

In [None]:
query = "What is the collect stage of data maturity?"
docs = docsearch.similarity_search(query)

In [None]:
chain.run(input_documents=docs, question=query)

' The collect stage of data maturity focuses on collecting internal or external datasets. Gathering sales records and corresponding weather data is an example of the collect stage.'
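
The `"stuff"` chain type simply stuffs all the retrieved documents into a single prompt and sends that to the model in one call. Here is a minimal sketch of the idea. The prompt wording and the `build_stuff_prompt` helper are my own stand-ins, not LangChain's actual template, and `toy_docs` holds placeholder strings rather than real chunks from the book.

```python
def build_stuff_prompt(doc_texts, question):
    """Join every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for docs[i].page_content
toy_docs = ["<first retrieved chunk>", "<second retrieved chunk>"]
toy_question = "What is the collect stage of data maturity?"

prompt = build_stuff_prompt(toy_docs, toy_question)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit inside the model's context window, which is one reason to keep an eye on `chunk_size` and on how many documents you retrieve.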