<a href="https://colab.research.google.com/github/ObjectMatrix/google-colab-notebook/blob/main/pinecode_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install langchain --upgrade
!pip install pinecone-client
!pip install openai 
!pip install pypdf

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
# PDF loaders. If UnstructuredPDFLoader gives you a hard time, try PyPDFLoader.
# Source of the book; the loader cell below expects a local copy at ./mml-book.pdf
mybook = "https://github.com/ObjectMatrix/google-colab-notebook/blob/main/mml-book.pdf"

from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader

from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import json


In [3]:
loader = PyPDFLoader("./mml-book.pdf")

## Other options for loaders 
# loader = UnstructuredPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")


In [4]:
data = loader.load()

In [5]:
# Note: PyPDFLoader already splits the PDF by page, so each document is one page
print(f'You have {len(data)} document(s) in your data')
# data[30] is an arbitrary sample page; the character count is for that page only
print(f'There are {len(data[30].page_content)} characters in your document')

You have 412 document(s) in your data
There are 1782 characters in your document


Chunk your data up into smaller units

In [6]:
# Note: If we're using PyPDFLoader then we'll be splitting for the 2nd time.
# This is optional, test out on your own data.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [7]:
print(f'Now you have {len(texts)} documents')

Now you have 638 documents
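
To make the chunking step more concrete, here is a minimal sketch of fixed-size splitting in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter tries separators such as paragraph and line breaks before cutting); the function name `chunk_text` is hypothetical, and its parameters only mirror `chunk_size` and `chunk_overlap` from the cell above.

```python
def chunk_text(text, chunk_size=2000, chunk_overlap=0):
    """Split text into consecutive windows of at most chunk_size characters.

    Hypothetical helper for illustration; it cuts purely by character count.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


page = "x" * 4500  # stand-in for one long page of text
pieces = chunk_text(page, chunk_size=2000, chunk_overlap=0)
print([len(p) for p in pieces])  # [2000, 2000, 500]
```

With `chunk_overlap=0` no text is shared between neighbouring chunks; a positive overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary.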


Now, let's create embeddings for our documents so we can do semantic search

In [8]:
from langchain.vectorstores import Chroma, Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone





In [9]:
with open('/content/drive/MyDrive/secrets.json', 'r') as f:
    secrets = json.load(f)

KEY = secrets['SECRET_KEY']
pinecone_env = secrets['pinecone_env']
# Use a distinct name here: assigning to `pinecone` would shadow the imported pinecone module
pinecone_key = secrets["pinecone"]

# Use the API keys from environment variables if they are set; otherwise fall back to the values from secrets.json
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY', KEY)

PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY', pinecone_key)
PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV', pinecone_env)  # You may need to change this to your own environment
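
The environment-variable fallback used above can be wrapped in a small helper that fails early when no key is available at all. This is only a sketch: `resolve_key` and the variable name `DEMO_API_KEY` are hypothetical and not part of the notebook. Unlike a plain `os.environ.get(name, default)`, it also treats an empty environment variable as missing.

```python
import os


def resolve_key(env_name, fallback=None):
    """Return the value of env_name from the environment, else the fallback.

    Hypothetical helper: raises instead of silently passing None downstream.
    """
    value = os.environ.get(env_name) or fallback
    if not value:
        raise KeyError(f"No value found for {env_name}")
    return value


# Demo with a made-up variable name and placeholder values
os.environ["DEMO_API_KEY"] = "from-environment"
print(resolve_key("DEMO_API_KEY", fallback="from-secrets-file"))  # from-environment
del os.environ["DEMO_API_KEY"]
print(resolve_key("DEMO_API_KEY", fallback="from-secrets-file"))  # from-secrets-file
```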

In [10]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [11]:

import pinecone
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_API_ENV)
index_name = "objectmatrix"


In [12]:
docsearch = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)

In [13]:
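# Confirm that the index is listed (print it, since only a cell's last expression is displayed),
# then run a test query against it
print(pinecone.list_indexes())
query = "algebra"
docs = docsearch.similarity_search(query)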
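
For intuition about what a similarity search does, here is a toy sketch that ranks documents by cosine similarity between embedding vectors. The three-dimensional vectors and the names `cosine_similarity`, `top_k`, and `toy_docs` are made up for illustration; real OpenAI embeddings are much longer, and whether a Pinecone index ranks by cosine similarity depends on the metric it was created with, which this notebook does not show.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up 3-dimensional "embeddings" standing in for real ones
toy_docs = {
    "linear algebra": [0.9, 0.1, 0.0],
    "probability": [0.1, 0.9, 0.1],
    "optimization": [0.5, 0.2, 0.7],
}
toy_query = [1.0, 0.0, 0.0]
print(top_k(toy_query, toy_docs, k=2))  # ['linear algebra', 'optimization']
```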

In [24]:
# Here's an example of the first document that was returned
# print(docs[0].page_content[:450])

[]


In [21]:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

query = "Example 4.1 (Testing for Matrix Invertibility)"
docs = docsearch.similarity_search(query)

chain.run(input_documents=docs, question=query)

'\n\nIn Example 4.1, a method is presented for testing whether a given matrix is invertible. The method involves computing the determinant of the matrix and checking if it is non-zero. If the determinant is non-zero, then the matrix is invertible.'
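
The "stuff" chain type simply places all retrieved chunks into a single prompt alongside the question. The sketch below shows that idea in plain Python. The function `build_stuff_prompt`, the prompt wording, and the toy chunks are assumptions for illustration; LangChain's actual template differs.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one prompt string.

    Hypothetical template; LangChain's actual prompt wording differs.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy chunks standing in for docs[i].page_content
toy_chunks = [
    "A square matrix is invertible if and only if its determinant is non-zero.",
    "The determinant of a triangular matrix is the product of its diagonal entries.",
]
prompt = build_stuff_prompt(toy_chunks, "When is a matrix invertible?")
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is one reason to keep the number and size of retrieved chunks small.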