# Chat with your PDF Files
Including source documents
___

Installing required libraries

In [1]:
#!pip install openai langchain pypdf chromadb docarray

In [2]:
import pandas as pd

I saved my OpenAI API key in a CSV file

In [3]:
k = pd.read_csv("keys.csv")

In [4]:
import os
import openai
os.environ['OPENAI_API_KEY'] = k["key"][0]  # <- replace with your own OpenAI API key
openai.api_key  = os.environ['OPENAI_API_KEY']
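
If you prefer not to depend on pandas just to read one value, the key lookup can be sketched with the standard library alone. This is a minimal sketch, not part of the original notebook: the helper name `load_openai_key` is hypothetical, and it assumes the CSV has a header row with a `key` column, as `k["key"][0]` above implies. An environment variable, if set, takes precedence over the file.

```python
import csv
import os


def load_openai_key(csv_path="keys.csv", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, else from the first CSV row."""
    key = os.environ.get(env_var)
    if key:
        return key
    # Fall back to the CSV file; assumes a header row with a "key" column.
    with open(csv_path, newline="") as handle:
        first_row = next(csv.DictReader(handle), None)
    if not first_row or not first_row.get("key"):
        raise ValueError(f"No 'key' value found in {csv_path}")
    return first_row["key"].strip()
```

Calling `os.environ['OPENAI_API_KEY'] = load_openai_key()` would then replace the pandas-based lookup.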

Document Loading

In [5]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("Reading from SQL Databases.pdf")  # <- replace with your own PDF file
pages = loader.load()
print(len(pages))
pages[0].metadata

13


{'source': 'Reading from SQL Databases.pdf', 'page': 0}

Split Documents

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 250,
    chunk_overlap = 25
)
splits = text_splitter.split_documents(pages)
print(len(splits))
splits[1].page_content

27


'S\nQL (Structured Query Language)\n▶Pronounced Ess Queue Ell orSequel\n▶The package RODBC is used to read SQL databases (and other database\nformats).\n▶Load required package\n> library(RODBC)\n▶Get an overview of the package: library(help=RODBC)'
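
To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts a string into fixed-width windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as newlines, which is why real chunks are often shorter than `chunk_size`); the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=250, chunk_overlap=25):
    """Cut text into fixed-width windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 600-character sample string made of repeating digits.
sample = "".join(str(i % 10) for i in range(600))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-25:] == chunks[1][:25])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.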

In [7]:
#!pip install tiktoken

Vector Store

In [8]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
from langchain.vectorstores import Chroma
persist_directory = 'sagemaker-studiolab-notebooks/GenAIDemos/ChromaPDF/'

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()

# The vector store only needs to be persisted once. To reuse it later, uncomment the following lines instead of rebuilding it:
# embedding = OpenAIEmbeddings()
# vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
# print(vectordb._collection.count())

27
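
Conceptually, the vector store keeps one embedding vector per chunk and answers a question by returning the chunks whose vectors lie closest to the question's vector. The sketch below illustrates that idea in plain Python with made-up three-dimensional vectors. It uses cosine similarity for illustration only; the metric actually used depends on how the Chroma collection is configured, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" invented for this example; real ones have many more dimensions.
store = [
    ("Getting a table", [0.9, 0.1, 0.0]),
    ("Data types", [0.1, 0.9, 0.0]),
    ("Other operating systems", [0.0, 0.1, 0.9]),
]
for text, score in top_k([1.0, 0.2, 0.0], store):
    print(f"{score:.3f}  {text}")
```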


Chat Chain

In [19]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

from langchain.chains import ConversationalRetrievalChain

retriever=vectordb.as_retriever()

qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=retriever,
    return_source_documents = True
)
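
With `return_source_documents = True`, each result carries a `source_documents` list whose items expose `page_content` and `metadata`. A small formatting helper can list every retrieved chunk rather than a single one. This is a sketch under assumptions: `format_sources` and `FakeDocument` are hypothetical names, and the sample documents below are invented stand-ins so the example runs without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing the two attributes used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(docs, max_chars=80):
    """Build one line per retrieved chunk: source file, page, and a short excerpt."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces, then truncate the excerpt.
        excerpt = " ".join(doc.page_content.split())[:max_chars]
        lines.append(f"[{rank}] {source}, page {page}: {excerpt}")
    return "\n".join(lines)


# Invented sample data, for illustration only.
docs = [
    FakeDocument("Getting a table\nUse sqlFetch to get a table.",
                 {"source": "example.pdf", "page": 9}),
    FakeDocument("Submit real SQL\nUse sqlQuery for more advanced queries.",
                 {"source": "example.pdf", "page": 10}),
]
print(format_sources(docs))
```

In the chat loop, `print(format_sources(results["source_documents"]))` would show all sources behind an answer.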

Chat With PDF

In [20]:
chat_history = []

In [21]:
while True:
    query = input("Question: ")
    # Stop before calling the chain so that "quit" is not sent to the model as a question
    if query.strip().lower() == "quit":
        break
    results = qa({"question": query, "chat_history": chat_history})
    print("AI Answer: " + results["answer"])
    print("----------------------------------")
    # Metadata (source file and page) of the first retrieved chunk
    print("Source Document: " + str(results["source_documents"][0].metadata))
    print("----------------------------------")
    # Text of the second retrieved chunk
    print("Source Document: " + str(results["source_documents"][1].page_content))
    # Store the exchange so that follow-up questions can refer back to it
    chat_history.append((query, results["answer"]))
    print("==================================")
    print("\n")

Question:  How can I fetch all the rows from a table?


AI Answer: To fetch all the rows from a table, you can use the `sqlFetch()` function in R. Here is an example:

```R
# Assuming you have already established a connection to the database
# and assigned it to the variable 'conn'

# Fetch all rows from the table 'manufacturer' in the 'bi' schema
data <- sqlFetch(conn, "bi.manufacturer")

# Print the fetched data
print(data)
```

This will fetch all the rows from the specified table and store them in the `data` variable. You can then manipulate or analyze the data as needed.
----------------------------------
Source Document: {'page': 9, 'source': 'Reading from SQL Databases.pdf'}
----------------------------------
Source Document: G
etting a table
▶UsesqlFetch to get a table from the database.
▶Get the table ’manufacturer’ from SCHEMA ’bi’:
> mf <- sqlFetch(conn,"bi.manufacturer")
> mf
ManufacturerID Manufacturer
1 1 Abbas
2 2 Aliqui
3 3 Barba
4 4 Currus
5 5 Fama
6 6 Leo




Question:  quit

Comparing results against **Maximal Marginal Relevance (MMR)**

In [28]:
result_mmr = vectordb.max_marginal_relevance_search(query="How many data types are supported?", k=3)

In [29]:
print(result_mmr[0].page_content)

Da
ta types
▶Classes of variables on the Rside:
> sapply(df, class)
ProductID Date Zip Units Revenue
"integer" "factor" "integer" "integer" "numeric"
▶Recall that the variable ’Zip’ was stored as the SQL speciﬁc type ’varchar’.
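
MMR picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen, so the final set is relevant but not repetitive. The following is a minimal sketch of that greedy selection over toy two-dimensional vectors. It is an illustration of the idea, not the code Chroma or LangChain runs; the function names and the weighting parameter `lambda_mult` (1.0 means relevance only, lower values favor diversity) are labels chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k indices, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Highest similarity to anything already selected (0.0 if nothing is).
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
print(mmr_select(query, docs, k=2, lambda_mult=0.3))
print(mmr_select(query, docs, k=2, lambda_mult=1.0))
```

With a low `lambda_mult` the second pick skips the near-duplicate and takes the dissimilar document; with `lambda_mult=1.0` the selection falls back to plain similarity ranking.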


In [30]:
result_mmr = vectordb.max_marginal_relevance_search(query="How can I submit a query to fetch rows from a table?", k=3)

In [31]:
print(result_mmr[0].page_content)

S
ubmit real SQL
▶UsesqlQuery for more advanced queries.
▶SQL syntax example:
SELECT Manufacturer FROM bi.manufacturer WHERE ManufacturerID < 10
▶Submit query to R:
> query <- "
+ SELECT Manufacturer
+ FROM bi.manufacturer
+ WHERE ManufacturerID < 10
