<a href="https://colab.research.google.com/github/GeorgeCrossIV/RAGstack-PDF-ChatBot/blob/main/RAGstack_PDF_ChatBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with this notebook

- Create a new vector search enabled database in Astra. [astra.datastax.com](https://astra.datastax.com)
- For the easy path, name the keyspace in that database "chatbot" so it matches the `keyspace` variable in this notebook (otherwise be prepared to update that variable in the Keys & Environment Variables cell)
- Create a token with permissions to create tables
- Download your secure-connect-bundle zip file.
- Download the [sample data file from here](https://drive.google.com/file/d/1KlXnYy6CECoQz7wjf-728ci_unpMSxvF/view?usp=sharing)
- When you open this notebook in Google Colab or your own notebook server, drag-and-drop the secure connect bundle and ProductDataset.csv into the File Browser of the notebook
- Set up an OpenAI API account and generate an API key
- Store the OpenAI key and the credentials from the token you generated as Colab secrets named `openai_api_key`, `cass_user`, and `cass_pw`, then update `scb_path` in the Keys & Environment Variables cell with the name of your secure connect bundle file.

# Setup

In [None]:
!pip install openai pandas jupyter-datatables cassandra-driver ragstack-ai pypdf

# Imports

In [2]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory
from cassandra.query import SimpleStatement
from google.colab import userdata
import openai
import numpy
import pandas as pd

# Keys & Environment Variables

In [3]:
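# keys and tokens here (read from Colab secrets)
openai_api_key = userdata.get('openai_api_key')
openai.api_key = openai_api_key
cass_user = userdata.get('cass_user')
cass_pw = userdata.get('cass_pw')
# path to the secure connect bundle uploaded to the notebook's file browser
scb_path = '/content/secure-connect-cassio-db.zip'
keyspace = "chatbot"
table = "chat_documents"
# set to True to drop/recreate the schema and (re)load the embeddings
create_embeddings = False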
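
The cell above reads credentials through Colab's `userdata` secrets, which exist only inside Google Colab. For your own notebook server, a minimal sketch of a fallback could look like the following. It assumes the same three secret names are exported as environment variables, and `get_secret` is a hypothetical helper, not part of any library.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab userdata if available, else from the environment."""
    try:
        # google.colab can only be imported inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
        except Exception:
            # e.g. the secret is not defined or notebook access was not granted
            value = None
        if value:
            return value

    # Assumption: outside Colab the secret is exported under the same name
    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab userdata or environment")
    return value
```

With this in place, a line such as `cass_user = get_secret('cass_user')` would work in both settings. Outside Colab, the top-level `from google.colab import userdata` import in the Imports cell would also need to be removed.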

# Select a model to compute embeddings

In [4]:
model_id = "text-embedding-ada-002"

# Connect to the Cluster

In [5]:
cloud_config= {
  'secure_connect_bundle': scb_path
}
auth_provider = PlainTextAuthProvider(cass_user, cass_pw)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider, protocol_version=4)
session = cluster.connect()
session.set_keyspace(keyspace)
session

<cassandra.cluster.Session at 0x7a1e29da0370>

# Drop / Create Schema

In [6]:
# only use this to reset the schema
if create_embeddings:
  session.execute(f"""DROP INDEX IF EXISTS {keyspace}.openai_desc""")
  session.execute(f"""DROP TABLE IF EXISTS {keyspace}.{table}""")

In [7]:
if create_embeddings:
  # Create the table that stores each text chunk with its embedding
  session.execute(f"""
  CREATE TABLE {keyspace}.{table} (
      document_id text,
      chunk_id int,
      document_text text,
      embedding_vector vector<float, 1536>,
      metadata_blob text,
      PRIMARY KEY (document_id, chunk_id))
  """)

  # Create the vector index used by ANN queries on the embedding column
  session.execute(f"""
  CREATE CUSTOM INDEX IF NOT EXISTS openai_desc ON {keyspace}.{table} (embedding_vector)
  USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
  """)

# Load PDF

In [8]:
!wget "https://github.com/GeorgeCrossIV/CassIO---PDF-Law-case-questions/raw/main/McCall-v-Microsoft.pdf"

--2023-12-08 06:07:04--  https://github.com/GeorgeCrossIV/CassIO---PDF-Law-case-questions/raw/main/McCall-v-Microsoft.pdf
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/GeorgeCrossIV/CassIO---PDF-Law-case-questions/main/McCall-v-Microsoft.pdf [following]
--2023-12-08 06:07:04--  https://raw.githubusercontent.com/GeorgeCrossIV/CassIO---PDF-Law-case-questions/main/McCall-v-Microsoft.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 254969 (249K) [application/octet-stream]
Saving to: ‘McCall-v-Microsoft.pdf.1’


2023-12-08 06:07:04 (5.70 MB/s) - ‘McCall-v-Microsoft.pdf.1’ saved [254969/254969]



In [9]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('McCall-v-Microsoft.pdf')
pages = loader.load_and_split()
#pages[2]


# Load the table with data and create text embeddings

In [10]:
if create_embeddings:
  # Build the INSERT statement once; values are bound on each execute
  query = SimpleStatement(
      f"""
      INSERT INTO {keyspace}.{table}
      (document_id, chunk_id, document_text, embedding_vector, metadata_blob)
      VALUES (%s, %s, %s, %s, %s)
      """
  )
  text_chunk_length = 400
  document_chunk_id = 0
  for page in pages:
    # Split the page text into fixed-size chunks of text_chunk_length characters
    text_chunks = [page.page_content[i:i + text_chunk_length] for i in range(0, len(page.page_content), text_chunk_length)]
    metadata_blob = f"Page: {page.metadata['page']}"
    for chunk in text_chunks:
      # chunk ids count up across the whole document, not per page
      document_chunk_id += 1
      # Create an embedding for each chunk and save it to the database
      embedding = openai.embeddings.create(input=chunk, model=model_id).data[0].embedding
      session.execute(query, (page.metadata['source'], document_chunk_id, chunk, embedding, metadata_blob))
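
The list comprehension in the cell above slices each page into fixed 400-character windows. Below is a minimal standalone sketch of that slicing. The optional overlap between neighboring chunks is an assumption of this example (the notebook itself uses no overlap), and `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_length: int = 400, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character chunks, optionally overlapping."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_length")
    # Each new chunk starts `step` characters after the previous one
    step = chunk_length - overlap
    return [text[i:i + chunk_length] for i in range(0, len(text), step)]


sample = "x" * 1000
print([len(c) for c in chunk_text(sample)])               # [400, 400, 200]
print([len(c) for c in chunk_text(sample, overlap=100)])  # [400, 400, 400, 100]
```

Character windows can cut words and sentences in half. An overlap is one common way to soften that, at the cost of storing and embedding some text twice.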



---


# Start using the index

In the steps up to this point, we have been creating a schema and loading the table with data, including embeddings we generated through the OpenAI Embedding API.
Now we are going to query that table and use the results to give ChatGPT some context to support its response.
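
One way to picture "giving ChatGPT some context" is as a function that turns the retrieved chunks and the question into a chat message list. The sketch below is an illustrative alternative, not this notebook's own approach (the notebook's query cell appends each chunk as a separate assistant message). `build_messages` and its prompt wording are hypothetical.

```python
from typing import Dict, List


def build_messages(question: str, chunks: List[str]) -> List[Dict[str, str]]:
    """Assemble chat messages that place retrieved chunks ahead of the question."""
    # Number the chunks so the excerpts can be told apart in the prompt
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return [
        {"role": "system",
         "content": "You're a chatbot answering questions about a document. "
                    "Answer only from the provided excerpts."},
        {"role": "user",
         "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ]


messages = build_messages("Who is the plaintiff?", ["first excerpt", "second excerpt"])
print(messages[1]["content"])
```

The returned list has the shape that the `messages` parameter of a chat completion call expects, so it could stand in for a hand-built message list.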

# Convert a query string into a text embedding to use as part of the query

This is where the real fun starts.  Provide a question or request to be used as the query.  The source document is the McCall v. Microsoft PDF loaded above, so ask the kind of question you would put to someone who has read the case.

Here we use the same API that we used to calculate embeddings for each row in the database, but this time we are using your input question to calculate a vector to use in a query.

Let's take a look at what a query against a vector index could look like.  The query vector has the same dimensions (number of entries in the list) as the embeddings we generated a few steps ago for each row in the database.
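
To make the idea of a vector query concrete, here is a minimal brute-force sketch of what an approximate-nearest-neighbor (ANN) search approximates: score every stored vector against the query vector and keep the closest ones. It assumes cosine similarity as the metric and uses 3-dimensional toy vectors in place of 1536-dimensional embeddings. `cosine_similarity` and `top_k` are hypothetical helpers, and a real vector index avoids this exhaustive scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: Dict[str, Sequence[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(key, cosine_similarity(query, vec)) for key, vec in rows.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy "table": row key -> embedding
rows = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
nearest = top_k([1.0, 0.0, 0.0], rows, k=2)
print([key for key, _ in nearest])  # ['a', 'b']
```

The dimension check mirrors the requirement stated above: a query vector can only be compared with stored vectors that have the same number of entries.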

In [11]:
customer_input = "What is the background of the McCall v. Microsoft Corp. case?"

embedding = openai.embeddings.create(input=customer_input, model=model_id).data[0].embedding

query = SimpleStatement(
    f"""
    SELECT *
    FROM {keyspace}.{table}
    ORDER BY embedding_vector ANN OF {embedding} LIMIT 5;
    """
    )
#display(query)
results = session.execute(query)
top_5_products = list(results)  # materialize the rows without relying on the private _current_rows attribute

#for row in top_5_products:
#  print(f"""{row.document_id}, {row.document_text}, {row.metadata_blob}\n""")

message_objects = []
message_objects.append({"role":"system",
                        "content":"You're a chatbot answering questions about a document"})

message_objects.append({"role":"user",
                        "content": customer_input})

products_list = []

for row in top_5_products:
    brand_dict = {'role': "assistant", "content": f"{row.document_text}"}
    products_list.append(brand_dict)

message_objects.extend(products_list)
message_objects.append({"role": "assistant", "content":"Here's my summarized answer:"})

completion = openai.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=message_objects
)
print(completion.choices[0].message.content)



The McCall v. Microsoft Corp. case involves a dispute between McCall, the plaintiff, and Microsoft Corporation, the defendant. The case was filed in the United States District Court for the Southern District of New York. The background details of the case are not provided in the document.


In [12]:
customer_input = "Who are the defendants in the case?"

embedding = openai.embeddings.create(input=customer_input, model=model_id).data[0].embedding

query = SimpleStatement(
    f"""
    SELECT *
    FROM {keyspace}.{table}
    ORDER BY embedding_vector ANN OF {embedding} LIMIT 10;
    """
    )
# Run the query and collect the 10 nearest chunks for this question
results = session.execute(query)
top_10_chunks = list(results)

message_objects = []
message_objects.append({"role":"system",
                        "content":"You're a chatbot answering questions about a document"})

message_objects.append({"role":"user",
                        "content": customer_input})

products_list = []

for row in top_10_chunks:
    brand_dict = {'role': "assistant", "content": f"{row.document_text}"}
    products_list.append(brand_dict)

message_objects.extend(products_list)
message_objects.append({"role": "assistant", "content":"Here's my summarized answer:"})

completion = openai.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=message_objects
)
print(completion.choices[0].message.content)

The defendants in the case are tiff WA, Berman Beerman, Steve W. Hagens Lovell, Christopher Lovell and P.S., Stewart PH/MDL, Seattle, WA, Richard C. LLP, New City, Giovanniello, York Earle II, Pepperman, Cromwell, Sullivan & PH/ PC, Enright Haven, CT, Gorman and MDL, New New York City, for Microsoft Corporation.
