<a href="https://colab.research.google.com/github/GeorgeCrossIV/Langchain-PDF-Law-CassIO/blob/main/Langchain_with_PDF_using_cassio_Law_cases.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Langchain Retrieval Augmentation (using Law data)
This notebook provides an example of using a PDF to create embeddings and ultimately enable a user to ask questions regarding the contents of the PDF.

## Colab-specific setup

Make sure you have a Database and get ready to upload the Secure Connect Bundle and supply the Token string
(see [Pre-requisites](https://cassio.org/start_here/#vector-database) on cassio.org for details).

Likewise, ensure you have the necessary secret for the LLM provider of your choice: you'll be asked to input it shortly
(see [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for details).

_Note: some portions of this notebook are part of the CassIO documentation. Visit [this page on cassIO.org](https://cassio.org/frameworks/langchain/qa-basic/)._


In [None]:
# install required dependencies
! pip install \
    "langchain" \
    "cassandra-driver>=3.28.0" \
    "cassio" \
    "google-cloud-aiplatform>=1.25.0" \
    "jupyter>=1.0.0" \
    "openai==0.27.7" \
    "tiktoken==0.4.0" \
    "pypdf"

You will likely be asked to "Restart the Runtime" at this time, as some dependencies
have been upgraded. **Please do restart the runtime now** for a smoother execution from this point onward.

# Load a PDF file

In [None]:
!wget "https://github.com/GeorgeCrossIV/CassIO---PDF-Law-case-questions/raw/main/McCall-v-Microsoft.pdf"

In [None]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('McCall-v-Microsoft.pdf')
pages = loader.load_and_split()

# Configure the Astra DB Connection

In [None]:
# Input your database keyspace name:
ASTRA_DB_KEYSPACE = input('Your Astra DB Keyspace name: ')

In [None]:
# Input your Astra DB token string, the one starting with "AstraCS:..."
ASTRA_DB_TOKEN_BASED_PASSWORD = input('Your Astra DB Token: ')

### Astra DB Secure Connect Bundle

Please upload the Secure Connect Bundle zipfile to connect to your Astra DB instance.

The Secure Connect Bundle is needed to establish a secure connection to the database.
Click [here](https://awesome-astra.github.io/docs/pages/astra/download-scb/#c-procedure) for instructions on how to download it from Astra DB.

In [None]:
# Upload your Secure Connect Bundle zipfile:
import os
from google.colab import files

print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )

In [None]:
# colab-specific override of helper functions
from cassandra.cluster import (
    Cluster,
)
from cassandra.auth import PlainTextAuthProvider

# The "username" is the literal string 'token' for this connection mode:
ASTRA_DB_TOKEN_BASED_USERNAME = 'token'


def getCQLSession(mode='astra_db'):
    if mode == 'astra_db':
        cluster = Cluster(
            cloud={
                "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
            },
            auth_provider=PlainTextAuthProvider(
                ASTRA_DB_TOKEN_BASED_USERNAME,
                ASTRA_DB_TOKEN_BASED_PASSWORD,
            ),
        )
        astraSession = cluster.connect()
        return astraSession
    else:
        raise ValueError('Unsupported CQL Session mode')

def getCQLKeyspace(mode='astra_db'):
    if mode == 'astra_db':
        return ASTRA_DB_KEYSPACE
    else:
        raise ValueError('Unsupported CQL Session mode')

### LLM Provider

In the cell below you can choose between **GCP VertexAI** or **OpenAI** for your LLM services.
(See [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for more details).

Make sure you set the `llmProvider` variable and supply the corresponding access secrets in the following cell.

In [None]:
# Set your secret(s) for LLM access:
llmProvider = 'OpenAI'  # 'GCP_VertexAI'

In [None]:
if llmProvider == 'OpenAI':
    apiSecret = input(f'Your secret for LLM provider "{llmProvider}": ')
    os.environ['OPENAI_API_KEY'] = apiSecret
elif llmProvider == 'GCP_VertexAI':
    # we need a json file
    print(f'Please upload your Service Account JSON for the LLM provider "{llmProvider}":')
    from google.colab import files
    uploaded = files.upload()
    if uploaded:
        vertexAIJsonFileTitle = list(uploaded.keys())[0]
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(os.getcwd(), vertexAIJsonFileTitle)
    else:
        raise ValueError(
            'No file uploaded. Please re-run the cell.'
        )
else:
    raise ValueError('Unknown/unsupported LLM Provider')

# Vector Similarity Search QA Quickstart

_**NOTE:** this uses Cassandra's "Vector Similarity Search" capability.
Make sure you are connecting to a vector-enabled database for this demo._
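
To make the idea of similarity search concrete, here is a minimal pure-Python sketch. It ranks stored vectors by cosine similarity to a query vector using a brute-force scan. The three-dimensional vectors and the helper names (`cosine_similarity`, `top_k`) are invented for illustration. A vector-enabled database performs this kind of lookup server-side, and its actual similarity measure and indexing strategy may differ from what is assumed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings", made up for this example only.
toy_store = [
    ("chunk about the verdict", [0.9, 0.1, 0.0]),
    ("chunk about the parties", [0.1, 0.9, 0.0]),
    ("chunk about procedure", [0.0, 0.2, 0.9]),
]

nearest = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(nearest)
```

Real embeddings have hundreds or thousands of dimensions, which is why a dedicated vector store is used instead of a linear scan like this one.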

In [None]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.vectorstores.cassandra import Cassandra

A database connection is needed to access Cassandra. The following assumes
that a _vector-search-capable Astra DB instance_ is available. Adjust as needed.

In [None]:
# creation of the DB connection
cqlMode = 'astra_db'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)

Both an LLM and an embedding function are required.

Below is the logic to instantiate the LLM and embeddings of choice. We choose to leave it in the notebooks for clarity.

In [None]:
# creation of the LLM resources

if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from VertexAI')
elif llmProvider == 'OpenAI':
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
else:
    raise ValueError('Unknown LLM provider.')

## Langchain Retrieval Augmentation

The following is a minimal usage of the Cassandra vector store. The store is created and filled at once, and is then queried to retrieve relevant parts of the indexed text, which are then stuffed into a prompt finally used to answer a question.
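
The "stuffing" step mentioned above can be pictured with a small sketch: the retrieved passages are concatenated into a single context block, and the question is appended. The function name `build_stuffed_prompt` and the prompt wording are hypothetical. LangChain uses its own internal template, which this example does not reproduce.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Join the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks)
    return (
        "Use the following context to answer the question. "
        "If the context is not sufficient, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks standing in for what the vector store would return.
example_chunks = [
    "First retrieved passage from the PDF.",
    "Second retrieved passage from the PDF.",
]
prompt = build_stuffed_prompt("What was the verdict?", example_chunks)
print(prompt)
```

Because every retrieved chunk ends up in one prompt, the number and size of chunks must stay within the LLM's context window.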

The following creates an "index creator", which knows about the type of vector store, the embedding to use and how to preprocess the input text:

_(Note: stores built with different embedding functions will need different tables. This is why we append the `llmProvider` name to the table name in the next cell.)_

In [None]:
table_name = 'vs_law_pdf_' + llmProvider

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Cassandra,
    embedding=myEmbedding,
    text_splitter=CharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=0,
    ),
    vectorstore_kwargs={
        'session': session,
        'keyspace': keyspace,
        'table_name': table_name,
    },
)

Create the Cassandra vector store and clear its entries in case the table already exists.

In [None]:
myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name=table_name,
)

myCassandraVStore.clear()

In [None]:
mySplitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=120)
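
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of overlapping chunking with a fixed sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences), and the function name `sliding_window_chunks` is made up for this example. It only shows how consecutive chunks can share text so that a sentence cut at a boundary still appears whole in a neighbouring chunk.

```python
def sliding_window_chunks(text, chunk_size=250, chunk_overlap=120):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A 600-character dummy string, used only to show the window arithmetic.
sample = "".join(str(i % 10) for i in range(600))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

A larger overlap increases the number of stored chunks (and embedding calls) but lowers the chance that relevant context is split across a boundary.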

In [None]:
# split each page into overlapping chunks and store them with their embeddings
for page in pages:
    page_chunks = mySplitter.transform_documents([page])
    myCassandraVStore.add_documents(page_chunks)

In [None]:
index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)

Let's ask some questions about the PDF we loaded:

In [None]:
query = "What is the background of the McCall v. Microsoft Corp. case?"
index.query(query,llm=llm)

In [None]:
query = "Who were the key parties involved in the case?"
index.query(query,llm=llm)

In [None]:
query = "What was the verdict?"
index.query(query,llm=llm)

You've now seen how an LLM can answer questions using content retrieved from the Astra DB vector store. Note that these answers differ from what the LLM would return if asked directly, without the retrieved context.

Let's look at the source passages retrieved for the question "What was the verdict?":

In [None]:
# retrieve only the 2 most similar chunks
retriever = index.vectorstore.as_retriever(search_kwargs={
    'k': 2,
})

In [None]:
retriever.get_relevant_documents(
    "What was the verdict?"
)