<a href="https://colab.research.google.com/github/GeorgeCrossIV/Langchain-Retrieval-Augmentation-with-CASSIO/blob/main/Langchain_Retrieval_Augmentation_using_cassio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Langchain Retrieval Augmentation (using Wikipedia data)
Large Language Models (LLMs) have a data freshness problem. Even the most powerful LLMs, like GPT-4, know nothing about events that happened after their training data was collected.

An LLM's knowledge is frozen in time: it is a static snapshot of the world as captured in its training data.

One solution to this problem is retrieval augmentation: we retrieve relevant information from an external knowledge base and pass it to the LLM along with the question. In this notebook we will learn how to do that, storing external or proprietary data in Astra DB and using it to provide more current LLM responses.
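
To make the idea concrete before any setup, here is a minimal sketch in plain Python. It is an illustration only: the three passages are hand-written stand-ins for a knowledge base, relevance is approximated by counting shared words (the notebook itself uses embeddings stored in Astra DB), and the names `KNOWLEDGE_BASE`, `retrieve` and `build_prompt` are hypothetical helpers that appear nowhere else in this notebook. No LLM is called; the sketch stops at the prompt that would be sent to one.

```python
# Toy retrieval augmentation: no real LLM or vector store is involved.
KNOWLEDGE_BASE = [
    "Andouille is a smoked sausage made using pork.",
    "April is the fourth month of the year.",
    "Art is a creative activity that expresses imaginative or technical skill.",
]


def tokenize(text):
    """Lowercase the text and return its words as a set, ignoring punctuation."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return set(cleaned.split())


def retrieve(question, passages, k=1):
    """Return the k passages that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


question = "What is Andouille?"
top_passages = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)
```

The rest of the notebook follows the same two steps, with embedding-based search in place of word overlap and a real LLM consuming the prompt.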

## Colab-specific setup

Make sure you have an Astra DB database, and be ready to upload its Secure Connect Bundle and supply the Token string
(see [Pre-requisites](https://cassio.org/start_here/#vector-database) on cassio.org for details).

Likewise, make sure you have the necessary secret for the LLM provider of your choice; you'll be asked to input it shortly
(see [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for details).

_Note: some portions of this notebook are part of the CassIO documentation. Visit [this page on cassio.org](https://cassio.org/frameworks/langchain/qa-basic/)._


In [None]:
# install required dependencies
! pip install \
    "git+https://github.com/hemidactylus/langchain@cassio#egg=langchain" \
    "cassandra-driver>=3.28.0" \
    "cassio>=0.0.4" \
    "google-cloud-aiplatform>=1.25.0" \
    "jupyter>=1.0.0" \
    "openai==0.27.7" \
    "python-dotenv==1.0.0" \
    "tensorflow-cpu==2.12.0" \
    "tiktoken==0.4.0" \
    "transformers>=4.29.2"

You will likely be asked to "Restart the Runtime" at this time, as some dependencies
have been upgraded. **Please do restart the runtime now** for a smoother execution from this point onward.

# Get the Wikipedia data from 20220301.simple

In [None]:
!wget https://raw.githubusercontent.com/GeorgeCrossIV/Langchain-Retrieval-Augmentation-with-CASSIO/main/20220301.simple.csv

Import the 20220301.simple Wikipedia snapshot from the CSV file.

In [None]:
import pandas as pd
data = pd.read_csv('20220301.simple.csv')

There are 10,000 entries in the Wikipedia data file. Processing them takes a while, so we'll reduce the dataset to 10 rows for this demo; feel free to increase the number of rows in later runs.

In [None]:
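# keep only the first 10 rows to limit processing time
data = data.head(10)
# normalize the 'text ' column name (note the trailing space) to 'text'
data = data.rename(columns={'text ': 'text'})
data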

We will execute queries against the [Andouille](https://simple.wikipedia.org/wiki/Andouille) Wikipedia entry later in this demo. The Wikipedia data used in this demo is from a snapshot in time, stored in a CSV file. Below is the text of the Wikipedia record that will be processed.

In [None]:
data.iloc[9]['text']

In [None]:
# Input your database keyspace name:
ASTRA_DB_KEYSPACE = input('Your Astra DB Keyspace name: ')

In [None]:
# Input your Astra DB token string, the one starting with "AstraCS:..."
ASTRA_DB_TOKEN_BASED_PASSWORD = input('Your Astra DB Token: ')

### Astra DB Secure Connect Bundle

Please upload the Secure Connect Bundle zipfile to connect to your Astra DB instance.

The Secure Connect Bundle is needed to establish a secure connection to the database.
Click [here](https://awesome-astra.github.io/docs/pages/astra/download-scb/#c-procedure) for instructions on how to download it from Astra DB.

In [None]:
# Upload your Secure Connect Bundle zipfile:
import os
from google.colab import files


print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )

In [None]:
# colab-specific override of helper functions
from cassandra.cluster import (
    Cluster,
)
from cassandra.auth import PlainTextAuthProvider

# The "username" is the literal string 'token' for this connection mode:
ASTRA_DB_TOKEN_BASED_USERNAME = 'token'


def getCQLSession(mode='astra_db'):
    if mode == 'astra_db':
        cluster = Cluster(
            cloud={
                "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
            },
            auth_provider=PlainTextAuthProvider(
                ASTRA_DB_TOKEN_BASED_USERNAME,
                ASTRA_DB_TOKEN_BASED_PASSWORD,
            ),
        )
        astraSession = cluster.connect()
        return astraSession
    else:
        raise ValueError('Unsupported CQL Session mode')

def getCQLKeyspace(mode='astra_db'):
    if mode == 'astra_db':
        return ASTRA_DB_KEYSPACE
    else:
        raise ValueError('Unsupported CQL Session mode')

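def getTableCount():
    # Relies on the globals `session`, `keyspace` and `table_name`, and on the
    # `SimpleStatement` import, all of which are set up in later cells:
    # call this function only after those cells have run.
    query = SimpleStatement(f"SELECT COUNT(*) FROM {keyspace}.{table_name};")

    # execute the query and read the count from the single result row
    results = session.execute(query)
    return results.one().count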

### LLM Provider

In the cell below you can choose between **GCP VertexAI** or **OpenAI** for your LLM services.
(See [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for more details).

Make sure you set the `llmProvider` variable in the next cell, then supply the corresponding access secret in the cell that follows it.

In [None]:
# Choose your LLM provider (the secret is requested in the next cell):
llmProvider = 'OpenAI'  # or 'GCP_VertexAI'


In [None]:
if llmProvider == 'OpenAI':
    apiSecret = input(f'Your secret for LLM provider "{llmProvider}": ')
    os.environ['OPENAI_API_KEY'] = apiSecret
elif llmProvider == 'GCP_VertexAI':
    # we need a json file
    print(f'Please upload your Service Account JSON for the LLM provider "{llmProvider}":')
    from google.colab import files
    uploaded = files.upload()
    if uploaded:
        vertexAIJsonFileTitle = list(uploaded.keys())[0]
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(os.getcwd(), vertexAIJsonFileTitle)
    else:
        raise ValueError(
            'No file uploaded. Please re-run the cell.'
        )
else:
    raise ValueError('Unknown/unsupported LLM Provider')

### Colab preamble completed

The following cells constitute the demo notebook proper.

# Vector Similarity Search QA Quickstart

_**NOTE:** this uses Cassandra's "Vector Similarity Search" capability.
Make sure you are connecting to a vector-enabled database for this demo._
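
As a rough illustration of what "vector similarity search" means, the sketch below ranks a few stored vectors by cosine similarity to a query vector. It rests on several assumptions: the three-dimensional vectors are made up (real embeddings have hundreds or thousands of dimensions), cosine similarity is only one common choice of metric, and the exhaustive scan stands in for the index a database would normally use. The helper names `cosine_similarity` and `top_k` are hypothetical and not part of any library used in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in stored.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up vectors standing in for stored document embeddings.
stored_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.9, 0.1, 0.0],
}

nearest = top_k([1.0, 0.0, 0.0], stored_vectors, k=2)
print(nearest)  # ['doc_a', 'doc_c']
```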

In [None]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from cassandra.query import SimpleStatement

The following line imports the Cassandra flavor of a LangChain vector store:

In [None]:
from langchain.vectorstores.cassandra import Cassandra

A database connection is needed to access Cassandra. The following assumes
that a _vector-search-capable Astra DB instance_ is available. Adjust as needed.

In [None]:
# creation of the DB connection
cqlMode = 'astra_db'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)

Both an LLM and an embedding function are required.

Below is the logic to instantiate the LLM and embeddings of choice. We choose to leave it in the notebook for clarity.

In [None]:
# creation of the LLM resources

if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from VertexAI')
elif llmProvider == 'OpenAI':
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
else:
    raise ValueError('Unknown LLM provider.')

## Langchain Retrieval Augmentation

The following is a minimal usage of the Cassandra vector store. The store is created and filled at once, and is then queried to retrieve relevant parts of the indexed text, which are then stuffed into a prompt finally used to answer a question.

The following creates an "index creator", which knows about the type of vector store, the embedding to use and how to preprocess the input text:

_(Note: stores built with different embedding functions will need different tables. This is why we append the `llmProvider` name to the table name in the next cell.)_

In [None]:
table_name = 'vs_test1_' + llmProvider

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Cassandra,
    embedding=myEmbedding,
    text_splitter=CharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=0,
    ),
    vectorstore_kwargs={
        'session': session,
        'keyspace': keyspace,
        'table_name': table_name,
    },
)

Create the Cassandra vector store, then clear any entries left in the table by a previous run.

In [None]:
myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name=table_name,  # same table name as defined for the index creator
)

myCassandraVStore.clear()

In [None]:
mySplitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=120)
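
To show what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch: a fixed-size sliding window over characters. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraph breaks, newlines and spaces); the hypothetical `chunk_text` helper only illustrates that consecutive chunks share `chunk_overlap` characters, so a sentence cut at a chunk boundary still appears whole in one of the neighbours.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

With the values used in this notebook (`chunk_size=250`, `chunk_overlap=120`), nearly half of each chunk repeats text from its neighbour, which trades extra storage and embedding calls for a lower chance of splitting a relevant passage.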

The `clear()` call above should have emptied the Astra DB table. Let's do a quick table count to make sure.

In [None]:
# count the records in the Astra DB table (should be 0 after clear())
print('Total records: ' + str(getTableCount()))

Define a function that splits a Wikipedia entry into chunks and stores them, with their embeddings, in the vector store.

In [None]:
def create_vector_index(row, myCassandraVStore):
  metadata = {
    'url': row['url'],
    'title': row['title']
  }
  page_content = row['text']

  wikiDocument = Document(
      page_content=page_content,
      metadata=metadata
  )
  wikiDocs = mySplitter.transform_documents([wikiDocument])
  myCassandraVStore.add_documents(wikiDocs)

Execute the `create_vector_index` function for each row in the Wikipedia dataframe. It's a good time to grab a drink, as the next step will take about 90 seconds to complete.

In [None]:
for index, row in data.iterrows():
  create_vector_index(row, myCassandraVStore)

Now that we've processed the records and stored their embedding vectors, let's see how many records are in the Astra DB table. We'll also look at one of the rows and a snippet of its new vector values.

In [None]:
# Count the number of records in the Astra DB table after the embedding
print('Total records: ' + str(getTableCount()))

In [None]:
# create a query that returns the first row of the Astra DB table
query = SimpleStatement(f"SELECT * FROM {keyspace}.{table_name} LIMIT 1;")

# execute the query, fetch the row once, and display its columns
row = session.execute(query).one()
print('document_id: ' + row.document_id)
print('document: ' + row.document)
print('metadata_blob: ' + row.metadata_blob)
print('first 20 values of embedding_vector: ' + str(row.embedding_vector[:20]))

In [None]:
index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)

Now let's query our proprietary store. We'll ask "What is Andouille?"

In [None]:
query = "What is Andouille?"
index.query(query, llm=llm)

I'm really interested in what temperature to cook my andouille.

In [None]:
query = "What temperature should Andouille be cooked?"
index.query(query, llm=llm)

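Let's compare this answer with what OpenAI's GPT-3 model (`text-davinci-003`) returns when asked directly, without retrieval. This cell needs the OpenAI secret entered earlier, so it only works when `llmProvider` is `'OpenAI'`.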

In [None]:
import openai

openai.api_key = apiSecret
response = openai.Completion.create(
  engine="text-davinci-003",
  prompt="What temperature should Andouille be cooked?",
  max_tokens=100
)

print(response.choices[0].text.strip())

You've now seen how we can use an LLM to answer a question with context from our Astra DB vector store. Notice that the answer differs from the one the LLM gives when queried directly.

Let's get some information about the source for the response to the question "What temperature should Andouille be cooked?"

In [None]:
retriever = index.vectorstore.as_retriever(search_kwargs={
    'k': 2,
})

In [None]:
retriever.get_relevant_documents(
    "What temperature should Andouille be cooked?"
)