<a href="https://colab.research.google.com/github/GeorgeCrossIV/Langchain-Retrieval-Augmentation-with-CASSIO/blob/main/Langchain_Retrieval_Augmentation_using_cassio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain Retrieval Augmentation (using Wikipedia data)
Large Language Models (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

LLMs are frozen in time: what they know is a static snapshot of the world as it was captured in their training data.

A solution to this problem is retrieval augmentation. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.
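As a rough illustration of the idea, the retrieve-then-prompt flow can be sketched in plain Python. This is a toy under stated assumptions: the two-sentence knowledge base, the word-overlap `score` function, and the prompt wording are all hypothetical stand-ins, whereas the notebook itself relies on embeddings and a vector store.

```python
# Toy illustration of retrieval augmentation: retrieve, then build a prompt.
# The knowledge base, scoring function and prompt wording are hypothetical.

def score(question: str, passage: str) -> int:
    """Count how many distinct question words also appear in the passage."""
    q_words = {w.strip("?.,!").lower() for w in question.split()}
    p_words = {w.strip("?.,!").lower() for w in passage.split()}
    return len(q_words & p_words)

def retrieve(question: str, passages: list, k: int = 1) -> list:
    """Return the k passages with the highest word overlap."""
    return sorted(passages, key=lambda p: score(question, p), reverse=True)[:k]

def build_prompt(question: str, context: list) -> str:
    """Stuff the retrieved passages into a prompt for the LLM."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

knowledge_base = [
    "Andouille is a type of pork sausage.",
    "April is the fourth month of the year.",
]
question = "What is Andouille?"
prompt = build_prompt(question, retrieve(question, knowledge_base))
print(prompt)
```

The LLM only ever sees the prompt, so the quality of the answer depends on whether the retrieval step surfaces the right passage.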

## Colab-specific setup

Make sure you have a database ready, and be prepared to upload its Secure Connect Bundle and supply the Token string
(see [Pre-requisites](https://cassio.org/start_here/#vector-database) on cassio.org for details).

Likewise, ensure you have the necessary secret for the LLM provider of your choice; you'll be asked to input it shortly
(see [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for details).

_Note: this notebook is part of the CassIO documentation. Visit [this page on cassio.org](https://cassio.org/frameworks/langchain/qa-basic/)._


In [None]:
# install required dependencies
! pip install \
    "git+https://github.com/hemidactylus/langchain@cassio#egg=langchain" \
    "cassandra-driver>=3.28.0" \
    "cassio>=0.0.4" \
    "google-cloud-aiplatform>=1.25.0" \
    "jupyter>=1.0.0" \
    "openai==0.27.7" \
    "python-dotenv==1.0.0" \
    "tensorflow-cpu==2.12.0" \
    "tiktoken==0.4.0" \
    "transformers>=4.29.2"

You will likely be asked to "Restart the Runtime" at this time, as some dependencies
have been upgraded. **Please do restart the runtime now** for a smoother execution from this point onward.

# Get the Wikipedia data from 20220301.simple

In [2]:
 !wget https://raw.githubusercontent.com/GeorgeCrossIV/Langchain-Retrieval-Augmentation-with-CASSIO/main/20220301.simple.csv

--2023-06-21 13:21:23--  https://raw.githubusercontent.com/GeorgeCrossIV/Langchain-Retrieval-Augmentation-with-CASSIO/main/20220301.simple.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30939404 (30M) [text/plain]
Saving to: ‘20220301.simple.csv’


2023-06-21 13:21:23 (94.9 MB/s) - ‘20220301.simple.csv’ saved [30939404/30939404]



Load the 20220301.simple Wikipedia data from the CSV file:

In [3]:
import pandas as pd
data = pd.read_csv('20220301.simple.csv')

There are 10,000 entries in the Wikipedia data file. We'll reduce the dataset to 10 rows for this demo, because processing the data takes a while. Feel free to increase the number of rows in later runs.

In [4]:
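# keep only the first 10 rows for this demo
data = data.head(10)
# normalize the column name in case the CSV header has a trailing space ('text ')
data = data.rename(columns={'text ': 'text'})
data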

| Unnamed: 0 | id | url | title | text |
|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... |
| 5 | 12 | https://simple.wikipedia.org/wiki/Autonomous%2... | Autonomous communities of Spain | Spain is divided in 17 parts called autonomous... |
| 6 | 13 | https://simple.wikipedia.org/wiki/Alan%20Turing | Alan Turing | Alan Mathison Turing OBE FRS (London, 23 June ... |
| 7 | 14 | https://simple.wikipedia.org/wiki/Alanis%20Mor... | Alanis Morissette | Alanis Nadine Morissette (born June 1, 1974) i... |
| 8 | 17 | https://simple.wikipedia.org/wiki/Adobe%20Illu... | Adobe Illustrator | Adobe Illustrator is a computer program for ma... |
| 9 | 18 | https://simple.wikipedia.org/wiki/Andouille | Andouille | Andouille is a type of pork sausage. It is spi... |


We will execute queries against the [Alan Turing](https://simple.wikipedia.org/wiki/Alan%20Turing) and [Andouille](https://simple.wikipedia.org/wiki/Andouille) Wikipedia entries later in this demo.

In [6]:
# Input your database keyspace name:
ASTRA_DB_KEYSPACE = input('Your Astra DB Keyspace name: ')

Your Astra DB Keyspace name: cassio_tutorials


In [7]:
# Input your Astra DB token string, the one starting with "AstraCS:..."
# (getpass keeps the token out of the notebook output)
from getpass import getpass
ASTRA_DB_TOKEN_BASED_PASSWORD = getpass('Your Astra DB Token: ')

Your Astra DB Token: AstraCS:[REDACTED]


### Astra DB Secure Connect Bundle

Please upload the Secure Connect Bundle zipfile to connect to your Astra DB instance.

The Secure Connect Bundle is needed to establish a secure connection to the database.
Click [here](https://awesome-astra.github.io/docs/pages/astra/download-scb/#c-procedure) for instructions on how to download it from Astra DB.

In [8]:
# Upload your Secure Connect Bundle zipfile:
import os
from google.colab import files


print('Please upload your Secure Connect Bundle')
uploaded = files.upload()
if uploaded:
    astraBundleFileTitle = list(uploaded.keys())[0]
    ASTRA_DB_SECURE_BUNDLE_PATH = os.path.join(os.getcwd(), astraBundleFileTitle)
else:
    raise ValueError(
        'Cannot proceed without Secure Connect Bundle. Please re-run the cell.'
    )

Please upload your Secure Connect Bundle


Saving secure-connect-cassio-db.zip to secure-connect-cassio-db.zip


In [9]:
# colab-specific override of helper functions
from cassandra.cluster import (
    Cluster,
)
from cassandra.auth import PlainTextAuthProvider

# The "username" is the literal string 'token' for this connection mode:
ASTRA_DB_TOKEN_BASED_USERNAME = 'token'


def getCQLSession(mode='astra_db'):
    if mode == 'astra_db':
        cluster = Cluster(
            cloud={
                "secure_connect_bundle": ASTRA_DB_SECURE_BUNDLE_PATH,
            },
            auth_provider=PlainTextAuthProvider(
                ASTRA_DB_TOKEN_BASED_USERNAME,
                ASTRA_DB_TOKEN_BASED_PASSWORD,
            ),
        )
        astraSession = cluster.connect()
        return astraSession
    else:
        raise ValueError('Unsupported CQL Session mode')

def getCQLKeyspace(mode='astra_db'):
    if mode == 'astra_db':
        return ASTRA_DB_KEYSPACE
    else:
        raise ValueError('Unsupported CQL Session mode')
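Outside Colab, where `files.upload()` is not available, the same three settings could be supplied through environment variables instead. The following is a minimal sketch under that assumption. The helper `load_astra_settings` and the environment variable names are hypothetical (they simply mirror the Python variable names used above), and the example values are placeholders.

```python
import os

# Hypothetical variable names, chosen to mirror the notebook's Python variables.
REQUIRED_SETTINGS = (
    "ASTRA_DB_KEYSPACE",
    "ASTRA_DB_TOKEN_BASED_PASSWORD",
    "ASTRA_DB_SECURE_BUNDLE_PATH",
)

def load_astra_settings(env=None):
    """Return the required settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    # collect every missing or empty setting so the error names them all
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}

# Usage with placeholder values (not real credentials):
example_env = {
    "ASTRA_DB_KEYSPACE": "cassio_tutorials",
    "ASTRA_DB_TOKEN_BASED_PASSWORD": "AstraCS:placeholder",
    "ASTRA_DB_SECURE_BUNDLE_PATH": "/path/to/secure-connect-bundle.zip",
}
settings = load_astra_settings(example_env)
print(sorted(settings))
```

Failing early with a list of the missing names tends to be easier to debug than a connection error raised later by the driver.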

### LLM Provider

In the cell below you can choose between **GCP VertexAI** or **OpenAI** for your LLM services.
(See [Pre-requisites](https://cassio.org/start_here/#llm-access) on cassio.org for more details).

Make sure you set the `llmProvider` variable and supply the corresponding access secrets in the following cell.

In [10]:
# Set your secret(s) for LLM access:
llmProvider = 'OpenAI'  # 'GCP_VertexAI'


In [11]:
if llmProvider == 'OpenAI':
    # getpass keeps the secret out of the notebook output
    from getpass import getpass
    apiSecret = getpass(f'Your secret for LLM provider "{llmProvider}": ')
    os.environ['OPENAI_API_KEY'] = apiSecret
elif llmProvider == 'GCP_VertexAI':
    # we need a json file
    print(f'Please upload your Service Account JSON for the LLM provider "{llmProvider}":')
    from google.colab import files
    uploaded = files.upload()
    if uploaded:
        vertexAIJsonFileTitle = list(uploaded.keys())[0]
        os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.join(os.getcwd(), vertexAIJsonFileTitle)
    else:
        raise ValueError(
            'No file uploaded. Please re-run the cell.'
        )
else:
    raise ValueError('Unknown/unsupported LLM Provider')

Your secret for LLM provider "OpenAI": sk-[REDACTED]


### Colab preamble completed

The following cells constitute the demo notebook proper.

# Vector Similarity Search QA Quickstart

_**NOTE:** this uses Cassandra's "Vector Similarity Search" capability.
Make sure you are connecting to a vector-enabled database for this demo._
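Conceptually, a vector similarity search ranks stored vectors by how close they are to the query vector and returns the top `k`. The sketch below shows this with cosine similarity on three-dimensional toy vectors. The vectors and labels are made up for illustration, and cosine similarity is used only as a common choice; the metric configured in the database is not stated here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the labels of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]

# Made-up toy "embeddings"; real embeddings have hundreds or thousands of dimensions.
stored_vectors = {
    "sausage": [0.9, 0.1, 0.0],
    "month": [0.0, 0.2, 0.9],
    "smoking": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

nearest = top_k(query_vector, stored_vectors, k=2)
print(nearest)
```

A real vector database avoids comparing the query against every stored vector by using an index, but the ranking idea is the same.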

In [12]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

The following line imports the Cassandra flavor of a LangChain vector store:

In [13]:
from langchain.vectorstores.cassandra import Cassandra

A database connection is needed to access Cassandra. The following assumes
that a _vector-search-capable Astra DB instance_ is available. Adjust as needed.

In [14]:
# creation of the DB connection
cqlMode = 'astra_db'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(140051429524528) cadf0dc6-b88d-4b3c-95c3-ed828664e189-us-east1.db.astra.datastax.com:29042:e80d59ef-d76c-4cc5-addb-55091103f7db> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Both an LLM and an embedding function are required.

Below is the logic to instantiate the LLM and embeddings of choice. We leave it in the notebook for clarity.

In [15]:
# creation of the LLM resources


if llmProvider == 'GCP_VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from VertexAI')
elif llmProvider == 'OpenAI':
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
else:
    raise ValueError('Unknown LLM provider.')

LLM+embeddings from OpenAI


## LangChain Retrieval Augmentation

The following is a minimal usage of the Cassandra vector store. The store is created and filled, then queried to retrieve the relevant parts of the indexed text, which are stuffed into a prompt that the LLM uses to answer a question.

The following creates an "index creator", which knows about the type of vector store, the embedding to use and how to preprocess the input text:

_(Note: stores built with different embedding functions will need different tables. This is why we append the `llmProvider` name to the table name in the next cell.)_

In [16]:
table_name = 'vs_test1_' + llmProvider

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Cassandra,
    embedding=myEmbedding,
    text_splitter=CharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=0,
    ),
    vectorstore_kwargs={
        'session': session,
        'keyspace': keyspace,
        'table_name': table_name,
    },
)


Create the Cassandra vector store and clear its entries in case the table already exists:

In [17]:
myCassandraVStore = Cassandra(
    embedding=myEmbedding,
    session=session,
    keyspace=keyspace,
    table_name=table_name,  # same table as defined for the index creator above
)

myCassandraVStore.clear()

In [18]:
mySplitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=120)
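To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker. It is only a sketch: `sliding_chunks` is a hypothetical helper that cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to split at separators such as paragraph breaks, newlines and spaces, so its chunks will not line up exactly like these.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghij" * 5  # 50 characters
chunks = sliding_chunks(sample, chunk_size=25, chunk_overlap=12)
for chunk in chunks:
    print(len(chunk), chunk)
```

With a large overlap relative to the chunk size, as in the splitter above (120 out of 250), consecutive chunks share a lot of text, which reduces the chance that a relevant sentence is cut in half at the cost of storing more chunks.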

Define a function that splits one Wikipedia entry into chunks and adds them to the vector store:

In [19]:
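def create_vector_index(row, myCassandraVStore):
    """Split one Wikipedia row into chunks and add them to the vector store."""
    # keep the source URL and title so retrieved chunks can be traced back
    metadata = {
        'url': row['url'],
        'title': row['title'],
    }
    aDocument = Document(
        page_content=row['text'],
        metadata=metadata,
    )
    # mySplitter is the RecursiveCharacterTextSplitter defined above
    aDocs = mySplitter.transform_documents([aDocument])
    myCassandraVStore.add_documents(aDocs)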

Execute the `create_vector_index` function for each row in the Wikipedia dataframe. It's a good time to grab a drink, as the next step will take about 90 seconds to complete.

In [20]:
for _, row in data.iterrows():
    create_vector_index(row, myCassandraVStore)

In [21]:
index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)

Now let's query our store. We'll ask "What is Andouille?"

In [22]:
query = "What is Andouille?"
index.query(query,llm=llm)

' Andouille is a type of pork sausage that is spicy (hot in taste) and smoked. It is made with different combinations of pork meat, fat, intestines, and tripe, and usually includes extra salt, black pepper, and garlic. It is smoked over pecan wood and sugar cane for a maximum of seven or eight hours, at about 175 degrees Fahrenheit (80 degrees Celsius).'

I'm really interested in the temperature at which to cook my andouille.

In [23]:
query = "What temperature should Andouille be cooked?"
index.query(query, llm=llm)

' About 175 degrees Fahrenheit (80 degrees Celsius).'

Let's compare this answer with what OpenAI's GPT-3 model (`text-davinci-003`) returns on its own, without retrieval:

In [28]:
import openai

openai.api_key = apiSecret
response = openai.Completion.create(
  engine="text-davinci-003",
  prompt="What temperature should Andouille be cooked?",
  max_tokens=100
)

print(response.choices[0].text.strip())

Andouille should be cooked to an internal temperature of 155 degrees Fahrenheit (68 degrees Celsius).


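You've now seen how an LLM can answer a prompt using content from our Astra vector store. Notice that the answer differs from what the LLM returns on its own: the retrieval-augmented answer (175 degrees Fahrenheit) repeats the smoking temperature given in the indexed Wikipedia text, while the LLM alone gives an internal cooking temperature of 155 degrees Fahrenheit.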

Now let's ask about the Stoney family, who are associated with Alan Turing.

In [24]:
query = "Who is the Stoney family?"
index.query(query,llm=llm)

' The Stoney family is not mentioned in the context given.'

Let's look at the source documents that were retrieved for the question "Who is the Stoney family?":

In [25]:
retriever = index.vectorstore.as_retriever(search_kwargs={
    'k': 2,
})

In [26]:
retriever.get_relevant_documents(
    "Who is the Stoney family?"
)

[Document(page_content='Trivia', metadata={'url': 'https://simple.wikipedia.org/wiki/April', 'title': 'April'}),
 Document(page_content='Alanis Morissette was born in Riverside Hospital of Ottawa in Ottawa, Ontario. Her father is French-Canadian. Her mother is from Hungary. She has an older brother, Chad, and a twin brother, Wade, who is 12 minutes younger than she is. Her parents had', metadata={'url': 'https://simple.wikipedia.org/wiki/Alanis%20Morissette', 'title': 'Alanis Morissette'})]