# Retrieval Augmented Generation (RAG) with LangChain
*Using IBM Granite Models*

## In this notebook
This notebook contains instructions for performing Retrieval Augmented Generation (RAG). RAG is an architectural pattern that improves the responses of a language model by retrieving factual information from a knowledge base and adding that information to the model query. The most common approach in RAG is to create dense vector representations of the knowledge base so that text chunks semantically similar to a given user query can be retrieved.

RAG use cases include:
- Customer service: Answering questions about a product or service using facts from the product documentation.
- Domain knowledge: Exploring a specialized domain (e.g., finance) using facts from papers or articles in the knowledge base.
- News chat: Chatting about current events by calling up relevant recent news articles.

In its simplest form, RAG requires 3 steps:

- Initial setup:
  - Index knowledge-base passages for efficient retrieval. In this recipe, we take embeddings of the passages and store them in a vector database.
- Upon each user query:
  - Retrieve relevant passages from the database. In this recipe, we use an embedding of the query to retrieve semantically similar passages.
  - Generate a response by feeding the retrieved passages into a large language model, along with the user query.
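
The three steps above can be sketched in plain Python with no external dependencies. This is a toy illustration under stated assumptions, not the recipe's implementation: shared-word counts stand in for embedding similarity, a Python list stands in for the vector database, and `words`, `retrieve`, and `build_prompt` are hypothetical helpers introduced only for this sketch. The final call to a language model is left out.

```python
import re


def words(text):
    """Lowercase word set; a crude stand-in for an embedding (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Step 1 (initial setup): index the knowledge-base passages.
passages = [
    "The warranty covers battery replacement for two years.",
    "Returns are accepted within thirty days of purchase.",
    "The device charges over USB-C.",
]
index = [(passage, words(passage)) for passage in passages]


def retrieve(query, k=1):
    """Step 2: return the k passages sharing the most words with the query."""
    query_words = words(query)
    ranked = sorted(index, key=lambda item: len(query_words & item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


def build_prompt(query, context):
    """Step 3 (input side): combine the retrieved passages with the user query."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


query = "How long does the warranty cover the battery?"
top_passages = retrieve(query)
prompt = build_prompt(query, top_passages)
print(prompt)  # this string is what would be sent to the language model
```

In the rest of this recipe, an embedding model, a vector database, and an LLM take the place of these stand-ins.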

## Setting up the environment

### Python version

Ensure you are running Python 3.10 or 3.11.

In [None]:
import sys
assert sys.version_info >= (3, 10) and sys.version_info < (3, 12), "Use Python 3.10 or 3.11 to run this notebook."

### Install dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [None]:
! pip install \
  "git+https://github.com/ibm-granite-community/granite-kitchen.git" \
  "langchain-huggingface" \
  "langchain-milvus" \
  "wget"

### Ollama

This notebook requires the IBM Granite models to be served by an AI model runtime so that they can be called for inference. This notebook uses [Ollama](https://github.com/ollama/ollama) to serve the models.

The Ollama server can run locally on your computer or inside the notebook itself; the latter requires running the notebook in [Google Colab](https://colab.research.google.com). Follow the steps in the section that best suits your needs.

#### Running Ollama Locally

If you have not already done this in the pre-work, you will need to start a local Ollama server.

Using the command line on your computer:

1. [Download and install Ollama](https://github.com/ollama/ollama?tab=readme-ov-file#ollama), if you haven't already.

    On macOS, you can use Homebrew to install with

    ```shell
    brew install ollama
    ```

1. Start the Ollama server.

    ```shell
    ollama serve
    ```

1. Pull down the Granite models you will want to use in the workshop. Larger models take more memory to run.

    ```shell
    ollama pull granite-code:3b
    ollama pull granite-code:8b
    ollama pull granite-code:20b
    ```


#### Running Ollama in Colab

Use this section if you are not running the Ollama server locally on your computer. Running the Ollama server in Colab limits the size of the Granite models you can use and makes calls to the models _significantly_ slower.

1. Download and install Ollama in Colab

In [None]:
!curl https://ollama.ai/install.sh | sh

2. Start the Ollama server as a background process in Colab using `nohup` and `&`

In [None]:
import os
os.system("nohup ollama serve &")

3. Pull down the Granite models in Colab that you will use in the workshop. Larger models take more memory to run. The `granite-code:20b` model is too large for the Colab runtime environment.

In [None]:
!ollama pull granite-code:3b
!ollama pull granite-code:8b
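
Whichever setup you chose, it can help to confirm that the server is reachable before continuing. The sketch below is one possible check, assuming Ollama is listening on its default address `http://localhost:11434` and using its `/api/tags` endpoint to list the locally pulled models. The helper name `list_ollama_models` is hypothetical and not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"


def list_ollama_models(base_url=OLLAMA_URL, timeout=3.0):
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers invalid JSON.
        return None
    return [entry.get("name", "") for entry in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama server not reachable; start it with `ollama serve`.")
else:
    print("Available models:", models)
```

If the list does not include the model you intend to use, pull it with `ollama pull` as shown above.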

## Selecting System Components

### Choose your Embeddings Model

Specify the model to use for generating embedding vectors from text.

To use a model from a provider other than Hugging Face, replace this code cell with one from [this Embeddings Model recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Embeddings_Models.ipynb).

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

### Choose your Vector Database

Specify the database to use for storing and retrieving embedding vectors.

To connect to a vector database other than Milvus, replace this code cell with one from [this Vector Store recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_Vector_Stores.ipynb).

In [None]:
from langchain_milvus import Milvus
import tempfile

db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
print(f"The vector database will be saved to {db_file}")

vector_db = Milvus(embedding_function=embeddings_model, connection_args={"uri": db_file}, auto_id=True)

### Choose your LLM
The LLM will be used for answering the question, given the retrieved text.

Select an IBM Granite Code model from the [`granite-code`](https://ollama.com/library/granite-code) library on Ollama. Here we use the LangChain Ollama client to connect to the model.

In [None]:
from langchain_ollama.llms import OllamaLLM

model = OllamaLLM(model="granite-code:3b")

## Building the Vector Database

In this example, we take the text of a State of the Union speech, split it into chunks, derive an embedding vector for each chunk using the embedding model, and load the chunks into the vector database for querying.

### Download the document

Here we use President Biden's State of the Union address from March 1, 2022.

In [None]:
import os, wget

filename = 'state_of_the_union.txt'
url = 'https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/data/foundation_models/state_of_the_union.txt'

if not os.path.isfile(filename):
  wget.download(url, out=filename)
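
If you prefer not to depend on the third-party `wget` package, the same download-if-missing pattern can be written with the standard library. This is an optional alternative sketched here, not part of the original recipe; `download_if_missing` is a hypothetical helper name.

```python
import os
import urllib.request


def download_if_missing(url, filename):
    """Download url to filename unless the file already exists; return the path."""
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(url, filename)
    return filename
```

Calling `download_if_missing(url, filename)` with the `url` and `filename` variables from the cell above would fetch the speech text only on the first run.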

### Split the document into chunks

Split the document into text segments that can fit into the model's context window.

In [None]:
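# TextLoader lives in langchain_community; the langchain.document_loaders path is deprecated.
from langchain_community.document_loaders import TextLoader
# Text splitters are published in the langchain_text_splitters package.
from langchain_text_splitters import CharacterTextSplitter

# Load the downloaded file as a list of LangChain Document objects.
loader = TextLoader(filename)
documents = loader.load()
# Split into chunks of roughly 1000 characters with no overlap between chunks.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)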
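
To make the chunking step concrete, here is a simplified sketch of separator-based splitting: the text is cut on a separator, and adjacent pieces are merged greedily until adding another piece would exceed `chunk_size`. It is an approximation for illustration only, not LangChain's actual implementation. `split_text` is a hypothetical helper, overlap handling is omitted (the recipe uses `chunk_overlap=0`), and a single piece longer than `chunk_size` is kept whole in this sketch.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece.strip():
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it starts a new chunk (even if oversized).
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph is a bit longer."
for chunk in split_text(sample, chunk_size=40):
    print(repr(chunk))
```

With `chunk_size=40`, the first two paragraphs fit together in one chunk and the third becomes its own chunk.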

### Populate the vector database

NOTE: Population of the vector database may take over a minute depending on your embedding model and service.

In [None]:
vector_db.add_documents(texts)

## Querying the Vector Database

### Conduct a similarity search

Search the database for the documents whose embedding vectors lie closest to the embedding of the query in vector space.

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
docs = vector_db.similarity_search(query)
print(docs[0].page_content)
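
Proximity in vector space is commonly measured with cosine similarity or a distance metric; which one applies in practice depends on how the vector store is configured. The following is a minimal sketch of ranking by cosine similarity, using made-up three-dimensional vectors in place of real embeddings. The `cosine_similarity` and `top_k` helpers are hypothetical and are not part of LangChain or Milvus.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs with the highest similarity to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors for illustration; real embeddings have hundreds of dimensions.
stored = [
    ("passage about the Supreme Court", [0.9, 0.1, 0.0]),
    ("passage about the economy", [0.1, 0.9, 0.1]),
    ("passage about judicial nominees", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]
for score, text in top_k(query_vec, stored):
    print(f"{score:.3f}  {text}")
```

A production vector database typically uses an index structure to avoid comparing the query against every stored vector, but the ranking idea is the same.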

## Answering Questions

### Automate the RAG pipeline

Build a question-answering chain with the model and the document retriever.

In [None]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=model, chain_type="stuff", retriever=vector_db.as_retriever()) # , chain_type_kwargs={"verbose": False})

### Generate a retrieval-augmented response to a question

Use the question-answering chain to process the query. 

In [None]:
query = "What did the president say about Ketanji Brown Jackson"
qa.invoke(query)