Last modified: 07/11/2024

# Retrieval Augmented Generation

This notebook uses the [IBM Granite 3.0 2B Instruct](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct) model from HuggingFace. The model is hosted locally, and used within a LangChain RAG pipeline to query local data.

The components making up a RAG application are as follows:
- Data ingestion & pre-processing
- Data embedding
- Data storage (Vector db)
- Prompt template
- Retriever
- Augmented generative model

[Optional]
- User interface
- Backend API

An example workflow is:
1. Ingestion: Use Tika/Textract to extract text from private files.
2. Preprocessing: Use NLTK/spaCy to clean and tokenize.
3. Embedding: Generate embeddings with Hugging Face SentenceTransformers.
4. Storage: Store embeddings in FAISS or Chroma.
5. Retrieval: Use Haystack or LangChain to retrieve relevant document embeddings for a given query.
6. Generation: Use Hugging Face Transformers or OpenAI’s API (optional) to generate responses based on retrieved content.
7. Serve: Create an API with FastAPI and optionally a UI with Streamlit or Gradio.
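
The workflow above can be sketched end to end with the standard library alone. This is a toy illustration, not the pipeline built later in this notebook: the corpus is made-up, and a bag-of-words count vector stands in for a real embedding model. All function names (`tokenize`, `embed`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter

# Ingestion: toy corpus standing in for text extracted from private files (made-up data).
corpus = {
    "doc-a": "FAISS stores dense vectors and supports fast similarity search.",
    "doc-b": "Streamlit and Gradio are used to build simple user interfaces.",
    "doc-c": "FastAPI serves Python functions as HTTP endpoints.",
}


def tokenize(text):
    """Preprocessing: lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Embedding: a toy bag-of-words count vector (stand-in for a real model)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Storage: an in-memory "vector store" mapping document id to embedding.
store = {doc_id: embed(text) for doc_id, text in corpus.items()}


def retrieve(query, k=1):
    """Retrieval: return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda d: cosine(q, store[d]), reverse=True)
    return ranked[:k]


def build_prompt(query, doc_ids):
    """Generation input: combine the question with the retrieved text."""
    context = "\n".join(corpus[d] for d in doc_ids)
    return f"Answer using the context.\n\nQuestion: {query}\n\nContext:\n{context}"


question = "Which library supports similarity search over vectors?"
top = retrieve(question)
prompt = build_prompt(question, top)
print(top)
print(prompt)
```

In a real pipeline the final `prompt` would be passed to a generative model; the serving step (API and UI) is omitted here.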

# Simple RAG
Following the [LangChain demo](https://python.langchain.com/docs/tutorials/rag/), this section sets up a RAG pipeline for single-task inference that queries dummy data stored in an in-memory vector database (FAISS).

- `HuggingFaceHub` is used to access HuggingFace hosted models (via API).
- `HuggingFacePipeline` is used to access locally hosted models.

In [None]:
# %pip install --quiet --upgrade langchain langchain-community python-dotenv
# %pip install langchain-chroma # for Chroma db
# %pip install faiss-cpu # for FAISS db

# # for remotely hosted inference
# %pip install huggingface_hub

# # for local inference
# %pip install langchain-huggingface "transformers[torch]" sentence-transformers

# # for progress bar in Jupyter notebooks
# %pip install --quiet tqdm ipywidgets

## Setup

In [None]:
import os
from dotenv import load_dotenv

# Set up your API tokens
load_dotenv(os.getcwd() + "/.env")
LANGSMITH_API_KEY = os.getenv("LANGCHAIN_API_KEY", "")
HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY", "")

## Document Embedding
A RAG application needs to embed documents so that they can be stored in a vector database and later retrieved.

Options include:
- **Hugging Face Transformers**: provides a wide range of pre-trained models (e.g. BERT, Sentence-BERT) for generating embeddings from text, images, and other data types. It’s highly versatile and integrates well with FAISS for tasks like semantic search.
- **Sentence Transformers**: Built on top of Hugging Face Transformers, this library is specifically designed for creating high-quality sentence and document embeddings. It supports various models optimized for different tasks, making it a great choice for embedding text data.
- **OpenAI Embeddings**: OpenAI offers dedicated embedding models (e.g. `text-embedding-3-small`) through its API. These embeddings can be indexed and searched using FAISS.

In [None]:
# Initialize model for embeddings using LangChain's HuggingFaceEmbeddings class
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings


embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
embedding_model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}

embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=encode_kwargs,
    multi_process=False,
    show_progress=True,
)
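
The cell above sets `normalize_embeddings` to `False`. As a minimal sketch of what normalisation changes: once vectors are scaled to unit length, their dot product equals their cosine similarity, whereas the raw dot product also depends on vector magnitude. The two-dimensional vectors below are made-up stand-ins for real embeddings, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                                      # depends on magnitude
normalized_dot = dot(l2_normalize(a), l2_normalize(b))   # equals cosine(a, b)

print(raw_dot, normalized_dot, cosine(a, b))
```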

In [None]:
# # Can explore embedding output with a test string

# query_embedding = embeddings.embed_query("a test string")
# query_embedding

## Vector Store

Store embedded data for retrieval by the RAG pipelines.

Use one of the following for local development and testing
- FAISS
- Chroma db

As a simple example, use an in-memory vector store, which is lost once the kernel is stopped. Data can either be loaded from the file system or entered using LangChain's `Document` class.

In [None]:
# Creating dummy documents using LangChain
from langchain_core.documents import Document


documents = [
    Document(
        page_content="Malikai is a Data Science Consultant at Terox with over 3 years of professional experience in the transportation domain.",
        metadata={"source": "Malikai-bio"},
    ),
    Document(
        page_content="Robert Isling has earned his PhD certificate in Physics from the University College London in 2021.",
        metadata={"source": "Robert-bio"},
    ),
    Document(
        page_content="Nusret is originally from the city of Izmir in Turkey, but currently resides in Nottingham, United Kingdom.",
        metadata={"source": "Nusret-bio"},
    ),
]

In [None]:
# # Loading dummy data from disk
# from langchain_community.document_loaders import TextLoader


# loader = TextLoader("./test_rag_doc.txt")
# documents = loader.load()

### FAISS

In [None]:
from langchain_community.vectorstores import FAISS


# Generate embeddings and store in FAISS
faiss_store = FAISS.from_documents(documents, embeddings)

In [None]:
# TEST that documents have been loaded into the vector store
faiss_store.similarity_search_with_score("London")
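
When reading the scores returned by the test above, note that with LangChain's default FAISS settings the score is an L2 distance, so a lower score means a closer match. The sketch below illustrates that ranking with plain Python; the vectors are made-up, and using the squared distance is an assumption made for simplicity.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up two-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0]
vectors = {"near": [0.9, 0.1], "far": [-1.0, 0.5]}

# Sort ascending: the smallest distance is the best match.
scored = sorted(
    ((name, squared_l2(query, vec)) for name, vec in vectors.items()),
    key=lambda pair: pair[1],
)
print(scored)
```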

## Model Initialisation

In [None]:
# HuggingFace model path
model_path = "ibm-granite/granite-3.0-2b-instruct"

Remote model

In [None]:
# # Using a model hosted remotely on HuggingFace infrastructure
# from langchain_community.llms import HuggingFaceHub

# # Initialize Hugging Face model through LangChain's HuggingFaceHub
# llm = HuggingFaceHub(
#     repo_id=model_path,  # or any model of choice
#     model_kwargs={"temperature": 0.5, "max_length": 100},
#     huggingfacehub_api_token=HUGGINGFACE_API_KEY,
# )

Local model

In [None]:
# Use a locally hosted model
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from langchain_huggingface import HuggingFacePipeline


tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

# Load a local model pipeline
local_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=500,  # total length (prompt + generated tokens); the default is 20
    truncation=True,
)

# Initialize Hugging Face model through LangChain's HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=local_pipeline)


## Prompt Template

In [None]:
from langchain.prompts import PromptTemplate

# The `PromptTemplate` class is suitable for one-off tasks, i.e.
# not used for conversational interactions. It does not support
# `{roles}` within the message.

message = """
Answer this question by rephrasing the information provided in the context.

{question}

Context:
{context}
"""

prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template=message,
)
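
To see what the model will receive, the template text can be filled in by hand. This sketch uses Python's built-in `str.format` as a stand-in for `PromptTemplate`, on the assumption that the template uses f-string-style `{placeholders}`; the context string is an illustrative example.

```python
# Same template text as in the cell above.
message = """
Answer this question by rephrasing the information provided in the context.

{question}

Context:
{context}
"""

# Fill both placeholders; in the RAG chain the context comes from the retriever.
filled = message.format(
    question="Who is Nusret?",
    context="Nusret is originally from the city of Izmir in Turkey.",
)
print(filled)
```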

In [None]:
# from langchain_core.prompts import ChatPromptTemplate

# # The `ChatPromptTemplate` is suitable for conversational interactions
# # with the LLM. It supports `{roles}` within the message, such as
# # system prompt, ai prompt, and human prompt using the appropriate
# # prompt template classes for each role.

# prompt_template = ChatPromptTemplate.from_messages([("human", message)])

## Retrieval

A vector store can be used as a retriever ([source](https://python.langchain.com/docs/tutorials/retrievers/#retrievers)). Vector stores can be queried by ([source](https://python.langchain.com/docs/tutorials/retrievers/#vector-stores)):
- similarity
- maximum marginal relevance (balances similarity to the query against diversity among the retrieved results)
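
The difference between the two query modes can be sketched in plain Python. This is a minimal illustration of the maximum marginal relevance idea, not LangChain's implementation: each step picks the candidate with the best trade-off between relevance to the query and redundancy with documents already selected. The vectors are made-up, and the names `mmr` and `lambda_mult` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy maximal marginal relevance; returns indices of the selected documents."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to any document already chosen.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
# Documents 0 and 1 are near-duplicates; document 2 is less similar but different.
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

by_similarity = sorted(
    range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
)[:2]
by_mmr = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3)
print(by_similarity, by_mmr)
```

Plain similarity returns the two near-duplicates, while MMR with a low `lambda_mult` swaps the second one for the more distinct document.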

In [None]:
from langchain_core.runnables import RunnableLambda

# Query by `similarity`
# Return top match (k=1)
retriever = RunnableLambda(faiss_store.similarity_search).bind(k=1)

# # Alternative implementation
# retriever = faiss_store.as_retriever(
#     search_type="similarity",
#     search_kwargs={"k": 1},
# )

In [None]:
# # test the retriever
# retriever.invoke("university")

## Augmented Generation

Define the RAG chain using LangChain's pipe (`|`) syntax.

In [None]:
from langchain_core.runnables import RunnablePassthrough

qa_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_template
    | llm
)
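
The data flow through the chain above can be sketched with plain functions. In the chain as written, the retriever's output (a list of documents) fills `{context}` directly; a common refinement, shown here as an assumption rather than part of the original chain, is to join the page text first. `Doc`, `fake_retriever`, `format_docs`, `fill_prompt`, and `fake_llm` are hypothetical stand-ins, not LangChain APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def fake_retriever(question):
    """Stand-in retriever: always returns one fixed document."""
    return [Doc("Nusret is originally from Izmir.", {"source": "Nusret-bio"})]


def format_docs(docs):
    """Join page text so the prompt holds plain text, not object representations."""
    return "\n\n".join(d.page_content for d in docs)


def fill_prompt(inputs):
    """Stand-in for the prompt template step."""
    return f"{inputs['question']}\n\nContext:\n{inputs['context']}"


def fake_llm(prompt):
    """Stand-in for the model: reports the prompt size instead of generating text."""
    return f"[model output for a prompt of {len(prompt)} characters]"


def run_chain(question):
    # Mirrors {"context": retriever, "question": passthrough} | prompt | llm
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return fake_llm(fill_prompt(inputs))


prompt_text = fill_prompt(
    {"context": format_docs(fake_retriever("Who is Nusret?")), "question": "Who is Nusret?"}
)
print(prompt_text)
print(run_chain("Who is Nusret?"))
```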

Query the RAG chain

In [None]:
query = "Who is Nusret?"
response = qa_chain.invoke(query)
print(response)

# First Call
This section calls the IBM Granite 3.0 2B Instruct model directly with a single query, without retrieval.

In [None]:
# from transformers import pipeline

# import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
device = "cpu"
model_path = "ibm-granite/granite-3.0-2b-instruct"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

In [None]:
# change input text as desired
chat = [
    {
        "role": "user",
        "content": "What is the dense transformer architecture?",
    },
]


chat = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# tokenize the text
input_tokens = tokenizer(chat, return_tensors="pt").to(device)

In [None]:
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output)


In [None]:
output[0].split("end_of_role|")
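
The split above leaves the prompt and end-of-text marker in the result. As a minimal sketch, the assistant's reply alone can be extracted as shown below. It assumes the decoded text uses role markers of the form `<|start_of_role|>…<|end_of_role|>` and ends turns with `<|end_of_text|>`; the `decoded` string is a made-up placeholder, not real model output, and `extract_reply` is a hypothetical helper.

```python
# Placeholder for output[0]; the marker format is an assumption.
decoded = (
    "<|start_of_role|>user<|end_of_role|>"
    "What is the dense transformer architecture?<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>"
    "A placeholder answer.<|end_of_text|>"
)


def extract_reply(text, role_end="<|end_of_role|>", eos="<|end_of_text|>"):
    """Return the text after the last role marker, without the end-of-text token."""
    reply = text.rsplit(role_end, 1)[-1]
    return reply.replace(eos, "").strip()


reply = extract_reply(decoded)
print(reply)
```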