## The Question
Large Language Models (LLMs) perform very well on natural language generation tasks, sometimes matching or surpassing human ability. Since GPT, natural language processing has seen many advancements, and ChatGPT showcased the power of LLMs on a variety of tasks such as question answering, long-form text generation, and even simple logical reasoning. However, its answers came from the model's parametric memory, which is frozen after training, i.e., the model could not update its knowledge or fetch relevant data. LLMs would often invent results and present them as fact: they would *hallucinate*. **Is there a way to make large language models fetch relevant information and generate accurate answers?**

## The Dataset
We use three chapters (Chapters 1, 2, and 7) of the course textbook 'Artificial Intelligence: A Modern Approach' and a couple of slides from Prof. Keogh and Prof. LePendu as our additional context.

## The Method
To achieve our objective of eliminating hallucination and boosting relevance, we use Llama-Index, LangChain, and Ollama to build a RAG pipeline. Llama-Index indexes the documents and stores them in a vector database. LangChain provides the interface to the Llama 2 7B model, which Ollama serves locally. The textbook chapters and the slides are indexed and stored in the vector database, and the entries most similar to the query are retrieved and passed to the language model. For the embeddings we use OpenAI's text-embedding-ada-002 model through their API.

## Our hypothesis
We hypothesize that supplementing the LLM with additional context boosts the relevance of its answers and stops it from hallucinating. We evaluate this with the RAGAS approach, which measures how relevant and consistent the generated result is with respect to the context and the query.

## Why bother?
As the use of LLMs rises, it becomes clear that no single LLM can know everything. It is important to have LLMs specialized for specific purposes, and to give them access to fresh data so that their outputs are relevant and factually correct. RAG is a promising approach for alleviating many of the problems with LLM generation, and used correctly it can broaden the use of language models across various fields.






# Main Jupyter Notebook for the CS205 Final Project
*This only works if run locally after completing all the installations mentioned in the README file of the project*

1. In this project we will explore a use case of Large Language Models.
2. We will start with how to use an LLM through HuggingFace, and explain some of the basic concepts behind an LLM.
3. Once we have a good understanding of how to use an LLM for generating text, we will explore Retrieval Augmented Generation (RAG).

For this project we have used the Llama 2 7-billion-parameter model with OpenAI's text-embedding-ada-002 embedding model. Llama 2 7B was served locally by Ollama. We have used Llama Index and LangChain to interact with the LLM.

#### The data store is at the root of the project directory with the name 'data'. Create this data directory before running the indexing and query cells

#### Let's understand how to use an LLM using HuggingFace

In [77]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

import os

In [None]:
import torch

#Import HuggingFace Transformer
from transformers import AutoModelForCausalLM, AutoTokenizer

In [None]:
#Fetch Meta's OPT LLM with 1.3 billion parameters. This is quite a small model compared to the SOTA like GPT4V, etc.

model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')

##### Open Pre-trained Transformer (OPT) is a collection of decoder-only transformers developed by Meta.

In [None]:
input_text = 'I like CS205 Artificial Intelligence course, because'
tok_input = tokenizer(input_text, return_tensors='pt', add_special_tokens=True, truncation=True) #Create tokens from the given input and return PyTorch tensors

In [None]:
torch.manual_seed(123)

generated_output = model.generate(**tok_input, 
                                  max_new_tokens=200, 
                                  return_dict_in_generate=True, 
                                  do_sample=True) #Sampling is enabled, so the output varies between runs unless the seed is fixed (see torch.manual_seed above)
                                                  #Pass top_k=<k> together with do_sample=True to restrict sampling to the k most likely tokens
                                                  #Set do_sample=False to have a deterministic (greedy) generation each time
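The difference between greedy decoding and top-k sampling mentioned in the comments above can be illustrated without a model. The following is a minimal sketch using only the standard library. The logits are made-up scores for a five-token vocabulary, and the helper names (`softmax`, `greedy_pick`, `top_k_sample`) are our own and are not part of the HuggingFace API.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Deterministic choice: always take the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_sample(logits, k, rng):
    """Keep the k highest-scoring token ids, renormalize, and sample one."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]

# Hypothetical next-token scores for a vocabulary of 5 tokens (illustrative only).
toy_logits = [2.0, 1.5, 0.3, -1.0, -2.0]

rng = random.Random(123)  # fixed seed so the sampled sequence is reproducible
greedy_token = greedy_pick(toy_logits)
sampled_tokens = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]

print("greedy:", greedy_token)
print("top-k samples:", sampled_tokens)
```

Greedy decoding always returns token 0 here, while top-k sampling with k=2 returns a mix of tokens 0 and 1 and never any of the lower-ranked tokens.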


In [None]:
decoded_output = tokenizer.batch_decode(generated_output.sequences, skip_special_tokens=True)[0]
print(decoded_output)

#### What is an embedding?

An embedding is a numerical representation of text. The tokens in the vocabulary are first mapped to integer ids, and each id is then mapped to a vector of real numbers. For a sequence of tokens, the resulting 'embedding' has shape (n_tokens x embedding_dimension).

In [86]:
sentence = "Life is sweet and sugary, but eating ice cream in Manali is long"

In [87]:
tokenize1 = {s: i for i, s in enumerate(sorted(sentence.replace(',', '').split()))}

In [88]:
sentence_vec = torch.tensor([tokenize1[s] for s in sentence.replace(',', '').split()])
print(sentence_vec)

tensor([ 0,  9, 12,  2, 11,  3,  5,  6,  4,  7,  1,  9, 10])


The sentence is tokenized as shown in the output above. Each word is mapped to an integer.
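One detail worth noting: the mapping above enumerates the sorted word list without removing duplicates, so the repeated word "is" consumes two positions and id 8 never appears in the tensor. This is harmless for the demonstration. A variant that deduplicates first, so that the ids run contiguously from 0 to vocab_size - 1, could look like the sketch below (the names `vocab` and `token_ids` are ours).

```python
sentence = "Life is sweet and sugary, but eating ice cream in Manali is long"
words = sentence.replace(",", "").split()

# Deduplicate before numbering so that ids are contiguous: 0 .. vocab_size - 1.
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}

# Look up the id of every word in the sentence, in order.
token_ids = [vocab[w] for w in words]

print(len(vocab))   # 12 distinct words
print(token_ids)    # [0, 8, 11, 2, 10, 3, 5, 6, 4, 7, 1, 8, 9]
```

With this mapping the embedding table would only need `len(vocab)` rows, one per distinct word.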

In [89]:
torch.manual_seed(123)

<torch._C.Generator at 0x12823fb70>

In [90]:
embed = torch.nn.Embedding(len(sentence_vec), 10)
embedded_sentence = embed(sentence_vec).detach()

The sentence "Life is sweet and sugary, but eating ice cream in Manali is long" now has the representation shown below. It has been converted from text into a numerical form that the machine can process. This embedding is the input to a language model. Although the sentence is now an embedding, these randomly initialized vectors do not capture any relationship between the words. We need a learned embedding to represent the text semantically.

In [91]:
print(embedded_sentence.shape)
print(embedded_sentence)

torch.Size([13, 10])
tensor([[ 0.3374, -0.1778, -0.3035, -0.5880,  0.3486,  0.6603, -0.2196, -0.3792,
          0.7671, -1.1925],
        [ 0.9031, -0.7218, -0.5951, -0.7112,  0.6230, -1.3729, -2.2150, -1.3193,
         -2.0915,  0.9629],
        [-1.4205, -0.2238, -0.2548,  1.1517, -0.0179,  0.4264, -0.7657, -0.0545,
         -0.7321,  1.2347],
        [ 1.5810,  1.3010,  1.2753, -0.2010,  0.4965, -1.5723,  0.9666, -1.1481,
         -1.1589,  0.3255],
        [-0.9896,  0.7016, -0.9405, -0.4681, -0.8016, -0.8183, -1.1820, -0.2877,
         -0.6043,  0.6002],
        [-0.6315, -2.8400, -1.3250,  0.1784, -2.1338,  1.0524, -0.3885, -0.9343,
         -0.4991, -1.0867],
        [-1.4779,  1.1331, -1.2203,  1.3139,  1.0533,  0.1388,  2.2473, -0.8036,
         -0.2808,  0.7697],
        [-0.6596, -0.7979,  0.1838,  0.2293,  0.5146,  0.9938, -0.2587, -1.0826,
         -0.0444,  1.6236],
        [ 0.8805,  1.5542,  0.6266, -0.1755,  0.0983, -0.0935,  0.2662, -0.5850,
          0.8768,  1.6221]

Now we shall use a pretrained model to generate a learned embedding for our sentences

In [92]:
sentence1 = "I feel like driving car today"
sentence2 = "I think I want to drive today"

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [None]:
embedding1 = model.encode(sentence1)
embedding2 = model.encode(sentence2)

The cosine similarity of two embeddings tells us how semantically similar the underlying texts are. This is particularly useful in our application, where the query is matched with documents using a similarity function.

The embedding is particularly important because it determines how well the text is represented in the latent space, and therefore how well the model can understand it semantically. A good embedding can significantly boost the generation capability of an LLM.

In [None]:
util.pytorch_cos_sim(embedding1, embedding2)

The cosine similarity is high, indicating that the two sentences are semantically similar.
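For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with the standard library on small made-up vectors. It is meant only to show the formula, not to replace `util.pytorch_cos_sim`.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|), in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors (not real embeddings).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, different length
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1

same_direction = cosine_similarity(v1, v2)
orthogonal = cosine_similarity(v1, v3)

print(same_direction)  # approximately 1.0
print(orthogonal)      # 0.0
```

Because the score depends only on direction and not on length, `v1` and `v2` are treated as maximally similar even though one is twice as long as the other.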

##### So far we have seen what an LLM is and what an embedding is. Building on this knowledge, we will now showcase RAG

#### Retrieval Augmented Generation 

Retrieval Augmented Generation (RAG) is a technique in generative AI to boost the knowledge of an LLM. The LLM's parameters are fixed after training and are not updated with current information, so a specialized knowledge base (which could be private) is created for the LLM to access. This is a non-parametric memory, i.e., this information is not stored in the learned parameters of the LLM. RAG combines retrieval with generation for content generation tasks.

How does RAG work?

1. Retrieval: The system retrieves the top-k most relevant documents, which act as additional context for the query. Since the documents are stored in the database in the form of embeddings, retrieval can be done with a similarity search.

2. Generation: The retrieved documents serve as additional context along with the original input. The generative model can now use both the original input and the retrieved context to generate the content.
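The two steps can be sketched end to end with the standard library. This is a toy illustration under stated assumptions: a bag-of-words count vector stands in for a real embedding model, the three documents are made-up examples, and the "generation" step only assembles the prompt that would be sent to an LLM. The function names (`embed`, `retrieve`, `build_prompt`) are ours.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, documents, k=2):
    """Step 1 (retrieval): return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Step 2 (generation input): combine retrieved context with the query."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"

# Made-up documents for illustration only.
documents = [
    "The Turing test checks whether a machine can imitate human conversation.",
    "A rational agent acts to achieve the best expected outcome.",
    "Minimax is an algorithm for choosing moves in two-player games.",
]

query = "What is the Turing test?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

In the actual pipeline the embedding model, the vector database, and the LLM call replace `embed`, the in-memory list, and the final `print`, respectively.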

#### Implementing RAG with Llama Index using ChromaDB as the vector database. Ollama is used to serve Llama 2 locally

The data can be accessed at this link. Add this folder to the root of the project directory

https://drive.google.com/drive/folders/10wzRErO4Zlj6L3bLSqh9QtTLDkaUOuxS?usp=drive_link

In [None]:
import openai

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, download_loader
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index.embeddings import OpenAIEmbedding

from langchain.llms import Ollama

import chromadb

In [None]:
from dotenv import dotenv_values
from pathlib import Path

In [80]:
api_key = dotenv_values('../.env')["OPENAI_API_KEY"]
openai.api_key = api_key

os.environ["OPENAI_API_KEY"] = api_key # For RAGAS evaluation

#### Set the embedding model

In [None]:
llm = Ollama(model="llama2")
#embed_model = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2") #Local Llama 2 embedding model
embed_model = OpenAIEmbedding() #Using OpenAI's text-embedding-ada-002

text-embedding-ada-002 is a powerful embedding model released by OpenAI. As discussed above, an embedding represents a text mathematically in n dimensions, and the distance between two embeddings measures the similarity between the corresponding sentences or words.

In [None]:
COLLECTION = "aiprof"
SLIDE_COLLECTION = 'slides'
PATH = '../chroma'

In [None]:
# create client and a new collection
db = chromadb.PersistentClient(path=PATH)
chroma_collection = db.get_or_create_collection(COLLECTION)

In [None]:
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

In [85]:
# load documents
documents = SimpleDirectoryReader("../data/AIMA/").load_data()

In [None]:
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)

In [None]:
# load from disk
db2 = chromadb.PersistentClient(path=PATH)
chroma_collection = db2.get_or_create_collection(COLLECTION)

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

index2 = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
)

In [114]:
query_engine = index2.as_query_engine()

In [None]:
resp = query_engine.query("What is the turing test?")
print(resp.response)

In [None]:
resp = query_engine.query("Generate 2 concise questions about rational agents")

In [None]:
resp.response.split('\n')

#### Querying Slides (PPTx)

In [None]:
# create client and a new collection
slides_db = chromadb.PersistentClient(path=PATH)
slides_chroma_collection = slides_db.get_or_create_collection(SLIDE_COLLECTION)

In [None]:
vector_store = ChromaVectorStore(chroma_collection=slides_chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

In [107]:
slides_reader = download_loader("PptxReader")
loader = slides_reader()

slides = loader.load_data(Path('../data/slides/4_Adversarial_Search.pptx'))

In [108]:
slides_index = VectorStoreIndex.from_documents(documents=slides, 
                                               storage_context=storage_context, 
                                               service_context=service_context)

In [None]:
slides_query = slides_index.as_query_engine()
resp = slides_query.query("What is The Minimax Algorithm?")
print(resp.response)

In [93]:
from llama_index.query_engine import RouterQueryEngine
from llama_index.tools import QueryEngineTool

pdf_query_engine = QueryEngineTool.from_defaults(query_engine=index2.as_query_engine(), description="Use this to retrieve information from the PDFs")
slides_query_engine = QueryEngineTool.from_defaults(query_engine=slides_index.as_query_engine(), description="Use this to retrieve information from the PPTx")


query_engine = RouterQueryEngine.from_defaults(
    query_engine_tools=[pdf_query_engine, slides_query_engine]
)
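The router above chooses between the two query engines based on their descriptions. As far as we understand, llama_index delegates that choice to an LLM-based selector. The sketch below swaps in a much simpler stand-in, plain word overlap between the query and each description, just to illustrate the idea of description-based routing. The `route` function and the `tools` dictionary are hypothetical and are not part of the llama_index API.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def route(query, tools):
    """Return the name of the tool whose description overlaps most with the query."""
    q = tokens(query)
    return max(tools, key=lambda name: len(q & tokens(tools[name])))

# Tool names are hypothetical; the descriptions are the ones used in the cell above.
tools = {
    "pdf": "Use this to retrieve information from the PDFs",
    "slides": "Use this to retrieve information from the PPTx",
}

choice = route("From the PPTx, tell me what is adversarial search", tools)
print(choice)  # slides
```

This also shows why the descriptions matter: the only signal that separates the two tools here is the word "PPTx", so a query that does not mention the source format would be hard to route with such similar descriptions.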

In [94]:
response = query_engine.query("From the PPTx, tell me what is adversarial search")
print(response.response)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Adversarial search is a type of search algorithm that assumes there is an adversary in the environment, and it relaxes the assumption of full control over the world. Adversarial search is used in game playing search, where the goal is to find the best move in a game against an opponent. The algorithm starts by evaluating the initial state of the game and then generates the game tree down to the terminal nodes. It applies the utility function to the terminal nodes and passes up the backed-up values to the parent node until it reaches the initial state. The value of the initial state is the minimum score for Max. In summary, adversarial search is a search algorithm that considers the presence of an adversary in the environment and evaluates the best move in a game against an opponent using a utility function.


## Evaluation

We have been quite successful at getting what look like accurate results, but let's run some evaluations on the generated text. The RAGAS paper (https://arxiv.org/pdf/2309.15217v1.pdf) inspired the ragas Python library, which runs evaluations on RAG-generated text.

**Retrieval Augmented Generation Assessment (RAGAS)** evaluates RAG-generated text without human intervention. Even after using RAG, we cannot be sure that the system did not hallucinate the generation. The paper proposes a suite of metrics for evaluating RAG.

*Faithfulness*: Measures the factual consistency of the generated answer with the retrieved context. The generated answer is faithful if all the claims it makes can be inferred from the given context. The score ranges from 0 to 1; the higher the score, the better.

*Answer Relevance*: Measures how relevant the answer is to the given prompt. Incomplete and/or redundant answers are given a lower score. The score ranges from 0 to 1.

*Context Relevance*: Similar to answer relevance, context relevance measures how relevant the retrieved context is to the query. The idea is that the context should exclusively contain the information required to answer the query prompt.

*Context Precision*: Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally all the relevant chunks must appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.

*Context Recall*: Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

Definitions are sourced from https://github.com/explodinggradients/ragas/tree/main/docs/concepts/metrics and the RAGAS paper. 
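To make the definitions concrete, the sketch below computes three of the scores as simple ratios. It rests on assumptions: in ragas the per-claim and per-chunk verdicts are produced by an LLM judge, whereas here they are hand-written boolean lists, and the formulas follow our reading of the ragas documentation, so they may differ between library versions. The function names are ours.

```python
def faithfulness_score(claim_supported):
    """Fraction of the answer's claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported)

def context_recall_score(gt_sentence_attributed):
    """Fraction of ground-truth sentences attributable to the retrieved context."""
    return sum(gt_sentence_attributed) / len(gt_sentence_attributed)

def context_precision_score(chunk_relevant):
    """Mean of precision@k taken over the ranks k that hold a relevant chunk."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical verdicts, for illustration only.
faith = faithfulness_score([True, True, False, False, False, False, False])
recall = context_recall_score([True, True])
precision = context_precision_score([False, True])

print(round(faith, 6))  # 0.285714 (2 supported claims out of 7)
print(recall)           # 1.0
print(precision)        # 0.5
```

Context precision rewards rankings that place relevant chunks first: a single relevant chunk at rank 2 scores 0.5, while the same chunk at rank 1 would score 1.0.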

In [None]:
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from ragas.metrics.critique import harmfulness

metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    harmfulness,
]

In [111]:
from ragas.llama_index import evaluate

In [115]:
eval_questions = ["What is the turing test?", 
                  "What is property of completeness in propositional logic?"]

eval_answers = ["The Turing test, originally called the imitation game by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses.",
                'In mathematical logic and metalogic, a formal system is called complete with respect to a particular property if every formula having the property can be derived using that system, i.e. is one of its theorems.']

eval_answers = [[a] for a in eval_answers]

In [116]:
result = evaluate(query_engine, metrics, eval_questions, eval_answers) #Takes long to run. 5-10min depending on the number of questions and answers

In [83]:
result.to_pandas()

| | question | contexts | answer | ground_truths | faithfulness | answer_relevancy | context_precision | context_recall | harmfulness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the turing test? | [6 Chapter 1.Introduction\ntheso-called totalT... | The Turing Test is a theoretical framework for... | [The Turing test, originally called the imitat... | 0.285714 | 0.920805 | 0.0 | 1.0 | 0 |
