## The Question
Large language models (LLMs) perform very well on natural language generation tasks, sometimes matching or surpassing human ability. Natural language processing has advanced rapidly since GPT, and ChatGPT showcased the power of LLMs on a variety of tasks such as question answering, long-form text generation, and even simple logical reasoning. However, its answers came from the model's parametric memory, which is frozen after training: the model could not update its knowledge or fetch relevant data. LLMs would often invent results and present them as fact; that is, they would *hallucinate*. **Is there a way to make large language models fetch relevant information and generate accurate answers?**

## The Dataset
We use three chapters (1, 2, and 7) of the course textbook 'Artificial Intelligence: A Modern Approach' and a couple of slides from Prof. Keogh and Prof. LePendu as additional context.

## The Method
To reduce hallucination and boost relevance, we use Llama-Index, LangChain, and Ollama to build a RAG pipeline. Llama-Index indexes the documents and stores them in a vector database. Ollama serves Llama 2 7B locally, and LangChain provides the interface to it. The textbook chapters and the slides are indexed and stored in the vector database, and are then retrieved based on their similarity to the query passed to the language model. For embeddings, we use OpenAI's text-embedding-ada-002 model through their API.

## Our hypothesis
We hypothesize that supplementing the LLM with additional context can boost the relevance of its answers and reduce hallucination. We evaluate with the RAGAS approach, which measures how relevant and consistent the generated result is with respect to the context and the query.

## Why bother?
As LLM use rises, it is worth remembering that no single LLM can know everything. It is important to have LLMs specialized for specific purposes, and to give them access to fresh data so that their output is relevant and factually correct. RAG seems a promising way to alleviate many of the problems of LLM generation, and, used correctly, it can broaden the use of language models in various fields.






# Main Jupyter Notebook for the CS205 Final Project
*This only works if run locally after completing all the installations mentioned in the README file of the project*

1. In this project we will explore a use case of Large Language Models.
2. We will start with how to use an LLM through HuggingFace, and explain some of the basic concepts behind an LLM.
3. Once we have a good understanding of how to use an LLM for generating text, we will explore Retrieval Augmented Generation (RAG).

For this project we have used the Llama 2 7-billion-parameter model with OpenAI's text-embedding-ada-002 embedding model. Llama 2 7B was served locally by Ollama. We have used Llama Index and LangChain to interact with the LLM.

#### The data store is at the root of the project directory with the name 'data'. Create this directory before running the indexing and query cells

#### Let's understand how to use an LLM using HuggingFace

In [123]:
# attach to the same event-loop
import nest_asyncio

nest_asyncio.apply()

import os

In [124]:
import torch

#Import HuggingFace Transformer
from transformers import AutoModelForCausalLM, AutoTokenizer

In [125]:
#Fetch Meta's OPT LLM with 1.3 billion parameters. This is quite a small model compared to the SOTA like GPT4V, etc.

model = AutoModelForCausalLM.from_pretrained('facebook/opt-1.3b')
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-1.3b')

##### Open Pre-trained Transformer (OPT) is a collection of decoder-only transformers developed by Meta.

In [126]:
input_text = 'I like CS205 Artificial Intelligence course, because'
tok_input = tokenizer(input_text, return_tensors='pt', add_special_tokens=True, truncation=True) #Create tokens from the given input and return PyTorch tensors

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [127]:
torch.manual_seed(123)

generated_output = model.generate(**tok_input,
                                  max_new_tokens=200,
                                  return_dict_in_generate=True,
                                  do_sample=True) #With do_sample=True the next token is sampled, so generation is stochastic.
                                                  #The manual seed set above makes the sampled output repeatable across runs
                                                  #Set do_sample=False to have a deterministic (greedy) generation each time


In [128]:
decoded_output = tokenizer.batch_decode(generated_output.sequences, skip_special_tokens=True)[0]
print(decoded_output)

I like CS205 Artificial Intelligence course, because of the breadth of material covered and the fact that it is an introductory course, so the material is "fresh". I think its a good course. (Yes, I do agree that I like the math, and i do agree that in all CS courses, the first course or two are usually the hardest ones).
Do you think you would still like it had you not taken a calculus course with higher math content
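
The `do_sample` flag in the generation cell switches between picking the most likely token every time and drawing a token at random from the most likely candidates. The following is a minimal pure-Python sketch of that difference, not the HuggingFace implementation; the scores in `toy_logits` and the helper names are made up for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Deterministic: always choose the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_sample(logits, k, rng):
    """Stochastic: sample among the k highest-scoring token ids."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top_ids])
    return rng.choices(top_ids, weights=probs, k=1)[0]

# Hypothetical next-token scores over a 5-token vocabulary.
toy_logits = [2.0, 0.5, 1.5, -1.0, 0.1]

greedy_choice = greedy_pick(toy_logits)
rng = random.Random(123)  # fixing the seed makes the sampled draws repeatable
sampled = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(10)]
print(greedy_choice, sampled)
```

Greedy decoding always returns token 0 here, while the sampled ids vary among the three best candidates (0, 2, and 1) but repeat exactly when the same seed is reused.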


#### What is an embedding?

An embedding is a numerical representation of text. Each word (token) in the vocabulary is first mapped to an integer id, and each id is then mapped to a vector of real numbers. For a sentence, the resulting 'embedding' has shape (n_tokens x embedding_dimension).

In [129]:
sentence = "Life is sweet and sugary, but eating ice cream on Everest is long"

In [130]:
tokenize1 = {s: i for i, s in enumerate(sorted(sentence.replace(',', '').split()))}

In [131]:
sentence_vec = torch.tensor([tokenize1[s] for s in sentence.replace(',', '').split()])
print(sentence_vec)

tensor([ 1,  8, 12,  2, 11,  3,  5,  6,  4, 10,  0,  8,  9])


The sentence is tokenized as shown in the output above: each word is mapped to an integer.
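
One detail in the output above: the word 'is' occurs twice in the sentence, so the sorted word list contains it twice, both occurrences map to 8, and id 7 is never used. A minimal sketch of a deduplicated vocabulary, assuming the same comma-stripping and whitespace tokenization; the names `words`, `vocab`, and `token_ids` are illustrative.

```python
sentence = "Life is sweet and sugary, but eating ice cream on Everest is long"

# Same simple tokenization as the notebook cell: drop commas, split on whitespace.
words = sentence.replace(",", "").split()

# Build the vocabulary from unique words so ids run from 0 to len(vocab) - 1 with no gaps.
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}

token_ids = [vocab[word] for word in words]
print(len(words), len(vocab))  # 13 12
print(token_ids)               # [1, 7, 11, 2, 10, 3, 5, 6, 4, 9, 0, 7, 8]
```

With this mapping the sentence still has 13 tokens, but the vocabulary has 12 entries and both occurrences of 'is' share id 7.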

In [132]:
torch.manual_seed(123)

<torch._C.Generator at 0x12823fb70>

In [135]:
embed = torch.nn.Embedding(len(sentence_vec), 10)
embedded_sentence = embed(sentence_vec).detach()

The sentence "Life is sweet and sugary, but eating ice cream on Everest is long" now has the representation shown below. It has been converted from text into a numerical form that the machine can process. An embedding like this is the input to a language model.
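
Conceptually, an embedding layer is a lookup table with one row per token id. The following pure-Python sketch illustrates that idea; it is not how `torch.nn.Embedding` is implemented. The table is filled with random values, much as a freshly created embedding layer holds random weights until the model is trained. The helper name `embed_tokens` is hypothetical.

```python
import random

rng = random.Random(123)

vocab_size = 13       # assumed: one row per possible token id
embedding_dim = 10    # length of each embedding vector

# The "layer" is just a table of vocab_size rows, each a vector of embedding_dim floats.
table = [[rng.gauss(0.0, 1.0) for _ in range(embedding_dim)] for _ in range(vocab_size)]

def embed_tokens(token_ids):
    """Look up one row per token id, giving an (n_tokens x embedding_dim) result."""
    return [table[i] for i in token_ids]

token_ids = [1, 8, 12, 2, 11, 3, 5, 6, 4, 10, 0, 8, 9]  # ids printed earlier in the notebook
embedded = embed_tokens(token_ids)
print(len(embedded), len(embedded[0]))  # 13 10
```

Because the lookup depends only on the id, the two occurrences of 'is' (positions 1 and 11) receive the same vector.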

In [136]:
print(embedded_sentence.shape)
print(embedded_sentence)

torch.Size([13, 10])
tensor([[-0.4223, -1.1036,  0.8398, -1.0029,  0.5253,  0.9389, -0.0306, -0.0894,
         -0.1965, -0.9713],
        [ 0.1419,  0.3696, -0.0174, -0.9575,  1.2968,  0.6833,  0.2154,  0.3307,
         -2.1467, -1.7984],
        [-0.9124, -1.4065,  1.3834,  0.0324,  0.0040,  0.3480,  1.7276, -2.5230,
          0.2561,  1.0097],
        [ 0.2790, -0.7587,  0.5473,  0.4301,  0.8558,  1.6098, -1.1893,  1.1677,
          0.6220,  2.5737],
        [ 0.8243, -0.4808, -0.1200,  0.4884,  1.1051, -0.5454, -0.2115, -0.2708,
         -0.0830, -0.2453],
        [-0.6239, -1.2965, -0.4382,  0.3265, -1.5786, -1.3995,  0.2425,  0.3648,
          1.3119, -0.2825],
        [ 0.7047, -0.2722,  0.0781, -0.1134, -0.7817,  0.8967, -0.4619, -1.5539,
         -0.3338,  0.2405],
        [-0.0334,  1.5544,  0.3418, -1.5768, -0.6933,  1.7409,  0.2698,  0.9595,
          0.7744,  1.8721],
        [ 1.7524, -0.2135,  0.4095,  0.0465,  0.5468,  1.1478, -0.3339, -0.6653,
          0.9376, -0.9225]

In [137]:
sentence1 = "I feel like driving car today"
sentence2 = "I think I want to drive today"

In [138]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [139]:
embedding1 = model.encode(sentence1)
embedding2 = model.encode(sentence2)

Computing the cosine similarity of two embeddings tells us how semantically similar the underlying texts are. This is particularly useful in our application, where the query is matched with documents using a similarity function.

In [140]:
util.pytorch_cos_sim(embedding1, embedding2)

tensor([[0.7208]])

The closer the value is to 1, the more semantically similar the sentences are. We can explore the similarity of other sentences by modifying the text.
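
As a sketch of what the similarity call computes: cosine similarity is the dot product of two vectors divided by the product of their lengths. The toy vectors below are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real sentence embeddings.
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1

same_direction = cosine_similarity(v1, v2)
orthogonal = cosine_similarity(v1, v3)
print(same_direction, orthogonal)
```

Vectors pointing the same way score (up to rounding) 1 regardless of their length, and orthogonal vectors score 0.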

#### What is RAG?

#### Retrieval Augmented Generation 

RAG is a technique in generative AI for extending the knowledge of an LLM. The LLM's parameters are learned during training and are not updated with current information, so a specialized knowledge base (which could be private) is created for the LLM to access. This is non-parametric memory, i.e., the information is not stored in the learned parameters of the LLM. RAG combines retrieval with generation for content generation tasks.

How does RAG work?

1. Retrieval: The retriever fetches the top-k most relevant documents, which act as additional context for the query. Since the documents are stored in the database as embeddings, retrieval is a similarity search.

2. Generation: The retrieved documents serve as additional context alongside the original input. The generative model can now use both the original input and the retrieved context to generate the content.
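
The two steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the Llama Index pipeline used in this notebook: word counts stand in for a real embedding model, the documents are a hypothetical mini knowledge base, and the final prompt would be handed to an LLM for the generation step.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lower-cased word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Step 1: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Step 2: combine retrieved context with the original query for the generator."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"

# Hypothetical mini knowledge base.
documents = [
    "The Turing test checks whether a machine can imitate a human.",
    "Adversarial search models games with an opponent.",
    "A rational agent acts to maximize its expected performance measure.",
]

query = "What is the Turing test?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

Only the first document shares words with the query, so it is the one retrieved and placed in the prompt ahead of the question.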

#### Implementing RAG with Llama Index using ChromaDB as the vector database. Ollama is used to serve Llama 2 locally

The data can be accessed at this link. Add this folder to the root of the project directory

https://drive.google.com/drive/folders/10wzRErO4Zlj6L3bLSqh9QtTLDkaUOuxS?usp=drive_link

In [141]:
import openai

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, download_loader
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index.embeddings import OpenAIEmbedding

from langchain.llms import Ollama

import chromadb

In [142]:
from dotenv import dotenv_values
from pathlib import Path

In [143]:
api_key = dotenv_values('../.env')["OPENAI_API_KEY"]
openai.api_key = api_key

os.environ["OPENAI_API_KEY"] = api_key # For RAGAS evaluation

In [144]:
#Set the LLM and the embedding model

llm = Ollama(model="llama2")
#embed_model = OllamaEmbeddings(base_url="http://localhost:11434", model="llama2") #Local Llama 2 embedding model
embed_model = OpenAIEmbedding() #Using OpenAI's text-embedding-ada-002

text-embedding-ada-002 is a powerful embedding model released by OpenAI. As discussed above, an embedding represents text as a vector in n dimensions, and the distance between embeddings measures the similarity between sentences or words.

In [145]:
COLLECTION = "aiprof"
SLIDE_COLLECTION = 'slides'
PATH = '../chroma'

In [146]:
# create client and a new collection
db = chromadb.PersistentClient(path=PATH)
chroma_collection = db.get_or_create_collection(COLLECTION)

In [147]:
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

In [150]:
# load documents
documents = SimpleDirectoryReader("../data/AIMA/").load_data()

In [152]:
slides_reader = download_loader("PptxReader")
loader = slides_reader()
slides = []

for file in os.listdir("../data/slides/"):
    if file.endswith(".pptx"):
        slides += loader.load_data("../data/slides/" + file)

In [153]:
information = documents + slides

In [154]:
index = VectorStoreIndex.from_documents(
    information, storage_context=storage_context, service_context=service_context
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [None]:
# load from disk
db2 = chromadb.PersistentClient(path=PATH)
chroma_collection = db2.get_or_create_collection(COLLECTION)

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

index2 = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
)

In [155]:
query_engine = index.as_query_engine()

In [156]:
resp = query_engine.query("What is the turing test?")
print(resp.response)

The Turing test is a check to see if a computer can behave like a human. It includes visual signals that the interrogator cannot test the TOTAL Turing Test subject's perceptual abilities, as well as the opportunity for the interrogator to pass physical objects through a hatch. To pass the total Turing Test, the computer will need computer vision to perceive objects and robotics to move them around. Within AI, there has not been a big effort to try to pass the Turing test. The main problem comes when AI programs have to interact with people, such as when an expert system explains how it came to its diagnosis or a natural language processing system has a dialogue with a user. These programs must behave according to certain normal conventions of human interaction in order to make themselves understood. The underlying representation and reasoning in such a system may or may not be based on a human model. Thinking humanly: The cognitive modeling approach If we are going to say that a given 

In [157]:
resp = query_engine.query("What is Adversarial Search?")
print(resp.response)

Adversarial search is a type of search algorithm that involves assuming there is an enemy entity that can potentially thwart the agent's goals. The algorithm relaxes the assumption of full control over the environment and considers the possibility of an opponent who can make moves that affect the outcome of the game.

In adversarial search, the algorithm considers different types of games, including cooperative and competitive games. For competitive games, the algorithm identifies the initial state, a set of operators, a terminal test, and a utility function. The utility function evaluates each node in terms of how good it is for each player, with positive values indicating states advantageous for Max and negative values indicating states advantageous for Min.

The algorithm then generates the game tree down to the terminal nodes and applies the utility function to the terminal nodes. It recursively passes up the backed-up values until reaching the initial state, and the minimum score 

In [158]:
resp = query_engine.query("Generate 2 concise questions about rational agents")

In [159]:
resp.response.split('\n')

['1. What is the difference between a performance measure and a utility function in the context of rational agents?',
 '2. Choose a familiar domain, and write a page description of an agent for that environment. Determine the most appropriate agent architecture for this domain (table lookup, simple reactive, goal-based, or utility-based)?']

In [160]:
resp = query_engine.query("When is the final project due")

In [161]:
print(resp.response)

Based on the provided context information, the final project is due on November 30, 2023 (tentative).


## Evaluation

We have been quite successful at getting what look like accurate results, but let's run some evaluations on the generated text. Inspired by the RAGAS paper (https://arxiv.org/pdf/2309.15217v1.pdf), the ragas Python library runs evaluations on RAG-generated text.

**Retrieval Augmented Generation Assessment (RAGAS)** evaluates RAG-generated text without human intervention. Even with RAG, we cannot be sure that the system did not hallucinate part of its output. The paper proposes a suite of metrics for evaluating RAG.

*Faithfulness*: Measures the factual consistency of the generated answer against the retrieved context. The answer is faithful if all the claims it makes can be inferred from the given context. The score ranges from 0 to 1; the higher the score, the better.
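
Following the RAGAS paper, faithfulness can be read as the fraction of statements in the answer that the context supports. A minimal sketch, assuming the statements have already been extracted and judged (in RAGAS an LLM does both); the function name and the verdicts below are hypothetical.

```python
def faithfulness_score(verdicts):
    """Fraction of answer statements judged as supported by the retrieved context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Hypothetical verdicts for five statements extracted from an answer
# (True = supported by the context).
verdicts = [True, False, True, False, False]
score = faithfulness_score(verdicts)
print(score)  # 0.4
```

Two supported statements out of five give a score of 0.4.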

*Answer Relevance*: Measures how relevant the answer is to the given prompt. Incomplete and/or redundant answers are given a lower score. The score ranges from 0 to 1.

*Context Relevance*: Similar to answer relevance, context relevance measures how relevant the retrieved context is to the query. The idea is that the context should exclusively contain the information required to answer the query.

*Context Precision*: Context precision evaluates whether the ground-truth relevant items present in the contexts are ranked near the top. Ideally, all the relevant chunks appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
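
As we understand the ragas documentation, context precision averages precision@k over the ranks that hold a relevant chunk, so relevant chunks near the top raise the score. A minimal sketch under that assumption; the function name and relevance flags are hypothetical, and in ragas an LLM decides which chunks are relevant.

```python
def context_precision(relevance_flags):
    """Mean of precision@k taken over the ranks k that hold a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical relevance of three retrieved chunks, in ranked order.
good_ranking = context_precision([True, True, False])   # relevant chunks on top
poor_ranking = context_precision([False, True, True])   # relevant chunks pushed down
print(good_ranking, poor_ranking)
```

Both rankings contain the same two relevant chunks, but the first scores 1.0 and the second scores lower (7/12) because its relevant chunks sit below an irrelevant one.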

*Context Recall*: Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

Definitions are sourced from https://github.com/explodinggradients/ragas/tree/main/docs/concepts/metrics and the RAGAS paper. 

In [162]:
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from ragas.metrics.critique import harmfulness

metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    harmfulness,
]

In [163]:
from ragas.llama_index import evaluate

In [167]:
eval_questions = ["What is the turing test?"]

eval_answers = ["The Turing test, originally called the imitation game by Alan Turing in 1950,[2] is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses."]
eval_answers = [[a] for a in eval_answers]

In [168]:
result = evaluate(query_engine, metrics, eval_questions, eval_answers) #Takes long to run. 5-10min depending on the number of questions and answers. Might even timeout. 
#The results of this evaluation will depend on the async library on your local. This is just a demonstration that generation can be evaluated

evaluating with [faithfulness]


100%|██████████| 1/1 [01:00<00:00, 60.96s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:02<00:00,  2.89s/it]


evaluating with [context_precision]


100%|██████████| 1/1 [00:11<00:00, 11.63s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:36<00:00, 36.18s/it]


evaluating with [harmfulness]


100%|██████████| 1/1 [00:08<00:00,  8.16s/it]


In [169]:
result.to_pandas()

| | question | contexts | answer | ground_truths | faithfulness | answer_relevancy | context_precision | context_recall | harmfulness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the turing test? | [6 Chapter 1.Introduction\ntheso-called totalT... | The Turing test is a check to determine whethe... | [The Turing test, originally called the imitat... | 0.4 | 0.8753 | 0.0 | 1.0 | 0 |


## Conclusion

This tutorial demonstrated what a large language model is, how to use a model from HuggingFace, and what embeddings are. We went further by exploring Retrieval Augmented Generation (RAG), a novel method to boost the generation of an LLM. We then evaluated the results using RAGAS.

Retrieval Augmented Generation has helped make LLMs more specific and relevant. It has demonstrated that it can reduce hallucination and boost the relevance of the generated text while offering more control and interpretability. Though this method is quite effective, it should be acknowledged that no external source is perfect or free of misinformation and bias. We have a long way to go before such models can be used without humans verifying and interpreting the results they generate.