# RAG System: Retrieval-Augmented Generation Implementation

This notebook demonstrates a complete Retrieval-Augmented Generation (RAG) system implementation for enterprise question-answering.

**Skills Demonstrated:**
- RAG architecture design and implementation
- Vector database integration for document retrieval
- LangChain/LangGraph pipeline construction
- Embedding models and similarity search
- LLM integration for answer generation
- RAG evaluation metrics

**Technologies:** LangChain, Vector Databases, Embedding Models, LLMs

# LLM Usage Clarification in A5

PLEASE READ THESE DIRECTIONS CAREFULLY

We want to clarify the boundaries for using LLMs (commercial and open source) in your completion of a5. Beyond the obvious use of the Mistral and Cohere LLMs as the question answering portion of your RAG pipeline, you may use an LLM to assist in certain portions of the project. For example, you may use an LLM to assist with re-writing portions of your report if you wish. We would discourage you from using the LLM to write code.

**You MUST NOT:**

* You must **NOT** use an LLM to answer any questions we posed. For example, do not use it to answer the questions in sections 4 or 5 of the notebook.

* You must **NOT** use the LLM to generate the initial draft of your final report.

* You must **NOT** use the LLM to generate or assist with your metrics justification. We want to know what you think in your own words.

**You CAN:**

* You can use an LLM to help you with the report writing as far as formulations and phrasing go, but obviously not to construct core content.

* You should turn off the use of your input to help with training commercial LLMs. For example, in OpenAI’s ChatGPT you can go to Settings in your account and then Data Controls and then make sure that ‘Improve the model for everyone’ is set to off.

**You MUST:**

* You **MUST** tell us how you have used an LLM by specifying which model(s) you used, how you used the model, and which portion of your output it helped to generate. You can include this in the references section and put citations into the body of your final report.

## 1. Setup

We will first install a number of libraries and import what we will need.

In [2]:
import os
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%%capture
!pip install -q -U transformers
!pip install -q -U datasets loralib sentencepiece
!pip install -q bitsandbytes accelerate
!pip install -q langchain
!pip install langchain-core
!pip install langchain-classic
!pip install einops
!pip install faiss-gpu
!pip install langchain_community
!pip install --upgrade --quiet chromadb bs4 qdrant-client
!pip install langchainhub
!pip install -U langchain-huggingface
!pip install -U langchain-cohere
!pip install --upgrade --quiet wikipedia
!pip install --upgrade --quiet arxiv
!pip install --upgrade --quiet pymupdf

!pip install xmltodict

!pip install cohere
!pip install langchain-qdrant

In [4]:
#In case we want to know our installed transformers library version
!pip list | grep transformers
!pip list | grep accelerate
!pip list | grep langchain

sentence-transformers 5.1.2
transformers 4.57.3
accelerate 1.12.0
langchain 1.1.0
langchain-classic 1.0.0
langchain-cohere 0.5.0
langchain-community 0.4.1
langchain-core 1.1.0
langchain-huggingface 1.1.0
langchain-qdrant 1.1.0
langchain-text-splitters 1.0.0
langchainhub 0.1.21


In [5]:
import torch
import os
import bs4
import json
import numpy as np
import time

from pprint import pprint

import locale

from transformers import AutoTokenizer , AutoModelForCausalLM
from transformers import pipeline, BitsAndBytesConfig
from langchain_huggingface import HuggingFacePipeline

from langchain_cohere import ChatCohere

from langchain_core.prompts import PromptTemplate
from langchain_classic.chains import LLMChain
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.output_parsers import StrOutputParser

from langchain_text_splitters import CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_classic import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores import Chroma
#from langchain_community.vectorstores import Qdrant
from langchain_qdrant import QdrantVectorStore as Qdrant
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.utils.math import cosine_similarity

from langchain_community.document_loaders import ArxivLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import WikipediaLoader
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import PubMedLoader

from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')



In [6]:
import langchain
print(langchain.__version__)

1.1.0


In [7]:
locale.getpreferredencoding = lambda: "UTF-8"

In [8]:
%%capture
!pip install -U sentence_transformers

Add your keys from the secret store (do **NOT** print them out or leave them exposed as plaintext in your notebook!):

In [9]:
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

## 2. Building the Components of our RAG System

Let us introduce and test the base components of our RAG system. We will largely use the Hugging Face and LangChain libraries.

### 2.1 The Embedding Model

We will need to represent text (pieces) as vectors. For this, we will use the [sentence_transformer](https://sbert.net/docs/sentence_transformer/pretrained_models.html) architecture.

**NOTE:** The embedding models you can use are: 'all-mpnet-base-v2', 'all-MiniLM-L12-v2', 'multi-qa-mpnet-base-dot-v1', 'all-distilroberta-v1', and 'multi-qa-distilbert-cos-v1 '

In [10]:
# %%capture
# base_embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")

In [11]:
# text = "This is a test document."
# query_result = base_embeddings.embed_query(text)
# print(f'Embedding dimension: {len(query_result)}')

# doc_result = base_embeddings.embed_documents(["Germany won the World Cup 4 times.", "This is a training document."])
# len(doc_result)

Do those dimensions look correct?

Now lets see if the embedding model is working as we want. Ideally our embeddings go beyond shared words and capture the underlying meaning.

In [12]:
# #Let's see how well our embeddng model works
# similarity = cosine_similarity([query_result], doc_result)[0]

# similarity

That's how you should define your embedding models.

Next, we turn to text chunks.

### 2.2. Loading and Chunking Texts

We first need to load the documents. Here is an example:

In [13]:
# loader = WebBaseLoader(
# web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
# bs_kwargs=dict(
# parse_only=bs4.SoupStrainer(
# class_=("post-content", "post-title", "post-header")
# )
# ),
# )

# documents = loader.load()

We will need to split the text in chunks that are 'suitable' as retrieval units. Let's for starters define a chunk size of 128 and have no overlap between the chunks:

In [14]:
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)
# splits = text_splitter.split_documents(documents)
# print('Number of splits/chunks: ', str(len(splits)))

Ok, so it looks like we have now many splits (chunks) from one document. Here is how you can get the content:

Perfect. Now we have the splits and embeddings. Next, the embeddings need to be stored in a vector db.

### 2.3 Storing the Embeddings of Chunks in Vectorstores

After loading and chunking the data, we need to save the vector representations of the chunks in a vectorstore. We will use Qdrant here for simplicity. We load the splits (structured chunks) and the embeddings:

In [16]:
# vectorstore = Qdrant.from_documents(splits,
# base_embeddings,
# location=":memory:", # Local mode with in-memory storage only
# collection_name="test",
# )
# retriever = vectorstore.as_retriever()

The nice thing is that the vector store also does the similarity searches for us:

In [17]:
# query = "What is Chain of Thought doing?"
# docs = vectorstore.similarity_search_by_vector(base_embeddings.embed_query(query)) # will rank the splits

Looks good! We have an ordered list of documents that seem to relate to the question. That is what we need.

The last major component is the actual LLM.

### 2.4. The LLM

We will use one Open Source Model ("mistralai/Mistral-7B-Instruct-v0.3") and one Proprietery Model (Cohere) for our tests. Let's first set up the OS model:

In [19]:
# #Quantization config

# quantization_config = BitsAndBytesConfig(
# load_in_4bit=True,
# bnb_4bit_quant_type="nf4",
# bnb_4bit_use_double_quant=True,
# bnb_4bit_compute_dtype=torch.bfloat16
# )

In [20]:
# #%%capture

# llm_mistral_model = AutoModelForCausalLM.from_pretrained(
# "mistralai/Mistral-7B-Instruct-v0.3",
# dtype=torch.float32,
# device_map='auto',
# quantization_config=quantization_config
# )

# llm_mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

In [21]:
# # RC - Modified to store and download
# # Download and save to Google Drive (If running from Colab)

# import os
# from google.colab import drive
# drive.mount('/content/drive')

# model_save_path = "/content/drive/MyDrive/models/mistral-7b-instruct"

# # Check if already saved
# if os.path.exists(model_save_path):
# print("Loading model from Google Drive...")
# model_source = model_save_path
# else:
# print("Downloading model from Hugging Face (this will take ~15 min)...")
# model_source = "mistralai/Mistral-7B-Instruct-v0.3"

# # Load from appropriate source
# llm_mistral_model = AutoModelForCausalLM.from_pretrained(
# model_source,
# dtype=torch.float32,
# device_map='auto',
# quantization_config=quantization_config
# )

# llm_mistral_tokenizer = AutoTokenizer.from_pretrained(model_source)

# # If we downloaded, save it for next time
# if model_source != model_save_path:
# print("Saving model to Google Drive for future use...")
# llm_mistral_model.save_pretrained(model_save_path)
# llm_mistral_tokenizer.save_pretrained(model_save_path)
# print("Saved!")

# llm_mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

We use the model first to generate a Hugging Face pipeline. A pipeline simplifies the process of actually generating responses.

In [22]:
# mistral_pipe = pipeline(
# "text-generation",
# model=llm_mistral_model,
# tokenizer=llm_mistral_tokenizer,
# max_new_tokens=1000,
# temperature=0.6,
# top_p=0.95,
# do_sample=True,
# repetition_penalty=1.2
# )
# mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id

# # wrapping the Hugging Face pipeline into a LangChain object

Does it work?

Reasonable!

We will also use a Cohere model, but will create this below as part of the LangChain framework.

### 2.5 Testing the LLM in a LangChain Chain

Chains will be defined and discussed in . In short, they are convenient programmatic ways to deal with 'chains' of actions that involve LLMs. For example, a list of events like 'here is a city name. Plug that city name into prompt template, then generate a story about that city. Lastly, format the model output as a string' can be easily handled by LangChain's Chain framework. In this case, the Chain would consist of the prompt template, the LLM, and the String Formatter. The parameter (the city in this case) will be provided at run time by invocation of the Chain. Let's test that.

To use a Hugging Face model in a LangChain environment, we need to wrap the model into a LangChain pipeline object:

In [24]:
# mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)

Next, we need to define a template and create a corresponding prompt template that can take any questiion

In [25]:
# test_llm_template = """[INST] Give me a two-sentence story about an {object}! [/INST]"""
# test_llm_prompt_template = PromptTemplate(template=test_llm_template, input_variables=["object"])

Let's define a Chain, a static flow of actions that (usually) involve at least a definition of the variables used in the chain, one or more templates, LLM step(s) and potentially other actions. This would be a chain that declares the variable 'object' to be expected when the chain is invoked, then inserts it into the template, and passes this to our mistral model pipeline (wrapped as a LangChain object):

In [26]:
# test_llm_chain_short = (
# {"object": RunnablePassthrough()}
# | test_llm_prompt_template
# | mistral_llm_lc
# )

Works too. We will use this notation moving forward.

Next, how would we do this with a Cohere Chat Model instead of Mistral?

In [28]:
# cohere_chat_model = ChatCohere(cohere_api_key=COHERE_API_KEY)

This can be plugged straight into the Chain:

In [29]:
# test_cohere_llm_chain_short = (
# {"object": RunnablePassthrough()}
# | test_llm_prompt_template
# | cohere_chat_model
# )

Works! (Note: you may want to review the format of the template. The one we used here is the one from Mistral, and the format may or may not be optimal for Cohere.)

How can we get the output formatting under control? We can add a String Formatter to the chain:

In [31]:
# output_parser = StrOutputParser()

# test_cohere_llm_chain_short_formatted = (
# {"object": RunnablePassthrough()}
# | test_llm_prompt_template
# | cohere_chat_model
# | output_parser
# )

# test_cohere_llm_chain_short_formatted.invoke('apple')

### 2.6 Setting Up a Simple RAG Chain

For RAG, we will follow the same approach. Except... you will **later** need to change the chain to include the retrieval step.

We first do a simple test: create a RAG template that takes a question and a pre-defined context as input, and generates the answer based on the provided context:

In [32]:
# rag_template = """[INST] Answer the question based only on the following context:
# {context}

# Question: {question}
# [/INST]
# """
# rag_prompt_template = ChatPromptTemplate.from_template(rag_template)

# base_rag_chain =(
# {"context": RunnablePassthrough(),
# "question": RunnablePassthrough()}
# | rag_prompt_template
# | mistral_llm_lc
# | output_parser
# )

# predefined_context = "Germany has won the World Cup 4 times."
# question = "How many times did Germany win the world cup?"

# resp = base_rag_chain.invoke({'context': predefined_context,
# 'question': question})
# print(resp)

That's great. But of course, the context needs to be created in an earlier retrieval step. More precisely, the documents will be first retrieved as a list, and then they will need to be formatted into one string to pass to the LLM in the context window.

Here is a simple formatting function that can be hooked into the chain, which combines a list of chunks into one string:

In [33]:
# def format_docs(docs):
# return "\n\n".join(doc.page_content for doc in docs)

So how could we build a simple chain? Let's first just get the retrieval done and the formatted retrieved data and the question inserted into the prompt template:

In [34]:
# rag_template = """Here is a context:\n{context} \n\nand here is a question: \n{question}"""

# rag_prompt = ChatPromptTemplate.from_template(rag_template)

# rag_chain = (
# {"context": retriever | format_docs,
# "question": RunnablePassthrough()}
# | rag_prompt

# )

In [35]:
# output = rag_chain.invoke('What is Chain of Thought?')

Ok... with some formatting... this looks good:

In [36]:
# print(output.messages[0].content)

Let's complete the RAG Chain:

In [37]:
# output_parser = StrOutputParser()

# rag_template = """[INST]Please answer the question below only based on the context information provided.\n\nHere is a context:\n{context} \n\nHere is a question: \n{question}.[/INST]"""
# rag_prompt = ChatPromptTemplate.from_template(rag_template)

# rag_chain = (
# {"context": retriever | format_docs,
# "question": RunnablePassthrough()}
# | rag_prompt
# | mistral_llm_lc

# )

What about the Cohere models?

In [39]:
# cohere_rag_chain = (
# {"context": retriever | format_docs,
# "question": RunnablePassthrough()}
# | rag_prompt
# | cohere_chat_model
# | output_parser
# )

Works too! Time to build the real thing and do experimentation.

## 3. The RAG Model & Experimentation

With this we can get started. First, we need to acquire the data, chunk it, vectorize it, and store the embeddings (and in this simple case also the docs) in our Qdrant vector db.

### 3.1 The Vector Database

We will start by creating our datastore, Qdrant. Usually, you would deploy the vector db as a server, but in this case let's simply put everything in memory. Also, in this case we will store not only the embeddings but the whole document in the vector store. We will seed the store with the splits from the blog post we had used before.

We will also create the retriever, which defines the way the documents are being retrieved. The retriever parameters define for example which method is used, how many docs are retrieved, etc. See [this LangChain link ](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore)for more information.

In [41]:
# qdrant_vectorstore = Qdrant.from_documents(splits,
# base_embeddings,
# location=":memory:", # Local mode with in-memory storage only
# collection_name="rag_tech_db",
# force_recreate=True
# )

# retriever = qdrant_vectorstore.as_retriever()

### 3.2 Data Acquisition, Chunking, and Vectorization

Now where we have our store we need to get the data into it. We will need to retrieve the data, create the chunks, then vectorize them, and finally store the vectors (along with the docs in this case) in the vector db.

Let us first set chunk size and overlap, as well as the type of splitter. These are starting parameters and you may want to experiment with them:

In [42]:
# #Note that these defaults may or may not be ideal!
# CHUNK_SIZE=128
# OVERLAP=0

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=OVERLAP)

Now let's work with an actual document collection. We will work with four types of documents:

* A few papers from the ArXiv on RAG and NLP
* A few blogs from Lily Weng that talk about Open Domain Question Answering and related topics
* A number of Wikipedia articles on that topic

To make testing easier we'll define a global record number so we can trace back to see which chunk came from which specific document.

In [43]:
# #assign a unique number to each document we ingest
# global_doc_number = 1

First we'll grab some papers from ArXiv. We'll grab the pdf files and get all of the pages as separate documents.

In [44]:
# arxiv_numbers = ('2005.11401', '2104.07567', '2104.09864', '2105.03011', '2106.09685', '2203.15556', '2203.02155', '2211.09260', '2211.12561',
# '2212.09741', '2305.14314', '2305.18290', '2306.15595', '2309.08872', '2309.15217', '2310.06825', '2310.11511',
# '2311.08377', '2312.05708', '2401.06532', '2401.17268', '2402.01306', '2402.19473', '2406.04744',
# '2312.10997', '2410.12812', '2410.15944', '2404.00657',
# )

In [45]:
# all_arxiv_pages = []

# #loop through the papers
# for identifier in arxiv_numbers:
# # Construct URL using the arXiv unique identifier
# arx_url = f"https://arxiv.org/pdf/{identifier}.pdf"

# # Extract pages from the document and add them to the list of pages
# arx_loader = PyMuPDFLoader(arx_url)
# arx_pages = arx_loader.load()
# for page_num in range(len(arx_pages)):
# page = arx_pages[page_num]
# #CHANGED
# page.metadata['page_num'] = page_num
# page.metadata['doc_num'] = global_doc_number
# page.metadata['doc_source'] = "ArXiv"
# all_arxiv_pages.append(page)

# global_doc_number += 1

How many docs did we get? Is that the correct number? And what is the content?

In [46]:
# num_pages = len(all_arxiv_pages)
# num_docs = global_doc_number - 1

# print(f"{num_docs} documents in total")
# print(f"{num_pages} pages in total")

Now we need to split the docs into chunks. LangChain provides a couple of ways to do that. We'll use for now the `RecursiveCharacterTextSplitter`.

In [48]:
# #index doc chunks
# splits = text_splitter.split_documents(all_arxiv_pages)
# for idx, text in enumerate(splits):
# splits[idx].metadata['split_id'] = idx

# print('Number of splits/chunks: ', len(splits))

Let's add the vectors to the datastore and see whether we can retrieve a nearest neighbor to a query. Let's look at the second closest match:

In [50]:
# %%capture

# qdrant_vectorstore.add_documents(documents=splits)

In [51]:
# query = "How can we train a model for preferences?"
# found_docs = qdrant_vectorstore.similarity_search_with_score(query)

In [52]:
# print(found_docs[0][0].page_content)
# print(found_docs[0][1])

Next, let's get some information from Wikipedia on our main topic -- Gen AI. LangChain provides a DocumentLoader that accesses the Wikipedia API.

In [53]:
# wiki_docs = WikipediaLoader(query="Generative Artificial Intelligence", load_max_docs=4).load()
# for idx, text in enumerate(wiki_docs):
# wiki_docs[idx].metadata['doc_num'] = global_doc_number
# wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

# global_doc_number += 1

# print('Number of documents: ', len(wiki_docs))

# #index docs
# wiki_splits = text_splitter.split_documents(wiki_docs)
# for idx, text in enumerate(wiki_splits):
# wiki_splits[idx].metadata['split_id'] = idx

# print('Number of splits/chunks: ', len(wiki_splits))

Now we'll add these splits to the vector stores.

In [54]:
# %%capture

# qdrant_vectorstore.add_documents(documents=wiki_splits)

Same with a couple of other queries:

In [55]:
# wiki_docs = WikipediaLoader(query="Information Retrieval", load_max_docs=4).load()
# for idx, text in enumerate(wiki_docs):
# wiki_docs[idx].metadata['doc_num'] = global_doc_number
# wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

# global_doc_number += 1

# print('Number of documents: ', len(wiki_docs))

# #index docs
# wiki_splits = text_splitter.split_documents(wiki_docs)
# for idx, text in enumerate(wiki_splits):
# wiki_splits[idx].metadata['split_id'] = idx

# print('Number of splits/chunks: ', len(wiki_splits))

In [56]:
# %%capture

# qdrant_vectorstore.add_documents(documents=wiki_splits)

And yet another related Wikipedia article.

In [57]:
# wiki_docs = WikipediaLoader(query="Large Language Models", load_max_docs=4).load()
# for idx, text in enumerate(wiki_docs):
# wiki_docs[idx].metadata['doc_num'] = global_doc_number
# wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

# global_doc_number += 1

# print('Number of documents: ', len(wiki_docs))

# #index docs
# wiki_splits = text_splitter.split_documents(wiki_docs)
# for idx, text in enumerate(wiki_splits):
# wiki_splits[idx].metadata['split_id'] = idx

# print('Number of splits/chunks: ', len(wiki_splits))

In [58]:
# %%capture

# qdrant_vectorstore.add_documents(documents=wiki_splits)

And finally another related Wikipedia article.

In [59]:
# wiki_docs = WikipediaLoader(query="Retrieval Augmented Generation", load_max_docs=4).load()
# for idx, text in enumerate(wiki_docs):
# wiki_docs[idx].metadata['doc_num'] = global_doc_number
# wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

# global_doc_number += 1

# print('Number of documents: ', len(wiki_docs))

# #index docs
# wiki_splits = text_splitter.split_documents(wiki_docs)
# for idx, text in enumerate(wiki_splits):
# wiki_splits[idx].metadata['split_id'] = idx

# print('Number of splits/chunks: ', len(wiki_splits))

In [60]:
# %%capture

# qdrant_vectorstore.add_documents(documents=wiki_splits)

We'll also augment our collection with some blog entries about Open Domain Question Answering, of which RAG is an approach, and some related topics in case users want to ask how the new Search system works.

In [61]:
# web_loader = WebBaseLoader(
# web_paths=("https://lilianweng.github.io/posts/2020-10-29-odqa/",
# "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
# "https://lilianweng.github.io/posts/2018-06-24-attention/",
# "https://lilianweng.github.io/posts/2023-06-23-agent/",
# "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/"),
# "https://lilianweng.github.io/posts/2024-07-07-hallucination/"

# bs_kwargs=dict(
# parse_only=bs4.SoupStrainer(
# class_=("post-content", "post-title", "post-header")
# )
# ),
# )

# web_documents = web_loader.load()

# for idx, text in enumerate(web_documents):
# web_documents[idx].metadata['doc_num'] = global_doc_number
# web_documents[idx].metadata['doc_source'] = "WWW"
# global_doc_number += 1

# print('Number of documents: ', len(web_documents))

Again, we will split the retrieved data into chunks and add the data to the vector store:

In [62]:
# web_splits = text_splitter.split_documents(web_documents)

# for idx, text in enumerate(web_splits):
# web_splits[idx].metadata['split_id'] = idx

# print('Number of splits: ', len(web_splits))

In [63]:
# %%capture

# qdrant_vectorstore.add_documents(documents=web_splits)

# project Start

### 3.3 The Test Data

You will want to test the system that you (will) have built. Below we give you a validation set that you could take as labeled data (imagine, your user personas would have had these questions and deemed the answers to be good). We also will give you a test set that only contains questions. (This is the set that we will use to get a feel for how well your RAG system corresponds to our Gold model).

Here are is the gold validation set and the test questions. **DO NOT CHANGE OR DELETE!!**

In [64]:
validation_questions_answers = {
 0: {
 "question": "What defines a large language model in the context of natural language processing tasks?",
 "gold_answer_research": "A large language model in the context of natural language processing tasks is characterized by its ability to achieve general-purpose language generation and other NLP tasks through self-supervised and semi-supervised training on large datasets. These models typically utilize feedforward neural networks and transformers, surpassing earlier models like recurrent neural networks and word n-gram language models. They are often pre-trained on vast amounts of text data from the internet and can be fine-tuned for specific downstream tasks such as summarization, machine reading comprehension, and natural language to SQL translation.",
 "gold_answer_marketing": "A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process."
 },
 1: {
 "question": "How do large language models like GPT-3 become capable of text generation?",
 "gold_answer_research": "Large language models like GPT-3 become capable of text generation through a process of pre-training on vast amounts of unlabelled text data, where they learn statistical relationships and patterns in language. This pre-training process involves training the model to predict the next word in a sequence, allowing it to understand and generate human-like text. Additionally, fine-tuning on specific tasks or datasets further enhances the model's ability to generate text that is contextually relevant and coherent. By combining these pre-training and fine-tuning techniques, GPT-3 can generate novel and diverse text outputs across a wide range of natural language processing tasks.",
 "gold_answer_marketing": "Large language models like GPT-3 become capable of text generation by being pre-trained on large data sets of unlabelled text, learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. This allows them to generate novel human-like content based on the patterns and relationships they have learned."
 },
 2: {
 "question": "What are some of the architectures used in building artificial neural networks for LLMs?",
 "gold_answer_research": "Some common architectures used in building artificial neural networks for LLMs include decoder-only transformer-based architectures, recurrent neural network variants, and Mamba (a state space model). MRKL (Modular Reasoning, Knowledge and Language) is another neuro-symbolic architecture that utilizes neural and symbolic expert modules. Additionally, the unified architecture proposed by Ronan Collobert and Jason Weston combines deep neural networks with multitask learning for natural language processing.",
 "gold_answer_marketing": "The architectures used in building artificial neural networks for LLMs include decoder-only transformer-based architecture, recurrent neural network variants, and Mamba (a state space model)."
 },
 3: {
 "question": "Can you name some notable large language models and their respective creators or companies?",
 "gold_answer_research": "Some notable large language models include Mistral 7B developed by a team of engineers including Albert Q. Jiang and Guillaume Lample, as well as Chinchilla developed by the research team at DeepMind. Additionally, The Pile dataset was created by Leo Gao, Stella Biderman, and others, while the Realm model was developed by Kelvin Guu and his team. These models have been recognized for their superior performance and efficiency in language generation and natural language processing tasks.",
 "gold_answer_marketing": "Some notable large language models include Mistral 7B by a team of researchers, Chinchilla by DeepMind, and GPT-3 by OpenAI."
 },
 7: {
 "question": "What licensing terms are associated with source-available models like Mistral AI's language models?",
 "gold_answer_research": "Source-available models like Mistral AI's language models, including Mistral 7B, are released under the Apache 2.0 license. This license allows for more permissive use and modification of the models, providing flexibility for users to adapt the models to their specific needs. Additionally, the release of Mistral 7B includes a reference implementation for easy deployment on various cloud platforms and integration with tools like the vLLM inference server and Hugging Face for streamlined usage. This licensing approach aims to facilitate widespread adoption and adaptation of the models for different tasks and applications.",
 "gold_answer_marketing": "Source-available models like Mistral AI's language models are released under the Apache 2.0 license."
 },
 8: {
 "question": "What are the main applications of language models?",
 "gold_answer_research": "Language models have a wide range of applications, including speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval. Large language models, which are currently the most advanced form, are created using larger datasets, often sourced from the public internet, and feedforward neural networks. These models can be used for tasks such as text generation, question-answering, creative writing, dialogue, and classification.",
 "gold_answer_marketing": "The main applications of language models include speech recognition, machine translation, natural language generation, optical character recognition, handwriting recognition, grammar induction, and information retrieval."
 },
 9: {
 "question": "Who proposed the first significant statistical language model?",
 "gold_answer_research": "The first significant statistical language model was proposed in 1980. This model was a probabilistic model of natural language. IBM conducted 'Shannon-style' experiments during the 1980s to identify potential sources for language modeling improvement by observing and analyzing the performance of human subjects in predicting or correcting text.",
 "gold_answer_marketing": "The first significant statistical language model was proposed by IBM in 1980."
 },
 11: {
 "question": "Which components have allowed large language models to surpass their predecessors?",
 "gold_answer_research": "Large language models have surpassed their predecessors due to their use of larger datasets, feedforward neural networks, and transformers. These components have enabled these models to outperform recurrent neural network-based models and traditional statistical models like word n-gram language models. The combination of these elements has allowed for the creation of more adaptable, efficient, and high-performing language models that can be applied to a wide range of tasks in the field of natural language processing.",
 "gold_answer_marketing": "The combination of larger datasets, feedforward neural networks, and transformers."
 },
 12: {
 "question": "What is a common strategy used by language models to address the curse of dimensionality?",
 "gold_answer_research": "One common strategy used by language models to address the curse of dimensionality is the use of continuous space embeddings produced by recurrent neural network-based models. These embeddings represent words as non-linear combinations of weights, which helps alleviate the data sparsity issue caused by the exponential increase in possible word sequences with the size of the vocabulary. This approach allows language models to efficiently capture the relationships between words in a lower-dimensional space, reducing the impact of the curse of dimensionality.",
 "gold_answer_marketing": "Continuous space embeddings produced in recurrent neural network-based language models are a common strategy used to address the curse of dimensionality."
 },
 13: {
 "question": "Why might large language models not be considered plausible cognitive models?",
 "gold_answer_research": "Large language models may not be considered plausible cognitive models because they sometimes learn patterns that humans do not learn, and fail to learn patterns that humans typically do learn. This discrepancy in learning capabilities between large language models, particularly recurrent neural networks, and humans raises questions about the true cognitive abilities of these models. Additionally, the potential for misuse of these models, such as generating misinformation or abusive content, highlights the need for caution when deploying them in various applications. Furthermore, the pre-defined context window size in LLMs and the potential for hallucination of data also contribute to the concerns about their cognitive modeling capabilities.",
 "gold_answer_marketing": "Large language models sometimes learn patterns that humans do not learn and fail to learn patterns that humans typically learn, making them not considered plausible cognitive models."
 },
 16: {
 "question": "What is the purpose of the constitution in training AI systems?",
 "gold_answer_research": "The purpose of the constitution in training AI systems, such as in the Constitutional AI approach, is to provide a set of guiding principles for the AI model to follow during the supervised learning phase. This constitution helps the model generate responses to prompts, self-critique these responses, and revise them accordingly. Additionally, the constitution serves as a framework for the reinforcement learning phase, where the AI evaluates responses based on these principles, ultimately training the AI to be harmless and helpful without extensive human feedback.",
 "gold_answer_marketing": "The purpose of the constitution in training AI systems is to guide the model to generate responses that align with a set of guiding principles, ensuring that the AI is harmless and helpful without extensive human feedback."
 },
 17: {
 "question": "What is the meaning of the term alignment tax in the context of AI development?",
 "gold_answer_research": "In the context of AI development, the term alignment tax refers to the additional cost incurred in ensuring that an AI system is aligned with human intent. This cost arises from the need to mitigate performance degradations introduced by fine-tuning the model to achieve alignment. A high alignment tax could deter the adoption of alignment techniques, as it may impact usability and performance. Therefore, there is a growing need for alignment techniques that have a low alignment tax to encourage the development of highly capable AI systems aligned with human intent.",
 "gold_answer_marketing": "The term 'alignment tax' in the context of AI development refers to the additional cost incurred to ensure that an AI system is aligned with human intent and ethical considerations. This cost can arise from mitigating performance degradations introduced by fine-tuning the AI model."
 },
 18: {
 "question": "How does the release of successive models in a language model series typically improve functionality?",
 "gold_answer_research": "The release of successive models in a language model series typically improves functionality by incorporating larger datasets, allowing for more comprehensive training and better understanding of syntax, semantics, and ontology in human language corpora. Additionally, newer models often address inaccuracies and biases present in earlier versions, leading to more accurate and reliable results. Furthermore, advancements in technology and research techniques contribute to the development of more efficient and high-performing language models, enabling them to be used in a wider range of tasks effectively.",
 "gold_answer_marketing": "The release of successive models in a language model series typically improves functionality by enhancing adaptability, performance, and efficiency through advancements in training methods and model architecture."
 },
 19: {
 "question": "What is the significant enhancement in Claude 2.1 compared to its previous version?",
 "gold_answer_research": "The significant enhancement in Claude 2.1 compared to its previous version Claude 2 is the expanded context window, which has been increased from 100,000 tokens to 200,000 tokens. This allows Claude 2.1 to process and analyze larger amounts of text, enabling it to provide more comprehensive summaries and assistance with tasks. Additionally, Claude 2.1 has improved its performance in handling complex queries and requests, showcasing advancements in its capabilities for natural language processing and understanding.",
 "gold_answer_marketing": "The significant enhancement in Claude 2.1 compared to its previous version is the expansion of the context window from 9,000 tokens to 100,000 tokens, as well as the ability to upload PDFs and other documents for reading, summarizing, and task assistance."
 },
 20: {
 "question": "In what way can a language model demonstrate meta-cognitive reasoning capabilities?",
 "gold_answer_research": "A language model can demonstrate meta-cognitive reasoning capabilities by realizing it is being artificially tested during needle-in-a-haystack evaluations, as shown by Claude 3. This ability allows the model to understand the context of the evaluation and adjust its responses accordingly. Additionally, fine-tuning a pre-trained model to answer questions without external context, like in the study by Roberts et al. (2020), can also showcase the meta-cognitive reasoning abilities of the language model. This fine-tuning process forces the model to rely on internalized knowledge acquired during pre-training, demonstrating its capacity for self-awareness and adaptive reasoning.",
 "gold_answer_marketing": "A language model can demonstrate meta-cognitive reasoning capabilities by realizing it is being artificially tested during evaluations such as needle in a haystack tasks."
 },
 22: {
 "question": "How can a language model's ability to analyze images expand its range of applications?",
 "gold_answer_research": "One way a language model's ability to analyze images can expand its range of applications is by enabling it to perform tasks that require both textual and visual information, such as image captioning or visual question answering. By incorporating image features into the model's input, it can generate more contextually relevant and accurate responses. This integration of image analysis can also enhance the model's performance in tasks like content generation, recommendation systems, and sentiment analysis, where visual cues play a significant role in understanding and interpreting the data. Additionally, combining language and image processing capabilities can lead to more sophisticated and versatile AI systems that can handle a wider range of real-world applications effectively.",
 "gold_answer_marketing": "By incorporating image analysis capabilities, a language model can be used for tasks such as image captioning, visual question answering, and text-to-image generation, expanding its range of applications beyond just text-based tasks."
 },
 23: {
 "question": "What are some ethical considerations that come into play when refining the performance of language models?",
 "gold_answer_research": "Some ethical considerations when refining the performance of language models include preventing biased outputs, ensuring privacy of data, avoiding generation of misinformation, and not promoting harmful activities. It is important to evaluate model outputs based on criteria such as not generating abusive or offensive language, not providing bad advice or promoting illegal activities, and not causing harm to individuals or the environment. Trade-offs may need to be made between these criteria depending on the task at hand. Additionally, alignment techniques should be used as part of a broader safety ecosystem to address safety issues associated with large language models.",
 "gold_answer_marketing": "Ethical considerations when refining language models include avoiding biased outputs, protecting privacy, preventing misinformation, and ensuring outputs do not cause harm or promote illegal activity. Trade-offs may need to be made in evaluating model outputs based on these criteria."
 },
 24: {
 "question": "Who developed the language model family known as Chinchilla?",
 "gold_answer_research": "The Chinchilla language model family was developed by the research team at DeepMind and presented in March 2022. It is named 'Chinchilla' as an advancement over the previous Gopher model family. The Chinchilla family has been trained to investigate the scaling laws of large language models and is designed to outperform GPT-3.",
 "gold_answer_marketing": "The research team at DeepMind developed the language model family known as Chinchilla."
 },
 25: {
	"question": "What benchmark did Chinchilla achieve an average accuracy of 67.5% on?",
	"gold_answer_research": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding).",
	"gold_answer_marketing": "Chinchilla achieved an average accuracy of 67.5% on the MMLU benchmark (Measuring Massive Multitask Language Understanding)."
 },
 27: {
 "question": "What is the significance of the accuracy percentage achieved by Chinchilla on the MMLU benchmark?",
 "gold_answer_research": "The significance of the accuracy percentage achieved by Chinchilla on the MMLU benchmark is that it is 67.5%, which is 7% higher than Gopher's performance. Chinchilla's performance on the benchmark is a positive indicator of its effectiveness in developing an effective training paradigm for large autoregressive language models with limited compute resources. The higher accuracy suggests that Chinchilla may have better performance on downstream tasks compared to Gopher. Additionally, Chinchilla was still in the testing phase as of January 12, 2023, indicating ongoing development and potential for further improvement in accuracy.",
 "gold_answer_marketing": "The significance of the accuracy percentage achieved by Chinchilla on the MMLU benchmark is 67.5%, which is 7% higher than Gopher's performance."
 },
 28: {
 "question": "Why is Chinchilla considered more efficient in terms of computing power for inference and fine-tuning?",
 "gold_answer_research": "Chinchilla is considered more efficient in terms of computing power for inference and fine-tuning because it has 70B parameters and four times as much data, allowing for high performance with limited compute resources. Additionally, Chinchilla's training paradigm recommends doubling the number of training tokens for every model size doubling, leading to better results on downstream tasks. Moreover, Chinchilla's smooth and differentiable model architecture contributes to its efficiency in training and inference processes.",
 "gold_answer_marketing": "Chinchilla is considered more efficient in terms of computing power for inference and fine-tuning because it has 70B parameters and four times as much data, which allows for high performance with limited compute resources."
 },
 30: {
 "question": "What is the recommended strategy for training large autoregressive language models with limited compute resources, as contributed by the Chinchilla team?",
 "gold_answer_research": "The Chinchilla team recommends that the number of training tokens should be doubled for every model size doubling to achieve better results on downstream tasks. They also suggest using larger, higher-quality training datasets to improve performance. Additionally, they mention the importance of balancing model size and efficiency to address computational costs and inference latency limitations. It is advised to focus on Transformer language models and consider sharing model parameters for quick task-switching when deploying as a service.",
 "gold_answer_marketing": "The Chinchilla team recommends doubling the number of training tokens for every model size doubling and using larger, higher-quality training datasets to achieve better results on downstream tasks."
 },
 33: {
 "question": "What assumptions must be met in order for the reparameterization of reward functions to be applied within the context of Plackett-Luce and Bradley-Terry models?",
 "gold_answer_research": "In order for the reparameterization of reward functions to be applied within the context of Plackett-Luce and Bradley-Terry models, the assumptions must include the existence of a reward function r(x, y) that induces an optimal model πr(y | x) as specified by Equation 4. Additionally, the reward function must be from an equivalence class of reward functions that differ only in an input-specific component. It is also important to ensure that the value distribution, representing human utility, is affected by input-specific changes to maximize preference. Finally, the assumptions should allow for the cancellation of the normalization constant Z(x) in the derived expressions.",
 "gold_answer_marketing": "The assumptions that must be met for the reparameterization of reward functions to be applied within the context of Plackett-Luce and Bradley-Terry models are that the reward classes consistent with these models can be represented by the reparameterization formula r(x, y) = β log(π(y|x) / πref(y|x))."
 },
 34: {
	"question": "What are some of the limitations of traditional position encoding methods in the architecture of pre-trained language models (PLMs), and what novel approach does the paper propose to address these issues?",
	"gold_answer_research": "One limitation of traditional position encoding methods in PLMs is that they may not enable length extrapolation of pre-existing models, leading to the need for substantial pre-training costs. The paper proposes a novel approach called Position Interpolation, which extends existing PLMs without deviating far from existing definitions of position encoding or attention mechanisms. This method allows for much extended context windows for text modeling, leading to significant perplexity gains and improved model performance.",
	"gold_answer_marketing": "Traditional position encoding methods in PLMs have limitations in enabling length extrapolation and adapting to extended context windows. The paper proposes a novel approach called Position Interpolation, which generates strong models that can effectively make use of much extended context windows. This method allows for substantial pre-training cost savings and preserves the quality of the original models, even for small context window tasks."
 },
 35: {
	"question": "How does the Rotary Position Embedding (RoPE) approach in Transformers differ from the traditional additive method of position embedding with respect to encoding position information?",
	"gold_answer_research": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by being multiplicative instead of additive. While traditional methods add position encoding to context representations, RoPE incorporates relative position information through rotation matrix product. This means that RoPE naturally includes relative position dependency in the self-attention formulation, without altering terms in the expanded formulation like the additive method does. Additionally, RoPE's properties show that it decays as the relative distance between positions increases, providing a clear theoretical interpretation of how position information is encoded.",
	"gold_answer_marketing": "The RoPE approach in Transformers differs from the traditional additive method of position embedding by incorporating relative position information through rotation matrix product instead of altering terms in the expanded formulation of additive position encoding."
 },
 36: {
 "question": "What approaches or methods are suggested for improving the alignment of language models with human preferences?",
 "gold_answer_research": "One approach suggested for improving the alignment of language models with human preferences is Direct Preference Optimization (DPO), which optimizes language models to adhere to human preferences without explicit reward modeling or reinforcement learning. This algorithm implicitly optimizes the same objective as existing RLHF algorithms, making it simple to implement and straightforward to train. Additionally, incorporating pretraining data into RLHF fine-tuning can help mitigate the alignment tax and improve the alignment of language models with human preferences.",
 "gold_answer_marketing": "Direct Preference Optimization (DPO) is a suggested approach for improving the alignment of language models with human preferences. It optimizes the same objective as existing reinforcement learning with a KL-divergence constraint but is simpler to implement and train."
 },
 38: {
 "question": "What methods have been explored to improve the alignment of language models with user preferences or desired outputs?",
 "gold_answer_research": "Several methods have been explored to improve the alignment of language models with user preferences or desired outputs. These include fine-tuning language models from human preferences, using reinforcement learning from human feedback, and minimizing f-divergence to align language models with preferences. Additionally, incorporating pretraining data into reinforcement learning from human feedback (RLHF) has shown promise in mitigating alignment issues. It is important to consider factors such as the underlying model, training data, fine-tuning data, and alignment methods when working towards better alignment with user intentions.",
 "gold_answer_marketing": "Methods explored to improve the alignment of language models with user preferences or desired outputs include RLHF, DPO, and f-divergence minimization."
 },
 39: {
 "question": "How does the uniqueness of a reparameterized reward function within an equivalence class impact the selection of optimal policies in constrained reinforcement learning problems?",
 "gold_answer_research": "The uniqueness of a reparameterized reward function within an equivalence class ensures that different reward functions from the same class will induce the same optimal policy in constrained reinforcement learning problems. This means that despite the under-specification issue, the optimal policy remains consistent across equivalent reward functions, allowing for more stable and reliable policy learning. By reparameterizing the reward function, the selection of optimal policies becomes more straightforward and consistent, leading to better convergence and performance in constrained reinforcement learning scenarios.",
 "gold_answer_marketing": "The uniqueness of a reparameterized reward function within an equivalence class ensures that different reward functions from the same class will lead to the selection of the same optimal policy in constrained reinforcement learning problems."
 },
 41: {
	"question": "Question: When conducting demographic and technical assessments of teams or research subjects, what types of data categories are typically collected and analyzed to ensure a comprehensive understanding of the group's composition and the methods used?",
	"gold_answer_research": "When conducting demographic and technical assessments of teams or research subjects, it is important to collect and analyze data categories such as age, gender, education level, professional background, and expertise in specific areas. By gathering information on these categories, you can ensure a comprehensive understanding of the group's composition and the methods used in your assessments. Additionally, it may be helpful to consider factors like cultural background, language proficiency, and geographical location to capture a more nuanced picture of the group being assessed. This detailed approach to data collection and analysis can provide valuable insights for making informed decisions and recommendations based on the gathered information.",
	"gold_answer_marketing": "Answer: Demographic data such as age, gender, education level, and technical data related to skills and experience are typically collected and analyzed for comprehensive understanding."
 },
 43: {
 "question": "How does the evaluation process determine the level of alignment in the models being tested?",
 "gold_answer_research": "The evaluation process determines the level of alignment in the models by comparing the system-level judgments made by the language model (GPT-4) with those made by human annotators, using metrics such as Kendall Tau and Spearman rank correlation. At the example level, the agreement between the model and human annotators is measured using Fleiss κ. These metrics provide insights into the alignment of the language model with human intentions and indicate the reliability of model-based evaluation as an alternative to human evaluation.",
 "gold_answer_marketing": "The evaluation process determines the level of alignment in the models being tested by comparing system-level judgments by the language model and human annotators, as well as examining agreement at the example level."
 },
 44: {
 "question": "What approaches can be used to improve the performance of an AI model on various datasets, and how do they compare in terms of effectiveness?",
 "gold_answer_research": "To improve the performance of an AI model on various datasets, you can consider combining your method with ways to filter the pretraining data or training the initial pretrained models using human data. Additionally, you could explore methods that improve models' truthfulness, such as WebGPT. Comparing the effectiveness of these approaches, it may be beneficial to experiment with different adaptation approaches in the low-data regime and evaluate their performance on tasks like MNLI-n. Fine-tuning models using different training datasets and varying the number of fine-tuning steps can also impact performance on public NLP datasets.",
 "gold_answer_marketing": "Approaches such as fine-tuning with different datasets, filtering pretraining data, and combining methods to improve model truthfulness can be used to enhance the performance of an AI model on various datasets. However, the effectiveness of these approaches may vary depending on the specific task and dataset."
 },
 46: {
 "question": "What are some common strategies for addressing the ethical and social risks associated with the deployment of language models, according to recent research findings?",
 "gold_answer_research": "Recent research findings suggest that common strategies for addressing the ethical and social risks associated with the deployment of language models include implementing alignment techniques to ensure that the models are helpful, honest, and harmless. These alignment techniques involve fine-tuning approaches such as reinforcement learning from human feedback to train the models to follow a broad class of written instructions. Additionally, it is important to consider the potential harms of language models in real-world applications and to evaluate how their outputs are used, especially in safety-critical situations. Furthermore, regulating access to large language models and incorporating diverse values-targeted datasets can help mitigate the risks of biased outputs, private data leaks, and misinformation generation.",
 "gold_answer_marketing": "Recent research findings suggest that common strategies for addressing ethical and social risks associated with the deployment of language models include mitigating biases, protecting private data, preventing the generation of misinformation, and ensuring alignment with human intentions to be helpful, honest, and harmless. These strategies aim to minimize the potential harms that language models may cause in various domains, such as medical diagnoses, employment, housing, and law enforcement."
 },
 47: {
 "question": "What categories are the listed companies classified into, and what are some examples of tasks that can be performed in relation to text extraction, generation, rewriting, and chat services based on the given content?",
 "gold_answer_research": "The listed companies are classified as follows: Apple falls under the category of Technology, Facebook is categorized as Social Media, and Fedex's category is not specified. Some examples of tasks that can be performed in relation to text extraction include salient span masking, named entity identification, and adding an empty null document. For text generation, tasks can include natural language generation, speech recognition, and machine translation. For rewriting, tasks can involve summarization, paraphrasing, and producing rap lyrics based on a given article. Chat services can include customer assistance, complaint resolution, and information retrieval.",
 "gold_answer_marketing": "Apple is classified into the Technology category, Facebook is classified into the Social Media category, and Fedex is not classified. Tasks that can be performed in relation to text extraction include salient span masking and adding an empty null document. Tasks related to text generation include speech recognition, machine translation, and natural language generation. For rewriting, tasks can include summarization and rewriting rap lyrics. Chat services can involve customer assistance and complaints."
 },
 48: {
 "question": "What criteria were used to select labelers to ensure they can effectively detect and rate sensitive content?",
 "gold_answer_research": "The criteria used to select labelers included agreement on sensitive speech flagging, performance on a screening test measuring the ability to detect and respond to sensitive content, comparison of outputs, and a demonstration score of 6/7. Labelers were chosen subjectively based on these criteria, with soft cutoffs at 75% agreement on sensitive speech flagging and comparisons. Additionally, labelers were trained on the project and provided with detailed instructions for each task to ensure consistency and accuracy in their ratings.",
 "gold_answer_marketing": "The criteria used to select labelers included agreement on sensitive speech flagging, performance on a screening test measuring sensitivity to different demographic groups, and ability to identify potentially harmful outputs. Soft cutoffs were set at 75% agreement on sensitive speech flagging and comparisons, and a 6/7 demonstration score."
 },
 50: {
 "question": "How do the evaluation metrics used in a model contribute to the overall quality and reliability of the generated responses?",
 "gold_answer_research": "The evaluation metrics used in a model, such as percentage of true responses and informative responses, play a crucial role in assessing the quality and reliability of the generated responses. By comparing the model's outputs to ground-truth summaries and baselines, these metrics help determine the performance of the model in generating accurate and relevant information. Additionally, subjective evaluations by labelers on a Likert scale further contribute to understanding the overall quality of the responses. Future work should focus on mitigating subjective preferences and biases in evaluation systems to ensure more reliable and consistent results.",
 "gold_answer_marketing": "The evaluation metrics used in a model help determine how often the model's outputs are preferred to a baseline policy, as well as the overall quality of each response. This contributes to assessing the performance and reliability of the generated responses in comparison to other models and benchmarks."
 },
 51: {
 "question": "How do the datasets encourage the application of common-sense reasoning and entailment recognition in natural language processing tasks?",
 "gold_answer_research": "The datasets mentioned in the context focus on evaluating natural language models on tasks that require common-sense reasoning and entailment recognition. By incorporating prompts that require understanding of context and relevant information, the models are challenged to go beyond surface-level processing. This encourages the development of models that can infer implicit information, draw logical conclusions, and make accurate predictions based on contextual clues. Through these datasets, researchers aim to improve the ability of natural language processing systems to perform more complex and nuanced tasks that mimic human-like reasoning.",
 "gold_answer_marketing": "The datasets encourage common-sense reasoning and entailment recognition by evaluating model performance on tasks like question answering, reading comprehension, and summarization, which require understanding context and relationships between different pieces of information in natural language."
 },
 52: {
 "question": "What metrics are used to evaluate the quality of translations and summaries in the dataset examples provided?",
 "gold_answer_research": "Translations in the dataset examples are evaluated using the BLEU metric, while summaries are judged based on their ROUGE-L scores with respect to a set of reference summaries. Additionally, the evaluation metric for the summaries is the f1 score from the sample to the target completion. These metrics help assess the quality and accuracy of the translations and summaries in the datasets.",
 "gold_answer_marketing": "Translations are evaluated using the BLEU metric, while summaries are judged via their ROUGE-L scores with respect to a set of reference summaries. The evaluation metric for translations is the f1 score from the sample to the target completion."
 },
 53: {
 "question": "What are the implications of adding updates on the pretraining data during the fine-tuning phase of model development based on the observed performance of the models?",
 "gold_answer_research": "Adding updates on the pretraining data during the fine-tuning phase of model development can help mitigate performance regressions observed in the models. This approach allows for reducing performance regressions on specific datasets without compromising labeler preference scores. Additionally, it enables the models to generalize to the preferences of 'held-out' labelers that did not provide training data, improving overall model performance. This method can be particularly useful for addressing performance issues and enhancing model robustness during the fine-tuning process.",
 "gold_answer_marketing": "Adding updates on the pretraining data during the fine-tuning phase can help mitigate performance regressions on certain tasks without compromising labeler preference scores. This approach can improve the generalization of models to new data and tasks."
 },
 54: {
 "question": "How do the capabilities of different AI models compare in terms of following explicit constraints and minimizing hallucinations, as evidenced by metadata ratings?",
 "gold_answer_research": "Based on the metadata ratings, it appears that the capabilities of different AI models vary in terms of following explicit constraints and minimizing hallucinations. The results show that extending LLaMA 7B and 13B models with a longer context window size leads to significant reductions in perplexity, indicating improved performance in minimizing hallucinations. However, it is important to note that the specific term 'AI hallucination' may anthropomorphize computers and the concept of hallucinations in AI is associated with unjustified responses or beliefs. Further research and analysis are needed to fully understand and compare the capabilities of different AI models in this context.",
 "gold_answer_marketing": "The capabilities of different AI models in terms of following explicit constraints and minimizing hallucinations can be compared based on metadata ratings."
 },
 55: {
 "question": "How do human likert scores compare when evaluating PPO with different initial models based on the pretraining fraction?",
 "gold_answer_research": "Based on the data provided, it appears that human likert scores for PPO with different initial models do not show significant sensitivity to the pretraining fraction choice, as indicated by the performance seeming not sensitive to the particular choice of 0%, 0.1%, or 0.5% pretraining data mix. Additionally, the likert scores seem to be consistent across different pretraining fractions. Further investigation into the impact of pretraining fraction on likert scores may be necessary to determine if there are any subtle differences in performance.",
 "gold_answer_marketing": "Human likert scores for PPO with different initial models show that there is not a significant difference in performance based on the pretraining fraction used."
 },
 59: {
 "question": "What methodologies can be employed to test the reliability and accuracy of AI-generated responses to prompts with varying levels of obscurity?",
 "gold_answer_research": "To test the reliability and accuracy of AI-generated responses to prompts with varying levels of obscurity, methodologies such as measuring truthfulness by comparing the model's actual output to its believed correct output can be employed. Additionally, sensitivity speech flagging can be used to identify and label prompts or completions that may elicit strong negative feelings. Agreement on rankings of prompts runted to the API can also help assess the model's performance on novel questions. Finally, human evaluation through random sampling of prompts and generated outputs can provide valuable insights into the quality of model responses.",
 "gold_answer_marketing": "Use automated and human evaluations, measure truthfulness, analyze model responses against user intentions, and conduct closed-book QA testing with various prompts."
 },
 60: {
 "question": "How does the subjective nature of human preferences influence the evaluation of chatbot task performance, and what are potential methods for addressing this challenge?",
 "gold_answer_research": "The subjective nature of human preferences can introduce variability in the evaluation of chatbot task performance, as different human annotators may have differing opinions on what constitutes a preferred response. This can lead to disagreements among evaluators, as seen in the case of comparing generations from different chatbot systems. The text suggests that drawing from disciplines such as Human-Computer Interaction and Psychology may offer insights into mitigating these challenges posed by subjective preferences. Additionally, the proposed Direct Preference Optimization (DPO) algorithm aims to optimize language models based on human preferences without the need for explicit reward modeling or reinforcement learning, offering a potential method to address this challenge.",
 "gold_answer_marketing": "The subjective nature of human preferences can impact the evaluation of chatbot task performance by introducing disagreements among evaluators. Potential methods for addressing this challenge mentioned in the text include investigating approaches from disciplines like Human-Computer Interaction and Psychology to mitigate subjective preferences."
 },
 61: {
 "question": "What are some potential societal impacts of the widespread use of the QLORA finetuning method for Language Learning Models (LLMs)?",
 "gold_answer_research": "The authors suggest that the QLORA finetuning method could help close the resource gap between large corporations and small teams, making state-of-the-art NLP technology more accessible. They believe that QLORA could lead to more independent analysis and auditing of LLMs, which could have a positive impact by ensuring models align with societal values and consensus. The method may also enable further investigations into the tradeoffs between simple cross-entropy loss and reinforcement learning from human feedback, potentially leading to more efficient training methods for LLMs.",
 "gold_answer_marketing": "The potential societal impacts of the widespread use of the QLORA finetuning method for LLMs include making state-of-the-art NLP technology more accessible to researchers with limited resources, closing the resource gap between large corporations and small teams, and enabling independent analysis of LLMs."
 },
 62: {
 "question": "What methods or approaches are being investigated or utilized to optimize or adapt machine learning models?",
 "gold_answer_research": "Some of the methods or approaches being investigated or utilized to optimize or adapt machine learning models according to the listed references include large-scale pre-training on general domain data followed by adaptation to specific tasks or domains, adding adapter layers, optimizing certain forms of adaptations, and exploring the use of large language models for reference-free text quality evaluation. Additionally, strategies like fine-tuning continuous prompts for generation and learning overparameterized neural networks via stochastic gradient descent on structured data are being explored for model adaptation. These approaches aim to make model adaptation more parameter- and compute-efficient.",
 "gold_answer_marketing": "The methods or approaches being investigated or utilized include large-scale pre-training on general domain data and adaptation to particular tasks or domains, adding adapter layers, and optimizing some forms of adaptations."
 },
 63: {
 "question": "What are the advantages of applying LoRA to transformer models in terms of computational efficiency during training and deployment?",
 "gold_answer_research": "Applying LoRA to transformer models offers several advantages in terms of computational efficiency. By using LoRA, a pre-trained model can be shared and used for multiple tasks, reducing the storage requirements and task-switching overhead significantly. Additionally, LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers, as gradients do not need to be calculated for all parameters. This can lead to faster model training and deployment times, improving overall computational efficiency.",
 "gold_answer_marketing": "Applying LoRA to transformer models can reduce the number of trainable parameters, making training more efficient and lowering the hardware barrier to entry. It also allows for quick task-switching during deployment by sharing the majority of model parameters."
 },
 64: {
 "question": "How does the training cost impact the evaluation of performance metrics across the different methods of adaptation for large Transformer models like GPT-3?",
 "gold_answer_research": "The training cost has a significant impact on the evaluation of performance metrics across different adaptation methods for large Transformer models like GPT-3. The cost of training our 175B SFT model is 4.9 petaflops/s-days, while training our 175B PPO-ptx model requires 60 petaflops/s-days, compared to 3,640 petaflops/s-days for GPT-3. This difference in training cost affects the efficiency and effectiveness of adaptation methods, as it determines the resources required for training and evaluating the models. Additionally, the training cost influences the scalability and feasibility of implementing these adaptation methods on a larger scale. It is essential to consider the trade-offs between training cost and performance metrics when evaluating and selecting adaptation methods for large Transformer models like GPT-3.",
 "gold_answer_marketing": "The training cost has a significant impact on the evaluation of performance metrics for large Transformer models like GPT-3. By achieving the largest reduction of trainable parameters, we can adapt the model while maintaining task performances, but the training cost for different adaptation methods varies. This cost can affect the overall efficiency and effectiveness of the adaptation process, as seen in the empirical studies conducted."
 },
 65: {
 "question": "What are the advantages of using low-rank adaptations during the fine-tuning process of pre-trained transformer models?",
 "gold_answer_research": "Using low-rank adaptations during the fine-tuning process of pre-trained transformer models can potentially amplify important features for specific downstream tasks that were not emphasized in the general pre-training model. This approach allows for a reduction in the number of trainable parameters, leading to a more efficient use of computational resources without sacrificing task performance. Additionally, low-rank adaptations can help in identifying the trade-off between performance and the number of trainable parameters, aiding in better optimization strategies for adaptation in low-data regimes.",
 "gold_answer_marketing": "Low-rank adaptations potentially amplify important features for specific downstream tasks that were not emphasized in the general pre-training model."
 },
 67: {
 "question": "What are the implications of fine-tuning large language models on datasets of varying sizes in contrast to using few-shot learning techniques?",
 "gold_answer_research": "Fine-tuning large language models on datasets of varying sizes compared to using few-shot learning techniques can significantly improve model performance across different tasks. Fine-tuning with DPO has shown to exceed PPO-based RLHF in controlling sentiment of generations and matching or improving response quality in summarization and single-turn dialogue. Additionally, fine-tuning is simpler to implement and train, making it a more practical approach for performance-sensitive applications with curated training datasets. The ability to fine-tune models for specific tasks on larger datasets can lead to substantial performance improvements compared to few-shot learning methods.",
 "gold_answer_marketing": "Fine-tuning with DPO surpasses PPO-based RLHF in sentiment control across generations and improves response quality in summarization and single-turn dialogue while being simpler to implement and train."
 },
 69: {
 "question": "How does the RoFormer model perform on semantic text matching tasks, such as those found in the CAIL2019-SCM dataset?",
 "gold_answer_research": "Based on the results from the experiments conducted on the CAIL2019-SCM dataset, RoFormer consistently outperforms its alternatives, including BERT and WoBERT, in dealing with long texts and semantic text matching tasks. When compared with BERT and WoBERT on the same pre-training data with short text cut-offs, RoFormer shows comparable performance to WoBERT and slightly better performance than BERT. However, when the maximum input text length is increased to 1024, RoFormer significantly outperforms WoBERT by an absolute improvement of 1.5%, showcasing its effectiveness in semantic text matching tasks.",
 "gold_answer_marketing": "The RoFormer model outperforms other pre-trained models on semantic text matching tasks, such as those in the CAIL2019-SCM dataset, especially when the maximum input text length is increased to 1024."
 },
 73: {
 "question": "What is the purpose of using a regularization term (like the epsilon multiplied by the identity matrix) in the linear regression model described in the code?",
 "gold_answer_research": "The purpose of using a regularization term (epsilon multiplied by the identity matrix) in the linear regression model described in the code is to prevent overfitting by adding a penalty term to the loss function. This regularization term helps to control the complexity of the model by discouraging large coefficients, leading to a more stable and generalizable solution. By incorporating this regularization, the model can better handle noise and outliers in the data, ultimately improving its performance on unseen data points.",
 "gold_answer_marketing": "The purpose of using a regularization term in the linear regression model is to mitigate or resolve catastrophic extrapolation error, making the extrapolated values comparable to those within the original range."
 },
 74: {
 "question": "What novel attention mechanisms does the language model discussed employ to improve performance and efficiency in processing long sequences?",
 "gold_answer_research": "The language model discussed employs sliding window attention to improve performance and efficiency in processing long sequences. This mechanism helps reduce the number of operations and memory usage in comparison to vanilla attention, leading to lower latency and increased throughput during inference. Additionally, the model incorporates linear attention as an alternative mechanism to avoid quadratic computation costs associated with input sequence length, further enhancing its ability to handle longer sequences effectively.",
 "gold_answer_marketing": "The language model discussed employs sliding window attention and linear attention mechanisms to improve performance and efficiency in processing long sequences."
 },
 75: {
 "question": "How does the performance of the Mistral 7B model in mathematical benchmarks compare to that of larger pretrained models?",
 "gold_answer_research": "Based on the provided information, the Mistral 7B model outperforms the Llama 34B model in mathematics and code generation benchmarks. Additionally, Mistral 7B approaches the performance of the larger Code-Llama 7B model in coding tasks. This indicates that Mistral 7B demonstrates high performance in mathematical benchmarks compared to larger pretrained models.",
 "gold_answer_marketing": "The Mistral 7B model outperforms larger pretrained models in mathematics benchmarks."
 },
 76: {
 "question": "What are the key differences between HALO and non-HALO loss functions, and what impact do these differences have on language model alignment?",
 "gold_answer_research": "HALO loss functions, such as DPO and KTO, explicitly model human biases like loss aversion, while non-HALO loss functions, like CSFT and SLiC, do not. These differences can impact language model alignment by influencing how well the model captures human preferences and perceptions in text generation tasks. HALOs may be more effective in aligning language models to human expectations and behaviors, potentially leading to more accurate and natural language generation outputs. Further research is needed to fully understand the implications of using HALO versus non-HALO loss functions in language model alignment.",
 "gold_answer_marketing": "HALO loss functions implicitly model human biases, such as loss aversion, while non-HALO loss functions do not. These biases can impact language model alignment by influencing the generation of text outputs that better match human perceptions of language quality."
 },
 78: {
 "question": "What implications does the lemma from Rafailov et al. (2023) have on the relationship between optimal policies and reward functions within the same equivalence class?",
 "gold_answer_research": "The lemma from Rafailov et al. (2023) implies that reward functions within the same equivalence class will induce the same optimal policy under the constrained RL problem. This means that even though the reward functions may differ by an input-specific component, they will lead to the same decision-making strategy. This has important implications for understanding the relationship between reward functions and optimal policies, showing that certain variations in the reward functions do not impact the final decision-making process. Understanding this relationship can help in designing more efficient and effective RL algorithms.",
 "gold_answer_marketing": "The lemma from Rafailov et al. (2023) implies that optimal policies and reward functions within the same equivalence class induce the same optimal policy under the constrained RL problem."
 },
 80: {
 "question": "What is the significance of the term open-domain in the context of question answering?",
 "gold_answer_research": "Answer: The term 'open-domain' in the context of question answering refers to the lack of specific context or relevant information provided to the model when answering a factual question. This means that the model must rely solely on the question itself without additional articles or background knowledge. In open-domain question answering, the model's task is to produce answers to factoid questions based solely on the input question without any accompanying context or information. Additionally, open-domain question answering systems typically aim to provide objective answers to factual questions, making it easier to evaluate the model's performance.",
 "gold_answer_marketing": "Answer: The term 'open-domain' in question answering refers to the lack of specific context or relevant information provided for any arbitrarily asked factual question. The model must generate answers solely based on the question itself."
 },
 81: {
 "question": "How do various models handle the retrieval of relevant context for question answering from external knowledge sources?",
 "gold_answer_research": "Various models handle the retrieval of relevant context for question answering from external knowledge sources in different ways. Some models, like DenSPI, encode all text in the knowledge corpus at the phrase level and rely on a retriever to identify the most relevant phrase as the predicted answer. Others, like ORQA, jointly learn a retriever and reader QA model to optimize correct answers without needing ground-truth context passages during training. Additionally, methods like Self-Ask and IRCoT combine iterative prompting and queries to external sources to construct the thought process iteratively and retrieve relevant content. The effectiveness of these models depends on the retrieval quality and the integration of retrieved content into the prompt.",
 "gold_answer_marketing": "Various models handle the retrieval of relevant context for question answering from external knowledge sources by using retrieval models to surface context based on relevance, recency, and importance. The dense representations of a question and context passage are extracted using language models, and the dot-product of these representations is used as the retrieval score to rank and select the most relevant passages. Additionally, some models use bi-directional LSTM and reader models to extract answers from context documents."
 },
 82: {
 "question": "What are the differences between open-book and closed-book question answering?",
 "gold_answer_research": "In open-book question answering, the retriever-generator approach involves two stages where the second stage is to generate free text directly to answer the question. This is also known as generative question answering. On the other hand, closed-book question answering involves the use of generative language models that are pre-trained on a large collection of textual data and can answer questions without explicit context, similar to a closed-book exam. These models produce free text responses to questions without requiring explicit reading comprehension. Additionally, swapping the question with the evidence in closed-book QA demonstrations has been found to consistently yield lower results across all datasets.",
 "gold_answer_marketing": "Open-book question answering involves generating free text directly to answer a question, while closed-book question answering uses pre-trained language models to produce free text responses without explicit context."
 },
 84: {
 "question": "What are some of the concerns related to fine-tuning QA models with common datasets?",
 "gold_answer_research": "One concern related to fine-tuning QA models with common datasets is the significant overlap between questions in the train and test sets of public QA datasets. This overlap can potentially lead to biased performance evaluations and limit the generalizability of the model. Additionally, using datasets that primarily focus on tasks like classification and question answering may not fully capture the diverse range of tasks that language models are used for in practice. Furthermore, efforts to modify language models to mitigate harms, such as reducing toxicity, can inadvertently impact their ability to model text from under-represented groups.",
 "gold_answer_marketing": "One concern of fine-tuning QA models with common datasets is the significant overlap between questions in the training and test sets, which can impact the model's performance."
 },
 85: {
 "question": "How does the performance of question answering models tend to vary with the size of the model?",
 "gold_answer_research": "The performance of question answering models tends to improve with the size of the model, with larger models like T5 with 11B parameters outperforming smaller models like DPR with 3 BERT-base models. However, increasing the size of the model may negatively affect its performance in certain cases. Research has shown that larger models have the capability to memorize answers seen during training and perform well on novel questions at test time, but dataset suitability is also important for optimal performance. Additionally, extensive analysis of chatbot performance using both human raters and AI evaluation has been conducted to determine the effectiveness of different models.",
 "gold_answer_marketing": "The performance of question answering models tends to improve with larger model sizes."
 },
 86: {
 "question": "What are the advantages and limitations of using generative language models for closed-book question answering?",
 "gold_answer_research": "The advantages of using generative language models for closed-book question answering include the ability to memorize factual knowledge within parameter weights and produce free text responses without explicit context. Additionally, models like RAG can be fine-tuned on various tasks for better performance. However, limitations may arise from the reliance on supervised learning, which can be costly and time-consuming, and may restrict the use on datasets that are not well-annotated. Furthermore, the performance of generative models may vary based on the model size and the quality of the retrieved relevant context.",
 "gold_answer_marketing": "Advantages: Generative language models can memorize factual knowledge and answer questions without explicit context, similar to a closed-book exam. Limitations: Reliance on supervised learning limits use on datasets that are not well-annotated and can be expensive and time-consuming to train extremely large models."
 },
 87: {
 "question": "What are the main components that complement the central controller in an autonomous agent system?",
 "gold_answer_research": "In an autonomous agent system powered by LLM, the main components that complement the central controller include Planning, Memory, and Task Decomposition. Planning involves breaking down tasks into subgoals and reflecting on past actions for refinement. Memory allows the agent to store and retrieve information for decision-making. Task Decomposition focuses on the reliability of the natural language interface and the use of expert modules for routing inquiries. Together, these components enhance the overall functionality and efficiency of the autonomous agent system.",
 "gold_answer_marketing": "The main components that complement the central controller in an autonomous agent system are Planning and Memory."
 },
 88: {
 "question": "How can subgoals and task decomposition improve the handling of complex tasks in autonomous systems?",
 "gold_answer_research": "Subgoals and task decomposition can improve the handling of complex tasks in autonomous systems by breaking down large tasks into smaller, more manageable subtasks. This enables the agent to efficiently navigate through the various steps required to complete the overall task. By reflecting on past actions and refining their approach, the agent can learn from mistakes and continuously improve the quality of their final results. This iterative process of breaking down tasks and refining actions ultimately leads to enhanced safety and reliability in autonomous systems.",
 "gold_answer_marketing": "Subgoals and task decomposition can break down large tasks into smaller, manageable steps, enabling efficient handling of complex tasks in autonomous systems. This allows the agent to plan ahead, reflect on past actions, learn from mistakes, and refine strategies for future steps, ultimately improving the quality of final results."
 },
 89: {
 "question": "What types of memory are leveraged in autonomous agents, and how do they differ?",
 "gold_answer_research": "In autonomous agents, two main types of memory are leveraged: short-term memory and long-term memory. Short-term memory is utilized for in-context learning, while long-term memory allows for the retention and recall of information over extended periods. The agents can interact with other agents and retain past experiences using these memory mechanisms. Short-term memory is essential for immediate learning, while long-term memory enables agents to store and retrieve a vast amount of information over time. These memory systems enable agents to make informed decisions based on past experiences and interactions.",
 "gold_answer_marketing": "Short-term memory and long-term memory are utilized in autonomous agents. Short-term memory is used for in-context learning, while long-term memory allows agents to retain and recall information over extended periods. Long-term memory often leverages external storage for fast retrieval."
 },
 91: {
 "question": "How does planning and reflection contribute to the iterative improvement of autonomous agents?",
 "gold_answer_research": "Planning and reflection play a crucial role in the iterative improvement of autonomous agents by enabling them to break down tasks into manageable subgoals, learn from past actions, and refine their strategies for future steps. Through task decomposition, agents can efficiently handle complex tasks, while self-criticism and self-reflection allow them to identify mistakes and make necessary adjustments, ultimately improving the quality of their final results. By synthesizing memories and guiding future behavior based on past events, the reflection mechanism helps agents make higher-level inferences over time, leading to continuous learning and enhancement in performance.",
 "gold_answer_marketing": "Planning and reflection contribute to the iterative improvement of autonomous agents by allowing them to break down tasks into manageable subgoals, learn from past actions, correct mistakes, and refine their approach for future steps. This process leads to better decision-making, higher-quality results, and overall improvement in performance over time."
 },
 92: {
 "question": "In what ways can an agent's performance be evaluated and refined over time?",
 "gold_answer_research": "An agent's performance can be evaluated and refined over time by continuously reviewing and analyzing actions, constructively self-criticizing behavior, reflecting on past decisions and strategies, and aiming to complete tasks in the least number of steps. Additionally, the agent can evaluate the results of API calls and refine inputs if necessary, as well as break down large tasks into smaller subgoals for more efficient handling. Self-reflection, learning from mistakes, and refining actions for future steps can also improve the quality of final results.",
 "gold_answer_marketing": "The speaker of the dialogue is 'agent'. An agent's performance can be evaluated and refined over time by continuously reviewing and analyzing actions, constructively self-criticizing big-picture behavior, reflecting on past decisions and strategies, being smart and efficient in completing tasks, breaking down tasks into smaller subgoals, self-reflection and refinement of actions, learning from mistakes, calling external APIs for additional information, and using memory, planning, and reflection mechanisms to improve behavior based on past experiences."
 },
 93: {
 "question": "What challenges are faced in enhancing the long-term planning capabilities of autonomous agents?",
 "gold_answer_research": "Enhancing long-term planning capabilities of autonomous agents faces challenges such as effectively exploring the solution space over a lengthy history and adjusting plans when unexpected errors occur. LLMs struggle to adapt plans in response to errors, making them less robust compared to human learning from trial and error. To address this, agents need to improve task decomposition by breaking down large tasks into manageable subgoals and incorporate reflection mechanisms for self-criticism and learning from past actions. Additionally, the integration of external classical planners like PDDL can aid in long-horizon planning for complex tasks.",
 "gold_answer_marketing": "Challenges in long-term planning for autonomous agents include difficulty in task decomposition, adjusting plans when faced with unexpected errors, and lack of robustness compared to human learning from trial and error."
 },
 94: {
 "question": "How does the use of tools and external resources extend the capabilities of large language models in practical applications?",
 "gold_answer_research": "The use of tools and external resources extends the capabilities of large language models by fine-tuning them to leverage external tool APIs. This process involves expanding the dataset based on the potential improvement in model outputs from newly added API call annotations. By incorporating external resources, such as APIs, large language models can enhance their performance and adaptability for a wide range of practical applications, including speech recognition, machine translation, and natural language generation. This approach helps create more efficient and high-performing language models that can effectively handle diverse tasks and scenarios.",
 "gold_answer_marketing": "The use of tools and external resources allows large language models to access additional information and functions, improving their performance in various tasks such as speech recognition, machine translation, and natural language generation. By fine-tuning the models to utilize external APIs, the dataset is expanded and enriched, leading to more accurate and efficient model outputs. This approach enhances the adaptability and effectiveness of large language models in practical applications."
 },
 95: {
 "question": "What are the primary techniques involved in steering the behavior of language models without modifying their underlying architectures?",
 "gold_answer_research": "The primary techniques involved in steering the behavior of language models without modifying their underlying architectures include using language models to generate toxic outputs as part of a data augmentation pipeline, making models refuse certain user instructions, and combining Reinforcement Learning from Human Feedback (RLHF) with architectures that heavily use self-attention. Additionally, techniques such as supervised fine-tuning (SFT) and few-shot learning, or prompt engineering, are utilized to achieve precise control over the behavior of language models. These techniques focus on aligning language models with human intentions through iterative improvements and alignment research.",
 "gold_answer_marketing": "The primary techniques involved in steering the behavior of language models without modifying their underlying architectures include using language models to generate toxic outputs as part of a data augmentation pipeline, making models refuse certain user instructions, and combining reinforcement learning with architecture that heavily uses self-attention."
 },
 96: {
 "question": "How can the alignment and steerability of language models be influenced through specific engineering methods?",
 "gold_answer_research": "Specific engineering methods that can influence the alignment and steerability of language models include fine-tuning data selection, alignment method choice, and incorporating pretraining data into reinforcement learning from human feedback (RLHF) processes. Additionally, designing interfaces for human labelers to provide feedback to language models can play a crucial role in improving alignment. It is also important to consider factors like model response editing, generating critiques, and using diverse feedback mechanisms to enhance alignment and steerability in language models.",
 "gold_answer_marketing": "Promote alignment and steerability of language models through specific engineering methods like prompt engineering, fine-tuning data, and incorporating pretraining data into reinforcement learning from human feedback (RLHF)."
 },
 101: {
 "question": "What innovative approaches are being explored to enhance language model's reasoning capabilities, specifically for complex reasoning tasks?",
 "gold_answer_research": "Some innovative approaches being explored to enhance language model's reasoning capabilities for complex tasks include ReAct by Yao et al. (2023), which synergizes reasoning and acting in language models, Complexity-based prompting for multi-step reasoning by Fu et al. (2022), Rationale-augmented ensembles in language models by Wang et al. (2022), and Automatic chain of thought prompting in large language models by Zhang et al. (2022). These approaches focus on improving the model's ability to handle complex reasoning tasks by incorporating advanced techniques such as prompt-based learning, rationale augmentation, and automatic chaining of thoughts. Additionally, models like PAL and PoT offload complex computation and reasoning tasks by generating programming language statements to resolve natural language problems, enhancing the efficiency and performance of the language model.",
 "gold_answer_marketing": "Researchers are exploring approaches like ReAct, complexity-based prompting, rationale-augmented ensembles, and automatic chain of thought prompting to enhance language model's reasoning capabilities for complex tasks."
 },
 102: {
 "question": "How can external tools and APIs be integrated with language models to extend their capabilities and applications?",
 "gold_answer_research": "External tools and APIs can be integrated with language models by fine-tuning the models to learn how to use these external resources effectively. This involves expanding the dataset to determine if adding new API call annotations can enhance the model's output quality. By calling external APIs for additional information that may be missing from the model weights, such as current information or access to proprietary sources, the language model can be prompted further to generate more accurate and insightful responses. Additionally, incorporating tool use capabilities, like those seen in ChatGPT Plugins and OpenAI API function calling, can provide practical examples of how language models can benefit from external resources.",
 "gold_answer_marketing": "External tools and APIs can be integrated with language models through techniques like fine-tuning the models to use the APIs for extra information, accessing external data sources, executing code, and enhancing the model's capabilities for tasks like speech recognition, machine translation, and natural language generation. This integration helps extend the applications of language models by providing access to additional resources and functionalities."
 },
 103: {
 "question": "What kind of learning challenges does the attention mechanism address in neural machine translation?",
 "gold_answer_research": "The attention mechanism in neural machine translation addresses the challenge of memorizing long source sentences by creating shortcuts between the context vector and the entire source input. This allows for more efficient learning of dependencies between source and target sequences, regardless of the in-between distance. Additionally, the attention mechanism enables the model to learn correlations between current words and previous parts of the sentence, improving the overall translation quality. By adapting different types of attention weight matrices, the model can further enhance its learning capabilities.",
 "gold_answer_marketing": "The attention mechanism helps address the challenge of memorizing long source sentences and creating shortcuts between context vectors and entire source inputs in neural machine translation."
 },
 104: {
 "question": "How is the encoder-decoder architecture in seq2seq models affected by long input sequences?",
 "gold_answer_research": "In seq2seq models, the encoder-decoder architecture can be affected by long input sequences. This can lead to challenges in attending to the relevant parts of the input during decoding, potentially resulting in erroneous generation. Additionally, the design of the decoding strategy itself, such as using top-k sampling for improved generation diversity, can also contribute to issues like hallucinations. Therefore, careful consideration of handling long input sequences and refining decoding strategies is crucial for effective seq2seq model performance.",
 "gold_answer_marketing": "The encoder-decoder architecture in seq2seq models may struggle with long input sequences, leading to difficulties in generating accurate target sequences."
 },
 105: {
 "question": "What are the differences between soft and hard attention in the context of image caption generation?",
 "gold_answer_research": "Soft attention in image caption generation allows the model to distribute its focus over all patches in the source image, similar to the attention mechanism proposed by Bahdanau et al. in 2015. On the other hand, hard attention restricts the model to only attend to a specific patch of the image when generating a word. This distinction between soft and hard attention impacts how the model learns to align image features with generated words, influencing the quality and interpretability of the generated captions. Additionally, the choice between soft and hard attention can affect the model's ability to capture global dependencies within the image.",
 "gold_answer_marketing": "Soft attention in image caption generation refers to alignment weights that are learned and placed 'softly' over all patches in the source image. This allows the model to focus on different regions of the image simultaneously. On the other hand, hard attention involves selecting a single aligned position for the current target word, making it more like a spotlight that focuses on specific areas of the image at a time."
 },
 106: {
 "question": "Can you describe the multi-head self-attention mechanism in the transformer model?",
 "gold_answer_research": "In the transformer model, the multi-head self-attention mechanism runs the scaled dot-product attention multiple times in parallel instead of just once. This allows the model to jointly attend to information from different representation subspaces at different positions. The independent attention outputs are then concatenated and linearly transformed to the expected dimensions. This approach enhances the model's ability to capture relationships and dependencies across different parts of the input sequence, leading to improved performance in tasks such as machine reading and summarization.",
 "gold_answer_marketing": "The multi-head self-attention mechanism in the transformer model runs scaled dot-product attention multiple times in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions. The independent attention outputs are concatenated and linearly transformed into the expected dimensions."
 },
 107: {
 "question": "In what ways does SNAIL address the issue of positioning in the transformer model?",
 "gold_answer_research": "SNAIL addresses the issue of weakly incorporating sequential order in the transformer model by using self-attention to enhance performance. It formulates relative position using vector production and encodes absolute position information through a rotation matrix. Additionally, SNAIL mathematically illustrates the advantageous properties of this method when applied to the Transformer. This approach improves the model's ability to handle positional dependencies, particularly important for tasks like reinforcement learning.",
 "gold_answer_marketing": "SNAIL addresses the issue of positioning in the transformer model by using self-attention to enhance the performance and incorporating relative position naturally through vector production and absolute position information through a rotation matrix."
 },
 108: {
 "question": "How does the Pointer Network differ from standard seq2seq models in dealing with sequential data?",
 "gold_answer_research": "The Pointer Network differs from standard seq2seq models in that it is specifically designed to handle problems where the output elements correspond to positions in an input sequence. Instead of using attention to blend hidden units like in seq2seq models, the Pointer Network directly points to the elements in the input sequence. This allows for greater flexibility in determining the output elements, which is particularly useful in tasks like sorting or the traveling salesman problem where the output categories are not predetermined. The architecture of a Pointer Network model includes an encoder-decoder setup with a focus on predicting or inferring specific elements in the input sequence based on their importance weights.",
 "gold_answer_marketing": "The Pointer Network differs from standard seq2seq models by being able to handle problems where the output elements correspond to positions in an input sequence, rather than using attention to blend hidden units."
 },
 110: {
 "question": "How does Neural Turing Machine (NTM) simulate the infinite memory characteristic of Turing machines?",
 "gold_answer_research": "Neural Turing Machine (NTM) simulates the infinite memory characteristic of Turing machines by coupling a neural network with external memory storage. The memory in NTM mimics the Turing machine tape, allowing the neural network to control operation heads to read from or write to the tape. However, the memory in NTM is finite, resembling more of a 'Neural von Neumann Machine' due to practical limitations in real modern computers. This limitation is addressed by various works that add memory capabilities to Transformers through recurrence, enhancing the model's ability to handle very long sequences while still maintaining attention mechanisms for efficient memory access.",
 "gold_answer_marketing": "Neural Turing Machine (NTM) uses external memory storage that mimics the Turing machine tape, allowing the neural network to control operation heads to read from or write to the tape. However, the memory in NTM is finite, making it more similar to a 'Neural von Neumann Machine' rather than having infinite memory like a Turing machine."
 }
}

test_questions = {
 4: {"question": "What significant advancements did the transformer architecture introduce in natural language processing?"},
 5: {"question": "How has the public accessibility of GPT models evolved since their inception?"},
 6: {"question": "What benchmark is used to compare the performance of different language models, as mentioned in the text?"},
 10: {"question": "What type of experiments did IBM perform in the 1980s regarding language models?"},
 14: {"question": "How is the evaluation of language models typically conducted?"},
 15: {"question": "Name some benchmarks or data sets used for evaluating language processing systems."},
 21: {"question": "For what purpose do language models like Claude undergo fine-tuning after their initial pre-training?"},
 26: {"question": "How does Chinchilla's training approach differ with respect to the number of training tokens compared to the model size?"},
 29: {"question": "What is the role of optimizers in the training of Chinchilla and how does it differ from its predecessor?"},
 31: {"question": "In what context is LaMDA mentioned in relation to the Gopher and Chinchilla models?"},
 32: {"question": "How does the DPO method influence the generation of language completions in terms of preferred and dispreferred outcomes?"},
 37: {"question": "What are the key advantages of InstructGPT models over GPT-3 models according to the findings in the research?"},
 40: {"question": "When evaluating summaries or chatbot responses, what criteria should be considered to determine their effectiveness and helpfulness?"},
 42: {"question": "What are the potential limitations and drawbacks of large language models that have not been fine-tuned with human feedback?"},
 49: {"question": "What methods are used to evaluate and measure bias and toxicity in the provided datasets?"},
 56: {"question": "What are the benchmarks used to evaluate the performance of the Deep Policy Optimization (DPO) method compared to other preference learning algorithms in the document provided?"},
 57: {"question": "What innovations does QLORA introduce to enhance the efficiency of language model fine-tuning, and what are the benefits achieved by these innovations?"},
 58: {"question": "How does the use of 4-bit QLORA with NF4 data type compare to 16-bit LoRA finetuning in terms of performance on benchmark datasets?"},
 66: {"question": "What approaches have been proposed to enhance the effectiveness of fine-tuning in language models, according to recent research?"},
 68: {"question": "What performance improvement can be observed when using fine-tuning over few-shot learning according to the study's results outlined in Table 8?"},
 71: {"question": "What metric is used to evaluate the effectiveness of a language model in the context of processing long sequences of text, as reported in the given document?"},
 72: {"question": "How does Position Interpolation contribute to the adaptability of language models for processing texts with varying lengths?"},
 77: {"question": "What are some of the research areas and methodologies referenced in the literature cited within the field of artificial intelligence and machine learning?"},
 79: {"question": "What are the common approaches for building open-domain question answering systems?"},
 83: {"question": "In what ways do neural networks, particularly Transformer-based models, contribute to question answering tasks?"},
 90: {"question": "What potential benefits does integrating external APIs provide for autonomous agents?"},
 98: {"question": "Can you elaborate on some biases that may affect the performance of language models during tasks and the strategies employed to mitigate these biases?"},
 99: {"question": "What is the significance of example diversity and order during language model prompting, and what are some methods to optimize these factors?"},
 100: {"question": "In the context of task instruction, how can the communication cost be reduced while maintaining effective interaction with language models?"},
 109: {"question": "What actor plays the role of Thanos in the Marvel Universe and the role of Cable in Deadpool 2?"}
}

### 3.3 Running the RAG System

Let's have a quick look at the validation and test data:

Let's now use the data to ask questions against it. So we need to define our prompt templates, the RAG Chain, etc.

We have two types of User Personas we need to support:

1. The engineers, who require pretty detailed information when they ask questions 
2. The marketing team and supporting staff who also will ask questions around GenAI in order to better understand the products and the field as a whole, but a lot more high level answers would likely be in order

**Below, please build your RAG pipeline including the relevant prompts. This is free form so you will need to create your own cells, text documentation as you need, etc.**

## Class Config

In [67]:
from dataclasses import dataclass
from typing import Optional

@dataclass
class VectorStoreConfig:
 MODEL_ABBREV = {
 'all-mpnet-base-v2':'mpnet',
 'all-MiniLM-L12-v2':'minilm',
 'multi-qa-mpnet-base-dot-v1': 'multiqampnet',
 'all-distilroberta-v1':'distilroberta',
 'multi-qa-distilbert-cos-v1': 'multiqadistilbert'
 }

 embedding_model: str = "multi-qa-mpnet-base-dot-v1"
 splitter_type: str = "text"
 chunk_size: Optional[int] = None
 chunk_overlap: Optional[int] = None

 def get_id(self):
 abbrev = VectorStoreConfig.MODEL_ABBREV[self.embedding_model]
 parts = [abbrev, self.splitter_type]
 if self.chunk_size is not None:
 parts.append(str(self.chunk_size))
 if self.chunk_overlap is not None:
 parts.append(str(self.chunk_overlap))
 return '_'.join(parts)

In [68]:
config = VectorStoreConfig(
 embedding_model="all-distilroberta-v1",
 splitter_type="text",
 chunk_size=1200,
 chunk_overlap=200
)

## Step 1: Embedding Model

In [69]:
# Step 1: Embedding Model
# The embedding models you can use are: 'all-mpnet-base-v2', 'all-MiniLM-L12-v2', 'multi-qa-mpnet-base-dot-v1', 'all-distilroberta-v1', and 'multi-qa-distilbert-cos-v1 '
%%capture
base_embeddings = HuggingFaceEmbeddings(model_name=config.embedding_model)

## Step 2: Build Vectorstore

### 2.1: Archive X

In [70]:
# Step 2: Load Chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=config.chunk_size, chunk_overlap=config.chunk_overlap)

In [71]:
# 2.1: Archive X
#assign a unique number to each document we ingest
global_doc_number = 1

arxiv_numbers = ('2005.11401', '2104.07567', '2104.09864', '2105.03011', '2106.09685', '2203.15556', '2203.02155', '2211.09260', '2211.12561',
 '2212.09741', '2305.14314', '2305.18290', '2306.15595', '2309.08872', '2309.15217', '2310.06825', '2310.11511',
 '2311.08377', '2312.05708', '2401.06532', '2401.17268', '2402.01306', '2402.19473', '2406.04744',
 '2312.10997', '2410.12812', '2410.15944', '2404.00657',
 )

all_arxiv_pages = []

#loop through the papers
for identifier in arxiv_numbers:
 # Construct URL using the arXiv unique identifier
 arx_url = f"https://arxiv.org/pdf/{identifier}.pdf"

 # Extract pages from the document and add them to the list of pages
 arx_loader = PyMuPDFLoader(arx_url)
 arx_pages = arx_loader.load()
 for page_num in range(len(arx_pages)):
 page = arx_pages[page_num]
 #CHANGED
 page.metadata['page_num'] = page_num
 page.metadata['doc_num'] = global_doc_number
 page.metadata['doc_source'] = "ArXiv"
 all_arxiv_pages.append(page)

 global_doc_number += 1

# Document chunking
#index doc chunks
splits = text_splitter.split_documents(all_arxiv_pages)
for idx, text in enumerate(splits):
 splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(splits))

num_pages = len(all_arxiv_pages)
num_docs = global_doc_number - 1

print(f"{num_docs} documents in total")
print(f"{num_pages} pages in total")

Number of splits/chunks: 2291
28 documents in total
598 pages in total


In [72]:
qdrant_vectorstore = Qdrant.from_documents(splits,
 base_embeddings,
 location=":memory:", # Local mode with in-memory storage only
 collection_name="rag_tech_db",
 force_recreate=True
)

retriever = qdrant_vectorstore.as_retriever(search_kwargs={"k": 8})

## 2.2: Wikipedia

In [73]:
wiki_docs = WikipediaLoader(query="Generative Artificial Intelligence", load_max_docs=4).load()
for idx, text in enumerate(wiki_docs):
 wiki_docs[idx].metadata['doc_num'] = global_doc_number
 wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

 global_doc_number += 1

print('Number of documents: ', len(wiki_docs))

#index docs
wiki_splits = text_splitter.split_documents(wiki_docs)
for idx, text in enumerate(wiki_splits):
 wiki_splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(wiki_splits))

qdrant_vectorstore.add_documents(documents=wiki_splits)

wiki_docs = WikipediaLoader(query="Information Retrieval", load_max_docs=4).load()
for idx, text in enumerate(wiki_docs):
 wiki_docs[idx].metadata['doc_num'] = global_doc_number
 wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

 global_doc_number += 1

print('Number of documents: ', len(wiki_docs))

#index docs
wiki_splits = text_splitter.split_documents(wiki_docs)
for idx, text in enumerate(wiki_splits):
 wiki_splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(wiki_splits))

qdrant_vectorstore.add_documents(documents=wiki_splits)

wiki_docs = WikipediaLoader(query="Large Language Models", load_max_docs=4).load()
for idx, text in enumerate(wiki_docs):
 wiki_docs[idx].metadata['doc_num'] = global_doc_number
 wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

 global_doc_number += 1

print('Number of documents: ', len(wiki_docs))

#index docs
wiki_splits = text_splitter.split_documents(wiki_docs)
for idx, text in enumerate(wiki_splits):
 wiki_splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(wiki_splits))

qdrant_vectorstore.add_documents(documents=wiki_splits)

wiki_docs = WikipediaLoader(query="Retrieval Augmented Generation", load_max_docs=4).load()
for idx, text in enumerate(wiki_docs):
 wiki_docs[idx].metadata['doc_num'] = global_doc_number
 wiki_docs[idx].metadata['doc_source'] = "Wikipedia"

 global_doc_number += 1

print('Number of documents: ', len(wiki_docs))

#index docs
wiki_splits = text_splitter.split_documents(wiki_docs)
for idx, text in enumerate(wiki_splits):
 wiki_splits[idx].metadata['split_id'] = idx

print('Number of splits/chunks: ', len(wiki_splits))

qdrant_vectorstore.add_documents(documents=wiki_splits)

Number of documents: 4
Number of splits/chunks: 21
Number of documents: 4
Number of splits/chunks: 22
Number of documents: 4
Number of splits/chunks: 17
Number of documents: 4
Number of splits/chunks: 19


['2c3c71f09f3246f2b4fe5ce1a36bbfb0',
 '3c3acbdb84394b0197065472165bf37b',
 '533f17692aa94e2eb4978fd7d88a892a',
 'f39c4321f0ec4e309ae3a74e2d8da4d2',
 '15280c0e015f4cb49c32c15170529060',
 'a099cc724dbc4be7a31af76fa2a1b7d8',
 'fcb79298d20b442dbe135640c66f4623',
 '141a849406bf472893414249016b6881',
 '288fbb014c3449cf8b85cd16444de5ef',
 'e373aaf0908c400ca902cde7be581334',
 'ddba3f9de272410eb88e0ef084f575b9',
 '427455834255424d923095f3e5826391',
 '17d9f6abe91d47b994eb4ffc30242432',
 'a9cbfd0bbd5f4d82979b6e0748b36f2e',
 'cdc3649f2a97489aa81a40f9a20188e0',
 'f6b92690af274475b660fa6b7a00dda1',
 '9bb52f2527aa43ce8bffdf03189e7cff',
 'cbf10e84defb4905a712c975967a4153',
 '3a182a2840814c11a9036360c0da499a']

## 2.3: Blogs

In [74]:
web_loader = WebBaseLoader(
 web_paths=("https://lilianweng.github.io/posts/2020-10-29-odqa/",
 "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
 "https://lilianweng.github.io/posts/2018-06-24-attention/",
 "https://lilianweng.github.io/posts/2023-06-23-agent/",
 "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
 "https://lilianweng.github.io/posts/2024-07-07-hallucination/"),

 bs_kwargs=dict(
 parse_only=bs4.SoupStrainer(
 class_=("post-content", "post-title", "post-header")
 )
 ),
)

web_documents = web_loader.load()

for idx, text in enumerate(web_documents):
 web_documents[idx].metadata['doc_num'] = global_doc_number
 web_documents[idx].metadata['doc_source'] = "WWW"
 global_doc_number += 1

print('Number of documents: ', len(web_documents))

web_splits = text_splitter.split_documents(web_documents)

for idx, text in enumerate(web_splits):
 web_splits[idx].metadata['split_id'] = idx

print('Number of splits: ', len(web_splits))

qdrant_vectorstore.add_documents(documents=web_splits)

Number of documents: 6
Number of splits: 283


['76ebbe1b8c2742e79f1611b247fa70d2',
 'd2cfd072a0214acf8e1e4e015ef5bc9b',
 '8e74141c52814e418d4b51f5e2495155',
 '233c937357f14c74996103f8e7de38ab',
 'e70475372df6454fb2e930ea9067847a',
 '365302d984974544820dfe58382e0810',
 'dd2ddb6009dc4936b53a63fb1511c9b7',
 '4dd0453f41b340fdb541afd8a04556dc',
 'bf4c06e09ee542079a3e52ae1a04a0e0',
 '5500345c7b254133935cbd035cc320fc',
 '58f4442e9a4d4e348123d051e59a8e97',
 '11fd08aa8d1b4735bfcbe80e271d8e54',
 'd3d0e7ba611e402397f9ef52a48305f0',
 '5a78b9f4c32640efb6d50986d8466c48',
 '0cf804251b264a92817ae90786df0ebd',
 '5bbb3b92cda64227af307a912de112d0',
 'bc642885b87a41c08be8eaf1bac99443',
 '3256c953a82d49fb9ed536a72bd506e1',
 'bac2b887ff444b47973ac53e6a46a750',
 '6a3491bfec654805b75447c3933aa1e8',
 '64fa0a67bd02417a8fefa01af663f5c6',
 '169dab0a802f41caaf9158bf29bddbca',
 'a9017b55105c49d89d58f9d644a294d3',
 '5f6e98141b924da9a9ac2be497378d4b',
 '63513d3019eb4a459236710fa30ad9d3',
 '321bfd0681eb4b9d9e569c1e0bcb76c7',
 '458256c652524233a0a4f382dec04d18',
 

## Step 3: LLM

### 3.1: Mistral

In [75]:
#Quantization config

quantization_config = BitsAndBytesConfig(
 load_in_4bit=True,
 bnb_4bit_quant_type="nf4",
 bnb_4bit_use_double_quant=True,
 bnb_4bit_compute_dtype=torch.bfloat16
)

# RC - Modified to store and download
# Download and save to Google Drive (If running from Colab)

model_save_path = "/content/drive/MyDrive/models/mistral-7b-instruct"

# Check if already saved
if os.path.exists(model_save_path):
 print("Loading model from Google Drive...")
 model_source = model_save_path
else:
 print("Downloading model from Hugging Face (this will take ~15 min)...")
 model_source = "mistralai/Mistral-7B-Instruct-v0.3"

# Load from appropriate source
llm_mistral_model = AutoModelForCausalLM.from_pretrained(
 model_source,
 dtype=torch.float32,
 device_map='cuda',
 quantization_config=quantization_config
)

llm_mistral_tokenizer = AutoTokenizer.from_pretrained(model_source)

# If we downloaded, save it for next time
if model_source != model_save_path:
 print("Saving model to Google Drive for future use...")
 llm_mistral_model.save_pretrained(model_save_path)
 llm_mistral_tokenizer.save_pretrained(model_save_path)
 print("Saved!")

llm_mistral_tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

mistral_pipe = pipeline(
 "text-generation",
 model=llm_mistral_model,
 tokenizer=llm_mistral_tokenizer,
 max_new_tokens=1000,
 temperature=0.6,
 top_p=0.95,
 do_sample=True,
 repetition_penalty=1.2
)
mistral_pipe.model.config.pad_token_id = mistral_pipe.model.config.eos_token_id

mistral_llm_lc = HuggingFacePipeline(pipeline=mistral_pipe)

Loading model from Google Drive...




tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model: 0%| | 0.00/587k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0%| | 0.00/414 [00:00<?, ?B/s]

Device set to use cuda


### 3.2: Cohere

In [76]:
cohere_chat_model = ChatCohere(cohere_api_key=COHERE_API_KEY)

## 4: Retrieval Chain

### 4.1: Prompt

In [77]:
# persona = 'marketing'
# engineering_template = """[INST]You are a technical AI assistant helping engineers understand complex AI and NLP concepts.

# Based on the context below, provide a detailed, technical answer to the question. Include specific details, technical terms, and explanations.

# Context:
# {context}

# Question: {question}

# Provide a comprehensive technical answer:[/INST]"""

# marketing_template = """[INST]You are an AI assistant helping marketing professionals understand AI concepts for business communication.

# Based on the context below, provide a clear, high-level answer to the question. Keep it accessible and focus on practical applications.

# Context:
# {context}

# Question: {question}

# Provide a clear, business-focused answer:[/INST]"""

# # Select template based on persona
# template = engineering_template if persona == "engineering" else marketing_template
# rag_prompt = ChatPromptTemplate.from_template(template)

# print(f"Using {persona.upper()} persona prompt")

In [78]:
# Prompt version 1
# persona = 'marketing'
# engineering_template = """[INST]You are a technical AI assistant helping engineers understand complex AI and NLP concepts.

# Based on the context below, provide a detailed, technical answer to the question. Include specific details, technical terms, and explanations.

# Context:
# {context}

# Question: {question}

# Provide a comprehensive technical answer:[/INST]"""

# marketing_template = """[INST]You are an AI assistant helping marketing professionals understand AI concepts for business communication.

# Based on the context below, provide a clear, high-level answer to the question. Keep it accessible and focus on practical applications.

# Context:
# {context}

# Question: {question}

# Provide a clear, business-focused answer:[/INST]"""

# engineering_rag_prompt = ChatPromptTemplate.from_template(engineering_template)
# marketing_rag_prompt = ChatPromptTemplate.from_template(marketing_template)

In [79]:
#Prompt version 2
persona = 'marketing'

engineering_template = """[INST]You are a technical AI assistant helping engineers understand AI and NLP concepts.

Using ONLY the context provided below, answer the question with technical precision. Your answer must:
- Be 4-5 sentences (approximately 500 characters)
- Use precise technical terminology
- Focus on implementation details and architectures
- Stay grounded in the provided context

If the context doesn't contain sufficient information to answer the question, state what information is missing.

Context:
{context}

Question: {question}

Provide a concise, technical answer (4-5 sentences):[/INST]"""

marketing_template = """[INST]You are an AI assistant helping marketing professionals understand AI concepts for business communication.

Using ONLY the context provided below, answer the question in accessible language. Your answer must:
- Be 2-3 sentences (approximately 250 characters)
- Avoid technical jargon
- Focus on business value and practical applications
- Stay grounded in the provided context

If the context doesn't contain sufficient information to answer the question, state what information is missing.

Context:
{context}

Question: {question}

Provide a brief, business-focused answer (2-3 sentences):[/INST]"""

engineering_rag_prompt = ChatPromptTemplate.from_template(engineering_template)
marketing_rag_prompt = ChatPromptTemplate.from_template(marketing_template)

In [80]:
# #Prompt version 3 with few-shot examples

# persona = 'marketing'

# engineering_template = """[INST]You are a technical AI assistant helping engineers understand AI and NLP concepts.

# Using ONLY the context provided below, answer the question with technical precision. Your answer must:

# - Be 4-5 sentences (approximately 500 characters)
# - Use precise technical terminology
# - Focus on implementation details and architectures
# - Stay grounded in the provided context

# If the context doesn't contain sufficient information to answer the question, state what information is missing.

# Examples of good engineering answers:

# Example 1:
# Question: How can a language model's ability to analyze images expand its range of applications?
# Answer: One way a language model's ability to analyze images can expand its range of applications is by enabling it to perform tasks that require both textual and visual information, such as image captioning or visual question answering. By incorporating image features into the model's input, it can generate more contextually relevant and accurate responses. This integration of image analysis can also enhance the model's performance in tasks like content generation, recommendation systems, and sentiment analysis, where visual cues play a significant role in understanding and interpreting the data. Additionally, combining language and image processing capabilities can lead to more sophisticated and versatile AI systems that can handle a wider range of real-world applications effectively.

# Example 2:
# Question: What is the purpose of using a regularization term (like the epsilon multiplied by the identity matrix) in the linear regression model described in the code?
# Answer: The purpose of using a regularization term (epsilon multiplied by the identity matrix) in the linear regression model described in the code is to prevent overfitting by adding a penalty term to the loss function. This regularization term helps to control the complexity of the model by discouraging large coefficients, leading to a more stable and generalizable solution. By incorporating this regularization, the model can better handle noise and outliers in the data, ultimately improving its performance on unseen data points.

# Example 3:
# Question: How does the Pointer Network differ from standard seq2seq models in dealing with sequential data?
# Answer: The Pointer Network differs from standard seq2seq models in that it is specifically designed to handle problems where the output elements correspond to positions in an input sequence. Instead of using attention to blend hidden units like in seq2seq models, the Pointer Network directly points to the elements in the input sequence. This allows for greater flexibility in determining the output elements, which is particularly useful in tasks like sorting or the traveling salesman problem where the output categories are not predetermined. The architecture of a Pointer Network model includes an encoder-decoder setup with a focus on predicting or inferring specific elements in the input sequence based on their importance weights.

# Context:
# {context}

# Question: {question}

# Provide a concise, technical answer (4-5 sentences):[/INST]"""

# marketing_template = """[INST]You are an AI assistant helping marketing professionals understand AI concepts for business communication.

# Using ONLY the context provided below, answer the question in accessible language. Your answer must:

# - Be 2-3 sentences (approximately 250 characters)
# - Avoid technical jargon
# - Focus on business value and practical applications
# - Stay grounded in the provided context

# If the context doesn't contain sufficient information to answer the question, state what information is missing.

# Examples of good marketing answers:

# Example 1:
# Question: How can a language model's ability to analyze images expand its range of applications?
# Answer: By incorporating image analysis capabilities, a language model can be used for tasks such as image captioning, visual question answering, and text-to-image generation, expanding its range of applications beyond just text-based tasks.

# Example 2:
# Question: What is the purpose of using a regularization term (like the epsilon multiplied by the identity matrix) in the linear regression model described in the code?
# Answer: The purpose of using a regularization term in the linear regression model is to mitigate or resolve catastrophic extrapolation error, making the extrapolated values comparable to those within the original range.

# Example 3:
# Question: How does the Pointer Network differ from standard seq2seq models in dealing with sequential data?
# Answer: The Pointer Network differs from standard seq2seq models by being able to handle problems where the output elements correspond to positions in an input sequence, rather than using attention to blend hidden units.

# Context:
# {context}

# Question: {question}

# Provide a brief, business-focused answer (2-3 sentences):[/INST]"""

# engineering_rag_prompt = ChatPromptTemplate.from_template(engineering_template)
# marketing_rag_prompt = ChatPromptTemplate.from_template(marketing_template)

In [81]:
def format_docs(docs):
 return "\n\n".join(doc.page_content for doc in docs)

output_parser = StrOutputParser()

In [82]:
# Build RAG chain
mistral_marketing_rag_chain = (
 {"context": retriever | format_docs,
 "question": RunnablePassthrough()}
 | marketing_rag_prompt
 | mistral_llm_lc
 | output_parser
)

mistral_engineering_rag_chain = (
 {"context": retriever | format_docs,
 "question": RunnablePassthrough()}
 | engineering_rag_prompt
 | mistral_llm_lc
 | output_parser
)

cohere_marketing_rag_chain = (
 {"context": retriever | format_docs,
 "question": RunnablePassthrough()}
 | marketing_rag_prompt
 | cohere_chat_model
 | output_parser
)

cohere_engineering_rag_chain = (
 {"context": retriever | format_docs,
 "question": RunnablePassthrough()}
 | engineering_rag_prompt
 | cohere_chat_model
 | output_parser
)

chains = {
 'marketing_mistral': mistral_marketing_rag_chain,
 'marketing_cohere': cohere_marketing_rag_chain,
 'engineering_mistral': mistral_engineering_rag_chain,
 'engineering_cohere': cohere_engineering_rag_chain
}

## 4. Tests & Evaluations

Here you should evaluate the results. First, you should implement your evaluation metrics and then you should run evaluation tests. This is really your area, but key results to show are:

1) Your metrics of choice 
2) How your various models compare to the labeled validation data.

Make sure you look at the results for the marketing team and the research team separately.

**Note:** You do not need to run all models against all labeled questions, as that may take some time. Just do that for a few models/configs, and test a larger set with a smaller subset. But if you use a subset you must justify why you are using that specific subset of questions.

**This is free form so you will need to create your own cells, text documentation as you need, etc.**

After you have implemented you evaluation strategy please answer the questions below in sections 4.1 and 4.2.

Please feel free to add more text and code cells as needed.

### RAG Function Setup

In [84]:
from bert_score import score

In [85]:
def ask_rag_question(question_id, chain_type='marketing_mistral', retrieved_contexts=None):
 question = validation_questions_answers[question_id]['question']

 if retrieved_contexts is None:
 retrieved_docs = retriever.invoke(question)
 retrieved_contexts = [doc.page_content for doc in retrieved_docs]

 response = chains[chain_type].invoke(question)

 if 'cohere' in chain_type:
 rag_answer = response.strip()
 else:
 if "[/INST]" in response:
 rag_answer = response.split("[/INST]", 1)[1].strip()
 else:
 rag_answer = response.strip()

 persona = chain_type.split('_')[0]

 if persona == 'marketing':
 gold_answer = validation_questions_answers[question_id]['gold_answer_marketing']
 else:
 gold_answer = validation_questions_answers[question_id]['gold_answer_research']

 return {
 'question_id': question_id,
 'question': question,
 'response': response,
 'rag_answer': rag_answer,
 'gold_answer': gold_answer,
 'retrieved_contexts': retrieved_contexts
 }

### Evaluation with Bertscore

In [86]:
def evaluate_with_bertscore(rag_response, scorer=None):
 """
 Evaluates RAG-generated answer against gold standard using BERTScore
 scorer: Optional pre-loaded BERTScorer instance for better performance
 """
 rag_answer = rag_response['rag_answer']
 gold_answer = rag_response['gold_answer']

 if scorer is not None:
 P, R, F1 = scorer.score([rag_answer], [gold_answer])
 precision, recall, f1 = P.item(), R.item(), F1.item()
 else:
 from bert_score import score
 bert_score = score([rag_answer], [gold_answer], lang='en', rescale_with_baseline=True)
 P, R, F1 = bert_score
 precision, recall, f1 = P.item(), R.item(), F1.item()

 return {'bert_precision':precision, 'bert_recall':recall,'bert_f1':f1}

In [87]:
# Upgrade to latest RAGAS version
!pip install --upgrade ragas

# Also upgrade langchain components
!pip install --upgrade langchain-openai langchain-core

Collecting ragas
 Downloading ragas-0.4.0-py3-none-any.whl.metadata (22 kB)
Collecting appdirs (from ragas)
 Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting diskcache>=5.6.3 (from ragas)
 Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting instructor (from ragas)
 Downloading instructor-1.13.0-py3-none-any.whl.metadata (11 kB)
Collecting scikit-network (from ragas)
 Downloading scikit_network-0.33.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.5 kB)
Collecting langchain_openai (from ragas)
 Downloading langchain_openai-1.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting jiter<1,>=0.10.0 (from openai>=1.0.0->ragas)
 Downloading jiter-0.11.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Collecting pre-commit>=4.3.0 (from instructor->ragas)
 Downloading pre_commit-4.5.0-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting ty>=0.0.1a23 (from instructor->ragas)
 Downloading ty-0.0.1a32-py3

In [88]:
def evaluate_with_ragas(rag_response):
 """
 Evaluates RAG system using RAGAS metrics
 """
 from ragas import EvaluationDataset, evaluate
 from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness
 from ragas.llms import LangchainLLMWrapper
 from langchain_openai import ChatOpenAI

 evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

 test_question = rag_response['question']
 gold_answer = rag_response['gold_answer']
 generated_answer = rag_response['rag_answer']
 retrieved_docs_content = rag_response['retrieved_contexts']

 dataset_dict = {
 "user_input": test_question,
 "retrieved_contexts": retrieved_docs_content,
 "response": generated_answer,
 "reference": gold_answer
 }

 evaluation_dataset = EvaluationDataset.from_list([dataset_dict])

 result = evaluate(
 dataset=evaluation_dataset,
 metrics=[
 LLMContextRecall(),
 Faithfulness(),
 FactualCorrectness()
 ],
 llm=evaluator_llm
 )

 return result

### Combine evaluation

In [89]:
def evaluate_complete(question_id, chain_type='marketing_mistral', retrieved_contexts=None, scorer=None):
 rag_response = ask_rag_question(question_id, chain_type, retrieved_contexts=retrieved_contexts)

 bert_scores = evaluate_with_bertscore(rag_response, scorer=scorer)

 ragas_result = evaluate_with_ragas(rag_response)

 all_scores = {
 'question_id': question_id,
 'chain_type': chain_type,
 'bert_precision': bert_scores['bert_precision'],
 'bert_recall': bert_scores['bert_recall'],
 'bert_f1': bert_scores['bert_f1'],
 'ragas_context_recall': ragas_result['context_recall'],
 'ragas_faithfulness': ragas_result['faithfulness'],
 'ragas_factual_correctness': ragas_result['factual_correctness(mode=f1)']
 }

 return rag_response, all_scores

### Run Testing

In [90]:
from bert_score import BERTScorer

bert_scorer = BERTScorer(lang="en", rescale_with_baseline=True)

tokenizer_config.json: 0%| | 0.00/25.0 [00:00<?, ?B/s]

config.json: 0%| | 0.00/482 [00:00<?, ?B/s]

vocab.json: 0%| | 0.00/899k [00:00<?, ?B/s]

merges.txt: 0%| | 0.00/456k [00:00<?, ?B/s]

tokenizer.json: 0%| | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors: 0%| | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [91]:
# import os

# # Delete the checkpoint file to start fresh
# checkpoint_file = '/content/drive/MyDrive/evaluation_results_checkpoint.pkl'

# if os.path.exists(checkpoint_file):
# os.remove(checkpoint_file)

In [92]:
import pandas as pd
from tqdm import tqdm
import time
import pickle
import os
from bert_score import score
from datasets import Dataset

# Configuration
# configs = ['marketing_mistral', 'marketing_cohere', 'engineering_mistral', 'engineering_cohere']
configs = ['marketing_cohere', 'engineering_cohere']
results_file = '/content/drive/MyDrive/evaluation_results_checkpoint.pkl'

available_question_ids = sorted(validation_questions_answers.keys())
print(f"Total available questions: {len(available_question_ids)}")
print(f"First 20 question IDs: {available_question_ids[:20]}")

question_ids_to_test = available_question_ids[:3]
print(f"Testing with questions: {question_ids_to_test}")

if os.path.exists(results_file):
 with open(results_file, 'rb') as f:
 rag_outputs = pickle.load(f)
 print(f"Loaded {len(rag_outputs)} existing generated responses")
else:
 rag_outputs = []
 print("Starting fresh generation")

completed_keys = {(r['question_id'], r['chain_type']) for r in rag_outputs}
total_needed = len(question_ids_to_test) * len(configs)
print(f"Progress: {len(completed_keys)}/{total_needed} generations complete")

print("PHASE 1: Generating RAG responses...")
for question_id in tqdm(question_ids_to_test, desc="Questions"):

 question = validation_questions_answers[question_id]['question']
 retrieved_docs = retriever.invoke(question)
 retrieved_contexts = [doc.page_content for doc in retrieved_docs]

 doc_sources = []
 for doc in retrieved_docs:
 if hasattr(doc, 'metadata') and 'source' in doc.metadata:
 doc_sources.append(doc.metadata['source'])
 else:
 doc_sources.append('unknown')

 for chain_type in configs:
 if (question_id, chain_type) in completed_keys:
 continue

 try:
 start = time.time()

 rag_response = ask_rag_question(
 question_id,
 chain_type
 )

 entry = {
 'question_id': question_id,
 'chain_type': chain_type,
 'question': rag_response['question'],
 'rag_answer': rag_response['rag_answer'],
 'gold_answer': rag_response['gold_answer'],
 'retrieved_contexts': rag_response['retrieved_contexts'],
 'retrieved_sources': doc_sources
 }
 rag_outputs.append(entry)

 with open(results_file, 'wb') as f:
 pickle.dump(rag_outputs, f)

 elapsed = time.time() - start
 print(f" Generated Q{question_id}-{chain_type} ({elapsed:.1f}s)")

 except Exception as e:
 print(f" Error Q{question_id}-{chain_type}: {e}")

print(f"PHASE 1 Complete: {len(rag_outputs)} responses generated")

print(f"PHASE 2: Batch Evaluation on {len(rag_outputs)} items...")

if rag_outputs and 'bert_f1' not in rag_outputs[0]:
 print(" Running BERTScore in batch mode...")
 start = time.time()

 rag_answers = [x['rag_answer'] for x in rag_outputs]
 gold_answers = [x['gold_answer'] for x in rag_outputs]

 P, R, F1 = score(rag_answers, gold_answers, lang='en', rescale_with_baseline=True)

 for i, entry in enumerate(rag_outputs):
 entry['bert_precision'] = P[i].item()
 entry['bert_recall'] = R[i].item()
 entry['bert_f1'] = F1[i].item()

 elapsed = time.time() - start
 print(f" BERTScore complete ({elapsed:.1f}s for {len(rag_outputs)} items)")

 with open(results_file, 'wb') as f:
 pickle.dump(rag_outputs, f)
else:
 print(" BERTScore already computed, skipping...")

if rag_outputs and 'ragas_faithfulness' not in rag_outputs[0]:
 print(" Running RAGAS in batch mode...")
 start = time.time()

 data_dict = {
 "user_input": [x['question'] for x in rag_outputs],
 "retrieved_contexts": [x['retrieved_contexts'] for x in rag_outputs],
 "response": [x['rag_answer'] for x in rag_outputs],
 "reference": [x['gold_answer'] for x in rag_outputs]
 }

 dataset = Dataset.from_dict(data_dict)

 from ragas import evaluate
 from ragas.metrics import LLMContextRecall, ContextPrecision, Faithfulness, FactualCorrectness, AnswerRelevancy
 from ragas.llms import LangchainLLMWrapper
 from langchain_openai import ChatOpenAI

 evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
 metrics = [LLMContextRecall(), ContextPrecision(), Faithfulness(), FactualCorrectness(), AnswerRelevancy()]

 ragas_results = evaluate(
 dataset=dataset,
 metrics=metrics,
 llm=evaluator_llm
 )

 for i, entry in enumerate(rag_outputs):
 entry['ragas_context_recall'] = ragas_results['context_recall'][i]
 entry['ragas_context_precision'] = ragas_results['context_precision'][i]
 entry['ragas_faithfulness'] = ragas_results['faithfulness'][i]
 entry['ragas_factual_correctness'] = ragas_results['factual_correctness(mode=f1)'][i]
 entry['ragas_answer_relevancy'] = ragas_results['answer_relevancy'][i]

 elapsed = time.time() - start
 print(f" RAGAS complete ({elapsed:.1f}s for {len(rag_outputs)} items)")

 with open(results_file, 'wb') as f:
 pickle.dump(rag_outputs, f)
else:
 print(" RAGAS already computed, skipping...")

results_df = pd.DataFrame(rag_outputs)
print(f"\n{'='*70}")
print(f"EVALUATION COMPLETE!")
print(f"{'='*70}")
print(f"Total items evaluated: {len(results_df)}")
print(f"Questions tested: {sorted(results_df['question_id'].unique())}")
print(f"Configs tested: {sorted(results_df['chain_type'].unique())}")
print(f"Results saved to: {results_file}")
print(f"\nQuick preview:")
print(results_df.groupby('chain_type')[['bert_f1', 'ragas_context_recall',
 'ragas_context_precision',
 'ragas_faithfulness',
 'ragas_factual_correctness',
 'ragas_answer_relevancy']].mean())

print(f"\n{'='*70}")
print("RETRIEVED DOCUMENT SOURCES")
print(f"{'='*70}")
for q_id in question_ids_to_test:
 q_data = results_df[results_df['question_id'] == q_id].iloc[0]
 print(f"\nQ{q_id}: {q_data['question'][:60]}...")
 print(f"Sources: {q_data['retrieved_sources']}")

Total available questions: 78
First 20 question IDs: [0, 1, 2, 3, 7, 8, 9, 11, 12, 13, 16, 17, 18, 19, 20, 22, 23, 24, 25, 27]
Testing with questions: [0, 1, 2]
Loaded 156 existing generated responses
Progress: 156/6 generations complete
PHASE 1: Generating RAG responses...


Questions: 100%|██████████| 3/3 [00:00<00:00, 15.12it/s]

PHASE 1 Complete: 156 responses generated
PHASE 2: Batch Evaluation on 156 items...
 BERTScore already computed, skipping...
 RAGAS already computed, skipping...

EVALUATION COMPLETE!
Total items evaluated: 156
Questions tested: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(12), np.int64(13), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(27), np.int64(28), np.int64(30), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(38), np.int64(39), np.int64(41), np.int64(43), np.int64(44), np.int64(46), np.int64(47), np.int64(48), np.int64(50), np.int64(51), np.int64(52), np.int64(53), np.int64(54), np.int64(55), np.int64(59), np.int64(60), np.int64(61), np.int64(62), np.int64(63), np.int64(64), np.int64(65), np.int64(67), np.int64(69), np.int64(73), np.int64(74), np.int64(75), np.int64(76), np.int64(78), np.int64(80), np.int6




In [93]:
results_df['rag_answer_length'] = results_df['rag_answer'].str.len()
results_df['gold_answer_length'] = results_df['gold_answer'].str.len()

In [94]:
# results_df.to_csv(r'/content/drive/MyDrive/Gen AI/phase_1_results.csv',index=False)

In [95]:
# results_df.to_csv(r'/content/drive/MyDrive/Gen AI/phase_2a_results.csv',index=False)

In [96]:
# results_df.to_csv(r'/content/drive/MyDrive/Gen AI/phase_2b_results.csv',index=False)

In [97]:
# results_df.to_csv(r'/content/drive/MyDrive/Gen AI/phase_2c_results.csv',index=False)

In [98]:
results_df.to_csv(r'/content/drive/MyDrive/Gen AI/phase_2d_results.csv',index=False)

In [100]:
gold_lengths = []

for q_id, q_data in validation_questions_answers.items():
 gold_lengths.append({
 'question_id': q_id,
 'research': len(q_data['gold_answer_research']),
 'marketing': len(q_data['gold_answer_marketing'])
 })

gold_df = pd.DataFrame(gold_lengths)

print("Average Gold Answer Lengths:")
print(f"Research: {gold_df['research'].mean():.0f} chars")
print(f"Marketing: {gold_df['marketing'].mean():.0f} chars")
print(f"Ratio: {(gold_df['research'].mean() / gold_df['marketing'].mean()):.2f}x")

Average Gold Answer Lengths:
Research: 613 chars
Marketing: 253 chars
Ratio: 2.42x


In [101]:
# Review docs associated with question 2
print(validation_questions_answers[2]['question'])

docs = retriever.invoke(validation_questions_answers[2]['question'])
for i, doc in enumerate(docs):
 print(f"\nDoc {i+1} (first 200 chars):")
 print(doc.page_content[:200])

What are some of the architectures used in building artificial neural networks for LLMs?

Doc 1 (first 200 chars):
Memory

Short-term memory: I would consider all the in-context learning (See Prompt Engineering) as utilizing short-term memory of the model to learn.
Long-term memory: This provides the agent with th

Doc 2 (first 200 chars):
Prompt LLM to generate instruction candidates based on a small set of demonstrations in the form of input-output pairs. E.g. {{Given desired input-output pairs}}\n\nThe instruction is.


Given a datas

Doc 3 (first 200 chars):
== Versions ==


=== Initial release ===
The first version of Llama (stylized as LLaMA and sometimes referred to as Llama 1) was announced on February 24, 2023, via a blog post and a paper describing 

Doc 4 (first 200 chars):
Moving beyond n-gram models, researchers started in 2000 to use neural networks to learn language models. Following the breakthrough of deep neural networks in image classification around 2012, simila

Do

### 4.1. Metrics

Now we'd like to hear about your approach to evaluation.

**Demonstration:**
4.1. In 8 to 15 sentences, please define and defend your approach to evaluating your RAG model. Please fill in your answer in the text block below:

The evaluation approach was designed to measure both sides of the RAG pipeline — retrieval quality (BERTScore) and generation quality (RAGAS). Given the strict time constraints of the POC, we developed an evaluation framework that allowed for rapid iteration while still producing reliable, comparable results. To standardize comparisons across experiments, we used a fixed test set of three questions representing different query styles. Although this limited set enabled fast experimentation, bringing in more samples from the provided answers would be greatly beneficial, however time and API cost constraints (especially with Cohere) prevented expanding it during the POC window.

We combined these automated metrics with manual reviews, recognizing that quantitative scores alone often miss nuances such as hallucinations or irrelevant retrieval. Using these automated metrics allowed us to quickly iterate on different configurations, while manual inspection helped identify further room for improvement or gaps. Retrieval was evaluated using context precision and recall to determine whether the right chunks were obtained, while generation was assessed through faithfulness and factual correctness to verify the content was grounded in what was retrieved.

To ensure consistent, controlled comparisons, we changed one variable at a time — prompt design, embedding model, k-value, and few-shot examples—and re-ran the standardized question set for each. This helped us understand the affect of adjustments to our RAG system. Although the resulting evaluation framework was effective for a constrained POC, we acknowledge that future work should expand the test suite, incorporate more diverse question types, and undertake more manual review.

### 4.2. Evaluation Comparisons

Document your key runs here. Feel free to add more text and code cells as needed. Include at least one full run on all 75 examples for your best models/model configuration.

### Phase 1 Evaluation: Baseline Results

**Configuration:**
- **LLMs**: Mistral-7B-Instruct-v0.3, Cohere Command-R
- **Prompts**: Marketing vs Engineering personas
- **Retrieval**: k=4, multi-qa-mpnet-base-dot-v1 embeddings
- **Chunking**: RecursiveCharacterTextSplitter (size=1200, overlap=200)
- **Evaluation Questions**: 3 questions selected to test retrieval across different complexity levels

In [103]:
phase_1_results = pd.read_csv(r'/content/drive/MyDrive/Gen AI/phase_1_results.csv')

Phase 1: Baseline Performance Issues

**Key Findings:**

Overall model performance was poor - both retrieval quality (BERTScore, manual review) and output grounding (RAGAS metrics) underperformed.
Verbosity problem: Outputs 3-12x longer than gold answers.

* marketing_cohere: 1,473 chars vs 379 target (3.9x)
* engineering_cohere: 5,753 chars vs 608 target (9.5x)
* Root cause: Prompts lack explicit length constraints

Q2 retrieval failure: 0% context precision on technical question

The engineering Q2 completely failed to retrieve any relevant information. Instead, it retrieved generic LLM introductions, citation lists, and a threat diagram.

* Root cause: the current embedding model (multi-qa-mpnet-base-dot-v1) does surface keyword matching, not semantic understandin. A closer look into these documents show that they do contain relevant architecture info, it was just not retrieved
* Engineering_cohere reached 59% factual correctness despite this,suggesting the LLM compensated for this despite not having retrieved relevant information.
This is an issue, as we would want our RAG system to verify answers against company documentation

**Model comparison:**

Mistral: More concise, better BERTScore, but lower faithfulness (0.43-0.89)
Cohere: Extremely verbose, but superior faithfulness (0.52-0.79) and factual correctness (59-72%)
Selected Cohere for Phase 2: faithfulness harder to fix than verbosity

**Next steps:**
1. Fix verbosity: Add explicit length constraints to prompts
2. Fix retrieval: Test all-distilroberta-v1 embedding (better semantic understanding)
3. Optimize k-value: Test k=8 after fixing embeddings

### Phase 2A Evaluation:

Changes from Phase 1:
* Added explicit length constraints (2-3 sentences/250 chars for marketing, 4-5 sentences/500 chars for engineering)
* Added strict grounding instruction ("Using ONLY the context provided")
* Removed vague terms like "comprehensive" and "detailed"
* Added instruction to state when information is missing

In [106]:
phase_2a_results = pd.read_csv(r'/content/drive/MyDrive/Gen AI/phase_2a_results.csv')

### Phase 2A Evaluation: Concise Prompts with Strict Grounding
Modification: Added explicit length targets (marketing: ~250 chars, engineering: ~500 chars) and strict grounding instruction ("Using ONLY the context provided").

**Key Findings:**

Verbosity resolved: Prompts reduced length 50-86%. marketing_cohere hit 372 chars vs 300 target.
Grounding vs accuracy trade-off revealed:

* Q0, Q1 (good retrieval): Models gave accurate, grounded answers
* Q2 (failed retrieval): Models acknowledged missing info instead of hallucinating

marketing_cohere: 1.0 faithfulness, 0% factual correctness (admits context insufficient)
engineering_cohere: 0.27 faithfulness, 0% factual correctness (attempted partial answer)

* Q2 still fails: 0% context precision/recall unchanged. Prompt engineering can't fix bad retrieval.

Model differences:

* Mistral: Works with limited context, moderate faithfulness
* Cohere: Strictly follows grounding, explicitly states "information needed is missing"

Conclusion:
Prompt engineering fixed verbosity and improved BERTScore. But Q2 retrieval failure persists—proves we need better embeddings, not better prompts.

### Phase 2B Evaluation:

Changes from Phase 2A:
* Replaced multi-qa-mpnet-base-dot-v1 → all-distilroberta-v1 while keeping Phase 2A concise prompts and k=4.
Rationale: Phase 2A proved prompt engineering can't fix retrieval failures.

In [108]:
phase_2b_results = pd.read_csv(r'/content/drive/MyDrive/Gen AI/phase_2b_results.csv')

Phase 2B: Embedding Model Optimization

Key Findings:

* Q2 retrieval breakthrough: Phase 2A retrieved generic introductions and threat diagrams (0% precision). Phase 2B retrieved actual technical content: "LSTM encoder-decoder," "transformer architecture," "Word2Vec," "seq2seq models" (25% precision). First successful technical retrieval for Q2.
* Q0/Q1 maintained or improved: BERTScore F1 increased +0.01 to +0.10 across conceptual and process questions. engineering_mistral hit 0.55 F1 on Q1 with perfect faithfulness (1.0) and 77% factual correctness—highest performance across all experiments.
* Q2 showed first non-zero scores: Factual correctness went from 0% to 0-25%. BERTScore F1 improved +0.08 to +0.15. Models could finally provide substantive answers instead of acknowledging missing information.
Remaining gap: Q2 context recall still 0%, precision only 25%. Only 1 of 4 chunks highly relevant—coverage incomplete.

Conclusion:
Embedding optimization fixed the root cause from Phase 2A. all-distilroberta-v1 has better semantic understanding of technical terms, turning Q2 from complete failure to partial success. System now produces concise, grounded responses when retrieval works.

### Phase 2C: k-Value Optimization (k=4 vs k=8)

 * Increased retrieval from k=4 → k=8 chunks while keeping Phase 2B config (distilroberta embeddings, concise prompts).

In [110]:
phase_2c_results = pd.read_csv(r'/content/drive/MyDrive/Gen AI/phase_2c_results.csv')

### Phase 2D: Few-Shot Prompting
* Added 3 few-shot examples to both marketing and engineering prompts while keeping Phase 2C config (distilroberta embeddings, k=8, concise instructions).

In [112]:
phase_2d_results = pd.read_csv(r'/content/drive/MyDrive/Gen AI/phase_2d_results.csv')

**Key Findings:**

* BERTScore improved: +2-9% across both personas. Few-shot examples demonstrated desired structure/vocabulary, leading to better word-level overlap with gold answers.
* Faithfulness dropped significantly: Engineering -14%, marketing -8%. Models pattern-matched against examples rather than grounding in retrieved context.
Marketing factual correctness tanked: -35% decline (0.323 → 0.210). Few-shot examples introduced implicit knowledge that led to hallucination when context insufficient—particularly problematic for marketing's shorter response format.

**Critical trade-off identified:**

Phase 2D produces answers that look like gold answers (higher BERTScore) but sacrifice grounding and accuracy (lower faithfulness/correctness). BERTScore gains (+2-9%) don't justify faithfulness/accuracy losses.

**Recommendation:**
Reject Phase 2D. Maintain Phase 2C configuration.
Phase 2C achieved near-perfect faithfulness (0.970) for engineering. Few-shot examples introduced hallucination risk that outweighs stylistic gains. Marketing -35% factual correctness unacceptable for business communications.

**Final optimal config:**
* LLM: Cohere
* Embeddings: all-distilroberta-v1
* Retrieval: k=8
* Prompts: Phase 2A concise, grounded prompts (no few-shot)

### Full Evaluation

In [114]:
import os

checkpoint_file = '/content/drive/MyDrive/evaluation_results_checkpoint.pkl'

if os.path.exists(checkpoint_file):
 os.remove(checkpoint_file)
 print(f"Deleted checkpoint: {checkpoint_file}")
else:
 print("No checkpoint file found - starting fresh")

Deleted checkpoint: /content/drive/MyDrive/evaluation_results_checkpoint.pkl


In [115]:
import pandas as pd
from tqdm import tqdm
import time
import pickle
import os
from bert_score import score
from datasets import Dataset

configs = ['marketing_cohere', 'engineering_cohere']
results_file = '/content/drive/MyDrive/evaluation_results_checkpoint.pkl'

available_question_ids = sorted(validation_questions_answers.keys())
print(f"Total available questions: {len(available_question_ids)}")
print(f"First 20 question IDs: {available_question_ids[:20]}")

question_ids_to_test = available_question_ids
print(f"Testing with {len(question_ids_to_test)} questions: {question_ids_to_test}")

if os.path.exists(results_file):
 with open(results_file, 'rb') as f:
 rag_outputs = pickle.load(f)
 print(f"Loaded {len(rag_outputs)} existing generated responses")
else:
 rag_outputs = []
 print("Starting fresh generation")

completed_keys = {(r['question_id'], r['chain_type']) for r in rag_outputs}
total_needed = len(question_ids_to_test) * len(configs)
print(f"Progress: {len(completed_keys)}/{total_needed} generations complete")

print("PHASE 1: Generating RAG responses...")
for question_id in tqdm(question_ids_to_test, desc="Questions"):

 question = validation_questions_answers[question_id]['question']
 retrieved_docs = retriever.invoke(question)
 retrieved_contexts = [doc.page_content for doc in retrieved_docs]

 doc_sources = []
 for doc in retrieved_docs:
 if hasattr(doc, 'metadata') and 'source' in doc.metadata:
 doc_sources.append(doc.metadata['source'])
 else:
 doc_sources.append('unknown')

 for chain_type in configs:
 if (question_id, chain_type) in completed_keys:
 continue

 try:
 start = time.time()

 rag_response = ask_rag_question(
 question_id,
 chain_type
 )

 entry = {
 'question_id': question_id,
 'chain_type': chain_type,
 'question': rag_response['question'],
 'rag_answer': rag_response['rag_answer'],
 'gold_answer': rag_response['gold_answer'],
 'retrieved_contexts': rag_response['retrieved_contexts'],
 'retrieved_sources': doc_sources
 }
 rag_outputs.append(entry)

 with open(results_file, 'wb') as f:
 pickle.dump(rag_outputs, f)

 elapsed = time.time() - start
 print(f" Generated Q{question_id}-{chain_type} ({elapsed:.1f}s)")

 time.sleep(3.5)

 except Exception as e:
 print(f" Error Q{question_id}-{chain_type}: {e}")
 time.sleep(3.5)

print(f"PHASE 1 Complete: {len(rag_outputs)} responses generated")

print(f"PHASE 2: Batch Evaluation on {len(rag_outputs)} items...")

if rag_outputs and 'bert_f1' not in rag_outputs[0]:
 print(" Running BERTScore in batch mode...")
 start = time.time()

 rag_answers = [x['rag_answer'] for x in rag_outputs]
 gold_answers = [x['gold_answer'] for x in rag_outputs]

 P, R, F1 = score(rag_answers, gold_answers, lang='en', rescale_with_baseline=True)

 for i, entry in enumerate(rag_outputs):
 entry['bert_precision'] = P[i].item()
 entry['bert_recall'] = R[i].item()
 entry['bert_f1'] = F1[i].item()

 elapsed = time.time() - start
 print(f" BERTScore complete ({elapsed:.1f}s for {len(rag_outputs)} items)")

 with open(results_file, 'wb') as f:
 pickle.dump(rag_outputs, f)
else:
 print(" BERTScore already computed, skipping...")

if rag_outputs and 'ragas_faithfulness' not in rag_outputs[0]:
 print(" Running RAGAS in batch mode...")
 start = time.time()

 data_dict = {
 "user_input": [x['question'] for x in rag_outputs],
 "retrieved_contexts": [x['retrieved_contexts'] for x in rag_outputs],
 "response": [x['rag_answer'] for x in rag_outputs],
 "reference": [x['gold_answer'] for x in rag_outputs]
 }

 dataset = Dataset.from_dict(data_dict)

 from ragas import evaluate
 from ragas.metrics import LLMContextRecall, ContextPrecision, Faithfulness, FactualCorrectness, AnswerRelevancy
 from ragas.llms import LangchainLLMWrapper
 from langchain_openai import ChatOpenAI

 evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
 metrics = [LLMContextRecall(), ContextPrecision(), Faithfulness(), FactualCorrectness(), AnswerRelevancy()]

 ragas_results = evaluate(
 dataset=dataset,
 metrics=metrics,
 llm=evaluator_llm
 )

 for i, entry in enumerate(rag_outputs):
 entry['ragas_context_recall'] = ragas_results['context_recall'][i]
 entry['ragas_context_precision'] = ragas_results['context_precision'][i]
 entry['ragas_faithfulness'] = ragas_results['faithfulness'][i]
 entry['ragas_factual_correctness'] = ragas_results['factual_correctness(mode=f1)'][i]
 entry['ragas_answer_relevancy'] = ragas_results['answer_relevancy'][i]

 elapsed = time.time() - start
 print(f" RAGAS complete ({elapsed:.1f}s for {len(rag_outputs)} items)")

 with open(results_file, 'wb') as f:
 pickle.dump(rag_outputs, f)
else:
 print(" RAGAS already computed, skipping...")

results_df = pd.DataFrame(rag_outputs)
print(f"\n{'='*70}")
print(f"EVALUATION COMPLETE!")
print(f"{'='*70}")
print(f"Total items evaluated: {len(results_df)}")
print(f"Questions tested: {sorted(results_df['question_id'].unique())}")
print(f"Configs tested: {sorted(results_df['chain_type'].unique())}")
print(f"Results saved to: {results_file}")
print(f"\nQuick preview:")
print(results_df.groupby('chain_type')[['bert_f1', 'ragas_context_recall',
 'ragas_context_precision',
 'ragas_faithfulness',
 'ragas_factual_correctness',
 'ragas_answer_relevancy']].mean())

print(f"\n{'='*70}")
print("RETRIEVED DOCUMENT SOURCES")
print(f"{'='*70}")
for q_id in question_ids_to_test:
 q_data = results_df[results_df['question_id'] == q_id].iloc[0]
 print(f"\nQ{q_id}: {q_data['question'][:60]}...")
 print(f"Sources: {q_data['retrieved_sources']}")

Total available questions: 78
First 20 question IDs: [0, 1, 2, 3, 7, 8, 9, 11, 12, 13, 16, 17, 18, 19, 20, 22, 23, 24, 25, 27]
Testing with 78 questions: [0, 1, 2, 3, 7, 8, 9, 11, 12, 13, 16, 17, 18, 19, 20, 22, 23, 24, 25, 27, 28, 30, 33, 34, 35, 36, 38, 39, 41, 43, 44, 46, 47, 48, 50, 51, 52, 53, 54, 55, 59, 60, 61, 62, 63, 64, 65, 67, 69, 73, 74, 75, 76, 78, 80, 81, 82, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 101, 102, 103, 104, 105, 106, 107, 108, 110]
Starting fresh generation
Progress: 0/156 generations complete
PHASE 1: Generating RAG responses...


Questions: 0%| | 0/78 [00:00<?, ?it/s]

 Generated Q0-marketing_cohere (1.8s)
 Generated Q0-engineering_cohere (4.9s)


Questions: 1%|▏ | 1/78 [00:13<17:40, 13.77s/it]

 Generated Q1-marketing_cohere (1.6s)
 Generated Q1-engineering_cohere (2.9s)


Questions: 3%|▎ | 2/78 [00:25<15:51, 12.52s/it]

 Generated Q2-marketing_cohere (8.4s)
 Generated Q2-engineering_cohere (2.1s)


Questions: 4%|▍ | 3/78 [00:42<18:32, 14.83s/it]

 Generated Q3-marketing_cohere (1.5s)
 Generated Q3-engineering_cohere (3.0s)


Questions: 5%|▌ | 4/78 [00:54<16:42, 13.54s/it]

 Generated Q7-marketing_cohere (1.3s)
 Generated Q7-engineering_cohere (2.7s)


Questions: 6%|▋ | 5/78 [01:05<15:23, 12.65s/it]

 Generated Q8-marketing_cohere (1.3s)
 Generated Q8-engineering_cohere (2.8s)


Questions: 8%|▊ | 6/78 [01:16<14:35, 12.16s/it]

 Generated Q9-marketing_cohere (1.6s)
 Generated Q9-engineering_cohere (2.0s)


Questions: 9%|▉ | 7/78 [01:27<13:47, 11.65s/it]

 Generated Q11-marketing_cohere (1.9s)
 Generated Q11-engineering_cohere (2.9s)


Questions: 10%|█ | 8/78 [01:39<13:38, 11.69s/it]

 Generated Q12-marketing_cohere (1.0s)
 Generated Q12-engineering_cohere (3.1s)


Questions: 12%|█▏ | 9/78 [01:50<13:13, 11.49s/it]

 Generated Q13-marketing_cohere (2.0s)
 Generated Q13-engineering_cohere (2.5s)


Questions: 13%|█▎ | 10/78 [02:01<13:01, 11.50s/it]

 Generated Q16-marketing_cohere (1.8s)
 Generated Q16-engineering_cohere (3.4s)


Questions: 14%|█▍ | 11/78 [02:14<13:06, 11.74s/it]

 Generated Q17-marketing_cohere (1.3s)
 Generated Q17-engineering_cohere (3.2s)


Questions: 15%|█▌ | 12/78 [02:25<12:50, 11.68s/it]

 Generated Q18-marketing_cohere (1.5s)
 Generated Q18-engineering_cohere (2.4s)


Questions: 17%|█▋ | 13/78 [02:36<12:23, 11.45s/it]

 Generated Q19-marketing_cohere (0.9s)
 Generated Q19-engineering_cohere (2.0s)


Questions: 18%|█▊ | 14/78 [02:46<11:41, 10.97s/it]

 Generated Q20-marketing_cohere (45.7s)
 Generated Q20-engineering_cohere (1.8s)


Questions: 19%|█▉ | 15/78 [03:40<25:17, 24.09s/it]

 Generated Q22-marketing_cohere (1.5s)
 Generated Q22-engineering_cohere (3.5s)


Questions: 21%|██ | 16/78 [03:52<21:09, 20.48s/it]

 Generated Q23-marketing_cohere (1.2s)
 Generated Q23-engineering_cohere (2.8s)


Questions: 22%|██▏ | 17/78 [04:03<17:55, 17.63s/it]

 Generated Q24-marketing_cohere (3.2s)
 Generated Q24-engineering_cohere (2.5s)


Questions: 23%|██▎ | 18/78 [04:16<16:09, 16.16s/it]

 Generated Q25-marketing_cohere (1.8s)
 Generated Q25-engineering_cohere (2.6s)


Questions: 24%|██▍ | 19/78 [04:28<14:31, 14.77s/it]

 Generated Q27-marketing_cohere (1.8s)
 Generated Q27-engineering_cohere (4.1s)


Questions: 26%|██▌ | 20/78 [04:41<13:44, 14.21s/it]

 Generated Q28-marketing_cohere (1.4s)
 Generated Q28-engineering_cohere (1.2s)


Questions: 27%|██▋ | 21/78 [04:50<12:12, 12.84s/it]

 Generated Q30-marketing_cohere (1.7s)
 Generated Q30-engineering_cohere (3.0s)


Questions: 28%|██▊ | 22/78 [05:02<11:41, 12.52s/it]

 Generated Q33-marketing_cohere (1.7s)
 Generated Q33-engineering_cohere (3.7s)


Questions: 29%|██▉ | 23/78 [05:14<11:25, 12.47s/it]

 Generated Q34-marketing_cohere (1.6s)
 Generated Q34-engineering_cohere (4.5s)


Questions: 31%|███ | 24/78 [05:27<11:22, 12.64s/it]

 Generated Q35-marketing_cohere (2.1s)
 Generated Q35-engineering_cohere (2.6s)


Questions: 32%|███▏ | 25/78 [05:39<10:54, 12.36s/it]

 Generated Q36-marketing_cohere (1.9s)
 Generated Q36-engineering_cohere (2.7s)


Questions: 33%|███▎ | 26/78 [05:51<10:31, 12.15s/it]

 Generated Q38-marketing_cohere (1.6s)
 Generated Q38-engineering_cohere (3.8s)


Questions: 35%|███▍ | 27/78 [06:03<10:25, 12.26s/it]

 Generated Q39-marketing_cohere (1.7s)
 Generated Q39-engineering_cohere (3.4s)


Questions: 36%|███▌ | 28/78 [06:15<10:11, 12.23s/it]

 Generated Q41-marketing_cohere (1.7s)
 Generated Q41-engineering_cohere (5.5s)


Questions: 37%|███▋ | 29/78 [06:30<10:28, 12.83s/it]

 Generated Q43-marketing_cohere (1.6s)
 Generated Q43-engineering_cohere (3.9s)


Questions: 38%|███▊ | 30/78 [06:42<10:11, 12.74s/it]

 Generated Q44-marketing_cohere (1.6s)
 Generated Q44-engineering_cohere (3.3s)


Questions: 40%|███▉ | 31/78 [06:54<09:46, 12.48s/it]

 Generated Q46-marketing_cohere (1.7s)
 Generated Q46-engineering_cohere (4.5s)


Questions: 41%|████ | 32/78 [07:07<09:44, 12.71s/it]

 Generated Q47-marketing_cohere (1.5s)
 Generated Q47-engineering_cohere (3.4s)


Questions: 42%|████▏ | 33/78 [07:19<09:20, 12.46s/it]

 Generated Q48-marketing_cohere (1.4s)
 Generated Q48-engineering_cohere (2.5s)


Questions: 44%|████▎ | 34/78 [07:30<08:47, 11.99s/it]

 Generated Q50-marketing_cohere (1.4s)
 Generated Q50-engineering_cohere (3.3s)


Questions: 45%|████▍ | 35/78 [07:42<08:32, 11.93s/it]

 Generated Q51-marketing_cohere (1.9s)
 Generated Q51-engineering_cohere (3.1s)


Questions: 46%|████▌ | 36/78 [07:54<08:21, 11.95s/it]

 Generated Q52-marketing_cohere (1.7s)
 Generated Q52-engineering_cohere (3.2s)


Questions: 47%|████▋ | 37/78 [08:06<08:09, 11.94s/it]

 Generated Q53-marketing_cohere (1.3s)
 Generated Q53-engineering_cohere (3.2s)


Questions: 49%|████▊ | 38/78 [08:17<07:52, 11.82s/it]

 Generated Q54-marketing_cohere (1.2s)
 Generated Q54-engineering_cohere (2.7s)


Questions: 50%|█████ | 39/78 [08:28<07:30, 11.55s/it]

 Generated Q55-marketing_cohere (1.5s)
 Generated Q55-engineering_cohere (12.3s)


Questions: 51%|█████▏ | 40/78 [08:49<09:04, 14.34s/it]

 Generated Q59-marketing_cohere (1.7s)
 Generated Q59-engineering_cohere (4.5s)


Questions: 53%|█████▎ | 41/78 [09:02<08:37, 13.99s/it]

 Generated Q60-marketing_cohere (2.4s)
 Generated Q60-engineering_cohere (4.3s)


Questions: 54%|█████▍ | 42/78 [09:16<08:21, 13.92s/it]

 Generated Q61-marketing_cohere (1.5s)
 Generated Q61-engineering_cohere (2.8s)


Questions: 55%|█████▌ | 43/78 [09:28<07:40, 13.17s/it]

 Generated Q62-marketing_cohere (1.6s)
 Generated Q62-engineering_cohere (3.4s)


Questions: 56%|█████▋ | 44/78 [09:39<07:15, 12.81s/it]

 Generated Q63-marketing_cohere (1.9s)
 Generated Q63-engineering_cohere (4.1s)


Questions: 58%|█████▊ | 45/78 [09:53<07:04, 12.87s/it]

 Generated Q64-marketing_cohere (1.6s)
 Generated Q64-engineering_cohere (3.9s)


Questions: 59%|█████▉ | 46/78 [10:05<06:48, 12.77s/it]

 Generated Q65-marketing_cohere (1.6s)
 Generated Q65-engineering_cohere (3.6s)


Questions: 60%|██████ | 47/78 [10:17<06:30, 12.61s/it]

 Generated Q67-marketing_cohere (1.3s)
 Generated Q67-engineering_cohere (3.9s)


Questions: 62%|██████▏ | 48/78 [10:30<06:15, 12.51s/it]

 Generated Q69-marketing_cohere (1.7s)
 Generated Q69-engineering_cohere (3.0s)


Questions: 63%|██████▎ | 49/78 [10:41<05:55, 12.27s/it]

 Generated Q73-marketing_cohere (1.1s)
 Generated Q73-engineering_cohere (1.9s)


Questions: 64%|██████▍ | 50/78 [10:51<05:25, 11.62s/it]

 Generated Q74-marketing_cohere (1.6s)
 Generated Q74-engineering_cohere (3.5s)


Questions: 65%|██████▌ | 51/78 [11:03<05:17, 11.76s/it]

 Generated Q75-marketing_cohere (1.7s)
 Generated Q75-engineering_cohere (2.5s)


Questions: 67%|██████▋ | 52/78 [11:15<05:01, 11.60s/it]

 Generated Q76-marketing_cohere (2.2s)
 Generated Q76-engineering_cohere (3.2s)


Questions: 68%|██████▊ | 53/78 [11:27<04:56, 11.84s/it]

 Generated Q78-marketing_cohere (1.8s)
 Generated Q78-engineering_cohere (3.6s)


Questions: 69%|██████▉ | 54/78 [11:40<04:48, 12.02s/it]

 Generated Q80-marketing_cohere (1.3s)
 Generated Q80-engineering_cohere (3.0s)


Questions: 71%|███████ | 55/78 [11:51<04:31, 11.82s/it]

 Generated Q81-marketing_cohere (2.0s)
 Generated Q81-engineering_cohere (3.2s)


Questions: 72%|███████▏ | 56/78 [12:03<04:22, 11.93s/it]

 Generated Q82-marketing_cohere (1.5s)
 Generated Q82-engineering_cohere (3.1s)


Questions: 73%|███████▎ | 57/78 [12:15<04:08, 11.84s/it]

 Generated Q84-marketing_cohere (1.3s)
 Generated Q84-engineering_cohere (2.5s)


Questions: 74%|███████▍ | 58/78 [12:26<03:50, 11.55s/it]

 Generated Q85-marketing_cohere (1.8s)
 Generated Q85-engineering_cohere (3.3s)


Questions: 76%|███████▌ | 59/78 [12:38<03:42, 11.71s/it]

 Generated Q86-marketing_cohere (1.5s)
 Generated Q86-engineering_cohere (2.9s)


Questions: 77%|███████▋ | 60/78 [12:49<03:29, 11.61s/it]

 Generated Q87-marketing_cohere (2.2s)
 Generated Q87-engineering_cohere (2.7s)


Questions: 78%|███████▊ | 61/78 [13:01<03:18, 11.70s/it]

 Generated Q88-marketing_cohere (1.5s)
 Generated Q88-engineering_cohere (2.5s)


Questions: 79%|███████▉ | 62/78 [13:12<03:03, 11.47s/it]

 Generated Q89-marketing_cohere (1.7s)
 Generated Q89-engineering_cohere (2.7s)


Questions: 81%|████████ | 63/78 [13:23<02:51, 11.46s/it]

 Generated Q91-marketing_cohere (1.6s)
 Generated Q91-engineering_cohere (2.9s)


Questions: 82%|████████▏ | 64/78 [13:35<02:40, 11.48s/it]

 Generated Q92-marketing_cohere (1.1s)
 Generated Q92-engineering_cohere (2.6s)


Questions: 83%|████████▎ | 65/78 [13:46<02:26, 11.24s/it]

 Generated Q93-marketing_cohere (1.5s)
 Generated Q93-engineering_cohere (2.7s)


Questions: 85%|████████▍ | 66/78 [13:57<02:15, 11.25s/it]

 Generated Q94-marketing_cohere (1.2s)
 Generated Q94-engineering_cohere (4.2s)


Questions: 86%|████████▌ | 67/78 [14:09<02:07, 11.62s/it]

 Generated Q95-marketing_cohere (1.6s)
 Generated Q95-engineering_cohere (2.7s)


Questions: 87%|████████▋ | 68/78 [14:21<01:55, 11.53s/it]

 Generated Q96-marketing_cohere (1.9s)
 Generated Q96-engineering_cohere (4.0s)


Questions: 88%|████████▊ | 69/78 [14:34<01:47, 11.97s/it]

 Generated Q101-marketing_cohere (1.5s)
 Generated Q101-engineering_cohere (3.2s)


Questions: 90%|████████▉ | 70/78 [14:45<01:35, 11.88s/it]

 Generated Q102-marketing_cohere (6.1s)
 Generated Q102-engineering_cohere (3.9s)


Questions: 91%|█████████ | 71/78 [15:02<01:33, 13.40s/it]

 Generated Q103-marketing_cohere (2.0s)
 Generated Q103-engineering_cohere (2.9s)


Questions: 92%|█████████▏| 72/78 [15:14<01:17, 12.95s/it]

 Generated Q104-marketing_cohere (2.0s)
 Generated Q104-engineering_cohere (2.6s)


Questions: 94%|█████████▎| 73/78 [15:26<01:02, 12.57s/it]

 Generated Q105-marketing_cohere (1.4s)
 Generated Q105-engineering_cohere (2.3s)


Questions: 95%|█████████▍| 74/78 [15:36<00:48, 12.00s/it]

 Generated Q106-marketing_cohere (1.5s)
 Generated Q106-engineering_cohere (3.1s)


Questions: 96%|█████████▌| 75/78 [15:48<00:35, 11.90s/it]

 Generated Q107-marketing_cohere (1.3s)
 Generated Q107-engineering_cohere (2.4s)


Questions: 97%|█████████▋| 76/78 [15:59<00:23, 11.53s/it]

 Generated Q108-marketing_cohere (3.5s)
 Generated Q108-engineering_cohere (3.0s)


Questions: 99%|█████████▊| 77/78 [16:12<00:12, 12.12s/it]

 Generated Q110-marketing_cohere (1.5s)
 Generated Q110-engineering_cohere (3.8s)


Questions: 100%|██████████| 78/78 [16:25<00:00, 12.63s/it]


PHASE 1 Complete: 156 responses generated
PHASE 2: Batch Evaluation on 156 items...
 Running BERTScore in batch mode...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 BERTScore complete (6.8s for 156 items)
 Running RAGAS in batch mode...


 evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))


Evaluating: 0%| | 0/780 [00:00<?, ?it/s]



 RAGAS complete (1008.4s for 156 items)

EVALUATION COMPLETE!
Total items evaluated: 156
Questions tested: [np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(7), np.int64(8), np.int64(9), np.int64(11), np.int64(12), np.int64(13), np.int64(16), np.int64(17), np.int64(18), np.int64(19), np.int64(20), np.int64(22), np.int64(23), np.int64(24), np.int64(25), np.int64(27), np.int64(28), np.int64(30), np.int64(33), np.int64(34), np.int64(35), np.int64(36), np.int64(38), np.int64(39), np.int64(41), np.int64(43), np.int64(44), np.int64(46), np.int64(47), np.int64(48), np.int64(50), np.int64(51), np.int64(52), np.int64(53), np.int64(54), np.int64(55), np.int64(59), np.int64(60), np.int64(61), np.int64(62), np.int64(63), np.int64(64), np.int64(65), np.int64(67), np.int64(69), np.int64(73), np.int64(74), np.int64(75), np.int64(76), np.int64(78), np.int64(80), np.int64(81), np.int64(82), np.int64(84), np.int64(85), np.int64(86), np.int64(87), np.int64(88), np.int64(89), np.int64(91), np.

In [116]:
results_df.to_csv(r'/content/drive/MyDrive/Gen AI/full_evaluation_results.csv', index=False)

In [117]:
results_df['rag_answer_length'] = results_df['rag_answer'].str.len()
results_df['gold_answer_length'] = results_df['gold_answer'].str.len()

## 5. Results

### 5.1 Model Specifications

Document the detailed specs of your choices. Also comment on how you valued the needs of the marketing team vs the needs of the researchers, in case you had to make a trade-off.

**Demonstration:**
5.1.a Provide the detailed specification of your pipelines.

* LLM: Cohere
* Embedding Model: all-distilroberta-v1 (384 dimensions)
* Vector Database: Qdrant (in-memory)
* Chunk Size: 1200 characters
* Chunk Overlap200 characters
* Retrievalk=8 chunks, cosine similarity

**Trade-Offs**

Primary decision: Optimized for engineering for POC as it was more difficult to manage - due to the larger content requirement and preciseness required.

* k=8 for both personas (engineering benefits +63% faithfulness on Q2, marketing drops -15% BERTScore but acceptable)
* Cohere over Mistral (engineering needs faithfulness 0.89 vs 0.71, marketing handles verbosity with prompt constraints)
* Single unified config for POC to reduce testing time (simplifies deployment, engineering is primary stakeholder with 300-person org)
* Result: Engineering faithfulness 0.89, marketing faithfulness 0.65 on 78-question validation. Both exceed 0.83 answer relevancy.

**Demonstration:**
5.1.b: What is the prompt you used for the engineering research group?

engineering_template = """[INST]You are a technical AI assistant helping engineers understand AI and NLP concepts.

Using ONLY the context provided below, answer the question with technical precision. Your answer must:
- Be 4-5 sentences (approximately 500 characters)
- Use precise technical terminology
- Focus on implementation details and architectures
- Stay grounded in the provided context

If the context doesn't contain sufficient information to answer the question, state what information is missing.

Context:
{context}

Question: {question}

Provide a concise, technical answer (4-5 sentences):[/INST]"""

**Demonstration:**
5.1.c.: What is the prompt you used for the marketing group??

marketing_template = """[INST]You are an AI assistant helping marketing professionals understand AI concepts for business communication.

Using ONLY the context provided below, answer the question in accessible language. Your answer must:
- Be 2-3 sentences (approximately 250 characters)
- Avoid technical jargon
- Focus on business value and practical applications
- Stay grounded in the provided context

If the context doesn't contain sufficient information to answer the question, state what information is missing.

Context:
{context}

Question: {question}

Provide a brief, business-focused answer (2-3 sentences):[/INST]"""

## 5.2 Some Test Questions

Please study the answers generated by your chosen setup for these specific test questions:

1. "What defines a large language model in the context of natural language processing tasks?" (Question 0)

2. "What are the advantages of applying LoRA to transformer models in terms of computational efficiency during training and deployment?" (Question 63)

3. "What actor played the role of Thanos in the Marvel Universe and the role of Cable in Deadpool 2?" (Question 109, no labeled answers)

For each of the three questions above please provide:

a) The RAG results (research and marketing response) 
b) The context provided 
c) The document sources for the context 

Then, for questions 1 and 2,

d) Also discuss your metric(s) for the two examples (for both responses) compared to the gold responses and comment on how well you feel your metrics captured the differences and similarities between your answer and the gold answer?

Put your answers to these questions into the appropriae cells below.

### 5.2.1 Test Question 1

Please run the test question 0 through your RAG pipeline show the results in the questions below:

**Demonstration:**
5.2.1.a.i. What is the engineering response for question 1?

In [122]:
evaluation_results = pd.read_csv(r'/content/drive/MyDrive/Gen AI/full_evaluation_results.csv')

5.2.1.a.i. A large language model (LLM) is defined by its vast number of parameters, typically ranging from billions to trillions, and its training on extensive text corpora using self-supervised learning. Architecturally, LLMs are often based on transformer models, leveraging attention mechanisms for sequence modeling. Their key characteristic is the ability to generalize across diverse natural language processing tasks with minimal task-specific fine-tuning, enabled by pre-training on large datasets. This scalability in model size and data has been shown to improve performance, though it introduces challenges like high computational costs and the need for high-quality training data. LLMs are particularly effective in generative tasks, such as language generation, summarization, and conversational agents, due to their capacity to capture complex linguistic patterns and semantics.

**Demonstration:**
5.2.1.a.ii. What is the marketing response for question 1?

5.2.1.a.ii. A large language model (LLM) is a powerful tool for businesses, designed to handle natural language processing tasks like language generation. LLMs are defined by their vast number of parameters (billions to trillions) and training on massive text datasets, enabling them to generalize across tasks, from conversational agents to automated reasoning, with minimal task-specific supervision.

**Demonstration:**
5.2.1.b: What is the context you passed to the LLM for question 1?

5.2.1.b:

--- Chunk 1 ---
billion parameters (Smith et al., 2022). The drive to train larger and larger models is clear—so far
increasing the size of language models has been responsible for improving the state-of-the-art in many
language modelling tasks. Nonetheless, large language models face several challenges, including
their overwhelming computational requirements (the cost of training and inference increase with
model size) (Rae et al., 2021; Thoppilan et al., 2022) and the need for acquiring more high-quality
training data. In fact, in this work we ﬁnd that larger, high quality datasets will play a key role in any
further scaling of language models.
Modelling the scaling behavior.
Understanding the scaling behaviour of language models and
their transfer properties has been important in the development of recent large models (Hernandez
et al., 2021; Kaplan et al., 2020). Kaplan et al. (2020) ﬁrst showed a predictable relationship between
model size and loss over many orders of magnitude. The authors investigate the question of choosing
the optimal model size to train for a given compute budget. Similar to us, they address this question

--- Chunk 2 ---
== Background ==
After the release of large language models such as GPT-3, a focus of research was up-scaling models, which in some instances showed major increases in emergent capabilities. The release of ChatGPT and its surprise success caused an increase in attention to large language models.
Compared with other responses to ChatGPT, Meta's Chief AI scientist Yann LeCun stated that large language models are best for aiding with writing.

== Versions ==

--- Chunk 3 ---
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.

== List ==
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Also, only the largest model's cost is written.

== Timeline ==

== See also ==
List of chatbots
List of language model benchmarks

== Notes ==

== References ==

--- Chunk 4 ---
== History ==

Before the emergence of transformer-based models in 2017, some language models were considered large relative to the computational and data constraints of their time. In the early 1990s, IBM's statistical models pioneered word alignment techniques for machine translation, laying the groundwork for corpus-based language modeling. In 2001, a smoothed n-gram model, such as those employing Kneser–Ney smoothing, trained on 300 million words, achieved state-of-the-art perplexity on benchmark tests. During the 2000s, with the rise of widespread internet access, researchers began compiling massive text datasets from the web ("web as corpus") to train statistical language models.

--- Chunk 5 ---
# Tasks
Examples
Language Modelling
20
WikiText-103, The Pile: PG-19, arXiv, FreeLaw, . . .
Reading Comprehension
3
RACE-m, RACE-h, LAMBADA
Question Answering
3
Natural Questions, TriviaQA, TruthfulQA
Common Sense
5
HellaSwag, Winogrande, PIQA, SIQA, BoolQ
MMLU
57
High School Chemistry, Astronomy, Clinical Knowledge, . . .
BIG-bench
62
Causal Judgement, Epistemic Reasoning, Temporal Sequences, . . .
Table 5 | All evaluation tasks. We evaluate Chinchilla on a collection of language modelling along
with downstream tasks. We evaluate on largely the same tasks as in Rae et al. (2021), to allow for
direct comparison.
4.2. Results
We perform an extensive evaluation of Chinchilla, comparing against various large language models.
We evaluate on a large subset of the tasks presented in Rae et al. (2021), shown in Table 5. As
the focus of this work is on optimal model scaling, we included a large representative subset, and
introduce a few new evaluations to allow for better comparison to other existing large models. The
evaluation details for all tasks are the same as described in Rae et al. (2021).
4.2.1. Language modelling
pubmed_abstracts
nih_exporter
uspto_backgrounds
pubmed_central

--- Chunk 6 ---
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) and provide the core capabilities of modern chatbots. LLMs can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on.
They consist of billions to trillions of parameters and operate as general-purpose sequence models, generating, summarizing, translating, and reasoning over text. LLMs represent a significant new technology in their ability to generalize across tasks with minimal task-specific supervision, enabling capabilities like conversational agents, code generation, knowledge retrieval, and automated reasoning that previously required bespoke systems.

--- Chunk 7 ---
thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open
and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei,
Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open
foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[28] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
[29] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied,
Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation
models. arXiv preprint arXiv:2304.06364, 2023.
9

--- Chunk 8 ---
Note that they did fine-tune the pretrained LM independently for each dataset.
Closed-book QA: Generative Language Model#
Big language models have been pre-trained on a large collection of unsupervised textual corpus. Given enough parameters, these models are able to memorize some factual knowledge within parameter weights. Therefore, we can use these models to do question-answering without explicit context, just like in a closed-book exam. The pre-trained language models produce free text to respond to questions, no explicit reading comprehension.

The amount of computation used for training big language models of different sizes is getting big. (Image source: Brown et al., 2020).

Roberts et al. (2020) measured the practical utility of a language model by fine-tuning a pre-trained model to answer questions without access to any external context or knowledge. They fine-tuned the T5 language model (same architecture as the original Transformer) to answer questions without inputting any additional information or context. Such setup enforces the language model to answer questions based on “knowledge” that it internalized during pre-training.

**Demonstration:**
5.2.1.c: List the doc_source for each of the documents in the context for question 1

1. https://arxiv.org/pdf/2203.15556.pdf
2. https://en.wikipedia.org/wiki/Llama_(language_model)
3. https://en.wikipedia.org/wiki/List_of_large_language_models
4. https://en.wikipedia.org/wiki/Large_language_model
5. https://arxiv.org/pdf/2203.15556.pdf
6. https://en.wikipedia.org/wiki/Large_language_model
7. https://arxiv.org/pdf/2310.06825.pdf
8. https://lilianweng.github.io/posts/2020-10-29-odqa/

**Demonstration:**
5.2.1.d.i: How well does your model perform relative to the gold answer we provided for engineering on question 1?

5.2.1.d.i:

[('BERTScore Precision', 0.268),
 ('BERTScore Recall', 0.341),
 ('BERTScore F1', 0.306),
 ('Context Recall', 1.000),
 ('Context Precision', 0.347),
 ('Faithfulness', 1.000),
 ('Factual Correctness', 0.360),
 ('Answer Relevancy', 0.925)]

**Demonstration:**
5.2.1.d.ii: How well does your model perform relative to the gold answer we provided for marketing on question 1?

5.2.1.d.ii.

[('BERTScore Precision', 0.322),
 ('BERTScore Recall', 0.392),
 ('BERTScore F1', 0.358),
 ('Context Recall', 1.000),
 ('Context Precision', 0.347),
 ('Faithfulness', 0.857),
 ('Factual Correctness', 0.290),
 ('Answer Relevancy', 0.880)]

### 5.2.2 Test Question 2

Please run question 63 through your RAG pipeline:

**Demonstration:**
5.2.2.a.i: What is the engineering response for question 2?

5.2.2.a.i:
Applying LoRA to Transformer models significantly enhances computational efficiency during training and deployment. By injecting low-rank matrices (L1 ∈ R^(h×r) and L2 ∈ R^(r×o)) into the linear projection layers, LoRA reduces the number of trainable parameters, lowering VRAM usage by up to 2/3 and enabling training with fewer GPUs. During training, LoRA avoids calculating gradients for frozen parameters, resulting in a 25% speedup for GPT-3 175B. At deployment, the trained low-rank matrices can be merged with frozen weights (W = W0 + BA), eliminating inference latency compared to full fine-tuning. Additionally, task switching is efficient, requiring only the replacement of LoRA weights (B′A′) with minimal memory overhead.

**Demonstration:**
5.2.2.a.ii: What is the marketing response for question 2?

5.2.2.a.ii:
Applying LoRA to transformer models significantly reduces memory and storage usage, allowing businesses to train models with fewer GPUs and avoid costly I/O bottlenecks. This efficiency enables faster task switching and the creation of customized models at a lower cost, while also speeding up training by up to 25% and introducing no additional inference latency during deployment.

**Demonstration:**
5.2.2.b: What is the context you passed to the LLM for question 2?

5.2.2.b:

--- Chunk 1 ---
Practical Beneﬁts and Limitations.
The most signiﬁcant beneﬁt comes from the reduction in
memory and storage usage. For a large Transformer trained with Adam, we reduce that VRAM
usage by up to 2/3 if r ≪dmodel as we do not need to store the optimizer states for the frozen
parameters. On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to
350GB. With r = 4 and only the query and value projection matrices being adapted, the checkpoint
size is reduced by roughly 10,000× (from 350GB to 35MB)4. This allows us to train with signiﬁ-
cantly fewer GPUs and avoid I/O bottlenecks. Another beneﬁt is that we can switch between tasks
while deployed at a much lower cost by only swapping the LoRA weights as opposed to all the
parameters. This allows for the creation of many customized models that can be swapped in and out
on the ﬂy on machines that store the pre-trained weights in VRAM. We also observe a 25% speedup
during training on GPT-3 175B compared to full ﬁne-tuning5 as we do not need to calculate the
gradient for the vast majority of the parameters.
LoRA also has its limitations. For example, it is not straightforward to batch inputs to different tasks

--- Chunk 2 ---
when the full rank (i.e., d) is as high as 12,288, making LoRA both storage- and compute-efﬁcient.
LoRA possesses several key advantages.
• A pre-trained model can be shared and used to build many small LoRA modules for dif-
ferent tasks. We can freeze the shared model and efﬁciently switch tasks by replacing the
matrices A and B in Figure 1, reducing the storage requirement and task-switching over-
head signiﬁcantly.
• LoRA makes training more efﬁcient and lowers the hardware barrier to entry by up to 3
times when using adaptive optimizers since we do not need to calculate the gradients or
maintain the optimizer states for most parameters. Instead, we only optimize the injected,
much smaller low-rank matrices.
• Our simple linear design allows us to merge the trainable matrices with the frozen weights
when deployed, introducing no inference latency compared to a fully ﬁne-tuned model, by
construction.
• LoRA is orthogonal to many prior methods and can be combined with many of them, such
as preﬁx-tuning. We provide an example in Appendix E.
Terminologies and Conventions
We make frequent references to the Transformer architecture

--- Chunk 3 ---
ness of full ﬁne-tuning by setting the LoRA rank r to the rank of the pre-trained weight matrices. In
other words, as we increase the number of trainable parameters 3, training LoRA roughly converges
to training the original model, while adapter-based methods converges to an MLP and preﬁx-based
methods to a model that cannot take long input sequences.
No Additional Inference Latency.
When deployed in production, we can explicitly compute and
store W = W0 + BA and perform inference as usual. Note that both W0 and BA are in Rd×k.
When we need to switch to another downstream task, we can recover W0 by subtracting BA and
then adding a different B′A′, a quick operation with very little memory overhead. Critically, this
2They represent a negligible number of parameters compared to weights.
3An inevitability when adapting to hard tasks.
4

--- Chunk 4 ---
the full model parameters which remain fixed. Gradients during stochastic gradient descent are
passed through the fixed pretrained model weights to the adapter, which is updated to optimize the
loss function. LoRA augments a linear projection through an additional factorized projection. Given
a projection XW = Y with X ∈Rb×h, W ∈Rh×o LoRA computes:
Y = XW + sXL1L2,
(3)
where L1 ∈Rh×r and L2 ∈Rr×o, and s is a scalar.
Memory Requirement of Parameter-Efficient Finetuning
One important point of discussion is
the memory requirement of LoRA during training both in terms of the number and size of adapters
used. Since the memory footprint of LoRA is so minimal, we can use more adapters to improve
performance without significantly increasing the total memory used. While LoRA was designed as a
3

--- Chunk 5 ---
Hyperparameters
Fine-Tune
PreEmbed
PreLayer
BitFit
AdapterH
LoRA
Optimizer
AdamW
Batch Size
128
# Epoch
2
Warmup Tokens
250,000
LR Schedule
Linear
Learning Rate
5.00E-06
5.00E-04
1.00E-04
1.6E-03
1.00E-04
2.00E-04
Table 12: The training hyperparameters used for different GPT-3 adaption methods. We use the
same hyperparameters for all datasets after tuning learning rate.
rally, we replace them after every Transformer block with an input agnostic vector. Thus, both the
embeddings and subsequent Transformer block activations are treated as trainable parameters. For
more on preﬁx-layer tuning, see Section 5.1.
In Table 15, we show the evaluation results of LoRA+PE and LoRA+PL on WikiSQL and MultiNLI.
First of all, LoRA+PE signiﬁcantly outperforms both LoRA and preﬁx-embedding tuning on
WikiSQL, which indicates that LoRA is somewhat orthogonal to preﬁx-embedding tuning. On
MultiNLI, the combination of LoRA+PE doesn’t perform better than LoRA, possibly because LoRA
on its own already achieves performance comparable to the human baseline. Secondly, we notice
that LoRA+PL performs slightly worse than LoRA even with more trainable parameters. We at-

--- Chunk 6 ---
of 128 to match the adapter baselines. Second, we initialize the model to the pre-trained model for
MRPC, RTE, and STS-B, not a model already adapted to MNLI like the ﬁne-tuning baseline. Runs
following this more restricted setup from Houlsby et al. (2019) are labeled with †. The result is
presented in Table 2 (Top Three Sections). See Section D.1 for details on the hyperparameters used.
5.3
DEBERTA XXL
DeBERTa (He et al., 2021) is a more recent variant of BERT that is trained on a much larger
scale and performs very competitively on benchmarks such as GLUE (Wang et al., 2019) and Su-
perGLUE (Wang et al., 2020). We evaluate if LoRA can still match the performance of a fully
ﬁne-tuned DeBERTa XXL (1.5B) on GLUE. The result is presented in Table 2 (Bottom Section).
See Section D.2 for details on the hyperparameters used.
5.4
GPT-2 MEDIUM/LARGE
Having shown that LoRA can be a competitive alternative to full ﬁne-tuning on NLU, we hope to
answer if LoRA still prevails on NLG models, such as GPT-2 medium and large (Radford et al.,
b). We keep our setup as close as possible to Li & Liang (2021) for a direct comparison. Due

--- Chunk 7 ---
guarantees that we do not introduce any additional latency during inference compared to a ﬁne-tuned
model by construction.
4.2
APPLYING LORA TO TRANSFORMER
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the
number of trainable parameters. In the Transformer architecture, there are four weight matrices in
the self-attention module (Wq, Wk, Wv, Wo) and two in the MLP module. We treat Wq (or Wk, Wv)
as a single matrix of dimension dmodel ×dmodel, even though the output dimension is usually sliced
into attention heads. We limit our study to only adapting the attention weights for downstream
tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity
and parameter-efﬁciency.We further study the effect on adapting different types of attention weight
matrices in a Transformer in Section 7.1. We leave the empirical investigation of adapting the MLP
layers, LayerNorm layers, and biases to a future work.
Practical Beneﬁts and Limitations.
The most signiﬁcant beneﬁt comes from the reduction in
memory and storage usage. For a large Transformer trained with Adam, we reduce that VRAM

--- Chunk 8 ---
Table 1 | Current LLMs. We show ﬁve of the current largest dense transformer models, their size,
and the number of training tokens. Other than LaMDA (Thoppilan et al., 2022), most models are
trained for approximately 300 billion tokens. We introduce Chinchilla, a substantially smaller model,
trained for much longer than 300B tokens.
Model
Size (# Parameters)
Training Tokens
LaMDA (Thoppilan et al., 2022)
137 Billion
168 Billion
GPT-3 (Brown et al., 2020)
175 Billion
300 Billion
Jurassic (Lieber et al., 2021)
178 Billion
300 Billion
Gopher (Rae et al., 2021)
280 Billion
300 Billion
MT-NLG 530B (Smith et al., 2022)
530 Billion
270 Billion
Chinchilla
70 Billion
1.4 Trillion
2. Related Work
Large language models.
A variety of large language models have been introduced in the last few
years. These include both dense transformer models (Brown et al., 2020; Lieber et al., 2021; Rae
et al., 2021; Smith et al., 2022; Thoppilan et al., 2022) and mixture-of-expert (MoE) models (Du
et al., 2021; Fedus et al., 2021; Zoph et al., 2022). The largest dense transformers have passed 500
billion parameters (Smith et al., 2022). The drive to train larger and larger models is clear—so far

**Demonstration:**
5.2.2.c: List the doc_source for each of the documents in the context for question 2

5.2.2.c:

1. https://arxiv.org/pdf/2106.09685.pdf
2. https://arxiv.org/pdf/2106.09685.pdf
3. https://arxiv.org/pdf/2106.09685.pdf
4. https://arxiv.org/pdf/2305.14314.pdf
5. https://arxiv.org/pdf/2106.09685.pdf
6. https://arxiv.org/pdf/2106.09685.pdf
7. https://arxiv.org/pdf/2106.09685.pdf
8. https://arxiv.org/pdf/2203.15556.pdf

**Demonstration:**
5.2.2.d.i: How well does your model perform relative to the gold answer we provided for engineering on question 2?

5.2.2.d.i:

[('BERTScore Precision', 0.128),
 ('BERTScore Recall', 0.424),
 ('BERTScore F1', 0.273),
 ('Context Recall', 0.750),
 ('Context Precision', 0.976),
 ('Faithfulness', 1.000),
 ('Factual Correctness', 0.130),
 ('Answer Relevancy', 0.957)]

**Demonstration:**
5.2.2.d.ii: How well does your model perform relative to the gold answer we provided for marketing on question 2?

5.2.2.d.ii:

[('BERTScore Precision', 0.323),
 ('BERTScore Recall', 0.489),
 ('BERTScore F1', 0.405),
 ('Context Recall', 1.000),
 ('Context Precision', 0.976),
 ('Faithfulness', 1.000),
 ('Factual Correctness', 0.200),
 ('Answer Relevancy', 0.955)]

#### 5.2.3 Test Question 3

Please run question 109 through your RAG pipeline:

**Demonstration:**
5.2.3.a.i: What is the engineering response for question 3?

5.2.3.a.i:

The context provided does not contain any information about the Marvel Universe, Thanos, Cable, Deadpool 2, or the actors associated with these roles. Therefore, it is impossible to answer the question using the given context. Information about the specific actors and their roles in these films or franchises is entirely missing.

**Demonstration:**
5.2.3.a.ii: What is the marketing response for question 3?

5.2.3.a.ii:

The context provided does not contain information about the actor who plays Thanos in the Marvel Universe or Cable in Deadpool 2. Therefore, I cannot answer the question with the given information.

**Demonstration:**
5.2.3.b: What is the context you passed to the LLM for question 3?

5.2.3_b:

--- Chunk 1 ---
Question
Context
Answer
Who directed the film Op-
penheimer and who stars
as J. Robert Oppenheimer
in the film?
Oppenheimer is a 2023 biographical thriller film written
and directed by Christopher Nolan. Based on the 2005
biography American Prometheus by Kai Bird and Mar-
tin J. Sherwin, the film chronicles the life of J. Robert
Oppenheimer, a theoretical physicist who was pivotal in
developing the first nuclear weapons as part of the Man-
hattan Project, and thereby ushering in the Atomic Age.
Cillian Murphy stars as Oppenheimer, with Emily Blunt
as Oppenheimer’s wife Katherine "Kitty" Oppenheimer.
High Faithfulness: Christopher
Nolan directed the film Oppen-
heimer. Cillian Murphy stars as J.
Robert Oppenheimer in the film.
Low
Faithfulness:
James
Cameron directed the film Op-
penheimer. Tom Cruise stars as J.
Robert Oppenheimer in the film.
Table 2: Example from WikiEval, showing answers with high and low faithfulness.
Question
Answer
When is the scheduled
launch date and time for
the PSLV-C56 mission,
and where will it be
launched from?
High answer relevance: The PSLV-C56 mission is scheduled to be launched on Sunday, 30

--- Chunk 2 ---
where $l$ is the hidden dimension of the bidirectional LSTM module. $\mathbf{W}^g \in \mathbb{R}^{l\times l}$, $\mathbf{b}^g \in \mathbb{R}^l$, and $\mathbf{W}^m \in \mathbb{R}^{2l \times 4l}$ are parameters to learn. The operator $\otimes \mathbf{e}_{d_x}$ is the outer product to repeat the column vector $\mathbf{b}^g$ $d_x$ times.
The ranker and reader components share the same Match-LSTM module with two separate prediction heads in the last layer, resulting in $\mathbf{H}^\text{rank}$ and $\mathbf{H}^\text{reader}$.

The overview of R^3 (reinforced ranker-reader) architecture. Both components share the same Match-LSTM module. (Image source: Wang, et al., 2017)

The retriever runs a max-pooling operation per passage and then aggregates to output a probability of each passage entailing the answer.

$$
\begin{aligned}
\mathbf{u}_i &= \text{max-pooling}(\mathbf{H}^\text{rank}_i) \in \mathbb{R}^l \\
\mathbf{C} &= \text{tanh}(\mathbf{W}^c[\mathbf{u}_1;\dots;\mathbf{u}_N] + \mathbf{b}^c \otimes \mathbf{e}_N) \in \mathbb{R}^{l \times n} \\
\gamma &= \text{softmax}(\mathbf{w}^c \mathbf{C}) \in \mathbb{R}^n
\end{aligned}
$$

--- Chunk 3 ---
Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables
input length extrapolation. In International Conference on Learning Representations, 2022. URL
https://openreview.net/forum?id=R8sQPpGCv0.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lilli-
crap. Compressive transformers for long-range sequence modelling. In International Confer-
ence on Learning Representations, 2020. URL https://openreview.net/forum?id=
SylKikSYDH.
13

--- Chunk 4 ---
The model is trained from scratch for five days on 256 A100
GPUs. Our implementation is in PyTorch (Paszke et al.,
2019) using Metaseq (Zhang et al., 2022). We use model
parallelism over 4 GPUs and a batch size of 16 sequences
per GPU. The optimization uses a linear learning rate decay
with 1500 warmup steps, a peak learning rate of 1e-4, a
gradient clipping of 1.0, and the Adam optimizer with β1 =
0.9, β2 = 0.98 (Kingma & Ba, 2015).
Approach
CIDEr (↑)
Retrieval Baseline
84.1
DALL-ESmall (Wang, 2021)
20.2
ruDALL-E-XL (Forever, 2021)
38.7
minDALL-E (Kim et al., 2021)
48.0
X-LXMERT (Cho et al., 2020)
55.8
Parti (Yu et al., 2022)
83.9
Flamingo (3B; 4-shot) (Alayrac et al., 2022)
85
Flamingo (80B; 4-shot) (Alayrac et al., 2022)
103
Vanilla CM3
71.9
RA-CM3 (2.7B) (Ours)
89.1
Table 3. Image-to-caption generation performance on MS-
COCO (with no finetuning). Our retrieval-augmented CM3 signif-
icantly outperforms the baseline CM3 with no retrieval. Moreover,
our model outperforms other strong models such as Parti (20B
parameters) and Flamingo (3B; 4-shot), despite using just ∼3B
parameters and 2-shot in-context examples.
Baseline.
For our baseline, we train a vanilla CM3 with no

--- Chunk 5 ---
the dataset license in the paper or a repository. The
rest is over view as follows:
• Apache License 2.0 license: GOV2, TREC-
9https://quoradata.quora.com/
First-Quora-Dataset-Release-Question-Pairs
10https://huggingface.co/BeIR
11https://creativecommons.org/licenses/by-sa/4.
0/
Robust, CODEC, CNN/DM
• MIT license: TREC-CAsT, GECOR, ORCAS-
I, MIMICS, MIMICS-Duo, XSum
• CC BY 4.0: Query2Doc, MSRP, SQuAD
• CC BY-SA 4.0: CANARD, BEIR, HotpotQA,
QuAC
• CC BY-SA 3.0: QReCC, FEVER, BoolQ
• CC BY-NC 2.0: SciFact
• Provided under the “Dataset License Agree-
ment”: TREC-COVID, Multi-News, MS MARCO
Note that CoQA contains several datasets under
different licenses. They are listed on the Hugging-
Face page.12
B
In-domain Evaluation Details
In this evaluation, we split the full dataset into
training, validation, and test sets. The split pro-
cess is designed based on the size and structure of
the original datasets. Specifically, if the original
datasets do not contain a test set, then: For original
datasets with over 10,400 samples, we randomly se-
lect 10,000 samples for constructing training data,
200 samples for validation, and 200 samples for

--- Chunk 6 ---
compared to CNNs1.
1A stack of multiple CNN layers can also capture longer intra-token relation, here we only consider single layer setting.
arXiv:2104.09864v5 [cs.CL] 8 Nov 2023

--- Chunk 7 ---
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Confer-
ence on Learning Representations, 2019. URL https://openreview.net/forum?id=
Bkg6RiCqY7.
Pedro Henrique Martins, Zita Marinho, and Andr´e F. T. Martins. ∞-former: Infinite memory trans-
former. 2021.
Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context
length for transformers. arXiv preprint arXiv:2305.16300, 2023.
Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens.
2023.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K¨opf, Ed-
ward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner,
Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance
Deep Learning Library. Curran Associates Inc., Red Hook, NY, USA, 2019.
Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables
input length extrapolation. In International Conference on Learning Representations, 2022. URL

--- Chunk 8 ---
$$
\arg\min_{\mathbf{e}'_i \in \mathcal{V}} [\mathbf{e}'_i - \mathbf{e}_i]^\top \nabla_{\mathbf{e}_i} \mathcal{L}_\text{adv}
$$

where $\mathcal{V}$ is the embedding matrix of all the tokens. $\nabla_{\mathbf{e}_i} \mathcal{L}_\text{adv}$ is the average gradient of the task loss over a batch around the current embedding of the $i$-th token in the adversarial triggering sequence $\mathbf{t}$. We can brute-force the optimal $\mathbf{e}’_i$ by a big dot product of size embedding of the entire vocabulary $\vert \mathcal{V} \vert$ $\times$ the embedding dimension $d$. Matrix multiplication of this size is cheap and can be run in parallel.
AutoPrompt (Shin et al., 2020) utilizes the same gradient-based search strategy to find the most effective prompt template for a diverse set of tasks.
The above token search method can be augmented with beam search. When looking for the optimal token embedding $\mathbf{e}’_i$, we can pick top-$k$ candidates instead of a single one, searching from left to right and score each beam by $\mathcal{L}_\text{adv}$ on the current data batch.

Illustration of how Universal Adversarial Triggers (UAT) works. (Image source: Wallace et al. 2019)

**Demonstration:**
5.2.3.c: List the doc_source for each of the documents in the context for question 3

5.2.3.c:

1. https://arxiv.org/pdf/2309.15217.pdf
2. https://lilianweng.github.io/posts/2020-10-29-odqa/
3. https://arxiv.org/pdf/2306.15595.pdf
4. https://arxiv.org/pdf/2211.12561.pdf
5. https://arxiv.org/pdf/2401.06532.pdf
6. https://arxiv.org/pdf/2104.09864.pdf
7. https://arxiv.org/pdf/2306.15595.pdf
8. https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/

**Demonstration:**
5.2.4.a: For questions 1 and 2, how well do you feel your metrics captured the differences and similarities between your answer and the gold answer?

5.2.4.a: Enter five sentence answer here.

The metrics captured aspects of answer quality but with some limitations. BERTScore effectively measured semantic similarity, with higher scores for marketing (0.358 F1) reflecting closer alignment to concise gold answers versus engineering (0.306 F1) whose longer responses introduced additional content not in gold answers. Faithfulness scores (1.0 for engineering, 0.857 for marketing) accurately showed how well responses stayed grounded in the context, as we can see that engineering's strict adherence to the material was correctly rewarded. However, factual correctness (0.29-0.36) may underestimate actual accuracy since both responses contain correct information that's simply phrased differently than gold answers. Context precision (0.347) appropriately flagged that only ~3 of 8 retrieved chunks were highly relevant. Overall, the multi-metric approach provided a more complete picture than any single metric would—BERTScore alone would miss grounding issues, while faithfulness alone would miss semantic quality.

However, if time permitted I would supplement metrics with further manually review.

### 5.3 Other Questions

Below are a few questions that you should think about. Please answer them in the text cells directly (in a short paragraph) and also see whether they may be relevant for your final write-up.

**Demonstration:**
5.3.a. How would you expect your response quality to change if you had a chunk size of 50?

==== ENTER YOUR 5.3.a.

A chunk size of 50 characters would severely degrade response quality. Each chunk would contain only ~8-10 words—far too small to capture meaningful content. Initially, I tested my data on the default chunk size provided (128) and retrieved useless information that was often cut off mid-sentence. Additionally, we'd need many more chunks (higher k) to gather sufficient context.

**Demonstration:**
5.3.b. How would you expect your response quality to change if you had a chunk size of 5000?

==== ENTER YOUR 5.3.b. ANSWER IN THIS TEXT BLOCK.

A chunk size of 5000 characters would reduce precision while improving recall. Each chunk would contain ~800-1000 words, meaning relevant information gets bundled with irrelevant content. The LLM would receive more context but struggle to identify the specific relevant portions, potentially leading to verbose or unfocused responses. Context precision would drop since chunks would be "partially relevant" rather than highly targeted. However, for complex questions requiring broader context, larger chunks might help by keeping related concepts together. Overall I anticipate performance would drop dramatically.

**Demonstration:**
5.3.c. If you had time, how do you think fine-tuning of the LLM could help? What type of data would you want for that? And which training approach would you take?

==== ENTER YOUR 5.3.c. ANSWER IN THIS TEXT BLOCK.

Fine-tuning could help in two ways: adapting output style to match persona requirements (concise marketing vs detailed engineering), and mproving domain-specific terminology understanding for LLM/NLP content. I would use QLoRA (4-bit quantization with Low-Rank Adaptation) for parameter-efficient fine-tuning, requiring less compute while preserving base model capabilities. Training data would include ~500-1000 question-answer pairs matching our gold answer format—concise 2-3 sentence marketing responses and detailed 4-5 sentence engineering responses grounded in retrieved context. This would reduce reliance on prompt engineering for length/style control.

**Demonstration:**
5.3.d. What was your design philosophy of the prompts? How did they differ between engineering and marketing support?

==== ENTER YOUR 5.3.d. ANSWER IN THIS TEXT BLOCK.

The core design philosophy was grounding + conciseness. Both prompts include "Using ONLY the context provided" to prevent hallucination and ensure verifiable answers. The key difference is output: marketing prompts request 2-3 sentences (250 chars) in accessible, jargon-free language focused on business value; engineering prompts request 4-5 sentences (500 chars) with technical precision and implementation details. (as confirmed by the different average lengths in the gold standard answers) Both prompts instruct models to acknowledge missing information rather than hallucinate when context is insufficient.

**Demonstration:**
5.3.e. What are your average and peak load estimates for the system? Given that, would you suggest a pay-per-use deployment or one that reserves the LLM?

==== ENTER YOUR 5.3.e.

I would guess average load of 30-100 queries/day during normal operations (depending on popularity of the tool), but peak loads of 200+ queries during product launches or documentation sprints. Given this bursty, unpredictable usage pattern, pay-per-use deployment is recommended. Reserved LLM infrastructure would sit idle most of the time, wasting budget. Pay-per-use scales automatically for peaks and costs nothing during quiet periods.

**Demonstration:**
5.3.f. What type of limitations/risks would you see in using this system?

==== ENTER YOUR 5.3.f.

Key limitations:
1. Retrieval failures—technical queries may not retrieve relevant chunks despite content existing in the knowledge base (as seen with Q2 architectures at 25% precision).
2. No real-time updates—the system cannot answer questions about information added after the last index update.
3. Hallucination risk—despite grounding instructions, models occasionally generate plausible-sounding but unverifiable claims (marketing faithfulness dropped to 0.65 on full validation).
4. Verbosity variance—response length still varies despite prompt constraints.
5. Latency and cost—retrieval + LLM inference adds 2-4 seconds per query, and API costs scale with usage.
6. Not production-ready—current performance - the tool requires huge optimization before deployment. Further work needed on embedding model selection, chunk size tuning, and potentially fine-tuning to meet production quality standards.

### Link to your 4-5 page pdf report

Enter a link to your report that you've stored in your Google Drive. Put it in a folder and give edit permissions to the following people:

mhbutler@.edu

penpen1986@.edu

peterg@.edu

cornelia.ilin@.edu

rahmfeld@.edu

makahmad@.edu

timtung@.edu

https://drive.google.com/file/d/18fn-hHzR8OqT74prrN2Kg0QJO058d35L/view?usp=sharing