# Problem Statement

## Business Context

As organizations grow, they are often inundated with large volumes of data, reports, and documents that contain critical information for decision-making. In real-world business settings, such as venture capital firms like Andreessen Horowitz, business analysts must sift through large datasets, research papers, and reports to extract the information that shapes strategic decisions.

For instance, suppose you've just joined Andreessen Horowitz, a renowned venture capital firm, and you are tasked with analyzing a dense report like the Harvard Business Review's **"How Apple is Organized for Innovation."** Going through such reports manually becomes extremely time-consuming as their size and complexity increase. However, by using **Semantic Search** and **Retrieval-Augmented Generation (RAG)** models, you can significantly streamline this process.

Imagine having the capability to directly ask questions like, “How does Apple structure its teams for innovation?” and get immediate, relevant answers drawn from the report. This ability to extract and organize specific insights quickly and accurately enables you to focus on higher-level analysis and decision-making, rather than being bogged down by information retrieval.

## Objective

Develop a RAG application to help business analysts efficiently extract key insights from extensive reports, such as **How Apple is Organized for Innovation**, enabling faster and more informed decision-making.

## Data Description

**How Apple is Organized for Innovation** - An 11-page article in PDF format

# Installing and Importing Necessary Libraries and Dependencies

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [9]:
#Libraries for processing dataframes,text
import json,os
import tiktoken
import pandas as pd

#Libraries for Loading Data, Chunking, Embedding, and Vector Databases
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Data Preparation for RAG

## Loading the Data

In [10]:
apple_pdf_path = "HBR_How_Apple_Is_Organized_For_Innovation-4.pdf"

In [12]:
pdf_loader = PyMuPDFLoader(apple_pdf_path)
pdf_loader

<langchain_community.document_loaders.pdf.PyMuPDFLoader at 0x757f47024470>

In [13]:
apple = pdf_loader.load()

## Data Overview

#### Checking the first 3 pages

In [14]:
for i in range(3):
    print(f"Page Number : {i+1}",end="\n")
    print(apple[i].page_content,end="\n")

Page Number : 1
REPRINT R2006F
PUBLISHED IN HBR
NOVEMBER–DECEMBER 2020
ARTICLE
ORGANIZATIONAL CULTURE
How Apple Is 
Organized  
for Innovation
It’s about experts leading experts. 
by Joel M. Podolny and Morten T. Hansen
This article is made available to you with compliments of Apple Inc for your personal use. Further posting, copying or distribution is not permitted.
Page Number : 2
2
Harvard Business Review
November–December 2020
This article is made available to you with compliments of Apple Inc for your personal use. Further posting, copying or distribution is not permitted.
Page Number : 3
PHOTOGRAPHER MIKAEL JANSSON
How Apple Is 
Organized 
for Innovation
It’s about experts 
leading experts.
ORGANIZATIONAL 
CULTURE
Joel M. 
Podolny
Dean, Apple 
University
Morten T. 
Hansen
Faculty, Apple 
University
AUTHORS
FOR ARTICLE REPRINTS CALL 800-988-0886 OR 617-783-7500, OR VISIT HBR.ORG
Harvard Business Review
November–December 2020  3
This article is made available to you with compliment

Consider the sixth page of the document (index 5 in the loaded list).  
- It contains shapes, text, and other elements.  

Let's see how the text is extracted.

In [17]:
apple[5].page_content

'targets were the overriding criteria for judging investments \nand leaders. Significantly, the bonuses of senior R&D exec-\nutives are based on companywide performance numbers \nrather than the costs of or revenue from particular products. \nThus product decisions are somewhat insulated from short-\nterm financial pressures. The finance team is not involved in \nthe product road map meetings of engineering teams, and \nengineering teams are not involved in pricing decisions.\nWe don’t mean to suggest that Apple doesn’t consider \ncosts and revenue goals when deciding which technologies \nand features the company will pursue. It does, but in ways \nthat differ from those employed by conventionally organized \ncompanies. Instead of using overall cost and price targets as \nfixed parameters within which to make design and engineer-\ning choices, R&D leaders are expected to weigh the benefits \nto users of those choices against cost considerations.\nIn a functional organization, individua

* If we observe the text closely, the text is not extracted sequentially.  

* This is a limitation, as we are missing coherent text.
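The extracted text shown above also carries layout artifacts such as words hyphenated across line breaks (`exec-\nutives`) and soft hyphens (`\xad`). The following is a minimal sketch of a clean-up step, assuming a hypothetical helper named `clean_extracted_text` that is not part of the original notebook. It does not repair reading order, and it would also merge genuine compounds that happen to break at a line end (for example `short-\nterm`).

```python
import re

def clean_extracted_text(text: str) -> str:
    """Tidy raw PDF text: drop soft hyphens, rejoin split words, flatten newlines."""
    # Soft hyphens (U+00AD) are invisible hyphenation hints left by the PDF layout.
    text = text.replace("\u00ad", "")
    # Rejoin words split across lines, e.g. "exec-\nutives" -> "executives".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse remaining line breaks (and surrounding spaces) into single spaces.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()

sample = "the bonuses of senior R&D exec-\nutives are based on companywide \nperformance"
print(clean_extracted_text(sample))
```

In the notebook, this could be applied to `apple[5].page_content` before chunking, if cleaner chunks are desired.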



#### Checking the number of pages

In [18]:
len(apple)

11

## Data Chunking

In [24]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap= 20
)

In [26]:
text_splitter

<langchain_text_splitters.character.RecursiveCharacterTextSplitter at 0x757e1f12e630>

In [27]:
document_chunks = pdf_loader.load_and_split(text_splitter)

In [28]:
len(document_chunks)

25

In [32]:
document_chunks[0].page_content

'REPRINT R2006F\nPUBLISHED IN HBR\nNOVEMBER–DECEMBER 2020\nARTICLE\nORGANIZATIONAL CULTURE\nHow Apple Is \nOrganized  \nfor Innovation\nIt’s about experts leading experts. \nby Joel M. Podolny and Morten T. Hansen\nThis article is made available to you with compliments of Apple Inc for your personal use. Further posting, copying or distribution is not permitted.'

In [33]:
document_chunks[-2].page_content

'the answer (because they don’t). This differs starkly from \nthe way leaders question subordinates about activities in the \nowning and teaching boxes.\nFinally, Rosner has delegated some areas—including \niMovie and GarageBand, in which he is not an expert—to \npeople with the requisite capabilities. For activities in the \ndelegating box, he assembles teams, agrees on objectives, \nmonitors and reviews prog\xadress, and holds the teams account-\nable: the stuff of general management.\nWhereas Apple’s VPs spend most of their time in the own-\ning and learning boxes, general managers at other companies \ntend to spend most of their time in the delegating box. Rosner \nestimates that he spends about 40% of his time on activities \nhe owns (including collaboration with others in a given area), \nabout 30% on learning, about 15% on teaching, and about 15% \non delegating. These numbers vary by manager, of course, \ndepending on their business and the needs at a given time.\nThe discretio

In [31]:
document_chunks[-1].page_content

'to cultivate the experts-leading-experts model even within \na business unit structure. For example, when filling the next \nsenior management role, pick someone with deep expertise \nin that area as opposed to someone who might make the best \ngeneral manager. But a full-fledged transformation requires \nthat leaders also transition to a functional organization. \nApple’s track rec\xadord proves that the rewards may justify the \nrisks. Its approach can produce extraordinary results.\u2002\nHBR Reprint R2006F\nJOEL M. PODOLNY is a vice president of Apple and the dean  \nof Apple University. Prior to joining Apple, in 2009, he was  \nthe dean of the Yale School of Management and on the faculty of \nHarvard’s and Stanford’s business schools. MORTEN T. HANSEN  \nis a member of Apple University’s faculty and a professor at the \nUniversity of California, Berkeley. He was formerly on the faculties  \nof Harvard Business School and INSEAD.\nFOR ARTICLE REPRINTS CALL 800-988-0886 OR 617-783

As expected, there are some overlaps:  

- The sentence '*to cultivate the experts-leading-experts model even within*' appears in both chunks.  
- If we increase the `chunk_overlap`, the overlapping length of the sentence will also increase.
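To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch over a list of tokens. It is an illustration only: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` uses a different strategy that also prefers to split on separators such as paragraph and line breaks.

```python
def sliding_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks

for chunk in sliding_chunks(list(range(10)), chunk_size=4, overlap=1):
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so a larger overlap means more repeated text and more chunks for the same document.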

## Embedding

In [34]:
embedding_model = SentenceTransformerEmbeddings(model_name='thenlper/gte-large')



In [36]:
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)

In [37]:
print("Dimension of the embedding vector ",len(embedding_1))
len(embedding_1)==len(embedding_2)

Dimension of the embedding vector  1024


True

* The embedding model maps every chunk, whatever its length, to a fixed-length vector (1024 dimensions here).  
* This is necessary because we want to compare them for similarity.
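As an illustration of how two fixed-length vectors can be compared, the sketch below computes cosine similarity in plain Python. The function name `cosine_similarity` is introduced here for illustration, and the metric that the vector store actually uses internally may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

With the vectors computed above, `cosine_similarity(embedding_1, embedding_2)` would give a similarity score for the first two chunks. The length check is the reason the fixed dimension matters.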

## Vector Database

In [38]:
out_dir = 'apple_db'

if not os.path.exists(out_dir):
  os.makedirs(out_dir)

In [None]:
# Creates a new Chroma database
vectorstore = Chroma.from_documents(
    document_chunks,
    embedding_model,
    persist_directory=out_dir
)

In [40]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x757e0a7cd5e0>

In [41]:
# Loads an existing Chroma database from the directory
vectorstore = Chroma(persist_directory=out_dir,embedding_function=embedding_model)



In [42]:
vectorstore.embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='thenlper/gte-large', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [43]:
vectorstore.similarity_search("Apple Steve Jobs iPhone ",k=3)

[Document(metadata={'trapped': '', 'creator': 'Adobe InDesign 14.0 (Macintosh)', 'keywords': '', 'producer': 'Adobe PDF Library 15.0 (via http://bfo.com/products/pdf?version=2.23.5-r33279)', 'total_pages': 11, 'file_path': 'HBR_How_Apple_Is_Organized_For_Innovation-4.pdf', 'author': '', 'format': 'PDF 1.6', 'modDate': 'D:20201201183749Z', 'encryption': 'Standard V2 R3 128-bit RC4', 'page': 4, 'creationDate': "D:20201005141842-04'00'", 'creationdate': '2020-10-05T14:18:42-04:00', 'source': 'HBR_How_Apple_Is_Organized_For_Innovation-4.pdf', 'title': '', 'moddate': '2020-12-01T18:37:49+00:00', 'subject': ''}, page_content='WHY A FUNCTIONAL ORGANIZATION?\nApple’s main purpose is to create products that enrich \npeople’s daily lives. That involves not only developing \nentirely new product categories such as the iPhone and the \nApple Watch, but also continually innovating within those \ncategories. Perhaps no product feature better reflects Apple’s \ncommitment to continuous innovation tha

* From the retrieved chunks, we observe that all the chunks are related to the key terms [ 'Apple', 'Steve Jobs', 'iPhone' ].

## Retriever

In [44]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 2}
)

In [45]:
rel_docs = retriever.get_relevant_documents("How does Apple develop and ship products that requires good coordination between the teams?")
rel_docs



[Document(metadata={'title': '', 'encryption': 'Standard V2 R3 128-bit RC4', 'modDate': 'D:20201201183749Z', 'producer': 'Adobe PDF Library 15.0 (via http://bfo.com/products/pdf?version=2.23.5-r33279)', 'trapped': '', 'creationDate': "D:20201005141842-04'00'", 'source': 'HBR_How_Apple_Is_Organized_For_Innovation-4.pdf', 'creationdate': '2020-10-05T14:18:42-04:00', 'format': 'PDF 1.6', 'moddate': '2020-12-01T18:37:49+00:00', 'subject': '', 'author': '', 'total_pages': 11, 'keywords': '', 'page': 7, 'creator': 'Adobe InDesign 14.0 (Macintosh)', 'file_path': 'HBR_How_Apple_Is_Organized_For_Innovation-4.pdf'}, page_content='40 specialist teams: silicon design, camera software, reliabil-\nity engineering, motion sensor hardware, video engineering, \ncore motion, and camera sensor design, to name just a few. \nHow on earth does Apple develop and ship products that \nrequire such coordination? The answer is collaborative \ndebate. Because no function is responsible for a product or a \nservic

In [46]:
rel_docs[0].page_content

'40 specialist teams: silicon design, camera software, reliabil-\nity engineering, motion sensor hardware, video engineering, \ncore motion, and camera sensor design, to name just a few. \nHow on earth does Apple develop and ship products that \nrequire such coordination? The answer is collaborative \ndebate. Because no function is responsible for a product or a \nservice on its own, cross-functional collaboration is crucial.\nWhen debates reach an impasse, as some inevitably do, \nhigher-level managers weigh in as tiebreakers, including at \ntimes the CEO and the senior VPs. To do this at speed with \nsufficient attention to detail is challenging for even the best \nof leaders, making it all the more important that the company \nfill many senior positions from within the ranks of its VPs, \nwho have experience in Apple’s way of operating.\nHowever, given Apple’s size and scope, even the executive \nteam can resolve only a limited number of stalemates. The \nmany horizontal dependencie

In [47]:
rel_docs[1].page_content

'Apple is run. Leaders can push, probe, and “smell” an issue. \nThey know which details are important and where to focus \ntheir attention. Many people at Apple see it as liberating, \neven exhilarating, to work for experts, who provide better \nguidance and mentoring than a general manager would. \nTogether, all can strive to do the best work of their lives in \ntheir chosen area.\nWillingness to collaboratively debate. Apple has \nhundreds of specialist teams across the company, dozens of \nwhich may be needed for even one key component of a new \nproduct offering. For example, the dual-lens camera with \nportrait mode required the collaboration of no fewer than  \nApple leaders are expected to possess deep expertise, be immersed \nin the details of their functions, and engage in collaborative debate.\nORGANIZATIONAL \nCULTURE\nFOR ARTICLE REPRINTS CALL 800-988-0886 OR 617-783-7500, OR VISIT HBR.ORG\nHarvard Business Review\nNovember–December 2020 \u20097\nThis article is made availa

- We can observe that the two relevant chunks contain the answer to the query.  
- If we increase the **`k`** value, there is a chance that we might find the answer in even more chunks.  
- This is a hyperparameter that we need to tune to get the best context.
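The role of `k` can be sketched with a brute-force top-k search over toy vectors. This is a conceptual illustration under stated assumptions: `top_k_indices` and the 2-D vectors are made up for the example, and the vector store's own index and distance metric may work differently.

```python
import math

def top_k_indices(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms

    # Rank every document by similarity to the query, best first.
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]  # stand-ins for chunk embeddings
toy_query = [0.9, 0.1]
print(top_k_indices(toy_query, toy_docs, k=2))
```

Raising `k` only ever appends lower-ranked chunks, so it adds recall at the cost of a longer and possibly noisier context.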

# Defining the Response Generator

## Downloading and Loading the model

In [48]:
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q6_K.gguf"
)

print("Downloaded to:", local_path)


Downloaded to: /home/dell/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q6_K.gguf


In [35]:
#uncomment the below snippet of code if the runtime is connected to GPU.
# llm = Llama(
#     model_path=local_path,
#     n_ctx=5000,
#     n_gpu_layers=38,
#     n_batch=512
# )

In [49]:
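# Use this snippet if the runtime is CPU only.
# n_threads sets the number of CPU threads; n_cores is not a Llama parameter.
llm = Llama(
   model_path=local_path,
   n_ctx=1024,
   n_threads=max(1, (os.cpu_count() or 1) - 2)  # leaves 2 cores free
)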

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /home/dell/.cache/huggingface/hub/models--TheBloke--Mistral-7B-Instruct-v0.2-GGUF/snapshots/3a6fbf4a41a1d52e415a4958cde6856d34b2db93/mistral-7b-instruct-v0.2.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_

llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 18
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model

In [50]:
llm = Llama(
    model_path=local_path,
    n_ctx=4096,
    n_threads=max(1, os.cpu_count() - 2)  # leaves 2 cores free
)

In [51]:
llm("How does Apple develop and ship products that requires good coordination between the teams?")['choices'][0]['text']

llama_perf_context_print:        load time =    1145.35 ms
llama_perf_context_print: prompt eval time =    1145.24 ms /    17 tokens (   67.37 ms per token,    14.84 tokens per second)
llama_perf_context_print:        eval time =    3625.31 ms /    15 runs   (  241.69 ms per token,     4.14 tokens per second)
llama_perf_context_print:       total time =    4774.86 ms /    32 tokens
llama_perf_context_print:    graphs reused =         14


' The answer is Agile Development and Scrum methodology.\n\nIn this'

- The response is generic and is not drawn from the report: without any context, the model answers from its general training knowledge. Let's provide our own context and align the response with our needs.

## System and User Prompt Template

Prompts guide the model to generate accurate responses. Here, we define two parts:

1. The system message describing the assistant's role.
2. A user message template including context and the question.

In [38]:
qna_system_message = """
You are an assistant whose work is to review the report and provide the appropriate answers from the context.
User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the context, respond "I don't know".
"""

In [39]:
qna_user_message_template = """
###Context
Here are some documents that are relevant to the question mentioned below.
{context}

###Question
{question}
"""

## Response Function

In [40]:
def generate_rag_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    global qna_system_message,qna_user_message_template
    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + '\n' + user_message

    # Generate the response
    try:
        response = llm(
                  prompt=prompt,
                  max_tokens=max_tokens,
                  temperature=temperature,
                  top_p=top_p,
                  top_k=top_k
                  )

        # Extract the generated text from the model's response
        response = response['choices'][0]['text'].strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response
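The prompt assembly inside the function can be isolated into a small helper, which makes it easier to inspect the exact text sent to the model. The sketch below is one possible variant, not the notebook's implementation: `build_prompt` and the demo template are hypothetical, and wrapping the text in `[INST] ... [/INST]` tags (the instruction format used by Mistral instruct models) is an assumption that differs from the function above, which sends the raw concatenated text.

```python
def build_prompt(system_message, user_template, context_chunks, question):
    """Fill the user template and wrap everything in Mistral-style instruction tags."""
    # Separate chunks with blank lines so their boundaries stay visible to the model.
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    user_message = user_template.format(context=context, question=question)
    return f"[INST]{system_message.strip()}\n\n{user_message.strip()}[/INST]"

# Hypothetical stand-ins for the notebook's system message and template.
demo_template = "###Context\n{context}\n\n###Question\n{question}"
prompt = build_prompt(
    "Answer only from the context.",
    demo_template,
    ["Chunk A", "Chunk B"],
    "Who wrote it?",
)
print(prompt)
```

Printing the prompt before calling the model is a cheap way to check that the context and question landed in the right place.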

# Question Answering using RAG

### Query 1: Who are the authors of this article and who published this article ?

In [35]:
user_input = "Who are the authors of this article and who published this article ?"
print(generate_rag_response(user_input))

Llama.generate: prefix-match hit


Answer:
Morten T. Hansen and Joel M. Podolny authored this article and Harvard Business Review published it.


In [51]:
user_input = "Who are the authors of this article and who published this article ?"
print(generate_rag_response(user_input))

Llama.generate: 452 prefix-match hit, remaining 1 prompt tokens to eval
llama_perf_context_print:        load time =    1175.68 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =    8648.28 ms /    35 runs   (  247.09 ms per token,     4.05 tokens per second)
llama_perf_context_print:       total time =    8659.14 ms /    36 tokens
llama_perf_context_print:    graphs reused =         33


Answer:
The authors of the article are Joel M. Podolny and Morten T. Hansen. The article was published by Harvard Business Review.


- The answer is clear, concise, and focused, without any unnecessary information.  

- For queries like this, we expect a response of this nature.

### Query 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [52]:
user_input_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
generate_rag_response(user_input_2)

Llama.generate: 145 prefix-match hit, remaining 1694 prompt tokens to eval
llama_perf_context_print:        load time =    1175.68 ms
llama_perf_context_print: prompt eval time =  138605.39 ms /  1694 tokens (   81.82 ms per token,    12.22 tokens per second)
llama_perf_context_print:        eval time =   22069.16 ms /    81 runs   (  272.46 ms per token,     3.67 tokens per second)
llama_perf_context_print:       total time =  160704.59 ms /  1775 tokens
llama_perf_context_print:    graphs reused =         78


"- Deep expertise\n  * Apple is a company where experts lead experts.\n  * The assumption is that it's easier to train an expert to manage well than to train a manager to be an expert.\n\n- Immersion in the details\n  * Leaders should know the details of their organization three levels down.\n  * This demands extreme precision in manufacturing and production processes."

- The response contains only two leadership characteristics, but they are well explained.  
- Perhaps if we increase the **`max_tokens`**, we might get the third characteristic as well (assuming it is in the document).
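Before raising `max_tokens`, it helps to check that the prompt and the generated tokens still fit in the context window, since both share `n_ctx`. The sketch below uses hypothetical helper names; the prompt size is taken from the log line above (145 cached tokens plus 1694 newly evaluated tokens).

```python
def remaining_generation_budget(prompt_tokens, n_ctx):
    """Tokens left for generation once the prompt occupies part of the window."""
    return max(0, n_ctx - prompt_tokens)

def fits_context(prompt_tokens, max_tokens, n_ctx):
    """True if the prompt plus the requested generation length fits in n_ctx."""
    return prompt_tokens + max_tokens <= n_ctx

prompt_tokens = 145 + 1694  # prefix-match hit + remaining prompt tokens from the log

print(remaining_generation_budget(prompt_tokens, n_ctx=4096))
print(fits_context(prompt_tokens, max_tokens=128, n_ctx=4096))
print(fits_context(prompt_tokens, max_tokens=128, n_ctx=1024))
```

Under these numbers, a three-chunk prompt fits comfortably in a 4096-token window but would not fit in a 1024-token one.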

### Query 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [37]:
user_input_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
generate_rag_response(user_input_3)

Llama.generate: prefix-match hit


"I don't know. The context does mention that Apple's functional organization and leadership model have played a crucial role in the company's innovation success, but it doesn't provide specific examples of how this has manifested in successful innovations."

- The system prompt explicitly instructs the model not to answer a query that cannot be derived from the context.  

- As expected, the model followed that instruction here and declined to answer rather than hallucinating a response.

## Tuning Generation Parameters

### Query 1: Who are the authors of this article and who published this article ?

In [38]:
user_input = "Who are the authors of this article and who published this article ?"
generate_rag_response(user_input, max_tokens=100)

Llama.generate: prefix-match hit


'Answer:\nMorten T. Hansen and Joel M. Podolny authored this article and Harvard Business Review published it.'

- **`max_tokens`** is only an upper limit. Even when it is set to 100, the model generates fewer tokens because the query can be answered briefly.  

- The temperature of 0 may also play a part, as it makes the model more deterministic and less creative.
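To make the temperature remark concrete, the sketch below shows how temperature reshapes a next-token distribution. It is a simplified illustration with made-up logits, not the sampler that llama.cpp actually runs; `softmax_with_temperature` is a hypothetical helper, and treating a temperature of 0 as a pure argmax is a convention adopted for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case: all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

As the temperature falls, the top token dominates and repeated runs become more alike. Output length, by contrast, is governed by when the model emits its end-of-sequence token or reaches `max_tokens`.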

### Query 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [39]:
user_input_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
generate_rag_response(user_input_2, temperature=0.1, max_tokens=350)

Llama.generate: prefix-match hit


"- Deep expertise: Apple's managers are expected to possess deep expertise that allows them to meaningfully engage in all the work being done within their individual functions. The assumption is that it's easier to train an expert to manage well than to train a manager to be an expert. At Apple, experts lead experts and this approach cascades down all levels of the organization.\n- Immersion in the details: Leaders at Apple are expected to know the details of their organization three levels down for effective cross-functional decision-making at the highest levels. Managers attend decision-making meetings with the details at their disposal, and if not, the decision must either be made without the details or postponed. Apple's leaders pay extreme attention to the exact shape of products' rounded corners and demand extremely precise manufacturing tolerances to produce millions of products with continuous curves."

- Compared with the previous case, increasing **`max_tokens`** produced fuller explanations, but the output shown above still lists only two of the three characteristics.

### Query 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [53]:
user_input_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
generate_rag_response(user_input_3, top_p=0.98, top_k=20, max_tokens=256)

Llama.generate: 145 prefix-match hit, remaining 1704 prompt tokens to eval
llama_perf_context_print:        load time =    1175.68 ms
llama_perf_context_print: prompt eval time =  131397.12 ms /  1704 tokens (   77.11 ms per token,    12.97 tokens per second)
llama_perf_context_print:        eval time =   11819.82 ms /    45 runs   (  262.66 ms per token,     3.81 tokens per second)
llama_perf_context_print:       total time =  143233.20 ms /  1749 tokens
llama_perf_context_print:    graphs reused =         42


"I don't know. The article discusses Apple's functional organization and how it has contributed to the company's innovation success, but it does not provide specific examples of innovations that resulted from this approach."

- Since the context provided doesn't help with the query, the model has responded correctly based on the prompt design.  

- However, the relevant passages might simply not be among the top **`k`** retrieved chunks. Therefore, it is better to experiment with higher values of **`k`** and check.
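Note that the `top_k` and `top_p` arguments in the call above are sampling settings and are unrelated to the retrieval `k`. The sketch below illustrates the general idea of top-k and nucleus (top-p) filtering on a made-up distribution. It is a simplified illustration with a hypothetical helper name, not llama.cpp's sampler. Also, with the temperature left at its default of 0 in `generate_rag_response`, decoding is effectively greedy, so these two settings are likely to have little effect on the answer.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    # Indices of the top_k highest-probability tokens, best first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # the nucleus now covers at least top_p of the mass
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {i: probs[i] / total for i in kept}

toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token probabilities
print(filter_top_k_top_p(toy_probs, top_k=3, top_p=0.7))
```

Sampling then draws from the surviving tokens only, which trims the unlikely tail of the distribution.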

# Output Evaluation

Let us now use the LLM-as-a-judge method to check the quality of the RAG system on two parameters: retrieval and generation. We illustrate this evaluation using the answers generated for the questions from the previous section.

- We are using the same Mistral model for evaluation, so here the LLM is effectively rating how well it has performed the task itself.
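The rater prompts used in this section ask the model to reason step by step and then assign a score from 1 to 5, so the number has to be pulled out of free-form text if it is to be aggregated. The following is a minimal sketch, assuming the rater writes something like "Score: 4" or "the score is 4". Both `extract_score` and that phrasing are assumptions, and real rater output may need a more tolerant parser.

```python
import re

def extract_score(rating_text):
    """Return the last 1-5 score mentioned after the word 'score', or None."""
    # Allow a few non-digit characters between "score" and the number (": ", " is ", ...).
    matches = re.findall(r"score\D{0,20}?([1-5])\b", rating_text, flags=re.IGNORECASE)
    return int(matches[-1]) if matches else None

print(extract_score("The answer uses only the context. Therefore, the score is 4."))
print(extract_score("No rating was produced."))
```

Taking the last match favors the final verdict over any scores mentioned earlier while the rater restates the criteria.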

### Defining the Evaluation Prompts

In [54]:
groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""

In [55]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.
"""

In [56]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""
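
To see what a rater actually receives, the template can be filled and wrapped in the `[INST]` markers used elsewhere in this notebook. The following is a minimal sketch: `build_judge_prompt` is a hypothetical helper (not part of the notebook), and the question, context and answer strings are placeholders rather than real outputs.

```python
# Same template as above, repeated so the snippet runs on its own.
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

def build_judge_prompt(system_message, question, context, answer):
    """Assemble a Mistral-style [INST] prompt for a rater (hypothetical helper)."""
    user_message = user_message_template.format(
        question=question, context=context, answer=answer
    )
    return f"[INST]{system_message}\n\nuser: {user_message}[/INST]"

# Placeholder values only; in the notebook these come from retrieval and generation.
prompt = build_judge_prompt(
    system_message="<rater system message>",
    question="<question>",
    context="<retrieved context>",
    answer="<generated answer>",
)
print(prompt)
```

Printing the assembled prompt before sending it to the model is a cheap way to confirm that the `###Question`, `###Context` and `###Answer` sections appear in the order the system message describes.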

### Defining the Evaluation Function

In [57]:
def generate_ground_relevance_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    global qna_system_message,qna_user_message_template
    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)

    # Combine user_prompt and system_message to create the prompt
    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=context_for_query, question=user_input)}
                [/INST]"""

    response = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    answer =  response["choices"][0]["text"]

    # Build the groundedness rater prompt from its system message and the question, context and answer
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    # Build the relevance rater prompt from its system message and the question, context and answer
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    response_1 = llm(
            prompt=groundedness_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    response_2 = llm(
            prompt=relevance_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    return response_1['choices'][0]['text'],response_2['choices'][0]['text']
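
The function returns the raters' free-text explanations, so the numeric score still has to be read by eye. A small parser can pull it out, assuming the judge ends its output with a `Rating:` heading as in the sample outputs in this section. This is a rough sketch: `extract_score` is a hypothetical helper, and it returns `None` whenever that heading is missing (for example, when the output was cut off).

```python
import re

def extract_score(judge_text):
    """Return the first standalone 1-5 digit after the last 'Rating:' heading.

    Returns None if there is no 'Rating:' heading or no score after it.
    Assumes the judge follows the 'Rating:' layout; other layouts are not handled.
    """
    _, sep, tail = judge_text.rpartition("Rating:")
    if not sep:
        return None  # heading absent, e.g. a truncated response
    match = re.search(r"\b([1-5])\b", tail)
    return int(match.group(1)) if match else None
```

Restricting the search to the text after `Rating:` avoids picking up the step numbers (`1.`, `2.`, `3.`) that the rater writes earlier in its answer.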

### Query 1: Who are the authors of this article and who published this article ?

In [51]:
user_input = "Who are the authors of this article and who published this article ?"
ground,rel = generate_ground_relevance_response(user_input,max_tokens=350)

print(ground,end="\n\n")
print('-'*200)
print(rel)

Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit


 Steps to evaluate the answer:
1. Identify the key information in the context related to the question.
2. Check if the authors and publisher mentioned in the context are included in the AI generated answer.
3. Verify that the answer is derived only from the information presented in the context and not from any external sources.

Explanation:
The AI generated answer correctly identifies Morten T. Hansen and Joel M. Podolny as the authors of the article and Harvard Business Review as the publisher. This information is directly taken from the context, specifically the author bylines and the publication name mentioned in the text. Therefore, the answer is derived only from the information presented in the context and adheres to the metric.

Evaluation:
The metric is followed completely.

Rating:
Based on the evaluation criteria, I would rate the AI generated answer as a 5, indicating that the metric is followed completely.

------------------------------------------------------------------

In [58]:
user_input = "Who are the authors of this article and who published this article ?"
ground,rel = generate_ground_relevance_response(user_input,max_tokens=350)

print(ground,end="\n\n")
print('-'*200)
print(rel)

Llama.generate: 1 prefix-match hit, remaining 467 prompt tokens to eval
llama_perf_context_print:        load time =    1175.68 ms
llama_perf_context_print: prompt eval time =   31668.07 ms /   467 tokens (   67.81 ms per token,    14.75 tokens per second)
llama_perf_context_print:        eval time =    7134.50 ms /    29 runs   (  246.02 ms per token,     4.06 tokens per second)
llama_perf_context_print:       total time =   38812.57 ms /   496 tokens
llama_perf_context_print:    graphs reused =         27
Llama.generate: 7 prefix-match hit, remaining 616 prompt tokens to eval
llama_perf_context_print:        load time =    1175.68 ms
llama_perf_context_print: prompt eval time =   46055.63 ms /   616 tokens (   74.77 ms per token,    13.38 tokens per second)
llama_perf_context_print:        eval time =   42632.92 ms /   172 runs   (  247.87 ms per token,     4.03 tokens per second)
llama_perf_context_print:       total time =   88766.82 ms /   788 tokens
llama_perf_context_print:    g

 Steps to evaluate the answer:
1. Identify the key information in the context related to the question.
2. Check if the answer is derived only from the information in the context.
3. Compare the answer with the information in the context to ensure accuracy.

Explanation:
The question asks for the authors of the article and the publisher. The context clearly states the names of the authors (Joel M. Podolny and Morten T. Hansen) and the publisher (Harvard Business Review). The AI generated answer matches the information in the context and is accurate.

Evaluation:
The metric is followed completely as the answer is derived only from the information presented in the context.

Rating:
Based on the evaluation criteria, the answer would receive a score of 5.

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Steps to evaluate the context as per

- The response earns a perfect score because it is both grounded in the context and relevant to the query.
- This indicates that both the retrieval and generation steps worked well for this query.

### Query 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [56]:
user_input_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
ground,rel = generate_ground_relevance_response(user_input_2,max_tokens=800)

Llama.generate: prefix-match hit
Llama.generate: prefix-match hit


 Steps to evaluate the answer:
1. Identify the three leadership characteristics mentioned in the context.
2. Check if each bullet point in the answer matches one of the leadership characteristics.
3. Verify that the explanation under each bullet point is derived from the information presented in the context.

The answer adheres to the metric considering the question and context as the input:

* Deep expertise: The context states that Apple's managers are expected to possess deep expertise in their individual functions, and the organization is structured such that experts lead other experts. This information is used to form the first bullet point in the answer.
  + Experts lead experts: This explanation is directly taken from the context.
  + Cascades down all levels: The context states that this approach cascades down through all areas of the organization, which is explained in the second line under this bullet point.
* Immersion in the details: The context states that Apple's leaders 

In [59]:
import pprint
pprint.pp(ground, width = 150)
print('-'*200)
pprint.pp(rel, width = 150)

(' Steps to evaluate the answer:\n'
 '1. Identify the key information in the context related to the question.\n'
 '2. Check if the answer is derived only from the information in the context.\n'
 '3. Compare the answer with the information in the context to ensure accuracy.\n'
 '\n'
 'Explanation:\n'
 'The question asks for the authors of the article and the publisher. The context clearly states the names of the authors (Joel M. Podolny and '
 'Morten T. Hansen) and the publisher (Harvard Business Review). The AI generated answer matches the information in the context and is accurate.\n'
 '\n'
 'Evaluation:\n'
 'The metric is followed completely as the answer is derived only from the information presented in the context.\n'
 '\n'
 'Rating:\n'
 'Based on the evaluation criteria, the answer would receive a score of 5.')
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

- The groundedness score is 5 since the response is derived solely from the context.  

- For relevance, the score is 4 (the metric is mostly followed), but the output does not make clear why a point was deducted.

- One fix is to modify the relevance prompt so that the model must give a reason for any point deduction. Another, if the output was truncated, is to increase `max_tokens`.
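
Whether the output was truncated can be checked rather than guessed. The sketch below assumes the completion dict returned by `llm(...)` follows the OpenAI-style layout with a `finish_reason` field on each choice, where `"length"` means the `max_tokens` limit was reached. `was_truncated` is a hypothetical helper, and the dict in the example is a hand-built stand-in, not a real model response.

```python
def was_truncated(response):
    """Return True if the completion stopped because it hit the token limit.

    Assumes an OpenAI-style completion dict with choices[0]["finish_reason"].
    """
    return response["choices"][0].get("finish_reason") == "length"

# Hand-built stand-in for a response that ran out of tokens mid-sentence.
example_response = {
    "choices": [{"text": " Steps to evaluate the context as per", "finish_reason": "length"}]
}
print(was_truncated(example_response))
```

Because `generate_ground_relevance_response` returns only the text, using this check would mean inspecting `response_1` and `response_2` inside the function, or returning the full response dicts alongside the text.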

### Query 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [68]:
user_input_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
ground,rel = generate_ground_relevance_response(user_input_3,max_tokens=500)

pprint.pp(ground,width = 120)
print('-'*200)
pprint.pp(rel, width = 120)

Llama.generate: prefix-match hit
Llama.generate: prefix-match hit
Llama.generate: prefix-match hit


(' Steps to evaluate the answer:\n'
 "1. Identify the specific examples mentioned in the context related to Apple's approach to leadership and successful "
 'innovations.\n'
 '2. Determine if the AI generated answer is derived only from the information presented in the context.\n'
 '\n'
 'Evaluation:\n'
 "The AI generated answer adheres to the metric as it mentions the specific example of the iPhone's development under "
 "Apple's functional organization and the evolution of the leadership approach that contributed to its success. The "
 'answer is directly derived from the context provided.\n'
 '\n'
 'Rating:\n'
 'Based on the evaluation criteria, I would rate this answer as a 5 - The metric is followed completely. The AI '
 'generated answer is fully derived from the information presented in the context and provides a specific example of '
 "how Apple's approach to leadership has led to successful innovations.")
------------------------------------------------------------------------

- For relevance, the response includes both the score and the reason for the point deduction.  

- For groundedness, it is unclear why one point was deducted.

# Business Insights and Recommendations

- Vector database creation time increases with the number of pages in the PDF document.
- Retrieval parameter **`k`** is critical as the answer can be spread across multiple contexts.
- **`chunk_overlap`** ensures coherence, especially when context spans across chunks.
- **`max_tokens`** should scale with query complexity: higher values allow detailed responses, while simple queries still produce concise outputs under a large token limit because of the prompt design and zero **`temperature`**.
- Refine prompt design and temperature settings to control response length and creativity.
- Continuously adjust RAG parameters based on specific use cases for optimal performance.
- Prioritize groundedness and relevance in evaluations to ensure reliable and contextually accurate outputs.
- Establish a feedback loop to fine-tune parameters, improving performance for diverse query types.
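
One way to make the feedback loop concrete is to log the judge scores per parameter setting and compare averages. The sketch below is illustrative only: `summarize_scores` and the record layout (`config`, `groundedness`, `relevance`) are assumptions rather than part of the notebook, and the scores are expected to have been parsed from the judge output beforehand.

```python
from collections import defaultdict
from statistics import mean

def summarize_scores(records):
    """Average groundedness and relevance per configuration label.

    Each record is a dict with keys "config", "groundedness" and "relevance".
    Records with a missing (None) score are skipped.
    """
    grouped = defaultdict(lambda: {"groundedness": [], "relevance": []})
    for rec in records:
        if rec["groundedness"] is None or rec["relevance"] is None:
            continue  # score could not be parsed for this query
        grouped[rec["config"]]["groundedness"].append(rec["groundedness"])
        grouped[rec["config"]]["relevance"].append(rec["relevance"])
    return {
        config: {metric: mean(values) for metric, values in scores.items()}
        for config, scores in grouped.items()
    }
```

A configuration label could be any string that identifies the setting under test, for example one that encodes the values of `k` and `chunk_overlap`. Comparing the averages across labels gives a simple basis for choosing which setting to keep.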
