
<center><font face = 'Times New Roman' size=10>Introduction to LLMs and GenAI</font></center>
<center><font face = 'Times New Roman' size=6>Mini Project 4 - Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)</font></center>

# Problem Statement

## Business Context

Growing organizations generate massive amounts of reports and data critical for decision-making. For example, venture capital analysts at firms like Andreessen Horowitz must extract insights from dense documents such as HBR’s “How Apple is Organized for Innovation.” Manual review is slow and inefficient, but Semantic Search and Retrieval-Augmented Generation (RAG) can deliver quick, precise answers to targeted questions, enabling faster strategic insights.

## Objective

Build a RAG application that allows business analysts to quickly extract key insights from lengthy reports, improving efficiency and decision-making.

## Data Description

**How Apple is Organized for Innovation** - An 11-page article in PDF format

# Installing and Importing Necessary Libraries and Dependencies

In [None]:
!pip uninstall -y numpy
!pip uninstall -y scipy

Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
  Successfully uninstalled numpy-2.0.2
Found existing installation: scipy 1.16.1
Uninstalling scipy-1.16.1:
  Successfully uninstalled scipy-1.16.1


In [None]:
!pip install numpy==1.26.4 scipy==1.11.4



In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install sentence-transformers==2.2.2 transformers==4.44.0
!pip install langchain==0.1.1 langchain-community


In [1]:
!pip install llama-cpp-python==0.1.85 huggingface_hub tiktoken pymupdf chromadb --force-reinstall --no-cache-dir -q



In [None]:
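# Rebuild llama-cpp-python with cuBLAS (GPU) support. The environment variables only
# take effect when they prefix the pip command on the same line.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.85 --force-reinstall --no-cache-dir -q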

In [None]:
import json, os   
import tiktoken   
from langchain.text_splitter import RecursiveCharacterTextSplitter   
from langchain_community.document_loaders import PyMuPDFLoader       
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings 
from langchain_community.vectorstores import Chroma             

from huggingface_hub import hf_hub_download   
from llama_cpp import Llama                   


# Data Preparation for RAG

## Loading the Data

In [4]:
apple_pdf_path = "/content/drive/MyDrive/0- July-Dec 2025/5th sem Intro to LLM and GenAI/Classroom Mini Projects/Part-4/HBR_How_Apple_Is_Organized_For_Innovation-4.pdf"

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
pdf_loader = PyMuPDFLoader(apple_pdf_path)

In [7]:
apple = pdf_loader.load()

## Data Overview

#### Checking the first 3 pages

In [8]:
for i in range(3):
    print(f"Page Number : {i+1}",end="\n")
    print(apple[i].page_content,end="\n")

Page Number : 1
REPRINT R2006F
PUBLISHED IN HBR
NOVEMBER–DECEMBER 2020
ARTICLE
ORGANIZATIONAL CULTURE
How Apple Is 
Organized  
for Innovation
It’s about experts leading experts. 
by Joel M. Podolny and Morten T. Hansen
This article is made available to you with compliments of Apple Inc for your personal use. Further posting, copying or distribution is not permitted.

Page Number : 2
2
Harvard Business Review
November–December 2020
This article is made available to you with compliments of Apple Inc for your personal use. Further posting, copying or distribution is not permitted.

Page Number : 3
PHOTOGRAPHER MIKAEL JANSSON
How Apple Is 
Organized 
for Innovation
It’s about experts 
leading experts.
ORGANIZATIONAL 
CULTURE
Joel M. 
Podolny
Dean, Apple 
University
Morten T. 
Hansen
Faculty, Apple 
University
AUTHORS
FOR ARTICLE REPRINTS CALL 800-988-0886 OR 617-783-7500, OR VISIT HBR.ORG
Harvard Business Review
November–December 2020  3
This article is made available to you with complime

The sixth page of the document contains shapes, text, and other layout elements.

Let's see how its text is extracted.

In [9]:
apple[5].page_content

'targets were the overriding criteria for judging investments \nand leaders. Significantly, the bonuses of senior R&D exec-\nutives are based on companywide performance numbers \nrather than the costs of or revenue from particular products. \nThus product decisions are somewhat insulated from short-\nterm financial pressures. The finance team is not involved in \nthe product road map meetings of engineering teams, and \nengineering teams are not involved in pricing decisions.\nWe don’t mean to suggest that Apple doesn’t consider \ncosts and revenue goals when deciding which technologies \nand features the company will pursue. It does, but in ways \nthat differ from those employed by conventionally organized \ncompanies. Instead of using overall cost and price targets as \nfixed parameters within which to make design and engineer-\ning choices, R&D leaders are expected to weigh the benefits \nto users of those choices against cost considerations.\nIn a functional organization, individua

* Looking closely, the text is not extracted in reading order.

* This is a limitation: chunks built from this text may not contain coherent passages.



#### Checking the number of pages

In [13]:
len(apple)

11

## Data Chunking

In [10]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=512,
    chunk_overlap= 20
)

In [11]:
document_chunks = pdf_loader.load_and_split(text_splitter)

In [12]:
len(document_chunks)

25

In [13]:
document_chunks[0].page_content

'REPRINT R2006F\nPUBLISHED IN HBR\nNOVEMBER–DECEMBER 2020\nARTICLE\nORGANIZATIONAL CULTURE\nHow Apple Is \nOrganized  \nfor Innovation\nIt’s about experts leading experts. \nby Joel M. Podolny and Morten T. Hansen\nThis article is made available to you with compliments of Apple Inc for your personal use. Further posting, copying or distribution is not permitted.'

In [14]:
document_chunks[-2].page_content

'the answer (because they don’t). This differs starkly from \nthe way leaders question subordinates about activities in the \nowning and teaching boxes.\nFinally, Rosner has delegated some areas—including \niMovie and GarageBand, in which he is not an expert—to \npeople with the requisite capabilities. For activities in the \ndelegating box, he assembles teams, agrees on objectives, \nmonitors and reviews prog\xadress, and holds the teams account-\nable: the stuff of general management.\nWhereas Apple’s VPs spend most of their time in the own-\ning and learning boxes, general managers at other companies \ntend to spend most of their time in the delegating box. Rosner \nestimates that he spends about 40% of his time on activities \nhe owns (including collaboration with others in a given area), \nabout 30% on learning, about 15% on teaching, and about 15% \non delegating. These numbers vary by manager, of course, \ndepending on their business and the needs at a given time.\nThe discretio

In [None]:
document_chunks[-1].page_content

As expected, consecutive chunks overlap:

- The sentence '*to cultivate the experts-leading-experts model even within*' appears in both chunks.
- Increasing `chunk_overlap` lengthens the text shared between consecutive chunks.
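
To make the effect of `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap over a list of tokens. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators); the helper `chunk_with_overlap` and the toy token list are assumptions made for illustration only.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


tokens = [f"t{i}" for i in range(10)]  # hypothetical tokens t0..t9
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
for chunk in chunks:
    print(chunk)

# Tokens that appear in both the first and the second chunk
shared = [t for t in chunks[0] if t in chunks[1]]
print("shared between first two chunks:", shared)
```

With `chunk_overlap=2`, each chunk repeats the last two tokens of the previous one. A larger overlap repeats more and produces more chunks for the same text.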

## Embedding

In [None]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
embedding_model = SentenceTransformerEmbeddings(model_name='thenlper/gte-large')

In [15]:
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)


In [6]:
print("Dimension of the embedding vector ",len(embedding_1))
len(embedding_1)==len(embedding_2)


* The embedding model maps every chunk to a vector of the same fixed length, regardless of how long the chunk is.
* This is necessary because similarity can only be computed between vectors of equal dimension.
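
As a rough illustration of why equal dimensions matter, the sketch below computes cosine similarity between small hand-made vectors. The vectors and the helper `cosine_similarity` are assumptions for illustration. They are not actual `gte-large` embeddings, and Chroma's internal distance computation may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of the same dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings"
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]   # same direction as vec_a
vec_c = [0.0, 1.0, 0.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: nothing in common
```

Passing vectors of different lengths raises a `ValueError`, which is the failure a fixed-length embedding avoids.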

## Vector Database

In [7]:
out_dir = 'apple_db'

os.makedirs(out_dir, exist_ok=True)

In [8]:
vectorstore = Chroma.from_documents(
    document_chunks,
    embedding_model,
    persist_directory=out_dir
)


In [9]:
vectorstore = Chroma(persist_directory=out_dir,embedding_function=embedding_model)


In [10]:
vectorstore.embeddings


In [11]:
vectorstore.similarity_search("Apple Steve Jobs iPhone ",k=3)


* From the retrieved chunks, we observe that all the chunks are related to the key terms [ 'Apple', 'Steve Jobs', 'iPhone' ].

## Retriever

In [12]:
retriever = vectorstore.as_retriever(
    search_type='similarity',
    search_kwargs={'k': 2}
)


In [13]:
rel_docs = retriever.get_relevant_documents("How does Apple develop and ship products that require good coordination between the teams?")
rel_docs


- The two retrieved chunks contain the answer to the query.
- Increasing **`k`** retrieves more chunks, which raises the chance of covering the answer but also adds lower-ranked, less similar text to the context.
- **`k`** is therefore a hyperparameter that we need to tune to get the best context.
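
The effect of `k` can be sketched with a toy ranking over hand-made vectors. The helper `top_k_chunks`, the chunk names, and the 2-D vectors are hypothetical. A real retriever ranks the stored chunk embeddings in the same spirit, though its exact scoring may differ.

```python
import math


def top_k_chunks(query_vec, chunk_vecs, k):
    """Return (chunk_id, score) pairs for the k chunks most similar to the query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    scored = [(cid, cos(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]


# Hypothetical 2-D "embeddings" for three chunks and one query
chunk_vecs = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.8, 0.6],
    "chunk_c": [0.0, 1.0],
}
query_vec = [1.0, 0.1]

for k in (1, 2, 3):
    print(k, [cid for cid, _ in top_k_chunks(query_vec, chunk_vecs, k)])
```

Each increase in `k` appends the next most similar chunk, so the added chunks are always less similar to the query than the ones already retrieved.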

# Defining the Response Generator

## Downloading and Loading the model

In [14]:
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"

In [15]:
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)


In [None]:
llm = Llama(
    model_path=model_path,
    n_ctx=2300,
    n_gpu_layers=38,
    n_batch=512
)


In [None]:
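# CPU-only alternative: run this cell instead of the GPU cell above, not in addition to it.
# Note: n_ctx=1024 can be too small once several 512-token chunks are added to the prompt.
llm = Llama(
    model_path=model_path,
    n_ctx=1024,
    # `n_cores` is not a Llama argument; `n_threads` sets the CPU thread count (one core left free here)
    n_threads=max(1, (os.cpu_count() or 2) - 1)
)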

In [18]:
llm("How does Apple develop and ship products that require good coordination between the teams?")['choices'][0]['text']


- The response seems generic and appears to come from the model's general knowledge rather than from our article. Let's provide our own context and align the response with our needs.

## System and User Prompt Template

Prompts guide the model to generate accurate responses. Here, we define two parts:

1. The system message describing the assistant's role.
2. A user message template including context and the question.

In [19]:
qna_system_message = """
You are an assistant whose work is to review the report and provide the appropriate answers from the context.
User input will have the context required by you to answer user questions.
This context will begin with the token: ###Context.
The context contains references to specific portions of a document relevant to the user query.

User questions will begin with the token: ###Question.

Please answer only using the context provided in the input. Do not mention anything about the context in your final answer.

If the answer is not found in the context, respond "I don't know".
"""

In [20]:
qna_user_message_template = """
###Context
Here are some documents that are relevant to the question mentioned below.
{context}

###Question
{question}
"""

## Response Function

In [21]:
def generate_rag_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # Retrieve the k most similar chunks. Query the vector store directly:
    # the retriever defined earlier is fixed at k=2 and does not pick up a `k` passed per call.
    relevant_document_chunks = vectorstore.similarity_search(query=user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    prompt = qna_system_message + '\n' + user_message

    # Generate the response
    try:
        response = llm(
                  prompt=prompt,
                  max_tokens=max_tokens,
                  temperature=temperature,
                  top_p=top_p,
                  top_k=top_k
                  )

        # Extract the generated text from the model's response
        response = response['choices'][0]['text'].strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response

# Question Answering using RAG

### Query 1: Who are the authors of this article and who published this article?

In [22]:
user_input = "Who are the authors of this article and who published this article?"
print(generate_rag_response(user_input))


- The answer is clear, concise, and focused, without any unnecessary information.  

- For queries like this, we expect a response of this nature.

### Query 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [23]:
user_input_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
generate_rag_response(user_input_2)


- The response contains only two leadership characteristics, but they are well explained.  
- Perhaps if we increase the **`max_tokens`**, we might get the third characteristic as well (assuming it is in the document).

### Query 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [24]:
user_input_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
generate_rag_response(user_input_3)


- The system prompt explicitly instructs the model not to answer when the answer cannot be derived from the context.

- As expected, the model followed this instruction and avoided hallucinating an answer for this query.

## Tuning Generation Parameters

### Query 1: Who are the authors of this article and who published this article?

In [25]:
user_input = "Who are the authors of this article and who published this article?"
generate_rag_response(user_input, max_tokens=100)


- **`max_tokens`** is only an upper bound: even with it set to 100, the model stopped earlier because the query can be answered with far fewer tokens.

- With the temperature set to 0, decoding is deterministic, so the model gives the same short answer on repeated runs instead of more varied, "creative" ones.
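
To illustrate how temperature shapes the choice of the next token, here is a minimal sketch of temperature-scaled softmax. The logits are made up, the helper `next_token_probs` is hypothetical, and treating temperature 0 as a greedy argmax is an assumption that matches common practice rather than a statement about this model's internals.

```python
import math


def next_token_probs(logits, temperature):
    """Softmax over logits / temperature; temperature == 0 is treated as greedy (one-hot)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.1, 1.0, 2.0):
    print(t, [round(p, 3) for p in next_token_probs(logits, t)])
```

Low temperatures concentrate almost all probability on the top-scoring token, while higher temperatures spread it across alternatives. Temperature changes which tokens are picked; the length limit is still set by `max_tokens`.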

### Query 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [26]:
user_input_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
generate_rag_response(user_input_2, temperature=0.1, max_tokens=350)


- Compared to the previous case, after increasing the **`max_tokens`**, we got the third characteristic.

### Query 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [27]:
user_input_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
generate_rag_response(user_input_3, top_p=0.98, top_k=20, max_tokens=256)


- Since the context provided doesn't help with the query, the model has responded correctly based on the prompt design.

- However, the relevant passage may simply be missing from the top **`k`** retrieved chunks. It is therefore worth experimenting with higher values of the retrieval **`k`** (not to be confused with the sampling parameter `top_k`).
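
Raising the retrieval `k` also lengthens the prompt, so it helps to check the token budget first. The sketch below uses this notebook's `chunk_size=512` and `n_ctx` values of 1024 and 2300. The 150-token prompt overhead and the helper `max_chunks_that_fit` are assumptions, and token counts from `cl100k_base` will not match the Mistral tokenizer exactly, so treat the result as an estimate.

```python
def max_chunks_that_fit(n_ctx, chunk_size, max_tokens, prompt_overhead):
    """Largest k such that k * chunk_size + prompt_overhead + max_tokens <= n_ctx."""
    available = n_ctx - max_tokens - prompt_overhead
    return max(available // chunk_size, 0)


PROMPT_OVERHEAD = 150  # assumed size of the system message and template, in tokens

for n_ctx in (1024, 2300, 4096):
    k_max = max_chunks_that_fit(
        n_ctx, chunk_size=512, max_tokens=256, prompt_overhead=PROMPT_OVERHEAD
    )
    print(f"n_ctx={n_ctx}: up to {k_max} chunks of 512 tokens")
```

Under these assumptions, a 1024-token window leaves room for only one full chunk, so a larger `k` needs a larger `n_ctx` or smaller chunks.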

# Output Evaluation

Let us now use the LLM-as-a-judge method to check the quality of the RAG system on two parameters - retrieval and generation. We illustrate this evaluation using the answers generated for the questions from the previous section.

- We are using the same Mistral model for evaluation, so here the LLM is effectively rating its own performance on the task.

### Defining the Evaluation Prompts

In [28]:
groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""

In [29]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.
"""

In [30]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

### Defining the Evaluation Function

In [31]:
def generate_ground_relevance_response(user_input,k=3,max_tokens=128,temperature=0,top_p=0.95,top_k=50):
    # Retrieve the k most similar chunks. Query the vector store directly so that
    # the `k` argument takes effect (it was previously hard-coded to 3).
    relevant_document_chunks = vectorstore.similarity_search(query=user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]
    context_for_query = ". ".join(context_list)

    # Combine user_prompt and system_message to create the prompt
    prompt = f"""[INST]{qna_system_message}\n
                {'user'}: {qna_user_message_template.format(context=context_for_query, question=user_input)}
                [/INST]"""

    response = llm(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    answer =  response["choices"][0]["text"]

    # Combine user_prompt and system_message to create the prompt
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    # Combine user_prompt and system_message to create the prompt
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                {'user'}: {user_message_template.format(context=context_for_query, question=user_input, answer=answer)}
                [/INST]"""

    response_1 = llm(
            prompt=groundedness_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    response_2 = llm(
            prompt=relevance_prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            stop=['INST'],
            )

    return response_1['choices'][0]['text'],response_2['choices'][0]['text']

### Query 1: Who are the authors of this article and who published this article?

In [None]:
user_input = "Who are the authors of this article and who published this article?"
ground,rel = generate_ground_relevance_response(user_input,max_tokens=350)

print(ground,end="\n\n")
print(rel)

- It got a perfect score because the response is both grounded in the context and relevant to the query.  
- This indicates that both the retrieval and generation steps worked well for this query.

### Query 2: List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines.

In [None]:
user_input_2 = "List down the three leadership characteristics in bulleted points and explain each one of the characteristics under two lines."
ground,rel = generate_ground_relevance_response(user_input_2,max_tokens=500)

print(ground,end="\n\n")
print(rel)

- The groundedness score is 5 since the response is derived solely from the context.  

- Regarding relevance, the score is 4 (the metric is mostly followed). However, it is not very clear why this rating was given.  

- One solution is to modify the relevance prompt so that the model states its reason for any point deduction. Another is to increase `max_tokens`, in case the output was truncated.

### Query 3: Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?

In [None]:
user_input_3 = "Can you explain specific examples from the article where Apple's approach to leadership has led to successful innovations?"
ground,rel = generate_ground_relevance_response(user_input_3,max_tokens=500)

print(ground,end="\n\n")
print(rel)

- For relevance, the response includes both the score and the reason for the point deduction.  

- For groundedness, it is unclear why one point was deducted.
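
Because the judge replies in free text, comparing runs is easier if the numeric rating is pulled out programmatically. The sketch below assumes the judge states its rating in a phrase such as "Score: 5" or "rating of 4 out of 5". The helper `extract_score` and the sample strings are hypothetical, and real judge output may need a different pattern.

```python
import re

# Matches "score" or "rating", then up to 20 non-digit characters, then a single digit 1-5
SCORE_PATTERN = re.compile(r"(?:score|rating)\D{0,20}?([1-5])\b", re.IGNORECASE)


def extract_score(judge_text):
    """Return the last 1-5 rating mentioned after 'score' or 'rating', or None if absent."""
    matches = SCORE_PATTERN.findall(judge_text)
    return int(matches[-1]) if matches else None


# Hypothetical judge outputs
samples = [
    "Steps: ... The answer follows the metric completely. Score: 5",
    "I would give a rating of 4 out of 5 because one aspect is missing.",
    "The answer is derived from the context.",
]
for text in samples:
    print(extract_score(text))
```

A `None` result flags outputs where no rating was found, for example because generation was cut off by `max_tokens` before the score was written.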

<section class="rag-conclusion" style="font-family:system-ui, -apple-system, Segoe UI, Roboto, Arial, sans-serif; line-height:1.6;">
  <h2 style="margin:0 0 .5rem;">Conclusion</h2>
  <p style="margin:.25rem 0 1rem;">
    For reliable, concise, and contextually accurate RAG outputs, tune retrieval and generation together—favoring groundedness over verbosity.
  </p>
  <ul style="margin:.5rem 0 1rem 1.25rem;">
    <li><strong>Indexing:</strong> Larger PDFs increase vectorization time; batch and cache embeddings.</li>
    <li><strong>Retrieval depth (<code>k</code>):</strong> Raise when answers span sections; lower for focused queries to reduce noise.</li>
    <li><strong>Chunking:</strong> Use <code>chunk_overlap</code> to preserve continuity across boundaries without inflating recall.</li>
    <li><strong>Generation limits:</strong> Set <code>max_tokens</code> to match query complexity; prompt structure and <code>temperature=0</code> keep simple answers brief.</li>
    <li><strong>Prompt &amp; creativity:</strong> Iterate prompts and adjust temperature to balance precision with flexibility.</li>
    <li><strong>Evaluation:</strong> Prioritize groundedness and relevance; measure with task-specific checks.</li>
    <li><strong>Feedback loop:</strong> Continuously refine parameters per use case using logged queries and outcomes.</li>
  </ul>
  <p style="margin:0;">
    Start conservative (low <code>k</code>, moderate overlap, strict prompts), evaluate groundedness, then incrementally expand scope (tokens, <code>k</code>, overlap) only when evidence shows improved answer coverage.
  </p>
</section>
