# Semantic Search & RAG with LlamaIndex
## ABB #8 - Session 3

Code authored by: Shaw Talebi

### imports

In [1]:
from IPython.display import display, Markdown
from bs4 import BeautifulSoup

from llama_index.core import VectorStoreIndex, get_response_synthesizer, Settings
from llama_index.core.schema import TextNode
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

In [2]:
from dotenv import load_dotenv
import os

# load the OpenAI secret key (sk) from the .env file into the environment
load_dotenv()
my_sk = os.getenv("OPENAI_API_KEY")

### 1) chunk articles

In [3]:
# List every file in the articles directory (non-HTML files are skipped below)
filename_list = ["articles/"+f for f in os.listdir('articles')]

chunk_list = []
for filename in filename_list:
    # only process .html files
    if filename.lower().endswith('.html'):
        # read html file
        with open(filename, 'r', encoding='utf-8') as file:
            html_content = file.read()
    
        # Parse HTML
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Get article title
        article_title = soup.find('title').get_text().strip() if soup.find('title') else "Untitled"
        
        # Initialize variables
        article_content = []
        current_section = "Main"  # Default section if no headers found
        
        # Find all headers and text content
        content_elements = soup.find_all(['h1', 'h2', 'h3', 'p', 'ul', 'ol'])
    
        # iterate through elements and extract text with metadata
        for element in content_elements:
            if element.name in ['h1', 'h2', 'h3']:
                current_section = element.get_text().strip()
            elif element.name in ['p', 'ul', 'ol']:
                text = element.get_text().strip()
                # Only add non-empty content that's at least 30 characters long
                if text and len(text) >= 30:
                    article_content.append({
                        'article_title': article_title,
                        'section': current_section,
                        'text': text
                    })
    
        # add article content to list
        chunk_list.extend(article_content)
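
The chunking rule above (track the most recent header, keep text blocks of at least 30 characters) can be sanity-checked on a tiny inline document. The following is a minimal sketch using only the standard library's `html.parser` rather than BeautifulSoup; the `SectionChunker` class and the sample HTML are hypothetical, and unlike `find_all` it captures only the outermost block, so nested paragraphs are not emitted twice.

```python
from html.parser import HTMLParser

HEADERS = {"h1", "h2", "h3"}
BLOCKS = {"p", "ul", "ol"}


class SectionChunker(HTMLParser):
    """Hypothetical stdlib stand-in for the BeautifulSoup loop above."""

    def __init__(self, min_chars=30):
        super().__init__()
        self.min_chars = min_chars
        self.section = "Main"  # default section if no header is seen
        self.chunks = []
        self._open = None      # tag currently being captured
        self._depth = 0        # nesting depth of that same tag
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._open is None and tag in HEADERS | BLOCKS:
            self._open, self._depth, self._buf = tag, 1, []
        elif tag == self._open:
            self._depth += 1

    def handle_data(self, data):
        if self._open is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        self._depth -= 1
        if self._depth:
            return
        text = "".join(self._buf).strip()
        if self._open in HEADERS:
            # a header updates the section label for the chunks that follow
            self.section = text
        elif len(text) >= self.min_chars:
            # keep only text blocks that meet the minimum length
            self.chunks.append({"section": self.section, "text": text})
        self._open = None


sample_html = (
    "<h1>Intro</h1><p>short</p>"
    "<p>This paragraph is comfortably longer than thirty characters.</p>"
    "<h2>Details</h2>"
    "<ul><li>First bullet point with enough text</li><li>second</li></ul>"
)
chunker = SectionChunker()
chunker.feed(sample_html)
for chunk in chunker.chunks:
    print(chunk)
```

The short paragraph is dropped, and each surviving chunk carries the header that preceded it, which is the metadata the nodes later rely on.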

In [4]:
chunk_list[0]

{'article_title': 'Fine-Tuning BERT for Text Classification',
 'section': 'Fine-Tuning BERT for Text Classification',
 'text': 'Although today’s 100B+ parameter transformer models are state-of-the-art in AI, there’s still much we can accomplish with smaller (< 1B parameter) models. In this article, I will walk through one such example, fine-tuning BERT (110M parameters) to classify phishing URLs. I’ll start by covering key concepts and then share example Python code.'}

In [5]:
# wrap each chunk in a LlamaIndex TextNode, keeping article and section as metadata
node_list = []
for i, chunk in enumerate(chunk_list):
    node_list.append(
        TextNode(
            id_=str(i), 
            text=chunk["text"], 
            metadata = {
                "article":chunk["article_title"],
                "section":chunk["section"]
            }
        )
    )

print(len(node_list))

778


In [6]:
node_list[0]

TextNode(id_='0', embedding=None, metadata={'article': 'Fine-Tuning BERT for Text Classification', 'section': 'Fine-Tuning BERT for Text Classification'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text='Although today’s 100B+ parameter transformer models are state-of-the-art in AI, there’s still much we can accomplish with smaller (< 1B parameter) models. In this article, I will walk through one such example, fine-tuning BERT (110M parameters) to classify phishing URLs. I’ll start by covering key concepts and then share example Python code.', mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}')

### 2) create index

In [7]:
index = VectorStoreIndex(node_list)

print(f"Embedding Model: {index._embed_model.model_name}")
print(f"Index Size: {len(index.vector_store.data.embedding_dict)}")
print(f"Embedding Size: {len(index.vector_store.data.embedding_dict['0'])}")

Embedding Model: text-embedding-ada-002
Index Size: 778
Embedding Size: 1536


In [8]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# changing embedding model
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

In [9]:
index = VectorStoreIndex(node_list)

print(f"Embedding Model: {index._embed_model.model_name}")
print(f"Index Size: {len(index.vector_store.data.embedding_dict)}")
print(f"Embedding Size: {len(index.vector_store.data.embedding_dict['0'])}")

Embedding Model: BAAI/bge-small-en-v1.5
Index Size: 778
Embedding Size: 384
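
The two printouts differ only in embedding size (1536 vs 384) for the same 778 nodes. A rough sketch of what that means for raw vector storage, assuming each value takes 4 bytes (float32); that byte size is an assumption for illustration, not something the notebook reports, and `index_bytes` is a hypothetical helper.

```python
BYTES_PER_VALUE = 4  # assumption: float32 storage, not confirmed by the notebook


def index_bytes(n_vectors, dim, bytes_per_value=BYTES_PER_VALUE):
    """Raw size of n_vectors embeddings of length dim, ignoring any overhead."""
    return n_vectors * dim * bytes_per_value


ada_bytes = index_bytes(778, 1536)  # text-embedding-ada-002
bge_bytes = index_bytes(778, 384)   # BAAI/bge-small-en-v1.5

print(f"ada-002:   {ada_bytes / 1e6:.2f} MB")
print(f"bge-small: {bge_bytes / 1e6:.2f} MB")
print(f"ratio:     {ada_bytes / bge_bytes:.1f}x")
```

Under that assumption the smaller model needs a quarter of the space, since storage scales linearly with embedding size.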


### 3) semantic search

In [10]:
# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

In [11]:
results = retriever.retrieve("When do I perform fine-tuning?")
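
Conceptually, the retriever embeds the query, scores it against every stored node embedding, and keeps the `similarity_top_k` best matches. Below is a minimal sketch in plain Python, assuming cosine similarity as the scoring function (to my knowledge the default for the in-memory store, but treat that as an assumption). The 3-dimensional vectors and the `top_k` helper are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, embedding_dict, k):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in embedding_dict.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy stand-ins for index.vector_store.data.embedding_dict and the query embedding
toy_embeddings = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.0, 1.0, 0.0],
    "2": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.1, 0.0]

ranked = top_k(toy_query, toy_embeddings, k=2)
for node_id, score in ranked:
    print(node_id, round(score, 3))
```

A real vector store does the same ranking over all 778 embeddings, usually with vectorized math instead of a Python loop.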

In [12]:
# results

In [13]:
# format results in markdown
results_markdown = ""
for i, result in enumerate(results, start=1):
    results_markdown += f"{i}. **Article title:** {result.metadata['article']}  \n"
    results_markdown += f"   **Section:** {result.metadata['section']}  \n"
    results_markdown += f"   **Snippet:** {result.text} \n\n"
    results_markdown += f"   **Score:** {result.score} \n\n"

In [14]:
display(Markdown(results_markdown))

1. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When do I Fine-tune?  
   **Snippet:** This is not to say that fine-tuning is useless. A central benefit of fine-tuning an AI assistant is lowering inference costs [3]. 

   **Score:** 0.8114657060166933 

2. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When NOT to Fine-tune  
   **Snippet:** The effectiveness of any approach will depend on the details of the use case. For example, fine-tuning is less effective than retrieval augmented generation (RAG) to provide LLMs with specialized knowledge [1]. 

   **Score:** 0.800293870277152 

3. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** How to Prepare Data for Fine-tuning?  
   **Snippet:** For example, if I wanted to fine-tune an LLM to respond to viewer questions on YouTube, I would need to gather a set of comments with questions and my associated responses. For a concrete example of this, check out the code walk-through on YouTube. 

   **Score:** 0.7996616635141707 

4. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When do I Fine-tune?  
   **Snippet:** Fine-tuning, on the other hand, can compress prompt sizes by directly training the model on examples. Shorter prompts mean fewer tokens at inference, leading to lower compute costs and faster model responses [3]. For instance, after fine-tuning, the above prompt could be compressed to the following. 

   **Score:** 0.7995040458001792 

5. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** We’ve already mentioned situations where RAG and fine-tuning perform well. However, since this is such a common question, it’s worth reemphasizing when each approach works best. 

   **Score:** 0.7930144142584221 

6. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** 3 Ways to Fine-tune  
   **Snippet:** The next, and perhaps most popular, way to fine-tune a model is via supervised learning. This involves training a model on input-output pairs for a particular task. An example is instruction tuning, which aims to improve model performance in answering questions or responding to user prompts [1,3]. 

   **Score:** 0.7919754233525915 

7. **Article title:** How to Improve LLMs with RAG  
   **Section:** Why we care  
   **Snippet:** Previous articles in this series discussed fine-tuning, which adapts an existing model for a particular use case. While this is an alternative way to endow an LLM with specialized knowledge, empirically, fine-tuning seems to be less effective than RAG at doing this [1]. 

   **Score:** 0.7899395386438688 

8. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** What is Fine-tuning?  
   **Snippet:** Fine-tuning is taking a pre-trained model and training at least one internal model parameter (i.e. weights). In the context of LLMs, what this typically accomplishes is transforming a general-purpose base model (e.g. GPT-3) into a specialized model for a particular use case (e.g. ChatGPT) [1]. 

   **Score:** 0.7895567465793972 

9. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** What’s Next?  
   **Snippet:** Here, I summarized the most common fine-tuning questions I’ve received over the past 12 months. While fine-tuning is not a panacea for all LLM use cases, it has key benefits. 

   **Score:** 0.7862102243041814 

10. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** What is Fine-tuning?  
   **Snippet:** I like to define fine-tuning as taking an existing (pre-trained) model and training at least 1 model parameter to adapt it to a particular use case. 

   **Score:** 0.7854351862786609 



### 4) RAG

In [15]:
# configure response synthesizer
response_synthesizer = get_response_synthesizer()

In [16]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
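
The `similarity_cutoff=0.5` setting filters retrieved nodes by score before they reach the LLM. Here is a sketch of that filtering step, assuming a node is kept when its score is at least the cutoff. The `apply_cutoff` helper is hypothetical, and the scores are the ten printed in the search results above.

```python
def apply_cutoff(scores, similarity_cutoff):
    """Keep only the scores that meet the cutoff (hypothetical helper)."""
    return [s for s in scores if s >= similarity_cutoff]


# scores of the ten results retrieved for "When do I perform fine-tuning?"
retrieved_scores = [
    0.8114657060166933, 0.800293870277152, 0.7996616635141707,
    0.7995040458001792, 0.7930144142584221, 0.7919754233525915,
    0.7899395386438688, 0.7895567465793972, 0.7862102243041814,
    0.7854351862786609,
]

kept_at_050 = apply_cutoff(retrieved_scores, 0.5)
kept_at_080 = apply_cutoff(retrieved_scores, 0.8)

print(len(kept_at_050), "nodes pass a 0.5 cutoff")
print(len(kept_at_080), "nodes pass a 0.8 cutoff")
```

For this query every retrieved node clears 0.5, so the cutoff changes nothing here; it only starts to matter for queries with weaker matches or with a stricter threshold.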

In [17]:
response = query_engine.query("When do I perform fine-tuning?")
print(response)

Perform fine-tuning when you want to lower inference costs by compressing prompt sizes, leading to lower compute costs and faster model responses.
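
The answer above echoes the retrieved snippets because the query engine passes them to the LLM as context alongside the question. Below is a minimal sketch of that prompt-assembly idea. The template wording and the `build_prompt` helper are illustrative assumptions, not LlamaIndex's actual prompt, and the two snippets are sentences taken from the search results above.

```python
def build_prompt(question, snippets):
    """Combine retrieved snippets and the user question into one prompt string."""
    # number each snippet so an answer can be traced back to its source
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


snippets = [
    "A central benefit of fine-tuning an AI assistant is lowering inference costs [3].",
    "Shorter prompts mean fewer tokens at inference, leading to lower compute costs "
    "and faster model responses [3].",
]

prompt = build_prompt("When do I perform fine-tuning?", snippets)
print(prompt)
```

Whatever the exact template, the quality of the answer is bounded by the snippets that survive retrieval and postprocessing.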


In [18]:
print(f"LLM: {Settings.llm.model}")

LLM: gpt-3.5-turbo


In [19]:
from llama_index.llms.openai import OpenAI

# changing the global LLM
Settings.llm = OpenAI(model="gpt-5")

In [20]:
# simpler way to make query engine
query_engine = index.as_query_engine()
response = query_engine.query("I'm trying to build a LinkedIn post writer using AI. What do you recommend?")
print(response)

Great idea. A simple, effective path is:
- Use prompt engineering to shape consistent, on-brand outputs. Create a few reusable prompt templates (e.g., different tones or goals) and iterate.
- Add a text-classification layer (e.g., a fine‑tuned BERT classifier) to tag or verify tone, format, and audience fit, then route to the right template or filter low‑quality generations.
- Close the loop with quick human feedback to refine prompts and the classifier over time.

I’ve compiled resources on prompt engineering and have work on fine‑tuning BERT for text classification that can help you implement this stack. If you want pointers or examples, check my site: https://www.shawhintalebi.com/ or reach out on YouTube 🎥, LinkedIn, or Twitter.


In [21]:
response.get_formatted_sources()

'> Source (Doc id: 28): My website: https://www.shawhintalebi.com/\n\n> Source (Doc id: 610): Socials: YouTube 🎥 | LinkedIn | Twitter'

In [22]:
print(f"LLM: {Settings.llm.model}")

LLM: gpt-5
