# Semantic Search with Text Embeddings
## ABB #2 - Session 4

Code authored by: Shaw Talebi

### imports

In [1]:
import os
from bs4 import BeautifulSoup
import json
from sentence_transformers import SentenceTransformer
import torch
from IPython.display import display, Markdown
from functions import *

### 1) chunk articles

In [2]:
# List all files in the articles directory (filtered to .html below)
filename_list = [os.path.join('articles', f) for f in os.listdir('articles')]

chunk_list = []
for filename in filename_list:
    # only process .html files
    if filename.lower().endswith('.html'):
        # read html file
        with open(filename, 'r', encoding='utf-8') as file:
            html_content = file.read()
    
        # Parse HTML
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Get article title
        article_title = soup.find('title').get_text().strip() if soup.find('title') else "Untitled"
        
        # Initialize variables
        article_content = []
        current_section = "Main"  # Default section if no headers found
        
        # Find all headers and text content
        content_elements = soup.find_all(['h1', 'h2', 'h3', 'p', 'ul', 'ol'])
    
        # iterate through elements and extract text with metadata
        for element in content_elements:
            if element.name in ['h1', 'h2', 'h3']:
                current_section = element.get_text().strip()
            elif element.name in ['p', 'ul', 'ol']:
                text = element.get_text().strip()
                # Only add non-empty content that's at least 30 characters long
                if text and len(text) >= 30:
                    article_content.append({
                        'article_title': article_title,
                        'section': current_section,
                        'text': text
                    })
    
        # add article content to list
        chunk_list.extend(article_content)

In [3]:
# save chunk list to file (create the output directory if needed)
filename = 'data/chunk_list.json'
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, 'w', encoding='utf-8') as f:
    json.dump(chunk_list, f, indent=4, ensure_ascii=False)

### 2) embed chunks

In [4]:
# define text to embed
text_list = []
for content in chunk_list:
    # concatenate title and section header
    context = content['article_title'] + " - " + content['section'] + ": "
    # append paragraph text, capped so context + text stays within 512 characters
    # (max(0, ...) avoids a negative slice bound if the context alone is longer)
    text = context + content['text'][:max(0, 512-len(context))]
    
    text_list.append(text)
print("Num chunks:",len(text_list))

Num chunks: 778
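
The truncation step above can be isolated into a small helper. This is a minimal sketch, not the notebook's code: the name `build_embedding_text` and the `max_chars` parameter are hypothetical. It also guards one edge case: if the title/section prefix alone is longer than the budget, a bare `512 - len(context)` goes negative, and a negative slice bound trims characters from the end of the paragraph instead of dropping it.

```python
def build_embedding_text(chunk, max_chars=512):
    """Prefix a chunk with its title and section, capped at max_chars characters."""
    context = f"{chunk['article_title']} - {chunk['section']}: "
    # Clamp the remaining budget at zero so the slice bound never goes negative
    budget = max(0, max_chars - len(context))
    # The final slice also caps an over-long context (an extra assumption of this sketch)
    return (context + chunk['text'][:budget])[:max_chars]


# Made-up chunk whose context alone (605 characters) exceeds the budget
example = {
    'article_title': 'T' * 300,
    'section': 'S' * 300,
    'text': 'body text that does not fit',
}
text = build_embedding_text(example)
print(len(text))
```

Note that the cap is in characters, while embedding models count tokens, so this is only a rough proxy for the model's input limit.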


In [5]:
chunk_list[0]

{'article_title': 'Fine-Tuning BERT for Text Classification',
 'section': 'Fine-Tuning BERT for Text Classification',
 'text': 'Although today’s 100B+ parameter transformer models are state-of-the-art in AI, there’s still much we can accomplish with smaller (< 1B parameter) models. In this article, I will walk through one such example, fine-tuning BERT (110M parameters) to classify phishing URLs. I’ll start by covering key concepts and then share example Python code.'}

In [6]:
text_list[0]

'Fine-Tuning BERT for Text Classification - Fine-Tuning BERT for Text Classification: Although today’s 100B+ parameter transformer models are state-of-the-art in AI, there’s still much we can accomplish with smaller (< 1B parameter) models. In this article, I will walk through one such example, fine-tuning BERT (110M parameters) to classify phishing URLs. I’ll start by covering key concepts and then share example Python code.'

In [7]:
# load model
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# compute embeddings
chunk_embeddings = model.encode(text_list)
print(chunk_embeddings.shape)

# save chunk embeddings to file
# (encode() returns a NumPy array by default; torch.save pickles it as-is)
torch.save(chunk_embeddings, 'data/chunk_embeddings.pt')

(778, 384)


### 3) semantic search

In [8]:
# define query
query = "What is a token?"
query_embedding = model.encode(query)
print(query_embedding.shape)

# compute similarity between query and all chunks
similarities = model.similarity(query_embedding, chunk_embeddings)
print(similarities.shape)
# print(similarities[0])

(384,)
torch.Size([1, 778])
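
To make the similarity step concrete, here is a hedged sketch of cosine similarity on toy vectors. It assumes `model.similarity` computes cosine similarity, which is the Sentence Transformers default and what the `cos` in the model name suggests; the helper `cosine_similarity` and the three-dimensional vectors are made up for illustration (the real embeddings have 384 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query_vec = [1.0, 0.0, 1.0]
chunk_vecs = [
    [2.0, 0.0, 2.0],    # same direction as the query, different length
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0],  # opposite direction
]

# One similarity per chunk, analogous to one row of the [1, 778] result
toy_similarities = [cosine_similarity(query_vec, c) for c in chunk_vecs]
print(toy_similarities)
```

The three values come out close to 1, 0, and -1 respectively (up to floating-point rounding): only direction matters, not vector length.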


In [9]:
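# define search parameters
temp = 0.1        # softmax temperature (lower = sharper score distribution)
k = 3             # maximum number of results to return
threshold = 0.05  # minimum softmax score a chunk needs to be kept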

# Rescale similarities via softmax
scores = torch.nn.functional.softmax(similarities/temp, dim=1)

# Get sorted indices and scores
sorted_indices = scores.argsort(descending=True)[0]
sorted_scores = scores[0][sorted_indices]

# Filter by threshold and get top k
filtered_indices = [
    idx.item() for idx, score in zip(sorted_indices, sorted_scores) 
    if score.item() >= threshold
][:k]

# Get corresponding content items and scores
top_results = [chunk_list[i] for i in filtered_indices]
result_scores = [scores[0][i].item() for i in filtered_indices]
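
The effect of the temperature can be seen on a handful of numbers. The sketch below uses plain Python rather than `torch.nn.functional.softmax`, and the similarity values are made up; `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(values, temp=1.0):
    """Softmax of values / temp, returned as a list that sums to 1."""
    scaled = [v / temp for v in values]
    # Subtract the max before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


sims = [0.62, 0.48, 0.45, 0.10]  # made-up cosine similarities

sharp = softmax_with_temperature(sims, temp=0.1)  # temperature used in the notebook
flat = softmax_with_temperature(sims, temp=1.0)   # no sharpening

print([round(s, 3) for s in sharp])
print([round(s, 3) for s in flat])
```

With `temp=0.1` the best match takes most of the probability mass and the weakest one falls below 0.05; with `temp=1.0` the scores stay close together. Because softmax scores sum to 1 across all chunks, a fixed threshold such as 0.05 becomes harder to clear as the number of chunks grows, which is consistent with only one chunk surviving the filter in the output that follows.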

In [10]:
top_results

[{'article_title': 'Cracking Open the OpenAI (Python) API',
  'section': '2) OpenAI’s (Python)\xa0API',
  'text': 'Tokens, in the context of LLMs, are essentially a set of numbers representing a set of words and characters. For example, “The” could be a token, “ end” (with the space) could be another, and “.” another.'}]

### 4) display results

In [11]:
results_markdown = ""
for i, result in enumerate(top_results, start=1):
    results_markdown += f"{i}. **Article title:** {result['article_title']}  \n"
    results_markdown += f"   **Section:** {result['section']}  \n"
    results_markdown += f"   **Snippet:** {result['text']}  \n\n"

In [12]:
display(Markdown(results_markdown))

1. **Article title:** Cracking Open the OpenAI (Python) API  
   **Section:** 2) OpenAI’s (Python) API  
   **Snippet:** Tokens, in the context of LLMs, are essentially a set of numbers representing a set of words and characters. For example, “The” could be a token, “ end” (with the space) could be another, and “.” another.  



In [13]:
# bringing it all together
query = "What's the difference between RAG and Fine-tuning?"
results_markdown = semantic_search(query, model, chunk_embeddings, chunk_list, temp=0.1, k=10, threshold=0)
display(Markdown(results_markdown))

1. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** We’ve already mentioned situations where RAG and fine-tuning perform well. However, since this is such a common question, it’s worth reemphasizing when each approach works best.  

2. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** RAG is when we inject relevant context into an LLM’s input prompt so that it can generate more helpful responses. For example, if we have a domain-specific knowledge base (e.g., internal company documents and emails), we might identify the items most relevant to the user’s query so that an LLM can synthesize information in an accurate and digestible way.  

3. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** Here’s high-level guidance on when to use each.  

4. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** Notice that these approaches are not mutually exclusive. In fact, the original RAG system proposed by Facebook researchers used fine-tuning to better use retrieved information for generating responses [4].  

5. **Article title:** How to Improve LLMs with RAG  
   **Section:** Why we care  
   **Snippet:** Previous articles in this series discussed fine-tuning, which adapts an existing model for a particular use case. While this is an alternative way to endow an LLM with specialized knowledge, empirically, fine-tuning seems to be less effective than RAG at doing this [1].  

6. **Article title:** How to Improve LLMs with RAG  
   **Section:** Some Nuances  
   **Snippet:** Document preparation—The quality of a RAG system is driven by how well useful information can be extracted from source documents. For example, if a document is unformatted and full of images and tables, it will be more difficult to parse than a well-formatted text file.  

7. **Article title:** How to Improve LLMs with RAG  
   **Section:** Some Nuances  
   **Snippet:** While the steps for building a RAG system are conceptually simple, several nuances can make building one (in the real world) more complicated.  

8. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When NOT to Fine-tune  
   **Snippet:** The effectiveness of any approach will depend on the details of the use case. For example, fine-tuning is less effective than retrieval augmented generation (RAG) to provide LLMs with specialized knowledge [1].  

9. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** RAG: necessary knowledge for the task is not commonly known or available on the web but can be stored in a databaseFine-tuning: necessary knowledge for the task is already baked into the model, but you want to reduce the prompt size or refine response qualityRAG + Fine-tuning: the task requires specialized knowledge, and we would like to reduce the prompt size or refine the response quality  

10. **Article title:** How to Improve LLMs with RAG  
   **Section:** What is RAG?  
   **Snippet:** RAG works by adding a step to this basic process. Namely, a retrieval step is performed where, based on the user’s prompt, the relevant information is extracted from an external knowledge base and injected into the prompt before being passed to the LLM.  
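
`semantic_search` is imported from the local `functions` module, whose source is not shown in this notebook. The following is a hypothetical reconstruction that simply chains steps 3 and 4; the body, the default values, and the use of a plain-Python softmax in place of `torch.nn.functional.softmax` are all assumptions, not the author's implementation. It only relies on the model exposing `encode` and `similarity`, as used earlier.

```python
import math


def semantic_search(query, model, chunk_embeddings, chunk_list,
                    temp=0.1, k=3, threshold=0.05):
    """Hypothetical sketch: return the top-k matching chunks as a Markdown string."""
    # Embed the query and compare it with every chunk (one row of similarities)
    query_embedding = model.encode(query)
    sims = [float(s) for s in model.similarity(query_embedding, chunk_embeddings)[0]]

    # Temperature-scaled softmax over all chunks
    scaled = [s / temp for s in sims]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Rank by score, drop anything below the threshold, keep at most k
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = [i for i in ranked if scores[i] >= threshold][:k]

    # Format results in the same layout as step 4
    results_markdown = ""
    for rank, i in enumerate(top, start=1):
        result = chunk_list[i]
        results_markdown += f"{rank}. **Article title:** {result['article_title']}  \n"
        results_markdown += f"   **Section:** {result['section']}  \n"
        results_markdown += f"   **Snippet:** {result['text']}  \n\n"
    return results_markdown
```

Passing `threshold=0`, as in the call above, disables the score filter, so the function returns the `k` highest-scoring chunks regardless of how low their scores are.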

