# Semantic Search with Text Embeddings
## ABB #3 - Session 4

Code authored by: Shaw Talebi

### imports

In [1]:
import os
from bs4 import BeautifulSoup
import json
from sentence_transformers import SentenceTransformer
import torch
from IPython.display import display, Markdown
from functions import semantic_search  # helper used in the final cell

### 1) chunk articles

In [2]:
# List every file in the articles directory (non-HTML files are skipped in the loop)
filename_list = [os.path.join('articles', f) for f in os.listdir('articles')]

chunk_list = []
for filename in filename_list:
    # only process .html files
    if filename.lower().endswith('.html'):
        # read html file
        with open(filename, 'r', encoding='utf-8') as file:
            html_content = file.read()
    
        # Parse HTML
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Get article title
        title_tag = soup.find('title')
        article_title = title_tag.get_text().strip() if title_tag else "Untitled"
        
        # Initialize variables
        article_content = []
        current_section = "Main"  # Default section if no headers found
        
        # Find all headers and text content
        content_elements = soup.find_all(['h1', 'h2', 'h3', 'p', 'ul', 'ol'])
    
        # iterate through elements and extract text with metadata
        for element in content_elements:
            if element.name in ['h1', 'h2', 'h3']:
                current_section = element.get_text().strip()
            elif element.name in ['p', 'ul', 'ol']:
                text = element.get_text().strip()
                # Only add non-empty content that's at least 30 characters long
                if text and len(text) >= 30:
                    article_content.append({
                        'article_title': article_title,
                        'section': current_section,
                        'text': text
                    })
    
        # add article content to list
        chunk_list.extend(article_content)

In [3]:
# save chunk list to file (create the output directory if needed)
os.makedirs('data', exist_ok=True)
chunk_path = 'data/chunk_list.json'
with open(chunk_path, 'w', encoding='utf-8') as f:
    json.dump(chunk_list, f, indent=4, ensure_ascii=False)

### 2) embed chunks

In [4]:
# define text to embed
text_list = []
for content in chunk_list:
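    # build a context prefix from the article title and section header
    context = content['article_title'] + " - " + content['section'] + ": "
    # append the paragraph text, capping the combined string at 512 characters
    # (a character budget, not a token count; max() guards against a negative slice)
    text = context + content['text'][:max(0, 512 - len(context))]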
    
    text_list.append(text)
print("Num chunks:",len(text_list))

Num chunks: 778
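
The string-building step in the cell above can be factored into a small helper. The sketch below is one way to do it, assuming the same 512-character budget; `build_embedding_text` and `max_chars` are names introduced here for illustration and are not part of the original notebook.

```python
def build_embedding_text(chunk, max_chars=512):
    """Prefix a chunk's text with its title and section, then cap the length."""
    context = f"{chunk['article_title']} - {chunk['section']}: "
    # Clamp at zero so an unusually long title/section cannot yield a negative slice
    budget = max(0, max_chars - len(context))
    # Note: the context prefix itself is never truncated, only the paragraph text
    return context + chunk['text'][:budget]


example_chunk = {
    'article_title': 'Title',
    'section': 'Intro',
    'text': 'x' * 1000,  # deliberately longer than the budget
}
example_text = build_embedding_text(example_chunk)
print(len(example_text))
```

The cap is measured in characters, so it only approximates the model's token limit. It mainly keeps very long paragraphs from dominating the input.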


In [5]:
chunk_list[0]

{'article_title': 'Fine-Tuning BERT for Text Classification',
 'section': 'Fine-Tuning BERT for Text Classification',
 'text': 'Although today’s 100B+ parameter transformer models are state-of-the-art in AI, there’s still much we can accomplish with smaller (< 1B parameter) models. In this article, I will walk through one such example, fine-tuning BERT (110M parameters) to classify phishing URLs. I’ll start by covering key concepts and then share example Python code.'}

In [6]:
text_list[0]

'Fine-Tuning BERT for Text Classification - Fine-Tuning BERT for Text Classification: Although today’s 100B+ parameter transformer models are state-of-the-art in AI, there’s still much we can accomplish with smaller (< 1B parameter) models. In this article, I will walk through one such example, fine-tuning BERT (110M parameters) to classify phishing URLs. I’ll start by covering key concepts and then share example Python code.'

In [7]:
# load model
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# compute embeddings
chunk_embeddings = model.encode(text_list)
print(chunk_embeddings.shape)

# save chunk embeddings to file (model.encode returns a NumPy array by default)
torch.save(chunk_embeddings, 'data/chunk_embeddings.pt')

(778, 384)


### 3) semantic search

In [8]:
# define query
query = "What is a token?"
query_embedding = model.encode(query)
print(query_embedding.shape)

# compute similarity between query and all chunks
similarities = model.similarity(query_embedding, chunk_embeddings)
print(similarities.shape)
# print(similarities[0])

(384,)
torch.Size([1, 778])
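
To make the similarity step concrete, here is a minimal pure-Python sketch of cosine similarity. It assumes `model.similarity` is using cosine similarity, which is the default similarity function in Sentence Transformers. The toy vectors and the `cosine_similarity` helper are illustrative, not taken from the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (the real ones above have 384 dimensions)
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # partially aligned with the query
]

toy_similarities = [cosine_similarity(query_vec, c) for c in chunk_vecs]
print([round(s, 3) for s in toy_similarities])
```

Because cosine similarity depends only on direction, the third vector's larger magnitude does not raise its score above that of the first.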


In [9]:
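# define search parameters
temp = 0.1        # softmax temperature; lower values sharpen the score distribution
k = 3             # maximum number of results to return
threshold = 0.05  # minimum softmax score a chunk needs to be returned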

# Rescale similarities via softmax
scores = torch.nn.functional.softmax(similarities/temp, dim=1)

# Get sorted indices and scores
sorted_indices = scores.argsort(descending=True)[0]
sorted_scores = scores[0][sorted_indices]

# Filter by threshold and get top k
filtered_indices = [
    idx.item() for idx, score in zip(sorted_indices, sorted_scores) 
    if score.item() >= threshold
][:k]

# Get corresponding content items and scores
top_results = [chunk_list[i] for i in filtered_indices]
result_scores = [scores[0][i].item() for i in filtered_indices]
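
The effect of the temperature is easier to see on a handful of numbers. The sketch below applies the same softmax-with-temperature idea to four made-up similarity values; `softmax_with_temperature` and `toy_sims` are hypothetical names used only for this illustration.

```python
import math


def softmax_with_temperature(values, temp):
    """Softmax of values / temp, with the max subtracted for numerical stability."""
    scaled = [v / temp for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_sims = [0.60, 0.50, 0.20, 0.10]  # hypothetical cosine similarities

soft = softmax_with_temperature(toy_sims, temp=1.0)   # nearly flat scores
sharp = softmax_with_temperature(toy_sims, temp=0.1)  # mass concentrates on the top items

print("temp=1.0:", [round(s, 3) for s in soft])
print("temp=0.1:", [round(s, 3) for s in sharp])
```

With these toy values, all four scores clear a 0.05 threshold at `temp=1.0`, while only the top two do at `temp=0.1`. Since softmax scores sum to 1 across all chunks, a sensible threshold also depends on how many chunks are in the index.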

In [10]:
top_results

[{'article_title': 'Cracking Open the OpenAI (Python) API',
  'section': '2) OpenAI’s (Python)\xa0API',
  'text': 'Tokens, in the context of LLMs, are essentially a set of numbers representing a set of words and characters. For example, “The” could be a token, “ end” (with the space) could be another, and “.” another.'}]

In [11]:
result_scores

[0.20563369989395142]

### 4) display results

In [12]:
results_markdown = ""
for i, result in enumerate(top_results, start=1):
    results_markdown += f"{i}. **Article title:** {result['article_title']}  \n"
    results_markdown += f"   **Section:** {result['section']}  \n"
    results_markdown += f"   **Snippet:** {result['text']}  \n\n"

In [13]:
display(Markdown(results_markdown))

1. **Article title:** Cracking Open the OpenAI (Python) API  
   **Section:** 2) OpenAI’s (Python) API  
   **Snippet:** Tokens, in the context of LLMs, are essentially a set of numbers representing a set of words and characters. For example, “The” could be a token, “ end” (with the space) could be another, and “.” another.  



In [14]:
# bringing it all together
query = "What is attention?"
results_markdown = semantic_search(query, model, chunk_embeddings, chunk_list, temp=0.1, k=10, threshold=0)
display(Markdown(results_markdown))

1. **Article title:** How to Build an LLM from Scratch  
   **Section:** Step 2: Model Architecture  
   **Snippet:** Attention allows the neural network to capture the importance of content and position for modeling language. This has been an idea in ML for decades. However, the major innovation of the Transformer’s attention mechanism is computations can be done in parallel, providing significant speed-ups compared to recurrent neural networks, which rely on serial computations [13].  

2. **Article title:** How to Build an LLM from Scratch  
   **Section:** Step 2: Model Architecture  
   **Snippet:** A transformer is a neural network architecture that uses attention mechanisms to generate mappings between inputs and outputs. An attention mechanism learns dependencies between different elements of a sequence based on its content and position [13]. This comes from the intuition that with language, context matters.  

3. **Article title:** How to Build an LLM from Scratch  
   **Section:** Step 2: Model Architecture  
   **Snippet:** Encoder-Decoder — we can combine the encoder and decoder modules to create an encoder-decoder transformer. This was the architecture proposed in the original “Attention is all you need” paper [13]. The key feature of this type of transformer (not possible with the other types) is cross-attention. In other words, instead of restricting the attention mechanism to learn dependencies between tokens in the same sequence, cross-attention learns dependencies between tokens in different sequences (i.e. sequences from encoder and decoder modules). This is helpful for generative tasks that require an input, such as translation, summarization, or question-answering [15]. Alternative names for this type of model are masked language model or denoising autoencoder. A popular LLM using this design is Facebook’s BART [17].  

4. **Article title:** How to Improve LLMs with RAG  
   **Section:** Resources  
   **Snippet:** Socials: YouTube 🎥 | LinkedIn | Instagram  

5. **Article title:** Cracking Open the Hugging Face Transformers Library  
   **Section:** What is Hugging Face?  
   **Snippet:** The power of these resources is that they are community generated, which leverages all the benefits of open-source (i.e. cost-free, wide diversity of tools, high-quality resources, and rapid pace of innovation). While these make building powerful ML projects more accessible than before, there is another key element of the Hugging Face ecosystem — the Transformers library.  

6. **Article title:** How to Improve LLMs with RAG  
   **Section:** Why we care  
   **Snippet:** Notice that RAG does not fundamentally change how we use an LLM; it's still prompt-in and response-out. RAG simply augments this process (hence the name).  

7. **Article title:** Cracking Open the Hugging Face Transformers Library  
   **Section:** Conclusion  
   **Snippet:** Hugging Face has become synonymous with open-source language models and machine learning. The biggest advantage of their ecosystem is it gives small-time developers, researchers, and tinkers access to powerful ML resources.  

8. **Article title:** A Practical Introduction to LLMs  
   **Section:** Resources  
   **Snippet:** Socials: YouTube 🎥 | LinkedIn | Twitter  

9. **Article title:** Multimodal Models — LLMs that can see and hear  
   **Section:** Example: Using LLaMA 3.2 Vision for Image-based Tasks  
   **Snippet:** Objectively describing a scene is simpler than understanding and explaining humor. Let’s see how the model explains the meme below.  

10. **Article title:** Cracking Open the OpenAI (Python) API  
   **Section:** Resources  
   **Snippet:** Socials: YouTube 🎥 | LinkedIn | Twitter  
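
The `semantic_search` helper is imported from `functions`, and its source is not shown in this notebook. As a rough guide, the sketch below bundles the steps from sections 3 and 4 into one function with the same arguments. It is an assumption about how such a helper could look, not the actual implementation, and the name `semantic_search_sketch` is hypothetical.

```python
import math


def semantic_search_sketch(query, model, chunk_embeddings, chunk_list,
                           temp=0.1, k=3, threshold=0.05):
    """Return the top-k chunks for a query as a Markdown string."""
    # Embed the query and compare it against every chunk embedding
    query_embedding = model.encode(query)
    similarities = model.similarity(query_embedding, chunk_embeddings)
    sims = [float(s) for s in similarities[0]]

    # Softmax with temperature (max subtracted for numerical stability)
    peak = max(sims)
    exps = [math.exp((s - peak) / temp) for s in sims]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Rank by score, drop anything below the threshold, keep at most k
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = [i for i in ranked if scores[i] >= threshold][:k]

    # Format the surviving chunks as a numbered Markdown list
    parts = []
    for rank, i in enumerate(top, start=1):
        chunk = chunk_list[i]
        parts.append(
            f"{rank}. **Article title:** {chunk['article_title']}  \n"
            f"   **Section:** {chunk['section']}  \n"
            f"   **Snippet:** {chunk['text']}  \n\n"
        )
    return "".join(parts)
```

With `threshold=0`, as in the call above, every chunk passes the filter, so the list is always filled up to `k` results. That is why low-relevance snippets such as the "Socials" lines appear once the strong matches run out. A nonzero threshold would cut the list short instead.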

