## Importing Required Libraries

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # TensorFlow C++ log filter: 0 = show all, 1 = hide INFO, 2 = hide INFO and WARNING, 3 = also hide ERROR

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from sentence_transformers import SentenceTransformer
import faiss
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint

## Semantic Search
To build a semantic search system, we should begin with the fundamentals. Let's explore what semantic search is and why it improves on keyword-based information retrieval.

### Defining Semantic Search
Semantic search goes beyond the constraints of conventional keyword-based methods by grasping the underlying context and subtleties in user inputs. Essentially, semantic search:
- Improves the overall search process by discerning the purpose and situational implications of queries.
- Provides more precise and pertinent outcomes by examining the links between terms and expressions in the given context.
- Can be combined with user signals such as search history to tailor results, though personalization is a separate layer rather than part of semantic matching itself.

### The Basics of How Semantic Search Operates
So, how does this intelligent tool perform its role? It employs sophisticated techniques from Natural Language Processing, commonly abbreviated as NLP. Here's a straightforward breakdown of the mechanism:
- **Capturing the Essence**: Initially, the system processes your input and aims to capture its core idea. Rather than merely identifying keywords, it delves into the true intent.
- **Establishing Links**: Then, it considers various associations among words (for instance, recognizing that "physician" and "doctor" are synonymous). This aids in comprehending your request more accurately.
- **Selecting the Optimal Matches**: Lastly, it acts like a librarian who knows the whole collection. It sifts through the data and returns the items that best fit your likely intent.

### The Technical Aspects of Semantic Search
Having covered the essentials, let's examine the machinery that drives semantic search. It rests on vectors: not the arrows from a physics class, but a closely related idea applied to text.

#### Vectors: The Foundation of Semantic Search
In semantic search, a vector is a sequence of numbers that a machine uses to represent the meaning of words or phrases. Picture each word or sentence as a point in a multidimensional space; the closer two points are, the more similar their meanings.
- **Generating Vectors**: We convert text into vectors with an embedding model such as the Universal Sentence Encoder, which assigns a numerical signature to each piece of text.
- **Measuring Similarity**: To gauge how alike two texts are, we compare their vectors with a measure such as cosine similarity, the cosine of the angle between them (values near 1 mean the vectors point in almost the same direction).
- **Applying Vectors in Searches**: Upon entering a query, the system identifies vectors nearest to your query's vector. These nearest ones correspond to the most fitting responses.
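To make the similarity measure concrete, here is a minimal sketch of cosine similarity in NumPy. The three-dimensional vectors and their labels are hand-picked for illustration and do not come from any real embedding model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration only
doctor = [0.9, 0.1, 0.3]
physician = [0.8, 0.2, 0.35]
banana = [0.1, 0.9, 0.0]

sim_close = cosine_similarity(doctor, physician)
sim_far = cosine_similarity(doctor, banana)
print(f"doctor vs physician: {sim_close:.3f}")
print(f"doctor vs banana:    {sim_far:.3f}")
```

With real embeddings the vectors have hundreds of dimensions, but the computation is the same.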

#### The Role of Vectors in Enabling Search
Vectors excel at encapsulating the intricate nuances of language that surpass mere word appearances. In a semantic search system, the process unfolds as follows:
1. **Vector Conversion**: As soon as a query is submitted, the system transforms the input into a vector.
2. **Database Scanning**: It rapidly reviews an enormous collection of precomputed vectors, each linked to various data entries.
3. **Information Extraction**: By pinpointing the most similar vectors, the system retrieves content that is related in meaning, even when it shares few or no words with the query.
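The three steps above can be sketched with a brute-force search over a tiny, made-up collection. The document vectors and the query vector here are assumptions standing in for the output of an embedding model; the helper name `top_k_cosine` is also hypothetical.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=2):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q                   # cosine similarity of every document with the query
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

# Hypothetical precomputed document vectors (in practice these come from an embedding model)
corpus = ["bike maintenance tips", "motorbike riding gear", "chocolate cake recipe"]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])
query_vector = np.array([1.0, 0.2, 0.0])  # stands in for an embedded query such as "motorcycle"

top_idx, top_scores = top_k_cosine(query_vector, doc_vectors, k=2)
for rank, (i, s) in enumerate(zip(top_idx, top_scores), start=1):
    print(f"Rank {rank}: {corpus[i]} (cosine similarity {s:.3f})")
```

Comparing the query against every stored vector is fine for a handful of documents; dedicated index libraries exist because this scan becomes expensive as the collection grows.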

By the conclusion of this tutorial, you'll know how to construct a search engine capable of these functions and beyond. We'll proceed gradually, starting from the basics. Prepared? Let's dive in!

## Understanding Vectorization and Indexing

Vectorization and indexing form the backbone of any effective semantic search engine. Let's dive into how these processes function, using the **Universal Sentence Encoder (USE)** for vectorization and **FAISS** for fast similarity search.

#### What does the Universal Sentence Encoder actually do?

The Universal Sentence Encoder transforms sentences—regardless of their length or complexity—into high-dimensional vectors. These vectors are simply lists of numbers that encode the semantic meaning of the text. Here's what makes USE so powerful:

- **Deep Language Understanding** — It captures the contextual meaning of sentences by considering how words interact with each other.
- **Broad Applicability** — Trained on diverse datasets, it performs well across many domains, languages (to some extent), and writing styles.
- **High Efficiency** — After training, it encodes sentences into vectors very quickly, making it practical for real-time applications.

#### How does the Universal Sentence Encoder work?

At its core, USE relies on advanced deep learning architectures (typically transformer-based or DAN — Deep Averaging Network variants). The process can be summarized as:

1. **Word & Context Analysis**: It examines individual words along with their surrounding context to build a rich understanding.
2. **Contextual Awareness**: The transformer variant uses word order and the relationships between words to interpret the overall intent. The DAN variant averages word and bigram embeddings, so it is less sensitive to word order but cheaper to run.
3. **Vector Generation**: All this information is compressed into a fixed-size numerical vector (512 dimensions for USE) that represents the sentence's meaning.
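The idea of a fixed-size output can be illustrated with a toy averaging encoder. This is not USE: the word vectors are pseudo-random, the single dense layer is untrained, and the names `word_vector`, `toy_encode`, `EMBED_DIM`, and `OUT_DIM` are made up for this sketch. It only shows that averaging turns sentences of any length into vectors of one size, and that pure averaging discards word order.

```python
import zlib
import numpy as np

EMBED_DIM = 16   # toy size for the per-word vectors
OUT_DIM = 8      # toy size for the sentence vector (USE itself outputs 512 dimensions)

def word_vector(word):
    """Deterministic pseudo-random vector per word (stand-in for learned word embeddings)."""
    seed = zlib.crc32(word.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)

# Random, untrained projection weights; a real model learns these during training
W = np.random.default_rng(0).standard_normal((EMBED_DIM, OUT_DIM))

def toy_encode(sentence):
    """Average the word vectors, then apply one dense layer with tanh."""
    words = sentence.lower().split()
    averaged = np.mean([word_vector(w) for w in words], axis=0)
    return np.tanh(averaged @ W)

short_vec = toy_encode("Hello world")
long_vec = toy_encode("The quick brown fox jumps over the lazy dog near the river bank")
print(short_vec.shape, long_vec.shape)  # both vectors have OUT_DIM entries

# Plain averaging ignores word order, so these two sentences map to the same vector
same = np.allclose(toy_encode("dog bites man"), toy_encode("man bites dog"))
print("order-insensitive:", same)
```

Trained encoders add learned weights (and, in transformer variants, attention over word positions) on top of this basic shape contract.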

#### What is FAISS and what role does it play?

FAISS (Facebook AI Similarity Search) is a highly optimized library designed for efficient similarity search and clustering of dense vectors. Once we have vectors from USE, FAISS enables us to quickly find the most similar ones among potentially millions or billions of entries.

Key strengths of FAISS include:

- **Blazing-fast Search** — It uses sophisticated algorithms to perform nearest-neighbor searches in milliseconds even on huge datasets.
- **Massive Scalability** — Handles vector collections too large for RAM through efficient indexing techniques and quantization.
- **Excellent Precision** — Offers tunable trade-offs between speed and accuracy via different indexing methods (IVF, HNSW, etc.).

#### How does FAISS work (simplified)?

FAISS offers index structures that make similarity search much faster than brute-force comparison, alongside exact "flat" indexes that compare the query against every stored vector. For the approximate indexes, the basic workflow looks like this:

1. **Index Construction** — It organizes all pre-computed vectors so that similar items are grouped or structured efficiently (e.g., using clustering or graph-based methods).
2. **Query Processing** — When a new query vector arrives (from USE), FAISS quickly narrows down the search space to the most promising regions.
3. **Result Retrieval** — It returns the top-k most similar vectors (and their corresponding documents) with impressive speed and accuracy.
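The workflow above can be sketched in plain NumPy as a toy inverted-file index: cluster the vectors, then search only the clusters closest to the query. This illustrates the idea only and is not how FAISS is implemented internally; the function names (`sq_dists`, `build_ivf`, `ivf_search`) and all parameter values are assumptions made for the example.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared L2 distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def build_ivf(vectors, n_clusters=8, n_iter=10, seed=0):
    """Toy index construction: k-means centroids plus a list of vector ids per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = sq_dists(vectors, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):                      # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Final assignment, consistent with the final centroids
    assign = sq_dists(vectors, centroids).argmin(axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=3, n_probe=2):
    """Query processing: scan only the n_probe clusters whose centroids are nearest the query."""
    probe = sq_dists(query[None, :], centroids)[0].argsort()[:n_probe]
    candidates = np.concatenate([lists[c] for c in probe])
    d = ((vectors[candidates] - query) ** 2).sum(axis=1)
    order = d.argsort()[:k]
    return candidates[order], d[order]

# Random stand-in data; real vectors would come from the embedding model
data = np.random.default_rng(1).standard_normal((500, 16)).astype(np.float32)
centroids, lists = build_ivf(data)
query = data[42]

ids, dists = ivf_search(query, data, centroids, lists, k=3, n_probe=2)
exact_ids = ((data - query) ** 2).sum(axis=1).argsort()[:3]
print("approximate:", ids)
print("exact:      ", exact_ids)
```

Raising `n_probe` scans more clusters, which moves the approximate result closer to the exact one at the cost of more distance computations. That is the speed versus accuracy trade-off mentioned above.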

#### Putting It All Together

USE and FAISS form a perfect partnership for semantic search:  
USE translates human language into a mathematical representation (vectors), while FAISS enables lightning-fast discovery of the most semantically similar content.

The complete flow runs from raw text through vectorization and indexing to similarity search: text gets encoded, stored in a vector index, and queried efficiently.

With this combination, you get a semantic search system that's both intelligent (understands meaning) and performant (searches at scale). Ready to move to the next step of actually implementing it? Let's keep building!

In [2]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [3]:
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [4]:
# Display the first 3 posts from the dataset
for i in range(3):
    print(f"Sample post {i+1}:\n")
    pprint(newsgroups_train.data[i])
    print("\n" + "-"*80 + "\n")

Sample post 1:

("From: lerxst@wam.umd.edu (where's my thing)\n"
 'Subject: WHAT car is this!?\n'
 'Nntp-Posting-Host: rac3.wam.umd.edu\n'
 'Organization: University of Maryland, College Park\n'
 'Lines: 15\n'
 '\n'
 ' I was wondering if anyone out there could enlighten me on this car I saw\n'
 'the other day. It was a 2-door sports car, looked to be from the late 60s/\n'
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition,\n'
 'the front bumper was separate from the rest of the body. This is \n'
 'all I know. If anyone can tellme a model name, engine specs, years\n'
 'of production, where this car is made, history, or whatever info you\n'
 'have on this funky looking car, please e-mail.\n'
 '\n'
 'Thanks,\n'
 '- IL\n'
 '   ---- brought to you by your neighborhood Lerxst ----\n'
 '\n'
 '\n'
 '\n'
 '\n')

--------------------------------------------------------------------------------

Sample post 2:

('From: guykuo@carson.u.washington.edu (Guy Kuo)\n'
 '

## Pre-processing Data

In [5]:
newsgroups = fetch_20newsgroups(subset='all')
documents = newsgroups.data

# Basic preprocessing of text data
def preprocess_text(text):
    # Remove every line that starts with "From:" (other headers such as Subject and Organization are kept)
    text = re.sub(r'^From:.*\n?', '', text, flags=re.MULTILINE)
    # Remove email addresses
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Remove punctuation and digits (everything except letters and whitespace)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Preprocess each document
processed_documents = [preprocess_text(doc) for doc in documents]
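Note that `preprocess_text` only drops lines beginning with `From:`, so other header fields (Subject, Organization, Lines) stay in the text that gets embedded. If that is not wanted, one possible variant is to cut everything before the first blank line, which separates headers from the body in these posts. The sketch below is an assumption about how this could be done, not part of the original pipeline; `strip_headers` and the sample post are hypothetical. scikit-learn's `fetch_20newsgroups` also accepts `remove=('headers', 'footers', 'quotes')` for a similar effect at load time.

```python
def strip_headers(post):
    """Drop everything up to the first blank line, which separates headers from the body."""
    _, sep, body = post.partition("\n\n")
    return body if sep else post  # no blank line found: leave the text unchanged

# Hypothetical post in the same layout as the dataset samples
sample_post = (
    "From: someone@example.com (Example User)\n"
    "Subject: WHAT car is this!?\n"
    "Organization: Example University\n"
    "Lines: 15\n"
    "\n"
    "I was wondering if anyone out there could enlighten me on this car.\n"
)

body_only = strip_headers(sample_post)
print(body_only)
```

Whether to keep the Subject line is a judgment call: it often carries useful topical signal, so removing every header can cost some retrieval quality.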

In [6]:
# Choose a sample post to display
sample_index = 0  # for example, the first post in the dataset

# Print the original post
print("Original post:\n")
print(newsgroups_train.data[sample_index])
print("\n" + "-"*80 + "\n")

# Print the preprocessed post
print("Preprocessed post:\n")
print(preprocess_text(newsgroups_train.data[sample_index]))
print("\n" + "-"*80 + "\n")

Original post:

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----






--------------------------------------------------------------------------------

Preprocessed post:

subject what car is this nntppostinghost racwamumdedu organization university of maryland college park lines i was wondering if anyone out there could enlighte

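## Universal Sentence Encoder

The cells below produce embeddings with a `sentence-transformers` model (`BAAI/bge-small-en-v1.5`, 384 dimensions) in place of the Universal Sentence Encoder; it plays the same role in the pipeline.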

In [17]:
# Load the model once. device='cpu' forces CPU; omit it to let the library pick a GPU if one is available.
# Choose one model based on your needs:
#   - 'all-MiniLM-L6-v2'     → fastest & lightest (~80 MB, 384 dim)
#   - 'all-mpnet-base-v2'    → best quality in medium size (~420 MB, 768 dim)
#   - 'BAAI/bge-small-en-v1.5' → excellent modern retrieval model (~130 MB, 384 dim)
#model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')
model = SentenceTransformer('BAAI/bge-small-en-v1.5', device='cpu')

# Helper that generates embeddings; used later to embed search queries
def embed_text(texts):
    """
    texts: a single string or a list of strings
    Returns: numpy array of embeddings
    """
    # encode() accepts either a single string or a list of strings
    return model.encode(texts, convert_to_numpy=True, show_progress_bar=True)

In [18]:
# Generate embeddings for every preprocessed document.
# processed_documents is the list of strings built in the preprocessing step.
# Passing the whole list in one call lets encode() batch the work internally.
X_embeddings = model.encode(
    processed_documents,
    batch_size=32,           # adjust: 16-64 good for CPU, higher = faster but more RAM
    show_progress_bar=True,  # nice progress bar for large lists
    convert_to_numpy=True    # returns numpy array directly 
)

# X_embeddings has shape (len(processed_documents), embedding_dimension),
# e.g. (N_docs, 384) for BAAI/bge-small-en-v1.5
print("Embeddings shape:", X_embeddings.shape)

Embeddings shape: (18846, 384)


## FAISS Indexing

After converting documents to vectors with the sentence embedding model (here `BAAI/bge-small-en-v1.5`, filling the role described earlier for the Universal Sentence Encoder), FAISS (Facebook AI Similarity Search) enables fast similarity queries.

### Building the FAISS Index

- Read the vector dimension from `X_embeddings.shape[1]`
- Initialize an exact L2-distance index: `faiss.IndexFlatL2(dimension)`
- Store the vectors with `index.add(X_embeddings)`, which makes every document searchable

#### Index Selection

- `IndexFlatL2` suits small and medium datasets; it compares the query against every stored vector, so results are exact
- FAISS provides specialized indexes for larger scale:
  - `IndexIVFFlat`: partitions vectors into clusters and searches only a few of them, trading some accuracy for speed
  - `IndexIVFPQ`: additionally compresses vectors for massive datasets
- See [FAISS indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes) for production needs

In [21]:
dimension = X_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # Creating a FAISS index
index.add(X_embeddings)  # Adding the document vectors to the index

## FAISS Querying

### Search Function Overview

- `search()` finds documents that are semantically similar to an input query
- It preprocesses the query, converts it to a vector, and runs a FAISS nearest-neighbor search for the top `k` results
- It returns the distances and indices of the top matches

### Running Queries

- Test with queries like "motorcycle"
- Results show:
  - Rank position
  - Distance (`IndexFlatL2` reports squared L2 distance; lower = closer match)
  - The preprocessed document text, followed in the next cell by the original text

This demonstrates the strength of semantic search: context-aware retrieval beyond keyword matching.
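The distances reported by `IndexFlatL2` are squared Euclidean distances. If the embeddings are unit-length, a squared L2 distance maps directly to a cosine similarity through the identity `||a - b||^2 = 2 - 2 * cos(a, b)`. The sketch below checks that identity on random unit vectors; the data is synthetic, the helper `squared_l2_to_cosine` is hypothetical, and whether a given model returns normalized vectors is an assumption to verify (in `sentence-transformers`, `encode(..., normalize_embeddings=True)` requests it explicitly).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic unit-length "embeddings": random rows, each normalized to norm 1
vecs = rng.standard_normal((4, 384)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

query, docs = vecs[0], vecs[1:]

squared_l2 = ((docs - query) ** 2).sum(axis=1)  # the kind of value a squared-L2 index reports
cosine = docs @ query                            # cosine similarity, since all norms are 1

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(squared_l2, 2 - 2 * cosine, atol=1e-5))

def squared_l2_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors into a cosine similarity."""
    return 1.0 - distance / 2.0
```

Under the unit-length assumption, a reported distance of about 0.76 would correspond to a cosine similarity of roughly 0.62. The ranking is the same either way, since a smaller squared L2 distance always means a larger cosine similarity for normalized vectors.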

In [22]:
# Function to perform a query using the Faiss index
def search(query_text, k=5):
    # Preprocess the query text
    preprocessed_query = preprocess_text(query_text)
    # Generate the query vector
    query_vector = embed_text([preprocessed_query])
    # Perform the search
    distances, indices = index.search(query_vector.astype('float32'), k)
    return distances, indices

# Example Query
query_text = "motorcycle"
distances, indices = search(query_text)

# Display the results
for i, idx in enumerate(indices[0]):
    # Show the preprocessed text, which is what was actually embedded and indexed
    print(f"Rank {i+1}: (Distance: {distances[0][i]})\n{processed_documents[idx]}\n")

Rank 1: (Distance: 0.7634575366973877)
subject re newbie organization nasa science internet project office lines in article writes hey there yea thats what i ama newbie i have never owned a motorcycle this makes it is spring matt ps i am not really sure what the purpose of this article was butoh well neither were we read for a few days then try again curt howland ace dod eff v sabre meddle not in the afairs of wizards for it makes them soggy and hard to relight

Rank 2: (Distance: 0.8812552094459534)
subject re first bike organization bc systems corporation lines in article james leo belliveau writes i am a serious motorcycle enthusiast without a motorcycle and to put it bluntly it sucks i really would like some advice on what would oh for a second i thought this was a posting by ed green bruce clarke bc environment email

Rank 3: (Distance: 0.8916760683059692)
subject re new to motorcycles organization hp sonoma county srsdmwtdmid xnewsreader tin version pl lines gregory humphreys wro

In [23]:
# Display the results
for i, idx in enumerate(indices[0]):
    # Displaying the original (unprocessed) document corresponding to the search result
    print(f"Rank {i+1}: (Distance: {distances[0][i]})\n{documents[idx]}\n")

Rank 1: (Distance: 0.7634575366973877)
From: howland@noc2.arc.nasa.gov (Curt Howland)
Subject: Re: Newbie
Organization: NASA Science Internet Project Office
Lines: 16

In article <C5swox.GwI@mailer.cc.fsu.edu>, os048@xi.cs.fsu.edu () writes:
|>  hey there,
|>     Yea, thats what I am....a newbie. I have never owned a motorcycle,

This makes 5! It IS SPRING!

|> Matt
|> PS I am not really sure what the purpose of this article was but...oh well

Neither were we. Read for a few days, then try again.

---
Curt Howland "Ace"       DoD#0663       EFF#569
howland@nsipo.nasa.gov            '82 V45 Sabre
     Meddle not in the afairs of Wizards,
 for it makes them soggy and hard to re-light.


Rank 2: (Distance: 0.8812552094459534)
From: bclarke@galaxy.gov.bc.ca
Subject: Re: First Bike??
Organization: BC Systems Corporation
Lines: 8

In article <0forqFa00iUzMATnMz@andrew.cmu.edu>, James Leo Belliveau <jbc9+@andrew.cmu.edu> writes:
>     I am a serious motorcycle enthusiast without a motorcycle,