Semantic search allows computers to understand the meaning behind user queries rather than relying only on exact keyword matching. Using FAISS (Facebook AI Similarity Search), we can build a high-performance system that searches through hundreds or even thousands of documents by meaning and not just by text overlap. This approach enables smarter, faster and more context-aware information retrieval.

Step 1: Install Dependencies

In [1]:
!pip install sentence-transformers faiss-cpu PyPDF2 python-docx numpy


Collecting sentence-transformers
  Downloading sentence_transformers-5.2.0-py3-none-any.whl.metadata (16 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.1-cp312-cp312-win_amd64.whl.metadata (7.6 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting python-docx
  Downloading python_docx-1.2.0-py3-none-any.whl.metadata (2.0 kB)
Collecting lxml>=3.1.0 (from python-docx)
  Downloading lxml-6.0.2-cp312-cp312-win_amd64.whl.metadata (3.7 kB)
Downloading sentence_transformers-5.2.0-py3-none-any.whl (493 kB)
Downloading faiss_cpu-1.13.1-cp312-cp312-win_amd64.whl (18.8 MB)


Step 2: Import Libraries


In [2]:
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
from docx import Document



Step 3: Extract Text from Documents
We define a document-loading function that:

Reads different file formats: .pdf, .docx and .txt.
Returns the content as plain text, ready for embedding generation.

In [7]:
def extract_text_from_file(file_path):
    """
    Extracts text content from PDF, DOCX, and TXT files.
    """
    text = ""
    lower_path = file_path.lower()  # match extensions case-insensitively
    if lower_path.endswith(".pdf"):
        reader = PdfReader(file_path)
        for page in reader.pages:
            # Guard against pages with no extractable text (e.g. scanned images)
            text += (page.extract_text() or "") + "\n"
    elif lower_path.endswith(".docx"):
        doc = Document(file_path)
        for para in doc.paragraphs:
            text += para.text + "\n"
    elif lower_path.endswith(".txt"):
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
    else:
        raise ValueError(f"Unsupported file format: {file_path}")

    return text.strip()
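
As a variation on the if/elif chain above, format handling can also be organised as a lookup table from file extension to reader function. The sketch below is only an illustration: READERS, read_txt and extract_text are hypothetical names that do not appear in the notebook, and only the .txt reader is implemented so the example runs without PyPDF2 or python-docx.

```python
import os
import tempfile


def read_txt(path):
    """Read a UTF-8 text file and return its content."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Hypothetical registry: PDF and DOCX readers would be registered the same way.
READERS = {".txt": read_txt}


def extract_text(path):
    """Pick a reader by (case-insensitive) file extension."""
    extension = os.path.splitext(path)[1].lower()
    reader = READERS.get(extension)
    if reader is None:
        raise ValueError(f"Unsupported file format: {extension or '(none)'}")
    return reader(path).strip()


# Write a small sample file and read it back.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "NOTES.TXT")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("  hello semantic search \n")
    extracted = extract_text(sample_path)

print(extracted)
```

Supporting another format then means adding one entry to the table rather than another branch.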


Step 4: Split Text into Chunks
Divides long documents into smaller segments (chunks).
Improves search accuracy and performance by focusing on smaller text units.

In [8]:
def chunk_text(text, chunk_size=300):
    """
    Splits a long text into smaller chunks for semantic embedding.
    """
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
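
To see what the word-based chunking produces, here is a minimal sketch on a tiny input. The second function, chunk_text_with_overlap, and its overlap parameter are not part of the notebook; they are an assumption showing one common way to let neighbouring chunks share some context.

```python
def chunk_text(text, chunk_size=300):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def chunk_text_with_overlap(text, chunk_size=300, overlap=50):
    """Hypothetical variant: each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


sample = "one two three four five six seven"
plain_chunks = chunk_text(sample, chunk_size=3)
overlap_chunks = chunk_text_with_overlap(sample, chunk_size=3, overlap=1)

print(plain_chunks)
print(overlap_chunks)
```

With overlap, the index holds more chunks for the same text, so the choice trades index size against the risk of splitting related sentences apart.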


Step 5: Load and Process Documents
This step:

Reads all supported files (.pdf, .docx, .txt) in the folder given by folder_path.
Splits them into manageable text chunks and stores the source file name for each chunk.

In [11]:
folder_path = r"E:\fiass\semantic search example documents"
documents = []
doc_sources = []
files_read = 0  # count only the files that were actually processed

for file in os.listdir(folder_path):
    if file.endswith((".pdf", ".docx", ".txt")):
        path = os.path.join(folder_path, file)
        print(f"📄 Reading file: {file}")
        content = extract_text_from_file(path)
        chunks = chunk_text(content)
        documents.extend(chunks)
        doc_sources.extend([file] * len(chunks))
        files_read += 1

print(f"\nLoaded {len(documents)} text chunks from {files_read} files.")


📄 Reading file: ai_intro.pdf
📄 Reading file: data_science.docx
📄 Reading file: db_basics.txt

Loaded 3 text chunks from 3 files.
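
The two lists built in this step are parallel: position i in documents and position i in doc_sources describe the same chunk, which is what later lets a search hit be traced back to its file. The sketch below illustrates that bookkeeping with in-memory strings instead of files; raw_files and its contents are made up for the example.

```python
def chunk_text(text, chunk_size=300):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


# Hypothetical stand-in for text already extracted from files on disk.
raw_files = {
    "a.txt": "alpha beta gamma delta epsilon",
    "b.txt": "zeta eta",
}

documents = []    # one entry per chunk
doc_sources = []  # file name for the chunk at the same position

for name, content in raw_files.items():
    chunks = chunk_text(content, chunk_size=2)
    documents.extend(chunks)
    doc_sources.extend([name] * len(chunks))

# Every chunk position maps back to exactly one source file.
for position, (source, chunk) in enumerate(zip(doc_sources, documents)):
    print(position, source, chunk)
```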


Step 6: Generate Text Embeddings
Here we will:

Convert text chunks into vector representations using SentenceTransformer.
Normalize the vectors to unit length so that inner-product search in FAISS is equivalent to cosine similarity.
Show embedding progress for transparency.

In [13]:
model = SentenceTransformer('all-MiniLM-L6-v2')
print("\nGenerating embeddings... (this may take a minute)")

embeddings = model.encode(documents, convert_to_numpy=True, show_progress_bar=True)
embeddings = embeddings.astype('float32')

# Normalize embeddings for cosine similarity
faiss.normalize_L2(embeddings)

print(f"Embeddings shape: {embeddings.shape}")



Generating embeddings... (this may take a minute)


Batches: 100%|██████████| 1/1 [00:00<00:00,  3.12it/s]

Embeddings shape: (3, 384)
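
The normalization step matters because, once every vector has unit L2 length, the inner product of two vectors equals their cosine similarity. The NumPy sketch below illustrates this with two made-up 2-dimensional vectors. normalize_rows is a hypothetical helper that returns a copy, whereas faiss.normalize_L2 modifies the array in place.

```python
import numpy as np


def normalize_rows(vectors):
    """Return a copy of `vectors` with every row scaled to unit L2 length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)  # leave all-zero rows unchanged
    return vectors / norms


# Made-up example vectors (real embeddings here have 384 dimensions).
vectors = np.array([[3.0, 4.0], [1.0, 0.0]], dtype="float32")
unit = normalize_rows(vectors)

a, b = vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # textbook cosine similarity
inner = float(unit[0] @ unit[1])                                 # inner product after normalization

print(unit)
print(cosine, inner)
```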





Step 7: Create FAISS Index
Initializes a FAISS IndexFlatIP index, which performs exact inner-product search; on L2-normalized vectors the inner product equals cosine similarity.
Adds all text embeddings to the index so they can be searched.



In [20]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # inner product = cosine similarity on L2-normalized vectors
index.add(embeddings)
print(f"FAISS index created with {index.ntotal} vectors.")


FAISS index created with 3 vectors.
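
Conceptually, a flat inner-product index compares the query against every stored vector and keeps the highest scores. The NumPy sketch below mimics that behaviour on made-up unit vectors; flat_ip_search is a hypothetical helper written for illustration and says nothing about how FAISS implements the search internally.

```python
import numpy as np


def flat_ip_search(stored, queries, top_k):
    """Exhaustive inner-product search; returns (scores, indices), best match first."""
    scores = queries @ stored.T                     # shape: (n_queries, n_stored)
    order = np.argsort(-scores, axis=1)[:, :top_k]  # highest inner product first
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order


# Made-up unit-length vectors standing in for chunk embeddings.
stored = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
query = np.array([[0.8, 0.6]], dtype="float32")

D, I = flat_ip_search(stored, query, top_k=2)
print(D)
print(I)
```

The returned pair mirrors the (D, I) convention of index.search: one row per query, with scores in D and the positions of the matching vectors in I.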


Step 8: Define Cleaning and Search Functions
1. clean_text(): Removes unwanted formatting and extra spaces.

2. semantic_search_best():

Converts the query into a vector.
Searches the FAISS index for similar embeddings.
Displays the best matches with readable snippets.
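
Two details of this step are easy to miss, so here is a minimal sketch of each in isolation. The first part uses the same symbol-stripping pattern as clean_text and shows that it also removes hyphens inside ordinary words. The second part, keep_strong_matches, is a hypothetical helper (not in the notebook) that applies the same rule as the search function: a hit is kept only if its score reaches similarity_threshold. The scores and indices are made up.

```python
import re


def clean_text(text):
    """Strip markdown-style symbols and collapse whitespace."""
    text = re.sub(r'[#=*`~_-]+', '', text)    # also removes '-' and '_' inside words
    text = re.sub(r'\s+', ' ', text).strip()  # normalize spaces
    return text


def keep_strong_matches(scores, indices, similarity_threshold=0.35):
    """Hypothetical helper: keep (index, score) pairs at or above the threshold."""
    return [
        (int(i), float(s))
        for s, i in zip(scores, indices)
        if i >= 0 and s >= similarity_threshold  # i < 0 marks an empty result slot
    ]


cleaned = clean_text("## **Decision-making**   in `AI`")
matches = keep_strong_matches([0.54, 0.47, 0.12], [0, 2, 1])

print(cleaned)
print(matches)
```

If hyphenated terms matter for the preview text, the character class would need to be narrowed, which is a design choice left to the reader.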

In [19]:
import re
import textwrap

def clean_text(text):
    """
    Cleans Markdown symbols and extra whitespace from extracted text.
    """
    # Unwrap **bold** text first, while the markers are still present
    text = re.sub(r'\*\*(.*?)\*\*', r'\1', text)
    # Remove remaining markdown symbols (headers, underlines, etc.).
    # Note: this also strips hyphens and underscores inside ordinary words.
    text = re.sub(r'[#=*`~_-]+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()  # normalize spaces
    return text


def semantic_search_best(query, top_k=1, wrap_width=100, similarity_threshold=0.35, snippet_length=300):
    """
    Performs semantic search and returns the most relevant file(s)
    with a short, word-wrapped snippet preview.
    """
    query_embedding = model.encode([query]).astype('float32')
    faiss.normalize_L2(query_embedding)

    D, I = index.search(query_embedding, top_k)

    print("\nTop Semantic Search Result(s):")
    print("=" * 120)

    results_shown = 0

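    for rank, idx in enumerate(I[0]):
        score = D[0][rank]
        # FAISS fills missing results with index -1 when top_k exceeds the index size
        if idx < 0 or score < similarity_threshold:
            continue  # skip empty slots and weak matches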

        snippet = clean_text(documents[idx])[:snippet_length]  # limit to snippet_length chars
        wrapped_snippet = textwrap.fill(snippet, width=wrap_width)

        print(f"\nRank {rank + 1}")
        print(f"Source File     : {doc_sources[idx]}")
        print(f"Similarity Score: {score:.4f}")
        print("-" * 120)
        print(f"Preview Snippet:\n{wrapped_snippet}")
        print("=" * 120)
        results_shown += 1

    if results_shown == 0:
        print("No strong semantic matches found for your query.")

Step 9: Run Semantic Search
Retrieves top semantically relevant chunks for each query.
Displays source document name, similarity score and wrapped text preview.

In [21]:
semantic_search_best("applications of artificial intelligence")


Top Semantic Search Result(s):

Rank 1
Source File     : ai_intro.pdf
Similarity Score: 0.5430
------------------------------------------------------------------------------------------------------------------------
Preview Snippet:
Artificial Intelligence (AI) Introduction Artificial Intelligence refers to the simulation of human
intelligence in machines that are programmed to think and act like humans. The core idea is to
enable computers to perform tasks such as reasoning, learning, perception, and decisionmaking.
Branches o


In [22]:
semantic_search_best("database systems and AI", top_k=3)


Top Semantic Search Result(s):

Rank 1
Source File     : ai_intro.pdf
Similarity Score: 0.5042
------------------------------------------------------------------------------------------------------------------------
Preview Snippet:
Artificial Intelligence (AI) Introduction Artificial Intelligence refers to the simulation of human
intelligence in machines that are programmed to think and act like humans. The core idea is to
enable computers to perform tasks such as reasoning, learning, perception, and decisionmaking.
Branches o

Rank 2
Source File     : db_basics.txt
Similarity Score: 0.4668
------------------------------------------------------------------------------------------------------------------------
Preview Snippet:
Database Basics Databases are structured systems used to store, manage, and retrieve data
efficiently. They play a crucial role in modern applications ranging from web services to analytics
platforms. Types of Databases 1. Relational Databases (RDBMS) — Store