
## 🔍 What is Semantic Search?

**Semantic search** is a modern approach to information retrieval that focuses on understanding the *meaning* of text, not just exact keyword matches. It uses vector-based representations, called **embeddings**, to capture the semantic content of sentences.
(If you would like to read more about embeddings, [go here](https://www.kaggle.com/code/beatafaron/nlp-trends-2025-update-complete-learning-guide).)

Instead of searching by literal words, we transform both the **user's query** and **all candidate texts** (like quotes) into high-dimensional vectors using a language model. Then, we compare these vectors using **cosine similarity** to find the most semantically similar matches.

#### 🧠 Embeddings and Models

To generate sentence embeddings, we use pre-trained transformer models like:

* `all-MiniLM-L6-v2` from [Sentence-Transformers](https://www.sbert.net/): a lightweight model optimized for semantic search.

These models convert text into numerical vectors that reflect meaning, context, and relationships between words.

#### ⚙️ Two Search Methods in This Tutorial:

1. **Cosine Similarity**: a simple way to compare the query vector with each quote vector and rank by similarity.
2. **FAISS (Facebook AI Similarity Search)**: a high-performance library that allows fast similarity search even on large datasets.

This setup enables the system to return meaningful quotes even when the wording between the query and the quote is completely different.


Let's walk through an example.

### Install external packages

In [None]:
!pip install -U sentence-transformers pandas faiss-cpu

> We'll use `sentence-transformers` to turn text into dense numerical vectors (embeddings),
> which we can compare using cosine similarity.

### Import libraries

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# 📌 Example 1:
### Search for the most relevant quotes based on meaning, not keywords.

### 1.1. Load data

In [None]:
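# Load the quotes and drop rows that have no quote text
df = pd.read_csv("/kaggle/input/wisdom-from-business-leaders-and-innovators/quotes-wisdom.csv")
df = df.dropna(subset=["quote"])
print(df.shape)  # (rows, columns)
df.head()        # only the last expression in a cell is displayed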

### 1.2. Load a Pretrained Model

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

**Sentence Transformers** are fine-tuned to capture the semantic meaning of sentences. <br>
We'll use a popular, small but effective model: **all-MiniLM-L6-v2**

✔️ This model is:
- Fast and lightweight (great for real-time apps)
- Trained mainly on English text (for other languages, consider a multilingual model such as `paraphrase-multilingual-MiniLM-L12-v2`)
- Trained for semantic similarity tasks



### 1.3. Encode All Quotes into Embeddings

Before we can compare anything, we need to convert all text into vector form.

In [None]:
# Get the quotes as a list
quote_texts = df['quote'].tolist()

# Encode them into dense vector representations
quote_embeddings = model.encode(quote_texts, convert_to_tensor=True)


### 1.4. Encode the User Query

We'll turn a search phrase (e.g. "how to lead a team") into a vector, just like the quotes. <br>
Then we compare the query vector to all quote vectors using cosine similarity. A higher score means a closer match.


In [None]:
query = "how to lead a team"
query_embedding = model.encode(query, convert_to_tensor=True)

### 1.5. Using cosine similarity scores

In [None]:
# Compute cosine similarity scores between query and all quotes
cos_scores = util.cos_sim(query_embedding, quote_embeddings)[0]

# Get top 5 most similar quotes
top_results = cos_scores.topk(k=5)

In [None]:
print(f"\nTop quotes for: '{query}'\n")
for score, idx in zip(top_results.values, top_results.indices):
    idx = idx.item()  # convert the 0-d tensor to a plain int
    quote = df.iloc[idx]['quote']
    author = df.iloc[idx]['author']
    print(f"{score.item():.4f} - {quote}  ({author})")



## 📐 What is **Cosine Similarity**?

Cosine similarity is a metric that tells us **how similar two vectors are**, regardless of their **length**: it focuses on **direction**.
Each sentence (e.g. a quote or a search query) is turned into a **vector** in high-dimensional space (e.g. 384 dimensions if you're using `all-MiniLM-L6-v2`).
These vectors point in different directions, and cosine similarity tells us **how aligned they are**.

---

**✅ Cosine Similarity Formula**

If you have two vectors **A** and **B**:

$$
\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|}
$$

* $A \cdot B$ is the **dot product** of the vectors
* $\|A\|$ and $\|B\|$ are the **magnitudes (lengths)** of the vectors
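
To make the formula concrete, here is a minimal sketch in plain Python. The helper name `cosine_similarity` and the small example vectors are made up for illustration; real sentence embeddings have hundreds of dimensions, but the arithmetic is the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors: (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 2.0]   # |a| = 3
b = [2.0, 1.0, 2.0]   # |b| = 3

sim_ab = cosine_similarity(a, b)                      # 8 / (3 * 3) = 0.888...
sim_same = cosine_similarity(a, [2.0, 4.0, 4.0])      # same direction, twice as long
sim_orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
sim_opp = cosine_similarity(a, [-1.0, -2.0, -2.0])    # opposite direction

print(round(sim_ab, 4), sim_same, sim_orth, sim_opp)
```

Note that doubling the length of a vector (`sim_same`) leaves the score at 1.0, which is exactly the "direction, not length" property described above.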

---

###  What does the result mean?

| Cosine Similarity | Interpretation                               |
| ----------------- | -------------------------------------------- |
| `1.0`             | Exactly the same direction                   |
| `0.0`             | Orthogonal vectors: no measurable similarity |
| `-1.0`            | Opposite direction (very rare in practice)   |

In this semantic search:

* The **query** is a vector
* Each **quote** is a vector
* Cosine similarity tells you: *How close in meaning is this quote to the query?*



---

**✅ Why use cosine similarity (vs Euclidean distance)?**

| Metric               | Cosine Similarity                     | Euclidean Distance               |
| -------------------- | ------------------------------------- | -------------------------------- |
| Focuses on...        | **Direction**                         | Length + Direction               |
| Sensitive to length? | ❌ No (normalizes vectors)             | ✅ Yes (penalizes longer vectors) |
| Best for...          | **Text embeddings** (semantic search) | Physical space, clustering, etc. |
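
The difference in the table can be seen with three toy 2-D vectors. This is only a sketch with invented numbers (the names `short`, `long_` and `other` are illustrative), but it shows how the two metrics can disagree about which vector is the "closest".

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

short = [1.0, 1.0]
long_ = [10.0, 10.0]   # same direction as `short`, ten times longer
other = [1.0, 0.0]     # different direction, but close to `short` in space

cos_short_long = cosine_similarity(short, long_)
cos_short_other = cosine_similarity(short, other)
euclid_short_long = math.dist(short, long_)
euclid_short_other = math.dist(short, other)

print(f"cosine:    long_={cos_short_long:.3f}  other={cos_short_other:.3f}")
print(f"euclidean: long_={euclid_short_long:.3f}  other={euclid_short_other:.3f}")
```

Cosine similarity ranks `long_` first because it points the same way as `short`, while Euclidean distance prefers `other` simply because it is nearby.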



## ✔️ **FAISS**


**FAISS** stands for:

> **Facebook AI Similarity Search**

It's an open-source library built by Facebook (Meta) to perform **very fast vector similarity searches**. It is especially useful for **semantic search**, **recommendation engines**, and **nearest neighbor retrieval**.
FAISS helps you **quickly find the most similar vectors** (sentences, images, products, users...) out of **thousands or millions**, using efficient math.

---

**Why use FAISS (instead of looping or cosine one-by-one)?**

Comparing the query against every vector with `util.cos_sim()` is fine for 500–2,000 vectors.
But if you scale to **100,000+** quotes, that becomes **slow**.

FAISS solves that by:

* Indexing your vectors in a smart way (e.g. via clustering, quantization)
* Searching with optimized **C++ backends**
* Supporting **GPU acceleration**

---

### 🧠 Core Concepts in FAISS

| Concept          | What it means                                                       |
| ---------------- | ------------------------------------------------------------------- |
| **Vector index** | A structure that stores all your embeddings for fast lookup         |
| **L2 / Cosine**  | Distance metric used (L2 = Euclidean, but cosine is common for NLP) |
| **Flat index**   | Basic brute-force method (fast for small data)                      |
| **IVF / HNSW**   | Smart indexes (faster for large datasets)                           |

---

### 🧪 Summary: FAISS = Speed + Scale

| Without FAISS       | With FAISS             |
| ------------------- | ---------------------- |
| Fine for < 2k items | Handles 1M+ vectors    |
| Python loops        | Fast C++ core          |
| No GPU              | ✅ Optional GPU support |



# 📌 Example 2:

In [None]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer


### 2.1. Load the data and model, generate embeddings

In [None]:
# Load the dataset
df = pd.read_csv("/kaggle/input/wisdom-from-business-leaders-and-innovators/quotes-wisdom.csv")

# Drop missing values just in case
df = df.dropna(subset=["quote"]).reset_index(drop=True)

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Get quote texts
quote_texts = df['quote'].tolist()

# Generate embeddings (shape: [N, 384])
embeddings = model.encode(quote_texts, convert_to_numpy=True)


### 2.2. Normalize the Embeddings

In [None]:
# Normalize each embedding to unit length so that the inner product
# used by the index below equals cosine similarity
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
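
Why does normalizing matter? The sketch below, using random stand-in vectors rather than real embeddings, checks numerically that the inner product of unit-length vectors equals cosine similarity. The helper `normalize_rows` and its `eps` guard against all-zero rows are assumptions added for illustration, not part of the notebook's pipeline.

```python
import numpy as np

def normalize_rows(matrix, eps=1e-12):
    """Scale each row to unit L2 length; eps avoids division by zero for all-zero rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8)).astype(np.float32)   # stand-ins for quote embeddings
query = rng.normal(size=(1, 8)).astype(np.float32)     # stand-in for a query embedding

# Cosine similarity computed directly from the formula
cosine = (vectors @ query.T).ravel() / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
)

# Plain inner product after normalizing both sides
inner = (normalize_rows(vectors) @ normalize_rows(query).T).ravel()

print(np.allclose(cosine, inner, atol=1e-5))
```

If the vectors were not normalized, an inner-product index would favor longer vectors instead of better-aligned ones.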


### 2.3. Build and populate a FAISS Index

In [None]:
# Get the dimensionality of embeddings
dimension = embeddings.shape[1]

# Create the FAISS index (flat = exact search)
index = faiss.IndexFlatIP(dimension)  # IP = inner product = cosine if vectors normalized

# Add all embeddings to the index
index.add(embeddings)


### 2.4. Search Example

In [None]:
query = "how to lead a team"
query_vector = model.encode([query], convert_to_numpy=True)
query_vector = query_vector / np.linalg.norm(query_vector, axis=1, keepdims=True)

# Search top 5 most similar quotes
D, I = index.search(query_vector, k=5)


In [None]:
# Print the top results (I holds row positions, D holds their similarity scores)
print(f"\nTop results for: '{query}'\n")
for idx, score in zip(I[0], D[0]):
    quote = df.iloc[idx]['quote']
    author = df.iloc[idx]['author']
    print(f"{score:.4f} - {quote}  ({author})")
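
To see what the `D` and `I` arrays contain, here is a rough NumPy emulation of an exact inner-product search. It is a sketch of the idea only, not FAISS's actual implementation; `flat_ip_search` is a hypothetical helper and the three 2-D vectors are toy data.

```python
import numpy as np

def flat_ip_search(database, queries, k):
    """Exact top-k search by inner product.

    Returns (scores, indices), each shaped (n_queries, k), best match first.
    """
    scores = queries @ database.T                       # (n_queries, n_database)
    indices = np.argsort(-scores, axis=1)[:, :k]        # highest score first
    top_scores = np.take_along_axis(scores, indices, axis=1)
    return top_scores, indices

# Toy unit-length vectors standing in for quote embeddings
database = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
queries = np.array([[1.0, 0.0]], dtype=np.float32)

D, I = flat_ip_search(database, queries, k=2)
print(D.shape, I.shape)   # one row per query, one column per result
print(I[0], D[0])
```

Each row of `I` lists the positions of the best matches for one query, and the same row of `D` holds their scores, which is why the loop above reads `I[0]` and `D[0]`.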



## 🔍 What Else Should You Know About Semantic Search?



### Semantic Search vs Keyword Search

| Keyword Search              | Semantic Search                        |
| --------------------------- | -------------------------------------- |
| Matches **exact words**     | Matches **meaning**                    |
| Sensitive to spelling/order | Understands **context**                |
| No synonyms or paraphrases  | Recognizes synonyms and reworded ideas |
| e.g. “CEO advice”           | Will find: “tips for business leaders” |
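
The keyword column of the table can be illustrated with a naive word-overlap check. This is a deliberately simple sketch (the `keyword_overlap` helper and the second document are invented for the example); real keyword engines are more sophisticated, but they share the same blind spot for paraphrases.

```python
def keyword_overlap(query, text):
    """Return the set of lowercase words shared by the query and the text."""
    return set(query.lower().split()) & set(text.lower().split())

query = "CEO advice"
documents = [
    "tips for business leaders",      # relevant, but shares no words with the query
    "advice column for gardeners",    # irrelevant, but shares the word "advice"
]

overlaps = [keyword_overlap(query, doc) for doc in documents]
print(overlaps)
```

A pure keyword match would rank the gardening text above the leadership one, which is the gap an embedding-based comparison is meant to close.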

---

### Best Models for Semantic Embeddings

Use Hugging Face models from the `sentence-transformers` family:

| Model Name                  | Notes                                                  |
| --------------------------- | ------------------------------------------------------ |
| `all-MiniLM-L6-v2`          | ⚡ Fast, lightweight, great for prototypes              |
| `paraphrase-MiniLM-L12-v2`  | 💬 Better quality, slightly slower                     |
| `multi-qa-MiniLM-L6-cos-v1` | 🔍 Trained for QA and search                           |
| `bge-base-en-v1.5`          | 📈 Strong results on retrieval benchmarks (as of 2024) |

Larger or more recent models such as `bge-large`, `Instructor`, and `GTE` can offer stronger results for serious production apps, at the cost of speed and memory.

---

### ✅ How to Improve Semantic Search Quality

* Use **better models** (larger and newer models usually produce better embeddings)
* **Phrase your query better**: a specific, complete phrase works better than a short or ambiguous one
* Normalize and clean your dataset
* Use a **second-stage re-ranker** (e.g. a cross-encoder) to re-score the top results
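
The re-ranking idea can be sketched as a two-stage pipeline. Everything below is illustrative: `word_overlap_score` is a crude placeholder standing in for a real cross-encoder, and the candidate texts and first-stage scores are invented toy data.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Re-score first-stage candidates with a slower, more precise scorer."""
    rescored = [(score_pair(query, text), text) for _, text in candidates]
    rescored.sort(key=lambda item: item[0], reverse=True)
    return rescored[:top_n]

def word_overlap_score(query, text):
    """Placeholder scorer (hypothetical): fraction of query words found in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)

# (first-stage similarity, text) pairs, as an embedding search might return them
first_stage = [
    (0.62, "Great products come from focus"),
    (0.58, "Lead a team by listening first"),
    (0.55, "A team follows a leader who serves"),
]

final = rerank("how to lead a team", first_stage, word_overlap_score, top_n=2)
for score, text in final:
    print(f"{score:.2f} - {text}")
```

The design point is that the expensive scorer only sees the handful of candidates returned by the fast vector search, so its cost stays small no matter how large the dataset is.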

---

### Beyond FAISS: Alternatives & Index Types

FAISS is great, but you should know other options too:

| Tool      | Strengths                                   |
| --------- | ------------------------------------------- |
| `FAISS`   | Powerful, scalable, supports GPU            |
| `Annoy`   | Lightweight, great for web apps             |
| `HNSWlib` | High-speed approximate search, good balance |

In FAISS itself, you can also use different **index types**:

* `IndexFlatIP` = brute-force inner product (cosine similarity when vectors are normalized)
* `IVF`, `HNSW` = approximate, faster at large scale (100k+ vectors)
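
To give a feel for how an IVF-style index trades accuracy for speed, here is a toy NumPy version of the idea: cluster the vectors, then search only the clusters nearest to the query. This is a conceptual sketch with random stand-in vectors, not how FAISS is implemented; `build_ivf`, `ivf_search` and the `nprobe` parameter name are chosen here for illustration.

```python
import numpy as np

def normalize(x):
    """Scale rows to unit length so inner product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def build_ivf(vectors, n_lists, n_iter=10, seed=0):
    """Toy inverted-file index: k-means-style centroids plus one id list per centroid."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=n_lists, replace=False)
    centroids = vectors[start].copy()
    for _ in range(n_iter):
        assign = np.argmax(vectors @ centroids.T, axis=1)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
        centroids = normalize(centroids)
    assign = np.argmax(vectors @ centroids.T, axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Score only the vectors in the nprobe lists whose centroids best match the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([lists[c] for c in probe])
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:k]
    return candidates[order], scores[order]

rng = np.random.default_rng(42)
vectors = normalize(rng.normal(size=(200, 16)))   # random stand-ins for embeddings
query = normalize(rng.normal(size=(1, 16)))[0]

centroids, lists = build_ivf(vectors, n_lists=8)
approx_ids, approx_scores = ivf_search(query, vectors, centroids, lists, k=5, nprobe=2)
exact_ids, exact_scores = ivf_search(query, vectors, centroids, lists, k=5, nprobe=8)
brute_ids = np.argsort(-(vectors @ query))[:5]

print("approximate:", approx_ids)
print("all lists:  ", exact_ids)
```

With a small `nprobe` only a fraction of the vectors is scored, so some true neighbors may be missed; probing every list falls back to the same result as brute force.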

---

### Common Mistakes

Avoid these issues:

* ❌ Not normalizing embeddings before adding them to an inner-product FAISS index
* ❌ Including duplicate quotes → repeated results
* ❌ Querying with too short/ambiguous input
* ❌ Comparing vectors from different models/languages
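
The duplicate problem is easy to guard against before encoding. The sketch below is one simple way to do it, assuming that quotes differing only in letter case or spacing should count as the same; `deduplicate` is a hypothetical helper and the sample strings are made up.

```python
def deduplicate(quotes):
    """Keep the first occurrence of each quote, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for quote in quotes:
        key = " ".join(quote.casefold().split())
        if key not in seen:
            seen.add(key)
            unique.append(quote)
    return unique

quotes = ["Lead by example.", "lead  by example. ", "Listen before you speak."]
unique_quotes = deduplicate(quotes)
print(unique_quotes)
```

Near-duplicates with different wording would need a fuzzier check, for example comparing their embeddings, which this sketch does not attempt.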

---

### Real-World Use Cases

Semantic search powers many modern apps:

| Use Case             | Example                                        |
| -------------------- | ---------------------------------------------- |
| AI assistants        | Find the right doc/snippet for a user question |
| Quote recommendation | Retrieve inspiring quotes by meaning           |
| Semantic FAQ         | User question → best-matching FAQ answer       |
| Product search       | “comfy waterproof hiking shoes under \$100”    |
| Talent matching      | Resume ↔ job description matching              |

---

### You can also

* **Vector space visualization**: Use UMAP/t-SNE to *see* clusters of meaning
* **Filtering by metadata**: Combine semantic similarity with `theme`, `region`, `gender`
* **Hybrid search**: Combine keyword and semantic for best of both worlds
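
As a rough illustration of hybrid search, one common approach is a weighted blend of a semantic score and a keyword score. The sketch below assumes both scores are already scaled to the range 0 to 1; `hybrid_scores`, the `alpha` weight and the numbers are all invented for the example.

```python
def hybrid_scores(semantic, keyword, alpha=0.7):
    """Blend two score lists: alpha weights semantic, (1 - alpha) weights keyword."""
    if len(semantic) != len(keyword):
        raise ValueError("score lists must have the same length")
    return [alpha * s + (1 - alpha) * k for s, k in zip(semantic, keyword)]

semantic = [0.80, 0.60, 0.30]   # e.g. cosine similarities for three documents
keyword = [0.00, 1.00, 0.20]    # e.g. normalized keyword-match scores

blended = hybrid_scores(semantic, keyword, alpha=0.5)
ranking = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
print(blended, ranking)
```

Here the second document overtakes the first once its exact keyword match is taken into account. The right value of `alpha` depends on the data and is usually tuned by trial.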

---


# 📌 Streamlit App

This notebook introduces semantic search and shows how you can use it to build powerful search tools. You can easily turn this into a simple Streamlit app for interactive use. If you'd like me to show you the easiest way to do that, just let me know. I'm happy to add the code.

You can also try out our quote search app here: [https://semantic-quote-leadership-search.streamlit.app/](https://semantic-quote-leadership-search.streamlit.app/)



---
**Author:** Beata Faron  
[LinkedIn](https://www.linkedin.com/in/beata-faron-24764832/) ‚Ä¢ [Kaggle](https://www.kaggle.com/beatafaron)

*Data Scientist with a background in business, design, and machine learning. Focused on time series forecasting and real-world applications.*
