## **Vector search**

## Introduction to Vector Search

Modern applications like recommendation systems, chatbots, and search engines need to handle **unstructured data** such as text, images, audio, and video.  
Traditional search (keyword or SQL-based) struggles here, which is why **Vector Search** is used.


### What is a Vector?

- A **vector** is simply a list of numbers (like `[0.2, -0.4, 1.7, ...]`).  
- In machine learning, vectors are used to represent **features** of data.  

Examples:
- A sentence can be converted into a vector that captures its meaning (using embeddings like Word2Vec, BERT, etc.).
- An image can be converted into a vector that represents its visual features.

These vectors usually live in **high-dimensional space** (e.g., 128D, 512D, 1024D).



### What is Vector Search?

**Vector Search** is a technique to find the most similar items to a given query based on their vector representations.

- Instead of looking for exact matches (`WHERE name = 'Alice'`),  
  vector search finds items that are **closest in meaning or similarity**.
- It compares vectors with a **distance or similarity measure** (cosine similarity, Euclidean distance, etc.).

Example:  
If you search for *"puppy"*, results may include *"dog"*, *"cute pet"*, and *"golden retriever"*, because their vectors lie close to the query's vector.



### Why Use Vector Search?

Traditional search:
- Works well for structured data and exact keywords.
- Struggles with synonyms, context, or unstructured data.

Vector search:
- Handles **semantic similarity**.
- Works with **text, images, and other unstructured formats**.
- Powers modern systems like:
  - Google Search (*related queries*)
  - ChatGPT knowledge retrieval
  - Netflix & Spotify recommendations
  - Image similarity search


### How Vector Search Works

1. **Data Encoding (Embeddings)**  
   Convert text, image, or audio into numerical vectors using a model.

2. **Vector Database**  
   Store all vectors in a vector database (e.g., Pinecone, Milvus, Weaviate) or in an index built with a library such as FAISS.

3. **Similarity Search**  
   Convert the query into a vector and compare it with stored vectors.

4. **Return Results**  
   Retrieve items whose vectors are most similar to the query.
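
The four steps above can be sketched end to end in plain Python. This is a toy illustration under loud assumptions, not how a production vector database works: `embed` is a hypothetical stand-in for a real embedding model (it only counts words from a tiny fixed vocabulary), and `TinyVectorStore` is a made-up brute-force store that compares the query against every stored vector.

```python
import math

# Hypothetical fixed vocabulary; a real embedding model learns its own features.
VOCAB = ["cat", "dog", "pet", "pizza", "food"]

def embed(text: str) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Brute-force in-memory store (hypothetical, for illustration only)."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text: str) -> None:
        # Steps 1 and 2: encode the item and store its vector.
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list[str]:
        # Step 3: encode the query and score it against every stored vector.
        query_vector = embed(query)
        scored = [(cosine_similarity(query_vector, vec), text) for text, vec in self.items]
        # Step 4: return the k highest-scoring items.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
for sentence in ["my cat is a lovely pet", "a dog is a loyal pet", "pizza is my favourite food"]:
    store.add(sentence)

top_matches = store.search("which pet is better a cat or a dog", k=2)
print(top_matches)
```

Real systems replace `embed` with a trained model and replace the linear scan with an approximate nearest-neighbour index, but the flow of encode, store, compare, and return stays the same.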



### Similarity Metrics

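- **Cosine Similarity** → cosine of the angle between two vectors; ignores their lengths (higher = more similar).
- **Euclidean Distance** → straight-line distance between two vectors (lower = more similar).
- **Dot Product** → combines alignment and magnitude; equals cosine similarity when both vectors have unit length.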
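
To make the difference between the three metrics concrete, here is a minimal sketch in plain Python. The vectors `u` and `v` are made up for illustration: `v` points in the same direction as `u` but is twice as long, so the metrics disagree about how "close" they are.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]  # same direction as u, twice the length

cos_uv = cosine_similarity(u, v)    # depends on direction only
dist_uv = euclidean_distance(u, v)  # depends on the gap between the points
dot_uv = dot_product(u, v)          # grows with both alignment and length

print(cos_uv, dist_uv, dot_uv)
```

Cosine similarity treats `u` and `v` as identical in direction, while Euclidean distance still reports a gap between them. When all vectors are normalised to unit length, ranking by cosine similarity, dot product, or Euclidean distance gives the same order.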

##### Example (Text Search)

| Sentence          | Vector (simplified) |
|-------------------|----------------------|
| "I love cats"     | [0.2, 0.8, 0.5]     |
| "Dogs are cute"   | [0.3, 0.7, 0.6]     |
| "Pizza is tasty"  | [0.9, 0.1, 0.2]     |

If query = "I like kittens" → vector ≈ [0.25, 0.75, 0.55]  
Closest matches = **"I love cats"** and **"Dogs are cute"**
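
The toy numbers in the table can be checked directly. The sketch below ranks the three sentences against the query vector with cosine similarity; the vectors are the simplified ones from the table, not the output of a real embedding model.

```python
import math

sentences = {
    "I love cats": [0.2, 0.8, 0.5],
    "Dogs are cute": [0.3, 0.7, 0.6],
    "Pizza is tasty": [0.9, 0.1, 0.2],
}
query_vector = [0.25, 0.75, 0.55]  # "I like kittens"

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Score every sentence, then sort from most to least similar.
scores = {text: cosine_similarity(query_vector, vec) for text, vec in sentences.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for text in ranked:
    print(f"{scores[text]:.3f}  {text}")
```

Both pet sentences score above 0.99 and are nearly tied, while "Pizza is tasty" scores below 0.5, which matches the closest matches listed above.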



#### Tools for Vector Search

- **Vector Databases**: Pinecone, Milvus, Weaviate  
- **Libraries**: FAISS (Facebook AI), Annoy, ScaNN  
- **Search Engines with Vector Support**: Elasticsearch, OpenSearch, Vespa  

## Summary

- **Vector Search = similarity search in high-dimensional space.**
- It’s essential for handling **unstructured data** (text, images, audio).  
- Core idea: **represent data as vectors → compare similarity → return nearest neighbors.**


In [1]:
!pip install langchain chromadb sentence-transformers langchain_community






In [2]:
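# Community integrations are imported from langchain_community;
# the old langchain.* paths for these classes are deprecated
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma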

In [3]:
from langchain.schema import Document

# Build one Document per song (the file stores one song's lyrics per line)
with open("songs_lyrics.txt", "r", encoding="utf-8") as f:
    songs = f.read().strip().split("\n")

documents = [Document(page_content=song) for song in songs if song.strip()]

In [4]:
print(len(documents))   # number of songs
print(documents[0].page_content)  # lyrics of first song

57650
"Look at her face, it's a wonderful face   And it means something special to me   Look at the way that she smiles when she sees me   How lucky can one fellow be?      She's just my kind of girl, she makes me feel fine   Who could ever believe that she could be mine?   She's just my kind of girl, without her I'm blue   And if she ever leaves me what could I do, what could I do?      And when we go for a walk in the park   And she holds me and squeezes my hand   We'll go on walking for hours and talking   About all the things that we plan      She's just my kind of girl, she makes me feel fine   Who could ever believe that she could be mine?   She's just my kind of girl, without her I'm blue   And if she ever leaves me what could I do, what could I do?"


In `songs_lyrics.txt`, each song's lyrics are on a separate line, so `documents` holds one `Document` per song.

#### Reading the dataset containing predicted emotions (useful as metadata for hybrid search)

In [5]:
import pandas as pd

# Reading the csv file
songs_sentiment_df = pd.read_csv('songs_with_predicted_emotions.csv')


##### **Vector database**: Chroma

In [None]:
# Attach title and artist metadata so each lyrics document can be linked back to its song.
# Note: only title and artist are stored here; the predicted-emotion column is not added.
from langchain.schema import Document

lyrics_song_artist = []

for _, row in songs_sentiment_df.iterrows():
    doc = Document(
        page_content=row['text'],   # lyrics: the text that gets embedded
        metadata={
            'title': row['song'],
            'artist': row['artist'],
        },
    )
    lyrics_song_artist.append(doc)


*page_content* → holds the lyrics, which is the text that gets embedded and compared during similarity search.

*metadata* → keeps the song title and artist alongside each vector, so results can be shown with full context.
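
Metadata can also narrow the candidate set before ranking, which is the idea behind the hybrid search mentioned earlier. The sketch below shows that idea in plain Python. Everything in it is hypothetical: the records, the vectors, the `emotion` key (the documents built above store only `title` and `artist`), and the `filtered_search` helper. With LangChain's Chroma wrapper, `similarity_search` accepts a `filter` argument that serves a similar purpose.

```python
import math

# Hypothetical records: made-up vectors and a made-up "emotion" metadata key.
records = [
    {"title": "Song A", "emotion": "sadness", "vector": [0.9, 0.1]},
    {"title": "Song B", "emotion": "joy", "vector": [0.8, 0.2]},
    {"title": "Song C", "emotion": "sadness", "vector": [0.1, 0.9]},
]

def filtered_search(query_vector, records, where, k=2):
    """Keep records whose metadata matches `where`, then rank by Euclidean distance."""
    candidates = [
        r for r in records
        if all(r.get(key) == value for key, value in where.items())
    ]
    # Smaller distance means more similar, so sort ascending.
    candidates.sort(key=lambda r: math.dist(query_vector, r["vector"]))
    return [r["title"] for r in candidates[:k]]

hits = filtered_search([1.0, 0.0], records, where={"emotion": "sadness"})
print(hits)
```

"Song B" is closer to the query than "Song C", but it is excluded because its metadata does not match the filter.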

In [None]:
# Use a HuggingFace embedding model (efficient & open-source)
embedding_function = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

# Create and persist Chroma DB
db = Chroma.from_documents(
    documents=lyrics_song_artist,
    embedding=embedding_function,
    persist_directory="./chroma_songs_db",
    collection_name="lyrics"
)

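# Save the DB to disk. With chromadb 0.4+ data is persisted automatically
# when persist_directory is set, and this explicit call is deprecated.
db.persist()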



##### **Similarity search**

In [None]:
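def get_top_songs(query: str, db, top_k: int = 3) -> list:
    """Return the top_k songs most similar to `query` as dicts with rank, title, artist and a lyrics snippet."""
    results = db.similarity_search(query, k=top_k)

    songs = []
    for i, doc in enumerate(results, 1):
        # Fall back to placeholders if a document has no metadata
        title = doc.metadata.get('title', 'Unknown Title')
        artist = doc.metadata.get('artist', 'Unknown Artist')
        # First 200 characters of the lyrics, flattened onto one line
        snippet = doc.page_content[:200].strip().replace('\n', ' ') + "..."

        songs.append({
            'rank': i,
            'title': title,
            'artist': artist,
            'lyrics_snippet': snippet
        })

    return songs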

In [None]:
# Similarity search
query = "A song that talks about love and heart breaks"
docs = db.similarity_search(query, k=10)
docs

[Document(metadata={'title': 'Give Your Heart A Break', 'artist': 'Glee'}, page_content="The day I first met you   You told me you'd never fall in love   But now that I get you   I know fear is what it really was   Now here we are, so close   Yet so far, haven't I passed the test?   When will you realize   Baby, I'm not like the rest?      Don't wanna break your heart   Wanna give your heart a break   I know you're scared, it's wrong   Like you might make a mistake      There's just one life to live   And there's no time to wait      To waste   So let me give your heart a break   Give your heart a break   Let me give your heart a break   Your heart a break   Oh yeah, yeah      On Sunday, you went home alone   There were tears in your eyes   I called your cell phone, my love   But you did not reply      The world is ours if we want it   We can take it      If you just take my hand   There's no turning back now (There's no turning back now)      Baby, try to understand   Don't wanna brea

In [None]:
query = "A song that talks about love and heart breaks"
top_songs = get_top_songs(query, db)

for song in top_songs:
    print(f"#{song['rank']} - {song['title']} by {song['artist']}")
    print(f"Lyrics snippet: {song['lyrics_snippet']}")
    print("-" * 50)

#1 - Give Your Heart A Break by Glee
Lyrics snippet: The day I first met you   You told me you'd never fall in love   But now that I get you   I know fear is what it really was   Now here we are, so close   Yet so far, haven't I passed the test?   When...
--------------------------------------------------
#2 - Give Your Heart A Break by Glee
Lyrics snippet: The day I first met you   You told me you'd never fall in love   But now that I get you   I know fear is what it really was   Now here we are, so close   Yet so far, haven't I passed the test?   When...
--------------------------------------------------
#3 - Love Never Broke Anyone's Heart by Vince Gill
Lyrics snippet: You say your heart has been broken   And it's taking forever to mend   And it's left you even more certain   That you'll never love again   A long time ago someone told me   It's not love that causes...
--------------------------------------------------
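
Ranks 1 and 2 in the output above are the same song, which suggests the same lyrics were indexed more than once. The notebook does not say why; duplicate rows in the CSV, or building the database twice into the same `persist_directory`, are two possible causes. One workaround is to fetch more results than needed and drop repeats by `(title, artist)`. The sketch below assumes that approach. `dedupe_results` is a hypothetical helper, and `SimpleDoc` is a stand-in for LangChain's `Document` so that the example runs without the database.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    """Stand-in for a LangChain Document so the sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dedupe_results(results, top_k=3):
    """Keep the first (best-ranked) hit per (title, artist) pair, up to top_k."""
    seen = set()
    unique = []
    for doc in results:
        key = (doc.metadata.get("title"), doc.metadata.get("artist"))
        if key in seen:
            continue  # same song already kept at a better rank
        seen.add(key)
        unique.append(doc)
        if len(unique) == top_k:
            break
    return unique

# Mimics the ranking shown above, where the first hit appears twice.
raw_results = [
    SimpleDoc("The day I first met you", {"title": "Give Your Heart A Break", "artist": "Glee"}),
    SimpleDoc("The day I first met you", {"title": "Give Your Heart A Break", "artist": "Glee"}),
    SimpleDoc("You say your heart has been broken", {"title": "Love Never Broke Anyone's Heart", "artist": "Vince Gill"}),
]

unique_titles = [doc.metadata["title"] for doc in dedupe_results(raw_results)]
print(unique_titles)
```

With the real database, the same helper could be applied to an over-fetched result list, for example `dedupe_results(db.similarity_search(query, k=top_k * 3), top_k)`.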
