# Vector Embeddings and Similarity Search

## Key Concepts <span style="color:blue">#keyConcepts</span>

### Vectors <span style="color:green">#vectors</span>

Vectors represent words, sentences, or documents as lists of numbers, where each number is a coordinate in a multi-dimensional space.

- **Definition**: A vector is an ordered list of numbers representing features or dimensions of an object or concept.
- **Dimensionality**: Real-world applications often use hundreds or thousands of dimensions.
- **Example**: In a simplified 2D space, "apple" could be represented as [1, 4], where:
  1. Is it a fruit? 1 (yes)
  2. Cost: $4

This allows complex objects or concepts to be represented numerically, enabling mathematical operations and comparisons.
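As a minimal sketch of such a comparison, the snippet below reuses the two toy features from the example (fruit flag and cost). The items `banana` and `hammer`, their values, and the helper `euclidean_distance` are made up for illustration.

```python
import math

# Toy 2D vectors: [is it a fruit? (1 = yes, 0 = no), cost in dollars].
# "banana" and "hammer" are hypothetical items added for comparison.
apple = [1, 4]
banana = [1, 2]
hammer = [0, 15]


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance(apple, banana))  # 2.0
print(euclidean_distance(apple, hammer))  # larger: the hammer is farther from the apple
```

With hand-picked features like these, the raw cost dominates the distance. Learned embeddings avoid that kind of manual feature design.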

<img src="vectors.png" alt="Vector Representation" width="700" height="300"/>
<img src="vectors2.png" alt="Vector Representation" width="700" height="300"/>
<img src="vectors3.png" alt="Vector Representation" width="700" height="300"/>

### Embedding <span style="color:green">#embedding</span>

Embedding is the process of converting text or other data into vector representations using machine learning models.

- **Purpose**: To create dense, low-dimensional vector representations of high-dimensional data.
- **Models**: Examples include Word2Vec and GloVe (word-level embeddings) and Amazon Titan embeddings (text-level embeddings).
- **Features**:
  1. Considers meaning and context, not just keywords
  2. Can create word, sentence, and document embeddings
  3. Enables semantic similarity comparisons

- **Process**:
  1. Input text is tokenized
  2. Tokens are passed through the embedding model
  3. Model outputs a fixed-length vector for each input
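The three steps above can be sketched without a trained model. The code below is only a stand-in: it uses feature hashing (not a learned model) to show the shape of the pipeline, so the output has a fixed length but does not capture meaning the way Word2Vec, GloVe, or Titan would. The names `tokenize`, `toy_embed`, and `dim` are hypothetical.

```python
import hashlib
import math
import re


def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text, dim=8):
    """Steps 2-3: map text of any length to a fixed-length vector.

    Assumption: feature hashing replaces the real embedding model here.
    Each token is hashed to one of `dim` slots and counted.
    """
    vector = [0.0] * dim
    for token in tokenize(text):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vector[index] += 1.0
    # Scale to unit length so long and short texts are comparable.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


short = toy_embed("An apple a day")
longer = toy_embed("An apple a day keeps the doctor away, or so the saying goes")
print(len(short), len(longer))  # both vectors have `dim` entries
```

A real system would swap `toy_embed` for a call to a trained embedding model, and the rest of the pipeline would stay the same.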

<img src="embeddings.png" alt="Embedding Process" width="700" height="300"/>
<img src="embeddings2.png" alt="Embedding Process" width="700" height="300"/>

### Data Chunking <span style="color:green">#dataChunking</span>

Data chunking involves breaking large documents or datasets into smaller, manageable parts before vectorization.

- **Purpose**: To keep each piece small enough to embed efficiently while keeping related text together in the same chunk.
- **Methods**:
  1. Split by fixed number of characters
  2. Split by tokens (words or subwords)
  3. Split by paragraphs or sentences
- **Benefits**:
  1. Enables processing of documents too large for direct embedding
  2. Preserves local context within chunks
  3. Allows for more granular similarity searches
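Two of the methods above, splitting by a fixed number of characters and splitting by paragraphs, might look like the following sketch. The function names and the `overlap` parameter are assumptions made for illustration; overlap is a common extra that repeats a few characters between neighboring chunks so text cut at a boundary still appears with some context.

```python
import re


def chunk_by_characters(text, chunk_size=200, overlap=20):
    """Split text into fixed-size character windows.

    `overlap` (an assumption, not required by the method) repeats the
    last few characters of one chunk at the start of the next.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def chunk_by_paragraphs(text):
    """Split text on blank lines, dropping empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


print(chunk_by_characters("abcdefghij", chunk_size=4, overlap=1))
print(chunk_by_paragraphs("First paragraph.\n\nSecond paragraph."))
```

Character splitting is simple and predictable but can cut words in half. Paragraph splitting respects the author's structure but produces chunks of uneven size.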

<img src="data_chunking.png" alt="Data Chunking" width="700" height="300"/>

### Vector Store <span style="color:green">#vectorStore</span>

A vector store is a specialized database designed for efficient storage and retrieval of vector representations.

- **Purpose**: To store and index large numbers of vector embeddings for fast similarity search.
- **Examples**:
  - Pinecone
  - Chroma DB
  - Facebook AI Similarity Search (FAISS), a similarity search library rather than a full database
- **Features**:
  1. Optimized for high-dimensional vector data
  2. Supports efficient similarity search algorithms
  3. Scales to millions or billions of vectors
  4. Often supports metadata storage alongside vectors
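To make the idea concrete, here is a toy in-memory store, not the API of Pinecone, Chroma DB, or FAISS. The class `InMemoryVectorStore` and its methods are hypothetical. It keeps vectors and metadata side by side and answers queries by brute force, scoring with cosine similarity (described in the next section). Real vector stores add indexing so they do not have to scan every vector.

```python
import math


class InMemoryVectorStore:
    """Toy vector store: vectors and metadata kept in parallel lists."""

    def __init__(self):
        self._vectors = []
        self._metadata = []

    def add(self, vector, metadata=None):
        """Store one vector together with optional metadata."""
        self._vectors.append(list(vector))
        self._metadata.append(metadata or {})

    def search(self, query, k=3):
        """Return (score, metadata) for the k most similar stored vectors."""
        scored = [
            (self._cosine(query, vector), meta)
            for vector, meta in zip(self._vectors, self._metadata)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0


store = InMemoryVectorStore()
store.add([1.0, 0.0], {"text": "apple"})
store.add([0.0, 1.0], {"text": "hammer"})
store.add([0.9, 0.1], {"text": "pear"})

for score, meta in store.search([1.0, 0.0], k=2):
    print(round(score, 3), meta["text"])
```

The metadata returned with each hit is what lets an application map a matching vector back to the original text or document.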

<img src="vector_store.png" alt="Vector Store" width="700" height="300"/>

### Cosine Similarity <span style="color:green">#cosineSimilarity</span>

Cosine similarity is a measure of similarity between two vectors based on the cosine of the angle between them.

- **Formula**: cos(θ) = (A · B) / (||A|| ||B||)
  Where A and B are vectors, · is the dot product, and ||A|| is the magnitude of A.
- **Range**: Values from -1 to 1
  - 1: Vectors point in the same direction (0° angle), even if their magnitudes differ
  - 0: Vectors are orthogonal (90° angle)
  - -1: Vectors point in opposite directions (180° angle)
- **Usage**: Commonly used in information retrieval and text similarity tasks
- **Advantage**: Measures orientation rather than magnitude, so vectors that point the same way score as similar even when their lengths differ
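The formula above translates almost directly into code. This is a minimal pure-Python sketch; the function name `cosine_similarity` and the choice to raise an error for zero vectors (where the formula is undefined) are assumptions.

```python
import math


def cosine_similarity(a, b):
    """cos(θ) = (A · B) / (||A|| ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: approximately 1
print(cosine_similarity([1, 2], [2, 4]))
# Orthogonal: 0
print(cosine_similarity([1, 0], [0, 1]))
# Opposite direction: -1
print(cosine_similarity([1, 0], [-1, 0]))
```

The first pair illustrates the advantage listed above: `[2, 4]` is twice as long as `[1, 2]`, yet the score is still (up to floating-point rounding) 1.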

<img src="cosine_similarity.png" alt="Cosine Similarity" width="700" height="300"/>

### K-Nearest Neighbors (KNN) <span style="color:green">#KNN</span>

KNN is an algorithm used for classification and regression, often applied in similarity search tasks.

- **Purpose**: To find the k most similar items to a query item in a dataset.
- **Process**:
  1. Calculate distance/similarity between query vector and all vectors in the dataset
  2. Select the k nearest neighbors based on the calculated distances
  3. For classification: majority vote of the neighbors' classes
  4. For regression: average of the neighbors' values
- **Distance metrics**: Can use various metrics, including Euclidean distance or cosine distance (1 minus cosine similarity)
- **Considerations**:
  - Choice of k affects the algorithm's performance and bias-variance tradeoff
  - Can be computationally expensive for large datasets, because brute-force search compares the query against every stored vector
  - Approximate Nearest Neighbor (ANN) algorithms are often used for better scalability
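The four process steps can be sketched as a brute-force implementation. The function names and the sample data below are made up for illustration, and Euclidean distance is used as the metric. This is the exact (non-approximate) version, so it scans the whole dataset on every query.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, dataset, k):
    """Steps 1-2: rank all (vector, target) pairs by distance, keep k."""
    return sorted(dataset, key=lambda item: euclidean(query, item[0]))[:k]


def knn_classify(query, dataset, k=3):
    """Step 3: majority vote over the neighbors' class labels."""
    votes = Counter(label for _, label in k_nearest(query, dataset, k))
    return votes.most_common(1)[0][0]


def knn_regress(query, dataset, k=3):
    """Step 4: average of the neighbors' numeric values."""
    neighbors = k_nearest(query, dataset, k)
    return sum(value for _, value in neighbors) / len(neighbors)


# Hypothetical labeled points for classification.
labeled = [
    ([1, 1], "fruit"), ([1, 2], "fruit"), ([2, 1], "fruit"),
    ([8, 8], "tool"), ([9, 8], "tool"),
]
# Hypothetical (feature, value) pairs for regression.
priced = [([1], 10.0), ([2], 20.0), ([3], 30.0), ([10], 100.0)]

print(knn_classify([1.5, 1.5], labeled, k=3))
print(knn_regress([2], priced, k=3))
```

Sorting the full dataset on each query is what makes this approach expensive at scale. ANN indexes trade a small amount of accuracy to avoid that full scan.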

<img src="knn_algorithm.png" alt="KNN Algorithm" width="700" height="300"/>

## Conclusion

These concepts form the foundation of modern vector-based similarity search systems, enabling efficient and meaningful comparisons of complex data in high-dimensional spaces. They are crucial for applications in natural language processing, recommendation systems, and information retrieval.