# Query-Aware HNSW Optimization Approach

## 🎯 Problem Statement
**Given:**
- 1M base vectors (embeddings)
- 10K query vectors, each with its 100 nearest neighbors (10K × 100 = 1M neighbor entries in total)

**Goal:** Improve HNSW search (lower query latency and higher recall) by exploiting observed query patterns

## 🔍 Key Insight
Base vectors that appear frequently in query results, or that lie close to the queries that retrieve them, act as critical "hubs". Placing them in HNSW's upper layers lets the search reach them sooner.

## 🛠 Implementation Steps

### 1️⃣ Frequency & Distance Calculation
- **Frequency (freq):** Count how often each base vector appears among the 1M neighbor entries
- **Distance (dist):** For each base vector, compute the average (or minimum) distance to the queries for which it was returned as a neighbor
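
A minimal NumPy sketch of this step follows. It assumes the neighbor lists are stored as two `(n_queries, k)` arrays, one of base-vector indices and one of the matching distances; the function name `neighbor_stats` and the choice to mark never-retrieved vectors with `NaN` are illustrative, not part of the original design.

```python
import numpy as np

def neighbor_stats(neighbor_ids, neighbor_dists, n_base):
    """Per-base-vector frequency, mean distance and min distance.

    neighbor_ids:   (n_queries, k) int array of base-vector indices
    neighbor_dists: (n_queries, k) float array of the matching distances
    """
    ids = np.asarray(neighbor_ids).ravel()
    dists = np.asarray(neighbor_dists, dtype=np.float64).ravel()

    # How many times each base vector shows up in any neighbor list.
    freq = np.bincount(ids, minlength=n_base)
    # Sum of distances per base vector, used for the mean below.
    dist_sum = np.bincount(ids, weights=dists, minlength=n_base)

    seen = freq > 0
    # Assumption: vectors that never appear get NaN, since no distance exists.
    mean_dist = np.full(n_base, np.nan)
    mean_dist[seen] = dist_sum[seen] / freq[seen]

    min_dist = np.full(n_base, np.inf)
    np.minimum.at(min_dist, ids, dists)  # unbuffered element-wise minimum
    min_dist[~seen] = np.nan
    return freq, mean_dist, min_dist

# Toy example: 3 queries, k = 2, 5 base vectors.
ids = np.array([[0, 1], [1, 2], [1, 4]])
d = np.array([[0.1, 0.4], [0.2, 0.3], [0.6, 0.5]])
freq, mean_dist, min_dist = neighbor_stats(ids, d, n_base=5)
```

In the toy example, base vector 1 is returned for all three queries and base vector 3 is never returned, so it has no distance statistic.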

### 2️⃣ Normalization
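| Metric        | Formula                                               | Purpose                                                          |
|---------------|-------------------------------------------------------|------------------------------------------------------------------|
| **Frequency** | `norm_freq = log(1+freq)/log(1+max_freq)`             | Compress the skewed frequency distribution into [0, 1]           |
| **Distance**  | `norm_dist = 1 - (dist-min_dist)/(max_dist-min_dist)` | Invert and scale distances to [0, 1], so closer vectors score higher |

Here `max_freq`, `min_dist`, and `max_dist` are taken over all base vectors.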

**Combined Score:**  
`f = 0.5*norm_freq + 0.5*norm_dist`  
*(The equal 0.5 weights are a starting point; tune them on validation queries.)*
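
The normalization and combined score above can be sketched as follows. The function name `combined_score`, the weight parameters, and the handling of edge cases (vectors with no recorded distance get `norm_dist = 0`; a zero distance range maps to 1) are assumptions added for the example, since the formulas themselves leave those cases open.

```python
import numpy as np

def combined_score(freq, dist, w_freq=0.5, w_dist=0.5):
    """f = w_freq * norm_freq + w_dist * norm_dist, per base vector."""
    freq = np.asarray(freq, dtype=np.float64)
    dist = np.asarray(dist, dtype=np.float64)

    # norm_freq = log(1 + freq) / log(1 + max_freq)
    max_freq = freq.max()
    if max_freq > 0:
        norm_freq = np.log1p(freq) / np.log1p(max_freq)
    else:
        norm_freq = np.zeros_like(freq)

    # norm_dist = 1 - (dist - min_dist) / (max_dist - min_dist)
    # Assumption: vectors without a distance (NaN) receive norm_dist = 0.
    seen = ~np.isnan(dist)
    norm_dist = np.zeros_like(dist)
    if seen.any():
        lo, hi = dist[seen].min(), dist[seen].max()
        if hi > lo:
            norm_dist[seen] = 1.0 - (dist[seen] - lo) / (hi - lo)
        else:
            norm_dist[seen] = 1.0  # all distances equal: avoid division by zero

    return w_freq * norm_freq + w_dist * norm_dist

# Toy input: frequencies and mean distances for 5 base vectors.
f = combined_score([1, 3, 1, 0, 1], [0.1, 0.4, 0.3, np.nan, 0.5])
```

With equal weights, a vector that is retrieved rarely but at very small distance (index 0 here) can outscore a frequently retrieved but more distant one (index 1), which is the trade-off the weights control.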

### 3️⃣ Threshold Selection
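**Recommended Methods:**  
- **Percentile:** `t = 90th percentile of f` (boosts the top 10% of vectors)  
- **Statistical:** `t = μ + 1.5σ`, where μ and σ are the mean and standard deviation of `f` (captures high-scoring outliers)  
- **Validation:** Try candidate thresholds on held-out queries and keep the one with the best recall/latency trade-off  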
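
As a rough illustration, the percentile and statistical thresholds could be computed as below. The helper names are hypothetical, and the evenly spaced scores are synthetic stand-ins for real `f` values.

```python
import numpy as np

def percentile_threshold(f, q=90.0):
    """Threshold at the q-th percentile of the scores."""
    return float(np.percentile(f, q))

def sigma_threshold(f, k=1.5):
    """Threshold at mean + k * standard deviation of the scores."""
    f = np.asarray(f, dtype=np.float64)
    return float(f.mean() + k * f.std())

# Synthetic scores, evenly spaced in [0, 1], for illustration only.
f = np.linspace(0.0, 1.0, 101)
t_pct = percentile_threshold(f)   # boosts roughly the top 10%
t_sigma = sigma_threshold(f)      # boosts scores far above the mean
boosted = f > t_pct               # boolean mask of vectors to promote
```

The two rules can select very different numbers of vectors depending on how skewed `f` is, so it is worth checking the size of `boosted` before rebuilding the index.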

### 4️⃣ HNSW Layer Assignment
For vectors with `f > t`, raise the probability of being assigned to a higher layer when the index is built.

This increases their chance of appearing in the upper layers, where every search starts.
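
One possible way to do this is sketched below. It is an assumption, not a formula given in this document: standard HNSW draws a level as `floor(-ln(U) * mL)` with `mL = 1/ln(M)`, and the sketch simply multiplies `mL` by a hypothetical `boost` factor for vectors above the threshold. The function name `assign_level` and the default values are illustrative.

```python
import math
import random

def assign_level(score, threshold, M=16, boost=2.0, rng=random):
    """Draw an HNSW level, biased upward when score > threshold.

    Standard HNSW: level = floor(-ln(U) * mL), mL = 1 / ln(M).
    Hypothetical modification: scale mL by `boost` for high-scoring vectors.
    """
    m_l = 1.0 / math.log(M)
    if score > threshold:
        m_l *= boost  # assumed boosting rule, not from the original text
    u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is always defined
    return int(-math.log(u) * m_l)

# Example draw for one boosted and one regular vector.
rng = random.Random(42)
hot_level = assign_level(score=0.95, threshold=0.9, rng=rng)
cold_level = assign_level(score=0.10, threshold=0.9, rng=rng)
```

Under this rule the probability of reaching layer 1 or above is `M ** (-1 / boost)`, so with `M = 16` and `boost = 2.0` it rises from 1/16 to 1/4 for boosted vectors. Applying it in practice requires an HNSW implementation whose level assignment can be overridden per vector.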

## 🚀 Expected Benefits
| Aspect         | Improvement Mechanism                                              |
|----------------|--------------------------------------------------------------------|
| **Speed**      | Frequently retrieved vectors are reached earlier in the search     |
| **Recall**     | Fewer missed connections between hub vectors                       |
| **Adaptivity** | Layer assignment follows observed query patterns and can be refreshed as they change |

## 📊 Visualization
