# 📦 FAISS Index Creation

This notebook demonstrates how to:
1. Load sentences from a CSV file
2. Generate embeddings using Sentence Transformers
3. Create a FAISS index for efficient similarity search
4. Save the index for later use

## What is FAISS?

**FAISS** (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It's widely used in production systems for:
- Semantic search
- Recommendation systems
- Duplicate detection
- RAG (Retrieval-Augmented Generation) systems

## 1️⃣ Install Required Libraries

In [None]:
# Install required packages
!pip install faiss-cpu sentence-transformers pandas numpy -q

## 2️⃣ Import Libraries

In [None]:
import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import pickle
import time

print("✅ All libraries imported successfully!")

## 3️⃣ Load Sentences from CSV

In [None]:
# Load the sentences CSV file
df = pd.read_csv('sentences.csv')

print(f"📊 Loaded {len(df)} sentences")
print(f"\n📁 Columns: {list(df.columns)}")
print(f"\n📂 Categories: {df['category'].unique().tolist()}")
print("\n🔢 Sentences per category:")
print(df['category'].value_counts())

In [None]:
# Preview some sentences
print("📝 Sample sentences:\n")
for _, row in df.head(10).iterrows():
    print(f"  [{row['category']:12}] {row['text']}")

## 4️⃣ Initialize Embedding Model

We'll use the `all-MiniLM-L6-v2` model, which is:
- Fast and lightweight (a 6-layer MiniLM model)
- Able to produce good-quality 384-dimensional embeddings
- Well suited to demonstrations

In [None]:
# Initialize the embedding model
print("🔄 Loading embedding model...")
model = SentenceTransformer('all-MiniLM-L6-v2')

# Get embedding dimension
embedding_dim = model.get_sentence_embedding_dimension()
print("✅ Model loaded!")
print(f"📐 Embedding dimension: {embedding_dim}")

## 5️⃣ Generate Embeddings

Convert all sentences to numerical vectors (embeddings).

In [None]:
# Extract sentences as a list
sentences = df['text'].tolist()

# Generate embeddings
print("🔄 Generating embeddings...")
start_time = time.time()

embeddings = model.encode(
    sentences,
    show_progress_bar=True,
    convert_to_numpy=True
)

elapsed_time = time.time() - start_time
print(f"\n✅ Embeddings generated in {elapsed_time:.2f} seconds")
print(f"📐 Embeddings shape: {embeddings.shape}")
print(f"   - {embeddings.shape[0]} sentences")
print(f"   - {embeddings.shape[1]} dimensions per embedding")
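
FAISS's Python bindings expect a 2-D, C-contiguous `float32` array. `model.encode` normally returns one already, but embeddings loaded from elsewhere (for example a `float64` array or a transposed view) may not qualify. A minimal sketch of a defensive conversion follows; `as_faiss_input` is a hypothetical helper name used for illustration, not part of FAISS.

```python
import numpy as np

def as_faiss_input(vectors):
    """Return a 2-D, C-contiguous float32 array suitable for FAISS.

    Hypothetical helper for illustration; not part of the FAISS API.
    """
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        # Treat a single vector as a batch of one
        arr = arr.reshape(1, -1)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {arr.shape}")
    return arr

# Example input: a float64, non-contiguous array (a transposed view)
raw = np.arange(12, dtype=np.float64).reshape(4, 3).T
clean = as_faiss_input(raw)
print(clean.dtype, clean.shape, clean.flags["C_CONTIGUOUS"])
```

In this notebook the call would be `embeddings = as_faiss_input(embeddings)` before normalizing or adding to the index.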

In [None]:
# Normalize embeddings (in place) for cosine similarity
# For unit-length vectors, the inner product equals the cosine similarity,
# so an inner-product index (IndexFlatIP) returns cosine scores directly
faiss.normalize_L2(embeddings)
print("✅ Embeddings normalized for cosine similarity")
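
To see why normalizing first lets an inner-product index return cosine similarities, the relationship can be checked numerically with plain NumPy. This is a small sketch on random vectors (the vectors and seed are arbitrary assumptions); it also shows that the squared L2 distance between unit vectors is `2 - 2 * cosine`, so L2 and cosine rankings agree after normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8).astype(np.float32)
b = rng.normal(size=8).astype(np.float32)

# Cosine similarity computed directly from the raw vectors
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# L2-normalize each vector (the same scaling faiss.normalize_L2 applies per row)
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

inner_product = float(a_unit @ b_unit)
squared_l2 = float(np.sum((a_unit - b_unit) ** 2))

print(f"cosine        = {cosine:.6f}")
print(f"inner product = {inner_product:.6f}")
print(f"2 - 2*cosine  = {2 - 2 * cosine:.6f}")
print(f"squared L2    = {squared_l2:.6f}")
```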

## 6️⃣ Create FAISS Index

FAISS offers different index types:
- `IndexFlatL2` - Exact search by L2 (Euclidean) distance, good for small datasets
- `IndexFlatIP` - Exact search by inner product (equals cosine similarity for normalized vectors)
- `IndexIVFFlat` - Approximate search over clustered vectors, faster for large datasets (must be trained before use)

For our 100 sentences, we'll use `IndexFlatIP` for exact cosine similarity search.

In [None]:
# Create FAISS index
print("🔄 Creating FAISS index...")

# IndexFlatIP = Inner Product index (cosine similarity for normalized vectors)
index = faiss.IndexFlatIP(embedding_dim)

# Add embeddings to the index
index.add(embeddings)

print("✅ FAISS index created!")
print(f"📊 Total vectors in index: {index.ntotal}")
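
Conceptually, a flat index compares the query with every stored vector and keeps the best `k`. The NumPy sketch below mimics that behaviour on toy data; `flat_ip_search` is a hypothetical name, and FAISS's real implementation is far more optimized than this.

```python
import numpy as np

def flat_ip_search(database, queries, k):
    """Exhaustive inner-product search: score every query against every vector."""
    scores = queries @ database.T              # shape (n_queries, n_vectors)
    top = np.argsort(-scores, axis=1)[:, :k]   # indices of the k highest scores
    return np.take_along_axis(scores, top, axis=1), top

# Toy data: four unit vectors in 2-D and one unit-length query
database = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]], dtype=np.float32
)
query = np.array([[0.8, 0.6]], dtype=np.float32)

scores, indices = flat_ip_search(database, query, k=2)
print(indices[0], scores[0])
```

The cost grows linearly with the number of stored vectors, which is negligible for 100 sentences and is the reason approximate indexes exist for much larger collections.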

## 7️⃣ Quick Test - Verify the Index Works

In [None]:
# Test query
test_query = "What programming language should I learn?"

# Generate embedding for query
query_embedding = model.encode([test_query], convert_to_numpy=True)
faiss.normalize_L2(query_embedding)

# Search for the top 5 most similar sentences
k = 5
# With IndexFlatIP on normalized vectors, the returned values are cosine
# similarities (higher is better), not distances
scores, indices = index.search(query_embedding, k)

print(f"🔍 Query: '{test_query}'")
print(f"\n📋 Top {k} most similar sentences:\n")

for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
    print(f"  {i+1}. [Score: {score:.4f}] [{df.iloc[idx]['category']:12}]")
    print(f"     {df.iloc[idx]['text']}\n")
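
When `k` exceeds the number of vectors in the index, FAISS pads the result with index `-1`. That cannot happen here with `k = 5` and 100 sentences, but a reusable lookup should guard against it. A minimal sketch, assuming one row of search output and a plain list of texts; `collect_hits` and the sample values are hypothetical.

```python
def collect_hits(scores, indices, texts):
    """Pair each valid result index with its score and text, skipping -1 padding."""
    hits = []
    for score, idx in zip(scores, indices):
        if idx < 0:
            # FAISS uses -1 as the index when fewer than k results are available
            continue
        hits.append(
            {"rank": len(hits) + 1, "score": float(score), "text": texts[int(idx)]}
        )
    return hits

# Stand-in for one row of search output with k=3 but only two stored vectors
# (the padded score is a placeholder value, not actual FAISS output)
example_scores = [0.91, 0.40, -1e30]
example_indices = [1, 0, -1]
example_texts = ["Python is popular.", "Rust is fast."]

for hit in collect_hits(example_scores, example_indices, example_texts):
    print(hit)
```

With the notebook's variables, the call would be `collect_hits(scores[0], indices[0], sentences)` when the search results are stored in `scores` and `indices`.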

## 8️⃣ Save the Index and Metadata

We need to save:
1. The FAISS index file
2. The sentences/metadata for retrieval

In [None]:
# Save FAISS index
faiss.write_index(index, 'sentences.faiss')
print("✅ FAISS index saved to 'sentences.faiss'")

# Save metadata (sentences and categories)
metadata = {
    'sentences': sentences,
    'categories': df['category'].tolist(),
    'ids': df['id'].tolist(),
    'model_name': 'all-MiniLM-L6-v2',
    'embedding_dim': embedding_dim
}

with open('sentences_metadata.pkl', 'wb') as f:
    pickle.dump(metadata, f)

print("✅ Metadata saved to 'sentences_metadata.pkl'")

In [None]:
# Verify saved files
import os

files = ['sentences.faiss', 'sentences_metadata.pkl']
print("📁 Saved files:\n")
for file in files:
    if os.path.exists(file):
        size = os.path.getsize(file) / 1024  # KB
        print(f"  ✅ {file} ({size:.2f} KB)")
    else:
        print(f"  ❌ {file} not found")
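
Checking that the files exist does not confirm that the index and metadata still match each other. The sketch below shows one way to reload the metadata and compare it against the index size and dimension. `load_metadata`, `check_consistency`, and the demo values are hypothetical names and data for illustration; the demo writes a tiny stand-in file to a temporary directory so it runs without FAISS.

```python
import os
import pickle
import tempfile

REQUIRED_KEYS = {"sentences", "categories", "ids", "model_name", "embedding_dim"}

def load_metadata(path):
    """Load the pickled metadata dict and check that it has the expected keys."""
    with open(path, "rb") as f:
        metadata = pickle.load(f)  # only unpickle files you created or trust
    missing = REQUIRED_KEYS - metadata.keys()
    if missing:
        raise KeyError(f"metadata is missing keys: {sorted(missing)}")
    return metadata

def check_consistency(ntotal, dim, metadata):
    """Verify that the index size and dimension agree with the metadata."""
    n_sentences = len(metadata["sentences"])
    if ntotal != n_sentences:
        raise ValueError(f"index holds {ntotal} vectors, metadata has {n_sentences}")
    if dim != metadata["embedding_dim"]:
        raise ValueError(f"index dim {dim} != metadata dim {metadata['embedding_dim']}")

# Demo with a tiny stand-in metadata file (values are illustrative only)
demo = {
    "sentences": ["a", "b"],
    "categories": ["x", "y"],
    "ids": [1, 2],
    "model_name": "all-MiniLM-L6-v2",
    "embedding_dim": 384,
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentences_metadata.pkl")
    with open(path, "wb") as f:
        pickle.dump(demo, f)
    loaded = load_metadata(path)

check_consistency(ntotal=2, dim=384, metadata=loaded)
print("metadata and index sizes agree")
```

In a later session, the real values would come from `index = faiss.read_index('sentences.faiss')`, followed by `check_consistency(index.ntotal, index.d, load_metadata('sentences_metadata.pkl'))`.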

## 📊 Summary

In this notebook, we:
1. ✅ Loaded 100 sentences from CSV
2. ✅ Generated embeddings using Sentence Transformers
3. ✅ Created a FAISS index for similarity search
4. ✅ Tested the index with a sample query
5. ✅ Saved the index and metadata for later use

**Next Step:** Use `05_faiss_vector_search.ipynb` to explore different search queries and understand how vector search works!

---

## 🧩 Bonus: Understanding Index Types

| Index Type | Description | Best For |
|------------|-------------|----------|
| `IndexFlatL2` | Exact L2 distance | Small datasets (<10K) |
| `IndexFlatIP` | Exact Inner Product | Cosine similarity (normalized) |
| `IndexIVFFlat` | Approximate search | Medium datasets (10K-1M) |
| `IndexHNSWFlat` | Graph-based approximate search | Large datasets, high recall |
| `IndexIVFPQ` | Compressed vectors | Very large datasets |
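
The speed-versus-recall trade-off behind `IndexIVFFlat` can be illustrated with a toy NumPy version of the inverted-file idea: group vectors into cells around centroids, then search only the `nprobe` most promising cells. This is a simplified sketch under stated assumptions. The real index learns its centroids with k-means during a `train` step, whereas this toy version samples them at random, and `build_ivf` and `ivf_search` are hypothetical names.

```python
import numpy as np

def build_ivf(vectors, nlist, rng):
    """Partition vectors into nlist cells around randomly sampled centroids."""
    centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)]
    # Assign each vector to the centroid with the highest inner product
    assignments = np.argmax(vectors @ centroids.T, axis=1)
    cells = [np.flatnonzero(assignments == c) for c in range(nlist)]
    return centroids, cells

def ivf_search(query, vectors, centroids, cells, k, nprobe):
    """Search only the nprobe cells whose centroids score highest for the query."""
    probe = np.argsort(-(centroids @ query))[:nprobe]
    candidates = np.concatenate([cells[c] for c in probe])
    scores = vectors[candidates] @ query
    order = np.argsort(-scores)[:k]
    return candidates[order], scores[order]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 16)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit length
query = vectors[0]

centroids, cells = build_ivf(vectors, nlist=10, rng=rng)
exact_ids = np.argsort(-(vectors @ query))[:5]
approx_ids, _ = ivf_search(query, vectors, centroids, cells, k=5, nprobe=3)
full_ids, _ = ivf_search(query, vectors, centroids, cells, k=5, nprobe=10)

print("exact:     ", exact_ids)
print("nprobe=3:  ", approx_ids)
print("nprobe=10: ", full_ids)
```

Probing every cell (`nprobe` equal to `nlist`) reproduces the exhaustive result, while a smaller `nprobe` scans fewer vectors and may miss some true neighbours.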