# Fixed Quora Dataset Processing and Embedding Generation

This notebook fixes the issue where queries were being removed due to incorrect column identification.

**Key Fixes:**
1. **Correct Column Detection**: Properly identifies the text column containing actual questions
2. **Preserved Data**: Ensures no queries are lost during processing
3. **Smart Text Processing**: Preserves semantic information while cleaning
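4. **Optimized Embeddings**: Encodes documents and queries with `sentence-transformers/all-MiniLM-L6-v2` and L2-normalizes the vectors, so inner-product search gives cosine similarity; retrieval quality is measured with MAP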


## Step 1: Install Required Packages

In [4]:
# Install required packages
!pip install --upgrade pip
# Quote the version specifiers so the shell does not treat ">" as a redirect
!pip install "sentence-transformers>=2.2.2"
!pip install "transformers>=4.21.0"
!pip install "torch>=1.13.0"
!pip install pandas numpy scikit-learn
!pip install joblib nltk tqdm
!pip install faiss-cpu
!pip install beir
!pip install datasets
!pip install ir_datasets

print("\n[INFO] Packages installed! Please restart runtime if needed.")


[INFO] Packages installed! Please restart runtime if needed.
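
After installation it can help to confirm which versions are actually present, since hosted runtimes often preinstall some of these packages. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of this notebook:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this notebook relies on
for pkg, ver in installed_versions(
    ["sentence-transformers", "transformers", "torch", "faiss-cpu"]
).items():
    print(f"{pkg}: {ver or 'not installed'}")
```

A `None` entry means the distribution was not found under that exact name, which is a hint to rerun the install cell or restart the runtime.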


## Step 2: Import Libraries

In [5]:
import pandas as pd
import numpy as np
import re
import string
import nltk
import joblib
import os
import warnings
import torch
import zipfile
from collections import defaultdict
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
import faiss
from sentence_transformers import SentenceTransformer

warnings.filterwarnings('ignore')

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

# Download NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("✅ All packages imported successfully!")

Using device: cuda
GPU: Tesla T4
GPU Memory: 14.7 GB
✅ All packages imported successfully!


## Step 3: Load Dataset

In [10]:
# For Google Colab
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    base_path = '/content/drive/MyDrive/downloads'
else:
    # For local environment
    base_path = '/Users/raafatmhanna/Desktop/Quora'

# Load the dataset files
print("Loading dataset files...")

# Try different possible file names and paths
file_patterns = {
    'docs': ['docs.tsv', 'documents.tsv'],
    'queries': ['queries.tsv', 'questions.tsv', 'query.tsv'],
    'qrels': ['qrels.tsv', 'relevance.tsv', 'labels.tsv']
}

datasets = {}
for data_type, patterns in file_patterns.items():
    for pattern in patterns:
        try:
            file_path = os.path.join(base_path, pattern)
            if os.path.exists(file_path):
                datasets[data_type] = pd.read_csv(file_path, sep='\t')
                print(f"✅ Loaded {data_type}: {len(datasets[data_type])} rows from {pattern}")
                break
        except Exception as e:
            # Report the failure instead of silently skipping the file
            print(f"⚠️ Could not read {pattern}: {e}")
            continue

# Verify we have the necessary data
if 'docs' not in datasets or 'queries' not in datasets:
    print("❌ Error: Could not load required files")
    print("Please ensure docs.tsv and queries.tsv are in the correct location")
else:
    print("\n✅ Dataset loaded successfully!")
    print(f"Documents: {len(datasets['docs']):,} rows")
    print(f"Queries: {len(datasets['queries']):,} rows")
    if 'qrels' in datasets:
        print(f"Qrels: {len(datasets['qrels']):,} rows")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loading dataset files...
✅ Loaded docs: 522770 rows from docs.tsv
✅ Loaded queries: 5000 rows from queries.tsv
✅ Loaded qrels: 7626 rows from qrels.tsv

✅ Dataset loaded successfully!
Documents: 522,770 rows
Queries: 5,000 rows
Qrels: 7,626 rows
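
The loading loop above tries several candidate file names and keeps the first one that exists. That lookup can be sketched on its own with the standard library; `first_existing` is a hypothetical helper name, and the temporary directory stands in for `base_path`:

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def first_existing(base_dir, candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate file that exists under base_dir, else None."""
    base = Path(base_dir)
    for name in candidates:
        path = base / name
        if path.is_file():
            return path
    return None


# Demo: only 'questions.tsv' exists, so it is picked over the other names
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "questions.tsv").write_text("query_id\ttext\n1\tWhat is FAISS?\n")
    found = first_existing(tmp, ["queries.tsv", "questions.tsv", "query.tsv"])
    print(found.name)  # questions.tsv
```

Separating the lookup from the parsing makes it easier to report "file not found" and "file could not be parsed" as different problems.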


## Step 4: Inspect Data Structure (Critical Fix)

In [13]:
# CRITICAL: Properly inspect the data structure
print("=== INSPECTING QUERIES STRUCTURE ===")
print("\nQuery columns:", list(datasets['queries'].columns))
print("\nFirst 5 rows of queries:")
print(datasets['queries'].head())

print("\n=== IDENTIFYING TEXT COLUMNS ===")
# Identify which columns contain actual text
for col in datasets['queries'].columns:
    sample_values = datasets['queries'][col].head(3).tolist()
    print(f"\nColumn '{col}':")
    for i, val in enumerate(sample_values):
        print(f"  Row {i}: {str(val)[:100]}..." if len(str(val)) > 100 else f"  Row {i}: {val}")

    # Check if this column contains question text
    if datasets['queries'][col].astype(str).str.len().mean() > 20:
        print(f"  → Likely contains text (avg length: {datasets['queries'][col].astype(str).str.len().mean():.1f})")

print("\n=== INSPECTING DOCUMENTS STRUCTURE ===")
print("\nDocument columns:", list(datasets['docs'].columns))
print("\nFirst 3 rows of documents:")
print(datasets['docs'].head(3))

=== INSPECTING QUERIES STRUCTURE ===

Query columns: ['query_id', 'text']

First 5 rows of queries:
   query_id                                               text
0       318                How does Quora look to a moderator?
1       378  How do I refuse to chose between different thi...
2       379  Did Ben Affleck shine more than Christian Bale...
3       399  What are the effects of demonitization of 500 ...
4       420                       Why creativity is important?

=== IDENTIFYING TEXT COLUMNS ===

Column 'query_id':
  Row 0: 318
  Row 1: 378
  Row 2: 379

Column 'text':
  Row 0: How does Quora look to a moderator?
  Row 1: How do I refuse to chose between different things to do in my life?
  Row 2: Did Ben Affleck shine more than Christian Bale as Batman?
  → Likely contains text (avg length: 51.5)

=== INSPECTING DOCUMENTS STRUCTURE ===

Document columns: ['doc_id', 'text']

First 3 rows of documents:
   doc_id                                               text
0       1  
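
The inspection above separates ID columns from text columns by average string length. A dependency-free sketch of that heuristic follows; the function name `pick_text_column` and the dict-of-lists input are assumptions made for illustration, and the sample rows are the three queries printed above:

```python
from statistics import mean


def pick_text_column(columns, min_avg_len=20):
    """columns: dict mapping column name -> list of cell values.

    Returns the column with the longest average string length, or None
    if no column exceeds min_avg_len characters on average.
    """
    best_name, best_len = None, 0.0
    for name, values in columns.items():
        if not values:
            continue
        avg_len = mean(len(str(v)) for v in values)
        if avg_len > best_len:
            best_name, best_len = name, avg_len
    return best_name if best_len > min_avg_len else None


sample = {
    "query_id": [318, 378, 379],
    "text": [
        "How does Quora look to a moderator?",
        "How do I refuse to chose between different things to do in my life?",
        "Did Ben Affleck shine more than Christian Bale as Batman?",
    ],
}
print(pick_text_column(sample))  # text
```

Numeric IDs average only a few characters, so the 20-character threshold rejects them while accepting typical question text.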

## Step 5: Smart Text Preprocessing with Correct Column Detection

In [14]:
def safe_clean_text(text):
    """
    Ultra-safe cleaning that preserves Quora question format
    """
    if pd.isna(text) or not isinstance(text, str):
        return ""

    # Convert to string to be safe
    text = str(text)

    # Minimal cleaning - preserve most information
    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def find_text_column(df, data_type='query'):
    """
    Intelligently find the column containing actual text content
    """
    print(f"\nFinding text column for {data_type}...")

    # First, look for columns with common text-related names
    text_keywords = ['text', 'question', 'query', 'content', 'title', 'body']

    for col in df.columns:
        col_lower = col.lower()
        # Skip ID columns
        if 'id' in col_lower and not any(keyword in col_lower for keyword in text_keywords):
            continue

        # Check if column name suggests text content
        if any(keyword in col_lower for keyword in text_keywords):
            # Verify it actually contains text
            avg_length = df[col].astype(str).str.len().mean()
            if avg_length > 20:  # Reasonable threshold for text content
                print(f"  ✓ Found text column: '{col}' (avg length: {avg_length:.1f})")
                return col

    # If no column found by name, find the column with longest average text
    max_length = 0
    best_col = None

    for col in df.columns:
        try:
            avg_length = df[col].astype(str).str.len().mean()
            if avg_length > max_length:
                max_length = avg_length
                best_col = col
        except Exception:
            continue

    if best_col and max_length > 20:
        print(f"  ✓ Found text column by length: '{best_col}' (avg length: {max_length:.1f})")
        return best_col

    # Last resort - return the second column (first is usually ID)
    if len(df.columns) > 1:
        print(f"  ⚠️ Using fallback column: '{df.columns[1]}'")
        return df.columns[1]

    return df.columns[0]

# Process queries with correct column detection
print("=== PROCESSING QUERIES ===")
queries_df = datasets['queries'].copy()

# Find the actual text column
query_text_col = find_text_column(queries_df, 'query')

# Show sample of what we're processing
print("\nSample queries before cleaning:")
for i in range(min(3, len(queries_df))):
    print(f"  {i+1}: {queries_df[query_text_col].iloc[i][:100]}...")

# Apply cleaning
queries_df['text_cleaned'] = queries_df[query_text_col].apply(safe_clean_text)

# Remove only truly empty entries
original_count = len(queries_df)
queries_df = queries_df[queries_df['text_cleaned'].str.len() > 0]
cleaned_count = len(queries_df)

print(f"\nQueries processed: {original_count} → {cleaned_count} (removed {original_count - cleaned_count})")

# Show sample after cleaning
print("\nSample queries after cleaning:")
for i in range(min(3, len(queries_df))):
    print(f"  {i+1}: {queries_df['text_cleaned'].iloc[i][:100]}...")

# Process documents
print("\n=== PROCESSING DOCUMENTS ===")
docs_df = datasets['docs'].copy()

# Find the actual text column for documents
doc_text_col = find_text_column(docs_df, 'document')

# Apply cleaning
docs_df['text_cleaned'] = docs_df[doc_text_col].apply(safe_clean_text)

# Remove only truly empty entries
original_count = len(docs_df)
docs_df = docs_df[docs_df['text_cleaned'].str.len() > 0]
cleaned_count = len(docs_df)

print(f"\nDocuments processed: {original_count} → {cleaned_count} (removed {original_count - cleaned_count})")

# Save cleaned data
queries_df.to_csv('queries_cleaned.tsv', sep='\t', index=False)
docs_df.to_csv('docs_cleaned.tsv', sep='\t', index=False)
print("\n✅ Cleaned data saved!")

=== PROCESSING QUERIES ===

Finding text column for query...
  ✓ Found text column: 'text' (avg length: 51.5)

Sample queries before cleaning:
  1: How does Quora look to a moderator?...
  2: How do I refuse to chose between different things to do in my life?...
  3: Did Ben Affleck shine more than Christian Bale as Batman?...

Queries processed: 5000 → 5000 (removed 0)

Sample queries after cleaning:
  1: How does Quora look to a moderator?...
  2: How do I refuse to chose between different things to do in my life?...
  3: Did Ben Affleck shine more than Christian Bale as Batman?...

=== PROCESSING DOCUMENTS ===

Finding text column for document...
  ✓ Found text column: 'text' (avg length: 62.2)

Documents processed: 522770 → 522768 (removed 2)

✅ Cleaned data saved!
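
To make the effect of the minimal cleaning concrete, here is a standalone sketch of the same two steps (URL removal, then whitespace collapsing) using only `re`. The name `minimal_clean` is hypothetical; the regular expressions are the ones used in `safe_clean_text` above:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
WS_RE = re.compile(r"\s+")


def minimal_clean(text):
    """Strip URLs and collapse whitespace; punctuation and case are left untouched."""
    if not isinstance(text, str):
        return ""  # missing values (None, NaN) become empty strings
    text = URL_RE.sub("", text)
    return WS_RE.sub(" ", text).strip()


print(minimal_clean("Why is  https://example.com/a?b=1 \t so slow?"))  # Why is so slow?
print(repr(minimal_clean(None)))  # ''
```

Only rows whose cleaned text is empty are dropped by the length filter, which is why ordinary questions pass through unchanged.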


## Step 6: Generate Optimized Embeddings

In [16]:
# Load optimized model
MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Loading model: {MODEL_NAME}")
model = SentenceTransformer(MODEL_NAME, device=device)

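# Allow inputs of up to 512 tokens before truncation (to my knowledge this model
# ships with a lower default of 256; Quora questions are far shorter than either)
if hasattr(model, 'max_seq_length'):
    model.max_seq_length = 512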

print(f"Model loaded on {device}")
print(f"Max sequence length: {getattr(model, 'max_seq_length', 'default')}")

# Prepare texts for embedding
print("\nPreparing texts...")

# Get document texts and IDs
doc_texts = docs_df['text_cleaned'].tolist()
doc_ids = docs_df[docs_df.columns[0]].tolist()  # First column is usually ID

# Get query texts and IDs
query_texts = queries_df['text_cleaned'].tolist()
query_ids = queries_df[queries_df.columns[0]].tolist()  # First column is usually ID

print(f"\nReady to generate embeddings for:")
print(f"  - {len(doc_texts):,} documents")
print(f"  - {len(query_texts):,} queries")

# Generate embeddings with progress bar
def generate_embeddings_batch(texts, desc="Generating embeddings"):
    """Generate embeddings with optimal batch size"""
    # Determine batch size based on available memory
    if torch.cuda.is_available():
        gpu_memory = torch.cuda.get_device_properties(0).total_memory
        if gpu_memory < 8e9:  # Less than 8GB
            batch_size = 32
        elif gpu_memory < 16e9:  # Less than 16GB
            batch_size = 64
        else:
            batch_size = 128
    else:
        batch_size = 32

    print(f"Using batch size: {batch_size}")

    embeddings = model.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True  # Important for better similarity
    )

    return embeddings

# Generate embeddings
print("\n=== GENERATING DOCUMENT EMBEDDINGS ===")
doc_embeddings = generate_embeddings_batch(doc_texts)

print("\n=== GENERATING QUERY EMBEDDINGS ===")
query_embeddings = generate_embeddings_batch(query_texts)

print(f"\n✅ Embeddings generated!")
print(f"Document embeddings shape: {doc_embeddings.shape}")
print(f"Query embeddings shape: {query_embeddings.shape}")

# Verify normalization
print(f"\nVerification:")
print(f"First doc embedding norm: {np.linalg.norm(doc_embeddings[0]):.3f} (should be ~1.0)")
print(f"First query embedding norm: {np.linalg.norm(query_embeddings[0]):.3f} (should be ~1.0)")

Loading model: sentence-transformers/all-MiniLM-L6-v2
Model loaded on cuda
Max sequence length: 512

Preparing texts...

Ready to generate embeddings for:
  - 522,768 documents
  - 5,000 queries

=== GENERATING DOCUMENT EMBEDDINGS ===
Using batch size: 64


Batches:   0%|          | 0/8169 [00:00<?, ?it/s]


=== GENERATING QUERY EMBEDDINGS ===
Using batch size: 64


Batches:   0%|          | 0/79 [00:00<?, ?it/s]


✅ Embeddings generated!
Document embeddings shape: (522768, 384)
Query embeddings shape: (5000, 384)

Verification:
First doc embedding norm: 1.000 (should be ~1.0)
First query embedding norm: 1.000 (should be ~1.0)
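
The norms of 1.000 matter because, for unit-length vectors, the inner product equals cosine similarity, which is what lets an inner-product index rank by cosine. A small pure-Python check of that identity (the toy vectors and helper names are illustrative only):

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale vec to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


q = [3.0, 4.0, 0.0]
d = [1.0, 2.0, 2.0]
print(round(cosine(q, d), 4))                           # 0.7333
print(round(dot(l2_normalize(q), l2_normalize(d)), 4))  # 0.7333
```

Both lines print the same value, since normalizing first and taking the dot product is algebraically the cosine formula.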


## Step 7: Evaluate Retrieval Performance

In [17]:
# Build FAISS index for efficient retrieval
print("Building FAISS index...")
index = faiss.IndexFlatIP(doc_embeddings.shape[1])  # Inner product for normalized vectors
index.add(doc_embeddings.astype(np.float32))
print(f"Index built with {index.ntotal} documents")

# Quick evaluation on sample queries
print("\n=== SAMPLE RETRIEVAL TEST ===")
n_samples = min(5, len(query_embeddings))
k = 5  # Top-k documents to retrieve

for i in range(n_samples):
    print(f"\nQuery {i+1}: {query_texts[i][:100]}...")

    # Search for similar documents
    scores, indices = index.search(query_embeddings[i:i+1].astype(np.float32), k)

    print(f"Top {k} retrieved documents:")
    for j, (score, idx) in enumerate(zip(scores[0], indices[0])):
        print(f"  {j+1}. Score: {score:.3f} - {doc_texts[idx][:80]}...")

# Calculate basic metrics
print("\n=== CALCULATING METRICS ===")

# Sample evaluation for efficiency
sample_size = min(100, len(query_embeddings))
sample_indices = np.random.choice(len(query_embeddings), sample_size, replace=False)
sample_queries = query_embeddings[sample_indices]

# Calculate similarity statistics
print("Calculating similarity statistics...")
similarities = cosine_similarity(sample_queries, doc_embeddings)

print(f"\nSimilarity Statistics:")
print(f"  Mean: {np.mean(similarities):.4f}")
print(f"  Std: {np.std(similarities):.4f}")
print(f"  Max: {np.max(similarities):.4f}")
print(f"  Min: {np.min(similarities):.4f}")

# Calculate MAP if qrels available
if 'qrels' in datasets and datasets['qrels'] is not None:
    print("\n=== CALCULATING MAP SCORE ===")
    # Implementation would go here based on qrels format
    print("MAP calculation requires proper qrels format")
else:
    print("\n⚠️ No qrels file found for MAP calculation")

Building FAISS index...
Index built with 522768 documents

=== SAMPLE RETRIEVAL TEST ===

Query 1: How does Quora look to a moderator?...
Top 5 retrieved documents:
  1. Score: 0.725 - How does one become a Quora moderator?...
  2. Score: 0.686 - Who are the Quora Moderators?...
  3. Score: 0.680 - How is Quora moderated?...
  4. Score: 0.676 - What does the Quora website look like to members of Quora moderation?...
  5. Score: 0.675 - How does Quora Moderation work?...

Query 2: How do I refuse to chose between different things to do in my life?...
Top 5 retrieved documents:
  1. Score: 0.800 - How do I choose what to do with my life?...
  2. Score: 0.763 - How do you "DECIDE" what you want to do with your life?...
  3. Score: 0.744 - How can I decide what to do in with my life?...
  4. Score: 0.731 - How do I decide on what to do with my life?...
  5. Score: 0.699 - Why I'm not able to decide what my goal is & what to do in my life?...

Query 3: Did Ben Affleck shine more than Christ
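
Conceptually, an exact inner-product index scores the query against every stored vector and keeps the k highest scores. The sketch below shows that idea in plain Python with `heapq`; it illustrates the technique and is not how FAISS is implemented internally. The toy vectors are made up:

```python
import heapq


def top_k_inner_product(query, docs, k):
    """Score every document exhaustively; return the k best (score, index) pairs."""
    scored = (
        (sum(q * d for q, d in zip(query, doc)), i)
        for i, doc in enumerate(docs)
    )
    return heapq.nlargest(k, scored)


docs = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
    [-1.0, 0.0],
]
query = [0.8, 0.6]

for score, idx in top_k_inner_product(query, docs, k=2):
    print(idx, round(score, 2))
# 1 0.96
# 0 0.8
```

The cost grows linearly with the number of documents, which is the motivation for the approximate indices built in Step 9.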

In [18]:
# ====== CALCULATE MAP & MPR METRICS ======
if 'qrels' in datasets and len(datasets['qrels']) > 0:
    print("\n=== CALCULATING RETRIEVAL METRICS ===")
    print("Preparing qrels data...")

    # Convert qrels to {query_id: {doc_id: relevance}} format
    qrels = defaultdict(dict)
    for _, row in datasets['qrels'].iterrows():
        qid = str(row['query_id'])
        did = str(row['doc_id'])
        qrels[qid][did] = int(row['relevance'])

    # Create mappings from IDs to embedding indices
    query_id_to_idx = {str(qid): i for i, qid in enumerate(query_ids)}
    doc_id_to_idx = {str(did): i for i, did in enumerate(doc_ids)}

    # Evaluation parameters
    top_k = 100  # Maximum number of docs to retrieve per query
    rank_cutoffs = [5, 10, 20, 50, 100]  # For MPR calculation

    # Initialize metrics storage
    map_scores = []
    mpr_scores = {k: [] for k in rank_cutoffs}

    print(f"\nEvaluating on {len(qrels)} query-relevance pairs...")

    # Process each query with relevance judgments
    for qid, relevant_docs in tqdm(qrels.items(), desc="Evaluating queries"):
        if qid not in query_id_to_idx:
            continue  # Skip if query wasn't processed

        query_idx = query_id_to_idx[qid]
        query_embedding = query_embeddings[query_idx]

        # Retrieve top_k documents
        distances, indices = index.search(
            query_embedding.reshape(1, -1).astype(np.float32),
            top_k
        )

        retrieved_docs = [doc_ids[i] for i in indices[0]]
        relevant_found = 0
        precisions = []

        # Calculate precision at each rank
        for rank, did in enumerate(retrieved_docs, 1):
            if str(did) in relevant_docs:
                relevant_found += 1
                precisions.append(relevant_found / rank)

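                # Record precision at cutoff points.
                # Note: this only fires when a relevant document sits exactly at
                # a cutoff rank, so the MPR values are conditional on that event
                # rather than a standard P@k averaged over all queries.
                if rank in rank_cutoffs:
                    mpr_scores[rank].append(relevant_found / rank)

        # Calculate Average Precision for this query.
        # Queries with no relevant document in the top_k are skipped, so MAP is
        # averaged only over queries with at least one hit.
        if precisions:
            ap = sum(precisions) / len(relevant_docs)
            map_scores.append(ap)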

    # Calculate final metrics
    if map_scores:
        MAP = np.mean(map_scores)
        print(f"\nMean Average Precision (MAP): {MAP:.4f}")

        print("\nMean Precision at Rank (MPR):")
        for cutoff in sorted(mpr_scores.keys()):
            if mpr_scores[cutoff]:
                mpr = np.mean(mpr_scores[cutoff])
                print(f"  @{cutoff}: {mpr:.4f}")
            else:
                print(f"  @{cutoff}: No relevant docs found")

        # Save metrics to metadata
        if 'metadata' in locals():
            metadata['retrieval_metrics'] = {
                'MAP': MAP,
                'MPR': {k: np.mean(v) for k, v in mpr_scores.items() if v}
            }
            joblib.dump(metadata, 'embedding_metadata.joblib')
            print("\n✅ Metrics saved to metadata")
    else:
        print("\n⚠️ No relevant documents found for any query")
else:
    print("\n⚠️ No qrels data found - skipping MAP/MPR calculation")

print("\n=== EVALUATION COMPLETE ===")


=== CALCULATING RETRIEVAL METRICS ===
Preparing qrels data...

Evaluating on 5000 query-relevance pairs...


Evaluating queries: 100%|██████████| 5000/5000 [06:29<00:00, 12.84it/s]


Mean Average Precision (MAP): 0.8454

Mean Precision at Rank (MPR):
  @5: 0.5285
  @10: 0.3847
  @20: 0.2935
  @50: 0.1250
  @100: 0.1025

=== EVALUATION COMPLETE ===
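
When reading these numbers, note how the cell computes them: a cutoff precision is recorded only when a relevant document lands exactly on a cutoff rank, and Average Precision is appended only for queries with at least one hit in the top 100. For comparison, here is a sketch of the conventional definitions, in which every query contributes a value (zero if nothing relevant is retrieved). The function names are hypothetical, and this code was not used to produce the figures above:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision at each relevant hit, divided by the number of relevant docs."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


ranked = ["d7", "d2", "d9", "d4"]
relevant = {"d7", "d9"}
print(precision_at_k(ranked, relevant, 2))             # 0.5
print(round(average_precision(ranked, relevant), 4))   # 0.8333
print(average_precision(["d1", "d3"], relevant))       # 0.0
```

Averaging these per-query values over all judged queries gives MAP and mean P@k; both would be expected to be no higher than figures computed only over queries with hits.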





## Step 8: Save Results

In [5]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Define your save directory in Google Drive
save_dir = '/content/drive/MyDrive/Quora_Embeddings'  # Change this to your preferred path

# Create directory if it doesn't exist
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
    print(f"Created directory: {save_dir}")
else:
    print(f"Directory already exists: {save_dir}")

print("\nSaving embeddings and metadata to Google Drive...")

# Save embeddings using joblib
joblib.dump(doc_embeddings, f'{save_dir}/doc_embeddings.joblib')
joblib.dump(query_embeddings, f'{save_dir}/query_embeddings.joblib')

# Save metadata
metadata = {
    'model_name': MODEL_NAME,
    'embedding_dim': doc_embeddings.shape[1],
    'num_docs': len(doc_embeddings),
    'num_queries': len(query_embeddings),
    'doc_ids': doc_ids,
    'query_ids': query_ids,
    'normalized': True
}
joblib.dump(metadata, f'{save_dir}/embedding_metadata.joblib')

# Save cleaned texts with IDs using joblib
doc_data = {
    'doc_ids': doc_ids,
    'texts': doc_texts
}
joblib.dump(doc_data, f'{save_dir}/documents_final.joblib')

query_data = {
    'query_ids': query_ids,
    'texts': query_texts
}
joblib.dump(query_data, f'{save_dir}/queries_final.joblib')

# Create summary
summary = f"""
=== PROCESSING COMPLETE ===

Model: {MODEL_NAME}
Documents: {len(doc_embeddings):,}
Queries: {len(query_embeddings):,}
Embedding Dimension: {doc_embeddings.shape[1]}

Files Generated (all in joblib format):
- doc_embeddings.joblib: Document embeddings
- query_embeddings.joblib: Query embeddings
- embedding_metadata.joblib: Metadata
- documents_final.joblib: Cleaned documents with IDs
- queries_final.joblib: Cleaned queries with IDs

Saved to Google Drive at: {save_dir}

✅ All files saved successfully!
"""

print(summary)

# Save summary as text file
with open(f'{save_dir}/processing_summary.txt', 'w') as f:
    f.write(summary)

# Create zip file for easy download
print("\nCreating zip file in Google Drive...")
with zipfile.ZipFile(f'{save_dir}/quora_embeddings_joblib.zip', 'w') as zipf:
    zipf.write(f'{save_dir}/doc_embeddings.joblib', 'doc_embeddings.joblib')
    zipf.write(f'{save_dir}/query_embeddings.joblib', 'query_embeddings.joblib')
    zipf.write(f'{save_dir}/embedding_metadata.joblib', 'embedding_metadata.joblib')
    zipf.write(f'{save_dir}/documents_final.joblib', 'documents_final.joblib')
    zipf.write(f'{save_dir}/queries_final.joblib', 'queries_final.joblib')
    zipf.write(f'{save_dir}/processing_summary.txt', 'processing_summary.txt')

print(f"✅ Zip file created: {save_dir}/quora_embeddings_joblib.zip")
print("\n🎉 Processing complete! Files saved to your Google Drive.")

KeyboardInterrupt: 

In [20]:
from google.colab import drive
from sentence_transformers import SentenceTransformer
import joblib
import os

# Mount Google Drive
drive.mount('/content/drive')

# Define your save directory in Google Drive
save_dir = '/content/drive/MyDrive/Quora_Embeddings'  # Change this to your preferred path

# Create directory if it doesn't exist
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
    print(f"Created directory: {save_dir}")
else:
    print(f"Directory already exists: {save_dir}")

# 1. Save the model itself
print("\nSaving the Sentence Transformer model...")
model_save_path = f"{save_dir}/{MODEL_NAME.replace('/', '_')}"
model.save(model_save_path)
print(f"✅ Model saved to: {model_save_path}")

# 2. Save embeddings using joblib
print("\nSaving embeddings...")
joblib.dump(doc_embeddings, f'{save_dir}/doc_embeddings.joblib')
joblib.dump(query_embeddings, f'{save_dir}/query_embeddings.joblib')

# 3. Save metadata
metadata = {
    'model_name': MODEL_NAME,
    'model_path': model_save_path,
    'embedding_dim': doc_embeddings.shape[1],
    'num_docs': len(doc_embeddings),
    'num_queries': len(query_embeddings),
    'doc_ids': doc_ids,
    'query_ids': query_ids,
    'normalized': True
}
joblib.dump(metadata, f'{save_dir}/embedding_metadata.joblib')

# 4. Save cleaned texts
doc_data = {
    'doc_ids': doc_ids,
    'texts': doc_texts
}
joblib.dump(doc_data, f'{save_dir}/documents_final.joblib')

query_data = {
    'query_ids': query_ids,
    'texts': query_texts
}
joblib.dump(query_data, f'{save_dir}/queries_final.joblib')

# Create summary
summary = f"""
=== PROCESSING COMPLETE ===

Model: {MODEL_NAME}
Model saved to: {model_save_path}
Documents: {len(doc_embeddings):,}
Queries: {len(query_embeddings):,}
Embedding Dimension: {doc_embeddings.shape[1]}

Files Generated:
- Model directory: {MODEL_NAME.replace('/', '_')}/
- doc_embeddings.joblib: Document embeddings
- query_embeddings.joblib: Query embeddings
- embedding_metadata.joblib: Metadata
- documents_final.joblib: Cleaned documents
- queries_final.joblib: Cleaned queries

Saved to Google Drive at: {save_dir}

✅ All files saved successfully!
"""

print(summary)

# Save summary
with open(f'{save_dir}/processing_summary.txt', 'w') as f:
    f.write(summary)

print("\n🎉 Processing complete! Model and embeddings saved to your Google Drive.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Directory already exists: /content/drive/MyDrive/Quora_Embeddings

Saving the Sentence Transformer model...
✅ Model saved to: /content/drive/MyDrive/Quora_Embeddings/sentence-transformers_all-MiniLM-L6-v2

Saving embeddings...

=== PROCESSING COMPLETE ===

Model: sentence-transformers/all-MiniLM-L6-v2
Model saved to: /content/drive/MyDrive/Quora_Embeddings/sentence-transformers_all-MiniLM-L6-v2
Documents: 522,768
Queries: 5,000
Embedding Dimension: 384

Files Generated:
- Model directory: sentence-transformers_all-MiniLM-L6-v2/
- doc_embeddings.joblib: Document embeddings
- query_embeddings.joblib: Query embeddings
- embedding_metadata.joblib: Metadata
- documents_final.joblib: Cleaned documents
- queries_final.joblib: Cleaned queries

Saved to Google Drive at: /content/drive/MyDrive/Quora_Embeddings

✅ All files saved successfully!


🎉 Processing comp
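
The first save cell bundles the output files into a zip archive. `zipfile.ZipFile` stores files uncompressed unless a compression method is passed, so a variant with DEFLATE enabled may be worth trying, although dense float embeddings often compress only modestly. A minimal sketch with a hypothetical helper `bundle_files`, run against a temporary directory rather than Google Drive:

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_files(paths, zip_path):
    """Write each file into zip_path under its bare file name, using DEFLATE."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path, arcname=Path(path).name)
    # Re-open the archive to list what was actually stored
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


with tempfile.TemporaryDirectory() as tmp:
    summary_file = Path(tmp) / "processing_summary.txt"
    summary_file.write_text("=== PROCESSING COMPLETE ===\n")
    names = bundle_files([summary_file], Path(tmp) / "bundle.zip")
    print(names)  # ['processing_summary.txt']
```

Passing `arcname` keeps the Drive directory path out of the archive, matching the second argument used in the `zipf.write` calls of that cell.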

## Step 9: Build FAISS Index for Efficient Retrieval

In [21]:
import faiss
import time
import pickle

print("=== BUILDING FAISS INDEX ===")

# Convert embeddings to float32 for FAISS
doc_embeddings_f32 = doc_embeddings.astype(np.float32)
query_embeddings_f32 = query_embeddings.astype(np.float32)

# Get embedding dimension
embedding_dim = doc_embeddings_f32.shape[1]
print(f"Embedding dimension: {embedding_dim}")
print(f"Number of documents: {len(doc_embeddings_f32):,}")

# Create different types of FAISS indices
print("\nBuilding FAISS indices...")

# 1. Flat Index (exact search, best for accuracy)
print("1. Building IndexFlatIP (exact search)...")
start_time = time.time()
index_flat = faiss.IndexFlatIP(embedding_dim)
index_flat.add(doc_embeddings_f32)
flat_build_time = time.time() - start_time
print(f"   ✅ Built in {flat_build_time:.2f} seconds")

# 2. IVF Index (approximate search, faster for large datasets)
print("2. Building IndexIVFFlat (approximate search)...")
start_time = time.time()

# Calculate number of clusters (nlist)
nlist = min(4096, int(np.sqrt(len(doc_embeddings_f32))))
print(f"   Using {nlist} clusters")

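# Create quantizer and index
# Pass METRIC_INNER_PRODUCT explicitly: IndexIVFFlat defaults to L2, which would
# not match the inner-product quantizer or the scores returned by IndexFlatIP
quantizer = faiss.IndexFlatIP(embedding_dim)
index_ivf = faiss.IndexIVFFlat(quantizer, embedding_dim, nlist, faiss.METRIC_INNER_PRODUCT)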

# Train the index
print("   Training index...")
index_ivf.train(doc_embeddings_f32)

# Add vectors
print("   Adding vectors...")
index_ivf.add(doc_embeddings_f32)

# Set search parameters
index_ivf.nprobe = min(32, nlist // 4)  # Number of clusters to search
ivf_build_time = time.time() - start_time
print(f"   ✅ Built in {ivf_build_time:.2f} seconds")
print(f"   Search will probe {index_ivf.nprobe} clusters")

# 3. HNSW Index (hierarchical navigable small world)
print("3. Building IndexHNSWFlat (graph-based search)...")
start_time = time.time()

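# Create HNSW index
# IndexHNSWFlat defaults to L2 distance; for unit-norm vectors the ranking matches
# inner product, but the returned values are squared L2 distances, not similarities
M = 16  # Number of connections
index_hnsw = faiss.IndexHNSWFlat(embedding_dim, M)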
index_hnsw.hnsw.efConstruction = 200  # Construction parameter
index_hnsw.hnsw.efSearch = 128  # Search parameter

# Add vectors
index_hnsw.add(doc_embeddings_f32)

hnsw_build_time = time.time() - start_time
print(f"   ✅ Built in {hnsw_build_time:.2f} seconds")
print(f"   M={M}, efConstruction={index_hnsw.hnsw.efConstruction}, efSearch={index_hnsw.hnsw.efSearch}")

# Store indices in a dictionary
faiss_indices = {
    'flat': index_flat,
    'ivf': index_ivf,
    'hnsw': index_hnsw
}

print("\n=== FAISS INDEX SUMMARY ===")
print(f"IndexFlatIP: {index_flat.ntotal:,} vectors, exact search")
print(f"IndexIVFFlat: {index_ivf.ntotal:,} vectors, {nlist} clusters, approximate search")
print(f"IndexHNSWFlat: {index_hnsw.ntotal:,} vectors, graph-based search")

print("\n✅ All FAISS indices built successfully!")

# Quick test search
print("\n=== QUICK SEARCH TEST ===")
test_query = query_embeddings_f32[0:1]
k = 5

for name, idx in faiss_indices.items():
    start_time = time.time()
    scores, indices = idx.search(test_query, k)
    search_time = time.time() - start_time
    print(f"{name.upper()}: Found {len(indices[0])} results in {search_time*1000:.2f}ms")


=== BUILDING FAISS INDEX ===
Embedding dimension: 384
Number of documents: 522,768

Building FAISS indices...
1. Building IndexFlatIP (exact search)...
   ✅ Built in 0.71 seconds
2. Building IndexIVFFlat (approximate search)...
   Using 723 clusters
   Training index...
   Adding vectors...
   ✅ Built in 32.63 seconds
   Search will probe 32 clusters
3. Building IndexHNSWFlat (graph-based search)...
   ✅ Built in 629.83 seconds
   M=16, efConstruction=200, efSearch=128

=== FAISS INDEX SUMMARY ===
IndexFlatIP: 522,768 vectors, exact search
IndexIVFFlat: 522,768 vectors, 723 clusters, approximate search
IndexHNSWFlat: 522,768 vectors, graph-based search

✅ All FAISS indices built successfully!

=== QUICK SEARCH TEST ===
FLAT: Found 5 results in 104.15ms
IVF: Found 5 results in 8.40ms
HNSW: Found 5 results in 0.99ms


## Step 10: Comprehensive Evaluation with FAISS

In [22]:
print("=== EVALUATION WITH FAISS INDICES ===")

def evaluate_with_faiss(index, index_name, k_values=(5, 10, 20, 50, 100)):
    """Evaluate retrieval performance using FAISS index"""
    print(f"\nEvaluating {index_name} index...")

    if 'qrels' not in datasets or len(datasets['qrels']) == 0:
        print("⚠️ No qrels data available for evaluation")
        return None

    # Convert qrels to dictionary
    qrels = defaultdict(dict)
    for _, row in datasets['qrels'].iterrows():
        qid = str(row['query_id'])
        did = str(row['doc_id'])
        qrels[qid][did] = int(row['relevance'])

    # Create ID mappings
    query_id_to_idx = {str(qid): i for i, qid in enumerate(query_ids)}
    doc_id_to_idx = {str(did): i for i, did in enumerate(doc_ids)}

    # Evaluation metrics storage
    results = {
        'map_scores': [],
        'precision_at_k': {k: [] for k in k_values},
        'recall_at_k': {k: [] for k in k_values},
        'search_times': []
    }

    # Evaluate each query
    evaluated_queries = 0
    total_queries = len(qrels)

    for qid, relevant_docs in tqdm(qrels.items(), desc=f"Evaluating {index_name}"):
        if qid not in query_id_to_idx:
            continue

        query_idx = query_id_to_idx[qid]
        query_embedding = query_embeddings_f32[query_idx:query_idx+1]

        # Search with FAISS
        start_time = time.time()
        max_k = max(k_values)
        distances, indices = index.search(query_embedding, max_k)
        search_time = time.time() - start_time
        results['search_times'].append(search_time)

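        # Get retrieved document IDs
        # FAISS pads with -1 when an index returns fewer than max_k hits; skip
        # those so that doc_ids[-1] is not picked up by mistake
        retrieved_docs = [str(doc_ids[i]) for i in indices[0] if i >= 0]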

        # Calculate metrics
        relevant_retrieved = 0
        precisions = []

        for rank, doc_id in enumerate(retrieved_docs, 1):
            if doc_id in relevant_docs:
                relevant_retrieved += 1
                precisions.append(relevant_retrieved / rank)

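        # Average Precision, truncated at max_k. Queries with no relevant hit in the
        # top max_k are skipped here, so MAP is averaged only over queries with a hit.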
        if precisions:
            ap = sum(precisions) / len(relevant_docs)
            results['map_scores'].append(ap)

        # Precision and Recall at K
        for k in k_values:
            if k <= len(retrieved_docs):
                top_k_docs = retrieved_docs[:k]
                relevant_in_top_k = sum(1 for doc in top_k_docs if doc in relevant_docs)

                precision_at_k = relevant_in_top_k / k
                recall_at_k = relevant_in_top_k / len(relevant_docs)

                results['precision_at_k'][k].append(precision_at_k)
                results['recall_at_k'][k].append(recall_at_k)

        evaluated_queries += 1

    # Calculate final metrics
    if results['map_scores']:
        final_results = {
            'index_name': index_name,
            'map': np.mean(results['map_scores']),
            'avg_search_time': np.mean(results['search_times']) * 1000,  # in ms
            'precision_at_k': {},
            'recall_at_k': {},
            'evaluated_queries': evaluated_queries
        }

        for k in k_values:
            if results['precision_at_k'][k]:
                final_results['precision_at_k'][k] = np.mean(results['precision_at_k'][k])
                final_results['recall_at_k'][k] = np.mean(results['recall_at_k'][k])

        return final_results
    else:
        print(f"⚠️ No relevant documents found for {index_name}")
        return None

# Evaluate all FAISS indices
faiss_results = {}

for index_name, index in faiss_indices.items():
    result = evaluate_with_faiss(index, index_name)
    if result:
        faiss_results[index_name] = result

# Display results
print("\n=== FAISS EVALUATION RESULTS ===")
print(f"{'Index':<15} {'MAP':<8} {'P@5':<8} {'P@10':<8} {'R@5':<8} {'R@10':<8} {'Time(ms)':<10}")
print("-" * 75)

for name, result in faiss_results.items():
    map_score = result['map']
    p5 = result['precision_at_k'].get(5, 0)
    p10 = result['precision_at_k'].get(10, 0)
    r5 = result['recall_at_k'].get(5, 0)
    r10 = result['recall_at_k'].get(10, 0)
    time_ms = result['avg_search_time']

    print(f"{name.upper():<15} {map_score:<8.4f} {p5:<8.4f} {p10:<8.4f} {r5:<8.4f} {r10:<8.4f} {time_ms:<10.2f}")

print("\n✅ FAISS evaluation completed!")

=== EVALUATION WITH FAISS INDICES ===

Evaluating flat index...
Evaluating flat: 100%|██████████| 5000/5000 [06:25<00:00, 12.98it/s]

Evaluating ivf index...
Evaluating ivf: 100%|██████████| 5000/5000 [00:22<00:00, 224.83it/s]

Evaluating hnsw index...
Evaluating hnsw: 100%|██████████| 5000/5000 [00:03<00:00, 1301.25it/s]


=== FAISS EVALUATION RESULTS ===
Index           MAP      P@5      P@10     R@5      R@10     Time(ms)  
---------------------------------------------------------------------------
FLAT            0.8454   0.2439   0.1325   0.9123   0.9498   76.08     
IVF             0.8458   0.2427   0.1319   0.9076   0.9450   4.27      
HNSW            0.8455   0.2438   0.1325   0.9121   0.9496   0.67      

✅ FAISS evaluation completed!
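
To make the metric definitions in `evaluate_with_faiss` concrete, here is a minimal, self-contained sketch of average precision, precision@k and recall@k for a single query. The helper names (`average_precision`, `precision_recall_at_k`) and the toy document IDs are illustrative assumptions, not part of the notebook. The AP formula mirrors the loop above: the sum of precision at each relevant hit, divided by the number of relevant documents.

```python
def average_precision(retrieved, relevant):
    """AP over a ranked list: sum of precision at each relevant hit,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, 1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def precision_recall_at_k(retrieved, relevant, k):
    """Precision@k and recall@k for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k, hits / len(relevant)


# Toy example (hypothetical IDs): two relevant docs, found at ranks 1 and 3
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3"}

ap = average_precision(retrieved, relevant)
p5, r5 = precision_recall_at_k(retrieved, relevant, k=5)
print(f"AP={ap:.4f}  P@5={p5:.4f}  R@5={r5:.4f}")
```

With only two relevant documents, P@5 cannot exceed 0.4 even when recall is perfect. The low P@5 next to the high R@5 in the table above is consistent with most queries having only one or two relevant documents, though the notebook does not report that distribution directly.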





## Step 11: Evaluation without FAISS (Traditional Cosine Similarity)

In [27]:
print("=== EVALUATION WITHOUT FAISS (COSINE SIMILARITY) ===")

def evaluate_without_faiss(k_values=(5, 10, 20, 50, 100), sample_size=None):
    """Evaluate retrieval performance using traditional cosine similarity"""

    if 'qrels' not in datasets or len(datasets['qrels']) == 0:
        print("⚠️ No qrels data available for evaluation")
        return None

    # Convert qrels to dictionary
    qrels = defaultdict(dict)
    for _, row in datasets['qrels'].iterrows():
        qid = str(row['query_id'])
        did = str(row['doc_id'])
        qrels[qid][did] = int(row['relevance'])

    # Create ID mappings
    query_id_to_idx = {str(qid): i for i, qid in enumerate(query_ids)}
    doc_id_to_idx = {str(did): i for i, did in enumerate(doc_ids)}

    # Sample queries for efficiency (optional)
    if sample_size and sample_size < len(qrels):
        sampled_qrels = dict(list(qrels.items())[:sample_size])
        print(f"Sampling {sample_size} queries out of {len(qrels)} for efficiency")
        qrels = sampled_qrels

    # Evaluation metrics storage
    results = {
        'map_scores': [],
        'precision_at_k': {k: [] for k in k_values},
        'recall_at_k': {k: [] for k in k_values},
        'search_times': []
    }

    # Evaluate each query
    evaluated_queries = 0

    for qid, relevant_docs in tqdm(qrels.items(), desc="Evaluating with cosine similarity"):
        if qid not in query_id_to_idx:
            continue

        query_idx = query_id_to_idx[qid]
        query_embedding = query_embeddings[query_idx:query_idx+1]

        # Calculate cosine similarity
        start_time = time.time()
        similarities = cosine_similarity(query_embedding, doc_embeddings)[0]

        # Get top-k documents
        max_k = max(k_values)
        top_indices = np.argsort(similarities)[::-1][:max_k]
        search_time = time.time() - start_time
        results['search_times'].append(search_time)

        # Get retrieved document IDs
        retrieved_docs = [str(doc_ids[i]) for i in top_indices]

        # Calculate metrics
        relevant_retrieved = 0
        precisions = []

        for rank, doc_id in enumerate(retrieved_docs, 1):
            if doc_id in relevant_docs:
                relevant_retrieved += 1
                precisions.append(relevant_retrieved / rank)

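        # Average Precision, truncated at max_k. Queries with no relevant hit in the
        # top max_k are skipped here, so MAP is averaged only over queries with a hit.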
        if precisions:
            ap = sum(precisions) / len(relevant_docs)
            results['map_scores'].append(ap)

        # Precision and Recall at K
        for k in k_values:
            if k <= len(retrieved_docs):
                top_k_docs = retrieved_docs[:k]
                relevant_in_top_k = sum(1 for doc in top_k_docs if doc in relevant_docs)

                precision_at_k = relevant_in_top_k / k
                recall_at_k = relevant_in_top_k / len(relevant_docs)

                results['precision_at_k'][k].append(precision_at_k)
                results['recall_at_k'][k].append(recall_at_k)

        evaluated_queries += 1

    # Calculate final metrics
    if results['map_scores']:
        final_results = {
            'method': 'cosine_similarity',
            'map': np.mean(results['map_scores']),
            'avg_search_time': np.mean(results['search_times']) * 1000,  # in ms
            'precision_at_k': {},
            'recall_at_k': {},
            'evaluated_queries': evaluated_queries
        }

        for k in k_values:
            if results['precision_at_k'][k]:
                final_results['precision_at_k'][k] = np.mean(results['precision_at_k'][k])
                final_results['recall_at_k'][k] = np.mean(results['recall_at_k'][k])

        return final_results
    else:
        print("⚠️ No relevant documents found")
        return None

# Run evaluation without FAISS
print("Running evaluation with traditional cosine similarity...")
traditional_results = evaluate_without_faiss(sample_size=1000)  # Sample for efficiency

if traditional_results:
    print("\n=== TRADITIONAL COSINE SIMILARITY RESULTS ===")
    print(f"Method: {traditional_results['method']}")
    print(f"MAP: {traditional_results['map']:.4f}")
    print(f"Average search time: {traditional_results['avg_search_time']:.2f} ms")
    print(f"Evaluated queries: {traditional_results['evaluated_queries']}")

    print("\nPrecision at K:")
    for k, precision in traditional_results['precision_at_k'].items():
        print(f"  P@{k}: {precision:.4f}")

    print("\nRecall at K:")
    for k, recall in traditional_results['recall_at_k'].items():
        print(f"  R@{k}: {recall:.4f}")

print("\n✅ Traditional evaluation completed!")

=== EVALUATION WITHOUT FAISS (COSINE SIMILARITY) ===
Running evaluation with traditional cosine similarity...
Sampling 1000 queries out of 5000 for efficiency


Evaluating with cosine similarity:  14%|█▍        | 142/1000 [02:17<13:51,  1.03it/s]


KeyboardInterrupt: 
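
The baseline above is slow in part because scikit-learn's `cosine_similarity` re-normalizes the full document matrix on every call, and `np.argsort` sorts every score even though only the top `max_k` are needed. A minimal sketch of a faster exact baseline, assuming the embeddings fit in memory: normalize the documents once, score with a matrix-vector product, and select the top k with `np.argpartition`. The function names and the random toy data are illustrative assumptions, not the notebook's code.

```python
import numpy as np


def l2_normalize(matrix):
    """Return a row-wise L2-normalized float32 copy (near-zero rows stay near zero)."""
    matrix = np.asarray(matrix, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)


def top_k_cosine(query_vec, docs_normalized, k):
    """Exact top-k by cosine similarity against pre-normalized documents."""
    query = l2_normalize(query_vec.reshape(1, -1))[0]
    scores = docs_normalized @ query              # cosine scores, shape (n_docs,)
    k = min(k, scores.shape[0])
    # argpartition selects the k best without a full sort; only those k are sorted
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Toy data standing in for doc_embeddings / query_embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

docs_normalized = l2_normalize(docs)              # done once, reused for every query
top_idx, top_scores = top_k_cosine(query, docs_normalized, k=10)

# Reference: full sort over all scores
full_scores = docs_normalized @ l2_normalize(query.reshape(1, -1))[0]
reference = np.argsort(-full_scores)[:10]
print(top_idx.tolist() == reference.tolist())
```

The same pre-normalized matrix can also score a batch of queries in one matrix-matrix product, which is usually faster than looping over queries one at a time.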

## Step 12: Comparison Analysis - FAISS vs Traditional Methods

In [24]:
print("=== COMPREHENSIVE COMPARISON ANALYSIS ===")

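# Create comparison table
# Note: traditional_results comes from a query sample (sample_size=1000 in Step 11),
# whereas faiss_results covers every query with qrels, so metric differences
# partly reflect the different query sets.
if faiss_results and traditional_results:
    print("\n📊 PERFORMANCE COMPARISON\n")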
    print(f"{'Method':<20} {'MAP':<8} {'P@5':<8} {'P@10':<8} {'R@5':<8} {'R@10':<8} {'Time(ms)':<10}")
    print("=" * 80)

    # Traditional method
    t_map = traditional_results['map']
    t_p5 = traditional_results['precision_at_k'].get(5, 0)
    t_p10 = traditional_results['precision_at_k'].get(10, 0)
    t_r5 = traditional_results['recall_at_k'].get(5, 0)
    t_r10 = traditional_results['recall_at_k'].get(10, 0)
    t_time = traditional_results['avg_search_time']

    print(f"{'Cosine Similarity':<20} {t_map:<8.4f} {t_p5:<8.4f} {t_p10:<8.4f} {t_r5:<8.4f} {t_r10:<8.4f} {t_time:<10.2f}")

    # FAISS methods
    for name, result in faiss_results.items():
        f_map = result['map']
        f_p5 = result['precision_at_k'].get(5, 0)
        f_p10 = result['precision_at_k'].get(10, 0)
        f_r5 = result['recall_at_k'].get(5, 0)
        f_r10 = result['recall_at_k'].get(10, 0)
        f_time = result['avg_search_time']

        method_name = f"FAISS {name.upper()}"
        print(f"{method_name:<20} {f_map:<8.4f} {f_p5:<8.4f} {f_p10:<8.4f} {f_r5:<8.4f} {f_r10:<8.4f} {f_time:<10.2f}")

    # Analysis
    print("\n🔍 ANALYSIS:\n")

    # Find best performing method
    best_map = max(traditional_results['map'], max(r['map'] for r in faiss_results.values()))
    fastest_method = min(traditional_results['avg_search_time'], min(r['avg_search_time'] for r in faiss_results.values()))

    print(f"📈 Best MAP Score: {best_map:.4f}")
    print(f"⚡ Fastest Search: {fastest_method:.2f} ms")

    # Speed comparison
    print("\n⚡ SPEED COMPARISON:")
    baseline_time = traditional_results['avg_search_time']

    for name, result in faiss_results.items():
        speedup = baseline_time / result['avg_search_time']
        print(f"  FAISS {name.upper()}: {speedup:.2f}x faster than cosine similarity")

    # Accuracy comparison
    print("\n🎯 ACCURACY COMPARISON:")
    baseline_map = traditional_results['map']

    for name, result in faiss_results.items():
        accuracy_ratio = result['map'] / baseline_map
        accuracy_diff = (result['map'] - baseline_map) * 100
        print(f"  FAISS {name.upper()}: {accuracy_ratio:.3f}x accuracy ({accuracy_diff:+.2f}% difference)")

    # Recommendations
    print("\n💡 RECOMMENDATIONS:")
    print("  • IndexFlatIP: Best accuracy, exact search, slower for large datasets")
    print("  • IndexIVFFlat: Good balance of speed and accuracy, suitable for large datasets")
    print("  • IndexHNSWFlat: Fast search, good for real-time applications")
    print("  • Cosine Similarity: Baseline method, exact results, can be slow")

else:
    print("⚠️ Cannot perform comparison - missing evaluation results")

print("\n✅ Comparison analysis completed!")

=== COMPREHENSIVE COMPARISON ANALYSIS ===

📊 PERFORMANCE COMPARISON

Method               MAP      P@5      P@10     R@5      R@10     Time(ms)  
Cosine Similarity    0.8086   0.2790   0.1603   0.8703   0.9217   825.46    
FAISS FLAT           0.8454   0.2439   0.1325   0.9123   0.9498   76.08     
FAISS IVF            0.8458   0.2427   0.1319   0.9076   0.9450   4.27      
FAISS HNSW           0.8455   0.2438   0.1325   0.9121   0.9496   0.67      

🔍 ANALYSIS:

📈 Best MAP Score: 0.8458
⚡ Fastest Search: 0.67 ms

⚡ SPEED COMPARISON:
  FAISS FLAT: 10.85x faster than cosine similarity
  FAISS IVF: 193.20x faster than cosine similarity
  FAISS HNSW: 1225.06x faster than cosine similarity

🎯 ACCURACY COMPARISON:
  FAISS FLAT: 1.045x accuracy (+3.67% difference)
  FAISS IVF: 1.046x accuracy (+3.72% difference)
  FAISS HNSW: 1.046x accuracy (+3.69% difference)

💡 RECOMMENDATIONS:
  • IndexFlatIP: Best accuracy, exact search, slower for large datasets
  • IndexIVFFlat:

## Step 13: Save and Download FAISS Indices

In [3]:
print("=== SAVING FAISS INDICES ===")

# Create directory for FAISS indices
# (save_dir must already be defined by an earlier cell; otherwise this raises NameError)
faiss_dir = f'{save_dir}/faiss_indices'
os.makedirs(faiss_dir, exist_ok=True)
print(f"FAISS index directory: {faiss_dir}")

# Save each FAISS index
saved_indices = {}

for name, index in faiss_indices.items():
    index_path = f'{faiss_dir}/{name}_index.faiss'
    faiss.write_index(index, index_path)
    saved_indices[name] = index_path
    print(f"✅ Saved {name} index to: {index_path}")

# Store HNSW parameters separately if the index exists
hnsw_params = {}
if 'index_hnsw' in locals() and index_hnsw.is_trained:
    # Access parameters if the index is trained and attributes exist
    if hasattr(index_hnsw.hnsw, 'M'):
        hnsw_params['M'] = index_hnsw.hnsw.M
    if hasattr(index_hnsw.hnsw, 'efConstruction'):
        hnsw_params['efConstruction'] = index_hnsw.hnsw.efConstruction
    if hasattr(index_hnsw.hnsw, 'efSearch'):
        hnsw_params['efSearch'] = index_hnsw.hnsw.efSearch


# Save index metadata
faiss_metadata = {
    'embedding_dim': embedding_dim,
    'num_documents': len(doc_embeddings),
    'doc_ids': doc_ids,
    'query_ids': query_ids,
    'model_name': MODEL_NAME,
    'indices': {
        'flat': {
            'type': 'IndexFlatIP',
            'description': 'Exact search using inner product',
            'file': 'flat_index.faiss'
        },
        'ivf': {
            'type': 'IndexIVFFlat',
            'description': f'Approximate search with {index_ivf.nlist} clusters',
            'nlist': index_ivf.nlist,
            'nprobe': index_ivf.nprobe,
            'file': 'ivf_index.faiss'
        },
        'hnsw': {
            'type': 'IndexHNSWFlat',
            'description': 'Graph-based search',
            # HNSW parameters come from hnsw_params, collected above
            'M': hnsw_params.get('M', 'N/A'),
            'efConstruction': hnsw_params.get('efConstruction', 'N/A'),
            'efSearch': hnsw_params.get('efSearch', 'N/A'),
            'file': 'hnsw_index.faiss'
        }
    },
    'evaluation_results': {
        'faiss': faiss_results if 'faiss_results' in locals() else None,
        'traditional': traditional_results if 'traditional_results' in locals() else None
    }
}

# Save metadata
metadata_path = f'{faiss_dir}/faiss_metadata.joblib'
joblib.dump(faiss_metadata, metadata_path)
print(f"✅ Saved FAISS metadata to: {metadata_path}")

# Create usage example script
# The template is an f-string, so literal braces in the generated script are doubled ({{ }})
usage_script = f"""
# FAISS Index Usage Example
# This script shows how to load and use the saved FAISS indices

import faiss
import joblib
import numpy as np

# Load metadata
metadata = joblib.load('faiss_metadata.joblib')
print(f"Loaded metadata for {{metadata['num_documents']:,}} documents")

# Load FAISS indices
indices = {{}}
for name in ['flat', 'ivf', 'hnsw']:
    index_path = f'{{name}}_index.faiss'
    indices[name] = faiss.read_index(index_path)
    print(f"Loaded {{name}} index with {{indices[name].ntotal}} vectors")

# Example search
def search_example(query_embedding, k=10):
    \"\"\"Example search function\"\"\"
    query_embedding = query_embedding.astype(np.float32)
    if len(query_embedding.shape) == 1:
        query_embedding = query_embedding.reshape(1, -1)

    results = {{}}
    for name, index in indices.items():
        scores, doc_indices = index.search(query_embedding, k)
        results[name] = {{
            'scores': scores[0],
            'doc_indices': doc_indices[0],
            'doc_ids': [metadata['doc_ids'][i] for i in doc_indices[0]]
        }}
    return results

# Usage:
# results = search_example(your_query_embedding)
# print(results['flat']['doc_ids'][:5])  # Top 5 document IDs
"""

with open(f'{faiss_dir}/usage_example.py', 'w') as f:
    f.write(usage_script)

print(f"✅ Created usage example: {faiss_dir}/usage_example.py")



=== SAVING FAISS INDICES ===


NameError: name 'save_dir' is not defined

## About FAISS as a Vector Store

**FAISS is widely regarded as one of the leading tools for vector search**, although strictly speaking it is a similarity-search library rather than a full vector database. Here's what changes when you apply FAISS to your Quora embeddings:

### What is FAISS?

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It's widely used as a vector store for:

- **Semantic Search**: Finding similar documents/texts
- **Recommendation Systems**: Finding similar items
- **RAG Applications**: Retrieval-Augmented Generation
- **Large-scale ML**: Handling millions/billions of vectors

### Changes After Applying FAISS:

#### 1. **Performance Improvements**
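- **Speed**: In the runs above, roughly 11x (flat) to 1,200x (HNSW) faster than the scikit-learn cosine similarity baseline
- **Memory**: Flat indices store the full float32 vectors; compressed index types (e.g. product quantization) reduce memory at some cost in accuracy
- **Scalability**: Designed to scale to billions of vectors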

#### 2. **Search Options**
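- **Exact Search**: IndexFlatIP (same ranking as cosine similarity when the embeddings are L2-normalized)
- **Approximate Search**: IndexIVF* (much faster; accuracy depends on `nlist` and `nprobe`, and in the run above MAP stayed within 0.001 of the flat index)
- **Graph Search**: IndexHNSW* (very fast, good for real-time)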
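
The link between `IndexFlatIP` and cosine similarity rests on normalization: for unit-length vectors, the inner product equals the cosine. A small sketch in plain Python, using made-up 3-dimensional vectors purely for illustration:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 2.0, 2.0]          # hypothetical embedding
short_doc = [1.0, 2.0, 2.0]      # same direction as the query
long_doc = [4.0, 0.0, 3.0]       # different direction, larger norm

# Raw inner product rewards the longer vector...
raw_ip = [dot(query, short_doc), dot(query, long_doc)]
# ...while the inner product of normalized vectors matches cosine similarity.
norm_ip = [dot(normalize(query), normalize(d)) for d in (short_doc, long_doc)]
cos = [cosine(query, d) for d in (short_doc, long_doc)]

print("raw inner product:       ", raw_ip)
print("normalized inner product:", [round(s, 4) for s in norm_ip])
print("cosine similarity:       ", [round(s, 4) for s in cos])
```

In this toy case the raw inner product ranks `long_doc` first (10 vs 9), while cosine ranks `short_doc` first (1.0 vs about 0.667). This is why embeddings are typically L2-normalized, for example with `faiss.normalize_L2`, before being added to an inner-product index.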

#### 3. **Production Ready**
- **Persistence**: Save/load indices to disk
- **GPU Support**: Leverage GPU acceleration
- **Batch Processing**: Efficient batch searches

#### 4. **Trade-offs**
- **Accuracy vs Speed**: Approximate methods trade slight accuracy for speed
- **Memory vs Speed**: Different indices have different memory requirements
- **Build Time**: Index construction takes time but search is much faster

### FAISS vs Other Vector Stores

| Feature | FAISS | Pinecone | Weaviate | Chroma |
|---------|-------|----------|----------|--------|
| **Type** | Library | SaaS | Open Source | Open Source |
| **Hosting** | Self-hosted | Cloud | Self/Cloud | Self-hosted |
| **Scale** | Billions | Millions | Millions | Millions |
| **Cost** | Free | Paid | Free/Paid | Free |
| **Speed** | Fastest | Fast | Fast | Fast |
| **Features** | Core search | Full service | Graph + Vector | Simple |

### When to Use FAISS

✅ **Use FAISS when:**
- You need maximum performance
- You have large datasets (>100K vectors)
- You want fine-grained control
- You're building production systems
- You need offline/local processing

❌ **Don't use FAISS when:**
- You need a full database with metadata
- You want a managed service
- You need advanced querying (filters, etc.)
- You have very small datasets (<10K vectors)
