# Task 2: Text Chunking, Embedding, and Vector Store Indexing

## Text Chunking, Embedding, and Vector Store for RAG System

This notebook splits complaint narratives into chunks, generates an embedding for each chunk, and builds a FAISS vector store for efficient semantic search.

**Objectives:**
- Implement text chunking strategy for long complaint narratives
- Generate embeddings using sentence transformers
- Create and populate a vector store (FAISS)
- Store metadata for traceability

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import pickle
import os
from typing import List, Dict, Any
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")
print(f"Current working directory: {os.getcwd()}")

# Check if GPU is available
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. Load Cleaned Data

First, let's load the cleaned complaint data from Task 1.

In [None]:
# Load the cleaned data from Task 1
data_path = "../data/filtered_complaints.csv"

if os.path.exists(data_path):
    df = pd.read_csv(data_path)
    print(f"✅ Loaded cleaned data: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print(f"\nFirst few rows:")
    print(df.head())
    
    # Check text lengths
    print(f"\nText length statistics:")
    print(df['cleaned_length'].describe())
    
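else:
    # Fail with an explicit error instead of calling exit() inside a notebook
    raise FileNotFoundError(f"❌ Cleaned data not found at {data_path}. Please run Task 1 first.")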

## 2. Text Chunking Strategy

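Embedding a long narrative as a single vector tends to work poorly: the model truncates input beyond its maximum sequence length, and one vector has to represent every topic the complaint raises. We'll implement a recursive character text splitter that breaks narratives into smaller chunks along natural boundaries (paragraphs, sentences, words).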
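
To make the roles of chunk size and overlap concrete, here is a minimal sketch of the simplest possible chunker: a fixed-size sliding window. It is not the recursive splitter implemented in this notebook (it ignores sentence boundaries entirely), and `sliding_window_chunks` is a hypothetical helper used only for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """Cut text into fixed-size windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo = "".join(str(i % 10) for i in range(1200))  # 1,200-character dummy text
windows = sliding_window_chunks(demo, chunk_size=500, overlap=100)
print([len(w) for w in windows])
print(windows[0][-100:] == windows[1][:100])  # tail of one window == head of the next
```

With a 500-character window and 100 characters of overlap, the window advances 400 characters at a time, so the 1,200-character dummy text yields windows of 500, 500, and 400 characters. The overlap means a sentence cut at a window edge still appears intact in one of the two neighbouring chunks.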

In [None]:
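class RecursiveCharacterTextSplitter:
    """
    A simple recursive character text splitter that breaks text into chunks
    while trying to preserve semantic boundaries.
    """

    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 100, separators: List[str] = None):
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = separators or ["\n\n", "\n", ".", "!", "?", ";", " ", ""]

    def _split(self, text: str, separators: List[str]) -> List[str]:
        """Recursively split text into pieces of at most chunk_size characters."""
        if len(text) <= self.chunk_size:
            return [text]

        # Use the first (coarsest) separator that occurs in the text.
        # The empty string is skipped because str.split("") raises ValueError.
        for pos, separator in enumerate(separators):
            if separator and separator in text:
                break
        else:
            # No separator left: fall back to hard cuts every chunk_size characters
            return [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]

        parts = text.split(separator)
        pieces, current = [], ""
        for i, part in enumerate(parts):
            # Keep the separator attached to the part it terminates
            piece = part + separator if i < len(parts) - 1 else part
            if len(piece) > self.chunk_size:
                # Too long on its own: flush, then recurse with the finer separators
                if current:
                    pieces.append(current)
                    current = ""
                pieces.extend(self._split(piece, separators[pos + 1:]))
            elif len(current) + len(piece) <= self.chunk_size:
                current += piece
            else:
                pieces.append(current)
                current = piece
        if current:
            pieces.append(current)
        return pieces

    def split_text(self, text: str) -> List[str]:
        """Split text into chunks, each prefixed with the tail of the previous chunk."""
        chunks = [piece.strip() for piece in self._split(text, self.separators)]
        chunks = [chunk for chunk in chunks if chunk]

        # Handle overlap: prepend up to chunk_overlap characters of the previous chunk,
        # so a final chunk can be as long as chunk_size + chunk_overlap + 1
        final_chunks = []
        for i, chunk in enumerate(chunks):
            if i == 0 or self.chunk_overlap == 0:
                final_chunks.append(chunk)
            else:
                overlap_text = chunks[i - 1][-self.chunk_overlap:]
                final_chunks.append(overlap_text + " " + chunk)

        return final_chunks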

# Test the chunking strategy
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Test with a sample narrative
if not df.empty:
    sample_text = df['Consumer complaint narrative'].iloc[0]
    print(f"Original text length: {len(sample_text)}")
    print(f"Original text: {sample_text[:200]}...")
    
    chunks = splitter.split_text(sample_text)
    print(f"\nNumber of chunks: {len(chunks)}")
    for i, chunk in enumerate(chunks[:3]):  # Show first 3 chunks
        print(f"Chunk {i+1} (length: {len(chunk)}): {chunk[:100]}...")

## 3. Embedding Model Selection

We'll use the `sentence-transformers/all-MiniLM-L6-v2` model to generate embeddings. It is a compact model that produces 384-dimensional vectors, which keeps inference fast and the index small while remaining well suited to semantic similarity tasks.
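
One practical consequence of the 384-dimensional output is a small memory footprint. The rough estimate below assumes the embeddings are stored uncompressed as float32 (4 bytes per value); `embedding_memory_mb` and the chunk counts are hypothetical, chosen only to show the scale.

```python
def embedding_memory_mb(n_vectors: int, dim: int = 384, bytes_per_value: int = 4) -> float:
    """Size in MiB of an (n_vectors, dim) array of float32 values."""
    return n_vectors * dim * bytes_per_value / 1024 ** 2


# Hypothetical chunk counts, just to show how the footprint scales
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} chunks -> {embedding_memory_mb(n):8.1f} MiB")
```

The footprint grows linearly with both the number of chunks and the embedding dimension, so a model with a larger output dimension would raise storage and search cost in proportion.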

In [None]:
# Initialize the embedding model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Loading embedding model: {model_name}")

# Load the model
embedding_model = SentenceTransformer(model_name)
embedding_model.to(device)

print(f"✅ Model loaded successfully!")
print(f"Model max sequence length: {embedding_model.max_seq_length}")
print(f"Model embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Test embedding generation
test_text = "I am having issues with my credit card billing"
test_embedding = embedding_model.encode([test_text])
print(f"\nTest embedding shape: {test_embedding.shape}")
print(f"Sample embedding values: {test_embedding[0][:10]}")

## 4. Process All Complaint Narratives

Now we'll process all complaint narratives through our chunking strategy and prepare them for embedding.

In [None]:
# Process all complaint narratives into chunks
def process_complaints_to_chunks(df, splitter):
    """
    Process all complaint narratives into chunks with metadata.
    """
    all_chunks = []
    chunk_metadata = []
    
    print(f"Processing {len(df)} complaints into chunks...")
    
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        narrative = row['Consumer complaint narrative']
        chunks = splitter.split_text(narrative)
        
        for chunk_idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            chunk_metadata.append({
                'complaint_id': row.get('Complaint ID', idx),
                'chunk_id': f"{idx}_{chunk_idx}",
                'chunk_index': chunk_idx,
                'total_chunks': len(chunks),
                'product': row['Product'],
                'issue': row['Issue'],
                'company': row.get('Company', 'Unknown'),
                'date_received': row.get('Date received', ''),
                'original_text_length': len(narrative),
                'chunk_length': len(chunk)
            })
    
    return all_chunks, chunk_metadata

# Process the complaints
all_chunks, chunk_metadata = process_complaints_to_chunks(df, splitter)

print(f"\n✅ Processing complete!")
print(f"Total chunks created: {len(all_chunks)}")
print(f"Average chunks per complaint: {len(all_chunks) / len(df):.2f}")

# Analyze chunk statistics
chunk_lengths = [len(chunk) for chunk in all_chunks]
print(f"\nChunk length statistics:")
print(f"Min: {min(chunk_lengths)}")
print(f"Max: {max(chunk_lengths)}")
print(f"Mean: {np.mean(chunk_lengths):.2f}")
print(f"Median: {np.median(chunk_lengths):.2f}")

# Visualize chunk lengths
plt.figure(figsize=(10, 6))
plt.hist(chunk_lengths, bins=50, alpha=0.7, edgecolor='black')
plt.title('Distribution of Chunk Lengths')
plt.xlabel('Chunk Length (characters)')
plt.ylabel('Frequency')
plt.axvline(np.mean(chunk_lengths), color='red', linestyle='--', label=f'Mean: {np.mean(chunk_lengths):.0f}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 5. Generate Embeddings

Now we'll generate embeddings for all chunks using our selected model.

In [None]:
# Generate embeddings for all chunks
def generate_embeddings_batch(chunks, model, batch_size=32):
    """
    Generate embeddings for chunks in batches to manage memory.
    """
    embeddings = []
    
    print(f"Generating embeddings for {len(chunks)} chunks...")
    
    for i in tqdm(range(0, len(chunks), batch_size)):
        batch = chunks[i:i + batch_size]
        batch_embeddings = model.encode(batch, convert_to_numpy=True)
        embeddings.extend(batch_embeddings)
    
    return np.array(embeddings)

# Generate embeddings
print("Starting embedding generation...")
embeddings = generate_embeddings_batch(all_chunks, embedding_model, batch_size=32)

print(f"\n✅ Embedding generation complete!")
print(f"Embeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"Total embeddings: {embeddings.shape[0]}")

# Save embeddings and metadata for later use
embeddings_path = "../vector_store/embeddings.npy"
metadata_path = "../vector_store/metadata.pkl"

os.makedirs("../vector_store", exist_ok=True)

np.save(embeddings_path, embeddings)
with open(metadata_path, 'wb') as f:
    pickle.dump(chunk_metadata, f)

print(f"✅ Embeddings saved to: {embeddings_path}")
print(f"✅ Metadata saved to: {metadata_path}")

## 6. Create FAISS Vector Store

We'll create a FAISS index for efficient similarity search and store it for use in the RAG pipeline.
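
The index used in this section ranks vectors by inner product, which equals cosine similarity only when the vectors are L2-normalized first. The NumPy-only sketch below checks that equivalence on random toy vectors with an exhaustive search; `l2_normalize` and `top_k_inner_product` are hypothetical helpers for illustration, not FAISS functions.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Return a float32 copy of `vectors` with every row scaled to unit length."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors


def top_k_inner_product(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Exhaustive inner-product search: score every corpus row, return the k best."""
    scores = corpus @ query
    order = np.argsort(-scores)[:k]
    return scores[order], order


rng = np.random.default_rng(42)
corpus = rng.normal(size=(100, 384))  # toy stand-in for chunk embeddings
query = rng.normal(size=384)          # toy stand-in for a query embedding

# Cosine similarity computed directly from its definition
cosine = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

# Inner product after L2 normalization of both sides
unit_corpus = l2_normalize(corpus)
unit_query = l2_normalize(query[None, :])[0]
scores, order = top_k_inner_product(unit_query, unit_corpus, k=3)

# The top inner-product scores match the top cosine similarities
print(np.allclose(scores, np.sort(cosine)[::-1][:3], atol=1e-5))
```

The same reasoning explains why the query must be normalized at search time as well: skipping it would scale every score by the query's norm, which leaves the ranking unchanged but makes the scores no longer interpretable as cosine similarities.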

In [None]:
# Create FAISS index
def create_faiss_index(embeddings):
    """
    Create a FAISS index for similarity search.
    """
    dimension = embeddings.shape[1]
    
    # IndexFlatIP ranks by inner product, which equals cosine similarity
    # once every vector is L2-normalized.
    # FAISS expects contiguous float32 input, so cast before normalizing in place.
    embeddings = np.ascontiguousarray(embeddings, dtype='float32')
    faiss.normalize_L2(embeddings)
    
    # Create index
    index = faiss.IndexFlatIP(dimension)
    
    # Add embeddings to index
    index.add(embeddings)
    
    return index

# Create the FAISS index
print("Creating FAISS index...")
faiss_index = create_faiss_index(embeddings.copy())

print(f"✅ FAISS index created!")
print(f"Index dimension: {faiss_index.d}")
print(f"Total vectors in index: {faiss_index.ntotal}")

# Save the FAISS index
index_path = "../vector_store/faiss_index.bin"
faiss.write_index(faiss_index, index_path)
print(f"✅ FAISS index saved to: {index_path}")

# Save chunks for retrieval
chunks_path = "../vector_store/chunks.pkl"
with open(chunks_path, 'wb') as f:
    pickle.dump(all_chunks, f)
print(f"✅ Chunks saved to: {chunks_path}")

## 7. Test Vector Store Retrieval

Let's test our vector store by performing some sample queries to ensure everything works correctly.

In [None]:
# Test retrieval functionality
def test_retrieval(query, index, chunks, metadata, model, k=5):
    """
    Test retrieval functionality with a sample query.
    """
    print(f"Query: '{query}'")
    print("-" * 50)
    
    # Encode query
    query_embedding = model.encode([query])
    faiss.normalize_L2(query_embedding)
    
    # Search
    scores, indices = index.search(query_embedding.astype('float32'), k)
    
    print(f"Top {k} results:")
    for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
        chunk = chunks[idx]
        meta = metadata[idx]
        
        print(f"\n{i+1}. Score: {score:.4f}")
        print(f"   Product: {meta['product']}")
        print(f"   Issue: {meta['issue']}")
        print(f"   Chunk: {chunk[:200]}...")
        print(f"   Metadata: Complaint ID: {meta['complaint_id']}, Chunk {meta['chunk_index']+1}/{meta['total_chunks']}")

# Test with sample queries
test_queries = [
    "Problems with credit card billing",
    "Issues with personal loan payments",
    "BNPL payment problems",
    "Savings account access issues",
    "Money transfer delays"
]

for query in test_queries:
    test_retrieval(query, faiss_index, all_chunks, chunk_metadata, embedding_model, k=3)
    print("\n" + "="*80 + "\n")
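
Because consecutive chunks of one narrative share overlapping text, a top-k search can return several chunks from the same complaint. A minimal sketch of one way to collapse such results, assuming hits are available as `(score, metadata)` pairs with the `complaint_id` field stored in the chunk metadata; `best_hit_per_complaint` and the sample values are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def best_hit_per_complaint(
    hits: List[Tuple[float, Dict[str, Any]]],
) -> List[Tuple[float, Dict[str, Any]]]:
    """Keep only the highest-scoring chunk for each complaint_id, best first."""
    best: Dict[Any, Tuple[float, Dict[str, Any]]] = {}
    for score, meta in hits:
        cid = meta["complaint_id"]
        if cid not in best or score > best[cid][0]:
            best[cid] = (score, meta)
    # Sort on the score only, so metadata dicts are never compared
    return sorted(best.values(), key=lambda hit: hit[0], reverse=True)


# Hypothetical search results: (score, metadata) pairs
hits = [
    (0.81, {"complaint_id": 101, "chunk_index": 0}),
    (0.79, {"complaint_id": 101, "chunk_index": 1}),
    (0.74, {"complaint_id": 205, "chunk_index": 2}),
]
for score, meta in best_hit_per_complaint(hits):
    print(meta["complaint_id"], meta["chunk_index"], score)
```

Whether to deduplicate depends on the downstream use: keeping every chunk gives the generator more context from one complaint, while one chunk per complaint gives broader coverage across complaints.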

## 8. Summary and Next Steps

### Chunking Strategy Summary
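- **Chunk Size**: 500-character target, with up to 100 characters of the previous chunk prepended as overlap (so a stored chunk can exceed 500 characters)
- **Separators**: Tried in order from coarse to fine (paragraph breaks, line breaks, sentence punctuation, spaces)
- **Total Chunks**: Printed in Section 4 as `len(all_chunks)`
- **Average Chunks per Complaint**: Printed in Section 4 as `len(all_chunks) / len(df)`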

### Embedding Model Choice
- **Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Rationale**: 
  - Good balance between performance and computational efficiency
  - 384-dimensional embeddings
  - Suitable for semantic similarity tasks
  - Fast inference time

### Vector Store Configuration
- **Index Type**: FAISS IndexFlatIP (Inner Product for cosine similarity)
- **Normalization**: L2 normalization for cosine similarity
- **Metadata Storage**: Complete traceability to original complaints

### Files Created
1. `../vector_store/embeddings.npy` - Raw embeddings array
2. `../vector_store/metadata.pkl` - Chunk metadata with traceability
3. `../vector_store/faiss_index.bin` - FAISS index for similarity search
4. `../vector_store/chunks.pkl` - Text chunks for retrieval

### Next Steps
The vector store is now ready for Task 3: Building the RAG Core Logic and Evaluation. The retrieval system can efficiently find relevant complaint chunks based on semantic similarity to user queries.