<a href="https://colab.research.google.com/github/MiskirB/B5W6-Intelligent-Complaint-Analysis/blob/main/02_chunking_embedding_indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install sentence-transformers faiss-cpu



In [2]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.makedirs("/content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store", exist_ok=True)


Mounted at /content/drive


In [3]:
import pandas as pd

data_path = "/content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/data/filtered_complaints.csv"
df = pd.read_csv(data_path)

print(df.shape)
df[['Product', 'cleaned_narrative']].head()


(177855, 20)


index,Product,cleaned_narrative
0,Credit card,a xxxx xxxx card was opened under my name by a...
1,Credit card,dear cfpb i have a secured credit card with ci...
2,Credit card,i have a citi rewards cards the credit balance...
3,Credit card,bi am writing to dispute the following charges...
4,Credit card,although the account had been deemed closed i ...


In [4]:
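def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        # A non-positive step would keep `start` from advancing (infinite loop).
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start += chunk_size - chunk_overlap  # advance by the step (250 by default)
    return chunks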

# Apply chunking; complaint_id stores the DataFrame row index of the source complaint
chunk_data = []
for idx, row in df.iterrows():
    chunks = chunk_text(row['cleaned_narrative'])
    for c in chunks:
        chunk_data.append({
            "complaint_id": idx,
            "product": row["Product"],
            "text": c
        })

chunk_df = pd.DataFrame(chunk_data)
print("✅ Total Chunks:", len(chunk_df))
chunk_df.head()


✅ Total Chunks: 800676


index,complaint_id,product,text
0,0,Credit card,a xxxx xxxx card was opened under my name by a...
1,0,Credit card,dulent and immediately closed the card however...
2,1,Credit card,dear cfpb i have a secured credit card with ci...
3,1,Credit card,d my check but their system doesnt have info a...
4,1,Credit card,me via mail within 14 days to fill out i calle...
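
The sliding-window behaviour is easier to see on a short synthetic string. The sketch below repeats the same logic so that it runs on its own; the 700-character `sample` is a made-up input, not a complaint from the dataset. With the default settings the window advances 250 characters at a time, so a text of length n yields ceil(n / 250) chunks, and each chunk begins with the last 50 characters of the previous one.

```python
import math

def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Same sliding-window logic as the notebook cell, repeated so this runs alone."""
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Hypothetical 700-character input made of repeating digits.
sample = "".join(str(i % 10) for i in range(700))
chunks = chunk_text(sample)

print([len(c) for c in chunks])                      # [300, 300, 200]
print(chunks[0][-50:] == chunks[1][:50])             # True: 50 shared characters
print(len(chunks) == math.ceil(len(sample) / 250))   # True
```

Because the split is by character count, chunks can start or end mid-word, as the `dulent and immediately...` row above shows.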






In [5]:
!ls /content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store/


In [8]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import pickle
from tqdm.auto import tqdm
import os

# 1. Prepare input texts
texts = chunk_df['text'].tolist()

# 2. Load the embedding model (runs on the GPU when CUDA is available, otherwise on CPU)
model = SentenceTransformer('all-MiniLM-L6-v2')

# 3. Encode all chunks in batches (fast on a GPU runtime)
embeddings = model.encode(
    texts,
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=64  # you can tune this
)

# 4. Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print("✅ FAISS index built. Total vectors:", index.ntotal)

# 5. Prepare metadata
metadata = chunk_df.to_dict(orient='records')

# 6. Save both to Google Drive
index_path = "/content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store/faiss_index.index"
metadata_path = "/content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store/metadata.pkl"

os.makedirs("/content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store", exist_ok=True)

faiss.write_index(index, index_path)
print("✅ FAISS index saved to:", index_path)

with open(metadata_path, "wb") as f:
    pickle.dump(metadata, f)
print("✅ Metadata saved to:", metadata_path)



✅ FAISS index built. Total vectors: 800676
✅ FAISS index saved to: /content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store/faiss_index.index
✅ Metadata saved to: /content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store/metadata.pkl
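
For intuition about what a flat L2 index does at query time: `faiss.IndexFlatL2` performs an exact, exhaustive comparison of the query against every stored vector and ranks by squared Euclidean distance. The NumPy sketch below imitates that behaviour on random data. It is only an illustration, not FAISS's implementation; `flat_l2_search` and the random `vectors` are hypothetical stand-ins for the real index and chunk embeddings, and the width of 384 is assumed to match `all-MiniLM-L6-v2`.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Exact k-nearest-neighbour search by squared L2 distance (brute force)."""
    # Pairwise squared distances, shape (num_queries, num_vectors).
    diffs = queries[:, None, :] - vectors[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k smallest distances per query, nearest first.
    ids = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, ids, axis=1), ids

rng = np.random.default_rng(0)
vectors = rng.random((1000, 384), dtype=np.float32)  # stand-in for chunk embeddings
queries = vectors[:2] + 0.001                        # two queries next to vectors 0 and 1

distances, ids = flat_l2_search(vectors, queries, k=3)
print(ids[:, 0])  # [0 1]: each query's nearest neighbour is the vector it was built from
```

In a retrieval step, the returned row positions would be used to look up the matching records in the saved `metadata` list, since the index and the metadata were built in the same order.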


In [9]:
!ls -lh /content/drive/MyDrive/B5W6-Intelligent-Complaint-Analysis/vector_store/


total 1.4G
-rw------- 1 root root 1.2G Jul  5 11:20 faiss_index.index
-rw------- 1 root root 218M Jul  5 11:20 metadata.pkl
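
The index size can be sanity-checked with a little arithmetic. A flat index stores every vector uncompressed as float32, so its size is roughly vectors × dimension × 4 bytes. The sketch below assumes an embedding width of 384 for `all-MiniLM-L6-v2` (the notebook reads it from `embeddings.shape[1]` without printing it) and ignores the small file header.

```python
num_vectors = 800_676   # index.ntotal reported by the notebook
dimension = 384         # assumed embedding width of all-MiniLM-L6-v2
bytes_per_float = 4     # float32

raw_bytes = num_vectors * dimension * bytes_per_float
print(raw_bytes)                       # 1229838336
print(round(raw_bytes / 1024**3, 2))   # 1.15 (GiB)
```

About 1.15 GiB is consistent with the `1.2G` that `ls -lh` reports for `faiss_index.index`.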
