# Data Preprocessing: Load and Vectorize Historical Data

The cell below builds a FAISS vector store for FAQ answers so that they can be retrieved by vector similarity search. A step-by-step breakdown follows the code.



In [12]:
```python
from langchain_ollama import OllamaEmbeddings
import faiss
import numpy as np
import pickle
import json

# Load FAQ data (only answers)
with open("FaQ_en.json", "r", encoding="utf-8") as f:
    faq_data = json.load(f)

# Extract answers only
answers = [q["answer"] for q in faq_data]  # Store only answers (no questions)

# Initialize embedding model
embedding_model = OllamaEmbeddings(model="nomic-embed-text")
answer_embeddings = embedding_model.embed_documents(answers)

# Convert embeddings to NumPy array
answer_vectors = np.array(answer_embeddings).astype("float32")

# Create FAISS index
dimension = answer_vectors.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(answer_vectors)

# Save FAISS index & metadata
faiss.write_index(index, "faiss_index")
with open("faiss_metadata.pkl", "wb") as f:
    pickle.dump(answers, f)  # Save answers only

print("✅ FAISS Vector Store Created (Only Answers Stored)")
```


✅ FAISS Vector Store Created (Only Answers Stored)


1. Loading and Processing FAQ Data
- The script loads a JSON file (FaQ_en.json) that contains FAQ entries.
- It keeps only the answers and discards the questions, so a user query is matched against answer text only.
2. Generating Embeddings for Answers
- Uses Ollama’s nomic-embed-text model, called through LangChain's OllamaEmbeddings, to generate embeddings.
- Each answer is converted into a dense vector representation.
- These embeddings make semantic search possible: the system finds answers that are similar in meaning, not just in keywords.
3. Preparing FAISS Index
- Converts the embeddings into a NumPy array in float32 format, which FAISS requires.
- Initializes a FAISS index with IndexFlatL2, which performs exact (brute-force) similarity search using L2 (Euclidean) distance.
- Adds the vectors to the index so they can be searched. A flat index needs no training step.
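4. Saving the FAISS Index and Metadata
- Writes the index to disk as faiss_index using faiss.write_index.
- Pickles the list of answers to faiss_metadata.pkl. The index stores only vectors, so this list is what maps a search result's position back to its answer text.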
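
To make the search step concrete, here is a minimal, dependency-free sketch of what an exact L2 search does: it compares the query with every stored vector and returns the closest positions. The function names, toy vectors, and texts are made up for illustration and are not part of the notebook. FAISS performs the same kind of comparison far more efficiently, and IndexFlatL2 reports squared L2 distances, which is why the sketch skips the square root.

```python
# Toy illustration of exact L2 search, the kind of computation IndexFlatL2 performs.
# All vectors and texts are invented; real embeddings have hundreds of dimensions.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(query, vectors, k):
    """Compare the query with every stored vector and return the k closest
    as (distance, position) pairs, nearest first."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]


# Position i in `vectors` corresponds to position i in `texts`,
# in the same way that row i of the index corresponds to answers[i].
texts = [
    "Reset your password from the login page.",
    "Refunds take five business days.",
    "Contact support by email.",
]
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

query = [0.9, 0.1, 0.0]  # hypothetical embedding of a password question
hits = flat_l2_search(query, vectors, k=2)
for distance, position in hits:
    print(f"{distance:.2f}  {texts[position]}")
```

The returned positions are used to look up the text in the list saved alongside the index, which is why the order of that list must match the order in which the vectors were added.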

# Further Study Resources:
- **FAISS Documentation**: https://faiss.ai/
- **Vector Embeddings**: Study models like BERT, SentenceTransformers, OpenAI embeddings.
- **LangChain RAG**: Learn how to use vector stores in retrieval-augmented generation.
- **Scaling FAISS**: Learn how to use quantization (PQ, OPQ) for large datasets.


In [2]:
```python
import json
import faiss
import pickle
import numpy as np
from langchain_ollama import OllamaEmbeddings

# Load the FAISS index & metadata
embedding_model = OllamaEmbeddings(model="nomic-embed-text")
index = faiss.read_index("faiss_index")

with open("faiss_metadata.pkl", "rb") as f:
    stored_answers = pickle.load(f)

# Load new JSON file
json_file_path = "processed_sites.json"
with open(json_file_path, "r", encoding="utf-8") as f:
    site_data = json.load(f)

# Create a list to store new embeddings
new_embeddings = []
new_texts = []

# Iterate through JSON entries and create embeddings
for site in site_data:
    site_name = site.get("site_name", "")
    location_description = site.get("location_description", "")
    summary = site.get("summary", "")

    # Create text representation
    new_text = f"{site_name}. {location_description} {summary}"

    # Generate the embedding
    new_embedding = embedding_model.embed_query(new_text)

    # Store for batch addition
    new_embeddings.append(new_embedding)
    new_texts.append(new_text)

# Convert new embeddings to numpy array
new_embeddings_np = np.array(new_embeddings).astype('float32')

# Append new embeddings to the existing FAISS index
index.add(new_embeddings_np)

# Append new texts to metadata storage
stored_answers.extend(new_texts)

# Save the updated FAISS index
faiss.write_index(index, "faiss_index")

# Save the updated metadata
with open("faiss_metadata.pkl", "wb") as f:
    pickle.dump(stored_answers, f)

print(f"Successfully added {len(new_texts)} new entries to the FAISS vector store!")
```


Successfully added 139 new entries to the FAISS vector store!


> This cell adds entries to the vector store from a separate data source (in our case, a PDF converted to JSON).
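
When appending records from a second source, two things are easy to get wrong: records with missing fields produce awkward text such as a leading ". ", and the metadata list can drift out of step with the index. The sketch below shows one possible way to guard against both. The helper names `build_site_text` and `check_alignment` and the sample records are hypothetical and do not appear in the notebook; only the three field names are taken from the cell above.

```python
FIELDS = ("site_name", "location_description", "summary")


def build_site_text(site):
    """Join the non-empty fields of one site record into a single string.
    Missing or null fields are treated as empty."""
    name, location, summary = ((site.get(f) or "").strip() for f in FIELDS)
    body = " ".join(part for part in (location, summary) if part)
    if name and body:
        return f"{name}. {body}"
    return name or body


def check_alignment(vector_count, texts):
    """Raise if the number of indexed vectors and stored texts differ."""
    if vector_count != len(texts):
        raise ValueError(
            f"index holds {vector_count} vectors but metadata holds {len(texts)} texts"
        )


# Invented sample records standing in for entries of processed_sites.json.
sites = [
    {
        "site_name": "Old Mill",
        "location_description": "North bank of the river.",
        "summary": "Built in 1850.",
    },
    {"site_name": "", "summary": "Unnamed ruin."},
    {},  # an empty record yields no text and is skipped
]

texts = [t for t in (build_site_text(s) for s in sites) if t]
check_alignment(vector_count=2, texts=texts)  # passes: two texts, two vectors
print(texts)
```

In the notebook's setting, the vector count would presumably come from the index's `ntotal` attribute, checked against `stored_answers` after each append and before saving.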