<a href="https://colab.research.google.com/github/DilkiSandunika/VGTU_Thesis_Project/blob/main/notebooks/02_knowledge_base_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# ===================================================================
# CELL 1: Install Necessary Libraries
# ===================================================================
print("Installing required libraries for vector database creation...")
# faiss-cpu is for vector search, sentence-transformers is for creating the vectors
!pip install faiss-cpu sentence-transformers -q
print("Libraries installed successfully.")


# ===================================================================
# CELL 2: Import Libraries and Unzip Data
# ===================================================================
import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from google.colab import files
import pickle # We will use pickle to save our list of documents

# Define paths
knowledge_base_zip_path = '/content/knowledge_base.zip'
knowledge_base_dir = '/content/knowledge_base/'
output_dir = '/content/processed_db/'

# Unzip the knowledge base files (-o overwrites existing files so the cell can be re-run without prompts)
print(f"Unzipping {knowledge_base_zip_path}...")
!unzip -q -o {knowledge_base_zip_path} -d {knowledge_base_dir}

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
print("Setup complete. Knowledge base is ready.")


# ===================================================================
# CELL 3: Load the Knowledge Documents
# ===================================================================
knowledge_base_docs = []

# Loop through the files in the unzipped directory (sorted so the document order is reproducible)
for filename in sorted(os.listdir(knowledge_base_dir)):
    if filename.endswith(".txt"):
        filepath = os.path.join(knowledge_base_dir, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            # Read each non-empty line as a separate document/rule
            for line in f:
                rule = line.strip()
                if rule:
                    knowledge_base_docs.append(rule)

print(f"Loaded {len(knowledge_base_docs)} individual rules/guidelines from the knowledge base.")
print("\n--- Knowledge Base Content ---")
for doc in knowledge_base_docs:
    print(f"- {doc}")


# ===================================================================
# CELL 4: Create Text Embeddings
# ===================================================================
print("\nLoading sentence transformer model (this may take a moment)...")
# 'all-MiniLM-L6-v2' is a compact general-purpose model that produces 384-dimensional sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

print("Converting knowledge base text into numerical vectors (embeddings)...")
# The .encode() method turns our list of sentences into a matrix of numbers
embeddings = model.encode(knowledge_base_docs)
print(f"Embeddings created successfully. Shape of embeddings matrix: {embeddings.shape}")


# ===================================================================
# CELL 5: Build and Save the FAISS Index and Document List
# ===================================================================
# Get the dimensionality of our vectors
d = embeddings.shape[1]

# Build a FAISS index. IndexFlatL2 does exact (brute-force) L2 search, which is sufficient for a small knowledge base.
index = faiss.IndexFlatL2(d)

# Add our document embeddings to the index
index.add(embeddings.astype('float32')) # FAISS requires float32

print(f"\nFAISS index built successfully with {index.ntotal} vectors.")

# --- CRUCIAL: Save both the index and the original documents ---
# 1. Save the FAISS index
faiss.write_index(index, os.path.join(output_dir, 'knowledge_base.index'))

# 2. Save the list of documents, so we can retrieve the original text later
with open(os.path.join(output_dir, 'knowledge_base_docs.pkl'), 'wb') as f:
    pickle.dump(knowledge_base_docs, f)

print(f"Vector DB and document list have been saved to the '{output_dir}' directory.")


# ===================================================================
# CELL 6: Download the Created Files (Optional)
# ===================================================================
print("\nTriggering download of the created vector database files...")
files.download(os.path.join(output_dir, 'knowledge_base.index'))
files.download(os.path.join(output_dir, 'knowledge_base_docs.pkl'))
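
Cell 5 relies on `IndexFlatL2`, which stores the vectors uncompressed and answers a query by comparing it against every stored vector, reporting squared Euclidean distances. As a rough illustration of that behaviour (a plain-Python sketch, not how FAISS is implemented), the block below performs the same exhaustive search on toy data. The function name `flat_l2_search` and the toy vectors and documents are hypothetical and do not come from the notebook.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    k: int,
) -> List[Tuple[float, int]]:
    """Return the k nearest stored vectors as (squared_l2_distance, position) pairs.

    Hypothetical helper: it compares the query against every stored vector.
    Positions refer to insertion order, which is why the document list has to
    be saved in the same order as the vectors.
    """
    scored = []
    for position, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, position))
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy 2-D "embeddings" standing in for the real 384-dimensional ones.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
toy_docs = ["doc A", "doc B", "doc C"]

hits = flat_l2_search(toy_vectors, query=[0.9, 0.1], k=2)
nearest_docs = [toy_docs[pos] for _, pos in hits]
```

Here `nearest_docs` lists "doc B" first and "doc A" second, because the query lies closest to the second toy vector. The search only returns positions, so the saved document list is what turns a hit back into rule text.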

Installing required libraries for vector database creation...
Libraries installed successfully.
Unzipping /content/knowledge_base.zip...
Setup complete. Knowledge base is ready.
Loaded 6 individual rules/guidelines from the knowledge base.

--- Knowledge Base Content ---
- Rule 101: All functional requirements must explicitly state the user role involved (e.g., 'the admin', 'the user', 'the officer').
- Rule 102: Any requirement handling personally identifiable information (PII) or sensitive data must mention encryption or secure handling.
- Rule 103: The system shall use role-based access control for any function that creates, modifies, or deletes data.
- Rule 104: Requirements must be written in a clear, active voice (e.g., "The system shall do X" not "X should be done").
- Rule 105: Each requirement must be atomic, meaning it describes a single, verifiable fun

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Converting knowledge base text into numerical vectors (embeddings)...
Embeddings created successfully. Shape of embeddings matrix: (6, 384)

FAISS index built successfully with 6 vectors.
Vector DB and document list have been saved to the '/content/processed_db/' directory.

Triggering download of the created vector database files...
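
The two saved files are only useful together: the index returns row positions, and the pickled list maps those positions back to rule text. A minimal sketch of how a later notebook might reload both and run a query follows. It assumes the file names and output directory used above and the same `all-MiniLM-L6-v2` model; the function names `hits_to_docs` and `search_knowledge_base` and the default `k` are hypothetical.

```python
import pickle
from typing import List, Sequence, Tuple


def hits_to_docs(
    docs: Sequence[str],
    distances: Sequence[float],
    ids: Sequence[int],
) -> List[Tuple[str, float]]:
    """Pair each returned id with its document text, skipping FAISS's -1 padding."""
    return [(docs[i], float(d)) for d, i in zip(distances, ids) if i != -1]


def search_knowledge_base(query: str, db_dir: str = "/content/processed_db/", k: int = 3):
    """Load the saved index and document list, then return the top-k rules for a query."""
    # Imported here so the helper above stays usable without these packages.
    import os
    import faiss
    from sentence_transformers import SentenceTransformer

    index = faiss.read_index(os.path.join(db_dir, "knowledge_base.index"))
    # Only unpickle files you created yourself; pickle can execute arbitrary code.
    with open(os.path.join(db_dir, "knowledge_base_docs.pkl"), "rb") as f:
        docs = pickle.load(f)

    # The query must be embedded with the same model that produced the stored vectors.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_vec = model.encode([query]).astype("float32")  # FAISS requires float32

    # Never ask for more neighbours than the index holds.
    distances, ids = index.search(query_vec, min(k, index.ntotal))
    return hits_to_docs(docs, distances[0], ids[0])
```

A call such as `search_knowledge_base("The system shall store user passwords.")` would return a list of `(rule_text, squared_distance)` pairs, smallest distance first. The example query is made up for illustration.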

