## GPT Sentence Transformer and FAISS example 3

This example expands on the first one by building a vector index that can be searched on an ad hoc basis. We also want to lay the groundwork for additional, topic-specific vector indexes that support targeted searches. For example, it might be helpful to keep one index per topic, each with some associated descriptive text. A question would first be compared against that text to determine which index(es) are most related to it, and the query would then be run against those sub-indexes to find the relevant documents and text.

[Sentence Transformers home page](https://www.sbert.net/)
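
The two-stage search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the notebook's implementation: `embed` is a toy bag-of-words stand-in for `model.encode`, `nearest` is a brute-force NumPy stand-in for `faiss.IndexFlatL2.search`, and the `topics` dictionary, `routed_search`, and all sample sentences are hypothetical.

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding (stand-in for model.encode); returns a unit vector."""
    words = text.lower().split()
    vec = np.array([words.count(w) for w in vocab], dtype=np.float32)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def nearest(query_vec, matrix, k):
    """Brute-force squared-L2 search (stand-in for faiss.IndexFlatL2.search)."""
    dists = ((matrix - query_vec) ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]
    return dists[order], order

# Hypothetical topics: a short description plus the sentences filed under each.
topics = {
    "AI": {
        "description": "artificial intelligence machine learning models",
        "sentences": ["machine learning models learn from data",
                      "artificial intelligence adoption varies by country"],
    },
    "Climate Change": {
        "description": "climate change warming emissions energy",
        "sentences": ["warming raises sea levels",
                      "renewable energy cuts emissions"],
    },
}

# Vocabulary for the toy embedding, built from every known text.
vocab = sorted({w for t in topics.values()
                for s in [t["description"], *t["sentences"]]
                for w in s.lower().split()})

topic_names = list(topics)
# Stage 1 index: one vector per topic description.
topic_matrix = np.stack([embed(topics[n]["description"], vocab) for n in topic_names])
# Stage 2 indexes: one matrix of sentence vectors per topic.
sub_indexes = {n: np.stack([embed(s, vocab) for s in topics[n]["sentences"]])
               for n in topic_names}

def routed_search(query, n_topics=1, k=1):
    """Pick the closest topic(s) first, then search only their sub-indexes."""
    q = embed(query, vocab)
    _, topic_ids = nearest(q, topic_matrix, n_topics)
    results = []
    for t in topic_ids:
        name = topic_names[t]
        dists, sent_ids = nearest(q, sub_indexes[name], k)
        for d, s in zip(dists, sent_ids):
            results.append((name, topics[name]["sentences"][s], float(d)))
    return sorted(results, key=lambda r: r[2])

hits = routed_search("how does renewable energy affect emissions")
print(hits[0][0], "->", hits[0][1])
```

With real embeddings, each matrix would become its own FAISS index and `n_topics` could be raised so that a question spanning several topics is routed to more than one sub-index.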

In [1]:
import faiss
import numpy as np
from torch import Tensor

import nltk
nltk.download('punkt')  # Download the punkt tokenizer for sentence tokenization

import os
import pprint
import re
import spacy
from sentence_transformers import SentenceTransformer, util
from directives_processor import DirectivesProcessor

%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hugh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
dp = DirectivesProcessor()

In [3]:
# Load the documents and their id mappings
documents_mapping = {}

def load_documents_from_folder(folder_path):    
    file_names = sorted(os.listdir(folder_path))  # os.listdir order is arbitrary; sort for stable document IDs
    
    for file_name in file_names:
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                document_content = file.read()
                documents_mapping[len(documents_mapping)] = {
                    'file_name': file_name,
                    'content': document_content
                }
    
    return documents_mapping

# Load documents from the 'data' folder
data_folder = 'data'  # Replace 'data' with your folder name
documents_mapping = load_documents_from_folder(data_folder)

# Construct documents array from the values of documents_mapping
documents = [doc_info['content'] for doc_info in documents_mapping.values()]

# Print the document ID to file name mapping
for doc_id, doc_info in documents_mapping.items():
    print(f"Document ID: {doc_id}, File Name: {doc_info['file_name']}")

Document ID: 0, File Name: AI_1.txt
Document ID: 1, File Name: ClimateChange_1.txt
Document ID: 2, File Name: CulturalDiversityAndTraditions_1.txt
Document ID: 3, File Name: FinancialMarkets_1.txt
Document ID: 4, File Name: HistoryAndHistoricalEvents_1.txt
Document ID: 5, File Name: Terrorism_1.txt
Document ID: 6, File Name: WorldHealthIssues_1.txt


In [4]:
# Generate a list of 'questions' from the document directives
directive_embeddings = {}
topic_directives = []
topic_names = dp.get_topic_names()
print('TOPIC NAMES:', topic_names)
topic_titles = dp.get_topic_titles()
print('\nTOPIC TITLES:\n', topic_titles)
print('\nAI DIRECTIVES:\n', dp.get_topic_text('AI'))

TOPIC NAMES: ['AI', 'Climate Change', 'World Health Issues', 'Cultural Diversity']

TOPIC TITLES:
 ['Artificial Intelligence and Machine Learning', 'Documents covering climate change, global warming, environmental impacts, renewable energy, etc.', 'Gather documents related to health topics such as diseases, medical advancements, healthcare policies, etc.', 'Gather information about different cultures, traditions, languages, and societal practices.']

AI DIRECTIVES:
 Artificial Intelligence and Machine Learning
	Global Implementation of AI and ML
		Identify key AI/ML initiatives in various countries.
		Analyze adoption rates and trends of AI/ML technologies.
		Compare and contrast scope and scale of AI/ML applications in different nations.
	Policy and Regulation Variance
		Investigate legal frameworks and policies governing AI/ML in different countries.
		Assess ethical considerations and regulatory disparities.
		Highlight geopolitical implications of AI/ML disparities among nations.
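
One way to turn a directive outline like the one printed above into a flat list of "questions" is to walk the tab indentation. The sketch below is an assumption about the text format, inferred only from the printed output: title at depth 0, sub-topic at depth 1, directive at depth 2. `flatten_directives` is a hypothetical helper, not part of `DirectivesProcessor`.

```python
def flatten_directives(topic_text):
    """Turn a tab-indented outline into (title, subtopic, directive) tuples."""
    title, subtopic, leaves = None, None, []
    for raw in topic_text.splitlines():
        if not raw.strip():
            continue  # skip blank or whitespace-only lines
        depth = len(raw) - len(raw.lstrip("\t"))  # number of leading tabs
        text = raw.strip()
        if depth == 0:
            title = text
        elif depth == 1:
            subtopic = text
        else:
            leaves.append((title, subtopic, text))
    return leaves

# Shortened sample in the same shape as the AI directives output.
sample = (
    "Artificial Intelligence and Machine Learning\n"
    "\tGlobal Implementation of AI and ML\n"
    "\t\tIdentify key AI/ML initiatives in various countries.\n"
    "\t\tAnalyze adoption rates and trends of AI/ML technologies.\n"
    "\tPolicy and Regulation Variance\n"
    "\t\tAssess ethical considerations and regulatory disparities.\n"
    "\t\n"
)

leaves = flatten_directives(sample)
# Prefix each directive with its sub-topic so the text carries some context.
questions = [f"{sub}: {directive}" for _, sub, directive in leaves]
for q in questions:
    print(q)
```

Each entry in `questions` could then be encoded with `model.encode` and stored alongside its topic name, which is one possible use for the `directive_embeddings` dictionary created in this cell.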
	

In [5]:
# Populate the index and track the sentence ids and locations

index = faiss.IndexFlatL2(768)  # Create an index
# Maintain a mapping between sentence embeddings' index and their original sentences
sentence_to_index_mapping = {}

# Load the docs into the index
for idx, doc in enumerate(documents):
    sentences = nltk.sent_tokenize(doc)
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    
    print('sentence_embeddings.shape', sentence_embeddings.shape)   
    
    for sentence_idx, embedding in enumerate(sentence_embeddings):
        # Add sentence embedding to the index
        index.add(embedding.cpu().numpy().reshape(1, -1))  # FAISS needs a 2-D float32 NumPy array
        
        # Track the mapping between sentence index and its embedding index
        sentence_to_index_mapping[len(sentence_to_index_mapping)] = {
            'document_index': idx,
            'sentence_index': sentence_idx,
            'sentence_text': sentences[sentence_idx]  # Save the actual sentence
        }
    
    print("Length of the index:", index.ntotal)
    
print("Final length of the index:", index.ntotal)

sentence_embeddings.shape torch.Size([42, 768])
Length of the index: 42
sentence_embeddings.shape torch.Size([43, 768])
Length of the index: 85
sentence_embeddings.shape torch.Size([110, 768])
Length of the index: 195
sentence_embeddings.shape torch.Size([101, 768])
Length of the index: 296
sentence_embeddings.shape torch.Size([134, 768])
Length of the index: 430
sentence_embeddings.shape torch.Size([338, 768])
Length of the index: 768
sentence_embeddings.shape torch.Size([63, 768])
Length of the index: 831
Final length of the index: 831
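
`faiss.IndexFlatL2` reports squared Euclidean distances, which are harder to read than cosine similarities. If the embeddings are unit length, the two are related by `||a - b||^2 = 2 - 2 * cos(a, b)`, so the nearest neighbours by L2 are also the most similar by cosine. The sketch below checks that identity with random unit vectors in plain NumPy. Whether the model's output is already normalized is an assumption worth verifying (passing `normalize_embeddings=True` to `model.encode` is one way to enforce it), and `squared_l2_to_cosine` is a hypothetical helper.

```python
import numpy as np

def squared_l2_to_cosine(sq_dists):
    """Convert squared L2 distances between unit vectors to cosine similarity."""
    return 1.0 - np.asarray(sq_dists) / 2.0

rng = np.random.default_rng(0)

# Random stand-ins for sentence embeddings, scaled to unit length.
vectors = rng.normal(size=(5, 8)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = rng.normal(size=8).astype(np.float32)
query /= np.linalg.norm(query)

# Squared L2 distances, as an IndexFlatL2 search would report them.
sq_dists = ((vectors - query) ** 2).sum(axis=1)

cosine_direct = vectors @ query                   # dot product of unit vectors
cosine_from_l2 = squared_l2_to_cosine(sq_dists)   # recovered from the distances

print(np.allclose(cosine_direct, cosine_from_l2, atol=1e-5))
```

Applied to the distance array returned by `index.search`, the same conversion would give a similarity score per hit that can be thresholded or shown next to each sentence.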


In [6]:
# Check the first 10 entries
# Slice the first 10 items of the dictionary
first_10_items = {k: sentence_to_index_mapping[k] for k in sorted(sentence_to_index_mapping)[:10]}

# Pretty print the first 10 items
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(first_10_items)

{   0: {   'document_index': 0,
           'sentence_index': 0,
           'sentence_text': 'Artificial intelligence (AI) vs. machine learning '
                            '(ML)\n'
                            'You might hear people use artificial intelligence '
                            '(AI) and machine learning (ML) interchangeably, '
                            'especially when discussing big data, predictive '
                            'analytics, and other digital transformation '
                            'topics.'},
    1: {   'document_index': 0,
           'sentence_index': 1,
           'sentence_text': 'The confusion is understandable as artificial '
                            'intelligence and machine learning are closely '
                            'related.'},
    2: {   'document_index': 0,
           'sentence_index': 2,
           'sentence_text': 'However, these trending technologies differ in '
                            'several ways, including scope, app

In [7]:
# Run a user query against the index and get the top 5
# User query or question (uncomment the alternative to compare results)
# user_query = "How does AI affect global health?"
user_query = "How is global health affected by climate change?"

# Encode user query into an embedding
user_query_embedding = model.encode(user_query, convert_to_tensor=True).cpu().numpy()  # .cpu() keeps this working on GPU

# Search in FAISS index
k = 5  # Number of most similar prompts to retrieve
D, I = index.search(np.array([user_query_embedding]), k)

# Retrieve sentences and corresponding documents based on valid indices
most_similar_prompts = []
for idx in I[0]:
    if idx in sentence_to_index_mapping:
        mapping = sentence_to_index_mapping[idx]
        doc_idx = mapping['document_index']
        sentence_idx = mapping['sentence_index']
        sentence = mapping['sentence_text']
        most_similar_prompts.append((doc_idx, sentence_idx, sentence))

print('USER QUERY:', user_query)
print("Most similar prompts to the user query:")
for doc_idx, sentence_idx, sentence in most_similar_prompts:
    filename = documents_mapping.get(doc_idx, {}).get('file_name')  # default must be a dict so .get() is safe
    print(f"\nDocument Index: {doc_idx}\t Sentence Index: {sentence_idx}\t Filename: {filename}")
    print(f"Sentence:\n******************************\n\t{sentence}")
    print('******************************\n')


USER QUERY: How is global health affected by climate change?
Most similar prompts to the user query:

Document Index: 6	 Sentence Index: 22	 Filename: WorldHealthIssues_1.txt
Sentence:
******************************
	We are now at a point where climate change is clearly with us, and much more attention needs to be put on minimizing the impacts on global health through adaptation or enhancing resilience.
******************************


Document Index: 6	 Sentence Index: 19	 Filename: WorldHealthIssues_1.txt
Sentence:
******************************
	As we know from the pandemic, preparedness is key, and we are far from prepared for the health impacts of a warmer climate.
******************************


Document Index: 6	 Sentence Index: 16	 Filename: WorldHealthIssues_1.txt
Sentence:
******************************
	Impact of climate changeA child looks out over a dried up lake
“Climate change is already affecting the health of millions of people all over the world, and more importantly