## GPT Sentence Transformer and FAISS example 4

Now, we want to expand on the first example to create a vector index that can be searched on an ad-hoc basis.  Additionally, I would like to expand on creating additional vector indexes that have sub-topics that can be used for targeted searches.  For example, it might be helpful to have a set of indexes that are purely related to various topics.  Each of these indexes would have some associated text that would be queried first to determine which index(es) were most related to the question.  Then the query would be applied to the sub-indexes to find relevant documents and text.

[Sentence Transformers home page](https://www.sbert.net/)

In [24]:
import faiss
import json
import numpy as np
import torch
from torch import Tensor

import nltk
nltk.download('punkt')  # Download the punkt tokenizer for sentence tokenization

import os
import pprint
import re
import spacy
from sentence_transformers import SentenceTransformer, util
from directives_processor import DirectivesProcessor
from transformers import pipeline

from transformers import BartTokenizer, BartForConditionalGeneration

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hugh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
dp = DirectivesProcessor()

In [3]:
# Load the BART tokenizer
summarization_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# Load the pre-trained BART model for summarization
summarization_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

In [4]:
from transformers import BertTokenizer, BertForQuestionAnswering

# Load the BERT tokenizer
qa_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

# Load the pre-trained BERT model for question answering
qa_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
# Load the documents and their id mappings
documents_mapping = {}

def load_documents_from_folder(folder_path):    
    file_names = os.listdir(folder_path)
    
    for file_name in file_names:
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                document_content = file.read()
                documents_mapping[len(documents_mapping)] = {
                    'file_name': file_name,
                    'content': document_content
                }
    
    return documents_mapping

# Load documents from the 'data' folder
data_folder = 'data'  # Replace 'data' with your folder name
documents_mapping = load_documents_from_folder(data_folder)

# Construct documents array from the values of documents_mapping
documents = [doc_info['content'] for doc_info in documents_mapping.values()]

# Print the document ID to file name mapping
for doc_id, doc_info in documents_mapping.items():
    print(f"Document ID: {doc_id}, File Name: {doc_info['file_name']}")

Document ID: 0, File Name: AI_1.txt
Document ID: 1, File Name: ClimateChange_1.txt
Document ID: 2, File Name: CulturalDiversityAndTraditions_1.txt
Document ID: 3, File Name: FinancialMarkets_1.txt
Document ID: 4, File Name: HistoryAndHistoricalEvents_1.txt
Document ID: 5, File Name: Terrorism_1.txt
Document ID: 6, File Name: WorldHealthIssues_1.txt


In [6]:
# Get the document by doc id:
doc_info = documents_mapping.get(1)
print(doc_info['content'][:100])

What Is Climate Change?
Climate change refers to long-term shifts in temperatures and weather patter


In [7]:
def summarize_text(text, model, tokenizer, max_length=150):
    # Tokenize the input text
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)

    # Generate the summary
    summary_ids = model.generate(inputs, max_length=max_length, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
    
    # Decode the summary tokens into text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# # Summarize one document
# # Example usage to summarize a document
# document_text = documents[0]
# summary = summarize_text(document_text, summarization_model, summarization_tokenizer)
# print("Summary:", summary)

# Summarize each document
for doc_id, doc_info in documents_mapping.items():
    print(f"Document ID: {doc_id}, File Name: {doc_info['file_name']}")
    print('******************************')
    document_text = doc_info['content']
    summary = summarize_text(document_text, summarization_model, summarization_tokenizer)
    print('Summary:\n\t', summary)
    print('******************************\n')

Document ID: 0, File Name: AI_1.txt
******************************
Summary:
	 Artificial intelligence (AI) and machine learning (ML) are closely related. However, these technologies differ in several ways, including scope, applications, and more. Here, we explore how these two innovative concepts are related and what makes them different from each other.
******************************

Document ID: 1, File Name: ClimateChange_1.txt
******************************
Summary:
	 The average temperature of the Earth’s surface is 1.1°C warmer than it was in the late 1800s (before the industrial revolution) The last decade (2011-2020) was the warmest on record, and each of the last four decades has been warmer than any previous decade since 1850.
******************************

Document ID: 2, File Name: CulturalDiversityAndTraditions_1.txt
******************************
Summary:
	 Cultural diversity is the quality of diverse or different cultures, as opposed to monoculture. It has a variety of

In [17]:
# Generate a list of 'questions' from the document directives
directive_embeddings = {}
topic_directives = []
topic_names = dp.get_topic_names()
print('TOPIC NAMES:', topic_names)
topic_titles = dp.get_topic_titles()
print('\nTOPIC TITLES:\n', topic_titles)
print('\nAI DIRECTIVES:\n', dp.get_topic_text('AI'))

TOPIC NAMES: ['AI', 'Climate Change', 'World Health Issues', 'Cultural Diversity']

TOPIC TITLES:
 ['Artificial Intelligence and Machine Learning', 'Documents covering climate change, global warming, environmental impacts, renewable energy, etc.', 'Gather documents related to health topics such as diseases, medical advancements, healthcare policies, etc.', 'Gather information about different cultures, traditions, languages, and societal practices.']

AI DIRECTIVES:
 Artificial Intelligence and Machine Learning
	Global Implementation of AI and ML
		Identify key AI/ML initiatives in various countries.
		Analyze adoption rates and trends of AI/ML technologies.
		Compare and contrast scope and scale of AI/ML applications in different nations.
	Policy and Regulation Variance
		Investigate legal frameworks and policies governing AI/ML in different countries.
		Assess ethical considerations and regulatory disparities.
		Highlight geopolitical implications of AI/ML disparities among nations.
	

In [28]:
index_name = 'doc_index.index'
mapping_name = 'sentence_to_index_mapping.json'
index = None
sentence_to_index_mapping = None

if os.path.exists(index_name) and os.path.exists(mapping_name):
    index = faiss.read_index(index_name)
    print(f"Loaded index from {index_name}")

    with open(mapping_name, 'r') as mapping_file:
        sentence_to_index_mapping = json.load(mapping_file)
    print(f"Loaded sentence_to_index_mapping from {mapping_name}")
else:
    # Populate the index and track the sentence ids and locations
    index = faiss.IndexFlatL2(768)  # Create an index
    # Maintain a mapping between sentence embeddings' index and their original sentences
    sentence_to_index_mapping = {}

    # Load the docs into the index
    for idx, doc in enumerate(documents):
        sentences = nltk.sent_tokenize(doc)
        sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

        print('sentence_embeddings.shape', sentence_embeddings.shape)   

        for sentence_idx, embedding in enumerate(sentence_embeddings):
            # Add sentence embedding to the index
            index.add(np.expand_dims(embedding, axis=0))

            # Track the mapping between sentence index and its embedding index
            sentence_to_index_mapping[len(sentence_to_index_mapping)] = {
                'document_index': idx,
                'sentence_index': sentence_idx,
                'sentence_text': sentences[sentence_idx]  # Save the actual sentence
            }

        print("Length of the index:", index.ntotal)
            
    # Save the index and mapping
    faiss.write_index(index, index_name)
    with open(mapping_name, 'w') as mapping_file:
        json.dump(sentence_to_index_mapping, mapping_file)
    
print("Final length of the index:", index.ntotal)

Loaded index from doc_index.index
Loaded sentence_to_index_mapping from sentence_to_index_mapping.json
Final length of the index: 831


In [22]:
# Test the saving and loading of an index
# index_name = 'doc_index.index'
# faiss.write_index(index, index_name)

# index = None
# print('index:', index, '\tReloading index.')
# index = faiss.read_index(index_name)
# print('index:', index)

index: None 	Reloading index.
index: <faiss.swigfaiss_avx2.IndexFlat; proxy of <Swig Object of type 'faiss::IndexFlat *' at 0x00000198DF943780> >


In [26]:
# Check the first 10 entries
# Slice the first 10 items of the dictionary
first_5_items = {k: sentence_to_index_mapping[k] for k in sorted(sentence_to_index_mapping)[:5]}

# Pretty print the first 5 items
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(first_5_items)

{   '0': {   'document_index': '2d28937a-69ae-4fc4-8543-1b9b5498fab2',
             'sentence_index': 0,
             'sentence_text': 'Artificial intelligence (AI) vs. machine '
                              'learning (ML)\n'
                              'You might hear people use artificial '
                              'intelligence (AI) and machine learning (ML) '
                              'interchangeably, especially when discussing big '
                              'data, predictive analytics, and other digital '
                              'transformation topics.'},
    '1': {   'document_index': '2d28937a-69ae-4fc4-8543-1b9b5498fab2',
             'sentence_index': 1,
             'sentence_text': 'The confusion is understandable as artificial '
                              'intelligence and machine learning are closely '
                              'related.'},
    '10': {   'document_index': '2d28937a-69ae-4fc4-8543-1b9b5498fab2',
              'sentence_index':

In [11]:
# Run a user query against the index and get the top 5
# User query or question
user_query = "How does AI affect global health?"
user_query = "How is global health affected by climate change?"

# Encode user query into an embedding
user_query_embedding = model.encode(user_query, convert_to_tensor=True).numpy()

# Search in FAISS index
k = 5  # Number of most similar prompts to retrieve
D, I = index.search(np.array([user_query_embedding]), k)

# Retrieve sentences and corresponding documents based on valid indices
most_similar_prompts = []
for idx in I[0]:
    if idx in sentence_to_index_mapping:
        mapping = sentence_to_index_mapping[idx]
        doc_idx = mapping['document_index']
        sentence_idx = mapping['sentence_index']
        sentence = mapping['sentence_text']
        most_similar_prompts.append((doc_idx, sentence_idx, sentence))

print('USER QUERY:', user_query)
print("Most similar prompts to the user query:")
for doc_idx, sentence_idx, sentence in most_similar_prompts:
    filename = documents_mapping.get(doc_idx, '').get('file_name')
    print(f"\nDocument Index: {doc_idx}\t Sentence Index: {sentence_idx}\t Filename: {filename}")
    print(f"Sentence:\n******************************\n\t{sentence}")
    print('******************************\n')


USER QUERY: How is global health affected by climate change?
Most similar prompts to the user query:

Document Index: 6	 Sentence Index: 22	 Filename: WorldHealthIssues_1.txt
Sentence:
******************************
	We are now at a point where climate change is clearly with us, and much more attention needs to be put on minimizing the impacts on global health through adaptation or enhancing resilience.
******************************


Document Index: 6	 Sentence Index: 19	 Filename: WorldHealthIssues_1.txt
Sentence:
******************************
	As we know from the pandemic, preparedness is key, and we are far from prepared for the health impacts of a warmer climate.
******************************


Document Index: 6	 Sentence Index: 16	 Filename: WorldHealthIssues_1.txt
Sentence:
******************************
	Impact of climate changeA child looks out over a dried up lake
“Climate change is already affecting the health of millions of people all over the world, and more importantly

In [12]:
def answer_question(question, context, model, tokenizer):
    # Tokenize the input question and context text
    inputs = tokenizer.encode_plus(question, context, return_tensors="pt", max_length=512, truncation=True)

    # Get the model's prediction for the start and end positions
    with torch.no_grad():  # Ensure no gradients are computed
        outputs = model(**inputs)
        start_scores = outputs.start_logits
        end_scores = outputs.end_logits

    # Convert start and end scores to numpy arrays for further processing
    start_scores = start_scores.detach().cpu().numpy()
    end_scores = end_scores.detach().cpu().numpy()

    # Find the tokens with the highest start and end scores
    start_idx = np.argmax(start_scores)
    end_idx = np.argmax(end_scores)

    # Decode the tokens into the answer text
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_idx:end_idx+1]))
    return answer

In [13]:
def process_single_large_document(user_question, faiss_sentences, large_document, qa_model, tokenizer):
    # Concatenate FAISS sentences with the user question to form the context
    faiss_context = ' '.join(faiss_sentences)
    context = user_question + " " + faiss_context

    # Tokenize the context and large document separately
    tokenized_context = tokenizer.encode_plus(context, return_tensors="pt", max_length=512, truncation=True)
    tokenized_document = tokenizer.encode_plus(large_document, return_tensors="pt", max_length=512, truncation=True)

    # Trim or truncate the tokenized document to fit within the remaining space
    max_seq_len = tokenizer.model_max_length - tokenized_context['input_ids'].size(1) - 3  # Account for special tokens [CLS], [SEP], etc.
    input_ids_doc = tokenized_document['input_ids'][:, :max_seq_len]
    attention_mask_doc = tokenized_document['attention_mask'][:, :max_seq_len]

    # Concatenate the tokenized context and truncated document
    input_ids = torch.cat([tokenized_context['input_ids'], input_ids_doc], dim=1)
    attention_mask = torch.cat([tokenized_context['attention_mask'], attention_mask_doc], dim=1)

    # Answer the question based on the combined context and document
    with torch.no_grad():
        outputs = qa_model(input_ids=input_ids, attention_mask=attention_mask)
    
    # Process outputs to get the answer
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    start_idx = torch.argmax(start_scores)
    end_idx = torch.argmax(end_scores)

    # Decode the tokens into the answer text
    answer = tokenizer.decode(input_ids[0][start_idx:end_idx+1], skip_special_tokens=True)
    
    return answer


In [14]:
# most_similar_prompts
# Get the document by doc id:
doc_info = documents_mapping.get(1)
document = doc_info['content']
print(document[:150])

# Extract only the sentences from most_similar_prompts
faiss_sentences = [sentence for _, _, sentence in most_similar_prompts]

print('\nUser  query:\n\t', user_query)
answer = process_single_large_document(user_query, faiss_sentences, document, qa_model, qa_tokenizer)
print("Answer:\n\t", answer)

What Is Climate Change?
Climate change refers to long-term shifts in temperatures and weather patterns. Such shifts can be natural, due to changes in 

User  query:
	 How is global health affected by climate change?
Answer:
	 climate change is already affecting the health of millions of people all over the world
