## GPT Sentence Transformer and FAISS example 2

Now we want to expand on the first example to create a vector index that can be searched on an ad-hoc basis. I would also like to create additional vector indexes, one per sub-topic, that can be used for targeted searches. For example, it might be helpful to have a set of indexes that each cover a single topic. Each index would have some associated descriptive text, which would be queried first to determine which index(es) are most related to the question. The query would then be applied to those sub-indexes to find relevant documents and text.
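The two-stage lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: a toy bag-of-words `embed` function stands in for `model.encode`, plain Python lists stand in for the FAISS sub-indexes, and the topic descriptions, documents, and the `route_and_search` helper are all hypothetical names made up for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for model.encode)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical descriptive text attached to each topic index
topic_descriptions = {
    "AI": "artificial intelligence and machine learning",
    "Climate Change": "climate change global warming renewable energy",
}

# Hypothetical per-topic document stores (these would be the FAISS sub-indexes)
topic_documents = {
    "AI": [
        "Machine learning is a subset of artificial intelligence.",
        "Neural networks learn patterns from data.",
    ],
    "Climate Change": [
        "Global warming is raising sea levels.",
        "Renewable energy reduces carbon emissions.",
    ],
}

def route_and_search(query, top_topics=1, top_docs=1):
    """Stage 1: rank topics by their description. Stage 2: search only the chosen topics."""
    q = embed(query)
    ranked_topics = sorted(
        topic_descriptions,
        key=lambda name: cosine(q, embed(topic_descriptions[name])),
        reverse=True,
    )
    results = []
    for name in ranked_topics[:top_topics]:
        ranked_docs = sorted(topic_documents[name], key=lambda d: cosine(q, embed(d)), reverse=True)
        results.extend((name, doc) for doc in ranked_docs[:top_docs])
    return results

hits = route_and_search("How does machine learning relate to artificial intelligence?")
print(hits)
```

In the real notebook, `embed` would be replaced by the SentenceTransformer model and each entry of `topic_documents` by its own FAISS index; the routing logic itself would stay the same.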

In [1]:
import faiss
import numpy as np
from torch import Tensor

import nltk
nltk.download('punkt')  # Download the punkt tokenizer for sentence tokenization

import os
import re
import spacy
from sentence_transformers import SentenceTransformer, util
from directives_processor import DirectivesProcessor

%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hugh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
dp = DirectivesProcessor()


In [3]:
doc_id_dict = {}

# Function to load documents from a folder
def load_documents_from_folder(folder_path):
    documents = []
    file_names = os.listdir(folder_path)
    
    for file_name in file_names:
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                document_content = file.read()
                documents.append(document_content)
    
    return documents

# Load documents from the 'data' folder
data_folder = 'data'  # Replace 'data' with your folder name
documents = load_documents_from_folder(data_folder)

'''
# WORKING BACKUP...
document_list = []
index = faiss.IndexFlatL2(768)  # Create an index

# Load the docs into the index
for doc in documents:
    sentences = nltk.sent_tokenize(doc)
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    document_list.append((doc, sentence_embeddings))
    index.add(sentence_embeddings.numpy())

'''
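`doc_id_dict` is declared at the top of this cell but never filled in. One plausible use (an assumption on my part, not something the notebook confirms) is to map each FAISS row number back to the file it came from, since a FAISS search returns integer row positions rather than documents. A minimal sketch, with the hypothetical helper name `load_documents_with_ids`:

```python
import os
import tempfile

def load_documents_with_ids(folder_path):
    """Load every file in folder_path and record which row each file will occupy."""
    documents = []
    doc_id_dict = {}
    # Sort the names so the row -> file mapping is reproducible between runs
    for file_name in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, "r", encoding="utf-8") as file:
                doc_id_dict[len(documents)] = file_name  # row number -> file name
                documents.append(file.read())
    return documents, doc_id_dict

# Demonstration on a throwaway folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as folder:
    for name, text in [("b.txt", "second"), ("a.txt", "first")]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as file:
            file.write(text)
    documents, doc_id_dict = load_documents_with_ids(folder)

print(doc_id_dict)
```

This mapping only lines up with the index when one vector is added per document. If sentences are added individually, the dictionary would need one entry per sentence instead.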


In [9]:
# Generate a list of 'questions' from the document directives
directive_embeddings = {}
topic_directives = []
topic_names = dp.get_topic_names()
print('TOPIC NAMES:', topic_names)
topic_titles = dp.get_topic_titles()
print('TOPIC TITLES:', topic_titles)


# for name in topic_names:
#     topic_directive = dp.get_topic_text(name)
# #     print('\n\n', name, 'DIRECTIVES:\n\t', topic_directive)
#     _embedding = model.encode(topic_directive, convert_to_tensor=True)
    
#     # Convert tensor to numpy array before storing
#     tensor_as_numpy = _embedding.detach().numpy()
#     directive_embeddings[name] = tensor_as_numpy
# #     directive_embeddings[name] = _embedding
# #     print('\n\n\n', directive_embeddings[name])

# # print('\n\ntopic_directives:', topic_directives)

TOPIC NAMES: ['AI', 'Climate Change', 'World Health Issues', 'Cultural Diversity']
TOPIC TITLES: ['Artificial Intelligence and Machine Learning', 'Documents covering climate change, global warming, environmental impacts, renewable energy, etc.', 'Gather documents related to health topics such as diseases, medical advancements, healthcare policies, etc.', 'Gather information about different cultures, traditions, languages, and societal practices.']


In [17]:

# Example questions
question_1 = "What is the main topic discussed?"
question_2 = "Is the topic related to health?"
most_related_to_health = {"similarity_score":0.0, "topic_sentence": "", "doc": ""}

# Process each document to find a sentence that might answer the first question
for idx, doc in enumerate(documents):
    # Tokenize document into sentences
    sentences = nltk.sent_tokenize(doc)
    print('sentences', sentences)
    
    # Encode the first question
    question_1_embedding = model.encode(question_1, convert_to_tensor=True)
    
    # Calculate cosine similarity between the first question and sentences in the document
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    print('question_1_embedding.shape', question_1_embedding.shape)
    print('sentence_embeddings.shape', sentence_embeddings.shape)    
    
    similarity_scores = util.pytorch_cos_sim(question_1_embedding, sentence_embeddings)
    print('similarity_scores', similarity_scores)
    
    # Find the index of the sentence with the highest similarity score to the first question
    most_similar_index = similarity_scores.argmax().item()
    
    # Extract the most similar sentence
    topic_sentence = sentences[most_similar_index]
    
    # Print the most similar sentence
    print(f"Document {idx + 1}:")
    print("Most similar sentence to the first question:")
    print(topic_sentence)
    
    # Encode the extracted topic sentence
    topic_embedding = model.encode(topic_sentence, convert_to_tensor=True)
    
    # Encode the second question
    question_2_embedding = model.encode(question_2, convert_to_tensor=True)
    
    # Calculate cosine similarity between the second question and the topic sentence
    print('question_2_embedding.shape', question_2_embedding.shape)
    print('topic_embedding.shape', topic_embedding.shape)
    similarity_score = util.pytorch_cos_sim(question_2_embedding, topic_embedding).item()
    print('similarity_score', similarity_score)
    
    # Determine if the topic is related to health based on a similarity threshold
    print(question_2, similarity_score, end='')
    if similarity_score > 0.5:  # Set your desired similarity threshold
        print("\tYes\n-----------------------\n")
    else:
        print("\tNo\n-----------------------\n")
    
    # Keep track of the document whose topic sentence scores highest against the second question
    if similarity_score > most_related_to_health["similarity_score"]:
        most_related_to_health["similarity_score"] = similarity_score
        most_related_to_health["topic_sentence"] = topic_sentence
        most_related_to_health["doc"] = doc

# Print the most related topic to health after the loop finishes
print("Most related topic to health:")
# print("Document:", most_related_to_health["doc"])
print("Most related topic sentence:", most_related_to_health["topic_sentence"])
print("Similarity score:", most_related_to_health["similarity_score"])

sentences ['Artificial intelligence (AI) vs. machine learning (ML)\nYou might hear people use artificial intelligence (AI) and machine learning (ML) interchangeably, especially when discussing big data, predictive analytics, and other digital transformation topics.', 'The confusion is understandable as artificial intelligence and machine learning are closely related.', 'However, these trending technologies differ in several ways, including scope, applications, and more.', 'Increasingly AI and ML products have proliferated as businesses use them to process and analyze immense volumes of data, drive better decision-making, generate recommendations and insights in real time, and create accurate forecasts and predictions.', 'So, what exactly is the difference when it comes to ML vs. AI, how are ML and AI connected, and what do these terms mean in practice for organizations today?', 'We’ll break down AI vs. ML and explore how these two innovative concepts are related and what makes them dif

question_1_embedding.shape torch.Size([768])
sentence_embeddings.shape torch.Size([43, 768])
similarity_scores tensor([[ 0.2419,  0.1238, -0.0184,  0.1170,  0.0981,  0.0886,  0.0005,  0.0530,
          0.0544,  0.1463,  0.0595,  0.1005,  0.1935,  0.0992,  0.0491,  0.0896,
          0.1184,  0.0736,  0.0964,  0.1519,  0.1932,  0.0940,  0.1508,  0.0449,
          0.0853,  0.0872,  0.1173,  0.0912,  0.1784,  0.2955,  0.2792,  0.1252,
          0.0486,  0.0878,  0.0769,  0.2088,  0.3176,  0.1409,  0.0938,  0.0278,
          0.1041, -0.0141,  0.2017]])
Document 2:
Most similar sentence to the first question:
It covers current impacts and those likely in the future.
question_2_embedding.shape torch.Size([768])
topic_embedding.shape torch.Size([768])
similarity_score 0.3013697862625122
Is the topic related to health? 0.3013697862625122	No
-----------------------

sentences ['Cultural diversity\n\nArticle\nTalk\nRead\nEdit\nView history\n\nTools\nFrom Wikipedia, the free encyclopedia\n\n37th G

question_1_embedding.shape torch.Size([768])
sentence_embeddings.shape torch.Size([110, 768])
similarity_scores tensor([[ 0.2440,  0.2679,  0.3744,  0.2583,  0.2470,  0.3532,  0.2936,  0.3166,
          0.3345,  0.3703,  0.2422,  0.2261,  0.2640,  0.3041,  0.4398,  0.5714,
          0.2589,  0.2159,  0.2450,  0.2693,  0.2685,  0.1460,  0.2070,  0.1370,
          0.1832,  0.0979,  0.1369,  0.0437,  0.1845,  0.2747,  0.2615,  0.1992,
          0.2736,  0.2625,  0.2221,  0.0693,  0.2410,  0.0235,  0.1860,  0.3852,
          0.3846,  0.1657,  0.2663,  0.2843,  0.3213,  0.3423,  0.3107,  0.2344,
          0.3037,  0.1119,  0.3319,  0.3248,  0.3286,  0.3011,  0.2627,  0.3956,
          0.1172,  0.2289,  0.2857,  0.1314,  0.2586,  0.2158,  0.2716,  0.2272,
          0.2261,  0.2816,  0.1938,  0.1087,  0.1670,  0.1456,  0.0501,  0.1327,
         -0.0016,  0.2255,  0.1018,  0.0812,  0.1126,  0.1755,  0.1626, -0.0865,
          0.0959,  0.1910,  0.1507,  0.2403,  0.1919,  0.2186,  0.1224,  0.203

question_1_embedding.shape torch.Size([768])
sentence_embeddings.shape torch.Size([101, 768])
similarity_scores tensor([[0.2483, 0.2021, 0.2080, 0.2548, 0.1855, 0.1501, 0.2539, 0.1727, 0.2475,
         0.1162, 0.2391, 0.1727, 0.2014, 0.1874, 0.1302, 0.1887, 0.2006, 0.1427,
         0.1589, 0.2095, 0.2024, 0.2765, 0.2030, 0.1629, 0.0613, 0.1558, 0.2258,
         0.0472, 0.0965, 0.0265, 0.1100, 0.0497, 0.2171, 0.1948, 0.1460, 0.1957,
         0.1614, 0.1603, 0.1717, 0.1475, 0.0793, 0.1959, 0.1443, 0.1484, 0.0777,
         0.0865, 0.1333, 0.1714, 0.0938, 0.1525, 0.1164, 0.1564, 0.2612, 0.1385,
         0.1282, 0.1208, 0.1165, 0.0964, 0.0130, 0.0364, 0.0290, 0.1101, 0.2423,
         0.3193, 0.1933, 0.1210, 0.0554, 0.0639, 0.1411, 0.1442, 0.1025, 0.1480,
         0.1539, 0.1302, 0.1746, 0.0891, 0.1258, 0.1247, 0.0321, 0.0749, 0.0919,
         0.0731, 0.0187, 0.0618, 0.0397, 0.0631, 0.2610, 0.1859, 0.1738, 0.1819,
         0.1891, 0.0969, 0.3061, 0.2008, 0.2106, 0.2280, 0.1744, 0.2312, 0.214

question_1_embedding.shape torch.Size([768])
sentence_embeddings.shape torch.Size([134, 768])
similarity_scores tensor([[ 0.3140,  0.2067,  0.2591,  0.2238,  0.2491,  0.2447,  0.3065,  0.1979,
          0.2897,  0.1699,  0.2125,  0.2204,  0.1014,  0.2760,  0.3080,  0.3335,
          0.3968,  0.3297,  0.2307,  0.2192,  0.2789,  0.1609,  0.1469,  0.2268,
          0.2914,  0.2596,  0.2236,  0.2077,  0.2902,  0.3707,  0.3843,  0.2852,
          0.1475,  0.1334,  0.2887,  0.2437,  0.2041,  0.2128,  0.2264,  0.2858,
          0.0721,  0.2045,  0.1648,  0.2224,  0.1934,  0.2998,  0.2325,  0.1904,
          0.2631,  0.1428,  0.2793,  0.1652,  0.2000,  0.1647,  0.2190,  0.1763,
          0.1835,  0.1415,  0.3181,  0.2319,  0.1896,  0.0811,  0.2104,  0.2097,
          0.1714,  0.2048,  0.2298,  0.1831,  0.2802,  0.2276,  0.2114,  0.0696,
          0.1617,  0.0727,  0.1243,  0.1945,  0.1463,  0.2147,  0.1341,  0.0676,
          0.1577,  0.1766,  0.1643,  0.2555,  0.2832,  0.3139,  0.2507,  0.175

question_1_embedding.shape torch.Size([768])
sentence_embeddings.shape torch.Size([338, 768])
similarity_scores tensor([[ 0.1701, -0.0137,  0.2242,  0.2548,  0.1073,  0.1245,  0.1265,  0.0575,
          0.1433,  0.1098,  0.2282,  0.1163,  0.1207,  0.2149,  0.2417,  0.0889,
          0.0122,  0.1464,  0.1672,  0.0108,  0.0027,  0.0415,  0.1332,  0.0321,
          0.1618,  0.1455,  0.1378,  0.1928,  0.1810,  0.1912,  0.2183,  0.1726,
          0.2694,  0.1788,  0.2297,  0.1822,  0.2386,  0.1468,  0.2135,  0.2666,
          0.1008,  0.1863,  0.0744,  0.0770,  0.1348,  0.0492,  0.1351,  0.1353,
          0.1414,  0.0467,  0.1357,  0.3063,  0.0183,  0.1951,  0.2512,  0.1341,
          0.2133,  0.1692,  0.2456,  0.2107,  0.2486,  0.0575,  0.2052,  0.0461,
          0.1107,  0.1324,  0.2391,  0.1732,  0.2123,  0.1187,  0.1657,  0.1393,
          0.0789,  0.0687,  0.1776,  0.0773,  0.2272,  0.1098,  0.1276,  0.2229,
          0.1180,  0.1663,  0.1768,  0.1957,  0.1564,  0.0811,  0.1312, -0.015

question_1_embedding.shape torch.Size([768])
sentence_embeddings.shape torch.Size([63, 768])
similarity_scores tensor([[0.2504, 0.2694, 0.3299, 0.2234, 0.2923, 0.2179, 0.1855, 0.1235, 0.1184,
         0.0788, 0.0554, 0.0942, 0.1918, 0.1722, 0.1831, 0.2448, 0.1314, 0.1457,
         0.1361, 0.1194, 0.2838, 0.0543, 0.1180, 0.3377, 0.1234, 0.0812, 0.0748,
         0.2405, 0.1913, 0.1950, 0.1858, 0.0624, 0.1089, 0.0486, 0.1566, 0.1305,
         0.0246, 0.1603, 0.2128, 0.1321, 0.2138, 0.1467, 0.2011, 0.1470, 0.1776,
         0.1123, 0.1136, 0.1977, 0.2550, 0.2574, 0.2206, 0.1670, 0.1072, 0.1637,
         0.1475, 0.2449, 0.1250, 0.2742, 0.1840, 0.1872, 0.1316, 0.1053, 0.2287]])
Document 7:
Most similar sentence to the first question:
“One aspect of this is improving overall health and enhancing socioeconomic development because we know that those who are more vulnerable will suffer the most.
question_2_embedding.shape torch.Size([768])
topic_embedding.shape torch.Size([768])
similarity_score 

In [22]:
document_list = []
index = faiss.IndexFlatL2(768)  # Create an index

# Load the docs into the index
for idx, doc in enumerate(documents):
    print(f"Document {idx + 1}:\t", doc.replace('\n', ' ')[:50])
#     sentences = nltk.sent_tokenize(doc)
#     doc_embeddings = model.encode(sentences, convert_to_tensor=True)
    doc_embeddings = model.encode(doc, convert_to_tensor=True)
    
    # Use a separate loop variable so the outer document idx is not overwritten
    for topic_idx, title in enumerate(topic_titles):
#         # Tokenize document into sentences
#         sentences = nltk.sent_tokenize(doc)
#         sentence_embeddings = model.encode(sentences, convert_to_tensor=True)

        # Encode the question
#         question_embedding = model.encode(question, convert_to_tensor=True)        
        
        topic_question = "Is the document related to " + title
        # topic_names and topic_titles are assumed to be in the same order
        topic_name = topic_names[topic_idx]
        print('\tIs the document related to ', topic_name)
        topic_embedding = model.encode(topic_question, convert_to_tensor=True)
#         print('\t\tdoc_embeddings.shape', doc_embeddings.shape)
#         print('\t\ttopic_embedding.shape', topic_embedding.shape)
        
        # Calculate cosine similarity between the document embedding and the topic_embedding
        similarity_score = util.pytorch_cos_sim(topic_embedding, doc_embeddings).item()
        print(f'\t\t{topic_name}: ', similarity_score, end='')
        if similarity_score > 0.4:  # Set your desired similarity threshold
            print("\tYes")
        else:
            print("\tNo")
    
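    # Store this document's own embedding (sentence_embeddings was a stale variable from an earlier cell)
    document_list.append((doc, doc_embeddings))
    # FAISS expects a 2-D float32 array of shape (n_vectors, 768), so reshape the single vector
    index.add(doc_embeddings.cpu().numpy().reshape(1, -1))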


Document 1:	 Artificial intelligence (AI) vs. machine learning 
	Is the document related to  AI
		A:  0.4988892674446106	Yes
	Is the document related to  Climate Change
		A:  0.14186131954193115	No
	Is the document related to  World Health Issues
		A:  0.12791411578655243	No
	Is the document related to  Cultural Diversity
		A:  0.1643749624490738	No
Document 2:	 What Is Climate Change? Climate change refers to l
	Is the document related to  AI
		A:  0.1305544078350067	No
	Is the document related to  Climate Change
		A:  0.426655650138855	Yes
	Is the document related to  World Health Issues
		A:  0.09304580837488174	No
	Is the document related to  Cultural Diversity
		A:  0.15576374530792236	No
Document 3:	 Cultural diversity  Article Talk Read Edit View hi
	Is the document related to  AI
		A:  0.12078669667243958	No
	Is the document related to  Climate Change
		A:  0.18997083604335785	No
	Is the document related to  World Health Issues
		A:  0.1784690022468567	No
	Is the document relat
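The cell above scores topics with cosine similarity but stores the vectors in `faiss.IndexFlatL2`, which, as far as I know, reports squared L2 distances. The two measures rank results identically only when the vectors are unit length, because then the squared distance equals 2 - 2·cos. The sketch below checks that identity on made-up vectors in plain Python; it does not call FAISS or the model.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors standing in for a query embedding and three document embeddings
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [0.1, -1.0, 3.0], [2.0, 0.5, 0.5])]

rows = [(dot(query, d), squared_l2(query, d)) for d in docs]
for cos, dist in rows:
    print(f"cosine={cos:.4f}  squared_l2={dist:.4f}  2-2*cos={2 - 2 * cos:.4f}")

# Highest cosine first and smallest distance first should give the same order
by_cosine = sorted(range(len(docs)), key=lambda i: -rows[i][0])
by_distance = sorted(range(len(docs)), key=lambda i: rows[i][1])
print(by_cosine, by_distance)
```

If the embeddings are not normalized, one option is to normalize them before calling `index.add`, or to use an inner-product index such as `faiss.IndexFlatIP` on normalized vectors so that the returned scores are cosine similarities directly.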

In [None]:

question_1 = "What is the main topic discussed?"
question_2 = "Is the topic related to health?"
most_related_to_health = {"similarity_score":0.0, "topic_sentence": "", "doc": ""}

# Encode the first question
question_1_embedding = model.encode(question_1, convert_to_tensor=True)

# Encode the second question
question_2_embedding = model.encode(question_2, convert_to_tensor=True)

for doc in document_list:
    

In [12]:

'''

question_1 = "What is the main topic discussed?"
question_2 = "Is the topic related to health?"
most_related_to_health = {"similarity_score":0.0, "topic_sentence": "", "doc": ""}

# Process each document to find a sentence that might answer the first question
for idx, doc in enumerate(documents):
    # Tokenize document into sentences
    sentences = nltk.sent_tokenize(doc)
    
    # Encode the first question
    question_1_embedding = model.encode(question_1, convert_to_tensor=True)
    
    # Calculate cosine similarity between the first question and sentences in the document
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    similarity_scores = util.pytorch_cos_sim(question_1_embedding, sentence_embeddings)
    
    # Find the index of the sentence with the highest similarity score to the first question
    most_similar_index = similarity_scores.argmax().item()
    
    # Extract the most similar sentence
    topic_sentence = sentences[most_similar_index]
    
    # Print the most similar sentence
    print(f"Document {idx + 1}:")
    print("Most similar sentence to the first question:")
    print(topic_sentence)
    
    # Encode the extracted topic sentence
    topic_embedding = model.encode(topic_sentence, convert_to_tensor=True)
    
    # Encode the second question
    question_2_embedding = model.encode(question_2, convert_to_tensor=True)
    
    # Calculate cosine similarity between the second question and the topic sentence
    similarity_score = util.pytorch_cos_sim(question_2_embedding, topic_embedding).item()
    
    # Determine if the topic is related to health based on a similarity threshold
    print(question_2, similarity_score, end='')
    if similarity_score > 0.5:  # Set your desired similarity threshold
        print("\tYes\n-----------------------\n")
    else:
        print("\tNo\n-----------------------\n")
    
    # Determine if the topic is related to health based on a similarity threshold
    if similarity_score > most_related_to_health["similarity_score"]:
        most_related_to_health["similarity_score"] = similarity_score
        most_related_to_health["topic_sentence"] = topic_sentence
        most_related_to_health["doc"] = doc

Document 1:
Most similar sentence to the first question:
Efficiency
Increasing operational efficiency and reducing costs.
Is the topic related to health? 0.24417605996131897	No


# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Get embeddings for documents
document_embeddings = model.encode(documents)

# Build FAISS index
index = faiss.IndexFlatL2(document_embeddings.shape[1])  # Create an index

# Add document embeddings to the index
index.add(document_embeddings)

# Process each document to find a sentence that might answer the question
for idx, doc in enumerate(documents):
    # Tokenize document into sentences
    sentences = nltk.sent_tokenize(doc)
    
    # Encode the question
    question_embedding = model.encode(question, convert_to_tensor=True)
    
    # Calculate cosine similarity between question and sentences in the document
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    similarity_scores = util.pytorch_cos_sim(question_embedding, sentence_embeddings)
    
    # Find the index of the sentence with the highest similarity score to the question
    most_similar_index = similarity_scores.argmax().item()
'''


# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Example question
question = "What is the main topic discussed?"

# Process each document to find a sentence that might answer the question
for idx, doc in enumerate(documents):
    # Tokenize document into sentences
    sentences = nltk.sent_tokenize(doc)
    
    # Encode the question
    question_embedding = model.encode(question, convert_to_tensor=True)
    
    # Calculate cosine similarity between question and sentences in the document
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    print('sentence_embeddings.shape', sentence_embeddings.shape)
    print('question_embedding.shape', question_embedding.shape)
    similarity_scores = util.pytorch_cos_sim(question_embedding, sentence_embeddings)
    
    # Find the index of the sentence with the highest similarity score to the question
    most_similar_index = similarity_scores.argmax().item()
    
    # Print the most similar sentence for demonstration
    print(f"Document {idx + 1}:")
    print("Most similar sentence possibly answering the question:")
    print(sentences[most_similar_index])
    print("\n-----------------------\n")

sentence_embeddings.shape torch.Size([42, 768])
question_embedding.shape torch.Size([768])
Document 1:
Most similar sentence possibly answering the question:
Efficiency
Increasing operational efficiency and reducing costs.

-----------------------

sentence_embeddings.shape torch.Size([43, 768])
question_embedding.shape torch.Size([768])
Document 2:
Most similar sentence possibly answering the question:
It covers current impacts and those likely in the future.

-----------------------

sentence_embeddings.shape torch.Size([110, 768])
question_embedding.shape torch.Size([768])
Document 3:
Most similar sentence possibly answering the question:
It emphasises an ongoing process of interaction and dialogue between cultures.

-----------------------

sentence_embeddings.shape torch.Size([101, 768])
question_embedding.shape torch.Size([768])
Document 4:
Most similar sentence possibly answering the question:
To give two more concrete examples, we will consider the role of stock markets in brin