## GPT Sentence Transformer and FAISS example 1

This notebook appears to be my first successful attempt at building a document search that takes natural-language questions and finds the documents related to them.

In [1]:
import faiss
import numpy as np

import nltk
nltk.download('punkt')  # Download the punkt tokenizer for sentence tokenization

import os
import re
import spacy
from sentence_transformers import SentenceTransformer, util

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hugh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
data_folder = 'data'  # Replace 'data' with your folder name

# Function to clean text
def clean_text(text):
    # Remove everything except word characters (letters, digits, underscore) and whitespace.
    # This also strips sentence-ending punctuation, so the cleaned documents can no longer
    # be split into sentences by the tokenizer used in later cells.
    cleaned_text = re.sub(r"[^\w\s]", "", text)
    return cleaned_text.strip()  # Strip leading/trailing whitespace

# Function to load documents from a folder
def load_documents_from_folder(folder_path):
    documents = []
    file_names = os.listdir(folder_path)
    
    for file_name in file_names:
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                document_content = file.read()
                cleaned_document = clean_text(document_content)
                documents.append(cleaned_document)
    
    return documents

# Load documents from the 'data' folder
documents = load_documents_from_folder(data_folder)

# Print the first few documents as a check
for i in range(min(5, len(documents))):  # Print the first 5 documents
    print(f"**********Document {i + 1}:\n**********")
    print(documents[i])
    print("\n-----------------------\n")


**********Document 1:
**********
Artificial intelligence AI vs machine learning ML
You might hear people use artificial intelligence AI and machine learning ML interchangeably especially when discussing big data predictive analytics and other digital transformation topics The confusion is understandable as artificial intelligence and machine learning are closely related However these trending technologies differ in several ways including scope applications and more  

Increasingly AI and ML products have proliferated as businesses use them to process and analyze immense volumes of data drive better decisionmaking generate recommendations and insights in real time and create accurate forecasts and predictions 

So what exactly is the difference when it comes to ML vs AI how are ML and AI connected and what do these terms mean in practice for organizations today 

Well break down AI vs ML and explore how these two innovative concepts are related and what makes them different from each ot
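
One side effect of `clean_text` is worth spelling out: the pattern `[^\w\s]` deletes full stops, question marks and exclamation marks along with the quotes and symbols, so a sentence tokenizer has nothing left to split on. The sketch below compares the original pattern with a gentler variant that keeps sentence-ending punctuation. `clean_text_keep_sentences` is a hypothetical helper I am introducing for illustration; it is not part of the notebook.

```python
import re

def clean_text(text):
    # Original behaviour: drop everything except word characters and whitespace
    return re.sub(r"[^\w\s]", "", text).strip()

def clean_text_keep_sentences(text):
    # Hypothetical variant: additionally keep . ! and ? so sentences stay separable
    return re.sub(r"[^\w\s.!?]", "", text).strip()

sample = 'AI and ML are "closely related". However, they differ!'
stripped = clean_text(sample)
kept = clean_text_keep_sentences(sample)
print(stripped)
print(kept)
```

Whether to keep commas, hyphens or apostrophes as well is a judgment call; the sentence embedding model is trained on ordinary text, so it may not need this cleaning step at all.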

In [3]:
# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

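# Get one embedding per document. Text beyond the model's maximum sequence length
# is truncated, so only the start of a long document is represented.
document_embeddings = model.encode(documents)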

In [4]:
# Build an exact (brute-force) FAISS index that ranks by squared L2 distance
index = faiss.IndexFlatL2(document_embeddings.shape[1])  # Argument is the embedding dimension (768 here)

# Add document embeddings to the index
index.add(document_embeddings)

print(document_embeddings.shape)
print(document_embeddings.shape[1])

(7, 768)
768
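
`IndexFlatL2` is an exact index: it compares the query against every stored vector and reports squared Euclidean distances. As a rough mental model, the same search can be sketched in plain NumPy. The random unit vectors below stand in for the `(7, 768)` embedding matrix, and `flat_l2_search` and `cosine_from_squared_l2` are hypothetical helpers, not FAISS functions. The conversion to cosine similarity only holds if the embeddings are unit length, which is an assumption worth checking with `np.linalg.norm` on the real embeddings.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    # Squared Euclidean distance between every query and every stored vector
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Indices of the k nearest stored vectors per query, nearest first
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

def cosine_from_squared_l2(sq_dist):
    # Only valid for unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
    return 1.0 - sq_dist / 2.0

rng = np.random.default_rng(0)
stored = rng.normal(size=(7, 768))                        # stand-in for document_embeddings
stored /= np.linalg.norm(stored, axis=1, keepdims=True)   # make every row unit length
queries = stored[[2]]                                     # query identical to stored vector 2

distances, indices = flat_l2_search(stored, queries, k=3)
print(indices[0], distances[0])
print(cosine_from_squared_l2(distances[0]))
```

Under the unit-length assumption, a squared distance of 1.43 corresponds to a cosine similarity of roughly 0.29, which is a fairly weak match.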


In [5]:
# Example query
query = "What is the main topic discussed?"

# Get the query embedding
query_embedding = model.encode([query])[0]

# Search for similar documents
k = 1  # Number of similar documents to retrieve
distances, indices = index.search(np.array([query_embedding]), k)

# Download the 'en_core_web_sm' model
# spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")  # Load SpaCy's English model

# Display top k similar documents
for i in range(k):
    doc = documents[indices[0][i]]
    doc_entities = nlp(doc).ents
    print(f"**********\n\tDocument {i + 1}:\n**********")
#     print(documents[indices[0][i]])
    print("\tDistance:", distances[0][i], "\n**********\n")
    print()
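    print("Entities related to the question in the document:")
    for ent in doc_entities:
        # Note: this asks whether the whole query string occurs inside the entity text,
        # which is essentially never true, so the list printed below stays empty
        if query.lower() in ent.text.lower():
            print(ent.text, "-", ent.label_)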
    print("\n-----------------------\n")


**********
	Document 1:
**********
	Distance: 1.4290388 
**********


Entities related to the question in the document:

-----------------------
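
The entity list is empty because the filter looks for the entire query inside each entity's text. One possible alternative, sketched here with plain strings so it runs without spaCy, is to keep an entity when it shares at least one word with the query. `related_entities` and the sample entity strings are hypothetical; in the notebook the list would come from something like `[ent.text for ent in nlp(doc).ents]`.

```python
def related_entities(query, entity_texts):
    # Words of the query, lower-cased and stripped of surrounding punctuation
    query_words = {w.strip("?.,!").lower() for w in query.split()}
    # Keep an entity when at least one of its words also occurs in the query
    return [
        text for text in entity_texts
        if query_words & {w.lower() for w in text.split()}
    ]

# Hypothetical entity strings standing in for spaCy's output
entities = ["Pew Research Center", "Sept 11 2001", "New York"]
print(related_entities("What did Pew Research find?", entities))
print(related_entities("What is the main topic discussed?", entities))
```

A generic question such as "What is the main topic discussed?" still matches nothing, since it names no entity; this kind of filter only helps when the query mentions something specific.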



In [6]:
# Example: Extract information using simple substring search
# Assume you're looking for a specific keyword in the retrieved documents
# (the `in` test below is a case-sensitive substring match)
keyword = "topic"

for i in range(k):
    if keyword in documents[indices[0][i]]:
        print("**********\n\tDocument containing relevant information:\n**********\n")
        print(documents[indices[0][i]])
        break
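
Because `in` on strings is a case-sensitive substring test, it finds `topic` inside `topics` but misses `Topic`. If whole-word, case-insensitive matching is what is wanted, a small regular expression is one way to get it. `contains_keyword` is a hypothetical helper name.

```python
import re

def contains_keyword(text, keyword):
    # Whole-word, case-insensitive match; re.escape guards against regex metacharacters
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(contains_keyword("The main Topic of this article", "topic"))
print(contains_keyword("other digital transformation topics", "topic"))
```

Whether plural forms should count as a match depends on the use case; stemming or lemmatization would be the next step if they should.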


In [7]:
# Example question
question = "What is the main topic discussed?"

# Process each document to find a sentence that might answer the question
for idx, doc in enumerate(documents):
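    # Tokenize document into sentences. The documents from cell 2 have had their punctuation
    # removed, so each one tends to come back as a single long "sentence" (see the output)
    sentences = nltk.sent_tokenize(doc)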
    
    # Encode the question
    question_embedding = model.encode(question, convert_to_tensor=True)
    
    # Calculate cosine similarity between question and sentences in the document
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    similarity_scores = util.pytorch_cos_sim(question_embedding, sentence_embeddings)
    
    # Find the sentence with the highest similarity score to the question
    most_similar_index = similarity_scores.argmax().item()
    most_similar_sentence = sentences[most_similar_index]
    
    # Print the most similar sentence for demonstration
    print(f"Document {idx + 1}:")
    print("Most similar sentence possibly answering the question:")
    print(most_similar_sentence)
    print("\n-----------------------\n")

Document 1:
Most similar sentence possibly answering the question:
Artificial intelligence AI vs machine learning ML
You might hear people use artificial intelligence AI and machine learning ML interchangeably especially when discussing big data predictive analytics and other digital transformation topics The confusion is understandable as artificial intelligence and machine learning are closely related However these trending technologies differ in several ways including scope applications and more  

Increasingly AI and ML products have proliferated as businesses use them to process and analyze immense volumes of data drive better decisionmaking generate recommendations and insights in real time and create accurate forecasts and predictions 

So what exactly is the difference when it comes to ML vs AI how are ML and AI connected and what do these terms mean in practice for organizations today 

Well break down AI vs ML and explore how these two innovative concepts are related and what

Document 3:
Most similar sentence possibly answering the question:
Cultural diversity

Article
Talk
Read
Edit
View history

Tools
From Wikipedia the free encyclopedia

37th General Assembly of UNESCO in 2013 Paris
Cultural diversity is the quality of diverse or different cultures as opposed to monoculture It has a variety of meanings in different contexts sometimes applying to cultural products like art works in museums or entertainment available online and sometimes applying to the variety of human cultures or traditions in a specific region or in the world as a whole It can also refer to the inclusion of different cultural perspectives in an organization or society

Cultural diversity can be affected by political factors such as censorship or the protection of the rights of artists and by economic factors such as free trade or protectionism in the market for cultural goods Since the middle of the 20th century there has been a concerted international effort to protect cultural diversi

Document 4:
Most similar sentence possibly answering the question:
Financial Markets Role in the Economy Importance Types and Examples
By ADAM HAYES Updated October 19 2023
Reviewed by CIERRA MURRY
Fact checked by KIRSTEN ROHRS SCHMITT
What Are Financial Markets
Financial markets refer broadly to any marketplace where securities trading occurs including the stock market bond market forex market and derivatives market Financial markets are vital to the smooth operation of capitalist economies

KEY TAKEAWAYS
Financial markets refer broadly to any marketplace where the trading of securities occurs
There are many kinds of financial markets including but not limited to forex money stock and bond markets
These markets may include assets or securities that are either listed on regulated exchanges or trade overthecounter OTC
Financial markets trade in all types of securities and are critical to the smooth operation of a capitalist society
When financial markets fail economic disruption includi

Document 5:
Most similar sentence possibly answering the question:
Americans Name the 10 Most Significant Historic Events of Their Lifetimes
911 Obama election and the tech revolution among those with greatest impact on the country

Shared experiences define what it means to be an American The Sept 11 2001 terrorist attacks were such a unifying event for modern Americans Nothing else has come close to being as important or as memorable according to a new survey conducted by Pew Research Center in association with AE Networks HISTORY

Roughly threequarters 76 of the public include the Sept 11 terror attacks as one of the 10 events during their lifetime with the greatest impact on the country according to a national online survey of 2025 adults conducted June 16July 4 2016

The perceived historic importance of the attacks on New York and the Pentagon span virtually every traditional demographic divide Majorities of men and women Millennials and Baby Boomers Americans with college degrees

Document 6:
Most similar sentence possibly answering the question:
Terrorism

Article
Talk
Read
View source
View history

Tools
Page semiprotected
From Wikipedia the free encyclopedia
Terrorist redirects here For other uses see Terrorist disambiguation

This article needs to be updated Please help update this article to reflect recent events or newly available information August 2021

United Airlines Flight 175 hits the South Tower of the World Trade Center during the September 11 attacks of 2001 in New York City
Part of a series on
Terrorism
DefinitionsHistoryIncidents
By ideology
Structure
MethodsTactics
Terrorist groups
Adherents
Response to terrorism
vte
Terrorism in its broadest sense is the use of intentional violence and fear to achieve political or ideological aims The term is used in this regard primarily to refer to intentional violence during peacetime or in the context of war against noncombatants mostly civilians and neutral military personnel1 The terms terrorist and terr

Document 7:
Most similar sentence possibly answering the question:
11 global health issues to watch in 2023 according to IHME experts
Published December 20 2022

Authors

ANNIE CHAN
CONTENT WRITER
As the year 2022 winds down what is next on the horizon for global health We turned to our IHME experts for their takes on the most critical health issues to watch in 2023 Entering our fourth year grappling with COVID19 most of our experts pointed to issues that were impacted in some way by the pandemic like long COVID and mental health They also offered potential interventions to address the threats 

The faculty members and research scientists who shared their insights are professor Mohsen Naghavi assistant professor Hwme Kyu assistant professor Angela Micah affiliate professor Michael Brauer affiliate assistant professor Alize Ferrari lead research scientist Liane Ong lead research scientist Sarah Wulf Hanson postdoctoral scholar Christian Razo postdoctoral scholar Ewerton Cousin and resea

In [8]:
# Example question
question = "What is the main topic discussed?"

# Process each document to find sentences that might answer the question
for idx, doc in enumerate(documents):
    # Encode the whole document as one string: this yields a single 768-dimensional vector,
    # not one vector per sentence, and it rebinds document_embeddings from cell 3
    document_embeddings = model.encode(doc, convert_to_tensor=True)
    
    # Encode the question
    question_embedding = model.encode(question, convert_to_tensor=True)
    
    # Calculate cosine similarity between the question and the document vector (a 1x1 result)
    similarity_scores = util.pytorch_cos_sim(question_embedding, document_embeddings)
    
    # Bug: with a single score, argmax() is always 0, and indexing the string doc with 0
    # returns its first character rather than a sentence
    most_similar_index = similarity_scores.argmax().item()
    most_similar_sentence = doc[most_similar_index]
    
    # Print the most similar sentence for demonstration
    print(f"Document {idx + 1}:")
    print("Most similar sentence possibly answering the question:")
    print(most_similar_sentence)
    print("\n-----------------------\n")

Document 1:
Most similar sentence possibly answering the question:
A

-----------------------

Document 2:
Most similar sentence possibly answering the question:
W

-----------------------

Document 3:
Most similar sentence possibly answering the question:
C

-----------------------

Document 4:
Most similar sentence possibly answering the question:
F

-----------------------

Document 5:
Most similar sentence possibly answering the question:
A

-----------------------

Document 6:
Most similar sentence possibly answering the question:
T

-----------------------

Document 7:
Most similar sentence possibly answering the question:
1

-----------------------
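
The single-letter results can be reproduced without the model. The toy vectors below are made-up stand-ins for the question embedding and the one whole-document embedding; the point is only the shapes and the indexing.

```python
import numpy as np

doc = "Artificial intelligence AI vs machine learning ML"
question_vec = np.array([0.6, 0.8])   # stand-in for the question embedding
doc_vec = np.array([0.8, 0.6])        # stand-in for the single whole-document embedding

# One question vector against one document vector gives a 1x1 similarity matrix
similarity_scores = np.atleast_2d(question_vec @ doc_vec)
most_similar_index = int(similarity_scores.argmax())

# Index 0 into a string is its first character, not its first sentence
print(similarity_scores.shape, most_similar_index, doc[most_similar_index])
```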



In [9]:
from sentence_transformers import SentenceTransformer, util
import os
import nltk
nltk.download('punkt')  # Download the punkt tokenizer for sentence tokenization

# Function to load documents from a folder (no clean_text this time, so punctuation is kept for sentence tokenization)
def load_documents_from_folder(folder_path):
    documents = []
    file_names = os.listdir(folder_path)
    
    for file_name in file_names:
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                document_content = file.read()
                documents.append(document_content)
    
    return documents

# Load documents from the 'data' folder
data_folder = 'data'  # Replace 'data' with your folder name
documents = load_documents_from_folder(data_folder)

# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Example question
question = "What is the main topic discussed?"

# Process each document to find a sentence that might answer the question
for idx, doc in enumerate(documents):
    # Tokenize document into sentences
    sentences = nltk.sent_tokenize(doc)
    
    # Encode the question
    question_embedding = model.encode(question, convert_to_tensor=True)
    
    # Calculate cosine similarity between question and sentences in the document
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    similarity_scores = util.pytorch_cos_sim(question_embedding, sentence_embeddings)
    
    # Find the index of the sentence with the highest similarity score to the question
    most_similar_index = similarity_scores.argmax().item()
    
    # Print the most similar sentence for demonstration
    print(f"Document {idx + 1}:")
    print("Most similar sentence possibly answering the question:")
    print(sentences[most_similar_index])
    print("\n-----------------------\n")



Document 1:
Most similar sentence possibly answering the question:
Efficiency
Increasing operational efficiency and reducing costs.

-----------------------

Document 2:
Most similar sentence possibly answering the question:
It covers current impacts and those likely in the future.

-----------------------

Document 3:
Most similar sentence possibly answering the question:
It emphasises an ongoing process of interaction and dialogue between cultures.

-----------------------

Document 4:
Most similar sentence possibly answering the question:
To give two more concrete examples, we will consider the role of stock markets in bringing a company to IPO and the role of the OTC derivatives market in the 2008-09 financial crisis.

-----------------------

Document 5:
Most similar sentence possibly answering the question:
Survey participants were asked to list the 10 historic events that occurred during their lifetimes that they thought “have had the greatest impact on the country.” Respondents
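
Taking only the `argmax` gives one sentence per document, and some of the winners above are fairly vague. Looking at the top few candidates instead might make it easier to judge the ranking. The following is a minimal NumPy sketch with made-up two-dimensional vectors in place of real embeddings; `top_n_sentences` is a hypothetical helper.

```python
import numpy as np

def top_n_sentences(question_vec, sentence_vecs, sentences, n=3):
    # Cosine similarity is the dot product of L2-normalised vectors
    q = question_vec / np.linalg.norm(question_vec)
    s = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    scores = s @ q
    # Highest scores first, limited to n results
    best = np.argsort(-scores)[:n]
    return [(sentences[i], float(scores[i])) for i in best]

sentences = ["first", "second", "third"]
sentence_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
question_vec = np.array([2.0, 0.1])

for sentence, score in top_n_sentences(question_vec, sentence_vecs, sentences, n=2):
    print(f"{score:.3f}  {sentence}")
```

With real embeddings, `sentence_vecs` would be the output of `model.encode(sentences)` and `question_vec` the encoded question.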

In [13]:
from sentence_transformers import SentenceTransformer, util
import os
import nltk
nltk.download('punkt')  # Download the punkt tokenizer for sentence tokenization

# Function to load documents from a folder
def load_documents_from_folder(folder_path):
    documents = []
    file_names = os.listdir(folder_path)
    
    for file_name in file_names:
        file_path = os.path.join(folder_path, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='utf-8') as file:
                document_content = file.read()
                documents.append(document_content)
    
    return documents

# Load documents from the 'data' folder
data_folder = 'data'  # Replace 'data' with your folder name
documents = load_documents_from_folder(data_folder)

# Load the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Example questions
question_1 = "What is the main topic discussed?"
question_2 = "Is the topic related to health?"
most_related_to_health = {"similarity_score": 0.0, "topic_sentence": "", "doc": ""}

# Process each document to find a sentence that might answer the first question
for idx, doc in enumerate(documents):
    # Tokenize document into sentences
    sentences = nltk.sent_tokenize(doc)
    
    # Encode the first question
    question_1_embedding = model.encode(question_1, convert_to_tensor=True)
    
    # Calculate cosine similarity between the first question and sentences in the document
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    
    
    similarity_scores = util.pytorch_cos_sim(question_1_embedding, sentence_embeddings)
    
    # Find the index of the sentence with the highest similarity score to the first question
    most_similar_index = similarity_scores.argmax().item()
    
    # Extract the most similar sentence
    topic_sentence = sentences[most_similar_index]
    
    # Print the most similar sentence
    print(f"Document {idx + 1}:")
    print("Most similar sentence to the first question:")
    print(topic_sentence)
    
    # Encode the extracted topic sentence
    topic_embedding = model.encode(topic_sentence, convert_to_tensor=True)
    
    # Encode the second question
    question_2_embedding = model.encode(question_2, convert_to_tensor=True)
    
    # Calculate cosine similarity between the second question and the topic sentence
    print('question_2_embedding.shape', question_2_embedding.shape)
    print('topic_embedding.shape', topic_embedding.shape)
    similarity_score = util.pytorch_cos_sim(question_2_embedding, topic_embedding).item()
    
    # Determine if the topic is related to health based on a similarity threshold
    print(question_2, similarity_score, end='')
    if similarity_score > 0.5:  # Set your desired similarity threshold
        print("\tYes\n-----------------------\n")
    else:
        print("\tNo\n-----------------------\n")
    
    # Keep track of the document whose topic sentence is most similar to the second question
    if similarity_score > most_related_to_health["similarity_score"]:
        most_related_to_health["similarity_score"] = similarity_score
        most_related_to_health["topic_sentence"] = topic_sentence
        most_related_to_health["doc"] = doc

# Print the most related topic to health after the loop finishes
print("Most related topic to health:")
# print("Document:", most_related_to_health["doc"])
print("Most related topic sentence:", most_related_to_health["topic_sentence"])
print("Similarity score:", most_related_to_health["similarity_score"])



Document 1:
Most similar sentence to the first question:
Efficiency
Increasing operational efficiency and reducing costs.
question_2_embedding.shape torch.Size([768])
topic_embedding.shape torch.Size([768])
Is the topic related to health? 0.24417605996131897	No
-----------------------

Document 2:
Most similar sentence to the first question:
It covers current impacts and those likely in the future.
question_2_embedding.shape torch.Size([768])
topic_embedding.shape torch.Size([768])
Is the topic related to health? 0.3013697862625122	No
-----------------------

Document 3:
Most similar sentence to the first question:
It emphasises an ongoing process of interaction and dialogue between cultures.
question_2_embedding.shape torch.Size([768])
topic_embedding.shape torch.Size([768])
Is the topic related to health? 0.3993353545665741	No
-----------------------

Document 4:
Most similar sentence to the first question:
To give two more concrete examples, we will consider the role of stock market
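
The bookkeeping at the end of the loop can also be done after the fact by collecting one result per document and then ranking. The sketch below uses the three scores visible in the output above; the `results` structure is my own assumption, not something the notebook builds. It also avoids starting the running maximum at `0.0`, which would skip every document if all cosine scores happened to be negative.

```python
# Scores copied from the output above for documents 1 to 3
results = [
    {"doc": 1, "similarity_score": 0.24417605996131897},
    {"doc": 2, "similarity_score": 0.3013697862625122},
    {"doc": 3, "similarity_score": 0.3993353545665741},
]
THRESHOLD = 0.5  # same cut-off as in the cell above

# Best match overall, regardless of the threshold
best = max(results, key=lambda r: r["similarity_score"])
# Documents that actually clear the threshold
related = [r["doc"] for r in results if r["similarity_score"] > THRESHOLD]

print("Best match: document", best["doc"], "score", best["similarity_score"])
print("Documents above threshold:", related)
```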

In [11]:
print(document_embeddings.shape)
# print(document_embeddings.shape[1])  # raises the IndexError below: cell 8 rebound document_embeddings to a 1-D tensor

torch.Size([768])


IndexError: tuple index out of range
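
The error comes from asking a one-element shape tuple for index 1. Encoding a single string gives a 1-D result of shape `(768,)`, whereas encoding a list gives `(n, 768)`. Reading the last axis works for both cases, as this small NumPy sketch shows; `embedding_dim` is a hypothetical helper.

```python
import numpy as np

single = np.zeros(768)        # like one encoded string: shape (768,)
batch = np.zeros((7, 768))    # like seven encoded documents: shape (7, 768)

def embedding_dim(embeddings):
    # The last axis is the embedding dimension for both 1-D and 2-D inputs
    return embeddings.shape[-1]

print(embedding_dim(single), embedding_dim(batch))
```

Using a different variable name inside the loop of cell 8 would also have kept the original `(7, 768)` array intact.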