# Instructions: 
- Use a Wikipedia dataset in 1-2 languages, with no more than 2,000 rows
- Convert the data to vectors with an embedding model
- Use t-SNE to show how many groups the converted data forms
- Try building a search engine that does not need query words to match the text exactly

# Find a dataset

Download it here

In [3]:
from datasets import load_dataset

data = load_dataset("csv", data_files="./bible_data_set.csv")

# Convert the dataset to embedding vectors using AzureOpenAIEmbeddings

In [4]:
# Choose a sample size for the test so only a portion of the data is vectorized - all of it would be too much
sample_size = 2000
sample = data["train"].select(range(sample_size))

In [5]:
print(sample[0])

{'citation': 'Genesis 1:1', 'book': 'Genesis', 'chapter': 1, 'verse': 1, 'text': 'In the beginning God created the heaven and the earth. \r\n'}


Preprocess the data to get it ready for embedding - clean the samples

- Remove punctuation and other symbols from the samples.
- Lemmatize: reduce words to their base or root form (e.g., "running" to "run") so that different forms of a word are treated as a single word.
- Remove stop words.

In [6]:
# Load stop words to remove
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
punctuation = set(string.punctuation)
# Create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()

processedData = []

for row in sample:
    # Get all the text in lowercase
    text = row["text"].lower()
    # Remove punctuation, then split on any whitespace (this also drops the trailing "\r\n")
    words = "".join(char for char in text if char not in punctuation).split()
    # Remove stop words, lemmatize the rest, and join everything back into a single string
    processed = " ".join(lemmatizer.lemmatize(word) for word in words if word not in stop_words)
    processedData.append(processed)
        


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\IQ\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\IQ\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
print(sample[0]["text"])
print(processedData[0])

In the beginning God created the heaven and the earth. 

beginning god created heaven earth 


In [None]:
from langchain_openai import AzureOpenAIEmbeddings
from dotenv import dotenv_values

secrets = dotenv_values("../.env")


# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(deployment=secrets["OPENAI_EMBEDDINGS_DEPLOYMENT"])

# Function to embed text
def embed_text(text):
    return embeddings.embed_query(text)

The next code segment has been updated but not run yet.

It now stores the embedded sample as a dictionary keyed by citation, since the previous version kept no reference back to the text.

In [None]:
# Embedding with Azure OpenAI
embedded_sample = {sample[i]["citation"]:embed_text(processedData[i]) for i in range(sample_size)}

print("Finished embedding")

# Save the file (the with-block closes it so the JSON is fully written to disk)
import json
with open("embedded_samples.json", "w") as file:
    json.dump(embedded_sample, file)

Finished embedding


# Show the data in the form of t-SNE

Use scikit-learn's t-SNE to show the data

- First reduce the number of dimensions to a reasonable amount - use PCA for dense data

In [None]:
import numpy
import json

dataVector = json.load(open("./embedded_samples.json"))

textVector = numpy.array(list(dataVector.values()))

vectorKeys = list(dataVector.keys())

print(textVector)

print(vectorKeys)

[[ 0.03563479 -0.00134475  0.03102818 ... -0.02081834  0.00409446
  -0.02684237]
 [ 0.01922061  0.02242618  0.00142568 ...  0.00186083  0.02902965
  -0.01797684]
 [ 0.02315761  0.0066396  -0.03868085 ... -0.00501729  0.00435796
   0.04194281]
 ...
 [ 0.02194769  0.00367686  0.00983882 ... -0.01595001  0.00959987
  -0.01832757]
 [ 0.01146847 -0.00915735  0.00148945 ... -0.02126457  0.0087683
   0.00010325]
 [ 0.02130495  0.00456814  0.01780402 ...  0.01542234 -0.01866298
  -0.01835063]]
['Genesis 1:1', 'Genesis 1:2', 'Genesis 1:3', 'Genesis 1:4', 'Genesis 1:5', 'Genesis 1:6', 'Genesis 1:7', 'Genesis 1:8', 'Genesis 1:9', 'Genesis 1:10', 'Genesis 1:11', 'Genesis 1:12', 'Genesis 1:13', 'Genesis 1:14', 'Genesis 1:15', 'Genesis 1:16', 'Genesis 1:17', 'Genesis 1:18', 'Genesis 1:19', 'Genesis 1:20', 'Genesis 1:21', 'Genesis 1:22', 'Genesis 1:23', 'Genesis 1:24', 'Genesis 1:25', 'Genesis 1:26', 'Genesis 1:27', 'Genesis 1:28', 'Genesis 1:29', 'Genesis 1:30', 'Genesis 1:31', 'Genesis 2:1', 'Genes

In [None]:
# Test 3D plot the first 3 features of the data w/ Plotly Express
import plotly.express as plot

fig = plot.scatter_3d(x = textVector[:,0], y = textVector[:,1], z = textVector[:,2], opacity=0.8)
fig.show()

Use PCA to reduce the dimensions to a reasonable amount, since the number of features is high

In [None]:
from sklearn.decomposition import PCA

# Returns 2 PCA components - the data is simplified to 2 dimensions/components
pca = PCA(n_components=2)
data_pca = pca.fit_transform(textVector)


fig = plot.scatter(x = data_pca[:,0], y = data_pca[:,1], opacity=0.8)
fig.show()


Use t-SNE to fit and transform the data

The PCA-transformed data is in the variable data_pca

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE()

data_tsne = tsne.fit_transform(data_pca)

# KL divergence after optimization - lower means the low-dimensional layout matches the input similarities more closely
tsne.kl_divergence_

0.5450316667556763

# Visualize the plot for the data that was just fitted

In [None]:
fig = plot.scatter(x = data_tsne[:,0], y = data_tsne[:,1])
fig.show()
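
The cells above reduce the data to 2 PCA components and then run t-SNE on that 2-D result. The scikit-learn documentation for TSNE suggests reducing to a moderate number of dimensions (e.g. 50) first when the feature count is very high, so more structure is left for t-SNE to work with. The following is a minimal sketch of that variant, not the notebook's method. It assumes random stand-in data (`fake_vectors`) in place of `textVector`, and the names `reduced`, `tsne_sketch` and `embedded_2d` are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for textVector: 300 random 1536-dimensional vectors (hypothetical data)
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(300, 1536))

# Step 1: PCA down to 50 components to cut noise and speed up t-SNE
reduced = PCA(n_components=50, random_state=0).fit_transform(fake_vectors)

# Step 2: t-SNE down to 2 components for plotting
# (perplexity must be smaller than the number of samples)
tsne_sketch = TSNE(n_components=2, perplexity=30, random_state=0)
embedded_2d = tsne_sketch.fit_transform(reduced)

print(reduced.shape)      # (300, 50)
print(embedded_2d.shape)  # (300, 2)
print(tsne_sketch.kl_divergence_)
```

With the real embeddings, `embedded_2d` would be plotted the same way as `data_tsne`.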

# Make the search engine

How it will work: 

We already have the embedded vectors for all the data that we want to search through.
We just need to write code that converts the search query to a vector, compares it with the stored vectors, and returns the best results.

1. Get a search query - user input
2. Convert the query to a vector by embedding it
3. Compare the query vector to the data vectors we have and return the closest ones
    - Compare the query with every data vector and return the ones with the highest cosine similarity - with only 2,000 vectors this is fast
        - The rank_bm25 package was considered, but BM25 ranks documents by the query words they contain, so it cannot find text that does not share words with the query
    - Take note of stop words in program execution: the stored texts had stop words removed, but the query is embedded as typed
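
The comparison in step 3 can be sketched with plain NumPy. This is an illustration on toy 2-D vectors, not real embeddings. The helper name `top_k_cosine` and the toy values are assumptions made for the example.

```python
import numpy as np

def top_k_cosine(doc_vectors, query_vector, k=3):
    """Return (indices, scores) of the k rows most similar to query_vector."""
    docs = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# Toy 2-D "embeddings" (hypothetical values, not real model output)
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

indices, scores = top_k_cosine(toy_docs, toy_query, k=2)
print(indices, scores)
```

The query points almost along the first axis, so row 0 ranks first and row 2 (which shares that direction partly) ranks second.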

In [None]:
# Example search query
searchQuery = input("Search query: ")

# Process the query/turn it into a vector
queryVector = embed_text(searchQuery)
print("Query processed")

# Calculate cosine similarity between the query and every stored vector
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(textVector, [queryVector])





Query processed


# Make a list of the 10 results most similar to the query

In [None]:
# Make a list of the top 10 most similar
import numpy

# Flatten the (n, 1) similarity matrix into a 1-D array of scores
similaritiesList = similarities.ravel()

# Indices of the 10 highest scores, best first (argsort avoids the
# wrong-index bug that list.index() has when two scores are equal)
topTen = numpy.argsort(similaritiesList)[::-1][:10]

results = [vectorKeys[i] for i in topTen]


In [None]:
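from datasets import load_dataset

data = load_dataset("csv", data_files="./bible_data_set.csv")

print(f"Top 10 matching results for query: {searchQuery}\n")

# Read the columns once instead of on every loop iteration
citations = list(data["train"]["citation"])
texts = list(data["train"]["text"])

for result in results:
    resultIndex = citations.index(result)
    # No nested double quotes in the f-string, so this also runs on Python versions before 3.12
    print(f"{result} - {texts[resultIndex]}")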

Top 10 matching results for query: Flood

Genesis 7:17 - And the flood was forty days upon the earth; and the waters increased, and bare up the ark, and it was lift up above the earth. 

Genesis 7:10 - And it came to pass after seven days, that the waters of the flood were upon the earth. 

Genesis 7:6 - And Noah was six hundred years old when the flood of waters was upon the earth. 

Genesis 9:15 - And I will remember my covenant, which is between me and you and every living creature of all flesh; and the waters shall no more become a flood to destroy all flesh. 

Genesis 9:28 - And Noah lived after the flood three hundred and fifty years. 

Genesis 6:17 - And, behold, I, even I, do bring a flood of waters upon the earth, to destroy all flesh, wherein is the breath of life, from under heaven; and every thing that is in the earth shall die. 

Exodus 15:8 - And with the blast of thy nostrils the waters were gathered together, the floods stood upright as an heap, and the depths were cong

# Implement it as a Vector Database - optimize the search - use Milvus Lite
- With an indexed vector database there is no need to compare the query with every single element
    - An approximate index groups the vectors into clusters and compares the query only against the clusters closest to it
- Milvus Lite does not support native Windows, so run it under WSL
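
The grouping idea in the bullets above can be illustrated with a small cluster-then-search sketch. This is not how Milvus is implemented, and whether a given Milvus setup uses a clustered index at all depends on its index configuration. The sketch only shows why grouping reduces the number of comparisons. It assumes random stand-in vectors, and the names `clustered_search` and `n_probe` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 16))  # hypothetical stored vectors
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Build step: group the stored vectors into clusters once, up front
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(vectors)
cluster_of = kmeans.predict(vectors)  # nearest centroid for each stored vector

def clustered_search(query, k=5, n_probe=2):
    """Search only the n_probe clusters whose centroids are closest to the query."""
    query = query / np.linalg.norm(query)
    # Rank clusters by the distance from their centroid to the query
    centroid_dist = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probed = np.argsort(centroid_dist)[:n_probe]
    # Candidate set: only the vectors that live in the probed clusters
    candidates = np.flatnonzero(np.isin(cluster_of, probed))
    scores = vectors[candidates] @ query  # cosine similarity, since all vectors are unit length
    best = candidates[np.argsort(scores)[::-1][:k]]
    return best, len(candidates)

hits, n_compared = clustered_search(vectors[42])
print(hits, n_compared)
```

Only `n_compared` of the 500 vectors are scored. The trade-off is that a true nearest neighbor sitting in an unprobed cluster is missed, which is why such indexes are called approximate.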


In [6]:
from pymilvus import MilvusClient
client = MilvusClient("milvus_demo.db")

import numpy
import json

dataVector = json.load(open("./embedded_samples.json"))

textVector = numpy.array(list(dataVector.values()))

vectorKeys = list(dataVector.keys())

Setup the database

In [7]:
import numpy

# Check for the number of dimensions in our array
print(numpy.shape(textVector))

collection_name = "Bible_search_azure"
if client.has_collection(collection_name):
    client.drop_collection(collection_name)

# Setup the database
client.create_collection(collection_name=collection_name, dimension=1536, metric_type="COSINE")

textVector_list = list(dataVector.values())

from datasets import load_dataset
originalData = load_dataset("csv", data_files="./bible_data_set.csv")

# Reformat data for Milvus - one dictionary per vector, with an ID number
# Format: {id: id, vector: vector} - for the ID number just use the index of the value in the normal list (the text is looked up by ID later)
vectorDatabase = []
for i in range(len(textVector_list)):
    vectorDatabase.append({"id":i, "vector":textVector[i]})

client.insert(collection_name=collection_name, data = vectorDatabase)



(2000, 1536)


{'insert_count': 2000,
 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215

Define function for vector embed

In [5]:
import os
from langchain_openai import AzureOpenAIEmbeddings

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(deployment=secrets["OPENAI_EMBEDDINGS_DEPLOYMENT"])

# Function to embed text
def embed_text(text):
    return embeddings.embed_query(text)

# Ask user for search query and search for it in the database

In [68]:
query = input("Enter search query: ")

response = client.search(
    collection_name=collection_name,
    data=[embed_text(query)],  # Use the function to convert the question to an embedding vector
    limit=10,  # Return top 10 results
    output_fields=["text"],  # Return the text field
)

print(f"Top results for query: {query}\n\n")

for part in response[0]:
    id = part["id"]
    print(f'{vectorKeys[id]} - {originalData["train"][id]["text"]}')

Top results for query: cain and abel


Genesis 4:8 - And Cain talked with Abel his brother: and it came to pass, when they were in the field, that Cain rose up against Abel his brother, and slew him. 

Genesis 4:25 - And Adam knew his wife again; and she bare a son, and called his name Seth: For God, said she, hath appointed me another seed instead of Abel, whom Cain slew. 

Genesis 4:9 - And the LORD said unto Cain, Where is Abel thy brother? And he said, I know not: Am I my brother's keeper? 

Genesis 4:2 - And she again bare his brother Abel. And Abel was a keeper of sheep, but Cain was a tiller of the ground. 

Genesis 4:5 - But unto Cain and to his offering he had not respect. And Cain was very wroth, and his countenance fell. 

Genesis 4:24 - If Cain shall be avenged sevenfold, truly Lamech seventy and sevenfold. 

Genesis 5:12 - And Cainan lived seventy years and begat Mahalaleel: 

Genesis 5:13 - And Cainan lived after he begat Mahalaleel eight hundred and forty years, and bega

# Try using a local embedding model to convert to vectors

- Using MiniLM (all-MiniLM-L6-v2) as the local embedding model, AzureOpenAI for cloud

In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=0)
def embedLocal(text):
    return model.encode(text)





It works - use it to embed the real data samples. 

- Embed the Bible dataset (the code below embeds every row; the first 2,000 vectors are sampled later to match the Azure embeddings)

In [2]:
from datasets import load_dataset

originalData = load_dataset("csv", data_files="./bible_data_set.csv")

trainingData = [line["text"] for line in originalData["train"]]
vectorData = embedLocal(trainingData)

print("Finished data tokenization")

import json
import numpy

# The with-block closes the file so the JSON is fully written to disk
with open("embedded_local.json", "w") as f:
    json.dump(vectorData.tolist(), f)

print("Finished packaging data into json file and saved")


Finished data tokenization
Finished packaging data into json file and saved


Compare the cosine similarities produced by the two embedding models

In [3]:
vectorData_miniLM = json.load(open("./embedded_local.json", "r"))
# Only take a sample since we only fed 2000 samples into Azure
vectorData_miniLM_sample = vectorData_miniLM[:2000]
vectorData_azure = list(json.load(open("./embedded_samples.json", "r")).values())



In [8]:
from sklearn.metrics.pairwise import cosine_similarity

examplequery = "sky"

miniLM_sim = cosine_similarity([embedLocal(examplequery)], [vectorData_miniLM[0]])
azure_sim = cosine_similarity([embed_text(examplequery)], [vectorData_azure[0]])

print(f"Testing similarity with query(\"{examplequery}\") and text example sentence(\"{trainingData[0][:-2]}\")")
print(f"\nSimilarity analyzed from tokenization by: MiniLM - {miniLM_sim}, AzureOpenAI - {azure_sim}")



Testing similarity with query("sky") and text example sentence("In the beginning God created the heaven and the earth. ")

Similarity analyzed from tokenization by: MiniLM - [[0.22766002]], AzureOpenAI - [[0.27191248]]


# Make a vector database to use with the miniLM vectors

Reuse the Milvus code from above, adapted for miniLM
- The dimension must change, since miniLM returns fewer dimensions than Azure

In [53]:
# Check for dimensions in miniLM model

print(len(vectorData_miniLM[0]))

384


In [9]:
from pymilvus import MilvusClient
client = MilvusClient("milvus_demo.db")

collection_name = "Bible_search_miniLM"

if client.has_collection(collection_name):
    client.drop_collection(collection_name)

# Setup the database
client.create_collection(collection_name=collection_name, dimension=384, metric_type="COSINE")

# Reformat data for Milvus - one dictionary per vector, with an ID number
# Format: {id: id, vector: vector} - for the ID number just use the index of the value in the normal list (the text is looked up by ID later)
vectorDatabase = []
for i in range(len(vectorData_miniLM_sample)):
    vectorDatabase.append({"id":i, "vector":vectorData_miniLM_sample[i]})

client.insert(collection_name=collection_name, data = vectorDatabase)





{'insert_count': 2000,
 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215

Use same code as previously written to search through the database

In [20]:
query = input("Enter search query: ")

response = client.search(
    collection_name=collection_name,
    data=[embedLocal(query)],  # Use the function to convert the question to an embedding vector
    limit=10,  # Return top 10 results
    output_fields=["text"],  # Return the text field
)

print(f"Top results for query: {query}\n\n")

for part in response[0]:
    id = part["id"]
    print(f'{originalData["train"][id]["citation"]} - {originalData["train"][id]["text"]}')

Top results for query: negawatts


Genesis 36:3 - And Bashemath Ishmael's daughter, sister of Nebajoth. 

Genesis 10:2 - The sons of Japheth; Gomer, and Magog, and Madai, and Javan, and Tubal, and Meshech, and Tiras. 

Exodus 1:4 - Dan, and Naphtali, Gad, and Asher. 

Genesis 25:15 - Hadar, and Tema, Jetur, Naphish, and Kedemah: 

Genesis 46:16 - And the sons of Gad; Ziphion, and Haggi, Shuni, and Ezbon, Eri, and Arodi, and Areli. 

Genesis 10:8 - And Cush begat Nimrod: he began to be a mighty one in the earth. 

Genesis 25:14 - And Mishma, and Dumah, and Massa, 

Genesis 14:7 - And they returned, and came to Enmishpat, which is Kadesh, and smote all the country of the Amalekites, and also the Amorites, that dwelt in Hazezontamar. 

Genesis 46:24 - And the sons of Naphtali; Jahzeel, and Guni, and Jezer, and Shillem. 

Genesis 36:1 - Now these are the generations of Esau, who is Edom. 



# Compare the two when searching for the same query

In [31]:
query = input("Enter search query: ")


client.load_collection("Bible_search_miniLM")
response_miniLM = client.search(
    collection_name="Bible_search_miniLM",
    data=[embedLocal(query)],  # Use the function to convert the question to an embedding vector
    limit=5,  # Return top 5 results
    output_fields=["text"],  # Return the text field
)

client.load_collection("Bible_search_azure")
response_azure = client.search(
    collection_name="Bible_search_azure",
    data=[embed_text(query)],
    limit=5,
    output_fields=["text"]
)

print(f"Results for query(vector search through vector database): {query}\n")

print("\nminiLM results: ")
for answer in response_miniLM[0]:
    id = answer["id"]
    print(f'{originalData["train"][id]["citation"]} - {originalData["train"][id]["text"]}')

print("\nAzureOpenAI results: ")
for answer in response_azure[0]:
    id = answer["id"]
    print(f'{originalData["train"][id]["citation"]} - {originalData["train"][id]["text"]}')

Results for query(vector search through vector database): The beginning of the world or something


miniLM results: 
Genesis 1:1 - In the beginning God created the heaven and the earth. 

Genesis 2:1 - Thus the heavens and the earth were finished, and all the host of them. 

Genesis 1:5 - And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day. 

Genesis 41:47 - And in the seven plenteous years the earth brought forth by handfuls. 

Genesis 6:13 - And God said unto Noah, The end of all flesh is come before me; for the earth is filled with violence through them; and, behold, I will destroy them with the earth. 


AzureOpenAI results: 
Genesis 1:1 - In the beginning God created the heaven and the earth. 

Genesis 10:10 - And the beginning of his kingdom was Babel, and Erech, and Accad, and Calneh, in the land of Shinar. 

Genesis 1:5 - And God called the light Day, and the darkness he called Night. And the evening and the morning
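
One way to put a number on how much the two models agree for a query is the overlap between their result ID sets. The sketch below uses a Jaccard overlap. The helper name `overlap_at_k` and the ID lists are hypothetical stand-ins for the `id` fields of `response_miniLM[0]` and `response_azure[0]`. They are not taken from the run above.

```python
def overlap_at_k(ids_a, ids_b):
    """Jaccard overlap between two result-ID lists (1.0 = same set, 0.0 = disjoint)."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical result IDs for the same query from two different models
ids_model_a = [0, 31, 4, 1243, 150]
ids_model_b = [0, 244, 4, 31, 7]

print(overlap_at_k(ids_model_a, ids_model_b))
```

This ignores ranking order. It only measures whether the two models retrieve the same verses.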