SpaCy is like the Swiss Army knife of NLP, while Transformers is more akin to a sledgehammer.

SpaCy is fast and lightweight. Transformers (e.g. Sentence Transformers) lets you use state-of-the-art models, but the trade-off is usually slower inference and higher memory usage.

Another important distinction is that SpaCy has tools for more linguistics-focused tasks, such as dependency parsing and annotation, while Transformers has tools for tasks that span beyond NLP.

In [20]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# the problem with fuzz is that it does not capture semantic meaning -> a high ratio can still be a bad match, since wording is key in TORs
#str1 = 'สามารถติดตั้งบนระบบปฏิบัติการต่าง ๆ เช่น Windows Server, Linux, Unix ได้เป็นอย่างน้อย'
#str2 = 'สามารถเลือกทำงานบนระบบปฏิบัติการ Windows หรือ ระบบปฏิบัติการ UNIX'

#str1 = 'สามารถใช้งาน Lock ข้อมูลในระดับแถว (Row Level Locking) ได้อัตโนมัติ โดยไม่ต้องมีการพัฒนาโปรแกรมเพิ่มเติม'
#str2 = 'เป็นฐานข้อมูลที่มีระบบ Lock ข้อมูลในระดับ Row Level Locking จริง ๆ ซึ่ง Database Engine กระทำได้เอง โดยต้องไม่มีการเขียนโปรแกรมเพิ่มเติม'

str1 = 'Im very happy right now, so thats the best right'
str2 = 'Im very sad right now, so thats the best right'

display(fuzz.token_sort_ratio(str1, str2)) # token based -> word order does not matter as long as the words are the same
display(fuzz.ratio(str1, str2)) # order matters -> whitespace also affects the output

89

94

# Fuzz does not give good results for similar text unless it is a near-exact match. Compare this to the SentenceTransformer below
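To make the point above concrete, here is a minimal standard-library sketch. It uses `difflib.SequenceMatcher` as a rough stand-in for a character-level ratio (an assumption for illustration, not the fuzzywuzzy API) and shows that swapping one word for its opposite barely lowers the score.

```python
from difflib import SequenceMatcher

def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a, b).ratio()

happy = 'Im very happy right now, so thats the best right'
sad = 'Im very sad right now, so thats the best right'

score = char_ratio(happy, sad)
# The score stays high even though the sentences mean the opposite
print(f"{score:.2f}")
```

A surface-level score like this is only useful when near-identical wording is what you want to detect.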

In [21]:
# test using sentence models -> pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# load a pre-trained model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Define the two sentences
sentence1 = 'Im very happy right now, so thats the best right'
sentence2 = 'Im very sad right now, so thats the best right'

# Generate embeddings for each sentence
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# Compute cosine similarity between the embeddings
cosine_score = util.pytorch_cos_sim(embedding1, embedding2)
print("Cosine similarity:", cosine_score.item())

Cosine similarity: 0.7313247919082642
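For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on toy 2D vectors (the vectors are made up for illustration; real embeddings from this model have 384 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score close to 1, orthogonal vectors score 0
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```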


In [6]:
# test using sentence models -> pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define the two sentences
sentence1 = 'I want to really eat some ice cream at the store'
sentence2 = 'I want to really not eat some ice cream at the storesssssssssss'

# Generate embeddings for each sentence
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# Compute cosine similarity between the embeddings
cosine_score = util.pytorch_cos_sim(embedding1, embedding2)
print("Cosine similarity:", cosine_score.item())

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Cosine similarity: 0.8878823518753052


In [7]:
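# Create spacy nlp object
# load en_core_web_md (medium model), en_core_web_lg (large model), en_core_web_trf (transformer-based, largest)
# the model package must be installed first, e.g. python -m spacy download en_core_web_trf
# pip uninstall en-core-web-lg
#nlp = spacy.load("en_core_web_lg")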

import spacy
nlp = spacy.load("en_core_web_trf")
print(nlp.meta)


OSError: [E050] Can't find model 'en_core_web_trf'. It doesn't seem to be a Python package or a valid path to a data directory.
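The E050 error above means the model package is not installed in the current environment. Since spaCy models installed with pip are ordinary Python packages, one way to check before calling `spacy.load` is sketched below, using only the standard library. The helper name `is_installed` is my own, not part of spaCy.

```python
import importlib.util

def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None

# Report which of the candidate spaCy models are available
for name in ("en_core_web_trf", "en_core_web_lg", "en_core_web_md"):
    print(name, "installed" if is_installed(name) else "missing")
```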

In [None]:
# Statement accuracy rate, compare between sentence transformer vs spacy vs fuzzywuzzy

In [None]:
import pandas as pd # for data manipulation
from sentence_transformers import SentenceTransformer, util

# Load the two Excel files - input file and reference file
df_main = pd.read_excel('Excel_file/Main.xlsx')
df_compare = pd.read_excel('Excel_file/Compare.xlsx')

# Load a Thai-compatible model
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# Encode all statements from Main.xlsx as a single batch
main_statements = df_main['Statement'].tolist()
main_embeddings = model.encode(main_statements, convert_to_tensor=True, show_progress_bar=True)
print(type(main_embeddings))
main_embeddings.shape

Batches: 100%|██████████| 11/11 [00:01<00:00,  6.43it/s]

<class 'torch.Tensor'>





torch.Size([341, 384])
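The progress bar shows 11 batches for 341 statements. That is consistent with a batch size of 32, which I am assuming is the default of `encode` here (it was not set explicitly). A quick arithmetic check:

```python
import math

def num_batches(n_sentences: int, batch_size: int = 32) -> int:
    """Number of batches needed to encode n_sentences (batch_size=32 is an assumption)."""
    return math.ceil(n_sentences / batch_size)

print(num_batches(341))  # 11, matching the progress bar above
```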

In [None]:
import pickle # for caching main embeddings

# testing pickle; .pkl is the conventional extension, but any extension works
student_names = ['Kay','Bob','Elena','Jane','Kyle']
with open('student_file.pkl', 'wb') as f:  # open the file in binary write mode
    pickle.dump(student_names, f) # serialize the list
# no f.close() needed: the with block closes the file

In [None]:
with open('student_file.pkl', 'rb') as f:  # open the file in binary read mode
    list_name = pickle.load(f) # deserialize the list
# no f.close() needed: the with block closes the file
print(list_name)

['Kay', 'Bob', 'Elena', 'Jane', 'Kyle']
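The same dump/load pair can be wrapped into a small load-or-compute helper, which is the pattern used for caching embeddings. This is a minimal sketch: the helper name `load_or_compute` and the temporary file path are placeholders for illustration.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return (value, from_cache). Calls compute() only if no cache file exists."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f), True
    value = compute()
    with open(cache_path, 'wb') as f:
        pickle.dump(value, f)
    return value, False

# Use a throwaway directory so the demo does not touch real files
cache_path = os.path.join(tempfile.mkdtemp(), 'names.pkl')
first, first_from_cache = load_or_compute(cache_path, lambda: ['Kay', 'Bob'])
second, second_from_cache = load_or_compute(cache_path, lambda: ['ignored'])
print(first_from_cache, second_from_cache, second)  # False True ['Kay', 'Bob']
```

Note that the second call ignores its `compute` argument entirely, so a stale cache file has to be deleted by hand when the source data changes.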


In [None]:
# selecting excel test
import pandas as pd
import re

def excel_cell_to_indices(cell_str):
    """
    Converts an Excel cell address (e.g., "A5") to zero-based (row, column) indices.
    """
    match = re.match(r"([A-Za-z]+)([0-9]+)", cell_str)
    if not match:
        raise ValueError("Invalid cell format: " + cell_str)
    col_str, row_str = match.groups()
    # Convert letters to a zero-based column index:
    col_idx = 0
    for char in col_str.upper():
        col_idx = col_idx * 26 + (ord(char) - ord('A') + 1)
    col_idx -= 1  # adjust to zero-based index
    row_idx = int(row_str) - 1  # adjust to zero-based index
    return row_idx, col_idx

def slice_excel_by_cells(df, num_start, num_end, stmt_start, stmt_end):
    """
    Extracts two series from the DataFrame based on provided Excel cell ranges.
    
    Args:
        df (pd.DataFrame): DataFrame read from the Excel file.
        num_start (str): Starting cell for TOR comply numbers (e.g., "A5").
        num_end (str): Ending cell for TOR comply numbers (e.g., "A23").
        stmt_start (str): Starting cell for TOR comply statements (e.g., "B5").
        stmt_end (str): Ending cell for TOR comply statements (e.g., "B23").
    
    Returns:
        (pd.Series, pd.Series): Two series, one for numbers and one for statements.
    """
    num_start_row, num_start_col = excel_cell_to_indices(num_start)
    num_end_row, _ = excel_cell_to_indices(num_end)  # Column should be same as start for numbers
    stmt_start_row, stmt_start_col = excel_cell_to_indices(stmt_start)
    stmt_end_row, _ = excel_cell_to_indices(stmt_end)  # Column should be same as start for statements
    
    # Slicing includes the ending row so add 1 (pandas slicing is end-exclusive)
    numbers = df.iloc[num_start_row:num_end_row+1, num_start_col]
    statements = df.iloc[stmt_start_row:stmt_end_row+1, stmt_start_col]
    return numbers, statements

# Example usage:
# Read the Excel file (adjust header settings if needed)
df = pd.read_excel("Excel_file/Unformat_test.xlsx", header=None)

# Dynamically select ranges using Excel cell notation.
tor_numbers, tor_statements = slice_excel_by_cells(df, "A6", "A23", "B6", "B23")

# Combine into a new DataFrame with proper column names
result_df = pd.DataFrame({
    "TOR comply number": tor_numbers,
    "TOR comply statement": tor_statements
})

print(result_df)


   TOR comply number                               TOR comply statement
5                4.2  สามารถเลือกทำงานบนระบบปฏิบัติการ Windows หรือ ...
6                4.3  เป็นฐานข้อมูลที่มีระบบ Lock ข้อมูลในระดับ Row ...
7                4.4  มีคุณสมบัติในการทำ Multi-Version Read Consiste...
8                4.5  สามารถทำการเก็บข้อมูลและแสดงผลได้ทั้งภาษาไทยแล...
9                4.6  มีการทำงานแบบ Machine Learning เพื่อช่วยเพิ่มป...
10               4.7  มีการทำงานแบบ Query Optimization และสามารถทำงา...
11               4.8  สามารถรองรับการจัดเก็บข้อมูลในรูปแบบ JSON โดยส...
12               4.9       สามารถทำงานในรูปแบบระบบฐานข้อมูลแบบ Graph ได
13               4.1  มีเครื่องมือรองรับในการจัดการระบบไฟล์สำหรับไฟล...
14               NaN  4.10.1 รองรับการช่วยกระจาย I/O ไปยังดิสก์ข้อมู...
15               NaN  4.10.2 รองรับการเพิ่มหรือลดจำนวน disk ได้โดยไม...
16               NaN  4.10.3 รองรับการจัดเรียงการกระจายของข้อมูลใหม่...
17               NaN  4.10.4 รองรับการ Mirror Resync ข้อมูลระหว่
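The column conversion treats the letters as a base-26 number without a zero digit ("A" = 1, "Z" = 26, "AA" = 27). As a sanity check, here is a sketch of the inverse direction, going from zero-based indices back to a cell address. Both function names are my own additions, not part of pandas.

```python
def column_index_to_letters(col_idx: int) -> str:
    """Convert a zero-based column index to Excel letters (0 -> 'A', 26 -> 'AA')."""
    letters = ""
    n = col_idx + 1  # work in one-based numbering, as Excel does
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return letters

def indices_to_excel_cell(row_idx: int, col_idx: int) -> str:
    """Convert zero-based (row, column) indices to an address such as 'A5'."""
    return f"{column_index_to_letters(col_idx)}{row_idx + 1}"

print(indices_to_excel_cell(4, 0))   # A5
print(indices_to_excel_cell(5, 1))   # B6
print(indices_to_excel_cell(0, 26))  # AA1
```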

In [None]:
# single batch code - LLM generated
import pandas as pd  # for data manipulation
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm
import time
import pickle
import os

# Load the two Excel files
df_main = pd.read_excel('Excel_file/Main.xlsx')
df_compare = pd.read_excel('Excel_file/Compare.xlsx')

# Record start time
start_time = time.time()

# Load the Sentence Transformer model
print("Start loading model...")
with tqdm(total=1, desc="Loading Model") as pbar:
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    pbar.update(1)
print("Model loaded!")

# Cache file for main embeddings
cache_file = 'main_embeddings.pkl'

# Either load cached embeddings or create them if they don't exist
if os.path.exists(cache_file):
    with open(cache_file, 'rb') as f:
        main_embeddings = pickle.load(f)
    print("Loaded cache file for main embeddings!")
else:
    print("Start embedding main statements...")
    main_statements = df_main['Statement'].tolist()
    main_embeddings = model.encode(main_statements, convert_to_tensor=True, show_progress_bar=True)

    with open(cache_file, 'wb') as f:
        pickle.dump(main_embeddings, f)
    print("Created cache file for main embeddings!")

# ---------------------------
# BATCH ENCODING IMPROVEMENT
# ---------------------------
print("Encoding compare statements in batch...")
compare_statements = df_compare['Statement'].tolist()
compare_embeddings = model.encode(compare_statements, convert_to_tensor=True, show_progress_bar=True)

print("Computing similarity scores in a single pass...")
# This creates a similarity matrix of shape (len(df_compare), len(df_main))
similarity_matrix = util.pytorch_cos_sim(compare_embeddings, main_embeddings)

# Find the highest similarity score for each row in df_compare
best_scores, best_idxs = similarity_matrix.max(dim=1)

# Set a threshold if you want to discard matches below a certain score
threshold = 0.1
Result = []

# Loop through each compare statement once, retrieving the best match
for i in range(len(compare_statements)):
    score = best_scores[i].item()
    idx = best_idxs[i].item()
    if score >= threshold:
        best_document = df_main.iloc[idx]['Document']
        best_statement = df_main.iloc[idx]['Statement']
        folder_location = df_main.iloc[idx]['Folder location']
        Result.append({
            'Number': df_compare.iloc[i]['Number'],
            'Statement': compare_statements[i],
            'Matched Statement': best_statement,
            'Matched Document Reference': best_document,
            'Similarity Score': score,
            'Folder location': folder_location
        })

# Create a DataFrame from the results
output_df = pd.DataFrame(Result)
#print(output_df)

# Write the output to Excel with XlsxWriter
output_file = 'Excel_file/Result.xlsx'
with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
    output_df.to_excel(writer, sheet_name='Sheet1', index=False)

    workbook = writer.book
    worksheet = writer.sheets['Sheet1']

    # Format the "Similarity Score" column (E) to display as a percentage
    percentage_format = workbook.add_format({'num_format': '0.00%'})
    worksheet.set_column('E:E', 16, percentage_format)

    # Apply conditional formatting based on similarity score
    num_rows = len(output_df)
    cell_range = f'E2:E{num_rows + 1}'

    red_format = workbook.add_format({'bg_color': '#FFC7CE', 'font_color': '#9C0006'})
    orange_format = workbook.add_format({'bg_color': '#FFEB9C', 'font_color': '#9C6500'})
    green_format = workbook.add_format({'bg_color': '#C6EFCE', 'font_color': '#006100'})

    # < 80%: red
    worksheet.conditional_format(cell_range, {
        'type': 'cell',
        'criteria': '<',
        'value': 0.8,
        'format': red_format
    })

    # 80% - 95%: orange
    worksheet.conditional_format(cell_range, {
        'type': 'cell',
        'criteria': 'between',
        'minimum': 0.8,
        'maximum': 0.95,
        'format': orange_format
    })

    # >= 95%: green
    worksheet.conditional_format(cell_range, {
        'type': 'cell',
        'criteria': '>=',
        'value': 0.95,
        'format': green_format
    })

    # Adjust column widths
    worksheet.set_column('A:A', 8)
    worksheet.set_column('B:B', 50)
    worksheet.set_column('C:C', 50)
    worksheet.set_column('D:D', 60)
    worksheet.set_column('F:F', 30)

# End time and total time
end_time = time.time()
total_time = end_time - start_time
print(f"Total time: {total_time:.3f} seconds taken")


Start loading model...


Loading Model: 100%|██████████| 1/1 [00:03<00:00,  3.28s/it]


Model loaded!
Loaded cache file for main embeddings!
Encoding compare statements in batch...


Batches: 100%|██████████| 2/2 [00:00<00:00,  6.99it/s]

Computing similarity scores in a single pass...
Total time: 3.595 seconds taken
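To see what the row-wise maximum over the similarity matrix does, here is a plain-Python sketch on a toy matrix. The numbers are made up; each row stands for one compare statement and each column for one main statement.

```python
def best_matches(similarity_matrix, threshold=0.1):
    """For each row, return (best_column_index, best_score), or None if below threshold."""
    results = []
    for row in similarity_matrix:
        best_idx = max(range(len(row)), key=row.__getitem__)
        best_score = row[best_idx]
        results.append((best_idx, best_score) if best_score >= threshold else None)
    return results

toy_matrix = [
    [0.20, 0.91, 0.35],  # best match is column 1
    [0.05, 0.02, 0.08],  # nothing reaches the threshold
    [0.88, 0.10, 0.97],  # best match is column 2
]
print(best_matches(toy_matrix))  # [(1, 0.91), None, (2, 0.97)]
```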





In [None]:
# Old code - unbatched processing in a loop - slightly slower in repeated testing, faster on the initial run
# Import the required lib
import pandas as pd # for data manipulation

# Sentence Transformers enables the transformation of sentences into vector spaces
from sentence_transformers import SentenceTransformer, util # util provides helper function for embeddings such as the function pytorch_cos_sim to compute cosine similarity
from tqdm import tqdm # for progress bar
import time # for total time
import pickle # for caching main embeddings
import os

# Load the two Excel files - input file and reference file
df_main = pd.read_excel('Excel_file/Main.xlsx')
df_compare = pd.read_excel('Excel_file/Compare.xlsx')

# Record start time
start_time = time.time()

# Load a Thai-compatible model
print("Start loading model...")
with tqdm(total=1, desc="Loading Model") as pbar:
    model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    pbar.update(1)
print("Model loaded!")

# Cache file for main embeddings
cache_file = 'main_embeddings.pkl'

# Use pickle to cache main file embeddings - load if already created, create if not
# Note: When excel file is changed, the embeddings will need to be re-created, delete the cache file (main_embeddings.pkl)
if os.path.exists(cache_file):
    with open(cache_file, 'rb') as f:
        main_embeddings = pickle.load(f)
    print("Loaded cache file for main embeddings!")
else:
    # Encode all statements from Main.xlsx as a single batch
    print("Start embedding main statements...")
    main_statements = df_main['Statement'].tolist()
    main_embeddings = model.encode(main_statements, convert_to_tensor=True, show_progress_bar=True)

    # Cache the embeddings
    with open(cache_file, 'wb') as f:
        pickle.dump(main_embeddings, f)
    
    print("Created cache file for main embeddings!")

# Create similarity function - one statement per call, compared against all main embeddings at once
def find_match(statement, main_df, threshold=0.2):
    """
    Finds the best match for 'statement' within 'main_df' using sentence-transformers semantic similarity.

    Args:
        statement (str): The statement from df_compare to match against df_main.
        main_df (pd.DataFrame): DataFrame containing 'Statement', 'Document' and 'Folder location' columns.
        threshold (float): Minimum similarity score required to consider a match valid.

    Returns:
        tuple: (best_document, best_statement, best_score, folder_location)
            - best_document: The matching 'Document' from df_main
            - best_statement: The matched statement from df_main
            - best_score: The highest cosine similarity score found
            - folder_location: The 'Folder location' of the matched row
        If best_score is below threshold, every field except best_score is None.
    """
    # Encode the input statement once
    embedding_input = model.encode(statement, convert_to_tensor=True)

    # Compute similarity to all main embeddings at once (shape: (1, n_main))
    similarity_scores = util.pytorch_cos_sim(embedding_input, main_embeddings)[0] # [0] extracts the first row of the 2D tensor: the similarity score against each statement in main_df
    # Get the best score and its index
    best_score, best_idx = similarity_scores.max(dim=0) # highest similarity score in the 1D tensor and its index; dim=0 reduces over the only dimension
    best_score = best_score.item() # convert the PyTorch scalar to a Python float, e.g. 0.95
    best_idx = best_idx.item() # convert the PyTorch scalar to a Python int, e.g. 3

    # Threshold check: return None for the match fields if below threshold, so the append is skipped later
    if best_score >= threshold:
        # Retrieve the corresponding row from df_main with the best index from before
        best_document = main_df.iloc[best_idx]['Document']  # Adjust column name if needed
        best_statement = main_df.iloc[best_idx]['Statement']
        folder_location = main_df.iloc[best_idx]['Folder location']
        return best_document, best_statement, best_score, folder_location
    else:
        return None, None, best_score, None

Result = []
# Loop through each row, tqdm for progress bar
for _, row in tqdm(df_compare.iterrows(), total=len(df_compare), desc="Processing rows"):
    # Use function to find the best match
    document, statement, score, location = find_match(row['Statement'], df_main, threshold=0.1)

    # If not none, then append to result
    if document is not None:
        Result.append({
            'Number': row['Number'],
            'Statement': row['Statement'],
            'Matched Statement': statement,
            'Matched Document Reference': document,
            'Similarity Score': score,
            'Folder location': location
        })

# Build the DataFrame result
output_df = pd.DataFrame(Result)
#print(output_df)

# Use pandas with the XlsxWriter engine to create an Excel file, coloring the similarity score based on three conditions
output_file = 'Excel_file/Result.xlsx'
with pd.ExcelWriter(output_file, engine='xlsxwriter') as writer:
    # write a dataframe into an excel file
    output_df.to_excel(writer, sheet_name='Sheet1', index=False)

    # Get the workbook and worksheet object
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    
    # Determine the cell range for the Similarity Score column (row 2 to the last row)
    num_rows = len(output_df)
    cell_range = f'E2:E{num_rows + 1}'
    
    # Apply conditional formatting:
    # Red for scores below 80% (< 0.8)
    red_format = workbook.add_format({'bg_color': '#FFC7CE', 'font_color': '#9C0006'})
    worksheet.conditional_format(cell_range, {
        'type': 'cell',
        'criteria': '<',
        'value': 0.8,
        'format': red_format
    })
    
    # Orange for scores between 80% and 95% (0.8 to 0.95)
    orange_format = workbook.add_format({'bg_color': '#FFEB9C', 'font_color': '#9C6500'})
    worksheet.conditional_format(cell_range, {
        'type': 'cell',
        'criteria': 'between',
        'minimum': 0.8,
        'maximum': 0.95,
        'format': orange_format
    })
    
    # Green for scores 95% and above (>= 0.95)
    green_format = workbook.add_format({'bg_color': '#C6EFCE', 'font_color': '#006100'})
    worksheet.conditional_format(cell_range, {
        'type': 'cell',
        'criteria': '>=',
        'value': 0.95,
        'format': green_format
    })

    # Set column width for each column
    worksheet.set_column('A:A', 8)
    worksheet.set_column('B:B', 50)
    worksheet.set_column('C:C', 50)
    worksheet.set_column('D:D', 65)
    worksheet.set_column('F:F', 30)
    
    # column E is formatted as a percentage instead, rounded to 2 decimal places
    percentage_format = workbook.add_format({'num_format': '0.00%'})
    worksheet.set_column('E:E', 14, percentage_format)


# Get end time and total time taken
end_time = time.time()
total_time = end_time - start_time
print(f"Total time: {total_time:.3f} seconds taken")

Start loading model...


Loading Model: 100%|██████████| 1/1 [00:03<00:00,  3.29s/it]


Model loaded!
Loaded cache file for main embeddings!


Processing rows: 100%|██████████| 35/35 [00:00<00:00, 54.25it/s]

Total time: 3.965 seconds taken
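The three conditional-format rules amount to a simple banding of the score. A score of exactly 0.95 satisfies both the orange ("between" 0.8 and 0.95) and the green (>= 0.95) rule, so the sketch below has to pick one. Treating 0.95 as green is my assumption, and `score_band` is a hypothetical helper, not part of XlsxWriter.

```python
def score_band(score: float) -> str:
    """Map a similarity score to the color band used in the Excel output."""
    if score >= 0.95:
        return "green"   # assumption: the boundary value 0.95 counts as green
    if score >= 0.8:
        return "orange"
    return "red"

for s in (0.42, 0.8, 0.949, 0.95, 1.0):
    print(s, score_band(s))
```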





In [None]:
# testing PDF editor with python (pip install PyMuPDF)
import fitz
import os
import shutil

# Clear output folder
output_dir = 'output'
if os.path.exists(output_dir):
    print("Clearing output folder...")
    # Go over the files and subdirectories in the output folder
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        try:
            if os.path.isfile(file_path) or os.path.islink(file_path):
                os.unlink(file_path)  # remove file or link
                print("unlinking file: " + file_path)
            elif os.path.isdir(file_path):
                shutil.rmtree(file_path)  # remove directory and its contents
                print("removing directory: " + file_path)
        except Exception as e:
            print(f"Failed to delete {file_path}. Reason: {e}")
else:
    # Create the output folder if it does not exist
    print("Making output folder...")
    os.makedirs(output_dir)

print("\n")
# this will be from the complier.py file (extracted from excel input file)
# Here's a simple Map instead, replace this with actual columns from either the result.xlsx or dataframe column to map, number will be from the result.xlsx
pdf_statements = {
    "ODB_13": "2.1",
    "ODB_14": "2.2",
    "ODB_15": "2.3",
    "ODB_16": "2.4",
    "ODB_17": "2.5",
    "ODB_18": "2.6",
}
# loop over each PDF file
for saved_statement, cur_statement in pdf_statements.items():
    print(f"Opening PDF: {saved_statement}.pdf ...")
    # Open the PDF file
    doc = fitz.open(f'document/{saved_statement}.pdf')

    # loop over all pages in PDF file
    for page in doc:
        # loop over all annotations
        for annot in page.annots():
            # check if annotation is a freetext annotation
            if "FreeText" in annot.type:
                # Update annotation text
                annot.set_info(content=cur_statement, title="Oracle")
                annot.update()  # save annotation

                print(f"Annotation content updated to: {cur_statement}")

    # Save the PDF file
    doc.save(f'output/{cur_statement}.pdf')
    doc.close()  # Close the document when done



Clearing output folder...
unlinking file: output\1.14.1.pdf
unlinking file: output\1.14.2.pdf


Opening PDF: ODB_13.pdf ...
Annotation content updated to: 2.1
Annotation content updated to: 2.1
Annotation content updated to: 2.1
Opening PDF: ODB_14.pdf ...
Annotation content updated to: 2.2
Annotation content updated to: 2.2
Opening PDF: ODB_15.pdf ...
Annotation content updated to: 2.3
Opening PDF: ODB_16.pdf ...
Annotation content updated to: 2.4
Annotation content updated to: 2.4
Annotation content updated to: 2.4
Annotation content updated to: 2.4
Annotation content updated to: 2.4
Annotation content updated to: 2.4
Opening PDF: ODB_17.pdf ...
Annotation content updated to: 2.5
Opening PDF: ODB_18.pdf ...
Annotation content updated to: 2.6
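The hard-coded map could later be built from the matching results. The sketch below assumes that each result row carries the 'Matched Document Reference' and 'Number' columns from the result file, and that the document reference equals the PDF file name without its extension (e.g. "ODB_13"). Both are assumptions, and `build_pdf_jobs` is a hypothetical helper. It returns a list of pairs rather than a dict, so one document matched by several numbers is not overwritten.

```python
def build_pdf_jobs(result_rows):
    """Return (input_pdf_path, output_pdf_path, number) for each result row."""
    jobs = []
    for row in result_rows:
        document = row['Matched Document Reference']
        number = str(row['Number'])
        jobs.append((f"document/{document}.pdf", f"output/{number}.pdf", number))
    return jobs

# Toy rows standing in for the result DataFrame (values are illustrative)
rows = [
    {'Number': '2.1', 'Matched Document Reference': 'ODB_13'},
    {'Number': '2.2', 'Matched Document Reference': 'ODB_14'},
    {'Number': '2.7', 'Matched Document Reference': 'ODB_13'},
]
for job in build_pdf_jobs(rows):
    print(job)
```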


In [13]:
# test why the statements cannot be embedded
import pandas as pd
df_main = pd.read_excel('Excel_file/ODB_Mapped_PDF.xlsx')
print("Start embedding main statements...")
main_statements = df_main['Statement'].tolist()
#main_embeddings = model.encode(main_statements, convert_to_tensor=True, show_progress_bar=True)
print(main_statements)

Start embedding main statements...
['เป็นระบบจัดการฐานข้อมูลเชิงสัมพันธ์ (Relational Database Management System)', 'เป็นระบบจัดการฐานข้อมูลเชิงสัมพันธ์ที่สนับสนุนการทำงานแบบออปเจ็กต์ (Object-Relational Database Management System)', 'เป็นระบบจัดการฐานข้อมูลเชิงสัมพันธ์ที่สนับสนุนการทำงานแบบออบเจ็กต์ (Object-Relational Database Management System) โดยสามารถทำงานกับข้อมูลต่าง ๆ ดังนี้ ', 'สามารถเลือกทำงานบนระบบปฏิบัติการ Windows หรือ ระบบปฏิบัติการ UNIX', 'สามารถทำงานได้บนระบบปฎิบัติการ Linux และ Windows', 'สามารถติดตั้งบนระบบปฏิบัติการ Linux และระบบปฏิบัติการ Windows ได้เป็นอย่างน้อย', 'เป็นฐานข้อมูลที่มีระบบ Lock ข้อมูลในระดับ Row Level Locking จริง ๆ ซึ่ง Database Engine กระทำได้เอง โดยต้องไม่มีการเขียนโปรแกรมเพิ่มเติม', 'สามารถใช้งาน Lock ข้อมูลในระดับแถว (Row Level Locking) ได้อัตโนมัติ โดยไม่ต้องมีการพัฒนาโปรแกรมเพิ่มเติม', 'มีคุณสมบัติในการทำ Multi-Version Read Consistency โดยไม่มีการอ่านข้อมูลแบบ Dirty Reads ทั้งนี้เพื่อความถูกต้องของข้อมูลที่จะถูกนำไปใช้ ที่ซึ่งผู้เป็น Readers และ

In [None]:
# dealing with trailing zeros -> make sure the Number column is formatted as text in Excel and the trailing zero is present (e.g. 3.10, not 3.1)
import pandas as pd
df_main = pd.read_excel(
    'Excel_file/Compare.xlsx',
    sheet_name="ODB",
    header=1,
    dtype={'Number': str}  # Force 'Number' column to be read as text
)
df_main.head(20)

Unnamed: 0,Number,Statement
0,3.1,\tเป็นระบบจัดเก็บฐานข้อมูลเชิงสัมพันธ์ที่สนับส...
1,3.2,สามารถทำงานได้บนระบบปฏิบัติการ ได้แก่ Linux แล...
2,3.3,สามารถเชื่อมต่อฐานข้อมูลด้วยเน็ตเวิร์คโพรโทคอล...
3,3.4,สามารถเชื่อมต่อฐานข้อมูลด้วยมาตรฐาน Open Datab...
4,3.5,สามารถรองรับการทำงานกับข้อมูลรูปแบบ Character ...
5,3.6,"สนับสนุนการทำ Index, Referential Integrity, Un..."
6,3.7,"สนับสนุนการทำ View, Stored Procedure, Function..."
7,3.8,มีระบบช่วยเหลือในการสืบค้นข้อมูล (Query Optimi...
8,3.9,มีคุณสมบัติในการทำ Multi-Version Read Consiste...
9,3.1,สามารถสำรองและกู้คืนฐานข้อมูล (Database Backup...
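The last row above shows 3.1 where the numbering suggests 3.10. Once a value has passed through a float, the trailing zero is gone and no string conversion brings it back. If the source file cannot be fixed, a heuristic repair is possible. The sketch below assumes the numbers are unique and listed in ascending order, so a repeated value must have lost a zero. `restore_trailing_zeros` is a hypothetical helper and will guess wrong if those assumptions do not hold.

```python
# A float cannot remember a trailing zero
print(str(float("3.10")))  # 3.1

def restore_trailing_zeros(numbers):
    """Append '0' to any value already seen, assuming unique ascending numbering."""
    seen = set()
    fixed = []
    for number in numbers:
        while number in seen:
            number += "0"
        seen.add(number)
        fixed.append(number)
    return fixed

print(restore_trailing_zeros(["3.1", "3.2", "3.9", "3.1"]))  # ['3.1', '3.2', '3.9', '3.10']
```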


In [None]:
# Saving the model locally in ONNX format
from sentence_transformers import SentenceTransformer, util

# Load the model using the ONNX backend and ensure export is enabled.
model = SentenceTransformer(
    "paraphrase-multilingual-MiniLM-L12-v2",
    backend="onnx",
    model_kwargs={"export": True}
)

# Save the model locally; this will create the correct folder structure.
model.save_pretrained("./onnx_model")

KeyboardInterrupt: 

In [2]:
# test out local model
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer(
    "./onnx_model",
    backend="onnx",
    model_kwargs={"file_name": "model.onnx"}
)

# Define the two sentences
sentence1 = 'Im very happy right now, so thats the best right'
sentence2 = 'Im very sad right now, so thats the best right'

# Generate embeddings for each sentence
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# Compute cosine similarity between the embeddings
cosine_score = util.pytorch_cos_sim(embedding1, embedding2)
print("Cosine similarity:", cosine_score.item())

  from .autonotebook import tqdm as notebook_tqdm


*************** EP Error ***************
EP Error D:\a\_work\1\s\onnxruntime\python\onnxruntime_pybind_state.cc:505 onnxruntime::python::RegisterTensorRTPluginsAsCustomOps Please install TensorRT libraries as mentioned in the GPU requirements page, make sure they're in the PATH or LD_LIBRARY_PATH, and that your GPU is supported.
 when using ['TensorrtExecutionProvider', 'CUDAExecutionProvider']
Falling back to ['CUDAExecutionProvider', 'CPUExecutionProvider'] and retrying.
****************************************
Cosine similarity: 0.7313246130943298
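The ONNX score differs from the earlier PyTorch score for the same sentence pair only from the seventh decimal place on. A quick check that the two backends agree within a tolerance (the tolerance of 1e-6 is my own choice, not a documented guarantee):

```python
import math

pytorch_score = 0.7313247919082642  # from the PyTorch run of the same sentences
onnx_score = 0.7313246130943298     # from the ONNX run above

difference = abs(pytorch_score - onnx_score)
agree = math.isclose(pytorch_score, onnx_score, abs_tol=1e-6)
print(difference, agree)
```

Small differences like this are expected from floating-point arithmetic in different runtimes.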
