<h1>Query Similarity Search with Embeddings</h1>

<p>This example demonstrates how to use embeddings to find the product that best matches a search query. We use a small dataset of products from five categories: <strong>Smartphones</strong>, <strong>Audio Equipment</strong>, <strong>Kitchen Appliances</strong>, <strong>Athletic Footwear</strong>, and <strong>Home Cleaning</strong>. For each query, we embed the text, calculate its cosine similarity with every product description embedding, and retrieve the most similar item.</p>

<h2>Dataset</h2>
<ul>
    <li><strong>Smartphones</strong></li>
    <li><strong>Audio Equipment</strong></li>
    <li><strong>Kitchen Appliances</strong></li>
    <li><strong>Athletic Footwear</strong></li>
    <li><strong>Home Cleaning</strong></li>
</ul>

<h3>Example Process</h3>
<p>Each product description is embedded once. At query time, the query is embedded with the same model, its cosine similarity to every product embedding is calculated, and the item with the highest score is returned.</p>
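<p>As a minimal sketch of the similarity measure itself, the snippet below computes cosine similarity in plain Python. The three-dimensional vectors and the names <code>item_a</code> and <code>item_b</code> are made up for illustration; real embeddings from the model have many more dimensions.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by the model).
query = [1.0, 0.0, 1.0]
items = {
    "item_a": [2.0, 0.0, 2.0],  # points in the same direction as the query
    "item_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every item and keep the one with the highest similarity.
scores = {name: cosine_similarity(query, vec) for name, vec in items.items()}
best = max(scores, key=scores.get)
```

<p>Cosine similarity depends only on direction, so <code>item_a</code> scores highest even though its vector is longer than the query vector.</p>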


<h2>Importing Required Packages</h2>

<p>Before starting, ensure you have all the necessary packages installed. If a package is missing, you can install it using <code>pip</code>. Below is the list of required imports for this project:</p>

In [8]:
from sentence_transformers import SentenceTransformer

import gradio as gr

import torch 
import torch.nn.functional as F
from torch import Tensor

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score, confusion_matrix

from datasets import load_from_disk, Dataset

import numpy as np
import pandas as pd
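
<p>To see which of the packages imported above are missing before running the notebook, a small helper along these lines could be used. The function name <code>missing_packages</code> and the <code>REQUIRED</code> mapping are hypothetical; the mapping is needed because some import names differ from the names passed to <code>pip</code>.</p>

```python
import importlib.util

# Import name -> package name to pass to pip.
REQUIRED = {
    "sentence_transformers": "sentence-transformers",
    "gradio": "gradio",
    "torch": "torch",
    "sklearn": "scikit-learn",
    "datasets": "datasets",
    "numpy": "numpy",
    "pandas": "pandas",
}

def missing_packages(required):
    """Return the pip names of top-level packages that cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
```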

<h2>Load Dataset and Move Model to GPU (if available)</h2>

In [2]:
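# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the embedding model onto the selected device.
model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1', device=device)
# Load the sample product dataset from a local directory.
dataset = load_from_disk('product_demo_data')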

No sentence-transformers model found with name Metric-AI/armenian-text-embeddings-1. Creating a new one with mean pooling.
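
<p>The message above indicates that the library found no sentence-transformers configuration for this checkpoint and therefore builds sentence embeddings by mean pooling the token embeddings. As a rough illustration of the idea, not the library's own code, mean pooling averages the token vectors while skipping padding positions. The function name <code>mean_pool</code> and the toy values are assumptions made for this sketch.</p>

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # max(count, 1) avoids dividing by zero for an all-padding input.
    return [total / max(count, 1) for total in totals]

# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```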


In [3]:
print(dataset)

Dataset({
    features: ['name', 'description', 'item_section'],
    num_rows: 15
})


<h2>Passage Preprocessing Steps</h2>

<p>In this example, product descriptions are treated as "passages." Each description is prefixed with <code>passage:</code>, and the prefixed text is then embedded with normalization enabled so that every embedding has unit length:</p>

In [4]:
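# Prefix every product description so the model treats it as a passage.
dataset = dataset.map(lambda x: {'passage': 'passage: ' + x['description']})
# Embed each passage; batch_size is dropped because map ignores it unless batched=True.
dataset = dataset.map(lambda x: {'embedding': model.encode(x['passage'], normalize_embeddings=True)})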


<h2>Search for Similar Items in a Dataset Using Gradio</h2>

<p>This interface takes a search query, such as a product or item description, and returns the most similar item in the dataset.</p>

<p><strong>Objective:</strong> Given a search query, the model embeds it and computes its similarity to the embedding of every item in the dataset. The item with the highest similarity score is returned.</p>

<p><strong>Usage:</strong> Type a custom query in the input box or select one of the example queries provided.</p>

<p>The app outputs the <strong>Name</strong> and <strong>Item Section</strong> of the most similar item.</p>

<h3>Gradio Interface Features:</h3>
<ul>
    <li><strong>Input:</strong> A text box for entering the search query.</li>
    <li><strong>Output:</strong> A text box showing the most similar item.</li>
    <li><strong>Clear Button:</strong> A button to clear the input and output fields.</li>
    <li><strong>Example Queries:</strong> Predefined queries that users can try.</li>
</ul>


In [17]:
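queries = [
    "բարձրորակ սմարթֆոն՝ հզոր պրոցեսորով",  # high-quality smartphone with a powerful processor
    "աղմուկը մեկուսացնող անլար ականջակալներ",  # noise-isolating wireless headphones
    "բազմաֆունկցիոնալ խոհանոցային գործիք",  # multifunctional kitchen tool
    "վազքի կոշիկներ",  # running shoes
    "ռոբոտ փոշեկուլ"  # robot vacuum cleaner
]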

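def search(query):
    # Prefix the query, mirroring the 'passage: ' prefix used for descriptions.
    processed_query = 'query: ' + query

    # Normalized embedding, so the dot product below equals cosine similarity.
    query_embedding = model.encode(processed_query, normalize_embeddings=True)

    # Score every item in the dataset against the query.
    similarities = [np.dot(item['embedding'], query_embedding) for item in dataset]

    # Pick the row with the highest score.
    max_index = int(np.argmax(similarities))
    max_row = dataset[max_index]

    return f"Name: {max_row['name']}\nItem Section: {max_row['item_section']}"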

iface = gr.Interface(
    fn=search, 
    inputs=gr.Textbox(
        label="🔍 Enter Your Search Query", 
        placeholder="Try one of the example queries...",
        value=queries[0]
    ),
    outputs=gr.Textbox(label="🏆 Most Similar Item"), 
    live=False,  
    clear_btn="Clear",
    examples=[
        [query] for query in queries 
    ]
)

iface.launch(share=True)





<h2>Query Similarity Calculation</h2>

<p>The following steps are taken in the code to calculate the similarity between queries and product descriptions:</p>

<ol>
    <li><strong>Query Processing:</strong> Each query is prefixed with <code>query:</code> to differentiate it from the product descriptions.</li>
    <li><strong>Embedding Queries:</strong> The processed queries are embedded using a pre-trained model, and the embeddings are normalized.</li>
    <li><strong>Similarity Calculation:</strong> For each query, the cosine similarity is calculated between the query embedding and the embedding of every product in the dataset. Because all embeddings are normalized to unit length, this is a plain dot product.</li>
    <li><strong>Finding the Most Similar Product:</strong> The product with the highest similarity score is selected as the most similar item to the query. Its name and item section are returned.</li>
</ol>
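
<p>The four steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: <code>toy_embed</code> is a hypothetical stand-in for <code>model.encode</code> that counts vocabulary words, and the vocabulary, product names, and descriptions are made up. It also returns the top <code>k</code> items instead of only the best one.</p>

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def toy_embed(text, vocab):
    # Hypothetical stand-in for model.encode(..., normalize_embeddings=True):
    # counts vocabulary words, then scales the vector to unit length.
    words = text.lower().split()
    return normalize([float(words.count(term)) for term in vocab])

def top_k(query, products, vocab, k=1):
    # Steps 1 and 2: prefix the query and embed it.
    query_vec = toy_embed("query: " + query, vocab)
    scored = []
    for name, description in products.items():
        passage_vec = toy_embed("passage: " + description, vocab)
        # Step 3: for unit-length vectors the dot product is the cosine similarity.
        score = sum(q * p for q, p in zip(query_vec, passage_vec))
        scored.append((score, name))
    # Step 4: highest score first.
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Made-up vocabulary and products, for illustration only.
vocab = ["running", "shoes", "robot", "vacuum", "phone"]
products = {
    "Trail Runner": "lightweight running shoes",
    "RoboClean": "robot vacuum cleaner",
    "PhoneX": "phone with a fast processor",
}
results = top_k("running shoes", products, vocab, k=1)
```

<p>The sketch embeds the passages on every call to stay short. The notebook instead embeds them once and stores the vectors in the dataset, which avoids repeating that work for each query.</p>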