<h1>Query Similarity Search with Embeddings</h1>

<p>This example demonstrates how to use text embeddings and cosine similarity to match queries against a small sample dataset. The dataset holds 15 product descriptions in categories such as <strong>Smartphones</strong>, <strong>Audio Equipment</strong>, <strong>Kitchen Appliances</strong>, <strong>Athletic Footwear</strong>, and <strong>Home Cleaning</strong>. For each query, we embed the text, compute its cosine similarity to every product embedding, and retrieve the most similar item.</p>
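
<p>As a point of reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below is a minimal pure-Python illustration; the function name <code>cosine_similarity</code> and the two-dimensional vectors are made up for this example and are not part of the notebook or the model output.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration only (assumption, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors, 0
print(same_direction, orthogonal)
```

<p>A score near 1 means the two vectors point in almost the same direction, which is the sense in which a query and a product description are treated as similar here.</p>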

<h2>Dataset</h2>
<ul>
    <li><strong>Smartphones</strong></li>
    <li><strong>Audio Equipment</strong></li>
    <li><strong>Kitchen Appliances</strong></li>
    <li><strong>Athletic Footwear</strong></li>
    <li><strong>Home Cleaning</strong></li>
</ul>

<h3>Example Process</h3>
<p>Each product description in the dataset is embedded once. Each query is then embedded, its cosine similarity to every product embedding is computed, and the product with the highest score is returned. The queries and their most similar items are listed in the final section.</p>


<h2>Importing Required Packages</h2>

<p>Before starting, ensure you have all the necessary packages installed. If a package is missing, you can install it using <code>pip</code>. Below is the list of required imports for this project:</p>

In [12]:
# Embedding model
from sentence_transformers import SentenceTransformer

# Tensor utilities and GPU detection
import torch
import torch.nn.functional as F
from torch import Tensor

# Classification and evaluation utilities
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score, confusion_matrix

# Dataset loading
from datasets import load_from_disk, Dataset

# Array and table handling
import numpy as np
import pandas as pd
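
<p>Before running the imports above, it can help to check which packages are missing. The following is a rough sketch using only the standard library; the helper name <code>missing_packages</code> and the <code>REQUIRED</code> mapping are hypothetical names introduced for this example.</p>

```python
import importlib.util

# Import name -> pip package name for the imports used in this notebook
REQUIRED = {
    "sentence_transformers": "sentence-transformers",
    "torch": "torch",
    "sklearn": "scikit-learn",
    "datasets": "datasets",
    "numpy": "numpy",
    "pandas": "pandas",
}

def missing_packages(required):
    """Return the pip names of packages whose top-level module cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]

missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

<p>The check only looks for the top-level module and does not verify versions, so it is a quick sanity test and not a substitute for a pinned requirements file.</p>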

<h2>Load Dataset and Move Model to GPU (if available)</h2>

In [55]:
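# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1', device=device)
# Load the product dataset previously saved to the local 'product_data' directory
dataset = load_from_disk('product_data')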

No sentence-transformers model found with name Metric-AI/armenian-text-embeddings-1. Creating a new one with mean pooling.


In [69]:
print(dataset)

Dataset({
    features: ['name', 'description', 'item_section', 'passage', 'embedding', 'similarity_scores'],
    num_rows: 15
})


<h2>Passage Preprocessing Steps</h2>

<p>In this example, each product description is treated as a "passage" and prefixed with <code>passage: </code> before embedding. The cell below adds the prefix and then computes a normalized embedding for every passage:</p>

In [56]:
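# Prefix each description so the model treats it as a passage
dataset = dataset.map(lambda x: {'passage': 'passage: ' + x['description']})
# Embed passages in batches; batch_size only takes effect when batched=True
dataset = dataset.map(
    lambda batch: {'embedding': model.encode(batch['passage'], normalize_embeddings=True)},
    batched=True,
    batch_size=32,
)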

In [57]:
df = dataset.to_pandas()

In [58]:
for passage in df['passage']:
    print(passage)

passage: Apple's flagship smartphone with advanced camera system, A17 Pro chip, and titanium design featuring innovative computational photography capabilities.
passage: High-end Android smartphone with 200MP camera, S Pen integration, powerful Snapdragon processor, and expansive display for mobile productivity and photography.
passage: Premium Android smartphone with advanced computational photography, AI-enhanced features, and powerful processor for creative mobile experiences.
passage: Premium noise-canceling wireless headphones with exceptional sound quality, intelligent noise cancellation, and comfortable over-ear design for audiophiles.
passage: Wireless noise-canceling earbuds with adaptive sound control, comfortable fit, and exceptional audio quality for immersive listening experiences.
passage: Advanced wireless earbuds with active noise cancellation, spatial audio, and seamless integration with Apple ecosystem for premium listening experience.
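passage: Iconic kitchen appliance with powerful motor, multiple attachment capabilities, and classic design available in various colors for home bakers and cooking enthusiasts.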

<h2>Query Similarity Calculation</h2>

<p>The following steps are taken in the code to calculate the similarity between queries and product descriptions:</p>

<ol>
    <li><strong>Query Processing:</strong> Each query is prefixed with <code>query:</code> to differentiate it from the product descriptions.</li>
    <li><strong>Embedding Queries:</strong> The processed queries are embedded using a pre-trained model, and the embeddings are normalized.</li>
    <li><strong>Similarity Calculation:</strong> For each query, the cosine similarity is calculated between the query embedding and the embeddings of the products in the dataset.</li>
    <li><strong>Finding the Most Similar Product:</strong> The product with the highest similarity score is selected as the most similar item to the query. The name and description of the most similar product are printed for each query.</li>
</ol>
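
<p>The steps above can be illustrated without loading the model. The sketch below uses made-up three-dimensional vectors in place of real embeddings; the helper names <code>l2_normalize</code> and <code>most_similar</code> are hypothetical and exist only in this example. It relies on the fact that, once both vectors are scaled to unit length, their dot product equals their cosine similarity.</p>

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as normalize_embeddings=True does for model output."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def most_similar(query_vec, item_vecs):
    """Return (index, score) of the item with the largest dot product against the query.

    Both sides are unit-normalized first, so the dot product is the cosine similarity.
    """
    q = l2_normalize(query_vec)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(item)))
        for item in item_vecs
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Made-up toy vectors standing in for real embeddings (assumption for illustration)
item_names = ["Smartphones", "Audio Equipment", "Kitchen Appliances"]
item_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
query_vec = [0.2, 1.0, 0.0]  # hypothetical embedding of an audio-related query

best_index, best_score = most_similar(query_vec, item_vecs)
print(item_names[best_index], round(best_score, 3))
```

<p>With these toy vectors the query lands closest to the second item. The notebook cell that follows applies the same idea to the real embeddings stored in the dataset.</p>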

In [65]:
queries = [
    'high-end smartphone with powerful processor',
    'noise-canceling wireless headphones',
    'multi-functional kitchen tool',
    'high-performance running sneakers',
    'advanced robotic vacuum cleaner',
]
# Prefix each query with "query: " to mirror the "passage: " prefix on descriptions
processed_queries = ['query: ' + query for query in queries]
query_embedding = model.encode(processed_queries, normalize_embeddings=True)

for i, query in enumerate(queries):
    # Both sides are unit-normalized, so the dot product equals the cosine similarity
    dataset = dataset.map(lambda x: {'similarity_scores': x['embedding'] @ query_embedding[i]})
    max_row = max(dataset, key=lambda x: x['similarity_scores'])

    print(f"Query: {query}")
    print("Most similar item in the dataset:")
    print(f"Name: {max_row['name']}")
    print(f"Description: {max_row['description']}")

Query: high-end smartphone with powerful processor
Most similar item in the dataset:
Name: Samsung Galaxy S23 Ultra
Description: High-end Android smartphone with 200MP camera, S Pen integration, powerful Snapdragon processor, and expansive display for mobile productivity and photography.


Query: noise-canceling wireless headphones
Most similar item in the dataset:
Name: Sony WH-1000XM5 Headphones
Description: Premium noise-canceling wireless headphones with exceptional sound quality, intelligent noise cancellation, and comfortable over-ear design for audiophiles.


Query: multi-functional kitchen tool
Most similar item in the dataset:
Name: KitchenAid Artisan Stand Mixer
Description: Iconic kitchen appliance with powerful motor, multiple attachment capabilities, and classic design available in various colors for home bakers and cooking enthusiasts.


Query: high-performance running sneakers
Most similar item in the dataset:
Name: Nike Air Zoom Pegasus 40
Description: High-performance running shoes with responsive cushioning, breathable mesh upper, and improved fit for long-distance runners and athletic training.


Query: advanced robotic vacuum cleaner
Most similar item in the dataset:
Name: iRobot Roomba j7+
Description: Advanced robotic vacuum with intelligent navigation, obstacle avoidance, self-emptying base, and smart home integration for effortless cleaning.
