<a href="https://colab.research.google.com/github/Adnya-01/AI-projects/blob/main/semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Semantic Search**
In this notebook, we walk through how semantic search can retrieve the sentences most relevant to a query from a multilingual translation dataset.

Semantic search is a retrieval method that returns results based on the context or intent of the query rather than on exact keyword matches (as in lexical search).
It is useful where traditional lexical search falls short and the intent behind the user's input matters, and it also suits multimodal and multilingual applications.
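To make the limitation of lexical search concrete, here is a minimal sketch of keyword overlap scoring. The helper names (`tokenize`, `keyword_overlap`) and the two example documents are illustrative assumptions, not part of the dataset or of any library.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a sentence and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def keyword_overlap(query: str, document: str) -> int:
    """Count the distinct words that the query and the document share."""
    return len(tokenize(query) & tokenize(document))


query = "I am your biggest fan"

# Hypothetical documents chosen for illustration only
documents = [
    "Turn on the fan, I am hot.",    # shares words with the query, different meaning
    "We really admire your music.",  # closer in meaning, shares only "your"
]

for doc in documents:
    print(keyword_overlap(query, doc), doc)
```

A purely lexical ranker would place the first document above the second, even though the second is closer in intent. Semantic search aims to avoid this by comparing meaning rather than surface words.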

Install required libraries

In [None]:
!uv pip install -qU \
  pinecone~=7.3.0 \
  pinecone-notebooks==0.1.1 \
  numpy==2.0.2 \
  datasets==3.5.1

Authenticate your Pinecone account and generate an API key

In [None]:
from pinecone_notebooks.colab import Authenticate

Authenticate()

Fetch your API key from the environment and initialize the Pinecone client, which is used for all later index operations.

In [None]:
import os

from pinecone import Pinecone

# Read the API key from the environment (set by the authentication step above)
api_key = os.environ.get("PINECONE_API_KEY")

# Initialize the client
pc = Pinecone(api_key=api_key)
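If the key is missing, `os.environ.get` silently returns `None` and the failure only shows up later. A small guard can fail fast instead. This is a sketch: `require_env` is a hypothetical helper, and the `env` parameter exists only so the example can run against a stand-in dictionary.

```python
import os
from collections.abc import Mapping


def require_env(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a stand-in environment (the key value is a placeholder)
fake_env = {"PINECONE_API_KEY": "example-key"}
print(require_env("PINECONE_API_KEY", env=fake_env))
```

In the notebook itself, this would be called as `require_env("PINECONE_API_KEY")` before constructing the client.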

Create (if needed), connect to, and inspect a Pinecone semantic search index.

In [None]:

index_name = "semantic-search"

if not pc.has_index(index_name):
    pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "llama-text-embed-v2",
            "field_map": {"text": "chunk_text"}
        }
    )

# Initialize index client
index = pc.Index(name=index_name)

# View index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}
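The stats above report a `cosine` metric over 1024-dimensional dense vectors. As a rough illustration of what that metric measures, the sketch below computes cosine similarity in plain Python. The 3-dimensional vectors are toy values chosen for readability and are not real embeddings. In this notebook the index computes similarity for us, so this code is for intuition only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed for illustration; real embeddings here have 1024 dimensions)
v1 = [1.0, 0.0, 0.0]
v2 = [2.0, 0.0, 0.0]  # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Cosine similarity depends only on direction, so `v1` and `v2` score as identical while `v1` and `v3` score as unrelated.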

Load the English-Spanish sentence pairs from the Tatoeba dataset. This language pair contains more than 200,000 translation pairs.

In [None]:
from datasets import load_dataset
# specify that we want the english-spanish translation pairs
tatoeba = load_dataset("Helsinki-NLP/tatoeba", lang1="en", lang2="es", trust_remote_code=True, split="train")


In [None]:
tatoeba[0:5]

{'id': ['0', '1', '2', '3', '4'],
 'translation': [{'en': "Let's try something.", 'es': '¡Intentemos algo!'},
  {'en': "Let's try something.", 'es': 'Intentemos algo.'},
  {'en': "Let's try something.", 'es': 'Permíteme hacer algo.'},
  {'en': "Let's try something.", 'es': 'Permíteme intentarlo.'},
  {'en': 'I have to go to sleep.', 'es': 'Tengo que irme a dormir.'}]}
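Note that slicing the dataset returns a dictionary of columns rather than a list of rows. The sketch below shows one way to turn that column-oriented shape into per-row dictionaries. It works on a plain dictionary copied from the output above, and `columns_to_rows` is a hypothetical helper, not part of the `datasets` library.

```python
def columns_to_rows(columns: dict[str, list]) -> list[dict]:
    """Convert {"col": [v0, v1, ...]} into [{"col": v0}, {"col": v1}, ...]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]


# First two entries of the slice shown above
batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Let's try something.", "es": "¡Intentemos algo!"},
        {"en": "Let's try something.", "es": "Intentemos algo."},
    ],
}

rows = columns_to_rows(batch)
for row in rows:
    print(row["id"], row["translation"]["en"], "->", row["translation"]["es"])
```

The same English sentence can appear with several Spanish translations, which is why duplicates show up later in the search results.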

In [None]:
keywords = ["fan"]

def simple_keyword_filter(sentence, keywords):
    # Return True if any keyword occurs in the sentence.
    # This is a case-sensitive substring match, so "fan" also matches "fans" and "fanning".
    for keyword in keywords:
        if keyword in sentence:
            return True
    return False

def transform_dataset_for_pinecone(dataset, use_filter=True):

    if use_filter:
        # keep only sentences containing a keyword; a small subset helps build intuition for semantic search
        translation_pairs = dataset.filter(
            lambda x: simple_keyword_filter(sentence=x["translation"]["en"], keywords=keywords)
        )
    else:
        # use the full 200k+ dataset. Run only if you want to embed this many records!
        translation_pairs = dataset

    # flatten and shuffle for ease of use
    translation_pairs = translation_pairs.flatten()
    translation_pairs = translation_pairs.shuffle(seed=1)

    english_sentences = translation_pairs.rename_column("translation.en", "text").remove_columns("translation.es")

    # add lang column to indicate embedding origin
    english_sentences = english_sentences.add_column("lang", ["en"]*len(english_sentences))


    records = []

    for idx, sentence in enumerate(english_sentences):
        # Here, we create a record for each sentence in the dataset
        # The record contains an ID and metadata fields which we can use to filter if desired
        # The chunk_text field is the text we will embed
        records.append(
            {
                "id": str(idx),
                "chunk_text": sentence["text"],
                "lang": sentence["lang"]
            }
        )

    # convert to record format
    return records


records = transform_dataset_for_pinecone(tatoeba)

Filter:   0%|          | 0/214127 [00:00<?, ? examples/s]

Flattening the indices:   0%|          | 0/134 [00:00<?, ? examples/s]
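The filter above uses Python's `in` operator, which is a case-sensitive substring test. The sketch below contrasts it with a whole-word, case-insensitive match to show which sentences each approach keeps. `whole_word_match` is a hypothetical alternative rather than what the notebook uses, and the last two sample sentences are made up for illustration.

```python
import re


def substring_match(sentence: str, keyword: str) -> bool:
    """Case-sensitive substring test, as used by simple_keyword_filter."""
    return keyword in sentence


def whole_word_match(sentence: str, keyword: str) -> bool:
    """Case-insensitive match on whole words only (assumed alternative)."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


samples = [
    "I want the fan.",
    "Tom is fanning himself.",
    "That was a fantastic meal.",  # hypothetical sentence
    "Fan mail arrived today.",     # hypothetical sentence
]

for s in samples:
    print(f"{substring_match(s, 'fan')!s:5} {whole_word_match(s, 'fan')!s:5} {s}")
```

The substring test keeps words such as "fanning" and "fantastic" but misses a capitalized "Fan". For this walkthrough that looseness is useful, since it puts several senses of "fan" into the index for the semantic queries to tell apart.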

In [None]:
from tqdm import tqdm

batch_size = 96
namespace = "english-sentences"


# upsert_records embeds each record's chunk_text with the index's integrated model.
# We upsert in batches of 96 to stay within the embedding model's limits.

for start in tqdm(range(0, len(records), batch_size), desc="Upserting records batch"):
    index.upsert_records(records=records[start:start + batch_size], namespace=namespace)

Upserting records batch: 100%|██████████| 2/2 [00:01<00:00,  1.13it/s]
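The progress bar shows two batches because the 134 filtered records do not fit into a single batch of 96. The batching logic can be sketched on its own as below. `batched` is a hypothetical helper and `dummy_records` is placeholder data with the same shape as the notebook's records.

```python
from collections.abc import Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder records, matching the count of filtered sentences above
dummy_records = [{"id": str(i), "chunk_text": f"sentence {i}"} for i in range(134)]

sizes = [len(batch) for batch in batched(dummy_records, 96)]
print(sizes)  # [96, 38]
```

Slicing past the end of a list is safe in Python, so the final batch simply holds the remaining 38 records.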


In [None]:
search_query = "I am your biggest fan"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

for result in results["result"]["hits"]:
    print(f'Sentence: {result["fields"]["chunk_text"]} Semantic Similarity Score: {result["_score"]}\n')

Sentence: I am a big golf fan. Semantic Similarity Score: 0.27246353030204773

Sentence: I'm a great baseball fan. Semantic Similarity Score: 0.2553280293941498

Sentence: I am a big fan of the arts. Semantic Similarity Score: 0.2516880929470062

Sentence: I am a fan of cars. Semantic Similarity Score: 0.25155746936798096

Sentence: I am fan of football. Semantic Similarity Score: 0.2507578730583191

Sentence: I'm a big fan of golf. Semantic Similarity Score: 0.24699632823467255

Sentence: I'm a big fan of golf. Semantic Similarity Score: 0.24119894206523895

Sentence: I'm a Real Madrid fan. Semantic Similarity Score: 0.23950009047985077

Sentence: I am a fan of the theater. Semantic Similarity Score: 0.23386766016483307

Sentence: We're all big fans of your music around here. Semantic Similarity Score: 0.2291153371334076
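"I'm a big fan of golf." appears twice in these results because the dataset holds the same English sentence under several records. If repeated text is unwanted, the hits can be deduplicated after the search. The sketch below assumes hits shaped like those printed above (`fields.chunk_text` and `_score`). The `_id` values are placeholders and the scores are rounded from the output.

```python
def unique_hits(hits: list[dict]) -> list[dict]:
    """Keep the first (highest-ranked) hit for each distinct chunk_text."""
    seen = set()
    unique = []
    for hit in hits:
        text = hit["fields"]["chunk_text"]
        if text not in seen:
            seen.add(text)
            unique.append(hit)
    return unique


# Stand-in hits; ids are hypothetical
sample_hits = [
    {"_id": "a", "_score": 0.2470, "fields": {"chunk_text": "I'm a big fan of golf."}},
    {"_id": "b", "_score": 0.2412, "fields": {"chunk_text": "I'm a big fan of golf."}},
    {"_id": "c", "_score": 0.2395, "fields": {"chunk_text": "I'm a Real Madrid fan."}},
]

for hit in unique_hits(sample_hits):
    print(f'{hit["_score"]:.4f}  {hit["fields"]["chunk_text"]}')
```

Because the hits arrive sorted by score, keeping the first occurrence preserves the best-scoring copy of each sentence.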



In [None]:
search_query = "We definately need a fan in this hot summer"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

for result in results["result"]["hits"]:
    print(f'Sentence: {result["fields"]["chunk_text"]} Semantic Similarity Score: {result["_score"]}\n')

Sentence: Incidentally, this room doesn't have anything like an air conditioner. All it has is a hand-held paper fan. Semantic Similarity Score: 0.3651370704174042

Sentence: Ladies use fans when it is hot. Semantic Similarity Score: 0.34572702646255493

Sentence: When was the last time you used a fan? Semantic Similarity Score: 0.33806848526000977

Sentence: Incidentally, this room doesn't have anything like an air-conditioner. All it has is a fan. Semantic Similarity Score: 0.33342260122299194

Sentence: I want the fan. Semantic Similarity Score: 0.29499420523643494

Sentence: Tom is fanning himself. Semantic Similarity Score: 0.2875750958919525

Sentence: I don't feel like doing anything, and I'm not getting up to turn on the fan. Semantic Similarity Score: 0.28487861156463623

Sentence: Everyone is carrying fans. Semantic Similarity Score: 0.27976104617118835

Sentence: There's a fan on the table. Semantic Similarity Score: 0.2776767313480377

Sentence: It will be difficult to swee

In [None]:
search_query = "Stop fanning yourself"

results = index.search(
    namespace=namespace,
    query={
        "top_k": 10,
        "inputs": {
            'text': search_query
        }
    }
)

for result in results["result"]["hits"]:
    print(f'Sentence: {result["fields"]["chunk_text"]} Semantic Similarity Score: {result["_score"]}\n')

Sentence: Tom is fanning himself. Semantic Similarity Score: 0.4490232765674591

Sentence: I cannot fan myself with Taninna's magazine. She would get mad at me. Semantic Similarity Score: 0.35516420006752014

Sentence: Turn off the fan when you're done reading the book. Semantic Similarity Score: 0.3299453556537628

Sentence: When was the last time you used a fan? Semantic Similarity Score: 0.31535041332244873

Sentence: Everyone is carrying fans. Semantic Similarity Score: 0.30289211869239807

Sentence: I turned on the fan and directed it to the wall. Semantic Similarity Score: 0.29892972111701965

Sentence: Ladies use fans when it is hot. Semantic Similarity Score: 0.2847423255443573

Sentence: It will be difficult to sweep the room with the fan on! Semantic Similarity Score: 0.2679525315761566

Sentence: I don't feel like doing anything, and I'm not getting up to turn on the fan. Semantic Similarity Score: 0.2675994038581848

Sentence: Are you guys crazy? Turning the fan on when it'

In [None]:
pc.delete_index(name=index_name)