# Semantic Search 
* Semantic search seeks to improve search accuracy by understanding the meaning of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms and paraphrases.

* The idea behind semantic search is to embed all entries in your corpus, whether they be sentences, paragraphs, or documents, into a vector space.

* At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are found. These entries should have a high semantic overlap with the query.
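
The idea above can be sketched without any model at all. The snippet below is a minimal illustration, assuming hand-made 3-dimensional vectors in place of real embeddings (the names `toy_corpus` and `toy_query` and all the numbers are invented for this example); it picks the corpus entry whose vector has the highest cosine similarity to the query vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings": a real model would produce these vectors.
toy_corpus = {
    "A man is eating pasta.": [0.9, 0.1, 0.0],
    "A monkey is playing drums.": [0.1, 0.9, 0.2],
    "A cheetah is running behind its prey.": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a food-related query

# The closest entry is the one with the highest similarity to the query.
best_match = max(
    toy_corpus,
    key=lambda text: cosine_similarity(toy_query, toy_corpus[text]),
)
print(best_match)
```

With real embeddings the vectors have hundreds of dimensions, but the search step is the same comparison.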

# Symmetric semantic search
* For symmetric semantic search, your query and the entries in your corpus are of about the same length and have the same amount of content.

# Asymmetric semantic search
* For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query.

In [None]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [None]:
import requests


# Corpus
* A corpus is a collection of authentic text or audio organized into datasets. Authentic here means text written or audio spoken by a native of the language or dialect. A corpus can be made up of anything from newspapers, novels, recipes, and radio broadcasts to television shows, movies, and tweets. In natural language processing, a corpus contains text and speech data that can be used to train AI and machine learning systems.

In [None]:
# Download the example corpus (one sentence per line, CRLF-separated)
response = requests.get('https://raw.githubusercontent.com/laxmimerit/machine-learning-dataset/master/text-dataset-for-machine-learning/sbert-corpus.txt')
corpus = response.text.split('\r\n')

# Download the example queries in the same format
response = requests.get('https://raw.githubusercontent.com/laxmimerit/machine-learning-dataset/master/text-dataset-for-machine-learning/sbert-queries.txt')
queries = response.text.split('\r\n')
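
Splitting on `'\r\n'` works for these two files, but it depends on the files using Windows line endings. A more tolerant way to turn downloaded text into a list of entries is sketched below; `raw_text` is an invented string standing in for the downloaded text, and the filtering of blank lines is an assumption about what is wanted, not something the original cell does.

```python
# Invented sample with mixed line endings and a trailing newline.
raw_text = "A man is eating food.\r\nA man is eating pasta.\nA monkey is playing drums.\r\n"

# split('\r\n') misses the bare '\n' and leaves an empty last entry.
naive_entries = raw_text.split('\r\n')

# splitlines() handles '\n', '\r\n' and '\r'; the filter drops blank lines.
entries = [line.strip() for line in raw_text.splitlines() if line.strip()]

print(naive_entries)
print(entries)
```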

In [None]:
print(corpus)

['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.', 'The girl is carrying a baby.', 'The baby is carried by the woman', 'A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.', 'A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']


In [None]:
print(queries)

['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


In [None]:

# Embed the corpus and the queries with the same model, as PyTorch tensors
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
queries_embeddings = model.encode(queries, convert_to_tensor=True)

In [None]:
# Normalize the embeddings to unit length so that the dot product equals cosine similarity (faster to compute)
corpus_embeddings = util.normalize_embeddings(corpus_embeddings)
queries_embeddings = util.normalize_embeddings(queries_embeddings)
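
Why normalization helps can be checked on a small example. The sketch below uses plain Python lists and two made-up 2-dimensional vectors (`normalize` and `dot` are helper names invented here, not library functions): once both vectors have length 1, their dot product equals the cosine similarity of the original vectors, so the division by the norms can be skipped at search time.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity computed the long way, on the raw vectors.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Plain dot product on the unit-length versions.
dot_of_unit = dot(normalize(a), normalize(b))

print(cosine, dot_of_unit)
```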


In [None]:
len(corpus_embeddings[0])

384

In [None]:
# For each query, return the 3 corpus entries with the highest dot-product score
hits = util.semantic_search(queries_embeddings, corpus_embeddings, score_function=util.dot_score, top_k=3)


In [None]:
hits

[[{'corpus_id': 2, 'score': 1.0000001192092896},
  {'corpus_id': 0, 'score': 0.8384665250778198},
  {'corpus_id': 1, 'score': 0.7468276619911194}],
 [{'corpus_id': 8, 'score': 1.0000001192092896},
  {'corpus_id': 7, 'score': 0.7612733840942383},
  {'corpus_id': 3, 'score': 0.3815288245677948}],
 [{'corpus_id': 10, 'score': 1.0000001192092896},
  {'corpus_id': 9, 'score': 0.8703994154930115},
  {'corpus_id': 6, 'score': 0.37411707639694214}]]
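
The shape of this result (one list per query, each holding `corpus_id`/`score` dictionaries sorted by score) can be reproduced with a brute-force search. The function below is only a sketch of that idea for a single query, with an invented name (`top_k_hits`) and invented unit-length toy vectors; it is not the library's implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_hits(query_vec, corpus_vecs, top_k=3):
    """Score every corpus vector against the query and keep the best top_k."""
    scored = [
        {"corpus_id": i, "score": dot(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]

# Invented unit-length vectors; the query is identical to corpus entry 1.
toy_corpus_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
toy_query_vec = [0.6, 0.8]

toy_hits = top_k_hits(toy_query_vec, toy_corpus_vecs, top_k=2)
print(toy_hits)
```

As in the real output, an entry identical to the query ranks first with a score of about 1. The value `1.0000001192092896` in the output above is consistent with single-precision rounding: it exceeds 1 by roughly 1.19e-7, the spacing of 32-bit floats near 1.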

In [None]:
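# Print every query next to its top matches and their scores
for query, query_hits in zip(queries, hits):
    for hit in query_hits:
        corpus_id = hit['corpus_id']  # index into the corpus list
        score = hit['score']
        print(query, "<>", corpus[corpus_id], "-->", score)

    print()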

A man is eating pasta. <> A man is eating pasta. --> 1.0000001192092896
A man is eating pasta. <> A man is eating food. --> 0.8384665250778198
A man is eating pasta. <> A man is eating a piece of bread. --> 0.7468276619911194

Someone in a gorilla costume is playing a set of drums. <> Someone in a gorilla costume is playing a set of drums. --> 1.0000001192092896
Someone in a gorilla costume is playing a set of drums. <> A monkey is playing drums. --> 0.7612733840942383
Someone in a gorilla costume is playing a set of drums. <> The girl is carrying a baby. --> 0.3815288245677948

A cheetah chases prey on across a field. <> A cheetah chases prey on across a field. --> 1.0000001192092896
A cheetah chases prey on across a field. <> A cheetah is running behind its prey. --> 0.8703994154930115
A cheetah chases prey on across a field. <> A man is riding a white horse on an enclosed ground. --> 0.37411707639694214

