# Getting started with semantic search

In this recipe, we will take a first look at expanding search with the help of a
word2vec model. When we search for a term, we expect the search engine to return
documents that contain a synonym of that term, even when the query doesn't use the
exact word found in the document. Real search engines are far more complicated than
what we show here, but this should give you a taste of what it's like to build a
customizable search engine.

In [61]:
#!pip install whoosh # a small scale python search engine

# How to do it…
We will create a class for the Whoosh search engine that will create a document index
based on the IMDb file. Then, we will load the pretrained word2vec model and use it to
augment the queries we pass to the engine.

IMPORT HELPER METHODS AND CLASSES

In [62]:
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED, DATETIME
from whoosh.index import create_in
from whoosh.analysis import StemmingAnalyzer
from whoosh.qparser import MultifieldParser
import csv

WORD EMBEDDING

In [63]:
#!pip install gensim

In [64]:
from gensim.models import KeyedVectors
import numpy as np

In [65]:
# import gensim.downloader as api
# w2v_model = api.load("word2vec-google-news-300")

In [66]:
word2vec_model_path = r"/root/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz"

In [67]:
imdb_dataset_path = "/content/IMDB-Movie-Data.csv"
search_engine_index_path = "/content/"

Create the IMDBSearchEngine class. The complete code for this class can
be found in this book's GitHub repository. The most important part of it is the
query_engine method:

In [68]:
class IMDBSearchEngine:

    def __init__(self, index_path: str, dataset_path: str, load_existing: bool = False):
        self.index_path = index_path
        self.dataset_path = dataset_path
        self.load_existing = load_existing

    def query_engine(self, query: str):
        search_results = []

        try:
            # Hard-coded placeholder results; see the book's repository
            # for the complete implementation.
            search_results = [
                {"title": "Gargantuan Giant", "score": 0.99},
                {"title": "The Colossus", "score": 0.91},
                {"title": "Massive Attack", "score": 0.85},
            ]
        except Exception as e:
            print(f"ERROR: Search query failed with exception: {e}")

        # Report and return after the try block: a return inside `finally`
        # would silently discard any exception still propagating.
        print(f"Returning {len(search_results)} results.")
        return search_results
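
The query_engine method shown here returns fixed results, so it may help to see what a keyword index does underneath. The following is only a rough sketch using the standard library, not the Whoosh-based implementation from the book's repository. The sample rows, the `Title` and `Description` column names, and the helper names `tokenize`, `build_index`, and `search_any` are all assumptions made for illustration.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical sample standing in for the IMDb CSV file.
# The column names "Title" and "Description" are assumptions.
SAMPLE_CSV = """Title,Description
Colossal,A woman discovers a link between herself and a gigantic creature.
Small Town,A quiet drama about neighbours.
Big Trouble,An enormous misunderstanding spirals out of control.
"""


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(csv_file):
    """Map each token to the set of titles whose title or description contains it."""
    index = defaultdict(set)
    for row in csv.DictReader(csv_file):
        for token in tokenize(row["Title"] + " " + row["Description"]):
            index[token].add(row["Title"])
    return index


def search_any(index, terms):
    """Return titles matching at least one term (the effect of joining terms with OR)."""
    hits = set()
    for term in terms:
        hits |= index.get(term.lower(), set())
    return sorted(hits)


toy_index = build_index(io.StringIO(SAMPLE_CSV))
print(search_any(toy_index, ["gigantic", "enormous"]))  # ['Big Trouble', 'Colossal']
```

A real engine such as Whoosh adds stemming, scoring, and on-disk storage on top of this idea, which is why the recipe delegates the work to it.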


The get_similar_words function takes a word and the pretrained model and
returns the top three words that are similar to the given word:

In [69]:
def get_similar_words(model, search_term):
  similarity_list = model.most_similar(search_term, topn=3)

  similar_words = [sim_tuple[0] for sim_tuple in similarity_list]

  return similar_words
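
To give a feel for what a nearest-neighbour lookup of this kind computes, here is a minimal sketch that ranks words by cosine similarity. It is not gensim's implementation. The three-dimensional vectors are invented for illustration (the pretrained model used in this recipe has 300 dimensions), and `toy_most_similar` is a hypothetical helper.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
TOY_VECTORS = {
    "gigantic": [0.9, 0.1, 0.0],
    "huge": [0.8, 0.2, 0.1],
    "enormous": [0.85, 0.15, 0.05],
    "tiny": [-0.7, 0.3, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(vectors, word, topn=3):
    """Return (word, similarity) pairs for the topn nearest neighbours of `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


neighbours = toy_most_similar(TOY_VECTORS, "gigantic", topn=2)
print([word for word, _score in neighbours])  # ['enormous', 'huge']
```

Each result is a (word, score) pair, which is why get_similar_words keeps only the first element of every tuple.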

Now, we can initialize the search engine. Pass load_existing = False when the index
doesn't exist yet, and load_existing = True once you've already created it:

In [70]:
search_engine = IMDBSearchEngine(search_engine_index_path, imdb_dataset_path, load_existing = False)

Load the word2vec model:

In [71]:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(word2vec_model_path, binary = True)

Let's say a user wants to find the movie Colossal, but forgot its real name, so they
search for gigantic. We will use gigantic as the search term:

In [72]:
search_term = "gigantic"

We will get three words similar to the input word:

In [73]:
other_words = get_similar_words(model, search_term)
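
A user may also type a word that is not in the model's vocabulary, in which case gensim's most_similar raises a KeyError. One possible guard is sketched below. `safe_similar_words` is a hypothetical helper, and `StubModel` with its invented scores is only a stand-in for the loaded model so that the sketch runs without downloading any vectors.

```python
def safe_similar_words(model, search_term, topn=3):
    """Return similar words, or an empty list if the term is unknown to the model."""
    try:
        return [word for word, _score in model.most_similar(search_term, topn=topn)]
    except KeyError:
        # Out-of-vocabulary term: fall back to searching for the original word only.
        return []


class StubModel:
    """Hypothetical stand-in for a loaded word2vec model (scores are invented)."""

    _neighbours = {
        "gigantic": [("enormous", 0.9), ("huge", 0.8), ("massive", 0.7)],
    }

    def most_similar(self, word, topn=10):
        if word not in self._neighbours:
            raise KeyError(f"Key '{word}' not present")
        return self._neighbours[word][:topn]


print(safe_similar_words(StubModel(), "gigantic"))    # ['enormous', 'huge', 'massive']
print(safe_similar_words(StubModel(), "qwertyuiop"))  # []
```

With an empty list as the fallback, the query simply contains the original search term and nothing else.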

We will then query the engine to return all the movies containing those words:

In [74]:
results = search_engine.query_engine(" OR ".join([search_term] + other_words))

print(results)

Returning 3 results.
[{'title': 'Gargantuan Giant', 'score': 0.99}, {'title': 'The Colossus', 'score': 0.91}, {'title': 'Massive Attack', 'score': 0.85}]
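
Joining the raw neighbours with OR works for simple cases, but the neighbours returned by a pretrained model can include case variants of the search term and multi-word phrases joined with underscores. The sketch below shows one way to tidy the terms before building the query string. `build_or_query` is a hypothetical helper, the neighbour list in the usage line is made up, and quoting phrases assumes the engine's query language treats quoted text as a phrase.

```python
def build_or_query(search_term, similar_words):
    """Join the search term and its neighbours into a single OR query string."""
    seen = set()
    clauses = []
    for word in [search_term] + list(similar_words):
        # Normalize: underscores become spaces, and case differences are ignored.
        normalized = word.replace("_", " ").lower().strip()
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        # Quote multi-word phrases so that they stay together in the query
        # (assumes the query language supports quoted phrases).
        clauses.append(f'"{normalized}"' if " " in normalized else normalized)
    return " OR ".join(clauses)


# Made-up neighbour list, for illustration only.
print(build_or_query("gigantic", ["Gigantic", "humongous", "larger_than_life"]))
# gigantic OR humongous OR "larger than life"
```

The resulting string can be passed to query_engine in place of the plain join.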
