# Vector Search on Mongo Atlas

This Python notebook shows how to connect to MongoDB Atlas programmatically and how to run an Atlas Vector Search query.

In [1]:
import os, sys
import pprint
import json
import time

# Add the project root directory to the module search path
sys.path.insert(0, '../')

## Load Settings

In [2]:
# Load settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# For debugging purposes
# print (config)

ATLAS_URI = config.get('ATLAS_URI')
OPENAI_API_KEY = config.get("OPENAI_API_KEY")

# Confirm that the settings loaded, without echoing secret values
if not ATLAS_URI:
    raise ValueError("'ATLAS_URI' is not set. Please set it in your .env file to continue.")
print("ATLAS_URI connection string found.")

if not OPENAI_API_KEY:
    raise ValueError("'OPENAI_API_KEY' is not set. Please set it in your .env file to continue.")
print("OPENAI_API_KEY found.")

ATLAS_URI connection string found.
OPENAI_API_KEY found.


In [3]:
# Our variables
DB_NAME = 'sample_mflix'
COLLECTION_NAME = 'embedded_movies'
INDEX_NAME = 'idx_plot_embedding'

## Find My IP Address

This IP address should be added to Atlas' 'access list' for the connection to work. If you completed Quest 1, this should be configured correctly already.

In [4]:
from urllib.request import urlopen
ip = urlopen('https://api.ipify.org').read()
decoded_ip = ip.decode('utf-8')

print (f"My IP address is {decoded_ip} \nMake sure that this IP address is allowed to connect to cloud Atlas")

URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
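
The lookup above can hang or fail when the network blocks outbound requests. The following is a minimal sketch of a more defensive version, assuming it is acceptable to continue without the IP address. The helper name `get_public_ip` and its `timeout` and `opener` parameters are illustrative and not part of the original notebook.

```python
from urllib.request import urlopen


def get_public_ip(url="https://api.ipify.org", timeout=5, opener=urlopen):
    """Return the public IP address as a string, or None if the lookup fails."""
    try:
        # `timeout` (in seconds) stops the cell from hanging on a blocked network
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError as exc:
        # URLError and socket timeouts are both subclasses of OSError
        print(f"Could not determine the public IP address: {exc}")
        return None
```

Calling `get_public_ip()` then returns either the address to add to the Atlas access list or `None`, in which case the address can be looked up manually instead.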

## Initialize Mongo Atlas Client

We start by initializing a connection to MongoDB Atlas.

In [7]:
from AtlasClient import AtlasClient

atlas_client = AtlasClient (ATLAS_URI, DB_NAME)
print("Connected to the Mongo Atlas database!")

ConfigurationError: The resolution lifetime expired after 21.614 seconds: Server Do53:192.168.1.1@53 answered REFUSED; Server Do53:8.8.8.8@53 answered The DNS operation timed out.; Server Do53:8.8.4.4@53 answered The DNS operation timed out.; Server Do53:8.8.8.8@53 answered The DNS operation timed out.; Server Do53:8.8.4.4@53 answered The DNS operation timed out.; Server Do53:8.8.8.8@53 answered The DNS operation timed out.; Server Do53:8.8.4.4@53 answered The DNS operation timed out.; Server Do53:8.8.8.8@53 answered The DNS operation timed out.; Server Do53:8.8.4.4@53 answered The DNS operation timed out.; Server Do53:8.8.8.8@53 answered The DNS operation timed out.; Server Do53:8.8.4.4@53 answered The DNS operation timed out.

## Initialize OpenAI Client
Recall that we're using OpenAI as our embedding model. Although the embedded_movies dataset already contains embeddings, we still need an embedding model to generate embeddings for the input queries, so that each query can be compared against the vectors stored in the database (i.e. vectors are compared against vectors, not text against vectors).

In [8]:
from OpenAIClient import OpenAIClient

openAI_client = OpenAIClient(api_key=OPENAI_API_KEY)
print ("OpenAI client initialized!")

OpenAI client initialized!


With our OpenAI client initialized, let's do a **quick vectorization test** as a sanity check! Here we use the OpenAI embedding model to get the vector representation (i.e. numerical representation) of the string "a futuristic Sci-fi movie".

In [9]:
text = 'a futuristic Sci-fi movie'

embedding = openAI_client.get_embedding(text)
print(f"Text: '{text}'\nEmbedding length: {len(embedding)}\nFirst 10 numbers of embedding:", embedding[:10])

APITimeoutError: Request timed out.
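
A timeout like the one above is often transient, so one option is to retry the call a few times with exponential backoff. This is a minimal sketch under that assumption. `call_with_retries` and all of its parameters are hypothetical helpers introduced here, not part of `OpenAIClient`.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on `retry_on` with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            sleep(delay)
```

In this notebook the call would look like `embedding = call_with_retries(openAI_client.get_embedding, text)`. Narrowing `retry_on` to the timeout exception raised by the client avoids retrying errors that cannot succeed, such as an invalid API key.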

## Atlas Vector Search 
Now for the fun part! We are going to do an embedding search on our embedded_movies dataset based on movie plots. What this means is that we're searching for movies based on the **meaning** of their plots.

We're **not** searching for keywords within plots; we're searching for movies whose plots are semantically closest to our input query.

Check out the examples below!

In [10]:
query = "imaginary characters from outerspace at war with earthlings"

embedding = openAI_client.get_embedding(query)
movies = atlas_client.vector_search(collection_name=COLLECTION_NAME, index_name=INDEX_NAME, attr_name='plot_embedding', embedding_vector=embedding,limit=10 )
print (f"Found {len (movies)} movies")
for idx, movie in enumerate (movies):
    print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}\nplot: {movie["plot"]}\n')

APITimeoutError: Request timed out.
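
The source of `AtlasClient.vector_search` is not shown in this notebook. As a rough sketch of what such a method might send to Atlas, the function below builds an aggregation pipeline around the `$vectorSearch` stage. The function name, the projected fields, and the default of ten candidates per returned result are assumptions made for illustration.

```python
def build_vector_search_pipeline(index_name, attr_name, embedding_vector,
                                 limit=10, num_candidates=None):
    """Build an aggregation pipeline whose first stage is $vectorSearch."""
    if num_candidates is None:
        # Assumption: consider 10x more candidates than results returned
        num_candidates = limit * 10
    return [
        {
            "$vectorSearch": {
                "index": index_name,            # name of the vector search index
                "path": attr_name,              # field that stores the embeddings
                "queryVector": list(embedding_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            # Keep the fields printed by the notebook, plus the similarity score
            "$project": {
                "title": 1,
                "year": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With PyMongo, a pipeline like this would typically be run with `collection.aggregate(pipeline)`, which returns the matching documents ordered from most to least similar.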

In [None]:
query = "superheroes saving earth"

embedding = openAI_client.get_embedding(query)
movies = atlas_client.vector_search(collection_name=COLLECTION_NAME, index_name=INDEX_NAME, attr_name='plot_embedding', embedding_vector=embedding,limit=10 )
print (f"Found {len (movies)} movies")
for idx, movie in enumerate (movies):
    print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}\nplot: {movie["plot"]}\n')

## Try Your Own Searches!

In the sample searches above, the results are ranked by how closely the semantic meaning of each movie's `plot` field matches the query. This is the power of Atlas Vector Search: we search by comparing semantic meaning (i.e. comparing vectors) rather than by exact 1:1 value matching.
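
To make "comparing vectors" concrete, here is a toy sketch that ranks three made-up, 3-dimensional plot vectors against a query vector. It assumes cosine similarity as the measure. The notebook does not state which similarity function the `idx_plot_embedding` index uses, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "plot embeddings", invented purely for illustration
plots = {
    "alien invasion": [0.9, 0.1, 0.0],
    "space battle": [0.8, 0.3, 0.1],
    "romantic comedy": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

# Highest similarity first, mirroring how search results are ordered
ranked = sorted(
    plots,
    key=lambda title: cosine_similarity(query_vector, plots[title]),
    reverse=True,
)
print(ranked)  # ['alien invasion', 'space battle', 'romantic comedy']
```

The two space-themed vectors point in nearly the same direction as the query and rank first, even though no keywords are compared at any point.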

Now try a search of your own! **Replace the placeholder value in the string below with your own custom query**.

In [None]:
# TODO: enter your query here
#query = "REPLACE-WITH-YOUR-QUERY"
query = "romance"

embedding = openAI_client.get_embedding(query)
movies = atlas_client.vector_search(collection_name=COLLECTION_NAME, index_name=INDEX_NAME, attr_name='plot_embedding', embedding_vector=embedding,limit=10 )
print (f"Found {len (movies)} movies")
for idx, movie in enumerate (movies):
    print(f'{idx+1}\nid: {movie["_id"]}\ntitle: {movie["title"]},\nyear: {movie["year"]}\nplot: {movie["plot"]}\n')

Good job following along to the end! Now let's **head back to StackUp** to complete our submission.