[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/advanced_techniques/retrieval_strategies_mongodb_llamaindex.ipynb)

[![View Article](https://img.shields.io/badge/View%20Article-blue)](https://www.mongodb.com/developer/products/atlas/optimize-relevance-mongodb-llamaindex/?utm_campaign=devrel&utm_source=cross-post&utm_medium=organic_social&utm_content=https%3A%2F%2Fgithub.com%2Fmongodb-developer%2FGenAI-Showcase&utm_term=apoorva.joshi)


# Optimizing for relevance using MongoDB and LlamaIndex

In this notebook, we will explore and tune different retrieval options in MongoDB's LlamaIndex integration to get the most relevant results.

In [1]:
import warnings
warnings.filterwarnings('ignore')
import os
import logging
logging.getLogger('llama_index').setLevel(logging.ERROR)

## Step 1: Install libraries

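- **pymongo**: Python driver for interacting with MongoDB databases and collections
- **llama-index**: Python package for the LlamaIndex LLM framework
- **llama-index-embeddings-google-genai**: Python package to use Google GenAI embedding models via their LlamaIndex integration
- **llama-index-llms-gemini**: Python package to use Gemini models via their LlamaIndex integration
- **llama-index-vector-stores-mongodb**: Python package for MongoDB’s LlamaIndex integration
- **pandas**: Python package for loading and processing the dataset
- **google-genai**: Google's Gen AI Python SDK, used here to configure the embedding model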

In [38]:
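# Configure Gemini as the default LLM so LlamaIndex does not fall back to OpenAI
# (run this cell after the install cell below)
from llama_index.core import Settings
from llama_index.llms.gemini import Gemini

# Any Gemini Pro or Flash model can be used here
Settings.llm = Gemini(model="gemini-2.5-flash-lite")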


In [32]:
!pip install -qU pymongo llama-index llama-index-embeddings-google-genai llama-index-llms-gemini llama-index-vector-stores-mongodb pandas google-genai

## Step 2: Setup prerequisites

- **Set the MongoDB connection string**: Follow the steps [here](https://www.mongodb.com/docs/manual/reference/connection-string/) to get the connection string from the Atlas UI.

- **Set the Google API Key**: Obtain your key from Google AI Studio.

In [2]:
import getpass
import os

from pymongo import MongoClient

In [3]:
os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API Key: ")

In [4]:
MONGODB_URI = getpass.getpass("Enter your MongoDB URI: ")
os.environ['MONGODB_URI'] = MONGODB_URI
mongodb_client = MongoClient(
    MONGODB_URI, appname="retrieval_strategies_llamaindex"
)

## Step 3: Load and process the dataset

In [5]:
import pandas as pd
import os
import ast
from llama_index.core import Document

In [6]:
# Load pre-computed embeddings data
data_path = "./data/movies_with_embeddings.csv"
print(f"Loading {data_path}...")
data = pd.read_csv(data_path)
print(f"Loaded {len(data)} rows.")

Loading ./data/movies_with_embeddings.csv...
Loaded 8584 rows.


In [7]:
data.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,vote_count,cast,crew,genres_list,cast_list,languages_list,fullplot,rating,combined_text,embedding
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","['Animation', 'Comedy', 'Family']","['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...",['English'],"Led by Woody, Andy's toys live happily in his ...",7.7,"Title: Toy Story\nPlot: Led by Woody, Andy's t...","[-0.027750747, -0.0032961427, 0.00066539174, -..."
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","['Adventure', 'Fantasy', 'Family']","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['English', 'Français']",When siblings Judy and Peter discover an encha...,6.9,Title: Jumanji\nPlot: When siblings Judy and P...,"[-0.010704878, -0.015528211, -2.695813e-05, -0..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","['Romance', 'Comedy']","['Walter Matthau', 'Jack Lemmon', 'Ann-Margret...",['English'],A family wedding reignites the ancient feud be...,6.5,Title: Grumpier Old Men\nPlot: A family weddin...,"[-0.021056956, -0.020402517, -0.00059282366, -..."
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","['Comedy', 'Drama', 'Romance']","['Whitney Houston', 'Angela Bassett', 'Loretta...",['English'],"Cheated on, mistreated and stepped on, the wom...",6.1,"Title: Waiting to Exhale\nPlot: Cheated on, mi...","[-0.017479543, -0.014748784, -0.0129432585, -0..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",['Comedy'],"['Steve Martin', 'Diane Keaton', 'Martin Short...",['English'],Just when George Banks has recovered from his ...,5.7,Title: Father of the Bride Part II\nPlot: Just...,"[-0.018792322, -0.010316654, 0.001152085, -0.0..."


In [8]:
print("Parsing stringified lists (embeddings, genres, etc)...")

converters = {
    "embedding": ast.literal_eval,
    "genres_list": ast.literal_eval,
    "cast_list": ast.literal_eval,
    "languages_list": ast.literal_eval
}

for col, func in converters.items():
    if col in data.columns:
        # safer apply
        data[col] = data[col].apply(lambda x: func(x) if isinstance(x, str) else x)

print("Data parsing complete.")

Parsing stringified lists (embeddings, genres, etc)...
Data parsing complete.
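
To make the parsing step concrete, here is a minimal sketch of what `ast.literal_eval` does to a single stringified cell, using a value that mirrors the `genres_list` column shown earlier. The `parse_cell` helper is a hypothetical name introduced for illustration; it applies the same string check as the loop above, so missing values such as `NaN` pass through unchanged.

```python
import ast


def parse_cell(value):
    """Parse a stringified Python literal; leave non-strings (e.g. NaN) untouched."""
    return ast.literal_eval(value) if isinstance(value, str) else value


raw = "['Animation', 'Comedy', 'Family']"
parsed = parse_cell(raw)

# The CSV cell is a str; after parsing it is a real list
print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0])

# Non-string values are returned as-is
print(parse_cell(float("nan")))
```

`ast.literal_eval` only evaluates Python literals (lists, numbers, strings, and so on), which makes it a safer choice than `eval` for data read from a file.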


In [9]:
documents = []

for _, row in data.iterrows():
    # Use combined_text from the CSV
    text = row['combined_text']
    
    # Prepare metadata
    # Ensure we use valid values
    title = row['title'] if pd.notna(row['title']) else "Unknown"
    rating = row['rating'] if pd.notna(row['rating']) else 0
    languages = row['languages_list'] if isinstance(row['languages_list'], list) else []
    genres = row['genres_list'] if isinstance(row['genres_list'], list) else []

    metadata = {
        "title": title,
        "rating": rating,
        "languages": languages,
        "genres": genres
    }
    
    # Create Document with pre-computed embedding
    doc = Document(
        text=text,
        metadata=metadata,
        embedding=row['embedding'] 
    )
    documents.append(doc)

print(f"Created {len(documents)} documents with pre-computed embeddings.")

Created 8584 documents with pre-computed embeddings.


In [11]:
print(documents[0].text)

Title: Toy Story
Plot: Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.
Cast: Tom Hanks, Tim Allen, Don Rickles, Jim Varney, Wallace Shawn
Genres: Animation, Comedy, Family
Languages: English
Rating: 7.7


In [12]:
print(documents[0].metadata)

{'title': 'Toy Story', 'rating': 7.7, 'languages': ['English'], 'genres': ['Animation', 'Comedy', 'Family']}


## Step 4: Create MongoDB Atlas vector store

In [13]:
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.settings import Settings
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from google.genai.types import EmbedContentConfig
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from pymongo.errors import OperationFailure
from pymongo.operations import SearchIndexModel

In [14]:
# Initialize Google GenAI Embedding
model_name = "gemini-embedding-001"
embed_model = GoogleGenAIEmbedding(
    model_name=model_name,
    embedding_config=EmbedContentConfig(output_dimensionality=3072)
)
# Testing with one example
try:
    test_embed = embed_model.get_text_embedding("Hello World")
    print(f"Embedding successful. Dimension: {len(test_embed)}")
except Exception as e:
    print(f"Error initializing embedding model: {e}")


Embedding successful. Dimension: 3072


In [16]:
VS_INDEX_NAME = "vector_index"
FTS_INDEX_NAME = "fts_index"
DB_NAME = "llamaindex"
COLLECTION_NAME = "hybrid_search"
collection = mongodb_client[DB_NAME][COLLECTION_NAME]

In [None]:
# Ensure embed_model is defined before usage
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from google.genai.types import EmbedContentConfig

if 'embed_model' not in globals():
    print("Defining embed_model...")
    embed_model = GoogleGenAIEmbedding(
        model_name="gemini-embedding-001",
        embedding_config=EmbedContentConfig(output_dimensionality=3072)
    )


In [18]:
vector_store = MongoDBAtlasVectorSearch(
    mongodb_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    vector_index_name=VS_INDEX_NAME,
    fulltext_index_name=FTS_INDEX_NAME,
    embedding_key="embedding",
    text_key="text",
)

# If the collection already has documents with embeddings, build the index from the existing vector store
if collection.count_documents({}) > 0:
    print("Collection found on MongoDB. Loading index from store...")
    vector_store_index = VectorStoreIndex.from_vector_store(
        vector_store,
        embed_model=embed_model
    )
# If the collection does not have documents, embed and ingest them into the vector store
else:
    print("Collection empty. Ingesting documents...")
    vector_store_context = StorageContext.from_defaults(vector_store=vector_store)
    vector_store_index = VectorStoreIndex.from_documents(
        documents,
        storage_context=vector_store_context,
        show_progress=True,
        embed_model=embed_model
    )


Collection empty. Ingesting documents...


Parsing nodes:   0%|          | 0/8584 [00:00<?, ?it/s]

Generating embeddings: 0it [00:00, ?it/s]


## Step 5: Create Atlas Search indexes

In [24]:
vs_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 3072,
                "similarity": "cosine",
            },
            {"type": "filter", "path": "metadata.rating"},
            {"type": "filter", "path": "metadata.languages"},
        ]
    },
    name=VS_INDEX_NAME,
    type="vectorSearch",
)
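
The `numDimensions` value in the vector index definition needs to match the length of the stored embedding vectors (3072 here, the same value passed as `output_dimensionality` when the embedding model was created). A minimal sketch of a sanity check follows. `check_embedding_dims` is a hypothetical helper, and the toy vectors stand in for the `embedding` column of the dataset.

```python
from collections import Counter


def check_embedding_dims(embeddings, expected_dim):
    """Count vectors whose length differs from expected_dim, keyed by length."""
    return Counter(len(vec) for vec in embeddings if len(vec) != expected_dim)


# Toy stand-ins; in this notebook you might pass data["embedding"] and 3072
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]
mismatches = check_embedding_dims(toy_embeddings, expected_dim=3)

if mismatches:
    print(f"Unexpected vector lengths: {dict(mismatches)}")
else:
    print("All embeddings match the index dimension.")
```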

In [25]:
fts_model = SearchIndexModel(
    definition={"mappings": {"dynamic": False, "fields": {"text": {"type": "string"}}}},
    name=FTS_INDEX_NAME,
    type="search",
)

In [26]:
for model in [vs_model, fts_model]:
    try:
        collection.create_search_index(model=model)
    except OperationFailure as exc:
        # Most often raised when an index with this name already exists
        print(f"Index '{model.document['name']}' was not created: {exc}")

## Step 6: Get movie recommendations

In [27]:
def get_recommendations(query: str, mode: str, **kwargs) -> None:
    """
    Get movie recommendations

    Args:
        query (str): User query
        mode (str): Retrieval mode. One of (default, text_search, hybrid)
    """
    query_engine = vector_store_index.as_query_engine(
        similarity_top_k=5, vector_store_query_mode=mode, **kwargs
    )
    response = query_engine.query(query)
    nodes = response.source_nodes
    for node in nodes:
        title = node.metadata["title"]
        rating = node.metadata["rating"]
        score = node.score
        print(f"Title: {title} | Rating: {rating} | Relevance Score: {score}")

### Full-text search

In [35]:
get_recommendations(
    query="Action movies about humans fighting machines",
    mode="text_search",
)

Title: Sgt. Bilko | Rating: 5.5 | Relevance Score: 5.555953502655029
Title: Those Magnificent Men in Their Flying Machines or How I Flew from London to Paris in 25 hours 11 minutes | Rating: 6.4 | Relevance Score: 5.342971324920654
Title: A.I. Artificial Intelligence | Rating: 6.8 | Relevance Score: 5.013030052185059
Title: Maximum Overdrive | Rating: 5.5 | Relevance Score: 4.941791534423828
Title: Babylon 5: In the Beginning | Rating: 7.3 | Relevance Score: 4.64164924621582


### Vector search

In [39]:
get_recommendations(
    query="Action movies about humans fighting machines", mode="default"
)

Title: Eve of Destruction | Rating: 5.1 | Relevance Score: 0.8475072979927063
Title: Solo | Rating: 3.9 | Relevance Score: 0.8448247909545898
Title: The Terminator | Rating: 7.4 | Relevance Score: 0.8446205258369446
Title: RoboCop | Rating: 7.1 | Relevance Score: 0.8444143533706665
Title: Robot Jox | Rating: 5.3 | Relevance Score: 0.8411067128181458


### Hybrid search

In [40]:
# Vector and full-text search are weighted equally by default
get_recommendations(query="Action movies about humans fighting machines", mode="hybrid")

Title: Eve of Destruction | Rating: 5.1 | Relevance Score: 0.5
Title: Sgt. Bilko | Rating: 5.5 | Relevance Score: 0.5
Title: Those Magnificent Men in Their Flying Machines or How I Flew from London to Paris in 25 hours 11 minutes | Rating: 6.4 | Relevance Score: 0.25
Title: Solo | Rating: 3.9 | Relevance Score: 0.25
Title: The Terminator | Rating: 7.4 | Relevance Score: 0.16666666666666666


In [41]:
# Higher alpha, vector search dominates
get_recommendations(
    query="Action movies about humans fighting machines",
    mode="hybrid",
    alpha=0.7,
)

Title: Eve of Destruction | Rating: 5.1 | Relevance Score: 0.7
Title: Solo | Rating: 3.9 | Relevance Score: 0.35
Title: Sgt. Bilko | Rating: 5.5 | Relevance Score: 0.30000000000000004
Title: The Terminator | Rating: 7.4 | Relevance Score: 0.2333333333333333
Title: RoboCop | Rating: 7.1 | Relevance Score: 0.175


In [42]:
# Lower alpha, full-text search dominates
get_recommendations(
    query="Action movies about humans fighting machines",
    mode="hybrid",
    alpha=0.3,
)

Title: Sgt. Bilko | Rating: 5.5 | Relevance Score: 0.7
Title: Those Magnificent Men in Their Flying Machines or How I Flew from London to Paris in 25 hours 11 minutes | Rating: 6.4 | Relevance Score: 0.35
Title: Eve of Destruction | Rating: 5.1 | Relevance Score: 0.3
Title: A.I. Artificial Intelligence | Rating: 6.8 | Relevance Score: 0.2333333333333333
Title: Maximum Overdrive | Rating: 5.5 | Relevance Score: 0.175
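
The hybrid scores printed above are consistent with a weighted reciprocal-rank combination: each result contributes `alpha / rank` from the vector list and `(1 - alpha) / rank` from the full-text list, and a document that appears in both lists receives the sum. This is inferred from the printed numbers rather than taken from the library source, so treat the sketch below as an approximation. `fuse` is a hypothetical helper, and the two ranked lists are copied from the full-text and vector search outputs above.

```python
def fuse(vector_ranked, text_ranked, alpha=0.5, top_k=5):
    """Weighted reciprocal-rank fusion over two ranked lists of titles."""
    scores = {}
    for weight, ranked in ((alpha, vector_ranked), (1 - alpha, text_ranked)):
        for rank, title in enumerate(ranked, start=1):
            # A title found in both lists accumulates both contributions
            scores[title] = scores.get(title, 0.0) + weight / rank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Rankings taken from the vector search and full-text search outputs above
vector_ranked = ["Eve of Destruction", "Solo", "The Terminator", "RoboCop", "Robot Jox"]
text_ranked = [
    "Sgt. Bilko",
    "Those Magnificent Men in Their Flying Machines",  # title shortened here
    "A.I. Artificial Intelligence",
    "Maximum Overdrive",
    "Babylon 5: In the Beginning",
]

for alpha in (0.5, 0.7, 0.3):
    print(f"alpha={alpha}")
    for title, score in fuse(vector_ranked, text_ranked, alpha):
        print(f"  {title}: {score:.4f}")
```

With `alpha=0.7` the top result scores `0.7 / 1` and the second `0.7 / 2`, matching the 0.7 and 0.35 shown above. Results with equal scores may appear in a different order than in the notebook output.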


### Combining metadata filters with search

In [43]:
from llama_index.core.vector_stores import (
    FilterCondition,
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

In [44]:
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="metadata.rating", value=7, operator=FilterOperator.GT),
        MetadataFilter(
            key="metadata.languages", value="English", operator=FilterOperator.EQ
        ),
    ],
    condition=FilterCondition.AND,
)
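
To make the intent of these filters concrete, the sketch below applies the same two conditions in plain Python to metadata dictionaries like the ones stored on each document. It assumes MongoDB's usual array semantics, where an equality match on an array field succeeds if any element equals the value. `passes_filters` is a hypothetical helper for illustration and is not part of LlamaIndex; the real filtering happens inside Atlas.

```python
def passes_filters(metadata, min_rating=7, language="English"):
    """Plain-Python mirror of: rating > min_rating AND language in languages."""
    rating_ok = metadata.get("rating", 0) > min_rating
    language_ok = language in metadata.get("languages", [])
    return rating_ok and language_ok


# Metadata values mirror the first two rows of the dataset shown earlier
samples = [
    {"title": "Toy Story", "rating": 7.7, "languages": ["English"]},
    {"title": "Jumanji", "rating": 6.9, "languages": ["English", "Français"]},
]

kept = [m["title"] for m in samples if passes_filters(m)]
print(kept)
```

Because the operator is `GT` rather than `GTE`, a movie rated exactly 7 would be excluded.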

In [45]:
get_recommendations(
    query="Action movies about humans fighting machines",
    mode="hybrid",
    alpha=0.7,
    filters=filters,
)

Title: The Terminator | Rating: 7.4 | Relevance Score: 0.7
Title: RoboCop | Rating: 7.1 | Relevance Score: 0.35
Title: Babylon 5: In the Beginning | Rating: 7.3 | Relevance Score: 0.30000000000000004
Title: The Matrix | Rating: 7.9 | Relevance Score: 0.235
Title: Terminator 2: Judgment Day | Rating: 7.7 | Relevance Score: 0.2333333333333333


## Hybrid Search with Different Alpha Values
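The cells below rerun hybrid search on two more queries with `alpha` set to 0.3 and 0.7, to show how it shifts the balance between vector search and full-text (keyword) search. Higher values favor vector search; lower values favor full-text search.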

In [50]:
query_str = "Funny romantic movie with a sad ending"

print("--- Alpha = 0.3 (More Keyword/Text focused) ---")
get_recommendations(
    query=query_str,
    mode="hybrid",
    alpha=0.3
)

--- Alpha = 0.3 (More Keyword/Text focused) ---
Title: Fast, Cheap & Out of Control | Rating: 8.3 | Relevance Score: 0.7
Title: Jails, Hospitals & Hip-Hop | Rating: 0.0 | Relevance Score: 0.35
Title: Funny About Love | Rating: 4.8 | Relevance Score: 0.3
Title: State and Main | Rating: 6.5 | Relevance Score: 0.2333333333333333
Title: Singles | Rating: 6.6 | Relevance Score: 0.175


In [51]:
query_str = "Funny romantic movie with a sad ending"

print("--- Alpha = 0.7 (More Vector/Semantic focused) ---")
get_recommendations(
    query=query_str,
    mode="hybrid",
    alpha=0.7
)

--- Alpha = 0.7 (More Vector/Semantic focused) ---
Title: Funny About Love | Rating: 4.8 | Relevance Score: 0.7
Title: Funny Felix | Rating: 5.1 | Relevance Score: 0.35
Title: Fast, Cheap & Out of Control | Rating: 8.3 | Relevance Score: 0.30000000000000004
Title: Sleepless in Seattle | Rating: 6.5 | Relevance Score: 0.2333333333333333
Title: Modern Romance | Rating: 6.7 | Relevance Score: 0.175


In [52]:
query_str = "Cyberpunk sci-fi with ai takeover"

print("--- Alpha = 0.3 (More Keyword/Text focused) ---")
get_recommendations(
    query=query_str,
    mode="hybrid",
    alpha=0.3
)

--- Alpha = 0.3 (More Keyword/Text focused) ---
Title: CQ | Rating: 6.0 | Relevance Score: 0.7
Title: Galaxy Quest | Rating: 6.9 | Relevance Score: 0.35
Title: A.I. Artificial Intelligence | Rating: 6.8 | Relevance Score: 0.3
Title: Amazon Women on the Moon | Rating: 6.0 | Relevance Score: 0.2333333333333333
Title: Logan's Run | Rating: 6.6 | Relevance Score: 0.175


In [53]:
query_str = "Cyberpunk sci-fi with ai takeover"

print("--- Alpha = 0.7 (More Vector/Semantic focused) ---")
get_recommendations(
    query=query_str,
    mode="hybrid",
    alpha=0.7
)

--- Alpha = 0.7 (More Vector/Semantic focused) ---
Title: A.I. Artificial Intelligence | Rating: 6.8 | Relevance Score: 0.7
Title: Runaway | Rating: 5.0 | Relevance Score: 0.35
Title: CQ | Rating: 6.0 | Relevance Score: 0.30000000000000004
Title: Nemesis 2 - Nebula | Rating: 4.6 | Relevance Score: 0.2333333333333333
Title: RoboCop | Rating: 7.1 | Relevance Score: 0.175
