In [1]:
!pip install datasets



# MongoDB + Pinecone Hybrid Retrieval Mini Pipeline (Part 2)

This notebook extends the earlier pipeline by:
- Sampling a subset of the MongoDB embedded movies dataset
- Cleaning & filtering (dropping rows with a null `fullplot` and removing the pre-existing `plot_embedding` column)
- Generating fresh sentence embeddings (`thenlper/gte-large`)
- Ingesting documents into MongoDB (metadata + text)
- Indexing embeddings in Pinecone for external vector similarity
- Performing semantic query → Pinecone match IDs → Mongo backfill → context assembly
- Building a grounded prompt and generating an answer with a Gemini model

Security: All credentials (MongoDB username/password, Pinecone API key, Google API key) must be supplied via secure environment variables or secret stores. Placeholders marked with `Write your own password` or similar should be replaced locally, never committed.
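
One way to follow this guidance is to read each secret from the environment and fail early when it is missing. The sketch below is a minimal illustration, not part of the original notebook: the helper name `require_env` and the variable names in the commented usage are assumptions.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical variable names; adapt them to your own secret store.
# mongo_user = require_env("MONGODB_USERNAME")
# pinecone_key = require_env("PINECONE_API_KEY")
```

Failing at startup keeps a missing key from surfacing later as a confusing authentication error.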

Each code cell is followed by an explanatory markdown cell for open-source clarity.

In [2]:
!pip install pandas



In [3]:
from datasets import load_dataset

In [4]:
dataset=load_dataset("MongoDB/embedded_movies")



In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['plot', 'runtime', 'genres', 'fullplot', 'directors', 'writers', 'countries', 'poster', 'languages', 'cast', 'title', 'num_mflix_comments', 'rated', 'imdb', 'awards', 'type', 'metacritic', 'plot_embedding'],
        num_rows: 1500
    })
})

In [6]:
import pandas as pd
data=pd.DataFrame(dataset["train"])

In [7]:
data.shape

(1500, 18)

In [8]:
data.head()

Unnamed: 0,plot,runtime,genres,fullplot,directors,writers,countries,poster,languages,cast,title,num_mflix_comments,rated,imdb,awards,type,metacritic,plot_embedding
0,Young Pauline is left a lot of money when her ...,199.0,[Action],Young Pauline is left a lot of money when her ...,"[Louis J. Gasnier, Donald MacKenzie]","[Charles W. Goddard (screenplay), Basil Dickey...",[USA],https://m.media-amazon.com/images/M/MV5BMzgxOD...,[English],"[Pearl White, Crane Wilbur, Paul Panzer, Edwar...",The Perils of Pauline,0,,"{'id': 4465, 'rating': 7.6, 'votes': 744}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[0.0007293965299999999, -0.026834568000000003,..."
1,A penniless young man tries to save an heiress...,22.0,"[Comedy, Short, Action]",As a penniless man worries about how he will m...,"[Alfred J. Goulding, Hal Roach]",[H.M. Walker (titles)],[USA],https://m.media-amazon.com/images/M/MV5BNzE1OW...,[English],"[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...",From Hand to Mouth,0,TV-G,"{'id': 10146, 'rating': 7.0, 'votes': 639}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.022837115, -0.022941574000000003, 0.014937..."
2,"Michael ""Beau"" Geste leaves England in disgrac...",101.0,"[Action, Adventure, Drama]","Michael ""Beau"" Geste leaves England in disgrac...",[Herbert Brenon],"[Herbert Brenon (adaptation), John Russell (ad...",[USA],,[English],"[Ronald Colman, Neil Hamilton, Ralph Forbes, A...",Beau Geste,0,,"{'id': 16634, 'rating': 6.9, 'votes': 222}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[0.00023330492999999998, -0.028511643000000003..."
3,"Seeking revenge, an athletic young man joins t...",88.0,"[Adventure, Action]",A nobleman vows to avenge the death of his fat...,[Albert Parker],"[Douglas Fairbanks (story), Jack Cunningham (a...",[USA],https://m.media-amazon.com/images/M/MV5BMzU0ND...,,"[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...",The Black Pirate,1,,"{'id': 16654, 'rating': 7.2, 'votes': 1146}","{'nominations': 0, 'text': '1 win.', 'wins': 1}",movie,,"[-0.005927917, -0.033394486, 0.0015323418, -0...."
4,An irresponsible young millionaire changes his...,58.0,"[Action, Comedy, Romance]","The Uptown Boy, J. Harold Manners (Lloyd) is a...",[Sam Taylor],"[Ted Wilde (story), John Grey (story), Clyde B...",[USA],https://m.media-amazon.com/images/M/MV5BMTcxMT...,[English],"[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...",For Heaven's Sake,0,PASSED,"{'id': 16895, 'rating': 7.6, 'votes': 918}","{'nominations': 1, 'text': '1 nomination.', 'w...",movie,,"[-0.0059373598, -0.026604708, -0.0070914757000..."


In [9]:
data=data.sample(80)

In [10]:
data.shape

(80, 18)

In [11]:
data.columns

Index(['plot', 'runtime', 'genres', 'fullplot', 'directors', 'writers',
       'countries', 'poster', 'languages', 'cast', 'title',
       'num_mflix_comments', 'rated', 'imdb', 'awards', 'type', 'metacritic',
       'plot_embedding'],
      dtype='object')

In [12]:
data.isnull().sum()

column,null_count
plot,0
runtime,2
genres,0
fullplot,2
directors,0
writers,1
countries,0
poster,7
languages,0
cast,0


In [13]:
data=data.dropna(subset=["fullplot"])

In [14]:
data=data.drop(columns=["plot_embedding"])

In [15]:
data.isnull().sum()

column,null_count
plot,0
runtime,2
genres,0
fullplot,0
directors,0
writers,1
countries,0
poster,7
languages,0
cast,0


In [16]:
!pip install sentence_transformers



In [17]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("thenlper/gte-large")

In [18]:
!pip install pymongo



In [None]:
import os
from pymongo.mongo_client import MongoClient
from urllib.parse import quote_plus

# Read credentials from environment variables (the variable names are assumed;
# use whatever names your secret store exposes). quote_plus escapes special characters.
username = quote_plus(os.environ.get("MONGODB_USERNAME", "Write your own username"))
password = quote_plus(os.environ.get("MONGODB_PASSWORD", "Write your own password"))

uri = f"mongodb+srv://{username}:{password}@cluster0.eh7hb.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"

client = MongoClient(uri)
try:
    # A successful ping confirms the cluster is reachable and the credentials are accepted
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [20]:
db=client["moviemydb"]

In [21]:
collection=db["moviemycollection"]

In [22]:
document=data.to_dict("records")

In [23]:
collection.insert_many(document)

InsertManyResult([ObjectId('688d22a7674a4d16f6dbb01a'), ObjectId('688d22a7674a4d16f6dbb01b'), ObjectId('688d22a7674a4d16f6dbb01c'), ObjectId('688d22a7674a4d16f6dbb01d'), ObjectId('688d22a7674a4d16f6dbb01e'), ObjectId('688d22a7674a4d16f6dbb01f'), ObjectId('688d22a7674a4d16f6dbb020'), ObjectId('688d22a7674a4d16f6dbb021'), ObjectId('688d22a7674a4d16f6dbb022'), ObjectId('688d22a7674a4d16f6dbb023'), ObjectId('688d22a7674a4d16f6dbb024'), ObjectId('688d22a7674a4d16f6dbb025'), ObjectId('688d22a7674a4d16f6dbb026'), ObjectId('688d22a7674a4d16f6dbb027'), ObjectId('688d22a7674a4d16f6dbb028'), ObjectId('688d22a7674a4d16f6dbb029'), ObjectId('688d22a7674a4d16f6dbb02a'), ObjectId('688d22a7674a4d16f6dbb02b'), ObjectId('688d22a7674a4d16f6dbb02c'), ObjectId('688d22a7674a4d16f6dbb02d'), ObjectId('688d22a7674a4d16f6dbb02e'), ObjectId('688d22a7674a4d16f6dbb02f'), ObjectId('688d22a7674a4d16f6dbb030'), ObjectId('688d22a7674a4d16f6dbb031'), ObjectId('688d22a7674a4d16f6dbb032'), ObjectId('688d22a7674a4d16f6dbb0

In [24]:
!pip install pinecone



In [None]:
import os

# Read the key from the environment (variable name assumed); the placeholder is only a fallback
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "Write your own password")

In [26]:
from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("mongo")
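
The cells in this notebook never write vectors to the `mongo` index, which is one plausible reason the query further down returns no matches. The sketch below shows how the stored documents could be embedded and upserted. It is an assumption-laden illustration, not the author's method: the function name `upsert_documents`, the `title` metadata field, and the batch size are all hypothetical. It also assumes the index was created with the same dimension as the model output (1024 for `thenlper/gte-large`; `embedding_model.get_sentence_embedding_dimension()` reports the value).

```python
from typing import Callable, Iterable, Sequence


def upsert_documents(index, encode: Callable[[list], Sequence], docs: Iterable[dict],
                     text_field: str = "fullplot", batch_size: int = 32) -> int:
    """Embed each document's text and upsert it into a Pinecone-style index.

    `index` only needs an `upsert(vectors=[...])` method; `encode` maps a list of
    strings to a sequence of vectors (for example `embedding_model.encode`).
    Returns the number of vectors written.
    """
    docs = list(docs)
    written = 0
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        vectors = encode([d[text_field] for d in batch])
        payload = [
            {
                # String form of the MongoDB _id, so ObjectId(id) works at query time
                "id": str(d["_id"]),
                # Plain Python floats; this also handles NumPy rows
                "values": [float(x) for x in vec],
                "metadata": {"title": str(d.get("title") or "")},
            }
            for d, vec in zip(batch, vectors)
        ]
        index.upsert(vectors=payload)
        written += len(payload)
    return written


# Assumed usage in this notebook (not executed here):
# upsert_documents(index, embedding_model.encode,
#                  collection.find({}, {"title": 1, "fullplot": 1}))
```

Using the MongoDB `_id` as the Pinecone ID is what makes the later `ObjectId(value)` lookup possible.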

In [27]:
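def get_result(query,similar_result=3):
  """Embed the query and return the top `similar_result` matches from the Pinecone index."""
  # encode() returns a NumPy array; Pinecone expects a plain list of floats
  embedding=embedding_model.encode(query).tolist()

  result=index.query(
    vector=embedding,
    top_k=similar_result,
  )
  return result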

In [28]:
query="what is the best horror movie to watch and why?"

In [29]:
result=get_result(query)



In [30]:
result

{'matches': [], 'namespace': '', 'usage': {'read_units': 1}}

In [31]:
from bson.objectid import ObjectId

In [32]:
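mylist=[]
for match in result["matches"]:
  # Each Pinecone ID is treated as the string form of a MongoDB _id
  doc=collection.find_one({"_id": ObjectId(match["id"])})
  if doc is not None:  # skip IDs that have no matching MongoDB document
    mylist.append(doc)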

In [33]:
mylist

[]
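
The loop above issues one `find_one` call per match. As a sketch of an alternative, the helper below fetches all matches in a single `$in` query and restores the ranking order returned by the vector search. The name `fetch_in_ranked_order` and its `to_object_id` parameter are hypothetical; in this notebook you would pass `to_object_id=ObjectId`.

```python
from typing import Any, Callable


def fetch_in_ranked_order(collection, ids: list,
                          to_object_id: Callable[[str], Any] = lambda s: s) -> list:
    """Fetch all documents for `ids` in one query, returned in the same order as `ids`.

    `collection` only needs a pymongo-style `find(filter)` method.
    IDs without a matching document are skipped.
    """
    if not ids:
        return []
    object_ids = [to_object_id(i) for i in ids]
    # `$in` does not guarantee result order, so index the results by _id first
    by_id = {doc["_id"]: doc for doc in collection.find({"_id": {"$in": object_ids}})}
    return [by_id[oid] for oid in object_ids if oid in by_id]


# Assumed usage in this notebook (not executed here):
# mylist = fetch_in_ranked_order(collection,
#                                [m["id"] for m in result["matches"]],
#                                to_object_id=ObjectId)
```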

In [34]:
combined_information = ""
for doc in mylist:
  # One line of context per retrieved movie
  combined_information += f"Title:{doc['title']}, fullplot: {doc['fullplot']}\n"

In [35]:
print(combined_information)




In [36]:
query

'what is the best horror movie to watch and why?'

In [37]:
prompt = f"Query: {query}\nContinue to answer the query by using the fullplot only:\n{combined_information}."

In [38]:
print(prompt)

Query: what is the best horror movie to watch and why?
Continue to answer the query by using the fullplot only:
.
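
The printed prompt contains no context because retrieval returned nothing, so the model has nothing to ground its answer in. A minimal sketch of a guard is shown below. The function name `build_prompt`, the `max_chars_per_doc` limit, and the instruction wording are assumptions for illustration, not the notebook's own prompt.

```python
def build_prompt(query: str, docs: list, max_chars_per_doc: int = 2000):
    """Return a grounded prompt, or None when there is no retrieved context."""
    if not docs:
        return None
    # Truncate each plot so a few long documents cannot dominate the prompt
    context = "\n".join(
        f"Title: {d['title']}, fullplot: {d['fullplot'][:max_chars_per_doc]}" for d in docs
    )
    return (
        f"Query: {query}\n"
        "Answer the query using only the fullplot text below. "
        "If it does not contain the answer, say so.\n"
        f"{context}"
    )
```

Checking for `None` before calling the model makes an empty retrieval visible instead of silently producing an ungrounded answer.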


In [39]:
%pip install --upgrade  langchain-google-genai



In [None]:
import os
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY') or "Write your own password"  # Provide real key securely
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

In [45]:
from langchain_google_genai import ChatGoogleGenerativeAI
def load_model(model_name="gemini-2.0-flash"):
  """Return a Gemini chat model for the given model name."""
  return ChatGoogleGenerativeAI(model=model_name)

In [46]:
model_text=load_model("gemini-2.0-flash")

In [47]:
model_text.invoke(prompt).content

'Okay, here\'s a response to the query "what is the best horror movie to watch and why?" using a full plot description to illustrate the potential effectiveness of a particular film, without actually explicitly naming it:\n\n"The \'best\' horror movie is subjective, of course, but one film stands out for its sustained tension, psychological depth, and truly unsettling imagery. Imagine a young woman, seemingly cursed, after a casual encounter. Following this encounter, she\'s haunted by visions of strangers. These aren\'t just fleeting glimpses; they\'re relentless, slow-moving figures who are always walking *toward* her.\n\nThe movie follows her frantic attempts to understand what\'s happening and escape this terrifying fate. She learns that she\'s been infected with a supernatural entity that passes between people through sexual contact. The entity manifests as different individuals, sometimes familiar faces, sometimes grotesque strangers, all relentlessly pursuing her.\n\nAs the thre

# About the Author

<div style="background-color: #f8f9fa; border-left: 5px solid #28a745; padding: 20px; margin-bottom: 20px; border-radius: 5px;">
  <h2 style="color: #28a745; margin-top: 0; font-family: 'Poppins', sans-serif;">Muhammad Atif Latif</h2>
  <p style="font-size: 16px; color: #495057;">Data Scientist & Machine Learning Engineer</p>
  
  <p style="font-size: 15px; color: #6c757d; margin-top: 15px;">
    Passionate about building AI solutions that solve real-world problems. Specialized in machine learning,
    deep learning, and data analytics with experience implementing production-ready models.
  </p>
</div>

## Connect With Me

<div style="display: flex; flex-wrap: wrap; gap: 10px; margin-top: 15px;">
  <a href="https://github.com/m-Atif-Latif" target="_blank">
    <img src="https://img.shields.io/badge/GitHub-Follow-212121?style=for-the-badge&logo=github" alt="GitHub">
  </a>
  <a href="https://www.kaggle.com/matiflatif" target="_blank">
    <img src="https://img.shields.io/badge/Kaggle-Profile-20BEFF?style=for-the-badge&logo=kaggle" alt="Kaggle">
  </a>
  <a href="https://www.linkedin.com/in/muhammad-atif-latif-13a171318" target="_blank">
    <img src="https://img.shields.io/badge/LinkedIn-Connect-0077B5?style=for-the-badge&logo=linkedin" alt="LinkedIn">
  </a>
  <a href="https://x.com/mianatif5867" target="_blank">
    <img src="https://img.shields.io/badge/Twitter-Follow-1DA1F2?style=for-the-badge&logo=twitter" alt="Twitter">
  </a>
  <a href="https://www.instagram.com/its_atif_ai/" target="_blank">
    <img src="https://img.shields.io/badge/Instagram-Follow-E4405F?style=for-the-badge&logo=instagram" alt="Instagram">
  </a>
  <a href="mailto:muhammadatiflatif67@gmail.com">
    <img src="https://img.shields.io/badge/Email-Contact-D14836?style=for-the-badge&logo=gmail" alt="Email">
  </a>
</div>

---

## Notebook Summary

In this continuation you:
- Sampled a subset of the movies dataset (80 of 1,500 rows) for faster iteration.
- Cleaned the data: dropped rows with a null `fullplot` and removed the legacy `plot_embedding` column so embeddings can be regenerated.
- Loaded `thenlper/gte-large` locally and used it to embed the query.
- Stored the cleaned documents (metadata + text) in MongoDB.
- Queried the Pinecone index `mongo` for the top candidate document IDs. In the run shown, the index returned no matches, which suggests it had not yet been populated with vectors for these documents.
- Backfilled matched IDs from MongoDB to obtain full plots + titles (nothing to fetch in this run).
- Assembled a prompt string from the retrieved context (empty in this run).
- Queried a Gemini model (`gemini-2.0-flash`). Because the context was empty, the answer shown draws on the model's own knowledge rather than on retrieved evidence.

Architecture Pattern:
User Query → Local Embedding → Pinecone Vector Search → IDs → Mongo Fetch → Context Assembly → LLM Generation.
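
The pattern above can be sketched as one function that wires the stages together. This is an illustrative outline under stated assumptions: the name `answer_query` and every callable parameter are hypothetical stand-ins for the notebook's embedding model, Pinecone index, MongoDB collection, and Gemini model.

```python
from typing import Callable, List, Sequence


def answer_query(
    query: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    fetch: Callable[[List[str]], List[dict]],
    build_prompt: Callable[[str, List[dict]], str],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Run query -> embedding -> vector search -> document fetch -> prompt -> LLM."""
    vector = embed(query)
    ids = search(vector, top_k)
    docs = fetch(ids)
    if not docs:
        # Without retrieved context the LLM would answer from its own knowledge,
        # so stop here instead of calling it.
        return "No matching documents were retrieved."
    return generate(build_prompt(query, docs))


# Assumed wiring for this notebook (not executed here):
# embed    = lambda q: embedding_model.encode(q).tolist()
# search   = lambda v, k: [m["id"] for m in index.query(vector=v, top_k=k)["matches"]]
# generate = lambda p: model_text.invoke(p).content
```

Passing each stage in as a callable keeps the flow testable without network access to Pinecone, MongoDB, or Gemini.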

Extensibility Ideas:
- Add caching for embeddings to avoid recomputation.
- Add reranking (e.g., cross-encoder) before context assembly.
- Stream responses token-by-token for improved UX.
- Persist query/response logs for evaluation & drift monitoring.
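
As a sketch of the first idea, an in-memory cache can wrap the encoder so that repeated queries are embedded only once. The name `make_cached_encoder` is hypothetical, and a plain dictionary is only one option; a persistent store would be needed to keep the cache across sessions.

```python
from typing import Callable, Dict, List


def make_cached_encoder(encode: Callable[[str], List[float]]) -> Callable[[str], List[float]]:
    """Wrap `encode` so each distinct text is embedded at most once per session."""
    cache: Dict[str, List[float]] = {}

    def cached(text: str) -> List[float]:
        if text not in cache:
            cache[text] = encode(text)
        return cache[text]

    return cached


# Assumed usage in this notebook (not executed here):
# cached_encode = make_cached_encoder(lambda q: embedding_model.encode(q).tolist())
```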

Crafted and refined by **Muhammad Atif Latif**, who builds practical hybrid retrieval + generation pipelines. Star the repo and use the links above to connect for more applied GenAI engineering content.