In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Lightweight RAG System for Movie Plots

This notebook builds a Retrieval-Augmented Generation (RAG) system that can answer questions about movie plots using the Wikipedia Movie Plots dataset.

## Architecture Overview
1. **Data Loading**: Load a subset of movie plots from Wikipedia
2. **Text Chunking**: Split long plots into manageable chunks
3. **Embedding**: Convert text chunks into vector representations
4. **Vector Store**: Store embeddings in FAISS for fast retrieval
5. **Retrieval**: Find most relevant chunks for a given query
6. **Generation**: Use an LLM to generate answers from retrieved context
7. **Structured Output**: Return answer with context and reasoning

## Step 1: Install Dependencies

**What we're installing:**
- `sentence-transformers`: Creates high-quality text embeddings (converts text to vectors)
- `faiss-cpu`: Fast vector similarity search library by Facebook
- `gemini`: Google API client
- `pandas`: For loading and processing the CSV dataset

**Why these choices:**
- Sentence-transformers is free and runs locally (since no API key needed for embeddings)
- FAISS is extremely fast for in-memory vector search
- We'll use Gemini API for generation (since its free)
- Pandas handles CSV data efficiently

In [3]:
!pip install -q sentence-transformers faiss-cpu pandas

[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m23.7/23.7 MB[0m [31m84.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m388.2/388.2 kB[0m [31m38.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [4]:
import pandas as pd
import numpy as np
import faiss
import json
from sentence_transformers import SentenceTransformer
from typing import List, Dict
import re
import os

## Step 2: Load the Kaggle Dataset

**Dataset source:**
- Kaggle: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
- File: `wiki_movie_plots_deduped.csv`


**Key decisions:**
- We'll take only 300-500 movies to keep it lightweight
- We filter out movies with very short plots (less than 100 characters)
- We keep only Title and Plot columns for simplicity

**Dataset structure:**
- Columns: Release Year, Title, Origin/Ethnicity, Director, Cast, Genre, Wiki Page, Plot

In [5]:
PROJECT_DIR = '/content/drive/MyDrive/Mini_RAG_System'
DATA_DIR = os.path.join(PROJECT_DIR, 'data')

os.makedirs(DATA_DIR, exist_ok=True)

print("Project directory created:")
print(PROJECT_DIR)
print(DATA_DIR)

CSV_PATH = '/content/drive/MyDrive/Mini_RAG_System/data/wiki_movie_plots_deduped.csv'

df = pd.read_csv(
    CSV_PATH,
    engine='python',
    on_bad_lines='skip'
)

print(f"Loaded {len(df)} movies")


Project directory created:
/content/drive/MyDrive/Mini_RAG_System
/content/drive/MyDrive/Mini_RAG_System/data
Loaded 34886 movies


## Step 3: Text Chunking

**Why we chunk:**
- Long plots (1000+ words) are too large for effective retrieval
- Smaller chunks improve precision - we get exactly the relevant part
- Most embedding models work best with 200-500 word chunks

**Our strategy:**
- Split plots into ~300-word chunks
- Add 50-word overlap between chunks to maintain context
- Keep track of which movie each chunk came from

**Example:**
A 1000-word plot becomes 3-4 chunks, each with the movie title attached so we know the source.

In [6]:
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """
    Split text into overlapping chunks based on word count.
    Handles None / NaN safely.
    """
    if text is None or (isinstance(text, float) and pd.isna(text)):
        return []

    text = str(text).strip()
    if not text:
        return []

    words = text.split()
    chunks = []

    step = max(1, chunk_size - overlap)

    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)

        if i + chunk_size >= len(words):
            break

    return chunks


# Create chunks with metadata
all_chunks = []
chunk_metadata = []

for idx, row in df.iterrows():
    title = row.get('Title', 'Unknown Title')
    plot = row.get('Plot')

    chunks = chunk_text(plot)

    if not chunks:
        continue

    for chunk_idx, chunk in enumerate(chunks):
        all_chunks.append(chunk)
        chunk_metadata.append({
            'title': title,
            'chunk_index': chunk_idx,
            'total_chunks': len(chunks),
            'text': chunk
        })

print(f"Created {len(all_chunks)} chunks from {len(df)} movies")
print(f"Average chunks per movie: {len(all_chunks) / len(df):.1f}")

print("\nSample chunk:")
print(f"Title: {chunk_metadata[0]['title']}")
print(f"Text: {chunk_metadata[0]['text'][:200]}...")


Created 65915 chunks from 34886 movies
Average chunks per movie: 1.9

Sample chunk:
Title: Kansas Saloon Smashers
Text: A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man...


## Step 4: Generate Embeddings

**What are embeddings:**
Embeddings convert text into numerical vectors (arrays of numbers) that capture semantic meaning. Similar texts have similar vectors.

**Model choice:**
We use `all-MiniLM-L6-v2` because:
- It's fast and lightweight (runs on CPU)
- Produces 384-dimensional vectors
- No API key required
- Good balance of quality and speed

**What happens:**
Each chunk gets converted to a 384-number vector. The model has learned to place semantically similar texts close together in this 384-dimensional space.

In [7]:
# Load embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for all chunks
print(f"Generating embeddings for {len(all_chunks)} chunks...")
embeddings = embedding_model.encode(
    all_chunks,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"\nEmbedding shape: {embeddings.shape}")
print(f"Each chunk is now represented as a {embeddings.shape[1]}-dimensional vector")

Loading embedding model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Generating embeddings for 65915 chunks...


Batches:   0%|          | 0/2060 [00:00<?, ?it/s]


Embedding shape: (65915, 384)
Each chunk is now represented as a 384-dimensional vector


## Step 5: Build FAISS Vector Store

**What is FAISS:**
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search. It finds the nearest neighbors to a query vector incredibly fast.

**How it works:**
- We create an index that stores all our embedding vectors
- When we query, FAISS compares the query vector to all stored vectors
- It returns the k most similar vectors using cosine similarity or L2 distance

**Index type:**
We use `IndexFlatL2` - a simple brute-force index perfect for our small dataset (300 movies ‚Üí ~500 chunks). For millions of vectors, you'd use approximate methods.

In [8]:
# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # L2 distance (Euclidean)

# Normalize embeddings for cosine similarity (optional but recommended)
faiss.normalize_L2(embeddings)

# Add embeddings to index
index.add(embeddings)

print(f"FAISS index created with {index.ntotal} vectors")
print(f"Index type: {type(index).__name__}")
print(f"Ready for similarity search!")

FAISS index created with 65915 vectors
Index type: IndexFlatL2
Ready for similarity search!


## Step 6: Implement Retrieval Function

**How retrieval works:**
1. Take a user query (e.g., "movies about AI")
2. Convert the query to an embedding using the same model
3. Search FAISS for the k most similar chunk embeddings
4. Return those chunks with their metadata (movie title, etc.)

**Key parameters:**
- `k`: Number of chunks to retrieve (typically 3-5)
- Higher k = more context but possibly more noise
- We return both the chunks and their similarity scores

In [9]:
def retrieve_relevant_chunks(query: str, k: int = 3) -> List[Dict]:
    """
    Retrieve the top-k most relevant chunks for a given query.
    """
    # Embed the query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_embedding)

    # Search the index
    distances, indices = index.search(query_embedding, k)

    # Gather results with metadata
    results = []
    for idx, distance in zip(indices[0], distances[0]):
        results.append({
            'title': chunk_metadata[idx]['title'],
            'text': chunk_metadata[idx]['text'],
            'chunk_index': chunk_metadata[idx]['chunk_index'],
            'similarity_score': float(1 / (1 + distance))  # Convert distance to similarity
        })

    return results


# Test retrieval
test_query = "movies about artificial intelligence"
print(f"Test query: '{test_query}'")
print("\nTop 3 retrieved chunks:")

retrieved = retrieve_relevant_chunks(test_query, k=3)
for i, result in enumerate(retrieved, 1):
    print(f"\n{i}. {result['title']} (similarity: {result['similarity_score']:.3f})")
    print(f"   {result['text'][:150]}...")

Test query: 'movies about artificial intelligence'

Top 3 retrieved chunks:

1. The Inerasable (similarity: 0.493)
   Ai is a mystery novel writer. She received a letter from Kubo, a reader of her novel and a university student. Kubo's letter states that she hears odd...

2. Stealth (similarity: 0.470)
   In the near future, the United States Navy develops an aviation program to deal with international terrorists and other enemies of the state quickly a...

3. Ulsaha Committee (similarity: 0.467)
   The film is about a school drop out whose pursuit for amazing scientific inventions lands him in trouble....


## Step 7: Setup LLM for Answer Generation

**What we're doing:**
Now we combine retrieval with generation. The LLM reads the retrieved chunks and generates a natural language answer.

**How it works:**
1. We provide the LLM with retrieved chunks as context
2. We give it the user's question
3. The LLM synthesizes an answer based only on the provided context
4. We request structured JSON output with answer, contexts, and reasoning


In [10]:
!pip install -q google-generativeai

In [11]:
import google.generativeai as genai

# Configure api key, after running, its been replaced with xxx for privacy and safety concerns
genai.configure(api_key="xxx")

print("Available models:")
for m in genai.list_models():
    if 'generateContent' in m.supported_generation_methods:
        print(m.name)

Available models:
models/gemini-2.5-flash
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-exp-1206
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/gemma-3-1b-it
models/gemma-3-4b-it
models/gemma-3-12b-it
models/gemma-3-27b-it
models/gemma-3n-e4b-it
models/gemma-3n-e2b-it
models/gemini-flash-latest
models/gemini-flash-lite-latest
models/gemini-pro-latest
models/gemini-2.5-flash-lite
models/gemini-2.5-flash-image-preview
models/gemini-2.5-flash-image
models/gemini-2.5-flash-preview-09-2025
models/gemini-2.5-flash-lite-preview-09-2025
models/gemini-3-pro-preview
models/gemini-3-flash-preview
models/gemini-3-pro-image-preview
models/nano-banana-pro-preview
models/gemini-robotics-er-1.5-preview
models/gemini-2.5

In [12]:
import google.generativeai as genai
import getpass

print("Please enter your Gemini API key:")
api_key = getpass.getpass("API Key: ")

genai.configure(api_key=api_key)

MODEL_NAME = "gemini-2.5-flash"
model = genai.GenerativeModel(MODEL_NAME)

print("‚úì Gemini API configured")
print(f"‚úì Gemini model initialized: {MODEL_NAME}")


Please enter your Gemini API key:
API Key: ¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑
‚úì Gemini API configured
‚úì Gemini model initialized: gemini-2.5-flash


## Step 8: Build the Complete RAG Pipeline

**The full RAG flow:**
1. **Retrieve**: Get relevant chunks from vector store
2. **Augment**: Package chunks as context for the LLM
3. **Generate**: LLM creates an answer using the context
4. **Structure**: Format output as JSON with answer, contexts, and reasoning

**Prompt engineering:**
We instruct the LLM to:
- Only use information from the provided contexts
- Cite which movie(s) the answer comes from
- Explain its reasoning process
- Admit if it can't answer based on available context

This is the core of RAG - grounding the LLM's responses in retrieved documents.

In [13]:
import json
from typing import Dict

def rag_query(question: str, k: int = 3) -> Dict:
    """
    Complete RAG pipeline: retrieve relevant chunks and generate an answer using Gemini.
    """
    # Step 1: Retrieve relevant chunks
    retrieved_chunks = retrieve_relevant_chunks(question, k=k)

    # Step 2: Format context for LLM
    context_text = "\n\n".join([
        f"Movie: {chunk['title']}\nPlot excerpt: {chunk['text']}"
        for chunk in retrieved_chunks
    ])

    # Step 3: Create prompt for LLM
    prompt = f"""
You are a helpful assistant that answers questions about movies based on plot summaries.

Using ONLY the following movie plot excerpts, answer the user's question.

CONTEXT:
{context_text}

QUESTION: {question}

Provide your response as a JSON object with these fields:
- "answer": A natural language answer to the question (mention specific movie titles)
- "contexts": A list of the movie titles you used to form your answer
- "reasoning": A brief explanation of how you formed the answer from the context

If the context doesn't contain enough information to answer the question, say so in the answer.

Return ONLY the JSON object, no other text.
"""

    # Step 4: Gemini call
    response = model.generate_content(
        prompt,
        generation_config={
            "temperature": 0.3,
            "max_output_tokens": 800
        }
    )

    response_text = response.text.strip()

    # Step 5: Parse JSON safely
    try:
        result = json.loads(response_text)
    except json.JSONDecodeError:
        result = {
            "answer": response_text,
            "contexts": [chunk['title'] for chunk in retrieved_chunks],
            "reasoning": "Answer generated using Gemini with retrieved context"
        }

    # Step 6: Attach retrieved chunks (for debugging / display)
    result['retrieved_chunks'] = [{
        'title': chunk['title'],
        'text': chunk['text'][:200] + '...',
        'similarity': chunk['similarity_score']
    } for chunk in retrieved_chunks]

    return result


## Step 9: Test the RAG System

**What we're testing:**
We'll run several example queries to demonstrate:
- Semantic search working correctly
- LLM synthesizing answers from multiple sources
- Proper citation of movie titles
- Reasoning about how the answer was formed

**Types of queries:**
1. Specific factual questions (e.g., "Which movie has HAL 9000?")
2. Thematic questions (e.g., "Movies about time travel")
3. Comparative questions (e.g., "Sci-fi movies with AI themes")

Each response shows the complete structured output with answer, contexts, and reasoning.

In [14]:
# Test with example queries
test_queries = [
    "Which movie features an AI system called HAL 9000?",
    "What movies involve time travel or time dilation?",
    "Tell me about movies with hitmen or assassins",
    "Which films explore dreams or dream worlds?"
]

print("=" * 80)
print("RAG SYSTEM TEST QUERIES")
print("=" * 80)

for i, query in enumerate(test_queries, 1):
    print(f"\n{'=' * 80}")
    print(f"Query {i}: {query}")
    print("=" * 80)

    result = rag_query(query, k=3)

    # Pretty print the result
    print("\nüìù ANSWER:")
    print(result['answer'])

    print("\nüìö CONTEXTS (Movies Used):")
    if isinstance(result['contexts'], list):
        for ctx in result['contexts']:
            print(f"  - {ctx}")
    else:
        print(f"  {result['contexts']}")

    print("\nüß† REASONING:")
    print(result['reasoning'])

    print("\nüîç Retrieved Chunks (with similarity scores):")
    for chunk in result['retrieved_chunks']:
        print(f"  - {chunk['title']} (similarity: {chunk['similarity']:.3f})")
        print(f"    {chunk['text']}")
        print()

RAG SYSTEM TEST QUERIES

Query 1: Which movie features an AI system called HAL 9000?

üìù ANSWER:
```json
{
  "answer": "The movie that features an AI system called HAL 9000 is 2001: A Space Odyssey.",
  "contexts": [
    "2001: A Space Odyssey"
  ],
  "reasoning": "The plot excerpt for '2001: A Space Odyssey' directly mentions 'Hal' and 'HAL 9000' when describing the AI system that interacts with the astronauts and eventually malfunctions."
}
```

üìö CONTEXTS (Movies Used):
  - Jedara Bale
  - Wizards
  - 2001: A Space Odyssey

üß† REASONING:
Answer generated using Gemini with retrieved context

üîç Retrieved Chunks (with similarity scores):
  - Jedara Bale (similarity: 0.492)
    Rajkumar plays a CID Police Agent, who is code-named as "999". The story revolves around the attempt to stop a formula which can convert any metal into gold reaching the hands of hooligans. Uday Kumar...

  - Wizards (similarity: 0.474)
    become bored or sidetracked in the midst of battle. Blackwolf t

## Step 10: Export Structured JSON Output

**Final step:**
We'll export the exact JSON structure requested in the requirements.

**JSON schema:**
```json
{
  "answer": "Natural language answer with movie citations",
  "contexts": ["Retrieved plot excerpt 1", "Retrieved plot excerpt 2"],
  "reasoning": "Explanation of how the answer was formed"
}
```

This format makes it easy to:
- Programmatically access the answer
- Verify which sources were used
- Understand the RAG system's decision-making process

In [15]:
# Example: Export clean JSON output
sample_query = "What movie involves a dystopian future with replicants?"

print(f"Query: {sample_query}\n")

result = rag_query(sample_query, k=3)

# Create clean JSON output (matching the spec req'd by assessment)
clean_output = {
    "answer": result['answer'],
    "contexts": [
        f"{chunk['title']}: {chunk['text']}"
        for chunk in result['retrieved_chunks']
    ],
    "reasoning": result['reasoning']
}

# Pretty print JSON
print("STRUCTURED JSON OUTPUT:")
print("=" * 80)
print(json.dumps(clean_output, indent=2))
print("=" * 80)

# Save to file
with open('rag_output.json', 'w') as f:
    json.dump(clean_output, f, indent=2)

print("\n‚úì Output saved to 'rag_output.json'")

Query: What movie involves a dystopian future with replicants?

STRUCTURED JSON OUTPUT:
{
  "answer": "```json\n{\n  \"answer\": \"Based on the provided plot excerpts, there is no information about any movie involving a dystopian future with replicants.\",\n  \"contexts\": [],\n  \"reasoning\": \"I reviewed the plot excerpts for 'Ocean Flame' and 'Current'. Neither excerpt contained any keywords or descriptions related to a dystopian future or replicants. Therefore, the question cannot be answered with the given information.\"\n}\n```",
  "contexts": [
    "Ocean Flame: The film tells the story of a punk's entanglements with a pure young girl....",
    "Ocean Flame: The film tells the story of a punk's entanglements with a pure young girl....",
    "Current: The film reveals the life and struggle of a farmer, who is tired of dealing with the corrupt systems in bureaucracy and politics at that time in India...."
  ],
  "reasoning": "Answer generated using Gemini with retrieved context"


## Interactive Query Interface

Takes custom user input and answers

**Example queries chosen by me and their respective results are displayed:**
- "Movies about space exploration"
- "Which film has a character named Morpheus?"
- "Stories involving the mafia or crime families"

In [16]:
# Interactive query function
def ask_question(question: str):
    print(f"\n{'=' * 80}")
    print(f"Q: {question}")
    print("=" * 80)

    result = rag_query(question, k=3)

    print("\nüìù ANSWER:")
    print(result['answer'])

    print("\nüìö SOURCES:")
    if isinstance(result['contexts'], list):
        for ctx in result['contexts']:
            print(f"  - {ctx}")

    print("\nüß† REASONING:")
    print(result['reasoning'])
    print("=" * 80)

    return result

# Try your own questions!
ask_question("What movies involve parallel realities or simulations?")


Q: What movies involve parallel realities or simulations?

üìù ANSWER:
```json
{
  "answer": "Based on the provided plot excerpts, none of the movies (For Love or Money, Sample People, A Paying Ghost) involve parallel realities or simulations. The context does not contain enough information to answer your question.",
  "contexts": [],
  "reasoning": "I reviewed the plot excerpts for 'For Love or Money', 'Sample People', and 'A Paying Ghost'. 'For Love or Money' only describes the availability of clips, not its plot. 'Sample People' describes multiple intersecting plot lines within a single reality, not parallel realities. 'A Paying Ghost' describes a ghost story, which involves supernatural elements but not parallel realities or simulations. Therefore, none of the provided information indicates any movie involving parallel realities or simulations."
}
```

üìö SOURCES:
  - For Love or Money
  -  Sample People
  - A Paying Ghost

üß† REASONING:
Answer generated using Gemini with ret

{'answer': '```json\n{\n  "answer": "Based on the provided plot excerpts, none of the movies (For Love or Money, Sample People, A Paying Ghost) involve parallel realities or simulations. The context does not contain enough information to answer your question.",\n  "contexts": [],\n  "reasoning": "I reviewed the plot excerpts for \'For Love or Money\', \'Sample People\', and \'A Paying Ghost\'. \'For Love or Money\' only describes the availability of clips, not its plot. \'Sample People\' describes multiple intersecting plot lines within a single reality, not parallel realities. \'A Paying Ghost\' describes a ghost story, which involves supernatural elements but not parallel realities or simulations. Therefore, none of the provided information indicates any movie involving parallel realities or simulations."\n}\n```',
 'contexts': ['For Love or Money', ' Sample People', 'A Paying Ghost'],
 'reasoning': 'Answer generated using Gemini with retrieved context',
 'retrieved_chunks': [{'title