
# Project: Explainable Movie Recommender using Generative AI

This notebook presents a project focused on building a movie recommendation system that goes beyond simply suggesting titles. The core goal is to provide **human-readable explanations** for *why* a particular movie is recommended.

## The Problem

Traditional recommendation systems, while effective, often function as "black boxes." Users receive suggestions but lack insight into the underlying reasoning. This lack of transparency can reduce trust and limit user understanding of the recommendations provided.

## The Solution

This project tackles the explainability problem by leveraging Generative AI capabilities. Instead of matching items through collaborative or content-based filtering alone, we use AI to understand movie content and user queries, find relevant movies, and then synthesize a natural language explanation for the recommendation.

## Gen AI Capabilities Used

This project specifically utilizes the following Generative AI techniques:

* **Embeddings:** Representing movie titles and descriptions as numerical vectors to capture semantic meaning and enable efficient similarity searching.
* **Retrieval Augmented Generation (RAG):** Combining a retrieval step (finding relevant movies using embeddings) with a generation step (using an LLM to formulate the explanation based on the retrieved information).
* **Prompting & Structured Output:** Crafting specific instructions (prompts) for the Large Language Model (LLM) to guide it towards coherent, relevant explanations, and asking it to return the result as a JSON object with fixed keys.
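
To make the embedding idea concrete, here is a minimal sketch of similarity search over toy vectors. The three-dimensional vectors and the query vector are invented for illustration only; real embeddings are produced by the embedding model and have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values, not model output).
toy_vectors = {
    "Inception": [0.9, 0.1, 0.2],
    "The Matrix": [0.8, 0.3, 0.1],
    "Spirited Away": [0.1, 0.9, 0.4],
}
toy_query = [0.85, 0.15, 0.2]  # pretend this encodes "mind-bending sci-fi"

# Rank titles from most to least similar to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda title: cosine(toy_query, toy_vectors[title]),
    reverse=True,
)
print(ranked)
```

The retrieval step of RAG is this ranking applied to real embeddings; the generation step then receives only the top-ranked movies as context.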

# Install key libraries

In [None]:
# Force upgrade/reinstall key libraries to ensure compatibility
!pip install -q -U numpy scipy scikit-learn pandas
!pip install -q -U google-generativeai

print("Libraries potentially updated. Please re-run the next cell (imports).")
# ===============================================

# **Imports & API Key**

In [None]:
import google.generativeai as genai
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import json
import os
import kaggle_secrets 

# Configure API Key using the imported module
try:
    
    client = kaggle_secrets.UserSecretsClient()
    GOOGLE_API_KEY = client.get_secret("GOOGLE_API_KEY")
    genai.configure(api_key=GOOGLE_API_KEY)
    print("Google AI API Key configured successfully using Kaggle Secrets.")

    # Initialize the generation model. Embeddings are requested later through
    # genai.embed_content(), so no separate embedding model object is needed.
    generation_model = genai.GenerativeModel('gemini-1.5-flash')
    print("Generation model initialized.")

except kaggle_secrets.SecretNotFoundError:
    print("ERROR: Kaggle Secret 'GOOGLE_API_KEY' not found.")
    print("Please ensure you have added your Google API key as a secret named 'GOOGLE_API_KEY' in your Kaggle notebook settings.")
except Exception as e:
    print(f"An unexpected error occurred during setup: {e}")
    # Potentially stop execution or handle the error appropriately
    raise # Re-raise the exception to halt execution if setup fails critically

# ===========================================================

# 1. Movie Data Preparation

## Data Loading and Preparation

This section handles the initial loading and preparation of the movie dataset used for the recommender.

* **Data Source:** For demonstration purposes in this notebook, a small sample of movie data (including Title, Genre, and Synopsis) is defined directly as a Python list of dictionaries.
* **Real-world Scenario:** In a production system, this data would typically be loaded from external sources like CSV files, databases, or APIs.
* **DataFrame Conversion:** The list of movie data is converted into a pandas DataFrame (`df_movies`) to facilitate data manipulation and integration with subsequent steps.
* **Text for Embeddings:** A key step here is the creation of the `text_for_embedding` column. This column concatenates the 'title', 'genre', and 'synopsis' fields into a single string. This combined text provides the comprehensive input required by the embedding model to generate a meaningful vector representation for each movie.
* **Verification:** The code prints the total number of movies loaded and displays the first few rows of the DataFrame to confirm successful loading and preparation.
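
As a rough sketch of the real-world scenario mentioned above, the same preparation can start from CSV data. The file name `movies.csv`, the helper `load_movies`, and the shortened synopses are assumptions made for this example; an in-memory string stands in for the file so the snippet runs on its own.

```python
import csv
import io

# Stand-in for a real file; in practice something like
# open("movies.csv", newline="", encoding="utf-8") would be used instead.
sample_csv = io.StringIO(
    'movie_id,title,genre,synopsis\n'
    '1,Inception,"Sci-Fi, Action, Thriller",A thief plants an idea through dream-sharing.\n'
    '2,Arrival,"Sci-Fi, Drama, Mystery",A linguist communicates with alien visitors.\n'
)

def load_movies(handle):
    """Read movie rows and build the combined text used for embedding."""
    rows = []
    for row in csv.DictReader(handle):
        row["movie_id"] = int(row["movie_id"])
        row["text_for_embedding"] = (
            f"{row['title']}. Genre: {row['genre']}. Synopsis: {row['synopsis']}"
        )
        rows.append(row)
    return rows

movies = load_movies(sample_csv)
print(f"Loaded {len(movies)} movies.")
print(movies[0]["text_for_embedding"])
```

Passing the resulting list of dictionaries to `pd.DataFrame(...)` should give a frame with the same columns as `df_movies`.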

In [None]:
# Simple movie dataset (Title, Genre, Synopsis)
# In a real scenario, this data would be loaded from a CSV file, database, or API
movie_data = [
    {"movie_id": 1, "title": "Inception", "genre": "Sci-Fi, Action, Thriller", "synopsis": "A thief who steals corporate secrets through use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O."},
    {"movie_id": 2, "title": "The Matrix", "genre": "Action, Sci-Fi", "synopsis": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers."},
    {"movie_id": 3, "title": "Blade Runner 2049", "genre": "Sci-Fi, Thriller, Mystery", "synopsis": "Young Blade Runner K's discovery of a long-buried secret leads him to track down former Blade Runner Rick Deckard, who's been missing for thirty years."},
    {"movie_id": 4, "title": "Parasite", "genre": "Comedy, Drama, Thriller", "synopsis": "Greed and class discrimination threaten the newly formed symbiotic relationship between the wealthy Park family and the destitute Kim clan."},
    {"movie_id": 5, "title": "Spirited Away", "genre": "Animation, Adventure, Family", "synopsis": "During her family's move to the suburbs, a sullen 10-year-old girl wanders into a world ruled by gods, witches, and spirits, and where humans are changed into beasts."},
    {"movie_id": 6, "title": "Interstellar", "genre": "Sci-Fi, Drama, Adventure", "synopsis": "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival."},
    {"movie_id": 7, "title": "Mad Max: Fury Road", "genre": "Action, Adventure, Sci-Fi", "synopsis": "In a post-apocalyptic wasteland, a woman rebels against a tyrannical ruler in search for her homeland with the help of a group of female prisoners, a psychotic worshiper, and a drifter named Max."},
    {"movie_id": 8, "title": "Arrival", "genre": "Sci-Fi, Drama, Mystery", "synopsis": "A linguist works with the military to communicate with alien lifeforms after twelve mysterious spacecraft appear around the world."},
    {"movie_id": 9, "title": "Eternal Sunshine of the Spotless Mind", "genre": "Drama, Romance, Sci-Fi", "synopsis": "When their relationship turns sour, a couple undergoes a medical procedure to have each other erased from their memories."},
    {"movie_id": 10, "title": "Pulp Fiction", "genre": "Crime, Drama", "synopsis": "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption."}
]

df_movies = pd.DataFrame(movie_data)

# Combine text features for embedding
df_movies['text_for_embedding'] = df_movies['title'] + ". Genre: " + df_movies['genre'] + ". Synopsis: " + df_movies['synopsis']

print(f"Loaded {len(df_movies)} movies.")
print(df_movies.head()) 

# 2. Generate Movie Embeddings

## Creating Vector Representations (Embeddings)

This is a fundamental step where we transform the textual information about each movie into a numerical format that computers can easily process for similarity comparisons.

* **Process:** The code iterates over each movie's `text_for_embedding` string created in the previous step.
* **Embedding Model:** It uses a pre-trained Generative AI embedding model (`models/text-embedding-004`) to generate a high-dimensional vector (the embedding) for each movie's text. The `RETRIEVAL_DOCUMENT` task type is specified, which is optimized for embedding documents that will be retrieved via search.
* **Purpose:** These vectors capture the semantic meaning and context of the movie's description. Movies with similar themes, genres, or plot elements are expected to have vectors that are numerically closer to each other in the embedding space.
* **Error Handling:** A mechanism is included to catch potential errors during the embedding generation for any movie. Movies that fail to generate an embedding are identified and removed from the dataset to ensure data integrity for the next steps.
* **Output:** The generated embeddings are added as a new column (`embedding`) to the DataFrame. Finally, these embeddings are converted into a NumPy array (`movie_embedding_array`). This array structure is essential for efficient vector similarity search operations (like cosine similarity) later in the notebook.
* **Efficiency Note:** For larger datasets, embedding generation is typically optimized by sending text in batches to the API rather than one by one, which is more efficient.
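
The batching idea from the efficiency note can be sketched as follows. The helpers `chunked` and `embed_in_batches` and the stand-in `fake_embed` are hypothetical names introduced for this example. The embedding call is passed in as a function so the snippet runs without an API key. The google-generativeai `embed_content` function is understood to accept a list of strings as `content`, but that should be checked against the SDK documentation before relying on it.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(texts, embed_fn, batch_size=4):
    """Embed texts batch by batch; a failed batch yields None for each of its texts."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        try:
            embeddings.extend(embed_fn(batch))
        except Exception as exc:
            # Keep positions aligned with the input texts so failures can be dropped later.
            print(f"Batch of {len(batch)} failed: {exc}")
            embeddings.extend([None] * len(batch))
    return embeddings

def fake_embed(batch):
    """Hypothetical stand-in for a real API call: one 2-d vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]

texts = [f"Movie {i}" for i in range(10)]
vectors = embed_in_batches(texts, fake_embed, batch_size=4)
print(len(vectors), vectors[0])
```

Because failed batches are recorded as `None`, the same "drop rows with missing embeddings" clean-up described above still applies unchanged.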

In [None]:
# Generate embeddings for each movie's text
# Note: Batching requests is more efficient for large datasets
movie_embeddings = []
print("Generating embeddings using model: models/text-embedding-004") # Optional: Confirm model

# Keep track of original number of movies
initial_movie_count = len(df_movies)

for idx, text in enumerate(df_movies['text_for_embedding']):
    current_movie_title = df_movies.iloc[idx]['title'] # Get title for better logging
    print(f"Embedding movie {idx+1}/{initial_movie_count}: '{current_movie_title}'...") 
    try:
        # Call genai.embed_content directly, specifying the model name
        response = genai.embed_content(
            model="models/text-embedding-004", 
            content=text,
            task_type="RETRIEVAL_DOCUMENT" #  RETRIEVAL_DOCUMENT for the items you'll search over
        )
        

        movie_embeddings.append(response['embedding'])

    except Exception as e:
        # Provide more specific error info
        print(f"ERROR embedding text for movie '{current_movie_title}': {e}")
        print(f"Failed text: {text}")
        movie_embeddings.append(None) # Handle potential errors by appending None

# Add embeddings to the DataFrame
df_movies['embedding'] = movie_embeddings

# Check how many failed before dropping
failed_count = df_movies['embedding'].isna().sum()
if failed_count > 0:
    print(f"\nWarning: Failed to generate embeddings for {failed_count} out of {initial_movie_count} movies.")

df_movies.dropna(subset=['embedding'], inplace=True) # Remove rows where embedding failed

# Convert list of embeddings into a NumPy array for similarity calculation
if not df_movies.empty:
    movie_embedding_array = np.array(df_movies['embedding'].tolist())
    print(f"\nSuccessfully generated embeddings for {len(df_movies)} movies.")
    print(f"Embedding array shape: {movie_embedding_array.shape}") # Should be (num_movies, embedding_dim)
else:
    movie_embedding_array = np.array([]) # Ensure it's an empty array if all failed
    print("\nERROR: No movie embeddings were successfully generated.")
    print(f"Embedding array shape: {movie_embedding_array.shape}")

# 3. Recommendation Function (RAG + Generation)



## Building the Explainable Recommender: Retrieval Augmented Generation (RAG)

This section defines the core function `get_explainable_recommendation` which orchestrates the process of generating a movie recommendation and its accompanying explanation based on a user query. This function implements the **Retrieval Augmented Generation (RAG)** pattern:

1.  **Retrieval (Query Embedding):** The user's input `user_query` is first converted into a vector embedding using the same embedding model (`models/text-embedding-004`), but with the `RETRIEVAL_QUERY` task type.
2.  **Retrieval (Similarity Search):** The query embedding is compared against every movie embedding in our dataset using cosine similarity. The function selects the `top_n + 1` most similar movies (one extra, in case the query names a movie that is itself in the dataset). These are the initial candidates.
3.  **Augmentation:** The detailed information (Title, Genre, Synopsis) for these top candidate movies is retrieved from the DataFrame. This information serves as the relevant "context" to augment the user's query for the next step.
4.  **Generation (Prompting & LLM Call):** A detailed prompt is dynamically constructed. This prompt includes the original `user_query` and the structured information about the `candidate_movies`. The prompt explicitly instructs the Large Language Model (LLM) to:
    * Select the *single best* recommendation from the provided candidates.
    * Generate a *concise explanation* for the choice, referencing aspects relevant to the user's query or the movies' details.
    * Format the output strictly as a JSON object with specified keys (`recommended_movie`, `explanation`).
    The function then calls the `generation_model` with this comprehensive prompt.
5.  **Output Parsing:** The text response received from the LLM is parsed. The code specifically looks for and extracts the JSON object containing the recommended movie title and its generated explanation, providing fallback mechanisms if the JSON format is unexpected.

The function returns the structured recommendation and explanation or an error message if the process could not be completed.
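
The output-parsing step can be illustrated in isolation. This is a minimal sketch, not the notebook's exact code: `extract_recommendation` and `REQUIRED_KEYS` are names chosen for this example, and the sample reply is invented rather than real model output.

```python
import json

REQUIRED_KEYS = {"recommended_movie", "explanation"}

def extract_recommendation(raw_text):
    """Return the parsed JSON object from an LLM reply, or None if unusable."""
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    # find/rfind return -1 when a brace is missing; the closing brace must follow the opening one.
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        return None
    return parsed

# Invented reply with chatter around the JSON object, as LLMs sometimes produce.
reply = (
    "Here is my pick:\n"
    '{"recommended_movie": "Arrival", "explanation": "It centers on communication."}\n'
    "Hope that helps!"
)
print(extract_recommendation(reply))
print(extract_recommendation("Sorry, I cannot help with that."))
```

Returning `None` for every failure mode keeps the caller's fallback logic in one place; the caller can then decide whether to surface the raw text instead.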

In [None]:
def get_explainable_recommendation(user_query, top_n=3):
    """
    Gets a movie recommendation with explanation based on user query.

    Args:
        user_query (str): The user's request (e.g., "a sci-fi movie like Blade Runner").
        top_n (int): Number of top similar movies to consider for the final recommendation.

    Returns:
        dict: A dictionary containing 'recommended_movie' and 'explanation', or an error message.
    """
    if not user_query:
        return {"error": "Please provide a query."}
    if movie_embedding_array.shape[0] == 0:
        return {"error": "No movie embeddings available."}


    try:
        # 1. Embed the user query (RAG - Retrieval Step 1)
        query_embedding_response = genai.embed_content(
            model="models/text-embedding-004",
            content=user_query,
            task_type="RETRIEVAL_QUERY" # Use RETRIEVAL_QUERY for the search query
        )
        query_embedding = np.array(query_embedding_response['embedding']).reshape(1, -1)

        # 2. Find similar movies using Cosine Similarity (RAG - Retrieval Step 2)
        similarities = cosine_similarity(query_embedding, movie_embedding_array)[0]

        # Get indices of the most similar movies.
        # One extra candidate (top_n + 1) is retrieved so the LLM still has top_n
        # alternatives if the query itself names a movie that is in the dataset.
        num_candidates = min(top_n + 1, len(similarities))
        top_indices = np.argsort(similarities)[-num_candidates:][::-1] # Highest similarity first

        # 3. Prepare context for the LLM (RAG - Augmentation Step)
        candidate_movies = []
        print(f"\n--- Debug: Top {num_candidates} candidates based on similarity ---")
        for i in top_indices:
            movie_info = df_movies.iloc[i]
            # Optional: Exclude the exact movie mentioned in the query if found?
            # For simplicity now, we don't exclude, LLM should handle it.
            candidate_movies.append({
                "title": movie_info["title"],
                "genre": movie_info["genre"],
                "synopsis": movie_info["synopsis"]
            })
            print(f"- {movie_info['title']} (Similarity: {similarities[i]:.4f})")
        print("------------------------------------------------------")


        # 4. Generate Recommendation and Explanation using LLM (Generation Step)
        # Construct a detailed prompt
        prompt = f"""
        User query: "{user_query}"

        Based on the user query, consider these potentially relevant movies:
        {json.dumps(candidate_movies, indent=2)}

        Your task:
        1. Select the *single best* movie recommendation from the list above that matches the user query.
        2. Provide a concise explanation (2-3 sentences) for *why* this movie is a good recommendation, connecting it to the user's query or the potential themes/genres implied by the query.
        3. Format your response strictly as a JSON object with keys "recommended_movie" and "explanation".

        Example Input Query: "Suggest a dark sci-fi thriller"
        Example Output JSON:
        {{
          "recommended_movie": "Blade Runner 2049",
          "explanation": "Blade Runner 2049 fits your request for a dark sci-fi thriller. Like the original Blade Runner often associated with the genre, it features a moody atmosphere, complex philosophical themes, and a compelling mystery within its futuristic setting."
        }}

        Now, generate the JSON output for the user query: "{user_query}"
        """

        # Call the Gemini model
        response = generation_model.generate_content(prompt)

        # 5. Parse the response
        raw_response_text = response.text.strip()
        print(f"\n--- Debug: Raw LLM Response ---")
        print(raw_response_text)
        print("---------------------------------")

        # Attempt to clean and parse JSON (robustness for potential LLM formatting issues)
        try:
            # Find the start and end of the JSON object
            json_start = raw_response_text.find('{')
            json_end = raw_response_text.rfind('}')
            # rfind returns -1 when no closing brace exists, so compare positions directly
            if json_start != -1 and json_end > json_start:
                json_string = raw_response_text[json_start:json_end + 1]
                parsed_output = json.loads(json_string)
                if "recommended_movie" in parsed_output and "explanation" in parsed_output:
                    return parsed_output
                else:
                    print("Warning: Parsed JSON missing required keys.")
                    # Fallback: return the raw text if the parsed JSON lacks the expected keys
                    return {"recommendation_details": raw_response_text}

            else:
                print("Warning: Could not find JSON object in the response.")
                return {"recommendation_details": raw_response_text} # Fallback

        except json.JSONDecodeError as json_err:
            print(f"Error decoding JSON from LLM response: {json_err}")
            # Fallback: Return the raw text if JSON is invalid
            return {"recommendation_details": raw_response_text}


    except Exception as e:
        print(f"An error occurred during recommendation: {e}")
        import traceback
        traceback.print_exc() # Print detailed traceback for debugging
        return {"error": f"An internal error occurred: {e}"}

# 4. Testing the Explainable Recommender

In [None]:
# Test Case 1
query1 = "Suggest a mind-bending sci-fi movie about dreams"
result1 = get_explainable_recommendation(query1)
print(f"\nQuery: {query1}")
print(f"Result: {result1}")

# Test Case 2
query2 = "I want an animated adventure movie from Japan"
result2 = get_explainable_recommendation(query2)
print(f"\nQuery: {query2}")
print(f"Result: {result2}")

# Test Case 3
query3 = "Looking for a thought-provoking sci-fi drama dealing with communication"
result3 = get_explainable_recommendation(query3)
print(f"\nQuery: {query3}")
print(f"Result: {result3}")

# Test Case 4 (More vague)
query4 = "Something action-packed in a desert"
result4 = get_explainable_recommendation(query4)
print(f"\nQuery: {query4}")
print(f"Result: {result4}")

# Conclusion

This project demonstrates the potential of leveraging Generative AI to build more transparent and explainable recommendation systems.

## Results

Initial testing on a few example movie queries showed that the system was able to retrieve relevant movies and generate plausible explanations based on the movie descriptions provided in the dataset. The quality of the explanations varied, sometimes being quite specific and other times more generic, depending on the movie and the query.

## Limitations

Despite the promising results, this prototype has several limitations:

* **Small Dataset:** The recommendations and explanations are limited by the size and content of the dataset used. A larger, more diverse dataset would likely improve relevance.
* **Embedding Model Limitations:** The performance is dependent on the quality and domain knowledge of the embedding model used. Generic models might struggle with nuanced movie concepts.
* **LLM Hallucination/Genericity:** Large Language Models can sometimes generate factually incorrect (hallucinated) or overly generic explanations, especially if the context is limited or the prompt is not precise enough.
* **Sensitivity to Prompt Phrasing:** The quality and focus of the generated explanation can be highly sensitive to the exact wording and structure of the prompt given to the LLM.
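
One inexpensive guard against the hallucination risk is to verify that the title the LLM returns is one of the candidates it was given. The sketch below assumes the parsed output and candidate list have the shapes used in this notebook; `is_grounded` is a hypothetical helper, and it only checks the title, not the factual accuracy of the explanation.

```python
def is_grounded(parsed_output, candidate_movies):
    """True if the LLM's pick matches a retrieved candidate title (case-insensitive)."""
    picked = str(parsed_output.get("recommended_movie", "")).strip().casefold()
    titles = {movie["title"].strip().casefold() for movie in candidate_movies}
    return picked in titles

candidates = [{"title": "Arrival"}, {"title": "Interstellar"}]
print(is_grounded({"recommended_movie": "arrival "}, candidates))  # matches a candidate
print(is_grounded({"recommended_movie": "Solaris"}, candidates))   # not in the candidate list
```

A failed check could trigger a retry or a fall back to the top similarity match, depending on how strict the application needs to be.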

## Future Work

There are several avenues for improving this explainable recommender:

* **Larger and Richer Dataset:** Incorporating a dataset with more movies, richer metadata (genre, cast, director, plot summaries, tags), and potentially user interactions.
* **Fine-tuning Embeddings:** Using or fine-tuning an embedding model specifically for the movie domain.
* **Incorporating User Profiles:** Including user history and preferences to personalize recommendations beyond just the current query.
* **Adding Conversational Abilities:** Making the system more agent-like, allowing users to refine recommendations through dialogue.
* **Function Calling:** Utilizing LLM function calling to potentially retrieve external data like movie ratings or reviews to enrich explanations.
* **Stricter Gen AI Evaluation:** Developing more robust methods to evaluate the quality, relevance, and factual accuracy of the generated explanations.

## Final Thoughts

Providing explanations for recommendations is crucial for building user trust and improving the overall user experience. This project serves as a foundation for exploring how Generative AI, through techniques like embeddings, RAG, and prompting, can unlock this valuable explainability layer in recommendation systems.