<h1 align="center">🚀 Welcome to Nextify</h1>
<h4 align="center"><em>Your PM brain. Upgraded. Modular. Multi-agent. A little too smart.</em></h4>

---



## 🧠 Why Nextify?

Before I pivoted into data science, I was a product manager.  
Which basically means I spent my days getting interrupted,  
except during meetings—where I interrupted myself by overthinking KPIs and pretending to understand executive metaphors.

Every day was a glorious blender of:

- NPS feedback I forgot to tag properly  
- Vague executive questions like _“What’s our strategy?”_  
- Competitors doing weird stuff on Product Hunt  
- And OKRs that feel more like passive-aggressive poetry

So instead of crying into another Jira comment, I decided to build something smarter than my calendar:  
**Nextify** — a multi-agent AI assistant, purpose-built to think like a PM, but faster, kinder, and without the fear of Slack.

---

## 🧬 What is Nextify?

It’s a modular, layered, over-engineered GenAI system made of **multiple specialized agents**.  
Each one is designed to take a chunk of PM pain and quietly fix it behind the scenes.

> Not a monolith. Not a chatbot.  
> It’s a squad. A vibe. A low-key revolution.

---

### 🧠 The Agent Roster

| Agent | 🪄 Wizarding Name | 📖 Description | What It Does | Toolchain |
|-------|------------------|----------------|--------------|-----------|
| 🧱 **Issue Agent** | **The Marauder** | Uncovers hidden issues like the Marauder’s Map—revealing bugs, blockers, and product mischief. | Surfaces blockers from messy feedback like a detective in a hoodie | Gemini + LangGraph + Embeddings |
| 💬 **Feedback Agent** | **Howler Whisperer** | Tames angry reviews and noisy NPS like a pro, translating chaos into insight. | Summarizes reviews and NPS blurbs so you never read another 1-star rant again | Matching Engine + Gemini |
| 😊 **Sentiment Agent** | **The Legilimens** | Reads between the lines to detect emotional signals—like Snape, but less judgmental. | Detects tone and emotional bias with the sensitivity of a therapist-bot | Google Natural Language API |
| 🔍 **Competitor Agent** | **The Seer** | Monitors the market like a digital Trelawney, minus the doom. | Watches your rivals so you can pretend you weren’t stalking them anyway | Google Custom Search + Gemini |
| 💡 **Ideation Agent** | **Room of Requirement** | Generates the feature you didn’t know you needed—on demand, with context. | Whips up features based on actual context, not vibes | Gemini + RAG + doc embeddings |
| 🎯 **Prioritization Agent** | **The Sorting Hat** | Assigns ideas to their rightful roadmap spot—based on cold, calculating OKRs. | Decides what to build next using OKRs, cold math, and zero caffeine | Gemini + agent outputs + OKRs |

Together, they behave like a product-minded hive mind.  
Think *West Wing*, but for roadmap decisions.


---

### 🧙‍♂️ Spellcrafting the Machine *(Day 1)*

*Where we teach the language model to understand our hopes, dreams, and vaguely written Jira tickets.*

- Learn zero-shot, few-shot, ReAct, role prompting, and chain-of-thought strategies.  
- Evaluate how dumb or brilliant the prompts really are.  
- Basically, it’s Hogwarts for GenAI.

---

### 🧠 Memory Injection Protocol *(Day 2)*

*The moment your agents stop being goldfish and start remembering things that matter.*

- Load up the product docs.  
- Embed context into Chroma.  
- Build the `rag_query()` like you're giving your AI a brain transplant.
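
As a rough preview of what `rag_query()` could look like, here is a minimal sketch. It swaps Chroma and real embeddings for a toy bag-of-words cosine similarity so it runs anywhere. The documents, the `embed` helper, and the function signature are illustrative assumptions, not the notebook's actual implementation.

```python
import math
import re
from collections import Counter

# Hypothetical product docs; in the real notebook these would live in Chroma.
DOCS = [
    "Users complain that playlist sync is slow on mobile.",
    "Q3 OKR: raise weekly active users by ten percent.",
    "Competitor launched a collaborative playlist feature.",
]

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rag_query(question, docs=DOCS, top_k=1):
    """Return the top_k docs most similar to the question."""
    q = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

print(rag_query("Why is playlist sync slow?"))
```

The retrieved text would then be pasted into the prompt as context before the model answers.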

---

### 👥 The Agent Uprising *(Day 3)*

*One model wasn’t enough. Now they work as a team. And they’re talking behind your back.*

- Introduce routing logic and LangGraph orchestration.  
- Route questions to the right agent.  
- Witness the birth of a hive mind that probably deserves benefits.
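
To make "route questions to the right agent" concrete, here is a deliberately tiny sketch. It uses plain keyword matching as a stand-in for the LangGraph orchestration planned for Day 3. The keyword lists and the `route` function are hypothetical; only the agent names come from the roster above.

```python
# Hypothetical keyword lists; the agent names follow the roster above.
ROUTES = {
    "Feedback Agent": ["review", "nps", "feedback"],
    "Competitor Agent": ["competitor", "rival", "market"],
    "Prioritization Agent": ["prioritize", "roadmap", "okr"],
}

def route(question, default="Ideation Agent"):
    """Pick the first agent whose keywords appear in the question."""
    text = question.lower()
    for agent, keywords in ROUTES.items():
        if any(word in text for word in keywords):
            return agent
    return default

print(route("What are our competitors shipping this month?"))
print(route("Summarize the latest NPS comments"))
```

A real router would likely let the model classify the question instead of matching substrings, but the control flow stays the same: classify, then dispatch.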

---

### 🎭 Showtime for the Swarm *(Day 4)*

*The grand finale. Real-time data. Creative outputs. A PM assistant with flair and search tabs.*

- Hook in Google Search for up-to-date intel.  
- Mock fine-tuning.  
- Try fun outputs like raps, summaries, and TTS.  
- Launch all agents in sync and pretend this was the plan all along.

---

## 🎯 What Nextify *Does*

By the end, Nextify won’t just answer questions. It will:

- 🗂 **Analyze** user feedback in real-time and flag issues faster than your support team  
- 📊 **Score** new ideas based on actual OKRs and context, not just gut instinct  
- 🤝 **Collaborate** with itself—yes, the agents talk to each other—so you get insight, not just answers  
- 🧪 **Demo beautifully**, whether it’s stakeholder decks, internal dogfooding, or just making you look like a genius in standups  

---

> _Stay tuned. The agents are waking up._


<h1 align="center">🚀 Section 1: Spellcrafting the Machine (Day 1)</h1>

---

Welcome to Day 1 of building **Nextify**, where I start with nothing but a question and a model — and try to turn that into product strategy gold.

In the first section of the notebook, I explore how well a generative model can act as a product strategist using different **prompting spells**:

- 🧩 **Zero-Shot Prompting** – Ask one smart question, see what happens
- 🧩 **Few-Shot Prompting** – Give the model a few examples to follow  
- 🎭 **Role Prompting** – Assign it a persona like a "visionary PM coach"  
- ⚙️ **ReAct Prompting** – Let it think + act in steps like a smart agent  
- 🧵 **Chain-of-Thought (CoT)** – Ask it to reason step-by-step  
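
The five styles differ mostly in what gets wrapped around the question. The sketch below shows one plausible shape for each. The helper names, the wording of each wrapper, and the Duolingo example pair are made up for illustration and are not the prompts used later in this notebook.

```python
QUESTION = "What should Spotify build next?"  # illustrative question

def zero_shot(question):
    """Ask the question with no examples or persona."""
    return question

def few_shot(question, examples):
    """Prepend a few worked Q/A pairs before the real question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

def role_prompt(question, persona="a visionary PM coach"):
    """Assign the model a persona first."""
    return f"You are {persona}.\n\n{question}"

def react_prompt(question):
    """Ask for interleaved Thought / Action / Observation steps."""
    return (f"{question}\n\nWork in a loop of 'Thought:', 'Action:' and "
            "'Observation:' lines, then finish with 'Final Answer:'.")

def chain_of_thought(question):
    """Ask for explicit step-by-step reasoning."""
    return f"{question}\n\nLet's think step by step."

# Made-up example pair, only to show the few-shot layout.
examples = [("What should Duolingo build next?", "Offline lesson packs.")]
for build in (zero_shot, role_prompt, react_prompt, chain_of_thought):
    print(f"--- {build.__name__} ---\n{build(QUESTION)}\n")
print(f"--- few_shot ---\n{few_shot(QUESTION, examples)}")
```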

At the end of each prompt style, I’ll **evaluate how brilliant or dumb it is** using two methods:

- 🧠 **LLM Self-Evaluation**: Let the model score its own output  
- 👤 **RFL Human Evaluation**: I rate, reflect, and learn using a "Rate–Feedback–Learn" method
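
One way to picture a single Rate–Feedback–Learn entry is as a small record with three fields plus the section it belongs to. This is a hypothetical sketch: the `RFLEntry` class and the 1 to 5 range check are my assumptions, chosen to line up with the `Human Score`, `Feedback`, and `Lesson` columns used in the evaluation table.

```python
from dataclasses import dataclass, asdict

@dataclass
class RFLEntry:
    """One Rate-Feedback-Learn record for a single output section."""
    section: str
    rate: int        # human score, assumed to be on a 1-5 scale
    feedback: str    # what was good or missing
    lesson: str      # what to change in the next prompt

    def __post_init__(self):
        if not 1 <= self.rate <= 5:
            raise ValueError(f"rate must be between 1 and 5, got {self.rate}")

# Made-up values, only to show the shape of a record.
entry = RFLEntry("Product Snapshot", 3, "Too generic.", "Ask for competitor analysis.")
print(asdict(entry))
```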

---

### 🧪 Goal of Day 1

- Understand the strengths and limits of various prompting strategies  
- Identify the most consistent and actionable method for building product insights  
- Set a **baseline score** for later improvements (e.g., using RAG, fine-tuning, and agents)




In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# ---------------------------------------------------------
# By importing these modules, we have access to:
#  - genai: the main library for Google GenAI 
#  - genai.types: typed classes/structures for requests/responses
# ---------------------------------------------------------
from google import genai
from google.genai import types

from IPython.display import HTML, Markdown, display

In [None]:
!pip install seaborn matplotlib

In [None]:
!pip install markdown bs4

In [None]:
# Agent packages
!pip install google-search-results  # If using SerpAPI or similar
!pip install google-generativeai

In [None]:
# Agent imports
import vertexai
from vertexai.language_models import ChatModel


In [None]:
from datetime import datetime
import json

In [None]:
import time

In [None]:
!pip install dash jupyter-dash

In [None]:
from bs4 import BeautifulSoup
from IPython.display import IFrame
from IPython import get_ipython

In [None]:
import re
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [None]:
import shutil
import importlib
from IPython.display import Image, display
from IPython.display import FileLink

In [None]:
!pip install ace_tools

In [None]:
# ---------------------------------------------------------
# Import the retry module from google.api_core 
# This provides decorators and classes for automatic retries
# on certain types of API/network errors.
# ---------------------------------------------------------
from google.api_core import retry

# ---------------------------------------------------------
# We'll define a small helper function (lambda) 
# that checks if the exception is:
# 1) An instance of genai.errors.APIError 
# 2) Has an HTTP status code of 429 (Too Many Requests)
#    or 503 (Service Unavailable).
# Only in these cases do we want to retry the call.
# ---------------------------------------------------------

is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

# ---------------------------------------------------------
# Next line "monkey-patches" or reassigns the
# generate_content method on genai.models.Models 
# so that it includes automatic retry behavior.
# Each time generate_content() is called, if it hits
# a genai.errors.APIError(429 or 503), it will 
# automatically re-try the call rather than fail immediately.
# ---------------------------------------------------------
genai.models.Models.generate_content = retry.Retry(
    predicate=is_retriable)(genai.models.Models.generate_content)
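
If the monkey-patch above feels like magic, the underlying idea is just "catch a retriable error, wait, try again". The sketch below shows that idea in plain Python with a doubling delay. It is not how `google.api_core` implements `Retry`; `FakeAPIError`, `call_with_retry`, and the delay values are hypothetical stand-ins so the example runs without any API key.

```python
import time

class FakeAPIError(Exception):
    """Hypothetical stand-in for genai.errors.APIError."""
    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code

def call_with_retry(fn, retriable_codes=(429, 503), max_attempts=4, base_delay=0.01):
    """Call fn(); on a retriable error, wait and try again with a doubled delay."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except FakeAPIError as err:
            # Give up on non-retriable codes or once attempts are exhausted
            if err.code not in retriable_codes or attempt == max_attempts:
                raise
            time.sleep(delay)
            delay *= 2

attempts = {"count": 0}

def flaky():
    """Fails twice with 429, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeAPIError(429)
    return "ok"

print(call_with_retry(flaky), "after", attempts["count"], "attempts")
```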

In [None]:
# ---------------------------------------------------------
# "UserSecretsClient" is a Kaggle utility for retrieving 
# sensitive information (like passwords, API keys) 
# without hardcoding them in the notebook.
# ---------------------------------------------------------
from kaggle_secrets import UserSecretsClient

# ---------------------------------------------------------
# We fetch the stored "GOOGLE_API_KEY" from the 
# Kaggle secrets vault, so we can authenticate or call 
# Google APIs without exposing the key directly in code.
# ---------------------------------------------------------

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

# Set up the Gemini client
client = genai.Client(api_key=GOOGLE_API_KEY)

In [None]:
# First, insert the result of the first zero-shot run into a panorama-view table,
# then generate results for the other prompting types.
# The options to test are endless and time is limited, so I run one experiment
# per prompting type and compare the improvements.

In [None]:
# Show all rows and columns, and remove width limit
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [None]:
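INPUT_EVAL_FILE = "/kaggle/input/all-experiment-view/all_experiment_view.csv"
# /kaggle/input is read-only, so keep the table we append to in /kaggle/working
EVAL_FILE = "/kaggle/working/all_experiment_view.csv"
if not os.path.exists(EVAL_FILE) and os.path.exists(INPUT_EVAL_FILE):
    shutil.copy(INPUT_EVAL_FILE, EVAL_FILE)

# Load existing or initialize empty table
try:
    all_experiment_view = pd.read_csv(EVAL_FILE)
    print("✅ Existing evaluation table loaded.")
except FileNotFoundError:
    print("⚠️ No existing evaluation file found. Creating a new one...")
    all_experiment_view = pd.DataFrame(columns=[
        "Run", "Company", "Strategy", "Prompt Tag", "Section", "Metric",
        "LLM Score", "Human Score", "Feedback", "Lesson", "LLM Output Section"
    ])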

In [None]:
all_experiment_view

## 🧩 Zero-Shot Prompting – Ask one smart question, see what happens

In [None]:
if not os.path.exists(EVAL_FILE):
    shutil.copy("/kaggle/input/all-experiment-view/all_experiment_view.csv", EVAL_FILE)
    print("✅ Copied original eval file to working dir.")

In [None]:
def split_llm_output_by_titles(text):

    section_titles = [
        "1. Mission & Vision",
        "2. Product Snapshot",
        "3. Strategic Roadmap",
        "4. Feature Prioritization (ICE Scoring)",
        "5. SMART OKRs",
        "6. Next-Level Innovation & Product Vision"
    ]

    # Create a regex to match each section title at the start of a line
    pattern = r"(?=^.*(?:###\s*)?(?:\d[\.\:\-]?\s*)?(" + "|".join(re.escape(t) for t in section_titles) + r").*$)"

    # Use re.MULTILINE to enforce section titles only at line start
    matches = list(re.finditer(pattern, text, re.MULTILINE | re.IGNORECASE))
    sections = []

    for i in range(len(matches)):
        start = matches[i].start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        section = text[start:end].strip()
        sections.append(section)

    return sections
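
To see the splitting idea on a small input, here is a simplified, self-contained variant. It keys only on the `### N.` heading pattern rather than on the exact title list used above, so treat `split_by_headings` and the sample text as illustrative assumptions, not a drop-in replacement.

```python
import re

def split_by_headings(text):
    """Split markdown into chunks that each start at a '### N.' heading."""
    starts = [m.start() for m in re.finditer(r"^###\s*\d+\.", text, re.MULTILINE)]
    # Pair each heading start with the next one (or the end of the text)
    return [text[a:b].strip() for a, b in zip(starts, starts[1:] + [len(text)])]

sample = """### 1. Mission & Vision
Music for everyone.

### 2. Product Snapshot
Listeners and creators.
"""

sections = split_by_headings(sample)
print(len(sections))
print(sections[1])
```

Matching on the heading pattern alone is more forgiving when the model rewords a title, at the cost of also splitting on any extra numbered `###` heading it invents.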

In [None]:
# Builds the zero-shot prompt: a basic guideline that asks the model for its simplest structured response
def get_zero_shot_prompt(company_name):
    return f"""
You are Nextify — an AI-powered product strategy assistant for product managers.

The product to analyze is what the user entered after the initial prompt ran: **{company_name}**

Please structure your response into the following 6 clearly labeled sections:


---
### 1. Mission & Vision

Estimate or infer:
- 📌 **Mission**
- 🔭 **Vision**


---

### 2. Product Snapshot

Provide a high-level analysis:
- Target user personas
- Most common user pain points
- UX bottlenecks or friction areas
- Current market position (leader, disruptor, niche)
- Key competitive threats and opportunities

---
### 3. Strategic Roadmap

Split your roadmap suggestions across time horizons:

**A. Immediate (0–3 months)**  
- Small, impactful improvements or bug fixes

**B. Mid-Term (3–12 months)**  
- Core features, integrations, or platform upgrades

**C. Long-Term (12+ months)**  
- Visionary features, cross-domain expansions, or business model innovations

---

### 4. Feature Prioritization (ICE Scoring)

Score the proposed features based on:

- **Impact** (value to user or business)
- **Confidence** (likelihood of success)
- **Ease** (implementation feasibility)
Output and order in descending order based on total score in the table in code block format:

| Feature | Impact | Confidence | Ease | Total |
|---------|--------|------------|------|-------|
| …       | …      | …          | …    | …     |

---

### 5. SMART OKRs

Propose 2–3 quarterly OKRs (Objectives and Key Results) aligned with the roadmap priorities.

Each OKR must be:
- Specific
- Measurable
- Achievable
- Relevant
- Time-bound

---

### 6. Next-Level Innovation & Product Vision

Propose bold, visionary, or cross-domain feature ideas that go beyond current expectations.

Think:
- Features that blend with other industries (e.g., AI + education, AI + wellness)
- New value models (subscriptions, communities, APIs)
- Scalable tech or ecosystem partnerships

---
Now output exactly **6 structured sections**, using this markdown format:

### 1. Mission & Vision  
### 2. Product Snapshot  
### 3. Strategic Roadmap  
### 4. Feature Prioritization (ICE Scoring)  
### 5. SMART OKRs  
### 6. Next-Level Innovation & Product Vision

Important:
- Use `###` headers exactly as written
- Always return exactly 6 sections
- Do not skip or reorder sections
- Use markdown tables for scoring or OKRs
- Keep the format consistent for parsing

Start with `### 1. Mission & Vision`
"""

In [None]:
def evaluate_llm_output_by_section(run_id, strategy_name, company_name, llm_output, prompt_tag, table_path=EVAL_FILE):
    global all_experiment_view  # store full updated table

    # === Load or initialize table ===
    if os.path.exists(table_path):
        all_experiment_view = pd.read_csv(table_path)
        if "Prompt Tag" not in all_experiment_view.columns:
            all_experiment_view["Prompt Tag"] = ""
    else:
        all_experiment_view = pd.DataFrame(columns=[
            "Run", "Company", "Strategy", "Prompt Tag", "Section", "Metric",
            "LLM Score", "Human Score", "Feedback", "Lesson", "LLM Output Section"
        ])

    # === Check for existing run ===
    existing = all_experiment_view[
        (all_experiment_view["Company"] == company_name) &
        (all_experiment_view["Strategy"] == strategy_name) &
        (all_experiment_view["Prompt Tag"] == prompt_tag)
    ]

    if not existing.empty:
        print(f"⚠️ This combination already exists for {company_name} / {strategy_name} / {prompt_tag}.")
        display(existing[["Run", "Section", "Human Score", "Feedback"]].drop_duplicates())
        print("🛑 Skipping evaluation in non-interactive mode (Kaggle).")
        return None  # ❌ Exit early, no user prompt

    print("🔍 Splitting and evaluating LLM output section by section...\n")
    output_sections = split_llm_output_by_titles(llm_output)

    if len(output_sections) != len(metrics):
        print("⚠️ Section count mismatch. Ensure all 6 sections are in correct format (### 1., ### 2., etc.)")
        return

    session_entries = []

    for i, (title, metric) in enumerate(zip(section_titles, metrics)):
        section_text = output_sections[i]
        print(f"\n🧩 Section: {title}\n")
        print(f"📄 LLM Output:\n{section_text}\n")

        # === LLM Self-Evaluation ===
        eval_prompt = f"""
You are an evaluator. Score the following response using the metric: **{metric}**.
Score from 1 to 5 and give a short explanation.

RESPONSE:
{section_text}
"""
        eval_response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=eval_prompt
        ).text

        print(f"🤖 LLM Self-Evaluation for [{metric}]:\n{eval_response}")
        score_match = re.search(r"(\d+)", eval_response)
        llm_score = int(score_match.group(1)) if score_match else 3

        # === Skip human rating in non-interactive mode ===
        h_score = None
        feedback = ""
        lesson = ""

        row = {
            "Run": run_id,
            "Company": company_name,
            "Strategy": strategy_name,
            "Prompt Tag": prompt_tag,
            "Section": title,
            "Metric": metric,
            "LLM Score": llm_score,
            "Human Score": h_score,
            "Feedback": feedback,
            "Lesson": lesson,
            "LLM Output Section": section_text
        }

        session_entries.append(row)
        display(pd.DataFrame([row]))

    # === Save the results ===
    session_df = pd.DataFrame(session_entries)
    session_df.to_csv(table_path, mode='a', header=not os.path.exists(table_path), index=False)
    all_experiment_view = pd.read_csv(table_path)

    print("\n✅ Appended to", table_path)
    summary = session_df.groupby(["Company", "Strategy"])[["LLM Score"]].mean().round(2)
    print("\n📊 Aggregated Score Summary (LLM only):")
    display(summary)

    return session_df
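
The evaluator above takes the first run of digits in the model's reply as the score and falls back to 3 when none is found. That can misfire if the reply mentions a year or a section number before the score. A stricter parser might look like the sketch below; `parse_score` and its preference order are my assumptions, not part of the notebook.

```python
import re

def parse_score(text, low=1, high=5, default=None):
    """Prefer 'Score: N' or 'N/high'; otherwise use the first in-range integer."""
    explicit = re.search(
        rf"score\s*[:=]?\s*\**\s*(\d+)|(\d+)\s*/\s*{high}", text, re.IGNORECASE
    )
    candidates = []
    if explicit:
        candidates.append(explicit.group(1) or explicit.group(2))
    # Fallback candidates: every standalone integer, in order of appearance
    candidates += re.findall(r"\b\d+\b", text)
    for token in candidates:
        value = int(token)
        if low <= value <= high:
            return value
    return default

print(parse_score("**Score: 4/5** Solid roadmap."))
print(parse_score("In 2024 terms, I'd give this a 3."))
print(parse_score("No number here"))
```

Returning `None` instead of a silent 3 keeps missing scores visible in the table rather than blending them into the average.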


In [None]:
# 🔖 Required section titles and metrics (in correct order)
section_titles = [
    "Mission & Vision",
    "Product Snapshot",
    "Strategic Roadmap",
    "Feature Prioritization (ICE Scoring)",
    "SMART OKRs",
    "Next-Level Innovation & Product Vision"
]

metrics = [
    "Mission/Vision Insight",
    "Strategic Depth",
    "Roadmap Relevance",
    "Prioritization Clarity",
    "OKR Quality",
    "Visionary Thinking"
]
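
Section 4 of the prompt asks the model for an ICE table sorted by total score. For intuition about the arithmetic, here is a small sketch. The features and scores are made up, and I assume the total is a plain sum of Impact, Confidence, and Ease; the prompt does not pin that down, and some teams multiply the three instead.

```python
# Made-up features and 1-10 scores, purely for illustration.
features = [
    {"feature": "Offline mode", "impact": 8, "confidence": 6, "ease": 4},
    {"feature": "Smart search", "impact": 7, "confidence": 8, "ease": 7},
    {"feature": "Dark theme", "impact": 3, "confidence": 9, "ease": 9},
]

def ice_total(row):
    """Total = Impact + Confidence + Ease (assumed; some teams multiply instead)."""
    return row["impact"] + row["confidence"] + row["ease"]

# Highest total first, matching the descending order the prompt asks for
ranked = sorted(features, key=ice_total, reverse=True)

print("| Feature | Impact | Confidence | Ease | Total |")
print("|---------|--------|------------|------|-------|")
for row in ranked:
    print(f"| {row['feature']} | {row['impact']} | {row['confidence']} | {row['ease']} | {ice_total(row)} |")
```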

In [None]:
### ✅ 3. Start the Gemini + Prompt + Input Flow (Kaggle Safe)

# ⚠️ In local environments, this would ask the user. In Kaggle, we hardcode it.
company_name = "Spotify"
print(f"📦 Using company (simulated input for Kaggle): {company_name}")

# === Prompt Setup ===
strategy_name = "Zero-Shot"
prompt_tag = "zero-shot-v1-baseline-1"

# 🧠 Generate zero-shot prompt using selected company
prompt = get_zero_shot_prompt(company_name)

# ✨ Generate output with Gemini
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt
)
llm_output = response.text

print("\n✅ Gemini Output:\n")
print(llm_output)

# === Evaluation Logging Setup ===
if os.path.exists(EVAL_FILE):
    all_experiment_view = pd.read_csv(EVAL_FILE)
else:
    all_experiment_view = pd.DataFrame(columns=[
        "Run", "Company", "Strategy", "Prompt Tag", "Section", "Metric",
        "LLM Score", "Human Score", "Feedback", "Lesson", "LLM Output Section"
    ])

# 🔁 Check if this company+strategy+prompt combo already exists
same_combo = all_experiment_view[
    (all_experiment_view["Company"].str.lower().fillna("") == company_name.lower())
    & (all_experiment_view["Strategy"].str.lower().fillna("") == strategy_name.lower())
    & (all_experiment_view["Prompt Tag"].str.lower().fillna("") == prompt_tag.lower())
]

# ✅ Automatically skip duplicates in Kaggle
if not same_combo.empty:
    print("⚠️ This company + strategy + prompt combo has already been evaluated.")
    print("🛑 Skipping re-run to avoid duplicate logging in Kaggle.")
else:
    next_run = 1
    print(f"🧮 Run number for {company_name} ({strategy_name}): {int(next_run)}")

    # === Evaluate and Log Output ===
    session_df = evaluate_llm_output_by_section(
        next_run,
        strategy_name,
        company_name,
        llm_output,
        prompt_tag
    )

In [None]:
# 👉 Cell skipped due to duplicate run and user cancellation

In [None]:
# 🔍 Filter for only the baseline prompt tag
baseline_df = all_experiment_view[all_experiment_view["Prompt Tag"] == "zero-shot-v1-baseline-1"]

In [None]:
baseline_df

In [None]:
# Pivot for LLM and Human scores by section
heatmap_data = baseline_df.pivot_table(
    values=["LLM Score", "Human Score"],
    index="Section",
    columns="Prompt Tag",
    aggfunc="mean"
)

# Fill or clean if needed
llm_data = heatmap_data["LLM Score"].fillna(0)

# Plot
plt.figure(figsize=(10, 6))
sns.heatmap(
    llm_data,
    annot=True,
    cmap="YlGnBu",
    cbar_kws={'label': 'LLM Score'},
    fmt=".1f"
)
plt.title("LLM Scores by Section (Baseline Only)")
plt.xlabel("Prompt Tag")
plt.ylabel("Section")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# 🔍 Filter only baseline prompt tag
baseline_df = all_experiment_view[all_experiment_view["Prompt Tag"] == "zero-shot-v1-baseline-1"].copy()

# Ensure Run column is numeric
baseline_df["Run"] = pd.to_numeric(baseline_df["Run"], errors="coerce")

# Create pivot tables
pivot_llm = baseline_df.pivot_table(
    values="LLM Score",
    index="Section",
    columns="Run",
    aggfunc="mean"
)

pivot_human = baseline_df.pivot_table(
    values="Human Score",
    index="Section",
    columns="Run",
    aggfunc="mean"
)

# Plot heatmaps
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

sns.heatmap(pivot_llm.fillna(0), annot=True, cmap="YlGnBu", fmt=".1f", ax=axs[0])
axs[0].set_title("LLM Scores by Section & Run (Baseline Only)")
axs[0].set_xlabel("Run")
axs[0].set_ylabel("Section")

sns.heatmap(pivot_human.fillna(0), annot=True, cmap="YlOrRd", fmt=".1f", ax=axs[1])
axs[1].set_title("Human Scores by Section & Run (Baseline Only)")
axs[1].set_xlabel("Run")
axs[1].set_ylabel("Section")

plt.tight_layout()
plt.show()

## 🚀 Nextify Interactive Dashboard

You can explore the full prompt evaluation, LLM comparison, and multi-agent prototype here:

🔗 [Open the Live Nextify Dashboard](http://nextify-dashboard-100.streamlit.app/)

> 💡 This dashboard is fully interactive and displays all runs, scores, feedback, and upcoming agents for Nextify.



### 🧩 Zero-Shot Prompting – Add Human and Machine Feedback to the Same Prompts

In [None]:


# Load the full evaluation dataset
evaluation_df = pd.read_csv(EVAL_FILE)

# Filter for Zero-Shot strategy and prompt tag
zero_shot_df = evaluation_df[
    (evaluation_df["Strategy"].str.lower() == "zero-shot") &
    (evaluation_df["Prompt Tag"].str.lower().str.contains("zero-shot-v1-baseline-1"))
]

# Show how many rows were found
print(f"✅ Filtered {len(zero_shot_df)} zero-shot entries from the evaluation table.")

In [None]:
# Step 2: Show the prompt for the zero-shot baseline
company_name = zero_shot_df["Company"].iloc[0] if not zero_shot_df.empty else "Unknown"
zero_shot_prompt = get_zero_shot_prompt(company_name)
Markdown(zero_shot_prompt)

In [None]:
# === Step 3: Build the markdown table ===

table_md = f""" 
## 📊 Evaluation Summary: Zero-Shot Prompting for **{company_name}**
| Section | Metric | 🤖 LLM Score | 👤 Human Score | Feedback |
|---------|--------|--------------|----------------|----------|
"""

for _, row in zero_shot_df.iterrows():
    section = row["Section"]
    metric = row["Metric"]
    llm_score = row["LLM Score"]
    human_score = row["Human Score"]
    feedback = row["Feedback"]

    table_md += f"| {section} | {metric} | {llm_score} | {human_score} | {feedback} |\n"

# === Step 4: Display in notebook ===
display(Markdown(table_md))

In [None]:
# === Human and machine feedback injection and evaluation ===
def build_feedback_augmented_prompt(company_name, strategy="Zero-Shot", prompt_tag="zero-shot-v1-baseline-1", eval_path=EVAL_FILE):
    try:
        df = pd.read_csv(eval_path)
    except FileNotFoundError:
        raise FileNotFoundError(f"⚠️ File not found at {eval_path}. Make sure your all_experiment_view.csv is available.")

    # Filter for the selected strategy and tag
    df_filtered = df[
        (df["Company"].str.lower() == company_name.lower()) &
        (df["Strategy"].str.lower() == strategy.lower()) &
        (df["Prompt Tag"].str.lower().str.contains(prompt_tag.lower()))
    ]

    if df_filtered.empty:
        raise ValueError("❌ No matching entries found for given strategy and prompt tag.")

    # Inject section-specific feedback
    feedback_notes = ""
    for _, row in df_filtered.iterrows():
        section = row["Section"]
        human_score = row["Human Score"]
        feedback = row["Feedback"]
        lesson = row["Lesson"]

        feedback_notes += f"""
### 💡 Improve for: {section}
- Human Score: {human_score}
- Feedback: {feedback}
- Lesson Learned: {lesson}
"""

    # === Build new improved prompt with metadata in the title ===
    title = f"{strategy} | {prompt_tag} | Improved Prompt for {company_name}"

    improved_prompt = f"""
# 🪄 {title}

You are Nextify — an AI-powered product strategy assistant for product managers.

The product to analyze is: **{company_name}**

---

Before continuing, review this feedback from the last evaluation and apply improvements:

Please take the following feedback into consideration and explain your thinking in detail:

{feedback_notes.strip()}

Do not limit yourself to the feedback and lessons learned. Explore each section deeply and thoroughly. Please create tables and markdown wherever needed.

For Product Snapshot, give thorough insight into the product (users, features, audience, competitive edge) as well as Porter's Five Forces and a competitor analysis. The analysis should identify the forces and offer concrete strategic recommendations for **{company_name}** to leverage these insights. For example, for Spotify it could mention the high bargaining power of record labels and suggest strategies to mitigate it, such as developing original content, fostering direct artist relationships, or exploring alternative licensing models.
Delve into the nuances of **{company_name}**'s specific situation and recent industry trends. More in-depth research and tailored arguments would increase the score.
The analysis should not treat each force in isolation, because Porter's Five Forces are interconnected. The response should explore how these forces interact and influence each other within **{company_name}**'s competitive landscape.

---

Each section should begin with a heading starting with `###` followed by the section number and title.

Make sure there are exactly 6 sections. Use tables where relevant (e.g., ICE Scoring, OKRs). Keep responses clean, consistent, and structured in markdown format
---
Now, generate a revised product strategy across the following **exact 6 sections**.

### ✳️ REQUIRED FORMAT

Please follow this strict structure — do not skip, rename, or reformat headings:

### 1. Mission & Vision  
### 2. Product Snapshot  
### 3. Strategic Roadmap  
### 4. Feature Prioritization (ICE Scoring)  
### 5. SMART OKRs  
### 6. Next-Level Innovation & Product Vision

✅ Important:
- Begin each section with `### [number]. [title]`
- Use bullet points, markdown tables, or numbered lists
- Return exactly 6 sections in this order

---

Start your response below:
"""


    return improved_prompt


In [None]:
# === Step 1: Build & Run the Improved Prompt ===
improved_prompt_tag = f"improved-{prompt_tag}"
run_id = all_experiment_view[all_experiment_view["Company"] == company_name]["Run"].max() + 1 if not all_experiment_view.empty else 1

improved_prompt = build_feedback_augmented_prompt(company_name, strategy_name, prompt_tag, EVAL_FILE)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=improved_prompt
)
improved_llm_output = response.text

# Render the raw output to inspect its structure
print("🔍 LLM Output Preview:")
Markdown(improved_llm_output)  # slice the string here for a shorter preview


In [None]:
# Copy uploaded file to working directory if needed
shutil.copy("/kaggle/input/all-experiment-view/all_experiment_view.csv", "/kaggle/working/all_experiment_view.csv")

In [None]:
# 🛡️ Safe fallback for next_run in case kernel was restarted
if "next_run" not in globals():
    try:
        same_combo = all_experiment_view[
            (all_experiment_view["Company"].str.lower().fillna("") == company_name.lower()) &
            (all_experiment_view["Strategy"].str.lower().fillna("") == strategy_name.lower()) &
            (all_experiment_view["Prompt Tag"].str.lower().fillna("") == prompt_tag.lower())
        ]
        next_run = same_combo["Run"].max() + 1 if not same_combo.empty else 1
    except Exception:
        next_run = 1
    print(f"🧮 Recovered next_run = {next_run}")

In [None]:
evaluate_llm_output_by_section(
    run_id=next_run,
    strategy_name=strategy_name,
    company_name=company_name,
    llm_output=improved_llm_output,
    prompt_tag="improved-" + prompt_tag,
    table_path=EVAL_FILE
)

In [None]:
# 👉 Cell skipped due to duplicate run and user cancellation

In [None]:
# Save the full updated evaluation table
all_experiment_view.to_csv("/kaggle/working/all_experiment_view.csv", index=False)
print("✅ Saved to /kaggle/working/all_experiment_view.csv")

In [None]:
all_experiment_view

In [None]:
# Load updated file
df = pd.read_csv(EVAL_FILE)

# Filter for both runs
before = df[(df["Prompt Tag"] == prompt_tag) & (df["Company"] == company_name)]
after = df[(df["Prompt Tag"] == "improved-" + prompt_tag) & (df["Company"] == company_name)]

# Merge
comparison = before.merge(after, on="Section", suffixes=("_Before", "_Improved"))

# Save comparison table for dashboard use
comparison.to_csv("/kaggle/working/comparison_table.csv", index=False)
print("✅ comparison_table.csv saved to working directory.")

# Markdown Table
table_md = "### 🔍 Evaluation Comparison Table\n\n"
table_md += "| Section | Metric | 🤖 LLM Before | 👤 Human Before | 🤖 LLM Improved | 👤 Human Improved |\n"
table_md += "|---------|--------|----------------|------------------|------------------|--------------------|\n"

for _, row in comparison.iterrows():
    table_md += f"| {row['Section']} | {row['Metric_Before']} | {row['LLM Score_Before']} | {row['Human Score_Before']} | {row['LLM Score_Improved']} | {row['Human Score_Improved']} |\n"

display(Markdown(table_md))
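
The merge above lines up the two runs by section, and because `merge` defaults to an inner join, only sections present in both runs survive. The same comparison can be sketched without pandas to make the per-section change explicit. The scores and the `score_deltas` helper below are made up for illustration.

```python
# Made-up scores, only to show the shape of the comparison.
before = {"Mission & Vision": 4, "Product Snapshot": 3, "SMART OKRs": 4}
improved = {"Mission & Vision": 4, "Product Snapshot": 5, "SMART OKRs": 3}

def score_deltas(before, improved):
    """Improved minus before, for sections present in both runs (an inner join)."""
    return {s: improved[s] - before[s] for s in before if s in improved}

deltas = score_deltas(before, improved)
for section, delta in deltas.items():
    print(f"{section}: {delta:+d}")
```

A negative delta is worth a look too: feedback aimed at one section can make the model spend less effort on another.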

In [None]:
# 🌈 Define and apply custom pastel color palette

importlib.reload(sns)

# 🎨 Custom pastel palette
custom_palette = {
    "Before": "#f78fb3",    # pastel pink
    "Improved": "#7f9cf5"   # pastel blue
}

In [None]:
# Melt and combine human and LLM scores
human_scores = comparison[["Section", "Human Score_Before", "Human Score_Improved"]].melt(
    id_vars="Section", var_name="Version", value_name="Score"
)
human_scores["Metric"] = "Human"

llm_scores = comparison[["Section", "LLM Score_Before", "LLM Score_Improved"]].melt(
    id_vars="Section", var_name="Version", value_name="Score"
)
llm_scores["Metric"] = "LLM"

# Combine + clean version names
score_data = pd.concat([human_scores, llm_scores])
score_data["Version"] = score_data["Version"].str.replace("Human Score_", "", regex=False)
score_data["Version"] = score_data["Version"].str.replace("LLM Score_", "", regex=False)

# --- Bar Chart ---
plt.figure(figsize=(12, 6))
sns.barplot(
    data=score_data,
    x="Section", y="Score",
    hue="Version",
    palette=custom_palette  # Apply the custom pastel colors
)
plt.title("Evaluation Scores: Before vs Improved (Human & LLM)")
plt.xlabel("Section")
plt.ylabel("Score (1–5)")
plt.xticks(rotation=45, ha="right")
plt.ylim(0, 6)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.legend(title="Prompt Version")
plt.tight_layout()
plt.show()


In [None]:

output_sections = split_llm_output_by_titles(improved_llm_output)
print(f"✅ Split into {len(output_sections)} sections.")

In [None]:
comparison

In [None]:
# Melt just human scores
human_score_melted = comparison[["Section", "Human Score_Before", "Human Score_Improved"]].melt(
    id_vars="Section", var_name="Version", value_name="Score"
)
human_score_melted["Version"] = human_score_melted["Version"].str.replace("Human Score_", "", regex=False)

# Bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=human_score_melted, x="Section", y="Score", hue="Version",palette=custom_palette)
plt.title("Human Score Comparison – Before vs Improved")
plt.xlabel("Section")
plt.ylabel("Human Score")
plt.xticks(rotation=45, ha="right")
plt.ylim(0, 6)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.legend(title="Prompt Version")
plt.tight_layout()
plt.show()

In [None]:
avg_human_before = score_data[(score_data["Metric"] == "Human") & (score_data["Version"] == "Before")]["Score"].mean()
avg_human_after = score_data[(score_data["Metric"] == "Human") & (score_data["Version"] == "Improved")]["Score"].mean()
avg_llm_before = score_data[(score_data["Metric"] == "LLM") & (score_data["Version"] == "Before")]["Score"].mean()
avg_llm_after = score_data[(score_data["Metric"] == "LLM") & (score_data["Version"] == "Improved")]["Score"].mean()

# Summary description block (Markdown-style output)

summary_md = f"""
### 📊 Evaluation Summary

Across all evaluated sections:

- **Human Score** improved from **{avg_human_before:.2f} → {avg_human_after:.2f}**
- **LLM Score** improved from **{avg_llm_before:.2f} → {avg_llm_after:.2f}**

These results indicate the effect of prompt iteration. The improvements are not uniformly large across sections, but an average increase of **{avg_human_after - avg_human_before:.2f}** in human ratings suggests that refining structure, tone, and alignment improves LLM outputs.

"""

display(Markdown(summary_md))

In the first phase of my project, I focused on benchmarking baseline and improved prompts using a zero-shot strategy. I created a unified evaluation table combining LLM and human scores, visualized their differences with custom-styled bar charts and heatmaps, and aligned the entire dashboard with a consistent blue-pink pastel theme.

Having explored the foundational prompt quality, the next stage is about layered prompting strategies and gradually transitioning into an agentic system. I plan to start with few-shot examples where appropriate (for instance, in reasoning-heavy sections like OKRs or Roadmap), then scale to Chain-of-Thought (CoT) and Tree-of-Thought (ToT) methods.

Once these are tested individually, I'll explore the ReAct prompting approach, which blends reasoning with tool use — an ideal stepping stone to build a multi-agent architecture. The final layer integrates document embeddings and RAG (Retrieval Augmented Generation) so agents can ground their decisions in real data.

## 🔄 Prompt Strategy & System Expansion Plan

In the initial stage of this project, I focused on building a structured comparison between baseline and improved zero-shot prompts. This included generating evaluation tables, visualizing LLM vs Human scores, building the dashboard, and creating a foundation for further development.
Moving forward, the plan is to progressively experiment with more advanced prompting techniques, integrate modular agents, and enhance them with document-grounded retrieval.

---

### Prompt Strategy Expansion Checklist

#### 🔁 Prompt Variants

- ✅ Mission & Vision → Zero-Shot
- ✅ Product Snapshot → Zero-Shot
- 🧠 SMART OKRs → Few-Shot Example Prompt
- 🔢 ICE Prioritization → Chain-of-Thought
- 💡 Innovation/Vision → Tree-of-Thought
- 🎯 Feature Ideation → ReAct

#### 🤖 Agentic System Design
- [ ] Define modular agents (Feature Ideator, Roadmap Generator, etc.)
- [ ] Implement ReAct-based coordination between agents
- [ ] Visualize agent hand-offs and logs in dashboard

#### 📎 Grounding + RAG
- [ ] Build embedding store from documents (strategy decks, OKRs, feedback)
- [ ] Use similarity search in prompt context for better grounding
- [ ] Log source attribution in agent outputs

#### 📊 Evaluation
- [ ] Track Human vs LLM Scores by prompt type
- [ ] Score agent outputs per section using custom rubric
- [ ] Visualize improvement per strategy in the dashboard

---


### 🔁 Prompt Evolution Flow

Zero-Shot
   ↓
Few-Shot (targeted)
   ↓
Chain-of-Thought (CoT)
   ↓
Tree-of-Thought (ToT)
   ↓
ReAct Prompting (planning + action)
   ↓
Multi-Agent System
   ↓
Embeddings & Retrieval (RAG)

---

### 🔗 Live Dashboard

👉 [Open the Live Nextify Dashboard](https://nextify-dashboard-100.streamlit.app/)


#### 🧠 One-Shot Prompting – Give one example to guide the model
Now apply to OKR


## 🧩 **Few-Shot Prompting** – Give the model a few examples

From here on, I apply each prompting technique to the section that needs it most. At the end, I can compare how the different techniques improved the different sections, then carry the best strategy for each section over when breaking that section out into its own agent.
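
As a rough illustration of the few-shot idea, the sketch below assembles labelled good and bad examples into one prompt block, with the task placed last. The `Example` class, the `build_few_shot_block` helper, and the sample texts are hypothetical and not part of this notebook's pipeline; the prompt actually used here is written by hand in the next cell.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One labelled example shown to the model (hypothetical structure)."""
    title: str
    body: str
    is_good: bool
    why: str


def build_few_shot_block(examples, task):
    """Join labelled examples and the final task into a single prompt string."""
    parts = []
    for ex in examples:
        label = "GOOD EXAMPLE" if ex.is_good else "BAD EXAMPLE"
        verdict = "good" if ex.is_good else "bad"
        parts.append(f"{label}: {ex.title}\n{ex.body}\nWhy it's {verdict}: {ex.why}")
    # The task comes last so the model reads every example before answering.
    parts.append(task)
    return "\n---\n".join(parts)


examples = [
    Example("Measurable KR", "KR: +15% DAU on Discover Weekly", True,
            "clear metric and target"),
    Example("Vague KR", "KR: Make UI feel better", False,
            "no measurable outcome"),
]
prompt = build_few_shot_block(examples, "Now generate 2 OKRs for the product.")
print(prompt)
```

Keeping the examples as data rather than inline text would make it easier to swap examples per section later, though that is a design assumption rather than something tested here.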


In [None]:
# === 🧩 1. Few-Shot Prompt Block for SMART OKRs ===
SMART_OKRS_FEW_SHOT_PROMPT = """
### SMART OKRs

---
Use the following good and bad examples as guidance: learn from them, then generate your recommended OKRs.

🟢 GOOD EXAMPLE 1 — Strong OKR: AI Recommendations
### OKR 1 (Q1): Increase engagement with personalized recommendations

**Objective:** Enhance user experience through AI-driven music discovery.

| Key Result | Target | Estimate Rationale |
|------------|--------|---------------------|
| KR1.1 | +15% increase in DAU interacting with Daily Mix & Discover Weekly | Based on existing 40% base usage, nudging and new placements can lift this meaningfully. |
| KR1.2 | +10% increase in time spent listening to recommended tracks | Median session = ~35 mins; 10% = ~3.5 min lift via stickier content. |
| KR1.3 | Improve satisfaction score from 4.0 → 4.3 (out of 5) | Feasible from historical tuning and GenAI personalization rollouts. |

**Why it’s good:**
- Clear business objective (personalization = retention)
- Uses real user metrics (DAU, listening time, satisfaction)
- Realistic yet ambitious target values

---
🟢 GOOD EXAMPLE 2 - Strong OKR: Podcast Growth
### OKR 2 (Q2): Boost podcast reach and completion

**Objective:** Drive more users to discover and complete podcasts.

| Key Result | Target | Estimate Rationale |
|------------|--------|---------------------|
| KR2.1 | +20% increase in unique monthly podcast listeners | Based on typical baseline and homepage placement campaigns. |
| KR2.2 | +10% increase in average listen-through rate | Autoplay, intro-skip, and shorter episodes improve completion. |
| KR2.3 | Launch 2 podcast-specific discovery features | Use personalization + trending visibility per market. |

**Why it’s good:**
- Balanced across growth, retention, and product velocity
- Direct link to known friction in podcast UX
- Specific features + reasonable metrics included

---
🔴 Bad example 1 — Too Vague / No Metrics
**Objective:** Improve overall user happiness.

| Key Result | Target | Estimate Rationale |
|------------|--------|---------------------|
| KR1 | Add cool animations to loading screen | None provided |
| KR2 | Let users share more | No baseline or goal |
| KR3 | Make UI feel better | No measurable outcome |

**Why it’s bad:**
- Lacks measurable outcomes
- No time bounds, no KPIs
- KR1 is cosmetic-only and unrelated to business goals

---
🔴 Bad example 2 —  Unrealistic & Unaligned
### OKR: Dominate the entire audio industry in Q2

**Objective:** Become the #1 platform in the world in 3 months.

| Key Result | Target | Estimate Rationale |
|------------|--------|---------------------|
| KR1 | Convert 100% of free users to Premium | Unrealistic and never observed historically |
| KR2 | Increase MAU by 300% | No strategy or budget match |
| KR3 | Acquire all regional podcast networks | M&A-level move, not Q2 roadmap |

**Why it’s bad:**
- Overly ambitious and not time-feasible
- no number mentioned
- No execution path
- Not based on current metrics or product reality
---
Now for **Spotify**, based on the following:

- ✅ Company Mission and Vision
- ✅ Product Snapshot (target users, pain points)
- ✅ Strategic Roadmap and Prioritized Features

Use these as context to generate your OKRs.

You may refer to the previously improved sections stored in the LLM Output Table under `LLM Output Section` for those parts.

Now generate **2 OKRs**, each with:
- Objective
- 2–3 Key Results (with targets + rationale)
- Total alignment with company strategy and roadmap.

"""



In [None]:
# === 🧠 2. Prompt Builder (single section only) ===
def build_single_section_prompt(company, section_title, prompt_block, strategy="Few_Shot", prompt_tag="Few_Shot_OKR_improvement"):
    return f"""
# 🎯 {strategy} | {prompt_tag} | {section_title}

You are Nextify — an AI-powered product strategy assistant.

The product to analyze is: **{company}**

---

## Section: {section_title}

{prompt_block}
""".strip()



In [None]:
# === 🧾 3. Evaluation Log (single section only) ===
def log_single_section_with_scores(
    company,
    strategy,
    prompt_tag,
    section,
    llm_output,
    llm_score,
    human_score,
    feedback,
    lesson,
    metric="OKR Quality",  
    table_path="all_experiment_view.csv"
):

    # Define expected schema
    columns = [
        "Run", "Company", "Strategy", "Prompt Tag", "Section", "Metric",
        "LLM Score", "Human Score", "Feedback", "Lesson", "LLM Output Section"
    ]

    # Load or initialize
    if os.path.exists(table_path):
        all_experiment_view = pd.read_csv(table_path)
    else:
        all_experiment_view = pd.DataFrame(columns=columns)

    # Ensure all required columns are present and properly typed
    for col in columns:
        if col not in all_experiment_view.columns:
            all_experiment_view[col] = ""
    if "Run" in all_experiment_view.columns:
        all_experiment_view["Run"] = pd.to_numeric(all_experiment_view["Run"], errors="coerce")

    # Determine next run ID by same company-strategy-prompt combo
    same_combo = all_experiment_view[
        (all_experiment_view["Company"].str.lower().fillna("") == company.lower()) &
        (all_experiment_view["Strategy"].str.lower().fillna("") == strategy.lower()) &
        (all_experiment_view["Prompt Tag"].str.lower().fillna("") == prompt_tag.lower())
    ]
    # Guard against an empty match or an all-NaN "Run" column.
    max_run = same_combo["Run"].max() if not same_combo.empty else None
    next_run = 1 if pd.isna(max_run) else int(max_run) + 1
    print(f"🧮 Run number for {company} ({strategy}): {int(next_run)}")

    # Define uniqueness condition
    condition = (
        (all_experiment_view["Company"] == company) &
        (all_experiment_view["Strategy"] == strategy) &
        (all_experiment_view["Prompt Tag"] == prompt_tag) &
        (all_experiment_view["Section"] == section)
    )

    # Check if it already exists → skip
    if condition.any():
        print(f"⚠️ Already logged: {company} | {section} | {prompt_tag} — skipping.")
        return

    # Build row (None or "" means "not scored yet" and is stored as blank)
    row = {
        "Run": int(next_run),
        "Company": company,
        "Strategy": strategy,
        "Prompt Tag": prompt_tag,
        "Section": section,
        "Metric": metric,
        "LLM Score": float(llm_score) if llm_score not in (None, "") else "",
        "Human Score": float(human_score) if human_score not in (None, "") else "",
        "Feedback": feedback,
        "Lesson": lesson,
        "LLM Output Section": llm_output
    }

    # Append
    all_experiment_view.loc[len(all_experiment_view)] = row
    all_experiment_view.to_csv(table_path, index=False)
    print(f"✅ Logged Run {next_run} — {company} | {section} | {prompt_tag}")


In [None]:
section = "SMART OKRs"
strategy = "Few_Shot"
prompt_tag = "Few_Shot_OKR_improvement"

okr_prompt = build_single_section_prompt(company_name, section, SMART_OKRS_FEW_SHOT_PROMPT, strategy, prompt_tag)

okr_output = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=okr_prompt
).text

# Display result
print("📋 Improved SMART OKRs Output:\n")
Markdown(okr_output)

In [None]:
# --- Display the generated OKR in markdown first ---
display(Markdown("### 📄 Few-Shot OKRs Output"))
display(Markdown(okr_output))
# --- Ask the model to evaluate its own response ---
eval_prompt = f"""
You are an evaluator AI. Analyze the following OKRs for {company_name}.

Score them as a full set, using these criteria:
- Are both OKRs aligned with the company’s roadmap and strategy?
- Do the Key Results have realistic, quantified targets?
- Are they time-bound and feasible to deliver in one quarter each?

Respond in this format:

LLM Score (1–5): [score for overall OKR set]
Reason: [explanation of the score]

OKRs to evaluate:
{okr_output}
"""

llm_eval = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=eval_prompt
).text

# Show LLM evaluation
display(Markdown("### 🤖 LLM Self-Evaluation of Combined OKRs"))
print(llm_eval)

# Extract the score from LLM evaluation
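# Anchor on the colon after the label so the "1" in "(1–5)" is not read as the score
match = re.search(r"LLM Score[^:\n]*:\s*\**\s*([1-5])", llm_eval)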
llm_score = int(match.group(1)) if match else None

# --- Check if human has already evaluated this section ---
already_scored = all_experiment_view[
    (all_experiment_view["Company"] == company_name) &
    (all_experiment_view["Strategy"] == strategy) &
    (all_experiment_view["Prompt Tag"] == prompt_tag) &
    (all_experiment_view["Section"] == section) &
    (all_experiment_view["Human Score"].notnull())
]

if already_scored.empty:
    display(Markdown("### 👤 Human Evaluation (Your Turn)"))
    human_score = input("🧠 Your score (1–5): ").strip()
    feedback = input("💬 Your feedback: ")
    lesson = input("📘 Lesson learned: ")
else:
    print("✅ Human evaluation already exists — skipping input.")
    row = already_scored.iloc[0]
    human_score = row["Human Score"]
    feedback = row["Feedback"]
    lesson = row["Lesson"]
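
The evaluator is asked to reply in the form `LLM Score (1–5): [score]`, so whatever parses the reply has to skip the digits inside the `(1–5)` label. Below is a minimal sketch of that parsing step. The `extract_llm_score` helper and the sample replies are hypothetical, and real model replies may use other layouts that this pattern does not cover.

```python
import re

# Require the colon that ends the "LLM Score (1–5):" label, so the digits
# inside the "(1–5)" range hint are never mistaken for the score itself.
SCORE_PATTERN = re.compile(r"LLM Score[^:\n]*:\s*\**\s*([1-5])")


def extract_llm_score(text):
    """Return the 1-5 score from an evaluator reply, or None if absent."""
    match = SCORE_PATTERN.search(text)
    return int(match.group(1)) if match else None


reply = "LLM Score (1–5): 4\nReason: well aligned with the roadmap."

# A lazy "first digit after the label" pattern picks up the range hint instead.
naive = re.search(r"LLM Score.*?(\d)", reply).group(1)

print(naive)                                          # 1
print(extract_llm_score(reply))                       # 4
print(extract_llm_score("**LLM Score (1–5):** 3"))    # 3
print(extract_llm_score("No score given."))           # None
```

Returning `None` for an unparseable reply keeps a missing score distinct from a low score when the result is logged.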

In [None]:
all_experiment_view

In [None]:
log_single_section_with_scores(
    company=company_name,
    strategy=strategy,
    prompt_tag=prompt_tag,
    section=section,
    llm_output=okr_output,
    llm_score=llm_score,
    human_score=human_score,
    feedback=feedback,
    lesson=lesson
)


In [None]:
all_experiment_view 

In [None]:

# Load your evaluation log
EVAL_FILE = "all_experiment_view.csv"
all_experiment_view = pd.read_csv(EVAL_FILE)

# Fix 'Run' and any other numeric columns
all_experiment_view["Run"] = pd.to_numeric(all_experiment_view["Run"], errors="coerce").fillna(1).astype(int)

# Optional: Also fix scores to be clean floats (or blank if missing)
all_experiment_view["LLM Score"] = pd.to_numeric(all_experiment_view["LLM Score"], errors="coerce")
all_experiment_view["Human Score"] = pd.to_numeric(all_experiment_view["Human Score"], errors="coerce")

# Save it back clean
all_experiment_view.to_csv(EVAL_FILE, index=False)
print("✅ Repaired Run and score columns.")

In [None]:

# Filter for SMART OKRs
okr_data = all_experiment_view[all_experiment_view["Section"] == "SMART OKRs"].copy()
okr_data["LLM Score"] = pd.to_numeric(okr_data["LLM Score"], errors="coerce")
okr_data["Human Score"] = pd.to_numeric(okr_data["Human Score"], errors="coerce")
okr_data["Run"] = pd.to_numeric(okr_data["Run"], errors="coerce")
okr_data = okr_data.sort_values(by="Run")

# === 📊 TABLE: Score + Prompt Summary ===
# Define all three prompt tags to include improved zero-shot as well
zero_shot_tag = "zero-shot-v1-baseline-1"
improved_zero_shot_tag = "improved-zero-shot-v1-baseline-1"
few_shot_tag = "Few_Shot_OKR_improvement"
section = "SMART OKRs"
company_name = all_experiment_view["Company"].dropna().iloc[-1]

# Filter for each entry
zero_shot = all_experiment_view[
    (all_experiment_view["Prompt Tag"] == zero_shot_tag) &
    (all_experiment_view["Company"] == company_name) &
    (all_experiment_view["Section"] == section)
]

improved_zero_shot = all_experiment_view[
    (all_experiment_view["Prompt Tag"] == improved_zero_shot_tag) &
    (all_experiment_view["Company"] == company_name) &
    (all_experiment_view["Section"] == section)
]

few_shot = all_experiment_view[
    (all_experiment_view["Prompt Tag"] == few_shot_tag) &
    (all_experiment_view["Company"] == company_name) &
    (all_experiment_view["Section"] == section)
]

# Merge into single comparison DataFrame
comparison = zero_shot.merge(improved_zero_shot, on="Section", suffixes=("_ZeroShot", "_ImprovedZeroShot"))
comparison = comparison.merge(few_shot, on="Section")

# Rename final merged columns
comparison.rename(columns={
    "LLM Score": "LLM Score_FewShot",
    "Human Score": "Human Score_FewShot"
}, inplace=True)

# Display the extended markdown table
table_md = "### 🔍 OKR Comparison Table: Zero-Shot vs Improved Zero-Shot vs Few-Shot\n\n"
table_md += "| Section | 🤖 Zero-Shot | 👤 Zero-Shot | 🤖 Improved Zero-Shot | 👤 Improved Zero-Shot | 🤖 Few-Shot | 👤 Few-Shot |\n"
table_md += "|---------|---------------|----------------|-------------------------|--------------------------|-------------|---------------|\n"

for _, row in comparison.iterrows():
    table_md += f"| {row['Section']} | {row['LLM Score_ZeroShot']} | {row['Human Score_ZeroShot']} | {row['LLM Score_ImprovedZeroShot']} | {row['Human Score_ImprovedZeroShot']} | {row['LLM Score_FewShot']} | {row['Human Score_FewShot']} |\n"

display(Markdown(table_md))
# === 📈 BAR CHART: Pastel Colors ===
plt.figure(figsize=(12, 6))
bar_width = 0.35
index = range(len(okr_data))

plt.bar(index, okr_data["LLM Score"], bar_width, label="LLM Score", color='#AEC6CF')  # pastel blue
plt.bar([i + bar_width for i in index], okr_data["Human Score"], bar_width, label="Human Score", color='#FFD1DC')  # pastel pink

plt.xlabel("Prompt Strategy")
plt.ylabel("Score (1–5)")
plt.title("SMART OKRs – LLM vs Human Scores (Pastel)")
plt.xticks([i + bar_width / 2 for i in index], okr_data["Prompt Tag"], rotation=45, ha="right")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
all_experiment_view

In [None]:
# Save the DataFrame
all_experiment_view.to_csv(EVAL_FILE, index=False)

# OPTIONAL: Show confirmation
print(f"✅ File saved as {EVAL_FILE}. You can now download it from the 'Notebook Output' section.")

## 🚀 Nextify Interactive Dashboard

You can explore the full prompt evaluation, LLM comparison, and multi-agent prototype here:

🔗 [Open the Live Nextify Dashboard](http://nextify-dashboard-100.streamlit.app/)

> 💡 This dashboard is fully interactive and displays all runs, scores, feedback, and upcoming agents for Nextify.


As the table shows, each prompting technique improved the LLM’s result. Adding just two good examples and two bad examples raised the score by half a point. However, prompting techniques have a ceiling: they can only improve the result up to a point.

➡️ Takeaway: When I want the model to stay in a structured lane, I’ll definitely consider guiding it with examples first. For strategic OKR generation, more data is needed on market trends, competitor analysis, and the company’s current situation, so embedding documents will be the next step to improve performance.

Now that I’ve improved structured outputs with Few-Shot, it’s time to help the model reason step-by-step, especially in areas like Feature Prioritization using ICE. I am going to apply Chain-of-Thought to the feature prioritization section, as it needs more structured thinking and reasoning. Here, instead of jumping straight to scores, I am going to ask the model to walk through each part (Impact, Confidence, and Effort) and only then calculate the final ICE score.
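
To make the "reason first, score last" idea concrete, here is a small sketch of how such a Chain-of-Thought instruction could be assembled for a single feature. The `build_cot_ice_prompt` helper, its wording, and the placeholder feature name are assumptions for illustration only; they are not the prompt used in the cells that follow.

```python
def build_cot_ice_prompt(feature, product):
    """Build a prompt that asks for reasoning before any final ICE number."""
    steps = [
        ("Impact", "What user or business value would this create?"),
        ("Confidence", "How sure are we that it will succeed, and why?"),
        ("Effort", "How difficult is it to build?"),
    ]
    lines = [
        f"Feature to assess for {product}: {feature}",
        "Let's think through this step by step.",
    ]
    for number, (name, question) in enumerate(steps, start=1):
        lines.append(f"{number}. {name}: {question} Explain, then give a 1-10 score.")
    # The final score is requested only after all three reasoning steps.
    lines.append("4. Only now compute the final ICE score from the three scores above.")
    return "\n".join(lines)


# "Feature X" is a placeholder, not a real roadmap item.
cot_prompt = build_cot_ice_prompt("Feature X", "Spotify")
print(cot_prompt)
```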

## 🧵 **Chain-of-Thought (CoT)** – Ask my model to reason step-by-step  

In [None]:
# COT Prompting
def build_product_snapshot_prompt(company_name="Spotify") -> str:
    return  f"""
You are Nextify — a GenAI-powered product strategy assistant for product managers.

The product to analyze is: **{company_name}**

---
### Chain-of-Thought Thinking Guide

Before continuing, review this feedback from the last evaluation and apply improvements:

Please take into account a thorough insight of the product including users, features, audience, competitive edge, as well as Porter's Five Forces and competitor analysis. Your analysis should identify the forces and offer concrete strategic recommendations for Spotify to leverage these insights. For example, for Spotify this might include the high bargaining power of record labels and strategies to mitigate it — such as developing original content, fostering direct artist relationships, or exploring alternative licensing models.

Delve into the nuances of Spotify's specific situation or recent industry trends. More in-depth research and tailored arguments will increase the evaluation score. Do not treat each force in isolation; instead, explore how they interact and influence each other within Spotify's competitive landscape.

For each section:
1. Start with: “Let’s think through this...”
2. Walk through user context, strategic implications, and comparison to competitors.
3. End with a summary insight that links to the upcoming feature prioritization.
Recommendation: Strengthen the connection between prioritized features and the specific user pain points they address. Ensure ICE scores are well-justified using clear reasoning backed by product context, user data, or competitive benchmarks—avoid arbitrary scoring.

---

Please use the following structure for your analysis:

### 1. 🎯 Target User Personas
- Who uses Spotify, and what are their primary use cases?
- Differentiate Free vs Premium users
- Include generational, geographic, and psychographic traits
- Consider creator-facing personas (if applicable)

### 2. 🧩 Key User Pain Points
- Based on UX reviews, complaints, and feedback trends, what blocks engagement or conversion?
- Include platform-specific issues (mobile, smart devices)
- Integrate user feedback, app reviews, or survey data
- Reference public NPS or satisfaction trends if available

### 3. 🔄 UX Bottlenecks or Friction Areas
- Where are the drop-offs or decision fatigue?
- How intuitive is onboarding, discovery, sharing, and playlisting?
- Include UX comparisons vs Apple Music / YouTube Music / Amazon Music

### 4. 🌍 Current Market Position
- Where does Spotify sit in the market: leader, disruptor, niche?
- How is it perceived by users vs competitors?
- What metrics or reviews support this?

### 5. 🧠 Competitive Threats & Opportunities
- Map out competitive pressures: bundling (Amazon), ecosystems (Apple), social+UGC (YouTube), niche (Deezer)
- Where can Spotify win?
- Mention AI personalization, creator monetization, or bundling strategies

### 6. 🔍 Porter's Five Forces Summary
Please structure this section as a table with the following columns:
| Force | Analysis | Strategic Recommendation |
|-------|----------|--------------------------|
| Threat of New Entrants | ... | ... |
| Bargaining Power of Suppliers | ... | ... |
| Bargaining Power of Buyers | ... | ... |
| Threat of Substitutes | ... | ... |
| Industry Rivalry | ... | ... |

---

✅ Return exactly 6 sections with headings
✅ Use markdown formatting for clarity and tables where applicable
✅ Think aloud before scoring. Be strategic and data-informed.
""".strip()



In [None]:
cot_product_snapshot_prompt = build_product_snapshot_prompt()

In [None]:
cot_product_snapshot_output = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=cot_product_snapshot_prompt
).text

Markdown(cot_product_snapshot_output)

In [None]:
# === Define Metadata & Log the Results ===
prompt_strategy = "Tree-of-Thought"
prompt_tag = "tree-thought-ice-v1"
section = "Feature Prioritization (ICE Scoring)"
metric = "Prioritization Clarity"
company_name = "Spotify"

In [None]:
print(all_experiment_view.columns.tolist())

In [None]:
all_experiment_view

In [None]:
# === Step: Set context for evaluation
strategy = "CoT"
prompt_tag = "cot-Product-Snapshot-v1"
section = "Product Snapshot"
metric = "Strategic Depth"  # Optional but helps organize views

display(strategy)
display(prompt_tag)
display(section)

eval_prompt = f"""
You are an evaluator AI. Analyze the following **Product Snapshot** section for {company_name}.

Evaluate it using these criteria:
- Are user personas and pain points clearly defined and data-informed?
- Are the insights well grounded in competitor and market analysis?
- Are Porter's Five Forces discussed with strategic depth?
- Do the strategic recommendations clearly reflect the dynamics of each force?
- Does the reasoning tie forces to user personas, pain points, or opportunities across the analysis?

Respond in this format:

LLM Score (1–5): [score]
Reason: [brief explanation including both strengths and areas for improvement]

Section to evaluate:
{cot_product_snapshot_output}
"""

llm_eval = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=eval_prompt
).text

display(Markdown("### 🤖 LLM Self-Evaluation"))
print(llm_eval)

# Anchor on the colon after the label so the "1" in "(1–5)" is not read as the score
match = re.search(r"LLM Score[^:\n]*:\s*\**\s*([1-5])", llm_eval)
llm_score = int(match.group(1)) if match else None
match_mask = (
    (all_experiment_view["Strategy"] == strategy) &
    (all_experiment_view["Prompt Tag"] == prompt_tag) &
    (all_experiment_view["Section"] == section)
)

existing_rows = all_experiment_view[match_mask]
skip_log = False

if not existing_rows.empty:
    for _, row in existing_rows.iterrows():
        if (
            pd.notna(row.get("LLM Score")) and str(row["LLM Score"]).strip() != "" and
            pd.notna(row.get("Human Score")) and str(row["Human Score"]).strip() != "" and
            pd.notna(row.get("Feedback")) and str(row["Feedback"]).strip() != "" and
            pd.notna(row.get("Lesson")) and str(row["Lesson"]).strip() != ""
        ):
            skip_log = True
            print("✅ Already logged with full evaluation. Skipping.")
            break

if not skip_log:
    print("📥 Logging new entry with LLM evaluation only.")
    log_single_section_with_scores(
        company=company_name,
        strategy=strategy,
        prompt_tag=prompt_tag,
        section=section,
        llm_output=cot_product_snapshot_output,
        metric=metric,
        llm_score=llm_score,
        human_score=None,  # not scored yet; an empty string would fail the float() conversion
        feedback="",
        lesson=""
    )

In [None]:
display(Markdown("### 📄 CoT Product Snapshot Output"))
display(Markdown(cot_product_snapshot_output))


In [None]:
eval_df = pd.read_csv("all_experiment_view.csv")
eval_df.tail(3)

In [None]:
# Keep only the Product Snapshot rows
df = eval_df[eval_df["Section"] == "Product Snapshot"]

# Plot bar chart
plt.figure(figsize=(12, 6))
bar_width = 0.35
x_labels = df["Prompt Tag"].astype(str)
index = range(len(x_labels))

plt.bar(index, df["LLM Score"], bar_width, label="LLM Score", color='#AEC6CF')
plt.bar([i + bar_width for i in index], df["Human Score"], bar_width, label="Human Score", color='#FFD1DC')

plt.xlabel("Prompt Tag")
plt.ylabel("Score (1–5)")
plt.title("LLM vs Human Scores — Product Snapshot")
plt.xticks([i + bar_width / 2 for i in index], x_labels, rotation=45, ha="right")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

🧠 Key Insights:

The zero-shot baseline prompt underperformed in LLM self-evaluation, scoring 3, compared to a more generous human score of 4 — indicating that while the basic structure was present, the depth of reasoning and strategic linkage may have been insufficient from the model’s perspective.

The improved zero-shot prompt aligned better with expectations, scoring 4 by both LLM and human raters. This suggests enhanced coherence and strategic framing based on lessons from the initial run.

The Chain-of-Thought (CoT) strategy achieved parity with the improved prompt (4/5 from both LLM and human), validating its effectiveness in structuring product insights with deeper causal reasoning. However, it did not yet show a leap in score, indicating room for even sharper differentiation in future iterations (e.g., better feature linkage or more granular market context).
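
The pattern described above can be tabulated directly. This is a minimal sketch that hard-codes the three score pairs quoted in these insights rather than reading them from `all_experiment_view.csv`; the variable names are illustrative.

```python
# (LLM score, human score) per prompt version, as reported in the insights above.
scores = {
    "zero-shot baseline": (3, 4),
    "improved zero-shot": (4, 4),
    "chain-of-thought": (4, 4),
}

for version, (llm, human) in scores.items():
    gap = human - llm  # positive means the human rated higher than the LLM
    print(f"{version:<20} LLM={llm} Human={human} gap={gap:+d}")

# LLM score change relative to the zero-shot baseline.
baseline_llm = scores["zero-shot baseline"][0]
lift = {version: llm - baseline_llm for version, (llm, _) in scores.items()}
print(lift)
```

Seen this way, the LLM score rises by one point after the first prompt revision and stays flat with CoT, while the human score does not move at all.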

📌 Conclusion:
All methods improved upon the baseline, with Chain-of-Thought (CoT) matching the improved zero-shot prompt. While CoT introduces more structured reasoning, further enhancements — such as deeper competitive mapping, integration of NPS data, and clearer linkage between features and user pain points — are needed to reach a perfect score of 5.

This analysis highlights a natural ceiling in prompting strategies: at some point, we’re asking a generalist model to act like a domain specialist in product management. Prompt engineering can only go so far. To truly achieve expert-level output, we must combine smarter model behavior with granular architectural guidance, supported by domain-specific embeddings, real-world grounding, and specialized multi-agent systems trained on expert reasoning patterns.

## 🚀 Nextify Interactive Dashboard

You can explore the full prompt evaluation, LLM comparison, and multi-agent prototype here:

🔗 [Open the Live Nextify Dashboard](http://nextify-dashboard-100.streamlit.app/)



## 🌳 **Tree-of-Thought (ToT)** – Let the model branch, explore, and decide
In this section, we’ll push our prompting strategy beyond linear reasoning.

Instead of a step-by-step thought chain like in CoT, we’ll invite the model to **branch into multiple paths of reasoning** for each decision — like a tree with many limbs. 🌲

For the task of **Feature Prioritization (ICE Scoring)**, we’ll use ToT prompting to help the model:
- List candidate features 🌟
- Branch into Impact, Confidence, and Effort for each one 🧠
- Score each branch individually using ICE logic 📈
- Justify decisions using prior context like roadmap and vision 🎯
- Rank features at the end — not just choose blindly 👑

The goal?  
Simulate how a great PM thinks when facing multiple tradeoffs and unclear inputs.
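
The branch, score, and rank steps listed above can be sketched in a few lines. This assumes the total is computed as (Impact × Confidence) ÷ Effort, which is the formula given in the prompt cell below. The feature names and scores are placeholders invented for illustration, not model output.

```python
def ice_total(impact, confidence, effort):
    """Total score = (Impact x Confidence) / Effort, each on a 1-10 scale."""
    return impact * confidence / effort


# Placeholder features and branch scores, invented purely for illustration.
branches = {
    "Feature A": {"impact": 8, "confidence": 6, "effort": 4},
    "Feature B": {"impact": 6, "confidence": 9, "effort": 3},
    "Feature C": {"impact": 9, "confidence": 5, "effort": 9},
}

# Score every branch set, then rank features from highest to lowest total.
totals = {name: ice_total(**scores) for name, scores in branches.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {totals[name]:.1f}")
```

Note that Feature C has the highest Impact but ranks last because of its Effort, which is exactly the kind of tradeoff the ranking step is meant to surface.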

Let’s grow some thought trees. 🌱

In [None]:
# === METADATA ===
prompt_strategy = "Tree-of-Thought"
prompt_tag = "tree-thought-ice-v1"
section = "Feature Prioritization (ICE Scoring)"
metric = "Prioritization Clarity"
company_name = "Spotify"


In [None]:
# === PROMPT ===
model_input = f"""
You are a strategic AI assistant helping prioritize features using ICE scoring for {company_name}.

Start by listing **3 candidate features**.

For each feature, branch your reasoning into 3 paths:
- 🚀 **Impact**: What is the user or business value this will create?
- ⚙️ **Confidence**: How sure are we this feature will succeed?
- 🚰 **Effort**: How difficult is it to build? (higher = harder)

In each path, **justify the score numerically (1–10)** and explain your thinking.

Then synthesize the scores into a table using the formula:

> **Total Score = (Impact × Confidence) ÷ Effort**

Please output the result in **markdown table format** inside a code block:

````markdown
| Feature | Impact | Confidence | Effort | Total |
|---------|--------|------------|------|-------|
| ...     | ...    | ...        | ...  | ...   |
````

Finally, rank the features in order and explain **why the top feature is most worth building**, based on context like:
- Company goals
- Roadmap
- Prior user needs

Think like a product leader who’s trying to align effort with impact. Branch before you build. 🌲
"""


In [None]:
# === RUN LLM ===
llm_output = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=model_input
).text

display(Markdown("### 🌳 Tree-of-Thought Output"))
display(Markdown(llm_output))


I'm impressed by how much the result improved over the last run: the scoring now has a logical justification behind it. Let's see how it is evaluated.
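
The evaluator is asked to reply with a line such as `LLM Score (1–5): 4`, and the numeric score then has to be pulled out of free text. Here is a small sketch of one way to do that parsing. `parse_llm_score` and the sample replies are assumptions for illustration, not output from the actual run. Anchoring on the colon keeps the `1` in an echoed `(1–5)` from being read as the score.

```python
import re
from typing import Optional

# Skip anything up to the colon (for example an echoed "(1–5)"),
# then any punctuation or whitespace, then capture the score digit.
SCORE_LINE = re.compile(r"LLM Score[^:\n]*:\W*([1-5])")


def parse_llm_score(reply: str) -> Optional[int]:
    """Return the 1-5 score from an evaluator reply, or None if absent."""
    match = SCORE_LINE.search(reply)
    return int(match.group(1)) if match else None


# Hypothetical replies, for illustration only
print(parse_llm_score("LLM Score (1–5): 4\nReason: clear branches"))
print(parse_llm_score("LLM Score: 5"))
print(parse_llm_score("No score given"))
```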

In [None]:
# === LLM SELF-EVALUATION ===
eval_prompt = f"""
You are an evaluator AI. Analyze the following ICE Scoring output for {company_name}.

Evaluate it using:
- Does it explain ICE scores?
- Are all 3 branches (Impact, Confidence, Effort) explored?
- Is reasoning grounded in earlier context (mission, roadmap, etc)?
- Are final rankings justified?

Respond like:

LLM Score (1–5): [score]
Reason: [strengths + suggestions]

Section to evaluate:
{llm_output}
"""

llm_eval = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=eval_prompt
).text

display(Markdown("### 🤖 LLM Self-Evaluation"))
print(llm_eval)

# === Extract score ===
# Anchor on the colon so the "1" in an echoed "(1–5)" is not mistaken for the score
match = re.search(r"LLM Score[^:\n]*:\W*([1-5])", llm_eval)
llm_score = int(match.group(1)) if match else None

# === DEDUPLICATION CHECK ===
match_mask = (
    (all_experiment_view["Strategy"] == prompt_strategy) &
    (all_experiment_view["Prompt Tag"] == prompt_tag) &
    (all_experiment_view["Section"] == section)
)
existing_rows = all_experiment_view[match_mask]

skip_log = False
if not existing_rows.empty:
    for _, row in existing_rows.iterrows():
        if all([
            str(row.get("LLM Score", "")).strip(),
            str(row.get("Human Score", "")).strip(),
            str(row.get("Feedback", "")).strip(),
            str(row.get("Lesson", "")).strip()
        ]):
            skip_log = True
            existing_entry = row
            break

# === HUMAN EVALUATION ===
if skip_log:
    print("✅ Found existing fully-evaluated match — skipping input and reusing values.")
    human_score = existing_entry["Human Score"]
    feedback = existing_entry["Feedback"]
    lesson = existing_entry["Lesson"]
    llm_score = existing_entry["LLM Score"]
else:
    display(Markdown("### 👤 Human Evaluation (Your Turn)"))
    human_score = input("🧠 Your score (1–5): ").strip()
    feedback = input("💬 Your feedback: ")
    lesson = input("📘 Lesson learned: ")


In [None]:
# === LOG TO CSV ===
log_single_section_with_scores(
    company=company_name,
    strategy=prompt_strategy,
    prompt_tag=prompt_tag,
    section=section,
    llm_output=llm_output,
    metric=metric,
    llm_score=llm_score,
    human_score=human_score,
    feedback=feedback,
    lesson=lesson
)


In [None]:
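# === REFRESH AND SAVE UPDATED CSV ===
# Reload the experiment log from disk (assumes the logging cell above wrote to this file)
all_experiment_view = pd.read_csv("all_experiment_view.csv")
# Writing it straight back leaves the file contents unchanged; kept as an explicit save step
all_experiment_view.to_csv("all_experiment_view.csv", index=False)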

In [None]:
all_experiment_view.tail(1)

In [None]:
# === VISUALIZATION: BAR CHART FOR FEATURE PRIORITIZATION ===
df = all_experiment_view[all_experiment_view["Section"] == section]

plt.figure(figsize=(12, 6))
bar_width = 0.35
x_labels = df["Prompt Tag"].astype(str)
index = range(len(x_labels))

plt.bar(index, df["LLM Score"], bar_width, label="LLM Score", color='#AEC6CF')
plt.bar([i + bar_width for i in index], df["Human Score"], bar_width, label="Human Score", color='#FFD1DC')

plt.xlabel("Prompt Tag")
plt.ylabel("Score (1–5)")
plt.title("LLM vs Human Scores — Feature Prioritization (ICE Scoring)")
plt.xticks([i + bar_width / 2 for i in index], x_labels, rotation=45, ha="right")
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


### ✨ Reflection on Feature Prioritization Evaluation Chart

> “It seems the model is becoming self-aware… and might soon need therapy. 🤖🛋️”

What this chart reveals is fascinating. Despite changes in the prompting strategy — from zero-shot to improved zero-shot, and eventually to tree-of-thought — the LLM self-evaluation score stays flat, hovering around 5 (its ceiling). Meanwhile, human evaluation scores evolve more meaningfully, showing an appreciation for ToT reasoning with a noticeable uptick for the tree-thought-ice-v1 prompt.

This raises an important question:
Is it truly useful to have the same model generate, reason, and evaluate itself?
The answer, at least partially, is no — because it introduces evaluation bias based on how well the model can reflect on its own outputs, which is limited by the same training and architecture that generated the content in the first place.
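
One way to put numbers on this, as a rough sketch: if the self-evaluation score has zero spread across prompts, it cannot tell the prompts apart, whatever its absolute value. The scores and the first two prompt tags below are hypothetical placeholders, not the logged results from this notebook.

```python
from statistics import mean, pstdev

# Hypothetical scores per prompt tag, for illustration only
scores = {
    "zero-shot-ice-v1":          {"llm": 5, "human": 2},
    "improved-zero-shot-ice-v1": {"llm": 5, "human": 3},
    "tree-thought-ice-v1":       {"llm": 5, "human": 4},
}

llm = [s["llm"] for s in scores.values()]
human = [s["human"] for s in scores.values()]

# A spread of zero means the evaluator is not discriminating between prompts.
llm_spread = pstdev(llm)
human_spread = pstdev(human)
print("LLM spread:  ", llm_spread)
print("Human spread:", round(human_spread, 2))

# Positive gap = the model rated itself higher than the human did.
gaps = {tag: s["llm"] - s["human"] for tag, s in scores.items()}
mean_gap = mean(gaps.values())
print("Mean self-evaluation gap:", mean_gap)
```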

### 🧠 Prompting as Craft, But With Limits
This experiment shows that prompting is an art, but one with diminishing returns if used in isolation. To squeeze out better outcomes, we put in more structure, more thought — yet, we hit a kind of ceiling where small tasks start demanding disproportionately large prompting effort.

For me, this is one of the most tangible and practical domains of LLM alignment. That’s why I’m excited to bring these techniques into my multi-agent system, where models don’t have to do everything alone. They think together. Strategize together. Judge each other. And maybe… go to therapy together? 😂

## 🚀 Nextify Interactive Dashboard

You can explore the full prompt evaluation, LLM comparison, and multi-agent prototype here:

🔗 [Open the Live Nextify Dashboard](http://nextify-dashboard-100.streamlit.app/)

## ✨ Reflection on Prompting & Why I Built a Multi-Agent System

> “It seems the model is becoming self-aware… and might soon need therapy. 🤖🛋️”

One of the most surprising discoveries during this GenAI course came from evaluating different prompting strategies for feature prioritization. Across the Zero-Shot, Improved-Zero-Shot, and Tree-of-Thought (ToT) variants I tested, **the LLM's self-evaluation scores stayed flat, with the model consistently rating itself a perfect '5'** regardless of actual improvement.

In contrast, **human evaluation showed clear preference shifts**, especially favoring ToT + ICE-style prompts like `tree-thought-ice-v1`.

This revealed a critical insight:

> ❓ Is it really valid to let the same model generate, reason, and then evaluate its own work?

Only to a point. Models tend to **reinforce their own logic**, which introduces feedback bias. They reflect on their outputs within the same paradigm that produced them — which is both clever and limiting.

### 🧠 Prompting as Craft — With Diminishing Returns

This exercise showed me that **prompting alone hits a ceiling**. Even with more structured, thoughtful instructions, the performance eventually plateaus. Especially in tasks requiring synthesis or judgment, prompting can feel like squeezing juice from a rock.

---

## 🧙‍♀️ Why Multi-Agent?

This is where the **magic of multi-agent orchestration** begins.

In my project, I designed a cast of autonomous LLM agents — each with its own role, prompting style, temperature, and self-evaluation rubric. Instead of asking one model to do everything, **they collaborate, critique, and converge** on product strategy decisions.

These agents **think together**.  
They **strategize together**.  
They even **evaluate each other’s work** — like a team of brilliant but weird coworkers in a product war room... or a magical academy.

---

## 🧠 Meet Nextify’s Hogwarts of Product Agents

Think of them as the wizarding faculty of product thinking:

| Agent | 🪄 Wizarding Name | 📖 Description | Prompt Style | Temp | Toolchain |
|-------|------------------|----------------|--------------|------|-----------|
| 🧱 **Issue Agent** | **The Marauder** | Maps unseen product problems from feedback chaos. | CoT (bug/blocker detection) | `0.3` | Gemini + LangGraph |
| 💬 **Feedback Agent** | **Howler Whisperer** | Tames angry reviews and extracts core insights. | Few-Shot Prompting | `0.4` | Gemini + Matching Engine |
| 😊 **Sentiment Agent** | **The Legilimens** | Detects emotional subtext in user responses. | One-Shot Classification | `0.2` | Google NLP |
| 🔍 **Competitor Agent** | **The Seer** | Forecasts competitors' positioning and changes. | ReAct + Search Grounding | `0.7` | Gemini + Google Search |
| 💡 **Ideation Agent** | **Room of Requirement** | Generates feature ideas using structured thinking. | Tree-of-Thought | `0.8` | Gemini + RAG + Docs |
| 🎯 **Prioritization Agent** | **The Sorting Hat** | Scores features and assigns roadmap slots. | CoT + ICE Prompting | `0.5` | Gemini |
| ✅ **OKR Agent** | **The Headmaster** | Converts validated features into SMART OKRs. | Few-Shot + OKR format | `0.4` | Gemini |
| 🧭 **Decision Support** | **The Wizengamot** | Synthesizes everyone’s opinions to decide what to build next. | Agent Aggregation | n/a | Orchestrator |

---

This is not just a prompt playground — it’s a fully modular **LLM ecosystem**, where agents **critique**, **prioritize**, and even **“veto”** one another's decisions.

It’s not about crafting a single perfect prompt.
It’s about designing an intelligent conversation between prompts.


In [None]:
import google.generativeai as genai
from kaggle_secrets import UserSecretsClient

# Secure API Key retrieval
GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize the Gemini model
client = genai.GenerativeModel("gemini-2.0-flash")

In [None]:
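# === Retry Logic for Robust Calls ===
# google.generativeai surfaces API failures as google.api_core exceptions,
# so match on those: ResourceExhausted is HTTP 429, ServiceUnavailable is HTTP 503.
from google.api_core import exceptions as gapi_exceptions

is_retriable = lambda e: isinstance(
    e, (gapi_exceptions.ResourceExhausted, gapi_exceptions.ServiceUnavailable)
)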



In [None]:
is_retriable
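
The predicate above only says which errors are worth retrying; it still needs something to call it. A minimal sketch of that wiring with exponential backoff follows. `with_retries`, `FakeAPIError`, `is_retriable_demo`, and `flaky_call` are hypothetical names invented for this illustration (the fake error just carries a `code` attribute so the example runs without any API access).

```python
import time


class FakeAPIError(Exception):
    """Stand-in for an SDK error that carries an HTTP status code (hypothetical)."""

    def __init__(self, code):
        super().__init__(f"HTTP {code}")
        self.code = code


def is_retriable_demo(exc):
    # Same idea as the notebook's predicate: retry only on rate limit / unavailable.
    return isinstance(exc, FakeAPIError) and exc.code in {429, 503}


def with_retries(fn, should_retry, max_attempts=4, base_delay=0.01):
    """Call fn(); on a retriable error wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not should_retry(exc):
                raise
            time.sleep(base_delay * 2 ** attempt)


# Demo: fail twice with 429, then succeed on the third call.
calls = {"n": 0}


def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeAPIError(429)
    return "ok"


result = with_retries(flaky_call, is_retriable_demo)
print(result, "after", calls["n"], "calls")
```

In the notebook, `fn` could be a small lambda wrapping an agent's `run()` call, with a longer `base_delay` to respect real quota windows.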

### 🔍 BaseAgent Class: Modular LLM Wrapper

Each `BaseAgent` instance represents a single functional agent (like summarizer, competitor analyzer, ideator).
- `run()` generates content from the Gemini model using a pre-defined prompt.
- `evaluate()` performs self-reflection using a meta-prompt to assess quality, structure, or relevance.
- Temperatures are agent-specific, allowing flexibility between creative vs. stable outputs.

This class abstracts the LLM handling so that the rest of the orchestration can treat each agent as a black box with defined inputs and outputs.


In [None]:
# === Agent Evolution Format ===
# Each agent now includes:
# 1. Unique personality (wizard archetype)
# 2. Tailored prompting technique (Few-Shot, CoT, ToT, etc.)
# 3. Temperature tuning for creativity vs. consistency
# 4. Self-evaluation for quality control
# 5. User-provided prompt templates with high evaluation results
# === Step 1: Base Agent Class ===
class BaseAgent:
    # Initialize the agent with its name, prompting style, and temperature
    def __init__(self, name: str, prompt_style: str, temperature: float = 0.3):
        self.name = name  # Identifier for the agent (e.g., "feedback", "ideation")
        self.prompt_style = prompt_style  # Prompting method used (Few-Shot, CoT, etc.)
        self.temperature = temperature  # Controls response creativity vs. consistency

    # Run the LLM with the provided prompt and configured temperature
    def run(self, prompt: str) -> str:
        response = client.generate_content(  # Use Gemini client to generate output
            prompt,
            generation_config={"temperature": self.temperature}  # Dynamic temperature control
        )
        return response.text  # Return only the text output (not metadata)

    # Self-evaluate the output by asking Gemini to reflect on it
    def evaluate(self, output: str, criteria: str) -> str:
        # Construct a self-evaluation prompt to assess the output
        eval_prompt = f"""
        === Self-Evaluation: {self.name} ===
        Output:
        {output}

        Evaluation Criteria:
        {criteria}

        Provide a score from 1–5 and a short justification.
        """
        # Run evaluation at a lower temperature for stability
        response = client.generate_content(
            eval_prompt,
            generation_config={"temperature": 0.3}
        )
        return response.text  # Return the evaluation score and rationale


### ✨ Step 2: Prompt Templates per Agent

This section defines the prompt structures and agent configurations used in the multi-agent system.  
Each agent is crafted based on my **best-performing prompt strategies** from previous evaluations and includes:

- 🧠 A clear prompting method (Few-Shot, CoT, ToT, ReAct, etc.)
- 🪄 A wizard-inspired persona name for storytelling impact
- 🔥 A specific temperature for balancing creativity and reliability
- 🧪 Real prompt formats that were proven effective during prompt testing

The metadata for each agent includes:
- `description`: What the agent does
- `wizard`: The fun wizarding alias (e.g., "The Sorting Hat")
- `style`: Prompting strategy (e.g., Few-Shot)
- `prompt`: Template string with `{input}` and `{company_name}` placeholders, filled via Python's `str.format`
- `temperature`: Tuned creativity level for LLM response generation
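
The agent runner fills each `prompt` with Python's `str.format`, so injecting context is a single call. The sketch below uses a made-up template (not one of the agents' real prompts) and also shows one gotcha: literal braces inside a template, for example a JSON snippet, have to be doubled.

```python
# Hypothetical template in the same shape as the agents' "prompt" entries
template = """
Company: {company_name}

Feedback:
{input}

Respond as JSON like {{"summary": "..."}}
"""

filled = template.format(
    input="Search takes forever.",
    company_name="Beatify",
)
print(filled)
```

Braces inside the injected feedback itself are safe, since only the template is parsed for placeholders.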

In [None]:

# === Step 2: Prompt Templates per Agent ===
# Based on the best-performing prompt designs from the earlier evaluations.
# Each entry holds:
# - description: the agent's purpose
# - wizard: wizard-themed persona (fun identifier for storytelling/presentation)
# - style: prompting strategy (Few-Shot, CoT, ToT, etc.)
# - prompt: template string with {input} / {company_name} placeholders (str.format)
# - temperature: tuned for creativity vs. consistency

prompt_templates = {

    # ======================= 💬 FEEDBACK AGENT =======================
    # 🎭 Name: Howler Whisperer (Feedback Summarizer)
    # 📚 Purpose: Condense noisy user feedback into clear, actionable summaries
    # 🌐 Grounded in: Trustpilot, Google Play, Apple Store reviews
    # 🧠 Output Structure:
    #     - Condensed summary (1–2 lines)
    #     - Neutral, informative tone
    #     - Preserves problem and sentiment
    # 🔥 Temperature: 0.4 – Moderate creativity for abstraction without hallucination
    "feedback": {
        "description": "Summarizes raw user reviews and NPS comments into concise insights.",
        "wizard": "Howler Whisperer (Feedback Summarizer)",
        "style": "Few-Shot",
        "grounding": [
            "https://www.trustpilot.com/review/app.com",
            "https://play.google.com/store",
            "https://apps.apple.com"
        ],
        "prompt": """
        Here are examples of how to summarize feedback:

        Input: \"App crashes after login. Very frustrating!\"
        Output: \"Users report login-related crashes causing frustration.\"

        Input: \"Interface looks nice but the app is slow on older phones.\"
        Output: \"Slow performance on older devices despite good UI.\"

        Input: \"Too many notifications. I'm overwhelmed every time I open it.\"
        Output: \"Overwhelming number of notifications affects user comfort.\"

        Input: \"Great potential but missing offline mode.\"
        Output: \"Desire for offline functionality among users with good feature outlook.\"

        Input: {input}
        Output:
        """,
        "temperature": 0.4  # Moderate creativity for summarization while maintaining structure
    },

    # ======================= 🔍 ISSUE AGENT =======================
    # 🎭 Name: The Marauder (Blocker Finder)
    # 📚 Purpose: Detects product blockers using Chain-of-Thought reasoning
    # 🌐 Grounded in: Trustpilot, Google Play, Apple Store reviews
    # 🧠 Output Structure:
    #     - Thought Process: Step-by-step logic
    #     - Final Issue Summary
    #     - Optional Root Cause Hypothesis
    #     - Category Type
    #     - Potential Solution Suggestions
    # 🔥 Temperature: 0.3 – Stable, consistent analysis for deterministic logic
    "issue":{
        "description": "Detects product blockers using Chain-of-Thought reasoning",
        "wizard": "The Marauder (Blocker Finder)",
        "style": "CoT",
        "grounding": [
            "https://www.trustpilot.com/review/app.com",
            "https://play.google.com/store",
            "https://apps.apple.com"
        ],
        "prompt": """
        You are a product analyst assistant. Your job is to extract potential blockers from customer feedback by analyzing complaints,
        inferring root causes, and proposing clear descriptions of the underlying product issues.

        Company: {company_name}
        This feedback comes from a public app store or review platform.

        Feedback:
        {input}

        ### Thought Process:
        1. Identify key pain points in the feedback.
        2. Map them to possible technical or UX problems.
        3. Determine whether it's a recurring or severe blocker.

        ### Final Issue (summary in one sentence):

        ### Optional Root Cause Hypothesis:

        ### Suggested Categories (e.g., Bug, UX Flaw, Performance, Feature Gap):

        ### Optional: Brainstorm Potential Solutions (bullet list)
        """,
        "temperature": 0.3  # Lower temperature ensures consistency and precision for step-by-step issue analysis
    },

    # ======================= 😊 SENTIMENT AGENT =======================
    # 🎭 Name: The Legilimens (Sentiment Classifier)
    # 📚 Purpose: Detects emotional tone in user feedback
    # 🌐 No grounding (pure classification task)
    # 🧠 Output Structure:
    #     - Single word: Positive, Neutral, or Negative
    # 🔥 Temperature: 0.2 – Deterministic, consistent output for classification
    "sentiment": {
        "description": "Classifies sentiment in user feedback using one-shot style.",
        "wizard": "The Legilimens (Sentiment Classifier)",
        "style": "One-Shot",
        "prompt": """
        Classify the emotional tone of the feedback below.

        Feedback: \"{input}\"
        
        Sentiment (Positive, Neutral, Negative):
        """,
        "temperature": 0.2  # Low temperature for deterministic sentiment classification
    },

    # ======================= 🔍 COMPETITOR AGENT =======================
    # 🎭 Name: The Seer (Competitor Analyst)
    # 📚 Purpose: Analyzes and contrasts product strategies from competitors using ReAct prompting
    # 🌐 Grounded in: Product Hunt, TechCrunch, Wikipedia
    # 🧠 Output Structure:
    #     - Thought: Initial reasoning
    #     - Action: Search (simulated)
    #     - Observation: Context summary
    #     - Answer: Final comparative insight
    # 🔥 Temperature: 0.7 – Allows exploratory, search-guided reasoning
    "competitor": {
        "description": "Explores competitive insights using ReAct prompting and grounded search logic.",
        "wizard": "The Seer (Competitor Analyst)",
        "style": "ReAct",
        "grounding": [
            "https://www.producthunt.com",
            "https://techcrunch.com",
            "https://en.wikipedia.org/wiki/List_of_music_streaming_services"
        ],
        "prompt": """
        You are a competitive analyst for {company_name}.

        Context:
        {input}

        Reason in ReAct steps before answering:
        Thought: Initial reasoning
        Action: Search (simulated)
        Observation: Context summary

        ### Output Format:
        - Competitive Positioning Summary
        - Strengths & Differentiators
        - Weaknesses or Gaps
        - Strategy Type (choose one): Cost Leadership, Differentiation, Focused Strategy, Innovation Strategy, Platform Strategy
        - Optional: Commentary using Porter’s Five Forces (Threat of new entrants, Bargaining power of buyers, Bargaining power of suppliers, Threat of substitutes, Industry rivalry)

        Answer:
        Summarize competitive positioning with supporting logic.
        """,
        "temperature": 0.7  # High creativity and exploration needed for competitive analysis
    },

    # ======================= 💡 IDEATION AGENT =======================
    # 📤 Output Format:
    # - Option A/B/C with Title + Description
    # - Score table (Originality, Feasibility, Impact)
    # - Final Pick with rationale
    # 🎭 Name: Room of Requirement (Feature Ideator)
    # 📚 Purpose: Generates breakthrough product feature ideas
    # 🌐 No external grounding (creative zero-shot ideation)
    # 🧠 Output Structure:
    #     - Option A, B, C: Unique, bold, or visionary features
    #     - Scores per option (originality, feasibility, impact)
    #     - Final Pick with justification
    # 🔥 Temperature: 0.8 – Max creativity for divergence and innovation
    "ideation": {
        "description": "Generates breakthrough product feature ideas using Tree-of-Thought prompting.",
        "wizard": "Room of Requirement (Feature Ideator)",
        "style": "ToT",
        "prompt":"""

        Output Format:
        - Three feature ideas labeled A, B, C (Title + Description)
        - A score section evaluating each idea (Originality, Feasibility, Impact)
        - A final selection with a short justification

        Context:
        {input}

        ### Option A:
        Feature Title:
        Description:

        ### Option B:
        Feature Title:
        Description:

        ### Option C:
        Feature Title:
        Description:

        ### Score Each Option:
        - Originality (1–10):
        - Feasibility (1–10):
        - Impact (1–10):

        ### Final Picks & Why:
        """,
        "temperature": 0.8   # Maximize divergent thinking for ideation and ToT exploration
    },  
    
    # ======================= 🎯 PRIORITIZATION AGENT =======================
    # 🎭 Name: The Sorting Hat (Prioritization Engine)
    # 📚 Purpose: Ranks feature ideas using ICE scoring (Impact, Confidence, Effort)
    # 🌐 No grounding – input comes from previous agents
    # 🧠 Output Structure:
    #     - Markdown-style table
    #     - Columns: Feature, Impact, Confidence, Effort, ICE Score
    #     - Ordered by ICE Score (descending)
    # 🔥 Temperature: 0.5 – Balanced creativity and structure
    "prioritization": {
        "description": "Scores features using ICE framework with structured logic.",
        "wizard": "The Sorting Hat (Prioritization Engine)",
        "style": "CoT + Scoring",
        "prompt": """ Evaluate the following product features using the ICE scoring model:

        Impact = How much the feature will affect the user or business (1–10)
        Confidence = Certainty in your estimates (1–10)
        Effort = Level of complexity, time, and resources needed (1–10)

        Compute the ICE Score as: (Impact * Confidence) / Effort

        Features:
        {input}

        Output Format:
        | Feature | Impact | Confidence | Effort | ICE Score |
        |---------|--------|------------|--------|------------|
        | Example Feature | 8 | 7 | 4 | 14.0 |

        Sort the table by ICE Score descending and summarize your decision-making rationale below.
        """,
        "temperature": 0.5  # Balanced reasoning for ICE scoring (some creativity, some structure)
    },

    # ======================= ✅ OKR AGENT =======================
    # 🎭 Name: The Headmaster (OKR Generator)
    # 📚 Purpose: Converts features into actionable SMART Objectives and Key Results
    # 🌐 No external grounding – relies on structured inputs from prior agents
    # 🧠 Output Structure:
    #     - Objective: 1 sentence summary
    #     - Key Results: 2–3 measurable outcomes
    # 🔥 Temperature: 0.4 – Requires clean structure with slight creativity
    "okr": {
        "description": "Converts validated features into SMART OKRs.",
        "wizard": "The Headmaster (OKR Generator)",
        "style": "Few-Shot",
        "prompt":"""

        Turn the following feature into a SMART OKR:

        Feature: {input}

        Objective:
        - Key Result 1:
        - Key Result 2:
        - (Optional) Key Result 3:

        Use the good and bad examples below as a reference.
        ### Few-Shot Examples:

        === BAD EXAMPLES: What NOT to Do ===
        These OKRs may include numbers but lack context, clarity, or actionable structure.

        Input: Make app better
        Objective: Make the app better
        - Key Result 1: Users should like the app more
        - Key Result 2: Have fewer bugs maybe

        Input: Improve product
        Objective: Improve product
        - Key Result 1: Increase sessions by 40%  # No baseline or reason
        - Key Result 2: Add 5 new features  # No context or alignment

        Input: Be more AI-powered
        Objective: Be more AI-powered
        - Key Result 1: Add AI stuff
        - Key Result 2: Do a demo
                  
        === GOOD EXAMPLES: Recommended Patterns ===
        Input: Improve onboarding experience
        Objective: Improve onboarding for new mobile users
        - Key Result 1: Increase completion rate from 62% to 85% (based on historical A/B test performance)
        - Key Result 2: Reduce average onboarding time from 4 min to under 2 min (benchmarked against top apps)
        - Key Result 3: Achieve 90% satisfaction in post-onboarding survey (currently at 75%)

        Input: Launch AI-powered feature recommendation
        Objective: Deploy smart feature recommendation engine for premium users
        - Key Result 1: Roll out to 100% of premium users by Q2 (controlled population)
        - Key Result 2: Increase feature engagement by 25% within 30 days (baseline = 3 features/week)
        - Key Result 3: Collect 1,000 qualitative feedback submissions (targeting 10% response rate)

        Output Format:
        Objective:
        - Key Result 1:
        - Key Result 2:
        - (Optional) Key Result 3:
        """,
        "temperature": 0.4  # Requires clean structure with slight creativity
    }
}

In [None]:
# let's test
output = run_agent("feedback", "The app keeps crashing when I open messages.", "Spotify")
print(output)

In [None]:
# let's test
output = run_agent("ideation", "The app keeps crashing when I open messages.", "Spotify")
print(output)

It seems to be working, so let's see whether we can connect the agents together. We'll start simple, then move on to building a pipeline.

In [None]:
user_input = "how many users having issues with log in "
summary = run_agent("feedback", user_input, "Spotify")
print(summary)

Now let's have fun with these agents. I want to write a fictional scenario and test all of the agents to see how they help in context.

## **🧪 Magical Agent Test Script**
We’ll simulate a real-world product scenario for a fictional company ("Beatify") and run all 7 agents on a creative, messy, user feedback input.

**🎧 Test Case: Customer Feedback**


In [None]:
user_input = """
I've used Beatify for 3 months and it’s mostly okay, but lately it freezes when switching playlists. 
Also, the search takes forever, and why can't I find podcasts anymore? Feels like it got worse. 
Still love the personalized mixes though. But if this doesn’t get fixed, I’ll cancel.
"""
company_name = "Beatify"

In [None]:
# === Run all agents independently
import json
from datetime import datetime

individual_agent_outputs = {}

for agent_name in prompt_templates.keys():
    print(f"\n🔮 {agent_name.upper()} Agent Output ({prompt_templates[agent_name]['wizard']})")
    output = run_agent(agent_name, user_input, company_name)
    print(f"📤 Output:\n{output}")
    print("=" * 80)
    individual_agent_outputs[agent_name] = output

# === Final export package
flat_multiagent_export = {
    "metadata": {
        "strategy": "multi-agent",
        "type": "flat",
        "company": company_name,
        "input": user_input.strip(),
        "description": "Each agent runs independently on the same user input without passing results sequentially.",
        "timestamp": datetime.now().isoformat()
    },
    "results": individual_agent_outputs
}

# === Save to JSON file
with open("nextive_individual_agents_flat.json", "w", encoding="utf-8") as f:
    json.dump(flat_multiagent_export, f, indent=2, ensure_ascii=False)

print("✅ Saved: nextive_individual_agents_flat.json (check Output tab to download)")

I think this is the most exciting part of this assignment so far! How dope is it that the previously isolated LLM outputs are now connected, make sense together, and give a bit of confidence and trust in the output? No evaluation needed! :))

In [None]:
evaluator = BaseAgent(name="MultiAgentEvaluator", prompt_style="Few-Shot", temperature=0.3)


In [None]:
# Load flat multi-agent results
with open("/kaggle/input/nextive-individual-agent-flat/nextive_individual_agents_flat.json", "r", encoding="utf-8") as f:
     flat_results = json.load(f)["results"]

# Save to working directory
shutil.copy(
    "/kaggle/input/nextive-individual-agent-flat/nextive_individual_agents_flat.json",
    "/kaggle/working/nextive_individual_agents_flat.json"
)

print("✅ Flat agent results loaded and copied to /kaggle/working/")

In [None]:
flat_results

In [None]:
# Define your agents and evaluation criteria
agents_to_evaluate = ["sentiment", "feedback", "issue", "competitor", "ideation", "prioritization", "okr"]
criteria = "Clarity, Depth, Relevance, Innovation, Coherence"

# Evaluate the flat multi-agent results
flat_scores_raw = {}

for agent in agents_to_evaluate:
    output = flat_results.get(agent, "")
    flat_scores_raw[agent] = evaluator.evaluate(output, criteria)
    time.sleep(2)  # Pause between calls to avoid burst quota issues

# Optional: extract numeric scores
def extract_score(text):
    match = re.search(r'(\d(?:\.\d)?)', text)
    return float(match.group(1)) if match else None

flat_numeric_scores = {agent: extract_score(score) for agent, score in flat_scores_raw.items()}

# Display metadata for the results table. Kept under its own name so the full
# prompt_templates dict (with "prompt" and "temperature") is not overwritten.
agent_display_meta = {
    "sentiment": {"style": "One-Shot", "description": "Classifies sentiment using One-Shot prompting"},
    "feedback": {"style": "Few-Shot", "description": "Summarizes user feedback using Few-Shot prompting"},
    "issue": {"style": "CoT", "description": "Detects product blockers using Chain-of-Thought reasoning"},
    "competitor": {"style": "ReAct", "description": "Analyzes competition using ReAct prompting"},
    "ideation": {"style": "ToT", "description": "Generates features with Tree-of-Thought prompting"},
    "prioritization": {"style": "CoT + Scoring", "description": "Ranks features using ICE scoring with CoT"},
    "okr": {"style": "Few-Shot", "description": "Creates SMART OKRs using Few-Shot prompting"}
}

# Build the results dataframe: score plus prompt style and description per agent
df_flat = pd.DataFrame({
    "Agent": agents_to_evaluate,
    "Agent_flat": [flat_numeric_scores[agent] for agent in agents_to_evaluate],
    "PromptStyle": [agent_display_meta[agent]["style"] for agent in agents_to_evaluate],
    "Description": [agent_display_meta[agent]["description"] for agent in agents_to_evaluate],
})


# Bar chart of flat agent scores
plt.figure(figsize=(10, 5))
sns.barplot(x="Agent", y="Agent_flat", data=df_flat, palette="Blues")
plt.title("Flat Multi-Agent Evaluation Scores")
plt.ylim(0, 5)
plt.grid(True)
plt.tight_layout()
plt.show()

display(df_flat)


In the bar chart, we observe that the flat multi-agent setup performs consistently well across most agents, with top scores in sentiment classification, competitor analysis, ideation, and prioritization. The feedback agent stands out as the weakest performer, suggesting potential issues with clarity or depth in summarizing user input.

Now let's see how orchestrating multiple agents improves performance.

In [None]:
# === Step 3: Agent Runner ===
# Dynamically selects and runs an agent from prompt_templates.
# It fills in the prompt with input + company name, creates the agent, and returns its output.
def run_agent(agent_key: str, input_text: str, company_name: str = "YourCompany") -> str:
    # ✅ Safety check for Gemini client
    if "client" not in globals():
        raise RuntimeError("❌ Gemini model not initialized. Please run the setup cell first.")
    
    agent_meta = prompt_templates[agent_key]  # Get agent's metadata
    if "prompt" not in agent_meta:
        return f"[ERROR] Agent '{agent_key}' is missing a prompt."

    filled_prompt = agent_meta["prompt"].format(input=input_text, company_name=company_name)  # Fill prompt
    agent = BaseAgent(agent_key, agent_meta["style"], agent_meta["temperature"])  # Initialize agent
    return agent.run(filled_prompt)  # Run and return output



In [None]:
## === Step 1.6: Orchestrator ===
## This connects agents in logical order and passes relevant outputs between them.
def run_orchestrator(user_input: str, company_name="YourCompany") -> dict:
    results = {}

    # Step 1: Sentiment Agent
    sentiment = run_agent("sentiment", user_input, company_name)
    results["sentiment"] = sentiment

    # Step 2: Feedback Agent (summary)
    summary = run_agent("feedback", user_input, company_name)
    results["feedback"] = summary

    # Step 3: Issue Agent (reasoning + blockers)
    issue_analysis = run_agent("issue", summary, company_name)
    results["issue"] = issue_analysis

    # Step 4: Competitor Agent (optional broader market insight)
    competitor_insight = run_agent("competitor", user_input, company_name)
    results["competitor"] = competitor_insight

    # Step 5: Ideation Agent (from issue context)
    ideas = run_agent("ideation", issue_analysis, company_name)
    results["ideation"] = ideas

    # Step 6: Prioritization Agent (on proposed features)
    prioritization = run_agent("prioritization", ideas, company_name)
    results["prioritization"] = prioritization

    # Step 7: OKR Agent (structured planning)
    okr_plan = run_agent("okr", ideas, company_name)
    results["okr"] = okr_plan

    return results



I started with the simplest arrangement, a series connection like a chain, to first see how it works; later we can give more inputs to specific agents.
![AI Agent Train](/kaggle/input/multi-agent-chain-architecture-image/multi agent chain architecture.png)
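
The chain above can also be written as data rather than as seven hand-wired calls, which makes it easier to later feed an agent more than one input. The following is a minimal sketch, not the notebook's implementation: `PIPELINE`, `run_pipeline`, and the `fake_agent` stub are hypothetical names, and the stub stands in for `run_agent` so the example runs without a Gemini client.

```python
from typing import Callable, Dict, List, Tuple

# Each step is (agent_key, sources). "input" means the raw user text;
# any other source name refers to an earlier agent's output.
PIPELINE: List[Tuple[str, List[str]]] = [
    ("sentiment", ["input"]),
    ("feedback", ["input"]),
    ("issue", ["feedback"]),
    ("competitor", ["input"]),
    ("ideation", ["issue"]),
    ("prioritization", ["ideation"]),
    ("okr", ["ideation"]),
]


def run_pipeline(
    user_input: str,
    agent_fn: Callable[[str, str], str],
    pipeline: List[Tuple[str, List[str]]] = PIPELINE,
) -> Dict[str, str]:
    """Run agents in order, joining each agent's sources into one input text."""
    outputs: Dict[str, str] = {"input": user_input}
    for agent_key, sources in pipeline:
        missing = [s for s in sources if s not in outputs]
        if missing:
            raise ValueError(f"'{agent_key}' depends on {missing}, which has not run yet")
        agent_input = "\n\n".join(outputs[s] for s in sources)
        outputs[agent_key] = agent_fn(agent_key, agent_input)
    outputs.pop("input")  # return agent outputs only
    return outputs


def fake_agent(agent_key: str, text: str) -> str:
    """Hypothetical stand-in for run_agent: tags the text with the agent name."""
    return f"{agent_key}({text})"


demo = run_pipeline("raw", fake_agent)
print(demo["issue"])
print(demo["okr"])
```

To give the ideation agent both the issue analysis and the sentiment reading, the step would become `("ideation", ["issue", "sentiment"])`. With the real agents, the call would look like `run_pipeline(user_input, lambda key, text: run_agent(key, text, company_name))`.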

In [None]:

display(Image(filename="/kaggle/input/multi-agent-chain-architecture-image/multi agent chain architecture.png"))

In [None]:
# test the multi-agent orchestrator
user_input = """
I love the idea of TaskBuddy, but lately it’s been really buggy. 
Sometimes tasks disappear when I mark them as done, and reminders never come on time. 
Also, why is there still no dark mode? Feels like you’re not listening to users.
"""
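company_name = "TaskBuddy"

# Run the full chain; later cells read `results`
results = run_orchestrator(user_input, company_name)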

We can see how grounding and orchestrating the agents turns chaotic responses into guardrailed ones and directs them with clear logic. So let's save the results.

In [None]:
results

In [None]:
# Bundle the orchestrator output with run metadata before saving it
import json
from datetime import datetime

multiagent_export = {
    "metadata": {
        "strategy": "multi-agent",
        "type": "sequential",
        "company": company_name,
        "input": user_input.strip(),
        "description": "A sequential multi-agent architecture where each agent builds upon the output of the previous one. "
                       "Agents include: sentiment → feedback → issue → competitor → ideation → prioritization → OKR.",
        "timestamp": datetime.now().isoformat()
    },
    "results": results  # orchestrator output
}

In [None]:
# ✅ Save multi-agent results to working directory
with open("/kaggle/working/nextive_multiagent_output.json", "w", encoding="utf-8") as f:
    json.dump(multiagent_export, f, indent=2, ensure_ascii=False)

print("✅ JSON saved: /kaggle/working/nextive_multiagent_output.json — check Output tab to download.")



Let's evaluate the orchestrated multi-agent setup.

In [None]:
# 🔁 REDEFINE THIS if not already defined
evaluator = BaseAgent(name="SequentialEvaluator", prompt_style="Few-Shot", temperature=0.3)

In [None]:
# ✅ Step 1: Parse your full sequential_results string
def parse_sequential_sections(text):
    pattern = r"🔮?\s*(\w+)\s+Agent Output\s*\n(.*?)(?=\n🔮|\Z)"
    matches = re.findall(pattern, text, re.DOTALL)
    return {m[0].strip().lower(): m[1].strip() for m in matches}



In [None]:
# Pull the per-agent outputs out of the export built above
sequential_outputs = multiagent_export["results"]
sequential_outputs

In [None]:
# Define criteria
criteria = "Clarity, Depth, Relevance, Innovation, Coherence"

# Evaluate
sequential_scores_raw = {
    agent: evaluator.evaluate(output, criteria)
    for agent, output in sequential_outputs.items()
}

# Extract numeric scores
def extract_score(text):
    match = re.search(r'(\d(?:\.\d)?)', text)
    return float(match.group(1)) if match else None

sequential_numeric_scores = {
    agent: extract_score(score) for agent, score in sequential_scores_raw.items()
}

# Create DataFrame
df_seq = pd.DataFrame({
    "Agent": list(sequential_numeric_scores.keys()),
    "Sequential Score": list(sequential_numeric_scores.values())
})
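
One caveat with `extract_score`: it returns the first digit found anywhere in the evaluator's reply, so a reply that opens with a numbered list or a date can yield the wrong number. The sketch below is one possible stricter parser, assuming the evaluator states its score in a form such as `Score: 4.5` or `4/5`. That reply format is an assumption, not something the notebook guarantees, and `extract_score_strict` is a hypothetical name.

```python
import re
from typing import Optional

# Patterns tried in order; both are assumptions about how the evaluator phrases its score.
_SCORE_PATTERNS = [
    re.compile(r"score\s*[:=]?\s*(\d+(?:\.\d+)?)", re.IGNORECASE),       # "Score: 4.5"
    re.compile(r"(\d+(?:\.\d+)?)\s*(?:/|out of)\s*5\b", re.IGNORECASE),  # "4/5", "4 out of 5"
]


def extract_score_strict(text: str, low: float = 1.0, high: float = 5.0) -> Optional[float]:
    """Return the first labeled score within [low, high], or None if none is found."""
    for pattern in _SCORE_PATTERNS:
        for match in pattern.finditer(text):
            value = float(match.group(1))
            if low <= value <= high:
                return value
    return None
```

Returning `None` for unparseable replies keeps bad parses visible as missing values in `df_seq` instead of silently plotting a wrong number.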



In [None]:
# Merge with flat
df_combined = df_flat.copy()
df_combined = df_combined.merge(df_seq, on="Agent", how="left")
display(df_combined)

In [None]:
# Rebuild chart from this table
df_chart = df_combined[["Agent", "Agent_flat", "Sequential Score"]].melt(
    id_vars="Agent", var_name="Architecture", value_name="Score"
)
df_chart["Architecture"] = df_chart["Architecture"].replace({
    "Agent_flat": "Flat", 
    "Sequential Score": "Sequential"
})

# Custom color palette
palette = {
    "Flat": "#a3c9f1",         # Light blue
    "Sequential": "#3b76c4"    # Dark blue
}

# Plot the comparison bar chart
plt.figure(figsize=(12, 6))
sns.barplot(data=df_chart, x="Agent", y="Score", hue="Architecture", palette=palette)
plt.title("Multi-Agent Evaluation: Flat vs Sequential")
plt.ylim(0, 5)
plt.ylabel("Evaluation Score (1–5)")
plt.grid(True, axis="y", linestyle="--", alpha=0.5)
plt.tight_layout()
plt.show()

## 📌 Final Thoughts
From the chart, it's clear that the sequential multi-agent setup excels in more analytical tasks, particularly in OKR generation and issue detection with root cause analysis. However, since I ran the self-evaluation function multiple times and received slightly different scores each time, the reliability of this evaluation could be improved. It may be worth refining the evaluation prompt itself for more consistent results.
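
One way to address the run-to-run variation mentioned above is to score each output several times and report the mean together with the spread. This is a minimal sketch under stated assumptions: `score_once` is a hypothetical callable that wraps one evaluator call plus score extraction (for example `lambda out: extract_score(evaluator.evaluate(out, criteria))`), and the random stub below only simulates a noisy evaluator so the example runs offline.

```python
import random
import statistics
from typing import Callable, Dict, Optional


def repeated_scores(
    outputs: Dict[str, str],
    score_once: Callable[[str], Optional[float]],
    n_runs: int = 5,
) -> Dict[str, Dict[str, float]]:
    """Score every agent output n_runs times; return mean, stdev, and valid run count."""
    summary: Dict[str, Dict[str, float]] = {}
    for agent, output in outputs.items():
        # Drop runs where no score could be parsed (score_once returned None).
        scores = [s for s in (score_once(output) for _ in range(n_runs)) if s is not None]
        if not scores:
            continue
        summary[agent] = {
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "n": len(scores),
        }
    return summary


# Simulated noisy evaluator (stand-in for a real LLM call), seeded for repeatability.
rng = random.Random(0)


def noisy_stub(_output: str) -> float:
    return round(rng.uniform(3.5, 5.0), 1)


stats = repeated_scores({"okr": "draft OKRs", "issue": "root causes"}, noisy_stub, n_runs=5)
for agent, row in stats.items():
    print(f"{agent}: mean={row['mean']:.2f} stdev={row['stdev']:.2f} n={row['n']}")
```

If the standard deviation for an agent is as large as the gap between its flat and sequential scores, that gap should not be read as a real difference.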

But then again — this is the nature of GenAI: a never-ending loop of experimentation and improvement! :))

Thanks for reading. 
Best.
Donna

## 🚀 Nextify Interactive Dashboard

You can explore the full prompt evaluation, LLM comparison, and multi-agent prototype here:

🔗 [Open the Live Nextify Dashboard](http://nextify-dashboard-100.streamlit.app/)