## Product Analyzer Agent

# Section 1: Setup & Environment

In this section, we will install the necessary libraries and set up our credentials to connect to the Reddit and Product Hunt APIs.

In [1]:
!/usr/local/bin/python3.11 -m pip install praw requests python-dotenv


[notice] A new release of pip available: 22.3.1 -> 25.1.1
[notice] To update, run: python3.11 -m pip install --upgrade pip


In [2]:
import os
from dotenv import load_dotenv

# Load the environment variables from the .env file
load_dotenv()

# Get the Reddit credentials
reddit_client_id = os.getenv("REDDIT_CLIENT_ID")
reddit_client_secret = os.getenv("REDDIT_CLIENT_SECRET")

# Get the Product Hunt credentials
ph_api_key = os.getenv("PRODUCT_HUNT_API_KEY")
ph_access_token = os.getenv("PRODUCT_HUNT_ACCESS_TOKEN")

# Check that every credential was loaded, not just the first of each pair
if all([reddit_client_id, reddit_client_secret, ph_api_key, ph_access_token]):
    print("✅ API keys loaded successfully!")
else:
    print("🚨 Error: Could not load API keys. Check your .env file.")

✅ API keys loaded successfully!
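
The check above reports only overall success or failure. A minimal sketch of a per-variable check follows, assuming the same four variable names used in the cell above. `find_missing_vars` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

# Variable names taken from the cell above; adjust them to match your .env file.
REQUIRED_VARS = [
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "PRODUCT_HUNT_API_KEY",
    "PRODUCT_HUNT_ACCESS_TOKEN",
]

def find_missing_vars(names, env=None):
    """Return the names whose values are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing variables: {', '.join(missing)}")
else:
    print("All required variables are set.")
```

Naming the missing variable makes a typo in the `.env` file easier to spot than a single pass/fail message.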


# Data collection Agent

# Connection to Reddit

In [3]:
import praw

# Use the loaded keys to initialize the Reddit API connection
# A user_agent is a unique name for your script so Reddit knows who is making requests.
try:
    reddit = praw.Reddit(
        client_id=reddit_client_id,
        client_secret=reddit_client_secret,
        user_agent="Feature-Analyzer by Flo"
    )

    # Check the connection status
    # Without a username and password PRAW runs in read-only mode, where reddit.user.me() returns None
    print(f"✅ Connected to Reddit as: {reddit.user.me()} (Read-Only Mode: {reddit.read_only})")

except Exception as e:
    print(f"🚨 Error connecting to Reddit: {e}")

✅ Connected to Reddit as: None (Read-Only Mode: True)


# Fetching Data from your Subreddit - in our case "n8n"

In [4]:
# Define the target subreddit and how many posts to fetch
subreddit_name = "n8n"
post_limit = 10

print(f"🔥 Fetching the top {post_limit} hot posts from r/{subreddit_name}...")

try:
    # Point our connection to the target subreddit
    subreddit = reddit.subreddit(subreddit_name)

    # Fetch the 'hot' posts and loop through them
    for post in subreddit.hot(limit=post_limit):
        # Print the title and score of each post
        print(f"  - [Score: {post.score}] {post.title}")

except Exception as e:
    print(f"🚨 Error fetching posts: {e}")

🔥 Fetching the top 10 hot posts from r/n8n...
  - [Score: 2] Weekly Self Promotion Thread
  - [Score: 28] How much is my workflow worth?
  - [Score: 15] We'll never know
  - [Score: 61] Anyone here looking for a job?
  - [Score: 14] Hiring n8n Expert (Bonus if Full Stack w/ Node.js + React)
  - [Score: 25] GitHub Releases to Marketing
  - [Score: 4] Things I have automated, do you want yours?
  - [Score: 2] Self hosting via Docker security risk?
  - [Score: 5] How I Use Redis to Cache Google API Data in n8n (and Why You Should Too)
  - [Score: 3] I want to create a did you mean loop


# Store Detailed Posts and Comments from the Subreddit

# Cleaning script for the comments: removes links, emojis, etc.

In [5]:
import re

def clean_text(text):
    """A function to clean raw text."""
    # Make text lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+', '', text)
    # Remove special characters and numbers, keeping only letters and spaces
    # (digits are dropped too, so "n8n" becomes "nn" in the cleaned text)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

print("✅ The 'clean_text' function is now defined and ready to use.")

✅ The 'clean_text' function is now defined and ready to use.
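
To make the effect of each cleaning step concrete, here is a small worked example. It repeats the same steps as `clean_text` so that it runs on its own, and adds `clean_text_keep_digits`, a hypothetical variant (not used elsewhere in this notebook) that keeps digits so product names such as "n8n" survive.

```python
import re

def clean_text(text):
    """Same cleaning steps as the cell above."""
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    return ' '.join(text.split())

def clean_text_keep_digits(text):
    """Hypothetical variant: identical, except digits survive the filter."""
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return ' '.join(text.split())

sample = "Check https://n8n.io for n8n docs! 🚀"
print(clean_text(sample))              # check for nn docs
print(clean_text_keep_digits(sample))  # check for n8n docs
```

Whether to keep digits is a trade-off: dropping them removes noise such as version numbers, while keeping them preserves names that contain a number.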


# Collect Targeted Feedback from Reddit

In [6]:
import json
from praw.models import MoreComments

# This is the targeted data collection script.
# It creates the 'reddit_data' variable that the final AI script needs.

reddit_data = []
subreddit_name = "n8n"
post_limit = 1000 # Using a large limit for a comprehensive search

# The search query to find relevant feedback posts
search_query = "feedback OR idea OR suggestion OR bug OR error OR improve OR wish OR stuck OR help"

print(f"🔬 Searching r/{subreddit_name} for top posts of all time containing feedback keywords...")

try:
    subreddit = reddit.subreddit(subreddit_name)
    # Use .search() to find the most relevant posts from all time
    for post in subreddit.search(search_query, sort='top', time_filter='all', limit=post_limit):
        post_data = {
            "id": post.id,
            "title": post.title,
            "score": post.score,
            "url": post.url,
            "body": post.selftext,
            "comments": []
        }

        # Collect the comments that are already loaded, skipping 'MoreComments' placeholders
        # (replies hidden behind those placeholders are not fetched)
        for comment in post.comments.list():
            if isinstance(comment, MoreComments):
                continue
            post_data["comments"].append(comment.body)
        
        reddit_data.append(post_data)

    print(f"✅ Successfully collected data for {len(reddit_data)} posts.")
    
    if not reddit_data:
        print("No posts found matching the search query.")

except Exception as e:
    print(f"🚨 Error collecting detailed data: {e}")

🔬 Searching r/n8n for top posts of all time containing feedback keywords...
✅ Successfully collected data for 249 posts.
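
The loop above skips `MoreComments` placeholders, so replies hidden behind "load more comments" links are never collected. A minimal sketch of one alternative follows, assuming PRAW's `replace_more()` method on a submission's comment forest. `collect_comment_bodies` and its `expand_limit` parameter are hypothetical names introduced here for illustration.

```python
def collect_comment_bodies(submission, expand_limit=0):
    """Return the body text of every loaded comment on a PRAW submission.

    expand_limit=0 removes MoreComments placeholders without extra requests;
    expand_limit=None expands all of them, which costs additional API requests.
    """
    # After this call the forest no longer contains placeholders up to the limit given.
    submission.comments.replace_more(limit=expand_limit)
    # .list() flattens the comment tree into a single list.
    return [comment.body for comment in submission.comments.list()]
```

Inside the collection loop this could be used as `post_data["comments"] = collect_comment_bodies(post)`. Expanding every placeholder on hundreds of posts can take a long time, so a small limit may be the more practical choice.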


In [7]:
import json

print("🧼 Applying cleaning function to all collected Reddit text...")

# This is the crucial loop that adds the 'cleaned_' keys to your Reddit data
for post in reddit_data:
    post['cleaned_title'] = clean_text(post['title'])
    post['cleaned_body'] = clean_text(post['body'])
    post['cleaned_comments'] = [clean_text(comment) for comment in post['comments']]

print(f"✅ Cleaning complete for {len(reddit_data)} posts.")

🧼 Applying cleaning function to all collected Reddit text...
✅ Cleaning complete for 249 posts.


# Product Hunt Crawl - Important: Product Hunt returns only a limited number of reviews, in our case 42 out of 102

In [8]:
import requests
import json
import time

# --- Configuration ---
product_slug = "n8n-io"
# We ask for 50 at a time, a common page size limit for APIs
# (the server may cap this lower; the run below returned 20 per page)
limit_per_page = 50

# --- API and header setup ---
api_url = "https://api.producthunt.com/v2/api/graphql"
headers = {
    'Authorization': f'Bearer {ph_access_token}',
    'Content-Type': 'application/json'
}

# --- A more advanced query that includes pagination logic ---
query = """
query getProductReviews($slug: String!, $limit: Int!, $after: String) {
  post(slug: $slug) {
    name
    comments(first: $limit, after: $after) {
      edges {
        node {
          body
        }
      }
      pageInfo {
        hasNextPage
        endCursor
      }
    }
  }
}
"""

# --- The Loop to Fetch All Pages ---
product_hunt_reviews = []
after_cursor = None # Start with no cursor to get the first page
has_next_page = True

print(f"📝 Fetching all reviews for '{product_slug}' from Product Hunt...")

while has_next_page:
    # Set the variables for the API request, including the cursor
    request_body = {
        "query": query,
        "variables": {"slug": product_slug, "limit": limit_per_page, "after": after_cursor}
    }
    
    try:
        response = requests.post(api_url, headers=headers, json=request_body, timeout=30)
        
        if response.status_code == 200:
            ph_data = response.json()
            post_info = ph_data['data']['post']
            
            if post_info:
                comments_page = post_info['comments']
                
                # Clean and store the reviews from the current page
                for comment in comments_page['edges']:
                    cleaned_review = clean_text(comment['node']['body'])
                    product_hunt_reviews.append(cleaned_review)
                
                # Check if there's a next page and get the new cursor
                page_info = comments_page['pageInfo']
                has_next_page = page_info['hasNextPage']
                after_cursor = page_info['endCursor']
                
                print(f"  - Fetched a page, total reviews collected: {len(product_hunt_reviews)}")
                
            else:
                # No post found for the slug
                print(f"🚨 Could not find a product with slug '{product_slug}'.")
                break

            # A small delay to be polite to the API
            time.sleep(1)

        else:
            print(f"🚨 Error fetching from Product Hunt: Status code {response.status_code}")
            print(response.text)
            break

    except Exception as e:
        print(f"🚨 An exception occurred: {e}")
        break

print(f"\n✅ Finished! Successfully collected and cleaned {len(product_hunt_reviews)} reviews.")

# --- Example of Cleaned Reviews ---
print("\n--- Example of Cleaned Reviews ---")
for review in product_hunt_reviews[:3]:
    print(f"- {review}")

📝 Fetching all reviews for 'n8n-io' from Product Hunt...
  - Fetched a page, total reviews collected: 20
  - Fetched a page, total reviews collected: 40
  - Fetched a page, total reviews collected: 42

✅ Finished! Successfully collected and cleaned 42 reviews.

--- Example of Cleaned Reviews ---
- i love your product
- so excited to test the possibilities with notion automatizations
- amazing solution


# Find the most common Phrases (N-Grams)

In [9]:
# Install the Natural Language Toolkit (nltk) package

!/usr/local/bin/python3.11 -m pip install nltk




In [10]:
import nltk
import ssl
from nltk.util import ngrams
from collections import Counter
import re
from nltk.corpus import stopwords

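# --- Workaround: disable SSL certificate verification so the NLTK download succeeds ---
# Note: this is not undone afterwards, so it stays in effect for the rest of the session.
# Use it only if the download fails with certificate errors.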
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# --- FIX: Directly download ALL necessary data packages ---
print("⚙️ Downloading required NLTK data packages...")
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
print("✅ NLTK data is ready.")


# --- ANALYSIS ---
print("\n🔬 Starting data analysis...")

# Combine all cleaned text into one big list
all_text = []

# Add cleaned Reddit data
for post in reddit_data:
    all_text.append(post['cleaned_title'])
    all_text.append(post['cleaned_body'])
    all_text.extend(post['cleaned_comments'])

# Add cleaned Product Hunt reviews
all_text.extend(product_hunt_reviews)

# Join everything into a single block of text and tokenize into words
full_text = ' '.join(all_text)
words = nltk.word_tokenize(full_text)

# --- Remove common "stop words" using NLTK's comprehensive list ---
stop_words = set(stopwords.words('english'))
# Add custom words to ignore based on previous output
stop_words.update(['im', 'ive', 'would', 'also', 'like', 'get']) 
filtered_words = [word for word in words if word not in stop_words]

# --- Generate and count two-word phrases (bigrams) ---
# Pass 3 instead of 2 to ngrams() to count three-word phrases (trigrams)
bigrams = ngrams(filtered_words, 2)
bigram_counts = Counter(bigrams)

print("✅ Analysis Complete!")
print("🔎 Top 10 Most Common Phrases (Bigrams):")
for phrase, count in bigram_counts.most_common(10):
    print(f"  - {' '.join(phrase)}: {count} times")

⚙️ Downloading required NLTK data packages...
✅ NLTK data is ready.

🔬 Starting data analysis...
✅ Analysis Complete!
🔎 Top 10 Most Common Phrases (Bigrams):
  - nn workflow: 134 times
  - ai agent: 134 times
  - thanks sharing: 112 times
  - google sheets: 112 times
  - using nn: 106 times
  - let know: 104 times
  - ai agents: 94 times
  - dont know: 86 times
  - nn workflows: 85 times
  - code node: 85 times
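
The phrases counted above are two words long because the second argument to `ngrams` sets the phrase length. The pure-Python sketch below shows the difference between n=2 and n=3 on a toy token list. `simple_ngrams` is a hypothetical stand-in that is assumed to behave like `nltk.util.ngrams` for a plain list without padding.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "ai agent workflow ai agent node".split()

bigram_counts = Counter(simple_ngrams(tokens, 2))
trigram_counts = Counter(simple_ngrams(tokens, 3))

print(bigram_counts.most_common(1))  # [(('ai', 'agent'), 2)]
print(len(trigram_counts))           # 4
```

Longer phrases are more specific but repeat less often, so trigram counts on the same text tend to be lower than bigram counts.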


# Using OpenAI to understand the user comments and potential feature requests

# Install the OpenAI library


In [11]:
!/usr/local/bin/python3.11 -m pip install openai




# Analyzing and categorizing all Reddit comments

In [12]:
from openai import OpenAI
import os
import json
from dotenv import load_dotenv
from collections import Counter
import pandas as pd

# --- PART 1: CONFIGURE CLIENT AND PROMPT ---

load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

if not openai_api_key:
    print("🚨 OpenAI API Key not found. Please add it to your .env file.")
else:
    client = OpenAI(api_key=openai_api_key)
    print("✅ OpenAI Client configured successfully.")

# This prompt is for the first step: categorizing individual comments
messages_for_prompt = [
    {
        "role": "system",
        "content": """You are a product analyst for a software company called **n8n**. Your task is to analyze user comments and extract actionable feedback **about the n8n product itself**.

**CRITICAL RULE**: You must first determine if the comment is direct feedback about the n8n software. If the comment is about the user's own project performance (e.g., their newsletter's CTR), about the subreddit community, or is just a general polite comment (e.g., 'thank you'), classify it as **`Irrelevant`**.

For relevant comments about the n8n product, classify them into one of these categories:
- Struggle / Confusion
- Bug Report
- Feature Request
- Positive Feedback
- Negative Feedback

Provide your output in a JSON format with two keys: `category` and `user_problem`. If there is no specific user problem, return `null` for that key."""
    },
    { "role": "user", "content": "**POST:** How I built this workflow.\n**COMMENT:** This is great, but the n8n expression editor is confusing. I wish it had autocomplete."},
    { "role": "assistant", "content": "{\n  \"category\": \"Negative Feedback\",\n  \"user_problem\": \"The n8n expression editor is confusing and lacks an autocomplete feature, making it difficult to use.\"\n}"},
    { "role": "user", "content": "**POST:** Look at my new workflow!\n**COMMENT:** This is awesome, great job!"},
    { "role": "assistant", "content": "{\n  \"category\": \"Irrelevant\",\n  \"user_problem\": null\n}"}
]

def get_contextual_analysis(post, comment):
    if isinstance(post, dict):
        full_context = f"**ORIGINAL POST TITLE:** {post['title']}\n**ORIGINAL POST BODY:** {post['body'][:500]}...\n**COMMENT:** {comment}"
    else:
        full_context = comment
    current_messages = messages_for_prompt + [{"role": "user", "content": full_context}]
    try:
        response = client.chat.completions.create(model="gpt-4o-mini", messages=current_messages, response_format={"type": "json_object"})
        return json.loads(response.choices[0].message.content)
    except Exception:
        # Treat any API or JSON-parsing error as "no result" so one bad item does not stop the run
        return None

# --- NEW: A more advanced function to generate dashboard summaries ---
def generate_dashboard_summary(title, feedback_list):
    print(f"\n🧠 Generating AI summary for '{title}'...")
    if not feedback_list:
        return f"No data available for {title}."
    
    problems_str = "\n".join([f"- {item}" for item in feedback_list])
    
    # This new prompt is much more specific and demanding
    prompt = f"""You are a senior product manager at n8n. Your task is to write a concise, actionable summary (around 150 words) for a dashboard section.
Based on the following list of user problems categorized under '{title}', synthesize the key themes and identify concrete improvements n8n can make.
Do not start with a title. Begin the summary directly.

**User Feedback:**
{problems_str}

**Actionable Summary for Dashboard:**
"""
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error generating summary: {e}"

# --- PART 2: COMBINE ALL DATA & RUN FULL ANALYSIS ---

final_results = []
all_feedback_items = []

# Gather Reddit comments with their post context
if 'reddit_data' in locals():
    for post in reddit_data:
        for comment in post['comments']:
            all_feedback_items.append({'comment': comment, 'context': post})

# Add Product Hunt reviews
if 'product_hunt_reviews' in locals():
    for review in product_hunt_reviews:
        all_feedback_items.append({'comment': review, 'context': 'Product Hunt Review'})

# Process every item (no sample_size limit)
print(f"\n🔬 Analyzing all {len(all_feedback_items)} feedback items from all sources... (This will take a long time)")

for item in all_feedback_items:
    comment_text = item['comment']
    post_context = item['context']
    
    if len(comment_text) < 20 or "[deleted]" in comment_text or "[removed]" in comment_text:
        continue
    
    analysis = get_contextual_analysis(post_context, comment_text)
    
    if analysis and analysis.get('category') and analysis['category'] != 'Irrelevant':
        final_results.append({
            "source": "Reddit" if isinstance(post_context, dict) else "Product Hunt",
            "comment": comment_text,
            "category": analysis['category'],
            "user_problem": analysis.get('user_problem')
        })

print(f"\n✅ AI analysis complete. Found {len(final_results)} relevant feedback items.")

# --- PART 3: SAVE REPORT & GENERATE DASHBOARD SUMMARIES ---

if final_results:
    df = pd.DataFrame(final_results)
    output_filename = 'final_comprehensive_report.csv'
    print(f"\n💾 Saving the comprehensive report to '{output_filename}'...")
    df.to_csv(output_filename, index=False, encoding='utf-8-sig')
    print("✅ Report created successfully.")

    # Generate AI summaries for the dashboard
    struggles = df[df['category'] == 'Struggle / Confusion']['user_problem'].dropna().tolist()
    feature_requests = df[df['category'] == 'Feature Request']['user_problem'].dropna().tolist()
    positive_feedback = df[df['category'] == 'Positive Feedback']['user_problem'].dropna().tolist()

    low_hanging_summary = generate_dashboard_summary("Low-Hanging Fruits (User Struggles)", struggles)
    nice_to_have_summary = generate_dashboard_summary("Nice-to-Have (Feature Requests)", feature_requests)
    no_impact_summary = generate_dashboard_summary("No Business Impact (Praise)", positive_feedback)
    
    dashboard_summaries = {
        "low_hanging_fruits": low_hanging_summary,
        "nice_to_have": nice_to_have_summary,
        "no_business_impact": no_impact_summary
    }
    
    with open('dashboard_summaries.json', 'w') as f:
        json.dump(dashboard_summaries, f, indent=2)
        
    print("\n✅ Dashboard summaries generated and saved to 'dashboard_summaries.json'.")
    
    print("\n--- AI-Generated Summaries Preview ---")
    print(f"\nLow-Hanging Fruits:\n{low_hanging_summary}")
    print(f"\nNice-to-Have:\n{nice_to_have_summary}")
    print(f"\nNo Business Impact:\n{no_impact_summary}")

else:
    print("\nNo relevant comments were found to generate a report.")
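
The analysis loop above trusts whatever category the model returns. A minimal sketch of a stricter parser follows, assuming the category names listed in the system prompt. `parse_classification` is a hypothetical helper and is not part of the OpenAI library.

```python
import json

# Categories taken from the system prompt above, plus the "Irrelevant" label.
ALLOWED_CATEGORIES = {
    "Struggle / Confusion",
    "Bug Report",
    "Feature Request",
    "Positive Feedback",
    "Negative Feedback",
    "Irrelevant",
}

def parse_classification(raw_json):
    """Parse a model reply; return None if it is malformed or off-schema."""
    try:
        data = json.loads(raw_json)
    except (TypeError, ValueError):
        return None
    category = data.get("category") if isinstance(data, dict) else None
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        return None
    return {"category": category, "user_problem": data.get("user_problem")}

print(parse_classification('{"category": "Bug Report", "user_problem": "Webhook fails"}'))
print(parse_classification('{"category": "Praise"}'))  # None
```

Rejecting unexpected labels keeps invented categories out of the report, at the cost of silently dropping those items unless they are logged.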


✅ OpenAI Client configured successfully.

🔬 Analyzing all 9636 feedback items from all sources... (This will take a long time)

✅ AI analysis complete. Found 1099 relevant feedback items.

💾 Saving the comprehensive report to 'final_comprehensive_report.csv'...
✅ Report created successfully.

🧠 Generating AI summary for 'Low-Hanging Fruits (User Struggles)'...

🧠 Generating AI summary for 'Nice-to-Have (Feature Requests)'...

🧠 Generating AI summary for 'No Business Impact (Praise)'...

✅ Dashboard summaries generated and saved to 'dashboard_summaries.json'.

--- AI-Generated Summaries Preview ---

Low-Hanging Fruits:
Users are encountering several common challenges with n8n, primarily centered around setup complexity, workflow management, and lack of clarity in documentation. Key areas for improvement include:

1. **Enhanced Documentation**: Provide more detailed, beginner-friendly guides and video tutorials focused on common tasks, such as workflow creation, using key nodes (e.g., HT