Title etc

In [2]:
# Install the pinned OpenAI client before importing it
!pip install openai==0.28

import os

import openai
import pandas as pd
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
# Read the OpenAI API key
openai.api_key = os.getenv("OPENAI_API_KEY")

if openai.api_key is None:
    print("⚠️ WARNING: OpenAI API Key not loaded. Please ensure you have a '.env' file in your environment with 'OPENAI_API_KEY=...'")
else:
    print("✅ OpenAI API Key successfully loaded from .env file.")

✅ OpenAI API Key successfully loaded from .env file.


In [3]:
#Data Loading
FILE_PATH = '/final.csv'

try:
    # Load the uploaded file
    df = pd.read_csv(FILE_PATH, encoding='utf-8')

    # Data Check
    print("-" * 50)
    print("✅ DataFrame successfully loaded.")
    print("\nInitial Data Structure:")
    print(df.head())
    print("\nData Types:")
    df.info()

except FileNotFoundError:
    print(f"❌ Error: The file '{FILE_PATH}' was not found. Please ensure it is uploaded.")

--------------------------------------------------
✅ DataFrame successfully loaded.

Initial Data Structure:
              ProductID                                       Product Name  \
0  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
1  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
2  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
3  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   
4  AVqkIhwDv8e3D1O-lebb  All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...   

                                            Category   Brand  Ratings  \
0  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      5.0   
1  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      5.0   
2  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      5.0   
3  Electronics,iPad & Tablets,All Tablets,Fire Ta...  Amazon      4.0   
4  Electronics,iPad & Tablets,All Tablets,Fire Ta...  A

In [4]:
#Choose the category for analysis
# Assuming the main category is the first item in the comma-separated string
# For example: 'Electronics,iPad & Tablets,All Tablets...' -> 'Electronics'
categories = df['Category'].str.split(',').str[0].str.strip()

print("\n\n--- Common Top-Level Categories ---")
print(categories.value_counts().head())



--- Common Top-Level Categories ---
Category
Fire Tablets       24874
AA                 12071
Stereos             6621
Back To College     5056
Electronics         4160
Name: count, dtype: int64


The three most frequent top-level categories are used for the rest of the analysis: "Fire Tablets", "AA", and "Stereos".
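
The selection above was made by reading the printed counts. As a minimal sketch, the same choice could be derived from the data instead of being typed in by hand. The helper name `top_primary_categories` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def top_primary_categories(df: pd.DataFrame, n: int = 3) -> list:
    """Return the n most frequent first entries of the comma-separated 'Category' column."""
    primary = df['Category'].str.split(',').str[0].str.strip()
    return primary.value_counts().head(n).index.tolist()


# Toy data standing in for final.csv (hypothetical values)
toy = pd.DataFrame({'Category': [
    'Fire Tablets,Tablets', 'Fire Tablets,Tablets', 'Fire Tablets, Electronics',
    'AA,Batteries', 'AA,Batteries', 'Stereos,Audio',
]})

print(top_primary_categories(toy, n=2))  # ['Fire Tablets', 'AA']
```

On the real data, `TARGET_CATEGORIES = top_primary_categories(df)` would presumably reproduce the three names listed above, provided the counts stay as printed.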

In [5]:
import pandas as pd
import numpy as np

#Define Target Categories
TARGET_CATEGORIES = ['Fire Tablets', 'AA', 'Stereos']
FILE_PATH = '/final.csv'

# Reload the DataFrame (ensure fresh start for this block)
try:
    df = pd.read_csv(FILE_PATH, encoding='utf-8')
except FileNotFoundError:
    print(f"❌ Error: The file '{FILE_PATH}' was not found. Please re-upload.")
    raise  # Re-raise so the cell stops here instead of calling exit()

#Creating primary category column
df['Primary Category'] = df['Category'].str.split(',').str[0].str.strip()

#filter the data
df_filtered = df[df['Primary Category'].isin(TARGET_CATEGORIES)].copy()

#drop rows where reviews are missing.
df_filtered.dropna(subset=['reviews.text'], inplace=True)

#make sure that 'Ratings' is numeric (it might be loaded as object if there were non-numeric values)
df_filtered['Ratings'] = pd.to_numeric(df_filtered['Ratings'], errors='coerce')
df_filtered.dropna(subset=['Ratings'], inplace=True)

#aggregate data by product | Group by the primary category and the product name
df_aggregated = df_filtered.groupby(['Primary Category', 'Product Name']).agg(
    # Calculate the average rating for the product
    Average_Rating=('Ratings', 'mean'),
    # Concatenate all reviews into a single string (separated by a space)
    All_Reviews=('reviews.text', lambda x: ' '.join(x.astype(str))),
    # Count the total number of reviews for the product
    Review_Count=('reviews.text', 'size')
).reset_index()

#inspection of aggregated data
print("✅ Aggregation complete.")
print("-" * 60)
print("Aggregated DataFrame Head (Products and Combined Reviews):")
print(df_aggregated.head())
print("-" * 60)
print("Aggregated DataFrame Information:")
df_aggregated.info()

✅ Aggregation complete.
------------------------------------------------------------
Aggregated DataFrame Head (Products and Combined Reviews):
  Primary Category                                       Product Name  \
0               AA  AmazonBasics AA Performance Alkaline Batteries...   
1               AA  AmazonBasics AAA Performance Alkaline Batterie...   
2     Fire Tablets  All-New Fire 7 Tablet with Alexa, 7" Display, ...   
3     Fire Tablets  All-New Fire HD 8 Kids Edition Tablet, 8 HD Di...   
4     Fire Tablets  All-New Fire HD 8 Kids Edition Tablet, 8 HD Di...   

   Average_Rating                                        All_Reviews  \
0        4.453594  Bulk is always the less expensive way to go fo...   
1        4.448040  I order 3 of them and one of the item is bad q...   
2        4.585366  Amazing way to keep my kids reading and also t...   
3        4.630901  I have been wanting to get a tablet for my gra...   
4        4.641638  Purchase this Amazon - Fire HD 8 Kid

The resulting df_aggregated DataFrame now contains the following columns:

    Primary Category: Used to loop through the categories and generate one article for each.

    Product Name: The name of the product.

    Average_Rating: The mean rating, used to rank the products (top 3 and worst).

    All_Reviews: A single text blob containing every review for that specific product, ready to be sent to the GPT-3.5 model for analysis and summarization.

    Review_Count: The number of reviews, used to ensure we only summarize products with a sufficient number of reviews.
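
To make the column descriptions concrete, here is a small worked example of the same named aggregation on a toy frame. The product names, ratings, and review texts are invented for illustration only.

```python
import pandas as pd

# Hypothetical review rows in the same shape as the filtered data
toy = pd.DataFrame({
    'Primary Category': ['AA', 'AA', 'AA', 'Stereos'],
    'Product Name': ['Battery A', 'Battery A', 'Battery B', 'Speaker C'],
    'Ratings': [5.0, 3.0, 4.0, 2.0],
    'reviews.text': ['Great', 'Died fast', 'Fine', 'Buzzes'],
})

# One row per (category, product): mean rating, joined reviews, review count
agg = toy.groupby(['Primary Category', 'Product Name']).agg(
    Average_Rating=('Ratings', 'mean'),
    All_Reviews=('reviews.text', lambda x: ' '.join(x.astype(str))),
    Review_Count=('reviews.text', 'size'),
).reset_index()

print(agg)
```

"Battery A" ends up with an Average_Rating of 4.0, an All_Reviews value of "Great Died fast", and a Review_Count of 2.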

In [6]:
#Prompt design

SYSTEM_PROMPT = """
You are a professional product reviewer and technical writer.
Your task is to analyze a large body of customer reviews and generate a well-structured, unbiased article summarizing(like a blog post) the best and worst products in a given category.
You MUST adhere strictly to the requested output structure.
"""

def generate_article_summary(category_df, model="gpt-3.5-turbo"):
    """
    Generates a structured article summary for a single product category.
    """
    if category_df.empty:
        return "No products found for this category."

    # --- A. Prepare Input Data for the Prompt ---

    # 1. Rank Products
    # Use the 25th percentile of review counts as the minimum, so products with very few reviews do not skew the ranking.
    min_reviews = category_df['Review_Count'].quantile(0.25)  # Products below this review count are excluded
    category_df_filtered = category_df[category_df['Review_Count'] >= min_reviews].copy()

    # Sort to find the Top 3 and Worst
    top_products = category_df_filtered.sort_values(by='Average_Rating', ascending=False).head(3)
    worst_product = category_df_filtered.sort_values(by='Average_Rating', ascending=True).head(1)

    # 2. Concatenate Reviews for Top Products and Worst Product
    top_reviews_text = " ".join(top_products['All_Reviews'].tolist())
    worst_reviews_text = worst_product['All_Reviews'].iloc[0] if not worst_product.empty else "No worst product identified."

    # 3. Format the product list for the prompt
    product_list = "\n".join(
        [f"- {row['Product Name']} (Avg Rating: {row['Average_Rating']:.2f})"
         for index, row in top_products.iterrows()]
    )
    category_name = category_df_filtered['Primary Category'].iloc[0]


    # --- B. Define the User Prompt ---
    USER_PROMPT = f"""
    Based on the provided customer reviews, generate a comprehensive article for the product category: '{category_name}'.

    **TASK REQUIREMENTS:**
    1.  **Top 3 Products:** Identify the top 3 best-rated products from the list below.
    2.  **Key Differences:** Summarize the main functional/feature differences between these top 3 products.
    3.  **Top Complaints:** For EACH of the top 3 products, list the 2-3 most frequent negative complaints (based on the combined review text).
    4.  **Worst Product:** Identify the single worst-rated product in the category.
    5.  **Avoidance Reason:** Summarize the key reasons (complaints/flaws) why the worst product should be avoided.

    **PRODUCTS & RATINGS (for ranking reference):**
    {product_list}

    **COMBINED REVIEW TEXT FOR TOP PRODUCTS (Use this for differences/complaints):**
    ---
    {top_reviews_text[:3000]}... [Truncated for API efficiency]
    ---

    **COMBINED REVIEW TEXT FOR WORST PRODUCT (Use this for avoidance reasons):**
    ---
    {worst_reviews_text[:1000]}... [Truncated for API efficiency]
    ---

    **FORMAT:**
    Present the output as a Markdown-formatted blog post, using H2 titles for sections.
    """

    # --- C. API Call ---
    try:
        print(f"⏳ Generating summary for {category_name}...")
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": USER_PROMPT}
            ],
            temperature=0.7 # A bit of creativity for a blog post tone
        )
        return response.choices[0].message['content']
    except Exception as e:
        return f"An API error occurred: {e}"

# --- D. Execution Loop ---
# Loop through the categories and call the function for each one.
categories = df_aggregated['Primary Category'].unique()
all_articles = {}

for category in categories:
    category_data = df_aggregated[df_aggregated['Primary Category'] == category]

    # Generate the article summary
    article_markdown = generate_article_summary(category_data)

    # Store and print the result
    all_articles[category] = article_markdown
    print(f"\n\n--- ARTICLE FOR: {category} ---\n")
    print(article_markdown)
    print("\n---------------------------------------------------------")


print("\n\n✅ ALL ARTICLES GENERATED!")

⏳ Generating summary for AA...


--- ARTICLE FOR: AA ---


## Top 3 Products:

### 1. AmazonBasics AAA Performance Alkaline Batteries (36 Count)
- Average Rating: 4.45

### 2. TBD

### 3. TBD

## Key Differences:
Upon analyzing the reviews for the top-rated product, the AmazonBasics AAA Performance Alkaline Batteries, it is evident that customers appreciate the overall performance and reliability of these batteries. However, some users have pointed out a recurring issue with missing backup springs, leading to inconvenience and the need for makeshift solutions like using a piece of aluminum to make the battery work.

## Top Complaints:
### AmazonBasics AAA Performance Alkaline Batteries (36 Count):
1. Missing backup springs causing inconvenience.
2. Quality control issues leading to a need for makeshift solutions.

## Worst Product:

### Avoidance Reason:
The worst-rated product in the category of AAA batteries is also the AmazonBasics AAA Performance Alkaline Batteries (36 Count). Th
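
The "TBD" placeholders and the worst product coinciding with the top product in the output above are consistent with two properties of the prompt inputs. First, `top_reviews_text[:3000]` slices the concatenation of all top products' reviews, so the slice may contain only the first product's text. Second, the worst product is drawn from the same filtered frame as the top three, so in a small category it can be one of them. The following is a minimal sketch of one possible alternative, not the notebook's method; `pick_top_and_worst`, `per_product_snippets`, and the toy data are hypothetical names and values.

```python
import pandas as pd


def pick_top_and_worst(category_df: pd.DataFrame, n_top: int = 3):
    """Return (top_products, worst_product) so that no product appears in both."""
    ranked = category_df.sort_values('Average_Rating', ascending=False)
    top = ranked.head(n_top)
    rest = ranked.drop(top.index)  # candidates that are not already in the top list
    worst = rest.tail(1)           # empty when the category has <= n_top products
    return top, worst


def per_product_snippets(products: pd.DataFrame, chars_per_product: int = 1000) -> str:
    """Truncate each product's reviews separately so every product is represented."""
    blocks = [
        f"### {row['Product Name']}\n{row['All_Reviews'][:chars_per_product]}"
        for _, row in products.iterrows()
    ]
    return "\n\n".join(blocks)


# Toy aggregated data (hypothetical values)
toy = pd.DataFrame({
    'Product Name': ['A', 'B', 'C', 'D'],
    'Average_Rating': [4.8, 4.5, 4.1, 3.2],
    'All_Reviews': ['a' * 5000, 'b' * 5000, 'c' * 5000, 'd' * 5000],
})

top, worst = pick_top_and_worst(toy)
snippets = per_product_snippets(top, chars_per_product=10)
print(snippets)
```

With this shape, a category with three or fewer products yields an empty `worst` frame, which the prompt could report explicitly rather than asking the model to name a worst product.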

In [7]:
# Assuming the 'all_articles' dictionary from the previous step is still in memory.
# It contains: {Category Name: Markdown Article Text}

OUTPUT_FILENAME = 'project_articles.md'

try:
    with open(OUTPUT_FILENAME, 'w', encoding='utf-8') as f:
        print(f"Starting to write articles to {OUTPUT_FILENAME}...")

        for category, content in all_articles.items():
            # Write a large header for the category
            f.write(f"# Product Review: {category}\n\n")

            # Write the generated article content (which is already in Markdown)
            f.write(content)

            # Add a separator for the next article
            f.write("\n\n---\n\n")

    print(f"\n✅ Successfully saved all generated articles to '{OUTPUT_FILENAME}'.")
    print("You can download this file from your Colab file explorer.")

except Exception as e:
    print(f"❌ An error occurred while writing the file: {e}")

Starting to write articles to project_articles.md...

✅ Successfully saved all generated articles to 'project_articles.md'.
You can download this file from your Colab file explorer.


In [8]:
import pandas as pd
import openai
import os
from dotenv import load_dotenv
import json

# --- 1. Setup and Data Loading (Consolidated Step 1 & 2) ---
print("Setting up Chatbot Environment and Preparing Data...")

# Load API Key
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

if openai.api_key is None:
    # Raise instead of calling exit() so the cell stops with a clear error
    raise RuntimeError("❌ FATAL ERROR: OpenAI API Key not loaded. Cannot run the chatbot.")

TARGET_CATEGORIES = ['Fire Tablets', 'AA', 'Stereos']
FILE_PATH = 'final.csv'

# Load the raw CSV
try:
    df = pd.read_csv(FILE_PATH, encoding='utf-8')
except FileNotFoundError:
    print(f"❌ Error: The file '{FILE_PATH}' was not found. Please ensure final.csv is uploaded.")
    raise  # Re-raise so the cell stops here instead of calling exit()

# Data Cleaning and Filtering
df['Primary Category'] = df['Category'].str.split(',').str[0].str.strip()
df_filtered = df[df['Primary Category'].isin(TARGET_CATEGORIES)].copy()
df_filtered.dropna(subset=['reviews.text', 'Ratings'], inplace=True)
df_filtered['Ratings'] = pd.to_numeric(df_filtered['Ratings'], errors='coerce')
df_filtered.dropna(subset=['Ratings'], inplace=True)

# Aggregate Data by Product
df_aggregated = df_filtered.groupby(['Primary Category', 'Product Name']).agg(
    Average_Rating=('Ratings', 'mean'),
    All_Reviews=('reviews.text', lambda x: ' '.join(x.astype(str))),
    Review_Count=('reviews.text', 'size')
).reset_index()

# Filter out products with very few reviews (e.g., less than 5)
df_aggregated = df_aggregated[df_aggregated['Review_Count'] >= 5]


# --- 2. Pre-process Data into a Dictionary for the AI Prompt ---
product_data_for_ai = {}
categories = df_aggregated['Primary Category'].unique().tolist()

for category in categories:
    category_df = df_aggregated[df_aggregated['Primary Category'] == category]

    # Sort to easily find the Top 3
    category_df_sorted = category_df.sort_values(by='Average_Rating', ascending=False)
    top_3 = category_df_sorted.head(3)

    product_data_for_ai[category] = []
    for index, row in top_3.iterrows():
        # *** FIX: Drastically reducing text size to prevent token limit breach ***
        reviews_text = row['All_Reviews'][:1000]

        product_data_for_ai[category].append({
            'Name': row['Product Name'],
            'Average Rating': f"{row['Average_Rating']:.2f}",
            'Review Count': int(row['Review_Count']),
            'Reviews Text Snippet (Truncated)': reviews_text
        })

print(f"✅ Data for {len(categories)} categories successfully prepared.")


# --- 3. Define the System Prompt and Chatbot Function ---

SYSTEM_PROMPT = f"""
You are **ByteAdvisor**, a professional product reviewer and technical writer. Your sole purpose is to provide advice based *only* on the customer reviews data provided below.

**YOUR MANDATE AND CONSTRAINTS (CRITICAL RULES):**
1.  **Persona & Greeting:** Maintain a helpful and professional tone. Start by welcoming the user and introducing your scope based on the available product categories.
2.  **In-Scope Data:** Your knowledge is strictly limited to the products and reviews in the data provided below.
3.  **Core Functions:** You must be able to:
    * List the available categories when asked: {', '.join(categories)}.
    * List the top 3 products for any requested category.
    * Analyze the provided 'Reviews Text Snippet (Truncated)' to list 3 clear **Pros** and 3 clear **Cons** for a specific top product.
    * Provide definitive buying advice (a recommendation) based on the ratings and pros/cons.
4.  **Out-of-Scope Rule:** If the user asks *anything* outside of the context of these products (e.g., weather, history, other products), you **MUST** kindly refuse by saying: "That question seems to be outside the scope of my current product review data. I can only offer advice on {', '.join(categories)}."

**PRODUCT DATA (JSON format):**
{json.dumps(product_data_for_ai, indent=2)}
"""

def product_advisor_chatbot():
    """
    Main function to run the interactive, stateful chatbot in a loop.
    """
    conversation_history = [
        {"role": "system", "content": SYSTEM_PROMPT}
    ]

    # Initial Greeting: Bot introduces itself based on the system prompt
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=conversation_history,
        temperature=0.7,
        max_tokens=200
    )
    initial_response = response.choices[0].message['content'].strip()
    print("--- Starting ByteAdvisor Chatbot ---")
    print(f"\n🤖 ByteAdvisor: {initial_response}")

    # Start the main conversation loop
    while True:
        user_input = input("\nYou: ").strip()

        if user_input.lower() in ['quit', 'exit', 'bye']:
            print("\n🤖 ByteAdvisor: Thank you for consulting me! Goodbye.")
            break

        # Add user's message to history
        conversation_history.append({"role": "user", "content": user_input})

        # *** FIX: Reducing history size to prevent token limit breach ***
        # Send the system prompt plus the last 5 non-system messages.
        # Slicing from index 1 keeps the system prompt from being sent twice while the history is still short.
        messages_to_send = [conversation_history[0]] + conversation_history[1:][-5:]

        try:
            # Call the API with the constrained prompt and history
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=messages_to_send,
                temperature=0.5,
                max_tokens=500
            )

            bot_response = response.choices[0].message['content'].strip()

            # Add bot's message to history
            conversation_history.append({"role": "assistant", "content": bot_response})

            print(f"\n🤖 ByteAdvisor: {bot_response}")

        except Exception as e:
            print(f"\n❌ An API error occurred: {e}. Please check your key or network connection.")
            conversation_history.pop()


# --- 4. EXECUTION ---
product_advisor_chatbot()

Setting up Chatbot Environment and Preparing Data...
✅ Data for 3 categories successfully prepared.
--- Starting ByteAdvisor Chatbot ---

🤖 ByteAdvisor: Hello! Welcome to ByteAdvisor, your go-to source for product advice based on customer reviews. The available categories for product reviews are AA, Fire Tablets, and Stereos.

If you'd like, I can list the top 3 products in any of these categories or provide you with the pros and cons of a specific product. How can I assist you today?

You: Stereos

🤖 ByteAdvisor: Great choice! Here are the top 3 stereo products:

1. **Amazon Fire Hd 6 Standing Protective Case (4th Generation - 2014 Release), Cayenne Red**
   - Average Rating: 4.83
   - Review Count: 6
   - Reviews Text Snippet (Truncated): "I really enjoy the Echo. I got an Echo Dot and liked it so much I then got the full size Echo..."

2. **Kindle Dx Leather Cover, Black (fits 9.7 Display, Latest and 2nd Generation Kindle Dxs)**
   - Average Rating: 4.78
   - Review Count: 9

KeyboardInterrupt: Interrupted by user
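
The history window used in the chat loop can be isolated into a small helper and checked without calling the API. This is a minimal sketch, assuming the first message is always the system prompt and `max_recent` is at least 1; `trim_history` and `max_recent` are illustrative names that do not appear in the notebook.

```python
def trim_history(history, max_recent=5):
    """Keep the system message plus at most the last `max_recent` non-system messages."""
    system, rest = history[0], history[1:]
    return [system] + rest[-max_recent:]


# Short conversation: the system prompt must appear exactly once
short = [
    {"role": "system", "content": "rules"},
    {"role": "user", "content": "Stereos"},
]
print(len(trim_history(short)))  # 2

# Longer conversation: only the most recent messages are kept
long_history = [{"role": "system", "content": "rules"}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(8)
]
trimmed = trim_history(long_history)
print([m["content"] for m in trimmed])
```

Because the system prompt here embeds the full product JSON, sending it only once per request keeps a large share of the token budget available for the conversation itself.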