# Part B: Sentiment Analysis Prompt Evaluation

## Objective
- Create a sentiment analysis prompt (positive/negative/neutral)
- Include a confidence score and reasoning
- Test on 10 emails
- Iterate and improve (v1 → v2)

## 1. Setup

In [1]:
import pandas as pd
from groq import Groq
import os
from dotenv import load_dotenv
import json
from datetime import datetime

load_dotenv()
client = Groq(api_key=os.getenv('GROQ_API_KEY'))
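
If `GROQ_API_KEY` is missing, `os.getenv` returns `None` and the problem may only surface later as an authentication error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper name, not part of `groq` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Possible use in the setup cell (assumes load_dotenv() has already run):
# client = Groq(api_key=require_env("GROQ_API_KEY"))
```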

## 2. Load Test Data (10 Emails)

In [2]:
# Load small dataset and select 10 emails
df = pd.read_csv('../data/small_dataset.csv')

# Select 10 emails for testing
test_emails = df.head(10).copy()

print(f"Test set: {len(test_emails)} emails")
print("\nSample emails:")
for idx, row in test_emails.iterrows():
    print(f"\nEmail {row['email_id']}:")
    print(f"Subject: {row['subject']}")
    print(f"Body: {row['body'][:100]}...")

Test set: 10 emails

Sample emails:

Email 1:
Subject: Unable to access shared mailbox
Body: Hi team, I'm unable to access the shared mailbox for our support team. It keeps showing a permission...

Email 2:
Subject: Rules not working
Body: We created a rule to auto-assign emails based on subject line but it stopped working since yesterday...

Email 3:
Subject: Email stuck in pending
Body: One of our emails is stuck in pending even after marking it resolved. Not sure what's happening....

Email 4:
Subject: Automation creating duplicate tasks
Body: Your automation engine is creating 2 tasks for every email. This started after we edited our workflo...

Email 5:
Subject: Tags missing
Body: Many of our tags are not appearing for new emails. Looks like the tagging model is not working for u...

Email 6:
Subject: Billing query
Body: We were charged incorrectly this month. Need a corrected invoice....

Email 7:
Subject: CSAT not visible
Body: CSAT scores disappeared from our dashboard today. I
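
The loading cell assumes the CSV has `email_id`, `subject`, and `body` columns and that every body is a string. An empty cell would load as `NaN`, and `row['body'][:100]` would then raise a `TypeError`. The sketch below checks both assumptions before selecting the test set; `select_test_emails` and `REQUIRED_COLUMNS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Columns the notebook reads from the dataset
REQUIRED_COLUMNS = ("email_id", "subject", "body")


def select_test_emails(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the first n emails after checking that the expected columns exist."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    subset = df.head(n).copy()
    # Empty cells load as NaN; replace them so slicing and str.format behave
    for col in ("subject", "body"):
        subset[col] = subset[col].fillna("").astype(str)
    return subset


# Possible use: test_emails = select_test_emails(df, n=10)
```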

## 3. Prompt v1: Initial Design

In [3]:
PROMPT_V1 = """Analyze the sentiment of this customer support email.

Subject: {subject}
Body: {body}

Classify the sentiment as: positive, negative, or neutral.

Provide your response in JSON format:
{{
    "sentiment": "positive/negative/neutral",
    "confidence": 0.0-1.0,
    "reasoning": "brief explanation of why you chose this sentiment"
}}
"""

# Save prompt v1
with open('prompt_v1.txt', 'w') as f:
    f.write(PROMPT_V1)

print("Prompt v1:")
print(PROMPT_V1)

Prompt v1:
Analyze the sentiment of this customer support email.

Subject: {subject}
Body: {body}

Classify the sentiment as: positive, negative, or neutral.

Provide your response in JSON format:
{{
    "sentiment": "positive/negative/neutral",
    "confidence": 0.0-1.0,
    "reasoning": "brief explanation of why you chose this sentiment"
}}



## 4. Test Prompt v1

In [4]:
def analyze_sentiment(subject, body, prompt_template, client):
    """
    Analyze sentiment using given prompt template.
    """
    prompt = prompt_template.format(subject=subject, body=body)
    
    try:
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=200
        )
        
        # Get response text
        response_text = response.choices[0].message.content
        
        # Strip markdown code fences if present
        response_text = response_text.strip()
        if '```' in response_text:
            # Keep only the content after the first fence and before the next one.
            # This also covers fences preceded by prose and a missing closing fence.
            fenced = response_text.split('```', 2)[1]
            # Drop an optional language tag such as "json" on the opening fence
            if fenced.lstrip().lower().startswith('json'):
                fenced = fenced.lstrip()[4:]
            response_text = fenced.strip()
        
        result = json.loads(response_text)
        return result
    
    except Exception as e:
        return {"sentiment": "error", "confidence": 0.0, "reasoning": str(e)}

# Test on all 10 emails
results_v1 = []

for idx, row in test_emails.iterrows():
    result = analyze_sentiment(row['subject'], row['body'], PROMPT_V1, client)
    results_v1.append({
        'email_id': row['email_id'],
        'subject': row['subject'],
        'sentiment': result['sentiment'],
        'confidence': result['confidence'],
        'reasoning': result['reasoning']
    })
    print(f"Email {row['email_id']}: {result['sentiment']} (confidence: {result['confidence']})")

# Save results
results_v1_df = pd.DataFrame(results_v1)
results_v1_df.to_json('results_v1.json', orient='records', indent=2)

Email 1: neutral (confidence: 0.8)
Email 2: negative (confidence: 0.8)
Email 3: neutral (confidence: 0.8)
Email 4: negative (confidence: 0.8)
Email 5: negative (confidence: 0.8)
Email 6: negative (confidence: 0.8)
Email 7: neutral (confidence: 0.8)
Email 8: negative (confidence: 0.9)
Email 9: neutral (confidence: 0.8)
Email 10: negative (confidence: 0.8)
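
The test loop reads `result['sentiment']` and `result['confidence']` directly, so it depends on the model returning exactly the requested keys with in-range values. A sketch of a stricter parsing step is shown below. `parse_sentiment_response` and `ALLOWED_SENTIMENTS` are hypothetical names, and taking the outermost `{...}` span is one possible way to ignore fences or surrounding prose, not the notebook's method.

```python
import json
import re

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_sentiment_response(text: str) -> dict:
    """Extract and validate the JSON object from a model response."""
    # Greedy match from the first "{" to the last "}", skipping fences and prose
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass

    sentiment = str(data.get("sentiment", "")).strip().lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")

    try:
        confidence = float(data["confidence"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("confidence is missing or not numeric") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    return {
        "sentiment": sentiment,
        "confidence": confidence,
        "reasoning": str(data.get("reasoning", "")),
    }
```

Because every failure is raised as a `ValueError`, a caller can catch one exception type and record the email as an error row.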


## 5. Manual Evaluation of v1 Results

In [5]:
# Display results for manual review
print("\nPrompt v1 Results:")
print("="*80)
for result in results_v1:
    # Use the stored email_id so the label stays correct for any subset or ordering
    print(f"\nEmail {result['email_id']}:")
    print(f"Subject: {result['subject']}")
    print(f"Sentiment: {result['sentiment']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Reasoning: {result['reasoning']}")
    print("-" * 80)

# TODO: Manually assess each prediction
# - Is the sentiment correct?
# - Is the confidence appropriate?
# - Is the reasoning sound?
# - What patterns of errors do you see?


Prompt v1 Results:

Email 1:
Subject: Unable to access shared mailbox
Sentiment: neutral
Confidence: 0.8
Reasoning: The customer is reporting an issue, but the tone is polite and matter-of-fact, without expressing frustration, anger, or dissatisfaction, which are typical characteristics of negative sentiment. The language used is also neutral, focusing on the problem and the request for assistance, rather than making a positive or negative comment.
--------------------------------------------------------------------------------

Email 2:
Subject: Rules not working
Sentiment: negative
Confidence: 0.8
Reasoning: The customer is reporting an issue with a feature (rules not working) that is causing inconvenience, indicating a negative experience with the product.
--------------------------------------------------------------------------------

Email 3:
Subject: Email stuck in pending
Sentiment: neutral
Confidence: 0.8
Reasoning: The customer is reporting an issue, but their tone is matter

## 6. Analyze Failures and Issues

In [6]:
# TODO: Document issues found in v1
# Examples of things to look for:
# - Incorrect sentiment classifications
# - Overconfident or underconfident predictions
# - Poor reasoning
# - Inconsistent handling of similar emails
# - Edge cases not handled well

print("\nIssues to address in v2:")
print("1. [Issue 1]")
print("2. [Issue 2]")
print("3. [Issue 3]")


Issues to address in v2:
1. [Issue 1]
2. [Issue 2]
3. [Issue 3]
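
One of the checks listed above, overconfident or underconfident predictions, can be partly automated. In the v1 run printed in section 4, nine of the ten emails received a confidence of 0.8, which suggests the score may not be discriminating between easy and hard cases. The sketch below measures how concentrated the confidence values are; `summarize_confidence` is a hypothetical helper.

```python
from collections import Counter


def summarize_confidence(results):
    """Return (most common confidence, share of predictions using it).

    `results` is a list of dicts with "sentiment" and "confidence" keys,
    in the same shape as results_v1. Error rows are ignored.
    """
    values = [r["confidence"] for r in results if r["sentiment"] != "error"]
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Possible use: value, share = summarize_confidence(results_v1)
```

A share close to 1.0 would be a candidate entry for the issue list, since it means the confidence field carries little information.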


## 7. Prompt v2: Improved Design

In [7]:
# TODO: Design improved prompt based on v1 failures
# Consider adding:
# - More specific instructions
# - Examples (few-shot learning)
# - Clearer definitions of positive/negative/neutral
# - Guidelines for confidence scoring
# - Context about customer support domain

PROMPT_V2 = """You are an expert at analyzing sentiment in customer support emails.

Email to analyze:
Subject: {subject}
Body: {body}

Instructions:
1. Classify the sentiment as:
   - "positive": Customer is happy, grateful, or satisfied
   - "negative": Customer is frustrated, angry, or disappointed
   - "neutral": Informational query or neither clearly positive nor negative

2. Consider:
   - Tone and word choice
   - Urgency markers
   - Emotional indicators
   - Context of the issue

3. Confidence scoring:
   - High (0.8-1.0): Clear sentiment indicators
   - Medium (0.5-0.79): Some ambiguity
   - Low (0.0-0.49): Mixed signals or unclear

Return JSON:
{{
    "sentiment": "positive/negative/neutral",
    "confidence": 0.0-1.0,
    "reasoning": "detailed explanation referencing specific words/phrases"
}}
"""

# Save prompt v2
with open('prompt_v2.txt', 'w') as f:
    f.write(PROMPT_V2)

print("Prompt v2 created with improvements.")

Prompt v2 created with improvements.
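
The confidence rubric in Prompt v2 can be mirrored in code, so that returned scores can be grouped into the same bands during review. This is a sketch only: `confidence_band` is a hypothetical helper, and because the rubric leaves small gaps (for example between 0.79 and 0.8), the boundaries below are assumed to be half-open.

```python
def confidence_band(confidence: float) -> str:
    """Map a confidence score to the bands described in Prompt v2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= 0.8:
        return "high"    # clear sentiment indicators
    if confidence >= 0.5:
        return "medium"  # some ambiguity
    return "low"         # mixed signals or unclear


# Possible use: results_v2_df["band"] = results_v2_df["confidence"].map(confidence_band)
```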


## 8. Test Prompt v2

In [8]:
# Test v2 on same 10 emails
results_v2 = []

for idx, row in test_emails.iterrows():
    result = analyze_sentiment(row['subject'], row['body'], PROMPT_V2, client)
    results_v2.append({
        'email_id': row['email_id'],
        'subject': row['subject'],
        'sentiment': result['sentiment'],
        'confidence': result['confidence'],
        'reasoning': result['reasoning']
    })
    print(f"Email {row['email_id']}: {result['sentiment']} (confidence: {result['confidence']})")

# Save results
results_v2_df = pd.DataFrame(results_v2)
results_v2_df.to_json('results_v2.json', orient='records', indent=2)

Email 1: neutral (confidence: 0.8)
Email 2: negative (confidence: 0.8)
Email 3: neutral (confidence: 0.7)
Email 4: negative (confidence: 0.8)
Email 5: negative (confidence: 0.8)
Email 6: negative (confidence: 0.8)
Email 7: neutral (confidence: 0.8)
Email 8: negative (confidence: 0.9)
Email 9: neutral (confidence: 0.9)
Email 10: negative (confidence: 0.8)
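
Sections 4 and 8 run the same loop with a different prompt. One way to avoid the duplication is a small runner that takes the analysis function as an argument, sketched below. `run_prompt` and its `analyze` parameter are hypothetical; `analyze` is assumed to accept `(subject, body, prompt_template)` and return a dict.

```python
def run_prompt(emails, prompt_template, analyze):
    """Apply one prompt to every email and collect flat result rows.

    `emails` is an iterable of dicts with "email_id", "subject", and "body".
    """
    rows = []
    for email in emails:
        result = analyze(email["subject"], email["body"], prompt_template)
        rows.append({
            "email_id": email["email_id"],
            "subject": email["subject"],
            # .get() keeps the run going if the model omitted a key
            "sentiment": result.get("sentiment", "error"),
            "confidence": result.get("confidence", 0.0),
            "reasoning": result.get("reasoning", ""),
        })
    return rows


# Possible use in this notebook:
# results_v2 = run_prompt(
#     test_emails.to_dict("records"),
#     PROMPT_V2,
#     lambda s, b, t: analyze_sentiment(s, b, t, client),
# )
```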


## 9. Compare v1 vs v2

In [9]:
# Create comparison DataFrame
comparison = pd.DataFrame({
    'email_id': results_v1_df['email_id'],
    'subject': results_v1_df['subject'],
    'v1_sentiment': results_v1_df['sentiment'],
    'v1_confidence': results_v1_df['confidence'],
    'v2_sentiment': results_v2_df['sentiment'],
    'v2_confidence': results_v2_df['confidence']
})

# Identify changes
comparison['changed'] = comparison['v1_sentiment'] != comparison['v2_sentiment']

print("\nComparison of v1 vs v2:")
print(comparison)

print(f"\nNumber of changes: {comparison['changed'].sum()}")
print(f"\nAverage confidence v1: {comparison['v1_confidence'].mean():.3f}")
print(f"Average confidence v2: {comparison['v2_confidence'].mean():.3f}")

# Show cases where prediction changed
if comparison['changed'].any():
    print("\nEmails where prediction changed:")
    print(comparison[comparison['changed']])


Comparison of v1 vs v2:
   email_id                              subject v1_sentiment  v1_confidence  \
0         1      Unable to access shared mailbox      neutral            0.8   
1         2                    Rules not working     negative            0.8   
2         3               Email stuck in pending      neutral            0.8   
3         4  Automation creating duplicate tasks     negative            0.8   
4         5                         Tags missing     negative            0.8   
5         6                        Billing query     negative            0.8   
6         7                     CSAT not visible      neutral            0.8   
7         8               Delay in email loading     negative            0.9   
8         9            Need help setting up SLAs      neutral            0.8   
9        10                   Mail merge failing     negative            0.8   

  v2_sentiment  v2_confidence  changed  
0      neutral            0.8    False  
1     negati
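
The comparison cell aligns the two result frames by row position, which works here because both runs iterate over the same emails in the same order. A sketch that joins on `email_id` instead, and also records the per-email confidence change, is shown below; `compare_runs` is a hypothetical helper.

```python
import pandas as pd


def compare_runs(v1: pd.DataFrame, v2: pd.DataFrame) -> pd.DataFrame:
    """Join two result frames on email_id and flag changed predictions."""
    merged = v1[["email_id", "subject", "sentiment", "confidence"]].merge(
        v2[["email_id", "sentiment", "confidence"]],
        on="email_id",
        suffixes=("_v1", "_v2"),
        validate="one_to_one",  # fail loudly on duplicate email_ids
    )
    merged["changed"] = merged["sentiment_v1"] != merged["sentiment_v2"]
    merged["confidence_delta"] = merged["confidence_v2"] - merged["confidence_v1"]
    return merged


# Possible use: comparison = compare_runs(results_v1_df, results_v2_df)
```

Note that the column names differ from the cell above (`sentiment_v1` rather than `v1_sentiment`) because they come from the `suffixes` argument.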

## 10. Document Improvements

In [10]:
# TODO: Document in evaluation_report.md:
# 1. What failed in v1
# 2. What was improved in v2
# 3. How to evaluate prompts systematically

print("\nNext steps:")
print("1. Manually review all results")
print("2. Calculate accuracy (need ground truth labels)")
print("3. Update evaluation_report.md with findings")
print("4. Document systematic evaluation process")


Next steps:
1. Manually review all results
2. Calculate accuracy (need ground truth labels)
3. Update evaluation_report.md with findings
4. Document systematic evaluation process
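
Step 2 can be sketched once hand-assigned labels exist. The function below is a minimal illustration that assumes predictions and labels are both dicts keyed by `email_id`. `accuracy` is a hypothetical helper, and no ground truth labels are provided in this notebook.

```python
def accuracy(predictions: dict, labels: dict) -> float:
    """Share of labelled emails whose predicted sentiment matches the label."""
    scored = [email_id for email_id in labels if email_id in predictions]
    if not scored:
        raise ValueError("no overlap between predictions and labels")
    correct = sum(predictions[eid] == labels[eid] for eid in scored)
    return correct / len(scored)


# Possible use, with `ground_truth` filled in by a human reviewer:
# v1_preds = dict(zip(results_v1_df["email_id"], results_v1_df["sentiment"]))
# print(accuracy(v1_preds, ground_truth))
```

With only 10 emails, each prediction moves accuracy by 10 percentage points, so small differences between v1 and v2 should be read with caution.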


## 11. Additional Analysis

In [11]:
# Sentiment distribution
print("\nSentiment Distribution:")
print("\nv1:")
print(results_v1_df['sentiment'].value_counts())
print("\nv2:")
print(results_v2_df['sentiment'].value_counts())

# Confidence distribution
print("\nConfidence Stats:")
print("\nv1:")
print(results_v1_df['confidence'].describe())
print("\nv2:")
print(results_v2_df['confidence'].describe())


Sentiment Distribution:

v1:
sentiment
negative    6
neutral     4
Name: count, dtype: int64

v2:
sentiment
negative    6
neutral     4
Name: count, dtype: int64

Confidence Stats:

v1:
count    10.000000
mean      0.810000
std       0.031623
min       0.800000
25%       0.800000
50%       0.800000
75%       0.800000
max       0.900000
Name: confidence, dtype: float64

v2:
count    10.000000
mean      0.810000
std       0.056765
min       0.700000
25%       0.800000
50%       0.800000
75%       0.800000
max       0.900000
Name: confidence, dtype: float64
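
Identical value counts do not by themselves show that each email kept its label, since two changes in opposite directions would cancel out. In this run the per-email printouts in sections 4 and 8 do show the same label for every email, with confidence changing only for emails 3 and 9. A transition table makes that check explicit; `transition_table` below is a hypothetical helper built on `pd.crosstab`.

```python
import pandas as pd


def transition_table(v1_sentiments, v2_sentiments) -> pd.DataFrame:
    """Cross-tabulate v1 labels (rows) against v2 labels (columns)."""
    v1 = pd.Series(list(v1_sentiments), name="v1")
    v2 = pd.Series(list(v2_sentiments), name="v2")
    return pd.crosstab(v1, v2)


# Possible use:
# print(transition_table(results_v1_df["sentiment"], results_v2_df["sentiment"]))
```

Off-diagonal cells count emails whose label changed between the two prompts.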
