# Categorizing Open-Ended Survey Responses with AI
## Using Real Social Science Data

This notebook demonstrates how to use AI to categorize and analyze open-ended survey responses from social science research.

## Dataset: Food Choices Survey

We'll use the **Food Choices** dataset from Kaggle:
- **Source**: Survey of 126 college students at Mercyhurst University
- **Topic**: Food preferences and eating habits
- **Key Question**: "Why do you eat comfort food?"
- **Download**: https://www.kaggle.com/datasets/borapajo/food-choices

This dataset contains real open-ended responses about comfort food choices, perfect for demonstrating categorization techniques.

In [None]:
# Setup and imports
import pandas as pd
import numpy as np
import json
import openai
import os
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

# Install and import kagglehub for dataset download
# !pip install kagglehub
import kagglehub

# Set up plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [None]:
# Load API key
import getpass

if not os.getenv('OPENAI_API_KEY'):
    openai.api_key = getpass.getpass("Enter your OpenAI API key: ")
else:
    openai.api_key = os.getenv('OPENAI_API_KEY')
    
print("✅ API key loaded")

## Step 1: Load and Explore the Data

First, download the dataset from Kaggle and load it. If the expected columns are missing, a later cell falls back to a small set of sample responses.

In [None]:
# Download the actual Food Choices dataset from Kaggle
path = kagglehub.dataset_download("borapajo/food-choices")
print("Path to dataset files:", path)

# List files in the dataset (os is already imported in the setup cell)
files = os.listdir(path)
print("\nFiles in dataset:")
for file in files:
    print(f"  - {file}")

# Load the main CSV file
csv_file = os.path.join(path, 'food_coded.csv')
food_data = pd.read_csv(csv_file)

print(f"\nDataset shape: {food_data.shape}")
print(f"Columns: {list(food_data.columns)}")

# Look for comfort food related columns
comfort_cols = [col for col in food_data.columns if 'comfort' in col.lower()]
print(f"\nComfort food related columns:")
for col in comfort_cols:
    print(f"  - {col}")

In [None]:
# Search for the comfort_food_reasons_coded column specifically
print("Searching for coded comfort food reasons...\n")

# Get exact column names containing 'comfort'
comfort_related = [col for col in food_data.columns if 'comfort' in col.lower()]
print(f"Columns with 'comfort' in name: {comfort_related}")

# Check if comfort_food_reasons_coded exists (with exact match)
if 'comfort_food_reasons_coded' in food_data.columns:
    print("\n✅ Found 'comfort_food_reasons_coded' column!")
    coded_data = food_data['comfort_food_reasons_coded'].dropna()
    print(f"Non-null coded responses: {len(coded_data)}")
    print(f"Unique categories: {coded_data.nunique()}")
    print("\nCategory distribution:")
    print(coded_data.value_counts())
    
    # Check if we have both original and coded
    if 'comfort_food_reasons' in food_data.columns:
        # Create paired data
        paired = food_data[['comfort_food_reasons', 'comfort_food_reasons_coded']].dropna()
        print(f"\n✅ Found {len(paired)} responses with both text and codes!")
        print("\nSample of paired data:")
        for idx, row in paired.head().iterrows():
            print(f"\nText: {row['comfort_food_reasons']}")
            print(f"Code: {row['comfort_food_reasons_coded']}")
else:
    print("\n❌ No 'comfort_food_reasons_coded' column found")
    print("We'll need to create our own categorization")
    
    # Look for any text columns that might be comfort food reasons
    text_cols = food_data.select_dtypes(include=['object']).columns
    print(f"\nText columns that might contain reasons:")
    for col in text_cols:
        if 'comfort' in col.lower() or 'reason' in col.lower():
            print(f"  - {col}")
            sample = food_data[col].dropna().head(2)
            for val in sample:
                print(f"      Sample: {str(val)[:100]}...")

In [None]:
# Prepare the responses to categorize
# If we have the actual comfort_food_reasons column, use it
# Otherwise, create sample data

if 'comfort_food_reasons' in food_data.columns:
    # Use actual data
    comfort_responses = food_data[['comfort_food_reasons']].dropna().copy()
    comfort_responses['respondent_id'] = range(1, len(comfort_responses) + 1)
    
    # Add demographic data if available
    if 'Gender' in food_data.columns:
        comfort_responses['gender'] = food_data.loc[comfort_responses.index, 'Gender']
    if 'grade_level' in food_data.columns:
        comfort_responses['year'] = food_data.loc[comfort_responses.index, 'grade_level']
    
    print(f"Using {len(comfort_responses)} actual responses from the dataset")
    print("\nFirst 5 responses:")
    for idx, row in comfort_responses.head().iterrows():
        print(f"{row['respondent_id']}. {row['comfort_food_reasons']}")
    
    # Check if we have existing codes for validation
    if 'comfort_food_reasons_coded' in food_data.columns:
        existing_codes = food_data.loc[comfort_responses.index, 'comfort_food_reasons_coded']
        comfort_responses['existing_code'] = existing_codes
        print(f"\n✅ Found existing codes for validation!")
        print(f"Unique existing categories: {existing_codes.dropna().unique()}")
else:
    # Create sample data if column not found
    print("Creating sample data similar to typical comfort food responses...")
    comfort_responses = pd.DataFrame({
        'respondent_id': range(1, 11),
        'comfort_food_reasons': [
            "stress and anxiety from exams",
            "reminds me of home",
            "when I'm sad or lonely",
            "boredom",
            "tastes good",
            "reward after hard work",
            "nostalgia",
            "homesickness",
            "stress eating",
            "convenience"
        ]
    })
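
The categorization cells call `categorize_response(..., coding_scheme)`, but no cell defines `coding_scheme`. A minimal sketch of one follows. The six category codes are the ones used in the validation cells of this notebook; the descriptions and the extra `other` code are illustrative assumptions that a researcher would refine against the actual responses.

```python
# Hypothetical coding scheme: the codes mirror those used in the validation cells,
# while the descriptions and the 'other' catch-all are assumptions for illustration.
coding_scheme = {
    'emotional_regulation': 'Eating to cope with stress, anxiety, sadness, or loneliness',
    'nostalgia': 'Food that recalls home, family, or childhood memories',
    'habit': 'Eating out of boredom or routine rather than hunger',
    'reward': 'Eating as a treat after effort or to celebrate',
    'sensory': 'Eating because the food tastes, smells, or feels good',
    'convenience': 'Eating because the food is quick, easy, or close at hand',
    'other': 'Reasons that fit none of the categories above',
}

print(f"Coding scheme with {len(coding_scheme)} categories:")
for code, desc in coding_scheme.items():
    print(f"  - {code}: {desc}")
```

Keeping the scheme in a plain dictionary means the same object can be passed to the prompt builder and saved alongside the results.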

In [None]:
# Explore the comfort food reasons column
print("Examining comfort food reasons in the dataset:\n")

# Check if there's a column with open-ended responses
if 'comfort_food_reasons' in food_data.columns:
    comfort_reasons = food_data['comfort_food_reasons'].dropna()
    print(f"Found {len(comfort_reasons)} comfort food reasons")
    print("\nFirst 5 responses:")
    for i, reason in enumerate(comfort_reasons.head(), 1):
        print(f"{i}. {reason}")
        
# Check for any coded/labeled versions
if 'comfort_food_reasons_coded' in food_data.columns:
    print("\n✅ Found existing coded labels!")
    print("Unique codes:", food_data['comfort_food_reasons_coded'].unique())
    
# Look for columns that might contain coding
coded_cols = [col for col in food_data.columns if 'coded' in col.lower() or 'code' in col.lower()]
if coded_cols:
    print(f"\nColumns with potential coding:")
    for col in coded_cols:
        print(f"  - {col}: {food_data[col].dtype}")
        if food_data[col].dtype == 'object':
            unique_vals = food_data[col].nunique()
            print(f"    Unique values: {unique_vals}")
            if unique_vals < 20:
                print(f"    Values: {food_data[col].unique()[:10]}")

def categorize_response(response, categories):
    """
    Use AI to categorize an open-ended response.
    
    Args:
        response: The open-ended text response
        categories: Dictionary of category codes and descriptions
    
    Returns:
        Dictionary with primary and secondary categories
    """
    
    # Create category list for prompt
    cat_list = "\n".join([f"- {code}: {desc}" for code, desc in categories.items()])
    
    prompt = f"""
    Categorize this survey response about why someone eats comfort food.
    
    Response: "{response}"
    
    Categories:
    {cat_list}
    
    Return a JSON object with:
    - primary_category: The main category code
    - secondary_category: A secondary category if applicable (or null)
    - confidence: Your confidence level (0-1)
    - key_phrases: List of 2-3 key phrases that led to this categorization
    
    Return ONLY the JSON object.
    """
    
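    # Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release
    response_obj = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert at qualitative data analysis."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    
    result = response_obj.choices[0].message.content.strip()
    # Strip a Markdown code fence (with or without a "json" tag) if the model added one
    if result.startswith("```"):
        result = result.strip("`")
        if result.startswith("json"):
            result = result[4:]
    
    return json.loads(result)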

# Test with one response
test_response = comfort_responses.iloc[0]['comfort_food_reasons']
test_result = categorize_response(test_response, coding_scheme)
print(f"Response: {test_response}")
print(f"\nCategorization: {json.dumps(test_result, indent=2)}")

In [None]:
# Step 4: Batch Processing All Responses
from tqdm import tqdm

# Categorize all responses
categorizations = []

# Process only a subset if dataset is large
max_responses = min(20, len(comfort_responses))  # Limit to 20 for demo
print(f"Processing {max_responses} responses...")

for idx in tqdm(range(max_responses), desc="Categorizing"):
    row = comfort_responses.iloc[idx]
    try:
        result = categorize_response(row['comfort_food_reasons'], coding_scheme)
        result['respondent_id'] = row['respondent_id']
        result['original_response'] = row['comfort_food_reasons']
        
        # Add existing code if available
        if 'existing_code' in row:
            result['existing_code'] = row['existing_code']
            
        categorizations.append(result)
    except Exception as e:
        print(f"Error processing response {row['respondent_id']}: {e}")
        categorizations.append({
            'respondent_id': row['respondent_id'],
            'original_response': row['comfort_food_reasons'],
            'primary_category': 'error',
            'secondary_category': None,
            'confidence': 0,
            'key_phrases': []
        })

# Convert to DataFrame
results_df = pd.DataFrame(categorizations)
print(f"\nProcessed {len(results_df)} responses")
results_df.head()
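
The loop above records an `error` category as soon as a call fails. For transient failures such as rate limits or timeouts, a retry with exponential backoff may recover the response instead. The helper below is a sketch under that assumption; `call_with_retry`, `max_attempts`, and `base_delay` are hypothetical names, not part of the notebook or of the OpenAI library.

```python
import time

def call_with_retry(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on any exception.

    Hypothetical helper: waits base_delay, then 2*base_delay, and so on between
    attempts, and re-raises the last exception once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Possible use inside the batch loop (names taken from the notebook):
# result = call_with_retry(categorize_response, row['comfort_food_reasons'], coding_scheme)
```

Because the helper re-raises after the final attempt, the existing `except` branch in the loop still catches persistent failures and records them as errors.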

## Step 3: AI-Powered Categorization

Now let's use AI to automatically categorize these responses:

In [None]:
def categorize_response(response, categories):
    """
    Use AI to categorize an open-ended response.
    
    Args:
        response: The open-ended text response
        categories: Dictionary of category codes and descriptions
    
    Returns:
        Dictionary with primary and secondary categories
    """
    
    # Create category list for prompt
    cat_list = "\n".join([f"- {code}: {desc}" for code, desc in categories.items()])
    
    prompt = f"""
    Categorize this survey response about why someone eats comfort food.
    
    Response: "{response}"
    
    Categories:
    {cat_list}
    
    Return a JSON object with:
    - primary_category: The main category code
    - secondary_category: A secondary category if applicable (or null)
    - confidence: Your confidence level (0-1)
    - key_phrases: List of 2-3 key phrases that led to this categorization
    
    Return ONLY the JSON object.
    """
    
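    # Requires openai>=1.0; the older openai.ChatCompletion interface was removed in that release
    # (named response_obj so it does not shadow the `response` argument)
    response_obj = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert at qualitative data analysis."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    
    result = response_obj.choices[0].message.content.strip()
    # Strip a Markdown code fence (with or without a "json" tag) if the model added one
    if result.startswith("```"):
        result = result.strip("`")
        if result.startswith("json"):
            result = result[4:]
    
    return json.loads(result)

# Test with one response
test_response = comfort_responses.iloc[0]['comfort_food_reasons']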
test_result = categorize_response(test_response, coding_scheme)
print(f"Response: {test_response}")
print(f"\nCategorization: {json.dumps(test_result, indent=2)}")

## Step 4: Batch Processing All Responses

In [None]:
# Step 7: Cross-tabulation Analysis
# Only run if we have demographic data
if 'gender' in comfort_responses.columns:
    # Merge with demographic data
    full_results = pd.merge(
        results_df, 
        comfort_responses[['respondent_id', 'gender']], 
        on='respondent_id',
        how='left'
    )
    
    # Cross-tabulation by gender
    gender_crosstab = pd.crosstab(full_results['primary_category'], full_results['gender'])
    print("Category distribution by gender:")
    print(gender_crosstab)
    
    # Visualization
    gender_crosstab.plot(kind='bar', stacked=False)
    plt.title('Comfort Food Reasons by Gender')
    plt.xlabel('Category')
    plt.ylabel('Count')
    plt.legend(title='Gender')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("No gender data available for cross-tabulation")
    full_results = results_df.copy()

## Step 5: Analyze Results

In [None]:
# Primary category distribution
primary_counts = results_df['primary_category'].value_counts()

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Bar chart of primary categories
primary_counts.plot(kind='bar', ax=ax1, color='steelblue')
ax1.set_title('Primary Categories for Comfort Food Reasons')
ax1.set_xlabel('Category')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)

# Pie chart
primary_counts.plot(kind='pie', ax=ax2, autopct='%1.1f%%')
ax2.set_title('Distribution of Comfort Food Reasons')
ax2.set_ylabel('')

plt.tight_layout()
plt.show()

# Summary statistics
print("\nCategory Summary:")
for cat in primary_counts.index:
    count = primary_counts[cat]
    pct = (count / len(results_df)) * 100
    print(f"{cat}: {count} responses ({pct:.1f}%)")

# Step 10: Validation - Compare with Existing Coding (if available)
if 'existing_code' in results_df.columns:
    print("✅ Comparing AI categorization with existing human coding...")
    print("="*60)
    
    # Filter to only rows with both AI and existing codes
    validation_df = results_df[results_df['existing_code'].notna()].copy()
    
    if len(validation_df) > 0:
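        # Map existing codes to our categories if needed
        # This is a simplified mapping with assumed text labels as keys - adjust based on actual codes
        # (if the dataset stores numeric codes, translate them via its codebook first)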
        code_mapping = {
            'stress': 'emotional_regulation',
            'boredom': 'habit',
            'sadness': 'emotional_regulation',
            'happiness': 'reward',
            'nostalgia': 'nostalgia',
            'convenience': 'convenience',
            'taste': 'sensory'
        }
        
        # Apply mapping
        validation_df['mapped_existing'] = validation_df['existing_code'].map(
            lambda x: code_mapping.get(str(x).lower(), str(x).lower())
        )
        
        # Calculate accuracy
        matches = (validation_df['primary_category'] == validation_df['mapped_existing']).sum()
        total = len(validation_df)
        accuracy = matches / total * 100
        
        print(f"\nAccuracy Score: {accuracy:.1f}% ({matches}/{total} matches)")
        
        # Show mismatches for analysis
        mismatches = validation_df[validation_df['primary_category'] != validation_df['mapped_existing']]
        if len(mismatches) > 0:
            print(f"\nMismatched categorizations ({len(mismatches)}):")
            for _, row in mismatches.head(5).iterrows():
                print(f"\nResponse: {row['original_response'][:80]}...")
                print(f"  Human code: {row['existing_code']}")
                print(f"  AI category: {row['primary_category']}")
                print(f"  AI confidence: {row['confidence']:.2f}")
        
        # Confusion matrix
        from sklearn.metrics import confusion_matrix, classification_report
        import numpy as np
        
        # Get unique categories
        categories = sorted(set(validation_df['primary_category'].unique()) | 
                          set(validation_df['mapped_existing'].unique()))
        
        # Create confusion matrix
        cm = confusion_matrix(validation_df['mapped_existing'], 
                            validation_df['primary_category'], 
                            labels=categories)
        
        # Plot confusion matrix
        fig, ax = plt.subplots(figsize=(10, 8))
        im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
        ax.figure.colorbar(im, ax=ax)
        
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               xticklabels=categories,
               yticklabels=categories,
               title='Confusion Matrix: Human vs AI Categorization',
               ylabel='Human Coding',
               xlabel='AI Categorization')
        
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
        
        # Add text annotations
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                ax.text(j, i, format(cm[i, j], 'd'),
                       ha="center", va="center",
                       color="white" if cm[i, j] > cm.max() / 2 else "black")
        
        plt.tight_layout()
        plt.show()
        
        # Classification report
        print("\nClassification Report:")
        print(classification_report(validation_df['mapped_existing'], 
                                   validation_df['primary_category']))
    else:
        print("No existing codes found for validation")
else:
    print("No existing coding available for validation")
    print("Consider having a human expert code a subset for comparison")

In [None]:
# Analyze AI confidence levels
confidence_scores = results_df['confidence'].values

plt.figure(figsize=(10, 5))
plt.hist(confidence_scores, bins=20, edgecolor='black', alpha=0.7)
plt.axvline(confidence_scores.mean(), color='red', linestyle='--', label=f'Mean: {confidence_scores.mean():.2f}')
plt.xlabel('Confidence Score')
plt.ylabel('Frequency')
plt.title('AI Confidence in Categorization')
plt.legend()
plt.show()

# Low confidence responses (for manual review)
low_confidence = results_df[results_df['confidence'] < 0.7]
print(f"\nResponses with low confidence (<0.7): {len(low_confidence)}")
if len(low_confidence) > 0:
    print("\nThese might need manual review:")
    for _, row in low_confidence.iterrows():
        print(f"- ID {row['respondent_id']}: {row['original_response'][:50]}...")
        print(f"  Category: {row['primary_category']} (confidence: {row['confidence']:.2f})")

## Step 7: Cross-tabulation Analysis

In [None]:
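# Merge with the demographic columns prepared earlier (assumes 'gender' was found in the dataset)
demo_cols = [col for col in ['gender', 'year'] if col in comfort_responses.columns]
full_results = pd.merge(results_df, comfort_responses[['respondent_id'] + demo_cols], on='respondent_id', how='left')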

# Cross-tabulation by gender
gender_crosstab = pd.crosstab(full_results['primary_category'], full_results['gender'])
print("Category distribution by gender:")
print(gender_crosstab)

# Visualization
gender_crosstab.plot(kind='bar', stacked=False)
plt.title('Comfort Food Reasons by Gender')
plt.xlabel('Category')
plt.ylabel('Count')
plt.legend(title='Gender')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Step 8: Extract Key Themes

In [None]:
# Collect all key phrases
all_phrases = []
for phrases in results_df['key_phrases']:
    if phrases:
        all_phrases.extend(phrases)

# Count phrase frequency
phrase_counts = Counter(all_phrases)
top_phrases = phrase_counts.most_common(10)

print("Top 10 Key Phrases:")
for phrase, count in top_phrases:
    print(f"  '{phrase}': {count} occurrences")

# Create word cloud visualization (if wordcloud is installed)
try:
    from wordcloud import WordCloud
    
    # Create text from all phrases
    text = ' '.join(all_phrases)
    
    # Generate word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
    
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Key Themes in Comfort Food Responses')
    plt.show()
except ImportError:
    print("\nInstall wordcloud for visualization: pip install wordcloud")

## Step 9: Export Results

In [None]:
# Save categorized data
output_file = 'categorized_comfort_food_responses.csv'
full_results.to_csv(output_file, index=False)
print(f"Results saved to {output_file}")

# Create summary report (cast NumPy values to native types so json.dump accepts them)
summary = {
    'total_responses': len(results_df),
    'categories_used': list(primary_counts.index),
    'most_common_category': primary_counts.index[0],
    'average_confidence': float(confidence_scores.mean()),
    'low_confidence_count': len(low_confidence),
    'category_distribution': {cat: int(n) for cat, n in primary_counts.items()}
}

print("\n" + "="*50)
print("SUMMARY REPORT")
print("="*50)
for key, value in summary.items():
    if key != 'category_distribution':
        print(f"{key}: {value}")

# Save summary as JSON
with open('categorization_summary.json', 'w') as f:
    json.dump(summary, f, indent=2)
print("\nSummary saved to categorization_summary.json")

## Step 10: Validation - Compare with Manual Coding

In a real research scenario, you would validate AI categorization against human coding:

In [None]:
# Simulate manual coding for validation (in practice, this would be done by researchers)
# These codes match the first five sample responses; re-code by hand if you loaded the real dataset

manual_coding = {
    1: 'emotional_regulation',  # "stress and anxiety from exams"
    2: 'nostalgia',             # "reminds me of home"
    3: 'emotional_regulation',  # "when I'm sad or lonely"
    4: 'habit',                 # "boredom"
    5: 'sensory'                # "tastes good"
}

# Compare AI vs manual for these samples
print("Validation Comparison (AI vs Manual):")
print("="*60)

for resp_id, manual_cat in manual_coding.items():
    ai_result = results_df[results_df['respondent_id'] == resp_id].iloc[0]
    match = "✅" if ai_result['primary_category'] == manual_cat else "❌"
    
    print(f"\nResponse {resp_id}: {match}")
    print(f"  Text: {ai_result['original_response'][:60]}...")
    print(f"  Manual: {manual_cat}")
    print(f"  AI: {ai_result['primary_category']} (confidence: {ai_result['confidence']:.2f})")

# Calculate agreement rate
matches = sum(1 for rid, manual in manual_coding.items() 
             if results_df[results_df['respondent_id'] == rid].iloc[0]['primary_category'] == manual)
agreement_rate = matches / len(manual_coding) * 100
print(f"\nAgreement Rate: {agreement_rate:.1f}%")
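
A raw agreement rate does not account for matches that would occur by chance, which matters when a few categories dominate. Cohen's kappa is a common chance-corrected alternative. The sketch below computes it in plain Python; the two label lists are made-up values for illustration, not results from this dataset.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of category labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both coders chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance of a match given each coder's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for illustration only
manual_labels = ['emotional_regulation', 'nostalgia', 'emotional_regulation', 'habit', 'sensory']
ai_labels = ['emotional_regulation', 'nostalgia', 'emotional_regulation', 'habit', 'reward']

raw = sum(m == a for m, a in zip(manual_labels, ai_labels)) / len(manual_labels)
print(f"Raw agreement: {raw:.2f}")
print(f"Cohen's kappa: {cohen_kappa(manual_labels, ai_labels):.2f}")
```

With only five coded responses the estimate is very unstable, so a larger manually coded subset would be needed before reporting it.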

## Conclusions and Best Practices

### Key Findings:
1. **Efficiency**: AI can categorize hundreds of responses in minutes, versus hours or days of manual coding
2. **Consistency**: With a fixed prompt and `temperature=0`, AI applies the same criteria to every response
3. **Transparency**: Key phrases show why each category was assigned
4. **Validation Needed**: AI codes are only trustworthy once checked against manual coding

### Best Practices:
1. **Clear Categories**: Define mutually exclusive categories that together cover the expected range of responses
2. **Iterative Refinement**: Review low-confidence responses and adjust prompts
3. **Human Oversight**: AI assists but doesn't replace researcher judgment
4. **Documentation**: Keep detailed records of prompts and decision rules
5. **Validation**: Compare AI codes with manual codes on a subset and report the agreement rate
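
The documentation practice above could be supported by saving a small record with every categorization run. The following is one possible sketch, not a prescribed format: `build_run_record` and its field names are hypothetical, and the prompt template and categories shown are abbreviated stand-ins for the ones used in the notebook.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_run_record(prompt_template, model, temperature, categories):
    """Collect the settings needed to reproduce or audit a categorization run."""
    return {
        'timestamp_utc': datetime.now(timezone.utc).isoformat(),
        'model': model,
        'temperature': temperature,
        'categories': categories,
        'prompt_template': prompt_template,
        # A hash makes it easy to spot when the prompt wording changed between runs
        'prompt_sha256': hashlib.sha256(prompt_template.encode('utf-8')).hexdigest(),
    }

# Abbreviated, hypothetical inputs for illustration
record = build_run_record(
    prompt_template='Categorize this survey response: "{response}" Categories: {cat_list}',
    model='gpt-3.5-turbo',
    temperature=0,
    categories={'sensory': 'Eating because the food tastes good'},
)
print(json.dumps(record, indent=2))
```

Writing this record to a JSON file next to the exported results keeps the prompt, model, and coding scheme tied to the output they produced.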

### When to Use AI Categorization:
- Large datasets (100+ responses)
- Initial exploration of themes
- Applying an established coding scheme consistently
- Time-sensitive analysis

### Limitations:
- May miss subtle context or sarcasm
- Requires clear, well-defined categories
- Costs scale with dataset size
- Requires validation and quality checks