Overview of the Steps:

	1.	Load and Preprocess the Data:
	•	Read the dataset from your .xlsx files.
	•	Count the number of comments per user.
	•	Group users based on the number of comments they have.
	•	Randomly sample users from each group.
	2.	Analyze Comments with the LLM:
	•	For each sampled user, collect their comments.
	•	Use OpenAI’s GPT-4o-mini model to analyze the text and assign a happiness score to each comment.
	•	Ensure consistent scoring methodology across all users.
	3.	Aggregate and Analyze Results:
	•	Compute average happiness scores per user.
	•	Compare these scores with the users’ subjective well-being (SWB) scores from the survey.
	•	Perform statistical analyses to explore correlations and validate your methodology.

In [3]:
pip install -r requirements.txt

Collecting openai (from -r requirements.txt (line 1))
  Using cached openai-1.54.1-py3-none-any.whl.metadata (24 kB)
Collecting python-dotenv (from -r requirements.txt (line 2))
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting requests (from -r requirements.txt (line 3))
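  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
ERROR: Could not find a version that satisfies the requirement os (from versions: none)
ERROR: No matching distribution found for os
Note: you may need to restart the kernel to use updated packages.

The install fails because requirements.txt lists os, which is part of the Python standard library and cannot be installed with pip. Remove that line and rerun the cell. The code below also imports pandas, numpy, and scipy, and pandas needs openpyxl to read .xlsx files, so make sure those packages are listed in requirements.txt or already installed.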


Explanation:

	•	Data Loading: We use pandas to read the Excel file and ensure all required columns are present.
	•	Counting Comments: We group the data by users and count the number of comments per user.
	•	Grouping Users: Users are categorized into groups based on their comment counts.
	•	Sampling Users: We randomly sample up to 10 users from each group (fewer if a group has fewer than 10 users), using a fixed random seed so the sample is reproducible.

In [None]:
import pandas as pd
import numpy as np
import openai
from dotenv import load_dotenv
import os
import time
import random

# Load your OpenAI API key from the .env file
load_dotenv()
openai.api_key = os.getenv('OPENAI_API_KEY')

# Step 1: Load and Preprocess the Data

# Replace 'test.xlsx' with the path to your dataset file
data = pd.read_excel('test.xlsx')

# Ensure that all necessary columns are present
required_columns = ['a', 'Type', 'ContentTimestamp', 'Content', 'Upvotes',
                    'ContentSubreddit', 'Score', 'QuestionID', 'SurveyResponse', 'SurveyTimestamp']

if not all(column in data.columns for column in required_columns):
    raise ValueError("One or more required columns are missing in the dataset.")

# Count the number of comments per user
user_comment_counts = data.groupby('a')['Content'].count().reset_index()
user_comment_counts.columns = ['user', 'comment_count']

# Define user groups based on comment counts
def assign_group(count):
    if count < 25:
        return 'Under 25'
    elif count < 100:
        return '25-99'
    elif count < 1000:
        return '100-999'
    else:
        return '1000+'

user_comment_counts['group'] = user_comment_counts['comment_count'].apply(assign_group)

# Sample users from each group
sampled_users = []
group_sizes = {'Under 25': 10, '25-99': 10, '100-999': 10, '1000+': 10}

for group, size in group_sizes.items():
    users_in_group = user_comment_counts[user_comment_counts['group'] == group]
    sample_size = min(size, len(users_in_group))
    sampled = users_in_group.sample(n=sample_size, random_state=42)
    sampled_users.extend(sampled['user'].tolist())
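
The grouping and sampling above can also be written with pandas' built-in binning. The sketch below is one possible alternative, not part of the original pipeline. It runs on a small invented table (the counts frame and the per_group value are assumptions for illustration) and uses pd.cut with right-open bins so the boundaries match assign_group.

```python
import pandas as pd

# Toy comment counts for six hypothetical users (not from the real dataset)
counts = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "comment_count": [3, 24, 25, 99, 100, 1500],
})

# Right-open bins [0, 25), [25, 100), [100, 1000), [1000, inf) mirror assign_group
bins = [0, 25, 100, 1000, float("inf")]
labels = ["Under 25", "25-99", "100-999", "1000+"]
counts["group"] = pd.cut(counts["comment_count"], bins=bins, labels=labels, right=False)

# Draw up to `per_group` users from every non-empty group with a fixed seed
per_group = 1
parts = []
for _, members in counts.groupby("group", observed=True):
    parts.append(members.sample(n=min(per_group, len(members)), random_state=42))
sampled = pd.concat(parts)

print(counts)
print(sampled["user"].tolist())
```

With the real data, counts would be user_comment_counts and per_group would be 10.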

Explanation:

	•	Filtering Data: We select only the data corresponding to the sampled users.
	•	LLM Function: The get_happiness_score function sends each comment to the LLM and retrieves a happiness score.
	•	Error Handling: If the API call fails or the reply cannot be converted to a number, we assign NaN so the comment is treated as missing data in later steps.
	•	API Rate Limiting: We add a delay after each API call to respect rate limits.
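
The error-handling step relies on the model replying with a bare number. As a hedged illustration, the sketch below shows one way to make the parsing more tolerant: it pulls the first number out of a reply and rejects values outside the 1 to 10 scale. The parse_score helper and its low/high parameters are hypothetical and do not appear in the code cell that follows.

```python
import math
import re

def parse_score(reply, low=1.0, high=10.0):
    """Return the first number found in `reply` if it lies in [low, high], else NaN."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply or "")
    if match is None:
        return math.nan  # no number in the reply at all
    value = float(match.group())
    # Treat out-of-range numbers as invalid rather than clipping them
    return value if low <= value <= high else math.nan

print(parse_score("7"))
print(parse_score("Score: 8.5"))
print(parse_score("I cannot rate this."))
print(parse_score("42"))
```

Rejecting out-of-range values, instead of clipping them to the scale, keeps an obviously malformed reply from being counted as an extreme score.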

In [None]:
# Filter the data for the sampled users
sampled_data = data[data['a'].isin(sampled_users)].copy()

# Function to get happiness score from the LLM
def get_happiness_score(text):
    # Prepare the prompt
    messages = [
        {"role": "system", "content": "You are an assistant that rates the happiness expressed in a given text on a scale from 1 to 10, where 1 is very unhappy and 10 is very happy. Only provide the numerical score."},
        {"role": "user", "content": text}
    ]
    try:
        # openai>=1.0 removed openai.ChatCompletion; use chat.completions instead
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            max_tokens=5,
            n=1,
            stop=None,
            temperature=0
        )
        # Extract the score from the response (responses are objects, not dicts)
        score_str = response.choices[0].message.content.strip()
        # Convert to float
        score = float(score_str)
        return score
    except Exception as e:
        print(f"Error processing text: {e}")
        return np.nan  # Use NaN for missing values

# Apply the function to the comments
# Since API calls can be rate-limited, we should process with care
happiness_scores = []
for idx, row in sampled_data.iterrows():
    text = row['Content']
    score = get_happiness_score(text)
    happiness_scores.append(score)
    # To avoid hitting the rate limit, add a delay
    time.sleep(1)  # Adjust based on your API rate limits

sampled_data['happiness_score'] = happiness_scores
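
A fixed one-second pause slows every call but does not recover from a rate-limit error when one does occur. A common complement is to retry with exponential backoff. The sketch below is a generic illustration, not the notebook's method: call_with_backoff, its parameters, and the flaky_call stand-in are all hypothetical, and no real API request is made. In practice it would wrap the API request itself rather than get_happiness_score, which catches exceptions internally.

```python
import random
import time

def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); after an exception, wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))

# Stand-in for an API call that fails twice before succeeding
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return 7.0

# Record the waits instead of sleeping so the demo finishes instantly
waits = []
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

Passing sleep as a parameter keeps the example fast to run; with the default time.sleep the function would actually pause between attempts.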

Explanation:

	•	Aggregating Scores: We compute the average happiness score per user from the LLM outputs.
	•	SWB Scores: We extract and average the SWB scores for each user from the survey responses.
	•	Merging Data: We combine the LLM happiness scores and SWB scores into a single DataFrame for analysis.
	•	Statistical Analysis: We calculate the Pearson correlation coefficient to assess the relationship between the LLM scores and SWB scores.
	•	Saving Results: Optionally, we save the merged data to a CSV file for further analysis or visualization.

In [None]:
# Compute average happiness scores per user
user_happiness = sampled_data.groupby('a')['happiness_score'].mean().reset_index()
user_happiness.columns = ['user', 'average_happiness_score']

# Get SWB scores for the sampled users
swb_data = data[data['a'].isin(sampled_users) & data['QuestionID'].isin(['Q1', 'Q2'])]
user_swb = swb_data.groupby(['a', 'QuestionID'])['Score'].mean().reset_index()
user_swb = user_swb.pivot(index='a', columns='QuestionID', values='Score').reset_index()
# Rename by label so the mapping does not depend on column order
user_swb = user_swb.rename(columns={'a': 'user', 'Q1': 'SWB_Q1', 'Q2': 'SWB_Q2'})

# Merge happiness scores with SWB scores
merged_data = pd.merge(user_happiness, user_swb, on='user')

# Perform statistical analysis
# For example, compute Pearson correlation between average happiness score and SWB scores
from scipy.stats import pearsonr

# Correlation with SWB_Q1 (drop users missing either value; pearsonr does not skip NaN)
valid_q1 = merged_data.dropna(subset=['average_happiness_score', 'SWB_Q1'])
corr_q1, p_value_q1 = pearsonr(valid_q1['average_happiness_score'], valid_q1['SWB_Q1'])
print(f"Correlation between LLM happiness scores and SWB_Q1: {corr_q1}, p-value: {p_value_q1}")

# Correlation with SWB_Q2
valid_q2 = merged_data.dropna(subset=['average_happiness_score', 'SWB_Q2'])
corr_q2, p_value_q2 = pearsonr(valid_q2['average_happiness_score'], valid_q2['SWB_Q2'])
print(f"Correlation between LLM happiness scores and SWB_Q2: {corr_q2}, p-value: {p_value_q2}")

# Optional: Save the merged data for further analysis
merged_data.to_csv('merged_data.csv', index=False)
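
Pearson correlation measures linear association. If the SWB responses are on an ordinal scale (an assumption; the survey format is not shown here), a rank-based measure such as Spearman's rho may be a useful complement. The sketch below compares the two on an invented table that reuses the column names of merged_data; the values are made up purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Toy per-user table (invented values, same column names as merged_data)
toy = pd.DataFrame({
    "average_happiness_score": [4.0, 5.5, np.nan, 7.0, 8.0, 6.0],
    "SWB_Q1": [3.0, 5.0, 6.0, 6.0, 9.0, np.nan],
})

# Keep only users that have both values
complete = toy.dropna(subset=["average_happiness_score", "SWB_Q1"])

r, p_r = pearsonr(complete["average_happiness_score"], complete["SWB_Q1"])
rho, p_rho = spearmanr(complete["average_happiness_score"], complete["SWB_Q1"])
print(f"n={len(complete)}, Pearson r={r:.3f}, Spearman rho={rho:.3f}")
```

With only about 40 sampled users, both coefficients will have wide uncertainty, so it helps to report the sample size next to each value.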

Adjustments and Considerations:

	•	API Rate Limits and Costs:
	•	Be mindful of the OpenAI API’s rate limits and potential costs, especially with a large dataset.
	•	Consider batching requests (for example, with OpenAI’s Batch API) or issuing them concurrently with the library’s AsyncOpenAI client if necessary.
	•	Error Handling:
	•	The get_happiness_score function includes basic error handling.
	•	If the LLM fails to provide a valid score, we assign NaN to handle missing data appropriately in analysis.
	•	Scalability:
	•	Since processing every comment might be impractical, sampling users and their comments helps manage the workload.
	•	You might further limit the number of comments per user if needed.
	•	Consistency in Scoring:
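	•	Using a fixed prompt and setting temperature=0 makes the LLM’s outputs far more repeatable, which improves consistency. It does not guarantee strictly deterministic results, so small run-to-run differences can still occur.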
	•	Statistical Analysis:
	•	Beyond Pearson correlation, you might explore other statistical tests or models (e.g., regression analysis) to deepen your insights.
	•	Visualizations (e.g., scatter plots) can also be helpful.
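
The scalability point about limiting comments per user can be sketched as follows. This is a minimal illustration on invented data, assuming the same user column name ('a') as the dataset; the max_per_user cap is a hypothetical value, not one taken from the analysis.

```python
import pandas as pd

# Toy comments: user "a1" has five comments, "a2" has two (invented data)
comments = pd.DataFrame({
    "a": ["a1"] * 5 + ["a2"] * 2,
    "Content": [f"comment {i}" for i in range(7)],
})

max_per_user = 3  # hypothetical cap

# Shuffle once with a fixed seed, then keep the first `max_per_user` rows per user
capped = (
    comments.sample(frac=1, random_state=42)
            .groupby("a")
            .head(max_per_user)
            .sort_index()
)
print(capped.groupby("a").size())
```

Shuffling before taking the first rows gives each user a random subset instead of their earliest comments, and users under the cap keep everything.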

Next Steps:

	•	Validation:
	•	Review a subset of the LLM’s outputs to ensure that the scores make sense and the model is interpreting the text as expected.
	•	Documentation:
	•	Keep detailed records of your methodology, including any parameters or thresholds used, to maintain transparency and reproducibility.
	•	Ethical Considerations:
	•	Ensure compliance with Reddit’s API terms and conditions and respect user privacy.
	•	Anonymize data if necessary when sharing results.
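
One simple way to avoid exposing usernames in shared results is to replace them with salted hashes. The sketch below is an assumption-laden illustration: the pseudonymize helper, the salt value, and the sample table are all hypothetical. Note that this is pseudonymization rather than full anonymization, since comment text itself can still identify a user.

```python
import hashlib

import pandas as pd

def pseudonymize(user_id, salt):
    """Return a short, stable pseudonym derived from a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + str(user_id)).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]

# Invented per-user results
results = pd.DataFrame({
    "user": ["alice", "bob", "alice"],
    "average_happiness_score": [6.5, 4.0, 6.5],
})

salt = "replace-with-a-secret-value"  # placeholder; keep the real salt out of shared files
results["user"] = results["user"].map(lambda u: pseudonymize(u, salt))
print(results)
```

The same username always maps to the same pseudonym, so per-user aggregation still works on the shared file.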

Notes on Adjusting Your Approach:

Given the scale of your data and the potential costs and time associated with processing a large number of comments, grouping users and sampling is a practical approach. This method allows you to:

	•	Manage Resources: Limit the number of API calls to a feasible amount.
	•	Maintain Representation: Sampling from each user group keeps the full range of user activity levels represented in the analysis.
	•	Focus on Quality: With a manageable dataset, you can spend more time ensuring the accuracy and reliability of your results.
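
To make the resource point concrete, a rough estimate of the workload can be computed before any API call is made. The sketch below uses invented comment counts and a hypothetical per-user cap; only the one-second delay comes from the scoring loop above. It counts requests and the minimum time spent in time.sleep, ignoring the latency of the requests themselves.

```python
import pandas as pd

# Invented comment counts for four sampled users
sampled_counts = pd.Series({"u1": 12, "u2": 80, "u3": 450, "u4": 2300})

delay_seconds = 1.0   # matches time.sleep(1) in the scoring loop
max_per_user = 200    # hypothetical cap on comments scored per user

total_calls = int(sampled_counts.sum())
capped_calls = int(sampled_counts.clip(upper=max_per_user).sum())

print(f"Uncapped: {total_calls} calls, at least {total_calls * delay_seconds / 60:.1f} min of delay")
print(f"Capped:   {capped_calls} calls, at least {capped_calls * delay_seconds / 60:.1f} min of delay")
```

Because a few very active users can dominate the total, a cap like this often reduces the number of calls far more than reducing the number of sampled users would.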
