EXAM NUMBER: B263310

# 1. Introduction

Online platforms host thousands of users engaging in discussions on countless topics. Reddit is one of the largest forum-based platforms, and its users often pride themselves on conducting “fact-based” discussions. Yet, in practice, evidence and data can fail to persuade when they clash with entrenched views. In this project, I focused on how evidence is used, or potentially misused, in Reddit discussions about a divisive figure: **Elon Musk**.

**Key Idea**: Evidence can matter, but its impact often depends on whether participants share norms about its validity and are open to changing their minds. Otherwise, the same source, link, or data point can be marshalled to reinforce opposing opinions.

In this notebook, I documented the process to:
- Scrape Reddit data related to Elon Musk from selected subreddits (`r/technology`, `r/EnoughMuskSpam`).
- Identify instances where commenters invoked *indicators* of evidence (primarily external links and specific keywords).
- Analyze the overall sentiment surrounding these discussions using VADER and a classification method based on OpenAI's GPT.
- Prepare the interaction data (replies) for network visualization in Gephi.

This notebook outlines the problem, my approach, the methods executed, and preliminary observations from the data, setting the stage for deeper analysis and visualization.

# 2. The Problem of Contradictory Interpretations

One of the central puzzles motivating this project is that the exact same piece of information often supports contradictory viewpoints online. For example, a news article about a controversial policy might be used by two different commenters to “prove” opposite claims. In the context of Elon Musk, a tweet or statistic about Tesla, SpaceX, or X (formerly Twitter) could be interpreted as either:

- Proof of mismanagement, inconsistency, or hypocrisy.
- Evidence of visionary leadership confronting biased media or overcoming obstacles.

This phenomenon likely occurs because:
1.  **Preexisting Beliefs**: Users tend to process new information through the lens of their existing worldview and biases towards the subject.
2.  **Community Norms**: Different online communities have varying standards for acceptable evidence. Some demand rigorous sources; others may rely more on anecdotes or assertions.
3.  **Emotional Loading**: Topics surrounding influential and often polarizing individuals like Elon Musk frequently trigger strong loyalty or hostility, potentially overshadowing the neutral presentation of facts.

My work explores how prevalent these dynamics are in the collected Reddit data, particularly looking for differences between subreddits with potentially different baseline attitudes towards Musk.

# 3. Position and Main Argument

I adopted the following position to guide this analysis:

> **"Evidence indicators alone do not guarantee persuasion online. If there are no shared community norms or incentives to engage with the substance of the evidence, users can reinterpret or dismiss it based on pre-existing biases. However, comparing subreddits with different prevailing sentiments might reveal variations in how evidence indicators correlate with engagement and tone."**

This position reflects the common experience of browsing Reddit: some communities foster a culture where backing up claims is encouraged, while others feature more rhetorical or emotionally driven exchanges where links might serve primarily as tribal signals. By analyzing the collected data, I aimed to see whether:

- Users in the selected subreddits (`r/technology`, `r/EnoughMuskSpam`) differ in how often they include links or evidence-related keywords.
- Posts/comments containing these evidence indicators correlate with higher engagement (scores) or specific sentiment patterns.
- Instances of the same URL being cited potentially align with divergent sentiment scores, hinting at contradictory usage (requiring further qualitative checks).

Ultimately, I argue that the **role of evidence** in these online discussions is not just about its presence, but about **how** it is framed and received within specific community contexts, and that evidence itself is rarely what persuades.

# 4. Research Questions

I framed my inquiry using these concrete questions:

1.  **RQ1**: To what extent did Reddit users in `r/technology` and `r/EnoughMuskSpam` include external evidence *indicators* (links, specific keywords) when discussing Elon Musk during the sampled period?
2.  **RQ2**: Did posts/comments containing these evidence indicators correlate with higher levels of engagement (e.g., upvotes/scores) in these subreddits?
3.  **RQ3**: Can instances be identified where the *same* external link (URL) was cited in comments with significantly different sentiment scores, suggesting contradictory interpretations?
4.  **RQ4**: How did the patterns observed for RQ1, RQ2, and RQ3 differ between `r/technology` and `r/EnoughMuskSpam`?
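
One way RQ3 could be operationalised is to group sentiment scores by cited URL and look at the gap between the lowest and highest score. The sketch below is a dependency-free illustration only: the function name `url_sentiment_spread`, the row layout (`urls`, `sentiment`), and the toy rows are assumptions made for this example, not part of the collected dataset or the notebook's pipeline.

```python
from collections import defaultdict


def url_sentiment_spread(rows, min_citations=2):
    """Return {url: (n_citations, min_score, max_score, spread)} for repeated URLs."""
    scores_by_url = defaultdict(list)
    for row in rows:
        # A row may cite several URLs; count each distinct URL once per row.
        for url in set(row["urls"]):
            scores_by_url[url].append(row["sentiment"])

    spread = {}
    for url, scores in scores_by_url.items():
        if len(scores) >= min_citations:
            lowest, highest = min(scores), max(scores)
            spread[url] = (len(scores), lowest, highest, round(highest - lowest, 4))
    return spread


# Invented toy rows (hypothetical URLs and scores, not scraped data).
toy_rows = [
    {"urls": ["https://example.com/a"], "sentiment": -0.6},
    {"urls": ["https://example.com/a"], "sentiment": 0.7},
    {"urls": ["https://example.com/b"], "sentiment": 0.1},
]
spread_result = url_sentiment_spread(toy_rows)
print(spread_result)
```

A large spread only flags a candidate for contradictory usage. Whether two commenters really read the same source in opposite ways still needs a qualitative check of the comments themselves.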

# 5. Methods and Computational Techniques


## 5.1 Data Collection (Reddit Scraping)

1.  **Subreddits Targeted**
    *   I selected two subreddits known to frequently discuss Elon Musk but often with different prevailing viewpoints:
        *   `r/technology`: A large, general tech news subreddit, likely containing a mix of opinions.
        *   `r/EnoughMuskSpam`: A subreddit explicitly critical of Elon Musk.
    *   This contrast allows for exploring RQ4 (differences between communities).

2.  **Scraping Execution**
    *   I used Python’s **PRAW** library to connect to the Reddit API.
    *   I scraped posts matching the query "Elon Musk" within the `time_filter='month'` timeframe, limiting the collection to `post_limit_per_subreddit = 50` posts from each subreddit to keep the dataset manageable for this initial analysis.
    *   For each selected post, I retrieved up to `comment_limit_per_post = 100` comments, including metadata like author, score, creation time, parent ID, and text.

3.  **Data Storage and Preprocessing**
    *   The raw scraped data (posts and comments) was compiled into a single list, converted to a pandas DataFrame, and saved to a CSV file (`reddit_data_Elon_Musk_YYYYMMDD_HHMMSS.csv`).
    *   Basic preprocessing involved converting timestamps to datetime objects and handling potential missing values (e.g., `[deleted]` authors).
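
Because each comment record carries its `parent_id`, the parent author can in principle be resolved from records that were already fetched, without a further API request per reply. The following is a minimal sketch under that assumption; `resolve_parent_authors` and the toy records are hypothetical and are not the lookup used in the collection code. Parents that fall outside the fetched sample resolve to the `[unknown_parent]` placeholder here.

```python
def resolve_parent_authors(post_id, post_author, comments):
    """Map each comment id to its parent's author using only local records."""
    # Reddit "fullnames" prefix IDs with a type tag: t3_ for posts, t1_ for comments.
    author_by_fullname = {f"t3_{post_id}": post_author}
    for comment in comments:
        author_by_fullname[f"t1_{comment['id']}"] = comment["author"]

    return {
        comment["id"]: author_by_fullname.get(comment["parent_id"], "[unknown_parent]")
        for comment in comments
    }


# Invented toy thread: one post by "alice" and three comments.
toy_comments = [
    {"id": "c1", "author": "bob", "parent_id": "t3_p1"},    # reply to the post
    {"id": "c2", "author": "carol", "parent_id": "t1_c1"},  # reply to bob
    {"id": "c3", "author": "dave", "parent_id": "t1_zz"},   # parent not in the sample
]
parents = resolve_parent_authors("p1", "alice", toy_comments)
print(parents)
```

The trade-off is coverage: a local lookup is fast, but it cannot name authors of parent comments that were cut off by the per-post comment limit.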

In [13]:

from dotenv import load_dotenv

load_dotenv('keys.env')

import praw
import pandas as pd
import time
import os # For environment variables
from datetime import datetime

In [None]:
# 5.1.1 Data Collection Code
import os
print("Starting Data Collection...")

# --- Configuration ---

REDDIT_CLIENT_ID = os.environ.get("REDDIT_CLIENT_ID", "YOUR_REDDIT_CLIENT_ID") 
REDDIT_CLIENT_SECRET = os.environ.get("REDDIT_CLIENT_SECRET", "YOUR_REDDIT_CLIENT_SECRET") 
REDDIT_USER_AGENT = os.environ.get("REDDIT_USER_AGENT", "YOUR_REDDIT_USER_AGENT") 

# Subreddits to target (one general tech-news community, one explicitly critical of Musk)
# Choose subreddits relevant to topic (e.g., Elon Musk)
# subreddit_list = ["technology", "EnoughMuskSpam", "SpaceXLounge", "politics", "wallstreetbets"]
subreddit_list = ["technology", "EnoughMuskSpam"] # Keep it small initially

# Search query
search_query = "Elon Musk"

# Time limit for search (e.g., 'month', 'year', 'all')
# Use of 'all' might return too much data so starting small
time_filter = 'month'

# Limit number of posts per subreddit (to keep it manageable)
post_limit_per_subreddit = 50 
comment_limit_per_post = 100 # Max comments per post


# --- PRAW Setup ---
try:
    reddit = praw.Reddit(
        client_id=REDDIT_CLIENT_ID,
        client_secret=REDDIT_CLIENT_SECRET,
        user_agent=REDDIT_USER_AGENT
    )
    reddit.read_only = True
    print("PRAW Reddit instance created.")
except Exception as e:
    print(f"Error creating PRAW instance: {e}")
    # Exit or handle error appropriately
    reddit = None

# --- Data Storage ---
all_data = [] # List to hold dictionaries of posts and comments

# --- Scraping Loop ---
if reddit:
    for sub_name in subreddit_list:
        print(f"\n--- Processing Subreddit: r/{sub_name} ---")
        try:
            subreddit = reddit.subreddit(sub_name)
            post_count = 0
            # Search for posts mentioning the query
            for submission in subreddit.search(search_query, sort="relevance", time_filter=time_filter, limit=post_limit_per_subreddit):
                if post_count >= post_limit_per_subreddit:
                    break
                post_count += 1
                print(f"  Fetching post {post_count}/{post_limit_per_subreddit}: {submission.id} - {submission.title[:50]}...")

                # Store post data
                post_author = submission.author.name if submission.author else "[deleted]"
                all_data.append({
                    'type': 'post',
                    'id': submission.id,
                    'subreddit': sub_name,
                    'title': submission.title,
                    'author': post_author,
                    'created_utc': submission.created_utc,
                    'score': submission.score,
                    'upvote_ratio': submission.upvote_ratio,
                    'num_comments': submission.num_comments,
                    'text': submission.selftext,
                    'url': submission.url, # URL the post links to (if not self-post)
                    'permalink': f"https://www.reddit.com{submission.permalink}",
                    'parent_id': None, # Posts don't have a parent in this context
                    'parent_author': None
                })

                # Fetch comments for this post
                submission.comments.replace_more(limit=5) # Expand top-level "more comments" links a few times
                comment_count = 0
                for comment in submission.comments.list():
                    if comment_limit_per_post is not None and comment_count >= comment_limit_per_post:
                        break
                    comment_count += 1

                    # Find parent author (can be post author or another comment author)
                    parent_author = "[deleted]" # Default
                    parent_is_post = comment.parent_id.startswith('t3_')
                    if parent_is_post:
                        parent_author = post_author
                    else: # Parent is another comment (t1_)
                        try:
                            # Attempt to fetch parent comment directly (might fail if deleted)
                            # Note: this is a lazy PRAW object, so reading .author below
                            # triggers one extra API request per reply
                            parent_comment = reddit.comment(comment.parent_id.split('_')[1])
                            if parent_comment and parent_comment.author:
                                parent_author = parent_comment.author.name
                        except Exception:
                            parent_author = "[unknown_parent]" # Or keep as deleted

                    # Store comment data
                    comment_author = comment.author.name if comment.author else "[deleted]"
                    all_data.append({
                        'type': 'comment',
                        'id': comment.id,
                        'subreddit': sub_name,
                        'title': None, # Comments don't have titles
                        'author': comment_author,
                        'created_utc': comment.created_utc,
                        'score': comment.score,
                        'upvote_ratio': None, # Comments don't have upvote ratio
                        'num_comments': None, # Not applicable directly
                        'text': comment.body,
                        'url': None, # Comments don't link to external URLs directly
                        'permalink': f"https://www.reddit.com{comment.permalink}",
                        'parent_id': comment.parent_id, # ID of the parent (post or comment)
                        'parent_author': parent_author
                    })

                print(f"    Fetched {comment_count} comments for post {submission.id}")
                time.sleep(1) # Be 'polite' to Reddit API between posts

        except Exception as e:
            print(f"  Error processing subreddit r/{sub_name}: {e}")
        time.sleep(2) # Be extra polite between subreddits

    # --- Convert to DataFrame and Save ---
    if all_data:
        df = pd.DataFrame(all_data)
        # Convert UTC timestamp to datetime objects
        df['created_datetime'] = pd.to_datetime(df['created_utc'], unit='s')
        # Define output filename
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_filename = f"reddit_data_{search_query.replace(' ','_')}_{timestamp}.csv"
        df.to_csv(output_filename, index=False)
        print(f"\nData collection complete. Saved {len(df)} rows to {output_filename}")
    else:
        print("\nNo data collected.")

else:
    print("Reddit instance not available. Scraping aborted.")

Starting Data Collection...
PRAW Reddit instance created.

--- Processing Subreddit: r/technology ---
  Fetching post 1/50: 1jl7jtp - Elon Musk pressured Reddit’s CEO on content modera...
    Fetched 100 comments for post 1jl7jtp
  Fetching post 2/50: 1jm22rt - Elon Musk makes request to Reddit CEO to take down...
    Fetched 100 comments for post 1jm22rt
  Fetching post 3/50: 1j84l15 - Elon Musk Says X Outage Caused by ‘Massive Cyberat...
    Fetched 100 comments for post 1j84l15
  Fetching post 4/50: 1j6mkmi - FAA workers threatened with firing if they ‘impede...
    Fetched 100 comments for post 1j6mkmi
  Fetching post 5/50: 1jmxnez - Elon Musk's Alleged Meddling Sparks Reddit Backlas...


## 5.2 Identifying Evidence Indicators

Instead of attempting full argument annotation, which is complex, I focused on identifying *indicators* that a user might be referencing external material.

1.  **Indicator Identification**
    *   I processed the `text` field of each post and comment in the DataFrame.
    *   **URLs:** Regular expressions (`regex`) were used to detect and extract any HTTP/HTTPS links present in the text. A boolean flag (`has_url`) was added.
    *   **Keywords:** A predefined list of words often associated with citing evidence (e.g., 'study', 'article', 'source', 'data', 'link') was compiled. Text was checked for the presence of these keywords (case-insensitive), resulting in a boolean flag (`has_keyword`).
    *   **Combined Indicator:** A final boolean column (`has_evidence_indicator`) was created, marking rows that contained *either* a URL or an evidence-related keyword.

2.  **Identifying Potential Contradictory Usage**
    *   To find instances where the same evidence might be used differently, I extracted all unique URLs cited across the dataset.
    *   I identified URLs that appeared more than once. These represent potential cases for RQ3, where multiple users referenced the same external source.

3.  **Augmented Data Storage**
    *   This indicator information was added to the DataFrame, which was then saved to a new CSV file (`..._augmented.csv`).
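
Links in Reddit comments are often written in Markdown, so a simple URL regex can pick up trailing punctuation such as a closing parenthesis. The sketch below shows one possible cleanup step together with a whole-word keyword check. It is an illustration under stated assumptions: `extract_urls`, `has_keyword`, the shortened keyword list, and the sample text are hypothetical and differ from the patterns used in the notebook's own cell.

```python
import re

# Assumed patterns for this sketch only.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")
KEYWORDS = ["study", "source", "according to"]
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)
TRAILING_PUNCTUATION = ".,;:!?'\""


def extract_urls(text):
    """Find URLs and trim punctuation that belongs to the sentence, not the link."""
    urls = []
    for raw in URL_PATTERN.findall(text):
        url = raw.rstrip(TRAILING_PUNCTUATION)
        # Drop a closing parenthesis only when it has no matching opener in the URL.
        while url.endswith(")") and url.count(")") > url.count("("):
            url = url[:-1].rstrip(TRAILING_PUNCTUATION)
        urls.append(url)
    return urls


def has_keyword(text):
    """True if any evidence keyword appears as a whole word (case-insensitive)."""
    return KEYWORD_PATTERN.search(text) is not None


sample = "See [this](https://example.com/page). Also https://en.wikipedia.org/wiki/X_(Y), ok"
print(extract_urls(sample))
print(has_keyword("According to this thread, it failed."))
```

Trimming before counting matters for the repeated-URL check, since the same link with and without a trailing `)` would otherwise be treated as two different sources.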

In [8]:
# 5.2.1 Argument Annotation Code

import pandas as pd
import re # For regular expressions (finding URLs)

print("Starting Argument/Evidence Identification...")

# --- Configuration ---
# Load the previously saved data
# Adjust filename if you saved it differently or ran collection earlier
input_filename = output_filename # Uses the filename saved from the previous step


# Keywords that might indicate citing evidence 
evidence_keywords = ['study', 'research', 'data', 'stat', 'article', 'report',
                     'source', 'evidence', 'according to', 'shows that', 'link', 'graph', 'chart']
keyword_pattern = r'\b(?:' + '|'.join(evidence_keywords) + r')\b' # Whole-word regex; non-capturing group avoids the pandas match-group warning in str.contains

# --- Load Data ---
try:
    df = pd.read_csv(input_filename)
    print(f"Loaded data from {input_filename} ({len(df)} rows)")
except FileNotFoundError:
    print(f"Error: Input file not found at {input_filename}. Make sure the data collection step ran successfully.")
    df = None
except Exception as e:
    print(f"Error loading data: {e}")
    df = None

# --- Identify Evidence ---
if df is not None:
    # Ensure 'text' column exists and handle potential NaN values
    if 'text' in df.columns:
        df['text'] = df['text'].fillna('') # Replace NaN with empty string
    else:
        print("Error: 'text' column not found in DataFrame.")
        df = None # Stop processing if text column is missing

if df is not None:
    # 1. Find URLs using regex
    # Simple regex for URLs (a good start, but it keeps trailing punctuation such as ')' from Markdown links)
    url_pattern = r'https?://[^\s/$.?#].[^\s]*'
    df['urls_found'] = df['text'].apply(lambda x: re.findall(url_pattern, str(x)))
    df['has_url'] = df['urls_found'].apply(lambda x: len(x) > 0)

    # 2. Find evidence-related keywords (case-insensitive)
    df['has_keyword'] = df['text'].str.contains(keyword_pattern, case=False, regex=True, na=False)

    # 3. Combine: Does it have either a URL or a keyword?
    df['has_evidence_indicator'] = df['has_url'] | df['has_keyword']

    # --- Summarize Findings ---
    print("\nEvidence Identification Summary:")
    print(f"Rows with URLs: {df['has_url'].sum()} ({df['has_url'].mean()*100:.1f}%)")
    print(f"Rows with Keywords: {df['has_keyword'].sum()} ({df['has_keyword'].mean()*100:.1f}%)")
    print(f"Rows with any Evidence Indicator: {df['has_evidence_indicator'].sum()} ({df['has_evidence_indicator'].mean()*100:.1f}%)")

    # Identify potential contradictory usage (simple check: same URL used multiple times)
    all_urls = df[df['has_url']]['urls_found'].explode() # Get a Series of all individual URLs found
    if not all_urls.empty:
        url_counts = all_urls.value_counts()
        repeated_urls = url_counts[url_counts > 1]
        print(f"\nFound {len(repeated_urls)} unique URLs cited more than once.")
        print("Top 5 most frequently cited URLs:")
        print(repeated_urls.head(5))
        # You can further analyze posts/comments citing these specific URLs later

    # --- Save Augmented Data ---
    augmented_filename = input_filename.replace(".csv", "_augmented.csv")
    df.to_csv(augmented_filename, index=False)
    print(f"\nSaved augmented data with evidence indicators to {augmented_filename}")

else:
    print("DataFrame not available. Evidence identification aborted.")

Starting Argument/Evidence Identification...
Loaded data from reddit_data_Elon_Musk_20250408_011357.csv (4757 rows)

Evidence Identification Summary:
Rows with URLs: 194 (4.1%)
Rows with Keywords: 172 (3.6%)
Rows with any Evidence Indicator: 344 (7.2%)

Found 3 unique URLs cited more than once.
Top 5 most frequently cited URLs:
urls_found
https://elonmusk.today/                                    2
https://www.reddit.com/r/PresidentElonMusk/s/LPoNoGq9Nx    2
https://www.reddit.com/r/gifs/s/Vxk2zfwmpG)                2
Name: count, dtype: int64

Saved augmented data with evidence indicators to reddit_data_Elon_Musk_20250408_011357_augmented.csv




## 5.3 Structuring the Conversation: Preparing Network Data for Gephi

To visualize the underlying structure of the conversations and map how users interact, I transformed the collected data into a format suitable for network analysis using **Gephi**. This involves defining who the participants are (nodes) and how they connect (edges).

1.  **Defining Nodes and Their Characteristics**
    *   Each unique Redditor who authored a post or comment in the dataset became a **node** in the network. Users identified only as `[deleted]` or other placeholders were excluded.
    *   Crucially, I aggregated key information for each author to serve as node attributes. Beyond basic activity metrics (post/comment counts, total/average scores, `used_evidence` count), I calculated the user's average VADER sentiment score (`avg_vader_sentiment`) and determined their most frequent stance towards Elon Musk (`dominant_openai_stance`) based on the OpenAI classifications of their comments. This enrichment allows for visualizing not just activity, but also attitude and stance directly onto the network participants.

2.  **Mapping Interactions as Edges**
    *   **Edges** represent direct replies between users. A directed edge was created from the author of a replying comment (`Source`) to the author of the parent comment or post (`Target`).
    *   Edge attributes capture details about the reply itself, including the replying comment's score (`comment_score`), whether the reply contained an evidence indicator (`reply_has_evidence`), and the specific VADER sentiment (`reply_vader_sentiment`) and OpenAI stance (`reply_openai_stance`) of that particular replying comment. This allows for analyzing the nature of specific interactions.

3.  **Export for Visualization**
    *   The resulting node and edge lists, complete with their respective attributes (including the aggregated stance/sentiment for nodes and specific stance/sentiment for edges), were exported as two distinct CSV files (`gephi_nodes_..._final_...csv`, `gephi_edges_..._final_...csv`). These files are directly importable into Gephi.

4.  **Visualization Goals with Enriched Data**
    *   My aim in Gephi is to move beyond just seeing who talks to whom. By mapping attributes like `dominant_openai_stance` and `avg_vader_sentiment` onto the nodes (e.g., using color), and potentially filtering edges based on `reply_openai_stance`, I can explore:
        *   Whether users with similar stances cluster together.
        *   How users with different sentiments interact.
        *   The structural position of users who frequently use evidence indicators, and how this relates to their stance.
        *   Differences in interaction patterns and stance distributions between `r/technology` and `r/EnoughMuskSpam`.
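
Before opening Gephi, the first question above (whether users with similar stances cluster together) can be given a rough numeric check: the share of reply edges that connect two users with the same dominant stance. This is a minimal, dependency-free sketch. `same_stance_share`, the toy users, and the toy edges are assumptions for illustration and are not computed from the exported files.

```python
def same_stance_share(edges, stance_by_user, ignore=("Unknown",)):
    """Share of reply edges whose source and target hold the same stance.

    Edges where either endpoint has an ignored or missing stance are skipped.
    Returns None when no edge is comparable.
    """
    comparable = 0
    same = 0
    for source, target in edges:
        source_stance = stance_by_user.get(source, "Unknown")
        target_stance = stance_by_user.get(target, "Unknown")
        if source_stance in ignore or target_stance in ignore:
            continue
        comparable += 1
        if source_stance == target_stance:
            same += 1
    return same / comparable if comparable else None


# Invented toy network (hypothetical users and stances).
toy_stances = {"a": "Anti-Musk", "b": "Anti-Musk", "c": "Pro-Musk", "d": "Unknown"}
toy_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
share = same_stance_share(toy_edges, toy_stances)
print(share)
```

A high share is not evidence of clustering on its own. In a subreddit where most users hold one stance, most edges will be same-stance by chance, so the figure should be compared against the overall stance distribution.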

In [None]:
# 5.3.1 Export for Gephi (Updated to include Sentiment/Stance)

import pandas as pd
from datetime import datetime # Needed for the timestamp in the output filenames

print("Preparing data for Gephi (including Sentiment/Stance)...")

# --- Configuration ---
# Load the FINAL data which includes sentiment and stance
# Make sure this filename matches the output of cell 5.4.1
input_filename = final_filename # Use the variable holding the final filename
# Or uncomment and set manually:
# input_filename = "reddit_data_Elon_Musk_YYYYMMDD_HHMMSS_final_analysis.csv"

# --- Load Data ---
try:
    df = pd.read_csv(input_filename)
    # Ensure required columns exist and handle potential NaN in text/authors
    df['text'] = df['text'].fillna('').astype(str) # Fill NaN first; astype(str) alone would turn NaN into the string 'nan'
    df['author'] = df['author'].fillna('[deleted]')
    df['parent_author'] = df['parent_author'].fillna('[deleted]') # Use a consistent placeholder
    print(f"Loaded final analysis data from {input_filename} ({len(df)} rows)")
except FileNotFoundError:
    print(f"Error: Input file not found at {input_filename}. Make sure step 5.4.1 ran successfully and saved the file.")
    df = None
except Exception as e:
    print(f"Error loading data: {e}")
    df = None

# --- Create Nodes and Edges ---
if df is not None:
    # --- Aggregate User-Level Attributes ---
    print("Aggregating user attributes (activity, sentiment, stance)...")

    # Define a function to get the mode (most frequent value) safely using pandas .mode()
    def safe_mode(series):
        # Exclude known non-user placeholders if necessary before calculating mode
        # Also filter out generic/uninformative stances if you don't want them as dominant
        valid_series = series.dropna()
        # Build the mask from the NaN-free series so its index matches what it filters
        valid_series = valid_series[~valid_series.isin(['[deleted]', '[unknown_parent]', 'API Error', 'Processing Error', 'Unclear'])]
        if valid_series.empty:
            # Default stance if no valid ones are found for a user
            # Choose what makes sense: 'Unknown', 'Neutral/Mixed', etc.
            return 'Unknown'
        else:
            # .mode() returns a Series (can have multiple modes if tied)
            # We usually just take the first one.
            mode_result = valid_series.mode()
            if not mode_result.empty:
                return mode_result[0]
            else:
                # Should not happen if valid_series wasn't empty, but as a fallback
                return 'Unknown'


    author_stats = df[df['author'] != '[deleted]'].groupby('author').agg(
        post_count=('type', lambda x: (x == 'post').sum()),
        comment_count=('type', lambda x: (x == 'comment').sum()),
        avg_score=('score', 'mean'),
        total_score=('score', 'sum'),
        used_evidence=('has_evidence_indicator', 'sum'),
        # Aggregate sentiment and stance
        avg_vader_sentiment=('vader_sentiment_compound', 'mean'),
        dominant_openai_stance=('openai_stance', safe_mode) # Use the safe mode function
    ).reset_index()

    # --- Create Node List ---
    print("Creating Node list...")
    # Nodes are unique authors (excluding deleted) from both author and parent_author columns
    authors = pd.concat([df['author'], df['parent_author']]).unique()
    nodes_df = pd.DataFrame(authors, columns=['Id'])
    nodes_df = nodes_df[~nodes_df['Id'].isin(['[deleted]', '[unknown_parent]'])] # Remove known placeholders

    nodes_df['Label'] = nodes_df['Id'] # Gephi uses 'Label' column for text display

    # Merge aggregated stats
    nodes_df = pd.merge(nodes_df, author_stats, left_on='Id', right_on='author', how='left')
    nodes_df = nodes_df.drop(columns=['author']) # Remove redundant column
    # Fill NaNs for users who only received replies (didn't post/comment themselves in the sample)
    # Fill numerical NaNs with 0, categorical NaNs with a default like 'Unknown' or 'Neutral/Mixed'
    numeric_cols = ['post_count', 'comment_count', 'avg_score', 'total_score', 'used_evidence', 'avg_vader_sentiment']
    for col in numeric_cols:
        if col in nodes_df.columns:
            nodes_df[col] = nodes_df[col].fillna(0)
    if 'dominant_openai_stance' in nodes_df.columns:
         nodes_df['dominant_openai_stance'] = nodes_df['dominant_openai_stance'].fillna('Unknown') # Handle users with no classifiable comments


    # --- Create Edge List ---
    print("Creating Edge list...")
    # Edges represent replies (comment author -> parent author)
    # Filter for comments that have valid source and target authors
    edges_df = df[
        (df['type'] == 'comment') &
        (~df['author'].isin(['[deleted]', '[unknown_parent]'])) & # Valid source
        (df['parent_author'].notna()) &
        (~df['parent_author'].isin(['[deleted]', '[unknown_parent]'])) # Valid target
    ].copy() # Make a copy to avoid SettingWithCopyWarning

    # Select and rename columns for Gephi edge format
    edges_df = edges_df[['author', 'parent_author', 'score', 'has_evidence_indicator', 'vader_sentiment_compound', 'openai_stance', 'id']] # Keep comment id for potential reference
    edges_df.rename(columns={
        'author': 'Source',
        'parent_author': 'Target',
        'score': 'comment_score', # Rename to avoid clash with node scores
        'has_evidence_indicator': 'reply_has_evidence',
        'vader_sentiment_compound': 'reply_vader_sentiment',
        'openai_stance': 'reply_openai_stance',
        'id': 'comment_id'
        }, inplace=True)

    edges_df['Type'] = 'Directed' # Edges go from replier to parent
    edges_df['Weight'] = 1.0 # Default weight for each reply


    # --- Save Files ---
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") # Add timestamp to avoid overwriting old files
    nodes_filename = f"gephi_nodes_{search_query.replace(' ','_')}_final_{timestamp}.csv"
    edges_filename = f"gephi_edges_{search_query.replace(' ','_')}_final_{timestamp}.csv"

    nodes_df.to_csv(nodes_filename, index=False)
    edges_df.to_csv(edges_filename, index=False)

    print(f"\nGephi data prepared (including aggregated stance/sentiment):")
    print(f"  - Nodes ({len(nodes_df)}): {nodes_filename}")
    print(f"  - Edges ({len(edges_df)}): {edges_filename}")
    print("Import these NEW CSV files into Gephi.")

else:
    print("DataFrame not available. Gephi export aborted.")

Preparing data for Gephi (including Sentiment/Stance)...
Loaded final analysis data from reddit_data_Elon_Musk_20250408_011357_final_analysis.csv (4757 rows)
Aggregating user attributes (activity, sentiment, stance)...
Creating Node list...
Creating Edge list...

Gephi data prepared (including aggregated stance/sentiment):
  - Nodes (2875): gephi_nodes_Elon_Musk_final_20250408_024555.csv
  - Edges (4550): gephi_edges_Elon_Musk_final_20250408_024555.csv
Import these NEW CSV files into Gephi.


## 5.4 Sentiment Analysis (VADER) and Stance Detection (OpenAI)

To gauge the emotional tone and the specific alignment of the discussions towards Elon Musk, I employed two distinct computational methods:

1.  **Sentiment Analysis Tool (VADER)**
    *   I used **VADER (Valence Aware Dictionary and sEntiment Reasoner)**, a lexicon and rule-based sentiment analysis tool attuned to social media language.
    *   VADER provided a `compound` score from -1 (most negative) to +1 (most positive) for each post and comment, capturing the overall emotional polarity. This was stored in the `vader_sentiment_compound` column.

2.  **Stance Detection Tool (OpenAI GPT)**
    *   To understand *how* users were positioned relative to Elon Musk (beyond just positive/negative tone), I utilized OpenAI's **GPT model (specifically `gpt-3.5-turbo` in this run)**.
    *   I crafted a prompt asking the model to classify the stance of each post/comment into predefined categories: `"Pro-Musk"`, `"Anti-Musk"`, `"Neutral/Mixed"`, or `"Unclear"`.
    *   The text from each post/comment was sent in batches to the OpenAI API. The model's classification for each item was parsed from the response.
    *   This classification, representing the inferred authorial stance towards Musk, was stored in the `openai_stance` column. *Initial results show classifications like "Anti-Musk" for critical comments (e.g., "What a hypocrite this dude is.") and "Neutral/Mixed" or "Unclear" for others, suggesting that the model can differentiate between these cases.*

3.  **Combined Analysis Approach**
    *   Having both VADER sentiment and OpenAI stance allows for a more nuanced understanding. For example, a comment could have negative sentiment (VADER score < 0) and be classified as "Anti-Musk" (OpenAI stance), confirming strong criticism. Conversely, a comment might be neutral in sentiment (VADER score ≈ 0) but still classified as "Pro-Musk" or "Anti-Musk" based on the content's alignment.

4.  **Final Data Storage**
    *   The DataFrame, now augmented with evidence indicators, VADER sentiment scores, and OpenAI stance classifications, was saved to a final CSV file (`..._final_analysis.csv`). This comprehensive dataset serves as the foundation for the subsequent discussion and analysis.
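
The combined approach in item 3 can be illustrated by bucketing the VADER compound score and counting how the buckets line up with the stance labels. The sketch below uses the ±0.05 cut-offs suggested in VADER's documentation for positive, neutral, and negative text. The helper names `vader_label` and `cross_tabulate` and the toy rows are assumptions for this example, not the notebook's own analysis.

```python
from collections import Counter


def vader_label(compound, threshold=0.05):
    """Bucket a VADER compound score into positive / neutral / negative."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def cross_tabulate(rows):
    """Count (sentiment bucket, stance) pairs across rows."""
    return Counter((vader_label(row["compound"]), row["stance"]) for row in rows)


# Invented toy rows (hypothetical scores and labels).
toy_scored = [
    {"compound": -0.7, "stance": "Anti-Musk"},
    {"compound": 0.0, "stance": "Anti-Musk"},
    {"compound": 0.6, "stance": "Pro-Musk"},
    {"compound": 0.6, "stance": "Anti-Musk"},
]
table = cross_tabulate(toy_scored)
for (sentiment, stance), count in sorted(table.items()):
    print(f"{sentiment:>8} | {stance:<10} | {count}")
```

Off-diagonal cells, such as positive sentiment paired with an "Anti-Musk" stance, are the interesting ones: they may point to sarcasm or to praise aimed at someone other than Musk, which tone-based scoring alone would misread.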

In [10]:
# 5.4.1 Sentiment / Stance Analysis Code

import pandas as pd
# Option 1: Simple Rule-Based Sentiment (VADER)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Option 2: Advanced Stance/Sentiment 
import openai
import json
import time
import os

print("Starting Sentiment/Stance Analysis...")

# --- Configuration ---
# Load the augmented data
input_filename = augmented_filename # From step 5.2.1


# --- Load Data ---
try:
    df = pd.read_csv(input_filename)
    # Ensure 'text' column is string and handle NaN
    df['text'] = df['text'].fillna('').astype(str) # Fill NaN first; astype(str) alone would turn NaN into the string 'nan'
    print(f"Loaded data from {input_filename} ({len(df)} rows)")
except FileNotFoundError:
    print(f"Error: Input file not found at {input_filename}.")
    df = None
except Exception as e:
    print(f"Error loading data: {e}")
    df = None

# --- Option 1: VADER Sentiment Analysis ---
if df is not None:
    print("\n--- Running VADER Sentiment Analysis ---")
    analyzer = SentimentIntensityAnalyzer()

    # Function to get VADER compound score
    def get_vader_sentiment(text):
        vs = analyzer.polarity_scores(text)
        return vs['compound'] # Compound score ranges from -1 (most negative) to +1 (most positive)

    # Apply VADER to the 'text' column (works for both posts and comments)

    df['vader_sentiment_compound'] = df['text'].apply(get_vader_sentiment)

    print("VADER analysis complete. Added 'vader_sentiment_compound' column.")
    print("Sentiment Score Summary:")
    print(df['vader_sentiment_compound'].describe())

# --- Option 2: Stance Detection using OpenAI ---


print("\n--- Preparing for OpenAI Stance Detection")
# --- OpenAI Setup ---
try:
    openai_api_key = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
    if not openai_api_key or openai_api_key == "YOUR_OPENAI_API_KEY":
         print("Warning: OpenAI API Key not found or not set in environment variables.")
         openai_client = None
    else:
         openai_client = openai.OpenAI(api_key=openai_api_key)
         print("OpenAI client initialized.")
except Exception as e:
     print(f"Error initializing OpenAI client: {e}")
     openai_client = None

# Function to classify stance using OpenAI (modify prompt as needed)
def classify_stance_openai(texts, batch_size=10):
     if not openai_client or not texts:
         return ["Error: OpenAI not configured or no texts"] * len(texts)

     results = {} # Use dictionary to store results mapped to original index

     for i in range(0, len(texts), batch_size):
         batch_texts = texts[i : i + batch_size]
         original_indices = list(range(i, i + len(batch_texts))) # Keep track of original index

         classification_prompt = ""
         for idx, text in enumerate(batch_texts):
             cleaned_text = str(text).replace('\n', ' ').replace('"', "'") # Basic cleaning
             classification_prompt += f'{idx}: "{cleaned_text[:500]}"\n\n' # Limit text length per item

         system_prompt = f"""
         You are analyzing Reddit comments about Elon Musk.
         For each text provided (indexed from 0), determine the author's stance towards Elon Musk.
         Classify the stance into one of these categories:
         - "Pro-Musk": Clearly positive, supportive, defending Musk.
         - "Anti-Musk": Clearly negative, critical, attacking Musk.
         - "Neutral/Mixed": No clear stance, balanced, objective, or discussing tangentially.
         - "Unclear": Cannot determine stance from the text provided.

         Provide your response as a JSON object for each text on a new line. Example format:
         {{"Index": 0, "Stance": "Anti-Musk"}}
         {{"Index": 1, "Stance": "Pro-Musk"}}
         """

         try:
             response = openai_client.chat.completions.create(
                 model="gpt-3.5-turbo", # Or use "gpt-4" for potentially better (but slower/costlier) results
                 messages=[
                     {"role": "system", "content": system_prompt},
                     {"role": "user", "content": f"Classify the stance for these texts:\n{classification_prompt}"}
                 ],
                 temperature=0.2 # Lower temperature for more deterministic results
             )
             response_content = response.choices[0].message.content

             # Parse the response (expecting JSON per line)
             lines = response_content.strip().split('\n')
             for line in lines:
                 if not line.strip():
                     continue  # Skip blank separator lines instead of warning on them
                 try:
                     data = json.loads(line.strip())
                     batch_index = data.get("Index")
                     stance = data.get("Stance")
                     if batch_index is not None and stance is not None and 0 <= batch_index < len(original_indices):
                         original_idx = original_indices[batch_index]
                         results[original_idx] = stance
                 except json.JSONDecodeError:
                     print(f"Warning: Could not decode JSON line: {line}")
                 except Exception as parse_e:
                      print(f"Warning: Error processing line '{line}': {parse_e}")


             print(f"  Processed batch {i//batch_size + 1}/{(len(texts) + batch_size - 1)//batch_size}")
             time.sleep(1) # Rate limiting

         except Exception as api_e:
             print(f"Error calling OpenAI API: {api_e}")
             # Mark remaining items in batch as error
             for idx in original_indices:
                 if idx not in results:
                     results[idx] = "API Error"
             time.sleep(5) # Wait longer after an API error

     # Return stances in the original order
     final_stances = [results.get(idx, "Processing Error") for idx in range(len(texts))]
     return final_stances


# --- Apply OpenAI Stance Classification (if configured) ---
if df is not None and openai_client:
     print("\n--- Running OpenAI Stance Detection (this may take time and cost money) ---")
     # Select only comments for stance detection, or apply to all? Let's do all for simplicity here.
     texts_to_classify = df['text'].tolist()
     stances = classify_stance_openai(texts_to_classify)

     if len(stances) == len(df):
         df['openai_stance'] = stances
         print("OpenAI stance classification complete. Added 'openai_stance' column.")
         print("Stance Distribution (OpenAI):")
         print(df['openai_stance'].value_counts())
     else:
         print(f"Error: Number of stance results ({len(stances)}) does not match DataFrame rows ({len(df)}).")
elif df is not None:
      print("\n--- Skipping OpenAI Stance Detection (not configured or enabled) ---")


# --- Save Final Data ---
if df is not None:
    final_filename = input_filename.replace("_augmented.csv", "_final_analysis.csv")
    df.to_csv(final_filename, index=False)
    print(f"\nSaved final data with sentiment/stance analysis to {final_filename}")

else:
    print("DataFrame not available. Sentiment/stance analysis aborted.")

Starting Sentiment/Stance Analysis...
Loaded data from reddit_data_Elon_Musk_20250408_011357_augmented.csv (4757 rows)

--- Running VADER Sentiment Analysis ---
VADER analysis complete. Added 'vader_sentiment_compound' column.
Sentiment Score Summary:
count    4757.000000
mean       -0.037494
std         0.470868
min        -0.987900
25%        -0.421500
50%         0.000000
75%         0.335500
max         0.995500
Name: vader_sentiment_compound, dtype: float64

--- Preparing for OpenAI Stance Detection
OpenAI client initialized.

--- Running OpenAI Stance Detection (this may take time and cost money) ---
  Processed batch 1/476
  Processed batch 2/476
  Processed batch 3/476
  Processed batch 4/476
  Processed batch 5/476
  Processed batch 6/476
  Processed batch 7/476
  Processed batch 8/476
  Processed batch 9/476
  Processed batch 10/476
  Processed batch 11/476
  Processed batch 12/476
  Processed batch 13/476
  Processed batch 14/476
  Processed batch 15/476
  Processed batch 16

# 6. Preliminary Discussion and Findings

Having executed the data collection, evidence indicator identification, sentiment analysis (VADER), stance detection (OpenAI), and network data preparation, I can now discuss the key findings emerging from the analysis of discussions surrounding Elon Musk in `r/technology` and `r/EnoughMuskSpam`.

**Stance Distribution:**

The OpenAI stance classification revealed a distinct landscape of opinions. Across the 4757 analyzed posts and comments:
*   A significant portion (**~54%**) were classified as **"Unknown"**. This could reflect limitations in the GPT-3.5 model's ability to interpret nuanced or short comments, or it might indicate that many comments genuinely didn't express a clear stance on Musk himself, perhaps focusing on tangential aspects of the news or discussion.
*   Among the items that did receive a stance label, there was a strong skew towards negativity: **~34%** of all items were classified as **"Anti-Musk"**, roughly three-quarters of the classified subset.
*   **"Neutral/Mixed"** stances accounted for **~9.6%**.
*   Explicitly **"Pro-Musk"** stances were relatively rare, representing only **~2.3%** of the total.

This distribution suggests that while overt support for Musk was minimal in these specific discussions and subreddits during the sampled period, outright criticism was common. The large "Unknown" category warrants further investigation, potentially through manual review or using a more advanced model, but the dominance of "Anti-Musk" among classified stances is a clear finding.
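
To make the phrase "among classified stances" concrete, the reported percentages can be re-expressed as shares of the classified subset only. The sketch below uses nothing but the rounded figures quoted above, so the results are approximate.

```python
# Percentages of all items as reported above (they sum to ~99.9 due to rounding).
shares = {"Unknown": 54.0, "Anti-Musk": 34.0, "Neutral/Mixed": 9.6, "Pro-Musk": 2.3}

# Keep only the items that received a stance label.
classified = {k: v for k, v in shares.items() if k != "Unknown"}
classified_total = sum(classified.values())

# Re-express each stance as a percentage of the classified subset.
conditional = {k: round(100 * v / classified_total, 1) for k, v in classified.items()}
print(conditional)
# {'Anti-Musk': 74.1, 'Neutral/Mixed': 20.9, 'Pro-Musk': 5.0}
```

On these figures, "Anti-Musk" accounts for about 74% of the items that received a stance, while "Pro-Musk" accounts for about 5%.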


**Low Evidence Indicator Usage:**

A striking finding was the infrequent use of explicit evidence indicators. **Less than 7%** of all analyzed posts and comments contained either a URL or one of the predefined evidence-related keywords. This low rate occurred despite the often argumentative and opinionated nature of the discussions, where stances (particularly "Anti-Musk") were clearly expressed. This observation aligns with the initial hypothesis that evidence, at least in the form of easily detectable indicators like links or specific keywords, might not be a primary feature of these online exchanges, even when strong opinions are present.
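
For reference, an evidence-indicator rate of this kind can be computed along the following lines. This is a sketch under assumptions, not the detection code used earlier in the notebook: the URL pattern, the three example keywords, the `has_evidence` column name, and the toy texts are all illustrative.

```python
import re
import pandas as pd

# Assumed indicator definitions: a simple URL pattern plus three example
# keywords. The keyword list actually used in the notebook may be longer.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
KEYWORD_PATTERN = re.compile(r"\b(source|data|article)\b", re.IGNORECASE)

def has_evidence_indicator(text):
    """Return True if the text contains a URL or an evidence keyword."""
    text = str(text)
    return bool(URL_PATTERN.search(text) or KEYWORD_PATTERN.search(text))

# Invented example texts.
toy = pd.DataFrame({"text": [
    "What a hypocrite this dude is.",
    "Source? Here is the article: https://example.com/report",
    "lol",
    "The data says otherwise.",
]})

# 'has_evidence' is a hypothetical column name for the boolean flag.
toy["has_evidence"] = toy["text"].apply(has_evidence_indicator)
rate = toy["has_evidence"].mean()
print(f"{rate:.0%} of toy rows contain an indicator")
```

Whole-word matching keeps the keyword check from firing inside unrelated words, but it also misses plurals such as "sources", so a rate measured this way is better read as a lower bound on explicit evidence talk.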


**The "NotEnoughMuskSpam" Anomaly:**

The network visualization preparation (sizing nodes by comment count) revealed an unexpected and intriguing case study: the user **"NotEnoughMuskSpam"**. This user was, by a significant margin, the most active commenter in the dataset (63+ comments) and garnered substantial engagement (675+ karma within this data). Surprisingly:
*   This highly active user was classified as **"Neutral/Mixed"** by OpenAI, despite a username suggesting a strong bias.
*   They used **zero detectable evidence indicators** in any of their comments.
*   Their average VADER score was slightly positive (~0.14), further aligning with the username's implication.

This user represents a fascinating deviation. Their high engagement without relying on external evidence indicators, coupled with a "Neutral" classification that contradicts their username, raises questions. Possible interpretations include: the user primarily engages through questions, meta-commentary, or non-falsifiable assertions rather than claims requiring evidence; the GPT-3.5 model might have struggled with their specific style; or their activity pattern itself is a form of engagement that doesn't fit typical "argumentative" models. This case underscores the idea that high activity and influence in online discussions aren't necessarily tied to evidence-based argumentation. The second-largest node, AutoModerator, was classified as neutral, as expected for an automated account, and serves as a baseline.
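
A per-user profile like the one discussed above can be assembled with a single grouped aggregation. The sketch below is illustrative only: the `author`, `score`, and `has_evidence` column names are assumptions, the toy users are invented, and `dominant` is a hypothetical helper that picks each user's most frequent stance.

```python
import pandas as pd

# Invented rows; one row per comment.
toy = pd.DataFrame({
    "author": ["userA", "userA", "userA", "userB", "userB"],
    "score": [10, 5, 1, 40, 2],
    "vader_sentiment_compound": [0.2, 0.0, 0.4, -0.6, -0.2],
    "openai_stance": ["Neutral/Mixed", "Neutral/Mixed", "Pro-Musk",
                      "Anti-Musk", "Anti-Musk"],
    "has_evidence": [False, False, False, True, False],
})

def dominant(series):
    """Most frequent label in a user's comments (hypothetical helper)."""
    return series.value_counts().idxmax()

per_user = toy.groupby("author").agg(
    comment_count=("score", "count"),
    total_score=("score", "sum"),
    mean_vader=("vader_sentiment_compound", "mean"),
    dominant_openai_stance=("openai_stance", dominant),
    evidence_count=("has_evidence", "sum"),
).sort_values("comment_count", ascending=False)
print(per_user)
```

Sorting by `comment_count` surfaces the most active accounts first, which is how an outlier with many comments and an `evidence_count` of zero would stand out.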

**Network Structure Insights:**

The prepared Gephi files now allow for visualizing these patterns. Plotting users (nodes) colored by their `dominant_openai_stance` and sized by `comment_count` should visually represent the stance distribution and highlight the prominence of users like "NotEnoughMuskSpam". Analyzing the connections (edges) can reveal if interactions primarily occur between users of similar stances (echo chambers) or if there's cross-stance engagement.
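
Alongside the visual inspection in Gephi, the echo-chamber question can be approximated numerically by measuring how many reply edges connect users with the same dominant stance. The edge list, the `source`/`target` column names, and the stance mapping below are invented for illustration and are not taken from the prepared Gephi files.

```python
import pandas as pd

# Hypothetical reply edges: 'source' replied to 'target'.
edges = pd.DataFrame({
    "source": ["a", "b", "c", "d"],
    "target": ["b", "a", "a", "c"],
})

# Hypothetical dominant stance per user (node attribute).
node_stance = {"a": "Anti-Musk", "b": "Anti-Musk", "c": "Pro-Musk", "d": "Anti-Musk"}

edges["source_stance"] = edges["source"].map(node_stance)
edges["target_stance"] = edges["target"].map(node_stance)
edges["same_stance"] = edges["source_stance"] == edges["target_stance"]

# Share of replies that stay within one stance group.
same_share = edges["same_stance"].mean()

# Full mixing matrix: rows are the replier's stance, columns the recipient's.
mixing = pd.crosstab(edges["source_stance"], edges["target_stance"])
print(f"Same-stance reply share: {same_share:.0%}")
print(mixing)
```

A same-stance share well above what the overall stance proportions would produce by chance would point towards echo-chamber behaviour, while a heavy off-diagonal in the mixing matrix would indicate cross-stance engagement.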

**VADER Sentiment:**

![VADER Sentiment](VaderSentiment.png)

**Legend:**

![Legend](Legend.png)



**ProMusk:**


![ProMusk](ProMusk.png)


**AntiMusk:**


![AntiMusk](AntiMusk.png)


**Unknown:**


![Unknown](Unknown.png)

This analysis phase has successfully quantified stance and evidence usage, revealing a landscape dominated by Anti-Musk sentiment (among classified comments), very low reliance on explicit evidence indicators, and intriguing user behavior patterns like the "NotEnoughMuskSpam" case.

# 7. Conclusions Based on Research Questions

Based on the computational analysis performed on the collected Reddit data regarding Elon Musk from `r/technology` and `r/EnoughMuskSpam`, I can offer the following preliminary conclusions for my research questions:

1.  **RQ1: To what extent do Reddit users cite external evidence or data?**
    *   **Conclusion:** Explicit evidence indicators (defined as URLs or specific keywords like 'source', 'data', 'article') were used **very infrequently**. Less than 7% of all posts and comments contained such indicators. This suggests that, within these subreddits and this topic, arguments or statements frequently rely on assertion, anecdote, or implied knowledge rather than explicit external backing. *Further analysis is needed to confirm if this low rate significantly differs between the two subreddits.*

2.  **RQ2: Do posts/comments containing explicit evidence correlate with higher levels of engagement or persuasion?**
    *   **Conclusion:** Based on this initial analysis, there is **no clear positive correlation** between the presence of these specific evidence indicators and higher engagement (measured by score/karma). The most active and highly-engaged user identified ("NotEnoughMuskSpam") used zero evidence indicators yet achieved significant comment volume and karma. This suggests engagement in these discussions may be more strongly driven by factors like activity level, humor, alignment with community sentiment/stance, or controversy rather than the explicit citation of external evidence as defined here.

3.  **RQ3: Under what circumstances do we see contradictory interpretations of the *same* piece of evidence?**
    *   **Conclusion:** The methodology successfully identified URLs that were cited multiple times within the dataset, providing **candidates for investigating contradictory interpretations**. However, a detailed qualitative analysis comparing the text, sentiment (VADER), and stance (OpenAI) of comments citing the *same* URL is required to definitively answer this question. The framework is established, but the specific analysis of these instances was not completed in this computational pass.

4.  **RQ4: Are these patterns different in subreddits with strict moderation policies versus those with more lenient or minimal rules?**
    *   **Conclusion:** A difference in **stance distribution** is expected, with `r/EnoughMuskSpam` presumably showing a much higher concentration of "Anti-Musk" stances than the likely more mixed distribution in `r/technology`, but this has not yet been confirmed by a direct per-subreddit comparison. The **low overall usage of evidence indicators (<7%)**, by contrast, might be a **shared characteristic** across both subreddits for this topic, suggesting that community norms around explicit evidence citation might be similarly low in both contexts *for this specific type of discussion*, despite potentially different moderation styles or prevailing sentiments. The distinct stance profiles likely influence *how* arguments are made and received more than the *frequency* of citing external links or keywords using the methods employed here. Further analysis comparing interaction patterns within the Gephi network for each subreddit is needed.
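
The follow-up analysis outlined for RQ3 could start by listing URLs that are cited under both "Pro-Musk" and "Anti-Musk" stances. The sketch below is a possible starting point under stated assumptions: the URL pattern, the `urls` column, and the toy comments are hypothetical, and matching on the raw URL string ignores variants such as tracking parameters or shortened links.

```python
import re
import pandas as pd

# Assumed URL pattern; stops at whitespace and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]]+")

# Invented comments with stance labels.
toy = pd.DataFrame({
    "text": [
        "Proof he is lying: https://example.com/a",
        "This shows he was right all along https://example.com/a",
        "See https://example.com/b",
        "No link here",
    ],
    "openai_stance": ["Anti-Musk", "Pro-Musk", "Anti-Musk", "Neutral/Mixed"],
})

# One row per (comment, URL) pair; comments without URLs are dropped.
toy["urls"] = toy["text"].apply(URL_PATTERN.findall)
cited = toy.explode("urls").dropna(subset=["urls"])

# Count how often each URL is cited under each stance.
by_url = pd.crosstab(cited["urls"], cited["openai_stance"])

# Candidates: URLs cited at least once from each opposing stance.
candidates = by_url[(by_url.get("Pro-Musk", 0) > 0) & (by_url.get("Anti-Musk", 0) > 0)]
print(candidates)
```

The resulting rows are only candidates: whether the two sides actually interpret the source differently still has to be judged by reading the comments themselves.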

Overall, the findings suggest that while discussions about Elon Musk are polarized (predominantly critical in this sample), the explicit use of external evidence indicators is not a common feature. Engagement appears linked to factors other than providing such evidence, and user activity can be high even without it, sometimes in surprising ways ("NotEnoughMuskSpam"). The expected contrast between the stance environments of the two subreddits is the most apparent difference so far, though it still needs confirmation through a direct comparison.