<a href="https://colab.research.google.com/github/FaarisIq/Persuasion-Analysis-Engine/blob/main/Persuasion_Analysis_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Persuasion Analysis Engine - Faaris Iqbal

In [None]:
!pip install praw pandas spacy vaderSentiment
!python -m spacy download en_core_web_sm

In [None]:
import praw
import pandas as pd
import re
import spacy
import json
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from IPython.display import display, HTML
import time
import numpy as np
from datetime import datetime

"""
Persuasion Analysis Engine - Faaris Iqbal

This engine analyzes persuasiveness in the subreddit r/changemyview by
collecting the top posts and scoring them on several weighted factors.

Main features:
- Collects the top posts (set by `limit`, default 3) and their comments via the Reddit API
- Analyzes each argument's structure, evidence quality, and persuasive techniques
- Detects deltas (the award a user gives to a commenter who changed their view)
- Scores persuasiveness from 0 to 1 using a weighted sum
- Exports clean datasets for further research

The persuasion factors analyzed are:
1. Argument Length/Depth (25% weight) - Detailed vs surface level arguments
2. Evidence quality (20% weight) - Academic sources vs blogs
3. Argument sophistication (20% weight) - Logical structure
4. Delta Awards (15% weight) - Proof of persuasion
5. Comment Engagement (10% weight) - Quality of responses and discussion
6. Emotional appeal (10% weight) - Emotional connection and language used

The persuasive techniques detected are:
- Analogies
- Rhetorical questions and direct questions
- Stats and data
- Personal experience
- Moral/ethical appeals
- Authority citations
- Concessions
- Counterargument acknowledgement
- Logical connectors

It outputs:
- 'cmv_posts_analysis.csv' - Main post data with all metrics
- 'cmv_comments_analysis.csv' - Individual comment analysis
- 'cmv_summary_stats.csv' - Aggregated statistics
- 'cmv_top_performers.csv' - Top 20 posts by persuasion score
- 'cmv_data_dictionary.csv' - Column descriptions for the posts dataset
- A printed summary, including the most persuasive posts

To use it:
Run the script; it collects the data and generates the CSV files automatically.
To change how many posts are analyzed, adjust the limit argument passed to
collect_cmv_data() in the main block at the bottom.
"""

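# reddit API setup
# Credentials are read from environment variables so that secrets are not
# stored in the notebook. The variable names below are placeholders; set
# them to your own Reddit app's values before running.
import os

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="changemyview_data_collector_v2"
)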

# nlp setup
nlp = spacy.load("en_core_web_sm")
analyzer = SentimentIntensityAnalyzer()

# scoring weights
SCORING_WEIGHTS = {
    'length': 0.25,
    'evidence': 0.20,
    'sophistication': 0.20,
    'delta': 0.15,
    'engagement': 0.10,
    'emotion': 0.10
}

# text preprocessing
def clean_text(text):
    """Enhanced text cleaning for Reddit content"""
    if pd.isna(text) or not isinstance(text, str):
        return ""

    # Remove Reddit markdown
    text = re.sub(r'\*\*(.*?)\*\*', r'\1', text)
    text = re.sub(r'\*(.*?)\*', r'\1', text)
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)
    text = re.sub(r'&gt;.*?\n', '', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'[^\w\s.,!?;:-]', '', text)

    return text.strip()

def spacy_sent_tokenize(text):
    """sentence tokenization"""
    if not isinstance(text, str) or not text.strip():
        return []

    try:
        doc = nlp(text)
        return [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    except Exception:
        # Fallback to simple splitting if spacy fails
        return [s.strip() for s in text.split('.') if s.strip()]

# delta detection (a delta is awarded to a commenter who changed the awarding user's view)
def extract_actual_deltas(post_comments):
    delta_count = 0
    delta_patterns = [
        r'!delta\b',
        r'∆',
        r'Δ',
        r'awarded.*?delta',
        r'changed my view',
        r'view.*?changed',
        r'cmv.*?successful'
    ]

    for comment in post_comments:
        body = comment.get('body', '').lower()

        # Skip bot/moderator comments
        if any(bot in str(comment.get('author', '')).lower()
               for bot in ['deltabot', 'automoderator', '[deleted]']):
            continue

        # Check for delta indicators
        for pattern in delta_patterns:
            if re.search(pattern, body, re.I):
                delta_count += 1
                break  # Max one delta per comment

    return delta_count

def is_delta_metadata_comment(text):
    """Filter out system delta messages"""
    if not isinstance(text, str):
        return False

    metadata_phrases = [
        "all comments that earned deltas",
        "delta system explained",
        "/r/deltalog",
        "change of view doesn't necessarily mean a reversal",
        "awarded a delta",
        "confirmation that a delta has been awarded"
    ]

    lowered = text.lower()
    return any(phrase in lowered for phrase in metadata_phrases)

# evidence quality analysis
def analyze_source_credibility(text):
    urls = re.findall(r'https?://[^\s<>\[\]]+', text)
    if not urls:
        return 0

    credibility_score = 0

    # High credibility sources
    high_cred_domains = [
        '.edu', '.gov', 'scholar.google', 'jstor', 'pubmed',
        'nature.com', 'science.org', 'cell.com', 'nejm.org',
        'stanford.edu', 'harvard.edu', 'mit.edu'
    ]

    # Medium credibility sources
    medium_cred_domains = [
        'reuters.com', 'ap.org', 'bbc.com', 'npr.org',
        'economist.com', 'wsj.com', 'nytimes.com',
        'washingtonpost.com', 'theguardian.com'
    ]

    # Low credibility indicators
    low_cred_indicators = [
        'blog', 'wordpress', 'medium.com', 'reddit.com',
        'youtube.com', 'twitter.com', 'facebook.com'
    ]

    for url in urls:
        url_lower = url.lower()

        if any(domain in url_lower for domain in high_cred_domains):
            credibility_score += 1.0
        elif any(domain in url_lower for domain in medium_cred_domains):
            credibility_score += 0.7
        elif any(indicator in url_lower for indicator in low_cred_indicators):
            credibility_score += 0.2
        else:
            credibility_score += 0.4  # Generic web source

    return min(credibility_score / len(urls), 1.0)

# argument sophistication analysis
def detect_argument_sophistication(text):
    """Measure argument quality and sophistication"""
    if not text:
        return 0

    text_lower = text.lower()
    sophistication_score = 0

    # Counterargument acknowledgment (high value)
    counter_patterns = [
        r"some might argue", r"critics say", r"on the other hand",
        r"however", r"nevertheless", r"although", r"admittedly",
        r"while.*?true", r"granted", r"i understand.*?but",
        r"fair point.*?but", r"you could argue"
    ]
    counter_count = sum(1 for p in counter_patterns if re.search(p, text_lower))
    sophistication_score += min(counter_count * 0.15, 0.4)

    # Qualification/nuance
    nuance_patterns = [
        r"in some cases", r"generally", r"tends to", r"often",
        r"usually", r"primarily", r"largely", r"typically",
        r"in most cases", r"under certain conditions"
    ]
    nuance_count = sum(1 for p in nuance_patterns if re.search(p, text_lower))
    sophistication_score += min(nuance_count * 0.08, 0.3)

    # Evidence integration
    evidence_patterns = [
        r"according to", r"research shows", r"studies indicate",
        r"data suggests", r"evidence shows", r"surveys found",
        r"analysis reveals", r"statistics show"
    ]
    evidence_count = sum(1 for p in evidence_patterns if re.search(p, text_lower))
    sophistication_score += min(evidence_count * 0.1, 0.3)

    # Logical structure indicators
    logic_patterns = [
        r"first", r"second", r"third", r"finally",
        r"therefore", r"thus", r"consequently", r"as a result",
        r"this leads to", r"it follows that"
    ]
    logic_count = sum(1 for p in logic_patterns if re.search(p, text_lower))
    sophistication_score += min(logic_count * 0.05, 0.2)

    return min(sophistication_score, 1.0)

# persuasive Features
def analyze_persuasive_features(text):
    """Comprehensive persuasive element detection"""
    if not text:
        return {}

    text_lower = text.lower()

    features = {
        'analogies': len(re.findall(r'\b(like|as if|similar to|just as|imagine if)\b', text_lower)),
        'questions': len(re.findall(r'\?', text)),
        'statistics': len(re.findall(r'\b\d+(\.\d+)?%?|\bpercent\b|\bratio\b|\btimes\b', text_lower)),
        'hedging': len(re.findall(r'\b(i think|maybe|possibly|could be|might be|seems like)\b', text_lower)),
        'personal_experience': len(re.findall(r'\b(i have|i\'ve|my experience|personally|i witnessed)\b', text_lower)),
        'moral_appeals': len(re.findall(r'\b(moral|ethics|right|wrong|should|ought|duty|responsibility)\b', text_lower)),
        'emotional_appeals': len(re.findall(r'\b(feel|emotion|heart|compassion|empathy|sympathy)\b', text_lower)),
        'authority_appeals': len(re.findall(r'\b(expert|professor|doctor|researcher|authority|official)\b', text_lower)),
        'consensus_appeals': len(re.findall(r'\b(everyone|most people|society|generally accepted|common sense)\b', text_lower)),
        'concessions': len(re.findall(r'\b(i admit|you\'re right|fair point|i concede|granted)\b', text_lower))
    }

    return features

def score_persuasive_features(features):
    """Convert feature counts to normalized score"""
    if not features:
        return 0

    weights = {
        'analogies': 0.15,
        'questions': 0.10,
        'statistics': 0.20,
        'hedging': 0.05,
        'personal_experience': 0.12,
        'moral_appeals': 0.10,
        'emotional_appeals': 0.08,
        'authority_appeals': 0.15,
        'consensus_appeals': 0.10,
        'concessions': 0.15
    }

    score = 0
    for feature, count in features.items():
        if feature in weights:
            normalized_count = min(count / 3, 1.0)
            score += weights[feature] * normalized_count

    return min(score, 1.0)

# comment engagement analysis
def analyze_comment_engagement(comments_data):
    """Analyze quality of community response"""
    if not comments_data:
        return 0

    engagement_score = 0

    # Average comment length (longer = more thoughtful)
    avg_length = np.mean([len(c.get('body', '')) for c in comments_data])
    length_score = min(avg_length / 500, 0.4)

    # Comment depth (replies to replies = deeper engagement)
    root_comments = [c for c in comments_data if c.get('is_root', False)]
    non_root_comments = [c for c in comments_data if not c.get('is_root', False)]
    depth_score = min(len(non_root_comments) / max(len(root_comments), 1) * 0.2, 0.3)

    # Quality indicators in comments
    quality_indicators = 0
    for comment in comments_data[:10]:  # Check top 10 comments
        body = comment.get('body', '').lower()

        # Positive engagement
        if any(word in body for word in ['interesting', 'good point', 'valid', 'thoughtful']):
            quality_indicators += 0.1

        # Constructive disagreement
        if any(word in body for word in ['however', 'but consider', 'what about', 'counterpoint']):
            quality_indicators += 0.15

        # Evidence sharing
        if any(word in body for word in ['source', 'link', 'study', 'research']):
            quality_indicators += 0.1

    engagement_score = length_score + depth_score + min(quality_indicators, 0.3)
    return min(engagement_score, 1.0)

# Sentiment Analysis
def get_enhanced_emotion_scores(text):
    """Combine VADER with emotional analysis that is domain-based"""
    vader_scores = analyzer.polarity_scores(text)

    # CMV-specific emotional indicators
    persuasive_emotions = [
        'understand', 'realize', 'believe', 'feel', 'think',
        'important', 'significant', 'crucial', 'essential'
    ]

    emotional_intensity = sum(1 for word in persuasive_emotions if word in text.lower())
    emotional_score = min(emotional_intensity / 15, 0.5)  # Normalize

    return {
        'vader_compound': vader_scores['compound'],
        'vader_pos': vader_scores['pos'],
        'vader_neg': vader_scores['neg'],
        'vader_neu': vader_scores['neu'],
        'persuasive_emotion': emotional_score
    }

# persuasion scoring function
def compute_enhanced_persuasion_score(post_text, comments_data, actual_deltas=0):
    """Comprehensive persuasion scoring with improved methodology"""

    if not post_text:
        return 0

    # 1. Length/Depth Analysis
    sentences = spacy_sent_tokenize(post_text)
    length_score = min(len(sentences) / 25, 1)  # Optimal around 25 sentences

    # 2. Evidence Quality
    # NOTE: clean_text() strips URLs, so this is 0 when post_text has already been cleaned
    evidence_score = analyze_source_credibility(post_text)

    # 3. Argument Sophistication
    sophistication_score = detect_argument_sophistication(post_text)

    # 4. Delta Integration (ground truth)
    delta_score = min(actual_deltas * 0.25, 1.0)  # Each delta worth 0.25, cap at 1.0

    # 5. Comment Engagement Quality
    engagement_score = analyze_comment_engagement(comments_data)

    # 6. Emotional Connection
    emotion_data = get_enhanced_emotion_scores(post_text)
    emotion_score = emotion_data['persuasive_emotion']

    # Weighted final score
    total_score = (
        SCORING_WEIGHTS['length'] * length_score +
        SCORING_WEIGHTS['evidence'] * evidence_score +
        SCORING_WEIGHTS['sophistication'] * sophistication_score +
        SCORING_WEIGHTS['delta'] * delta_score +
        SCORING_WEIGHTS['engagement'] * engagement_score +
        SCORING_WEIGHTS['emotion'] * emotion_score
    )

    return round(min(total_score, 1.0), 3)

def format_timestamp(unix_timestamp):
    """Convert Unix timestamp to readable format"""
    try:
        return datetime.fromtimestamp(unix_timestamp).strftime('%Y-%m-%d %H:%M:%S')
    except (TypeError, ValueError, OverflowError, OSError):
        return ""

# data collection
def collect_cmv_data(limit=3, time_filter="year"):
    print(f" Starting enhanced CMV data collection (limit: {limit})")

    subreddit = reddit.subreddit("changemyview")
    posts_data = []
    comments_data = []

    for i, post in enumerate(subreddit.top(time_filter=time_filter, limit=limit)):
        print(f" Processing post {i+1}/{limit}: {post.title[:50]}...")

        # Get comments
        post.comment_sort = 'top'
        post.comments.replace_more(limit=5)

        post_comments = []

        for comment in post.comments.list():
            if is_delta_metadata_comment(comment.body):
                continue

            cleaned_body = clean_text(comment.body)
            if len(cleaned_body) < 10:  # Skip very short comments
                continue

            # Analyze individual comment
            comment_features = analyze_persuasive_features(cleaned_body)
            comment_emotion = get_enhanced_emotion_scores(cleaned_body)
            comment_args = spacy_sent_tokenize(cleaned_body)

            comment_data = {
                'post_id': post.id,
                'comment_id': comment.id,
                'author': str(comment.author),
                'comment_text': cleaned_body,
                'comment_score': comment.score,
                'created_timestamp': format_timestamp(comment.created_utc),
                'created_utc': comment.created_utc,
                'is_root_comment': comment.is_root,  # parent_id carries a "t3_" prefix, so it never equals post.id
                'comment_word_count': len(cleaned_body.split()),
                'comment_sentence_count': len(comment_args),

                # Persuasive features (flattened)
                'analogies_count': comment_features.get('analogies', 0),
                'questions_count': comment_features.get('questions', 0),
                'statistics_count': comment_features.get('statistics', 0),
                'hedging_count': comment_features.get('hedging', 0),
                'personal_experience_count': comment_features.get('personal_experience', 0),
                'moral_appeals_count': comment_features.get('moral_appeals', 0),
                'emotional_appeals_count': comment_features.get('emotional_appeals', 0),
                'authority_appeals_count': comment_features.get('authority_appeals', 0),
                'consensus_appeals_count': comment_features.get('consensus_appeals', 0),
                'concessions_count': comment_features.get('concessions', 0),

                # Sentiment scores (flattened)
                'vader_compound': comment_emotion['vader_compound'],
                'vader_positive': comment_emotion['vader_pos'],
                'vader_negative': comment_emotion['vader_neg'],
                'vader_neutral': comment_emotion['vader_neu'],
                'persuasive_emotion_score': comment_emotion['persuasive_emotion']
            }

            comments_data.append(comment_data)
            post_comments.append({
                'body': cleaned_body,
                'author': str(comment.author),
                'score': comment.score,
                'created_utc': comment.created_utc,
                'parent_id': comment.parent_id,
                'comment_id': comment.id,
                'is_root': comment.is_root,  # parent_id carries a "t3_" prefix, so it never equals post.id
                'features': comment_features,
                'emotion': comment_emotion,
                'arg_units': comment_args
            })
        # metrics used in persuasion score
        post_clean = clean_text(post.selftext)
        post_features = analyze_persuasive_features(post_clean)
        post_emotion = get_enhanced_emotion_scores(post_clean)
        post_args = spacy_sent_tokenize(post_clean)
        actual_deltas = extract_actual_deltas(post_comments)
        persuasion_score = compute_enhanced_persuasion_score(
            post_clean, post_comments, actual_deltas
        )

        # additional metrics
        sophistication_score = detect_argument_sophistication(post_clean)
        # Use the raw text here: clean_text() strips URLs, which this check needs
        evidence_quality = analyze_source_credibility(post.selftext)
        engagement_score = analyze_comment_engagement(post_comments)

        # Create clean post record
        post_data = {
            # Basic post info
            'post_id': post.id,
            'title': post.title,
            'author': str(post.author),
            'post_text': post_clean,  # Full cleaned text
            'original_post_text': post.selftext,  # Original for reference
            'created_timestamp': format_timestamp(post.created_utc),
            'created_utc': post.created_utc,
            'post_score': post.score,
            'num_comments': post.num_comments,
            'post_url': f"https://reddit.com{post.permalink}",

            # Text metrics
            'post_word_count': len(post_clean.split()),
            'post_sentence_count': len(post_args),
            'post_character_count': len(post_clean),

            # Persuasion scores
            'persuasion_score': persuasion_score,
            'argument_sophistication_score': sophistication_score,
            'evidence_quality_score': evidence_quality,
            'comment_engagement_score': engagement_score,
            'delta_count': actual_deltas,
            'has_deltas': actual_deltas > 0,

            # Persuasive features (flattened from post_features)
            'analogies_count': post_features.get('analogies', 0),
            'questions_count': post_features.get('questions', 0),
            'statistics_count': post_features.get('statistics', 0),
            'hedging_count': post_features.get('hedging', 0),
            'personal_experience_count': post_features.get('personal_experience', 0),
            'moral_appeals_count': post_features.get('moral_appeals', 0),
            'emotional_appeals_count': post_features.get('emotional_appeals', 0),
            'authority_appeals_count': post_features.get('authority_appeals', 0),
            'consensus_appeals_count': post_features.get('consensus_appeals', 0),
            'concessions_count': post_features.get('concessions', 0),

            # Boolean flags for easy filtering
            'has_analogies': post_features.get('analogies', 0) > 0,
            'has_statistics': post_features.get('statistics', 0) > 0,
            'has_personal_experience': post_features.get('personal_experience', 0) > 0,
            'has_moral_appeals': post_features.get('moral_appeals', 0) > 0,
            'has_authority_appeals': post_features.get('authority_appeals', 0) > 0,

            # Sentiment scores (flattened from post_emotion)
            'vader_compound': post_emotion['vader_compound'],
            'vader_positive': post_emotion['vader_pos'],
            'vader_negative': post_emotion['vader_neg'],
            'vader_neutral': post_emotion['vader_neu'],
            'persuasive_emotion_score': post_emotion['persuasive_emotion'],

            # Engagement metrics
            'root_comments_count': len([c for c in post_comments if c.get('is_root', False)]),
            'reply_comments_count': len([c for c in post_comments if not c.get('is_root', False)]),
            'avg_comment_length': np.mean([len(c.get('body', '')) for c in post_comments]) if post_comments else 0,
            'total_comment_words': sum([len(c.get('body', '').split()) for c in post_comments])
        }

        posts_data.append(post_data)

        # Brief pause to avoid rate limiting
        time.sleep(0.5)

    print(f" Data collection complete. Collected {len(posts_data)} posts and {len(comments_data)} comments")
    return posts_data, comments_data

# Enhanced analysis + export functions
def save_clean_datasets(posts_data, comments_data):
    """Save multiple clean CSV files for different analysis needs"""

    # 1. Main posts dataset
    posts_df = pd.DataFrame(posts_data)
    posts_df.to_csv('cmv_posts_analysis.csv', index=False)
    print(f" Posts data saved to 'cmv_posts_analysis.csv' ({len(posts_df)} rows)")

    # 2. Comments dataset (if we have comment data)
    if comments_data:
        comments_df = pd.DataFrame(comments_data)
        comments_df.to_csv('cmv_comments_analysis.csv', index=False)
        print(f" Comments data saved to 'cmv_comments_analysis.csv' ({len(comments_df)} rows)")

    # 3. Summary statistics dataset
    summary_stats = create_summary_statistics(posts_df)
    summary_df = pd.DataFrame([summary_stats])
    summary_df.to_csv('cmv_summary_stats.csv', index=False)
    print(f" Summary statistics saved to 'cmv_summary_stats.csv'")

    # 4. Top performers dataset (for easy reference)
    top_performers = posts_df.nlargest(20, 'persuasion_score')[
        ['post_id', 'title', 'persuasion_score', 'delta_count', 'post_score',
         'post_word_count', 'has_deltas', 'created_timestamp']
    ].copy()
    top_performers['rank'] = range(1, len(top_performers) + 1)
    top_performers.to_csv('cmv_top_performers.csv', index=False)
    print(f" Top performers saved to 'cmv_top_performers.csv' ({len(top_performers)} rows)")

    return posts_df, (comments_df if comments_data else None), summary_df

def create_summary_statistics(df):
    """Create summary stats"""
    return {
        'analysis_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'total_posts_analyzed': len(df),
        'avg_persuasion_score': round(df['persuasion_score'].mean(), 3),
        'median_persuasion_score': round(df['persuasion_score'].median(), 3),
        'std_persuasion_score': round(df['persuasion_score'].std(), 3),
        'max_persuasion_score': round(df['persuasion_score'].max(), 3),
        'min_persuasion_score': round(df['persuasion_score'].min(), 3),

        'posts_with_deltas': int((df['delta_count'] > 0).sum()),
        'posts_with_deltas_percent': round((df['delta_count'] > 0).mean() * 100, 1),
        'total_deltas_awarded': int(df['delta_count'].sum()),
        'avg_deltas_per_post': round(df['delta_count'].mean(), 2),
        'max_deltas_single_post': int(df['delta_count'].max()),

        'avg_post_word_count': round(df['post_word_count'].mean(), 0),
        'avg_post_sentence_count': round(df['post_sentence_count'].mean(), 1),
        'avg_reddit_score': round(df['post_score'].mean(), 1),
        'avg_comment_count': round(df['num_comments'].mean(), 1),

        # Persuasive technique prevalence
        'posts_with_analogies_percent': round((df['has_analogies']).mean() * 100, 1),
        'posts_with_statistics_percent': round((df['has_statistics']).mean() * 100, 1),
        'posts_with_personal_experience_percent': round((df['has_personal_experience']).mean() * 100, 1),
        'posts_with_moral_appeals_percent': round((df['has_moral_appeals']).mean() * 100, 1),
        'posts_with_authority_appeals_percent': round((df['has_authority_appeals']).mean() * 100, 1),

        # Quality metrics
        'avg_argument_sophistication': round(df['argument_sophistication_score'].mean(), 3),
        'avg_evidence_quality': round(df['evidence_quality_score'].mean(), 3),
        'avg_comment_engagement': round(df['comment_engagement_score'].mean(), 3),

        # Sentiment distribution
        'avg_vader_compound': round(df['vader_compound'].mean(), 3),
        'positive_sentiment_posts_percent': round((df['vader_compound'] > 0.1).mean() * 100, 1),
        'negative_sentiment_posts_percent': round((df['vader_compound'] < -0.1).mean() * 100, 1),
        'neutral_sentiment_posts_percent': round((abs(df['vader_compound']) <= 0.1).mean() * 100, 1)
    }

def display_enhanced_analysis_summary(posts_df, comments_df=None):
    """Enhanced summary with actionable insights"""
    print("\n" + "="*60)
    print(" PERSUASION ANALYSIS SUMMARY")
    print("="*60)

    # Basic stats
    print(f" Dataset Overview:")
    print(f"   • Total posts analyzed: {len(posts_df):,}")
    if comments_df is not None:
        print(f"   • Total comments analyzed: {len(comments_df):,}")
    print(f"   • Analysis completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    # Persuasion metrics
    print(f"\n Persuasion Metrics:")
    print(f"   • Average persuasion score: {posts_df['persuasion_score'].mean():.3f}")
    print(f"   • Highest persuasion score: {posts_df['persuasion_score'].max():.3f}")
    print(f"   • Posts with deltas: {(posts_df['delta_count'] > 0).sum()} ({(posts_df['delta_count'] > 0).mean()*100:.1f}%)")
    print(f"   • Total deltas awarded: {posts_df['delta_count'].sum()}")
    print(f"   • Average deltas per post: {posts_df['delta_count'].mean():.2f}")

    # Content analysis
    print(f"\n Content Analysis:")
    print(f"   • Average word count: {posts_df['post_word_count'].mean():.0f}")
    print(f"   • Average sentence count: {posts_df['post_sentence_count'].mean():.1f}")
    print(f"   • Posts with statistics: {posts_df['has_statistics'].mean()*100:.1f}%")
    print(f"   • Posts with personal experience: {posts_df['has_personal_experience'].mean()*100:.1f}%")
    print(f"   • Posts with moral appeals: {posts_df['has_moral_appeals'].mean()*100:.1f}%")

    # Quality indicators
    print(f"\n Quality Indicators:")
    print(f"   • Average argument sophistication: {posts_df['argument_sophistication_score'].mean():.3f}")
    print(f"   • Average evidence quality: {posts_df['evidence_quality_score'].mean():.3f}")
    print(f"   • Average comment engagement: {posts_df['comment_engagement_score'].mean():.3f}")

    # Top performing posts
    print(f"\n Most Persuasive Posts:")
    top_posts = posts_df.nlargest(5, 'persuasion_score')[
        ['title', 'persuasion_score', 'delta_count', 'post_score']
    ]

    for i, (_, row) in enumerate(top_posts.iterrows(), 1):
        title_truncated = row['title'][:55] + "..." if len(row['title']) > 55 else row['title']
        print(f"   {i}. [{row['persuasion_score']:.3f}] {title_truncated}")
        print(f"      Deltas: {row['delta_count']} | Reddit Score: {row['post_score']}")

    # Correlation insights
    print(f"\n Key Correlations:")
    corr_deltas = posts_df['persuasion_score'].corr(posts_df['delta_count'])
    corr_reddit_score = posts_df['persuasion_score'].corr(posts_df['post_score'])
    corr_word_count = posts_df['persuasion_score'].corr(posts_df['post_word_count'])

    print(f"   • Persuasion score ↔ Delta count: {corr_deltas:.3f}")
    print(f"   • Persuasion score ↔ Reddit score: {corr_reddit_score:.3f}")
    print(f"   • Persuasion score ↔ Word count: {corr_word_count:.3f}")

    print(f"\n Files Generated:")
    print(f"   • cmv_posts_analysis.csv - Main dataset with all post metrics")
    if comments_df is not None:
        print(f"   • cmv_comments_analysis.csv - Individual comment analysis")
    print(f"   • cmv_summary_stats.csv - Aggregated statistics")
    print(f"   • cmv_top_performers.csv - Top 20 most persuasive posts")

def create_data_dictionary():
    """Generate a data dictionary for the CSV files"""

    posts_dictionary = {
        'Column Name': [
            'post_id', 'title', 'author', 'post_text', 'original_post_text',
            'created_timestamp', 'created_utc', 'post_score', 'num_comments',
            'post_url', 'post_word_count', 'post_sentence_count', 'post_character_count',
            'persuasion_score', 'argument_sophistication_score', 'evidence_quality_score',
            'comment_engagement_score', 'delta_count', 'has_deltas',
            'analogies_count', 'questions_count', 'statistics_count', 'hedging_count',
            'personal_experience_count', 'moral_appeals_count', 'emotional_appeals_count',
            'authority_appeals_count', 'consensus_appeals_count', 'concessions_count',
            'has_analogies', 'has_statistics', 'has_personal_experience',
            'has_moral_appeals', 'has_authority_appeals',
            'vader_compound', 'vader_positive', 'vader_negative', 'vader_neutral',
            'persuasive_emotion_score', 'root_comments_count', 'reply_comments_count',
            'avg_comment_length', 'total_comment_words'
        ],
        'Description': [
            'Unique Reddit post identifier',
            'Post title text',
            'Reddit username of post author',
            'Cleaned post content text',
            'Original unprocessed post text',
            'Human-readable creation timestamp',
            'Unix timestamp of post creation',
            'Reddit upvote score',
            'Total number of comments',
            'Direct URL to Reddit post',
            'Number of words in post',
            'Number of sentences in post',
            'Total character count',
            'Overall persuasion score (0-1)',
            'Argument sophistication score (0-1)',
            'Quality of cited evidence (0-1)',
            'Comment engagement quality (0-1)',
            'Number of delta awards received',
            'Boolean: post received any deltas',
            'Count of analogies used',
            'Count of questions asked',
            'Count of statistics/numbers cited',
            'Count of hedging language used',
            'Count of personal experience references',
            'Count of moral/ethical appeals',
            'Count of emotional language',
            'Count of authority citations',
            'Count of consensus appeals',
            'Count of concessions made',
            'Boolean: contains analogies',
            'Boolean: contains statistics',
            'Boolean: contains personal experience',
            'Boolean: contains moral appeals',
            'Boolean: contains authority appeals',
            'VADER sentiment compound score (-1 to 1)',
            'VADER positive sentiment (0-1)',
            'VADER negative sentiment (0-1)',
            'VADER neutral sentiment (0-1)',
            'Domain-specific persuasive emotion score',
            'Number of top-level comments',
            'Number of reply comments',
            'Average length of comments',
            'Total words across all comments'
        ],
        'Data Type': [
            'string', 'string', 'string', 'string', 'string',
            'datetime', 'integer', 'integer', 'integer', 'string',
            'integer', 'integer', 'integer', 'float', 'float', 'float',
            'float', 'integer', 'boolean', 'integer', 'integer', 'integer',
            'integer', 'integer', 'integer', 'integer', 'integer', 'integer',
            'integer', 'boolean', 'boolean', 'boolean', 'boolean', 'boolean',
            'float', 'float', 'float', 'float', 'float', 'integer', 'integer',
            'float', 'integer'
        ]
    }

    dictionary_df = pd.DataFrame(posts_dictionary)
    dictionary_df.to_csv('cmv_data_dictionary.csv', index=False)
    print(" Data dictionary saved to 'cmv_data_dictionary.csv'")

    return dictionary_df

# execution
if __name__ == "__main__":
    # Collect data (limit=3 keeps this a quick test run; raise it for a larger sample)
    posts_data, comments_data = collect_cmv_data(limit=3, time_filter="year")

    # Save clean datasets
    posts_df, comments_df, summary_df = save_clean_datasets(posts_data, comments_data)

    # Create data dictionary
    dictionary_df = create_data_dictionary()

    # Display comprehensive summary
    display_enhanced_analysis_summary(posts_df, comments_df)

    print("\n Analysis complete.")
    print(" Check 'cmv_data_dictionary.csv' for column explanations.")
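
The data dictionary built in `create_data_dictionary()` relies on several parallel lists (descriptions, data types, and so on) staying the same length; `pd.DataFrame` raises a `ValueError` when they do not. The sketch below shows one way to check this up front and get a clearer error message. The helper name `check_parallel_columns`, the `'Column'` key, and the toy column names are assumptions made for illustration, not part of the original notebook.

```python
def check_parallel_columns(table):
    """Return the shared length of all lists in `table`, or raise ValueError."""
    lengths = {key: len(values) for key, values in table.items()}
    if len(set(lengths.values())) > 1:
        # Report every list's length so the mismatched one is easy to spot.
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty dict has no columns, so report a length of 0.
    return next(iter(lengths.values()), 0)


# Toy example with illustrative entries only (not the full dictionary).
toy_dictionary = {
    'Column': ['post_id', 'title'],  # hypothetical column names
    'Description': ['Unique Reddit post identifier', 'Post title text'],
    'Data Type': ['string', 'string'],
}
print(check_parallel_columns(toy_dictionary))
```

Calling a check like this just before `pd.DataFrame(posts_dictionary)` would surface a missing or extra entry by list name rather than through a generic pandas error.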

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



 Starting enhanced CMV data collection (limit: 3)



 Processing post 1/3: CMV: Anyone who votes for Trump is completely lack...



 Processing post 2/3: CMV: Voting for Donald Trump in the 2024 election ...



 Processing post 3/3: CMV: The online left has failed young men...



 Data collection complete. Collected 3 posts and 2602 comments
 Posts data saved to 'cmv_posts_analysis.csv' (3 rows)
 Comments data saved to 'cmv_comments_analysis.csv' (2602 rows)
 Summary statistics saved to 'cmv_summary_stats.csv'
 Top performers saved to 'cmv_top_performers.csv' (3 rows)
 Data dictionary saved to 'cmv_data_dictionary.csv'

 PERSUASION ANALYSIS SUMMARY
 Dataset Overview:
   • Total posts analyzed: 3
   • Total comments analyzed: 2,602
   • Analysis completed: 2025-09-02 03:54:19

 Persuasion Metrics:
   • Average persuasion score: 0.472
   • Highest persuasion score: 0.531
   • Posts with deltas: 3 (100.0%)
   • Total deltas awarded: 11
   • Average deltas per post: 3.67

 Content Analysis:
   • Average word count: 572
   • Average sentence count: 26.7
   • Posts with statistics: 100.0%
   • Posts with personal experience: 66.7%
   • Posts with moral appeals: 100.0%

 Quality Indicators:
   • Average argument sophistication: 0.163
   • Average evidence quality: 0.0