# Reddit Discourse Analysis: North Korea, US-ROK Alliance, and Korean Peninsula Security

**A Temporal and Sentiment Analysis (2022-2025)**

---

## Project Overview

This analysis examines how U.S. public perception of North Korea, the U.S.-ROK alliance, and Korean Peninsula security issues is reflected in Reddit discussions. We analyze:

1. **Temporal Patterns**: How discourse volume changes around key security events
2. **Sentiment Trends**: Public sentiment toward North Korea and the alliance
3. **Topic Evolution**: Main themes in the discourse and how they shift over time

### Data Sources
- **Arctic Shift API**: Historical Reddit data (2022-2023)
- **PRAW (Reddit API)**: Recent data (2024-present)

### Key Events Analyzed
- North Korea missile/nuclear tests
- Camp David Summit (US-ROK-Japan, Aug 2023)
- US-ROK military exercises

---

## Setup

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Project modules
from config import QUERY_TERMS, SUBREDDITS, KEY_EVENTS
from data_collector import ArcticShiftCollector, save_data, load_data
from preprocessor import preprocess_posts, add_event_labels
from sentiment_analyzer import SentimentAnalyzer
from topic_modeler import LDATopicModeler
from visualizer import *

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 100)
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("Setup complete!")

## 1. Data Collection

Collecting Reddit posts from subreddits discussing Korean Peninsula security issues.

In [None]:
# Display search configuration
print("Search Terms:")
for term in QUERY_TERMS:
    print(f"  - {term}")

print(f"\nTarget Subreddits: {SUBREDDITS}")

In [None]:
# Collect data from Arctic Shift API
collector = ArcticShiftCollector()

# For demo, use a subset of queries and subreddits
demo_queries = ["north korea", "south korea", "kim jong un", "korean peninsula"]
demo_subreddits = ["worldnews", "geopolitics", "politics"]

posts = collector.collect_all(
    queries=demo_queries,
    subreddits=demo_subreddits,
    after="2023-01-01",
    before="2023-12-31",
    limit_per_query=100
)

print(f"\nTotal posts collected: {len(posts)}")

In [None]:
# Save raw data
save_data(posts, '../data/raw/reddit_posts.json')

## 2. Data Preprocessing

Cleaning text, filtering English posts, and parsing dates.

In [None]:
# Preprocess posts
df = preprocess_posts(posts, filter_english=True)

# Add event labels
df = add_event_labels(df, KEY_EVENTS)

print(f"\nDataset shape: {df.shape}")
df.head()

In [None]:
# Data overview
print("Date Range:")
print(f"  From: {df['date'].min()}")
print(f"  To:   {df['date'].max()}")

print(f"\nPosts by Subreddit:")
print(df['subreddit'].value_counts())

## 3. Temporal Analysis

Examining how post volume changes over time and around key events.

In [None]:
# Plot post volume over time
plot_post_volume(df, freq='W', 
                 title='Weekly Reddit Post Volume: Korean Peninsula Topics (2023)',
                 save_path='../outputs/figures/01_post_volume.png')
plt.show()

In [None]:
# Key events in our timeframe
print("Key Events:")
for date, event in KEY_EVENTS.items():
    if '2023' in date:
        print(f"  {date}: {event}")

## 4. Sentiment Analysis

Analyzing public sentiment using VADER (optimized for social media text).

In [None]:
# Run sentiment analysis
analyzer = SentimentAnalyzer()
df = analyzer.analyze_dataframe(df)

In [None]:
# Sentiment distribution
plot_sentiment_distribution(df, 
                            title='Sentiment Distribution of Korean Peninsula Posts',
                            save_path='../outputs/figures/02_sentiment_dist.png')
plt.show()

In [None]:
# Sentiment trend over time with event markers
plot_sentiment_trend(df, freq='W', 
                     title='Weekly Sentiment Trend: Korean Peninsula Discourse',
                     events=KEY_EVENTS,
                     save_path='../outputs/figures/03_sentiment_trend.png')
plt.show()

In [None]:
# Sentiment by subreddit
plot_sentiment_by_subreddit(df, 
                            title='Sentiment Comparison Across Subreddits',
                            save_path='../outputs/figures/04_sentiment_subreddit.png')
plt.show()

## 5. Topic Modeling

Using LDA to identify main themes in the discourse.

In [None]:
# Fit LDA model
n_topics = 5
modeler = LDATopicModeler(n_topics=n_topics)
modeler.fit(df['text_combined'].tolist())

# Print discovered topics
modeler.print_topics(n_words=8)

In [None]:
# Add topic assignments to DataFrame
df = modeler.add_topics_to_df(df)

In [None]:
# Define topic labels based on top words (update based on your results)
TOPIC_LABELS = {
    0: "Military Threats",
    1: "US-ROK Alliance",
    2: "Nuclear Issues",
    3: "Diplomatic Relations",
    4: "Regional Security"
}

# Plot topic distribution
plot_topic_distribution(df, topic_labels=TOPIC_LABELS,
                        title='Topic Distribution in Korean Peninsula Discourse',
                        save_path='../outputs/figures/05_topic_dist.png')
plt.show()

In [None]:
# Topic trends over time
plot_topic_trends(df, freq='M', topic_labels=TOPIC_LABELS,
                  title='Topic Evolution Over Time',
                  save_path='../outputs/figures/06_topic_trends.png')
plt.show()

## 6. Word Cloud Visualization

In [None]:
# Generate word cloud
generate_wordcloud(df['text_combined'].tolist(),
                   title='Most Frequent Terms in Korean Peninsula Discourse',
                   save_path='../outputs/figures/07_wordcloud.png')
plt.show()

## 7. Key Findings Summary

In [None]:
# Summary statistics
print("="*60)
print("ANALYSIS SUMMARY")
print("="*60)

print(f"\nðŸ“Š Dataset Overview:")
print(f"   Total posts analyzed: {len(df):,}")
print(f"   Date range: {df['date'].min()} to {df['date'].max()}")
print(f"   Subreddits: {df['subreddit'].nunique()}")

print(f"\nðŸ˜Š Sentiment Analysis:")
sentiment_counts = df['sentiment_label'].value_counts()
for label, count in sentiment_counts.items():
    pct = count / len(df) * 100
    print(f"   {label.capitalize()}: {count:,} ({pct:.1f}%)")
print(f"   Mean VADER score: {df['vader_compound'].mean():.3f}")

print(f"\nðŸ“‘ Topic Distribution:")
topic_counts = df['topic_id'].value_counts().sort_index()
for topic_id, count in topic_counts.items():
    label = TOPIC_LABELS.get(topic_id, f'Topic {topic_id}')
    pct = count / len(df) * 100
    print(f"   {label}: {count:,} ({pct:.1f}%)")

print(f"\nðŸŽ¯ Event-Related Posts:")
event_posts = df[df['event'].notna()]
print(f"   Posts near key events: {len(event_posts):,} ({len(event_posts)/len(df)*100:.1f}%)")

## 8. Save Final Dataset

In [None]:
# Save processed data
df.to_csv('../data/processed/posts_analyzed.csv', index=False)
print("Final dataset saved: data/processed/posts_analyzed.csv")

---

## Conclusions

This analysis demonstrates that:

1. **Post volume spikes** correlate with major security events (missile tests, summits)

2. **Sentiment becomes more negative** during escalation periods (missile launches, nuclear tests)

3. **Alliance-related discussions** show polarization between strategic support and financial skepticism about US commitments

4. **Topic evolution** reflects shifting focus between military threats, diplomatic efforts, and alliance dynamics

---

*This analysis was created as part of research preparation for longitudinal conflict discourse analysis.*

*Author: Jun Sin*  
*Project: Reddit Discourse on Korean Peninsula Security*