# Source-Level Sentiment Analysis

Not all news sources are created equal. Some outlets run sensational headlines while others stick to dry factual reporting. Some focus heavily on a few tickers while others spread their coverage thin.

This notebook explores **how sentiment varies by news source** (the `domain` column in our GDELT data). The goal is to understand:

1. Which sources contribute the most coverage?
2. Do certain sources skew positive or negative?
3. How does source sentiment vary by ticker?
4. Are there sources we might want to weight differently (or exclude) in downstream analysis?

This kind of analysis matters because if we later build signals from sentiment, we don't want a single noisy source dominating the signal.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Project path setup (works from notebook dir or project root)
current = Path.cwd()
while not (current / "data").exists() and current != current.parent:
    current = current.parent
PROJECT_ROOT = current

# We need the sentiment-scored data
# If gdelt_articles_with_sentiment.csv doesn't exist yet, fall back to clean + compute scores
SENTIMENT_PATH = PROJECT_ROOT / "data" / "processed" / "gdelt_articles_with_sentiment.csv"
CLEAN_PATH = PROJECT_ROOT / "data" / "processed" / "gdelt_articles_clean.csv"

if SENTIMENT_PATH.exists():
    df = pd.read_csv(SENTIMENT_PATH, parse_dates=["seendate"])
    print(f"Loaded {len(df):,} rows from {SENTIMENT_PATH.name}")
elif CLEAN_PATH.exists():
    # Fall back to clean data and add sentiment on the fly
    import sys
    sys.path.insert(0, str(PROJECT_ROOT / "scripts"))
    from add_sentiment import add_sentiment_scores
    
    df = pd.read_csv(CLEAN_PATH, parse_dates=["seendate"])
    df = add_sentiment_scores(df, text_col="title")
    print(f"Loaded {len(df):,} rows from {CLEAN_PATH.name} (computed sentiment on the fly)")
else:
    raise FileNotFoundError("No GDELT data found. Run the pipeline first.")

# Quick sanity check
required_cols = ["domain", "sentiment_score", "ticker", "title", "url"]  # "url" is used for article counts later on
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")
    
print(f"Unique sources: {df['domain'].nunique()}")
print(f"Date range: {df['seendate'].min().date()} to {df['seendate'].max().date()}")

## Part 1: Who's Writing About Our Tickers?

Before we look at sentiment, let's just see which sources show up most often. This tells us where most of our data is coming from—and flags any sources that might be over- or under-represented.

In [None]:
# Article counts by source
source_counts = df.groupby("domain").size().sort_values(ascending=False)

print(f"Total articles: {len(df):,}")
print(f"Total sources: {len(source_counts)}")
print()

# Top 15 sources
top_n = 15
print(f"Top {top_n} sources by article count:")
print("=" * 50)
for i, (domain, count) in enumerate(source_counts.head(top_n).items(), 1):
    pct = 100 * count / len(df)
    print(f"{i:2}. {domain:35} {count:5,} ({pct:5.1f}%)")

# How concentrated is coverage?
top5_share = source_counts.head(5).sum() / len(df) * 100
top10_share = source_counts.head(10).sum() / len(df) * 100
print()
print(f"Top 5 sources account for {top5_share:.1f}% of articles")
print(f"Top 10 sources account for {top10_share:.1f}% of articles")

In [None]:
# Visualize the distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: bar chart of top sources
top_sources = source_counts.head(12)
ax1 = axes[0]
bars = ax1.barh(range(len(top_sources)), top_sources.values, color="steelblue")
ax1.set_yticks(range(len(top_sources)))
ax1.set_yticklabels(top_sources.index)
ax1.invert_yaxis()  # largest at top
ax1.set_xlabel("Article Count")
ax1.set_title("Top 12 Sources by Volume")

# Add count labels
for bar, count in zip(bars, top_sources.values):
    ax1.text(bar.get_width() + 2, bar.get_y() + bar.get_height()/2, 
             f"{count:,}", va="center", fontsize=9)

# Right: cumulative distribution (how many sources to get X% of articles)
ax2 = axes[1]
cumulative = source_counts.cumsum() / source_counts.sum() * 100
ax2.plot(range(1, len(cumulative)+1), cumulative.values, marker=".", markersize=3)
ax2.axhline(80, color="red", linestyle="--", alpha=0.7, label="80% threshold")
ax2.axhline(95, color="orange", linestyle="--", alpha=0.7, label="95% threshold")
ax2.set_xlabel("Number of Sources (ranked by volume)")
ax2.set_ylabel("Cumulative % of Articles")
ax2.set_title("Coverage Concentration")
ax2.legend()
ax2.set_xlim(0, min(50, len(cumulative)))

plt.tight_layout()
plt.show()

# Find how many sources are needed to reach 80% and 95%
# (count the sources still below the threshold, then add the one that crosses it)
n_80 = (cumulative < 80).sum() + 1
n_95 = (cumulative < 95).sum() + 1
print(f"Need {n_80} sources to cover 80% of articles")
print(f"Need {n_95} sources to cover 95% of articles")

## Part 2: Source-Level Sentiment Profiles

Now the interesting part: do different sources have different "tones"?

We'll compute several metrics for each source:
- **avg_sentiment**: mean sentiment score (are they generally positive or negative?)
- **sentiment_std**: how much does their sentiment vary?
- **positive_rate / negative_rate**: what fraction of their articles are positive vs negative?
- **neutral_rate**: what fraction score exactly 0 (no sentiment words matched)?

The neutral rate is particularly interesting: a high neutral rate might mean the source uses vocabulary our lexicon doesn't capture, or that it writes very dry headlines.
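
To make the ambiguity of a zero score concrete, here is a minimal sketch of a lexicon scorer. The word lists and the `toy_score` function are hypothetical and are not the lexicon or logic behind `add_sentiment_scores`; they only illustrate how a headline can score 0 for different reasons.

```python
# Hypothetical toy lexicon, for illustration only (not the project's real lexicon)
POSITIVE = {"surge", "beat", "record", "gain"}
NEGATIVE = {"plunge", "miss", "lawsuit", "loss"}


def toy_score(title: str) -> float:
    """Return (positive hits - negative hits) / total hits, or 0.0 if nothing matched."""
    words = title.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    hits = pos + neg
    return (pos - neg) / hits if hits else 0.0


headlines = [
    "Nvidia shares surge after record quarter",   # two positive hits
    "Tesla to report earnings on Wednesday",      # dry headline, no lexicon words
    "Chipmaker shares rocket on upbeat outlook",  # upbeat tone, but words not in the lexicon
    "Revenue beat offset by lawsuit",             # one positive and one negative hit
]
scores = [toy_score(h) for h in headlines]
for h, s in zip(headlines, scores):
    print(f"{s:+.2f}  {h}")
```

In this toy setup the last three headlines all score 0, although only the second is genuinely neutral. Whether offsetting hits also net to 0 in the real scorer depends on how `add_sentiment_scores` combines matches, which this sketch does not assume.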

In [None]:
# Build source-level aggregations
source_agg = df.groupby("domain").agg(
    article_count=("url", "count"),
    avg_sentiment=("sentiment_score", "mean"),
    median_sentiment=("sentiment_score", "median"),
    sentiment_std=("sentiment_score", "std"),
    positive_rate=("sentiment_score", lambda x: (x > 0).mean()),
    negative_rate=("sentiment_score", lambda x: (x < 0).mean()),
    neutral_rate=("sentiment_score", lambda x: (x == 0).mean()),
).round(3)

# Add "signal_rate" = 1 - neutral_rate (how often the lexicon fires)
source_agg["signal_rate"] = (1 - source_agg["neutral_rate"]).round(3)

# Sort by article count for now
source_agg = source_agg.sort_values("article_count", ascending=False)

print("Source sentiment profiles (top 20 by volume):")
print("=" * 100)
display_cols = ["article_count", "avg_sentiment", "sentiment_std", "positive_rate", "negative_rate", "signal_rate"]
print(source_agg[display_cols].head(20).to_string())

### Which sources skew most positive or negative?

Let's filter to sources with at least 10 articles (so the averages mean something) and see who's on the extremes.
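
One rough way to see why a minimum article count helps: the standard error of a mean shrinks with the square root of the sample size. The sketch below assumes a per-article standard deviation of 0.4, which is a made-up value for illustration and not a number measured from this dataset.

```python
import math


def standard_error(std: float, n: int) -> float:
    """Standard error of a sample mean: std / sqrt(n)."""
    return std / math.sqrt(n)


assumed_std = 0.4  # hypothetical per-article spread, not measured from our data
for n in (3, 10, 50, 200):
    se = standard_error(assumed_std, n)
    print(f"n={n:>3}: standard error ~ {se:.3f}")
```

The cutoff of 10 is still a judgment call. The same quantity could be estimated per source from the `sentiment_std` and `article_count` columns of `source_agg` to check how noisy each average is.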

In [None]:
# Filter to sources with meaningful sample size
min_articles = 10
source_agg_filtered = source_agg[source_agg["article_count"] >= min_articles].copy()
print(f"Sources with >= {min_articles} articles: {len(source_agg_filtered)}")
print()

# Most positive sources
print("Most POSITIVE sources (by avg sentiment):")
print("-" * 60)
most_positive = source_agg_filtered.nlargest(8, "avg_sentiment")
for domain, row in most_positive.iterrows():
    print(f"  {domain:35} avg={row['avg_sentiment']:+.3f}  (n={int(row['article_count'])})")

print()

# Most negative sources
print("Most NEGATIVE sources (by avg sentiment):")
print("-" * 60)
most_negative = source_agg_filtered.nsmallest(8, "avg_sentiment")
for domain, row in most_negative.iterrows():
    print(f"  {domain:35} avg={row['avg_sentiment']:+.3f}  (n={int(row['article_count'])})")

In [None]:
# Scatter: avg sentiment vs signal rate (how often lexicon fires)
# This shows us: are positive sources just using more sentiment words, or are they actually biased?

fig, ax = plt.subplots(figsize=(10, 7))

# Size by article count (log scale for visibility)
sizes = np.log1p(source_agg_filtered["article_count"]) * 15

scatter = ax.scatter(
    source_agg_filtered["signal_rate"],
    source_agg_filtered["avg_sentiment"],
    s=sizes,
    alpha=0.6,
    c=source_agg_filtered["avg_sentiment"],
    cmap="RdYlGn",
    edgecolors="black",
    linewidth=0.5
)

# Reference lines
ax.axhline(0, color="gray", linestyle="--", alpha=0.5)
ax.axvline(source_agg_filtered["signal_rate"].median(), color="gray", linestyle=":", alpha=0.5)

# Label a few notable points:
# the 4 most positive, 3 most negative, and 3 highest-volume sources
to_label = source_agg_filtered.nlargest(4, "avg_sentiment").index.tolist()
to_label += source_agg_filtered.nsmallest(3, "avg_sentiment").index.tolist()
to_label += source_agg_filtered.nlargest(3, "article_count").index.tolist()
to_label = list(set(to_label))  # dedupe

for domain in to_label:
    row = source_agg_filtered.loc[domain]
    ax.annotate(
        domain.replace(".com", "").replace(".org", ""),
        (row["signal_rate"], row["avg_sentiment"]),
        fontsize=8,
        alpha=0.8,
        xytext=(5, 5),
        textcoords="offset points"
    )

ax.set_xlabel("Signal Rate (fraction of articles with sentiment words)")
ax.set_ylabel("Average Sentiment Score")
ax.set_title("Source Sentiment Profile\n(bubble size ~ log of article count, color = sentiment)")
plt.colorbar(scatter, label="Avg Sentiment")
plt.tight_layout()
plt.show()

## Part 3: Source Coverage by Ticker

Do different sources focus on different stocks? This matters because if, say, source A only covers TSLA and source B only covers NVDA, then comparing their sentiment isn't apples-to-apples.

In [None]:
# Cross-tab: source × ticker article counts
source_ticker_counts = pd.crosstab(df["domain"], df["ticker"])

# Focus on top sources
top_sources_list = source_counts.head(12).index.tolist()
source_ticker_top = source_ticker_counts.loc[top_sources_list]

# Normalize by row (what % of each source's articles go to each ticker)
source_ticker_pct = source_ticker_top.div(source_ticker_top.sum(axis=1), axis=0) * 100

print("Ticker coverage by source (% of source's articles):")
print(source_ticker_pct.round(1).to_string())

In [None]:
# Heatmap of ticker coverage
fig, ax = plt.subplots(figsize=(10, 8))

sns.heatmap(
    source_ticker_pct,
    annot=True,
    fmt=".0f",
    cmap="YlOrRd",
    ax=ax,
    cbar_kws={"label": "% of source articles"}
)
ax.set_title("How Sources Distribute Coverage Across Tickers")
ax.set_xlabel("Ticker")
ax.set_ylabel("Source")
plt.tight_layout()
plt.show()

## Part 4: Source × Ticker Sentiment

Now the real question: for the same ticker, do different sources give different sentiment signals?

If source A is always more positive about NVDA than source B, that could be useful info. Or it could just be noise we need to normalize out.
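
One possible way to separate a source's tone from the news flow on each ticker is to subtract the ticker's average sentiment before comparing sources. The sketch below uses a tiny made-up DataFrame with the same column names as `df`; the values and domain names are invented, and `ticker_adjusted` is a hypothetical column name.

```python
import pandas as pd

# Made-up example rows; column names mirror df, values are invented
toy = pd.DataFrame({
    "domain": ["a.com", "a.com", "b.com", "b.com", "a.com", "b.com"],
    "ticker": ["NVDA", "NVDA", "NVDA", "NVDA", "TSLA", "TSLA"],
    "sentiment_score": [0.6, 0.4, 0.2, 0.0, -0.2, -0.4],
})

# Subtract each ticker's mean so sources are compared against that ticker's baseline
ticker_mean = toy.groupby("ticker")["sentiment_score"].transform("mean")
toy["ticker_adjusted"] = toy["sentiment_score"] - ticker_mean

# Average adjusted score per source: positive means more upbeat than
# other sources writing about the same tickers
source_tilt = toy.groupby("domain")["ticker_adjusted"].mean()
print(source_tilt.round(3))
```

This is only one of several reasonable normalizations. With few articles per cell the ticker baseline itself is noisy, so the adjusted values should be read with the same sample-size caution as the raw averages.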

In [None]:
# Source × ticker sentiment
source_ticker_sentiment = df.groupby(["domain", "ticker"]).agg(
    count=("url", "count"),
    avg_sentiment=("sentiment_score", "mean"),
).reset_index()

# Filter to meaningful sample sizes
source_ticker_sentiment = source_ticker_sentiment[source_ticker_sentiment["count"] >= 5]

# Pivot for heatmap
sentiment_pivot = source_ticker_sentiment.pivot(
    index="domain",
    columns="ticker",
    values="avg_sentiment"
)

# Filter to top sources
sentiment_pivot_top = sentiment_pivot.loc[
    sentiment_pivot.index.isin(source_counts.head(15).index)
].dropna(how="all")

print("Source × Ticker sentiment (top sources, min 5 articles per cell):")
print(sentiment_pivot_top.round(3).to_string())

In [None]:
# Heatmap of source × ticker sentiment
fig, ax = plt.subplots(figsize=(10, 8))

# Use diverging colormap centered at 0
vmax = max(abs(sentiment_pivot_top.min().min()), abs(sentiment_pivot_top.max().max()))
vmax = min(vmax, 0.5)  # cap for better contrast

sns.heatmap(
    sentiment_pivot_top,
    annot=True,
    fmt=".2f",
    cmap="RdYlGn",
    center=0,
    vmin=-vmax,
    vmax=vmax,
    ax=ax,
    cbar_kws={"label": "Avg Sentiment"}
)
ax.set_title("Average Sentiment by Source and Ticker\n(green = positive, red = negative)")
ax.set_xlabel("Ticker")
ax.set_ylabel("Source")
plt.tight_layout()
plt.show()

## Part 5: Sentiment Consistency Within Sources

Are some sources more "consistent" in their sentiment (always positive or always negative), while others swing wildly?

High standard deviation might mean:
- The source covers a wide range of news (good and bad)
- Or it's just noisy

Low standard deviation might mean:
- The source has a consistent editorial slant
- Or it writes boring headlines that don't trigger our lexicon

In [None]:
# Look at sentiment distribution for a few key sources
key_sources = source_counts.head(6).index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

for i, source in enumerate(key_sources):
    ax = axes[i]
    source_data = df[df["domain"] == source]["sentiment_score"]
    
    # Density-normalized histogram (no KDE overlay)
    ax.hist(source_data, bins=30, density=True, alpha=0.7, color="steelblue", edgecolor="white")
    
    # Stats
    mean_sent = source_data.mean()
    std_sent = source_data.std()
    n = len(source_data)
    neutral_pct = (source_data == 0).mean() * 100
    
    ax.axvline(mean_sent, color="red", linestyle="--", label=f"mean={mean_sent:.2f}")
    ax.axvline(0, color="gray", linestyle=":", alpha=0.5)
    
    ax.set_title(f"{source}\n(n={n}, neutral={neutral_pct:.0f}%)")
    ax.set_xlabel("Sentiment Score")
    ax.set_xlim(-1.1, 1.1)
    ax.legend(fontsize=8)

plt.suptitle("Sentiment Distribution by Source", fontsize=12, y=1.02)
plt.tight_layout()
plt.show()

## Part 6: Takeaways and Recommendations

Let's summarize what we learned and think about implications for downstream analysis.

In [None]:
# Summary stats
print("=" * 70)
print("SOURCE-LEVEL SENTIMENT SUMMARY")
print("=" * 70)
print()

# Coverage concentration
print("COVERAGE:")
print(f"  • {len(source_counts)} unique sources")
print(f"  • Top 5 sources = {top5_share:.1f}% of articles")
print(f"  • Top 10 sources = {top10_share:.1f}% of articles")
print()

# Sentiment variation across sources
avg_range = source_agg_filtered["avg_sentiment"].max() - source_agg_filtered["avg_sentiment"].min()
print("SENTIMENT VARIATION:")
print(f"  • Average sentiment ranges from {source_agg_filtered['avg_sentiment'].min():.3f} to {source_agg_filtered['avg_sentiment'].max():.3f}")
print(f"  • Spread across sources: {avg_range:.3f}")
print(f"  • Overall mean sentiment: {df['sentiment_score'].mean():.3f}")
print()

# Signal rates
print("LEXICON MATCH RATES:")
print(f"  • Overall: {100*(df['sentiment_score'] != 0).mean():.1f}% of articles have sentiment words")
print(f"  • By source: ranges from {source_agg_filtered['signal_rate'].min()*100:.1f}% to {source_agg_filtered['signal_rate'].max()*100:.1f}%")
print()

# Potential outliers/concerns
print("NOTABLE SOURCES:")
if len(most_positive) > 0:
    top_pos = most_positive.iloc[0]
    print(f"  • Most positive: {most_positive.index[0]} (avg={top_pos['avg_sentiment']:.3f}, n={int(top_pos['article_count'])})")
if len(most_negative) > 0:
    top_neg = most_negative.iloc[0]
    print(f"  • Most negative: {most_negative.index[0]} (avg={top_neg['avg_sentiment']:.3f}, n={int(top_neg['article_count'])})")
    
# Highest volume
top_vol = source_agg.iloc[0]
print(f"  • Highest volume: {source_agg.index[0]} ({int(top_vol['article_count'])} articles, avg={top_vol['avg_sentiment']:.3f})")

### What This Means for Analysis

**Observations:**

1. **Coverage is concentrated** — a handful of sources dominate the data. This means if those sources have a bias, it'll show up in aggregate sentiment.

2. **Sources do differ in average sentiment** — some are consistently more positive, others more negative. This could reflect editorial slant, or just the type of news they cover.

3. **Lexicon match rates vary** — some sources use more "sentiment-rich" language, others are drier. High neutral rates don't mean the source is actually neutral; it might just mean our lexicon doesn't capture their vocabulary.

**Recommendations:**

- **For aggregate sentiment signals**: Consider weighting by source so that high-volume sources don't dominate, for example by averaging within each source first and then across sources. Using the median instead of the mean is another option.

- **For ticker-specific analysis**: Be aware that if source A covers NVDA heavily and source B covers TSLA, comparing their sentiment isn't straightforward.

- **For robustness checks**: Try running analysis with and without the most extreme sources to see if conclusions hold.

- **For future work**: Could explore whether certain sources are more "predictive" of price moves — maybe some outlets' sentiment actually matters more.
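
The first and third recommendations can be sketched on toy data. The DataFrame below is invented (one high-volume source and two small ones), and the variable names are hypothetical; it is meant as an illustration of the idea rather than a recipe tuned to this dataset.

```python
import pandas as pd

# Made-up scores: one high-volume source and two small ones (values are invented)
toy = pd.DataFrame({
    "domain": ["big.com"] * 6 + ["small1.com"] * 2 + ["small2.com"] * 2,
    "sentiment_score": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, -0.1, -0.3, 0.0, -0.2],
})

# Pooled mean: every article counts equally, so big.com dominates
pooled_mean = toy["sentiment_score"].mean()

# Source-balanced mean: average within each source first, then across sources
per_source = toy.groupby("domain")["sentiment_score"].mean()
balanced_mean = per_source.mean()

# Leave-one-source-out: recompute the pooled mean with each source removed
leave_one_out = {
    d: toy.loc[toy["domain"] != d, "sentiment_score"].mean()
    for d in per_source.index
}

print(f"pooled mean:   {pooled_mean:+.3f}")
print(f"balanced mean: {balanced_mean:+.3f}")
for d, m in leave_one_out.items():
    print(f"without {d:12} {m:+.3f}")
```

If the pooled mean moves a lot when a single source is dropped, or differs sharply from the source-balanced mean, that source is driving the aggregate. Equal weighting has its own cost: it gives tiny sources as much say as large ones, so combining it with the minimum article count filter may be sensible.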

In [None]:
# Save the source aggregation table for reference
output_path = PROJECT_ROOT / "data" / "processed" / "source_sentiment_summary.csv"
source_agg.to_csv(output_path)
print(f"Saved source summary to: {output_path}")