In [4]:
from IPython.display import HTML, display
display(HTML("""
<style>
.jp-MarkdownOutput h1 { font-size: 1.4em !important; color: #2c3e50 !important; }
.jp-MarkdownOutput h2 { font-size: 1.2em !important; color: #34495e !important; }
.jp-MarkdownOutput p { font-family: Arial !important; font-size: 13px !important; }
</style>
"""))

## Israel–Palestine: Measuring Sentiment and Stance Over Time

### Project overview
We aim to measure how online sentiment and stance regarding Israel and Palestine evolve over time, using a Reddit dataset (daily‑updated) curated around the conflict. The final output will be a narrative notebook that combines text, code, and visualizations to tell a coherent story about trends and patterns in public discourse.

- **Goal**: quantify and compare sentiment/stance toward Israel and Palestine over time; identify spikes and turning points aligned with major events.
- **Data**: Reddit posts/comments from the Kaggle dataset: [Reddit on Israel–Palestine (daily updated)](https://www.kaggle.com/datasets/asaniczka/reddit-on-israel-palestine-daily-updated).
- **Primary task**: build a classifier that assigns each text a stance label among five classes:
  - pro‑Israel
  - anti‑Israel
  - pro‑Palestine
  - anti‑Palestine
  - neutral

### Why stance (not only sentiment)?
Generic sentiment (positive/negative/neutral) is often insufficient for polarized topics. Stance captures the *position* relative to a target (Israel or Palestine), allowing us to distinguish, for example, criticism of one side vs support of the other, even when overall sentiment is negative.

### Research questions
- How do the proportions of each stance change over time (daily/weekly)?
- Do major geopolitical events coincide with sharp shifts in stance distribution?
- Are there persistent differences between subreddits or topical clusters?
- How do sentiment scores relate to stance labels (agreement or divergence)?
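
As a minimal sketch of how the first question could be answered once stance labels exist, the snippet below computes weekly stance shares on a toy dataframe. The `stance` column, the toy rows, and the helper name `weekly_stance_share` are assumptions for illustration; the dataset itself ships no stance labels.

```python
import pandas as pd

def weekly_stance_share(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per week and one column per stance; each row sums to 1."""
    # Bucket timestamps into Monday-to-Sunday weeks, keyed by the week's start.
    week = frame["created_time"].dt.to_period("W").dt.start_time
    counts = frame.groupby([week, "stance"]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0)

# Hypothetical labeled rows (not taken from the dataset).
toy = pd.DataFrame({
    "created_time": pd.to_datetime(
        ["2023-10-02", "2023-10-03", "2023-10-04", "2023-10-10", "2023-10-11"]
    ),
    "stance": ["neutral", "pro-Israel", "pro-Palestine", "neutral", "neutral"],
})

shares = weekly_stance_share(toy)
print(shares.round(2))
```

Working with shares rather than raw counts keeps weeks with very different volumes comparable.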

### Planned approach (high level)
1. Load and clean the Reddit dataset; deduplicate and filter off‑topic/low‑quality text.
2. Exploratory analysis: descriptive stats, topic/cluster exploration (e.g., PCA/UMAP + clustering) to map discourse structure.
3. Pseudo‑labeling with a strong pretrained sentiment model for baseline trends; create a small gold set for stance.
4. Train a stance classifier with five classes (baseline: TF‑IDF + Logistic Regression; advanced: lightweight transformer fine‑tune/PEFT).
5. Evaluate with accepted metrics (macro‑F1, confusion matrices), check over/underfitting, and use explainability tools (e.g., SHAP) for insight.
6. Visualize time‑series of stance proportions with uncertainty bands; annotate salient events.
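
The baseline in step 4 and the macro‑F1 check in step 5 could look roughly like the sketch below. It assumes scikit‑learn is installed and uses a handful of hand‑written toy sentences, not dataset rows; the hyperparameters are placeholders, and scoring on the training data is only a wiring check, not an evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy, hand-written examples (NOT from the dataset); real training needs a gold set.
texts = [
    "I support Israel's right to defend itself",
    "Israel deserves our full support",
    "I condemn Israel's actions",
    "Israel should be sanctioned",
    "I stand with the Palestinian people",
    "Palestinians deserve support and rights",
    "I oppose the Palestinian leadership's positions",
    "Palestine should be condemned for this",
    "The UN meeting is scheduled for Tuesday",
    "Here is a timeline of recent events",
]
labels = [
    "pro-Israel", "pro-Israel",
    "anti-Israel", "anti-Israel",
    "pro-Palestine", "pro-Palestine",
    "anti-Palestine", "anti-Palestine",
    "neutral", "neutral",
]

# TF-IDF over unigrams and bigrams feeding a multinomial logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(texts, labels)

pred = baseline.predict(texts)
macro_f1 = f1_score(labels, pred, average="macro")
print(f"Training-set macro-F1 (sanity check only): {macro_f1:.2f}")
```

In practice the score should come from a held‑out split of the gold set, with macro averaging so that rare stance classes weigh as much as common ones.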

### Scope and assumptions
- Initial focus on English text; multilingual expansion if time permits.
- Stance schema is defined below and may be refined after an initial labeling pass.
- Ethical considerations: handle sensitive content respectfully, document limitations/biases, and avoid doxxing or harm.

### Deliverables
- Narrative Jupyter Notebook (50/50 text and code) exported to HTML.
- Clear comparisons of methods and a reasoned discussion of findings and limitations.



In [32]:
# Core libraries and visualization setup
from __future__ import annotations

# Data & utils
import os
from pathlib import Path
import pandas as pd
import numpy as np

# Visualization (Plotly only)
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

# Plotly theme and defaults
pio.templates.default = "plotly_white"
px.defaults.template = "plotly_white"
px.defaults.width = 900
px.defaults.height = 520

# Stance color palette (consistent across plots)
STANCE_COLORS = {
    "pro-Israel": "#1f77b4",
    "anti-Israel": "#2ca02c",
    "pro-Palestine": "#d62728",
    "anti-Palestine": "#9467bd",
    "neutral": "#7f7f7f",
}

print("Libraries loaded. Plotly configured.")


Libraries loaded. Plotly configured.


### Stance label schema (5 classes)
We frame stance as the author’s position toward a target (Israel or Palestine). This is distinct from generic sentiment.

- **pro‑Israel**: expresses support for Israeli people/policies/positions; frames Israel positively or defends it.
- **anti‑Israel**: expresses opposition to Israeli people/policies/positions; frames Israel negatively or calls for sanctions/condemnation.
- **pro‑Palestine**: expresses support for Palestinian people/rights/positions; frames Palestine positively or advocates on its behalf.
- **anti‑Palestine**: expresses opposition to Palestinian people/policies/positions; frames Palestine negatively or calls for sanctions/condemnation.
- **neutral**: informational or balanced coverage without clear stance; sarcasm/ambiguous content that cannot be reliably assigned.

Notes and edge cases:
- Posts can be negative about violence broadly without taking a side → likely **neutral** unless a clear target is present.
- Criticism of a specific government (e.g., the Israeli government) is **anti‑Israel** only if it attacks Israel as a whole; if it is policy‑specific and balanced, mark it **neutral**.
- Support for a ceasefire or humanitarian aid is **neutral** unless it explicitly endorses or opposes a side.
- When both stances are present, choose the dominant one; if unclear, mark **neutral**.



### Dataset files overview
The dataset includes two CSV files. Per the dataset documentation, we treat the comments file as legacy and use the opinions/posts file as the authoritative source.

- `reddit_opinion_PSE_ISR.csv` (primary)
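  - Comment-level records (one row per comment) that also carry the fields of the parent post; curated for ongoing analysis and updates.
  - Fields include the comment identifier and text (`comment_id`, `self_text`), parent-post fields (`post_id`, `post_title`, `post_self_text`), `subreddit`, `score`, and `created_time`.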
  - Recommended as the main input for stance/sentiment over time.

- `pse_isr_reddit_comments.csv` (legacy)
  - Comment–level dump retained for historical/backward compatibility.
  - Columns commonly observed: `comment_id`, `self_text`, `subreddit`, `score`, `created_time`.
  - Optional for supplementary analyses (e.g., comment dynamics); not the primary table.

Usage policy in this notebook:
- We will load and analyze `reddit_opinion_PSE_ISR.csv` as the canonical dataset.
- The legacy comments file may be referenced for context or robustness checks if needed.


### Data Loading and Initial Discovery

In [33]:
from pathlib import Path
import pandas as pd

# Use the path KaggleHub returned earlier; fallback to re-download if missing.
try:
    local_path
except NameError:
    import kagglehub
    local_path = kagglehub.dataset_download('asaniczka/reddit-on-israel-palestine-daily-updated')

base = Path(local_path)
opinion_csv = next(base.rglob('reddit_opinion_PSE_ISR.csv'))
df = pd.read_csv(opinion_csv)

print('Using file:', opinion_csv)
print('Shape:', df.shape)
print('Columns:', df.columns.tolist())
display(df.head())

Using file: C:\Users\Danco\.cache\kagglehub\datasets\asaniczka\reddit-on-israel-palestine-daily-updated\versions\651\reddit_opinion_PSE_ISR.csv
Shape: (3161341, 24)
Columns: ['comment_id', 'score', 'self_text', 'subreddit', 'created_time', 'post_id', 'author_name', 'controversiality', 'ups', 'downs', 'user_is_verified', 'user_account_created_time', 'user_awardee_karma', 'user_awarder_karma', 'user_link_karma', 'user_comment_karma', 'user_total_karma', 'post_score', 'post_self_text', 'post_title', 'post_upvote_ratio', 'post_thumbs_ups', 'post_total_awards_received', 'post_created_time']


Unnamed: 0,comment_id,score,self_text,subreddit,created_time,post_id,author_name,controversiality,ups,downs,...,user_link_karma,user_comment_karma,user_total_karma,post_score,post_self_text,post_title,post_upvote_ratio,post_thumbs_ups,post_total_awards_received,post_created_time
0,na7ycnc,1,"It might be late, but better late than never.",Palestine,2025-08-23 09:27:56,1mxj1rm,VivienneNovag,0,1,0,...,2701.0,10601.0,13302.0,6487,,Cardinal Zuppi read the names of every child w...,0.98,6487,0,2025-08-22 21:29:58
1,na7y50u,1,Because Jewish people have also faced pogroms ...,IsraelPalestine,2025-08-23 09:25:50,1mxpnoj,AsaxenaSmallwood04,0,1,0,...,26.0,3872.0,3898.0,0,"To preface, I am a Jew and that day was horrif...",Why do people constantly bring up October 7th ...,0.41,0,0,2025-08-23 02:26:14
2,na7xy4f,1,Zionists and their eternal victim complex.,Palestine,2025-08-23 09:23:55,1mxw26r,Mediocre-Working8841,0,1,0,...,1.0,2742.0,2743.0,198,,"Brainwash Zio Crew on the streets again, tryin...",1.0,198,0,2025-08-23 08:33:46
3,na7xx8o,1,Let's stop the whataboutism right here. The Pa...,IsraelPalestine,2025-08-23 09:23:42,1mxpnoj,Reasonable-Notice439,0,1,0,...,5.0,2966.0,2971.0,0,"To preface, I am a Jew and that day was horrif...",Why do people constantly bring up October 7th ...,0.41,0,0,2025-08-23 02:26:14
4,na7xvf2,1,unpopular opinion:\n\nThe more 'Israel' seeks ...,AskMiddleEast,2025-08-23 09:23:11,1mxvv5i,Tornupto48,0,1,0,...,238.0,5423.0,5661.0,5,The prospects for a Palestinian two-state solu...,Bleak Outlook for Palestinian Statehood,1.0,5,0,2025-08-23 08:21:14


### Column dictionary
The table below describes the main fields present in `reddit_opinion_PSE_ISR.csv`.

| Column | Meaning |
|---|---|
| `comment_id` | ID of the specific Reddit comment associated with the row (if applicable). |
| `score` | Net upvotes (upvotes − downvotes) for the comment. |
| `self_text` | Comment body text. May be empty/NaN. |
| `subreddit` | Name of the subreddit where the content appeared. |
| `created_time` | Comment timestamp (UTC). May be NaN if only post time is available. |
| `post_id` | ID of the parent Reddit post (submission). |
| `author_name` | Username of the commenting user. May be `[deleted]`/`[removed]`. |
| `controversiality` | Reddit flag for how controversial a comment is (0/1). |
| `ups` | Upvote count reported for the comment (may mirror `score`). |
| `downs` | Downvote count reported for the comment (often 0 due to API limitations). |
| `user_is_verified` | Whether the user account is verified (boolean/flag). |
| `user_account_created_time` | Account creation time (UTC) of the author. |
| `user_awardee_karma` | Karma from receiving awards. |
| `user_awarder_karma` | Karma from giving awards. |
| `user_link_karma` | Link (post) karma of the author. |
| `user_comment_karma` | Comment karma of the author. |
| `user_total_karma` | Total karma of the author. |
| `post_score` | Net upvotes for the post (submission). |
| `post_self_text` | Post body text (submission text). May be empty/NaN if it’s a link post. |
| `post_title` | Post title. |
| `post_upvote_ratio` | Fraction of upvotes out of total votes for the post (0–1). |
| `post_thumbs_ups` | Upvote/like count for the post as captured by the scraper (if available). |
| `post_total_awards_received` | Number of awards the post received. |
| `post_created_time` | Post (submission) timestamp (UTC). |

Notes:
- Text for stance will primarily come from `post_title` + `post_self_text` (and optionally `self_text`).
- Time analyses will use `post_created_time` when present; otherwise `created_time`.
- Some fields can be missing for certain rows; handle NaNs accordingly.
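
The timestamp rule in the notes can be applied row by row, as in the sketch below. The toy values and the `event_time` column name are assumptions for illustration; only `post_created_time` and `created_time` come from the dataset.

```python
import pandas as pd

# Toy rows: one valid post time, one missing, one unparseable.
toy = pd.DataFrame({
    "post_created_time": ["2025-08-22 21:29:58", None, "not a date"],
    "created_time": ["2025-08-23 09:27:56", "2025-08-23 09:25:50", "2025-08-23 09:23:55"],
})

# errors="coerce" turns missing or unparseable values into NaT instead of raising.
post_ts = pd.to_datetime(toy["post_created_time"], errors="coerce", utc=True)
comment_ts = pd.to_datetime(toy["created_time"], errors="coerce", utc=True)

# Prefer the post timestamp; fall back to the comment timestamp per row.
toy["event_time"] = post_ts.fillna(comment_ts)
print(toy["event_time"])
```

A per-row fallback differs from a column-level check: it still yields a usable timestamp when the post-time column exists but individual values are missing.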


### Reusable dataframe stats
A simple helper to quickly inspect shape, dtypes, missing values, and memory usage. We’ll reuse it after preprocessing steps.


In [39]:
def show_stats(df, name='df', top_na=10):
    print(f'[{name}] Shape:', df.shape)
    print('\nDtypes (counts):')
    print(df.dtypes.value_counts())
    print('\nMissing values (top):')
    na = df.isna().sum().sort_values(ascending=False)
    print(na.head(top_na))
    mem_mb = df.memory_usage(deep=True).sum() / 1_048_576
    print(f"\nApprox. memory usage: {mem_mb:.2f} MB")

# Use on raw and reduced views
show_stats(df, 'df (current)')


[df (current)] Shape: (3161341, 23)

Dtypes (counts):
object     9
int64      7
float64    6
bool       1
Name: count, dtype: int64

Missing values (top):
user_account_created_time    130693
user_comment_karma              468
user_awardee_karma              468
user_awarder_karma              468
user_total_karma                468
user_link_karma                 468
self_text                        16
author_name                       0
post_id                           0
created_time                      0
dtype: int64

Approx. memory usage: 8270.39 MB


### Missing post body (`post_self_text`) and our approach
We observe a high number of missing values in `post_self_text` (1,691,162 rows). This is expected because many Reddit submissions are link/image/video posts that do not have a self‑text body, and some posts may be deleted/removed.

Since both the post topic (title) and the body can carry meaningful information for stance, we will:
- Combine `post_title` and `post_self_text` into a single `text` field.
- If the body is missing, use the title alone.
- Drop rows only when both title and body are empty.


In [37]:
# Simple combine: title + body -> text, then drop originals
df['text'] = (df['post_title'].fillna('') + ' ' + df['post_self_text'].fillna('')).str.strip()
df.drop(columns=['post_title', 'post_self_text'], inplace=True)
print('Empty text rows:', int((df['text'].str.len() == 0).sum()))


Empty text rows: 0


### Keep engagement columns for impact analysis
For modeling we won’t use engagement fields (to avoid leakage), but for impact evaluation we’ll keep them now:
- Text and context: `text`, `subreddit`, `post_id`
- Time: `post_created_time` or `created_time`
- Engagement: `score`, `ups`, `downs`, `post_score`, `post_upvote_ratio`, `post_thumbs_ups`, `post_total_awards_received`

We’ll correlate stance trends with these later to study visibility and reception, but exclude them from training features.


In [40]:
# Reduced dataframe with engagement columns kept for analysis
eng_cols = [
    'score','ups','downs','post_score','post_upvote_ratio','post_thumbs_ups','post_total_awards_received'
]
base_cols = ['text','subreddit','post_id','post_created_time','created_time']
cols = [c for c in base_cols + eng_cols if c in df.columns]

# Save the original dataframe
df_original = df.copy()

# Create a reduced dataframe with the essential columns
# (copy so the assignments below don't trigger SettingWithCopyWarning)
df = df[cols].copy()

# Standardize timestamp: prefer the post time, otherwise keep the comment time
if 'post_created_time' in df.columns:
    df['created_time'] = pd.to_datetime(df['post_created_time'], errors='coerce', utc=True)
elif 'created_time' in df.columns:
    df['created_time'] = pd.to_datetime(df['created_time'], errors='coerce', utc=True)

df['date'] = df['created_time'].dt.date
print('df shape:', df.shape)
df.head()


df shape: (3161341, 13)


Unnamed: 0,text,subreddit,post_id,post_created_time,created_time,score,ups,downs,post_score,post_upvote_ratio,post_thumbs_ups,post_total_awards_received,date
0,Cardinal Zuppi read the names of every child w...,Palestine,1mxj1rm,2025-08-22 21:29:58,2025-08-22 21:29:58+00:00,1,1,0,6487,0.98,6487,0,2025-08-22
1,Why do people constantly bring up October 7th ...,IsraelPalestine,1mxpnoj,2025-08-23 02:26:14,2025-08-23 02:26:14+00:00,1,1,0,0,0.41,0,0,2025-08-23
2,"Brainwash Zio Crew on the streets again, tryin...",Palestine,1mxw26r,2025-08-23 08:33:46,2025-08-23 08:33:46+00:00,1,1,0,198,1.0,198,0,2025-08-23
3,Why do people constantly bring up October 7th ...,IsraelPalestine,1mxpnoj,2025-08-23 02:26:14,2025-08-23 02:26:14+00:00,1,1,0,0,0.41,0,0,2025-08-23
4,Bleak Outlook for Palestinian Statehood The pr...,AskMiddleEast,1mxvv5i,2025-08-23 08:21:14,2025-08-23 08:21:14+00:00,1,1,0,5,1.0,5,0,2025-08-23


In [41]:
show_stats(df, 'df (current)')

[df (current)] Shape: (3161341, 13)

Dtypes (counts):
int64                  6
object                 5
datetime64[ns, UTC]    1
float64                1
Name: count, dtype: int64

Missing values (top):
text                 0
subreddit            0
post_id              0
post_created_time    0
created_time         0
score                0
ups                  0
downs                0
post_score           0
post_upvote_ratio    0
dtype: int64

Approx. memory usage: 6315.74 MB


### Subreddit distribution 
We want a quick peek at the dominant subreddits and their share of the conversation. To do that, we:
- Display a table summarizing the number of posts for each subreddit (top N, with a possible “Other” group).
- Show a pie chart that visualizes each subreddit’s percentage of the total.

This gives us a clear view of where discussion is concentrated before we dive deeper.
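
The optional “Other” group mentioned above could be built as in this sketch. The helper name `top_n_with_other` and the toy values are hypothetical.

```python
import pandas as pd

def top_n_with_other(values: pd.Series, n: int = 10) -> pd.Series:
    """Count values, keep the n most frequent, and pool the rest as 'Other'."""
    counts = values.value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({"Other": remainder})])
    return top

# Toy subreddit column with four distinct values.
toy = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 1 + ["d"] * 1)

summary = top_n_with_other(toy, n=2)
share = (summary / summary.sum() * 100).round(1)
print(pd.DataFrame({"rows": summary, "share_pct": share}))
```

With an “Other” slice, the percentages refer to the whole dataset rather than to the top N only.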

In [145]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Get top subreddits
counts = df['subreddit'].value_counts().head(10)

# Sample greyscale colors, one per subreddit
black_white_colors = px.colors.sample_colorscale('Greys', len(counts))

# Create side-by-side layout: table on the left, pie chart on the right
fig = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "table"}, {"type": "pie"}]],
    subplot_titles=("Distribution", "PCT (%)")
)

# Add pie chart
fig.add_trace(go.Pie(
    labels=counts.index, 
    values=counts.values,
    textinfo='label+percent',
    marker_colors=black_white_colors,
), row=1, col=2)

# Add table
fig.add_trace(go.Table(
    header=dict(values=['Subreddit', 'Posts'], fill_color='black',font_color='white'),
    cells=dict(values=[counts.index, counts.values],font=dict(size=14),height=30, fill_color='white',line_color='black')
), row=1, col=1)

# Layout
fig.update_layout(height=600, width=1000, title="Top Subreddits")
fig.show()

### What the results tell us (current dataset snapshot)
- The conversation is highly concentrated. The largest subreddit is **IsraelPalestine** with **1,150,793** rows (~**37%** of the top‑10 total). Each row is one comment, so the counts here measure comment volume rather than distinct posts.
- Next is **worldnews** with ~**776,636** posts (~**25%**), followed by **Palestine** with ~**345,207** (~**11%**).
- Additional high‑volume communities include:
  - **PublicFreakout** ~**230,524** (~**7.4%**)
  - **AskMiddleEast** ~**211,261** (~**6.8%**)
  - **CombatFootage** ~**149,476** (~**4.8%**)
  - **worldnewsvideo** ~**93,115** (~**3.0%**)
  - **NonCredibleDefense** ~**89,415** (~**2.9%**)
  - **CrazyFuckingVideos** ~**41,429** (~**1.3%**)
  - **IsrealPalestineWar_23** ~**23,161** (~**0.7%**)
- Interpretation: a handful of subreddits dominate total volume, which can bias aggregate trends toward those communities. In later steps we’ll normalize by time and examine per‑subreddit rates to enable fair comparisons.
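
To illustrate why per‑subreddit rates matter, the sketch below contrasts a pooled stance share with an average of per‑subreddit rates. The `stance` labels and the two toy communities are assumptions; no such labels exist in the raw data yet.

```python
import pandas as pd

# Toy data: one large community and one small one with different stance mixes.
toy = pd.DataFrame({
    "subreddit": ["big"] * 8 + ["small"] * 2,
    "stance": ["neutral"] * 6 + ["pro-Israel"] * 2 + ["pro-Palestine"] * 2,
})

# Pooled share: every row counts equally, so the large community dominates.
pooled = toy["stance"].value_counts(normalize=True)

# Per-subreddit rates, then an unweighted mean so each community counts once.
per_sub = pd.crosstab(toy["subreddit"], toy["stance"], normalize="index")
balanced = per_sub.mean(axis=0)

print(pd.DataFrame({"pooled": pooled, "balanced": balanced}).round(3))
```

Neither view is the correct one in general; reporting both shows how much of an aggregate trend is driven by the largest communities.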

### Distribution over time
To check how volume is distributed over time, we first need to establish the timeframe the dataset covers.

In [146]:
earliest_date = df['date'].min()
earliest_row = df[df['date'] == earliest_date].iloc[0]

print(f"Dataset spans from: {earliest_date}")
print(f"First post was in r/{earliest_row['subreddit']}")
print(f"Post ID: {earliest_row['post_id']}")
if 'text' in df.columns:
    text_preview = earliest_row['text'][:100] + "..." if len(earliest_row['text']) > 100 else earliest_row['text']
    print(f"Text preview: {text_preview}")

Dataset spans from: 2023-09-02
First post was in r/ActualPublicFreakouts
Post ID: 1680ip7
Text preview: Eritrean immigrants (illegals) fighting in the streets of Tel Aviv (Israel) - Blues are against Erit...


In [147]:
# Create monthly post counts
monthly_counts = (df.assign(month_year=pd.to_datetime(df['date']).dt.to_period('M'))
                   .groupby('month_year')
                   .size()
                   .reset_index(name='post_count'))

monthly_counts['year'] = monthly_counts['month_year'].dt.year
monthly_counts['month'] = monthly_counts['month_year'].dt.month
monthly_counts['month_name'] = monthly_counts['month_year'].dt.strftime('%Y-%m')

print(f"Dataset spans {len(monthly_counts)} months")
display(monthly_counts.head(10))

Dataset spans 24 months


Unnamed: 0,month_year,post_count,year,month,month_name
0,2023-09,2034,2023,9,2023-09
1,2023-10,25790,2023,10,2023-10
2,2023-11,289402,2023,11,2023-11
3,2023-12,270405,2023,12,2023-12
4,2024-01,194976,2024,1,2024-01
5,2024-02,157363,2024,2,2024-02
6,2024-03,163465,2024,3,2024-03
7,2024-04,179069,2024,4,2024-04
8,2024-05,184685,2024,5,2024-05
9,2024-06,132615,2024,6,2024-06


In [150]:
import plotly.graph_objects as go

# Ensure chronological order, then compute 3- and 6-month running averages
monthly_counts = monthly_counts.sort_values('month_name')
monthly_counts['ma_3'] = monthly_counts['post_count'].rolling(3, min_periods=1).mean()
monthly_counts['ma_6'] = monthly_counts['post_count'].rolling(6, min_periods=1).mean()

# Bar chart of monthly volume
fig = px.bar(
    monthly_counts, x='month_name', y='post_count',
    title='Post Volume Over Time (Monthly)',
    labels={'month_name': 'Month', 'post_count': 'Number of Posts'},
    color_discrete_sequence=['#D2691E']
)

# Overlay the 6-month running average
fig.add_trace(go.Scatter(
    x=monthly_counts['month_name'], y=monthly_counts['ma_6'],
    name='6-month avg', mode='lines',
    line=dict(color='#8B4513', width=3)
))

fig.update_layout(
    height=500, width=1000, bargap=0.1,
    paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)',
    xaxis_title='Time', yaxis_title='Posts per Month'
)
fig.show()

### Interpretation
Activity begins to climb in October 2023 and then surges sharply in November 2023. This peak aligns with reactions to the October 7 attack and its aftermath, suggesting a short lag between the event and the broader social‑media response.

Also, using a 6‑month running average, activity has eased from the early peak but remains high. June 2025 records ~189K posts, the fourth‑highest monthly total since September 2023.
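
The running average and a month ranking can be reproduced on the first ten monthly totals displayed in the table earlier. This is a sketch on that excerpt only, so the ranks are relative to these ten months and not to the full series.

```python
import pandas as pd

# First ten monthly totals as displayed in the monthly_counts table.
monthly = pd.Series(
    [2034, 25790, 289402, 270405, 194976, 157363, 163465, 179069, 184685, 132615],
    index=pd.period_range("2023-09", periods=10, freq="M"),
    name="post_count",
)

# min_periods=1 means the earliest windows average fewer than six months.
ma_6 = monthly.rolling(6, min_periods=1).mean()

# Rank months by volume (1 = busiest) to check claims such as "fourth-highest".
rank = monthly.rank(ascending=False).astype(int)

print(pd.DataFrame({"post_count": monthly, "ma_6": ma_6.round(0), "rank": rank}))
```

Because of `min_periods=1`, the first five points of the line are averages over shorter windows, which is worth keeping in mind when reading the start of the chart.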

https://en.wikipedia.org/wiki/October_7_attacks


<img src="assets/october_7.png" alt="Alt text" width="800">