# Platform Narratives Analysis Workflow

This notebook demonstrates the complete workflow for analyzing social media platform narratives using the modular components from the `src` package. It follows the same analysis steps as the original `matching.ipynb` but in a more structured and reproducible way.

## Setup and Imports

In [None]:
import os
import pandas as pd
import numpy as np
import networkx as nx
from tqdm.auto import tqdm

# Import components from our package
from src.ranking import fastLexRank
from src.graph_analysis import build_graph, get_descendants, get_tree_nodes, get_posts_from_tree
from src.text_processing import clean_text, detect_post_language, filter_posts_by_language
from src.utils import load_json_data, create_id_to_post_map, posts_to_dataframe, extract_anchor_ids

## 1. Load and Prepare Data

First, we'll load the data from a JSON file and prepare it for analysis.

In [None]:
# Load data from JSON file
# Update the path to your data file
data_path = '../data/your_data.json'
data = load_json_data(data_path)
print(f"Loaded {len(data)} posts from {data_path}")
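
For reference, a loader along the lines of `load_json_data` could be as small as the sketch below. It is a stand-in, not the `src.utils` implementation, and it assumes the file holds a JSON array of post objects. The `id` and `post` keys are the ones used later in this notebook; `parent_id` is a hypothetical name for the reply link.

```python
import json
import os
import tempfile


def load_posts(path):
    """Load a JSON array of post dicts (hypothetical stand-in for load_json_data)."""
    with open(path, encoding="utf-8") as f:
        posts = json.load(f)
    if not isinstance(posts, list):
        raise ValueError(f"Expected a JSON array of posts, got {type(posts).__name__}")
    return posts


# Write a tiny sample file so the sketch runs on its own.
# 'parent_id' is an assumed field name; the real data may use a different key.
sample_posts = [
    {"id": 1, "post": "Original post", "parent_id": None},
    {"id": 2, "post": "A reply", "parent_id": 1},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample_posts, f)
    loaded_posts = load_posts(sample_path)

print(f"Loaded {len(loaded_posts)} posts")
```

Checking the top-level type up front gives a clearer error than a failure several cells later, when the graph is built.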

## 2. Clean and Preprocess Text

Next, we'll clean the post text and filter to English posts only.

In [None]:
# Clean post text (skip posts with missing or empty text)
for post in data:
    if post.get('post'):
        post['post'] = clean_text(post['post'])

# Filter to English posts only
english_posts = filter_posts_by_language(data)
print(f"Filtered to {len(english_posts)} English posts")
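
To make the two steps concrete, here is a minimal sketch of what cleaning and language filtering might look like. The cleaning rules (dropping URLs, collapsing whitespace) and the `detect` callable are assumptions for illustration, not the behavior of `clean_text` or `filter_posts_by_language`. `toy_detect` is a deliberately crude placeholder for a real language detector.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text_sketch(text):
    """Remove URLs and collapse whitespace (assumed cleaning steps)."""
    text = URL_PATTERN.sub("", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


def filter_by_language_sketch(posts, detect, target="en"):
    """Keep posts whose text `detect` labels as `target`.

    `detect` is any callable mapping text to a language code, for example
    a wrapper around a language-detection library.
    """
    return [p for p in posts if p.get("post") and detect(p["post"]) == target]


def toy_detect(text):
    """Crude placeholder: treat mostly-ASCII text as English."""
    ascii_chars = sum(ch.isascii() for ch in text)
    return "en" if ascii_chars / len(text) > 0.9 else "other"


demo_posts = [
    {"id": 1, "post": "Check   this out https://example.com  now"},
    {"id": 2, "post": "これは日本語の投稿です"},
]
for p in demo_posts:
    p["post"] = clean_text_sketch(p["post"])

english_only = filter_by_language_sketch(demo_posts, toy_detect)
print([p["id"] for p in english_only])
```

Passing the detector in as a parameter keeps the filter easy to test and lets you swap in a proper detector without touching the filtering logic.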

## 3. Build Post Graph

Now we'll build a directed graph of the reply relationships between posts, along with a lookup table from post ID to post.

In [None]:
# Build the post graph
post_graph = build_graph(english_posts)
print(f"Built graph with {post_graph.number_of_nodes()} nodes and {post_graph.number_of_edges()} edges")

# Create a mapping from post IDs to posts
id_to_post = create_id_to_post_map(english_posts)
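
The sketch below shows one way such a graph and ID map could be assembled. It uses plain dictionaries rather than the graph object returned by `build_graph`, and it assumes that each post carries a `parent_id` field and that edges point from a parent to its replies. Both are assumptions, not details confirmed by the `src` package.

```python
from collections import defaultdict


def build_reply_graph(posts, parent_key="parent_id"):
    """Return (children, id_to_post) for a list of post dicts.

    `children` maps a post ID to the IDs of its direct replies, i.e. edges
    point from parent to reply. `parent_key` is an assumed field name.
    """
    id_to_post = {post["id"]: post for post in posts}
    children = defaultdict(list)
    for post in posts:
        parent_id = post.get(parent_key)
        # Skip root posts and replies whose parent was filtered out earlier
        if parent_id is not None and parent_id in id_to_post:
            children[parent_id].append(post["id"])
    return dict(children), id_to_post


thread = [
    {"id": "a", "post": "Root", "parent_id": None},
    {"id": "b", "post": "Reply to a", "parent_id": "a"},
    {"id": "c", "post": "Reply to b", "parent_id": "b"},
    {"id": "d", "post": "Orphan reply", "parent_id": "missing"},
]
children, id_map = build_reply_graph(thread)
edge_count = sum(len(replies) for replies in children.values())
print(f"{len(id_map)} nodes, {edge_count} edges")
```

Replies whose parent is absent (for example, removed by the language filter) are kept as nodes but get no incoming edge, so they behave like roots.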

## 4. Apply FastLexRank

Next, we'll apply FastLexRank, an efficient variant of LexRank centrality scoring, to identify the posts that are most representative of the corpus.

In [None]:
# Convert posts to DataFrame for ranking
posts_df = pd.DataFrame(english_posts)

# Apply FastLexRank to identify significant content
ranked_df = fastLexRank(posts_df)
print("Applied FastLexRank to identify significant content")

# Display the top 5 most significant posts
ranked_df.head(5)[['post', 'ap']]
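
To give some intuition for centrality-based ranking, here is a simplified sketch that scores each text by its mean cosine similarity to every other text. This is not the FastLexRank algorithm or the `src.ranking` implementation. It is a small bag-of-words illustration of the underlying idea that a post resembling many other posts is likely to be representative.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for one post."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centrality_scores(texts):
    """Score each text by its mean cosine similarity to all other texts."""
    vectors = [term_vector(t) for t in texts]
    n = len(vectors)
    scores = []
    for i, u in enumerate(vectors):
        total = sum(cosine(u, v) for j, v in enumerate(vectors) if j != i)
        scores.append(total / (n - 1) if n > 1 else 0.0)
    return scores


texts = [
    "the election results were announced today",
    "election results announced this morning",
    "my cat likes to sleep all day",
]
scores = centrality_scores(texts)
ranked = sorted(zip(scores, texts), reverse=True)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

The pairwise comparison here is quadratic in the number of posts, which is the cost that faster approximations aim to avoid on large corpora.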

## 5. Extract Anchor Posts and Their Trees

Now we'll identify anchor posts and extract their trees: each anchor post together with all of its descendants in the graph.

In [None]:
# Extract anchor posts (posts with no parent or matched_id=0)
# If you have a stats DataFrame with matched_id column:
# anchor_ids = extract_anchor_ids(stats_df)

# For demonstration, we'll use the top 10 ranked posts as anchors
top_posts = ranked_df.head(10)
anchor_ids = top_posts['id'].tolist()  # plain Python values for graph and dict lookups
print(f"Selected {len(anchor_ids)} anchor posts")
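
The two anchor-selection strategies mentioned in the cell above can be sketched with plain dictionaries as follows. The helper names are hypothetical, and reading `matched_id == 0` as "this post is an anchor" is an assumption taken from the comment in the cell, not from the `extract_anchor_ids` source.

```python
def anchors_from_stats(stats_rows):
    """IDs of rows whose matched_id is 0 (assumed to mark anchor posts)."""
    return [row["id"] for row in stats_rows if row.get("matched_id") == 0]


def anchors_from_ranking(ranked_rows, k=10, score_key="ap"):
    """IDs of the k highest-scoring rows.

    Sorts explicitly instead of relying on the input order.
    """
    ordered = sorted(ranked_rows, key=lambda row: row[score_key], reverse=True)
    return [row["id"] for row in ordered[:k]]


stats_rows = [
    {"id": 101, "matched_id": 0},
    {"id": 102, "matched_id": 101},
    {"id": 103, "matched_id": 0},
]
ranked_rows = [
    {"id": 101, "ap": 0.2},
    {"id": 102, "ap": 0.9},
    {"id": 103, "ap": 0.5},
]
stats_anchor_ids = anchors_from_stats(stats_rows)
top_anchor_ids = anchors_from_ranking(ranked_rows, k=2)
print(stats_anchor_ids, top_anchor_ids)
```

Note that the two strategies can disagree: a highly ranked post may itself be a reply, in which case its tree is a subtree of some other anchor's tree.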

In [None]:
# Extract all posts in the trees rooted at anchor posts.
# Track seen IDs so overlapping trees (an anchor that is a reply
# under another anchor) do not add the same post twice.
anchor_posts_and_replies = []
seen_ids = set()

for post_id in anchor_ids:
    descendants = get_descendants(post_graph, post_id)
    for node_id in {post_id} | descendants:
        post = id_to_post.get(node_id)
        # Skip IDs with no post in the mapping and posts already collected
        if post and node_id not in seen_ids:
            seen_ids.add(node_id)
            anchor_posts_and_replies.append(post)

print(f"Extracted {len(anchor_posts_and_replies)} posts from anchor post trees")
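
For readers curious about what a descendants lookup involves, the sketch below collects every node reachable from a root with a breadth-first search over a parent-to-children mapping. It is an illustrative stand-in, not the `get_descendants` implementation, and the `children` dictionary is a made-up example.

```python
from collections import deque


def descendants_bfs(children, root):
    """All nodes reachable from `root` (excluding `root`) via breadth-first search."""
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen and child != root:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical reply structure: a -> b, c; b -> d; x -> y
children = {"a": ["b", "c"], "b": ["d"], "x": ["y"]}
tree_a = {"a"} | descendants_bfs(children, "a")
tree_b = {"b"} | descendants_bfs(children, "b")
combined = tree_a | tree_b
print(sorted(combined))
```

Because `b` sits inside the tree rooted at `a`, the two trees overlap; taking the set union keeps each post once.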

## 6. Analyze Results

Finally, we'll re-rank the posts collected from the anchor trees to see which ones are most significant within that subset.

In [None]:
# Convert anchor posts and replies to DataFrame
anchor_df = pd.DataFrame(anchor_posts_and_replies)

# Apply FastLexRank to the anchor posts and replies
ranked_anchor_df = fastLexRank(anchor_df)

# Display the top 5 most significant posts in the anchor trees
ranked_anchor_df.head(5)[['post', 'ap']]

## Summary

In this notebook, we've demonstrated the complete workflow for analyzing social media platform narratives:

1. Loading and preparing data
2. Cleaning and preprocessing text
3. Building a post graph
4. Applying FastLexRank to identify significant content
5. Extracting anchor posts and their trees
6. Analyzing the results

This structured approach makes it easy to reproduce the analysis with different datasets and parameters.