# Social Network Analysis: Eredivisie Toxicity Project

This notebook performs a social network analysis (SNA) on toxic interactions related to Eredivisie football matches. It includes data preprocessing, network construction, community detection, and visualization of toxicity spread.

In [None]:
# ==============================================
# Step 1: Import packages
# ==============================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import pickle
import os

# Configure plot style
sns.set(style="whitegrid")

### Load main and metadata files

This section loads the two core datasets used for network analysis that were created earlier in the pipeline:

- `final_sna_dataframe.csv`: the full dataset of tweets and replies enriched with toxicity scores and thread structure.
- `toxic_match_metadata_cleaned.csv`: structured metadata per match, including team names, dates, and manually defined toxicity-triggering events.

These two datasets form the basis for constructing and analyzing reply networks across matches and clubs.


In [40]:
# ==============================================
# Step 2: Load main and metadata files
# ==============================================

# Load main tweet + reply data
sna_df = pd.read_csv("final_sna_dataframe.csv")

# Load metadata (matches, teams, event triggers)
match_metadata = pd.read_csv("toxic_match_metadata_cleaned.csv")

# Preview structure
print("Main dataset shape:", sna_df.shape)
print("Metadata shape:", match_metadata.shape)


Main dataset shape: (23478, 17)
Metadata shape: (79, 14)


### Clean and prepare tweet data (with URL structure)

This step filters and standardizes the dataset to ensure that reply relationships can be traced accurately across match and club networks.

Key actions include:

- Selecting only the columns relevant for network construction and visualization.
- Dropping entries missing critical identifiers (`match_id`, `author`, or `tweet_url`).
- Cleaning and lowercasing all tweet and parent URLs for consistent matching.
- Converting timestamps into datetime objects for timeline visualizations.
- Reconstructing the `parent_author` field using a mapping from `parent_url → author`, ensuring more reliable edge definitions even when scraped data is incomplete.

This preparation step is essential for building valid reply graphs and avoiding structural gaps.


In [41]:
# ==============================================
# Step 3: Clean and prepare tweet data (with URL structure)
# ==============================================

# Select key columns for SNA
columns_to_keep = [
    "match_id", "author", "text", "timestamp", "team_handle", 
    "final_toxicity_label", "tweet_url", "parent_url", "thread_depth", 
    "original_parent_author"  # make sure this is retained if present
]

# Drop rows without match_id, author, or tweet_url
sna_df = sna_df[columns_to_keep].dropna(subset=["match_id", "author", "tweet_url"])

# Standardize URLs for safe tracing
sna_df["tweet_url"] = sna_df["tweet_url"].astype(str).str.strip().str.lower()
sna_df["parent_url"] = sna_df["parent_url"].astype(str).str.strip().str.lower()

# Convert timestamps
sna_df["timestamp"] = pd.to_datetime(sna_df["timestamp"], errors="coerce")

# Rebuild parent_author using tweet_url → author map
url_to_author = sna_df.set_index("tweet_url")["author"].to_dict()

# Use mapping where possible; fallback to original scraped reply field if needed
mapped_authors = sna_df["parent_url"].map(url_to_author)
sna_df["parent_author"] = mapped_authors.combine_first(sna_df["original_parent_author"])


### Add event trigger information

This step enriches the tweet dataset with metadata about emotionally charged match events that are likely to trigger online toxicity. These include:

- **Red cards**
- **Controversial referee decisions**
- **Player mistakes**
- **Other custom-labeled trigger types**

The selected columns from the match metadata are merged into the main `sna_df` using the `match_id` as a key. This allows later analysis to connect spikes in toxicity or engagement to real-world in-game incidents.


In [42]:
# ==============================================
# Step 4: Add event trigger information
# ==============================================

# Select trigger columns from match metadata
trigger_columns = ["match_id", "trigger_types", "red_card", "controversie", "player_error"]

# Merge into main SNA dataset
sna_df = pd.merge(
    sna_df,
    match_metadata[trigger_columns],
    on="match_id",
    how="left"
)


## Construct per-match reply networks

Now that I have cleaned and merged the tweets and replies into a single dataset, I can construct the interaction network.

Each node in this network represents a unique Twitter user. A directed edge is created from user A to user B if A replied to B. This network helps me analyze the structure and dynamics of conversations surrounding Eredivisie matches.

I focus only on replies with valid `author` and `in_reply_to_user` values to ensure edges reflect meaningful interactions.

Key actions in this step:
- Filter relevant reply rows
- Create a directed edge list: source = replier, target = person replied to
- Build a directed graph using NetworkX


In [None]:
# ==============================================
# Step 5a: Create per-match reply networks
# ==============================================

# === Handle normalization ===
def normalize_handle(x):
    return str(x).lower().replace("@", "").strip()

# === Dictionary to store graphs per match ===
match_networks = {}

for match_id, match_df in sna_df.groupby("match_id"):
    G = nx.DiGraph()

    # 1. Add reply edges: normalized author → parent_author
    for _, row in match_df.iterrows():
        author = normalize_handle(row["author"])
        parent = normalize_handle(row["parent_author"])
        if pd.notna(author) and pd.notna(parent) and author != parent:
            G.add_edge(author, parent, match_id=match_id)

    # 2. Add timeline authors as isolated nodes if not already in graph
    timeline_authors = match_df[match_df["thread_depth"] == 0]["author"].dropna().unique()
    for author in timeline_authors:
        norm_author = normalize_handle(author)
        if norm_author not in G:
            G.add_node(norm_author)

    match_networks[match_id] = G
    print(f"Match {match_id}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Save the match networks as a pickle file
with open("C:/Master/Master project/match_networks.pkl", "wb") as f:
    pickle.dump(match_networks, f)

Match M001: 65 nodes, 72 edges
Match M002: 34 nodes, 37 edges
Match M003: 41 nodes, 44 edges
Match M004: 42 nodes, 44 edges
Match M005: 72 nodes, 85 edges
Match M006: 192 nodes, 253 edges
Match M007: 19 nodes, 18 edges
Match M008: 258 nodes, 304 edges
Match M009: 58 nodes, 59 edges
Match M010: 478 nodes, 611 edges
Match M011: 131 nodes, 139 edges
Match M012: 85 nodes, 88 edges
Match M013: 26 nodes, 28 edges
Match M014: 164 nodes, 193 edges
Match M015: 3 nodes, 2 edges
Match M016: 72 nodes, 76 edges
Match M017: 119 nodes, 140 edges
Match M018: 47 nodes, 52 edges
Match M019: 5 nodes, 3 edges
Match M020: 238 nodes, 276 edges
Match M021: 64 nodes, 74 edges
Match M022: 117 nodes, 136 edges
Match M023: 175 nodes, 220 edges
Match M024: 49 nodes, 54 edges
Match M025: 38 nodes, 42 edges
Match M026: 201 nodes, 234 edges
Match M027: 38 nodes, 38 edges
Match M028: 125 nodes, 141 edges
Match M029: 534 nodes, 736 edges
Match M030: 209 nodes, 263 edges
Match M031: 13 nodes, 13 edges
Match M032: 71 no

## Construct per-club reply networks

To analyze interaction patterns at the club level, I create separate reply networks for each Eredivisie club based on the `team_handle` associated with the tweets.

Each node in these networks represents a Twitter user. A directed edge is drawn from the reply author to the user they replied to. In addition to interaction edges, I also add timeline authors (original posters with `thread_depth == 0`) as isolated nodes, even if they didn’t participate in replies, to preserve their presence in the network.

Key actions in this step:
- Group the dataset by `team_handle` to process each club separately
- Create directed edges from repliers to parent authors
- Add isolated nodes for timeline authors without interactions
- Store each club’s graph in a dictionary and save it to file


In [44]:
# ==============================================
# Step 5b: Create per-club reply networks (with isolated timeline authors)
# ==============================================

# Helper to normalize usernames
def normalize_handle(x):
    return str(x).lower().replace("@", "").strip()

# Dictionary to store graphs per club
club_networks = {}

for club_name, club_df in sna_df.groupby("team_handle"):
    G = nx.DiGraph()

    # 1. Add reply edges: normalized author → parent_author
    for _, row in club_df.iterrows():
        author = normalize_handle(row["author"])
        parent = normalize_handle(row["parent_author"])

        if pd.notna(author) and pd.notna(parent) and author != parent:
            G.add_edge(author, parent, club=club_name)

    # 2. Add timeline authors (thread_depth = 0) as isolated nodes
    timeline_authors = club_df[club_df["thread_depth"] == 0]["author"].dropna().unique()
    for author in timeline_authors:
        norm_author = normalize_handle(author)
        if norm_author not in G:
            G.add_node(norm_author)

    club_networks[club_name] = G
    print(f"{club_name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

with open("C:/Master/Master project/club_networks.pkl", "wb") as f:
    pickle.dump(club_networks, f)



AFCAjax: 2269 nodes, 4234 edges
AZAlkmaar: 343 nodes, 506 edges
AlmereCityFC: 75 nodes, 78 edges
Feyenoord: 1508 nodes, 2764 edges
FortunaSittard: 80 nodes, 89 edges
GAEagles: 184 nodes, 211 edges
HeraclesAlmelo: 95 nodes, 103 edges
NACnl: 243 nodes, 322 edges
PECZwolle: 258 nodes, 328 edges
PSV: 1348 nodes, 2132 edges
RKCWAALWIJK: 84 nodes, 93 edges
SpartaRotterdam: 162 nodes, 266 edges
WillemII: 84 nodes, 99 edges
fcgroningen: 214 nodes, 300 edges
fctwente: 747 nodes, 1178 edges
fcutrecht: 201 nodes, 265 edges
necnijmegen: 158 nodes, 224 edges
scHeerenveen: 128 nodes, 176 edges


## Compute SNA metrics per match

In this step, I compute key social network analysis (SNA) metrics for each match-specific reply graph. These metrics help quantify the structure and dynamics of user interactions during each match.

For each `match_id`, I calculate:

- **Degree Centrality**: Measures how many direct connections a user has.
- **Betweenness Centrality**: Captures how often a user lies on the shortest path between others, indicating influence or gatekeeping.
- **Clustering Coefficient**: Reflects how tightly users in the network tend to cluster together (measured on the undirected version of the graph).

I then compute the average value of each metric across all nodes in the graph and store the results in a summary DataFrame (`match_metrics_df`).


In [None]:
# ==============================================
# Step 6: Compute SNA metrics per match
# ==============================================

match_metrics = []

# Loop over each match-specific graph
for match_id, G in match_networks.items():
    if G.number_of_nodes() == 0:
        continue  # Skip empty graphs

    # Compute node-level metrics
    degree_centrality = nx.degree_centrality(G)
    betweenness_centrality = nx.betweenness_centrality(G, normalized=True)
    clustering = nx.clustering(G.to_undirected())

    # Aggregate into match-level averages
    match_metrics.append({
        "match_id": match_id,
        "num_nodes": G.number_of_nodes(),
        "num_edges": G.number_of_edges(),
        "avg_degree_centrality": sum(degree_centrality.values()) / len(degree_centrality),
        "avg_betweenness": sum(betweenness_centrality.values()) / len(betweenness_centrality),
        "avg_clustering": sum(clustering.values()) / len(clustering)
    })

# Convert results to DataFrame and preview
match_metrics_df = pd.DataFrame(match_metrics)
match_metrics_df.sort_values("num_nodes", ascending=False).head()

Unnamed: 0,match_id,num_nodes,num_edges,avg_degree_centrality,avg_betweenness,avg_clustering
57,M058,965,1355,0.002913,0.000108,0.025263
74,M075,592,808,0.004619,5.5e-05,0.022544
28,M029,534,736,0.005172,0.000517,0.007119
9,M010,478,611,0.00536,2.8e-05,0.033842
34,M035,452,580,0.00569,3.1e-05,0.022048


## Compute SNA metrics per club

Similar to the match-level analysis, I compute network metrics for each club-specific reply graph. These metrics provide insight into how users engage in conversations surrounding specific Eredivisie teams.

For each `club` in the `club_networks` dictionary, I calculate:

- **Degree Centrality**: Measures direct interactions a user has.
- **Betweenness Centrality**: Indicates how influential a user is in the flow of information.
- **Clustering Coefficient**: Measures the tightness of user clusters in the network.

The results are aggregated into a `club_metrics_df` DataFrame, enabling comparison across clubs based on interaction structure and toxicity patterns.

In [None]:
# ==============================================
# Step 7: Compute SNA metrics per club
# ==============================================

club_metrics = []

# Loop over each club-specific graph
for club, G in club_networks.items():
    if G.number_of_nodes() == 0:
        continue  # Skip empty graphs

    # Compute node-level metrics
    degree_centrality = nx.degree_centrality(G)
    betweenness_centrality = nx.betweenness_centrality(G, normalized=True)
    clustering = nx.clustering(G.to_undirected())

    # Aggregate into club-level averages
    club_metrics.append({
        "club": club,
        "num_nodes": G.number_of_nodes(),
        "num_edges": G.number_of_edges(),
        "avg_degree_centrality": sum(degree_centrality.values()) / len(degree_centrality),
        "avg_betweenness": sum(betweenness_centrality.values()) / len(betweenness_centrality),
        "avg_clustering": sum(clustering.values()) / len(clustering)
    })

# Convert results to DataFrame and preview
club_metrics_df = pd.DataFrame(club_metrics)
club_metrics_df.sort_values("num_nodes", ascending=False).head()

Unnamed: 0,club,num_nodes,num_edges,avg_degree_centrality,avg_betweenness,avg_clustering
0,AFCAjax,2269,4234,0.001646,0.000155,0.059631
3,Feyenoord,1508,2764,0.002433,0.000339,0.04851
9,PSV,1348,2132,0.002348,0.000305,0.041445
14,fctwente,747,1178,0.004228,0.000435,0.074687
1,AZAlkmaar,343,506,0.008627,0.000423,0.047156


## References and Justification of Methods

This notebook applies Social Network Analysis (SNA) techniques to visualize and analyze interactions around Eredivisie football club tweets. The development of this notebook is grounded in both academic literature and established technical tools. Below are the references used for each key component:

**1. Social Network Metrics**  
The following SNA metrics were implemented:
- **Degree Centrality**, **Betweenness Centrality**, and **Clustering Coefficient** are core concepts in network science (Wasserman & Faust, 1994).
- These have been applied to Twitter in research such as Himelboim et al. (2017), which explored influence structures in social media.

> Wasserman, S., & Faust, K. (1994). *Social Network Analysis: Methods and Applications*. Cambridge University Press.  
> Himelboim, I., Smith, M. A., Rainie, L., Shneiderman, B., & Espina, C. (2017). Classifying Twitter topic-networks using social network analysis. *Social Media + Society*, 3(1).

**2. Community Detection (Louvain Algorithm)**  
Community detection is done using the Louvain method (Blondel et al., 2008), a fast and widely used algorithm for modularity optimization in large-scale networks.

> Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. *Journal of Statistical Mechanics: Theory and Experiment*, 2008(10), P10008.

**3. Toxicity Highlighting and Node Annotation**  
Nodes are color-coded and styled based on account type and toxicity, inspired by prior work on toxic content in Twitter SNA (Chatzakou et al., 2017).

> Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., & Vakali, A. (2017). Mean birds: Detecting aggression and bullying on Twitter. In *Proceedings of the ACM Web Science Conference* (WebSci).

**4. Visualization Libraries**  
We use NetworkX and Plotly for interactive and static graph visualization. These are standard tools for Python-based SNA:
- NetworkX: Hagberg et al. (2008)
- Plotly: Official documentation and open-source community examples

> Hagberg, A. A., Schult, D. A., & Swart, P. J. (2008). Exploring network structure, dynamics, and function using NetworkX. In *Proceedings of the 7th Python in Science Conference* (SciPy2008), 11–15.

These references justify the choice of methods used in this notebook and align with best practices in both social science research and computational SNA.
