### Introduction

This notebook combines all the CSV files generated from the scraping process into one structured dataset. It loads, merges, and cleans the tweet data from multiple match files, ensuring that all timeline tweets and their associated replies (first-level and second-level) are aggregated into a unified format.

This step is important for preparing the final dataset that will be used for toxicity detection and social network analysis. It allows me to work with the entire sample of scraped data in a consistent way, across all matches and teams.

### Prepare combined tweet dataset

This notebook loads all tweet data collected from the scraping pipeline and merges it into a single, clean dataset for further analysis. Specifically, it brings together:

- Timeline tweets from official club accounts
- First-level replies to those tweets
- Second-level replies to those replies (when available)

The process begins by reading every CSV file in the `timelines`, `replies`, and `second_level_replies` folders. I loop through each folder and concatenate its files into one dataframe per level. Because the files are discovered at run time, I can add or remove matches without changing the code.

To keep track of where each tweet or reply comes from, I preserve the key metadata columns: `match_id` and `team_handle` (parsed from each filename), plus `tweet_url` and `reply_url`. These identifiers are needed in later steps:
- In **NLP**, I will classify toxicity levels based on the `tweet` or `reply` content.
- In **social network analysis (SNA)**, I will use these IDs to map who is interacting with whom and through what type of tweet.
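
As an illustration of how `match_id` and `team_handle` can be recovered from a filename, the sketch below mirrors the `extract_match_and_team` helper from the merge cell and applies it to two filenames that appear in this notebook's output. The third filename is hypothetical and only shows a limitation: a handle that itself contains an underscore would be cut down to the part after the last underscore.

```python
import re

def extract_match_and_team(filename):
    """Return (match_id, team_handle) parsed from a scraper output filename."""
    match = re.search(r"_M(\d+)_", filename)            # e.g. "_M065_" -> "065"
    team = filename.split("_")[-1].replace(".csv", "")  # text after the last underscore
    return (f"M{match.group(1)}" if match else None), team

examples = [
    "AlmereCityFC_tweets_M065_AlmereCityFC.csv",
    "second_level_replies_M001_fcgroningen.csv",
    "replies_M010_My_Club.csv",  # hypothetical name, not from the dataset
]

parsed = {name: extract_match_and_team(name) for name in examples}
for name, (match_id, team) in parsed.items():
    print(f"{name} -> match_id={match_id}, team_handle={team}")
```

If handles with underscores ever occur, splitting on the match token instead of the last underscore would be one way around the truncation.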

I also perform basic cleanup, such as:
- Dropping duplicate tweets or replies (based on timestamp or link)
- Normalizing column names and formats
- Ensuring consistent datetime formats, which are important for time-based filtering or network animation
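
The merge cell in this notebook renames columns but does not show the deduplication or datetime steps explicitly, so here is a minimal sketch of how the three cleanup steps could look. The function name `basic_cleanup`, its parameters, and the sample rows are assumptions for illustration. It also assumes the timestamps are in a format that `pd.to_datetime` can parse; anything else becomes `NaT`.

```python
import pandas as pd

def basic_cleanup(df, url_col="tweet_url", time_col="timestamp"):
    """Normalize column names, parse timestamps, and drop rows with a repeated URL."""
    out = df.copy()

    # Normalize column names: trim whitespace, lower-case, spaces to underscores.
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )

    # Parse timestamps into timezone-aware datetimes; unparseable values become NaT.
    out[time_col] = pd.to_datetime(out[time_col], errors="coerce", utc=True)

    # Drop repeated URLs, but keep rows whose URL is missing, because
    # missing values would otherwise be treated as duplicates of each other.
    is_dup = out[url_col].notna() & out.duplicated(subset=[url_col], keep="first")
    return out[~is_dup].reset_index(drop=True)

# Made-up sample data with messy headers, one duplicate, and one bad timestamp.
demo = pd.DataFrame({
    " Tweet_URL": ["u1", "u1", "u2", None, None],
    "Timestamp ": [
        "2024-03-01 20:15:00",
        "2024-03-01 20:15:00",
        "not a date",
        "2024-03-02 10:00:00",
        "2024-03-02 11:00:00",
    ],
})

clean = basic_cleanup(demo)
print(clean)
```

Deduplicating on the URL rather than on the timestamp is a design choice in this sketch: two different replies can share a timestamp, while a URL identifies one tweet.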

After combining everything, I either merge the timeline tweets with their replies or keep the levels separate, depending on the structure that the downstream analysis notebooks require.
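
For the case where replies are joined to the tweets they answer, a left join on the parent URL is one option. This is a sketch under assumptions, not the notebook's own method: the column names follow the renamed schema used in the merge cell (`tweet_url`, `parent_url`, `text`, `author`), and the sample rows are made up.

```python
import pandas as pd

# Made-up sample data in the renamed schema.
tweets = pd.DataFrame({
    "tweet_url": ["t1", "t2"],
    "text": ["Kick-off!", "Full time"],
    "author": ["clubA", "clubA"],
})
replies = pd.DataFrame({
    "tweet_url": ["r1", "r2", "r3"],
    "parent_url": ["t1", "t1", "t9"],  # "t9" was never scraped
    "text": ["Come on!", "Great start", "Where is the lineup?"],
    "author": ["fan1", "fan2", "fan3"],
})

# Rename the tweet columns so they describe the parent side of the join.
parents = tweets.rename(columns={
    "tweet_url": "parent_url",
    "text": "parent_text",
    "author": "parent_author",
})

# Left join keeps every reply; validate raises if a parent URL is not unique.
threads = replies.merge(
    parents, on="parent_url", how="left", validate="many_to_one"
)
print(threads)
```

Replies whose parent tweet is missing from the timeline data keep their row and get empty parent columns, which makes gaps in the scrape easy to spot.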

By the end of this step, I have a centralized and standardized dataset that represents all match-related conversations. This unified dataset is the foundation for the upcoming stages of the project, including toxicity classification using NLP and influence mapping using social network analysis.


```python
import os
import re

import pandas as pd

# === Paths ===
timeline_path = "C:/Master/Master project/timelines"
replies_path = "C:/Master/Master project/replies"
second_replies_path = "C:/Master/Master project/second_level_replies"
output_path = "C:/Master/Master project/merged_threads_fixed.csv"

# === File-level functions ===
def extract_match_and_team(filename):
    """Parse the match ID (e.g. "M065") and the team handle from a filename."""
    match = re.search(r'_M(\d+)_', filename)
    team = filename.split("_")[-1].replace(".csv", "")
    return f"M{match.group(1)}" if match else None, team

def safe_load_csv(filepath):
    """Read one CSV; print a message and return None if it is empty or unreadable."""
    try:
        df = pd.read_csv(filepath, encoding="utf-8-sig")
        if df.empty or len(df.columns) == 0:
            raise ValueError("Empty or invalid file")
        return df
    except Exception as e:
        print(f"Failed to load {os.path.basename(filepath)}: {e}")
        return None

def load_and_prepare(folder, depth):
    """Load every CSV in `folder` and tag rows with match, team, depth, and source file."""
    dfs = []
    for file in sorted(os.listdir(folder)):  # sorted for a reproducible load order
        if not file.endswith(".csv"):
            continue
        full_path = os.path.join(folder, file)
        df = safe_load_csv(full_path)
        if df is None:
            continue

        match_id, team_handle = extract_match_and_team(file)
        df["match_id"] = match_id
        df["team_handle"] = team_handle
        df["thread_depth"] = depth
        df["source_file"] = file
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()

# === Load data ===
tweets = load_and_prepare(timeline_path, depth=0)
replies = load_and_prepare(replies_path, depth=1)
second = load_and_prepare(second_replies_path, depth=2)

# === Clean each layer ===

# --- Timeline tweets ---
tweets = tweets.rename(columns={"tweet": "text", "link": "tweet_url"})
# The account name is the part of the filename before "_tweets_"
tweets["author"] = tweets["source_file"].str.split("_tweets_").str[0]
tweets["parent_url"] = None  # top-level tweets have no parent
tweets["thread_id"] = tweets["tweet_url"]

# --- Replies ---
replies = replies.rename(columns={
    "reply": "text",
    "reply_url": "tweet_url",
    "in_reply_to_user": "original_parent_author",  # keep original if needed
    "tweet_url": "parent_url"  # this is the tweet being replied to
})
replies["thread_id"] = replies["parent_url"]

# --- Second-level replies ---
second = second.rename(columns={
    "reply": "text",
    "reply_url": "tweet_url",
    "parent_reply_url": "parent_url",
    "in_reply_to_author": "original_parent_author"
})
# Note: parent_url is the first-level reply here, so thread_id points at
# that reply rather than at the root timeline tweet.
second["thread_id"] = second["parent_url"]

# === Ensure all columns exist ===
all_cols = [
    "match_id", "team_handle", "thread_depth", "thread_id", "author",
    "text", "timestamp", "tweet_url", "parent_author", "parent_url",
    "comments", "source_file", "original_parent_author"
]

def fill_missing_columns(df):
    """Add any missing columns as None and return them in a fixed order."""
    for col in all_cols:
        if col not in df.columns:
            df[col] = None
    return df[all_cols]

tweets = fill_missing_columns(tweets)
replies = fill_missing_columns(replies)
second = fill_missing_columns(second)

# === Merge all ===
merged = pd.concat([tweets, replies, second], ignore_index=True)
merged = merged.sort_values(by=["match_id", "thread_id", "timestamp", "thread_depth"]).reset_index(drop=True)

# === Rebuild parent_author using a tweet_url -> author map ===
url_to_author = merged.set_index("tweet_url")["author"].to_dict()
merged["parent_author"] = merged["parent_url"].map(url_to_author)

# === Fall back to the scraped in_reply_to value when mapping fails ===
merged["parent_author"] = merged["parent_author"].fillna(merged["original_parent_author"])

# === Save output ===
merged.to_csv(output_path, index=False, encoding="utf-8-sig")
print(f"Merged dataset saved to: {output_path}")
```


```text
Failed to load AlmereCityFC_tweets_M065_AlmereCityFC.csv: Empty or invalid file
Failed to load PECZwolle_tweets_M053_PECZwolle.csv: Empty or invalid file
Failed to load scHeerenveen_tweets_M015_scHeerenveen.csv: Empty or invalid file
Failed to load scHeerenveen_tweets_M067_scHeerenveen.csv: Empty or invalid file
Failed to load second_level_replies_M001_fcgroningen.csv: No columns to parse from file
Failed to load second_level_replies_M007_GAEagles.csv: No columns to parse from file
Failed to load second_level_replies_M027_RKCWAALWIJK.csv: No columns to parse from file
Merged dataset saved to: C:/Master/Master project/merged_threads_fixed.csv
```
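
One detail of the merge cell is worth noting: for second-level replies, `thread_id` is set to `parent_url`, which is the URL of the first-level reply rather than the root timeline tweet. If a later analysis needs every row of a conversation to share the root tweet's URL, the parent links can be followed upward. The sketch below is one possible way to do that; the function name `resolve_thread_root` and the sample rows are assumptions, and the column names follow the merged schema.

```python
import pandas as pd

def resolve_thread_root(df, max_depth=2):
    """Follow parent_url links upward and return the root URL for each row."""
    parent_of = (
        df.dropna(subset=["tweet_url"])
          .set_index("tweet_url")["parent_url"]
          .to_dict()
    )

    def root(url):
        # Climb at most max_depth + 1 links so a malformed cycle cannot loop forever.
        for _ in range(max_depth + 1):
            parent = parent_of.get(url)
            if parent is None or pd.isna(parent):
                break
            url = parent
        return url

    return df["tweet_url"].map(root)

# Made-up sample: one tweet, one reply, one second-level reply, one orphan reply.
sample = pd.DataFrame({
    "tweet_url":    ["t1", "r1", "s1", "r9"],
    "parent_url":   [None, "t1", "r1", "t9"],
    "thread_depth": [0, 1, 2, 1],
})

sample["root_url"] = resolve_thread_root(sample)
print(sample)
```

A reply whose parent was never scraped resolves to that parent's URL, which is the closest available stand-in for the root.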


### Note on empty or missing files

During the merge process, you may see warnings such as:

- `Empty or invalid file`
- `No columns to parse from file`

These messages appear when the scraper produced no tweets or replies for a particular match or team. The merge itself has not failed; an empty file simply means one of the following:

- No tweets were posted by that account in the scraping window, or
- No tweets received replies, or
- Nitter failed to load valid tweet content during scraping

The merge step skips these files and continues processing the rest of the data, so the messages do not affect the functionality of the tool. Keep in mind that a file left empty by a Nitter failure means that match or team is absent from the merged dataset.
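
The two messages correspond to two different kinds of empty file. `No columns to parse from file` is the message of `pandas.errors.EmptyDataError`, which `pd.read_csv` raises for a file with no content at all, while `Empty or invalid file` is raised by `safe_load_csv` when a file has a header row but no data rows. The sketch below tells the two cases apart; the function name `classify_csv`, the status labels, and the temporary files are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

def classify_csv(filepath):
    """Return 'no_content', 'header_only', or 'ok' for one CSV file."""
    try:
        df = pd.read_csv(filepath, encoding="utf-8-sig")
    except pd.errors.EmptyDataError:
        return "no_content"   # zero-byte file: nothing to parse
    return "header_only" if df.empty else "ok"

# Build three small files in a temporary folder to exercise each case.
contents = {
    "empty.csv": "",
    "header.csv": "tweet,link\n",
    "data.csv": "tweet,link\nhello,https://example.com/1\n",
}

results = {}
with tempfile.TemporaryDirectory() as folder:
    for name, body in contents.items():
        path = os.path.join(folder, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
        results[name] = classify_csv(path)

print(results)
```

Logging the status per file, instead of only printing a message, would make it easier to decide which matches are worth re-scraping.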
