# Project 3
## Author: Brailey Sharpe
## Version: Fall 2025

In [25]:
import pandas as pd
import glob
import os

## 1. Load and Prepare Tweet Dataset from Multiple Sources

The tweet dataset for this project was collected by multiple students in the class. Originally, each CSV file in the `data` folder was named after the individual student who collected it. For privacy reasons, these files were renamed to a generic pattern (`stu1.csv`, `stu2.csv`, …, `stu14.csv`) so that no student names appear in this notebook or any public repository.
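
The notebook does not record how the renaming was done. As a rough sketch only, it could be scripted as shown here; `anonymize_csv_names` is a hypothetical helper, and the file names in the demonstration are made up. The order is shuffled so that the numbering does not mirror the alphabetical order of the original names.

```python
import os
import random
import tempfile


def anonymize_csv_names(folder: str, prefix: str = "stu") -> None:
    """Hypothetical helper: rename each CSV in `folder` to <prefix><n>.csv."""
    names = [f for f in os.listdir(folder) if f.lower().endswith(".csv")]
    random.shuffle(names)  # numbering should not follow the original name order
    for n, old in enumerate(names, start=1):
        target = os.path.join(folder, f"{prefix}{n}.csv")
        if os.path.exists(target):
            # Assumption: the folder holds no files already using the pattern.
            raise FileExistsError(f"Refusing to overwrite {target}")
        os.rename(os.path.join(folder, old), target)


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("alice.csv", "bob.csv", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    anonymize_csv_names(tmp)
    renamed = sorted(os.listdir(tmp))

print(renamed)  # non-CSV files are left untouched
```

The helper deliberately keeps no mapping from old to new names, since storing one alongside the data would defeat the purpose of the renaming.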

All CSV files in the `data` directory are loaded, cleaned, and combined into a single dataframe. For each file, only rows where `type == "tweet"` are kept, so that mock records and other non-tweet entries are removed. Each file is then restricted to the 20 common columns specified in Part 1 of the project; a column that a file lacks is added and filled with missing values:

- `id`, `url`, `twitterUrl`, `text`, `source`
- `retweetCount`, `replyCount`, `likeCount`, `quoteCount`, `viewCount`
- `createdAt`, `lang`, `bookmarkCount`
- `isReply`, `inReplyToId`, `conversationId`, `inReplyToUsername`
- `isPinned`, `isConversationControlled`, `isQuote`

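`createdAt` is parsed as a UTC timestamp and the six count columns are converted to numeric values; entries that cannot be converted become missing. Rows that are missing a key field (`id`, `text`, or `createdAt`, including a `createdAt` that failed to parse) are discarded. After all files are concatenated, duplicate tweets are removed in two stages: first, rows sharing an `id` are dropped, keeping the first occurrence, and then any remaining rows that are exact duplicates across all 20 columns are removed. Because `id` is itself one of those 20 columns, the second stage is only a safeguard: once every `id` is unique, it cannot remove any further rows. For more information about the dataset, see `README.md`.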
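
The two-stage deduplication can be illustrated on a toy dataframe. This is a minimal sketch with invented values, not the project data; it uses three columns instead of the full 20.

```python
import pandas as pd

# Invented rows: id 1 appears twice, with different like counts.
toy = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "text": ["a", "a", "b", "c"],
    "likeCount": [5, 7, 0, 2],
})

# Stage 1: keep the first row for each id.
by_id = toy.drop_duplicates(subset="id", keep="first")

# Stage 2: drop rows that match on every column.
by_all = by_id.drop_duplicates(keep="first")

removed_by_id = len(toy) - len(by_id)
removed_full = len(by_id) - len(by_all)
print(removed_by_id, removed_full)  # prints: 1 0
```

Note that `keep="first"` retains whichever copy of a tweet is encountered first, so the surviving engagement counts depend on the order in which the files are concatenated.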

In [26]:
TWITTER_DATETIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

COMMON_COLS = [
    "id", "url", "twitterUrl", "text", "source",
    "retweetCount", "replyCount", "likeCount", "quoteCount", "viewCount",
    "createdAt", "lang", "bookmarkCount",
    "isReply", "inReplyToId", "conversationId", "inReplyToUsername",
    "isPinned", "isConversationControlled", "isQuote"
]

NUMERIC_COLS = [
    "retweetCount", "replyCount", "likeCount",
    "quoteCount", "viewCount", "bookmarkCount"
]


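def load_and_clean_single(path: str) -> pd.DataFrame:
    """Load and clean a single student CSV file."""
    df = pd.read_csv(path, low_memory=False)

    # Keep only real tweets; files without a "type" column are left unfiltered.
    if "type" in df.columns:
        df = df[df["type"] == "tweet"].copy()

    # Add any common column this file lacks so every frame has the same schema.
    for col in COMMON_COLS:
        if col not in df.columns:
            df[col] = pd.NA

    # Timestamps that do not match the expected format become NaT.
    df["createdAt"] = pd.to_datetime(
        df["createdAt"],
        format=TWITTER_DATETIME_FORMAT,
        errors="coerce",
        utc=True
    )

    # Count values that cannot be converted become NaN.
    for col in NUMERIC_COLS:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Drop rows missing a key field, including rows whose createdAt failed to parse.
    df = df.dropna(subset=["id", "text", "createdAt"])
    df = df[COMMON_COLS].copy()

    return df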


def load_all_data(data_folder: str = "data") -> pd.DataFrame:
    """Load, clean, and combine all student CSV files."""
    pattern = os.path.join(data_folder, "*.csv")
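    # Sort so the file order, and therefore which duplicate keep="first" retains, is deterministic.
    file_paths = sorted(glob.glob(pattern))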

    if not file_paths:
        raise FileNotFoundError(f"No CSV files found in folder: {data_folder}")

    frames = []
    for fp in file_paths:
        print(f"Loading and cleaning: {fp}")
        cleaned = load_and_clean_single(fp)
        frames.append(cleaned)

    df = pd.concat(frames, ignore_index=True)

    before_id = len(df)
    df = df.drop_duplicates(subset="id", keep="first")
    after_id = len(df)
    print(f"Removed {before_id - after_id} duplicates by ID.")

    before_full = len(df)
    df = df.drop_duplicates(subset=COMMON_COLS, keep="first")
    after_full = len(df)
    print(f"Removed {before_full - after_full} full-content duplicates.")

    print("Final dataframe shape:", df.shape)
    return df

df = load_all_data("data")


Loading and cleaning: data\stu1.csv
Loading and cleaning: data\stu10.csv
Loading and cleaning: data\stu11.csv
Loading and cleaning: data\stu12.csv
Loading and cleaning: data\stu13.csv
Loading and cleaning: data\stu14.csv
Loading and cleaning: data\stu2.csv
Loading and cleaning: data\stu3.csv
Loading and cleaning: data\stu4.csv
Loading and cleaning: data\stu5.csv
Loading and cleaning: data\stu6.csv
Loading and cleaning: data\stu7.csv
Loading and cleaning: data\stu8.csv
Loading and cleaning: data\stu9.csv
Removed 867 duplicates by ID.
Removed 0 full-content duplicates.
Final dataframe shape: (20653, 20)
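
As an optional sanity check, the guarantees described in this section (unique `id` values, no missing key fields, timezone-aware `createdAt`) can be verified directly. The sketch below is an illustration, not part of the original notebook: `check_cleaned` is a hypothetical helper, and it is run here on a small invented frame rather than on the combined dataframe.

```python
import pandas as pd

KEY_COLS = ["id", "text", "createdAt"]


def check_cleaned(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: report whether the cleaning guarantees hold."""
    return {
        "ids_unique": bool(frame["id"].is_unique),
        "no_missing_keys": bool(frame[KEY_COLS].notna().all().all()),
        "createdAt_tz_aware": isinstance(
            frame["createdAt"].dtype, pd.DatetimeTZDtype
        ),
    }


# Invented rows using the same timestamp format as the notebook.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["a", "b", "c"],
    "createdAt": pd.to_datetime(
        ["Wed Oct 10 20:19:24 +0000 2018"] * 3,
        format="%a %b %d %H:%M:%S %z %Y",
        utc=True,
    ),
})

report = check_cleaned(toy)
print(report)
```

Calling `check_cleaned(df)` on the dataframe returned by `load_all_data` should report `True` for all three entries if the cleaning steps behaved as described.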
