# Airline Tweet Sentiment – Exploratory Data Analysis

This notebook provides a quick exploratory data analysis (EDA) for the airline tweet sentiment datasets used in our paper:

**Exploring Transformer Models for Sentiment Analysis in Airline Service Reviews**  
https://ieeexplore.ieee.org/abstract/document/10796289

We focus on:
- Basic dataset structure and size
- Sentiment label distribution
- Example tweets per sentiment class

You can switch between datasets (e.g. `airline_us`, `airline_global`, `airline_merged`) by changing a single variable in the first code cell.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from airline_sentiment.utils.config import load_global_config, PROJECT_ROOT

# -------------------------------------------------------------------
# Configuration
# -------------------------------------------------------------------

# Choose which dataset to inspect: "airline_us", "airline_global", "airline_merged"
DATASET_NAME = "airline_us"

cfg = load_global_config()
processed_root = cfg.get("paths", {}).get("processed_data", "data/processed")
dataset_dir = PROJECT_ROOT / processed_root / DATASET_NAME

print(f"Using processed dataset from: {dataset_dir}")

if not dataset_dir.is_dir():
    # Include the resolved path so a misconfigured root is easy to spot.
    raise FileNotFoundError(
        f"Processed dataset directory not found: {dataset_dir}. "
        "Run 'python scripts/prepare_datasets.py' first."
    )

train_path = dataset_dir / "train.csv"
val_path = dataset_dir / "val.csv"
test_path = dataset_dir / "test.csv"

for p in [train_path, val_path, test_path]:
    if not p.is_file():
        raise FileNotFoundError(
            f"Expected split file not found: {p}. Did you run the data preparation script?"
        )

train_df = pd.read_csv(train_path)
val_df = pd.read_csv(val_path)
test_df = pd.read_csv(test_path)

len(train_df), len(val_df), len(test_df)

## Basic dataset overview

We start by looking at the shapes of the splits and a sample of rows.

In [None]:
print("Train shape:", train_df.shape)
print("Val shape:  ", val_df.shape)
print("Test shape: ", test_df.shape)

print("\nTrain columns:", list(train_df.columns))

train_df.head()

## Sentiment label distribution

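Now we inspect the distribution of sentiment labels in the training set. This shows how imbalanced the classes are. The training counts alone do not confirm that the stratified split worked as expected; for that, the label proportions should also match across the validation and test splits.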

In [None]:
label_counts = train_df["label_str"].value_counts().sort_index()
label_counts

In [None]:
plt.figure(figsize=(6, 4))
label_counts.plot(kind="bar")
plt.title(f"Label distribution in TRAIN split – {DATASET_NAME}")
plt.xlabel("Sentiment label")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
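
To check the stratification itself, it can help to put the label proportions of all three splits side by side. The cell below is a minimal sketch: `label_proportions` is a hypothetical helper (not part of the `airline_sentiment` package), and it assumes every split has the same `label_str` column as the training set.

```python
import pandas as pd


def label_proportions(splits, label_col="label_str"):
    """Return per-split label proportions (rows: labels, columns: splits).

    `splits` maps a split name to its DataFrame. This helper is illustrative
    and assumes each DataFrame contains `label_col`.
    """
    props = {
        name: df[label_col].value_counts(normalize=True)
        for name, df in splits.items()
    }
    # A label that is absent from a split is reported as 0.0 instead of NaN.
    return pd.DataFrame(props).fillna(0.0).sort_index()
```

In this notebook it could be called as `label_proportions({"train": train_df, "val": val_df, "test": test_df}).round(3)`. With a stratified split, each row should show nearly identical values across the three columns.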

## Example tweets per sentiment class

We inspect a few example tweets from each sentiment class to qualitatively understand the data.

In [None]:
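def show_examples(df, label_str, n=5):
    """Print the first `n` tweets in `df` whose label equals `label_str`."""
    # head(n) takes the first n matching rows in file order, not a random sample.
    subset = df[df["label_str"] == label_str].head(n)
    print(f"\n=== Examples for label: {label_str} (n={len(subset)}) ===")
    for idx, row in subset.iterrows():
        print(f"[{idx}]", row["text"])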

for label in sorted(train_df["label_str"].unique()):
    show_examples(train_df, label, n=5)

## Next steps

From here you can extend the EDA to:
- Analyze tweet length distributions
- Investigate frequent unigrams/bigrams per class
- Explore temporal patterns (if timestamps are available)
- Check for data leakage or anomalies
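
As a starting point for two of these items, the sketch below summarizes tweet length per class and looks for tweets that appear in more than one split. Both helpers are hypothetical and make simplifying assumptions: length is measured in whitespace-separated tokens rather than model tokens, and leakage is approximated by exact text matches after lowercasing and trimming, so near-duplicates are not detected.

```python
import pandas as pd


def length_summary(df, text_col="text", label_col="label_str"):
    """Describe whitespace-token counts per sentiment label."""
    # Missing texts count as zero tokens.
    n_tokens = df[text_col].fillna("").astype(str).str.split().str.len()
    return n_tokens.groupby(df[label_col]).describe()


def shared_texts(a, b, text_col="text"):
    """Return the set of normalized texts that occur in both DataFrames."""
    def normalize(series):
        return set(series.dropna().astype(str).str.strip().str.lower())

    return normalize(a[text_col]) & normalize(b[text_col])
```

For example, `length_summary(train_df)` gives one row of summary statistics per label, and `len(shared_texts(train_df, test_df))` counts tweets shared between the training and test splits. A non-zero count is worth investigating before reporting test metrics.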

This notebook is meant as a lightweight, reproducible starting point for exploring the airline tweet sentiment datasets used in our experiments.