# 5.2 Exploratory Data Analysis (EDA)

This notebook analyzes the current synthetic dataset and its labels.
We report:
- Dataset integrity (CSV ↔ image files match)
- Label distributions and unknown rates
- Crosstab analysis between attributes
- Qualitative visualization (random image grid)

> Note: EDA is performed on the current annotation version. Labels are being refined through manual verification and will be updated for the final dataset release.


## Setup

This notebook assumes the dataset is stored in Google Drive under `MyDrive/dataset_project`.


In [None]:
from google.colab import drive
drive.mount("/content/drive")

## 1. Load dataset

We load the labels CSV and clean Excel-export artifacts (e.g., `Unnamed:*` columns).

In [None]:
import os
import pandas as pd

BASE_DIR = "/content/drive/MyDrive/dataset_project"
CSV_PATH = os.path.join(BASE_DIR, "labels.csv")

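df = pd.read_csv(CSV_PATH)
# Strip header whitespace first so padded names (e.g. " Unnamed: 0") are also caught.
df.columns = [str(c).strip() for c in df.columns]
# Drop index-like columns left behind by Excel/CSV round-trips.
df = df.loc[:, ~df.columns.str.startswith("Unnamed")].copy()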

print("Rows:", len(df))
print("Columns:", df.columns.tolist())
df.head()
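
To make the clean-up step concrete, here is a minimal sketch on a tiny in-memory CSV. The sample rows and the helper name `drop_export_artifacts` are made up for illustration and are not part of the dataset or the notebook. The helper strips header whitespace before dropping columns, so padded names are handled too.

```python
import io
import pandas as pd

def drop_export_artifacts(frame: pd.DataFrame) -> pd.DataFrame:
    """Strip header whitespace and drop 'Unnamed*' columns (hypothetical helper)."""
    out = frame.copy()
    out.columns = [str(c).strip() for c in out.columns]
    return out.loc[:, ~out.columns.str.startswith("Unnamed")]

# Toy CSV: the empty first header is what a saved index column looks like,
# and " Gaze" carries a stray leading space.
raw = io.StringIO(",filename, Gaze\n0,img_001.jpg,Camera\n1,img_002.jpg,Away\n")
demo = drop_export_artifacts(pd.read_csv(raw))
print(demo.columns.tolist())  # ['filename', 'Gaze']
```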

## 2. Dataset integrity checks

We verify a 1:1 mapping between image files and the CSV filenames:
- Missing files (CSV → no image)
- Extra files (image → not in CSV)
- Duplicate filenames in CSV

In [None]:
import os
import pandas as pd

IMAGES_DIR = BASE_DIR  # images are read from the same folder as labels.csv
IMG_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

# Match extensions case-insensitively: this also catches mixed-case names (e.g. ".Jpg")
# and avoids counting a file twice on case-insensitive filesystems.
img_files = sorted(
    os.path.join(IMAGES_DIR, f)
    for f in os.listdir(IMAGES_DIR)
    if os.path.splitext(f)[1].lower() in IMG_EXTS
)

img_set = {os.path.basename(p) for p in img_files}
csv_names = df["filename"].astype(str).apply(os.path.basename).tolist()
csv_set = set(csv_names)

missing = sorted(csv_set - img_set)  # listed in the CSV, no file on disk
extra = sorted(img_set - csv_set)    # file on disk, not listed in the CSV
dup_counts = pd.Series(csv_names).value_counts()
dups = dup_counts[dup_counts > 1]    # filenames that appear in more than one row

print("Images in folder:", len(img_files))
print("Rows in CSV:", len(df))
print("Unique filenames in CSV:", len(csv_set))
print("Missing (CSV->no file):", len(missing))
print("Extra (file->not in CSV):", len(extra))
print("Duplicate filenames in CSV:", len(dups))
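
The three checks reduce to two set differences and a frequency count. The sketch below shows that logic on toy filename lists. The function name `compare_filenames` and the sample names are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def compare_filenames(csv_names, image_names):
    """Return (missing, extra, duplicates) for two lists of base filenames."""
    csv_set, img_set = set(csv_names), set(image_names)
    missing = sorted(csv_set - img_set)  # listed in the CSV, no file on disk
    extra = sorted(img_set - csv_set)    # file on disk, not listed in the CSV
    duplicates = sorted(n for n, c in Counter(csv_names).items() if c > 1)
    return missing, extra, duplicates

# Toy input: "b.jpg" is listed twice, "c.jpg" has no file, "d.png" has no row.
missing, extra, duplicates = compare_filenames(
    ["a.jpg", "b.jpg", "b.jpg", "c.jpg"],
    ["a.jpg", "b.jpg", "d.png"],
)
print(missing, extra, duplicates)  # ['c.jpg'] ['d.png'] ['b.jpg']
```

A clean 1:1 mapping corresponds to all three lists being empty.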

## 3. Label distributions

We inspect the class distribution for each attribute and report unknown rates.
This helps identify potential imbalance and noisy labels.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

cols = ["Gaze", "Headphones", "Environment", "Privacy", "ObjectInHand"]

def clean_labels(s: pd.Series):
    s = s.astype(str).str.strip()
    s = s.replace(["nan", "NaN", "None", ""], np.nan)
    s = s.fillna("Unknown")
    return s

def summary_table(series: pd.Series):
    counts = series.value_counts(dropna=False)
    perc = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": perc})

for col in cols:
    s = clean_labels(df[col])
    tab = summary_table(s)

    print(f"\n=== {col} ===")
    display(tab)

    plt.figure()
    tab["count"].plot(kind="bar")
    plt.title(f"{col} distribution")
    plt.xlabel(col)
    plt.ylabel("count")
    plt.tight_layout()
    plt.show()

unknown_rates = []
for col in cols:
    s = clean_labels(df[col]).str.lower()
    unknown_rates.append((col, round((s == "unknown").mean() * 100, 2)))

unknown_df = pd.DataFrame(unknown_rates, columns=["attribute", "unknown_percent"])
print("\n=== UNKNOWN RATES (%) ===")
display(unknown_df)
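
One way to read imbalance off these tables at a glance is to put the majority-class share next to the unknown rate for each attribute. The following is a minimal sketch on toy data. The helper name `imbalance_summary` and the toy values are assumptions for illustration, and majority share is only one of several possible imbalance measures.

```python
import pandas as pd

def imbalance_summary(frame, columns, unknown_label="Unknown"):
    """Per attribute: class count, majority class with its share, and unknown rate."""
    rows = []
    for col in columns:
        s = frame[col].fillna(unknown_label).astype(str).str.strip()
        shares = s.value_counts(normalize=True)  # sorted, largest share first
        is_unknown = s.str.lower() == unknown_label.lower()
        rows.append({
            "attribute": col,
            "n_classes": int(shares.size),
            "majority_class": shares.index[0],
            "majority_percent": round(float(shares.iloc[0]) * 100, 1),
            "unknown_percent": round(float(is_unknown.mean()) * 100, 1),
        })
    return pd.DataFrame(rows)

# Toy labels, not taken from the dataset.
toy = pd.DataFrame({
    "Gaze": ["Camera", "Camera", "Camera", "Away"],
    "Headphones": ["Yes", None, "No", "No"],
})
summary = imbalance_summary(toy, ["Gaze", "Headphones"])
print(summary)
```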

## 4. Crosstab analysis

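We examine pairwise relationships between attributes using row-normalized crosstabs: each row sums to 100%, so a cell gives the percentage of that row's class that falls into the column's class.
Strong skews in these tables point to correlations that may indicate dataset bias.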

In [None]:
import pandas as pd

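def crosstab_percent(a_col, b_col):
    # Fill before casting: astype(str) can turn NaN into the literal string "nan",
    # which fillna() would then miss.
    a = df[a_col].fillna("Unknown").astype(str).str.strip()
    b = df[b_col].fillna("Unknown").astype(str).str.strip()
    # normalize="index" divides each row by its total, so rows sum to 100%.
    return (pd.crosstab(a, b, normalize="index") * 100).round(1)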

print("Privacy x Environment (row %):")
display(crosstab_percent("Privacy", "Environment"))

print("\nObjectInHand x Gaze (row %):")
display(crosstab_percent("ObjectInHand", "Gaze"))
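
To show how a row-normalized crosstab reads, here is a small sketch on toy labels. The values are invented for illustration and are not drawn from the dataset. Missing labels are filled with "Unknown" before the table is built.

```python
import numpy as np
import pandas as pd

# Toy labels, including one missing ObjectInHand value.
toy = pd.DataFrame({
    "ObjectInHand": ["Phone", "Phone", "Pen", np.nan],
    "Gaze": ["Camera", "Away", "Camera", "Camera"],
})

rows = toy["ObjectInHand"].fillna("Unknown").astype(str).str.strip()
cols = toy["Gaze"].fillna("Unknown").astype(str).str.strip()

# normalize="index" divides each row by its own total.
table = (pd.crosstab(rows, cols, normalize="index") * 100).round(1)
print(table)
print(table.sum(axis=1))  # every row sums to 100.0
```

In this toy table the "Phone" row splits 50/50 between "Away" and "Camera", while "Pen" and "Unknown" fall entirely under "Camera". A row concentrated in one column is the kind of pattern worth checking for bias, keeping in mind that rows with few samples are noisy.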

## 5. Qualitative inspection (random samples)

We display a random grid of images to validate that the synthetic data matches the intended Zoom-like framing and that the annotated attributes are visually present.

In [None]:
import random
import matplotlib.pyplot as plt
from PIL import Image
import os

n = min(12, len(img_files))
sample = random.sample(img_files, n)

cols_grid = 4
rows = (n + cols_grid - 1) // cols_grid
plt.figure(figsize=(cols_grid*3, rows*3))

for i, p in enumerate(sample):
    img = Image.open(p).convert("RGB")
    ax = plt.subplot(rows, cols_grid, i+1)
    ax.imshow(img)
    ax.axis("off")
    ax.set_title(os.path.basename(p), fontsize=8)

plt.tight_layout()
plt.show()
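
If the same grid should reappear across runs, for example to compare images before and after a label update, the draw can be seeded. This is a minimal sketch; the helper names `pick_sample` and `grid_shape` and the toy paths are assumptions for illustration, not part of the notebook.

```python
import random

def pick_sample(paths, n=12, seed=0):
    """Return up to n paths, chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    ordered = sorted(paths)    # fixed order so the seed alone determines the draw
    return rng.sample(ordered, min(n, len(ordered)))

def grid_shape(n, n_cols=4):
    """Rows and columns needed to show n images with n_cols per row."""
    return ((n + n_cols - 1) // n_cols, n_cols)

# Toy paths; the listing order does not affect the result.
paths = [f"img_{i:03d}.jpg" for i in range(30)]
first = pick_sample(paths, n=12, seed=0)
again = pick_sample(list(reversed(paths)), n=12, seed=0)
print(first == again, grid_shape(len(first)))  # True (3, 4)
```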

## 6. Key takeaways

- The dataset contains **1825** images and **1825** labeled rows, with a **1:1 match** between CSV filenames and image files (**0 missing / 0 extra / 0 duplicates**).
- We analyzed distributions for **Gaze, Headphones, Environment, Privacy, ObjectInHand**, including unknown rates.
- Crosstabs show correlations between attributes (e.g., **ObjectInHand = Phone/Pen** co-occurring with **Gaze = Camera**) that may indicate dataset bias.
- Visual inspection of a random grid (up to 12 images) supports Zoom-like framing and attribute visibility.
- Labels are being refined through manual verification; the final release will include improved annotations.