# 01. Dataset Exploration and Domain Analysis

### Introduction
In a **Sim-to-Real** project, understanding the distribution of data is crucial. We need to ensure that our **Source Domain** (Synthetic/Game data) provides enough examples for the model to learn, and that our **Target Domain** (Real-world data) is correctly formatted for validation.

### Objectives
1.  Verify the integrity of the processed dataset.
2.  Visualize the balance between Simulated (Train) and Real (Val) images.
3.  Analyze the class distribution to check for imbalance (e.g., are there enough snipers?).

In [None]:
import os
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import config  # project-local settings module; provides PROCESSED_DATA_DIR

# Set visual style
sns.set_theme(style="whitegrid")
%matplotlib inline

# Define Paths using the Config module for consistency
TRAIN_DIR = config.PROCESSED_DATA_DIR / "train"
VAL_DIR = config.PROCESSED_DATA_DIR / "val"

### 1. Domain Distribution (Sim vs. Real)
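Here we count the images in the Training set (Simulated) and the Validation set (Real). For a conventional random split, 70-80% training data is a common rule of thumb. In this project the split follows the domain boundary instead, so the ratio matters mainly as a check that the real validation set is large enough to give stable metrics.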
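
The ratio can be made concrete with a few lines of arithmetic. The sketch below uses a hypothetical helper, `domain_split`, and made-up counts (800 and 200) purely for illustration; neither comes from this project.

```python
def domain_split(n_train, n_val):
    """Return (train_fraction, val_fraction) for two image counts."""
    total = n_train + n_val
    if total == 0:
        raise ValueError("both domains are empty; nothing to compare")
    return n_train / total, n_val / total


# Illustrative counts only, not measurements from this dataset.
train_frac, val_frac = domain_split(800, 200)
print(f"Simulated: {train_frac:.0%} | Real: {val_frac:.0%}")
# prints "Simulated: 80% | Real: 20%"
```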

In [None]:
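def count_dataset(path):
    """Count the entries under path/images and path/labels."""
    imgs = len(glob.glob(str(path / "images" / "*")))
    lbls = len(glob.glob(str(path / "labels" / "*")))
    return imgs, lbls

train_imgs, train_lbls = count_dataset(TRAIN_DIR)
val_imgs, val_lbls = count_dataset(VAL_DIR)

# Report labels next to images: a mismatch is a first hint of missing annotations
print(f"[{'Simulated':^10}] Training Images:   {train_imgs} (labels: {train_lbls})")
print(f"[{'Real-World':^10}] Validation Images: {val_imgs} (labels: {val_lbls})")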

# Bar chart of image counts per domain (plt.bar_label needs Matplotlib >= 3.4)
plt.figure(figsize=(6, 4))
bars = plt.bar(['Simulated (Source)', 'Real (Target)'], [train_imgs, val_imgs], color=['#3498db', '#e74c3c'])
plt.title("Data Distribution by Domain")
plt.ylabel("Count")
plt.bar_label(bars)
plt.show()
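
Equal image and label counts do not guarantee that every image has its own annotation. As one possible integrity check, the sketch below compares file stems between `images/` and `labels/`. It assumes YOLO-style `.txt` label files that share a stem with their image, plus a fixed set of image extensions; neither is confirmed by this notebook. `find_unpaired` and `IMAGE_EXTS` are hypothetical names, and the temporary directory only stands in for a real split folder.

```python
import tempfile
from pathlib import Path

# Assumed image formats; adjust to whatever the dataset actually contains.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def find_unpaired(split_dir):
    """Return (images_without_label, labels_without_image) as sorted stem lists."""
    split_dir = Path(split_dir)
    image_stems = {
        p.stem
        for p in (split_dir / "images").glob("*")
        if p.suffix.lower() in IMAGE_EXTS
    }
    # Assumption: one .txt label file per image, sharing the image's stem.
    label_stems = {p.stem for p in (split_dir / "labels").glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Tiny stand-in split: "b" has no label, "c" has no image.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "labels").mkdir()
    for name in ("a.jpg", "b.png"):
        (root / "images" / name).touch()
    for name in ("a.txt", "c.txt"):
        (root / "labels" / name).touch()
    print(find_unpaired(root))  # prints (['b'], ['c'])
```

On the real data this could be called as `find_unpaired(TRAIN_DIR)` and `find_unpaired(VAL_DIR)`. Images without a label file are not always errors, since some pipelines treat them as background examples.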