# 📊 Notebook 02: Data Exploration & Analysis

**AI Virtual Try-On System - Hybrid Generative AI Approach**

---

## 📋 Notebook Overview

This notebook covers comprehensive data exploration for the Virtual Try-On project:

1. **Dataset Overview** - Understanding VITON-HD and DeepFashion datasets
2. **Data Download** - Scripts to download required datasets
3. **Data Structure Analysis** - Explore directory structure and file formats
4. **Image Visualization** - Visualize person images, garments, and annotations
5. **Statistical Analysis** - Image dimensions, distributions, and quality checks
6. **Annotation Exploration** - Parse masks, pose keypoints, and segmentation
7. **Data Quality Assessment** - Check for corrupted files and missing data
8. **Dataset Splitting** - Prepare train/validation/test splits

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- ✅ Understand the structure of the VITON-HD dataset
- ✅ Know how to download and organize datasets
- ✅ Visualize person-garment pairs
- ✅ Understand pose annotations and segmentation masks
- ✅ Identify data quality issues
- ✅ Be ready for data preprocessing

---

## 1️⃣ Setup and Imports

In [None]:
# Standard libraries
import os
import sys
import json
import random
from pathlib import Path
from collections import Counter, defaultdict

# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.gridspec import GridSpec

# Image processing
import cv2
from PIL import Image

# Progress bars
from tqdm.notebook import tqdm

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)

# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Increase figure quality
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 300
plt.rcParams['figure.figsize'] = (12, 8)

print("✅ All imports successful!")

In [None]:
# Set up project paths
project_root = Path.cwd().parent
data_dir = project_root / 'data'
raw_data_dir = data_dir / 'raw'
processed_data_dir = data_dir / 'processed'
outputs_dir = project_root / 'outputs'

print("="*70)
print("📁 PROJECT PATHS")
print("="*70)
print(f"Project Root: {project_root}")
print(f"Data Directory: {data_dir}")
print(f"Raw Data: {raw_data_dir}")
print(f"Processed Data: {processed_data_dir}")
print(f"Outputs: {outputs_dir}")
print("="*70)

## 2️⃣ Dataset Overview

### VITON-HD Dataset

**VITON-HD** is a high-resolution virtual try-on dataset containing:
- **13,679 image pairs** (person + garment)
- **Resolution**: 1024×768 pixels (height × width)
- **Annotations**: Human parsing masks, pose keypoints, dense pose
- **Split**: Train (11,647) / Test (2,032)

**Dataset Structure:**
```
VITON-HD/
├── train/
│   ├── image/           # Person images
│   ├── cloth/           # Garment images
│   ├── image-parse-v3/  # Segmentation masks
│   ├── openpose_img/    # Pose visualizations
│   ├── openpose_json/   # Pose keypoints (JSON)
│   └── train_pairs.txt  # Image pair mappings
└── test/
    └── [same structure]
```
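
The layout above implies that every annotation for a person image can be located from the file name alone. A minimal sketch of that lookup follows; `sample_paths` is a hypothetical helper (not part of the VITON-HD code), and the `_keypoints.json` suffix is assumed to match the naming used by the sample data generated later in this notebook.

```python
from pathlib import Path


def sample_paths(split_dir, person_name, cloth_name):
    """Build the expected file paths for one person-garment pair.

    Hypothetical helper: it only constructs paths and does not check
    that the files exist on disk.
    """
    split_dir = Path(split_dir)
    stem = Path(person_name).stem  # e.g. "00000_00"
    return {
        "person": split_dir / "image" / person_name,
        "cloth": split_dir / "cloth" / cloth_name,
        "parse": split_dir / "image-parse-v3" / f"{stem}.png",
        "pose_json": split_dir / "openpose_json" / f"{stem}_keypoints.json",
    }
```

For example, `sample_paths(raw_data_dir / 'viton-hd' / 'train', '00000_00.jpg', '00000_00.jpg')` returns the four paths that the later loading and quality-check cells build by hand.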

## 3️⃣ Download Dataset

### Option 1: Manual Download

1. **VITON-HD**: 
   - Visit: https://github.com/shadow2496/VITON-HD
   - Download from Google Drive link
   - Extract to `data/raw/viton-hd/`

2. **DeepFashion** (Optional):
   - Visit: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
   - Register and download
   - Extract to `data/raw/deepfashion/`

### Option 2: Automated Download (Using gdown)

In [None]:
# Install gdown for Google Drive downloads
!pip install gdown -q

print("✅ gdown installed successfully!")

In [None]:
import gdown

# Create raw data directory
viton_hd_dir = raw_data_dir / 'viton-hd'
viton_hd_dir.mkdir(parents=True, exist_ok=True)

print("="*70)
print("📥 DATASET DOWNLOAD")
print("="*70)
print("\n⚠️  Note: VITON-HD dataset is ~15GB. Download may take 30-60 minutes.")
print("\nFor this tutorial, we'll work with a sample dataset first.")
print("You can download the full dataset later.\n")
print("="*70)

# For now, we'll create sample data structure
print("\n💡 Creating sample data structure for demonstration...")

## 4️⃣ Create Sample Dataset

For demonstration purposes, let's create a sample dataset structure with synthetic data.

In [None]:
# Create sample dataset structure
sample_dirs = [
    'viton-hd/train/image',
    'viton-hd/train/cloth',
    'viton-hd/train/image-parse-v3',
    'viton-hd/train/openpose_img',
    'viton-hd/train/openpose_json',
    'viton-hd/test/image',
    'viton-hd/test/cloth',
    'viton-hd/test/image-parse-v3',
    'viton-hd/test/openpose_img',
    'viton-hd/test/openpose_json',
]

for dir_path in sample_dirs:
    full_path = raw_data_dir / dir_path
    full_path.mkdir(parents=True, exist_ok=True)

print("✅ Sample dataset structure created!")
print("\n📁 Created directories:")
for dir_path in sample_dirs[:5]:
    print(f"   - {dir_path}")
print("   ...")

In [None]:
# Helper function to create sample images
def create_sample_person_image(width=768, height=1024):
    """Create a sample person image with simple shapes"""
    img = np.ones((height, width, 3), dtype=np.uint8) * 240
    
    # Draw simple person silhouette
    # Head
    cv2.circle(img, (width//2, height//4), 80, (255, 220, 200), -1)
    
    # Body
    cv2.rectangle(img, (width//2-100, height//4+50), 
                  (width//2+100, height//2+100), (100, 150, 200), -1)
    
    # Arms
    cv2.rectangle(img, (width//2-180, height//4+80), 
                  (width//2-100, height//2+50), (255, 200, 180), -1)
    cv2.rectangle(img, (width//2+100, height//4+80), 
                  (width//2+180, height//2+50), (255, 200, 180), -1)
    
    # Legs
    cv2.rectangle(img, (width//2-80, height//2+100), 
                  (width//2-20, height-100), (50, 50, 100), -1)
    cv2.rectangle(img, (width//2+20, height//2+100), 
                  (width//2+80, height-100), (50, 50, 100), -1)
    
    return img

def create_sample_garment_image(width=768, height=1024, color=None):
    """Create a sample garment (shirt) image"""
    img = np.ones((height, width, 3), dtype=np.uint8) * 255
    
    if color is None:
        color = (random.randint(50, 200), random.randint(50, 200), random.randint(50, 200))
    
    # Draw simple shirt shape
    points = np.array([
        [width//2-150, height//3],
        [width//2-200, height//3+100],
        [width//2-120, height//2+150],
        [width//2+120, height//2+150],
        [width//2+200, height//3+100],
        [width//2+150, height//3],
    ], np.int32)
    
    cv2.fillPoly(img, [points], color)
    
    return img

def create_sample_segmentation_mask(width=768, height=1024):
    """Create a sample segmentation mask"""
    mask = np.zeros((height, width), dtype=np.uint8)
    
    # Different regions with different labels
    # 0: background, 5: upper-clothes, 13: face, 14-15: arms, 16-17: legs
    
    # Face (label 13)
    cv2.circle(mask, (width//2, height//4), 80, 13, -1)
    
    # Upper clothes (label 5)
    cv2.rectangle(mask, (width//2-100, height//4+50), 
                  (width//2+100, height//2+100), 5, -1)
    
    # Arms (labels 14, 15)
    cv2.rectangle(mask, (width//2-180, height//4+80), 
                  (width//2-100, height//2+50), 14, -1)
    cv2.rectangle(mask, (width//2+100, height//4+80), 
                  (width//2+180, height//2+50), 15, -1)
    
    # Legs (labels 16, 17)
    cv2.rectangle(mask, (width//2-80, height//2+100), 
                  (width//2-20, height-100), 16, -1)
    cv2.rectangle(mask, (width//2+20, height//2+100), 
                  (width//2+80, height-100), 17, -1)
    
    return mask

print("✅ Sample image generation functions created!")

In [None]:
# Generate sample dataset (10 samples for demonstration)
num_samples = 10

print("🎨 Generating sample dataset...\n")

train_dir = raw_data_dir / 'viton-hd' / 'train'
pairs = []

for i in tqdm(range(num_samples), desc="Creating samples"):
    # Generate IDs
    person_id = f"{i:05d}_00"
    cloth_id = f"{i:05d}_00"
    
    # Create person image
    person_img = create_sample_person_image()
    person_path = train_dir / 'image' / f"{person_id}.jpg"
    cv2.imwrite(str(person_path), cv2.cvtColor(person_img, cv2.COLOR_RGB2BGR))
    
    # Create garment image
    garment_img = create_sample_garment_image()
    garment_path = train_dir / 'cloth' / f"{cloth_id}.jpg"
    cv2.imwrite(str(garment_path), cv2.cvtColor(garment_img, cv2.COLOR_RGB2BGR))
    
    # Create segmentation mask
    seg_mask = create_sample_segmentation_mask()
    seg_path = train_dir / 'image-parse-v3' / f"{person_id}.png"
    cv2.imwrite(str(seg_path), seg_mask)
    
    # Create pose keypoints (simplified JSON)
    pose_data = {
        "version": 1.3,
        "people": [{
            "pose_keypoints_2d": [384, 256, 0.9] * 18  # Simplified 18 keypoints
        }]
    }
    pose_path = train_dir / 'openpose_json' / f"{person_id}_keypoints.json"
    with open(pose_path, 'w') as f:
        json.dump(pose_data, f)
    
    # Add to pairs
    pairs.append(f"{person_id}.jpg {cloth_id}.jpg")

# Save pairs file
pairs_path = train_dir / 'train_pairs.txt'
with open(pairs_path, 'w') as f:
    f.write('\n'.join(pairs))

print(f"\n✅ Generated {num_samples} sample image pairs!")
print(f"📍 Location: {train_dir}")

## 5️⃣ Explore Dataset Structure

In [None]:
# Analyze dataset structure
def analyze_dataset_structure(dataset_path):
    """Analyze and display dataset structure"""
    dataset_path = Path(dataset_path)
    
    if not dataset_path.exists():
        print(f"❌ Dataset not found at {dataset_path}")
        return
    
    print("="*70)
    print("📊 DATASET STRUCTURE ANALYSIS")
    print("="*70)
    
    for split in ['train', 'test']:
        split_path = dataset_path / split
        if not split_path.exists():
            continue
            
        print(f"\n📁 {split.upper()} Split:")
        print("-" * 50)
        
        for subdir in split_path.iterdir():
            if subdir.is_dir():
                num_files = len(list(subdir.glob('*')))
                print(f"   {subdir.name:20s} : {num_files:5d} files")
            elif subdir.is_file():
                print(f"   {subdir.name:20s} : file")
    
    print("\n" + "="*70)

# Analyze VITON-HD dataset
viton_hd_path = raw_data_dir / 'viton-hd'
analyze_dataset_structure(viton_hd_path)
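
The per-directory counts above do not show which file formats are present. A small sketch, assuming the same directory layout, tallies extensions with `Counter`; `count_extensions` is a hypothetical helper name introduced here for illustration.

```python
from collections import Counter
from pathlib import Path


def count_extensions(directory):
    """Count files under `directory` (recursively) by lowercase extension."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "<none>"
            counts[path.suffix.lower() or "<none>"] += 1
    return counts
```

Calling `count_extensions(raw_data_dir / 'viton-hd' / 'train')` on the sample data should report only `.jpg`, `.png`, `.json` and `.txt` entries; anything else would be worth a closer look.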

## 6️⃣ Load and Visualize Sample Data

In [None]:
# Load pairs file
pairs_file = raw_data_dir / 'viton-hd' / 'train' / 'train_pairs.txt'

if pairs_file.exists():
    with open(pairs_file, 'r') as f:
        # Skip blank lines so every entry unpacks into (person, garment)
        pairs_data = [line.split() for line in f if line.strip()]
    
    print("="*70)
    print("📋 DATASET PAIRS INFORMATION")
    print("="*70)
    print(f"\nTotal pairs: {len(pairs_data)}")
    print("\nFirst 5 pairs:")
    for i, (person, garment) in enumerate(pairs_data[:5]):
        print(f"   {i+1}. Person: {person:20s} | Garment: {garment}")
    print("\n" + "="*70)
else:
    print("⚠️  Pairs file not found. Please ensure dataset is downloaded.")
    pairs_data = []
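
The loading cell assumes every line holds exactly two file names. A slightly more defensive sketch is shown below; `parse_pairs` is a hypothetical helper that collects malformed lines instead of failing on them, which can help when a pairs file has been edited by hand.

```python
def parse_pairs(lines):
    """Split pair lines into (person, garment) tuples and collect malformed lines.

    Returns (pairs, malformed), where malformed holds (line_number, text) entries.
    """
    pairs, malformed = [], []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
        else:
            malformed.append((line_no, line.rstrip("\n")))
    return pairs, malformed
```

It accepts any iterable of strings, so an open file handle can be passed directly: `parse_pairs(open(pairs_file))`.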

In [None]:
# Visualization function
def visualize_sample(person_img, garment_img, seg_mask, pose_img=None, title="Sample"):
    """Visualize a complete sample with all components"""
    
    if pose_img is not None:
        fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    else:
        fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    
    # Person image
    axes[0].imshow(person_img)
    axes[0].set_title('Person Image', fontsize=12, fontweight='bold')
    axes[0].axis('off')
    
    # Garment image
    axes[1].imshow(garment_img)
    axes[1].set_title('Garment Image', fontsize=12, fontweight='bold')
    axes[1].axis('off')
    
    # Segmentation mask
    axes[2].imshow(seg_mask, cmap='tab20')
    axes[2].set_title('Segmentation Mask', fontsize=12, fontweight='bold')
    axes[2].axis('off')
    
    # Pose visualization (if available)
    if pose_img is not None:
        axes[3].imshow(pose_img)
        axes[3].set_title('Pose Keypoints', fontsize=12, fontweight='bold')
        axes[3].axis('off')
    
    plt.suptitle(title, fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

print("✅ Visualization function created!")

In [None]:
# Load and visualize random samples
if pairs_data:
    train_dir = raw_data_dir / 'viton-hd' / 'train'
    
    # Select random samples
    num_visualize = min(3, len(pairs_data))
    sample_indices = random.sample(range(len(pairs_data)), num_visualize)
    
    print("="*70)
    print("🎨 VISUALIZING SAMPLE DATA")
    print("="*70)
    
    for idx in sample_indices:
        person_name, garment_name = pairs_data[idx]
        
        # Load images
        person_path = train_dir / 'image' / person_name
        garment_path = train_dir / 'cloth' / garment_name
        seg_path = train_dir / 'image-parse-v3' / person_name.replace('.jpg', '.png')
        
        if person_path.exists() and garment_path.exists() and seg_path.exists():
            person_img = cv2.imread(str(person_path))
            person_img = cv2.cvtColor(person_img, cv2.COLOR_BGR2RGB)
            
            garment_img = cv2.imread(str(garment_path))
            garment_img = cv2.cvtColor(garment_img, cv2.COLOR_BGR2RGB)
            
            # Read with PIL so palette-mode PNGs keep their class indices
            seg_mask = np.array(Image.open(seg_path))
            
            # Visualize
            visualize_sample(
                person_img, garment_img, seg_mask,
                title=f"Sample {idx+1}: {person_name} + {garment_name}"
            )
        else:
            print(f"⚠️  Sample {idx+1} files not found")
else:
    print("⚠️  No pairs data available for visualization")

## 7️⃣ Image Statistics and Analysis

In [None]:
# Analyze image dimensions and statistics
def analyze_image_statistics(image_dir, num_samples=50):
    """Analyze image dimensions and basic statistics"""
    image_dir = Path(image_dir)
    
    if not image_dir.exists():
        print(f"❌ Directory not found: {image_dir}")
        return None
    
    image_files = list(image_dir.glob('*.jpg')) + list(image_dir.glob('*.png'))
    
    if not image_files:
        print(f"❌ No images found in {image_dir}")
        return None
    
    # Sample images for analysis
    sample_files = random.sample(image_files, min(num_samples, len(image_files)))
    
    dimensions = []
    file_sizes = []
    
    for img_path in tqdm(sample_files, desc="Analyzing images"):
        # Open in a context manager so the file handle is closed promptly
        with Image.open(img_path) as img:
            dimensions.append(img.size)  # (width, height)
        file_sizes.append(img_path.stat().st_size / 1024)  # KB
    
    # Convert to DataFrame for analysis
    df = pd.DataFrame({
        'width': [d[0] for d in dimensions],
        'height': [d[1] for d in dimensions],
        'file_size_kb': file_sizes
    })
    
    return df

# Analyze person images
person_stats = None  # Stays None when the directory is missing, so later cells can test it
person_dir = raw_data_dir / 'viton-hd' / 'train' / 'image'
if person_dir.exists():
    print("📊 Analyzing person images...\n")
    person_stats = analyze_image_statistics(person_dir)
    
    if person_stats is not None:
        print("="*70)
        print("📈 PERSON IMAGES STATISTICS")
        print("="*70)
        print(person_stats.describe())
        print("\n" + "="*70)

In [None]:
# Visualize dimension distribution
if person_stats is not None and len(person_stats) > 0:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Width distribution
    axes[0].hist(person_stats['width'], bins=20, color='skyblue', edgecolor='black')
    axes[0].set_title('Image Width Distribution', fontweight='bold')
    axes[0].set_xlabel('Width (pixels)')
    axes[0].set_ylabel('Frequency')
    axes[0].grid(True, alpha=0.3)
    
    # Height distribution
    axes[1].hist(person_stats['height'], bins=20, color='lightcoral', edgecolor='black')
    axes[1].set_title('Image Height Distribution', fontweight='bold')
    axes[1].set_xlabel('Height (pixels)')
    axes[1].set_ylabel('Frequency')
    axes[1].grid(True, alpha=0.3)
    
    # File size distribution
    axes[2].hist(person_stats['file_size_kb'], bins=20, color='lightgreen', edgecolor='black')
    axes[2].set_title('File Size Distribution', fontweight='bold')
    axes[2].set_xlabel('File Size (KB)')
    axes[2].set_ylabel('Frequency')
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
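
Histograms make it easy to miss a handful of odd-sized images. As a complement, the sketch below reports the dominant `(width, height)` and every size that deviates from it; `summarize_dimensions` is a hypothetical helper, and treating the most common size as the expected one is an assumption.

```python
from collections import Counter


def summarize_dimensions(sizes):
    """Return the most common (width, height) and a dict of deviating sizes.

    `sizes` is any iterable of (width, height) tuples.
    """
    sizes = list(sizes)
    if not sizes:
        raise ValueError("no image sizes given")
    counts = Counter(sizes)
    expected, _ = counts.most_common(1)[0]
    outliers = {size: n for size, n in counts.items() if size != expected}
    return expected, outliers
```

With the statistics frame from above, a call could look like `summarize_dimensions(zip(person_stats['width'], person_stats['height']))`; an empty outlier dict means all sampled images share one size.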

## 8️⃣ Segmentation Mask Analysis

In [None]:
# Define segmentation labels (VITON-HD uses 20 classes)
SEGMENTATION_LABELS = {
    0: 'background',
    1: 'hat',
    2: 'hair',
    3: 'glove',
    4: 'sunglasses',
    5: 'upper-clothes',
    6: 'dress',
    7: 'coat',
    8: 'socks',
    9: 'pants',
    10: 'jumpsuits',
    11: 'scarf',
    12: 'skirt',
    13: 'face',
    14: 'left-arm',
    15: 'right-arm',
    16: 'left-leg',
    17: 'right-leg',
    18: 'left-shoe',
    19: 'right-shoe'
}

print("📋 Segmentation Labels:")
print("="*70)
for label_id, label_name in SEGMENTATION_LABELS.items():
    print(f"   {label_id:2d}: {label_name}")
print("="*70)

In [None]:
# Analyze segmentation masks
def analyze_segmentation_masks(mask_dir, num_samples=20):
    """Analyze segmentation mask label distribution"""
    mask_dir = Path(mask_dir)
    
    if not mask_dir.exists():
        print(f"❌ Mask directory not found: {mask_dir}")
        return None
    
    mask_files = list(mask_dir.glob('*.png'))
    
    if not mask_files:
        print(f"❌ No mask files found in {mask_dir}")
        return None
    
    sample_files = random.sample(mask_files, min(num_samples, len(mask_files)))
    
    label_counts = Counter()
    
    for mask_path in tqdm(sample_files, desc="Analyzing masks"):
        # Read with PIL so palette-mode PNGs yield class indices;
        # cv2.IMREAD_GRAYSCALE would convert palette colors to gray levels instead
        with Image.open(mask_path) as mask_img:
            mask = np.array(mask_img)
        unique_labels, counts = np.unique(mask, return_counts=True)
        
        for label, count in zip(unique_labels, counts):
            # Store plain Python ints so keys match SEGMENTATION_LABELS
            label_counts[int(label)] += int(count)
    
    return label_counts

# Analyze masks
label_distribution = None  # Stays None when the directory is missing, so later cells can test it
mask_dir = raw_data_dir / 'viton-hd' / 'train' / 'image-parse-v3'
if mask_dir.exists():
    print("🎭 Analyzing segmentation masks...\n")
    label_distribution = analyze_segmentation_masks(mask_dir)
    
    if label_distribution:
        print("\n" + "="*70)
        print("📊 LABEL DISTRIBUTION")
        print("="*70)
        
        for label_id in sorted(label_distribution.keys()):
            label_name = SEGMENTATION_LABELS.get(label_id, 'unknown')
            count = label_distribution[label_id]
            print(f"   {label_id:2d} ({label_name:15s}): {count:12,d} pixels")
        
        print("="*70)
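
Raw pixel counts are hard to compare at a glance because the background dominates. A minimal sketch that converts the counts into percentage shares follows; `label_shares` is a hypothetical helper, and it expects a mapping like the `Counter` returned by `analyze_segmentation_masks`.

```python
def label_shares(label_counts, label_names):
    """Convert raw pixel counts into percentage shares, largest first.

    Returns a list of (label_name, percent) tuples.
    """
    total = sum(label_counts.values())
    if total == 0:
        return []
    shares = [
        (label_names.get(label, "unknown"), 100.0 * count / total)
        for label, count in label_counts.items()
    ]
    return sorted(shares, key=lambda item: item[1], reverse=True)
```

For instance, `label_shares(label_distribution, SEGMENTATION_LABELS)` gives a ranked list that shows directly what fraction of the sampled pixels belongs to `upper-clothes`.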

In [None]:
# Visualize label distribution
if label_distribution:
    labels = [SEGMENTATION_LABELS.get(k, f'Label {k}') for k in sorted(label_distribution.keys())]
    counts = [label_distribution[k] for k in sorted(label_distribution.keys())]
    
    plt.figure(figsize=(14, 6))
    bars = plt.bar(range(len(labels)), counts, color='steelblue', edgecolor='black')
    plt.xticks(range(len(labels)), labels, rotation=45, ha='right')
    plt.xlabel('Segmentation Labels', fontweight='bold')
    plt.ylabel('Pixel Count', fontweight='bold')
    plt.title('Segmentation Label Distribution', fontsize=14, fontweight='bold')
    plt.yscale('log')  # Log scale for better visualization
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

## 9️⃣ Pose Keypoints Analysis

In [None]:
# Load and analyze pose keypoints
def load_pose_keypoints(json_path):
    """Load pose keypoints from OpenPose JSON file"""
    with open(json_path, 'r') as f:
        data = json.load(f)
    
    if 'people' in data and len(data['people']) > 0:
        keypoints = data['people'][0]['pose_keypoints_2d']
        # Reshape to (N, 3): one (x, y, confidence) row per keypoint
        keypoints = np.array(keypoints).reshape(-1, 3)
        return keypoints
    
    return None

# OpenPose keypoint names: the first 19 entries of the BODY_25 ordering
# (the 18-point COCO layout has no MidHip entry)
POSE_KEYPOINT_NAMES = [
    'Nose', 'Neck', 'RShoulder', 'RElbow', 'RWrist',
    'LShoulder', 'LElbow', 'LWrist', 'MidHip', 'RHip',
    'RKnee', 'RAnkle', 'LHip', 'LKnee', 'LAnkle',
    'REye', 'LEye', 'REar', 'LEar'
]

print("📍 OpenPose Keypoints:")
print("="*70)
for i, name in enumerate(POSE_KEYPOINT_NAMES):
    print(f"   {i:2d}: {name}")
print("="*70)
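
OpenPose stores each person's keypoints as one flat list of the form `[x0, y0, c0, x1, y1, c1, ...]`. The sketch below groups that list into triples and keeps only confident detections, without NumPy; `confident_keypoints` is a hypothetical helper, and the 0.1 threshold is assumed to match the one used in the visualization cell.

```python
def confident_keypoints(flat, min_conf=0.1):
    """Group a flat [x, y, c, x, y, c, ...] list into triples and filter by confidence.

    Returns (index, x, y) tuples for keypoints whose confidence exceeds `min_conf`.
    """
    if len(flat) % 3 != 0:
        raise ValueError("keypoint list length must be a multiple of 3")
    triples = [flat[i:i + 3] for i in range(0, len(flat), 3)]
    return [
        (idx, x, y)
        for idx, (x, y, conf) in enumerate(triples)
        if conf > min_conf
    ]
```

The returned index can be used to look up the joint name in `POSE_KEYPOINT_NAMES`, provided the file follows the same keypoint ordering.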

In [None]:
# Visualize pose keypoints on image
def visualize_pose_keypoints(image, keypoints):
    """Visualize pose keypoints on image"""
    img_copy = image.copy()
    
    # Draw keypoints
    for i, (x, y, conf) in enumerate(keypoints):
        if conf > 0.1:  # Only draw if confidence > threshold
            cv2.circle(img_copy, (int(x), int(y)), 5, (0, 255, 0), -1)
            cv2.putText(img_copy, str(i), (int(x)+10, int(y)), 
                       cv2.FONT_HERSHEY_SIMPLEX, 0.4, (255, 255, 255), 1)
    
    # Draw skeleton connections
    skeleton = [
        (0, 1), (1, 2), (2, 3), (3, 4),  # Right arm
        (1, 5), (5, 6), (6, 7),  # Left arm
        (1, 8), (8, 9), (9, 10), (10, 11),  # Right leg
        (8, 12), (12, 13), (13, 14),  # Left leg
        (0, 15), (0, 16), (15, 17), (16, 18)  # Face
    ]
    
    for start_idx, end_idx in skeleton:
        if start_idx < len(keypoints) and end_idx < len(keypoints):
            start_point = keypoints[start_idx]
            end_point = keypoints[end_idx]
            
            if start_point[2] > 0.1 and end_point[2] > 0.1:
                cv2.line(img_copy, 
                        (int(start_point[0]), int(start_point[1])),
                        (int(end_point[0]), int(end_point[1])),
                        (255, 0, 0), 2)
    
    return img_copy

print("✅ Pose visualization function created!")

In [None]:
# Load and visualize a sample with pose
if pairs_data:
    train_dir = raw_data_dir / 'viton-hd' / 'train'
    
    # Get first sample
    person_name, _ = pairs_data[0]
    
    person_path = train_dir / 'image' / person_name
    pose_json_path = train_dir / 'openpose_json' / person_name.replace('.jpg', '_keypoints.json')
    
    if person_path.exists() and pose_json_path.exists():
        # Load image
        person_img = cv2.imread(str(person_path))
        person_img = cv2.cvtColor(person_img, cv2.COLOR_BGR2RGB)
        
        # Load keypoints
        keypoints = load_pose_keypoints(pose_json_path)
        
        if keypoints is not None:
            # Visualize
            pose_vis = visualize_pose_keypoints(person_img, keypoints)
            
            fig, axes = plt.subplots(1, 2, figsize=(12, 6))
            
            axes[0].imshow(person_img)
            axes[0].set_title('Original Image', fontweight='bold')
            axes[0].axis('off')
            
            axes[1].imshow(pose_vis)
            axes[1].set_title('Pose Keypoints Visualization', fontweight='bold')
            axes[1].axis('off')
            
            plt.suptitle('Pose Estimation Example', fontsize=14, fontweight='bold')
            plt.tight_layout()
            plt.show()
            
            print("\n📊 Keypoint Confidence Scores:")
            print("="*70)
            # zip pairs each name with its keypoint and stops at the shorter sequence
            for name, (x, y, conf) in zip(POSE_KEYPOINT_NAMES, keypoints):
                print(f"   {name:15s}: ({x:6.1f}, {y:6.1f}) - Confidence: {conf:.3f}")
            print("="*70)

## 🔟 Data Quality Check

In [None]:
# Check for missing or corrupted files
def check_data_quality(dataset_path, pairs_file):
    """Check for missing or corrupted files in dataset"""
    dataset_path = Path(dataset_path)
    
    # Load pairs, skipping blank lines
    with open(pairs_file, 'r') as f:
        pairs = [line.split() for line in f if line.strip()]
    
    print("="*70)
    print("🔍 DATA QUALITY CHECK")
    print("="*70)
    print(f"\nChecking {len(pairs)} pairs...\n")
    
    missing_files = []
    corrupted_files = []
    
    for person_name, garment_name in tqdm(pairs, desc="Checking files"):
        # Check person image
        person_path = dataset_path / 'image' / person_name
        if not person_path.exists():
            missing_files.append(('person', person_name))
        else:
            try:
                with Image.open(person_path) as img:
                    img.verify()  # Raises if the file appears broken
            except Exception:
                corrupted_files.append(('person', person_name))
        
        # Check garment image
        garment_path = dataset_path / 'cloth' / garment_name
        if not garment_path.exists():
            missing_files.append(('garment', garment_name))
        else:
            try:
                with Image.open(garment_path) as img:
                    img.verify()  # Raises if the file appears broken
            except Exception:
                corrupted_files.append(('garment', garment_name))
        
        # Check segmentation mask
        seg_path = dataset_path / 'image-parse-v3' / person_name.replace('.jpg', '.png')
        if not seg_path.exists():
            missing_files.append(('segmentation', person_name))
    
    # Report results
    print("\n" + "="*70)
    print("📊 QUALITY CHECK RESULTS")
    print("="*70)
    print(f"\nTotal pairs checked: {len(pairs)}")
    print(f"Missing files: {len(missing_files)}")
    print(f"Corrupted files: {len(corrupted_files)}")
    
    if missing_files:
        print(f"\n⚠️  First 5 missing files:")
        for file_type, filename in missing_files[:5]:
            print(f"   - {file_type}: {filename}")
    
    if corrupted_files:
        print(f"\n⚠️  First 5 corrupted files:")
        for file_type, filename in corrupted_files[:5]:
            print(f"   - {file_type}: {filename}")
    
    if not missing_files and not corrupted_files:
        print("\n✅ All files are present and valid!")
    
    print("\n" + "="*70)
    
    return missing_files, corrupted_files

# Run quality check
train_dir = raw_data_dir / 'viton-hd' / 'train'
pairs_file = train_dir / 'train_pairs.txt'

if train_dir.exists() and pairs_file.exists():
    missing, corrupted = check_data_quality(train_dir, pairs_file)
else:
    print("⚠️  Dataset or pairs file not found")
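
The check above covers images and masks but not the pose files. A minimal sketch for validating one OpenPose JSON file is given below; `check_pose_json` is a hypothetical helper, and the rules it applies (at least one person, keypoint list length divisible by three) are assumptions about what counts as usable.

```python
import json
from pathlib import Path


def check_pose_json(json_path):
    """Return None if the pose file looks usable, else a short problem description."""
    try:
        data = json.loads(Path(json_path).read_text())
    except (OSError, ValueError) as exc:
        # OSError covers missing/unreadable files, ValueError covers invalid JSON
        return f"unreadable: {exc.__class__.__name__}"
    people = data.get("people") if isinstance(data, dict) else None
    if not people:
        return "no people detected"
    keypoints = people[0].get("pose_keypoints_2d", [])
    if not keypoints or len(keypoints) % 3 != 0:
        return "malformed keypoint list"
    return None
```

Looping over `train_dir / 'openpose_json'` and collecting the non-`None` results would extend the quality report to pose annotations.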

## 1️⃣1️⃣ Summary Statistics

In [None]:
# Create comprehensive summary
print("="*70)
print("📊 DATASET SUMMARY")
print("="*70)

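# `missing` and `corrupted` exist only if the quality check actually ran
try:
    data_quality = '✅ Good' if not (missing or corrupted) else '⚠️  Issues Found'
except NameError:
    data_quality = 'Not checked'

summary = {
    'Dataset': 'VITON-HD (Sample)',
    'Total Pairs': len(pairs_data) if pairs_data else 0,
    'Image Resolution': '768 x 1024 pixels',
    'Segmentation Classes': 20,
    'Pose Keypoints': 18,
    'Data Quality': data_quality
}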

for key, value in summary.items():
    print(f"\n{key:25s}: {value}")

print("\n" + "="*70)
print("\n✅ Data exploration complete!")
print("\n🚀 Next Steps:")
print("   1. Download full VITON-HD dataset if needed")
print("   2. Proceed to notebook 03_data_preprocessing.ipynb")
print("   3. Preprocess images and create training dataset")
print("\n" + "="*70)
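
The overview lists dataset splitting, and the dataset layout already separates train and test, so the remaining step is to hold out a validation set from the training pairs. A minimal sketch follows, assuming a random hold-out is acceptable; the `split_pairs` name, the 10% default, and the seed are illustrative choices rather than anything prescribed by VITON-HD.

```python
import random


def split_pairs(pairs, val_fraction=0.1, seed=42):
    """Shuffle a copy of `pairs` deterministically and split off a validation set.

    Returns (train_pairs, val_pairs).
    """
    if not 0.0 <= val_fraction < 1.0:
        raise ValueError("val_fraction must be in [0, 1)")
    shuffled = list(pairs)
    # A dedicated Random instance keeps the split independent of global seed state
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

A call such as `train_pairs, val_pairs = split_pairs(pairs_data)` yields the same split on every run, which keeps later experiments comparable.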

---

## 📝 Key Takeaways

### What We Learned:

1. **Dataset Structure**
   - VITON-HD contains person-garment pairs with rich annotations
   - Each sample includes: image, garment, segmentation mask, and pose keypoints
   - Standard resolution: 768×1024 pixels (width × height)

2. **Segmentation Masks**
   - 20 semantic classes covering body parts and clothing
   - Critical for garment region identification
   - Used for warping and blending operations

3. **Pose Keypoints**
   - 18 body keypoints from OpenPose
   - Essential for pose-guided generation
   - Used in ControlNet conditioning

4. **Data Quality**
   - Important to check for missing/corrupted files
   - Consistent image dimensions across dataset
   - Clean annotations for better training
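
To make the role of the segmentation mask in garment region identification concrete, the sketch below extracts a boolean mask and bounding box for a single label from a 2-D label map. `garment_region` is a hypothetical helper, and label id 5 (`upper-clothes`) is taken from the label table used in this notebook.

```python
import numpy as np

UPPER_CLOTHES = 5  # label id for 'upper-clothes' in this notebook's label table


def garment_region(label_map, label_id=UPPER_CLOTHES):
    """Return a boolean mask and bounding box (x0, y0, x1, y1) for one label.

    The box uses exclusive upper bounds; it is None when the label is absent.
    """
    mask = label_map == label_id
    if not mask.any():
        return mask, None
    ys, xs = np.nonzero(mask)
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    return mask, bbox
```

The bounding box can be used to crop the person image with `person_img[y0:y1, x0:x1]`, and the boolean mask to blank out the garment region before warping or blending.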

### Important Notes:

- **Full Dataset**: Download complete VITON-HD (~15GB) for production training
- **Storage**: Ensure sufficient disk space (50GB+ recommended)
- **Preprocessing**: Next notebook will handle image preprocessing and augmentation
- **Custom Data**: You can add your own e-commerce images following the same structure
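
Before starting a large download, the storage note above can be checked programmatically. A minimal sketch using only the standard library is shown here; `has_free_space` is a hypothetical helper, and the threshold passed to it is up to the reader.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3
```

For example, `has_free_space(raw_data_dir, 50)` checks the recommended headroom on the drive that holds the raw data directory; the path must already exist.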

---

## 🔗 Resources

- [VITON-HD Paper](https://arxiv.org/abs/2103.16874)
- [VITON-HD GitHub](https://github.com/shadow2496/VITON-HD)
- [OpenPose](https://github.com/CMU-Perceptual-Computing-Lab/openpose)
- [Graphonomy (Human Parsing)](https://github.com/Gaoyiminggithub/Graphonomy)

---

**Author**: Huzaifa Nasir  
**Date**: December 2025  
**Notebook**: 02_data_exploration.ipynb  
**Status**: ✅ Complete