# 📚 PyTorch Practice Notebook - Lecture 2: Professional Data Pipelines

**Based on:** SAIR PyTorch Mastery - Lecture 2: Professional Data Pipelines with PyTorch

**Instructions:** Complete the exercises below to test your understanding of PyTorch data pipelines. Try to solve them without looking at the original notebook first!

**Time Estimate:** 3-4 hours

## 🆕 Enhanced Features:
- Edge case testing (corrupt files, missing data)
- Performance comparison exercises
- Debugging exercises (finding bugs in given code)
- Additional Sudanese context scenarios

## 🔧 Setup & Imports

Run this cell first to set up your environment.

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, IterableDataset
import torchvision
from torchvision import transforms
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import os
from pathlib import Path
from PIL import Image
import json
from collections import defaultdict
import tempfile
import shutil
import psutil
from tqdm import tqdm
from io import StringIO
import random
import warnings

# For reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## 🆕 NEW: Debugging Exercise 0 - Find the Bugs!

**Task:** This dataset class has multiple bugs. Identify and fix them all.

In [None]:
# =========== BUGGY DATASET - FIND AND FIX ALL BUGS! ===========
class BuggyImageDataset(Dataset):
    """Dataset with multiple bugs - fix them all!"""
    
    def __init__(self, image_dir, label_file):
        # BUG 1: Missing super().__init__()
        
        self.image_dir = image_dir
        
        # BUG 2: No error handling for missing file
        self.labels = pd.read_csv(label_file)
        
        # BUG 3: Inefficient - loading all image paths upfront
        self.image_paths = []
        for ext in ['.jpg', '.png', '.jpeg']:
            self.image_paths.extend(list(Path(image_dir).glob(f'*{ext}')))
            
        # BUG 4: No validation that images match labels
        
        # BUG 5: Transform applied differently each time
        self.transform = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
        ])
    
    def __len__(self):
        # BUG 6: Inconsistent length
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        # BUG 7: No error handling for corrupt files
        img_path = self.image_paths[idx]
        
        # BUG 8: Opening image but not closing it properly
        image = Image.open(img_path)
        
        # BUG 9: Applying random transform differently each time
        if self.transform:
            image = self.transform(image)
        
        # BUG 10: Hardcoded label extraction
        label = self.labels.iloc[idx]['label']
        
        return image, label
    
    def show_sample(self, idx):
        # BUG 11: Modifies the image but doesn't return it
        img, label = self[idx]
        plt.imshow(img.permute(1, 2, 0) if len(img.shape) == 3 else img)
        plt.title(f"Label: {label}")
        plt.show()

# =========== YOUR FIXED VERSION ===========
class FixedImageDataset(Dataset):
    """Your fixed version of the buggy dataset"""
    
    def __init__(self, image_dir, label_file):
        # TODO: Fix all bugs
        pass
    
    def __len__(self):
        pass
    
    def __getitem__(self, idx):
        pass
    
    def show_sample(self, idx):
        pass

## 🎯 Exercise 1: Dataset Fundamentals & Memory Management

### Part A: Fix the Memory-Inefficient Dataset

**Task:** This dataset loads ALL data in `__init__`, which is inefficient for large datasets. Rewrite it to use lazy loading.

**Original (problematic) implementation:**

In [None]:
# =========== PROBLEMATIC DATASET - FIX ME! ===========
class MemoryInefficientDataset(Dataset):
    """Dataset that loads ALL data in __init__ - problematic for large datasets"""
    
    def __init__(self, csv_path):
        super().__init__()
        # PROBLEM: Loading ALL data at initialization
        self.data = pd.read_csv(csv_path)
        
        # PROBLEM: Converting ALL to tensors upfront
        self.features = torch.tensor(self.data.iloc[:, :-1].values, dtype=torch.float32)
        self.labels = torch.tensor(self.data.iloc[:, -1].values, dtype=torch.float32)
        
        print(f"Loaded {len(self)} samples")
        print(f"Memory usage: {self.features.element_size() * self.features.nelement() / 1e6:.1f} MB for features")
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # No actual loading needed - data already in memory
        return self.features[idx], self.labels[idx]
# =====================================================

**Your Task:** Rewrite the dataset to:
1. Load only metadata in `__init__`
2. Load data on-demand in `__getitem__`
3. Handle CSV files larger than memory
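**Hint (one possible approach):** The three requirements above can be met by scanning the file once to record the byte offset where each row starts, then seeking to a single row on demand. The sketch below uses only the standard library. `CsvOffsetIndex` and `read_row` are hypothetical names, not part of PyTorch or pandas, and the sketch assumes no field contains an embedded newline.

```python
import csv
import os
import tempfile


class CsvOffsetIndex:
    """Hypothetical helper: remembers where each data row starts in the file."""

    def __init__(self, csv_path):
        self.csv_path = csv_path
        self.offsets = []
        # Binary mode so that tell()/seek() positions are exact byte offsets.
        with open(csv_path, "rb") as f:
            self.header = f.readline().decode("utf-8").strip().split(",")
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():  # ignore blank lines
                    self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def read_row(self, idx):
        """Seek to row idx and parse only that line."""
        with open(self.csv_path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8")
        return next(csv.reader([line]))


# Small demonstration on a throwaway file
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("feature1,feature2,label\n1.2,3.4,0\n2.3,4.5,1\n")
index = CsvOffsetIndex(tmp.name)
print(len(index), index.read_row(1))
os.unlink(tmp.name)
```

Only the list of integer offsets stays in memory, so `__len__` is cheap and `__getitem__` can convert the parsed strings to tensors one row at a time.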

**Test Data:**

In [None]:
# Create test CSV data
test_csv_data = """feature1,feature2,feature3,feature4,label
1.2,3.4,5.6,7.8,0
2.3,4.5,6.7,8.9,1
3.4,5.6,7.8,9.0,0
4.5,6.7,8.9,10.1,1
5.6,7.8,9.0,11.2,0
"""

# Save to temporary file
temp_csv = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
temp_csv.write(test_csv_data)
temp_csv.close()

print(f"Test CSV created at: {temp_csv.name}")

In [None]:
# =========== YOUR CODE HERE ===========
class MemoryEfficientDataset(Dataset):
    """Your optimized dataset implementation"""
    
    def __init__(self, csv_path):
        super().__init__()
        # TODO: Load only metadata, not data
        
    def __len__(self):
        # TODO: Return dataset length
        pass
    
    def __getitem__(self, idx):
        # TODO: Load data on-demand
        pass
# =======================================

### 🆕 NEW: Part A-2: Handle Corrupt/Missing Data

**Task:** Extend your dataset to handle:
1. Corrupted rows in CSV
2. Missing values (NaN)
3. Invalid data types
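**Hint (one possible approach):** All three failure modes can be handled in one place by validating each line before it becomes a sample. The sketch below is a hypothetical helper (`parse_row` is not a library function). Its `default_value` and `skip_corrupt` parameters mirror the arguments of the exercise skeleton, and it assumes every column, including the label, should be numeric.

```python
import math


def parse_row(line, n_columns, default_value=0.0, skip_corrupt=True):
    """Return a list of floats for one CSV line, or None if the row is dropped."""
    fields = line.strip().split(",")
    if len(fields) != n_columns:
        return None  # wrong number of columns: nothing sensible to salvage
    values = []
    for field in fields:
        try:
            value = float(field)
        except ValueError:  # empty string or non-numeric text
            value = math.nan
        if math.isnan(value):
            if skip_corrupt:
                return None
            value = default_value  # replace the bad cell instead of dropping the row
        values.append(value)
    return values


print(parse_row("1.2,3.4,5.6,7.8,0", 5))
print(parse_row("2.3,4.5,corrupt,8.9,1", 5))
print(parse_row("3.4,5.6,7.8,,0", 5, skip_corrupt=False))
```

Whether to drop a row or fill in a default is a modelling decision. Counting how many rows each branch touches makes that choice visible later.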

**Test with corrupt data:**

In [None]:
# Create CSV with corrupt/missing data.
# The notes are kept outside the string so they do not become part of the CSV:
#   data row 2: string where a float is expected
#   data row 3: missing value
#   data row 5: wrong number of columns
corrupt_csv_data = """feature1,feature2,feature3,feature4,label
1.2,3.4,5.6,7.8,0
2.3,4.5,corrupt,8.9,1
3.4,5.6,7.8,,0
4.5,6.7,8.9,10.1,1
invalid_row_with_extra_columns,1,2,3,4,5,6
5.6,7.8,9.0,11.2,0
"""

corrupt_csv = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
corrupt_csv.write(corrupt_csv_data)
corrupt_csv.close()

print(f"Corrupt test CSV created at: {corrupt_csv.name}")

# =========== YOUR CODE HERE ===========
class RobustMemoryEfficientDataset(Dataset):
    """Dataset that handles corrupt/missing data"""
    
    def __init__(self, csv_path, default_value=0.0, skip_corrupt=True):
        """
        Args:
            csv_path: Path to CSV file
            default_value: Value to use for missing/corrupt data
            skip_corrupt: Whether to skip corrupt rows or replace with defaults
        """
        super().__init__()
        # TODO: Implement robust CSV loading
        # Handle:
        # 1. Corrupt rows (wrong data types)
        # 2. Missing values (NaN)
        # 3. Wrong number of columns
        
    def __len__(self):
        pass
    
    def __getitem__(self, idx):
        # TODO: Handle corrupt data gracefully
        # Options:
        # 1. Return default values
        # 2. Skip to next valid sample
        # 3. Raise specific error types
        pass
# =======================================

### Part B: Memory Usage Comparison

**Task:** Compare memory usage between the two implementations.
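**Hint (one possible approach):** A minimal way to compare allocations is the standard-library `tracemalloc` module, sketched below. `measure_python_allocations` is a hypothetical helper name. Note the limitation: `tracemalloc` only sees memory requested through Python's allocator, so tensor storage allocated inside PyTorch is typically not counted. For tensors, `element_size() * nelement()` (as used in the dataset above) or the process RSS from `psutil.Process().memory_info().rss` is a better signal.

```python
import tracemalloc


def measure_python_allocations(build_fn):
    """Run build_fn() and return (result, current_bytes, peak_bytes)."""
    tracemalloc.start()
    try:
        result = build_fn()
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


# Eager: materialise every value up front. Lazy: create a generator that computes on demand.
eager, _, eager_peak = measure_python_allocations(lambda: [float(i) for i in range(100_000)])
lazy, _, lazy_peak = measure_python_allocations(lambda: (float(i) for i in range(100_000)))
print(f"eager peak: {eager_peak / 1e6:.2f} MB, lazy peak: {lazy_peak / 1e6:.2f} MB")
```

The same wrapper can be pointed at `lambda: MemoryInefficientDataset(path)` and at your lazy version to fill in the comparison.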

In [None]:
# =========== YOUR CODE HERE ===========
# 1. Test memory inefficient version
print("Testing Memory Inefficient Dataset:")
# TODO: Instantiate and measure memory

# 2. Test your memory efficient version
print("\nTesting Your Memory Efficient Dataset:")
# TODO: Instantiate and measure memory

# 3. Load a few samples and measure time
print("\nTesting sample loading performance:")
# TODO: Time loading of 1000 samples for each

# 4. Cleanup
os.unlink(temp_csv.name)
os.unlink(corrupt_csv.name)
print("Cleaned up temporary files")
# =======================================

### 🆕 NEW: Part C: Performance Comparison Challenge

**Task:** Compare 3 different implementations and analyze trade-offs.

In [None]:
class Implementation1(Dataset):
    """Implementation 1: Load everything in __init__"""
    def __init__(self, csv_path):
        self.data = pd.read_csv(csv_path)
        self.features = torch.tensor(self.data.iloc[:, :-1].values, dtype=torch.float32)
        self.labels = torch.tensor(self.data.iloc[:, -1].values, dtype=torch.float32)
    
    def __len__(self): return len(self.data)
    def __getitem__(self, idx): return self.features[idx], self.labels[idx]

class Implementation2(Dataset):
    """Implementation 2: Keep the DataFrame in __init__, build tensors per sample in __getitem__"""
    def __init__(self, csv_path):
        self.data = pd.read_csv(csv_path)
        self.csv_path = csv_path
    
    def __len__(self): return len(self.data)
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        return torch.tensor(row.iloc[:-1].values, dtype=torch.float32), \
               torch.tensor(row.iloc[-1], dtype=torch.float32)

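class Implementation3(Dataset):
    """Implementation 3: Memory mapping with numpy (simplified: in-memory arrays)"""
    def __init__(self, csv_path):
        self.data = pd.read_csv(csv_path)
        # NOTE: these are ordinary in-memory arrays, not a true memory map.
        # For files larger than RAM, back them with np.memmap or
        # np.load(path, mmap_mode='r') on a pre-converted .npy file.
        self.features = np.array(self.data.iloc[:, :-1].values, dtype=np.float32)
        self.labels = np.array(self.data.iloc[:, -1].values, dtype=np.float32)
    
    def __len__(self): return len(self.data)
    def __getitem__(self, idx):
        # labels[idx] is a NumPy scalar, which torch.from_numpy rejects, so use torch.tensor
        return torch.from_numpy(self.features[idx]), torch.tensor(self.labels[idx])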

# =========== PERFORMANCE COMPARISON ===========
print("Performance Comparison Challenge:")
print("="*50)

# Create a larger test file
large_csv = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
large_csv.write("feature1,feature2,feature3,feature4,label\n")
for i in range(10000):  # 10k samples
    large_csv.write(f"{i*1.0},{i*2.0},{i*3.0},{i*4.0},{i%2}\n")
large_csv.close()

print(f"Large test file created: {large_csv.name}")

# TODO: Implement performance comparison
# 1. Measure initialization time for each
# 2. Measure memory usage after initialization
# 3. Measure time to load 1000 random samples
# 4. Create a comparison table
# 5. Analyze trade-offs for different scenarios

print("\nExpected Output Table:")
print("Implementation | Init Time | Memory | Load Time | Best For")
print("------------- | --------- | ------ | --------- | --------")
print("1 (Eager)     | Fast      | High   | Fast      | Small datasets")
print("2 (Lazy)      | Fast      | Low    | Slow      | Huge datasets")
print("3 (Memmap)    | Medium    | Medium | Medium    | Medium-large datasets")

# Cleanup
os.unlink(large_csv.name)

## 🚀 Exercise 2: DataLoader Optimization

### Part A: Diagnose and Fix Slow Data Loading

**Task:** This training script has slow data loading. Identify the bottlenecks and fix them.

In [None]:
# =========== SLOW TRAINING SCRIPT - FIX ME! ===========
class SlowDataset(Dataset):
    def __init__(self, num_samples=1000):
        self.num_samples = num_samples
        self.images = []
        self.labels = []
        
        # Simulate slow image generation
        for i in range(num_samples):
            # Simulate image loading/processing
            time.sleep(0.001)  # 1ms delay per image
            self.images.append(torch.randn(3, 224, 224))
            self.labels.append(i % 10)
    
    def __len__(self):
        return self.num_samples
    
    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

# Slow training setup
dataset = SlowDataset(num_samples=100)

# PROBLEMATIC DataLoader configuration
dataloader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,  # PROBLEM: No parallel loading
    pin_memory=False,  # PROBLEM: Not using pinned memory for GPU
)

# Simple model
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(16, 10)
).to(device)

# Training loop
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

print("Starting slow training...")
start_time = time.time()

for epoch in range(2):
    for batch_idx, (images, labels) in enumerate(dataloader):
        images, labels = images.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}")

print(f"Total training time: {time.time() - start_time:.2f} seconds")
# ======================================================

**Your Task:**
1. Identify all bottlenecks in the code above
2. Rewrite the dataset to be more efficient
3. Optimize the DataLoader configuration
4. Show performance improvement

In [None]:
# =========== YOUR OPTIMIZED SOLUTION ===========
# 1. Create an optimized dataset
class OptimizedDataset(Dataset):
    def __init__(self, num_samples=1000):
        # TODO: Optimize initialization
        pass
    
    def __len__(self):
        # TODO
        pass
    
    def __getitem__(self, idx):
        # TODO: Optimize data loading
        pass

# 2. Create optimized DataLoader
# TODO: Choose optimal parameters
# dataloader_optimized = DataLoader(...)

# 3. Benchmark performance
print("\nBenchmarking Optimized Version:")
# TODO: Run training with optimized setup and measure time

# 4. Compare performance
# TODO: Show speedup factor
# ===============================================

### 🆕 NEW: Part A-2: Handle Corrupt Images Gracefully

**Task:** Real-world datasets often have corrupt images. Implement a dataset that handles:
1. Corrupt image files
2. Missing image files
3. Wrong image formats
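**Hint (one possible approach):** A cheap first filter for missing files and wrong formats is to read the first few bytes and compare them with known file signatures. The sketch below uses only the standard library. `sniff_image_format` and `SIGNATURES` are hypothetical names. A matching signature does not prove the file decodes, so a fuller check would still open the file with Pillow (for example `Image.open(path).verify()` inside a `try`/`except`).

```python
import tempfile
from pathlib import Path

# File signatures ("magic bytes") for JPEG and PNG.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def sniff_image_format(path):
    """Return 'jpeg', 'png', or None for a missing or unrecognised file."""
    path = Path(path)
    if not path.is_file():
        return None
    with open(path, "rb") as f:
        head = f.read(8)
    for name, signature in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None


with tempfile.TemporaryDirectory() as d:
    good = Path(d) / "good.png"
    good.write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    fake = Path(d) / "fake.jpg"  # .jpg extension, but not image data
    fake.write_bytes(b"not an image")
    print(sniff_image_format(good), sniff_image_format(fake),
          sniff_image_format(Path(d) / "missing.jpg"))
```

Running this scan once in `__init__` lets you update `valid_count` and `corrupt_count` before any training starts.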

In [None]:
class RobustImageDataset(Dataset):
    """Dataset that handles corrupt/missing images gracefully"""
    
    def __init__(self, image_dir, label_dict, corrupt_strategy='skip'):
        """
        Args:
            image_dir: Directory with images (some may be corrupt)
            label_dict: Dict mapping image_name to label
            corrupt_strategy: 'skip', 'placeholder', or 'retry'
        """
        super().__init__()
        self.image_dir = Path(image_dir)
        self.label_dict = label_dict
        self.corrupt_strategy = corrupt_strategy
        
        # TODO: Implement
        # 1. Scan directory and validate images
        # 2. Handle corrupt images based on strategy
        # 3. Create list of valid samples
        
        # Statistics
        self.corrupt_count = 0
        self.valid_count = 0
    
    def _validate_image(self, image_path):
        """Validate if image can be loaded"""
        # TODO: Check if file exists, can be opened, is valid image
        pass
    
    def _get_placeholder(self):
        """Return placeholder for corrupt images"""
        # TODO: Return gray image or previous valid image
        pass
    
    def __len__(self):
        pass
    
    def __getitem__(self, idx):
        # TODO: Implement with error handling
        pass

### Part B: Profile Data Loading Performance

**Task:** Create a profiling tool that measures:
1. Batch loading times
2. GPU idle time
3. Memory usage during training
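**Hint (one possible approach):** The first two measurements can be approximated by splitting each iteration into "time spent waiting for the next batch" and "time spent in the training step". The sketch below is framework-agnostic and uses a hypothetical `profile_loop` helper. Time spent waiting on data is only a proxy for GPU idle time. On a GPU, kernels run asynchronously, so you would call `torch.cuda.synchronize()` inside `step_fn` before the clock is read.

```python
import time


def profile_loop(batches, step_fn, num_batches=20):
    """Split wall time into 'waiting for data' and 'running the step'."""
    wait_times, step_times = [], []
    iterator = iter(batches)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # blocks while the loader prepares data
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        t2 = time.perf_counter()
        wait_times.append(t1 - t0)
        step_times.append(t2 - t1)
    total = sum(wait_times) + sum(step_times)
    return {
        "batches": len(wait_times),
        "mean_wait_s": sum(wait_times) / max(len(wait_times), 1),
        "mean_step_s": sum(step_times) / max(len(step_times), 1),
        "wait_fraction": sum(wait_times) / total if total > 0 else 0.0,
    }


def slow_batches(n, delay=0.002):
    for i in range(n):
        time.sleep(delay)  # stand-in for slow disk reads or decoding
        yield i


report = profile_loop(slow_batches(5), step_fn=lambda batch: None)
print(report)
```

A `wait_fraction` close to 1 suggests the pipeline is data-bound. A value close to 0 suggests the model step dominates.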

In [None]:
# =========== YOUR CODE HERE ===========
class DataLoaderProfiler:
    """Your implementation of a data loading profiler"""
    
    def __init__(self):
        self.metrics = {
            'batch_times': [],
            'gpu_idle_times': [],
            'memory_usage': [],
            'cpu_usage': []
        }
    
    def profile_training(self, model, dataloader, num_batches=20):
        """Profile training loop"""
        # TODO: Implement profiling
        pass
    
    def print_report(self):
        """Print profiling results"""
        # TODO: Print detailed report
        pass
    
    def plot_metrics(self):
        """Visualize metrics"""
        # TODO: Create plots
        pass

# Test your profiler
profiler = DataLoaderProfiler()
# TODO: Profile both slow and optimized versions
# ===============================================

## 🖼️ Exercise 3: Computer Vision Pipeline

### Part A: Create Augmentation Pipeline for Sudanese Agriculture

**Task:** Design data augmentations specifically for Sudanese agricultural images.

**Considerations:**
- Plants might be at different angles
- Varying lighting conditions (bright sun vs shade)
- Dust/sand particles in air
- Different camera angles (from drone vs ground)

In [None]:
# =========== YOUR CODE HERE ===========
# 1. Create training augmentations
sudanese_agriculture_train_transform = transforms.Compose([
    # TODO: Design appropriate augmentations
    # Consider: rotation, color jitter, random crop, etc.
])

# 2. Create validation augmentations (simpler)
sudanese_agriculture_val_transform = transforms.Compose([
    # TODO: Simple preprocessing for validation
])

# 3. Create test function
def test_augmentations(transform, num_samples=4):
    """Test and visualize augmentations"""
    # TODO: Create dummy image and apply transformations
    # Visualize original + augmented versions
    pass

# Test your augmentations
print("Testing Sudanese Agriculture Augmentations:")
test_augmentations(sudanese_agriculture_train_transform)
# =======================================

### 🆕 NEW: Part A-2: Performance Comparison of Augmentations

**Task:** Compare different augmentation strategies for performance.

In [None]:
# Different augmentation strategies
light_aug = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

heavy_aug = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(30),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

class AddDustNoise:
    """Simulate dust/sand particles in air"""
    def __call__(self, img):
        # TODO: Implement dust noise
        return img

# Built after AddDustNoise is defined; referencing the class earlier raises NameError
sudanese_aug = transforms.Compose([
    transforms.Resize((256, 256)),
    # Add dust/sand simulation
    transforms.RandomApply([AddDustNoise()], p=0.3),
    # Bright sunlight simulation
    transforms.RandomApply([transforms.ColorJitter(brightness=0.4)], p=0.5),
    # Wind effect (blur)
    transforms.RandomApply([transforms.GaussianBlur(3)], p=0.2),
    transforms.ToTensor(),
])

# TODO: Benchmark performance
# 1. Measure time per batch with each augmentation
# 2. Compare GPU utilization
# 3. Analyze trade-off between augmentation complexity and performance

### Part B: Handle Large Satellite Images

**Task:** Create a dataset that can handle very large satellite images (e.g., 10,000×10,000 pixels) without loading them entirely into memory.
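**Hint (one possible approach):** The tile metadata can be computed from the image size alone, before any pixels are read. The sketch below shows one way to place overlapping tiles so that the last tile sits flush with the image edge. `tile_origins` and `tile_grid` are hypothetical helper names, and the defaults mirror the `tile_size=512, overlap=64` arguments of the exercise skeleton.

```python
def tile_origins(length, tile_size, overlap):
    """Start coordinates of tiles covering [0, length) along one axis."""
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must satisfy 0 <= overlap < tile_size")
    if length <= tile_size:
        return [0]
    stride = tile_size - overlap
    origins = list(range(0, length - tile_size + 1, stride))
    # Add a final tile flush with the edge if the stride did not land on it.
    if origins[-1] != length - tile_size:
        origins.append(length - tile_size)
    return origins


def tile_grid(width, height, tile_size=512, overlap=64):
    """All (x, y) top-left corners for a width x height image."""
    return [(x, y)
            for y in tile_origins(height, tile_size, overlap)
            for x in tile_origins(width, tile_size, overlap)]


print(len(tile_grid(10_000, 10_000)))  # 23 origins per axis -> 529 tiles
```

Storing `(image_idx, x, y, label)` tuples built from this grid keeps `__init__` light. How cheaply a single tile can then be read depends on the file format and the reader you choose.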

In [None]:
# =========== YOUR CODE HERE ===========
class SatelliteImageDataset(Dataset):
    """Dataset for large satellite images using tiling"""
    
    def __init__(self, image_paths, labels, tile_size=512, overlap=64):
        """
        Args:
            image_paths: List of paths to large satellite images
            labels: List of labels (e.g., crop type, drought level)
            tile_size: Size of tiles to extract
            overlap: Overlap between tiles to avoid edge artifacts
        """
        super().__init__()
        self.image_paths = image_paths
        self.labels = labels
        self.tile_size = tile_size
        self.overlap = overlap
        
        # TODO: Pre-calculate tile information
        # Store tile metadata (image_idx, x, y, label)
        self.tiles = []
        
    def __len__(self):
        # TODO: Return number of tiles
        pass
    
    def __getitem__(self, idx):
        # TODO: Load only the needed tile from large image
        pass
    
    def visualize_tile(self, idx, show_grid=True):
        """Visualize a tile within the context of the full image"""
        # TODO: Implement visualization
        pass

# Test with simulated large images
print("Creating test satellite images...")
temp_dir = Path(tempfile.mkdtemp())

# TODO: Create test images and test your dataset

# Cleanup
shutil.rmtree(temp_dir)
# =======================================

### 🆕 NEW: Part B-2: Performance Optimization for Satellite Images

**Task:** Optimize satellite image loading for different scenarios.

In [None]:
def benchmark_satellite_strategies():
    """Compare different strategies for handling large satellite images"""
    
    strategies = {
        'naive': 'Load entire image, then crop',
        'tiling': 'Pre-compute tiles',
        'streaming': 'Stream tiles on demand',
        'memmap': 'Memory map the image file',
    }
    
    # TODO: Implement benchmark
    # 1. Create large test image (5000x5000)
    # 2. Measure time to load 100 random tiles with each strategy
    # 3. Measure memory usage
    # 4. Create comparison table
    
    print("Strategy Comparison:")
    print("Strategy   | Load Time | Memory | Best For")
    print("---------- | --------- | ------ | --------")
    print("Naive      | Slow      | High   | Small images")
    print("Tiling     | Medium    | Medium | Medium images, random access")
    print("Streaming  | Fast      | Low    | Sequential access")
    print("Memmap     | Fast      | Low    | Random access, large images")

# Run benchmark
benchmark_satellite_strategies()

## 📚 Exercise 4: NLP Pipeline for Arabic Text

### Part A: Handle Sudanese Arabic Dialect

**Task:** Create a text dataset that handles Sudanese Arabic dialect features:
1. Right-to-left text
2. Dialect-specific words
3. Handle both Modern Standard Arabic and Sudanese dialect
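**Hint (one possible approach):** Right-to-left is a display property. Python strings store Arabic in logical (reading) order, so no reversal is needed before tokenizing. The sketch below shows one common normalization recipe and a simple whitespace vocabulary. `normalize_arabic` and `build_vocab` are hypothetical helpers, and folding the hamza forms of alef into bare alef is an assumption that some tasks may not want.

```python
import re

# Combining marks U+064B..U+0652 (tanween, short vowels, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"  # elongation character used for justification
# Alef with hamza above/below or madda -> bare alef (an assumption, see lead-in).
ALEF_VARIANTS = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def normalize_arabic(text, strip_diacritics=True):
    """Unify alef forms, drop tatweel and (optionally) diacritics."""
    text = text.replace(TATWEEL, "").translate(ALEF_VARIANTS)
    if strip_diacritics:
        text = DIACRITICS.sub("", text)
    return " ".join(text.split())


def build_vocab(texts, specials=("<pad>", "<unk>")):
    """Map each whitespace token to an integer id, special tokens first."""
    vocab = {token: i for i, token in enumerate(specials)}
    for text in texts:
        for token in normalize_arabic(text).split():
            vocab.setdefault(token, len(vocab))
    return vocab


vocab = build_vocab(["الطقس اليوم حار جداً", "الطقس حار"])
print(len(vocab))  # 2 special tokens + 4 distinct words = 6
```

Dialect-specific words can either stay as their own vocabulary entries or be mapped to MSA equivalents through a lookup table applied after normalization. Which is better depends on whether the dialect itself is the signal, as it is in the test labels below.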

In [None]:
# =========== YOUR CODE HERE ===========
class SudaneseArabicDataset(Dataset):
    """Dataset for Sudanese Arabic text classification"""
    
    def __init__(self, texts, labels, vocab=None, max_length=128, 
                 handle_dialect=True, normalize=True):
        super().__init__()
        self.texts = texts
        self.labels = labels
        self.max_length = max_length
        self.handle_dialect = handle_dialect
        self.normalize = normalize
        
        # TODO: Build vocabulary considering dialect
        if vocab is None:
            self.vocab = self._build_vocab(texts)
        else:
            self.vocab = vocab
        
        # TODO: Add special tokens
        
    def _build_vocab(self, texts):
        """Build vocabulary with dialect handling"""
        # TODO: Implement vocabulary building
        # Consider: dialect normalization, MSA mapping, etc.
        pass
    
    def _preprocess_text(self, text):
        """Preprocess Arabic text"""
        # TODO: Implement preprocessing steps:
        # 1. Normalize Arabic characters
        # 2. Remove diacritics (optional)
        # 3. Handle dialect words (map to MSA or keep)
        # 4. Other cleaning steps
        pass
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        # TODO: Implement text encoding
        pass
    
    def decode(self, token_ids):
        """Convert token IDs back to text"""
        # TODO: Implement decoding
        pass

# Test data
sudanese_texts = [
    "كِسرة بتاعة فول مع طماطم",  # Sudanese dialect
    "الطقس اليوم حار جداً",  # Modern Standard Arabic
    "شايف القوم ده عاملين إزاي",  # Sudanese dialect
    "الزراعة في السودان متقدمة",  # MSA
    "عايز أشوف كماشة",  # Sudanese dialect
]

labels = [0, 1, 0, 1, 0]  # 0 = dialect, 1 = MSA

# Test your dataset
print("Testing Sudanese Arabic Dataset:")
dataset = SudaneseArabicDataset(sudanese_texts, labels)
# TODO: Test encoding/decoding
# =======================================

### 🆕 NEW: Part A-2: Handle Noisy/Corrupt Arabic Text

**Task:** Real-world Arabic text from social media/SMS is often noisy. Handle:
1. Mixed Arabic/English/Latin script
2. Missing diacritics
3. Spelling variations
4. Emojis and special characters
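**Hint (one possible approach):** Emoji and most pictographs can be filtered by Unicode general category rather than by a hand-written list, and mixed script can be flagged by counting letters per script. The sketch below uses only the standard library. `strip_symbols` and `script_mix` are hypothetical helpers, and the category choices are an assumption that also removes symbols such as © or ™.

```python
import re
import unicodedata

ARABIC_LETTER = re.compile("[\u0621-\u064A]")
LATIN_LETTER = re.compile("[A-Za-z]")


def strip_symbols(text):
    """Drop emoji and other symbol characters, then collapse whitespace."""
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # So = other symbol (most emoji), Sk = modifier symbol (skin tones),
        # Cf = format characters such as the zero-width joiner.
        # U+FE0F (emoji variation selector) is handled explicitly so that
        # Arabic diacritics, which share its Mn category, are left alone.
        if category in ("So", "Sk", "Cf") or ch == "\ufe0f":
            continue
        kept.append(ch)
    return " ".join("".join(kept).split())


def script_mix(text):
    """Count Arabic and Latin letters to flag mixed-script inputs."""
    return {
        "arabic": len(ARABIC_LETTER.findall(text)),
        "latin": len(LATIN_LETTER.findall(text)),
    }


sample = "I love السودان ❤️"
print(strip_symbols(sample))
print(script_mix(sample))
```

A 'light' strategy might stop after `strip_symbols`. An 'aggressive' one could additionally drop Latin letters whenever `script_mix` reports that Arabic dominates.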

In [None]:
class NoisyArabicDataset(Dataset):
    """Dataset that handles noisy Arabic text"""
    
    def __init__(self, texts, labels, cleaning_strategy='aggressive'):
        """
        Args:
            texts: List of noisy Arabic texts
            labels: Corresponding labels
            cleaning_strategy: 'light', 'aggressive', or 'smart'
        """
        super().__init__()
        self.texts = texts
        self.labels = labels
        self.strategy = cleaning_strategy
        
    def _clean_text(self, text):
        """Clean noisy Arabic text"""
        # TODO: Implement cleaning strategies
        # Light: Remove emojis, extra spaces
        # Aggressive: Normalize all variations, remove Latin
        # Smart: Try to preserve meaning while cleaning
        pass
    
    def _normalize_arabic(self, text):
        """Normalize Arabic characters"""
        # TODO: Normalize different forms of same character
        pass
    
    def _handle_mixed_script(self, text):
        """Handle text with mixed Arabic/Latin script"""
        # TODO: Convert numbers, transliterations, etc.
        pass

# Test with noisy data
noisy_texts = [
    "الطقس حار جدا ☀️🔥",  # With emojis
    "السعر 150 جنيہ",  # Mixed Arabic/English numbers
    "مشغول حاليا...",  # With punctuation
    "I love السودان ❤️",  # Mixed languages
    "أنا جعان 🍔",  # With emoji
]

# TODO: Test cleaning strategies

### Part B: Streaming Dataset for Large Text Corpora

**Task:** Create a streaming dataset that can handle text files larger than memory.
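**Hint (one possible approach):** A stream cannot be shuffled globally without reading it all, so a common compromise is a fixed-size shuffle buffer: fill the buffer, then for every new line emit a random buffered line and store the new one in its place. The sketch below is a plain generator with a hypothetical name (`shuffled_lines`). Its body could serve as the core of `__iter__`. With multiple DataLoader workers you would additionally split lines between workers, for example using the worker id and worker count from `torch.utils.data.get_worker_info()`.

```python
import random


def shuffled_lines(file_path, buffer_size=1000, seed=None):
    """Yield stripped, non-empty lines in approximately shuffled order."""
    rng = random.Random(seed)
    buffer = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if len(buffer) < buffer_size:
                buffer.append(line)
                continue
            # Buffer full: emit a random element and put the new line in its slot.
            j = rng.randrange(buffer_size)
            yield buffer[j]
            buffer[j] = line
    rng.shuffle(buffer)  # drain whatever is left at end of file
    yield from buffer
```

Memory use is bounded by `buffer_size` lines regardless of file size. A larger buffer gives better mixing at the cost of more memory.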

In [None]:
# =========== YOUR CODE HERE ===========
class StreamingArabicNews(IterableDataset):
    """Streaming dataset for Arabic news articles"""
    
    def __init__(self, file_path, vocab=None, max_length=128, 
                 buffer_size=1000, shuffle=True):
        """
        Args:
            file_path: Path to large text file (one article per line)
            vocab: Pre-built vocabulary
            max_length: Maximum sequence length
            buffer_size: Number of lines to buffer
            shuffle: Whether to shuffle the stream
        """
        super().__init__()
        self.file_path = file_path
        self.vocab = vocab or self._build_vocab_from_file()
        self.max_length = max_length
        self.buffer_size = buffer_size
        self.shuffle = shuffle
        
        # TODO: Initialize
    
    def _build_vocab_from_file(self):
        """Build vocabulary by streaming through file once"""
        # TODO: Implement vocabulary building from stream
        pass
    
    def __iter__(self):
        """Stream data from file"""
        # TODO: Implement streaming logic
        # Consider: worker splitting, buffering, shuffling
        pass

# Create test large text file
print("Creating test text file...")
temp_text_file = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8')

# Generate Arabic text
arabic_samples = [
    "تقرير عن الزراعة في السودان",
    "أخبار الرياضة المحلية",
    "تطورات السوق المالية",
    "الطقس وأحوال الزراعة",
    "التعليم في المناطق الريفية",
]

# Write many lines to simulate large file
for i in range(100):
    for sample in arabic_samples:
        temp_text_file.write(f"{sample} - النسخة {i}\n")
temp_text_file.close()

print(f"Test file created: {temp_text_file.name} ({os.path.getsize(temp_text_file.name)} bytes)")

# Test your streaming dataset
print("\nTesting Streaming Dataset:")
# TODO: Test the streaming dataset

# Cleanup
os.unlink(temp_text_file.name)
# =======================================

### 🆕 NEW: Part B-2: Performance Comparison for Text Datasets

**Task:** Compare different text processing strategies.
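**Hint (one possible approach):** The comparison can start with the two strategies that need no extra libraries, whitespace and character-level tokenization, measured with a small harness. The sketch below uses hypothetical helper names (`whitespace_tokens`, `character_tokens`, `compare_tokenizers`). An Arabic-specific or HuggingFace tokenizer could be added to the same dictionary as another callable.

```python
import time


def whitespace_tokens(text):
    return text.split()


def character_tokens(text):
    return [ch for ch in text if not ch.isspace()]


def compare_tokenizers(sentences, tokenizers):
    """Return {name: (seconds, vocab_size, mean_tokens_per_sentence)}."""
    results = {}
    for name, tokenize in tokenizers.items():
        start = time.perf_counter()
        tokenized = [tokenize(s) for s in sentences]
        elapsed = time.perf_counter() - start
        vocab = {tok for toks in tokenized for tok in toks}
        mean_len = sum(len(t) for t in tokenized) / max(len(tokenized), 1)
        results[name] = (elapsed, len(vocab), mean_len)
    return results


samples = ["تقرير عن الزراعة في السودان", "أخبار الرياضة المحلية"] * 1000
results = compare_tokenizers(samples, {
    "simple": whitespace_tokens,
    "character_level": character_tokens,
})
for name, (seconds, vocab_size, mean_len) in results.items():
    print(f"{name:16s} | {seconds:.4f} s | vocab {vocab_size} | {mean_len:.1f} tokens/sentence")
```

Vocabulary size and sequence length pull in opposite directions here, which is the trade-off the table in the benchmark function is meant to capture.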

In [None]:
def benchmark_text_processing():
    """Compare text processing strategies"""
    
    strategies = {
        'simple': 'Split on whitespace',
        'arabic_tokenizer': 'Arabic-specific tokenizer',
        'transformers': 'HuggingFace tokenizer',
        'character_level': 'Character-level encoding',
    }
    
    # TODO: Implement benchmark
    # 1. Process 10,000 Arabic sentences with each strategy
    # 2. Measure processing time
    # 3. Measure memory usage
    # 4. Compare vocabulary sizes
    
    print("Text Processing Strategy Comparison:")
    print("Strategy         | Speed   | Memory | Vocab Size | Accuracy")
    print("---------------- | ------- | ------ | ---------- | --------")
    print("Simple split     | Fast    | Low    | Large      | Low")
    print("Arabic tokenizer | Medium  | Medium | Medium     | High")
    print("Transformers     | Slow    | High   | Large      | Highest")
    print("Character level  | Fast    | Low    | Small      | Medium")

# Run benchmark
benchmark_text_processing()

## 🧪 Challenge Problems

### 🆕 NEW: Challenge 0: Debugging Real-World Sudanese Data Pipeline

**Task:** Debug this real-world Sudanese data pipeline with multiple issues.

In [None]:
# =========== BUGGY SUDANESE PIPELINE - DEBUG ME! ===========
class BuggySudanesePipeline:
    """Buggy pipeline for Sudanese agricultural data"""
    
    def __init__(self, data_dir):
        self.data_dir = data_dir
        
        # Load all data at once (problem for large datasets)
        self.images = []
        self.prices = []
        self.dates = []
        
        for file in os.listdir(data_dir):
            if file.endswith('.jpg'):
                # Load image
                img = Image.open(os.path.join(data_dir, file))
                self.images.append(img)  # PROBLEM: Storing PIL images
                
                # Parse metadata from filename
                parts = file.split('_')
                self.prices.append(float(parts[1]))  # No error handling
                self.dates.append(parts[2])  # Assuming format is correct
    
    def create_dataset(self):
        """Create PyTorch dataset"""
        class SudaneseDataset(Dataset):
            def __init__(self, images, prices, dates):
                self.images = images  # PROBLEM: Passing PIL images
                self.prices = prices
                self.dates = dates
                
                # Random transform each time
                self.transform = transforms.RandomRotation(30)
            
            def __len__(self):
                return len(self.images)
            
            def __getitem__(self, idx):
                # PROBLEM: No tensor conversion
                img = self.images[idx]
                
                # PROBLEM: Different augmentation each call
                if random.random() > 0.5:
                    img = self.transform(img)
                
                # Convert price to tensor (but price might be missing)
                price = torch.tensor(self.prices[idx])
                
                return img, price
        
        return SudaneseDataset(self.images, self.prices, self.dates)
    
    def create_dataloader(self, batch_size=32):
        """Create DataLoader"""
        dataset = self.create_dataset()
        
        # PROBLEM: Inefficient DataLoader settings
        return DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=0,  # No parallel loading
            collate_fn=self.buggy_collate  # Custom buggy collate
        )
    
    def buggy_collate(self, batch):
        """Buggy collate function"""
        images, prices = zip(*batch)
        
        # PROBLEM: Assuming all images same size
        images = torch.stack(images)
        prices = torch.stack(prices)
        
        return images, prices

# TODO: Identify and fix ALL bugs in this pipeline
# List at least 10 different bugs and their fixes
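
**Hint (one possible fix, not the full solution):** the unchecked `float(parts[1])` and `parts[2]` lines are a good place to start. The sketch below assumes filenames follow a `<crop>_<price>_<YYYY-MM-DD>.jpg` pattern, which is only inferred from the indices used in the buggy code; `MarketRecord` and `parse_market_filename` are hypothetical names introduced for illustration. It uses the filename stem so the `.jpg` extension does not leak into the date, and it returns `None` instead of raising on malformed names.

```python
import math
from datetime import date
from pathlib import Path
from typing import NamedTuple, Optional


class MarketRecord(NamedTuple):
    crop: str
    price: float
    day: date


def parse_market_filename(filename: str) -> Optional[MarketRecord]:
    """Parse '<crop>_<price>_<YYYY-MM-DD>.jpg' (assumed pattern); None if malformed."""
    stem = Path(filename).stem  # drop the extension before splitting
    parts = stem.split("_")
    if len(parts) != 3:
        return None
    crop, price_text, date_text = parts
    try:
        price = float(price_text)
        day = date.fromisoformat(date_text)
    except ValueError:
        return None
    # float() accepts 'nan' and 'inf', so reject those and negative prices explicitly
    if not math.isfinite(price) or price < 0:
        return None
    return MarketRecord(crop, price, day)


good = parse_market_filename("sorghum_125.5_2024-03-01.jpg")
bad = parse_market_filename("sorghum_notaprice_2024-03-01.jpg")
short = parse_market_filename("sorghum.jpg")
print(good, bad, short)
```

Returning `None` lets the caller decide whether to skip the file, log it, or count it; raising a custom exception would be an equally reasonable design.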

### Challenge 1: Multi-Modal Dataset (Images + Text)

**Task:** Create a dataset that handles both images and text for a Sudanese market monitoring system.

**Scenario:** You're building a system that:
- Takes photos of market goods (sorghum, millet, wheat)
- Has Arabic text descriptions from sellers
- Includes price information
- Needs to predict whether prices are reasonable

**Requirements:**
1. Handle image loading and augmentation
2. Process Arabic text descriptions
3. Combine multiple data types in single sample
4. Handle missing data (some samples might have only image or only text)
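
Requirements 3 and 4 meet in the collate step, so a rough sketch of that step may help before you start. It assumes each sample is a dict with `image`, `text`, `price`, and `label` keys, that a missing modality is represented as `None`, and that all present images share one shape. The function name `multimodal_collate`, the `image_shape` and `pad_id` parameters, and the mask keys are illustrative choices, not part of the exercise. It corresponds to the `'zero'` missing-data strategy only.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def multimodal_collate(batch, image_shape=(3, 8, 8), pad_id=0):
    """Collate dict samples whose 'image' or 'text' entry may be None."""
    images, image_mask, texts, text_mask = [], [], [], []
    for sample in batch:
        has_image = sample["image"] is not None
        # zero-fill missing images and remember which rows are real
        images.append(sample["image"] if has_image else torch.zeros(image_shape))
        image_mask.append(has_image)

        has_text = sample["text"] is not None and len(sample["text"]) > 0
        # a missing description becomes a single pad token
        texts.append(sample["text"] if has_text else torch.tensor([pad_id], dtype=torch.long))
        text_mask.append(has_text)

    return {
        "image": torch.stack(images),
        "image_mask": torch.tensor(image_mask),
        # pad every token sequence to the longest one in this batch
        "text": pad_sequence(texts, batch_first=True, padding_value=pad_id),
        "text_lengths": torch.tensor([len(t) for t in texts]),
        "text_mask": torch.tensor(text_mask),
        "price": torch.tensor([s["price"] for s in batch], dtype=torch.float32),
        "label": torch.tensor([s["label"] for s in batch], dtype=torch.long),
    }


batch = [
    {"image": torch.rand(3, 8, 8), "text": torch.tensor([5, 9, 2]), "price": 120.0, "label": 1},
    {"image": None, "text": torch.tensor([7]), "price": 95.0, "label": 0},
    {"image": torch.rand(3, 8, 8), "text": None, "price": 300.0, "label": 0},
]
out = multimodal_collate(batch)
print(out["image"].shape, out["text"].shape)
```

Returning explicit masks lets the model ignore zero-filled inputs rather than treating them as real data; the `'mean'` and `'ignore'` strategies would need different handling.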

In [None]:
# =========== CHALLENGE 1 ===========
class SudaneseMarketMultiModalDataset(Dataset):
    """Multi-modal dataset for Sudanese market monitoring"""
    
    def __init__(self, 
                 image_dir,  # Directory with images
                 metadata_file,  # CSV with text, prices, etc.
                 image_transform=None,
                 text_max_length=50,
                 handle_missing='zero'):  # How to handle missing data
        """
        Args:
            image_dir: Directory containing product images
            metadata_file: CSV with columns: image_name, description, price, category, is_reasonable
            image_transform: Transformations for images
            text_max_length: Maximum text sequence length
            handle_missing: Strategy for missing data ('zero', 'mean', 'ignore')
        """
        super().__init__()
        self.image_dir = Path(image_dir)
        
        # TODO: Load metadata
        self.metadata = pd.read_csv(metadata_file)
        
        # TODO: Initialize image transformations
        self.image_transform = image_transform or self._default_image_transform()
        
        # TODO: Initialize text processing
        self.text_max_length = text_max_length
        self.text_vocab = self._build_text_vocab()
        
        # TODO: Handle missing data strategy
        self.handle_missing = handle_missing
        
        # TODO: Preprocess data
        
    def _default_image_transform(self):
        """Default image transformations for market products"""
        # TODO: Design appropriate transformations
        pass
    
    def _build_text_vocab(self):
        """Build vocabulary from Arabic descriptions"""
        # TODO: Implement
        pass
    
    def __len__(self):
        return len(self.metadata)
    
    def __getitem__(self, idx):
        """Return multi-modal sample"""
        row = self.metadata.iloc[idx]
        
        # TODO: Load and process image
        image_tensor = None  # Load image if exists
        
        # TODO: Process Arabic text
        text_tensor = None  # Encode text if exists
        
        # TODO: Handle price/numerical features
        price_tensor = None
        
        # TODO: Handle missing data
        
        # TODO: Combine into single sample
        sample = {
            'image': image_tensor,
            'text': text_tensor,
            'price': price_tensor,
            'label': row['is_reasonable']  # 0 or 1
        }
        
        return sample
    
    def collate_fn(self, batch):
        """Custom collate function for multi-modal data"""
        # TODO: Implement collate function that handles:
        # - Variable length sequences
        # - Missing modalities
        # - Different data types
        pass

# Create test data
print("Creating test multi-modal data...")
temp_mm_dir = Path(tempfile.mkdtemp())

# TODO: Create test images and metadata
# 1. Create image directory with dummy images
# 2. Create CSV metadata file

# Test your dataset
print("\nTesting Multi-Modal Dataset:")
# dataset = SudaneseMarketMultiModalDataset(...)
# dataloader = DataLoader(dataset, batch_size=4, collate_fn=dataset.collate_fn)
# TODO: Test batch loading

# Cleanup
shutil.rmtree(temp_mm_dir)
# ===================================

### Challenge 2: Data Pipeline for Sudanese Healthcare

**Task:** Design a complete data pipeline for medical imaging in Sudanese hospitals.

**Special Considerations:**
1. Handle DICOM files (medical images)
2. Include patient metadata
3. Respect patient privacy (anonymization)
4. Work with limited internet connectivity (offline capable)
5. Handle power outages (checkpointing)

**Bonus:** Implement data validation to catch corrupted files.
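
For consideration 5, one common pattern is to write the checkpoint to a temporary file and then rename it over the old one, so an interrupted write leaves the previous checkpoint intact. The sketch below is a minimal illustration under assumptions: the state is a small JSON-serializable dict, and the names `save_checkpoint`, `load_checkpoint`, and `dataset_state.json` are hypothetical.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, path: Path) -> None:
    """Write state as JSON via a temp file plus rename."""
    path = Path(path)
    # the temp file must live in the same directory so the rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def load_checkpoint(path: Path, default: dict) -> dict:
    """Return the saved state, or a copy of `default` if none is usable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(default)


work_dir = Path(tempfile.mkdtemp())
ckpt_path = work_dir / "dataset_state.json"

fresh = load_checkpoint(ckpt_path, {"epoch": 0, "next_index": 0})
save_checkpoint({"epoch": 2, "next_index": 640}, ckpt_path)
restored = load_checkpoint(ckpt_path, {"epoch": 0, "next_index": 0})
print(fresh, restored)

shutil.rmtree(work_dir)
```

How often to checkpoint is a trade-off between lost work after an outage and the extra disk writes; that choice is left to your design.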

In [None]:
# =========== CHALLENGE 2 ===========
class SudaneseHealthcareDataset(Dataset):
    """Dataset for Sudanese healthcare applications"""
    
    def __init__(self, data_root, transform=None, anonymize=True,
                 validate_data=True, cache_size=100):
        """
        Args:
            data_root: Root directory with structure:
                - images/ (DICOM or PNG files)
                - metadata.csv (patient info, diagnoses)
                - annotations/ (optional: segmentation masks)
            transform: Image transformations
            anonymize: Whether to anonymize patient data
            validate_data: Validate file integrity
            cache_size: Number of samples to cache in memory
        """
        super().__init__()
        self.data_root = Path(data_root)
        
        # TODO: Implement with healthcare-specific considerations
        
    def _load_and_validate_dicom(self, filepath):
        """Load and validate DICOM file"""
        # TODO: Implement DICOM loading with validation
        pass
    
    def _anonymize_metadata(self, metadata):
        """Remove personally identifiable information"""
        # TODO: Implement anonymization
        pass
    
    def _checkpoint_state(self):
        """Save dataset state for recovery from power outages"""
        # TODO: Implement checkpointing
        pass
    
    def _restore_from_checkpoint(self):
        """Restore dataset state"""
        # TODO: Implement restoration
        pass
    
    def __len__(self):
        pass
    
    def __getitem__(self, idx):
        pass

# Create a DataValidator class
class HealthcareDataValidator:
    """Validate healthcare data integrity"""
    
    @staticmethod
    def validate_dicom(filepath):
        """Validate DICOM file integrity"""
        # TODO: Check if DICOM is valid and not corrupted
        pass
    
    @staticmethod
    def validate_metadata(metadata):
        """Validate patient metadata"""
        # TODO: Check required fields, data types, ranges
        pass
    
    @staticmethod
    def check_anonymization(metadata):
        """Check if data is properly anonymized"""
        # TODO: Verify no PII remains
        pass

# Design document
print("""
Design Considerations for Sudanese Healthcare Pipeline:

1. OFFLINE OPERATION:
   - Local caching of all data
   - Pre-processed datasets stored locally
   - Batch processing for when connectivity is available

2. POWER RESILIENCE:
   - Regular checkpointing of dataset state
   - Incremental processing with recovery
   - Battery backup considerations

3. PRIVACY:
   - Automatic anonymization of patient data
   - Encryption of sensitive data
   - Access controls

4. VALIDATION:
   - File integrity checks
   - Data completeness validation
   - Cross-field consistency checks
""")

# TODO: Implement test cases for your design
# ===================================
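
As a starting point for `_anonymize_metadata` and `check_anonymization`, the sketch below drops direct identifiers and replaces the patient ID with a keyed hash. The column names in `PII_FIELDS`, the `patient_id` key, and both function names are assumptions made for illustration; your metadata schema may differ. Note that this is pseudonymization of tabular metadata only: fields such as age and diagnosis can still help re-identify someone, and DICOM headers carry their own patient tags that are not handled here.

```python
import hashlib
import hmac

# hypothetical column names treated as direct identifiers
PII_FIELDS = {"patient_name", "phone", "address", "national_id"}


def anonymize_record(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and replace patient_id with a keyed hash."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    if "patient_id" in cleaned:
        digest = hmac.new(
            secret_key, str(cleaned["patient_id"]).encode("utf-8"), hashlib.sha256
        )
        # same patient + same key gives the same pseudonym, so records stay linkable
        cleaned["patient_id"] = digest.hexdigest()[:16]
    return cleaned


def find_remaining_pii(record: dict) -> list:
    """Return the names of any direct-identifier fields still present."""
    return sorted(PII_FIELDS.intersection(record))


key = b"example-key-store-the-real-one-separately"
raw = {
    "patient_id": "P-00123",
    "patient_name": "Example Patient",
    "phone": "000-000-0000",
    "age": 54,
    "diagnosis": "pneumonia",
}
anon = anonymize_record(raw, key)
print(find_remaining_pii(raw), find_remaining_pii(anon))
```

A keyed hash is used instead of a plain hash so that someone who knows the ID format cannot simply hash candidate IDs and match them, provided the key is kept away from the dataset.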

## 🆕 NEW: Performance Comparison Final Challenge

**Task:** Create a comprehensive performance comparison of different data pipeline strategies.
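
A small timing helper is often enough to get the benchmark started. The sketch below is one possible shape, assuming a synthetic in-memory `TensorDataset` stands in for real data; `time_batches` is a hypothetical helper name. It only varies `batch_size` with `num_workers=0` so it runs anywhere, but the same function can be pointed at loaders built with other settings.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_batches(loader, max_batches=100):
    """Return (batches_seen, seconds_elapsed) for iterating over `loader`."""
    start = time.perf_counter()  # started before iteration, so iterator setup is included
    seen = 0
    for seen, _batch in enumerate(loader, start=1):
        if seen >= max_batches:
            break
    return seen, time.perf_counter() - start


# synthetic stand-in data: 512 small RGB "images" with 3 classes
dataset = TensorDataset(torch.rand(512, 3, 32, 32), torch.randint(0, 3, (512,)))

results = {}
for batch_size in (16, 64):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=0)
    n_batches, seconds = time_batches(loader, max_batches=100)
    results[batch_size] = (n_batches, seconds)
    print(f"batch_size={batch_size:3d}  batches={n_batches:3d}  time={seconds:.4f}s")
```

Single timings are noisy, so repeating each measurement a few times and reporting the median is a reasonable refinement.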

In [None]:
def comprehensive_performance_benchmark():
    """Comprehensive benchmark of different data pipeline strategies"""
    
    print("Comprehensive Performance Benchmark")
    print("="*60)
    
    scenarios = [
        ("Small dataset (1GB)", 1000, "all_in_memory"),
        ("Medium dataset (10GB)", 10000, "lazy_loading"),
        ("Large dataset (100GB)", 100000, "streaming"),
        ("Mixed modalities", 5000, "multi_modal"),
        ("Corrupt data (10% corrupt)", 1000, "robust_loading"),
    ]
    
    strategies = {
        "naive": "Basic implementation (num_workers=0, no caching)",
        "optimized": "Optimized (num_workers=4, pin_memory=True, prefetch)",
        "memory_mapped": "Memory mapping for large files",
        "streaming": "IterableDataset for streaming",
        "cached": "CachedDataset with LRU cache",
    }
    
    # TODO: Implement comprehensive benchmark
    # For each scenario and strategy:
    # 1. Measure initialization time
    # 2. Measure memory usage
    # 3. Measure time to load 100 batches
    # 4. Measure GPU utilization
    # 5. Create comparison table
    
    print("\nExpected Results Summary:")
    print("Scenario         | Best Strategy   | Why")
    print("---------------- | --------------- | ---")
    print("Small dataset    | naive/optimized | Overhead not worth it")
    print("Medium dataset   | optimized       | Balanced perf/memory")
    print("Large dataset    | streaming       | Memory constraints")
    print("Mixed modalities | cached          | Repeated access")
    print("Corrupt data     | robust_loading  | Error handling needed")

## 📊 Assessment Questions

Answer these questions in markdown cells:

### Q1: When should you use `num_workers=0` in DataLoader? What are the trade-offs?

### Q2: What's the difference between `pin_memory=True` and `pin_memory=False`? When would you use each?

### Q3: How does `prefetch_factor` affect performance and memory usage?

### Q4: What are the main differences between `Dataset` and `IterableDataset`? Give examples of when to use each.

### Q5: How would you handle a dataset where some samples have corrupted files?

### Q6: What special considerations are needed for Arabic text processing vs English?

### Q7: How would you design a data pipeline that works in areas with intermittent internet connectivity?

### 🆕 Q8: Compare 3 different strategies for handling large images (naive loading, tiling, memory mapping). When would you use each?

### 🆕 Q9: How would you debug a data pipeline that's slower than expected? List the steps you would take.

### 🆕 Q10: Design a data validation pipeline for Sudanese agricultural images. What checks would you implement?

## ✅ Progress Tracker

Check off exercises as you complete them:

- [ ] **Debugging Exercise 0**: Find and fix bugs in buggy dataset
- [ ] **Exercise 1A**: Fix Memory-Inefficient Dataset
- [ ] **Exercise 1A-2**: Handle corrupt/missing data
- [ ] **Exercise 1B**: Memory Usage Comparison
- [ ] **Exercise 1C**: Performance Comparison Challenge
- [ ] **Exercise 2A**: Diagnose & Fix Slow Data Loading
- [ ] **Exercise 2A-2**: Handle corrupt images gracefully
- [ ] **Exercise 2B**: Profile Data Loading Performance
- [ ] **Exercise 3A**: Sudanese Agriculture Augmentations
- [ ] **Exercise 3A-2**: Performance comparison of augmentations
- [ ] **Exercise 3B**: Large Satellite Images Dataset
- [ ] **Exercise 3B-2**: Performance optimization for satellite images
- [ ] **Exercise 4A**: Sudanese Arabic Dialect Dataset
- [ ] **Exercise 4A-2**: Handle noisy/corrupt Arabic text
- [ ] **Exercise 4B**: Streaming Text Dataset
- [ ] **Exercise 4B-2**: Performance comparison for text datasets
- [ ] **Challenge 0**: Debugging Real-World Sudanese Data Pipeline
- [ ] **Challenge 1**: Multi-Modal Dataset (Images + Text)
- [ ] **Challenge 2**: Sudanese Healthcare Pipeline
- [ ] **Final Challenge**: Comprehensive Performance Benchmark
- [ ] **Assessment Questions Q1-Q10**

## 🏆 Completion Certificate

Once you complete all exercises, you've mastered:
- ✅ PyTorch Dataset design patterns
- ✅ DataLoader optimization techniques
- ✅ Computer vision pipelines with augmentation
- ✅ NLP pipelines for Arabic text
- ✅ Multi-modal data handling
- ✅ Production considerations for Sudanese context
- ✅ 🆕 Debugging and performance optimization skills
- ✅ 🆕 Handling corrupt/missing data
- ✅ 🆕 Performance comparison and analysis

**You're ready for Lecture 3: Advanced Model Architectures & Training!** 🎉

## 💡 Tips for Success

1. **Start Simple**: Begin with basic implementations, then optimize
2. **Profile Early**: Use the profiler to identify bottlenecks
3. **Test with Small Data**: Verify correctness before scaling up
4. **Consider Sudanese Context**: Think about real-world constraints
5. **Document Your Choices**: Explain why you made certain design decisions
6. **🆕 Test Edge Cases**: Always test with corrupt/missing data
7. **🆕 Compare Performance**: Benchmark different approaches
8. **🆕 Debug Systematically**: Learn to identify and fix bugs efficiently

## 🤝 Need Help?

- Review Lecture 2 notebook for concepts
- Use PyTorch documentation for specific APIs
- Test your implementations step by step
- Consider edge cases (missing data, large files, etc.)
- 🆕 Use debugging tools: pdb, print statements, profiling
- 🆕 Create minimal reproducible examples when debugging

### Very Important Note:
# Go to Chapter 10 of Hands-On Machine Learning with Scikit-Learn and PyTorch by Aurélien Géron, solve the exercises at the end of the chapter, and add your solutions to this notebook as well.