# 📚 PyTorch Practice Notebook - Lecture 2: Professional Data Pipelines

**Based on:** SAIR PyTorch Mastery - Lecture 2: Professional Data Pipelines with PyTorch

**Instructions:** Complete the exercises below to test your understanding of PyTorch data pipelines. Try to solve them without looking at the original notebook first!

**Time Estimate:** 3-4 hours

## 🔧 Setup & Imports

Run this cell first to set up your environment.

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, IterableDataset
import torchvision
from torchvision import transforms
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import os
from pathlib import Path
from PIL import Image
import json
from collections import defaultdict
import tempfile
import shutil
import psutil
from tqdm import tqdm
from io import StringIO

# For reproducibility
torch.manual_seed(42)
np.random.seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## 🎯 Exercise 1: Dataset Fundamentals & Memory Management

### Part A: Fix the Memory-Inefficient Dataset

**Task:** This dataset loads ALL data in `__init__`, which is inefficient for large datasets. Rewrite it to use lazy loading.

**Original (problematic) implementation:**

In [None]:
# =========== PROBLEMATIC DATASET - FIX ME! ===========
class MemoryInefficientDataset(Dataset):
    """Dataset that loads ALL data in __init__ - problematic for large datasets"""
    
    def __init__(self, csv_path):
        super().__init__()
        # PROBLEM: Loading ALL data at initialization
        self.data = pd.read_csv(csv_path)
        
        # PROBLEM: Converting ALL to tensors upfront
        self.features = torch.tensor(self.data.iloc[:, :-1].values, dtype=torch.float32)
        self.labels = torch.tensor(self.data.iloc[:, -1].values, dtype=torch.float32)
        
        print(f"Loaded {len(self)} samples")
        print(f"Memory usage: {self.features.element_size() * self.features.nelement() / 1e6:.1f} MB for features")
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # No actual loading needed - data already in memory
        return self.features[idx], self.labels[idx]
# =====================================================

**Your Task:** Rewrite the dataset to:
1. Load only metadata in `__init__`
2. Load data on-demand in `__getitem__`
3. Handle CSV files larger than memory

**Test Data:**

In [None]:
# Create test CSV data
test_csv_data = """feature1,feature2,feature3,feature4,label
1.2,3.4,5.6,7.8,0
2.3,4.5,6.7,8.9,1
3.4,5.6,7.8,9.0,0
4.5,6.7,8.9,10.1,1
5.6,7.8,9.0,11.2,0
"""

# Save to temporary file
temp_csv = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
temp_csv.write(test_csv_data)
temp_csv.close()

print(f"Test CSV created at: {temp_csv.name}")

In [None]:
# =========== YOUR CODE HERE ===========
class MemoryEfficientDataset(Dataset):
    """Your optimized dataset implementation"""
    
    def __init__(self, csv_path):
        super().__init__()
        # TODO: Load only metadata, not data
        
    def __len__(self):
        # TODO: Return dataset length
        pass
    
    def __getitem__(self, idx):
        # TODO: Load data on-demand
        pass
# =======================================
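**Hint (one possible approach):** the three requirements above can be met by scanning the file once to record where each row starts, then seeking to a single row on demand. The sketch below shows only that indexing idea with the standard library. The class name `LazyCsvRows` is hypothetical, it assumes a one-line header with the label in the last column, and it assumes no quoted field contains a newline. Wrapping it in a `Dataset` and converting to tensors is left to you.

```python
import csv
import os
import tempfile


class LazyCsvRows:
    """Map-style access to CSV rows that keeps only byte offsets in memory.

    Hypothetical helper: assumes one header line, numeric columns, label last,
    and no quoted fields that span several lines.
    """

    def __init__(self, csv_path):
        self.csv_path = csv_path
        self.offsets = []  # byte position where each data row starts
        with open(csv_path, "rb") as f:
            f.readline()  # skip the header row
            while True:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():  # ignore blank lines
                    self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek straight to the requested row and parse only that line.
        with open(self.csv_path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8")
        values = [float(v) for v in next(csv.reader([line]))]
        return values[:-1], values[-1]  # (features, label)


# Small self-contained demo on a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("feature1,feature2,label\n1.2,3.4,0\n2.3,4.5,1\n3.4,5.6,0\n")

rows = LazyCsvRows(tmp.name)
n_rows = len(rows)
second = rows[1]
os.unlink(tmp.name)
print(n_rows, second)
```

The trade-off is one integer per row in memory and one file open per sample. Keeping a per-worker file handle open is a common refinement once this version works.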

### Part B: Memory Usage Comparison

**Task:** Compare memory usage between the two implementations.

In [None]:
# =========== YOUR CODE HERE ===========
# 1. Test memory inefficient version
print("Testing Memory Inefficient Dataset:")
# TODO: Instantiate and measure memory

# 2. Test your memory efficient version
print("\nTesting Your Memory Efficient Dataset:")
# TODO: Instantiate and measure memory

# 3. Load a few samples and measure time
print("\nTesting sample loading performance:")
# TODO: Time 1000 sample reads for each (the test CSV has only 5 rows, so cycle through the indices)

# 4. Cleanup
os.unlink(temp_csv.name)
print("Cleaned up temporary file")
# =======================================
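**Hint (one possible approach):** to compare eager and lazy loading you need a repeatable way to measure peak memory. The sketch below uses the standard-library `tracemalloc` module rather than `psutil`, and plain Python lists rather than tensors, so it illustrates the measurement pattern only. `tracemalloc` tracks Python-level allocations and does not see tensor storage allocated by PyTorch itself. For the real datasets, process-level numbers such as `psutil.Process().memory_info().rss` are the better fit. The helper names `peak_bytes`, `eager` and `lazy` are hypothetical.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(fn):
    """Run fn() and return the peak Python heap usage it caused, in bytes."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


# Write a throwaway CSV (no header) with 20,000 numeric rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    for i in range(20000):
        tmp.write(f"{i}.0,{i + 1}.0,{i % 2}\n")
path = tmp.name


def eager():
    # Parse every row and keep all of them alive at once.
    with open(path) as f:
        rows = [[float(v) for v in line.split(",")] for line in f]
    return len(rows)


def lazy():
    # Parse one row at a time; earlier rows can be freed immediately.
    count = 0
    with open(path) as f:
        for line in f:
            _ = [float(v) for v in line.split(",")]
            count += 1
    return count


eager_peak = peak_bytes(eager)
lazy_peak = peak_bytes(lazy)
os.unlink(path)
print(f"eager peak: {eager_peak / 1e6:.2f} MB, lazy peak: {lazy_peak / 1e6:.2f} MB")
```

Exact numbers depend on the Python build, so compare the ratio between the two runs rather than the absolute values.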

## 🚀 Exercise 2: DataLoader Optimization

### Part A: Diagnose and Fix Slow Data Loading

**Task:** This training script has slow data loading. Identify the bottlenecks and fix them.

In [None]:
# =========== SLOW TRAINING SCRIPT - FIX ME! ===========
class SlowDataset(Dataset):
    def __init__(self, num_samples=1000):
        self.num_samples = num_samples
        self.images = []
        self.labels = []
        
        # Simulate slow image generation
        for i in range(num_samples):
            # Simulate image loading/processing
            time.sleep(0.001)  # 1ms delay per image
            self.images.append(torch.randn(3, 224, 224))
            self.labels.append(i % 10)
    
    def __len__(self):
        return self.num_samples
    
    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

# Slow training setup
dataset = SlowDataset(num_samples=100)

# PROBLEMATIC DataLoader configuration
dataloader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,  # PROBLEM: No parallel loading
    pin_memory=False,  # PROBLEM: Not using pinned memory for GPU
)

# Simple model
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(16, 10)
).to(device)

# Training loop
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

print("Starting slow training...")
start_time = time.time()

for epoch in range(2):
    for batch_idx, (images, labels) in enumerate(dataloader):
        images, labels = images.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}")

print(f"Total training time: {time.time() - start_time:.2f} seconds")
# ======================================================

**Your Task:**
1. Identify all bottlenecks in the code above
2. Rewrite the dataset to be more efficient
3. Optimize the DataLoader configuration
4. Show performance improvement

In [None]:
# =========== YOUR OPTIMIZED SOLUTION ===========
# 1. Create an optimized dataset
class OptimizedDataset(Dataset):
    def __init__(self, num_samples=1000):
        # TODO: Optimize initialization
        pass
    
    def __len__(self):
        # TODO
        pass
    
    def __getitem__(self, idx):
        # TODO: Optimize data loading
        pass

# 2. Create optimized DataLoader
# TODO: Choose optimal parameters
# dataloader_optimized = DataLoader(...)

# 3. Benchmark performance
print("\nBenchmarking Optimized Version:")
# TODO: Run training with optimized setup and measure time

# 4. Compare performance
# TODO: Show speedup factor
# ===============================================
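**Hint (one possible approach):** some `DataLoader` options only make sense together. `persistent_workers` and `prefetch_factor` are valid only when `num_workers > 0`, and `pin_memory` helps only when batches are copied to a CUDA device. The sketch below is a hypothetical helper, `loader_kwargs`, that encodes those constraints as a plain dictionary. The default worker count is a starting heuristic to benchmark, not a recommendation from the lecture.

```python
import os


def loader_kwargs(use_cuda, num_workers=None, prefetch_factor=2):
    """Build keyword arguments for torch.utils.data.DataLoader (hypothetical helper)."""
    if num_workers is None:
        # Heuristic starting point: leave one core for the main process.
        num_workers = max((os.cpu_count() or 1) - 1, 0)
    kwargs = {
        "num_workers": num_workers,
        # Pinned host memory only helps when batches are copied to a CUDA device.
        "pin_memory": bool(use_cuda),
    }
    if num_workers > 0:
        # Both options are rejected by DataLoader when there are no worker processes.
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = prefetch_factor  # batches loaded ahead per worker
    return kwargs


cpu_debug = loader_kwargs(use_cuda=False, num_workers=0)
gpu_train = loader_kwargs(use_cuda=True, num_workers=4)
print(cpu_debug)
print(gpu_train)

# Intended use inside the notebook (not executed here):
#   loader = DataLoader(dataset, batch_size=16, shuffle=True,
#                       **loader_kwargs(torch.cuda.is_available()))
#   images = images.to(device, non_blocking=True)  # pairs with pin_memory=True
```

Workers only speed things up if the expensive work happens in `__getitem__`. `SlowDataset` does all of its slow work serially in `__init__`, so changing the loader settings alone will not remove that cost.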

### Part B: Profile Data Loading Performance

**Task:** Create a profiling tool that measures:
1. Batch loading times
2. GPU idle time
3. Memory usage during training

In [None]:
# =========== YOUR CODE HERE ===========
class DataLoaderProfiler:
    """Your implementation of a data loading profiler"""
    
    def __init__(self):
        self.metrics = {
            'batch_times': [],
            'gpu_idle_times': [],
            'memory_usage': [],
            'cpu_usage': []
        }
    
    def profile_training(self, model, dataloader, num_batches=20):
        """Profile training loop"""
        # TODO: Implement profiling
        pass
    
    def print_report(self):
        """Print profiling results"""
        # TODO: Print detailed report
        pass
    
    def plot_metrics(self):
        """Visualize metrics"""
        # TODO: Create plots
        pass

# Test your profiler
profiler = DataLoaderProfiler()
# TODO: Profile both slow and optimized versions
# ===============================================
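**Hint (one possible approach):** the core of a data-loading profiler is to split each iteration into the time spent waiting for the next batch and the time spent in the training step. Time spent waiting for data is a reasonable first approximation of GPU idle time. The sketch below shows that split with `time.perf_counter` on a simulated slow loader. The names `profile_loop` and `slow_batches` are hypothetical. On a GPU, the step function should call `torch.cuda.synchronize()` before returning, because CUDA kernels are launched asynchronously and would otherwise look faster than they are.

```python
import time


def profile_loop(batches, step_fn, num_batches=20):
    """Split each iteration into data-wait time and step time (in seconds)."""
    wait_times, step_times = [], []
    it = iter(batches)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is time the model sits idle
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward/backward/optimizer step would go here
        t2 = time.perf_counter()
        wait_times.append(t1 - t0)
        step_times.append(t2 - t1)
    return wait_times, step_times


def slow_batches(n, delay):
    """Simulated loader that takes `delay` seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


waits, steps = profile_loop(slow_batches(5, 0.005), lambda batch: None, num_batches=5)
wait_share = sum(waits) / (sum(waits) + sum(steps))
print(f"batches: {len(waits)}, data-wait share: {wait_share:.0%}")
```

A high data-wait share suggests the input pipeline is the bottleneck. A low one suggests the model step dominates and loader tuning will not help much.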

## 🖼️ Exercise 3: Computer Vision Pipeline

### Part A: Create Augmentation Pipeline for Sudanese Agriculture

**Task:** Design data augmentations specifically for Sudanese agricultural images.

**Considerations:**
- Plants might be at different angles
- Varying lighting conditions (bright sun vs shade)
- Dust/sand particles in air
- Different camera angles (from drone vs ground)

In [None]:
# =========== YOUR CODE HERE ===========
# 1. Create training augmentations
sudanese_agriculture_train_transform = transforms.Compose([
    # TODO: Design appropriate augmentations
    # Consider: rotation, color jitter, random crop, etc.
])

# 2. Create validation transforms (deterministic preprocessing only, no random augmentation)
sudanese_agriculture_val_transform = transforms.Compose([
    # TODO: Simple preprocessing for validation
])

# 3. Create test function
def test_augmentations(transform, num_samples=4):
    """Test and visualize augmentations"""
    # TODO: Create dummy image and apply transformations
    # Visualize original + augmented versions
    pass

# Test your augmentations
print("Testing Sudanese Agriculture Augmentations:")
test_augmentations(sudanese_agriculture_train_transform)
# =======================================

### Part B: Handle Large Satellite Images

**Task:** Create a dataset that can handle very large satellite images (e.g., 10,000×10,000 pixels) without loading them entirely into memory.

In [None]:
# =========== YOUR CODE HERE ===========
class SatelliteImageDataset(Dataset):
    """Dataset for large satellite images using tiling"""
    
    def __init__(self, image_paths, labels, tile_size=512, overlap=64):
        """
        Args:
            image_paths: List of paths to large satellite images
            labels: List of labels (e.g., crop type, drought level)
            tile_size: Size of tiles to extract
            overlap: Overlap between tiles to avoid edge artifacts
        """
        super().__init__()
        self.image_paths = image_paths
        self.labels = labels
        self.tile_size = tile_size
        self.overlap = overlap
        
        # TODO: Pre-calculate tile information
        # Store tile metadata (image_idx, x, y, label)
        self.tiles = []
        
    def __len__(self):
        # TODO: Return number of tiles
        pass
    
    def __getitem__(self, idx):
        # TODO: Load only the needed tile from large image
        pass
    
    def visualize_tile(self, idx, show_grid=True):
        """Visualize a tile within the context of the full image"""
        # TODO: Implement visualization
        pass

# Test with simulated large images
print("Creating test satellite images...")
temp_dir = Path(tempfile.mkdtemp())

# TODO: Create test images and test your dataset

# Cleanup
shutil.rmtree(temp_dir)
# =======================================
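**Hint (one possible approach):** the tile metadata can be computed from image sizes alone, without reading any pixels. The sketch below uses a stride of `tile_size - overlap` and shifts the final tile inward so it ends exactly at the image edge. The function names `axis_starts` and `tile_index` are hypothetical. How each tile is actually read (for example a windowed read from a tiled file format) is left open here.

```python
def axis_starts(length, tile_size, overlap):
    """Start offsets along one axis so that tiles cover [0, length)."""
    if overlap >= tile_size:
        raise ValueError("overlap must be smaller than tile_size")
    if length <= tile_size:
        return [0]  # image smaller than one tile: a single tile, pad when reading
    stride = tile_size - overlap
    last = length - tile_size  # largest valid start offset
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)  # shift the final tile inward so it ends at the edge
    return starts


def tile_index(image_sizes, tile_size=512, overlap=64):
    """Return (image_idx, x, y) for every tile of every image.

    image_sizes is a list of (width, height) pairs, one per image.
    """
    tiles = []
    for image_idx, (width, height) in enumerate(image_sizes):
        for y in axis_starts(height, tile_size, overlap):
            for x in axis_starts(width, tile_size, overlap):
                tiles.append((image_idx, x, y))
    return tiles


tiles = tile_index([(10000, 10000)])
print(len(tiles), tiles[0], tiles[-1])
# prints: 529 (0, 0, 0) (0, 9488, 9488)
```

With a stride of 448 there are 22 regular starts per axis plus one edge-aligned start, giving 23 × 23 = 529 tiles. `__len__` can then return `len(self.tiles)`, and `__getitem__` only needs to look up one `(image_idx, x, y)` entry.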

## 📚 Exercise 4: NLP Pipeline for Arabic Text

### Part A: Handle Sudanese Arabic Dialect

**Task:** Create a text dataset that handles Sudanese Arabic dialect features:
1. Right-to-left text
2. Dialect-specific words
3. Mixed Modern Standard Arabic (MSA) and Sudanese dialect input

In [None]:
# =========== YOUR CODE HERE ===========
class SudaneseArabicDataset(Dataset):
    """Dataset for Sudanese Arabic text classification"""
    
    def __init__(self, texts, labels, vocab=None, max_length=128, 
                 handle_dialect=True, normalize=True):
        super().__init__()
        self.texts = texts
        self.labels = labels
        self.max_length = max_length
        self.handle_dialect = handle_dialect
        self.normalize = normalize
        
        # TODO: Build vocabulary considering dialect
        if vocab is None:
            self.vocab = self._build_vocab(texts)
        else:
            self.vocab = vocab
        
        # TODO: Add special tokens
        
    def _build_vocab(self, texts):
        """Build vocabulary with dialect handling"""
        # TODO: Implement vocabulary building
        # Consider: dialect normalization, MSA mapping, etc.
        pass
    
    def _preprocess_text(self, text):
        """Preprocess Arabic text"""
        # TODO: Implement preprocessing steps:
        # 1. Normalize Arabic characters
        # 2. Remove diacritics (optional)
        # 3. Handle dialect words (map to MSA or keep)
        # 4. Other cleaning steps
        pass
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        # TODO: Implement text encoding
        pass
    
    def decode(self, token_ids):
        """Convert token IDs back to text"""
        # TODO: Implement decoding
        pass

# Test data
sudanese_texts = [
    "كِسرة بتاعة فول مع طماطم",  # Sudanese dialect
    "الطقس اليوم حار جداً",  # Modern Standard Arabic
    "شايف القوم ده عاملين إزاي",  # Sudanese dialect
    "الزراعة في السودان متقدمة",  # MSA
    "عايز أشوف كماشة",  # Sudanese dialect
]

labels = [0, 1, 0, 1, 0]  # 0 = dialect, 1 = MSA

# Test your dataset
print("Testing Sudanese Arabic Dataset:")
dataset = SudaneseArabicDataset(sudanese_texts, labels)
# TODO: Test encoding/decoding
# =======================================
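**Hint (one possible approach):** a common first pass at Arabic preprocessing removes diacritics and tatweel and unifies the hamza-carrying alef variants before building a vocabulary. The sketch below shows that pass plus whitespace tokenization, special tokens and fixed-length encoding, using only the standard library. The function names are hypothetical, the characters are written as `\u` escapes to avoid editor bidirectional-display issues, and dialect-to-MSA mapping is deliberately left out. Right-to-left order is a display concern: Python strings are stored in logical order, so no reversal is needed.

```python
import re
from collections import Counter

# U+064B..U+0652: tanween, short vowels, shadda, sukun; U+0670: dagger alef
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character used for justification
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
PAD, UNK = "<pad>", "<unk>"


def normalize_arabic(text):
    """Strip diacritics and tatweel, unify alef variants, collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)  # map to bare alef
    return " ".join(text.split())


def build_vocab(texts, min_freq=1):
    """Whitespace-token vocabulary with <pad>=0 and <unk>=1."""
    counts = Counter(tok for t in texts for tok in normalize_arabic(t).split())
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in sorted(counts.items()):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, max_length):
    """Map tokens to ids, truncate, then right-pad to max_length."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in normalize_arabic(text).split()]
    ids = ids[:max_length]
    return ids + [vocab[PAD]] * (max_length - len(ids))


kisra = "\u0643\u0650\u0633\u0631\u0629"   # word with a kasra diacritic
ashoof = "\u0623\u0634\u0648\u0641"        # word starting with alef-hamza
vocab = build_vocab([kisra, ashoof])
encoded = encode(kisra + " xyz", vocab, max_length=4)
print(len(vocab), encoded)
```

Whether to also fold ta marbuta or alef maqsura is a design choice: it shrinks the vocabulary but merges words that differ in meaning, so it is not applied here.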

### Part B: Streaming Dataset for Large Text Corpora

**Task:** Create a streaming dataset that can handle text files larger than memory.

In [None]:
# =========== YOUR CODE HERE ===========
class StreamingArabicNews(IterableDataset):
    """Streaming dataset for Arabic news articles"""
    
    def __init__(self, file_path, vocab=None, max_length=128, 
                 buffer_size=1000, shuffle=True):
        """
        Args:
            file_path: Path to large text file (one article per line)
            vocab: Pre-built vocabulary
            max_length: Maximum sequence length
            buffer_size: Number of lines to buffer
            shuffle: Whether to shuffle the stream
        """
        super().__init__()
        self.file_path = file_path
        self.vocab = vocab or self._build_vocab_from_file()
        self.max_length = max_length
        self.buffer_size = buffer_size
        self.shuffle = shuffle
        
        # TODO: Initialize
    
    def _build_vocab_from_file(self):
        """Build vocabulary by streaming through file once"""
        # TODO: Implement vocabulary building from stream
        pass
    
    def __iter__(self):
        """Stream data from file"""
        # TODO: Implement streaming logic
        # Consider: worker splitting, buffering, shuffling
        pass

# Create test large text file
print("Creating test text file...")
temp_text_file = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8')

# Generate Arabic text
arabic_samples = [
    "تقرير عن الزراعة في السودان",
    "أخبار الرياضة المحلية",
    "تطورات السوق المالية",
    "الطقس وأحوال الزراعة",
    "التعليم في المناطق الريفية",
]

# Write many lines to simulate large file
for i in range(100):
    for sample in arabic_samples:
        temp_text_file.write(f"{sample} - النسخة {i}\n")
temp_text_file.close()

print(f"Test file created: {temp_text_file.name} ({os.path.getsize(temp_text_file.name)} bytes)")

# Test your streaming dataset
print("\nTesting Streaming Dataset:")
# TODO: Test the streaming dataset

# Cleanup
os.unlink(temp_text_file.name)
# =======================================
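**Hint (one possible approach):** two pieces of streaming logic can be tested without PyTorch: splitting lines across workers, and approximate shuffling with a bounded buffer. The sketch below shows both as plain generators with hypothetical names. Inside `IterableDataset.__iter__`, the worker id and count would come from `torch.utils.data.get_worker_info()`, which returns `None` in the main process. Here they are passed in explicitly.

```python
import random


def shard_lines(lines, worker_id, num_workers):
    """Yield only the lines that belong to this worker (round-robin split)."""
    for i, line in enumerate(lines):
        if i % num_workers == worker_id:
            yield line


def shuffle_buffer(items, buffer_size, rng):
    """Approximate shuffle that holds at most buffer_size items in memory."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    buf = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)  # still filling the buffer
            continue
        j = rng.randrange(buffer_size)
        yield buf[j]   # emit a random buffered item...
        buf[j] = item  # ...and put the new item in its slot
    rng.shuffle(buf)   # drain whatever is left at end of stream
    yield from buf


lines = [f"article {i}" for i in range(10)]
shard0 = list(shard_lines(lines, worker_id=0, num_workers=2))
shard1 = list(shard_lines(lines, worker_id=1, num_workers=2))
mixed = list(shuffle_buffer(lines, buffer_size=4, rng=random.Random(0)))
unshuffled = list(shuffle_buffer(lines, buffer_size=1, rng=random.Random(0)))
print(shard0)
print(mixed)
```

Without sharding, every worker would read the whole file and each line would appear `num_workers` times per epoch. A larger buffer gives a better shuffle at the cost of more memory.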

## 🧪 Challenge Problems

### Challenge 1: Multi-Modal Dataset (Images + Text)

**Task:** Create a dataset that handles both images and text for a Sudanese market monitoring system.

**Scenario:** You're building a system that:
- Takes photos of market goods (sorghum, millet, wheat)
- Has Arabic text descriptions from sellers
- Includes price information
- Needs to predict whether prices are reasonable

**Requirements:**
1. Handle image loading and augmentation
2. Process Arabic text descriptions
3. Combine multiple data types in single sample
4. Handle missing data (some samples might have only image or only text)

In [None]:
# =========== CHALLENGE 1 ===========
class SudaneseMarketMultiModalDataset(Dataset):
    """Multi-modal dataset for Sudanese market monitoring"""
    
    def __init__(self, 
                 image_dir,  # Directory with images
                 metadata_file,  # CSV with text, prices, etc.
                 image_transform=None,
                 text_max_length=50,
                 handle_missing='zero'):  # How to handle missing data
        """
        Args:
            image_dir: Directory containing product images
            metadata_file: CSV with columns: image_name, description, price, category, is_reasonable
            image_transform: Transformations for images
            text_max_length: Maximum text sequence length
            handle_missing: Strategy for missing data ('zero', 'mean', 'ignore')
        """
        super().__init__()
        self.image_dir = Path(image_dir)
        
        # TODO: Load metadata
        self.metadata = pd.read_csv(metadata_file)
        
        # TODO: Initialize image transformations
        self.image_transform = image_transform or self._default_image_transform()
        
        # TODO: Initialize text processing
        self.text_max_length = text_max_length
        self.text_vocab = self._build_text_vocab()
        
        # TODO: Handle missing data strategy
        self.handle_missing = handle_missing
        
        # TODO: Preprocess data
        
    def _default_image_transform(self):
        """Default image transformations for market products"""
        # TODO: Design appropriate transformations
        pass
    
    def _build_text_vocab(self):
        """Build vocabulary from Arabic descriptions"""
        # TODO: Implement
        pass
    
    def __len__(self):
        return len(self.metadata)
    
    def __getitem__(self, idx):
        """Return multi-modal sample"""
        row = self.metadata.iloc[idx]
        
        # TODO: Load and process image
        image_tensor = None  # Load image if exists
        
        # TODO: Process Arabic text
        text_tensor = None  # Encode text if exists
        
        # TODO: Handle price/numerical features
        price_tensor = None
        
        # TODO: Handle missing data
        
        # TODO: Combine into single sample
        sample = {
            'image': image_tensor,
            'text': text_tensor,
            'price': price_tensor,
            'label': row['is_reasonable'],  # 0 or 1
        }
        
        return sample
    
    def collate_fn(self, batch):
        """Custom collate function for multi-modal data"""
        # TODO: Implement collate function that handles:
        # - Variable length sequences
        # - Missing modalities
        # - Different data types
        pass

# Create test data
print("Creating test multi-modal data...")
temp_mm_dir = Path(tempfile.mkdtemp())

# TODO: Create test images and metadata
# 1. Create image directory with dummy images
# 2. Create CSV metadata file

# Test your dataset
print("\nTesting Multi-Modal Dataset:")
# dataset = SudaneseMarketMultiModalDataset(...)
# dataloader = DataLoader(dataset, batch_size=4, collate_fn=dataset.collate_fn)
# TODO: Test batch loading

# Cleanup
shutil.rmtree(temp_mm_dir)
# ===================================
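**Hint (one possible approach):** the awkward parts of a multi-modal collate function are padding variable-length text and recording which modalities are present. The sketch below shows only that bookkeeping, on plain lists and with a hypothetical function name, `collate_market`. It assumes each sample is a dict shaped like the one in `__getitem__`, with `None` for a missing image or text. Converting the lists to tensors and stacking images (with a zero image substituted when one is missing) is left to you.

```python
def collate_market(batch, pad_id=0):
    """Pad token ids to the longest text in the batch and build presence masks.

    batch: list of dicts with keys 'image', 'text', 'price', 'label';
    'image' and 'text' may be None when that modality is missing.
    """
    max_len = max((len(s["text"]) for s in batch if s["text"] is not None), default=0)
    out = {"text": [], "text_mask": [], "has_text": [], "has_image": [],
           "price": [], "label": []}
    for s in batch:
        ids = list(s["text"]) if s["text"] is not None else []
        pad = max_len - len(ids)
        out["text"].append(ids + [pad_id] * pad)
        out["text_mask"].append([1] * len(ids) + [0] * pad)  # 1 = real token
        out["has_text"].append(s["text"] is not None)
        out["has_image"].append(s["image"] is not None)
        out["price"].append(s["price"])
        out["label"].append(s["label"])
    return out


batch = [
    {"image": "img0", "text": [5, 6, 7], "price": 120.0, "label": 1},
    {"image": None, "text": [8], "price": 95.5, "label": 0},
    {"image": "img2", "text": None, "price": 110.0, "label": 1},
]
collated = collate_market(batch)
print(collated["text"])
print(collated["has_image"], collated["has_text"])
```

Passing the masks to the model lets it ignore padded positions and absent modalities instead of treating zeros as real input.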

### Challenge 2: Data Pipeline for Sudanese Healthcare

**Task:** Design a complete data pipeline for medical imaging in Sudanese hospitals.

**Special Considerations:**
1. Handle DICOM files (medical images)
2. Include patient metadata
3. Respect patient privacy (anonymization)
4. Work with limited internet connectivity (offline capable)
5. Handle power outages (checkpointing)

**Bonus:** Implement data validation to catch corrupted files.

In [None]:
# =========== CHALLENGE 2 ===========
class SudaneseHealthcareDataset(Dataset):
    """Dataset for Sudanese healthcare applications"""
    
    def __init__(self, data_root, transform=None, anonymize=True,
                 validate_data=True, cache_size=100):
        """
        Args:
            data_root: Root directory with structure:
                - images/ (DICOM or PNG files)
                - metadata.csv (patient info, diagnoses)
                - annotations/ (optional: segmentation masks)
            transform: Image transformations
            anonymize: Whether to anonymize patient data
            validate_data: Validate file integrity
            cache_size: Number of samples to cache in memory
        """
        super().__init__()
        self.data_root = Path(data_root)
        
        # TODO: Implement with healthcare-specific considerations
        
    def _load_and_validate_dicom(self, filepath):
        """Load and validate DICOM file"""
        # TODO: Implement DICOM loading with validation
        pass
    
    def _anonymize_metadata(self, metadata):
        """Remove personally identifiable information"""
        # TODO: Implement anonymization
        pass
    
    def _checkpoint_state(self):
        """Save dataset state for recovery from power outages"""
        # TODO: Implement checkpointing
        pass
    
    def _restore_from_checkpoint(self):
        """Restore dataset state"""
        # TODO: Implement restoration
        pass
    
    def __len__(self):
        pass
    
    def __getitem__(self, idx):
        pass

# Create a DataValidator class
class HealthcareDataValidator:
    """Validate healthcare data integrity"""
    
    @staticmethod
    def validate_dicom(filepath):
        """Validate DICOM file integrity"""
        # TODO: Check if DICOM is valid and not corrupted
        pass
    
    @staticmethod
    def validate_metadata(metadata):
        """Validate patient metadata"""
        # TODO: Check required fields, data types, ranges
        pass
    
    @staticmethod
    def check_anonymization(metadata):
        """Check if data is properly anonymized"""
        # TODO: Verify no PII remains
        pass

# Design document
print("""
Design Considerations for Sudanese Healthcare Pipeline:

1. OFFLINE OPERATION:
   - Local caching of all data
   - Pre-processed datasets stored locally
   - Batch processing for when connectivity is available

2. POWER RESILIENCE:
   - Regular checkpointing of dataset state
   - Incremental processing with recovery
   - Battery backup considerations

3. PRIVACY:
   - Automatic anonymization of patient data
   - Encryption of sensitive data
   - Access controls

4. VALIDATION:
   - File integrity checks
   - Data completeness validation
   - Cross-field consistency checks
""")

# TODO: Implement test cases for your design
# ===================================
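**Hint (one possible approach):** three of the considerations above have small building blocks that need only the standard library. A DICOM Part 10 file starts with a 128-byte preamble followed by the bytes `DICM`, which gives a cheap corruption check before a real parser (such as pydicom) is involved. Patient identifiers can be replaced with a salted hash. A checkpoint can be written atomically so that a power cut never leaves a half-written file. All function names and the `PII_FIELDS` column names below are hypothetical. A salted hash is pseudonymization rather than full de-identification, so treat it as a starting point.

```python
import hashlib
import json
import os
import shutil
import tempfile

PII_FIELDS = {"patient_name", "national_id", "phone", "address"}  # hypothetical columns


def looks_like_dicom(path):
    """Cheap check: 128-byte preamble followed by the magic bytes b'DICM'."""
    with open(path, "rb") as f:
        header = f.read(132)
    return len(header) == 132 and header[128:] == b"DICM"


def anonymize(record, salt):
    """Drop direct identifiers and replace patient_id with a salted hash."""
    clean = {k: v for k, v in record.items() if k not in PII_FIELDS}
    digest = hashlib.sha256((salt + str(record["patient_id"])).encode("utf-8")).hexdigest()
    clean["patient_id"] = digest[:16]
    return clean


def save_checkpoint(state, path):
    """Write JSON to a temp file in the same directory, then rename atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # readers see the old file or the new one, never a mix


workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "scan.dcm")
bad = os.path.join(workdir, "broken.dcm")
with open(good, "wb") as f:
    f.write(b"\x00" * 128 + b"DICM" + b"payload")
with open(bad, "wb") as f:
    f.write(b"not a dicom file")
good_ok, bad_ok = looks_like_dicom(good), looks_like_dicom(bad)

record = {"patient_id": 17, "patient_name": "example", "age": 42, "diagnosis": "normal"}
clean = anonymize(record, salt="local-secret")

ckpt = os.path.join(workdir, "state.json")
save_checkpoint({"next_index": 128, "epoch": 3}, ckpt)
with open(ckpt) as f:
    restored = json.load(f)
shutil.rmtree(workdir)
print(good_ok, bad_ok, sorted(clean), restored)
```

The magic-byte check only rejects files that are clearly not DICOM. Truncated pixel data still needs a full parse to detect.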

## 📊 Assessment Questions

Answer these questions in markdown cells:

### Q1: When should you use `num_workers=0` in DataLoader? What are the trade-offs?

### Q2: What's the difference between `pin_memory=True` and `pin_memory=False`? When would you use each?

### Q3: How does `prefetch_factor` affect performance and memory usage?

### Q4: What are the main differences between `Dataset` and `IterableDataset`? Give examples of when to use each.

### Q5: How would you handle a dataset where some samples have corrupted files?

### Q6: What special considerations are needed for Arabic text processing vs English?

### Q7: How would you design a data pipeline that works in areas with intermittent internet connectivity?

## ✅ Progress Tracker

Check off exercises as you complete them:

- [ ] Exercise 1A: Fix Memory-Inefficient Dataset
- [ ] Exercise 1B: Memory Usage Comparison
- [ ] Exercise 2A: Diagnose & Fix Slow Data Loading
- [ ] Exercise 2B: Profile Data Loading Performance
- [ ] Exercise 3A: Sudanese Agriculture Augmentations
- [ ] Exercise 3B: Large Satellite Images Dataset
- [ ] Exercise 4A: Sudanese Arabic Dialect Dataset
- [ ] Exercise 4B: Streaming Text Dataset
- [ ] Challenge 1: Multi-Modal Dataset (Images + Text)
- [ ] Challenge 2: Sudanese Healthcare Pipeline
- [ ] Assessment Questions Q1-Q7

## 🏆 Completion Certificate

Once you complete all exercises, you've mastered:
- ✅ PyTorch Dataset design patterns
- ✅ DataLoader optimization techniques
- ✅ Computer vision pipelines with augmentation
- ✅ NLP pipelines for Arabic text
- ✅ Multi-modal data handling
- ✅ Production considerations for Sudanese context

**You're ready for Lecture 3: Advanced Model Architectures & Training!** 🎉

## 💡 Tips for Success

1. **Start Simple**: Begin with basic implementations, then optimize
2. **Profile Early**: Use the profiler to identify bottlenecks
3. **Test with Small Data**: Verify correctness before scaling up
4. **Consider Sudanese Context**: Think about real-world constraints
5. **Document Your Choices**: Explain why you made certain design decisions

## 🤝 Need Help?

- Review Lecture 2 notebook for concepts
- Use PyTorch documentation for specific APIs
- Test your implementations step by step
- Consider edge cases (missing data, large files, etc.)