# PyTorch Tutorial: Working with Real Data

In the real world, data doesn't come in nice pre-packaged tensors. It comes in CSV files, image folders, text documents, and databases. This notebook teaches you how to handle real-world data in PyTorch.

## Learning Objectives
- Understand `Dataset` and `DataLoader` classes
- Create custom datasets for CSV and image data
- Use DataLoaders for batching and shuffling
- Apply data transformations and augmentation

---

## The PyTorch Data Pipeline

The standard pipeline for handling data in PyTorch involves two main classes:

1. **`torch.utils.data.Dataset`**: Stores the samples and their corresponding labels.
2. **`torch.utils.data.DataLoader`**: Wraps an iterable around the Dataset to enable easy access to the samples.
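
To make this division of labor concrete, here is a minimal sketch that uses the built-in `TensorDataset` instead of a custom class. The toy tensors `xs` and `ys` are made up for illustration and are not used elsewhere in this notebook.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data (illustrative only): 6 samples with 2 features each, plus labels
xs = torch.arange(12, dtype=torch.float32).reshape(6, 2)
ys = torch.tensor([0, 1, 0, 1, 0, 1])

# Dataset: indexable storage, one (sample, label) pair per index
toy_dataset = TensorDataset(xs, ys)
sample, target = toy_dataset[0]

# DataLoader: iterates over the Dataset and stacks samples into batches
toy_loader = DataLoader(toy_dataset, batch_size=3, shuffle=False)
batch_shapes = [(xb.shape, yb.shape) for xb, yb in toy_loader]

print(len(toy_dataset), sample, target)
print(batch_shapes)
```

The Dataset answers "what is sample `i`?", while the DataLoader decides the order in which samples are visited and how they are grouped into batches.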


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
from PIL import Image
import os
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)

## 1. Custom Dataset from CSV

Let's simulate a common scenario: you have a CSV file with features and labels.

First, we'll create a dummy CSV file.

In [None]:
# Create a dummy CSV file
data = {
    'feature1': np.random.rand(100),
    'feature2': np.random.rand(100),
    'feature3': np.random.rand(100),
    'label': np.random.randint(0, 2, 100)  # Binary classification
}
df = pd.DataFrame(data)
df.to_csv('dummy_data.csv', index=False)
print("Created dummy_data.csv")
print(df.head())

Now, let's create a custom Dataset class to read this CSV.

A custom (map-style) `Dataset` typically defines three methods:
- `__init__`: Load or locate the data (read files, download, etc.)
- `__len__`: Return the number of samples in the dataset
- `__getitem__`: Return one sample as a `(features, label)` pair at the given index

In [None]:
class CSVDataset(Dataset):
    def __init__(self, csv_file):
        self.data = pd.read_csv(csv_file)
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # Get features (columns 0-2)
        features = self.data.iloc[idx, 0:3].values.astype(np.float32)
        # Get label (column 3)
        label = self.data.iloc[idx, 3]
        
        # Convert to tensors
        features = torch.tensor(features)
        label = torch.tensor(label, dtype=torch.long)
        
        return features, label

# Test the dataset
dataset = CSVDataset('dummy_data.csv')
print(f"Dataset length: {len(dataset)}")
features, label = dataset[0]
print(f"First sample features: {features}")
print(f"First sample label: {label}")
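
The `CSVDataset` above converts one row to tensors on every `__getitem__` call. When the table fits in memory, one possible variant converts everything once in `__init__` and then only indexes tensors. The sketch below assumes a label column named `label`. The class name `InMemoryCSVDataset`, the `label_column` parameter, and the throwaway file `example.csv` are hypothetical names chosen for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class InMemoryCSVDataset(Dataset):
    """Hypothetical variant of CSVDataset: converts the whole table to tensors once."""

    def __init__(self, csv_file, label_column="label"):
        frame = pd.read_csv(csv_file)
        # Every column except the label column is treated as a feature
        feature_frame = frame.drop(columns=[label_column])
        self.features = torch.tensor(feature_frame.to_numpy(dtype=np.float32))
        self.labels = torch.tensor(frame[label_column].to_numpy(), dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Plain tensor indexing, no pandas work per sample
        return self.features[idx], self.labels[idx]


# Build a small throwaway CSV so the example runs on its own
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    pd.DataFrame({
        "feature1": [0.1, 0.2, 0.3],
        "feature2": [1.0, 2.0, 3.0],
        "feature3": [5.0, 6.0, 7.0],
        "label": [0, 1, 0],
    }).to_csv(csv_path, index=False)
    mem_dataset = InMemoryCSVDataset(csv_path)

first_features, first_label = mem_dataset[0]
print(len(mem_dataset), first_features, first_label)
```

The trade-off is memory: this keeps the full table as tensors, whereas reading rows lazily is the better fit for data that does not fit in RAM.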

## 2. Using DataLoader

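The `DataLoader` handles batching, shuffling (`shuffle=True` reshuffles the samples every epoch), and parallel data loading (controlled by `num_workers`, which defaults to `0`, meaning data is loaded in the main process).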

In [None]:
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

# Iterate through the dataloader
print("Iterating through DataLoader:")
for i, (features, labels) in enumerate(dataloader):
    print(f"Batch {i}:")
    print(f"  Features shape: {features.shape}")
    print(f"  Labels: {labels}")
    if i == 2:  # Stop after 3 batches
        break
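
When the dataset size is not a multiple of `batch_size`, the last batch is smaller. The sketch below, which uses a made-up toy `TensorDataset` of 10 samples, shows how the number of batches is determined and what `drop_last=True` changes.

```python
import math

import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy dataset (illustrative only): 10 samples, 3 features each
toy = TensorDataset(torch.randn(10, 3), torch.randint(0, 2, (10,)))

# Default behavior: the final batch holds the leftover samples
loader = DataLoader(toy, batch_size=4, shuffle=True)
sizes = [len(labels) for _, labels in loader]

# drop_last=True discards the incomplete final batch
loader_drop = DataLoader(toy, batch_size=4, shuffle=True, drop_last=True)
sizes_drop = [len(labels) for _, labels in loader_drop]

print(len(loader), sizes)            # number of batches is ceil(10 / 4)
print(len(loader_drop), sizes_drop)  # number of batches is floor(10 / 4)
```

Dropping the last batch can be useful when a model or loss is sensitive to very small batches, at the cost of skipping a few samples each epoch.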

## 3. Image Data and Transforms

For images, we often use `torchvision.transforms` for preprocessing and augmentation.

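Common transforms:
- `Resize`: Change the image size
- `ToTensor`: Convert a PIL image or `uint8` array to a float tensor (scales values to [0, 1] and moves channels first, giving shape `C x H x W`)
- `Normalize`: Subtract a per-channel mean and divide by a per-channel std
- `RandomHorizontalFlip`: Flip the image left-to-right with probability `p` (data augmentation)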

In [None]:
# Define transforms
transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.RandomHorizontalFlip(p=0.5),  # Augmentation
    transforms.ToTensor(),  # Converts [0, 255] to [0, 1]
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])  # Normalize to [-1, 1]
])

print("Transforms defined!")

## 4. ImageFolder

If your images are organized in folders by class (e.g., `data/cats/`, `data/dogs/`), you can use `torchvision.datasets.ImageFolder`.

```python
from torchvision.datasets import ImageFolder

# dataset = ImageFolder(root='path/to/data', transform=transform)
```

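This automatically assigns labels based on folder names: each subfolder of `root` becomes a class, and class indices follow the sorted order of the folder names. The mapping is available as `dataset.class_to_idx`.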

## Key Takeaways

1. **Dataset Class**: Implement `__len__` and `__getitem__` to handle your custom data.
2. **DataLoader**: Use this for efficient batching and shuffling.
3. **Transforms**: Essential for image preprocessing and augmentation.
4. **ImageFolder**: A convenient way to load classification datasets organized as one folder per class.

**Clean up**: Let's remove the dummy file.

In [None]:
import os
if os.path.exists('dummy_data.csv'):
    os.remove('dummy_data.csv')
    print("Removed dummy_data.csv")