In [None]:
# =============================================================================
# DS776 REQUIRED SETUP - Run this cell FIRST, before any other code!
# =============================================================================
%run ../Course_Tools/auto_update_introdl.py

In [None]:
# Imports for this notebook
from pathlib import Path
import os
import shutil

from introdl import config_paths_keys

paths = config_paths_keys()
DATA_PATH = paths['DATA_PATH']
MODELS_PATH = paths['MODELS_PATH']
CACHE_PATH = paths['CACHE_PATH']

# Understanding Storage in DS776

This notebook explains how storage works in this course, where different types of files go, and how to manage your storage effectively.

**Why this matters:** Deep learning involves downloading large pretrained models and saving training checkpoints. Without understanding where these files go, you can easily fill up your storage quota.

---

## Storage Overview

In this course, you'll work with several types of files:

| File Type | Examples | Size | Persistence |
|-----------|----------|------|-------------|
| **Datasets** | CIFAR-10, IMDB, CoNLL | 10MB - 5GB | Downloaded once, reused |
| **Pretrained Models** | BERT, ResNet, GPT-2 | 100MB - 5GB | Cached automatically |
| **Your Checkpoints** | `model.pt`, `checkpoint-1000/` | 50MB - 2GB each | You control these |
| **Training Logs** | TensorBoard, wandb | 1MB - 100MB | Usually temporary |

The course setup system organizes these into specific folders so you can:
1. Find your files easily
2. Clean up when needed
3. Know what's safe to delete

## The Path Variables

Every notebook uses three path variables from `config_paths_keys()`:

### `DATA_PATH` - Datasets
Where datasets are downloaded and stored.

```python
from torchvision.datasets import CIFAR10
from datasets import load_dataset

# Loading CIFAR-10
dataset = CIFAR10(root=DATA_PATH, download=True)

# Loading a HuggingFace dataset
dataset = load_dataset('imdb', cache_dir=str(DATA_PATH))
```

### `MODELS_PATH` - Your Trained Models
Where YOU save your trained model checkpoints. This is unique per notebook.

```python
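import torch
from transformers import TrainingArguments

# Saving a PyTorch model (weights only)
torch.save(model.state_dict(), MODELS_PATH / 'my_model.pt')

# HuggingFace Trainer output (add your other settings as keyword arguments)
training_args = TrainingArguments(output_dir=str(MODELS_PATH / 'bert_classifier'))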
```

### `CACHE_PATH` - Downloaded Pretrained Models
Where pretrained models (BERT, ResNet, etc.) are automatically cached.

```python
# This downloads and caches automatically to CACHE_PATH
model = AutoModel.from_pretrained('bert-base-uncased')
```

In [None]:
# Let's see where your paths point to
print("Your storage locations:")
print(f"  DATA_PATH:   {DATA_PATH}")
print(f"  MODELS_PATH: {MODELS_PATH}")
print(f"  CACHE_PATH:  {CACHE_PATH}")

## CoCalc Storage Architecture

CoCalc has two types of storage:

### Home Server (Synced Storage)
- **Location:** `~/home_workspace/`
- **Limit:** ~10GB total
- **Syncs:** Yes - available everywhere, backed up
- **Use for:** API keys, important models you want to keep

### Compute Server (Local Storage)
- **Location:** `~/cs_workspace/`
- **Limit:** ~50GB
- **Syncs:** No - only on compute server
- **Use for:** Large datasets, cached models, temporary files

**The course setup automatically uses the right storage:**
- On compute server: Data and cache go to `cs_workspace/` (fast, lots of space)
- Your models always go to `MODELS_PATH` which syncs back

## What Gets Cached Automatically?

### PyTorch/TorchVision Models
When you use pretrained models:
```python
model = models.resnet50(weights='IMAGENET1K_V1')  # ~100MB download
```
PyTorch caches these in `TORCH_HOME` (set by our setup to `CACHE_PATH`).

### HuggingFace Models
Transformers downloads are LARGE:
```python
model = AutoModel.from_pretrained('bert-base-uncased')  # ~440MB
model = AutoModel.from_pretrained('bert-large-uncased')  # ~1.3GB
model = AutoModel.from_pretrained('gpt2-medium')  # ~1.5GB
```
These cache in `HF_HOME` (set by our setup to `CACHE_PATH/huggingface`).

### HuggingFace Datasets
```python
dataset = load_dataset('imdb')  # ~80MB
dataset = load_dataset('xsum')  # ~250MB
```
These cache in `HF_DATASETS_CACHE` (set to `DATA_PATH`).
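
To confirm where the libraries will actually cache files, you can inspect the three environment variables named above. This is a minimal sketch: `cache_env_report` is a hypothetical helper (not part of `introdl`), and it only reads the variables. It does not verify that the course setup set them.

```python
import os

# Environment variables that, per the sections above, the course setup
# points at CACHE_PATH / DATA_PATH.
CACHE_ENV_VARS = ("TORCH_HOME", "HF_HOME", "HF_DATASETS_CACHE")

def cache_env_report(env=None):
    """Return {variable name: value or None} for the cache-related variables."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in CACHE_ENV_VARS}

for name, value in cache_env_report().items():
    shown = value if value is not None else "(not set; the library default applies)"
    print(f"{name:18} -> {shown}")
```

If a variable shows as not set, the setup cell at the top of the notebook probably has not been run in this kernel yet.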

In [None]:
# Check what's in your cache
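def get_folder_size(path):
    """Get total size of a folder in bytes."""
    total = 0
    path = Path(path)
    if path.exists():
        for f in path.rglob('*'):
            # Skip symlinks: the HuggingFace cache links snapshot files to
            # shared blobs, so following links would count files twice.
            if f.is_file() and not f.is_symlink():
                total += f.stat().st_size
    return total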

def format_size(size_bytes):
    """Format bytes to human readable."""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size_bytes < 1024:
            return f"{size_bytes:.1f} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.1f} TB"

print("Current storage usage:")
print(f"  DATA_PATH:   {format_size(get_folder_size(DATA_PATH))}")
print(f"  CACHE_PATH:  {format_size(get_folder_size(CACHE_PATH))}")
print(f"  MODELS_PATH: {format_size(get_folder_size(MODELS_PATH))}")

## The HuggingFace Trainer Storage Problem

**This is the #1 cause of storage issues in this course!**

By default, HuggingFace Trainer saves a checkpoint every 500 training steps (`save_strategy="steps"`, `save_steps=500`) and keeps all of them:

```
output_dir/
├── checkpoint-500/     # 500MB
├── checkpoint-1000/    # 500MB
├── checkpoint-1500/    # 500MB
├── checkpoint-2000/    # 500MB
└── checkpoint-2500/    # 500MB
                        # = 2.5GB for 5 checkpoints!
```

For a BERT model trained for 10 epochs, this can use **5-10GB of storage**.
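
The arithmetic behind these numbers is simple: one checkpoint folder per save interval, each roughly the same size. A rough sketch, where `estimate_checkpoint_storage_mb` and its parameter names are hypothetical, and the 500MB checkpoint size is taken from the listing above rather than measured:

```python
def estimate_checkpoint_storage_mb(total_steps, save_every_steps, checkpoint_mb):
    """Estimate disk use in MB when every checkpoint is kept."""
    n_checkpoints = total_steps // save_every_steps
    return n_checkpoints * checkpoint_mb

# The listing above: 2,500 steps, a save every 500 steps, 500MB each.
print(estimate_checkpoint_storage_mb(2500, 500, 500))   # 2500

# A run twice as long doubles the footprint.
print(estimate_checkpoint_storage_mb(5000, 500, 500))   # 5000
```

Real checkpoints are often larger than the model weights alone because they also store optimizer state, so treat the result as a lower bound.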

### The Solution: `save_total_limit=1`

**Always include these settings when using HuggingFace Trainer:**

```python
training_args = TrainingArguments(
    output_dir=str(MODELS_PATH / 'my_model'),
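    eval_strategy="epoch",        # Must match save_strategy (called evaluation_strategy in older transformers releases)
    save_strategy="epoch",
    save_total_limit=1,           # CRITICAL: Delete old checkpoints as training goes!
    load_best_model_at_end=True,  # Load the best one when done
    metric_for_best_model="eval_loss",  # Or eval_accuracy, eval_f1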
    # ... other settings ...
)
```

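With these settings, the Trainer deletes older checkpoints as it goes. Because `load_best_model_at_end=True` always protects the best checkpoint, up to two folders can remain: the best one and, if it is different, the most recent one.
```
output_dir/
├── checkpoint-1000/    # 500MB (best so far)
└── checkpoint-2500/    # 500MB (most recent)
```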
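
Runs that were started without a checkpoint limit leave their old `checkpoint-*` folders behind. The sketch below shows one way to prune them by hand. `list_checkpoints` and `prune_checkpoints` are hypothetical helpers, not part of `transformers` or `introdl`, and they assume the highest step number is the checkpoint you want to keep, which is not necessarily the best-scoring one.

```python
import re
import shutil
import tempfile
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def list_checkpoints(output_dir):
    """Return checkpoint-<step> folders in output_dir, sorted by step number."""
    found = []
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return [path for _, path in sorted(found)]

def prune_checkpoints(output_dir, keep=1, dry_run=True):
    """Remove all but the `keep` highest-step checkpoints; return what was (or would be) removed."""
    checkpoints = list_checkpoints(output_dir)
    to_remove = checkpoints[:-keep] if keep > 0 else checkpoints
    if not dry_run:
        for path in to_remove:
            shutil.rmtree(path)
    return to_remove

# Demonstration on a throwaway directory, so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    removed = prune_checkpoints(tmp, keep=1, dry_run=False)
    print([p.name for p in removed])                # ['checkpoint-500', 'checkpoint-1000']
    print([p.name for p in list_checkpoints(tmp)])  # ['checkpoint-1500']
```

The default `dry_run=True` only reports what would be removed, which is a sensible first pass before deleting anything under `MODELS_PATH`.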

## What's Safe to Delete?

### Safe to Delete (will re-download automatically):
- ✅ `DATA_PATH` contents - datasets re-download
- ✅ `CACHE_PATH` contents - pretrained models re-download
- ✅ `~/.cache/huggingface/` - HuggingFace cache
- ✅ Old `Lesson_XX_Models/` folders
- ✅ Old `Homework_XX_Models/` folders (after grades posted)

### Keep Until Grades Posted:
- ⚠️ Current homework `MODELS_PATH` - your submitted work

### Never Delete:
- ❌ `home_workspace/api_keys.env` - your API keys
- ❌ Your notebook files (`.ipynb`)
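
Before deleting anything from the lists above, it helps to see which subfolders are actually large. A minimal sketch, assuming you only need per-subfolder totals. `largest_subfolders` is a hypothetical helper and is not how `Storage_Cleanup.ipynb` is implemented.

```python
import tempfile
from pathlib import Path

def folder_size_bytes(path):
    """Sum the sizes of regular files under path, skipping symlinks."""
    total = 0
    for f in Path(path).rglob("*"):
        if f.is_file() and not f.is_symlink():
            total += f.stat().st_size
    return total

def largest_subfolders(root, top_n=5):
    """Return up to top_n (size_in_bytes, folder_name) pairs, largest first."""
    root = Path(root)
    if not root.exists():
        return []
    sizes = [(folder_size_bytes(child), child.name)
             for child in root.iterdir() if child.is_dir()]
    return sorted(sizes, reverse=True)[:top_n]

# Demonstration on a throwaway directory with two folders of known size.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_bytes in (("small", 10), ("big", 100)):
        folder = Path(tmp) / name
        folder.mkdir()
        (folder / "file.bin").write_bytes(b"x" * n_bytes)
    print(largest_subfolders(tmp))  # [(100, 'big'), (10, 'small')]
```

In a course notebook you would pass `DATA_PATH` or `CACHE_PATH` instead of the temporary directory.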

### Using Storage_Cleanup.ipynb

The easiest way to clean up is to run:

**`Lessons/Course_Tools/Storage_Cleanup.ipynb`**

This notebook:
1. Shows you exactly what's using storage
2. Lets you delete it with one click
3. Protects your API keys and current work

## Storage Tips

### 1. Always Use Path Variables
```python
# Good - works everywhere
torch.save(model.state_dict(), MODELS_PATH / 'model.pt')

# Bad - breaks on different environments
torch.save(model.state_dict(), './models/model.pt')
```

### 2. Set `save_total_limit=1` for HF Trainer
Every. Single. Time.

### 3. Clean Up After Grades Are Posted
Old homework models can be deleted once you have your grade.

### 4. Don't Download the Same Model Twice
If a notebook asks you to load a pretrained model, it's cached after first download. Running the cell again uses the cache.

### 5. Use Compute Server for Training
The compute server has more local storage and faster GPUs. Use it for training, then your models sync back.

### 6. Check Storage Before Long Training Runs
Run Storage_Cleanup.ipynb before starting a multi-hour training job.
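
A quick pre-flight check can also be done in code with the standard library. This is a sketch under one assumption: `shutil.disk_usage` reports free space on the underlying filesystem, which may be more than your CoCalc quota allows. `free_space_gb` and `has_enough_space` are hypothetical helper names.

```python
import shutil

def free_space_gb(path="."):
    """Free space, in GB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024**3

def has_enough_space(required_gb, path="."):
    """True if at least required_gb of space is free at path."""
    return free_space_gb(path) >= required_gb

print(f"Free space here: {free_space_gb():.1f} GB")
if not has_enough_space(5):
    print("Less than 5 GB free: clean up before starting a long training run.")
```

Pass `MODELS_PATH` as `path` to check the location where checkpoints will be written. The 5 GB threshold is only an illustrative value.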

## Summary

| Path Variable | What Goes There | Safe to Delete? |
|--------------|-----------------|------------------|
| `DATA_PATH` | Datasets | ✅ Yes - re-downloads |
| `CACHE_PATH` | Pretrained models | ✅ Yes - re-downloads |
| `MODELS_PATH` | YOUR checkpoints | ⚠️ After grades posted |

**Key Takeaways:**
1. Use `DATA_PATH`, `MODELS_PATH`, `CACHE_PATH` - never hardcode paths
2. Always use `save_total_limit=1` with HuggingFace Trainer
3. Run `Storage_Cleanup.ipynb` when you need space
4. Cached models re-download automatically - don't worry about deleting them