# Train from a notebook

This notebook shows how to start a training run from within a Jupyter environment, which is useful for interactive development and debugging.

## Preprocessing

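Before training, convert your dataset into a Parquet file. The conversion preprocesses and caches the images once, so subsequent runs start much faster. Training can also preprocess on the fly from a `dataset.json` and an image directory (see the configuration cell), but that is slower and not recommended for large datasets or repeated runs.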

You can use the `convert-to-parquet` script for this. For example, if you have a directory of images and a directory of corresponding `.txt` caption files, you can run:

```bash
convert-to-parquet --images-dir /path/to/your/images --captions-dir /path/to/your/captions --output-file preprocessed.parquet
```
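
Before converting, it can help to confirm that every image has a caption. The sketch below is an illustration, not part of `convert-to-parquet`: it assumes captions are matched to images by file stem (`cat.jpg` with `cat.txt`) and that the listed extensions cover your images. `find_unpaired` and `IMAGE_EXTENSIONS` are hypothetical names; check the script's own matching rule if your files are named differently.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def find_unpaired(images_dir, captions_dir):
    """Return (images without a caption, captions without an image) as sorted stems."""
    image_stems = {
        p.stem
        for p in Path(images_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    }
    caption_stems = {
        p.stem
        for p in Path(captions_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    }
    # Set differences give the stems present on one side only.
    return sorted(image_stems - caption_stems), sorted(caption_stems - image_stems)
```

Calling `find_unpaired("/path/to/your/images", "/path/to/your/captions")` returns two lists; ideally both are empty before you run the conversion.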

Alternatively, if you have a `dataset.json` file, you can run:

```bash
convert-to-parquet --dataset-json /path/to/your/dataset.json --images-dir /path/to/your/images --output-file preprocessed.parquet
```

Then, use the path to `preprocessed.parquet` in the configuration below.
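
A notebook kernel's working directory is not always the directory where you ran `convert-to-parquet`, so a relative path can point at the wrong place. A minimal sketch, assuming you want to fail early rather than inside the training run; `resolve_dataset_path` is a hypothetical helper, not part of the training package:

```python
from pathlib import Path


def resolve_dataset_path(path):
    """Return the absolute path to an existing file as a string, or raise early."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Dataset file not found: {resolved}")
    return str(resolved)
```

The returned string can then be used as the `dataset` value, for example `dataset=resolve_dataset_path("preprocessed.parquet")`.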

In [None]:
from pathlib import Path
import omegaconf
from joycaption_beta_one.finetuning.train_accelerate import Config, run_training

In [None]:
# Configure your training run.
# You can override any of the default values in the Config object.
config = Config(
    output_dir=Path("checkpoints_notebook"),
    # wandb_project="my-notebook-runs",  # uncomment to use wandb
    device_batch_size=1,
    batch_size=8,
    learning_rate=5e-5,
    num_epochs=1,  # Increase for a full training run
    
    # --- Dataset Configuration ---
    # Recommended: use a preprocessed .parquet file for speed.
    # Replace the placeholder below with the path to your own file.
    dataset="/path/to/your/preprocessed.parquet",

    # Alternatively, you can preprocess on the fly from a dataset.json and image directory.
    # This is slower and not recommended for large datasets or multiple runs.
    # dataset="/path/to/your/dataset.json",
    # images_path=Path("/path/to/your/images"),
)

In [None]:
# run_training expects an OmegaConf object, so wrap the Config instance in a structured config.
structured_config = omegaconf.OmegaConf.structured(config)

In [None]:
# This will start the training process.
# Note: This requires a GPU and will take a significant amount of time to run.
run_training(structured_config)