# Faraday - Streaming Training Data

This tutorial shows how to use Faraday to train a generative model from a streaming dataset.
Streaming matters when the source data is too large to load into memory at once.
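
To make the idea of streaming concrete, here is a minimal sketch in plain Python. It is not litdata's or OpenSynth's API: `read_records` and `stream_batches` are hypothetical helpers, and the record length of 4 values is an arbitrary placeholder. The point is only that a lazy iterator keeps a single batch in memory at a time, however large the source is.

```python
from typing import Iterable, Iterator, List


def read_records(n_records: int) -> Iterator[List[float]]:
    """Hypothetical source that yields one record at a time."""
    for i in range(n_records):
        # Stand-in for reading a single record from disk or object storage.
        yield [float(i)] * 4


def stream_batches(
    records: Iterable[List[float]], batch_size: int
) -> Iterator[List[List[float]]]:
    """Group a lazy stream of records into batches of at most batch_size."""
    batch: List[List[float]] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh batch; the previous one can be freed
    if batch:
        yield batch  # final, possibly smaller, batch


batch_sizes = [len(b) for b in stream_batches(read_records(1200), batch_size=500)]
print(batch_sizes)  # [500, 500, 200]
```

A real streaming library adds sharding, caching and multi-worker loading on top of this pattern, but the memory argument is the same.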

---

For more information on Faraday's architecture, refer to the [Faraday paper](https://arxiv.org/abs/2404.04314).

For more information on litdata (the PyTorch streaming library used here), refer to the [litdata docs](https://github.com/Lightning-AI/litdata).


### Prerequisites

1. If you haven't already, download the LCL dataset from [data.london.gov.uk](https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households), or
2. Use the CLI app to download and prepare the data (see the README).
3. Follow the tutorial `faraday_tutorial.ipynb` to train Faraday using the traditional (non-streaming) method.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import logging

logger = logging.getLogger(__name__)

# 💿 Loading Data Modules

In [3]:
from pathlib import Path
from opensynth.data_modules.streaming_data_module import StreamDataModule

stream_data_path = Path("../../data/processed/historical/stream")
stats_path = Path("../../data/processed/historical/train/mean_std.csv")

dm = StreamDataModule(
    data_path=str(stream_data_path),
    stats_path=stats_path,
    num_workers=9,
    batch_size=500,
    max_cache_size="10GB",
    shuffle=False,
    persistent_workers=True,
)
dm.setup()
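
The `stats_path` argument above points at a file named `mean_std.csv`. How OpenSynth produces or consumes that file is not shown in this tutorial. As an assumption, the sketch below supposes the statistics are used to standardise readings, and shows how a mean and standard deviation could be computed in one pass over a stream (Welford's algorithm) without holding the data in memory. `running_mean_std` and `standardise` are hypothetical names.

```python
import math
from typing import Iterable, Tuple


def running_mean_std(values: Iterable[float]) -> Tuple[float, float]:
    """One-pass (Welford) mean and population standard deviation."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count == 0:
        raise ValueError("cannot compute statistics of an empty stream")
    return mean, math.sqrt(m2 / count)


def standardise(x: float, mean: float, std: float) -> float:
    """Assumed normalisation: subtract the mean, divide by the std."""
    return (x - mean) / std


# iter(...) stands in for a stream that can only be read once.
mean, std = running_mean_std(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(round(mean, 6), round(std, 6))  # 5.0 2.0
z = standardise(9.0, mean, std)
```

Because the update touches one value at a time, the same loop works whether the values come from a list or from a streamed dataset.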

# 🤖 VAE Module

In [4]:
from opensynth.models.faraday.model import FaradayVAE
# Passing in your own encoder architecture may be supported in the future
model = FaradayVAE(
    class_dim=2,
    latent_dim=16,
    learning_rate=0.001,
    mse_weight=3,
)

In [6]:
import pytorch_lightning as pl
import torch

# A batch size of around 500 is where MPS becomes faster than CPU,
# but large batch sizes can sometimes hurt convergence.
# Suggestion: train on CPU with a small batch size, and tune hyperparameters
# for a large batch size before switching to accelerator="mps".

trainer = pl.Trainer(max_epochs=1, accelerator="auto")
trainer.fit(model, dm)

# Save model
torch.save(model, 'faraday_vae_stream.pt')

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/shengchai/.local/share/virtualenvs/OpenSynth-EhRIPYd3/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default

  | Name           | Type                    | Params | Mode 
-------------------------------------------------------------------
0 | encoder        | Encoder                 | 201 K  | train
1 | decoder        | Decoder                 | 200 K  | train
2 | reparametriser | ReparametrisationModule | 544    | 

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=1` reached.
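
The model summary above lists a `reparametriser` module between the encoder and decoder. In a standard VAE this is where the reparameterisation trick is applied: the latent sample is written as a deterministic function of the encoder outputs plus independent noise, so gradients can flow through the sampling step. The sketch below shows that trick in plain Python. It assumes a log-variance parameterisation, which may differ from what `ReparametrisationModule` actually does, and `reparameterise` is a hypothetical name.

```python
import math
import random
from typing import List


def reparameterise(
    mu: List[float], log_var: List[float], rng: random.Random
) -> List[float]:
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)  # sigma = exp(log_var / 2)
        for m, lv in zip(mu, log_var)
    ]


latent_dim = 16  # same value as latent_dim in the FaradayVAE call above
rng = random.Random(0)
mu = [0.5] * latent_dim
log_var = [-40.0] * latent_dim  # tiny variance, so z stays very close to mu
z = reparameterise(mu, log_var, rng)
```

With a larger `log_var` the samples spread around `mu`, which is what lets the decoder learn from a distribution rather than a single point.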


# 🕸️ GMM Module

In [7]:
from opensynth.models.faraday.model import FaradayModel
import torch

In [9]:
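# Replace with the path to the relevant checkpoint.
# The file holds a fully pickled model (saved with torch.save(model, ...)),
# so PyTorch >= 2.6 needs weights_only=False. Only load files you trust.
model = torch.load('faraday_vae_stream.pt', weights_only=False)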

faraday_model_50 = FaradayModel(vae_module=model, n_components=50, max_epochs=100, tol=1e-2, covariance_reg=1e-4)
faraday_model_10 = FaradayModel(vae_module=model, n_components=10, max_epochs=100, tol=1e-2, covariance_reg=1e-4)
faraday_model_1 = FaradayModel(vae_module=model, n_components=1, max_epochs=100, tol=1e-2, covariance_reg=1e-4)
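
The three `FaradayModel` instances above differ only in `n_components`. The sketch below is not OpenSynth's implementation. It is a minimal 1-D expectation-maximisation loop in plain Python that illustrates the roles these arguments conventionally play when fitting a Gaussian mixture: `max_epochs` caps the number of EM passes, `tol` stops training once the mean log-likelihood changes by less than that amount, and `covariance_reg` is added to each variance so it cannot collapse to zero. That the arguments behave this way inside `FaradayModel` is an assumption based on their names, and `fit_gmm_1d` is a hypothetical function.

```python
import math
from typing import List, Sequence, Tuple


def fit_gmm_1d(
    data: Sequence[float],
    n_components: int,
    max_epochs: int = 100,
    tol: float = 1e-2,
    covariance_reg: float = 1e-4,
) -> Tuple[List[float], List[float], List[float]]:
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(data)
    xs = sorted(data)
    # Deterministic initialisation: place the means at evenly spaced quantiles.
    means = [xs[(2 * k + 1) * n // (2 * n_components)] for k in range(n_components)]
    centre = sum(xs) / n
    variances = [sum((x - centre) ** 2 for x in xs) / n + covariance_reg] * n_components
    weights = [1.0 / n_components] * n_components

    prev_ll = -math.inf
    for _ in range(max_epochs):
        # E-step: responsibility of each component for each point.
        # (No log-space stabilisation here; fine for a small illustration.)
        resp = []
        ll = 0.0
        for x in data:
            dens = [
                w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            ll += math.log(total) / n  # accumulate the mean log-likelihood
            resp.append([d / total for d in dens])

        # M-step: re-estimate weights, means and regularised variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp) + 1e-12  # guard against empty components
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = (
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
                + covariance_reg
            )

        if abs(ll - prev_ll) < tol:  # converged: likelihood barely changed
            break
        prev_ll = ll
    return weights, means, variances


data = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
weights, means, variances = fit_gmm_1d(data, n_components=2)
```

On this toy data the two components settle near 0 and 10 with roughly equal weights. A looser `tol` stops earlier, and a larger `covariance_reg` widens every component.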




In [10]:
gmm_data_module = StreamDataModule(
    data_path=str(stream_data_path),
    stats_path=stats_path,
    num_workers=9,
    batch_size=5000,
    max_cache_size="10GB",
    shuffle=False,
    persistent_workers=True,
)
gmm_data_module.setup()

In [11]:
faraday_model_50.train_gmm(dm=gmm_data_module)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/shengchai/.local/share/virtualenvs/OpenSynth-EhRIPYd3/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/Users/shengchai/.local/share/virtualenvs/OpenSynth-EhRIPYd3/lib/python3.11/site-packages/pytorch_lightning/core/optimizer.py:182: `LightningModule.configure_optimizers` returned `None`, this fit will run with no optimizer

  | Name                      | Type                        | Params | Mode 
----------------------------------------------------------------------------------
0 | gmm_module                | GaussianMixtureModel        | 0      | train
1 | vae_module                | FaradayVAE                  | 402 K  | train
2 | weight_metric             | WeightsMetric               | 0      | train
3 | mean_metric               | MeansMetric              

Initial prec chol: 0.48578402400016785.                 Initial mean: -5.278018951416016


/Users/shengchai/.local/share/virtualenvs/OpenSynth-EhRIPYd3/lib/python3.11/site-packages/pytorch_lightning/utilities/data.py:122: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.


Training: |          | 0/? [00:00<?, ?it/s]

In [12]:
torch.save(faraday_model_50, "faraday_model_50.pt")