These notes follow [this guide](https://pytorch-lightning.readthedocs.io/en/stable/model/build_model_expert.html) from the PyTorch Lightning documentation.

## Lightning Lite

> [`LightningLite`](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.lite.LightningLite.html#pytorch_lightning.lite.LightningLite)
> enables **pure PyTorch users** to **scale their existing code on any kind of device** while retaining **full control over their own loops and optimization logic**.

See gif:

<img src="./assets/lightning_lite.gif"/>

LightningLite is the right tool for you if you match one of the two following descriptions:
* I want to quickly scale my existing code to multiple devices with minimal code changes.
* I would like to convert my existing code to the Lightning API, but a full transition to Lightning might be too complex. I am looking for a stepping stone to ensure reproducibility during the transition.

> ⚠️ Currently in Beta!


### Learn by example

#### My Existing PyTorch Code

The `train` function contains a standard training loop used to train `MyModel` on `MyDataset` for `num_epochs` epochs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class MyModel(nn.Module):
    ...


class MyDataset(Dataset):
    ...


def train(args):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = MyModel(...).to(device)
    optimizer = torch.optim.SGD(model.parameters(), ...)

    dataloader = DataLoader(MyDataset(...), ...)

    model.train()
    for epoch in range(args.num_epochs):
        for batch in dataloader:
            batch = batch.to(device)
            optimizer.zero_grad()
            loss = model(batch)
            loss.backward()
            optimizer.step()


train(args)
```

#### Convert to LightningLite

Here are four steps to let LightningLite scale your PyTorch models.
1. Create the `LightningLite` object at the beginning of your training code.
1. Remove all `.to` and `.cuda` calls, since `LightningLite` takes care of device placement.
1. Route your objects through `lite`:
    * **call `setup()` on each model and optimizer pair**,
    * **call `setup_dataloaders()` on all your dataloaders**, and
    * **replace `loss.backward()` with `lite.backward(loss)`**.
1. Run the script from the terminal using `lightning run model path/to/train.py`, or use the `launch()` method in a notebook.

Code:
```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from lightning.lite import LightningLite


class MyModel(nn.Module):
    ...


class MyDataset(Dataset):
    ...


def train(args):

    lite = LightningLite()  # NOTE.

    model = MyModel(...)
    optimizer = torch.optim.SGD(model.parameters(), ...)
    model, optimizer = lite.setup(model, optimizer)    # NOTE.  # Scale your model / optimizers

    dataloader = DataLoader(MyDataset(...), ...)
    dataloader = lite.setup_dataloaders(dataloader)    # NOTE.  # Scale your dataloaders

    model.train()
    for epoch in range(args.num_epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(batch)
            lite.backward(loss)    # NOTE.  # instead of loss.backward()
            optimizer.step()


train(args)
```

That’s all you need to do to your code. You can now train on any kind of device and scale your training.
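
One thing `setup_dataloaders()` takes care of in a distributed run is making sure each process iterates over a different slice of the dataset. The sketch below is plain Python, not LightningLite's implementation: it assumes a strided, non-shuffled split with wrap-around padding, similar in spirit to PyTorch's `DistributedSampler`, and `shard_indices` is a hypothetical helper introduced only for illustration.

```python
import math
from typing import List


def shard_indices(dataset_len: int, world_size: int, rank: int) -> List[int]:
    """Return the dataset indices one process would iterate over.

    Assumption: strided split, no shuffling, padded by wrapping around so
    that every process receives the same number of samples.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if dataset_len == 0:
        return []
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad with indices from the start so `total` divides evenly across ranks.
    indices += [indices[i % dataset_len] for i in range(total - dataset_len)]
    # Each rank takes every `world_size`-th index, starting at its own rank.
    return indices[rank:total:world_size]


for rank in range(4):
    print(rank, shard_indices(10, world_size=4, rank=rank))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6, 0]
# 3 [3, 7, 1]
```

Because every process draws its own batches, the effective batch size per optimizer step in such a setup is the per-process batch size multiplied by the number of processes.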

Check out [this](https://github.com/Lightning-AI/lightning/blob/master/examples/lite/image_classifier_2_lite.py) full MNIST training example with LightningLite.

Here is how to train on eight GPUs with [`torch.bfloat16`](https://pytorch.org/docs/1.10.0/generated/torch.Tensor.bfloat16.html) precision:
```sh
lightning run model ./path/to/train.py --strategy=ddp --devices=8 --accelerator=cuda --precision="bf16"
```

Here is how to use [DeepSpeed Zero3](https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html) with eight GPUs and mixed precision:
```sh
lightning run model ./path/to/train.py --strategy=deepspeed --devices=8 --accelerator=cuda --precision=16
```

`LightningLite` can also pick the accelerator and devices automatically for you!
```sh
lightning run model ./path/to/train.py --devices=auto --accelerator=auto --precision=16
```
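
If you launch runs from another script (a sweep, for example), it can help to assemble the command programmatically. The sketch below builds the same `lightning run model` invocations shown above as an argument list. `build_lite_command` is a hypothetical helper, not part of Lightning, and it emits only the four flags that appear in this section.

```python
import shlex
from typing import List, Optional, Union


def build_lite_command(
    script: str,
    accelerator: str = "auto",
    devices: Union[int, str] = "auto",
    precision: Optional[Union[int, str]] = None,
    strategy: Optional[str] = None,
) -> List[str]:
    """Assemble a `lightning run model` command as a list of arguments."""
    cmd = ["lightning", "run", "model", script]
    if strategy is not None:
        cmd.append(f"--strategy={strategy}")
    cmd.append(f"--devices={devices}")
    cmd.append(f"--accelerator={accelerator}")
    if precision is not None:
        cmd.append(f"--precision={precision}")
    return cmd


cmd = build_lite_command(
    "./path/to/train.py",
    accelerator="cuda",
    devices=8,
    precision="bf16",
    strategy="ddp",
)
# Render the list as a single shell-style string for logging.
print(shlex.join(cmd))
# lightning run model ./path/to/train.py --strategy=ddp --devices=8 --accelerator=cuda --precision=bf16
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell-quoting problems.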

You can also easily use distributed collectives if required.

```python
lite = LightningLite()

# Transfer and concatenate tensors across processes
lite.all_gather(...)

# Transfer an object from one process to all the others
lite.broadcast(..., src=...)

# The total number of processes running across all devices and nodes.
lite.world_size

# The global index of the current process across all devices and nodes.
lite.global_rank

# The index of the current process among the processes running on the local node.
lite.local_rank

# The index of the current node.
lite.node_rank

# Whether this global rank is rank zero.
if lite.is_global_zero:
    # do something on rank 0
    ...

# Wait for all processes to enter this call.
lite.barrier()
```
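
To make the relationship between these rank properties concrete, here is a small pure-Python sketch. It assumes that every node runs the same number of processes and that global ranks are assigned node by node, which is the usual convention but is not stated above. `RankInfo`, `rank_info`, and `procs_per_node` are illustrative names, not part of LightningLite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankInfo:
    world_size: int   # total number of processes across all nodes
    global_rank: int  # index of this process across all nodes
    local_rank: int   # index of this process on its own node
    node_rank: int    # index of the node

    @property
    def is_global_zero(self) -> bool:
        return self.global_rank == 0


def rank_info(node_rank: int, local_rank: int, num_nodes: int, procs_per_node: int) -> RankInfo:
    """Derive the rank properties for one process (assumes equal-sized nodes)."""
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    if not 0 <= local_rank < procs_per_node:
        raise ValueError("local_rank must be in [0, procs_per_node)")
    return RankInfo(
        world_size=num_nodes * procs_per_node,
        global_rank=node_rank * procs_per_node + local_rank,
        local_rank=local_rank,
        node_rank=node_rank,
    )


info = rank_info(node_rank=1, local_rank=3, num_nodes=2, procs_per_node=8)
print(info.global_rank, info.world_size, info.is_global_zero)
# 11 16 False
```

With two nodes of eight processes each, the process with `local_rank` 3 on node 1 has global rank 11, and only the process with global rank 0 satisfies `is_global_zero`.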

The code stays hardware-agnostic, whether you are running on a CPU, on two GPUs, or on multiple machines with many GPUs.

If you require custom data or model device placement, you can deactivate `LightningLite`’s automatic placement by calling `lite.setup_dataloaders(..., move_to_device=False)` for the data and `lite.setup(..., move_to_device=False)` for the model. You can then read the current device from `lite.device`, or use the `to_device()` utility to move an object to the current device.

### Distributed Training Pitfalls

See: https://pytorch-lightning.readthedocs.io/en/stable/model/build_model_expert.html#distributed-training-pitfalls

### Lightning Lite Flags

See: https://pytorch-lightning.readthedocs.io/en/stable/model/build_model_expert.html#lightning-lite-flags

Key flags and methods covered there include:
* `accelerator`, `devices`
* `strategy`
* `precision`
* `save`, `load`
* ...