# Intro to TorchRec

When building recommendation systems, we frequently want to represent entities such as products or pages with embeddings. For example, see Meta AI's [Deep learning recommendation model](https://arxiv.org/abs/1906.00091), or DLRM. As the number of entities grows, the size of the embedding tables can exceed a single GPU’s memory. A common practice is to shard the embedding table across devices, a type of model parallelism. To that end, **TorchRec introduces its primary API called [`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel), or DMP. Like PyTorch’s DistributedDataParallel, DMP wraps a model to enable distributed training.**
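
To make the idea of sharding concrete, here is a minimal pure-Python sketch of one possible scheme: splitting a table's rows into contiguous ranges, one per device. The helper names `row_wise_shards` and `locate_row` are hypothetical and are not part of TorchRec; the strategies TorchRec actually supports are chosen by its planner.

```python
from typing import List, Tuple


def row_wise_shards(num_embeddings: int, num_devices: int) -> List[Tuple[int, int]]:
    """Split rows [0, num_embeddings) into contiguous, near-equal ranges, one per device."""
    base, extra = divmod(num_embeddings, num_devices)
    shards = []
    start = 0
    for device in range(num_devices):
        # The first `extra` devices take one additional row each.
        size = base + (1 if device < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


def locate_row(row_id: int, shards: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (device index, local row index) for a global row id."""
    for device, (start, end) in enumerate(shards):
        if start <= row_id < end:
            return device, row_id - start
    raise IndexError(f"row {row_id} is outside the table")


shards = row_wise_shards(num_embeddings=10, num_devices=4)
print(shards)                 # [(0, 3), (3, 6), (6, 8), (8, 10)]
print(locate_row(7, shards))  # (2, 1)
```

Each device then only needs memory for its own range of rows, which is what lets the full table exceed the capacity of any single GPU.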

## **Installation**
Requirements:
- Python >= 3.7

We highly recommend CUDA when using TorchRec. If using CUDA:
- CUDA >= 11.0


In [1]:
# install conda to make installing pytorch with the CUDA toolkit easier.
!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.9.2-Linux-x86_64.sh
!bash ./Miniconda3-py37_4.9.2-Linux-x86_64.sh -b -f -p /usr/local


In [2]:
# install pytorch with cudatoolkit 11.6
!conda install pytorch pytorch-cuda=11.6 -c pytorch-nightly -c nvidia -y


Installing TorchRec will also install [FBGEMM](https://github.com/pytorch/fbgemm), a collection of CUDA kernels and GPU enabled operations to run

In [3]:
# install torchrec
!pip3 install torchrec-nightly


The following steps are needed for the Colab runtime to detect the added shared libraries. The runtime searches for shared libraries in /usr/lib, so we copy over the libraries that were installed in /usr/local/lib/. **This step is required, but only in the Colab runtime.**

In [4]:
!cp /usr/local/lib/lib* /usr/lib/

**Restart your runtime at this point for the newly installed packages to be seen.** Run the step below immediately after restarting so that Python knows where to look for packages. **Always run this step after restarting the runtime.**

In [5]:
import sys
sys.path = ['', '/env/python', '/usr/local/lib/python37.zip', '/usr/local/lib/python3.7', '/usr/local/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/site-packages']

## **Overview**
This tutorial covers three pieces of TorchRec: the `nn.Module` `EmbeddingBagCollection`, the `DistributedModelParallel` API, and the data structure `KeyedJaggedTensor`.

### Distributed Setup
We set up our environment with torch.distributed. For more info on distributed, see this [tutorial](https://pytorch.org/tutorials/beginner/dist_overview.html).

Here, we use one rank (the Colab process) corresponding to our single Colab GPU.
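
The single-rank setup can also be wrapped in a small helper that picks a backend. The sketch below is only an illustration: `configure_single_process` is a hypothetical name, and it assumes you want to fall back to the CPU-capable "gloo" backend when no suitable GPU is available.

```python
import os


def configure_single_process(cuda_available: bool, port: int = 29500) -> str:
    """Set the rendezvous variables for one rank and return a backend name."""
    # setdefault keeps any values that were already exported by a launcher.
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", str(port))
    return "nccl" if cuda_available else "gloo"


backend = configure_single_process(cuda_available=False)
print(backend)  # gloo
```

In a real session you would pass `torch.cuda.is_available()` as the flag and hand the returned string to `dist.init_process_group(backend=...)`.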

In [6]:
import os
import torch
import torchrec
import torch.distributed as dist

os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "29500"

# Note - you will need a V100 or A100 to run the tutorial as is!
# If using an older GPU (such as the Colab free K80),
# you will need to compile fbgemm with the appropriate CUDA architecture
# or run with "gloo" on CPUs
dist.init_process_group(backend="nccl")


### From EmbeddingBag to EmbeddingBagCollection
PyTorch represents embeddings through [`torch.nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) and [`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html). EmbeddingBag is a pooled version of Embedding.

TorchRec extends these modules by creating collections of embeddings. We will use [`EmbeddingBagCollection`](https://pytorch.org/torchrec/torchrec.modules.html#torchrec.modules.embedding_modules.EmbeddingBagCollection) to represent a group of EmbeddingBags.

Here, we create an EmbeddingBagCollection (EBC) with two embedding bags. Each table, `product_table` and `user_table`, holds 4096 embeddings of dimension 64. Note how we initially allocate the EBC on device "meta". This tells the EBC not to allocate memory yet.
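
As a rough sense of scale, the weight memory these two tables will need once they are materialized can be estimated with simple arithmetic. The sketch below assumes 32-bit float weights (an assumption, not something the configuration above states) and ignores optimizer state and any padding.

```python
BYTES_PER_FP32 = 4  # assumption: weights are stored as 32-bit floats


def table_bytes(num_embeddings: int, embedding_dim: int,
                bytes_per_element: int = BYTES_PER_FP32) -> int:
    """Bytes needed for the weights of one embedding table."""
    return num_embeddings * embedding_dim * bytes_per_element


# (num_embeddings, embedding_dim) for the two tables in this tutorial.
tables = {"product_table": (4096, 64), "user_table": (4096, 64)}
total = sum(table_bytes(n, d) for n, d in tables.values())
print(total / 2**20, "MiB")  # 2.0 MiB
```

The toy tables fit easily on one device. The same formula gives about 512 GB for a hypothetical table with one billion rows of dimension 128, which is the regime where sharding becomes necessary.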

In [None]:
ebc = torchrec.EmbeddingBagCollection(
    device="meta",
    tables=[
        torchrec.EmbeddingBagConfig(
            name="product_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["product"],
            pooling=torchrec.PoolingType.SUM,
        ),
        torchrec.EmbeddingBagConfig(
            name="user_table",
            embedding_dim=64,
            num_embeddings=4096,
            feature_names=["user"],
            pooling=torchrec.PoolingType.SUM,
        )
    ]
)

### FBGEMM Optimizations - Batching and Fusion

TorchRec provides abstractions over [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu) kernels that provide efficient implementations of the canonical nn.EmbeddingBags. Two of the available optimizations are:

* Table batching, which allows you to look up multiple embeddings with one kernel call.
* Optimizer Fusion, which allows the module to update itself given the canonical pytorch optimizers and arguments.

These can be accessed by using the [fuse_embedding_optimizer](https://github.com/pytorch/torchrec/blob/main/torchrec/modules/fused_embedding_modules.py#L271) wrapper, which replaces embedding modules with their batched and fused counterparts. You can also use these efficient counterparts directly; take a look at `torchrec.modules.fused_embedding_modules`.

To quantitatively see the performance gains of this, see our [benchmarks](https://github.com/pytorch/torchrec/blob/main/benchmarks/README.md).

Note that this step is optional - the following steps can also be applied to the non-optimized EmbeddingBagCollection.
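
To illustrate the idea behind optimizer fusion, here is a pure-Python sketch of an SGD update applied during the backward step of one sum-pooled bag. It is a conceptual illustration only: `fused_sgd_backward` is a hypothetical name, and the FBGEMM kernels are implemented very differently.

```python
from typing import List


def fused_sgd_backward(table: List[List[float]], row_ids: List[int],
                       grad_pooled: List[float], lr: float) -> None:
    """Apply SGD to the rows of one sum-pooled bag during the backward step.

    With SUM pooling, every looked-up row receives the pooled gradient unchanged,
    so each row can be updated in place without building a dense gradient table.
    """
    for row in row_ids:
        weights = table[row]
        for j, g in enumerate(grad_pooled):
            weights[j] -= lr * g


table = [[1.0, 1.0] for _ in range(4)]  # 4 rows, embedding_dim = 2
fused_sgd_backward(table, row_ids=[0, 2], grad_pooled=[0.5, -1.0], lr=0.02)
print(table[0])  # a touched row, moved against the gradient
print(table[1])  # [1.0, 1.0], an untouched row
```

Only the rows that were actually looked up are modified, and no separate optimizer step over the whole table is required.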

In [None]:
from torchrec.optim.apply_optimizer_in_backward import apply_optimizer_in_backward

apply_optimizer_in_backward(
    optimizer_class=torch.optim.SGD,
    params=ebc.parameters(),
    optimizer_kwargs={"lr": 0.02},
)

### DistributedModelParallel
Now, we’re ready to wrap our model with [`DistributedModelParallel`](https://pytorch.org/torchrec/torchrec.distributed.html#torchrec.distributed.model_parallel.DistributedModelParallel) (DMP). Instantiating DMP will:

1. Decide how to shard the model. DMP will collect the available ‘sharders’ and come up with a ‘plan’ of the optimal way to shard the embedding table(s) (i.e., the EmbeddingBagCollection).
2. Actually shard the model. This includes allocating memory for each embedding table on the appropriate device(s).

In this toy example, since we have two EmbeddingTables and one GPU, TorchRec will place both on the single GPU.
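
The notion of a ‘plan’ can be illustrated with a deliberately simplified sketch that places whole tables on devices by size. This is not TorchRec's planner: `greedy_plan` is a hypothetical helper, and it assumes tables are never split and that memory is the only cost.

```python
from typing import Dict


def greedy_plan(table_bytes: Dict[str, int], num_devices: int) -> Dict[str, int]:
    """Assign whole tables to devices, largest first, onto the least-loaded device."""
    load = [0] * num_devices
    plan: Dict[str, int] = {}
    # Sort by descending size, breaking ties by name for a deterministic result.
    for name, size in sorted(table_bytes.items(), key=lambda kv: (-kv[1], kv[0])):
        device = min(range(num_devices), key=lambda d: load[d])
        plan[name] = device
        load[device] += size
    return plan


sizes = {"product_table": 4096 * 64 * 4, "user_table": 4096 * 64 * 4}
print(greedy_plan(sizes, num_devices=1))  # {'product_table': 0, 'user_table': 0}
print(greedy_plan(sizes, num_devices=2))  # {'product_table': 0, 'user_table': 1}
```

With a single device, every table lands on device 0, matching the toy example; with two devices, this sketch would spread the tables out.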

To learn more about sharding, see our [sharding tutorial](https://pytorch.org/tutorials/advanced/sharding.html).


In [None]:
model = torchrec.distributed.DistributedModelParallel(ebc, device=torch.device("cuda"))
print(model)
print(model.plan)

### Sharders and Quantized Comms

By default, DistributedModelParallel will identify which Sharder to use for your embedding module. In the above case, it creates a default EmbeddingBagCollectionSharder.

However, you may also specify your own Sharder; by doing so, you can set additional sharding parameters. For example, you can specify the quantized/mixed precision config.

Applying quantization and mixed precision to collective calls during distributed training is often used to increase training throughput without significantly sacrificing model quality.

TorchRec provides helper functions to construct communication codecs, based on the [FBGEMM Qcomm library](https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/fbgemm_gpu/quantize_comm.py).

Below, we create a sharder that uses FP16 precision for the forward pass: embedding tensors passed around in a collective call are first cast to FP16. Similarly, for the backward pass, tensors within collective calls are cast to BF16.
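
To see what the two 16-bit formats do to a value, the following standard-library sketch round-trips a number through each. FP16 keeps 10 mantissa bits with a narrow exponent range, while BF16 keeps float32's exponent range with 7 mantissa bits. The helpers are illustrative only and are unrelated to the FBGEMM codecs; the BF16 helper truncates for simplicity, whereas a production codec may round instead.

```python
import struct


def to_fp16_and_back(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16_and_back(x: float) -> float:
    """Round-trip through bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 0.1234567
print(to_fp16_and_back(value))  # close to value, small rounding error
print(to_bf16_and_back(value))  # close to value, coarser mantissa
# 1e30 is far beyond FP16's largest finite value (65504) but still fits in BF16.
print(to_bf16_and_back(1e30))
```

Both formats halve the bytes sent per element compared with FP32, which is where the throughput gain in collective calls comes from.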

In [None]:
from torchrec.distributed.fbgemm_qcomm_codec import get_qcomm_codecs_registry, QCommsConfig, CommType
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder

sharder = EmbeddingBagCollectionSharder(
    qcomm_codecs_registry=get_qcomm_codecs_registry(
            qcomms_config=QCommsConfig(
                forward_precision=CommType.FP16,
                backward_precision=CommType.BF16,
            )
        )
)
model = torchrec.distributed.DistributedModelParallel(ebc, sharders=[sharder], device=torch.device("cuda"))

### Query vanilla nn.EmbeddingBag with input and offsets

We query [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) with `input`, and [`nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html) with `input` and `offsets`. `input` is a 1-D tensor containing the lookup values. `offsets` is a 1-D tensor giving the starting index of each example's bag within `input`, i.e. the cumulative sum of the number of values pooled by the preceding examples.

Let's look at an example, recreating the product EmbeddingBag above:

```
|------------|
| product ID |
|------------|
| [101, 202] |
| []         |
| [303]      |
|------------|
```
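
How `offsets` partitions a flat `input` into bags can be sketched in plain Python. The helper `split_bags` is a hypothetical name used only for illustration; it mirrors the convention that each offset marks where a bag starts and the last bag runs to the end of the input.

```python
from typing import List


def split_bags(values: List[int], offsets: List[int]) -> List[List[int]]:
    """Split a flat list of IDs into bags; bag i spans offsets[i] up to offsets[i + 1]."""
    ends = offsets[1:] + [len(values)]  # the last bag runs to the end of the input
    return [values[start:end] for start, end in zip(offsets, ends)]


print(split_bags([101, 202, 303], offsets=[0, 2, 2]))  # [[101, 202], [], [303]]
```

The repeated offset `2` is what encodes the empty second example in the table above.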



In [None]:
product_eb = torch.nn.EmbeddingBag(4096, 64)
product_eb(input=torch.tensor([101, 202, 303]), offsets=torch.tensor([0, 2, 2]))

### Representing minibatches with KeyedJaggedTensor

We need an efficient representation of multiple examples, each with an arbitrary number of entity IDs per feature. To enable this "jagged" representation, we use the TorchRec data structure [`KeyedJaggedTensor`](https://pytorch.org/torchrec/torchrec.sparse.html#torchrec.sparse.jagged_tensor.JaggedTensor) (KJT).

Let's take a look at **how to look up a collection of two embedding bags**, "product" and "user". Assume the minibatch is made up of three examples for three users. The first example has two product IDs, the second has none, and the third has one.

```
|------------|------------|
| product ID | user ID    |
|------------|------------|
| [101, 202] | [404]      |
| []         | [505]      |
| [303]      | [606]      |
|------------|------------|
```

The query should be:

In [None]:
mb = torchrec.KeyedJaggedTensor(
    keys = ["product", "user"],
    values = torch.tensor([101, 202, 303, 404, 505, 606]).cuda(),
    lengths = torch.tensor([2, 0, 1, 1, 1, 1], dtype=torch.int64).cuda(),
)

print(mb.to(torch.device("cpu")))

Note that the KJT batch size is `batch_size = len(lengths)//len(keys)`. **In the above example, batch_size is 3.**
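
The flat `values`/`lengths` layout can be unpacked by hand to check that it encodes the table above. The sketch below uses plain Python lists rather than tensors, and `unpack_kjt` is a hypothetical helper, not a TorchRec function.

```python
from typing import Dict, List


def unpack_kjt(keys: List[str], values: List[int],
               lengths: List[int]) -> Dict[str, List[List[int]]]:
    """Rebuild per-key, per-example ID lists from a flat values/lengths layout.

    Assumes the key-major order used in the example above: all examples of the
    first key, then all examples of the second key, and so on.
    """
    batch_size = len(lengths) // len(keys)
    result: Dict[str, List[List[int]]] = {}
    position = 0
    for k, key in enumerate(keys):
        bags = []
        for length in lengths[k * batch_size:(k + 1) * batch_size]:
            bags.append(values[position:position + length])
            position += length
        result[key] = bags
    return result


print(unpack_kjt(
    keys=["product", "user"],
    values=[101, 202, 303, 404, 505, 606],
    lengths=[2, 0, 1, 1, 1, 1],
))
# {'product': [[101, 202], [], [303]], 'user': [[404], [505], [606]]}
```

The first three lengths belong to "product" and the last three to "user", which is why `len(lengths) // len(keys)` recovers the batch size.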



### Putting it all together, querying our distributed model with a KJT minibatch
Finally, we can query our model using our minibatch of products and users.

The resulting lookup will contain a KeyedTensor, where each key (or feature) contains a 2D tensor of size 3x64 (batch_size x embedding_dim).
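
The shape of the pooled output can be checked on a toy scale with a plain-Python version of SUM pooling. This is a sketch with a hypothetical 4-row, 2-dimensional table, not the sharded lookup that DMP performs.

```python
from typing import List


def sum_pool(table: List[List[float]], bags: List[List[int]]) -> List[List[float]]:
    """Sum the rows selected by each bag; an empty bag pools to a zero vector."""
    dim = len(table[0])
    pooled = []
    for bag in bags:
        row_sum = [0.0] * dim
        for row_id in bag:
            for j in range(dim):
                row_sum[j] += table[row_id][j]
        pooled.append(row_sum)
    return pooled


# Toy table: 4 rows, embedding_dim = 2 (the tutorial uses 4096 rows, dim 64).
toy_table = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
out = sum_pool(toy_table, bags=[[0, 1], [], [3]])
print(out)                    # [[2.0, 4.0], [0.0, 0.0], [6.0, 7.0]]
print(len(out), len(out[0]))  # 3 2  (batch_size x embedding_dim)
```

Each example yields exactly one vector regardless of how many IDs it contains, so the output has one row per example.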

In [None]:
pooled_embeddings = model(mb).to_dict()
print("product embeddings", pooled_embeddings["product"])
print("user embeddings", pooled_embeddings["user"])

## More resources
For more information, please see our [dlrm](https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm/) example, which includes multinode training on the Criteo Terabyte dataset using Meta’s [DLRM](https://arxiv.org/abs/1906.00091).