# Introduction to TorchVision

TorchVision is a package of popular datasets, model architectures, and computer vision utilities such as image transforms, helpers for displaying images, and functions for reading and writing videos and images.

We have already used some TorchVision functionality in previous sections. This section covers it in more detail so that you are better equipped to use it in your own work.

TorchVision consists of the following modules:
1. Datasets
2. Transforms
3. Models
4. Utils
5. IO
6. Ops

The most used of the above are `Datasets`, `Transforms` and `Models`.




# Datasets

When we loaded datasets such as MNIST and Fashion-MNIST in previous sections, we saw how convenient they are. TorchVision packages many more popular datasets. The `datasets` are mostly used together with the `DataLoader` class available in PyTorch.

## Function Syntax

```python
torchvision.datasets.DATASET(root, train=True, transform=None, target_transform=None, download=False)
```

Where,

- `DATASET` is the name of the dataset class, such as `MNIST`, `FashionMNIST` or `CIFAR10`. The full list is in the `torchvision.datasets` documentation. Some datasets, such as COCO, take slightly different arguments.
- `root` is the folder where the dataset is stored, or where it will be saved if you opt to download it.
- `train` is a flag that specifies whether to load the training split (`True`) or the test split (`False`).
- `download` is a flag that you turn on when you want to download the data. The data is not downloaded again if it is already present in the `root` folder.
- `transform` is a callable applied to each input image, for example cropping or resizing. Several transforms can be chained into one.
- `target_transform` is a callable applied to each target (label) to transform it as required.

## Why Is It Useful?

Suppose you are working on a problem and achieve decent accuracy. Now you want to test your model on different or harder data. You would have to search for a dataset, inspect it to see how it is organized, download it to your system, and adapt it to fit your training pipeline. Only then would you be ready to use the new dataset.

But when you use TorchVision datasets, <span style="color:green">you can skip all these steps</span> and treat the new dataset as a drop-in replacement for the old one. <span style="color:red">That's because almost all datasets available in TorchVision have a similar API.</span>

They have common arguments like <span style="color:purple">transform</span> and <span style="color:purple">target_transform</span>, which transform the inputs and the labels/targets respectively.



# Transforms

These are image transforms applied to the data while training a network. Simple operations like cropping, resizing and normalizing are all examples of transforms. To apply multiple transforms to an image, chain them using the `Compose` class.

The `torchvision.transforms` documentation lists all the transforms available in TorchVision.

Some frequently used transforms:

- <span style="color:blue">`torchvision.transforms.ToTensor`</span> - It takes a PIL image (or a NumPy array of shape [H x W x C]) with values in the range [0, 255] and converts it to a float tensor of shape [C x H x W] in the range [0, 1].
- <span style="color:red">`torchvision.transforms.Compose`</span> - It chains several transforms together so that you can apply them all in one go.

Apart from these ready-made transforms, there are <span style="color:green">functional transforms</span> in `torchvision.transforms.functional`, which give you more control over the transformations because you pass every parameter explicitly on each call.


# Models

Just as `torchvision.datasets` contains popular datasets used for experimentation, `torchvision.models` has many well-known models for computer vision tasks such as:

- Classification
- Detection
- Segmentation
- Video Classification

The list keeps growing with time.

## Function Syntax

```python
model = torchvision.models.MODEL(pretrained=True)
```

Where,

- `MODEL` is the name of the model constructor, such as `alexnet` or `resnet18`. Check the full list of available models [here](#).
- `pretrained` is the flag that specifies whether the model should be initialized with pretrained weights. If set to `True`, the weights file is downloaded when it is not already present. In torchvision 0.13 and later this flag is deprecated in favor of a `weights` argument.

You can use these models for similar problems, or for a subset of the problem on which the pretrained model was trained. You can also treat them as a starting point and fine-tune them to perform a new task. More on fine-tuning later.


# Utils

The `utils` module has two functions that come in handy when working with images and presenting the results of your work.

## Make grid of images for display

```python
torchvision.utils.make_grid(tensor, nrow=8, padding=2, normalize=False, range=None, scale_each=False, pad_value=0)
```

Where,

- `tensor` – 4D mini-batch Tensor of shape (B x C x H x W) or a list of images all of the same size.
- `nrow` – Number of images displayed in each row of the grid. The final grid has ceil(B / nrow) rows and nrow columns. Default: 8.
- `padding` – Amount of padding, in pixels, between images. Default: 2.
- `normalize` – If True, shifts the image to the range (0, 1), using the min and max values specified by `range`. Default: False.
- `range` – Tuple (min, max) of numbers used to normalize the image. If None, the min and max are computed from the tensor. Newer torchvision releases name this argument `value_range`. Default: None.
- `scale_each` – If True, scale each image in the batch of images separately rather than (the min, max) over all images. Default: False.
- `pad_value` – Value for the padded pixels. Default: 0.

## Save Image

```python
torchvision.utils.save_image(tensor, fp, nrow=8, padding=2, normalize=False, range=None, scale_each=False, pad_value=0, format=None)
```

If you pass a mini-batch to `save_image`, it saves the images as a single grid. `fp` is the output filename or file object, and the other arguments behave as in `make_grid`.

# IO

As the name suggests, the `io` module is designed to perform IO operations such as reading and writing media files. At the time of writing it supported only video reading and writing. Later releases added image functions such as `read_image`.

## Read Video

```python
torchvision.io.read_video(filename, start_pts=0, end_pts=None, pts_unit='pts')
```

Where,

- `filename` – The name of the video file to read.
- `start_pts` – The start presentation timestamp of the video to read. Default: 0.
- `end_pts` – The end presentation timestamp of the video to read. Default: None.
- `pts_unit` – The unit of the presentation timestamps (pts) in `start_pts` and `end_pts`. Default: 'pts'.

It reads a video from `filename` and returns the video frames, the audio frames, and a dictionary of metadata. You can also specify the timestamps between which you want to read the video.

## Write Video

```python
torchvision.io.write_video(filename, video_array, fps, video_codec='libx264', options=None)
```

Where,

- `filename` – The name of the video file to write.
- `video_array` – The array of video frames to write.
- `fps` – The frames per second (fps) rate for the written video.
- `video_codec` – The codec to use for the written video. Default: 'libx264'.
- `options` – Additional options to pass to the video writer. Default: None.

It writes a 4D tensor of frames to a video file. `video_array` is expected to be a `uint8` tensor in [T x H x W x C] format, where T is the number of frames.


# Ops

The ops module implements some functions used for specific computer vision tasks. Some of them are:

- <span style="color:lightblue">Non-Maximum Suppression</span> - Used in object detection pipelines to discard overlapping boxes
- <span style="color:red">Region of Interest Pooling</span> - Used in the Fast R-CNN paper
- <span style="color:green">Region of Interest Alignment</span> - Used in the Mask R-CNN paper

These are mentioned mainly for the sake of completeness. You will rarely need to call them directly.
