<a href="https://colab.research.google.com/github/LuluW8071/Data-Science/blob/main/Pytorch/07_PyTorch_Paper_Replicating/00_PyTorch_Paper_Replicating.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Paper Replicating

### [Resource](https://www.learnpytorch.io/08_pytorch_paper_replicating)

We're going to replicate a **machine learning research paper** and build the Vision Transformer (ViT), a state-of-the-art computer vision architecture, from scratch using PyTorch.

<img src="https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/08-vit-paper-applying-vit-to-food-vision-mini.png">

## What is paper replicating?

Paper replicating refers to the process of reproducing the results of a published research paper. This involves independently conducting the same experiments or analyses as the original study to determine if the same findings can be obtained. Replication is a fundamental aspect of the scientific method because it helps to verify the reliability and validity of research findings.

In machine learning, the goal of **paper replicating** is to reproduce a paper's advances in code so you can apply its techniques to your own problem.

<img src = "https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/08-vit-paper-what-is-paper-replicating-images-math-text-to-code.png">

*Machine learning paper replicating involves turning a machine learning paper made up of images/diagrams, math and text into usable code, in our case usable PyTorch code. Diagram, math equations and text from the [ViT paper](https://arxiv.org/abs/2010.11929).*

## What is a machine learning research paper?

A machine learning research paper is a scientific paper that details the findings of a research group in a specific area.

The contents vary from paper to paper, but they generally follow this structure:

| Section      | Contents                                                                                      |
|--------------|-----------------------------------------------------------------------------------------------|
| Abstract     | A summary of the main findings and contributions of the paper.                                |
| Introduction | The main problem addressed by the paper and a review of previous methods used to tackle it.   |
| Method       | The approach taken by the researchers, including models, data sources, and training setups.   |
| Results      | The outcomes of the research, comparing new models or setups with previous work.              |
| Conclusion   | The limitations of the proposed methods and suggestions for future research directions.       |
| References   | The sources and papers referenced by the researchers to support their work.                   |
| Appendix     | Additional resources or findings that weren't included in the main sections of the paper.     |

## Why replicate a machine learning research paper?

A machine learning research paper is often a presentation of months of work and experiments done by some of the best machine learning teams in the world condensed into a few pages of text.

And if these experiments lead to better results in an area related to the problem you're working on, it'd be nice to check them out.

Also, replicating the work of others is a fantastic way to practice your skills.

<img src = "https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/08-george-hotz-quote.png">


## Where can you find code examples for machine learning research papers?

- [arXiv](https://arxiv.org/)
- [AK Twitter](https://twitter.com/_akhaliq)
- [Papers with Code](https://paperswithcode.com/)
- [Google Scholar](https://scholar.google.com/)
- [lucidrains' `vit-pytorch` GitHub repository](https://github.com/lucidrains/vit-pytorch)

## What we're going to cover

We will replicate the machine learning research paper [*An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*](https://arxiv.org/abs/2010.11929) (ViT paper) using PyTorch. By replicating this paper, we aim to understand the process and gain momentum for future replications.

The **Transformer neural network architecture**, introduced in the paper [*Attention is All You Need*](https://arxiv.org/abs/1706.03762), was originally designed for one-dimensional text sequences and is characterized by its use of the attention mechanism as the primary learning layer.
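
To make "attention as the primary learning layer" a little more concrete, here is a minimal sketch of scaled dot-product attention, the core operation described in *Attention is All You Need*. It is a simplification: it leaves out the learned query/key/value projections and the multiple heads of a full multi-head attention layer, and the function and variable names are illustrative rather than taken from either paper.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v):
    """Simplified attention for tensors shaped (batch, seq_len, dim)."""
    d_k = q.size(-1)
    # Similarity of every token with every other token, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Turn each row of similarities into weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Each output token is a weighted mix of the value vectors
    return weights @ v, weights


torch.manual_seed(42)
tokens = torch.randn(1, 5, 8)  # 1 sequence, 5 tokens, 8 features per token
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention
print(out.shape)   # same shape as the input sequence
print(attn.shape)  # one weight per (query token, key token) pair
```

Passing the same tensor as query, key and value is what makes this *self*-attention: every token in the sequence attends to every other token in the same sequence.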

The **Vision Transformer** (ViT) adapts the Transformer architecture for vision problems, starting with image classification. Although the ViT has evolved over time, we will focus on replicating the original version, known as the "vanilla Vision Transformer". Understanding the original makes it easier to adapt to newer versions.

Our goal is to build the ViT architecture according to the original paper and apply it to the FoodVision Mini dataset.

| Topic                                        | Contents                                                                                                               |
|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| 0. Getting setup                             | We've written useful code in previous sections, so let's download it and ensure we can use it again.                   |
| 1. Get data                                  | Let's get the pizza, steak, and sushi image classification dataset and build a Vision Transformer to improve FoodVision Mini results. |
| 2. Create Datasets and DataLoaders           | We'll use the `data_setup.py` script from chapter 05, PyTorch Going Modular, to set up our DataLoaders.                  |
| 3. Replicating the ViT paper: an overview    | Replicating a machine learning paper can be challenging, so let's break down the ViT paper into smaller parts to replicate it step by step. |
| 4. Equation 1: The Patch Embedding           | The ViT architecture has four main equations; the first is the patch and position embedding, which turns an image into a sequence of learnable patches. |
| 5. Equation 2: Multi-Head Self-Attention (MSA) | The self-attention/multi-head self-attention (MSA) mechanism is central to every Transformer, including ViT. Let's create an MSA block using PyTorch's built-in layers. |
| 6. Equation 3: Multilayer Perceptron (MLP)   | The ViT uses a multilayer perceptron in its Transformer Encoder and output layer. Let's create an MLP for the Transformer Encoder. |
| 7. Creating the Transformer Encoder          | A Transformer Encoder has alternating layers of MSA (equation 2) and MLP (equation 3) connected via residual connections. Let's stack the layers from sections 5 & 6 to create one. |
| 8. Putting it all together to create ViT     | We have all the components to create the ViT architecture, so let's combine them into a single class for our model.     |
| 9. Setting up training code for our ViT model| Training our custom ViT is similar to the other models we've trained. With our `train()` function in `engine.py`, we can start training with a few lines of code. |
| 10. Using a pretrained ViT from `torchvision.models` | Training a large model like ViT needs a lot of data. Since we have a small dataset, let's use transfer learning to improve our results. |
| 11. Make predictions on a custom image       | The magic of machine learning is seeing it work on your own data. |
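
As a quick preview of the patch embedding in topic 4, the arithmetic behind "an image is worth 16x16 words" can be sketched in a few lines. The 224px image size and 16px patch size come from the ViT paper; the three color channels are an assumption for RGB images, and the variable names are illustrative.

```python
image_size = 224      # height and width in pixels
patch_size = 16       # each patch is 16x16 pixels
color_channels = 3    # assumption: RGB images

# The image must divide evenly into patches
assert image_size % patch_size == 0

# Number of patches = (patches per row) * (patches per column)
num_patches = (image_size // patch_size) ** 2

# Each patch flattened into one vector of pixel values
patch_dim = patch_size * patch_size * color_channels

print(f"Number of patches: {num_patches}")  # 196
print(f"Values per patch:  {patch_dim}")    # 768
```

So under these settings a single image becomes a sequence of 196 patches, which plays the role that a sequence of word tokens plays in a text Transformer.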



## 0. Getting setup

We'll start with the regular imports and device-agnostic code, and this time we'll also get the `helper_functions.py` script from GitHub.

To save writing extra code, we'll reuse some of the Python scripts (`data_setup.py`, `dataset.py` and `engine.py`) we created in an earlier section, [05_PyTorch_Going_Modular](https://github.com/LuluW8071/Data-Science/tree/main/Pytorch/05_PyTorch_Going_Modular/going_modular).

- `data_setup.py`: downloads the dataset
- `dataset.py`: contains the train and test dataloader functions
- `engine.py`: contains the train and test loop functions


In [None]:
# Importing Libraries
import torch
import torchvision
import torch.nn as nn
import matplotlib.pyplot as plt
from torchvision import transforms

# Setting up device agnostic code
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cpu'
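
For reference, a minimal sketch of how the `device` variable is typically used: both the model and the data are moved to the same device before the forward pass. The tiny `layer` and the random tensor `x` are placeholders for illustration, not part of the ViT model we'll build.

```python
import torch

# Same device-agnostic pattern as above
device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and data (illustrative only)
layer = torch.nn.Linear(in_features=3, out_features=1)
x = torch.rand(2, 3)

# Move both to the target device; mixing devices raises a runtime error
layer = layer.to(device)
x = x.to(device)

y = layer(x)
print(y.shape, y.device)
```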

In [None]:
# Install torchinfo only if it isn't already available
try:
    from torchinfo import summary
except ImportError:
    !pip install torchinfo
    from torchinfo import summary

# Download and Load the going_modular scripts
try:
    from going_modular import data_setup, dataset, engine
    from assets import helper_functions
except ImportError:
    # Get the going_modular scripts
    print("[INFO] Couldn't find going_modular scripts... downloading them from GitHub.")
    !git clone https://github.com/LuluW8071/Data-Science/
    !mv Data-Science/assets .
    !mv Data-Science/Pytorch/05_PyTorch_Going_Modular/going_modular .
    !rm -rf Data-Science
    from going_modular import data_setup, dataset, engine
    from assets import helper_functions

## 1. Get Data

In [None]:
# Setup directory paths to train and test images
train_dir = "dataset/train"
test_dir = "dataset/test"
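
Before building DataLoaders, it can help to confirm the directories contain what we expect. The sketch below assumes the common one-subfolder-per-class layout (for example `dataset/train/pizza/`), which is not confirmed by this notebook; `count_images_per_class()` and `IMAGE_EXTENSIONS` are hypothetical helpers written for this check, not part of the `going_modular` scripts.

```python
from pathlib import Path

# Assumption: images use one of these file extensions
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Return {class_name: image_count} for a folder of class subfolders."""
    root = Path(root)
    if not root.is_dir():
        # Nothing downloaded yet (or wrong path): report no classes
        return {}
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


for split_dir in ("dataset/train", "dataset/test"):
    print(split_dir, count_images_per_class(split_dir))
```

An empty dictionary for either split suggests the dataset has not been downloaded yet or the path is wrong.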

## 2. Create Datasets and DataLoaders

We can use the `create_dataloaders()` function in `data_setup.py` to load the images, transformed to the size used in the ViT paper: 224px high and 224px wide.

<img src = "https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/08-vit-paper-image-size-and-batch-size.png">

*You can often find various hyperparameter settings listed in a table. In this case we're still preparing our data, so we're mainly concerned with things like image size and batch size. Source: Table 3 in [ViT paper](https://arxiv.org/abs/2010.11929).*

And since we'll be training our model from scratch (no transfer learning to begin with), we won't provide a `normalize` transform.

### 2.1 Prepare transforms for images

In [None]:
# Transforming images
img_transform = transforms.Compose([transforms.Resize((224, 224)),
                                    transforms.ToTensor()])

img_transform

Compose(
    Resize(size=(224, 224), interpolation=bilinear, max_size=None, antialias=True)
    ToTensor()
)