# 09 - PyTorch Paper Replication #

This time we'll replicate a machine learning paper by building a Vision Transformer (ViT) from scratch with PyTorch. The ViT is a different architecture from the neural networks we've been working on so far, so expect something new and exciting, if a bit intimidating.

![Display](images/08-vit-paper-applying-vit-to-food-vision-mini.png)

**What is Paper Replicating?**

Machine learning is a field that moves very fast, with new findings appearing all the time. These findings are often published as machine learning research papers on the internet.

The goal of *paper replicating* is to reproduce those findings in code so that you understand the technique and can apply it to your own problem, adding another tool to your toolbox.

For example, say a new architecture comes out that claims to beat the current architectures, and the paper shows benchmarks to back this up. Why not try that new architecture on your own problem and leverage the benefits?

![Display](images/08-vit-paper-equation-3-mapped-to-code.png)

**What is a Machine Learning Paper?**

Machine learning papers follow the usual scientific paper layout. The contents vary, but for the most part the structure stays the same. You'll usually see the following:

1. *Abstract* - Overview/Summary of the paper's findings and contributions.
2. *Introduction* - Discusses the paper's main problem and gives a short history of the methods previously used to try to solve it.
3. *Method* - Details how the researchers did their work. They discuss the model(s), data source, and training setups.
4. *Results* - The outcome of the entire paper. What the new model / setup resulted in and how it compares to previous works. This is where experiment tracking comes into play because you can easily compare different models.
5. *Conclusion* - Describes the limitations of the methods and what should be done next to further the field.
6. *Appendix* - Extra resources to share and use that weren't included in the previous sections.

**Why Replicate Machine Learning Papers?**

Machine learning papers present the results of months of work and experiments by some of the best machine learning engineers and teams out there, condensed into a form that other people can read and use.

It's always interesting to find out whether these new approaches really do perform better, or whether they could solve your specific problem, so they're worth checking out. Beyond that, replicating the work of the best in the industry is simply good practice, and you gain new experience along the way.

One good rule to follow is this:

1. Download a paper.
2. Implement it.
3. Repeat until you're confident.

But that's easier said than done. Replicating research papers with only a basic understanding of machine learning isn't exactly 'convenient'; it's actually pretty challenging.

That's perfectly normal. You wouldn't expect to walk into a bouldering gym for the first time and know what everyone there is doing.

The research teams behind these papers spend months to years on them, and you're reading the result in the span of a couple of hours or days. Expecting to absorb all that condensed information from the get-go isn't practical. It takes time to understand and to replicate, so be patient.

Replicating papers is so hard that there are libraries built specifically to make machine learning research accessible to the general public. *HuggingFace*, *PyTorch Image Models* (timm), and *Fast.ai* are examples.

**Where To Find Code Examples For Machine Learning Research Papers?**

There's one thing to keep in mind when looking for machine learning research papers: there are a lot of them and the number keeps growing by the day. Don't expect to keep up with the ever-growing ecosystem; just stick with one interesting paper that caught your attention and learn from it.

Here are a few resources for finding machine learning papers and (hopefully) the code that comes with them:

*arXiv* - Pronounced 'archive', it is a free and open portal for reading technical articles. It isn't limited to machine learning; it also covers other fields such as physics and computer science in general.

*AK Twitter* - A Twitter account that posts machine learning research highlights, often with live demos. It isn't easy to search through, but it's a good place to spot interesting papers.

*Papers with Code* - A curated collection of machine learning papers with the code attached to them. It also contains machine learning datasets, benchmarks, and the current best-performing models.

**Chapter Coverage**

Alright, that's enough discussion; it's time to get hands-on and replicate a paper. There's nothing better than actually doing it to start learning. The process might differ from paper to paper, but going through one will give you momentum and confidence when dealing with other papers out there in the wild.

So which paper are we going for? We'll be replicating [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) (the ViT paper) with PyTorch. ViT is built on the Transformer neural network architecture, which was introduced in an earlier paper, [Attention is all you need](https://arxiv.org/abs/1706.03762).

The Transformer architecture was originally designed to work on one-dimensional sequences of text.

The **Transformer Architecture** is generally considered to be any neural network that uses the **attention mechanism** as its primary learning layer. Quite confusing? Not really! Remember convolutional neural networks (CNNs)? Their primary learning layer is the convolution. The same concept applies to Transformers, but this time the primary learning layer is the attention mechanism.
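To make "attention as the primary learning layer" a little more concrete, here is a minimal sketch of scaled dot-product self-attention, the form described in the *Attention is all you need* paper. The function name and the random projection matrices are placeholders for illustration only; in a real model the projections are learned, and the attention layer has multiple heads.

```python
import math
import torch

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence.

    x:   (batch, seq_len, dim) input sequence
    w_*: (dim, dim) projection matrices (random placeholders in this sketch)
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    # Similarity of every position with every other position, scaled by sqrt(dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    # Softmax turns each row of scores into weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Each output position is a weighted mix of all value vectors
    return weights @ v, weights

torch.manual_seed(42)
x = torch.randn(1, 4, 8)  # 1 sequence, 4 tokens, 8 features per token
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out, weights = scaled_dot_product_self_attention(x, w_q, w_k, w_v)
print(out.shape)      # torch.Size([1, 4, 8])
print(weights.shape)  # torch.Size([1, 4, 4])
```

The output has the same shape as the input, which is what lets attention layers be stacked the same way convolutional layers are stacked in a CNN.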

As you might already have guessed from the name, the *Vision Transformer* (ViT) architecture is designed to make the original Transformer architecture work on vision problems. It started with classification and then gradually expanded into other areas.

The *Vision Transformer* has been implemented and extended many times, so many different variants exist. We're just going to focus on replicating the original paper, referred to as the *Vanilla Vision Transformer*. Why is that? Why not just use the latest? Well, once you get the main idea, the rest is pretty easy to follow.

So again, we're going to replicate the original *Vision Transformer* and apply it to our *FoodVision Mini* problem.

**1. Getting Setup** - Take all the helper and core functions that we've built before that we can use here to make things easier.

**2. Get Data** - Grab the pizza, steak, sushi image classification dataset.

**3. Create Datasets & DataLoaders** - Use the *data_setup.py* script to set up the DataLoaders.

**4. Replicating ViT Paper: Overview** - We'll go over the ViT paper and decipher its contents first before jumping into replicating it. We can't replicate something that we don't understand.

**5. Equation 1: Patch Embedding** - The original ViT architecture is built from four core equations. The first covers the patch & position embedding: basically turning an image into a sequence of learnable patch embeddings.

**6. Equation 2: Multi-Head Self-Attention (MSA)** - The self-attention/multi-head self-attention (MSA) mechanism is the main idea of every Transformer architecture, including the ViT architecture. This section focuses on creating an MSA layer with PyTorch.

**7. Equation 3: Multilayer Perceptron (MLP)** - The ViT architecture uses a multilayer perceptron inside its Transformer Encoder and for its output layer. We'll make an MLP for the Transformer Encoder.

**8. Creating The Transformer Encoder** - The Transformer Encoder alternates between MSA and MLP blocks. Each block is wrapped in a residual connection, which adds the block's input back to its output. We'll create a Transformer Encoder by stacking MSA & MLP layers on top of each other.

**9. Putting It All Together** - We'll combine everything from the previous steps into a single class, giving us our very first ViT model built from scratch.

**10. Preparing Training Code** - Luckily, training a ViT model doesn't change much. It's pretty similar to training the models we've built before, so we can still leverage our *train()* function from *engine.py*.

**11. Using Pretrained ViT from TorchVision** - ViT is a pretty large model and our dataset is much, much smaller than what it was designed for, so we can't reliably train such a big model from scratch. To save ourselves valuable time, we'll leverage transfer learning: use pretrained weights and see whether they improve performance.

**12. Making Predictions** - The happiest part of the entire process is seeing the final results and actually using the model. Just like before, we'll grab a custom image and see the results.
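As a preview of steps 5 to 8, here is a minimal sketch of a patch embedding followed by one Transformer Encoder block. It is not the chapter's final implementation. The class names `PatchEmbedding` and `EncoderBlock` are hypothetical, and using `nn.Conv2d` with `kernel_size` equal to `stride` to cut and project patches is a common implementation choice assumed here. The sizes (patch size 16, embedding dimension 768, 12 heads, MLP size 3072) are the ViT-Base values from the paper. The class token, position embedding, and dropout are left out to keep the sketch short.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch embeddings (part of Equation 1)."""
    def __init__(self, in_channels=3, patch_size=16, embed_dim=768):
        super().__init__()
        # A conv with kernel_size == stride slices the image into
        # non-overlapping patches and linearly projects each one.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)          # (batch, embed_dim, H/patch, W/patch)
        x = x.flatten(2)          # (batch, embed_dim, num_patches)
        return x.transpose(1, 2)  # (batch, num_patches, embed_dim)

class EncoderBlock(nn.Module):
    """One MSA block plus one MLP block, each with a residual connection."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        # Equation 2: layer norm -> multi-head self-attention -> add the input back
        normed = self.norm1(x)
        attn_out, _ = self.msa(normed, normed, normed, need_weights=False)
        x = x + attn_out
        # Equation 3: layer norm -> MLP -> add the input back
        return x + self.mlp(self.norm2(x))

image = torch.randn(1, 3, 224, 224)  # one random stand-in for a 224x224 RGB image
with torch.no_grad():
    patches = PatchEmbedding()(image)
    encoded = EncoderBlock()(patches)
print(patches.shape)  # torch.Size([1, 196, 768]) -> (224 / 16) ** 2 = 196 patches
print(encoded.shape)  # torch.Size([1, 196, 768])
```

The encoder block returns a tensor with the same shape as its input, which is why several of these blocks can be stacked one after another.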

**NOTE**: Even after you've gained valuable experience replicating papers, remember not to get bogged down in any single paper. New things are always being discovered and better methods come along as time goes on, so it's less about a particular paper and more about the skill of deciphering papers: turning the math and the words on a page into actual working code.

**Quick Terminology**

Before proceeding, here are a couple of definitions to remember so that you won't get confused:

1. *ViT* - Short for Vision Transformer, the neural network architecture that we're going to focus on.

2. *ViT Paper* - Refers to the original machine learning paper that introduced the ViT architecture - [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929). Any mention of the ViT paper refers to the paper itself and not the architecture.

**Setting Up**

