# Introduction To PyTorch - Learning Path

This notebook serves as a guide to all the great resources available online for you to learn PyTorch. The notebook has no explicit code because there are possibly thousands of Introduction to PyTorch tutorials.

Now, why is this one different? The intention is to lay a path to becoming a serious accelerated computing professional. The world's top researchers and industry scientists use the PyTorch framework to train large and small models across hundreds of thousands of GPUs.

While you start with an introduction, the hope is that you gain such a strong intuition over time that you can learn to train accurate models at scale. This becomes a mix of understanding distributed computing and deeper levels of math used in machine learning. The PyTorch Framework provides efficient abstractions of all the building blocks needed to build these robust deep learning systems.

The [PyTorch website](https://pytorch.org/) is the top place to read tutorials, documentation, and release notes. You can think of PyTorch as an API-style library that uses its building blocks to compose your deep learning systems. See the [PyTorch Tutorials](https://pytorch.org/tutorials/) for great notebooks for starting.

What you will discover in machine learning is that people are trying to solve problems in specific domains such as speech processing, language modeling, computer vision. Hence, you will see that PyTorch has domain-specific libraries to help optimize those workflows and building blocks. Some libraries accelerate other parts of the machine learning pipeline, such as data loading. PyTorch is built so you can write code that runs on a single GPU or CPU and scale to thousands of GPUs.  See some of the [domain-specific libraries](https://pytorch.org/pytorch-domains):

- [torchaudio](https://pytorch.org/audio/stable/index.html) - audio and signal processing
- [torchvision](https://pytorch.org/vision/stable/index.html) - Datasets, model architectures, and common image transformations for computer vision - part of the PyTorch framework
- [TorchServe](https://pytorch.org/serve/) - a performant, flexible tool for serving PyTorch models in production (using them after training). This project is officially discontinued, but the blogs still have great patterns to learn from.
- [torcharrow](https://github.com/pytorch/torcharrow) - a data preprocessing library that uses the [Apache Arrow](https://arrow.apache.org/overview/) data format
- [torchrec](https://github.com/pytorch/torchrec) - train Recommender System (RecSys) models using scalable building blocks

# The Best Way to Learn PyTorch
According to [Jeremy Howard](https://x.com/jeremyphoward), founder of the popular deep learning community [fast.ai](https://www.fast.ai/), the best way to learn is by doing (coding). He claims that you don't need to learn a bunch of complex math to get going. There is a lot of math to learn in machine learning, but most deep learning software packages do the math for you, so what you need to know is how to use the APIs, in this case the PyTorch APIs.

He created the very popular course [Practical Deep Learning for Coders](https://course.fast.ai/), which a number of people have graduated from and have gone on to become senior deep learning researchers at NVIDIA.

See these interviews with famous fast.ai Fellows.
- [Even Oldridge, PhD](https://www.linkedin.com/in/even-oldridge/) - Director of Engineering at NVIDIA
  - [Interview](https://youtu.be/-WzXIV8P_Jk?si=ARW4z8mfHP3TfuNZ&t=646) - Watch each lecture all the way through to build understanding, download it as an MP3 and listen to it, do Kaggle competitions, and pick a problem outside the teaching materials
- [Sanyam Bhutani](https://www.linkedin.com/in/sanyambhutani/) - Host of Chai Time Data Science Podcast + Partner Engineering at Meta
- Radek Osmulski - Senior Deep Learning Scientist at NVIDIA (no college degree)
  - [Interview](https://www.youtube.com/watch?v=CkPrDBzD1Hs)
  - [Book on Meta Learning](https://radekosmulski.com/) - How to learn Deep learning, written from his experience doing the fast.ai course.

Most importantly, your job as a PyTorch practitioner is to train models, not necessarily study how to train models, so implement as many model trainings as possible on as many different datasets. Some of the people mentioned before leveraged Kaggle as a mechanism for learning by doing competitions. This is not always the easiest way to start because it can be intimidating. [Kaggle](https://www.kaggle.com/) is still one of the best places to get data for training on real world problems.


#### Using Coding Assistants

Use a coding assistant that can explain every line of PyTorch code, and use it to sharpen your intuition. The hard part is that if an assistant writes the code for you, you may not spend much time writing code yourself, so strike a balance by retyping code so you learn specific APIs. Always cross-reference the PyTorch documentation for up-to-date APIs, as coding assistants tend to have static knowledge. Very senior engineers use coding assistants to accelerate their development, and you should too.

#### Why are you Learning PyTorch?

Knowing why you want to learn PyTorch and having a clearly defined goal will give you the determination to get through the complex learning steps. Many times, if you are working on a custom problem where an example does not exist, you may run into odd errors, so read the [PyTorch forums](http://discuss.pytorch.org/) for help (after you ask a coding assistant). More importantly, having a good reason will guide you to find people in the community from whom you can learn and grow.


#### Using Abstraction Libraries
An abstraction library takes the lower-level PyTorch APIs and groups them into easier-to-use building blocks that reduce the amount of code you write and help you adhere to better design patterns. [Fast.ai](https://www.fast.ai/) and [PyTorch Lightning](https://github.com/Lightning-AI/pytorch-lightning) are two examples. Fast.ai is more of a learning abstraction library, and PyTorch Lightning is a larger open source project used at the largest scale at some companies. NVIDIA's [NeMo](https://github.com/NVIDIA/NeMo) library for training Large Language Models, multi-modal models, Text-to-Speech models, and more is built on PyTorch Lightning.

#### What Does it Mean to Learn PyTorch?
If you learn PyTorch well, you will understand the fundamentals of machine learning and deep learning and have a strong command of training and inference on distributed systems.

Your goal in training a model is to find the best-performing parameters of the model that satisfy some objective function. Typically, the goal is to minimize the error between a predicted output vs the actual output. People create custom objective functions to achieve their desired modeling goals.

A model learns from a dataset. During the process, you have to split your dataset into different groups so you can detect when the model [overfits](https://stackoverflow.com/questions/52009816/how-to-know-if-underfitting-or-overfitting-is-occuring) (memorizes the answers) on your data. This data is typically stored in host (CPU) memory and needs to be moved to the GPU to accelerate training. The model may have to train on the same data many times; each full pass through the data is known as an [epoch](https://stackoverflow.com/questions/4752626/epoch-vs-iteration-when-training-neural-networks). In each epoch the model learns a bit more: you measure the error and then adjust the parameters for the next round in proportion to the error in the model. This is where you will hear terms such as [gradients](https://builtin.com/data-science/gradient-descent#:~:text=A%20gradient%20simply%20measures%20the,zero%2C%20the%20model%20stops%20learning.) and backpropagation. Once the model is trained, you check its correctness against a validation dataset as a proxy for how well it will perform on unseen data.
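
As a minimal sketch of the split-and-epoch idea described above, the snippet below holds out part of a synthetic dataset for validation and counts how many training samples are visited over a few epochs. The dataset size, the 80/20 split, and the seed are illustrative assumptions, not recommendations.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic data: 100 samples with 4 features each and one target value.
features = torch.randn(100, 4)
targets = torch.randn(100, 1)
dataset = TensorDataset(features, targets)

# Hold out 20% of the samples so the model can be scored on data it never trained on.
generator = torch.Generator().manual_seed(0)
train_set, val_set = random_split(dataset, [80, 20], generator=generator)

num_epochs = 3
samples_seen = 0
for epoch in range(num_epochs):
    # One epoch is one full pass over the training split.
    for x, y in train_set:
        samples_seen += 1

print(len(train_set), len(val_set), samples_seen)
```

No learning happens here; the loop only shows that three epochs over 80 training samples touch each sample three times, while the 20 validation samples are never used for training.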

When you use PyTorch, you will leverage different APIs to perform each of these high-level steps. Other than training a more accurate model, the most challenging part of your PyTorch journey is optimizing your training to fit on limited computing resources. Every person wants their model to train faster. Once you start doing distributed training, the model training process involves loading, moving (communicating), generating, and storing lots of data from the CPU to the GPU and across GPUs.

As you progress in your PyTorch journey, you will see that the APIs are designed to work across multiple GPUs to increase the computational performance and reduce the time to train the model. As you go deeper, you will pay more attention to computational bottlenecks between having finite memory and compute cores on the GPU and CPU. The quality of the learning done by the model changes as you start to make the model bigger and have to split it across multiple GPUs.


## Other Introduction to PyTorch Learning Material

Here are some other great books that you can leverage to learn PyTorch. **Note** that the code in books can become outdated because the PyTorch project is constantly being updated, so you may need to pin to a PyTorch version.
- [Machine Learning with PyTorch and Scikit-Learn](https://sebastianraschka.com/books/#machine-learning-with-pytorch-and-scikit-learn) by [Sebastian Raschka](https://www.linkedin.com/in/sebastianraschka/)
- [Lightning AI Education Portal](https://lightning.ai/ai-education/) by [Lightning AI - Creators of PyTorch Lightning](https://www.linkedin.com/in/wfalcon/)
- [Build a Large Language Model From Scratch](https://sebastianraschka.com/books/) by [Sebastian Raschka](https://www.linkedin.com/in/sebastianraschka/)
- [Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications](https://www.amazon.com/Programming-PyTorch-Deep-Learning-Applications/dp/1492045357) by Ian Pointer
- Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD by [Jeremy Howard](https://x.com/jeremyphoward)
- [YouTube PyTorch Tutorials](https://www.youtube.com/playlist?list=PLhhyoLH6IjfxeoooqP9rhU3HJIAVAJ3Vz) by [Aladdin Persson](https://www.linkedin.com/in/aladdin-persson-a95384153/)
- [24 Hour Full PyTorch Course - PyTorch for Deep Learning & Machine Learning – Full Course](https://www.youtube.com/watch?v=V_xro1bcAuA) by [Daniel Bourke](https://www.linkedin.com/in/mrdbourke/)


# Topics to Understand for Machine Learning, Deep Learning and PyTorch

### Installing PyTorch
When you are first starting, installation of Deep Learning software on a home machine can sometimes be very nuanced. The fastest way to get started with PyTorch code (especially using the GPU) is to use online GPU platforms, as they will typically have PyTorch notebooks with one click. This will become even more important once you start learning beyond a single GPU.
- [Google Colab](https://colab.research.google.com/)
- GPU providers such as [Runpod](https://www.runpod.io/) and [Brev.dev](https://www.nvidia.com/en-us/launchables/pricing/)

See the [official installation page for PyTorch](https://pytorch.org/get-started/locally/).

When using NVIDIA GPUs, you will need to install a PyTorch version that is compatible with the [CUDA version installed](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/) on your machine. See the [PyTorch Documentation](https://pytorch.org/docs/stable/torch.html), which states that you need a GPU with CUDA compute capability >= 3.0.

Read more information on the [compute capability](https://developer.nvidia.com/cuda-gpus) of your GPU.
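
As a quick check after installation, a sketch like the following reports which PyTorch build is installed, which CUDA version it was built against, and the compute capability of the first visible GPU. It is written to fall back gracefully on a CPU-only machine.

```python
import torch

print("PyTorch version:", torch.__version__)
# CUDA version this PyTorch build was compiled against (None for CPU-only builds).
print("Built with CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU 0: {name}, compute capability {major}.{minor}")
else:
    print("No CUDA device visible; PyTorch will run on the CPU.")
```

If `torch.cuda.is_available()` returns `False` on a machine that has an NVIDIA GPU, a driver or CUDA version mismatch is a common place to start looking.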


#### Short Note on Installing CUDA
A short note on installing CUDA, the "programming language" for NVIDIA GPUs. There are two parts to installing CUDA on your system. The first is the device driver, which allows your computer to see the NVIDIA GPU (sometimes referred to as a CUDA Device). This is typically very specific to your system configurations and operating system. To the computer, the GPU might be viewed as another graphics device. In many installation instructions, you have to override graphics settings so as not to conflict with your primary graphics device on the machine. Once you install the device driver, this doesn't mean that you can start doing deep learning on the GPU.

The next step is to install the CUDA Toolkit, which must be compatible with the device driver you installed previously. If you install the incorrect Toolkit for the device driver, something will break when you use PyTorch. Think of the CUDA Toolkit as a set of high-performance instructions that leverage the hardware's capabilities. Once you install the CUDA Toolkit, it will detect all the NVIDIA GPUs on your system, and you can use all the devices. In some multi-GPU systems, you will have a dedicated (non-compute) GPU that renders the graphics, but it will still be detected as a compute device. Pay attention to the device IDs (starting at index 0, 1, 2, etc.) in the output of the `nvidia-smi` command to make sure PyTorch uses a compute GPU, not a graphics GPU.

See how [PyTorch allocates IDs to your CPUs and GPUs](https://pytorch.org/docs/stable/tensor_attributes.html#torch.dtype), which you can use for finer-grained control over where data is transferred.
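
To make the device IDs concrete, here is a small sketch that lists the CUDA devices PyTorch can see and then places a tensor on an explicitly chosen one. The choice of `cuda:0` is only an example, and the code falls back to the CPU when no GPU is visible.

```python
import torch

# List every CUDA device PyTorch can see, with the index PyTorch assigned to it.
num_gpus = torch.cuda.device_count()
for index in range(num_gpus):
    print(index, torch.cuda.get_device_name(index))

# Pick an explicit device; fall back to the CPU when no GPU is visible.
device = torch.device("cuda:0") if num_gpus > 0 else torch.device("cpu")
x = torch.ones(2, 3, device=device)
print(x.device)
```

The `CUDA_VISIBLE_DEVICES` environment variable controls which GPUs a process can see, so it is worth comparing the names printed here with the `nvidia-smi` output rather than assuming the indices line up.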



An easier way to start with PyTorch on your home computer is to use an NVIDIA PyTorch Container. This may be a bit daunting to set up if you have not worked with Docker containers. It gives you a containerized environment with a compatible version of PyTorch installed, which you can then launch on a multi-GPU machine as long as that machine can run NVIDIA Docker containers. The [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html) will enable you to use any NVIDIA software by simply pulling a container. See [NVIDIA PyTorch Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), a stable version of PyTorch maintained by NVIDIA.

### Understanding Tensors

In Deep Learning, you will work with scalars (just a single number), vectors (an array of numbers), and matrices. These can all be described as tensors in PyTorch, a fundamental data structure. During training and inference, your input data will get converted to some Tensor that has a shape and will interact with your model, which is just another group of Tensors. Interaction means doing matrix math between your data and the model.

Read this getting started guide for [PyTorch Tensors](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html).

Every Tensor has a shape, a type, and a device. The shape is the dimensions of the tensor, which must line up for the matrix math to work out. The type is the precision of the numbers stored inside the tensor (more on this below). The device is where the tensor lives: on the CPU or on a GPU. Where the tensor is located matters, and you should always be conscious of the overhead of moving data onto the GPU and retrieving data from the GPU back to the CPU.
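
A short sketch of these three attributes, using made-up values, might look like this:

```python
import torch

scalar = torch.tensor(3.0)               # 0 dimensions
vector = torch.tensor([1.0, 2.0, 3.0])   # 1 dimension
matrix = torch.zeros(2, 3)               # 2 dimensions

for t in (scalar, vector, matrix):
    print(tuple(t.shape), t.dtype, t.device)

# .to() returns a tensor on the target device
# (the same tensor if it is already there).
device = "cuda" if torch.cuda.is_available() else "cpu"
matrix_on_device = matrix.to(device)

# Shapes must line up for matrix math: (2, 3) @ (3,) -> (2,)
result = matrix @ vector
print(tuple(result.shape))
```

Printing `shape`, `dtype`, and `device` like this is a handy first debugging step when an operation fails with a shape or device mismatch.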

As you dive more into Tensors, you will likely get confused by all of the different APIs and things that you can do with them, in addition to changing the shapes of tensors. PyTorch and other frameworks are focused on writing the most efficient code, both in readability and runtime (how fast it runs). How you do matrix multiplication on the GPU on a small scale changes as the shape of the matrix grows. So, a significant part of your journey to becoming a PyTorch expert will be knowing how to structure your problem so that it runs fast on GPUs.

Read more about [Tensor Attributes](https://pytorch.org/docs/stable/tensor_attributes.html).

When PyTorch uses a CUDA-enabled device (NVIDIA GPU), it interacts with the GPU through the CUDA API. It is worth taking a glance at this [link](https://pytorch.org/docs/stable/cuda.html) once you have written some PyTorch code, as it will give you a view of more granular ways to communicate with the GPU(s).


### A Short Memo on Memory and Numerical Precision

A good mental model is that a Tensor contains a group of numbers. Each of those numbers has a numeric precision (think of the number of decimal places as a loose definition). The higher the numerical precision (or loosely decimal places), the more memory (on CPU and GPU) that number occupies. Math with higher-precision numbers is typically more accurate, but the computation is slower, consumes more memory, and takes longer to move from one device to another. Most of your professional journey will be spent optimizing how you use memory on the GPU so that you can do more work with the same compute resources.
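
To see the memory side of this tradeoff, the sketch below stores the same number of elements at several precisions and prints the footprint of each. The one-million-element size is arbitrary.

```python
import torch

# One million elements stored at different precisions.
for dtype in (torch.float64, torch.float32, torch.float16, torch.bfloat16, torch.int8):
    t = torch.zeros(1_000_000, dtype=dtype)
    megabytes = t.numel() * t.element_size() / 1e6
    print(f"{str(dtype):15s} {t.element_size()} bytes/element  {megabytes:.0f} MB")

# Lower precision also means coarser spacing between representable values.
print(torch.finfo(torch.float32).eps, torch.finfo(torch.float16).eps)
```

Halving the bytes per element halves the memory for the same tensor, and the larger `eps` for `float16` shows the accuracy given up in exchange.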

For your long-term PyTorch journey, you must understand how memory works in a computer. Ask your favorite coding assistant to explain a bit and a byte. You will be thinking a lot about using memory on the GPU, typically measured in Gigabytes. Each GPU has a finite amount of memory, and your job is to do computations within a limited memory budget. When people need more GPU memory, they add more GPUs. When you have multiple GPUs, they need to be connected to each other to transfer data back and forth efficiently. The speed of this connection matters. This is why [NVLink](https://blogs.nvidia.com/blog/what-is-nvidia-nvlink/) was invented to make GPU-to-GPU communication faster within a node.
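
As a rough worked example of budgeting GPU memory, the arithmetic below estimates how much memory the parameters of a model occupy at different precisions. The 7-billion-parameter model and the helper name `weights_gigabytes` are hypothetical, and the estimate covers only the weights.

```python
def weights_gigabytes(num_parameters: int, bytes_per_parameter: int) -> float:
    """Memory needed to hold the parameters alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


num_parameters = 7_000_000_000  # hypothetical 7-billion-parameter model
for label, nbytes in (("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)):
    print(f"{label:10s} {weights_gigabytes(num_parameters, nbytes):.0f} GB")
```

Training needs noticeably more than this, because gradients and other training state also live in GPU memory, which is one reason people add more GPUs.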

A node of GPUs is a single computer; think of a desktop. It typically has 8 GPUs connected with high-speed CPUs, memory, and storage (yes, this matters for performance). Once you start pooling multiple nodes of GPUs together, the networking between these computers becomes critical. This is why [NVIDIA purchased the networking company Mellanox](https://nvidianews.nvidia.com/news/nvidia-to-acquire-mellanox-for-6-9-billion) to be able to make larger systems of GPUs much faster.

Always look up how much memory a GPU has. See more information on individual GPUs: [T4-16GB](https://www.nvidia.com/en-us/data-center/tesla-t4/), [V100-16GB or 32GB](https://www.nvidia.com/en-gb/data-center/tesla-v100/), [L4-24GB](https://www.nvidia.com/en-us/data-center/l4/), [L40s-48GB](https://www.nvidia.com/en-us/data-center/l40s/), [A100-80GB](https://www.nvidia.com/en-us/data-center/a100/), [H100-80GB](https://www.nvidia.com/en-us/data-center/h100/), [H200-141GB](https://www.nvidia.com/en-us/data-center/h200/), [A6000-48GB](https://www.nvidia.com/en-us/design-visualization/rtx-a6000/), [RTX 4090](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/).

Not all GPUs are created equal. Each GPU is designed to perform well for different applications and under different performance loads. See this page for the programming guides of various architectures of [NVIDIA GPUs](https://docs.nvidia.com/cuda/index.html). When you are starting, use smaller (in terms of memory) GPUs so you can hone your skills and then advance to more powerful GPUs such as H100 and H200 when your workload increases both in size and complexity.

When it comes to performance, it's not only about the memory size but also the types of cores on the GPU. The cores determine the number of Ops or operations and calculations that can happen on the GPU.

GPU specification sheets list various integer and floating-point formats. Why are there so many formats? It comes down to the accuracy vs memory tradeoff. Integer formats cannot represent fractional values, so integer math is typically less accurate, but it is significantly faster and uses less memory.

What can be more confusing is that certain precisions are only available on specific GPUs. See this article on [numerical precision](https://dev-discuss.pytorch.org/t/more-in-depth-details-of-floating-point-precision/654) in deep learning. The Hopper family of GPUs (H100, H200) enabled the [FP8](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html) precision format, which allowed generative AI researchers to keep accuracy high while using less memory. The new [Blackwell generation of GPUs enables FP4](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/) (4-bit floating point), which is half the size of FP8 and is designed to retain comparable accuracy for many workloads. For industry professionals, this means you can move roughly twice as many values within the same memory bandwidth.

Read more about [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensor-cores/).

During Neural Network training, some model layers can be represented in a lower precision, so a single model can mix higher-precision and lower-precision layers. This is known as mixed precision or automatic mixed precision, where PyTorch chooses when to change the precision of the numbers in the model to make computation more efficient with little or no effect on accuracy. Once you start using PyTorch for serious projects, this will become very important. Read more in the [PyTorch Automatic Mixed Precision Docs](https://pytorch.org/docs/stable/amp.html) and the [PyTorch Automatic Precision Blog](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/).
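
A minimal sketch of automatic mixed precision with `torch.autocast` is shown below. The layer sizes are arbitrary, and the choice of `float16` on CUDA and `bfloat16` on CPU is an assumption made so the snippet runs on either kind of machine.

```python
import torch
from torch import nn

device_type = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is a common choice on CUDA; bfloat16 is widely supported on CPU.
amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

model = nn.Linear(8, 4).to(device_type)
x = torch.randn(16, 8, device=device_type)

# Inside the context, eligible ops such as the linear layer's matmul run in
# the lower precision; the stored weights stay in float32.
with torch.autocast(device_type=device_type, dtype=amp_dtype):
    out = model(x)

print(model.weight.dtype, out.dtype)
```

This only shows the forward pass. Training in `float16` usually also involves gradient scaling, which the AMP documentation linked above covers.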

You will get confused once you start digging deeper into numerical precision, but that's ok. It is part of the learning process, so do not give up. To summarize this section, your journey to become a PyTorch expert is simply about doing highly performant and accurate math to solve some business, hobby, or scientific research problem. Sometimes, your innovation might be a new model architecture that gives you more accuracy, or it may be changing the numeric precision to conserve compute resources so you train on more data and the model gets more accurate because it had more practice. Always do your best to understand the flow of data both during your training and inference phases.


### Tensor Operations

The previous section covered that PyTorch's Tensor Data Structure stores numbers and data types in different formats. A large part of your model training workflow is manipulating data. The Tensor data structure has many efficient operators for manipulating the data inside of Tensors. Keep in mind the type of data inside of the Tensor. Typically, you can only store one [type of numeric precision](https://pytorch.org/docs/stable/tensor_attributes.html#torch.dtype) in your Tensor.

To understand how a PyTorch Tensor is laid out in memory and how to inspect that layout, see this [page](https://pytorch.org/docs/stable/tensor_attributes.html#torch.dtype).

[Tensor Views](https://pytorch.org/docs/stable/tensor_view.html) - Remember that as a PyTorch developer, you think about optimizing memory usage. A view shares memory with the tensor it came from, while other operations create copies of the data, which is not always efficient and can lead to out-of-memory errors on the GPU. As you get more comfortable with PyTorch beyond the basics, read the documentation of the API you are using to see how it affects memory usage. This will sharpen your intuition and help you read production PyTorch code where experts use all these best practices.
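
The difference between a view and a copy can be checked directly. In this sketch, `data_ptr()` is used to compare the memory address of the first element of each tensor.

```python
import torch

base = torch.arange(6)          # values 0 through 5
view = base.view(2, 3)          # same memory, different shape
copy = base.clone().view(2, 3)  # new memory

# A view and its base point at the same storage; a clone does not.
print(view.data_ptr() == base.data_ptr())
print(copy.data_ptr() == base.data_ptr())

# Writing through the view changes the base tensor too.
view[0, 0] = 100
print(base[0].item(), copy[0, 0].item())
```

The view costs no extra memory, but edits made through it are visible in the original tensor, which is exactly the tradeoff to keep in mind.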

The primary operations you will be doing on Tensors are contained [here](https://pytorch.org/docs/stable/torch.html#module-torch), in the main torch API.

For specific APIs for tensor slicing, indexing, etc., see this [link](https://pytorch.org/docs/stable/torch.html#indexing-slicing-joining-mutating-ops).

This is a good video to digest on [Tensor Operations](http://youtube.com/watch?v=TXZmaIvE9tw).
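
A few of the indexing, slicing, and joining operations can be sketched on a small tensor as follows. The values are arbitrary.

```python
import torch

x = torch.arange(12).reshape(3, 4)   # rows: [0..3], [4..7], [8..11]

first_row = x[0]           # shape (4,)
last_column = x[:, -1]     # shape (3,)
block = x[1:, :2]          # rows 1-2, columns 0-1 -> shape (2, 2)
evens = x[x % 2 == 0]      # boolean mask -> 1-D tensor of matching values

stacked_rows = torch.cat([x, x], dim=0)   # joins along an existing axis -> (6, 4)
new_axis = torch.stack([x, x], dim=0)     # adds a new leading axis -> (2, 3, 4)

print(tuple(stacked_rows.shape), tuple(new_axis.shape))
```

The distinction between `torch.cat` and `torch.stack` is a frequent source of shape errors, so printing the resulting shapes is a cheap sanity check.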

See this [link](https://pytorch.org/docs/stable/torch.html#module-torch) for some properties of tensors you can interact with. Many of the operations on tensors are implemented as (or referred to as) PyTorch Operators. The main reason to highlight this concept is that as you train more models on PyTorch, the version of PyTorch used becomes essential. Each version of PyTorch has a set of operators that it supports, and sometimes, when you get errors, the cause comes down to which operator is used.

For instance, `torch.add()` is an operator that allows you to add tensors together. However, the functionality of an operator may change across PyTorch versions (not very likely for add), so you might get an error when using an older version of PyTorch. The main takeaway as you use PyTorch with external libraries: many errors occur because an external library was written against a different version of PyTorch, so just be aware.
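
For illustration, here is `torch.add` in use alongside the installed version string, which is worth including whenever you report or search for an error.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

print(torch.__version__)             # record this when reporting errors

total = torch.add(a, b)              # same result as a + b
scaled = torch.add(a, b, alpha=2)    # computes a + 2 * b
print(total, scaled)
```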

When you read the logs of lots of PyTorch errors, you may see the word "ATen" pop up; you will see it in the forums. Much of the PyTorch performance exists because it is written in C++. [ATen](https://pytorch.org/cppdocs/#aten) is fundamentally a tensor library, on top of which almost all other Python and C++ interfaces in PyTorch are built. This is only mentioned to make this long-term learning path less mysterious.


### GPU Acceleration

Remember that CUDA is the lower-level interface for interacting with NVIDIA GPUs. As you progress in your journey, it is worthwhile to understand how PyTorch leverages CUDA under the hood. Read more about [CUDA Semantics in PyTorch](https://pytorch.org/docs/stable/notes/cuda.html).

See this important snippet from the documentation:
> Cross-GPU operations are not allowed by default, except copy_() and other methods with copy-like functionality such as to() and cuda(). Unless you enable peer-to-peer memory access, any attempts to launch ops on tensors spread across different devices will raise an error.

The [torch.cuda](https://pytorch.org/docs/stable/cuda.html#module-torch.cuda) API shows what CUDA-level primitives you can access in PyTorch. This will become more useful once you investigate more fine-grained memory management techniques and multi-GPU training.
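
In practice, the rule quoted above means tensors should be brought onto the same device before you combine them. A minimal sketch, assuming a single GPU at `cuda:0` and falling back to the CPU otherwise:

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

weights = torch.randn(4, 4, device=device)
batch = torch.randn(2, 4)             # created on the CPU by default

# Copy the batch to the device that holds the weights before combining them.
batch = batch.to(weights.device)
output = batch @ weights
print(output.device)
```

Moving the data explicitly with `.to()` keeps it obvious where each transfer happens, which helps when you later look for unnecessary copies.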

You can also check out [Working with CUDA in PyTorch](https://www.run.ai/guides/gpu-deep-learning/pytorch-gpu). See this guide from [Lightning AI](https://lightning.ai/) on [training on multiple GPUs](https://lightning.ai/docs/pytorch/1.6.2/accelerators/gpu.html#multi-gpu-training). This will help you understand why people scale up training on GPUs.

So far, we have discussed at a high level about PyTorch, and CUDA being the interface to leverage NVIDIA GPUs to accelerate and scale Deep Neural Network Training and Inference. The main library leveraged by PyTorch under the hood is [cuDNN - CUDA Deep Neural Network](https://developer.nvidia.com/cudnn). This library contains efficient, scalable implementations of common operations used in DNN workflows. When you call a PyTorch API on an NVIDIA GPU, you will likely be using cuDNN under the hood, which is built on CUDA.
- The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.
- Accelerated Learning - cuDNN provides [kernels](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#kernels) (programs that run on the GPU) that target Tensor Cores to deliver the best available performance on compute-bound operations. It offers heuristics for choosing the correct kernel for a given problem size.
- Expressive Op Graph API - The user defines computations as a graph of operations on tensors. The cuDNN library has both a direct C API and an open-source C++ frontend for convenience. Most users choose the frontend as their entry point to cuDNN.
- Fusion Support - cuDNN supports fusion of compute-bound and memory-bound operations. Common generic fusion patterns are typically implemented by runtime kernel generation. Specialized fusion patterns are optimized with pre-written kernels.

One concept to mention is that of Fusion. A PyTorch program on the GPU consists of multiple programs (kernels) that launch on the GPU. Each one reads data, processes it, writes the result, and passes control to the next kernel. Many lower-level performance optimizations come from combining several of these kernels into a single kernel. There is overhead in launching each new kernel, so it can be minimized by launching fewer kernels and doing more work per kernel launched on the GPU.
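
A small way to get a feel for this idea from Python is to compare two separate element-wise operators with a single built-in operator that does the same math. This is only an illustration of "more work per operator" using `torch.addcmul`; it is not how cuDNN or a compiler performs fusion.

```python
import torch

torch.manual_seed(0)
a = torch.randn(1000)
b = torch.randn(1000)
c = torch.randn(1000)

# Two operators: the product b * c is written out as a temporary tensor,
# then read back in by the add.
unfused = a + b * c

# One operator that computes a + b * c in a single call.
fused = torch.addcmul(a, b, c)

print(torch.allclose(unfused, fused, atol=1e-6))
```

Whether the single call is measurably faster depends on the device and the tensor sizes, so treat it as a concept demo and measure before relying on it.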


Read through this [Deep Learning Performance Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf) to gain a deeper appreciation for how computation problems are broken down and sped up using GPU computing. Another topic to become familiar with over time is that of [computational graphs](https://www.geeksforgeeks.org/computational-graphs-in-deep-learning/). Think of this as a way of keeping track of all the operations (and their ordering) that must happen to produce a successful result. Note that the computational graphs during training and inference may be slightly different.
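
PyTorch records a computational graph as operations run, and you can peek at it through the `grad_fn` attribute. The sketch below, on a tiny made-up tensor, also shows how the graph differs between a training-style and an inference-style computation.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Training-style computation: autograd records each operation.
y = (x * 2).sum()
print(y.grad_fn)                  # graph node for the final sum
print(y.grad_fn.next_functions)   # edges leading back to the multiply

y.backward()                      # walk the graph backwards
print(x.grad)                     # d(sum(2x))/dx is 2 for every element

# Inference-style computation: no graph is recorded.
with torch.no_grad():
    y_inference = (x * 2).sum()
print(y_inference.grad_fn)
```

Skipping graph recording during inference is one reason inference needs less memory than training.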


## Training Models in PyTorch

There are many great tutorials on the PyTorch Training Loop; these are the same steps you would take in any deep learning training framework, but these tutorials are more specific to the PyTorch API. See [tutorials](https://pytorch.org/tutorials/beginner/basics/intro.html).

The tutorials just mentioned cover seven steps. For each one, we have put an intuition primer you should have in your mind and links to other educational material that will deepen your understanding. It is not necessary to dive into each one of these extra links when you are first starting off, but remember this is a map that you will continuously come back to find the next path to take in your PyTorch journey.

1. **Tensors** - Learn the fundamentals of the Tensor Data Structure in PyTorch. Many of the previous links in this document cover this. Always keep in mind where your tensor is located (CPU or GPU, and which GPU), what its size and shape are, and whether an operation duplicates the data and wastes memory.
2. **Datasets and DataLoaders** - Think of a dataset as a processed set of data that is ready to be sent to the GPU for processing and the DataLoader as the transportation mechanism that will determine how many workers send data and how much data gets sent in batches. GPUs process data very quickly, so to keep them busy, you want to ensure enough data is sent to the GPU. Once you run through the initial tutorials, these [docs](https://pytorch.org/docs/stable/data.html) would be good to digest. You may not understand everything at first, but this is an area you will have to master over time.

3. **Transforms** - This is a bit more for training computer vision models. During the training process, there are several standard transformations you can do to images to make the training more accurate and get the data to be valid for consumption by your neural network.

4. **Build Model** - There are two phases here: first you define all the layers that your model will contain, then you define how those layers are connected together in a forward pass. A forward pass defines the flow of data through your model to produce the desired output. PyTorch has implementations of [standard layers](https://pytorch.org/docs/stable/nn.html) for different kinds of models, so you should use these.

5. **Automatic Differentiation** - You may have come across differentiation in your calculus class. When you compute the derivative of a function, you are asking how much your output variable changes based on a change in your input variable. In the case of Deep Neural Networks, we ask how the error changes when a change is made in the parameters of the model. PyTorch has an automatic way of doing this calculation at scale ([read more](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html?highlight=parameter)). This [video](https://youtu.be/VMj-3S1tku0?si=lLuiYx1XeKVqIVAL) from [Andrej Karpathy](https://x.com/karpathy) (former AI lead at Tesla and founding member of OpenAI) is a must-watch if you want to understand the deep internals of this topic.

6. **Optimization Loop** - An objective function is one of the keys to getting an accurate model. It is an equation that defines success. In many cases, it defines how to measure the error in a network as a function of the training parameters. Training adjusts the parameters to minimize this function. You may see the terms cost function, loss function, and objective function used almost interchangeably ([read more](https://stats.stackexchange.com/questions/179026/objective-function-cost-function-loss-function-are-they-the-same-thing)). Most times for training models, you will see the term "loss" being used. This indicates the total error at this point in the training process. The training loop consists of sending batches of data to your model; your model predicts its best answer using the current values of its parameters. Once you predict a value, you can compute an error against the true value from the training data. Given this error, you can find the gradient of each parameter (essentially how much, and in which direction, you should adjust the parameter to reduce the error) through backpropagation. Once you know how much to adjust each parameter, you update all the parameters in the network and train on the next batch. Each time, you measure the loss (error) of the model, and with more training iterations and epochs you hope to see it decrease. You will see the term optimizer used frequently. This refers to the algorithm that uses the gradients to update the parameters. Read more: [1](https://pytorch.org/docs/stable/optim.html), [2](https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-on-deep-learning-optimizers/). This is a complex topic to grasp at first, so be patient. One thing to keep in mind during training is that every trainable parameter in your model has a gradient. Think of this as a separate number describing how much that parameter affects the error of the whole model.
Storing these [gradients](https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad_.html), along with the intermediate values needed to compute them, consumes a large share of GPU memory, and you will learn over time many advanced techniques to optimize this. Gradients are only calculated during training; once you get to inference, the total memory needed by your model decreases.

7. **Save, Load, and Use Model** - Once your training loops are complete, you would have shown your model the training data multiple times and found the best parameters that would make the model most accurate to the objective you set. Now, you want to use your trained model to solve the problem. You get rid of the gradients and some other layers you don't need after training, and you save all the parameters of the model as a file. This file can have multiple formats, such as [PyTorch](https://pytorch.org/tutorials/beginner/saving_loading_models.html) and [ONNX](https://onnx.ai/), that can be used in an inference server.
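
The steps above can be sketched end to end on a toy problem. Everything specific here is an assumption chosen to keep the example small: the synthetic data for the line y = 3x + 1, the hypothetical `TinyRegressor` class, the SGD learning rate, and the temporary checkpoint path. Step 3 (Transforms) is skipped because the data is not images.

```python
import os
import tempfile

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Steps 1-2. Tensors, Dataset, DataLoader: noisy samples of y = 3x + 1.
inputs = torch.linspace(-1, 1, 256).unsqueeze(1)            # shape (256, 1)
targets = 3 * inputs + 1 + 0.05 * torch.randn_like(inputs)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=32, shuffle=True)


# Step 4. Build Model: layers in __init__, data flow in forward.
class TinyRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)


model = TinyRegressor().to(device)
loss_fn = nn.MSELoss()                                      # objective function
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)     # update rule

# Steps 5-6. Automatic differentiation inside the optimization loop.
epoch_losses = []
for epoch in range(20):
    running = 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)   # move each batch to the model's device
        optimizer.zero_grad()               # clear gradients from the previous step
        loss = loss_fn(model(x), y)         # forward pass and error
        loss.backward()                     # backpropagation fills .grad
        optimizer.step()                    # adjust parameters using the gradients
        running += loss.item() * x.size(0)
    epoch_losses.append(running / len(loader.dataset))

# Step 7. Save, load, and use the model.
checkpoint_path = os.path.join(tempfile.mkdtemp(), "tiny_regressor.pt")
torch.save(model.state_dict(), checkpoint_path)

restored = TinyRegressor()
restored.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
restored.eval()
with torch.no_grad():                       # no gradients needed for inference
    prediction = restored(torch.tensor([[0.5]]))

print(epoch_losses[0], epoch_losses[-1], prediction.item())
```

The printed loss should drop between the first and last epoch, and the prediction for an input of 0.5 should land near 2.5, since the data was generated from y = 3x + 1. Real projects differ mainly in scale: a larger dataset, a deeper model, and a validation pass after each epoch.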

# Conclusion
We hope you have enjoyed this high level PyTorch learning path. Feel free to share any feedback that would improve the material and remember to keep pushing yourself to build more models.


## Other resources
- [Pre-Train a 3B parameter LLM on 16 GPUs - Lightning AI](https://lightning.ai/docs/overview/train-models)


