A deep-dive on the entire history of deep learning, highlighting the series of innovations that got us from simple feed-forward networks to GPT-4o.
For each key milestone, I've included the critical papers in this repository, along with my notes, my explanation of important intuitions & math, and a toy implementation in pytorch when relevant.
The rest of this page is my breakdown of everything we can learn from this history, and what it tells us about the future of deep learning, inspired by The Lessons of History by Will & Ariel Durant.
Important
This project is designed so everyone can get most of the value by just reading my overview on the rest of this page.
Then, people curious to learn about the technical details of each innovation can explore the rest of the repository via the links in the resources section.
Note
For more context, check out the original Twitter thread.
Note
Thanks to Pavan Jayasinha and Anand Majmudar for their constant feedback while I made this 😄
The most interesting part of my deep-dive came from noticing a clear trend across all the key advancements, which has completely reframed how I understand deep learning:
Important
There are 7 simple constraints that limit the capacity of digital intelligence:
- data
- parameters
- optimization & regularization
- architecture
- compute
- compute efficiency
- energy
The entire history of deep learning can be seen as the series of advancements that have gradually raised the ceiling on these constraints, enabling the creation of increasingly intelligent systems.
It's impossible to understand where we're going without first understanding how we got here - and it's impossible to understand how we got here without understanding these constraints, which have always governed the rate of progress.
By understanding them, we can also explore a few related questions:
- How is progress made in deep learning?
- Where do the ideas that drive progress in deep learning come from?
- How have our narratives about digital intelligence changed over time?
- What does deep learning teach us about our own intelligence?
- Where is the future of deep learning headed?
So, let's start by understanding these constraints from first principles.
We can define intelligence1 as the ability to accurately model reality2. Practically, we're interested in models of reality that are useful for performing economically valuable tasks.
The goal of deep learning is to produce accurate models of reality for these useful tasks by:
- Treating the true models that describe reality as complex probability distributions3
- Creating neural networks capable of modeling complex probability distributions
- Training these networks to learn to model the probability distributions that underlie reality
In this view, creating intelligence with deep learning involves just two steps:
- Collect useful information about reality (collect data)
- Create a neural network that can effectively learn from this information (model data)
The only way to increase the intelligence of our model is to improve how well we accomplish each of these steps.
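As a concrete illustration of these two steps, here's a minimal PyTorch sketch - the task, network size, and hyperparameters are all arbitrary illustrative choices. Noisy samples from a toy process stand in for "collect data," and fitting a small network to them stands in for "model data."

```python
import torch
import torch.nn as nn

# Step 1: collect data - noisy samples from a toy "true distribution"
x = torch.linspace(-3, 3, 512).unsqueeze(1)
y = torch.sin(x) + 0.1 * torch.randn_like(x)

# Step 2: model data - train a small network to approximate the distribution
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```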
With this in mind, we can look at the constraints that govern this process. Let's start by understanding the constraint on data.
We've established that the goal of deep learning is to model the probability distributions that describe reality.
Let's call the distribution that we're trying to model for a specific task the true distribution. In order to learn about the true distribution, we collect many samples from it. These samples make up a dataset.
The dataset contains some information about the true distribution, but it doesn't contain all information about the true distribution4. Because of this, the dataset represents an approximation of the true distribution, which we'll call the empirical distribution.
At best, we can expect our neural network to learn to model this empirical distribution5.
However, our original goal was to model the true distribution. To account for this, we need the empirical distribution to be a good approximation of the true distribution. The quality of this approximation determines the cap of how good a model trained on the dataset can get.
This is the first constraint on the intelligence of a neural network.
Note
Constraint #1: A model can only be as good as the dataset it was trained on.
Specifically, the cap on how well a model can approximate the true distribution is determined by how much information about the true distribution is contained within the dataset.
To make the empirical distribution a better approximation of the true distribution, we need to include more information about the true distribution in the dataset.
We can increase the total information in the dataset by increasing the information in each individual sample (intuitively, this means using samples that are more informative for the relevant task).
We can also increase the information in the dataset by adding more samples that offer new information about the true distribution6.
To simplify, there are two ways to improve the quality of the dataset:
- data quality
- data quantity
This is not because more data is always good7, but because we want more information about the true distribution in the dataset so the model can learn a sufficient approximation of it.
With this understanding of the data constraint and how to improve the quality of datasets, we can look at how progress in this dimension has impacted the history of deep learning.
Early machine learning relied on datasets collected by individual research teams. Despite the development of effective approaches to deep learning, datasets weren't large enough to prove their advantages.
The introduction of datasets like MNIST and ImageNet drastically increased the availability of high quality datasets large enough to effectively train deep learning models.
Early CNNs like LeNet and AlexNet used these datasets to show that deep neural networks could compete with the traditional machine learning approaches used at the time.
It's easy to take for granted the impact of these datasets now, as they have long been obsolete - but they clearly had a huge impact on the field. Notably, AlexNet, which completely changed the field of deep learning, could not have existed without the creation of the ImageNet dataset.
The introduction of large labeled datasets can be seen as the first breakthrough in pushing the data constraint toward larger datasets.
Though useful, these datasets were inherently unscalable due to the manual labeling process they relied on. In order to push the data constraint to the next level with even larger datasets, a new approach to data was needed.
The internet is the most obvious source of massive amounts of data that could plausibly be used for deep learning. However, it was initially unclear how to use this data to train a deep learning model.
Unlike labeled datasets, internet data is not created for a specific task, so it didn't appear to contain high-quality data that could contribute to training a specific model. For this reason, internet data appeared to be unusable in deep learning for a long time8.
BERT completely changed this. BERT popularized the transfer learning paradigm now used by all large language models (including GPTs) - the model was pre-trained on a large portion of the internet (high quantity, unpredictable quality), and then fine-tuned on smaller datasets (low quantity, high quality).
For the first time ever, BERT showed that we could actually make internet-scale datasets useful.
The results also shocked the broader tech community - for example, causing a Google executive to express that an AI system would inevitably replace Google search in the near future.
For those curious, the LoRA paper further developed on why the transfer learning paradigm developed by BERT and used by all modern LLMs may be so effective.
BERT and the GPTs were technically impressive but didn't immediately reach the mainstream until the release of ChatGPT.
InstructGPT was the breakthrough that enabled this. It used RLHF techniques to fine-tune the base GPT-3 model on a human-generated dataset of question-answer pairs deemed to be good responses for a helpful assistant.
By learning to behave effectively as an assistant, InstructGPT created the practical communication style that enabled ChatGPT to succeed.
The success of InstructGPT is an indication of how high-leverage data quality can be when fine-tuning language models.
Though many fine-tuned models existed before the instruct series, InstructGPT was far preferred to almost everything else at the time due to the high quality data it was trained on.
How much more can we improve the quality of the datasets deep learning models are trained on to improve the capacity for models to become intelligent?
The amount of data generated on the internet is increasing exponentially, which should continue to provide a source of increasingly large datasets to train on9.
However, there's another question about the quality of the data on internet-scale datasets. We want our systems to model reality - whereas the internet can be understood as a (highly) lossy compression of the true laws of reality10.
Because of this, the abundance of humanoid robots may present a new means of data collection for deep learning models that gives direct access to information about reality - which makes OpenAI & Microsoft's investment and collaboration with Figure particularly interesting.
Regardless, current scaling laws have shown that current models are far from reaching the capacity of the information available in internet-scale datasets, meaning we may be far away from the point where data becomes the constraint again.
Now that we've understood the data constraint, we can explore what constrains how effectively the neural network can model the data.
This determines how close to modeling the empirical distribution the model will get, which corresponds with its intelligence.
The first constraint that determines the capacity for the model to learn the empirical distribution is the number of parameters in the neural network.
The model needs to have enough representational capacity to be able to learn the empirical distribution of the dataset.
This means the neural network needs enough parameters to provide the degrees of freedom required to accurately model the distribution. In practice, it's challenging to predict the minimal number of parameters needed to fully model a dataset.
However, when the amount of information in the dataset is far beyond what the network is capable of modeling, the easiest way to improve the network is to scale up the number of parameters - which can mean increasing the depth of the network and adding more parameters per layer.
With modern internet-scale datasets, the complexity is massive, so the approach of adding more parameters shows no signs of slowing down in terms of its efficacy at improving the intelligence of models.
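To make "adding more parameters" concrete, here's a rough sketch (layer sizes are arbitrary illustrative choices) of how the parameter count of a plain MLP grows as width and depth increase:

```python
import torch.nn as nn

def count_params(width: int, depth: int, d_in: int = 784, d_out: int = 10) -> int:
    """Count trainable parameters of a plain MLP with `depth` hidden layers."""
    layers, d_prev = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(d_prev, width), nn.ReLU()]
        d_prev = width
    layers += [nn.Linear(d_prev, d_out)]
    return sum(p.numel() for p in nn.Sequential(*layers).parameters())

print(count_params(width=256, depth=2))   # ≈ 269,000 parameters
print(count_params(width=1024, depth=8))  # ≈ 8.2 million parameters
```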
Note
Constraint #2: The representational capacity of a model is bounded by the number of parameters it contains.
In practice, we'll see that increasing the number of parameters in a neural network is actually a function of the other constraints.
Let's look at the times in the past where this constraint has been particularly relevant.
The earliest neural networks consisted of just a single input and output layer, heavily limiting their representational capacity.
The original backpropagation paper discussed the addition of a hidden layer, adding more parameters to the network, which significantly increased its ability to represent more complex problems (like shift-registers, the XOR gate, etc. - all very simple examples, but impressive at the time).
AlexNet is one of the clearest examples of increasing parameters leading to better models11 - the AlexNet architecture used 5 convolutional layers, far more than the previous largest CNN at the time, which enabled it to crush the previous best score in the ImageNet competition.
However, early on, size appeared to be one of many factors constraining the improvement of models, rather than the most important constraint.
The GPT series made it clear that for internet datasets, scaling parameters appears to be sufficient to significantly increase model quality.
The scaling laws show no sign of letting up, which has motivated the current continued attempts at training larger and larger models.
Scaling laws for model performance as a function of parameters
Importantly, the reason for this trend is not that increasing the number of parameters in a model always increases its intelligence. Instead, it's due to the fact that current models still don't have enough representational capacity to capture all the information in internet-scale datasets.
As mentioned previously, increasing the number of parameters in a neural network is actually governed by the other constraints.
In reality, you can't keep scaling up the number of parameters in a model and expect quality to keep increasing. Scaling up a model (by increasing its depth or the number of parameters per layer) introduces two new classes of problems.
First, increasing the depth of a network can make it take far longer to converge to an optimal solution, or in the worst cases, can prevent the network from converging.
The process of ensuring models can converge effectively, even as they grow in depth, is known as optimization.
Additionally, when you scale up the number of parameters in a model so its representational capacity exceeds the complexity of the empirical distribution, the model can start fitting trivial noise in the distribution. This effect is known as overfitting.
The process of regularization is used to ensure models learn useful generalizations of the dataset and don't overfit to noise.
In practice, the actual depth of a network is constrained by the efficacy of the optimization & regularization strategies used.
Note
Constraint #3: The efficacy of optimization & regularization approaches constrains the number of parameters a network can handle while still being able to converge and generalize.
While training deeper neural networks with backpropagation, gradients start to get magnified or disappear, due to the compounding effects of multiplication by sequences of large or small weights12.
This is known as the vanishing and exploding gradients problem.
It's easy to forget how prohibitive this problem was - it completely prevented the effective training of networks beyond a few layers in depth, putting a significant constraint on the size of networks.
The introduction of residuals via the ResNet architecture completely solved this problem by creating residual pathways for gradients to flow effectively through networks of arbitrary depth.
This unlock removed a significant constraint on network depth, enabling much larger networks to be trained (which removed a cap on parameters that existed for a long time before this).
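A minimal sketch of the idea, simplified to linear layers (the original ResNet uses convolutional blocks with batch normalization): each block adds its output to an identity path, so gradients can flow through the sum unchanged regardless of depth.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = x + F(x): the identity path lets gradients pass
    through unchanged, even in very deep stacks of these blocks."""
    def __init__(self, dim: int):
        super().__init__()
        self.fn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.fn(x)  # the residual connection

# 50 blocks deep, yet gradients still reach the earliest layers
deep_net = nn.Sequential(*[ResidualBlock(128) for _ in range(50)])
out = deep_net(torch.randn(4, 128))
```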
Dropout introduced a critical regularization strategy that has been used in most networks after its creation, notably contributing to the success of AlexNet, which initially popularized it.
Conceptually, the ideal way to prevent a model from overfitting to a particular problem would be to train a variety of neural networks on the same problem and then take the average of their predictions. This process would cancel out the noise fitted by each network, leaving only the true representations.
However, this naive approach is prohibitively expensive - training multiple large neural networks for a single problem multiplies the compute cost.
Dropout enabled a computationally effective equivalent approach involving randomly blocking out the effects of a subset of neurons in each training run13, effectively training an exponential number of sub-networks within a neural network and averaging their predictions together.
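In PyTorch this amounts to a single layer; the dropout rate and layer sizes below are just illustrative choices.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero 50% of activations during training
    nn.Linear(256, 10),
)

model.train()            # dropout active: a different random sub-network each step
model.eval()             # dropout disabled: behaves like the averaged ensemble
logits = model(torch.randn(32, 784))
```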
Another problem when training deep networks is that later layers have to keep re-adapting as the activations of earlier layers shift during training, potentially rendering their earlier progress useless.
This problem is known as internal covariate shift, and it also prohibited the training of deeper networks.
The introduction of Batch Normalization and Layer Normalization solved this by forcing neuron activations into predictable distributions, preventing the covariate shift problem.
This breakthrough, combined with residuals, provided the basis for building much deeper networks. Layer Normalization in particular enabled the training of deeper recurrent models like RNNs and LSTMs that led to the innovations eventually resulting in the Transformer.
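A small sketch of what Layer Normalization does to activations (dimensions are arbitrary): each sample's activations are rescaled to a predictable distribution before a learned scale and shift are applied.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 512) * 5 + 3   # activations with arbitrary scale and shift
ln = nn.LayerNorm(512)
y = ln(x)                          # per-sample: zero mean, unit variance, then learned scale/shift

print(y.mean(dim=-1)[0].item(), y.std(dim=-1)[0].item())  # ≈ 0.0, ≈ 1.0
```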
The original optimization algorithm, stochastic gradient descent, takes a fixed, pre-determined step to update the parameters at each time-step.
In practice, this can be highly inefficient and hurt convergence14.
The Adam optimizer introduced an efficient algorithm that keeps adaptive moments tracking the history of gradients throughout the optimization process. This allows the optimizer to adjust step sizes based on past information, often leading to much faster convergence.
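In code, switching between the two is a one-line change; the learning rates below are just illustrative, and the betas are PyTorch's defaults.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# Plain SGD: every parameter gets the same fixed step size.
sgd = torch.optim.SGD(model.parameters(), lr=1e-2)

# Adam: per-parameter step sizes adapted from running estimates ("moments")
# of the first- and second-order statistics of past gradients.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
adam.step()
```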
The advancements mentioned above (and related developments) are all used in most models to date. For example, the Transformer architecture uses Dropout, Layer Normalization, and Residuals throughout its architecture, and was trained using the Adam optimizer.
Because of how effective these techniques have been at removing prior problems, optimization & regularization appear to be largely solved now.
This is especially augmented by the fact that we're far from reaching the peak of the scaling laws on current internet-scale datasets, so overfitting is not a concern.
Despite this, it's important to remember that optimization & regularization are still real constraints on the size of neural networks, even though they no longer limit models in their current state.
We covered how increasing the number of parameters in a neural network increases its representational capacity. This can be understood as the network's ability to store useful representations that effectively model the empirical distribution.
By default, deep neural networks are forced to learn the most optimal ways to store representations for different problems.
However, when we already know an effective method for the model to store useful representations relevant to a particular problem, it can be helpful to build the ability to store representations in this useful form directly into the model.
Building specific structures into the neural network design to make it easier for the model to store useful representations is known as adding inductive bias.
Designing good neural network architectures is about increasing the density of useful representations in the model, meaning more efficient use of parameters.
In this way, improved architectures can achieve similar effects to scaling up parameters.
In practice, architectural advancements have made previously intractable problems (like image synthesis) possible for neural networks.
Note
Constraint #4: The quality of the network architecture constrains the representational capacity of a model.
Technically, a deep neural network with non-linearities is capable of modeling any distribution, given a sufficient number of parameters15.
But in practicality, there are distributions with so much complexity that simple deep neural networks can't effectively model them16. For these distributions, we turn to architectural advancements to make progress.
The Convolutional Neural Network was the first effective architecture that introduced a significant inductive bias into neural networks. The idea behind the CNN is directly inspired by the hierarchical processing of inputs from the brain's vision system.
CNNs use feature maps that detect high-level features across images to implement the translational invariance that's critical to image recognition tasks.
This provided a deep learning analogue to the manual feature engineering efforts often used before deep learning was proven.
CNNs were critical for the initial adoption of deep learning - neural networks like LeNet and AlexNet used the architecture to beat the state-of-the-art in image classification competitions. Additionally, CNNs are still relevant in modern models, with the U-Net architecture being used in modern Diffusion models for image generation.
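A minimal sketch of this inductive bias (loosely LeNet-shaped, with arbitrary channel counts): shared convolutional filters slide across the image, so the same feature detector applies at every position.

```python
import torch
import torch.nn as nn

# Weight sharing + local receptive fields give (approximate) translation invariance.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 10),               # 10 classes, e.g. MNIST digits
)
logits = cnn(torch.randn(8, 1, 28, 28))      # batch of 28x28 grayscale images
```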
The Recurrent Neural Network introduced the ability to store memories about the past to inform future decisions.
While theoretically interesting, it remained largely ineffective for sequence-modeling tasks until the introduction of the Long Short-Term Memory architecture which enabled neural networks to learn complex relationships across time and space by learning to store, retrieve, and forget memories over long time horizons.
The LSTM inductive bias made them effective at sequence-modeling tasks, kicking off the arc of progress that eventually led to the creation of the Transformer.
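A minimal sketch of using an LSTM for sequence modeling (sizes are arbitrary); note that timesteps are still processed one after another, which is the limitation discussed next.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
x = torch.randn(8, 100, 128)   # batch of 8 sequences, 100 timesteps each
out, (h, c) = lstm(x)          # hidden state h and cell state c (the "memory")
# The 100 timesteps are processed sequentially - the dependency the Transformer later removed.
```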
Despite its efficacy, the LSTM was constrained by the fact that it processed input sequences sequentially, making it slow to train.
The Attention mechanism was initially introduced as an addition to LSTMs to enhance their ability to understand the relationship between concepts.
The now famous Attention Is All You Need paper removed all the LSTM components and demonstrated that the inductive bias of attention alone is effective for sequence-modeling tasks, introducing the Transformer architecture which has permanently changed deep learning.
The transformer is particularly effective not just because of the power of the attention mechanism, but because of the high parallelization it achieved by removing recurrence.
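A minimal sketch of single-head, unmasked scaled dot-product attention, the core operation of the Transformer; real implementations add learned projections, multiple heads, masking, and dropout.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """softmax(QK^T / sqrt(d)) V - every position attends to every other
    position in parallel, with no recurrence across timesteps."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5   # (batch, seq, seq)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)              # self-attention over 10 tokens
out = attention(q, k, v)
```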
The CNN introduced the ability to understand samples from the complex distribution of images.
However, the problem of synthesizing images appeared to be much harder - CNNs could learn to filter out the details in images and focus on high-level features, whereas image generation models would need to learn to create both high-level features and complex details.
Image generation models like Variational Auto-Encoders and Diffusion models learn to generate both high-level features and complex details by introducing random sampling and noise directly into their architectures.
VAEs create a bottleneck that forces the models to learn useful representations in a low dimensional space. Then, they add back noise on top of these representations through random sampling. So VAEs start by learning representations, and then add noise.
Diffusion models, instead, start with noise and learn to slowly add information back into it.
Without these designs, modern image generation models like Stable Diffusion and DALL E wouldn't exist.
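A minimal sketch of the VAE side of this idea (sizes are illustrative, and real VAEs use deeper encoders/decoders plus a KL term in the loss): the encoder compresses the input into a small latent, and noise is injected by sampling before decoding.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Encoder compresses x into a low-dimensional latent; noise is injected
    by sampling z = mu + sigma * eps before decoding."""
    def __init__(self, d_in=784, d_latent=16):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_latent)   # outputs mean and log-variance
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

recon, mu, logvar = TinyVAE()(torch.randn(8, 784))
```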
The Word2Vec model popularized the concept of text embeddings that preserve semantic and syntactic meaning by forcing models to create vector representations for concepts with interesting properties.
A commonly used example of the power of such embeddings is that the following relationship holds approximately in the embedding space: Embedding("King") - Embedding("Man") + Embedding("Woman") ≈ Embedding("Queen").
Embeddings show us how the relationships between concepts can be represented in a highly condensed format.
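A sketch of the classic analogy test, assuming the gensim library and its pretrained Google News word2vec vectors are available (a large download); the nearest neighbor of "king" - "man" + "woman" is typically "queen".

```python
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pretrained word2vec vectors
# vector("king") - vector("man") + vector("woman") ≈ vector("queen")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```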
Later models like CLIP based on the Transformer architecture have led to complex embedding spaces mapping understandings of concepts across modalities to a single representation space, enabling multi-modal models like DALL E 2.
For the past several years after the introduction of the Transformer, efforts have mainly been focused around scaling up the parameters and data fed into transformers without heavily adjusting the inductive biases.
This stagnation in architectural innovation is largely a result of the efficacy of the Transformer, and may hint at something about the inherent power of attention's inductive bias for intelligence.
This explicit desire not to change architectures anymore is discussed by Andrej Karpathy in this clip.
Instead of changing base architectures, many state-of-the-art models have been combining different existing architectures together - for example, the Diffusion model design uses the U-Net underneath, and DALL-E-2 uses both CLIP (which is built with the Vision Transformer) and a Diffusion model.
The combination of different working architectures has also resulted in the increasing multi-modality of models, as indicated by the recent announcement of GPT-4o, which trains a single base model on a variety of modalities (likely combining a variety of architectures underneath, although the implementation details are unreleased).
Assuming an efficient architecture and effective optimization & regularization, the last constraint on the total number of parameters and representational capacity in a model is compute.
During training, the gradient for each parameter needs to be computed and updated at each time-step, which costs computational resources. So, with more parameters, there are far more computations during back-propagation which becomes the limiting step.
Because of this, a single device can only train a finite number of parameters at once; beyond that, training has to be parallelized across multiple devices. And if there's a limit on how many devices we can use together, we've hit a constraint on compute.
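As a rough illustration of spreading a model across devices, here's PyTorch's simplest multi-GPU wrapper (large-scale training actually relies on DistributedDataParallel plus model/tensor parallelism; sizes are arbitrary).

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# If several GPUs are visible, replicate the model and split each batch across them.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```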
Note
Constraint #5: The total available compute constrains the maximum number of trainable parameters a model can have.
In practice, the constraint may be caused by a lack of resources (to buy devices), supply (due to constrained supply chains), or energy (discussed later)17.
AlexNet was one of the first major deep learning applications that took advantage of the parallelization capacity of GPUs to train neural networks.
They were also the first people to train a deep learning model across multiple GPUs at once to speed up training.
They were able to accomplish this because of the recent addition of the ability for NVIDIA GPUs to write to each other's memory, which enabled much faster direct communication between GPUs rather than communicating through the host machine.
This innovation (introduced due to gaming, not deep learning), has become critical in training large models, where communication between large clusters of GPUs has become essential.
This paper pushed the compute constraint in several ways - first, just by using GPUs for training in the first place, and additionally by using multiple GPUs to shard training and take advantage of inter-GPU communication.
Until the past decade, the GPUs that have enabled deep learning to progress so far were driven forward not by the incentives of deep learning (which offered scarce revenue opportunity early-on for large companies like NVIDIA), but by the tailwinds of the gaming market.
In this way, deep learning benefited from a bit of luck - the compute tailwinds created by the gaming industry enabled deep learning to take off in a way that likely would not have happened in the absence of gaming.
The gaming industry raised the constraint on compute for deep learning models by creating a sufficient financial incentive to produce GPUs of increasing quality.
Through the trail of papers, you can see the quality of compute slowly get better over time, even before dedicated AI compute was created.
Finally, in 2020, NVIDIA released their A100 model built specifically for AI applications, as they determined that AI was a strategic bet worth taking. This decision has now yielded the H100, and soon the B100, GPUs that will power much of AI training.
Jensen Huang delivering an H200 to early OpenAI
It wasn't initially obvious that acquiring compute would become a huge constraint.
The power-law trend that first became visible with BERT, RoBERTa, GPT-2, and GPT-3 made it clear that scaling up parameters, and thus compute, was a necessary factor in increasing model intelligence.
As this trend became clearer and the AI narrative became more powerful, everyone began acquiring the necessary compute, leading to a level of demand the supply chain hadn't anticipated. This has made compute itself difficult to acquire.
In addition, the raw cost of acquiring a large amount of compute has become prohibitively expensive for most players.
These constraints on compute led Sam Altman to say that "compute is going to be the currency of the future."
Zuck spent several billion to buy 350,000 NVIDIA GPUs, which now appears to be an act of incredible foresight considering the current struggle to get compute.
This increased demand for compute has also been reflected in the surging market caps of all the essential companies in NVIDIA's compute supply chain including TSMC & ASML.
The current constraint on compute is partially a result of compute supply chains not having predicted the unexpected jump in demand caused by the AI boom.
As supply chains inevitably adjust to meet these demands, the constraint will likely shift from who has already obtained the most compute to who has the resources to purchase the most compute, which also positions OpenAI well considering their partnership with the well-resourced Microsoft.
In recent fundraising cycles, many startups have raised money to build dedicated AI chips for inference and training, promising to further speed up the efficiency of training large models.
These specialized chips, broadly known as Application Specific Integrated Circuits, build assumptions about how deep learning models work directly into hardware, offering the ability to drastically accelerate training.
The question is, will other companies be able to compete in this space, or will NVIDIA maintain its dominance of the AI training market (most likely)?
While the power of compute increases, making effective use of this compute is not a guarantee. Using compute efficiently is a software problem that takes active effort and optimization.
Innovations like FlashAttention, which drastically accelerated the speed of Transformers through an optimization in how attention accesses memory, are a reminder that compute optimizations are another lever to increase the efficiency of training and scale up models.
Note
Constraint #6: The software implementations for training constrain the efficiency of compute utilization.
Initially, GPUs were challenging to work with as they depended on a completely new programming paradigm.
The introduction of CUDA as a GPU programming paradigm familiar to C programmers made writing GPU code far more approachable.
This language enabled the AlexNet authors to manually implement their own kernels to speed up the convolution operation on GPUs, unlocking a new level of parallelization for training CNNs.
People rarely have to write low-level kernels anymore since popular libraries like PyTorch and JAX already provide optimized kernels for the most common operations, making it easy for modern deep learning engineers to use GPUs without needing to dip into low-level code.
Despite the fact that GPU kernels are now largely written, there are likely still plenty of opportunities for improving the compute efficiency of model implementations - notably, the introduction of FlashAttention demonstrated how big of a difference these changes can make in terms of training efficiency.
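For example, recent PyTorch versions expose a fused attention call that can dispatch to FlashAttention-style kernels when run on a suitable GPU; the shapes and dtype below are illustrative, and a CUDA device is assumed.

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim) - assumes a CUDA GPU is available
q = k = v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# PyTorch 2.x can dispatch this call to a fused FlashAttention-style kernel,
# avoiding materializing the full (seq x seq) attention matrix in GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```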
Finally, even if the compute supply chains are capable of supporting all demand, and we have infinite resources to purchase compute, there is still a constraint on compute: energy
In practice, large training runs need to be run on physically clustered compute in large data centers since the devices need to communicate with each other.
As the number of devices in large training runs grows, data centers will need to be able to support the energy needs of these devices.
This may actually become a meaningful constraint, as Zuck discussed in this clip on the Dwarkesh podcast.
Specifically, energy grids can only supply a certain amount of power in a single location, meaning there's a cap on how large data-centers can become before they run into energy permitting problems and much slower, government-regulated processes.
Note
Constraint #7: The energy available to draw from the grid in a single location constrains the amount of compute that can be used for a training run.
As many companies plan to build large data-centers for AI training, we'll see how the energy constraint plays out - notably, Microsoft and OpenAI are rumored to be launching a $100B data-center project.
Having covered each constraint individually, we can now put them all into perspective in relation to the broader arrow of progress in deep learning.
A helpful way to think about the 7 constraints is in terms of hard constraints and leverage.
The hard constraints are data, compute, and energy - these are rate-limited by slow processes - data currently being limited by the scaling growth of the internet and other data collection methods, compute being limited by individual company resources and supply chains, and energy constraints eventually being rate-limited by regulation.
Meanwhile, parameters, optimization & regularization, architecture, and compute efficiency can be thought of as forms of leverage on the hard constraints - they are all easy to vary and can be optimized to maximize a model's intelligence given a fixed set of data, compute, and energy.
Maximizing the leverage constraints is important for individual training runs, but improving the hard constraints is what really pushes forward the base intelligence of models.
This is again indicative of the scaling laws - our models have not shown signs of coming close to fully modeling the information in current internet-scale datasets, so we continue to scale up models by increasing compute and parameters.
We can look back at this history of progress in deep learning through the lens of constraints, and see a few key milestones that stand out above the rest which have completely shifted narratives around deep learning.
Since narratives are a powerful tool for allocating capital and talent toward problems18, these narrative shifts alone have had a significant impact on deep learning progress.
The first major narrative shift in deep learning occurred after the release of AlexNet in 2012.
Prior to this paper, deep learning was considered inferior to traditional ML, as it consistently lost to manual feature engineering approaches in image classification and other challenges.
The success of AlexNet brought down the top-5 error rate on the ImageNet challenge from 25.8% to 16.4%, blowing the previous state-of-the-art out of the water.
This directly enabled further innovations like GoogLeNet and ResNet, but more importantly, it shifted attention back on deep learning and created new interest in the field.
The narrative shift that occurred as a result of this work was from one of skepticism about the utility of deep learning to belief that it was a viable, and even superior, approach to traditional machine learning.
This narrative shift was essential to get us to the point that we're at today, and it seems that Ilya Sutskever (who co-authored AlexNet) realized how the scaling laws would play out long before it reached consensus, as discussed in this interview with Geoffrey Hinton.
The Attention Is All You Need paper created a massively parallelizable architecture that enabled training on internet scale datasets.
The introduction of the Transformer alone was not what created the largest narrative shifts though.
Arguably, it was the introduction of BERT that really showed how transformers could take advantage of massive datasets scraped from the internet via pre-training and fine-tuning, which kicked off the modern trends in AI focusing on achieving general intelligence.
Because of its transfer learning approach, BERT achieved state-of-the-art results on many NLP tasks without training on them explicitly, showing one of the first indications of some form of generalized intelligence.
The shock caused by BERT is evident in the Google executive statement claiming that BERT would replace all of the past 20 years of progress on the search product.
The arrow of progress defined by the improvements from GPT-2 to GPT-3 onwards created the scaling laws narrative that dominates the current public sentiment.
Importantly, OpenAI took a bet on the scaling laws early on, well before they were widely recognized as being valid19. A few years ago, most people thought the scaling laws were naive.
Now, they look clear in hindsight because of the series of bets OpenAI took to validate these laws, with GPT-2 and GPT-3 further validating their hypothesis.
Extrapolating out the progression of scaling laws correctly is challenging - as Zuck points out in this clip, trends like these rarely continue until we reach the goal - we usually run into bottlenecks and then have to readjust strategy.
In this context, the question is how far the empirical distribution of the internet dataset will take us. Framed differently - how close is the empirical distribution of the internet to the true distribution describing reality?
This will determine when we hit a carrying capacity on how much better our models can get by scaling parameters to train on the internet.
This narrative is also a good indicator of how impactful narratives are in fundraising.
The AGI narrative may be the most powerful narrative in history since it can claim that "everything else economically valuable will be solved by this problem."
Clearly, this was used effectively with the rumored $7T OpenAI fundraising attempt (which was of course just a rumor, but an indication of the power of the AGI narrative, since people believed it was a possibility).
Where do the ideas that have led to breakthroughs in deep learning come from?
When we look at the history of progress, we can see several common sources of inspiration that appear frequently.
The most apparent source of direct inspiration for many advancements in deep learning is neuroscience.
The CNN is almost directly inspired by the visual system in the brain, and it led to significant advancements in deep learning.
Similarly, the effectiveness of ReLU is explained in terms of the energy efficiency of sparse representations for concepts in the brain.
Other systems, like the LSTM and the Attention mechanism, appear to draw from neuroscientific concepts (memory and attention) on a surface level, although in reality their implementations are motivated more by the math of neural networks and by engineering for specific problems than by directly modeling the brain.
For example, the LSTM design is precisely engineered to address the vanishing & exploding gradients problem in RNNs, and it happens that a long-term-memory-based system is an effective way to fix this problem.
This pattern suggests that rather than taking direct inspiration from neuroscience, deep learning may have converged on similar approaches to how nature has built intelligence in the brain, partly through first principles.
This is a nice ex-post rationalization, but may overly construct a clean narrative that doesn't actually reflect the situation.
Additionally, early papers seem to have felt pressure to fit ideas into neuroscientific and biological justifications, even where there may not have been any.
Dropout struck me as the most blatant example of this, as they explain "one possible motivation" for dropout coming from animal sexual behavior, despite their prior explanation in the paper of dropout following from a rather logical line of thinking around regularization.
This seems to be an attempt to make the technique appear to correspond with biology after it was designed, rather than biology actually serving as a source of inspiration (of course, I could be wrong about this).
Most notably, backpropagation and LoRA are directly inspired by the math behind neural networks.
LoRA (low-rank adaptation) directly manipulates how models are fine-tuned by taking advantage of a feature of linear algebra (decomposing weight updates into lower-dimensional matrices with far fewer trainable parameters).
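A minimal sketch of the idea (the rank, scaling, and layer sizes are illustrative rather than the paper's exact recipe): the pretrained weight is frozen and only a low-rank update is trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the pretrained layer and learn a low-rank update BA,
    so the effective weight is W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)   # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(4096, 4096))   # trainable: ~65k params vs ~16.8M frozen
out = layer(torch.randn(2, 4096))
```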
Similarly, advancements like Residuals were directly motivated by the nature of gradient flows within neural networks.
Notably, VAEs and Diffusion models take inspiration from thermodynamics - specifically Langevin dynamics, as well as probability and information theory.
These systems involve noisy sampling, and these models turn to approaches used in similarly noisy systems in the real world for inspiration.
In practice, most of the innovations in deep learning are actually more motivated by engineering problems in neural network design, and bear only surface-level resemblance to the apparent fields of inspiration.
What can this progression of progress in deep learning tell us about our own intelligence?
I'll try to be purely empirical here, since it's easy to dip into unfounded philosophizing with this topic given its subjective nature.
As we've discussed, one way to view intelligence (motivated by the Free Energy Principle) is as a measure of our ability to model complex distributions that describe reality, and then run active inference on these models to accomplish things in the world20.
It seems that the combination of data about reality (dataset vs. our senses), compute (transistors vs. neurons), and energy (electricity vs. food) along with scale (parameters vs. connections), and of course, an effective learning algorithm, yields systems that appear intelligent.
Additionally, the efficacy of various inductive biases offered by different architectures may indicate something inherent about the structure of the information they're trying to model.
For example, the effectiveness of the attention mechanism raises the question of why this inductive bias alone appears to be so effective at modeling data.
If intelligence is really just a function of data, compute, energy, and training, then it seems inevitable now that digital intelligence will soon surpass us.
We've now reframed the history of progress as the series of advancements that have continually raised the ceiling on the constraints governing digital intelligence.
Everything in the past that has contributed to progress has been determined by the constraints discussed above.
Importantly, nothing about this changes in the future - these same 7 constraints will always determine where we're headed, and how close we are to AGI.21
At this point, we've solved the theoretical problem of AGI, in the sense that we know exactly what would get us to AGI22.
This was not obvious until the past decade, where we've seen the power of how far deep learning can go.
The question is now whether we will solve the engineering problem of AGI. Will we be able to keep pushing on all the constraints to keep improving digital intelligence?
Although scaling laws are currently at play and the current path forward is to acquire larger amounts of compute to train larger models, the efficacy of this approach will hit a limit in the future (it's difficult to know when).
It's possible that we may hit a bottleneck in how good models can get based on the quality of the empirical distribution of the internet, in which case we'll have to seek other sources of data.
Important
It's critical to remember that the core principle of progress in deep learning is that pushing on the 7 constraints will lead to increasingly intelligent systems.
Though the scaling laws indicate that the current limiting constraints are compute and parameters, these may shift to data and energy over time, which will bring new challenges.
Important
Each topic highlighted in this repository is covered in a folder linked below.
In each folder, you'll find a copy of the critical papers related to the topic (.pdf files), along with my own breakdown of intuitions, math, and my implementation when relevant (all in the .ipynb file).
1. Deep Neural Networks
2. Optimization & Regularization
- 2.1. Weight Decay
- 2.2. ReLU
- 2.3. Residuals
- 2.4. Dropout
- 2.5. Batch Normalization
- 2.6. Layer Normalization
- 2.7. GELU
- 2.8. Adam
3. Sequence Modeling
- 3.1. RNN
- 3.2. LSTM
- 3.3. Learning to Forget
- 3.4. Word2Vec & Phrase2Vec
- 3.5. Seq2Seq
- 3.6. Attention
- 3.7. Mixture of Experts
4. Transformers
- 4.1. Transformer
- 4.2. BERT
- 4.3. T5
- 4.4. GPT-2 & GPT-3
- 4.5. LoRA
- 4.8. RLHF & InstructGPT
- 4.9. Vision Transformer
5. Image Generation
I've provided my minimal implementations for many of the core topics in this repository in the .ipynb files for each topic.
Generally, you can find good implementations of most papers online, so the challenge isn't figuring out how to reimplement them. I've included these to collect them in one place for convenience, and also to highlight simple implementations that demonstrate each concept and that I've actually gotten to train.
I used A100's to train most of the larger models.
Deep Neural Networks
- DNN - Learning Internal Representations by Error Propagation (1987), D. E. Rumelhart et al. [PDF]
- CNN - Backpropagation Applied to Handwritten Zip Code Recognition (1989), Y. Lecun et al. [PDF]
- LeNet - Gradient-Based Learning Applied to Document Recognition (1998), Y. Lecun et al. [PDF]
- AlexNet - ImageNet Classification with Deep Convolutional Networks (2012), A. Krizhevsky et al. [PDF]
- U-Net - U-Net: Convolutional Networks for Biomedical Image Segmentation (2015), O. Ronneberger et al. [PDF]
Optimization & Regularization
- Weight Decay - A Simple Weight Decay Can Improve Generalization (1991), A. Krogh and J. Hertz [PDF]
- ReLU - Deep Sparse Rectified Neural Networks (2011), X. Glorot et al. [PDF]
- Residuals - Deep Residual Learning for Image Recognition (2015), K. He et al. [PDF]
- Dropout - Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014), N. Srivastava et al. [PDF]
- BatchNorm - Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), S. Ioffe and C. Szegedy [PDF]
- LayerNorm - Layer Normalization (2016), J. Lei Ba et al. [PDF]
- GELU - Gaussian Error Linear Units (GELUs) (2016), D. Hendrycks and K. Gimpel [PDF]
- Adam - Adam: A Method for Stochastic Optimization (2014), D. P. Kingma and J. Ba [PDF]
Sequence Modeling
- RNN - A Learning Algorithm for Continually Running Fully Recurrent Neural Networks (1989), R. J. Williams [PDF]
- LSTM - Long Short-Term Memory (1997), S. Hochreiter and J. Schmidhuber [PDF]
- Learning to Forget - Learning to Forget: Continual Prediction with LSTM (2000), F. A. Gers et al. [PDF]
- Word2Vec - Efficient Estimation of Word Representations in Vector Space (2013), T. Mikolov et al. [PDF]
- Phrase2Vec - Distributed Representations of Words and Phrases and their Compositionality (2013), T. Mikolov et al. [PDF]
- Encoder-Decoder - Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014), K. Cho et al. [PDF]
- Seq2Seq - Sequence to Sequence Learning with Neural Networks (2014), I. Sutskever et al. [PDF]
- Attention - Neural Machine Translation by Jointly Learning to Align and Translate (2014), D. Bahdanau et al. [PDF]
- Mixture of Experts - Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017), N. Shazeer et al. [PDF]
Transformers
- Transformer - Attention Is All You Need (2017), A. Vaswani et al. [PDF]
- BERT - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), J. Devlin et al. [PDF]
- RoBERTa - RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019), Y. Liu et al. [PDF]
- T5 - Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019), C. Raffel et al. [PDF]
- GPT-2 - Language Models are Unsupervised Multitask Learners (2019), A. Radford et al. [PDF]
- GPT-3 - Language Models are Few-Shot Learners (2020) T. B. Brown et al. [PDF]
- LoRA - LoRA: Low-Rank Adaptation of Large Language Models (2021), E. J. Hu et al. [PDF]
- RLHF - Fine-Tuning Language Models From Human Preferences (2019), D. Ziegler et al. [PDF]
- PPO - Proximal Policy Optimization Algorithms (2017), J. Schulman et al. [PDF]
- InstructGPT - Training language models to follow instructions with human feedback (2022), L. Ouyang et al. [PDF]
- Helpful & Harmless - Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022), Y. Bai et al. [PDF]
- Vision Transformer - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), A. Dosovitskiy et al. [PDF]
Generative Models
- GAN - Generative Adversarial Networks (2014), I. J. Goodfellow et al. [PDF]
- VAE - Auto-Encoding Variational Bayes (2013), D. Kingma and M. Welling [PDF]
- VQ VAE - Neural Discrete Representation Learning (2017), A. Oord et al. [PDF]
- VQ VAE 2 - Generating Diverse High-Fidelity Images with VQ-VAE-2 (2019), A. Razavi et al. [PDF]
- Diffusion - Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015), J. Sohl-Dickstein et al. [PDF]
- Denoising Diffusion - Denoising Diffusion Probabilistic Models (2020), J. Ho. et al. [PDF]
- Denoising Diffusion 2 - Improved Denoising Diffusion Probabilistic Models (2021), A. Nichol and P. Dhariwal [PDF]
- Diffusion Beats GANs - Diffusion Models Beat GANs on Image Synthesis, P. Dhariwal and A. Nichol [PDF]
- CLIP - Learning Transferable Visual Models From Natural Language Supervision (2021), A. Radford et al. [PDF]
- DALL E - Zero-Shot Text-to-Image Generation (2021), A. Ramesh et al. [PDF]
- DALL E 2 - Hierarchical Text-Conditional Image Generation with CLIP Latents (2022), A. Ramesh et al. [PDF]
Footnotes
-
Everyone has different definitions of intelligence, all of which are useful in different contexts, and none of which capture the full picture of what this word means. People may disagree with the specifics of this definition. I've chosen this one for the sake of simplicity to clearly frame what we're trying to achieve with deep learning from an economic perspective - I'm less concerned with its philosophical implications here. ↩
-
Karl Friston's Free Energy Principle suggests that this definition of intelligence is also valid in the context of the brain (beware, the paper is explained with unnecessary mathematical complexity, but the core concept it describes is simple). Notably, intelligence systems create models of the world and then use those models to perform active inference to modify their environments. ↩
-
This idea may seem unintuitive at first. But it's actually saying something very simple: (1) reality has a set of rules that govern what happens (2) we can model these rules by assigning probabilities to what's likely to happen, given what has already happened (3) thus, these models are probability distributions. Again, the Free Energy Principle supports this view of modeling reality. ↩
-
Assuming the true distribution we're trying to model is sufficiently complex to the point where including all information about it in the dataset would be intractable. This is almost always the case in deep learning. ↩
-
Assuming the model perfectly represents all information that exists within the dataset, which rarely happens. ↩
-
This is analogous to how adding more terms to a Taylor series yields a function closer to the original. Approximations improve with more information about the true function. ↩
-
In fact, you can think of examples where more data makes no difference. For example, adding the same image to a dataset twice (or two images very similar to each other) doesn't improve the quality of the resulting model. This is because these new data points don't add much new information about the true distribution. ↩
-
There was not powerful enough compute or good enough architectures to process the scale of internet datasets effectively for a long time. ↩
-
This may not actually be sufficient to keep increasing the quality of models, as a recent analysis of zero-shot learning shows that large language models' ability to perform tasks increases logarithmically with the amount of relevant data in the dataset. ↩
-
The internet is a lossy compression of the entirety of human knowledge, with lots of noise (complex and contrasting intentions behind different posts). Additionally, human knowledge itself is a very lossy (and partially inaccurate) compression of the laws of reality. ↩
-
Although, AlexNet was the result of a large number of innovations that combined to make it so effective - the increase in network depth was complemented with a use of effective optimization & regularization methods and the use of GPUs for training which enabled this increase in size. ↩
-
Understanding this section relies on a basic understanding of the fundamentals of the backpropagation algorithm. 3blue1brown's neural network series is an intuitive and interesting introduction for anyone who wants to learn. ↩
-
This effect forces individual neurons to learn general representations useful in collaboration with a variety of other neurons, rather than co-adapting with neighboring neurons, which allows large groups of neurons to fit to noise. ↩
-
Specifically, in parameter spaces with large variance in the gradients, a fixed step-size may cause over-adjustments in some parts of the landscape and painfully slow progress in others. ↩
-
This idea is explored in the original backpropagation paper. ↩
-
For example, image classification, where individual pixel values are noisy and subject to a variety of transformations. ↩
-
There are also many engineering challenges with training on increasingly large clusters of devices like GPUs that need to be able to communicate with each other. ↩
-
For those curious, Kevin Kwok's essay on Narrative Distillation is an excellent exploration of the power of narratives in capital and resource allocation. ↩
-
In Thiel terms, you could frame this as OpenAI's "secret" or something they believe that others don't. ↩
-
This view of intelligence also fits the framework of thinking in the WaitButWhy post The Cook and the Chef: Musk's Secret Sauce particularly well. ↩
-
This is not saying that scaling laws will get us to AGI, but that constantly pushing on the constraints will get us to AGI. We may run into bottlenecks that render the scaling laws obsolete at some point. ↩
-
Assuming you believe that the current systems exhibit intelligent behavior, which some people still disagree with. ↩