In [3]:
# So it seems like a good idea would be something that can take two gradients and turn them into one gradient.
# That one gradient could be better than a plain average if we choose it to maximize the smallest decrease
#  in loss over the data. The idea is to bootstrap and make the data more useful than it would otherwise have been,
#  as a form of regularization where we force generalization in our gradient steps.
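
# A minimal sketch of one way to turn two gradients into one, under a first-order assumption
# (loss decrease for example i is approximated by the dot product of its gradient with the step).
# Under that assumption, the direction that maximizes the smaller of the two decreases is the
# minimum-norm point on the segment between the two gradients. This is a swapped-in standard
# construction, not something the notes above specify; `combine_two_gradients` is a hypothetical name.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def combine_two_gradients(g1, g2):
    """Return the minimum-norm convex combination of g1 and g2.

    To first order, stepping against this vector maximizes the smaller of the
    two per-example loss decreases among steps of the same length.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: nothing to combine.
        return list(g1)
    # alpha minimizes ||alpha * g1 + (1 - alpha) * g2||^2, clipped to [0, 1].
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = max(0.0, min(1.0, alpha))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]


g1, g2 = [4.0, 0.0], [0.0, 1.0]
combined = combine_two_gradients(g1, g2)
average = [(a + b) / 2.0 for a, b in zip(g1, g2)]
```

# With these two gradients the plain average is dominated by the larger one, while the combined
# vector has the same dot product with both, so neither example is left behind.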

# Next there is the question of whether we can convert this distilled gradient back into a piece of data.
# One way to do this is simply to run gradient descent for multiple steps, then compute the total change in weights
# over those steps. That is the "effective gradient" of all of those data pieces, and we can distill a single data piece
# that produces that same gradient (just run gradient descent on the data itself).
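
# A rough sketch of the effective-gradient idea, assuming a tiny linear model with squared error
# so the derivatives can be written by hand. The model, the data, the learning rate, and the names
# `effective_gradient` and `distill_point` are all illustrative assumptions. The backtracking step
# size is only there to keep the toy descent from diverging.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def grad(w, x, y):
    """Gradient of the squared error (w.x - y)^2 with respect to w."""
    r = dot(w, x) - y
    return [2.0 * r * xi for xi in x]


def effective_gradient(w0, data, lr):
    """Take one SGD step per example, then return (w0 - w_final) / lr."""
    w = list(w0)
    for x, y in data:
        g = grad(w, x, y)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return [(a - b) / lr for a, b in zip(w0, w)]


def match_loss(w0, x, y, target):
    """Squared distance between the gradient of (x, y) at w0 and the target."""
    g = grad(w0, x, y)
    return sum((gi - ti) ** 2 for gi, ti in zip(g, target))


def distill_point(w0, target, x, y, steps=500):
    """Gradient descent on (x, y) so that its gradient at w0 approaches target."""
    x = list(x)
    for _ in range(steps):
        r = dot(w0, x) - y
        e = [2.0 * (2.0 * r * xi - ti) for xi, ti in zip(x, target)]
        ex = dot(e, x)
        dx = [2.0 * wi * ex + 2.0 * r * ei for wi, ei in zip(w0, e)]
        dy = -2.0 * ex
        current = match_loss(w0, x, y, target)
        step = 0.1
        while step > 1e-12:  # halve the step until the matching loss drops
            new_x = [xi - step * di for xi, di in zip(x, dx)]
            new_y = y - step * dy
            if match_loss(w0, new_x, new_y, target) < current:
                x, y = new_x, new_y
                break
            step *= 0.5
    return x, y


w0 = [0.5, -0.3]
data = [([1.0, 2.0], 1.0), ([2.0, -1.0], 0.0), ([0.5, 0.5], 2.0)]
target = effective_gradient(w0, data, lr=0.05)
start_x, start_y = data[0]
x_star, y_star = distill_point(w0, target, start_x, start_y)
```

# Here `(x_star, y_star)` is a single synthetic example whose gradient at `w0` is pulled toward
# the combined effect of three SGD steps. It is tied to this particular `w0`.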

# However, this only works if you have the same initialization and network. What if this procedure is run multiple times
# on different sizes of networks, and different

# Alternatively, what if we make a model that tries to predict the "next gradient", given the current one? Or what if, given
#  data, it tries to output the data that distills it? If we have two pieces of data, and it outputs one piece of data, we
#  could train a "distillation network" on lots of different initializations, all at varying levels of being trained, and
#  may get a decent model at "distilling" two pieces of data into one. This model can be used to divide the size of data in
#  half, then in half again, etc., allowing for a very distilled dataset that can rapidly train new models.
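
# The repeated halving could be sketched as below. `distill_pair` stands in for the trained
# distillation network, which does not exist here; `average_pair` is only a placeholder so the
# loop can run, and is not a claim about what a real distiller would output.

```python
def halve(dataset, distill_pair):
    """Distill consecutive pairs; an unpaired last item is carried over as-is."""
    out = [distill_pair(dataset[i], dataset[i + 1])
           for i in range(0, len(dataset) - 1, 2)]
    if len(dataset) % 2 == 1:
        out.append(dataset[-1])
    return out


def distill_dataset(dataset, distill_pair, target_size):
    """Halve repeatedly until the dataset has at most target_size items."""
    while len(dataset) > target_size and len(dataset) > 1:
        dataset = halve(dataset, distill_pair)
    return dataset


def average_pair(a, b):
    # Placeholder only: a trained distillation network would go here.
    return [(p + q) / 2.0 for p, q in zip(a, b)]


data = [[float(i)] for i in range(8)]
small = distill_dataset(data, average_pair, target_size=2)
```

# Each pass calls the distiller on about half as many items as the previous one, so the total
# number of distiller calls stays below the size of the original dataset.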

# It's worth asking how that is any better than just training the model itself. You will still need to run a model on every
#   piece of data. You also have a new hyperparam, which might be annoying. The improvements would be:
# 1. The distillation network might be smaller, because the function to distill is simpler
# 2. This allows you to distill once, then experiment more rapidly with many different hyperparams because training is quicker
# 3. This makes training more accessible as you don't need to download massive datasets

# The questions we would ask about this distiller are:
# 1. How well does it generalize to different kinds of networks?
# 2. Does it need samples from networks at different stages of training? (freshly initialized, trained a little, fully trained)
#    To what extent does this influence its effectiveness?
# 3. What are ways of measuring distillation quality? One would be average loss of fitted models on data before distill and after distill
# 4. Are there general sorts of tradeoffs we make here? What are they? How do different types of data affect how well this works?
# 5. Can this act as data augmentation? Pick two random images and generate a distilled image
# 6. How do classification tasks fit into this? Should we only distill images from the same class? Or can we incorporate loss to do it from multiple classes?
# 7. We can sort of "bootstrap" it: produce an image, then run gradient descent on that image to maximize the
#    smallest improvement in loss across the two source images. We can then take this refined image and use
#    it to further train the model.

# Can we do that "gradient bootstrapping" for standard training as well? Yes:
# 1. Take a minibatch
# 2. Gradient descent to find the data that maximizes the minimum improvement in loss over that minibatch
# 3. Use the gradient of that data instead of the averaged gradient
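
# The three steps above can be sketched on a toy problem. This assumes a linear model with
# squared error and uses finite differences in place of autograd, purely to stay dependency-free;
# `min_improvement`, `bootstrap_point`, and `bootstrapped_step` are hypothetical names, and the
# step sizes are arbitrary. A trial point is only accepted if it raises the worst-case improvement.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def loss(w, x, y):
    r = dot(w, x) - y
    return r * r


def grad(w, x, y):
    r = dot(w, x) - y
    return [2.0 * r * xi for xi in x]


def min_improvement(w, batch, point, lr):
    """Smallest per-example loss decrease after one step on `point`'s gradient."""
    g = grad(w, point[0], point[1])
    w_new = [wi - lr * gi for wi, gi in zip(w, g)]
    return min(loss(w, x, y) - loss(w_new, x, y) for x, y in batch)


def bootstrap_point(w, batch, lr, steps=100, eps=1e-5):
    """Step 2: ascend on a synthetic (x, y), starting from the first example."""
    p = list(batch[0][0]) + [batch[0][1]]

    def score(q):
        return min_improvement(w, batch, (q[:-1], q[-1]), lr)

    best = score(p)
    for _ in range(steps):
        fd = []
        for k in range(len(p)):  # finite-difference estimate of the ascent direction
            bumped = list(p)
            bumped[k] += eps
            fd.append((score(bumped) - best) / eps)
        step = 0.1
        while step > 1e-8:  # accept only steps that raise the score
            trial = [pi + step * di for pi, di in zip(p, fd)]
            s = score(trial)
            if s > best:
                p, best = trial, s
                break
            step *= 0.5
    return (p[:-1], p[-1]), best


def bootstrapped_step(w, batch, lr):
    """Step 3: update w with the synthetic point's gradient, not the batch average."""
    point, _ = bootstrap_point(w, batch, lr)
    g = grad(w, point[0], point[1])
    return [wi - lr * gi for wi, gi in zip(w, g)]


w = [0.5, -0.3]
batch = [([1.0, 2.0], 1.0), ([2.0, -1.0], 0.0), ([0.5, 0.5], 2.0)]  # step 1: a minibatch
new_w = bootstrapped_step(w, batch, lr=0.05)
```

# The inner search costs many extra loss evaluations per minibatch, so whether this beats plain
# averaging for the same compute is an open question in these notes.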

# Even better, what if we train a model to update the weights to improve the loss, and then gradient descent on that
# meta model?