When we shuffle the training set, DeepClean fails to converge. This seems to go against most DL training intuition.
Think of a batch as a kind of "meta-sample" used for optimization: it's composed of smaller individual samples whose information contributions are averaged when the gradient is computed for backpropagation. When we randomly shuffle the dataset, we create combinatorially many of these meta-samples, each averaging different information and producing diverse gradient updates, which helps combat overfitting. When we batch sequentially, the batch compositions are identical every epoch, so we're essentially downsizing our dataset by a factor of the batch size and forcing the network to learn from the same information over and over again.
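As a minimal sketch of that difference (using a toy `DataLoader`, not DeepClean's actual data pipeline), this prints which sample indices land in each batch under sequential vs. shuffled loading:

```python
# Toy comparison: batch compositions under sequential vs. shuffled loading.
import torch
from torch.utils.data import DataLoader, TensorDataset

n_samples, batch_size = 16, 4
# Use the sample index itself as the "data" so batches are easy to inspect.
dataset = TensorDataset(torch.arange(n_samples))

for shuffle in (False, True):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
    print(f"\nshuffle={shuffle}")
    for epoch in range(2):
        # Each batch is a tuple containing one tensor of sample indices.
        batches = [batch[0].tolist() for batch in loader]
        print(f"  epoch {epoch}: {batches}")
```

With `shuffle=False`, every epoch produces the exact same `n_samples / batch_size` batch compositions, so the gradient only ever sees those few averaged "meta-samples"; with `shuffle=True`, the compositions change every epoch.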
It would be really great to understand why we're observing this phenomenon, because it feels like there's performance we're leaving on the table by not understanding it.