
validation data not entirely held out from training #57

Closed
ftostevin-ont opened this issue Sep 24, 2021 · 2 comments

Comments


ftostevin-ont (Collaborator) commented Sep 24, 2021

  • When training with a random validation subsample, the indices of the chunks to be used for training and for validation are fixed at the start of training.
  • When creating training batches, a random_start_position in the range [0, batch_size) is applied to a random selection from the shuffled list of training chunks. For each selected chunk, the records between coordinates (chunk_start + random offset) and (chunk_end + random offset) are therefore loaded.
  • Because of this offset, if a validation chunk follows a training chunk, validation data can be loaded and used in training, so the validation data are not truly held out as an independent sample (see the code sketch below).
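For illustration, here is a minimal sketch of the problem, not the actual loader code; chunk_size, batch_size and the helper name are assumptions made for this example:

```python
# Minimal sketch (illustrative names and sizes, not the project's loader).
import random

chunk_size = 1000   # records per chunk (assumed)
batch_size = 5000   # random_start_position is drawn from [0, batch_size)

def loaded_record_range(chunk_idx, offset):
    """Half-open record range actually read for a selected training chunk."""
    chunk_start = chunk_idx * chunk_size
    chunk_end = chunk_start + chunk_size
    # The same offset is added to both ends, so the window keeps its length
    # but can slide well past the selected chunk into whatever chunks follow.
    return chunk_start + offset, chunk_end + offset

offset = random.randrange(batch_size)        # random_start_position
lo, hi = loaded_record_range(chunk_idx=0, offset=offset)
touched = list(range(lo // chunk_size, (hi - 1) // chunk_size + 1))
print(f"offset={offset}: records [{lo}, {hi}) come from chunks {touched}")
```

Because the offset can be much larger than a chunk, the loaded window can land several chunks ahead of the one that was selected.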

Example:

Input file chunks
|---0---|---1---|---2---|---3---|---4---|---5---|---6---|---7---|---8---|---9---|...
 
Random validation chunks: [2, 5, 6]
                |---2---|               |---5---|---6---|
Training chunks:
|---0---|---1---|       |---3---|---4---|               |---7---|---8---|---9---|...
 
Example of a training epoch
Random offset: |..OFFSET..|
Randomly selected training chunks for batch: [0, 4, 8, 9,...]
Training records:
|---0---|---1---|---2---|---3---|---4---|---5---|---6---|---7---|---8---|---9---|...
|..OFFSET..|-------|            |..OFFSET..|-------|            |..OFFSET..|-----...

Here, data from validation chunks 2, 5 and 6 are included in training. The records loaded for training chunk 4 in fact consist entirely of validation data.
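The same arithmetic can be checked directly; the numbers below (8-record chunks, a 12-record offset) are read off the diagram above, not taken from the code base:

```python
# Reproduce the example: 8-record chunks, a 12-record offset, validation
# chunks {2, 5, 6}, and training chunks [0, 4, 8, 9] selected for the batch.
chunk_size = 8
offset = 12
validation_chunks = {2, 5, 6}

for c in [0, 4, 8, 9]:
    lo = c * chunk_size + offset
    hi = lo + chunk_size
    touched = set(range(lo // chunk_size, (hi - 1) // chunk_size + 1))
    leaked = sorted(touched & validation_chunks)
    print(f"training chunk {c}: records [{lo}, {hi}) -> chunks {sorted(touched)}, "
          f"leaked validation chunks {leaked}")
# Chunk 0 pulls in part of validation chunk 2; chunk 4 reads only from
# validation chunks 5 and 6; chunks 8 and 9 touch none of the listed
# validation chunks.
```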

Possible solutions:

  • Don't use a random_start_position offset
    • Chunk order within a batch is already randomised, so this alone might provide enough randomness for fixed chunk boundaries not to be much of an issue for training.
    • Records could also be shuffled within a batch after the chunks have been assembled (though this might have a performance impact).
  • Prune chunks that immediately precede validation chunks from the training list (see the sketch after this list)
    • This probably requires reducing the range of random_start_position from batch_size to chunk_size; otherwise a huge proportion of the training data would be excluded.
    • Even then, this prevents the run-on issue but means removing roughly a validation-fraction's worth of additional training chunks (on top of the chunks used for validation), and it complicates the calculation of the training/validation split.
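A rough sketch of the pruning option, assuming random_start_position has already been capped at chunk_size so that only the immediately following chunk can be reached (names are illustrative, not from the code base):

```python
def prune_training_chunks(training_chunks, validation_chunks):
    validation = set(validation_chunks)
    # Keep a training chunk only if the chunk immediately after it is not a
    # validation chunk, so a window shifted by < chunk_size cannot leak.
    return [c for c in training_chunks if (c + 1) not in validation]

training = [0, 1, 3, 4, 7, 8, 9]
validation = [2, 5, 6]
print(prune_training_chunks(training, validation))  # [0, 3, 7, 8, 9]
```

With the example layout above, chunks 1 and 4 would be dropped from training in addition to the three validation chunks.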
zhengzhenxian (Collaborator) commented:

Issue confirmed. We propose splitting all chunks into contiguous [training chunks, buffer chunks, validation chunks] blocks and applying random_start_position only to the training chunks. The buffer is larger than the maximum possible random_start_position, so no validation chunk would be involved in training. Will fix after merging #56.
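For concreteness, a sketch of that layout under assumed sizes (the chunk count, chunk size, validation fraction and maximum offset are all illustrative, not the values used in the fix):

```python
import math

def split_chunks(n_chunks, chunk_size, validation_fraction, max_offset):
    # Reserve enough whole chunks after the training block to absorb the
    # largest possible random_start_position, so a shifted training window
    # can never reach the validation block.
    n_buffer = math.ceil(max_offset / chunk_size)
    n_validation = int(n_chunks * validation_fraction)
    n_training = n_chunks - n_buffer - n_validation
    training = list(range(n_training))
    buffer = list(range(n_training, n_training + n_buffer))
    validation = list(range(n_training + n_buffer, n_chunks))
    return training, buffer, validation

train, buf, val = split_chunks(n_chunks=100, chunk_size=1000,
                               validation_fraction=0.1, max_offset=5000)
print(len(train), len(buf), len(val))  # 85 5 10
```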

aquaskyline (Member) commented:

fixed in #61
