I'm thinking about min-max normalization before tokenization, but I'm not sure how this will affect the data distribution...
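In case it helps to see the effect concretely, here is a minimal sketch of per-window min-max normalization (the function name and the constant-window handling are mine, not from any library). It preserves the shape of the window while discarding its absolute level and scale, which is exactly the distributional change to think about:

```python
import numpy as np

def minmax_normalize(x):
    """Map a window to [0, 1]; the shape survives, the absolute level is lost."""
    lo, hi = x.min(), x.max()
    if hi == lo:                      # constant window: avoid divide-by-zero
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

# Large values with small differences keep their relative spacing.
x = np.array([4000.0, 4002.0, 4005.0, 4010.0])
print(minmax_normalize(x))  # values map to 0.0, 0.2, 0.5, 1.0
```

Note that every window ends up spanning the full [0, 1] range, so the model can no longer distinguish a window that varied by 10 from one that varied by 1000.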
My data have relatively large values (4000+), but the differences between points are actually quite small (ranging from 1 to 10, roughly).
As shown in this image, the ids after mean-scale tokenization are even identical for several points, which means the bin resolution is not fine enough to properly separate them.
I wonder whether a similar situation occurs in some of the pretraining data, and how to solve this if it is a problem.
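The collision described above can be reproduced with a minimal sketch of mean-scale tokenization with uniform binning (the bin count and the [LOW, HIGH] range here are illustrative assumptions, not the exact values used by the library): after dividing by the mean absolute value, the whole series collapses into a sliver narrower than a single bin.

```python
import numpy as np

# Illustrative binning parameters (assumptions, not the library's exact values).
N_BINS = 4094
LOW, HIGH = -15.0, 15.0

def mean_scale_tokenize(x):
    """Divide by the mean absolute value, then quantize into uniform bins."""
    scale = np.mean(np.abs(x))
    scaled = x / scale
    edges = np.linspace(LOW, HIGH, N_BINS + 1)
    return np.digitize(scaled, edges), scaled

# A series with a large offset (~4000) but small point-to-point variation.
x = np.array([4000.0, 4002.0, 4005.0, 4007.0, 4010.0])
ids, scaled = mean_scale_tokenize(x)

bin_width = (HIGH - LOW) / N_BINS     # ~0.0073 per bin
spread = scaled.max() - scaled.min()  # ~0.0025 after scaling
print(ids)                  # at most two distinct ids for the whole series
print(spread < bin_width)   # True: the series spans less than one bin width
```

The root cause is that mean scaling normalizes by the level (~4000) rather than the variation (~10), so all scaled values cluster tightly around 1.0. Subtracting the window mean (or min) before scaling would spread the values across many more bins.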