I'm thinking about min-max normalization before tokenization, but I'm not sure how this will affect the data distribution...
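In case it helps to see the effect concretely, here is a minimal sketch of per-window min-max normalization (the function name and the constant-window handling are mine, not from any library). It preserves the shape of the window while discarding its absolute level and scale, which is exactly the distributional change to think about:

```python
import numpy as np

def minmax_normalize(x):
    """Map a window to [0, 1]; the shape survives, the absolute level is lost."""
    lo, hi = x.min(), x.max()
    if hi == lo:                      # constant window: avoid divide-by-zero
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

# Large values with small differences keep their relative spacing.
x = np.array([4000.0, 4002.0, 4005.0, 4010.0])
print(minmax_normalize(x))  # values map to 0.0, 0.2, 0.5, 1.0
```

Note that every window ends up spanning the full [0, 1] range, so the model can no longer distinguish a window that varied by 10 from one that varied by 1000.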
My data have relatively large values (4000+), but the differences between points are actually quite small (ranging from 1 to 10, roughly).
As shown in this image, the ids after mean-scale tokenization are even identical for several points, which means the bin resolution is not fine enough to properly separate them.
I wonder whether a similar situation occurs in some of the pretraining data, and how to solve this if it is a problem.
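The collision described above can be reproduced with a minimal sketch of mean-scale tokenization with uniform binning (the bin count and the [LOW, HIGH] range here are illustrative assumptions, not the exact values used by the library): after dividing by the mean absolute value, the whole series collapses into a sliver narrower than a single bin.

```python
import numpy as np

# Illustrative binning parameters (assumptions, not the library's exact values).
N_BINS = 4094
LOW, HIGH = -15.0, 15.0

def mean_scale_tokenize(x):
    """Divide by the mean absolute value, then quantize into uniform bins."""
    scale = np.mean(np.abs(x))
    scaled = x / scale
    edges = np.linspace(LOW, HIGH, N_BINS + 1)
    return np.digitize(scaled, edges), scaled

# A series with a large offset (~4000) but small point-to-point variation.
x = np.array([4000.0, 4002.0, 4005.0, 4007.0, 4010.0])
ids, scaled = mean_scale_tokenize(x)

bin_width = (HIGH - LOW) / N_BINS     # ~0.0073 per bin
spread = scaled.max() - scaled.min()  # ~0.0025 after scaling
print(ids)                  # at most two distinct ids for the whole series
print(spread < bin_width)   # True: the series spans less than one bin width
```

The root cause is that mean scaling normalizes by the level (~4000) rather than the variation (~10), so all scaled values cluster tightly around 1.0. Subtracting the window mean (or min) before scaling would spread the values across many more bins.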