smolstr is a concept i had once about space saving by storing text in 5 bits instead of the normal 8 bits which unicode is encoded in.
this is nothing more than a test, feel free to use it anywhere, its open source and should be pretty straight forward
from my own testing this saves on average a ratio of 3/2 in direct size comparison
if you want to test this encoder, i've left a tool in the releases tab which anyone who has touched an command prompt should be able to figure out, also to some extent the user could just drag and drop txt files ontop of exe
The file format is built with three blocks
Reading the file format is pretty simple, one could skip first 36 bytes, but its recommended to read what alphabet the file is using otherwise it will text will be jibberish
this will require some bitshifting, so atleast basic understanding on how bitshifting works.
everytime the current u128 runs out of bits (every 25 characters, 5 * 16 = 125), the current u128 is pushed to the output file and another u128 starts, rinse and repeat until there is no data left and the last u128 just gets pushed, so the last u128 always gets padded.
reading is the same as writing but in reverse, so instead of writing you read u128, go through every 5 bits until there are no bits to go through then drop current u128 and read the next u128. when the end of the file is reached you are done!
- pushing a 1 will toggle "number mode" all character read during this time will get the index in the alphabet to result in a number a = 1, b = 2, etc... example: "1aabb1" will result the number "1122"
- pushing a 2 will either make the following letter uppercase or if the following character is a space it will put a newline character (\n)*
*this might be changed in the future
- charset can be changed BUT charset must be EXACTLY 32 characters. special characters (space, 1, 2) CANNOT be changed
they should stay the same cause they have different meaning), default is " abcdefghijklmnopqrstuvwxyz.!?12"