Stuff I thought I should add - random observations #1
Comments
I didn't even notice! Good find. We can also see the cursor data in the bar at the bottom. I bet all of that data is stored in that metadata structure that sits right before the 3 lengths and the real buffer. As far as observation 3 goes, it's a complete mystery. The whole system acts differently sometimes. It's wild. I have also noticed that sometimes it appends a second version of the data, not always complete, to the end. I assume this is a history buffer of some kind. Maybe it will make more sense once we figure out what is in the metadata structure. I know there has to be some sort of timestamp or similar value generated on every save, as the last 4 bytes of the footer and 4 bytes in the metadata structure always change when you save the file, even if you change nothing. I don't have the issue where undo reverts the entire file, although that could depend on some conditions, maybe?
Is this data at the end of the bin file, or between the main text buffer and the "history" one? Also, thank you for taking the time to contribute! Appreciate it! |
I wonder if there's also a copy-paste buffer. I do think you are right about the file format having all of the data for that tab. Similar to vim, I think, there may be a buffer for copy-paste in the file, so that it remembers the last thing you cut, copied, or deleted. IDK about that third one, but vim definitely does that. Will have to keep it in mind when I am looking at these files again! |
I decided to do a bit of work on the notepad bin file while notepad is running. It seems to live-edit the bin file as you type. I believe every action in this state is simply appended to the file. As far as I could see, while notepad is open it appends to the end of this tab file without deleting or modifying any earlier data. I also believe every action in notepad, when appended, includes a "tail" of the form
Speaking of pasting text into notepad: the pastes always seem to have a header of 3 bytes, or 4 bytes for smaller pastes. The length of the header behaves weirdly. Pasting 67 characters (<255) led to a 3-byte header twice. Then pasting 68 characters made it a 4-byte header. Then pasting 67 characters kept the 4-byte header. It is just weird...
Some constants, though: when the length of the paste is <255, the last byte before the content is always the length of the content. However, this changes when the length is >255. The length (tested with 267, 268, 269...) seems to still be stored in the header, but I see weird values like 8B/8C/8D followed by 02. Perhaps the length is approximated as XX * 02 = the actual length? For a 267-character paste the header had 8B 02, but 8B * 02 is 278, not 267. The values increase with the length of the paste but do not seem to equal it; they may be the length of something else. The first byte of the header seems to be random. In general, I've seen pastes of the form
EVERY action I did, though, always had a tail attached immediately after the content: a null byte plus 4 garble bytes. If you see anything to the contrary then do tell, because this pattern doesn't seem to change for me. I also tried testing some other hypotheses:
Edit: Upon further analysis, this is not reliably reproducible. Sometimes, the cursor position persists without 0.bin and 1.bin |
I think you are running into the varints here. You can see how to decode those, here. This class assumes that you give it a buffer of the right size, but the function above shows how to figure out how many bytes are in the varint (the high bit of a byte is set if another byte of the varint follows). |
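To illustrate the varint reading, here is a minimal Python sketch of an unsigned LEB128 decoder. The name `read_uleb128` is made up for this example and is not taken from the linked parser. Assuming the paste headers really are varints, the `8B 02` seen for the 267-character paste decodes to exactly 267: the low 7 bits of `8B` give 11, and `02` shifted left by 7 gives 256.

```python
def read_uleb128(buf: bytes, offset: int = 0) -> tuple:
    """Decode one unsigned LEB128 value; return (value, bytes_consumed)."""
    value = 0
    shift = 0
    pos = offset
    while True:
        if pos >= len(buf):
            raise ValueError("buffer ended in the middle of a varint")
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, least significant group first.
        value |= (byte & 0x7F) << shift
        # A clear high bit marks the last byte of the varint.
        if byte & 0x80 == 0:
            return value, pos - offset
        shift += 7


# Header bytes reported in this thread for pastes of 67, 267, 268 and 269 characters.
for raw in ("43", "8B02", "8C02", "8D02"):
    value, size = read_uleb128(bytes.fromhex(raw))
    print(f"{raw}: value={value}, bytes={size}")
```

Values below 128 fit in one byte, which would explain why the byte before the content looked like a plain length for short pastes.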
I don't have a Windows 11 machine handy at the moment, but I'm pretty sure it's Ctrl+Y to redo in the new notepad. Probably to match Microsoft Word, which I think also uses Ctrl+Y for redo. |
Hi, I think I might be able to help you all with looking at the buffer for unsaved changes. I have my notes and thoughts located in my repo: https://github.com/ogmini/Notepad-Tabstate-Buffer. In short, unsigned LEB128 is used to store the position, the number of characters deleted, and the number of characters added. These are followed by the characters that were added, if any, stored as little-endian UTF-16. As was noted earlier, a 4-byte sequence follows. I haven't figured out what it is yet. It is actually really elegant how they stored this, as it handles normal typing, deletion, selecting, and copy/pasting. *EDIT After some poking, those 4 bytes appear to be the CRC32 of the previous bytes. |
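As a rough sketch of the layout described in this comment, the Python below parses one appended record: three unsigned LEB128 values (position, deleted count, added count), then the added text as little-endian UTF-16, then a 4-byte tail. The helper names are hypothetical. It also assumes the added count is in UTF-16 code units, that the CRC32 covers only this record's preceding bytes, and that the stored byte order is unknown (both orders are tried). None of those assumptions is confirmed in the thread.

```python
import zlib


def read_uleb128(buf: bytes, offset: int) -> tuple:
    """Decode one unsigned LEB128 value; return (value, next_offset)."""
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("buffer ended in the middle of a varint")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, offset
        shift += 7


def parse_edit_record(buf: bytes, offset: int = 0) -> dict:
    """Parse one appended edit record starting at `offset`."""
    start = offset
    position, offset = read_uleb128(buf, offset)
    deleted, offset = read_uleb128(buf, offset)
    added, offset = read_uleb128(buf, offset)
    text_end = offset + added * 2  # assumption: count is in 2-byte UTF-16 units
    tail_end = text_end + 4        # 4-byte tail follows the text
    if tail_end > len(buf):
        raise ValueError("record is truncated")
    return {
        "position": position,
        "deleted": deleted,
        "added": added,
        "text": buf[offset:text_end].decode("utf-16-le"),
        "body": buf[start:text_end],
        "tail": buf[text_end:tail_end],
        "next_offset": tail_end,
    }


def tail_matches_crc32(body: bytes, tail: bytes) -> bool:
    """Compare the tail with CRC32(body) in either byte order (order is a guess)."""
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return tail in (crc.to_bytes(4, "big"), crc.to_bytes(4, "little"))


# The undo record quoted in the original post of this issue.
record = parse_edit_record(bytes.fromhex("14050099192" "6FB"))
print(record["position"], record["deleted"], record["added"], record["tail"].hex())
```

Read this way, the undo record from the original post is "at position 20, delete 5 characters, add 0", followed by the 4-byte tail. Whether that tail equals the CRC32 of just `14 05 00` is left for someone with real files to check using `tail_matches_crc32`.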
Program.cs LN94: after the 01, it's the selected text by the looks of it. The next byte after
You can change your code to match, and it should display the selections correctly:
// LN94
byte[] un2 = reader.ReadBytes(2); // 00 01
ulong selectionStartIndex = reader.BaseStream.ReadLEB128Unsigned();
ulong selectionEndIndex = reader.BaseStream.ReadLEB128Unsigned();
In my case This is the content of the
I've forked the repo to include a Selected text output (seems to be consistently working) here |
CRC32 of the bytes just added or more than that? |
There are actually multiple CRC32 checks in the file. https://github.com/ogmini/Notepad-Tabstate-Buffer/blob/main/README.md#file-format |
@notarib-catcher If you're talking about the last 4 bytes, it's the CRC32 of everything after |
I'm currently trying to figure out the timestamp type of a saved file. |
I haven't started to tackle these unknown bytes yet. I jumped over to the Windowstate files since they seem a little shorter and might give insight into the Tabstate files. Looking at what you've found, I would agree that those 8 bytes appear to be a timestamp of some sort. It might be related to FILETIME in the Win32 API https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-filetime?redirectedfrom=MSDN I could be completely wrong though. I haven't tried or tested. |
I looked into this more and it's partially correct. I treated the 8 bytes as a In my tests, the Here's my latest ImHex pattern:
#include <std/mem.pat>
#include <std/string.pat>
#include <std/hash.pat>
#include <type/leb128.pat>
#include <std/time.pat>
using ul = type::uLEB128;
struct SavedNotepadTab
{
char NullTerminatedHeaderIdentifier[3];
bool IsSaved;
ul SavedFilePathLength;
char16 SavedFilePath[SavedFilePathLength];
ul TabContentLength0;
u8 Unknown1[2];
ul uLEB128FileTime;
u8 Unknown2[32];
u8 PossibleSpecialDelimiter[2];
ul SelectionStartIndex;
ul SelectionEndIndex;
u8 PossibleSpecialDelimiterEnd[4];
ul ContentLength2;
char16 Content[ContentLength2];
bool IsTempFile;
u32 CRC32;
};
SavedNotepadTab tabState @ 0x0;
std::time::EpochTime unixTime = std::time::filetime_to_unix(tabState.uLEB128FileTime);
std::print("Last Saved Unix Time is " + std::string::to_string(unixTime)); |
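For anyone working outside ImHex, the same conversion can be sketched in Python. This assumes the decoded uLEB128 value really is a Win32 FILETIME (100-nanosecond ticks since 1601-01-01 UTC), as guessed earlier in the thread. The helper names are hypothetical, and the sample value is constructed for illustration rather than taken from a real tab-state file.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100 ns ticks from 1601-01-01 UTC; Unix time counts seconds from 1970-01-01 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
EPOCH_DIFF_SECONDS = 11_644_473_600
TICKS_PER_SECOND = 10_000_000


def filetime_to_unix(filetime: int) -> float:
    """Convert FILETIME ticks to seconds since the Unix epoch."""
    return filetime / TICKS_PER_SECOND - EPOCH_DIFF_SECONDS


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert FILETIME ticks to an aware UTC datetime (microsecond precision)."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


# Illustrative value only: the FILETIME that corresponds to the Unix epoch itself.
sample = EPOCH_DIFF_SECONDS * TICKS_PER_SECOND
print(filetime_to_unix(sample), filetime_to_datetime(sample).isoformat())
```

If a decoded value lands on a plausible save time after this conversion, that supports the FILETIME reading; a date far in the past or future would suggest the field is something else.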
Very cool, I've been playing around with ImHex, trying to learn it. The use of more uLEB128 makes sense since it appears everywhere else. Oddly, the Windowstate file appears to use uint16 for coordinates. Interestingly, I don't think the unsaved tab has a timestamp in it? |
The unsaved tab doesn't look to have a timestamp in it, at least the ones I have tested. I currently don't know how to have ImHex fill the array with the remaining chunks (see here) I'm still curious about the unknown From my observations it's shown the following
|
They aren't unknown bytes. Check the parser to see what they are. You are looking at the size of the buffer accounting for special characters like line feeds, as the buffer only stores Unix-style line feeds and omits the Windows carriage return. With Windows-style newlines, this size (which is a varint) is larger than the buffer size listed just before the actual text buffer by 1 for each newline. Then you have the encoding type and the carriage return type after that. The parser should hopefully clear it up for you. |
Also, this looks like mouse position? You said it's incrementing as if it were big-endian instead of little-endian? |
IDK if those are timestamps or what. They might be combined data, as, the timestamp makes no sense no matter what I compare it to. Mouse position might be more accurate, but idk. It's a very weird part of the file. |
By volatile I mean that it changes on every save instance. Even when my mouse is at rest it changes.
They've even confirmed on the post you've linked that it's a timestamp. I'll look into what the delimiters actually are in a little bit. Edit: For the record, I was primarily covering the Saved Tabs while @notarib-catcher was covering Unsaved Tabs and the 0.bin and 1.bin files. The timestamps occur in Saved Tabs but not Unsaved Tabs. Here's the link to my latest ImHex pattern, which covers Saved and Unsaved. In Saved files I am still missing a few pieces:
u8 Unknown1[2]; // 05 01
ul FileTime;
u8 Unknown2[32];
u8 SelectionStartDelimiter[2]; // 00 01
From the link in my HexPat. The Unsaved and Saved tabs both have that 1 unknown bool before the CRC32, which the Python PR just shrugs off as char unk1; and never references again. |
The 32 unknown bytes look to be the SHA256 of the content. The CyberChef recipe is |
hmm. That would mean the Metadata structure is not a fixed size. Have you tried this with multiple files? |
I have, and it's consistent, I've tried short text files and long text files. The selection indexes, and lengths all scale properly. I'm just left with figuring out what those delimiters mean. Since the one person said that there's not any delimiters, each value has a purpose. |
Alright, well I gotta figure out how to get the FileTime properly in Rust. I can import the functions myself, but being lazy at the moment. The FileTime crate does not give me what I need. |
Someone in the thread that I mentioned this one in, said they are working on a writeup where they have all of it figured out, btw. Just a heads up. Might wanna go subscribe to that thread if you are waiting for the update. |
This is slightly incorrect. You need to hash the content of the file on disk, as it is not stored the same way in the tab-state. The encoding and the carriage return type affect the hash. |
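A small Python sketch of that point, under the thread's working assumption that the 32 bytes are a SHA-256 of the saved file's raw bytes. `content_hash_matches` is a hypothetical helper, and the "stored" hash below is generated in the example itself as a stand-in for bytes read from a real tab-state file.

```python
import hashlib


def content_hash_matches(file_bytes: bytes, stored_hash: bytes) -> bool:
    """Compare SHA-256 of the raw on-disk bytes with the 32 bytes from the tab-state."""
    return hashlib.sha256(file_bytes).digest() == stored_hash


text = "line one\nline two"
lf_bytes = text.encode("utf-8")                          # LF only, as kept in the tab buffer
crlf_bytes = text.replace("\n", "\r\n").encode("utf-8")  # CRLF, as a file saved on disk might be

# Stand-in for the 32 bytes read from a tab-state file of a CRLF-saved file.
stored = hashlib.sha256(crlf_bytes).digest()

print(content_hash_matches(crlf_bytes, stored))  # hashing the bytes as saved on disk
print(content_hash_matches(lf_bytes, stored))    # hashing the buffer form instead
```

The same text hashes differently once the line endings (or the encoding) change, which is why hashing the text buffer from the tab-state alone would not reproduce the stored value.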
I just uploaded the a |
Thanks, I've gone ahead and updated my pattern to reflect that. |
Btw, this is what I think you were missing from an earlier post. The 0 after the content hash might be padding or something else, but I have been considering the cursor start delimiter to be 01 and the end delimiter to be 01 as an int. I am not sure about this part, but if you look at the other states, the marker start and end are the same. Starts with a single byte |
I will be adding all of you to the readme in the thank you section, shortly. I am going to link your githubs, but, let me know here if you would like a different link to be used! |
SHA256 Hash - Nordgaren/tabstate-util#1 (comment)
A lot to catch up on here. Thanks to @JustArion for the ImHex patterns and @Nordgaren for the Binary Template file. Very useful in comparing and learning how to use 010 and ImHex. Getting a little lost in the mix, but has anyone looked further at the 0.bin and 1.bin files? @notarib-catcher https://github.com/ogmini/Notepad-Tabstate-Buffer?tab=readme-ov-file#0bin--1bin At this point, the only parts I can't figure out are:
I might give writing an ImHex pattern or 010 Binary Template for this a shot. *EDIT Looks like the size of the bin file is stored in bytes as a uLEB128. 0x0B - 11 bytes until the CRC32 bytes |
Looking at your pattern and template files, specifically the header part. I don't think that the 3rd byte is a NULL or a delimiter. Granted, it is always 0 for the non-0.bin and 1.bin files (I'm calling these state files for now). In the state file, it appears to be a uLEB128 sequence number that continues to count up over time. Lets the program know which is the last state file and which is the backup. I doubt those numbers will ever change from 0x00 for the non-state files; but it just seems too similar to the state file to be different. |
Yeah, I mention this in the parser. I keep it as null for now until I figure out the significance. It could just be junk, though, when it's not null, which seems to be the case. The uLEB128 you are speaking of might actually be replacing the 0 or 1 for the saved/unsaved state, which I have seen happen. It basically just tells you how many bytes there are until the footer. It's hard to tell, as there is definitely a condition where a junk file can be generated. |
My next line of thought is to check the notepad exe itself. There are some strings which have been helpful, and I might have found where notepad writes the header bytes, but I am not at home, so the reverse engineering is slow going (I only have one screen on my laptop). But I think we can get some more definitive answers if I look into the notepad exe a bit more. It's in C++, so we might even luck out on getting some RTTI data :) |
I think I figured out the WindowState files for anyone interested. Far simpler file. |
@Nordgaren - are you planning to publish the BT file to 010 Editor? I'm hoping you are. |
I think someone else had, already. I can upload mine, if it's better, or the person who has uploaded theirs is free to update theirs with mine! |
I submitted one for the windowstate file. I don't believe there is one for the tabstate file. |
I decided to take the approach of repeatedly opening and closing notepad and trying to see what data persisted and what was lost between sessions. All that data must be part of the tab "state" and therefore, must be somewhere in these tab-state files.
To be honest, reading this a second time, I don't think the info here is that helpful, but it should help someone working on this get a head start. Take everything here with a tub of salt - I'm a uni student and have 0.00 years of professional experience.
OBSERVATION #1:
Steps:
Notepad notices that the file on disk has edits newer than the cached edits in notepad.
Therefore:
I was curious about the 0.bin and 1.bin files. Since they are considerably smaller but still somewhat follow the same format (see point 7), I decided to focus on those and run some tests.
OBSERVATION #2:
01 00 00 00
) in the 1.bin file, followed by some garbled data. Also, if you notice, notepad preserves cursor position between sessions. I assume that, too, must be stored somewhere in those files or the main one. They're clearly a complete "tab state" that has all the necessary info to recreate a notepad tab, including where the cursor was, etc.
OBSERVATION #3:
While notepad was open, I tried adding more data to the file. What I noticed was that the new data was added as XXXX appended onto the end of the original file content.
<original-file-contents> <garble> <byte 1 of new data> <8 bytes of garble> <byte 2 of new data> <8 bytes of garble> <another byte of new data>
.... (Or I guess, if it was UTF-16, 7 bytes of garble and two bytes of new data and so on... - which seems more accurate)
Pressing Ctrl+S to save immediately purges this garble and turns the file into the same format that John saw in his video for a closed session.
Another curious thing I noticed was that while every action caused a change in the file, notepad does not seem to have a REDO function. Or at least, it isn't mapped to Ctrl+Shift+Z.
Another, more curious thing is that "undo" seems to revert the entire file to its original state as saved. Discarding all new data in one step...
So I experimented with this some more. What seems to happen is that all actions taken in notepad get appended to the end of the bin file. Ex. Undo gets appended as
14 05 00 99 19 26 FB
That's then cleared when the file is saved. Adding onto the theory that everything in the file while notepad is open is an action, and not data to be stored: if you paste text in an open window, the pasted text is visible in hex in one coherent, UTF-16 encoded block.
I got kinda fatigued at this point as it was getting late, but I hope whoever reads this gets a bit of a head start!