Stuff I thought I should add - random observations #1

Open
notarib-catcher opened this issue Feb 28, 2024 · 39 comments

@notarib-catcher

notarib-catcher commented Feb 28, 2024

I decided to take the approach of repeatedly opening and closing notepad and trying to see what data persisted and what was lost between sessions. All that data must be part of the tab "state" and therefore, must be somewhere in these tab-state files.

To be honest, reading this a second time, I don't think the info here is that helpful, but it should help someone working on this get a head start. Take everything here with a tub of salt - I'm a uni student and have 0.00 years of professional experience.

OBSERVATION #1

Steps:

  1. Create a new file and save it.
  2. Load up the saved file in notepad and edit it
  3. DO NOT SAVE the edits and close notepad. Reopen notepad to verify that the edits were cached (They were). Then close notepad
  4. Open the file in a second editor and add some text. Save the file.
  5. Reopen the file in notepad and navigate to the tab with the unsaved data.

Notepad notices that the file on disk has edits newer than the cached edits in notepad.

Therefore:

  • Notepad (probably) saves the hash of the file on disk + time of last edit as well as the hash of the cached edits and their timestamps. That could be the garbled data before and after the contents.
  • I believe that the garbled data in between the delimiters and the data after the end of contents must be some form of hashes + timestamp. Perhaps the timestamp of the edits + the timestamp of the last edits and the hash + timestamp of the file on disk.

I was curious about the .0.bin and .1.bin files. They are considerably smaller but still follow the same format somewhat (see step 7 below), so I decided to focus a bit on those and do some tests.

OBSERVATION #2:

  1. Create a file
  2. Open it in notepad and see the cache. One file with a UUID is made.
  3. Close the file, we see .0.bin and .1.bin pop into existence.
  4. We also see that .1.bin is empty (Zero bytes).
  5. Reopen the file in notepad. This usually makes a second (newer) tab. Close that tab so that the original tab is in view.
  6. Now close the file without making any edits in the tab.
  7. .1.bin is populated! Moreover, we see the same pattern (01 00 00 00) in the .1.bin file - followed by some garbled data.
  8. Now repeat steps 5 through 7.
  9. We see that the end of .1.bin has changed.
  10. If I repeat 5-7 a second time, we see that .1.bin doesn't change, but .0.bin does. In conclusion, it seems notepad stores session data alternately, once in .0.bin and once in .1.bin. The initial session populates .0.bin, the next populates .1.bin, and back and forth.

Also, notice that notepad preserves cursor position between sessions. I assume that, too, must be stored somewhere in those files or the main one. They're clearly a complete "Tab state" that has all the necessary info to recreate a notepad tab, including where the cursor was, etc.

OBSERVATION #3:

While notepad was open, I tried adding more data to the file. What I noticed was that the new data was appended onto the end of the original file content, like this:

<original-file-contents> <garble> <byte 1 of new data> <8 bytes of garble> <byte 2 of new data> <8 bytes of garble> <another byte of new data>

.... (Or I guess, if it was UTF-16, 7 bytes of garble and two bytes of new data and so on... - which seems more accurate)

Pressing Ctrl+S to save immediately purges this garble and turns it into the same format that John saw in his video for a closed session.

Another curious thing I noticed was that while every action caused a change in the file, notepad does not seem to have a REDO function. Or at least, it isn't mapped to ctrl+shift+z.

Another, more curious thing is that "undo" seems to revert the entire file to its original state as saved. Discarding all new data in one step...

So I experimented with this some more. What seems to happen is that all actions taken in notepad get appended to the end of the bin file. Ex. Undo gets appended as 14 05 00 99 19 26 FB, which is then cleared when the file is saved.

Adding onto the theory that everything in the file while notepad is open is an action, and not data to be stored: If you paste text in an open window, the pasted text is visible in hex in one coherent, UTF-16 encoded block.

I got kinda fatigued at this point as it was getting late, but I hope whoever reads this gets a bit of a head start!

@Nordgaren
Owner

Nordgaren commented Feb 28, 2024

Also, if you notice, notepad preserves cursor position between sessions, I assume that too, must be stored somewhere in those files or the main one. They're clearly a complete "Tab state" that has all the necessary info to recreate a notepad tab, including where the cursor was, etc.

I didn't even notice! Good find. We can also see the cursor data in the bar at the bottom. I bet all of that data is stored in that metadata structure that is right before the 3 lengths and the real buffer.

As far as observation 3 goes, it's a complete mystery. The whole system acts differently sometimes. It's wild. I have also noticed that sometimes it appends a second version of the data, but not always complete, to the end. I assume this is like a history buffer of some kind, as well. Maybe it will make more sense once we figure out what is in the metadata structure. I know there has to be some sort of timestamp or something that is generated every save, as the last 4 bytes of the footer and 4 bytes in the metadata structure always change when you save the file, even if you change nothing.

I don't have the issue where undo reverts the entire file, although that could be depending on some conditions, maybe?

So I experimented with this some more. What seems to happen is that all actions taken in notepad get appended to the end of the bin file. Ex. Undo gets appended as 14 05 00 99 19 26 FB thats then cleared when the file is saved.

This is data at the end of the bin file or between the main text buffer and the "history" one?

Also, thank you for taking your time to contribute! Appreciate it!

@Nordgaren
Owner

I wonder if there's also a copy paste buffer. I do think you are right about the file format having all of the data for that tab. Similar to vim, I think there's a buffer for copy paste in the file, so that it remembers the last thing you cut, copied or deleted. IDK about that third one, but vim definitely does that. Will have to keep it in mind when I am looking at these files again!

@notarib-catcher
Author

notarib-catcher commented Feb 29, 2024

As far as observation 3 goes, it's a complete mystery. The whole systems acts differently sometimes. It's wild. I have also noticed that sometimes it appends a second version of the data, but not always complete, to the end. I assume this is like a history buffer of some kind, as well. Maybe it will make more sense once we figure out what is in the metadata structure. I know there has to be some sort of timestamp or something that is generated every save, as the last 4 bytes of the footer and 4 bytes in the metadata structure always change when you save the file, even if you change nothing.

I worked on this a bit more between classes, mostly with notepad open and the file unsaved (working on the "live" buffer, while notepad is active in the background).

I decided to do a bit of work on the notepad bin file while notepad is running. It seems to live-edit the bin file as you type.

I believe every action in this state is simply appended to the file. As far as I could see, there is no backtracking or editing of earlier parts of the file. While notepad is open, it appends to the end of this tab file without deleting or modifying any earlier data.

I also believe every action in notepad, when appended, includes a "tail" of the form 00 XX XX XX XX where XX is seemingly random (possibly a checksum). I tested typing characters, backspacing, pasting, etc. Everything seems to have this "Tail" with one null byte and 4 bytes that follow.

Speaking of pasting text into notepad:

The pastes always seem to have a header of 3 or 4 bytes for smaller pastes. The length of the header behaves weirdly. Pasting 67 characters (<255) led to a 3-byte header twice. Then pasting 68 characters made it a 4-byte header. Then pasting 67 bytes kept the 4-byte header. It is just weird....

Some constants though: When the length of the paste is <255, the last byte before the content is always the length of the content.

However this changes when the length is >255. length (tested with 267,268,269...) seems to still be stored in the header, but I see weird values like 8B/8C/8D followed by 02 (perhaps the length is approximated to XX * 02 = the actual length? For a 267 char paste, the header had 8B 02, but 8B * 02 is 278 not 267.... the values seem to increase with length but do not seem to equal the length of the paste but the length of something else)

The first byte of the header seems to be random.
The second byte seems to be a counter of sorts - only present on headers length >= 4. I have observed it staying the same or increasing by 2, but never decreasing, hence my deduction.
The third byte is null, and the next one or two bytes before the content seem to indicate some sort of length value as described earlier.

In general, I've seen pastes of the form

XX [CC] 00 LL [LL] ...<CONTENT AS UTF-16>... 00 XX XX XX XX
Where
XX - No idea what it does (garble / random)
CC - Somewhat behaves like a counter (+2 or same, never reduces)
00 - null byte
LL - Seems to indicate length (of... something.. Sometimes equals paste length sometimes approximately equals it)
[..] - Only present some times, not present other times.

EVERY action I did though, always had a tail attached immediately after the content which is nullbyte + 4 garble bytes. If you see anything to the contrary then do tell, because this pattern doesn't seem to change for me.

I also tried testing another hypothesis:

We can also see the cursor data in the bar at the bottom. I bet all of that data is stored in that metadeta structure that is right before the 3 lengths and the real buffer.

I tried to narrow down where the cursor position is stored. It appears to be in the .0.bin or .1.bin files.

Test:

  1. Open notepad on an existing txt file.
  2. Move the cursor to the middle of the file.
  3. Close notepad.
  4. Delete both .0.bin and .1.bin
  5. Reopen notepad: The cursor is at Line 0 Col 0.

Edit: Upon further analysis, this is not reliably reproducible. Sometimes, the cursor position persists without 0.bin and 1.bin

@Nordgaren
Owner

However this changes when the length is >255. length (tested with 267,268,269...) seems to still be stored in the header, but I see weird values like 8B/8C/8D followed by 02 (perhaps the length is approximated to XX * 02 = the actual length? For a 267 char paste, the header had 8B 02, but 8B * 02 is 278 not 267.... the values seem to increase with length but do not seem to equal the length of the paste but the length of something else)

I think you are running into the varints here. You can see how to decode those here. This class assumes that you give it a buffer of the right size, but the function above shows how to figure out how many bytes are in the varint (the high bit of a byte is set if another byte follows it in the varint).
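
A minimal Python sketch of that varint (unsigned LEB128) decoding, applied to the header bytes from the 267-character paste. The function name is made up for illustration and is not taken from the parser linked in this comment.

```python
def decode_uleb128(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode one unsigned LEB128 value; return (value, bytes_consumed)."""
    value = 0
    shift = 0
    pos = offset
    while True:
        byte = data[pos]
        pos += 1
        # The low 7 bits carry data, least significant group first.
        value |= (byte & 0x7F) << shift
        # A clear high bit marks the last byte of the varint.
        if (byte & 0x80) == 0:
            break
        shift += 7
    return value, pos - offset


print(decode_uleb128(bytes([0x8B, 0x02])))  # (267, 2)
print(decode_uleb128(bytes([0x43])))        # (67, 1)
```

Read this way, 8B 02 is exactly 267, and 8C 02 / 8D 02 come out as 268 / 269, which would match the paste lengths tested above.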

@starchyunderscore

Another curious thing I noticed was that while every action done caused a change in the file, curiously, notepad does not seem to have a REDO function. Or atleast, it isn't mapped to ctrl+shift+z.

I don't have a w11 machine handy at the moment but I'm pretty sure it's ctrl+y to redo in the new notepad. Probably to match microsoft word, which I think also uses ctrl+y for redo.

@Nordgaren
Owner

Another curious thing I noticed was that while every action done caused a change in the file, curiously, notepad does not seem to have a REDO function. Or atleast, it isn't mapped to ctrl+shift+z.

I don't have a w11 machine handy at the moment but I'm pretty sure its ctrl+y to redo in the new notepad. Probably to match microsoft word, which I think also uses ctrl+y for redo.

Weird. It's not in the interface at all, but ctrl + y does work. Thank you!

@ogmini

ogmini commented Mar 4, 2024

Hi,

I think I might be able to help you all looking at the buffer for unsaved changes. I have my notes and thoughts located in my repo: https://github.com/ogmini/Notepad-Tabstate-Buffer.

But in short, unsigned LEB128 is used to store the position, the number of characters deleted, and the number of characters added. These are followed by the characters that were added, if any, stored as little-endian UTF-16. As was noted earlier, a 4 byte sequence follows. I haven't figured out what they are yet. It is actually really elegant how they stored this, as it handles normal typing, deletion, selecting, and copy/pasting.

*EDIT

After some poking, those 4 bytes appear to be the CRC32 of the previous bytes.
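
As a rough illustration of that chunk layout, here is a Python sketch that reads one chunk: three unsigned LEB128 values (position, characters deleted, characters added), the added characters as UTF-16LE, then a 4-byte CRC32 over the chunk's preceding bytes. The names `Chunk`, `parse_chunk` and `build_chunk` are hypothetical, and the byte order of the stored CRC32 (big-endian here) is an assumption, since it is not stated in this thread. Under this reading, the earlier "XX [CC] 00 LL [LL]" paste header would be the position, a deleted count of 0, and the added count.

```python
import zlib
from dataclasses import dataclass


def read_uleb128(data: bytes, pos: int) -> tuple[int, int]:
    """Return (value, position just past the varint)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def write_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


@dataclass
class Chunk:
    position: int
    deleted: int
    added: str
    crc_ok: bool


def parse_chunk(data: bytes, pos: int = 0, crc_byteorder: str = "big"):
    """Parse one chunk starting at pos; return (Chunk, position after it)."""
    start = pos
    position, pos = read_uleb128(data, pos)
    deleted, pos = read_uleb128(data, pos)
    added_count, pos = read_uleb128(data, pos)
    text_end = pos + 2 * added_count  # two bytes per UTF-16 code unit
    added = data[pos:text_end].decode("utf-16-le")
    # Assumption: the CRC32 covers this chunk's bytes and is stored big-endian.
    stored_crc = int.from_bytes(data[text_end:text_end + 4], crc_byteorder)
    crc_ok = zlib.crc32(data[start:text_end]) == stored_crc
    return Chunk(position, deleted, added, crc_ok), text_end + 4


def build_chunk(position: int, deleted: int, added: str, crc_byteorder: str = "big") -> bytes:
    """Build a synthetic chunk for testing (not captured from Notepad)."""
    text = added.encode("utf-16-le")
    body = (write_uleb128(position) + write_uleb128(deleted)
            + write_uleb128(len(text) // 2) + text)
    return body + zlib.crc32(body).to_bytes(4, crc_byteorder)


raw = build_chunk(300, 0, "hi")
chunk, end = parse_chunk(raw)
print(chunk, end == len(raw))
```

The round trip only shows that the layout is self-consistent. Checking it against real bytes from a tab-state file would be the actual test of the byte-order assumption.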

@JustArion

JustArion commented Mar 7, 2024

Hi,

I think I might be able to help you all looking at the buffer for unsaved changes. I have my notes and thoughts located in my repo: https://github.com/ogmini/Notepad-Tabstate-Buffer.

But in short, unsigned LEB128 is used to position, number of characters deleted, number of characters added. These are followed by the characters if any that were added stored as little-endian UTF-16. As was noted earlier, a 4 byte sequence follows. I haven't figured out what they are yet. It is actually really elegant how they stored this as this handles both normal typing, deletion, selecting, and copy/pasting.

*EDIT

After some poking, those 4 bytes appear to be the CRC32 of the previous bytes.

Program.cs LN94: after the 01, it's the selected text by the looks of it. The next byte after 00 01 is the selection start index (uLEB128 most likely) and the next one after that is the selection end index (if nothing is selected, the start and end would be equal).

You can change your code to match and it should display the selections correctly.

// LN94

byte[] un2 = reader.ReadBytes(2); // 00 01
ulong selectionStartIndex = reader.BaseStream.ReadLEB128Unsigned();
ulong selectionEndIndex = reader.BaseStream.ReadLEB128Unsigned();

In my case selectionStartIndex would be 0 and selectionEndIndex would be 18
83b8b10a-21d6-496a-9a4a-5f7a419346c8_07-03-2024

EDIT:
Dawn_Files_63ca874b-67e5-4639-a128-31987b44cf57

This is the content of the numBytes

0 ?
1 ?
0 SelectionStartIndex (uLEB128)
18 SelectionEndIndex (uLEB128)
1 ?
0 ?
0 ?
0 ?
18 ContentLength (uLEB128)

I've forked the repo to include a Selected text output (seems to be consistently working) here

@notarib-catcher
Author

Hi,

I think I might be able to help you all looking at the buffer for unsaved changes. I have my notes and thoughts located in my repo: https://github.com/ogmini/Notepad-Tabstate-Buffer.

But in short, unsigned LEB128 is used to position, number of characters deleted, number of characters added. These are followed by the characters if any that were added stored as little-endian UTF-16. As was noted earlier, a 4 byte sequence follows. I haven't figured out what they are yet. It is actually really elegant how they stored this as this handles both normal typing, deletion, selecting, and copy/pasting.

*EDIT

After some poking, those 4 bytes appear to be the CRC32 of the previous bytes.

CRC32 of the bytes just added or more than that?

@ogmini

ogmini commented Mar 8, 2024

Hi,
I think I might be able to help you all looking at the buffer for unsaved changes. I have my notes and thoughts located in my repo: https://github.com/ogmini/Notepad-Tabstate-Buffer.
But in short, unsigned LEB128 is used to position, number of characters deleted, number of characters added. These are followed by the characters if any that were added stored as little-endian UTF-16. As was noted earlier, a 4 byte sequence follows. I haven't figured out what they are yet. It is actually really elegant how they stored this as this handles both normal typing, deletion, selecting, and copy/pasting.
*EDIT
After some poking, those 4 bytes appear to be the CRC32 of the previous bytes.

CRC32 of the bytes just added or more than that?

There are actually multiple CRC32 checks in the file.

https://github.com/ogmini/Notepad-Tabstate-Buffer/blob/main/README.md#chunk-format-for-unsaved-buffer

https://github.com/ogmini/Notepad-Tabstate-Buffer/blob/main/README.md#file-format

@JustArion

CRC32 of the bytes just added or more than that?

@notarib-catcher If you're talking about the last 4 bytes, it's the CRC32 of everything after NP\0 till the end (excluding the last 4 bytes of course)
2471f5c9-c083-45ca-b8ab-36b53be0f439_08-03-2024
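
A small Python sketch of that check, assuming the file starts with the 3-byte NP\0 header and ends with the 4-byte CRC32. The function name is hypothetical, and the big-endian byte order for the stored value is an assumption that this thread does not confirm. The sample bytes are synthetic, not taken from a real tab-state file.

```python
import zlib

MAGIC = b"NP"


def file_crc_matches(data: bytes, byteorder: str = "big") -> bool:
    """Compare the trailing 4 bytes with the CRC32 of everything after NP\\0."""
    if len(data) < 7 or data[:2] != MAGIC:
        raise ValueError("not a tab-state file (missing NP header)")
    stored = int.from_bytes(data[-4:], byteorder)
    # Skip the 3 header bytes and exclude the 4 CRC bytes themselves.
    return zlib.crc32(data[3:-4]) == stored


# Synthetic example: header, four arbitrary body bytes, then the CRC32.
body = b"\x01\x02\x03\x04"
sample = b"NP\x00" + body + zlib.crc32(body).to_bytes(4, "big")
corrupted = sample[:3] + b"\xff" + sample[4:]

print(file_crc_matches(sample))
print(file_crc_matches(corrupted))
```

If the check fails on a real file, trying `byteorder="little"` is the obvious next step before doubting the covered range.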

@JustArion

I'm currently trying to figure out the timestamp type of a saved file.
After the SavedFilePath there's the first mention of the Content length, following that there's 2 unknown bytes, after that, there's a volatile patch of bytes that change on every save. I believe this to be an 8 byte long timestamp.

The issue is, it's incrementing from the left.
Dawn_Files_fd249d21-e31f-40eb-a257-de61cce949dc

@ogmini

ogmini commented Mar 9, 2024

I'm currently trying to figure out the timestamp type of a saved file. After the SavedFilePath there's the first mention of the Content length, following that there's 2 unknown bytes, after that, there's a volatile patch of bytes that change on every save. I believe this to be an 8 byte long timestamp.

The issue is, its incrementing from the left.

I haven't started to tackle these unknown bytes yet. I jumped over to the Windowstate files since they seem a little shorter and might give insight into the Tabstate files.

Looking at what you've found, I would agree that those 8 bytes appear to be a timestamp of some sort. It might be related to FILETIME in the Win32 API: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-filetime?redirectedfrom=MSDN

I could be completely wrong though. I haven't tried or tested.

@JustArion

JustArion commented Mar 9, 2024

I could be completely wrong though. I haven't tried or tested.

I looked into this more and it's partially correct. I treated the 8 bytes as a ulong / int64 and used that to determine the file time. I got close (the time shows the year is 2097). Then I noticed the 8 bytes could be a uLEB128 value if I included the 01 after the ED (as shown in the gif I posted previously).

In my tests, the uLEB128 value came out to 133544567406230621, which corresponds to a valid file time. I converted it to Unix time as 1709983140, which (according to discord's <t:1709983140:R>) is correct.

Here's my latest ImHex pattern

#include <std/mem.pat>
#include <std/string.pat>
#include <std/hash.pat>
#include <type/leb128.pat>
#include <std/time.pat>

using ul = type::uLEB128;


struct SavedNotepadTab
{
    char NullTerminatedHeaderIdentifier[3];
    bool IsSaved;
    ul SavedFilePathLength;
    char16 SavedFilePath[SavedFilePathLength];
    ul TabContentLength0;

    u8 Unknown1[2];

    ul uLEB128FileTime;

    u8 Unknown2[32];
    u8 PossibleSpecialDelimiter[2];

    ul SelectionStartIndex;
    ul SelectionEndIndex;

    u8 PossibleSpecialDelimiterEnd[4];

    ul ContentLength2;
    char16 Content[ContentLength2];
    bool IsTempFile;
    u32 CRC32;
};

SavedNotepadTab tabState @ 0x0;

std::time::EpochTime unixTime = std::time::filetime_to_unix(tabState.uLEB128FileTime);
std::print("Last Saved Unix Time is " + std::string::to_string(unixTime));

@ogmini

ogmini commented Mar 9, 2024

Very cool, I've been playing around with imHex trying to learn it. The use of more uLEB128 makes sense since it appears everywhere else. Oddly, the Windowstate file appears to use uint16 for coordinates.

Interestingly, I don't think the unsaved tab has a timestamp in it?

@JustArion

JustArion commented Mar 9, 2024

The unsaved tab doesn't look to have a timestamp in it, at least the ones I have tested.

I currently don't know how to have ImHex fill the array with the remaining chunks (see here)

I'm still curious about the unknown bool before the default CRC32 check.

From my observations, it has the following values:

  • False on Saved File,
  • True on unsaved file,
  • False on unsaved file with chunks

@Nordgaren
Owner

I'm currently trying to figure out the timestamp type of a saved file. After the SavedFilePath there's the first mention of the Content length, following that there's 2 unknown bytes, after that, there's a volatile patch of bytes that change on every save. I believe this to be an 8 byte long timestamp.

The issue is, its incrementing from the left.

They aren't unknown bytes. Check the parser to see what they are. You are looking at the size of the buffer, accounting for special characters like line feeds, as the buffer only stores unix type line feeds and omits the Windows one. If the file has Windows type newlines, this size (which is a varint) will be 1 larger than the buffer size listed before the actual text buffer for each newline.

Then you have the encoding type and the carriage return type after that. The parser should hopefully clear it up for you.

@Nordgaren
Owner

Also, this looks like mouse position? You said it's incrementing as if it were big endian, instead of little endian?

@Nordgaren
Owner

IDK if those are timestamps or what. They might be combined data, as, the timestamp makes no sense no matter what I compare it to. Mouse position might be more accurate, but idk. It's a very weird part of the file.

@JustArion

JustArion commented Mar 10, 2024

By volatile I mean that it changes on every save instance. Even when my mouse is at rest it changes.

E7 E8 8A C7 91 AF 9C ED 01: I've pretty much already confirmed at this point that it's a timestamp, since it returns the correct time. When I parse it as a uLEB128, that set of bytes equates to 133543903883342951, which is a valid Int64 as required for a Windows FILETIME struct. When converted from FileTime to Unix you get a timestamp of 1709916788.
Accounting for timezone + DST, you get an accurate timestamp.
19781994-042a-4894-808e-7287f4dff13c_10-03-2024
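
The arithmetic above can be reproduced with a short Python sketch. It decodes the nine bytes as an unsigned LEB128 value, then converts the resulting FILETIME (100 ns ticks since 1601-01-01 UTC) to Unix seconds. The function names are illustrative and do not come from the ImHex pattern.

```python
from datetime import datetime, timezone

# 100 ns ticks between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
FILETIME_UNIX_OFFSET = 116_444_736_000_000_000


def decode_uleb128(data: bytes) -> int:
    """Decode a single unsigned LEB128 value from the start of data."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if (byte & 0x80) == 0:  # high bit clear: last byte
            break
    return value


def filetime_to_unix(filetime: int) -> int:
    """Convert a FILETIME tick count to whole Unix seconds."""
    return (filetime - FILETIME_UNIX_OFFSET) // 10_000_000


raw = bytes.fromhex("E7 E8 8A C7 91 AF 9C ED 01")
filetime = decode_uleb128(raw)
unix_seconds = filetime_to_unix(filetime)

print(filetime)      # 133543903883342951
print(unix_seconds)  # 1709916788
print(datetime.fromtimestamp(unix_seconds, tz=timezone.utc))
```

The printed datetime is in UTC, so it still needs the timezone + DST adjustment mentioned above before comparing it with a local clock.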

They've even confirmed on the post you've linked that it's a timestamp. I'll look into the other things for what the delimiters are in actuality in a little bit.

Edit: For note I might add, I was primarily covering the Saved Tabs while @notarib-catcher was covering Unsaved Tabs and the 0.bin and 1.bin files. The timestamps occur in Saved Tabs but not Unsaved tabs. Here's the link to my latest ImHex pattern which covers Saved and Unsaved. In Saved files I am still missing a few pieces

        u8 Unknown1[2]; // 05 01

        ul FileTime;

        u8 Unknown2[32];
        u8 SelectionStartDelimiter[2]; // 00 01

From the link in my HexPat.

The Unsaved and Saved tabs both have that 1 unknown bool before the CRC32, which the python PR just shrugs off as

    char      unk1;

and never references again.

@JustArion

The 32 unknown bytes look to be the SHA256 of the content. The CyberChef recipe is From Hex (Auto) -> Decode Text (UTF-16LE (1200)) -> SHA2 (Size: 256, Rounds: 64), applied to the copied hex bytes of the content.

@Nordgaren
Owner

By volatile I mean that it changes on every save instance. Even when my mouse is at rest it changes.

E7 E8 8A C7 91 AF 9C ED 01 I've pretty much already confirmed at this point that its a timestamp since it returns the correct time. When I parse it as a uLEB128, those set of bytes equate to 133543903883342951 which is a valid Int64 required for a WindowsFileTime struct. When converted from FileTime to Unix you get a timestamp of 1709916788 Accounting for timezone + DST, you get an accurate timestamp. 19781994-042a-4894-808e-7287f4dff13c_10-03-2024

They've even confirmed on the post you've linked that it's a timestamp. I'll look into the other things for what the delimiters are in actuality in a little bit.

Edit: For note I might add, I was primarily covering the Saved Tabs while @notarib-catcher was covering Unsaved Tabs and the 0.bin and 1.bin files. The timestamps occur in Saved Tabs but not Unsaved tabs. Here's the link to my latest ImHex pattern which covers Saved and Unsaved. In Saved files I am still missing a few pieces

        u8 Unknown1[2]; // 05 01

        ul FileTime;

        u8 Unknown2[32];
        u8 SelectionStartDelimiter[2]; // 00 01

From the link in my HexPat.

The Unsaved and Saved tabs both have that 1 unknown bool before the CRC32. Which the python PR just shrugs of as

    char      unk1;

and never references again.

hmm. That would mean the Metadata structure is not a fixed size. Have you tried this with multiple files?

@JustArion

I have, and it's consistent, I've tried short text files and long text files. The selection indexes, and lengths all scale properly. I'm just left with figuring out what those delimiters mean. Since the one person said that there's not any delimiters, each value has a purpose.

@Nordgaren
Owner

Alright, well I gotta figure out how to get the FileTime properly in Rust. I can import the functions myself, but being lazy at the moment. The FileTime crate does not give me what I need.

@Nordgaren
Owner

Someone in the thread that I mentioned this one in, said they are working on a writeup where they have all of it figured out, btw. Just a heads up. Might wanna go subscribe to that thread if you are waiting for the update.

@Nordgaren
Owner

The 32 unknown bytes looks to be the SHA256 of the content. The CyberChef recipe is From Hex (Auto) -> Decode Text (UTF-16LE (1200)) -> SHA2 (Size: 256, Rounds: 64) When copying the hex bytes of the content

This is slightly incorrect. You need to hash the content in the file, as it is not stored the same in the tab-state. The encoding and the carriage return type affect the hash.
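
A quick Python sketch of why the two hashes differ. The sample text, the UTF-8 encoding and the CRLF line endings are assumptions chosen for illustration; the point is only that the SHA-256 is taken over the bytes as saved on disk, not over the UTF-16LE buffer in the tab-state file.

```python
import hashlib


def file_content_hash(file_bytes: bytes) -> bytes:
    """SHA-256 digest (32 bytes) of the file exactly as stored on disk."""
    return hashlib.sha256(file_bytes).digest()


# Text as described for the tab-state buffer: one character per newline.
buffer_text = "first line\nsecond line"

# Assumed on-disk form: UTF-8 with Windows (CRLF) line endings.
on_disk = buffer_text.replace("\n", "\r\n").encode("utf-8")

# The same text as it would sit in the tab-state buffer (UTF-16LE).
in_buffer = buffer_text.encode("utf-16-le")

print(file_content_hash(on_disk).hex())
print(file_content_hash(on_disk) == file_content_hash(in_buffer))
```

To compare against a real tab-state file, hash the saved file's raw bytes (for example the result of `open(path, "rb").read()`) and compare the digest with the 32 stored bytes.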

@Nordgaren
Owner

I just uploaded a .bt file. The only thing it is missing right now is the extra buffer after the main buffer, if it exists.

https://github.com/Nordgaren/tabstate-util/blob/8e30c391e2e437ea57caf206f42ce5e777f5361c/TabState.bt#L1-L0

@JustArion

The 32 unknown bytes looks to be the SHA256 of the content. The CyberChef recipe is From Hex (Auto) -> Decode Text (UTF-16LE (1200)) -> SHA2 (Size: 256, Rounds: 64) When copying the hex bytes of the content

This is slightly incorrect. You need to hash the content in the file, as it is not stored the same in the tab-state. The encoding and the carriage return type effect the hash.

Thanks, I've gone ahead and updated my pattern to reflect that.

@Nordgaren
Owner

        u8 Encoding;
        u8 CarriageReturnType;
        ul FileTime;

        u8 ContentHash[32];
        u8 Padding;

Btw, this is what I think you were missing from an earlier post.

The 0 after content hash might be padding or something else, but I have been considering the cursor start delimiter to be 01 and the end delimiter to be 01 as an int. I am not sure about this part, but if you look at the other states, the marker start and end is the same. It starts with a single byte 1 and ends at the 4 byte 1. I also don't think the 0 is part of the marker, as that would make the first 1 big endian and then the second 1 for the marker little endian, which would be quite odd. Obviously you can read this, but it's less straightforward to mix endianness like that.

@Nordgaren
Owner

I will be adding all of you to the readme in the thank you section, shortly. I am going to link your githubs, but, let me know here if you would like a different link to be used!

ogmini added a commit to ogmini/Notepad-Tabstate-Buffer that referenced this issue Mar 11, 2024
@ogmini

ogmini commented Mar 11, 2024

A lot to catch up on here. Thanks to @JustArion for the imHex patterns and @Nordgaren for the Binary Template file. Very useful in comparing and learning how to use 010 and imHex.

Getting a little lost in the mix; but has anyone looked further at the 0.bin and 1.bin files? @notarib-catcher

https://github.com/ogmini/Notepad-Tabstate-Buffer?tab=readme-ov-file#0bin--1bin

At this point, the only parts I can't figure out are:

  • some variable 2 or 3 bytes before the selection start and end
  • 4 bytes after the selection start and end (Do always appear to be that 4 byte 1 or 0x00 0x00 0x00 0x01)

I might give writing an imHex pattern or 010 Binary Template for this a shot.

*EDIT

Looks like the size of the bin file is stored in bytes as a uLEB128.

0 bin file

0x0B - 11 bytes until the CRC32 bytes
0x00 - Delim?
0xCA 0x3E - 8010 which matches the 8010 bytes of the bin file
0x9B 0x1F - Start Selection
0x9B 0x1F - End Selection
0x01 0x00 0x00 0x00 - Delim?
0x39 0x59 0xA2 0x92 - CRC32
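
The byte listing above can be walked with a short Python sketch. It starts at the 0x0B byte, so whatever precedes it in the file is not covered, and the field names simply follow the guesses in the listing. Treating the first value as a uLEB128 rather than a plain byte is an assumption, and the CRC32 is extracted but not verified.

```python
def read_uleb128(data: bytes, pos: int) -> tuple[int, int]:
    """Return (value, position just past the varint)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse_state_record(data: bytes) -> dict:
    """Parse the 0.bin fields listed above, starting at the length value."""
    bytes_until_crc, pos = read_uleb128(data, 0)
    body_start = pos
    unknown = data[pos]  # "Delim?" in the listing
    pos += 1
    bin_size, pos = read_uleb128(data, pos)
    selection_start, pos = read_uleb128(data, pos)
    selection_end, pos = read_uleb128(data, pos)
    trailer = data[pos:pos + 4]  # 01 00 00 00, meaning unknown
    pos += 4
    return {
        "bytes_until_crc": bytes_until_crc,
        "unknown": unknown,
        "bin_size": bin_size,
        "selection_start": selection_start,
        "selection_end": selection_end,
        "trailer": trailer.hex(),
        "crc32": data[pos:pos + 4].hex(),
        # True when the declared length matches the bytes actually consumed.
        "length_consistent": pos - body_start == bytes_until_crc,
    }


raw = bytes.fromhex("0B 00 CA3E 9B1F 9B1F 01000000 3959A292")
record = parse_state_record(raw)
print(record)
```

With these bytes the size field decodes to 8010 and both selection values decode to 3995, and the 11-byte length lines up with the fields between it and the CRC32.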

0 bin size

@ogmini

ogmini commented Mar 11, 2024

@Nordgaren @JustArion

Looking at your pattern and template files, specifically the header part. I don't think that the 3rd byte is a NULL or a delimiter. Granted, it is always 0 for the non-0.bin and 1.bin files (I'm calling these state files for now). In the state file, it appears to be a uLEB128 sequence number that continues to count up over time. Lets the program know which is the last state file and which is the backup. I doubt those numbers will ever change from 0x00 for the non-state files; but it just seems too similar to the state file to be different.
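
If that reading is right, picking the current state file could look like the Python sketch below. It assumes the value right after the 2-byte NP magic is a uLEB128 sequence number, which is only a hypothesis at this point in the thread. The function names and sample bytes are invented for illustration, and an empty file is treated as never written.

```python
def sequence_number(data: bytes) -> int:
    """Read the uLEB128 value that follows the 2-byte NP magic."""
    if not data:
        return -1  # a zero-byte state file has not been written yet
    if data[:2] != b"NP":
        raise ValueError("missing NP header")
    value, shift = 0, 0
    for byte in data[2:]:
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value
        shift += 7
    raise ValueError("truncated sequence number")


def newest_state_file(zero_bin: bytes, one_bin: bytes) -> str:
    """Return the suffix of the state file with the higher sequence number."""
    if sequence_number(zero_bin) >= sequence_number(one_bin):
        return ".0.bin"
    return ".1.bin"


# Synthetic headers: sequence 5 vs 6, then sequence 128 vs an empty file.
print(newest_state_file(b"NP\x05\x0b\x00", b"NP\x06\x0b\x00"))  # .1.bin
print(newest_state_file(b"NP\x80\x01\x0b", b""))                 # .0.bin
```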

@Nordgaren
Owner

@Nordgaren @JustArion

Looking at your pattern and template files, specifically the header part. I don't think that the 3rd byte is a NULL or a delimiter. Granted, it is always 0 for the non-0.bin and 1.bin files (I'm calling these state files for now). In the state file, it appears to be a uLEB128 sequence number that continues to count up over time. Lets the program know which is the last state file and which is the backup. I doubt those numbers will ever change from 0x00 for the non-state files; but it just seems too similar to the state file to be different.

Yea, I mention this in the parser. I keep it as null for now until I figure out the significance. It could just be junk, though, when it's not null, which seems to be the case.

The uLEB128 you are speaking of might actually be replacing the 0 or 1 for the saved/unsaved state, which I have seen happen. It basically just tells you how many bytes there are until the footer.

It's hard to tell, as, there is definitely a condition where a junk file can be generated.

@Nordgaren
Owner

My next line of thought is to check the notepad exe itself. There are some strings which have been helpful, and I might have found where notepad writes the header bytes, but I am not at home, so the reverse engineering is slow going (I only have one screen on my laptop).

But I think we can get some more definitive answers if I look into the notepad exe a bit more. It's in C++, so we might even luck out on getting some RTTI data :)

ogmini added a commit to ogmini/Notepad-Tabstate-Buffer that referenced this issue Mar 11, 2024
@ogmini

ogmini commented Mar 12, 2024

I think I figured out the WindowState files for anyone interested. Far simpler file.

https://github.com/ogmini/Notepad-Windowstate-Buffer/

@notarib-catcher
Author

notarib-catcher commented Mar 26, 2024

The issue is, its incrementing from the left. Dawn_Files_fd249d21-e31f-40eb-a257-de61cce949dc

Being able to see the file update in realtime seems extremely useful - what application is that? Getting ImHex rn :P

@ogmini

ogmini commented May 1, 2024

@Nordgaren - are you planning to publish the BT file to 010 Editor? I'm hoping you are.

@Nordgaren
Owner

@Nordgaren - are you planning to publish the BT file to 010 Editor? I'm hoping you are.

I think someone else had, already. I can upload mine, if it's better, or the person who has uploaded theirs is free to update theirs with mine!

@ogmini

ogmini commented May 2, 2024

@Nordgaren - are you planning to publish the BT file to 010 Editor? I'm hoping you are.

I think someone else had, already. I can upload mine, if it's better, or the person who has uploaded theirs is free to update theirs with mine!

I submitted one for the windowstate file. I don't believe there is one for the tabstate file.


5 participants