Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the decoding/encoding PackFile logic independant as a lib. #8

Closed
Trolldemorted opened this issue Aug 12, 2018 · 38 comments
Closed

Comments

@Trolldemorted
Copy link
Contributor

While investigating TWA's tables I noticed that some packs were not properly unpacked. Luckily I managed to track the problem down: If 1 << 4 is set in the bitmask, every item in the payload starts at a multiple of 8.

twa_pack_lib fix: TotalWar-Modding/tw_pack_lib@c549376 (unfortunately rather big because of major simplifications)

@Frodo45127
Copy link
Owner

mmm strange. I found that bitmask used in other total war games recently (only in music packs) and PackFiles with that bitmask didn't had any difference in their structure. I'll take another look, but.... maybe it's me not seen any problem, or that bitmask working differently when the PackFile index is encrypted....

@Trolldemorted
Copy link
Contributor Author

Interesting - from what game are these and what bits are set? You said here that 0x4-0x7 only contained 0, 1, 2, 3, or 4, but those music packs are an exception?

@Frodo45127
Copy link
Owner

I only found that recently. At first, I though they were weird PackFile types, like they say in the comments of the PFM code. But, after adding support for Arena and Rome 2, I found that, like arena, those weird types are in reality just normal types with a bitmask. That particular bit is set in music.pack PackFiles. In other games, they use that bit and the one that adds 4 bytes to each entry in the index (wich seems to be a Timestamp like the one in the header) in multiple PackFiles.

@Trolldemorted
Copy link
Contributor Author

So maybe they use a consistent PFH5 format for every PFH5 game? That would be awesome, as our tools should work for every game then!

Are you sure rpfm/pfm are unpacking music packs correctly? TWA's music.pack also has the bit set, but it is also padded and encrypted. Fun fact: twa_table_decrypt can decrypt TWA's encrypted .wem files!

Could it be that the bit means "padded and encrypted"? Or were the files in music.pack unpacked with rpfm/pfm looking good?

If it means padded and encrypted I'm gonna integrate the decryption into twa_unpack.

@Frodo45127
Copy link
Owner

Mmm it could be. Currently, I've only seen that in music PackFiles, and the .wem format is unknown to me, so I haven't been able to check that the extracted data actually works. I've been able to check only that the data of each PackedFile is extracted correctly (no data left to extract when extracting everything). I just did a quick search and found that not only Arena .wem are encrypted. It seems they may be encrypted in all Total War Games at least back to Rome 2 (the oldest game I can currently test), making that bit a "Files encrypted" bit.

Do you have any other total war game you can test it? Just in case that decrypt logic is the same in all games.

@Trolldemorted
Copy link
Contributor Author

Unfortunately I don't own any other TW game. Are you definitely sure that other packs with encrypted files are not padded and no bytes were left after unpacking the last file? Are there any files which could have been padded (i.e. their size is not a multiple of 8)?

@Trolldemorted
Copy link
Contributor Author

I have released twa_unpack 1.5 which can unpack TWA's encrypted files - you could try it with other PFH5 games if you happen to own them. A friend has rome2, but unfortunately that uses PHF4.

@Frodo45127
Copy link
Owner

Just checked. Warhammer 2 is the only game using PFH5, and... there is no music PackFile. All the audio files are unencrypted in an audio.pack. But just by looking at the files, they are very different. In Warhammer 2 wem files you can see they start with RIFF, where Attila and Rome 2 just look like random data (Warhammer 1 doesn't have music either). The only difference I found between PFH4 and PFH5 is PFH4 doesn't have a 00 in a specific position in the index, but Arena PFH5 doesn't have that either so.... technically, your program should be compatible with PFH4.

I have the theory that the 1<< 8 bit means just that, that the PFH5 uses a "compatible" format with PFH4, but.... just a theory.

Also, tested it with a PFH4, crashes on an unwrap.

@Trolldemorted
Copy link
Contributor Author

So the development of PFH5 is rather recent?

I have released twa_unpack 1.6 which should be able to unpack most PFH4 files. I tried with rome2's music.pack which is unpacked and decrypted fine.

Now that twa_pack_lib apparently supports most PFH files that exist, should we adapt its api (for changing a pack file's content) so that you can use it in rpfm?

@Frodo45127
Copy link
Owner

Yes, the first game I saw with PFH5 was Warhammer 2, and that one was released in... september last year I think.

Apart from that, I know there is a PFH4 for Rome 2/Attila/Warhammer, and I heard of PFH2 and 1 for Empire, Napoleon and Shogun 2, but haven't confirmed it yet, as I currently don't support them in RPFM.

And about the twa_pack_lib, sure. For PackedFiles, as long as the decrypt function returns a Vec<u8> with the decrypted data, it's compatible with RPFM, though I cannot ensure files other than tables get the padding ignored... Maybe we can use their size from the index as a way to get those garbage bytes away? I mean, resize the resulting vector to whatever size is specified in the index, so the extra bytes get removed.

For full PackFiles... Currently, it'll be easier for me to just call your decrypt functions when needed. Too many things are wired up in the PackFile decoding logic to just make it use completely the twa_pack_lib, and there are plans to change it even more in the future so...

So the best way I see to get the lib working with RPFM, is to make the decrypt functions public (so I just delete the ones I put in RPFM and it should work). The table decrypt lib should work right now without many problems. I'll see if I can get that tested this weekend.

@Trolldemorted
Copy link
Contributor Author

Maybe we can use their size from the index as a way to get those garbage bytes away? I mean, resize the resulting vector to whatever size is specified in the index, so the extra bytes get removed.

That's what twa_pack_lib and twa_unpack 1.6 is doing atm (with just a truncate call), but since I am neither a .wem or TW table expert I am not 100% certain that the output is correct.

Too many things are wired up in the PackFile decoding logic to just make it use completely the twa_pack_lib, and there are plans to change it even more in the future so...

You might want to decouple the low-level file handling from the rest of rpfm though - after all, every pack file with any version is just a basic archive, all you can do with it is adding, removing and listing content and flipping some flags. Or is there anything a generic .pack libary could not deliver because attunement to specific games is required?

@Frodo45127
Copy link
Owner

You might want to decouple the low-level file handling from the rest of rpfm though - after all, every pack file with any version is just a basic archive, all you can do with it is adding, removing and listing content and flipping some flags. Or is there anything a generic .pack libary could not deliver because attunement to specific games is required?

There is no real problem to do that. What I meant was that moving all the stuff in RPFM to use twa_pack_lib will be a problem, due to having to rework stuff either in the lib or in RPFM to get it working together. But getting the current PackFile module from RPFM, making a separate crate with it and integrate there twa_pack_lib should not really be a problem.

So, to recap:

  • 0001 0000 0000 is still unknown, (I suspect it tells Arena the PackFile index is in a PFH4 compatible format, for compat reasons, as in the beta that's the format Arena used, but have no proof).
  • 0000 0001 0000 means files are encrypted, and need decryption before using them.
  • We still need to check the files are valid without those "padding" bytes from the decryption.
  • The PackFile module should be moved to a separate crate, and updated to reflect all these discoveries.
  • And the current twa_pack_lib can be merged into that future crate (as that crate will contain the logic for all Total War PackFiles (in the future) making your unpacker work with any PackFile the crate can read.

If that's so, I can start working on the separate crate now. Guess Rome 2 Schemas will have to wait a bit more...

@Trolldemorted
Copy link
Contributor Author

0001 0000 0000 is still unknown, (I suspect it tells Arena the PackFile index is in a PFH4 compatible format, for compat reasons, as in the beta that's the format Arena used, but have no proof).

That could be, does warhammer 2 have a pack file where the bit is not set and where the index is different?

0000 0001 0000 means files are encrypted, and need decryption before using them.

For PFH4 that appears to be the case, for PFH5 you also have to respect the padding bytes before each packed file.

We still need to check the files are valid without those "padding" bytes from the decryption.

Yup, can you see that by looking at TWA's tables?

The PackFile module should be moved to a separate crate, and updated to reflect all these discoveries.
And the current twa_pack_lib can be merged into that future crate (as that crate will contain the logic for all Total War PackFiles (in the future) making your unpacker work with any PackFile the crate can read. If that's so, I can start working on the separate crate now. Guess Rome 2 Schemas will have to wait a bit more...

You can have a go at rome 2 schemas, just tell me what you need from the pack library. Is the current way of parsing ok?

When and how does RPFM "bake" a mod pack file? I assume that

  • users can add and remove files in rpfm's UI
  • rpfm realizes these operations in a project folder
  • users can hit a "build me a .pack from my current project" button which packs the project folder recursively into a .pack file

If that is the case, all we have to do is add a build_pack function to pack_lib, something along

fn build_pack(input_directory: &PathBuf, output_file: &mut File, u32 version, u32 bitmask)

@Trolldemorted
Copy link
Contributor Author

Trolldemorted commented Aug 16, 2018

twa_pack_lib now has a build_pack function which packs a directory recursively into a .pack, for PFH5 the results are looking good, I am just unsure what values we should write into 0x1C-0x1F and 0x24-0x27.

Is there any documentation for the PFH4 header available? I don't know what value must go into 0x18-0x0B ah its the timestamp, nevermind

@Frodo45127
Copy link
Owner

That could be, does warhammer 2 have a pack file where the bit is not set and where the index is different?

None of Warhammer 2 PackFiles uses that bit, and all of them have a 00 byte between the size/timestamp and their path in the index.

For PFH4 that appears to be the case, for PFH5 you also have to respect the padding bytes before each packed file.

I though the padding was caused by the decryption process. It's different between 4 and 5?

Yup, can you see that by looking at TWA's tables?

Yeah, can be tested once I finish what I'm doing now.

You can have a go at rome 2 schemas, just tell me what you need from the pack library. Is the current way of parsing ok?

When and how does RPFM "bake" a mod pack file? I assume that

  • users can add and remove files in rpfm's UI
  • rpfm realizes these operations in a project folder
  • users can hit a "build me a .pack from my current project" button which packs the project folder recursively into a .pack file

Nop. RPFM currently has 2 ways to read PackFiles. First one, it reads the PackFile's header, stores it in ram, then the Indexes, stores them in ram, and then it reads to ram all the PackedFile's data. For saving, it puts together a PackFile with all that data and write it to disk.

The other way is similar, but instead of reading all that data, it just stores a BufReader of the PackFile and a list of the position were the data of each PackedFile starts, so it can be read on-demand. This mode doesn't support saving data yet.

https://github.com/Frodo45127/rpfm/blob/master/src/packfile/packfile.rs

That's the file were all the reading/saving happens, and were all the structs that forms the PackFile exists. The separate crate I talked about is that file + a couple more of helpers. There, your decrypt functions can be applied if the header bits are sets in the specific moment (Is what it does right now, but the functions are not updated with your lastest changes yet).

@Trolldemorted
Copy link
Contributor Author

None of Warhammer 2 PackFiles uses that bit, and all of them have a 00 byte between the size/timestamp and their path in the index.

Fascinating. So many changes and flags for literally no gain (or we are just missing a glorious master scheme). Well, whatever floats TCA's boat. Could you upload WH2's boot.pack so that I can ensure everything works?

I though the padding was caused by the decryption process. It's different between 4 and 5?

There is padding introduced after every file by the decryption, but in Arena's PFH5 there is also padding in the .pack before every file - every file in an encrypted .pack starts at the next multiple of 8, even the first file after the index. Since WH2 has no encrypted packs (?) I assume that should be the case for all PFH5 files.

Nop. RPFM currently has 2 ways to read PackFiles. First one, it reads the PackFile's header, stores it in ram, then the Indexes, stores them in ram, and then it reads to ram all the PackedFile's data. For saving, it puts together a PackFile with all that data and write it to disk.
The other way is similar, but instead of reading all that data, it just stores a BufReader of the PackFile and a list of the position were the data of each PackedFile starts, so it can be read on-demand. This mode doesn't support saving data yet.
https://github.com/Frodo45127/rpfm/blob/master/src/packfile/packfile.rs
That's the file were all the reading/saving happens, and were all the structs that forms the PackFile exists. The separate crate I talked about is that file + a couple more of helpers. There, your decrypt functions can be applied if the header bits are sets in the specific moment (Is what it does right now, but the functions are not updated with your lastest changes yet).

What do you think of storing packs in extracted form in the filesystem, and producing a .pack only on demand? Benefits that come to my mind right now are

  • no more pack file parsing when opening projects except the very first time
  • users can easily use external programs to manipulate packed files (e.g. someone will certainly have a look at music.pack with third party tools)
  • projects would not be a single large binary file any more, so they could be conveniently checked into version control systems like git
  • you don't have to copy all files into one file everytime users want to save

@Frodo45127
Copy link
Owner

Fascinating. So many changes and flags for literally no gain (or we are just missing a glorious master scheme). Well, whatever floats TCA's boat. Could you upload WH2's boot.pack so that I can ensure everything works?

This is Warhammer 2 boot.pack: boot.tar.gz

What do you think of storing packs in extracted form in the filesystem, and producing a .pack only on demand? Benefits that come to my mind right now are

  • no more pack file parsing when opening projects except the very first time
  • users can easily use external programs to manipulate packed files (e.g. someone will certainly have a look at music.pack with third party tools)
  • projects would not be a single large binary file any more, so they could be conveniently checked into version control systems like git
  • you don't have to copy all files into one file everytime users want to save

RPFM is used to create and edit PackFiles. Doing that (storing the files in the filesystem) means that every time you want to open/edit a PackFile it has to be written to disk. Doing that for a 1 or 2GB PackFile in an HDD would make it pretty slow. If you want a similar behavior (every extracted file goes to a special folder keeping their paths, and when adding from there, it gets added to the right path in the PackFile), you use the MyMod feature. It's there for that reason. But the main way the PackFiles are stored is not going to change. The only change planned is the current "read-only" mode (PackedFiles data is read on demand, instead of being loaded at the start) will gain saving capabilities when CA fixes certain bug in Warhammer 2, so the data will not be kept eating ram for nothing.

If someone wants to use external programs, or put their entire mod under git, they can just extract everything (it's just a button) or create a git repo in the MyMod folder.

PD: Something I just realised. What you're suggesting is what the Assembly Kit (official modding tools) does. Everything is an individual file, then we export to a PackFile. And the modders call it a "Glorified search engine" and many prefer to use the unofficial tools instead. Guess that's something to take into account too, as this tool is aimed at modders after all.

@Trolldemorted
Copy link
Contributor Author

Trolldemorted commented Aug 16, 2018

This is Warhammer 2 boot.pack: boot.tar.gz

Thanks! I will look at it after I have ensured that twa_pack_lib does not panic on arbitrary input.

RPFM is used to create and edit PackFiles. Doing that (storing the files in the filesystem) means that every time you want to open/edit a PackFile it has to be written to disk. Doing that for a 1 or 2GB PackFile in an HDD would make it pretty slow.

Not everytime, only when you open a .pack for the first time (and you could even do it in the background and switch from read-only to editing mode after unpacking is complete)! After that, you can always load it from the filesystem, which should be faster than parsing the pack again and again.

Right now you have to write an up to 2 GB pack file every time users save and exit, even if they don't want to produce a pack file - If you would have a filesystem backend, you can virtually instantly save and load.

Do the users dislike Assembly Kit because of its underlying backend design principles, or because it lacks important features? I think from TCA's engineers' point of view the only reason why you want to build a pack file is because you want a TW game to load your work right now, for all other intents and purposes a plain old folder on your filesystem is better suited for the job.

@Frodo45127
Copy link
Owner

Not everytime, only when you open a .pack for the first time (and you could even do it in the background and switch from read-only to editing mode after unpacking is complete)! After that, you can always load it from the filesystem, which should be faster than parsing the pack again and again.

And we have this situation: You want to create a unit. You want it to have different textures. You open the data.pack to see how the tables there work. You open the models.pack Packfiles until you find the one with the rigidmodel you want his texture changed. You make your unit mod. And, as a bonus, you have 10GB of useless data in your disk. In contrast, parsing the files takes almost no time (except if you open a very big PackFile in an HDD, and that's getting fixed when CA fix their bug), and doesn't eat your disk space nor produce a ton of files you'll never use.

Right now you have to write an up to 2 GB pack file every time users save and exit, even if they don't want to produce a pack file - If you would have a filesystem backend, you can virtually instantly save and load.

I did some optimizations a while ago. As a result, a 2GB PackFile takes 1.8 seconds to open, and 2.3 seconds to save (in an SSD). That for me is fast enough. Of course, a 2GB PackFile is an extreme case, as most mods don't ever reach 500MB. So yeah, the speed gained with getting all the files in the filesystem it's far from enough to justify rewriting most of the program.

Do the users dislike Assembly Kit because of its underlying backend design principles, or because it lacks important features? I think from TCA's engineers' point of view the only reason why you want to build a pack file is because you want a TW game to load your work right now, for all other intents and purposes a plain old folder on your filesystem is better suited for the job.

Because what RPFM or PFM can do to a mod in seconds, the assembly kit takes more than a minute with all the loading tables, exporting, etc.... For a game developer is a good tool, because you don't have to tweak and test every single change. For a modder, once you start tweaking (for example, tweaking your units stats to make it balanced, that requires constant testing, or for testing scripts), or doing more than one mod in the same installation, it gets pretty bad, and even useless in some cases. And also, certain things like patching maps in Warhammer games, with Assembly Kit alone are impossible to do. So, the main use many people give it is to use his global search function.

@Trolldemorted
Copy link
Contributor Author

As a result, a 2GB PackFile takes 1.8 seconds to open, and 2.3 seconds to save (in an SSD).

That is indeed amazing - So the impact of the overhead of using multiple files is so huge?

@Trolldemorted
Copy link
Contributor Author

Trolldemorted commented Aug 17, 2018

0001 0000 0000 is still unknown, (I suspect it tells Arena the PackFile index is in a PFH4 compatible format, for compat reasons, as in the beta that's the format Arena used, but have no proof).

0001 0000 0000 means that the header size is 0x30 and that the index has the useless 00 byte before the path. Warhammer 2's boot.pack's header does not have the bit set, and its header is only 0x1C bytes long - I assume the others are the same?

regarding pack_lib: So to keep things speedy pack_lib must provide a way

  • to iterate the index and yield something which you can ask for the file's contents (if you want lazyloading)
  • to build a .pack file with input from memory

@Frodo45127
Copy link
Owner

That is indeed amazing - So the impact of the overhead of using multiple files is so huge?

If you mean the impact of using filesystem files instead the ones in RAM, no, it's not huge (at least in theory), but it requires to change the entire system of how RPFM deals with PackFiles for no apparent gain. The current system does the job very fast, rewriting it just because is.... not useful.

0001 0000 0000 means that the header size is 0x30 and that the index has the useless 00 byte before the path. Warhammer 2's boot.pack's header does not have the bit set, and its header is only 0x1C bytes long - I assume the others are the same?

🤦‍♂️ I didn't even realize that... Yeah, all Warhammer 2 PackFiles have a 28 byte header.

regarding pack_lib: So to keep things speedy pack_lib must provide a way

  • to iterate the index and yield something which you can ask for the file's contents (if you want lazyloading)
  • to build a .pack file with input from memory

The lib must provide a way to read any PackFile since Rome 2 (PFH4 & 5), to write PackFiles of said format (except encrypted/with extended header, as there is no reason to do that because Arena doesn't even support modding, but do it if you want). In addition, if there are encrypted PackedFiles (like the tables in Arena, or the music files) it must provide them already decrypted when you try to read them.

Also, if the idea it's to make RPFM depend on it, it has to implement all the extra functions that RPFM currently have in the PackFile module, like adding PackFiles, deleting them,.... and, of course, being somewhat compatible with the current system RPFM uses.

@Trolldemorted
Copy link
Contributor Author

Trolldemorted commented Aug 18, 2018

I began looking through rpfm today, it is indeed not a tiny change.

Are you using serde to send commands and data through an mpsc channel to communicate between ui and bg thread by sending a command string first and encoded data immediately after? Can't you just send dedicated command/data structs back and forth, and wrap them in arcs/mutexes if neccessary?

Here is what I have so far. Every place that needs cleaning is marked with a TODO.

@Frodo45127
Copy link
Owner

I began looking through rpfm today, it is indeed not a tiny change.

That's why I told you it was easier to just isolate the PackFile module, integrate there your decrypt functions (because the only things that module doesn't do is decrypt PackedFiles and the padding stuff, and both should be relatively easy to add) and use that as the basis of a new crate.

Also, just checked and you removed a lot of the data taken from the PackFiles, like all the PackFile Index, header stuff... Is all of that stuff stored in a struct from the twa_pack_lib when decoded? because some of that, like the timestamp in the header is currently used.

Are you using serde to send commands and data through an mpsc channel to communicate between ui and bg thread by sending a command string first and encoded data immediately after? Can't you just send dedicated command/data structs back and forth, and wrap them in arcs/mutexes if neccessary?

Probably, but I'm still learning how to deal with two threads, and this approach allows me to have both threads data isolated from one another with a channel to send commands, and another two in case we need to send data back and forward. Things stay simple, isolated and easy to use and understand.

@Trolldemorted
Copy link
Contributor Author

That's why I told you it was easier to just isolate the PackFile module, integrate there your decrypt functions (because the only things that module doesn't do is decrypt PackedFiles and the padding stuff, and both should be relatively easy to add) and use that as the basis of a new crate.

The top-down approach would most likely be less work, but I fear the resulting library would be less uniform and constrained in it's design decisions. I want others to enjoy using the library too :)

Also, just checked and you removed a lot of the data taken from the PackFiles, like all the PackFile Index, header stuff... Is all of that stuff stored in a struct from the twa_pack_lib when decoded? because some of that, like the timestamp in the header is currently used.

Yes and no. The PackFile from twa_pack_lib doesn't store anything except its raw bytes, and rpfm uses its public accessors to build it's PackFileView:

pub struct PackFileView {
    pub file_path: PathBuf,
    pub pfh_version: PFHVersion,
    pub bitmask: PFHFlags,
    pub packed_files: Vec<PackedFile>,
    pub empty_folders: Vec<Vec<String>>
}

The timestamp is still missing, will add it as soon as the most important TODOs are solved. PackFileView should contain everything you need: You can add and remove PackedFiles, you can add temporary empty folders, and you have everything you need to build a valid .pack file from it - a PFH version, a bitmask and a list of files is everything a pack library should need to build a file (ok, maybe the weird PackFile Index that apparently has no effect has to be added).

I think we might want to move the type (the least significant byte of the bitmap) out of the bitmap - the bitflags crate does not like flags with zero set bits, and overlapping bits aren't nice either.

Probably, but I'm still learning how to deal with two threads, and this approach allows me to have both threads data isolated from one another with a channel to send commands, and another two in case we need to send data back and forward. Things stay simple, isolated and easy to use and understand.

Using channels is perfectly reasonable, but is serializing and deserializing necessary? Maybe clone is sufficient, which should be faster both at compile and runtime. By the way, how long does rpfm (without dependencies) take to build on your machine? My poor laptop needs 15-30 seconds :(

@Frodo45127
Copy link
Owner

The top-down approach would most likely be less work, but I fear the resulting library would be less uniform and constrained in it's design decisions. I want others to enjoy using the library too :)

Yeah, but in RPFM the top-down approach (I learned today it's called that) it's useful, because it's not just PackFiles. There are still a lot of things to decode (older PackFiles, ESF files, RigidModels,....) and it's far easier to do it if you can store and deal separately with every part of the code. I mean, for a "finished" lib where you know exactly what you're going to have or you know it's never going to change, your approach is good, but for something that is reverse-engineered and that it may change in an update/new game release, and there are still unknown things about it (shogun 2 and older games)... It's not as "future-proof".

This is something I learnt when I started this project: code should be easy to read for someone who wants to learn how the things work and has no idea of the language. If someone in the future has to take the lead, I want them to be able to see how everything works, how things can be used.... That eases a lot the "entry barrier" for that kind of thing. That's why I re-implemented the original PFM instead of maintaining it. Because it was hard to understand how the things worked, what was X or Y byte for.... Took me a while to just get how the tables were encoded properly, due to that, for example. And had to fix it a few updates later for the same reason. Yes, it's not the fastest or more flexible way to do it, but it's the easiest to learn the structure of the files just by looking at the code.

I want others to be able to look at the code and learn how everything works, even if the data itself it's useless. Because this is reverse engineering. You can't go and ask anyone to see how it works. There is no official documentation. If you want others to learn in case you can't continue/abandon the projects, they need to have an easy way to learn the format, how it works, how the data can be used,....

The changes you want to make in RPFM to make it work with the lib are too disruptive, hiding too much of the "how it works" part that actually helps to understand how all the stuff works. So in that case, RPFM will not depend on that lib. You may want others to enjoy using the library, but in my case, I don't really enjoy using your approach for this kind of stuff. For a lib that must work fast, it works. For code to learn how things work, it doesn't. And I want RPFM to be easy to learn from.

Using channels is perfectly reasonable, but is serializing and deserializing necessary?

I wanted to just have one sending channel and one receiving channel for data, so all the data needed to have the same type. Serializing to a Vec<u8> does the job. I tested it with the schemas (they are far bigger than what it's usually sended in those channels, except PackedFile's data of rigidmodels and some exceptions) and it was fast, so I used that solution.

By the way, how long does rpfm (without dependencies) take to build on your machine? My poor laptop needs 15-30 seconds :(

In my i5-4570, in linux 50 seconds to 1:30. In windows, 1:30 to 1:50, depending how much CPU the antimalware service wants to eat in that moment. Full compilation with libs is about 8 to 16 minutes, mainly due to Qt. Those are debug builds. Release builds take more.

@Trolldemorted
Copy link
Contributor Author

Yeah, but in RPFM the top-down approach (I learned today it's called that) it's useful, because it's not just PackFiles. There are still a lot of things to decode (older PackFiles, ESF files, RigidModels,....) and it's far easier to do it if you can store and deal separately with every part of the code. I mean, for a "finished" lib where you know exactly what you're going to have or you know it's never going to change, your approach is good, but for something that is reverse-engineered and that it may change in an update/new game release, and there are still unknown things about it (shogun 2 and older games)... It's not as "future-proof".

Do you have pfh3 support? I really would like to include that into pack_lib, but I don't have any .packs nor games to reverseengineer. Nevertheless, ESF/RigidModels handling could stay exactly like it is, even with a separate pack lib.

The changes you want to make in RPFM to make it work with the lib are too disruptive, hiding too much of the "how it works" part that actually helps to understand how all the stuff works. So in that case, RPFM will not depend on that lib. You may want others to enjoy using the library, but in my case, I don't really enjoy using your approach for this kind of stuff. For a lib that must work fast, it works. For code to learn how things work, it doesn't. And I want RPFM to be easy to learn from.

Is it hiding the "how it works" part, or just moving a part of it into a separate crate? So far my diff modifies/adds 241 lines of code, that's really not that much. If you want a well-tested module that can handle any (malformed) input without panicking, you are better of separating it from everything else and making sure it can do one thing well. My Signal-Windows project has ~50k lines of code, it would be unmanageable if we didn't have decisive borders where one lib ends and another lib begins.

I wanted to just have one sending channel and one receiving channel for data, so all the data needed to have the same type. Serializing to a Vec does the job. I tested it with the schemas (they are far bigger than what it's usually sended in those channels, except PackedFile's data of rigidmodels and some exceptions) and it was fast, so I used that solution.

They must have the same type, but that does not mean they must be the same "object". You could have used an enum (which I would recommend because they are analyzed at compile time by the compiler, which gives you nice things like compile time errors if you forget to match an enum variant on the receiving end of a channel), or boxes to Any implementations. Both would be simpler and more easy to use than using 4 channels and inter-process-communication libraries.

@Frodo45127
Copy link
Owner

Do you have pfh3 support? I really would like to include that into pack_lib, but I don't have any .packs nor games to reverseengineer. Nevertheless, ESF/RigidModels handling could stay exactly like it is, even with a separate pack lib.

I don't have any older games installed right now, as I'm implementing full support (packfiles, schemas,...) for one game at a time, and I'm with Rome 2 now. But I think they skipped the 3. The game before Rome 2 (Shogun 2) used PFH2 if the code I read from PFM is correct.

Is it hiding the "how it works" part, or just moving a part of it into a separate crate? So far my diff modifies/adds 241 lines of code, that's really not that much. If you want a well-tested module that can handle any (malformed) input without panicking, you are better of separating it from everything else and making sure it can do one thing well. My Signal-Windows project has ~50k lines of code, it would be unmanageable if we didn't have decisive borders where one lib ends and another lib begins.

It's the hiding "how it works" part. I don't have a problem with moving the PackFile or PackedFile systems to a different crate. They are already fairly isolated modules to begin with with clearly defined roles that they do properly, and there is no problem with making them available to others. The problem is reducing them to just "what it would be useful to change" hiding how they really work.

I don't really care about malformed input. If a PackFile is malformed, RPFM throws an error. That's all. But if someone wants to learn how they work, I want them to be able to see what each specific part of the PackFile does. Yes, things like the PackFile index may seem useless right now, but that doesn't mean we find them an use in the future.

Also, where did you saw 241 lines? Here I just see a ton of code that has changed, and a lot of commented code that has to be rewritten just to fit with the library. That's the kind of change I don't want. That's what you call "well-tested" code that has been in use for differents amounts of time since release back in november last year with no problems, changed just to fit a lib with no apparent benefit.
imagen

They must have the same type, but that does not mean they must be the same "object". You could have used an enum (which I would recommend because they are analyzed at compile time by the compiler, which gives you nice things like compile time errors if you forget to match an enum variant on the receiving end of a channel), or boxes to Any implementations. Both would be simpler and more easy to use than using 4 channels and inter-process-communication libraries.

Didn't though about that (using enum I mean) for the message passing. The only time I heavely used enums was in the recent error rework, after all. I'll take a look at it one I finish the Rome 2 schema. And four channels? Between the UI and Background threads there is one to send commands from the UI, and two for communication, one UI -> Background and one Background -> UI. There is no forth channel there.

@Trolldemorted
Copy link
Contributor Author

It's the hiding "how it works" part. I don't have a problem with moving the PackFile or PackedFile systems to a different crate. They are already fairly isolated modules to begin with with clearly defined roles that they do properly, and there is no problem with making them available to others. The problem is reducing them to just "what it would be useful to change" hiding how they really work.

But how am I hiding anything? pack_lib's parser is a really clean and safe parser in ~250 lines of code. You can easily see what the fields in the header are, you can easily see when a header has which size, and you can easily see how to iterate through the index.

I don't really care about malformed input. If a PackFile is malformed, RPFM throws an error. That's all. But if someone wants to learn how they work, I want them to be able to see what each specific part of the PackFile does. Yes, things like the PackFile index may seem useless right now, but that doesn't mean we find them an use in the future.

If you consider crashes/panics ok that's completely fine for rpfm, but it is never acceptable for a parsing library to panic on malformed input. And do you really believe that you can understand a packfile's structure better if you read rpfm's? I deliberately tried to write an easy-to-read crate, it has literally only one function which does something not trivial, and that is get_next which has to handle all the different index shenanigans.

Also, where did you saw 241 lines? Here I just see a ton of code that has changed, and a lot of commented code that has to be rewritten just to fit with the library. That's the kind of change I don't want. That's what you call "well-tested" code that has been in use for differents amounts of time since release back in november last year with no problems, changed just to fit a lib with no apparent benefit.

Add that time it were 241 modifieds/added (and a few hundred removed) lines in 8 files, which is not that much compared to rpfm's size. Right now I'm at 345 modified/added (953 removed) lines, and I would be willing to discuss each of these if you disagree with them. It's fine when you don't want to use pack_lib, but it would be way easier to maintain one pack implementation instead of several.

Didn't though about that (using enum I mean) for the message passing. The only time I heavely used enums was in the recent error rework, after all. I'll take a look at it one I finish the Rome 2 schema. And four channels? Between the UI and Background threads there is one to send commands from the UI, and two for communication, one UI -> Background and one Background -> UI. There is no forth channel there.

Ah I thought the bg thread might want to send commands to the ui thread too, my bad. But yeah, rust's enums are awesome! All instances of one enum are of the same type, so you can declare commands like

pub struct PackedPath {
    pack_file: String,
    packed_file_path: String
}
pub struct AddPackedFileCommand {
    pub fs_path: PathBuf,
    pub packed_path: PackedPath
}

pub struct DeletePackedFileCommand {
    pub packed_path: PackedPath
}

enum UICommand {
    AddPackedFile(AddPackedFileCommand),
    DeletePackedFile(DeletePackedFileCommand)
}

and match them on the receiver:

match receiver.recv() {
    UICommand::AddPackedFile(add_packed_file_command) => println!("foo!"),
    UICommand::DeletePackedFile(delete_packed_file_command) => println!("bar!")
}

Since your commands could be rather big I maybe would recommend sending boxes with UICommands instead of raw UICommands. We maybe should open a dedicated issue for a typesafe thread communication, though.

@Frodo45127
Copy link
Owner

But how am I hiding anything? pack_lib's parser is a really clean and safe parser in ~250 lines of code. You can easily see what the fields in the header are, you can easily see when a header has which size, and you can easily see how to iterate through the index.

If you consider crashes/panics ok that's completely fine for rpfm, but it is never acceptable for a parsing library to panic on malformed input. And do you really believe that you can understand a packfile's structure better if you read rpfm's? I deliberately tried to write an easy-to-read crate, it has literally only one function which does something not trivial, and that is get_next which has to handle all the different index shenanigans.

No, I mean it returns an error to the UI and does nothing. And yeah, in RPFM is all explained, not only through comments, but also with code that's not too hard to understand for someone new to the language. From that point of view, your crate is harder to understand. I mean, even I have problems understanding some parts, specially with the iterator part because I've never implemented one because I didn't need it, and I'm not a noob (not an expert either). From that point of view, RPFM is not as simple, but it's easier to understand.

Here's an example. This is RPFM header struct:

pub struct PackFileHeader {
    pub id: String,
    pub pack_file_type: u32,
    pub pack_file_count: u32,
    pub pack_file_index_size: u32,
    pub packed_file_count: u32,
    pub packed_file_index_size: u32,
    pub creation_time: u32,

    pub data_is_encrypted: bool,
    pub index_includes_timestamp: bool,
    pub index_is_encrypted: bool,
    pub header_is_extended: bool,
}

This is twa_packfile_lib way to get stuff from the header:

pub fn get_bitmask(raw_data: &[u8]) -> ::PFHFlags {
    ::PFHFlags::from_bits_truncate(LittleEndian::read_u32(&raw_data[0x04..0x08]))
}

fn get_index_length(raw_data: &[u8]) -> u32 {
    LittleEndian::read_u32(&raw_data[0x10..0x14])
}

In RPFM you can clearly see every piece of code that forms the header (it lacks the extended header stuff, but you get the idea). In the lib, I can see how to obtain certain parts. You get from [0x04..0x08] the bitmask and from [0x10..0x14] the index lenght. Ok, but what about the bytes in between? What do they have? They are not used yet, but someone coming with 0 experience reads that and thinks, "that's unknown data". Then they go ahead, again, trying to get what are those bytes, why some packfiles have them with other than "0",.... That's what I mean with "hiding how it works". You still don't understand the header with the struct? Then you can take a look at the reading function, in simple code, were you can see how every piece of the header is taken appart and stored.

And that, for someone new to the language (as I was a few months ago) or even to programming, is far easier to understand that a bunch of getters.

Since your commands could be rather big I maybe would recommend sending boxes with UICommands instead of raw UICommands. We maybe should open a dedicated issue for a typesafe thread communication, though.

Added to my personal todo list. Hopefully, once I'm done with the 70+ tables I still have to decode from Rome 2, and with the Arena stuff, I can get that done.

@Trolldemorted
Copy link
Contributor Author

No, I mean it returns an error to the UI and does nothing.

That would be nice but that is not the case - the very first malformed pack file I created happily crashes rpfm with an index out of range error.

I mean, even I have problems understanding some parts, specially with the iterator part because I've never implemented one because I didn't need it, and I'm not a noob (not an expert either)

Iterators are a thing in every major programming language, and all you have to do is implement a function that yields the next item - isn't that super simple stuff? But yeah, some comments would be nice, I'll have to add some rustdoc strings.

Ok, but what about the bytes in between? What do they have? They are not used yet, but someone coming with 0 experience reads that and thinks, "that's unknown data". Then they go ahead, again, trying to get what are those bytes, why some packfiles have them with other than "0",.... That's what I mean with "hiding how it works".

That's a pack_lib bug - it still needs support for pf header extensions.

You still don't understand the header with the struct? Then you can take a look at the reading function, in simple code, were you can see how every piece of the header is taken appart and stored.

pack_lib's philosophy is to be a shallow-but-safe library above .pack files. If you are interested where a certain value comes from, you go to the function implementation and can precisely see which bytes from the file are being mapped to your requested value, and how they are being converted. If you are interested in all values, you should not be looking at source code but into the pack file format documentation (which we should reference in our comments, now that I think of it!).

Added to my personal todo list. Hopefully, once I'm done with the 70+ tables I still have to decode from Rome 2, and with the Arena stuff, I can get that done.

👍it might take some refactoring, but you are rewarded with a typesafe and simple thread communication.

@Frodo45127
Copy link
Owner

That would be nice but that is not the case - the very first malformed pack file I created happily crashes rpfm with an index out of range error.

And that's an error that has to be fixed.

Iterators are a thing in every major programming language, and all you have to do is implement a function that yields the next item - isn't that super simple stuff? But yeah, some comments would be nice, I'll have to add some rustdoc strings.

Using iterators, yes. Implementing the IntoIterator trait for a Struct, no. There is a big difference. In the total war modding community most of the people that knows about programming know Lua, C#, Java and I've seen some code in Ruby too. People that knows Rust is.... well, I can count them with one hand, including you and me. One of the reasons to deliverately keep it simple, easy to follow the process and not use that kind of complex stuff in RPFM is so, if in the future I can't maintain it anymore, someone doesn't have a hard time trying to learn the entirety of the language to just update it for one more game.

pack_lib's philosophy is to be a shallow-but-safe library above .pack files. If you are interested where a certain value comes from, you go to the function implementation and can precisely see which bytes from the file are being mapped to your requested value, and how they are being converted. If you are interested in all values, you should not be looking at source code but into the pack file format documentation (which we should reference in our comments, now that I think of it!).

Sure but, as I said above, people who can read it in the community if needed to know how are PackFile's encoded are not many. Sure, you can get from were single values come from. Sure, you can check the entire format in a doc. But if a new format comes out and someone wants to know "how does the lib work" to adapt it to the new format, they need to spend a lot of time learning what do implementing this IntoIterator thing does, why it's done, how it fits in the process to decode the PackFile.... you can't see how the program "flows", so you can't see were it fails with the new format and fix it. It's not about being "clean", it's about being "easy for beginners and easy to debug" which makes it more "future-proof". That's the problem I have with the twa_pack_lib. That it doesn't fit any of those ticks and, in a very limited community (Total War modders who know rust) those are a must if you want your lib to be useful in the long run.

it might take some refactoring, but you are rewarded with a typesafe and simple thread communication.

Not sure about that, as the program will still crash by default if the variant is not the one it should receive, but... if it can be done without the overhead of serializing/deserializing stuff, is good enough.

Also, in one of the last commits the padding stuff got fixed, so the original issue is fixed, but to keep with the conversation I'll just change the title instead of opening a new issue for the lib.

@Frodo45127 Frodo45127 changed the title fix TWA's padded packs Make the decoding/encoding PackFile logic independant as a lib. Aug 22, 2018
@Trolldemorted
Copy link
Contributor Author

Using iterators, yes. Implementing the IntoIterator trait for a Struct, no. There is a big difference. In the total war modding community most of the people that knows about programming know Lua, C#, Java and I've seen some code in Ruby too. People that knows Rust is.... well, I can count them with one hand, including you and me. One of the reasons to deliverately keep it simple, easy to follow the process and not use that kind of complex stuff in RPFM is so, if in the future I can't maintain it anymore, someone doesn't have a hard time trying to learn the entirety of the language to just update it for one more game.

Java has Iterable where you have to implement iterator(), C# has IEnumerable where you have to implement GetEnumerator(). Every higher level programming language I know has the concept of an iterator and an iterable, and everyone who has worked with these languages should know them, so understanding rust's into_iter should be trivial.

Sure but, as I said above, people who can read it in the community if needed to know how are PackFile's encoded are not many. Sure, you can get from were single values come from. Sure, you can check the entire format in a doc. But if a new format comes out and someone wants to know "how does the lib work" to adapt it to the new format, they need to spend a lot of time learning what do implementing this IntoIterator thing does, why it's done, how it fits in the process to decode the PackFile.... you can't see how the program "flows", so you can't see were it fails with the new format and fix it. It's not about being "clean", it's about being "easy for beginners and easy to debug" which makes it more "future-proof". That's the problem I have with the twa_pack_lib. That it doesn't fit any of those ticks and, in a very limited community (Total War modders who know rust) those are a must if you want your lib to be useful in the long run.

The entire flow consists of get_next, which contains less than 100 lines of code. All it does is move through the index and through the payload, while respecting the bitmask's settings and the pfh version. That's really not complicated. If a new pfh version comes out, all you have to do is change how get_next interprets the index and the payload.

If we were to take your packfile modules to build a separate lib, the first thing I'd do is implement IntoIter - the pack file format is nothing more than an iteration, so IntoIter perfectly repesents the underlying data structure and it is the most easy-to-use api you can get. I have opened a PR, if you like it merge it, if you have questions ask, if you don't like it close it :)

Semi-related: What do you think of supporting packing encrypted mod packs for the sake of trolling?

@Frodo45127
Copy link
Owner

Wasn't convinced (because despite having touched both C# and Rust for quite some time, I've never had the need to implement an iterator by myself), so I asked some modders, and the only guy who answered said that, from a point of someone who knows lua and is familiar with rust, the lib is easy enough to read.

So I guess that's enough to accept it (still at least the functions in the lib need comments to know what they actually do), though the pull-request needs some... polishing before accepting it. I already wrote all that stuff in a comment in the pull-request, so you can check all that stuff there.

Semi-related: What do you think of supporting packing encrypted mod packs for the sake of trolling?

If you want to do it, sure. But do you have any clue what those 14 extra bytes from the extended header are for?

@Frodo45127
Copy link
Owner

An update about the channel stuff. The commands have been moved to an enum (no more risk of crashing due to a typo in the command), but hold no data, because the data may or may not need to have a command before sending it and it's better to have all the data together in one enum instead splited in two. For example, there are multiple features that try something in the background thread, then go back to the ui for some input, then back to the background.

And the data... just counting the data send from the background thread, there are 33 different types of data. There are even more coming from the UI. That means a potential 50 to 100 variants long enum. All of them need to be matched when receiving, and panic if the variant is not the one we want, because that means the message system has been messed up and there is not an easy way to recover from that. Is there any specific reason to rewrite all the data sending/receiving stuff other than removing the already very little overhead while serializing/deserializing? Because, apart of that I can't find any reason to do it, and it looks like it's too much work to move the system from using serde and generics to enums for very little gain.

@Trolldemorted
Copy link
Contributor Author

If you want to do it, sure. But do you have any clue what those 14 extra bytes from the extended header are for?

I don't, but since there is many unused stuff like the timestamps I wouldn't worry too much about them as long as everything works.

An update about the channel stuff. The commands have been moved to an enum (no more risk of crashing due to a typo in the command)

👍

but hold no data, because the data may or may not need to have a command before sending it and it's better to have all the data together in one enum instead splited in two.

I'm not sure I understand this correctly. Why can't you add the data to the command? If the same command is sometimes used with and sometimes without data you can either put the data in an Option<DATA_FOO> or split it into two commands because they are different commands after all.

For example, there are multiple features that try something in the background thread, then go back to the ui for some input, then back to the background.

Doesn't it sound like this are two different commands then?

And the data... just counting the data send from the background thread, there are 33 different types of data. There are even more coming from the UI. That means a potential 50 to 100 variants long enum. All of them need to be matched when receiving, and panic if the variant is not the one we want, because that means the message system has been messed up and there is not an easy way to recover from that.

Assuming there are not more responses than commands, I guess it is worth it. You can (and imho should) add the data to the enums, so you will have one bg->ui enum and the command enum, and just 2 channels.

Is there any specific reason to rewrite all the data sending/receiving stuff other than removing the already very little overhead while serializing/deserializing? Because, apart of that I can't find any reason to do it, and it looks like it's too much work to move the system from using serde and generics to enums for very little gain.

That depends on how much you want to change in the future. If you are saying rpfm's feature list is almost complete and you will only add support for new games, it could be wasted time. If you want to continue developing it for a longer period and want a codebase that is easy to understand for new developers, I'd go for it immediately:

  • it makes things easier to read for new contributors (you can move the enums and their contained structs into a new module, so new contributors have an exact, small but exhaustive and correct view on rpfm's internal communication)
  • it makes contributions more stable because there are less ways to make things wrong
  • it makes refactoring easier because you cannot accidentally send the wrong data (that happened to me in the pack_lib branch, took me a while to find out what was going wrong where)
  • serde prevents you from directly serializing external types (and the workaround I used is annoying)
  • serde prevents you from sharing data without copying it
  • iirc serde generates code during compile time, which could explain the slow build times on HDDs

@Frodo45127
Copy link
Owner

I'm not sure I understand this correctly. Why can't you add the data to the command? If the same command is sometimes used with and sometimes without data you can either put the data in an Option<DATA_FOO> or split it into two commands because they are different commands after all.

Quite the opposite. The situation is the data is sent sometimes without a command. For example, there are commands that you send and the UI thread waits for a response. Then, depending on the response you send more data or just do nothing.

Assuming there are not more responses than commands, I guess it is worth it. You can (and imho should) add the data to the enums, so you will have one bg->ui enum and the command enum, and just 2 channels.

There are more responses that commands. There are currently 55 commands, and between 50-100 data types that are send one way or another.

Also, there's another problem related to merging the channels. When I started the 0.9 version and the Qt port, I set a goal: the UI MUST NOT HANG. To do it, RPFM currently makes the UI thread to enter on a loop that keeps processing events (like resizing the window) until it gets a response from the background thread. That means you can potentially send a command while there is another one executing itself, making the entire program crash. In earlier 0.9 versions, you could crash the program by keeping pressed the "Del" button because the deletion process got messed by a checking process, and message were received in the wrong place. To "fix it", I left a channel for commands only (so commands can be chained and don't interfere with one another) and another for data, and every time data received, RPFM checks that's the type it needs using generics and crashes if it's not (because there is no way to be sure what was sent).

Merging them will mean the commands can potentially crash in many more places that before.

That depends on how much you want to change in the future. If you are saying rpfm's feature list is almost complete and you will only add support for new games, it could be wasted time. If you want to continue developing it for a longer period and want a codebase that is easy to understand for new developers, I'd go for it immediately:

  • it makes things easier to read for new contributors (you can move the enums and their contained structs into a new module, so new contributors have an exact, small but exhaustive and correct view on rpfm's internal communication)
  • it makes contributions more stable because there are less ways to make things wrong
  • it makes refactoring easier because you cannot accidentally send the wrong data (that happened to me in the pack_lib branch, took me a while to find out what was going wrong where)
  • serde prevents you from directly serializing external types (and the workaround I used is annoying)
  • serde prevents you from sharing data without copying it
  • iirc serde generates code during compile time, which could explain the slow build times on HDDs

Mmmm.... I'm not sure about the second and third one. The way I see it, the same problem with wrong messages being sent can also happen with enums, because you can receive the wrong variant and the program will have to crash, to avoid weird behavior.

And the fifth one.... if you want data to be shared in an enum without deleting the original data (because the threads are isolated), you have to copy it anyways isn't?

I'll check it again once I'm done with this release of today, but the way I see it.... other than readability improvements and a very little speed up, it's a very big rewrite with very little gain.

@Frodo45127 Frodo45127 added this to ToDo... If I Can. in RPFM Status Jan 5, 2019
@Frodo45127 Frodo45127 removed this from ToDo... If I Can. in RPFM Status Jan 5, 2019
@Frodo45127
Copy link
Owner

Done in 2.0 with the big split.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@Trolldemorted @Frodo45127 and others