-
-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make the decoding/encoding PackFile logic independant as a lib. #8
Comments
mmm strange. I found that bitmask used in other total war games recently (only in music packs) and PackFiles with that bitmask didn't had any difference in their structure. I'll take another look, but.... maybe it's me not seen any problem, or that bitmask working differently when the PackFile index is encrypted.... |
Interesting - from what game are these and what bits are set? You said here that 0x4-0x7 only contained 0, 1, 2, 3, or 4, but those music packs are an exception? |
I only found that recently. At first, I though they were weird PackFile types, like they say in the comments of the PFM code. But, after adding support for Arena and Rome 2, I found that, like arena, those weird types are in reality just normal types with a bitmask. That particular bit is set in music.pack PackFiles. In other games, they use that bit and the one that adds 4 bytes to each entry in the index (wich seems to be a Timestamp like the one in the header) in multiple PackFiles. |
So maybe they use a consistent PFH5 format for every PFH5 game? That would be awesome, as our tools should work for every game then! Are you sure rpfm/pfm are unpacking music packs correctly? TWA's music.pack also has the bit set, but it is also padded and encrypted. Fun fact: twa_table_decrypt can decrypt TWA's encrypted .wem files! Could it be that the bit means "padded and encrypted"? Or were the files in music.pack unpacked with rpfm/pfm looking good? If it means padded and encrypted I'm gonna integrate the decryption into twa_unpack. |
Mmm it could be. Currently, I've only seen that in music PackFiles, and the .wem format is unknown to me, so I haven't been able to check that the extracted data actually works. I've been able to check only that the data of each PackedFile is extracted correctly (no data left to extract when extracting everything). I just did a quick search and found that not only Arena .wem are encrypted. It seems they may be encrypted in all Total War Games at least back to Rome 2 (the oldest game I can currently test), making that bit a "Files encrypted" bit. Do you have any other total war game you can test it? Just in case that decrypt logic is the same in all games. |
Unfortunately I don't own any other TW game. Are you definitely sure that other packs with encrypted files are not padded and no bytes were left after unpacking the last file? Are there any files which could have been padded (i.e. their size is not a multiple of 8)? |
I have released twa_unpack 1.5 which can unpack TWA's encrypted files - you could try it with other PFH5 games if you happen to own them. A friend has rome2, but unfortunately that uses PHF4. |
Just checked. Warhammer 2 is the only game using PFH5, and... there is no music PackFile. All the audio files are unencrypted in an I have the theory that the 1<< 8 bit means just that, that the PFH5 uses a "compatible" format with PFH4, but.... just a theory. Also, tested it with a PFH4, crashes on an unwrap. |
So the development of PFH5 is rather recent? I have released twa_unpack 1.6 which should be able to unpack most PFH4 files. I tried with rome2's music.pack which is unpacked and decrypted fine. Now that twa_pack_lib apparently supports most PFH files that exist, should we adapt its api (for changing a pack file's content) so that you can use it in rpfm? |
Yes, the first game I saw with PFH5 was Warhammer 2, and that one was released in... september last year I think. Apart from that, I know there is a PFH4 for Rome 2/Attila/Warhammer, and I heard of PFH2 and 1 for Empire, Napoleon and Shogun 2, but haven't confirmed it yet, as I currently don't support them in RPFM. And about the For full PackFiles... Currently, it'll be easier for me to just call your decrypt functions when needed. Too many things are wired up in the PackFile decoding logic to just make it use completely the So the best way I see to get the lib working with RPFM, is to make the decrypt functions public (so I just delete the ones I put in RPFM and it should work). The table decrypt lib should work right now without many problems. I'll see if I can get that tested this weekend. |
That's what twa_pack_lib and twa_unpack 1.6 is doing atm (with just a truncate call), but since I am neither a .wem or TW table expert I am not 100% certain that the output is correct.
You might want to decouple the low-level file handling from the rest of rpfm though - after all, every pack file with any version is just a basic archive, all you can do with it is adding, removing and listing content and flipping some flags. Or is there anything a generic .pack libary could not deliver because attunement to specific games is required? |
There is no real problem to do that. What I meant was that moving all the stuff in RPFM to use So, to recap:
If that's so, I can start working on the separate crate now. Guess Rome 2 Schemas will have to wait a bit more... |
That could be, does warhammer 2 have a pack file where the bit is not set and where the index is different?
For PFH4 that appears to be the case, for PFH5 you also have to respect the padding bytes before each packed file.
Yup, can you see that by looking at TWA's tables?
You can have a go at rome 2 schemas, just tell me what you need from the pack library. Is the current way of parsing ok? When and how does RPFM "bake" a mod pack file? I assume that
If that is the case, all we have to do is add a fn build_pack(input_directory: &PathBuf, output_file: &mut File, u32 version, u32 bitmask) |
twa_pack_lib now has a build_pack function which packs a directory recursively into a .pack, for PFH5 the results are looking good, I am just unsure what values we should write into 0x1C-0x1F and 0x24-0x27.
|
None of Warhammer 2 PackFiles uses that bit, and all of them have a
I though the padding was caused by the decryption process. It's different between 4 and 5?
Yeah, can be tested once I finish what I'm doing now.
Nop. RPFM currently has 2 ways to read PackFiles. First one, it reads the PackFile's header, stores it in ram, then the Indexes, stores them in ram, and then it reads to ram all the PackedFile's data. For saving, it puts together a PackFile with all that data and write it to disk. The other way is similar, but instead of reading all that data, it just stores a BufReader of the PackFile and a list of the position were the data of each PackedFile starts, so it can be read on-demand. This mode doesn't support saving data yet. https://github.com/Frodo45127/rpfm/blob/master/src/packfile/packfile.rs That's the file were all the reading/saving happens, and were all the structs that forms the PackFile exists. The separate crate I talked about is that file + a couple more of helpers. There, your decrypt functions can be applied if the header bits are sets in the specific moment (Is what it does right now, but the functions are not updated with your lastest changes yet). |
Fascinating. So many changes and flags for literally no gain (or we are just missing a glorious master scheme). Well, whatever floats TCA's boat. Could you upload WH2's boot.pack so that I can ensure everything works?
There is padding introduced after every file by the decryption, but in Arena's PFH5 there is also padding in the .pack before every file - every file in an encrypted .pack starts at the next multiple of 8, even the first file after the index. Since WH2 has no encrypted packs (?) I assume that should be the case for all PFH5 files.
What do you think of storing packs in extracted form in the filesystem, and producing a .pack only on demand? Benefits that come to my mind right now are
|
This is Warhammer 2
RPFM is used to create and edit PackFiles. Doing that (storing the files in the filesystem) means that every time you want to open/edit a PackFile it has to be written to disk. Doing that for a 1 or 2GB PackFile in an HDD would make it pretty slow. If you want a similar behavior (every extracted file goes to a special folder keeping their paths, and when adding from there, it gets added to the right path in the PackFile), you use the If someone wants to use external programs, or put their entire mod under git, they can just extract everything (it's just a button) or create a git repo in the PD: Something I just realised. What you're suggesting is what the Assembly Kit (official modding tools) does. Everything is an individual file, then we export to a PackFile. And the modders call it a "Glorified search engine" and many prefer to use the unofficial tools instead. Guess that's something to take into account too, as this tool is aimed at modders after all. |
Thanks! I will look at it after I have ensured that twa_pack_lib does not panic on arbitrary input.
Not everytime, only when you open a .pack for the first time (and you could even do it in the background and switch from read-only to editing mode after unpacking is complete)! After that, you can always load it from the filesystem, which should be faster than parsing the pack again and again. Right now you have to write an up to 2 GB pack file every time users save and exit, even if they don't want to produce a pack file - If you would have a filesystem backend, you can virtually instantly save and load. Do the users dislike Assembly Kit because of its underlying backend design principles, or because it lacks important features? I think from TCA's engineers' point of view the only reason why you want to build a pack file is because you want a TW game to load your work right now, for all other intents and purposes a plain old folder on your filesystem is better suited for the job. |
And we have this situation: You want to create a unit. You want it to have different textures. You open the
I did some optimizations a while ago. As a result, a 2GB PackFile takes 1.8 seconds to open, and 2.3 seconds to save (in an SSD). That for me is fast enough. Of course, a 2GB PackFile is an extreme case, as most mods don't ever reach 500MB. So yeah, the speed gained with getting all the files in the filesystem it's far from enough to justify rewriting most of the program.
Because what RPFM or PFM can do to a mod in seconds, the assembly kit takes more than a minute with all the loading tables, exporting, etc.... For a game developer is a good tool, because you don't have to tweak and test every single change. For a modder, once you start tweaking (for example, tweaking your units stats to make it balanced, that requires constant testing, or for testing scripts), or doing more than one mod in the same installation, it gets pretty bad, and even useless in some cases. And also, certain things like patching maps in Warhammer games, with Assembly Kit alone are impossible to do. So, the main use many people give it is to use his global search function. |
That is indeed amazing - So the impact of the overhead of using multiple files is so huge? |
regarding pack_lib: So to keep things speedy pack_lib must provide a way
|
If you mean the impact of using filesystem files instead the ones in RAM, no, it's not huge (at least in theory), but it requires to change the entire system of how RPFM deals with PackFiles for no apparent gain. The current system does the job very fast, rewriting it just because is.... not useful.
🤦♂️ I didn't even realize that... Yeah, all Warhammer 2 PackFiles have a 28 byte header.
The lib must provide a way to read any PackFile since Rome 2 (PFH4 & 5), to write PackFiles of said format (except encrypted/with extended header, as there is no reason to do that because Arena doesn't even support modding, but do it if you want). In addition, if there are encrypted PackedFiles (like the tables in Arena, or the music files) it must provide them already decrypted when you try to read them. Also, if the idea it's to make RPFM depend on it, it has to implement all the extra functions that RPFM currently have in the PackFile module, like adding PackFiles, deleting them,.... and, of course, being somewhat compatible with the current system RPFM uses. |
I began looking through rpfm today, it is indeed not a tiny change. Are you using serde to send commands and data through an mpsc channel to communicate between ui and bg thread by sending a command string first and encoded data immediately after? Can't you just send dedicated command/data structs back and forth, and wrap them in arcs/mutexes if neccessary? Here is what I have so far. Every place that needs cleaning is marked with a TODO. |
That's why I told you it was easier to just isolate the PackFile module, integrate there your decrypt functions (because the only things that module doesn't do is decrypt PackedFiles and the padding stuff, and both should be relatively easy to add) and use that as the basis of a new crate. Also, just checked and you removed a lot of the data taken from the PackFiles, like all the PackFile Index, header stuff... Is all of that stuff stored in a struct from the
Probably, but I'm still learning how to deal with two threads, and this approach allows me to have both threads data isolated from one another with a channel to send commands, and another two in case we need to send data back and forward. Things stay simple, isolated and easy to use and understand. |
The top-down approach would most likely be less work, but I fear the resulting library would be less uniform and constrained in it's design decisions. I want others to enjoy using the library too :)
Yes and no. The PackFile from twa_pack_lib doesn't store anything except its raw bytes, and rpfm uses its public accessors to build it's PackFileView: pub struct PackFileView {
pub file_path: PathBuf,
pub pfh_version: PFHVersion,
pub bitmask: PFHFlags,
pub packed_files: Vec<PackedFile>,
pub empty_folders: Vec<Vec<String>>
} The timestamp is still missing, will add it as soon as the most important TODOs are solved. PackFileView should contain everything you need: You can add and remove PackedFiles, you can add temporary empty folders, and you have everything you need to build a valid .pack file from it - a PFH version, a bitmask and a list of files is everything a pack library should need to build a file (ok, maybe the weird PackFile Index that apparently has no effect has to be added). I think we might want to move the type (the least significant byte of the bitmap) out of the bitmap - the bitflags crate does not like flags with zero set bits, and overlapping bits aren't nice either.
Using channels is perfectly reasonable, but is serializing and deserializing necessary? Maybe clone is sufficient, which should be faster both at compile and runtime. By the way, how long does rpfm (without dependencies) take to build on your machine? My poor laptop needs 15-30 seconds :( |
Yeah, but in RPFM the top-down approach (I learned today it's called that) it's useful, because it's not just PackFiles. There are still a lot of things to decode (older PackFiles, ESF files, RigidModels,....) and it's far easier to do it if you can store and deal separately with every part of the code. I mean, for a "finished" lib where you know exactly what you're going to have or you know it's never going to change, your approach is good, but for something that is reverse-engineered and that it may change in an update/new game release, and there are still unknown things about it (shogun 2 and older games)... It's not as "future-proof". This is something I learnt when I started this project: code should be easy to read for someone who wants to learn how the things work and has no idea of the language. If someone in the future has to take the lead, I want them to be able to see how everything works, how things can be used.... That eases a lot the "entry barrier" for that kind of thing. That's why I re-implemented the original PFM instead of maintaining it. Because it was hard to understand how the things worked, what was X or Y byte for.... Took me a while to just get how the tables were encoded properly, due to that, for example. And had to fix it a few updates later for the same reason. Yes, it's not the fastest or more flexible way to do it, but it's the easiest to learn the structure of the files just by looking at the code. I want others to be able to look at the code and learn how everything works, even if the data itself it's useless. Because this is reverse engineering. You can't go and ask anyone to see how it works. There is no official documentation. If you want others to learn in case you can't continue/abandon the projects, they need to have an easy way to learn the format, how it works, how the data can be used,.... The changes you want to make in RPFM to make it work with the lib are too disruptive, hiding too much of the "how it works" part that actually helps to understand how all the stuff works. So in that case, RPFM will not depend on that lib. You may want others to enjoy using the library, but in my case, I don't really enjoy using your approach for this kind of stuff. For a lib that must work fast, it works. For code to learn how things work, it doesn't. And I want RPFM to be easy to learn from.
I wanted to just have one sending channel and one receiving channel for data, so all the data needed to have the same type. Serializing to a
In my i5-4570, in linux 50 seconds to 1:30. In windows, 1:30 to 1:50, depending how much CPU the antimalware service wants to eat in that moment. Full compilation with libs is about 8 to 16 minutes, mainly due to Qt. Those are debug builds. Release builds take more. |
Do you have pfh3 support? I really would like to include that into pack_lib, but I don't have any .packs nor games to reverseengineer. Nevertheless, ESF/RigidModels handling could stay exactly like it is, even with a separate pack lib.
Is it hiding the "how it works" part, or just moving a part of it into a separate crate? So far my diff modifies/adds 241 lines of code, that's really not that much. If you want a well-tested module that can handle any (malformed) input without panicking, you are better of separating it from everything else and making sure it can do one thing well. My Signal-Windows project has ~50k lines of code, it would be unmanageable if we didn't have decisive borders where one lib ends and another lib begins.
They must have the same type, but that does not mean they must be the same "object". You could have used an enum (which I would recommend because they are analyzed at compile time by the compiler, which gives you nice things like compile time errors if you forget to match an enum variant on the receiving end of a channel), or boxes to Any implementations. Both would be simpler and more easy to use than using 4 channels and inter-process-communication libraries. |
But how am I hiding anything? pack_lib's parser is a really clean and safe parser in ~250 lines of code. You can easily see what the fields in the header are, you can easily see when a header has which size, and you can easily see how to iterate through the index.
If you consider crashes/panics ok that's completely fine for rpfm, but it is never acceptable for a parsing library to panic on malformed input. And do you really believe that you can understand a packfile's structure better if you read rpfm's? I deliberately tried to write an easy-to-read crate, it has literally only one function which does something not trivial, and that is
Add that time it were 241 modifieds/added (and a few hundred removed) lines in 8 files, which is not that much compared to rpfm's size. Right now I'm at 345 modified/added (953 removed) lines, and I would be willing to discuss each of these if you disagree with them. It's fine when you don't want to use pack_lib, but it would be way easier to maintain one pack implementation instead of several.
Ah I thought the bg thread might want to send commands to the ui thread too, my bad. But yeah, rust's enums are awesome! All instances of one enum are of the same type, so you can declare commands like pub struct PackedPath {
pack_file: String,
packed_file_path: String
}
pub struct AddPackedFileCommand {
pub fs_path: PathBuf,
pub packed_path: PackedPath
}
pub struct DeletePackedFileCommand {
pub packed_path: PackedPath
}
enum UICommand {
AddPackedFile(AddPackedFileCommand),
DeletePackedFile(DeletePackedFileCommand)
} and match them on the receiver: match receiver.recv() {
UICommand::AddPackedFile(add_packed_file_command) => println!("foo!"),
UICommand::DeletePackedFile(delete_packed_file_command) => println!("bar!")
} Since your commands could be rather big I maybe would recommend sending boxes with UICommands instead of raw UICommands. We maybe should open a dedicated issue for a typesafe thread communication, though. |
No, I mean it returns an error to the UI and does nothing. And yeah, in RPFM is all explained, not only through comments, but also with code that's not too hard to understand for someone new to the language. From that point of view, your crate is harder to understand. I mean, even I have problems understanding some parts, specially with the iterator part because I've never implemented one because I didn't need it, and I'm not a noob (not an expert either). From that point of view, RPFM is not as simple, but it's easier to understand. Here's an example. This is RPFM header struct: pub struct PackFileHeader {
pub id: String,
pub pack_file_type: u32,
pub pack_file_count: u32,
pub pack_file_index_size: u32,
pub packed_file_count: u32,
pub packed_file_index_size: u32,
pub creation_time: u32,
pub data_is_encrypted: bool,
pub index_includes_timestamp: bool,
pub index_is_encrypted: bool,
pub header_is_extended: bool,
} This is pub fn get_bitmask(raw_data: &[u8]) -> ::PFHFlags {
::PFHFlags::from_bits_truncate(LittleEndian::read_u32(&raw_data[0x04..0x08]))
}
fn get_index_length(raw_data: &[u8]) -> u32 {
LittleEndian::read_u32(&raw_data[0x10..0x14])
} In RPFM you can clearly see every piece of code that forms the header (it lacks the extended header stuff, but you get the idea). In the lib, I can see how to obtain certain parts. You get from And that, for someone new to the language (as I was a few months ago) or even to programming, is far easier to understand that a bunch of getters.
Added to my personal todo list. Hopefully, once I'm done with the 70+ tables I still have to decode from Rome 2, and with the Arena stuff, I can get that done. |
That would be nice but that is not the case - the very first malformed pack file I created happily crashes rpfm with an index out of range error.
Iterators are a thing in every major programming language, and all you have to do is implement a function that yields the next item - isn't that super simple stuff? But yeah, some comments would be nice, I'll have to add some rustdoc strings.
That's a pack_lib bug - it still needs support for pf header extensions.
pack_lib's philosophy is to be a shallow-but-safe library above .pack files. If you are interested where a certain value comes from, you go to the function implementation and can precisely see which bytes from the file are being mapped to your requested value, and how they are being converted. If you are interested in all values, you should not be looking at source code but into the pack file format documentation (which we should reference in our comments, now that I think of it!).
👍it might take some refactoring, but you are rewarded with a typesafe and simple thread communication. |
And that's an error that has to be fixed.
Using iterators, yes. Implementing the
Sure but, as I said above, people who can read it in the community if needed to know how are PackFile's encoded are not many. Sure, you can get from were single values come from. Sure, you can check the entire format in a doc. But if a new format comes out and someone wants to know "how does the lib work" to adapt it to the new format, they need to spend a lot of time learning what do implementing this
Not sure about that, as the program will still crash by default if the variant is not the one it should receive, but... if it can be done without the overhead of serializing/deserializing stuff, is good enough. Also, in one of the last commits the padding stuff got fixed, so the original issue is fixed, but to keep with the conversation I'll just change the title instead of opening a new issue for the lib. |
Java has Iterable where you have to implement
The entire flow consists of get_next, which contains less than 100 lines of code. All it does is move through the index and through the payload, while respecting the bitmask's settings and the pfh version. That's really not complicated. If a new pfh version comes out, all you have to do is change how get_next interprets the index and the payload. If we were to take your packfile modules to build a separate lib, the first thing I'd do is implement IntoIter - the pack file format is nothing more than an iteration, so IntoIter perfectly repesents the underlying data structure and it is the most easy-to-use api you can get. I have opened a PR, if you like it merge it, if you have questions ask, if you don't like it close it :) Semi-related: What do you think of supporting packing encrypted mod packs for the sake of trolling? |
Wasn't convinced (because despite having touched both C# and Rust for quite some time, I've never had the need to implement an iterator by myself), so I asked some modders, and the only guy who answered said that, from a point of someone who knows lua and is familiar with rust, the lib is easy enough to read. So I guess that's enough to accept it (still at least the functions in the lib need comments to know what they actually do), though the pull-request needs some... polishing before accepting it. I already wrote all that stuff in a comment in the pull-request, so you can check all that stuff there.
If you want to do it, sure. But do you have any clue what those 14 extra bytes from the extended header are for? |
An update about the channel stuff. The commands have been moved to an enum (no more risk of crashing due to a typo in the command), but hold no data, because the data may or may not need to have a command before sending it and it's better to have all the data together in one enum instead splited in two. For example, there are multiple features that try something in the background thread, then go back to the ui for some input, then back to the background. And the data... just counting the data send from the background thread, there are 33 different types of data. There are even more coming from the UI. That means a potential 50 to 100 variants long enum. All of them need to be matched when receiving, and panic if the variant is not the one we want, because that means the message system has been messed up and there is not an easy way to recover from that. Is there any specific reason to rewrite all the data sending/receiving stuff other than removing the already very little overhead while serializing/deserializing? Because, apart of that I can't find any reason to do it, and it looks like it's too much work to move the system from using serde and generics to enums for very little gain. |
I don't, but since there is many unused stuff like the timestamps I wouldn't worry too much about them as long as everything works.
👍
I'm not sure I understand this correctly. Why can't you add the data to the command? If the same command is sometimes used with and sometimes without data you can either put the data in an Option<DATA_FOO> or split it into two commands because they are different commands after all.
Doesn't it sound like this are two different commands then?
Assuming there are not more responses than commands, I guess it is worth it. You can (and imho should) add the data to the enums, so you will have one bg->ui enum and the command enum, and just 2 channels.
That depends on how much you want to change in the future. If you are saying rpfm's feature list is almost complete and you will only add support for new games, it could be wasted time. If you want to continue developing it for a longer period and want a codebase that is easy to understand for new developers, I'd go for it immediately:
|
Quite the opposite. The situation is the data is sent sometimes without a command. For example, there are commands that you send and the UI thread waits for a response. Then, depending on the response you send more data or just do nothing.
There are more responses that commands. There are currently 55 commands, and between 50-100 data types that are send one way or another. Also, there's another problem related to merging the channels. When I started the 0.9 version and the Qt port, I set a goal: the UI MUST NOT HANG. To do it, RPFM currently makes the UI thread to enter on a loop that keeps processing events (like resizing the window) until it gets a response from the background thread. That means you can potentially send a command while there is another one executing itself, making the entire program crash. In earlier 0.9 versions, you could crash the program by keeping pressed the "Del" button because the deletion process got messed by a checking process, and message were received in the wrong place. To "fix it", I left a channel for commands only (so commands can be chained and don't interfere with one another) and another for data, and every time data received, RPFM checks that's the type it needs using generics and crashes if it's not (because there is no way to be sure what was sent). Merging them will mean the commands can potentially crash in many more places that before.
Mmmm.... I'm not sure about the second and third one. The way I see it, the same problem with wrong messages being sent can also happen with enums, because you can receive the wrong variant and the program will have to crash, to avoid weird behavior. And the fifth one.... if you want data to be shared in an enum without deleting the original data (because the threads are isolated), you have to copy it anyways isn't? I'll check it again once I'm done with this release of today, but the way I see it.... other than readability improvements and a very little speed up, it's a very big rewrite with very little gain. |
Done in 2.0 with the big split. |
While investigating TWA's tables I noticed that some packs were not properly unpacked. Luckily I managed to track the problem down: If 1 << 4 is set in the bitmask, every item in the payload starts at a multiple of 8.
twa_pack_lib fix: TotalWar-Modding/tw_pack_lib@c549376 (unfortunately rather big because of major simplifications)
The text was updated successfully, but these errors were encountered: