This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Compression

orangemittens edited this page Jul 21, 2014 · 5 revisions

#Overview

The Sims 4 uses two different compression algorithms: DEFLATE compression and traditional DBPF compression.

##DEFLATE Compression

The input stream starts with a two-byte header that indicates it is using the DEFLATE algorithm. Several header values are possible; the most common ones are 0x78DA and 0x7801. The latter marks a weaker compression setting, so the data is compressed less effectively. After these two bytes, the standard DEFLATE library can handle the rest of the stream.

NOTE: The compression flag of each index entry has nothing to do with the type of compression. You should read the first two bytes of the data and choose the decompression routine based on them.
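
The header check described above can be sketched as follows. This is a minimal illustration, not the game's own code: the function names `looks_like_zlib` and `inflate_entry` are made up for this example, and it assumes the two header bytes are an ordinary zlib header, which Python's `zlib.decompress` consumes itself.

```python
import zlib


def looks_like_zlib(data: bytes) -> bool:
    """Return True if the first two bytes form a valid zlib header for DEFLATE."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble 8 means DEFLATE; the two bytes read as a big-endian
    # number must be a multiple of 31 (true for 0x78DA and 0x7801).
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def inflate_entry(data: bytes) -> bytes:
    """Decompress an entry that starts with a zlib header (hypothetical helper)."""
    if not looks_like_zlib(data):
        raise ValueError("entry does not start with a zlib/DEFLATE header")
    # zlib.decompress parses the two-byte header and the trailing checksum.
    return zlib.decompress(data)
```

Data that fails this check can then be tested for the traditional DBPF header instead.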

##Traditional DBPF Compression

The idea behind the compression is to reuse previously decoded strings. For example, if the word "heureka" occurs twice in a file, the second occurrence would be encoded by pointing to the first, thus lowering the size of the file.

The compression is done by defining control characters that tell three things:

  1. How many characters of plain text that follow should be appended to the output.
  2. How many characters should be read from the already decoded text (and appended to the output)
  3. Where to read the characters from in the already decoded text.

Thus, the algorithm to decompress these files goes like this:

Read the header, which is formatted like so:

 Offset 00 - Compression type (0x10 | [0x40] | [0x80])
 Offset 01 - 0xFB
 Offset 02 - Uncompressed size of file (4 bytes if type contains 0x80, otherwise 3 bytes)
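
As a rough sketch, the header could be parsed like this. Two things are assumptions rather than statements from this page: that the compression type byte precedes the 0xFB byte (matching the "0x10FB" tag used further down), and that the size field is big-endian. The name `parse_dbpf_header` is invented for the example.

```python
def parse_dbpf_header(data: bytes) -> tuple[int, int, int]:
    """Return (compression_type, uncompressed_size, payload_offset)."""
    if len(data) < 2:
        raise ValueError("data too short for a header")
    # ASSUMPTION: type byte first, then 0xFB (bytes on disk: 10 FB).
    ctype, magic = data[0], data[1]
    if magic != 0xFB or not (ctype & 0x10):
        raise ValueError("not traditional DBPF compression")
    # 0x80 in the type widens the size field from 3 to 4 bytes.
    size_len = 4 if ctype & 0x80 else 3
    if len(data) < 2 + size_len:
        raise ValueError("header is truncated")
    # ASSUMPTION: the size is stored big-endian.
    size = int.from_bytes(data[2:2 + size_len], "big")
    return ctype, size, 2 + size_len
```

The third value is the offset of the first control character: 5 for the 3-byte size field, 6 for the 4-byte one.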

After the header (Offset 5 or 6, depending on the compression type flags) is the start of the actual compressed file data, which is handled like so:

 { 
 	- Read the next control character. 
 	- Depending on the control character, read 0-3 more bytes that are a part of the control character.
 	- Inspect the control character.  From this, find out ''how many'' characters should be read and ''where from''.
 	- Read 0-''n'' characters from source and append them to the output. (''n'' being the "how many" data from above)
 	- Copy 0-''n'' characters from somewhere in the output to the end of the output. (''n'' in this case is the "how many to copy" data from above, and the position is the "where from" data)
 } 

###Control Characters

There are 4 types of control characters. They differ in how many characters can be read and how far back in the output they can be read from. The following conventions are used to describe them:

CC length

  • Length of control character.
Num plain text
  • Number of characters immediately after the control character that should be read and appended to output.
Num to copy
  • Number of chars that should be copied from somewhere in the already decoded output and added to the end of the output.
Copy offset
  • Where to start reading characters when copying from somewhere in the already decoded output.
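  • This is given as an offset from the current end of the output buffer, i.e. a stored offset of 0 means that you should copy the last character in the output and append it to the output. An offset of 1 means that you should copy the second-to-last character. The formulas below add 1 to the stored value, so their result is the distance back from the end of the output (1 = last character), while the "Maximum Offset" entries give the largest stored value.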
byte0
  • first byte of control character.
Bits
  • Bits of the control character.
    • p - num plain text
    • c - num to copy
    • o - copy offset
    • 0, 1 - identifier (literal bits that select the type of control character).

Note: It can sometimes be confusing when a control character states that you should copy for example 10 characters 5 steps from the end of the output. Clearly, you cannot read more than 5 characters before you reach the end of the buffer. The solution is to read and write one character at the time. Each time you read a character you copy it to the end thereby increasing the size of the output. By doing this, even offset 0 is possible and would result in duplicating the last character a number of times. This is utilized by the compression to recreate repeating text, for example bars of repeating dashes.
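
The one-character-at-a-time copy can be sketched like this. `copy_from_output` is a hypothetical helper, and `distance` is the stored offset plus 1, so a distance of 1 refers to the last character of the output.

```python
def copy_from_output(out: bytearray, distance: int, count: int) -> None:
    """Append `count` bytes read `distance` bytes back from the end of `out`."""
    if distance < 1 or distance > len(out):
        raise ValueError("copy distance points outside the decoded output")
    for _ in range(count):
        # The buffer grows on every append, so out[-distance] moves
        # forward too; this is what makes overlapping copies work.
        out.append(out[-distance])


dashes = bytearray(b"ab-")
copy_from_output(dashes, 1, 5)      # repeat the last character 5 times
print(dashes.decode())              # ab------

pattern = bytearray(b"abcde")
copy_from_output(pattern, 5, 10)    # 10 characters, 5 steps from the end
print(pattern.decode())             # abcdeabcdeabcde
```

A slice-based copy (`out += out[-distance:][:count]`) would stop short in the overlapping case, which is why the loop appends a single byte per step.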

###0x00 - 0x7F

 CC length: 2 bytes
 Num plain text: byte0 & 0x03
 Num to copy: ((byte0 & 0x1C) >> 2) + 3
 Copy offset: ((byte0 & 0x60) << 3) + byte1 + 1
 Bits: 0oocccpp oooooooo
 Num plain text limit: 0-3
 Num to copy limit: 3-10
 Maximum Offset: 1023
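
A small sketch of decoding this 2-byte form; `decode_cc2` is a name chosen for the example. It returns the copy position as a distance (stored offset plus 1), exactly as the formula above computes it.

```python
def decode_cc2(byte0: int, byte1: int) -> tuple[int, int, int]:
    """Decode a 2-byte control character (byte0 in 0x00-0x7F).

    Returns (num_plain, num_copy, distance).
    """
    if not 0x00 <= byte0 <= 0x7F:
        raise ValueError("byte0 is not a 2-byte control character")
    num_plain = byte0 & 0x03
    num_copy = ((byte0 & 0x1C) >> 2) + 3
    distance = ((byte0 & 0x60) << 3) + byte1 + 1
    return num_plain, num_copy, distance


print(decode_cc2(0x00, 0x00))  # (0, 3, 1)
print(decode_cc2(0x7F, 0xFF))  # (3, 10, 1024)
```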

###0x80 - 0xBF

 CC length: 3 bytes
 Num plain text: ((byte1 & 0xC0) >> 6) & 0x03
 Num to copy: (byte0 & 0x3F) + 4
 Copy offset: ((byte1 & 0x3F) << 8) + byte2 + 1
 Bits: 10cccccc ppoooooo oooooooo
 Num plain text limit: 0-3
 Num to copy limit: 4-67
 Maximum Offset: 16383
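
The 3-byte form can be decoded in the same way. `decode_cc3` is a hypothetical name, and the returned distance is again the stored offset plus 1.

```python
def decode_cc3(byte0: int, byte1: int, byte2: int) -> tuple[int, int, int]:
    """Decode a 3-byte control character (byte0 in 0x80-0xBF).

    Returns (num_plain, num_copy, distance).
    """
    if not 0x80 <= byte0 <= 0xBF:
        raise ValueError("byte0 is not a 3-byte control character")
    num_plain = (byte1 >> 6) & 0x03          # top two bits of byte1
    num_copy = (byte0 & 0x3F) + 4
    distance = ((byte1 & 0x3F) << 8) + byte2 + 1
    return num_plain, num_copy, distance


print(decode_cc3(0x80, 0x00, 0x00))  # (0, 4, 1)
print(decode_cc3(0xBF, 0xFF, 0xFF))  # (3, 67, 16384)
```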

###0xC0 - 0xDF

 CC length: 4 bytes
 Num plain text: byte0 & 0x03
 Num to copy: ((byte0 & 0x0C) << 6) + byte3 + 5
 Copy offset: ((byte0 & 0x10) << 12) + (byte1 << 8) + byte2 + 1
 Bits: 110occpp oooooooo oooooooo cccccccc
 Num plain text limit: 0-3
 Num to copy limit: 5-1028
 Maximum Offset: 131071
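
A sketch for the 4-byte form, where both the copy length and the offset are split between byte0 and the following bytes. `decode_cc4` is an invented name; the distance it returns is the stored offset plus 1.

```python
def decode_cc4(byte0: int, byte1: int, byte2: int, byte3: int) -> tuple[int, int, int]:
    """Decode a 4-byte control character (byte0 in 0xC0-0xDF).

    Returns (num_plain, num_copy, distance).
    """
    if not 0xC0 <= byte0 <= 0xDF:
        raise ValueError("byte0 is not a 4-byte control character")
    num_plain = byte0 & 0x03
    # Two high bits of the length come from byte0, the low eight from byte3.
    num_copy = ((byte0 & 0x0C) << 6) + byte3 + 5
    # One high bit of the offset comes from byte0, the rest from byte1 and byte2.
    distance = ((byte0 & 0x10) << 12) + (byte1 << 8) + byte2 + 1
    return num_plain, num_copy, distance


print(decode_cc4(0xC0, 0x00, 0x00, 0x00))  # (0, 5, 1)
print(decode_cc4(0xDF, 0xFF, 0xFF, 0xFF))  # (3, 1028, 131072)
```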

###0xE0 - 0xFB

This is the simplest form of control character. The only thing it does is tell how many plain text characters follow.
The formula for this is: (C - 0xDF) * 4, where C is the value of the control character.
Thus a value of 0xE0 means that you should read 4 characters of plain text and append them to the output.

 CC length: 1 byte 
 Num plain text: ((byte0 & 0x1F) << 2) + 4
 Num to copy: 0 
 Copy offset: -  
 Bits: 111ppppp 
 Num plain text limit: 4-112 (Multiples of 4)
 Num to copy limit: 0 
 Maximum Offset: - 
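
The two formulas given for this form, (C - 0xDF) * 4 and the bit-mask version, produce the same result. A quick sketch to confirm it, with `literal_run_length` as a made-up helper name:

```python
def literal_run_length(byte0: int) -> int:
    """Number of plain text bytes that follow a 0xE0-0xFB control character."""
    if not 0xE0 <= byte0 <= 0xFB:
        raise ValueError("byte0 is not a plain-text-only control character")
    return ((byte0 & 0x1F) << 2) + 4


# Both formulas agree over the whole range.
for c in range(0xE0, 0xFC):
    assert literal_run_length(c) == (c - 0xDF) * 4

print(literal_run_length(0xE0), literal_run_length(0xFB))  # 4 112
```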

###0xFC - 0xFF

 CC length: 1 byte 
 Num plain text: (byte0 & 0x03)
 Num to copy: 0 
 Copy offset: - 
 Bits: 111111pp 
 Num plain text limit: 0-3
 Num to copy limit: 0 
 Maximum Offset: -

Compressed data MUST end with a code in the range 0xFC to 0xFF. If the data is an exact fit to the size, 0xFC can be used as a null code. While community tools properly handle data without the ending byte, Sims 3 will keep reading until it encounters it, usually resulting in a crash.
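
Putting the pieces together, a complete decompressor might look roughly like the sketch below. It is an illustration under stated assumptions, not a reference implementation: it assumes the type byte precedes 0xFB, that the size field is big-endian, and, like the community tools mentioned above, it tolerates a missing stop code by stopping at the end of the input. The name `decompress_dbpf` is hypothetical.

```python
def decompress_dbpf(data: bytes) -> bytes:
    """Decompress traditional DBPF data (sketch; header layout is assumed)."""
    ctype, magic = data[0], data[1]        # ASSUMPTION: type first, then 0xFB
    if magic != 0xFB or not (ctype & 0x10):
        raise ValueError("not traditional DBPF compression")
    size_len = 4 if ctype & 0x80 else 3
    size = int.from_bytes(data[2:2 + size_len], "big")  # ASSUMPTION: big-endian
    pos = 2 + size_len
    out = bytearray()

    while pos < len(data):                 # lenient: a stop code is not required
        b0 = data[pos]
        pos += 1
        copy = dist = 0
        if b0 < 0x80:                      # 0x00-0x7F, 2 bytes
            b1 = data[pos]
            pos += 1
            plain = b0 & 0x03
            copy = ((b0 & 0x1C) >> 2) + 3
            dist = ((b0 & 0x60) << 3) + b1 + 1
        elif b0 < 0xC0:                    # 0x80-0xBF, 3 bytes
            b1, b2 = data[pos:pos + 2]
            pos += 2
            plain = (b1 >> 6) & 0x03
            copy = (b0 & 0x3F) + 4
            dist = ((b1 & 0x3F) << 8) + b2 + 1
        elif b0 < 0xE0:                    # 0xC0-0xDF, 4 bytes
            b1, b2, b3 = data[pos:pos + 3]
            pos += 3
            plain = b0 & 0x03
            copy = ((b0 & 0x0C) << 6) + b3 + 5
            dist = ((b0 & 0x10) << 12) + (b1 << 8) + b2 + 1
        elif b0 < 0xFC:                    # 0xE0-0xFB, plain text only
            plain = ((b0 & 0x1F) << 2) + 4
        else:                              # 0xFC-0xFF, stop code
            plain = b0 & 0x03

        out += data[pos:pos + plain]       # plain text comes first
        pos += plain
        if copy and dist > len(out):
            raise ValueError("copy offset points before the start of the output")
        for _ in range(copy):              # byte by byte, so overlaps work
            out.append(out[-dist])
        if b0 >= 0xFC:
            break

    if len(out) != size:
        raise ValueError("decoded size does not match the header")
    return bytes(out)


stream = bytes([0x10, 0xFB, 0x00, 0x00, 0x10, 0xE1]) + b"heureka " \
    + bytes([0x10, 0x07, 0xFD]) + b"!"
print(decompress_dbpf(stream))  # b'heureka heureka!'
```

In the example stream, 0xE1 introduces 8 plain bytes, the pair 0x10 0x07 copies 7 bytes from 8 back, and 0xFD ends the data with one final plain byte.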

###Compression Types

In addition to compression tagged as 0x10FB, there are other values that can be ORed into the 0x10 byte.

  • 0x40 : Sims 3 seems to have two compression and decompression routines. The coding is identical between them, however data tagged with 0x40 only uses a subset of the available codes, and limits the window to a much smaller size than would otherwise be possible. If data is written that goes beyond these limits, it will crash Sims 3. (Q: What codes are restricted? What window size is allowed?)
  • 0x80 : If the uncompressed data is 16 MB or longer, the size won't fit in the normal 3 bytes in the header. ORing 0x80 into the compression type increases the uncompressed size field to 4 bytes.