KWZ Format

The KWZ format is used to store Flipnote animations. The body of the file is structured into sections, with each section beginning with an 8-byte header. The last 256 bytes of a KWZ file are an SHA-256 RSA-2048 signature over the rest of the file.

Variants of the format are also used for folder icons and comments.

Section Headers

Type	Details
char[4]	Section magic
uint32	Section size (not including header)

The first 3 chars of the section magic identify the section type. The last char seems to be used for flags of some kind, but their meaning isn't known.
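
As a rough illustration of the header layout above, the following Python sketch walks the sections of a file held in memory. The function name `iter_sections` is made up for this example, and it assumes little-endian integers and sections stored back to back with no padding; neither assumption is stated explicitly in this document.

```python
import struct

def iter_sections(data: bytes, signature_size: int = 256):
    """Yield (section_type, flag_byte, body) for each section in a KWZ file."""
    end = len(data) - signature_size  # the trailing signature is not a section
    offset = 0
    while offset + 8 <= end:
        # 8-byte header: char[4] magic, then uint32 size (assumed little-endian)
        magic, size = struct.unpack_from("<4sI", data, offset)
        body_start = offset + 8
        # first 3 chars identify the section, the 4th is the unknown flags char
        yield magic[:3].decode("ascii"), magic[3], data[body_start:body_start + size]
        offset = body_start + size
```

Collecting the results into a dict keyed by section type (`KFH`, `KTN`, `KMC`, `KMI`, `KSN`) is a convenient way to look sections up afterwards.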

Sections

KFH (File Header)

Offset	Type	Details
0x0	uint32	CRC32 checksum
0x4	uint32	Creation timestamp
0x8	uint32	Last edit timestamp
0xC	uint32	App version? - seen as `0`, `1` or `3` so far
0x10	byte[10]	Root author ID
0x1A	byte[10]	Parent author ID
0x24	byte[10]	Current author ID
0x2E	wchar[11]	Root author name
0x44	wchar[11]	Parent author name
0x5A	wchar[11]	Current author name
0x70	char[28]	Root filename
0x8C	char[28]	Parent filename
0xA8	char[28]	Current filename
0xC4	uint16	Frame count
0xC6	uint16	Thumbnail frame index
0xC8	uint16	Flags
0xCA	uint8	Frame speed
0xCB	uint8	Layer visibility flags
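
The table above maps fairly directly onto a fixed-size struct. The sketch below is one possible way to unpack it in Python; the field names and the `parse_kfh` helper are invented for this example, and little-endian byte order is an assumption.

```python
import struct
from collections import namedtuple

# One group of format characters per run of rows in the table above.
# "<" means little-endian with no padding (an assumption of this sketch).
KFH_STRUCT = struct.Struct("<4I 10s10s10s 22s22s22s 28s28s28s 3H 2B")

KfhHeader = namedtuple("KfhHeader", [
    "crc32", "created", "edited", "app_version",
    "root_author_id", "parent_author_id", "current_author_id",
    "root_author_name", "parent_author_name", "current_author_name",
    "root_filename", "parent_filename", "current_filename",
    "frame_count", "thumb_index", "flags", "frame_speed", "layer_flags",
])

def parse_kfh(body: bytes) -> KfhHeader:
    """Unpack the raw KFH fields; IDs, names and filenames stay as bytes."""
    return KfhHeader._make(KFH_STRUCT.unpack_from(body, 0))
```

The struct size works out to 0xCC bytes, which matches the last offset in the table (0xCB) plus one byte.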

Timestamps are stored as the number of seconds since midnight 1 Jan 2000.
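
A minimal sketch of the conversion in Python follows. The helper name is hypothetical, and since this document does not say which timezone the epoch refers to, the result is left as a naive `datetime`.

```python
from datetime import datetime, timedelta

# midnight, 1 Jan 2000 (timezone not specified by the format notes)
KWZ_EPOCH = datetime(2000, 1, 1)

def kwz_timestamp_to_datetime(seconds: int) -> datetime:
    """Convert a KWZ timestamp (seconds since the epoch above) to a datetime."""
    return KWZ_EPOCH + timedelta(seconds=seconds)
```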

Author names are null-padded UTF-16 LE strings. Author IDs are usually formatted as lowercase hex strings with dashes like xxxx-xxxx-xxxx-xxxxxx. The last byte of an author ID seems to always be null, and isn't included as part of the string.
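
For illustration, the two fields could be decoded in Python roughly like this. Both function names are made up, and the sketch assumes the ID bytes are printed in the order they are stored.

```python
def format_author_id(raw: bytes) -> str:
    """Format a 10-byte author ID as xxxx-xxxx-xxxx-xxxxxx."""
    hex_str = raw[:9].hex()  # drop the trailing null byte; hex() is lowercase
    return "-".join([hex_str[0:4], hex_str[4:8], hex_str[8:12], hex_str[12:18]])

def decode_author_name(raw: bytes) -> str:
    """Decode a null-padded UTF-16 LE author name."""
    return raw.decode("utf-16-le").split("\x00")[0]
```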

Filenames are base32-encoded using a custom alphabet sequence cwmfjordvegbalksnthpyxquiz012345. The decoded filename can be unpacked to get the author ID, creation timestamp and modified timestamp.

If you're dealing with DSi Library notes, bear in mind that their metadata has some quirks that you may need to handle.

KFH flags

Bitmask	Details
`flags & 0x1`	Lock flag
`flags & 0x2`	Loop playback flag
`flags & 0x4`	Toolset flag
`flags & 0x10`	Unsure, possibly indicates something with layer depth?

Layer visibility flags

Bitmask	Details
`flags & 0x1`	Layer A invisible
`flags & 0x2`	Layer B invisible
`flags & 0x4`	Layer C invisible

Flipnote Playback Speeds:

Value	Frames per second
0	0.2
1	0.5
2	1
3	2
4	4
5	6
6	8
7	12
8	20
9	24
10	30
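
The speed table can be used as a simple lookup. The sketch below, with hypothetical names, turns a frame speed value into a frame rate and an overall animation length.

```python
# frames per second, indexed by the KFH frame speed value
KWZ_FRAMERATES = [0.2, 0.5, 1, 2, 4, 6, 8, 12, 20, 24, 30]

def frames_per_second(frame_speed: int) -> float:
    return KWZ_FRAMERATES[frame_speed]

def animation_length_seconds(frame_count: int, frame_speed: int) -> float:
    """Playback time for one pass through the animation."""
    return frame_count / KWZ_FRAMERATES[frame_speed]
```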

KTN (Thumbnail)

This section starts with a CRC32 checksum followed by the Flipnote's thumbnail stored as JPEG image data. Thumbnails are 80px x 64px, with a black line at the bottom which is normally cropped out when displayed.

KMC (Frame Data)

Offset	Type	Details
0x0	uint32	CRC32 checksum
0x4	-	Frame data

Frames are stored in playback sequence, and split into separate images for layer A, layer B, and layer C. Layer images are 320 pixels wide and 240 pixels high, although under normal circumstances the outer edges can't be drawn on since the app puts a border around the bottom screen.

The data for each layer is compressed individually; to get the compressed size for a given frame layer (and thus calculate the offsets for any given frame) you need to first parse the KMI.
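
As a sketch of that offset calculation, assuming the layer sizes have already been read from the KMI and that compressed layers are packed back to back in A, B, C order with no padding (an assumption of this example, as is the function name):

```python
def frame_layer_offsets(layer_sizes, data_start=4):
    """Compute (offset, size) for every layer of every frame.

    layer_sizes: one (size_a, size_b, size_c) tuple per frame, taken from KMI.
    data_start:  offset of the frame data inside the KMC section body,
                 4 because the body starts with a CRC32 checksum.
    """
    offsets = []
    pos = data_start
    for sizes in layer_sizes:
        frame = []
        for size in sizes:
            frame.append((pos, size))
            pos += size
        offsets.append(frame)
    return offsets
```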

Layers are divided into 1200 tiles that are 8 pixels high and 8 pixels wide. Every horizontal line in a tile ultimately references a line table index. The line table contains every possible combination of pixels for a line.

Tile arrangement

8x8 tiles are grouped into larger tiles which are 128x128 pixels, unless they fall off the edge of the frame, in which case they are cropped. Both the large tiles and the 8x8 tiles inside each of them are stored in sequence from left to right, top to bottom.

The following pseudocode shows how we currently deal with this:

for (int large_tile_y = 0; large_tile_y < 240; large_tile_y += 128) {
  for (int large_tile_x = 0; large_tile_x < 320; large_tile_x += 128) {
    for (int tile_y = 0; tile_y < 128; tile_y += 8) {
      int y = large_tile_y + tile_y;
      // if the tile falls off the bottom of the frame, jump to the next large tile
      if (y >= 240)
        break;

      for (int tile_x = 0; tile_x < 128; tile_x += 8) {
        int x = large_tile_x + tile_x;
        // if the tile falls off the right of the frame, jump to the next small tile row
        if (x >= 320)
          break;
        // ... decode tile here -- (x, y) is the position of the tile's top-left corner relative to the top-left of the image
      }
    }
  }
}
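
The same traversal can be written as a Python generator, which makes it easy to check that it visits each of the 1200 tiles exactly once. The function name and parameters are illustrative only.

```python
def iter_tile_positions(width=320, height=240, large=128, small=8):
    """Yield the (x, y) top-left corner of each 8x8 tile in storage order."""
    for large_y in range(0, height, large):
        for large_x in range(0, width, large):
            # clip each large tile against the bottom and right frame edges
            for y in range(large_y, min(large_y + large, height), small):
                for x in range(large_x, min(large_x + large, width), small):
                    yield x, y
```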

Reading bits

Layer compression relies heavily on bitpacking. Bits are read from a tiny 16-bit buffer until there are no more bits left, at which point another uint16 is read from the compressed layer buffer, and so on.

For reference, here is a pseudocode implementation of a generic readBits() function. This will be used throughout the rest of the pseudocode examples in the next part of the documentation:

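// these should only be reset whenever you start reading a new layer
// bitValue needs more than 16 bits, since new bits are added above any that are still unread
uint32 bitValue = 0;
// starting at 16 marks the buffer as empty, so the first call fetches a uint16
int bitIndex = 16;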

int readBits(int numBits) {
  if (bitIndex + numBits > 16) {
    // readUint16() would read an uint16 from the compressed layer buffer, 
    // then increment the layer buffer pointer by 2
    uint16 nextBits = readUint16();
    bitValue |= nextBits << (16 - bitIndex);
    bitIndex -= 16;
  }
  int result = bitValue & ((1 << numBits) - 1);
  bitValue >>= numBits;
  bitIndex += numBits;
  return result;
}
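
For reference, here is a runnable Python version of the same idea. The class name `BitReader` is made up, little-endian uint16 words are assumed, and the bit index starts at 16 (meaning "buffer empty") so that the very first call loads a word from the layer data.

```python
import struct

class BitReader:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0          # byte offset of the next uint16 to load
        self.bit_value = 0    # Python ints are unbounded, so overflow is not a concern
        self.bit_index = 16   # 16 means no unread bits are buffered

    def read_bits(self, num_bits: int) -> int:
        if self.bit_index + num_bits > 16:
            # not enough buffered bits: append the next uint16 above the unread ones
            (next_bits,) = struct.unpack_from("<H", self.data, self.pos)
            self.pos += 2
            self.bit_value |= next_bits << (16 - self.bit_index)
            self.bit_index -= 16
        result = self.bit_value & ((1 << num_bits) - 1)
        self.bit_value >>= num_bits
        self.bit_index += num_bits
        return result
```

A new `BitReader` would be created for each compressed layer, which takes care of resetting the state.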

Tile decompression

Each tile starts with a 3-bit value which gives the type of compression it uses.

Pseudocode:

int tileType = readBits(3);

Tile type 0

All lines are the same and use one of the commonly occurring line indexes defined in the common line index table. A single 5-bit value provides an index for the common line index table, which in turn gives the line table index.

Pseudocode:

int lineIndex = commonLineIndexTable[readBits(5)];
uint8 a[8] = lineTable[lineIndex];
uint8 tile[8][8] = [a, a, a, a, a, a, a, a];

Tile type 1

Same as type 0, but instead a 13-bit value gives the line table index directly.

Pseudocode:

int lineIndex = readBits(13);
uint8 a[8] = lineTable[lineIndex];
uint8 tile[8][8] = [a, a, a, a, a, a, a, a];

Tile type 2

All lines use a commonly occurring line index, given by a single 5-bit value. However, every other line is rotated one pixel to the left, so the common line index table is used to get the line table index for odd lines and the common shifted line index table is used to get the line table index for even ones.

This tile type is most commonly used for dithering patterns created with the paintbrush tool.

Pseudocode:

int index = readBits(5);
int lineIndexA = commonLineIndexTable[index];
int lineIndexB = commonShiftedLineIndexTable[index];
uint8 a[8] = lineTable[lineIndexA];
uint8 b[8] = lineTable[lineIndexB];
uint8 tile[8][8] = [a, b, a, b, a, b, a, b];

Tile type 3

Same as type 2, except a 13-bit value is used. Odd lines use the regular line table, while even lines use the shifted line table.

Pseudocode:

int lineIndex = readBits(13);
uint8 a[8] = lineTable[lineIndex];
// the shifted line table holds pixels and is indexed by the same line table index
uint8 b[8] = shiftedLineTable[lineIndex];
uint8 tile[8][8] = [a, b, a, b, a, b, a, b];

Tile type 4

Each line can either be a 5-bit common line index table index, or a 13-bit line table index. The tile starts with a series of 8 bitflags which indicates which to use for each line.

Pseudocode:

uint8 flags = readBits(8);
uint8 tile[8][8];
int i = 0; // current line within the tile
for (int mask = 1; mask < 0xFF; mask <<= 1)
{
  int lineIndex;
  if (flags & mask)
    lineIndex = commonLineIndexTable[readBits(5)];
  else
    lineIndex = readBits(13);
  tile[i] = lineTable[lineIndex];
  i += 1;
}

Tile type 5

This indicates that one or more tiles have not changed since the previous frame, so they can be skipped. A 5-bit value gives the number of tiles to skip after the current one.

Tile type 6

Not used.

Tile type 7

This tile is comprised of two possible line values (A and B) arranged in a pattern. A 2-bit value provides the pattern type (detailed below), followed by a 1-bit value which indicates whether the common line index table is used.

If the 1-bit value is set to 1, then A and B should each be read as a 5-bit common line index table index, and the pattern type should also be adjusted by doing pattern type = (pattern type + 1) % 4.

If the 1-bit value is 0, then A and B should be read as 13-bit line table indexes.

Then the arrangement of tile lines for each pattern type is:

Pattern type	Line pattern
0	`A B A B A B A B`
1	`A A B A A B A A`
2	`A B A A B A A B`
3	`A B B A B B A B`

Pseudocode:

uint8 pattern = readBits(2);
uint8 isCommon = readBits(1);

int lineIndexA;
int lineIndexB;

if (isCommon == 1)
{
  lineIndexA = commonLineIndexTable[readBits(5)];
  lineIndexB = commonLineIndexTable[readBits(5)];
  pattern = (pattern + 1) % 4;
}
else
{
  lineIndexA = readBits(13);
  lineIndexB = readBits(13);
}

uint8 a[8] = lineTable[lineIndexA]; // pixels for line A
uint8 b[8] = lineTable[lineIndexB]; // pixels for line B
uint8 tile[8][8];

// pattern number indicates the order of the lines in the tile
switch (pattern)
{
  case 0:
    tile = [a, b, a, b, a, b, a, b];
    break;
  case 1:
    tile = [a, a, b, a, a, b, a, a];
    break;
  case 2:
    tile = [a, b, a, a, b, a, a, b];
    break;
  case 3:
    tile = [a, b, b, a, b, b, a, b];
    break; 
}
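
To tie the tile types together, here is a compact Python sketch covering types 0 to 4 and 7. It is only an illustration: `decode_tile` and its parameters are hypothetical, `read_bits` is any callable that behaves like readBits(), and the tables are passed in rather than built here. For type 3 it treats the shifted line table as a table of pixel lines, which is how that table is defined under "Decompression tables". Type 5 (skip) and type 6 (unused) are left to the caller, since they affect the tile loop rather than producing pixels.

```python
# line arrangement for each tile type 7 pattern
PATTERNS = {0: "ABABABAB", 1: "AABAABAA", 2: "ABAABAAB", 3: "ABBABBAB"}

def decode_tile(tile_type, read_bits, line_table, shifted_line_table,
                common, common_shifted):
    """Return the 8 lines of a tile for tile types 0-4 and 7."""
    if tile_type == 0:
        return [line_table[common[read_bits(5)]]] * 8
    if tile_type == 1:
        return [line_table[read_bits(13)]] * 8
    if tile_type == 2:
        i = read_bits(5)
        return [line_table[common[i]], line_table[common_shifted[i]]] * 4
    if tile_type == 3:
        i = read_bits(13)
        return [line_table[i], shifted_line_table[i]] * 4
    if tile_type == 4:
        flags = read_bits(8)
        lines = []
        for bit in range(8):  # lowest flag bit describes the first line
            if flags & (1 << bit):
                index = common[read_bits(5)]
            else:
                index = read_bits(13)
            lines.append(line_table[index])
        return lines
    if tile_type == 7:
        pattern = read_bits(2)
        if read_bits(1):
            a = line_table[common[read_bits(5)]]
            b = line_table[common[read_bits(5)]]
            pattern = (pattern + 1) % 4
        else:
            a = line_table[read_bits(13)]
            b = line_table[read_bits(13)]
        return [a if ch == "A" else b for ch in PATTERNS[pattern]]
    raise ValueError("tile type %d is not handled by this sketch" % tile_type)
```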

Decompression tables

Line table

Contains every possible combination of pixels for an 8-pixel line. This can be generated too -- our method creates the table as an array of 6561 lines, where each item represents a line of 8 pixels. Pixel values are 0 for transparent, 1 for layer color 1 and 2 for layer color 2.

// the line table is a 2d array of size [6561][8]
uint8 lineTable[6561][8];
int index = 0;
for (uint8 a = 0; a < 3; a++)
for (uint8 b = 0; b < 3; b++)
for (uint8 c = 0; c < 3; c++)
for (uint8 d = 0; d < 3; d++)
for (uint8 e = 0; e < 3; e++)
for (uint8 f = 0; f < 3; f++)
for (uint8 g = 0; g < 3; g++)
for (uint8 h = 0; h < 3; h++)
{
	lineTable[index] = [b, a, d, c, f, e, h, g];
	index += 1;
}

Shifted line table

Contains every possible combination of pixels for an 8-pixel line, in the same order as the regular line table, but with pixels shift-rotated one place to the left.

uint8 shiftedLineTable[6561][8];
int index = 0;
for (uint8 a = 0; a < 3; a++)
for (uint8 b = 0; b < 3; b++)
for (uint8 c = 0; c < 3; c++)
for (uint8 d = 0; d < 3; d++)
for (uint8 e = 0; e < 3; e++)
for (uint8 f = 0; f < 3; f++)
for (uint8 g = 0; g < 3; g++)
for (uint8 h = 0; h < 3; h++)
{
	shiftedLineTable[index] = [a, d, c, f, e, h, g, b];
	index += 1;
}
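
Both tables can be generated in one pass. The Python sketch below uses `itertools.product`, which varies its last element fastest and so visits values in the same order as the nested loops above; the function name is made up for this example.

```python
from itertools import product

def build_line_tables():
    """Return (line_table, shifted_line_table), each with 6561 8-pixel lines."""
    line_table = []
    shifted_line_table = []
    for a, b, c, d, e, f, g, h in product(range(3), repeat=8):
        line_table.append((b, a, d, c, f, e, h, g))
        # same pixels, rotated one place to the left
        shifted_line_table.append((a, d, c, f, e, h, g, b))
    return line_table, shifted_line_table
```

Each entry of the shifted table is simply the matching regular entry with its first pixel moved to the end.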

Common line index table

Represents line table indices for commonly occurring lines.

uint16 commonLineIndexTable[32] = [
  0x0000, 0x0CD0, 0x19A0, 0x02D9, 0x088B, 0x0051, 0x00F3, 0x0009,
  0x001B, 0x0001, 0x0003, 0x05B2, 0x1116, 0x00A2, 0x01E6, 0x0012,
  0x0036, 0x0002, 0x0006, 0x0B64, 0x08DC, 0x0144, 0x00FC, 0x0024,
  0x001C, 0x0004, 0x0334, 0x099C, 0x0668, 0x1338, 0x1004, 0x166C
];

Common shifted line index table

Represents line table indices for commonly occurring lines, but where the line pixels are shift-rotated one place to the left.

uint16 commonShiftedLineIndexTable[32] = [
  0x0000, 0x0CD0, 0x19A0, 0x0003, 0x02D9, 0x088B, 0x0051, 0x00F3, 
  0x0009, 0x001B, 0x0001, 0x0006, 0x05B2, 0x1116, 0x00A2, 0x01E6, 
  0x0012, 0x0036, 0x0002, 0x02DC, 0x0B64, 0x08DC, 0x0144, 0x00FC, 
  0x0024, 0x001C, 0x099C, 0x0334, 0x1338, 0x0668, 0x166C, 0x1004
];
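
The two common tables are related: each entry of the shifted table is the line table index of the corresponding common line after rotating it one pixel to the left. The Python sketch below, with hypothetical names, derives the shifted table from the regular one, which can serve as a sanity check on a transcription of the values above.

```python
from itertools import product

COMMON = [
    0x0000, 0x0CD0, 0x19A0, 0x02D9, 0x088B, 0x0051, 0x00F3, 0x0009,
    0x001B, 0x0001, 0x0003, 0x05B2, 0x1116, 0x00A2, 0x01E6, 0x0012,
    0x0036, 0x0002, 0x0006, 0x0B64, 0x08DC, 0x0144, 0x00FC, 0x0024,
    0x001C, 0x0004, 0x0334, 0x099C, 0x0668, 0x1338, 0x1004, 0x166C,
]
COMMON_SHIFTED = [
    0x0000, 0x0CD0, 0x19A0, 0x0003, 0x02D9, 0x088B, 0x0051, 0x00F3,
    0x0009, 0x001B, 0x0001, 0x0006, 0x05B2, 0x1116, 0x00A2, 0x01E6,
    0x0012, 0x0036, 0x0002, 0x02DC, 0x0B64, 0x08DC, 0x0144, 0x00FC,
    0x0024, 0x001C, 0x099C, 0x0334, 0x1338, 0x0668, 0x166C, 0x1004,
]

# regular line table, plus a reverse lookup from pixels to line table index
LINES = [(b, a, d, c, f, e, h, g)
         for a, b, c, d, e, f, g, h in product(range(3), repeat=8)]
INDEX_OF = {line: i for i, line in enumerate(LINES)}

def shifted_index(index: int) -> int:
    """Line table index of the given line rotated one pixel to the left."""
    line = LINES[index]
    return INDEX_OF[line[1:] + line[:1]]

derived_shifted = [shifted_index(i) for i in COMMON]
```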

KMI (Frame Meta)

This section contains a table of metadata for each animation frame. Each entry in the table is 28 bytes long:

Frame Meta Entry

Offset	Type	Details
0x0	uint32	Flags
0x4	uint16	Layer A size
0x6	uint16	Layer B size
0x8	uint16	Layer C size
0xA	hex[10]	Frame author ID
0x14	uint8	Layer A depth
0x15	uint8	Layer B depth
0x16	uint8	Layer C depth
0x17	uint8	Sound effect flags
0x18	uint16	Unknown, usually 0
0x1A	uint16	Camera flags

Frame Flags

Mask	Details
`flags & 0xF`	Paper color index
`(flags >> 4) & 0x1`	Layer A diffing flag
`(flags >> 5) & 0x1`	Layer B diffing flag
`(flags >> 6) & 0x1`	Layer C diffing flag
`(flags >> 7) & 0x1`	Is frame based on prev frame
`(flags >> 8) & 0xF`	Layer A first color index
`(flags >> 12) & 0xF`	Layer A second color index
`(flags >> 16) & 0xF`	Layer B first color index
`(flags >> 20) & 0xF`	Layer B second color index
`(flags >> 24) & 0xF`	Layer C first color index
`(flags >> 28) & 0xF`	Layer C second color index

Diffing flags are stored in the order of Layer A, layer B, layer C starting from the lowest bit. The bit will be set to 0 if the layer is based on the same layer from the previous frame.

Each color is stored as a palette index.
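
As an example, the frame flags could be unpacked in Python along these lines. The function and key names are invented for this sketch; the bit positions follow the Frame Flags table.

```python
def unpack_frame_flags(flags: int) -> dict:
    """Split the 32-bit frame flags into their documented fields."""
    return {
        "paper_color": flags & 0xF,
        # a diffing bit of 0 means the layer is based on the previous frame's layer
        "layer_uses_prev": [((flags >> (4 + i)) & 0x1) == 0 for i in range(3)],
        "based_on_prev_frame": ((flags >> 7) & 0x1) == 1,
        # (first color, second color) palette indices for layers A, B, C
        "layer_colors": [
            ((flags >> (8 + 8 * i)) & 0xF, (flags >> (12 + 8 * i)) & 0xF)
            for i in range(3)
        ],
    }
```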

Frame Palette

Index	Name	HEX color
`0`	white	#ffffff
`1`	black	#141414
`2`	red	#ff1717
`3`	yellow	#ffe600
`4`	green	#008232
`5`	blue	#06aeff
`6`	transparent (paper only)	-

Layer depth

Stores each layer's 3D depth - 0 for nearest, 6 for furthest. Note that layers are not necessarily stored in order of depth (layer B could be visually in front of A, or behind it). You will need to sort layers by their depth to correctly reconstruct the frame. If two layers have the same depth then the A, B, C order takes precedence.
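
One way to get a drawing order is a stable sort by depth. The sketch below returns layer indices from furthest to nearest, so that drawing them in that order puts the nearest layer on top. It assumes that "A, B, C order" on a tie means A ends up in front of B, and B in front of C; that reading is a guess, and the function name is hypothetical.

```python
def layer_draw_order(depths):
    """depths: [depth_a, depth_b, depth_c]. Returns layer indices, furthest first.

    Index 0 is layer A, 1 is layer B, 2 is layer C.
    """
    # Start from C, B, A so that, for equal depths, A is drawn last (on top).
    # sorted() is stable, so ties keep that starting order.
    return sorted([2, 1, 0], key=lambda layer: -depths[layer])
```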

When stereoscopic 3D is enabled, layers are shifted to the left for the left-eye image and to the right for the right-eye image. The number of pixels to shift by is simply the depth value. So to e.g. get the left-eye image for a layer with a depth value of 6, you just have to shift all of its pixels by 6 places to the left. To control the intensity of the 3D effect, the layer's depth value is multiplied with the value of the 3DS' depth slider, which starts at 0 for no 3D, and ends at 1 for full 3D.

Sound Effect Flags

Mask	Details
`(soundFlags & 0x1) !== 0`	Is SE1 used on this frame
`(soundFlags & 0x2) !== 0`	Is SE2 used on this frame
`(soundFlags & 0x4) !== 0`	Is SE3 used on this frame
`(soundFlags & 0x8) !== 0`	Is SE4 used on this frame

Camera Flags

Mask	Details
`(cameraFlags & 0x1) !== 0`	Layer A includes a photo
`(cameraFlags & 0x2) !== 0`	Layer B includes a photo
`(cameraFlags & 0x4) !== 0`	Layer C includes a photo

KSN (Sound Data)

Sound Header

Type	Description
uint32	Flipnote speed when recorded
uint32	BGM size
uint32	SE1 (A) size
uint32	SE2 (X) size
uint32	SE3 (Y) size
uint32	SE4 (up) size
uint32	CRC32 checksum of the audio tracks

After the header, the audio tracks data is stored in the order of BGM, SE1, SE2, SE3, and SE4.
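
A possible way to split the section body into its tracks in Python is sketched below. The names are made up, little-endian byte order is assumed, and tracks are assumed to follow the header immediately with no padding between them.

```python
import struct

SOUND_HEADER = struct.Struct("<7I")  # speed, 5 track sizes, CRC32
TRACK_NAMES = ["bgm", "se1", "se2", "se3", "se4"]

def parse_ksn(body: bytes):
    """Return (recorded_speed, crc32, tracks) from a KSN section body."""
    fields = SOUND_HEADER.unpack_from(body, 0)
    speed, sizes, crc = fields[0], fields[1:6], fields[6]
    tracks = {}
    offset = SOUND_HEADER.size
    for name, size in zip(TRACK_NAMES, sizes):
        tracks[name] = body[offset:offset + size]
        offset += size
    return speed, crc, tracks
```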

Sound data

Sound data is mono-channel IMA ADPCM sampled at 16364Hz (NOT 16384Hz). That said, Nintendo's implementation differs from the norm ever so slightly (of course!), and these differences need to be accounted for in order to accurately decode audio.

Typically 4-bit IMA ADPCM data is used, however in order to save space, the audio may switch into a 2-bit sample mode in places where the audio signal is relatively flat. The decoder will read the next sample as a 2-bit value if the step index left by the previous sample is below 18, or if it is only possible to read 2 bits from the current byte (conveniently, the audio encoder avoids 4-bit samples that overlap two bytes).

In addition, there are a couple of small divergences from the IMA ADPCM standard:

The step index is clamped between 0 and 79, compared to the standard 0 to 88.
The predictor is clamped in the 12-bit range (between -2048 and 2047), compared to the standard 16-bit range (-32768 to 32767).
The predictor value is scaled to the 16-bit range by multiplying by 16 after being clamped.
The initial decoder state is step_index = 40, however step_index = 0 may result in better sounding audio depending on the flipnote. This issue is seen in a more exaggerated form in some DSi Library flipnotes, detailed below.

The full step table is:

7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767

Since this is fairly complex, here is pseudocode to convert the sound data to 16-bit signed PCM:

# track_length is assumed to be the size of the audio track, in bytes
# track_data is assumed to be the audio data, as an array of bytes

# index table for 2-bit samples
index_table_2 = [
  -1,  2,
  -1,  2,
]

# index table for 4-bit samples
index_table_4 = [
  -1, -1, -1, -1, 2, 4, 6, 8,
  -1, -1, -1, -1, 2, 4, 6, 8,
]

# we don't know how long the unpacked audio is going to be, 
# so create an output buffer with enough space for 
# 60 seconds of audio with a sample rate of 16364 Hz
# output_buffer should be of type int16[]
output_buffer = Array(16364 * 60)
output_offset = 0

# initial decoder state:
# note: these variables must be signed integers of at least 16 bits in size
sample = 0
step = 0
diff = 0
step_index = 40
predictor = 0

for track_offset = 0; track_offset < track_length; track_offset += 1:
  byte = track_data[track_offset]
  bit_pos = 0
  while bit_pos < 8:
    # at this point step_index still holds the value left by the previous sample
    if step_index < 18 or bit_pos > 4:
      # read 2-bit sample
      sample = byte & 0x3
      # get diff
      step = step_table[step_index]
      diff = step >> 3
      if sample & 1: diff += step
      if sample & 2: diff = -diff
      predictor += diff
      # get step index
      step_index += index_table_2[sample]
      byte >>= 2
      bit_pos += 2
    else:
      # read 4-bit sample
      sample = byte & 0xF
      # get diff
      step = step_table[step_index]
      diff = step >> 3
      if sample & 1: diff += step >> 2
      if sample & 2: diff += step >> 1
      if sample & 4: diff += step
      if sample & 8: diff = -diff
      predictor += diff
      # get step index
      step_index += index_table_4[sample]
      byte >>= 4
      bit_pos += 4

    # clamp step index and predictor
    step_index = max(0, min(step_index, 79))
    predictor = max(-2048, min(predictor, 2047))

    # scale to 16 bit and write to output
    output_buffer[output_offset] = predictor * 16
    output_offset += 1
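
For anyone who wants something runnable, the decoder described in this section can be sketched in Python as follows. This is an interpretation of the notes above rather than a reference implementation: the function name and the `initial_step_index` parameter are made up, the step index is updated by adding the index table value in both sample modes, and the clamp is applied to the accumulated predictor.

```python
STEP_TABLE = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767,
]
INDEX_TABLE_2BIT = [-1, 2, -1, 2]
INDEX_TABLE_4BIT = [-1, -1, -1, -1, 2, 4, 6, 8] * 2

def decode_adpcm(track_data: bytes, initial_step_index: int = 40) -> list:
    """Decode one KWZ audio track to a list of signed 16-bit PCM samples."""
    output = []
    predictor = 0
    step_index = initial_step_index
    for byte in track_data:
        bit_pos = 0
        while bit_pos < 8:
            step = STEP_TABLE[step_index]
            diff = step >> 3
            if step_index < 18 or bit_pos > 4:
                # 2-bit sample
                sample = byte & 0x3
                if sample & 1: diff += step
                if sample & 2: diff = -diff
                step_index += INDEX_TABLE_2BIT[sample]
                byte >>= 2
                bit_pos += 2
            else:
                # 4-bit sample
                sample = byte & 0xF
                if sample & 1: diff += step >> 2
                if sample & 2: diff += step >> 1
                if sample & 4: diff += step
                if sample & 8: diff = -diff
                step_index += INDEX_TABLE_4BIT[sample]
                byte >>= 4
                bit_pos += 4
            predictor += diff
            # clamp, then scale the 12-bit predictor up to 16 bits
            step_index = max(0, min(step_index, 79))
            predictor = max(-2048, min(predictor, 2047))
            output.append(predictor * 16)
    return output
```

Passing a different `initial_step_index` (for example 0) makes it easy to experiment with tracks that sound distorted with the default of 40.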

Extras

Nintendo DSi Library Conversions

The Nintendo DSi Library was a section of Flipnote Studio 3D's online services where users could view works from the DSiWare version of Flipnote Studio that had originally been uploaded to Flipnote Hatena. These notes have been converted from the PPM format to KWZ by Nintendo. Unfortunately, their converter was a bit buggy, so there are a few quirks in these notes:

Sometimes audio track data sounds extremely distorted, even when the Flipnote is played on a 3DS. This is because of a bug in Nintendo's conversion process that requires us to use a different initial step index for decoding those audio tracks. The initial step index values 0 or 40 are by far the most common, however any value between 0 and 40 may be correct in order to produce audio tracks with the least distortion.
Filenames stored in the KFH section may be a packed PPM filename instead of a valid KWZ one. This only happens in about 1/3 of the notes seen.
The first byte of the FSIDs stored in the KFH section will always be either 00, 10, 12 or 14 and the last byte will always be 00 (never displayed). The rest of the ID is the PPM FSID stored in reverse byte order. For example, the PPM FSID 13209B805109B9B8 would become something like 00B8B90951809B201300.
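
The FSID mapping described in the last point can be sketched in Python like this. The function names are hypothetical, and since the first byte varies between notes it is taken as a parameter that defaults to 00.

```python
def ppm_fsid_to_kwz(ppm_fsid_hex: str, first_byte: int = 0x00) -> str:
    """Build the 10-byte KWZ form of a PPM FSID (both as hex strings)."""
    ppm = bytes.fromhex(ppm_fsid_hex)
    # prefix byte, PPM FSID in reverse byte order, trailing null byte
    kwz = bytes([first_byte]) + ppm[::-1] + b"\x00"
    return kwz.hex().upper()

def kwz_fsid_to_ppm(kwz_fsid_hex: str) -> str:
    """Recover the PPM FSID from a DSi Library KWZ author ID."""
    kwz = bytes.fromhex(kwz_fsid_hex)
    return kwz[1:9][::-1].hex().upper()
```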

Comments

Hand-written comments on Flipnote Gallery World (the app's online service client) use a variant of the KWZ format with the extension .kwc. These comments do not have KTN or KSN sections, and can only ever have 1 frame.

Folder Icons

Icons used for SD card folders are also a variant of the KWZ format, which only use the KMC and KMI sections.

RSA Signature

The final 256 bytes of a KWZ file should consist of an SHA-256 RSA-2048 signature over the rest of the file.

The DER format private key for signing a KWZ file can be found as plaintext in memory or in the decompressed .code of the app. It will begin with the bytes 30 82 04 and end with the bytes E4 07 50, resulting in a total of 1,192 bytes overall. Its SHA-256 checksum should match E6892FF794E8A768C9ECC76152C4E72823514366B3A206298F5CB603D5EB797A.
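
The checks above are easy to automate. The Python sketch below, using only the standard library and hypothetical function names, separates the signature from the signed data and tests whether a candidate byte string matches the key description given here. It does not perform any signing or verification itself.

```python
import hashlib

KEY_SHA256 = "E6892FF794E8A768C9ECC76152C4E72823514366B3A206298F5CB603D5EB797A"

def split_signature(data: bytes):
    """Return (signed_data, signature) for a KWZ file held in memory."""
    return data[:-256], data[-256:]

def looks_like_kwz_private_key(candidate: bytes) -> bool:
    """Check length, leading and trailing bytes, and SHA-256 of a candidate key."""
    return (
        len(candidate) == 1192
        and candidate.startswith(bytes.fromhex("308204"))
        and candidate.endswith(bytes.fromhex("E40750"))
        and hashlib.sha256(candidate).hexdigest().upper() == KEY_SHA256
    )
```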

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.