Commit

Update README.md (#13)
* Update README.md

* Big Endian (#14)

* Create OEM_Doc.md

* PEP8 naming (#19)
Beakerboy committed Mar 7, 2023
1 parent 8e3fef9 commit 8c78155
Showing 10 changed files with 262 additions and 149 deletions.
25 changes: 24 additions & 1 deletion .github/workflows/python-package.yml
@@ -33,7 +33,7 @@ jobs:
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --show-source --statistics
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
@@ -42,3 +42,26 @@ jobs:
      run: |
        pytest --cov=ms_ovba_compression
        coveralls --service=github
  flake8:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10"]
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        if [ -f requirements_dev.txt ]; then pip install -r requirements_dev.txt; fi
        pip install -e .
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --show-source --statistics
77 changes: 77 additions & 0 deletions OEM_Doc.md
@@ -0,0 +1,77 @@
Page 110 of [MS OVBA](https://interoperability.blob.core.windows.net/files/MS-OVBA/%5bMS-OVBA%5d.pdf) provides this as the uncompressed input:

23 61 61 61 62 63 64 65 66 61 61 61 61 67 68 69
6a 61 61 61 61 61 6B 6C 61 61 61 6D 6E 6F 70 71
61 61 61 61 61 61 61 61 61 61 61 61 72 73 74 75
76 77 78 79 7A 61 61 61

And this as the compressed output:

01 2F B0 00 23 61 61 61 62 63 64 65 82 66 00 70
61 67 68 69 6A 01 38 08 61 6B 6C 00 30 6D 6E 6F
70 06 71 02 70 04 10 72 73 74 75 76 10 77 78 79
7A 00 3C

If we split the compressed output into token sequences and
byte-swap the little-endian encoded words, we get:

Container signature = 01
chunk header = B0 2F
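
As a sanity check, the chunk header can be unpacked in a few lines of Python. This is only a sketch based on my reading of the header layout in MS-OVBA (low 12 bits hold the chunk size minus 3, the next 3 bits are a signature, the top bit is the compressed flag). The name `parse_chunk_header` is made up for illustration and is not part of this package.

```python
def parse_chunk_header(raw: bytes) -> dict:
    """Unpack a two-byte, little-endian compressed chunk header."""
    header = int.from_bytes(raw, "little")
    return {
        "chunk_size": (header & 0x0FFF) + 3,  # low 12 bits store size - 3
        "signature": (header >> 12) & 0b111,  # expected to be 0b011
        "compressed": bool(header >> 15),     # top bit: 1 = compressed
    }


# Bytes as they appear in the stream (2F B0), i.e. 0xB02F once swapped.
fields = parse_chunk_header(bytes.fromhex("2FB0"))
print(fields)  # {'chunk_size': 50, 'signature': 3, 'compressed': True}
```

The chunk size of 50 plus the one-byte container signature accounts for all 51 bytes of the compressed output above.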

Token sequences. The first byte of each sequence is the flag byte.
The two-byte copy tokens are distinguished from literal tokens
with brackets.

1 = 00 23 61 61 61 62 63 64 65
2 = 82 66 [70 00] 61 67 68 69 6A [38 01]
3 = 08 61 6B 6C [30 00] 6D 6E 6F 70
4 = 06 71 [70 02] [10 04] 72 73 74 75 76
5 = 10 77 78 79 7A [3C 00]
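
The split above can be reproduced mechanically. The following is a rough sketch, assuming each flag bit (least significant first) marks the corresponding token as a literal (0) or a two-byte little-endian copy token (1). `split_token_sequences` is a hypothetical helper written for this note, not a function from the package.

```python
COMPRESSED = bytes.fromhex(
    "01 2F B0 00 23 61 61 61 62 63 64 65 82 66 00 70 "
    "61 67 68 69 6A 01 38 08 61 6B 6C 00 30 6D 6E 6F "
    "70 06 71 02 70 04 10 72 73 74 75 76 10 77 78 79 "
    "7A 00 3C"
)


def split_token_sequences(chunk_data: bytes) -> list:
    """Return a list of (flag_byte, tokens) pairs.

    Each token is ("literal", byte) or ("copy", 16-bit value).
    """
    sequences = []
    pos = 0
    while pos < len(chunk_data):
        flags = chunk_data[pos]
        pos += 1
        tokens = []
        for bit in range(8):
            if pos >= len(chunk_data):
                break  # the last sequence may hold fewer than 8 tokens
            if (flags >> bit) & 1:
                value = int.from_bytes(chunk_data[pos:pos + 2], "little")
                tokens.append(("copy", value))
                pos += 2
            else:
                tokens.append(("literal", chunk_data[pos]))
                pos += 1
        sequences.append((flags, tokens))
    return sequences


# Skip the container signature (1 byte) and the chunk header (2 bytes).
sequences = split_token_sequences(COMPRESSED[3:])
copy_tokens = [
    f"{value:04X}"
    for _, tokens in sequences
    for kind, value in tokens
    if kind == "copy"
]
print(copy_tokens)  # ['7000', '3801', '3000', '7002', '1004', '3C00']
```

The six values match the bracketed copy tokens listed above.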

The differences I’m seeing are at the copy token in the third
sequence, the second copy token in the fourth sequence, and
the final copy token.

I’ll work the 3rd token sequence by hand to demonstrate
what is expected given my understanding of the pseudo-code.

At this point in the token-sequence compression function, we
have the following state variables:

index=3
CompressedCurrent=26
DecompressedCurrent=24

Running through the matching algorithm, we set Candidate to
DecompressedCurrent - 1, so 23. The byte at 23 is 6C, while the byte
at 24 is 61, so we decrement Candidate. Candidate 22 (6B) fails to
match as well. At Candidate 21 the inner while loop iterates once,
giving a length of 1. We decrement Candidate again and again
find a match, this time of length 2, so this is the new BestLength.
We decrement Candidate again (to 19) and find a match of length 3.
We continue to decrement Candidate and find other length-3 matches,
but since the length is never greater than 3, BestCandidate
remains 19 and BestLength remains 3.

The offset is the difference between DecompressedCurrent and BestCandidate,
so 5, and the length is 3.
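
Here is the hand calculation as a small Python sketch. It follows my understanding of the pseudo-code, in particular the strict "greater than" comparison when updating the best match; the maximum-length cap is left out because it does not matter for a length of 3. `match_length` and `find_match` are illustrative names, not the package's API.

```python
DATA = bytes.fromhex(
    "23 61 61 61 62 63 64 65 66 61 61 61 61 67 68 69 "
    "6a 61 61 61 61 61 6B 6C 61 61 61 6D 6E 6F 70 71 "
    "61 61 61 61 61 61 61 61 61 61 61 61 72 73 74 75 "
    "76 77 78 79 7A 61 61 61"
)


def match_length(data: bytes, candidate: int, current: int) -> int:
    """Count how many bytes starting at candidate equal those at current."""
    length = 0
    while (current + length < len(data)
           and data[candidate + length] == data[current + length]):
        length += 1
    return length


def find_match(data: bytes, current: int, chunk_start: int = 0) -> tuple:
    """Scan candidates backwards from current - 1; return (offset, length)."""
    best_candidate, best_length = current, 0
    for candidate in range(current - 1, chunk_start - 1, -1):
        length = match_length(data, candidate, current)
        if length > best_length:  # strict: a tie keeps the earlier find
            best_candidate, best_length = candidate, length
    if best_length < 3:
        return 0, 0  # too short to be worth a copy token
    return current - best_candidate, best_length


print(find_match(DATA, 24))                               # (5, 3)
print([match_length(DATA, c, 24) for c in (19, 18, 17)])  # [3, 3, 3]
```

Candidates 19, 18 and 17 all give a length of 3, so which one wins depends entirely on how ties are handled.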

Now we call CopyTokenHelp:

difference = 24 - 0 = 24
BitCount = max(⌈log₂(24)⌉, 4) = max(⌈4.58⌉, 4) = 5
LengthMask = 0xFFFF >> 5 = 0x07FF
OffsetMask = ~0x07FF = 0xF800
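
The same numbers fall out of a direct Python transcription. This is a sketch of CopyTokenHelp as I understand it; `copy_token_help` is a hypothetical name, and I compute the ceiling of log₂ with `int.bit_length()` to stay in integer arithmetic.

```python
def copy_token_help(decompressed_current: int, chunk_start: int = 0) -> tuple:
    """Return (bit_count, length_mask, offset_mask, maximum_length)."""
    difference = decompressed_current - chunk_start
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    bit_count = max((difference - 1).bit_length(), 4)
    length_mask = 0xFFFF >> bit_count
    offset_mask = ~length_mask & 0xFFFF  # keep the result to 16 bits
    maximum_length = length_mask + 3
    return bit_count, length_mask, offset_mask, maximum_length


bit_count, length_mask, offset_mask, maximum_length = copy_token_help(24)
print(bit_count, hex(length_mask), hex(offset_mask))  # 5 0x7ff 0xf800
```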

And a call to PackCopyToken:

temp1 = Offset - 1 = 4
temp2 = 16 - BitCount = 11
temp3 = Length - 3 = 0
Token = (temp1 << temp2) | temp3
= (4 << 11) | 0 = 0x0004 << 11 = 0x2000
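
For reference, the packing step in Python, using a left shift as in the expanded arithmetic. `pack_copy_token` is my own illustrative name and the sketch assumes the offset and length have already been validated.

```python
def pack_copy_token(offset: int, length: int, bit_count: int) -> int:
    """Pack an (offset, length) pair into a 16-bit copy token."""
    temp1 = offset - 1      # offsets are stored minus 1
    temp2 = 16 - bit_count  # number of bits left for the length field
    temp3 = length - 3      # lengths are stored minus 3
    return (temp1 << temp2) | temp3


token = pack_copy_token(offset=5, length=3, bit_count=5)
print(hex(token))                        # 0x2000
print(token.to_bytes(2, "little").hex())  # 0020
```

The second line shows the bytes as they would appear in a little-endian stream.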

So the token should be 0x2000 instead of the 0x3000
that is published in the document. If we unpack 0x3000 we
get the same length, 3, but an offset of 7 instead of 5,
which points at candidate 17 rather than 19. For some
reason the matching algorithm is not returning the first
best candidate, but instead a different match.
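
To make the comparison concrete, here is a sketch of the inverse operation applied to both tokens. It assumes the same BitCount of 5 that was derived for DecompressedCurrent = 24; `unpack_copy_token` is a hypothetical helper, not the package's function.

```python
def unpack_copy_token(token: int, bit_count: int) -> tuple:
    """Split a 16-bit copy token into (offset, length)."""
    length_mask = 0xFFFF >> bit_count
    offset_mask = ~length_mask & 0xFFFF
    length = (token & length_mask) + 3
    offset = ((token & offset_mask) >> (16 - bit_count)) + 1
    return offset, length


print(unpack_copy_token(0x2000, 5))  # (5, 3): what I calculate
print(unpack_copy_token(0x3000, 5))  # (7, 3): what the document publishes
```

With DecompressedCurrent = 24, an offset of 7 corresponds to candidate 17, which is one of the other length-3 matches found during the scan.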
10 changes: 3 additions & 7 deletions README.md
@@ -29,7 +29,7 @@ from ms_ovba_compression.ms_ovba import MsOvba

# returns b'\x01\x19°\x00abcdefgh\x00ijklmnop\x00qrstuv.'
input = b'abcdefghijklmnopqrstuv.'
ms_ovba = MsOVBA()
ms_ovba = MsOvba()
ms_ovba.compress(input)

# returns b'#aaabcdefaaaaghijaaaaaklaaamnopqaaaaaaaaaaaarstuvwxyzaaa'
@@ -38,13 +38,9 @@ compressed = b'\x01/°\x00#aaabcde²f\x00paghij\x018\x08akl\x000mnop\x06q\x02p\x
ms_ovba.decompress(compressed)

```
The objects can be initialized to indicate the endianness if the default little-endian is not desired. However, having never seen real world big-endian packed data
means this feature is untested.
The objects can be initialized to indicate the endianness if the default little-endian is not desired.
```python
# unsure if it should return:
# b'\x01°\x19\x00abcdefgh\x00ijklmnop\x00qrstuv.'
# or
# b'\x01\x01—\x00abcdefgh\x00ijklmnop\x00qrstuv.'
# returns b'\x01°\x19\x00abcdefgh\x00ijklmnop\x00qrstuv.'
input = b'abcdefghijklmnopqrstuv.'
ms_ovba = MsOvba("big")
ms_ovba.compress(input)
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "ms_ovba_compression"
version = "0.1.1"
version = "0.2.0"
authors = [
{ name="Kevin Nowaczyk", email="beakerboy99@yahoo.com" },
]
1 change: 1 addition & 0 deletions requirements_dev.txt
@@ -1,3 +1,4 @@
pytest
pytest-cov
coveralls
pep8-naming