Update README.md (#13)

* Update README.md
* Big Endian (#14)
* Create OEM_Doc.md
* PEP8 naming (#19)
Beakerboy · Mar 7, 2023 · 8c78155
1 parent 8e3fef9
commit 8c78155
Showing 10 changed files with 262 additions and 149 deletions.
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -33,7 +33,7 @@ jobs:
     - name: Lint with flake8
       run: |
         # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --show-source --statistics
+        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
         # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
         flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
     - name: Test with pytest
@@ -42,3 +42,26 @@ jobs:
       run: |
         pytest --cov=ms_ovba_compression
         coveralls --service=github
+  flake8:
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10"]
+    steps:
+    - uses: actions/checkout@v3
+    - name: Set up Python ${{ matrix.python-version }}
+      uses: actions/setup-python@v3
+      with:
+        python-version: ${{ matrix.python-version }}
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        python -m pip install flake8 pytest
+        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        if [ -f requirements_dev.txt ]; then pip install -r requirements_dev.txt; fi
+        pip install -e .
+    - name: Lint with flake8
+      run: |
+        # fail the job on any flake8 violation, including the pep8-naming checks
+        flake8 . --count --show-source --statistics
diff --git a/OEM_Doc.md b/OEM_Doc.md
@@ -0,0 +1,77 @@
+Page 110 of [MS OVBA](https://interoperability.blob.core.windows.net/files/MS-OVBA/%5bMS-OVBA%5d.pdf) provides this as the uncompressed input:
+
+    23 61 61 61 62 63 64 65 66 61 61 61 61 67 68 69
+    6a 61 61 61 61 61 6B 6C 61 61 61 6D 6E 6F 70 71
+    61 61 61 61 61 61 61 61 61 61 61 61 72 73 74 75
+    76 77 78 79 7A 61 61 61
+
+And this as the compressed output:
+
+    01 2F B0 00 23 61 61 61 62 63 64 65 82 66 00 70
+    61 67 68 69 6A 01 38 08 61 6B 6C 00 30 6D 6E 6F
+    70 06 71 02 70 04 10 72 73 74 75 76 10 77 78 79
+    7A 00 3C
+
+Splitting the compressed output into token sequences and
+swapping the bytes of each little-endian word gives:
+
+    Container signature = 01
+    chunk header = B0 2F
+
+Token sequences. The two-byte copy tokens are distinguished
+from literal tokens with brackets.
+
+    1 = 00 23 61 61 61 62 63 64 65
+    2 = 82 66 [70 00] 61 67 68 69 6A [38 01]
+    3 = 08 61 6B 6C [30 00] 6D 6E 6F 70
+    4 = 06 71 [70 02] [10 04] 72 73 74 75 76
+    5 = 10 77 78 79 7A [3C 00]
+
+The differences I’m seeing are at the copy token in the third
+sequence, the second copy token in the fourth sequence, and
+the final copy token.
+
+I’ll work the third token sequence by hand to demonstrate
+what is expected given my understanding of the pseudo-code.
+
+At this point in the function that compresses a token sequence, we
+have the following state variables:
+
+    index=3
+    compressedCurrent=26
+    decompressedCurrent=24
+
+Running through the matching algorithm, we set Candidate to
+DecompressedCurrent - 1, so 23. The byte at 23 is 6C, while the byte
+at 24 is 61, so we decrement Candidate. The same occurs until
+Candidate is 21. At this point the inner while loop iterates once
+to give us a length of 1. We decrement Candidate again and again
+find a match, this time of length 2, so this is the new BestLength.
+We decrement Candidate again (to 19) and find a match of length 3.
+We continue to decrement Candidate and find other length 3 matches,
+but since the length is never greater than 3, BestCandidate
+remains 19 and BestLength remains 3.
+
+The offset is the difference between DecompressedCurrent and BestCandidate,
+so 24 - 19 = 5, and the length is 3.
+
+Now we call CopyTokenHelp:
+
+    difference = 24 - 0 = 24
+    BitCount = max(⌈log₂(24)⌉, 4) = max(⌈4.58⌉, 4) = 5
+    LengthMask = 0xFFFF >> 5 = 0x07FF
+    OffsetMask = ~0x07FF = 0xF800
+
+And a call to PackCopyToken:
+
+    temp1 = Offset - 1 = 4
+    temp2 = 16 - BitCount = 11
+    temp3 = Length - 3 = 0
+    Token = (temp1 << temp2) | temp3
+          = (4 << 11) | 0 = 0x2000
+
+So the token should be 0x2000 instead of the 0x3000
+that is published in the document. If we unpack 0x3000 we
+get the same length, 3, but a different offset, 7. For some
+reason the matching algorithm behind the published output is not
+returning the first best candidate, but instead a different match.
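
The hand calculation in OEM_Doc.md can be reproduced with a short Python sketch. This is not the code from this repository: `find_match`, `bit_count`, `pack_copy_token`, and `unpack_copy_token` are hypothetical helper names, and the matcher follows the reading used in the walkthrough, where a candidate only replaces the current best when its match is strictly longer. Clamping to the maximum copy length and to the chunk end is left out, since neither limit is reached in this example.

```python
# Uncompressed input, copied from the hex dump in OEM_Doc.md.
DATA = bytes.fromhex(
    "23 61 61 61 62 63 64 65 66 61 61 61 61 67 68 69"
    "6a 61 61 61 61 61 6B 6C 61 61 61 6D 6E 6F 70 71"
    "61 61 61 61 61 61 61 61 61 61 61 61 72 73 74 75"
    "76 77 78 79 7A 61 61 61"
)


def find_match(data, chunk_start, current):
    """Return (offset, length) of the best earlier match for data[current:].

    Assumption: candidates are scanned from current - 1 down to chunk_start,
    and a later candidate wins only if its match is strictly longer.
    """
    best_length, best_candidate = 0, 0
    candidate = current - 1
    while candidate >= chunk_start:
        c, d, length = candidate, current, 0
        while d < len(data) and data[d] == data[c]:
            length += 1
            c += 1
            d += 1
        if length > best_length:
            best_length, best_candidate = length, candidate
        candidate -= 1
    if best_length < 3:
        return 0, 0  # too short to be worth a copy token
    return current - best_candidate, best_length


def bit_count(chunk_start, current):
    """Offset bit width: max(ceil(log2(difference)), 4), in integer arithmetic."""
    difference = current - chunk_start
    return max((difference - 1).bit_length(), 4)


def pack_copy_token(offset, length, bits):
    """Pack offset and length into a 16-bit copy token."""
    return ((offset - 1) << (16 - bits)) | (length - 3)


def unpack_copy_token(token, bits):
    """Inverse of pack_copy_token: return (offset, length)."""
    length_mask = 0xFFFF >> bits
    return (token >> (16 - bits)) + 1, (token & length_mask) + 3


offset, length = find_match(DATA, 0, 24)
bits = bit_count(0, 24)
token = pack_copy_token(offset, length, bits)
print(offset, length, bits, hex(token))
print(unpack_copy_token(0x3000, bits))
```

Under these assumptions the sketch yields offset 5, length 3, and token 0x2000, while unpacking the published 0x3000 yields offset 7, length 3.
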
diff --git a/README.md b/README.md
@@ -29,7 +29,7 @@ from ms_ovba_compression.ms_ovba import MsOvba
 
 # returns b'\x01\x19\xb0\x00abcdefgh\x00ijklmnop\x00qrstuv.'
 input = b'abcdefghijklmnopqrstuv.'
-ms_ovba = MsOVBA()
+ms_ovba = MsOvba()
 ms_ovba.compress(input)
 
 # returns b'#aaabcdefaaaaghijaaaaaklaaamnopqaaaaaaaaaaaarstuvwxyzaaa'
@@ -38,13 +38,9 @@ compressed = b'\x01/°\x00#aaabcde²f\x00paghij\x018\x08akl\x000mnop\x06q\x02p\x
 ms_ovba.decompress(compressed)
 
 ```
-The objects can be initialized to indicate the endianness if the default little-endian is not desired. However, having never seen real world big-endian packed data
-means this feature is untested.
+The objects can be initialized to indicate the endianness if the default little-endian is not desired.
 ```python
-# unsure if it should return:
-# b'\x01°\x19\x00abcdefgh\x00ijklmnop\x00qrstuv.'
-# or
-# b'\x01\x01—\x00abcdefgh\x00ijklmnop\x00qrstuv.'
+# returns b'\x01\xb0\x19\x00abcdefgh\x00ijklmnop\x00qrstuv.'
 input = b'abcdefghijklmnopqrstuv.'
 ms_ovba = MsOvba("big")
 ms_ovba.compress(input)
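
The little-endian and big-endian outputs in the README example differ only in the byte order of the two-byte chunk header, because this input produces no copy tokens. Below is a minimal sketch of that header, assuming the layout described in MS-OVBA: the compressed flag in the top bit, the 0b011 signature in the next three bits, and the chunk size minus 3 in the low 12 bits. `chunk_header` is a hypothetical helper, not part of `ms_ovba_compression`.

```python
def chunk_header(chunk_size, byteorder="little"):
    """Pack the header of a compressed chunk that is chunk_size bytes long.

    chunk_size counts the two header bytes themselves (assumption based on
    the MS-OVBA layout: the 12-bit size field stores the size minus 3).
    """
    size_field = chunk_size - 3
    header = 0x8000 | (0b011 << 12) | size_field
    return header.to_bytes(2, byteorder)


# README example: 2 header bytes + 3 flag bytes + 23 literal bytes = 28.
print(chunk_header(28, "little"))  # bytes 19 B0
print(chunk_header(28, "big"))     # bytes B0 19

# OEM_Doc.md example: 51-byte container minus the signature byte = 50.
print(chunk_header(50, "little"))  # bytes 2F B0
```
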

diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "ms_ovba_compression"
-version = "0.1.1"
+version = "0.2.0"
 authors = [
   { name="Kevin Nowaczyk", email="beakerboy99@yahoo.com" },
 ]

diff --git a/requirements_dev.txt b/requirements_dev.txt
@@ -1,3 +1,4 @@
 pytest
 pytest-cov
 coveralls
+pep8-naming