✨ Implement simple base class for custom encoding implementations #181

Merged
merged 15 commits into from Mar 4, 2023

Conversation

pleonex
Member

@pleonex pleonex commented Feb 26, 2022

Description

Add a simple abstract class for implementing custom encodings.
It only requires providing the encode and decode methods, based on the high-performance Span types, and it provides helper methods to report invalid chars / bytes.
The byte/char count methods reuse the encode and decode methods, as the implementation is typically very similar and performance is not affected.
This class is already used in Metatron to implement the Persona encoding.
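
One way the count methods can reuse the encode routine is to run it against a writer that only counts what it is given. The following is a minimal, self-contained sketch of that idea, not the code in this PR: `CountingWriter` and `ToyEncoding` are hypothetical names, the toy byte layout is made up for illustration, and the real `SpanStream<T>` may work differently.

```csharp
using System;

// Hypothetical stand-in for a span-based output stream: it writes into the
// destination span, or only counts when countOnly is set.
public ref struct CountingWriter
{
    private readonly Span<byte> destination;
    private readonly bool countOnly;

    public CountingWriter(Span<byte> destination, bool countOnly)
    {
        this.destination = destination;
        this.countOnly = countOnly;
        Length = 0;
    }

    // Number of bytes written (or counted) so far.
    public int Length { get; private set; }

    public void Write(byte value)
    {
        if (!countOnly) {
            destination[Length] = value;
        }

        Length++;
    }
}

public static class ToyEncoding
{
    // Counting pass: runs Encode without touching any output buffer.
    public static int GetByteCount(ReadOnlySpan<char> chars)
    {
        var writer = new CountingWriter(Span<byte>.Empty, countOnly: true);
        Encode(chars, ref writer);
        return writer.Length;
    }

    // Real pass: runs the same Encode against the caller's buffer.
    public static int GetBytes(ReadOnlySpan<char> chars, Span<byte> bytes)
    {
        var writer = new CountingWriter(bytes, countOnly: false);
        Encode(chars, ref writer);
        return writer.Length;
    }

    // Single encode routine shared by both entry points.
    // Toy layout (an assumption): ASCII as one byte, anything else as two.
    private static void Encode(ReadOnlySpan<char> chars, ref CountingWriter writer)
    {
        foreach (char c in chars) {
            if (c < 0x80) {
                writer.Write((byte)c);
            } else {
                writer.Write((byte)(c >> 8));
                writer.Write((byte)c);
            }
        }
    }
}
```

With this shape, the encoding logic lives in one place and the count can never drift from what the encoder actually produces.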

Also refactor the performance test app so that each test can be run individually.

  • Implement base class
  • Compare performance
  • Add tests

Performance comparison

The speed of a custom implementation built directly on Encoding and one built on this class is practically the same. Thanks to the advanced use of Span, which this class simplifies, it allocates about 2.5 to 3 times less memory, the same amount as the standard .NET implementations.
In this case we compare a custom Shift-JIS implementation with the .NET implementation. Both custom versions are slower than the encoding provided by .NET, which applies low-level optimizations with binary dictionaries. Our implementation is a typical use case, with conditions that follow the spec.
The numbers are good for a worst-case input of 5 MB of text (about 0.5 s), the memory usage is great, and the implementation is very simple.

BenchmarkDotNet=v0.13.1, OS=fedora 34
Intel Core i7-4720HQ CPU 2.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.102
  [Host]     : .NET 6.0.2 (6.0.222.11401), X64 RyuJIT
  DefaultJob : .NET 6.0.2 (6.0.222.11401), X64 RyuJIT

| Method | TextLength | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| Sjis | 280 | 3.428 μs | 0.0336 μs | 0.0314 μs | 0.19 | 0.00 | 0.3662 | - | - | 1 KB |
| SjisCustomEncoding | 280 | 18.076 μs | 0.1248 μs | 0.1167 μs | 1.00 | 0.00 | 0.9460 | - | - | 3 KB |
| SjisCustomYarhlEncoding | 280 | 19.227 μs | 0.1089 μs | 0.1019 μs | 1.06 | 0.01 | 0.3967 | - | - | 1 KB |
| Sjis | 5242880 | 59,626.926 μs | 430.4532 μs | 402.6462 μs | 0.14 | 0.00 | 222.2222 | 222.2222 | 222.2222 | 20,225 KB |
| SjisCustomEncoding | 5242880 | 413,645.574 μs | 3,893.8989 μs | 3,642.3555 μs | 1.00 | 0.00 | - | - | - | 50,950 KB |
| SjisCustomYarhlEncoding | 5242880 | 462,457.768 μs | 8,352.3387 μs | 7,812.7828 μs | 1.12 | 0.02 | - | - | - | 20,227 KB |

Example

private sealed class CustomSjisYarhlEncoding : SimpleSpanEncoding
{
    private readonly Dictionary<int, int> codeToUnicode;
    private readonly Dictionary<int, int> unicodeToCode;

    public CustomSjisYarhlEncoding(Dictionary<int, int> codeToUnicode, Dictionary<int, int> unicodeToCode)
        : base(0, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)
    {
        this.codeToUnicode = codeToUnicode;
        this.unicodeToCode = unicodeToCode;
    }

    public override string EncodingName => "sjis-yarhl";

    public override int GetMaxByteCount(int charCount) => charCount * 2;

    public override int GetMaxCharCount(int byteCount) => byteCount;

    protected override void Decode(ReadOnlySpan<byte> bytes, SpanStream<char> buffer)
    {
        byte lead = 0x00;
        int count = bytes.Length;
        for (int i = 0; i < count; i++) {
            int codePoint = -1;
            byte current = bytes[i];

            if (lead != 0x00) {
                int offset = (current < 0x7F) ? 0x40 : 0x41;
                int leadOffset = (lead < 0xA0) ? 0x81 : 0xC1;

                bool inRange1 = current is >= 0x40 and <= 0x7E;
                bool inRange2 = current is >= 0x80 and <= 0xFC;
                if (!inRange1 && !inRange2) {
                    DecodeUnknownBytes(buffer, i, current);
                }

                int pointer = ((lead - leadOffset) * 188) + current - offset;
                if (pointer is >= 8836 and <= 10715) {
                    codePoint = 0xE000 - 8836 + pointer;
                } else {
                    if (!codeToUnicode.TryGetValue(pointer, out codePoint)) {
                        DecodeUnknownBytes(buffer, i, current);
                    }
                }

                lead = 0x00;
            } else if (current == 0x5C) {
                codePoint = 0x00A5; // yen
            } else if (current == 0x7E) {
                codePoint = 0x203E; // overline
            } else if (current < 0x80) {
                codePoint = current;
            } else if (current is >= 0xA1 and <= 0xDF) {
                codePoint = 0xFF61 - 0xA1 + current;
            } else if (current is (>= 0x81 and <= 0x9F) or (>= 0xE0 and <= 0xFC)) {
                lead = current;
            } else {
                throw new FormatException();
            }

            if (codePoint != -1) {
                buffer.Write((char)codePoint);
            }
        }

        // A lead byte still pending at the end of the input is an incomplete sequence.
        if (lead != 0x00) {
            DecodeUnknownBytes(buffer, count - 2, lead);
        }
    }

    protected override void Encode(ReadOnlySpan<char> chars, SpanStream<byte> buffer, bool isFallbackText = false)
    {
        int count = chars.Length;
        for (int i = 0; i < count; i++) {
            ushort codePoint = chars[i];

            if (codePoint == 0x00A5) {
                buffer.Write(0x5C);
            } else if (codePoint == 0x203E) {
                buffer.Write(0x7E);
            } else if (codePoint < 0x80) {
                buffer.Write((byte)codePoint);
            } else if (codePoint is >= 0xFF61 and <= 0xFF9F) {
                buffer.Write((byte)(codePoint - 0xFF61 + 0xA1));
            } else {
                if (codePoint == 0x2212) {
                    codePoint = 0xFF0D;
                }

                if (!unicodeToCode.TryGetValue(codePoint, out int code)) {
                    EncodeUnknownChar(buffer, codePoint, i, isFallbackText);
                    continue; // the char was reported; do not also write bytes for code 0
                }

                int lead = code / 188;
                int leadOffset = (lead < 0x1F) ? 0x81 : 0xC1;
                int trail = code % 188;
                int offset = (trail < 0x3F) ? 0x40 : 0x41;
                buffer.Write((byte)(lead + leadOffset));
                buffer.Write((byte)(trail + offset));
            }
        }
    }
}
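
The lead/trail byte arithmetic in the example above can be checked in isolation. This is a small sketch that copies the same formulas into two helper methods; `SjisPointer`, `ToBytes` and `FromBytes` are hypothetical names introduced here for illustration and are not part of the PR.

```csharp
using System;

public static class SjisPointer
{
    // Same arithmetic as the Encode method: table pointer -> two bytes.
    public static (byte Lead, byte Trail) ToBytes(int pointer)
    {
        int lead = pointer / 188;
        int leadOffset = (lead < 0x1F) ? 0x81 : 0xC1;
        int trail = pointer % 188;
        int offset = (trail < 0x3F) ? 0x40 : 0x41;
        return ((byte)(lead + leadOffset), (byte)(trail + offset));
    }

    // Same arithmetic as the Decode method: two bytes -> table pointer.
    public static int FromBytes(byte lead, byte trail)
    {
        int leadOffset = (lead < 0xA0) ? 0x81 : 0xC1;
        int offset = (trail < 0x7F) ? 0x40 : 0x41;
        return ((lead - leadOffset) * 188) + trail - offset;
    }
}
```

For example, `SjisPointer.ToBytes(283)` yields the bytes 0x82 0xA0, the Shift-JIS code for あ, and `SjisPointer.FromBytes(0x82, 0xA0)` returns 283 again. The trail byte skips 0x7F, which is why the offset changes from 0x40 to 0x41 at that point.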

@pleonex pleonex added this to the vNext milestone Feb 26, 2022
@pleonex pleonex self-assigned this Feb 26, 2022
@pleonex pleonex marked this pull request as ready for review March 4, 2023 12:13
@pleonex pleonex merged commit 330f35b into develop Mar 4, 2023
5 checks passed
@pleonex pleonex deleted the feature/encoding-span-again branch March 4, 2023 12:17