#### LZ codes

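An LZ code maps variable-length strings of input symbols to codewords of fixed (or predictable) length. The input strings are chosen so that they occur with nearly equal probability, so a string made of frequently occurring symbols contains more symbols than a string made of rare ones.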

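The compression algorithm is adaptive: it scans the data only once, needs no prior information about the data statistics, and runs in time proportional to the length of the message.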

#### LZW algorithm

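Build a translation table (the string table) with the prefix property: if the string ωK, formed from some string ω and a single character K, is in the table, then ω is in the table as well.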
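The prefix property can be checked mechanically. The following is a minimal sketch, not part of the original notebook: `has_prefix_property` is a hypothetical helper name, and it assumes ASCII strings so that `s[1:end-1]` is a valid slice.

```julia
# Hypothetical helper: true if every multi-character entry of `table`
# also has its prefix (the entry without its last character) in `table`.
# Assumes ASCII entries, so byte indexing with s[1:end-1] is safe.
function has_prefix_property(table::Vector{String})
    for s in table
        length(s) == 1 && continue      # single characters have no proper prefix to check
        prefix = s[1:end-1]
        (prefix in table) || return false
    end
    return true
end

@show has_prefix_property(["a", "b", "c", "ab", "ba", "abc"]);
@show has_prefix_property(["a", "b", "abc"]);   # "ab" is missing
```

Because LZW only ever adds ωK when ω is already in the table, every table it builds should pass this check.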

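Procedure

Initialization: put all single-character strings into the string table

    read the first input character → prefix string ω

Step: read the next input character K

    if there is no K (input exhausted):

        code(ω) → output; stop

    if ωK is already in the string table:

        ωK → ω; Repeat Step

    else (ωK is not in the string table):

        code(ω) → output

        ωK → string table;

        K → ω; Repeat Step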

In [1]:
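# Example: LZW-encode a string over a three-letter alphabet into 4-bit codewords
# (4 bits suffice here because the table never grows beyond 15 entries)
str = "ababcbababaaaaaaa";
maplist = ["a","b","c"];   # string table, initialized with all single-character strings
ω = str[1];                # current prefix string
cstr = "";                 # encoded output as a bit string
for i in str[2:end]
    K = i;
    ωK = string(ω, K);
    if ωK in maplist
        ω = ωK;            # ωK is known: keep extending the prefix
    else
        # emit the codeword (table index) of ω as 4 binary digits
        cstr = string(cstr, string(findfirst(maplist .== "$ω"),base=2,pad=4));
        push!(maplist, ωK);  # add the new string to the table
        ω = K;
    end
end
# input exhausted: emit the codeword of the remaining prefix
cstr = string(cstr, string(findfirst(maplist .== "$ω"),base=2,pad=4));
@show maplist;
@show cstr;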

maplist = ["a", "b", "c", "ab", "ba", "abc", "cb", "bab", "baba", "aa", "aaa", "aaaa"]
cstr = "0001001001000011010110000001101010110001"
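To see the "variable-length string to fixed-length codeword" mapping directly, the bit string can be cut into 4-bit groups and each group looked up in the final table. This is only an illustrative sketch: `final_table` and `bits` are hypothetical names holding copies of the output above, and a real decoder would not have the final table in advance.

```julia
# Copies of the encoder output shown above
final_table = ["a", "b", "c", "ab", "ba", "abc", "cb", "bab", "baba", "aa", "aaa", "aaaa"]
bits = "0001001001000011010110000001101010110001"

# Cut the bit string into 4-bit codewords and look each one up in the table
phrases = [final_table[parse(Int, bits[i:i+3], base=2)] for i in 1:4:length(bits)]
@show phrases;
@show join(phrases);
```

Each codeword has the same width, while the phrases it stands for range from one to three characters; later phrases tend to be longer because the table has had time to grow.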


In [37]:
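# For short messages the compression is poor: the message must be long enough for the algorithm to accumulate knowledge of symbol frequencies.
# Note: this compares 17 input symbols with 40 output bits; even at 2 bits per symbol (34 bits for a three-letter alphabet) the LZW output is longer.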
length(str) < length(cstr)

true
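The effect of message length can be explored with a reusable encoder. The sketch below is not from the original notebook: `lzw_encode` is a hypothetical function name, it returns integer codes instead of a bit string, and it uses a `Dict` for table lookup. The codeword width is taken as the number of binary digits of the largest code that was emitted, which is a simplifying assumption.

```julia
# Hypothetical helper: LZW-encode `s` over a single-character `alphabet`,
# returning the list of integer codewords (1-based table indices).
function lzw_encode(s::AbstractString, alphabet::Vector{String})
    table = Dict(sym => i for (i, sym) in enumerate(alphabet))
    codes = Int[]
    isempty(s) && return codes
    ω = string(first(s))                 # current prefix string
    for K in Iterators.drop(s, 1)        # remaining characters, one at a time
        ωK = string(ω, K)
        if haskey(table, ωK)
            ω = ωK                       # keep extending the prefix
        else
            push!(codes, table[ω])       # emit the code of the known prefix
            table[ωK] = length(table) + 1
            ω = string(K)
        end
    end
    push!(codes, table[ω])               # flush the last prefix
    return codes
end

for msg in ("ababcbababaaaaaaa", "ab"^500)
    codes = lzw_encode(msg, ["a", "b", "c"])
    nbits = ndigits(maximum(codes), base=2)   # assumed fixed codeword width
    println(length(msg), " symbols -> ", length(codes), " codewords of ", nbits, " bits each")
end
```

On a long, highly repetitive message the number of codewords grows much more slowly than the number of input symbols, which is where the compression comes from.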

In [43]:
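# Another form of the string table: each new string is stored as the identifier of its prefix
# followed by the extension character. The advantage is that every added entry has the same length.
str = "ababcbababaaaaaaa";
maplist = ["a","b","c"];
ω = str[1];
cstr = "";
for i in str[2:end]
    K = i;
    # ω is kept in the same form as the table entries, so it can be looked up directly
    idx = string(findfirst(maplist .== "$ω"),base=2,pad=4);
    ωK = "$idx$K";         # 4-bit prefix identifier + extension character
    if ωK in maplist
        ω = ωK;
    else
        cstr = string(cstr, idx);   # emit the codeword of the prefix
        push!(maplist, ωK);
        ω = K;
    end
end
# input exhausted: emit the codeword of the remaining prefix
cstr = string(cstr, string(findfirst(maplist .== "$ω"),base=2,pad=4));
@show maplist;
@show cstr;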

maplist = ["a", "b", "c", "0001b", "0010a", "0100c", "0011b", "0101b", "1000a", "0001a", "1010a", "1011a"]
cstr = "0001001001000011010110000001101010110001"
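An entry in this compact form can be turned back into its full string by following the prefix identifiers recursively. The following is a small sketch under the assumptions of this example (4-bit identifiers, single ASCII extension character); `expand` is a hypothetical helper name.

```julia
# Table in "prefix identifier + extension character" form, copied from the output above
table = ["a", "b", "c", "0001b", "0010a", "0100c", "0011b", "0101b", "1000a", "0001a", "1010a", "1011a"]

# Hypothetical helper: recover the full string represented by entry `i`.
# Single-character entries are returned as they are; longer entries are
# "4-bit prefix index" followed by one extension character.
function expand(table, i)
    entry = table[i]
    length(entry) == 1 && return entry
    prefix = parse(Int, entry[1:4], base=2)
    return expand(table, prefix) * entry[end:end]
end

expanded = [expand(table, i) for i in eachindex(table)]
@show expanded;
```

The expanded entries should coincide with the plain string table built by the first encoder, which is why both variants produce the same codeword stream.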


LZW decoding: the key is to rebuild the dictionary that was used during encoding; see [link](https://blog.csdn.net/hanzhen7541/article/details/91141112).

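Procedure

Initialization: put all known codewords (the single-character strings) into the string table

    read the first input codeword → c; output string(c)

Step: c → p

    read the next input codeword c

    if there is no c (input exhausted):

        stop

    if c is already in the string table:

        output string(c)

        string(p) → P; string(c) → C

        add P+C[1] to the dictionary; Repeat Step

    else (c is not in the string table):

        string(p) → P; string(p) → C

        add P+C[1] to the dictionary; output P+C[1]; Repeat Step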

In [2]:
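rmaplist = ["a","b","c"];                # the decoder's table starts from the same single-character strings
c = parse(Int,cstr[1:4],base=2);         # first codeword
str2 = rmaplist[c];                      # decoded output
N = length(cstr)÷4;                      # number of 4-bit codewords
for i in 2:N
    p = c;                               # previous codeword
    c = parse(Int,cstr[4*i-3:4*i],base=2);
    if c <= length(rmaplist)
        # c is already in the table: output its string, then add P + first character of C
        P = rmaplist[p]; C = rmaplist[c];
        str2 = "$str2$C";
        push!(rmaplist,string(P,C[1]));
    else
        # c is not in the table yet: it must stand for P + first character of P
        P = rmaplist[p]; C = P[1];
        str2 = "$str2$P$C";
        push!(rmaplist,string(P,C));
    end
end
@show rmaplist;
@show str2;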

rmaplist = ["a", "b", "c", "ab", "ba", "abc", "cb", "bab", "baba", "aa", "aaa", "aaaa"]
str2 = "ababcbababaaaaaaa"


In [3]:
isequal(str,str2)

true
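The case where a codeword is not yet in the decoder's table arises when the encoder uses an entry immediately after creating it. The shortest instance with this alphabet is "aaa", which encodes to the codes 1 and 4. The sketch below is not from the original notebook: `lzw_decode` is a hypothetical function that works on integer codes instead of the 4-bit string and assumes a non-empty ASCII alphabet.

```julia
# Hypothetical helper: decode a list of integer LZW codewords,
# rebuilding the string table on the fly.
function lzw_decode(codes::Vector{Int}, alphabet::Vector{String})
    table = copy(alphabet)
    isempty(codes) && return ""
    out = IOBuffer()
    prev = table[codes[1]]
    print(out, prev)
    for c in codes[2:end]
        if c <= length(table)
            cur = table[c]                 # ordinary case: codeword already known
        elseif c == length(table) + 1
            cur = prev * prev[1]           # special case: entry was created one step earlier by the encoder
        else
            error("invalid codeword $c")
        end
        print(out, cur)
        push!(table, prev * cur[1])        # same entry the encoder added at this step
        prev = cur
    end
    return String(take!(out))
end

@show lzw_decode([1, 4], ["a", "b", "c"]);   # code 4 is unknown when it arrives
@show lzw_decode([1, 2, 4, 3, 5, 8, 1, 10, 11, 1], ["a", "b", "c"]);
```

In the special case the new table entry and the decoded string are the same, which matches the `else` branch of the procedure above.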