# Lempel-Ziv Complexity

## Definition
The Lempel-Ziv complexity of a binary sequence is the number of distinct substrings (or sub-words) encountered as the sequence is read as a stream, from left to right [1]. It is related to Kolmogorov complexity.
The more a stream of data repeats itself, the lower its Lempel-Ziv complexity.

## Algorithm
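The algorithm is straightforward. For each new component it tries every earlier starting position, so the number of comparisons can grow up to O(n²) in the worst case rather than O(n). Here is the pseudocode: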
```
S := a binary sequence of size n, indexed from 1 (S[1] .. S[n])
i := 0    // i is the current pointer into the prefix
C := 1    // C is the Lempel-Ziv complexity, incremented iteratively
u := 1    // u is the length of the current prefix
v := 1    // v is the length of the current component for the current pointer i
vmax := v // vmax is the final length used for the current component (largest over all possible pointers i)

while u + v <= n do
   if S[i + v] = S[u + v] then
      v := v + 1
   else
      vmax := max(v, vmax)
      i := i + 1
      if i = u then  // all the pointers have been treated
         C := C + 1
         u := u + vmax
         v := 1
         i := 0
         vmax := v
      else
         v := 1
      end if
   end if
end while
if v != 1 then
   C := C + 1
end if
```

The following Python implementation of the Lempel-Ziv complexity uses more descriptive variable names:


```python
def complexityLempelZiv(stream):
    """Return the Lempel-Ziv complexity of a sequence of symbols."""
    # Variable initialization
    complexity = 1
    prefix_length = 1
    length_component = 1
    max_length_component = 1
    pointer = 0

    # Continue until the whole stream has been parsed
    while prefix_length + length_component <= len(stream):

        # For the current pointer, extend the component while it can be copied from the prefix
        if stream[pointer + length_component - 1] == stream[prefix_length + length_component - 1]:
            length_component += 1  # increase the length of the component
        else:
            max_length_component = max(length_component, max_length_component)
            pointer += 1

            # All the pointers have been investigated: pick the largest component for the jump
            if pointer == prefix_length:
                # Increase the complexity
                complexity += 1
                # Increase the prefix length by the maximum component size found so far
                prefix_length += max_length_component

                # Reset the variables
                pointer = 0
                max_length_component = 1

            # Reset the length of the component
            length_component = 1

    # Count the final component if the stream ended in the middle of it
    if length_component != 1:
        complexity += 1

    return complexity
```


```python
# Example with a complex binary stream
stream = [1,0,0,1,1,1,1,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0]
print("Complex binary stream complexity =", complexityLempelZiv(stream))
# Example with a non-complex binary stream
stream = [1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0]
print("Non-Complex binary stream complexity =", complexityLempelZiv(stream))
```

Output:

```
Complex binary stream complexity = 8
Non-Complex binary stream complexity = 3
```


## How does it work?
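The algorithm looks for what is called the exhaustive history of the stream of data. The number of components in this exhaustive history is the Lempel-Ziv complexity, which is therefore always greater than or equal to 1. The exhaustive history is built step by step: a delimiting line is moved through the stream, and at each step the algorithm tries to reproduce as much as possible of what lies to the right of the line by copying from what lies to its left. Each new component is the longest such copy plus the one symbol that follows it (the last component may simply end with the stream). Take for instance this stream of data: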
```
stream = [0,1,0,1,0,1,0]
```
Its Lempel-Ziv complexity is equal to 3, as there are three components in its exhaustive history: '0', '1' and '01010'.
Another way to see this metric: the Lempel-Ziv complexity is the number of different substrings encountered as the stream is read from beginning to end.
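
To make the exhaustive history visible, here is a minimal sketch that returns the components themselves instead of only counting them. The helper name `lz_components` is hypothetical (it does not come from the original text); it assumes the same search as the implementation above, namely trying every earlier starting position and keeping the longest copy plus one symbol.

```python
def lz_components(stream):
    """Split `stream` into the components of its exhaustive history.

    Hypothetical helper for illustration: same prefix/pointer search as the
    complexity function, but it records each component.
    """
    n = len(stream)
    if n == 0:
        return []
    components = [list(stream[:1])]  # the first symbol is always a component
    prefix_length = 1
    while prefix_length < n:
        remaining = n - prefix_length
        longest = 1  # a component holds at least one symbol
        for pointer in range(prefix_length):
            length = 0
            # Extend the copy starting at `pointer` while it matches
            # (the copy is allowed to run past the delimiting line).
            while (length < remaining
                   and stream[pointer + length] == stream[prefix_length + length]):
                length += 1
            # Longest copy plus one new symbol, capped at the end of the stream
            longest = max(longest, min(length + 1, remaining))
        components.append(list(stream[prefix_length:prefix_length + longest]))
        prefix_length += longest
    return components


print(lz_components([0, 1, 0, 1, 0, 1, 0]))
# → [[0], [1], [0, 1, 0, 1, 0]]
```

The number of components returned is the Lempel-Ziv complexity of the stream, so `len(lz_components(stream))` should agree with the counting implementation.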

For a more detailed analysis of the algorithm: https://nbviewer.jupyter.org/github/Naereen/Lempel-Ziv_Complexity/blob/master/Short_study_of_the_Lempel-Ziv_complexity.ipynb

## How to use this metric?
First, encode the data stream either as a binary sequence or with a given alphabet. Make sure the data to analyze is actually a stream, i.e. that the data at the beginning is in the 'past' of the data at the end. Then run the stream through the Lempel-Ziv complexity algorithm and collect the resulting complexity value. This number alone doesn't say much, so it needs to be compared with a baseline to be meaningful.
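
As one possible illustration of these steps, the sketch below binarizes a numeric signal and compares its complexity with a baseline. Both choices are assumptions made for this example, not something prescribed by the text: the signal is thresholded at its median, and the baseline is the mean complexity of shuffled copies of the same bits. The names `binarize` and `normalized_complexity` are hypothetical.

```python
import random
import statistics


def lempel_ziv_complexity(stream):
    """Compact copy of the counting algorithm, so this block runs on its own."""
    complexity, prefix_length, pointer = 1, 1, 0
    length, max_length = 1, 1
    while prefix_length + length <= len(stream):
        if stream[pointer + length - 1] == stream[prefix_length + length - 1]:
            length += 1
        else:
            max_length = max(length, max_length)
            pointer += 1
            if pointer == prefix_length:
                complexity += 1
                prefix_length += max_length
                pointer, max_length = 0, 1
            length = 1
    if length != 1:
        complexity += 1
    return complexity


def binarize(values):
    """Assumed encoding: 1 above the median of the signal, 0 otherwise."""
    threshold = statistics.median(values)
    return [1 if v > threshold else 0 for v in values]


def normalized_complexity(bits, n_shuffles=20, seed=0):
    """Complexity divided by the mean complexity of shuffled copies (assumed baseline)."""
    rng = random.Random(seed)
    baseline = []
    for _ in range(n_shuffles):
        shuffled = list(bits)
        rng.shuffle(shuffled)  # same symbols, temporal order destroyed
        baseline.append(lempel_ziv_complexity(shuffled))
    return lempel_ziv_complexity(bits) / statistics.mean(baseline)


# A strictly periodic sawtooth signal: much less complex than its shuffles
sawtooth = [t % 16 for t in range(256)]
print(normalized_complexity(binarize(sawtooth)))
```

A value well below 1 suggests the stream is more regular than a random arrangement of the same symbols. Other baselines are possible, and the right one depends on the question being asked.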

## Why does this metric matter?
This metric is important because it is the core idea behind two lossless compression algorithms (LZ77 and LZ78), which are the basis for many other high-performance compression algorithms such as DEFLATE. Beyond its general interest for computer science, it is also a fast and useful metric for comparing the complexity of two streams of data (as was done in the Perturbation Complexity Index by Casali et al. [2]).
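
The link with compression can be observed informally using Python's standard `zlib` module, which implements DEFLATE. The sketch below is only an illustration under assumed inputs (one symbol stored per byte, a fixed random seed): it compares the compressed size of a repetitive stream with that of an irregular one.

```python
import random
import zlib

n = 4096

# Repetitive stream: 0, 1, 0, 1, ... stored one symbol per byte
repetitive = bytes([0, 1] * (n // 2))

# Irregular stream: pseudo-random bits, also one symbol per byte
rng = random.Random(0)
irregular = bytes(rng.getrandbits(1) for _ in range(n))

size_repetitive = len(zlib.compress(repetitive))
size_irregular = len(zlib.compress(irregular))

print("compressed size, repetitive stream:", size_repetitive)
print("compressed size, irregular stream:", size_irregular)
```

The repetitive stream compresses to far fewer bytes, in line with its lower Lempel-Ziv complexity. Compressed size is not the same quantity as the complexity count, so it should only be read as a rough proxy.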

## References
[1] A. Lempel and J. Ziv, “On the Complexity of Finite Sequences,” IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 75–81, Jan. 1976.

[2] A. G. Casali et al., “A Theoretically Based Index of Consciousness Independent of Sensory Processing and Behavior,” Science Translational Medicine, vol. 5, no. 198, p. 198ra105, Aug. 2013.
