*Lab (TP) by Rachel Blin, as part of Alexandrina Rogozan's course*

# Run-length encoding

The goal of this lab is to run-length encode the following sequence:

"*00000110000011101110010111000001101011100001101011*"

## Grouping symbols into pairs

First, we group the sequence into pairs of the form (value, run), where value is the symbol and run is the number of consecutive occurrences of that symbol.

__Example__:  
For the message "babeecceddae", we obtain the following pairs:  

(b,1), (a,1), (b,1), (e,2), (c,2), (e,1), (d,2), (a,1), (e,1).    

1) Transform the sequence to encode into pairs.

In [19]:
s = "00000110000011101110010111000001101011100001101011"

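def sequence_to_pair_of_sign(s):
    """
    A function that transforms a sequence into pairs of signs
    
    # Argument:
        - s: The sequence to be transformed
        
    # Return:
        A list of tuples (symbol, length), one per run of identical symbols.
        A run is appended only when the symbol changes, so the final run of
        the sequence is not part of the result.
    """
    prec = s[0]  # symbol of the current run
    l = 0        # length of the current run
    res = []
    for i in s:
        if prec == i:
            l += 1
        else:
            # The symbol changed: store the finished run and start a new one
            res += [(prec, l)]
            prec = i
            l = 1
    return res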

sequence_to_pair_of_sign(s)

[('0', 5),
 ('1', 2),
 ('0', 5),
 ('1', 3),
 ('0', 1),
 ('1', 3),
 ('0', 2),
 ('1', 1),
 ('0', 1),
 ('1', 3),
 ('0', 5),
 ('1', 2),
 ('0', 1),
 ('1', 1),
 ('0', 1),
 ('1', 3),
 ('0', 4),
 ('1', 2),
 ('0', 1),
 ('1', 1),
 ('0', 1)]
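
The loop above appends a pair only when the symbol changes, so the last run of the sequence, `('1', 2)`, does not appear in the listing. A minimal sketch of a variant that keeps every run, assuming `itertools.groupby` is acceptable for this lab; `run_length_pairs` is a hypothetical name, not part of the lab code:

```python
from itertools import groupby

def run_length_pairs(sequence):
    """Return one (symbol, run length) pair per run, including the final run."""
    # groupby yields each maximal block of identical consecutive symbols
    return [(symbol, len(list(run))) for symbol, run in groupby(sequence)]

s = "00000110000011101110010111000001101011100001101011"
pairs = run_length_pairs(s)

# The run lengths must add up to the length of the source sequence
total_length = sum(length for _, length in pairs)
print(pairs[-3:], total_length == len(s))
```

With this variant the run lengths sum to the length of `s`, which is a quick way to check that no run was lost.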

## Encoding the pairs

Now that the sequence has been transformed into pairs, each pair can be treated as a character in its own right. These characters are encoded with any of the algorithms seen in the previous labs. An algorithm with variable-length code words, such as Huffman coding, is recommended.

__Example__:  
For the message "babeecceddae", we obtain the following pairs:  

(b,1), (a,1), (b,1), (e,2), (c,2), (e,1), (d,2), (a,1), (e,1)    

These pairs occur the following number of times:    

| Character | Number of occurrences |
|-----------|-----------------------|
| (b,1)     | 2                     |
| (a,1)     | 2                     |
| (e,2)     | 1                     |
| (c,2)     | 1                     |
| (e,1)     | 2                     |
| (d,2)     | 1                     |
2) Encode the resulting sequence of pairs using Huffman coding.

In [123]:
import operator
import collections

import pandas as pd  # used by occurencies() below

def unique(message):
    """ Returns all the unique tuples in the sequence transformed into a list of tuples.
    
    # Argument:
        - message: The sequence of tuples (symbol, length).
        
    # Returns:
        - A list containing all the unique tuples of the sequence.
    """
    
    chars = [] 
    for char in message: 
        if char not in chars: 
            chars.append(char) 
    return chars




def occurencies(message):
    """ Counts the number of occurrences of the tuples in the sequence.
    
    # Argument:
        - message: The sequence transformed into a list of tuples.
    
    # Returns:
        - A dict containing the tuples (symbol,length) as keys and the number of occurrences of each tuple as values.
    """
    
    # value_counts sorts the tuples by decreasing number of occurrences
    occ = pd.value_counts(message)
    
    dict_occ = {}
    for i in range(len(occ)):
        dict_occ[occ.index[i]] = occ.iloc[i]  # explicit positional access
    
    return dict_occ




def order_by_values(dictionary):
    """ Helper function to order a dictionary by its values.
    
    # Argument:
        - dictionary: The dictionary to order.
        
    # Returns:
        - A list of tuples (character, number of occurrences), sorted by increasing number of occurrences.
    
    """
    return sorted(dictionary.items(), key=operator.itemgetter(1))

# Data simplification

def attributing_symbol_pair(sorted_occurences):
    """
    A function attributing a symbol to each tuple contained in the sequence to be encoded.
    
    # Arguments:
        - sorted_occurences: A dict mapping each pair (symbol, length) to its number of occurrences in the sequence.
        
    # Return: 
        A dict mapping each attributed letter to the number of occurrences of its pair, and the conversion dict mapping each pair to its letter.
    """
    import string
    alphabet = string.ascii_lowercase  # one letter per distinct pair, so at most 26 pairs
    list_clef = []
    for clef in sorted_occurences.keys(): 
        list_clef += [clef]
 
            
    dictio = {}
    dict_conv = {}
    for i in range(len(sorted_occurences)):
        dictio[alphabet[i]] = sorted_occurences[list_clef[i]]
        dict_conv[list_clef[i]] = alphabet[i]
        
    return dictio, dict_conv
    
    
# Building the Huffman tree

# Creation of the Tree type

class Tree(object):
    """ Tree object."""
    def __init__(self, g=None, d=None, data=None):
        """ Init function.
        
        # Arguments:
            - g: The left node of the tree.
            - d: The right node of the tree.
            - data: The data to be held in the node.
        """
        self.g = g
        self.d = d
        self.data = data
    
def update_list_node(list_node, new_node):
    """ Adds a node in the correct place in an ordered list of nodes.
    
    # Arguments:
        - list_node: The list of nodes we want to add the node to.
        - new_node: The node to be added.
    
    # Returns:
        - The list with the added node.
    """
    for index, node in enumerate(list_node):
        if node.data[1] >= new_node.data[1]:
            list_node[index:index] = [new_node]
            break
    else:
        list_node.append(new_node)
            
    return list_node
    
def Huffman(sorted_occurences):
    """ Creates the Huffman tree.
    
    # Argument:
        - sorted_occurences: The list of (symbol, number of occurrences) tuples of the message we wish to encode, sorted by increasing number of occurrences.
    
    # Returns:
        - The root of the Huffman tree.
    """
    list_node = [Tree(data=value) for value in sorted_occurences]
    
    while(len(list_node) > 1):
        node_left = list_node.pop(0)
        node_right = list_node.pop(0)
        new_data = (node_left.data[0] + node_right.data[0], node_left.data[1] + node_right.data[1])
        new_node = Tree(node_left, node_right, new_data)
        update_list_node(list_node, new_node)

    return list_node[0]
    

def get_dict_coding(graph):
    """ Creates the coding dictionary.
    
    # Argument:
        - graph: The root of the Huffman tree.
    
    # Returns:
        - A coding dictionary.
    """
    current_level = [graph]
    dict_Huffman = {}
    # Breadth-first traversal: at each level, every symbol of a left subtree
    # gets a "1" appended to its code and every symbol of a right subtree a "0".
    while current_level:
        next_level = []
        for node in current_level:
            if node.g:
                next_level.append(node.g)
                seq = node.g.data[0]
                for char in seq:
                    if char in dict_Huffman:
                        dict_Huffman[char] += "1"
                    else:
                        dict_Huffman[char] = "1"
            if node.d:
                next_level.append(node.d)
                seq = node.d.data[0]
                for char in seq:
                    if char in dict_Huffman:
                        dict_Huffman[char] += "0"
                    else:
                        dict_Huffman[char] = "0"
        current_level = next_level
    return dict_Huffman



def conversion_dict_Huff(dict_Huffman, conversion_dict):
    """
    A function to convert the Huffman dictionary so it can use the real tuples symbol, length
    
    # Arguments:
        - dict_Huffman: The non converted Huffman dictionary 
        - conversion_dict: The dictionary of conversion
        
    # Return:
        A converted version of the Huffman dictionary so it can directly encode the sequence of tuples.
    """
    
    list_clef_Huff = []
    for clef in dict_Huffman: 
        list_clef_Huff += [clef]
    print(list_clef_Huff)
    
    list_clef_conv = []
    for clef in conversion_dict.keys(): 
        list_clef_conv += [clef]
    print(list_clef_conv)
    
    
    
    dict_final = {}
    for i in range(len(list_clef_conv)) :
        dict_final[list_clef_conv[i]] = dict_Huffman[conversion_dict[list_clef_conv[i]]]
        
    return dict_final 


# Encoding Huffman message

def encode_message(message_Huffman, dict_Huffman):
    """ Encodes a given message.
    
    # Arguments:
        - message_Huffman: The message to encode, as a list of tuples (symbol, length).
        - dict_Huffman: The coding dictionary.
    
    # Returns:
        - The encoded message.
    """
    m = ""
    for char in message_Huffman:
        m = m + dict_Huffman[char]
    return m

    

In [125]:
import numpy as np
import pandas as pd


dictio,dict_conv = attributing_symbol_pair(occurencies(sequence_to_pair_of_sign(s)))
print(dict_conv)

print(get_dict_coding(Huffman(order_by_values(dictio))))

print(conversion_dict_Huff(get_dict_coding(Huffman(order_by_values(dictio))), dict_conv))

encode_message(sequence_to_pair_of_sign(s),conversion_dict_Huff(get_dict_coding(Huffman(order_by_values(dictio))), dict_conv))


{('0', 1): 'a', ('1', 3): 'b', ('1', 1): 'c', ('1', 2): 'd', ('0', 5): 'e', ('0', 4): 'f', ('0', 2): 'g'}
{'b': '11', 'f': '1011', 'g': '1010', 'c': '100', 'd': '011', 'e': '010', 'a': '00'}
['b', 'f', 'g', 'c', 'd', 'e', 'a']
[('0', 1), ('1', 3), ('1', 1), ('1', 2), ('0', 5), ('0', 4), ('0', 2)]
{('0', 1): '00', ('1', 3): '11', ('1', 1): '100', ('1', 2): '011', ('0', 5): '010', ('0', 4): '1011', ('0', 2): '1010'}
['b', 'f', 'g', 'c', 'd', 'e', 'a']
[('0', 1), ('1', 3), ('1', 1), ('1', 2), ('0', 5), ('0', 4), ('0', 2)]


'0100110101100111010100001101001100100001110110110010000'
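
Because no code word in the dictionary above is a prefix of another, the bit string can be decoded by reading bits until the buffer matches a code word. A minimal sketch, reusing the dictionary and the encoded string printed above; `decode_message` is a hypothetical helper, not part of the lab code:

```python
codes = {("0", 1): "00", ("1", 3): "11", ("1", 1): "100", ("1", 2): "011",
         ("0", 5): "010", ("0", 4): "1011", ("0", 2): "1010"}
encoded = "0100110101100111010100001101001100100001110110110010000"

def decode_message(bits, codes):
    """Decode a bit string produced with a prefix-free code back into pairs."""
    reverse = {code: pair for pair, code in codes.items()}
    pairs, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:
            # A complete code word was read: emit its pair and start over
            pairs.append(reverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code word")
    return pairs

pairs = decode_message(encoded, codes)
# Expand each (symbol, length) pair back into a run of symbols
decoded = "".join(symbol * length for symbol, length in pairs)
print(pairs)
print(decoded)
```

Expanding each decoded pair into `symbol * length` rebuilds the part of the source sequence covered by the encoded pairs.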

3) Compute the compression rate of the message and conclude on the performance.

In [4]:
def compression_rate(longueur_source, longueur_encode):
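    """
    A function to compute the compression rate of the message
    
    # Arguments:
        - longueur_source: The length of the source message
        - longueur_encode: The length of the encoded message
    
    # Return:
        The compression rate, 1 - longueur_encode / longueur_source (negative when the encoded message is longer)
    """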
    
    taux = 1 - (longueur_encode / longueur_source)
    return taux

We conclude that, in our case, run-length encoding produces a message longer than the input: 55 bits against 50 source symbols. Run-length encoding is only effective when the runs are long enough.
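
The rate itself is not displayed above, so here is the arithmetic as a short sketch. It uses the same formula as `compression_rate` and takes the two lengths from the source sequence and the encoded string shown earlier; it counts only the code words, not the dictionary needed to decode them.

```python
source_length = 50   # symbols in the source sequence
encoded_length = 55  # bits in the encoded string shown earlier

# Same formula as compression_rate(): negative means the message grew
rate = 1 - encoded_length / source_length
print(f"{rate:.2f}")  # -0.10
```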


Consider the sequence:     

"111111111111111000000111111000000000000000111111111111111"

4) Encode this sequence using run-length encoding and conclude.

In [129]:
s = "111111111111111000000111111000000000000000111111111111111"

# Build the conversion dictionary
dictio,dict_conv = attributing_symbol_pair(occurencies(sequence_to_pair_of_sign(s)))

# Encode the message
encode_message(sequence_to_pair_of_sign(s),conversion_dict_Huff(get_dict_coding(Huffman(order_by_values(dictio))), dict_conv))

# Check the compression rate
compression_rate(len(s),len(encode_message(sequence_to_pair_of_sign(s),conversion_dict_Huff(get_dict_coding(Huffman(order_by_values(dictio))), dict_conv))))



['c', 'd', 'a', 'b']
[('0', 15), ('0', 6), ('1', 6), ('1', 15)]
['c', 'd', 'a', 'b']
[('0', 15), ('0', 6), ('1', 6), ('1', 15)]


0.8596491228070176

The compression rate is good here, at about 86%.
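
This figure can be reproduced by hand. The sketch below assumes that each of the four pairs listed in the output is encoded exactly once and that Huffman coding gives four equally frequent symbols a 2-bit code word each; the variable names are illustrative.

```python
pairs = [("0", 15), ("0", 6), ("1", 6), ("1", 15)]  # pairs listed in the output
bits_per_pair = 2      # four equally frequent symbols: balanced tree of depth 2
source_length = 57     # symbols in the second sequence

encoded_length = bits_per_pair * len(pairs)
rate = 1 - encoded_length / source_length
print(encoded_length, round(rate, 4))
```

Eight bits against 57 source symbols gives the rate reported above, again without counting the dictionary.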