> You are given a word, stored as an array of length $ n $ whose characters come from an alphabet $ E $ of size $ |E| $. You are also given a number $ k $. The word has length at least $ |E|^k $ ($ n \ge |E|^k $). Propose an algorithm that returns the most frequently occurring contiguous substring of length $ k $ in this word. The algorithm must run in $ O(n) $ time and use $ O(1) $ memory. In addition, the contents of the array must be unchanged after the algorithm finishes.

### Remarks

For the task to be feasible at all, we must assume that a letter can be converted to a number representing its position in the alphabet in $ O(1) $ time, and that the reverse conversion takes the same time. For this reason I assume that the alphabet is represented by a class that performs both operations in constant time.

### Algorithm implementation

##### Implementation of the class representing the alphabet

In [5]:
class Alphabet:
    def __init__(self, characters):
        self.chars_to_codes = {char:code for code, char in enumerate(characters)}
        self.codes_to_chars = list(characters)
        
    def __repr__(self):
        return f'{self.__class__.__name__}({"".join(self.codes_to_chars)})'
        
    def __str__(self):
        return '\n'.join(f'{char}: {code}' for char, code in self.chars_to_codes.items())
    
    def __iter__(self):
        yield from self.codes_to_chars
    
    def __len__(self):
        return len(self.chars_to_codes)
    
    def __getitem__(self, idx):
        return self.chr(idx)
    
    def ord(self, char):
        # If the character is not in the alphabet, None will be returned
        return self.chars_to_codes.get(char)
    
    def chr(self, code):
        # If a code is not valid, None will be returned
        return self.codes_to_chars[code] if 0 <= code < len(self.codes_to_chars) else None


##### The algorithm from the task

In [2]:
def most_common_substring(string: list, alphabet: 'alphabet class', k: 'length of a substring') -> (str, int):
    # In the first loop, encode all substrings of length k. Indices before k-1 hold codes
    # of prefixes shorter than k, as there are not enough letters there to form
    # a k-element substring.
    
    limit = len(alphabet) ** k
    # Store the code of the first character in the first position
    string[0] = alphabet.ord(string[0])
    
    for i in range(1, len(string)):
        string[i] = encode_substring(string[i-1], string[i], limit, alphabet)
        
    # Using a counting-sort approach, store the count of each encoded substring in the string
    # array itself. As we have to remember the initial codes, every occurrence of a substring
    # with code c adds the limit value to string[c] (valid because c < limit <= len(string)).
    # Start at index k-1 so that the shorter prefixes are not counted as k-element substrings.
    for i in range(k - 1, len(string)):
        # Retrieve the initial code of a substring
        substring_code = counter_to_code(string[i], limit)
        # Increase the counter of this substring in the string array
        string[substring_code] += limit
        
    # Find the greatest value of a counter
    max_count = 0
    max_count_code = 0
    for i in range(len(string)):
        # Retrieve a value of a counter
        count = get_count_from_counter(string[i], limit)
        # Update the max_count and max_count_code variables if we found a substring
        # which occurs more often
        if count > max_count:
            max_count = count
            max_count_code = i
            
    # Restore the original values of the string array
    curr_code = counter_to_code(string[-1], limit)
    for i in range(len(string)-1, 0, -1):
        prev_code = counter_to_code(string[i-1], limit)
        curr_char = decode_substring(prev_code, curr_code, limit, alphabet)
        string[i] = curr_char
        curr_code = prev_code
        
    # Decode the first character
    first_code = counter_to_code(string[0], limit)
    string[0] = decode_substring(0, first_code, limit, alphabet)

    # Return the substring with the most occurrences and the number of its repetitions.
    # The helper collects the characters in a list and joins them at the end, as that is
    # far faster than repeated string concatenation.
    return decode_substring_with_code(max_count_code, limit, alphabet), max_count
        

def encode_substring(prev_code: 'code of a previous encoded substring', 
                     curr_char: 'current character', 
                     code_limit: 'limit value',
                     alphabet: 'Alphabet class instance') -> int:
    """
    This function creates the code of a substring from the code of the previous substring
    and a new character. The code is always lower than code_limit, as code_limit equals
    the number of distinct substrings of the maximum length. The idea is to treat the
    substring as a number written in the base of the alphabet's length. Encoding a substring
    is equivalent to converting a hexadecimal number to its decimal representation, except
    that the base is not 16 and the digits are the characters of our alphabet.
    """
    base = len(alphabet)
    char_code = alphabet.ord(curr_char)
    # Shift all digits to the left by multiplying the code of the previous substring by the
    # base of our system (the alphabet's length), add the code of the new character in the
    # least significant digit's position, and drop the oldest character (via the modulo) if
    # the code of the new substring exceeds the biggest possible code.
    return (prev_code * base + char_code) % code_limit


def decode_substring(prev_code: 'code of a previous encoded substring', 
                     curr_code: 'code of a current substring', 
                     code_limit: 'limit value',
                     alphabet: 'Alphabet class instance') -> str:
    """
    This function retrieves the single character that was appended to the previous substring
    to form the current one. prev_code (the code of the previous substring) is required in
    order to recover that character.
    """
    base = len(alphabet)
    # Subtract the previous code shifted to the left (modulo code_limit) from the current code
    # in order to get the code of the character that was added last.
    char_code = curr_code - (prev_code * base) % code_limit
    return alphabet.chr(char_code)


def counter_to_code(counter_value: 'value of a substring counter',
                    code_limit: 'limit value') -> int:
    """
    As we will store counts of repeated substring as codes increased by a proper multiple
    of the code_limit, this function will help us retrieve the initial code of an encoded
    substring.
    """
    return counter_value % code_limit


def get_count_from_counter(counter_value: 'value of a substring counter',
                           code_limit: 'limit value') -> int:
    """
    As we will store counts of repeated substring as codes increased by a proper multiple
    of the code_limit, this function will help us retrieve the real count of the substring
    which will be equal to the number of times we increased a code by the code_limit.
    """
    return counter_value // code_limit


def decode_substring_with_code(substring_code: 'code of a substring to decode',
                               code_limit: 'limit value',
                               alphabet: 'Alphabet class instance') -> str:
    """
    This function retrieves a string from its code.
    """
    base = len(alphabet)
    chars = []
    
    while code_limit > 1:
        substring_code, char_code = divmod(substring_code, base)
        chars.append(alphabet.chr(char_code))
        code_limit //= base
        
    return ''.join(chars[::-1])
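
To make the encoding and the in-place counters easier to follow, here is a small worked sketch on a hypothetical two-letter alphabet (`'ab'`, `k = 2`, word `'abab'`). It does not use the `Alphabet` class; the names `letters`, `to_code`, `slots` and `restored` are assumptions made only for this illustration. It mirrors the three phases of the algorithm: rolling codes, counters packed as multiples of `limit`, and recovery of the original characters.

```python
# Hypothetical toy input, chosen only for illustration.
letters = 'ab'
to_code = {ch: i for i, ch in enumerate(letters)}
k = 2
base = len(letters)
limit = base ** k            # 4 possible substrings: aa, ab, ba, bb
word = 'abab'                # len(word) >= limit, as the task requires

# Phase 1: rolling code of the (at most k) last characters, in base |E|.
codes = []
code = 0
for ch in word:
    # Shift left by one digit, append the new digit, drop the oldest digit.
    code = (code * base + to_code[ch]) % limit
    codes.append(code)
print(codes)                 # [0, 1, 2, 1]

# Phase 2: pack counters into the same array. Slot c gains `limit` for every
# occurrence of the substring with code c; `value % limit` is still the code.
slots = codes[:]
for i in range(k - 1, len(slots)):
    c = slots[i] % limit
    slots[c] += limit
counts = [value // limit for value in slots[:limit]]
print(counts)                # [0, 2, 1, 0] -> 'ab' occurs twice, 'ba' once

# Phase 3: recover each character from two consecutive codes.
restored = []
prev = 0
for value in slots:
    cur = value % limit
    digit = cur - (prev * base) % limit
    restored.append(letters[digit])
    prev = cur
print(''.join(restored))     # abab
```

The subtraction in the last phase never goes negative because `(prev * base) % limit` is a multiple of `base`, so adding a digit smaller than `base` cannot wrap around `limit`.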

##### A few tests

In [3]:
import random


def test(alphabet_letters, k):
    E = Alphabet(alphabet_letters)
    print(repr(E))
    print(E)
    E_pow_k = len(E) ** k
    E_chars = list(E)
    string_length = random.randint(E_pow_k, 2 * E_pow_k) # Use a random length that satisfies n >= |E|^k from the task
    string_arr = [random.choice(E_chars) for _ in range(string_length)]
    prev_string = string_arr[:]
    print('String (part):')
    print(string_arr[:10], string_arr[-10:])

    result = most_common_substring(string_arr, E, k)
    print(string_arr[:10], string_arr[-10:])  # Check that the string hasn't changed
    print('Is the same?:', string_arr == prev_string)

    print('\n===== Final results: =====')
    print('Most common substring:', result[0])
    print('Occerrences:', result[1])
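    # str.count skips overlapping matches, so compare every window of length k instead
    real_count = sum(prev_string[i:i+k] == list(result[0]) for i in range(len(prev_string) - k + 1))
    print('Real occurences of the result string:', real_count) # If the same as above, the reported count is correct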

In [4]:
from string import ascii_lowercase

test(ascii_lowercase, 4)

Alphabet(abcdefghijklmnopqrstuvwxyz)
a: 0
b: 1
c: 2
d: 3
e: 4
f: 5
g: 6
h: 7
i: 8
j: 9
k: 10
l: 11
m: 12
n: 13
o: 14
p: 15
q: 16
r: 17
s: 18
t: 19
u: 20
v: 21
w: 22
x: 23
y: 24
z: 25
String (part):
['s', 'f', 'w', 'b', 't', 'e', 'q', 'q', 'o', 'g'] ['u', 'y', 'q', 's', 'u', 'w', 'o', 'n', 'm', 'q']
['s', 'f', 'w', 'b', 't', 'e', 'q', 'q', 'o', 'g'] ['u', 'y', 'q', 's', 'u', 'w', 'o', 'n', 'm', 'q']
Is the same?: True

===== Final results: =====
Most common substring: ccte
Occerrences: 11
Real occurences of the result string: 11
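
The test above confirms the count of the returned substring, but not that no other substring occurs more often. A brute-force cross-check can cover that. The sketch below uses `collections.Counter`, so it needs extra memory and is meant only for verification; `window_counts` is a hypothetical helper name, not part of the solution.

```python
from collections import Counter


def window_counts(word, k):
    """Count every (possibly overlapping) window of length k in word."""
    return Counter(''.join(word[i:i + k]) for i in range(len(word) - k + 1))


# Small hand-checkable input over the letters 'a' and 'b' with k = 2.
word = list('aaabab')
counts = window_counts(word, 2)
best_count = max(counts.values())
best_substrings = sorted(s for s, c in counts.items() if c == best_count)
print(best_substrings, best_count)   # ['aa', 'ab'] 2
```

When several substrings tie, as here, `most_common_substring` appears to return the one with the smallest code, because it scans the counters in increasing order and only replaces the maximum on a strictly greater count. A comparison against this reference should therefore check the count, or membership in `best_substrings`, rather than equality with one particular substring.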
