# String Data Type in Python

In Python, a string is a sequence of characters. Strings in Python are immutable, which means once a string is created, it cannot be changed. However, you can create a new string based on the original string with modifications. Strings can be defined using either single quotes (' ') or double quotes (" "), and they can contain letters, numbers, special characters, spaces, and even escape sequences (like \n for a new line).
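
As a minimal sketch of what immutability means in practice (the variable names here are illustrative only), item assignment on a string raises `TypeError`, while building a new string from the old one works:

```python
greeting = "Hello"

# Strings do not support item assignment
try:
    greeting[0] = "J"
except TypeError as error:
    message = str(error)

# Build a new string instead; the original stays unchanged
modified = "J" + greeting[1:]
print(greeting, modified)  # Hello Jello
```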

## String Operations

Concatenation: Joining two or more strings together.
Indexing: Accessing a specific character in a string using its position.
Slicing: Extracting a portion of a string.

In [1]:
# String concatenation
string1 = "Hello"
string2 = "World"
concatenated_string = string1 + " " + string2
print("Concatenated String:", concatenated_string)

# String indexing
index = 4
character_at_index = string1[index]
print(f"Character at index {index} in '{string1}':", character_at_index)
print(' - Note: the first character in string has index 0.')

# String slicing
start_index = 1
end_index = 4
sliced_string = string1[start_index:end_index]
print(f"Sliced string from index {start_index} to {end_index} in '{string1}':", sliced_string)


Concatenated String: Hello World
Character at index 4 in 'Hello': o
 - Note: the first character in string has index 0.
Sliced string from index 1 to 4 in 'Hello': ell


In [2]:
# Repeat with implicit printing

# Defining the strings
string1 = "Hello"
string2 = "World"

# String concatenation
concatenated_string = string1 + " " + string2

# String indexing
index = 4
character_at_index = string1[index]

# String slicing
start_index = 1
end_index = 4
sliced_string = string1[start_index:end_index]

(concatenated_string, character_at_index, sliced_string)


('Hello World', 'o', 'ell')

## String Character Encoding

Character encoding is a system that pairs each character from a given set with a number (for example, an ordinal number or code point) and defines how those numbers are stored as bytes, which enables the transmission and storage of text. In computing, we deal with different character encodings due to historical and practical reasons.

1. ASCII (American Standard Code for Information Interchange):

- It was one of the first character encodings and is based on the English alphabet.
- ASCII uses 7 bits to represent each character, allowing for 128 different symbols.
- It covers English letters, digits, and a few punctuation marks, but lacks characters from non-English languages.

2. Unicode:

- Unicode is a computing industry standard designed to consistently represent text expressed in most of the world's writing systems.
- It can represent over a million characters, covering a wide array of languages and symbols.
- Unicode is an abstract representation. To store Unicode characters in memory or on disk, we need a specific encoding, such as UTF-8 or UTF-16.

3. UTF-8 (Unicode Transformation Format - 8 bits):

- It is a variable-width character encoding that can represent any character in the Unicode standard.
- UTF-8 is backward-compatible with ASCII.
- It's the dominant character encoding for the web.

4. Code Page (Windows Encoding):

- Before Unicode became widespread, many different encodings were created to handle different languages and character sets. Windows had its own set of encodings known as "code pages".
- Each code page supports a different character set. For instance, CP1250 covers Central and Eastern European languages written in Latin script, while CP1251 covers Cyrillic script.
- With the advent of Unicode, code pages have become less common, but you might still encounter them in legacy systems or older data.
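
The variable-width property of UTF-8 and its backward compatibility with ASCII can be checked directly. This is a small sketch; the sample characters are arbitrary choices, not taken from the text above:

```python
# One character each that needs 1, 2, 3 and 4 bytes in UTF-8
samples = ["a", "č", "€", "\U0001F600"]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# Pure ASCII text produces identical bytes in ASCII and UTF-8
ascii_compatible = "Hello".encode("ascii") == "Hello".encode("utf-8")
print(ascii_compatible)  # True
```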

Converting Between Various String Encodings in Python:

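In Python, *str.encode()* turns a string into bytes in a given encoding, and *bytes.decode()* turns bytes back into a string. To convert data from one encoding to another, decode the bytes with the source encoding and encode the resulting string with the target encoding.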

In [3]:
# Original string (contains a special character for demonstration purposes)
original_string = "Hello, World! – Special Character and Local Characters čšž"

# Encoding the string to bytes in UTF-8
utf8_encoded = original_string.encode('utf-8')

# Decoding the bytes from UTF-8 to string
utf8_decoded = utf8_encoded.decode('utf-8')

# Encoding the string to bytes in CP1250 (a Windows code page)
cp1250_encoded = original_string.encode('cp1250', errors='ignore')  # errors='ignore' silently drops characters CP1250 cannot represent

# Decoding the bytes from CP1250 back to a string
cp1250_decoded = cp1250_encoded.decode('cp1250')

utf8_encoded, utf8_decoded, cp1250_encoded, cp1250_decoded


(b'Hello, World! \xe2\x80\x93 Special Character and Local Characters \xc4\x8d\xc5\xa1\xc5\xbe',
 'Hello, World! – Special Character and Local Characters čšž',
 b'Hello, World! \x96 Special Character and Local Characters \xe8\x9a\x9e',
 'Hello, World! – Special Character and Local Characters čšž')

The built-in *sorted()* function in Python returns a new list containing all items from the original list (or any iterable), sorted in ascending order by default. When applied to a string, sorted() treats the string as a sequence of characters and returns a list of characters sorted in ascending order. The order is by Unicode code point, not by any language's alphabet, so in the examples below the letters č, š, and ž sort after every unaccented ASCII letter.

In [2]:
# Original string (also try "programming")
string = "češnja"

# Sorting the string; sorted() accepts the string directly, so no list() conversion is needed
sorted_characters = sorted(string)

# Joining the sorted characters to form a sorted string
sorted_string = ''.join(sorted_characters)

sorted_string


'aejnčš'

In [3]:
# List of strings
string_list = ["banana", "želod", "apple", "cherry", "date", "češnja"]

# Sorting the list of strings
sorted_list = sorted(string_list)

sorted_list


['apple', 'banana', 'cherry', 'date', 'češnja', 'želod']
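
Because sorted() compares code points, 'češnja' lands after 'date' in the output above. One way to get dictionary order for a specific language is a custom sort key. The sketch below assumes the lowercase Slovenian alphabet order; `SLOVENIAN_ALPHABET`, `RANK`, and `slovenian_key` are hypothetical names introduced for illustration:

```python
# Lowercase Slovenian alphabet, used here as the assumed collation order
SLOVENIAN_ALPHABET = "abcčdefghijklmnoprsštuvzž"
RANK = {letter: position for position, letter in enumerate(SLOVENIAN_ALPHABET)}

def slovenian_key(word):
    # Letters outside the alphabet (q, w, x, y, ...) sort after all known
    # letters, and among themselves by code point
    return [(RANK.get(ch, len(RANK)), ch) for ch in word.lower()]

words = ["banana", "želod", "apple", "cherry", "date", "češnja"]
sorted_words = sorted(words, key=slovenian_key)
print(sorted_words)
# ['apple', 'banana', 'cherry', 'češnja', 'date', 'želod']
```

The standard library's `locale.strxfrm` can serve as a key function as well, but its result depends on which locales are installed on the system.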

## Key-Indexed Counting

Key-indexed counting is a sorting technique that is particularly effective when the keys are small integers. It can also be adapted to work with strings where each character acts as a key. The algorithm works by computing the frequency of each key and then using those frequencies to determine where each item should be placed in the sorted order.

Here's an overview of key-indexed counting for strings:

- Count frequencies: Count the occurrences of each character (or key) in the input.
- Compute starting indices: Determine where items with each key should start in the output.
- Distribute the items: Place each item in its sorted position in the output.
- Copy back: Copy the sorted items back to the original array.

In [4]:
def key_indexed_counting_sort(s):
    # Define the size of the count array; every character must satisfy ord(char) < 256
    R = 256  # Extended ASCII (Latin-1) alphabet size; plain ASCII has only 128 characters
    N = len(s)
    
    # Initialize the count array with zeros
    count = [0] * (R + 1)
    aux = [""] * N
    
    # Step 1: Count frequencies
    for char in s:
        count[ord(char) + 1] += 1
    
    # Step 2: Compute starting indices
    for r in range(R):
        count[r + 1] += count[r]
    
    # Step 3: Distribute the items
    for char in s:
        aux[count[ord(char)]] = char
        count[ord(char)] += 1
    
    # Step 4: Join the sorted characters (strings are immutable, so there is no array to copy back into)
    sorted_string = ''.join(aux)
    
    return sorted_string

# Demonstration
input_string = "keyindexedcountingsort"
sorted_result = key_indexed_counting_sort(input_string)
sorted_result


'cddeeegiiknnnoorsttuxy'
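
The same four steps apply when the keys are small integers attached to records rather than characters. The following sketch uses made-up `(name, section)` pairs and a hypothetical helper `sort_by_small_key`; it also shows that the method is stable, meaning records with equal keys keep their original relative order:

```python
def sort_by_small_key(records, num_keys):
    """Stable key-indexed counting on (name, key) pairs with 0 <= key < num_keys."""
    count = [0] * (num_keys + 1)

    # Count frequencies, offset by one position
    for _, key in records:
        count[key + 1] += 1

    # Turn frequencies into starting indices
    for r in range(num_keys):
        count[r + 1] += count[r]

    # Distribute the records into their sorted positions
    aux = [None] * len(records)
    for record in records:
        key = record[1]
        aux[count[key]] = record
        count[key] += 1
    return aux

students = [("Anderson", 2), ("Brown", 3), ("Davis", 3),
            ("Garcia", 1), ("Harris", 1), ("Jackson", 3)]
by_section = sort_by_small_key(students, 4)
print(by_section)
```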

## LSD (Least Significant Digit) Radix Sort

In LSD Radix Sort, strings are sorted starting from the rightmost character (least significant digit) and moving to the left. It's often used for fixed-length strings or numbers.

## MSD (Most Significant Digit) Radix Sort

In MSD Radix Sort, strings are sorted starting from the leftmost character (most significant digit) and moving to the right. It can handle variable-length strings.

In [5]:
def lsd_radix_sort(strings, W):
    """LSD Radix Sort for fixed-length strings."""
    R = 256  # Extended ASCII alphabet size; requires ord(char) < 256
    aux = [""] * len(strings)
    
    for d in reversed(range(W)):
        count = [0] * (R + 1)
        
        # Count frequencies
        for string in strings:
            count[ord(string[d]) + 1] += 1
        
        # Compute starting indices
        for r in range(R):
            count[r + 1] += count[r]
        
        # Distribute strings
        for string in strings:
            aux[count[ord(string[d])]] = string
            count[ord(string[d])] += 1
        
        # Copy back
        for i in range(len(strings)):
            strings[i] = aux[i]
    
    return strings


In [6]:
def msd_radix_sort(strings):
    """MSD Radix Sort."""
    R = 256  # Extended ASCII alphabet size; requires ord(char) < 256
    aux = [""] * len(strings)
    
    def sort(strings, aux, lo, hi, d):
        if hi <= lo:
            return
        
        count = [0] * (R + 2)
        
        # Count frequencies
        for i in range(lo, hi + 1):
            c = ord(strings[i][d]) if d < len(strings[i]) else -1
            count[c + 2] += 1
        
        # Compute starting indices
        for r in range(R + 1):
            count[r + 1] += count[r]
        
        # Distribute strings
        for i in range(lo, hi + 1):
            c = ord(strings[i][d]) if d < len(strings[i]) else -1
            aux[count[c + 1]] = strings[i]
            count[c + 1] += 1
        
        # Copy back
        for i in range(lo, hi + 1):
            strings[i] = aux[i - lo]
        
        # Recursively sort for each character value
        for r in range(R):
            sort(strings, aux, lo + count[r], lo + count[r + 1] - 1, d + 1)
    
    sort(strings, aux, 0, len(strings) - 1, 0)
    return strings


In [7]:
# Demonstration
strings_lsd = ["cat", "dog", "bat", "man", "pan"]
strings_msd = ["bat", "cat", "apple", "all", "alley"]

sorted_lsd = lsd_radix_sort(strings_lsd, 3)
sorted_msd = msd_radix_sort(strings_msd)

sorted_lsd, sorted_msd


(['bat', 'cat', 'dog', 'man', 'pan'], ['all', 'alley', 'apple', 'bat', 'cat'])

## Brute-Force Substring Search

The Brute-Force Substring Search algorithm checks for a substring by examining all possible positions it could appear in the main string. For each position, it checks if the substring matches the characters in the main string at that position.

In [8]:
def brute_force_substring_search(main_string, substring):
    """Brute-Force Substring Search."""
    M = len(substring)
    N = len(main_string)

    for i in range(N - M + 1):
        j = 0
        while j < M:
            if main_string[i + j] != substring[j]:
                break
            j += 1
        if j == M:  # Match found
            return i
    return -1  # No match found


In [9]:
# Demonstration
main_string = "hellothisisateststring"
substring1 = "test"
substring2 = "world"

position1 = brute_force_substring_search(main_string, substring1)
position2 = brute_force_substring_search(main_string, substring2)

position1, position2


(12, -1)

## Knuth-Morris-Pratt (KMP) Algorithm

The KMP algorithm preprocesses the pattern (substring) to generate a "partial match" table (also known as a "failure function"). This table determines how much we can skip ahead when a mismatch is found, which reduces the number of character comparisons.

In [10]:
# KMP Algorithm
def kmp_search(text, pattern):
    def compute_lps_array(pattern, M, lps):
        length = 0
        lps[0] = 0
        i = 1
        while i < M:
            if pattern[i] == pattern[length]:
                length += 1
                lps[i] = length
                i += 1
            else:
                if length != 0:
                    length = lps[length - 1]
                else:
                    lps[i] = 0
                    i += 1

    M = len(pattern)
    N = len(text)
    if M == 0:
        return 0  # An empty pattern matches at index 0
    lps = [0] * M
    j = 0
    compute_lps_array(pattern, M, lps)
    i = 0
    while i < N:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == M:
            return i - j  # First match starts at index i - j
        elif i < N and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return -1
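
To make the partial match table concrete, here is a standalone sketch that builds it with the same logic as the nested helper above and returns it; `lps_table` is a name chosen for this illustration. Entry `i` is the length of the longest proper prefix of `pattern[:i + 1]` that is also a suffix of it:

```python
def lps_table(pattern):
    """Return the longest-proper-prefix-that-is-also-a-suffix table."""
    lps = [0] * len(pattern)
    length = 0  # Length of the current matched prefix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            # Fall back to the next shorter candidate prefix
            length = lps[length - 1]
        else:
            i += 1
    return lps

print(lps_table("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
print(lps_table("aabaaab"))  # [0, 1, 0, 1, 2, 2, 3]
```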


## Boyer-Moore Algorithm

Boyer-Moore compares the pattern against the text from right to left and skips sections of the text, so in many cases it performs fewer comparisons than other string-search algorithms. The main idea is that every time a mismatch occurs, the pattern itself determines how far it can be shifted. The implementation below uses only the good-suffix rule to compute the shifts.

In [11]:
# Boyer-Moore Algorithm
def boyer_moore_search(text, pattern):
    def preprocess_strong_suffix(shift, bpos, pattern, m):
        i = m
        j = m + 1
        bpos[i] = j
        while i > 0:
            while j <= m and pattern[i - 1] != pattern[j - 1]:
                if shift[j] == 0:
                    shift[j] = j - i
                j = bpos[j]
            i -= 1
            j -= 1
            bpos[i] = j

    def preprocess_case2(shift, bpos, pattern, m):
        j = bpos[0]
        for i in range(m + 1):
            if shift[i] == 0:
                shift[i] = j
            if i == j:
                j = bpos[j]

    m = len(pattern)
    n = len(text)
    if m == 0:
        return 0
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    preprocess_strong_suffix(shift, bpos, pattern, m)
    preprocess_case2(shift, bpos, pattern, m)
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            return s  # First match starts at shift s
        else:
            s += shift[j + 1]

    return -1
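
The function above derives its shifts from matching suffixes of the pattern. The other classic Boyer-Moore heuristic is the bad-character rule. The sketch below uses that rule alone, as a simplified illustration rather than a full Boyer-Moore implementation; `bad_character_search` is a hypothetical name:

```python
def bad_character_search(text, pattern):
    """Right-to-left matching with the bad-character shift only."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0

    # Rightmost index of each character in the pattern
    last = {ch: i for i, ch in enumerate(pattern)}

    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            return s
        # Align the mismatched text character with its rightmost occurrence
        # in the pattern; always move forward by at least one position
        s += max(1, j - last.get(text[s + j], -1))
    return -1

print(bad_character_search("thisisateststringforalgorithms", "test"))  # 7
```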


## Rabin-Karp Algorithm

Rabin-Karp uses hashing to find a pattern in text. For every substring of the text of the same length as the pattern, if the hash matches the hash of the pattern, a direct comparison is performed.

In [12]:
# Rabin-Karp Algorithm (simplified: every window is rehashed with the built-in hash(), without a rolling update)
def rabin_karp_search(text, pattern):
    m = len(pattern)
    n = len(text)
    pattern_hash = hash(pattern)
    text_hash = hash(text[:m])
    for i in range(0, n - m + 1):
        if pattern_hash == text_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            text_hash = hash(text[i + 1:i + m + 1])
    return -1


In [13]:
# Demonstration
text = "thisisateststringforalgorithms"
pattern = "test"
kmp_result = kmp_search(text, pattern)
boyer_moore_result = boyer_moore_search(text, pattern)
rabin_karp_result = rabin_karp_search(text, pattern)

kmp_result, boyer_moore_result, rabin_karp_result


(7, 7, 7)

- KMP is efficient because it avoids unnecessary comparisons by using the LPS (Longest Prefix Suffix) array.
- Boyer-Moore can skip multiple characters at once, making it faster in many scenarios.
- Rabin-Karp uses hashing, making it efficient for multiple pattern searches, but there's a possibility of hash collisions.

The core idea behind the Rabin-Karp algorithm is to compute a hash value for the pattern and then for the substring of the text of the same length. If the hash values match, we compare the substring and the pattern directly.

To make the algorithm efficient, we employ the sliding window technique. This means that, instead of recomputing the hash from scratch for each substring, we "slide" over the text one character at a time, updating the hash value incrementally.

To prevent hash values from growing too large, we'll use modular arithmetic. This means we'll compute hash values modulo a prime number to ensure they remain within a reasonable size.

In [19]:
def rabin_karp_with_modular(text, pattern):
    # Define parameters
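    d = 256  # Number of characters in the input alphabet (assuming extended ASCII)
    q = 101  # A prime number for modulus operation
    M = len(pattern)
    N = len(text)
    if M == 0:
        return 0  # An empty pattern matches at index 0
    if M > N:
        return -1  # The pattern cannot fit in the text
    p = 0  # Hash value for pattern
    t = 0  # Hash value for text
    h = 1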
    
    # The value of h would be "pow(d, M-1) % q"
    for i in range(M-1):
        h = (h * d) % q

    # Calculate the hash value of pattern and first window of text
    for i in range(M):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q

    # Slide the pattern over text one by one
    for i in range(N - M + 1):
        # Compare the characters only when the hash values match;
        # the direct comparison rules out false positives from hash collisions
        if p == t and text[i:i + M] == pattern:
            return i

        # Calculate hash value for next window of text
        if i < N - M:
            # Python's % always returns a value in [0, q), so no sign fix-up is needed
            t = (d * (t - ord(text[i]) * h) + ord(text[i + M])) % q

    return -1


In [20]:
# Demonstration
text = "thisisateststringforrabin_karp"
pattern = "rabin"
result = rabin_karp_with_modular(text, pattern)

result


20
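
The incremental update is the step that is easiest to get wrong, so it helps to check it against hashes computed from scratch. This sketch reuses the same base and modulus (256 and 101); `window_hash` and `roll` are names made up for the check, and the sample text is arbitrary:

```python
def window_hash(s, d=256, q=101):
    """Polynomial hash of s computed from scratch."""
    value = 0
    for ch in s:
        value = (d * value + ord(ch)) % q
    return value

def roll(prev_hash, out_ch, in_ch, m, d=256, q=101):
    """Slide a window of length m: drop out_ch on the left, append in_ch on the right."""
    high = pow(d, m - 1, q)  # Weight of the leftmost character
    return (d * (prev_hash - ord(out_ch) * high) + ord(in_ch)) % q

sample = "abracadabra"
m = 4
current = window_hash(sample[:m])
all_match = True
for i in range(1, len(sample) - m + 1):
    current = roll(current, sample[i - 1], sample[i + m - 1], m)
    all_match = all_match and current == window_hash(sample[i:i + m])

print(all_match)  # True
```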

## Regular Expressions

In [2]:
import re

# regular expression for letters
test_letters = ["123", "abc", "abc123", "čšž", ""]
results = {word: bool(re.match(r"^[a-zA-ZčšžČŠŽ]*$", word)) for word in test_letters}
results

{'123': False, 'abc': True, 'abc123': False, 'čšž': True, '': True}

In [19]:
# regular expression for numbers (a character-set check only: it also accepts strings such as "..." or "")
test_numbers = ["123", "abc", "123.4", "13e-3"]
results = {number: bool(re.match(r"^[0-9.e-]*$", number)) for number in test_numbers}
results

{'123': True, 'abc': False, '123.4': True, '13e-3': True}
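
The pattern `^[0-9.e-]*$` only restricts which characters may appear, so it also accepts strings that are not numbers. A stricter sketch for decimal numbers with an optional sign and exponent could look like this; `NUMBER_PATTERN` and `looks_like_number` are illustrative names, and the grammar is one reasonable choice rather than a standard:

```python
import re

# Optional sign, then digits with an optional fraction (or a bare fraction),
# then an optional exponent
NUMBER_PATTERN = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")

def looks_like_number(s):
    # fullmatch requires the whole string to match, so no ^ or $ anchors are needed
    return NUMBER_PATTERN.fullmatch(s) is not None

candidates = ["123", "abc", "123.4", "13e-3", "...", "e-", ""]
number_results = {s: looks_like_number(s) for s in candidates}
print(number_results)
```

Only "123", "123.4", and "13e-3" are accepted by this pattern.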

To check if a string is a valid email, we can use a regular expression pattern that captures common characteristics of email addresses. Here's a basic approach using the re library:

- Start with the local part which can contain alphanumeric characters, dots, hyphens, and underscores.
- Followed by the @ symbol.
- Then the domain part which can contain alphanumeric characters, hyphens, and dots.
- The domain part must end with a dot followed by a valid top-level domain (like "com", "org", "net", etc.), typically 2-6 characters long.

In [20]:
def is_valid_email(email):
    # Regular expression pattern for a basic email validation
    pattern = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$"
    return bool(re.match(pattern, email))



In [21]:
# Test the function
test_emails = ["example@email.com", "invalid-email", "another.example@domain.org", "test@.com"]
results = {email: is_valid_email(email) for email in test_emails}

results


{'example@email.com': True,
 'invalid-email': False,
 'another.example@domain.org': True,
 'test@.com': False}

To check if a string is a valid URL, we can use a regular expression pattern that captures common characteristics of URLs. Here's a basic approach using the re library:

- Start with the scheme part, commonly http, https, ftp, etc.
- Followed by ://.
- Then an optional www. prefix.
- Next, the domain name which can contain alphanumeric characters, hyphens, and dots.
- Optionally, a port number prefixed by a colon.
- A path, which can have slashes, alphanumeric characters, hyphens, dots, etc.
- Optionally, a query string starting with ? and containing key-value pairs separated by &.
- Optionally, a fragment starting with #.

In [22]:
def is_valid_url(url):
    # Regular expression pattern for a basic URL validation
    pattern = r'^(https?|ftp):\/\/'  # scheme
    pattern += r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain
    pattern += r'localhost|'  # localhost
    pattern += r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # OR ip
    pattern += r'(?::\d+)?'  # port
    pattern += r'(?:/?|[/?]\S+)$'  # path
    return bool(re.match(pattern, url, re.IGNORECASE))



In [23]:
# Test the function
test_urls = [
    "https://www.example.com",
    "ftp://files.example.com/download/",
    "http:/invalid-url.com",
    "https://localhost:8000/path/to/page?query=value&another=value#fragment"
]
results = {url: is_valid_url(url) for url in test_urls}

results


{'https://www.example.com': True,
 'ftp://files.example.com/download/': True,
 'http:/invalid-url.com': False,
 'https://localhost:8000/path/to/page?query=value&another=value#fragment': True}

In [24]:
def test_13_digits(s):
    pattern = re.compile(r'^\d{13}$')
    return bool(pattern.match(s))

# Test cases
test_strings = [
    "1234567890123",    # 13 digits
    "12345678901234",   # 14 digits
    "12345678901",      # 11 digits
    "123456789012a",    # 13 characters, but last one is not a digit
    " 1234567890123",   # 14 characters, starts with a space
    "1234567890123 "    # 14 characters, ends with a space
]

# Apply the test function to each string
results = {s: test_13_digits(s) for s in test_strings}
results


{'1234567890123': True,
 '12345678901234': False,
 '12345678901': False,
 '123456789012a': False,
 ' 1234567890123': False,
 '1234567890123 ': False}

In [25]:
def test_EMSO(s):
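    # Layout: DD MM YYY (last three digits of the year, starting with 0 or 9),
    # register code 50, then four more digits
    pattern = re.compile(r'^[0-3]\d[01]\d[09]\d\d[5][0]\d{4}$')
    # Alternative:       r"^[0-3][0-9][01][0-9]{3}[5-9][0-9]{2}[0-1][0-9][0-9][0-9]$"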
    return bool(pattern.match(s))

# Test cases
test_strings = [
    "1212000501234",   # 
    "1201999505037",   # 
    "1234567890125",   # 
    "1234567890126",   # 
    "1234567890127",   # 
    "1512961500073"    # 
]

# Apply the test function to each string
results = {s: test_EMSO(s) for s in test_strings}
results

{'1212000501234': True,
 '1201999505037': True,
 '1234567890125': False,
 '1234567890126': False,
 '1234567890127': False,
 '1512961500073': True}

In [26]:
from datetime import datetime

def validate_emso(emso):
    # Step 1: Check if the string has exactly 13 digits
    if not re.match(r"^\d{13}$", emso):
        return False
    
    # Step 2: Extract the date parts and validate the date
    day, month, year = int(emso[:2]), int(emso[2:4]), int(emso[4:7])
    # print(emso)
    # print(day, month, year)
    
    # Handle years in both centuries (1900 and 2000)
    year += 2000 if year < 900 else 1000
    # print(day, month, year)
    
    try:
        datetime(year, month, day)
    except ValueError:
        return False
    
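    # Step 3: Check the gender digit (not implemented)
    
    # Step 4: Validate the control digit (not implemented; the digit is only extracted)
    control_digit = int(emso[12])
    
    # Draft check, left disabled: a plain digit sum is not a weighted checksum
    # if sum(int(digit) for digit in emso[:12]) % 11 != control_digit:
    #     return False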
    
    return True

# Test cases
test_emso_numbers = [
    "0101006500006",  # Valid EMSO number
    "0112996505555",  # 1 Dec 1996 is a valid date; the control digit is not checked
    "0101006505555",  # Passes: only the length and the date are checked
    "0101006500007",  # Wrong control digit, but that check is not implemented
    "01010065055555", # Too many digits
    "ABCDEF6500006",  # Contains letters
    "1512961500073",  # Valid length and date
]

# Apply the validation function to each EMSO number
results = {emso: validate_emso(emso) for emso in test_emso_numbers}
results


{'0101006500006': True,
 '0112996505555': True,
 '0101006505555': True,
 '0101006500007': True,
 '01010065055555': False,
 'ABCDEF6500006': False,
 '1512961500073': True}
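
The validate_emso function above extracts the control digit but never verifies it, which is why every 13-digit string with a valid date passes. The sketch below adds that check under an assumption: it uses the weighted modulo-11 scheme commonly documented for EMŠO (weights 7, 6, 5, 4, 3, 2 repeated twice). Treat the weights and the function names as assumptions to confirm against an official specification:

```python
import re

def emso_control_digit(first_twelve):
    """Control digit for the first 12 digits, or None if no valid digit exists."""
    weights = [7, 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2]  # Assumed weighting scheme
    total = sum(int(digit) * weight for digit, weight in zip(first_twelve, weights))
    remainder = total % 11
    if remainder == 0:
        return 0
    control = 11 - remainder
    return control if control != 10 else None

def has_valid_control_digit(emso):
    # Require exactly 13 ASCII digits before doing any arithmetic
    if not re.fullmatch(r"[0-9]{13}", emso):
        return False
    return emso_control_digit(emso[:12]) == int(emso[12])

print(has_valid_control_digit("0101006500006"))  # True
print(has_valid_control_digit("0101006500007"))  # False
```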

In [11]:
def test_QT_EMSO(s):
    pattern = re.compile(r'^(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[0-2])(\d{2})(\d{3}[A-Z]?)') # Andraz
    # pattern = re.compile(r'^[0-3][0-9][01][0-9]\d{3}[5-9]0\d{3}\d$') #  ^[0-3][0-9][01][0-9]{3}[5-9][0-9]{2}[0-1][0-9][0-9][0-9]$) # Andrej060
    # pattern = re.compile(r'^[01][1-9](0[1-9]|1[0-2])([0-2][0-9]|3[0-1])\d{3}$') #Aleksander
    return bool(pattern.match(s))

# Test cases
test_strings = [
    "0000000500000",   # day, month
    "0101001500070",   # OK
    "1508961505010",   # OK
    "3002999506010",   # February
    "3112000500010",   # OK
    "3310005507010",   # day
    "1513123501010",   # month
    "1709123502010",   # year
    "0806992602010",   # 60
    "6397265046840"    # random
]

# Apply the test function to each string
results = {s: test_QT_EMSO(s) for s in test_strings}
results

{'0000000500000': False,
 '0101001500070': True,
 '1508961505010': True,
 '3002999506010': True,
 '3112000500010': True,
 '3310005507010': False,
 '1513123501010': False,
 '1709123502010': True,
 '0806992602010': True,
 '6397265046840': False}

## String Compression

Run-Length Encoding (RLE) is a basic form of lossless data compression in which runs of data are stored as a single data value and count. For strings, this typically involves representing consecutive repeated characters as a single character followed by the number of repetitions.

In [12]:
def rle_encode(s):
    if not s:
        return ""

    encoding = []
    count = 1

    for i in range(1, len(s)):
        if s[i] == s[i-1]:
            count += 1
        else:
            encoding.append(str(count) + s[i-1])
            count = 1

    encoding.append(str(count) + s[-1])
    return ''.join(encoding)


def rle_decode(s):
    if not s:
        return ""

    decoding = []
    num = ""

    for char in s:
        if char.isdigit():
            num += char
        else:
            decoding.append(char * int(num))
            num = ""

    return ''.join(decoding)


In [13]:
# Test the functions
test_string = "AAABBBCCDAAAAACCCCCC"
encoded = rle_encode(test_string)
decoded = rle_decode(encoded)

test_string, encoded, decoded


('AAABBBCCDAAAAACCCCCC', '3A3B2C1D5A6C', 'AAABBBCCDAAAAACCCCCC')
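
The run-grouping step of RLE maps directly onto `itertools.groupby`, which yields each run of equal consecutive characters. As an alternative sketch (the function name `rle_encode_groupby` is made up here), the encoder can be written as:

```python
from itertools import groupby

def rle_encode_groupby(s):
    # groupby yields (character, iterator over the run of that character)
    return "".join(f"{sum(1 for _ in run)}{char}" for char, run in groupby(s))

print(rle_encode_groupby("AAABBBCCDAAAAACCCCCC"))  # 3A3B2C1D5A6C
```

Like the loop-based version, this format is only unambiguous when the input itself contains no digits.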

## Huffman Coding

Huffman coding is a popular method for lossless data compression. The basic idea is to encode frequently occurring characters with shorter codes and less frequent characters with longer codes. The Huffman algorithm uses a priority queue (or a sorted list) of nodes based on character frequency, and then it builds the Huffman Tree, from which the Huffman codes are derived.

- Create a frequency dictionary from the input string.
- Build the Huffman Tree.
- Build the Huffman code table.
- Encode the input string.
- Decode the encoded data.

In [14]:
import heapq
from collections import Counter

class Node:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        return self.freq < other.freq


def build_huffman_tree(s):
    frequency = dict(Counter(s))
    priority_queue = [Node(char, freq) for char, freq in frequency.items()]
    heapq.heapify(priority_queue)

    while len(priority_queue) > 1:
        left = heapq.heappop(priority_queue)
        right = heapq.heappop(priority_queue)

        merged = Node(None, left.freq + right.freq)
        merged.left = left
        merged.right = right

        heapq.heappush(priority_queue, merged)

    return priority_queue[0]


def build_huffman_codes(root):
    codes = {}

    def build_codes_recursive(node, current_code):
        if node is None:
            return

        if node.char is not None:
            codes[node.char] = current_code
            return

        build_codes_recursive(node.left, current_code + "0")
        build_codes_recursive(node.right, current_code + "1")

    build_codes_recursive(root, "")
    return codes


def huffman_encode(s):
    root = build_huffman_tree(s)
    huffman_code = build_huffman_codes(root)

    encoded = ''.join([huffman_code[char] for char in s])
    return encoded, root


def huffman_decode(encoded, root):
    decoded = []
    current = root

    for bit in encoded:
        if bit == '0':
            current = current.left
        else:
            current = current.right

        if current.char:
            decoded.append(current.char)
            current = root

    return ''.join(decoded)


In [15]:
# Test the Huffman encoding and decoding
test_string = "BCCABBDDAEAFAAAAAACCCCCCCCCC"
encoded_data, tree_root = huffman_encode(test_string)
decoded_data = huffman_decode(encoded_data, tree_root)

test_string, encoded_data, decoded_data


('BCCABBDDAEAFAAAAAACCCCCCCCCC',
 '100001110010010101010111011111101101111111111110000000000',
 'BCCABBDDAEAFAAAAAACCCCCCCCCC')

In [16]:
def print_huffman_tree(root, level=0, prefix=""):
    """Recursively print the Huffman tree."""
    if not root:
        return

    # If it's a leaf, print the character
    if root.char:
        print(f"{' ' * (level * 2)}[{prefix}] {root.char}")
    else:
        # If it's an internal node, print its frequency
        print(f"{' ' * (level * 2)}[{prefix}] {root.freq}")

    # Recursive calls for the left and right children
    print_huffman_tree(root.left, level + 1, "0")
    print_huffman_tree(root.right, level + 1, "1")


# Display the Huffman tree and Huffman codes
huffman_codes = build_huffman_codes(tree_root)
print("Huffman Tree:")
print_huffman_tree(tree_root)
print("\nHuffman Codes:")
for char, code in huffman_codes.items():
    print(f"{char}: {code}")


Huffman Tree:
[] 28
  [0] C
  [1] 16
    [0] 7
      [0] B
      [1] 4
        [0] D
        [1] 2
          [0] F
          [1] E
    [1] A

Huffman Codes:
C: 0
B: 100
D: 1010
F: 10110
E: 10111
A: 11
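
With the code table printed above, the size of the encoded message can be compared with a fixed 8 bits per character. This is a rough sketch: it copies the table from the output and ignores the cost of storing the tree alongside the data:

```python
from collections import Counter

text = "BCCABBDDAEAFAAAAAACCCCCCCCCC"
# Code table copied from the printed Huffman codes
codes = {"C": "0", "B": "100", "D": "1010", "F": "10110", "E": "10111", "A": "11"}

frequencies = Counter(text)
encoded_bits = sum(frequencies[ch] * len(code) for ch, code in codes.items())
fixed_width_bits = 8 * len(text)

print(encoded_bits, fixed_width_bits)  # 57 224
```

The most frequent character, C, gets the one-bit code, which is where most of the saving comes from.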


While both Huffman coding and Morse code use variable-length codes to represent characters, their purposes and methodologies differ. Huffman coding is a method of data compression, while Morse code is a telecommunication encoding scheme. However, both can be visualized using binary trees, and both have historical significance in their respective fields.

## LZW (Lempel-Ziv-Welch)

LZW is a universal lossless data compression algorithm. It's based on the idea of replacing repeated occurrences of data with references to a dictionary.

Compression:

- Initialize the dictionary with all possible characters.
- Search for the longest string W in the dictionary that matches the current input.
- Emit the dictionary index for W.
- Add W followed by the next symbol in the input to the dictionary.
- Start the next cycle with the symbol just read.

Decompression:

- Initialize the dictionary with all possible characters.
- Read one value from the compressed input and output the corresponding string from the dictionary.
- Read the next value and do the following:
  - Find the corresponding string for this value.
  - Output the string.
  - Add the string from the last value followed by the first character of the current string to the dictionary.
- Continue with the next value.

In [17]:
def lzw_compress(s):
    """Compress a string using the LZW algorithm; return the list of output codes and the final dictionary."""
    # Initialize the dictionary with individual characters
    dictionary = {chr(i): i for i in range(256)}
    current_string = ""
    compressed = []
    # print(compressed)

    for char in s:
        combined = current_string + char
        if combined in dictionary:
            current_string = combined
        else:
            compressed.append(dictionary[current_string])
            dictionary[combined] = len(dictionary)
            current_string = char
        # print(char, 'combined: ', combined, 'current: ', current_string, 'compressed: ', compressed)

    if current_string:
        compressed.append(dictionary[current_string])
    # print(compressed)
    # print()

    return compressed, dictionary


In [18]:
def lzw_decompress(compressed):
    """Decompress a list of output codes to a string."""
    if not compressed:
        return ""  # Nothing to decode
    # Initialize the dictionary with individual characters
    dictionary = {i: chr(i) for i in range(256)}
    current_string = dictionary[compressed[0]]
    decompressed = [current_string]
    # print(decompressed)

    for k in compressed[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == len(dictionary):
            entry = current_string + current_string[0]
        else:
            raise ValueError("Invalid compressed data")

        decompressed.append(entry)
        dictionary[len(dictionary)] = current_string + entry[0]
        current_string = entry
        # print(k, entry, len(dictionary)-1, list(dictionary.keys())[len(dictionary)-1], list(dictionary.values())[len(dictionary)-1])
        # print(decompressed)
        # print()

    return ''.join(decompressed)


In [19]:
# Test the LZW compression and decompression
test_string = "ABABABABACCCCCCAAAAAA"
# test_string = "Testing consists of several consecutive tests and tests and tests"
compressed_data, lzw_dict = lzw_compress(test_string)
decompressed_data = lzw_decompress(compressed_data)

test_string, compressed_data, decompressed_data


('ABABABABACCCCCCAAAAAA',
 [65, 66, 256, 258, 257, 67, 261, 262, 65, 264, 265],
 'ABABABABACCCCCCAAAAAA')

In [26]:
# Convert the dictionary to a list of tuples and sort by value for visualization
sorted_lzw_dict = sorted(lzw_dict.items(), key=lambda x: x[1])

sorted_lzw_dict[256:]

[('AB', 256),
 ('BA', 257),
 ('ABA', 258),
 ('ABAB', 259),
 ('BAC', 260),
 ('CC', 261),
 ('CCC', 262),
 ('CCCA', 263),
 ('AA', 264),
 ('AAA', 265)]
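
The output codes only save space once they are packed into bits. As a back-of-the-envelope sketch, assume every code is stored with the same fixed width, just wide enough for the largest code. Real LZW implementations often grow the code width as the dictionary fills, so this is an assumption, not the only scheme:

```python
original = "ABABABABACCCCCCAAAAAA"
# Output codes copied from the compression result above
codes = [65, 66, 256, 258, 257, 67, 261, 262, 65, 264, 265]

bits_per_code = max(codes).bit_length()  # Width needed for the largest code
compressed_bits = bits_per_code * len(codes)
original_bits = 8 * len(original)

print(bits_per_code, compressed_bits, original_bits)  # 9 99 168
```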

In [27]:
sorted_lzw_dict[32:67]


[(' ', 32),
 ('!', 33),
 ('"', 34),
 ('#', 35),
 ('$', 36),
 ('%', 37),
 ('&', 38),
 ("'", 39),
 ('(', 40),
 (')', 41),
 ('*', 42),
 ('+', 43),
 (',', 44),
 ('-', 45),
 ('.', 46),
 ('/', 47),
 ('0', 48),
 ('1', 49),
 ('2', 50),
 ('3', 51),
 ('4', 52),
 ('5', 53),
 ('6', 54),
 ('7', 55),
 ('8', 56),
 ('9', 57),
 (':', 58),
 (';', 59),
 ('<', 60),
 ('=', 61),
 ('>', 62),
 ('?', 63),
 ('@', 64),
 ('A', 65),
 ('B', 66)]