Skip to content
turbolag edited this page Jul 15, 2019 · 4 revisions

This tutorial presents some of the commonly used algorithms and techniques for solving string problems.

Hashing

The most basic problem encountered to any coder is the string search problem. Here we are supposed to find the occurrence of a pattern in big text document. In this section we will discuss the technique of hashing to solve this problem.

The underlying principle behind this approach is to compare hash values, hash values are integers, of the pattern with all the substrings of the text rather than comparing the substring with the pattern character by character. If the hash of the both the pattern and substring are equal we can actually compare them. If the hash function is good, less collision and even distribution, we can avoid a lot useless comparisons. But this involves calculating hash for every substring of the text.

Suppose the length of the pattern is m and length of text is n. For calculating hash of every substring of length m we need to calculate hash of (n - m) substrings. To avoid this we can make use of a special hash function called rolling hash function. The hash function helps in getting the hash value of the new string from the previous substring using only O(1), constant, operation.

Let's see how this can be done: If the total number of characters in a character-set is 9, we can use integers to uniquely determine each string. Each character can be given values from 1 to 9 and position can then be determined by the 10th factor.

For example:

[b = 1, c = 2, d = 3, e = 4, f = 5, g = 6, h = 7, i = 8, j = 9] 
str = "bdfg" 
hash(str) = 1356

But in reality the numbers of characters in not limited to 10. In such case to make the hash value fit in the integer range we can take mod of the hash value with a big prime number.

d = number of characters in the character-set
p = a large prime number

For calculating the hash value of the string str

int hash_pattern = 0; 
for(int i = 0; i < m; i ++){
	hash_pattern *= d; 
	hash_pattern += str[i]
	hash_pattern %= p;
}

In the above code we calculated hash value for a string of length m.

hash(str(l, r)) = V
hash(str(l + 1, r + 1)) = (d * (V - str[l] * pow(d, m - 1)) + str[r + 1]) % p

we can pre-calculate pow(d, m - 1)
h = pow(d, m - 1)

hash(str(l + 1, r + 1)) = (d * (V - str[l] * h) + str[r + 1]) % p

The reason we call it rolling hash is because we are calculating the hash value of the new string, str(l + 1, r + 1), from str(l, r). It is almost like we shifted a window of size r - l + 1 by l. Therefore we can see that we can calculate the value of hash of all the substrings of a particular length in O(n).

So in essence we calculate hash of each substring of length m and compare it with the hash value of the pattern. If the hash values are equal we compare the actual string otherwise there is no need to compare.

TODO: Sample Code;

Z Function

This function calculates for each index i, 0 <= i <= n, the length of the substring, k, for which input.substr(0, k) == input.substr(i, k).

For example:

input = "aaabaab"
z[0] = 0
z[1] = 2
z[2] = 1
z[3] = 0
z[4] = 2
z[5] = 1
z[6] = 0

Naive Method:

vector<int> z_function(string& input){
	int n = input.size();
	vector<int> z(n);

	for(int i = 1; i < n; i ++){
		int counter = 0, index = i;
		for(int j = 0; input[index] == input[j] ; j ++, index ++, counter ++);

		z[i] = counter;
	}

	return z;
}

Time Complexity: O(n^2)

Linear Time Algorithm:

If we look carefully at the above algorithm we can see that computation of the previous steps can be used to compute the value for the current step.

For example: During the ith step we scan the input from i to i + k where k is the value of the z function. This means that substring starting from i to i + k is equal to substring from 0 to k according to the definition of z function. Let us denote i as l and i + k as r. Therefore we can say that for j , l < j <= r, the value of z function can be approximated by z[j - l] rather than assigning 0 as the initial value. But there is a caveat that the value z[j - l] can exceed the length of the string. To avoid that we can use z[j] = min(z[j - l], r - j + 1). And if j does not lie in the range l to r, we can initialize z[j] to 0. After initializing z[j] to appropriate values depending on the values of l, r and j itself we can run the trivial algorithm to expand values of z[j]. After calculating the new value of z function we can update the value of l and r if z[j] + j - 1 > r.

vector<int> z_function(string& input){
	int n = input.size();
	vector<int> z(n);

	for(int i = 1, l = 0, r = 0; i < n; i ++){
		if(i < r){
			z[i] = min(z[i - l], r - i + 1);
		}

		while(i + z[i] < n && input[z[i]] == input[i + z[i]]){
			z[i] ++;
		}

		if(i + z[i] - 1 > r){
			l = i, r = i + z[i] - 1;
		}
	}

	return z;
}

Time Complexity = O(n)

Suffix Array

As the name suggests Suffix Array is a collection of all the suffixes of the input.

For example: Denoting Suffix Array by sa[]

input = "banana"
sa[0] = "banana"
sa[1] = "anana"
sa[2] = "nana"
sa[3] = "ana"
sa[4] = "na"
sa[5] = "n"

This collection of string can be very useful if we can sort them. Using quick sort to sort all the strings will be very time consuming. Total time taken by quicksort is O(n * n lg n) as each comparison can take O(n) time(comparing two strings can take O(n) time in the worst case scenario). Instead of using quicksort we can use Most Significant Bit Sort.

The total space taken by all the suffixes is (n + n - 1 + n - 2 + .. 1) = O(n^2) which is very high. To avoid this we can make an array called rank as shown in the code below. The array rank stores the index of starting character of the suffix.

struct SuffixArray {
    string str;
    vector<int> ranks;
    int length;

    SuffixArray(string& input) : length(input.size()), str(input) {
        for (int i = 0; i < length + 1; i++) {
            ranks.push_back(i - 1);
        }
    }

    void build() {
        vector<int> aux(length + 1);
        msbSort(&ranks[0], &aux[0], 0, length, 0);
    }

    void msbSort(int* ranks, int* aux, int low, int high, int offset) {
        if (low >= high) {
            return;
        }

        // 256 + 1
        // when the string is shorter than the offset, we assign a special negative value which
        // also needs to be accomodated in the count array.
        int count[256 + 1] = { 0 };

        for (int i = low; i <= high; i++) {
            int ch = ranks[i] == -1 | ranks[i] + offset >= length ? -1 : str[ranks[i] + offset];
            count[ch + 1] ++;
        }

        for (int i = 1; i < 256 + 1; i++) {
            count[i] += count[i - 1];
        }

        for (int i = low; i <= high; i++) {
            int ch = ranks[i] == -1 | ranks[i] + offset >= length ? -1 : str[ranks[i] + offset];
            // 1 is added to every charater so that '-1' can also be accomodated.
            int index = --count[ch + 1];
            aux[index] = ranks[i];
        }

        for (int i = low; i <= high; i++) {
            ranks[i] = aux[i - low];
        }

        for (int i = 0; i < 256; i++) {
            if (count[i + 1] - count[i] <= 1) {
                continue;
            }

            msbSort(ranks, aux, low + count[i], low + count[i + 1] - 1, offset + 1);
        }
    }
};
Clone this wiki locally