# Keyword matches

The string `"Hello World"` contains the substring `"Hello"` once, `"World"` once, and the substring `"o"` twice. 

Let's write a function that, given a substring `sub`, counts how many times it appears in a string `full`.
We will allow substrings to overlap. For example `"ahaha"` contains `"aha"` twice!


In [136]:
{- Example Solution -}

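import Data.List (isPrefixOf)

-- Count the (possibly overlapping) occurrences of `sub` in `full`.
matchCount :: Eq a => [a] -> [a] -> Int
matchCount _ [] = 0
matchCount sub full@(_ : rest) =
  -- A match starts here if `sub` is a prefix of what remains.
  let inc = if sub `isPrefixOf` full then 1 else 0
   in inc + matchCount sub rest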


And testing the matches:

In [137]:
matchCount "Hello" "Hello World" == 1
matchCount "World" "Hello World" == 1
matchCount "o" "Hello World" == 2
matchCount "aha" "ahaha" == 2

True

True

True

True
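The same "try every starting position" idea can also be sketched without explicit recursion, by testing the pattern against every suffix of the string. The name `matchCountTails` is made up for this illustration. One difference to be aware of: `tails` also yields the empty suffix, so an empty `sub` is counted one more time here than by the recursive version.

```haskell
import Data.List (isPrefixOf, tails)

-- Hypothetical alternative: count the suffixes of `full` that start with `sub`.
-- Each suffix corresponds to one starting position, so overlaps are included.
matchCountTails :: Eq a => [a] -> [a] -> Int
matchCountTails sub full = length (filter (sub `isPrefixOf`) (tails full))
```

For example, `matchCountTails "aha" "ahaha"` evaluates to `2`, because both `"ahaha"` and `"aha"` are suffixes that begin with `"aha"`.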

It is also worthwhile writing a function `matchAll` which, given a list of substrings, counts all matches and returns the total.

In [138]:
{- Example Solution -}

matchAll :: Eq a => [[a]] -> [a] -> Int
matchAll cs ss = foldl (\i c -> i + matchCount c ss) 0 cs


And testing it:

In [139]:
matchAll ["Hello", "World"] "Hello World" == 2
matchAll ["Hello", "World", "o"] "Hello World" == 4

True

True
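Since the fold only adds up one count per keyword, the same result can be sketched with `sum` over a list comprehension. The name `matchAllSum` is hypothetical, and `matchCount` is repeated here only so the snippet stands on its own. Note that a keyword listed twice is counted twice.

```haskell
import Data.List (isPrefixOf)

-- Count the (possibly overlapping) occurrences of `sub` in `full`.
matchCount :: Eq a => [a] -> [a] -> Int
matchCount _ [] = 0
matchCount sub full@(_ : rest) =
  (if sub `isPrefixOf` full then 1 else 0) + matchCount sub rest

-- Hypothetical variant of matchAll: one count per keyword, then add them up.
matchAllSum :: Eq a => [[a]] -> [a] -> Int
matchAllSum subs full = sum [matchCount sub full | sub <- subs]
```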

One question we can ask ourselves is: how many possible unique substrings can a string match on?

After all, a string is finite in length, so it can only match on a finite number of keywords.

Taking the string `"Rob"` as an example, it must match the substrings `"R"`, `"o"`, and `"b"`. But also `"Ro"`, `"ob"` and `"Rob"` itself! That's 6 total substrings that fit in `"Rob"`.
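To check a count like this by hand, one option is to list every contiguous substring directly. The sketch below assumes a helper named `substrings` (not part of `Data.List`): for each suffix of the string it takes every non-empty prefix, which gives one entry per starting position and length.

```haskell
import Data.List (inits, tails)

-- Hypothetical helper: every non-empty contiguous substring,
-- one entry per (start position, length) pair.
substrings :: [a] -> [[a]]
substrings xs = [s | t <- tails xs, s <- drop 1 (inits t)]
```

With this definition, `substrings "Rob"` evaluates to `["R","Ro","Rob","o","ob","b"]`, the same six substrings listed above.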

It would be good to understand this pattern for any length of string. To do this we'll look to `Data.List`, which contains two useful functions: [subsequences](https://hackage.haskell.org/package/base-4.20.0.1/docs/Data-List.html#v:subsequences) and [nub](https://hackage.haskell.org/package/base-4.20.0.1/docs/Data-List.html#v:nub).

`nub` takes a list and returns the **unique** elements of the list, keeping the first occurrence of each, e.g. the list `['b', 'o', 'b']` becomes `['b', 'o']`.

In [149]:
import Data.List (nub, subsequences)

nub "bob"

"bo"

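`subsequences` takes a list and returns all **subsequences** of that list: every sublist obtained by dropping zero or more elements while keeping the rest in order. The elements need not be adjacent, so for `['b', 'o', 'b']` the result includes `"bb"` as well as the contiguous substrings.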


In [150]:
subsequences "bob"

["","b","o","bo","b","bb","ob","bob"]

Combining these two functions, we can find all unique subsequences:

In [151]:
uniqueSubseqences = nub . subsequences

In [152]:
uniqueSubseqences "bob"

["","b","o","bo","bb","ob","bob"]
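Note that not every subsequence is a substring: `"bb"` skips over the `'o'`, so it never appears contiguously in `"bob"`. As a small sketch, the contiguous ones can be picked out with `isInfixOf` from `Data.List`. The name `contiguousOnly` is made up for this illustration.

```haskell
import Data.List (isInfixOf, nub, subsequences)

uniqueSubseqences :: Eq a => [a] -> [[a]]
uniqueSubseqences = nub . subsequences

-- Hypothetical helper: keep only the subsequences that occur
-- contiguously in the original list.
contiguousOnly :: Eq a => [a] -> [[a]]
contiguousOnly full = filter (`isInfixOf` full) (uniqueSubseqences full)
```

Here `contiguousOnly "bob"` evaluates to `["","b","o","bo","ob","bob"]`, which is the earlier result with `"bb"` removed.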

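Using `uniqueSubseqences`, write a function that counts every match of every unique substring in a string, **ignoring empty substrings** (i.e. ignoring `""`). Subsequences that are not contiguous, such as `"bb"` in `"bob"`, simply match zero times.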

Hint: use the `matchAll` you wrote previously.

In [154]:
{- Example solution -}


countTotalSubstrings ss = matchAll (filter ([] /=) $ uniqueSubseqences ss) ss

We can then test it against different inputs and see if we can find a pattern:

In [155]:
countTotalSubstrings "H"     == 1
countTotalSubstrings "He"    == 3
countTotalSubstrings "Hel"   == 6
countTotalSubstrings "Hell"  == 10
countTotalSubstrings "Hello" == 15

True

True

True

True

True

The same holds for any string you might try, and a general pattern begins to emerge: each count is the previous count plus the length of the new string!

| Word    | Substring count |
|---------|-----------------|
| "H"     | 1               |
| "He"    | 2+1=3           |
| "Hel"   | 3+2+1=6         |
| "Hell"  | 4+3+2+1=10      |
| "Hello" | 5+4+3+2+1=15    |
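The "previous count plus the new length" rule in the table can be written down as a recurrence. The sketch below uses two hypothetical names, `substringCountRec` and `firstTriangles`, and works on the length of the string rather than the string itself.

```haskell
-- Hypothetical helper: the count for a string of length n, built from the
-- recurrence count(n) = count(n - 1) + n, with count(0) = 0.
substringCountRec :: Int -> Int
substringCountRec n
  | n <= 0 = 0
  | otherwise = n + substringCountRec (n - 1)

-- The same numbers as running totals of 1, 2, 3, ...
firstTriangles :: Int -> [Int]
firstTriangles k = take k (scanl1 (+) [1 ..])
```

Both `map substringCountRec [1 .. 5]` and `firstTriangles 5` evaluate to `[1,3,6,10,15]`, matching the right-hand column of the table.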

This pattern is also known as a **Triangle number** because we can draw this out as a triangle:

```
    *     <- length == 1
   * *    <- length == 2
  * * *   <- length == 3
 * * * *  <- length == 4
* * * * * <- length == 5
```

In fact, there is a simple formula for finding the triangle number!

$$
T_n = \frac{ n \cdot (n+1) } 2 
$$
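One way to see where this formula comes from is to write the sum twice, once forwards and once backwards, and add the two rows term by term. Every column then adds up to the same value, and there are as many columns as terms in the sum:

$$
\begin{aligned}
2 T_n &= (1 + 2 + \dots + n) + (n + (n-1) + \dots + 1) \\
      &= \underbrace{(n+1) + (n+1) + \dots + (n+1)}_{n \text{ terms}} \\
      &= n \cdot (n+1)
\end{aligned}
$$

Dividing both sides by 2 gives the formula above. For the word `"Hello"` the length is 5, so the result is (5 · 6) / 2 = 15.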

Can you write a function that represents the triangle number calculation?

In [158]:
{- Example solution -}

triangle n = (n * (n + 1)) `div` 2

In [159]:
countTotalSubstrings "H"     == triangle (length "H")
countTotalSubstrings "He"    == triangle (length "He")
countTotalSubstrings "Hel"   == triangle (length "Hel")
countTotalSubstrings "Hell"  == triangle (length "Hell")
countTotalSubstrings "Hello" == triangle (length "Hello")

True

True

True

True

True
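The checks above all use prefixes of `"Hello"`. As a rough sketch, the comparison can be run over a few more inputs, including the empty string and words with repeated letters. The names `agreesWithTriangle` and `sampleWords` are made up for this illustration, and the definitions are repeated (with `sum` in place of the fold) so the snippet stands on its own. Keep the words short: `subsequences` produces 2^n lists for a word of length n.

```haskell
import Data.List (isPrefixOf, nub, subsequences)

matchCount :: Eq a => [a] -> [a] -> Int
matchCount _ [] = 0
matchCount sub full@(_ : rest) =
  (if sub `isPrefixOf` full then 1 else 0) + matchCount sub rest

-- Total matches over all unique, non-empty subsequences of the input.
countTotalSubstrings :: Eq a => [a] -> Int
countTotalSubstrings ss =
  sum [matchCount sub ss | sub <- nub (subsequences ss), not (null sub)]

triangle :: Int -> Int
triangle n = (n * (n + 1)) `div` 2

-- Hypothetical check: does the brute-force count equal the triangle number?
agreesWithTriangle :: String -> Bool
agreesWithTriangle s = countTotalSubstrings s == triangle (length s)

-- Arbitrary sample inputs chosen for this sketch.
sampleWords :: [String]
sampleWords = ["", "a", "aa", "bob", "ahaha", "Hello"]
```

Evaluating `all agreesWithTriangle sampleWords` gives `True`. Repeated letters do not change the total, because every occurrence of a substring corresponds to exactly one starting position and length.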