# Text Searching/Processing: Regular Expressions

Regular expressions (or _REGEX_) are a fundamental part of text searching and processing. They provide a robust and flexible way to define patterns of characters within text documents, with uses such as pattern matching and term extraction. With regex and some custom code, we can provide search capabilities for a small corpus.


## Typical REGEX Tasks

 * Search and Extract
 * Search and Replace
 * Search and Count

**References:**
 * https://en.wikipedia.org/wiki/Regular_expression
 * [Python Regular Expressions](https://docs.python.org/3/library/re.html)


Read more about the basics of REGEX in [this section](https://en.wikipedia.org/wiki/Regular_expression#Basic_concepts).


Here are some _Cheat Sheets_ for constructing regular expressions:
 * https://www.debuggex.com/cheatsheet/regex/python
 * https://web.dsa.missouri.edu/static/PDF/python-regular-expressions-cheat-sheet.pdf
 * [Cheat Sheet + Testing Playground](http://www.pyregex.com/)


**Below are a number of examples to play with.**

In [None]:
# The Python library
import re

## Search

In [None]:
text_to_search = 'What is the frequency Kenneth!?'

In [None]:
query_text = r'the'
searched = re.search(query_text, text_to_search)

query_text = r'What'
also_searched = re.search(query_text, text_to_search)

print(searched)
print(also_searched)
print(type(searched))

The `r` at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions, so it is recommended to always write pattern strings with the `r` prefix as a habit. The search results are returned as a `Match` object, or `None` when nothing is found.
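
As a small sketch of why the raw prefix matters, the cell below compares the same pattern written with and without `r`. The variable names and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Without the r prefix, Python itself turns "\b" into a backspace character
# before the regex engine ever sees the pattern.
plain = '\bthe\b'
raw = r'\bthe\b'

print(len(plain))  # 5: backspace, t, h, e, backspace
print(len(raw))    # 7: backslash, b, t, h, e, backslash, b

sentence = 'What is the frequency Kenneth!?'  # illustrative sample text
plain_result = re.search(plain, sentence)  # looks for literal backspaces
raw_result = re.search(raw, sentence)      # \b is a word boundary here

print(plain_result)  # None
print(raw_result)    # a Match object for 'the'
```

With the raw string, `\b` reaches the regex engine intact and is treated as a word boundary.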

### Extracting Data from the RE objects

As you can see above, we did not get back strings but `Match` objects with attributes and methods.
We have to use the API to extract matches.
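
A minimal sketch of that API follows, using a few standard `Match` methods. The names `sample` and `m` are assumptions made for this illustration.

```python
import re

sample = 'What is the frequency Kenneth!?'  # illustrative sample text
m = re.search(r'the', sample)

if m is not None:
    print(m.group(0))                 # the matched text: 'the'
    print(m.start(), m.end())         # 8 11
    print(m.span())                   # (8, 11)
    print(sample[m.start():m.end()])  # slicing with the span gives 'the' again
```

Checking for `None` before calling any method avoids an `AttributeError` when nothing was found.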

<!--
This may seem strange. 
Since we tried to match on '`the`', so why print '`the`'?
You will see as we progress, the key is that we used a **pattern**, not a literal.
-->

In [None]:
if searched:
    print(searched.group(0))
else:
    print('Nothing found')

In [None]:
also_searched.group(0)

In [None]:
query_text = r'fellow'
not_matched = re.match(query_text, text_to_search)
print(type(not_matched))

In [None]:
if not_matched:
    print(not_matched.group(0))
else:
    print('Nothing matched')
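
Note that the cell above used `re.match` rather than `re.search`. As a hedged sketch of the difference: `re.match` only tries the pattern at the very start of the string, while `re.search` scans the whole string. The variable names below are assumptions for illustration.

```python
import re

sample = 'What is the frequency Kenneth!?'  # illustrative sample text

at_start = re.match(r'What', sample)         # succeeds: 'What' is at position 0
mid_match = re.match(r'frequency', sample)   # None: not at the start
mid_search = re.search(r'frequency', sample) # succeeds: found later in the string

print(at_start)
print(mid_match)
print(mid_search)
```

So `re.match` returning `None` does not always mean the term is absent; it may simply not be at the beginning.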

## Search and Extract (match)

The goal of matching is to find all the instances of a pattern!
Specifically, we typically want to extract them.

### Find All (multiple matches, counting)

What happens when we have multiple instances of the pattern?
Well, the REGEX should get us all instances!

In [None]:
query_text = r'frequency'
found = re.findall(query_text, text_to_search)
print(type(found))

In [None]:
print(found)

Notice we got a list! 
(A list of one item.)
Since a list is a familiar Python object, the `findall` concept is typically very useful.
Think back to **BeautifulSoup** parsing and searches,
where asking for all matches instead of a single `find` made the downstream code more robust, because an empty list is easier to handle than `None`.
For instance, when the term was not found earlier with `re.match`, we saw:
```
<class 'NoneType'>
```
Compare that to the below!

In [None]:
query_text = r'fellow'
not_matched = re.findall(query_text, text_to_search)
print(not_matched)

Note, an empty list!
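
The practical benefit can be sketched as follows: an empty list can be looped over or counted without any special case, whereas a `None` result needs an explicit check. The names `sample`, `hit`, `message`, and `e_count` are assumptions for illustration.

```python
import re

sample = 'What is the frequency Kenneth!?'  # illustrative sample text

# Looping over an empty list is safe: the body simply never runs.
for word in re.findall(r'fellow', sample):
    print(word)

# With search we must guard against None before using the result.
hit = re.search(r'fellow', sample)
message = hit.group(0) if hit else 'Nothing found'
print(message)

# Counting is just the length of the list.
e_count = len(re.findall(r'e', sample))
print(e_count)  # 5
```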

------

#### Now, the power is in the multi-occurrence!

In [None]:
text_to_search = "frog bog log cog nog fog food"

### Consult the cheat sheets to fully understand!
txt_pattern = r'\wo\w?'
## A breakdown of this RE:
## \w == a word character (letter, digit, or underscore)
## o == the literal letter o
## \w == another word character
## ? == makes the preceding \w optional

found = re.findall(txt_pattern, text_to_search)
print(type(found))

In [None]:
print(found)
print("Num of matches: {}".format(len(found)))

We see that from the **`\wo\w?`** pattern we found multiple words.

To recap, based on your reading of the comments and the cheat sheets,
we expect a pattern as follows:
 1. Any word character (letter, digit, or underscore)
 1. the letter 'o'
 1. Optionally, one more word character
 
**Take special note of the last match: `foo` from `food`. It didn't include the other overlapping match, `ood`, because `findall` returns only non-overlapping matches. We can include the overlapping matches with a positive look-ahead assertion (see [here](https://junli.netlify.app/en/overlapping-regular-expression-in-python/)).**


In [None]:
found = re.findall(r'(?=(\wo\w?))', text_to_search)
print(found)

---

## Substitution (sub)

In the next example, we are using an exclusionary (negated) pattern: a **`^`** as the first character within a set of characters **`[]`** matches any character that is _not_ in the set.

In [None]:
text_to_change = "frog bog log cog nog fog"
more_text_to_change = "frog bog log cog nog fog schlog nschlog grog"

In [None]:
regex_sub = r'[^og ]{1,2}'
#find not o and not g and not space
#1 or 2 of these things
subbed = re.sub(regex_sub, 'd', text_to_change)
subbed_again = re.sub(regex_sub, 'd', more_text_to_change)
print(type(subbed))

In [None]:
print(subbed)
print(subbed_again)

We see that each time we matched a run of characters that were none of the following:
 1. o
 1. g
 1. _a space_

Then, we replaced that match with the letter **`d`**.
A key aspect is the `{1,2}`, which specifies a run of one or two characters that are not any of those three.
That is how `frog` changed to `dog`: the two characters `fr` were replaced by a single `d`.

You might want to see how the output changes using `r'[^og ]{1}'` or `r'[^og ]{2}'`.
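
As a quick side-by-side sketch of that suggestion, the cell below runs all three quantifiers over one short string. The sample string `words` and the `results` dictionary are assumptions made for this illustration.

```python
import re

words = "frog bog schlog"  # illustrative sample text
results = {}

for pattern in (r'[^og ]{1}', r'[^og ]{1,2}', r'[^og ]{2}'):
    # Replace each run that satisfies the quantifier with a single 'd'
    results[pattern] = re.sub(pattern, 'd', words)
    print(pattern, '->', results[pattern])

# [^og ]{1}   -> ddog dog ddddog   (every single character replaced)
# [^og ]{1,2} -> dog dog ddog      (pairs collapse to one 'd')
# [^og ]{2}   -> dog bog ddog      (lone characters like 'b' are left alone)
```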




In [None]:
regex_sub = r'[^og ]{2}'
#find not o and not g and not space
#exactly 2 of these things in a row
subbed = re.sub(regex_sub, 'd', text_to_change)
subbed_again = re.sub(regex_sub, 'd', more_text_to_change)
print(f"Given: {text_to_change}")
print(f"Subs.: {subbed}")
print(f"Given: {more_text_to_change}")
print(f"Subs.: {subbed_again}")



## Tokenization (split)

Classically, we see the tokenization or splitting of data on particular characters, such as line breaks (`\n`) and field separators (`,`). What if we want to split on a class of characters?


In [None]:
text_to_split = 'This1is2some34text567to89split'

In [None]:
split_text = re.split(r'\d+', text_to_split)
print(split_text)

Above, we see that we are matching on a digit (`\d`), specifying _one or more_ with the **`+`**.
This allows the string to split on the following:
 * 1
 * 2
 * 34
 * 567
 * 89

You will observe this is much more powerful than splitting on commas or some other single character alone.
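
The same idea extends to any character class. The sketch below splits on a mix of commas, semicolons, and whitespace, and also shows that wrapping the pattern in a capturing group keeps the delimiters in the output. The sample strings and variable names are assumptions for illustration.

```python
import re

record = 'alpha, beta;gamma  delta'  # illustrative sample text

# Split on one or more commas, semicolons, or whitespace characters
fields = re.split(r'[,;\s]+', record)
print(fields)  # ['alpha', 'beta', 'gamma', 'delta']

# With a capturing group, the delimiters are kept in the result
with_delims = re.split(r'(\d+)', 'This1is2some34text')
print(with_delims)  # ['This', '1', 'is', '2', 'some', '34', 'text']
```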

---

## Caution: Greed

REGEX quantifiers such as `*` and `+` are _greedy_ by default, meaning they will match the largest region possible.

In the example below: 
 1. "`.`" matches any character
 1. "`*`" is a _zero or more_ multiplier for the character match.

In [None]:
html = '<h1>Header!</h1> <p>Paragraph!</p>'

In [None]:
# Match an opening angle bracket, any characters, then a closing angle bracket
text_pattern = r'<.*>'
greedy = re.findall(text_pattern, html)
print(greedy)

So, we see above we got a **1 item** list.
In other words, one single match spanning the whole string!

Do we have some control?
Maybe!

In [None]:
regex_not_greedy = r'<.*?>'
regex_not_greedy_words = r'<.*?>(.*?)<.*?>'
## *? == not greedy!

not_greedy = re.findall(regex_not_greedy, html)
not_greedy_words = re.findall(regex_not_greedy_words, html)

print(not_greedy)
print(not_greedy_words)

## NOTE: when the pattern contains exactly one group, findall returns only the text of that group


You can see the `r'<.*?>'` got just the tag elements. Here, the `?` after `*` makes the quantifier lazy (non-greedy), so it matches as few characters as possible.
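
For comparison, here is a small sketch that places the greedy and lazy patterns next to a third option often used for this kind of task: a negated character set that cannot run past the closing bracket. The `snippet` string and variable names are assumptions for illustration.

```python
import re

snippet = '<h1>Header!</h1> <p>Paragraph!</p>'  # illustrative sample text

greedy = re.findall(r'<.*>', snippet)      # runs to the last '>'
lazy = re.findall(r'<.*?>', snippet)       # stops at the first '>'
negated = re.findall(r'<[^>]*>', snippet)  # anything except '>' inside the tag

print(greedy)   # ['<h1>Header!</h1> <p>Paragraph!</p>']
print(lazy)     # ['<h1>', '</h1>', '<p>', '</p>']
print(negated)  # ['<h1>', '</h1>', '<p>', '</p>']
```

For this input the lazy and negated-set patterns give the same result; which one to prefer is a judgment call that depends on the text being processed.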

Additionally, you see one other new thing to pay special notice of:
the match sub-string extraction.
 * `r'<.*?>(.*?)<.*?>'`

The `()` included in the match pattern forms a group: `findall` then extracts only the portion of the text that matched inside the parentheses, instead of the entire pattern match.
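
To make the effect of groups concrete, this sketch runs `findall` with no group, one group, and two groups. The third pattern, with a `\1` back-reference to the tag name, is an illustrative assumption and not part of the original notebook.

```python
import re

snippet = '<h1>Header!</h1> <p>Paragraph!</p>'  # illustrative sample text

# No group: findall returns the entire match
no_group = re.findall(r'<.*?>.*?<.*?>', snippet)
print(no_group)    # ['<h1>Header!</h1>', '<p>Paragraph!</p>']

# One group: findall returns only the group's text
one_group = re.findall(r'<.*?>(.*?)<.*?>', snippet)
print(one_group)   # ['Header!', 'Paragraph!']

# Two groups: findall returns a tuple per match
two_groups = re.findall(r'<(\w+)>(.*?)</\1>', snippet)
print(two_groups)  # [('h1', 'Header!'), ('p', 'Paragraph!')]
```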

The above snippet is an example of **greedy vs lazy matching**. To learn more, see [here](https://blog.kiprosh.com/regular-expressions-greedy-vs-non-greedy/) and [here](https://mariusschulz.com/articles/why-using-the-greedy-in-regular-expressions-is-almost-never-what-you-actually-want).

---

# Save your notebook, then `File > Close and Halt`