:::{.callout-note}
## Look at GitHub

The *real* content is in your [`regex-in-class-<username>`](https://github.com/orgs/Lin511-2024/repositories) repository on GitHub. This page is meant to be more of a reference sheet.
:::



:::{.callout-note collapse="true"}
## This is just a support function

In [1]:
import urllib.parse
from IPython.display import display, Markdown

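def get_regex_url(regex: str) -> None:
    """Display `regex` as a Markdown link to its regexper.com diagram."""
    base = "https://regexper.com/#"
    # Percent-encode the regex so it survives in the URL fragment
    url = base + urllib.parse.quote(regex)
    url_markdown = f"[{regex}]({url})"
    display(Markdown(url_markdown))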

:::

## Setting up for using regular expressions in python

- We'll need to import the `re` module
- Unlike ordinary strings, we'll write our regular expressions as *raw* strings, with an `r` before the opening quote, so that Python leaves backslashes like the one in `\b` alone

In [2]:
import re

r"regex"

'regex'
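To see why the `r` matters, here is a small sketch comparing an ordinary string and a raw string that look the same on the page. The variable names are just for illustration. In an ordinary string, Python turns `\b` into a single backspace character before `re` ever sees it.

```python
import re

# Ordinary string: "\b" becomes one backspace character
plain = "\bthe\b"
# Raw string: the backslash and the "b" are kept as two characters
raw = r"\bthe\b"

plain_len = len(plain)  # 5 characters
raw_len = len(raw)      # 7 characters

# Only the raw string reaches `re` as a word-boundary pattern
plain_hits = re.findall(plain, "see the bear")
raw_hits = re.findall(raw, "see the bear")

print(plain_len, raw_len)
print(plain_hits, raw_hits)
```

The ordinary string finds nothing, because it is searching for literal backspace characters around "the".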

## Important `re` functions

Two ways to use `re` to search strings are

`re.search()`
: Return structured information about *where* the regex first matches, or `None` if it doesn't match anywhere.

`re.findall()`
: Return a list of all the matching substrings.

In [3]:
sentence1 = "The speaker is speaking."

### `re.search()`

In [4]:
re.search(r"speak", sentence1)

<re.Match object; span=(4, 9), match='speak'>

In [5]:
sentence1[4:9]

'speak'

### `re.findall()`

In [6]:
re.findall(r"speak", sentence1)

['speak', 'speak']
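One difference worth keeping in mind: `re.search()` stops at the *first* match, and gives back `None` when there is no match at all. The sketch below illustrates this, and uses `re.finditer()` (another function in the `re` module, not otherwise covered on this sheet) to get the position of every match.

```python
import re

sentence1 = "The speaker is speaking."

# Only the first match is reported
first = re.search(r"speak", sentence1)
first_span = first.span()

# No match at all: the result is None, not an empty match
missing = re.search(r"listen", sentence1)

# finditer yields one match object per match
all_spans = [m.span() for m in re.finditer(r"speak", sentence1)]

print(first_span)
print(missing)
print(all_spans)
```

Because a failed search returns `None`, it's usually safest to check the result before calling something like `.span()` on it.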

## Simple character searches

As in the examples above, ordinary characters in a regex match themselves literally.

In [7]:
#| results: asis
speak_regex = r"speak"
get_regex_url(speak_regex)

[speak](https://regexper.com/#speak)

## Options

If you want a character to be chosen from a set of options, place the options in `[]`. The whole `[]` matches exactly one character.

In [8]:
vowels_regex = r"[aeiou]"
get_regex_url(vowels_regex)

[[aeiou]](https://regexper.com/#%5Baeiou%5D)

In [9]:
re.findall(r"[aeiou]", sentence1)

['e', 'e', 'a', 'e', 'i', 'e', 'a', 'i']

In [10]:
the_regex = r"[Tt]he"
get_regex_url(the_regex)

[[Tt]he](https://regexper.com/#%5BTt%5Dhe)

In [11]:
re.findall(the_regex, sentence1)

['The']

### Ranges

Ranges of characters or numbers can be given inside `[]` like so

In [12]:
get_regex_url(r"[a-z]")
get_regex_url(r"[A-Z]")
get_regex_url(r"[0-9]")
get_regex_url(r"[A-Za-z]")

[[a-z]](https://regexper.com/#%5Ba-z%5D)

[[A-Z]](https://regexper.com/#%5BA-Z%5D)

[[0-9]](https://regexper.com/#%5B0-9%5D)

[[A-Za-z]](https://regexper.com/#%5BA-Za-z%5D)
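Here is a quick sketch of what those ranges pick out of a string. The example text is made up for illustration.

```python
import re

# Hypothetical example text
text = "Room B12 is on Floor 3."

lower = re.findall(r"[a-z]", text)     # lowercase letters only
upper = re.findall(r"[A-Z]", text)     # uppercase letters only
digits = re.findall(r"[0-9]", text)    # digits only
letters = re.findall(r"[A-Za-z]", text)  # any ASCII letter

print(upper)
print(digits)
print("".join(lower))
print(len(letters))
```

Each range still matches a single character at a time, which is why `re.findall()` returns the letters one by one.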

### "Metacharacters"

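- `\w` == `[A-Za-z0-9_]`
  - word characters (in Python 3, `\w` also matches non-ASCII letters and digits, like `é`, unless you pass the `re.ASCII` flag)
- `\W` == `[^A-Za-z0-9_]`
  - non-word characters
- `\d` == `[0-9]`
  - digits
- `\D` == `[^0-9]`
  - non-digits
- `\s` == `[ \t\n\r\f\v]`
  - any whitespace character
- `\S` == `[^ \t\n\r\f\v]`
  - non-whitespace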
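As a rough illustration of these shorthands, the sketch below runs a few of them over a made-up string. The `+` used here means "one or more" of the preceding item.

```python
import re

# Hypothetical example text
text = "Order 66 shipped_on 2024-01-15"

words = re.findall(r"\w+", text)      # runs of word characters
numbers = re.findall(r"\d+", text)    # runs of digits
chunks = re.findall(r"\S+", text)     # runs of non-whitespace
pieces = re.split(r"\s+", text)       # split on runs of whitespace

print(words)
print(numbers)
print(chunks)
print(pieces)
```

Notice that `_` counts as a word character but `-` does not, so `shipped_on` stays together while the date is broken up by `\w+`.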
  

### Any Character

To match *any* character (letter, number, punctuation, space, etc.) use `.` or "dot". By default, the one character `.` does not match is a newline.

In [13]:
re.findall(
    # return every word character and 
    # the following character
    r"\w.",
    sentence1
)

['Th', 'e ', 'sp', 'ea', 'ke', 'r ', 'is', 'sp', 'ea', 'ki', 'ng']
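One detail about `.` is that, by default, it skips newline characters. A minimal sketch, using a made-up two-line string and the `re.DOTALL` flag from the `re` module:

```python
import re

# Hypothetical two-line string
two_lines = "ab\ncd"

# Default: the newline is not matched
default = re.findall(r".", two_lines)

# With re.DOTALL, "." matches the newline as well
dotall = re.findall(r".", two_lines, flags=re.DOTALL)

print(default)
print(dotall)
```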

### Escaping special symbols

If you wanted to find the actual period in `sentence1`, you'd have to "escape" the `.` with a preceding `\`.

In [14]:
# compare
get_regex_url(r".")
get_regex_url(r"\.")

[.](https://regexper.com/#.)

[\.](https://regexper.com/#%5C.)

In [15]:
re.findall(
    r"\.",
    sentence1
)

['.']
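The difference between an escaped and unescaped `.` is easiest to see on a string where they disagree. The sketch below uses made-up text, and also shows `re.escape()`, a function in the `re` module that adds the backslashes for you when you want to search for a piece of text literally.

```python
import re

# Hypothetical example text
prices = "5.00 5x00"

# Unescaped dot: any character can sit between "5" and "00"
loose = re.findall(r"5.00", prices)

# Escaped dot: only a real period will do
strict = re.findall(r"5\.00", prices)

# re.escape() escapes special symbols like "$" and "." automatically
literal = re.escape("$5.00")
found = re.findall(literal, "It costs $5.00 (plus tax).")

print(loose)
print(strict)
print(found)
```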

## Modifiers

Modifiers (also called quantifiers) come after the definition of a single character, and define *how many times* that character can appear. They work the same way after a `[]` set or a metacharacter like `\w`.

- `a?` = zero or one `a`
- `a+` = one or more `a`
- `a*` = zero or more `a`

In [16]:
get_regex_url(r"bana?na")
get_regex_url(r"bana+na")
get_regex_url(r"bana*na")


[bana?na](https://regexper.com/#bana%3Fna)

[bana+na](https://regexper.com/#bana%2Bna)

[bana*na](https://regexper.com/#bana%2Ana)
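To make the three modifiers concrete, here is a small sketch that checks which of a few made-up spellings each pattern accepts. The helper `matching()` is something introduced here for illustration, and it uses `re.fullmatch()`, which only succeeds when the *entire* string matches the pattern.

```python
import re

# Hypothetical spellings with zero, one, two, and three a's in the middle
candidates = ["banna", "banana", "banaana", "banaaana"]

def matching(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional = matching(r"bana?na", candidates)      # zero or one "a"
one_or_more = matching(r"bana+na", candidates)   # one or more "a"
zero_or_more = matching(r"bana*na", candidates)  # zero or more "a"

print(optional)
print(one_or_more)
print(zero_or_more)
```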

## Grouping

You can define groupings within regular expressions by wrapping part of the pattern in `()`. The *effect* of these groupings depends on what kind of regex function you're using. For `re.findall()`, it'll match the whole pattern, but return just the text captured by the grouping.

In [17]:
sentence2 = "The big bear and the small bear ran away."

In [18]:
get_regex_url(r"[Tt]he (\w+) bear")

[[Tt]he (\w+) bear](https://regexper.com/#%5BTt%5Dhe%20%28%5Cw%2B%29%20bear)

In [19]:
re.findall(
    r"[Tt]he (\w+) bear",
    sentence2
)

['big', 'small']
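Groupings behave a little differently across functions, so here is a sketch of three common cases: pulling a group out of a `re.search()` match, using more than one group with `re.findall()`, and using `(?:...)`, a "non-capturing" group that groups without changing what `re.findall()` returns.

```python
import re

sentence2 = "The big bear and the small bear ran away."

# With re.search(), group(0) is the whole match and group(1) is the first grouping
m = re.search(r"[Tt]he (\w+) bear", sentence2)
whole = m.group(0)
inner = m.group(1)

# With two groupings, re.findall() returns a tuple per match
pairs = re.findall(r"([Tt]he) (\w+) bear", sentence2)

# A non-capturing group leaves re.findall() returning whole matches
full_matches = re.findall(r"(?:[Tt]he) \w+ bear", sentence2)

print(whole, "|", inner)
print(pairs)
print(full_matches)
```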

## Boundaries

- `^the ` == Finds "the " at the *start* of a string.

- ` the$` == Finds " the" at the *end* of a string.

- `\bthe\b` == Finds "the" in between word boundaries.

In [20]:
get_regex_url(r"^the ")
get_regex_url(r" the$")
get_regex_url(r"\bthe\b")

[^the ](https://regexper.com/#%5Ethe%20)

[ the$](https://regexper.com/#%20the%24)

[\bthe\b](https://regexper.com/#%5Cbthe%5Cb)

In [21]:
sentence3 = "I saw the other bear."
re.findall(
    r"the",
    sentence3
)

['the', 'the']

The second "the" there comes from inside "o**the**r".

In [26]:
re.findall(
    r"\bthe\b",
    sentence3
)

['the']

In [23]:
sentence3

'I saw the other bear.'
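The `^` and `$` boundaries can be tried out the same way. This is a minimal sketch using made-up strings; the last line combines `\b` with `[Tt]` to catch both capitalizations while still skipping "other".

```python
import re

# Hypothetical example strings
starts_with_the = "the bear saw the other bear"
ends_with_the = "that is not the"

# ^ only matches at the very start of the string
start_hit = re.findall(r"^the ", starts_with_the)
start_miss = re.findall(r"^the ", "I saw the other bear.")

# $ only matches at the very end of the string
end_hit = re.findall(r" the$", ends_with_the)
end_miss = re.findall(r" the$", starts_with_the)

# Word boundaries plus a character set
both_cases = re.findall(r"\b[Tt]he\b", "The bear saw the other bear")

print(start_hit, start_miss)
print(end_hit, end_miss)
print(both_cases)
```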