# Additional regex tricks

A regex master can pull off some amazing tricks, but even a novice can wield regexes to great effect when working with natural language data.
This unit introduces some essential tricks that will be with us for the rest of the semester.
They are:

1. `+` for matching sequences of 1 or more characters, and
1. `\w` for matching **w**ord characters.

## Using `+` for "one or more"

One exercise in the previous unit had you replace punctuation symbols, e.g. exclamation points (`!`) by dots (`.`).
So `"Hi!!! You know me!! I'm a chatbot!"` would become `"Hi... You know me.. I'm a chatbot."`.
But what if your instructions were instead to replace each sequence of exclamation points by a single dot, so that `"Hi!!! You know me!! I'm a chatbot!"` is converted to `"Hi. You know me. I'm a chatbot."`?
How could you do this?
Your first instinct might be something like this:

In [None]:
import re

line = "Hi!!! You know me!! I'm a chatbot!"

# replace one !
line = re.sub(r"!", r".", line)
# replace two !
line = re.sub(r"!!", r".", line)
# replace three !
line = re.sub(r"!!!", r".", line)

print(line)

As you can see, this does not work at all.
The first line replaces every `!` by a dot, and after that none of the other lines of code get to apply because there are no `!` left.
We can add `print` statements to make this clearer:

In [None]:
import re

line = "Hi!!! You know me!! I'm a chatbot!"

# replace one !
line = re.sub(r"!", r".", line)
# show result, with tabulator before line
print("Replaced ! by .:\t", line)
# replace two !
line = re.sub(r"!!", r".", line)
print("Replaced !! by .:\t", line)
# replace three !
line = re.sub(r"!!!", r".", line)
print("Replaced !!! by .:\t", line)

Clearly the first substitution has already destroyed all configurations where the others could have applied.

By the way, if you're wondering about the `\t` in the code above, this is how Python represents a tab character, also called a tabulator (the Tab key is the big key right above CapsLock on the left side of your keyboard).
Whenever you want a tab in a Python string, use `\t`.
Pressing the Tab key is not a reliable alternative because many editors insert spaces or trigger autocompletion instead.
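
If you want to convince yourself that `\t` really stands for a single tab character rather than a backslash followed by a `t`, a small sketch like the following may help. The variable name `label` and the example text are made up for illustration.

```python
# "\t" takes two keystrokes to type in the source code,
# but it ends up as just one character (a tab) in the string
label = "Name:\tAlice"

# the tab shows up as a wide gap
print(label)
# the tab on its own has length 1
print(len("\t"))
# repr shows the string the way you would type it in code
print(repr(label))
```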

But let's return to our problem.
The code above does not work because the substitutions are carried out in the wrong order.
The most specific substitution should come first, not last.
If a substitution *X* matches all configurations where substitution *Y* should apply, *Y* must be carried out before *X*.
So in our case, we have to reverse the order of substitutions.

In [None]:
import re

line = "Hi!!! You know me!! I'm a chatbot!"

# replace three !
line = re.sub(r"!!!", r".", line)
print("Replaced !!! by .:\t", line)
# replace two !
line = re.sub(r"!!", r".", line)
print("Replaced !! by .:\t", line)
# replace one !
line = re.sub(r"!", r".", line)
print("Replaced ! by .:\t", line)

So now we have a solution that works for our example string.
But this is not a good solution.
It will fail for any string that contains more than three exclamation points in a row.

In [None]:
import re

line = "Hi!!! You know me!! I'm a chatbot! I really love shouting!!!!"

# replace three !
line = re.sub(r"!!!", r".", line)
print("Replaced !!! by .:\t", line)
# replace two !
line = re.sub(r"!!", r".", line)
print("Replaced !! by .:\t", line)
# replace one !
line = re.sub(r"!", r".", line)
print("Replaced ! by .:\t", line)

As you can see, `!!!!` is rewritten as `.!` by the first substitution because if you have 4 `!` in a row, then you obviously have 3 `!` in a row, and the first substitution replaces 3 `!` by a single dot.
The last substitution then turns all remaining `!` into `.`, so `.!` becomes `..`.
As we wanted `!!!!` to become `.` rather than `..`, this code does not work.

**Exercise.**
You might think that replacing `..` by `.` would fix the problem, but that won't work either.
Expand the code below so that it carries out this substitution, then change the example string so that the output will still contain `..` somewhere.

In [None]:
import re

line = "Hi!!! You know me!! I'm a chatbot! I really love shouting!!!!"

# replace three !
line = re.sub(r"!!!", r".", line)
print("Replaced !!! by .:\t", line)
# replace two !
line = re.sub(r"!!", r".", line)
print("Replaced !! by .:\t", line)
# replace one !
line = re.sub(r"!", r".", line)
print("Replaced ! by .:\t", line)

If you like a challenge, you might now sit down and try to figure out some complex way of chaining substitutions in such a way that you can never get two dots in a row.
That might be entertaining, but it is far from necessary.
In fact, whatever solution you would come up with would be clunky and inelegant.
Regular expressions provide a much nicer solution.

In [None]:
import re

line = "Hi!!! You know me!! I'm a chatbot! I really love shouting!!!!"

# replace each sequence of one or more ! by a single dot
line = re.sub(r"!+", r".", line)
print("Replaced !-sequences by .:\t", line)

There you have it, all in one line, short and sweet.
The `+` character means "1 or more instances of whatever is right before me":

- `!+` means "1 or more instances of `!`",
- `A+` means "1 or more instances of `A`",
- `[Aa]+` means "1 or more instances of characters that match `A` or `a`".
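
The three patterns in this list can be tried side by side. The sketch below uses a made-up test string (`text`) and made-up variable names; it only relies on `re.sub`, which you already know.

```python
import re

# made-up example string with runs of !, A, and a
text = "AAA!! aAaA! AaaA"

# each run of one or more ! becomes a single dot
bangs = re.sub(r"!+", ".", text)
# each run of one or more A becomes a single X; lowercase a is untouched
upper = re.sub(r"A+", "X", text)
# each run of A and a, in any mix, becomes a single X
mixed = re.sub(r"[Aa]+", "X", text)

print(bangs)   # AAA. aAaA. AaaA
print(upper)   # X!! aXaX! XaaX
print(mixed)   # X!! X! X
```

Note how `A+` splits `aAaA` into separate matches because the lowercase `a` interrupts each run, whereas `[Aa]+` treats the whole thing as one run.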

**Exercise.**
Look at the two code cells below.
The first one replaces `!+` and `\?+` by `.`, whereas the second one applies this substitution to `[!\?]+`.
The two do not produce the same output for `"What!? Why did you do this??? You're insane!!!"`.
Before you run the code, think about what the difference will be.
Then verify your answer by running the cells.
Provide a brief explanation of why we get this difference.

In [None]:
import re

line = "What!? Why did you do this??? You're insane!!!"

line = re.sub(r"!+", ".", line)
line = re.sub(r"\?+", ".", line)

print(line)

In [None]:
import re

line = "What!? Why did you do this??? You're insane!!!"

line = re.sub(r"[!\?]+", ".", line)

print(line)

*put your explanation here*

*Hints:*
If you're stuck with the exercise, highlight the text below to read some tips.

<span style="color:#000000;background-color:#000000;">
Insert print statements to see how the line changes after each substitution.
</span>

**Exercise.**
By default, `+` only pays attention to whatever character or character range is immediately to its left.
So `Aa+` means "an A followed by 1 or more instances of a".
But what might `(Aa)+` mean?
Or `(!!!)+` compared to `!!!+`, for instance?
Use the code cell below for some experiments of your own, then add your answer underneath.

In [None]:
# experiment here

*put your answer here*

In sum, whenever you want to replace 1 or more instances of *X*, use *X* followed by `+`.
Here *X* may be a single character, a range of characters inside `[` and `]`, or a sequence inside `(` and `)`.
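
To see the difference between `+` after a single character and `+` after a sequence in parentheses, here is a small sketch with a made-up string of laughs. The names `laugh`, `only_a`, and `whole` are just for illustration.

```python
import re

# made-up example string
laugh = "ha haha haaa hahaha"

# ha+ : an h followed by one or more a (the + only looks at the a)
only_a = re.sub(r"ha+", "X", laugh)
# (ha)+ : one or more repetitions of the whole sequence ha
whole = re.sub(r"(ha)+", "X", laugh)

print(only_a)   # X XX X XXX
print(whole)    # X X Xaa X
```

With `ha+`, the string `haha` counts as two separate matches, while `(ha)+` treats it as a single one. Conversely, `haaa` is one match for `ha+`, but `(ha)+` only covers the first `ha` and leaves `aa` behind.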

## Matching specific character classes

You already know that you can use `[` and `]` to define a range of possible matches.
So `[abD!\?]` will match every character that is either `a`, `b`, `D`, `!`, or `?`.

In [None]:
import re

line = "This string contains D? Sure, but not T!!"

line = re.sub(r"[abD!\?]", r"X", line)

print(line)

But what if you want to match every word character, for instance to play a word game?

In [None]:
import re
import random

print("Hey, let's play a word game!")
print("You have to guess the secret animal.")

secret = random.choice(["cat", "dog", "duck", "elephant"])

# we split the print-command across two lines to keep it readable
print("Here's what it looks like:",
      # this masks out all lowercase characters
      re.sub(r"[abcdefghijklmnopqrstuvwxyz]", r"*", secret))

print("What is it?")
answer = input()

while answer not in [secret, "STOP"]:
    print("Sorry, wrong guess!")
    print("Try again or type STOP to quit playing.")
    answer = input()

if answer == "STOP":
    print("Okay. The correct answer would have been", secret)
else:
    print("That's it! Good job!!!")

In the code above, we use

```python
re.sub(r"[abcdefghijklmnopqrstuvwxyz]", r"*", secret)
```

to replace the secret word by an equally long string of stars.
This is awfully long, though, and it gets even worse if we play this game with, say, famous rock and metal bands whose name is a single word.

In [None]:
import re
import random

print("Hey, let's play a word game!")
print("You have to guess which rock/metal band I'm thinking of.")

secret = random.choice(["Megadeth", "Metallica", "Nirvana", "Opeth",
                        "Rammstein", "Sepultura", "Tool"])

# we split the print-command across two lines to keep it readable
print("Here's what it looks like:",
      # this masks out all uppercase and lowercase letters
      re.sub(r"[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]", r"*", secret))

print("What is it?")
answer = input()

while answer not in [secret, "STOP"]:
    print("Sorry, wrong guess!")
    print("Try again or type STOP to quit playing.")
    answer = input()

if answer == "STOP":
    print("Okay. The correct answer would have been", secret)
else:
    print("That's it! Good job!!!")

Now that is one superlong `re.sub` to replace letters by stars.
And you know what?
It doesn't even work reliably!
Run the code below, where the list of bands has been changed a bit.

In [None]:
import re
import random

print("Hey, let's play a word game!")
print("You have to guess which rock/metal band I'm thinking of.")

secret = random.choice(["Deströyer 666", "Motörhead", "Mötley Crüe", "Schweißer"])

# we split the print-command across two lines to keep it readable
print("Here's what it looks like:",
      # this is meant to mask out all letters, but not spaces or punctuation
      re.sub(r"[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]", r"*", secret))

print("What is it?")
answer = input()

while answer not in [secret, "STOP"]:
    print("Sorry, wrong guess!")
    print("Try again or type STOP to quit playing.")
    answer = input()

if answer == "STOP":
    print("Okay. The correct answer would have been", secret)
else:
    print("That's it! Good job!!!")

Our substitution only works for the 26 letters of the basic Latin alphabet used in English.
But many metal band names love their metal umlauts, for instance the *ö* in *Motörhead*.
And others have non-English names and thus may use letters that aren't part of that basic alphabet.
In the code above, that's the case for the German metal band *Schweißer* (German for *welder*).
So this clearly isn't a good way of handling things: it takes forever to type, and it doesn't even work well.

Again, regular expressions provide a simple solution.
You can use `\w` to match any **w**ord character.
That is to say, `\w` matches all the letters of the Latin alphabet used in English, but also special letters from other languages like ö, ü, ß, æ, ð, þ, and many more.
It also matches digits and the underscore (`_`), but not punctuation (`.`, `!`, `?`, `;`, `-`, `'`) or white space (spaces and tabs).
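
If you are unsure whether a particular character counts as a word character, you can simply test it. The sketch below runs through a made-up list of sample characters (`samples`) and reports for each one whether `\w` matches it. It assumes Python 3, where `\w` also covers the underscore (`_`).

```python
import re

# made-up sample of characters to test
samples = ["a", "Z", "ö", "ß", "7", "_", "!", "-", " ", "\t"]

for char in samples:
    # re.sub turns the character into * only if \w matches it
    masked = re.sub(r"\w", "*", char)
    # repr makes the space and the tab visible in the output
    print(repr(char), "is a word character:", masked == "*")
```

The first six characters are reported as word characters, the last four are not.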

In [None]:
import re
import random

print("Hey, let's play a word game!")
print("You have to guess which rock/metal band I'm thinking of.")

secret = random.choice(["Deströyer 666!!!", "Motörhead", "Mötley Crüe", "Schweißer"])

# we split the print-command across two lines to keep it readable
print("Here's what it looks like:",
      # this masks out all letters and digits, but not spaces or punctuation
      re.sub(r"\w", r"*", secret))

print("What is it?")
answer = input()

while answer not in [secret, "STOP"]:
    print("Sorry, wrong guess!")
    print("Try again or type STOP to quit playing.")
    answer = input()

if answer == "STOP":
    print("Okay. The correct answer would have been", secret)
else:
    print("That's it! Good job!!!")

**Exercise.**
Instead of `re.sub(r"\w", r"*", secret)` we could have used `re.sub(r".", r"*", secret)` to replace every character by a star.
This would have made a difference for two of the band names in the code above.
Why is it more appropriate to use `\w` rather than `.` in this program?

*put your answer here*

*Hints:*
If you're stuck with the exercise, highlight the text below to read some tips.

<span style="color:#000000;background-color:#000000;">
The relevant band names are "Deströyer 666!!!" and "Mötley Crüe".
</span>

The combination of `\w` and `+` is particularly powerful.
The regex `\w+` matches sequences of word characters.
In other words, words!
At this point it might not be clear why this is powerful.
After all, we only use regexes with `re.sub` to rewrite parts of a string.
There's no tangible benefit to rewriting every word by, say, `*` or `word`.

**Exercise.**
While there is no tangible benefit to it, it's still a good exercise.
Write a piece of code that asks the user for some input and rewrites every word by *word*.
Then test your code on the following sentences:

- True that!
- John's car got totaled.
- 1337 speak is used by h4x0rz.

Reflect on what your program considers a word.
Does it match your own intuition?
There is no wrong or right here, just think about it.

In [None]:
# put your code here

In future units, we will encounter other uses for regular expressions, in particular with the function `re.findall`.
This is still a few weeks away, but nonetheless you should spend some time right now on mastering the use of `+` and `\w`.
It will make things much easier later on.
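
As a small preview, and without going into the details that later units cover properly, here is a sketch of what `re.findall` does with `\w+`. The example sentence and the variable names are made up.

```python
import re

# made-up example sentence
sentence = "Regexes are fun, right?"

# re.findall returns a list of all parts of the string that match the regex
words = re.findall(r"\w+", sentence)

print(words)   # ['Regexes', 'are', 'fun', 'right']
```

Instead of rewriting the string, `re.findall` collects the matches, which is why a regex for words suddenly becomes very useful.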

## Bullet point summary

- `+` means *1 or more of*
    - `a+` = 1 or more `a`s; matched by `aaaa`, but not `AAAA`
    - `A+` = 1 or more `A`s; matched by `AAAA`, but not `aaaa`
    - `[Aa]+` = 1 or more characters matching `A` or `a`; matched by `aaaa`, `AAAA`, `AAaA`, `aAAa`, and many more
    - `(Aa)+` = 1 or more instances of `Aa` in a row; matched by `AaAa`, but not `aaaa`, `AAAA`, or `aAAA`
- `\w` matches **w**ord characters
    - all letters of the Latin alphabet (A, B, C, ..., X, Y, Z, a, b, c, ..., x, y, z)
    - letters of other alphabets (æ, ö, ð, ü, ß, ç, and many more)
    - digits (0, 1, 2, ..., 9)
    - the underscore (_)
- `\w` does **not** match
    - white space (space, tab)
    - special characters (!, ?, &, %, #, -, ;, and so on)