# String substitutions

Our set of tools for building chatbots is growing rapidly, but there are still some key techniques missing.
Arguably one of the most important ones is the ability to clean up the user's input before we do anything with it.

## Normalizing user input with `str.lower()`

Consider once more the code for Bran, the branching chatbot.
Originally, the code contained this clunker of an `if`-test:

In [None]:
leg_pulling = input()
if leg_pulling == "Yes" or leg_pulling == "Yes." or leg_pulling == "yes" or leg_pulling == "yes.":
    print("Well, at least you're honest.")

With lists, we could simplify this quite a bit to the following:

In [None]:
leg_pulling = input()
if leg_pulling in ["Yes", "Yes.", "yes", "yes."]:
    print("Well, at least you're honest.")

But this is still rather clunky since we have to list multiple versions of *yes* to account for capitalization and punctuation.
Fortunately, we can tell Python to *normalize* the string before it is used in the test.
First, we can use the function `str.lower` to replace all uppercase characters by their lowercase counterparts.

In [None]:
print(str.lower("ThiS StrInG uses lowerCase and UPPERCase in a HapHaZaRd Manner."))

So by forcing all characters in the user input to be converted to lowercase, we can drop `"Yes"` and `"Yes."` from the list.
Go ahead, run the code below several times and enter *Yes* one time and *yes* the other, and maybe *YES* on your third try.
You will get exactly the same result every time.

In [None]:
leg_pulling = input()
if str.lower(leg_pulling) in ["yes", "yes."]:
    print("Well, at least you're honest.")

**Exercise.**
Experimentation time.
The `str.lower` function also has the counterparts `str.upper` and `str.title`.
Do at least 4 experiments on different example strings to figure out what these functions do.
Then put your description below.

In [None]:
# put at least 4 experiments here

*put your descriptions here*

**Exercise.**
Below are two copies of the `if`-test from above, but with `str.lower` replaced by `str.upper` and `str.title`.
Adapt the strings in the list so that the test still works as intended.

In [None]:
leg_pulling = input()
if str.upper(leg_pulling) in ["yes", "yes."]:
    print("Well, at least you're honest.")

In [None]:
leg_pulling = input()
if str.title(leg_pulling) in ["yes", "yes."]:
    print("Well, at least you're honest.")

## Getting rid of punctuation with regular expressions

But note that the code above still does not work as intended if you enter *Yes!* or *yes...*.
That's because Python leaves the punctuation unchanged, and neither *yes!* nor *yes...* is an item of the list.
To fix this, we have to tell Python to also delete all punctuation markers from the input before we test it.
This is done with a powerful tool called *regular expressions*.

You might have already used regular expressions without realizing it.
For example, if you have a black belt in Google-fu then you know that you can enter `"smashing .* pumpkins"` to match any website that contains the words *Smashing* and *Pumpkins* in this order, possibly with other material in between.
That would of course include websites that talk about *The Smashing Pumpkins* (which I guess count as an oldies band by now?), but it also includes variations like *smashing rotten pumpkins*, *Smashing ABC Pumpkins*, and *Smashing Marilyn and Pumpkins Manson* (those are actual Google hits).
You can also use regular expressions when you search for a book in a library catalog, or even with a word processor to find specific patterns.

A lesser known function of regular expressions is that they can also be used to replace the matched pattern by some other pattern.
It is this aspect of regular expressions that we are particularly interested in.
Python already comes with a regular expression package, called `re`.
Let us look at a simple example first where we replace every exclamation mark by a question mark.

In [None]:
# first we have to load the regular expression library, called re
import re

# here's our example string
example_string = "Go... go home! I hate you!!! Never talk to me again!!1!!1!"
# replace every ! by ?
print(re.sub(r"!", r"?", example_string))

Here we use the substitution function of the `re` library, called `re.sub`.
As with any other function, `re.sub` is followed by `(` and `)`.
In contrast to `print` or `input`, `re.sub` needs three things to occur between those brackets, and they must be separated by commas (`,`):

1.  First we specify what parts of the string should be replaced. Note the `r` in front of `"`.
1.  Then we say what those parts should be replaced by. Note again the `r` in front of `"`.
1.  Finally, we tell the `sub` command in what string it should carry out those substitutions.
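
The three parts are easier to tell apart if each one gets a name first. The following is only a sketch: the variable names `pattern`, `replacement`, and `text` are made up for illustration and are not part of the `re` library.

```python
import re

# hypothetical names, chosen only to label the three arguments
pattern = r"!"            # 1. what should be replaced
replacement = r"?"        # 2. what it should be replaced by
text = "Stop! Go away!"   # 3. the string in which to substitute

result = re.sub(pattern, replacement, text)
print(result)  # Stop? Go away?
# re.sub builds a new string; the original is left unchanged
print(text)    # Stop! Go away!
```

Because `re.sub` does not modify the original string, you have to store its result in a variable if you want to keep working with it.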

Don't worry about why there is an `r` in front of each string (it tells Python that these are *raw strings*, but what that means is a very technical matter that we won't wrestle with in this class).
We need the `r` to have things work the way we want, rather than a subtly different way.
As a mnemonic, just remember that we are using **r**egular expressions here, and strings in **r**egular expressions should have an `r` in front of them.

***The regular r rule:* always put r before regular expression strings.**
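
For the curious, here is a small peek at what the `r` changes (it is safe to skip). In a normal string, Python itself treats some backslash combinations specially, for example `\n` stands for a line break. In a raw string, the backslash is kept as an ordinary character, so it reaches `re` untouched.

```python
# in a normal string, \n is a single character (a line break)
print(len("\n"))   # 1
# in a raw string, \n is two characters: a backslash and the letter n
print(len(r"\n"))  # 2
```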

**Exercise.**
Adapt the code below so that every exclamation mark is replaced by 3 question marks.
No, we haven't explicitly said yet how to do that, but you should be able to figure it out with a bit of experimentation.

In [None]:
# first we have to import the regular expression library, called re
import re

# here's our example string
example_string = "Go... go home! I hate you!!! Never talk to me again!!1!!1!"
# replace every ! by 3 instances of ?
print(re.sub(r"!", r"?", example_string))

The main difference between regular expressions and normal strings is that some symbols have a special meaning, and this allows us to represent complex patterns via strings.
For example, suppose we want to delete every symbol in the string, including all spaces.
Then we could do this with the following regular expression.

In [None]:
# first we have to import the regular expression library, called re
import re

# here's our example string
example_string = "Go... go home! I hate you!!! Never talk to me again!!1!!1!"
# delete every character (replace it by nothing)
print(re.sub(r".", r"", example_string))

As you can see here, `.` has a special meaning in regular expressions.
It does not represent just the punctuation symbol `.`, but every character, including space.
The power of regular expressions stems from the fact that one can define very complicated pattern matching rules.
But this power requires certain special symbols, and since it would be awfully inconvenient to introduce new symbols that nobody can easily type on their keyboard, the decision was made to assign special meaning to common symbols that can be found on every keyboard.
In our particular case, `.` is a special symbol that matches any character, including any digit, letter of the alphabet, punctuation symbols, special characters (e.g. `@` or `#`), and even *whitespace* (spaces and tabs).
In regular expressions, `.` does not represent a dot, it is a placeholder for literally any character.
So the regular expression above says "replace every symbol by nothing", which is the same as deleting every symbol.
This may not seem very useful, but in combination with some other special symbols, `.` allows you to do some very powerful *pattern matching* and remove unwanted material from strings.
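
To see more clearly what `.` matches, it helps to replace each match by a visible symbol instead of deleting it. The sketch below uses two made-up example strings. Note that, by default, `.` leaves line breaks alone; that rarely matters for a single line of user input, but it is good to know.

```python
import re

mixed = "a1 @!"
# every character, including the space, is replaced by *
print(re.sub(r".", r"*", mixed))      # *****

two_lines = "ab\ncd"
# the line break between ab and cd is not matched, so it survives
print(re.sub(r".", r"*", two_lines))
```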

## Escaping special characters

But if `.` matches anything, how can we match an actual dot?
We have to temporarily cancel the regular expression interpretation of `.`, which we do by putting a backslash `\` in front of it.
By writing `\.`, we effectively say "Computer, I mean an actual dot. Seriously, dude, don't give me a digit, a letter, or God knows what, I just want a dot."
Using a backslash to block the regular expression interpretation is also called *escaping*.
So when we write `\.`, we escape `.` to get a standard dot.

In [None]:
# first we have to import the regular expression library, called re
import re

# here's our example string
example_string = "Go... go home! I hate you!!! Never talk to me again!!1!!1!"
# replace every literal dot by ?
print(re.sub(r"\.", r"?", example_string))
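
The difference between the unescaped and the escaped dot is easiest to see side by side. A minimal sketch with a made-up string:

```python
import re

version = "v1.2"
# unescaped: . matches every character
print(re.sub(r".", r"?", version))   # ????
# escaped: \. matches only the actual dot
print(re.sub(r"\.", r"?", version))  # v1?2
```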

**Exercise.**
The cell below contains a slightly modified version of the `if` test from Bran, the branching chatbot.
Complete the arguments for the `re.sub` function so that the `if` test works as intended for answers containing dots.
You do not need to handle answers with `?` or `!`.

In [None]:
import re

leg_pulling = input()
# clean up leg_pulling with a regular expression substitution;
# you have to modify r""
leg_pulling = re.sub(r"", r"", leg_pulling)
if str.lower(leg_pulling) in ["yes", "sure", "most definitely", "that's for sure"]:
    print("Well, at least you're honest.")

## Multiple substitutions

Our main motivation for using regular expressions is their ability to remove punctuation symbols from the user's input.
But right now we only know how to replace individual characters.
This would make it rather tedious to remove all punctuation symbols from the input.

In [None]:
leg_pulling = input()
# remove . (special character -> must be escaped with \)
leg_pulling = re.sub(r"\.", r"", leg_pulling)
# remove ? (special character -> must be escaped with \)
leg_pulling = re.sub(r"\?", r"", leg_pulling)
# remove !
leg_pulling = re.sub(r"!", r"", leg_pulling)
# remove ,
leg_pulling = re.sub(r",", r"", leg_pulling)
# remove ;
leg_pulling = re.sub(r";", r"", leg_pulling)

if str.lower(leg_pulling) in ["yes", "sure", "most definitely", "that's for sure"]:
    print("Well, at least you're honest.")

This is incredibly tedious.
It would be much more convenient if we could just provide a list of characters, all of which should be deleted.
Regular expressions allow us to do just that with a syntax that's similar to lists in Python.
Inside the string of the first argument, we can specify a range of alternatives between square brackets.
So `r"[\.\?!,;]"` means "any symbol that is a dot, a question mark, an exclamation mark, a comma, or a semicolon".
Note that in contrast to Python lists, the entries between the square brackets are not separated by commas - that would make them even harder to read.

In [None]:
leg_pulling = input()
# remove punctuation
leg_pulling = re.sub(r"[\.\?!,;]", r"", leg_pulling)

if str.lower(leg_pulling) in ["yes", "sure", "most definitely", "that's for sure"]:
    print("Well, at least you're honest.")
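
Since `input()` makes it hard to compare several cases at once, here is a sketch that applies the same substitution to a fixed, made-up string. It also shows a detail that goes slightly beyond the text above: inside square brackets, `.` and `?` are already treated as ordinary characters, so the version without backslashes gives the same result. Escaping them anyway does no harm.

```python
import re

messy = "Well... yes, sure; why not?!"
# with escaped . and ?, as in the cell above
print(re.sub(r"[\.\?!,;]", r"", messy))  # Well yes sure why not
# without escapes: same result, since . and ? are literal inside [ ]
print(re.sub(r"[.?!,;]", r"", messy))    # Well yes sure why not
```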

**Exercise.**
Without running the code above, try to figure out if the following input would pass the `if`-test: `Y!!;e...,S..?`.
Write down your prediction in the cell below.
Then run the code above with this input.
Was your prediction borne out?
Explain why the string does or does not pass the `if`-test.

*put your prediction and the explanation of the actual behavior here*

**Exercise.**
Design a new chatbot, *Lord Loudmouth*.
Lord Loudmouth, possibly as a result of the inbreeding that is so common among aristocratic nobility, is not the greatest conversationalist.
He just repeats whatever words the user said, but with three exclamation marks after each word.
So if the user enters *Hi. I am John, what is your name?*, Lord Loudmouth will reply *Hi!!! I!!! Am!!! John!!! What!!! Is!!! Your!!! Name!!!*.

Lord Loudmouth keeps doing this until the user says *Goodbye.*, *Sorry, my ears are ringing.*, or *And I thought it's the quiet ones you oughta watch.* (or some variant with sloppier punctuation and capitalization).

In [None]:
# Your Lord Loudmouth code

*Hints:*
If you're stuck with the exercise, highlight the text below to read some tips.

<span style="color:#000000;background-color:#000000;">
We already have a function for making only the first letter of each word uppercase.
If you don't remember what it is, go back to the very first exercise of this notebook.
</span>

<span style="color:#000000;background-color:#000000;">
Inserting !!! after every word is the same as replacing every space after a word by "!!! " (note the space after !!!).
So all you have to figure out is how you can make sure there is exactly one space after each word.
Think about it for a bit.
If you're still stuck, check the next hint.
</span>

<span style="color:#000000;background-color:#000000;">
First, replace every punctuation symbol (.!?,;:) by a space.
Assuming that the user input ends with a punctuation symbol, now every word has at least one space after it.
But some may have two spaces.
For instance, ", " would become two spaces after the comma is replaced by space.
So you also have to replace each instance of two spaces by one space.
Otherwise, some words will be followed by "!!! !!! " instead of just "!!! ".
</span>

**Exercise.**
Lord Loudmouth has a lovely wife, *Lady Lunatic*.
She, too, is conversationally challenged:

- Every command she turns into a sentence (exclamation marks become dots).
- Every sentence she turns into a question (dots become question marks).
- Every question she turns into a command (question marks become exclamation marks).

Write an initial draft of the code for Lady Lunatic.
Two important hints:

1.  As mentioned in an earlier example, `?` also has a special meaning in regular expressions, so make sure to escape it with `\`.
1.  Special characters only need to be escaped in the first part of `re.sub`, i.e. the part that specifies what should be replaced.
    In the second part, where it says what to replace the match with, characters **must not** be escaped.
    Yeah, regular expressions are a little finicky, that's why it's important to practice them a lot.
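
The second hint can be checked with a small experiment. This sketch uses a made-up replacement (`?!`) that is deliberately different from anything Lady Lunatic needs.

```python
import re

question = "What?"
# pattern escaped, replacement not escaped: works as intended
print(re.sub(r"\?", r"?!", question))   # What?!
# escaping in the replacement too leaves an unwanted backslash behind
print(re.sub(r"\?", r"\?!", question))  # What\?!
```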
    
Once you have finished your draft, test it on the following strings:

- "Let it go!"
- "I am tired."
- "You don't like me anymore?"
- "You don't like me anymore? Shut up! You old bastard."

Did things turn out the way you expected?
What do you think went wrong?

Once you've thought about this for a bit (and put your answers below), write up a second draft that works for the examples above (but may fail on some other sentences that include, say, `|`, or `@`).

In [None]:
# put your first draft of the Lady Lunatic code here

*Put your answers to the questions here*

In [None]:
# put your second draft here, which tries to fix the previous problem

*Hints:*
If you're stuck with the exercise, highlight the text below to read some tips.

<span style="color:#000000;background-color:#000000;">
In your first draft, you probably used three substitutions.
One turns ! to ., one turns . to ?, and one turns ? to !.
This creates a big problem.
If you run the three substitutions in sequence, then all punctuation eventually becomes !:
"What? Shut up! Bastard." -> "What? Shut up. Bastard." -> "What? Shut up? Bastard?" -> "What! Shut up! Bastard!"
</span>

<span style="color:#000000;background-color:#000000;">
In order to fix this, you have to create placeholders.
For example, it is very unlikely that the user ever writes "@@@@@" or "|||||".
So replace ! by @@@@@, . by |||||, and ? by !:
"What? Shut up! Bastard." -> "What? Shut up@@@@@ Bastard." -> "What? Shut up@@@@@ Bastard|||||" -> "What! Shut up@@@@@ Bastard|||||"
</span>

<span style="color:#000000;background-color:#000000;">
Don't forget to replace @@@@@ by . and ||||| by ? at the end.
</span>

Regular expressions are very powerful, and we will see all kinds of uses for them throughout the semester.
If you're curious, you can already experiment a bit online on the website [Pythex](https://pythex.org/).
But the power of regular expressions (or simply *regexes*) can feel overwhelming at the beginning.
So we will take it slowly and grow our *regex* vocabulary step-by-step rather than doing everything in one fell swoop.

## Bullet point summary

- User input must be normalized before processing.
- Capitalization is controlled by `str.lower`, `str.upper`, and `str.title`.
- Use a regular expression (*regex*) for more complicated substitutions.
- Load the regex library with `import re`.
- For substitutions, use `re.sub`.
  The usage pattern is `re.sub(r"to_replace", r"the_replacement", the_string)`.
- Don't forget the *regular r rule*.

***The regular r rule:* always put r before regular expression strings.**

- In regexes, some characters have special meaning.
  The dot (`.`) matches any arbitrary character.
- Special meanings must be escaped with `\`.
  So `\.` matches a literal dot.
- Lists of characters go between `[` and `]`, but are **not** separated by a comma.
  To replace all punctuation, use `r"[\.\?!;,]"`.
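
Putting these points together, a cleanup step might look like the sketch below. The example answer and the two-item list are made up for illustration.

```python
import re

answer = "YES!!!"
# step 1: remove punctuation
answer = re.sub(r"[\.\?!;,]", r"", answer)
# step 2: convert everything to lowercase
answer = str.lower(answer)

print(answer)                     # yes
print(answer in ["yes", "sure"])  # True
```

After both steps, a single lowercase entry in the list covers *Yes*, *yes.*, *YES!!!*, and similar variants.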