# Perfecting our chatbots with regular expressions

As you know, Eliza was the first chatbot, and there are many websites such as [this one](http://www.manifestation.com/neurotoys/eliza.php3) that host an online version of Eliza.
After playing around with Eliza for a while, you will notice that she uses a rather neat trick: she takes part of the user's input and reuses it in her own sentence.
This goes beyond the simple memorization and repetition we have done so far.
But we can copy that trick with regular expressions.

## Reusing parts of the input without changes

One particular pattern of Eliza is that any question of the form *Can you X?* will get the reply *You don't believe that I can X?* or *Do you want me to be able to X?*.
Apparently Eliza saves the user's input, deletes *Can you*, and then puts the remainder after *You don't believe that I can* or *Do you want me to be able to*.
Since we already know how to delete something with regular expressions, this should be straightforward.

In [None]:
# our first attempt at reusing parts of the input;
# we will use the libraries random and re
import random
import re

# an example of a possible user input
reply = "Can you drain the ocean with a tea egg?"

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"Can you", r"", reply)
print(beginning, reply)

This worked fairly well, except that we have two spaces between *can* and *drain* instead of one.
That's because after deleting *Can you*, `reply` starts with a space, which Python won't automatically remove for us.
But Python does automatically insert a space between the arguments inside `print`, so we get two spaces: first the automatically inserted one, then the one that wasn't deleted from `reply`.
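
To see where the two spaces come from, it can help to look at the pieces separately. The short check below is only a sketch: the variable names `leftover` and `joined` are made up for illustration, `repr` is used to make the leading space visible, and joining with `" "` mimics what `print` does with its arguments by default.

```python
import re

# delete "Can you" but leave everything else untouched
leftover = re.sub(r"Can you", r"", "Can you drain the ocean with a tea egg?")

# repr shows the string with quotes, so the leading space becomes visible
print(repr(leftover))

# print separates its arguments with a single space by default;
# joining with " " mimics that, so we end up with two spaces in a row
joined = " ".join(["You don't believe that I can", leftover])
print(repr(joined))
```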

**Exercise.**
Copy-paste the code above into the cell below.
Then adapt the regular expression so that we no longer get two spaces in a row.

In [None]:
# put the adapted code here

But the code still has several problems.
Run your fixed version with the following inputs:

1. can you drain the ocean with a tea egg?
1. Can you drain the ocean with a tea egg
1. Can you help me and my friends?

These example sentences show three things:

1. When the user ignores capitalization, our regular expression fails to delete *can you*.
1. When the user does not add punctuation, our output lacks it too.
1. We need to change the pronouns in the reply.

Let us tackle each one in turn.

## Ignoring case

The first problem concerns the distinction between lowercase and uppercase letters.
This is actually fairly easy to solve, although the most obvious solution won't work.
Initially, you might think that we can just use the `str.lower` function to make the whole input lowercase.
Then we just change the regular expression from `"Can you "` to `"can you "` and everything should work fine.

In [None]:
import random
import re

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"can you ", r"", str.lower(reply))
print(beginning, reply)

This code does indeed work as intended for the input *can you drain the ocean with a tea egg*.
But see what happens when we make a minor change to the input sentence.

In [None]:
import random
import re

reply = "Can you drain the Pacific with a tea egg?"

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"can you ", r"", str.lower(reply))
print(beginning, reply)

By using the `str.lower` function, we have not only undone the capitalization of *Can*, but also the one of the proper name *Pacific*.
That's not good: we don't want a chatbot that turns *John* into *john*, or *Microsoft* into *microsoft*.

Alright, back to the drawing board.
Another choice would be to use two different regular expressions, one for uppercase and one for lowercase.
Here we first replace `reply` by the result of deleting all instances of `can you `, and then we take that string, delete all instances of `Can you `, and save that as the variable `reply`.
This code should work correctly irrespective of whether the user starts their reply with *Can you* or *can you*.

**Exercise.**
Copy-paste the version of the code without `str.lower` and make the suggested modifications.
Then run the code with two example sentences, one that starts with *Can you* and one that starts with *can you*.

In [None]:
# put your modified code here

While this solution seems to work, it is very clunky.
Fortunately, there is a much nicer solution.
We can make a regular expression case-insensitive by adding `(?i)` at the beginning.
So `"(?i)Can you "` and `"(?i)can you "` do the same thing: they both match `Can you ` and `can you `.
This allows us to make the code almost as simple as the version we started out with.

In [None]:
import random
import re

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)can you ", r"", reply)
print(beginning, reply)

But this piece of code is still not flawless.
Run the code above and enter *Can you let me can you to end this conversation?*.
This is a well-formed sentence of English as the second *can you* is an instance of the phrase *to can somebody*, which means to fire somebody.
A user may indeed enter a sentence like this.
So what does the program do with such an input?

The result may or may not be surprising to you - both instances of `can you ` are gone.
By making our match for `Can you ` case insensitive, we now match every instance of `can you ` in the string although we only want to delete the one at the very beginning of the sentence.
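
The effect can also be checked without typing anything in. The following sketch hard-codes the example sentence and leaves out the random beginning; the names `sentence`, `matches`, and `stripped` are only illustrative.

```python
import re

sentence = "Can you let me can you to end this conversation?"

# list every place where the case-insensitive pattern matches
matches = re.findall(r"(?i)can you ", sentence)
print(matches)

# without an anchor, re.sub deletes all of those matches
stripped = re.sub(r"(?i)can you ", r"", sentence)
print(stripped)
```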

In order to fix this, we have to use yet another regular expression operator: the symbol `^`, which matches the beginning of a string.
By replacing `"(?i)can you "` with `"(?i)^can you "`, we now tell Python to do a case insensitive search for `can you ` at the beginning of the string.
The symbol `^` is called *caret*, and on American keyboards it is entered with SHIFT 6.

In [None]:
import random
import re

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
print(beginning, reply)

Try this code on a variety of inputs that start with *Can you* or *can you*; it should work just fine.
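
If you would rather not type each input by hand, a small loop can run several sentences through the same regular expression. This is just a sketch: the list `examples` is made up, and the random beginning is left out so that the output is the same on every run.

```python
import re

examples = [
    "Can you drain the ocean with a tea egg?",
    "can you drain the Pacific with a tea egg?",
    "Can you let me can you to end this conversation?",
]

# apply the anchored, case-insensitive pattern to each example
results = [re.sub(r"(?i)^can you ", r"", text) for text in examples]
for result in results:
    print(result)
```

Only the sentence-initial *Can you* is removed in each case; the capitalization of *Pacific* and the second *can you* in the last sentence are left alone.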

**Exercise.**
Eliza doesn't just take a part of the user's question when it starts with *Can you* or *can you*, but also when it starts with *Do you* or *do you*.
In regular expressions, such alternatives can be specified as `(alternative1|alternative2)`.
So the alternatives must be between parentheses, separated by the *pipe* symbol `|` (to enter this symbol, hit the SHIFT key and \ at the same time).
Modify the code below (it's the same as the one above) so that the regular expression also works for *Do you* and *do you*.

In [None]:
import random
import re

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
print(beginning, reply)
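
If the syntax for alternatives is new to you, here is a sketch on an unrelated pattern, so that it does not give away the exercise. The greetings and the names `greetings` and `trimmed` are invented for illustration.

```python
import re

greetings = ["Hello there", "hi there", "Hey there"]

# (hello|hi) matches either alternative;
# (?i) ignores case and ^ anchors the match at the start of the string
trimmed = [re.sub(r"(?i)^(hello|hi) ", r"", text) for text in greetings]
print(trimmed)
```

The third string is unchanged because *Hey* is not one of the listed alternatives.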

## Stripping punctuation

Another problem with the current code is that we rely on the user to supply correct punctuation.
But as anybody who has ever read a YouTube comment can tell you, people are really sloppy with punctuation.
So rather than rely on the user, we will do it ourselves in two steps.
First, we strip away all punctuation at the end of the input, and then we add the question mark ourselves.

Let's start very simple, by deleting all instances of `.`, `?`, and `!`.
We have already seen how to do this with `re.sub`, so this should be easy.

**Exercise.**
Guess what you have to do now! ;)

In [None]:
# adding a regex for stripping away punctuation
# we will add our own punctuation in a later version

# copy-paste the code here, then extend it so that it deletes punctuation symbols (.?!)

But once again this doesn't work well in all cases.
Sure, for inputs like *Can you tell me what time it is?* it does just fine for stripping away the final `?`.
But try the slightly different *Can you help me? What time is it?*, and things go horribly wrong.
Again our match is too general.
We only want to remove punctuation symbols at the end of the string.
For this we need the counterpart of `^`: `$` matches the end of the string.

In [None]:
# adding a regex for stripping away punctuation
# we will add our own punctuation in a later version

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
# also remove punctuation from end
reply = re.sub(r"[\.\?!]$", r"", reply)
print(beginning, reply)

Alright, this is much better (although the chatbot's reply is still somewhat strange because it also contains the follow-up question; we will deal with that later).
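
To compare the two versions side by side, the sketch below applies the pattern with and without `$` to the same string. The string is what is left of *Can you help me? What time is it?* after *Can you* has been deleted; the names `everywhere` and `at_end` are only illustrative.

```python
import re

text = "help me? What time is it?"

# without $, every punctuation symbol in the string is deleted
everywhere = re.sub(r"[\.\?!]", r"", text)
print(everywhere)

# with $, only a punctuation symbol at the very end of the string is deleted
at_end = re.sub(r"[\.\?!]$", r"", text)
print(at_end)
```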

Before we proceed, a quick remark on `^` and `$`, which aren't exactly the most intuitive choices for indicating the beginning and the end of a string.
Things are made even worse by the fact that on American keyboards, `^` is SHIFT 6 whereas `$` is SHIFT 4.
At the very least, one would expect that the symbol for the beginning of the string is to the **left** of the symbol for the end of the string, not the other way round.
These conventions are mostly a historical accident - they were established in the 60s and 70s, when computer keyboards still looked very different from today.
But here is a mnemonic that helps me remember which is which:

1. Since `$` kind of looks like an S, we can take it to be short for *string*.
   So there is some link between `$` and strings.
1. When you **start** drawing an `$` by hand, you first add the curve at the top, which kind of looks like `^`.
   So `^` is the beginning of the string.
1. You only get `$` once you're done writing the symbol, at the very **end**.
   So `$` is the end of the string.

Alright, let's return to the chatbot itself.
There is still an issue with the code above.
You will notice it immediately when you run it with the input *Can you help me...*.
Go ahead, try it!

In [None]:
# adding a regex for stripping away punctuation
# we will add our own punctuation in a later version

# This is the same code as before!

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
# also remove punctuation from end
reply = re.sub(r"[\.\?!]$", r"", reply)
print(beginning, reply)

As you can see, our regular expression only deletes the last period at the end of the sentence, rather than all of them.
That's because `[\.\?!]` only matches a single character; it cannot match a sequence of characters.
If we want it to mean "match one or more instances of these characters", we have to add `+` right after it.

In [None]:
# adding a regex for stripping away punctuation
# we will add our own punctuation in a later version

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
# also remove punctuation from end
# we now match all punctuation symbols at the end, not just the last
reply = re.sub(r"[\.\?!]+$", r"", reply)
print(beginning, reply)

Great, this does indeed work!
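
The difference that `+` makes is easy to see on a fixed string. The following sketch uses the made-up names `single` and `run` and skips the `input` and `print` parts of the chatbot.

```python
import re

text = "help me..."

# without +, only the very last punctuation symbol is deleted
single = re.sub(r"[\.\?!]$", r"", text)
print(single)

# with +, the whole run of punctuation symbols at the end is deleted
run = re.sub(r"[\.\?!]+$", r"", text)
print(run)
```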

**Exercise.**
The two instances of `re.sub` in the code above can be combined into one using the `|` symbol for alternatives.
Copy-paste the code into the cell below and make the requested change.

In [None]:
# put your modified code here

So now we only have to add our question mark at the end of the string.
We could add it as a third argument of `print`, but that would add a space before the question mark.

In [None]:
# adding a regex for stripping away punctuation
# we now add our own punctuation

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
reply = re.sub(r"[\.\?!]+$", r"", reply)
print(beginning, reply, "?")

Okay, so adding the punctuation inside the `print` call won't work.
But there is a simple alternative: instead of deleting the punctuation symbols at the end, we just replace them by `?`.

**Exercise.**
Copy-paste the latest version of the code into the cell below, then modify it so that punctuation symbols are replaced by `?` rather than being deleted.

In [None]:
# put your code here

If you run the code above with the input *Can you help me...*, you will now get something like *You don't believe that I can help me?* as the answer.
Far from perfect, but the punctuation now works exactly as we want.
Except... yes, I'm sorry to say it, there is still one problem.
Instead of *Can you help me...*, try the code with *Can you help me*, without any punctuation at the end.
The answer you will get won't have a question mark at the end.

Why don't we get a question mark?
Because our regular expression basically says "if there are one or more punctuation symbols at the end of the string, replace them by a question mark".
But if the string has no punctuation symbols at the end, then there is nothing to replace and we don't get a question mark.
Instead, we want to say something more like this: "replace the end of the string, and any punctuation symbols you might find there, by a single question mark".
In other words, we want the punctuation symbols to be **optional**: if they are there, they get replaced; if they aren't, we still do our substitution.
The regular expression operator `()?` indicates this optionality, where the optional match is listed between the parentheses.
So instead of `[\.\?!]+$`, we need `([\.\?!]+)?$`.

In [None]:
# adding a regex for stripping away punctuation
# we now add our own punctuation

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
# also remove punctuation from end
# we now match all punctuation symbols at the end, not just the last
# and we replace them by ?
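# count=1 replaces only the leftmost match; otherwise Python 3.7+ would also
# replace the empty match at the very end and give ?? after punctuation
reply = re.sub(r"([\.\?!]+)?$", r"?", reply, count=1)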
print(beginning, reply)
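
Here is a non-interactive sketch for checking the optional-punctuation pattern on several inputs. The helper name `add_question_mark` is made up. It passes `count=1` to `re.sub` so that only the leftmost match is replaced: on Python 3.7 and later, `re.sub` would otherwise also replace the empty match that `([\.\?!]+)?$` finds at the very end of the string right after a run of punctuation, which results in `??`.

```python
import re

def add_question_mark(text):
    """Hypothetical helper: replace optional final punctuation by a single ?."""
    # count=1 makes re.sub stop after the first (leftmost) match
    return re.sub(r"([\.\?!]+)?$", r"?", text, count=1)

checked = [add_question_mark(text)
           for text in ["help me...", "help me", "help me?!"]]
print(checked)
```

All three inputs end up with exactly one question mark, whether or not they had punctuation to begin with.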

At this point your eyes may glaze over from all the symbol salad.
Don't despair!
Regular expressions can be a little tricky, but they're just a matter of practice.
You don't have to be a rocket engineer to figure out how they work; once you've gotten used to the confusing notation, they're actually fairly easy most of the time.
Here are the symbols we have encountered so far:

**Symbol** | **Meaning**
:-- | :--
. | match any single character (except a newline)
[xyz] | match any one of the characters x, y, z, between the brackets
(X)? | the match X between parentheses is optional
X+ | match one or more of the preceding pattern X
^ | match beginning of string
$ | match end of string
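
Each row of the table can be tried out on a toy string. The strings and variable names below are invented for illustration and have nothing to do with the chatbot.

```python
import re

# .  matches any single character
dot = re.sub(r"c.t", r"X", "cat cot cut")
# [ao]  matches one of the characters between the brackets
char_class = re.sub(r"[ao]", r"_", "cat cot cut")
# (very )?  makes the part between parentheses optional
optional = re.sub(r"(very )?good", r"fine", "good, very good")
# o+  matches one or more o in a row
plus = re.sub(r"o+", r"0", "so sooo good")
# ^ and $  match the beginning and the end of the string
caret = re.sub(r"^a", r"A", "a banana")
dollar = re.sub(r"a$", r"A", "a banana")

for result in (dot, char_class, optional, plus, caret, dollar):
    print(result)
```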

But you've got a point, we've covered quite a lot in this section.
Let's continue with something simpler, just to give you a breather.

## Pronoun replacement

Our chatbot still copies large parts of the user's input *verbatim*, without any changes.
In English this trick works mostly fine, but it runs into a serious issue with pronouns.
If the user asks *Can you help me*, the chatbot should reply *Do you want me to be able to help you*, not *Do you want me to be able to help me*.
Similarly, if the user says *Can you help yourself*, we should reply *You don't believe that I can help myself*, not *You don't believe that I can help yourself*.
So all first person pronouns must be replaced by their second person counterparts, and the other way round.
This is really easy compared to handling punctuation, but there are still several traps to avoid.

Let's try the following first.

In [None]:
# adding a regex for stripping away punctuation
# we now add our own punctuation

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
# also remove punctuation from end
# we now match all punctuation symbols at the end, not just the last
# and we replace them by ?
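# count=1 replaces only the leftmost match; otherwise Python 3.7+ would also
# replace the empty match at the very end and give ?? after punctuation
reply = re.sub(r"([\.\?!]+)?$", r"?", reply, count=1)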

# replace first person pronouns by second person
reply = re.sub(r" I ", r" you ", reply)
reply = re.sub(r" me ", r" you ", reply)
reply = re.sub(r" my ", r" your ", reply)
reply = re.sub(r" mine ", r" yours ", reply)

# replace second person pronouns by first person
reply = re.sub(r" you ", r" me ", reply)
reply = re.sub(r" your ", r" my ", reply)
reply = re.sub(r" yours ", r" mine ", reply)

print(beginning, reply)

This does not work.
When you type in *Can you help me please*, you still get a reply like *Do you want me to be able to help me please?*.
The reason is actually obvious: we first tell Python to replace all first person pronouns by second person pronouns, and then all second person pronouns by first person pronouns.
So the string *Can you help me please* changes as follows:

1. Delete *Can you*, leaving *help me please*.
1. Replace first person by second person, leaving *help you please*.
1. Replace second person by first person, leaving *help me please*.
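
The three steps above can be traced with a few lines of code. This sketch only uses the two substitutions that matter for this input, and the names `step1` to `step3` are made up.

```python
import re

# step 1: delete the sentence-initial "Can you "
step1 = re.sub(r"(?i)^can you ", r"", "Can you help me please")
# step 2: first person becomes second person
step2 = re.sub(r" me ", r" you ", step1)
# step 3: second person becomes first person, which undoes step 2
step3 = re.sub(r" you ", r" me ", step2)

for step in (step1, step2, step3):
    print(step)
```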

The problem is that the second set of substitutions undoes the first.
We have to somehow mark the new second person pronouns as special.
One way of doing this is to put a rarely used symbol sequence in front of them, e.g. `@@@` (this string is a good choice because the user is unlikely to ever use it in a sentence).
Then the second person substitution will no longer apply to those pronouns.
At the end, we delete `@@@` before pronouns and get the desired string.

In [None]:
# adding a regex for stripping away punctuation
# we now add our own punctuation

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
reply = re.sub(r"(?i)^can you ", r"", reply)
# also remove punctuation from end
# we now match all punctuation symbols at the end, not just the last
# and we replace them by ?
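# count=1 replaces only the leftmost match; otherwise Python 3.7+ would also
# replace the empty match at the very end and give ?? after punctuation
reply = re.sub(r"([\.\?!]+)?$", r"?", reply, count=1)

# replace first person pronouns by second person;
# since we add @@@ in front, they won't be rewritten later on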
reply = re.sub(r" I ", r" @@@you ", reply)
reply = re.sub(r" me ", r" @@@you ", reply)
reply = re.sub(r" my ", r" @@@your ", reply)
reply = re.sub(r" mine ", r" @@@yours ", reply)

# replace second person pronouns by first person
reply = re.sub(r" you ", r" me ", reply)
reply = re.sub(r" your ", r" my ", reply)
reply = re.sub(r" yours ", r" mine ", reply)

# delete all instances of @@@
reply = re.sub(r"@@@", "", reply)

print(beginning, reply)

This code now works for *Can you help me please*, but it still fails for *Can you help me*.
That is because we only replace pronouns between spaces, but a sentence-final pronoun is not followed by a space.
And actually we're going to run into the same problem if for some reason a personal pronoun occurs at the beginning of the string.
There's a very elegant way to solve this, which is to first break the string into words, replace words, and then reassemble the words into a string.
We cannot do this yet, so let us try a somewhat hacky solution that has the advantage that it can be done just with regular expressions: we will make sure that the string in which we do the substitutions always starts with a space and ends with a space.
That way, we don't run into the problems with sentence-initial or sentence-final pronouns.
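
For the curious, the word-based approach mentioned above might look roughly like the sketch below. It relies on `str.split`, a dictionary, and `str.join`, none of which this unit assumes you know, and the names `swaps` and `swap_pronouns` are made up. It also ignores punctuation that is attached to a word, so it is not a replacement for the regular expression version developed here.

```python
# each pronoun is listed once together with its counterpart
swaps = {"I": "you", "me": "you", "my": "your", "mine": "yours",
         "you": "me", "your": "my", "yours": "mine"}

def swap_pronouns(text):
    """Hypothetical helper: swap pronouns word by word."""
    # break the string into words
    words = text.split()
    # look up every word exactly once; words not in the dictionary stay as they are
    swapped = [swaps.get(word, word) for word in words]
    # reassemble the words into a string
    return " ".join(swapped)

print(swap_pronouns("give my book to your friend"))
```

Because every word is looked up only once, a replaced pronoun cannot be rewritten a second time, and pronouns at the beginning or end of the string need no special treatment.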

The overall procedure then works as follows:

1. Suppose the user enters *Can you help me???*
1. Delete *Can you*, but leave the space before *help* (as we did at the very beginning of the unit).
1. The string is now ` help me???`.
1. Instead of replacing `???` by `?`, substitute only a space.
   The string is now ` help me `.
1. Do the first round of pronoun substitutions.
   The string is now ` help @@@you `.
1. Delete all instances of `@@@`.
   The string is now ` help you `.
1. Delete sentence-initial space.
   The string is now `help you `.
1. Replace sentence-final space by question mark.
   The string is now `help you?`.
   
Here's the corresponding code:

In [None]:
# adding a regex for stripping away punctuation
# we now add our own punctuation

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
# keep space after you
reply = re.sub(r"(?i)^can you", r"", reply)
# also remove punctuation from end
# but replace it by space now
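# count=1 replaces only the leftmost match; otherwise Python 3.7+ would also
# replace the empty match at the very end and add a second space
reply = re.sub(r"([\.\?!]+)?$", r" ", reply, count=1)

# replace first person pronouns by second person;
# since we add @@@ in front, they won't be rewritten later on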
reply = re.sub(r" I ", r" @@@you ", reply)
reply = re.sub(r" me ", r" @@@you ", reply)
reply = re.sub(r" my ", r" @@@your ", reply)
reply = re.sub(r" mine ", r" @@@yours ", reply)

# replace second person pronouns by first person
reply = re.sub(r" you ", r" me ", reply)
reply = re.sub(r" your ", r" my ", reply)
reply = re.sub(r" yours ", r" mine ", reply)

# delete all instances of @@@
reply = re.sub(r"@@@", r"", reply)

# delete initial space
reply = re.sub(r"^ ", r"", reply)
# replace final space by ?
reply = re.sub(r" $", r"?", reply)

print(beginning, reply)
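
To test the whole procedure on several sentences at once, the same steps can be wrapped in a function. This is only a sketch: the name `rewrite` is made up, the random beginning is left out so that the results are predictable, and `count=1` is passed to the punctuation step so that Python 3.7 and later replace only the leftmost match instead of adding a second space for the empty match at the very end.

```python
import re

def rewrite(text):
    """Hypothetical helper: the steps of the cell above, minus input and print."""
    # delete "can you" but keep the space after it
    text = re.sub(r"(?i)^can you", r"", text)
    # replace optional final punctuation by a single space
    text = re.sub(r"([\.\?!]+)?$", r" ", text, count=1)
    # first person to second person, marked with @@@
    text = re.sub(r" I ", r" @@@you ", text)
    text = re.sub(r" me ", r" @@@you ", text)
    text = re.sub(r" my ", r" @@@your ", text)
    text = re.sub(r" mine ", r" @@@yours ", text)
    # second person to first person; marked pronouns are skipped
    text = re.sub(r" you ", r" me ", text)
    text = re.sub(r" your ", r" my ", text)
    text = re.sub(r" yours ", r" mine ", text)
    # remove the markers, the initial space, and turn the final space into ?
    text = re.sub(r"@@@", r"", text)
    text = re.sub(r"^ ", r"", text)
    return re.sub(r" $", r"?", text)

for sentence in ["Can you help me???", "Can you give me your book"]:
    print(rewrite(sentence))
```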

Pretty convoluted, but that is what a lot of work with regular expressions is like.
You have to think about a way to set up an elaborate system of pattern matching rules that step by step converts the input into the output you want.
This isn't particularly elegant, but in a system that doesn't even understand what a pronoun is, that's the only way to do it.

**Exercise.**
Our pronoun substitution system works fairly well, but you might have noticed that we do not cover cases where the user entered *Me* or *My* instead of *me* and *my*.
Similarly, we always replace *you* by *me*, although it might sometimes be a subject and thus need to be replaced by *I*.
A general solution is difficult, but at the very least we want to account for the most common cases:

- *you are* should be *I am*, and the other way round
- *you're* should be *I'm*, and the other way round
- *you* should be replaced by *I* before *may*, *must*, *have*, *need*, *should*, *feel*, and *think*.

Adapt the code below so that it handles those additional cases.

In [None]:
# adding a regex for stripping away punctuation
# we now add our own punctuation

reply = input()

beginning = random.choice(["You don't believe that I can",
                           "Do you want me to be able to"])
# keep space after you
reply = re.sub(r"(?i)^can you", r"", reply)
# also remove punctuation from end
# but replace it by space now
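# count=1 replaces only the leftmost match; otherwise Python 3.7+ would also
# replace the empty match at the very end and add a second space
reply = re.sub(r"([\.\?!]+)?$", r" ", reply, count=1)

# replace first person pronouns by second person;
# since we add @@@ in front, they won't be rewritten later on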
reply = re.sub(r" I ", r" @@@you ", reply)
reply = re.sub(r" me ", r" @@@you ", reply)
reply = re.sub(r" my ", r" @@@your ", reply)
reply = re.sub(r" mine ", r" @@@yours ", reply)

# replace second person pronouns by first person
reply = re.sub(r" you ", r" me ", reply)
reply = re.sub(r" your ", r" my ", reply)
reply = re.sub(r" yours ", r" mine ", reply)

# delete all instances of @@@
reply = re.sub(r"@@@", r"", reply)

# delete initial space
reply = re.sub(r"^ ", r"", reply)
# replace final space by ?
reply = re.sub(r" $", r"?", reply)

print(beginning, reply)

Well, we covered quite a lot of ground in this unit.
We still don't know everything there is to know about regular expressions, but we have the basics under our belts now.
The next unit will teach you one more important technique, and after that we will slowly grow our collection of regular expression tricks throughout the semester.
It might all still feel pretty awkward to you, but you'll be a *regex wizard* in no time.