# Backreferences for smarter input recycling

By now our regular expression repertoire is large enough to allow for some impressive tricks.
Sure, our few lines of code aren't enough to fool anybody for very long into believing that they are talking to a real human.
But the **techniques** themselves already cover an enormous amount of ground.
If you were to sit down for a week and just keep adding more and more code and regular expressions to cover all kinds of cases, you would end up with a fairly convincing chatbot.
You could easily do much better than Eliza.

That said, one more trick is needed to cover some edge cases, namely how we can turn *Can you help this man?* into *Would you like helping this man?*.

## Adding -ing inside the input: A failed attempt

Consider the example above where the user's input *Can you help this man?* gets the reply *Would you like helping this man?*.
By now you know how such input reuse works: it's just the result of some regular expression substitutions.
But notice that this example is slightly different from what we have encountered so far.
In previous cases, we deleted part of the user input and combined the rest with some prewritten string.
More precisely, we would delete *Can you* from the input and print the remainder right after *Would you like*.
But this isn't sufficient in this case:

In [None]:
# our current solution for reusing user inputs

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you help this man?"

# delete "can you "
reply = re.sub(r"(?i)^can you ", r"", reply)
# here we might have some other code for pronoun substitutions and punctuation

print("Would you like", reply)

The code above produces the output *Would you like help this man?*, which is not a grammatical sentence of English.
We are missing the *ing* after *help*.
But how could we possibly get it there?
It seems that we have to insert *ing* inside of `reply`, but that's tricky to do.

We could also delete *help* and then insert *Would you like helping* before the remainder of the user's reply.

In [None]:
# a first attempt at inserting -ing after the verb

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you help this man?"

# delete "can you help "
reply = re.sub(r"(?i)^can you help ", r"", reply)
# here we might have some other code for pronoun substitutions and punctuation

print("Would you like helping", reply)

This works for any question of the form *Can you help XYZ?*, but it obviously produces wrong results with other inputs:

In [None]:
# a first attempt at inserting -ing after the verb;
# now with a different input, producing a bad reply

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you fix this man?"

# delete "can you help "
reply = re.sub(r"(?i)^can you help ", r"", reply)
# here we might have some other code for pronoun substitutions and punctuation

print("Would you like helping", reply)

Hmm, not good, not good at all.
Alright, maybe we can fix this with a bunch of `if` statements to ensure we only use a substitution when it is appropriate.
But that's not really a good solution.

In [None]:
# a second attempt at inserting -ing after the verb;
# now with many if-then tests

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you fix this man?"

if "Can you help" in reply:
    # delete "can you help "
    reply = re.sub(r"(?i)^can you help ", r"", reply)
    # here we might have some other code for pronoun substitutions and punctuation
    print("Would you like helping", reply)
    
if "Can you fix" in reply:
    # delete "can you fix "
    reply = re.sub(r"(?i)^can you fix ", r"", reply)
    # here we might have some other code for pronoun substitutions and punctuation
    print("Would you like fixing", reply)

**Exercise.**
Think carefully about how well this code would work in practice, then list at least three shortcomings.
You could give example sentences where the code won't work as intended, e.g. because the chatbot produces an ungrammatical reply or multiple replies.
You can also talk about why the code is inelegant and wouldn't be much fun to write for a programmer.

*put your answers here*

## Storing multiple parts of the input

The solution above clearly isn't usable in practice, so let's consider another option.
Intuitively, it seems that whenever we get an input of the form *Can you verb XYZ*, we want to delete *Can you* and split the remainder into two strings, *verb* and *XYZ*.
Then we should be able to produce an output of the form *Would you like verbing XYZ?*.

In [None]:
# a second attempt at inserting -ing after the verb;
# now we store different parts of the input in distinct variables

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you help this man?"

# delete "can you "
reply = re.sub(r"(?i)^can you ", r"", reply)
# delete everything after the first space, and store the remainder as verb
verb = re.sub(r" .*", r"", reply)
# delete the first word and space, and store the remainder as end; \w is a shorthand for [A-Za-z0-9_]
end = re.sub(r"^\w+ ", r"", reply)
# here we might have some other code for pronoun substitutions and punctuation

# put everything together
print("Would you like", verb, "ing", end)

This works much better than the previous version of the code.
It produces a usable reply with any input of the form *Can you verb XYZ*, not just for those verbs for which we wrote a dedicated regular expression.
And we do not need to juggle around all those `if` statements.

**Exercise.**
The code above does not work correctly if the user made a typo and accidentally entered two spaces after *Can you*, or a tab.
First, describe the problem and why it arises.
Then fix the first instance of `re.sub` so that the problem no longer arises.

*Hints*:
- `\s` matches whitespace characters, including space, newline (`\n`), and tabulator (`\t`).
- Remember that you can use `+` and `*` to iterate a class of characters.

In [None]:
# a second attempt at inserting -ing after the verb;
# we have to fix a minor bug

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you help this man?"

# delete "can you ";
# FIXME: something isn't quite right with this substitution
reply = re.sub(r"(?i)^can you ", r"", reply)
# delete everything after the first space, and store the remainder as verb
verb = re.sub(r" .*", r"", reply)
# delete the first word and space, and store the remainder as end; \w is a shorthand for [A-Za-z0-9_]
end = re.sub(r"^\w+ ", r"", reply)
# here we might have some other code for pronoun substitutions and punctuation

# put everything together
print("Would you like", verb, "ing", end)

**Exercise.**
Even with your fix the code above still runs into a problem if the user is careless with how many spaces he or she inserts.
What is this second problem, and how can it be fixed?

*put your explanation of the problem here*

In [None]:
# fix the line below so that it avoids the problem you noticed
reply = re.sub(r"(?i)^can you", r"", reply)

One obvious quibble with the current code is the space between the verb and *ing*.
Instead of *Would you like helping this man?* we get *Would you like help ing this man?*.
There are two ways of fixing this.

1. We can tell Python not to insert a space between the arguments of `print`.
1. We can combine `verb` and `"ing"` into a single string before they are printed.

For the first solution, we have to set the parameter `sep` in the `print` function to the *empty string* `""` and then insert the spaces ourselves.

In [None]:
# Solution 1: change the print statement

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you help this man?"

# delete "can you "
reply = re.sub(r"(?i)can you ", r"", reply)
# delete everything after the first space, and store the remainder as verb
verb = re.sub(r" .*", r"", reply)
# delete the first word and space, and store the remainder as end
end = re.sub(r"^\w+ ", r"", reply)
# here we might have some other code for pronoun substitutions and punctuation

# put everything together;
# we use sep="" so that no spaces are inserted between arguments,
# then we add some spaces on our own wherever we need them
print("Would you like", " ", verb, "ing", " ", end, sep="")

**Exercise.**
For each one of the input-output pairings below, write a `print` statement that automatically prints the input as the output:

In [None]:
# finish the code here
print("ba", "a", "a")  # -> banana
print("John", "Mary", "Sue")  # -> John, Mary, Sue
print("Police", "police", "police") # -> Police police police police police
print("A", "list", "of", "words") # -> A
                                  #    list
                                  #    of
                                  #    words
                                  #    (Hint: what's the special character for printing a new line?)

The second solution is a lot simpler: whenever we have two strings, we can tell Python to concatenate them into a single string with the operator `+`.

In [None]:
verb = "help"
verbing = verb + "ing"
print(verbing)

**Exercise.**
Do you know what we haven't had in a while?
Experimentation time!
List at least five commands that explore possible usages of `+`, and for each one add a comment that explains what you are trying to find out.
Then provide a summary of your findings at the end.
Two things you should pay particular attention to:

- Can we combine multiple strings with `+`?
- Can we combine strings with other things, e.g. numbers or lists?

In [None]:
# put your experimentation code here

*put your summary here*

By using `+`, we can greatly simplify the `print` statement:

In [None]:
# Solution 2: combine verb and "ing" with +

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you help this man?"

# delete "can you "
reply = re.sub(r"(?i)can you ", r"", reply)
# delete everything after the first space, and store the remainder as verb
verb = re.sub(r" .*", r"", reply)
# delete the first word and space, and store the remainder as end
end = re.sub(r"^\w+ ", r"", reply)
# here we might have some other code for pronoun substitutions and punctuation

# put everything together
print("Would you like", verb + "ing", end)

Well, this pretty much does what we want, and it is a fair bit nicer than changing the separator for `print`.
So, end of story?
No, not quite.

## A more elegant solution: Backreferences

The solution above works, but it is clunky.
For each part of the string, we have to define a new variable and use `re.sub` to delete everything from the string except the part that we want to store in that variable.
That's pretty tedious, and it also makes the code hard to read for others.
Regular expressions provide a much simpler solution: *backreferences*.

Backreferences couldn't be easier.
Whenever a part of a regular expression appears in parentheses, we can refer to whatever that part matched by a number.
For a concrete example, consider date conversions:

- An American date is of the form `MM/DD/YYYY`, i.e. two-digit number of the month, two-digit number of the day, and the four-digit year, all separated by slashes.
- A European date, on the other hand, is of the form `DD.MM.YYYY`, so the order of day and month is switched and the separator is a dot instead of a slash.
- An ISO date is of the form `YYYY-MM-DD`, so the units are ordered from largest to smallest and separated by hyphens.

Now suppose that we have a document full of US-style dates and want to convert them to the European format.
Intuitively we want to switch the order of days and months, and replace slashes by dots.
This is very easy with backreferences.

In [None]:
# convert US date to European date
import re

date_us = "03/11/2012"

# use re.sub with backreferences and \d as a shorthand for [0-9]
date_eu = re.sub(r"(\d+)/(\d+)/(\d+)", r"\2.\1.\3", date_us)
print("European conversion:", date_us, "becomes", date_eu)

In the code above, we first define a match of the form `\d+/\d+/\d+`.
This doesn't look very nice because of all the slashes and backslashes, but it's not too complicated once we move through it step by step:

1. `\d` represents a digit, i.e. any member of `[0-9]`
1. `+` tells us that the previous item should be matched one or more times; hence `\d+` means "one or more digits"
1. `/` matches a slash
1. and at this point the pattern repeats, matching sequences of digits separated by slashes.

But the code above doesn't just use the regular expression `\d+/\d+/\d+`, it uses `(\d+)/(\d+)/(\d+)`.
This is the crucial trick that allows us to use backreferences.
Each pair of parentheses defines a new group, and we can reference these groups by escaped numbers:

1. `\1` matches the first group, i.e. month in the US format
1. `\2` matches the second group, i.e. day in the US format
1. `\3` matches the third group, i.e. year in the US format

So when we tell Python to replace `(\d+)/(\d+)/(\d+)` by `\2.\1.\3`, we are saying "start with the second group, then put in a dot, then the first group, then a dot, then the third group".
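
To see what each group captures, it can help to look at the match itself before doing any substitution. The short sketch below uses `re.search` and the `group` method of the match object, neither of which has appeared in this notebook so far; the variable name `match` is only chosen for illustration.

```python
import re

date_us = "03/11/2012"

# search for the same pattern as above and keep the match object
match = re.search(r"(\d+)/(\d+)/(\d+)", date_us)

# group(0) is the whole match; group(1), group(2), group(3) are the parenthesized groups
print("whole match:", match.group(0))
print("group 1 (month):", match.group(1))
print("group 2 (day):", match.group(2))
print("group 3 (year):", match.group(3))

# \2.\1.\3 in re.sub puts these same pieces together in a new order
date_eu = match.group(2) + "." + match.group(1) + "." + match.group(3)
print(date_eu)
```

Building the string by hand like this gives the same result as the `re.sub` call with `\2.\1.\3`, which is why the backreference version is so much shorter.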

**Exercise.**
Complete the code below so that it converts the US date to the ISO format YYYY-MM-DD instead.

In [None]:
# convert US date to ISO date
date_us = "03/11/2012"

# use re.sub with backreferences and \d as a shorthand for [0-9]
date_iso = re.sub(r"", r"", date_us)  # FIXME: add correct regexes
print("ISO conversion:", date_us, "becomes", date_iso)

Alright, so how can backreferences help us with the problem of inserting *ing*?
We can write a single regular expression that will define two groups, one consisting of the first word, the second of everything after the first word.
Then we replace the match by `\1ing\2`, i.e. the word followed by *ing* and then the second group.

In [None]:
# The final solution: use backreferences

# as always, we have to import the re module first
import re

# let's fix the user's reply to keep the example easy to test
reply = "Can you help this man?"

# immediately produce the desired output by
# - deleting "can you "
# - using \w+ to match the first word after "can you "; in the output, we insert "ing" right after the match
# - following this up by whatever the rest of the input is (remember, .* matches everything!)
reply = re.sub(r"(?i)can you (\w+)(.*)", r"\1ing\2", reply)
# here we might have some other code for pronoun substitutions and punctuation

# put everything together
print("Would you like", reply)
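
To check that this single substitution carries over to verbs other than *help*, here is a small sketch that wraps it in a function and tries a few inputs. The function name `would_you_like` and the test sentences are made up for illustration. Keep in mind that simply appending *ing* gives odd spellings for some verbs (*make* would become *makeing*), which this sketch does not try to fix.

```python
import re

def would_you_like(reply):
    # hypothetical helper: same substitution as in the cell above
    rest = re.sub(r"(?i)can you (\w+)(.*)", r"\1ing\2", reply)
    return "Would you like " + rest

# try a few different verbs, including a lowercase input
for sentence in ["Can you help this man?", "Can you fix this man?", "can you read this book?"]:
    print(would_you_like(sentence))
```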

**Exercise.**
Earlier on we saw that assuming a single space between `you` and the next word isn't safe because the user may make typos.
This problem is still present with the code above.
Modify the code in the cell below so that this problem no longer arises.

In [None]:
# The final solution with a minor tweak to space matching
import re
reply = "Can you help this man?"

reply = re.sub(r"(?i)can you (\w+)(.*)", r"\1ing\2", reply)  # FIXME: this regex needs to be adjusted
# here we might have some other code for pronoun substitutions and punctuation

print("Would you like", reply)

The code above works like a charm, and it does it all with a single regular expression.
Backreferences are a tremendously useful tool because in almost all cases where one wants to modify a string, one also wants to preserve certain parts of it.
This isn't limited to chatbots, it is also essential for cleaning up any data that one wants to use in a program.
For example, we may want to analyse the writing style of Shakespeare, but when we download an ebook of Hamlet, it may be full of crud like HTML tags.
These tags are obviously not part of Hamlet but just instructions for how the file should be displayed on an ebook reader or in a web browser.
All of this stuff should be ripped out while preserving the text, and that's exactly what backreferences are good for.
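
As a rough sketch of that idea, the cell below strips simple tags from a line while keeping the text between them. The sample line and the pattern are assumptions made for illustration: the pattern only handles lowercase tags without attributes that are not nested inside each other, and `.*?` is a non-greedy variant of `.*` that stops at the first possible closing tag. Real HTML is messier than this.

```python
import re

# a made-up line with two simple HTML tags
line = "<b>To be</b>, or <i>not</i> to be"

# group 1 is the text between an opening tag and the next closing tag;
# we replace tag + text + tag by just the text
clean = re.sub(r"<[a-z]+>(.*?)</[a-z]+>", r"\1", line)
print(clean)
```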

## Another application for backreferences

Backreferences will be important for the rest of this course, but they are also very useful for your daily work.
Or at least, they are in mine.
Let me show you two examples.

All the notebooks for this course are written in [Markdown](https://www.markdowntutorial.com/).
Intuitively, Markdown allows you to write text with some basic formatting such as italics and lists without using a word processor.
Markdown even allows you to write mathematical formulas in a convenient fashion.
For example, `\frac{1}{2}` is displayed as $\frac{1}{2}$.
(If you cannot see the formula, make sure JavaScript is activated in your browser.)

I use notebooks with mathematical formulas a fair bit for some of my other classes.
But sometimes I get confused and accidentally use Python-like syntax, typing `\frac{1,2}`, which displays incorrectly as $\frac{1,2}$.
I could go through the whole notebook and fix those errors by hand, but it is much faster to do it with a regular expression that uses backreferences:

In [None]:
import re

# a test string for demonstration purposes
string = r"The correct syntax is \frac{x}{y} not \frac{x,y}, so both \frac{1,2} and \frac{101, \pi} are incorrect"
print(string)
# replace \frac{x, y} with \frac{x}{y}
string = re.sub(r"frac{([^,]*),\s*([^}]*)}", r"frac{\1}{\2}", string)
print(string)

Here's how this regular expression works in detail.
Basically, we want to split any expression of the form `frac{x, y}` into `frac{-group1-,-whitespace-group2-}`.
The regular expression accomplishes this as follows:

1. `frac{`: matches frac{
1. `(`: starts the first group
1. `[^,]*`: 0 or more characters that are not a comma
1. `)`: ends the first group
1. `,`: matches the comma
1. `\s*`: 0 or more whitespace characters (space, tabulator, newline)
1. `(`: starts the second group
1. `[^}]*`: 0 or more characters that are not }
1. `)`: ends the second group
1. `}`: matches }

As you can see, the regex may look like an intimidating barrage of strange characters at first, but if you carefully work through it from left to right you can make sense of it soon enough.
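
One caveat about the pattern above, offered as a sketch rather than a definitive fix: `[^,]*` may also run across a closing `}`, so a correct `\frac{x}{y}` that is followed by a comma later in the text can be caught by accident. A slightly stricter variant uses `[^,}]*` for the first group, so that the group has to end before the first comma or closing brace. The test string below is made up for illustration.

```python
import re

# the correct fraction comes first and is followed by a comma
string = r"We have \frac{a}{b}, and also \frac{1,2}"

# original pattern: the first group may swallow "}{" and reach the comma after \frac{a}{b}
loose = re.sub(r"frac{([^,]*),\s*([^}]*)}", r"frac{\1}{\2}", string)
print(loose)

# stricter pattern: the first group must not contain a comma or a closing brace
strict = re.sub(r"frac{([^,}]*),\s*([^}]*)}", r"frac{\1}{\2}", string)
print(strict)
```

With the stricter pattern the already correct fraction is left alone and only `\frac{1,2}` is rewritten.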

Here's another case where I use regular expressions a fair bit in my daily routine: inserting links.
In Markdown, any string can be made a link by surrounding it in square brackets and following it up by a URL in parentheses.
For example, the link

[Markdown](https://www.markdowntutorial.com/)

is created from the code

`[Markdown](https://www.markdowntutorial.com/)`.

Sometimes I want to make every instance of certain words and phrases a link, but doing that by hand is tedious.
Instead, I add the links later on with a regular expression.
Below is a code snippet that will replace every instance of *Markdown*, *markdown*, or *[Mm]arkdown tutorial* by a link to the markdown tutorial.

In [None]:
import re

# define a test string for demonstration purposes
string = "The markdown tutorial teaches you a bit about Markdown, which allows you to use markdown notation for text formatting."
print(string)

# now we add the markdown command for the link
string = re.sub(r"(?i)(markdown( tutorial)?)", r"[\1](https://www.markdowntutorial.com)", string)
print(string)

**The moral of the story:**
Put some effort into mastering regular expressions, and soon you'll also be able to save time by automatically fixing your own documents!

**The secondary moral of the story:**
[Markdown](https://www.markdowntutorial.com) is really cool, you should give it a try!
Why deal with a clunky and slow word processor that only works on PCs if you can write your essay as a txt file with Markdown on any device you want and then convert it to docx, pdf, a website, an ebook, or whatever else you want?