# Writing a cleaning function for preprocessing text

The previous unit contained only toy examples of functions.
This unit looks at a practical use that is ubiquitous in language technology: cleaning up text.
The type of text varies.
It may be a sentence entered by the user, an email, a tweet, a website, or a bunch of books you downloaded from the internet.
But whatever it is, odds are that the text contains all kinds of things that you need to fix to make the text easier to work with in Python.
This is called *preprocessing* or simply *cleaning*.

## Cleaning user input for a chatbot

We have already seen a very simple kind of cleaning in previous units.
There, we used `str.lower` to remove capitalization, and we also used a regular expression to get rid of punctuation.
This greatly simplifies the rest of the program, for instance `if`-tests.
Contrast the two mini-programs below.

In [None]:
print("Hi. Do you think I'm human?")
reply = input()

while reply not in ["Yes.", "Yes", "yes.", "yes"]:
    if reply in ["No.", "No", "no.", "no"]:
        print("But I am! Why won't you believe me?")
    else:
        print("Please answer Yes or No.")
    print("So, I ask you again: Do you think I'm human?")
    reply = input()
        
print("Fooled'ya!")

In [None]:
import re

print("Hi. Do you think I'm human?")
reply = input()
# some basic cleanup
reply = str.lower(reply)
reply = re.sub(r"[\.\?!,; ]", r"", reply)

while reply != "yes":
    if reply == "no":
        print("But I am! Why won't you believe me?")
    else:
        print("Please answer Yes or No.")
    print("So, I ask you again: Do you think I'm human?")
    reply = input()
    reply = str.lower(reply)
    reply = re.sub(r"[\.\?!,; ]", r"", reply)
        
print("Fooled'ya!")

**Exercise.**
The second program is more general than the first.
List at least 5 user inputs that are handled correctly by the second program, but not the first.

*list your user inputs here (one line per input)*

In the second program, the conditions for `while` and `if` are simpler because we know that the user input has been preprocessed so that all characters are lowercase and the string contains no spaces or punctuation.
But this comes at the price of code duplication.
The first time we get a reply, we clean it up with `str.lower` and then `re.sub`.
A few lines later, we do the same thing inside the `while` loop.
Alright, that's not the most elegant solution, but it gets the job done, so it's fine, right?
Not quite.

Suppose that later down the road, you decide to also remove hyphens from `reply`.
Then you have to replace

```python
re.sub(r"[\.\?!,; ]", r"", reply)
```

by

```python
re.sub(r"[\.\?!,; \-]", r"", reply)
```

A simple change.
But crucially, you have to do it in two places: before the `while`-loop, and inside of the `while`-loop.
If you change only one of the two and then change your program under the assumption that `reply` never contains a hyphen (`-`), all of a sudden things may go wrong in some cases.
You have accidentally introduced a bug.

Our code snippets are so short that it won't take you long to track down the mistake, but in the real world programs aren't this short.
A simple Python script can easily have a few hundred lines, and more complex applications quickly run into thousands of lines of code.
An operating system involves millions of lines of code!
With such a code base, finding the one line you forgot to change is like looking for a needle in a haystack.
Functions allow you to work around this problem.

## A clean-up function

Think a bit about the structure of the program above.
It actually involves two tasks:

1. cleaning up the user input, and
1. interacting with the user until they provide the desired input.

In the current code, these two tasks are directly interwoven: there is no clear separation between the lines that handle the first task and the lines that handle the second.
With functions, we can tease the two apart.
Here is what this would look like.

In [None]:
import re

def clean_up(user_string):
    user_string = str.lower(user_string)
    user_string = re.sub(r"[\.\?!,; ]", r"", user_string)
    # a new command: return defines the output of the function
    return user_string

print("Hi. Do you think I'm human?")
reply = input()
# some basic cleanup
reply = clean_up(reply)

while reply != "yes":
    if reply == "no":
        print("But I am! Why won't you believe me?")
    else:
        print("Please answer Yes or No.")
    print("So, I ask you again: Do you think I'm human?")
    reply = input()
    reply = clean_up(reply)
        
print("Fooled'ya!")

This code snippet contains a command we haven't encountered before, `return`.
Ignore it for now; the important thing is the overall structure of the code.
As you can see, we first define a function `clean_up` whose job it is to clean up any string that is passed into it as an argument.
The commands are exactly the ones we used before, that is to say, `str.lower` and `re.sub` with the appropriate substitutions.
The rest of the program is almost exactly the same, except that we now call the `clean_up` function in two places to preprocess the user input.
This may seem like a minor change, but it completely changes the program.
If we now decide to also remove hyphens from user input, we only have to make a single change in the definition of `clean_up`.
Both points where the function is called automatically use the new definition.
With a change to a single line of code, we have simultaneously changed the program in two places.

This is the true power of functions.
We can dissect our program into tasks, and each task gets its own function.
Any modification we make to the function is automatically used at every point where we call the function.
Without functions, you might have to make the same change to hundreds of lines in your program.
With functions, the change is made only once, in the definition of the function.
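
As a minimal sketch of this idea, here is what the hyphen change could look like. The only modification is inside `clean_up`, where a hyphen (escaped as `\-`) is added to the character class. The variable names `first_reply` and `second_reply` and the example strings are made up for illustration, and fixed strings stand in for `input()` so that the cell runs without user interaction.

```python
import re

def clean_up(user_string):
    user_string = str.lower(user_string)
    # the only change: a hyphen (escaped as \-) was added to the character class
    user_string = re.sub(r"[\.\?!,; \-]", r"", user_string)
    return user_string

# both calls automatically use the updated definition
first_reply = clean_up("Y-e-s!")
second_reply = clean_up("No - no.")

print(first_reply)
print(second_reply)
```

Neither call had to be touched. Only the definition changed.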

## What is `return`? It is not `print`!

Consider once more the definition of `clean_up`.

In [None]:
import re

def clean_up(user_string):
    user_string = str.lower(user_string)
    user_string = re.sub(r"[\.\?!,; ]", r"", user_string)
    # a new command: return defines the output of the function
    return user_string

clean_up("Hi!!!1! I hate periods, but I love ellipsis...")

What is the use of `return` here?
It seems to do the same thing as `print`.

In [None]:
import re

def clean_up(user_string):
    user_string = str.lower(user_string)
    user_string = re.sub(r"[\.\?!,; ]", r"", user_string)
    # let's try print instead of return
    print(user_string)

clean_up("Hi!!!1! I hate periods, but I love ellipsis...")

It looks like `return` and `print` do the same thing.
In both cases you get the string `hi1ihateperiodsbutiloveellipsis`.
When you look more closely, though, you might already notice some minor differences.
With `print`, you just get the string.
With `return`, you also get something like `Out[3]` in your output, immediately to the left of the string.
You also don't get `hi1ihateperiodsbutiloveellipsis` but rather `'hi1ihateperiodsbutiloveellipsis'`.
Why these subtle differences?
Because `return` produces an output, whereas `print` only shows a message on the screen.

The difference may seem subtle (or perhaps even non-existent) to you, but it is incredibly important.
Whenever you run Python code, there are two "recipients":

1. The user sitting in front of the screen, and
1. Python itself.

The `print` function only serves one purpose, and that's to display information to the user.
It cannot be used to make information accessible to Python.
So far this hasn't been an issue because we didn't use functions, so Python had access to everything in the code you wrote.
But remember that functions are essentially blackboxes.
When you call a function `foo` with, say, two arguments `x` and `y`, you feed the values of `x` and `y` into this blackbox.
Inside the blackbox, Python carries out computations according to the definition of `foo`, but from the outside nobody can see what's going on in there.
If you want Python to use the end result of whatever happened in the blackbox, you have to open a door, metaphorically speaking.
The `return` command opens this door.
Without the `return` command, a function is more like a black hole that absorbs some arguments and never returns any output.
Sure, you can use `print` to tell the user what's going on, but Python can't use what was printed.
If you want to pass something out of the function, you need `return`.
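
To make this a bit more concrete, here is a small sketch of what "passing something out of the function" allows. The value that `clean_up` returns is stored in a variable and then reused in further computations. The names `cleaned`, `is_double_yes`, and `length` and the example string are made up for illustration.

```python
import re

def clean_up(user_string):
    user_string = str.lower(user_string)
    user_string = re.sub(r"[\.\?!,; ]", r"", user_string)
    return user_string

# the returned string leaves the function and can be stored in a variable
cleaned = clean_up("Yes, yes!")

# Python can now keep working with it
is_double_yes = cleaned == "yesyes"
length = len(cleaned)

# print is only used at the very end, to show the results to the user
print(cleaned, length, is_double_yes)
```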

Here's how you can think of this in terms of flowcharts.
Let's look first at the flowchart for the chatbot without the separate `clean_up` function.

```
print Hi
|
store user input as reply
|
lowercase reply
|
remove punctuation
|
reply != yes? yes ---> reply == no? yes --> print complaint
|      ^                     |                |
|      |               print Answer Yes/No    |
|      |                     |                |
|      |                     |<---------------|
|      |                     |
|      |               print Am I human?
|      |                     |
|      |               store user input as reply
|      |                     |
|      |               lowercase reply
|      |                     |
|      |---------------remove punctuation
|
print Fooled'ya
```

And here's what it looks like with the `clean_up` function.
Notice how we now effectively have two separate flowcharts.
The first one, called `MAIN`, illustrates the main structure of the program.
The second one, `CLEAN_UP`, depicts what happens inside the `clean_up` function.

```
MAIN:

print Hi
|
store user input as reply
|
/---------------\
|clean_up(reply)|
\---------------/
|
reply != yes? yes ---> reply == no? yes --> print complaint
|      ^                     |                |
|      |               print Answer Yes/No    |
|      |                     |                |
|      |                     |<---------------|
|      |                     |
|      |               print Am I human?
|      |                     |
|      |               store user input as reply
|      |                     |
|      |               /---------------\
|      |               |clean_up(reply)|
|      |               \---------------/
|      |                     |
|      |---------------------|
|
print Fooled'ya

CLEAN_UP(x):

lowercase x
|
remove punctuation from x
|
output clean x
```

The calls to `clean_up` are surrounded by borders in `MAIN` to highlight that the function acts like a blackbox.
`MAIN` does not know what is going on inside.
But notice that `MAIN` does not end with `clean_up`.
Instead, it relies on `clean_up` to do its job and then give it some output to work with.
Whenever you have a flowchart where a function is inserted somewhere in the middle and thus has the job of converting some input to some output, you know that this function needs a `return` command.

**The print-return mnemonic: `print` prints to screen, `return` returns an output.**

**Exercise.**
To see for yourself that `print` does not provide what you want, modify the definition below so that it uses `print` instead of `return`.
Then run the function.
Something will go pretty wrong.
Think about why the program behaves the way it does, and write down your explanation in the cell below.

In [None]:
import re

def clean_up(user_string):
    user_string = str.lower(user_string)
    user_string = re.sub(r"[\.\?!,; ]", r"", user_string)
    return user_string

print("Hi. Do you think I'm human?")
reply = input()
# some basic cleanup
reply = clean_up(reply)

while reply != "yes":
    if reply == "no":
        print("But I am! Why won't you believe me?")
    else:
        print("Please answer Yes or No.")
    print("So, I ask you again: Do you think I'm human?")
    reply = input()
    reply = clean_up(reply)
        
print("Fooled'ya!")

*put your explanation here*

## How to use `return`

You now know that you should `return` whenever you need the function to return a value that can serve as the input for another computation.
Only use `print` if you actually want the user to see a specific part of the computation.
Remember, `print` inside functions is only for showing stuff to the user; it does not give Python access to what's going on inside the function.
For this, you need `return`.
The actual usage of `return` is simple: just put it in front of whatever you want to return.
This does not have to be a single variable; it can also be an expression, that is, a piece of code that computes a value.
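
As a small sketch of the second case, the hypothetical variant `clean_up_short` below does the same job as `clean_up` but returns the result of the computation directly instead of first storing it in a variable. The function name and the example string are made up for illustration.

```python
import re

def clean_up_short(user_string):
    # no intermediate variable: the result of the nested calls is returned directly
    return re.sub(r"[\.\?!,; ]", r"", str.lower(user_string))

print(clean_up_short("Hello, World!"))
```

Whether you prefer this compact style or the step-by-step version with a variable is mostly a matter of readability.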

**Exercise.**
For each one of the functions below, add a comment that describes what its output looks like.
You should give both a general description and a specific example (if a function has multiple return statements, provide an example for each).
The first cell has already been filled out to give you a better idea of what to do.

In [None]:
import random

def upper_lower(chars):
    if random.choice([True, False]):
        return str.upper(chars)
    else:
        return str.lower(chars)

# randomly convert string to uppercase or lowercase
# upper_lower("aBolu") -> "ABOLU"
# upper_lower("aBolu") -> "abolu"

In [None]:
def addition(x, y):
    return x + y

In [None]:
def politics_filter(string):
    if "Trump" in string or "Hillary" in string:
        return "censored"
    else:
        return string

In [None]:
import random

def random_greeting(list_of_names):
    greeting = random.choice(["Hi,", "Hello,"])
    name = random.choice(list_of_names)
    return greeting + " " + name + "!"
    # experiment to find out what + does with strings

In [None]:
def madlibs(adjective, verb, noun):
    madlib = "A(n) " + adjective + " man was " + verb + "ing his " + noun
    # experiment to find out what + does with strings
    return madlib

In [None]:
def constant_function(x):
    return 17

In [None]:
def is_upper(chars):
    if chars == str.upper(chars):
        return True
    else:
        return False

**Exercise.**
The strength of functions is that they allow you to break down a complex task into several simple ones.
In order to see this, you have to finally work on a more complex task.
That's what this exercise is about.
Go back to the earlier units and collect all the tricks you have learned for chatbots:

1. Cleaning up the user input.
1. Memorizing user input, either to reuse it with `random.choice` or to point out if the user has said something before.
1. Randomizing the chatbot with the `random` library.
1. Using if-statements to check for special cases (e.g. slurs in the input).
1. Talking indefinitely until the user writes a specific sentence (e.g. "Good Lord almighty, please make it stop!").

Then design a chatbot that behaves as follows:

1. It starts the conversation with a random greeting (at least 5 choices).
1. If the user writes a nice reply, the bot switches into "nice mode".
   If the user writes a nasty reply, the bot switches into "nasty mode".
   You have to write some code to determine if the reply is nice or nasty.
1. In nice mode, the bot keeps saying supersweet niceties until the user replies with something nasty.
   It then switches to nasty mode.
1. In nasty mode, the bot only outputs `"..."` until the user writes something they've written before.
   The bot then shouts `"Leave me alone!"` and the conversation stops.
   
Here are a few suggestions for what should probably be its own function:

- the clean up function (you want one, believe me)
- detecting if the user reply is nice or nasty
- nice mode for the bot
- nasty mode for the bot

I also suggest that you draw a flowchart for yourself.
Don't write a single line of code until you have a good idea of how the whole chatbot works!
You must think about what each function looks like, how to combine them with the main part of the program, where to start the `while`-loop, and so on.

This is the first time you have to write a more complex program from scratch, so take your time.
Feel free to brainstorm ideas with your peers in the Python discussion board.

In [None]:
# put your code here

## Bullet point summary

- Functions need a `return` statement to return an output.
- Never confuse `print` and `return`.

**The print-return mnemonic: `print` prints to screen, `return` returns an output.**