# From text to lists

The previous unit talked a bit about cleaning up input.
You saw how you can use custom functions to remove all kinds of clutter from the input.
When you clean up input in this way, you are taking a string and modifying it to get a more suitable string that is easier to work with.
But sometimes a cleaner string is still not good enough: what you actually want is a list.
Instead of the string

```python
"This string is a string, not a list, as you can see."
```

you might actually want the list

```python
["This", "string", "is", "a", "string", "not", "a", "list", "as", "you", "can", "see"]
```

The process of breaking up a string into a list of words is called **tokenization**, and it is really important in language technology.

## Tokenization the easy way

It is very easy to build a basic tokenizer for English.
In fact, a few lines of Python code suffice.

In [None]:
import re

# define a custom tokenizer function
def tokenize(the_string):
    token_list = re.findall(r"\w+", the_string)
    return token_list


def ex_print(example):
    print("The sentence is:")
    print(example)
    print("Tokenization yields this list:")
    print(tokenize(example))
    
    
ex_print("This is an example sentence")

Run the code above to verify that it indeed converts the example sentence to a list of words.
Feel free to change the example sentence a bit and observe how the list changes accordingly.

All the real work in the code above is done by the custom function `tokenize`.
It takes a single argument, which is given the name `the_string`.
The function then does two things.
First, it calls `re.findall(r"\w+", the_string)`, which looks rather cryptic.
This call does all the work of converting a string into a list of words.
We'll see in a moment how exactly it does that.
Second, the resulting list is stored in the variable `token_list`, which is then returned as the output of the function.

**Exercise.**
Rather than storing the list in a variable and then returning the variable, we can directly return the list.
Modify the code below along these lines.
The result should consist of only two lines of code.

In [None]:
def tokenize(the_string):
    token_list = re.findall(r"\w+", the_string)
    return token_list

So let's turn to the mystery of how the string is converted into a list.
The process is actually much simpler than it looks.
You already know Python's `re` package for working with regular expressions.
This package provides the `re.sub` function to modify parts of a string.
But it also provides the function `re.findall`.
This function takes two arguments:

1. The first argument of `re.findall` is a regular expression.
1. The second argument of `re.findall` is a string.

The function then scans the string from left to right, looking for parts that match the regular expression.
When a match is found, it is added to the list of matches.
At the end, the function returns the list of all found matches.

Here is a very simple example:

In [None]:
import re
# matching digits in a string
digits = re.findall(r"[0-9]", "James Madison had 0 sons and fought the war of 1812.")
print(digits)

As you can see, the regular expression matches every character that is a digit between 0 and 9.
When the `re.findall` function scans through the string from left to right, the first matching symbol is 0.
So at this point the list of matches is `['0']`.
Then it has to move right for quite a while without much happening, until it finally encounters the year 1812.
Each digit of this year is a symbol that matches the regular expression, so each one gets added to the list of matches.
The order in the list corresponds exactly to the order of the digits in the string.

Now compare this to the list we get with a slightly different regular expression.

In [None]:
import re
# matching numbers in a string
numbers = re.findall(r"[0-9]+", "James Madison had 0 sons and fought the war of 1812.")
print(numbers)

Instead of a list of digits, we now have a list of numbers.
How come?
Recall that `+` in a regular expression means *1 or more instances of*.
So now `re.findall` doesn't just look at individual symbols in the string.
Instead, it looks for the longest parts of the string that are only built from digits.
That's `0` and `1812`.

The regular expression `r"[0-9]+"` is very similar to the one used in our custom function `tokenize`.
There, the regex is `r"\w+"`.
So apparently we want `re.findall` to match continuous parts of the string that are built up from whatever `\w` means.
But as you already know, `\w` is a shorthand for *word character*.
So the regex `r"\w+"` matches the maximal parts of the string that are built up from word characters, which is just a roundabout way of saying that it matches words.

**Exercise.**
The code below uses the tokenizer function on a variety of example sentences.
Run the cell multiple times and carefully study how `re.findall` picks out the matches.
Based on your observations, what counts as a word character?

In [None]:
import random
import re

# define a custom tokenizer function
def tokenize(the_string):
    token_list = re.findall(r"\w+", the_string)
    return token_list


def ex_print(example):
    print("The sentence is:")
    print(example)
    print("Tokenization yields this list:")
    print(tokenize(example))
    
    
sentences = ["This is the first example sentence.",
             "Cuz I'm Batman!",
             "What???!!",
             "Engage the hyper-drive!",
             "True music aficionados listen to Röyksopp, Schweißer, and Sígur Rós...",
             "Bankers only care about $$$",
             "My phone number is 555-123-4567",
             "Stalag 17 might be Billy Wilder's best movie!",
             ":-)",
             "2 + 2 = 4. This much is obvious!"]

ex_print(random.choice(sentences))

*put your description of word characters here*

**Exercise.**
Regular expressions don't just have `\w` as a special descriptor for word characters.
Among others, there is also `\d` to pick out digits.
Based on the code above for matching digits and numbers, write two custom functions `digit_match` and `number_match`.
Test the functions on the example sentence *James Madison had 0 sons and fought the war of 1812*.
Whereas `digit_match` should return `['0', '1', '8', '1', '2']`, `number_match` should return `['0', '1812']`.

In [None]:
def digit_match(the_string):
    # complete this function (replace the placeholder below)
    pass


def number_match(the_string):
    # complete this function (replace the placeholder below)
    pass


# put your tests here

So whenever you need to tokenize something, you can follow this simple recipe:

```python
re.findall(r"\w+", string_to_be_tokenized)
```

Pretty amazing if you think about it.
A single line of code for what might seem like a fairly difficult task.

## Limits of the simple approach

You have probably noticed by now that this approach results in lists that use a somewhat odd notion of word.
For instance, the word *non-descriptive* would be split into *non* and *descriptive*.
A phone number like *555-123-4567* will be split into *555*, *123*, and *4567*, when it is arguably a single word.
One solution would be to treat as a word any continuous sequence of characters that does not contain any whitespace.
A whitespace character is a space, a tab, or a newline.
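
To make the notion of whitespace concrete, here is a minimal sketch that lists the whitespace characters in a made-up string. It relies on the regex shorthand `\s`, which matches a single whitespace character; the string `sample` and the variable name `whitespace_chars` are only illustrative.

```python
import re

# A made-up sample string containing a space, a tab (\t), and a newline (\n).
sample = "one two\tthree\nfour"

# \s without a + matches exactly one whitespace character per match.
whitespace_chars = re.findall(r"\s", sample)
print(whitespace_chars)
```

The printed list contains three entries: the space, the tab, and the newline, in the order in which they occur in the string.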

**Exercise.**
In regular expressions, `\s` can be used to match whitespace characters, whereas `\S` matches anything that is not whitespace.
Adapt the code below so that instead of maximal sequences of word characters, it matches maximal sequences of non-whitespace characters.
How does this affect tokenization of the example sentences?
List at least one specific improvement, and at least one specific regression (that is to say, something this solution does worse than the previous tokenizer).

In [None]:
import random
import re

# define a custom tokenizer function
def tokenize(the_string):
    token_list = re.findall(r"\w+", the_string)
    return token_list

sentences = ["This is the first example sentence.",
             "Cuz I'm Batman!",
             "What???!!",
             "Engage the hyper-drive!",
             "True music aficionados listen to Röyksopp, Schweißer, and Sígur Rós...",
             "Bankers only care about $$$",
             "My phone number is 555-123-4567",
             "Stalag 17 might be Billy Wilder's best movie!",
             ":-)"]

example = random.choice(sentences)
print("The sentence is:")
print(example)
print("Tokenization yields this list:")
print(tokenize(example))

*put your answers here*

**Exercise.**
This continues the previous exercise.
Modify your code so that each punctuation symbol is treated as a word.
So `"Sue slept."` would be tokenized as `["Sue", "slept", "."]`,
`"Sue, stop!"` as `["Sue", ",", "stop", "!"]`,
and `"Sue and Bill..."` as `["Sue", "and", "Bill", ".", ".", "."]`.

*Hints:*
If you're stuck with the exercise, highlight the text below to read some tips.

<span style="color:#000000;background-color:#000000;">
Use re.sub to insert whitespace before punctuation symbols.
</span>

In [None]:
# put your modified code here

**Exercise.**
You now know that

- `\w` matches word characters
- `\d` matches digits
- `\s` matches whitespace (spaces, tabs, newlines)
- `\S` matches anything that is not matched by `\s`.

So what might `\W` and `\D` match?
Experiment with `re.findall` in the code cell below to verify your answer.

In [None]:
# experiment here

*put your description of \W and \D here*

Writing a high-quality tokenizer is actually a very challenging task.
All kinds of edge cases must be taken into account.
For example, how should one tokenize any of the following:

- *$20*
- *R2-D2*
- *www.somedomain.com/query?=5328_iawb;id=293*
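
As a quick check, the sketch below runs the basic `r"\w+"` recipe on these three strings. The names `edge_cases` and `results` are only illustrative, and the output shows what this particular regex does, not what a good tokenizer should do.

```python
import re

# The three edge cases from the list above.
edge_cases = ["$20", "R2-D2", "www.somedomain.com/query?=5328_iawb;id=293"]

# Map each string to the tokens found by the basic recipe.
results = {text: re.findall(r"\w+", text) for text in edge_cases}

for text, tokens in results.items():
    print(text, "->", tokens)
```

The dollar sign disappears entirely, *R2-D2* is split in two, and the URL falls apart into seven pieces. Only *5328_iawb* stays together, because the underscore counts as a word character.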

The answer is not straightforward and depends on why exactly you want to tokenize strings to begin with.
We will see concrete applications in the remainder of this unit.
For those applications, the tokenizer doesn't have to be top-notch; the regex `r"\w+"` will do the job just fine.
But even for such simple applications, tokenization can be very hard depending on the language.
English makes our life easy because of the convention that words are separated by spaces, rather than run together like this:

```
WedonotwriteEnglishsentenceslikethiswithoutanyspacebetweenwords.
```

But some languages do not follow this convention, for instance Chinese.
Writing a tokenizer for Chinese is a much harder task that requires a good understanding of the language.
One short regular expression won't cut it for Chinese.
This is a good example of how a piece of language technology might be straightforward for language X, but really hard for language Y.
This happens quite often.
In an ideal world, the engineers working on language technology would have a solid background in linguistics that makes them aware of how much languages can vary.
Unfortunately, we do not live in this ideal world; many pieces of language technology do not work well for certain languages because the programmers made assumptions that simply do not hold for these languages.
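
The following sketch illustrates the problem with the basic recipe. It assumes a short Chinese sentence chosen purely for illustration (roughly "I like studying linguistics.") and compares it with its English counterpart; the variable names are likewise hypothetical.

```python
import re

# Illustrative sentences; the Chinese one is written without spaces between words.
english = "I like studying linguistics."
chinese = "我喜欢学习语言学。"

english_tokens = re.findall(r"\w+", english)
chinese_tokens = re.findall(r"\w+", chinese)

print(english_tokens)
print(chinese_tokens)
```

For English we get four tokens. For Chinese, every character before the final punctuation mark is a word character and there is no whitespace, so the whole sentence comes back as a single token even though it consists of several words.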

## Bullet point summary

- Tokenization is the process of converting a sentence or text from a string to a list of words (tokens).
- A basic tokenization recipe:

```python
re.findall(r"\w+", string_to_be_tokenized)
```

- `re.findall` takes a regular expression R and a string S and returns a list of all matches for R in S.
- Regular expressions provide shorthands for matching specific classes:
    - `\w` matches word characters (letters, including non-English ones like *ö*, digits, and the underscore),
    - `\d` matches digits (0-9),
    - `\s` matches whitespace,
    - `\W`, `\D`, and `\S` match whatever is not matched by `\w`, `\d`, and `\s`.
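
As a small illustration of these shorthands, the sketch below applies several of them to one made-up string. The string `sample` and the variable names are assumptions chosen for this example only.

```python
import re

# A made-up string with a non-English letter, an underscore, digits, and punctuation.
sample = "Röyksopp_2 plays at 8pm!"

word_tokens = re.findall(r"\w+", sample)      # maximal runs of word characters
digit_tokens = re.findall(r"\d+", sample)     # maximal runs of digits
non_digit = re.findall(r"\D+", sample)        # maximal runs of anything but digits
non_space = re.findall(r"\S+", sample)        # maximal runs of non-whitespace
non_word = re.findall(r"\W+", sample)         # maximal runs of non-word characters

print(word_tokens)
print(digit_tokens)
print(non_digit)
print(non_space)
print(non_word)
```

Note how `\w+` keeps *Röyksopp_2* in one piece but drops the exclamation mark, whereas `\S+` keeps the exclamation mark attached to *8pm*.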