# Tokenization: Simple uses

Tokenization is incredibly useful.
That's because it adds **structure** to sentences.
A string is a linear sequence of characters, but as far as Python is concerned all the characters have equal status.
In a string like `"Sue ran."`, Python does not treat `S`, `u`, `e` as parts of a unit *Sue*.
For Python, there is no noteworthy difference between *Sue* and *e r* or *an.*; they are all just parts of the string.
Tokenization imposes a list structure on the sentence, and we can use this list structure to do all kinds of nifty things.
Once you are used to treating sentences as lists of words rather than strings, you won't want to go back.
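
To make the contrast concrete, here is a minimal sketch. It assumes a simple regex-based tokenizer that treats every run of word characters (`\w+`) as a word; the variable names are only for illustration.

```python
import re

sentence = "Sue ran."

# as a string, the sentence is a flat sequence of characters
characters = list(sentence)
print(characters)

# after tokenization, the sentence is a list of words
words = re.findall(r"\w+", sentence)
print(words)
```

The first list has one entry per character, including the space and the period, while the second list groups the characters into the units *Sue* and *ran*.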

## Counting words

One very simple application for tokenization is word counts.
Suppose you want the user to write a brief personal statement of no more than 250 words.
If you already have a tokenizer, this is easy-peasy.

In [None]:
import re

def tokenize(the_string):
    """Convert string to list of words."""
    return re.findall(r"\w+", the_string)


def get_description(n):
    """Get description of at most n words."""
    # request description
    print("Welcome to Social Media Platform #729!")
    print("Please enter a description of yourself (at most", n, "words)")
    description = input()
    
    # check that word limit is obeyed
    while len(tokenize(description)) > n:
        print("Your description is too long, please enter a shorter one.")
        description = input()
        
    # acknowledge receipt of description
    print("Thank you, your description will be added to your profile.")
    

# ask the user for a 10 word description
get_description(10)

This piece of code contains two things we haven't encountered before, so let's go through them one by one.

1. Each custom function starts with a **docstring**, e.g. `"""Convert string to list of words."""`.
   A docstring is like a comment that describes what the function does.
   Docstrings don't do anything on their own; they're just a summary for whoever is reading the code.
   But there are specialized tools that can use docstrings to automatically create documentation for the code (which isn't possible with normal comments).
   It is good practice to give each function its own docstring.
   
1. The `len` function returns the **len**gth of a string or list.
   For a list, its length is the number of items it contains.
   The length of a string is its number of characters.
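
As a small illustration of how tools can get at docstrings: Python stores a function's docstring in its `__doc__` attribute, which is not the case for normal comments. The sketch below reuses the `tokenize` function from the code above.

```python
import re

def tokenize(the_string):
    """Convert string to list of words."""
    # this normal comment is not stored anywhere
    return re.findall(r"\w+", the_string)

# the docstring is stored together with the function
print(tokenize.__doc__)
```

The built-in `help` function relies on the same information, so `help(tokenize)` displays the docstring, too.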

In [None]:
len("abcde")

In [None]:
len(["a", "b", "c", "d", "e"])

In [None]:
import re

def tokenize(the_string):
    """Convert string to list of words."""
    return re.findall(r"\w+", the_string)

example_string = "James Madison had 0 sons and fought the war of 1812."
example_tokenized = tokenize(example_string)

print("The string is:")
print(example_string)
print("It has", len(example_string), "characters (including spaces and punctuation).")
print("It contains", len(example_tokenized), "words.")

So the `get_description` function from the start of this section works as follows:
First, it gets the description from the user as a string.
It then starts a `while` loop.
Whether the loop is executed depends on the length of the list that results from tokenizing the description.
The loop is entered whenever the length of this list exceeds *n*:

```python
len(tokenize(description)) > n
```

That's the same as saying that the list contains more than *n* elements.
But since each element of the list is a word of the description provided by the user, this means that the loop is entered whenever the description contains more than *n* words.
In this case, the user is asked to provide a shorter description.
The `while` loop keeps running until the user complies with the request for a suitably short description.

This example may strike you as somewhat artificial, but there will be several occasions where we need to measure how many words are in a string.
The procedure is always the same:

1. Tokenize the string.
1. Measure the length of the list with `len`.
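
These two steps can be wrapped in a small helper function. This is only a sketch: the name `word_count` is made up for illustration, and it assumes the same regex-based `tokenize` as above, which counts every run of word characters (including numbers) as a word.

```python
import re

def tokenize(the_string):
    """Convert string to list of words."""
    return re.findall(r"\w+", the_string)


def word_count(the_string):
    """Return the number of words in the_string."""
    # step 1: tokenize the string
    words = tokenize(the_string)
    # step 2: measure the length of the resulting list
    return len(words)


print(word_count("Sue ran."))
print(word_count("James Madison had 0 sons and fought the war of 1812."))
```

With this tokenizer, the first call counts 2 words and the second one counts 11.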

**Exercise.**
Write a custom function `wordchar_count` that takes a string as its only argument and returns the number of word characters in it.

*Hints:*
If you're stuck with the exercise, highlight the text below to read some tips.

<span style="color:#000000;background-color:#000000;">
Use re.findall to match all word characters, then measure the length of this list.
</span>

In [None]:
# put your code here

**Exercise.**
Write a function `shorter` that compares two strings and returns the one of the two that contains fewer characters (including white space, punctuation, special symbols, and so on).
If the strings have the same length, randomly choose one of the two to return.

In [None]:
# put your code here

## Picking out specific words

There is another big advantage to representing sentences as lists of words rather than flat strings.
Items in a list can be explicitly referenced by their **index**.
This makes it possible to check which word occurs in a specific position in a sentence.

In [None]:
sentence = ["John", "likes", "Sue"]

# use the notation list[n] to get the element that is preceded by n elements
print(sentence[0])
print(sentence[1])
print(sentence[2])

Python allows us to use index notation with lists to pick out the element at a given index.
The format is very simple:

```python
some_list[some_index]
```

While indices are a fairly simple concept, Python beginners are usually tripped up by the fact that indices start at 0.
So if you want the first word in a sentence, you have to look at `sentence[0]`, not `sentence[1]`.
This can be confusing, but there's actually a good reason for that.
You can think of a list like `["John", "likes", "Sue"]` as a line where indices and elements are interspersed:

```
0 John 1 likes 2 Sue 3
```

The piece of code below also reveals this hidden pattern.

In [None]:
list(enumerate(["John", "likes", "Sue"]))

So elements of a list don't occupy specific indices but rather occur between two indices.
When we want a specific list element, we have to give Python the index to the left of the element.
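
One consequence of this picture is that a list with three elements is fully covered by the indices 0, 1, and 2, because every element is named by the index on its left. The loop below is a small sketch of this; it assumes you are comfortable with `for` loops and uses `range(len(sentence))` to produce exactly those indices.

```python
sentence = ["John", "likes", "Sue"]

# range(len(sentence)) produces the indices 0, 1, 2
for index in range(len(sentence)):
    # print each index next to the element to its right
    print(index, sentence[index])
```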

**Exercise.**
Experimentation time.
Use the code cell below to figure out the answers to the following questions:

1. What happens if we use an index that's not a number, e.g. `some_list['John']` or `some_list[$]`?
1. What happens if the index does not exist, e.g. `["John", "likes", "Sue"][3]`?
1. Can indices be used with strings?

In [None]:
# experiment here

*put your answers here*

Once an item has been retrieved from a list by its index, it behaves as usual.

In [None]:
# John is John
["John", "likes", "Sue"][0] == "John"

In [None]:
# Sue is not John
["John", "likes", "Sue"][2] == "John"
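
The same holds for other operations. In the sketch below, the retrieved items are ordinary strings, so we can measure their length or glue them together with `+` (the variable names are just for illustration).

```python
sentence = ["John", "likes", "Sue"]

# the second word is a string, so len counts its characters
second_word_length = len(sentence[1])
print(second_word_length)

# retrieved items can be combined like any other strings
rebuilt = sentence[0] + " " + sentence[1] + " " + sentence[2]
print(rebuilt)
```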

**Exercise.**
There is a prescriptive rule that sentences should not start with *because*.
Write a custom function `because_check` that takes as its input a string and prints a warning if the first word is *because* or *Because*.

In [None]:
# put your code here

One curious feature of Python is that we can also use negative indices for lists.

In [None]:
sentence = ["John", "likes", "Sue"]

# use the notation list[-n] to get the n-th element from the back
print(sentence[-1])
print(sentence[-2])
print(sentence[-3])

Note that negative indices start with 1, not 0.
That seems awfully inconsistent: positive indices start with 0 ("the first element is preceded by 0 elements"), but negative indices start with 1 ("the last element is the first element from the right").
This makes more sense if you again think of lists as sequences of indices and list elements.
Remember, the list `["John", "likes", "Sue"]` actually looks more like this under the hood:

```
0 John 1 likes 2 Sue 3
```

Negative indices are exactly the same, except that we number from right to left:

```
-3 John -2 likes -1 Sue -0
```

Now keep in mind that we always refer to an element by the index to its left.
This is why the last element is at the index -1 rather than -0.
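
There is also a practical reason why the last element cannot sit at -0: for Python, -0 and 0 are the same number, so `some_list[-0]` is just another way of writing `some_list[0]`. The sketch below prints every negative index next to its element and then checks what -0 picks out.

```python
sentence = ["John", "likes", "Sue"]

# range(-len(sentence), 0) produces the indices -3, -2, -1
for index in range(-len(sentence), 0):
    print(index, sentence[index])

# -0 is the same as 0, so this is the first element, not the last
print(sentence[-0])
```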

I know, I know, that's still fairly confusing.
We will mostly be using positive indices for the rest of this course.
But when we do use negative indices and you can't figure out what's going on, refer back to this passage.
It all gets a lot clearer once you've seen it in action a few times.

**Exercise.**
Say whether the following statement is true or false:

- For every list `some_list`, it holds that `some_list[-1] == some_list[len(some_list) - 1]`.

Justify your answer.

*put your answer here*

**Exercise.**
Write a custom function `sandwich_sentence` that checks whether a sentence starts with the same word that it ends with.
If so, it returns `True`, otherwise it returns `False`.
For example:

```python
sandwich_sentence("John's name has always been John")
True

sandwich_sentence("John's true name is unpronounceable in your human tongue")
False
```

In [None]:
# put your solution here

## What's the point?

The examples in this notebook are all very simplistic.
Considered in isolation, they do not seem to have much to do with language technology.
But that's because they are such basic techniques.
True, they do not amount to much on their own.
But they are essential for more interesting applications, such as stylistic analysis.
More on that in the next notebook.

## Bullet point summary

- Tokenization adds structure to sentences, which opens up many new techniques.
- Functions should have docstrings at the very top to explain their basic functionality:
    
```python
def some_function(first_argument, ..., last_argument):
    """This is a docstring."""
    some code
```

- `len` measures the length of a list (number of elements) or a string (number of characters).
- Each item in a list has an index.
  Use the index to retrieve the item from the list.
  
  - `some_list[0]`: first element from the left (0 elements to its left)
  - `some_list[7]`: 8th element from the left (7 elements to its left)
  - `some_list[-1]`: last element (1st from the right)
  - `some_list[-7]`: 7th element from the right