# Word Completion

There is one important application area for unigram models that we haven't discussed yet: word completion.
In a word completion task, we have to determine the possible continuations for a partially typed word.
For example, if the user types *to* we might suggest *too*, *tool*, *torrent*, and so on.
Usually only the most likely completions are shown to the user, but we will leave this to the next unit.
For now, we just want to figure out how we can compute all the available completions for a string.
You might think this is the harder task, but it's actually quite a bit easier than picking out only the most likely completions.

## Some general observations

Obviously Python can only offer possible completions if it knows what the possible words of English are.
Otherwise we might end up with suggestions like *tolmnox*, which is not an English word.
So the first thing we need is an English dictionary.
Without it, Python has no idea what is or isn't an English word.
Fortunately we already know from the previous unit where to find such a dictionary and how to convert it to a list.

In [None]:
import urllib.request
import re

def read_file(filename):
    with open(filename, "r", encoding="utf-8") as text:
        return text.read()

# download the file
url = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
urllib.request.urlretrieve(url, "words.txt")
dict_string = read_file("words.txt")

# tokenize dict_string;
# remember that each word is on its own line, so [^\n]+ does the trick
dictionary = re.findall(r"[^\n]+", dict_string)

**Warning.** Make sure you run the cell above, otherwise `dictionary` won't be defined and we need it for the rest of this section!

Now that we have a dictionary, we need a way to find the words in it that start with the string we want to complete.
For example, if the user has typed *excite*, the possible completions according to our dictionary are:

1. excite (after all, the user might have really just meant to type *excite*)
1. excited
1. excitedly
1. excitedness
1. excitement
1. excitements
1. exciter
1. exciters
1. excites

But how do we do this?
Intuitively, we have to look at each word in the dictionary and check whether it starts with the string the user has typed.
If so, we add it to the list of possible completions.
There is a simple way of doing this, a more elaborate one, and a pretty cumbersome one.
We'll look at the easy way first, but the pretty cumbersome way will prove more insightful for the future.
In the next unit, we'll look at the more elaborate way as a happy middle ground.

## The simple way: finding completions with `str.startswith`

As just discussed, we want to iterate over all words in the dictionary and add a word to the list of completions only if it starts with the string we are trying to complete.
This is pretty easy to convert into Python code.
The only problem is how one determines whether a word is a possible completion for a string.
Fortunately, Python already has a function for exactly that job: `str.startswith`.

In [None]:
def complete_word(string, dictionary):
    # start empty list of completions
    completions = []
    # iterate over the words in the dictionary
    for word in dictionary:
        # if word starts with string, add it to completions
        if str.startswith(word, string):
            list.append(completions, word)
    # the for loop is done;
    # we return the list of all found completions
    return completions

Most of the code above should be familiar to you.
We create an empty list `completions` that we add potential completions to, and this is what the function returns as its output.
The `for`-loop is used to iterate over all elements of `dictionary` so that we can look at each `word` individually and determine if it should be added to completions with `list.append(completions, word)`.
The mystery is how we make this decision.
The job is accomplished by the built-in function `str.startswith(string1, string2)`, which takes two strings and determines if `string1` starts with `string2`.

In [None]:
for example in [str.startswith("excitement", "excite"),
                str.startswith("excite", "excite"),
                str.startswith("excitable", "excite"),
                str.startswith("Hillary", "Hi"),
                str.startswith("Hillary", "Hi!")]:
    print(example)
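In other people's code you will often see the same check written with the string in front, as in `word.startswith(prefix)`. The following is a minimal sketch showing that both spellings give the same answer; the variable names are made up for this illustration.

```python
# two ways of writing the same check; the variable names are hypothetical
word = "excitement"
prefix = "excite"

# the style used in this unit: both strings are passed as arguments
function_style = str.startswith(word, prefix)
# the style with the string in front: word is checked against prefix
method_style = word.startswith(prefix)

print(function_style)
print(method_style)
```

Both lines print `True`. This unit sticks with the `str.startswith(word, prefix)` spelling because it matches `list.append(completions, word)` and the other functions we have used so far.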

**Exercise.**
Python also has a function `str.endswith`.
What might this function do?
Construct some examples as in the previous cell for `str.startswith`, then give an informal description of the function.

In [None]:
# adapt the examples so that they illustrate how str.endswith works
for example in [str.endswith("excitement", "excite"),
                str.endswith("excite", "excite"),
                str.endswith("excitable", "excite"),
                str.endswith("Hillary", "Hi"),
                str.endswith("Hillary", "Hi!")]:
    print(example)

*put your description here*

Thanks to Python's `str.startswith` function, we can now use `complete_word` to get the list of all completions for a string.

In [None]:
complete_word("excite", dictionary)

In [None]:
complete_word("yes", dictionary)
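The full dictionary is too large to check by eye, so it can help to try the function on a tiny word list first. The sketch below repeats `complete_word` so that it runs on its own; `mini_dictionary` is a hand-made list that exists only for this illustration and is not part of the downloaded file.

```python
def complete_word(string, dictionary):
    # same function as above, repeated so that this cell runs on its own
    completions = []
    for word in dictionary:
        if str.startswith(word, string):
            list.append(completions, word)
    return completions

# a tiny hand-made word list, only for illustration
mini_dictionary = ["to", "too", "tool", "torrent", "yes", "yesterday", "Too"]

to_completions = complete_word("to", mini_dictionary)
yes_completions = complete_word("yes", mini_dictionary)
x_completions = complete_word("x", mini_dictionary)

print(to_completions)   # ['to', 'too', 'tool', 'torrent']
print(yes_completions)  # ['yes', 'yesterday']
print(x_completions)    # []
```

Note that `Too` is not offered as a completion for `to`: `str.startswith` distinguishes between upper and lower case. If no word matches, the function simply returns the empty list.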

This is the easy way, and usually this is the preferred way of solving this specific task.
But the solution is slightly unsatisfying for us because it is still a mystery how exactly `str.startswith` compares the two strings.
As Python beginners you should not be content to just learn a list of Python commands for specific tasks, but rather try to figure out how complex tasks can be solved with general-purpose tools that are useful in many different situations.
So let us look at the cumbersome but more general solution next, which introduces an important new technique: **string positions**.

## The cumbersome way: referencing string positions

Why is the string `yesty` (whatever that means) a possible completion for `yes`?
Because `yesty` starts with `yes`.
But how can we tell that `yesty` starts with `yes`?
Well, `yes` is three characters long, and

1. when we look at the first character of `yes` and `yesty`, they are both `y`, and
1. when we look at the second character of `yes` and `yesty`, they are both `e`, and
1. when we look at the third character of `yes` and `yesty`, they are both `s`.

Alright, fair enough, but how does that help us with Python?
It shows us how to compare strings by looking at their parts!

First, remember that when we want a specific element `a` from a counter `counts`, we can use the square bracket notation `counts["a"]` to get the value for that specific element.
We can do something very similar for strings: if we want to look at only a specific letter in `string`, we use `string[p]`, where `p` is a number indicating the position in the string.
Sounds complicated, but it's fairly straightforward:

In [None]:
print("yes"[0])  # this prints y, the 1st letter
print("yes"[1])  # this prints e, the 2nd letter
print("yes"[2])  # this prints s, the 3rd letter

Notice that the value of the position is always one less than what you would intuitively expect.
If you want the first letter, you have to look at position 0, if you want the second letter you look at position 1, and so on.
As a mnemonic, you can think of the position as an indicator of how many letters are to the left of the letter you want to look at.
Obviously the first character has nothing to its left - it is the first letter after all! - so its position is 0.
At position 1, we see the letter that has one letter to its left, which must be the second letter in the string.
And so on.
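To see the mnemonic at work, one can print every position of a word next to the letter found there. This is only a small sketch; the word `tool` and the names `pos` and `pairs` are chosen for this illustration.

```python
word = "tool"

# collect (position, letter) pairs so that we can inspect them afterwards
pairs = []
pos = 0
# the positions run from 0 up to, but not including, the length of the word
while pos < len(word):
    print(pos, word[pos])
    list.append(pairs, (pos, word[pos]))
    pos = pos + 1
```

The word has four letters, but the positions that get printed are 0, 1, 2, and 3. The last position is always one less than the length of the string.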

**Exercise.**
The cell below has code to print a few strings, but for each string we actually want to print only a specific letter.
Add the necessary position to each string.

In [None]:
# 1) print t
print("yesty")
# 2) print the second y
print("yesty")
# 3) print the w
print("yesterweek")

**Exercise.**
Somebody was very eager to use the new position notation, but made quite a few mistakes in the process.
Fix them all, and explain in a comment why the code doesn't work as intended.

In [None]:
print("excitement[3]")
# your comment:

print("yes")[1]
# your comment:

print("yes"[3])
# your comment:

print("quagmire"[2.0])
# your comment:

**Exercise.**
We can also use more complex expressions inside the square brackets as long as they evaluate to an integer.
For each one of the following, add a comment that explains what it does.

In [None]:
print("yesterweek"[2*3])
# your comment:

print("yesterweek"[len("yesterweek") - 1])
# your comment:

print("yesterweek"[(len("yester") + len("week")) - len("yesterweek")])
# your comment:

With the position notation we can already ask Python to check whether `yesty` starts with `yes`.
To do this, we first determine the length of `yes` with `len("yes")`, which evaluates to 3.
We then look at all the positions that are less than 3, and if the two strings disagree on any value, we return `False`.
If every comparison is passed without returning `False`, we return `True`.

In [None]:
# a function for comparing two strings
def startswith(long_string, short_string):
    # we only need to compare as many positions as there are in the shorter string
    limit = len(short_string)
    # we start with the leftmost position
    pos = 0
    # and now a while-loop for comparing positions until we hit the limit
    while pos < limit:
        # if the two strings aren't the same at the position, we return False
        if long_string[pos] != short_string[pos]:
            return False
        else:
            # move on to next position
            pos = pos + 1
    # the while loop has finished, so all the comparisons went okay;
    # we return True
    return True

The Python built-in `str.startswith` accomplishes the same work as our custom function `startswith`, at least as long as the first string is not shorter than the second.
So now we have a better idea what exactly Python is doing when we call `str.startswith`.

In [None]:
print("Output with str.startswith")
for example in [str.startswith("excitement", "excite"),
                str.startswith("excite", "excite"),
                str.startswith("excitable", "excite"),
                str.startswith("Hillary", "Hi"),
                str.startswith("Hillary", "Hi!")]:
    print(example)
    
# print a new line
print()

print("Output with our custom startswith function")
for example in [startswith("excitement", "excite"),
                startswith("excite", "excite"),
                startswith("excitable", "excite"),
                startswith("Hillary", "Hi"),
                startswith("Hillary", "Hi!")]:
    print(example)
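There is one case where the two functions differ: when the first string is shorter than the second and matches it all the way to its end, as with `yes` and `yesty`. The built-in returns `False`, whereas our custom function tries to look at a position that `yes` doesn't have and stops with an `IndexError`. The sketch below shows one possible way to guard against this; the name `safe_startswith` is made up for this illustration.

```python
def startswith(long_string, short_string):
    # same function as above, repeated so that this cell runs on its own
    limit = len(short_string)
    pos = 0
    while pos < limit:
        if long_string[pos] != short_string[pos]:
            return False
        else:
            pos = pos + 1
    return True

def safe_startswith(string, prefix):
    # a prefix that is longer than the string can never match
    if len(prefix) > len(string):
        return False
    return startswith(string, prefix)

print(str.startswith("yes", "yesty"))   # False
print(safe_startswith("yes", "yesty"))  # False

# the unguarded version runs past the end of "yes"
raised_index_error = False
try:
    startswith("yes", "yesty")
except IndexError:
    raised_index_error = True
print(raised_index_error)               # True
```

For word completion this case does come up: any dictionary word that is shorter than the typed string, and matches its beginning, triggers it.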

**Exercise.**
We saw that Python also has a function `str.endswith` as a counterpart to `str.startswith`.
Adapt the code for our custom `startswith` function so that it is the counterpart of `str.endswith` instead.
If you're stuck, check the next text cell for hints.

In [None]:
# a function for comparing two strings
def endswith(long_string, short_string):
    # all the code below must be adapted
    limit = len(short_string)
    # we start with the leftmost position
    pos = 0
    # and now a while-loop for comparing positions until we hit the limit
    while pos < limit:
        # if the two strings aren't the same at the position, starts_with is False
        if long_string[pos] != short_string[pos]:
            return False
        else:
            # move on to next position
            pos = pos + 1
    # the while loop has finished, so all the comparisons went okay;
    # we return True
    return True

*Hints*:
Suppose we want to see if `yesterweek` ends with `week`.
Then we have to do the following steps:

1. Determine that the length of `week` is 4.
1. Now we know that we have to look at the last four characters of `yesterweek`.
   Since the length of `yesterweek` is 10, the last four characters have the positions 6, 7, 8, and 9.
1. So overall we have to compare
    - position 0 in `week` to position 6 in `yesterweek`,
    - position 1 in `week` to position 7 in `yesterweek`,
    - and so on, until we've looked at all 4 relevant positions.
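If you want to check the position arithmetic from the hints before writing the function, the following sketch prints which positions get paired up. It is not the solution to the exercise, and the name `offset` is just one possible choice for the difference between the two lengths.

```python
long_string = "yesterweek"
short_string = "week"

# how far the short string is shifted to the right inside the long one
offset = len(long_string) - len(short_string)
print(offset)

# collect the pairs of positions that have to be compared
position_pairs = []
pos = 0
while pos < len(short_string):
    print(pos, short_string[pos], pos + offset, long_string[pos + offset])
    list.append(position_pairs, (pos, pos + offset))
    pos = pos + 1
```

Each printed line shows a position in `week`, the letter there, the matching position in `yesterweek`, and the letter there.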

The ability to reference specific elements of a string with positions is incredibly useful, in particular because it is not limited to strings but also works for lists.

In [None]:
testlist = ["I", "love", "Python", "!"]

print(testlist[0])
print(testlist[1])
print(testlist[2])
print(testlist[3])
print(testlist[1+2])
print(testlist[len(testlist) - 1])

**Exercise.**
Determine through an experiment whether we can also use positions with counters.

In [None]:
# you can experiment here

For the concrete problem of word completions, however, using string positions is more cumbersome than `str.startswith` because we have to define our own custom function `startswith`.
In addition, our custom function is not as fast as `str.startswith` - built-in Python functions use all kinds of optimizations to make the code as efficient as possible.
So for a real-world problem, you are better off using `str.startswith`.
But this specific function will only help you in a few cases, whereas referencing items by their position is a very useful general purpose technique.