# Exercises 6 - 7

## Exercise 6 - Slicing and Modifying

There are (at least) two ways in which we can approach this exercise. We can either take a slice of characters (A) or tokenize the string (B).


In [2]:
text = 'Python programming can be fun.'

### Variant A (Slicing)

In [11]:
# 'can' spans indices 19-21; the end index of a slice is exclusive
third_word_a = text[19:22]
third_word_a_upper = third_word_a.upper()

In [20]:
print(third_word_a_upper)

CAN


If we don't want to count the characters, we can also use a combination of `find` and `len`.

In [17]:
starting_point = text.find('can')
end_point = starting_point + len('can')

third_word_a = text[starting_point:end_point]
third_word_a_upper = third_word_a.upper()

In [18]:
print(third_word_a_upper)

CAN


Now that we have this solution, we can generalize even further and create a function that finds and modifies arbitrary substrings. Of course, this function is of little practical use (we already know the substring we are searching for), but it nicely shows how we can gradually work towards generalized solutions.

In [23]:
def find_and_uppercase(text, search):
  starting_point = text.find(search)
  end_point = starting_point + len(search)

  word = text[starting_point:end_point]
  word_upper = word.upper()

  return word_upper

In [24]:
find_and_uppercase(text, 'can')

'CAN'
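One caveat: `str.find` returns `-1` when the substring does not occur, and the slice in the function above is then computed from that `-1`, so the result no longer corresponds to the search term. The following is a minimal sketch of a guarded variant; the name `find_and_uppercase_safe` and the decision to return `None` are assumptions made for illustration.

```python
def find_and_uppercase_safe(text, search):
  # str.find returns -1 if `search` does not occur in `text`
  starting_point = text.find(search)

  if starting_point == -1:
    return None

  end_point = starting_point + len(search)

  return text[starting_point:end_point].upper()


text = 'Python programming can be fun.'

print(find_and_uppercase_safe(text, 'can'))     # CAN
print(find_and_uppercase_safe(text, 'cannot'))  # None
```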

### Variant B (Tokenization)

In [30]:
tokenized = text.split() # Without any arguments, split will use whitespace
tokenized

['Python', 'programming', 'can', 'be', 'fun.']

In [28]:
third_word_b = tokenized[2]
third_word_b_upper = third_word_b.upper()

In [29]:
third_word_b_upper

'CAN'
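Just as with Variant A, we could wrap this in a small function. This is only a sketch; the name `nth_word_upper` and the 1-based counting are assumptions chosen for readability.

```python
def nth_word_upper(text, n):
  # Split on whitespace and pick the n-th token (1-based, so n=3 is the third word)
  tokenized = text.split()

  return tokenized[n - 1].upper()


text = 'Python programming can be fun.'

print(nth_word_upper(text, 3))  # CAN
print(nth_word_upper(text, 5))  # FUN.
```

Note that the last token keeps its period, because `split` only separates at whitespace.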

## Exercise 7 - Counting Tokens

First, we will download (`git clone`) the repository. This way, we will have access to the two files (`simple.txt` and `challenge.txt`).

In [38]:
%%capture
!git clone https://github.com/IngoKl/python-programming-for-linguists

Next, we will open and read the file.

In [45]:
with open('python-programming-for-linguists/2020/data/tokenize/simple.txt', 'r') as f:
  text = f.read()

In [46]:
text

'The black cat chased the mouse.'

We can build a very simple tokenizer, just as above, by using the `str.split()` method.

In [50]:
tokenized = text.split()

tokenized

['The', 'black', 'cat', 'chased', 'the', 'mouse.']

Now we just need to get the length of the resulting list.

In [48]:
len(tokenized)

6

As requested in the exercise, we will put all of that into one function.

In [51]:
def count_tokens(file):
  with open(file, 'r') as f:
    text = f.read()

  tokenized = text.split()

  return len(tokenized)

In [52]:
count_tokens('python-programming-for-linguists/2020/data/tokenize/simple.txt')

6
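When `open` is called without an `encoding` argument, the encoding used depends on the platform and the Python version, which can matter for texts containing characters such as *€*. The sketch below passes the encoding explicitly; the function name `count_tokens_utf8` and the temporary sample file are assumptions, used here only so that the example runs without the cloned repository.

```python
import os
import tempfile


def count_tokens_utf8(file, encoding='utf-8'):
  # Same logic as count_tokens, but with an explicit encoding
  with open(file, 'r', encoding=encoding) as f:
    text = f.read()

  return len(text.split())


# Write a small sample file so that the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='utf-8', delete=False) as tmp:
  tmp.write('The black cat chased the mouse.')
  sample_path = tmp.name

sample_count = count_tokens_utf8(sample_path)
print(sample_count)  # 6

os.remove(sample_path)
```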

Great, now let's try to use our function with the more challenging `challenge.txt` example.

In [53]:
count_tokens('python-programming-for-linguists/2020/data/tokenize/challenge.txt')

11

This number is probably too low. Let's have a look at both the file and the output of our tokenizer.

In [55]:
with open('python-programming-for-linguists/2020/data/tokenize/challenge.txt', 'r') as f:
  text = f.read()

print(text)
print(text.split())

Sue owed Ms. O'Neil $10. Unfortunately, she  didn't have the money.
['Sue', 'owed', 'Ms.', "O'Neil", '$10.', 'Unfortunately,', 'she', "didn't", 'have', 'the', 'money.']


Alright, there are various problems here.

*   *$* and *10* should arguably be split
*   *didn't* should also be split into two words or tokens
*   Because we have *O'Neil*, we can't just split at the `'` character.
*   Punctuation, such as the period in *money.*, stays attached to the preceding word.
*   There's an extra space after *she* that could potentially cause trouble.
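As a side note on the extra space: whether it causes trouble depends on how we split. Called without arguments, `str.split` treats any run of whitespace as a single separator, whereas an explicit `' '` separator produces an empty string between the two spaces. The variable `sample` is just an illustrative snippet of the sentence.

```python
sample = "she  didn't"  # two spaces after 'she'

no_argument = sample.split()
explicit_space = sample.split(' ')

print(no_argument)     # ['she', "didn't"]
print(explicit_space)  # ['she', '', "didn't"]
```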

Let's try to build a more robust tokenizer that can handle these cases. Our approach will be to modify the text before we do the tokenization.


In [71]:
import re

def count_tokens_optimized(text):
  # Replace double whitespace
  text = text.replace('  ', ' ')

  # Add a space between $/€ and numbers
  text = re.sub(r'(\$|\€)([0-9]*)\b', r'\1 \2', text)

  # Add space between words and periods
  text = re.sub(r'(\w+)(\.)', r'\1 \2', text)

  # Split off the contracted negation n't (e.g. didn't -> did n't)
  text = text.replace("n't", " n't")

  tokenized = text.split()

  return len(tokenized)

In [70]:
count_tokens_optimized(text)

16
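To see which tokens are behind the count of 16, it helps to return the list itself instead of its length. The sketch below applies the same four steps as the function above; the name `tokenize_optimized` is an assumption, and the challenge sentence is pasted in as a string so that the example runs on its own.

```python
import re


def tokenize_optimized(text):
  # Same steps as in count_tokens_optimized, but returning the tokens
  text = text.replace('  ', ' ')
  text = re.sub(r'(\$|€)([0-9]*)\b', r'\1 \2', text)
  text = re.sub(r'(\w+)(\.)', r'\1 \2', text)
  text = text.replace("n't", " n't")

  return text.split()


challenge = "Sue owed Ms. O'Neil $10. Unfortunately, she  didn't have the money."

tokens = tokenize_optimized(challenge)

print(tokens)
print(len(tokens))  # 16
```

Looking at the list, *Ms.* ends up as two tokens (`'Ms'` and `'.'`), while the comma stays attached in `'Unfortunately,'`.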

Great, our optimized function handles the problems we listed above. However, it is tailored to this particular example and the edge cases (which are really not edge cases) we encountered here: the comma in *Unfortunately,* is still attached to the word, and the abbreviation *Ms.* is split into two tokens. At least we have accounted for not just the *$* sign, but also for *€*.

While state-of-the-art tokenizers use sophisticated language models to solve these problems, there are still good rule-based tokenizers out there. If you want to see some real-world code, have a look at the [NLTKWordTokenizer](https://github.com/nltk/nltk/blob/develop/nltk/tokenize/destructive.py).
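To give a rough idea of what a rule-based approach can look like, here is a small sketch that describes the tokens themselves with a single regular expression instead of modifying the text first. It is not how the NLTK tokenizer works; the pattern and the names `TOKEN_PATTERN` and `tokenize_regex` are assumptions made for illustration, and the rules only cover the cases discussed here.

```python
import re

# Alternatives are tried from top to bottom at each position
TOKEN_PATTERN = re.compile(r"""
    \w+(?=n't)        # word stem directly before the contraction n't
  | n't               # the contraction itself
  | \w+(?:'\w+)*      # words, including names such as O'Neil
  | [$€]              # currency symbols
  | [^\w\s]           # any other single punctuation character
""", re.VERBOSE)


def tokenize_regex(text):
  # findall returns every non-overlapping match; whitespace is skipped
  return TOKEN_PATTERN.findall(text)


challenge = "Sue owed Ms. O'Neil $10. Unfortunately, she  didn't have the money."

regex_tokens = tokenize_regex(challenge)

print(regex_tokens)
print(len(regex_tokens))  # 17
```

In contrast to the replacement-based function, this sketch also splits off the comma, which is why it arrives at 17 tokens. It still separates *Ms.* into two tokens; handling abbreviations would need additional rules or a word list.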