# Text as data

## Segmentation and tokenization

In this section we will try to chop up a piece of text into words (tokens) that we can count and analyse.

In [2]:
example_string = "This is an example string. It will illustrate how text data is not trivial to work with, even if it is nice and clean.\n\nThings like punctuation and abbreviations, e.g. 'cf.' and 'i.e.', can cause trouble for simple approaches of segmentation and tokenization."

In [3]:
print(example_string)

This is an example string. It will illustrate how text data is not trivial to work with, even if it is nice and clean.

Things like punctuation and abbreviations, e.g. 'cf.' and 'i.e.', can cause trouble for simple approaches of segmentation and tokenization.


A naive approach of splitting a text on periods to get sentences and on whitespace to get words gets us pretty far but, as we shall see, not all the way. Let's try out a few things.

In [4]:
example_string.split('.')

['This is an example string',
 ' It will illustrate how text data is not trivial to work with, even if it is nice and clean',
 '\n\nThings like punctuation and abbreviations, e',
 'g',
 " 'cf",
 "' and 'i",
 'e',
 "', can cause trouble for simple approaches of segmentation and tokenization",
 '']

We mostly get complete sentences, but the abbreviations cause trouble. Let's forget sentences for now and focus on individual words. We start by splitting on whitespace.

In [5]:
example_string.split(' ')[:20]

['This',
 'is',
 'an',
 'example',
 'string.',
 'It',
 'will',
 'illustrate',
 'how',
 'text',
 'data',
 'is',
 'not',
 'trivial',
 'to',
 'work',
 'with,',
 'even',
 'if',
 'it']

Hmm, the period sticks to the `string.` token and the comma to the `with,` token. Let's try a regex that splits on a space together with (most) punctuation in front of it. We define a group of punctuation characters, `[.,;?!]`, that _may_ (see the use of the `?` operator) precede the space.

In [6]:
import re # regular expression module

In [10]:
pattern = '[.,;?!]? '
regex = re.compile(pattern)
regex.split(example_string)[:30]

['This',
 'is',
 'an',
 'example',
 'string',
 'It',
 'will',
 'illustrate',
 'how',
 'text',
 'data',
 'is',
 'not',
 'trivial',
 'to',
 'work',
 'with',
 'even',
 'if',
 'it',
 'is',
 'nice',
 'and',
 'clean.\n\nThings',
 'like',
 'punctuation',
 'and',
 'abbreviations',
 'e.g',
 "'cf.'"]

Better. But the newlines (`\n`) still cause trouble: `clean.\n\nThings` stays glued together as one token. We can use the whitespace class `\s`, which also matches newlines, and require one or more whitespace characters with the `+` operator.

In [11]:
pattern = r'[.,;?!]?\s+' # raw string, so the backslash in \s reaches the regex engine untouched
regex = re.compile(pattern)
regex.split(example_string)[:40]

['This',
 'is',
 'an',
 'example',
 'string',
 'It',
 'will',
 'illustrate',
 'how',
 'text',
 'data',
 'is',
 'not',
 'trivial',
 'to',
 'work',
 'with',
 'even',
 'if',
 'it',
 'is',
 'nice',
 'and',
 'clean',
 'Things',
 'like',
 'punctuation',
 'and',
 'abbreviations',
 'e.g',
 "'cf.'",
 'and',
 "'i.e.'",
 'can',
 'cause',
 'trouble',
 'for',
 'simple',
 'approaches',
 'of']

Hmm, notice that we also lost the final period in `e.g.`, resulting in a truncated `e.g` token. We could handle the different abbreviations in our regex. Also, we may actually want to keep the punctuation marks as tokens in their own right, but probably not the whitespace, since it is not that interesting.

You can probably see where this is going. To handle all cases that do not conform to a simple "split on whitespace" approach, the regex grows more complex.
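To make that concrete, here is a minimal sketch of a regex that keeps punctuation as separate tokens, drops whitespace, and special-cases dotted abbreviations. The pattern and the names `token_pattern`, `sample` and `sketch_tokens` are illustrative assumptions, not something NLTK uses internally.

```python
import re

# Alternatives are tried left to right at each position:
#   1. dotted abbreviations such as "e.g." (two or more single letters, each followed by a period)
#   2. runs of word characters
#   3. any single character that is neither a word character nor whitespace
token_pattern = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")

sample = "Abbreviations, e.g. 'cf.' and 'i.e.', cause trouble."
sketch_tokens = token_pattern.findall(sample)
print(sketch_tokens)
```

Note that `e.g.` and `i.e.` now survive as single tokens, but `cf.` is still split into `cf` and `.`, so yet another special case would be needed.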

Instead of reinventing the wheel, we will use tokenizers from the Natural Language Toolkit for this purpose.

In [12]:
import nltk
from nltk import word_tokenize
nltk.download('punkt') # Punkt models used by the tokenizer (newer NLTK releases may ask for 'punkt_tab' instead)

[nltk_data] Downloading package punkt to /home/au479461/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
tokens = word_tokenize(example_string)
tokens

['This',
 'is',
 'an',
 'example',
 'string',
 '.',
 'It',
 'will',
 'illustrate',
 'how',
 'text',
 'data',
 'is',
 'not',
 'trivial',
 'to',
 'work',
 'with',
 ',',
 'even',
 'if',
 'it',
 'is',
 'nice',
 'and',
 'clean',
 '.',
 'Things',
 'like',
 'punctuation',
 'and',
 'abbreviations',
 ',',
 'e.g',
 '.',
 "'cf",
 '.',
 "'",
 'and',
 "'",
 'i.e',
 '.',
 "'",
 ',',
 'can',
 'cause',
 'trouble',
 'for',
 'simple',
 'approaches',
 'of',
 'segmentation',
 'and',
 'tokenization',
 '.']

We can count those tokens and see what makes it to the top.

In [14]:
from collections import Counter

counter = Counter(tokens)
counter.most_common(20)

[('.', 6),
 ('and', 4),
 ('is', 3),
 (',', 3),
 ("'", 3),
 ('This', 1),
 ('an', 1),
 ('example', 1),
 ('string', 1),
 ('It', 1),
 ('will', 1),
 ('illustrate', 1),
 ('how', 1),
 ('text', 1),
 ('data', 1),
 ('not', 1),
 ('trivial', 1),
 ('to', 1),
 ('work', 1),
 ('with', 1)]
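If `Counter` is new to you, the following sketch shows what it does under the hood, using a made-up token list (`toy_count_tokens` is an assumption for illustration). It also turns the raw counts into relative frequencies, which are easier to compare across texts of different lengths.

```python
from collections import Counter

toy_count_tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'ship']

# Manual counting with a plain dict ...
manual_counts = {}
for token in toy_count_tokens:
    manual_counts[token] = manual_counts.get(token, 0) + 1

# ... gives the same counts as Counter, which adds helpers such as most_common()
toy_counter = Counter(toy_count_tokens)
top_two = toy_counter.most_common(2)

# Relative frequency: the count divided by the total number of tokens
total = sum(toy_counter.values())
relative = {token: count / total for token, count in top_two}
print(top_two)
print(relative)
```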

## Loading some text data
Let's get some simple text data to work with. It does not matter much what it is, as long as it is plain text.

A good place to start is [Project Gutenberg](https://www.gutenberg.org/), which has free e-books available as plain text files. I have downloaded some of the top books and stored them in the `data` folder.

You can try to find other text sources and load them in.

In [17]:
from glob import glob

for filename in glob('../data/gutenberg/*.txt'):
    print(filename)

../data/gutenberg/moby_dick.txt
../data/gutenberg/pride_and_prejudice.txt
../data/gutenberg/romeo_and_juliet.txt
../data/gutenberg/middlemarch.txt
../data/gutenberg/a_room_with_a_view.txt


Let's try to work with the text data from _Moby Dick_. Load the text, and chop it into chapters with a regex.

In [18]:
chapter_splitter = re.compile(r'\n{3}CHAPTER \d+\. ') # three newlines, then e.g. "CHAPTER 12. " (\. is a literal period)

with open('../data/gutenberg/moby_dick.txt', encoding='utf-8') as f: # assuming the file is UTF-8 encoded
    raw_text = f.read()
    chapters = chapter_splitter.split(raw_text)[1:] # slice off title, preface etc. that come before chapter 1

for chapter in chapters[:10]:
    print(chapter[:100] + ' ...')
    print('-'*50)

Loomings.

Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money i ...
--------------------------------------------------
The Carpet-Bag.

I stuffed a shirt or two into my old carpet-bag, tucked it under my
arm, and starte ...
--------------------------------------------------
The Spouter-Inn.

Entering that gable-ended Spouter-Inn, you found yourself in a wide,
low, straggli ...
--------------------------------------------------
The Counterpane.

Upon waking next morning about daylight, I found Queequeg’s arm thrown
over me in  ...
--------------------------------------------------
Breakfast.

I quickly followed suit, and descending into the bar-room accosted the
grinning landlord ...
--------------------------------------------------
The Street.

If I had been astonished at first catching a glimpse of so outlandish
an individual as  ...
--------------------------------------------------
The Chapel.

In this same New Bedford there stands a Whaleman’s 
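To see why the first element of the split is sliced away, here is a small sketch on a made-up miniature "book". The string `toy_text` is an assumption for illustration, not the real file contents.

```python
import re

# Same idea as the chapter splitter: three newlines followed by "CHAPTER <number>. "
toy_splitter = re.compile(r'\n{3}CHAPTER \d+\. ')

toy_text = (
    "MOBY-DICK\n\nFront matter goes here."
    "\n\n\nCHAPTER 1. Loomings.\n\nCall me Ishmael."
    "\n\n\nCHAPTER 2. The Carpet-Bag.\n\nI stuffed a shirt."
)

toy_parts = toy_splitter.split(toy_text)
# toy_parts[0] is everything before the first chapter heading
toy_chapters = toy_parts[1:]
for toy_chapter in toy_chapters:
    print(repr(toy_chapter))
```

The matched heading itself (`CHAPTER 1. `) is consumed by the split, which is why each chapter starts directly with its title.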

Let's count some words from the first chapter!

In [19]:
first_chapter = chapters[0] # remember zero indexing!
tokens = word_tokenize(first_chapter)
counter = Counter(tokens)
counter.most_common(20)

[(',', 169),
 ('the', 121),
 ('of', 81),
 ('.', 79),
 ('a', 68),
 ('and', 66),
 ('to', 53),
 ('in', 46),
 ('I', 43),
 ('is', 34),
 ('that', 31),
 ('it', 26),
 ('as', 26),
 (';', 25),
 ('me', 24),
 ('all', 23),
 ('you', 23),
 ('?', 18),
 ('this', 16),
 ('my', 14)]

This is a bit more interesting than the example string above, where most words occurred only once or twice. But it still does not tell us much, since the top-ranking tokens are just punctuation and function words like _the_ and _of_. Let's do some filtering!

## Filtering

In [20]:
from string import punctuation
from nltk import corpus
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/au479461/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
punctuation_set = set(punctuation)
stopwords_set = set(corpus.stopwords.words('english'))

In [22]:
# short-hand way with list comprehension:
# filtered_tokens = [token for token in tokens if token not in punctuation_set and token not in stopwords_set]

filtered_tokens = []
for token in tokens:
    if token not in punctuation_set and token not in stopwords_set:
        filtered_tokens.append(token)

In [23]:
filtered_counter = Counter(filtered_tokens)
filtered_counter.most_common(20)

[('I', 43),
 ('sea', 10),
 ('one', 10),
 ('go', 10),
 ('upon', 9),
 ('But', 8),
 ('part', 7),
 ('’', 7),
 ('And', 7),
 ('see', 6),
 ('It', 6),
 ('get', 6),
 ('time', 6),
 ('land', 6),
 ('like', 6),
 ('water', 6),
 ('old', 6),
 ('take', 5),
 ('What', 5),
 ('ever', 5)]

Better. But this reveals an issue that we need to fix: some of the stopwords are not removed because they begin with a capital letter, e.g. _But_.

We can also make other observations.

The first is that capitalized and non-capitalized versions of a word (e.g. _Tell_ vs _tell_) are counted as two different words. Depending on our analysis, we may want to count them as one.

The second is that inflected and non-inflected versions of a word (e.g. _passenger_ vs _passengers_) are counted as different words. Again, depending on our analysis, we may want to count them as one.
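If you want to remove capitalized stopwords but keep the original casing of everything else, one option is to lowercase only for the comparison. This is a minimal sketch with a hand-written mini stopword set (`toy_stopwords` is an assumption; the notebook itself uses NLTK's English list).

```python
# Hypothetical mini stopword list, for illustration only
toy_stopwords = {'but', 'and', 'it', 'the', 'what'}
toy_cased_tokens = ['But', 'the', 'Sea', 'and', 'It', 'What', 'Ishmael']

# Compare the lowercased form against the stopword set, but keep the original token
kept_tokens = [token for token in toy_cased_tokens
               if token.lower() not in toy_stopwords]
print(kept_tokens)
```

This keeps the distinction between, say, _Sea_ and _sea_ in the counts, which may or may not be what you want.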

## Normalization

Normalization is about handling differences. This includes things like:
1. Capitalization, e.g. _Tell_ vs _tell_.
2. UK and US spelling, e.g. _colour_ vs _color_. (We will skip that here; it is mostly relevant across different texts.)
3. Inflection, e.g. _passenger_ vs _passengers_.

Be mindful that each such normalization step eliminates potentially meaningful information. There are reasons for capitalization (consider _apple_ vs _Apple_ or _bill_ vs _Bill_), and plural inflection and past tense are there for a reason too.

For this exercise we will do decapitalization and stemming.

In [24]:
# short-hand way with list comprehension
# lowercase_tokens = [token.lower() for token in tokens]
# filtered_lowercase_tokens = [token for token in lowercase_tokens
#                              if token not in punctuation_set and token not in stopwords_set]

lowercase_tokens = []
for token in tokens:
    lowercase_tokens.append(token.lower())

filtered_lowercase_tokens = []
for token in lowercase_tokens:
    if token not in punctuation_set and token not in stopwords_set:
        filtered_lowercase_tokens.append(token)

filtered_lowercase_tokens[:50]

['loomings',
 'call',
 'ishmael',
 'years',
 'ago—never',
 'mind',
 'long',
 'precisely—having',
 'little',
 'money',
 'purse',
 'nothing',
 'particular',
 'interest',
 'shore',
 'thought',
 'would',
 'sail',
 'little',
 'see',
 'watery',
 'part',
 'world',
 'way',
 'driving',
 'spleen',
 'regulating',
 'circulation',
 'whenever',
 'find',
 'growing',
 'grim',
 'mouth',
 'whenever',
 'damp',
 'drizzly',
 'november',
 'soul',
 'whenever',
 'find',
 'involuntarily',
 'pausing',
 'coffin',
 'warehouses',
 'bringing',
 'rear',
 'every',
 'funeral',
 'meet',
 'especially']

In [25]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# short-hand way with list comprehension
# stemmed_filtered_lowercase_tokens = [stemmer.stem(token) for token in filtered_lowercase_tokens]

stemmed_filtered_lowercase_tokens = []
for token in filtered_lowercase_tokens:
    stemmed = stemmer.stem(token)
    stemmed_filtered_lowercase_tokens.append(stemmed)

stemmed_filtered_lowercase_tokens[:50]

['loom',
 'call',
 'ishmael',
 'year',
 'ago—nev',
 'mind',
 'long',
 'precisely—hav',
 'littl',
 'money',
 'purs',
 'noth',
 'particular',
 'interest',
 'shore',
 'thought',
 'would',
 'sail',
 'littl',
 'see',
 'wateri',
 'part',
 'world',
 'way',
 'drive',
 'spleen',
 'regul',
 'circul',
 'whenev',
 'find',
 'grow',
 'grim',
 'mouth',
 'whenev',
 'damp',
 'drizzli',
 'novemb',
 'soul',
 'whenev',
 'find',
 'involuntarili',
 'paus',
 'coffin',
 'warehous',
 'bring',
 'rear',
 'everi',
 'funer',
 'meet',
 'especi']

In [26]:
counter = Counter(stemmed_filtered_lowercase_tokens)
counter.most_common(20)

[('go', 15),
 ('sea', 12),
 ('one', 11),
 ('part', 10),
 ('upon', 9),
 ('whale', 8),
 ('get', 7),
 ('’', 7),
 ('take', 7),
 ('passeng', 7),
 ('see', 6),
 ('time', 6),
 ('land', 6),
 ('like', 6),
 ('water', 6),
 ('voyag', 6),
 ('old', 6),
 ('thing', 6),
 ('sailor', 6),
 ('whenev', 5)]
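The `’` token is still in the list because `string.punctuation` only contains ASCII punctuation, and tokens like `ago—never` remain glued together by the dash. The sketch below shows one possible cleanup on a made-up token list (`toy_messy_tokens` and `dash_splitter` are assumptions): split on dashes and drop every piece that contains no letter or digit.

```python
import re

toy_messy_tokens = ['ago—never', 'mind', '’', 'precisely—having', '“', 'sea', '...']

# Split tokens on em and en dashes
dash_splitter = re.compile('[—–]')

cleaned_tokens = []
for token in toy_messy_tokens:
    for piece in dash_splitter.split(token):
        # Keep a piece only if it contains at least one letter or digit
        if any(ch.isalnum() for ch in piece):
            cleaned_tokens.append(piece)

print(cleaned_tokens)
```

A step like this would have to run before the stopword filter, since pieces such as _never_ and _having_ may themselves be stopwords.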

### A note on stemming
Stemming can be quite aggressive in how much of a word it removes, e.g. both _passenger_ and _passengers_ are reduced to _passeng_. An alternative is lemmatization, which gives the dictionary (lookup) form of a word. That is, _passengers_ and _passenger_ would both become _passenger_. However, lemmatization requires POS tags on the tokens to work properly.
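
To illustrate why the POS tag matters, here is a toy sketch of lemmatization as a dictionary lookup. The table `toy_lemmas`, its tag names and the function `toy_lemmatize` are made up for this example; a real lemmatizer (NLTK ships one based on WordNet) relies on a full lexicon rather than a handful of entries.

```python
# Hypothetical lookup table: (surface form, POS tag) -> lemma
toy_lemmas = {
    ('passengers', 'NOUN'): 'passenger',
    ('saw', 'NOUN'): 'saw',
    ('saw', 'VERB'): 'see',
    ('meeting', 'NOUN'): 'meeting',
    ('meeting', 'VERB'): 'meet',
}

def toy_lemmatize(token, pos):
    """Return the lemma for (token, pos); fall back to the lowercased token."""
    key = (token.lower(), pos)
    return toy_lemmas.get(key, token.lower())

print(toy_lemmatize('Passengers', 'NOUN'))
print(toy_lemmatize('saw', 'NOUN'), toy_lemmatize('saw', 'VERB'))
```

The same surface form _saw_ maps to different lemmas depending on whether it is tagged as a noun or a verb, which a stemmer cannot distinguish.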