# Word Counting

This notebook introduces some of the basic tools and ideas for working with natural language (text), including tokenization and word counting.

## Imports

In [1]:
import types

## Tokenization

In [2]:
PUNCTUATION = '`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t\n'

Write a generator function, `remove_punctuation`, that removes punctuation from an iterator of words and yields the cleaned words:

* Strip the punctuation characters at the beginning and end of each word.
* Replace `-` by a space if found in the middle of the word and split on that white space to yield multiple words.
* If a word is all punctuation, don't yield it at all.

In [3]:
def remove_punctuation(words, punctuation=PUNCTUATION):
    """Remove punctuation from an iterator of words, yielding the results."""
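    for word in words:
        # Strip leading and trailing punctuation, then split on interior hyphens.
        stripped = word.strip(punctuation)
        for part in stripped.replace("-", " ").split():
            # str.split() with no arguments never returns empty strings,
            # so an all-punctuation word yields nothing.
            yield part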

In [4]:
assert list(remove_punctuation(['!data;']))==['data']
assert list(remove_punctuation(['!data-science:']))==['data', 'science']
assert list(remove_punctuation(['!!']))==[]
assert isinstance(remove_punctuation(['!!']), types.GeneratorType)
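The stripping step relies on `str.strip` treating its argument as a set of characters rather than as a literal prefix or suffix. A small illustration, using a shortened, hypothetical punctuation string (`PUNCT`) in place of the full `PUNCTUATION` constant:

```python
# PUNCT is a shortened, illustrative punctuation set (an assumption for this sketch).
PUNCT = '!;:,.-'

# Every leading/trailing character that appears in PUNCT is removed,
# regardless of the order in which the characters occur.
stripped = ';!data-science:,'.strip(PUNCT)
print(stripped)  # the interior '-' is untouched by strip

# Splitting on the interior hyphen then gives the individual words.
parts = stripped.replace('-', ' ').split()
print(parts)

# A word made only of punctuation strips down to the empty string,
# and ''.split() returns an empty list, so nothing is produced for it.
only_punctuation = '!!'.strip(PUNCT).split()
print(only_punctuation)
```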

Write a generator function, `lower_words`, that makes each word in an iterator lowercase, yielding each lowercase word:

In [5]:
def lower_words(words):
    """Make each word in an iterator lowercase."""
    for word in words:
        yield word.lower()

In [6]:
assert isinstance(lower_words('AAA'), types.GeneratorType)
assert list(lower_words('This IS NOT LoWerCaSe'.split(' ')))==['this', 'is', 'not', 'lowercase']

[Stop words](https://en.wikipedia.org/wiki/Stop_words) are common words in text that are typically filtered out when performing natural language processing. Typical stop words are *and*, *of*, *a*, *the*, etc.

Write a generator function, `remove_stop_words`, that removes stop words from an iterator, yielding the results:

In [7]:
def remove_stop_words(words, stop_words=None):
    """Remove the stop words from an iterator of words.
    
    stop_words can be provided as a list of words or a whitespace separated string of words.
    """
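    if stop_words is None:
        stop_words = ()
    elif isinstance(stop_words, str):
        # Split the string so membership is tested per word, not per substring.
        stop_words = stop_words.split()
    stop_words = set(stop_words)
    for word in words:
        if word not in stop_words:
            yield word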

In [8]:
assert list(remove_stop_words('the begin to the end a of the day'.split(' '), stop_words='a the')) == \
    ['begin', 'to', 'end', 'of', 'day']
assert list(remove_stop_words('the begin to the end a of the day'.split(' '), stop_words=['a', 'the'])) == \
    ['begin', 'to', 'end', 'of', 'day']
assert list(remove_stop_words('the begin to the end a of the day'.split(' '))) == \
    ['the', 'begin', 'to', 'the', 'end', 'a', 'of', 'the', 'day']
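When `stop_words` is passed as a whitespace-separated string, note that Python's `in` operator on a string tests for substrings rather than whole words. The sketch below illustrates the difference; `as_stop_word_set` is a hypothetical helper introduced here for illustration, not part of the assignment.

```python
def as_stop_word_set(stop_words):
    """Normalize stop words (None, a string, or an iterable) into a set.

    Hypothetical helper, for illustration only.
    """
    if stop_words is None:
        return set()
    if isinstance(stop_words, str):
        # Split so that membership is tested per word.
        return set(stop_words.split())
    return set(stop_words)


raw = 'a the'
# Substring test: 'he' is found inside 'the', although it is not a stop word.
substring_hit = 'he' in raw
# Whole-word test against the normalized set.
word_hit = 'he' in as_stop_word_set(raw)
print(substring_hit, word_hit)
```

Normalizing to a set also makes each membership test constant time on average, which matters when the stop word list is long.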

[Tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking a string or line of text and returning a sequence of words, or *tokens*, with the following transforms applied:

* Punctuation removed
* All words lowercased
* Stop words removed

Write a generator function, `tokenize_line`, that yields tokenized words from an input line of text.

In [9]:
def tokenize_line(line, stop_words=None, punctuation=PUNCTUATION):
    """Yield the tokens in a line of text, removing punctuation and stop words."""
    # Pass `punctuation` through, and chain the generators without building a list.
    words = remove_punctuation(line.split(), punctuation)
    yield from remove_stop_words(lower_words(words), stop_words)

In [10]:
assert isinstance(tokenize_line("This, is the way; that things will end"), types.GeneratorType)
assert list(tokenize_line("This, is the way; that things will end", stop_words=['the', 'is'])) == \
    ['this', 'way', 'that', 'things', 'will', 'end']

Write a generator function, `tokenize_lines`, that can yield the tokens in an iterator of lines of text.

In [11]:
def tokenize_lines(lines, stop_words=None, punctuation=PUNCTUATION):
    """Tokenize an iterator of lines, yielding the tokens."""
    for line in lines:
        # Delegate to the per-line generator without materializing a list.
        yield from tokenize_line(line, stop_words, punctuation)

In [12]:
wasteland = """
APRIL is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
"""

assert isinstance(tokenize_lines(wasteland.splitlines()), types.GeneratorType)

assert list(tokenize_lines(wasteland.splitlines(), stop_words='is the of and')) == \
    ['april','cruellest','month','breeding','lilacs','out','dead','land',
     'mixing','memory','desire','stirring','dull','roots','with','spring',
     'rain']
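Because each stage is a generator, tokens are produced on demand and no intermediate list of words needs to be built. A rough sketch of this idea, using a simplified stand-in tokenizer (`simple_tokens`, a hypothetical name) applied to an endless stream of lines:

```python
import itertools


def simple_tokens(lines):
    """Simplified stand-in for a tokenizer: lowercase, whitespace-split words."""
    for line in lines:
        for word in line.split():
            yield word.lower()


# itertools.cycle repeats the two lines forever, so this input could never
# be materialized as a list; the generator still works one token at a time.
endless_lines = itertools.cycle(['APRIL is the', 'cruellest month'])
first_six = list(itertools.islice(simple_tokens(endless_lines), 6))
print(first_six)
```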

## Counting words

Write a function, `count_words`, that takes an iterator of words and returns a dictionary whose keys are the unique words and whose values are the word counts. Never assume that the input iterator is a concrete list or tuple.

In [13]:
def count_words(words):
    """Return a word count dictionary from an iterator of words."""
    word_dict = {}
    
    for word in words:
        if word not in word_dict:
            word_dict[word] = 1
        else:
            word_dict[word] += 1
            
    return word_dict

In [14]:
assert count_words(tokenize_line('This, and The-this from, and A a a')) == \
    {'a': 3, 'and': 2, 'from': 1, 'the': 1, 'this': 2}
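For comparison, the standard library's `collections.Counter` performs the same single-pass counting over any iterable. This is offered as an alternative sketch rather than the intended solution to the exercise; `count_words_counter` is a hypothetical name.

```python
from collections import Counter


def count_words_counter(words):
    """Count words with collections.Counter and return a plain dict."""
    # Counter consumes the iterable once, so generators are fine as input.
    return dict(Counter(words))


counts = count_words_counter(iter(['a', 'and', 'a', 'this', 'and', 'a']))
print(counts)
```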

Write a function, `sort_word_counts`, that returns a list of sorted word counts:

* Each element of the list should be a `(word, count)` tuple.
* The list should be sorted by the word counts, with the highest counts coming first.
* To perform this sort, look at using the `sorted` function.

Returning a concrete list is fine here, because its memory use is proportional to the number of unique words in the text rather than to the total number of words.

In [15]:
def sort_word_counts(wc):
    """Return a list of 2-tuples of (word, count), sorted by count descending."""
    return sorted(wc.items(), key=lambda word_count: word_count[1], reverse=True)

In [16]:
assert set(sort_word_counts(count_words(tokenize_line('This, and The-this from, and A a a')))) == \
    {('a', 3), ('and', 2), ('this', 2), ('the', 1), ('from', 1)}
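Words with equal counts stay in dictionary order, because `sorted` is stable. If ties should instead be broken alphabetically, one option is a compound sort key. A sketch, with the hypothetical name `sort_word_counts_alpha`:

```python
def sort_word_counts_alpha(wc):
    """Sort (word, count) pairs by count descending, then by word ascending."""
    # Negating the count sorts it in descending order while the word
    # component of the key still sorts in ascending order.
    return sorted(wc.items(), key=lambda item: (-item[1], item[0]))


ranked = sort_word_counts_alpha({'this': 2, 'a': 3, 'the': 1, 'and': 2, 'from': 1})
print(ranked)
```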

## File IO

Write a generator function, `files_to_lines`, that takes an iterator of filenames and yields the lines in all of those files. Never create a concrete list/tuple in this process, so that memory consumption stays $\mathcal{O}(1)$. Use a `with` statement to properly close each file.

In [17]:
def files_to_lines(files):
    """Iterate over a sequence of filenames, yielding all of the lines in the files."""
    for file in files:
        with open(file) as f:
            for line in f:
                yield line

In [18]:
%%writefile file1.txt
This is the first line in the first file.
This is the secon line in the first file.

Overwriting file1.txt


In [19]:
%%writefile file2.txt
This is the first line in the second file.
This is the second line in the second file.

Overwriting file2.txt


In [20]:
assert isinstance(files_to_lines(['file1.txt', 'file2.txt']), types.GeneratorType)
assert list(files_to_lines(['file1.txt', 'file2.txt'])) == \
    ['This is the first line in the first file.\n',
     'This is the secon line in the first file.',
     'This is the first line in the second file.\n',
     'This is the second line in the second file.']
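To check this behavior without relying on files in the working directory, the same idea can be exercised against temporary files. The sketch below is a variant under stated assumptions: `iter_file_lines` is a hypothetical name, and the explicit `encoding='utf-8'` is an assumption about the input files.

```python
import os
import tempfile


def iter_file_lines(paths, encoding='utf-8'):
    """Yield lines from each file in turn, one line at a time."""
    for path in paths:
        with open(path, encoding=encoding) as f:
            # Iterating over the file object reads it line by line.
            yield from f


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [('a.txt', 'one\ntwo\n'), ('b.txt', 'three\n')]:
        path = os.path.join(tmp, name)
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        paths.append(path)
    # Consume the generator while the temporary files still exist.
    lines = list(iter_file_lines(paths))

print(lines)
```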

## All together now

Now use all of the above functions to perform tokenization and word counting for all of the text documents described by your instructor:

* You should be able to perform this in a memory-efficient manner.
* Read your stop words from the included `stopwords.txt` file.
* Save your sorted word counts to a variable named `swc`.

In [21]:
# %cat /data/gutenberg/1400.txt

In [22]:
%cd ~/assignment03
def get_all_stopwords():
    with open("stopwords.txt") as file:
        for line in file:
            for word in line.split():
                yield word

# A set gives fast membership tests when filtering stop words.
stopwords = set(get_all_stopwords())
    
%cd /data/gutenberg
files = ['11.txt', '1400.txt', '17208.txt', '2701.txt', '33511.txt', 'README.md', '1342.txt', '1661.txt', '23.txt', '29021.txt', '84.txt']

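# Feed the token generator straight into count_words; wrapping it in list()
# would hold every token in memory at once.
swc = sort_word_counts(count_words(tokenize_lines(files_to_lines(files), stopwords)))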

/nbhome/dtumer/assignment03
/data/gutenberg


In [23]:
assert [word for word, count in swc[0:10]] == \
    ['said', 'one', 'mr', 'now', 'upon', 'will', 'little', 'time', 'man', 'like']

Create a horizontal bar chart for the top 50 words using text and simple calls to `print`:

* For each word, encode the count as a bar of `*` characters.
* You will have to scale the length of your bars to fit on the page.
* Provide a label for each bar that indicates which word the count applies to.
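One way to handle the scaling step is to fix a maximum bar width and scale every count relative to the largest one. A minimal sketch, assuming a hypothetical `format_bars` helper, an arbitrary default width of 60 characters, and made-up counts:

```python
def format_bars(word_counts, width=60):
    """Return one 'word: ***' line per (word, count) pair.

    Bars are scaled so that the largest count spans `width` characters.
    """
    if not word_counts:
        return []
    # Fall back to 1 to avoid dividing by zero when every count is 0.
    largest = max(count for _, count in word_counts) or 1
    label_width = max(len(word) for word, _ in word_counts)
    lines = []
    for word, count in word_counts:
        # Round to the nearest whole star, but show at least one star
        # for any nonzero count.
        stars = max(1, round(count * width / largest)) if count else 0
        lines.append(word.rjust(label_width) + ': ' + '*' * stars)
    return lines


# Made-up counts, for illustration only.
demo_lines = format_bars([('said', 300), ('one', 150), ('mr', 5)], width=20)
for line in demo_lines:
    print(line)
```

Right-aligning the labels keeps the bars starting in the same column, which makes their lengths easier to compare.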

In [24]:
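SCALE = 40  # number of occurrences represented by each '*'

def print_bar(count, scale=SCALE):
    """Yield one '*' per `scale` occurrences of a word, rounding up."""
    remaining = count / scale
    while remaining > 0:
        yield "*"
        remaining -= 1

print("each '*' represents {} instances of the specified word\n".format(SCALE))
for word, count in swc[0:50]:
    print(word + ": " + "".join(print_bar(count)))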

each '*' represents 40 instances of the specified word

said: *******************************************************************************************
one: ************************************************************************
mr: ***************************************************
now: ***************************************************
upon: *************************************************
will: **********************************************
little: *******************************************
time: **************************************
man: **************************************
like: ************************************
see: *********************************
must: ********************************
much: ********************************
well: ******************************
know: ******************************
may: *****************************
whale: ****************************
two: ****************************
great: ****************************
never: ***********************