This notebook will be collected automatically at **6pm on Monday** from the `/home/data_scientist/assignments/Week7` directory on the course JupyterHub server. If you work on this assignment on the course JupyterHub server, just make sure that you save your work; instructors will pull your notebooks automatically after the deadline. If you work on this assignment locally, note that the only way to submit is via JupyterHub: you must place the notebook file in the correct directory with the correct file name before the deadline.

1. Make sure everything runs as expected. First, restart the kernel (in the menubar, select `Kernel` → `Restart`) and then run all cells (in the menubar, select `Cell` → `Run All`).
2. Make sure you fill in every place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed by the autograder.
3. Do not change the file path or the file name of this notebook.
4. Make sure that you save your work (in the menubar, select `File` → `Save and Checkpoint`).

# Problem 7.1. Text Analysis.

In this problem, we perform basic text analysis tasks,
such as accessing data, tokenizing a corpus, and computing token frequencies,
on our course syllabus and on the NLTK Reuters corpus.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy as sp
import re
import requests

import nltk

from nose.tools import (
    assert_equal,
    assert_is_instance,
    assert_almost_equal,
    assert_true,
    assert_false
    )

In the first half of the problem, we use our [course syllabus](https://github.com/UI-DataScience/info490-sp16/blob/master/orientation/syllabus.md) as a sample text.

In [2]:
repo_url = 'https://raw.githubusercontent.com/UI-DataScience/info490-sp16'
syllabus_path = 'orientation/syllabus.md'
commit_hash = '9a70b4f736963ff9bece424b0b34a393ebd574f9'

resp = requests.get('{0}/{1}/{2}'.format(repo_url, commit_hash, syllabus_path))
syllabus_text = resp.text

assert_is_instance(syllabus_text, str)

print(syllabus_text)

# INFO 490: Advanced Data Science #

INFO 490: Advanced Data Science explores advanced concepts in data
science by employing a practical approach, including machine learning;
probabilistic programming; text, network, and graph analysis; and cloud
computing.

## Course Goals ##

Upon completion of this course, students will be expected to understand
advanced data science concepts. Students will learn the practical
aspects of applying machine and statistical learning in a variety of
contexts, as well as different aspects of cloud computing. Specific
concepts that will be covered including supervised and unsupervised
learning, dimensional reduction, clustering, probabilistic programming,
text mining, graph analysis, network analysis, Hadoop, NoSQL data
stores, Spark, and streaming data analysis.

## Prerequisites ##

As a pre-requisite for this course, you must have mastered the material
in *INFO 490: Foundations of Data Science*. Generally, this is
demonstrated by having taken this previ

## Tokenize

- Tokenize the text string `syllabus_text`.
  You should clean up the list of tokens by removing all punctuation tokens
  and keeping only tokens with one or more alphanumeric characters.
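
As a rough illustration of the cleaning rule described above, the sketch below tokenizes a made-up sentence with `re.findall`. The helper name `tokenize_sketch` and the sample string are assumptions for illustration, not part of the assignment, and this is only one of several ways to get the same result. Note that `\w` also matches the underscore, so a token made only of underscores would need separate handling if one appeared.

```python
import re

def tokenize_sketch(text):
    # Hypothetical helper: lowercase the text, then pull out maximal runs
    # of word characters (letters, digits, underscore).
    return re.findall(r'\w+', text.lower())

sample = "Due: Tuesday, 12:00 PM -- no late work!"  # made-up sample text
tokens = tokenize_sketch(sample)
print(tokens)
# ['due', 'tuesday', '12', '00', 'pm', 'no', 'late', 'work']
```

Punctuation such as `:` and `--` disappears, and `12:00` splits into `'12'` and `'00'`.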

In [3]:
def get_words(text):
    '''
    Tokenizes the text string, and returns a list of tokens with
    one or more alphanumeric characters.
    
    Parameters
    ----------
    text: A string.
    
    Returns
    -------
    words: A list of strings.
    '''
    
    # YOUR CODE HERE
    # Match every character that is neither a word character nor whitespace
    pattern = re.compile(r'[^\w\s]')
    # Replace those characters with spaces, lowercase, and split on whitespace
    words = pattern.sub(' ', text.lower()).split()
        
    return words

In [4]:
syllabus_words = get_words(syllabus_text)
print(syllabus_words[:5], '...', syllabus_words[-5:])

['info', '490', 'advanced', 'data', 'science'] ... ['following', 'tuesday', '12', '00', 'pm']


In [5]:
assert_is_instance(syllabus_words, list)
assert_true(all(isinstance(w, str) for w in syllabus_words))
assert_equal(len(syllabus_words), 2363)

assert_true(all(all(not c.isupper() for c in w) for w in syllabus_words))
assert_true(all(any(c.isalnum() for c in w) for w in syllabus_words))

assert_equal(syllabus_words[:5], ['info', '490', 'advanced', 'data', 'science'])
assert_equal(syllabus_words[-5:], ['following', 'tuesday', '12', '00', 'pm'])

## Lexical Diversity

- Compute the number of tokens, the number of words, and the lexical diversity.
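
To make the three quantities concrete, here is a small sketch on a made-up word list. Judging by the expected values in this notebook (702 and 2363 for the syllabus), "tokens" here means the number of distinct strings and "words" means the total length of the list, with lexical diversity taken as words divided by tokens. Other references define the ratio the other way around, so treat this convention as specific to this assignment. The variable names below are illustrative only.

```python
from collections import Counter

toy_words = ['the', 'cat', 'saw', 'the', 'dog', 'the', 'end']  # made-up data

counts = Counter(toy_words)    # maps each distinct string to its count
n_distinct = len(counts)       # number of distinct strings
n_total = len(toy_words)       # total number of entries in the list
ratio = n_total / n_distinct   # average occurrences per distinct string

print(n_distinct, n_total, round(ratio, 3))
# 5 7 1.4
```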

In [6]:
def count(words):
    '''
    Computes the number of tokens, the number of words, and the lexical diversity.
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A 3-tuple of (num_tokens, num_words, lex_div)
    num_tokens: An int. The number of tokens in "words".
    num_words: An int. The number of words in "words".
    lex_div: A float. The lexical diversity of "words".
    '''
    
    # YOUR CODE HERE
    counts = nltk.FreqDist(words)
    num_tokens = len(counts)
    num_words = len(words)
    lex_div = num_words / num_tokens
    
    return num_tokens, num_words, lex_div

In [7]:
num_tokens, num_words, lex_div = count(syllabus_words)
print("Syllabus has {0} tokens and {1} words for a lexical diversity of {2:4.3f}"
      "".format(num_tokens, num_words, lex_div))

Syllabus has 702 tokens and 2363 words for a lexical diversity of 3.366


In [8]:
assert_is_instance(num_tokens, int)
assert_is_instance(num_words, int)
assert_is_instance(lex_div, float)

assert_equal(num_tokens, 702)
assert_equal(num_words, 2363)
assert_almost_equal(lex_div, 3.366096866096866)

## Most common occurrences

- Compute the most commonly occurring terms and their counts.
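
A minimal sketch of the idea on made-up data, using the standard library's `collections.Counter` (in NLTK 3, `nltk.FreqDist` builds on `Counter`, so its `most_common` method should behave the same way). The list and variable names are assumptions for illustration.

```python
from collections import Counter

toy_words = ['data', 'science', 'data', 'course', 'data', 'course']  # made-up data

# most_common(n) returns the n highest counts as (item, count) tuples,
# ordered from most to least frequent.
top_two = Counter(toy_words).most_common(2)
print(top_two)
# [('data', 3), ('course', 2)]
```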

In [9]:
def get_most_common(words, ntop):
    '''
    Computes the most commonly occurring terms and their counts.
    
    Parameters
    ----------
    words: A list of strings.
    ntop: An int. The number of most common words that will be returned.
    
    Returns
    -------
    A list of (token, frequency) tuples.
    '''
    
    # YOUR CODE HERE
    counts = nltk.FreqDist(words)
    most_common = counts.most_common(ntop)
    
    return most_common

In [10]:
syllabus_most_common = get_most_common(syllabus_words, 10)

print('{0:12s}: {1}'.format('Term', 'Count'))
print(20*'-')

for token, freq in syllabus_most_common:
    print('{0:12s}: {1:4d}'.format(token, freq))

Term        : Count
--------------------
the         :  113
to          :   58
will        :   51
and         :   48
a           :   47
week        :   44
of          :   43
be          :   39
course      :   38
you         :   38


In [11]:
assert_is_instance(syllabus_most_common, list)
assert_true(all(isinstance(t, tuple) for t in syllabus_most_common))
assert_true(all(isinstance(t, str) for t, f in syllabus_most_common))
assert_true(all(isinstance(f, int) for t, f in syllabus_most_common))

assert_equal(len(get_most_common(syllabus_words, 10)), 10)
assert_equal(len(get_most_common(syllabus_words, 20)), 20)

assert_equal(
    set(syllabus_most_common[:10]),
    set([('the', 113), ('to', 58), ('will', 51), ('and', 48), ('a', 47),
     ('week', 44), ('of', 43), ('be', 39), ('course', 38), ('you', 38)])
    )

## Hapax

- Write a function that finds all hapaxes (words that occur exactly once) in a list of words.
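
A hapax is a word that occurs exactly once. The sketch below shows the idea on a made-up list with `collections.Counter`; the names are illustrative, and `nltk.FreqDist` offers a ready-made `hapaxes()` method for the same purpose.

```python
from collections import Counter

toy_words = ['to', 'be', 'or', 'not', 'to', 'be']  # made-up data
counts = Counter(toy_words)

# Keep only the strings whose count is exactly one.
hapaxes_sketch = [w for w, c in counts.items() if c == 1]
print(sorted(hapaxes_sketch))
# ['not', 'or']
```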

In [12]:
def find_hapaxes(words):
    '''
    Finds hapaxes in "words".
    
    Parameters
    ----------
    words: A list of strings.
    
    Returns
    -------
    A list of strings.
    '''
    
    # YOUR CODE HERE
    counts = nltk.FreqDist(words)
    hapaxes = counts.hapaxes()
    
    return hapaxes

In [13]:
syllabus_hapaxes = find_hapaxes(syllabus_words)
print(sorted(syllabus_hapaxes)[-10:])

['why', 'willing', 'wondering', 'working', 'worth', 'would', 'writing', 'www', 'yourself', 'zero']


In [14]:
assert_is_instance(syllabus_hapaxes, list)
assert_true(all(isinstance(w, str) for w in syllabus_hapaxes))
assert_equal(len(syllabus_hapaxes), 388)
assert_equal(
    sorted(syllabus_hapaxes)[-10:],
    ['why', 'willing', 'wondering', 'working', 'worth',
     'would', 'writing', 'www', 'yourself', 'zero']
    )

## NLTK corpus

In the second half of the problem, we use the NLTK Reuters corpus. See the [NLTK docs](http://www.nltk.org/book/ch02.html#reuters-corpus) for more information.

In [15]:
from nltk.corpus import reuters

## Lexical diversity in corpus

- Compute the number of tokens, the number of words, and the lexical diversity. Use the `words()` method of the `reuters` object, whose output includes non-alphanumeric tokens.
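
Because the corpus tokens are used as-is, punctuation marks and differently cased forms all count separately. The sketch below illustrates the effect on a made-up token list (not taken from the Reuters data); the names are assumptions for illustration.

```python
from collections import Counter

toy_tokens = ['The', 'dollar', 'fell', ',', 'the', 'yen', 'rose', '.']  # made-up data

# Raw counts: 'The' and 'the' are distinct, and ',' and '.' are counted too.
raw = Counter(toy_tokens)

# For contrast: lowercase and drop tokens with no alphanumeric character.
cleaned = Counter(t.lower() for t in toy_tokens if any(c.isalnum() for c in t))

print(len(toy_tokens), len(raw))            # 8 8
print(sum(cleaned.values()), len(cleaned))  # 6 5
```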

In [16]:
def count_corpus(corpus):
    '''
    Computes the number of tokens, the number of words, and the lexical diversity.
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    
    Returns
    -------
    A 3-tuple of (num_tokens, num_words, lex_div)
    num_tokens: An int. The number of tokens in "corpus".
    num_words: An int. The number of words in "corpus".
    lex_div: A float. The lexical diversity of "corpus".
    '''
    
    # YOUR CODE HERE
    words = corpus.words()
    counts = nltk.FreqDist(words)
    num_words = len(words)
    num_tokens = len(counts)
    lex_div = num_words / num_tokens
    
    return num_tokens, num_words, lex_div

In [17]:
num_tokens, num_words, lex_div = count_corpus(reuters)
print("The Reuters corpus has {0} tokens and {1} words for a lexical diversity of {2:4.3f}"
      "".format(num_tokens, num_words, lex_div))

The Reuters corpus has 41600 tokens and 1720901 words for a lexical diversity of 41.368


In [18]:
assert_is_instance(num_tokens, int)
assert_is_instance(num_words, int)
assert_is_instance(lex_div, float)
assert_equal(num_tokens, 41600)
assert_equal(num_words, 1720901)
assert_almost_equal(lex_div, 41.3678125)

## Long words

- Search for all words in corpus that are longer than 20 characters.
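
The sketch below shows the filter on a hypothetical stand-in object, so it runs without the Reuters data. `ToyCorpus` and `long_words_sketch` are made-up names; the only assumption about a real corpus is that it exposes a `words()` method. The comparison is strict, so a string of exactly 20 characters is not returned, and runs of punctuation count as "words" just like anything else.

```python
class ToyCorpus:
    # Hypothetical stand-in that exposes only a words() method.
    def words(self):
        return ['oil', 'x' * 20, 'y' * 21, '.' * 25]

def long_words_sketch(corpus, length=20):
    # Keep strings strictly longer than `length`; duplicates are not removed.
    return [w for w in corpus.words() if len(w) > length]

found = long_words_sketch(ToyCorpus())
print([len(w) for w in found])
# [21, 25]
```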

In [19]:
def get_long_words(corpus, length=20):
    '''
    Finds all words in "corpus" longer than "length".
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    length: An int. Default: 20
    
    Returns
    -------
    A list of strings.
    '''
    
    # YOUR CODE HERE
    words = corpus.words()
    long_words = [word for word in words if len(word) > length]
    
    return long_words

In [20]:
long_words = get_long_words(reuters, length=20)
print(long_words)

['discontinuedoperations', 'Beteiligungsgesellschaft', 'Gloeielampenfabrieken', '..........................................', 'Warenhandelsgesellschaft']


In [21]:
assert_is_instance(long_words, list)
assert_true(all(isinstance(w, str) for w in long_words))
assert_equal(len(long_words), 5)
assert_equal(
    set(long_words),
    set([
        'discontinuedoperations',
        'Warenhandelsgesellschaft',
        'Gloeielampenfabrieken',
        'Beteiligungsgesellschaft',
        '..........................................'
        ])
    )