A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 1. Text Analysis.

In this problem, we perform basic text analysis tasks on the NLTK Reuters corpus.

In [None]:
import nltk
from nltk.corpus import reuters

from nose.tools import assert_equal, assert_is_instance, assert_true

The following text and code cells give an overview of the NLTK Reuters corpus, taken from the [NLTK documentation](http://www.nltk.org/book/ch02.html#reuters-corpus).

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test"; thus, the text with fileid `test/14826` is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document, as discussed in [Chapter 6 of the NLTK book](http://www.nltk.org/book/ch06.html#chap-data-intensive).
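Because the set name is the part of the fileid before the slash, the two sets can be separated with plain string handling. The following is a minimal sketch, not part of the assignment: `split_by_set` is a hypothetical helper, and the short list stands in for the full output of `reuters.fileids()`.

```python
# A small stand-in for reuters.fileids(); only ids mentioned in this notebook are used.
sample_fileids = ['test/14826', 'training/9865', 'training/9880']

def split_by_set(fileids):
    """Group fileids by the set name that precedes the '/' (hypothetical helper)."""
    groups = {}
    for fileid in fileids:
        set_name, _, _ = fileid.partition('/')
        groups.setdefault(set_name, []).append(fileid)
    return groups

groups = split_by_set(sample_fileids)
print(groups['test'])
print(groups['training'])
```

With the corpus loaded, passing `reuters.fileids()` instead of `sample_fileids` should give the same kind of grouping for all documents.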

In [None]:
print(len(reuters.fileids()))

In [None]:
print(reuters.fileids()[:10])

In [None]:
print(len(reuters.categories()))

In [None]:
print(reuters.categories())

Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.

In [None]:
print(reuters.categories('training/9865'))

In [None]:
print(reuters.categories(['training/9865', 'training/9880']))

In [None]:
print(reuters.fileids('barley'))

In [None]:
print(reuters.fileids(['barley', 'corn']))
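Since categories overlap, one way to see the overlap directly is to intersect the fileid lists of two categories. The sketch below uses a toy mapping whose ids and category assignments are invented for illustration; they do not reflect the actual contents of the Reuters corpus.

```python
# Toy data: invented fileids and category assignments, NOT real Reuters data.
toy_fileids_by_category = {
    'barley': ['doc/1', 'doc/2', 'doc/3'],
    'corn': ['doc/2', 'doc/3', 'doc/4'],
}

# Documents that appear under both categories.
in_both = sorted(
    set(toy_fileids_by_category['barley']) & set(toy_fileids_by_category['corn'])
)
print(in_both)

# With the real corpus, the same idea would read:
#   set(reuters.fileids('barley')) & set(reuters.fileids('corn'))
```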

Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored as upper case.

In [None]:
print(reuters.words('training/9865')[:14])
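The upper-case convention suggests a simple heuristic for pulling the title off the front of a document: take tokens for as long as they are upper case. This is only a sketch under that assumption. The token list is invented, and a title containing punctuation or digits would end the run early, because `str.isupper` returns `False` for tokens with no cased characters.

```python
from itertools import takewhile

# Invented tokens standing in for the output of reuters.words(fileid).
toy_words = ['GRAIN', 'EXPORTS', 'RISE', 'Exports', 'of', 'grain', 'rose', 'sharply']

# Collect the leading run of upper-case tokens as the (assumed) title.
title = list(takewhile(str.isupper, toy_words))
print(title)
```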

In [None]:
print(reuters.words(['training/9865', 'training/9880']))

In [None]:
print(reuters.words(categories='barley'))

In [None]:
print(reuters.words(categories=['barley', 'corn']))

## Search for long words

- Write a function that returns a list of all words in an NLTK corpus that are strictly longer than the specified `length`.
- The function takes two arguments, `corpus` and `length`. `corpus` is mandatory, while `length` is optional. If the user doesn't specify the `length` parameter, the default value of 20 will be used.
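- The function returns a list of strings. A word that occurs more than once in the corpus appears more than once in the list. The tests check the length of the list and compare its contents as a set, so the order of the words does not matter.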
- For example, if we use the Reuters corpus, we get

```python
>>> long_words = get_long_words(reuters)
>>> print(long_words)
```
```
['discontinuedoperations', 'Warenhandelsgesellschaft', 'Gloeielampenfabrieken', 'Beteiligungsgesellschaft', '..........................................']
```

```python
>>> words_longer_than_18_characters = get_long_words(reuters, 18)
>>> print(words_longer_than_18_characters)
```
```
['shareholdersapproved', 'polytetrahydrofuran', 'internationalisation', 'extraordinarycredits', 'discontinuedoperations', 'chlorofluorocarbons', 'Warenhandelsgesellschaft', 'Internationalisation', 'Internationalisation', 'Houdstermaatschappij', 'Gloeilampenfabrieken', 'Gloeilampenfabrieken', 'Gloeielampenfabrieken', 'Genossenschaftliche', 'Beteiligungsgesellschaft', 'Beleggingscompagnie', '..........................................', '....................', '.........--........']
```
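"Strictly longer" means a word of exactly `length` characters is excluded. As a rough self-check, a helper along the following lines tests whether a candidate result is consistent with the specification. `satisfies_spec` is a hypothetical name, not part of the assignment or the autograder, and passing it does not show that the result is complete.

```python
def satisfies_spec(result, length=20):
    """Return True if `result` is a list of strings, each strictly longer than `length`."""
    return isinstance(result, list) and all(
        isinstance(word, str) and len(word) > length for word in result
    )

# 'discontinuedoperations' has 22 characters, 'Gloeielampenfabrieken' has 21.
ok_default = satisfies_spec(['discontinuedoperations', 'Gloeielampenfabrieken'])

# 'Gloeilampenfabrieken' has exactly 20 characters, so it fails the default
# threshold of 20 but passes a threshold of 18.
ok_boundary = satisfies_spec(['Gloeilampenfabrieken'])
ok_with_18 = satisfies_spec(['Gloeilampenfabrieken'], 18)

print(ok_default, ok_boundary, ok_with_18)
```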

In [None]:
def get_long_words(corpus, length=20):
    """
    Finds all words in 'corpus' that are strictly longer than 'length' characters.
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    length: An int. Default: 20
    
    Returns
    -------
    A list of strings.
    """
    
    # YOUR CODE HERE
    
    return long_words

In [None]:
long_words = get_long_words(reuters)
print(long_words)

In [None]:
long_words = get_long_words(reuters)
assert_is_instance(long_words, list)
assert_true(all(isinstance(w, str) for w in long_words))
assert_equal(len(long_words), 5)
assert_equal(
    set(long_words),
    set(
        ['discontinuedoperations',
        'Warenhandelsgesellschaft',
        'Gloeielampenfabrieken',
        'Beteiligungsgesellschaft',
        '..........................................']
    )
)

words_longer_than_18_characters = get_long_words(reuters, 18)
assert_equal(len(words_longer_than_18_characters), 19)
assert_equal(
    set(words_longer_than_18_characters),
    set(
        ['shareholdersapproved', 'polytetrahydrofuran', 'internationalisation',
         'extraordinarycredits', 'discontinuedoperations', 'chlorofluorocarbons',
         'Warenhandelsgesellschaft', 'Internationalisation', 'Internationalisation',
         'Houdstermaatschappij', 'Gloeilampenfabrieken', 'Gloeilampenfabrieken',
         'Gloeielampenfabrieken', 'Genossenschaftliche', 'Beteiligungsgesellschaft',
         'Beleggingscompagnie', '..........................................',
         '....................', '.........--........']
    )
)