A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 2. Text Classification.

In this problem, we use the NLTK and scikit-learn libraries to perform text classification tasks on the [NLTK Reuters corpus](http://www.nltk.org/book/ch02.html#reuters-corpus).

In [None]:
import numpy as np
import nltk
from nltk.corpus import reuters
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

from nose.tools import assert_equal, assert_is_instance
from numpy.testing import assert_almost_equal, assert_array_equal

The corpus contains 10,788 documents (`fileids`) which have been classified into 90 topics (categories). We will use those 10,788 documents to train and test a machine learning model that predicts which topic each document belongs to. So, we will first find the category for each element of `fileids`. Note that the categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. If a document has more than one category, we will sort its categories alphabetically and use only the first one. Here's a function that extracts categories from the `fileids`.

In [None]:
def get_categories_from_fileids(corpus, fileids):
    """
    Finds categories for each element of 'fileids'.
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    fileids: A list of strings.
    
    Returns
    -------
    A list of strings.
    """
    
    result = [sorted(corpus.categories(fileids=f))[0] for f in fileids]
    
    return result

In [None]:
categories = get_categories_from_fileids(reuters, reuters.fileids())
print(categories[:5], '...', categories[-5:])
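
To make the "first category" rule concrete, here is a minimal sketch that does not touch the real corpus. The dictionary `toy_categories` and the helper `first_sorted_category` are hypothetical stand-ins for `corpus.categories(fileids=f)`; the file IDs and labels are made up for illustration.

```python
# Hypothetical stand-in for corpus.categories(fileids=f): a plain dict
# mapping made-up file IDs to their (possibly multiple) topic labels.
toy_categories = {
    "doc/1": ["wheat", "grain", "corn"],
    "doc/2": ["trade"],
    "doc/3": ["ship", "crude"],
}


def first_sorted_category(labels):
    """Return the alphabetically first label, mirroring sorted(...)[0] above."""
    return sorted(labels)[0]


toy_result = [first_sorted_category(toy_categories[f]) for f in toy_categories]
print(toy_result)  # ['corn', 'trade', 'crude']
```

A document tagged with several topics therefore always ends up with a single label, chosen by alphabetical order rather than by importance.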

The Reuters data set has already been grouped into a training set and a test set. Here's a function that mimics `scikit-learn`'s `train_test_split()` function and organizes the text data and categories into training and test sets.

In [None]:
def train_test_split(corpus):
    """
    Creates a training set and a test set from the NLTK Reuters corpus.
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    
    Returns
    -------
    A 4-tuple (X_train, X_test, y_train, y_test)
    """
    
    train_fileids = [fileid for fileid in corpus.fileids() if fileid.startswith('train')]
    X_train = [corpus.raw(fileids=fileid) for fileid in train_fileids]
    y_train = get_categories_from_fileids(corpus, train_fileids)
    
    test_fileids = [fileid for fileid in corpus.fileids() if fileid.startswith('test')]
    X_test = [corpus.raw(fileids=fileid) for fileid in test_fileids]
    y_test = get_categories_from_fileids(corpus, test_fileids)
    
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(reuters)
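
The split above relies only on the prefix of each file ID. As a small sketch of that idea, the list `toy_fileids` below uses made-up IDs that follow the `training/...` and `test/...` naming pattern; it is an illustration, not output from the corpus.

```python
# Made-up file IDs following the 'training/...' and 'test/...' naming
# pattern that the prefix checks above rely on.
toy_fileids = ["test/1", "test/2", "training/3", "training/4", "training/5"]

# Same prefix tests as in train_test_split() above.
toy_train = [f for f in toy_fileids if f.startswith("train")]
toy_test = [f for f in toy_fileids if f.startswith("test")]

print(len(toy_train), len(toy_test))  # 3 2
```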

We have 7,769 documents for training. Note that each data point in `X_train` (and `X_test`) is a raw text document (a string). To train a machine learning model on text data, we will use `CountVectorizer` to convert text documents to a matrix of token counts.

In [None]:
print("There are {} training documents.".format(len(X_train)))
print("An example of training data:\n\n{}".format(X_train[0]))

And each training label is the category that the corresponding document belongs to.

In [None]:
print(y_train[0])
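
As mentioned above, `CountVectorizer` turns text documents into a matrix of token counts. The following is a minimal sketch on three made-up sentences (`toy_docs` is not part of the Reuters data), showing the learned vocabulary and the resulting document-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents, used only for illustration.
toy_docs = [
    "oil prices rise",
    "grain prices fall",
    "oil and grain exports rise",
]

toy_cv = CountVectorizer()
toy_dtm = toy_cv.fit_transform(toy_docs)  # sparse matrix: documents x tokens

# vocabulary_ maps each token to its column index; list tokens in column order.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_dtm.toarray())
```

Each row corresponds to one document and each column to one token, so the entry in row `i`, column `j` is the number of times token `j` occurs in document `i`.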

## Classify documents using SVC and stop words.

- Build a **pipeline** by using [Pipeline](http://scikit-learn.org/0.17/modules/generated/sklearn.pipeline.Pipeline.html),  [CountVectorizer](http://scikit-learn.org/0.17/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), and [LinearSVC](http://scikit-learn.org/0.17/modules/generated/sklearn.svm.LinearSVC.html). Name the first step `cv` (the `CountVectorizer` step) and the second step `svc` (the `LinearSVC` step).

- Use English stop words. 

- Use default values for all parameters in `CountVectorizer()` other than the stop words setting. Use default values for all parameters in `LinearSVC()` except for `random_state`.

- Without the `random_state` parameter, the `LinearSVC` algorithm has a random element. If you provide an integer to the `random_state` parameter, the algorithm becomes deterministic and reproducible. So, don't forget to set the `random_state` parameter in `LinearSVC()`.

- You do not need to use all four of the data arguments (`X_train`, `X_test`, `y_train`, and `y_test`). You should decide which arguments are needed and which are not.

- The function must return a tuple of a `Pipeline` instance and a numpy array of predicted values. So your function will look something like

```python
def classify_document(X_train, X_test, y_train, y_test, random_state):

    ### YOUR CODE HERE
    clf = pipeline.fit(...)
    predicted = clf.predict(...)
    ### YOUR CODE HERE

    return clf, predicted
```

- Refer to the [Introduction to Text Classification notebook](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week9/notebooks/intro2tc.ipynb).
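
Two of the points above can be checked on toy data before you write the real function. The sketch below is only an illustration under stated assumptions: `toy_docs` is made up, and `X_toy`, `y_toy` come from `make_classification` rather than from the Reuters corpus. It shows that English stop words are dropped from the vocabulary, and that fixing `random_state` gives the same fitted coefficients on repeated runs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# 1. Stop words: common English words are dropped from the vocabulary.
toy_docs = ["the price of oil", "the grain and the corn"]
plain_vocab = set(CountVectorizer().fit(toy_docs).vocabulary_)
filtered_vocab = set(
    CountVectorizer(stop_words="english").fit(toy_docs).vocabulary_
)
removed = plain_vocab - filtered_vocab
print(sorted(removed))  # tokens removed as stop words

# 2. random_state: two fits with the same seed give the same coefficients.
# X_toy and y_toy are synthetic numeric data, unrelated to the corpus.
X_toy, y_toy = make_classification(n_samples=60, n_features=8, random_state=0)
coef_a = LinearSVC(random_state=0).fit(X_toy, y_toy).coef_
coef_b = LinearSVC(random_state=0).fit(X_toy, y_toy).coef_
same = np.allclose(coef_a, coef_b)
print(same)  # True
```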

In [None]:
def classify_document(X_train, X_test, y_train, y_test, random_state):
    """
    Creates a document term matrix and uses an SVM classifier to classify documents.
    Uses English stop words.
    
    Parameters
    ----------
    X_train: A list of strings.
    X_test: A list of strings.
    y_train: A list of strings.
    y_test: A list of strings.
    random_state: An int or a np.random.RandomState instance.
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A Pipeline instance.
    y_pred: A numpy array.
    """

    # YOUR CODE HERE
    
    return clf, predicted

In [None]:
clf, y_pred = classify_document(X_train, X_test, y_train, y_test, random_state=0)
score = accuracy_score(y_test, y_pred)
print("SVC prediction accuracy = {0:3.1f}%".format(100.0 * score))

In [None]:
assert_is_instance(clf, Pipeline)
assert_is_instance(y_pred, np.ndarray)
cv = clf.named_steps['cv']
assert_is_instance(cv, CountVectorizer)
assert_is_instance(clf.named_steps['svc'], LinearSVC)
assert_equal(cv.stop_words, 'english')
assert_equal(len(y_pred), len(y_test))
assert_array_equal(y_pred[:5], ['trade', 'grain', 'crude', 'corn', 'palm-oil'])
assert_array_equal(y_pred[-5:], ['acq', 'dlr', 'earn', 'ipi', 'gold'])
assert_almost_equal(score, 0.87777409738323953)