A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 1. Twitter.

In this problem, we will use the Twitter API to extract a set of tweets, and perform sentiment analysis on Twitter data to classify tweets as positive or negative.

In [None]:
import numpy as np
import nltk
import tweepy as tw

from sklearn.utils import check_random_state
from sklearn.cross_validation import train_test_split
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

from nose.tools import assert_equal, assert_is_instance, assert_true, assert_almost_equal
from numpy.testing import assert_array_equal

## Create a Twitter Application

- Follow the [Introduction to Social Media, Twitter notebook](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week10/notebooks/intro2smt.ipynb) to create a Twitter Application.

- Save the credentials into a file at `/home/data_scientist/lessons/accy571_week10/twitter.cred`. Assuming you have already used the *Lessons* tab to download Week 10 notebooks, you can use the file manager on the Jupyter server dashboard to navigate to *lessons* $\rightarrow$ *accy571_week10*, and click on `twitter.cred`. Clicking on a text file will open the file in a text editor. Alternatively, you can use a text editor such as `vim` in the terminal.

- `twitter.cred` must have the following four credentials in order on separate lines:
```
Access Token
Access Token Secret
Consumer Key
Consumer Secret
```

- Once you have stored your credentials, run the following code cells (you don't have to write any code in this section) to check if you are able to use the Twitter API.

In [None]:
def connect_twitter_api(cred_file):
    
    # Order: Access Token, Access Token Secret, Consumer Key, Consumer Secret
    with open(cred_file) as fin:
        tokens = [line.rstrip('\n') for line in fin if not line.startswith('#')]

    auth = tw.OAuthHandler(tokens[2], tokens[3])
    auth.set_access_token(tokens[0], tokens[1])

    return tw.API(auth)
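
To see what the parsing step in `connect_twitter_api()` does without touching real credentials, here is a minimal sketch that applies the same list comprehension to a temporary file. The placeholder values and the `read_tokens` helper name are made up for illustration; they are not part of the assignment.

```python
import os
import tempfile


def read_tokens(cred_file):
    """Hypothetical helper: the same filtering logic as connect_twitter_api()."""
    with open(cred_file) as fin:
        # Skip comment lines that start with '#'; strip the trailing newline.
        return [line.rstrip('\n') for line in fin if not line.startswith('#')]


# Placeholder values only; a real file holds your own four credentials.
content = (
    "# my twitter credentials\n"
    "ACCESS_TOKEN\n"
    "ACCESS_TOKEN_SECRET\n"
    "CONSUMER_KEY\n"
    "CONSUMER_SECRET\n"
)

with tempfile.NamedTemporaryFile('w', suffix='.cred', delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

tokens = read_tokens(path)
os.remove(path)

print(tokens[2], tokens[3])  # consumer key and secret, passed to OAuthHandler
print(tokens[0], tokens[1])  # access token and secret, passed to set_access_token
```

Note that only lines starting with `#` are skipped. A stray blank line would be kept as an empty string and shift the indices, so the four credentials should sit on consecutive lines.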

In [None]:
# Do NOT change file path or name of twitter.cred
api = connect_twitter_api('/home/data_scientist/lessons/accy571_week10/twitter.cred')
assert_equal(api.get_user('katyperry').screen_name, 'katyperry')
assert_equal(api.get_user('justinbieber').created_at.strftime('%Y %m %d %H %M %S'), '2009 03 28 16 41 22')
assert_equal(api.get_user('BarackObama').name, 'Barack Obama')

We will first train a model on the NLTK Twitter corpus, and then use it to classify a set of tweets fetched from the Twitter API.

In [None]:
from nltk.corpus import twitter_samples as tws

`get_pos_neg_tweets()` in the following code cell creates a training set from the NLTK Twitter corpus. Positive tweets are in `positive_tweets.json`, while negative tweets are in `negative_tweets.json`. The `data` and `targets` arrays are one-dimensional numpy arrays, where the first half are the positive tweets and the second half are the negative tweets. Every positive tweet is assigned a numerical label of 1 in `targets`, and every negative tweet a label of 0.

In [None]:
def get_pos_neg_tweets(corpus):
    """
    Creates a training set from twitter_samples corpus.
    
    Parameters
    ----------
    corpus: The nltk.corpus.twitter_samples corpus.
    
    Returns
    -------
    A tuple of (data, targets)
    """
    
    # Read from the corpus passed in as an argument rather than the global name.
    pos_tweets = np.array(corpus.strings('positive_tweets.json'))
    neg_tweets = np.array(corpus.strings('negative_tweets.json'))

    pos_labels = np.ones(pos_tweets.shape[0])
    neg_labels = np.zeros(neg_tweets.shape[0])

    targets = np.concatenate((pos_labels, neg_labels), axis=0)
    data = np.concatenate((pos_tweets, neg_tweets), axis=0)
    
    return data, targets

In [None]:
data, targets = get_pos_neg_tweets(tws)
print(data)

In [None]:
print(targets)
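
The labeling scheme can be tried on a tiny hand-made example that needs no corpus download. This is only a sketch: the four strings below are invented stand-ins for the corpus tweets.

```python
import numpy as np

# Made-up stand-ins for the strings read from the two JSON files.
pos_tweets = np.array(['great day :)', 'love this :)'])
neg_tweets = np.array(['bad day :(', 'hate this :('])

# One label per tweet: 1 for positive, 0 for negative.
pos_labels = np.ones(pos_tweets.shape[0])
neg_labels = np.zeros(neg_tweets.shape[0])

# Positives first, negatives second, in both arrays.
toy_data = np.concatenate((pos_tweets, neg_tweets), axis=0)
toy_targets = np.concatenate((pos_labels, neg_labels), axis=0)

print(toy_data.shape, toy_targets.shape)
print(toy_targets)
```

Because both arrays are concatenated in the same order, `toy_targets[i]` is always the label of `toy_data[i]`.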

We train on 80% of the data, and test the performance on the remaining 20%.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data, targets, test_size=0.2, random_state=0
)
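
As a rough illustration of what the split does, the sketch below runs `train_test_split` twice on ten invented strings. The fallback import is an assumption about your environment: newer scikit-learn releases provide the function in `sklearn.model_selection`, while this notebook imports it from `sklearn.cross_validation`.

```python
import numpy as np

try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import train_test_split
except ImportError:
    # older releases, as imported at the top of this notebook
    from sklearn.cross_validation import train_test_split

# Ten made-up samples with labels, standing in for data and targets.
toy_data = np.array(['tweet {}'.format(i) for i in range(10)])
toy_targets = np.array([1] * 5 + [0] * 5)

# Same arguments twice: the fixed random_state should give the same split.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    toy_data, toy_targets, test_size=0.2, random_state=0
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    toy_data, toy_targets, test_size=0.2, random_state=0
)

print(len(Xa_train), len(Xa_test))
print(np.array_equal(Xa_test, Xb_test))
```

With ten samples and `test_size=0.2`, eight go to training and two to testing, and fixing `random_state` makes the shuffle repeatable.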

## Train a LinearSVC model.

- Build a pipeline by using [Pipeline](http://scikit-learn.org/0.17/modules/generated/sklearn.pipeline.Pipeline.html), [TfidfVectorizer](http://scikit-learn.org/0.17/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), and [LinearSVC](http://scikit-learn.org/0.17/modules/generated/sklearn.svm.LinearSVC.html). Name the first step `tf` (the TfidfVectorizer step) and the second step `svc` (the LinearSVC step).

- Use English stop words.
- Use unigrams, bigrams, and trigrams.

- Use default values for all other parameters in `TfidfVectorizer()`. Use default values for all parameters in `LinearSVC()` except for `random_state`.

- Without the `random_state` parameter, the `LinearSVC` algorithm has a random element. If you provide an integer (or a seeded `np.random.RandomState` instance, as in this assignment) to the `random_state` parameter, the algorithm becomes deterministic and reproducible. So, don't forget to set the `random_state` parameter in `LinearSVC()`.

- You do not necessarily need all four of the data arguments (`X_train`, `X_test`, `y_train`, and `y_test`). Decide which arguments are needed and which are not.

- The function must return a tuple of a `Pipeline` instance and a numpy array of predicted values. So your function will look something like

```python
def train(X_train, X_test, y_train, y_test, random_state):

    ### YOUR CODE HERE
    clf = Pipeline(...)
    ### YOUR CODE HERE
    y_pred = clf.predict(...)
    ### YOUR CODE HERE

    return clf, y_pred
```
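
If the mechanics of named pipeline steps and n-gram ranges are unfamiliar, the sketch below shows them on a four-document toy corpus. It deliberately uses different components (`CountVectorizer` and `MultinomialNB`), different step names (`cv`, `nb`), and a different n-gram range than the assignment asks for, so it illustrates the pattern without being the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus: label 1 for positive, 0 for negative.
docs = ['good movie', 'great film', 'bad movie', 'awful film']
labels = [1, 1, 0, 0]

# Each step is a (name, estimator) pair; the names are arbitrary here.
toy_clf = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ('nb', MultinomialNB()),
])
toy_clf.fit(docs, labels)

# A fitted step can be looked up by its name.
vocab = sorted(toy_clf.named_steps['cv'].vocabulary_)
print(vocab)

# predict() vectorizes the raw strings and then classifies them.
toy_pred = toy_clf.predict(['good film', 'bad film'])
print(toy_pred)
```

The vocabulary contains both single words such as `good` and word pairs such as `good movie`, which is what an n-gram range of `(1, 2)` means.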

In [None]:
def train(X_train, X_test, y_train, y_test, random_state):
    """
    Creates a document term matrix and uses LinearSVC classifier to make document classifications.
    
    Parameters
    ----------
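    X_train: A numpy array of strings. The training tweets.
    X_test: A numpy array of strings. The test tweets.
    y_train: A numpy array of numerical labels (1 for positive, 0 for negative).
    y_test: A numpy array of numerical labels (1 for positive, 0 for negative).
    random_state: A np.random.RandomState instance.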
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A Pipeline instance.
    y_pred: A numpy array.
    """
    
    # YOUR CODE HERE
    
    return clf, y_pred

In [None]:
clf, y_pred = train(X_train, X_test, y_train, y_test, random_state=check_random_state(0))
score = accuracy_score(y_test, y_pred)
print("SVC prediction accuracy = {0:5.1f}%".format(100.0 * score))
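
For intuition about the accuracy figure, here is a small sketch on invented labels. It compares `accuracy_score` with a manual count of matching entries; the arrays are made up and unrelated to the tweet data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: four of the five predictions match the truth.
y_true_toy = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0])

toy_score = accuracy_score(y_true_toy, y_pred_toy)

# Accuracy is the fraction of positions where prediction equals truth.
manual_score = (y_true_toy == y_pred_toy).mean()

print("accuracy = {0:5.1f}%".format(100.0 * toy_score))
print(toy_score == manual_score)
```

Accuracy only counts matches, so swapping the two arguments does not change the value.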

In [None]:
assert_is_instance(clf, Pipeline)
assert_is_instance(y_pred, np.ndarray)
tf = clf.named_steps['tf']
assert_is_instance(tf, TfidfVectorizer)
assert_is_instance(clf.named_steps['svc'], LinearSVC)
assert_equal(tf.stop_words, 'english')
assert_equal(tf.ngram_range, (1, 3))
assert_equal(len(y_pred), len(y_test))
assert_array_equal(y_pred[:10], [0, 1, 1, 0, 1, 0, 0, 0, 1, 1])
assert_array_equal(y_pred[-10:], [0, 0, 1, 1, 1, 0, 0, 1, 1, 0])
assert_almost_equal(score, 0.76400000)

To apply our trained sentiment analysis pipeline to new Twitter data, let's use Tweepy's [user_timeline()](http://docs.tweepy.org/en/latest/api.html#API.user_timeline) to extract 20 tweets from each of two users. Note that we specify the `max_id` parameter for reproducibility.

(You don't have to write any code in the following code cells.)

In [None]:
def get_timeline(user, max_id):
    """
    Fetches 20 tweets from 'user'.
    
    Parameters
    ----------
    user: A string. The ID or screen name of a Twitter user.
    max_id: An int. Returns only statuses with an ID less than
            (i.e., older than) or equal to the specified ID.
    
    Returns
    -------
    A tweepy.models.ResultSet of Status objects.
    """
    
    timeline = api.user_timeline(id=user, count=20, max_id=max_id)
    
    return timeline

In [None]:
timeline1 = get_timeline('HillaryClinton', max_id=790748347615371264)

In [None]:
timeline2 = get_timeline('realDonaldTrump', max_id=790730455129714688)
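
The role of `max_id` can be pictured with plain integers. The sketch below is a hypothetical stand-in for the filtering described in the docstring above, not a Tweepy function, and the IDs are invented.

```python
def take_up_to_max_id(status_ids, max_id, count=20):
    """Hypothetical stand-in for the max_id/count filtering; not part of Tweepy."""
    # Keep IDs at or below max_id, newest (largest) first, and cap at `count`.
    eligible = sorted((i for i in status_ids if i <= max_id), reverse=True)
    return eligible[:count]


# Invented status IDs; larger IDs stand for newer tweets.
fake_ids = [101, 105, 110, 120, 130, 140]

window = take_up_to_max_id(fake_ids, max_id=120, count=3)
print(window)
```

Tweets posted after the pinned ID (130 and 140 here) never enter the window, which is why fixing `max_id` keeps the fetched set stable as a user keeps tweeting.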

Finally, we use the Linear SVC model to classify each tweet in the timelines as a positive tweet or a negative tweet.

In [None]:
def predict(clf, timeline):
    """
    Uses a classifier ("clf") to classify each tweet in
    "timeline" as a positive tweet or a negative tweet.
    
    Parameters
    ----------
    clf: A Pipeline instance.
    timeline: A tweepy.models.ResultSet instance.
    
    Returns
    -------
    A numpy array.
    """
    
    texts = np.array([t.text for t in timeline])
    y_pred = clf.predict(texts)
    
    return y_pred

In [None]:
pred1 = predict(clf, timeline1)
print('{} has {} positive tweets and {} negative tweets.'.format(
    'Hillary Clinton', (pred1 == 1).sum(), (pred1 == 0).sum()))

In [None]:
pred2 = predict(clf, timeline2)
print('{} has {} positive tweets and {} negative tweets.'.format(
    'Donald Trump', (pred2 == 1).sum(), (pred2 == 0).sum()))
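
The counting idiom used in the two cells above relies on boolean arrays: comparing an array with a value yields `True`/`False` entries, and `sum()` counts the `True` ones. A minimal sketch with invented predictions:

```python
import numpy as np

# Invented predictions: 1 for positive, 0 for negative.
toy_pred = np.array([1., 0., 1., 1., 0.])

n_pos = (toy_pred == 1).sum()  # number of entries equal to 1
n_neg = (toy_pred == 0).sum()  # number of entries equal to 0

print('{} positive tweets and {} negative tweets.'.format(n_pos, n_neg))
```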