# Exercise Notebook Instructions

### 1. Important: Only modify the cells which instruct you to modify them - leave "do not modify" cells alone.  

The code which tests your responses assumes you have run the startup/read-only code exactly.

### 2. Work through the notebook in order.

Some steps depend on earlier ones, so you'll want to move through the notebook in order.

### 3. It is okay to use libraries.

You may find some questions are fairly straightforward to answer using built-in library functions.  That's totally okay - part of the point of these exercises is to familiarize you with the commonly used functions.

### 4. Seek help if stuck

If you get stuck, don't worry!  You can either review the videos/notebooks from this week, ask in the course forums, or look to the solutions for the correct answer.  BUT, be careful about looking to the solutions too quickly.  Struggling to get the right answer is an important part of the learning process.

# Exercise Notebook on natural language processing

`nltk` also provides access to a dataset of tweets from Twitter. It includes a set of tweets already classified as negative or positive.

In this exercise notebook we would like to apply to this dataset the same sentiment analysis classification that was performed on the movie reviews corpus.

## Exercise 1: Download and inspect the twitter_samples dataset

First we want to download the dataset and inspect it:

In [None]:
import nltk

In [None]:
# DO NOT MODIFY

nltk.download("twitter_samples")
from nltk.corpus import twitter_samples

First let's check the common `fileids` method of `nltk` corpora:

In [None]:
twitter_samples.fileids()

The `twitter_samples` object has a `tokenized()` method that returns all tweets from a fileid, each one already tokenized. Read its documentation and use it to find the number of positive and negative tweets.
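
As a rough illustration of the data shape only, the sketch below uses a hypothetical toy list (`toy_tweets`, not part of the dataset) laid out as a list of tweets, where each tweet is a list of token strings. It shows why counting tweets and counting words give different numbers.

```python
# Hypothetical toy data: a list of tweets, each tweet a list of token strings.
toy_tweets = [
    ["great", "day", ":)"],
    ["so", "tired", ":("],
]

n_tweets = len(toy_tweets)                          # length of the outer list
n_tokens = sum(len(tweet) for tweet in toy_tweets)  # total tokens across all tweets

print(n_tweets, n_tokens)  # 2 6
```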

In [None]:
number_of_positive_tweets = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
number_of_negative_tweets = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# DO NOT MODIFY

assert number_of_positive_tweets == 5000, "Make sure you are counting the number of tweets, not the number of words"

## Exercise 2: Build a bag-of-words model function

As in the lecture, we can build a bag-of-words model to train our machine learning algorithm.

In [None]:
import string

As a first step, we define a list of words that we want to filter out of our dataset:

In [None]:
nltk.download("stopwords")  # the stopword list must be available locally before first use
useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)

In [None]:
def build_bag_of_words_features_filtered(words):
    """Build a bag of words model"""
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert len(build_bag_of_words_features_filtered(["what", "the", "?", ","]))==0, "Make sure we are filtering out both stopwords and punctuation"

## Exercise 3: Create a list of all words

Before performing sentiment analysis, let's first inspect the dataset a little bit more by creating a list of all words.

In [None]:
words = []
for dataset in ["positive_tweets.json", "negative_tweets.json"]:
    for tweet in twitter_samples.tokenized(dataset):
        words.extend(tweet)

Study the code above and note that it is a nested loop: for each dataset, we loop through each of its tweets. Also notice that we are using `extend`. How does it differ from `append`? Try it on a simple case, read the documentation, or search for it online!
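
One possible simple case to try, using small hypothetical lists rather than the tweet data:

```python
extended = ["a"]
extended.extend(["b", "c"])   # adds each element of the argument to the list

appended = ["a"]
appended.append(["b", "c"])   # adds the argument itself as one single element

print(extended)  # ['a', 'b', 'c']
print(appended)  # ['a', ['b', 'c']]
```

With `append`, `words` would end up as a list of tweets (a list of lists) instead of one flat list of words.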

Now let's filter out punctuation and stopwords:

In [None]:
filtered_words = None
# YOUR CODE HERE
raise NotImplementedError()

Filtering out the `useless_words` defined in the previous section should reduce the length of the word list by more than a factor of 2:

In [None]:
# DO NOT MODIFY 

assert len(filtered_words) == 85637, "Make sure that the filtering is applied correctly"

## Exercise 4: Find the most common words


The `collections` module of the standard library contains a `Counter` class that is handy for counting the frequencies of the words in our list:

In [None]:
# DO NOT MODIFY 

from collections import Counter

counter = Counter(filtered_words)
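
To get a feel for how a `Counter` behaves, here is a minimal sketch on a hypothetical toy list (`toy_words` is made up for illustration and is not the tweet data):

```python
from collections import Counter

toy_words = ["good", "bad", "good", ":)", "good", ":)"]  # hypothetical toy tokens
toy_counter = Counter(toy_words)

print(toy_counter["good"])     # 3
print(toy_counter[":)"])       # 2
print(toy_counter["missing"])  # 0: unseen keys count as zero instead of raising KeyError
```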

`Counter` also has a `most_common()` method to access the words with the highest counts:

In [None]:
most_common_words = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert most_common_words[0][0] == ":(", "The most common word should be :("
assert len(most_common_words) == 10, "Make sure you are only getting the first 10"

## Exercise 5: Build the features for machine learning

Using our `build_bag_of_words_features_filtered` function, we can build the negative and positive features separately.

The format of the positive features should be:

    [
        ( { "here":1, "some":1, "words":1 }, "pos" ),
        ( { "another":1, "tweet":1}, "pos" )
    ]
    
It is a list of tuples. The first element of each tuple is a dictionary that maps each word appearing in the tweet to 1, and the second element is the string "pos" or "neg".
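
If you want to sanity-check the structure before running the asserts, a small checker along these lines may help. The helper `looks_like_feature_list` is hypothetical, written only for this illustration, and it tests the shape of the data rather than its content.

```python
def looks_like_feature_list(features):
    """Return True if `features` is a list of (dict, label) tuples."""
    if not isinstance(features, list):
        return False
    for item in features:
        # each entry must be a 2-tuple: (bag of words, label)
        if not (isinstance(item, tuple) and len(item) == 2):
            return False
        bag, label = item
        if not isinstance(bag, dict) or label not in ("pos", "neg"):
            return False
    return True


toy_features = [
    ({"here": 1, "some": 1, "words": 1}, "pos"),
    ({"another": 1, "tweet": 1}, "pos"),
]
print(looks_like_feature_list(toy_features))          # True
print(looks_like_feature_list([{"here": 1}, "pos"]))  # False: entries are not tuples
```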

In [None]:
negative_features = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
positive_features = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
positive_features[0][0]

In [None]:
assert positive_features[0][1] == "pos", "Make sure the feature is a list of tuples whose second element is pos or neg"
assert positive_features[0][0]["engaged"] == 1, "Make sure that the first element of each tuple is a dictionary of words"

## Exercise 6: Train a NaiveBayesClassifier

In [None]:
from nltk.classify import NaiveBayesClassifier

Let's use 80% of the data for training and the rest for testing:

In [None]:
split = int(len(positive_features) * 0.8)

In [None]:
split

In [None]:
classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])

Let's check the accuracy on the training and test sets. Make sure to convert both values to percentages.
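
As a reminder of what the number means, accuracy is the share of examples whose predicted label matches the true label. The sketch below computes it as a percentage on hypothetical toy labels; `percent_accuracy` is an illustrative helper, not an `nltk` function.

```python
def percent_accuracy(predicted, expected):
    """Share of predictions that match the expected labels, in percent."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100 * correct / len(expected)


toy_predicted = ["pos", "neg", "pos", "pos"]  # hypothetical classifier output
toy_expected = ["pos", "neg", "neg", "pos"]   # hypothetical true labels
print(percent_accuracy(toy_predicted, toy_expected))  # 75.0
```

If the value you obtain is a fraction between 0 and 1, multiply it by 100 to express it as a percentage.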

In [None]:
training_accuracy = None
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
test_accuracy = None
# YOUR CODE HERE
raise NotImplementedError()

The test accuracy should turn out to be very high compared to the movie review dataset. Check the most informative features below to understand why:

In [None]:
classifier.show_most_informative_features()