# Workshop Week 5

# Demonstration: Sentiment analysis

In this part of the workshop we will walk through a system that uses the Naive Bayes classifiers of NLTK and scikit-learn to predict the labels of NLTK's corpus of movie reviews. This corpus is used in [NLTK's chapter 6](http://www.nltk.org/book/ch06.html#document-classification). The corpus contains a selection of movie reviews, each with a label that indicates whether the review is positive or negative. Classifying reviews as "positive" or "negative" is a task related to [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis).

### Read the movie review corpus

The following code reads the movie reviews corpus and shows some statistics on the labels given to the data.

In [1]:
import nltk
nltk.download("movie_reviews")
from nltk.corpus import movie_reviews
movie_reviews.categories()

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


['neg', 'pos']

In [2]:
print("Number of negative reviews:", len(movie_reviews.fileids('neg')))
print("Number of positive reviews:", len(movie_reviews.fileids('pos')))

Number of negative reviews: 1000
Number of positive reviews: 1000


The following code partitions the movie review corpus into training, devtest, and test sets.

In [3]:
import random
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.seed(1234)
random.shuffle(documents)
threshold1 = int(len(documents)*.6)
threshold2 = int(len(documents)*.8)
train = documents[:threshold1]
devtest = documents[threshold1:threshold2]
test = documents[threshold2:]
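
The same 60/20/20 partition can be wrapped in a small reusable helper. This is only a minimal sketch: the function name `three_way_split` and its parameters are hypothetical and not part of NLTK, and the data is a toy list of integers.

```python
import random

def three_way_split(items, first=0.6, second=0.8, seed=1234):
    """Shuffle a copy of items and cut it into train, dev-test and test lists."""
    shuffled = list(items)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # local generator with a fixed seed
    threshold1 = int(len(shuffled) * first)
    threshold2 = int(len(shuffled) * second)
    return (shuffled[:threshold1],
            shuffled[threshold1:threshold2],
            shuffled[threshold2:])

toy_train, toy_devtest, toy_test = three_way_split(range(10))
print(len(toy_train), len(toy_devtest), len(toy_test))   # 6 2 2
```

Fixing the seed makes the partition reproducible, so repeated runs evaluate on the same dev-test items.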

### Document features

The following code defines a feature set of the 2000 most frequent non-stop words of the training set. The value of each feature is 1 if the word is present in the document, and 0 otherwise. This is what is generally called [one-hot encoding](https://en.wikipedia.org/wiki/One-hot). To build the list we will ignore word casing.

In [5]:
import collections
nltk.download("stopwords")
from nltk.corpus import stopwords
stop = stopwords.words('english')
c = collections.Counter([w.lower() for (words,category) in train 
                                   for w in words if w.lower() not in stop])
top2000words = [w for (w,count) in c.most_common(2000)]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [6]:
def document_features(words):
    "Return the document features for an NLTK classifier"
    words_lower = [w.lower() for w in words]
    result = dict()
    for w in top2000words:
        result['has(%s)' % w] = (w in words_lower)
    return result
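
To see what such a feature dictionary looks like, here is a minimal sketch on a toy vocabulary. The names `presence_features`, `toy_vocabulary` and `toy_review` are made up for illustration. The sketch also stores the lowercased words in a set, an optional change that makes each membership test faster than a list lookup.

```python
def presence_features(words, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in words."""
    words_lower = {w.lower() for w in words}   # a set gives constant-time lookups
    return {'has(%s)' % w: (w in words_lower) for w in vocabulary}

toy_vocabulary = ['good', 'bad', 'plot']       # hypothetical stand-in for top2000words
toy_review = ['A', 'GOOD', 'plot', ',', 'mostly']
print(presence_features(toy_review, toy_vocabulary))
# {'has(good)': True, 'has(bad)': False, 'has(plot)': True}
```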

### NLTK Naive Bayes

The following code trains an NLTK Naive Bayes classifier using the training set, and reports the evaluation results on the training set and the dev-test set.

In [7]:
train_features = [(document_features(x),y) for (x,y) in train]
devtest_features = [(document_features(x),y) for (x,y) in devtest]
classifier = nltk.NaiveBayesClassifier.train(train_features)

In [8]:
nltk.classify.accuracy(classifier,devtest_features)

0.7775

In [9]:
nltk.classify.accuracy(classifier,train_features)

0.8816666666666667

The accuracy on the training set (about 0.88) is noticeably higher than on the dev-test set (about 0.78), which suggests that the classifier is overfitting the training data.

### Matrix features for sklearn

The following code defines a second feature extractor that uses one-hot encoding on the same list of 2000 words, and which is suitable for sklearn.

In [10]:
def vector_features(words):
    "Return a vector of features for sklearn"
    words_lower = [w.lower() for w in words]
    result = []
    for w in top2000words:
        if w in words_lower:
            result.append(1)
        else:
            result.append(0)
    return result

### sklearn Naive Bayes

This code generates the features for sklearn and trains and evaluates a multinomial Naive Bayes classifier.

In [11]:
train_vectors = [vector_features(x) for (x,y) in train]
devtest_vectors = [vector_features(x) for (x,y) in devtest]

In [12]:
from sklearn.naive_bayes import MultinomialNB
sklearn_classifier = MultinomialNB()
sklearn_classifier.fit(train_vectors, [y for (x,y) in train])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [13]:
from sklearn.metrics import accuracy_score

In [14]:
predictions = sklearn_classifier.predict(devtest_vectors)
true = [y for (x,y) in devtest]
accuracy_score(true,predictions)

0.83999999999999997

In [15]:
predictions = sklearn_classifier.predict(train_vectors)
true = [y for (x,y) in train]
accuracy_score(true,predictions)

0.92000000000000004

### Question

Both Naive Bayes implementations clearly overfit. What do you think you could do to reduce overfitting?

_Write your answer here_

### tfidf with Naive Bayes

The following code computes tf.idf of the training set and uses sklearn's multinomial Naive Bayes classifier.

Note that sklearn's `TfidfVectorizer` takes a list of strings as the input, but in our previous experiments we used the tokenised information, that is, the list of tokens provided by NLTK. We could instead compute tf.idf from token counts with `TfidfTransformer` (see the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)),
but since we haven't covered it in the lectures, let's use the raw strings as our data.

In [16]:
texts = [(movie_reviews.raw(fileid), category)
         for category in movie_reviews.categories()
         for fileid in movie_reviews.fileids(category)]
random.seed(1234)
random.shuffle(texts)
threshold1 = int(len(texts)*.6)
threshold2 = int(len(texts)*.8)
text_train = texts[:threshold1]
text_devtest = texts[threshold1:threshold2]
text_test = texts[threshold2:]

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
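# input='content' means that each item is a raw string (valid values: 'filename', 'file', 'content')
tfidf = TfidfVectorizer(input='content',stop_words='english',max_features=2000)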
text_tfidf_train = tfidf.fit_transform([x for (x,y) in text_train])
text_tfidf_devtest = tfidf.transform([x for (x,y) in text_devtest])

Note that we used `tfidf.fit_transform` on the training set, and `tfidf.transform` on the dev-test set. This is because we use the training set to learn the tf.idf parameters
(the 2000 most frequent words and their IDF), and then we apply those parameters unchanged when we compute the tf.idf of the dev-test set.
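
The difference between the two calls can be seen on a toy example. This is a minimal sketch with made-up texts. It uses the default `TfidfVectorizer` settings, which ignore one-letter tokens such as "a".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train_texts = ["a great film", "a dull film", "great acting"]   # made-up data
toy_devtest_texts = ["a great surprise"]                             # 'surprise' is unseen

toy_tfidf = TfidfVectorizer()
toy_train_matrix = toy_tfidf.fit_transform(toy_train_texts)   # learns vocabulary and IDF
toy_devtest_matrix = toy_tfidf.transform(toy_devtest_texts)   # reuses them unchanged

print(sorted(toy_tfidf.vocabulary_))      # words learnt from the training texts only
print(toy_train_matrix.shape, toy_devtest_matrix.shape)
```

Both matrices have the same number of columns, and the unseen word "surprise" is simply ignored because it is not in the vocabulary learnt from the training texts.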

In [18]:
sklearn_tfidfclassifier = MultinomialNB()
sklearn_tfidfclassifier.fit(text_tfidf_train,[y for (x,y) in text_train])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [19]:
accuracy_score([y for (x,y) in text_devtest], sklearn_tfidfclassifier.predict(text_tfidf_devtest))

0.80500000000000005

In [20]:
accuracy_score([y for (x,y) in text_train], sklearn_tfidfclassifier.predict(text_tfidf_train))

0.91416666666666668

### Question
The results with tf.idf vectors are similar to the results with one-hot encoding. In other applications tf.idf may improve the results, but that did not happen this time. Why do you think there wasn't much improvement?

_Write your answer here_

# Your Turn

## Naive Bayes on Question Classification

NLTK has a corpus of questions and their question types according to a particular classification scheme (e.g. `DESC` refers to a question expecting a descriptive answer, such as one starting with "How"; `HUM` refers to a question expecting an answer referring to a human). Here is an example of how to use the corpus:

In [21]:
nltk.download("qc")
from nltk.corpus import qc
train = qc.tuples("train.txt")
test = qc.tuples("test.txt")

[nltk_data] Downloading package qc to /root/nltk_data...
[nltk_data]   Unzipping corpora/qc.zip.


In [22]:
train[:3]

[('DESC:manner', 'How did serfdom develop in and then leave Russia ?'),
 ('ENTY:cremat', 'What films featured the character Popeye Doyle ?'),
 ('DESC:manner', "How can I find a list of celebrities ' real names ?")]

In [23]:
test[:3]

[('NUM:dist', 'How far is it from Denver to Aspen ?'),
 ('LOC:city', 'What county is Modesto , California in ?'),
 ('HUM:desc', 'Who was Galileo ?')]

### Exercise: All question types

Write Python code that lists all the possible question types of the training set (remember: _never look at the test set_).

['HUM:title',
 'ENTY:product',
 'NUM:count',
 'NUM:dist',
 'ENTY:food',
 'NUM:volsize',
 'NUM:date',
 'LOC:city',
 'NUM:perc',
 'NUM:period',
 'LOC:country',
 'ENTY:techmeth',
 'NUM:code',
 'DESC:manner',
 'NUM:other',
 'ENTY:event',
 'LOC:state',
 'ENTY:plant',
 'ENTY:color',
 'ENTY:word',
 'ENTY:cremat',
 'ABBR:exp',
 'LOC:mount',
 'NUM:speed',
 'ENTY:termeq',
 'DESC:reason',
 'LOC:other',
 'ENTY:letter',
 'ENTY:veh',
 'ENTY:symbol',
 'ENTY:instru',
 'DESC:def',
 'DESC:desc',
 'HUM:gr',
 'ENTY:body',
 'ENTY:dismed',
 'ENTY:sport',
 'ABBR:abb',
 'NUM:temp',
 'ENTY:other',
 'ENTY:religion',
 'ENTY:substance',
 'ENTY:animal',
 'NUM:weight',
 'ENTY:lang',
 'ENTY:currency',
 'HUM:ind',
 'NUM:ord',
 'NUM:money',
 'HUM:desc']

### Exercise: All general types

The question types have two parts. The first part describes a general type, and the second part defines a subtype. For example, the question type `DESC:manner` belongs to the general `DESC` type and within that type to the `manner` subtype. Let's focus on the general types only. Write Python code that lists all the possible general types (there are 6 of them).

['NUM', 'LOC', 'ENTY', 'DESC', 'HUM', 'ABBR']

### Exercise: Partition the data
The corpus comes with a train set and a test set, but for this exercise we want a partition into train, dev-test, and test. Combine all the data into one list and do a 3-way partition into train, dev-test, and test. Make sure that you shuffle the data.

### Exercise: Feature extractor

Write a feature extractor function that uses individual words as features ("one-hot encoding"). To obtain the list of words, use the 100 most frequent words in the training set (since you aren’t supposed to use the dev-test or test sets to extract features). Note that we do not use a list of stop words now since the questions are very short, and some words such as 'how' are useful for question classification but are listed as stop words.

### Exercise: NLTK Naive Bayes classifier

Train an NLTK Naive Bayes classifier with the features of the training set, and test it on the testing set. What accuracy do you obtain? Is the system overfitting to the training data?

### Exercise: sklearn Naive Bayes classifier

Convert the feature set to a document matrix suitable for sklearn, and train again using sklearn's Multinomial Naive Bayes classifier. Are the results different?

### Exercise: Majority baseline

What is the accuracy if we use a majority baseline?
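
As a reminder, a majority baseline always predicts the most frequent label of the training set. The following is a minimal sketch on made-up labels. The function name `majority_baseline_accuracy` is hypothetical, and you will need to apply the idea to your own partition.

```python
import collections

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = collections.Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, hits / len(eval_labels)

toy_train_labels = ['ENTY', 'ENTY', 'HUM', 'ENTY', 'LOC']   # made-up labels
toy_eval_labels = ['ENTY', 'HUM', 'ENTY', 'NUM']
print(majority_baseline_accuracy(toy_train_labels, toy_eval_labels))
# ('ENTY', 0.5)
```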

### Exercise: Support Vector Machine
Now use sklearn's Support Vector Machines and compare the results.

### Advanced Exercise: Classification errors

**This exercise will be useful for some of the advanced tasks of assignment 2.**

The following code produces the confusion matrix of the NLTK classifier (read section 3.4 of [Chapter 6 of the NLTK book](http://www.nltk.org/book/ch06.html)). The confusion matrix indicates all errors of classification between different labels. With this information, identify the most common misclassification types. In groups, discuss how you could address the misclassification errors.

In [48]:
# Assumes that q_devtest, qc_extractor and qc_classifier were defined in the previous exercises
qc_devtest_features = [(qc_extractor(q), c) for (c,q) in q_devtest]
devtest_predictions = [qc_classifier.classify(f) for f, l in qc_devtest_features]
devtest_labels = [c for (c,q) in q_devtest]
nltk_cm = nltk.ConfusionMatrix(devtest_labels, devtest_predictions)
print(nltk_cm.pretty_format(sort_by_count=True, show_percents=True))

     |             E      D                    A |
     |      H      N      E      N      L      B |
     |      U      T      S      U      O      B |
     |      M      Y      C      M      C      R |
-----+-------------------------------------------+
 HUM | <15.5%>  5.3%   1.3%   0.3%   0.6%   0.2% |
ENTY |   3.7% <14.0%>  2.8%   0.2%   1.2%      . |
DESC |   0.3%   4.1% <15.1%>  1.3%   0.6%      . |
 NUM |   0.5%   3.9%   1.0% <10.2%>  0.4%      . |
 LOC |   0.4%   2.9%   1.1%      . <10.8%>     . |
ABBR |      .   0.5%   0.8%      .      .  <0.9%>|
-----+-------------------------------------------+
(row = reference; col = test)
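
Besides reading the matrix, the most common misclassification types can be listed by counting the (reference, predicted) pairs that disagree. This is a minimal sketch on made-up labels. The function name `most_common_errors` is hypothetical, and with real data you would pass `devtest_labels` and `devtest_predictions` instead of the toy lists.

```python
import collections

def most_common_errors(reference, predicted, n=3):
    """Return the n most frequent (reference, predicted) pairs that disagree."""
    errors = collections.Counter(
        (ref, pred) for ref, pred in zip(reference, predicted) if ref != pred)
    return errors.most_common(n)

toy_reference = ['HUM', 'HUM', 'ENTY', 'DESC', 'NUM', 'HUM']    # made-up gold labels
toy_predicted = ['ENTY', 'ENTY', 'HUM', 'ENTY', 'NUM', 'HUM']   # made-up predictions
print(most_common_errors(toy_reference, toy_predicted))
```

In the toy data the most frequent error is a `HUM` question predicted as `ENTY`, which happens twice.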

