# Assignment 2 - Sentiment Analysis of Tweets (20 marks) - Update 18 April 2017

In this assignment you will use statistical classifiers for the task of predicting the polarity (positive, negative) of opinions expressed in tweets. This is a type of **sentiment analysis** which is becoming increasingly useful given the strong influence of the opinions posted in social media nowadays.

To learn more about sentiment analysis and some of its techniques you can read chapter 7 of this book:

* [Maynard et al. Natural Language Processing for the Semantic Web. Morgan & Claypool, 2016.](http://www.morganclaypool.com/doi/10.2200/S00741ED1V01Y201611WBE015)

For this assignment we will use the Twitter Sentiment Analysis Training Corpus described in [this blogpost](http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/), which contains over 1.5 million tweets annotated with the sentiment polarity: 0 for negative sentiment, and 1 for positive sentiment.

**Note that this corpus is much larger than the data that we have used in the workshop exercises. Be careful when you process it. If your code is not efficient, you will easily overwhelm the resources in your computer. To do the tasks, it is best that your computer has at least 16GB of RAM (the computers in E6A123 have 16GB of RAM). It would be wise if you did your first tests for debugging with a small portion of the data.**


Download the file SentimenAnalysisDataset.zip, unzip the file and place the resulting CSV (comma-separated-value) file in the same folder as this notebook. The unzipped file takes 150MB of space, so there is no problem to load the code into the memory of a modern computer. The following code uses Python's `csv` module to load the data and prints the first row and the total number of rows.

In [1]:
import nltk
import sklearn
import csv

from collections import Counter

In [2]:
with open('Sentiment Analysis Dataset.csv', encoding='utf8') as f:
    reader = csv.reader(f)
    print("Header line: %s" % next(reader))
    annotated_data = [r for r in reader]
print(annotated_data[0])
print("Total number of rows:", len(annotated_data))

Header line: ['\ufeffItemID', 'Sentiment', 'SentimentSource', 'SentimentText']
['1', '0', 'Sentiment140', '                     is so sad for my APL friend.............']
Total number of rows: 1578614


As you can see, each element in the list `annotated_data` is a list with the following information:

* item ID
* Sentiment (0 if negative, 1 if positive)
* Sentiment source (we can ignore this information for this assignment)
* Text of the tweet

To make this exercise more manageable, we will use only 500,000 tweets taken randomly from the annotated data:

In [3]:
import random
random.seed(1234)
random.shuffle(annotated_data)
annotated_data = annotated_data[:500000]

If you look at the tweets you will see that there are words in all uppercase characters, and that information may be useful. Therefore, **in the exercises below it is probably best that you do not convert the words to uppercase or lowercase**.

### Exercise 1 (1 mark) - Split the data
Split the data into a training set, a dev-test set, and a test set. Use the following ratio for splitting the data:

* Training set: 80%
* Dev-test set: 10%
* Test set: 10%

In [4]:
threshold1 = int(len(annotated_data)*0.8)
threshold2 = int(len(annotated_data)*0.9)

training_set = annotated_data[:threshold1]
devtest_set = annotated_data[threshold1:threshold2]
test_set = annotated_data[threshold2:]

# tokenize the text in the sets
training_set_tokenized = []
devtest_set_tokenized = []
test_set_tokenized = []

for (i,j,k,l) in training_set:
    training_set_tokenized.append((i,j,k,nltk.word_tokenize(l)))
for (i,j,k,l) in devtest_set:
    devtest_set_tokenized.append((i,j,k,nltk.word_tokenize(l)))
for (i,j,k,l) in test_set:
    test_set_tokenized.append((i,j,k,nltk.word_tokenize(l)))

### Exercise 2 (1 mark) - Check that the data are balanced
Print the percentage of positive and negative sentiments in each partition, and check that they are similar.

In [5]:
# training
training_PosNeg = [j for (i,j,k,l) in training_set]
training_PosNeg_Counter = Counter(training_PosNeg)
# devtest
devtest_PosNeg = [j for (i,j,k,l) in devtest_set]
devtest_PosNeg_Counter = Counter(devtest_PosNeg)
# testset
test_PosNeg = [j for (i,j,k,l) in test_set]
test_PosNeg_Counter = Counter(test_PosNeg)

# '0'=negative, '1'=positive
training_pos_value = training_PosNeg_Counter['1']/sum(training_PosNeg_Counter.values())
print("training set positive percentage: {:.2%}".format(training_pos_value))

training_neg_value = training_PosNeg_Counter['0']/sum(training_PosNeg_Counter.values())
print("training set negative percentage: {:.2%}".format(training_neg_value))

devtest_pos_value = devtest_PosNeg_Counter['1']/sum(devtest_PosNeg_Counter.values())
print("devtest set positive percentage: {:.2%}".format(devtest_pos_value))

devtest_neg_value = devtest_PosNeg_Counter['0']/sum(devtest_PosNeg_Counter.values())
print("devtest set negative percentage: {:.2%}".format(devtest_neg_value))

test_pos_value = test_PosNeg_Counter['1']/sum(test_PosNeg_Counter.values())
print("test set positive percentage: {:.2%}".format(test_pos_value))

test_neg_value = test_PosNeg_Counter['0']/sum(test_PosNeg_Counter.values())
print("test set negative percentage: {:.2%}".format(test_neg_value))


training set positive percentage: 49.99%
training set negative percentage: 50.01%
devtest set positive percentage: 49.81%
devtest set negative percentage: 50.19%
test set positive percentage: 50.19%
test set negative percentage: 49.81%


### Exercise 3 (2 marks) - Some simple data exploration
Answer the following questions. In your solution you need to include the code that you used to answer the questions. To find the answers to all of the questions below, **do not convert the text to lowercase**.

1. (1 mark) What is the size of the entire vocabulary in the training set?
2. (1 mark) In the training set, what words appear in the largest number of tweets with positive sentiment? What words appear in the largest number of tweets with negative sentiment?

In [6]:
total_words = []
positive_words = []
negative_words = []

for (i,j,k,l) in training_set_tokenized:
    for word in set(l):
        #add to total words
        total_words.append(word)
        if j =='0':
            #its negative
            negative_words.append(word)
        if j =='1':
            #its positive
            positive_words.append(word)
        
unique_total_words = set(total_words)
positive_words_counter = Counter(positive_words)
negative_words_counter = Counter(negative_words)

print("Size of vocabulary: ", len(unique_total_words), "\n")

print("Words which appear in largest number of tweets " \
      "with positive sentiment: ")
for (i,j) in positive_words_counter.most_common(10):
    print(i, " ", end="")
print("\n") #newline

print("Words which appear in largest number of tweets " \
      "with negative sentiment: ")
for (i,j) in negative_words_counter.most_common(10):
    print(i, " ", end="")
print("\n") #newline

Size of vocabulary:  346623 

Words which appear in largest number of tweets with positive sentiment: 
@  !  .  the  to  I  ,  a  you  and  

Words which appear in largest number of tweets with negative sentiment: 
@  .  I  to  the  !  ,  a  my  i  



### Exercise 4 (2 marks) - One-hot encoding for Naive Bayes in NLTK

Using the training set, design a feature extractor that uses one-hot encoding with the entire vocabulary in the training set. Use the feature extractor to train a Naive Bayes classifier in NLTK and report the accuracy of the classifier using the test set.

In [7]:
def feature_extractor(words):
    result = dict()
    for w in words:
        result[w] = (w in unique_total_words)
    return result

train_features = [(feature_extractor(l),j) for (i,j,k,l) in training_set_tokenized]
devtest_features = [(feature_extractor(l), j) for (i,j,k,l) in devtest_set_tokenized]
test_features = [(feature_extractor(l), j) for (i,j,k,l) in test_set_tokenized]
classifier = nltk.NaiveBayesClassifier.train(train_features)

print("training set: ",nltk.classify.accuracy(classifier, train_features))
print("devtest set: ",nltk.classify.accuracy(classifier, devtest_features))
print("test set: ",nltk.classify.accuracy(classifier, test_features))

training set:  0.86886
devtest set:  0.7662
test set:  0.76926


### Exercise 5 (2 marks) - One-hot encoding of most informative features

Find the 2000 most informative features with the help of the NLTK classifier of exercise 4 (read [chapter 6 of the NLTK book](http://www.nltk.org/book/ch06.html) for help on how to find the most informative features). Use NLTK to build a new Naive Bayes classifier that uses these 2000 most informative features, train it on the training set, and report the accuracy on the test set. 

In [8]:
most_informative2000 = classifier.most_informative_features(2000)
most_informative2000words = []
for (word,boolean) in most_informative2000:
    most_informative2000words.append(word)

def FE_most_informative_features(words):
    result = dict()
    for w in words:
        if w in most_informative2000words:
            result[w] = (w in word)
    return result

train_features = [(FE_most_informative_features(l),j) for (i,j,k,l) in training_set_tokenized]
devtest_features = [(FE_most_informative_features(l), j) for (i,j,k,l) in devtest_set_tokenized]
test_features = [(FE_most_informative_features(l), j) for (i,j,k,l) in test_set_tokenized]
classifier = nltk.NaiveBayesClassifier.train(train_features)

print("training set: ",nltk.classify.accuracy(classifier, train_features))
print("devtest set: ",nltk.classify.accuracy(classifier, devtest_features))
print("test set: ",nltk.classify.accuracy(classifier, test_features))

training set:  0.578395
devtest set:  0.57246
test set:  0.56868


### Exercise 6 (2 marks) - Tfidf for Naive Bayes in Scikit-Learn
Using Scikit-Learn, generate the tf.idf matrix of the training set. **Use the defaults of `sklearn`'s `TfidfVectorizer` except for `lowercase`, which you must set to `False`**. [sklearn documentation for TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) With this matrix, train an `sklearn` Naive Bayes classifier using the training set and report the accuracy on the test set.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

tfidf = TfidfVectorizer(lowercase="False")
tfidf_train = tfidf.fit_transform([l for (i,j,k,l) in training_set])
tfidf_devtest = tfidf.transform([l for (i,j,k,l) in devtest_set])
tfidf_test = tfidf.transform(([l for (i,j,k,l) in test_set]))

tfidfclassifier = MultinomialNB()
tfidfclassifier.fit(tfidf_train,[j for (i,j,k,l) in training_set])

print("test: ",accuracy_score([j for (i,j,k,l) in training_set], tfidfclassifier.predict(tfidf_train)))
print("devtest: ",accuracy_score([j for (i,j,k,l) in devtest_set], tfidfclassifier.predict(tfidf_devtest)))
print("test: ",accuracy_score([j for (i,j,k,l) in test_set], tfidfclassifier.predict(tfidf_test)))

test:  0.85102
devtest:  0.7663
test:  0.76866


### Exercise 7 (2 marks) - Analysis of Results
Analyse the results of all the classifiers from the previous exercises, and answer these questions. In all answers you must include any code that you used to answer the questions, the output of the code. Your answer must also include text formatted in Markdown where you write your interpretation of the output and how this interpretation answer the questions.

1. (1 mark) Did you observe any overfitting in any of the classifiers? How did you determine whether they are overfitting?
2. (1 mark) Do we have too little training data, or do we have too much training data for these classifiers?


#### 1.
Yes, both the: **One-hot encoding for Naive Bayes in NLTK** and **Tfidf for Naive Bayes in Scikit-Learn** are overfitting.  This can be caused by providing too many features and relying on the training data that doesn't generalise well to new examples.  The results of the classifiers reveal that they are overfitting.

**One-hot encoding for Naive Bayes in NLTK**<br>
Difference in accuracy between training set and devtest set: 0.10266<br>
Difference in accuracy between training set and test set: 0.0996<br>

|Set          |Accuracy |
|-------------|---------|
|training set:|0.86886  |
|devtest set: |0.7662   |
|test set:    |0.76926  |

**Tfidf for Naive Bayes in Scikit-Learn**<br>
Difference in accuracy between training set and devtest set: 0.08472<br>
Difference in accuracy between training set and test set: 0.08236<br>

|Set          |Accuracy|
|-------------|------- |
|training set:|0.85102 |
|devtest:     |0.7663  |
|test:        |0.76866 |

#### 2.
Yes, there is too much training data in the classifers.

### Exercise (8 marks) - Improve the System and Final Analysis
This exercise is open ended. Your goal is to perform a more detailed error analysis and identify ways to improve the classification of sentiments by adding or changing the features. Read, for example, Chapter 7 of the book by Maynard et al, or do your own research using the library or the Web. To obtain top marks in this part, you need to include the following in the answer:

1. An error analysis of the previous systems.
2. Text explaining what sort of modifications you would want to implement, and justify why these would be useful modifications.
3. An implementation of the improved classifier.
4. An evaluation of the results and comparison with the previous classifiers. You must explain what is the best system. In your explanation, include all factors that you used to decide which system is best.
5. Text explaining what further changes would possibly improve the classifier and why.

All this information should be inserted in this notebook below this question. The information should be structured in sections and it must be clear and precise. The explanations should be convincing. Below is a possible list of section headings. These sections are just a guideline. Feel free to change them, but make sure that they are informative and relevant.

### 1. Error Analysis

**One-hot encoding for Naive Bayes in NLTK** and **Tfidf for Naive Bayes in Scikit-Learn** are over fitted.

### 2. Explanation of the Proposed New Classifier
##### Giving less data to the training set and more to dev-test and test set
This can reduce overfitting.

##### Removing punctuation characters from the vocabulary
As seen in the words which appear in the largest number of tweets, punctuation characters appear.  The tokenizer has recognised these characters as words, however they are not technically words.  This can be seen with words with the largest number of tweets.
Removing punctuation characters would help the classifier.

##### Removing stopwords from the vocabulary
Stopwords are high frequency words.  These words appear in both positive and negative sentiments.  Because of their high frequency, they can make the classifier inaccurate.  Removing stopwords will lower the vocabulary and increase efficiency


### 3. Code of the Proposed New Classifier

#### [SEE CODE BLOCK BELOW]
`#Removing stopwords from the vocabulary`

### 4. Evaluation and Comparison

##### Removing stopwords from the vocabulary
The vocabulary has been reduced from 346,623 to 321,588 which is a 25,035 reduction in words.

### 5. Final Conclusions and Possible Improvements

Both **One-hot encoding for Naive Bayes in NLTK** and **Tfidf for Naive Bayes in Scikit-Learn** achieve similar goals but take different approaches.  The more a classifer can be trained, the better.

In [14]:
#Removing stopwords from the vocabulary
import collections
nltk.download("stopwords")
from nltk.corpus import stopwords
stop = stopwords.words('english')

total_words = []
positive_words = []
negative_words = []

for (i,j,k,l) in training_set_tokenized:
    if word not in set(l):
        for word in set(l):
            #add to total words
            total_words.append(word)
            if j =='0':
                #its negative
                negative_words.append(word)
            if j =='1':
                #its positive
                positive_words.append(word)

unique_total_words = set(total_words)
positive_words_counter = Counter(positive_words)
negative_words_counter = Counter(negative_words)

print("Size of vocabulary: ", len(unique_total_words), "\n")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nigel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Size of vocabulary:  321588 



# Submission of Results

Your submission should consist of this jupyter notebook with all your code and explanations inserted in the notebook. The notebook should contain the output of the runs so that it can be read by the assessor without needing to run the output.

Examine this notebook so that you can have an idea of how to format text for good visual impact. You can also read this useful [guide to the markdown notation](http://daringfireball.net/projects/markdown/syntax), which explains the format of the text.

Late submissions will have a penalty of **4 marks deduction per day late**.

Each question specifies a mark. The final mark of the assignment is the sum of all the individual marks, after applying any deductions for late submission.

By submitting this assignment you are acknowledging that this is your own work. Any submissions that breach the code of academic honesty will be penalised as per the [academic honesty policy](https://staff.mq.edu.au/work/strategy-planning-and-governance/university-policies-and-procedures/policies/academic-honesty).