# Week 6 (Part 2): Sentiment Lexicons
In this lab we will be looking at lexicons for sentiment analysis.  In particular, we will be investigating:
* bootstrapping wordlists using WordNet
* bootstrapping wordlists from corpora

First some preliminary imports:

In [6]:
from nltk.corpus import wordnet as wn
import nltk
import operator
import random
import sys
import pandas as pd
from Week6Labs.classifiercode import feature_extract, get_training_test_data, SimpleClassifier, classifier_evaluate
%matplotlib inline

We are going to use a WordList classifier (developed in Wk3) and evaluation code (developed in Wk4), so let's import them here.

In [4]:
from classifiercode import *

Sussex NLTK root directory is \\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources


Now let's define some short lists of positive and negative words that we might expect to find in movie reviews.

In [2]:
my_positive_words=["awesome","thrilling","funny","great"]
my_negative_words=["boring","terrible","hate","waste"]

Grab some training and testing data

In [5]:
training,testing=get_training_test_data("dvd")
traindata=[(feature_extract(review),label) for (review,label) in training]
testdata=[(feature_extract(review),label) for (review,label) in testing]


Construct a simple wordlist classifier (no training).  This is going to give us our baseline performance which we are going to try to beat.

In [7]:
baseline=SimpleClassifier(my_positive_words,my_negative_words)

Now let's evaluate it.

In [8]:
ms=["accuracy","precision","recall","f1"]
classifier_evaluate(baseline,testing,ms)

[0.55, 0.5270758122743683, 0.9733333333333334, 0.6838407494145199]

What do these figures tell you about the baseline classifier?
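
When reading these figures it can help to remember that F1 is the harmonic mean of precision and recall, so the fourth number can be recomputed from the second and third. A minimal sketch, assuming the output list is in the same order as `ms` (accuracy, precision, recall, F1):

```python
# Scores reported above, assumed to be in the order given by ms
accuracy, precision, recall, f1 = [0.55, 0.5270758122743683,
                                   0.9733333333333334, 0.6838407494145199]

# F1 is the harmonic mean of precision and recall
recomputed_f1 = 2 * precision * recall / (precision + recall)

# The two values should agree up to floating-point rounding
print(recomputed_f1, f1)
```

Comparing precision with recall in this way is one starting point for answering the question above.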

## 1. Bootstrapping Wordlists from WordNet

We are going to use semantic relationships in WordNet to extend the wordlists we have created.  Using human input to *seed* an algorithm in this way is often referred to as **bootstrapping**.  It is also often described as a form of **semi-supervised** learning.

A useful helper function `flatten` is defined below.  This takes an arbitrarily nested list and flattens it into a single-level list.

In [None]:
def flatten(nested_list):
    """
    flatten an arbitrarily nested list
    :param nested_list: list structure potentially containing more lists
    :return: list of atomic items
    """
    if isinstance(nested_list,list):
        res=[]
        for item in nested_list:
            res+=flatten(item)
        return res
    # anything that is not a list (including a string) is treated as atomic
    return [nested_list]


flatten([[[1,2],4],[5,6]])

### Exercise 1.1
Write a function `find_words(word,relation)` which takes two arguments, a word and a relation, and returns the **set** of all of the words which are in the given relation with the given word according to WordNet.  For example:
* `find_words("car","synonym")` should return `{'auto','automobile','cable_car','car','elevator_car','gondola','machine','motorcar','railcar','railroad_car','railway_car'}`
* `find_words("car","hyponym")` should return a set of 83 words
* `find_words("car","hypernym")` should return a set of 4 words
* `find_words("car","antonym")` should return an empty set

Hint: one way of doing this is to use nested list comprehensions, flatten the resulting list using the `flatten()` function defined above and then use the built-in `set()` function to remove duplicates.
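
The pattern in the hint (nested comprehension, then flatten, then `set()`) can be illustrated without WordNet. The sketch below uses a hypothetical toy dictionary, `toy_senses`, in place of WordNet, and a hypothetical function name `toy_find_words`; it is not a solution to the exercise, only a demonstration of the three steps.

```python
# Hypothetical toy data standing in for WordNet: each word maps to a list of
# "senses", and each sense is a list of names. This is NOT the WordNet API.
toy_senses = {
    "car": [["car", "auto", "automobile"], ["car", "railcar"], ["cable_car", "car"]],
}

def flatten(nested):
    """Compact equivalent of the flatten() helper defined above."""
    if isinstance(nested, list):
        return [atom for item in nested for atom in flatten(item)]
    return [nested]

def toy_find_words(word):
    # Step 1: nested list comprehension, one inner list per sense
    nested = [[name for name in sense] for sense in toy_senses.get(word, [])]
    # Step 2: flatten to a single-level list; Step 3: set() removes duplicates
    return set(flatten(nested))

print(sorted(toy_find_words("car")))
```

In your own solution, the outer loop would run over the synsets that WordNet returns for the word rather than over the toy dictionary.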

### Exercise 1.2
Use your `find_words()` function to extend your lists of seeds in four different ways:
1. add all synonyms
2. add all synonyms and antonyms
3. add all synonyms, antonyms and hyponyms
4. add all synonyms, antonyms, hyponyms and hypernyms

In each case, think about whether the related words should be added to the **same** seed list or to the **other** seed list.

Starting with the seed lists defined above, the lengths of your extended lists should be:
1. positive: 68, negative: 73
2. positive: 71, negative: 73 
3. positive: 71, negative: 168
4. positive: 91, negative: 219
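
One possible shape for the extension step is sketched below for synonyms and antonyms only. Everything here is an assumption for illustration: `toy_relations`, `toy_lookup` and `extend_seeds` are hypothetical names, the toy data is invented, and in the lab the lookup would be your own `find_words()`. Where hyponyms and hypernyms should go is left for you to decide.

```python
# Hypothetical toy relation table; in the lab these sets would come from find_words()
toy_relations = {
    ("great", "synonym"): {"great", "outstanding"},
    ("boring", "synonym"): {"boring", "dull"},
    ("boring", "antonym"): {"interesting"},
}

def toy_lookup(word, relation):
    """Return the toy set of related words, or an empty set if none is listed."""
    return toy_relations.get((word, relation), set())

def extend_seeds(positive, negative, lookup):
    """Return extended copies of the two seed lists.

    Synonyms keep the polarity of their seed; antonyms go to the other list.
    """
    new_pos, new_neg = set(positive), set(negative)
    for word in positive:
        new_pos |= lookup(word, "synonym")
        new_neg |= lookup(word, "antonym")
    for word in negative:
        new_neg |= lookup(word, "synonym")
        new_pos |= lookup(word, "antonym")
    return sorted(new_pos), sorted(new_neg)

pos, neg = extend_seeds(["great"], ["boring"], toy_lookup)
print(pos)  # ['great', 'interesting', 'outstanding']
print(neg)  # ['boring', 'dull']
```

Working on copies keeps the original seed lists intact, which matters later when the corpus-based extensions start again from the original seeds.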


### Exercise 1.3
* Build classifiers using each of the four variations of wordlist extensions. 
* Test them on the testing set
* Display your results (including the baseline) in a pandas table
* Make a barchart showing the accuracy scores for the different variations.
* Interpret your results
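
A results table of this kind could be laid out as follows. This is a minimal sketch, assuming one row per classifier and one column per metric; only the baseline row is filled in, using the scores reported earlier in this lab, and rows for the extended wordlists would be added once you have evaluated them.

```python
import pandas as pd

ms = ["accuracy", "precision", "recall", "f1"]

# Baseline scores copied from the evaluation output earlier in this lab.
# Add one entry per extended wordlist after evaluating each classifier.
scores = {
    "baseline": [0.55, 0.5270758122743683, 0.9733333333333334, 0.6838407494145199],
}

# One row per classifier, one column per metric
results = pd.DataFrame.from_dict(scores, orient="index", columns=ms)
print(results)

# In a notebook with %matplotlib inline, this draws the accuracy bar chart:
# results["accuracy"].plot.bar()
```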

## 2. Finding Word Patterns in Corpus Data

Now we are going to use the training data as a corpus to search for words which tend to be **conjoined** with the seed words.  For example, "funny and fresh" would indicate that "fresh" is also a positive sentiment word.  On the other hand, "funny but predictable" indicates that "predictable" is a negative sentiment word. 
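
The idea can be tried out on a toy token list before working with the real corpus. The sketch below is an illustration under stated assumptions: the sentence is invented, `words_after` is a hypothetical helper, and it only looks at words to the right of the seed, with no filtering.

```python
# Toy token list for illustration only; not taken from the lab corpus
tokens = "the film was funny and fresh , funny but predictable".split()

def words_after(seed, conj, tokens):
    """Return the set of words that directly follow 'seed conj' in tokens."""
    return {
        tokens[i + 2]
        for i in range(len(tokens) - 2)
        if tokens[i] == seed and tokens[i + 1] == conj
    }

print(words_after("funny", "and", tokens))  # {'fresh'}
print(words_after("funny", "but", tokens))  # {'predictable'}
```

A fuller version might also check the pattern in the other direction (a word followed by the conjunction and then the seed).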

Note that whilst we are using the labelled training data, we will not be paying attention to the positive and negative labels.  Therefore this is essentially an unsupervised method (we could use any in-domain corpus data even if it had not been labelled with sentiment).

First, let's flatten the corpus into a list of tokens and normalise the corpus (stop-word removal and lower-casing).  We are not going to use the bag-of-words representation here because we want to pay attention to the order of the words.

In [None]:
training_corpus=flatten([review.sents() for review,label in training])

In [None]:
normalised_corpus=[normalise(token) for token in training_corpus]

In [None]:
normalised_corpus[:20]

### Exercise 2.1

Write a function `search_conj()` to find words which are co-ordinated.  Your function should take 3 arguments:
* the conjunction word e.g., one of {"and", "or", "but"}
* the seed word
* the corpus

It should return a set of words.  You should filter this set for (at least) stopwords and punctuation.

### Exercise 2.2
Update your `find_words()` and `extend()` functions as necessary so that you can use them to add words found by `search_conj()` to seed word lists.  Then, extend your **original** seed word lists by 
1. adding words conjoined with *and*
2. adding words conjoined with *and* and with *but*

When I ran this, I got lists of the following lengths:
1. positive: 28, negative: 17
2. positive: 28, negative: 25

However, your lists will probably differ in length, because the sentences are sampled at random from the corpus.

### Exercise 2.3
Build and evaluate classifiers using these wordlists.  Make sure you provide a visualisation of your results (e.g., a pandas bar chart) and draw some conclusions.  What are the advantages and disadvantages of each method (using WordNet vs using corpora) for extending the wordlists?