In this notebook, we will take a brief look at unsupervised sentiment analysis using the SentiWordNet lexicon. We will use `nltk`, a very well-known library for NLP.

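You will need the following libraries installed: `nltk`, `numpy`, `typing_extensions` (for `Literal`), and `dataclasses` (a separate backport package on Python 3.6 and part of the standard library from Python 3.7; you can avoid it by changing the code a bit). I ran this on Python 3.6, but any later Python 3 should work. These are the versions on my computer: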

```
In [65]: numpy.__version__
Out[65]: '1.18.4'

In [66]: nltk.__version__
Out[66]: '3.5'
```

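First, some necessary imports and downloads. The tokenizer and tagger used further down also rely on NLTK's `punkt` and `averaged_perceptron_tagger` data packages (with the NLTK version listed above); if they are missing on your machine, `nltk.download` can fetch them in the same way.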

In [110]:
import nltk
nltk.download('sentiwordnet')
nltk.download('wordnet')
from nltk.corpus import sentiwordnet as swn

[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /home/sagnik/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package wordnet to /home/sagnik/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


If you remember from the slides, SentiWordNet is a big dictionary (map) in which each [synset](https://www.geeksforgeeks.org/nlp-synsets-for-a-word-in-wordnet/) is indexed by a key: (part-of-speech tag, word). For that key, it gives you the positive, negative, and neutral scores (along with the synset).

What is part-of-speech tagging? It is the process of classifying words into their parts of speech and labeling them accordingly.

| sentence                                                 | why    | not    | tell | someone | ?           |
|----------------------------------------------------------|--------|--------|------|---------|-------------|
| part of speech                                           | adverb | adverb | verb | noun    | punctuation |
| [part of speech tag](http://www.nltk.org/book/ch05.html) | WRB    | RB     | VB   | NN      | .           |

Any part-of-speech tagger uses a particular `tagset`, i.e., a set of part-of-speech tags. NLTK uses a tagset from [UPenn](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), more commonly known as the Penn Treebank tagset.

How do we get sentiment scores from SentiWordNet?


In [111]:
positive_word = "happy"
negative_word = "unhappy"
pos_tag = "a"  # WordNet tag for adjectives
print(list(swn.senti_synsets(positive_word, pos_tag)))
for word in [positive_word, negative_word]:
    # Look the word up once and use its first listed sense.
    first_synset = list(swn.senti_synsets(word, pos_tag))[0]
    print("-"*30)
    print(f"word: {word}, positive score: {first_synset.pos_score()}")
    print(f"word: {word}, negative score: {first_synset.neg_score()}")
    print(f"word: {word}, neutral score: {first_synset.obj_score()}")
    print("-"*30)


[SentiSynset('happy.a.01'), SentiSynset('felicitous.s.02'), SentiSynset('glad.s.02'), SentiSynset('happy.s.04')]
------------------------------
word: happy, positive score: 0.875
word: happy, negative score: 0.0
word: happy, neutral score: 0.125
------------------------------
------------------------------
word: unhappy, positive score: 0.0
word: unhappy, negative score: 0.75
word: unhappy, neutral score: 0.25
------------------------------
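
The three scores of a sense are not independent: in NLTK's implementation, the neutral (objective) score is what remains after subtracting the positive and negative scores from 1, so the three always add up to 1. Here is a quick check using the values printed above. The `scores` dictionary is a hand-copied illustration, not an NLTK object.

```python
# Scores copied by hand from the output above: (positive, negative, neutral).
scores = {
    "happy": (0.875, 0.0, 0.125),
    "unhappy": (0.0, 0.75, 0.25),
}

for word, (pos, neg, obj) in scores.items():
    # The neutral score is whatever is left after the positive and negative parts.
    assert obj == 1.0 - (pos + neg)
    print(f"{word}: {pos} + {neg} + {obj} = {pos + neg + obj}")
```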


You might have noticed that we use a POS tag of `a` and not a Penn Treebank tag such as `JJ`. This is because SentiWordNet uses WordNet's tagset for part-of-speech tags: `a: all adjectives, r: all adverbs, v: all verbs, n: all nouns`. This requires us to convert between the two tagsets, using a function that we will define later:

```
def penn_pos_tag_to_word_net(pos_tag_penn: str) -> Union[str, None]:
    word_net_tag = {'NN':wn.NOUN, 'JJ':wn.ADJ,
                  'VB':wn.VERB, 'RB':wn.ADV}
    return word_net_tag.get(pos_tag_penn[:2])
```
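
To see what the two-character prefix lookup does in practice, here is a small standalone sketch of the same idea. It does not need `nltk`: the single-letter strings are the values of `wn.NOUN`, `wn.ADJ`, `wn.VERB`, and `wn.ADV`, and the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are made up for this illustration.

```python
from typing import Optional

# WordNet's single-letter tags; these are the values of wn.NOUN, wn.ADJ, wn.VERB and wn.ADV.
PENN_PREFIX_TO_WORDNET = {"NN": "n", "JJ": "a", "VB": "v", "RB": "r"}


def penn_to_wordnet(penn_tag: str) -> Optional[str]:
    # Only the first two characters are looked up, so NNS, NNP, ... all map like NN.
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:2])


for tag in ["NN", "NNS", "JJR", "VBD", "RB", "WRB", "DT", "."]:
    print(f"{tag!r:6} -> {penn_to_wordnet(tag)!r}")
```

Note that a tag such as `WRB` (the tag for "why" in the table above) starts with `WR`, not `RB`, so it maps to `None`, just like determiners and punctuation.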

So we know how to get the positive, negative, and neutral sentiment scores for a word from SentiWordNet. How do we use them to classify a sentence? Let's define some interfaces first.

We will assume the use of SentiWordNet, so some parts of the code are the same for any class, i.e., tokenization, POS tagging, conversion of POS tags to SentiWordNet POS tags, and the scores for each sentiment. What changes is _how_ we use these scores.

In [112]:
from enum import Enum
from typing_extensions import Literal
from nltk.tokenize import word_tokenize as tokenize
from nltk.tag import pos_tag
import numpy as np
from nltk.corpus import wordnet as wn
from typing import List, Tuple, Union
from dataclasses import dataclass

class Sentiment(Enum):
    POSITIVE = 1
    NEGATIVE = 2
    NEUTRAL = 3

@dataclass
class TokenSentiment:
    token: str
    pos_tag: str
    positive: float
    negative: float
    neutral: float

def penn_pos_tag_to_word_net(pos_tag_penn: str) -> Union[str, None]:
    word_net_tag = {'NN':wn.NOUN, 'JJ':wn.ADJ,
                  'VB':wn.VERB, 'RB':wn.ADV}
    return word_net_tag.get(pos_tag_penn[:2])


def get_token_sentiment(token: str, pos: str) -> TokenSentiment:
    try:
        # Use the first sense listed for this (token, pos) pair.
        synset_0 = list(swn.senti_synsets(token, pos))[0]
        return TokenSentiment(token=token, pos_tag=pos, positive=synset_0.pos_score(),
                              negative=synset_0.neg_score(), neutral=synset_0.obj_score())
    except IndexError:
        # No SentiWordNet entry for this pair: treat the token as fully neutral.
        return TokenSentiment(token=token, pos_tag=pos, positive=0.0, negative=0.0, neutral=1.0)

class SentenceSentiment:
    def __init__(self, sentence: str):
        self.sentence = sentence
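        # `penn_tag` avoids reusing the name of the imported `pos_tag` function as a loop variable.
        self.pos_tokens = [(token, penn_pos_tag_to_word_net(penn_tag)) for token, penn_tag in pos_tag(tokenize(sentence))]
        # Tokens whose Penn tag has no WordNet equivalent (determiners, punctuation, ...) are dropped.
        self.token_sentiments = [get_token_sentiment(token, pos) for token, pos in self.pos_tokens if pos is not None]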
    
    def run(self, **kwargs) -> Literal[Sentiment.POSITIVE, Sentiment.NEGATIVE, Sentiment.NEUTRAL]:
        # Subclasses decide how to combine the token scores.
        raise NotImplementedError


In [113]:
sentence = SentenceSentiment("this movie is awesome!")
sentence.token_sentiments

[TokenSentiment(token='movie', pos_tag='n', positive=0.0, negative=0.0, neutral=1.0),
 TokenSentiment(token='is', pos_tag='v', positive=0.25, negative=0.125, neutral=0.625),
 TokenSentiment(token='awesome', pos_tag='a', positive=0.875, negative=0.125, neutral=0.0)]

In [114]:
sentence = SentenceSentiment("this movie is horrible!")
sentence.token_sentiments

[TokenSentiment(token='movie', pos_tag='n', positive=0.0, negative=0.0, neutral=1.0),
 TokenSentiment(token='is', pos_tag='v', positive=0.25, negative=0.125, neutral=0.625),
 TokenSentiment(token='horrible', pos_tag='a', positive=0.0, negative=0.625, neutral=0.375)]

Now we will implement the `run` method in a subclass to get the sentiment of a sentence. A simple way to do this is to aggregate the sentiment scores of the tokens. The aggregation function can be, for example, the average or the maximum.

In [115]:
import numpy as np

class SentenceSentimentAggregation(SentenceSentiment):
    def __init__(self, sentence: str):
        super().__init__(sentence)
    
    def run(self, **kwargs):
        aggregation_fn = kwargs['aggregation_fn']
        # No token could be scored (e.g. only determiners or punctuation): fall back to neutral.
        if not self.token_sentiments:
            return Sentiment.NEUTRAL
        score_dict = {
            Sentiment.POSITIVE: aggregation_fn([x.positive for x in self.token_sentiments]),
            Sentiment.NEGATIVE: aggregation_fn([x.negative for x in self.token_sentiments]),
            Sentiment.NEUTRAL: aggregation_fn([x.neutral for x in self.token_sentiments]),
        }
        # Return the sentiment with the highest aggregated score.
        score_vals = score_dict.items()
        return sorted(score_vals, key=lambda x:x[1], reverse=True)[0][0]

In [116]:
sentence = SentenceSentimentAggregation("this movie is awesome!")
sentence.run(aggregation_fn=np.mean)

<Sentiment.NEUTRAL: 3>
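
This result may look surprising for such a positive sentence. Redoing the aggregation by hand on the token scores shown earlier for "this movie is awesome!" makes it clear why. The sketch below uses the standard library's `statistics.mean` and the built-in `max` as stand-ins for `np.mean` and `np.max`, and the names `token_scores`, `labels`, and `aggregate` are made up for this illustration.

```python
from statistics import mean

# (positive, negative, neutral) scores copied from the token_sentiments output shown earlier.
token_scores = {
    "movie": (0.0, 0.0, 1.0),
    "is": (0.25, 0.125, 0.625),
    "awesome": (0.875, 0.125, 0.0),
}

labels = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def aggregate(aggregation_fn):
    # zip(*...) regroups the per-token triples into one sequence per sentiment.
    columns = zip(*token_scores.values())
    return {label: aggregation_fn(column) for label, column in zip(labels, columns)}


for name, fn in [("mean", mean), ("max", max)]:
    totals = aggregate(fn)
    winner = max(totals, key=totals.get)
    print(name, totals, "->", winner)
```

With the mean, the fully neutral noun and the mostly neutral verb pull the neutral average (about 0.54) above the positive one (0.375). With the maximum, the neutral score of 1.0 for "movie" beats the 0.875 of "awesome", so this sentence comes out as neutral under both aggregations.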

In [117]:
sentence = SentenceSentimentAggregation("awesome!")
sentence.run(aggregation_fn=np.mean)

<Sentiment.NEUTRAL: 3>

In [109]:
sentence = SentenceSentimentAggregation("truly awesome!")
sentence.run(aggregation_fn=np.max)

<Sentiment.POSITIVE: 1>

You should try this out with some other examples. Another method is to compare the number of positive words with the number of negative words to determine the polarity of the sentence. This is left as an exercise.
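
As a possible starting point for that exercise, here is a minimal standalone sketch of the counting idea. It is only one way to do it and rests on assumptions of my own: a token counts as positive when its positive score is strictly larger than its negative score (and vice versa), and ties are labeled neutral. The function name `count_based_sentiment` and the trimmed-down `TokenSentiment` are made up for this illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Sentiment(Enum):
    POSITIVE = 1
    NEGATIVE = 2
    NEUTRAL = 3


@dataclass
class TokenSentiment:
    # Trimmed-down version of the notebook's class: only the fields used here.
    token: str
    positive: float
    negative: float


def count_based_sentiment(token_sentiments: List[TokenSentiment]) -> Sentiment:
    # Assumption: a token is positive (negative) when that score is strictly larger.
    n_positive = sum(1 for t in token_sentiments if t.positive > t.negative)
    n_negative = sum(1 for t in token_sentiments if t.negative > t.positive)
    if n_positive > n_negative:
        return Sentiment.POSITIVE
    if n_negative > n_positive:
        return Sentiment.NEGATIVE
    return Sentiment.NEUTRAL  # tie, or no polar tokens at all


# Scores copied from the token_sentiments outputs shown earlier.
awesome = [TokenSentiment("movie", 0.0, 0.0), TokenSentiment("is", 0.25, 0.125),
           TokenSentiment("awesome", 0.875, 0.125)]
horrible = [TokenSentiment("movie", 0.0, 0.0), TokenSentiment("is", 0.25, 0.125),
            TokenSentiment("horrible", 0.0, 0.625)]
print(count_based_sentiment(awesome))
print(count_based_sentiment(horrible))
```

In this sketch the slightly positive "is" cancels out "horrible", so the second sentence ends in a tie. Refining the counting rule, for instance with a minimum score threshold, is a natural next step for the exercise.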