# Sentiment & Dictionaries

We will mostly use NLTK to conduct sentiment analysis in this lab.

## NLTK Corpus

NLTK has several corpora. Some are useful for sentiment analysis.

http://www.nltk.org/howto/corpus.html

* opinion_lexicon
* WordNet
* SentiWordNet

### opinion lexicon

Opinion Lexicon: a list of English positive and negative opinion (sentiment) words, around 6,800 in total. The list has been compiled over many years, starting with the paper by Hu and Liu (KDD-2004).

You first need to download the NLTK `opinion_lexicon` corpus:
`nltk.download('opinion_lexicon')`



In [84]:
import nltk
#nltk.download('opinion_lexicon') # run this download once, the first time you use the corpus
from nltk.corpus import opinion_lexicon

In [85]:
opinion_lexicon.positive()

['a+', 'abound', 'abounds', 'abundance', 'abundant', ...]

In [86]:
len(opinion_lexicon.positive())

2006

In [87]:
opinion_lexicon.negative()

['2-faced', '2-faces', 'abnormal', 'abolish', ...]

In [88]:
len(opinion_lexicon.negative())

4783

**<span class="mark">Your turn</span>**: think of three positive and three negative sentiment words. Check whether they are in the lexicon.

In [90]:
# replace with your own words
my_pos = ['happy','alright','groovy']
my_neg = ['psycho','retarded','nasty']

In [91]:
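# run this to see whether each word is in either lexicon
# build the lookup sets once instead of re-reading the corpus for every word
pos_words = set(opinion_lexicon.positive())
neg_words = set(opinion_lexicon.negative())
print('WORD, POS, NEG\n---------------')
for lex in [my_pos, my_neg]:
    for word in lex:
        print(word, word in pos_words, word in neg_words)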

WORD, POS, NEG
---------------
happy True False
alright False False
groovy False False
psycho False False
retarded False True
nasty False True


The results above show that the opinion lexicon cannot assign a positive or negative label to some words, such as "alright", "groovy", and "psycho". A word that carries no sentiment at all produces the same result (`False False`).

### Sentiment of tweet
In the last lab, you all tried tokenizing tweets.

**<span class="mark">TODO</span>**: What's the sentiment of a tweet sample? 
You can try with "@john lol that was #awesome :)"


In [92]:
test_tweet = "@john lol that was #awesome :)"

#your code below
from nltk.tokenize import sent_tokenize, word_tokenize

# tokenize text into words
words = word_tokenize(test_tweet)
words

['@', 'john', 'lol', 'that', 'was', '#', 'awesome', ':', ')']

In [93]:
print('WORD, POS, NEG\n---------------')
for lex in [words]:
    for word in lex:
        print(word,word in opinion_lexicon.positive(),word in opinion_lexicon.negative())

WORD, POS, NEG
---------------
@ False False
john False False
lol False False
that False False
was False False
# False False
awesome True False
: False False
) False False


In [94]:
# Prof Tanu's Code
tweetoken = nltk.word_tokenize(test_tweet)
for word in tweetoken:
    print(word, word in opinion_lexicon.positive(), word in opinion_lexicon.negative())

@ False False
john False False
lol False False
that False False
was False False
# False False
awesome True False
: False False
) False False
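
The per-word lookups above can be folded into a single tweet-level score. The following is a minimal sketch, not part of the lab's reference solution: `lexicon_score` is a hypothetical helper that counts positive and negative tokens and returns the difference. The two small word sets are stand-ins so the example runs without downloading the corpus; in the lab you would pass `set(opinion_lexicon.positive())` and `set(opinion_lexicon.negative())` instead.

```python
def lexicon_score(tokens, pos_words, neg_words):
    """Return (#positive tokens) - (#negative tokens) for a tokenized text."""
    tokens = [t.lower() for t in tokens]
    n_pos = sum(t in pos_words for t in tokens)
    n_neg = sum(t in neg_words for t in tokens)
    return n_pos - n_neg

# Stand-in word sets (assumption: tiny samples for illustration only)
pos_words = {"awesome", "happy"}
neg_words = {"nasty", "abnormal"}

# Tokens as produced by word_tokenize for the test tweet above
tokens = ['@', 'john', 'lol', 'that', 'was', '#', 'awesome', ':', ')']
score = lexicon_score(tokens, pos_words, neg_words)
print(score)  # 1: one positive token, no negative tokens
```

A score above zero suggests positive sentiment and a score below zero suggests negative sentiment. A score of zero can mean either neutral text or words that are simply missing from the lexicon. This simple count ignores negation and emoticons.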


### sentiment analysis with `VADER`
https://pypi.org/project/vaderSentiment/

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

In [95]:
#pip install vaderSentiment

In [96]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #pip install vaderSentiment

In [97]:
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores(test_tweet)

{'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.872}

Let's try another text, this time a news article. Recall this text from the last lab.

In [98]:
text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

In [99]:
analyzer.polarity_scores(text)

{'neg': 0.2, 'neu': 0.778, 'pos': 0.023, 'compound': -0.9287}

**How to interpret the overall score?**

The compound score is computed by summing the valence scores of each word in the text that appears in the lexicon, adjusting the sum according to VADER's rules, and then normalizing it to lie between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.

* Positive sentiment: compound score >= 0.05
* Neutral sentiment: -0.05 < compound score < 0.05
* Negative sentiment: compound score <= -0.05
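
To make the normalization step concrete, here is a minimal sketch of a squashing function. The formula `x / sqrt(x^2 + alpha)` with `alpha = 15` is believed to match VADER's reference implementation, but treat the constant as an assumption and check the library source if it matters. `normalize_compound` is a name chosen here for illustration.

```python
import math

def normalize_compound(raw_sum, alpha=15.0):
    """Map an unbounded valence sum into the interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Larger sums move toward -1 or +1 but never reach them
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw, round(normalize_compound(raw), 4))
```

Because of this squashing, a long text with many strongly valenced words drifts toward the extremes, which is worth keeping in mind when comparing compound scores for a tweet and a news paragraph.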

**Multi-dimensional measures of sentiment**

The `pos`, `neu`, and `neg` scores are the proportions of the text that fall into each category, so they should add up to 1, or very close to it, because each value is reported rounded to three decimals. These are the most useful metrics if you want multidimensional measures of sentiment for a given sentence.
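
As a quick sanity check, the three proportions reported for the news text above can be summed by hand. This sketch hard-codes the scores from the earlier output rather than calling VADER, so it runs on its own.

```python
# Scores VADER returned for the news text above
scores = {'neg': 0.2, 'neu': 0.778, 'pos': 0.023, 'compound': -0.9287}

total = scores['neg'] + scores['neu'] + scores['pos']
print(round(total, 3))  # 1.001

# Allow a small tolerance because each proportion is rounded to three decimals
assert abs(total - 1.0) < 0.005
```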

**<span class="mark">TODO</span>**:

Write a function that interprets the overall sentiment of a text as positive, negative, or neutral based on VADER's analysis.

In [100]:
# Your code below
# Apply the thresholds above to the compound score:
# >= 0.05 is positive, <= -0.05 is negative, anything in between is neutral

k = analyzer.polarity_scores(text)


if k["compound"] >= 0.05:
    print("Positive sentiment")
elif k["compound"] <= -0.05:
    print("Negative sentiment")
else:
    print("Neutral sentiment")


Negative sentiment
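
The TODO asks for a function, so here is one possible way to wrap the same thresholds. This is a sketch under the cutoffs listed above; `interpret_sentiment` is a hypothetical name, and it takes the compound score directly so it can be tried without VADER installed.

```python
def interpret_sentiment(compound):
    """Label a VADER compound score using the thresholds listed above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Compound scores taken from the outputs earlier in this lab
print(interpret_sentiment(0.872))    # positive (test tweet)
print(interpret_sentiment(-0.9287))  # negative (news text)
print(interpret_sentiment(0.0))      # neutral
```

With the analyzer from this lab, the call would look like `interpret_sentiment(analyzer.polarity_scores(text)['compound'])`.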


### Testing a few more text sentiments

In [101]:
text1 = "I am happy"
text2 = "I think I am happy"
text3 = "I doubt If I am happy"
text4 = "I think I am not happy"
text5 = "I am not happy"

In [102]:
analyzer.polarity_scores(text1)

{'neg': 0.0, 'neu': 0.351, 'pos': 0.649, 'compound': 0.5719}

In [103]:
analyzer.polarity_scores(text2)

{'neg': 0.0, 'neu': 0.519, 'pos': 0.481, 'compound': 0.5719}

In [104]:
analyzer.polarity_scores(text3)

{'neg': 0.245, 'neu': 0.392, 'pos': 0.363, 'compound': 0.296}

In [105]:
analyzer.polarity_scores(text4)

{'neg': 0.375, 'neu': 0.625, 'pos': 0.0, 'compound': -0.4585}

In [106]:
analyzer.polarity_scores(text5)

{'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.4585}

### sentiment analysis with `TextBlob`

https://textblob.readthedocs.io/en/dev/

In [107]:
#!pip install TextBlob

In [108]:
from textblob import TextBlob #pip install TextBlob

blob = TextBlob(test_tweet)
blob.polarity

0.7666666666666666

In [109]:
blob.subjectivity

0.9

Let's try another text, this time a news article. Recall this text from the last lab.

In [110]:
text = "Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack \"despicable.\""

In [111]:
blob = TextBlob(text)
blob.polarity

-0.04285714285714285

In [112]:
blob.subjectivity

0.2642857142857143

There are a few other attributes and methods available as well. Type `blob.` and press Tab to see them.

In [113]:
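# Note: without parentheses this only displays the bound method; call blob.parse() to run the parser
blob.parse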

<bound method BaseBlob.parse of TextBlob("Netanyahu's visit was cut short by reports late Sunday that a rocket was fired from Gaza into central Israel, wounding at least seven people. Following criticism from political opponents over what they consider the prime minister's unclear stance toward the militant political group, Israel responded with a series of strikes into Gaza against Hamas, which largely governs the contested strip. President Donald Trump tacitly endorsed the strike following his meetings with Netanyahu, calling the Hamas attack "despicable."")>

#### A few more tests to see the rule-based approach

In [114]:
TextBlob('great').sentiment

Sentiment(polarity=0.8, subjectivity=0.75)

In [115]:
TextBlob('not great').sentiment

Sentiment(polarity=-0.4, subjectivity=0.75)

So the negation rule observed here is: polarity of "not great" = polarity of "great" × -0.5 = 0.8 × -0.5 = -0.4.
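
The arithmetic can be written as a one-line function. This is only a sketch of the rule as observed in the two outputs above, not TextBlob's actual implementation; `negated_polarity` and its `factor` parameter are hypothetical names.

```python
def negated_polarity(polarity, factor=-0.5):
    """Apply the observed negation rule: scale the polarity by -0.5."""
    return polarity * factor

print(negated_polarity(0.8))  # -0.4, matching TextBlob('not great') above
```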

**<span class="mark">TODO for fun</span>**

Try with a few different variations to see whether you can observe the rules working here.

### `Empath`

https://github.com/Ejhfast/empath-client

https://pypi.org/project/empath/

In [116]:
#!pip install empath

In [117]:
from empath import Empath #pip install empath

In [118]:
lexicon = Empath()

In [119]:
categ = lexicon.analyze("he hit the other person", normalize=True)

In [120]:
print('Categories for the sentence: "he hit the other person":')
for key, value in categ.items():
    if value != 0:
        print(key)

Categories for the sentence: "he hit the other person":
movement
violence
pain
negative_emotion


In [121]:
#available categories in empath
print(categ.keys())

dict_keys(['help', 'office', 'dance', 'money', 'wedding', 'domestic_work', 'sleep', 'medical_emergency', 'cold', 'hate', 'cheerfulness', 'aggression', 'occupation', 'envy', 'anticipation', 'family', 'vacation', 'crime', 'attractive', 'masculine', 'prison', 'health', 'pride', 'dispute', 'nervousness', 'government', 'weakness', 'horror', 'swearing_terms', 'leisure', 'suffering', 'royalty', 'wealthy', 'tourism', 'furniture', 'school', 'magic', 'beach', 'journalism', 'morning', 'banking', 'social_media', 'exercise', 'night', 'kill', 'blue_collar_job', 'art', 'ridicule', 'play', 'computer', 'college', 'optimism', 'stealing', 'real_estate', 'home', 'divine', 'sexual', 'fear', 'irritability', 'superhero', 'business', 'driving', 'pet', 'childish', 'cooking', 'exasperation', 'religion', 'hipster', 'internet', 'surprise', 'reading', 'worship', 'leader', 'independence', 'movement', 'body', 'noise', 'eating', 'medieval', 'zest', 'confusion', 'water', 'sports', 'death', 'healing', 'legend', 'heroic

In [122]:
# let's see how Empath works on our tweet text
categ_tweets = lexicon.analyze(test_tweet)
# categ_tweets  # left commented out because it prints a long dictionary

In [123]:
print('Categories for the sentence:', test_tweet)
for key, value in categ_tweets.items():
    if value != 0:
        print(key)

Categories for the sentence: @john lol that was #awesome :)


In [124]:
# how will this work on text5

categ = lexicon.analyze(text5, normalize=True)
categ_tweets = lexicon.analyze(text5)

In [125]:

print('Categories for the sentence:', text5)
for key, value in categ_tweets.items():
    if value != 0:
        print(key)

Categories for the sentence: I am not happy
wedding
cheerfulness
optimism
childish
celebration
party
positive_emotion


In [126]:
categ_text = lexicon.analyze(text5)

In [127]:
print('Categories for the sentence:', text5, '\n---------')
for key, value in categ_text.items():
    if value != 0:
        print(key)

Categories for the sentence: I am not happy 
---------
wedding
cheerfulness
optimism
childish
celebration
party
positive_emotion


**<span class="mark">TODO</span>**: 

1. From the project pitches that you all submitted, you had some idea of what data to collect. Get one data point for your problem (this could be one reddit post from a community, one tweet, etc.)
2. Now check to see which categories of Empath are present
3. Now loop through your entire dataset
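
For step 3, one possible shape for the loop is sketched below. `count_categories` is a hypothetical helper, and `toy_analyze` is a stand-in for `lexicon.analyze` so the example runs without Empath installed; its two categories and word lists are invented for illustration.

```python
from collections import Counter

def count_categories(texts, analyze):
    """Count in how many texts each category has a nonzero score."""
    counts = Counter()
    for text in texts:
        scores = analyze(text)
        counts.update(key for key, value in scores.items() if value != 0)
    return counts

# Stand-in analyzer (assumption: mimics the dict that lexicon.analyze returns)
def toy_analyze(text):
    words = text.lower().split()
    return {
        "violence": float(sum(w in {"hit", "fight"} for w in words)),
        "positive_emotion": float(sum(w in {"happy", "love"} for w in words)),
    }

posts = ["he hit the other person", "I am happy", "they fight and hit"]
counts = count_categories(posts, toy_analyze)
print(counts.most_common())
```

With Empath, you would pass `lexicon.analyze` as the second argument, for example `count_categories(my_posts, lexicon.analyze)`, where `my_posts` is your own list of texts.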

In [128]:
import tweepy 
import json

In [129]:
# Function to read the key file and load keys in a dictionary
def loadKeys(key_file):
    with open(key_file) as f:
        key_dict = json.load(f)
    return key_dict['api_key'], key_dict['api_secret'], key_dict['token'], key_dict['token_secret']

KEY_FILE = 'keys.json'
api_key, api_secret, token, token_secret = loadKeys(KEY_FILE)
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(token, token_secret)
api = tweepy.API(auth)

In [130]:
search_term = "COVID19"
new_search = search_term + " -filter:retweets"
no_of_pages = 1

# Note: `api.search` is the Tweepy 3.x name; Tweepy 4.x renamed it to `api.search_tweets`
for page in tweepy.Cursor(api.search, q=new_search, lang="en").pages(no_of_pages):
    for status in page:
        print("\033[1mtweet :\033[0m " + status.text)
        categ_text = lexicon.analyze(status.text)
        for key, value in categ_text.items():
            if value != 0:
                print(key)

[1mtweet :[0m When the federal government’s #COVID19 Testing and Screening Expert Advisory Panel released its first report last m… https://t.co/K5X7OxMudg
crime
government
journalism
communication
meeting
work
law
[1mtweet :[0m Hospital pastor heading into ICU to pray with #COVID19 patients &amp; administer last rites, but lets bring the governm… https://t.co/u919if9b9U
divine
religion
worship
death
traveling
giving
[1mtweet :[0m Number of UK 🇬🇧 intensive care beds increased 158% in &lt;12 months due to #COVID19. Now provide 1 bed per 11,085 popu… https://t.co/AlKQSUmr3T
help
sleep
medical_emergency
furniture
healing
trust
negative_emotion
children
giving
positive_emotion
[1mtweet :[0m Looking for datasets for my research project regarding Covid19's impact on labour market. https://t.co/A51KFqxeo9 #bigdata
school
journalism
business
internet
reading
violence
meeting
injury
science
work
technology
[1mtweet :[0m #Moderna CEO says the world will have to live with Covid 'forever'