# 6.6. Sentiment Analysis using VADER 🎭

-----

## 1. Warmup


#### Take a look at the VADER GitHub repo and try to answer these questions:

1. What is sentiment analysis? What use cases can you think of?

2. What can we find in the lexicon, and more specifically, what do the four values represent?

3. Does VADER take punctuation into account? Which words does VADER consider to intensify a sentiment?

4. How does VADER score a text as a whole?

### 1.1. What is Sentiment Analysis?


             
### Main classes of Sentiment Analysis solutions:


### 1.2. What use cases are there for Sentiment Analysis?

### 1.3 What are some challenges faced when determining sentiment in NLP?




## 2. Sentiment Analysis using VADER


In [1]:
#!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


* VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model for sentiment analysis that takes into account not only the polarity (positive vs. negative) but also the intensity of a sentiment.

In [3]:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer # add this to the etl script and requirements

analyzer = SentimentIntensityAnalyzer()


#### The polarity_scores() method of VADER’s SentimentIntensityAnalyzer takes in a string and returns a dictionary with four scores:
- neg (negative)
- neu (neutral)
- pos (positive)
- compound

Note: The compound score is the sum of the lexicon ratings of the words in the text, normalized to lie between -1 (most extreme negative) and +1 (most extreme positive).
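To make the normalization step more concrete, here is a minimal sketch of how a summed valence could be squashed into the range -1 to +1. It assumes the formula x / sqrt(x² + alpha) with alpha = 15, which is how the reference implementation appears to normalize. The function name `normalize` and the input value -2.8 are illustrative assumptions, not part of this notebook.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded summed valence into the open interval (-1, 1).

    Assumption: alpha=15 mirrors the constant used by the VADER reference code.
    """
    return score / math.sqrt(score * score + alpha)


# Hypothetical summed valence for a short, strongly negative text
print(round(normalize(-2.8), 4))
```

Under the assumed valence of -2.8, the result rounds to -0.5859, which matches the compound value returned for "WTF" in the next cell.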

In [4]:
analyzer.polarity_scores("WTF")




{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound': -0.5859}

#### VADER analyses sentiments primarily based on certain key points:
    

Punctuation: An exclamation mark (!) increases the magnitude of the sentiment intensity without changing its polarity. For example, “The party was good!” is more intense than “The party was good”, and adding more exclamation marks increases the magnitude accordingly.
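The punctuation rule can be sketched in a few lines. This is a simplified illustration, assuming the constants that the reference implementation appears to use (0.292 per exclamation mark, counting at most four). The function name `exclamation_emphasis` is hypothetical.

```python
def exclamation_emphasis(text, per_mark=0.292, max_marks=4):
    """Return the extra emphasis contributed by exclamation marks.

    Assumption: each '!' adds 0.292, and marks beyond the fourth are ignored.
    The emphasis is later added to (or subtracted from) the summed valence,
    depending on its sign, so polarity is never flipped.
    """
    return min(text.count("!"), max_marks) * per_mark


for sentence in ["The party was good", "The party was good!", "The party was good!!!!!"]:
    print(sentence, "->", round(exclamation_emphasis(sentence), 3))
```

Under these assumptions the effect saturates, so a fifth exclamation mark adds nothing beyond the fourth.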

In [10]:
analyzer.polarity_scores("The party was AWESOME!!!!!")


{'neg': 0.0, 'neu': 0.187, 'pos': 0.813, 'compound': 0.8658}

In [9]:
analyzer.polarity_scores("The party was shit")

{'neg': 0.434, 'neu': 0.241, 'pos': 0.325, 'compound': -0.2263}

Capitalization: Using upper case letters to emphasize a sentiment-relevant word in the presence of other non-capitalized words increases the magnitude of the sentiment intensity. For example, “The party was GREAT!” conveys more intensity than “The party was great!”
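A rough sketch of the capitalization rule follows. It assumes an increment of 0.733 for an all-caps sentiment word in a mixed-case text, which is the constant the reference implementation appears to use. The helper names and the valence 3.1 are hypothetical.

```python
C_INCR = 0.733  # assumed emphasis increment for ALL-CAPS words


def is_mixed_case(tokens):
    """True if some, but not all, tokens are written in upper case."""
    upper = sum(1 for token in tokens if token.isupper())
    return 0 < upper < len(tokens)


def caps_adjust(token, valence, tokens):
    """Push the valence further from zero if the token is shouted in a mixed-case text."""
    if valence == 0 or not token.isupper() or not is_mixed_case(tokens):
        return valence
    return valence + C_INCR if valence > 0 else valence - C_INCR


tokens = ["The", "party", "was", "GREAT"]
print(caps_adjust("GREAT", 3.1, tokens))  # 3.1 is a made-up lexicon valence
```

Note that a text written entirely in capitals gets no boost in this sketch, because the emphasis only counts relative to non-capitalized neighbours.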





In [12]:
analyzer.polarity_scores("The service here is good BUT the food is shit.")

{'neg': 0.33, 'neu': 0.539, 'pos': 0.131, 'compound': -0.6059}

In [13]:
analyzer.polarity_scores("The service here is good but the food is shit.")

{'neg': 0.33, 'neu': 0.539, 'pos': 0.131, 'compound': -0.6059}

Degree modifiers: Also called intensifiers, they either increase or decrease the sentiment intensity. For example, “The service here is extremely good” is more intense than “The service here is good”, whereas “The service here is marginally good” is less intense.
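As an illustration of degree modifiers, the sketch below shifts a word's valence by a fixed amount when it is preceded by a booster or dampener. The value 0.293 and the small word list are assumptions based on the reference implementation. The function name `apply_booster` and the example valences are hypothetical.

```python
B_INCR = 0.293   # assumed boost for intensifiers such as "extremely"
B_DECR = -0.293  # assumed reduction for dampeners such as "marginally"

# Tiny illustrative subset; the real booster dictionary is much larger
BOOSTERS = {
    "extremely": B_INCR,
    "very": B_INCR,
    "marginally": B_DECR,
    "slightly": B_DECR,
}


def apply_booster(valence, preceding_word):
    """Move the valence away from zero for boosters and toward zero for dampeners."""
    scalar = BOOSTERS.get(preceding_word.lower(), 0.0)
    if valence < 0:
        scalar = -scalar  # keep the adjustment pointing the right way for negative words
    return valence + scalar


print(apply_booster(1.9, "extremely"))   # stronger than 1.9
print(apply_booster(1.9, "marginally"))  # weaker than 1.9
```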


In [21]:
analyzer.polarity_scores("Putin advanced to Ukraine. People killed")


{'neg': 0.429, 'neu': 0.381, 'pos': 0.19, 'compound': -0.5423}

Conjunctions: Use of conjunctions like “but” signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant. “The music here is great, but the service is horrible” has mixed sentiment, with the latter half dictating the overall rating.
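One way to picture the "but" rule is as a reweighting of word valences on either side of the conjunction. The sketch below assumes the weights 0.5 (before "but") and 1.5 (after "but"), which is what the reference implementation appears to apply. The function name and the valences are made up for illustration.

```python
def weight_around_but(tokens, valences, before=0.5, after=1.5):
    """Dampen valences before the first 'but' and amplify those after it."""
    lowered = [token.lower() for token in tokens]
    if "but" not in lowered:
        return list(valences)
    pivot = lowered.index("but")
    weighted = []
    for position, valence in enumerate(valences):
        if position < pivot:
            weighted.append(valence * before)
        elif position > pivot:
            weighted.append(valence * after)
        else:
            weighted.append(valence)
    return weighted


tokens = ["great", "but", "horrible"]
valences = [3.1, 0.0, -2.5]  # hypothetical lexicon valences
print(weight_around_but(tokens, valences))
```

Because the tokens are lower-cased before the lookup, "BUT" and "but" are treated the same way in this sketch, which is consistent with the identical scores in the two cells above.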


Preceding Tri-gram: By examining the tri-gram preceding a sentiment-laden lexical feature, we catch nearly 90% of cases where negation flips the polarity of the text. A negated sentence would be “The music here isn’t really all that great”.
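The negation check can be approximated as follows. This is a deliberately simplified sketch: it looks at the three tokens before a sentiment word and, if one of them is a negation, multiplies the valence by -0.74 (the scalar the reference implementation appears to use). The negation list, the function name and the valence 1.9 are assumptions, and the real rules handle more special cases.

```python
N_SCALAR = -0.74  # assumed negation scalar
NEGATIONS = {"not", "no", "never", "isn't", "don't", "cannot"}  # illustrative subset


def apply_negation(tokens, index, valence):
    """Flip and dampen the valence if a negation occurs in the preceding tri-gram."""
    window = tokens[max(0, index - 3):index]
    if any(token.lower() in NEGATIONS for token in window):
        return valence * N_SCALAR
    return valence


tokens = ["The", "food", "is", "not", "very", "good"]
print(apply_negation(tokens, tokens.index("good"), 1.9))  # 1.9 is a made-up valence
```

Multiplying by a value slightly smaller than 1 in magnitude reflects the idea that "not good" is negative, but less intensely so than "good" is positive.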


In [None]:
analyzer.polarity_scores("The service here is good BUT the food is shit.")

What do we notice about the scores?

* The neg, neu and pos scores give the proportion of the text that falls into each category.
* compound: the most useful single score. The thresholds mentioned in the repo are:
  - positive sentiment: compound score >= 0.05
  - neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
  - negative sentiment: compound score <= -0.05
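The thresholds above translate directly into a small helper. The function name `label_from_compound` is an assumption for illustration, and the cut-off of 0.05 is taken from the list above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores taken from earlier cells in this notebook
for score in (-0.5859, 0.8658, 0.0):
    print(score, "->", label_from_compound(score))
```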

In [22]:
import pandas as pd

## 3. Toy dataset: Analysing Tweets

In [26]:
#!pip install nltk



In [31]:
import nltk
from nltk.corpus import twitter_samples 
import pandas as pd
nltk.download('twitter_samples')

# get 5000 positive and 5000 negative tweets


[nltk_data] Downloading package twitter_samples to
[nltk_data]     /Users/alexandros.samartzis/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [32]:
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [33]:
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
for string in all_negative_tweets[:5]:
    print(string)

hopeless for tmr :(
Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(
@Hegelbon That heart sliding into the waste basket. :(
“@ketchBurning: I hate Japanese call him "bani" :( :(”

Me too
Dang starting next week I have "work" :(


In [35]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
for string in all_positive_tweets[:5]:
    print(string)

#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!
@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!
@97sides CONGRATS :)
yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days


In [37]:
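# Note: polarity_scores is meant for a single string; here a list slice of the first 59 tweets is passed instead
analyzer.polarity_scores(all_positive_tweets[0:59])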

{'neg': 0.024, 'neu': 0.657, 'pos': 0.319, 'compound': 0.9998}

In [38]:
all_positive_tweets.extend(all_negative_tweets) # extend my list

In [39]:
df_tweets = pd.DataFrame({'tweets':all_positive_tweets})

In [40]:
df_tweets

Unnamed: 0,tweets
0,#FollowFriday @France_Inte @PKuchly57 @Milipol...
1,@Lamb2ja Hey James! How odd :/ Please call our...
2,@DespiteOfficial we had a listen last night :)...
3,@97sides CONGRATS :)
4,yeaaaah yippppy!!! my accnt verified rqst has...
...,...
9995,I wanna change my avi but uSanele :(
9996,MY PUPPY BROKE HER FOOT :(
9997,where's all the jaebum baby pictures :((
9998,But but Mr Ahmad Maslan cooks too :( https://t...


### 3.1 Clean data


In [41]:
import re # part of the standard library, so nothing to add to the requirements

In [42]:
mentions_regex = r'@[A-Za-z0-9]+'  # does not match underscores, so "@France_Inte" leaves "_Inte" behind
url_regex = r'https?://\S+'  # this will not catch all possible URLs     ### add this to etl script
hashtag_regex = r'#'
rt_regex = r'RT\s'

def clean_tweets(tweet):
    tweet = re.sub(mentions_regex, '', tweet)  #removes @mentions
    tweet = re.sub(hashtag_regex, '', tweet) #removes hashtag symbol
    tweet = re.sub(rt_regex, '', tweet) #removes RT to announce retweet
    tweet = re.sub(url_regex, '', tweet) #removes most URLs
    
    return tweet
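The cleaning function above leaves fragments of handles that contain underscores. A possible variation, shown here only as a sketch, matches word characters (including underscores) in mentions, anchors "RT" at a word boundary and collapses leftover whitespace. The function name `clean_tweet_variant` and the sample tweet are invented for illustration.

```python
import re

MENTION_RE = re.compile(r"@\w+")          # \w also covers underscores in handles
URL_RE = re.compile(r"https?://\S+")      # still will not catch every possible URL
RT_RE = re.compile(r"\bRT\s+")            # only match RT as a separate word


def clean_tweet_variant(tweet):
    """Remove mentions, hashtag symbols, retweet markers and URLs from a tweet."""
    tweet = MENTION_RE.sub("", tweet)
    tweet = tweet.replace("#", "")
    tweet = RT_RE.sub("", tweet)
    tweet = URL_RE.sub("", tweet)
    return " ".join(tweet.split())  # collapse runs of whitespace


sample = "RT @France_Inte great #FollowFriday picks https://t.co/abc :)"
print(clean_tweet_variant(sample))
```

Emoticons such as ":)" are kept on purpose, since VADER's lexicon assigns sentiment to many of them.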



In [43]:
df_tweets.tweets = df_tweets.tweets.apply(clean_tweets)
df_tweets

Unnamed: 0,tweets
0,FollowFriday _Inte _Paris for being top engag...
1,Hey James! How odd :/ Please call our Contact...
2,we had a listen last night :) As You Bleed is...
3,CONGRATS :)
4,yeaaaah yippppy!!! my accnt verified rqst has...
...,...
9995,I wanna change my avi but uSanele :(
9996,MY PUPPY BROKE HER FOOT :(
9997,where's all the jaebum baby pictures :((
9998,But but Mr Ahmad Maslan cooks too :(


### 3.2 Calculating scores

In [44]:
pol_scores = df_tweets['tweets'].apply(analyzer.polarity_scores).apply(pd.Series)

In [None]:
# Make dataframe of polarity scores



In [45]:
pol_scores.head()

Unnamed: 0,neg,neu,pos,compound
0,0.0,0.595,0.405,0.7579
1,0.149,0.572,0.279,0.6229
2,0.0,0.706,0.294,0.7959
3,0.0,0.0,1.0,0.7983
4,0.0,0.729,0.271,0.795


In [None]:
# In Postgres you will store two columns: the tweet text and its compound score


### Further Reading:

* https://www.kaggle.com/piyushagni5/sentiment-analysis-on-twitter-dataset-nlp

* https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

* https://textblob.readthedocs.io/en/dev/




### Next step:
* Get tweets from MongoDB
* Clean the tweets
* Do sentiment analysis with VADER
* Save tweet and sentiment in postgres
