# Sentiment Analysis for Exploratory Data Analysis

modified by: Erynn Gutierrez
[Link to lesson](https://programminghistorian.org/en/lessons/sentiment-analysis)

In [1]:
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The function defined below uses the VADER sentiment analysis tool. For a given text, VADER yields a set of positive, neutral, and negative scores, which are then aggregated and scaled into a ‘compound score’.

VADER also sums all weighted scores to calculate a “compound” value normalized between -1 and 1; this value attempts to describe the overall affect of the entire text, from strongly negative (-1) to strongly positive (1). In the lesson's example e-mail, the VADER analysis describes the passage as slightly-to-moderately negative (-0.3804). We can think of this value as estimating the overall impression of an average reader when considering the e-mail as a whole, despite some ambiguity and ambivalence along the way.
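To make the normalization step concrete, here is a minimal sketch of how a summed valence score can be squashed into the range -1 to 1. The function name `normalize` and the constant `alpha=15` follow what the VADER reference implementation appears to use, but treat both as assumptions here; the real tool also adjusts the raw sum for punctuation, capitalization, and negation before this step.

```python
import math


def normalize(raw_score, alpha=15.0):
    """Squash an unbounded summed valence score into the interval (-1, 1).

    alpha is assumed to match VADER's default; larger values flatten the curve.
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Illustrative raw sums (made up for this sketch, not taken from any e-mail)
for raw in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"raw sum {raw:+.1f} -> compound {normalize(raw):+.4f}")
```

Because the denominator grows with the raw sum, long texts with many sentiment-bearing words drift toward -1 or 1, which is worth keeping in mind when comparing short and long messages.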

In [2]:
# first, we import the relevant modules from the NLTK library
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyse_message(message_text):

    # next, we initialize VADER so we can use it within our Python script
    sid = SentimentIntensityAnalyzer()

    # Calling the polarity_scores method on sid and passing in the message_text outputs a dictionary with negative, neutral, positive, and compound scores for the input text
    scores = sid.polarity_scores(message_text)

    print(message_text)

    # Here we loop through the keys contained in scores (pos, neu, neg, and compound scores) and print the key-value pairs on the screen
    for key in sorted(scores):
        print('{0}: {1}, '.format(key, scores[key]), end='')

In [3]:
# setting message text

message_1 = "Looks great.  I think we should have a least 1 or 2 real time traders in Calgary."

message_2 = """I think we are making great progress on the systems side.  I would like to
set a deadline of November 10th to have a plan on all North American projects
(I'm ok if fundementals groups are excluded) that is signed off on by
commercial, Sally's world, and Beth's world.  When I say signed off I mean
that I want signitures on a piece of paper that everyone is onside with the
plan for each project.  If you don't agree don't sign. If certain projects
(ie. the gas plan) are not done yet then lay out a timeframe that the plan
will be complete.  I want much more in the way of specifics about objectives
and timeframe.

Thanks for everyone's hard work on this."""

analyse_message(message_1)
print("\n")
analyse_message(message_2)

Looks great.  I think we should have a least 1 or 2 real time traders in Calgary.
compound: 0.6249, neg: 0.0, neu: 0.745, pos: 0.255, 

I think we are making great progress on the systems side.  I would like to
set a deadline of November 10th to have a plan on all North American projects
(I'm ok if fundementals groups are excluded) that is signed off on by
commercial, Sally's world, and Beth's world.  When I say signed off I mean
that I want signitures on a piece of paper that everyone is onside with the
plan for each project.  If you don't agree don't sign. If certain projects
(ie. the gas plan) are not done yet then lay out a timeframe that the plan
will be complete.  I want much more in the way of specifics about objectives
and timeframe.

Thanks for everyone's hard work on this.
compound: 0.8951, neg: 0.042, neu: 0.821, pos: 0.136, 

Here you can see that, when analyzing the e-mail as a whole, VADER returns values that suggest the message is mostly neutral (neu: 0.821) but that more features appear to be positive (pos: 0.136) than negative (neg: 0.042). VADER computes an overall sentiment score of 0.8951 for the message (on a scale of -1 to 1), which suggests a strongly positive affect for the message as a whole.
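When working through many messages, it can help to turn the compound score into a coarse label. The sketch below is one way to do that; the helper name `label_compound` is hypothetical, and the cutoff of 0.05 is a commonly cited convention rather than anything fixed by this notebook, so adjust it to suit your corpus.

```python
def label_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label.

    threshold is an assumed cutoff; scores strictly inside
    (-threshold, threshold) are treated as neutral.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the two outputs above
for name, compound in (("message_1", 0.6249), ("message_2", 0.8951)):
    print(f"{name}: {compound} -> {label_compound(compound)}")
```

A label like this discards most of the nuance in the scores, so it is best used for a first pass over the data and not as a final interpretation.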

In [4]:
message_3 = '''It seems to me we are in the middle of no man's land with respect to the  following:  Opec production speculation, Mid east crisis and renewed  tensions, US elections and what looks like a slowing economy (?), and no real weather anywhere in the world. I think it would be most prudent to play  the markets from a very flat price position and try to day trade more aggressively. I have no intentions of outguessing Mr. Greenspan, the US. electorate, the Opec ministers and their new important roles, The Israeli and Palestinian leaders, and somewhat importantly, Mother Nature.  Given that, and that we cannot afford to lose any more money, and that Var seems to be a problem, let's be as flat as possible. I'm ok with spread risk  (not front to backs, but commodity spreads). The morning meetings are not inspiring, and I don't have a real feel for  everyone's passion with respect to the markets.  As such, I'd like to ask  John N. to run the morning meetings on Mon. and Wed.  Thanks. Jeff'''
analyse_message(message_3)

It seems to me we are in the middle of no man's land with respect to the  following:  Opec production speculation, Mid east crisis and renewed  tensions, US elections and what looks like a slowing economy (?), and no real weather anywhere in the world. I think it would be most prudent to play  the markets from a very flat price position and try to day trade more aggressively. I have no intentions of outguessing Mr. Greenspan, the US. electorate, the Opec ministers and their new important roles, The Israeli and Palestinian leaders, and somewhat importantly, Mother Nature.  Given that, and that we cannot afford to lose any more money, and that Var seems to be a problem, let's be as flat as possible. I'm ok with spread risk  (not front to backs, but commodity spreads). The morning meetings are not inspiring, and I don't have a real feel for  everyone's passion with respect to the markets.  As such, I'd like to ask  John N. to run the morning meetings on Mon. and Wed.  Thanks. Jeff
compoun

The following function scores each sentence separately instead of producing a single set of scores for the entire text.

In [None]:
import nltk
from nltk.tokenize import PunktTokenizer

# PunktTokenizer loads its sentence model from the 'punkt_tab' resource
nltk.download('punkt_tab')

# MODIFIED from Programming Historian due to pickling being removed from NLTK
# referenced NLTK documentation + Stack Overflow for the nltk.download tip!

def analyze_sentences(text):
    sent_detector = PunktTokenizer()
    sid = SentimentIntensityAnalyzer()

    # Tokenize and analyze
    sentences = sent_detector.tokenize(text)

    for sentence in sentences:
        print(sentence)
        scores = sid.polarity_scores(sentence)
        for key in sorted(scores):
            print(f'{key}: {scores[key]}, ', end='')
        print("\n")
        print()

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [14]:
my_text = '''
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.  And sometimes sentences
can start with non-capitalized words.  i is a good variable
name.
'''
analyze_sentences(my_text)


Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
compound: -0.0572, neg: 0.071, neu: 0.929, pos: 0.0, 


And sometimes sentences
can start with non-capitalized words.
compound: 0.0516, neg: 0.0, neu: 0.854, pos: 0.146, 


i is a good variable
name.
compound: 0.4404, neg: 0.0, neu: 0.508, pos: 0.492, 
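Once each sentence has its own scores, a natural next step in exploratory analysis is to summarize them. The following sketch works on the compound values printed above and does not call NLTK; the helper name `summarize_sentences` and the choice of a simple mean are assumptions made for illustration. Note that averaging per-sentence compound scores is not the same as VADER's compound score for the whole text.

```python
from statistics import mean

# (sentence, compound) pairs copied from the output above
sentence_scores = [
    ("Punkt knows that the periods in Mr. Smith and Johann S. Bach "
     "do not mark sentence boundaries.", -0.0572),
    ("And sometimes sentences can start with non-capitalized words.", 0.0516),
    ("i is a good variable name.", 0.4404),
]


def summarize_sentences(scored):
    """Return the mean compound score plus the extreme sentences."""
    compounds = [compound for _, compound in scored]
    return {
        "mean_compound": mean(compounds),
        "most_negative": min(scored, key=lambda pair: pair[1]),
        "most_positive": max(scored, key=lambda pair: pair[1]),
    }


summary = summarize_sentences(sentence_scores)
print(f"mean compound: {summary['mean_compound']:.4f}")
print(f"most negative: {summary['most_negative'][0]}")
print(f"most positive: {summary['most_positive'][0]}")
```

Looking at the extremes alongside the mean helps show whether a message is uniformly mild or contains a few strongly worded sentences that a single whole-text score would hide.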


