# In-Class Activity, February 7, 2023

## Part 1: Sentiment Analysis

In "Data Reimagined," you read about how researchers have analyzed speech and written text to draw conclusions about how people interact with each other. These kinds of analyses are part of a data science sub-field called [natural language processing, or NLP](https://en.wikipedia.org/wiki/Natural_language_processing).

In this activity, we'll do a simple NLP project using a method called [_sentiment analysis_](https://en.wikipedia.org/wiki/Sentiment_analysis), which looks at how many positive and negative words are used in a piece of text.

This activity is adapted from:

Zoë Wilkinson Saldaña, ["Sentiment Analysis for Exploratory Data Analysis,"](https://programminghistorian.org/en/lessons/sentiment-analysis) Programming Historian 7 (2018), https://doi.org/10.46430/phen0079.

### Getting started: Importing packages

We'll use a new package this time, nltk, as well as a couple of other features. Run the cells below to install everything you need.

In [1]:
import pandas as pd

In [2]:
# We'll be using a package called nltk, or the Natural Language Toolkit
# If you are using Anaconda, it should already be installed and you just need to import it
import nltk

# This is the sentiment analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# If you run this block of code and get an error, you may need to install or update nltk
# See: https://www.nltk.org/install.html

We will be using a method called VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis ([GitHub page here](https://github.com/cjhutto/vaderSentiment#about-the-scoring)), which essentially looks at each individual word in a piece of text and marks it as either negative, positive, or neutral.

The VADER lexicon has ~7500 words in it, and you can [view it here.](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt)

**Some negative words**: awful, avoiding, boredom, ):<

**Some positive words**: awesome, boldest, clever, :)

**Some neutral words**: aboard, amorphously

The code below gets the VADER dataset, as well as a tokenizer (we'll get to it -- it turns a big block of text into smaller pieces, in our case, sentences).

In [3]:
# This is the VADER dataset
nltk.download('vader_lexicon')
# This is the tokenizer
nltk.download('punkt')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/tenzinuden/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tenzinuden/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
# Let's start with a couple of sample sentences...
aSentence = "Happy families are all alike; every unhappy family is unhappy in its own way."
bSentence = "I love data science :)"

In [5]:
# This code creates a sentiment intensity object (which we'll call sid)
sid = SentimentIntensityAnalyzer()

In [25]:
# This is how we get the sentiment scores for that sentence
scores = sid.polarity_scores(aSentence)
# Putting the variable name on the last line displays its value
scores


{'neg': 0.276, 'neu': 0.542, 'pos': 0.182, 'compound': -0.2263}

In [26]:
scoresb=sid.polarity_scores(bSentence)
scoresb

{'neg': 0.0, 'neu': 0.217, 'pos': 0.783, 'compound': 0.802}

In [7]:
# The sentiment scores are in the form of a dictionary
# This prints them in a nice format
for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')

compound: -0.2263, neg: 0.276, neu: 0.542, pos: 0.182, 

How do we read these scores?

* _compound_ is the overall sentiment for the piece of text. It is a number between -1 and 1, with -1 being the most negative possible, and 1 being the most positive possible. So a score of -0.2263 is only slightly negative.

* _neg_, _neu_, and _pos_ tell us roughly what proportion of this piece of text is negative, neutral, or positive (weighted by how strongly each word is rated). In the sentence above, 27.6% is negative, 54.2% is neutral, and 18.2% is positive.
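To make these numbers concrete, here is a small sketch that turns a compound score into a label. The cutoffs of 0.05 and -0.05 are the thresholds suggested in the VADER documentation; the helper `label_sentiment` is our own illustration, not part of nltk. The scores are copied from the two example sentences above.

```python
# Scores copied from the outputs above
scores_a = {'neg': 0.276, 'neu': 0.542, 'pos': 0.182, 'compound': -0.2263}
scores_b = {'neg': 0.0, 'neu': 0.217, 'pos': 0.783, 'compound': 0.802}

def label_sentiment(compound, threshold=0.05):
    # label_sentiment is a hypothetical helper, not an nltk function
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

print(label_sentiment(scores_a['compound']))
print(label_sentiment(scores_b['compound']))

# neg, neu, and pos are proportions, so they should add up to (about) 1
print(round(scores_a['neg'] + scores_a['neu'] + scores_a['pos'], 3))
```

A label like this throws away how strong the sentiment is, so it is best used alongside the compound score rather than instead of it.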

### Now let's read in a bigger text file -- the chapter you read and annotated

In [8]:
# This code opens up the text file and saves it as a variable called chapter
with open('data-reimagined.txt') as f:
    chapter = f.read()

In [27]:
# Let's see what it looks like
chapter
# YOUR CODE HERE

'At 6 A.M. on a particular Friday of every month, the streets of most of Manhattan will be largely desolate. The stores lining these streets will be closed,\ntheir façades covered by steel security gates, the apartments above dark and silent.\nThe floors of Goldman Sachs, the global investment banking institution in lower Manhattan, on the other hand, will be brightly lit, its elevators taking\nthousands of workers to their desks. By 7 A.M. most of these desks will be occupied.\nIt would not be unfair on any other day to describe this hour in this part of town as sleepy. On this Friday morning, however, there will be a buzz of\nenergy and excitement. On this day, information that will massively impact the stock market is set to arrive.\nMinutes after its release, this information will be reported by news sites. Seconds after its release, this information will be discussed, debated, and\ndissected, loudly, at Goldman and hundreds of other financial firms. But much of the real action in 

In [10]:
# This code gets the sentiment scores for the entire chapter (as one block)
scores = sid.polarity_scores(chapter)

In [11]:
# And let's print them out
for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')

compound: 1.0, neg: 0.058, neu: 0.828, pos: 0.115, 

What is going on here? Why is the compound score 1, when most of the words are neutral, and positive words are twice as likely as negative ones?

It has to do with how VADER calculates the compound score, which is based on the sum of the scores of all of the words. As a text gets more and more words, that sum keeps growing, and the compound score approaches either 1 or -1! (For more on that, check out [this blog post.](https://medium.com/@piocalderon/vader-sentiment-analysis-explained-f1c4f9101cd9))

(And the lesson here is: Always check how analysis methods you are using actually work!)
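Here is a minimal sketch of that effect. It assumes the normalization used in the VADER reference implementation, where the sum of the word scores is divided by the square root of that sum squared plus a constant (15). The function below is a simplified stand-in written for this activity, and the sample sums are made up to show the trend.

```python
import math

def normalize(score_sum, alpha=15):
    # Squash an unbounded sum of word scores into the range -1 to 1
    # (assumed formula; alpha=15 is the constant we believe VADER uses)
    norm = score_sum / math.sqrt(score_sum ** 2 + alpha)
    return max(-1.0, min(1.0, norm))

# The larger the sum (which tends to happen in longer texts),
# the closer the result gets to 1
for total in [0, 1, 4, 20, 100, 1000]:
    print(total, round(normalize(total), 4))
```

A whole chapter contains hundreds of scored words, so the sum is large and the result is pinned near 1 no matter how mixed the individual sentences are.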

### Tokenizing: Breaking up a big chunk of text into smaller pieces

So, in order to do analysis on this chapter, we'll need to break it up into sentences. This is called _tokenization_.
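Why not just split the text at every period? The sketch below tries that with a simple rule (split wherever ., !, or ? is followed by a space) on the opening of the chapter. The second sentence is shortened here for the example, and `naive_sentences` is just a name we picked.

```python
import re

text = ("At 6 A.M. on a particular Friday of every month, the streets of most of "
        "Manhattan will be largely desolate. The stores lining these streets will be closed.")

# Naive rule: split wherever ., !, or ? is followed by whitespace
naive_sentences = re.split(r'(?<=[.!?])\s+', text)

for piece in naive_sentences:
    print(repr(piece))
```

The rule breaks the first sentence in two after "A.M.", so we get three pieces instead of two. A trained tokenizer is designed to handle abbreviations like this.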

We'll use a built-in tokenizer in nltk.

In [28]:
# Make the tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Tokenize the chapter -- and save the tokens as a variable called sentences
sentences = tokenizer.tokenize(chapter)
sentences
# Note that sentences is a list

['At 6 A.M. on a particular Friday of every month, the streets of most of Manhattan will be largely desolate.',
 'The stores lining these streets will be closed,\ntheir façades covered by steel security gates, the apartments above dark and silent.',
 'The floors of Goldman Sachs, the global investment banking institution in lower Manhattan, on the other hand, will be brightly lit, its elevators taking\nthousands of workers to their desks.',
 'By 7 A.M. most of these desks will be occupied.',
 'It would not be unfair on any other day to describe this hour in this part of town as sleepy.',
 'On this Friday morning, however, there will be a buzz of\nenergy and excitement.',
 'On this day, information that will massively impact the stock market is set to arrive.',
 'Minutes after its release, this information will be reported by news sites.',
 'Seconds after its release, this information will be discussed, debated, and\ndissected, loudly, at Goldman and hundreds of other financial firms.',

In [13]:
# Let's examine...

# YOUR CODE HERE

In [14]:
# And pull out a sentence

# YOUR CODE HERE

In [15]:
# Build a list of compound scores for each of our sentences
compound_scores = []
for sentence in sentences:
    # This selects the compound score from the dictionary of sentiment scores
    compound_scores.append(sid.polarity_scores(sentence)['compound'])

In [16]:
compound_scores

[0.0,
 0.34,
 0.5423,
 0.0,
 0.3724,
 0.6486,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0258,
 0.0,
 0.6113,
 -0.4404,
 0.0,
 -0.6666,
 -0.3182,
 -0.3612,
 0.0,
 -0.296,
 0.0,
 0.6705,
 -0.4019,
 0.4588,
 -0.7783,
 0.0,
 0.0,
 -0.2023,
 0.25,
 0.0,
 0.0,
 -0.5994,
 0.4101,
 -0.0258,
 -0.4404,
 0.0,
 0.3898,
 0.0,
 0.3898,
 0.0,
 -0.1027,
 -0.6249,
 -0.2057,
 0.5831,
 0.6808,
 0.4497,
 0.0,
 0.34,
 0.2732,
 0.8885,
 0.4019,
 0.0,
 -0.2023,
 0.0,
 0.5106,
 0.0,
 0.6369,
 0.8649,
 -0.296,
 0.25,
 0.8047,
 0.0,
 0.6369,
 0.4215,
 0.0,
 0.3804,
 0.4215,
 0.0,
 0.4404,
 0.6369,
 0.0,
 0.0,
 0.0,
 0.0,
 0.1531,
 0.296,
 0.5256,
 0.4754,
 0.2263,
 0.8316,
 0.0,
 0.0,
 0.3384,
 0.4927,
 -0.128,
 0.4404,
 0.4754,
 0.2263,
 0.0,
 0.0,
 0.0,
 -0.4497,
 0.6808,
 0.0,
 0.0,
 0.7684,
 0.4215,
 0.0,
 0.0,
 0.7184,
 0.4215,
 -0.235,
 0.0,
 -0.5409,
 0.7543,
 -0.5267,
 0.0,
 -0.6124,
 -0.296,
 0.0,
 0.6369,
 0.0,
 0.1901,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.0352,
 0.0,
 0.0,
 0.4404,
 -0.2755,
 0.0,
 0.6789,
 0.0,
 -0.7

In [17]:
# Now let's put them in a dataframe
# There are many ways to do this, but here's one:
# First we "zip" the two lists into a list of (sentence, score) tuples
sentence_sentiment = list(zip(sentences, compound_scores))
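If `zip` is new to you, this short sketch shows what it does with two small lists. The two sentences and their scores are copied from the chapter output; the variable names are only for this example.

```python
# Two short lists standing in for sentences and compound_scores
example_sentences = ['By 7 A.M. most of these desks will be occupied.',
                     'On this Friday morning, however, there will be a buzz of\nenergy and excitement.']
example_scores = [0.0, 0.6486]

# zip pairs up the first items, then the second items, and so on
pairs = list(zip(example_sentences, example_scores))
print(pairs)

# Each item is one (sentence, score) tuple
first_sentence, first_score = pairs[0]
print(first_score)
```

Because each tuple holds one sentence and its score, each one becomes one row when the list is handed to pandas.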

In [18]:
# Then we create a new data frame, df, and give it the list of tuples and a list of column names
df = pd.DataFrame(sentence_sentiment, columns=['Sentence','Compound Score'])

In [19]:
# And let's examine it
df.head(20)
# YOUR CODE HERE

| | Sentence | Compound Score |
|---|---|---|
| 0 | At 6 A.M. on a particular Friday of every mont... | 0.0 |
| 1 | The stores lining these streets will be closed... | 0.34 |
| 2 | The floors of Goldman Sachs, the global invest... | 0.5423 |
| 3 | By 7 A.M. most of these desks will be occupied. | 0.0 |
| 4 | It would not be unfair on any other day to des... | 0.3724 |
| 5 | On this Friday morning, however, there will be... | 0.6486 |
| 6 | On this day, information that will massively i... | 0.0 |
| 7 | Minutes after its release, this information wi... | 0.0 |
| 8 | Seconds after its release, this information wi... | 0.0 |
| 9 | But much of the real action in finance these d... | 0.0 |


In [20]:
# Sort to see our range
df.sort_values('Compound Score')
# YOUR CODE HERE

| | Sentence | Compound Score |
|---|---|---|
| 510 | In other words, just from the words, the compu... | -0.8934 |
| 474 | If someone writes “I am sad thinking about all... | -0.8779 |
| 381 | But many have long\nsuspected the cause was th... | -0.8271 |
| 483 | Generally, I think many people are secretly sa... | -0.7964 |
| 160 | (Every year, hundreds of horses die on America... | -0.7906 |
| ... | ... | ... |
| 361 | For one thing, we learn about the slow growth\... | 0.8860 |
| 50 | But the opportunity to know how much solitaire... | 0.8885 |
| 571 | Republicans and Democrats presumably both have... | 0.8957 |
| 200 | 153 was a two-year-old who ran faster than eve... | 0.9153 |


In [21]:
# Most positive sentence
# Hint: Use .iloc to select a row, and then select the 'Sentence' column from that row
df.iloc[473]['Sentence']
# YOUR CODE HERE

'If someone writes “I am happy and in love and feeling awesome,”\nsentiment analysis would code that as extremely happy text.'

In [22]:
# Most negative sentence
df.iloc[510]['Sentence']

# YOUR CODE HERE

'In other words, just from the words, the computer was able to detect that things go from bad to worse to worst.'
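Row numbers like 510 only work for this particular chapter. As an alternative, here is a sketch that finds the highest and lowest scores directly from a list of (sentence, score) pairs like `sentence_sentiment`. The three pairs below are copied from the outputs above; `sample` is a stand-in name for this example.

```python
# A few (sentence, compound score) pairs copied from the outputs above
sample = [
    ('By 7 A.M. most of these desks will be occupied.', 0.0),
    ('It would not be unfair on any other day to describe this hour in this part of town as sleepy.', 0.3724),
    ('In other words, just from the words, the computer was able to detect that things go from bad to worse to worst.', -0.8934),
]

# key picks out the score (the second item in each pair) for comparison
most_positive = max(sample, key=lambda pair: pair[1])
most_negative = min(sample, key=lambda pair: pair[1])

print(most_positive)
print(most_negative)
```

In pandas, `df['Compound Score'].idxmax()` and `.idxmin()` return the index labels of the largest and smallest values, which you could pass to `df.loc` instead of typing a row number.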

In [23]:
# What other sentences might you examine?
# Note: a cell only displays the value of its last line, so only row 0 appears below
df.iloc[200]['Sentence']
df.iloc[0]['Sentence']

# YOUR CODE HERE

'At 6 A.M. on a particular Friday of every month, the streets of most of Manhattan will be largely desolate.'

In [24]:
# Challenge 1: 
# Add columns in your data frame for the percent of positive, negative, and neutral words in each sentence
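# polarity_scores returns proportions between 0 and 1, so multiply by 100 to get percents
df['Percent_of_Positive'] = df['Sentence'].apply(lambda s: sid.polarity_scores(s)['pos'] * 100)
df['Percent_of_Negative'] = df['Sentence'].apply(lambda s: sid.polarity_scores(s)['neg'] * 100)
df['Percent_of_Neutral'] = df['Sentence'].apply(lambda s: sid.polarity_scores(s)['neu'] * 100)
# What sentence has the highest proportion of positive words?
# What sentence has the highest proportion of negative words?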

In [None]:
# Challenge 2:
# Make a plot or plots to visualize the sentiment data for the "Data Reimagined" chapter
# You might start with a line plot (see notes from last week)
# If you want to try something new, check out the histogram: https://plotly.com/python/histograms/
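Before reaching for a plotting library, it can help to see what a histogram is actually counting. This is a rough sketch, not a finished answer to the challenge: it sorts the first 20 compound scores from the list above into bins of width 0.5 and prints a text bar for each bin. The bin width and variable names are arbitrary choices for this example.

```python
import math
from collections import Counter

# The first 20 compound scores from the list above
first_scores = [0.0, 0.34, 0.5423, 0.0, 0.3724, 0.6486, 0.0, 0.0, 0.0, 0.0,
                0.0, 0.0258, 0.0, 0.6113, -0.4404, 0.0, -0.6666, -0.3182, -0.3612, 0.0]

bin_width = 0.5
# Each score falls in the bin that starts at floor(score / bin_width) * bin_width
bin_counts = Counter(math.floor(score / bin_width) for score in first_scores)

for bin_index in sorted(bin_counts):
    start = bin_index * bin_width
    print(f'{start:5.1f} to {start + bin_width:4.1f}: ' + '#' * bin_counts[bin_index])
```

A plotting library does the same counting for you and draws bars instead of # characters.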

In [None]:
# Challenge 3:
# Pick another piece of text to analyze.
# It could be a paper you wrote, or social media posts, or your annotations on the chapter, or anything...
# Perform a basic sentiment analysis on it