# Sentiment Analysis

Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
from requests import get
import re

import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import nltk

from nltk.sentiment import SentimentIntensityAnalyzer

## Sentiment from Text

The purpose of **text analysis** is to extract meaning from text data. This involves cleaning and processing text data, as well as using analysis methods that are able to get something **quantitative** out of something that doesn't inherently have **numbers**. So far, we've used **topic modeling** to do some of this to get an idea of what topics were discussed within documents. 

Another way to extract meaning from text is by assigning values of **sentiment**. The words we use have meaning, and we can assign measures of what they are intended to portray. For example, the word "bad" is generally a negative sentiment (slang usage notwithstanding), while "good" has a positive sentiment. The word "hurt" generally also has a negative sentiment, while "heal" has a positive one. In this way, we can attempt to put different words onto the same scale and measure the overall sentiment of text.

In this section, we will look at doing a type of analysis called **sentiment analysis**, which is a class of techniques designed to extract this type of meaning from text data. In particular, we'll look at one dictionary-based method called **VADER** (Valence Aware Dictionary and Sentiment Reasoner). 

## Example with Twitter Data

VADER is a **dictionary-based** method, meaning it is pre-built and comes with a list of words and scores. To use it, we need to download the list of words with scores, then apply those scores to the words within our documents. Combining the scores of the words/tokens within our document gives us the overall sentiment of the document. For VADER, we will get back the negative, neutral, positive, and compound scores of a document.
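
To make the idea of a dictionary-based method concrete, here is a minimal sketch of the lookup-and-combine step. The four-word `toy_lexicon` and its scores are made up for illustration and are not VADER's actual values. The real VADER also applies rules for things like negation, punctuation, capitalization, and intensifiers, which are omitted here. The normalization `x / sqrt(x**2 + 15)` follows the one used in the VADER reference implementation to keep the compound score between -1 and 1.

```python
import math

# Hypothetical mini-lexicon: these scores are illustrative, not VADER's.
toy_lexicon = {"good": 1.9, "bad": -2.5, "heal": 1.5, "hurt": -2.4}

def toy_compound(text, lexicon=toy_lexicon, alpha=15):
    """Sum the scores of known words, then squash the total into (-1, 1)."""
    tokens = text.lower().split()
    total = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return total / math.sqrt(total ** 2 + alpha)

print(toy_compound("good food but bad service"))  # slightly negative
print(toy_compound("time will heal"))             # positive
print(toy_compound("the meeting is at noon"))     # 0.0, no scored words
```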

To use VADER, we first download the `vader_lexicon` resource. 

In [None]:
nltk.download('vader_lexicon')

VADER was developed for social media data, such as posts on Twitter. That is, the dictionary scores that are part of VADER were developed with shorter posts and some slang in mind. The `nltk` package comes with some sample Twitter data to test these methods on. 

In [None]:
nltk.download('twitter_samples')

We'll first do some light cleaning of the data to make any existing links unclickable. The `tweets` object below should be a list with all tweets from the sample dataset provided with `nltk`.

In [None]:
# make it so that we can't accidentally click on links
tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

In [None]:
tweets[:10]

To do sentiment analysis with VADER, we first create a `SentimentIntensityAnalyzer` object. This works similarly to how we did topic modeling with Latent Dirichlet Allocation. We then provide the data that we want scored. 

In [None]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(tweets[0])

The scores for negative, neutral, and positive each fall between 0 and 1 and indicate the proportion of the document that carries that type of sentiment, so the three add up to (approximately) 1. The compound score is a value from -1 to 1 that provides an overall summary of how positive or negative the document is in its sentiment. 

The compound score is most often used, and typically, the threshold for being considered positive, neutral, or negative is as follows:
- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05
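
The thresholds above can be written as a small helper. The function name `label_sentiment` is hypothetical, and the 0.05 default simply mirrors the conventional cutoff listed above.

```python
def label_sentiment(compound, cutoff=0.05):
    """Map a compound score to 'positive', 'neutral', or 'negative'."""
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"

for score in (0.62, 0.05, 0.0, -0.3):
    print(score, label_sentiment(score))
```

With a scored tweet, this could be called as `label_sentiment(sia.polarity_scores(tweets[0])['compound'])`.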

<font color ='red'>**Question 1: Create a list called `sentiments` that contains the compound sentiment for each tweet in `tweets`.**</font>

We can also do some basic analyses of the sentiment scores. For example, we can find some summary statistics, as well as create graphs of the distribution of the sentiment score.

In [None]:
plt.hist(sentiments, bins = 20)

<font color ='red'>**Question 2: What is the mean compound sentiment in `sentiments`?**</font>

## Data - NYT 2021 Archive

Though VADER was developed for shorter social media posts, we can still use it for other types of text as well. It generally works best with shorter documents, though, and longer forms of text such as movie reviews would be better served by breaking them apart and looking at sentiment of individual sentences.
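
One way to handle longer documents, sketched below, is to split the text into sentences, score each one, and average the results. The helper name `mean_sentence_score` and the simple regular-expression splitter are assumptions made for illustration (a tokenizer such as `nltk.sent_tokenize` would be more robust), and averaging is only one of several reasonable ways to combine sentence scores. The scorer is passed in as a function so that the sketch runs without the VADER lexicon.

```python
import re

def mean_sentence_score(text, score_fn):
    """Split text into sentences, score each, and return the average."""
    # Naive split on ., !, or ? followed by whitespace (assumption: fine
    # for a demo, but not for abbreviations like "Dr." or "U.S.").
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(score_fn(s) for s in sentences) / len(sentences)

# Stand-in scorer for the demo: +1 if "great" appears, -1 if "awful" does.
def toy_score(sentence):
    s = sentence.lower()
    return ("great" in s) - ("awful" in s)

review = "The cast was great. The plot was awful! The ending was great."
print(mean_sentence_score(review, toy_score))
```

With VADER, the scorer would be something like `lambda s: sia.polarity_scores(s)['compound']`.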

The abstract data from the New York Times API is relatively short and can actually work quite well for this type of analysis. As before, let's bring in all articles in 2021 from the NYT Archive.

In [None]:
nyt_2021 = pd.read_csv('nyt_2021.csv').dropna()

In [None]:
nyt_2021.head()

Let's take a quick look at one abstract to see what it would be scored as.

In [None]:
# .iloc[0] picks the first remaining row, even if dropna() removed index label 0
sia.polarity_scores(nyt_2021.abstract.iloc[0])

Let's take a look at the abstract to see what it actually says.

In [None]:
# .iloc[0] picks the first remaining row, even if dropna() removed index label 0
nyt_2021.abstract.iloc[0]

<font color ='red'>**Question 3: Find the compound score for each abstract in `nyt_2021`. Create a new column in `nyt_2021` called `sentiment_score` that contains the compound score.**</font>

## Using Sentiment Scores

Now that we have calculated sentiment scores for our article abstracts, we can look at summaries and try to understand more about the abstracts using various summary statistics and graphs.

In [None]:
nyt_2021.sentiment_score.describe()

In [None]:
nyt_2021.sentiment_score.hist()

We can also use them to look at trends and differences across different types of articles and types of content. For example, let's take a look at all Op-Ed articles about Biden. 

In [None]:
biden_op_df = nyt_2021[(nyt_2021.type_of_material == 'Op-Ed') & (nyt_2021.abstract.str.contains('Biden'))]

To get an idea of the sentiment of the Op-Eds written about Biden, we can first create a quick histogram. 

In [None]:
biden_op_df.sentiment_score.hist()

Most of them seemed to be relatively neutral, but there are peaks around -0.5 and 0.5 indicating articles that were positive or negative. This makes sense for Op-Eds since they are opinion pieces, so we might expect stronger positive or negative language. 

If we wanted to see trends, we could also look at the sentiment over time. Let's take a look at the average sentiment for these articles by month. The seaborn `sns.lineplot` does the aggregation for us and even draws a shaded band around the line (by default, a 95% confidence interval for the mean) to give an idea of how uncertain each monthly average is.
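
When a month has several articles, the aggregation that `sns.lineplot` performs is, by default, a mean of the `y` values at each `x` value. The sketch below reproduces just that step in plain Python on made-up `(month, score)` pairs so the arithmetic is visible. The numbers are hypothetical and are not taken from `nyt_2021`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (month, sentiment_score) pairs, not real NYT data.
rows = [(7, 0.4), (7, 0.0), (8, -0.6), (8, -0.2), (8, 0.2), (9, 0.5)]

by_month = defaultdict(list)
for month, score in rows:
    by_month[month].append(score)

# One mean per month: these are the points that the line connects.
monthly_mean = {m: mean(scores) for m, scores in sorted(by_month.items())}
print(monthly_mean)
```

On the DataFrame itself, `biden_op_df.groupby('month')['sentiment_score'].mean()` should give the same kind of per-month averages.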

In [None]:
sns.lineplot(data = biden_op_df, x = 'month', y = 'sentiment_score')

The sentiment generally seems to be neutral or positive, with a quick drop in August. Why might this be the case? This is around the time that the withdrawal of US troops from Afghanistan was completed, which drew lots of criticism for how it was handled. Let's take a look at some of the abstracts to see if that's what we see. 

In [None]:
biden_op_df[biden_op_df.month == 8]

<font color ='red'>**Question 4: What was the overall trend of the sentiment of News articles over the course of 2021? Did the overall trend differ for articles that mentioned "Biden"?**</font>

## How good are our scores?

It seems intuitive that a dictionary of positive and negative terms would be a good way to classify text, but how do we know whether a sentiment dictionary is any good? We can test the performance of our sentiment dictionary by comparing its predictions to a "ground truth" source of evidence. The `imdb_reviews` dataset has two columns: the text of a user review and a label that is 1 if the user gave a positive rating (>=7) and 0 if the user gave a negative rating (<=4). Since this corpus includes both text and a numeric label, we can use it as the ground truth for assessing our sentiment classifier.

In [None]:
imdb = pd.read_csv('imdb_reviews.csv').dropna()
imdb.head()

Now we'll apply the polarity scores to each review. Since the ground truth labels are dichotomous (1 or 0), we'll simplify our analysis by also making the sentiment score measure into a dichotomous variable. 

In [None]:
polarity =  pd.DataFrame([sia.polarity_scores(i) for i in imdb['text']])

# assign a positive or negative label based on the compound score: 
polarity['positive'] = polarity['compound']>=.05

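# add the "ground truth" to the polarity
# (use the raw values so rows match by position; dropna() can leave gaps in imdb's index)
polarity['actual'] = imdb['label'].to_numpy()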
# view the results:
polarity.head()

Finally, we can assess the quality of our sentiment predictions by creating a confusion matrix. A confusion matrix is a two-way table with the "predictions" on one axis, and the "ground truth" on the other. 

In [None]:
cmat = pd.crosstab(polarity['positive'], polarity['actual'],  margins=True)
cmat

I can calculate the accuracy from this output by summing the correctly classified documents and dividing by the total. What is the accuracy rate? 
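
As a worked example of that calculation, the sketch below uses made-up counts for the four cells of a confusion matrix. They are hypothetical and are not the IMDB results, so the actual accuracy still needs to be read off `cmat`.

```python
# Hypothetical cell counts (not the IMDB results).
true_pos = 400   # predicted positive, actually positive
true_neg = 300   # predicted negative, actually negative
false_pos = 200  # predicted positive, actually negative
false_neg = 100  # predicted negative, actually positive

total = true_pos + true_neg + false_pos + false_neg
accuracy = (true_pos + true_neg) / total
print(accuracy)  # 0.7
```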

In many cases, accuracy alone doesn't tell us enough to judge whether a model is good. One issue is that we might care more about false negatives or false positives. For instance: if a restaurant wants to be sure they address customer complaints, they might make it easier for a review to be classified as negative by raising the cutoff a review must reach to count as positive, even if it means they are less accurate overall. 

In [None]:
# higher cutoff for positive, so more reviews are flagged as negative (0.5 is an illustrative choice):
pd.crosstab(polarity['compound']>=.5,  polarity['actual'],  margins=True)
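
The trade-off can be seen on a handful of made-up reviews. In the sketch below, the `scores` and `actual` values are hypothetical, and the helper names `missed_negatives` and `accuracy_at` are introduced only for this illustration. The first counts how many truly negative reviews end up classified as positive at a given cutoff, and the second gives the overall accuracy at that cutoff.

```python
# Hypothetical compound scores and ground-truth labels (1 = positive).
scores = [0.80, 0.45, 0.40, 0.20, 0.10, -0.02, -0.30, -0.60]
actual = [1,    1,    1,    1,    0,    0,     0,     0]

def missed_negatives(cutoff):
    """Count truly negative reviews that are predicted positive."""
    return sum(1 for s, a in zip(scores, actual) if s >= cutoff and a == 0)

def accuracy_at(cutoff):
    """Share of reviews whose predicted label matches the true label."""
    hits = sum(1 for s, a in zip(scores, actual) if int(s >= cutoff) == a)
    return hits / len(scores)

for cutoff in (-0.05, 0.05, 0.5):
    print(cutoff, missed_negatives(cutoff), accuracy_at(cutoff))
```

In this toy data, raising the cutoff to 0.5 catches every negative review but lowers accuracy, while lowering it to -0.05 lets more negative reviews slip through.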


The other problem with accuracy as a metric is that it can give a misleading picture when there is more of one class than another. To give an extreme example: if we had a data set where 90% of the reviews were positive, I could make a classifier that was 90% accurate by simply predicting that every review was positive. We can partially mitigate this problem by using a metric like balanced accuracy, which essentially averages the rate of correct classification across the positive and negative cases. 
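
The 90% example can be checked by hand. The sketch below builds a hypothetical set of 100 labels, 90 positive and 10 negative, and scores a classifier that always predicts positive. Balanced accuracy is computed here as the average of the recall on each class, which is how scikit-learn's `balanced_accuracy_score` defines it.

```python
# Hypothetical data: 90 positive reviews, 10 negative reviews.
actual = [1] * 90 + [0] * 10
predicted = [1] * 100          # always predict "positive"

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def recall(label):
    """Share of reviews with this true label that were predicted correctly."""
    preds = [p for p, a in zip(predicted, actual) if a == label]
    return sum(p == label for p in preds) / len(preds)

balanced_accuracy = (recall(1) + recall(0)) / 2
print(accuracy, balanced_accuracy)  # 0.9 0.5
```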

In [None]:
from sklearn.metrics import balanced_accuracy_score
# the ground truth goes first (y_true), then the predictions (y_pred)
balanced_accuracy_score(polarity['actual'], polarity['positive'].astype(int))