# Sentiment Analysis

With our new workflow for cleaning and processing text, let's see another example of the types of analyses we can perform on the data.

**Sentiment Analysis** aims to automatically classify the emotional content or sentiment of a given piece of information. The most common use case is automatically classifying text, such as social media posts, as positive or negative.

A common approach is to run a full part-of-speech parse on a large set of labeled training data, then use machine learning to determine the features that best predict positivity or negativity in the training set.

This approach uses machine learning techniques similar to those we discussed in our last workshop (in this case, often Naive Bayes classifiers).
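To make the classifier idea concrete, here is a minimal sketch of a Naive Bayes classifier written in plain Python. It is only an illustration: it uses raw word counts as features rather than a part-of-speech parse, the four training examples are invented, and the names `train_naive_bayes` and `predict` are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label) pairs invented for illustration
train = [
    (["great", "rally", "today"], "pos"),
    (["love", "this", "speech"], "pos"),
    (["terrible", "debate", "performance"], "neg"),
    (["hate", "this", "policy"], "neg"),
]

def train_naive_bayes(examples):
    """Count how often each label and each (label, word) pair occurs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(tokens, label_counts, word_counts, vocab):
    """Return the label with the highest log-probability for the tokens."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # start from the log prior of the label
        score = math.log(n_docs / total_docs)
        n_words = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocab:
                continue  # skip words never seen in training
            # add-one (Laplace) smoothing avoids log(0) for unseen pairs
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(train)
print(predict(["love", "this", "rally"], *model))
print(predict(["terrible", "policy"], *model))
```

A real classifier would need far more labeled data than this, which is one reason a ready-made wordlist is a convenient shortcut.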


For our *simple case today*, we'll use a shortcut: the AFINN-111 wordlist, which contains positive/negative valence ratings (integers from -5 to +5) for around 2,500 English lexical items. This lets us estimate a rough positive-to-negative score for each tweet.
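The wordlist file is plain text with one entry per line: a term, a tab, and an integer rating. The sketch below parses a few lines in that layout using only the standard library. The three sample lines are illustrative stand-ins typed in by hand, not copied from the real file, so check any values against AFINN-111 itself.

```python
import csv
import io

# Illustrative lines in the word<TAB>rating layout (assumed values, not the real file)
sample = "good\t3\nbad\t-3\nhappy\t3\n"

ratings = {}
for word, rating in csv.reader(io.StringIO(sample), delimiter="\t"):
    # ratings are stored as text in the file, so convert them to int
    ratings[word] = int(rating)

print(ratings.get("good", 0))
print(ratings.get("election", 0))  # a word with no entry falls back to 0
```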

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import nltk

# read in saved tweet searches and AFINN sentiment ratings
trump_tweets = pd.read_csv('saved_searches/trump_tweets_extensive.csv')
biden_tweets = pd.read_csv('saved_searches/biden_tweets_extensive.csv')
sentim_ratings = pd.read_csv('AFINN/AFINN-111.txt', sep = '\t', header = None, names = ['word', 'rating'])

# redefine the cleaning function we developed in the last segment
def clean_sentence(s, stopwords):
    """
    Take as input a sentence as a str and a list of stopwords.
    Return a tokenized, lowercase list of all alphabetic words that are not stopwords.
    You could make this function more efficient by vectorizing it and applying it to a whole pandas column at once.
    """
    # split the sentence into word and punctuation tokens
    s_tokenized = nltk.tokenize.word_tokenize(s)
    s_lower = [w.lower() for w in s_tokenized]
    # keep purely alphabetic tokens (drops punctuation, numbers, and mixed tokens)
    s_words = [w for w in s_lower if w.isalpha()]
    s_final = [w for w in s_words if w not in stopwords]
    return s_final

# define stopwords following nltk
stopwords = nltk.corpus.stopwords.words('english')

# clean the tweet text from both trump and biden dfs
trump_tweets['cleaned_text'] = trump_tweets.apply(lambda row: clean_sentence(row['text'], stopwords), axis = 1)
biden_tweets['cleaned_text'] = biden_tweets.apply(lambda row: clean_sentence(row['text'], stopwords), axis = 1)

# inspect various aspects to double-check our work
print(sentim_ratings.head())
print('\n\n')
print(trump_tweets['text'][0])
print(trump_tweets['cleaned_text'][0])
print('\n\n')
print(biden_tweets['text'][0])
print(biden_tweets['cleaned_text'][0])

## Method

Our rough method for sentiment analysis will take the following steps:

- Match each word in a tweet to its sentiment value
- If a word doesn't have a match, give it a value of 0
- Sum the values, then divide the total sentiment rating by the number of words
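The steps above can be worked through by hand on a single cleaned tweet. In this sketch both the token list and the three ratings are hypothetical values chosen to keep the arithmetic easy to follow.

```python
# Hypothetical ratings and tokens, chosen only to illustrate the arithmetic
ratings = {"love": 3, "awful": -3, "win": 4}
tokens = ["love", "crowd", "awful", "traffic", "win"]

# steps 1 and 2: look up each word, using 0 when there is no match
values = [ratings.get(w, 0) for w in tokens]

# step 3: sum the values and divide by the number of words
total = sum(values)
score = total / len(tokens)

print(values)  # [3, 0, -3, 0, 4]
print(total)   # 4
print(score)   # 0.8
```

Note that the two unrated words still count toward the length, so they pull the score toward 0.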

**As we work through one example, consider the ways in which this method can be inaccurate.**
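One inaccuracy worth checking is negation. The sketch below uses a tiny stand-in stopword set and a single hypothetical rating, with simple whitespace splitting instead of the NLTK tokenizer. NLTK's English stopword list also contains "not", so the same effect can appear in the full pipeline.

```python
stopwords = {"this", "is", "not"}  # tiny stand-in for the NLTK stopword list
ratings = {"good": 3}              # hypothetical rating

def score(sentence):
    """Average rating per non-stopword, using whitespace tokenization."""
    words = [w for w in sentence.lower().split() if w not in stopwords]
    if not words:
        return 0.0  # assumption: nothing left to rate counts as neutral
    return sum(ratings.get(w, 0) for w in words) / len(words)

print(score("this is good"))
print(score("this is not good"))
```

Both sentences get the same score, because "not" is removed before any rating is looked up.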

In [None]:
def measure_sentiment(s, ratings):
    """
    Takes a list (s) of lowercase words and matches them to the pandas df ratings ('word' and 'rating' columns).
    Returns the summed rating divided by the number of words (0.0 for an empty list).
    """
    # an empty list (e.g. a tweet made up only of stopwords) would otherwise divide by zero
    if len(s) == 0:
        return 0.0
    value = 0
    # looping over words in list s; lots of room to make this more efficient
    for word in s:
        # select the ratings (if any) whose word matches
        matches = ratings.loc[ratings['word'] == word, 'rating']
        # unmatched words contribute 0; matched words add their integer rating
        if not matches.empty:
            value += matches.iloc[0]
    # normalize by length of tweet
    return value / len(s)

trump_tweets['sentiment_value'] = trump_tweets.apply(lambda row: measure_sentiment(row['cleaned_text'], sentim_ratings), axis = 1)
biden_tweets['sentiment_value'] = biden_tweets.apply(lambda row: measure_sentiment(row['cleaned_text'], sentim_ratings), axis = 1)
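The loop above scans the whole ratings table once per word, which gets slow on large sets of tweets. One possible speed-up, sketched here under the assumption that the ratings fit comfortably in memory, is to build a dictionary once and look words up in it. The name `measure_sentiment_fast` and the three-word lookup are hypothetical.

```python
def measure_sentiment_fast(words, rating_lookup):
    """Average rating per word, using a plain dict for constant-time lookups."""
    if not words:
        return 0.0  # assumption: treat an empty tweet as neutral
    # unrated words contribute 0 via the default in .get()
    total = sum(rating_lookup.get(w, 0) for w in words)
    return total / len(words)

# In the notebook, the dict could be built once from the DataFrame, e.g.
#   rating_lookup = dict(zip(sentim_ratings['word'], sentim_ratings['rating']))
# Here a hypothetical three-word lookup stands in for it.
rating_lookup = {"good": 3, "bad": -3, "win": 4}

print(measure_sentiment_fast(["good", "win", "today", "vote"], rating_lookup))  # 1.75
```

The result should match the DataFrame-based version for the same ratings. Only the lookup strategy changes.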

In [None]:
# plot and descriptive statistics
fig, ax = plt.subplots(1,1)
plt.boxplot([trump_tweets['sentiment_value'], biden_tweets['sentiment_value']])
plt.xticks([1,2], ['trump','biden'])
ax.axhline(y=0)
plt.show()

print(trump_tweets['sentiment_value'].describe())
print()
print(biden_tweets['sentiment_value'].describe())


In [None]:
# remove all "ambivalent" tweets that have a value of 0 (no rated words, or positive and negative ratings that cancel out)
trump_sub = trump_tweets[trump_tweets.sentiment_value != 0]
biden_sub = biden_tweets[biden_tweets.sentiment_value != 0]

# plot and descriptive statistics
fig, ax = plt.subplots(1,1)
plt.boxplot([trump_sub['sentiment_value'], biden_sub['sentiment_value']])
plt.xticks([1,2], ['trump','biden'])
ax.axhline(y=0)
plt.show()

print(trump_sub['sentiment_value'].describe())
print()
print(biden_sub['sentiment_value'].describe())
