# Chapter 11: Sentiment Analysis

Suppose you are conducting market research through interviews or focus groups. Understanding the emotion, or *sentiment*, in people's responses is an important part of that research, which is why sentiment analysis has become a popular text-analysis technique. In the example below, the question is: what effect do positive or negative tweets have on the number of retweets? Because not all text is clearly positive or negative, both are also compared against neutral tweets.

Sentiment analysis assigns each piece of text a numeric score, which can then be used in regression and other forms of analysis.

In [1]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
    
# Word lists and lexicons in nltk: https://www.nltk.org/howto/corpus.html#word-lists-and-lexicons
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\jrw100\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


**ACTIVITY**: Call sia.polarity_scores() with various strings inside the parentheses. The function analyzes the text using the downloaded lexicon and returns scores for negative ('neg'), neutral ('neu'), and positive ('pos') sentiment, plus an overall 'compound' score.

In [2]:
sia.polarity_scores("This is a really great tweet!")

{'neg': 0.0, 'neu': 0.461, 'pos': 0.539, 'compound': 0.6893}

In [7]:
sia.polarity_scores("Hellow how are you?")

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [8]:
sia.polarity_scores("I'm going to men's group tonight")

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

In [9]:
sia.polarity_scores("That was kind of a nasty move")

{'neg': 0.438, 'neu': 0.562, 'pos': 0.0, 'compound': -0.5984}
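
The outputs so far fall into three groups: clearly positive, clearly negative, and neutral. One common way to turn the compound score into such a label is a pair of cutoffs. The sketch below assumes the conventional VADER thresholds of +0.05 and -0.05; the function name label_from_scores is hypothetical and not part of NLTK.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a polarity_scores()-style dict to 'positive', 'negative', or 'neutral'.

    The 0.05 cutoff is an assumed convention, not something NLTK enforces.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dictionaries copied from the cell outputs above
examples = {
    "This is a really great tweet!": {"neg": 0.0, "neu": 0.461, "pos": 0.539, "compound": 0.6893},
    "Hellow how are you?": {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    "That was kind of a nasty move": {"neg": 0.438, "neu": 0.562, "pos": 0.0, "compound": -0.5984},
}

for text, scores in examples.items():
    print(f"{label_from_scores(scores):8} | {text}")
```

In the notebook, the same function could be applied to the result of sia.polarity_scores() for any string.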

In [15]:
sia.polarity_scores("Jason best is the")

{'neg': 0.0, 'neu': 0.417, 'pos': 0.583, 'compound': 0.6369}
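
It can help to know where the 'compound' number comes from. As far as the VADER implementation goes, the adjusted word valences are summed and then squashed into the range -1 to 1 with x / sqrt(x² + alpha), using alpha = 15. The sketch below reproduces only that normalization step; the raw sums fed into it are illustrative assumptions, not values read from the lexicon.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Hypothetical raw valence sums, chosen only to show the shape of the curve
for raw in (-8.0, -3.2, 0.0, 3.2, 8.0):
    print(f"raw sum {raw:5.1f} -> compound {normalize(raw):7.4f}")
```

A raw sum of 3.2 lands at about 0.637, in line with the compound score reported for "Jason best is the". That would be consistent with "best" being the only scored word in that string, although this is an inference rather than something the output confirms. It also suggests why word order made no difference there: the score is built from individual words, not from the grammar of the sentence.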

**Turn it into a DataFrame**: The exercise above is a fun activity, but the technique becomes far more useful when applied to a large dataset of text. Below, each tweet in the dataset is given positive, negative, and neutral sentiment scores (plus an overall score), which can then be analyzed to see what effect sentiment has on the number of retweets.

In [None]:
import pandas as pd

df_tweets = pd.read_csv('https://www.ishelp.info/data/tweets_aws.csv')
df_tweets.drop(columns=['Sentiment'], inplace=True)
df_tweets

| | Gender | Weekday | Hour | Day | Reach | RetweetCount | Klout | text |
|---|---|---|---|---|---|---|---|---|
| 0 | Male | Monday | 23 | 2 | 4037 | 1 | 52 | Amazon Web Services is becoming a nice predict... |
| 1 | Unknown | Friday | 12 | 4 | 524418 | 21 | 72 | Announcing four new VPN features in our Sao Pa... |
| 2 | Unknown | Tuesday | 9 | 31 | 1748 | 1 | 46 | Are you an @awscloud user? Use #Zadara + #AWS ... |
| 3 | Unknown | Saturday | 3 | 27 | 1179 | 1 | 0 | AWS CloudFormation Adds Support for Amazon VPC... |
| 4 | Unknown | Saturday | 3 | 27 | 1179 | 1 | 0 | AWS CloudFormation Adds Support for Amazon VPC... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Unknown | Tuesday | 12 | 15 | 12532 | 20 | 59 | Friends dont let friends build datacenters. Ce... |
| 996 | Male | Tuesday | 6 | 31 | 822 | 1 | 45 | "Check out Amazon Web Services Big Data Day co... |
| 997 | Male | Wednesday | 3 | 27 | 544 | 6 | 41 | Slides and the demo project related to my #AWS... |
| 998 | Unknown | Tuesday | 8 | 19 | 582387 | 2 | 73 | Teradata SW leading #BI #DataWarehouse availab... |


In [4]:
# Create the score columns up front so they hold floats
df_tweets['sentiment_overall'] = 0.0
df_tweets['sentiment_neg'] = 0.0
df_tweets['sentiment_neu'] = 0.0
df_tweets['sentiment_pos'] = 0.0

# Score each tweet's text and store the four VADER scores on its row
for row in df_tweets.itertuples():
    sentiment = sia.polarity_scores(row.text)  # look up the text column by name, not position
    df_tweets.loc[row.Index, 'sentiment_overall'] = sentiment['compound']
    df_tweets.loc[row.Index, 'sentiment_neg'] = sentiment['neg']
    df_tweets.loc[row.Index, 'sentiment_neu'] = sentiment['neu']
    df_tweets.loc[row.Index, 'sentiment_pos'] = sentiment['pos']

df_tweets.head()

| | Gender | Weekday | Hour | Day | Reach | RetweetCount | Klout | text | sentiment_overall | sentiment_neg | sentiment_neu | sentiment_pos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Monday | 23 | 2 | 4037 | 1 | 52 | Amazon Web Services is becoming a nice predict... | 0.7757 | 0.0 | 0.508 | 0.492 |
| 1 | Unknown | Friday | 12 | 4 | 524418 | 21 | 72 | Announcing four new VPN features in our Sao Pa... | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | Unknown | Tuesday | 9 | 31 | 1748 | 1 | 46 | Are you an @awscloud user? Use #Zadara + #AWS ... | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Unknown | Saturday | 3 | 27 | 1179 | 1 | 0 | AWS CloudFormation Adds Support for Amazon VPC... | 0.6249 | 0.0 | 0.711 | 0.289 |
| 4 | Unknown | Saturday | 3 | 27 | 1179 | 1 | 0 | AWS CloudFormation Adds Support for Amazon VPC... | 0.6249 | 0.0 | 0.711 | 0.289 |
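
The row-by-row loop above works, but the same four columns can also be built in one pass by mapping the scorer over the text column. This is a minimal sketch of that alternative, not a change to the notebook's method: add_sentiment_columns and toy_scorer are hypothetical names, and toy_scorer is a stand-in so the example runs without NLTK.

```python
import pandas as pd

# Mapping from VADER's dictionary keys to the column names used in this chapter
COLUMN_NAMES = {
    "compound": "sentiment_overall",
    "neg": "sentiment_neg",
    "neu": "sentiment_neu",
    "pos": "sentiment_pos",
}


def add_sentiment_columns(df, scorer, text_col="text"):
    """Return a copy of df with the four sentiment_* columns from scorer(text)."""
    # One score dict per row, expanded into a DataFrame aligned on df's index
    scores = pd.DataFrame(df[text_col].map(scorer).tolist(), index=df.index)
    scores = scores.rename(columns=COLUMN_NAMES)[list(COLUMN_NAMES.values())]
    # Drop any earlier sentiment columns so the join cannot collide with them
    base = df.drop(columns=list(COLUMN_NAMES.values()), errors="ignore")
    return base.join(scores)


def toy_scorer(text):
    """Stand-in for sia.polarity_scores: treats the word 'nice' as fully positive."""
    pos = 1.0 if "nice" in text.lower() else 0.0
    return {"neg": 0.0, "neu": 1.0 - pos, "pos": pos, "compound": pos}


demo = pd.DataFrame({"text": ["A nice tweet", "A plain tweet"]})
print(add_sentiment_columns(demo, toy_scorer))
```

In the notebook, the call would presumably be add_sentiment_columns(df_tweets, sia.polarity_scores), which should yield the same columns as the loop.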


**ACTIVITY**: Use multiple linear regression to see whether sentiment, or any other feature, has an effect on retweets. The first model below leaves out the sentiment scores as a baseline; the second adds them.

In [5]:
import statsmodels.api as sm

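# One-hot encode the categorical columns; dtype=float keeps every predictor numeric for statsmodels
df_dummy = pd.get_dummies(df_tweets, columns=['Gender', 'Weekday'], drop_first=True, dtype=float)

# Baseline model: every feature except the tweet text and the sentiment scores
y = df_dummy.RetweetCount
X = df_dummy.drop(columns=['text', 'RetweetCount', 'sentiment_overall', 'sentiment_neg', 'sentiment_neu', 'sentiment_pos']).assign(const=1)

print(sm.OLS(y, X).fit().summary())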

                            OLS Regression Results                            
Dep. Variable:           RetweetCount   R-squared:                       0.246
Model:                            OLS   Adj. R-squared:                  0.236
Method:                 Least Squares   F-statistic:                     24.73
Date:                Thu, 27 Mar 2025   Prob (F-statistic):           4.08e-52
Time:                        08:59:45   Log-Likelihood:                -4321.1
No. Observations:                1000   AIC:                             8670.
Df Residuals:                     986   BIC:                             8739.
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Hour                  0.3036      0.11

In [6]:
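# Full model: keep the four sentiment scores as predictors this time.
# Note: neg, neu, and pos sum to roughly 1, so alongside the constant they are
# nearly collinear; read their individual coefficients with caution.
X = df_dummy.drop(columns=['text', 'RetweetCount']).assign(const=1)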

print(sm.OLS(y, X).fit().summary())

                            OLS Regression Results                            
Dep. Variable:           RetweetCount   R-squared:                       0.252
Model:                            OLS   Adj. R-squared:                  0.239
Method:                 Least Squares   F-statistic:                     19.49
Date:                Thu, 27 Mar 2025   Prob (F-statistic):           3.97e-51
Time:                        09:00:50   Log-Likelihood:                -4316.9
No. Observations:                1000   AIC:                             8670.
Df Residuals:                     982   BIC:                             8758.
Df Model:                          17                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Hour                  0.2977      0.11
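
One way to judge whether the four sentiment columns add anything is an F-test between the two nested models. The sketch below works only from the rounded R-squared values and degrees of freedom printed in the two summaries above, so its result is approximate; the function name nested_f_statistic is hypothetical.

```python
def nested_f_statistic(r2_restricted, r2_full, extra_terms, df_resid_full):
    """F statistic for whether the extra terms in the full model add explanatory power.

    Valid only when both models share the same dependent variable and observations.
    """
    gain_per_term = (r2_full - r2_restricted) / extra_terms
    unexplained_per_df = (1.0 - r2_full) / df_resid_full
    return gain_per_term / unexplained_per_df


# Values read from the two OLS summaries (Df Model 13 vs. 17, Df Residuals 982)
extra_terms = 17 - 13
f_stat = nested_f_statistic(
    r2_restricted=0.246,
    r2_full=0.252,
    extra_terms=extra_terms,
    df_resid_full=982,
)
print(f"F({extra_terms}, 982) is roughly {f_stat:.2f}")
```

With the fitted results objects at hand, statsmodels can do this directly: calling compare_f_test on the full model's results, with the baseline model's results as the argument, returns the F statistic, its p-value, and the difference in degrees of freedom.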