## Sentiment Analysis

Tokenization (together with text normalization, n-grams, stems and lemmas) is one of the first stages of an NLP pipeline. The resulting tokens carry information about a word's sentiment, i.e. the emotion or feeling the word evokes.
* ***Sentiment analysis*** - measuring the sentiment of phrases or chunks of text
* Examples - Movie review sites and retailers such as Amazon request feedback on their services or on the products they sell in their marketplace
    * Star rating - typically from 1 to 5; gives us quantitative data about how people feel about products they've purchased or services they've used

<br>
<br>
Having an algorithm detect sentiment is valuable, especially because humans (unless they have strong domain knowledge) can struggle to assign an unbiased sentiment score to a review, particularly a negative one. Being able to take natural language text as input lets us retrieve and extract information from it. In the Big Data era, NLP pipelines can process large amounts of text fairly quickly and objectively.

### Implementation

There are two approaches to sentiment analysis:

1) Rules-based algorithm composed by a human 
<br>
2) Machine learning (ML) model learned from data by a machine

* **Rules based** - this approach uses human-constructed rules of thumb (heuristics) to measure sentiment. A common rules-based approach is to find specific keywords in the corpus and map each one to a numerical score or weight in a dictionary/mapping. This step builds upon the tokenization process. The final step is to add up the scores of every keyword in a document that can be found in the dictionary of sentiment scores. The final score follows a polarity scheme (-1 for absolutely negative; 0 for neutral; +1 for absolutely positive).
* **Machine learning (supervised learning)** - relies on a labelled set of documents to train an ML model that learns such rules itself. The ML sentiment model is trained to process input text and output a numerical value (score) for the sentiment being measured, such as **positivity, negativity or spaminess**. A lot of data labelled with the right sentiment score is required. Hence, a KPI such as a star rating can be used to associate keywords with that rating and to derive a labelled output (target variable) of either **positive** or **negative**.
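
The rules-based steps above (look up each token in a dictionary of scores, then add the scores up) can be sketched in a few lines. The lexicon, its values and the function name `rule_based_score` are hypothetical and chosen only for illustration, and clipping the sum to [-1, +1] is an assumption made here to match the polarity scheme.

```python
# Hypothetical toy lexicon: token -> polarity score (illustrative values only)
TOY_LEXICON = {"good": 0.6, "great": 0.8, "bad": -0.6, "awful": -0.9}


def rule_based_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of known tokens and clip the result to [-1, +1]."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    total = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return max(-1.0, min(1.0, total))


print(rule_based_score("a great film with a bad ending"))
print(rule_based_score("awful plot and bad acting"))
```

Tokens that are missing from the dictionary simply contribute nothing, which is why the size of the lexicon matters for this approach.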

#### 1) VADER - rules-based sentiment analyser

VADER (Valence Aware Dictionary and sEntiment Reasoner) is one of the most common rules-based sentiment analysers. NLTK contains an implementation under `nltk.sentiment.vader`, but one of its creators, **Hutto (GA Tech)**, maintains his own Python package, `vaderSentiment`.

In [1]:
#!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


In [2]:
sent_anal = SentimentIntensityAnalyzer()
lexicon = sent_anal.lexicon
# Keep only lexicon entries that contain a space, i.e. multi-word phrases (n-grams)
[(tok, score) for tok, score in lexicon.items() if " " in tok]

[("( '}{' )", 1.6),
 ("can't stand", -2.0),
 ('fed up', -1.8),
 ('screwed up', -1.5)]

In [3]:
# Computing polarity scores for example texts 
print(sent_anal.polarity_scores(text='Python is handy and is good for when we need to use NLP'))
print(sent_anal.polarity_scores(text='Python is not a poor choice for implementing most applications'))

{'neg': 0.0, 'neu': 0.805, 'pos': 0.195, 'compound': 0.4404}
{'neg': 0.0, 'neu': 0.779, 'pos': 0.221, 'compound': 0.3724}


The VADER algorithm measures the intensity of sentiment polarity in three separate scores:
<br>
1) **Positive**
<br>
2) **Neutral**
<br>
3) **Negative**
<br>
<br>
Alongside these it reports a single **compound** positivity score.
<br>
VADER handles negation fairly well: 'not a poor' is scored as slightly positive, much like 'good', because the negation is considered together with its neighbouring words rather than in isolation.
<br>
VADER's built-in tokenizer ignores any words and n-grams that aren't in its lexicon/vocabulary.
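
To my understanding of the `vaderSentiment` source, the compound score is the sum of the matched token valences squashed into [-1, 1] by x / sqrt(x² + α) with α = 15. Treat that formula and the constant as an assumption rather than documented API. Under that assumption, a text whose only scored token has a valence of 1.9 would come out at roughly 0.44, which is in line with the 0.4404 printed above.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded summed valence into the range [-1, 1].

    Assumption: mirrors the normalisation believed to be used by vaderSentiment.
    """
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


# A few summed valences and the compound value they would map to
for total in (-4.0, 0.0, 1.9, 8.0):
    print(f"{total:+.1f} -> {normalize(total):+.4f}")
```

The squashing means that adding more strongly worded tokens moves the compound score towards the extremes without ever exceeding them.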


In [4]:
corpus = ['Amazingly perfect! Nice one! :) :)', 'Completely horrible! The product is useless. :@',
'The food was decent. some good and bads meals in between.']
for doc in corpus:
    scores = sent_anal.polarity_scores(doc)
    print(f"{scores['compound']:+}: {doc}")

+0.9281: Amazingly perfect! Nice one! :) :)
-0.8856: Completely horrible! The product is useless. :@
+0.4404: The food was decent. some good and bads meals in between.


The main drawback of VADER is that its lexicon contains only about 7,500 entries, so it can only score that many of the words that might appear in one's own corpus.
<br>
Analysing larger corpora would mean checking which of their words are missing from the lexicon and possibly adding polarity scores for them, i.e. supplementing `sent_anal.lexicon`.
<br>
An ML approach (a Naive Bayes model) avoids this manual scoring by learning the word weights from labelled data.
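
A rough sketch of what supplementing the lexicon involves: find the corpus tokens the lexicon cannot score, then add hand-assigned scores for the ones that matter. The dict below is a hypothetical stand-in for `sent_anal.lexicon` (which is used above as a token-to-score mapping), and every score in it is an illustrative guess.

```python
# Hypothetical stand-in for sent_anal.lexicon: token -> valence (illustrative values)
lexicon = {"good": 1.9, "horrible": -2.5}

corpus = ["good battery life", "horrible laggy screen"]

# Every distinct token in the corpus (naive whitespace tokenizer, an assumption)
vocabulary = {tok for doc in corpus for tok in doc.lower().split()}

# Tokens the lexicon cannot score
missing = sorted(vocabulary - set(lexicon))
print(missing)

# Supplement the lexicon with a hand-assigned score (a guess, not a VADER value)
lexicon.update({"laggy": -1.5})

# Share of the corpus vocabulary that the lexicon can now score
coverage = len(vocabulary & set(lexicon)) / len(vocabulary)
print(f"coverage: {coverage:.0%}")
```

Many missing tokens ('battery', 'screen') are neutral and need no score, but deciding that still takes manual review, which is the cost the ML approach tries to avoid.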

#### 2) Naive Bayes - ML

As with any standard ML model, it's important to identify the feature(s) and the target variable.
* **Feature** - text i.e. all the words in our corpus
* **Target** - Sentiment label

**Naive Bayes** attempts to find keywords in our vocabulary that are predictive of the sentiment (target) label.
<br>
The model computes internal coefficients that map words/tokens to scores, and those scores are thresholded to arrive at a sentiment label.
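
To make the idea of "internal coefficients" concrete, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing on a toy corpus. It is not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(tok for d in class_docs for tok in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # One coefficient per (class, token): the smoothed log probability
        log_like[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_like


def predict_nb(text, log_prior, log_like):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Tokens never seen in training carry no weight and are skipped
        scores[c] = log_prior[c] + sum(
            log_like[c][t] for t in text.split() if t in log_like[c])
    return max(scores, key=scores.get)


docs = ["good fun film", "great good cast", "bad dull film", "awful bad plot"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
prior, like = train_nb(docs, labels)
print(predict_nb("good cast", prior, like))
print(predict_nb("dull plot", prior, like))
```

Each token's log likelihood acts as its weight for a class, and the predicted label is whichever class accumulates the larger total.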

In [5]:
# Need to be in nlpiaenv (conda) virtual environment before running this cell: conda activate nlpiaenv
from nlpia.data.loaders import get_data
movies = get_data('hutto_movies')
print(movies.head().round(2))
print(movies.shape)

sentiment                                               text
id                                                              
1        2.27  The Rock is destined to be the 21st Century's ...
2        3.53  The gorgeously elaborate continuation of ''The...
3       -0.60                     Effective but too tepid biopic
4        1.47  If you sometimes like to go to the movies to h...
5        1.73  Emerges as something rare, an issue movie that...
(10605, 2)


In [6]:
# It's wise to check on summary statistics such as range of the sentiment scores
movies.describe().round(2)

Unnamed: 0,sentiment
count,10605.0
mean,0.0
std,1.92
min,-3.88
25%,-1.77
50%,-0.08
75%,1.83
max,3.94


In [7]:
import pandas as pd 
from collections import Counter
from nltk.tokenize import casual_tokenize # Better at handling slang, usernames and punctuation altogether

In [8]:
# bow: bag-of-words
bow = [] 
for text in movies['text']:
    bow.append(Counter(casual_tokenize(text)))

In [9]:
df_bow = pd.DataFrame.from_records(bow)

In [10]:
# Tokens that don't appear in a sentence show up as missing values (NaN); check how many need filling with 0
#df_bow.isna().sum()

In [11]:
# casting as int after filling in missing values helps with display and enables compression of memory for the DataFrame
df_bow = df_bow.fillna(0).astype(int)
df_bow.shape

(10605, 20756)
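
The bag-of-words construction in the cells above can be sketched on a toy corpus without pandas. This is a simplified illustration: it uses whitespace splitting instead of `casual_tokenize`, and the two example sentences are made up.

```python
from collections import Counter

docs = ["the film is good", "the plot is the problem"]

# One Counter per document: token -> number of occurrences
bows = [Counter(doc.split()) for doc in docs]

# Vocabulary in order of first appearance; each token becomes one column
vocab = list(dict.fromkeys(tok for doc in docs for tok in doc.split()))

# Dense count matrix; a Counter returns 0 for tokens absent from a document
matrix = [[bow[tok] for tok in vocab] for bow in bows]

print(vocab)
for row in matrix:
    print(row)
```

Every new distinct token adds a column, which is how a corpus of 10,605 short reviews ends up with more than 20,000 dimensions.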

In [12]:
# Allow DataFrame to show wide output (many columns) easier
pd.set_option('display.width', 75)

In [13]:
# Vocabulary normalization would clearly help here by reducing the number of columns (dimensions)
# Carry on for now, as this is just an example run-through of an ML sentiment analysis pipeline
df_bow.head()

Unnamed: 0,The,Rock,is,destined,to,be,the,21st,Century's,new,...,Ill,slummer,Rashomon,dipsticks,Bearable,Staggeringly,’,ve,muttering,dissing
0,1,1,1,1,2,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,4,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Show the bag-of-words columns for the tokens of the first sentence in the corpus
df_bow.head()[list(bow[0].keys())]

Unnamed: 0,The,Rock,is,destined,to,be,the,21st,Century's,new,...,Schwarzenegger,",",Jean,Claud,Van,Damme,or,Steven,Segal,.
0,1,1,1,1,2,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,2,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,4
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,4,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1


#### Model fitting

In [15]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
# X: the bag-of-words DataFrame
# y: boolean label, True when the movie sentiment is positive (sentiment > 0)
nb = nb.fit(df_bow, movies.sentiment > 0)

In [16]:
# Column 1 of predict_proba holds the probability of the True (positive) class; rescale it from [0, 1] to [-4, 4]
movies['predicted_sentiment'] = nb.predict_proba(df_bow)[:,1] * 8 - 4
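
The `* 8 - 4` in the cell above is a linear rescaling of a probability in [0, 1] onto the [-4, 4] range of the sentiment scores. A small sketch of that arithmetic follows; the helper name `prob_to_sentiment` is hypothetical.

```python
def prob_to_sentiment(p, low=-4.0, high=4.0):
    """Linearly map a probability in [0, 1] onto the range [low, high]."""
    return p * (high - low) + low


# p = 0.5 (a coin flip) lands on a neutral score of 0
for p in (0.0, 0.25, 0.5, 1.0):
    print(p, prob_to_sentiment(p))
```

Because Naive Bayes probabilities tend to sit close to 0 or 1, the rescaled values cluster near the ends of the range, which helps explain the large absolute error against the graded sentiment scores.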

In [17]:
movies['error'] = (movies['predicted_sentiment'] - movies['sentiment']).abs()
mae = movies['error'].mean()
# print out MAE to one decimal place
print(f'the Mean Absolute Error (MAE) is {mae:0.1f}')

the Mean Absolute Error (MAE) is 1.9


In [18]:
movies['sentiment_ispositive'] = (movies['sentiment'] > 0).astype(int)
movies['predicted_ispositive'] = (movies['predicted_sentiment'] > 0).astype(int)

In [19]:
movies['''sentiment predicted_sentiment sentiment_ispositive\
    predicted_ispositive'''.split()].head(8)

Unnamed: 0_level_0,sentiment,predicted_sentiment,sentiment_ispositive,predicted_ispositive
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,2.266667,2.511515,1,1
2,3.533333,3.999904,1,1
3,-0.6,-3.655976,0,0
4,1.466667,1.940954,1,1
5,1.733333,3.910373,1,1
6,2.533333,3.995188,1,1
7,2.466667,3.960466,1,1
8,1.266667,-1.918701,1,0


In [20]:
# check accuracy of the model 
(movies.predicted_ispositive == movies.sentiment_ispositive).sum() / len(movies)

0.9344648750589345

In [21]:
# Using sklearn accuracy score function instead of filtering/computation version directly above
from sklearn.metrics import accuracy_score
# (y_true, y_pred) - bear in mind this is accuracy on the training data, as there is no test set
accuracy_score(movies['sentiment_ispositive'], movies['predicted_ispositive'])

0.9344648750589345

This is a convenient way of building a sentiment analyser with relatively little code, given a large set of labelled text data. It can scale to large corpora better than VADER, which relies on a compiled list of about 7,500 words and their sentiment scores.
<br>
Some steps to refine the model:
* Split the data - keep one part for the model to learn from and hold out a subset to test out-of-sample accuracy (how well the model predicts on new data).
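
A hold-out split can be sketched with the standard library alone. The helper name `train_test_split_simple`, the 80/20 ratio and the seed are assumptions for illustration; in practice scikit-learn's `sklearn.model_selection.train_test_split` serves the same purpose.

```python
import random


def train_test_split_simple(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle indices once, then carve off the first `test_fraction` as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


rows = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split_simple(rows, labels)
print(len(X_train), len(X_test))
```

The model would then be fitted on the training part only and scored on the held-out part, so the reported accuracy reflects data it has not seen.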

In [48]:
products = get_data('hutto_products')
bow = [] 
for text in products['text']:
    bow.append(Counter(casual_tokenize(text)))

In [49]:
df_product_bow = pd.DataFrame.from_records(bow)
df_product_bow = df_product_bow.fillna(0).astype(int)
df_product_bow.head()

Unnamed: 0,troubleshooting,ad,-,2500,and,2600,no,picture,scrolling,b,...,undone,warrranty,expire,expired,voids,develops,soldier,serving,baghdad,harddisk
0,1,2,2,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,2,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
# Stack the two bag-of-words DataFrames on top of each other (row-wise)
# df_bow.append(df_product_bow)  # older alternative; DataFrame.append was removed in pandas 2.0
df_all_bow = pd.concat([df_bow, df_product_bow], axis='rows')
df_all_bow.columns

Index(['The', 'Rock', 'is', 'destined', 'to', 'be', 'the', '21st',
       'Century's', 'new',
       ...
       'sligtly', 'owner', '81', 'defectively', 'warrranty', 'expire',
       'expired', 'voids', 'baghdad', 'harddisk'],
      dtype='object', length=23302)

The combined bow df **(df_all_bow)** contains tokens that weren't in the original bow df **(df_bow)**: 23302 columns now instead of 20756.

In [51]:
# Concatenation fills tokens that are missing from either DataFrame with NaN rather than 0,
# so select the product rows, keep only the training columns and fill with 0 again
df_product_bow = df_all_bow.iloc[len(movies):][df_bow.columns].fillna(0).astype(int)
print(df_product_bow.shape) # New product bow dataframe 
print(df_bow.shape) # Original bow dataframe 

(3546, 20756)
(10605, 20756)


The above code step ensures that the new bow df has the same columns (tokens), in the same order, as the original df used to train the Naive Bayes model.
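
The alignment step can be illustrated without pandas: new documents are counted against the fixed training vocabulary, and anything outside that vocabulary is dropped. The vocabulary and documents below are made up for the sketch. With pandas, `DataFrame.reindex(columns=..., fill_value=0)` is another way to get the same column alignment.

```python
from collections import Counter

train_vocab = ["the", "film", "is", "good", "bad"]  # hypothetical training columns
new_docs = ["the player is bad", "good picture good sound"]

aligned, unseen = [], set()
for doc in new_docs:
    counts = Counter(doc.split())
    # Same columns, same order as training; tokens absent from the doc become 0
    aligned.append([counts[tok] for tok in train_vocab])
    # Tokens the model has never seen are dropped and cannot influence the prediction
    unseen |= set(counts) - set(train_vocab)

for row in aligned:
    print(row)
print(sorted(unseen))
```

The second document keeps only its two occurrences of 'good'; the domain-specific words that describe the product are lost before the model ever sees them.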

In [52]:
products['ispos'] = (products['sentiment'] > 0).astype(int)

In [53]:
# Making our predictions on unseen data here - i.e. product reviews 
products['predicted_ispos'] = nb.predict(df_product_bow).astype(int)

In [54]:
products.head()

Unnamed: 0,id,sentiment,text,ispos,predicted_ispos
0,1_1,-0.9,troubleshooting ad-2500 and ad-2600 no picture...,0,0
1,1_2,-0.15,"repost from january 13, 2004 with a better fit...",0,0
2,1_3,-0.2,does your apex dvd player only play dvd audio ...,0,0
3,1_4,-0.1,or does it play audio and video but scrolling ...,0,0
4,1_5,-0.5,before you try to return the player or waste h...,0,0


In [55]:
(products.predicted_ispos == products.ispos).sum() / len(products)




0.5572476029328821

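The Naive Bayes model, trained on movie reviews, performed poorly when predicting whether product reviews are positive: about 56% accuracy, compared with 93% on the movie reviews it was trained on.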

***Reason*** 

* Difference between document text lexicons - the vocabulary of the product texts has 2546 (23302 - 20756) tokens that were not present in the movie reviews (the texts our model was originally trained on). This amounts to roughly 12% of the number of tokens in the original movie review vocabulary, and none of those words have any weights or scores in our Naive Bayes model
* Potential Naive Bayes model limitations - Naive Bayes doesn't handle negation as well as VADER does. It would need the additional step of including n-grams in our tokenizer, so that negation words (like 'not'/'never') are connected to the positive words they modify
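
A minimal sketch of the n-gram idea for negation: emitting bigrams alongside single tokens gives the model a separate feature such as 'not good' that can receive its own weight. The function name `unigrams_and_bigrams` is hypothetical and whitespace splitting stands in for a real tokenizer.

```python
def unigrams_and_bigrams(text):
    """Return the tokens of `text` followed by its adjacent-token pairs."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


print(unigrams_and_bigrams("not good at all"))
```

The trade-off is a much larger vocabulary, so this would usually be combined with the vocabulary normalization mentioned earlier.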