## Sentiment Analysis

Tokenization (including text normalization, n-grams, stems and lemmas) is one of the first stages of an NLP pipeline. The resulting tokens carry information about a word's sentiment, i.e. the emotion or feeling a word evokes.
* ***Sentiment analysis*** - measuring the sentiment of phrases or chunks of text
* Examples - movie review sites and companies such as Amazon request feedback on their services or on the products they sell in their marketplace
    * Star rating - typically from 1-5; gives us quantitative data about how people feel about products they've purchased or services they've used

<br>
<br>
Automated sentiment detection matters because human raters, unless they have strong domain knowledge, can struggle to assign an unbiased sentiment score to a review (particularly a negative one). Representing natural language text as machine-readable input lets us retrieve and extract information from it. In the Big Data era, NLP pipelines can process large amounts of text fairly quickly and objectively.

### Implementation

There are two approaches to sentiment analysis:

1) Rules-based algorithm composed by a human 
<br>
2) Machine learning (ML) model learned from data by a machine

* **Rules based** - this approach uses human-constructed rules of thumb (heuristics) to measure sentiment. A common rules-based approach is to find specific keywords in the corpus and map each one to a numerical score or weight in a dictionary/mapping. This step builds upon the tokenization process. The final step is to add up the scores of all keywords in a document that are also found in the dictionary of sentiment scores. The final score follows a polarity scheme (-1 for absolutely negative; 0 for neutral; +1 for absolutely positive).
* **Machine learning (supervised learning)** - relies on a labeled set of documents to train an ML model that learns such rules. The ML sentiment model is trained to process input text and output a numerical value (score) for the sentiment being measured, such as **positivity, negativity or spaminess**. A large amount of data labeled with the correct sentiment score is required. One way to obtain it is to use a KPI such as a star rating: the text associated with each rating supplies the keywords, and the rating itself supplies the labeled output (target variable), either **positive** or **negative**.
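
The rules-based steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how any particular library works: the lexicon `TOY_LEXICON`, its weights and the function `keyword_sentiment` are all made up for the example, tokenization is a naive whitespace split, and clipping the sum to the range -1 to +1 is an assumption made here to match the polarity scheme.

```python
# Hypothetical sentiment lexicon: keywords and weights are invented for illustration.
TOY_LEXICON = {"good": 0.5, "great": 0.8, "poor": -0.5, "horrible": -0.9}


def keyword_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the tokens in `text`, clipped to [-1, 1]."""
    tokens = text.lower().split()  # naive whitespace tokenization
    # Strip trailing/leading punctuation, then look each token up (0.0 if unknown)
    total = sum(lexicon.get(tok.strip(".,!?"), 0.0) for tok in tokens)
    # Clip so the result stays within the -1 .. +1 polarity scheme (an assumption)
    return max(-1.0, min(1.0, total))


print(keyword_sentiment("The food was good"))
print(keyword_sentiment("Horrible service and poor value!"))
print(keyword_sentiment("It arrived on Tuesday."))
```

Because each keyword is scored in isolation, a phrase such as 'not good' would still come out positive with this sketch; handling that kind of context requires extra rules.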

#### 1) VADER - rules-based sentiment analyser

VADER (Valence Aware Dictionary and sEntiment Reasoner) is one of the most widely used rules-based sentiment analysers. NLTK includes an implementation in `nltk.sentiment.vader`, but one of its creators, **Hutto (GA Tech)**, maintains a separate Python package, `vaderSentiment`.

In [1]:
#!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


In [2]:
sent_anal = SentimentIntensityAnalyzer()
lexicon = sent_anal.lexicon
# Keep only lexicon entries that contain a space, i.e. multi-word tokens (n-grams)
[(tok, score) for tok, score in lexicon.items() if " " in tok]

[("( '}{' )", 1.6),
 ("can't stand", -2.0),
 ('fed up', -1.8),
 ('screwed up', -1.5)]

In [5]:
# Computing polarity scores for example texts 
print(sent_anal.polarity_scores(text='Python is handy and is good for when we need to use NLP'))
print(sent_anal.polarity_scores(text='Python is not a poor choice for implementing most applications'))

{'neg': 0.0, 'neu': 0.805, 'pos': 0.195, 'compound': 0.4404}
{'neg': 0.0, 'neu': 0.779, 'pos': 0.221, 'compound': 0.3724}


VADER reports how sentiment polarity is distributed across the text as three separate scores
<br>
1) **Positive**
<br>
2) **Neutral**
<br>
3) **Negative**
<br>
<br>
It also reports a **compound** score, a single overall sentiment value ranging from -1 (most negative) to +1 (most positive).
<br>
VADER handles negation fairly well: it scores 'not a poor' as mildly positive, much as it does 'good', because it takes neighbouring words into account rather than scoring each word in isolation.
<br>
VADER ignores any word or n-gram that is not in its lexicon/vocabulary.
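
To give a feel for how a compound score can stay between -1 and +1, here is a minimal sketch of the normalization we understand the `vaderSentiment` source to apply to the summed valence of the matched tokens: x / sqrt(x² + alpha), with alpha = 15. The function name `normalize_valence` is ours, and the snippet is a standalone re-implementation under that assumption rather than a call into the library.

```python
import math


def normalize_valence(total, alpha=15.0):
    """Squash an unbounded summed valence into the open interval (-1, 1).

    Assumes the x / sqrt(x^2 + alpha) form with alpha = 15.
    """
    return total / math.sqrt(total * total + alpha)


print(round(normalize_valence(1.9), 4))   # a single mildly positive lexicon hit
print(round(normalize_valence(-6.0), 4))  # several negative hits added together
print(normalize_valence(0.0))             # no lexicon hits at all
```

For a summed valence of 1.9 this gives 0.4404, the same compound value as the first example sentence above, which would be consistent with a single lexicon hit of about that valence.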


In [17]:
corpus = ['Amazingly perfect! Nice one! :) :)', 'Completely horrible! The product is useless. :@',
'The food was decent. some good and bads meals in between.']
for doc in corpus:
    scores = sent_anal.polarity_scores(doc)
    print(f"{scores['compound']:+}: {doc}")

+0.9281: Amazingly perfect! Nice one! :) :)
-0.8856: Completely horrible! The product is useless. :@
+0.4404: The food was decent. some good and bads meals in between.


#### 2) Naive Bayes - ML

In [2]:
# Activate the nlpiaenv (conda) virtual environment before running this cell: conda activate nlpiaenv
import nlpia