# Sentiment Analysis

Sentiment analysis is one part of Natural Language Processing (NLP). At its most basic level, the idea is to categorize a section of text (words, sentences, tweets, etc.) as either "positive" or "negative". However, sentiment analysis is not limited to those two labels; text can also be categorized by different emotions. There are many ways to conduct sentiment analysis, and this notebook showcases a few, ranging from very basic to more advanced methods.

Note that we are going to avoid simply matching words to calculate sentiment. That is, we will not keep a dictionary of words for a category (e.g. "positive") and check whether the input text contains those words. That is a relatively simple way to perform sentiment analysis. Instead, we will mainly focus on predicting the sentiment of an input text.
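For contrast, the word-matching approach described above can be sketched in a few lines. This is only an illustration of the method the notebook sets aside; the word sets and the `match_sentiment` function are hypothetical and not part of the notebook's code.

```python
# Hypothetical word sets, for illustration only
POSITIVE = {"good", "great", "awesome"}
NEGATIVE = {"bad", "worst", "terrible"}


def match_sentiment(text):
    """Count dictionary hits for each category and compare the totals."""
    words = text.lower().split()
    pos_hits = sum(word in POSITIVE for word in words)
    neg_hits = sum(word in NEGATIVE for word in words)
    if pos_hits > neg_hits:
        return "positive"
    if neg_hits > pos_hits:
        return "negative"
    return "neutral"


print(match_sentiment("what a great and awesome day"))
print(match_sentiment("the worst day"))
```

Nothing is learned here: the result depends entirely on which words happen to be in the dictionaries.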

The sections below showcase a few ways to predict sentiment.

To find out more about sentiment analysis, see:
 * https://en.wikipedia.org/wiki/Sentiment_analysis 
 * https://www.brandwatch.com/blog/understanding-sentiment-analysis/
 
For full installation instructions for the libraries used in this notebook, see their respective sites.

Written 2018

## Basic Sentiment Analysis

We will first showcase a very basic method to conduct sentiment analysis. This example is not comprehensive; it is shown mainly to demonstrate the core concept of sentiment analysis, which is to train a model (program) to predict the sentiment of a body of text.

We will first create lists of words that we believe are "positive", "neutral", and "negative".

In [66]:
# List of words belonging to each category

positive_words = ['good','great','awesome','happy','fun','exciting']
neutral_words = ['any','other','words','i','believe','mean','neutral']
negative_words = [ 'bad', 'worst','terrible','sad','angry' ]

Using these sets of words, we are going to train a model to help us categorize different texts. As you can tell, the lists above are very basic and serve mostly as an example. If you have your own corpus of words for each category, feel free to use it (as long as it is in the same format as the lists above, if you want to keep following the example).

First we will import the libraries we are going to use. For full installation steps, see each package's documentation.

In [67]:
# Import required library
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier

Now that we have imported the libraries, we can define a helper function. This function turns a word into the feature format the classifier expects, so that each word can later be paired with a label such as "positive". These are sometimes referred to as "word features".

In [68]:
# Helper function: turn a single word into a feature set for the classifier
def sentiment_label(word):
    return {word: True}

Now we can move on to labeling each set of words we defined earlier. As you will see, the code so far allows you to expand the categories beyond just "positive", "negative", and "neutral".

In [69]:
# Label each word with its corresponding category
positive_label_words = [(sentiment_label(positive), 'positive') for positive in positive_words]
neutral_label_words  = [(sentiment_label(neutral), 'neutral') for neutral in neutral_words]
negative_label_words  = [(sentiment_label(negative), 'negative') for negative in negative_words]
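To make the shape of this training data concrete, the sketch below builds a shortened version of it and prints each entry. The two-word lists and the `word_feature` helper are hypothetical stand-ins used only for illustration; the point is that every training example is a pair of a feature dictionary and a label.

```python
# Hypothetical, shortened word lists for illustration
positive_words = ['good', 'great']
negative_words = ['bad', 'worst']


def word_feature(word):
    """Hypothetical helper: a feature set marking that `word` is present."""
    return {word: True}


# Each training example is a (feature set, label) pair
training_data = (
    [(word_feature(word), 'positive') for word in positive_words]
    + [(word_feature(word), 'negative') for word in negative_words]
)

for features, label in training_data:
    print(features, '->', label)
```

Adding another category only requires another word list and another set of pairs with a new label.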

Now we have completed all the setup required. All that is left is to train the model on our categories and test it out. Before we move on, it is important to note that what we are doing is predictive. It may not be accurate in all cases (depending on your labels/features and model).

We will start by setting up the classifier using the labels we created earlier.

In [70]:
# Create the classifier
classifier = NaiveBayesClassifier.train((positive_label_words + neutral_label_words + negative_label_words)) 

We now have everything we need to try out our classifier. To keep things organized, we will create a function that predicts the sentiment of the input text. You can alter this function to do additional cleaning of the text or manipulate it however you wish. In this example we simply convert the text to lower case and split it into words.

In [73]:
# Create a function to predict the sentiment of the input text
def simple_predict(input_text):

    # Set the count for each category to 0 to start
    positive = 0
    negative = 0
    neutral = 0

    # Set the input text to lower case and split it into words
    formatted_input = input_text.lower().split(' ')

    # Classify each word
    for word in formatted_input:
        class_result = classifier.classify(sentiment_label(word))

        # Depending on the result, increase the corresponding category
        if class_result == 'positive':
            positive = positive + 1
        elif class_result == 'neutral':
            neutral = neutral + 1
        elif class_result == 'negative':
            negative = negative + 1

    # Print out the share of words assigned to each category
    print('Positive: ' + str(float(positive)/len(formatted_input)))
    print('Neutral: ' + str(float(neutral)/len(formatted_input)))
    print('Negative: ' + str(float(negative)/len(formatted_input)))

Now that we have completed that, we can try it out.

In [74]:
# Testing out the predicting function
simple_predict("Awesome movie, I liked it")

Positive: 0.2
Neutral: 0.8
Negative: 0.0


As you can see, this is not very accurate: the sentence is more positive than neutral. The inaccurate result can be attributed to the fact that our classifier is not trained well (limited vocabulary size, lack of cleaning, etc.), so most words in the sentence never appeared in training.
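One likely reason "neutral" dominates is how a Naive Bayes classifier treats words it has never seen. It combines a prior for each label with evidence from the features; when a word such as "liked" is not in the training vocabulary, there is no evidence and the prior alone decides. The sketch below estimates those priors as plain relative frequencies of the labels. That is a simplifying assumption for illustration: NLTK applies smoothing, which shifts the numbers slightly but should not change which label is largest.

```python
from collections import Counter

# Training words per category, copied from the lists defined earlier
positive_words = ['good', 'great', 'awesome', 'happy', 'fun', 'exciting']
neutral_words = ['any', 'other', 'words', 'i', 'believe', 'mean', 'neutral']
negative_words = ['bad', 'worst', 'terrible', 'sad', 'angry']

label_counts = Counter({
    'positive': len(positive_words),
    'neutral': len(neutral_words),
    'negative': len(negative_words),
})
total = sum(label_counts.values())

# Unsmoothed relative frequency of each label in the training data
priors = {label: count / total for label, count in label_counts.items()}

# The label an unseen word falls back to: the one with the largest prior
fallback_label = max(priors, key=priors.get)

print(priors)
print(fallback_label)
```

Because the neutral list is the longest, every unknown word is counted as neutral, which inflates the neutral share of the sentence.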

In addition, we can improve our results through cleaning. For example, "like" and "liked" are forms of the same word, so you can apply stemming (or another cleaning method) to help standardize the input text.
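A minimal sketch of such cleaning is shown below. The `tokenize` and `crude_stem` functions are hypothetical, and `crude_stem` is a toy suffix stripper, not a real stemming algorithm; in practice a proper stemmer (NLTK ships several) would take its place. The sketch only shows the idea: strip punctuation, then map related word forms to one shared token.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(word):
    """Toy suffix stripper (assumption: a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("Awesome movie, I liked it")
stems = [crude_stem(token) for token in tokens]

print(tokens)
print(stems)
```

For the cleaning to help, the same steps would also have to be applied to the training words, so that the classifier and the input text share one vocabulary.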

For additional information see:
* https://pythonspot.com/python-sentiment-analysis/ 

## NLTK Vader 

Another way to conduct sentiment analysis is to use NLTK Vader. This is a module in NLTK that can be used to calculate how positive, negative, and neutral a text is. Unlike the previous example, Vader does not require you to define what is "positive", "negative", and "neutral". Depending on your goal and purpose, this may or may not be relevant.

To begin we will import Vader from NLTK.

In [1]:
# Import Vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer

To better showcase the differences between this method and the previous one, we will use the exact same sentence for analysis. Before we can do that, however, we need to declare the analyzer.

In [3]:
# Declare Analyzer
sid = SentimentIntensityAnalyzer()

Now that we have the analyzer declared, we can analyze our sentence. First we calculate the scores and then print them out (the print statement is formatted so the output is easier to read).

In [5]:
# Calculate sentiment score
sentiment_Score = sid.polarity_scores("Awesome movie, I liked it")

# Print out the score
for score in sorted(sentiment_Score):
    print('{0}: {1} '.format(score, sentiment_Score[score]), end='')

compound: 0.7845 neg: 0.0 neu: 0.225 pos: 0.775 

As you can see, unlike the previous example, Vader rates this sentence as more positive than neutral. For those who want more control over what constitutes a "positive" word (etc.) or would like additional categories beyond the three presented, Vader may not be the right method.

For additional information on Vader, see its documentation at:
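If a single label is needed rather than four scores, the `compound` value can be thresholded. The sketch below uses cutoffs of 0.05 and -0.05, a commonly cited convention for Vader; treat the exact cutoffs as an assumption to tune for your data. The `label_from_compound` function is hypothetical, and the scores dictionary simply repeats the output printed above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a single sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Scores copied from the Vader output shown above
scores = {'neg': 0.0, 'neu': 0.225, 'pos': 0.775, 'compound': 0.7845}

print(label_from_compound(scores['compound']))
```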
* http://www.nltk.org/howto/sentiment.html 
* https://www.nltk.org/_modules/nltk/sentiment/vader.html

## Conclusion

In this notebook we touched on two different methods to conduct sentiment analysis. Although it does not show every way to conduct sentiment analysis, it does showcase some common ones. Depending on your needs, each method can be expanded and altered to fit. For example, additional cleaning and a larger vocabulary could lead to more accurate results.