# Sentiment Analysis

In [4]:
import os
import operator
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Sentiment analysis uses data processing techniques and NLP to infer the _sentiment_ of a piece of text. Today we will only look at _polarity_: positive vs. negative opinion.

Use cases for sentiment analysis:
* ...

Problems with sentiment analysis:
* ...

### Dataset

IMDB movie reviews dataset: http://ai.stanford.edu/~amaas/data/sentiment/
* 25000 positive & 25000 negative reviews
* 50/50 training/test split
* 7 stars or more -> positive review
* 4 stars or fewer -> negative review
* at most 30 reviews per movie
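
The rating thresholds above can be sketched in code. As far as we know, the review files in the downloaded archive are named `<id>_<rating>.txt` (for example `0_9.txt`); that naming convention and the helper name `label_from_filename` are assumptions made for illustration, not something this notebook relies on.

```python
import os

def label_from_filename(file_name):
    """Map a review file name such as '0_9.txt' to a polarity label.

    Assumes the '<id>_<rating>.txt' naming convention (an assumption).
    """
    stem = os.path.splitext(os.path.basename(file_name))[0]
    rating = int(stem.split('_')[1])
    if rating >= 7:
        return 1      # 7 stars or more -> positive
    if rating <= 4:
        return 0      # 4 stars or fewer -> negative
    return None       # 5-6 stars: neutral, not part of the labeled data

print(label_from_filename('0_9.txt'))
print(label_from_filename('12_3.txt'))
```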

In [3]:
def read_corpus(dataset):
    """Read all reviews of one split ('train' or 'test') with their labels."""
    corpus = []
    labels = []
    for rev in ['pos', 'neg']:
        folder = os.path.join('aclImdb', dataset, rev)
        for file in os.listdir(folder):
            # Set the encoding explicitly so the result does not depend
            # on the platform default.
            with open(os.path.join(folder, file), 'r', encoding='utf-8') as f:
                corpus.append(f.read())
            labels.append(1 if rev == 'pos' else 0)  # 1 = positive, 0 = negative
    return corpus, labels

In [5]:
corpus_train, y_train = read_corpus('train')
corpus_test, y_test = read_corpus('test')

FileNotFoundError: [Errno 2] No such file or directory: './aclImdb/train/pos/'

### Approaches

1. rule-based (unsupervised)
2. vectorization / ML model (supervised)
3. deep learning / LSTM (supervised)

#### 1.a. Lexicon-based method

We start with two lexicons of words associated with positive and negative sentiments.

`positive-words.txt`: https://gist.github.com/mkulakowski2/4289437

`negative-words.txt`: https://gist.github.com/mkulakowski2/4289441

Let's imagine you have an unlabeled dataset of movie reviews. How would you use these lists of positive and negative words to infer the sentiment of the reviews?
* ...

In [None]:
def read_words(sentiment):
    """Load a lexicon file, skipping ';' comment lines and blank lines."""
    with open(f'posneg/{sentiment}-words.txt', mode='r') as f:
        lines = f.readlines()
    return [line.strip() for line in lines if not line.startswith(';') and line.strip()]

In [None]:
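import re

def determine_sentiment(corpus):
    """Label each text by comparing counts of positive and negative lexicon words."""
    pos_set, neg_set = set(positive_words), set(negative_words)
    y_pred = []
    for text in corpus:
        # Match whole lowercase tokens; a plain substring test would also
        # count e.g. 'bad' inside 'badge'.
        tokens = set(re.findall(r"[a-z0-9'+*-]+", text.lower()))
        n_pos = len(tokens & pos_set)
        n_neg = len(tokens & neg_set)
        if n_pos > n_neg:
            y_pred.append(1)
        elif n_pos < n_neg:
            y_pred.append(0)
        else:
            # Tie: no evidence either way, so guess at random
            y_pred.append(np.random.choice([0, 1]))
    return y_pred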

In [None]:
positive_words = read_words('positive')

In [None]:
negative_words = read_words('negative')

In [None]:
y_pred_lexicon = determine_sentiment(corpus_test)
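
To see how well a lexicon method does, we can compare its predictions with the labels. The sketch below uses tiny made-up word lists and reviews (all hypothetical, not the real lexicons or the IMDB data) and counts whole-word matches. Ties fall back to the negative class here to keep the output deterministic, which differs from the random tie-break used above.

```python
import re

# Hypothetical mini-lexicons, stand-ins for positive-words.txt / negative-words.txt
toy_positive = {'good', 'great', 'fun'}
toy_negative = {'bad', 'boring', 'awful'}

def lexicon_label(text):
    """Return 1 if more positive than negative lexicon words occur, else 0."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    n_pos = len(tokens & toy_positive)
    n_neg = len(tokens & toy_negative)
    return 1 if n_pos > n_neg else 0

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

toy_corpus = ['A great and fun movie.', 'Boring plot, awful acting.', 'Not good.']
toy_labels = [1, 0, 0]
toy_pred = [lexicon_label(text) for text in toy_corpus]
print(toy_pred, accuracy(toy_labels, toy_pred))
```

The third toy review shows a typical weakness: counting lexicon words ignores negation, so "Not good." is labeled positive. On the real data, `accuracy_score(y_test, y_pred_lexicon)` gives the same kind of measurement.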

#### 1.b. VADER Sentiment Analysis

[VADER](https://github.com/cjhutto/vaderSentiment) (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based model for sentiment analysis that takes into account not only polarity (positive vs. negative) but also the intensity of a sentiment.

In [6]:
!pip install vaderSentiment

In [7]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Your task is to implement sentiment analysis using VADER, following the README file here:

https://github.com/cjhutto/vaderSentiment#code-examples

For each review in your test corpus, determine the sentiment (positive or negative), and compare that with the labels for your test set to determine accuracy.
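
One possible way to approach this, sketched under assumptions: VADER's `polarity_scores` returns a dictionary with a `compound` score between -1 and 1, and the README suggests treating values of 0.05 and above as positive. Since we only have two classes, the sketch below simply thresholds `compound` at 0. The threshold and the helper names (`label_from_compound`, `vader_predict`, `accuracy`) are choices made for illustration, not part of VADER.

```python
def label_from_compound(compound, threshold=0.0):
    """Turn a compound score in [-1, 1] into a binary label (1 = positive)."""
    return 1 if compound >= threshold else 0

def vader_predict(corpus, analyzer, threshold=0.0):
    """Label every text using any object with a VADER-style polarity_scores()."""
    return [
        label_from_compound(analyzer.polarity_scores(text)['compound'], threshold)
        for text in corpus
    ]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Two compound scores taken from the README example sentences
print([label_from_compound(c) for c in [0.8316, -0.7424]])
```

Given an `analyzer = SentimentIntensityAnalyzer()`, this would be used as `y_pred_vader = vader_predict(corpus_test, analyzer)` followed by `accuracy(y_test, y_pred_vader)`.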

In [9]:
analyzer = SentimentIntensityAnalyzer()

In [11]:
sentences = ["VADER is smart, handsome, and funny.",  # positive sentence example
             "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.", # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!", # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!", # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",  # negation sentence example
             "The book was good.",  # positive sentence
             "At least it isn't a horrible book.",  # negated negative sentence with contraction
             "The book was only kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "Today SUX!",  # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol", # mixed sentiment example with slang and contrastive conjunction "but"
             "Make sure you :) or :D today!",  # emoticons handled
             "Catch utf-8 emoji such as such as 💘 and 💋 and 😁",  # emojis handled
             "Not bad at all"  # Capitalized negation
            ]

In [12]:
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

VADER is smart, handsome, and funny.----------------------------- {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
VADER is smart, handsome, and funny!----------------------------- {'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
VADER is very smart, handsome, and funny.------------------------ {'neg': 0.0, 'neu': 0.299, 'pos': 0.701, 'compound': 0.8545}
VADER is VERY SMART, handsome, and FUNNY.------------------------ {'neg': 0.0, 'neu': 0.246, 'pos': 0.754, 'compound': 0.9227}
VADER is VERY SMART, handsome, and FUNNY!!!---------------------- {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342}
VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!--------- {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
VADER is not smart, handsome, nor funny.------------------------- {'neg': 0.646, 'neu': 0.354, 'pos': 0.0, 'compound': -0.7424}
The book was good.----------------------------------------------- {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'co

#### 2. Vectorization / ML model

This follows the approach you've seen in Week 4.

In [None]:
pipeline = make_pipeline(CountVectorizer(stop_words='english', ngram_range=(1, 2)),
                         TfidfTransformer(),
                         LogisticRegression())
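
To make the first two pipeline steps less of a black box, here is a rough pure-Python sketch of what counting plus TF-IDF weighting does on a toy corpus. It assumes scikit-learn's default smoothed formula, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. It uses naive whitespace tokenization without stop words or bigrams, so the numbers will not match the real pipeline; `toy_docs` and `tfidf` are hypothetical names.

```python
import math
from collections import Counter

toy_docs = ['good movie', 'bad movie', 'good good fun']

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(toy_docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the second toy document, "bad" gets a larger weight than "movie" because "movie" appears in more documents and is therefore less informative.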

In [None]:
pipeline.fit(corpus_train, y_train)

In [None]:
y_pred = pipeline.predict(corpus_test)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
weights = pipeline['logisticregression'].coef_[0]

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = pipeline['countvectorizer'].get_feature_names_out()

In [None]:
# 20 features with the most negative weights (strongest evidence for a negative review)
print(np.asarray(feature_names)[np.argsort(weights)[:20]])

In [None]:
# 20 features with the most positive weights (strongest evidence for a positive review)
print(np.asarray(feature_names)[np.argsort(weights)[-20:]])