## Creating a Dictionary-based Sentiment Analyzer

Using the small corpus of text reviews generated in task 1, the objective is to build a dictionary-based sentiment analyzer.
<br>
<br>
Topics covered in this task:
* Word and sentence tokenization
* Review score classification
* Insights from comparing sentiment scores with review ratings
* Correlation analysis across review groups
* Accounting for negation within the sentiment analyzer



In [1]:
## Imports needed for the analysis
import pandas as pd 
import numpy as np 
import os, pathlib
import matplotlib.pyplot as plt 
import altair
import pandas_bokeh
%matplotlib inline 

### Load small corpus dataset

In [2]:
# Move into the task1 directory where the small review corpus is located
path = pathlib.Path.home()/'Desktop/nlp-map-project/task1-create-dataset'
try:
    os.chdir(path)  # os.chdir returns None, so its result is not assigned
except FileNotFoundError:
    print('Directory not found; assuming we are already in the right folder, carry on!')

In [3]:
df = pd.read_csv('small_corpus.csv')

Before any NLP/ML processing, we first check for missing values: the NLTK tokenizers expect strings and raise an error on NaN entries, so any gaps must be handled up front.

In [4]:
# there are some missing reviews,
# but the number is not substantial relative to the size of the dataset/corpus
df.isna().sum()

ratings    0
reviews    4
dtype: int64

In [5]:
df[df['reviews'].isna()]

Unnamed: 0,ratings,reviews
686,1,
2590,4,
3197,5,
3470,5,


In [6]:
# fill NaNs with an empty string
df['reviews'] = df['reviews'].fillna('')

In [7]:
# check to ensure there are no nulls left
assert df['reviews'].notna().all()

In [8]:
review_sample = df['reviews'].head().tolist()
rating_sample = df['ratings'].head().tolist()

In [9]:
print(*rating_sample)
print(*review_sample, sep='\n') # each review is printed on its own line

1 1 1 1 1
Recently UBISOFT had to settle a huge class-action suit brought against the company for bundling (the notoriously harmful) StarFORCE DRM with its released games. So what the geniuses at the helm do next? They decide to make the same mistake yet again - by choosing the same DRM scheme that made BIOSHOCK, MASS EFFECT and SPORE infamous: SecuROM 7.xx with LIMITED ACTIVATIONS!

MASS EFFECT can be found in clearance bins only months after its release; SPORE not only undersold miserably but also made history as the boiling point of gamers lashing back, fed up with idiotic DRM schemes. And the clueless MBAs that run an art-form as any other commodity business decided that, "hey, why not jump into THAT mud-pond ourselves?"

The original FAR CRY was such a GREAT game that any sequel of it would have to fight an uphill battle to begin with (especially without its original developing team). Now imagine shooting this sequel on the foot with a well known, much hated and totally useless DR

### Word and sentence tokenization

In [10]:
# import relevant tokenization modules from nltk 
from nltk.tokenize import word_tokenize, sent_tokenize

In [11]:
# normalize the text to lowercase first (useful for lexicon lookups),
# then split each review into word tokens
word_tokenization = df['reviews'].str.lower().apply(word_tokenize)
word_tokenization

0       [recently, ubisoft, had, to, settle, a, huge, ...
1        [code, did, n't, work, ,, got, me, a, refund, .]
2       [these, do, not, work, at, all, ,, all, i, get...
3       [well, let, me, start, by, saying, that, when,...
4       [dont, waste, your, money, ,, you, will, just,...
                              ...                        
4495    [nice, long, micro, usb, cable, ,, battery, la...
4496    [i, 've, been, having, a, great, time, with, t...
4497                                                  [d]
4498    [really, pretty, ,, funny, ,, interesting, gam...
4499    [i, had, a, lot, of, fun, playing, this, game,...
Name: reviews, Length: 4500, dtype: object

In [12]:
sent_tokenization = df['reviews'].str.lower().apply(sent_tokenize)
sent_tokenization

0       [recently ubisoft had to settle a huge class-a...
1                    [code didn't work, got me a refund.]
2       [these do not work at all, all i get is static...
3       [well let me start by saying that when i first...
4       [dont waste your money, you will just end up u...
                              ...                        
4495    [nice long micro usb cable, battery lasts a lo...
4496    [i've been having a great time with this game....
4497                                                  [d]
4498    [really pretty, funny, interesting game., work...
4499    [i had a lot of fun playing this game, if your...
Name: reviews, Length: 4500, dtype: object

### Download NLTK `opinion lexicon`

In [13]:
# corresponding module import 
import nltk
nltk.download('opinion_lexicon')

[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /Users/ShuaibAhmed/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!


True

In [14]:
from nltk.corpus import opinion_lexicon

In [15]:
# Examine the corpus, e.g. the first 10 entries of each word list
positive = opinion_lexicon.positive()[:10]
negative = opinion_lexicon.negative()[:10]
words = sorted(opinion_lexicon.words())[:10] # sorted alphabetically

In [16]:
print(negative)
print(positive)
print(words)

['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation']
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort']


In [17]:
# check the length of each corpus/set 
print(len(opinion_lexicon.positive()))
print(len(opinion_lexicon.negative()))
print(len(opinion_lexicon.words()))

2006
4783
6789


In [18]:
# create a function to check/test if certain words are in the opinion_lexicon 
def word_check(word):
    if word in opinion_lexicon.positive():
        return f'{word} is positive'
    elif word in opinion_lexicon.negative():
        return f'{word} is negative' 
    else: 
        return f'{word} not covered in lexicon' 

In [19]:
print(word_check('sad'))
print(word_check('bad'))
print(word_check('wonderful'))
# deliberately check that lookups are case-sensitive (lexicon entries are lowercase)
print(word_check('AWESOME')) 
print(word_check('awesome'))

sad is negative
bad is negative
wonderful is positive
AWESOME not covered in lexicon
awesome is positive
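
As the output shows, lexicon lookups are case-sensitive, so input should be lowercased before the check. The following is a minimal sketch of that idea, not the notebook's own code: the two small sets are hypothetical stand-ins for the NLTK opinion lexicon, and `word_check_normalized` is an illustrative name.

```python
# Hypothetical miniature lexicon (assumption: stands in for opinion_lexicon)
POSITIVE = {"awesome", "wonderful"}
NEGATIVE = {"sad", "bad"}

def word_check_normalized(word):
    # Lowercase first so 'AWESOME' and 'awesome' resolve to the same entry
    key = word.lower()
    if key in POSITIVE:
        return f"{word} is positive"
    if key in NEGATIVE:
        return f"{word} is negative"
    return f"{word} not covered in lexicon"

print(word_check_normalized("AWESOME"))
print(word_check_normalized("meh"))
```

Holding the words in sets also keeps each membership test constant time on average, which matters once the check runs over every token in the corpus.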


### Classify reviews - negative (-1) to positive (+1)

It is recommended to score the reviews in two steps: 
<br>
1) First score each sentence of a review from -1 to 1, based on the number of positive and negative words it contains. 
<br>
2) Then compute the sentiment score of each review from the scores of its sentences, the review having first been split into sentences.
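
The two steps above can be sketched on a toy example. This is a minimal illustration under stated assumptions rather than the final implementation: the tiny `POSITIVE`/`NEGATIVE` sets stand in for the NLTK opinion lexicon, and the regex-based `split_sentences` and `tokenize` helpers are hypothetical simplifications of `sent_tokenize` and `word_tokenize`.

```python
import re

# Toy stand-ins for the opinion lexicon (assumption: illustration only)
POSITIVE = {"great", "fun", "nice"}
NEGATIVE = {"bad", "broken", "waste"}

def split_sentences(text):
    # Naive split on ., ! and ? (hypothetical stand-in for sent_tokenize)
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    # Lowercased alphabetic tokens only (hypothetical stand-in for word_tokenize)
    return re.findall(r"[a-z]+", sentence.lower())

def sentence_score(sentence):
    # Step 1: (positives - negatives) / word count, so the score lies in [-1, 1]
    words = tokenize(sentence)
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def review_score(review):
    # Step 2: average the sentence scores; a review with no sentences scores 0
    scores = [sentence_score(s) for s in split_sentences(review)]
    return sum(scores) / len(scores) if scores else 0.0

example = "Great game. The disc was broken."
print(review_score(example))  # (0.5 + -0.25) / 2 = 0.125
```

Dividing by the word count in step 1 is what bounds each sentence score, and averaging in step 2 keeps the review score in the same range regardless of review length.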

In [20]:
"""
def sentence_scoring(sents):
    word_selection = [w.lower() for w in sents if w.isalpha()]
    total_word_selection = len(word_selection)
    sum_positive = len([w for w in word_selection if w in opinion_lexicon.positive()])
    sum_negative = len([w for w in word_selection if w in opinion_lexicon.negative()])
    if total_word_selection > 0:
        return (sum_positive - sum_negative) / total_word_selection
    else:
        return 0
"""


In [21]:
"""
def review_score(review):
    sentiment_scores = list()
    sentences = sent_tokenize(review)
    for sent in sentences:
        words = word_tokenize(sent)
        sentence_score = sentence_scoring(words)
        sentiment_scores.append(sentence_score)
    if sentiment_scores:  # has at least 1 sentence score
        return sum(sentiment_scores) / len(sentiment_scores)
    else:  # return 0 if no sentiment_scores, avoid division by zero
        return 0
"""


In [22]:
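# Build the lexicon sets once; set membership is far faster than
# rescanning opinion_lexicon.positive()/negative() for every word
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def sentiment(sentence):
    score = 0
    words = [word.lower() for word in word_tokenize(sentence) if word.isalnum()]
    for word in words:
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
        # neutral words leave the running score unchanged
    # empty reviews have no words; return 0 to avoid division by zero
    if not words:
        return 0
    # normalize by word count so the score stays within the -1/+1 range
    return score / len(words)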

In [None]:
# store the result under a new name so the sentiment() function is not overwritten
sentiment_scores = df['reviews'].apply(sentiment)
sentiment_scores