## Creating a Dictionary-based Sentiment Analyzer

Given the small corpus of reviews generated in task 1, the objective is to construct a dictionary-based sentiment analyzer.

Topics covered in these tasks:
* Word and sentence tokenization
* Review score classification
* Insights surrounding score and review comparisons
* Correlation analysis surrounding review groups
* Accounting for negation within the sentiment analyzer



In [51]:
# Imports needed for the analysis
import pandas as pd 
import numpy as np 
import os, pathlib
import matplotlib.pyplot as plt 
import altair
import pandas_bokeh
%matplotlib inline 

### Load small corpus dataset

In [3]:
# Move into the task1 directory, where the small review corpus is located
path = pathlib.Path.home()/'Desktop/nlp-map-project/task1-create-dataset'
try:
    os.chdir(path)  # os.chdir returns None, so its result is not assigned
except FileNotFoundError:
    # Directory not found: assume we are already in the right folder
    print('Already in directory/folder, carry on!')

In [4]:
df = pd.read_csv('small_corpus.csv')

When dealing with NLP/ML problems, we should first check for missing values; left unhandled, they cause problems in later steps such as tokenization.

In [6]:
# there are a few missing reviews,
# which is not substantial relative to the size of the dataset/corpus
df.isna().sum()

ratings    0
reviews    4
dtype: int64

In [9]:
df[df['reviews'].isna()]

Unnamed: 0,ratings,reviews
686,1,
2590,4,
3197,5,
3470,5,


In [10]:
# fill NaNs with an empty string
df['reviews'] = df['reviews'].fillna('')

In [12]:
# check to ensure there are no nulls
assert df['reviews'].notna().all()
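
As a small illustration of the choice made above, the sketch below contrasts filling missing reviews with an empty string against dropping those rows. The `toy` frame and its values are made up for the example and are not taken from `small_corpus.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real corpus
toy = pd.DataFrame({
    'ratings': [1, 4, 5],
    'reviews': ['bad port', None, 'great fun'],
})

# Option 1: keep every row and replace the missing review with an empty string
filled = toy.assign(reviews=toy['reviews'].fillna(''))

# Option 2: drop the rows whose review is missing
dropped = toy.dropna(subset=['reviews'])

print(filled['reviews'].tolist())
print(len(toy), len(dropped))
```

Filling keeps the rating attached to every row, which may matter if ratings are later compared against review sentiment; dropping is a reasonable alternative when only a handful of rows are affected.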

In [48]:
review_sample = df['reviews'].head().tolist()
rating_sample = df['ratings'].head().tolist()

In [49]:
print(*rating_sample)
print(*review_sample, sep='\n-\n') # each review is separated by a '-'

1 1 1 1 1
-
Recently UBISOFT had to settle a huge class-action suit brought against the company for bundling (the notoriously harmful) StarFORCE DRM with its released games. So what the geniuses at the helm do next? They decide to make the same mistake yet again - by choosing the same DRM scheme that made BIOSHOCK, MASS EFFECT and SPORE infamous: SecuROM 7.xx with LIMITED ACTIVATIONS!

MASS EFFECT can be found in clearance bins only months after its release; SPORE not only undersold miserably but also made history as the boiling point of gamers lashing back, fed up with idiotic DRM schemes. And the clueless MBAs that run an art-form as any other commodity business decided that, "hey, why not jump into THAT mud-pond ourselves?"

The original FAR CRY was such a GREAT game that any sequel of it would have to fight an uphill battle to begin with (especially without its original developing team). Now imagine shooting this sequel on the foot with a well known, much hated and totally useless 

### Word and sentence tokenization

In [52]:
# import relevant tokenization modules from nltk 
from nltk.tokenize import word_tokenize, sent_tokenize

In [54]:
# text normalization (here, lowercasing) is a useful step before text analysis,
# followed by splitting each review into word tokens
word_tokenization = df['reviews'].str.lower().apply(word_tokenize)
word_tokenization

0       [recently, ubisoft, had, to, settle, a, huge, ...
1        [code, did, n't, work, ,, got, me, a, refund, .]
2       [these, do, not, work, at, all, ,, all, i, get...
3       [well, let, me, start, by, saying, that, when,...
4       [dont, waste, your, money, ,, you, will, just,...
                              ...                        
4495    [nice, long, micro, usb, cable, ,, battery, la...
4496    [i, 've, been, having, a, great, time, with, t...
4497                                                  [d]
4498    [really, pretty, ,, funny, ,, interesting, gam...
4499    [i, had, a, lot, of, fun, playing, this, game,...
Name: reviews, Length: 4500, dtype: object

In [55]:
sent_tokenization = df['reviews'].str.lower().apply(sent_tokenize)
sent_tokenization

0       [recently ubisoft had to settle a huge class-a...
1                    [code didn't work, got me a refund.]
2       [these do not work at all, all i get is static...
3       [well let me start by saying that when i first...
4       [dont waste your money, you will just end up u...
                              ...                        
4495    [nice long micro usb cable, battery lasts a lo...
4496    [i've been having a great time with this game....
4497                                                  [d]
4498    [really pretty, funny, interesting game., work...
4499    [i had a lot of fun playing this game, if your...
Name: reviews, Length: 4500, dtype: object
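
To make the difference between the two granularities concrete on a single review, here is a rough, dependency-free sketch. The helpers `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical regex stand-ins, not NLTK's algorithms: for example, they keep `didn't` as one token, whereas `word_tokenize` splits it into `did` and `n't`, as seen in the output above.

```python
import re

def simple_sent_tokenize(text):
    """Crude sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

def simple_word_tokenize(text):
    """Crude word splitter: lowercase words (keeping inner ' or -) and single punctuation marks."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*|[^\sa-z0-9]", text.lower())

# Made-up review text, loosely modelled on the second row of the corpus
review = "Code didn't work, got me a refund. Would not buy again!"

sentences = simple_sent_tokenize(review.lower())
tokens = simple_word_tokenize(review)

print(sentences)
print(tokens)
```

Sentence tokens preserve context such as "would not buy", which becomes relevant when accounting for negation, while word tokens are the units that get looked up in a lexicon.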

### Download NLTK `opinion_lexicon`

In [59]:
# corresponding module import 
import nltk
nltk.download('opinion_lexicon')

[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /Users/ShuaibAhmed/nltk_data...
[nltk_data]   Unzipping corpora/opinion_lexicon.zip.


True

In [61]:
from nltk.corpus import opinion_lexicon

In [62]:
dir(opinion_lexicon)

['CorpusView',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_encoding',
 '_fileids',
 '_get_root',
 '_read_word_block',
 '_root',
 '_tagset',
 '_unload',
 'abspath',
 'abspaths',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'license',
 'negative',
 'open',
 'positive',
 'raw',
 'readme',
 'root',
 'unicode_repr',
 'words']

In [74]:
# Examine the corpus: first 10 entries of each word list
positive = opinion_lexicon.positive()[:10]
negative = opinion_lexicon.negative()[:10]
words = sorted(opinion_lexicon.words())[:10] # sorted alphabetically

In [75]:
print(negative)
print(positive)
print(words)

['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation']
['2-faced', '2-faces', 'a+', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort']


In [79]:
# check the length of each corpus/set 
print(len(opinion_lexicon.positive()))
print(len(opinion_lexicon.negative()))
print(len(opinion_lexicon.words()))

2006
4783
6789
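
The counts are consistent with `words()` simply being the two lists combined (2006 + 4783 = 6789). As a minimal sketch of how such word lists could drive a dictionary-based score, the example below counts positive and negative tokens and takes the difference. The names `POSITIVE`, `NEGATIVE`, `lexicon_score` and `label` are assumptions made for illustration, the tiny word sets only reuse a few entries from the samples printed above, and negation is not handled here.

```python
# Hypothetical stand-in lexicon: a few entries copied from the printed samples.
# With NLTK available, the full sets could instead be built as
#   set(opinion_lexicon.positive()) and set(opinion_lexicon.negative())
POSITIVE = {'abundant', 'acclaim', 'acclaimed', 'accessible'}
NEGATIVE = {'abnormal', 'abominable', 'abort', 'aborted'}

def lexicon_score(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Return (# positive tokens) - (# negative tokens) for lowercase tokens."""
    pos = sum(1 for token in tokens if token in positive)
    neg = sum(1 for token in tokens if token in negative)
    return pos - neg

def label(score):
    """Map a raw score to a coarse sentiment label."""
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

# Made-up token lists standing in for tokenized reviews
examples = [
    ['an', 'acclaimed', 'and', 'accessible', 'game'],
    ['abominable', 'port'],
    ['it', 'is', 'a', 'game'],
]

for tokens in examples:
    score = lexicon_score(tokens)
    print(tokens, score, label(score))
```

Converting the lexicon lists to sets keeps each membership test cheap, which helps when scoring thousands of reviews token by token.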
