## Sentiment Analyzer

In [44]:
import nltk
# https://www.nltk.org/_modules/nltk/stem/wordnet.html
from nltk.stem import WordNetLemmatizer

import numpy as np 
from sklearn.utils import shuffle 
from sklearn.linear_model import LogisticRegression 
from bs4 import BeautifulSoup


In [45]:
# turns words into their base forms, e.g. cats --> cat, dogs --> dog 
wordnet_lemmatizer = WordNetLemmatizer()

# stop words from http://www.lextek.com/manuals/onix/stopwords1.html
with open('data/stopwords.txt') as f:
    stopwords = set(w.rstrip() for w in f)
# from nltk.corpus import stopwords
# stopwords.words('english')

In [46]:
# load review data 
with open('electronics/positive.review') as f:
    # name the parser explicitly so bs4 does not have to guess one
    positive_reviews = BeautifulSoup(f.read(), 'html.parser')
# only look at the review text 
positive_reviews = positive_reviews.find_all('review_text')

with open('electronics/negative.review') as f:
    negative_reviews = BeautifulSoup(f.read(), 'html.parser')
negative_reviews = negative_reviews.find_all('review_text')

In [47]:
type(positive_reviews)

bs4.element.ResultSet

In [48]:
positive_reviews[:5]

[<review_text>
 I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.
 
 I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.
 
 As always, Amazon had it to me in &lt;2 business days
 </review_text>, <review_text>
 I ordered 3 APC Back-UPS ES 500s on the recommendation of an employee of mine who used to work at APC. I've had them for about a month now without any problems. They've functioned properly through a few unexpected power interruptions. I'll gladly order more if the need arises.
 
 Pros:
  - Large plug spacing, good for power adapters
  - Simple design
  - Long cord
 
 Cons:
  - No line conditioning (usually an expensive option
 </review_t

In [49]:
negative_reviews[:5]

[<review_text>
 cons
 tips extremely easy on carpet and if you have a lot of cds stacked at the top
 
 poorly designed, it is a vertical cd rack that doesnt have individual slots for cds, so if you want a cd from the bottom of a stack you have basically pull the whole stack to get to it
 
 putting it together was a pain, the one i bought i had to break a piece of metal just to fit it in its guide holes.
 
 again..poorly designed... doesnt even fit cds that well, there are gaps, and the cd casses are loose fitting
 
 pros
 ..........
 i guess it can hold a lot of cds....
 </review_text>, <review_text>
 It's a nice look, but it tips over very easily. It is not steady on a rug surface dispite what the picture on the box shows. My advice is if you need a CD rack that holds a lot of CD's? Save your money and invest in something nicer and more sturdy
 </review_text>, <review_text>
 I have bought and returned three of these units now. Each one has been defective, and finally I just gave up on

In [50]:
len(positive_reviews)

1000

In [51]:
len(negative_reviews)

1000

In [53]:
# # if there are more positive reviews than negative reviews,
# # take a random sample of the positive reviews to keep 
# # the positive / negative classes balanced 
# np.random.shuffle(positive_reviews)
# positive_reviews = positive_reviews[: len(negative_reviews)]


# # # we can also oversample the negative reviews
# # diff = len(positive_reviews) - len(negative_reviews)
# # idxs = np.random.choice(len(negative_reviews), size=diff)
# # extra = [negative_reviews[i] for i in idxs]
# # negative_reviews += extra
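
The two balancing strategies in the commented-out cell above can be sketched with the standard library alone. This is a minimal illustration on toy lists, not the notebook's data; the helper names `downsample` and `oversample` are hypothetical.

```python
import random

def downsample(larger, smaller, seed=0):
    """Return a random subset of `larger` with as many items as `smaller`."""
    rng = random.Random(seed)
    return rng.sample(larger, len(smaller))

def oversample(smaller, larger, seed=0):
    """Return `smaller` plus randomly repeated items so it matches `larger` in size."""
    rng = random.Random(seed)
    diff = len(larger) - len(smaller)
    extra = [rng.choice(smaller) for _ in range(diff)]
    return smaller + extra

# toy data standing in for the parsed reviews
positive = ['p%d' % i for i in range(10)]
negative = ['n%d' % i for i in range(6)]

balanced_positive = downsample(positive, negative)
balanced_negative = oversample(negative, positive)
print(len(balanced_positive), len(balanced_negative))
```

Down-sampling discards data from the larger class, while oversampling repeats reviews from the smaller class; with 1000 reviews on each side, as here, neither is needed.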

In [54]:
# custom tokenizer: lower-cases the text, drops short words, lemmatizes, and removes stop words 
def custom_tokenizer(s):
    # lower case 
    s = s.lower()
    # split into words 
    tokens = nltk.tokenize.word_tokenize(s)
    # remove short words such as it, am ... 
    tokens = [word for word in tokens if len(word) > 2]
    # convert words into base form: cats --> cat 
    tokens = [wordnet_lemmatizer.lemmatize(word) for word in tokens]
    # remove stop words 
    tokens = [word for word in tokens if word not in stopwords]
    return tokens 
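
To see the order of the filtering steps without installing NLTK data, here is a dependency-free approximation. It swaps in a regular expression for `nltk.tokenize.word_tokenize`, skips lemmatization entirely, and uses a tiny hypothetical stop-word set, so its output will differ from `custom_tokenizer` on real reviews.

```python
import re

# hypothetical stop-word set; the notebook loads a much longer list from a file
TOY_STOPWORDS = {'the', 'and', 'this', 'for', 'was'}

def toy_tokenizer(s):
    # lower case
    s = s.lower()
    # split on runs of letters and apostrophes (a stand-in for word_tokenize)
    tokens = re.findall(r"[a-z']+", s)
    # remove short words
    tokens = [t for t in tokens if len(t) > 2]
    # remove stop words
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return tokens

print(toy_tokenizer('This UPS was easy to set up and the price is great'))
```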

In [67]:
# create word_to_index map to record the index of each word in a vocabulary 
word_idx_map = {}
positive_tokenized = []
negative_tokenized = []
orig_reviews = []

def update_map(reviews, review_list, cur_idx):
    for review in reviews:
        orig_reviews.append(review.text)
        tokens = custom_tokenizer(review.text)
        review_list.append(tokens)
        for token in tokens:
            if token not in word_idx_map:
                word_idx_map[token] = cur_idx
                cur_idx += 1 
    return cur_idx
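
When the first call starts at index 0, as it does in this notebook, the next free index always equals the current size of the map. The sketch below uses that fact to avoid passing `cur_idx` by hand between calls; `build_vocab` is a hypothetical helper and the token lists are toy data.

```python
def build_vocab(tokenized_reviews, word_idx_map=None):
    """Assign each new token the next free index; reuse the map across calls."""
    if word_idx_map is None:
        word_idx_map = {}
    for tokens in tokenized_reviews:
        for token in tokens:
            if token not in word_idx_map:
                # the next free index equals the number of words seen so far
                word_idx_map[token] = len(word_idx_map)
    return word_idx_map

positive = [['easy', 'price'], ['price', 'great']]
negative = [['waste', 'money'], ['price', 'poor']]

vocab = build_vocab(positive)
vocab = build_vocab(negative, vocab)
print(vocab)
```

This removes the need to copy the value returned by the first call into the second call.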

    
    

In [60]:
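# workaround for SSL certificate errors when downloading NLTK data
# (note: this disables certificate verification)
# https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed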
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

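# nltk.download()
nltk.download('punkt')
# WordNetLemmatizer also needs the WordNet corpus; uncomment if it is missing
# nltk.download('wordnet')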

In [68]:
# positive reviews 
update_map(positive_reviews, positive_tokenized, cur_idx = 0)


7556

In [70]:
# negative reviews
update_map(negative_reviews, negative_tokenized, cur_idx = 7556)

print('Vocabulary size: ' ,len(word_idx_map))


Vocabulary size:  11082


In [71]:
# convert the tokens of each review into a vector and store the label in the last column 
def tokens_to_vector(tokens, label):
    # vocabulary size + 1 for label 
    x = np.zeros(len(word_idx_map) + 1)
    for token in tokens:
        i = word_idx_map[token]
        x[i] += 1 
    # normalize counts to proportions; skip empty reviews to avoid dividing by zero 
    total = x.sum()
    if total > 0:
        x = x / total
    x[-1] = label 
    return x 
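
As a worked example of the row layout, the sketch below builds the same kind of vector with plain Python lists for a three-word toy vocabulary (the vocabulary and the review are made up for illustration). The leading entries hold word proportions and the last entry holds the label. It also guards against an empty token list, which would otherwise cause a division by zero.

```python
toy_word_idx_map = {'easy': 0, 'price': 1, 'waste': 2}

def toy_tokens_to_vector(tokens, label):
    # one slot per vocabulary word, plus one for the label
    x = [0.0] * (len(toy_word_idx_map) + 1)
    for token in tokens:
        x[toy_word_idx_map[token]] += 1
    # turn counts into proportions, leaving an all-zero row for empty input
    total = sum(x)
    if total > 0:
        x = [count / total for count in x]
    x[-1] = label
    return x

print(toy_tokens_to_vector(['price', 'easy', 'price', 'price'], 1))
```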

In [81]:
# total number of reviews 
N = len(positive_tokenized) + len(negative_tokenized)
print('Number of reviews: ', N)

# data have N rows and len(word_idx_map) + 1 cols for features and label 
data = np.zeros((N, len(word_idx_map) + 1))

for i, tokens in enumerate(positive_tokenized):
    data[i, :] = tokens_to_vector(tokens, 1)

for i, tokens in enumerate(negative_tokenized):
    data[i + len(positive_tokenized), :] = tokens_to_vector(tokens, 0)

Number of reviews:  2000


In [82]:
# create train and test set from data 

# shuffle 
orig_reviews, data = shuffle(orig_reviews, data)

X, Y = data[:, :-1], data[:, -1]

# hold out the last 100 reviews for testing 
X_train, Y_train = X[:-100], Y[:-100]
X_test, Y_test = X[-100:], Y[-100:]
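
The split keeps the last 100 shuffled rows for testing and trains on the rest. Below is a small sketch of the same slicing on toy lists, using a hypothetical `holdout_split` helper and a fixed seed so the shuffle is repeatable. It assumes `n_test` is at least 1.

```python
import random

def holdout_split(rows, labels, n_test, seed=0):
    """Shuffle rows and labels together, then hold out the last n_test pairs."""
    pairs = list(zip(rows, labels))
    random.Random(seed).shuffle(pairs)
    train, test = pairs[:-n_test], pairs[-n_test:]
    X_train, Y_train = [r for r, _ in train], [y for _, y in train]
    X_test, Y_test = [r for r, _ in test], [y for _, y in test]
    return X_train, Y_train, X_test, Y_test

# toy rows whose label is the parity of their single feature
rows = [[i] for i in range(20)]
labels = [i % 2 for i in range(20)]
X_train, Y_train, X_test, Y_test = holdout_split(rows, labels, n_test=5)
print(len(X_train), len(X_test))
```

In the notebook itself, passing `random_state` to `sklearn.utils.shuffle` would play the same role as the seed here and make the reported accuracies reproducible.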


In [86]:
# fit model 
model = LogisticRegression()
model.fit(X_train, Y_train)
print('Train accuracy: ', model.score(X_train, Y_train))
print('Test accuracy: ', model.score(X_test, Y_test))




Train accuracy:  0.7857894736842105
Test accuracy:  0.72


In [87]:
# print the words whose weight magnitude exceeds the threshold 
threshold = 0.7 
for word, index in word_idx_map.items():
    weight = model.coef_[0][index]
    if abs(weight) > threshold:
        print(word, weight)
        
        

wa -1.7308697031787024
you 1.0171635324366886
n't -1.959793012109964
ha 0.7205408328370105
little 0.9454283441895711
buy -0.8422313078150562
cable 0.7817057724738325
doe -1.2115400249929358
support -0.8524665413863667
quality 1.4807231263729081
lot 0.7365005930568032
price 2.821363877407886
've 0.7474735606440409
love 1.1060511210359893
bad -0.7108218967956866
money -0.9491708761120471
then -1.007293200971402
speaker 0.8944060445521356
highly 0.9335654403919926
recommend 0.7448579992011756
perfect 0.9498494605783255
tried -0.7509026086833667
sound 0.9399908938442343
time -0.8654097653797324
poor -0.7863136125014152
easy 1.7458418002204057
excellent 1.2953946006339072
returned -0.8145441968758805
week -0.7236048377363157
memory 0.972496757426921
month -0.8088324730311598
item -0.9901746996502715
return -1.2216844154486044
fast 0.8966566890759341
waste -0.9947943115242666
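
To read these weights: logistic regression computes p(y = 1 | x) as the sigmoid of the weighted sum of the features plus an intercept, so words with positive weights push a review toward the positive class and words with negative weights push it toward the negative class. The sketch below illustrates this with a few weights rounded from the output above; the zero intercept and the two toy feature vectors are assumptions for illustration only.

```python
import math

# weights rounded from the printed output; the intercept is assumed to be 0
weights = {'price': 2.82, 'easy': 1.75, 'return': -1.22, 'waste': -0.99}
intercept = 0.0

def positive_probability(features):
    """p(y = 1 | x) = sigmoid(w . x + b) for a dict of word proportions."""
    z = intercept + sum(weights[w] * v for w, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# word proportions for two made-up reviews
print(positive_probability({'price': 0.5, 'easy': 0.5}))
print(positive_probability({'return': 0.5, 'waste': 0.5}))
```

The first call yields a probability above 0.5 and the second one below it, which matches the signs of the weights involved.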


In [88]:
# have a look at the misclassified examples 
# predictions on all data (training and test rows alike) 
preds = model.predict(X)
probabilities = model.predict_proba(X)[:, 1] # p(y = 1|x)


In [111]:
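# misclassified reviews, grouped by their true label 
actual_positive = []  # true label 1, predicted negative (false negatives) 
actual_negative = []  # true label 0, predicted positive (false positives) 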
for i in range(N):
    prob = probabilities[i]
    y = Y[i]
    if y == 1 and prob < 0.5:
        actual_positive.append(orig_reviews[i])
    elif y == 0 and prob > 0.5:
        actual_negative.append(orig_reviews[i])
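
The loop above sorts misclassified reviews by their true label: `actual_positive` collects positive reviews the model scored below 0.5 (false negatives) and `actual_negative` collects negative reviews it scored above 0.5 (false positives). Here is a self-contained sketch of the same bookkeeping on made-up texts, labels, and probabilities; `split_misclassified` is a hypothetical helper.

```python
def split_misclassified(texts, labels, probabilities, cutoff=0.5):
    """Return (false_negatives, false_positives) as lists of texts."""
    false_negatives, false_positives = [], []
    for text, y, prob in zip(texts, labels, probabilities):
        if y == 1 and prob < cutoff:
            false_negatives.append(text)
        elif y == 0 and prob > cutoff:
            false_positives.append(text)
    return false_negatives, false_positives

texts = ['works fine', 'not bad at all', 'broke in a week', 'great price, then it died']
labels = [1, 1, 0, 0]
probabilities = [0.81, 0.34, 0.12, 0.77]

fn, fp = split_misclassified(texts, labels, probabilities)
print(fn)
print(fp)
```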

In [113]:
actual_positive[:2]

['\nA simple, cheap investment.  This CF reader actually does what it should, i.e. transfer info from compact flash to the computer.  Previous Kodak 1.0 multi-card reader had issues after a few weeks, and stopped working after 2 months.  This one is much faster and is still working.  Ahhhhh\n',
 "\nNo problems so far; does its job.(if you are setting up a wireless network, get windows XP. 1 click and you're set up.\n"]

In [114]:
actual_negative[:2]

["\n2gb of flash memory will not be destroyed at high altitude or dropped short of shattering the memory stick.  A 30gb hard drive as in the iPod 30g (that my son has) is subject to air pressure, shock and temperature domain that a flash memory device will not be subject to.  Because the hard drive spins to lift the heads above the disk, moisture from condensation will cause the head to crash if, brought in from a cold car and turned on.  I do not know about the iPod but some larger devices have dew point detector that will not spin up unless the air in the device is at safe temp.\n\nTHAT's WHY..\n",
 '\nI purchased the Notebook Cooler approximately one month ago, after reading online testimonials of laptops becoming damaged from overheating. Thankfully, the same time I read that, Circuit City was having a sale on this item for only $15, after mail-in rebate. Fast forward one month and my opinion of the Cooler so far: not bad... \n\nMy laptop is noticeably cooler to the touch now, and 

### Conclusions

* Different classifiers and different features may improve performance.

* Regression could be used instead of classification, since sentiment is naturally expressed as a continuous score.

* Classification could also use more categories than just the positive and negative classes.