# Creating a Dictionary-based Sentiment Analyzer

In [49]:
import pandas as pd
import nltk
from IPython.display import display
pd.set_option('display.max_columns', None)

#### Step 1: Loading in the small_corpus .csv file created in the "creating_dataset" milestone.

In [17]:
reviews = pd.read_csv('amazon.csv')
reviews.head(5)

Unnamed: 0.1,Unnamed: 0,reviewerName,overall,reviewText,reviewTime,day_diff,helpful_yes,helpful_no,total_vote,score_pos_neg_diff,score_average_rating,wilson_lower_bound
0,0,,4,No issues.,23-07-2014,138,0,0,0,0,0.0,0.0
1,1,0mie,5,"Purchased this for my device, it worked as advertised. You can never have too much phone memory, since I download a lot of stuff this was a no brainer for me.",25-10-2013,409,0,0,0,0,0.0,0.0
2,2,1K3,4,it works as expected. I should have sprung for the higher capacity. I think its made a bit cheesier than the earlier versions; the paint looks not as clean as before,23-12-2012,715,0,0,0,0,0.0,0.0
3,3,1m2,5,"This think has worked out great.Had a diff. bran 64gb card and if went south after 3 months.This one has held up pretty well since I had my S3, now on my Note3.*** update 3/21/14I've had this for a few months and have had ZERO issue's since it was transferred from my S3 to my Note3 and into a note2. This card is reliable and solid!Cheers!",21-11-2013,382,0,0,0,0,0.0,0.0
4,4,2&amp;1/2Men,5,"Bought it with Retail Packaging, arrived legit, in a orange envelope, english version not asian like the picture shows. arrived quickly, bought a 32 and 16 both retail packaging for my htc one sv and Lg Optimus, both cards in working order, probably best price you'll get for a nice sd card",13-07-2013,513,0,0,0,0,0.0,0.0


#### Step 2: Tokenizing the sentences and words of the reviews

Tokenization refers to splitting a larger body of text into smaller units such as sentences or words. For languages that do not separate words with spaces, it also involves working out where the word boundaries are.

Here, we're going to test different word tokenizers on the reviews. We'll then decide which tokenizer might be better to use.
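
As a rough illustration of what tokenization means, the sketch below splits a short made-up review into sentences and words using only the standard library `re` module. The regular expressions are simplifying assumptions for illustration and are far cruder than the NLTK tokenizers used in the rest of this notebook.

```python
import re

# Made-up review text, not taken from the dataset.
text = "No issues. It works as expected, and it's fast!"

# Naive sentence split: break after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word split: runs of letters, digits and apostrophes.
words = re.findall(r"[A-Za-z0-9']+", text)

print(sentences)  # ['No issues.', "It works as expected, and it's fast!"]
print(words)
```

Rules this simple break on abbreviations, decimals and emoticons, which is one reason to compare purpose-built tokenizers.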

### 1. Treebank Word Tokenizer

Topics to be covered:
1. TreebankWordTokenizer in NLP
2. Penn Treebank

In [18]:
from nltk.tokenize import TreebankWordTokenizer
from string import punctuation
import string

In [19]:
tb_tokenizer = TreebankWordTokenizer()

Here we first convert the reviewText to a string, second replace HTML line-break tags (`<br />`) with a space, third remove the punctuation, and fourth convert all the words to lowercase. The tag has to be replaced before the punctuation is stripped; otherwise `<br />` has already been reduced to `br ` and the replacement never matches.

In [20]:
reviews["rev_text_lower"] = reviews['reviewText'].apply(lambda rev: str(rev)\
                                                        .replace("<br />", " ")\
                                                        .translate(str.maketrans('', '', punctuation))\
                                                        .lower())

In [21]:
reviews[['reviewText','rev_text_lower']].sample(2)

Unnamed: 0,reviewText,rev_text_lower
4804,The micro card also comes with an adaptor to allow use as a SD card. I used it to increase memory in a cell phone and it worked perfectly.,the micro card also comes with an adaptor to allow use as a sd card i used it to increase memory in a cell phone and it worked perfectly
4725,What you buy this for is in there. 32 gb fast writing speed etc. happy with product. no issues yet,what you buy this for is in there 32 gb fast writing speed etc happy with product no issues yet
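
The order of the cleaning steps matters. The sketch below uses a made-up review string (not taken from the dataset) and two hypothetical helper functions to compare stripping punctuation before and after replacing the `<br />` tag.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans('', '', punctuation)

def clean_punct_first(text):
    # Punctuation is removed first, so "<br />" becomes "br " and
    # the later replace() finds nothing to replace.
    return str(text).translate(STRIP_PUNCT).replace("<br />", " ").lower()

def clean_tag_first(text):
    # The tag is replaced while it is still intact.
    return str(text).replace("<br />", " ").translate(STRIP_PUNCT).lower()

sample = "Works great!<br />Fast card."
print(clean_punct_first(sample))  # works greatbr fast card
print(clean_tag_first(sample))    # works great fast card
```

In the first variant the leftover `br` is glued onto the preceding word, which would create a spurious token later on.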


The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

In [22]:
reviews["tb_tokens"] = reviews['rev_text_lower'].apply(lambda rev: tb_tokenizer.tokenize(str(rev)))

In [23]:
pd.set_option('display.max_colwidth', None)

In [24]:
reviews[['reviewText','tb_tokens']].sample(3)

Unnamed: 0,reviewText,tb_tokens
2841,The card seems fast and it's a huge amount of storage for the price. It has worked flawlessly for a few months in a range of smartphones and cameras.,"[the, card, seems, fast, and, its, a, huge, amount, of, storage, for, the, price, it, has, worked, flawlessly, for, a, few, months, in, a, range, of, smartphones, and, cameras]"
1532,"This is a very good microSD memory, i use it in an Android 4.0 phone, it sees 29.71 GB of space. It was recognize immediately by the phone and it does not freeze when i'm playing a videogame (like Asphalt 7). The design is very cool too. I like its speed. I would recommend this memory to any smartphone user.","[this, is, a, very, good, microsd, memory, i, use, it, in, an, android, 40, phone, it, sees, 2971, gb, of, space, it, was, recognize, immediately, by, the, phone, and, it, does, not, freeze, when, im, playing, a, videogame, like, asphalt, 7, the, design, is, very, cool, too, i, like, its, speed, i, would, recommend, this, memory, to, any, smartphone, user]"
2626,works great with my samsung galaxy note 2. just stuck it in and it worked. no formatting needed. very fast which and no loss of camera speed,"[works, great, with, my, samsung, galaxy, note, 2, just, stuck, it, in, and, it, worked, no, formatting, needed, very, fast, which, and, no, loss, of, camera, speed]"


### 2. Casual Tokenizer

In [25]:
from nltk.tokenize.casual import casual_tokenize

In [26]:
reviews['casual_tokens'] = reviews['rev_text_lower'].apply(lambda rev: casual_tokenize(str(rev)))

In [27]:
reviews[['reviewText','casual_tokens','tb_tokens']].sample(3)

Unnamed: 0,reviewText,casual_tokens,tb_tokens
2997,"Great price delivered to my front door what more could you ask for. Access speeds are quick enough for my phone (note 3), have had no issues thus far.","[great, price, delivered, to, my, front, door, what, more, could, you, ask, for, access, speeds, are, quick, enough, for, my, phone, note, 3, have, had, no, issues, thus, far]","[great, price, delivered, to, my, front, door, what, more, could, you, ask, for, access, speeds, are, quick, enough, for, my, phone, note, 3, have, had, no, issues, thus, far]"
603,does very good..... holds a lot of songs......,"[does, very, good, holds, a, lot, of, songs]","[does, very, good, holds, a, lot, of, songs]"
1302,I bought a new phone and got this chip because it was the largest the phone would accept. I keep a lot of pictures on it. 100% satisfied.,"[i, bought, a, new, phone, and, got, this, chip, because, it, was, the, largest, the, phone, would, accept, i, keep, a, lot, of, pictures, on, it, 100, satisfied]","[i, bought, a, new, phone, and, got, this, chip, because, it, was, the, largest, the, phone, would, accept, i, keep, a, lot, of, pictures, on, it, 100, satisfied]"
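
In the samples above the two tokenizers return identical lists, which is expected because punctuation was stripped before tokenizing. A minimal sketch on a raw, made-up sentence shows where they differ; the example text is an assumption, and neither tokenizer needs any downloaded NLTK data.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.casual import casual_tokenize

# Made-up raw text that still contains punctuation and an emoticon.
raw = "It's great, 100% satisfied :-)"

tb_raw = TreebankWordTokenizer().tokenize(raw)
casual_raw = casual_tokenize(raw)

print(tb_raw)      # Treebank splits the contraction into "It" and "'s"
print(casual_raw)  # casual_tokenize keeps "It's" and the emoticon ":-)" whole
```

Which behaviour is preferable depends on the later steps: a lexicon lookup works on whole words, so how contractions and emoticons are split affects what can be matched.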


## Stemming

In [28]:
from nltk.stem.porter import PorterStemmer

In [29]:
stemmer = PorterStemmer()

In [30]:
reviews['tokens_stemmed'] = reviews['tb_tokens'].apply(lambda words: [stemmer.stem(w) for w in words])

In [32]:
reviews[['tb_tokens','tokens_stemmed']].sample(3)

Unnamed: 0,tb_tokens,tokens_stemmed
358,"[i, do, not, get, the, advertised, speed, of, 30mbs, more, like, a, 200300kbs, on, my, galaxy, s4, at, least, it, gives, me, a, lot, of, extra, storage, space, for, a, phone, it, works, nicely, if, you, dont, need, to, transfer, big, files]","[i, do, not, get, the, advertis, speed, of, 30mb, more, like, a, 200300kb, on, my, galaxi, s4, at, least, it, give, me, a, lot, of, extra, storag, space, for, a, phone, it, work, nice, if, you, dont, need, to, transfer, big, file]"
1002,"[gets, the, job, done, for, my, gopro, so, thats, good, i, just, never, liked, how, small, the, actual, card, is, due, to, its, nature, of, it, might, just, get, lostthen, thats, game, over, the, price, was, amazing, so, i, got, two, for, the, price, of, half, of, one]","[get, the, job, done, for, my, gopro, so, that, good, i, just, never, like, how, small, the, actual, card, is, due, to, it, natur, of, it, might, just, get, lostthen, that, game, over, the, price, wa, amaz, so, i, got, two, for, the, price, of, half, of, one]"
4646,"[fast, and, holds, all, my, music, and, larger, apps]","[fast, and, hold, all, my, music, and, larger, app]"
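
The Porter stemmer strips suffixes by rule, so its output is not always a dictionary word, as the table above shows (`wa`, `galaxi`, `storag`). A minimal sketch on a few of those same tokens:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Tokens taken from the sample rows above.
words = ["advertised", "storage", "galaxy", "was", "amazing"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Stems such as `wa` are unlikely to be found in a dictionary-based lexicon, which is a reason to look at lemmatisation next.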


## Lemmatisation

In [35]:
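import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

# One-time downloads, if these resources are not installed yet:
# nltk.download('wordnet')
# nltk.download('sentiwordnet')
# nltk.download('averaged_perceptron_tagger')  # 'averaged_perceptron_tagger_eng' on newer NLTK releases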

In [36]:
def penn_to_wn(tag):
    """
        Convert between the PennTreebank tags to simple Wordnet tags
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
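
To see which tokens survive this mapping, the sketch below re-implements the same prefix logic with plain strings. It assumes the WordNet constants `wn.ADJ`, `wn.NOUN`, `wn.ADV` and `wn.VERB` are the one-letter codes `'a'`, `'n'`, `'r'` and `'v'`, and the (word, tag) pairs are hand-written examples rather than real tagger output.

```python
# Assumed values of wn.ADJ, wn.NOUN, wn.ADV and wn.VERB.
ADJ, NOUN, ADV, VERB = 'a', 'n', 'r', 'v'

def penn_to_wn_sketch(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None."""
    prefixes = {'J': ADJ, 'N': NOUN, 'R': ADV, 'V': VERB}
    # Only the first letter of the tag decides the mapping.
    return prefixes.get(tag[:1])

# Hand-written (word, tag) pairs for illustration only.
tagged = [("the", "DT"), ("card", "NN"), ("works", "VBZ"),
          ("really", "RB"), ("fast", "JJ"), ("and", "CC"), ("2", "CD")]

for word, tag in tagged:
    print(word, tag, penn_to_wn_sketch(tag))
```

Determiners, conjunctions and numbers map to `None`, which is why function words and digits do not appear in the lemma lists later in this notebook.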

In [37]:
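lemmatizer = WordNetLemmatizer()
def get_lemas(tokens):
    """Lemmatize tokens, keeping only adjectives, nouns, adverbs and verbs."""
    lemmas = []
    for token in tokens:
        # Each token is tagged on its own, so the tagger sees no sentence context.
        pos = penn_to_wn(pos_tag([token])[0][1])
        if pos:
            lemma = lemmatizer.lemmatize(token, pos)
            if lemma:
                lemmas.append(lemma)
    return lemmas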

In [None]:
reviews['lemmas'] = reviews['tb_tokens'].apply(lambda tokens: get_lemas(tokens))

In [44]:
reviews[['reviewText','tokens_stemmed','lemmas']].sample(2)

Unnamed: 0,reviewText,tokens_stemmed,lemmas
1240,"I bought this card when I got my Note 2 back in October, it is perfect and fast. No problems in any way and fully compatible with my device.UPDATE 1/15/14: I bought a second card for my Note 3 and it it excellent. The frustration free packaging was much appreciated from the other packing last year. My first original card got corrupted when I put it in an old feature phone but I am confident it had nothing to do with the card in any way. I have tested these cards with the Galaxy Note 2, Note 3, Note 10.1 2014 and Tab 3 7"" and they all read the card no problem. I have been extremely satisfied so far.","[i, bought, thi, card, when, i, got, my, note, 2, back, in, octob, it, is, perfect, and, fast, no, problem, in, ani, way, and, fulli, compat, with, my, deviceupd, 11514, i, bought, a, second, card, for, my, note, 3, and, it, it, excel, the, frustrat, free, packag, wa, much, appreci, from, the, other, pack, last, year, my, first, origin, card, got, corrupt, when, i, put, it, in, an, old, featur, phone, but, i, am, confid, it, had, noth, to, do, with, the, card, in, ani, way, i, have, test, these, card, with, the, galaxi, note, 2, note, 3, note, 101, ...]","[i, bought, card, i, get, note, back, october, be, perfect, fast, problem, way, fully, compatible, deviceupdate, i, bought, second, card, note, excellent, frustration, free, packaging, be, much, appreciate, other, pack, last, year, first, original, card, get, corrupt, i, put, old, feature, phone, i, be, confident, have, nothing, do, card, way, i, have, test, card, galaxy, note, note, note, tab, read, card, problem, i, have, be, extremely, satisfied, so, far]"
4068,"Had this item for a week or so now and it seems to be working fine. I haven't put it through any kind of stress or anything, just stuck it in my phone and let it do it's thing. So far, so good.","[had, thi, item, for, a, week, or, so, now, and, it, seem, to, be, work, fine, i, havent, put, it, through, ani, kind, of, stress, or, anyth, just, stuck, it, in, my, phone, and, let, it, do, it, thing, so, far, so, good]","[have, item, week, so, now, seem, be, work, fine, i, havent, put, kind, stress, anything, just, stuck, phone, let, do, thing, so, far, so, good]"






## Sentiment Predictor Baseline Model

In [40]:
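def get_sentiment_score(tokens):
    """Sum SentiWordNet (positive - negative) scores over the tokens."""
    score = 0
    tags = pos_tag(tokens)
    for word, tag in tags:
        # Skip words whose tag has no WordNet equivalent.
        wn_tag = penn_to_wn(tag)
        if not wn_tag:
            continue
        # Skip words that WordNet does not know for this part of speech.
        synsets = wn.synsets(word, pos=wn_tag)
        if not synsets:
            continue

        # Use the first synset, which WordNet lists as the most common sense.
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())

        score += (swn_synset.pos_score() - swn_synset.neg_score())

    return score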
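
The scoring logic can be sketched without WordNet by using a small hand-made lexicon. The words and (positive, negative) values below are invented for illustration and are not SentiWordNet scores; the point is only to show how per-word differences are summed and that words without an entry are skipped.

```python
# Hypothetical lexicon: word -> (positive score, negative score).
TOY_LEXICON = {
    "great": (0.75, 0.0),
    "good": (0.75, 0.0),
    "faulty": (0.0, 0.625),
    "problem": (0.0, 0.5),
}

def toy_sentiment_score(tokens, lexicon=TOY_LEXICON):
    """Sum (positive - negative) over the tokens found in the lexicon."""
    score = 0.0
    for token in tokens:
        if token not in lexicon:
            continue  # unknown words contribute nothing
        pos, neg = lexicon[token]
        score += pos - neg
    return score

print(toy_sentiment_score(["great", "card", "no", "problem"]))  # 0.25
print(toy_sentiment_score(["not", "so", "good"]))               # 0.75
```

Because each word is scored independently, this toy version has no notion of negation scope: "no problem" lowers the total and "not so good" still counts as positive.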

In [47]:
swn.senti_synset(wn.synsets("awesome", wn.ADJ)[0].name()).pos_score()

0.875

In [50]:
reviews['sentiment_score'] = reviews['lemmas'].apply(lambda tokens: get_sentiment_score(tokens))

In [51]:
reviews[['reviewText','lemmas','sentiment_score']].sample(5)

Unnamed: 0,reviewText,lemmas,sentiment_score
4439,"I own the Samsung Galaxy Note 2 smartphone- and besides having a user-replaceable battery, this device also allows the user to pop in a Micro SD card. More is better, too much is almost enough... and hence the 64-GB model, which permits storing insane numbers of songs on your phone-- in my case, over 7000! (Three of the songs are great, 2 are OK, and the rest? Not so good!) Seriously- this is a 'must own' if your smart phone is not made by Apple! If you own the iPhone, not so great, there are no user-accessible batteries or Micro SD cards in iPhones. If you own Android- well, life will be considerably brighter once you pop this in, which took all of like 30 seconds. You do not need to format it unless you wish to erase what you've already put on it. Highly recommend - SanDisk is a trusted provider of flash memory devices, and buying from Amazon.com is a no-brainer.","[i, own, samsung, galaxy, note, smartphone, have, userreplaceable, battery, device, also, allows, user, pop, micro, sd, card, more, be, well, too, much, be, almost, enough, hence, model, permit, store, insane, number, song, phone, case, song, be, great, be, ok, rest, not, so, good, seriously, be, own, smart, phone, be, not, make, apple, own, iphone, not, so, great, there, be, useraccessible, battery, micro, sd, card, iphones, own, android, well, life, be, considerably, brighter, once, pop, take, second, do, not, need, format, wish, erase, youve, already, put, highly, recommend, sandisk, be, trust, provider, flash, memory, device, buying, amazoncom, be, nobrainer]",1.875
1491,"I big problem I have been having lately is finding a trustworthy source for SD cards. They are manufactured in China and these rip off companies are selling fake cards at a competitive price, but don't be fooled. If the price seems unbelievable that is because you will end up with a card that is not only less capacity then you are paying for, but a lower class and sometimes a faulty card altogether.When it comes to storage, pay the extra cash to get something reliable. Like this SanDisk Ultra 64 GB SDXC Class 10 memory card. It is blazaing fast and works well in my HTC EVO LTE Android phone.Side note: This card brings my total phone storage capacity to 158 GB. That is 158 GB that I have full access to. 85 GB of which is synced to the cloud.","[i, big, problem, i, have, be, have, lately, be, find, trustworthy, source, sd, card, be, manufacture, china, rip, company, be, sell, fake, card, competitive, price, dont, be, fool, price, seem, unbelievable, be, end, up, card, be, not, only, less, capacity, then, be, pay, low, class, sometimes, faulty, card, altogetherwhen, come, storage, pay, extra, cash, get, something, reliable, sandisk, ultra, gb, sdxc, class, memory, card, be, blazaing, fast, work, well, htc, evo, lte, android, phoneside, note, card, brings, total, phone, storage, capacity, gb, be, gb, i, have, full, access, gb, be, sync, cloud]",0.75
1602,Bought This For My Samsung Galaxy S3Load It Up With Music And It Works FlawlesslyI'm Very Satisfied With This Card,"[bought, samsung, galaxy, s3load, up, music, work, flawlesslyim, very, satisfied, card]",0.375
1029,Bought it for my Galaxy S5. Dumped 32g of music on it and so far so good. It's nice to have plenty of storage.,"[bought, galaxy, s5, dumped, music, so, far, so, good, nice, have, plenty, storage]",1.0
1180,"I purchased this for a Galaxy Tab 10.1 but didn't like it - I returned the tablet (and accidently left the memory card in it). So this is the second one I've purchased, it is now in use in an ASUS T1000 tablet/notebook and I am entirely satisfied.","[i, purchase, galaxy, tab, didnt, i, return, tablet, accidently, left, memory, card, so, be, second, ive, purchase, be, now, use, asus, t1000, tabletnotebook, i, be, entirely, satisfied]",0.875


In [52]:
reviews[['reviewText','lemmas','sentiment_score']].sample(5)

Unnamed: 0,reviewText,lemmas,sentiment_score
3389,"I purchased this as a recommendation to use with a LS300W dashcam. Like most dashcams, there is constant high definition writing to the SD card because of needed looping. This works as expected. Reliable and fast.","[i, purchase, recommendation, use, ls300w, dashcam, most, dashcams, there, be, constant, high, definition, write, sd, card, need, loop, work, expect, reliable, fast]",0.75
110,Great for adding a ton of storage to your smartphone or other device. Fast enough for most needs and the price is right.,"[great, add, ton, storage, smartphone, other, device, fast, enough, most, need, price, be, right]",-0.375
1740,This is a great up grade for me. I use it for my tablet and it gives me lots of extra storage. I would recommend to anyone.,"[be, great, up, grade, i, use, tablet, give, lot, extra, storage, i, recommend, anyone]",-0.625
4030,Already bought this one last may but you can't beat the new price. Original Sandisk products never failed me before. Just be careful where to buy.,"[already, bought, last, cant, beat, new, price, original, sandisk, product, never, fail, just, be, careful, buy]",0.25
3811,"I ordered the regular packaging after reading the frustration-free packaging is sometimes counterfeit. Anyway, the 32GB card I received was formatted as FAT32, and it seemed to work just fine in my new Moto G LTE phone running Android 4.4.3. Moved apps onto the card, put some music on it. Works great, no complaints.","[i, order, regular, packaging, reading, frustrationfree, packaging, be, sometimes, counterfeit, anyway, card, i, receive, be, format, fat32, seem, work, just, fine, new, moto, g, lte, phone, run, android, move, apps, card, put, music, work, great, complaint]",0.388
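
A score only becomes a prediction once it is mapped to a label. A minimal sketch, assuming a cut-off of 0 (a hypothetical choice, not something fixed by this notebook) and using the five scores from the sample above:

```python
def score_to_label(score, threshold=0.0):
    """Label a review positive if its score exceeds the threshold (assumed to be 0)."""
    return "positive" if score > threshold else "negative"

# Scores copied from the sample output above (rows 3389, 110, 1740, 4030, 3811).
sample_scores = [0.75, -0.375, -0.625, 0.25, 0.388]
labels = [score_to_label(s) for s in sample_scores]
print(labels)
# ['positive', 'negative', 'negative', 'positive', 'positive']
```

Rows 110 and 1740 read as favourable reviews yet receive negative scores, which illustrates the limits of this baseline.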
