In [83]:
import pandas as pd
import nltk
from IPython.display import display
pd.set_option('display.max_columns', None)

In [84]:
reviews = pd.read_csv("new_data/small_corpus.csv")

In [85]:
reviews.head(2)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,1.0,True,"02 9, 2015",A29ROVDR59S0UJ,B00BPEBG8A,Madmatt7,"The folks who developed this game obviously had absolutely no connection/clue as to what was going on in the previous thief games. All the intelligence, humor, superb story-telling, stylish music and interesting characters have given way to looking goth/emo/cool. I'm guessing this game was developed with the pre-teen crowd in mind. It's that brainless. I'd rather play ""The Metal Age"" 20 times over than this. Very, very disappointed.","Does not deserve the ""Thief"" name.",1423440000,2.0,{'Platform:': ' PC Online Game Code'},
1,1.0,True,"12 30, 2015",A13WITJWA68CBR,B00438VWRU,Adventure Reader,"After an hour of play, I am only through six of the puzzles. In fact, I had to look up the solutions to two of the puzzles on internet gaming walkthrough sites.\nHere is my assessment of the game so far:\nGRAPHICS: The graphics are very lackluster. I understand this is a DS game, but I am using it on the new 3DS XL with the biggest screen and the game still looks fuzzy.\nGAMEPLAY: The flow of the games is limited to going to different points on a map and not being able to move to another location until you solve the given puzzle(s). Sometimes you will be tasked with having to find an object in one of the rooms, but the cruddy graphics make it hard to be certain of what or where to click on the screen. However, you can just click all over the screen until the item reveals itself.\nOVERALL:\nI enjoy games of mystery, hidden objects, puzzle-solving, and problem-solving, and I thought this game would offer these things and a good story. Unfortunately it is a disappointment due to the terrible graphics and the difficulty of some of the puzzles. I give this game 1 star.",Some of the puzzles are just too difficult,1451433600,,,


<h3> Text Preprocessing</h3>

1. Tokenizing the sentences and words of the reviews

Here, we'll try different word tokenizers on the reviews and then decide which one is better to use.

<b> Treebank Word Tokenizer</b>

It identifies word boundaries in a given text, separating words on whitespace and punctuation marks according to Penn Treebank conventions.

It also handles contractions like "don't" or "can't," splitting them into their constituent parts ("do" and "n't", "ca" and "n't"). This only works while the apostrophes are still present in the text.
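
As a quick check of the contraction handling, the sketch below runs the tokenizer on a short made-up sentence, once as written and once with the punctuation already removed. The sample sentence is an assumption for illustration; the only NLTK call used is `TreebankWordTokenizer.tokenize`.

```python
from string import punctuation
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
sample = "I couldn't run it."  # made-up sample sentence

# With the apostrophe intact, the contraction is split into "could" + "n't".
raw_tokens = tokenizer.tokenize(sample)

# With punctuation stripped first, "couldnt" stays a single opaque token.
stripped = sample.translate(str.maketrans('', '', punctuation)).lower()
stripped_tokens = tokenizer.tokenize(stripped)

print(raw_tokens)       # ['I', 'could', "n't", 'run', 'it', '.']
print(stripped_tokens)  # ['i', 'couldnt', 'run', 'it']
```

So the contraction splitting only has an effect when the apostrophes are still in the text at tokenization time.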

In [86]:
from nltk.tokenize import TreebankWordTokenizer
from string import punctuation
import string

In [87]:
tb_tokenizer = TreebankWordTokenizer()

In [88]:
# Replace HTML line breaks first: once punctuation is stripped, "<br />" no longer matches.
reviews["rev_text_lower"] = reviews['reviewText'].apply(lambda rev: str(rev)\
                                                        .replace("<br />", " ")\
                                                        .translate(str.maketrans('', '', punctuation))\
                                                        .lower())
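
The order of the two string operations in this cleaning step matters. A minimal sketch on a made-up review string (the text is an assumption, not taken from the corpus), comparing what happens when punctuation is stripped before versus after the `<br />` replacement:

```python
from string import punctuation

strip_punct = str.maketrans('', '', punctuation)
sample = "Great game.<br />Couldn't stop playing!"  # made-up review text

# Punctuation stripped first: "<", "/" and ">" are deleted, so the later
# replace("<br />", " ") finds nothing and "br" stays glued to "game".
punct_first = sample.translate(strip_punct).replace("<br />", " ").lower()

# Tag replaced first: the line break becomes a space before anything is deleted.
tag_first = sample.replace("<br />", " ").translate(strip_punct).lower()

print(punct_first)  # great gamebr couldnt stop playing
print(tag_first)    # great game couldnt stop playing
```

Note that in both cases the apostrophe in "Couldn't" is removed, which is why contractions appear as single tokens such as "couldnt" further down.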

In [89]:
reviews[['reviewText','rev_text_lower']].sample(2)

Unnamed: 0,reviewText,rev_text_lower
3903,excellent,excellent
3552,Excellent service.\n\nThe game arrived in perfect condition.\n\nI recommend to all fans of the series X.\n\nEveryone should have it in your collection.,excellent service\n\nthe game arrived in perfect condition\n\ni recommend to all fans of the series x\n\neveryone should have it in your collection


In [90]:
reviews["tb_tokens"] = reviews['rev_text_lower'].apply(lambda rev: tb_tokenizer.tokenize(str(rev)))

In [91]:
pd.set_option('display.max_colwidth', None)

In [92]:
reviews[['reviewText','tb_tokens']].sample(2)

Unnamed: 0,reviewText,tb_tokens
3865,Steelseries's best optical mouse,"[steelseriess, best, optical, mouse]"
1318,"It was already on this PC, and I couldn't run it for that reason. I tried changing the player name and everything I could think of. It would be a stretch to give it a good rating.","[it, was, already, on, this, pc, and, i, couldnt, run, it, for, that, reason, i, tried, changing, the, player, name, and, everything, i, could, think, of, it, would, be, a, stretch, to, give, it, a, good, rating]"


<b> Casual Tokenizer</b>

In [93]:
from nltk.tokenize.casual import casual_tokenize

In [94]:
reviews['casual_tokens'] = reviews['rev_text_lower'].apply(lambda rev: casual_tokenize(str(rev)))


In [95]:
reviews[['reviewText','casual_tokens','tb_tokens']].sample(2)


Unnamed: 0,reviewText,casual_tokens,tb_tokens
2251,"The issue I feel I have with this is that it's not really a base builder game, but more of a capture points on the map game, much like Relic's flagship series, Dawn of War 2.\n\nIt's not an rpg/rts like dow2, but I enjoyed more of the base building of DOW and AOE and Starcraft than this. Granted, it's still fun, and it looks good but whatever.\n\nSome people were saying they had disk issues, but I got this game (both expansions) off of steam, and the only issue I have is that there are TONS of patches. Literally GIGS.\n\nIf you have a slow connection and you want to play online, forget about it. Leave your computer on for a few days.\n\nI had no issues with drm.","[the, issue, i, feel, i, have, with, this, is, that, its, not, really, a, base, builder, game, but, more, of, a, capture, points, on, the, map, game, much, like, relics, flagship, series, dawn, of, war, 2, its, not, an, rpgrts, like, dow, 2, but, i, enjoyed, more, of, the, base, building, of, dow, and, aoe, and, starcraft, than, this, granted, its, still, fun, and, it, looks, good, but, whatever, some, people, were, saying, they, had, disk, issues, but, i, got, this, game, both, expansions, off, of, steam, and, the, only, issue, i, have, is, that, there, are, tons, of, patches, ...]","[the, issue, i, feel, i, have, with, this, is, that, its, not, really, a, base, builder, game, but, more, of, a, capture, points, on, the, map, game, much, like, relics, flagship, series, dawn, of, war, 2, its, not, an, rpgrts, like, dow2, but, i, enjoyed, more, of, the, base, building, of, dow, and, aoe, and, starcraft, than, this, granted, its, still, fun, and, it, looks, good, but, whatever, some, people, were, saying, they, had, disk, issues, but, i, got, this, game, both, expansions, off, of, steam, and, the, only, issue, i, have, is, that, there, are, tons, of, patches, literally, ...]"
4156,Hope SONY makes a Prologue Game of The LEGEND OF DRAGOON with D3 with Japan Studio and script writers from FOLK LORE!,"[hope, sony, makes, a, prologue, game, of, the, legend, of, dragoon, with, d3, with, japan, studio, and, script, writers, from, folk, lore]","[hope, sony, makes, a, prologue, game, of, the, legend, of, dragoon, with, d3, with, japan, studio, and, script, writers, from, folk, lore]"


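In these samples the two tokenizers give almost identical results, because punctuation and apostrophes were already stripped from the text before tokenizing. The one visible difference is that the casual tokenizer splits "dow2" into "dow" and "2", while the Treebank tokenizer keeps it as a single token. We'll continue with the Treebank tokens.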

<h3>2. Stemming</h3>

In [96]:
from nltk.stem.porter import PorterStemmer

In [97]:
stemmer = PorterStemmer()

In [98]:
reviews['tokens_stemmed'] = reviews['tb_tokens'].apply(lambda words: [stemmer.stem(w) for w in words])

In [99]:
reviews[['tb_tokens','tokens_stemmed']].sample(2)

Unnamed: 0,tb_tokens,tokens_stemmed
1984,"[i, havent, waited, all, this, time, for, an, incomplete, game, that, is, so, restricted, and, controlled, this, is, not, like, blizzard, blizzard, is, supposed, to, be, the, thoughtful, good, company, that, knows, what, we, the, gamers, want, no, lan, ability, is, unacceptable, no, spawn, installs, makes, me, feel, sad, splitting, the, game, into, 3, parts, and, likely, charging, 60, for, each, is, pretty, ridiculous, ongoing, online, activation, could, get, annoying, this, hardly, seems, like, starcraft, or, blizzard, what, happened]","[i, havent, wait, all, thi, time, for, an, incomplet, game, that, is, so, restrict, and, control, thi, is, not, like, blizzard, blizzard, is, suppos, to, be, the, thought, good, compani, that, know, what, we, the, gamer, want, no, lan, abil, is, unaccept, no, spawn, instal, make, me, feel, sad, split, the, game, into, 3, part, and, like, charg, 60, for, each, is, pretti, ridicul, ongo, onlin, activ, could, get, annoy, thi, hardli, seem, like, starcraft, or, blizzard, what, happen]"
4360,"[i, got, this, to, complet, the, sires, that, i, have, so, fare, and, still, gating, the, others, best, game, ever]","[i, got, thi, to, complet, the, sire, that, i, have, so, fare, and, still, gate, the, other, best, game, ever]"


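After stemming, we can see that words are reduced to a common stem by stripping suffixes: "waited" becomes "wait" and "restricted" becomes "restrict". The stem is not always a dictionary word ("this" becomes "thi", "company" becomes "compani"), and prefixes are left in place ("incomplete" becomes "incomplet"). Stemming helps reduce the dimensionality of the dataset and improves efficiency by merging inflected forms, such as "-ing" forms, that add noise without contributing much to the prediction.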
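
To make the dimensionality point concrete, here is a small sketch that stems a hand-picked word list (the list is an assumption chosen for illustration) and compares the number of distinct forms before and after:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Four inflected forms of the same verb (hand-picked example words).
words = ["wait", "waits", "waited", "waiting"]
stems = [stemmer.stem(w) for w in words]

print(stems)                                   # all four collapse to one stem
print(len(set(words)), "->", len(set(stems)))  # 4 -> 1

# Stems are not guaranteed to be real words.
print(stemmer.stem("this"), stemmer.stem("company"))  # thi compani
```

Four vocabulary entries become one, which is the reduction in feature count that stemming is used for here.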

<h3>3. Lemmatization</h3>

Lemmatization is similar to stemming but aims to reduce words to their canonical or dictionary form (lemma). 

In [100]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

In [101]:
def penn_to_wn(tag):
    """
        Convert between the PennTreebank tags to simple Wordnet tags
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None

In [102]:
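lemmatizer = WordNetLemmatizer()
def get_lemas(tokens):
    """Lemmatize tokens, keeping only adjectives, nouns, adverbs and verbs."""
    lemmas = []
    for token in tokens:
        # Each token is tagged on its own, so the tagger sees no sentence context.
        pos = penn_to_wn(pos_tag([token])[0][1])
        # Tokens whose tag has no WordNet equivalent (e.g. determiners) are dropped.
        if pos:
            lemma = lemmatizer.lemmatize(token, pos)
            if lemma:
                lemmas.append(lemma)
    return lemmas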

In [103]:
import nltk
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\tatav\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [104]:

reviews['lemmas'] = reviews['tb_tokens'].apply(lambda tokens: get_lemas(tokens))

In [105]:
reviews[['reviewText','tokens_stemmed','lemmas']].sample(2)

Unnamed: 0,reviewText,tokens_stemmed,lemmas
3070,Looks great and was a great price\nI like the little Japanese package art touches.,"[look, great, and, wa, a, great, price, i, like, the, littl, japanes, packag, art, touch]","[look, great, be, great, price, i, little, japanese, package, art, touch]"
3384,I'm not a fan of shooters. But I am a fan of Final Fantasy VII so that's why I got this. Still not a fan of shooters but it's all right. I got the game because I love the characters and the story so I'm more than satisfied. The cut scenes are beautiful. They're long but I'd rather watch them than play the game anyway to be honest.,"[im, not, a, fan, of, shooter, but, i, am, a, fan, of, final, fantasi, vii, so, that, whi, i, got, thi, still, not, a, fan, of, shooter, but, it, all, right, i, got, the, game, becaus, i, love, the, charact, and, the, stori, so, im, more, than, satisfi, the, cut, scene, are, beauti, theyr, long, but, id, rather, watch, them, than, play, the, game, anyway, to, be, honest]","[im, not, fan, shooter, i, be, fan, final, fantasy, vii, so, thats, i, get, still, not, fan, shooter, right, i, get, game, i, love, character, story, so, im, more, satisfied, cut, scene, be, beautiful, theyre, long, id, rather, watch, play, game, anyway, be, honest]"


In [106]:
def get_sentiment_score(tokens):
    score = 0
    tags = pos_tag(tokens)
    for word, tag in tags:
        wn_tag = penn_to_wn(tag)
        if not wn_tag:
            continue
        synsets = wn.synsets(word, pos=wn_tag)
        if not synsets:
            continue
        
        # Use the first synset, i.e. the most common sense of the word:
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())
        
        score += (swn_synset.pos_score() - swn_synset.neg_score())
        
    return score

In [107]:
import nltk
nltk.download('sentiwordnet')


[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\tatav\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


True

Now let's check that SentiWordNet gives a sensible score for a clearly positive word.



In [108]:
swn.senti_synset(wn.synsets("perfect", wn.ADJ)[0].name()).pos_score()

0.625

As we can see, the word "perfect" gets a positive score, as expected for a positive word.
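
The review-level score is just the sum of (positive minus negative) over the individual words. The toy sketch below mimics that rule with a tiny hand-made lexicon instead of SentiWordNet. The lexicon, its values (apart from the 0.625 positive score for "perfect" seen above) and the name `toy_score` are all assumptions for illustration only.

```python
# Hypothetical mini-lexicon: word -> (pos_score, neg_score). Values are invented,
# except the 0.625 positive score for "perfect" reported above.
TOY_LEXICON = {
    "perfect": (0.625, 0.0),
    "good": (0.75, 0.0),
    "bad": (0.0, 0.625),
}

def toy_score(tokens):
    """Sum pos - neg over all tokens; unknown words contribute nothing."""
    score = 0.0
    for token in tokens:
        pos, neg = TOY_LEXICON.get(token, (0.0, 0.0))
        score += pos - neg
    return score

print(toy_score(["good", "game"]))         # 0.75
print(toy_score(["not", "good", "game"]))  # 0.75, "not" changes nothing
print(toy_score(["bad", "perfect"]))       # 0.0, the two words cancel out
```

Because each word is scored independently, a purely additive rule like this has no notion of negation, which is worth keeping in mind when reading the review scores.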

In [109]:
reviews['sentiment_score'] = reviews['lemmas'].apply(lambda tokens: get_sentiment_score(tokens))

In [110]:
reviews[['reviewText','lemmas','sentiment_score']].sample(3)

Unnamed: 0,reviewText,lemmas,sentiment_score
835,This Wii was not in good condition and it did not have the two games with it which was very disappointing. I sent it back.,"[wii, be, not, good, condition, do, not, have, game, be, very, disappoint, i, sent, back]",0.0
2541,"Great Dragon Quest game for fans of Dragon Quest and a great building crafting game for fans of Minecraft and other such games. This is not a Minecraft clone, this is an Action RPG with light crafting and building elements. If you enjoy JRPGs or Crafting survival type games, give this one a shot. It's got great art, runs well, has a story and side quests and gives you things to do other then just roam around building things.","[great, dragon, quest, game, fan, dragon, quest, great, building, craft, game, fan, minecraft, other, such, game, be, not, minecraft, clone, be, action, rpg, light, craft, building, element, enjoy, jrpgs, craft, survival, type, game, give, shot, get, great, art, run, well, have, story, side, quest, give, thing, do, other, then, just, roam, building, thing]",-1.125
1440,"I ordered the Sentey Nebulus due to it's price point and me wanting to replace my Razer Deathadder. The Nebulus came and the build and weight was both good. I've no complaints at this point until Day 2 of use and suddenly all key clicks are not recognized, Left, Right, Mouse Wheel, Back, Forward and even the DPI button would not work.\n\nNow I understood that this was likely due to a faulty device and with how it came packaged and all that, I was led to believe that this mouse was worth more than what it's sold for and I was just getting a bad apple. So I got it exchanged and got a 2nd Sentey Nebulus, same amazing product it seems but when I plugged it in the same issue was happening. No key clicks we're recognized and this was when I decided that I'll be staying away from this company/mouse.\n\nI've not been frustrated with a product as much as I have this one and I definitely recommend looking elsewhere for a mouse.","[i, order, sentey, nebulus, due, price, point, want, replace, razer, deathadder, nebulus, come, build, weight, be, good, ive, complaint, point, day, use, suddenly, key, click, be, not, recognize, left, right, mouse, wheel, back, forward, even, dpi, button, not, work, now, i, understood, be, likely, due, faulty, device, come, package, i, be, lead, believe, mouse, be, worth, more, sell, i, be, just, get, bad, apple, so, i, get, exchange, get, sentey, nebulus, same, amaze, product, seem, i, plug, same, issue, be, happen, key, click, be, recognize, be, i, decide, ill, be, stay, away, companymouse, ive, not, be, frustrate, product, much, i, ...]",-1.0


<h3> User input for sentiment analysis</h3>



In [113]:
def classify_sentiment(sentiment_score):
    if sentiment_score > 0:
        return "positive"
    elif sentiment_score < 0:
        return "negative"
    else:
        return "neutral"

def analyze_sentiment(review_text):
    tokens = tb_tokenizer.tokenize(review_text.lower())
    lemmas = get_lemas(tokens)
    sentiment_score = get_sentiment_score(lemmas)
    return classify_sentiment(sentiment_score)

while True:
    review_text = input("Enter a review (type 'exit' to quit): ")
    if review_text.lower() == 'exit':
        break
    sentiment = analyze_sentiment(review_text)
    print(f"The sentiment of the review is: {sentiment}")


The sentiment of the review is: negative
The sentiment of the review is: negative
The sentiment of the review is: positive
The sentiment of the review is: positive
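
The interactive loop needs someone at the keyboard. For a quick non-interactive check of the thresholds, the sketch below repeats `classify_sentiment` and applies it to the three sentiment scores from the sample output shown earlier (0.0, -1.125 and -1.0). The `labels` list is a helper name introduced only for this example.

```python
def classify_sentiment(sentiment_score):
    """Map a numeric score to a label: >0 positive, <0 negative, 0 neutral."""
    if sentiment_score > 0:
        return "positive"
    elif sentiment_score < 0:
        return "negative"
    else:
        return "neutral"

# Scores taken from the sampled rows 835, 2541 and 1440 above.
sample_scores = [0.0, -1.125, -1.0]
labels = [classify_sentiment(s) for s in sample_scores]

for score, label in zip(sample_scores, labels):
    print(f"{score:>7} -> {label}")
```

Row 2541 is an enthusiastic review, yet its score of -1.125 maps to "negative", so the labels are best treated as a rough signal and spot-checked against the review text.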
