### Data
Sentiment is a measure of how positive or negative a piece of text is.

These are Amazon reviews that come with 5-star ratings; we will look at the electronics category. The data comes from this link:
http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html (it has multi-domain data)

The data is already labeled for us, split into two files:

negative.review.txt

positive.review.txt

Each file is XML, so we will need an XML parser (BeautifulSoup). We will ignore all the extra fields and only look at "review_text". To build the feature data, we count the number of occurrences of each word in a review and divide it by the total number of words kept from that review.
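
As a minimal sketch of the feature described above (word count divided by total word count), the snippet below works on a toy token list. The function name `word_proportions` and the example tokens are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def word_proportions(tokens):
    """Return {word: count / total_tokens} for one document (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # An empty document has no proportions; this avoids dividing by zero.
        return {}
    return {word: n / total for word, n in counts.items()}

# Toy tokens standing in for one tokenized review.
example_tokens = ["power", "supply", "power", "clean"]
proportions = word_proportions(example_tokens)
print(proportions)
```

The guard for an empty token list matters because dividing by a zero total would otherwise raise an error here, or produce NaN values in a NumPy version of the same calculation.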

In [3]:
import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

In [32]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [4]:
wordnet_lemmatizer = WordNetLemmatizer() # reduces words to their base form, e.g. both 'cat' and 'cats' become 'cat'

In [5]:
# http://www.lextek.com/manuals/onix/stopwords1.html
stopwords = set(w.rstrip() for w in open('stopwords.txt'))  # read one stopword per line from stopwords.txt, stripping trailing whitespace (including the newline)

In [12]:
positive_reviews = BeautifulSoup(open('positive.review.txt').read(), "lxml")

In [13]:
positive_reviews = positive_reviews.findAll('review_text')

In [16]:
positive_reviews[0:2]

[<review_text>
 I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.
 
 I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.
 
 As always, Amazon had it to me in &lt;2 business days
 </review_text>, <review_text>
 I ordered 3 APC Back-UPS ES 500s on the recommendation of an employee of mine who used to work at APC. I've had them for about a month now without any problems. They've functioned properly through a few unexpected power interruptions. I'll gladly order more if the need arises.
 
 Pros:
  - Large plug spacing, good for power adapters
  - Simple design
  - Long cord
 
 Cons:
  - No line conditioning (usually an expensive option
 </review_t

In [18]:
negative_reviews = BeautifulSoup(open('negative.review.txt').read(), "lxml")

In [20]:
negative_reviews = negative_reviews.findAll('review_text')

In [21]:
negative_reviews[0:2]

[<review_text>
 cons
 tips extremely easy on carpet and if you have a lot of cds stacked at the top
 
 poorly designed, it is a vertical cd rack that doesnt have individual slots for cds, so if you want a cd from the bottom of a stack you have basically pull the whole stack to get to it
 
 putting it together was a pain, the one i bought i had to break a piece of metal just to fit it in its guide holes.
 
 again..poorly designed... doesnt even fit cds that well, there are gaps, and the cd casses are loose fitting
 
 pros
 ..........
 i guess it can hold a lot of cds....
 </review_text>, <review_text>
 It's a nice look, but it tips over very easily. It is not steady on a rug surface dispite what the picture on the box shows. My advice is if you need a CD rack that holds a lot of CD's? Save your money and invest in something nicer and more sturdy
 </review_text>]

In [22]:
len(negative_reviews)

1000

In [23]:
len(positive_reviews)

1000

In [24]:
# So the numbers of positive and negative reviews are the same (the classes are balanced).

In [34]:
def my_tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)  # split into word tokens (also separates punctuation and contractions such as "n't", not just whitespace)
    tokens = [t for t in tokens if len(t) > 2] # only keep tokens longer than 2 characters
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]  # lemmatize them
    tokens = [t for t in tokens if t not in stopwords] # removing the stopwords
    return tokens

In [38]:
positive_tokenized = []
negative_tokenized = []

In [39]:
# Now I want to create an index for each of my words, so that each word will have its own index in the final data vector.
word_index_map = {} # map words to indices
current_index = 0 # it will increase whenever I see a new word.

for review in positive_reviews:
    tokens = my_tokenizer(review.text) # tokenize one review; '.text' returns the tag's text content as a string
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1

In [40]:
for review in negative_reviews:
    tokens = my_tokenizer(review.text) # tokenize one review; '.text' returns the tag's text content as a string
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
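
The two loops above build the same word-to-index map, so the logic can be factored into one helper. This is only a sketch under assumptions: `build_word_index` and the toy documents are hypothetical names, and it uses `len(word_index)` as the next free index instead of a separate counter.

```python
def build_word_index(tokenized_docs):
    """Map each distinct token to the order in which it was first seen (hypothetical helper)."""
    word_index = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word_index:
                # The next free index always equals the current vocabulary size.
                word_index[token] = len(word_index)
    return word_index

# Toy documents standing in for positive_tokenized + negative_tokenized.
toy_docs = [["power", "supply", "power"], ["bad", "supply"]]
toy_index = build_word_index(toy_docs)
print(toy_index)  # {'power': 0, 'supply': 1, 'bad': 2}
```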

In [47]:
positive_tokenized[0] # tokens for the 1st review

['purchased',
 'this',
 'unit',
 'due',
 'frequent',
 'blackout',
 'power',
 'supply',
 'bad',
 'run',
 'cable',
 'modem',
 'router',
 'lcd',
 'monitor',
 'minute',
 'this',
 'time',
 'save',
 'shut',
 'equally',
 'electronics',
 'receiving',
 'clean',
 'power',
 'feel',
 'this',
 'investment',
 'minor',
 'compared',
 'loss',
 'valuable',
 'data',
 'failure',
 'equipment',
 'due',
 'power',
 'spike',
 'irregular',
 'power',
 'supply',
 'amazon',
 'business',
 'day']

In [41]:
def tokens_to_vector(tokens, label):
    x = np.zeros(len(word_index_map) + 1) # our vocabulary size + 1 (for the label)
    for t in tokens:
        i = word_index_map[t] # get the index from the word_index_map
        x[i] += 1  # count how many times each token (word) occurs, stored at that word's index in x
    x = x / x.sum() # convert counts to proportions (the label slot is still 0 here, so the sum is the token count; assumes at least one token)
    x[-1] = label # the last position in x holds the label
    return x

In [48]:
N = len(positive_tokenized) + len(negative_tokenized)
data = np.zeros((N, len(word_index_map) + 1))
i = 0

for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens, 1)  # creating the xy data
    data[i, :] = xy  # fill row i of the document-term matrix (one row per review)
    i += 1

In [50]:
for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens, 0)
    data[i, :] = xy
    i += 1

In [51]:
data[0:3,]  # first three rows of the document-term matrix (the last column is the label)

array([[ 0.02272727,  0.06818182,  0.02272727, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  1.        ],
       [ 0.        ,  0.        ,  0.08333333, ...,  0.        ,
         0.        ,  1.        ]])

In [52]:
np.random.shuffle(data)
X = data[:, :-1] # all rows and everything except the last column
Y = data[:, -1] # all rows and the last column

Xtrain = X[:-100, ] # n-100 rows
Ytrain = Y[:-100, ]
Xtest = X[-100:, ] # last 100 rows
Ytest = Y[-100:, ]
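
Because the shuffle above is unseeded, the train/test split (and therefore the score) changes from run to run. The sketch below shows the same shuffle-then-hold-out idea with a fixed seed, using only the standard library on toy rows. The helper `shuffle_and_split`, its `seed` parameter, and the toy data are assumptions for illustration.

```python
import random

def shuffle_and_split(rows, n_test, seed=0):
    """Shuffle a copy of rows with a fixed seed, then hold out the last n_test rows."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    rng = random.Random(seed)   # private generator, so the result is reproducible
    shuffled = list(rows)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    return shuffled[:-n_test], shuffled[-n_test:]

# Toy (feature, label) rows standing in for the rows of `data`.
toy_rows = [(i, i % 2) for i in range(10)]
train_rows, test_rows = shuffle_and_split(toy_rows, n_test=3)
print(len(train_rows), len(test_rows))  # 7 3
```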

In [53]:
model = LogisticRegression()
model.fit(Xtrain, Ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [54]:
print("Classification Rate: ", model.score(Xtest, Ytest))

Classification Rate:  0.76
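
For a classifier, `model.score` reports mean accuracy: the fraction of test samples predicted correctly. With 100 held-out reviews, 0.76 corresponds to 76 correct predictions, against a baseline of roughly 0.5 for balanced classes. A minimal sketch of that calculation on toy labels (the `accuracy` helper and the labels are hypothetical):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels: three of the four predictions match.
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
print(accuracy(toy_true, toy_pred))  # 0.75
```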


In [55]:
# Now we can look at the weight each word has, to see whether that word carries positive or negative sentiment.
# We are only interested in the weights that are far from 0.
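
Iterating over a dict yields words in insertion order, so strong positive and negative words come out interleaved. Sorting by weight can make such a listing easier to read. The sketch below assumes a plain `{word: weight}` dict; `strongest_words` and the toy weights are illustrative, not the model's actual coefficients.

```python
def strongest_words(weights, threshold=0.5):
    """Keep words whose |weight| exceeds threshold, sorted from most positive to most negative."""
    strong = [(word, w) for word, w in weights.items() if abs(w) > threshold]
    return sorted(strong, key=lambda pair: pair[1], reverse=True)

# Toy weights; in the notebook these would come from model.coef_[0] via word_index_map.
toy_weights = {"excellent": 1.38, "waste": -0.95, "cable": 0.62, "phone": 0.10}
for word, weight in strongest_words(toy_weights):
    print(word, weight)
```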

In [65]:
len(model.coef_[0])

11091

In [72]:
threshold = 0.5
for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    if weight > threshold or weight < -threshold:
        print(word, weight)

# Large positive weight: word associated with positive reviews
# Large negative weight: word associated with negative reviews

excellent 1.37901783019
comfortable 0.616864182857
home 0.524906543303
pro 0.501658861835
support -0.838216132451
returned -0.801154845597
buy -0.808404552766
customer -0.656858883524
quality 1.51424312473
recommend 0.669441306869
picture 0.554290487222
paper 0.581735867138
refund -0.592400150479
bad -0.778819168452
time -0.595275432258
little 0.873519580721
unit -0.741762453688
waste -0.947856210098
then -1.09319438347
month -0.718469318886
item -0.968374402213
price 2.80746360336
using 0.65710622088
sound 1.04919740495
pretty 0.770956716708
warranty -0.618821051962
expected 0.557377790438
speaker 0.946786896444
cable 0.621942760353
wa -1.56327408287
you 1.023016148
return -1.19354366257
video 0.634326414007
week -0.745043401469
space 0.638827835221
easy 1.76637483035
've 0.800972810874
money -0.989810580724
n't -1.80723776261
lot 0.677668764814
fast 0.915560219336
try -0.693867551178
ha 0.872442812886
bit 0.619292094803
doe -1.18853213168
happy 0.541922097699
junk -0.515878107438
hig