# Background

For this project we'll be using the Multi-Domain Sentiment Dataset, which is available at:
https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html

The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from 4 product types (domains): Kitchen, Books, DVDs, and Electronics. Each domain has several thousand reviews, but the exact number varies by domain. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
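
The conversion from star ratings to binary labels could look like the sketch below. The cut-offs are an assumption (more than 3 stars is positive, fewer than 3 is negative, exactly 3 is skipped as ambiguous), and `rating_to_label` is a hypothetical helper rather than part of the dataset's tooling.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (ambiguous)."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # assumption: 3-star reviews are treated as neutral and skipped


ratings = [5.0, 1.0, 3.0, 4.0, 2.0]
labels = [rating_to_label(r) for r in ratings]
print(labels)  # [1, 0, None, 1, 0]
```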
        
    

We'll start as usual by importing the necessary libraries

In [38]:
import urllib.request
import os
import tarfile
from bs4 import BeautifulSoup
import nltk
nltk.download(['punkt','wordnet'])
import numpy as np
from nltk.stem import WordNetLemmatizer 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/riaanmostert/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/riaanmostert/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Next we'll download the required file and extract it to our working directory

In [2]:
download_root = "https://www.cs.jhu.edu/~mdredze/datasets/sentiment/"
file_name = "domain_sentiment_data.tar.gz"
sentiment_url =  download_root + file_name


def download_extract(url,location):
    '''
    url: URL where the archive resides
    location: directory on the workstation the archive is downloaded and extracted to
    '''
    # Use the url argument (not the global) and derive the file name from it
    gz_path = os.path.join(location, os.path.basename(url))
    urllib.request.urlretrieve(url=url, filename=gz_path)
    # The with-block closes the archive even if extraction fails
    with tarfile.open(gz_path) as gz_folder:
        gz_folder.extractall(path=location)

In [3]:
download_extract(url= sentiment_url,location=os.getcwd())
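
Re-running the cell above downloads the archive again each time. A minimal sketch of a variant that skips the download when the archive is already on disk is shown below; `fetch_once` is a hypothetical name, not something used elsewhere in this notebook.

```python
import os
import tarfile
import urllib.request


def fetch_once(url, location):
    """Download url into location unless the archive is already there, then extract it."""
    archive_path = os.path.join(location, os.path.basename(url))
    if not os.path.exists(archive_path):
        # Only hit the network when the archive is missing
        urllib.request.urlretrieve(url, filename=archive_path)
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=location)
    return archive_path
```

Calling `fetch_once(sentiment_url, os.getcwd())` a second time would then reuse the archive on disk and only repeat the extraction.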

We'll import a list of stopwords that was downloaded from: http://www.lextek.com/manuals/onix/stopwords1.html

In [4]:
# Read one stopword per line; the with-block closes the file afterwards
with open('stopwords.txt') as f:
    stopwords = set(w.rstrip() for w in f)

Looking at the list of stopwords, we see some surprising inclusions, like 'great', 'important' and 'problem'. Because words like these can carry sentiment, we'll remove them from the stopword list so that they are kept in the reviews

In [5]:
not_stopwords= ['best','better','good','great',
'greater','greatest','important','interesting','problem','problems','work',
'worked','working','works','would']

In [6]:
# Set difference keeps stopwords as a set, so membership tests stay fast
stopwords = stopwords - set(not_stopwords)
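
The effect of taking sentiment-bearing words out of the stopword list can be seen on a toy example. The word lists below are made up for illustration; the real list comes from `stopwords.txt`.

```python
# Toy stopword list; the real one is read from stopwords.txt
toy_stopwords = {"the", "is", "a", "great", "problem"}
keep = {"great", "problem"}  # sentiment-bearing words we want to keep in reviews

filtered_stopwords = toy_stopwords - keep  # set difference

review = ["the", "battery", "is", "a", "great", "problem"]
before = [w for w in review if w not in toy_stopwords]
after = [w for w in review if w not in filtered_stopwords]
print(before)  # ['battery']
print(after)   # ['battery', 'great', 'problem']
```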

Since the reviews are in an XML-like format, we'll use the BeautifulSoup library to assist us in importing the files

In [7]:
# Parse the file, then keep only the <review_text> elements
with open('sorted_data_acl/electronics/positive.review') as f:
    positive_reviews = BeautifulSoup(f.read(), 'lxml')
positive_reviews = positive_reviews.find_all('review_text')



In [8]:
# Parse the file, then keep only the <review_text> elements
with open('sorted_data_acl/electronics/negative.review') as f:
    negative_reviews = BeautifulSoup(f.read(), 'lxml')
negative_reviews = negative_reviews.find_all('review_text')

In [9]:
print('Number of positive reviews: {}'.format(len(positive_reviews)))
print('Number of negative reviews: {}'.format(len(negative_reviews)))

Number of positive reviews: 1000
Number of negative reviews: 1000


We see that the positive and negative reviews are evenly split (1000 each), which is convenient for our modelling exercise because accuracy is then a meaningful metric. Next we'll write a function that tokenizes the reviews in four steps: convert the text to lowercase, keep only tokens longer than two characters, lemmatize each token to its base (dictionary) form, and finally remove the stopwords

In [10]:
wordnet_lemmatizer = WordNetLemmatizer()

def my_tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    tokens = [t for t in tokens if len(t) > 2 ]
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in  stopwords]
    return tokens

Let's create a word-to-index map so that we can create our word-frequency vectors later. Let's also save the tokenized versions so we don't have to tokenize again later

In [11]:
word_index_map = {}
current_index = 0

positive_tokenize = []
nagative_tokenize = []


for review in positive_reviews:
    tokens = my_tokenizer(review.text)
    positive_tokenize.append(tokens)

    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
            
            
for review in negative_reviews:
    tokens = my_tokenizer(review.text)
    nagative_tokenize.append(tokens)

    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
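
The index-assignment logic can be seen in isolation on toy data. The sketch below is a compact equivalent using `dict.setdefault`; `build_word_index` and the toy reviews are hypothetical and only illustrate the idea.

```python
def build_word_index(tokenized_reviews):
    """Give each distinct token the next free integer index, in order of first appearance."""
    index = {}
    for tokens in tokenized_reviews:
        for token in tokens:
            # len(index) is the next unused index; existing tokens keep theirs
            index.setdefault(token, len(index))
    return index


toy_reviews = [["great", "price"], ["bad", "price", "great"]]
toy_index = build_word_index(toy_reviews)
print(toy_index)  # {'great': 0, 'price': 1, 'bad': 2}
```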

In [12]:
positive_tokenize[0]

['purchased',
 'unit',
 'due',
 'frequent',
 'blackout',
 'power',
 'supply',
 'bad',
 'run',
 'cable',
 'modem',
 'router',
 'lcd',
 'monitor',
 'minute',
 'time',
 'save',
 'work',
 'shut',
 'equally',
 'important',
 'electronics',
 'receiving',
 'clean',
 'power',
 'feel',
 'investment',
 'minor',
 'compared',
 'loss',
 'valuable',
 'data',
 'failure',
 'equipment',
 'due',
 'power',
 'spike',
 'irregular',
 'power',
 'supply',
 'amazon',
 'business',
 'day']

Next we'll write a function that first counts how often each word occurs in a review and then divides by the total count, so that each entry is the proportion of the review's tokens made up by that word. The label is stored in the last position of the vector

In [13]:
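def tokens_to_vector(tokens,label):
    # One slot per vocabulary word, plus a final slot for the label
    x = np.zeros(len(word_index_map)+1)
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1
    # Convert counts to proportions; skip empty reviews to avoid dividing by zero
    total = x.sum()
    if total > 0:
        x = x / total
    x[-1] = label
    return x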
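
As a worked example of the count-then-normalise step, take a hypothetical three-word vocabulary and a four-token review. The names `toy_index` and `toy_tokens_to_vector` are illustrative stand-ins for `word_index_map` and the function above.

```python
import numpy as np

toy_index = {"great": 0, "price": 1, "bad": 2}  # hypothetical vocabulary


def toy_tokens_to_vector(tokens, label):
    x = np.zeros(len(toy_index) + 1)  # one slot per word plus one for the label
    for t in tokens:
        x[toy_index[t]] += 1          # raw counts
    x = x / x.sum()                   # counts -> proportions
    x[-1] = label                     # label goes in the last slot
    return x


vec = toy_tokens_to_vector(["great", "price", "great", "great"], 1)
print(vec.tolist())  # [0.75, 0.25, 0.0, 1.0]
```

Here 'great' makes up three of the four tokens (0.75), 'price' one of four (0.25), 'bad' does not occur, and the final entry is the label.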
    

In [14]:
N = len(positive_reviews) + len(negative_reviews)
data = np.zeros((N,len(word_index_map)+1))
i = 0

for token in positive_tokenize:
    xy = tokens_to_vector(token,1)
    data[i,:] = xy
    i += 1
    
for token in nagative_tokenize:
    xy = tokens_to_vector(token,0)
    data[i,:] = xy
    i += 1

Next we'll shuffle the data and split it into a training set and a test set, holding out the last 100 reviews for testing

In [15]:
np.random.seed(567)
np.random.shuffle(data)

X = data[:,:-1]
y = data[:,-1]

In [16]:
X_train = X[:-100,]
y_train = y[:-100]

In [17]:
X_test = X[-100:,]
y_test = y[-100:]
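
The same kind of hold-out split can be sketched with scikit-learn's `train_test_split`, which shuffles and splits in one call and can keep the class balance equal in both parts via `stratify`. The toy arrays below stand in for `X` and `y`; the sizes and seed are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature matrix and labels (20 rows, balanced classes)
X_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.array([1] * 10 + [0] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=4,       # an integer means "this many rows"
    random_state=567,
    stratify=y_toy,    # keep the positive/negative ratio in both parts
)
print(X_tr.shape, X_te.shape)  # (16, 2) (4, 2)
print(int(y_te.sum()))         # 2
```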

In [18]:
model = LogisticRegression()
model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
print('Accuracy rate {:1.2f} '.format(model.score(X_test,y_test)))

Accuracy rate 0.75 


Not bad for the first try! Let's look at the weights for the different words to see whether they make sense. We'll only print words whose absolute weight exceeds a threshold of 0.5

In [20]:
threshold = 0.5

for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    
    if abs(weight) > threshold:
        print(word,weight)

unit -0.5192522345018026
bad -0.7216540338202034
cable 0.5665912597138244
time -0.785508495761577
've 0.7423653151173176
month -0.8401167522996076
problem 0.6864006668703992
good 2.0418603056513405
sound 0.9271781879037885
lot 0.6969369705471137
n't -2.214416060680395
easy 1.6964124737926294
quality 1.2268962576370404
company -0.5812177116732773
card -0.5175925736432473
best 1.1583655523035772
item -1.0006937155653035
working -0.6278319414574575
wa -1.4985996144368325
perfect 1.0105649222514923
fast 0.8750718518437195
ha 0.5226216515686919
price 2.532786135135857
great 3.8206824310571283
money -1.125056780650876
memory 0.9158174736872906
would -0.9681810064705028
buy -1.0918928377260069
worked -0.8976187684010911
happy 0.6211608325395602
pretty 0.6295695652542614
doe -1.1507679196744474
highly 1.0180175252415575
recommend 0.6472559119760061
customer -0.6012834046036264
support -0.81876351877893
little 0.7997269220004799
returned -0.7761606505264842
excellent 1.2630159048311336
love 1.0

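These weights mostly make sense. For example, reviews containing words such as 'bad', 'returned' or 'money' are more likely to be negative, while 'great' has the largest positive weight and strongly increases the likelihood of a review being positive. A few entries appear to be artifacts of the preprocessing: 'wa', 'ha' and 'doe' are what the lemmatizer produces for 'was', 'has' and 'does', and "n't" and "'ve" come from the tokenizer splitting contractions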
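
Instead of a fixed threshold, the weights can also be ranked to show the strongest words in each direction. A minimal sketch, using a handful of the weights printed above (rounded), with `toy_weights` as a hypothetical stand-in for the word-to-coefficient pairs:

```python
toy_weights = {
    "great": 3.82, "price": 2.53, "good": 2.04,
    "n't": -2.21, "wa": -1.50, "money": -1.13,
}

# Sort ascending by weight: most negative first, most positive last
ranked = sorted(toy_weights.items(), key=lambda kv: kv[1])
most_negative = ranked[:2]
most_positive = ranked[-2:][::-1]
print(most_negative)  # [("n't", -2.21), ('wa', -1.5)]
print(most_positive)  # [('great', 3.82), ('price', 2.53)]
```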

We'll fit a few additional models to see whether we can improve the accuracy rate: a naive Bayes, a random forest and an AdaBoost model

In [39]:
nb_model = MultinomialNB()
nb_model.fit(X_train,y_train)

In [41]:
print('Accuracy rate {:1.2f} '.format(nb_model.score(X_test,y_test)))

Accuracy rate 0.76 


In [24]:
rf_model = RandomForestClassifier(n_estimators=200,random_state=893)
rf_model.fit(X_train,y_train)

In [26]:
print('Accuracy rate {:1.2f} '.format(rf_model.score(X_test,y_test)))

Accuracy rate 0.79 


In [31]:
ada_model =  AdaBoostClassifier(n_estimators=200,random_state=893)
ada_model.fit(X_train,y_train)

In [36]:
print('Accuracy rate {:1.2f} '.format(ada_model.score(X_test,y_test)))

Accuracy rate 0.80 


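The AdaBoost model gives the best accuracy of the four, with only `n_estimators` and `random_state` set and all other hyperparameters left at their defaults. With a test set of 100 reviews, though, the gap between 0.79 and 0.80 is a single review, so the ranking is tentative. We may be able to achieve better results by making use of a recursive neural network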
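
One way to get a steadier comparison than a single 100-review hold-out is k-fold cross-validation with `cross_val_score`, which is imported above but not used. The sketch below runs on synthetic data from `make_classification`, which is only a stand-in for the review vectors, so the scores it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 samples, 20 features, two classes
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Five folds: each sample is used for testing exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(2))
print("Mean accuracy: {:1.2f}".format(scores.mean()))
```

On the real data, the same call with `X` and `y` in place of the synthetic arrays would give one accuracy per fold for each model being compared.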