In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import scipy
from sklearn.naive_bayes import BernoulliNB
import seaborn as sns

Here I classify Yelp reviews with a model trained on Amazon reviews. To begin, let's load the *amazon_cells_labelled.txt* file:

In [2]:
# os.path.join keeps the path portable across operating systems
amazon = pd.read_csv(os.path.join('sentiment labelled sentences', 'amazon_cells_labelled.txt'), delimiter='\t', header=None)
amazon.columns = ['text', 'sentiment']
amazon.head()

Unnamed: 0,text,sentiment
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


The plan is to start with a simple model:
1. download the bookmarked positive words list and include it in this analysis
2. tokenize with lowercasing
3. remove stop words, etc.


Now let's load a file of positive words, obtained from [mkulakowski2's gist on GitHub](https://gist.github.com/mkulakowski2/4289437):

In [3]:
# load it
with open('positive-words.txt','r') as file:
    positive = file.read()


# remove description and empty elements
positive = positive[positive.rfind(';')+1:]
positive = positive.split('\n')

for i in range(positive.count('')):
    positive.remove('')

# add a couple of positive words    
positive.append('cool')
positive.append('decent')
    
len(positive)

2008

Load another list, this time consisting of stop words:

In [4]:
# load list of stop words
with open('stop_words.txt','r') as file:
    stop = file.read()
stop = stop.split('\n')

On to the final list of negative words:

In [5]:
# load it
with open('negative-words.txt','r') as file:
    negative = file.read()


# remove description and empty elements
negative = negative[negative.rfind(';')+1:]
negative = negative.split('\n')

for i in range(negative.count('')):
    negative.remove('')

len(negative)

4783

Since this text corpus is lengthy, adding each word as a column may not be very efficient. Instead, we can calculate the fraction of words in each review that are positive, skipping any positive word that has *not* in one of the two positions before it (as in *not good*, *not that good* or *not so good*), since such a phrase is likely negative.

Additionally, we need to remove punctuation marks from words first. Many punctuation marks are attached to words without a separating space, so they end up inside the tokens and prevent a match when the tokens are checked for membership in the positive and negative word lists.
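
To see the problem concretely, here is a small illustration with a hypothetical two-word lexicon. It uses `str.strip(string.punctuation)` purely for brevity, which is a different technique from the translation table used in the next cell.

```python
import string

positive_words = {'great', 'good'}  # hypothetical lexicon, for illustration

# Splitting on spaces leaves punctuation glued to the tokens
tokens = "Good case, works great!".lower().split(' ')

# 'great!' is not in the lexicon even though 'great' is
raw_hits = [t for t in tokens if t in positive_words]

# Stripping leading and trailing punctuation restores the match
cleaned = [t.strip(string.punctuation) for t in tokens]
clean_hits = [t for t in cleaned if t in positive_words]

print(tokens)
print(raw_hits)
print(clean_hits)
```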

In [6]:
# Prepare a translation table from punctuation
punctuation = ''.join(['.',',',';',':','-','?','!'])
TRANSDICT = str.maketrans(punctuation,' '*len(punctuation))

def remove_punctuation(word):
    """ removes punctuation from a word"""
    return word.translate(TRANSDICT).strip().replace(' ','')


def percent_positive(review):
    """ Tokenizes each sentence, checks for membership in positive words,
        makes sure positive words are not preceded by 'not'
    """
    
    # tokenize a sentence and remove punctuation
    tokenized = review.lower().split(' ')
    tokenized = [remove_punctuation(word) for word in tokenized]
    pcnt = 0
    
    # check for membership in positive words list, making sure 'not' doesn't precede
    for word in tokenized:
        if tokenized.index(word) == 0 and (word in positive):
            pcnt += 1/len(tokenized)
        elif tokenized.index(word) == 1:
            if word in positive and (tokenized[tokenized.index(word)-1] != 'not'):
                pcnt += 1/len(tokenized)
        elif tokenized.index(word) > 1:
            if word in positive and (tokenized[tokenized.index(word)-1] != 'not') and (tokenized[tokenized.index(word)-2] != 'not'):
                pcnt += 1/len(tokenized)
    return pcnt

# Apply percent_positive to the text column in our dataframe
amazon['positive'] = amazon['text'].apply(percent_positive)

Now we do the same thing for negative sentiments. The negative word list was obtained from [this webpage](http://ptrckprry.com/course/ssd/data/negative-words.txt) and it includes misspelled negative words as well:

In [7]:
def percent_negative(review):
    """ Tokenizes each sentence, checks for membership in negative words,
        makes sure negative words are not preceded by 'not'
    """
    
    # tokenize a sentence and remove punctuation
    tokenized = review.lower().split(' ')
    tokenized = [remove_punctuation(word) for word in tokenized]
    pcnt = 0
    
    # check for membership in negative words list, making sure 'not' doesn't precede
    for word in tokenized:
        if tokenized.index(word) == 0 and word in negative:
            pcnt += 1/len(tokenized)
        elif tokenized.index(word) == 1:
            if word in negative and (tokenized[tokenized.index(word)-1] != 'not'):
                pcnt += 1/len(tokenized)
        elif tokenized.index(word) > 1:
            if word in negative and (tokenized[tokenized.index(word)-1] != 'not') and (tokenized[tokenized.index(word)-2] != 'not'):
                pcnt += 1/len(tokenized)
    return pcnt

# Apply percent_negative to the text column in our dataframe
amazon['negative'] = amazon['text'].apply(percent_negative)
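
The two functions above differ only in the word list they consult, so they could share one implementation. One caveat of the loops above is that `list.index` returns the position of the *first* occurrence of a word, so a word that appears twice in a review is checked against the neighbours of its first occurrence both times. The sketch below is one possible alternative that walks the tokens with `enumerate` instead; `lexicon_fraction`, `negators` and `window` are illustrative names, and results produced with it may differ slightly from the numbers reported in this notebook.

```python
def lexicon_fraction(tokens, lexicon, negators=('not',), window=2):
    """Fraction of tokens found in `lexicon` and not negated.

    A token counts as negated when any of the `window` tokens just
    before it is in `negators`. Passing `lexicon` as a set keeps each
    membership test fast.
    """
    if not tokens:
        return 0.0
    hits = 0
    for i, word in enumerate(tokens):
        if word not in lexicon:
            continue
        # Look only at the tokens immediately preceding position i
        preceding = tokens[max(0, i - window):i]
        if any(p in negators for p in preceding):
            continue
        hits += 1
    return hits / len(tokens)


# The first 'good' is negated, the second one is not
example = lexicon_fraction(['not', 'good', 'but', 'good'], {'good'})
print(example)
```

Here the second *good* is counted because its own two predecessors are *good* and *but*, whereas an index-based lookup would reuse the position of the first *good*.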

Now let's check our performance:

In [8]:
# Sort values by most negative first
amazon.sort_values(by=['negative'],ascending=False).head()

Unnamed: 0,text,sentiment,positive,negative
502,Defective crap.,0,0.0,1.0
751,disappointing.,0,0.0,1.0
463,Disappointed!.,0,0.0,1.0
993,disappointed.,0,0.0,1.0
836,"Horrible, horrible protector.",0,0.0,0.666667


In [9]:
# Sort values by most positive first
amazon.sort_values(by=['positive'],ascending=False).head()

Unnamed: 0,text,sentiment,positive,negative
407,Works great.,1,1.0,0.0
92,Worked great!.,1,1.0,0.0
689,Works well.,1,1.0,0.0
185,Incredible!.,1,1.0,0.0
870,Works fine.,1,1.0,0.0


Pay attention to our target feature, *sentiment*. It aligns with the *positive* and *negative* columns in both cases, at least for these top-scoring rows. The data is therefore ready for training a Bernoulli naive Bayes classifier. Before that, let's load the Yelp dataset and add the *positive* and *negative* columns to it:
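
One detail worth keeping in mind: scikit-learn's `BernoulliNB` has a `binarize` parameter that defaults to `0.0`, so each feature is thresholded to 0 or 1 before fitting. With the default, the model sees only whether a review contains any un-negated positive or negative word, not how large the fraction is. The plain-Python sketch below mimics that thresholding; the `binarize` helper is illustrative and is not the library's own code.

```python
def binarize(rows, threshold=0.0):
    """Map every value above `threshold` to 1 and everything else to 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


# (positive, negative) pairs: the first two are taken from the tables
# above, the last one is a hypothetical review with no lexicon hits
features = [
    [1.0, 0.0],       # "Works great."
    [0.0, 0.666667],  # "Horrible, horrible protector."
    [0.0, 0.0],
]
binary_features = binarize(features)
print(binary_features)
```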


In [10]:
# Load the Yelp dataset
yelp = pd.read_csv(os.path.join('sentiment labelled sentences', 'yelp_labelled.txt'), delimiter='\t', header=None)
yelp.columns = ['text', 'sentiment']
yelp.head()


# Apply percent_positive to the text column in our dataframe
yelp['positive'] = yelp['text'].apply(percent_positive)


# Apply percent_negative to the text column in our dataframe
yelp['negative'] = yelp['text'].apply(percent_negative)

yelp.head()

Unnamed: 0,text,sentiment,positive,negative
0,Wow... Loved this place.,1,0.5,0.0
1,Crust is not good.,0,0.0,0.0
2,Not tasty and the texture was just nasty.,0,0.0,0.125
3,Stopped by during the late May bank holiday of...,1,0.133333,0.0
4,The selection on the menu was great and so wer...,1,0.083333,0.0


In [11]:
# Initialize a model object
classifier = BernoulliNB()

# Fit our model to the data.
classifier.fit(amazon[['positive','negative']], amazon['sentiment'])

# Classify, storing the result in a new variable.
y_pred = classifier.predict(yelp[['positive','negative']])

# Display our results.
print("Number of mislabeled points out of a total {} points : {}".format(
    yelp.shape[0],
    (yelp['sentiment'] != y_pred).sum()
))

Number of mislabeled points out of a total 1000 points : 185
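
The error count converts directly into an accuracy figure. The short sketch below does the arithmetic with the numbers printed above; `error_count` and the toy labels are hypothetical and only illustrate what the comparison `yelp['sentiment'] != y_pred` is counting.

```python
def error_count(y_true, y_pred):
    """Number of positions where the prediction differs from the label."""
    return sum(t != p for t, p in zip(y_true, y_pred))


# Toy labels, for illustration only: one of four predictions is wrong
toy_errors = error_count([1, 0, 0, 1], [1, 0, 1, 1])

# Figures reported by the cell above
total = 1000
mislabeled = 185
correct = total - mislabeled
accuracy = correct / total
print(f"{correct} correct out of {total}, accuracy = {accuracy:.1%}")
```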


# Conclusion

Our model predicted 815 of the 1,000 Yelp review sentiments correctly (81.5% accuracy) when trained only on Amazon reviews. This suggests that simple word-list features can generalize across pre-prepared datasets from different sources and serve as a basis for building strong prediction models.