I will be working with product review text from Amazon. The reviews cover only products in the "Electronics" category. The objective is to train a model that predicts the rating, from 1 to 5 stars, from the text of the customer review, and to extract whatever additional insight I can from the data.

In [1]:
import gzip
import ujson as json

with gzip.open("Desktop/Stuff/amazon_electronics_reviews_training.json.gz", "r") as f:
    data = [json.loads(line) for line in f]

The ratings are stored under the key "overall".

In [2]:
import numpy as np

ratings = np.array([rev['overall'] for rev in data])

Next, I'll create a transformer that extracts the corpus from the 'reviewText' key of each review.

In [3]:
from spacy.lang.en import STOP_WORDS
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

In [4]:
class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return [rev['reviewText'] for rev in X]

In [5]:
X_train, X_test, y_train, y_test = train_test_split(data, ratings, test_size=0.9, random_state=0)

In [6]:
pipe = Pipeline([('transformer', MyTransformer()), ('vectorizer', TfidfVectorizer(stop_words=STOP_WORDS)),
                ('regressor', Ridge())])

param_grid = {'regressor__alpha': np.logspace(-3, 2, 20)}

In [7]:
gs = GridSearchCV(pipe, param_grid, cv=5, n_jobs=2, verbose=1)

In [8]:
gs.fit(X_train, y_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:  3.9min
[Parallel(n_jobs=2)]: Done 100 out of 100 | elapsed:  7.1min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('transformer', MyTransformer()), ('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))]),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid={'regressor__alpha': array([1.00000e-03, 1.83298e-03, 3.35982e-03, 6.15848e-03, 1.12884e-02,
       2.06914e-02, 3.79269e-02, 6.95193e-02, 1.27427e-01, 2.33572e-01,
       4.28133e-01, 7.84760e-01, 1.43845e+00, 2.63665e+00, 4.83293e+00,
       8.85867e+00, 1.62378e+01, 2.97635e+01, 5.45559e+01, 1.00000e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [9]:
gs.best_params_

{'regressor__alpha': 1.438449888287663}

In [10]:
model = gs.best_estimator_

In [11]:
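# X_test and y_test hold 90% of the data, so I fit on them and keep the smaller X_train split for scoring.
# Note: `pipe` still uses Ridge's default alpha here; calling pipe.set_params(**gs.best_params_)
# first would apply the tuned value from the grid search.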

pipe.fit(X_test, y_test)

Pipeline(memory=None,
     steps=[('transformer', MyTransformer()), ('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [12]:
pipe.score(X_train, y_train)

0.43865741905267175
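
For a regressor such as Ridge, `score` reports the coefficient of determination (R^2) rather than a classification accuracy, so the value above reads as the fraction of rating variance explained by the model. A minimal sketch of that quantity, using made-up ratings and predictions rather than anything from the notebook:

```python
# Hypothetical true ratings and model predictions (not from the notebook).
y_true = [1, 2, 3, 4, 5]
y_pred = [2, 2, 3, 4, 4]

mean_true = sum(y_true) / len(y_true)

# Residual sum of squares: how far the predictions are from the true ratings.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: how far the true ratings are from their own mean.
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(r2)
```

A model that always predicts the mean rating scores 0 on this measure, and a perfect model scores 1.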

To derive some insight from my analysis, I want to determine the most polarizing words in the corpus of reviews, that is, words that strongly signal whether a review is positive or negative. For example, a word like "terrible" will appear mostly in negative reviews. The naive Bayes model estimates probabilities such as P(terrible | negative), the probability that the word "terrible" appears given that the review is negative. Using these probabilities, I can derive a polarity score for each counted word,

polarity = log(P(word | positive) / P(word | negative)).
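
As a rough illustration of the polarity score, here is a small sketch on a made-up toy corpus. It uses raw word counts with add-one smoothing (the notebook itself feeds TF-IDF weights into MultinomialNB), and the documents, labels, and the `log_prob` helper are all hypothetical:

```python
import math
from collections import Counter

# Hypothetical toy corpus of (tokens, label) pairs; not the Amazon data.
docs = [
    (["terrible", "waste", "refund"], "negative"),
    (["terrible", "junk"], "negative"),
    (["excellent", "sturdy"], "positive"),
    (["excellent", "perfect", "sturdy"], "positive"),
]

# Count how often each word occurs in each class.
counts = {"negative": Counter(), "positive": Counter()}
for tokens, label in docs:
    counts[label].update(tokens)

vocab = sorted(set(counts["negative"]) | set(counts["positive"]))
alpha = 1.0  # add-one (Laplace) smoothing so unseen words do not give log(0)


def log_prob(word, label):
    """Smoothed estimate of log P(word | label)."""
    total = sum(counts[label].values())
    return math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))


# polarity = log(P(word | positive) / P(word | negative))
polarity = {w: log_prob(w, "positive") - log_prob(w, "negative") for w in vocab}

# Most negative words first, most positive words last.
ranked = sorted(vocab, key=polarity.get)
for word in ranked:
    print(f"{word:10s} {polarity[word]:+.3f}")
```

Words with a large positive score lean toward positive reviews, and words with a large negative score lean toward negative ones.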

In [13]:
# This dataset contains only reviews rated 1 or 5 stars

import numpy as np
from sklearn.naive_bayes import MultinomialNB

with gzip.open("Desktop/Stuff/amazon_one_and_five_star_reviews.json.gz", "r") as f:
    data_polarity = [json.loads(line) for line in f]

ratings = np.array([rev['overall'] for rev in data_polarity])

In [14]:
tr_model = MyTransformer()
tr = tr_model.fit_transform(data_polarity)

In [15]:
vec = TfidfVectorizer(stop_words=STOP_WORDS)
tr_t = vec.fit_transform(tr)

In [16]:
nb_model = MultinomialNB()
nb_model.fit(tr_t, ratings)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [17]:
# The naive Bayes model's feature_log_prob_ attribute holds log P(word | class) for every word:
# row 0 is the negative (1-star) class and row 1 the positive (5-star) class

log_array = nb_model.feature_log_prob_

In [18]:
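# ref holds every word in the vocabulary, in the same order as the columns of log_array
# (on scikit-learn >= 1.0 use vec.get_feature_names_out(); get_feature_names() was removed in 1.2)

ref = vec.get_feature_names()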

In [19]:
# Just to make sure I have the same number of words

assert log_array.shape[1] == len(ref)

In [20]:
# Build a dictionary keyed by word, holding each word's negative and positive polarity.
# The negative polarity is log P(word | negative) - log P(word | positive);
# the positive polarity is the same difference with the sign flipped.

mapping = {word: [] for word in ref}

for i in range(len(ref)):
    mapping[ref[i]].append(log_array[0, i]-log_array[1, i])
    mapping[ref[i]].append(log_array[1, i]-log_array[0, i])

In [21]:
#Sorting to extract the words with the highest polarities

negative_polarities = sorted([log_array[0, i]-log_array[1, i] for i in range(len(ref))])[-25:]
positive_polarities = sorted([log_array[1, i]-log_array[0, i] for i in range(len(ref))])[-25:]

In [22]:
top_50 = []

for polarity in negative_polarities:
    for value in ref:
        if mapping[value][0] == polarity:
            top_50.append(value)
            
for polarity in positive_polarities:
    for value in ref:
        if mapping[value][1] == polarity:
            top_50.append(value)

In [23]:
# From this, I was able to get the top 50 polarizing words that often indicate a positive or negative rating

top_50

['refused',
 'threw',
 'disappointing',
 'randomly',
 'stopped',
 'unreliable',
 'horrible',
 'awful',
 'unacceptable',
 'poor',
 'beware',
 'defective',
 'trash',
 'worse',
 'worthless',
 'useless',
 'garbage',
 'returned',
 'terrible',
 'junk',
 'worst',
 'returning',
 'return',
 'waste',
 'refund',
 'regrets',
 'fantastic',
 'buck',
 'telephoto',
 'photography',
 'crisp',
 'dslr',
 'portraits',
 'awesome',
 'handy',
 'charm',
 '200mm',
 'pleased',
 'bokeh',
 'excellent',
 'incredible',
 'macro',
 'sturdy',
 'amazing',
 'portrait',
 'monopod',
 'perfect',
 'protects',
 'beat',
 'highly']
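
The sort-then-match loops above can also be written with `np.argsort`, which returns the word indices directly and avoids comparing floats for equality. A minimal sketch under stated assumptions: `words` and `log_prob` below are small hypothetical stand-ins for `ref` and `nb_model.feature_log_prob_`, not the fitted values.

```python
import numpy as np

# Hypothetical stand-ins for the vocabulary and the naive Bayes log-probabilities.
words = np.array(["awful", "cable", "great", "junk", "perfect", "works"])
log_prob = np.log(np.array([
    [0.30, 0.20, 0.05, 0.25, 0.05, 0.15],  # row 0: negative (1-star) class
    [0.05, 0.20, 0.30, 0.05, 0.25, 0.15],  # row 1: positive (5-star) class
]))

# One signed score per word: positive values lean positive, negative values lean negative.
polarity = log_prob[1] - log_prob[0]

# Indices of the words ordered from most negative to most positive.
order = np.argsort(polarity)

k = 2  # number of words to keep on each side
most_negative = words[order[:k]].tolist()
most_positive = words[order[::-1][:k]].tolist()

print(most_negative)
print(most_positive)
```

Because a single signed score is sorted once, each word can appear in at most one of the two lists as long as `2 * k` does not exceed the vocabulary size.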

Finally, I'll apply topic modelling with Non-Negative Matrix Factorization (NMF) to extract topics from the reviews and get further information out of my data.
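
Each row of a fitted NMF components matrix assigns a weight to every vocabulary word, and a topic is usually summarized by its highest-weighted words. A minimal sketch of that lookup, using a made-up vocabulary and weight vector rather than the fitted model:

```python
import numpy as np

# Hypothetical vocabulary and one row of topic weights (not from the fitted NMF).
feature_names = ["bass", "cable", "lens", "usb", "zoom"]
topic = np.array([0.1, 0.0, 2.5, 0.2, 1.7])
n_top_words = 3

# argsort is ascending, so walk backwards from the end to get the largest weights first.
top = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top]

# For contrast: a forward slice that stops at -1 leaves out the single largest weight.
skipping_top = [feature_names[i] for i in topic.argsort()[-n_top_words - 1:-1]]

print(top_words)
print(skipping_top)
```

The reversed slice lists the words in descending weight order, which makes the dominant term of each topic easy to spot.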

In [24]:
from sklearn.decomposition import NMF

In [25]:
# Not much can be taken from the first topic, but the second definitely involves photography with a Nikon camera,
# the third is about the sound produced by headphones, and the fourth is probably a desktop system

n_topics = 10
n_top_words = 20

tfidf = TfidfVectorizer(stop_words='english')
nmf = NMF(n_components=n_topics, random_state=0)
pipe = Pipeline([('tr', MyTransformer()), ('vectorizer', tfidf), ('dim-red', nmf)])

pipe.fit(data)

# On scikit-learn >= 1.0 use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2
feature_names = tfidf.get_feature_names()

for i, topic in enumerate(nmf.components_):
    print("Topic: {}".format(i))
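    # Note: this slice lists words in ascending weight order and leaves out the single
    # highest-weighted word; topic.argsort()[:-n_top_words - 1:-1] would include it.
    indices = topic.argsort()[-n_top_words-1:-1]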
    top_words = [feature_names[ind] for ind in indices]
    print(" ".join(top_words), "\n")

Topic: 0
really little tv battery need case does ve bought don power time cd work player radio unit like dvd use 

Topic: 1
pictures photography image low macro kit fast 70 wide shots cap zoom nikon hood 50mm light sharp lenses focus canon 

Topic: 2
listening like don headphone set hear better great quality noise head volume ears comfortable music pair bass ear speakers headphones 

Topic: 3
signal computer printer need extension video connectors long audio monitor say monster belkin quality length connect usb tv needed cables 

Topic: 4
shots carry picture case zoom small memory photos batteries strap tripod use cameras card battery flash canon digital pictures bag 

Topic: 5
need use worked highly exactly expected just buy bought item advertised easy perfectly needed say recommend fine price product works 

Topic: 6
haze polarizing clean glare multi job coated expensive polarizer lens quality lenses does hoya protection protect tiffen glass uv filters 

Topic: 7
ve keys computer han