# Amazon reviews multilingual UK dataset


This project is divided into 4 tasks:
1. __Data Processing__ 
2. __Classification__ 
3. __Regression__ 
4. __Recommender Systems__
    1. Similarity matching
    2. Predictions
    3. Recommendations on Test set

# 1: Data Processing

## The Data

For this project I will be working with an Amazon reviews dataset. The list of available datasets in the repository can be found [here.](https://s3.amazonaws.com/amazon-reviews-pds/readme.html)
This dataset is a collection of reviews of products bought in the UK on Amazon. It is about 333 MB in size, so it is a mid-sized dataset.

### DATA COLUMNS:
    marketplace       - 2 letter country code of the marketplace where the review was written.
    customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
    review_id         - The unique ID of the review.
    product_id        - The unique Product ID the review pertains to. In the multilingual dataset the reviews
                    for the same product in different countries can be grouped by the same product_id.
    product_parent    - Random identifier that can be used to aggregate reviews for the same product.
    product_title     - Title of the product.
    product_category  - Broad product category that can be used to group reviews 
                    (also used to group the dataset into coherent parts).
    star_rating       - The 1-5 star rating of the review.
    helpful_votes     - Number of helpful votes.
    total_votes       - Number of total votes the review received.
    vine              - Review was written as part of the Vine program.
    verified_purchase - The review is on a verified purchase.
    review_headline   - The title of the review.
    review_body       - The review text.
    review_date       - The date the review was written.

### DATA FORMAT
    Tab ('\t') separated text file, without quote or escape characters.
    First line in each file is header; 1 line corresponds to 1 record.
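
Because the files use no quote or escape characters, quote marks inside a review are ordinary text. As a minimal sketch of what that means for a parser, the standard `csv` module can be told to leave quotes alone with `quoting=csv.QUOTE_NONE`. The two rows below are made up for illustration and show only four of the columns.

```python
import csv
import io

# A made-up header line and one made-up record, tab separated
sample = (
    "marketplace\tcustomer_id\tstar_rating\treview_body\n"
    "UK\t123\t5\t\"Great\" game\n"
)

# QUOTE_NONE: quote characters get no special treatment, matching the format above
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

print(rows[0]["review_body"])   # the quotes around Great are kept
print(rows[0]["star_rating"])   # still a string until it is cast
```

Splitting each line on `'\t'`, as done in the loading cell of this notebook, treats quotes the same way.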

### First Step: Imports

Import all the libraries needed in this project.

In [1]:
import gzip
from collections import defaultdict
import random
import numpy as np
import scipy.optimize
import string
import nltk
from sklearn import linear_model
from nltk.stem.porter import PorterStemmer # Stemming
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing

### 1: Read the data and Fill your dataset

1. Cast `star_rating`, `helpful_votes` and `total_votes` to integers.
2. Convert the `verified_purchase` and `vine` flags to True/False (`'Y'` becomes True, anything else False).

In [2]:
path = "amazon_reviews_multilingual_UK_v1_00.tsv.gz"

dataset = []

# Open the gzipped TSV in text mode; the file is closed automatically afterwards
with gzip.open(path, 'rt', encoding="utf8") as f:
    # The first line holds the column names
    header = f.readline().strip().split('\t')

    for line in f:
        fields = line.strip().split('\t')
        d = dict(zip(header, fields))
        # Cast the numeric columns to integers
        for field in ['star_rating', 'helpful_votes', 'total_votes']:
            d[field] = int(d[field])
        # Convert the 'Y' flags to booleans
        for field in ['verified_purchase', 'vine']:
            d[field] = (d[field] == 'Y')
        dataset.append(d)

In [3]:
dataset[10]

{'marketplace': 'UK',
 'customer_id': '17809',
 'review_id': 'R1X8DIPIAIWFK9',
 'product_id': 'B00AESN8XY',
 'product_parent': '593367180',
 'product_title': 'Plague Inc.',
 'product_category': 'Mobile_Apps',
 'star_rating': 4,
 'helpful_votes': 0,
 'total_votes': 0,
 'vine': False,
 'verified_purchase': True,
 'review_headline': 'gd',
 'review_body': 'Good game I spose',
 'review_date': '2015-03-11'}

### 2: Split the data into a Training and Testing set

After shuffling, the first 80% of the rows form the training set and the remaining 20% form the testing set.

In [4]:
random.shuffle(dataset)

N = len(dataset)
trainingSet = dataset[:4*N//5]
testingSet = dataset[4*N//5:]

print("Training Set: ",len(trainingSet), "\nTest Set: ",len(testingSet), "\nTotal no.of rows",N)

Training Set:  1365995 
Test Set:  341499 
Total no.of rows 1707494
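
Since `random.shuffle` is called without a seed, the split (and every number that follows) changes from run to run. A minimal sketch of a repeatable split is shown below. The helper name `shuffle_split` and the `seed` argument are hypothetical and not part of the notebook; the 80/20 cut uses the same `4*N//5` arithmetic as above.

```python
import random

def shuffle_split(rows, seed=0):
    """Shuffle a copy of rows with a fixed seed and cut it 80/20."""
    rng = random.Random(seed)      # private generator, so the global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = 4 * len(shuffled) // 5   # same cut point as in the cell above
    return shuffled[:cut], shuffled[cut:]

train, test = shuffle_split(range(10))
print(len(train), len(test))       # 8 2
```

Calling `shuffle_split` twice with the same seed returns the same two lists.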


### 3: Extracting Basic Statistics

Next, answer some basic statistical questions, all based on the __Training Set:__
1. What is the __average rating__?
2. What fraction of reviews are from __verified purchases__?
3. How many __total users__ are there?
4. How many __total items__ are there?
5. What fraction of reviews have __5-star ratings__?

In [5]:
d_star = [d['star_rating'] for d in trainingSet]
avg_rating = np.average(d_star)
print("1. ",avg_rating)

d_ver = [d['verified_purchase'] for d in trainingSet if d['verified_purchase'] ==True ]
frac_reviews = (len(d_ver)/len(trainingSet))*100
print("2. ",round(frac_reviews,2),"%")

# This way it takes unique customer_id and product_id
users = set()
for d in trainingSet:
    users.add(d['customer_id'])
print("3. ",len(users))
items = set()
for d in trainingSet:
    items.add(d['product_id'])
print("4. ",len(items))

d_five = [d['star_rating'] for d in trainingSet if d['star_rating'] ==5 ]
frac_five = (len(d_five)/len(trainingSet))*100
print("5. ",round(frac_five,2),"%")

1.  4.379450144400236
2.  76.17 %
3.  797423
4.  55021
5.  67.1 %


# 2: Classification

Extract features from each review and use them to classify whether the review comes from a verified purchase. Here I will be using a Logistic Regression model.

### 1: Define the feature function

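The feature vector is meant to hold a constant offset, the __star rating__ and the ___length___ of the __review body__. Note that the code below passes `len(wordCount)` as the third entry, which is the same number for every review, so only the star rating actually varies between reviews.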

In [6]:
wordCount = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)
for d in trainingSet:
    f = ''.join([x for x in d['review_body'].lower() if not x in string.punctuation])
for w in f.split():
    w = stemmer.stem(w) # with stemming
    wordCount[w]+=1


def feature(dat):
    feat = [1, dat['star_rating'], len(wordCount)]
    return feat
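
A minimal sketch of a feature function that uses the length of each review's own text, as the description of this step intends. `feature_with_length` is a hypothetical name, and measuring length in characters is an assumption; a word count would be an equally reasonable reading.

```python
def feature_with_length(dat):
    # Offset term, star rating, and the character length of this review's text
    return [1, dat['star_rating'], len(dat['review_body'])]

# Example using the review shown earlier in the notebook
example = {'star_rating': 4, 'review_body': 'Good game I spose'}
print(feature_with_length(example))   # [1, 4, 17]
```

Unlike `len(wordCount)`, the third entry here differs from review to review, so the scaler and the classifier have something to work with.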

### 2: Fit your model

1. Create a __Feature Vector__ based on the feature function defined above.
2. Create a __Label Vector__ based on the "verified purchase" column from the training set.
3. Standardize the features with a scaler fitted on the training set.
4. Define the model, i.e. a __Logistic Regression__ model.
5. Fit the model.

In [7]:
X_train = [feature(d) for d in trainingSet]    
y_train = [d['verified_purchase'] for d in trainingSet]   

X_test = [feature(d) for d in testingSet]    
y_test = [d['verified_purchase'] for d in testingSet]   

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# print("Label: ", y[:100], "\nFeatures:", X[:10])
model = linear_model.LogisticRegression()
model.fit(X_train_scaled, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

### 3: Compute Accuracy of Your Model

1. Make __Predictions__ based on the model.
2. Compute the __Accuracy__ of the model.

In [8]:

predictions_train = model.predict(X_train_scaled)
predictions_test = model.predict(X_test_scaled)

correctPredictions_train = predictions_train == y_train
correctPredictions_test = predictions_test == y_test

accuracy_train = sum(correctPredictions_train) / len(correctPredictions_train)*100
accuracy_test = sum(correctPredictions_test) / len(correctPredictions_test)*100

print("Training accuracy: ",round(accuracy_train,2),"%","\nTest accuracy: ",round(accuracy_test,2),"%")
print("Confusion matrix: \n",confusion_matrix(y_test, predictions_test))

Training accuracy:  76.21 % 
Test accuracy:  76.14 %
Confusion matrix: 
 [[     0  81466]
 [     0 260033]]


### 4: Finding the Balanced Error Rate

1. Compute __True__ and __False Positives__
2. Compute __True__ and __False Negatives__
3. Compute __Balanced Error Rate__ based on the above defined variables.

In [9]:
TP_train = sum([(p and l) for (p, l) in zip(predictions_train, y_train)])
FP_train = sum([(p and not l) for (p, l) in zip(predictions_train, y_train)])
TN_train = sum([(not p and not l) for (p, l) in zip(predictions_train, y_train)])
FN_train = sum([(not p and l) for (p, l) in zip(predictions_train, y_train)])
TF_accuracy = (TP_train + TN_train) / (TP_train + FP_train + TN_train + FN_train)
BER = 1 - 1/2 * (TP_train / (TP_train + FN_train) + TN_train / (TN_train + FP_train))
print(f'TP_train = {TP_train}')
print(f'FP_train = {FP_train}')
print(f'TN_train = {TN_train}')
print(f'FN_train = {FN_train}')
print(f'TF_Accuracy: {round(TF_accuracy*100,2)}%')
print(f'BER_train = {BER}')

TP_train = 1041020
FP_train = 324975
TN_train = 0
FN_train = 0
TF_Accuracy: 76.21%
BER_train = 0.5


In [10]:
TP_test = sum([(p and l) for (p, l) in zip(predictions_test, y_test)])
FP_test = sum([(p and not l) for (p, l) in zip(predictions_test, y_test)])
TN_test = sum([(not p and not l) for (p, l) in zip(predictions_test, y_test)])
FN_test = sum([(not p and l) for (p, l) in zip(predictions_test, y_test)])
TF_accuracy = (TP_test + TN_test) / (TP_test + FP_test + TN_test + FN_test)
BER = 1 - 1/2 * (TP_test / (TP_test + FN_test) + TN_test / (TN_test + FP_test))
print(f'TP_test = {TP_test}')
print(f'FP_test = {FP_test}')
print(f'TN_test = {TN_test}')
print(f'FN_test = {FN_test}')
print(f'TF_Accuracy: {round(TF_accuracy*100,2)}%')
print(f'BER_test = {BER}')

TP_test = 260033
FP_test = 81466
TN_test = 0
FN_test = 0
TF_Accuracy: 76.14%
BER_test = 0.5
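
The counts above show that the model predicts True for every review (TN and FN are both 0). The accuracy then simply equals the share of verified purchases, while the BER of 0.5 shows that the model does no better than a trivial predictor. A minimal sketch of the same computation as a reusable function follows. The name `balanced_error_rate` is hypothetical, and returning a rate of 0.0 for a class with no examples is an assumption made to avoid division by zero.

```python
def balanced_error_rate(predictions, labels):
    """BER = 1 - (TPR + TNR) / 2 for boolean predictions and labels."""
    pairs = list(zip(predictions, labels))
    TP = sum(1 for p, l in pairs if p and l)
    FP = sum(1 for p, l in pairs if p and not l)
    TN = sum(1 for p, l in pairs if not p and not l)
    FN = sum(1 for p, l in pairs if not p and l)
    # Guard against a class that never occurs (assumption: count its rate as 0)
    TPR = TP / (TP + FN) if (TP + FN) else 0.0
    TNR = TN / (TN + FP) if (TN + FP) else 0.0
    return 1 - 0.5 * (TPR + TNR)

labels = [True, True, True, False]
always_true = [True, True, True, True]
print(balanced_error_rate(always_true, labels))   # 0.5, although accuracy is 75%
print(balanced_error_rate(labels, labels))        # 0.0 for perfect predictions
```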


# 3: Regression

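Next, change both the features and the target: instead of classifying verified purchases, predict the star rating with a regression model.

Here I will be using word counts (indexed by word ID) as the feature vector and the star rating as the label.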

### 1: Unique Words in a Sample Set

I will take a smaller sample set here (the first 20% of the training set), as stemming the full training set would take a very long time.

1. Count the number of unique words found within the 'review body' portion of the sample set defined below, making sure to __Ignore Punctuation and Capitalization__.
2. Count the number of unique words found within the 'review body' portion of the sample set defined below, this time with the use of __Stemming,__ still __Ignoring Punctuation__ ___and___ __Capitalization__.

In [11]:
#GIVEN for 1.
wordCount = defaultdict(int)
punctuation = set(string.punctuation)

#GIVEN for 2.
wordCountStem = defaultdict(int)
stemmer = PorterStemmer() #use stemmer.stem(stuff)

In [12]:
sampleSet = trainingSet[:2*len(trainingSet)//10]

In [13]:
for d in sampleSet:
    f = ''.join([x for x in d['review_body'].lower() if not x in punctuation])
for w in f.split():
    w = stemmer.stem(w) # with stemming
    wordCountStem[w]+=1
        
# for d in trainingSet:
#     f = ''.join([x for x in d['review_body'].lower() if not x in punctuation])
for w in f.split():
    wordCount[w]+=1
    
counts = [(wordCount[w],w) for w in wordCount]
counts_stem = [(wordCountStem[w],w) for w in wordCountStem]

words = [x[1] for x in counts]
words_stem = [x[1] for x in counts_stem]
print("wordCount:",words,len(words),"\nwordStem Count:",words_stem,len(words_stem))

wordCount: ['excellent'] 1 
wordStem Count: ['excel'] 1
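
The output above lists a single word because the `for w in f.split()` loops in the cell are not nested inside `for d in sampleSet`, so only the text of the last review is counted. A minimal sketch of a nested version is shown below. `count_words` and its `stem` parameter are hypothetical names; `stem` can be any callable that maps a word to its stem, for example `PorterStemmer().stem` from nltk.

```python
import string
from collections import defaultdict

punctuation = set(string.punctuation)

def count_words(reviews, stem=None):
    """Count words over all reviews, ignoring punctuation and capitalization."""
    counts = defaultdict(int)
    for d in reviews:
        text = ''.join(c for c in d['review_body'].lower() if c not in punctuation)
        # Nested inside the review loop, so every review contributes
        for w in text.split():
            if stem is not None:
                w = stem(w)
            counts[w] += 1
    return counts

# Two made-up reviews
sample = [{'review_body': 'Excellent game!'},
          {'review_body': 'Good game, excellent value.'}]
counts = count_words(sample)
print(len(counts), counts['excellent'])   # 4 2
```

On the real sample set this would produce a vocabulary of many words, and the recorded outputs in the following cells would change accordingly.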


### 2: Evaluating Classifiers

1. Given the feature function and counts vector, __Define__ an X vector.
2. __Fit__ the model using a __Ridge Model__ with (alpha = 1.0, fit_intercept = True).
3. Using the model, __Make your Predictions__.
4. Find the __MSE__ between the resulting predictions and the y vector.

In [14]:
#GIVEN FUNCTIONS
def feature_reg(datum):
    feat = [0]*len(words)
    r = ''.join([c for c in datum['review_body'].lower() if not c in punctuation])
    for w in r.split():
        if w in wordSet:
            feat[wordId[w]] += 1
    feat.append(1) #offset
    return feat

def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

In [15]:
#GIVEN COUNTS AND SETS
counts = [(wordCount[w], w) for w in wordCount]
counts.sort()
counts.reverse()

#Note: increasing the size of the dictionary may require a lot of memory
words = [x[1] for x in counts[:100]]

wordId = dict(zip(words, range(len(words))))
wordSet = set(words)
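
To make the word-ID mapping concrete, here is a small sketch of the same bag-of-words idea as `feature_reg`, using a made-up three-word vocabulary instead of the top-100 list. The function name `bag_of_words` is hypothetical.

```python
import string

punctuation = set(string.punctuation)

# Made-up vocabulary; in the notebook this is the list of most frequent words
words = ['good', 'game', 'bad']
wordId = dict(zip(words, range(len(words))))

def bag_of_words(text):
    feat = [0] * len(words)
    cleaned = ''.join(c for c in text.lower() if c not in punctuation)
    for w in cleaned.split():
        if w in wordId:
            feat[wordId[w]] += 1   # count occurrences at the word's index
    feat.append(1)                 # offset term
    return feat

print(bag_of_words("Good game. Really GOOD!"))   # [2, 1, 0, 1]
```

Words outside the vocabulary ("really" here) are simply ignored, which is why the dictionary size matters for this model.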

In [17]:
random.shuffle(trainingSet)
X_train_reg = [feature_reg(d) for d in trainingSet]
y_train_reg = [d['star_rating'] for d in trainingSet]

X_test_reg = [feature_reg(d) for d in testingSet]
y_test_reg = [d['star_rating'] for d in testingSet]

scaler = preprocessing.StandardScaler().fit(X_train_reg)
X_train_reg_scaled = scaler.transform(X_train_reg)
X_test_reg_scaled = scaler.transform(X_test_reg)

model = linear_model.Ridge(alpha = 1.0, fit_intercept = True)
model.fit(X_train_reg_scaled, y_train_reg)
y_test_pred = model.predict(X_test_reg_scaled)
mse = MSE(y_test_pred, y_test_reg)

print("MSE: ",mse)

MSE:  1.1778335377014026


In [18]:
# If you would like to work with this example more in your free time, here are some tips to improve your solution:
# 1. Implement a validation pipeline and tune the regularization parameter
# 2. Alter the word features (e.g. dictionary size, punctuation, capitalization, stemming, etc.)
# 3. Incorporate features other than word features

# 4: Recommendation Systems

For this final task, I will use a simple latent factor-based recommender system to make rating predictions, and then evaluate the performance of these predictions.

In [6]:
#Create and fill our default dictionaries for our dataset
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)

for d in trainingSet:
    user,item = d['customer_id'], d['product_id']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)
    
#Create two dictionaries that will hold the user and item bias terms
userBiases = defaultdict(float)
itemBiases = defaultdict(float)

#Getting the respective lengths of our dataset and dictionaries
N = len(trainingSet)
nUsers = len(reviewsPerUser)
nItems = len(reviewsPerItem)

#Getting the list of keys
users = list(reviewsPerUser.keys())
items = list(reviewsPerItem.keys())

labels = [d['star_rating'] for d in trainingSet]

### 1: Calculate the ratingMean

1. Find the __average rating__ of the training set.
2. Calculate a __baseline MSE value__ between the actual ratings and a prediction that is always the average rating.

In [7]:
alpha = sum([d['star_rating'] for d in trainingSet]) / len(trainingSet)
# alpha = np.reshape(-1,1)
alwaysPredictMean = [alpha for d in trainingSet]

def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)

# labels = [d['star_rating'] for d in trainingSet]
# print(labels[:100])
print("Rating mean: ",alpha)
print("MSE: ",MSE(alwaysPredictMean, labels))

Rating mean:  4.379450144400236
MSE:  1.1851124967710491
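
A useful way to read this baseline: the MSE of always predicting the mean is exactly the (population) variance of the ratings, so any model has to beat the variance to be worth using. A minimal sketch on five made-up ratings:

```python
import math
from statistics import pvariance

ratings = [5, 5, 4, 1, 5]                  # made-up ratings
mean = sum(ratings) / len(ratings)         # 4.0

# MSE of a predictor that always outputs the mean
baseline_mse = sum((mean - r) ** 2 for r in ratings) / len(ratings)

print(baseline_mse)                        # 2.4
print(math.isclose(baseline_mse, pvariance(ratings)))   # True
```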


In [8]:
userGamma = {}
itemGamma = {}
K = 2 #Dimensionality of gamma

for u in reviewsPerUser:
    userGamma[u] = [random.random() * 0.1 - 0.05 for k in range(K)]
    
for i in reviewsPerItem:
    itemGamma[i] = [random.random() * 0.1 - 0.05 for k in range(K)]

The functions below define the model's prediction, its regularized cost and the gradient of that cost, which are used to optimize the MSE value above.

In [9]:
# alpha = ratingMean
def unpack(theta):
    global alpha
    global userBiases
    global itemBiases
    global userGamma
    global itemGamma
    index = 0
    alpha = theta[index]
    index += 1
    userBiases = dict(zip(users, theta[index:index+nUsers]))
    index += nUsers
    itemBiases = dict(zip(items, theta[index:index+nItems]))
    index += nItems
    for u in users:
        userGamma[u] = theta[index:index+K]
        index += K
    for i in items:
        itemGamma[i] = theta[index:index+K]
        index += K
        
def inner(x, y):
    return sum([a*b for a,b in zip(x,y)])


def prediction(user, item):
    return alpha + userBiases[user] + itemBiases[item] + inner(userGamma[user], itemGamma[item])


def cost(theta, labels, lamb):
    unpack(theta)
    predictions = [prediction(d['customer_id'], d['product_id']) for d in trainingSet]
    cost = MSE(predictions, labels)
    print("MSE = " + str(cost))
    for u in users:
        cost += lamb*userBiases[u]**2
        for k in range(K):
            cost += lamb*userGamma[u][k]**2
    for i in items:
        cost += lamb*itemBiases[i]**2
        for k in range(K):
            cost += lamb*itemGamma[i][k]**2
    return cost


def derivative(theta, labels, lamb):
    unpack(theta)
    N = len(trainingSet)
    dalpha = 0
    dUserBiases = defaultdict(float)
    dItemBiases = defaultdict(float)
    dUserGamma = {}
    dItemGamma = {}
    for u in reviewsPerUser:
        dUserGamma[u] = [0.0 for k in range(K)]
    for i in reviewsPerItem:
        dItemGamma[i] = [0.0 for k in range(K)]
    for d in trainingSet:
        u,i = d['customer_id'], d['product_id']
        pred = prediction(u, i)
        diff = pred - d['star_rating']
        dalpha += 2/N*diff
        dUserBiases[u] += 2/N*diff
        dItemBiases[i] += 2/N*diff
        for k in range(K):
            dUserGamma[u][k] += 2/N*itemGamma[i][k]*diff
            dItemGamma[i][k] += 2/N*userGamma[u][k]*diff
    for u in userBiases:
        dUserBiases[u] += 2*lamb*userBiases[u]
        for k in range(K):
            dUserGamma[u][k] += 2*lamb*userGamma[u][k]
    for i in itemBiases:
        dItemBiases[i] += 2*lamb*itemBiases[i]
        for k in range(K):
            dItemGamma[i][k] += 2*lamb*itemGamma[i][k]
    dtheta = [dalpha] + [dUserBiases[u] for u in users] + [dItemBiases[i] for i in items]
    for u in users:
        dtheta += dUserGamma[u]
    for i in items:
        dtheta += dItemGamma[i]
    return np.array(dtheta)
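
Written out, the functions above appear to implement the following model, objective and gradients. The symbols are chosen here for illustration: $r_{u,i}$ is the star rating of a training review, $N$ the number of training reviews, $I_u$ the items reviewed by user $u$, $\beta$ the bias terms (`userBiases`, `itemBiases`), $\gamma$ the `K`-dimensional factors (`userGamma`, `itemGamma`) and $\lambda$ the `lamb` argument.

```latex
f(u,i) = \alpha + \beta_u + \beta_i + \gamma_u \cdot \gamma_i

\mathrm{cost}(\theta) = \frac{1}{N}\sum_{(u,i)} \bigl(f(u,i) - r_{u,i}\bigr)^2
  + \lambda\Bigl[\sum_u \bigl(\beta_u^2 + \lVert\gamma_u\rVert^2\bigr)
  + \sum_i \bigl(\beta_i^2 + \lVert\gamma_i\rVert^2\bigr)\Bigr]

\frac{\partial\,\mathrm{cost}}{\partial\alpha}
  = \frac{2}{N}\sum_{(u,i)} \bigl(f(u,i) - r_{u,i}\bigr)

\frac{\partial\,\mathrm{cost}}{\partial\beta_u}
  = \frac{2}{N}\sum_{i \in I_u} \bigl(f(u,i) - r_{u,i}\bigr) + 2\lambda\beta_u

\frac{\partial\,\mathrm{cost}}{\partial\gamma_{u,k}}
  = \frac{2}{N}\sum_{i \in I_u} \gamma_{i,k}\bigl(f(u,i) - r_{u,i}\bigr) + 2\lambda\gamma_{u,k}
```

The item-side gradients are symmetric. Note that the `MSE = ...` lines printed by `cost` show only the first term of the objective, while the value returned to the optimizer also includes the $\lambda$ penalty.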

### 2: Optimize

1. __Optimize__ the above MSE using the `scipy.optimize.fmin_l_bfgs_b` function.

In [10]:
scipy.optimize.fmin_l_bfgs_b(cost, [alpha] + # Initialize alpha
                                   [0.0]*(nUsers+nItems) + # Initialize beta
                                   [random.random() * 0.1 - 0.05 for k in range(K*(nUsers+nItems))], derivative, 
                             args = (labels, 0.001))


MSE = 1.1851130366731504
MSE = 1.1822930372457103
MSE = 1.1723415069484748
MSE = 106.06406002545341
MSE = 1.1883609556190189
MSE = 1.1644849173794891
MSE = 1.139567901823131
MSE = 1.138967496958636
MSE = 1.1399467030896961
MSE = 1.141787650918011
MSE = 1.1422832258314293
MSE = 1.1425497236767768
MSE = 1.1425707025961862
MSE = 1.142571705087677


(array([ 4.38334481e+00, -4.04016730e-03,  8.21316899e-04, ...,
         3.88408651e-07,  1.16490547e-07, -6.08750809e-07]),
 1.1587750830270385,
 {'grad': array([ 1.73857309e-06, -6.50465159e-10, -8.14765130e-11, ...,
          7.74118627e-10,  2.33175246e-10, -1.21687998e-09]),
  'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',
  'funcalls': 14,
  'nit': 11,
  'warnflag': 0})
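
The MSE values above are measured on the training set. Evaluating on the testing set needs some care, because it can contain users or items that never appear in training, and `prediction` would raise a `KeyError` for them once `unpack` has replaced the default dictionaries with plain ones. A minimal sketch of a prediction with a fallback is shown below. The name `predict_with_fallback` and the choice to let unseen users or items contribute zero are assumptions, not part of the notebook.

```python
def predict_with_fallback(user, item, alpha, userBiases, itemBiases, userGamma, itemGamma):
    """Latent-factor prediction where unseen users or items contribute nothing."""
    bu = userBiases.get(user, 0.0)
    bi = itemBiases.get(item, 0.0)
    gu = userGamma.get(user)
    gi = itemGamma.get(item)
    if gu is not None and gi is not None:
        interaction = sum(a * b for a, b in zip(gu, gi))
    else:
        interaction = 0.0
    return alpha + bu + bi + interaction

# Toy parameters, made up for illustration
alpha_toy = 4.0
ub = {'u1': 0.5}
ib = {'i1': -0.25}
ug = {'u1': [1.0, 2.0]}
ig = {'i1': [0.5, 0.25]}

print(predict_with_fallback('u1', 'i1', alpha_toy, ub, ib, ug, ig))    # 5.25
print(predict_with_fallback('new', 'i1', alpha_toy, ub, ib, ug, ig))   # 3.75
```

With such a function, a test MSE could be computed by calling the `MSE` helper on the predictions for `testingSet`.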

### 3: Recommending Products

Recommendations for a product from the testing set are made based on item similarities computed from the training set.

In [11]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)
itemNames = {}

for d in trainingSet:
    user,item = d['customer_id'], d['product_id']
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    itemNames[item] = d['product_title']

def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom

def mostSimilar(iD, n):
    similarities = []
    id_list = []
    users = usersPerItem[iD]
    for i2 in usersPerItem:
        if i2 == iD: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    
    for i in similarities:
        id_list.append(i[1])
    print(id_list[:n])
    return similarities[:n]

# def predictRating(user,item):
#     ratings = []
#     similarities = []
#     for d in reviewsPerUser[user]:
#         i2 = d['product_id']
#         if i2 == item: continue
#         ratings.append(d['star_rating'])
#         similarities.append(Jaccard(usersPerItem[item],usersPerItem[i2]))
#     if (sum(similarities) > 0):
#         weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
#         return sum(weightedRatings) / sum(similarities)
#     else:
#         # User hasn't rated any similar items
#         return ratingMean
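
As a small worked example of the similarity used by `mostSimilar`: two items are compared through the sets of users who reviewed them, and the Jaccard similarity is the size of the intersection divided by the size of the union. The user IDs below are made up, and the guard for two empty sets is an addition that the notebook's `Jaccard` does not have.

```python
def jaccard(s1, s2):
    union = s1 | s2
    if not union:          # added guard: two empty sets have no overlap to measure
        return 0.0
    return len(s1 & s2) / len(union)

buyers_a = {'u1', 'u2', 'u3'}
buyers_b = {'u2', 'u3', 'u4', 'u5'}

# 2 shared users out of 5 distinct users
print(jaccard(buyers_a, buyers_b))   # 0.4
```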


In [12]:
query = testingSet[10]['product_id']
# query1 = testingSet['product_id']

print("Product ID: ",query,"\nProduct title:",itemNames[query])

Product ID:  1401216676 
Product title: Batman The Killing Joke Special Ed HC


In [13]:
mostSimilar(query, 10)

['1401207529', '1401232590', '1401223176', '1401235425', 'B00X5ZT04E', '140120841X', '1607066017', '160309329X', 'B003N5VTUY', 'B000028D3Y']


[(0.07339449541284404, '1401207529'),
 (0.02912621359223301, '1401232590'),
 (0.02586206896551724, '1401223176'),
 (0.024793388429752067, '1401235425'),
 (0.018691588785046728, 'B00X5ZT04E'),
 (0.017857142857142856, '140120841X'),
 (0.013333333333333334, '1607066017'),
 (0.013157894736842105, '160309329X'),
 (0.012987012987012988, 'B003N5VTUY'),
 (0.012987012987012988, 'B000028D3Y')]

In [14]:
print("Similarity based Recommendations(top 10)")
[itemNames[x[1]] for x in mostSimilar(query, 10)]

Similarity based Recommendations(top 10)
['1401207529', '1401232590', '1401223176', '1401235425', 'B00X5ZT04E', '140120841X', '1607066017', '160309329X', 'B003N5VTUY', 'B000028D3Y']


['Batman Year One Deluxe SC',
 'Batman The Long Halloween TP',
 'Batman Hush Complete TP',
 'Batman Volume 1: The Court of Owls TP (The New 52) (Batman (DC Comics Paperback))',
 'X-Men: Days of Future Past - Rogue Cut [Blu-ray] [2014]',
 'V For Vendetta New Edition TP',
 'Saga Volume 1 (Saga (Comic Series))',
 'The League of Extraordinary Gentlemen (Volume III): Century|The League of Extraordinary Gentlemen|The League of Extraordinary Gentlemen',
 'Batman: Under the Red Hood',
 'Instrument (Soundtrack)']

## Comments:
1. The MSE after optimization (about 1.143) is slightly better than the baseline of always predicting the average rating (about 1.185).
2. Only product recommendations were made here; predicting ratings could be done as a further step.
3. For the input product *Batman The Killing Joke Special Ed HC*, the list consists of comic-related recommendations, except for *Instrument (Soundtrack)*, which belongs to a music category.
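
Regarding comment 2, a minimal sketch of how a rating could be predicted from the same Jaccard similarities, following the commented-out `predictRating` function in the notebook. The name `predict_rating`, the explicit arguments and the toy data are assumptions for illustration; the fallback value `ratingMean` stands for the average training rating.

```python
def jaccard(s1, s2):
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def predict_rating(user, item, reviewsPerUser, usersPerItem, ratingMean):
    """Similarity-weighted average of the user's ratings on other items."""
    ratings = []
    similarities = []
    for d in reviewsPerUser.get(user, []):
        i2 = d['product_id']
        if i2 == item:
            continue
        ratings.append(d['star_rating'])
        similarities.append(jaccard(usersPerItem.get(item, set()),
                                    usersPerItem.get(i2, set())))
    if sum(similarities) > 0:
        weighted = sum(r * s for r, s in zip(ratings, similarities))
        return weighted / sum(similarities)
    # The user has not rated any similar items
    return ratingMean

# Toy data: u1 rated B with 5 stars and C with 2 stars; predict u1's rating for A
usersPerItem_toy = {'A': {'u2', 'u3'}, 'B': {'u1', 'u2', 'u3'}, 'C': {'u1', 'u3'}}
reviewsPerUser_toy = {'u1': [{'product_id': 'B', 'star_rating': 5},
                             {'product_id': 'C', 'star_rating': 2}]}

# Similarities are 2/3 (A vs B) and 1/3 (A vs C), so the result is close to 4.0
print(predict_rating('u1', 'A', reviewsPerUser_toy, usersPerItem_toy, 4.38))
```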