<a href="https://colab.research.google.com/github/Hernanros/NLP-Ydata/blob/master/HW5/HW_5_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW5 - Rating prediction using Amazon's Reviews
    
In this exercise, you'll train a text classification on a **subset** of the the Amazon's Reviews dataset. 

The Amazon's Reviews dataset  contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.


We will focus on the Home and Kitchen segment which contains ~550k reviews and can be downloaded here: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz

You will predict the rating that was given to a product from the review.

The dataset contains the following fields for each review, in JSON format:
1. "reviewerID": "A11N155CW1UV02",
1. "asin": "B000H00VBQ",
1. "reviewerName": "AdrianaM"
1. "helpful": [0, 0]
1. "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all."
1. "overall": 2.0
1. "summary": "A little bit boring for me"
1. "unixReviewTime": 1399075200
1. "reviewTime": "05 3, 2014"




Please note that the **only** two fields that you are allowed to use in this exercise are "reviewText" which contains the review and "overall" which contains the rating. Other than that you have the **option** to use the "asin" field which is a unique product identifier. You may (or may not :) ) find this field useful. 



## General guidelines

1. You are required to implement at least two models.
1. The first should be a CNN or an RNN (or a combination) and should include the use of Glove embeddings.
1. The second model should be implemented using the transformers package and include Transfer learning concepts that were mentioned in the Lecture.
1. Pay attention to any preprocessing steps that are needed.
1. Feel free to be creative and use any method which was mentioned in the lectures (e.g., tf-idf, pos,...) extra points will be given to creativity.
1. The main criteria for evaluation is not the over-all score but rather the entire process (preprocessing, efficient training ...)





In [1]:
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz

--2020-06-24 11:09:14--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 138126598 (132M) [application/x-gzip]
Saving to: ‘reviews_Home_and_Kitchen_5.json.gz’


2020-06-24 11:09:23 (14.9 MB/s) - ‘reviews_Home_and_Kitchen_5.json.gz’ saved [138126598/138126598]



In [2]:
! gunzip reviews_Home_and_Kitchen_5.json.gz

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile
import json

In [4]:
revs = []
for line in open('reviews_Home_and_Kitchen_5.json', 'rb'):
    try:
      revs.append(json.loads(line))
    except:
      continue

In [81]:
df = pd.DataFrame(revs)

data = df[['asin','reviewText','overall']]
data.head()

Unnamed: 0,asin,reviewText,overall
0,615391206,My daughter wanted this book and the price on ...,5.0
1,615391206,I bought this zoku quick pop for my daughterr ...,5.0
2,615391206,There is no shortage of pop recipes available ...,4.0
3,615391206,This book is a must have if you get a Zoku (wh...,5.0
4,615391206,This cookbook is great. I have really enjoyed...,4.0


In [6]:
data.groupby('asin').size()

asin
0615391206    11
0689027818     5
0912696591    93
1223070743     8
1567120709    16
              ..
B00L8HA5L8    14
B00L9KOZBK     6
B00LAI4UYS     5
B00LB18EKK    19
B00LBFUU12     9
Length: 28237, dtype: int64

# Workflow
Pre-processing:
  - tf-idf calculation (where each document might be a review OR a specific user reviews collection)
  - ~tokenization~
  - ~POS tagging~
  - ~embedding~

## Model No.1 - RNN + Glove
Generally: takes in each review, removes stopwords, transform tokens into embedding, predicts a score.
As features:
- ~embbeded tokens~
- ~product_id (asin)~
- ~num reviews for product~
- ~len of mean product review length~
- ~num of adjactives~


In [7]:
!wget  http://nlp.stanford.edu/data/glove.840B.300d.zip

z = zipfile.ZipFile("./glove.840B.300d.zip")
glove_pd = pd.read_csv(z.open('glove.840B.300d.txt'), sep=" ", quoting=3, header=None, index_col=0,skiprows=lambda x: x >200000)
glove = {key: val.values for key, val in glove_pd.T.items()}
del glove_pd

--2020-06-24 11:09:39--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2020-06-24 11:09:39--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2020-06-24 11:09:39--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip

# part 1 - pre-processing and feature extraction

In [8]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords 
nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
from nltk.stem.porter import *
nltk.download('averaged_perceptron_tagger')


stemmer = PorterStemmer()
tokenizer = nltk.RegexpTokenizer(r"\w+")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [82]:
toy = data.iloc[np.random.choice(len(data),500),:]
toy

Unnamed: 0,asin,reviewText,overall
431234,B005447JCY,I love my filter water in these. I washed the ...,5.0
494773,B008MWKGH0,"Bought 2 of these for our guest rooms (1 gray,...",4.0
79799,B00017UT6W,Great quality product. A good bronze color an...,5.0
149988,B000FSFOM6,I bought one of the 20X72 inch mats for my wif...,1.0
342229,B0032SK8XG,I have no complaints about this item. It is w...,5.0
...,...,...,...
45574,B00008439Y,I bought the Roomba about 6 months ago. It has...,3.0
297464,B0027IS6NG,I opted for this curtain because I didn't want...,5.0
399715,B004BA8UWA,So far the king size has lasted and is definit...,3.0
466357,B006SOHESS,I make frozen yogurt for my pups. They love it...,5.0


In [83]:
s2i = {w:i for i,w in enumerate(glove.keys())}
i2s = {i:w for w,i in s2i.items()}
i2v = {i:v for i,v in enumerate(glove.values())}
stop_words = set(stopwords.words('english'))
def intranslate(sent):
  return np.array([s2i[word] if word in s2i.keys() else 0 for word in sent]).reshape(1,-1)

def preprocessor (entry, stopwords):  
  #1. remove stop words and punctuation marks:
  tokenized = tokenizer.tokenize(entry)
  tokenized = [w.lower() for w in tokenized  if not w in stop_words]
  return tokenized

def embed (sent, embbeding_dict):
  return np.array([embbeding_dict[word] if word in embbeding_dict.keys() else np.zeros((1,300)) for word in sent])

nltk.download('averaged_perceptron_tagger')
def adj_count(entry):
  return np.sum([1 if pos[1].startswith('JJ') else 0 for pos in np.array(nltk.pos_tag(tokenizer.tokenize(entry)))])

def preprocessor_df (toy):

  toy['tokenized'] = toy['reviewText'].apply(lambda x: preprocessor(x, stopwords))
  toy['len'] = toy['tokenized'].apply(lambda x: len(x))
  toy['joined'] = toy['tokenized'].apply(lambda x: ' '.join(x))
  toy['int_sentences'] = toy.tokenized.apply(lambda x: intranslate(x))

  toy['glove'] = toy.tokenized.apply(lambda x:embed (x, glove))

  #add number of reviews as feature
  toy = toy.join(toy.groupby('asin')['reviewText'].count(), on = 'asin',rsuffix = '_count')

  # remove too long reviews
  toy = toy[toy.len < toy.len.quantile(.95)]

  #add product mean length as a feature
  toy = toy.join(toy.groupby('asin')['len'].mean(), on = 'asin',rsuffix = '_mean')
  
  #count number of adhuctives in review
  toy['num_adjs'] = toy.reviewText.apply(lambda x: adj_count(x))

  return toy

toy = preprocessor_df(toy)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user

# part 2 - Bidirectional LSTM

In [84]:
dt = toy
dt.head()

Unnamed: 0,asin,reviewText,overall,tokenized,len,joined,int_sentences,glove,reviewText_count,len_mean,num_adjs
431234,B005447JCY,I love my filter water in these. I washed the ...,5.0,"[i, love, filter, water, i, washed, soon, i, g...",20,i love filter water i washed soon i got weird ...,"[[108, 185, 3268, 333, 108, 9734, 677, 108, 21...","[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397...",1,20.0,3
494773,B008MWKGH0,"Bought 2 of these for our guest rooms (1 gray,...",4.0,"[bought, 2, guest, rooms, 1, gray, 1, camel, s...",24,bought 2 guest rooms 1 gray 1 camel silky smoo...,"[[1475, 80, 2728, 1494, 66, 7305, 66, 23233, 2...","[[0.05361799999999999, 0.07041900000000001, -0...",1,24.0,4
79799,B00017UT6W,Great quality product. A good bronze color an...,5.0,"[great, quality, product, a, good, bronze, col...",14,great quality product a good bronze color heav...,"[[158, 396, 493, 6, 112, 11144, 866, 2007, 188...","[[-0.093846, 0.58296, -0.019271, -0.0700720000...",2,17.5,4
149988,B000FSFOM6,I bought one of the 20X72 inch mats for my wif...,1.0,"[i, bought, one, 20x72, inch, mats, wife, two,...",55,i bought one 20x72 inch mats wife two years ag...,"[[108, 1475, 51, 0, 2786, 17173, 1119, 135, 14...","[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397...",1,55.0,7
342229,B0032SK8XG,I have no complaints about this item. It is w...,5.0,"[i, complaints, item, it, well, made, looks, n...",14,i complaints item it well made looks nice serv...,"[[108, 6346, 1308, 21, 133, 171, 628, 490, 397...","[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397...",1,14.0,1


In [58]:
# import torch
# from torch import nn
# from torch.nn.utils.rnn import pad_sequence

from keras import Model
from keras.preprocessing import sequence
from keras.models import Sequential,Input
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional,Concatenate
from sklearn.model_selection import  train_test_split

MAXLEN = np.max(data.len)
BATCHSIZE = 32


In [102]:
X = dt.int_sentences
X = [x[0] for x in X]
X = sequence.pad_sequences(X, maxlen=MAXLEN, padding='post', truncating = 'post')
X = np.concatenate([X , np.array(dt.len).reshape(-1,1) , np.array(dt.reviewText_count).reshape(-1,1),
                np.array(dt.len_mean).reshape(-1,1),np.array(dt.num_adjs).reshape(-1,1)],axis = 1)
X_train,X_test,y_train,y_test = train_test_split(X,dt.overall,test_size = .25, random_state = 123)

X_t1,X_t2 = X_train[:,:MAXLEN],X_train[:,MAXLEN:]


In [15]:



glove_matrix = np.array(list(glove.values())[:200000])

In [16]:
glove_matrix = np.concatenate([np.zeros((1,300)),glove_matrix])

In [96]:
inp1,inp2 = Input(shape=(MAXLEN,)),Input(shape=(4,))
x_emb = Embedding(glove_matrix.shape[0],300 ,weights=[glove_matrix],input_length =  MAXLEN, trainable = False)(inp1)
x_emb = Bidirectional(LSTM(64))(x_emb)
x_emb = Dropout(0.2)(x_emb)
layer = Concatenate()([x_emb, inp2])
output = Dense(1,input_dim = 128)(layer)
model = Model([inp1,inp2], output)
model.compile('adam', 'mse', metrics=['mse'])

In [97]:
print(model.summary())

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_23 (InputLayer)           (None, 138)          0                                            
__________________________________________________________________________________________________
embedding_13 (Embedding)        (None, 138, 300)     59998500    input_23[0][0]                   
__________________________________________________________________________________________________
bidirectional_13 (Bidirectional (None, 128)          186880      embedding_13[0][0]               
__________________________________________________________________________________________________
dropout_13 (Dropout)            (None, 128)          0           bidirectional_13[0][0]           
____________________________________________________________________________________________

In [116]:
model.fit([X_train[:,:MAXLEN],X_train[:,MAXLEN:]], y_train,
          batch_size=BATCHSIZE,
          epochs=40,
          validation_data=[[X_test[:,:MAXLEN],X_test[:,MAXLEN:]], y_test])

Train on 356 samples, validate on 119 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.callbacks.History at 0x7f5e44c8a630>