<a href="https://colab.research.google.com/github/Hernanros/NLP-Ydata/blob/master/HW5/HW_5_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HW5 - Rating prediction using Amazon's Reviews
    
In this exercise, you'll train a text classifier on a **subset** of the Amazon Reviews dataset.

The Amazon Reviews dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.


We will focus on the Home and Kitchen segment, which contains ~550k reviews and can be downloaded here: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz

You will predict the rating that was given to a product from the review.

The dataset contains the following fields for each review, in JSON format:
1. "reviewerID": "A11N155CW1UV02"
1. "asin": "B000H00VBQ"
1. "reviewerName": "AdrianaM"
1. "helpful": [0, 0]
1. "reviewText": "I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all."
1. "overall": 2.0
1. "summary": "A little bit boring for me"
1. "unixReviewTime": 1399075200
1. "reviewTime": "05 3, 2014"




Please note that the **only** two fields you are allowed to use in this exercise are "reviewText", which contains the review, and "overall", which contains the rating. In addition, you have the **option** to use the "asin" field, which is a unique product identifier. You may (or may not :) ) find this field useful.
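
As a minimal sketch of the field restriction above, the snippet below parses one JSON line and keeps only the permitted fields. The helper name `extract_fields`, the constant `ALLOWED_FIELDS`, and the shortened sample record are illustrative assumptions, not part of the assignment.

```python
import json

# Fields the exercise allows: the review text, the rating, and (optionally) the product id.
ALLOWED_FIELDS = ("asin", "reviewText", "overall")

def extract_fields(json_line):
    """Parse one line of the reviews file and keep only the allowed fields (hypothetical helper)."""
    record = json.loads(json_line)
    return {field: record.get(field) for field in ALLOWED_FIELDS}

# Shortened version of the sample record listed above.
sample_line = json.dumps({
    "reviewerID": "A11N155CW1UV02",
    "asin": "B000H00VBQ",
    "reviewText": "It didn't appeal to me at all.",
    "overall": 2.0,
    "summary": "A little bit boring for me",
})

example = extract_fields(sample_line)
print(example)
```

Dropping the other fields right after parsing makes it harder to use a disallowed field by accident later on.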



## General guidelines

1. You are required to implement at least two models.
1. The first should be a CNN or an RNN (or a combination) and should include the use of GloVe embeddings.
1. The second model should be implemented using the transformers package and include transfer learning concepts that were mentioned in the lecture.
1. Pay attention to any preprocessing steps that are needed.
1. Feel free to be creative and use any method that was mentioned in the lectures (e.g., tf-idf, POS, ...). Extra points will be given for creativity.
1. The main criterion for evaluation is not the overall score but rather the entire process (preprocessing, efficient training, ...).





In [60]:
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz

--2020-06-18 09:48:04--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Home_and_Kitchen_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 138126598 (132M) [application/x-gzip]
Saving to: ‘reviews_Home_and_Kitchen_5.json.gz’


2020-06-18 09:48:36 (4.20 MB/s) - ‘reviews_Home_and_Kitchen_5.json.gz’ saved [138126598/138126598]



In [61]:
! gunzip reviews_Home_and_Kitchen_5.json.gz

gzip: reviews_Home_and_Kitchen_5.json already exists; do you wish to overwrite (y or n)? y


In [76]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile

In [66]:
import json

revs = []
with open('reviews_Home_and_Kitchen_5.json', 'r') as f:
    for line in f:
        try:
            revs.append(json.loads(line))
        except json.JSONDecodeError:
            # skip lines that are not valid JSON
            continue

In [74]:
df = pd.DataFrame(revs)

data = df[['asin','reviewText','overall']]
data.head()

Unnamed: 0,asin,reviewText,overall
0,615391206,My daughter wanted this book and the price on ...,5.0
1,615391206,I bought this zoku quick pop for my daughterr ...,5.0
2,615391206,There is no shortage of pop recipes available ...,4.0
3,615391206,This book is a must have if you get a Zoku (wh...,5.0
4,615391206,This cookbook is great. I have really enjoyed...,4.0


In [75]:
data.groupby('asin').size()

asin
0615391206    11
0689027818     5
0912696591    93
1223070743     8
1567120709    16
              ..
B00L8HA5L8    14
B00L9KOZBK     6
B00LAI4UYS     5
B00LB18EKK    19
B00LBFUU12     9
Length: 28237, dtype: int64

# Workflow
Pre-processing:
  - tf-idf calculation (where each document is either a single review or the collection of reviews written by a specific user)
  - ~tokenization~
  - POS tagging
  - ~embedding~
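
The tf-idf step in the list above can be sketched in plain Python as follows. This is a simplified illustration, assuming raw term frequency (count divided by document length) and `idf = log(N / df)`; library implementations typically add smoothing and normalization, so their values will differ. The function name `tfidf` and the toy documents are made up for this example. A "document" here is any token list, so it could be a single review or several reviews concatenated.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per tokenized document (simplified tf-idf)."""
    n_docs = len(documents)
    # document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

toy_docs = [["great", "pan"], ["great", "knife"], ["bad", "pan"]]
toy_weights = tfidf(toy_docs)
print(toy_weights[1])
```

Terms that occur in only one document ("knife", "bad") receive the largest weights, while a term present in every document would receive a weight of zero.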

## Model No.1 - RNN + GloVe
In general, the model takes in each review, removes stopwords, transforms the tokens into embeddings, and predicts a score.
Features:
- ~embedded tokens~
- ~product_id (asin)~
- ~number of reviews per product~
- ~mean review length per product~
- number of adjectives


In [77]:
# !wget  http://nlp.stanford.edu/data/glove.840B.300d.zip

# z = zipfile.ZipFile("./glove.840B.300d.zip")
# glove_pd = pd.read_csv(z.open('glove.840B.300d.txt'), sep=" ", quoting=3, header=None, index_col=0)
# glove = {key: val.values for key, val in glove_pd.T.items()}
# del glove_pd

--2020-06-18 10:22:22--  http://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.840B.300d.zip [following]
--2020-06-18 10:22:22--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2020-06-18 10:22:23--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip

ModuleNotFoundError: ignored
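
The commented-out cell above loads the GloVe vectors through pandas. A lighter alternative, sketched here under the assumption that the file uses the usual GloVe text layout (one token followed by its vector components, separated by single spaces), is to read it line by line. `load_glove` is a hypothetical helper, and the tiny in-memory file stands in for `glove.840B.300d.txt`.

```python
import io

def load_glove(lines, dim):
    """Build {token: list of floats} from GloVe-style text lines (hypothetical helper)."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip empty or malformed lines
        # Take the last `dim` fields as the vector so a token containing spaces stays intact.
        token = " ".join(parts[:-dim])
        embeddings[token] = [float(x) for x in parts[-dim:]]
    return embeddings

# Stand-in for the real file, using 3-dimensional vectors instead of 300.
fake_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
toy_glove = load_glove(fake_file, dim=3)
print(toy_glove["good"])
```

With the real file, the same function could be called on an open file handle with `dim=300`; reading line by line avoids holding an intermediate DataFrame in memory.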

# Part 1 - Pre-processing and feature extraction

In [234]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# download the tokenizer models, stop-word list, and POS tagger used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

stemmer = PorterStemmer()
# keep runs of word characters only, which also drops punctuation
tokenizer = nltk.RegexpTokenizer(r"\w+")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [235]:
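# draw a 500-review toy sample for quick experiments (np.random.choice samples with replacement by default)
toy = data.iloc[np.random.choice(len(data),500),:]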
toy

Unnamed: 0,asin,reviewText,overall,tokenized
122167,B000A68E48,"Good product, well built, and works as describ...",5.0,"[Good product, well built, and works as descri..."
382462,B003YUBQI8,I love these measuring spoons! They're measur...,5.0,"[I love these measuring spoons!, They're measu..."
326211,B002R5A178,I love that it folds flat and stores easily wh...,5.0,[I love that it folds flat and stores easily w...
132651,B000BVFYUO,For years I had been using a wooden turntable ...,5.0,[For years I had been using a wooden turntable...
547753,B00IW110B2,The Spiral Slicer is so amazing. I have seen ...,5.0,"[The Spiral Slicer is so amazing., I have seen..."
...,...,...,...,...
138210,B000E9Q0TM,"I liked this grinder--good looks, one hand ope...",2.0,"[I liked this grinder--good looks, one hand op..."
540864,B00GCETRWU,Really charming display for cupcakes. Has a ti...,4.0,"[Really charming display for cupcakes., Has a ..."
385862,B00416XIW6,The Snapware storage containers are the best t...,5.0,[The Snapware storage containers are the best ...
546997,B00IMV7I7W,"First of all, this Serta queen-size box spring...",1.0,"[First of all, this Serta queen-size box sprin..."


In [285]:
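# lookup tables between GloVe tokens, integer ids, and vectors
s2i = {w: i for i, w in enumerate(glove.keys())}
i2s = {i: w for w, i in s2i.items()}
i2v = {i: v for i, v in enumerate(glove.values())}

def intranslate(sent):
  # map tokens to ids; unknown tokens fall back to 0, which is also the id of the first GloVe token
  return np.array([s2i.get(word, 0) for word in sent]).reshape(1, -1)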

# assumption: NLTK's English stop-word list (the cell that originally defined stop_words is not shown)
stop_words = set(stopwords.words('english'))

def preprocessor(entry, stop_words):
  # 1. tokenize (which drops punctuation), remove stop words, and lowercase.
  # The stop-word check runs before lowercasing, so capitalized stop words such as "I" are kept.
  tokenized = tokenizer.tokenize(entry)
  tokenized = [w.lower() for w in tokenized if w not in stop_words]
  return tokenized

def embed(sent, embedding_dict):
  # look up each token's vector; unknown tokens get a zero vector of the same shape (300,)
  return np.array([embedding_dict[word] if word in embedding_dict else np.zeros(300) for word in sent])

nltk.download('averaged_perceptron_tagger')
def adj_count(entry):
  # count tokens whose POS tag starts with 'JJ' (adjectives)
  return sum(1 for _, tag in nltk.pos_tag(tokenizer.tokenize(entry)) if tag.startswith('JJ'))

def preprocessor_df(toy):
  # work on a copy so that the caller's frame is not modified
  toy = toy.copy()

  toy['tokenized'] = toy['reviewText'].apply(lambda x: preprocessor(x, stop_words))
  toy['len'] = toy['tokenized'].apply(len)
  toy['joined'] = toy['tokenized'].apply(lambda x: ' '.join(x))
  toy['int_sentences'] = toy['tokenized'].apply(intranslate)

  toy['glove'] = toy['tokenized'].apply(lambda x: embed(x, glove))

  # add the number of reviews per product as a feature
  toy = toy.join(toy.groupby('asin')['reviewText'].count(), on='asin', rsuffix='_count')

  # remove reviews that are too long (at or above the 95th length percentile)
  toy = toy[toy['len'] < toy['len'].quantile(.95)]

  # add the mean review length per product as a feature
  toy = toy.join(toy.groupby('asin')['len'].mean(), on='asin', rsuffix='_mean')

  # count the number of adjectives in each review
  toy['num_adjs'] = toy['reviewText'].apply(adj_count)

  return toy

toy = preprocessor_df(toy)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
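
In `intranslate`, out-of-vocabulary words are mapped to index 0, which is also the index of the first GloVe token and the value later used for padding. One possible way to keep these cases apart, shown as a sketch rather than as this notebook's method, is to reserve index 0 for padding and index 1 for unknown words. The names `PAD_ID`, `UNK_ID`, `build_vocab`, and `encode` are introduced here for illustration only.

```python
PAD_ID = 0  # reserved for padding
UNK_ID = 1  # reserved for words missing from the embedding vocabulary

def build_vocab(words):
    """Map each word to an integer id, starting after the two reserved ids."""
    return {word: idx for idx, word in enumerate(words, start=2)}

def encode(tokens, vocab):
    """Translate a token list into ids, using UNK_ID for unknown tokens."""
    return [vocab.get(token, UNK_ID) for token in tokens]

toy_vocab = build_vocab(["good", "product", "well"])
toy_ids = encode(["good", "gizmo", "well"], toy_vocab)
print(toy_ids)
```

If this scheme were adopted, the embedding matrix would need two extra leading rows (for example zero vectors) so that row `i` still corresponds to id `i`.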


# Part 2 - Bidirectional LSTM

In [286]:
data = toy
data.head()

Unnamed: 0,asin,reviewText,overall,tokenized,len,joined,glove,reviewText_count,len_mean,num_adjs,int_sentences,reviewText_count.1,len_mean.1,reviewText_count.2,len_mean.2
122167,B000A68E48,"Good product, well built, and works as describ...",5.0,"[good, product, well, built, works, described,...",19,good product well built works described got tw...,"[[-0.42625, 0.4431, -0.34517, -0.1326, -0.0581...",1,19.0,3,"[[112, 493, 133, 1178, 647, 1945, 218, 135, 71...",1,19.0,1,19.0
382462,B003YUBQI8,I love these measuring spoons! They're measur...,5.0,"[i, love, measuring, spoons, they, measuring, ...",36,i love measuring spoons they measuring spoons ...,"[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397...",1,36.0,4,"[[108, 185, 7677, 32659, 49, 7677, 32659, 108,...",1,36.0,1,36.0
326211,B002R5A178,I love that it folds flat and stores easily wh...,5.0,"[i, love, folds, flat, stores, easily, use, it...",19,i love folds flat stores easily use it hold mu...,"[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397...",1,19.0,4,"[[108, 185, 18103, 2488, 2151, 1069, 125, 21, ...",1,19.0,1,19.0
132651,B000BVFYUO,For years I had been using a wooden turntable ...,5.0,"[for, years, i, using, wooden, turntable, thou...",49,for years i using wooden turntable thought pre...,"[[-0.17224, 0.18234, -0.27847, -0.084665999999...",1,49.0,8,"[[11, 141, 108, 245, 5387, 38841, 400, 491, 11...",1,49.0,1,49.0
359247,B003G2ZVSA,I ordered this on May 8th and received it in j...,5.0,"[i, ordered, may, 8th, received, 4, business, ...",30,i ordered may 8th received 4 business days it ...,"[[0.18733, 0.40595, -0.51174, -0.55482, 0.0397...",2,20.0,6,"[[108, 3311, 119, 3690, 933, 131, 220, 257, 21...",2,20.0,2,20.0


In [279]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional

MAXLEN = np.max(data.len)
BATCHSIZE = 32


In [287]:
sequence.pad_sequences(data.int_sentences.iloc[0], maxlen=MAXLEN, padding='post', truncating = 'post')

array([[ 112,  493,  133, 1178,  647, 1945,  218,  135, 7106, 3426,  250,
        2568, 8438,  751,   27,  158,  493,  158,  423,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]], dtype=int32)
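
The output above shows the effect of `padding='post'` and `truncating='post'`: the 19 token ids come first and zeros fill the row up to `MAXLEN`. A rough pure-Python equivalent for a single sequence is given below, purely to illustrate the behaviour; `pad_post` is a hypothetical helper, not a Keras function.

```python
def pad_post(ids, maxlen, value=0):
    """Cut or pad a list of ids at the end so it has exactly `maxlen` entries."""
    trimmed = list(ids)[:maxlen]  # 'post' truncation drops ids from the end
    return trimmed + [value] * (maxlen - len(trimmed))  # 'post' padding appends the fill value

padded = pad_post([112, 493, 133], maxlen=5)
truncated = pad_post([112, 493, 133], maxlen=2)
print(padded, truncated)
```

Padding at the end keeps the real tokens at the start of every row, and every review in a batch ends up with the same length.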