## Amazon review data 2018
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, it includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:

- More reviews: The total number of reviews is 233.1 million (142.8 million in 2014).
- Newer reviews: Current data includes reviews in the range May 1996 - Oct 2018.
- Metadata:
  - We have added transaction metadata for each review shown on the review page. Such information includes:
    - Product information, e.g. color (white or black), size (large or small), package type (hardcover or electronics), etc.
    - Product images that are taken after the user received the product.
  - We have also added more detailed metadata from the product landing page. Such detailed information includes:
    - Bullet-point descriptions under the product title.
    - Technical details table (attribute-value pairs).
    - Similar products table.
- More categories: Includes 5 new product categories.

In [5]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

In [6]:
!wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Books.json.gz
!wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Books.json.gz

--2022-01-14 00:00:33--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Books.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11813848069 (11G) [application/octet-stream]
Saving to: ‘Books.json.gz.2’


2022-01-14 00:05:03 (41.7 MB/s) - ‘Books.json.gz.2’ saved [11813848069/11813848069]

--2022-01-14 00:05:03--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Books.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1219104464 (1.1G) [application/octet-stream]
Saving to: ‘meta_Books.json.gz.2’


2022-01-14 00:05:31 (41.7 MB/s) - ‘meta_Books.json.gz.2’ saved [1219104464/1219104464]
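
The `.2` suffix in the log above suggests that copies of both files were already present when `wget` ran, so about 12 GB was transferred again. A minimal sketch of a download step that skips files already on disk, using only the standard library; the helper name `download_once` is hypothetical and not part of the original notebook:

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download_once(url, dest_dir="."):
    """Download `url` into `dest_dir` unless a file with that name already exists."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if dest.exists():
        return dest  # skip the transfer, so no ".1" / ".2" duplicates appear
    tmp = dest.with_name(dest.name + ".part")
    with urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)  # stream in chunks instead of loading into memory
    tmp.rename(dest)  # only a completed download gets the final name
    return dest


# Usage with the URLs from the cell above (left commented out: each call transfers gigabytes):
# download_once("http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Books.json.gz")
# download_once("http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Books.json.gz")
```

Writing to a `.part` file first means an interrupted download is never mistaken for a complete one on the next run.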



In [9]:
### load the metadata

data = []
count = 0
with gzip.open('meta_Books.json.gz') as f:
    for line in f:
        count += 1
        data.append(json.loads(line.strip()))
        if count == 100000:  # stop after the first 100,000 products
            break

# length of the list: the number of products loaded (capped at 100,000), not the total in the file
print(len(data))

# first record in the list
print(data[0])

100000
{'category': [], 'tech1': '', 'description': ['It is a biology book with God&apos;s perspective.'], 'fit': '', 'title': 'Biology Gods Living Creation Third Edition 10 (A Beka Book Science Series)', 'also_buy': ['0669009075', 'B000K2P5SA', 'B00MD4G2N0', 'B000ASIPTK', '0130508470', '1892427524', '0321567919', 'B000BJBH20', '0547484631', 'B000HAJTQO', 'B000AUCX7I', '0130365645', 'B000BI1Y2O', '0395976715', '052817729X', '1579246443', 'B001CK63XK', '1591669847', '0395879884', '836585161X', 'B01J2F9BH6', 'B00KYEHR4E', '158008141X', '1857928393', '0927545829', 'B015AR0RA0', 'B000TVHHRE', '0865167990', '1579246052', 'B003NXXVD4', 'B000OH6AX0', '061802087X', 'B000NU2X02', '0743252012'], 'tech2': '', 'brand': 'Keith Graham', 'feature': [], 'rank': '1,349,781 in Books (', 'also_view': ['0019777701', 'B000AUCX7I', 'B000K2P5SA', 'B001CK63XK', 'B01J2F9BH6', 'B000BI1Y2O', '1932012540', 'B0095ZCRCK'], 'main_cat': 'Books', 'similar_item': '', 'date': '', 'price': '$39.94', 'asin': '0000092878',
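
The counter-and-break loading pattern can be wrapped in a small reusable function. The sketch below is one possible way to do it with `itertools.islice`; the name `read_json_lines` and its `limit` parameter are illustrative assumptions, not part of the original notebook:

```python
import gzip
import json
from itertools import islice


def read_json_lines(path, limit=None):
    """Parse up to `limit` records from a gzipped JSON-lines file (all records if None)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops reading after `limit` lines, so the rest of the file is never decompressed
        return [json.loads(line) for line in islice(f, limit)]


# Usage, assuming the files downloaded above are in the working directory:
# data = read_json_lines("meta_Books.json.gz", limit=100_000)
# data_books = read_json_lines("Books.json.gz", limit=1_000_000)
```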

In [10]:
### load the reviews data

data_books = []
count = 0
with gzip.open('Books.json.gz') as f:
    for line in f:
        count += 1
        data_books.append(json.loads(line.strip()))
        if count == 1000000:  # stop after the first 1,000,000 reviews
            break

# length of the list: the number of reviews loaded (capped at 1,000,000), not the total in the file
print(len(data_books))

# first record in the list
print(data_books[0])

1000000
{'overall': 5.0, 'verified': False, 'reviewTime': '08 12, 2005', 'reviewerID': 'A1C6M8LCIX4M6M', 'asin': '0001713353', 'style': {'Format:': ' Paperback'}, 'reviewerName': 'June Bug', 'reviewText': "This book is a winner with both of my boys.  They really enjoy the pictures and the story.  It's a classic.", 'summary': "Children's favorite", 'unixReviewTime': 1123804800}
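
Each review carries its date twice: as text in `reviewTime` and as seconds since the epoch in `unixReviewTime`. A short sketch of turning either field into a proper datetime column, using two rows copied from this notebook's sample; the new column names are arbitrary choices:

```python
import pandas as pd

# Two (reviewTime, unixReviewTime) pairs taken from the sample reviews in this notebook
reviews = pd.DataFrame({
    "reviewTime": ["08 12, 2005", "10 3, 2016"],
    "unixReviewTime": [1123804800, 1475452800],
})

# Seconds since the Unix epoch -> timestamps
reviews["review_date"] = pd.to_datetime(reviews["unixReviewTime"], unit="s")

# The text field uses a "month day, year" layout; parsing it should give the same dates
reviews["review_date_from_text"] = pd.to_datetime(reviews["reviewTime"], format="%m %d, %Y")

print(reviews[["review_date", "review_date_from_text"]])
```

With a datetime column in place, filters such as `reviews[reviews["review_date"].dt.year >= 2015]` become straightforward.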


In [11]:
# convert the lists of records into pandas DataFrames

df = pd.DataFrame(data)

print(len(df))

df_books = pd.DataFrame(data_books)

print(len(df_books))

100000
1000000


In [12]:
df_books.head(5)

,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5.0,False,"08 12, 2005",A1C6M8LCIX4M6M,0001713353,{'Format:': ' Paperback'},June Bug,This book is a winner with both of my boys. T...,Children's favorite,1123804800,,
1,5.0,False,"03 30, 2005",A1REUF3A1YCPHM,0001713353,{'Format:': ' Hardcover'},TW Ervin II,"The King, the Mice and the Cheese by Nancy Gur...",A story children will love and learn from,1112140800,,
2,5.0,False,"04 4, 2004",A1YRBRK2XM5D5,0001713353,{'Format:': ' Hardcover'},Rebecca L. Menner,My daughter got her first copy from her great-...,Third copy,1081036800,5.0,
3,5.0,False,"02 21, 2004",A1V8ZR5P78P4ZU,0001713353,{'Format:': ' Hardcover'},Mindy Stone,I remember this book from when I was a child a...,Graphically Wonderful!,1077321600,,
4,5.0,True,"10 3, 2016",A2ZB06582NXCIV,0001713353,,B. Deniger,"Just as I remembered it, one of my favorites f...","Great condition, very happy to have this to sh...",1475452800,,


In [13]:
df.head()

,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,[],,[It is a biology book with God&apos;s perspect...,,Biology Gods Living Creation Third Edition 10 ...,"[0669009075, B000K2P5SA, B00MD4G2N0, B000ASIPT...",,Keith Graham,[],"1,349,781 in Books (","[0019777701, B000AUCX7I, B000K2P5SA, B001CK63X...",Books,,,$39.94,0000092878,[],[]
1,"[Books, New, Used & Rental Textbooks, Medicine...",,[],,Mksap 16 Audio Companion: Medical Knowledge Se...,[],,Acp,[],"1,702,625 in Books (","[B01MUCYEV7, B01KUGTY6O]",Books,,,,000047715X,[],[]
2,"[Books, Arts & Photography, Music]",,"[Discography of American Punk, Hardcore, and P...",,"Flex! Discography of North American Punk, Hard...",[],,Burkhard Jarisch,[],"6,291,012 in Books (",[],Books,,,$199.99,0000004545,[],[]
3,"[Books, Arts & Photography, Music]",,[This is a collection of classic gospel hymns ...,,Heavenly Highway Hymns: Shaped-Note Hymnal,[],,Stamps/Baxter,[],"2,384,057 in Books (","[0006180116, 0996092730, B000QFOGY0, B06WWKNDL...",Books,,,,0000013765,[],[]
4,[],,[],,Georgina Goodman Nelson Womens Size 8.5 Purple...,[],,,[],"11,735,726 in Books (",[],Books,,,$164.10,0000000116,[],[]
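
Several metadata fields in the preview above arrive as raw strings: `price` carries a currency symbol, `rank` mixes a number with text, and `description` contains HTML entities such as `&apos;`. The sketch below shows one way to normalize them, on a tiny frame built from values seen in this sample. It assumes these fields are always strings of this shape, which may not hold across the full file, and the new column names are arbitrary:

```python
import html

import pandas as pd

# Values copied from the first rows of the metadata sample
meta = pd.DataFrame({
    "price": ["$39.94", "", "$199.99"],
    "rank": ["1,349,781 in Books (", "1,702,625 in Books (", "6,291,012 in Books ("],
    "description": [["It is a biology book with God&apos;s perspective."], [], []],
})

# "$39.94" -> 39.94; empty or unparseable prices become NaN
meta["price_usd"] = pd.to_numeric(
    meta["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# "1,349,781 in Books (" -> 1349781 (first run of digits and commas)
meta["rank_in_books"] = pd.to_numeric(
    meta["rank"].str.extract(r"([\d,]+)", expand=False).str.replace(",", "", regex=False),
    errors="coerce",
)

# Decode HTML entities inside each description fragment
meta["description"] = meta["description"].apply(
    lambda parts: [html.unescape(part) for part in parts]
)

print(meta[["price_usd", "rank_in_books"]])
```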


In [14]:
# group reviews by asin and count the non-null values in each column, most-rated products first
df_books.groupby("asin").count().sort_values("overall", ascending=False)

asin,overall,verified,reviewTime,reviewerID,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0007420412,44371,44371,44371,44371,44324,44364,44358,44349,44371,1458,20
0007548672,27944,27944,27944,27944,27929,27944,27940,27941,27944,998,21
000711835X,18591,18591,18591,18591,18319,18590,18587,18589,18591,2176,89
0007444117,18029,18029,18029,18029,18028,18028,18028,18013,18029,6858,1
0007378033,13088,13088,13088,13088,13036,13088,13088,13088,13088,115,4
...,...,...,...,...,...,...,...,...,...,...,...
0007137621,1,1,1,1,1,1,1,1,1,1,0
000817461X,1,1,1,1,1,1,1,1,1,0,0
0007137613,1,1,1,1,1,1,1,1,1,1,0
0030036798,1,1,1,1,1,1,1,1,1,1,0


In [15]:
# average the numeric columns per asin, highest-rated products first
df_books.groupby("asin").mean(numeric_only=True).sort_values("overall", ascending=False)

asin,overall,verified,unixReviewTime
0007307454,5.0,0.50,1.350173e+09
0060012854,5.0,0.00,1.116202e+09
0060164786,5.0,0.75,1.257757e+09
0007131119,5.0,1.00,1.257206e+09
0007131852,5.0,0.00,1.056758e+09
...,...,...,...
0060173025,1.0,0.00,8.612352e+08
0028638611,1.0,0.00,1.142957e+09
0005047218,1.0,0.00,1.125101e+09
0002250535,1.0,0.00,1.089677e+09
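
Sorting by mean rating alone tends to push products with very few reviews to both ends of the ranking, since a single 5.0 or 1.0 review sets the average. One possible refinement is to compute the count and the mean together and keep only products above a minimum number of reviews. The toy data and the `MIN_REVIEWS` threshold below are assumptions chosen for illustration:

```python
import pandas as pd

# Toy stand-in for df_books with only the columns needed here
reviews = pd.DataFrame({
    "asin": ["A", "A", "A", "B", "C", "C"],
    "overall": [5.0, 4.0, 3.0, 5.0, 1.0, 2.0],
})

MIN_REVIEWS = 2  # hypothetical cut-off; tune it to the size of the sample

# One pass over the groups yields both statistics
stats = reviews.groupby("asin")["overall"].agg(n_reviews="count", mean_rating="mean")

# Drop products with too few reviews before ranking by average rating
popular = stats[stats["n_reviews"] >= MIN_REVIEWS].sort_values("mean_rating", ascending=False)
print(popular)
```

Here product `B` has a perfect average from a single review and is excluded, while `A` and `C` remain.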


In [21]:
#df.head()
df_books.groupby("reviewerID").count()
# There is not enough data in this sample to group by user

reviewerID,overall,verified,reviewTime,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
A0015332H21AK8WZ0ZCS,1,1,1,1,1,1,1,1,1,0,0
A0016480NKTAGGP3FK83,1,1,1,1,1,1,1,1,1,0,0
A0031182Y9AZ7V9OI7FJ,1,1,1,1,1,1,1,1,1,0,0
A0037178T85I7MBSWLOR,1,1,1,1,1,1,1,1,1,0,0
A00394165MU8CCTDESEE,1,1,1,1,1,1,1,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
AZZYK2BT6EU8V,1,1,1,1,1,1,1,1,1,0,0
AZZYUA6JI1MO0,1,1,1,1,1,1,1,1,1,0,0
AZZYZQB8GYU82,1,1,1,1,1,1,1,1,1,0,0
AZZZBHHLU1CMM,1,1,1,1,1,1,1,1,1,0,0
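
The visible rows of the table above all show a single review per reviewer, but the elided middle is not displayed. A quick way to check how sparse the per-user data really is would be to look at the distribution of reviews per reviewer. A minimal sketch on toy data; on the real sample the same lines would run against `df_books["reviewerID"]`:

```python
import pandas as pd

# Toy stand-in for df_books["reviewerID"]
reviews = pd.DataFrame({"reviewerID": ["u1", "u2", "u1", "u3", "u1", "u2"]})

# Number of reviews written by each reviewer
per_user = reviews["reviewerID"].value_counts()

# How many reviewers wrote exactly 1, 2, 3, ... reviews
distribution = per_user.value_counts().sort_index()

# Fraction of reviewers with more than one review
share_multi = (per_user > 1).mean()

print(distribution)
print(f"{share_multi:.1%} of reviewers have more than one review")
```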


In [None]:
### remove rows with an unformatted title (some 'title' values still contain HTML-style content)

df3 = df.fillna('')
df4 = df3[df3.title.str.contains('getTime')]   # rows with unformatted titles
df5 = df3[~df3.title.str.contains('getTime')]  # rows with properly formatted titles
print(len(df4))
print(len(df5))
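
A variant of the filter above computes the mask once and lets `str.contains` handle missing titles through `na=False`, so the other columns do not need to be filled with empty strings. This is a sketch on made-up rows; the third title is a hypothetical example of leftover script content, not a value taken from the dataset:

```python
import pandas as pd

meta = pd.DataFrame({
    "title": [
        "Heavenly Highway Hymns: Shaped-Note Hymnal",
        None,  # missing title
        "var aPageStart = (new Date()).getTime();",  # hypothetical unformatted title
    ]
})

# regex=False matches the literal substring; na=False treats missing titles as "no match"
is_unformatted = meta["title"].str.contains("getTime", regex=False, na=False)

unformatted = meta[is_unformatted]
clean = meta[~is_unformatted]

print(len(unformatted))
print(len(clean))
```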

In [None]:
# what the unformatted rows look like
df4.iloc[0]