In [None]:
# --------------------------------------Assignment Report and Code ----------------------------------------------------
### Part 1.1: Describe the dataset
# The product category I selected is Video Games, which comes in two variations: the 5-Core reviews dataset and the all-reviews dataset. The 5-Core dataset contains only reviews from users and products with at least five reviews each, while the all-reviews dataset contains every review entry and is therefore much more diverse. The all-reviews dataset has 1324753 reviews in total, whereas the 5-Core dataset contains only 231780. There is also a metadata file; it is not a variation of the reviews but a separate set of information describing each product.

# Only the 5-Core reviews dataset will be used in this analysis, for two reasons. First, video games are a highly competitive market with many online stores. Amazon does not have its own social platform for gaming (unlike Steam, Origin or the Epic Games Store), which makes it less popular for games in general. Many people do buy video games from Amazon during sales, but these customers are unlikely to leave a review. A user who has written multiple reviews, by contrast, is more likely to buy video games from Amazon regularly, so I assume that their reviews are generally more reliable than those of one-off customers.

# Second, many games are tied to a specific platform, and people usually buy them through that platform's own store, so far fewer copies are sold on Amazon. Products with fewer than five reviews are therefore likely to hold a very small market share and do not reflect trends within the category.
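To make the 5-Core idea concrete, the following is a minimal sketch of how such a subset could be derived from a reviews table. It is not the procedure used to build the published dataset, which is downloaded ready-made; the function name `k_core` and the toy data are hypothetical, and the only assumption is that the table has `reviewerID` and `asin` columns as described in this report. The filter is repeated because dropping a sparse product can push a user below the threshold, and vice versa.

```python
import pandas as pd

def k_core(reviews: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Keep only reviews whose user AND product both have at least k reviews."""
    current = reviews
    while True:
        # Number of reviews per user / per product, aligned to each row
        user_counts = current['reviewerID'].map(current['reviewerID'].value_counts())
        item_counts = current['asin'].map(current['asin'].value_counts())
        kept = current[(user_counts >= k) & (item_counts >= k)]
        if len(kept) == len(current):
            return kept  # nothing dropped in this pass, so the result is stable
        current = kept

# Hypothetical toy data, filtered with k=2 to keep the example small
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u2', 'u3', 'u3'],
    'asin':       ['p1', 'p2', 'p1', 'p2', 'p1', 'p3'],
})
core = k_core(toy, k=2)
print(core)
```

In the toy data, product `p3` has a single review and is dropped in the first pass; that leaves user `u3` with one review, so a second pass removes `u3` as well.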

# The reviews and metadata can be linked together using the Amazon Standard Identification Number (ASIN) unique to each product. 
# Both variations of the reviews dataset use reviewerID and asin as identifiers; the other attributes are reviewerName, helpful, reviewText, overall, summary, unixReviewTime, and reviewTime. The metadata uses asin as the identifier and contains description, price, imUrl, related, salesRank, categories, title, and brand. From the product data I will use asin, description, price, related, salesRank, title, and brand, while every attribute of the review data will be used in this analysis. Only video game products will be included in the analysis, meaning that electronics and gaming equipment will be filtered out for this report.
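As an illustration of linking the two tables through the ASIN, here is a small sketch using `DataFrame.merge`. The miniature tables `reviews_toy` and `meta_toy` and their values are hypothetical stand-ins, not rows from the dataset; only the column names follow the attributes listed above.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the review and metadata tables
reviews_toy = pd.DataFrame({
    'reviewerID': ['A1', 'A2', 'A3'],
    'asin': ['P1', 'P1', 'P2'],
    'overall': [5.0, 3.0, 4.0],
})
meta_toy = pd.DataFrame({
    'asin': ['P1', 'P2', 'P3'],
    'title': ['Game One', 'Game Two', 'Game Three'],
    'price': [19.99, None, 5.00],
})

# Left join: every review is kept, and product attributes are attached
# wherever the ASIN is found in the metadata. validate='many_to_one'
# raises an error if an ASIN appears more than once in the metadata.
merged = reviews_toy.merge(meta_toy, on='asin', how='left', validate='many_to_one')
print(merged[['reviewerID', 'asin', 'overall', 'title']])
```

A left join is chosen here so that reviews are never silently lost when a product has no metadata entry; an inner join would be the alternative if only fully described products are wanted.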

In [None]:
### Part 1.2: Describe the steps you used for data preparation and preprocessing
I used the parsing code provided by the dataset's source to load each downloaded file into a DataFrame, then saved each DataFrame as an Apache Parquet file using the DataFrame.to_parquet method in pandas. The implementation is in the code cells that follow.

In [1]:
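# Import libraries and define parse functions for the dataset
import ast
import gzip

import pandas as pd

def parse(path):
    """Yield one record (a dict) per line of a gzipped file."""
    # Each line is a Python dict literal; ast.literal_eval parses it
    # without executing arbitrary code, unlike eval.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)

def getDF(path):
    """Load every record in the file into a DataFrame indexed 0..n-1."""
    records = dict(enumerate(parse(path)))
    return pd.DataFrame.from_dict(records, orient='index')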

In [24]:
#generate DataFrame from 5-core reviews and save to parquet
fiveCoreDF = getDF('reviews_Video_Games_5.json.gz')
fiveCoreDF.to_parquet('fiveCore.parquet.gzip', compression='gzip')


In [25]:
#generate DataFrame from products metadata and save to parquet
meta = getDF('meta_Video_Games.json.gz')
meta.to_parquet('metadata.parquet.gzip', compression='gzip')

In [26]:
#generate DataFrame from all category reviews and save to parquet
allDF = getDF('reviews_Video_Games.json.gz')
allDF.to_parquet('all.parquet.gzip', compression='gzip')

In [15]:
#Read metadata from parquet and store it in a variable
metadata = pd.read_parquet('metadata.parquet.gzip')
metadata.head(10)

Unnamed: 0,asin,description,price,imUrl,related,salesRank,categories,title,brand
0,0078764343,Brand new sealed!,37.98,http://ecx.images-amazon.com/images/I/513h6dPb...,"{'also_bought': ['B000TI836G', 'B003Q53VZC', '...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, Xbox 360, Games]]",,
1,043933702X,In Stock NOW. Eligible for FREE Super Saving ...,23.5,http://ecx.images-amazon.com/images/I/61KKRndV...,"{'also_bought': None, 'also_viewed': ['B000067...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, PC, Games]]",,
2,0439339987,Grandma Groupers kelp seeds are missing and wi...,8.95,http://ecx.images-amazon.com/images/I/416QZg89...,"{'also_bought': ['B000314VVU', 'B000PXUOTE', '...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, PC, Games]]",,
3,0439342260,This software is BRAND NEW. Packaging may diff...,,http://ecx.images-amazon.com/images/I/61Wvu-Uj...,"{'also_bought': None, 'also_viewed': ['0439343...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, PC, Games]]",,
4,0439339960,a scholastic clubs fairs cd rom game,,http://ecx.images-amazon.com/images/I/51k3oRCF...,"{'also_bought': None, 'also_viewed': ['B00028D...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, PC, Games]]",,
5,0439374391,CD-ROM: Thomas & Friends: The Great Festival A...,20.0,http://ecx.images-amazon.com/images/I/21MTRNJY...,"{'also_bought': None, 'also_viewed': ['B00005Y...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, PC, Games]]",,
6,0439394422,"Product that encourages families to learn, exp...",12.96,http://ecx.images-amazon.com/images/I/51Zx2bIw...,"{'also_bought': ['B0002667BI', 'B00005JKTY', '...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, Mac, Games], [Video Games, PC, ...",,
7,043940133X,Your ship has crash-landed on the planet Tatoo...,30.0,http://ecx.images-amazon.com/images/I/51Gh39kq...,"{'also_bought': None, 'also_viewed': ['0545077...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, Mac, Games], [Video Games, PC, ...",,
8,0439573947,3 Great games in one box set!,10.79,http://ecx.images-amazon.com/images/I/514%2B24...,"{'also_bought': ['B000RLM19A', 'B003R79HFW', '...","{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, PC, Games]]",,
9,0439591295,cartoon network,,http://ecx.images-amazon.com/images/I/51XQ-nAM...,,"{'Arts, Crafts & Sewing': None, 'Beauty': None...","[[Video Games, PC, Games]]",,


In [None]:
#Read full review data from parquet and store it in a variable
allReviews = pd.read_parquet('all.parquet.gzip')
allReviews.head(10)

In [16]:
#Read 5-Core Reviews data from parquet and store it in a variable
fiveCoreReviews = pd.read_parquet('fiveCore.parquet.gzip')
fiveCoreReviews.head(10)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2HD75EMZR8QLN,700099867,123,"[8, 12]",Installing the game was a struggle (because of...,1.0,Pay to unlock content? I don't think so.,1341792000,"07 9, 2012"
1,A3UR8NLLY1ZHCX,700099867,"Alejandro Henao ""Electronic Junky""","[0, 0]",If you like rally cars get this game you will ...,4.0,Good rally game,1372550400,"06 30, 2013"
2,A1INA0F5CWW3J4,700099867,"Amazon Shopper ""Mr.Repsol""","[0, 0]",1st shipment received a book instead of the ga...,1.0,Wrong key,1403913600,"06 28, 2014"
3,A1DLMTOTHQ4AST,700099867,ampgreen,"[7, 10]","I got this version instead of the PS3 version,...",3.0,"awesome game, if it did not crash frequently !!",1315958400,"09 14, 2011"
4,A361M14PU2GUEG,700099867,"Angry Ryan ""Ryan A. Forrest""","[2, 2]",I had Dirt 2 on Xbox 360 and it was an okay ga...,4.0,DIRT 3,1308009600,"06 14, 2011"
5,A2UTRVO4FDCBH6,700099867,A.R.G.,"[0, 0]","Overall this is a well done racing game, with ...",4.0,"Good racing game, terrible Windows Live Requir...",1368230400,"05 11, 2013"
6,AN3YYDZAS3O1Y,700099867,Bob,"[11, 13]",Loved playing Dirt 2 and I thought the graphic...,5.0,A step up from Dirt 2 and that is terrific!,1313280000,"08 14, 2011"
7,AQTC623NCESZW,700099867,Chesty Puller,"[1, 4]",I can't tell you what a piece of dog**** this ...,1.0,Crash 3 is correct name AKA Microsoft,1353715200,"11 24, 2012"
8,A1QJJU33VNC4S7,700099867,D@rkFX,"[0, 1]",I initially gave this one star because it was ...,4.0,A great game ruined by Microsoft's account man...,1352851200,"11 14, 2012"
9,A2JLT2WY0F2HVI,700099867,D. Sweetapple,"[1, 1]",I still haven't figured this one out. Did ever...,2.0,Couldn't get this one to work,1391817600,"02 8, 2014"


In [59]:
# Column-wise non-null counts for the all-reviews and 5-Core datasets
print("All reviews:", allReviews.count())
print("5-Core reviews:", fiveCoreReviews.count())
# The total number of reviews in the all-reviews dataset
print("All reviews:", allReviews['reviewerID'].size)
# The total number of unique users with at least one review
print("All users:", allReviews['reviewerID'].unique().size)

# Some reviewerName values are missing: every other column has 1324753 entries, but reviewerName has only 1298568



All reviews: reviewerID        1324753
asin              1324753
reviewerName      1298568
helpful           1324753
reviewText        1324753
overall           1324753
summary           1324753
unixReviewTime    1324753
reviewTime        1324753
dtype: int64
5-Core reviews: reviewerID        231780
asin              231780
reviewerName      228967
helpful           231780
reviewText        231780
overall           231780
summary           231780
unixReviewTime    231780
reviewTime        231780
dtype: int64
All reviews: 1324753
All users: 826767
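
The lower count for reviewerName arises because `count()` skips missing values. A short sketch of reading the same three quantities directly is given below; the table `toy` is hypothetical and stands in for `allReviews`.

```python
import pandas as pd

# Hypothetical stand-in for the reviews table
toy = pd.DataFrame({
    'reviewerID': ['A1', 'A2', 'A2', 'A3'],
    'reviewerName': ['Ann', None, 'Bob', None],
})

n_reviews = len(toy)                           # rows, including those with missing fields
n_users = toy['reviewerID'].nunique()          # distinct reviewers
n_missing = toy['reviewerName'].isna().sum()   # rows with no reviewer name

print(n_reviews, n_users, n_missing)
```

Applied to the real data, the same `isna().sum()` call on reviewerName would be expected to equal the difference between the row count and the reviewerName count shown above.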


In [58]:
# Inspect the category paths of each product.
# Series.unique() is not called here: each entry is a nested list of category paths, which is unhashable.
metadata.categories

0                         [[Video Games, Xbox 360, Games]]
1                               [[Video Games, PC, Games]]
2                               [[Video Games, PC, Games]]
3                               [[Video Games, PC, Games]]
4                               [[Video Games, PC, Games]]
                               ...                        
50948    [[Video Games, Digital Games, Casual Games], [...
50949    [[Video Games, PC, Accessories, Gaming Keyboar...
50950    [[Video Games, PC, Accessories, Gaming Keyboar...
50951    [[Video Games, Digital Games, Casual Games], [...
50952    [[Video Games, Digital Games, Casual Games], [...
Name: categories, Length: 50953, dtype: object
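
Since each product carries a list of category paths, filtering out gaming equipment can be done with a small predicate applied per row. The sketch below is one possible rule, not the rule used in this report: the function `is_game`, the choice of `'Accessories'` as the excluded label, and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

def is_game(category_paths, excluded=('Accessories',)):
    """Return True if none of the product's category paths contains an excluded label."""
    if category_paths is None:
        return False  # assumption: products without categories are left out
    return not any(label in excluded
                   for path in category_paths
                   for label in path)

# Hypothetical rows shaped like the categories column above
meta_toy = pd.DataFrame({
    'asin': ['P1', 'P2', 'P3'],
    'categories': [
        [['Video Games', 'PC', 'Games']],
        [['Video Games', 'PC', 'Accessories', 'Controllers']],
        [['Video Games', 'Digital Games', 'Casual Games']],
    ],
})

games_only = meta_toy[meta_toy['categories'].apply(is_game)]
print(games_only['asin'].tolist())
```

The `excluded` tuple can be extended with further labels once the full set of category paths in the metadata has been inspected.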