# Introduction to data

The dataset consists of product reviews and metadata of products in the "Grocery and Gourmet Food" category on Amazon and was collected by researchers at UCSD.

The original reviews dataset contains 142.8 million reviews spanning May 1996 to July 2014. However, we use only a subset in which every user and every item has at least 5 reviews in the "Grocery and Gourmet Food" category, called the food 5-core. The food 5-core dataset is 108 MB and contains 151,254 reviews, each with 9 attributes:

- <code>reviewerID</code> - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- <code>asin</code> - ID of the product, e.g. 0000013714
- <code>reviewerName</code> - name of the reviewer
- <code>helpful</code> - helpfulness rating of the review, e.g. 2/3
- <code>reviewText</code> - text of the review
- <code>overall</code> - rating of the product
- <code>summary</code> - summary of the review
- <code>unixReviewTime</code> - time of the review (unix time)
- <code>reviewTime</code> - time of the review (raw)
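
As a small illustration of the time and helpfulness fields, the sketch below works on a single made-up review record. The record values are illustrative, and the layout of <code>helpful</code> as a pair of [helpful votes, total votes] is an assumption rather than something shown in this notebook.

```python
from datetime import datetime, timezone

# Hypothetical review record; field names follow the list above,
# the values are made up for illustration.
review = {
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "helpful": [2, 3],             # assumed layout: [helpful votes, total votes]
    "overall": 5.0,
    "unixReviewTime": 1370304000,  # seconds since 1970-01-01 UTC
}

# Convert the unix timestamp to a timezone-aware datetime.
review_date = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc)

# Fraction of votes that found the review helpful (None if nobody voted).
helpful_votes, total_votes = review["helpful"]
helpful_ratio = helpful_votes / total_votes if total_votes else None

print(review_date.date(), helpful_ratio)
```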

The metadata dataset contains metadata for 171,761 products in the "Grocery and Gourmet Food" category and is 182 MB. Each product has 9 attributes:

- <code>asin</code> - ID of the product, e.g. 0000031852
- <code>title</code> - name of the product
- <code>description</code> - description of the product
- <code>price</code> - price in US dollars (at time of crawl)
- <code>imUrl</code> - URL of the product image
- <code>related</code> - related products (also bought, also viewed, bought together, buy after viewing)
- <code>salesRank</code> - sales rank information
- <code>brand</code> - brand name
- <code>categories</code> - list of categories the product belongs to

# Project Idea

:D

# Explore data

In [1]:
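import ast
import gzip
import json

import pandas as pd


def parse(path):
    """Yield one record per line of a gzipped file of JSON objects."""
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)


def getDF(path):
    """Load the records from parse() into a DataFrame indexed 0..n-1."""
    return pd.DataFrame.from_dict(dict(enumerate(parse(path))), orient='index')


def parse2(path):
    """Yield one record per line of a gzipped file of Python dict literals."""
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            # literal_eval parses the same literals that eval would,
            # without executing arbitrary code found in the file
            yield ast.literal_eval(line)


def getDF2(path):
    """Load the records from parse2() into a DataFrame indexed 0..n-1."""
    return pd.DataFrame.from_dict(dict(enumerate(parse2(path))), orient='index')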

In [2]:
review_df = getDF('data/reviews_Grocery_and_Gourmet_Food_5.json.gz')
# review_df

In [3]:
product_df = getDF2('data/meta_Grocery_and_Gourmet_Food.json.gz')
# product_df
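
The two loaders differ only in how each line is parsed. A plausible reason, stated here as an assumption since the raw files are not shown, is that the metadata lines are Python dict literals (single-quoted strings) rather than strict JSON. The sketch below uses a made-up line to show the difference; <code>ast.literal_eval</code> parses such literals without executing code, which plain <code>eval</code> would do.

```python
import ast
import json

# Made-up line in Python-literal style (single quotes); not taken from the data.
line = "{'asin': '0000031852', 'price': 3.99, 'related': {'also_viewed': ['B001GE8N4Y']}}"

# Strict JSON requires double-quoted strings, so json.loads rejects this line.
try:
    json.loads(line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval only accepts literals (dicts, lists, strings, numbers, ...),
# so it cannot run code embedded in a line.
record = ast.literal_eval(line)

print(parsed_as_json, record["related"]["also_viewed"])
```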

Verify that every user and every product indeed has at least 5 reviews.

In [4]:
import numpy as np
from collections import Counter

In [5]:
asin_counts = Counter(review_df['asin'])
reviewerID_counts = Counter(review_df['reviewerID'])
# the smallest count per product and per reviewer should both be at least 5
print(min(asin_counts.values()), min(reviewerID_counts.values()))

# Preprocess

In [6]:
product_df

Unnamed: 0,asin,description,title,imUrl,related,salesRank,categories,price,brand
0,0657745316,This is real vanilla extract made with only 3 ...,100 Percent All Natural Vanilla Extract,http://ecx.images-amazon.com/images/I/41gFi5h0...,{'also_viewed': ['B001GE8N4Y']},{'Grocery & Gourmet Food': 374004},[[Grocery & Gourmet Food]],,
1,0700026444,"Silverpot Tea, Pure Darjeeling, is an exquisit...",Pure Darjeeling Tea: Loose Leaf,http://ecx.images-amazon.com/images/I/51hs8sox...,,{'Grocery & Gourmet Food': 620307},[[Grocery & Gourmet Food]],,
2,1403796890,Must have for any WWE Fan\n \n \n \nFeaturing ...,WWE Kids Todler Velvet Slippers featuring John...,http://ecx.images-amazon.com/images/I/518SEST5...,,,[[Grocery & Gourmet Food]],3.99,
3,141278509X,Infused with Vitamins and Electrolytes Good So...,Archer Farms Strawberry Dragonfruit Drink Mix ...,http://ecx.images-amazon.com/images/I/51CFQIis...,{'also_viewed': ['B0051IETTY']},{'Grocery & Gourmet Food': 620322},[[Grocery & Gourmet Food]],,
4,1453060375,MiO Energy is your portable energy source givi...,Mio Energy Liquid Water Enhancer Black Cherry ...,http://ecx.images-amazon.com/images/I/51EUsMcn...,"{'also_viewed': ['B006MSEOJ2', 'B005VOOQLO', '...",{'Grocery & Gourmet Food': 268754},[[Grocery & Gourmet Food]],11.99,Mio
...,...,...,...,...,...,...,...,...,...
171755,B00LDXFI6Y,Nescafe Cafe Viet is extracted from the aromat...,Nescafe Cafe Viet Vietnamese Sweetened Instant...,http://ecx.images-amazon.com/images/I/51qAGS7j...,{'also_viewed': ['B000DN8EZW']},,[[Grocery & Gourmet Food]],17.99,
171756,B00LMMLRG6,Moon Cheese Snacks Moon Cheese High in protein...,"Moon Cheese, 2 Oz. Pack of Three (Assortment)",http://ecx.images-amazon.com/images/I/419FO438...,{'also_viewed': ['B000UPFWW6']},{'Grocery & Gourmet Food': 54090},[[Grocery & Gourmet Food]],16.95,
171757,B00LOXAZ1Q,Sour Punch candy is the brand of mouth waterin...,"Sour Punch Blue Raspberry Bite, 5 Ounce Bag --...",http://ecx.images-amazon.com/images/I/31Cj3cHD...,,{'Grocery & Gourmet Food': 133517},[[Grocery & Gourmet Food]],16.55,
171758,B00LOZ7F0S,"Our Vanilla Extract made from\nPremium, Organi...",Organic Mexican Vanilla,http://ecx.images-amazon.com/images/I/11iORwy7...,,,[[Grocery & Gourmet Food]],,


In [7]:
# set of reviewed ASINs, for fast membership tests
asin_review_list = set(review_df['asin'])

related_products = []  # filtered related ASINs, one list per kept product
rel = {}               # the same lists, keyed by the product's row position
idx = []               # row positions of the products to drop
# loop over all products in product_df
for i in range(len(product_df)):
    row = product_df.iloc[i]
    related = row['related']

    # keep the product only if it is in the review df and has related items
    # (a missing 'related' value is NaN rather than a dict)
    if row['asin'] in asin_review_list and isinstance(related, dict):
        # loop over every key (also viewed, also bought, etc.) and keep
        # only the related products that are also in review_df
        related_prod = [prod
                        for key in related
                        for prod in related[key]
                        if prod in asin_review_list]
        rel[i] = related_prod
        related_products.append(related_prod)
    else:
        idx.append(i)

In [10]:
# remove indices in original df
product_df2 = product_df.drop(idx)
# make a new column 
product_df3 = product_df2.assign(related_products = related_products)
# explode column related_products
product_df3 = product_df3.explode('related_products')
product_df3

Unnamed: 0,asin,description,title,imUrl,related,salesRank,categories,price,brand,related_products
17,616719923X,Green Tea Flavor Kit Kat have quickly become t...,Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,http://ecx.images-amazon.com/images/I/51LdEao6...,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",{'Grocery & Gourmet Food': 37305},[[Grocery & Gourmet Food]],,,B0047YG5UY
17,616719923X,Green Tea Flavor Kit Kat have quickly become t...,Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,http://ecx.images-amazon.com/images/I/51LdEao6...,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",{'Grocery & Gourmet Food': 37305},[[Grocery & Gourmet Food]],,,B0002IZD02
17,616719923X,Green Tea Flavor Kit Kat have quickly become t...,Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,http://ecx.images-amazon.com/images/I/51LdEao6...,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",{'Grocery & Gourmet Food': 37305},[[Grocery & Gourmet Food]],,,B004N8LMFM
17,616719923X,Green Tea Flavor Kit Kat have quickly become t...,Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,http://ecx.images-amazon.com/images/I/51LdEao6...,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",{'Grocery & Gourmet Food': 37305},[[Grocery & Gourmet Food]],,,B0009F8JRC
17,616719923X,Green Tea Flavor Kit Kat have quickly become t...,Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,http://ecx.images-amazon.com/images/I/51LdEao6...,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",{'Grocery & Gourmet Food': 37305},[[Grocery & Gourmet Food]],,,B004HU7TC6
...,...,...,...,...,...,...,...,...,...,...
171232,B00K00H9I6,Harvested from the iconic snowy woods of Quebe...,Canadian Finest Maple Syrup - 100% Pure Certif...,http://ecx.images-amazon.com/images/I/41abh7Ho...,"{'also_bought': ['B005P0LW66', 'B00JEKYNZA', '...",{'Grocery & Gourmet Food': 1500},[[Grocery & Gourmet Food]],18.95,,B008RVURA2
171232,B00K00H9I6,Harvested from the iconic snowy woods of Quebe...,Canadian Finest Maple Syrup - 100% Pure Certif...,http://ecx.images-amazon.com/images/I/41abh7Ho...,"{'also_bought': ['B005P0LW66', 'B00JEKYNZA', '...",{'Grocery & Gourmet Food': 1500},[[Grocery & Gourmet Food]],18.95,,B002483TSQ
171232,B00K00H9I6,Harvested from the iconic snowy woods of Quebe...,Canadian Finest Maple Syrup - 100% Pure Certif...,http://ecx.images-amazon.com/images/I/41abh7Ho...,"{'also_bought': ['B005P0LW66', 'B00JEKYNZA', '...",{'Grocery & Gourmet Food': 1500},[[Grocery & Gourmet Food]],18.95,,B000LKXNG2
171486,B00KC0LGI8,,"Betty Crocker Dry Meals Suddenly Grain Salad, ...",http://ecx.images-amazon.com/images/I/61zqxqJi...,"{'also_viewed': ['B00KSKIHVG', 'B00JWWM1T0', '...",{'Grocery & Gourmet Food': 97624},[[Grocery & Gourmet Food]],,,
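
The last row above has no <code>related_products</code> value. This is most likely a product whose filtered list was empty: <code>explode</code> keeps such a row and fills in a missing value instead of dropping it. A minimal sketch on a made-up three-product frame (all ASINs are placeholders):

```python
import pandas as pd

# Toy frame; 'P1'..'P3' and 'R1'..'R3' are placeholder ASINs.
toy = pd.DataFrame({"asin": ["P1", "P2", "P3"]})
toy = toy.assign(related_products=[["R1", "R2"], [], ["R3"]])

# One row per (product, related product); the index label is repeated.
exploded = toy.explode("related_products")

# A product whose list was empty survives as a single row with NaN ...
n_missing = exploded["related_products"].isna().sum()

# ... and can be removed if only real product pairs are wanted.
pairs = exploded.dropna(subset=["related_products"])

print(len(exploded), n_missing, len(pairs))
```

Whether to keep or drop the rows with a missing value depends on whether products without any reviewed related item should remain in the analysis.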


Check that the correct indices were dropped :)

In [11]:
a = set(list(product_df2.index))
b = set(list(rel.keys()))

In [12]:
a - a.intersection(b)

set()
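
The empty set confirms that every index kept in <code>product_df2</code> is also a key of <code>rel</code>. The expression is equivalent to <code>a - b</code> and checks one direction only; a symmetric difference would also catch keys of <code>rel</code> that are missing from the frame. A small sketch with made-up index values:

```python
# Made-up index sets for illustration.
a = {17, 42, 99}        # e.g. indices kept in the filtered frame
b = {17, 42, 99, 123}   # e.g. keys of the lookup dict, with one extra entry

one_way = a - a.intersection(b)   # same as a - b: elements of a missing from b
both_ways = a ^ b                 # symmetric difference: mismatches in either direction

print(one_way, both_ways)
```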