# Product Review Embeddings
## Author: Luis Eduardo Ferro Diez <a href="mailto:luis.ferro1@correo.icesi.edu.co">luis.ferro1@correo.icesi.edu.co</a>

This notebook contains an exploratory analysis of the Amazon product review dataset. The dataset will serve as the ground truth for the semantic representation of the product categories that we later want to use to characterize geographic zones.

### Resources
* Dataset: http://jmcauley.ucsd.edu/data/amazon/

We will use the product reviews (aggressively deduplicated data) and the product metadata to build the semantic space.

In [4]:
metadata_path = "/media/ohtar10/Adder-Storage/datasets/amazon_products/reviews/metadata.json.gz"

We follow the reading instructions from the dataset web page, because the JSON objects are not strict and cause problems when read directly with pandas.
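
To illustrate why a strict JSON parser rejects such records, the sketch below parses a made-up line written as a Python dict literal with single-quoted strings. The sample line is a hypothetical record for illustration, not one taken from the dataset, and `ast.literal_eval` is used here as a safer stand-in for `eval`.

```python
import ast
import json

# Hypothetical record in the style of the metadata file: a Python dict
# literal with single-quoted strings, which is not valid JSON.
line = "{'asin': '0000000000', 'categories': [['Books']]}"

try:
    json.loads(line)
    strict_json_ok = True
except json.JSONDecodeError:
    # Strict JSON requires double-quoted property names and strings.
    strict_json_ok = False

# literal_eval only accepts Python literals, so unlike eval it cannot run code.
record = ast.literal_eval(line)

# Re-serializing with json.dumps produces strict JSON that any parser accepts.
strict_line = json.dumps(record)
```

Once a record has been re-serialized this way, it can be read back with `json.loads` or any other strict JSON reader.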

Also, since the file is too large to fit into memory, we need a mechanism that reads it in chunks.

In [24]:
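import ast
import gzip
import json
from itertools import islice

import pandas as pd

def read_file(path, lines=100):
    """Yield the first `lines` records of the gzipped file as dicts."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for l in islice(f, lines):
            # Records are Python dict literals, not strict JSON;
            # ast.literal_eval parses them without executing code, unlike eval.
            yield ast.literal_eval(l)

def read_file_chunks(path, chunksize=1000, line_number=9430088):
    """Yield every record as a strict JSON string, reading chunksize lines at a time."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for _ in range(0, line_number, chunksize):
            # islice(f, chunksize) takes the next chunksize lines from the
            # current file position; passing a start index here would skip data.
            for l in islice(f, chunksize):
                yield json.dumps(ast.literal_eval(l))

def generate_head_df(path, lines=100):
    """Build a DataFrame from the first `lines` records, indexed by row number."""
    df = dict(enumerate(read_file(path, lines)))
    return pd.DataFrame.from_dict(df, orient='index')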
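
As a complement to the functions above, here is a minimal sketch of a reader that yields whole chunks instead of single records, which can be convenient when each chunk should be processed as a unit. The name `iter_chunks` is hypothetical, and the sketch assumes every line of the file is a Python dict literal that `ast.literal_eval` can parse.

```python
import ast
import gzip
from itertools import islice


def iter_chunks(path, chunksize=1000):
    """Yield lists of at most `chunksize` parsed records from a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        while True:
            # islice pulls the next `chunksize` lines from the open handle,
            # so only one chunk is held in memory at a time.
            chunk = [ast.literal_eval(line) for line in islice(f, chunksize)]
            if not chunk:
                # An empty chunk means the end of the file was reached.
                break
            yield chunk
```

Each chunk is a list of dicts, so it can be turned into a DataFrame with `pd.DataFrame(chunk)` and aggregated before the next chunk is read. This version stops at the end of the file, so it does not need the total line count up front.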

In [25]:
metadata_df = generate_head_df(metadata_path, 10)
metadata_df.head()

| | asin | salesRank | imUrl | categories | title | description | price | related | brand |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1048791 | {'Books': 6334800} | http://ecx.images-amazon.com/images/I/51MKP0T4... | [[Books]] | The Crucible: Performed by Stuart Pankin, Jero... | | | | |
| 1 | 143561 | {'Movies & TV': 376041} | http://g-ecx.images-amazon.com/images/G/01/x-s... | [[Movies & TV, Movies]] | Everyday Italian (with Giada de Laurentiis), V... | 3Pack DVD set - Italian Classics, Parties and ... | 12.99 | {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '... | |
| 2 | 37214 | {'Clothing': 1233557} | http://ecx.images-amazon.com/images/I/31mCncNu... | [[Clothing, Shoes & Jewelry, Girls], [Clothing... | Purple Sequin Tiny Dancer Tutu Ballet Dance Fa... | | 6.99 | {'also_viewed': ['B00JO8II76', 'B00DGN4R1Q', '... | Big Dreams |
| 3 | 32069 | | http://ecx.images-amazon.com/images/I/51EzU6qu... | [[Sports & Outdoors, Other Sports, Dance, Clot... | Adult Ballet Tutu Cheetah Pink | | 7.89 | {'also_bought': ['0000032050', 'B00D0DJAEG', '... | BubuBibi |
| 4 | 31909 | {'Toys & Games': 201847} | http://ecx.images-amazon.com/images/I/41xBoP0F... | [[Sports & Outdoors, Other Sports, Dance]] | Girls Ballet Tutu Neon Pink | High quality 3 layer ballet tutu. 12 inches in... | 7.0 | {'also_bought': ['B002BZX8Z6', 'B00JHONN1S', '... | Unknown |


In [35]:
from itertools import chain

categories = list(chain.from_iterable(metadata_df.categories.values))
categories

[['Books'],
 ['Movies & TV', 'Movies'],
 ['Clothing, Shoes & Jewelry', 'Girls'],
 ['Clothing, Shoes & Jewelry',
  'Novelty, Costumes & More',
  'Costumes & Accessories',
  'More Accessories',
  'Kids & Baby'],
 ['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts'],
 ['Sports & Outdoors', 'Other Sports', 'Dance'],
 ['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts'],
 ['Movies & TV', 'Movies'],
 ['Books'],
 ['Sports & Outdoors', 'Other Sports', 'Dance'],
 ['Books']]

In [39]:
def flatten(l):
    """Flatten one level of nesting: a list of lists becomes a flat list."""
    return [item for sublist in l for item in sublist]

# Use set to eliminate duplicates
categories_flat = set(flatten(categories))
categories_flat

{'Books',
 'Clothing',
 'Clothing, Shoes & Jewelry',
 'Costumes & Accessories',
 'Dance',
 'Girls',
 'Kids & Baby',
 'More Accessories',
 'Movies',
 'Movies & TV',
 'Novelty, Costumes & More',
 'Other Sports',
 'Skirts',
 'Sports & Outdoors'}
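
Flattening into a set discards both the hierarchy and the frequency of each category. As a hedged sketch of one way to keep some of that information, the example below counts only the first element of each category path. It assumes each inner list runs from the most general category to the most specific one, and the variable names are hypothetical. The sample paths are a few of those shown in the earlier output.

```python
from collections import Counter

# A few category paths taken from the output above; each inner list is
# assumed to be ordered from general to specific.
category_paths = [
    ['Books'],
    ['Movies & TV', 'Movies'],
    ['Clothing, Shoes & Jewelry', 'Girls'],
    ['Sports & Outdoors', 'Other Sports', 'Dance'],
    ['Books'],
]

# Keep only the first element of each path and count how often it appears.
top_level_counts = Counter(path[0] for path in category_paths)
```

Calling `top_level_counts.most_common()` then lists the top-level categories from most to least frequent.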