# Product Review Embeddings
## Author: Luis Eduardo Ferro Diez <a href="mailto:luis.ferro1@correo.icesi.edu.co">luis.ferro1@correo.icesi.edu.co</a>

This notebook contains an exploratory analysis of the Amazon product review dataset. The dataset will serve as ground truth for the semantic representation of the product categories, which we later want to use to characterize geographic zones.

### Resources
* Dataset: http://jmcauley.ucsd.edu/data/amazon/

We are going to use the product reviews (aggressively deduplicated data) and the product metadata to build the semantic space.

In [1]:
metadata_path = "/media/ohtar10/Adder-Storage/datasets/amazon_products/reviews/metadata.json.gz"

We follow the reading instructions from the dataset web page, since the JSON objects are not strict and cause problems when read directly with pandas.

Also, since the file is too large to fit into memory, we need a mechanism that can read it in chunks.

In [2]:
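import pandas as pd
import ast
import gzip
import json
from itertools import islice

def read_file(path, lines=100):
    """Yield the first `lines` records of a gzip file as dicts."""
    # Text mode plus a context manager, so the file is closed when the generator finishes.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for l in islice(f, lines):
            # The lines are not strict JSON. Assuming each one is a plain Python
            # literal, literal_eval parses it without executing code as eval would.
            yield ast.literal_eval(l)

def generate_head_df(path, lines=100):
    """Build a DataFrame from the first `lines` records, indexed by line position."""
    df = {}
    for k, v in enumerate(read_file(path, lines)):
        df[k] = v
    return pd.DataFrame.from_dict(df, orient='index')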
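
The difference between strict JSON and the looser format mentioned above can be seen in a small sketch. The sample line is made up for illustration and only imitates a single-quoted record; `ast.literal_eval` is used here as a safer stand-in for `eval`, on the assumption that every line is a plain Python literal.

```python
import ast
import json

# Hypothetical line imitating the single-quoted style; not taken from the dataset.
sample_line = "{'asin': '0000000001', 'categories': [['Books']], 'price': 12.99}"

try:
    json.loads(sample_line)
    strict_json_ok = True
except json.JSONDecodeError:
    # Strict JSON requires double-quoted strings, so parsing fails here.
    strict_json_ok = False

# literal_eval only accepts Python literals (dicts, lists, strings, numbers),
# so it cannot run arbitrary code embedded in a line.
record = ast.literal_eval(sample_line)
print(strict_json_ok, record['categories'])
```

If a file turned out to contain JSON-only tokens such as `null` or `true`, this approach would fail and a different parser would be needed.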

In [22]:
metadata_df = generate_head_df(metadata_path, 1000)
metadata_df.head()

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
0,1048791,{'Books': 6334800},http://ecx.images-amazon.com/images/I/51MKP0T4...,[[Books]],"The Crucible: Performed by Stuart Pankin, Jero...",,,,
1,143561,{'Movies & TV': 376041},http://g-ecx.images-amazon.com/images/G/01/x-s...,"[[Movies & TV, Movies]]","Everyday Italian (with Giada de Laurentiis), V...","3Pack DVD set - Italian Classics, Parties and ...",12.99,"{'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '...",
2,37214,{'Clothing': 1233557},http://ecx.images-amazon.com/images/I/31mCncNu...,"[[Clothing, Shoes & Jewelry, Girls], [Clothing...",Purple Sequin Tiny Dancer Tutu Ballet Dance Fa...,,6.99,"{'also_viewed': ['B00JO8II76', 'B00DGN4R1Q', '...",Big Dreams
3,32069,,http://ecx.images-amazon.com/images/I/51EzU6qu...,"[[Sports & Outdoors, Other Sports, Dance, Clot...",Adult Ballet Tutu Cheetah Pink,,7.89,"{'also_bought': ['0000032050', 'B00D0DJAEG', '...",BubuBibi
4,31909,{'Toys & Games': 201847},http://ecx.images-amazon.com/images/I/41xBoP0F...,"[[Sports & Outdoors, Other Sports, Dance]]",Girls Ballet Tutu Neon Pink,High quality 3 layer ballet tutu. 12 inches in...,7.0,"{'also_bought': ['B002BZX8Z6', 'B00JHONN1S', '...",Unknown


We are interested in the `asin`, `categories`, `title` and `description` columns. Let's also check whether they contain null values.

In [27]:
columns = ['asin', 'categories', 'title', 'description']
metadata_df = metadata_df[columns]
metadata_df.isnull().sum()

asin             0
categories       2
title            0
description    622
dtype: int64

There are many null values in `description`, but we can cope with that because we also have the reviews. The null values in `categories`, however, won't serve our purpose, so let's drop those records. We must take care not to drop the records that merely lack a description.

In [34]:
metadata_df = metadata_df[metadata_df.categories.notnull()]
metadata_df.head()

Unnamed: 0,asin,categories,title,description
0,1048791,[[Books]],"The Crucible: Performed by Stuart Pankin, Jero...",
1,143561,"[[Movies & TV, Movies]]","Everyday Italian (with Giada de Laurentiis), V...","3Pack DVD set - Italian Classics, Parties and ..."
2,37214,"[[Clothing, Shoes & Jewelry, Girls], [Clothing...",Purple Sequin Tiny Dancer Tutu Ballet Dance Fa...,
3,32069,"[[Sports & Outdoors, Other Sports, Dance, Clot...",Adult Ballet Tutu Cheetah Pink,
4,31909,"[[Sports & Outdoors, Other Sports, Dance]]",Girls Ballet Tutu Neon Pink,High quality 3 layer ballet tutu. 12 inches in...


In [35]:
metadata_df.isnull().sum()

asin             0
categories       0
title            0
description    621
dtype: int64
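
The same filtering can also be expressed with `dropna` restricted to one column. The sketch below uses a made-up toy frame rather than the real metadata, and shows that rows are dropped only when `categories` is missing while missing descriptions are kept.

```python
import pandas as pd

# Toy frame with invented values: one record lacks categories, two lack a description.
toy = pd.DataFrame({
    'asin': ['0001', '0002', '0003'],
    'categories': [[['Books']], None, [['Movies & TV', 'Movies']]],
    'description': [None, 'some text', None],
})

# subset limits the null check to `categories`, so null descriptions are tolerated.
cleaned = toy.dropna(subset=['categories'])
null_counts = cleaned.isnull().sum()
print(null_counts)
```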

The categories are stored as nested lists (one list of category paths per product), so it is better to flatten them for easier usage.

In [37]:
from itertools import chain

categories = list(chain.from_iterable(metadata_df.categories.values))
categories[:10]

[['Books'],
 ['Movies & TV', 'Movies'],
 ['Clothing, Shoes & Jewelry', 'Girls'],
 ['Clothing, Shoes & Jewelry',
  'Novelty, Costumes & More',
  'Costumes & Accessories',
  'More Accessories',
  'Kids & Baby'],
 ['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts'],
 ['Sports & Outdoors', 'Other Sports', 'Dance'],
 ['Sports & Outdoors', 'Other Sports', 'Dance', 'Clothing', 'Girls', 'Skirts'],
 ['Movies & TV', 'Movies'],
 ['Books'],
 ['Sports & Outdoors', 'Other Sports', 'Dance']]

Now let's flatten them one more level and keep only the unique category names.

In [38]:
def flatten(nested):
    """Flatten one level of nesting."""
    return [item for sublist in nested for item in sublist]

# Use set to eliminate duplicates
categories_flat = set(flatten(categories))
categories_flat

{'Active',
 'Active Skirts',
 'Books',
 'CDs & Vinyl',
 'Christian',
 'Clothing',
 'Clothing, Shoes & Jewelry',
 'Costumes & Accessories',
 'Dance',
 'Girls',
 'Gospel',
 'Jigsaw Puzzles',
 'Kids & Baby',
 'More Accessories',
 'Movies',
 'Movies & TV',
 'Novelty, Costumes & More',
 'Other Sports',
 'Pop & Contemporary',
 'Praise & Worship',
 'Puzzles',
 'Skirts',
 'Sports & Outdoors',
 'Toys & Games'}
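
The two flattening passes can be followed more easily on a toy input. The values below are invented and only mirror the shape of the `categories` column.

```python
from itertools import chain

# Toy stand-in for the `categories` column: a list of category paths per product.
toy_categories = [
    [['Books']],
    [['Movies & TV', 'Movies']],
    [['Clothing', 'Girls'], ['Clothing', 'Costumes']],
]

# First pass: products -> category paths.
paths = list(chain.from_iterable(toy_categories))
# Second pass: category paths -> individual names; the set drops duplicates.
unique_names = set(chain.from_iterable(paths))
print(sorted(unique_names))
```

Note that the second pass discards the hierarchy: a name such as `Clothing` is kept once, regardless of which path it appeared in.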

Now let's extract the titles and descriptions for each category as separate fields (as arrays) so we can use them for later processing.

In [45]:
# Create a new field with a flattened version of the categories column for easier lookups.
# assign() returns a new DataFrame, so nothing is written into a slice of the original
# and pandas does not emit a SettingWithCopyWarning.
metadata_df = metadata_df.assign(
    categories_flattened=metadata_df.categories.apply(lambda cat: list(chain.from_iterable(cat)))
)
metadata_df.head()


Unnamed: 0,asin,categories,title,description,categories_flattened
0,1048791,[[Books]],"The Crucible: Performed by Stuart Pankin, Jero...",,[Books]
1,143561,"[[Movies & TV, Movies]]","Everyday Italian (with Giada de Laurentiis), V...","3Pack DVD set - Italian Classics, Parties and ...","[Movies & TV, Movies]"
2,37214,"[[Clothing, Shoes & Jewelry, Girls], [Clothing...",Purple Sequin Tiny Dancer Tutu Ballet Dance Fa...,,"[Clothing, Shoes & Jewelry, Girls, Clothing, S..."
3,32069,"[[Sports & Outdoors, Other Sports, Dance, Clot...",Adult Ballet Tutu Cheetah Pink,,"[Sports & Outdoors, Other Sports, Dance, Cloth..."
4,31909,"[[Sports & Outdoors, Other Sports, Dance]]",Girls Ballet Tutu Neon Pink,High quality 3 layer ballet tutu. 12 inches in...,"[Sports & Outdoors, Other Sports, Dance]"


The above modification allows us to search by category with a simple lambda.

In [46]:
books = metadata_df.categories_flattened.apply(lambda cl: 'Books' in cl)
metadata_df[books]

Unnamed: 0,asin,categories,title,description,categories_flattened
0,0001048791,[[Books]],"The Crucible: Performed by Stuart Pankin, Jero...",,[Books]
7,0001048775,[[Books]],Measure for Measure: Complete &amp; Unabridged,William Shakespeare is widely regarded as the ...,[Books]
9,0001048236,[[Books]],The Sherlock Holmes Audio Collection,"&#34;One thing is certain, Sherlockians, put a...",[Books]
10,0000401048,[[Books]],The rogue of publishers' row;: Confessions of ...,,[Books]
11,0001019880,[[Books]],Classic Soul Winner's New Testament Bible,,[Books]
...,...,...,...,...,...
995,0004707052,[[Books]],Collins English-Norwegian Dictionary,,[Books]
996,0004700481,[[Books]],Collins Gem Spanish Dictionary: Spanish-Englis...,"Text: Spanish, English",[Books]
997,0004707702,[[Books]],Collins Pocket French Dictionary,,[Books]
998,0004710304,[[Books]],Collins Easy Learning Italian Dictionary,,[Books]


Let's define a function to facilitate this:

In [49]:
def get_products(df: pd.DataFrame, category: str, field='categories_flattened') -> pd.DataFrame:
    """Return the rows of `df` whose `field` list contains `category`."""
    cat_indexes = df[field].apply(lambda cl: category in cl)
    return df[cat_indexes]

In [50]:
get_products(metadata_df, 'Movies')

Unnamed: 0,asin,categories,title,description,categories_flattened
1,0000143561,"[[Movies & TV, Movies]]","Everyday Italian (with Giada de Laurentiis), V...","3Pack DVD set - Italian Classics, Parties and ...","[Movies & TV, Movies]"
6,0000589012,"[[Movies & TV, Movies]]",Why Don't They Just Quit? DVD Roundtable Discu...,,"[Movies & TV, Movies]"
28,0000695009,"[[Movies & TV, Movies]]",Understanding Seizures and Epilepsy DVD,,"[Movies & TV, Movies]"
34,000107461X,"[[Movies & TV, Movies]]",Live in Houston [VHS],,"[Movies & TV, Movies]"
36,0000143529,"[[Movies & TV, Movies]]",My Fair Pastry (Good Eats Vol. 9),Disc 1: Flour Power (Scones; Shortcakes; South...,"[Movies & TV, Movies]"
45,0000143502,"[[Movies & TV, Movies]]",Rise and Swine (Good Eats Vol. 7),Rise and Swine (Good Eats Vol. 7) includes bon...,"[Movies & TV, Movies]"
65,0000143588,"[[Movies & TV, Movies]]","Barefoot Contessa (with Ina Garten), Entertain...",Barefoot Contessa Volume 2: On these three dis...,"[Movies & TV, Movies]"
66,0001517791,"[[Movies & TV, Movies]]",Praise Aerobics [VHS],Praise Aerobics - A low-intensity/high-intesit...,"[Movies & TV, Movies]"
73,0001527665,"[[Movies & TV, Movies]]",Peace Child [VHS],,"[Movies & TV, Movies]"
91,0001516035,"[[Movies & TV, Movies]]",Worship with Don Moen [VHS],Worship with Don Moen [VHS],"[Movies & TV, Movies]"
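
One detail worth noting is that the lookup tests list membership, so it matches whole category names rather than substrings. The sketch below repeats the function on a made-up toy frame to show that `'Movies'` does not match a product tagged only with `'Movies & TV'`.

```python
import pandas as pd

def get_products(df, category, field='categories_flattened'):
    # `in` on a list compares whole elements, not substrings.
    mask = df[field].apply(lambda cl: category in cl)
    return df[mask]

# Toy frame with invented identifiers.
toy = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'categories_flattened': [['Books'], ['Movies & TV', 'Movies'], ['Movies & TV', 'TV']],
})

movies = get_products(toy, 'Movies')
movies_tv = get_products(toy, 'Movies & TV')
print(list(movies.asin), list(movies_tv.asin))
```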


## Combine with product reviews
Now that we can obtain all the metadata for a particular category, let's obtain the reviews per category. For a more interesting analysis, we are only interested in categories with ten or more reviews.

In [51]:
product_reviews_path = "/media/ohtar10/Adder-Storage/datasets/amazon_products/reviews/aggressive_dedup.json.gz"

In [52]:
pr_df = generate_head_df(product_reviews_path, 1000)
pr_df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A00000262KYZUE4J55XGL,B003UYU16G,Steven N Elich,"[0, 0]",It is and does exactly what the description sa...,5.0,Does what it's supposed to do,1353456000,"11 21, 2012"
1,A000008615DZQRRI946FO,B005FYPK9C,mj waldon,"[0, 0]",I was sketchy at first about these but once yo...,5.0,great buy,1357603200,"01 8, 2013"
2,A00000922W28P2OCH6JSE,B000VEBG9Y,Gabriel Merrill,"[0, 0]",Very mobile product. Efficient. Easy to use; h...,3.0,Great product but needs a varmint guard.,1395619200,"03 24, 2014"
3,A00000922W28P2OCH6JSE,B001EJMS6K,Gabriel Merrill,"[0, 0]",Easy to use a mobile. If you're taller than 4f...,4.0,Great inexpensive product. Mounts easily and t...,1395619200,"03 24, 2014"
4,A00000922W28P2OCH6JSE,B003XJCNVO,Gabriel Merrill,"[0, 0]",Love this feeder. Heavy duty & capacity. Best ...,4.0,Great feeder. Would recommend for use for thos...,1395619200,"03 24, 2014"


We are only interested in what was said about the product, so we will use only `asin`, `summary` and `reviewText`.

In [53]:
pr_columns = ['asin', 'summary', 'reviewText']
pr_df = pr_df[pr_columns]
pr_df.head()

Unnamed: 0,asin,summary,reviewText
0,B003UYU16G,Does what it's supposed to do,It is and does exactly what the description sa...
1,B005FYPK9C,great buy,I was sketchy at first about these but once yo...
2,B000VEBG9Y,Great product but needs a varmint guard.,Very mobile product. Efficient. Easy to use; h...
3,B001EJMS6K,Great inexpensive product. Mounts easily and t...,Easy to use a mobile. If you're taller than 4f...
4,B003XJCNVO,Great feeder. Would recommend for use for thos...,Love this feeder. Heavy duty & capacity. Best ...


In [75]:
metadata_df.join(pr_df, on='asin', how='inner')

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
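
The error above arises because `DataFrame.join` with `on='asin'` matches the caller's `asin` column (strings) against the index of the other frame, which here is the default integer index. A minimal sketch of one way around it, using `merge` on the shared column, follows. The toy frames and the lowered threshold are assumptions made for illustration, not the notebook's actual data.

```python
import pandas as pd

# Toy frames with invented values.
meta = pd.DataFrame({
    'asin': ['0001', '0002', '0003'],
    'categories_flattened': [['Books'], ['Movies & TV', 'Movies'], ['Books', 'Audio']],
})
reviews = pd.DataFrame({
    'asin': ['0001', '0001', '0003', '0009'],
    'summary': ['good', 'fine', 'great', 'orphan review'],
})

# merge compares the two `asin` columns directly; the inner join keeps
# only products that have at least one review.
joined = meta.merge(reviews, on='asin', how='inner')

# One row per (review, category) pair, then count reviews per category.
reviews_per_category = (
    joined.explode('categories_flattened')
          .groupby('categories_flattened')
          .size()
)

min_reviews = 2  # the text uses ten; lowered here to suit the toy data
frequent = reviews_per_category[reviews_per_category >= min_reviews]
print(frequent)
```

On the 1000-row samples loaded above, such a merge may well return few or no rows, since the two files are sampled independently and need not share products.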

#### Note:
No idea what I was trying to do here....

In [7]:
from collections import defaultdict

# Map each category name to the titles of the products that belong to it.
categories_dict = defaultdict(dict)
for cat in categories_flat:
    cat_indexes = metadata_df.categories_flattened.apply(lambda clist: cat in clist)
    filtered = metadata_df[cat_indexes]
    categories_dict[cat] = {'titles': filtered.title.values}

### Read the whole set in chunks

In [10]:
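def read_file_chunks(path, chunksize=1000, line_number=9430088):
    """Yield each record as a strict JSON string, reading `chunksize` lines at a time."""
    with gzip.open(path, 'r') as f:
        for _ in range(0, line_number, chunksize):
            # islice(f, chunksize) takes the next `chunksize` lines from the shared
            # file iterator, so each pass continues where the previous one stopped.
            for l in islice(f, chunksize):
                yield json.dumps(eval(l))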
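
A variant that yields one DataFrame per chunk, and stops at end of file instead of relying on a known line count, could look like the sketch below. The helper name `iter_chunks` and the temporary test file are hypothetical, and `ast.literal_eval` again assumes each line is a plain Python literal.

```python
import ast
import gzip
import os
import tempfile
from itertools import islice

import pandas as pd

def iter_chunks(path, chunksize=1000):
    """Hypothetical helper: yield one DataFrame per `chunksize` lines until EOF."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        while True:
            records = [ast.literal_eval(line) for line in islice(f, chunksize)]
            if not records:
                break
            yield pd.DataFrame(records)

# Write a tiny gzip file with five made-up records to exercise the helper.
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy.json.gz')
with gzip.open(tmp_path, 'wt', encoding='utf-8') as f:
    for i in range(5):
        f.write(repr({'asin': f'{i:04d}', 'title': f'item {i}'}) + '\n')

chunk_sizes = [len(chunk) for chunk in iter_chunks(tmp_path, chunksize=2)]
print(chunk_sizes)
```

Only one chunk is held in memory at a time, so each chunk can be filtered or aggregated before the next one is read.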