# Amazon Product Reviews Playground
This notebook is my playground for exploring the Amazon product reviews dataset, using Apache Livy to communicate with Spark.

In [1]:
spark.version

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
4,,spark,idle,,,✔



SparkSession available as 'spark'.



res1: String = 2.4.4


In [2]:
val base_path = "/Users/lferrod/datasets"


base_path: String = /Users/lferrod/datasets


In [3]:
val metadata_path = base_path + "/amazon/product_reviews/metadata.json.gz"
val reviews_path = base_path + "/amazon/product_reviews/aggressive_dedup.json.gz"


metadata_path: String = /Users/lferrod/datasets/amazon/product_reviews/metadata.json.gz
reviews_path: String = /Users/lferrod/datasets/amazon/product_reviews/aggressive_dedup.json.gz


In [4]:
val metadata_df = spark.read.json(metadata_path)
metadata_df.createOrReplaceTempView("product_metadata")


metadata_df: org.apache.spark.sql.DataFrame = [_corrupt_record: string, asin: string ... 8 more fields]


In [5]:
%%sql
select * from product_metadata limit 20


We can observe that the categories are encoded as an array of arrays, and that a product may belong to more than one category. Let's first flatten the category array and explore the distinct categories, in case we want to drop redundant ones.

In [6]:
import spark.implicits._
import org.apache.spark.sql.functions._

// Explode twice: once for the outer array, once for the inner one, so each row holds a single category.
metadata_df.select($"asin", 
                   explode($"categories").as("categories"),
                   $"title",
                   $"description").
    select($"asin", 
           explode($"categories").as("category"), 
           $"description", $"title").createOrReplaceTempView("product_metadata")


import spark.implicits._
import org.apache.spark.sql.functions._
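
To make the effect of the two chained `explode` calls concrete, here is a minimal sketch in plain Scala collections, without Spark. The `Product` case class, the helper `explodeCategories` and the sample rows are made up for illustration; they only mimic the shape of the metadata records.

```scala
// Plain Scala sketch (no Spark); the records below are invented sample data.
case class Product(asin: String, categories: Seq[Seq[String]])

val sampleProducts = Seq(
  Product("A1", Seq(Seq("Books", "Literature & Fiction"), Seq("Kindle Store"))),
  Product("A2", Seq(Seq("Movies & TV")))
)

// One (asin, category) pair per innermost element, like two chained explodes.
def explodeCategories(products: Seq[Product]): Seq[(String, String)] =
  for {
    product  <- products
    path     <- product.categories // first explode: outer array
    category <- path               // second explode: inner array
  } yield (product.asin, category)

val exploded = explodeCategories(sampleProducts)
```

As with `explode`, a product whose category list is empty yields no pairs at all, so it silently disappears from the result.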


There are more than 2,500 categories! This quickly becomes impractical for experimentation. We need to reduce the number of categories so they are easier to work with. We might discard those with fewer than $n$ products associated with them, and possibly merge some of the rest.

Let's start with the most populated categories: the query below keeps those with more than 10,000 products, capped at 60 rows. We need to inspect them manually and discard or merge them for simplicity.

We are interested in major categories rather than specific ones for now: for instance, "Books" instead of a particular literary genre, and likewise for movies and TV.

In [7]:
%%sql
select category, count(asin) as products
from product_metadata
group by category
having count(asin) > 10000
order by products desc
limit 60
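
One detail worth keeping in mind: `count(asin)` counts rows, not distinct products. If the same category shows up on more than one category path of a product, that product is counted more than once after the double explode. Whether this happens in the dataset is not checked here. The sketch below shows the thresholding logic on invented pairs, in plain Scala without Spark; `productsPerCategory` and `minProducts` are hypothetical names, and the deduplication step is an assumption about what "number of products" should mean.

```scala
// Plain Scala sketch (no Spark); the (asin, category) pairs are invented.
val samplePairs = Seq(
  ("A1", "Books"), ("A1", "Books"), // same product listed twice under Books
  ("A2", "Books"),
  ("A3", "Music")
)

// Count distinct products per category and keep categories above a threshold,
// largest first (ties broken alphabetically).
def productsPerCategory(pairs: Seq[(String, String)], minProducts: Int): Seq[(String, Int)] =
  pairs.distinct
    .groupBy { case (_, category) => category }
    .map { case (category, rows) => category -> rows.size }
    .filter { case (_, products) => products > minProducts }
    .toSeq
    .sortBy { case (category, products) => (-products, category) }

val populated = productsPerCategory(samplePairs, minProducts = 1)
```

In SQL the equivalent change would be `count(distinct asin)` in place of `count(asin)`.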


We can observe a good number of products in these categories. We can also see that categories such as 'Movies' and 'Movies & TV' can be merged together, since they cover similar things.

Another important aspect is that some categories seem to describe a genre rather than a product type. "Classical", for example, could refer to movies, music or books, so a second category would be needed to disambiguate it. For now, we'll stick with parent (root) categories that can aggregate the others.

In [8]:
%%sql
select category, count(asin) as products
from (
    select asin, case 
    when category in ("Movies", 
                      "Movies & TV") then "Movies & TV"
    when category in ("Clothing, Shoes & Jewelry", 
                      "Clothing", "T-Shirts", "Shirts", 
                      "Jewelry", 
                      "Dresses", 
                      "Boots", 
                      "Shoes", 
                      "Jewelry: International Shipping Available", 
                      "Shoes & Accessories: International Shipping Available", 
                      "Fashion") then "Clothing, Shoes & Jewelry"
    when category in ("Music", 
                      "Digital Music", 
                      "CDs & Vinyl",
                      "Pop",
                      "Jazz",
                      "Alternative Rock",
                      "Rock", 
                      "World Music") then "Music"
    when category in ("Electronics", 
                      "Cell Phones & Accessories", 
                      "Computers & Accessories", 
                      "Camera & Photo") then "Technology, Electronics & Accessories"
    when category in ("Books", 
                      "Kindle eBooks", 
                      "Kindle Store", 
                      "Literature & Fiction", 
                      "Kindle Short Reads",
                      "Christian Books & Bibles") then "Books"
    when category in ("Games", 
                      "Toys & Games") then "Toys & Games"
    when category in ("Home & Kitchen", 
                      "Kitchen & Dining") then "Home & Kitchen"
    when category in ("Office & School Supplies", 
                      "Office Products") then "Office & School Supplies"
    when category in ("Hair Care", 
                      "Skin Care",
                      "Health & Personal Care") then "Health & Personal Care"
    when category in ("Pet Supplies", 
                      "Dogs") then "Pets & Animals"
    when category in ("Athletic", 
                     "Health, Fitness & Dieting") then "Health, Fitness & Dieting"
    else category end as category
    from product_metadata
)
group by category
having count(asin) > 10000
order by products desc
limit 60
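
The `case ... when` mapping is written out twice in this notebook, and the two copies already differ: the later one also lists 'Earrings' and 'Dance & Electronic'. One way to avoid that drift is to keep the mapping in a single lookup table. The following is a sketch only, in plain Scala with an abbreviated mapping; `rootCategoryOf` and `toRootCategory` are hypothetical names, not part of the notebook.

```scala
// Plain Scala sketch: root category -> member categories (abbreviated on purpose).
val rootCategoryOf: Map[String, String] = Seq(
  "Movies & TV"    -> Seq("Movies", "Movies & TV"),
  "Music"          -> Seq("Music", "Digital Music", "CDs & Vinyl", "Pop", "Jazz"),
  "Pets & Animals" -> Seq("Pet Supplies", "Dogs")
).flatMap { case (root, members) => members.map(_ -> root) }.toMap

// Categories without an entry keep their own name, like the `else category` branch.
def toRootCategory(category: String): String =
  rootCategoryOf.getOrElse(category, category)
```

For example, `toRootCategory("Dogs")` gives `"Pets & Animals"`, while an unmapped name such as `"Video Games"` is returned unchanged.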


Based on the list above, we manually pick a group that we consider to be good enough for the experiment.

In [9]:
val final_products_sql = """select asin, title, description, case 
    when category in ('Movies', 
                      'Movies & TV') then 'Movies & TV'
    when category in ('Clothing, Shoes & Jewelry', 
                      'Clothing', 
                      'T-Shirts', 
                      'Shirts', 
                      'Jewelry', 
                      'Dresses', 
                      'Boots', 
                      'Shoes', 
                      'Jewelry: International Shipping Available', 
                      'Shoes & Accessories: International Shipping Available', 
                      'Fashion',
                      'Earrings') then 'Clothing, Shoes & Jewelry'
    when category in ('Music', 
                      'Digital Music', 
                      'CDs & Vinyl',
                      'Pop',
                      'Jazz',
                      'Alternative Rock',
                      'Rock', 
                      'World Music',
                      'Dance & Electronic') then 'Music'
    when category in ('Electronics', 
                      'Cell Phones & Accessories', 
                      'Computers & Accessories', 
                      'Camera & Photo') then 'Technology, Electronics & Accessories'
    when category in ('Books', 
                      'Kindle eBooks', 
                      'Kindle Store', 
                      'Literature & Fiction', 
                      'Kindle Short Reads',
                      'Christian Books & Bibles') then 'Books'
    when category in ('Games', 
                      'Toys & Games') then 'Toys & Games'
    when category in ('Home & Kitchen', 
                      'Kitchen & Dining') then 'Home & Kitchen'
    when category in ('Office & School Supplies', 
                      'Office Products') then 'Office & School Supplies'
    when category in ('Hair Care', 
                      'Skin Care',
                      'Health & Personal Care') then 'Health & Personal Care'
    when category in ('Pet Supplies', 
                      'Dogs') then 'Pets & Animals'
    when category in ('Athletic', 
                     'Health, Fitness & Dieting') then 'Health, Fitness & Dieting'
    else category end as category
    from product_metadata"""

val final_products = spark.sql(final_products_sql)


final_products_sql: String =
select asin, title, description, case
    when category in ('Movies',
                      'Movies & TV') then 'Movies & TV'
    when category in ('Clothing, Shoes & Jewelry',
                      'Clothing',
                      'T-Shirts',
                      'Shirts',
                      'Jewelry',
                      'Dresses',
                      'Boots',
                      'Shoes',
                      'Jewelry: International Shipping Available',
                      'Shoes & Accessories: International Shipping Available',
                      'Fashion',
                      'Earrings') then 'Clothing, Shoes & Jewelry'
    when category in ('Music',
                      'Digital Music',
                      'CDs & Vinyl',
          ...
final_products: org.apache.spark.sql.DataFrame = [asin: string, title: string ... 2 more fields]


In [10]:
val categories = List("Music", 
                      "Books",
                      "Movies & TV",
                      "Clothing, Shoes & Jewelry",
                      "Technology, Electronics & Accessories",
                      "Home & Kitchen",
                      "Sports & Outdoors",
                      "Toys & Games",
                      "Health & Personal Care",
                      "Tools & Home Improvement",
                      "Pets & Animals",
                      "Health, Fitness & Dieting",
                      "Patio, Lawn & Garden",
                      "Musical Instruments",
                      "Video Games")
val categoriesDF = final_products.filter($"category".isin(categories: _*))
categoriesDF.createOrReplaceTempView("product_categories")
categoriesDF.write.mode("overwrite").partitionBy("category").parquet(s"$base_path/amazon/product_reviews/final_categories")


categories: List[String] = List(Music, Books, Movies & TV, Clothing, Shoes & Jewelry, Technology, Electronics & Accessories, Home & Kitchen, Sports & Outdoors, Toys & Games, Health & Personal Care, Tools & Home Improvement, Pets & Animals, Health, Fitness & Dieting, Patio, Lawn & Garden, Musical Instruments, Video Games)
categoriesDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [asin: string, title: string ... 2 more fields]


### Explore product reviews
With the categories selected, we can proceed to obtain the corresponding reviews for each one and aggregate them.

In [11]:
val product_reviews = spark.read.json(reviews_path)
product_reviews.createOrReplaceTempView("product_reviews")
product_reviews.show(10)



product_reviews: org.apache.spark.sql.DataFrame = [asin: string, helpful: array<bigint> ... 7 more fields]
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|      asin|helpful|overall|          reviewText| reviewTime|          reviewerID|   reviewerName|             summary|unixReviewTime|
+----------+-------+-------+--------------------+-----------+--------------------+---------------+--------------------+--------------+
|B003UYU16G| [0, 0]|    5.0|It is and does ex...|11 21, 2012|A00000262KYZUE4J5...| Steven N Elich|Does what it's su...|    1353456000|
|B005FYPK9C| [0, 0]|    5.0|I was sketchy at ...| 01 8, 2013|A000008615DZQRRI9...|      mj waldon|           great buy|    1357603200|
|B000VEBG9Y| [0, 0]|    3.0|Very mobile produ...|03 24, 2014|A00000922W28P2OCH...|Gabriel Merrill|Great product but...|    1395619200|
|B001EJMS6K| [0, 0]|    4.0|Easy to use a mob...|03 24, 2014|A00000922W28P2OCH...|G

In [12]:
%%sql
select pc.category, pc.title, pc.description, pr.reviewText as review_text
from product_categories pc inner join product_reviews pr on pc.asin = pr.asin
limit 100
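
A caveat about this join: after the merge step, a product that was listed under two categories with the same root (say "Books" and "Kindle Store") can appear as two identical rows in `product_categories`, and an inner join then repeats each of its reviews once per row. Whether such duplicates exist in this dataset is not verified here. The sketch below reproduces the effect in plain Scala without Spark; the case classes, `joinReviews` and all sample rows are invented for illustration.

```scala
// Plain Scala sketch (no Spark); all rows below are made up for illustration.
case class CategorizedProduct(asin: String, category: String, title: String)
case class Review(asin: String, reviewText: String)

// Inner join on asin: products without reviews and reviews without products are dropped.
def joinReviews(products: Seq[CategorizedProduct],
                reviews: Seq[Review]): Seq[(String, String, String)] = {
  val reviewsByAsin = reviews.groupBy(_.asin)
  for {
    product <- products
    review  <- reviewsByAsin.getOrElse(product.asin, Seq.empty[Review])
  } yield (product.category, product.title, review.reviewText)
}

val sampleCategorized = Seq(
  CategorizedProduct("A1", "Books", "Some title"), // e.g. originally "Books"
  CategorizedProduct("A1", "Books", "Some title"), // e.g. originally "Kindle Store"
  CategorizedProduct("A2", "Music", "Other title") // has no reviews
)
val sampleReviews = Seq(Review("A1", "great read"), Review("B9", "orphan review"))

val joinedWithDuplicates = joinReviews(sampleCategorized, sampleReviews)
val joinedDeduplicated   = joinReviews(sampleCategorized.distinct, sampleReviews)
```

Here `joinedWithDuplicates` contains the single review twice, while `joinedDeduplicated` contains it once. On the Spark side, dropping duplicate rows before registering the view would play the role of `.distinct`.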


In [13]:
%%sql
select pc.category, collect_set(pc.title) as titles
from product_categories pc
where category = "Books"
group by pc.category
limit 10


An error was encountered:
Invalid status code '400' from http://localhost:8998/sessions/4/statements/13 with error payload: {"msg":"requirement failed: Session isn't active."}


In [14]:
print("hello")

An error was encountered:
Invalid status code '404' from http://localhost:8998/sessions/4 with error payload: {"msg":"Session '4' not found."}
