# Analysis of Amazon Product Data

After the graph has been constructed (products as vertices, cross-buy links as edges), the following topics will be investigated:
* correlation between in-degree and sales rank
* further questions: Does the number of reviews influence the sales rank? Are there distinct sub-graphs, for example along product categories?

Data source: https://snap.stanford.edu/data/amazon-meta.html

Data import

In [3]:
# Import graphframes (from Spark-Packages)
from graphframes import *
from pyspark.sql.functions import monotonically_increasing_id, desc


# Get the product data (already pre-processed data in table format)
products = spark.read.csv("/FileStore/tables/ProductData_small.csv", sep=";", header=True)

# Get the cross-buy data (similar products)
links = spark.read.csv("/FileStore/tables/LinkData_small.csv", sep=";", header=True)

How many products are we dealing with here?

In [5]:
products.count()

Quick check of the data format:

In [7]:
display(products.take(3))

Id,ASIN,Group,Salesrank,Reviews,avg rating,Title
21,0790747324,DVD,795,140,4.5,The Time Machine
84,B000063W82,Video,780,14,4.5,The Best of Schoolhouse Rock! - 30th Anniversary Edition
252,B0000262WI,Music,581,82,4.5,The Köln Concert


And which and how many groups are there?

In [9]:
display(products.select("Group").distinct())

Group
Video
Toy
DVD
Sports
Baby Product
Video Games
Book
Music
Software


In [10]:
products.select("Group").distinct().count()

And how many products per category?

In [12]:
display(products.groupBy('Group').count().sort('count', ascending=False))

Group,count
Video,2933
Music,2855
Book,2535
DVD,2276
Software,5
Toy,4
Sports,1
Baby Product,1
Video Games,1


In [13]:
display(links.take(3))

src,dest
790747324,B00007JMD8
790747324,6305350221
790747324,B00004RF9B


The graph will be constructed from the product data. Each product is identified by its unique *ASIN* (Amazon Standard Identification Number).

The GraphFrames library requires the vertex identifier column to be named *id*, so *ASIN* has to be renamed. Be careful: the DataFrame already contains a column "Id", and Spark resolves column names case-insensitively by default.

In [15]:
# rename "Id" to "Id-nu" to avoid any confusion (nu = not used)
# rename "ASIN" to "id"
ProductVertices = products.withColumnRenamed("Id", "Id-nu").distinct()
ProductVertices = ProductVertices.withColumnRenamed("ASIN", "id").distinct()
ProductVertices = ProductVertices.withColumnRenamed("avg rating", "avg_rating").distinct()

# check if done correctly
display(ProductVertices.take(3))

Id-nu,id,Group,Salesrank,Reviews,avg_rating,Title
1698,6302796857,Video,424,28,4.5,A Town Like Alice
7524,B00000JWVS,DVD,88,58,4.5,Jerry Seinfeld Live on Broadway: I'm Telling You for the Last Time
10025,0393057658,Book,4371,252,4.5,Moneyball: The Art of Winning an Unfair Game


What does the edge table look like?

In [17]:
display(links.take(1))

src,dest
790747324,B00007JMD8


And how many edges are there?

In [19]:
links.count()

Although not always strictly required, it is good practice to have an ID column in each table.

Since the product links do not have one yet, create an additional ID:

In [21]:
# 2. Create Edges
links = links.withColumn("link-id", monotonically_increasing_id())

display(links.take(3))

src,dest,link-id
790747324,B00007JMD8,0
790747324,6305350221,1
790747324,B00004RF9B,2
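
The generated IDs are unique and increasing, but not necessarily consecutive. According to the Spark documentation, the current implementation of `monotonically_increasing_id` puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal plain-Python sketch of that layout (the helper name `split_monotonic_id` is made up for illustration and is not part of Spark):

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition

def split_monotonic_id(value):
    """Split an ID into (partition index, record number) under the documented layout."""
    return value >> RECORD_BITS, value & ((1 << RECORD_BITS) - 1)

print(split_monotonic_id(2))               # (0, 2)
print(split_monotonic_id((1 << 33) + 42))  # (1, 42)
```

Under this layout, the small link-id values in the sample above would correspond to rows from the first partition.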


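Salesrank, the number of reviews and the average rating are read as strings. Cast them to numeric types for calculations. Note that the rating cast writes to a column named *avg rating* (with a space), so it adds a new float column next to the original string column *avg_rating*.

In [23]:
# change type of features "Salesrank" and "Reviews" to int
ProductVertices = ProductVertices.withColumn('Salesrank', ProductVertices.Salesrank.cast('int'))
ProductVertices = ProductVertices.withColumn('Reviews', ProductVertices.Reviews.cast('int'))
# cast the rating to float; the target name 'avg rating' differs from 'avg_rating',
# so this creates an additional column instead of replacing the string column
ProductVertices = ProductVertices.withColumn('avg rating', ProductVertices.avg_rating.cast('float'))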
display(ProductVertices.take(1))

Id-nu,id,Group,Salesrank,Reviews,avg_rating,Title,avg rating
1698,6302796857,Video,424,28,4.5,A Town Like Alice,4.5


GraphFrames expects the edge columns to be named *src* and *dst*, so rename *dest* to *dst*:

In [25]:
# rename destination "dest" to "dst"
links = links.withColumnRenamed('dest', 'dst')


Now comes the moment of truth! Let's see if the graph can be constructed!

In [27]:
# Build the GraphFrame based on products and links

ProductGraph = GraphFrame(ProductVertices, links)

# print a sample of the vertices and edges
print("Products: %d" % ProductGraph.vertices.count())
print()  # blank line (a bare `print` does nothing in Python 3)
print("SAMPLE PRODUCT DATA:")
print(ProductGraph.vertices.take(1))
print()

print("Links: %d" % ProductGraph.edges.count())
print()
print("SAMPLE LINK DATA:")
print(ProductGraph.edges.take(1))
print()

The Graph is constructed. Do the fancy analyses!

1) Assess the in-degree of each product and compare it to the sales rank

Plot the next chunk in the following way:

You can look at the distribution of the inDegree values by clicking on the icon to the right of 'Raw table' and then choosing a histogram with values = 'inDegree' and 30 bins.

We can see that most of the inDegree values are between 0 and 2.

In [31]:
display(ProductGraph.inDegrees.sort('inDegree', ascending=False))
#inDegRank = ProductGraph.inDegrees.sort('inDegree', ascending=False)

id,inDegree
B00008LDNZ,78
B00005JMAH,55
B00023P4I8,53
B00003CXCT,52
B000634DCW,45
B000068DBC,37
6304765266,36
B00005ATZT,34
B00001QEE7,33
B00005JMIJ,33
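
The in-degree of a product is simply the number of edges whose `dst` is that product. The following plain-Python sketch on a few invented edges (not the real data) illustrates the idea behind `inDegrees`. As in the GraphFrames result, vertices without any incoming edge do not show up at all.

```python
from collections import Counter

# Toy edge list (src, dst); the IDs are invented for illustration
edges = [("A", "B"), ("C", "B"), ("A", "C"), ("D", "B")]

# Count how often each product appears as a destination
in_degree = Counter(dst for _, dst in edges)

# Print products sorted by descending in-degree
for product, degree in in_degree.most_common():
    print(product, degree)
```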


In [32]:
maxNumInDeg = ProductGraph.inDegrees.groupBy().max('inDegree')
display(maxNumInDeg)

maxNumInDegValue = str(maxNumInDeg.collect()[0]['max(inDegree)'])
#print(maxNumInDegValue)

maxInDegreeProduct = ProductGraph.inDegrees.filter('inDegree = {}'.format(maxNumInDegValue))
display(maxInDegreeProduct)


id,inDegree
B00008LDNZ,78


In [33]:
# join back to get product info based on max inDegree
ProductMaxInDegree = ProductGraph.vertices.join(maxInDegreeProduct, 'id')
display(ProductMaxInDegree)



id,Id-nu,Group,Salesrank,Reviews,avg_rating,Title,avg rating,inDegree
B00008LDNZ,548091,DVD,110,83,5.0,Laura,5.0,78


Plot the following chunk as a bar chart with 'Plot Options':
 - Keys: inDegree
 - Values: Salesrank
 
Unfortunately you can't choose a logarithmic y-axis, but you can still see that products with low in-degree values tend to have a rather bad (numerically high) sales rank.

Alternative: scatter plot of inDegree vs. Salesrank, grouped by product group (Keys = Group)
Update: a box plot of avg_rating vs. Salesrank can now be created, but it does not provide much information

In [35]:
# repeat exercise from above but with entire list
selection = ProductGraph.inDegrees.sort('inDegree', ascending=False)
#display(selection)

ProductSortInDeg = ProductGraph.vertices.join(selection, 'id').sort('inDegree', ascending=False)
display(ProductSortInDeg)

id,Id-nu,Group,Salesrank,Reviews,avg_rating,Title,avg rating,inDegree
B00008LDNZ,548091,DVD,110,83,5.0,Laura,5.0,78
B000068DBC,546689,DVD,109,633,4.5,Pulp Fiction (Collector's Edition),4.5,37
6304765266,97579,DVD,351,152,4.5,While You Were Sleeping,4.5,36
B00005ATZT,547040,DVD,1020,140,4.5,The Fugitive (Special Edition),4.5,34
B00001QEE7,104775,DVD,49,160,4.5,The Little Mermaid (Limited Issue),4.5,33
6305368171,276371,DVD,236,500,4.0,You've Got Mail,4.0,33
B00003CXA2,547272,DVD,185,530,4.0,Forrest Gump,4.0,30
630522577X,548182,DVD,253,180,4.5,My Fair Lady,4.5,29
B000059HB7,4388,DVD,1877,79,4.5,Rio Bravo,4.5,29
B000022TSH,544390,DVD,1108,154,4.5,Chinatown,4.5,27


Plot inDegree vs. Salesrank for Books only
* plot as scatter plot with LOESS smoother

In [37]:
# try to differentiate by product group
ProductSortInDeg = ProductGraph.vertices.join(selection, 'id').sort('inDegree', ascending=False).filter("Group = 'Book'")
display(ProductSortInDeg)

id,Id-nu,Group,Salesrank,Reviews,avg_rating,Title,avg rating,inDegree
0684801523,199628,Book,956,934,4.0,The Great Gatsby,4.0,27
0316769487,98756,Book,60,2568,4.0,The Catcher in the Rye,4.0,24
0446310786,537519,Book,111,1414,4.5,To Kill a Mockingbird,4.5,20
0694003611,502086,Book,156,339,4.5,Goodnight Moon (Board Book),4.5,19
0140177396,341570,Book,126,1000,4.5,Of Mice and Men (Penguin Great Books of the 20th Century),4.5,19
0066620996,154855,Book,29,363,4.5,Good to Great: Why Some Companies Make the Leap... and Others Don't,4.5,18
0805047905,502784,Book,171,172,5.0,"Brown Bear, Brown Bear, What Do You See?",5.0,18
0399501487,35512,Book,143,1101,4.0,Lord of the Flies,4.0,18
0399226907,180888,Book,279,164,4.5,The Very Hungry Caterpillar board book,4.5,18
0449214923,482263,Book,215,260,4.5,Think and Grow Rich,4.5,17


Plot the following chunk as a bar chart with 'Plot Options':
 - Keys: outDegree
 - Values: Salesrank

In [39]:
# repeat exercise from above for outDegree
selection = ProductGraph.outDegrees.sort('outDegree', ascending=True)
#display(selection)

ProductSortOutDeg = ProductGraph.vertices.join(selection, 'id').sort('outDegree', ascending=True)
display(ProductSortOutDeg)

id,Id-nu,Group,Salesrank,Reviews,avg_rating,Title,avg rating,outDegree
6303031803,481374,Video,1778,3,3.5,A Gnome Named Gnorm,3.5,1
6300215350,394817,Video,2142,19,4.5,Water,4.5,1
6300218228,481265,Video,721,4,5.0,Woman Called Golda,5.0,1
6303471617,241928,Video,4095,5,4.0,Rolling Thunder,4.0,1
B0000714F7,475706,Video,447,7,5.0,The Forsyte Saga - First Generation,5.0,1
B00000F5ZW,282259,Video,1014,26,4.5,Scavenger Hunt,4.5,1
6300270246,424280,Video,4661,6,3.0,Deal of the Century,3.0,1
B00005MM6X,57827,Video,4135,0,0.0,First Aid Pet Emergency: Dogs,0.0,1
6301973135,162750,Video,1011,8,5.0,Of Human Bondage,5.0,1
6300271382,328876,Video,3023,7,4.0,Pete Kelly's Blues,4.0,1


outDegree is rather boring, since most products are assigned five presumably similar products by default, as you can see here:

In [41]:
display(ProductSortOutDeg.groupBy("outDegree").count())

outDegree,count
1,41
2,43
3,48
4,43
5,10343


Plot the output of the next chunk in the following way:
choose a scatter plot with options:
  - values: Salesrank, Reviews
  - LOESS bandwidth 0.47
  
and apply to all rows.
We can see a very weak dependency, suggesting that the fewer reviews a product has, the less likely it is to be a top-selling product (the causal direction might actually be the reverse here: top-selling products may simply attract more reviews).

In [43]:
# relation of number of reviews and sales rank:
display(ProductGraph.vertices.select("Reviews", "Salesrank").sort("Reviews", ascending = False))


Reviews,Salesrank
5545,110
5539,420
5539,2135
5039,170
5039,1533
5034,746
4924,404
4924,661
4924,327
3839,3221


What is the correlation coefficient between the two?

In [45]:
ProductGraph.vertices.stat.corr("Reviews", "Salesrank")
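
`stat.corr` computes the Pearson correlation coefficient by default. For reference, here is a plain-Python sketch of that formula, applied only to the ten (Reviews, Salesrank) rows shown above. The helper `pearson` is written for illustration, and the value for these ten rows is not representative of the full dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Sample input copied from the output above
reviews = [5545, 5539, 5539, 5039, 5039, 5034, 4924, 4924, 4924, 3839]
salesrank = [110, 420, 2135, 170, 1533, 746, 404, 661, 327, 3221]

print(pearson(reviews, salesrank))
```

Since a sales rank is an ordinal quantity, a rank-based coefficient such as Spearman's might be worth comparing as well.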

Try to find products that have an edge to another product which in turn has an edge back to the first product:

In [47]:
ProductsWithBalancedSimilarity = ProductGraph.find("(v1)-[e1]->(v2); (v2)-[e2]->(v1)")
display(ProductsWithBalancedSimilarity)

v1,e1,v2,e2
"List(21, 0790747324, DVD, 795, 140, 4.5, The Time Machine, 4.5)","List(0790747324, B00007JMD8, 0)","List(114681, B00007JMD8, DVD, 1076, 104, 4.5, Journey to the Center of the Earth, 4.5)","List(B00007JMD8, 0790747324, 11138)"
"List(84, B000063W82, Video, 780, 14, 4.5, The Best of Schoolhouse Rock! - 30th Anniversary Edition, 4.5)","List(B000063W82, 156949407X, 5)","List(239294, 156949407X, Video, 1223, 24, 5.0, Schoolhouse Rock! - Grammar Rock, 5.0)","List(156949407X, B000063W82, 22808)"
"List(296, 0385504209, Book, 19, 3049, 3.5, The Da Vinci Code, 3.5)","List(0385504209, 0671027360, 20)","List(376858, 0671027360, Book, 31, 1552, 4.0, Angels & Demons, 4.0)","List(0671027360, 0385504209, 34818)"
"List(296, 0385504209, Book, 19, 3049, 3.5, The Da Vinci Code, 3.5)","List(0385504209, 0671027387, 22)","List(424705, 0671027387, Book, 84, 429, 4.0, Deception Point, 4.0)","List(0671027387, 0385504209, 38171)"
"List(448, 0312966091, Book, 607, 141, 4.5, Three To Get Deadly : A Stephanie Plum Novel (A Stephanie Plum Novel), 4.5)","List(0312966091, 0312990456, 40)","List(471848, 0312990456, Book, 304, 414, 4.5, One for the Money (A Stephanie Plum Novel), 4.5)","List(0312990456, 0312966091, 41794)"
"List(448, 0312966091, Book, 607, 141, 4.5, Three To Get Deadly : A Stephanie Plum Novel (A Stephanie Plum Novel), 4.5)","List(0312966091, 0312971346, 41)","List(104366, 0312971346, Book, 1401, 264, 4.5, High Five (A Stephanie Plum Novel), 4.5)","List(0312971346, 0312966091, 10118)"
"List(448, 0312966091, Book, 607, 141, 4.5, Three To Get Deadly : A Stephanie Plum Novel (A Stephanie Plum Novel), 4.5)","List(0312966091, 0312976275, 42)","List(475036, 0312976275, Book, 787, 324, 4.5, Hot Six : A Stephanie Plum Novel (A Stephanie Plum Novel), 4.5)","List(0312976275, 0312966091, 41970)"
"List(448, 0312966091, Book, 607, 141, 4.5, Three To Get Deadly : A Stephanie Plum Novel (A Stephanie Plum Novel), 4.5)","List(0312966091, 0312980140, 43)","List(158327, 0312980140, Book, 1084, 274, 4.0, Seven Up (A Stephanie Plum Novel), 4.0)","List(0312980140, 0312966091, 15295)"
"List(457, B0000296JB, Music, 2439, 545, 4.5, Make Yourself, 4.5)","List(B0000296JB, B00005QG9J, 45)","List(119794, B00005QG9J, Music, 2488, 622, 4.5, Morning View, 4.5)","List(B00005QG9J, B0000296JB, 11691)"
"List(480, B0000296J9, Music, 1310, 42, 5.0, Gunfighter Ballads & Trail Songs, 5.0)","List(B0000296J9, B0000026AG, 50)","List(219235, B0000026AG, Music, 2015, 26, 5.0, Johnny Horton - Greatest Hits, 5.0)","List(B0000026AG, B0000296J9, 21158)"


How many product pairs reference each other? Can we just calculate it like this?

In [49]:
ProductsWithBalancedSimilarity.count()

Does this mean we have 11202 product pairs that reference each other? Let's test whether our data frame of mutually referencing products lists each pair once or twice by looking at an example:

In [51]:
display(ProductsWithBalancedSimilarity.filter("(v1.id = '0790747324' and v2.id = 'B00007JMD8') or (v2.id = '0790747324' and v1.id = 'B00007JMD8')"))

v1,e1,v2,e2
"List(21, 0790747324, DVD, 795, 140, 4.5, The Time Machine, 4.5)","List(0790747324, B00007JMD8, 0)","List(114681, B00007JMD8, DVD, 1076, 104, 4.5, Journey to the Center of the Earth, 4.5)","List(B00007JMD8, 0790747324, 11138)"
"List(114681, B00007JMD8, DVD, 1076, 104, 4.5, Journey to the Center of the Earth, 4.5)","List(B00007JMD8, 0790747324, 11138)","List(21, 0790747324, DVD, 795, 140, 4.5, The Time Machine, 4.5)","List(0790747324, B00007JMD8, 0)"


So the pairs are listed twice, which makes sense: the motif matches every ordered pair (v1, v2), so each mutual link is found once from each side. This means there are 11202/2 = 5601 product pairs that reference each other.
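
Instead of halving the count, each pair can also be kept exactly once by imposing an order on the two IDs. A plain-Python sketch on invented edges (self-loops are ignored here):

```python
# Toy directed edges; the IDs are invented for illustration
edges = {("A", "B"), ("B", "A"), ("A", "C"), ("C", "D"), ("D", "C")}

# Every reciprocal link shows up once per direction
ordered = [(s, d) for (s, d) in edges if (d, s) in edges]

# Keep one representative per pair by requiring src < dst
unordered = {(s, d) for (s, d) in ordered if s < d}

print(len(ordered), len(unordered))  # 4 2
```

On the motif result, the analogous step would presumably be a filter such as `v1.id < v2.id`; this is a suggestion and has not been run against the data here.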

Now we want to explore whether the "strongly connected components" of our graph are evenly distributed across all groups, which would mean that a DVD is as likely to reference another DVD as, for example, a book, or whether the groups tend to stick to themselves.

Here we first calculate the strongly connected components
(small warning: this command takes quite a while, ~1.17 hours, >250 jobs):

In [54]:
connectedComp = ProductGraph.stronglyConnectedComponents(maxIter=3)
#connectedComp.select("id", "component").groupBy("component").count().sort(desc("count")).show()

#ProductGraph.connectedComponents()
display(connectedComp)

Id-nu,id,Group,Salesrank,Reviews,avg_rating,Title,avg rating,component
194466,6305973385,DVD,3145,108,3.5,Paula Abdul - Cardio Dance,3.5,26
45292,B000002NJH,Music,295,282,4.5,Paint the Sky with Stars: The Best of Enya,4.5,29
265694,B00000K2VU,Music,2504,31,4.5,Cheap Thrills (Exp),4.5,34359738398
100395,B00005Q6OS,Music,1254,43,4.5,"Those Who Tell the Truth Shall Die, Those Who Tell the Truth Shall Live Forever",4.5,77309411361
526151,055357342X,Book,853,670,4.5,"A Storm of Swords (A Song of Ice and Fire, Book 3)",4.5,137438953476
477878,0316769509,Book,4230,137,4.5,Nine Stories,4.5,146028888066
301867,6303418899,Video,3006,5,4.5,There Goes a Motorcycle,4.5,146028888086
169155,6304400551,Video,3815,216,4.5,Mary Poppins,4.5,206158430228
190765,156686903X,Book,3495,102,4.0,Final Fantasy VIII Official Strategy Guide,4.0,171798691859
355037,B00004U38Q,Music,2436,11,5.0,The Very Best of Perry Como,5.0,94489280561
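
A strongly connected component is a maximal set of vertices in which every vertex can reach every other one along directed edges. To make the concept concrete, here is a single-machine sketch using Kosaraju's textbook algorithm on invented edges. It is not the distributed implementation GraphFrames uses, and the function name is chosen for illustration only.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm with explicit stacks; returns a list of vertex sets."""
    graph, reverse = defaultdict(list), defaultdict(list)
    vertices = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        vertices.update((src, dst))

    # Pass 1: record vertices in order of DFS completion on the original graph
    visited, order = set(), []
    for start in sorted(vertices):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # all neighbours handled: the node is finished
                order.append(node)
                stack.pop()

    # Pass 2: walk the reversed graph in reverse completion order
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

# Toy graph: a 3-cycle, a 2-cycle reachable from it, and a dangling vertex
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"),
         ("D", "E"), ("E", "D"), ("E", "F")]
for comp in strongly_connected_components(edges):
    print(sorted(comp))
```

Note that a vertex without any cycle through it, like `F` in the toy graph, forms a component of size one.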


Products in the same strongly connected component share the same value in the column "component". So if we group the data frame by "component" and "Group" and apply "count()", we get one row for each group that occurs in a given component.

In [56]:
countGroupComponent= connectedComp.groupBy("component", "Group").count().sort("component")
display(countGroupComponent)

component,Group,count
0,Book,5
1,Book,1
2,Book,1
3,Book,1
4,Book,1
5,Book,1
6,Book,1
7,Book,8
8,Book,2
9,Video,9


Now how many groups are taking part in each cluster?

In [58]:
countComponentsNumberOfGroups = countGroupComponent.groupBy("component").count().sort(desc("count"))
display(countComponentsNumberOfGroups)

component,count
23,4
1357209665545,3
8589934634,3
730144440322,3
180388626482,3
300647710759,3
266287972394,3
360777252871,2
283467841564,2
1468878815251,2


Next we get an overview of how many clusters have members from *n* groups:

In [60]:
numGroupsCount = countComponentsNumberOfGroups.withColumnRenamed("count", "NumberOfGroupsForComponent").groupBy("NumberOfGroupsForComponent").count()
display(numGroupsCount)

NumberOfGroupsForComponent,count
4,1
3,6
2,77
1,6667
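
The chain of grouping steps above (component and group, then groups per component, then components per number of groups) can be condensed into a small plain-Python sketch. The (component, group) assignments are invented, and the variable names are illustrative only.

```python
from collections import Counter, defaultdict

# Toy (component, group) assignments, one per product; values are invented
products = [(0, "Book"), (0, "Book"), (0, "DVD"), (1, "Music"),
            (2, "Video"), (2, "DVD"), (3, "Book")]

# Step 1: which distinct groups occur in each component
groups_per_component = defaultdict(set)
for component, group in products:
    groups_per_component[component].add(group)

# Step 2: how many components contain exactly n distinct groups
components_by_group_count = Counter(len(g) for g in groups_per_component.values())

print(dict(components_by_group_count))
# Summing the counts gives the total number of components
print(sum(components_by_group_count.values()))
```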


So how many clusters are there in total? Just adding up the values in column "count" in the last chunk results in the number of clusters:

In [62]:
display(numGroupsCount.agg({"count": "sum"}))

sum(count)
6751


6751 is a very high number, seeing that there are just 10611 products in our dataset in total.

And what is the size distribution of the clusters in general (independent of which groups their members belong to)?

In [65]:
countConnected = connectedComp.groupBy("component").count().sort(desc("count"))
display(countConnected)

component,count
23,988
51539607567,29
60129542167,25
8589934634,21
154618822687,21
197568495616,19
163208757252,19
94489280561,19
19,16
343597383724,15


So there is only one really large strongly connected component, with 988 members. The next biggest one has only 29 members.

How many of the 6751 clusters have more than 3 members?

In [68]:
countConnected.filter("count > 3").count()