## Book Copurchase Graph

#### Table of Contents

* Introduction
* Data Imported and Formatting
* Data Exploration
* Graph Processing and Analysis

### Imports

In [73]:
import cudf
import cugraph
import numpy as np

import pandas as pd

### Introduction

The dataset is a processed version of the Amazon product co-purchasing network metadata from SNAP: http://snap.stanford.edu/data/amazon-meta.html.
The original dataset covers 548,552 different products (books, music CDs, DVDs, and VHS video tapes).
The dataset used below includes only books.

### Load and Explore Dataset

In [74]:
dataset_path = '../data/amazon/books/amazon-books-v2.0.csv'

In [75]:
%%time
gdf = cudf.read_csv(dataset_path)

CPU times: user 113 ms, sys: 145 ms, total: 258 ms
Wall time: 262 ms


Here is what the data looks like.

ASIN is the identifier of an Amazon book. The Copurchased column lists the books that are commonly purchased together with the book in the ASIN column.

In [76]:
%%time
gdf.head().to_pandas()

CPU times: user 22.3 ms, sys: 926 µs, total: 23.3 ms
Wall time: 22 ms


Unnamed: 0,Id,ASIN,Title,Categories,Group,Copurchased,SalesRank,TotalReviews,AvgRating
0,1,827229534,Patterns of Preaching: A Sermon Sampler,subjects religion preaching clergy spiritualit...,Book,0804215715 156101074X 0687023955 0687074231 08...,396585,2,2.0
1,2,738700797,Candlemas: Feast of Flames,subjects witchcraft earth religion based spiri...,Book,0738700827 1567184960 1567182836 0738700525 07...,168596,12,12.0
2,3,486287785,World War II Allied Fighter Planes Trading Cards,general subjects hobbies home garden crafts books,Book,,1270652,1,1.0
3,4,842328327,Life Application Bible Commentary: 1 and 2 Tim...,subjects life bibles christian general history...,Book,0842328130 0842330313 0842328610 0842328572,631289,1,1.0
4,5,1577943082,Prayers That Avail Much for Business: Executive,subjects religion prayerbooks devotion worship...,Book,157794349X 0892749504 1577941829 0892749563,455160,0,0.0


#### Explore Books

Dataset contains 392966 book titles.

In [77]:
%%time
gdf.shape

CPU times: user 10 µs, sys: 6 µs, total: 16 µs
Wall time: 23.1 µs


(392966, 9)

Every ASIN is unique: the number of distinct ASINs equals the number of rows.

In [78]:
%%time
gdf.ASIN.unique().shape[0]

CPU times: user 3.83 ms, sys: 24.8 ms, total: 28.7 ms
Wall time: 27.5 ms


392966

A single book can be looked up by filtering on its ASIN.

In [79]:
%%time
query = gdf[gdf.ASIN == "1577943082"]

CPU times: user 139 ms, sys: 137 ms, total: 276 ms
Wall time: 275 ms


In [80]:
query.to_pandas()

Unnamed: 0,Id,ASIN,Title,Categories,Group,Copurchased,SalesRank,TotalReviews,AvgRating
4,5,1577943082,Prayers That Avail Much for Business: Executive,subjects religion prayerbooks devotion worship...,Book,157794349X 0892749504 1577941829 0892749563,455160,0,0.0


#### Preprocessing

Convert the cuDF DataFrame to pandas, which provides the APIs needed to split (explode) a column into multiple rows.

In [81]:
pd_df = gdf.to_pandas()

Replace missing values with empty strings in the object-typed columns.

In [82]:
pd_df.Copurchased = pd_df.Copurchased.fillna('').astype(str)

In [83]:
pd_df.Categories = pd_df.Categories.fillna('').astype(str)

In [84]:
pd_df.head(6)

Unnamed: 0,Id,ASIN,Title,Categories,Group,Copurchased,SalesRank,TotalReviews,AvgRating
0,1,827229534,Patterns of Preaching: A Sermon Sampler,subjects religion preaching clergy spiritualit...,Book,0804215715 156101074X 0687023955 0687074231 08...,396585,2,2.0
1,2,738700797,Candlemas: Feast of Flames,subjects witchcraft earth religion based spiri...,Book,0738700827 1567184960 1567182836 0738700525 07...,168596,12,12.0
2,3,486287785,World War II Allied Fighter Planes Trading Cards,general subjects hobbies home garden crafts books,Book,,1270652,1,1.0
3,4,842328327,Life Application Bible Commentary: 1 and 2 Tim...,subjects life bibles christian general history...,Book,0842328130 0842330313 0842328610 0842328572,631289,1,1.0
4,5,1577943082,Prayers That Avail Much for Business: Executive,subjects religion prayerbooks devotion worship...,Book,157794349X 0892749504 1577941829 0892749563,455160,0,0.0
5,6,486220125,How the Other Half Lives: Studies Among the Te...,general social subjects history jewish nonfict...,Book,0486401960 0452283612 0486229076 0714840343,188784,17,17.0


Create a new DataFrame that splits the books in Copurchased into individual rows, with ASIN as the index.

In [85]:
new_pd_df = pd.DataFrame(pd_df.Copurchased.str.split(' ').tolist(), index=pd_df.ASIN).stack()

In [86]:
new_pd_df.head(6)

ASIN         
0827229534  0    0804215715
            1    156101074X
            2    0687023955
            3    0687074231
            4    082721619X
0738700797  0    0738700827
dtype: object

In [87]:
# drop the secondary index level (the position of each book within its list)
# and turn ASIN into a regular column (it can't stay as the index because its values repeat)
new_pd_df = new_pd_df.reset_index(level=1, drop=True).reset_index()

In [88]:
new_pd_df.head()

Unnamed: 0,ASIN,0
0,827229534,0804215715
1,827229534,156101074X
2,827229534,0687023955
3,827229534,0687074231
4,827229534,082721619X


In [89]:
# the frame now holds only the two columns we need for the graph
# rename column '0' to 'Copurchase_ASIN'
new_pd_df.columns = ['ASIN', 'Copurchase_ASIN']

In [90]:
new_pd_df.head(10)

Unnamed: 0,ASIN,Copurchase_ASIN
0,827229534,0804215715
1,827229534,156101074X
2,827229534,0687023955
3,827229534,0687074231
4,827229534,082721619X
5,738700797,0738700827
6,738700797,1567184960
7,738700797,1567182836
8,738700797,0738700525
9,738700797,0738700940
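
The reshaping above can be hard to follow from the pandas calls alone. As a rough illustration, here is a plain-Python sketch of the same idea: each space-separated Copurchased string becomes one (ASIN, Copurchase_ASIN) pair per copurchased book. The helper name `explode_copurchases` and the three sample rows are made up for this sketch (the values echo the tables above); it is not the notebook's implementation.

```python
# Hypothetical sample rows: (ASIN, space-separated Copurchased string).
books = [
    ("0827229534", "0804215715 156101074X 0687023955"),
    ("0486287785", ""),  # a book with no copurchases
    ("0842328327", "0842328130 0842330313"),
]


def explode_copurchases(rows):
    """Return one (asin, copurchase_asin) pair per copurchased book."""
    edges = []
    for asin, copurchased in rows:
        # "".split(" ") yields [""], so a book without copurchases is kept
        # as a single row with an empty Copurchase_ASIN.
        for other in copurchased.split(" "):
            edges.append((asin, other))
    return edges


edge_rows = explode_copurchases(books)
for row in edge_rows:
    print(row)
```

Note that the book without copurchases still produces one row with an empty second field, which is consistent with the empty Copurchase_ASIN values that show up later in the sorted output.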


In [91]:
%%time
sorted_pd_df = new_pd_df.sort_values(by=['ASIN'])

CPU times: user 1.59 s, sys: 0 ns, total: 1.59 s
Wall time: 1.59 s


In [92]:
sorted_pd_df

Unnamed: 0,ASIN,Copurchase_ASIN
700446,0000037931,
511570,0001047655,0061007358
511569,0001047655,0061007129
511571,0001047655,0061007137
511572,0001047655,0061099341
511573,0001047655,0061007161
886962,0001053388,
596758,0001053736,0345336062
596759,0001053736,0140380531
596757,0001053736,0440905605


#### Construct Book Graph

In [93]:
%%time
new_gdf = cudf.from_pandas(new_pd_df)

CPU times: user 130 ms, sys: 60.2 ms, total: 191 ms
Wall time: 190 ms


In [94]:
new_gdf.dtypes

ASIN               object
Copurchase_ASIN    object
dtype: object

In [95]:
combined_gdf = cudf.merge(new_gdf, gdf, on=['ASIN'])

In [96]:
sorted_combined_gdf = combined_gdf.sort_values(['ASIN'])

Replace missing values in Categories with empty strings.

In [97]:
sorted_combined_gdf['Categories'] = sorted_combined_gdf['Categories'].fillna('')

In [98]:
sorted_combined_gdf.head().to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,Copurchased,SalesRank,TotalReviews,AvgRating
694075,37931,,370379,"Saluki Champions, 1952-1988",,Book,,2031890,0,0.0
507141,1047655,61007129.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,0061007129 0061007358 0061007137 0061099341 00...,1116690,30,30.0
507145,1047655,61007358.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,0061007129 0061007358 0061007137 0061099341 00...,1116690,30,30.0
507149,1047655,61007137.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,0061007129 0061007358 0061007137 0061099341 00...,1116690,30,30.0
507153,1047655,61099341.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,0061007129 0061007358 0061007137 0061099341 00...,1116690,30,30.0


Drop the Copurchased column, which is now redundant.

In [99]:
sorted_combined_gdf = sorted_combined_gdf.drop('Copurchased')

#### Graph

Calculate edge weights: the strength of the connection between two vertices, based on the similarity of their neighbor sets.

We will build a graph between ASIN and Copurchase_ASIN. Both columns are object (string) typed, so we create renumbered source and destination vertex id columns of type int32, as cuGraph requires. The numbering map returned by the renumbering step maps the new ids back to the original ids. The current cuGraph renumbering API supports only int32 input, so ASIN and Copurchase_ASIN must be converted to int32 first. As the outputs below show, this cast turns an empty Copurchase_ASIN into 0 and drops a trailing X (038533348X becomes 38533348), so every book without copurchases points at a shared placeholder vertex and distinct ASINs can end up with the same integer id.
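
To make the renumbering idea concrete, here is a minimal plain-Python sketch that maps arbitrary ids to contiguous integers and keeps a list for translating back. The function `renumber` below is a hypothetical standalone helper, not cuGraph's API, and it assigns ids in order of first appearance, which need not match the order cuGraph uses. Because it works on any hashable value, it can renumber the string ASINs directly without an integer cast.

```python
def renumber(sources, destinations):
    """Map arbitrary hashable ids to contiguous integers 0..n-1.

    Returns (src_ids, dst_ids, numbering), where numbering[new_id] is the
    original id. Ids are assigned in order of first appearance; this is an
    assumption of the sketch, not a statement about cuGraph's ordering.
    """
    to_new = {}
    numbering = []

    def lookup(original):
        if original not in to_new:
            to_new[original] = len(numbering)
            numbering.append(original)
        return to_new[original]

    src_ids = [lookup(s) for s in sources]
    dst_ids = [lookup(d) for d in destinations]
    return src_ids, dst_ids, numbering


# Illustrative string ASINs; the trailing 'X' is preserved.
src = ["0425164349", "0425164349", "0899683061"]
dst = ["0385333498", "038533348X", "038533348X"]

src_r, dst_r, numbering = renumber(src, dst)
print(src_r, dst_r)
# Translate the renumbered destinations back to the original ASINs.
print([numbering[i] for i in dst_r])
```

The round trip through `numbering` is the same check the notebook performs in its "translate back" printout.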

In [100]:
sorted_combined_gdf.add_column('ASIN_int', sorted_combined_gdf['ASIN'].astype('int32'))
sorted_combined_gdf.add_column('Copurchase_ASIN_int', sorted_combined_gdf['Copurchase_ASIN'].astype('int32'))

In [101]:
sorted_combined_gdf.head().to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,SalesRank,TotalReviews,AvgRating,ASIN_int,Copurchase_ASIN_int
694075,37931,,370379,"Saluki Champions, 1952-1988",,Book,2031890,0,0.0,37931,0
507141,1047655,61007129.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61007129
507145,1047655,61007358.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61007358
507149,1047655,61007137.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61007137
507153,1047655,61099341.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61099341


In [102]:
sorted_combined_gdf

<cudf.DataFrame ncols=11 nrows=1037401 >

In [103]:
sorted_combined_gdf.dtypes

ASIN                    object
Copurchase_ASIN         object
Id                       int64
Title                   object
Categories              object
Group                   object
SalesRank                int64
TotalReviews             int64
AvgRating              float64
ASIN_int                 int32
Copurchase_ASIN_int      int32
dtype: object

In [104]:
G = cugraph.Graph()

src_r, dst_r, numbering = G.renumber(sorted_combined_gdf['ASIN_int'], sorted_combined_gdf['Copurchase_ASIN_int'])

In [105]:
renumbered_map_gdf = cudf.DataFrame()
renumbered_map_gdf.add_column("original_id", numbering)

In [106]:
sorted_combined_gdf.add_column("src_renumbered", src_r)
sorted_combined_gdf.add_column("dst_renumbered", dst_r)

In [107]:
sorted_combined_gdf.head(10).to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,SalesRank,TotalReviews,AvgRating,ASIN_int,Copurchase_ASIN_int,src_renumbered,dst_renumbered
694075,37931,,370379,"Saluki Champions, 1952-1988",,Book,2031890,0,0.0,37931,0,247725,0
507141,1047655,61007129.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61007129,354701,26322
507145,1047655,61007358.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61007358,354701,37041
507149,1047655,61007137.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61007137,354701,26683
507153,1047655,61099341.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61099341,354701,126539
507157,1047655,61007161.0,271961,Prodigal Daughter,general tape subjects literature contemporary ...,Book,1116690,30,30.0,1047655,61007161,354701,27795
873449,1053388,,466894,The Poetry of Lord Byron,general tape subjects literature authors lord ...,Book,1874503,0,0.0,1053388,0,236861,0
596387,1053736,393320979.0,316651,Sir Gawain and the Green Knight,general tape subjects literature short poetry ...,Book,53150,15,15.0,1053736,393320979,253550,265651
596391,1053736,395898714.0,316651,Sir Gawain and the Green Knight,general tape subjects literature short poetry ...,Book,53150,15,15.0,1053736,395898714,253550,147791
596405,1053736,440905605.0,316651,Sir Gawain and the Green Knight,general tape subjects literature short poetry ...,Book,53150,15,15.0,1053736,440905605,253550,21474


In [108]:
sorted_combined_gdf.dtypes

ASIN                    object
Copurchase_ASIN         object
Id                       int64
Title                   object
Categories              object
Group                   object
SalesRank                int64
TotalReviews             int64
AvgRating              float64
ASIN_int                 int32
Copurchase_ASIN_int      int32
src_renumbered           int32
dst_renumbered           int32
dtype: object

In [109]:
for i in range(10):
    print(" " + str(i) +
          ": (" + str(sorted_combined_gdf.ASIN_int[i]) + "," + 
          str(sorted_combined_gdf.Copurchase_ASIN_int[i]) +")"
          ", renumbered: (" + str(sorted_combined_gdf.src_renumbered[i]) + "," + 
          str(sorted_combined_gdf.dst_renumbered[i]) +")"
          ", translate back: (" + str(numbering[sorted_combined_gdf.src_renumbered[i]]) + "," +
          str(numbering[sorted_combined_gdf.dst_renumbered[i]]) +")"
         )


 0: (37931,0), renumbered: (247725,0), translate back: (37931,0)
 1: (1047655,61007129), renumbered: (354701,26322), translate back: (1047655,61007129)
 2: (1047655,61007358), renumbered: (354701,37041), translate back: (1047655,61007358)
 3: (1047655,61007137), renumbered: (354701,26683), translate back: (1047655,61007137)
 4: (1047655,61099341), renumbered: (354701,126539), translate back: (1047655,61099341)
 5: (1047655,61007161), renumbered: (354701,27795), translate back: (1047655,61007161)
 6: (1053388,0), renumbered: (236861,0), translate back: (1053388,0)
 7: (1053736,393320979), renumbered: (253550,265651), translate back: (1053736,393320979)
 8: (1053736,395898714), renumbered: (253550,147791), translate back: (1053736,395898714)
 9: (1053736,440905605), renumbered: (253550,21474), translate back: (1053736,440905605)


In [110]:
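def mapping_to_original(gdf):
    # Print each edge as original ids, renumbered ids, and the ids recovered via `numbering`.
    # Iterate over every row (range(len(gdf) - 1) would skip the last one).
    for i in range(len(gdf)):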
        print(" " + str(i) +
              ": (" + str(gdf.ASIN_int[i]) + "," + 
              str(gdf.Copurchase_ASIN_int[i]) +")"
              ", renumbered: (" + str(gdf.src_renumbered[i]) + "," + 
              str(gdf.dst_renumbered[i]) +")"
              ", translate back: (" + str(numbering[gdf.src_renumbered[i]]) + "," +
              str(numbering[gdf.dst_renumbered[i]]) +")"
             )


In [111]:
query_1 = sorted_combined_gdf[sorted_combined_gdf.src_renumbered == 108718]
query_2 = sorted_combined_gdf[sorted_combined_gdf.src_renumbered == 160]

In [112]:
query_1.to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,SalesRank,TotalReviews,AvgRating,ASIN_int,Copurchase_ASIN_int,src_renumbered,dst_renumbered
220416,425164349,0385334206,125240,Timequake,general science subjects literature fantasy au...,Book,20867,176,176.0,425164349,385334206,108718,239424
220420,425164349,0385333498,125240,Timequake,general science subjects literature fantasy au...,Book,20867,176,176.0,425164349,385333498,108718,205098
220424,425164349,038533348X,125240,Timequake,general science subjects literature fantasy au...,Book,20867,176,176.0,425164349,38533348,108718,136798
220428,425164349,0385333501,125240,Timequake,general science subjects literature fantasy au...,Book,20867,176,176.0,425164349,385333501,108718,205265
223679,425164349,0425130215,125240,Timequake,general science subjects literature fantasy au...,Book,20867,176,176.0,425164349,425130215,108718,43672


In [113]:
query_2.to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,SalesRank,TotalReviews,AvgRating,ASIN_int,Copurchase_ASIN_int,src_renumbered,dst_renumbered
811651,899683061,0385333498,415390,Venus on the Half-Shell,fantasy science subjects general fiction books,Book,104045,39,39.0,899683061,385333498,160,205098
811655,899683061,0899667570,415390,Venus on the Half-Shell,fantasy science subjects general fiction books,Book,104045,39,39.0,899683061,899667570,160,41952
811659,899683061,0425164349,415390,Venus on the Half-Shell,fantasy science subjects general fiction books,Book,104045,39,39.0,899683061,425164349,160,108718
811663,899683061,038533348X,415390,Venus on the Half-Shell,fantasy science subjects general fiction books,Book,104045,39,39.0,899683061,38533348,160,136798
811677,899683061,0743422007,415390,Venus on the Half-Shell,fantasy science subjects general fiction books,Book,104045,39,39.0,899683061,743422007,160,328148


Create a directed graph of the copurchase network. Each edge points from one book to another: buying the 'src_renumbered' book is interpreted as influencing the purchase of the 'dst_renumbered' book. Because this is a unipartite graph, the similarity coefficient between a source and a destination is computed from the number of neighbor vertices they share.
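
As a worked illustration of the coefficient, the sketch below computes the Jaccard similarity of two vertices as the size of the intersection of their neighbor sets divided by the size of their union. It assumes a vertex's neighbors are the vertices its outgoing edges point to; how cuGraph treats direction internally may differ. The tiny edge list and the helper name `jaccard` are made up for the example.

```python
def jaccard(neighbors, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    a = neighbors.get(u, set())
    b = neighbors.get(v, set())
    union = a | b
    if not union:
        # Two vertices without neighbors share nothing; report 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical directed edges (source, destination).
edges = [(0, 2), (0, 3), (0, 4), (1, 3), (1, 4), (1, 5)]

# Build the out-neighbor set of every source vertex.
neighbors = {}
for s, d in edges:
    neighbors.setdefault(s, set()).add(d)

# Vertices 0 and 1 share {3, 4} out of {2, 3, 4, 5}: 2 / 4 = 0.5
print(jaccard(neighbors, 0, 1))
```

A coefficient of 0.0 means the two books have no neighbor in common, and 1.0 means their neighbor sets are identical.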

In [114]:
graph = cugraph.Graph()
graph.add_edge_list(sorted_combined_gdf["src_renumbered"], sorted_combined_gdf["dst_renumbered"])

In [115]:
%time jac_df = cugraph.jaccard(graph)

CPU times: user 6.69 ms, sys: 13.7 ms, total: 20.3 ms
Wall time: 19.2 ms


In [116]:
jac_df

<cudf.DataFrame ncols=3 nrows=1037401 >

In [117]:
graph.number_of_edges()

1037401

In [118]:
graph.number_of_vertices()

392970

In [119]:
degree = graph.degree()

In [120]:
in_degree = graph.in_degree()

In [121]:
out_degree = graph.out_degree()

In [122]:
degree_query = degree[degree.vertex == 160]
in_degree_query = in_degree[in_degree.vertex == 160]
out_degree_query = out_degree[out_degree.vertex == 160]

In [123]:
degree_query.to_pandas()

Unnamed: 0,vertex,degree
160,160,5


In [124]:
in_degree_query.to_pandas()

Unnamed: 0,vertex,degree
160,160,0


In [125]:
out_degree_query.to_pandas()

Unnamed: 0,vertex,degree
160,160,5
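
The three results above are consistent with degree = in_degree + out_degree for vertex 160 (5 = 0 + 5). A small plain-Python sketch of that bookkeeping, using only the five edges that leave vertex 160 in the earlier query output, might look like this. The `Counter`-based approach is an illustration, not how cuGraph computes degrees.

```python
from collections import Counter

# The five edges leaving vertex 160 (renumbered ids from the query above).
edges = [
    (160, 205098),
    (160, 41952),
    (160, 108718),
    (160, 136798),
    (160, 328148),
]

# Out-degree counts edges by source, in-degree counts edges by destination.
out_degree = Counter(s for s, _ in edges)
in_degree = Counter(d for _, d in edges)


def degree(vertex):
    """Total degree: incoming plus outgoing edges."""
    return in_degree[vertex] + out_degree[vertex]


print(out_degree[160], in_degree[160], degree(160))
```

With only this subset of edges, vertex 108718 has in-degree 1; in the full graph its degrees are larger because it also has its own outgoing edges.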


In [145]:
jac_query_1 = jac_df[jac_df.source == 2]
jac_query_2 = jac_df[jac_df.destination == 13578]

In [146]:
jac_query_1.to_pandas()

Unnamed: 0,source,destination,jaccard_coeff
1,2,1768,0.0
2,2,175960,0.0


In [128]:
jac_query_2.to_pandas()

Unnamed: 0,source,destination,jaccard_coeff
4675,1768,13578,0.0
143983,54703,13578,0.166667
198538,75315,13578,0.4
206705,78331,13578,0.25
224889,85144,13578,0.0
293393,111290,13578,0.4
323422,122522,13578,0.166667
463344,175115,13578,0.166667
989152,374721,13578,0.0


Select the edges whose Jaccard coefficient lies strictly between 0.1 and 0.3.

In [129]:
jac_count_query = jac_df[jac_df.jaccard_coeff < 0.3]
final_query = jac_count_query[jac_count_query.jaccard_coeff > 0.1]

In [131]:
final_query.sort_values("jaccard_coeff", ascending=True).to_pandas()

Unnamed: 0,source,destination,jaccard_coeff
65,19,357871,0.111111
95,31,22716,0.111111
98,31,251997,0.111111
110,36,362777,0.111111
147,54,215276,0.111111
170,62,258634,0.111111
298,112,25001,0.111111
411,160,41952,0.111111
413,160,136798,0.111111
414,160,205098,0.111111


Get every pair of vertices that are two hops apart.

two_hop = graph.get_two_hop_neighbors()

In [133]:
two_hop.to_pandas()

Unnamed: 0,first,second
0,2,13578
1,2,31899
2,2,154112
3,2,212003
4,2,218064
5,2,340652
6,3,9829
7,3,35350
8,3,69979
9,3,81636
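
For intuition, here is a plain-Python sketch of how such (first, second) pairs can be derived from an edge list. It assumes a two-hop pair is formed by following two directed edges in a row, and it drops pairs where both ends are the same vertex; both are assumptions of the sketch rather than documented cuGraph behavior. The sample edges reuse renumbered ids that appear in this notebook's outputs, and `two_hop_pairs` is a hypothetical helper name.

```python
from collections import defaultdict


def two_hop_pairs(edges):
    """Return sorted (first, second) pairs joined by a path of two edges."""
    out = defaultdict(set)
    for s, d in edges:
        out[s].add(d)

    pairs = set()
    for first, middles in out.items():
        for middle in middles:
            # out.get avoids inserting new keys while we iterate over out.
            for second in out.get(middle, ()):
                if second != first:
                    pairs.add((first, second))
    return sorted(pairs)


# A few directed edges, using renumbered ids seen in the outputs.
edges = [
    (2, 1768),
    (2, 175960),
    (1768, 13578),
    (1768, 31899),
    (374721, 31899),
]

print(two_hop_pairs(edges))
```

Here vertex 2 reaches 13578 and 31899 through the intermediate vertex 1768, which matches the first two rows of the table above.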


In [62]:
two_hop_query = two_hop[two_hop["first"] == 2]



In [143]:
new_query = sorted_combined_gdf[sorted_combined_gdf.dst_renumbered == 31899]

In [144]:
new_query.to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,SalesRank,TotalReviews,AvgRating,ASIN_int,Copurchase_ASIN_int,src_renumbered,dst_renumbered
740368,0060506539,1563384086,391724,When Religion Becomes Evil,subjects comparative religion spirituality soc...,Book,203685,25,25.0,60506539,1563384086,374721,31899
906298,0060556102,1563384086,482724,When Religion Becomes Evil: Five Warning Signs,subjects comparative religion spirituality soc...,Book,40349,25,25.0,60556102,1563384086,1768,31899
543799,1563383624,1563384086,292054,Jesus Against Christianity: Reclaiming the Mis...,reference subjects christology religion theolo...,Book,192034,6,6.0,1563383624,1563384086,10303,31899
376103,157075134X,1563384086,203476,School of Assassins: The Case for Closing the ...,social general subjects nonfiction sciences so...,Book,884968,6,6.0,157075134,1563384086,216607,31899
246589,1570753857,1563384086,137358,"School of Assassins: Guns, Greed, and Globaliz...",events social subjects general terrorism nonfi...,Book,376163,6,6.0,1570753857,1563384086,323191,31899


In [148]:
query_1 = sorted_combined_gdf[sorted_combined_gdf.src_renumbered == 2]

In [149]:
query_1.to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,SalesRank,TotalReviews,AvgRating,ASIN_int,Copurchase_ASIN_int,src_renumbered,dst_renumbered
720259,077880027X,778800482,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,778800482,392926,9537
720263,077880027X,965260313,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,965260313,392926,5139
720267,077880027X,671750194,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,671750194,392926,301471
720271,077880027X,553380907,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,553380907,392926,246334
720275,077880027X,1579547222,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,1579547222,392926,141142


In [64]:
two_hop_query.to_pandas()

Unnamed: 0,first,second
0,2,13578
1,2,31899
2,2,154112
3,2,212003
4,2,218064
5,2,340652


In [151]:
rows = sorted_combined_gdf[sorted_combined_gdf.ASIN_int == 77880027]
rows.to_pandas()

Unnamed: 0,ASIN,Copurchase_ASIN,Id,Title,Categories,Group,SalesRank,TotalReviews,AvgRating,ASIN_int,Copurchase_ASIN_int,src_renumbered,dst_renumbered
720259,077880027X,778800482,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,778800482,392926,9537
720263,077880027X,965260313,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,965260313,392926,5139
720267,077880027X,671750194,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,671750194,392926,301471
720271,077880027X,553380907,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,553380907,392926,246334
720275,077880027X,1579547222,388677,Better Baby Food: Your Essential Guide to Nutr...,general parenting subjects body wine medicine ...,Book,130103,15,15.0,77880027,1579547222,392926,141142


Get the renumbered id for a given ASIN_int.

In [156]:
ASIN = '077880027X'
ASIN_int = 77880027
renum_ASIN = numbering[numbering == ASIN_int].index[0]

In [None]:
jac_df.sort_values("jaccard_coeff", ascending=False).to_pandas()

In [157]:
%%time
edge_ls = cudf.DataFrame()
edge_ls["second"] = two_hop.second.unique().astype("int32")

CPU times: user 278 ms, sys: 9.48 ms, total: 288 ms
Wall time: 287 ms


In [158]:
%%time
edge_ls["first"] = renum_ASIN.astype("int32")

CPU times: user 214 ms, sys: 8.31 ms, total: 222 ms
Wall time: 220 ms


In [160]:
%%time
edge_ls.sort_values("second", ascending=False).head().to_pandas()

CPU times: user 8.08 ms, sys: 16 ms, total: 24.1 ms
Wall time: 22.8 ms


Unnamed: 0,second,first
145383,392963,392926
145382,392962,392926
145381,392960,392926
145380,392954,392926
145379,392952,392926


In [161]:
%%time
jacc = cugraph.jaccard(graph, first=edge_ls["first"], second=edge_ls["second"])

CPU times: user 5.23 ms, sys: 593 µs, total: 5.82 ms
Wall time: 4.85 ms


Sort by Jaccard coefficient in descending order to see the most similar books. The top row, with a coefficient of 1.0, is the book compared with itself.

In [163]:
%%time
jacc.sort_values("jaccard_coeff", ascending=False).head(15).to_pandas()

CPU times: user 20.1 ms, sys: 12.1 ms, total: 32.2 ms
Wall time: 31 ms


Unnamed: 0,source,destination,jaccard_coeff
145369,392926,392926,1.0
122993,392926,332769,0.5
12941,392926,35293,0.285714
91078,392926,246334,0.285714
111407,392926,301471,0.285714
139332,392926,376868,0.285714
31412,392926,85045,0.25
42625,392926,115568,0.25
55881,392926,151095,0.25
16836,392926,45871,0.125


Sample three books from the subgraph around the book whose renumbered id (renum_ASIN) is 392926.

In [178]:
ASIN = '077880027X'
ASIN_int = 77880027
renum_ASIN = numbering[numbering == ASIN_int].index[0]
renum_ASIN

392926

In [184]:
ASIN_int_1 = numbering[392926]
ASIN_int_1

77880027

In [235]:
book_title = sorted_combined_gdf[sorted_combined_gdf.ASIN_int == ASIN_int_1].Title.unique()
book_cats = sorted_combined_gdf[sorted_combined_gdf.ASIN_int == ASIN_int_1].Categories.unique()
book_cats_ls_1 = book_cats.str.split()

print(book_title)
book_cats_ls_1.to_pandas()

0    Better Baby Food: Your Essential Guide to Nutrition, Feeding & Cooking for Your Baby & Toddler
Name: Title, dtype: object


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,general,parenting,subjects,body,wine,medicine,families,medical,health,babies,technical,food,cooking,mind,toddlers,nutrition,nursing,books,professional


Look up the ASIN_int of a renumbered id in the numbering map.

In [197]:
ASIN_int_2 = numbering[332769]
ASIN_int_2

789471906

In [208]:
book_title = sorted_combined_gdf[sorted_combined_gdf.ASIN_int == ASIN_int_2].Title.unique()
book_cats = sorted_combined_gdf[sorted_combined_gdf.ASIN_int == ASIN_int_2].Categories.unique()
book_cats_ls_2 = book_cats.str.split()

print(book_title)
book_cats_ls_2.to_pandas()

0    Organic Baby and Toddler Cookbook (Organic)
Name: Title, dtype: object
0    general reference subjects healthy wine parenting families special vegetables health food cooking nutrition diet books vegetarian
Name: Categories, dtype: object


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,general,reference,subjects,healthy,wine,parenting,families,special,vegetables,health,food,cooking,nutrition,diet,books,vegetarian


In [237]:
ASIN_int_3 = numbering[152064]
ASIN_int_3

71387765

In [239]:
book_title = sorted_combined_gdf[sorted_combined_gdf.ASIN_int == ASIN_int_3].Title.unique()
book_cats = sorted_combined_gdf[sorted_combined_gdf.ASIN_int == ASIN_int_3].Categories.unique()
book_cats_ls_3 = book_cats.str.split()

print(book_title)
book_cats_ls_3.to_pandas()

0    Baby Signs: How to Talk with Your Baby Before Your Baby Can Talk, New Edition
Name: Title, dtype: object


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,general,body,subjects,parents,infant,parenting,infants,development,families,health,babies,music,specialty,counseling,videos,toddlers,psychology,mind,stores,books


In [230]:
def find_common_categories(A, B):
    # Print every category that appears in both single-row category tables A and B.
    for i in A:
        for j in B:
            if A[i][0] == B[j][0]:
                print(A[i][0], end=" ")

In [240]:
find_common_categories(book_cats_ls_1, book_cats_ls_3)

general parenting subjects body families health babies mind toddlers books 
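
The nested loop above compares every category of one book with every category of the other. A possible plain-Python alternative, shown here as a sketch, uses a set for the membership test and keeps the order of the first book's categories. The function name `common_categories` is hypothetical; the two category lists are copied from the outputs for "Better Baby Food" and "Baby Signs" above.

```python
def common_categories(cats_a, cats_b):
    """Return the categories of cats_a that also occur in cats_b, in order."""
    in_b = set(cats_b)
    return [cat for cat in cats_a if cat in in_b]


# Categories of "Better Baby Food" (book_cats_ls_1 above).
cats_1 = (
    "general parenting subjects body wine medicine families medical health "
    "babies technical food cooking mind toddlers nutrition nursing books "
    "professional"
).split()

# Categories of "Baby Signs" (book_cats_ls_3 above).
cats_3 = (
    "general body subjects parents infant parenting infants development "
    "families health babies music specialty counseling videos toddlers "
    "psychology mind stores books"
).split()

shared = common_categories(cats_1, cats_3)
print(" ".join(shared))
```

For these two books the sketch yields the same ten shared categories as the notebook's printed result.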