## Generating Graph Embeddings

We have already constructed our graphs. To make meaningful comparisons between the products that form their nodes, we now need a numerical representation of each node. Node embeddings provide this: each node is mapped to a vector, and nodes that sit close together in the graph end up with similar vectors. In this sense, embeddings capture the relational ("social") structure of our graphs.

We can obtain these embeddings using the [DeepWalk](https://arxiv.org/pdf/1403.6652.pdf) and [Node2Vec](https://arxiv.org/pdf/1607.00653.pdf) algorithms.

We first need to generate [Random Walks](https://cse.iitkgp.ac.in/~pawang/courses/SC15/rw1.pdf) over the graph in order to learn the node embeddings. In simple terms, given a graph and a starting node, we select one of its neighbors at random and move to it, then select a random neighbor of that node and move to it, and so on. If we repeat this, say, 50 times, the resulting sequence of visited nodes is a random walk of length 50 on the graph.

Because each walk can take a different route, we generate several walks per node rather than just one.
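
The procedure described above can be sketched in a few lines of plain Python. This is only an illustration on a made-up toy graph: the node names, the `random_walk` helper and the uniform choice of neighbors (edge weights are ignored here) are assumptions for the example, not what PecanPy does internally.

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Return the list of visited nodes for a walk of `walk_length` steps."""
    walk = [start]
    for _ in range(walk_length):
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        # Pick the next node uniformly at random among the neighbors
        walk.append(rng.choice(neighbors))
    return walk

# Toy undirected graph (hypothetical nodes), stored as adjacency lists
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

rng = random.Random(42)
walks_per_node = 3  # several walks per node, since each one can differ
toy_walks = [
    random_walk(toy_graph, node, walk_length=5, rng=rng)
    for node in toy_graph
    for _ in range(walks_per_node)
]
print(len(toy_walks))  # 12 walks: 4 nodes x 3 walks each
print(toy_walks[0])
```

A walk of 5 steps contains 6 nodes, because the starting node is included.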

### Generating Embeddings using DeepWalk

We first generate random walks for each node (say, 10 walks per node, each of length 50). We then treat these walks as sentences and train a Word2Vec (skip-gram) model on them to obtain the node embeddings.

We will use the [PecanPy](https://github.com/krishnanlab/PecanPy) library for this. It lets us run both DeepWalk and Node2Vec by changing a few parameters. PecanPy is a fast and memory-efficient implementation of node2vec. The original implementations suffer from issues such as:

    - Walk generation is not parallelized, even though it is an inherently parallelizable task.
    - Preprocessing and storing the 2nd order transition probabilities causes memory issues.
    - The graph is stored with networkx, which is quite inefficient for large-scale computation.

This [blogpost](https://towardsdatascience.com/run-node2vec-faster-with-less-memory-using-pecanpy-1bdf31f136de) discusses these issues in a lot more detail.

In [1]:
import pandas as pd
import numpy as np
from pecanpy import pecanpy
from gensim.models import Word2Vec

We will first look at product views. For this we can use any of the graphs we generated earlier. Let us consider the undirected weighted graph of views for this analysis.

We could just as easily pick one of the other event types or graph types for our analysis. We would only need to change the filenames accordingly.

In [4]:
views_graph = pd.read_parquet('../Data/Graphs/undir_weight_graph_views.parquet')
views_graph

| | product_id1 | product_id2 | count |
|---|---|---|---|
| 0 | 3601512 | 3601269 | 37 |
| 1 | 100059274 | 3601306 | 12 |
| 2 | 5100855 | 4804056 | 880 |
| 3 | 100041977 | 100041934 | 2 |
| 4 | 100057579 | 100057560 | 11 |
| ... | ... | ... | ... |
| 6846587 | 12300752 | 4803895 | 1 |
| 6846588 | 100017063 | 6501098 | 1 |
| 6846589 | 8800741 | 1005195 | 1 |
| 6846590 | 25200369 | 23900098 | 1 |


In [5]:
print("The number of nodes in the graph is", len(set(views_graph['product_id1'].unique()).union(views_graph['product_id2'].unique())))
print("The number of edges in the graph is", len(views_graph))

The number of nodes in the graph is 211861
The number of edges in the graph is 6846592


PecanPy accepts input graphs in the '.edg' format, which here is a tab-separated edge list with one edge per line and no header. So we first need to store our graph in that format. We store these graphs in the 'PecanPy_Graphs' folder.

In [6]:
views_graph.to_csv('../Data/PecanPy_Graphs/undir_weight_graph_views.edg', sep='\t', index=False, header=False)
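
To make the file layout concrete, here is a small standard-library sketch that writes two edges in the same shape as the `to_csv` call above (no header, tab-separated, columns `product_id1`, `product_id2`, `count`) and reads them back. The temporary path and the name `toy_graph.edg` are placeholders for illustration only.

```python
import csv
import os
import tempfile

# Two edges taken from the table above: (product_id1, product_id2, count)
toy_edges = [
    (3601512, 3601269, 37),
    (100059274, 3601306, 12),
]

# Write a headerless, tab-separated edge list to a temporary file
edg_path = os.path.join(tempfile.mkdtemp(), "toy_graph.edg")
with open(edg_path, "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(toy_edges)

# Read it back: each line is "node1<TAB>node2<TAB>weight"
with open(edg_path) as handle:
    rows = [line.rstrip("\n").split("\t") for line in handle]

print(rows)
# [['3601512', '3601269', '37'], ['100059274', '3601306', '12']]
```

Note that the ids come back as strings once they are read from a text file, which is why the node ids in the generated walks are strings as well.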

#### Random Walk Generation

PecanPy implements node2vec, which generates random walks that interpolate between breadth-first (BFS-like) and depth-first (DFS-like) exploration of the graph. This trade-off is controlled by two parameters: 'p' (the return parameter) and 'q' (the in-out parameter). DeepWalk corresponds to the special case where both are set to 1.

In [7]:
# For Deepwalk we simply set p=1 and q=1 
v_graph = pecanpy.SparseOTF(p=1, q=1, workers=-1, verbose=True, extend=False)
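
The effect of p and q on a single step can be sketched as follows. This follows the search bias from the node2vec paper (1/p for returning to the previous node, 1 for a common neighbor of the previous node, 1/q otherwise); the helper names and the toy graph are made up for illustration, and edge weights are left out to keep the example short.

```python
def node2vec_bias(prev_node, next_node, adjacency, p, q):
    """Unnormalized bias for stepping to `next_node` after arriving from `prev_node`."""
    if next_node == prev_node:
        return 1.0 / p  # step straight back
    if next_node in adjacency[prev_node]:
        return 1.0      # stay close to the previous node
    return 1.0 / q      # move further away from the previous node

def step_probabilities(prev_node, current_node, adjacency, p, q):
    """Normalize the biases over all neighbors of the current node."""
    candidates = sorted(adjacency[current_node])
    biases = [node2vec_bias(prev_node, c, adjacency, p, q) for c in candidates]
    total = sum(biases)
    return {c: b / total for c, b in zip(candidates, biases)}

# Hypothetical graph: the walk has just moved from "t" to "v"
toy_adjacency = {
    "t": {"v", "x1"},
    "v": {"t", "x1", "x2"},
    "x1": {"t", "v"},
    "x2": {"v"},
}

# p = q = 1 (DeepWalk): every neighbor of "v" is equally likely
deepwalk_like = step_probabilities("t", "v", toy_adjacency, p=1, q=1)
# q < 1: the far node "x2" becomes more likely than the others
outward = step_probabilities("t", "v", toy_adjacency, p=1, q=0.5)
print(deepwalk_like)
print(outward)  # {'t': 0.25, 'x1': 0.25, 'x2': 0.5}
```

In a weighted graph, each bias would additionally be multiplied by the weight of the corresponding edge before normalizing.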

In [8]:
# We have set weighted to True and set directed to False because we are using an undirected weighted graph for views.
v_graph.read_edg('../Data/PecanPy_Graphs/undir_weight_graph_views.edg', weighted=True, directed=False)

# We generate 10 random walks per node and each walk is of length 50
walks = v_graph.simulate_walks(num_walks=10, walk_length=50)

  0%|                                               | 0/2118610 [00:00<?, ?it/s]

In [9]:
# We look at one of the walks generated for the nodes
print(walks[0])

['100028635', '100008532', '28718344', '100028626', '100028648', '28716341', '28721605', '28718996', '28721605', '28721604', '100044499', '28722269', '28718867', '28717988', '28717986', '28720817', '28717986', '28719275', '28717946', '28719275', '28717980', '28719115', '28719132', '28718352', '28722207', '45600073', '45600114', '45600109', '12718062', '12719529', '12713110', '12710815', '12705468', '12716207', '12715137', '12702049', '12707874', '12719390', '12708290', '100023385', '12718994', '12714511', '12700236', '12704352', '12705983', '12702915', '12703402', '12703401', '100031717', '100048700', '12714716']


Next, we train a Word2Vec model on these walks to turn them into embeddings.

The parameter choices are explained below:

    - hs = 1 : for using hierarchical softmax
    - sg = 1 : for using skipgrams 
    - vector_size = 128 : this is the size of our embeddings
    - window = 5 : window size, i.e. the maximum distance between the current and the predicted node within a walk
    - min_count = 1 : ignore nodes that occur fewer than 1 time in the walks, i.e. keep every node
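
To see what the `window` parameter means for a walk, the sketch below lists the (center, context) pairs that a skip-gram model would be trained on. It is a simplified illustration with a made-up helper `skipgram_pairs` and a toy walk. Actual Word2Vec implementations add further details on top of this basic idea.

```python
def skipgram_pairs(walk, window):
    """List (center, context) pairs for every node within `window` positions."""
    pairs = []
    for i, center in enumerate(walk):
        start = max(0, i - window)
        stop = min(len(walk), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs

toy_walk = ["a", "b", "c", "d"]  # hypothetical walk of four nodes
pairs = skipgram_pairs(toy_walk, window=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

With `window=5`, each node in a walk is paired with up to five nodes on either side, so products that appear near each other in many walks are pushed towards similar vectors.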
    
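We set the seed so that results are as reproducible as possible each time we run this notebook. Note that gensim only guarantees fully deterministic results when training with a single worker thread.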
 

In [10]:
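import os

# gensim expects a positive worker count; with workers=-1 no training threads
# are started, which would leave the vectors at their random initial values.
model = Word2Vec(walks,
                 hs=1,
                 sg=1,
                 vector_size=128,
                 window=5,
                 min_count=1,
                 workers=os.cpu_count() or 1,
                 seed=42)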

In [11]:
model.save('../Data/Models/deepwalk_undir_weight_graph_views.model')

In [12]:
product_ids = set(views_graph['product_id1'].unique()).union(views_graph['product_id2'].unique())
vg_embeddings = []
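for i in product_ids:
    try:
        # Word2Vec keys are strings, so we look the id up as a string
        vg_embeddings.append({'product_id': i, 'embedding_vector': model.wv[str(i)]})
    except KeyError:
        # The node is missing from the model vocabulary, so it has no vector
        print(i, "Not Exist")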

In [13]:
views_graph_embeddings = pd.DataFrame(vg_embeddings)
views_graph_embeddings.to_parquet('../Data/Embeddings/deepwalk_undir_weight_graph_views_embedding.parquet', index=False)


We have successfully saved our embeddings and the trained model.
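
Since the goal of these embeddings is to compare products, a common way to do so is cosine similarity between two embedding vectors. The sketch below uses made-up three-dimensional vectors and hypothetical product names purely for illustration; the real vectors have 128 dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, not taken from the trained model
toy_embeddings = {
    "product_a": [1.0, 0.0, 1.0],
    "product_b": [2.0, 0.0, 2.0],  # same direction as product_a
    "product_c": [0.0, 1.0, 0.0],  # orthogonal to product_a
}

sim_ab = cosine_similarity(toy_embeddings["product_a"], toy_embeddings["product_b"])
sim_ac = cosine_similarity(toy_embeddings["product_a"], toy_embeddings["product_c"])
print(round(sim_ab, 6), round(sim_ac, 6))  # 1.0 0.0
```

For the trained model itself, gensim's `model.wv.most_similar(key)` ranks the other nodes by this same measure.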

<hr> 

### Generating Embeddings using Node2Vec

We discussed above what p and q are. The return parameter p controls how likely a walk is to step straight back to the node it just came from: if p is small, the walk backtracks often and remains within the local neighborhood, whereas if p is large, it is less likely to revisit nodes and explores further. Similarly, if q is small, the walk is biased towards nodes further away from the previous node, so depth-first exploration is favored, and if q is large, it stays close to the previous node, so breadth-first exploration is favored.

We shall keep p as 1, but reduce the value of q, which pushes the walks to explore more deeply. This should help us discover clusters or communities of nodes that frequently co-occur, which in our case are products viewed together.
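
As a rough illustration of how such a biased walk could be simulated on a weighted graph, the sketch below multiplies each edge weight by the node2vec bias (1/p, 1 or 1/q) and samples the next node accordingly. The graph, the function name `biased_walk` and the data layout are assumptions made for this example, and PecanPy's actual implementation is far more optimized than this.

```python
import random

def biased_walk(graph, start, walk_length, p, q, rng):
    """Second-order walk on a weighted graph given as {node: {neighbor: weight}}."""
    walk = [start]
    while len(walk) < walk_length + 1:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break
        if len(walk) == 1:
            # First step: there is no previous node, so only edge weights matter
            weights = [graph[current][n] for n in neighbors]
        else:
            previous = walk[-2]
            weights = []
            for n in neighbors:
                if n == previous:
                    bias = 1.0 / p   # return to the previous node
                elif n in graph[previous]:
                    bias = 1.0       # common neighbor of the previous node
                else:
                    bias = 1.0 / q   # move away from the previous node
                weights.append(graph[current][n] * bias)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Hypothetical graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d
toy_weighted_graph = {
    "a": {"b": 5, "c": 2},
    "b": {"a": 5, "c": 1},
    "c": {"a": 2, "b": 1, "d": 1},
    "d": {"c": 1, "e": 3, "f": 3},
    "e": {"d": 3, "f": 4},
    "f": {"d": 3, "e": 4},
}

toy_biased_walk = biased_walk(toy_weighted_graph, "a", walk_length=10,
                              p=1, q=0.5, rng=random.Random(42))
print(toy_biased_walk)
```

With q below 1, steps that lead away from the previous node get a larger share of the probability than they would under DeepWalk's p = q = 1 setting.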

In [24]:
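# For Node2Vec we keep p=1 and lower q to 0.5.
# Note: extend=True switches on PecanPy's node2vec+ extension, unlike the DeepWalk run above.
views_graph_n2v = pecanpy.SparseOTF(p=1, q=0.5, workers=-1, verbose=True, extend=True)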

In [25]:
views_graph_n2v.read_edg('../Data/PecanPy_Graphs/undir_weight_graph_views.edg', weighted=True, directed=False)

walks = views_graph_n2v.simulate_walks(num_walks=10, walk_length=50)

  0%|                                               | 0/2118170 [00:00<?, ?it/s]

We train the Word2Vec model once again.

In [28]:
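import os

# gensim expects a positive worker count; with workers=-1 no training threads
# are started, which would leave the vectors at their random initial values.
model_n2v = Word2Vec(walks,
                     hs=1,
                     sg=1,
                     vector_size=128,
                     window=5,
                     min_count=1,
                     workers=os.cpu_count() or 1,
                     seed=42)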

In [29]:
model_n2v.save('../Data/Models/node2vec_undir_weight_graph_views.model')

In [30]:
product_ids = set(views_graph['product_id1'].unique()).union(views_graph['product_id2'].unique())
vg_embeddings_n2v = []
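for i in product_ids:
    try:
        # Word2Vec keys are strings, so we look the id up as a string
        vg_embeddings_n2v.append({'product_id': i, 'embedding_vector': model_n2v.wv[str(i)]})
    except KeyError:
        # The node is missing from the model vocabulary, so it has no vector
        print(i, "Not Exist")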

In [None]:
views_graph_embeddings_n2v = pd.DataFrame(vg_embeddings_n2v)
views_graph_embeddings_n2v.to_parquet('../Data/Embeddings/node2vec_undir_weight_graph_views_embedding.parquet', index=False)


We have now saved the models and embeddings for both DeepWalk and Node2Vec. We did this for product views using the undirected weighted graph. We could do the same with the other graph variants and the other event types.