# Document and Node Embeddings

In this notebook we outline the process of creating vector representations of the text of the Wikipedia articles as well as of their nodes in the network. The best way to interact with and understand the embeddings is in [`tensorboard`](https://www.tensorflow.org/tensorboard), and we have hosted the data at the two links below for you to play around with. We recommend trying both PCA and UMAP for dimensionality reduction.

* [Document Embeddings](https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/MatPiq/139f3fccd6f0d0c6c9077f3aa87bd301/raw/0ab01b0d1615ac1f8ff686775fb9afa8b43e5422/config.json)
* [Node Embeddings]

The rest of this page is structured as follows: first, we compute the document embeddings and give a brief explanation of the method. Next, we do the same for the node embeddings. We finish with an analysis of the similarities between the document and node embeddings, computing the correlation coefficient between their corresponding principal components.

In [34]:
import time
from gensim.models.doc2vec import Doc2Vec
from collections import namedtuple
from node2vec import Node2Vec
import networkx as nx
import numpy as np
import pandas as pd
data = pd.read_csv('wiki_df.csv.gz')

## Document Embeddings

Document embeddings (`doc2vec`) are an extension of `word2vec` that lets us estimate vector representations of documents using shallow neural networks {cite}`le2014distribute, mikolov2013efficien`. In our case the documents are the Wikipedia articles corresponding to the social science disciplines. The task of the network is to predict a word $x_k$ from its context, the surrounding words $x_{k-c}, \dots, x_{k+c}$, where $c$ is the size of the window. While this task is not very interesting in itself, it forces the hidden layer to learn a numerical representation of the words that takes context into account. By also including an indicator variable $x_{ck}$ for each document, we simultaneously learn a representation of the documents in the same latent vector space. A paper by {citep}`rheault2020word`, for example, showed that one can extract meaningful representations of the ideology of politicians and parties from a parliamentary corpus. The gif below is borrowed from https://github.com/tsandefer/dsi_capstone_2 and shows a simplified visualization of the architecture.

![gid](https://raw.githubusercontent.com/tsandefer/dsi_capstone_2/master/images/model_path.gif)
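To make the prediction task concrete, the following is a minimal sketch of how training examples could be enumerated for one document: each target word is paired with up to $c$ words on either side, together with the document tag. The function name `training_examples` and the toy sentence are illustrative assumptions, not part of `gensim`, which builds these windows internally.

```python
from typing import List, Tuple


def training_examples(tag: str, tokens: List[str], c: int) -> List[Tuple[str, List[str], str]]:
    """Return (document tag, context words, target word) triples.

    Hypothetical helper for illustration only; c is the window size.
    """
    examples = []
    for k, target in enumerate(tokens):
        # Up to c words to the left and up to c words to the right of position k.
        left = tokens[max(0, k - c):k]
        right = tokens[k + 1:k + 1 + c]
        examples.append((tag, left + right, target))
    return examples


# Toy document (assumed text, not taken from the data set)
tokens = "sociology is the study of society".split()
for tag, context, target in training_examples("Sociology", tokens, c=2):
    print(f"{tag}: {context} -> {target!r}")
```

Words near the start or end of a document simply get a shorter context, which is why the slices are clipped rather than padded in this sketch.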

To run `doc2vec` we first need to create a list of documents. Each document is a `namedtuple` containing the tokenized text and the indicator variable (tag) for the article. We also set several hyperparameters, the most important being the dimension of the hidden layer, `vector_size`. Since the data set is relatively small we set this to 64; values in the range of 200-300 are common for larger corpora.

In [11]:
docs = []
# Define a lightweight document type: a list of tokens plus a list of tags
document_tup = namedtuple('Doc', 'words, tags')
for _, row in data.iterrows():
    # Ignore empty articles (missing text is not a str)
    if isinstance(row['cleaned_text'], str):
        docs.append(document_tup(row['cleaned_text'].split(),
                                 [row['name']]))

In [12]:
def doc2vec(docs: list, vector_size, window,
            min_count, workers, epochs):
    """
    Fits a doc2vec model on a list of namedtuples with `words` and `tags` fields.
    Returns the trained model.
    """
    start_time = time.time()
    model = Doc2Vec(vector_size=vector_size, window=window, min_count=min_count, workers=workers, epochs=epochs)
    print(f'Starting to build the vocabulary based on {len(docs)} documents...')
    model.build_vocab(docs)
    print(f'Starting to train the model for {epochs} epochs and with vector size {vector_size}...')
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    print(f'Finished. Total time to train: {(time.time()-start_time) / 60} min...')
    return model

model = doc2vec(docs, vector_size=64, window=20, min_count=10, workers=8, epochs=5)

Starting to build the vocabulary based on 5954 documents...
Starting to train the model for 5 epochs and with vector size 64...
Finished. Total time to train: 0.31995852788289386 min...


In [65]:
#Extract the document embeddings from trained model
doc_labs = list(model.dv.key_to_index.keys())
doc_embs = np.array([model.dv[lab] for lab in doc_labs])
doc_df = pd.DataFrame(doc_embs, index = doc_labs)
#Save the embeddings and labels locally as TSV
with open('document_meta.tsv','w+', encoding='utf-8') as file_metadata:
    for lab in doc_labs:
        file_metadata.write(lab+'\n')

doc_df.to_csv('document_embeddings.tsv', sep='\t', index=False, header=False)
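Besides the projector, a quick way to sanity-check embeddings is to look up the nearest neighbours of an article by cosine similarity. The sketch below uses plain Python and a made-up three-dimensional dictionary `toy`; the helper names `cosine` and `nearest` and the vectors are assumptions for illustration, not values from the trained model.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(label: str, embeddings: Dict[str, List[float]], top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n labels most similar to `label`, best first."""
    query = embeddings[label]
    scores = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != label]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy embeddings (hypothetical values, not model output)
toy = {
    "Sociology": [0.9, 0.1, 0.0],
    "Social theory": [0.8, 0.2, 0.1],
    "Econometrics": [0.0, 0.9, 0.4],
}
print(nearest("Sociology", toy, top_n=2))
```

On the real model, `gensim` should offer a comparable lookup through `model.dv.most_similar`, which is likely the more convenient route in practice.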

## Node Embeddings

`Node2vec` was introduced in {citep}`grover2016node2vec` and is in many ways similar to `doc2vec` as explained above, with the notable difference that we are working with a graph instead of a collection of text documents. The trick of `node2vec` is to first represent the graph as sequences of node identifiers that encode the connections between nodes. As visualized below, this is done by taking random walks in the graph and letting the visited nodes form artificial "sentences". This yields a data structure that can be passed to the ordinary `word2vec` model, which generates one embedding for each node.

![node](https://miro.medium.com/max/1838/1*GbZk_M_HqCu8Y99J_FzhQw.gif)
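The walk-generation step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: it samples neighbours uniformly, whereas `node2vec` biases the walks with its return and in-out parameters `p` and `q` (uniform sampling corresponds to `p = q = 1` on an unweighted graph). The function `random_walks` and the toy adjacency dictionary are hypothetical and not part of the `node2vec` package.

```python
import random
from typing import Dict, List


def random_walks(adjacency: Dict[str, List[str]], walk_length: int,
                 num_walks: int, seed: int = 0) -> List[List[str]]:
    """Generate `num_walks` uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Toy undirected graph stored as an adjacency list (assumed example)
toy_graph = {
    "Sociology": ["Economics", "Anthropology"],
    "Economics": ["Sociology"],
    "Anthropology": ["Sociology"],
}
walks = random_walks(toy_graph, walk_length=5, num_walks=2)
print(walks[0])  # one artificial "sentence" of node names
```

Each walk is a list of node names, so the full list of walks has the same shape as a tokenized text corpus and can be handed to a `word2vec`-style model.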

Before running the model we load the edge list and create the undirected `networkx` graph object.

In [23]:
edgelist = pd.read_pickle("https://drive.google.com/uc?export=download&id=1x1WOVm5Wp6SLfN1sePSdgorbAaGQSYR3")
G = nx.Graph()
G.add_edges_from(edgelist)

In [24]:
#Train the model
model = Node2Vec(G, dimensions=64, walk_length=40, num_walks=200, workers=8).fit()

Computing transition probabilities: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4186/4186 [00:09<00:00, 420.27it/s]
Generating walks (CPU: 1): 100%|██████████| 25/25 [01:20<00:00,  3.22s/it]
Generating walks (CPU: 2): 100%|██████████| 25/25 [01:22<00:00,  3.31s/it]
Generating walks (CPU: 4): 100%|██████████| 25/25 [01:29<00:00,  3.60s/it]
Generating walks (CPU: 6): 100%|██████████| 25/25 [01:43<00:00,  4.14s/it]
Generating walks (CPU: 8): 100%|██████████| 25/25 [01:40<00:00,  4.04s/it]
Generating walks (CPU: 3): 100%|██████████| 25/25 [01:54<00:00,  4.59s/it]
Generating walks (CPU: 5): 100%|██████████| 25/25 [01:54<00:00,  4.59s/it]
Generating walks (CPU: 7): 100%|██████████| 25/25 [01:55<00:00,  4.62s/it]


In [66]:
# `model` is the fitted node2vec model from the previous cell
node_labs = list(model.wv.key_to_index.keys())
node_embs = [model.wv[lab] for lab in node_labs]
node_df = pd.DataFrame(node_embs, index = node_labs)
#Save the embeddings and labels locally as TSV
with open('node_meta.tsv','w+', encoding='utf-8') as file_metadata:
    # Write the node labels (not the document labels) so rows match node_embeddings.tsv
    for lab in node_labs:
        file_metadata.write(lab+'\n')

node_df.to_csv('node_embeddings.tsv', sep='\t', index=False, header=False)