# Generate queries with neurocard

See neurocard.ipynb in https://github.com/Erostrate9/neurocard

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!cp /content/drive/MyDrive/job-light-content-2500.csv /content/job-light-content-2500.csv

- Note: As described in the Fauce paper, the JOB-base queries are generated from the numeric columns of JOB-light.
- So the training data contains only int and float columns, no string columns.
- JOB-base has only 6 tables:
  1. title
  2. cast_info
  3. movie_info
  4. movie_companies
  5. movie_keyword
  6. movie_info_idx


# Queries Featurization

## Clone GitHub repository

In [1]:
!git config --global user.email ericsun42@outlook.com
!git config --global user.name Erostrate9
!mkdir -p /root/.ssh && cp -r "/content/drive/My Drive/ssh/." /root/.ssh/
!ssh -T git@github.com

Hi Erostrate9! You've successfully authenticated, but GitHub does not provide shell access.


In [2]:
!git clone -b eric git@github.com:Erostrate9/vldb2021_fauce.git

Cloning into 'vldb2021_fauce'...
remote: Enumerating objects: 636, done.
remote: Counting objects: 100% (194/194), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 636 (delta 70), reused 163 (delta 52), pack-reused 442 (from 1)
Receiving objects: 100% (636/636), 4.55 MiB | 13.39 MiB/s, done.
Resolving deltas: 100% (290/290), done.


In [3]:
!mkdir py3
!cd py3 && git clone -b dverma/update_fauce git@github.com:Erostrate9/vldb2021_fauce.git

Cloning into 'vldb2021_fauce'...
remote: Enumerating objects: 636, done.
remote: Counting objects: 100% (194/194), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 636 (delta 70), reused 163 (delta 52), pack-reused 442 (from 1)
Receiving objects: 100% (636/636), 4.55 MiB | 16.65 MiB/s, done.
Resolving deltas: 100% (290/290), done.


In [None]:
!mkdir -p vldb2021_fauce/colab_notebooks

In [56]:
%%bash
cd vldb2021_fauce
git add .
git commit -m "1. only use 6 tables. 2. finished Joins2Vec"
git push origin eric

[eric 3df3c8f] update queries_featurization environment.yml
 17 files changed, 301 insertions(+), 28 deletions(-)
 create mode 100644 queries_featurization/Joins2Vec/classify.pyc
 create mode 100644 queries_featurization/Joins2Vec/data_utils.pyc
 create mode 100644 queries_featurization/Joins2Vec/embeddings/node_edges_dims_4_epochs_3_embeddings.txt
 create mode 100644 queries_featurization/Joins2Vec/example_data/datasets/node.Labels
 create mode 100644 queries_featurization/Joins2Vec/example_data/datasets/node_edges/0.WL2
 create mode 100644 queries_featurization/Joins2Vec/example_data/datasets/node_edges/0.gexf
 create mode 100644 queries_featurization/Joins2Vec/make_subgraph2vec_corpus.pyc
 create mode 100644 queries_featurization/Joins2Vec/skipgram.pyc
 create mode 100644 queries_featurization/Joins2Vec/train_utils.pyc
 create mode 100644 queries_featurization/Joins2Vec/utils.pyc
 rewrite queries_featurization/graph_embedding/emb/graph_node.emd (99%)
 rename queries_featurization/gr

To github.com:Erostrate9/vldb2021_fauce.git
   87bef1f..3df3c8f  eric -> eric


## Table Encoding using a graph embedding method

- The full schema of the IMDB database is shown in the figure below.
However, JOB-base has only 6 tables:
  1. title
  2. cast_info
  3. movie_info
  4. movie_companies
  5. movie_keyword
  6. movie_info_idx
![IMDB schema](https://www.researchgate.net/profile/Peter-Boncz/publication/319893076/figure/fig2/AS:631637725438007@1527605577677/MDB-schema-with-key-foreign-key-relationships-Underlined-attributes-are-primary-keys.png)

- The Fauce paper doesn't give full details of the training data: it doesn't say which 15 tables were used, and the edgelist provided doesn't correspond to the IMDB schema.
- So I use the JOB-base version with 6 tables.
```csv
title 1
cast_info 2
movie_info 3
movie_companies 4
movie_keyword 5
movie_info_idx 6
```
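The mapping above only assigns node IDs; the embedding step also needs edges between those nodes. Below is a minimal sketch of how an edge list could be derived, assuming the star-shaped join structure of JOB-light, in which every other table joins to `title` through its `movie_id` column. The edge set, the `src dst` line format, and the helper names `star_edges` and `to_edgelist` are illustrative assumptions, not the contents of the repository's `graph/graph.edgelist`.

```python
# Table-to-node-ID mapping, copied from the listing above.
TABLE_IDS = [
    ("title", 1),
    ("cast_info", 2),
    ("movie_info", 3),
    ("movie_companies", 4),
    ("movie_keyword", 5),
    ("movie_info_idx", 6),
]


def star_edges(table_ids, hub="title"):
    """Return (hub_id, other_id) pairs.

    Assumption: every non-hub table joins only to the hub table.
    """
    ids = dict(table_ids)
    return [(ids[hub], node_id) for name, node_id in table_ids if name != hub]


def to_edgelist(edges):
    """Render edges as one whitespace-separated 'src dst' pair per line."""
    return "\n".join("{} {}".format(src, dst) for src, dst in edges)


edgelist_text = to_edgelist(star_edges(TABLE_IDS))
print(edgelist_text)
```

If the schema actually used contains joins between non-`title` tables, those pairs would have to be added to the edge list by hand.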



In [None]:
!apt install build-essential
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!conda install --channel defaults conda --yes
!conda update --channel defaults --all --yes

In [42]:
!conda env remove -n qf

In [None]:
!conda env create -f /content/vldb2021_fauce/queries_featurization/environment.yml

In [None]:
!conda env create -f /content/py3/vldb2021_fauce/environment.yml

In [12]:
!curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py
!source activate qf && python get-pip.py --force-reinstall

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1863k  100 1863k    0     0  4814k      0 --:--:-- --:--:-- --:--:-- 4827k
Collecting pip<21.0
  Downloading pip-20.3.4-py2.py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 2.8 MB/s 
Installing collected packages: pip
Successfully installed pip-20.3.4


In [55]:
!cd /content/vldb2021_fauce/queries_featurization/graph_embedding && source activate qf && python main.py --input graph/graph.edgelist --output emb/graph_node.emd

Walk iteration:
1 / 3
2 / 3
3 / 3


The first line of the output file (`emb/graph_node.emd`) has the following format:

`num_of_nodes dim_of_representation`

The next lines are as follows:

`node_id dim1 dim2 ... dimd`
where dim1, ... , dimd is the d-dimensional representation learned by the graph embedding method.
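
The format described above can be read back with a few lines of Python. This is a minimal sketch, not code from the repository: the function name `read_embeddings` and the inline sample data are hypothetical, and it assumes whitespace-separated fields with integer node IDs.

```python
import io


def read_embeddings(handle):
    """Parse an embedding file into a dict of node_id -> list of floats.

    Expects a header line 'num_of_nodes dim_of_representation' followed by
    one 'node_id dim1 ... dimd' line per node.
    """
    header = handle.readline().split()
    num_nodes, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("expected %d dimensions, got %d" % (dim, len(values)))
        vectors[int(parts[0])] = values

    if len(vectors) != num_nodes:
        raise ValueError("expected %d nodes, got %d" % (num_nodes, len(vectors)))
    return vectors


# Hypothetical two-node, three-dimensional sample in the same layout.
sample = io.StringIO(u"2 3\n1 0.1 -0.2 0.3\n2 0.0 0.5 -0.5\n")
embeddings = read_embeddings(sample)
print(embeddings[2])
```

To load the real file, the same function can be called on a handle from `open(...)`. The header check guards against a truncated or partially written output.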

In [56]:
!cat /content/vldb2021_fauce/queries_featurization/graph_embedding/emb/graph_node.emd

6 4
1 -0.137066 -0.133486 0.131702 -0.100182
5 0.021005 -0.105605 -0.127137 0.000637
6 0.035167 -0.029032 0.026336 0.117065
3 -0.109892 0.026662 0.112908 0.021810
4 -0.002323 0.120363 0.000427 0.053929
2 0.097379 -0.015477 -0.023758 -0.032507


## Join Encoding using Joins2Vec

In [27]:
%%bash
source activate qf
pip install networkx==1.11
pip install numpy==1.11.2
pip install gensim==0.12.1
pip install tensorflow==0.12.1
pip install joblib==0.11
pip install scikit-learn
pip install singledispatch



In [15]:
# !mkdir -p /content/vldb2021_fauce/queries_featurization/example_data/datasets/node_edges
# !touch /content/vldb2021_fauce/queries_featurization/example_data/datasets/node_edges/0.gexf
# !touch /content/vldb2021_fauce/queries_featurization/example_data/datasets/node.Labels

In [20]:
!touch /content/vldb2021_fauce/queries_featurization/example_data/datasets/node_edges/1.gexf

In [54]:
%%bash
source activate qf
cd /content/vldb2021_fauce/queries_featurization/Joins2Vec
# mkdir ../embeddings
python main.py --corpus ./example_data/datasets/node_edges --class_labels_file_name ./example_data/datasets/node.Labels --output_dir ./embeddings

Device mapping: no known devices.
gradients/nce_loss/embedding_lookup_grad/strided_slice: (StridedSlice): /job:localhost/replica:0/task:0/cpu:0
gradients/nce_loss/Slice_grad/pack: (Pack): /job:localhost/replica:0/task:0/cpu:0
gradients/embedding_lookup_grad/strided_slice: (StridedSlice): /job:localhost/replica:0/task:0/cpu:0
gradients/nce_loss/embedding_lookup_1_grad/strided_slice: (StridedSlice): /job:localhost/replica:0/task:0/cpu:0
gradients/nce_loss/Slice_2_grad/pack: (Pack): /job:localhost/replica:0/task:0/cpu:0
gradients/nce_loss/Slice_1_grad/pack: (Pack): /job:localhost/replica:0/task:0/cpu:0
gradients/nce_loss/Slice_3_grad/pack: (Pack): /job:localhost/replica:0/task:0/cpu:0
gradients/nce_loss/sub_1_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/cpu:0
gradients/nce_loss/truediv_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/cpu:0
gradients/Mean_grad/Prod_1: (Prod): /job:localhost/replica:0/task:0/cpu:0
gradients/Mean_grad/Maximum: (Maximum): /job:localhost/replic

INFO:root:Loaded 1 graph file names form ./example_data/datasets/node_edges
INFO:root:Dumped subgraph2vec sentences for all 1 graphs in ./example_data/datasets/node_edges in 0.0 sec
INFO:root:Initializing SKIPGRAM...
INFO:root:vocabulary size: 19
INFO:root:number of documents: 1
INFO:root:number of words to be trained: 20
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping:

I tensorflow/core/common_runtime/simple_placer.cc:827] gradients/nce_loss/embedding_lookup_grad/strided_slice: (StridedSlice)/job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] gradients/nce_loss/Slice_grad/pack: (Pack)/job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] gradients/embedding_lookup_grad/strided_slice: (StridedSlice)/job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:827] gradients/nce_loss/embedding_lookup_1_grad/strided_slice: (StridedSlice)/job:localhost/replica:0

CalledProcessError: Command 'b'source activate qf\ncd /content/vldb2021_fauce/queries_featurization/Joins2Vec\n# mkdir ../embeddings\npython main.py --corpus ./example_data/datasets/node_edges --class_labels_file_name ./example_data/datasets/node.Labels --output_dir ./embeddings\n'' returned non-zero exit status 1.