# Quick Start

This notebook shows how to use this project to predict function names from function bodies.

We start by setting up the global variables, then read the model and the other necessary files from disk, and finally feed our example functions into the model to evaluate the results.

## Setup

During the setup we import the necessary libraries and set the values for our model. If you have trained a separate model, verify that its parameters match the default settings of this notebook.

> If you run this notebook yourself, set the path variables in the next cell to match your environment.

In [1]:
PY150_PATH = '/tmp/py150'
PY150_DICT_NAME = 'extracted.dict.c2s'

EXAMPLE_PATH = '/code-embeddings/examples'
EXAMPLE_C2S_NAME = 'examples.c2s'

CHECKPOINT_PATH = '/code-embeddings/checkpoints/train'

### Imports

In [2]:
from argparse import ArgumentParser
from functools import partial

import numpy as np
import tensorflow as tf

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

from preprocessing import dataset
from preprocessing import vocabulary
from training import loss, mask, schedule
from training.model import transformer
from evaluate import evaluate


config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)



### Arguments

In [3]:
arg_parser = ArgumentParser()
arg_parser.add_argument('--dict', required=True)
arg_parser.add_argument('--data', required=True)

arg_parser.add_argument('--num-paths', type=int, default=100)
arg_parser.add_argument('--num-tokens', type=int, default=10)
arg_parser.add_argument('--num-targets', type=int, default=10)

arg_parser.add_argument('--num-layers', type=int, default=2)
arg_parser.add_argument('--num-heads', type=int, default=4)
arg_parser.add_argument('--embedding-size', type=int, default=32)
arg_parser.add_argument('--dense-size', type=int, default=64)
arg_parser.add_argument('--dropout-rate', type=float, default=.2)

args = arg_parser.parse_args(
    [
        '--dict',
        f'{PY150_PATH}/{PY150_DICT_NAME}',
        '--data',
        f'{EXAMPLE_PATH}/{EXAMPLE_C2S_NAME}'
    ]
)
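The cell above passes an explicit list to `parse_args` instead of reading the command line, which lets the training script's argument definitions be reused inside a notebook. A minimal, standalone sketch of that pattern follows; the parser here is a trimmed-down copy for illustration and `demo_args` is a hypothetical name, not part of the project.

```python
from argparse import ArgumentParser

# A reduced parser with three of the options defined in the notebook.
parser = ArgumentParser()
parser.add_argument('--dict', required=True)
parser.add_argument('--num-paths', type=int, default=100)
parser.add_argument('--dropout-rate', type=float, default=.2)

# Passing a list means sys.argv is ignored, so this also works in a notebook kernel.
demo_args = parser.parse_args(
    ['--dict', '/tmp/py150/extracted.dict.c2s', '--num-paths', '200']
)

# Dashes in option names become underscores in attribute names.
print(demo_args.dict)
print(demo_args.num_paths)     # overridden on the "command line"
print(demo_args.dropout_rate)  # falls back to the default
```

To evaluate a model trained with non-default settings, the same list can be extended with the matching options, for example `'--num-layers', '4'`.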

## Load files

After the setup we load the embedding lookups, the example file and the model.

### Embedding lookups

In [4]:
subtoken2count, path2count, target2count, max_contexts = vocabulary.load(args.dict)

idx2sub, sub2idx = vocabulary.to_encoder_decoder(subtoken2count, special_tokens=[vocabulary.PAD, vocabulary.UNK])
idx2path, path2idx = vocabulary.to_encoder_decoder(path2count, special_tokens=[vocabulary.PAD, vocabulary.UNK])
idx2tar, tar2idx = vocabulary.to_encoder_decoder(target2count, special_tokens=[vocabulary.PAD, vocabulary.UNK, vocabulary.SOS, vocabulary.EOS])

token_table = vocabulary.to_table(sub2idx, sub2idx[vocabulary.UNK])
path_table = vocabulary.to_table(path2idx, path2idx[vocabulary.UNK])
target_table = vocabulary.to_table(tar2idx, tar2idx[vocabulary.UNK])
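To make the role of the lookups more concrete, here is a rough pure-Python sketch of how a count dictionary could be turned into an index-to-token and a token-to-index mapping with the special tokens placed first. The function `to_encoder_decoder_sketch`, the helper `lookup`, the frequency-based ordering and the `'<PAD>'` spelling are all assumptions made for illustration; the project's `vocabulary` module may work differently.

```python
# '<UNK>' appears in the notebook output; the '<PAD>' spelling is an assumption.
PAD, UNK = '<PAD>', '<UNK>'


def to_encoder_decoder_sketch(token2count, special_tokens):
    """Hypothetical stand-in: special tokens first, then tokens by descending count."""
    ordered = sorted(token2count, key=token2count.get, reverse=True)
    tokens = list(special_tokens) + ordered
    idx2tok = dict(enumerate(tokens))
    tok2idx = {tok: idx for idx, tok in idx2tok.items()}
    return idx2tok, tok2idx


def lookup(tok2idx, token):
    """Mimic a lookup table with a default value: unknown tokens map to UNK."""
    return tok2idx.get(token, tok2idx[UNK])


counts = {'get': 12, 'item': 7, 'index': 3}
idx2tok, tok2idx = to_encoder_decoder_sketch(counts, [PAD, UNK])

print(lookup(tok2idx, 'item'))    # a known token gets its own index
print(lookup(tok2idx, 'unseen'))  # an out-of-vocabulary token falls back to UNK
```

Reserving the lowest indices for special tokens is a common convention because it makes them easy to filter out later with a single comparison.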

### Example file

In [5]:
dst = dataset.create(
    args.data,
    args.num_paths,
    args.num_tokens,
    args.num_targets,
    token_table,
    path_table,
    target_table
)

### Model

In [6]:
model = transformer.Transformer(
    args.num_paths,
    args.num_tokens,
    args.num_layers,
    args.num_heads,
    args.embedding_size,
    args.dense_size,
    len(idx2path),
    len(idx2sub),
    len(idx2tar),
    1000,
    args.dropout_rate
)

In [7]:
ckpt = tf.train.Checkpoint(model=model)
ckpt_manager = tf.train.CheckpointManager(ckpt, CHECKPOINT_PATH, max_to_keep=5)

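if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!')
else:
    # Without a checkpoint the model keeps its random initial weights.
    print(f'No checkpoint found in {CHECKPOINT_PATH}!')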

Latest checkpoint restored!


## Prediction

Finally, we use our model to predict names for the functions defined in the example file.

In [8]:
evaluate_fn = partial(evaluate, num_targets=args.num_targets, tar2idx=tar2idx, model=model)
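The `partial` call above fixes every argument except the input, so the loop that follows only has to pass `X`. A small standalone sketch of the same idea, using a hypothetical `evaluate_sketch` function and a toy model rather than the project's `evaluate`:

```python
from functools import partial


def evaluate_sketch(inputs, num_targets, model):
    """Hypothetical stand-in for evaluate: run the model, keep num_targets outputs."""
    return model(inputs)[:num_targets]


def toy_model(inputs):
    """A toy 'model' that doubles every input value."""
    return [x * 2 for x in inputs]


# Bind the configuration once; only the inputs remain as a free argument.
evaluate_fn_demo = partial(evaluate_sketch, num_targets=2, model=toy_model)

print(evaluate_fn_demo([1, 2, 3]))
```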

In [9]:
for X, y in dst:
    y_hat, weights = evaluate_fn(X)
    # Drop ids 0-3, assumed to be the special tokens (PAD, UNK, SOS, EOS) added first.
    y = tf.gather_nd(y, tf.where(y > 3))

    real = '_'.join([idx2tar[i] for i in y.numpy()])
    predicted = '_'.join([idx2tar[i] for i in y_hat.numpy()])

    print(f'Real function name: {real}')
    print(f'Predicted function name: {predicted}')
    print('')

Real function name: count_occurences
Predicted function name: <UNK>

Real function name: contains
Predicted function name: is_element_present

Real function name: index_of
Predicted function name: get_item
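The decoding step in the loop can be illustrated without TensorFlow. The sketch below assumes, as the `y > 3` filter suggests, that the four special tokens occupy indices 0 to 3. The tiny vocabulary, the `'<PAD>'`, `'<SOS>'` and `'<EOS>'` spellings and the `decode` helper are hypothetical.

```python
# Assumed layout: special tokens first, then regular subtokens.
SPECIAL = ['<PAD>', '<UNK>', '<SOS>', '<EOS>']
idx2tar_demo = dict(enumerate(SPECIAL + ['index', 'of', 'get', 'item']))


def decode(ids, idx2tar, num_special=4):
    """Drop special-token ids and join the remaining subtokens with underscores."""
    return '_'.join(idx2tar[i] for i in ids if i >= num_special)


# A padded target sequence: <SOS> index of <EOS> <PAD> <PAD>
target = [2, 4, 5, 3, 0, 0]
print(decode(target, idx2tar_demo))
```

In the notebook only the real name `y` is filtered this way, which is why a special token such as `<UNK>` can still show up in a predicted name.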

