# How to Use DeepBugs for Yourself
Follow along with this notebook to reproduce our replication of DeepBugs, tested on the switched-argument bug (i.e., the developer accidentally passed the arguments in reverse order).

Or, feel free to just check out the pre-saved output - things can take a while to run.

You can also use the functions we provide to deploy DeepBugs in your own code!

## 1. Round up the source code (~1 hour download)

Start by downloading the 150k JavaScript Dataset using the links below. 

* [Training Data - 10.0GB](https://1drv.ms/u/s!AvvT9f1RiwGbh6hYNoymTrzQcNA46g?e=WeJf3K)
* [Testing Data - 4.8GB](https://1drv.ms/u/s!AvvT9f1RiwGbh6hXmjPOUS-kBARjFA?e=AJY1Xf)

Save them into the `demo_data` folder.

## 2. Convert the source code ASTs to tokens (22 minutes)
For a given corpus of code, you should have a large list of source files, each of which is converted into an Abstract Syntax Tree (AST).

In this example, we convert each AST from the 150k JavaScript Dataset into a list of tokens (e.g., "ID:setInterval" or "LIT:true"). Those lists are then aggregated into a master list of lists. This list-of-lists format matters for training Word2Vec because each list of tokens corresponds to a single source file: tokens within a source file are closely related, while tokens from different source files may not be.

Example:
```
[
    [ # Corresponds to first source file
        "ID:setInterval",
        "LIT:1000",
        "ID:callbackFn",
        "LIT:true",
        "LIT:http-mode",
        ...
    ],
    [ # Corresponds to second source file
        "ID:fadeIn",
        "LIT:300",
        "ID:css",
        "LIT:color:red;margin:auto",
        ...
    ]
]
```
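To make the token format concrete, here is a minimal sketch of turning one AST into such a list. It assumes each AST is a JSON array of node objects with a `type` field and an optional `value` field, that identifier nodes are typed `Identifier`, and that literal node types start with `Literal`. The function name `tokens_from_ast` is hypothetical, and this is not necessarily how `ast_token_extractor` works internally.

```python
import json


def tokens_from_ast(nodes):
    """Return "ID:..." / "LIT:..." tokens for one AST (a list of node dicts)."""
    tokens = []
    for node in nodes:
        # Skip entries that are not node objects and nodes that carry no value.
        if not isinstance(node, dict) or "value" not in node:
            continue
        node_type = node.get("type", "")
        if node_type == "Identifier":  # assumed type name
            tokens.append("ID:{0}".format(node["value"]))
        elif node_type.startswith("Literal"):  # assumed naming scheme
            tokens.append("LIT:{0}".format(node["value"]))
    return tokens


# One line of a made-up corpus file: a JSON-encoded AST.
example_line = json.dumps([
    {"id": 0, "type": "Program", "children": [1]},
    {"id": 1, "type": "CallExpression", "children": [2, 3, 4]},
    {"id": 2, "type": "Identifier", "value": "setInterval"},
    {"id": 3, "type": "Identifier", "value": "callbackFn"},
    {"id": 4, "type": "LiteralNumber", "value": "1000"},
])

example_tokens = tokens_from_ast(json.loads(example_line))
print(example_tokens)
```

Running the sketch over every line of a corpus file and collecting the results would give the list-of-lists structure shown above.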
### Note on using our code
If you organize your ASTs into one file, such that each line of the file corresponds to one AST, you can just call our ready-to-go `ast_token_extractor.get_tokens_from_corpus()` function as shown below.

If you need more fine-grained control, you could use `ast_token_extractor.get_tokens_from()` to extract tokens from each node in a single AST.

In [2]:
from ast_token_extractor import get_tokens_from_corpus

TRAIN_DATA_PATH = "demo_data/150k_training.json"
TEST_DATA_PATH = "demo_data/150k_testing.json"

list_of_lists_of_tokens = get_tokens_from_corpus(TRAIN_DATA_PATH)

# Count the source files processed (one token list per source file)
num_token_lists = len(list_of_lists_of_tokens)

print("Extracted tokens from {0} source files".format(num_token_lists))
print("A few examples...")
print(list_of_lists_of_tokens[0])


100000it [22:00, 75.76it/s]

Extracted tokens from 100000 source files
A few examples...
['ID:gTestfile', 'LIT:regress-472450-04.js', 'ID:BUGNUMBER', 'LIT:472450', 'ID:summary', 'LIT:TM: Do not assert: StackBase(fp) + blockDepth == regs.sp', 'ID:actual', 'LIT:', 'ID:expect', 'LIT:', 'ID:test', 'ID:test', 'ID:enterFunc', 'LIT:test', 'ID:printBugNumber', 'ID:BUGNUMBER', 'ID:printStatus', 'ID:summary', 'ID:jit', 'LIT:true', 'ID:__proto__', 'ID:✖', 'LIT:1', 'ID:f', 'ID:eval', "LIT:for (var y = 0; y < 1; ++y) { for each (let z in [null, function(){}, null, '', null, '', null]) { let x = 1, c = []; } }", 'ID:f', 'ID:jit', 'LIT:false', 'ID:reportCompare', 'ID:expect', 'ID:actual', 'ID:summary', 'ID:exitFunc', 'LIT:test']





## 3. Convert tokens to vectors: train a Word2Vec model (~10 minutes)
Now that you have reduced your dataset to lists of tokens, you can use them to train a Word2Vec model that learns a vector for each token from the contexts in which it appears, so tokens used in similar contexts end up with similar vectors. For instance, in the run below `LIT:true` ends up close to `LIT:false` (similarity of about 0.87) but far from `LIT:1` (about 0.01).

We train Word2Vec using the Continuous Bag of Words (CBOW) method with a 200-token window (i.e., for a given token, we use the previous 100 tokens and the following 100 tokens as its context). Like the original authors, we limit the vocabulary to the 10,000 most frequent tokens in the dataset.
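If the window and vocabulary cap feel abstract, the following pure-Python sketch shows what they mean on a toy corpus. The helper names `top_k_vocabulary` and `cbow_context` are hypothetical and exist only for illustration; the real training is done by the Word2Vec library, not by this code.

```python
from collections import Counter


def top_k_vocabulary(list_of_token_lists, k):
    """Return the set of the k most frequent tokens across all source files."""
    counts = Counter(token for tokens in list_of_token_lists for token in tokens)
    return {token for token, _ in counts.most_common(k)}


def cbow_context(tokens, index, half_window):
    """Return the tokens within half_window positions on either side of tokens[index]."""
    start = max(0, index - half_window)
    before = tokens[start:index]
    after = tokens[index + 1:index + 1 + half_window]
    return before + after


# Toy corpus: one token list per (made-up) source file.
files = [
    ["ID:setInterval", "LIT:1000", "ID:callbackFn"],
    ["ID:setInterval", "LIT:1000", "ID:fadeIn"],
    ["ID:setInterval", "LIT:300"],
]

vocab = top_k_vocabulary(files, k=2)
context = cbow_context(files[0], index=1, half_window=1)
print(sorted(vocab))
print(context)
```

With `half_window=100` and `k=10000`, the same two ideas correspond to the settings described above. Note that the context never crosses a file boundary, which is why the list-of-lists format matters.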

### Note on using our code
As long as you have one list of tokens per source file, aggregated into a master list of all source files, then you can call our ready-made `token2vectorizer.train_word2vec()` function as shown below.


In [4]:
from token2vectorizer import train_word2vec

WORD2VEC_MODEL_SAVE_PATH = "demo_data/word2vec.model"

model = train_word2vec(list_of_lists_of_tokens, WORD2VEC_MODEL_SAVE_PATH)

# similarity() returns cosine similarity: higher values mean the two tokens appear in more similar contexts
print("Cosine similarity between LIT:true and LIT:false:", model.wv.similarity("LIT:true", "LIT:false"))
print("Cosine similarity between LIT:true and LIT:1:", model.wv.similarity("LIT:true", "LIT:1"))


Cosine similarity between LIT:true and LIT:false: 0.8699683
Cosine similarity between LIT:true and LIT:1: 0.013210262


## 4. Save your token-vector vocabulary for later
To speed things up when you're training and testing DeepBugs, you should save your learned Word2Vec vocabulary as a dictionary for rapid lookup and sharing. Our `token2vectorizer.save_token2vec_vocabulary()` handles this for you in a jiffy.

Example output:
```
{
    "LIT:true": [-5.174832, -4.9506106, 1.6868128, 1.476279, -3.211739, ...],
    ...
}
```

In [5]:
import json
from gensim.models import Word2Vec
from token2vectorizer import save_token2vec_vocabulary

WORD2VEC_MODEL_READ_PATH = "demo_data/word2vec.model"
VOCAB_SAVE_PATH = "demo_data/token2vec.json"

model = Word2Vec.load(WORD2VEC_MODEL_READ_PATH)
save_token2vec_vocabulary(model, VOCAB_SAVE_PATH)

with open(VOCAB_SAVE_PATH) as example_json:
    vocab = json.load(example_json)
    print("A couple examples...")
    print("ID:Date: ", vocab["ID:Date"], "\n")
    print("ID:end: ", vocab["ID:end"])

A couple examples...
ID:Date:  [2.293358325958252, 2.8205933570861816, -2.336764097213745, -1.6994364261627197, 0.732081949710846, 9.726104736328125, -0.49508002400398254, -8.360478401184082, 0.813254714012146, -3.673178195953369, 1.3742514848709106, 1.2163194417953491, -4.503299713134766, -2.1647446155548096, 5.015124797821045, 5.080772399902344, -7.650942802429199, -1.901321291923523, -0.9393377900123596, -2.431166887283325, -6.784677028656006, 3.32989239692688, -9.29594612121582, 4.371859073638916, -1.702826976776123, -0.5957798361778259, -2.4750595092773438, 2.1724228858947754, -3.6958248615264893, -10.615705490112305, -8.203575134277344, -8.35372257232666, 3.7036781311035156, -1.3913450241088867, 4.563570499420166, -7.413445472717285, -10.156298637390137, 11.29690170288086, -1.2124414443969727, -4.36269474029541, 7.557013511657715, 7.205021381378174, -5.480103492736816, 2.5261998176574707, 9.942569732666016, -2.938769817352295, -2.881113290786743, 0.6350539326667786, 7.15181446075
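
Because the saved vocabulary is plain JSON mapping each token to a list of floats, reading it back for lookups needs only the standard library. Here is a minimal sketch; the helper names `load_token2vec` and `vector_for` are hypothetical, and mapping unknown tokens to an all-zero vector is our assumption for illustration, not something the saved file dictates.

```python
import json
import os
import tempfile


def load_token2vec(path):
    """Load a token -> vector dictionary from a JSON file."""
    with open(path) as vocab_file:
        return json.load(vocab_file)


def vector_for(token, vocab, dimensions):
    """Look up a token's vector; unknown tokens get zeros (an assumption)."""
    return vocab.get(token, [0.0] * dimensions)


# Round trip with a tiny made-up vocabulary instead of the real file.
toy_vocab = {"ID:Date": [0.5, -1.25], "LIT:true": [2.0, 0.0]}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "token2vec.json")
    with open(toy_path, "w") as out_file:
        json.dump(toy_vocab, out_file)
    loaded = load_token2vec(toy_path)

known = vector_for("ID:Date", loaded, dimensions=2)
unknown = vector_for("ID:neverSeen", loaded, dimensions=2)
print(known, unknown)
```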

## 5. Generate positive/negative examples (~45 minutes)

In our example, we are testing for the same switched-argument bug that the DeepBugs authors tested for, so we generate data by extracting all 2-argument function calls from the 150k dataset and then artificially swapping the arguments to make "buggy" examples.

### Note on using our code
Our code is specific to switched-argument bugs. For your own bugs, you will need to write your own code to generate positive and negative training/testing examples. You can follow similar procedures to our `swarg_` scripts.

We save our examples as `.npz` files. Each file holds two NumPy arrays of the same length, `data_x` (the examples) and `labels_y` (the labels), where `labels_y[i]` is 1 for a positive example and 0 for a negative one.
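
As a rough illustration of the idea (not a copy of our `swarg_` scripts), the sketch below builds one example in the original order and one with the arguments swapped for every 2-argument call. It assumes each call is a `(callee, first_arg, second_arg)` tuple of tokens, that calls containing out-of-vocabulary tokens are skipped, and that "positive" (label 1) means the original argument order. The function name `make_swapped_arg_examples` is hypothetical.

```python
def make_swapped_arg_examples(calls, vocab):
    """Build (data, labels) from 2-argument calls and a token -> vector dictionary."""
    data, labels = [], []
    for callee, first_arg, second_arg in calls:
        tokens = (callee, first_arg, second_arg)
        # Assumption: drop calls with any token missing from the vocabulary.
        if any(token not in vocab for token in tokens):
            continue
        callee_vec, first_vec, second_vec = (vocab[token] for token in tokens)
        # Original order: assumed to be the positive example (label 1).
        data.append(callee_vec + first_vec + second_vec)
        labels.append(1)
        # Swapped arguments: the artificial "buggy" example (label 0).
        data.append(callee_vec + second_vec + first_vec)
        labels.append(0)
    return data, labels


# Toy vocabulary with 1-dimensional vectors so the swap is easy to see.
toy_vocab = {"ID:setInterval": [1.0], "ID:callbackFn": [2.0], "LIT:1000": [3.0]}
toy_calls = [
    ("ID:setInterval", "ID:callbackFn", "LIT:1000"),
    ("ID:unknownFn", "ID:callbackFn", "LIT:1000"),  # skipped: callee not in vocabulary
]

toy_data, toy_labels = make_swapped_arg_examples(toy_calls, toy_vocab)
print(toy_data, toy_labels)
```

The resulting lists could then be written with `numpy.savez(path, data_x=..., labels_y=...)` to match the `.npz` layout described above.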

In [None]:
import json
from swarg_gen_train_eval import gen_good_bad_fn_args
from swarg_fnargs2tokens import get_all_2_arg_fn_calls_from_ast

VOCAB_READ_PATH = "demo_data/token2vec.json"

SWARG_TRAIN_EXAMPLES_SAVE_PATH = "demo_data/switch_arg_train.npz"
SWARG_TEST_EXAMPLES_SAVE_PATH = "demo_data/switch_arg_test.npz"

gen_good_bad_fn_args(TRAIN_DATA_PATH, VOCAB_READ_PATH, SWARG_TRAIN_EXAMPLES_SAVE_PATH)
gen_good_bad_fn_args(TEST_DATA_PATH, VOCAB_READ_PATH, SWARG_TEST_EXAMPLES_SAVE_PATH)


100000it [33:48, 49.29it/s]
8845it [02:47, 48.35it/s] 

## 6. Train DeepBugs (~10 minutes)
We use examples generated from the training partition of the 150K JavaScript Dataset.

In [14]:
from math import ceil

import numpy as np
from nn_trainer import model

train_path = "demo_data/switch_arg_train.npz"

# Load the examples and labels generated in step 5
with np.load(train_path) as data:
    x_data = data['data_x']
    y_data = data['labels_y']
    print('x_data:'+str( x_data.shape))
    print('y_data:'+str( y_data.shape))

# Use the first two-thirds of the examples for training and the rest for validation
train_p = 2/3
split_index = ceil(len(x_data) * train_p)
train_data_x = x_data[:split_index]
train_data_y = y_data[:split_index]
val_data_x = x_data[split_index:]
val_data_y = y_data[split_index:]

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop', 
              metrics=['accuracy'])

model_mdata = model.fit(train_data_x, train_data_y, 
       validation_data=(val_data_x, val_data_y), 
       epochs=10, batch_size=100, shuffle=True)

model.save('demo_data/deepbug_model_withw2v.keras')


## 7. Test DeepBugs (<1 minute)
We use examples generated from the test partition of the 150K JavaScript Dataset.

In [15]:
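import numpy as np
import keras  # assumption: Keras is importable directly; otherwise use `from tensorflow import keras`

test_path = "demo_data/switch_arg_test.npz"
with np.load(test_path) as data:
    data_test = data['data_x']
    labels_test = data['labels_y']

# Load the model saved at the end of step 6
model = keras.models.load_model("demo_data/deepbug_model_withw2v.keras")

# evaluate() returns the loss followed by the compiled metrics (here, accuracy)
model_mdata = model.evaluate(data_test, labels_test)

print(model_mdata)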
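
Since the model was compiled with `metrics=['accuracy']`, the printed result should be the loss followed by the accuracy. If you want a finer-grained view, you could threshold the predicted probabilities yourself. The sketch below is one way to do that in plain Python; the helper name `summarize_predictions` and the 0.5 threshold are our assumptions, and the probabilities shown are made up rather than real model output.

```python
def summarize_predictions(probabilities, labels, threshold=0.5):
    """Return accuracy, precision, and recall for thresholded probabilities."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(predictions, labels))
    true_pos = sum(1 for pred, label in pairs if pred == 1 and label == 1)
    true_neg = sum(1 for pred, label in pairs if pred == 0 and label == 0)
    false_pos = sum(1 for pred, label in pairs if pred == 1 and label == 0)
    false_neg = sum(1 for pred, label in pairs if pred == 0 and label == 1)
    return {
        "accuracy": (true_pos + true_neg) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Made-up probabilities and labels, purely for illustration.
toy_probabilities = [0.9, 0.2, 0.6, 0.4]
toy_labels = [1, 0, 0, 1]

summary = summarize_predictions(toy_probabilities, toy_labels)
print(summary)
```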