### Prerequisites

You should have completed steps 1-3 of this tutorial before beginning this exercise.  The files required for this notebook are generated by those previous steps.

This notebook takes approximately 3 hours to run on an AWS `p3.8xlarge` instance. 

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
%cd /content/drive/MyDrive/Automate/

/content/drive/MyDrive/Automate


In [3]:
!pip install annoy==1.11.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting annoy==1.11.5
  Downloading annoy-1.11.5.tar.gz (632 kB)
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.11.5-cp38-cp38-linux_x86_64.whl size=303177 sha256=d4f2e6de5d5f928bd38db63535a7e67065036fc54d5409d3ee4ad7d84c59b24d
  Stored in directory: /root/.cache/pip/wheels/42/08/67/145506e4a49c72863367f7c4c2706e8e3da0841d211ddc470d
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.11.5


In [4]:
!pip install pathos

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pathos
  Downloading pathos-0.3.0-py3-none-any.whl (79 kB)
Collecting ppft>=1.7.6.6
  Downloading ppft-1.7.6.6-py3-none-any.whl (52 kB)
Collecting multiprocess>=0.70.14
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting pox>=0.3.2
  Downloading pox-0.3.2-py3-none-any.whl (29 kB)
Installing collected packages: ppft, pox, multiprocess, pathos
Successfully installed multiprocess-0.70.14 pathos-0.3.0 pox-0.3.2 ppft-1.7.6.6


In [5]:
!pip install textacy==0.6.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textacy==0.6.1
  Downloading textacy-0.6.1-py2.py3-none-any.whl (137 kB)
Collecting python-levenshtein>=0.12.0
  Downloading python_Levenshtein-0.20.8-py3-none-any.whl (9.4 kB)
Collecting pyphen>=0.9.4
  Downloading pyphen-0.13.2-py3-none-any.whl (2.0 MB)
Collecting unidecode>=0.04.19
  Downloading Unidecode-1.3.6-py3-none-any.whl (235 kB)
Collecting cytoolz>=0.8.0
  Downloading cytoolz-0.12.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting ftfy<5.0.0,>=4.2.0
  Downloading ftfy-4.4.3.tar.gz (50 kB)
Collecting ijson>=2.3
  Downloading ijson-

In [None]:
# # Optional: you can set what GPU you want to use in a notebook like this.  
# # Useful if you want to run concurrent experiments at the same time on different GPUs.
# import os
# os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"]="2"

In [6]:
from pathlib import Path
import numpy as np
from seq2seq_utils import extract_encoder_model, load_encoder_inputs
from keras.layers import Input, Dense, BatchNormalization, Dropout, Lambda

from keras.models import load_model, Model
from seq2seq_utils import load_text_processor


# Where you will save artifacts from this step
OUTPUT_PATH = Path('./data/code2emb/')
OUTPUT_PATH.mkdir(exist_ok=True)

# These are where the artifacts are stored from steps 2 and 3, respectively.
seq2seq_path = Path('./data/seq2seq/')
langemb_path = Path('./data/lang_model/')

# set seeds
from numpy.random import seed
seed(1)
import tensorflow
tensorflow.random.set_seed(2)

# Train Model That Maps Code To Sentence Embedding Space

In step 2, we trained a seq2seq model that can summarize function code using `(code, docstring)` pairs as the training data.  

In this step, we will fine-tune the encoder from the seq2seq model to generate code embeddings in the docstring space by using `(code, docstring-embeddings)` as the training data.  Therefore, this notebook will go through the following steps:

1. Load the seq2seq model and extract the encoder (remember seq2seq models have an encoder and a decoder).
2. Freeze the weights of the encoder.
3. Add some dense layers on top of the encoder.
4. Train this new model by supplying `(code, docstring-embeddings)` pairs.  We will call this model `code2emb_model`.
5. Unfreeze the entire model and resume training.  This helps fine-tune the model a little more towards this task.
6. Encode all of the code, including code that does not contain a docstring, and save that into a search index for future use.  

### Load seq2seq model from Step 2 and extract the encoder

First, load the seq2seq model from Step 2, then extract the encoder (we do not need the decoder).

In [7]:
# load the pre-processed data for the encoder (we don't care about the decoder in this step)
encoder_input_data, doc_length = load_encoder_inputs(seq2seq_path/'py_t_code_vecs_v2.npy')
seq2seq_Model = load_model(seq2seq_path/'code_summary_seq2seq_model.h5')

Shape of encoder input: (139472, 55)


In [8]:
encoder_input_data.shape

(139472, 55)

In [9]:
# Extract Encoder from seq2seq model
encoder_model = extract_encoder_model(seq2seq_Model)
# Get a summary of the encoder and its layers
encoder_model.summary()

Model: "Encoder-Model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Encoder-Input (InputLayer)  [(None, 55)]              0         
                                                                 
 Body-Word-Embedding (Embedd  (None, 55, 800)          16001600  
 ing)                                                            
                                                                 
 Encoder-Batchnorm-1 (BatchN  (None, 55, 800)          3200      
 ormalization)                                                   
                                                                 
 Encoder-Last-GRU (GRU)      [(None, 1000),            5406000   
                              (None, 1000)]                      
                                                                 
Total params: 21,410,800
Trainable params: 21,409,200
Non-trainable params: 1,600
_____________________________________

Freeze the encoder

In [20]:
# Freeze Encoder Model
for l in encoder_model.layers:
    l.trainable = False
    print(l, l.trainable)

<keras.engine.input_layer.InputLayer object at 0x7faccd19e8b0> False
<keras.layers.core.embedding.Embedding object at 0x7faccd19e9a0> False
<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x7faccd19ecd0> False
<keras.layers.rnn.gru.GRU object at 0x7faccd19ef70> False


### Load Docstring Embeddings From Step 3

The target for our `code2emb` model will be docstring-embeddings instead of docstrings.  Therefore, we will use the embeddings for docstrings that we computed in step 3.  For this tutorial, we will use the average over all hidden states, which is saved in the file `avg_emb_dim500_v2.npy`.

Note that in our experiments, a concatenation of the average, max, and last hidden state worked better than using the average alone.  However, in the interest of simplicity we demonstrate just using the average hidden state.  We leave it as an exercise to the reader to experiment with other approaches. 
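
To make the pooling options above concrete, here is a minimal NumPy sketch of averaging, max-pooling, and taking the last hidden state, then concatenating all three. The function name, the toy matrix, and the assumption that the hidden states arrive as a `(timesteps, hidden_dim)` array are ours for illustration; step 3 may handle padding and batching differently.

```python
import numpy as np

def pool_hidden_states(hidden_states: np.ndarray) -> dict:
    """Pool a (timesteps, hidden_dim) matrix of hidden states into fixed-size vectors."""
    avg = hidden_states.mean(axis=0)   # average over all timesteps
    mx = hidden_states.max(axis=0)     # element-wise max over timesteps
    last = hidden_states[-1]           # hidden state at the final timestep
    return {
        "avg": avg,
        # Concatenation is three times as wide as the average alone.
        "concat": np.concatenate([avg, mx, last]),
    }

# Toy input: 3 timesteps, hidden size 2 (hypothetical values).
states = np.array([[1.0, 4.0],
                   [3.0, 0.0],
                   [2.0, 2.0]])
pooled = pool_hidden_states(states)
print(pooled["avg"])     # [2. 2.]
print(pooled["concat"])  # [2. 2. 3. 4. 2. 2.]
```

Note that the concatenated variant triples the target dimensionality, so the output layer of any model trained against it would need to be three times as wide.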

In [21]:
# Load the docstring embeddings produced by the language model in step 3
fastailm_emb = np.load(langemb_path/'avg_emb_dim500_v2.npy')
print(encoder_input_data.shape)
print(fastailm_emb.shape)

# check that the encoder inputs have the same number of rows as the docstring embeddings
assert encoder_input_data.shape[0] == fastailm_emb.shape[0]
fastailm_emb.shape


(139472, 55)
(139472, 400)


(139472, 400)

### Construct `code2emb` Model Architecture

The `code2emb` model is the encoder from the seq2seq model with some dense layers added on top.  The output of the last dense layer of this model needs to match the dimensionality of the docstring embedding, which is 400 in this case (see the shape printed above, despite the `dim500` in the filename).

In [22]:
#### Encoder Model ####
encoder_inputs = Input(shape=(doc_length,), name='Encoder-Input')
enc_out = encoder_model(encoder_inputs)

# first dense layer with batch norm
x = Dense(400, activation='relu')(enc_out)
x = BatchNormalization(name='bn-1')(x)
# linear output layer; its width must match the docstring embedding size (400)
out = Dense(400)(x)
code2emb_model = Model([encoder_inputs], out)

In [23]:
code2emb_model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Encoder-Input (InputLayer)  [(None, 55)]              0         
                                                                 
 Encoder-Model (Functional)  (None, 1000)              21410800  
                                                                 
 dense_2 (Dense)             (None, 400)               400400    
                                                                 
 bn-1 (BatchNormalization)   (None, 400)               1600      
                                                                 
 dense_3 (Dense)             (None, 400)               160400    
                                                                 
Total params: 21,973,200
Trainable params: 561,600
Non-trainable params: 21,411,600
_________________________________________________________________


### Train the `code2emb` Model

The model we are training is relatively simple: two dense layers on top of the pre-trained encoder.  We leave the encoder frozen at first and unfreeze it in a later step.
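
The training cells in this notebook compile the model with `cosine_proximity` as the loss. As a rough illustration of what that objective measures, the sketch below computes the negative cosine similarity between target and predicted vectors in NumPy. It reflects our understanding of the loss rather than the Keras implementation; the function and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_proximity(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Negative cosine similarity per row; -1.0 means the vectors point the same way."""
    # Normalize each row to unit length (eps guards against division by zero).
    true_unit = y_true / np.maximum(np.linalg.norm(y_true, axis=-1, keepdims=True), eps)
    pred_unit = y_pred / np.maximum(np.linalg.norm(y_pred, axis=-1, keepdims=True), eps)
    return -np.sum(true_unit * pred_unit, axis=-1)

# Toy targets and predictions: same direction, orthogonal, opposite direction.
targets = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
preds = np.array([[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
losses = cosine_proximity(targets, preds)
print(losses)
```

Because only the direction of the vectors matters, a validation loss closer to -1 indicates that predicted code embeddings are better aligned with their docstring embeddings.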

In [24]:
from keras.callbacks import CSVLogger, ModelCheckpoint
from keras import optimizers
import tensorflow.compat.v1 as tf

code2emb_model.compile(optimizer=optimizers.Nadam(lr=0.002), loss=tf.keras.losses.cosine_proximity)
script_name_base = 'code2emb_model_'
csv_logger = CSVLogger('{:}.log'.format(script_name_base))
model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
                                   save_best_only=True)

batch_size = 1500
epochs = 10
history = code2emb_model.fit([encoder_input_data], fastailm_emb,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])

  super(Nadam, self).__init__(name, **kwargs)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
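
The `validation_split=0.12` argument in the training cell above holds out the last 12% of the rows (taken before any shuffling) for validation. The sketch below estimates how many rows land on each side for the 139,472 examples loaded earlier; the helper name and the exact rounding rule are our assumptions.

```python
import math

def split_sizes(num_samples: int, validation_split: float) -> tuple:
    """Return (train_rows, validation_rows) for a tail hold-out split."""
    # Rows before split_at are used for training, the rest for validation.
    split_at = int(math.floor(num_samples * (1.0 - validation_split)))
    return split_at, num_samples - split_at

train_rows, val_rows = split_sizes(139472, 0.12)
steps_per_epoch = math.ceil(train_rows / 1500)  # batch_size used in the frozen stage
print(train_rows, val_rows)  # 122735 16737
print(steps_per_epoch)       # 82
```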


In [25]:
encoder_input_data.shape

(139472, 55)

`.7453`

### Unfreeze all Layers of Model and Resume Training

In the previous step, we left the encoder frozen.  Now that the dense layers are trained, we will unfreeze the entire model and let it train some more.  This will hopefully allow the model to specialize a bit more for this task.

In [26]:
for l in code2emb_model.layers:
    l.trainable = True
    print(l, l.trainable)

<keras.engine.input_layer.InputLayer object at 0x7fac6662b0d0> True
<keras.engine.functional.Functional object at 0x7faccd19e790> True
<keras.layers.core.dense.Dense object at 0x7fac6662b4c0> True
<keras.layers.normalization.batch_normalization.BatchNormalization object at 0x7fad3bc13ac0> True
<keras.layers.core.dense.Dense object at 0x7fac6662b130> True


In [27]:
code2emb_model.compile(optimizer=optimizers.Nadam(lr=0.0001), loss=tf.keras.losses.cosine_proximity)
script_name_base = 'code2emb_model_unfreeze_'
csv_logger = CSVLogger('{:}.log'.format(script_name_base))
model_checkpoint = ModelCheckpoint('{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
                                   save_best_only=True)

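batch_size = 100
# In Keras, `epochs` is the index of the final epoch, not a count of extra epochs.
# The frozen stage ran epochs 1-10, so resume at 10 and stop at 20 (10 more epochs).
epochs = 20
history = code2emb_model.fit([encoder_input_data], fastailm_emb,
          batch_size=batch_size,
          epochs=epochs,
          initial_epoch=10,
          validation_split=0.12, callbacks=[csv_logger, model_checkpoint])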
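
When resuming training with `initial_epoch`, Keras treats `epochs` as the index of the final epoch rather than as a number of additional epochs, so a resumed run only trains when `epochs` is larger than `initial_epoch`. The helper below is a hypothetical illustration of that arithmetic, not part of Keras.

```python
def epochs_to_run(initial_epoch: int, epochs: int) -> int:
    """Number of epochs a fit() call trains when `epochs` is the final epoch index."""
    return max(0, epochs - initial_epoch)

print(epochs_to_run(initial_epoch=0, epochs=10))   # 10
print(epochs_to_run(initial_epoch=10, epochs=20))  # 10
print(epochs_to_run(initial_epoch=16, epochs=10))  # 0
```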

### Save `code2emb` model

In [28]:
code2emb_model.save(OUTPUT_PATH/'code2emb_model.hdf5')

# Vectorize all of the code without docstrings

We want to vectorize all of the code without docstrings so we can test the efficacy of the search on code that the model never saw during training.

In [29]:
from keras.models import load_model
from pathlib import Path
import numpy as np
from seq2seq_utils import load_text_processor
code2emb_path = Path('./data/code2emb/')
seq2seq_path = Path('./data/seq2seq/')
data_path = Path('./data/processed_data/')

In [31]:
!pip install joblib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install keras==2.9.0 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting keras==2.9.0
  Downloading keras-2.9.0-py2.py3-none-any.whl (1.6 MB)
[K     |████████████████████████████████| 1.6 MB 13.7 MB/s 
[?25hInstalling collected packages: keras
  Attempting uninstall: keras
    Found existing installation: keras 2.11.0
    Uninstalling keras-2.11.0:
      Successfully uninstalled keras-2.11.0
Successfully installed keras-2.9.0


If you hit a `pad_sequences` import error, import it from `keras_preprocessing` instead: `from keras_preprocessing.sequence import pad_sequences`

In [32]:
code2emb_model = load_model(code2emb_path/'code2emb_model.hdf5')
num_encoder_tokens, enc_pp = load_text_processor(seq2seq_path/'py_code_proc_v2.dpkl')

with open(data_path/'without_docstrings.function', 'r') as f:
    no_docstring_funcs = f.readlines()

Size of vocabulary for data/seq2seq/py_code_proc_v2.dpkl: 20,002


### Pre-process code without docstrings for input into `code2emb` model

We use the same text processor (`enc_pp`) that we used to train the original model, so that tokens map to the same vocabulary ids the encoder saw during training.
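
As a rough picture of what this pre-processing produces, the sketch below maps tokens to integer ids and pads or truncates each function to a fixed length (the encoder in this notebook expects 55 ids per row). The vocabulary, the reserved ids, and the choice of left-padding are hypothetical; the fitted text processor may use different conventions.

```python
def encode_and_pad(tokens, vocab, max_len, pad_id=0, unk_id=1):
    """Map tokens to ids, truncate to max_len, and left-pad with pad_id."""
    ids = [vocab.get(tok, unk_id) for tok in tokens][:max_len]  # unknown tokens -> unk_id
    return [pad_id] * (max_len - len(ids)) + ids

# Toy vocabulary (hypothetical ids; 0 and 1 are reserved for padding and unknowns).
toy_vocab = {"def": 2, "self": 3, "return": 4}
row = encode_and_pad("def __repr__ self return Node".split(), toy_vocab, max_len=8)
print(row)  # [0, 0, 0, 2, 1, 3, 4, 1]
```

Reusing the processor fitted on the training data matters because a freshly fitted vocabulary would assign different ids to the same tokens, and the encoder's embedding layer would then look up the wrong vectors.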

In [33]:
# tokenized functions that did not contain docstrings
no_docstring_funcs[:5]

['function_tokens\n',
 'def __init__ self leafs edges self edges edges self leafs sorted leafs\n',
 'def __eq__ self other if isinstance other Node return id self id other or self leafs other leafs and self edges other edges else return False\n',
 'def __repr__ self return Node leafs edges format self leafs self edges\n',
 'staticmethod def _isCapitalized token return len token 1 and token isalpha and token 0 isupper and token 1 islower\n']

In [34]:
encinp = enc_pp.transform_parallel(no_docstring_funcs)
np.save(code2emb_path/'nodoc_encinp.npy', encinp)



### Extract code vectors

In [5]:
from keras.models import load_model
from pathlib import Path
import numpy as np
code2emb_path = Path('./data/code2emb/')
encinp = np.load(code2emb_path/'nodoc_encinp.npy')
code2emb_model = load_model(code2emb_path/'code2emb_model.hdf5')

Use the `code2emb` model to map the code into the same vector space as natural language 

In [6]:
nodoc_vecs = code2emb_model.predict(encinp, batch_size=10)



In [7]:
# make sure the number of output rows equals the number of input rows
assert nodoc_vecs.shape[0] == encinp.shape[0]

Save the vectorized code

In [8]:
np.save(code2emb_path/'nodoc_vecs.npy', nodoc_vecs)
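
Once the code vectors are saved, they can be searched by similarity to a query vector. The sketch below is a brute-force cosine search in NumPy over toy vectors, offered as a simple stand-in for the search index mentioned at the start of this notebook; the function name and data are hypothetical, and an approximate nearest-neighbor library would likely be preferable at the scale of the real dataset.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k rows of `vectors` most similar to `query`."""
    unit_vecs = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vecs @ unit_query   # cosine similarity of every row to the query
    return np.argsort(-scores)[:k]    # highest similarity first

# Toy "code vectors" (hypothetical); in practice these would be the rows of nodoc_vecs.
toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
hits = top_k_cosine(np.array([1.0, 0.0]), toy_vecs, k=2)
print(hits)  # [0 2]
```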