# Data

The [150k Python Dataset](https://eth-sri.github.io/py150) from the SRI Lab at ETH Zurich.

# Code representation

Each code fragment, at method-level granularity, is treated as a "document" and represented by a continuous vector embedding.[13] Word embeddings are trained with FastText, a variation of the Word2Vec algorithm.
- Extracting Information from Source Code
    - simple tokenizer: extract all words from source code by removing non-alphanumeric tokens. => words from different syntactic categories become indistinguishable
    - parser-based approach: traverse the parse tree of each method and extract information from the following syntactic categories (Java-like):
        - method name
        - method invocation
        - Enums
        - String literals
        - comments
        - <strike>variable name</strike>
- Building Vector Representations
    - <strike>simply average embeddings</strike>
    - Weighted average of all unique words in a document => normalized tf-idf
- Retrieval
    - average the vector representations of the query's constituent words to create a document embedding for the query sentence
    - use a standard similarity search algorithm to find the document vectors closest to the query by cosine distance => FAISS
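
For Python sources, the parser-based extraction listed above could be sketched with the standard `ast` module. This is only an illustration under assumptions, not the preprocessing that produced the pickle file used later: the helper name `extract_function_info` is hypothetical, and the dictionary keys are chosen to mirror the `function_name`, `docstring`, and `func_call` columns of the notebook.

```python
import ast

def extract_function_info(source):
    """Hypothetical helper: collect name, docstring and called names per function."""
    records = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = []
            # Walk the function body and record the name of every call target.
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    func = child.func
                    if isinstance(func, ast.Name):         # e.g. open(...)
                        calls.append(func.id)
                    elif isinstance(func, ast.Attribute):  # e.g. f.read(...)
                        calls.append(func.attr)
            records.append({
                "function_name": node.name,
                "docstring": ast.get_docstring(node) or "",
                "func_call": ",".join(calls),
            })
    return records

example = '''
def load(path):
    """Read a file."""
    with open(path) as f:
        return f.read().strip()
'''
records = extract_function_info(example)
print(records)
```

Calls inside nested functions are also attributed to the enclosing function here; whether that is desirable depends on how methods are delimited in the corpus.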

# Model

Input: natural language queries <br>
Output: related code fragments retrieved directly from Github code corpus<br><br>

Map the input query into the same vector space as the codebase, then rank the code fragments by their vector distance to the query to retrieve the most relevant results.
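
The mapping and ranking step can be sketched in plain Python as follows. This is a minimal sketch under assumptions: the 2-D word vectors are made up for illustration, the function names (`average_vectors`, `cosine_similarity`, `search`) are hypothetical, and a real system would use the trained embeddings with a library such as FAISS rather than a linear scan.

```python
import math

def average_vectors(words, word_vectors):
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, word_vectors, doc_vectors, top_k=3):
    """Embed the query like a document, then rank documents by cosine similarity."""
    query_vec = average_vectors(query.lower().split(), word_vectors)
    if query_vec is None:
        return []
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 2-D vectors, made up for illustration only.
toy_word_vectors = {
    "read": [1.0, 0.0], "file": [0.9, 0.1],
    "send": [0.0, 1.0], "email": [0.1, 0.9],
}
toy_doc_vectors = {
    "read_file": average_vectors(["read", "file"], toy_word_vectors),
    "send_email": average_vectors(["send", "email"], toy_word_vectors),
}
results = search("read a file", toy_word_vectors, toy_doc_vectors, top_k=1)
print(results)
```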

# Evaluation

Evaluation metric and choice of model parameters.

- Metric: select subsets of words from each document as simulated queries, check whether the query retrieves the original document, and report the percentage of documents retrieved at top 1 and top 10.
    - random benchmark test
    - TF-IDF benchmark test => better performance
- Parameters:
    - embedding dimension => 500
    - three ways of combining word embeddings into document embeddings => the conclusion is that tf-idf weighting is better
    - vector representation => BM25 is better
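
The simulated-query metric above can be sketched as a hit rate at k. The sketch below assumes random word selection (the "random benchmark" variant), unweighted averaging, and made-up 3-D vectors; `hit_rate_at_k` and its parameters are hypothetical names, not taken from the paper or the notebook.

```python
import math
import random

def mean_vector(words, word_vectors):
    vecs = [word_vectors[w] for w in words]
    return [sum(column) / len(vecs) for column in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def hit_rate_at_k(documents, word_vectors, k, query_len=2, seed=0):
    """Fraction of documents found in the top k for a query sampled from their own words."""
    rng = random.Random(seed)
    doc_vecs = {doc_id: mean_vector(words, word_vectors)
                for doc_id, words in documents.items()}
    hits = 0
    for doc_id, words in documents.items():
        # Simulated query: a random subset of the document's words.
        query_words = rng.sample(sorted(words), min(query_len, len(words)))
        query_vec = mean_vector(query_words, word_vectors)
        ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
        if doc_id in ranked[:k]:
            hits += 1
    return hits / len(documents)

# Toy corpus and vectors, made up for illustration only.
toy_vectors = {
    "read": [1.0, 0.0, 0.0], "file": [0.9, 0.0, 0.1],
    "send": [0.0, 1.0, 0.0], "email": [0.1, 0.9, 0.0],
    "parse": [0.0, 0.0, 1.0], "json": [0.0, 0.1, 0.9],
}
toy_documents = {
    "read_file": {"read", "file"},
    "send_email": {"send", "email"},
    "parse_json": {"parse", "json"},
}
top1 = hit_rate_at_k(toy_documents, toy_vectors, k=1, query_len=1)
print("top-1 hit rate:", top1)
```

On the real corpus the same function would be called with k=1 and k=10 to obtain the two reported percentages.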

# Implementation

In [1]:
import numpy as np
import pandas as pd
import string
import re
import pickle

from gensim.models import FastText
from gensim.models import KeyedVectors
from time import time

In [2]:
# change parameters here
function_file_path="data/pickle100_list.pkl"
wordembedding_file_path="data/embeddings.txt"
docembedding_file_path="data/document_embeddings.csv"

## Data Processing

In [3]:
# load in processed file. A set of keywords for each document (source code function)
def load_words_from_ast(file_path):
    with open(file_path, 'rb') as f:
        function_list = pickle.load(f)
    unpickled_df = pd.DataFrame(function_list, columns=['data_id', 'function_name', 'docstring', 'func_call'])

    func_size=len(unpickled_df)
    print("Total Number of Functions in \"{}\": {}".format(file_path, func_size))
    return unpickled_df

unpickled_df=load_words_from_ast(function_file_path)

Total Number of Functions in "data/pickle100_list.pkl": 1038


In [4]:
# exclude uninformative functions (no function calls, or `__init__`) from the code base
# 1 - keep, 0 - delete from code base
unpickled_df['keep_in_codebase'] = np.where(((unpickled_df['func_call'] == '')|
                                            (unpickled_df['function_name'] == '__init__')
                                            ), 0, 1)
unpickled_df.head()

| | data_id | function_name | docstring | func_call | keep_in_codebase |
|---|---|---|---|---|---|
| 0 | 1 | `__getattribute__` | | `__getattribute__` | 1 |
| 1 | 1 | `__setattr__` | | `ref,__setattr__` | 1 |
| 2 | 2 | `main` | | `setup,closing,ZmqProxy,consume_in_thread,wait` | 1 |
| 3 | 3 | `test_vpnservice_create` | | `create_stubs,first,AndReturn,ReplayAll,vpnserv...` | 1 |
| 4 | 3 | `test_vpnservices_get` | | `create_stubs,AndReturn,ReplayAll,vpnservices_g...` | 1 |


In [5]:
# output new function file with `keep_in_codebase` column

In [6]:
def parse_func_name(terms):
    """
    Return a list of unique lowercase keywords from a function name
    """
    #1. alphanumeric and "-", "_" only (raw string avoids the invalid "\w" escape)
    terms=re.sub(r"[^\w_-]", " ", terms)
    
    #2. snake case (also split on the spaces introduced by step 1)
    result=[term for term in re.split(r"[\s_]+", terms) if term!=""]
    
    #3. camel case
    ## https://stackoverflow.com/questions/29916065/how-to-do-camelcase-split-in-python
    def camel_case_split(identifier):
        matches = re.finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
        return [m.group(0) for m in matches]
    
    # flatten the per-term lists and lowercase
    result=[s.lower() for t in result for s in camel_case_split(t)]
    # [TODO] deal with __init__

    # deduplicate after lowercasing; dict.fromkeys keeps first-seen order
    return list(dict.fromkeys(result))

def parse_docstring(docstring):
    """
    Replace every punctuation character except "_" in a docstring with a space
    """
    ## reference: https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    regex = re.compile('[%s]' % re.escape(string.punctuation.replace("_", "")))
    return regex.sub(' ', docstring)
    
def parse_func_call(terms):
    result=[]
    terms_list=terms.split(",")
    for term in terms_list:
        result+=parse_func_name(term)
    return result

In [7]:
## Parsing examples
##parse_docstring("test 123, real_test.")
#parse_func_name("UnpickledDfTest")
#parse_func_call("setup,closing,ZmqProxy,consume_in_thread,wait")
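
The identifier splitting done by the helpers above can also be illustrated with a compact, self-contained variant. This is a sketch using a different regular expression than `parse_func_name`, and `split_identifier` is a hypothetical name; it is included only to show what snake-case and camel-case splitting is expected to produce.

```python
import re

# Matches, in order: an acronym followed by a capitalized word ("HTTP" in "HTTPServer"),
# an optionally capitalized lowercase word, a trailing acronym, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(identifier):
    """Split a snake_case or camelCase identifier into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        words.extend(match.lower() for match in _WORD.findall(part))
    return words

print(split_identifier("UnpickledDfTest"))    # ['unpickled', 'df', 'test']
print(split_identifier("consume_in_thread"))  # ['consume', 'in', 'thread']
print(split_identifier("HTTPServer2"))        # ['http', 'server', '2']
print(split_identifier("__init__"))           # ['init']
```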

In [8]:
#[TODO] wrap processing into pipeline
#def parse_keywords(df):
#    data = (
#        data
#        # Clean Data
#        .pipe(parse_func_name)
#        .pipe(fix_missing_desc)
#        .pipe(change_lowercase_title)
#        .pipe(change_lowercase_desc)
#        
#        # Transform data
#        .pipe(select_columns, 
#              "item_id",
#              "title",
#              "description",
#              "deal_probability"
#             )
#        .pipe(add_char_len_title)
#        .pipe(add_char_len_desc)
#        .pipe(add_tfidf_title)
#        .pipe(add_tfidf_desc)
#        #.pipe(add_keywords_desc)
#    ) 

In [9]:
# for each function, combine all the keywords into a set
def parse_keywords(df):
    list_function_keywords=[]
    for idx in range(len(df)):
        # read from the `df` argument, not the global `unpickled_df`
        row=df.iloc[idx]
        keywords=[]

        func_name=parse_func_name(row["function_name"])
        keywords+=func_name
        
        # [TODO] exact possible function names (snakecase or camelcase)
        if row["docstring"]:
            docstring=row["docstring"].lower().split()
            keywords+=docstring

        if row["func_call"]:
            func_invoc=parse_func_call(row["func_call"])
            keywords+=func_invoc

        list_function_keywords.append(set(keywords))
    return list_function_keywords

list_function_keywords=parse_keywords(unpickled_df)

In [10]:
len(list_function_keywords) #742490 for 100k

1038

## Building Word Embeddings

In [11]:
# hyperparameters
#num_list_function_keywords=742490
vocab_size=500  # embedding dimension (despite the name, not the vocabulary size)
window_size=5
min_count=1


# other parameters defined earlier
func_size=len(unpickled_df)

In [12]:
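# The paper uses the continuous skip-gram model with a window size of 5,
# i.e. all pairs of words within distance 5 are considered nearby words.
# Note: gensim's FastText defaults to CBOW (sg=0); pass sg=1 to train skip-gram instead.
st=time()
#[TODO] tuning hyperparameters
# `size` is the gensim 3.x keyword; gensim 4 renamed it to `vector_size`
model = FastText(size=vocab_size, window=window_size, min_count=min_count)  # instantiate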
model.build_vocab(sentences=list_function_keywords)
model.train(sentences=list_function_keywords, total_examples=len(list_function_keywords), epochs=10)  # train
print("Run time: {} s".format(time()-st))

Run time: 1.6682569980621338 s


In [13]:
print(model)

FastText(vocab=2193, size=500, alpha=0.025)


In [14]:
# saving a model trained via Gensim's fastText implementation
# 2019/03/05: the model might be too big. Saving word vector only.
# model.save('saved_model_gensim')

In [15]:
#[TODO] normalize word embeddings
trained_ft_vectors = model.wv
# save vectors to file if you want to use them later
trained_ft_vectors.save_word2vec_format(wordembedding_file_path, binary=False)

In [16]:
# load wordembedding
st=time()
trained_ft_vectors = KeyedVectors.load_word2vec_format(wordembedding_file_path)
print("Run time: {} s".format(time()-st))

Run time: 1.0046939849853516 s


In [17]:
# Test
trained_ft_vectors.most_similar("pop", topn=10)

[('popitemlist', 0.9999936819076538),
 ('poplist', 0.9999935030937195),
 ('popitem', 0.9999934434890747),
 ('profileconnectionv3', 0.9999931454658508),
 ('formatted', 0.999993085861206),
 ('serverhardwaretypeuri:', 0.9999929666519165),
 ('res', 0.9999929666519165),
 ('profileconnectionv4', 0.9999929666519165),
 ('serverprofiledescription:', 0.9999929666519165),
 ('serverhardwaretypeuri', 0.9999929666519165)]

## Building Document Embeddings

1. Average over all the words;
2. Average over the unique words in each document;
3. [x] Weighted average of all unique words in a document
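
Option 3 can be sketched as follows. Because only the unique words of a document are used, each term frequency is 1 and the tf-idf weight reduces to the IDF; the smoothed IDF formula and the normalization by the sum of weights are assumptions of this sketch, and the function names and toy vectors are hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1.0
            for word, df in doc_freq.items()}

def weighted_doc_embedding(doc, word_vectors, idf):
    """IDF-weighted average over the unique words of one document."""
    dim = len(next(iter(word_vectors.values())))
    words = [w for w in set(doc) if w in word_vectors]
    # Words missing from the IDF table fall back to a neutral weight of 1.0.
    total = sum(idf.get(w, 1.0) for w in words)
    if total == 0:
        return [0.0] * dim
    return [sum(idf.get(w, 1.0) * word_vectors[w][i] for w in words) / total
            for i in range(dim)]

# Toy corpus and 2-D vectors, made up for illustration only.
toy_docs = [["get", "user"], ["get", "file"], ["get", "email"]]
toy_vectors = {"get": [1.0, 0.0], "user": [0.0, 1.0],
               "file": [0.5, 0.5], "email": [0.2, 0.8]}
idf = idf_weights(toy_docs)
embedding = weighted_doc_embedding(toy_docs[0], toy_vectors, idf)
print(embedding)
```

In the toy corpus, "get" appears in every document and so receives the lowest weight, which pulls the embedding of the first document toward the rarer word "user".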

In [18]:
trained_ft_vectors["ping"]

array([-6.25856370e-02, -5.55267222e-02, -7.85330608e-02, -3.93048376e-02,
       -4.29004878e-02, -3.79858464e-02,  1.42647758e-01, -1.74224690e-01,
       -8.96555465e-03,  3.22859317e-01, -6.30123094e-02,  2.95341253e-01,
        4.56187874e-02, -2.16565847e-01,  1.87134832e-01, -1.82485044e-01,
       -2.77954359e-02,  2.66234815e-01, -2.45725200e-01, -1.53831780e-01,
       -5.41035794e-02,  1.04464404e-01,  1.26882985e-01, -7.77364671e-02,
        1.23624302e-01, -1.39036879e-01,  3.21845189e-02,  1.37269631e-01,
       -9.99649465e-02, -4.16537654e-03, -7.15402663e-02,  3.87921780e-02,
       -2.34017938e-01,  4.73021120e-02,  4.31040108e-01, -4.36112620e-02,
       -1.29100844e-01,  1.07720375e-01, -2.80036390e-01,  1.29020929e-01,
        8.93684402e-02, -1.88718773e-02,  3.14803213e-01,  2.48088613e-02,
       -2.70498693e-02,  4.17367332e-02, -1.13380343e-01,  9.18267220e-02,
        2.16031186e-02, -2.33349949e-01,  7.70558640e-02,  1.41062737e-01,
        1.37658939e-01,  

In [19]:
def gen_doc_embedding(df, trained_ft_vectors):
    # one row per function; functions excluded from the code base keep an all-zero row
    dim=trained_ft_vectors.vector_size
    document_embeddings=np.zeros((len(df), dim))
    for idx in range(len(df)):
        row=df.iloc[idx]
        if row["keep_in_codebase"]!=1:
            continue

        keywords=parse_func_name(row["function_name"])
        if row["docstring"]:
            keywords+=row["docstring"].lower().split()
        if row["func_call"]:
            keywords+=parse_func_call(row["func_call"])

        # simple (unweighted) average over the unique words that have a vector;
        # vectors reloaded from word2vec format cannot embed unseen words
        set_keywords=[term for term in set(keywords) if term in trained_ft_vectors]
        if set_keywords:
            document_embeddings[idx]=np.mean([trained_ft_vectors[term] for term in set_keywords], axis=0)
    return document_embeddings
 
st=time()
document_embeddings=gen_doc_embedding(unpickled_df, trained_ft_vectors)

# save the whole document_embeddings
np.savetxt(docembedding_file_path, document_embeddings, delimiter=",")
print("Run time: {} s".format(time()-st))

Run time: 1.3863921165466309 s


In [20]:
document_embeddings[0]

array([-3.66801880e-02, -3.21855582e-02, -4.59765755e-02, -2.35858317e-02,
       -2.45559718e-02, -2.24505942e-02,  8.43747035e-02, -1.02493078e-01,
       -5.15302084e-03,  1.90111786e-01, -3.72895114e-02,  1.73853204e-01,
        2.66733337e-02, -1.27457857e-01,  1.10210687e-01, -1.07288539e-01,
       -1.63457636e-02,  1.57116175e-01, -1.45147130e-01, -9.02184844e-02,
       -3.14268246e-02,  6.15113713e-02,  7.45549425e-02, -4.58974466e-02,
        7.30959624e-02, -8.18264559e-02,  1.86812934e-02,  8.09102580e-02,
       -5.87484464e-02, -2.88770651e-03, -4.22209315e-02,  2.24082600e-02,
       -1.37350544e-01,  2.78440081e-02,  2.54560262e-01, -2.57833675e-02,
       -7.63532147e-02,  6.34653047e-02, -1.64957941e-01,  7.54120201e-02,
        5.25249094e-02, -1.04196491e-02,  1.85203046e-01,  1.43219084e-02,
       -1.51749421e-02,  2.44742054e-02, -6.67797551e-02,  5.35738915e-02,
        1.28637813e-02, -1.37296230e-01,  4.48314771e-02,  8.30411613e-02,
        8.13265145e-02,  

In [21]:
print("{} documents with {} dimensions".format(document_embeddings.shape[0], document_embeddings.shape[1]))

1038 documents with 500 dimensions


## Evaluate Model

In [22]:
trained_ft_vectors["pop"]

array([-4.68079038e-02, -4.04543281e-02, -5.80844954e-02, -2.87060738e-02,
       -3.15373875e-02, -2.84324195e-02,  1.05471179e-01, -1.28478676e-01,
       -6.28562225e-03,  2.38256693e-01, -4.65923771e-02,  2.17354193e-01,
        3.38355266e-02, -1.59164429e-01,  1.38680652e-01, -1.34276271e-01,
       -2.12664790e-02,  1.97024718e-01, -1.82871431e-01, -1.13444343e-01,
       -4.01647165e-02,  7.72339627e-02,  9.40907523e-02, -5.74264862e-02,
        9.07003507e-02, -1.03050575e-01,  2.28626169e-02,  1.00894913e-01,
       -7.41708502e-02, -3.23248655e-03, -5.41171655e-02,  2.81273890e-02,
       -1.72783270e-01,  3.50187048e-02,  3.18187565e-01, -3.21109183e-02,
       -9.54566523e-02,  7.98353478e-02, -2.06494600e-01,  9.41645727e-02,
        6.62450865e-02, -1.30515797e-02,  2.31177941e-01,  1.79727860e-02,
       -1.91488601e-02,  3.03185247e-02, -8.42062011e-02,  6.73439056e-02,
        1.63530800e-02, -1.71801567e-01,  5.60964048e-02,  1.04134865e-01,
        1.02509938e-01,  

# Notes

# Questions
The paper mentions two evaluation approaches: one uses GitHub only, the other uses both GitHub and StackOverflow. I'm guessing the former is for tuning during the development stage, while the latter is the final evaluation of the completed system (NCS).