# Data

[150k Python Dataset](https://eth-sri.github.io/py150) from the SRI Lab at ETH Zurich.

# Code representation

Each code fragment is represented as a continuous vector embedding at method-level granularity, with each method treated as a "document".[13] The word embeddings are trained with FastText, a variation of the Word2Vec algorithm.
- Extracting Information from Source Code
    - simple tokenizer: extract all words from the source code by removing non-alphanumeric tokens. => the extracted words are undifferentiated (no way to tell which syntactic role each one played)
    - parser-based approach: traverse through the parse tree for each method, and extract information from the following syntactic categories. (Java-like)
        - method name
        - method invocation
        - Enums
        - String literals
        - comments
        - <strike>variable name</strike>
- Building Vector Representations
    - <strike>simply average embeddings</strike>
    - Weighted average of all unique words in a document=> normalized tf-idf
- Retrieval
    - average the vector representations of constituent words to create a document embedding for the query sentence
    - use a standard similarity search algorithm to find the document vectors with the smallest cosine distance to the query vector. => FAISS
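
The parser-based extraction step listed above can be sketched with Python's standard `ast` module. This is only an illustration under assumptions: the function name `extract_keywords` is hypothetical, the syntactic categories in the list are described for a Java-like language, and because Python's parser discards comments the docstring is used here as a stand-in for them.

```python
import ast

def extract_keywords(source):
    """Return {function_name: [keywords]} for every function found in `source`."""
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        keywords = [node.name]                    # method name
        docstring = ast.get_docstring(node)
        if docstring:
            keywords.extend(docstring.split())    # docstring (stand-in for comments)
        for child in ast.walk(node):
            if isinstance(child, ast.Call):       # method invocations
                func = child.func
                if isinstance(func, ast.Name):        # e.g. open(...)
                    keywords.append(func.id)
                elif isinstance(func, ast.Attribute): # e.g. json.loads(...)
                    keywords.append(func.attr)
        result[node.name] = keywords
    return result

# Toy input; the source is only parsed, never executed.
sample = '''
def load_config(path):
    """Read settings from disk."""
    with open(path) as fh:
        return json.loads(fh.read())
'''
keywords_by_function = extract_keywords(sample)
print(keywords_by_function["load_config"])
```

Variable names are deliberately not collected, matching the struck-through item in the list.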

# Model

Input: natural language queries <br>
Output: related code fragments retrieved directly from Github code corpus<br><br>

Map the input query into the same vector space as the code corpus, then rank the code fragments by the vector distance between the query and each document to return the most relevant results.
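
A minimal sketch of that mapping and ranking, written in plain Python so it runs without dependencies. The function names and the toy 2-dimensional vectors are made up for illustration, and the exhaustive scan is a stand-in for an index such as FAISS. Query words without a vector are simply skipped here, which is a simplification.

```python
import math

def average(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, word_vectors, document_vectors, top_k=3):
    """Embed the query as the mean of its word vectors, then rank documents."""
    known = [word_vectors[w] for w in query.lower().split() if w in word_vectors]
    if not known:
        return []
    query_vector = average(known)
    scored = [(cosine_similarity(query_vector, vec), name)
              for name, vec in document_vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Toy data (hypothetical): two words per document, 2-d embeddings.
toy_word_vectors = {
    "read": [1.0, 0.0], "file": [0.9, 0.1],
    "send": [0.0, 1.0], "email": [0.1, 0.9],
}
toy_document_vectors = {
    "read_file": average([toy_word_vectors["read"], toy_word_vectors["file"]]),
    "send_email": average([toy_word_vectors["send"], toy_word_vectors["email"]]),
}
results = search("read a file", toy_word_vectors, toy_document_vectors, top_k=2)
print(results)
```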

# Evaluation

Metric and choosing parameters of the model

- Metric: select subsets of words from a document as simulated queries and check whether the model can retrieve that document; then report the percentage of documents retrieved at top 1 and top 10.
    - random benchmark test
    - TF-IDF benchmark test => better performance
- Parameters:
    - embedding dimension => 500
    - three ways of combining word embeddings into document embeddings => the conclusion is that the tf-idf weighted average works better
    - vector representation=> BM25 is better
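
The simulated-query metric described above could be computed roughly as follows. This is a sketch under assumptions: `top_k_hit_rate` and `make_overlap_search` are hypothetical names, the query size of 2 is arbitrary, and the term-overlap retriever only stands in for the embedding-based search so that the example is self-contained.

```python
import random

def top_k_hit_rate(documents, search_fn, k, query_size=2, seed=0):
    """Fraction of documents found in the top-k results of their own simulated query."""
    rng = random.Random(seed)
    hits = 0
    for doc_id, terms in enumerate(documents):
        pool = sorted(terms)  # sort so that sampling from a set is reproducible
        query = rng.sample(pool, min(query_size, len(pool)))
        if doc_id in search_fn(query, k):
            hits += 1
    return hits / len(documents)

def make_overlap_search(documents):
    """Hypothetical stand-in retriever: rank documents by number of shared query terms."""
    def search(query, k):
        def overlap(i):
            return len(set(query) & set(documents[i]))
        ranked = sorted(range(len(documents)), key=overlap, reverse=True)
        return ranked[:k]
    return search

# Toy corpus: each document is a set of keywords, as in the notebook cells.
toy_documents = [
    {"read", "file", "path"},
    {"send", "email", "smtp"},
    {"parse", "json", "file"},
]
toy_search = make_overlap_search(toy_documents)
top1 = top_k_hit_rate(toy_documents, toy_search, k=1)
top10 = top_k_hit_rate(toy_documents, toy_search, k=10)
print("top-1: {:.2f}, top-10: {:.2f}".format(top1, top10))
```

Swapping `toy_search` for a function that queries the document embeddings would give the top-1 and top-10 numbers for the actual model.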

# Implementation

In [1]:
import numpy as np
import pandas as pd
import string
import re

from gensim.models import FastText

## Data Processing

In [2]:
# load in processed file. A set of keywords for each document (source code function)
unpickled_df = pd.read_pickle("data/py100.pkl")
func_size=len(unpickled_df)
print("Total Number of Functions: {}".format(func_size))
unpickled_df.head()

Total Number of Functions: 1038


| Unnamed: 0 | data_id | function_name | docstring | func_call |
|---|---|---|---|---|
| 0 | 1 | `__getattribute__` | | `__getattribute__` |
| 1 | 1 | `__setattr__` | | `ref,__setattr__` |
| 2 | 2 | `main` | | `setup,closing,ZmqProxy,consume_in_thread,wait` |
| 3 | 3 | `test_vpnservice_create` | | `create_stubs,first,AndReturn,ReplayAll,vpnserv...` |
| 4 | 3 | `test_vpnservices_get` | | `create_stubs,AndReturn,ReplayAll,vpnservices_g...` |


In [32]:
def parse_func_name(terms):
    
    # remove punctuations except for "_"
    ## reference: https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    regex = re.compile('[%s]' % re.escape(string.punctuation.replace("_", "")))
    terms=regex.sub('', terms)
    
    #[TODO] camel case
    
    # [TODO] deal with __init__

    # lowercase
    terms=terms.lower()
    
    # snake case
    return [term for term in terms.split("_") if term!=""]
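
One possible way to address the camel-case TODO is sketched below. It is not part of the notebook: `split_identifier` is a hypothetical helper, it does not strip punctuation, and it has to run before lowercasing because the case boundaries are what it splits on.

```python
import re

# Zero-width boundaries: lower/digit followed by upper ("zmq|Proxy"),
# or the end of an acronym followed by a capitalised word ("HTTP|Response").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def split_identifier(name):
    """Split an identifier on underscores and camel-case boundaries, lowercased."""
    terms = []
    for part in name.split("_"):
        if part:  # skips the empty pieces produced by names like "__init__"
            terms.extend(piece.lower() for piece in _CAMEL_BOUNDARY.split(part))
    return terms

for example in ["ZmqProxy", "consume_in_thread", "__init__", "parseHTTPResponse"]:
    print(example, "->", split_identifier(example))
```

Splitting on empty matches with `re.split` requires Python 3.7 or later.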
    

In [34]:
# for each function, combine all the keywords into a set
list_function_keywords=[]
for idx in range(len(unpickled_df)):
    keywords=[]
    
    func_name=parse_func_name(unpickled_df.iloc[idx]["function_name"].lower())
    keywords+=func_name
    
    #[TODO] only alphanumeric characters
    #[TODO] camel case
    #[TODO] snake case
    
    if unpickled_df.iloc[idx]["docstring"]:
        docstring=unpickled_df.iloc[idx]["docstring"].lower().split()
        keywords+=docstring
    
    if unpickled_df.iloc[idx]["func_call"]:
        func_invoc=unpickled_df.iloc[idx]["func_call"].lower().split(",")
        keywords+=func_invoc
    
    list_function_keywords.append(set(keywords))

In [59]:
len(list_function_keywords)

1038

## Building Word Embeddings

In [47]:
# hyperparameters
vocab_size=200  # NOTE: despite the name, this is the embedding dimension, not the vocabulary size
window_size=5
min_count=1


# other parameters defined earlier
# func_size

In [48]:
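# Intended setup: the continuous skip-gram model with a window size of 5,
# i.e. all pairs of words within distance 5 are considered nearby words.
# NOTE: gensim's FastText defaults to CBOW (sg=0); pass sg=1 to train skip-gram.
# NOTE: `size` is the gensim 3.x parameter name; gensim 4 renamed it to `vector_size`.

#[TODO] tuning hyperparameters
model = FastText(size=vocab_size, window=window_size, min_count=min_count)  # instantiate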
model.build_vocab(sentences=list_function_keywords)
model.train(sentences=list_function_keywords, total_examples=len(list_function_keywords), epochs=10)  # train

In [49]:
print(model)

FastText(vocab=2826, size=200, alpha=0.025)


In [50]:
# saving a model trained via Gensim's fastText implementation
model.save('saved_model_gensim')

In [51]:
trained_ft_vectors = model.wv
# save vectors to file if you want to use them later
trained_ft_vectors.save_word2vec_format('embeddings.txt', binary=False)

In [58]:
# Test
trained_ft_vectors.most_similar("button", topn=10)

[('buttonpresssignal', 0.9999955296516418),
 ('buttonreleasesignal', 0.9999948740005493),
 ('menubutton', 0.9999943971633911),
 ('buttonpress', 0.999994158744812),
 ('buttonrelease', 0.9999939203262329),
 ('__menubutton', 0.9999935626983643),
 ('gadgetmenudefinition', 0.9999933838844299),
 ('widgetmenudefinition', 0.9999933242797852),
 ('__selectionchangedsignal', 0.9999933242797852),
 ('__addmenudefinition', 0.9999932646751404)]

## Building Document Embeddings

1. Average over all the words;
2. Average over the unique words in each document;
3. [x] Weighted average of all unique words in a document
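
A sketch of option 3, in plain Python with toy vectors. The exact weighting is an assumption: since each document here is a set of unique terms, the term frequency is 1 and the weight reduces to a smoothed IDF, normalized so that the weights within a document sum to 1. The helper names are hypothetical, and the paper's formula may differ.

```python
import math
from collections import Counter

def idf_weights(documents):
    """Smoothed inverse document frequency for every term (assumed formula)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def weighted_document_vector(doc, word_vectors, idf):
    """Average the word vectors of the unique terms, weighted by normalized IDF."""
    terms = sorted(set(doc))
    dim = len(next(iter(word_vectors.values())))
    vector = [0.0] * dim
    total = sum(idf[t] for t in terms)
    if total == 0:
        return vector  # empty document
    for term in terms:
        weight = idf[term] / total  # weights sum to 1 within the document
        for i, value in enumerate(word_vectors[term]):
            vector[i] += weight * value
    return vector

# Toy data (hypothetical): "file" appears in both documents, so it gets less weight.
toy_docs = [{"read", "file"}, {"write", "file"}]
toy_vectors = {"read": [1.0, 0.0], "write": [0.0, 1.0], "file": [0.5, 0.5]}
toy_idf = idf_weights(toy_docs)
doc0 = weighted_document_vector(toy_docs[0], toy_vectors, toy_idf)
print(doc0)
```

Compared with a plain average, the rarer term "read" pulls the first document's vector further toward its own direction.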

In [53]:
trained_ft_vectors["ping"]

array([-5.56673110e-01,  5.04946597e-02, -6.21083099e-03,  7.66705275e-02,
       -4.35648561e-01, -1.12217357e-02, -1.99079901e-01, -1.77537516e-01,
       -1.90618366e-01, -1.70305610e-01,  2.27255642e-01,  9.98908356e-02,
       -2.59666800e-01, -5.06281614e-01,  3.26456904e-01, -2.70832479e-01,
        2.08521977e-01,  8.82439911e-02, -1.41606554e-01,  3.91939998e-01,
        4.06734288e-01,  1.84914604e-01,  1.56045079e-01,  2.20082983e-01,
        4.43840586e-02, -2.10403875e-01, -7.75189400e-01,  3.60961616e-01,
        3.83777827e-01, -1.51074573e-01, -3.56423855e-02, -3.47285122e-02,
        7.90149271e-02, -1.95222750e-01, -1.23928860e-01, -1.25188157e-01,
       -2.44518846e-01,  1.82098448e-02,  2.50440568e-01,  2.17390880e-01,
       -4.60534185e-01,  7.12547824e-02, -3.46408099e-01, -1.01377316e-01,
        3.02482873e-01,  5.12988389e-01, -2.71733373e-01, -1.14690021e-01,
        1.18469305e-01,  1.80004150e-01, -1.58470497e-01, -1.23429805e-01,
        2.89993674e-01,  

In [54]:
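# One row per function; vocab_size holds the embedding dimension here.
document_embeddings=np.zeros((func_size, vocab_size))
for idx, doc in enumerate(list_function_keywords):
    # NOTE: this is an unweighted sum over the unique terms of the document,
    # not yet the tf-idf weighted average selected in the list above.
    doc_vec_sum=np.zeros(vocab_size)
    for term in doc:
        doc_vec_sum+=trained_ft_vectors[term]
    document_embeddings[idx]=doc_vec_sum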

In [55]:
document_embeddings[0]

array([-8.78725350e-01,  8.05873852e-02, -9.74157732e-03,  1.20289113e-01,
       -6.88482195e-01, -1.99429551e-02, -3.12518537e-01, -2.79037617e-01,
       -3.01731780e-01, -2.65968807e-01,  3.59103099e-01,  1.56895529e-01,
       -4.08433288e-01, -7.98545092e-01,  5.14272422e-01, -4.24994081e-01,
        3.29313479e-01,  1.37901261e-01, -2.21873112e-01,  6.16696984e-01,
        6.42413348e-01,  2.90032335e-01,  2.49334745e-01,  3.46673459e-01,
        7.29946401e-02, -3.31593908e-01, -1.21915424e+00,  5.69301248e-01,
        6.01602867e-01, -2.38119364e-01, -5.76344281e-02, -5.45627698e-02,
        1.25188146e-01, -3.10112871e-01, -1.95762858e-01, -1.98347032e-01,
       -3.84881064e-01,  2.96096532e-02,  3.90661761e-01,  3.46012399e-01,
       -7.26857781e-01,  1.12263758e-01, -5.44764236e-01, -1.58682752e-01,
        4.74034548e-01,  8.07780802e-01, -4.27958280e-01, -1.77960575e-01,
        1.89581826e-01,  2.82897569e-01, -2.50179067e-01, -1.92018941e-01,
        4.57203627e-01,  

In [56]:
print("{} documents with {} dimensions".format(document_embeddings.shape[0], document_embeddings.shape[1]))

1038 documents with 200 dimensions


## Evaluate Model

# Notes

# Questions
The paper mentions two evaluation approaches: one uses GitHub only, the other uses both GitHub and StackOverflow. I'm guessing the former is for tuning during the development stage, while the latter is the final evaluation of the completed system (NCS).