# Data

[150k Python Dataset]("https://eth-sri.github.io/py150") from SRILAB.

# Code representation

A continuous vector embedding of each code fragment at method–level granularity as "document".[13] FastText, a variation of Word2Vec algorithm.
- Extracting Information from Source Code
    - simple tokenizer: extract all words from source code by removing non–alphanumeric tokens.=> indifferenciable
    - parser-based approach: traverse through the parse tree for each method, and extract information from the following syntactic categories. (Java-like)
        - method name
        - method invocation
        - Enums
        - String literals
        - comments
        - <strike>variable name</strike>
- Building Vector Representations
    - <strike>simply average embeddings</strike>
    - Weighted average of all unique words in a document=> normalized tf-idf
- Retrieval
    - average the vector representations of constituent words to create a document embedding for the query sentence
    - a standard similarity search algorithm to find the document vectors with closest cosine distance. => FAISS 

# Model

Input: natural language queries <br>
Output: related code fragments retrieved directly from Github code corpus<br><br>

Map the Input into the same vector space as the codebase, and then calculate the vector distance of them in order to get the relevant result.

# Evaluation

Metric and choosing parameters of the model

- Metric: select subsets of words from the document as simulated queries and then see if it can retrive the document, and then evaluate by the percentage of the documents that are retrieve back at top1 and top10. 
    - random benchmark test
    - TF-IDF benchmark test => better performance
- Parameters:
    - embedding dimention=> 500
    - three ways of combining word embeddings to document embeddings=> the conclusion is tf-idf better
    - vector representation=> BM25 is better

# Implementation

In [14]:
import numpy as np
import pandas as pd

from gensim.models import FastText

## Data Processing

In [21]:
# load in processed file. A set of keywords for each document (source code function)
unpickled_df = pd.read_pickle("./dummy.pkl")
func_size=len(unpickled_df)
print("Total Number of Functions: {}".format(func_size))
unpickled_df.head()

Total Number of Functions: 264


Unnamed: 0,data_id,function_name,docstring
0,8,ping,Handle ping requests
1,8,message,Proxy message from one user to another
2,8,presence,Presence information may be sent out from the ...
3,8,roster,A roster is this account's list of contacts; i...
4,8,push,Push roster changes to all clients that have r...


In [3]:
# for each function, combine all the keywords into a set
list_function_keywords=[]
for idx in range(len(unpickled_df)):
    keywords=[]
    keywords.append(unpickled_df.iloc[idx]["function_name"].lower())
    
    #[TODO] only alphabenumeric characters
    #[TODO] camel case
    #[TODO] snake case
    #[TODO] Add function calls
    docstring=unpickled_df.iloc[0]["docstring"].lower().split()
        
    
    keywords+=docstring
    list_function_keywords.append(set(keywords))

## Building Word Embeddings

In [23]:
# hyperparameters
vocab_size=10
window_size=5
min_count=1


# other parameters defined earlier
# func_size

In [4]:
# We employ the continuous skip–gram model with a window size of 5, 
# i.e. all pairs of words within distance 5 are considered nearby words.

#[TODO] tuning hyperparameters
model = FastText(size=vocab_size, window=window_size, min_count=min_count)  # instantiate
model.build_vocab(sentences=list_function_keywords)
model.train(sentences=list_function_keywords, total_examples=len(list_function_keywords), epochs=10)  # train

In [5]:
print(model)

FastText(vocab=212, size=10, alpha=0.025)


In [6]:
# saving a model trained via Gensim's fastText implementation
model.save('saved_model_gensim')

In [7]:
trained_ft_vectors = model.wv
# save vectors to file if you want to use them later
trained_ft_vectors.save_word2vec_format('embeddings.txt', binary=False)

In [8]:
# Test
trained_ft_vectors.most_similar("ping", topn=10)

[('requests', 0.973355770111084),
 ('handle', 0.9630223512649536),
 ('help_ping', 0.8176469802856445),
 ('do_ping', 0.7732444405555725),
 ('find_registry', 0.6913332343101501),
 ('iterlists', 0.6796385049819946),
 ('close', 0.6605330109596252),
 ('test_divisibleby', 0.6550693511962891),
 ('add_header', 0.6526221036911011),
 ('do_list', 0.6403547525405884)]

## Building Document Embeddings

1. Average over all the words;
2. Average over the unique words in each document;
3. [x] Weighted average of all unique words in a document

In [29]:
trained_ft_vectors["ping"]

array([ 0.03385561,  0.00126243, -0.03126138, -0.02985602, -0.06224034,
       -0.02554143, -0.0580775 , -0.06109515, -0.11077709, -0.0576953 ],
      dtype=float32)

In [24]:
document_embeddings=np.zeros((func_size, vocab_size))
for idx, doc in enumerate(list_function_keywords):
    doc_vec_sum=np.zeros(vocab_size)
    for term in doc:
        doc_vec_sum+=trained_ft_vectors[term]
    document_embeddings[idx]=doc_vec_sum

In [30]:
document_embeddings[0]

array([ 0.14120819, -0.00069223, -0.11513096, -0.08152609, -0.14342172,
       -0.08912421, -0.16949473, -0.18863287, -0.30467276, -0.17498178])

In [28]:
print("{} documents with {} dimentions".format(document_embeddings.shape[0], document_embeddings.shape[1]))

264 documents with 10 dimentions


## Evaluate Model

# Notes

# Questions
The paper mentions 2 evaluation approach: 1 uses Github only, the other one uses both GitHub and StackOverflow. I'm guessing the former one is for tuning in the development stage; while the later is the final evaluation for the completed system (NCS).