# Evaluation 4

1. Select the 1,000 most popular Python repositories on GitHub.
2. Filter the top N most frequently asked questions on StackOverflow (17,000 -> 2,000), ranked by question views and votes.
3. Verify that the StackOverflow code snippet exists in the 1,000 repositories:
    - search with ElasticSearch
    - manually choose 100 questions from the ElasticSearch results
4. Use the StackOverflow questions as input to the model, and manually evaluate whether the top 10 results contain a correct answer.
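
Step 2 could be sketched roughly as follows. The column names `ViewCount` and `Question Score` are the ones in the StackOverflow CSV loaded later in this notebook. How views and votes were actually combined is not stated, so summing the two ranks is an assumption, and `top_n_questions` is a hypothetical helper.

```python
import pandas as pd

def top_n_questions(df, n, view_col="ViewCount", score_col="Question Score"):
    """Return the n rows with the best combined popularity rank (assumed scoring)."""
    # Rank each signal separately (1 = most popular) so that raw view counts,
    # which are much larger than vote counts, do not dominate the ordering.
    combined = df[view_col].rank(ascending=False) + df[score_col].rank(ascending=False)
    # A smaller combined rank means a more popular question.
    return df.loc[combined.sort_values(kind="stable").index[:n]]

# toy data for illustration only
questions = pd.DataFrame({
    "Question Title": ["a", "b", "c", "d"],
    "ViewCount": [435, 56, 2389, 311],
    "Question Score": [4, 0, 1, 2],
})
top2 = top_n_questions(questions, 2)
```

Ranking before summing is one way to keep either signal from swamping the other; a weighted sum of the raw counts would be an equally plausible reading of "views + votes".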

# Automated Evaluation 6

This replaces the 4th step of the evaluation method above with:<br>
- Take the top 10 results retrieved by NCS and, for each retrieved method, compute a similarity score between the ground-truth code snippet and that method.
- Choose a similarity threshold that minimizes false positives.
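
A minimal sketch of this automated check, under stated assumptions: the similarity measure is not specified above, so token-level `difflib.SequenceMatcher` is used as a stand-in, and `snippet_similarity`, `is_answered` and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher

def snippet_similarity(a, b):
    """Similarity in [0, 1] between two code strings, compared token by token."""
    # Assumption: whitespace tokenization and SequenceMatcher stand in for
    # whatever similarity measure the evaluation actually uses.
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def is_answered(ground_truth, retrieved_methods, threshold):
    """True if any retrieved method is at least `threshold` similar to the ground truth."""
    return any(snippet_similarity(ground_truth, m) >= threshold for m in retrieved_methods)

# toy data for illustration only
ground_truth = "with open(path) as f: data = f.read()"
retrieved = [
    "def load(path): with open(path) as f: data = f.read() return data",
    "def add(a, b): return a + b",
]
hit = is_answered(ground_truth, retrieved, threshold=0.5)
```

Raising the threshold trades recall for fewer false positives, which is the direction the second bullet asks for.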

In [1]:
from gensim.models import KeyedVectors
from time import time
import math      # used by normalize()
import operator  # used by find_nearest_neighbors()
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [2]:
# change the path to target files
st=time()
path_wordembedding="data/embeddings.txt"
path_docembedding="data/document_embeddings.csv"
path_stackoverflow="data/stack_overflow_data_all.csv"

# change hyperparameters
vocab_size=200
window_size=5

# range of StackOverflow rows to evaluate
start_idx=0
end_idx=10 # exclusive: rows start_idx..end_idx-1 are processed

In [3]:
# load StackOverflow data
st=time()
df_stack_overflow=pd.read_csv(path_stackoverflow)
print("Dimension of StackOverflow data: {}".format(df_stack_overflow.shape))
print("Run time: {} s".format(time()-st))

Dimension of StackOverflow data: (30510, 7)
Run time: 1.0743699073791504 s


In [4]:
# load wordembedding: representation of words
st=time()
trained_ft_vectors = KeyedVectors.load_word2vec_format(path_wordembedding)
print("Run time: {} s".format(time()-st))

Run time: 0.4511878490447998 s


In [5]:
# load document embedding: representation of each source code function
st=time()
document_embeddings=np.loadtxt(fname=path_docembedding, delimiter=",")
print("Dimension of the document embedding: {}".format(document_embeddings.shape))
print("Run time: {} s".format(time()-st))

Dimension of the document embedding: (1038, 200)
Run time: 0.23111772537231445 s


In [6]:
# normalize each word representation vector in place so that its L2 norm is 1.
# we do this so that the cosine similarity reduces to a simple dot product

def normalize(word_representations):
    for word in word_representations:
        total=0
        for key in word_representations[word]:
            total+=word_representations[word][key]*word_representations[word][key]
            
        total=math.sqrt(total)
        for key in word_representations[word]:
            word_representations[word][key]/=total

def dictionary_dot_product(dict1, dict2):
    dot=0
    for key in dict1:
        if key in dict2:
            dot+=dict1[key]*dict2[key]
    return dot

def find_sim(word_representations, query):
    if query not in word_representations:
        print("'%s' is not in vocabulary" % query)
        return None
    
    scores={}
    for word in word_representations:
        cosine=dictionary_dot_product(word_representations[query], word_representations[word])
        scores[word]=cosine
    return scores

# Find the K words with the highest cosine similarity to a query in a set of word_representations
def find_nearest_neighbors(word_representations, query, K):
    scores=find_sim(word_representations, query)
    if scores is not None:
        sorted_x = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
        for idx, (k, v) in enumerate(sorted_x[:K]):
            print("%s\t%s\t%.5f" % (idx,k,v))
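
The claim that cosine similarity reduces to a dot product after L2 normalization can be checked with a small dense example. This is a sketch, not the helpers above: it uses NumPy arrays instead of the dictionary-based representations, and `l2_normalize` is a hypothetical name.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row of `matrix` to unit L2 norm; all-zero rows are left as zeros."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    return matrix / norms

# toy vectors for illustration only
vectors = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit = l2_normalize(vectors)

# similarity of every vector to the first one, using only a dot product
dot_scores = unit @ unit[0]

# reference cosine similarity computed from the unnormalized vectors
cosine_scores = vectors @ vectors[0] / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(vectors[0])
)
```

Both arrays hold the same values, so normalizing once up front saves recomputing the norms for every query.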

In [7]:
def get_most_relevant_document(question, word_embedding, doc_embedding, num=10):
    """Return the functions that are most relevant to the natural language question.

    Args:
        question: A string. A question from StackOverflow.
        word_embedding: Word embedding generated from the codebase.
        doc_embedding: Document embedding generated from the codebase.
        num: The number of top similar functions to return.

    Returns:
        An array of indices into DOC_EMBEDDING of the top NUM functions most related
        to the QUESTION, or np.nan if no token of the QUESTION is in WORD_EMBEDDING.
    
    """
    # convert QUESTION to a vector by averaging the embeddings of its known tokens
    tokenized_ques=question.split()
    # use the DOC_EMBEDDING argument (not the global) to get the embedding dimension
    vec_ques=np.zeros((1,doc_embedding.shape[1]))
    token_count=0
    has_token_in_embedding=False
    for token in tokenized_ques:
        if token in word_embedding:
            has_token_in_embedding=True
            vec_ques+=word_embedding[token]
            token_count+=1
    
    if has_token_in_embedding:
        mean_vec_ques=vec_ques/token_count
    
    
        # compute similarity between this question and each of the source code snippets
        cosine_sim=[]
        for idx, doc in enumerate(doc_embedding):
            #[TODO] fix dimension

            try:
                cosine_sim.append(cosine_similarity(mean_vec_ques, doc.reshape(1, -1))[0][0])
            except ValueError:
                # dump the inputs to help debug a dimension mismatch
                print(question)
                print(vec_ques, token_count)
                print(mean_vec_ques)
                print(doc.reshape(1, -1))
        # get top `num` similar functions, most similar first
        result=np.array(cosine_sim).argsort()[-num:][::-1]
    else:
        # no token of the question is in the vocabulary
        result=np.nan
    return result
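
The per-document loop above can also be written as a single matrix product. The following is a sketch of that alternative rather than the notebook's method: `top_k_by_cosine` is a hypothetical helper, and it assumes the query is a 1-D vector with the same dimension as the rows of the document matrix.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=10):
    """Return indices of the k rows of doc_matrix most similar to query_vec, best first."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    # guard against division by zero for all-zero vectors
    denom = np.where(doc_norms * query_norm == 0, 1.0, doc_norms * query_norm)
    scores = doc_matrix @ query_vec / denom
    # negate so that argsort orders from highest to lowest similarity
    return np.argsort(-scores, kind="stable")[:k]

# toy data for illustration only
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
top2 = top_k_by_cosine(query, docs, k=2)
```

With the document matrix held in memory, this computes all similarities in one pass and avoids calling `cosine_similarity` once per document.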

In [8]:
# limit number of questions
df_stack_overflow_partial=df_stack_overflow.iloc[start_idx:end_idx,:]

In [9]:
st=time()
list_most_relevant_doc=[]
for idx in range(len(df_stack_overflow_partial)): 
    question=df_stack_overflow_partial.iloc[idx]["Question Title"]
    
    most_relevant_doc=get_most_relevant_document(question, trained_ft_vectors, document_embeddings)
    list_most_relevant_doc.append(most_relevant_doc)
# assign() returns a new DataFrame, which avoids SettingWithCopyWarning on the slice
df_stack_overflow_partial=df_stack_overflow_partial.assign(func_id=list_most_relevant_doc)
print("Run time: {} s".format(time()-st))

Run time: 1.678067922592163 s


In [10]:
# save result
df_stack_overflow_partial.to_pickle("data/SO_similarity_{}_{}.pkl".format(start_idx, end_idx))

In [11]:
df_stack_overflow_partial

Unnamed: 0,Post Link,Question Score,ViewCount,Question Title,Question Content,Answer,Tags,func_id
0,48174398,4,435,New Dataframe column as a generic function of ...,<p><strong>What is the fastest (and most effic...,<p>Let's try to analyze the problem for a seco...,<python><pandas><dataframe><vectorization>,"[448, 371, 504, 317, 403, 554, 409, 454, 452, ..."
1,48211001,0,56,Python: How to update multiple address lines,<p>I'm currently working on a banking system f...,<p>Your issue is arguably in your <code>update...,<python>,"[433, 490, 230, 149, 722, 527, 516, 973, 515, ..."
2,48213278,1,2389,Implementing Otsu binarization from scratch py...,<p>It seems my implementation is incorrect and...,<p>I dont know if my implementation is alright...,<python><python-3.x><image-processing><compute...,"[1011, 1003, 991, 275, 992, 994, 993, 990, 989..."
3,48217471,2,311,Is it possible to check for anagram without us...,<p>I'm currently studying and I was told not t...,<p>The pythonic way of answering this question...,<python><algorithm>,"[433, 173, 172, 182, 184, 419, 183, 303, 350, ..."
4,48217583,1,531,Accessing attributes of user defined object le...,<p>I have a class called LineString that consi...,<p>The cause for your error is the fact that y...,<python><class><typeerror>,"[289, 108, 117, 175, 290, 287, 288, 427, 530, ..."
5,48217704,0,47,why the data injected in template is only avai...,<p>I am working on a <strong>portfolio</strong...,<p>It's because you use <code>{{ user }}</code...,<python><django><django-models><django-templat...,"[495, 108, 376, 530, 117, 86, 335, 213, 343, 477]"
6,48219116,0,51,Python Iterators vs Indexing,<p>As I was writing a function I came upon som...,"<blockquote>\n <p>If I'm not mistaken, an ite...",<python><indexing><iterator>,
7,48219403,3,29,How to catch exceptions from function that has...,<p>I have a library that makes calls to an API...,<p>To pass everything seperately you would do ...,<python><python-3.x><exception-handling>,"[175, 288, 172, 515, 173, 153, 289, 150, 523, ..."
8,48221500,0,51,Replacing elements from a string,<p>I am trying to write a function that will p...,"<pre><code>def replace_chars(tmpStr, tmpChar):...",<python><python-3.x>,"[226, 421, 377, 340, 346, 160, 456, 100, 416, ..."
9,48225501,0,243,Refreshing the page and check for text every 1...,<p>I have a jenkins job which runs for 15 to 3...,<p>you are missing to call the re.search again...,<python-3.x><selenium-webdriver><pycharm>,"[287, 92, 360, 258, 407, 477, 428, 290, 117, 495]"
