# Introduction

In this notebook we demonstrate how **Word Embeddings (Word2Vec)** can be applied as an Information Retrieval technique to recover trace links between Test Cases and Bug Reports.
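
To make the idea concrete, the sketch below shows one common way to compare two texts with word embeddings: average the word vectors of each text and take the cosine similarity of the two averages. It is only an illustration under stated assumptions. The vectors in `TOY_VECTORS` are made up, the helper names `embed` and `cosine_similarity` are hypothetical, and mean-pooling is an assumed strategy; the internals of `WordVec_BasedModel` are not shown in this notebook.

```python
import math

# Made-up 3-dimensional vectors for illustration only. Real Word2Vec vectors
# are learned from a corpus and usually have hundreds of dimensions.
TOY_VECTORS = {
    "browser": [0.9, 0.1, 0.0],
    "crash": [0.1, 0.9, 0.2],
    "tab": [0.8, 0.2, 0.1],
    "freeze": [0.2, 0.8, 0.3],
}


def embed(tokens, vectors=TOY_VECTORS):
    """Mean-pool the vectors of the known tokens (assumed pooling strategy)."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known token: fall back to the zero vector.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


query_vec = embed(["browser", "crash"])  # stands in for a bug report
doc_vec = embed(["tab", "freeze"])       # stands in for a test case
similarity = cosine_similarity(query_vec, doc_vec)
```

The two toy texts share no words, yet their averaged vectors point in nearly the same direction, which is the property that embedding-based retrieval relies on.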

We model our study as follows:

* The title, summary, and description of each bug report together form a single query.
* The content of each test case is treated as one document to be retrieved in response to a query.
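
As a minimal sketch of the first point, a query can be built by joining the textual fields of a bug report. The field names `title`, `summary`, and `description` and the helper `compose_query` are hypothetical; the notebook itself reads the query text from the `br_desc` column, and how that column was assembled is not shown here.

```python
def compose_query(bug_report):
    """Join the textual fields of one bug report into a single query string."""
    fields = ("title", "summary", "description")  # hypothetical field names
    # Missing or empty fields are skipped so they do not add stray spaces.
    parts = [(bug_report.get(field) or "").strip() for field in fields]
    return " ".join(part for part in parts if part)


bug_report = {
    "title": "Tab crashes on reload",
    "summary": "Reloading a pinned tab crashes the browser",
    "description": "Steps: pin a tab, press F5.",
}
query_text = compose_query(bug_report)
```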

## Import Libraries

In [2]:
import sys
if '../..' not in sys.path:
    sys.path.append('../..')

import pandas as pd
import numpy as np
import spacy

from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_recall_fscore_support, pairwise_distances, pairwise
from joblib import Parallel, delayed  # sklearn.externals.joblib was removed in scikit-learn 0.23

from modules.utils import plots
from modules.utils import firefox_dataset_p2 as fd
from modules.utils import tokenizers as tok
from modules.utils import aux_functions
from modules.utils import model_evaluator as m_eval

from modules.models.wordvec import WordVec_BasedModel
from modules.models.model_hyperps import WordVec_Model_Hyperp

import warnings; warnings.simplefilter('ignore')

## Load Dataset

In [3]:
test_cases_df = fd.read_testcases_df()
bug_reports_df = fd.read_bugreports_df()

corpus = test_cases_df.tc_desc
query = bug_reports_df.br_desc

test_cases_names = test_cases_df.tc_name
bug_reports_names = bug_reports_df.br_name

orc = fd.read_oracle_expert_volunteers_df()

TestCases.shape: (207, 12)
BugReports.shape: (93, 19)
Oracle.shape: (207, 93)


## Evaluate Recovery Effectiveness

To evaluate the effectiveness of the tested technique (Word2Vec), we use metrics commonly applied in the field of IR:

* Precision
* Recall
* F1-score
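
For reference, with TP, FP, and FN counting true positive, false positive, and false negative links, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two. The sketch below computes them for one bug report, given the oracle links and the recovered links as 0/1 lists over the same test cases. The function name and the convention of returning 0.0 when a denominator is zero are assumptions; how `ModelEvaluator` averages these values into the reported means is not shown in this notebook.

```python
def precision_recall_f1(true_links, predicted_links):
    """Compute precision, recall and F1 from two aligned 0/1 lists."""
    pairs = list(zip(true_links, predicted_links))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly recovered
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # recovered but wrong
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed links

    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: five test cases for a single bug report.
oracle_column = [1, 0, 1, 0, 0]
predicted_column = [1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(oracle_column, predicted_column)
```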

### Analysis with Default Values of WordVec Model

In [11]:
best_model = WordVec_BasedModel()
best_model.recover_links(corpus, query, test_cases_names, bug_reports_names)
evaluator = m_eval.ModelEvaluator(orc, best_model)
evaluator.evaluate_model(verbose=True)

{'Measures': {'Mean FScore of WordVec': 0.011884550084889643,
              'Mean Precision of WordVec': 0.03225806451612903,
              'Mean Recall of WordVec': 0.0076997946436211185},
 'Setup': [{'Name': 'WordVec'},
           {'Similarity Measure and Minimum Threshold': ('cosine', 0.8)},
           {'Top Value': 3},
           {'Tokenizer': <utils.tokenizers.WordNetBased_LemmaTokenizer object at 0x7ff6d0759b70>}]}


## Running WordVec_Based Model

In [12]:
%%time

wv_hyperp = {
    WordVec_Model_Hyperp.SIM_MEASURE_MIN_THRESHOLD.value : ('cosine', .80),
    WordVec_Model_Hyperp.TOP.value : 100,
    WordVec_Model_Hyperp.TOKENIZER.value : tok.PorterStemmerBased_Tokenizer()
}

wv_model = WordVec_BasedModel(**wv_hyperp)
wv_model.set_name('WordVec_Model_AllData')
wv_model.recover_links(corpus, query, test_cases_names, bug_reports_names)

print("\nModel Evaluation -------------------------------------------")
evaluator = m_eval.ModelEvaluator(orc, wv_model)
evaluator.evaluate_model(verbose=True)


Model Evaluation -------------------------------------------
{'Measures': {'Mean FScore of WordVec_Model_AllData': 0.06038017399432412,
              'Mean Precision of WordVec_Model_AllData': 0.036881720430107505,
              'Mean Recall of WordVec_Model_AllData': 0.21091840265292752},
 'Setup': [{'Name': 'WordVec_Model_AllData'},
           {'Similarity Measure and Minimum Threshold': ('cosine', 0.8)},
           {'Top Value': 100},
           {'Tokenizer': <utils.tokenizers.PorterStemmerBased_Tokenizer object at 0x7ff75b867a90>}]}
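
The setup above pairs a minimum cosine threshold (0.8) with a top value (100). One plausible reading, which this notebook does not confirm, is that for each bug report the test cases are ranked by similarity and a link is kept only if the test case is among the top-ranked ones and its score reaches the threshold. The sketch below illustrates that reading with made-up scores; the function name, the test case names, and the combination rule itself are assumptions.

```python
def recover_links(similarities, min_threshold=0.8, top=3):
    """Map each test case to 1 (link) or 0 (no link) for one bug report.

    similarities: dict of test case name -> similarity score.
    A link is kept only if the test case ranks within the `top` highest
    scores AND its score is at least `min_threshold` (assumed rule).
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    kept = {name for name, score in ranked[:top] if score >= min_threshold}
    return {name: int(name in kept) for name in similarities}


# Made-up similarity scores between one bug report and four test cases.
scores = {"TC_A": 0.91, "TC_B": 0.85, "TC_C": 0.83, "TC_D": 0.79}
links_top2 = recover_links(scores, min_threshold=0.8, top=2)
links_top4 = recover_links(scores, min_threshold=0.8, top=4)
```

Under this reading, a larger top value admits more candidates per bug report. That is consistent with recall rising from about 0.008 to about 0.21 between the two runs while precision stays low, although the tokenizer also changed between them.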


In [13]:
display(aux_functions.highlight_df(orc.iloc[0:20, 0:7]))

tc_name,BR_1181835_SRC,BR_1248267_SRC,BR_1248268_SRC,BR_1257087_SRC,BR_1264988_SRC,BR_1267480_SRC,BR_1267501_SRC
TC_1_TRG,0,0,0,0,0,0,0
TC_2_TRG,0,0,0,0,0,0,0
TC_3_TRG,0,0,0,0,0,0,0
TC_4_TRG,0,0,0,0,0,0,0
TC_5_TRG,0,0,0,0,0,0,0
TC_6_TRG,0,0,0,0,0,0,0
TC_7_TRG,0,0,0,0,0,0,0
TC_8_TRG,0,0,0,0,0,0,0
TC_9_TRG,0,0,0,0,0,0,0
TC_10_TRG,0,0,0,0,0,0,0


In [14]:
display(aux_functions.highlight_df(wv_model.get_trace_links_df().iloc[0:20, 0:7]))

tc_name,BR_1181835_SRC,BR_1248267_SRC,BR_1248268_SRC,BR_1257087_SRC,BR_1264988_SRC,BR_1267480_SRC,BR_1267501_SRC
TC_1_TRG,0,1,1,1,1,1,0
TC_2_TRG,1,0,1,1,1,1,1
TC_3_TRG,0,0,0,0,1,1,0
TC_4_TRG,0,0,0,0,1,1,0
TC_5_TRG,0,0,1,0,1,1,0
TC_6_TRG,0,0,0,0,0,0,0
TC_7_TRG,0,0,0,0,0,1,0
TC_8_TRG,0,0,1,0,0,1,0
TC_9_TRG,1,0,1,0,1,1,0
TC_10_TRG,0,0,1,0,0,0,0


In [15]:
display(aux_functions.highlight_df(wv_model.get_sim_matrix().iloc[0:20, 0:7]))

tc_name,BR_1181835_SRC,BR_1248267_SRC,BR_1248268_SRC,BR_1257087_SRC,BR_1264988_SRC,BR_1267480_SRC,BR_1267501_SRC
TC_1_TRG,0.808688,0.895289,0.888758,0.873387,0.878313,0.848028,0.860484
TC_2_TRG,0.834933,0.878338,0.895842,0.867924,0.88925,0.852614,0.882508
TC_3_TRG,0.799828,0.840642,0.862494,0.839938,0.872096,0.843978,0.841579
TC_4_TRG,0.820683,0.852514,0.876617,0.84142,0.884782,0.849379,0.881742
TC_5_TRG,0.815045,0.85896,0.879903,0.854327,0.879805,0.840349,0.867477
TC_6_TRG,0.766847,0.77763,0.80679,0.774372,0.814278,0.782348,0.732557
TC_7_TRG,0.792411,0.83497,0.852643,0.832363,0.851883,0.816936,0.790722
TC_8_TRG,0.817741,0.850608,0.881453,0.838018,0.861688,0.820657,0.81042
TC_9_TRG,0.828998,0.86365,0.902099,0.851726,0.873674,0.836063,0.835231
TC_10_TRG,0.816448,0.827262,0.877477,0.823269,0.858366,0.782386,0.814478
