# Introduction

In this notebook we demonstrate the use of the **BM25 (Best Matching 25)** information retrieval technique to recover trace links between test cases and bug reports.

We model our study as follows:

* Each bug report's title, summary and description together compose a single query.
* Each test case's entire content is treated as a single document to be retrieved for a query.
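
To make the query/document setup above concrete, here is a minimal, self-contained sketch of Okapi BM25 scoring. It is not the project's `BM_25` class: the function name `bm25_scores`, the toy texts and the whitespace tokenization are all illustrative assumptions, and the smoothed IDF used here is one common variant. The defaults `k1=1.2` and `b=0.75` are the usual BM25 values.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every document in corpus_tokens against query_tokens (Okapi BM25)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "1 +" keeps the weight non-negative (assumed variant).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation (k1) and document-length normalization (b).
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy data: one "bug report" query against three "test case" documents.
test_cases = [
    "open new tab and check toolbar".split(),
    "download pdf file and open viewer".split(),
    "zoom page with keyboard shortcut".split(),
]
bug_report = "pdf viewer crashes on open".split()
scores = bm25_scores(bug_report, test_cases)
```

The second toy document shares the most terms with the query, so it receives the highest score, while the third shares none and scores zero.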


## Import Libraries

In [4]:
from mod_finder_util import mod_finder_util
mod_finder_util.add_modules_origin_search_path()

import pandas as pd
import numpy as np

from modules.utils import plots
from modules.utils import firefox_dataset_p2 as fd
from modules.utils import tokenizers as tok
from modules.utils import aux_functions
from modules.utils import model_evaluator as m_eval

from modules.models.bm25 import BM_25
from modules.models.model_hyperps import BM25_Model_Hyperp

import warnings
warnings.simplefilter('ignore')

## Load Dataset

In [5]:
test_cases_df = fd.read_testcases_df()
bug_reports_df = fd.read_bugreports_df()

corpus = test_cases_df.tc_desc
query = bug_reports_df.br_desc

test_cases_names = test_cases_df.tc_name
bug_reports_names = bug_reports_df.br_name

orc = fd.read_oracle_expert_volunteers_df()

TestCases.shape: (207, 12)
BugReports.shape: (93, 19)
Oracle.shape: (207, 93)


# BM25 Model

#### Quick Test with Model

## Evaluate Recovering Efficiency

To evaluate the efficiency of the tested algorithm (BM25), we use metrics commonly applied in the field of IR:

* Precision
* Recall
* F1-score
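
As a rough illustration of these three metrics, the sketch below computes them per bug report (per column) from two aligned binary matrices and then averages them. The helper `precision_recall_f1` and the toy matrices are hypothetical; how `ModelEvaluator` aggregates its "Mean" values is not shown in this notebook, so averaging over columns is an assumption.

```python
import pandas as pd


def precision_recall_f1(oracle, predicted):
    """Per-column precision, recall and F1 for two aligned 0/1 DataFrames."""
    true_pos = (oracle.eq(1) & predicted.eq(1)).sum()
    retrieved = predicted.eq(1).sum()   # links the model returned
    relevant = oracle.eq(1).sum()       # links present in the oracle
    # 0/0 yields NaN in pandas; treat those cases as 0.0 (an assumption).
    precision = (true_pos / retrieved).fillna(0.0)
    recall = (true_pos / relevant).fillna(0.0)
    f1 = (2 * precision * recall / (precision + recall)).fillna(0.0)
    return pd.DataFrame({"precision": precision, "recall": recall, "f1": f1})


# Hypothetical toy matrices: rows are test cases, columns are bug reports.
index = ["TC_1", "TC_2", "TC_3"]
oracle_toy = pd.DataFrame({"BR_A": [1, 0, 1], "BR_B": [0, 0, 0]}, index=index)
predicted_toy = pd.DataFrame({"BR_A": [1, 1, 0], "BR_B": [0, 1, 0]}, index=index)

metrics = precision_recall_f1(oracle_toy, predicted_toy)
mean_metrics = metrics.mean()  # one mean value per metric across bug reports
```

Bug reports with no oracle links, such as `BR_B` in the toy data, pull the mean recall down under this convention. Excluding them would be an equally defensible choice.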

#### Analysis with Default Values of BM25 Model

In [15]:
best_model = BM_25()
best_model.recover_links(corpus, query, test_cases_names, bug_reports_names)

df = pd.DataFrame(best_model.get_sim_matrix())
df.head(10)

evaluator = m_eval.ModelEvaluator(orc, best_model)
evaluator.evaluate_model(verbose=True)
#evaluator.plot_precision_vs_recall()

{'Measures': {'Mean FScore of BM25': 0.030104830696773936,
              'Mean Precision of BM25': 0.08960573476702507,
              'Mean Recall of BM25': 0.0187334993563513},
 'Setup': [{'Name': 'BM25'},
           {'Top Value': 3},
           {'Sim Measure Min Threshold': ('', 0.0)},
           {'K': 1.2},
           {'B': 0.75},
           {'Epsilon': 0.25},
           {'Tokenizer Type': <class 'utils.tokenizers.WordNetBased_LemmaTokenizer'>}]}


## Running BM25 Model

In [16]:
%%time

bm25_hyperp = {
    BM25_Model_Hyperp.TOP.value : 100,
    BM25_Model_Hyperp.SIM_MEASURE_MIN_THRESHOLD.value : ('-', 0.0),
    BM25_Model_Hyperp.TOKENIZER.value : tok.PorterStemmerBased_Tokenizer()
}

bm25_model = BM_25(**bm25_hyperp)
bm25_model.set_name('BM25_Model_AllData')
bm25_model.recover_links(corpus, query, test_cases_names, bug_reports_names)

print("\nModel Evaluation -------------------------------------------")
evaluator = m_eval.ModelEvaluator(orc, bm25_model)
evaluator.evaluate_model(verbose=True)


Model Evaluation -------------------------------------------
{'Measures': {'Mean FScore of BM25_Model_AllData': 0.05931470851816979,
              'Mean Precision of BM25_Model_AllData': 0.03547961247737676,
              'Mean Recall of BM25_Model_AllData': 0.22795750152471203},
 'Setup': [{'Name': 'BM25_Model_AllData'},
           {'Top Value': 100},
           {'Sim Measure Min Threshold': ('-', 0.0)},
           {'K': 1.2},
           {'B': 0.75},
           {'Epsilon': 0.25},
           {'Tokenizer Type': <class 'utils.tokenizers.PorterStemmerBased_Tokenizer'>}]}
CPU times: user 1.85 s, sys: 0 ns, total: 1.85 s
Wall time: 1.85 s


In [17]:
aux_functions.highlight_df(orc.iloc[0:20, 0:7])

| tc_name | BR_1181835_SRC | BR_1248267_SRC | BR_1248268_SRC | BR_1257087_SRC | BR_1264988_SRC | BR_1267480_SRC | BR_1267501_SRC |
|---|---|---|---|---|---|---|---|
| TC_1_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_2_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_3_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_4_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_5_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_6_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_7_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_8_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_9_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TC_10_TRG | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [18]:
aux_functions.highlight_df(bm25_model.get_trace_links_df().iloc[0:20, 0:7])

| tc_name | BR_1181835_SRC | BR_1248267_SRC | BR_1248268_SRC | BR_1257087_SRC | BR_1264988_SRC | BR_1267480_SRC | BR_1267501_SRC |
|---|---|---|---|---|---|---|---|
| TC_1_TRG | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| TC_2_TRG | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| TC_3_TRG | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| TC_4_TRG | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| TC_5_TRG | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| TC_6_TRG | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| TC_7_TRG | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| TC_8_TRG | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| TC_9_TRG | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| TC_10_TRG | 0 | 0 | 1 | 0 | 1 | 1 | 0 |


In [19]:
aux_functions.highlight_df(bm25_model.get_sim_matrix().iloc[0:20, 0:7])

| tc_name | BR_1181835_SRC | BR_1248267_SRC | BR_1248268_SRC | BR_1257087_SRC | BR_1264988_SRC | BR_1267480_SRC | BR_1267501_SRC |
|---|---|---|---|---|---|---|---|
| TC_1_TRG | 38.6255 | 37.7564 | 99.2604 | 25.4208 | 35.2903 | 9.95057 | 41.1919 |
| TC_2_TRG | 48.7012 | 32.4501 | 95.8335 | 26.0056 | 47.9472 | 20.2499 | 61.5733 |
| TC_3_TRG | 48.1199 | 35.0023 | 103.687 | 30.0274 | 47.4105 | 10.5854 | 45.4554 |
| TC_4_TRG | 38.4382 | 29.7728 | 82.121 | 26.5913 | 52.1056 | 11.1322 | 46.2069 |
| TC_5_TRG | 46.7854 | 29.0437 | 73.4714 | 25.5024 | 37.9227 | 10.4102 | 48.7725 |
| TC_6_TRG | 35.5896 | 30.8811 | 79.3024 | 31.0278 | 31.3074 | 12.81 | 43.0957 |
| TC_7_TRG | 50.853 | 47.4332 | 126.927 | 40.9876 | 38.2928 | 12.724 | 47.8232 |
| TC_8_TRG | 48.5698 | 35.2207 | 113.906 | 32.2022 | 34.4048 | 9.04805 | 47.4817 |
| TC_9_TRG | 55.1053 | 35.7478 | 125.212 | 30.6349 | 34.3473 | 9.22116 | 50.1063 |
| TC_10_TRG | 31.6464 | 32.6436 | 107.518 | 24.0958 | 33.721 | 14.1636 | 43.7667 |
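
The trace links matrix and the similarity matrix above appear to be related through the `Top Value` and `Sim Measure Min Threshold` settings. The sketch below shows one plausible way to derive binary links from scores: rank the test cases within each bug report column, keep the `top_n` best, and drop scores that do not exceed a minimum. The helper `links_from_similarity`, the strict `>` comparison and the toy scores are assumptions, not the project's actual implementation.

```python
import pandas as pd


def links_from_similarity(sim_matrix, top_n, min_score=0.0):
    """Mark, per column (query), the top_n best-scoring rows above min_score as 1."""
    # Rank 1 is the highest score in each column; ties are broken by row order.
    ranks = sim_matrix.rank(axis=0, ascending=False, method="first")
    # Strict ">" is an assumed reading of the minimum-threshold setting.
    links = (ranks <= top_n) & (sim_matrix > min_score)
    return links.astype(int)


# Hypothetical toy similarity scores: rows are test cases, columns are bug reports.
sim_toy = pd.DataFrame(
    {"BR_A": [3.0, 1.0, 2.0], "BR_B": [0.0, 5.0, 0.0]},
    index=["TC_1", "TC_2", "TC_3"],
)
links_toy = links_from_similarity(sim_toy, top_n=2, min_score=0.0)
```

Because ranking is done per column, a raw score that earns a link for one bug report may not earn one for another, which matches the pattern visible in the tables above.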
