# Introduction - Using COSINE Metric

In this notebook we demonstrate the use of the **LSI (Latent Semantic Indexing)** technique, taken from the Information Retrieval (IR) field, to recover trace links between Test Cases and Bug Reports.

We model our study as follows:

* The title, summary and description of each bug report together form a single query.
* The content of each test case is treated as a whole document that must be retrieved for the query.
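
The query/document setup above can be sketched with plain scikit-learn. This is a minimal illustration under stated assumptions, not the `LSI` class imported below: the toy texts are invented, and the pipeline (TF-IDF, truncated SVD, cosine similarity) is the textbook form of LSI rather than a copy of the project's implementation.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy texts standing in for test case and bug report descriptions.
test_cases = [
    "open a new tab and check that the new tab page shows top sites",
    "bookmark a page and verify it appears in the bookmarks toolbar",
    "clear recent history and confirm visited urls are removed",
]
bug_reports = [
    "visited urls remain in history after clearing recent history",
    "new bookmark does not appear in the bookmarks toolbar",
]

# 1. Learn the TF-IDF vocabulary from the documents (test cases) and
#    reuse it to vectorize the queries (bug reports).
vectorizer = TfidfVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(test_cases)
query_term = vectorizer.transform(bug_reports)

# 2. Project documents and queries into the same low-rank latent space.
svd = TruncatedSVD(n_components=2, random_state=42)
doc_latent = svd.fit_transform(doc_term)
query_latent = svd.transform(query_term)

# 3. Cosine similarity between every document and every query:
#    rows are test cases, columns are bug reports.
sim = cosine_similarity(doc_latent, query_latent)
print(sim.round(2))
```

Each column of `sim` ranks the test cases for one bug report; a trace link is then recovered wherever the similarity is high enough.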

## Import Libraries

In [3]:
import sys
if '../..' not in sys.path:
    sys.path.append('../..')

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from modules.utils import plots
from modules.utils import firefox_dataset_p1 as fd
from modules.utils import tokenizers as tok
from modules.utils import aux_functions
from modules.utils import model_evaluator as m_eval

from modules.models.lsi import LSI
from modules.models.model_hyperps import LSI_Model_Hyperp

import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

from IPython.display import display

import warnings; warnings.simplefilter('ignore')

## Load Dataset

In [4]:
test_cases_df = fd.read_testcases_df()
bug_reports_df = fd.read_bugreports_df()

corpus = test_cases_df.tc_desc
query = bug_reports_df.br_desc

test_cases_names = test_cases_df.tc_name
bug_reports_names = bug_reports_df.br_name

orc = fd.read_trace_df()

TestCases.shape: (207, 12)
BugReports.shape: (35336, 18)
Oracle.shape: (207, 35336)
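
The printed shapes suggest that the oracle is a test cases × bug reports matrix (207 rows, 35336 columns). The sketch below assumes it is a binary pandas DataFrame indexed by `tc_name` with `br_name` columns, which is consistent with how `orc.loc[...]` is used later in this notebook; the miniature data and identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical miniature oracle: rows are test cases, columns are bug reports,
# and a 1 marks a true trace link.
oracle = pd.DataFrame(
    [[1, 0, 0],
     [0, 1, 0],
     [0, 1, 1]],
    index=pd.Index(["TC_1_TRG", "TC_2_TRG", "TC_3_TRG"], name="tc_name"),
    columns=pd.Index(["BR_10_SRC", "BR_20_SRC", "BR_30_SRC"], name="br_name"),
)

# Label-based selection of a sub-oracle for chosen test cases and bug reports.
subset = oracle.loc[["TC_2_TRG", "TC_3_TRG"], ["BR_20_SRC", "BR_30_SRC"]]
print(subset.shape)               # (2, 2)
print(int(subset.values.sum()))   # 3 true links in the subset
```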


In [3]:
bug_reports_df[bug_reports_df.Version =='50 Branch'].head(10)

Unnamed: 0,Bug_Number,Summary,Platform,Component,Version,Creation_Time,Whiteboard,QA_Whiteboard,First_Comment_Text,First_Comment_Creation_Time,br_name,br_desc
3118,651799,vmware server 2 web interface does not draw pr...,x86_64,Untriaged,50 Branch,2011-04-21T04:58:40Z,,,User-Agent: Mozilla/5.0 (Windows NT 6.1;...,2011-04-21T04:58:40Z,BR_651799_SRC,651799 vmware server 2 web interface does not ...
6198,933532,Open external links and bookmarks to the right...,All,Tabbed Browser,50 Branch,2013-10-31T23:46:07Z,,,,2013-10-31T23:46:07Z,BR_933532_SRC,933532 Open external links and bookmarks to th...
9435,1194382,URLs of visited sites remain in History after ...,Unspecified,Bookmarks & History,50 Branch,2015-08-13T18:49:08Z,,,User Agent: Mozilla/5.0 (Windows NT 6.0; rv:36...,2015-08-13T18:49:08Z,BR_1194382_SRC,1194382 URLs of visited sites remain in Histor...
10048,1224039,"Sync repeats the default bookmarks,folders,and...",All,Sync,50 Branch,2015-11-12T00:18:24Z,,,Created attachment 8686329 Firefox_bug_example...,2015-11-12T00:18:24Z,BR_1224039_SRC,"1224039 Sync repeats the default bookmarks,fol..."
11829,1278464,wrong rendering of some sites,x86_64,Untriaged,50 Branch,2016-06-07T06:07:12Z,,,Created attachment 8760593 frame problem 1.JPG...,2016-06-07T06:07:12Z,BR_1278464_SRC,1278464 wrong rendering of some sites x86_64 U...
11841,1278631,DataURL link should not inherit the security o...,x86,Security,50 Branch,2016-06-07T18:00:42Z,,,User Agent: Mozilla/5.0 (Macintosh; Intel Mac ...,2016-06-07T18:00:42Z,BR_1278631_SRC,1278631 DataURL link should not inherit the se...
11864,1279094,Private tab addon and e10s causes hung window ...,x86_64,Extension Compatibility,50 Branch,2016-06-08T23:27:17Z,triaged,,User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64...,2016-06-08T23:27:17Z,BR_1279094_SRC,1279094 Private tab addon and e10s causes hung...
11883,1279209,"When creating a new bookmark folder, the namin...",x86_64,Bookmarks & History,50 Branch,2016-06-09T12:29:36Z,,,Created attachment 8761549 bookmark_folder-iss...,2016-06-09T12:29:36Z,BR_1279209_SRC,"1279209 When creating a new bookmark folder, t..."
11905,1279430,Firefox Nightly 50.0a1 has no icons of Bookmar...,x86,Bookmarks & History,50 Branch,2016-06-10T08:31:02Z,,,Created attachment 8761940 Screenshots User A...,2016-06-10T08:31:02Z,BR_1279430_SRC,1279430 Firefox Nightly 50.0a1 has no icons of...
11907,1279445,Crash in mozalloc_abortNS_DebugBreakmozilla::i...,x86_64,New Tab Page,50 Branch,2016-06-10T09:29:54Z,,,This bug was filed from the Socorro interface ...,2016-06-10T09:29:54Z,BR_1279445_SRC,1279445 Crash in mozalloc_abortNS_DebugBreakmo...


## Evaluate Recovery Effectiveness

To evaluate the effectiveness of the tested algorithm (LSI), we use metrics that are common in the IR field:

* Precision
* Recall
* F1-score
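
For reference, the standard definitions of these metrics can be written as follows. The notation is an assumption introduced here: for a bug report (query) $q$, $R_q$ is the set of test cases the model links to $q$ and $O_q$ is the set of test cases the oracle links to $q$. A ratio with a zero denominator is taken to be 0 by convention.

```latex
\mathrm{Precision}_q = \frac{|R_q \cap O_q|}{|R_q|}, \qquad
\mathrm{Recall}_q = \frac{|R_q \cap O_q|}{|O_q|}, \qquad
F_{1,q} = \frac{2 \cdot \mathrm{Precision}_q \cdot \mathrm{Recall}_q}{\mathrm{Precision}_q + \mathrm{Recall}_q}
```

The "Mean" prefix in the measures reported later in this notebook suggests that the per-query values are averaged over all bug reports, although the evaluator's code is not shown here.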

## Running LSI Model with Different Types of Oracles

### Strong and Weak Links Datasets

In [9]:
br_tc_strong_df = pd.read_csv('../../data/mozilla_firefox_v2/firefoxDataset/oracle/output/BR_TC_Strong.csv')
br_tc_weak_df = pd.read_csv('../../data/mozilla_firefox_v2/firefoxDataset/oracle/output/BR_TC_Weak.csv')
br_tc_mix_df = pd.read_csv('../../data/mozilla_firefox_v2/firefoxDataset/oracle/output/BR_TC_Mix.csv')

#br_tc_strong_df = br_tc_mix_df.loc[0:3,:]
#br_tc_weak_df = br_tc_mix_df.loc[4:8,:]

print(br_tc_strong_df.shape)
print(br_tc_weak_df.shape)
print(br_tc_mix_df.shape)

(12, 6)
(11, 6)
(8, 4)


### Define **run_lsi_model()** Function

In [11]:
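def run_lsi_model(selected_tcs, selected_brs):
    """Run LSI on the selected test cases and bug reports, then print the
    evaluation measures, the recovered trace links and the oracle subset."""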
    tcs_df = test_cases_df[test_cases_df.tc_name.isin(selected_tcs)]
    brs_df = bug_reports_df[bug_reports_df.br_name.isin(selected_brs)]

    #display(tcs_df.head(15))
    #display(brs_df.head(15))
    
    corpus_subset = tcs_df.tc_desc
    query_subset = brs_df.br_desc
    testcases_names_subset = tcs_df.tc_name
    bug_reports_names_subset = brs_df.br_name
    orc_subset_df = orc.loc[testcases_names_subset, bug_reports_names_subset]
       
    print('TestCases Subset Shape: {}'.format(tcs_df.shape))
    print('BugReports Subset Shape: {}'.format(brs_df.shape))
    print('Oracle Subset Shape: {}'.format(orc_subset_df.shape))

    lsi_hyperp = {
        LSI_Model_Hyperp.SIM_MEASURE_MIN_THRESHOLD.value : ('cosine' , .80),
        LSI_Model_Hyperp.TOP.value : 10,
        LSI_Model_Hyperp.SVD_MODEL_N_COMPONENTS.value: 50,
        LSI_Model_Hyperp.VECTORIZER_NGRAM_RANGE.value: (1,1),
        LSI_Model_Hyperp.VECTORIZER.value : TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True),
        LSI_Model_Hyperp.VECTORIZER_TOKENIZER.value : tok.WordNetBased_LemmaTokenizer()
    }

    lsi_model = LSI(**lsi_hyperp)
    lsi_model.set_name('LSI_Model_0')
    lsi_model.recover_links(corpus_subset, query_subset, testcases_names_subset, bug_reports_names_subset)

    print("\nModel Evaluation -------------------------------------------")
    evaluator = m_eval.ModelEvaluator(orc_subset_df, lsi_model)
    evaluator.evaluate_model(verbose=True)
    
    print("\n\nTraceLinks Matrix --------------------------------------")
    display(aux_functions.highlight_df(lsi_model.get_trace_links_df()))

    print("\n\nOracle -----------------------------------------")
    display(aux_functions.highlight_df(orc_subset_df))
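
The hyperparameters above combine a minimum cosine similarity (`0.80`) with a top value (`10`). How the `LSI` class applies them is not shown in this notebook, so the sketch below is only one plausible reading: for each bug report, keep at most `top` test cases, and only those whose similarity reaches the threshold. The function name `select_links` and the similarity values are hypothetical.

```python
import numpy as np

def select_links(sim, threshold=0.80, top=10):
    """Turn a (n_test_cases, n_bug_reports) similarity matrix into a 0/1 link matrix."""
    links = np.zeros(sim.shape, dtype=int)
    for q in range(sim.shape[1]):
        # Indices of the `top` most similar test cases for bug report q.
        ranked = np.argsort(-sim[:, q], kind="stable")[:top]
        # Among those, keep only the ones at or above the threshold.
        keep = ranked[sim[ranked, q] >= threshold]
        links[keep, q] = 1
    return links

# Hypothetical similarities: 3 test cases (rows) x 2 bug reports (columns).
sim = np.array([
    [0.95, 0.10],
    [0.85, 0.79],
    [0.40, 0.81],
])
print(select_links(sim, threshold=0.80, top=10))
# [[1 0]
#  [1 0]
#  [0 1]]
```

Under this reading, a high threshold tends to favour precision over recall, because few candidate links survive the cut.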

### Oracle with Strong Links Only

In [12]:
selected_tcs = ['TC_{}_TRG'.format(tc_num) for tc_num in br_tc_strong_df.TC.values]
selected_brs = ['BR_{}_SRC'.format(bg_num) for bg_num in br_tc_strong_df.BR.values]

run_lsi_model(selected_tcs, selected_brs)

TestCases Subset Shape: (12, 10)
BugReports Subset Shape: (7, 12)
Oracle Subset Shape: (12, 7)

Model Evaluation -------------------------------------------
{'Measures': {'Mean FScore of LSI_Model_0': 0.25,
              'Mean Precision of LSI_Model_0': 0.8571428571428571,
              'Mean Recall of LSI_Model_0': 0.1496598639455782},
 'Setup': [{'Name': 'LSI_Model_0'},
           {'Similarity Measure and Minimum Threshold': ('cosine', 0.8)},
           {'Top Value': 10},
           {'SVD Model': {'algorithm': 'randomized',
                          'n_components': 50,
                          'n_iter': 10,
                          'random_state': 42,
                          'tol': 0.0}},
           {'Vectorizer': {'analyzer': 'word',
                           'binary': False,
                           'decode_error': 'strict',
                           'dtype': <class 'numpy.float64'>,
                           'encoding': 'utf-8',
                           'input': 'conten

tc_name,BR_786692_SRC,BR_1298575_SRC,BR_1313805_SRC,BR_1320658_SRC,BR_1329292_SRC,BR_1329421_SRC,BR_1329430_SRC
TC_17_TRG,0,0,0,0,0,0,0
TC_118_TRG,0,0,0,0,0,0,0
TC_120_TRG,0,1,0,0,0,0,0
TC_121_TRG,0,0,0,0,0,0,0
TC_136_TRG,0,0,0,0,0,0,0
TC_143_TRG,0,0,0,0,0,0,1
TC_155_TRG,0,0,1,0,0,0,0
TC_172_TRG,0,0,0,0,1,0,0
TC_181_TRG,0,0,0,0,0,0,0
TC_183_TRG,0,0,0,1,0,0,0




Oracle -----------------------------------------


tc_name,BR_786692_SRC,BR_1298575_SRC,BR_1313805_SRC,BR_1320658_SRC,BR_1329292_SRC,BR_1329421_SRC,BR_1329430_SRC
TC_17_TRG,1,0,0,0,0,0,0
TC_118_TRG,0,1,0,0,0,0,0
TC_120_TRG,0,1,0,0,0,0,0
TC_121_TRG,0,1,0,0,0,0,0
TC_136_TRG,0,0,0,0,0,0,0
TC_143_TRG,0,0,1,1,1,1,1
TC_155_TRG,0,0,1,1,1,1,1
TC_172_TRG,0,0,1,1,1,1,1
TC_181_TRG,0,0,1,1,1,1,1
TC_183_TRG,0,0,1,1,1,1,1


### Oracle with Weak Links Only

In [13]:
selected_tcs = ['TC_{}_TRG'.format(tc_num) for tc_num in br_tc_weak_df.TC.values]
selected_brs = ['BR_{}_SRC'.format(bg_num) for bg_num in br_tc_weak_df.BR.values]

run_lsi_model(selected_tcs, selected_brs)

TestCases Subset Shape: (11, 10)
BugReports Subset Shape: (5, 12)
Oracle Subset Shape: (11, 5)

Model Evaluation -------------------------------------------
{'Measures': {'Mean FScore of LSI_Model_0': 0.1388888888888889,
              'Mean Precision of LSI_Model_0': 0.4,
              'Mean Recall of LSI_Model_0': 0.08571428571428572},
 'Setup': [{'Name': 'LSI_Model_0'},
           {'Similarity Measure and Minimum Threshold': ('cosine', 0.8)},
           {'Top Value': 10},
           {'SVD Model': {'algorithm': 'randomized',
                          'n_components': 50,
                          'n_iter': 10,
                          'random_state': 42,
                          'tol': 0.0}},
           {'Vectorizer': {'analyzer': 'word',
                           'binary': False,
                           'decode_error': 'strict',
                           'dtype': <class 'numpy.float64'>,
                           'encoding': 'utf-8',
                           'input': 'conten

tc_name,BR_1285719_SRC,BR_1298575_SRC,BR_1329292_SRC,BR_1329421_SRC,BR_1329430_SRC
TC_14_TRG,0,0,0,0,0
TC_35_TRG,0,0,0,0,0
TC_75_TRG,0,0,0,0,0
TC_105_TRG,0,0,0,0,0
TC_154_TRG,0,0,0,0,0
TC_155_TRG,0,0,0,0,0
TC_174_TRG,0,0,0,0,0
TC_196_TRG,0,1,0,0,0
TC_197_TRG,0,0,0,1,0
TC_200_TRG,0,0,0,0,1




Oracle -----------------------------------------


tc_name,BR_1285719_SRC,BR_1298575_SRC,BR_1329292_SRC,BR_1329421_SRC,BR_1329430_SRC
TC_14_TRG,1,0,0,0,0
TC_35_TRG,1,0,0,0,0
TC_75_TRG,0,1,0,0,0
TC_105_TRG,0,1,0,0,0
TC_154_TRG,0,0,1,1,1
TC_155_TRG,0,0,1,1,1
TC_174_TRG,0,0,1,1,1
TC_196_TRG,0,0,1,1,1
TC_197_TRG,0,0,1,1,1
TC_200_TRG,0,0,1,1,1


### Oracle with Mixed Links (Strong and Weak)

In [14]:
selected_tcs = ['TC_{}_TRG'.format(tc_num) for tc_num in br_tc_mix_df.TC.values]
selected_brs = ['BR_{}_SRC'.format(bg_num) for bg_num in br_tc_mix_df.BR.values]

run_lsi_model(selected_tcs, selected_brs)

TestCases Subset Shape: (8, 10)
BugReports Subset Shape: (3, 12)
Oracle Subset Shape: (8, 3)

Model Evaluation -------------------------------------------
{'Measures': {'Mean FScore of LSI_Model_0': 0.38888888888888884,
              'Mean Precision of LSI_Model_0': 0.6666666666666666,
              'Mean Recall of LSI_Model_0': 0.27777777777777773},
 'Setup': [{'Name': 'LSI_Model_0'},
           {'Similarity Measure and Minimum Threshold': ('cosine', 0.8)},
           {'Top Value': 10},
           {'SVD Model': {'algorithm': 'randomized',
                          'n_components': 50,
                          'n_iter': 10,
                          'random_state': 42,
                          'tol': 0.0}},
           {'Vectorizer': {'analyzer': 'word',
                           'binary': False,
                           'decode_error': 'strict',
                           'dtype': <class 'numpy.float64'>,
                           'encoding': 'utf-8',
                           'i

tc_name,BR_786692_SRC,BR_1285719_SRC,BR_1329430_SRC
TC_14_TRG,0,0,0
TC_17_TRG,1,0,0
TC_35_TRG,0,0,0
TC_105_TRG,0,0,0
TC_118_TRG,0,0,0
TC_136_TRG,0,0,0
TC_143_TRG,0,0,1
TC_155_TRG,0,0,0




Oracle -----------------------------------------


tc_name,BR_786692_SRC,BR_1285719_SRC,BR_1329430_SRC
TC_14_TRG,1,1,0
TC_17_TRG,1,1,0
TC_35_TRG,1,1,0
TC_105_TRG,0,0,0
TC_118_TRG,0,0,0
TC_136_TRG,0,0,0
TC_143_TRG,0,0,1
TC_155_TRG,0,0,1
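
The measures reported for this run can be checked by hand from the two tables above. The sketch below is a reconstruction, not the code of `ModelEvaluator`: it assumes that precision, recall and F-score are computed per bug report (column) and then averaged, with a value of 0 whenever a denominator is zero. The function name `mean_scores` is hypothetical.

```python
import numpy as np

# Recovered trace links and oracle for the mixed-links run, copied from the
# two tables above (rows: test cases, columns: bug reports).
recovered = np.array([
    [0, 0, 0],  # TC_14_TRG
    [1, 0, 0],  # TC_17_TRG
    [0, 0, 0],  # TC_35_TRG
    [0, 0, 0],  # TC_105_TRG
    [0, 0, 0],  # TC_118_TRG
    [0, 0, 0],  # TC_136_TRG
    [0, 0, 1],  # TC_143_TRG
    [0, 0, 0],  # TC_155_TRG
])
oracle = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
])

def mean_scores(recovered, oracle):
    """Per-query precision, recall and F-score, averaged over all queries."""
    precisions, recalls, fscores = [], [], []
    for q in range(oracle.shape[1]):
        hits = int((recovered[:, q] & oracle[:, q]).sum())
        n_recovered = int(recovered[:, q].sum())
        n_relevant = int(oracle[:, q].sum())
        p = hits / n_recovered if n_recovered else 0.0
        r = hits / n_relevant if n_relevant else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        fscores.append(f)
    return float(np.mean(precisions)), float(np.mean(recalls)), float(np.mean(fscores))

precision, recall, fscore = mean_scores(recovered, oracle)
print(round(precision, 4), round(recall, 4), round(fscore, 4))
# 0.6667 0.2778 0.3889
```

These values agree with the mean precision, recall and F-score printed for this run, which supports the per-query averaging assumption at least for this case.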


### General Test

In [15]:
# Random sample of 15 bug reports from the '48 Branch' and '60 Branch' versions
bugreports_subset_df = bug_reports_df[(bug_reports_df.Version == '48 Branch') | (bug_reports_df.Version == '60 Branch')].sample(15, random_state=42)
# Random sample of 10 test cases from two test days (not used by the run below)
testcases_subset_df = test_cases_df[(test_cases_df.TestDay.str.contains('20161014')) | (test_cases_df.TestDay.str.contains('20161028'))].sample(10, random_state=1000)

# Hand-picked test cases used for this run
selected_testcases = ['TC_{}_TRG'.format(tc_num) for tc_num in [13, 14, 15, 16, 17, 18]]  # should link with 48 Branch
aux_tc = test_cases_df[test_cases_df.tc_name.isin(selected_testcases)]  # kept for inspection; not used by the run below

selected_bugreports = bugreports_subset_df.br_name
run_lsi_model(selected_testcases, selected_bugreports)


TestCases Subset Shape: (6, 10)
BugReports Subset Shape: (15, 12)
Oracle Subset Shape: (6, 15)

Model Evaluation -------------------------------------------
{'Measures': {'Mean FScore of LSI_Model_0': 0.07619047619047618,
              'Mean Precision of LSI_Model_0': 0.26666666666666666,
              'Mean Recall of LSI_Model_0': 0.04444444444444444},
 'Setup': [{'Name': 'LSI_Model_0'},
           {'Similarity Measure and Minimum Threshold': ('cosine', 0.8)},
           {'Top Value': 10},
           {'SVD Model': {'algorithm': 'randomized',
                          'n_components': 50,
                          'n_iter': 10,
                          'random_state': 42,
                          'tol': 0.0}},
           {'Vectorizer': {'analyzer': 'word',
                           'binary': False,
                           'decode_error': 'strict',
                           'dtype': <class 'numpy.float64'>,
                           'encoding': 'utf-8',
                          

tc_name,BR_1268934_SRC,BR_1282551_SRC,BR_1291175_SRC,BR_1299787_SRC,BR_1418983_SRC,BR_1432520_SRC,BR_1436749_SRC,BR_1443632_SRC,BR_1443754_SRC,BR_1450216_SRC,BR_1461828_SRC,BR_1463274_SRC,BR_1463735_SRC,BR_1497738_SRC,BR_1513270_SRC
TC_13_TRG,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0
TC_14_TRG,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
TC_15_TRG,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
TC_16_TRG,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
TC_17_TRG,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0
TC_18_TRG,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0




Oracle -----------------------------------------


tc_name,BR_1268934_SRC,BR_1282551_SRC,BR_1291175_SRC,BR_1299787_SRC,BR_1418983_SRC,BR_1432520_SRC,BR_1436749_SRC,BR_1443632_SRC,BR_1443754_SRC,BR_1450216_SRC,BR_1461828_SRC,BR_1463274_SRC,BR_1463735_SRC,BR_1497738_SRC,BR_1513270_SRC
TC_13_TRG,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
TC_14_TRG,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
TC_15_TRG,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
TC_16_TRG,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
TC_17_TRG,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0
TC_18_TRG,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0


### Run Model with Entire Dataset

In [27]:
%%time

lsi_hyperp = {
    LSI_Model_Hyperp.SIM_MEASURE_MIN_THRESHOLD.value : ('cosine' , .80),
    LSI_Model_Hyperp.TOP.value : 10,
    LSI_Model_Hyperp.SVD_MODEL_N_COMPONENTS.value: 100,
    LSI_Model_Hyperp.VECTORIZER_NGRAM_RANGE.value: (1,1),
    LSI_Model_Hyperp.VECTORIZER.value : TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True),
    LSI_Model_Hyperp.VECTORIZER_TOKENIZER.value : tok.WordNetBased_LemmaTokenizer()
}

lsi_model = LSI(**lsi_hyperp)
lsi_model.set_name('LSI_Model_AllData')
lsi_model.recover_links(corpus, query, test_cases_names, bug_reports_names)

print("\nModel Evaluation -------------------------------------------")
evaluator = m_eval.ModelEvaluator(orc, lsi_model)
evaluator.evaluate_model(verbose=True)


Model Evaluation -------------------------------------------
{'Measures': {'Mean FScore of LSI_Model_AllData': 4.9188730740674965e-05,
              'Mean Precision of LSI_Model_AllData': 0.0009344735798833324,
              'Mean Recall of LSI_Model_AllData': 2.558522889171405e-05},
 'Setup': [{'Name': 'LSI_Model_AllData'},
           {'Similarity Measure and Minimum Threshold': ('cosine', 0.8)},
           {'Top Value': 10},
           {'SVD Model': {'algorithm': 'randomized',
                          'n_components': 100,
                          'n_iter': 10,
                          'random_state': 42,
                          'tol': 0.0}},
           {'Vectorizer': {'analyzer': 'word',
                           'binary': False,
                           'decode_error': 'strict',
                           'dtype': <class 'numpy.float64'>,
                           'encoding': 'utf-8',
                           'input': 'content',
                           'lowercase': Tr