# 0. AQE Per-Query Analysis
In the following, we provide a more detailed analysis of our Automatic Query Expansion (AQE) evaluation results as a supplement to the analysis in the paper. There, we showed that AQE methods, in particular the text-segment-wise PRF (s) method, can outperform the ranking performance of the base ranker BM25. However, AQE extracts its expansion terms by applying the Kullback-Leibler Divergence (KLD) score to the initial GUI ranking results of the base ranker BM25. For queries where these initial results are too inconsistent, the extracted expansion terms may not be suitable for obtaining a better ranking. We therefore provide a per-query analysis based on our NL-based GUI retrieval gold standard: we investigate, for example, how many queries actually improve and which queries may suffer in performance when AQE is used.
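
As a rough illustration of the KLD-based term selection mentioned above, the following sketch scores candidate terms by how much more likely they are in the top-ranked (pseudo-relevant) documents than in the whole collection. The function name `kld_expansion_terms`, the whitespace tokenization and the toy documents are assumptions made for illustration only; the implementation evaluated in the paper, including its text-segment-wise variant, may differ.

```python
import math
from collections import Counter


def kld_expansion_terms(feedback_docs, collection_docs, query_terms, n_terms=3):
    """Rank candidate expansion terms by their KLD contribution.

    score(t) = p(t | feedback) * log(p(t | feedback) / p(t | collection))
    Assumes the feedback documents are part of the collection, so that
    p(t | collection) > 0 for every feedback term.
    """
    feedback_counts = Counter(t for doc in feedback_docs for t in doc)
    collection_counts = Counter(t for doc in collection_docs for t in doc)
    feedback_total = sum(feedback_counts.values())
    collection_total = sum(collection_counts.values())

    scores = {}
    for term, count in feedback_counts.items():
        if term in query_terms:
            continue  # terms already in the query are not useful as expansions
        p_feedback = count / feedback_total
        p_collection = collection_counts[term] / collection_total
        scores[term] = p_feedback * math.log(p_feedback / p_collection)

    # Highest-scoring terms first
    return sorted(scores, key=scores.get, reverse=True)[:n_terms]


# Toy example (hypothetical GUI texts, tokenized by whitespace)
collection = [
    "login email password sign in".split(),
    "news headlines sports article".split(),
    "sports scores live news".split(),
    "settings profile password".split(),
]
feedback = collection[1:3]  # pretend these are the top-2 results of the base ranker
expansion = kld_expansion_terms(feedback, collection, query_terms={"news"})
```

If the initial results were inconsistent (for instance, one news screen and one login screen), the feedback distribution would be spread over unrelated terms, which is the failure mode discussed above.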

# 1. Load Datasets
Here we load the evaluation results for each method. Each method has one field per metric used in our evaluation, and each metric field contains the 100 metric values for the 100 queries in our gold standard.

In [1]:
import json
import pandas as pd
from ast import literal_eval
import numpy as np
import seaborn as sns
sns.set_theme()

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 5000)
pd.set_option('display.max_colwidth', 5000)

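# Relative path to the gold standard (queries, GUI indexes and relevance labels)
abs_path = '../data/gui_ranking/goldstandard/'

goldstandard = pd.read_csv(abs_path+"goldstandard.csv")
# These columns are stored as string literals in the CSV; parse them back into Python objects
goldstandard['gui_indexes'] = goldstandard['gui_indexes'].apply(literal_eval)
goldstandard['relevance'] = goldstandard['relevance'].apply(literal_eval)

# Relative path to the per-query AQE evaluation results
abs_path = '../data/gui_ranking/aqe/dataset/'

# Layout: ranker_results[method][metric] -> list of per-query metric values
with open(abs_path +  'aqe_results_for_per_query_analysis.json', 'r', encoding='utf8') as file:
    ranker_results = json.load(file)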
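
The layout described above (method, then metric, then one value per query) can be checked before running the analysis. The helper `check_results_layout` below is a hypothetical name introduced for this sketch, and the toy dictionary only mimics the structure of `ranker_results` with three queries instead of 100.

```python
def check_results_layout(results, n_queries):
    """Return (method, metric, length) for every metric list of unexpected length."""
    problems = []
    for method, metrics in results.items():
        for metric, values in metrics.items():
            if len(values) != n_queries:
                problems.append((method, metric, len(values)))
    return problems


# Toy stand-in for `ranker_results`; the last list is deliberately too short
toy_results = {
    "bm25okapi": {"AveP": [0.2, 0.5, 1.0], "P@5": [0.2, 0.4, 0.6]},
    "prf_kld_cat_bm25": {"AveP": [0.3, 0.4, 1.0], "P@5": [0.2, 0.6]},
}
layout_problems = check_results_layout(toy_results, n_queries=3)
```

In the notebook, `check_results_layout(ranker_results, n_queries=100)` would be expected to return an empty list if every metric field indeed holds 100 values.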

# 2. Analysis of Improvements for Different Metrics
In the following, we compare the per-query metric values over the 100 queries of the gold standard and count the cases where AQE (in particular the text-segment-wise PRF (s) method) improved the performance or was at least as good as the base ranker. We then compute the average metric difference over these cases and do the same for the cases where AQE decreased the metric.

First, the Average Precision (AveP) is improved or at least as good as with the base ranker for 57% of the 100 queries, with an average improvement of 0.137. Second, the Mean Reciprocal Rank (MRR) is improved or at least as good for 69% of the queries, with an average improvement of 0.200. Third, the Precision at 5 (P@5) is improved or at least as good for 77% of the queries, with an average improvement of 0.078. Fourth, the HITS at 5 (HITS@5) is improved or at least as good for 92% of the queries, with an average improvement of 0.098. Fifth, the Normalized Discounted Cumulative Gain at 5 (NDCG@5) is improved or at least as good for 62% of the queries, with an average improvement of 0.156. Note that these cases include queries with a difference of exactly zero. Thus, averaged over the five considered metrics, 71.4% of the 100 queries from the gold standard are improved or perform at least as well as with the base ranker.
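
The per-metric figures above all follow the same recipe, which can be sketched compactly as follows. The helper `summarize_differences` is a hypothetical name, it uses plain Python rather than the NumPy and pandas calls of the notebook cells, and the four toy values are not taken from the gold standard.

```python
from statistics import mean


def summarize_differences(base_values, aqe_values):
    """Split per-query differences (AQE minus base) into non-negative and negative cases."""
    differences = [aqe - base for base, aqe in zip(base_values, aqe_values)]
    non_negative = [d for d in differences if d >= 0]  # improved or equal
    negative = [d for d in differences if d < 0]       # decreased
    return {
        "share_non_negative": len(non_negative) / len(differences),
        "mean_non_negative": mean(non_negative) if non_negative else None,
        "mean_negative": mean(negative) if negative else None,
    }


# Toy metric values for four queries
summary = summarize_differences([0.5, 0.2, 1.0, 0.4], [0.75, 0.2, 0.5, 0.9])

# The overall figure is the plain average of the five per-metric percentages
shares = {"AveP": 57, "MRR": 69, "P@5": 77, "HITS@5": 92, "NDCG@5": 62}
average_share = sum(shares.values()) / len(shares)
```

Because the overall figure averages percentages across metrics, it does not mean that the same 71.4% of queries are non-negative under every metric.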

## 2.1 AveP (Average Precision)
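
For readers less familiar with the metric, here is a minimal sketch of Average Precision for a single ranked list with binary relevance labels. The function name and the normalization by the number of relevant items found in the list are assumptions; the evaluation code behind `ranker_results` is not shown in this notebook and may normalize differently.

```python
def average_precision(relevance):
    """AveP for one ranking; `relevance` holds 1 (relevant) or 0 in ranked order."""
    n_relevant = sum(1 for rel in relevance if rel)
    if n_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / n_relevant


# Relevant items at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision([1, 0, 1, 0, 0])
```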

In [2]:
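# Per-query AveP values of the base ranker and of the AQE method analyzed here
results_1 = np.array(ranker_results['bm25okapi']['AveP'])
results_2 = np.array(ranker_results['prf_kld_cat_bm25']['AveP'])
# Positive difference: AQE improved the query; negative difference: AQE decreased it
differences = results_2 - results_1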

In [3]:
# Split the differences into non-negative (improved or equal) and negative (decreased) cases
positive_vals = []
negative_vals = []
for diff in differences:
    if diff >= 0:
        positive_vals.append(diff)
    else:
        negative_vals.append(diff)

In [4]:
positive_vals_describe = pd.DataFrame(positive_vals)
positive_vals_describe.describe()

| statistic | value |
|---|---|
| count | 57.0 |
| mean | 0.137215 |
| std | 0.1478 |
| min | 0.0 |
| 25% | 0.026667 |
| 50% | 0.064583 |
| 75% | 0.235607 |
| max | 0.506746 |


In [5]:
negative_vals_describe = pd.DataFrame(negative_vals)
negative_vals_describe.describe()

| statistic | value |
|---|---|
| count | 43.0 |
| mean | -0.148715 |
| std | 0.169749 |
| min | -0.616667 |
| 25% | -0.274306 |
| 50% | -0.09241 |
| 75% | -0.016438 |
| max | -0.002778 |


## 2.2 MRR (Mean Reciprocal Rank)
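
The per-query values stored under `RecipRank` are reciprocal ranks; MRR is their mean over all queries. A minimal sketch, assuming binary relevance labels in ranked order (the function name is hypothetical):

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if no item is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


# First relevant item at rank 2 for the base ranking and at rank 1 after expansion
rr_base = reciprocal_rank([0, 1, 0])
rr_aqe = reciprocal_rank([1, 0, 0])
rr_difference = rr_aqe - rr_base
```

A difference of 0.5, as in this toy case, is what results when the first relevant screen moves from rank 2 to rank 1, so small rank changes near the top of the list produce large differences for this metric.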

In [6]:
results_1 = np.array(ranker_results['bm25okapi']['RecipRank'])
results_2 = np.array(ranker_results['prf_kld_cat_bm25']['RecipRank'])
differences = results_2 - results_1

In [7]:
positive_vals = []
negative_vals = []
for diff in differences:
    if diff >= 0:
        positive_vals.append(diff)
    else:
        negative_vals.append(diff)

In [8]:
positive_vals_describe = pd.DataFrame(positive_vals)
positive_vals_describe.describe()

| statistic | value |
|---|---|
| count | 69.0 |
| mean | 0.19986 |
| std | 0.283674 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.017857 |
| 75% | 0.5 |
| max | 0.928571 |


In [9]:
negative_vals_describe = pd.DataFrame(negative_vals)
negative_vals_describe.describe()

| statistic | value |
|---|---|
| count | 31.0 |
| mean | -0.406418 |
| std | 0.317138 |
| min | -0.875 |
| 25% | -0.666667 |
| 50% | -0.5 |
| 75% | -0.064665 |
| max | -0.010256 |


## 2.3 P@5 (Precision at 5)
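
A minimal sketch of Precision at k, assuming binary relevance labels in ranked order (the function name is hypothetical). With k = 5 the metric can only take multiples of 0.2, which is consistent with the step size of the differences reported in this section.

```python
def precision_at_k(relevance, k=5):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for rel in relevance[:k] if rel) / k


# Two relevant items among the first five; the one at rank 6 is ignored
p_at_5 = precision_at_k([1, 0, 1, 0, 0, 1], k=5)
```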

In [10]:
results_1 = np.array(ranker_results['bm25okapi']['P@5'])
results_2 = np.array(ranker_results['prf_kld_cat_bm25']['P@5'])
differences = results_2 - results_1

In [11]:
positive_vals = []
negative_vals = []
for diff in differences:
    if diff >= 0:
        positive_vals.append(diff)
    else:
        negative_vals.append(diff)

In [12]:
positive_vals_describe = pd.DataFrame(positive_vals)
positive_vals_describe.describe()

| statistic | value |
|---|---|
| count | 77.0 |
| mean | 0.077922 |
| std | 0.117679 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.2 |
| max | 0.6 |


In [13]:
negative_vals_describe = pd.DataFrame(negative_vals)
negative_vals_describe.describe()

| statistic | value |
|---|---|
| count | 23.0 |
| mean | -0.243478 |
| std | 0.084348 |
| min | -0.4 |
| 25% | -0.2 |
| 50% | -0.2 |
| 75% | -0.2 |
| max | -0.2 |


## 2.4 HITS@5
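
A minimal sketch of HITS at k, read here as a binary indicator of whether at least one relevant item appears in the top k (the function name is hypothetical). This reading is an assumption, although it is consistent with the differences in this section, which are only -1, 0 or 1.

```python
def hits_at_k(relevance, k=5):
    """1.0 if any of the top-k ranked items is relevant, otherwise 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0


# The only relevant item sits at rank 6, so it does not count for k = 5
hit_missed = hits_at_k([0, 0, 0, 0, 0, 1], k=5)
hit_found = hits_at_k([0, 0, 1, 0, 0, 0], k=5)
```

Under this reading, a negative difference means that AQE pushed every relevant screen out of the top 5 for that query.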

In [14]:
results_1 = np.array(ranker_results['bm25okapi']['HITS@5'])
results_2 = np.array(ranker_results['prf_kld_cat_bm25']['HITS@5'])
differences = results_2 - results_1

In [15]:
positive_vals = []
negative_vals = []
for diff in differences:
    if diff >= 0:
        positive_vals.append(diff)
    else:
        negative_vals.append(diff)

In [16]:
positive_vals_describe = pd.DataFrame(positive_vals)
positive_vals_describe.describe()

| statistic | value |
|---|---|
| count | 92.0 |
| mean | 0.097826 |
| std | 0.298707 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 1.0 |


In [17]:
negative_vals_describe = pd.DataFrame(negative_vals)
negative_vals_describe.describe()

| statistic | value |
|---|---|
| count | 8.0 |
| mean | -1.0 |
| std | 0.0 |
| min | -1.0 |
| 25% | -1.0 |
| 50% | -1.0 |
| 75% | -1.0 |
| max | -1.0 |


## 2.5 NDCG@5 (Normalized Discounted Cumulative Gain at 5)
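
A minimal sketch of NDCG at k with graded relevance. It assumes linear gains, a log2 rank discount and an ideal ordering built from the same list of gains; the evaluation behind `ranker_results` may use a different gain function or derive the ideal ranking from all judged screens. The function names and the toy gains are hypothetical.

```python
import math


def dcg_at_k(gains, k=5):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k=5):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering of the same gains."""
    ideal_dcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Graded relevance of a ranked list; higher values mean more relevant
ndcg = ndcg_at_k([3, 2, 3, 0, 1, 2], k=5)
ndcg_ideal = ndcg_at_k([3, 3, 2, 2, 1, 0], k=5)
```

Unlike P@5 and HITS@5, this metric also changes when relevant screens merely swap positions within the top 5.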

In [18]:
results_1 = np.array(ranker_results['bm25okapi']['NDCG@5'])
results_2 = np.array(ranker_results['prf_kld_cat_bm25']['NDCG@5'])
differences = results_2 - results_1

In [19]:
positive_vals = []
negative_vals = []
for diff in differences:
    if diff >= 0:
        positive_vals.append(diff)
    else:
        negative_vals.append(diff)

In [20]:
positive_vals_describe = pd.DataFrame(positive_vals)
positive_vals_describe.describe()

| statistic | value |
|---|---|
| count | 62.0 |
| mean | 0.155565 |
| std | 0.150701 |
| min | 0.0 |
| 25% | 0.049256 |
| 50% | 0.100536 |
| 75% | 0.220311 |
| max | 0.565197 |


In [21]:
negative_vals_describe = pd.DataFrame(negative_vals)
negative_vals_describe.describe()

| statistic | value |
|---|---|
| count | 38.0 |
| mean | -0.199118 |
| std | 0.163062 |
| min | -0.796229 |
| 25% | -0.289457 |
| 50% | -0.161485 |
| 75% | -0.078993 |
| max | -0.011195 |


# 3. Top-10 Positive and Negative Query Examples
In the following, we additionally provide the ten queries for which AQE (in particular the text-segment-wise PRF (s) method) improved the Average Precision (AveP) the most, as well as the ten queries for which it decreased the Average Precision the most. We then discuss the examples and offer possible explanations for why applying AQE to these queries led to the improvements or decreases.

In particular, we observe that the queries that improve drastically with AQE are more general and less specific, and they request screens from well-known domains (e.g. news, start screens, search screens, login and list screens). In contrast, the queries whose performance decreases with AQE are more specific, contain more details (6.2 tokens on average compared to 5.2) and request screens from less well-known domains (e.g. learning piano online, nearby picks, voice changer). For example, the query "Chinese salad recipes" requests screens with very specific recipes, whereas the base retrieval will most likely return more general recipe screens. AQE then reinforces the focus on general recipe screens instead of recipe screens specifically for Chinese salads. Such detailed and specific requirements make it harder for the AQE mechanism to improve the results.

The ranking performance can most often be improved when (i) more relevant screens are available in Rico (e.g. the domains are better represented in the GUI population) and (ii) the base ranker is able to retrieve these relevant and similar GUI screens with the initial query. Higher diversity in the initial GUI result set makes it more difficult for the AQE techniques to extract meaningful expansion terms.

In [22]:
results_1 = np.array(ranker_results['bm25okapi']['AveP'])
results_2 = np.array(ranker_results['prf_kld_cat_bm25']['AveP'])
differences = results_2 - results_1

In [23]:
positive_vals = []
negative_vals = []
for diff in differences:
    if diff >= 0:
        positive_vals.append(diff)
    else:
        negative_vals.append(diff)

After computing the differences of the Average Precision metric values per query, we sort these values and obtain their indexes. These indexes can be used to find the corresponding query in our gold standard.

In [24]:
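# Differences sorted from the largest improvement to the largest decrease,
# together with the query indexes in the same order
diff_sorted_hi_low = np.sort(np.array(differences))[::-1]
diff_sorted_hi_low_args = np.argsort(np.array(differences))[::-1]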

In [25]:
diff_sorted_hi_low_args

array([ 9, 64, 49, 85, 59, 76, 73,  3, 95, 13, 58, 10, 98, 97, 31, 35, 17,
       99, 90, 11, 53, 16, 57,  2, 21, 48, 26, 40, 93, 44, 75,  0, 67, 25,
       23, 89, 50, 42, 15,  8, 37, 77, 41,  7, 60, 34, 63, 70, 22, 86, 30,
       72, 32, 83, 87, 36, 39, 62, 28, 92, 19, 51, 27, 91, 84, 78, 38, 52,
        6, 14, 55, 33, 96,  1, 69, 43, 65, 29, 66, 74, 24, 82, 20, 88, 47,
       68, 18, 71, 81, 61, 56, 46,  5, 79, 12, 94, 80, 54, 45,  4],
      dtype=int64)

In [26]:
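# Differences sorted from the largest decrease to the largest improvement,
# together with the query indexes in the same order
diff_sorted_low_hi = np.sort(np.array(differences))
diff_sorted_low_hi_args = np.argsort(np.array(differences))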

In [27]:
diff_sorted_low_hi_args

array([ 4, 45, 54, 80, 94, 12, 79,  5, 46, 56, 61, 81, 71, 18, 68, 47, 88,
       20, 82, 24, 74, 66, 29, 65, 43, 69,  1, 96, 33, 55, 14,  6, 52, 38,
       78, 84, 91, 27, 51, 19, 92, 28, 62, 39, 36, 87, 83, 32, 72, 30, 86,
       22, 70, 63, 34, 60,  7, 41, 77, 37,  8, 15, 42, 50, 89, 23, 25, 67,
        0, 75, 44, 93, 40, 26, 48, 21,  2, 57, 16, 53, 11, 90, 99, 17, 35,
       31, 97, 98, 10, 58, 13, 95,  3, 73, 76, 59, 85, 49, 64,  9],
      dtype=int64)

## 3.1 Top-10 Positive Query Examples
In the following, we obtain the top-10 positive query examples, i.e. the ten queries with the largest improvements of the Average Precision when using AQE.

In [28]:
# Look up the query text for the ten indexes with the largest AveP improvement
queries_diff_hi_low = [goldstandard.loc[query_index, 'query'] for query_index in diff_sorted_hi_low_args[:10]]

In [29]:
queries_diff_hi_low

['home appliances',
 'a panel app window with its points',
 'sports news overview',
 'live breaking news video with another top stories article and video',
 'starting screen of Amino',
 'search coffee with location',
 'online dating',
 'a survery app with sinup and facebook login option',
 'my net diary with many options',
 'a list of songs']

In [30]:
from nltk import word_tokenize

In [31]:
query_lengths_hi_low = [len(word_tokenize(query)) for query in queries_diff_hi_low]

In [32]:
query_lengths_hi_low_describe = pd.DataFrame(query_lengths_hi_low)
query_lengths_hi_low_describe.describe()

| statistic | value |
|---|---|
| count | 10.0 |
| mean | 5.2 |
| std | 3.011091 |
| min | 2.0 |
| 25% | 3.25 |
| 50% | 4.0 |
| 75% | 6.75 |
| max | 11.0 |


## 3.2 Top-10 Negative Query Examples
In the following, we obtain the top-10 negative query examples, i.e. the ten queries with the largest decreases of the Average Precision when using AQE.

In [33]:
# Look up the query text for the ten indexes with the largest AveP decrease
queries_diff_low_hi = [goldstandard.loc[query_index, 'query'] for query_index in diff_sorted_low_hi_args[:10]]

In [34]:
queries_diff_low_hi

['down nearby picks',
 'photo preview app with zoomIn zoomOut options',
 'learning and play piano online',
 'song with the artist picture and pause button',
 'Chinese salad recipes',
 'voice changer with many buttons',
 'customer complaint service',
 'screen with sample movies and information on recording with SnapMovie',
 'page that display the questions  with options',
 'Information screen to add a tabatas in a tabatas list app']

In [35]:
from nltk import word_tokenize

In [36]:
query_lengths_low_hi = [len(word_tokenize(query)) for query in queries_diff_low_hi]

In [37]:
query_lengths_low_hi_describe = pd.DataFrame(query_lengths_low_hi)
query_lengths_low_hi_describe.describe()

| statistic | value |
|---|---|
| count | 10.0 |
| mean | 6.2 |
| std | 2.898275 |
| min | 3.0 |
| 25% | 3.5 |
| 50% | 6.0 |
| 75% | 7.75 |
| max | 11.0 |
