## Ablation Study for PaperSeek

This notebook presents an ablation study for the PaperSeek pipeline, examining how different kinds of synthetic data affect retrieval performance across various scientific topics. It covers two comparisons:

1. Comparison of retrieval performance using synthetic research questions (RQs) versus the original RQs.
2. Comparison of retrieval performance using synthetic Core Publications versus a random Core Publication.

In [1]:
from synergy_dataset import Dataset, iter_datasets
from utils import DataReader, Query
from openai import OpenAI
import pandas as pd
import json

# Load the evaluation data; the context manager closes the file handle
with open("data/evaluation_data.json", "r", encoding="utf-8") as f:
    eval_data = json.load(f)
reader = DataReader(create_index=False)

In [2]:
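def invert_abstract(inv_index):
    """Rebuild the abstract text from a {word: [positions]} inverted index."""
    if inv_index is None:
        return None
    # Flatten to (word, position) pairs, then order the words by position
    l_inv = [(w, p) for w, pos in inv_index.items() for p in pos]
    return " ".join(w for w, _ in sorted(l_inv, key=lambda x: x[1]))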

def generate_synthetic_rq(text, n=1):
    """Ask the model for `n` research questions about `text`; returns the raw 'RQ1: ...' string."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[
            {"role": "system", "content": "You are a research expert that draws research questions from the given content."},
            {"role": "user", "content": f"""Generate {n} research question(s) for the following content: {text}.
             
             Your output must be as follows:
             RQ1: <RQ1>
             RQ2: <RQ2>
             etc...
             """}
        ],
        seed=42,
    )
    return response.choices[0].message.content
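To make the abstract reconstruction concrete, the sketch below applies the same logic as `invert_abstract` to a small hand-made inverted index. The helper name `rebuild_abstract` and the toy index are assumptions for illustration only; real indices come from the `abstract_inverted_index` field of the dataset metadata.

```python
def rebuild_abstract(inv_index):
    """Rebuild plain text from a {word: [positions]} inverted index."""
    if inv_index is None:
        return None
    # One (position, word) pair per occurrence, sorted by position
    pairs = [(pos, word) for word, positions in inv_index.items() for pos in positions]
    return " ".join(word for _, word in sorted(pairs))


# Hypothetical toy index, not taken from the dataset
toy_index = {"the": [0, 3], "cat": [1], "chased": [2], "mouse": [4]}
print(rebuild_abstract(toy_index))  # the cat chased the mouse
```

A word that occurs several times simply lists several positions, which is why the index has to be flattened before sorting.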


In [3]:
# LitQEval data
slr_content = {
    "Software Defect Prediction": """Title: A Systematic Literature Review of Software Defect Prediction: Research Trends, Datasets, Methods and Frameworks[SEP]
    Abstract: Recent studies of software defect prediction typically produce datasets, methods and frameworks which allow software engineers to focus on development activities in terms of defect-prone code, thereby improving software quality and making better use of resources. Many software defect prediction datasets, methods and frameworks are published disparate and complex, thus a comprehensive picture of the current state of defect prediction research that exists is missing. This literature review aims to identify and analyze the research trends, datasets, methods and frameworks used in software defect prediction research betweeen 2000 and 2013. Based on the defined inclusion and exclusion criteria, 71 software defect prediction studies published between January 2000 and December 2013 were remained and selected to be investigated further. This literature review has been undertaken as a systematic literature review. Systematic literature review is defined as a process of identifying, assessing, and interpreting all available research evidence with the purpose to provide answers for specific research questions. Analysis of the selected primary studies revealed that current software defect prediction research focuses on five topics and trends: estimation, association, classification, clustering and dataset analysis. The total distribution of defect prediction methods is as follows. 77.46% of the research studies are related to classification methods, 14.08% of the studies focused on estimation methods, and 1.41% of the studies concerned on clustering and association methods. In addition, 64.79% of the research studies used public datasets and 35.21% of the research studies used private datasets. Nineteen different methods have been applied to predict software defects. From the nineteen methods, seven most applied methods in software defect prediction are identified. 
Researchers proposed some techniques for improving the accuracy of machine learning classifier for software defect prediction by ensembling some machine learning methods, by using boosting algorithm, by adding feature selection and by using parameter optimization for some classifiers. The results of this research also identified three frameworks that are highly cited and therefore influential in the software defect prediction field. They are Menzies et al. Framework, Lessmann et al. Framework, and Song et al. Framework.""",
    "Software Fault Prediction Metrics": """Title: Software fault prediction metrics: A systematic literature review[SEP]
    Abstract: ContextSoftware metrics may be used in fault prediction models to improve software quality by predicting fault location.ObjectiveThis paper aims to identify software metrics and to assess their applicability in software fault prediction. We investigated the influence of context on metrics’ selection and performance.MethodThis systematic literature review includes 106 papers published between 1991 and 2011. The selected papers are classified according to metrics and context properties.ResultsObject-oriented metrics (49%) were used nearly twice as often compared to traditional source code metrics (27%) or process metrics (24%). Chidamber and Kemerer’s (CK) object-oriented metrics were most frequently used. According to the selected studies there are significant differences between the metrics used in fault prediction performance. Object-oriented and process metrics have been reported to be more successful in finding faults compared to traditional size and complexity metrics. Process metrics seem to be better at predicting post-release faults compared to any static code metrics.ConclusionMore studies should be performed on large industrial software systems to find metrics more relevant for the industry and to answer the question as to which metrics should be used in a given context.""",
    "Cloud Migration": """Title: Cloud Migration Research: A Systematic Review[SEP]
    Abstract: Background--By leveraging cloud services, organizations can deploy their software systems over a pool of resources. However, organizations heavily depend on their business-critical systems, which have been developed over long periods. These legacy applications are usually deployed on-premise. In recent years, research in cloud migration has been carried out. However, there is no secondary study to consolidate this research. Objective--This paper aims to identify, taxonomically classify, and systematically compare existing research on cloud migration. Method--We conducted a systematic literature review (SLR) of 23 selected studies, published from 2010 to 2013. We classified and compared the selected studies based on a characterization framework that we also introduce in this paper. Results--The research synthesis results in a knowledge base of current solutions for legacy-to-cloud migration. This review also identifies research gaps and directions for future research. Conclusion--This review reveals that cloud migration research is still in early stages of maturity, but is advancing. It identifies the needs for a migration framework to help improving the maturity level and consequently trust into cloud migration. This review shows a lack of tool support to automate migration tasks. This study also identifies needs for architectural adaptation and self-adaptive cloud-enabled systems.""",
    "Multicore Performance Prediction": """Title: Parallelization, Modeling, and Performance Prediction in the Multi-/Many Core Area: A Systematic Literature Review[SEP]
    Abstract: Context: Software developers face complex, connected, and large software projects. The development of such systems involves design decisions that directly impact the quality of the software. For an early decision making, software developers can use model-based prediction approaches for (non-)functional quality properties. Unfortunately, the accuracy of these approaches is challenged by newly introduced hardware features like multiple cores within a single CPU (multicores) and their dependence on shared memory and other shared resources. Objectives: Our goal is to understand whether and how existing model-based performance prediction approaches face this challenge. We plan to use gained insights as foundation for enriching existing prediction approaches with capabilities to predict systems running on multicores. Methods: We perform a Systematic Literature Review (SLR) to identify current model-based prediction approaches in the context of multicores. Results: Our SLR covers the software engineering, embedded systems, High Performance Computing, and Software Performance Engineering domains for which we examined 34 sources in detail. We found various performance prediction approaches which tries to increase prediction accuracy for multicore systems by including shared memory designs to the prediction models. Conclusion: However, our results show that the memory designs models are only in an initial phase. Further research has to be done to improve cache, memory, and memory bandwidth model as well as to include auto tuner support.""",
    "Business Process Meta Models": """Title: What is a process model composed of?[SEP]
    Abstract:  Business process modelling languages typically enable the representation of business process models by employing (graphical) symbols. These symbols can vary depending upon the verbosity of the language, the modelling paradigm, the focus of the language and so on. To make explicit different constructs and rules employed by a specific language, as well as bridge the gap across different languages, meta-models have been proposed in the literature. These meta-models are a crucial source of knowledge on what state-of-the-art literature considers relevant to describe business processes. The goal of this work is to provide the first extensive systematic literature review (SLR) of business process meta-models. This SLR aims to answer research questions concerning: (1) the kind of meta-models proposed in the literature, (2) the recurring constructs they contain, (3) their purposes and (4) their evaluations. The SRL was performed manually considering papers automatically retrieved from reference paper repositories as well as proceedings of the main conferences in the Business Process Management research area. Sixty-five papers were selected and evaluated against four research questions. The results indicate the existence of a reasonable body of work conducted in this specific area, but not a full maturity. In particular, in answering the research questions several challenges have (re-)emerged for the Business Process Community, concerning: (1) the type of elements that constitute a Business Process and their meaning, (2) the absence of a (or several) reference meta-model(s) for the community, (3) the purpose for which meta-models are introduced in the literature and (4) a framework for the evaluation of the meta-models themselves. Moreover, the classification framework devised to answer the four research questions can provide a reference structure for future descriptive categorizations.""",
    "Data Stream Processing Latency": """Title: Enactment of adaptation in data stream processing with latency implications—A systematic literature review[SEP]
    Abstract: Context Stream processing is a popular paradigm to continuously process huge amounts of data. Runtime adaptation plays a significant role in supporting the optimization of data processing tasks. In recent years runtime adaptation has received significant interest in scientific literature. However, so far no categorization of the enactment approaches for runtime adaptation in stream processing has been established. Objective This paper identifies and characterizes different approaches towards the enactment of runtime adaptation in stream processing with a main focus on latency as quality dimension. Method We performed a systematic literature review (SLR) targeting five main research questions. An automated search, resulting in 244 papers, was conducted. 75 papers published between 2006 and 2018 were finally included. From the selected papers, we extracted data like processing problems, adaptation goals, enactment approaches of adaptation, enactment techniques, evaluation metrics as well as evaluation parameters used to trigger the enactment of adaptation in their evaluation. Results We identified 17 different enactment approaches and categorized them into a taxonomy. For each, we extracted the underlying technique used to implement this enactment approach. Further, we identified 9 categories of processing problems, 6 adaptation goals, 9 evaluation metrics and 12 evaluation parameters according to the extracted data properties. Conclusion We observed that the research interest on enactment approaches to the adaptation of stream processing has significantly increased in recent years. The most commonly applied enactment approaches are parameter adaptation to tune parameters or settings of the processing, load balancing used to re-distribute workloads, and processing scaling to dynamically scale up and down the processing. In addition to latency, most adaptations also address resource fluctuation / bottleneck problems. 
For presenting a dynamic environment to evaluate enactment approaches, researchers often change input rates or processing workloads.""",
    "Software Process Line": """Title: Software process line as an approach to support software process reuse: A systematic literature review[SEP]
    Abstract: Context Software Process Line (SPrL) aims at providing a systematic reuse technique to support reuse experiences and knowledge in the definition of software processes for new projects thus contributing to reduce effort and costs and to achieve improvements in quality. Although the research body in SPrL is expanding, it is still an immature area with results offering an overall view scattered with no consensus. Objective The goal of this work is to identify existing approaches for developing, using, managing and visualizing the evolution of SPrLs and to characterize their support, especially during the development of reusable process family artefacts, including an overview of existing SPrL supporting tools in their multiple stages; to analyse variability management and component-based aspects in SPrL; and, finally, to list practical examples and conducted evaluations. This research aims at reaching a broader and more consistent view of the research area and to provide perspectives and gaps for future research. Method We performed a systematic literature review according to well-established guidelines set. We used tools to partially support the process, which relies on a six-member research team. Results We report on 49 primary studies that deal mostly with conceptual or theoretical proposals and the domain engineering stage. Years 2014, 2015, and 2018 yielded the largest number of articles. This can indicate SPrL as a recent research theme and one that attracts ever-increasing interest. Conclusion Although this research area is growing, there is still a lack of practical experiences and approaches for actual applications or project-specific process derivations and decision-making support. The concept of an integrated reuse infrastructure is less discussed and explored; and the development of integrated tools to support all reuse stages is not fully addressed. 
Other topics for future research are discussed throughout the paper with gaps pointed as opportunities for improvements in the area.""",
}

# SYNERGY data: build the same "Title ... [SEP] Abstract ..." text for each dataset
for d in iter_datasets():
    publication = Dataset(d.name).metadata["publication"]
    title = publication["title"]
    abstract = invert_abstract(publication["abstract_inverted_index"])
    text = f"Title: {title}[SEP]\nAbstract: {abstract}"
    slr_content[title] = text


# Replace the titles with the proper topic name in the dataset
keys_map = {
    "Cerebral small vessel disease and the risk of dementia: A systematic review and meta‐analysis of population‐based evidence": "Cerebral Small Vessel Disease and the Risk of Dementia",
    "Psychological theories of depressive relapse and recurrence: A systematic review and meta-analysis of prospective studies": "Psychological Theories of Depressive Relapse and Recurrence",
    "Comparative efficacy and safety of long-acting oral opioids for chronic non-cancer pain: a systematic review": "Comparative Efficacy and Safety of Long-Acting Oral Opioids for Chronic Non-Cancer Pain",
    "Comparative efficacy and safety of skeletal muscle relaxants for spasticity and musculoskeletal conditions: a systematic review": "Efficacy and Safety of Skeletal Muscle Relaxants for Spasticity and Musculoskeletal Conditions",
    "A Systematic Literature Review on Fault Prediction Performance in Software Engineering": "Fault Prediction Performance in Software Engineering",
    "Does the Source of Mesenchymal Stem Cell Have an Effect in the Management of Osteoarthritis of the Knee? Meta-Analysis of Randomized Controlled Trials": "Mesenchymal Stem Cell Effect in the Management of Osteoarthritis of the Knee",
    "A Systematic Review Comparing Experimental Design of Animal and Human Methotrexate Efficacy Studies for Rheumatoid Arthritis: Lessons for the Translational Value of Animal Studies": "Comparing Experimental Design of Animal and Human Methotrexate Efficacy Studies for Rheumatoid Arthritis",
    "Poor nutritional condition promotes high‐risk behaviours: a systematic review and meta‐analysis": "Poor nutritional condition promotes high-risk behaviours",
    "Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts: A Systematic Review": "Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts",
    "Systematic Review and Meta-Analysis of Early-Life Exposure to Bisphenol A and Obesity-Related Outcomes in Rodents": "Early-Life Exposure to Bisphenol A and Obesity-Related Outcomes in Rodents",
}
# Rename the title keys to their topic names
for old_key, new_key in keys_map.items():
    slr_content[new_key] = slr_content.pop(old_key)
# Drop entries whose value is None
slr_content = {k: v for k, v in slr_content.items() if v is not None}


def generate_rqs(n_rqs=1):
    """Generate `n_rqs` synthetic RQs for every topic in `eval_data` not flagged as `generated`."""
    topics_with_rqs = [topic for topic, value in eval_data.items() if not value["generated"]]
    synthetic_queries = {}

    for topic in topics_with_rqs:
        text = slr_content[topic]
        synthetic_queries[topic] = generate_synthetic_rq(text, n_rqs)

    return synthetic_queries
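The key-renaming loop above raises a `KeyError` as soon as one title in `keys_map` does not match a key in `slr_content` exactly, which is easy to trigger because some titles contain non-ASCII hyphens. A more defensive variant could look like the sketch below. The helper `rename_keys` and the sample dictionaries are hypothetical and not part of the notebook.

```python
def rename_keys(mapping, renames):
    """Return a renamed copy of `mapping` plus the old keys that were not found."""
    renamed = dict(mapping)  # leave the input untouched
    missing = []
    for old_key, new_key in renames.items():
        if old_key in renamed:
            renamed[new_key] = renamed.pop(old_key)
        else:
            missing.append(old_key)
    return renamed, missing


# Hypothetical sample data for illustration
content = {"A long review title": "text A", "Cloud Migration": "text B"}
renames = {"A long review title": "Short Topic", "Title not present": "Other Topic"}
renamed, missing = rename_keys(content, renames)
print(sorted(renamed))  # ['Cloud Migration', 'Short Topic']
print(missing)          # ['Title not present']
```

Reporting the unmatched titles instead of failing makes it easier to spot a mismatch between the dataset metadata and the hand-written mapping.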

In [4]:
synethic_queries_1 = {
    "Cerebral Small Vessel Disease and the Risk of Dementia": "RQ1: How does the burden of different cerebral small vessel disease markers—specifically white matter hyperintensities, covert brain infarcts, and cerebral microbleeds—predict the risk and progression of Alzheimer’s disease and other forms of dementia in the general population?",
    "Psychological Theories of Depressive Relapse and Recurrence": "RQ1: How do the cognitive, behavioral, and personality-based psychological factors compare to psychodynamic factors in predicting depressive relapse among individuals with a history of major depressive disorder in prospective studies?",
    "Comparative Efficacy and Safety of Long-Acting Oral Opioids for Chronic Non-Cancer Pain": "RQ1: How do the efficacy and safety profiles of individual long-acting oral opioids compare to one another and to short-acting opioid formulations in the treatment of chronic non-cancer pain?",
    "Efficacy and Safety of Skeletal Muscle Relaxants for Spasticity and Musculoskeletal Conditions": "RQ1: How do the efficacy and safety profiles of commonly used skeletal muscle relaxants (eg, baclofen, tizanidine, dantrolene, and cyclobenzaprine) compare in the management of spasticity versus musculoskeletal conditions, and what are the clinical implications of these differences?",
    "Fault Prediction Performance in Software Engineering": "RQ1: How do the context and selection of independent variables, combined with modeling techniques such as Naive Bayes and Logistic Regression, influence the predictive performance of fault prediction models in software engineering?",
    "Mesenchymal Stem Cell Effect in the Management of Osteoarthritis of the Knee": "RQ1: How does the source of mesenchymal stem cells—adipose-derived versus bone marrow-derived—influence the long-term clinical outcomes (such as pain reduction, functional improvement, and imaging changes) and safety in patients with knee osteoarthritis?",
    "Comparing Experimental Design of Animal and Human Methotrexate Efficacy Studies for Rheumatoid Arthritis": "RQ1: How does the misalignment of key experimental design elements between animal and human methotrexate efficacy studies impact the translational validity of research findings for rheumatoid arthritis?",
    "Poor nutritional condition promotes high-risk behaviours": "RQ1: How does poor nutritional condition influence risk-taking behaviour in animals, and what role do factors such as experimental context and life stage play in modulating this relationship?",
    "Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts": "RQ1: How do Bayesian and frequentist estimation methods compare in terms of accuracy, convergence, and model fit when applied to structural equation models in small sample contexts?",
    "Early-Life Exposure to Bisphenol A and Obesity-Related Outcomes in Rodents": "RQ1: How does early-life exposure to bisphenol A affect adiposity and lipid metabolism in rodent models, and are these effects modulated by sex and exposure dose?",
    "Software Process Line": "RQ1: How can an integrated tool infrastructure be designed to support all reuse stages in Software Process Lines, particularly in managing variability and component-based artefacts, to facilitate project-specific process derivation and improve cost, effort, and quality outcomes?",
    "Data Stream Processing Latency": "RQ1: How do different runtime adaptation enactment approaches—such as parameter adaptation, load balancing, and processing scaling—influence latency performance in data stream processing systems under dynamic workloads and resource fluctuations?",
    "Business Process Meta Models": "RQ1: How can a unified reference framework be developed from existing business process meta-models to standardize the identification, meaning, and evaluation of recurring constructs in business processes?",
    "Multicore Performance Prediction": "RQ1: How can current model-based performance prediction approaches be enhanced to accurately model shared memory designs—including cache behavior and memory bandwidth—in multicore systems?",
    "Cloud Migration": "RQ1: How can the integration of automated tool support and adaptive migration frameworks address current gaps in legacy-to-cloud migration, particularly in terms of improving architectural adaptation and system self-adaptiveness?",
    "Software Fault Prediction Metrics": "RQ1: How do different types of software fault prediction metrics (object‐oriented, traditional source code, and process metrics) perform across various development contexts, particularly in large industrial software systems?",
    "Software Defect Prediction": "RQ1: How have the frameworks, datasets, and prediction methods for software defect prediction evolved between 2000 and 2013, and what impact do they have on the accuracy and efficiency of defect classification and estimation in current research?",
}
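The generated strings keep the `RQ1:`, `RQ2:`, ... prefixes requested in the prompt. If the questions are needed individually, they can be split on those prefixes, as in the minimal sketch below. The helper `split_rqs` and the sample string are assumptions for illustration, and the pattern assumes each `RQn:` label starts a line.

```python
import re


def split_rqs(raw):
    """Split an 'RQ1: ... RQ2: ...' string into a list of question strings."""
    # Split on 'RQ<number>:' labels that start a line, then drop empty pieces
    parts = re.split(r"(?m)^\s*RQ\d+:\s*", raw)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical model output in the requested format
sample = "RQ1: How does A affect B? \n\nRQ2: What explains C?"
print(split_rqs(sample))  # ['How does A affect B?', 'What explains C?']
```

A string holding a single question, such as the values in the dictionary above, yields a one-element list.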


In [5]:
synethic_queries_5 = {
    "Cerebral Small Vessel Disease and the Risk of Dementia": "RQ1: How does white matter hyperintensity volume contribute to the increased risk of all-dementia and Alzheimer's disease in the general population?\n\nRQ2: What is the relationship between covert brain infarcts and the subsequent risk of developing all-dementia, and what factors may influence this association?\n\nRQ3: In what ways do cerebral microbleeds compare to white matter hyperintensities and covert brain infarcts in predicting the risk of all-dementia?\n\nRQ4: How do variations in study design, population characteristics, and imaging techniques affect the assessment of cerebral small vessel disease markers' association with dementia risk?\n\nRQ5: What further methodological approaches or prospective studies are needed to clarify the roles of different cerebral small vessel disease markers in the development of dementia?",
    "Psychological Theories of Depressive Relapse and Recurrence": "RQ1: How do cognitive, behavioral, and personality-based factors, as measured in prospective studies, predict depressive relapse compared to psychodynamic factors?\n\nRQ2: In what ways do the design features of prospective longitudinal studies (e.g., clinical interview assessments prior to relapse) influence the observed relationship between theory-derived psychological factors and depressive relapse?\n\nRQ3: What methodological or conceptual factors could explain the discrepancy between significant odds ratios and non-significant hazard ratios in studies examining psychological theories of depressive relapse?\n\nRQ4: Given the absence of prospective studies on the diathesis-stress theories in the current literature, what potential research designs could effectively investigate the role of diathesis-stress factors in depressive relapse?\n\nRQ5: How might future research refine the measurement of psychological constructs within cognitive, behavioral, and personality frameworks to improve their predictive utility for depressive relapse in individuals with a history of major depressive disorder?",
    "Comparative Efficacy and Safety of Long-Acting Oral Opioids for Chronic Non-Cancer Pain": "RQ1: What differences in efficacy exist among various long-acting opioids when managing chronic non-cancer pain?\n\nRQ2: How do the safety profiles of individual long-acting opioids compare in the treatment of chronic non-cancer pain?\n\nRQ3: Are long-acting opioids, as a class, more effective or safer than short-acting opioids in managing chronic non-cancer pain?\n\nRQ4: What are the methodological limitations and quality concerns in the current randomized trials and observational studies assessing long-acting opioids for chronic non-cancer pain?\n\nRQ5: In the case of oxycodone, how do long-acting and short-acting formulations compare in terms of efficacy and safety for pain control?",
    "Efficacy and Safety of Skeletal Muscle Relaxants for Spasticity and Musculoskeletal Conditions": "RQ1: What is the comparative efficacy of baclofen, tizanidine, and dantrolene in reducing spasticity, particularly in conditions such as multiple sclerosis?\n\nRQ2: How do the adverse effect profiles of tizanidine and baclofen differ in patients with spasticity, especially regarding symptoms like dry mouth and muscular weakness?\n\nRQ3: In the management of acute musculoskeletal conditions, how does cyclobenzaprine compare to carisoprodol, orphenadrine, and tizanidine in terms of both efficacy and safety?\n\nRQ4: What are the reasons for the observed gaps in adverse event assessment across randomized and observational studies evaluating skeletal muscle relaxants, and how might future trials be designed to address these deficiencies?\n\nRQ5: What is the incidence and clinical significance of rare but serious hepatotoxicity associated with dantrolene and, to a lesser degree, chlorzoxazone, and how should these risks inform clinical decision-making?",
    "Fault Prediction Performance in Software Engineering": "RQ1: How do different modeling techniques, specifically simple methods like Naive Bayes and Logistic Regression, compare to more complex approaches in terms of fault prediction performance?\n\nRQ2: In what ways does the combination of various independent variables and the application of feature selection contribute to improvements in fault prediction accuracy?\n\nRQ3: How does the contextual information of a software project (such as code complexity and project domain) affect the predictive performance of fault prediction models?\n\nRQ4: What role does the overall modeling methodology play in the reliability and confidence of fault prediction studies in software engineering?\n\nRQ5: How can future research address current limitations in reporting contextual and methodological details to enhance the reproducibility and assessment of fault prediction models?",
    "Mesenchymal Stem Cell Effect in the Management of Osteoarthritis of the Knee": "RQ1: How does the source of mesenchymal stem cells (bone marrow vs adipose-derived) affect short-term pain relief outcomes in patients with knee osteoarthritis as measured by the Visual Analog Score (VAS)? \n\nRQ2: In what ways do bone marrow–derived and adipose-derived mesenchymal stem cells differ in improving functional outcomes (e.g., WOMAC, KOOS, Lysholm Knee Scale, WORMS) over various follow-up periods (6 months, 1 year, and 24 months) in knee osteoarthritis management? \n\nRQ3: What are the comparative long-term safety profiles and incidence of adverse events between BM-MSC and AD-MSC therapies in the treatment of knee osteoarthritis? \n\nRQ4: What underlying biological or mechanistic factors might explain the observed differences in clinical outcomes between bone marrow–derived and adipose-derived mesenchymal stem cell treatments for knee osteoarthritis? \n\nRQ5: How can future randomized controlled trials be designed, particularly in terms of standardized dosing and head-to-head comparisons, to conclusively determine the optimal mesenchymal stem cell source for managing knee osteoarthritis?",
    "Comparing Experimental Design of Animal and Human Methotrexate Efficacy Studies for Rheumatoid Arthritis": "RQ1: How do differences in experimental design—such as sample size, randomization, and blinding—between animal and human studies affect the translational validity of methotrexate’s efficacy in rheumatoid arthritis?\n\nRQ2: In what ways do demographic differences (e.g., age and sex of subjects) between animal models and human participants influence the observed treatment outcomes of methotrexate?\n\nRQ3: How do the disparities in statistical methodologies (e.g., use of power calculations, Chi-square tests versus analyses of variance) impact the reliability and comparability of findings from animal and human studies of methotrexate?\n\nRQ4: What role does the route of administration, given its greater variability and reporting in animal studies compared to human studies, play in the observed differences in methotrexate efficacy?\n\nRQ5: What modifications in experimental design could be implemented in future animal studies to better align with human clinical trials and thus enhance the overall translational success in drug development for rheumatoid arthritis?",
    "Poor nutritional condition promotes high-risk behaviours": "RQ1: How does poor nutritional condition influence the propensity for risk-taking behavior in animals across different ecological contexts (predation, novelty, and exploration)?\n\nRQ2: To what extent do the experimental contexts used to measure risk-taking behavior moderate the relationship between nutritional condition and risk tendency?\n\nRQ3: How do the asset protection principle and the state-dependent safety hypothesis explain the observed differences in risk-taking behavior between low- and high-nutritional condition individuals?\n\nRQ4: In what ways does the life stage during which nutritional condition is manipulated affect subsequent risk-taking behavior in animals?\n\nRQ5: How does poor nutritional condition impact not only the mean levels but also the variance of risk-taking behavior, and what factors influence this observed behavioural variability?",
    "Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts": "RQ1: How do Bayesian and frequentist estimation methods compare in terms of accuracy and robustness when applied to structural equation models in small sample contexts?\n\nRQ2: What are the main advantages and limitations of using Bayesian estimation over maximum likelihood estimation for structural equation models with limited data?\n\nRQ3: In what ways do small sample sizes influence the performance and reliability of Bayesian versus frequentist estimation techniques in structural equation modeling?\n\nRQ4: What specific challenges and methodological issues are encountered when implementing Bayesian estimation in small sample contexts, as reported in the systematic review?\n\nRQ5: How can the findings from the systematic review inform the development of best practices or guidelines for choosing between Bayesian and frequentist approaches in structural equation models with small samples?",
    "Early-Life Exposure to Bisphenol A and Obesity-Related Outcomes in Rodents": "RQ1: How does early-life exposure to bisphenol A (BPA) influence obesity-related outcomes such as body weight, fat pad weight, and circulating lipid levels in rodents?\n\nRQ2: To what extent do dose variations, particularly exposures below versus above the U.S. reference dose of 50 μg/kg/d, modulate the impact of BPA on adiposity and lipid profiles in rodent models?\n\nRQ3: What role do sex differences play in the associations observed between early-life BPA exposure and obesity-related outcomes in rodents?\n\nRQ4: How are specific biomarkers—such as triglycerides, free fatty acids, and leptin—altered in rodents following early-life BPA exposure, and what mechanisms might underpin these changes?\n\nRQ5: In what ways does heterogeneity and potential risk of bias across rodent experimental studies affect the overall conclusions on the relationship between early-life BPA exposure and obesity-related outcomes?",
    "Software Process Line": "RQ1: How do current approaches in Software Process Lines (SPrL) support the development, management, and evolution of reusable process family artefacts, particularly regarding variability management and component-based design?\n\nRQ2: What are the key challenges and limitations identified in existing SPrL tools with respect to providing integrated support across all reuse stages in software process reuse?\n\nRQ3: In what ways are the conceptual and theoretical frameworks of SPrL being translated into practical applications and project-specific process derivations, and what gaps remain in this translation?\n\nRQ4: How does the evolution of research trends—from earlier to more recent studies—influence the understanding and application of SPrL in supporting cost reduction and quality improvements in software development?\n\nRQ5: What future research directions can be derived from the current literature gaps, particularly regarding the development of decision-making support systems and integrated reuse infrastructures in SPrL?",
    "Data Stream Processing Latency": "RQ1: How can the existing taxonomy of enactment approaches for runtime adaptation in stream processing be extended or refined to capture emerging techniques and dimensions not covered in current literature?\n\nRQ2: What are the specific impacts of different runtime adaptation techniques on latency in stream processing, and how do these techniques compare in mitigating latency issues under varying data workloads?\n\nRQ3: In what ways can adaptation strategies be enhanced to more effectively address resource fluctuations and bottlenecks beyond the commonly applied approaches of parameter tuning, load balancing, and scaling?\n\nRQ4: How do various evaluation metrics and trigger parameters influence the performance outcomes of adaptation enactment approaches, and what combinations yield optimal latency improvements in dynamic processing environments?\n\nRQ5: What practical challenges arise when implementing automated enactment approaches in real-world stream processing systems, particularly under scenarios with frequent changes in input rates and processing workloads, and how might these challenges be overcome?",
    "Business Process Meta Models": "RQ1: What are the different types of meta-models proposed in the literature for representing business processes, and how do they vary in their use of (graphical) symbols?  \nRQ2: Which recurring constructs are commonly present across various business process meta-models, and what roles do these elements play in the modeling paradigms?  \nRQ3: What purposes motivate the introduction of meta-models in business process modeling literature, and how do these purposes affect the design of business process languages?  \nRQ4: How are business process meta-models evaluated according to current research, and what frameworks or criteria are used in these evaluations?  \nRQ5: What challenges exist regarding the lack of a standardized reference meta-model for business processes, and how might a unified classification framework help address these issues?",
    "Multicore Performance Prediction": "RQ1: How effectively do existing model-based performance prediction approaches address the challenges posed by shared memory architectures in multicore systems?\n\nRQ2: What specific limitations exist in current cache, memory, and memory bandwidth models when predicting performance on multicore platforms, and how can these models be improved?\n\nRQ3: In what ways can the incorporation of auto tuner support enhance the accuracy and adaptability of performance prediction models for systems running on multicores?\n\nRQ4: How do the design decisions made in model-based prediction approaches impact the prediction of (non-)functional quality properties in large-scale, connected software projects deployed on multicore hardware?\n\nRQ5: What insights from systematic literature reviews across diverse domains (software engineering, embedded systems, high-performance computing, and software performance engineering) can guide future enhancements to performance prediction models in the multicore environment?",
    "Cloud Migration": "RQ1: How do legacy, business-critical systems face challenges during the transition from on-premise deployments to cloud infrastructures?  \nRQ2: What frameworks and taxonomic classifications currently exist for cloud migration strategies, and how do they compare in terms of addressing legacy system complexities?  \nRQ3: To what extent does tool support for automating migration tasks exist, and what are the key gaps hindering the automation of legacy-to-cloud migration processes?  \nRQ4: How can architectural adaptation and self-adaptive mechanisms be incorporated into cloud migration practices to enhance system robustness and trust?  \nRQ5: What research gaps and emerging trends, as identified through systematic literature reviews, should future studies focus on to improve the maturity and methodologies of cloud migration?",
    "Software Fault Prediction Metrics": "RQ1: How do object-oriented, traditional source code, and process metrics compare in terms of their effectiveness in predicting software faults?\n\nRQ2: In what ways does the context in which metrics are applied influence their selection and performance in fault prediction models?\n\nRQ3: What specific attributes of Chidamber and Kemerer’s (CK) object-oriented metrics contribute to their frequent use and success in predicting software faults?\n\nRQ4: How can process metrics be further optimized to improve the prediction of post-release faults in large-scale industrial software systems?\n\nRQ5: What additional empirical studies are needed to identify context-specific metrics that are most relevant and effective for fault prediction in diverse software environments?",
    "Software Defect Prediction": "RQ1: What are the primary trends in software defect prediction research between 2000 and 2013, and how have these trends influenced the use of estimation, association, classification, clustering, and dataset analysis methods?\n\nRQ2: How does the choice between public and private datasets affect the reproducibility and generalizability of software defect prediction studies?\n\nRQ3: In what ways do the various classification methods compare in their effectiveness for predicting software defects, and what factors contribute to the predominance of classification techniques in the literature?\n\nRQ4: How do the highly cited frameworks (Menzies et al., Lessmann et al., and Song et al.) influence current practices and future developments in software defect prediction research?\n\nRQ5: What impact do techniques such as ensemble learning, boosting algorithms, feature selection, and parameter optimization have on improving the accuracy of machine learning classifiers in software defect prediction?",
}

In [6]:
def run(n_rqs=1):
    """Compare core-publication recall for synthetic vs. original queries."""
    if n_rqs == 1:
        synthetic_queries = synethic_queries_1
    elif n_rqs == 5:
        synthetic_queries = synethic_queries_5
    else:
        raise ValueError(f"n_rqs must be 1 or 5, got {n_rqs}")

    topics = list(synthetic_queries.keys())
    s_queries = list(synthetic_queries.values())
    queries = [Query(topic).format(rq_count=n_rqs) for topic in topics]

    # Fetch both query sets in one call; the first half are the original queries.
    response = reader.fetch(queries + s_queries)
    hits, s_hits = response[: len(queries)], response[len(queries) :]

    results = {"topic": [], "synthetic": [], "real": [], "total": []}
    core_df = pd.read_excel("data/eval_cps.xlsx")
    for topic, hit, s_hit in zip(topics, hits, s_hits):
        topic_df = core_df[core_df["topic"] == topic]
        actual = topic_df.shape[0]

        # Build the masks from topic_df (not core_df) so they align with the filtered rows.
        n_cores = topic_df[topic_df["id"].isin(hit.ids)]["title"].count()
        s_n_cores = topic_df[topic_df["id"].isin(s_hit.ids)]["title"].count()

        results["topic"].append(topic)
        results["synthetic"].append(s_n_cores)
        results["real"].append(n_cores)
        results["total"].append(actual)

        print(f"----------------------{topic} Recall----------------------")
        print(f"Synthetic: {s_n_cores/actual}, Real: {n_cores/actual}")

    df = pd.DataFrame(results)
    df.to_excel(f"ablation/synthetic_rqs{n_rqs}_results.xlsx", index=False)
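
The recall reported by `run` is the share of a topic's core publications whose `id` appears among the retrieved ids. The cell below is a minimal sketch of that calculation on toy data. `recall_per_topic` and `toy_core` are hypothetical names introduced only for illustration, and the toy ids are not taken from `data/eval_cps.xlsx`. Unlike `run`, the sketch counts matching rows directly instead of counting non-null titles.

```python
import pandas as pd


def recall_per_topic(core_df, retrieved_ids, topic):
    """Fraction of a topic's core publications whose id is in retrieved_ids."""
    topic_df = core_df[core_df["topic"] == topic]
    if topic_df.empty:
        # No core publications for this topic: recall is undefined.
        return float("nan")
    found = topic_df["id"].isin(retrieved_ids).sum()
    return float(found) / len(topic_df)


# Toy data (assumption): same column names as used in run(), made-up values.
toy_core = pd.DataFrame(
    {
        "topic": ["A", "A", "A", "B"],
        "id": ["W1", "W2", "W3", "W4"],
        "title": ["t1", "t2", "t3", "t4"],
    }
)

# Two of the three core publications of topic "A" are retrieved.
toy_recall = recall_per_topic(toy_core, ["W1", "W3", "W9"], "A")
print(round(toy_recall, 3))  # 0.667
```

Returning NaN for a topic without core publications avoids the division by zero that `s_n_cores/actual` would raise in that case.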

#### Are synthetic RQs better than the original RQs?

In [7]:
df_1 = pd.read_excel("ablation/synthetic_rqs1_results.xlsx")
df_1[["synthetic", "real"]] = df_1[["synthetic", "real"]].div(df_1["total"], axis=0)
df_1 = df_1.drop(columns=["total"]).set_index("topic")
# add mean row
df_1.loc["Mean"] = df_1.mean()
df_1.style.highlight_max(subset=pd.IndexSlice[:, ["synthetic", "real"]], axis=1, color="green")\
    .format("{:.3f}")

topic,synthetic,real
Cerebral Small Vessel Disease and the Risk of Dementia,1.0,1.0
Psychological Theories of Depressive Relapse and Recurrence,0.969,1.0
Comparative Efficacy and Safety of Long-Acting Oral Opioids for Chronic Non-Cancer Pain,0.833,0.917
Efficacy and Safety of Skeletal Muscle Relaxants for Spasticity and Musculoskeletal Conditions,0.556,0.333
Fault Prediction Performance in Software Engineering,0.913,0.923
Mesenchymal Stem Cell Effect in the Management of Osteoarthritis of the Knee,0.587,0.651
Comparing Experimental Design of Animal and Human Methotrexate Efficacy Studies for Rheumatoid Arthritis,0.688,0.002
Poor nutritional condition promotes high-risk behaviours,0.069,0.042
Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts,1.0,0.957
Early-Life Exposure to Bisphenol A and Obesity-Related Outcomes in Rodents,0.821,0.791


In [8]:
df_5 = pd.read_excel("ablation/synthetic_rqs5_results.xlsx")
df_5[["synthetic", "real"]] = df_5[["synthetic", "real"]].div(df_5["total"], axis=0)
df_5 = df_5.drop(columns=["total"]).set_index("topic")
df_5.loc["Mean"] = df_5.mean()
df_5.style.highlight_max(subset=pd.IndexSlice[:, ["synthetic", "real"]], axis=1, color="green")\
    .format("{:.3f}")

topic,synthetic,real
Cerebral Small Vessel Disease and the Risk of Dementia,1.0,1.0
Psychological Theories of Depressive Relapse and Recurrence,0.969,1.0
Comparative Efficacy and Safety of Long-Acting Oral Opioids for Chronic Non-Cancer Pain,0.833,0.833
Efficacy and Safety of Skeletal Muscle Relaxants for Spasticity and Musculoskeletal Conditions,0.556,0.111
Fault Prediction Performance in Software Engineering,0.942,0.923
Mesenchymal Stem Cell Effect in the Management of Osteoarthritis of the Knee,0.476,0.651
Comparing Experimental Design of Animal and Human Methotrexate Efficacy Studies for Rheumatoid Arthritis,0.558,0.302
Poor nutritional condition promotes high-risk behaviours,0.083,0.083
Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts,0.957,0.957
Early-Life Exposure to Bisphenol A and Obesity-Related Outcomes in Rodents,0.806,0.791


In [9]:
df = pd.merge(df_1, df_5, on="topic", suffixes=("_1", "_5"))
# "total" was already dropped from both frames, so only real_5 is renamed; it serves as the "real" baseline.
df = df.rename(columns={"real_5": "real"})[["real", "synthetic_1", "synthetic_5"]]

df.style.highlight_max(subset=pd.IndexSlice[:, ["real", "synthetic_1","synthetic_5"]], axis=1, color="green")\
    .format("{:.3f}")

topic,real,synthetic_1,synthetic_5
Cerebral Small Vessel Disease and the Risk of Dementia,1.0,1.0,1.0
Psychological Theories of Depressive Relapse and Recurrence,1.0,0.969,0.969
Comparative Efficacy and Safety of Long-Acting Oral Opioids for Chronic Non-Cancer Pain,0.833,0.833,0.833
Efficacy and Safety of Skeletal Muscle Relaxants for Spasticity and Musculoskeletal Conditions,0.111,0.556,0.556
Fault Prediction Performance in Software Engineering,0.923,0.913,0.942
Mesenchymal Stem Cell Effect in the Management of Osteoarthritis of the Knee,0.651,0.587,0.476
Comparing Experimental Design of Animal and Human Methotrexate Efficacy Studies for Rheumatoid Arthritis,0.302,0.688,0.558
Poor nutritional condition promotes high-risk behaviours,0.083,0.069,0.083
Bayesian Versus Frequentist Estimation for Structural Equation Models in Small Sample Contexts,0.957,1.0,0.957
Early-Life Exposure to Bisphenol A and Obesity-Related Outcomes in Rodents,0.791,0.821,0.806
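
One way to summarise a table like the one above is to count, per topic, whether the synthetic queries retrieve more, the same, or fewer core publications than the original ones. The sketch below does this for three rows copied from the printed output. `outcome_counts`, `rows`, and the tolerance `tol` are assumptions made for illustration and are not part of the PaperSeek pipeline.

```python
import pandas as pd

# Three rows copied from the combined table above.
rows = pd.DataFrame(
    {
        "real": [1.0, 0.111, 0.651],
        "synthetic_1": [1.0, 0.556, 0.587],
        "synthetic_5": [1.0, 0.556, 0.476],
    },
    index=[
        "Cerebral Small Vessel Disease and the Risk of Dementia",
        "Efficacy and Safety of Skeletal Muscle Relaxants for Spasticity and Musculoskeletal Conditions",
        "Mesenchymal Stem Cell Effect in the Management of Osteoarthritis of the Knee",
    ],
)


def outcome_counts(table, synthetic_col, real_col="real", tol=1e-9):
    """Count topics where the synthetic recall is higher, equal, or lower."""
    delta = table[synthetic_col] - table[real_col]
    return {
        "synthetic_better": int((delta > tol).sum()),
        "tie": int((delta.abs() <= tol).sum()),
        "real_better": int((delta < -tol).sum()),
    }


counts_1 = outcome_counts(rows, "synthetic_1")
counts_5 = outcome_counts(rows, "synthetic_5")
print(counts_1)
print(counts_5)
```

Applied to the full merged `df`, the same function would give the tally over all topics; the "Mean" row, if present, should be dropped first so it is not counted as a topic.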
