### LLM-Based Validation Notebook

This notebook supported the validation of the generated QA dataset, complementing the human validation.

In this notebook, we use an LLM to check the quality of the generated questions and answers. The outputs were then manually reviewed to decide whether the feedback should be incorporated into the dataset.

Accompanying this notebook are multiple text files that contain previous feedback from the LLM. 

In [1]:
import os
import pandas as pd
from tqdm import tqdm
from sqa_system.core.language_model.llm_provider import LLMProvider
from sqa_system.core.data.file_path_manager import FilePathManager
from sqa_system.core.config.models import LLMConfig
llm_config_classification = LLMConfig.from_dict({
    "additional_params": {},
    "endpoint": "OpenAI",
    "name_model": "gpt-4.1",
    "temperature": 0.0,
    "max_tokens": -1
})
llm_adapter = LLMProvider().get_llm_adapter(llm_config_classification)


fpm = FilePathManager()
current_dir = os.getcwd()
deep_dataset_location = fpm.combine_paths(fpm.get_parent_directory(current_dir), "qa_datasets", "full", "deep_graph_dataset.csv")
reduced_deep_dataset_location = fpm.combine_paths(fpm.get_parent_directory(current_dir), "qa_datasets", "reduced", "reduced_deep_graph_dataset.csv")
reduced_deep_dataset_df = pd.read_csv(reduced_deep_dataset_location)
deep_dataset_df = pd.read_csv(deep_dataset_location)

Rotating log file
2025-04-19 16:01:34,924 - New session started


In [2]:
def validate_grammar(text: str) -> tuple[bool, str]:
    """Checks whether the string is grammatically correct."""
    prompt = f"You are a grammar checker. Check the following text for grammatical correctness:\n\n{text}\n\nIf the text has grammatical issues say: 'The text has grammatical issues. Here is the corrected text:' and provide the corrected text. If the text is grammatically correct, say: 'The text is grammatically correct.' Do not add any additional details or explanations. Do not add quotation marks (single or double) to any words or phrases unless they are already present in the original text. Maintain the original quotation marks."
    response = llm_adapter.generate(prompt).content
    if "The text is grammatically correct." in response:
        return True, text
    else:
        corrected_text = response.split("Here is the corrected text:")[-1].strip()
        return False, corrected_text
    
def validate_question_answer_consistency(question: str, answer: str) -> tuple[bool, str]:
    """Checks whether the answer is consistent with the question."""
    prompt = f"""You are an answer validator. Check whether the following answer is consistent with the question by following the checklist: \n
    1. Does the answer directly address the question?
    2. Does the answer avoid adding details that the question did not ask for?
    3. Are the verbs used in the question and answer consistent?
    4. Are the nouns used in the question and answer consistent?  

    Note that the answer should repeat the question in some way for example: 'What is the research object of the paper with the title X?' should be answered with 'The research object of the paper with the title X is Y.' Therefore, the answer should not just simply state the facts.
    
    If any of the checklist items are not satisfied, say: 'Checklist item(s) X,Y is/are not satisfied. Here is the corrected answer:' and provide the corrected answer where each of the checklist items is satisfied. If all checklist items are satisfied, say: 'The answer is consistent with the question.' Do not add any additional details or explanations.\n\nQuestion: {question}\nAnswer: {answer}"""
    response = llm_adapter.generate(prompt).content
    if "The answer is consistent with the question" in response:
        return True, answer   
    else:
        corrected_answer = response.split("Here is the corrected answer:")[-1].strip()
        return False, corrected_answer
    
def validate_typed_question(question_template: str) -> tuple[bool, str]:
    """Checks whether the question template correctly adds the type of the placeholder in the question."""
    prompt = f"""You are a question validator. You receive a question with a placeholder that is indicated as '[placeholder]'. For any given placeholder, there is a typed string which needs to be contained within the question (not the placeholder) to indicate the type of constraint that the question is asking for. You need to check whether the question correctly contains the typed string that is expected for the given placeholder. The table of placeholders and their expected typed strings is as follows:
        
    | Placeholder | Typed String | 
    | ----------- | ----------- |
    | -           | doi         |
    | paper title | paper with the title|
    | -           | author     |
    | -           | venue       |
    | -           | research field |
    | year        | publication year |
    | research level | research level |
    | paper class name | paper class |
    | Threat to Validity | threat to validity |
    | -           | input data  | 
    | -           | replication package link | 
    | -           | tool support |
    | -           | threats to validity guideline |
    | research object name | research object |
    | evaluation method name | evaluation method |
    | -           | evaluation guideline |
    | property name | property |
    | sub-property name | property|
    | Provides Replication Package | provides replication package |
    
    If the question correctly includes the typed string that is expected for the given placeholder in the question, say 'The question correctly includes the typed string that is expected for the given placeholder.' If the question does not include the typed string that is expected for the given placeholder, say 'The question does not include the typed string that is expected for the given placeholder. Here is the corrected question:' and provide the corrected question. Do not add any additional details or explanations. 
    
    Examples:
    Question: In which venue has the paper with the title '[paper title]' been published?
    Answer: The question correctly includes the typed string that is expected for the given placeholder.
    Question: In which venue has '[paper title]' been published?
    Answer: The question does not include the typed string that is expected for the given placeholder. Here is the corrected question: In which venue has the paper with the title '[paper title]' been published?
    
    
    The Question with placeholder: {question_template}
    """
    
    response = llm_adapter.generate(prompt).content
    if "The question correctly includes the typed string that is expected for the given placeholder." in response:
        return True, question_template
    else:
        corrected_question = response.split("Here is the corrected question:")[-1].strip()
        if corrected_question == question_template:
            return True, question_template        
        return False, corrected_question
    
def validate_correct_verbs(text: str) -> tuple[bool, str]:
    """Validates whether the verbs in the text are correctly used."""
    prompt = f"""You are a verb-validator. You will receive a single-sentence input and must check every verb that governs one of the listed nouns. Matching is case-insensitive and applies to both singular and plural forms. Any inflected form of an allowed verb (e.g. “apply,” “applies,” “applying,” “applied”) is also acceptable.

    Allowed-verbs table:
    ```markdown
    | Noun               | Allowed Verbs          |
    |--------------------|------------------------|
    | research object    | investigate, evaluate  |
    | evaluation method  | evaluate, apply        |
    | threat to validity | discuss                |
    | guideline          | use                    |
    | property           | investigate, evaluate  |
    | input data         | use                    |
    ```
    If the verbs in the text are correctly used, say 'The verbs in the text are correctly used.' If the verbs in the text are not correctly used, say 'The verbs in the text are not correctly used. Here is the corrected text:' and provide the corrected text. Do not add any additional details or explanations.

    Text: Which paper discusses the property robustness?
    Answer: The verbs in the text are not correctly used. Here is the corrected text: Which paper investigates the property robustness?
    
    Text: {text}
    Answer:
    """

    response = llm_adapter.generate(prompt).content
    if "The verbs in the text are correctly used." in response:
        return True, text
    else:
        corrected_text = response.split("Here is the corrected text:")[-1].strip()
        return False, corrected_text

def validate_question_to_template(question: str, template: str) -> tuple[bool, str]:
    """Validates whether the template is consistent with the question."""
    prompt = f"""You are a question to template validator. You receive a question and a template. You need to check whether the template is consistent with the question. A template is consistent with the question, if the template is the exact same as the question except for the placeholders that are indicated as '[placeholder]'. The placeholders are the only parts of the template that can be different from the question. If the template is consistent with the question, say 'The template is consistent with the question.' If the template is not consistent with the question, say 'The template is not consistent with the question. Here is the corrected template:' and provide the corrected template.\nQuestion: {question}\nTemplate: {template}"""
    response = llm_adapter.generate(prompt).content
    if "The template is consistent with the question." in response:
        return True, template
    else:
        corrected_template = response.split("Here is the corrected template:")[-1].strip()
        return False, corrected_template
    
def get_list_of_difference(first: str, second: str) -> list[str]:
    """Returns a list of those words that are in the second string but not in the first string."""
    first_words = set(first.split())
    second_words = set(second.split())
    difference = second_words - first_words
    return list(difference)
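
Because `get_list_of_difference` compares sets of words, it reports nothing when a correction only adds or moves a word that already occurs elsewhere in the sentence (for example, inserting "in" into a sentence that already contains "in"). The following is a minimal sketch of an order-aware alternative based on the standard-library `difflib.SequenceMatcher`. The function name `get_word_level_changes` is hypothetical and not part of the notebook or of `sqa_system`.

```python
import difflib


def get_word_level_changes(first: str, second: str) -> list[tuple[str, str, str]]:
    """Return one (operation, words from first, words from second) tuple per change.

    The operation is 'insert', 'delete', or 'replace', as reported by difflib.
    """
    first_words = first.split()
    second_words = second.split()
    # autojunk=False disables difflib's heuristic that ignores very frequent items.
    matcher = difflib.SequenceMatcher(a=first_words, b=second_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged stretch of words
        changes.append((tag, " ".join(first_words[i1:i2]), " ".join(second_words[j1:j2])))
    return changes
```

Unlike the set difference, this keeps word order and duplicates, so a reviewer sees where in the sentence the LLM changed something.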

### Validate Reduced QA Dataset

In [None]:
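# `index + 2` in the messages below converts the zero-based DataFrame index
# into the corresponding line of the CSV file (line 1 is the header row).
for index, row in tqdm(reduced_deep_dataset_df.iterrows(), total=len(reduced_deep_dataset_df)):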
    question = row['question']
    golden_answer = row['golden_answer']
    question_template = row['updated template']
    is_typed = row['semi-typed']
    
    is_grammatically_correct, corrected_question = validate_grammar(question)
    if not is_grammatically_correct:
        print(f"Line: {index + 2}, Found grammar issues in the question")
        print(f"Original Question: {question}")
        print(f"Corrected Question: {corrected_question}")
        print(f"Difference Question Corrected: {get_list_of_difference(question, corrected_question)}")
        print(f"Difference Question Original: {get_list_of_difference(corrected_question, question)}")
        print()
        
    is_grammatically_correct, corrected_answer = validate_grammar(golden_answer)
    if not is_grammatically_correct:
        print(f"Line: {index + 2}, Found grammar issues in the answer")
        print(f"Original Answer: {golden_answer}")
        print(f"Corrected Answer: {corrected_answer}")
        print(f"Difference Answer Corrected: {get_list_of_difference(golden_answer, corrected_answer)}")
        print(f"Difference Answer Original: {get_list_of_difference(corrected_answer, golden_answer)}")
        print()
        
    is_consistent, corrected_answer = validate_question_answer_consistency(corrected_question, corrected_answer)
    if not is_consistent:
        print(f"Line: {index + 2}, Found inconsistency between question and answer")
        print(f"Original Answer: {golden_answer}")
        print(f"Corrected Answer: {corrected_answer}")
        print(f"Difference Answer Corrected: {get_list_of_difference(golden_answer, corrected_answer)}")
        print(f"Difference Answer Original: {get_list_of_difference(corrected_answer, golden_answer)}")
        print()
    
    if is_typed:
        is_valid, corrected_question_template = validate_typed_question(question_template)
        if not is_valid:
            print(f"Line: {index + 2}, found typing issues in the question template")
            print(f"Original Question Template: {question_template}")
            print(f"Corrected Question Template: {corrected_question_template}")
            print(f"Difference Question Template Corrected: {get_list_of_difference(question_template, corrected_question_template)}")
            print(f"Difference Question Template Original: {get_list_of_difference(corrected_question_template, question_template)}")
            print()
            
        is_noun_verb_correct, corrected_question = validate_correct_verbs(corrected_question)
        if not is_noun_verb_correct:
            print(f"Line: {index + 2}, found verb issues in the question")
            print(f"Original Question: {question}")
            print(f"Corrected Question: {corrected_question}")
            print(f"Difference Question Corrected: {get_list_of_difference(question, corrected_question)}")
            print(f"Difference Question Original: {get_list_of_difference(corrected_question, question)}")
            print()

        is_noun_verb_correct, corrected_answer = validate_correct_verbs(corrected_answer)
        if not is_noun_verb_correct:
            print(f"Line: {index + 2}, found verb issues in the answer")
            print(f"Original Answer: {golden_answer}")
            print(f"Corrected Answer: {corrected_answer}")
            print(f"Difference Answer Corrected: {get_list_of_difference(golden_answer, corrected_answer)}")
            print(f"Difference Answer Original: {get_list_of_difference(corrected_answer, golden_answer)}")
            print()

  7%|▋         | 3/44 [00:16<03:35,  5.26s/it]

Line: 5, Found grammar issues in the answer
Original Answer: The publications with the paper class personal experience, ranked in descending order by publication year, are: 1. Data-Centric Communication and Containerization for Future Automotive Software Architectures (2018); 2. Towards a Reference Architecture for Cloud-Based Plant Genotyping and Phenotyping Analysis Frameworks (2017).
Corrected Answer: The publications with the paper class of personal experience, ranked in descending order by publication year, are: 1. Data-Centric Communication and Containerization for Future Automotive Software Architectures (2018); 2. Towards a Reference Architecture for Cloud-Based Plant Genotyping and Phenotyping Analysis Frameworks (2017).
Difference Answer Corrected: ['of']
Difference Answer Original: []



 11%|█▏        | 5/44 [00:27<03:30,  5.41s/it]

Line: 7, Found inconsistency between question and answer
Original Answer: Konstantinos Plakidas has published three papers in the following years: 2021, 2020, and 2019. Therefore, the distribution of papers published per year is as follows: one paper in 2021, one paper in 2020, and one paper in 2019, resulting in a total of three papers over three years, which is one paper per year.
Corrected Answer: The number of papers that the author Konstantinos Plakidas published per publication year is one paper in 2019, one paper in 2020, and one paper in 2021.
Difference Answer Corrected: ['number', 'author', 'The', '2021.', 'that', 'publication']
Difference Answer Original: ['a', 'over', 'as', 'years,', 'Therefore,', 'resulting', 'has', 'years:', 'following', 'year.', 'three', 'follows:', 'total', '2021,', '2019.', 'distribution', 'which']



 16%|█▌        | 7/44 [00:37<03:10,  5.16s/it]

Line: 9, found verb issues in the answer
Original Answer: The evaluation methods of the paper 'Continuous Analysis of Collaborative Design' are Questionnaire, Controlled Experiment, and Technical Experiment.
Corrected Answer: The evaluation methods of the paper 'Continuous Analysis of Collaborative Design' are applied as Questionnaire, Controlled Experiment, and Technical Experiment.
Difference Answer Corrected: ['as', 'applied']
Difference Answer Original: []



 18%|█▊        | 8/44 [00:43<03:15,  5.42s/it]

Line: 10, Found grammar issues in the answer
Original Answer: The number of threats to validity discussed in the paper with the title 'DesignDiff: Continuously Modeling Software Design Difference from Code Revisions' is two.
Corrected Answer: The number of threats to validity discussed in the paper titled 'DesignDiff: Continuously Modeling Software Design Difference from Code Revisions' is two.
Difference Answer Corrected: ['titled']
Difference Answer Original: ['title', 'with']



 27%|██▋       | 12/44 [01:02<02:31,  4.75s/it]

Line: 14, Found inconsistency between question and answer
Original Answer: The paper with the title 'Designing Robust Software Systems through Parametric Markov Chain Synthesis' is the only paper that evaluates the property robustness.
Corrected Answer: The paper that investigates the property robustness is 'Designing Robust Software Systems through Parametric Markov Chain Synthesis.'
Difference Answer Corrected: ["Synthesis.'", 'investigates', 'robustness']
Difference Answer Original: ['evaluates', "Synthesis'", 'with', 'robustness.', 'title', 'only']



 36%|███▋      | 16/44 [01:23<02:24,  5.15s/it]

Line: 18, Found grammar issues in the question
Original Question: Among papers that apply the evaluation method Data Science, how many have not used input data compared to those that have made their input data available?
Corrected Question: Among papers that apply the evaluation method Data Science, how many have not made their input data available compared to those that have?
Difference Question Corrected: ['have?', 'available']
Difference Question Original: ['available?', 'used']

Line: 18, Found inconsistency between question and answer
Original Answer: There is one paper with the evaluation method Data Science that does not provide input data, and there are two papers with the evaluation method Data Science that make their input data available. Therefore, the comparison shows that there is one paper with no input data and two papers with input data available.
Corrected Answer: The number of papers that apply the evaluation method Data Science and have not made their input data avai

 41%|████      | 18/44 [01:38<02:39,  6.12s/it]

Line: 20, Found inconsistency between question and answer
Original Answer: Evaluation research is the most frequently used paper class for papers that investigate the property Reliability.
Corrected Answer: The paper class that papers are most frequently classified as when they investigate the property Reliability is Evaluation research.
Difference Answer Corrected: ['classified', 'as', 'research.', 'Reliability', 'The', 'are', 'when', 'they']
Difference Answer Original: ['Reliability.', 'research', 'for', 'used']



 43%|████▎     | 19/44 [01:43<02:28,  5.93s/it]

Line: 21, Found grammar issues in the question
Original Question: How are the papers that investigate the research object Technical Debt distributed by their publication year?
Corrected Question: How are the papers that investigate the research object of Technical Debt distributed by their publication year?
Difference Question Corrected: ['of']
Difference Question Original: []



 45%|████▌     | 20/44 [01:48<02:14,  5.61s/it]

Line: 22, Found grammar issues in the question
Original Question: What are the research objects investigated in papers of the paper class philosophical paper?
Corrected Question: What are the research objects investigated in papers of the philosophical paper class?
Difference Question Corrected: ['class?']
Difference Question Original: ['paper?', 'class']

Line: 22, Found grammar issues in the answer
Original Answer: The research object investigated in papers of the philosophical paper class is Architecture Decision Making.
Corrected Answer: The research object investigated in papers of the philosophical paper class is architecture decision making.
Difference Answer Corrected: ['architecture', 'making.', 'decision']
Difference Answer Original: ['Decision', 'Architecture', 'Making.']



 55%|█████▍    | 24/44 [02:09<01:45,  5.26s/it]

Line: 26, Found grammar issues in the answer
Original Answer: The frequency with which the property Satisfaction is evaluated compared to the property Context Coverage in papers that investigate the research object Architecture Evolution is that Satisfaction is evaluated three times, whereas Context Coverage is evaluated only once.
Corrected Answer: The frequency with which the property Satisfaction is evaluated compared to the property Context Coverage in papers that investigate the research object Architecture Evolution is as follows: Satisfaction is evaluated three times, whereas Context Coverage is evaluated only once.
Difference Answer Corrected: ['follows:', 'as']
Difference Answer Original: []



 59%|█████▉    | 26/44 [02:19<01:35,  5.29s/it]

Line: 28, Found grammar issues in the answer
Original Answer: The evaluation method that has been applied most frequently in papers investigating Technical Debt is case study.
Corrected Answer: The evaluation method that has been applied most frequently in papers investigating Technical Debt is the case study.
Difference Answer Corrected: ['the']
Difference Answer Original: []

Line: 28, Found inconsistency between question and answer
Original Answer: The evaluation method that has been applied most frequently in papers investigating Technical Debt is case study.
Corrected Answer: The evaluation method that has been applied the most in papers that investigate the research object Technical Debt is the case study.
Difference Answer Corrected: ['investigate', 'research', 'the', 'object']
Difference Answer Original: ['frequently', 'investigating']



 61%|██████▏   | 27/44 [02:24<01:27,  5.16s/it]

Line: 29, Found grammar issues in the question
Original Question: How frequently is the research object Technical Debt investigated per publication year?
Corrected Question: How frequently is the research object "Technical Debt" investigated per publication year?
Difference Question Corrected: ['"Technical', 'Debt"']
Difference Question Original: ['Technical', 'Debt']

Line: 29, Found grammar issues in the answer
Original Answer: The research object Technical Debt was investigated in 2017, 2020, and 2021, once each year.
Corrected Answer: The research object Technical Debt was investigated in 2017, 2020, and 2021, once in each year.
Difference Answer Corrected: []
Difference Answer Original: []

Line: 29, Found inconsistency between question and answer
Original Answer: The research object Technical Debt was investigated in 2017, 2020, and 2021, once each year.
Corrected Answer: The frequency with which the research object "Technical Debt" is investigated per publication year is as foll

 64%|██████▎   | 28/44 [02:30<01:24,  5.26s/it]

Line: 30, Found grammar issues in the answer
Original Answer: The property investigated on the research object Architecture Design Method in the paper 'Enabling Continuous Software Engineering for Embedded Systems Architectures with Virtual Prototypes' is Functional Suitability.
Corrected Answer: The property investigated in the research object Architecture Design Method in the paper 'Enabling Continuous Software Engineering for Embedded Systems Architectures with Virtual Prototypes' is Functional Suitability.
Difference Answer Corrected: []
Difference Answer Original: ['on']



 68%|██████▊   | 30/44 [02:41<01:15,  5.38s/it]

Line: 32, Found grammar issues in the answer
Original Answer: The number of different evaluation methods that the author David Monschein applied in papers that evaluate the property Efficiency is two.
Corrected Answer: The number of different evaluation methods that the author David Monschein applied in papers that evaluate the property Efficiency is two.
Difference Answer Corrected: []
Difference Answer Original: []



 73%|███████▎  | 32/44 [02:50<00:59,  4.93s/it]

Line: 34, Found grammar issues in the answer
Original Answer: The research object Architecture Extraction is investigated in two papers published in the year 2018, in comparison to one paper in the year 2020.
Corrected Answer: The research object Architecture Extraction is investigated in two papers published in 2018, compared to one paper in 2020.
Difference Answer Corrected: ['compared']
Difference Answer Original: ['year', 'the', 'comparison']

Line: 34, Found inconsistency between question and answer
Original Answer: The research object Architecture Extraction is investigated in two papers published in the year 2018, in comparison to one paper in the year 2020.
Corrected Answer: The research object Architecture Extraction is investigated more often in papers published in the publication year 2018 (two papers) in comparison to the publication year 2020 (one paper).
Difference Answer Corrected: ['more', 'papers)', '2018', '2020', 'paper).', 'often', '(one', 'publication', '(two']
Dif

 82%|████████▏ | 36/44 [03:14<00:41,  5.24s/it]

Line: 38, Found grammar issues in the question
Original Question: What is the title of the paper that investigates the research object Architecture Decision Making and has the paper class philosophical paper?
Corrected Question: What is the title of the paper that investigates the research object Architecture Decision Making and has the paper class of philosophical paper?
Difference Question Corrected: []
Difference Question Original: []



 86%|████████▋ | 38/44 [03:24<00:31,  5.27s/it]

Line: 40, Found grammar issues in the question
Original Question: How many papers applied the evaluation method Controlled Experiment in the publication year 2018?
Corrected Question: How many papers applied the evaluation method "Controlled Experiment" in the publication year 2018?
Difference Question Corrected: ['"Controlled', 'Experiment"']
Difference Question Original: ['Controlled', 'Experiment']

Line: 40, Found grammar issues in the answer
Original Answer: The number of papers that applied the evaluation method Controlled Experiment in the publication year 2018 is two.
Corrected Answer: The number of papers that applied the evaluation method Controlled Experiment in the publication year 2018 is two.
Difference Answer Corrected: []
Difference Answer Original: []

Line: 40, found verb issues in the answer
Original Answer: The number of papers that applied the evaluation method Controlled Experiment in the publication year 2018 is two.
Corrected Answer: The number of papers that ap

 91%|█████████ | 40/44 [03:37<00:22,  5.53s/it]

Line: 42, Found grammar issues in the answer
Original Answer: In the year 2020, there are two papers that investigate the research object Architecture Evolution. In comparison, in the year 2021, there are two papers that investigate the research object.
Corrected Answer: In the year 2020, there are two papers that investigate the research object of Architecture Evolution. In comparison, in the year 2021, there are two papers that investigate the research object.
Difference Answer Corrected: ['of']
Difference Answer Original: []

Line: 42, Found inconsistency between question and answer
Original Answer: In the year 2020, there are two papers that investigate the research object Architecture Evolution. In comparison, in the year 2021, there are two papers that investigate the research object.
Corrected Answer: The number of papers investigating the research object Architecture Evolution in the publication year 2020 is two, whereas the number in the publication year 2021 is also two.
Diff

 95%|█████████▌| 42/44 [03:46<00:10,  5.06s/it]

Line: 44, Found grammar issues in the answer
Original Answer: The paper class that has the most papers applying the evaluation method Benchmark in the publication year 2020 is the proposal of a solution.
Corrected Answer: The paper class that has the most papers applying the evaluation method Benchmark in the publication year 2020 is the proposal of a solution class.
Difference Answer Corrected: ['solution', 'class.']
Difference Answer Original: ['solution.']

Line: 44, found verb issues in the question
Original Question: Among papers that apply the evaluation method Benchmark, what is the paper class that is most frequently applied to papers published in the publication year 2020?
Corrected Question: Among papers that apply the evaluation method Benchmark, what is the paper class that is most frequently evaluated in papers published in the publication year 2020?
Difference Question Corrected: ['evaluated']
Difference Question Original: ['to', 'applied']



100%|██████████| 44/44 [03:56<00:00,  5.38s/it]
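
Some findings in the output above (for example, the grammar findings for lines 32 and 40) show a corrected answer that is identical to the original. Only `validate_typed_question` treats an unchanged correction as a pass. The following is a minimal sketch of a shared parsing helper that would apply this rule to every validator. The name `parse_validator_response` and its parameters are hypothetical, and the helper assumes the LLM reply is already available as a plain string.

```python
def parse_validator_response(
    response: str, original: str, ok_sentence: str, correction_marker: str
) -> tuple[bool, str]:
    """Interpret a validator reply as (is_valid, text to keep)."""
    if ok_sentence in response:
        return True, original
    # Everything after the last marker is taken as the suggested correction.
    corrected = response.split(correction_marker)[-1].strip()
    # An unchanged "correction" is not a real finding, so count it as a pass.
    if corrected == original.strip():
        return True, original
    return False, corrected
```

With such a helper, `validate_grammar` could end in `return parse_validator_response(response, text, "The text is grammatically correct.", "Here is the corrected text:")`, and the other validators could do the same with their own sentences.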


In [5]:
for index, row in tqdm(reduced_deep_dataset_df.iterrows(), total=len(reduced_deep_dataset_df)):
    question = row['question']
    question_template = row['updated template']
    template_consistent, corrected_template = validate_question_to_template(question, question_template)
    if not template_consistent:
        print(f"Line: {index + 2}, found template issues in the question")
        print(f"Original Question Template: {question_template}")
        print(f"Corrected Question Template: {corrected_template}")
        print(f"Difference Question Template Corrected: {get_list_of_difference(question_template, corrected_template)}")
        print(f"Difference Question Template Original: {get_list_of_difference(corrected_template, question_template)}")
        print()

 18%|█▊        | 8/44 [00:06<00:28,  1.24it/s]

Line: 9, found template issues in the question
Original Question Template: What are the evaluation methods used in the paper with the title '[paper title]'?
Corrected Question Template: What are the evaluation methods applied in the paper with the title '[paper title]'?
Difference Question Template Corrected: ['applied']
Difference Question Template Original: ['used']



 25%|██▌       | 11/44 [00:08<00:28,  1.18it/s]

Line: 12, found template issues in the question
Original Question Template: What are the evaluation methods used by the paper with the title '[paper title]' compared to the methods applied in the paper with the title '[paper title]'?
Corrected Question Template: What are the evaluation methods used by the paper with the title '[paper title]' compared to the evaluation methods applied in the paper with the title '[paper title]'?
Difference Question Template Corrected: []
Difference Question Template Original: []



 30%|██▉       | 13/44 [00:09<00:22,  1.35it/s]

Line: 14, found template issues in the question
Original Question Template: Which paper investigates the property [property name]?
Corrected Question Template: Which paper investigates the property [property]?
Difference Question Template Corrected: ['[property]?']
Difference Question Template Original: ['[property', 'name]?']



 41%|████      | 18/44 [00:14<00:23,  1.12it/s]

Line: 19, found template issues in the question
Original Question Template: Which papers investigate the research object with the name [research object name] and indicate that no input data was used in their work?
Corrected Question Template: Which papers investigate the research object [research object name] and indicate that no input data was used?
Difference Question Template Corrected: ['used?']
Difference Question Template Original: ['their', 'used', 'name', 'work?', 'with', 'in']



 48%|████▊     | 21/44 [00:16<00:20,  1.14it/s]

Line: 22, found template issues in the question
Original Question Template: What are the research objects that are investigated in papers of the paper class [paper class]?
Corrected Question Template: What are the research objects investigated in papers of the paper class [paper class]?
Difference Question Template Corrected: []
Difference Question Template Original: ['that']



 52%|█████▏    | 23/44 [00:17<00:16,  1.29it/s]

Line: 24, found template issues in the question
Original Question Template: How many evaluation methods are used to evaluate the property [property name]?
Corrected Question Template: How many different evaluation methods are used to evaluate the property [property name]?
Difference Question Template Corrected: ['different']
Difference Question Template Original: []



 55%|█████▍    | 24/44 [00:18<00:16,  1.22it/s]

Line: 25, found template issues in the question
Original Question Template: Among those papers that evaluate the property [property name], what the evaluation methods that have been applied? Rank the evaluation methods in descending alphabetical order.
Corrected Question Template: Among those papers that evaluate the property [property name], what are the evaluation methods that have been applied? Rank the evaluation methods in descending alphabetical order.
Difference Question Template Corrected: ['are']
Difference Question Template Original: []



 57%|█████▋    | 25/44 [00:19<00:17,  1.10it/s]

Line: 26, found template issues in the question
Original Question Template: How frequently is the property [property name] evaluated compared to the property [property name] in papers that investigate the research object [research object name]?
Corrected Question Template: How frequently is the property [property 1] evaluated compared to the property [property 2] in papers that investigate the research object [research object name]?
Difference Question Template Corrected: ['1]', '2]']
Difference Question Template Original: ['name]']



 59%|█████▉    | 26/44 [00:21<00:18,  1.00s/it]

Line: 27, found template issues in the question
Original Question Template: What are the evaluation properties that are evaluated in papers that investigate the research object [research object name] without adhering to a evaluation guideline?
Corrected Question Template: What are the properties that are evaluated in papers that investigate the research object [research object name] without adhering to an evaluation guideline?
Difference Question Template Corrected: ['an']
Difference Question Template Original: ['a']



 64%|██████▎   | 28/44 [00:22<00:13,  1.21it/s]

Line: 29, found template issues in the question
Original Question Template: How frequently is the research object [research object name] investigated per publication year??
Corrected Question Template: How frequently is the research object [research object name] investigated per publication year?
Difference Question Template Corrected: ['year?']
Difference Question Template Original: ['year??']



 70%|███████   | 31/44 [00:24<00:10,  1.28it/s]

Line: 32, found template issues in the question
Original Question Template: How many different evaluation methods did the author [author name] use in papers that evaluate the property [property name]?
Corrected Question Template: How many different evaluation methods did the author [author name] apply in papers that evaluate the property [property name]?
Difference Question Template Corrected: ['apply']
Difference Question Template Original: ['use']



 95%|█████████▌| 42/44 [00:32<00:01,  1.41it/s]

Line: 43, found template issues in the question
Original Question Template: Which papers published in the publication year [year] investigate the research object [research object name] and have not used input data in their work?
Corrected Question Template: Which papers published in the publication year [year] investigate the research object [research object name] and have not used input data?
Difference Question Template Corrected: ['data?']
Difference Question Template Original: ['their', 'work?', 'data']



100%|██████████| 44/44 [00:33<00:00,  1.31it/s]
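
A note on reading the `Difference ...` lists above: they appear to be computed as a set difference over whitespace-separated words. This is an assumption inferred from the output, since `get_list_of_difference` is defined earlier in the notebook. Under that assumption, a word that is inserted a second time is not reported, which would explain why line 12 shows two empty lists even though `evaluation` was added. The sketch below contrasts the set-based comparison with a multiset variant. Both function names are hypothetical and not part of the notebook.

```python
from collections import Counter


def word_set_difference(source: str, target: str) -> list[str]:
    """Words that occur in `target` but never in `source` (set semantics)."""
    return sorted(set(target.split()) - set(source.split()))


def word_multiset_difference(source: str, target: str) -> list[str]:
    """Words that occur more often in `target` than in `source`."""
    # Counter subtraction keeps only positive counts.
    surplus = Counter(target.split()) - Counter(source.split())
    return list(surplus.elements())


original = "the evaluation methods used compared to the methods applied"
corrected = "the evaluation methods used compared to the evaluation methods applied"

print(word_set_difference(original, corrected))       # []
print(word_multiset_difference(original, corrected))  # ['evaluation']
```

The multiset variant reports repeated insertions at the cost of slightly noisier lists when a common word such as `the` is added or removed.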


### Validate Full Dataset
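
The cell below repeats the same five `print` statements for every check. A minimal sketch of how that report block could be factored into one helper follows. It assumes the CSV has a single header line and no multi-line fields, so that DataFrame index 0 corresponds to line 2 of the file. `format_finding`, `word_difference`, and `CSV_LINE_OFFSET` are hypothetical names, and `word_difference` is only a stand-in for the notebook's `get_list_of_difference`.

```python
from typing import Optional

# Assumption: one header line, then one CSV line per DataFrame row.
CSV_LINE_OFFSET = 2


def word_difference(source: str, target: str) -> list[str]:
    """Words of `target` that do not occur in `source` (sorted for stable output)."""
    return sorted(set(target.split()) - set(source.split()))


def format_finding(index: int, issue: str, label: str,
                   original: str, corrected: str) -> Optional[str]:
    """Return one report block, or None if the correction changes nothing."""
    if original.strip() == corrected.strip():
        return None
    return "\n".join([
        f"Line: {index + CSV_LINE_OFFSET}, found {issue}",
        f"Original {label}: {original}",
        f"Corrected {label}: {corrected}",
        f"Difference {label} Corrected: {word_difference(original, corrected)}",
        f"Difference {label} Original: {word_difference(corrected, original)}",
    ])


report = format_finding(
    17, "grammar issues in the question", "Question",
    "In which year has X been published?",
    "In which year was X published?",
)
if report is not None:
    print(report)
```

The guard on identical texts would suppress reports in which the model flags an issue but returns the text unchanged, as appears to happen in the grammar report for line 49 further down, where both difference lists are empty.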

In [4]:
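# `index + 2` in the messages below maps the zero-based DataFrame index to the
# CSV line number (line 1 of the file is the header).
for index, row in tqdm(deep_dataset_df.iterrows(), total=len(deep_dataset_df)):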
    question = row['question']
    golden_answer = row['golden_answer']
    question_template = row['updated template']
    is_typed = row['semi-typed']
    
    is_grammatically_correct, corrected_question = validate_grammar(question)
    if not is_grammatically_correct:
        print(f"Line: {index + 2}, Found grammar issues in the question")
        print(f"Original Question: {question}")
        print(f"Corrected Question: {corrected_question}")
        print(f"Difference Question Corrected: {get_list_of_difference(question, corrected_question)}")
        print(f"Difference Question Original: {get_list_of_difference(corrected_question, question)}")
        print()
        
    is_grammatically_correct, corrected_answer = validate_grammar(golden_answer)
    if not is_grammatically_correct:
        print(f"Line: {index + 2}, Found grammar issues in the answer")
        print(f"Original Answer: {golden_answer}")
        print(f"Corrected Answer: {corrected_answer}")
        print(f"Difference Answer Corrected: {get_list_of_difference(golden_answer, corrected_answer)}")
        print(f"Difference Answer Original: {get_list_of_difference(corrected_answer, golden_answer)}")
        print()
        
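    # The consistency check receives the texts returned by validate_grammar; its
    # result overwrites corrected_answer, which the verb check below reuses.
    is_consistent, corrected_answer = validate_question_answer_consistency(corrected_question, corrected_answer)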
    if not is_consistent:
        print(f"Line: {index + 2}, Found inconsistency between question and answer")
        print(f"Original Answer: {golden_answer}")
        print(f"Corrected Answer: {corrected_answer}")
        print(f"Difference Answer Corrected: {get_list_of_difference(golden_answer, corrected_answer)}")
        print(f"Difference Answer Original: {get_list_of_difference(corrected_answer, golden_answer)}")
        print()
    
    if is_typed:
        is_valid, corrected_question_template = validate_typed_question(question_template)
        if not is_valid:
            print(f"Line: {index + 2}, found typing issues in the question template")
            print(f"Original Question Template: {question_template}")
            print(f"Corrected Question Template: {corrected_question_template}")
            print(f"Difference Question Template Corrected: {get_list_of_difference(question_template, corrected_question_template)}")
            print(f"Difference Question Template Original: {get_list_of_difference(corrected_question_template, question_template)}")
            print()
        
        is_noun_verb_correct, corrected_question = validate_correct_verbs(corrected_question)
        if not is_noun_verb_correct:
            print(f"Line: {index + 2}, found verb issues in the question")
            print(f"Original Question: {question}")
            print(f"Corrected Question: {corrected_question}")
            print(f"Difference Question Corrected: {get_list_of_difference(question, corrected_question)}")
            print(f"Difference Question Original: {get_list_of_difference(corrected_question, question)}")
            print()

        is_noun_verb_correct, corrected_answer = validate_correct_verbs(corrected_answer)
        if not is_noun_verb_correct:
            print(f"Line: {index + 2}, found verb issues in the answer")
            print(f"Original Answer: {golden_answer}")
            print(f"Corrected Answer: {corrected_answer}")
            print(f"Difference Answer Corrected: {get_list_of_difference(golden_answer, corrected_answer)}")
            print(f"Difference Answer Original: {get_list_of_difference(corrected_answer, golden_answer)}")
            print()

  4%|▎         | 6/170 [00:16<08:15,  3.02s/it]

Line: 8, Found grammar issues in the question
Original Question: Which papers have the research level secondary research?
Corrected Question: Which papers have the research level of secondary research?
Difference Question Corrected: ['of']
Difference Question Original: []



  6%|▋         | 11/170 [00:31<06:40,  2.52s/it]

Line: 13, Found grammar issues in the question
Original Question: How many papers have the paper class personal experience paper?
Corrected Question: How many papers have the paper class "personal experience paper"?
Difference Question Corrected: ['paper"?', '"personal']
Difference Question Original: ['paper?', 'personal']

Line: 13, Found grammar issues in the answer
Original Answer: There are two publications that have the paper class personal experience paper.
Corrected Answer: There are two publications that have the paper class of personal experience paper.
Difference Answer Corrected: ['of']
Difference Answer Original: []

Line: 13, Found inconsistency between question and answer
Original Answer: There are two publications that have the paper class personal experience paper.
Corrected Answer: The number of papers that have the paper class "personal experience paper" is two.
Difference Answer Corrected: ['"personal', 'two.', 'is', 'of', 'paper"', 'The', 'papers', 'number']
Differe

  7%|▋         | 12/170 [00:33<06:36,  2.51s/it]

Line: 14, Found grammar issues in the answer
Original Answer: The publications with the paper class personal experience, ranked in descending order by publication year, are: 1. Data-Centric Communication and Containerization for Future Automotive Software Architectures (2018); 2. Towards a Reference Architecture for Cloud-Based Plant Genotyping and Phenotyping Analysis Frameworks (2017).
Corrected Answer: The publications with the paper class of personal experience, ranked in descending order by publication year, are: 1. Data-Centric Communication and Containerization for Future Automotive Software Architectures (2018); 2. Towards a Reference Architecture for Cloud-Based Plant Genotyping and Phenotyping Analysis Frameworks (2017).
Difference Answer Corrected: ['of']
Difference Answer Original: []



 10%|█         | 17/170 [00:51<07:38,  2.99s/it]

Line: 19, Found grammar issues in the question
Original Question: In which year has 'Determination and Enforcement of Least-Privilege Architecture in Android' been published in comparison to 'Automated Microservice Identification in Legacy Systems with Functional and Non-Functional Metrics'?
Corrected Question: In which year was 'Determination and Enforcement of Least-Privilege Architecture in Android' published in comparison to 'Automated Microservice Identification in Legacy Systems with Functional and Non-Functional Metrics'?
Difference Question Corrected: ['was']
Difference Question Original: ['has', 'been']



 11%|█         | 18/170 [00:54<08:01,  3.17s/it]

Line: 19, Found inconsistency between question and answer
Original Answer: The publication 'Determination and Enforcement of Least-Privilege Architecture in Android' was published in 2017, while 'Automated Microservice Identification in Legacy Systems with Functional and Non-Functional Metrics' was published in 2020.
Corrected Answer: 'Determination and Enforcement of Least-Privilege Architecture in Android' was published earlier than 'Automated Microservice Identification in Legacy Systems with Functional and Non-Functional Metrics'.
Difference Answer Corrected: ['than', 'earlier', "Metrics'."]
Difference Answer Original: ['publication', 'while', "Metrics'", 'The', '2020.', '2017,']

Line: 20, Found inconsistency between question and answer
Original Answer: The paper class of 'Continuous Integration Impediments in Large-Scale Industry Projects' is classified as evaluation research, while the paper class of 'An Architecture Framework for Modelling and Simulation of Situational-Aware Cy

 11%|█         | 19/170 [00:59<09:07,  3.62s/it]

Line: 21, Found grammar issues in the question
Original Question: Which paper class does 'Supporting Architectural Decision Making on Data Management in Microservice Architectures' have compared to 'An Architecture for Decentralized, Collaborative, and Autonomous Robots'?
Corrected Question: Which paper class does 'Supporting Architectural Decision Making on Data Management in Microservice Architectures' belong to compared to 'An Architecture for Decentralized, Collaborative, and Autonomous Robots'?
Difference Question Corrected: ['belong']
Difference Question Original: ['have']

Line: 21, Found inconsistency between question and answer
Original Answer: The paper 'Supporting Architectural Decision Making on Data Management in Microservice Architectures' has a paper class of evaluation research, while the paper 'An Architecture for Decentralized, Collaborative, and Autonomous Robots' has a paper class of evaluation research as well. Both papers share the same classification in this rega

 13%|█▎        | 22/170 [01:06<06:48,  2.76s/it]

Line: 24, Found inconsistency between question and answer
Original Answer: In the year 2021, Mohamed Soliman published two papers classified as evaluation research. In 2018, he published one paper classified as evaluation research. Therefore, the number of papers classified as evaluation research published by Mohamed Soliman per year is two in 2021 and one in 2018.
Corrected Answer: The number of paper classes among the papers published by the author Mohamed Soliman is one (evaluation research). Their distribution per publication year is: two evaluation research papers in 2021 and one evaluation research paper in 2018.
Difference Answer Corrected: ['Their', 'publication', 'classes', '(evaluation', 'is:', 'research).', 'The', 'author', 'among', 'distribution']
Difference Answer Original: ['research.', 'classified', 'In', '2021,', '2018,', 'Therefore,', 'as', 'he']



 14%|█▎        | 23/170 [01:10<07:39,  3.13s/it]

Line: 25, Found grammar issues in the answer
Original Answer: In the year 2017, Manoj Bhat published one paper class categorized as a proposal of solution. In 2018, he published one paper class categorized as both a proposal of solution and validation research. Therefore, the total number of paper classes published by Manoj Bhat is two, with one in 2017 and one in 2018. The distribution of paper classes published per year is one paper class per year for both 2017 and 2018.
Corrected Answer: In the year 2017, Manoj Bhat published one paper class categorized as a proposal of solution. In 2018, he published one paper class categorized as both a proposal of solution and validation research. Therefore, the total number of paper classes published by Manoj Bhat is two, with one in 2017 and one in 2018. The distribution of paper classes published per year is one paper class per year in both 2017 and 2018.
Difference Answer Corrected: []
Difference Answer Original: ['for']



 14%|█▍        | 24/170 [01:13<07:46,  3.20s/it]

Line: 25, Found inconsistency between question and answer
Original Answer: In the year 2017, Manoj Bhat published one paper class categorized as a proposal of solution. In 2018, he published one paper class categorized as both a proposal of solution and validation research. Therefore, the total number of paper classes published by Manoj Bhat is two, with one in 2017 and one in 2018. The distribution of paper classes published per year is one paper class per year for both 2017 and 2018.
Corrected Answer: The paper classes of the papers published by Manoj Bhat and their distribution per year are as follows: In 2017, the paper class is proposal of solution. In 2018, the paper classes are proposal of solution and validation research.
Difference Answer Corrected: ['their', 'are', 'follows:', 'papers']
Difference Answer Original: ['total', 'both', 'a', '2018.', 'for', 'with', '2017', 'two,', 'one', 'in', 'categorized', 'number', 'Therefore,', 'he']

Line: 26, found verb issues in the questio

 15%|█▍        | 25/170 [01:18<08:52,  3.67s/it]

Line: 26, found verb issues in the answer
Original Answer: Yes, the paper with the title 'Data-Driven Software Architecture for Analyzing Confidentiality' does make its tool support available.
Corrected Answer: Yes, the paper with the title 'Data-Driven Software Architecture for Analyzing Confidentiality' does use its tool support.
Difference Answer Corrected: ['use', 'support.']
Difference Answer Original: ['support', 'make', 'available.']



 15%|█▌        | 26/170 [01:20<07:36,  3.17s/it]

Line: 27, Found inconsistency between question and answer
Original Answer: Yes, the paper indicates that tool support is available for the study of architectural decay in open-source software.
Corrected Answer: Tool support is available in 'An Empirical Study of Architectural Decay in Open-Source Software.'
Difference Answer Corrected: ['Open-Source', 'Empirical', "Software.'", "'An", 'Decay', 'Study', 'Tool', 'Architectural']
Difference Answer Original: ['tool', 'for', 'architectural', 'indicates', 'the', 'that', 'study', 'software.', 'paper', 'decay', 'open-source', 'Yes,']

Line: 28, Found grammar issues in the answer
Original Answer: The evaluation method used in the paper is Case Study.
Corrected Answer: The evaluation method used in the paper is a case study.
Difference Answer Corrected: ['case', 'a', 'study.']
Difference Answer Original: ['Study.', 'Case']

Line: 28, Found inconsistency between question and answer
Original Answer: The evaluation method used in the paper is Case 

 16%|█▌        | 27/170 [01:30<12:05,  5.07s/it]

Line: 29, Found inconsistency between question and answer
Original Answer: The evaluation method used in the paper is a Controlled Experiment.
Corrected Answer: The evaluation method that has been used for the evaluation in 'REST vs GraphQL: A Controlled Experiment' is a controlled experiment.
Difference Answer Corrected: ['A', 'been', 'for', "Experiment'", 'controlled', 'that', 'has', "'REST", 'GraphQL:', 'vs', 'experiment.']
Difference Answer Original: ['paper', 'Experiment.']



 17%|█▋        | 29/170 [01:33<08:17,  3.53s/it]

Line: 30, found verb issues in the answer
Original Answer: The evaluation methods of the paper 'Continuous Analysis of Collaborative Design' are Questionnaire, Controlled Experiment, and Technical Experiment.
Corrected Answer: The evaluation methods of the paper 'Continuous Analysis of Collaborative Design' are applied as Questionnaire, Controlled Experiment, and Technical Experiment.
Difference Answer Corrected: ['as', 'applied']
Difference Answer Original: []



 18%|█▊        | 30/170 [01:35<07:15,  3.11s/it]

Line: 32, found verb issues in the question
Original Question: What are the threats to validity of the paper with the title 'Predicting the Performance of Privacy-Preserving Data Analytics Using Architecture Modelling and Simulation'?
Corrected Question: What are the threats to validity discussed in the paper with the title 'Predicting the Performance of Privacy-Preserving Data Analytics Using Architecture Modelling and Simulation'?
Difference Question Corrected: ['in', 'discussed']
Difference Question Original: []



 18%|█▊        | 31/170 [01:42<09:39,  4.17s/it]

Line: 32, found verb issues in the answer
Original Answer: The threats to validity of the paper 'Predicting the Performance of Privacy-Preserving Data Analytics Using Architecture Modelling and Simulation' are external validity and internal validity.
Corrected Answer: The threats to validity of the paper 'Predicting the Performance of Privacy-Preserving Data Analytics Using Architecture Modelling and Simulation' are discussed as external validity and internal validity.
Difference Answer Corrected: ['as', 'discussed']
Difference Answer Original: []



 19%|█▉        | 32/170 [01:44<08:14,  3.58s/it]

Line: 33, Found inconsistency between question and answer
Original Answer: The threats to validity identified in the paper are external validity and construct validity.
Corrected Answer: The validity threats discussed in 'A Blockchain-Based Micro Economy Platform for Distributed Infrastructure Initiatives' are external validity and construct validity.
Difference Answer Corrected: ['Platform', 'Distributed', 'for', 'Blockchain-Based', 'discussed', "'A", 'Infrastructure', "Initiatives'", 'Economy', 'Micro']
Difference Answer Original: ['paper', 'the', 'identified', 'to']

Line: 34, Found grammar issues in the answer
Original Answer: The number of threats to validity discussed in the paper with the title 'DesignDiff: Continuously Modeling Software Design Difference from Code Revisions' is two.
Corrected Answer: The number of threats to validity discussed in the paper titled 'DesignDiff: Continuously Modeling Software Design Difference from Code Revisions' is two.
Difference Answer Correct

 20%|██        | 34/170 [01:51<07:28,  3.30s/it]

Line: 35, Found inconsistency between question and answer
Original Answer: The paper identifies two threats to validity: external validity and internal validity.
Corrected Answer: The number of validity threats discussed in 'An Architecture for Decentralized, Collaborative, and Autonomous Robots' is two.
Difference Answer Corrected: ['Decentralized,', 'Architecture', "'An", 'for', 'of', 'is', 'two.', 'discussed', "Robots'", 'in', 'Autonomous', 'number', 'Collaborative,']
Difference Answer Original: ['two', 'validity:', 'internal', 'paper', 'identifies', 'external', 'validity.', 'to']

Line: 36, Found grammar issues in the answer
Original Answer: The paper has one evaluation method, which is Focus Group.
Corrected Answer: The paper has one evaluation method, which is focus group.
Difference Answer Corrected: ['focus', 'group.']
Difference Answer Original: ['Focus', 'Group.']

Line: 36, Found inconsistency between question and answer
Original Answer: The paper has one evaluation method, 

 21%|██        | 35/170 [01:58<10:14,  4.55s/it]

Line: 36, found verb issues in the answer
Original Answer: The paper has one evaluation method, which is Focus Group.
Corrected Answer: The number of evaluation methods that the paper with the title 'Continuous Architecture: Towards the Goldilocks Zone and Away from Vicious Circles' applies is one.
Difference Answer Corrected: ['one.', 'from', 'that', 'Goldilocks', 'Vicious', 'Architecture:', "'Continuous", 'with', 'of', 'methods', 'number', 'applies', 'and', 'the', 'Zone', 'Away', "Circles'", 'Towards', 'title']
Difference Answer Original: ['which', 'method,', 'Focus', 'has', 'one', 'Group.']

Line: 37, Found grammar issues in the answer
Original Answer: The paper contains one evaluation method, which is Interview.
Corrected Answer: The paper contains one evaluation method, which is the interview.
Difference Answer Corrected: ['interview.', 'the']
Difference Answer Original: ['Interview.']



 22%|██▏       | 38/170 [02:03<05:59,  2.73s/it]

Line: 39, Found inconsistency between question and answer
Original Answer: The paper has the following properties ranked in descending order: 1. Usability and 2. Effectiveness.
Corrected Answer: The properties that have been evaluated in 'Constructing a Shared Infrastructure for Software Architecture Analysis and Maintenance' are usability and effectiveness, ranked in descending alphabetical order as: usability, effectiveness.
Difference Answer Corrected: ['Architecture', 'a', 'for', 'that', "'Constructing", 'as:', 'have', 'Software', 'are', 'order', 'usability', 'effectiveness.', 'Infrastructure', 'alphabetical', "Maintenance'", 'been', 'Shared', 'evaluated', 'usability,', 'effectiveness,', 'Analysis']
Difference Answer Original: ['following', 'has', 'the', 'Effectiveness.', '1.', 'paper', 'Usability', '2.', 'order:']

Line: 40, Found inconsistency between question and answer
Original Answer: The publication has the following threats to validity: internal validity, external validity, 

 23%|██▎       | 39/170 [02:09<07:57,  3.65s/it]

Line: 40, found verb issues in the answer
Original Answer: The publication has the following threats to validity: internal validity, external validity, and construct validity.
Corrected Answer: The threats to validity that the paper with the title 'Continuous Integration Impediments in Large-Scale Industry Projects' discusses, ranked in descending alphabetical order, are internal validity, external validity, and construct validity.
Difference Answer Corrected: ["Projects'", 'Industry', 'Integration', 'Impediments', 'with', 'Large-Scale', 'ranked', 'that', 'are', 'discusses,', 'descending', 'paper', 'title', 'in', "'Continuous", 'alphabetical', 'order,', 'validity']
Difference Answer Original: ['publication', 'has', 'following', 'validity:']



 24%|██▎       | 40/170 [02:12<07:21,  3.39s/it]

Line: 41, Found inconsistency between question and answer
Original Answer: The publication has the following threats to validity sorted in descending alphabetical order: internal validity, external validity, and construct validity.
Corrected Answer: The validity threats discussed in 'Architectural Security Weaknesses in Industrial Control Systems (ICS) an Empirical Study Based on Disclosed Software Vulnerabilities', ranked in reverse alphabetical order, are internal validity, external validity, and construct validity.
Difference Answer Corrected: ['ranked', 'Control', 'Weaknesses', 'Industrial', "Vulnerabilities',", 'Software', 'are', 'order,', 'Empirical', 'Based', 'Systems', 'Disclosed', 'discussed', 'on', 'Security', "'Architectural", '(ICS)', 'an', 'Study', 'reverse']
Difference Answer Original: ['publication', 'following', 'the', 'has', 'descending', 'sorted', 'to', 'order:']



 25%|██▍       | 42/170 [02:19<07:04,  3.31s/it]

Line: 43, Found inconsistency between question and answer
Original Answer: The paper 'A Blockchain-Based Micro Economy Platform for Distributed Infrastructure Initiatives' uses the evaluation method 'Case Study', while the paper 'Integrating Statistical Response Time Models in Architectural Performance Models' adopts 'Technical Experiment' as its evaluation method.
Corrected Answer: The evaluation method used in 'A Blockchain-Based Micro Economy Platform for Distributed Infrastructure Initiatives' is 'Case Study', while the evaluation method used in 'Integrating Statistical Response Time Models in Architectural Performance Models' is 'Technical Experiment'.
Difference Answer Corrected: ["Experiment'.", 'is', 'used']
Difference Answer Original: ['method.', 'its', 'adopts', "Experiment'", 'paper', 'uses', 'as']

Line: 44, Found inconsistency between question and answer
Original Answer: The paper 'How Software Architects Focus Their Attention' is classified as a validation research paper,

 25%|██▌       | 43/170 [02:24<08:05,  3.82s/it]

Line: 45, Found grammar issues in the question
Original Question: What is the paper class of 'From Monolithic Architecture Style to Microservice one Based on a Semi-Automatic Approach' compared to 'Enabling Consistency between Software Artefacts for Software Adaption and Evolution'?
Corrected Question: What is the paper class of 'From Monolithic Architecture Style to Microservice One Based on a Semi-Automatic Approach' compared to 'Enabling Consistency between Software Artefacts for Software Adaptation and Evolution'?
Difference Question Corrected: ['Adaptation', 'One']
Difference Question Original: ['Adaption', 'one']

Line: 45, Found grammar issues in the answer
Original Answer: The paper class of 'From Monolithic Architecture Style to Microservice One Based on a Semi-Automatic Approach' is evaluation research, which is the same as that of 'Enabling Consistency between Software Artefacts for Software Adaptation and Evolution'.
Corrected Answer: The paper class of 'From Monolithic Arc

 26%|██▌       | 44/170 [02:26<07:16,  3.47s/it]

Line: 46, Found inconsistency between question and answer
Original Answer: António Silva investigated the following number of research objects in his papers per publication year: 2019: one research object. 2020: two research objects.
Corrected Answer: The number of research objects that the author António Silva investigated in his papers per publication year is as follows: 2019: one research object; 2020: two research objects.
Difference Answer Corrected: ['year', 'that', 'is', 'object;', 'The', 'author', 'follows:', 'as']
Difference Answer Original: ['year:', 'object.', 'following']



 26%|██▋       | 45/170 [02:30<07:01,  3.37s/it]

Line: 47, Found grammar issues in the answer
Original Answer: The research objects that have been investigated in papers published by Klym Shumaiev are distributed as one in 2017 and one in 2018.
Corrected Answer: The research objects investigated in papers published by Klym Shumaiev are distributed as one in 2017 and one in 2018.
Difference Answer Corrected: []
Difference Answer Original: ['have', 'that', 'been']



 27%|██▋       | 46/170 [02:32<06:31,  3.16s/it]

Line: 48, Found inconsistency between question and answer
Original Answer: The evaluation methods published by Mauro Caporuscio per year are as follows: In 2020, two evaluation methods were published. In 2021, one evaluation method was published. Therefore, the number of evaluation methods published per year is two methods in 2020 and one method in 2021.
Corrected Answer: The number of evaluation methods applied in papers published by the author Mauro Caporuscio for each publication year is as follows: in 2020, two evaluation methods were applied; in 2021, one evaluation method was applied.
Difference Answer Corrected: ['applied;', 'publication', 'each', 'applied.', 'for', 'applied', 'author', 'papers']
Difference Answer Original: ['2021.', 'and', 'are', '2020', 'per', 'published.', 'Therefore,', 'In']



 28%|██▊       | 47/170 [02:37<07:42,  3.76s/it]

Line: 49, Found grammar issues in the answer
Original Answer: The number of evaluation approaches that have been applied in papers published by Danny Weyns is one in 2018 and one in 2019.
Corrected Answer: The number of evaluation approaches that have been applied in papers published by Danny Weyns is one in 2018 and one in 2019.
Difference Answer Corrected: []
Difference Answer Original: []



 28%|██▊       | 48/170 [02:40<06:57,  3.42s/it]

Line: 49, Found inconsistency between question and answer
Original Answer: The number of evaluation approaches that have been applied in papers published by Danny Weyns is one in 2018 and one in 2019.
Corrected Answer: For each year, the number of evaluation approaches that have been applied in papers published by Danny Weyns is one in 2018 and one in 2019.
Difference Answer Corrected: ['For', 'each', 'year,', 'the']
Difference Answer Original: ['The']

Line: 50, Found inconsistency between question and answer
Original Answer: The paper with the title 'Designing Robust Software Systems through Parametric Markov Chain Synthesis' is the only paper that evaluates the property robustness.
Corrected Answer: The paper that investigates the property robustness is the paper with the title 'Designing Robust Software Systems through Parametric Markov Chain Synthesis.'
Difference Answer Corrected: ['robustness', "Synthesis.'", 'investigates']
Difference Answer Original: ['evaluates', "Synthesis'"

 29%|██▉       | 50/170 [02:45<05:50,  2.92s/it]

Line: 52, Found grammar issues in the answer
Original Answer: The paper titled 'Supporting Architectural Decision Making on Data Management in Microservice Architectures' is where the limit of detection is evaluated.
Corrected Answer: The paper titled 'Supporting Architectural Decision Making on Data Management in Microservice Architectures' evaluates the limit of detection.
Difference Answer Corrected: ['detection.', 'evaluates']
Difference Answer Original: ['detection', 'where', 'evaluated.', 'is']



 30%|███       | 51/170 [02:48<05:23,  2.72s/it]

Line: 52, Found inconsistency between question and answer
Original Answer: The paper titled 'Supporting Architectural Decision Making on Data Management in Microservice Architectures' is where the limit of detection is evaluated.
Corrected Answer: The paper in which the limit of detection is evaluated is 'Supporting Architectural Decision Making on Data Management in Microservice Architectures.'
Difference Answer Corrected: ["Architectures.'", 'which', 'evaluated']
Difference Answer Original: ['where', 'evaluated.', 'titled', "Architectures'"]



 32%|███▏      | 54/170 [02:59<07:04,  3.66s/it]

Line: 55, Found inconsistency between question and answer
Original Answer: The following papers use focus groups as a method in their evaluations: 1. Technical Architectures for Automotive Systems, 2. Architectural Assumptions and Their Management in Industry - An Exploratory Study, 3. An Exploratory Study of Naturalistic Decision Making in Complex Software Architecture Environments, 4. Continuous Architecture: Towards the Goldilocks Zone and Away from Vicious Circles, 5. On Cognitive Biases in Architecture Decision Making, 6. System- and Software-level Architecting Harmonization Practices for Systems-of-Systems: An exploratory case study on a long-running large-scale scientific instrument, 7. Understanding Architecture Decisions in Context.
Corrected Answer: The papers that use focus groups as a method in their evaluations are: 1. Technical Architectures for Automotive Systems, 2. Architectural Assumptions and Their Management in Industry - An Exploratory Study, 3. An Exploratory Stud

 33%|███▎      | 56/170 [03:08<07:09,  3.77s/it]

Line: 57, Found inconsistency between question and answer
Original Answer: The publication 'An Architecture for Decentralized, Collaborative, and Autonomous Robots' evaluates the research object with the property 'Functional Suitability' and has input data available. Additionally, the publication 'Architecture-Centric Support for Integrating Security Tools in a Security Orchestration Platform' also evaluates the same property and has input data available.
Corrected Answer: The publications that assess the functional suitability of a reference architecture and make their input data available are 'An Architecture for Decentralized, Collaborative, and Autonomous Robots' and 'Architecture-Centric Support for Integrating Security Tools in a Security Orchestration Platform'.
Difference Answer Corrected: ['available', "Platform'.", 'architecture', 'their', 'are', 'that', 'of', 'functional', 'publications', 'suitability', 'reference', 'make', 'assess']
Difference Answer Original: ['publication

 34%|███▍      | 58/170 [03:16<07:00,  3.75s/it]

Line: 60, Found grammar issues in the answer
Original Answer: There are three papers that discuss repeatability as a threat to validity and apply the evaluation method interviews.
Corrected Answer: There are three papers that discuss repeatability as a threat to validity and apply the evaluation method of interviews.
Difference Answer Corrected: ['of']
Difference Answer Original: []
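
Note on the `Difference ... Corrected` / `Difference ... Original` lists: the helper that produces them is not shown in this part of the notebook. The unordered entries and the punctuation that stays attached to words (for example `'detection.'`) are consistent with a set difference over whitespace-separated tokens. The following is a minimal sketch under that assumption. The function name `word_set_difference` is hypothetical, and `sorted` is used only to make the result deterministic.

```python
def word_set_difference(original: str, corrected: str) -> tuple[list[str], list[str]]:
    """Return (words only in corrected, words only in original).

    Assumption: tokens are split on whitespace and compared as sets,
    so word order and repeated words are ignored.
    """
    original_words = set(original.split())
    corrected_words = set(corrected.split())
    only_in_corrected = sorted(corrected_words - original_words)
    only_in_original = sorted(original_words - corrected_words)
    return only_in_corrected, only_in_original


# Example taken from the "Line: 60" entry above.
original_answer = (
    "There are three papers that discuss repeatability as a threat to "
    "validity and apply the evaluation method interviews."
)
corrected_answer = (
    "There are three papers that discuss repeatability as a threat to "
    "validity and apply the evaluation method of interviews."
)
added, removed = word_set_difference(original_answer, corrected_answer)
print("Difference Answer Corrected:", added)
print("Difference Answer Original:", removed)
```

Because the comparison is set-based, a correction that only reorders words, or that removes one of several occurrences of the same word, yields two empty lists.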



 35%|███▌      | 60/170 [03:19<05:09,  2.81s/it]

Line: 61, Found inconsistency between question and answer
Original Answer: There are two papers that discuss confirmability as a threat to validity and apply the evaluation method of focus groups.
Corrected Answer: The number of publications in which confirmability is identified as a validity threat while focus groups are employed for evaluation is two.
Difference Answer Corrected: ['which', 'for', 'two.', 'is', 'while', 'identified', 'publications', 'in', 'employed', 'The', 'groups', 'number']
Difference Answer Original: ['papers', 'and', 'two', 'that', 'the', 'method', 'discuss', 'apply', 'There', 'to', 'groups.']



 37%|███▋      | 63/170 [03:28<05:09,  2.89s/it]

Line: 64, found verb issues in the answer
Original Answer: The publications with the evaluation property Reliability, ranked in descending order by their publication year, are: 1. A Comparison of MQTT Brokers for Distributed IoT Edge Computing (2020), 2. An Architecture-Driven Adaptation Approach for Big Data Cyber Security Analytics (2019), 3. A Platform Architecture for Multi-Tenant Blockchain-Based Systems (2019), 4. An Architecture for Decentralized, Collaborative, and Autonomous Robots (2018), 5. Quality Evaluation of PaaS Cloud Application Design Using Generated Prototypes (2017)
Corrected Answer: The publications with the investigated property Reliability, ranked in descending order by their publication year, are: 1. A Comparison of MQTT Brokers for Distributed IoT Edge Computing (2020), 2. An Architecture-Driven Adaptation Approach for Big Data Cyber Security Analytics (2019), 3. A Platform Architecture for Multi-Tenant Blockchain-Based Systems (2019), 4. An Architecture for De

 38%|███▊      | 64/170 [03:30<04:22,  2.48s/it]

Line: 66, Found grammar issues in the question
Original Question: Among papers that apply the evaluation method Data Science, how many have not used input data compared to those that have made their input data available?
Corrected Question: Among papers that apply the evaluation method Data Science, how many have not made their input data available compared to those that have?
Difference Question Corrected: ['available', 'have?']
Difference Question Original: ['available?', 'used']

Line: 66, Found inconsistency between question and answer
Original Answer: There is one paper with the evaluation method Data Science that does not provide input data, and there are two papers with the evaluation method Data Science that make their input data available. Therefore, the comparison shows that there is one paper with no input data and two papers with input data available.
Corrected Answer: Among papers that apply the evaluation method Data Science, the number that have not made their input data

 39%|███▉      | 66/170 [03:37<04:55,  2.84s/it]

Line: 67, Found inconsistency between question and answer
Original Answer: There are four papers with the evaluation method Grounded Theory that have input data available, compared to one paper that has no input data.
Corrected Answer: The number of papers among publications that use grounded theory for evaluation with publicly available input data is four, whereas the number of papers without publicly available input data is one.
Difference Answer Corrected: ['one.', 'use', 'available', 'publicly', 'without', 'for', 'of', 'is', 'grounded', 'four,', 'theory', 'publications', 'The', 'among', 'number', 'whereas']
Difference Answer Original: ['compared', 'Theory', 'are', 'four', 'has', 'data.', 'method', 'one', 'have', 'Grounded', 'paper', 'no', 'available,', 'There', 'to']



 40%|████      | 68/170 [03:42<04:41,  2.76s/it]

Line: 69, Found inconsistency between question and answer
Original Answer: In the provided contexts, the property Usability was investigated in one paper, while the property Portability was investigated in two papers.
Corrected Answer: Among papers that investigate a reference architecture, the number that focus on usability compared to portability is one paper for usability and two papers for portability.
Difference Answer Corrected: ['a', 'for', 'that', 'is', 'paper', 'to', 'usability', 'number', 'and', 'on', 'portability', 'compared', 'investigate', 'focus', 'portability.', 'architecture,', 'reference', 'papers', 'Among']
Difference Answer Original: ['was', 'investigated', 'contexts,', 'papers.', 'property', 'while', 'provided', 'Portability', 'paper,', 'in', 'Usability', 'In']



 41%|████      | 69/170 [03:46<05:04,  3.01s/it]

Line: 71, Found grammar issues in the question
Original Question: Which papers that investigate architecture evolution have not used input data?
Corrected Question: Which papers investigating architecture evolution have not used input data?
Difference Question Corrected: ['investigating']
Difference Question Original: ['that', 'investigate']



 42%|████▏     | 72/170 [03:54<04:26,  2.72s/it]

Line: 74, Found grammar issues in the answer
Original Answer: The paper class that papers are most frequently classified as when they investigate the property Reliability is Evaluation research.
Corrected Answer: The paper class that papers are most frequently classified as when they investigate the property Reliability is evaluation research.
Difference Answer Corrected: ['evaluation']
Difference Answer Original: ['Evaluation']



 44%|████▎     | 74/170 [03:59<04:15,  2.66s/it]

Line: 76, Found grammar issues in the answer
Original Answer: The most commonly used paper class for publications that employ the evaluation method Case Study and target the research object Architecture Pattern is evaluation research.
Corrected Answer: The most commonly used paper class for publications that employ the evaluation method Case Study and target the research object Architecture Pattern is evaluation research papers.
Difference Answer Corrected: ['papers.']
Difference Answer Original: ['research.']



 45%|████▍     | 76/170 [04:04<03:41,  2.36s/it]

Line: 78, Found grammar issues in the question
Original Question: How are the papers that investigate the research object Technical Debt distributed by their publication year?
Corrected Question: How are the papers that investigate the research object of Technical Debt distributed by their publication year?
Difference Question Corrected: ['of']
Difference Question Original: []



 46%|████▌     | 78/170 [04:12<04:33,  2.98s/it]

Line: 80, Found inconsistency between question and answer
Original Answer: The evaluation method Benchmark is applied in four papers: one from 2017, one from 2019, and two from 2020. Therefore, the frequency is as follows: 1 paper in 2017, 1 paper in 2019, and 2 papers in 2020.
Corrected Answer: The frequency of papers that apply the evaluation method Benchmark per publication year is: 1 paper in 2017, 1 paper in 2019, and 2 papers in 2020.
Difference Answer Corrected: ['publication', 'is:', 'that', 'of', 'per', 'apply', 'year']
Difference Answer Original: ['from', 'four', 'two', 'papers:', 'is', 'applied', 'one', 'Therefore,', 'follows:', 'as']



 46%|████▋     | 79/170 [04:15<04:42,  3.10s/it]

Line: 81, Found grammar issues in the question
Original Question: What is the frequency of papers that used a field experiment in their evaluation per year?
Corrected Question: What is the frequency of papers that used a field experiment in their evaluations per year?
Difference Question Corrected: ['evaluations']
Difference Question Original: ['evaluation']



 47%|████▋     | 80/170 [04:17<04:09,  2.77s/it]

Line: 82, Found grammar issues in the question
Original Question: What are the research objects investigated in papers of the paper class philosophical paper?
Corrected Question: What are the research objects investigated in papers of the philosophical paper class?
Difference Question Corrected: ['class?']
Difference Question Original: ['paper?', 'class']

Line: 82, Found grammar issues in the answer
Original Answer: The research object investigated in papers of the philosophical paper class is Architecture Decision Making.
Corrected Answer: The research object investigated in papers of the philosophical paper class is architecture decision making.
Difference Answer Corrected: ['architecture', 'decision', 'making.']
Difference Answer Original: ['Making.', 'Architecture', 'Decision']

Line: 82, found typing issues in the question template
Original Question Template: What are the research objects investigated in papers of the paper class [paper class]?
Corrected Question Template: What a

 49%|████▉     | 83/170 [04:29<05:17,  3.65s/it]

Line: 85, Found grammar issues in the question
Original Question: Which research objects are investigated in papers that evaluate the reliability?
Corrected Question: Which research objects are investigated in papers that evaluate reliability?
Difference Question Corrected: []
Difference Question Original: ['the']

Line: 86, Found grammar issues in the answer
Original Answer: The evaluation methods applied to evaluate the Compatibility are Field Experiment and Technical Experiment.
Corrected Answer: The evaluation methods applied to evaluate compatibility are field experiment and technical experiment.
Difference Answer Corrected: ['field', 'compatibility', 'technical', 'experiment.', 'experiment']
Difference Answer Original: ['Compatibility', 'Experiment', 'Technical', 'Experiment.', 'the', 'Field']



 50%|█████     | 85/170 [04:31<03:25,  2.42s/it]

Line: 86, Found inconsistency between question and answer
Original Answer: The evaluation methods applied to evaluate the Compatibility are Field Experiment and Technical Experiment.
Corrected Answer: The evaluation methods applied in papers that evaluate compatibility are field experiment and technical experiment.
Difference Answer Corrected: ['field', 'compatibility', 'that', 'technical', 'in', 'papers', 'experiment.', 'experiment']
Difference Answer Original: ['Compatibility', 'Experiment', 'Technical', 'Experiment.', 'the', 'Field', 'to']



 51%|█████     | 87/170 [04:37<03:31,  2.55s/it]

Line: 89, Found grammar issues in the answer
Original Answer: There are two research objects investigated in papers with the Freedom from Risk property.
Corrected Answer: There are two research objects investigated in papers with the Freedom from Risk property.
Difference Answer Corrected: []
Difference Answer Original: []

Line: 89, Found inconsistency between question and answer
Original Answer: There are two research objects investigated in papers with the Freedom from Risk property.
Corrected Answer: The number of research objects investigated in papers where the property Freedom from Risk is evaluated is two.
Difference Answer Corrected: ['evaluated', 'of', 'is', 'two.', 'property', 'where', 'The', 'number']
Difference Answer Original: ['with', 'are', 'two', 'There', 'property.']

Line: 89, found typing issues in the question template
Original Question Template: How many research objects are investigated in papers where the property [Content Data: property name] is evaluated?
Corr

 52%|█████▏    | 88/170 [04:44<04:54,  3.59s/it]

Line: 90, Found grammar issues in the answer
Original Answer: Among papers that evaluate compatibility, the number of research objects evaluated is two.
Corrected Answer: Among papers that evaluate compatibility, the number of research objects evaluated is two.
Difference Answer Corrected: []
Difference Answer Original: []
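
Some entries, such as the grammar checks for lines 89 and 90, are flagged even though the corrected text is identical to the original. Such entries could be filtered out before manual review. The following is a small sketch of one way to do this; `is_noop_correction` is a hypothetical helper and not part of the notebook's code.

```python
def is_noop_correction(original: str, corrected: str) -> bool:
    """Return True when both texts are equal after collapsing whitespace."""
    def normalize(text: str) -> str:
        # Collapse runs of whitespace and strip leading/trailing spaces.
        return " ".join(text.split())

    return normalize(original) == normalize(corrected)


# Example taken from the "Line: 90" entry above.
original_answer = (
    "Among papers that evaluate compatibility, the number of research "
    "objects evaluated is two."
)
corrected_answer = (
    "Among papers that evaluate compatibility, the number of research "
    "objects evaluated is two."
)
print(is_noop_correction(original_answer, corrected_answer))
```

Normalizing whitespace also covers originals that end with a trailing space, so a pure whitespace change is not reported as a correction.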



 56%|█████▌    | 95/170 [05:09<04:10,  3.34s/it]

Line: 97, Found inconsistency between question and answer
Original Answer: Using the evaluation method Case Study, the property Maintainability is evaluated twice while Usability is only evaluated once.
Corrected Answer: The frequency with which the property Maintainability is evaluated in comparison to the property Usability by applying the evaluation method Case Study is that Maintainability is evaluated twice, while Usability is evaluated once.
Difference Answer Corrected: ['comparison', 'with', 'which', 'frequency', 'that', 'applying', 'twice,', 'in', 'The', 'Study', 'by', 'to']
Difference Answer Original: ['Using', 'twice', 'Study,', 'only']



 58%|█████▊    | 99/170 [05:20<03:38,  3.08s/it]

Line: 100, Found inconsistency between question and answer
Original Answer: Among publications that investigate architectural assumptions, which properties have been evaluated without following a guideline for their evaluation?
Corrected Answer: The properties that have been evaluated without following a guideline for their evaluation among publications that investigate architectural assumptions are [list the properties here].
Difference Answer Corrected: ['evaluation', 'assumptions', 'are', 'the', 'here].', 'The', 'among', '[list']
Difference Answer Original: ['Among', 'evaluation?', 'which', 'assumptions,']

Line: 101, Found grammar issues in the answer
Original Answer: The research objects that have been evaluated with the evaluation method Field Experiment without using an evaluation guideline are Reference Architecture and Architecture Optimization Method.
Corrected Answer: The research objects that have been evaluated with the evaluation method Field Experiment without using an e

 59%|█████▉    | 100/170 [05:25<04:11,  3.59s/it]

Line: 102, Found grammar issues in the answer
Original Answer: Architecture Pattern is the only research object for which no evaluation guideline has been used when applying the evaluation method Grounded Theory. 
Corrected Answer: Architecture Pattern is the only research object for which no evaluation guidelines have been used when applying the evaluation method Grounded Theory.
Difference Answer Corrected: ['have', 'guidelines']
Difference Answer Original: ['has', 'guideline']



 59%|█████▉    | 101/170 [05:27<03:29,  3.04s/it]

Line: 102, Found inconsistency between question and answer
Original Answer: Architecture Pattern is the only research object for which no evaluation guideline has been used when applying the evaluation method Grounded Theory. 
Corrected Answer: The research objects evaluated without using a guideline for the evaluation among publications that apply grounded theory in their evaluation are architecture patterns.
Difference Answer Corrected: ['architecture', 'objects', 'a', 'without', 'their', 'are', 'that', 'evaluated', 'using', 'grounded', 'apply', 'theory', 'patterns.', 'publications', 'among', 'in', 'The']
Difference Answer Original: ['when', 'Theory.', 'Architecture', 'only', 'which', 'been', 'has', 'applying', 'method', 'is', 'used', 'Pattern', 'Grounded', 'no', 'object']

Line: 103, Found grammar issues in the answer
Original Answer: The evaluation method that has been applied most frequently in papers investigating Technical Debt is case study.
Corrected Answer: The evaluation met

 60%|██████    | 102/170 [05:29<02:59,  2.63s/it]

Line: 104, Found grammar issues in the question
Original Question: What are the properties that are investigated the most often for the research object Technical Debt?
Corrected Question: What are the properties that are most often investigated for the research object Technical Debt?
Difference Question Corrected: []
Difference Question Original: []
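
In the entry above, both difference lists are empty although the question was changed, because the correction only moves words around. An order-aware comparison makes such changes visible. The following is a sketch using `difflib.ndiff` from the Python standard library; `ordered_word_diff` is a hypothetical helper and not the notebook's own implementation.

```python
import difflib


def ordered_word_diff(original: str, corrected: str) -> list[str]:
    """Return word-level changes, prefixed with '- ' (removed) or '+ ' (added)."""
    diff = difflib.ndiff(original.split(), corrected.split())
    # Keep only removals and additions; drop unchanged words and hint lines.
    return [entry for entry in diff if entry.startswith(("- ", "+ "))]


# Example taken from the "Line: 104" question entry above.
original_question = (
    "What are the properties that are investigated the most often "
    "for the research object Technical Debt?"
)
corrected_question = (
    "What are the properties that are most often investigated "
    "for the research object Technical Debt?"
)
changes = ordered_word_diff(original_question, corrected_question)
print(changes)
```

Unlike a set-based comparison, this reports the moved word `investigated` and the dropped `the`.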

Line: 104, Found grammar issues in the answer
Original Answer: The property that is investigated the most often for the research object Technical Debt is Maintainability.
Corrected Answer: The property that is investigated most often for the research object Technical Debt is Maintainability.
Difference Answer Corrected: []
Difference Answer Original: []

Line: 104, Found inconsistency between question and answer
Original Answer: The property that is investigated the most often for the research object Technical Debt is Maintainability.
Corrected Answer: The properties that are most often investigated for the research object Technical Debt a

 62%|██████▏   | 105/170 [05:38<03:16,  3.03s/it]

Line: 107, Found grammar issues in the question
Original Question: How frequently is the research object Technical Debt investigated per publication year?
Corrected Question: How frequently is the research object "Technical Debt" investigated per publication year?
Difference Question Corrected: ['Debt"', '"Technical']
Difference Question Original: ['Debt', 'Technical']



 63%|██████▎   | 107/170 [05:44<02:50,  2.70s/it]

Line: 109, Found inconsistency between question and answer
Original Answer: The evaluation method Benchmark was used twice in 2020, once in 2017, and once in 2019.
Corrected Answer: The evaluation method Benchmark is applied twice in 2020, once in 2017, and once in 2019 per publication year.
Difference Answer Corrected: ['publication', 'is', 'applied', 'per', 'year.', '2019']
Difference Answer Original: ['2019.', 'was', 'used']



 64%|██████▍   | 109/170 [05:49<02:40,  2.63s/it]

Line: 110, Found inconsistency between question and answer
Original Answer: The evaluation method Experiment was used once in 2019, once in 2018, and once in 2017.
Corrected Answer: The frequency per year that field experiments are used for evaluation is once in 2019, once in 2018, and once in 2017.
Difference Answer Corrected: ['field', 'for', 'frequency', 'are', 'that', 'is', 'per', 'year', 'experiments']
Difference Answer Original: ['Experiment', 'was', 'method']

Line: 111, Found grammar issues in the answer
Original Answer: The property investigated on the research object Architecture Design Method in the paper 'Enabling Continuous Software Engineering for Embedded Systems Architectures with Virtual Prototypes' is Functional Suitability.
Corrected Answer: The property investigated in the research object Architecture Design Method in the paper 'Enabling Continuous Software Engineering for Embedded Systems Architectures with Virtual Prototypes' is Functional Suitability.
Difference 

 65%|██████▌   | 111/170 [05:55<02:39,  2.71s/it]

Line: 113, Found grammar issues in the question
Original Question: What is the evaluation method applied to evaluate the research object Architecture Analysis Method in the paper with the title 'Accurate Analysis of Quality Properties of Software with Observation-Based Markov Chain Refinement'?
Corrected Question: What is the evaluation method applied to evaluate the research object, Architecture Analysis Method, in the paper with the title 'Accurate Analysis of Quality Properties of Software with Observation-Based Markov Chain Refinement'?
Difference Question Corrected: ['Method,', 'object,']
Difference Question Original: ['object', 'Method']



 66%|██████▋   | 113/170 [06:01<02:41,  2.84s/it]

Line: 114, Found inconsistency between question and answer
Original Answer: The method applied to evaluate the reference architecture in the paper 'FLRA: A Reference Architecture for Federated Learning Systems' is Data Science.
Corrected Answer: The method applied to evaluate the reference architecture in the paper 'FLRA: A Reference Architecture for Federated Learning Systems' is not simply 'Data Science'; the answer should specify the actual evaluation method used in the paper, such as 'case study', 'experiment', or 'simulation', if mentioned. Please provide the correct method as stated in the paper. 

For example, if the paper uses a case study, the corrected answer would be:

The method applied to evaluate the reference architecture in the paper 'FLRA: A Reference Architecture for Federated Learning Systems' is a case study.
Difference Answer Corrected: ['simply', "'experiment',", 'such', 'a', 'mentioned.', "study',", "'case", 'Please', 'paper,', 'stated', 'corrected', 'correct', '

 67%|██████▋   | 114/170 [06:04<02:47,  2.98s/it]

Line: 116, Found grammar issues in the answer
Original Answer: The evaluation methods applied to evaluate accuracy on the investigated objects in papers published by Duc Le are Technical Experiment and Data Science.
Corrected Answer: The evaluation methods applied to assess accuracy on the investigated objects in papers published by Duc Le are Technical Experiment and Data Science.
Difference Answer Corrected: ['assess']
Difference Answer Original: ['evaluate']



 68%|██████▊   | 115/170 [06:07<02:34,  2.81s/it]

Line: 116, Found inconsistency between question and answer
Original Answer: The evaluation methods applied to evaluate accuracy on the investigated objects in papers published by Duc Le are Technical Experiment and Data Science.
Corrected Answer: The evaluation methods that have been applied to evaluate accuracy on the investigated objects in papers published by Duc Le are Technical Experiment and Data Science.
Difference Answer Corrected: ['have', 'that', 'been']
Difference Answer Original: []

Line: 117, Found grammar issues in the answer
Original Answer: The methods that have been applied to evaluate the usability of objects investigated in papers published by Lu Xiao are Controlled Experiment and Case Study.
Corrected Answer: The methods that have been applied to evaluate the usability of objects investigated in papers published by Lu Xiao are controlled experiments and case studies.
Difference Answer Corrected: ['case', 'experiments', 'studies.', 'controlled']
Difference Answer Or

 68%|██████▊   | 116/170 [06:09<02:24,  2.67s/it]

Line: 118, Found grammar issues in the answer
Original Answer: The evaluation methods applied by the author Stephan Seifermann to the research object Architecture Analysis Method are Technical Experiment and Case Study.
Corrected Answer: The evaluation methods applied by the author Stephan Seifermann to the research object, Architecture Analysis Method, are technical experiment and case study.
Difference Answer Corrected: ['study.', 'case', 'technical', 'object,', 'Method,', 'experiment']
Difference Answer Original: ['Experiment', 'Technical', 'object', 'Study.', 'Case', 'Method']

Line: 118, Found inconsistency between question and answer
Original Answer: The evaluation methods applied by the author Stephan Seifermann to the research object Architecture Analysis Method are Technical Experiment and Case Study.
Corrected Answer: The evaluation methods that have been applied by the author Stephan Seifermann with the research object Architecture Analysis Method are technical experiment an

 69%|██████▉   | 117/170 [06:16<03:29,  3.96s/it]

Line: 118, found verb issues in the answer
Original Answer: The evaluation methods applied by the author Stephan Seifermann to the research object Architecture Analysis Method are Technical Experiment and Case Study.
Corrected Answer: The evaluation methods that have been applied by the author Stephan Seifermann with the research object Architecture Analysis Method are technical experiment and case study.
Difference Answer Corrected: ['with', 'been', 'study.', 'that', 'case', 'have', 'technical', 'experiment']
Difference Answer Original: ['Experiment', 'Technical', 'to', 'Study.', 'Case']

Line: 119, Found grammar issues in the answer
Original Answer: The number of different evaluation methods that the author David Monschein applied in papers that evaluate the property Efficiency is two.
Corrected Answer: The number of different evaluation methods that the author David Monschein applied in papers that evaluate the property Efficiency is two.
Difference Answer Corrected: []
Difference A

 70%|███████   | 119/170 [06:22<02:53,  3.39s/it]

Line: 121, Found grammar issues in the answer
Original Answer: There are three evaluation methods published by the author Stefan Kugele for the research object Architecture Pattern.
Corrected Answer: There are three evaluation methods published by the author Stefan Kugele for the research object, Architecture Pattern.
Difference Answer Corrected: ['object,']
Difference Answer Original: ['object']



 71%|███████   | 120/170 [06:24<02:33,  3.07s/it]

Line: 121, Found inconsistency between question and answer
Original Answer: There are three evaluation methods published by the author Stefan Kugele for the research object Architecture Pattern.
Corrected Answer: The number of different evaluation methods that have been applied by the author Stefan Kugele to evaluate architecture patterns is three.
Difference Answer Corrected: ['architecture', 'been', 'number', 'that', 'of', 'is', 'applied', 'different', 'have', 'The', 'evaluate', 'patterns', 'to', 'three.']
Difference Answer Original: ['Architecture', 'for', 'are', 'three', 'research', 'object', 'There', 'Pattern.', 'published']

Line: 122, Found grammar issues in the answer
Original Answer: Two evaluation methods have been applied by the author Ingo Weber in papers that investigate the research object Architecture Decision Making.
Corrected Answer: Two evaluation methods have been applied by the author Ingo Weber in papers that investigate the research object of Architecture Decision

 72%|███████▏  | 122/170 [06:30<02:19,  2.90s/it]

Line: 123, Found inconsistency between question and answer
Original Answer: The evaluation methods applied by Xiwei Xu to the research object Architecture Decision Making, ranked in descending alphabetical order, are: 1. Interview and 2. Argumentation.
Corrected Answer: The methods that have been applied to evaluate architecture decision making in the papers published by Xiwei Xu, ranked in descending alphabetical order, are Interview and Argumentation.
Difference Answer Corrected: ['architecture', 'been', 'Xu,', 'are', 'that', 'decision', 'making', 'have', 'evaluate', 'papers', 'published']
Difference Answer Original: ['evaluation', 'Architecture', '1.', 'research', 'Xu', 'object', '2.', 'are:', 'Making,', 'Decision']

Line: 124, found typing issues in the question template
Original Question Template: What are the evaluation methods that the author [Metadata: author name] applied in papers that investigate the research object [Content Data: research object name]? Rank the evaluation m

 72%|███████▏  | 123/170 [06:34<02:32,  3.23s/it]

Line: 125, Found grammar issues in the answer
Original Answer: The evaluation properties investigated by the author Mohamed Soliman and evaluated with the evaluation method Technical Experiment are Functional Suitability and Accuracy.
Corrected Answer: The evaluation properties investigated by the author Mohamed Soliman and evaluated with the evaluation method Technical Experiment are Functional Suitability and Accuracy.
Difference Answer Corrected: []
Difference Answer Original: []



 73%|███████▎  | 124/170 [06:36<02:15,  2.95s/it]

Line: 125, Found inconsistency between question and answer
Original Answer: The evaluation properties investigated by the author Mohamed Soliman and evaluated with the evaluation method Technical Experiment are Functional Suitability and Accuracy.
Corrected Answer: The properties that have been evaluated with a technical experiment in papers published by Mohamed Soliman are Accuracy and Functional Suitability, ranked in descending alphabetical order.
Difference Answer Corrected: ['Suitability,', 'a', 'been', 'order.', 'ranked', 'that', 'Accuracy', 'have', 'technical', 'descending', 'in', 'papers', 'alphabetical', 'published', 'experiment']
Difference Answer Original: ['Experiment', 'Accuracy.', 'evaluation', 'Suitability', 'Technical', 'investigated', 'the', 'method', 'author']

Line: 126, Found grammar issues in the answer
Original Answer: The evaluation properties investigated by the author Andreas Burger that have been evaluated with the evaluation method Technical Experiment are pe

 74%|███████▎  | 125/170 [06:41<02:37,  3.50s/it]

Line: 127, Found grammar issues in the answer
Original Answer: The research object Architecture Extraction is investigated in two papers published in the year 2018, in comparison to one paper in the year 2020.
Corrected Answer: The research object Architecture Extraction is investigated in two papers published in 2018, compared to one paper in 2020.
Difference Answer Corrected: ['compared']
Difference Answer Original: ['comparison', 'the', 'year']

Line: 127, Found inconsistency between question and answer
Original Answer: The research object Architecture Extraction is investigated in two papers published in the year 2018, in comparison to one paper in the year 2020.
Corrected Answer: The research object Architecture Extraction is investigated more often in papers published in the publication year 2018 (two papers) in comparison to the publication year 2020 (one paper).
Difference Answer Corrected: ['publication', '2020', 'paper).', 'papers)', '(two', 'more', '2018', 'often', '(one']
D

 74%|███████▍  | 126/170 [06:46<02:49,  3.84s/it]

Line: 128, Found grammar issues in the answer
Original Answer: The research object Architecture Pattern was investigated two times in 2018 and two times in 2020.
Corrected Answer: The research object Architecture Pattern was investigated twice in 2018 and twice in 2020.
Difference Answer Corrected: ['twice']
Difference Answer Original: ['two', 'times']



 75%|███████▍  | 127/170 [06:49<02:33,  3.58s/it]

Line: 128, Found inconsistency between question and answer
Original Answer: The research object Architecture Pattern was investigated two times in 2018 and two times in 2020.
Corrected Answer: The number of times architecture patterns have been investigated in 2018 in comparison to 2020 is the same; architecture patterns were investigated twice in 2018 and twice in 2020.
Difference Answer Corrected: ['comparison', 'architecture', 'been', 'number', '2020', 'the', 'of', 'is', 'were', 'have', 'same;', 'twice', 'patterns', 'to']
Difference Answer Original: ['was', 'Architecture', 'two', 'Pattern', 'research', 'object']

Line: 129, Found inconsistency between question and answer
Original Answer: In 2020, the evaluation method Questionnaire was applied once. In 2021, the evaluation method Questionnaire was applied twice.
Corrected Answer: The evaluation method Questionnaire is applied once in papers published in the publication year 2020, in comparison to being applied twice in papers publis

 75%|███████▌  | 128/170 [06:57<03:31,  5.04s/it]

Line: 130, Found grammar issues in the answer
Original Answer: In 2021, three papers have been published that applied the evaluation method Interview. In 2020 the method has not been applied.
Corrected Answer: In 2021, three papers were published that applied the evaluation method Interview. In 2020, the method was not applied.
Difference Answer Corrected: ['was', 'were', '2020,']
Difference Answer Original: ['have', 'has', '2020', 'been']



 76%|███████▌  | 129/170 [06:58<02:34,  3.77s/it]

Line: 130, Found inconsistency between question and answer
Original Answer: In 2021, three papers have been published that applied the evaluation method Interview. In 2020 the method has not been applied.
Corrected Answer: The number of times interviews have been applied for evaluation in 2020 in comparison to 2021 is as follows: in 2020, interviews were not applied, while in 2021, interviews were applied in three papers.
Difference Answer Corrected: ['2021', 'comparison', 'for', 'number', '2020,', 'of', 'papers.', 'is', 'were', 'while', 'applied,', 'interviews', 'in', 'The', 'to', 'follows:', 'times', 'as']
Difference Answer Original: ['applied.', 'published', 'that', 'the', 'has', 'method', 'papers', 'Interview.', 'In']



 76%|███████▋  | 130/170 [07:02<02:32,  3.82s/it]

Line: 132, Found grammar issues in the question
Original Question: What are the evaluation methods that have been used by the author Muhammad Babar and were not applied with an evaluation guideline?
Corrected Question: What evaluation methods have been used by the author Muhammad Babar and were not applied with an evaluation guideline?
Difference Question Corrected: []
Difference Question Original: ['are', 'that']

Line: 132, Found grammar issues in the answer
Original Answer: The evaluation methods applied by Muhammad Babar that have no evaluation guideline are Technical Experiment and Case Study.
Corrected Answer: The evaluation methods applied by Muhammad Babar that have no evaluation guidelines are Technical Experiment and Case Study.
Difference Answer Corrected: ['guidelines']
Difference Answer Original: ['guideline']

Line: 132, Found inconsistency between question and answer
Original Answer: The evaluation methods applied by Muhammad Babar that have no evaluation guideline are T

 77%|███████▋  | 131/170 [07:08<02:54,  4.48s/it]

Line: 132, found verb issues in the answer
Original Answer: The evaluation methods applied by Muhammad Babar that have no evaluation guideline are Technical Experiment and Case Study.
Corrected Answer: The evaluation methods that have been applied by the author Muhammad Babar and were not used with an evaluation guideline are Technical Experiment and Case Study.
Difference Answer Corrected: ['with', 'been', 'not', 'the', 'were', 'used', 'an', 'author']
Difference Answer Original: ['no']



 78%|███████▊  | 132/170 [07:10<02:22,  3.75s/it]

Line: 133, Found inconsistency between question and answer
Original Answer: The evaluation methods applied by Johannes Grohmann without an evaluation guideline are Technical Experiment and Case Study.
Corrected Answer: The evaluation methods that were applied without a guideline in papers published by Johannes Grohmann are Technical Experiment and Case Study.
Difference Answer Corrected: ['a', 'that', 'were', 'in', 'papers', 'published']
Difference Answer Original: ['an']



 79%|███████▉  | 135/170 [07:18<01:43,  2.95s/it]

Line: 137, Found grammar issues in the question
Original Question: What are the evaluation methods that have been applied the most for papers with the paper class proposal of solution and the investigated research object Architecture Extraction?
Corrected Question: What are the evaluation methods that have been applied the most to papers with the paper class proposal of solution and the investigated research object Architecture Extraction?
Difference Question Corrected: ['to']
Difference Question Original: ['for']

Line: 137, Found grammar issues in the answer
Original Answer: The evaluation method that has been applied the most with the research object Architecture Extraction and papers classified as proposals of solutions is Case Study.
Corrected Answer: The evaluation method that has been applied the most to the research object Architecture Extraction and to papers classified as proposals of solutions is Case Study.
Difference Answer Corrected: ['to']
Difference Answer Original: ['w

 80%|████████  | 136/170 [07:23<01:59,  3.53s/it]

Line: 138, Found grammar issues in the answer
Original Answer: The evaluation method that was applied the most to the research object Architecture Evolution in the year 2020 is Case Study.
Corrected Answer: The evaluation method that was applied the most to the research object Architecture Evolution in the year 2020 was Case Study.
Difference Answer Corrected: []
Difference Answer Original: ['is']



 81%|████████  | 137/170 [07:32<02:49,  5.15s/it]

Line: 139, Found inconsistency between question and answer
Original Answer: The frequency per publication year of research objects evaluated with the Data Science method is: one in 2019, one in 2020, and one in 2021.
Corrected Answer: The frequency per publication year from 2019 to 2021 at which the research objects of papers have been evaluated using the evaluation method Data Science is: one in 2019, one in 2020, and one in 2021.
Difference Answer Corrected: ['2021', 'at', 'evaluation', 'from', 'which', 'been', 'using', 'have', 'papers', 'to', '2019']
Difference Answer Original: ['with']



 82%|████████▏ | 139/170 [07:39<02:14,  4.33s/it]

Line: 141, Found grammar issues in the question
Original Question: What is the distribution of the research objects that are evaluated with the evaluation method Argumentation between the publication years 2019 and 2021?
Corrected Question: What is the distribution of the research objects that are evaluated with the evaluation method Argumentation between the publication years 2019 and 2021?
Difference Question Corrected: []
Difference Question Original: []

Line: 141, Found grammar issues in the answer
Original Answer: The distribution of the research objects that are evaluated with the evaluation method Argumentation between the publication years 2019 and 2021 is: two objects in 2019, none in 2020, and none in 2021.
Corrected Answer: The distribution of the research objects that are evaluated with the evaluation method Argumentation between the publication years 2019 and 2021 is as follows: two objects in 2019, none in 2020, and none in 2021.
Difference Answer Corrected: ['follows:',

 83%|████████▎ | 141/170 [07:46<01:44,  3.61s/it]

Line: 143, Found inconsistency between question and answer
Original Answer: The title of the paper that investigates the research object Architecture Decision Making and has the paper class of philosophical paper is 'Architectural Design Decisions for Systems Supporting Model-Based Analysis of Runtime Events: A Qualitative Multi-method Study.'
Corrected Answer: The title of the paper that investigates the research object Architecture Decision Making and has the paper class philosophical paper is 'Architectural Design Decisions for Systems Supporting Model-Based Analysis of Runtime Events: A Qualitative Multi-method Study.'
Difference Answer Corrected: []
Difference Answer Original: []



 85%|████████▌ | 145/170 [07:58<01:10,  2.82s/it]

Line: 146, found verb issues in the answer
Original Answer: The papers that applied the evaluation method Argumentation in the publication year 2020 are: 'Unlimited Rulebook: a Reference Architecture for Economy Mechanics in Digital Games', 'Continuous Experimentation for Automotive Software on the Example of a Heavy Commercial Vehicle in Daily Operation', and 'Architectural Patterns for Cross-Domain Personalised Automotive Functions'.
Corrected Answer: The papers that evaluated or applied the evaluation method Argumentation in the publication year 2020 are: 'Unlimited Rulebook: a Reference Architecture for Economy Mechanics in Digital Games', 'Continuous Experimentation for Automotive Software on the Example of a Heavy Commercial Vehicle in Daily Operation', and 'Architectural Patterns for Cross-Domain Personalised Automotive Functions'.
Difference Answer Corrected: ['evaluated', 'or']
Difference Answer Original: []



 86%|████████▌ | 146/170 [07:59<00:57,  2.38s/it]

Line: 148, Found grammar issues in the question
Original Question: How many papers applied the evaluation method Controlled Experiment in the publication year 2018?
Corrected Question: How many papers applied the evaluation method "Controlled Experiment" in the publication year 2018?
Difference Question Corrected: ['Experiment"', '"Controlled']
Difference Question Original: ['Experiment', 'Controlled']

Line: 148, Found grammar issues in the answer
Original Answer: The number of papers that applied the evaluation method Controlled Experiment in the publication year 2018 is two.
Corrected Answer: The number of papers that applied the evaluation method Controlled Experiment in the publication year 2018 is two.
Difference Answer Corrected: []
Difference Answer Original: []



 86%|████████▋ | 147/170 [08:05<01:16,  3.31s/it]

Line: 148, found verb issues in the answer
Original Answer: The number of papers that applied the evaluation method Controlled Experiment in the publication year 2018 is two.
Corrected Answer: The number of papers that applied the evaluation method Controlled Experiment in the publication year 2018 is two.
Difference Answer Corrected: []
Difference Answer Original: []



 87%|████████▋ | 148/170 [08:07<01:04,  2.92s/it]

Line: 149, Found inconsistency between question and answer
Original Answer: Two papers used Grounded Theory as an evaluation method in the year 2020.
Corrected Answer: The number of papers that were evaluated with grounded theory in 2020 is two.
Difference Answer Corrected: ['with', 'evaluated', '2020', 'that', 'of', 'were', 'is', 'two.', 'grounded', 'theory', 'The', 'number']
Difference Answer Original: ['evaluation', 'Theory', 'the', 'method', 'used', 'Grounded', 'Two', 'an', 'year', '2020.', 'as']

Line: 150, Found inconsistency between question and answer
Original Answer: Two papers investigate Architecture Extraction as a research object in the year 2019.
Corrected Answer: The number of papers that investigated the research object Architecture Extraction in the publication year 2019 is two.
Difference Answer Corrected: ['publication', 'investigated', 'that', 'of', 'is', 'two.', 'The', 'number', '2019']
Difference Answer Original: ['investigate', 'a', '2019.', 'Two', 'as']



 89%|████████▉ | 151/170 [08:17<01:01,  3.23s/it]

Line: 153, Found grammar issues in the answer
Original Answer: The publications that have the evaluation method Motivating Example ranked by their publication year in descending order are: 1. A Runtime Safety Enforcement Approach by Monitoring and Adaptation (2021), 2. A DSL for MAPE Patterns Representation in Self-adapting Systems (2018)
Corrected Answer: The publications that have the evaluation method Motivating Example, ranked by their publication year in descending order, are: 1. A Runtime Safety Enforcement Approach by Monitoring and Adaptation (2021), 2. A DSL for MAPE Patterns Representation in Self-adapting Systems (2018).
Difference Answer Corrected: ['(2018).', 'Example,', 'order,']
Difference Answer Original: ['Example', 'order', '(2018)']



 89%|████████▉ | 152/170 [08:19<00:53,  2.99s/it]

Line: 153, Found inconsistency between question and answer
Original Answer: The publications that have the evaluation method Motivating Example ranked by their publication year in descending order are: 1. A Runtime Safety Enforcement Approach by Monitoring and Adaptation (2021), 2. A DSL for MAPE Patterns Representation in Self-adapting Systems (2018)
Corrected Answer: The publications by Patrizia Scandurra that use a motivating example for evaluation, ranked in descending order of their publication year, are: 1. A Runtime Safety Enforcement Approach by Monitoring and Adaptation (2021), 2. A DSL for MAPE Patterns Representation in Self-adapting Systems (2018).
Difference Answer Corrected: ['use', 'a', '(2018).', 'example', 'of', 'Patrizia', 'motivating', 'year,', 'Scandurra', 'evaluation,']
Difference Answer Original: ['evaluation', '(2018)', 'the', 'Motivating', 'method', 'have', 'year', 'Example']

Line: 154, Found inconsistency between question and answer
Original Answer: The public

 91%|█████████ | 154/170 [08:26<00:50,  3.13s/it]

Line: 155, Found inconsistency between question and answer
Original Answer: The publications sorted from newest to oldest are: 1. Architectural Technical Debt: A Grounded Theory (2020), 2. On Service-Orientation for Automotive Software (2017).
Corrected Answer: The publications that are classified as validation research and evaluated with grounded theory, sorted from newest to oldest, are: 1. Architectural Technical Debt: A Grounded Theory (2020), 2. On Service-Orientation for Automotive Software (2017).
Difference Answer Corrected: ['and', 'with', 'theory,', 'classified', 'evaluated', 'are', 'that', 'grounded', 'research', 'validation', 'oldest,', 'as']
Difference Answer Original: ['oldest']



 92%|█████████▏| 156/170 [08:30<00:33,  2.40s/it]

Line: 157, Found inconsistency between question and answer
Original Answer: In the year 2021, there are four papers that investigate the research object Architecture Decision Making. However, there are no papers listed for the year 2020.
Corrected Answer: The number of papers that investigate architecture decision making in 2020 in comparison to 2021 is as follows: there are no papers in 2020, while there are four papers in 2021.
Difference Answer Corrected: ['2021', '2021.', 'comparison', 'architecture', '2020,', 'number', '2020', 'decision', 'of', 'is', 'making', 'while', 'in', 'The', 'to', 'follows:', 'as']
Difference Answer Original: ['2020.', 'Architecture', 'for', 'listed', 'the', 'In', 'research', '2021,', 'object', 'year', 'However,', 'Making.', 'Decision']

Line: 158, Found grammar issues in the answer
Original Answer: In the year 2017, there are three papers that evaluate the property Maintainability. In comparison, in the year 2020, there are also three papers that evaluate 

 92%|█████████▏| 157/170 [08:34<00:40,  3.10s/it]

Line: 159, Found grammar issues in the question
Original Question: How many papers evaluated the property security in 2017 in comparison to 2020?
Corrected Question: How many papers evaluated property security in 2017 in comparison to 2020?
Difference Question Corrected: []
Difference Question Original: ['the']



 93%|█████████▎| 158/170 [08:37<00:37,  3.12s/it]

Line: 159, Found inconsistency between question and answer
Original Answer: In the year 2017, there were two papers that evaluated the property Security. In comparison, in the year 2020, there was one paper that evaluated the property Security.
Corrected Answer: The number of papers that evaluated property security in 2017 in comparison to 2020 is as follows: in 2017, there were two papers, while in 2020, there was one paper.
Difference Answer Corrected: ['comparison', '2017', 'number', '2020', 'of', 'is', 'while', 'paper.', 'The', 'security', 'papers,', 'to', 'follows:', 'as']
Difference Answer Original: ['the', 'paper', 'year', 'comparison,', 'Security.', 'In']



 95%|█████████▌| 162/170 [08:48<00:20,  2.62s/it]

Line: 164, Found grammar issues in the answer
Original Answer: The paper class that has the most papers applying the evaluation method Benchmark in the publication year 2020 is the proposal of a solution.
Corrected Answer: The paper class that has the most papers applying the evaluation method Benchmark in the publication year 2020 is the proposal of a solution class.
Difference Answer Corrected: ['solution', 'class.']
Difference Answer Original: ['solution.']

Line: 164, found verb issues in the question
Original Question: Among papers that apply the evaluation method Benchmark, what is the paper class that is most frequently applied to papers published in the publication year 2020?
Corrected Question: Among papers that apply the evaluation method Benchmark, what is the paper class that is most frequently evaluated in papers published in the publication year 2020?
Difference Question Corrected: ['evaluated']
Difference Question Original: ['to', 'applied']



 96%|█████████▌| 163/170 [08:57<00:32,  4.58s/it]

Line: 166, found verb issues in the question
Original Question: Which paper class has the most papers with the research object Architecture Evolution in the publication year 2021?
Corrected Question: Which paper class has the most papers with the research object Architecture Evolution investigated in the publication year 2021?
Difference Question Corrected: ['investigated']
Difference Question Original: []



 97%|█████████▋| 165/170 [09:01<00:17,  3.48s/it]

Line: 166, found verb issues in the answer
Original Answer: The paper class that has the most papers with the research object Architecture Evolution in the publication year 2021 is evaluation research.
Corrected Answer: The paper class that has the most papers with the research object Architecture Evolution in the publication year 2021 is investigating research.
Difference Answer Corrected: ['investigating']
Difference Answer Original: ['evaluation']



 98%|█████████▊| 166/170 [09:03<00:11,  2.98s/it]

Line: 168, Found inconsistency between question and answer
Original Answer: The distribution of papers investigating the research object Architecture Extraction between 2018 and 2020 is as follows: two papers were published in 2018, one paper was published in 2019, and one paper was published in 2020.
Corrected Answer: The papers that investigate the research object Architecture Extraction are distributed between the publication years 2018 and 2020 as follows: two papers were published in 2018, one paper was published in 2019, and one paper was published in 2020.
Difference Answer Corrected: ['publication', 'investigate', 'years', 'are', 'that', 'distributed']
Difference Answer Original: ['distribution', 'investigating', 'of', 'is']



 99%|█████████▉| 168/170 [09:10<00:06,  3.13s/it]

Line: 170, Found inconsistency between question and answer
Original Answer: The distribution of papers with the evaluation method Argumentation between 2019 and 2021 is as follows: In 2019 two papers were published and in 2020 three papers were published. However, in 2021 no paper was published with the evaluation method Argumentation.
Corrected Answer: The distribution of papers with the evaluation method Argumentation between the publication years 2019 and 2021 is: in 2019, two papers; in 2020, three papers; and in 2021, no papers.
Difference Answer Corrected: ['publication', 'years', '2020,', 'is:', 'papers;', '2019,', 'papers.', '2021,']
Difference Answer Original: ['was', '2020', 'follows:', 'In', 'is', 'were', 'Argumentation.', 'paper', 'published.', 'However,', 'published', 'as']



100%|██████████| 170/170 [09:18<00:00,  3.28s/it]
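
The "Difference" lists in the output above look like unordered sets of whitespace-separated tokens: punctuation stays attached to words (for example 'paper).'), and a changed word that already occurs elsewhere in the other text does not show up. Several flagged rows also have a corrected text identical to the original. The sketch below shows one way such a comparison could be computed and how identical "corrections" could be filtered out before manual review. The names `token_set_difference` and `is_noop_correction` are hypothetical; the notebook's own `get_list_of_difference` is defined elsewhere and may behave differently.

```python
def token_set_difference(reference: str, candidate: str) -> list[str]:
    """Return the whitespace-separated tokens found in candidate but not in reference.

    Sorted for reproducible output; the lists in the notebook output are unordered.
    """
    return sorted(set(candidate.split()) - set(reference.split()))


def is_noop_correction(original: str, corrected: str) -> bool:
    """Return True when the corrected text differs from the original only in whitespace."""
    return original.split() == corrected.split()


original = ("The research object Architecture Pattern was investigated "
            "two times in 2018 and two times in 2020.")
corrected = ("The research object Architecture Pattern was investigated "
             "twice in 2018 and twice in 2020.")

added = token_set_difference(original, corrected)    # tokens only in the corrected text
removed = token_set_difference(corrected, original)  # tokens only in the original text
print(added)    # ['twice']
print(removed)  # ['times', 'two']
print(is_noop_correction(original, corrected))  # False
```

A set-based comparison ignores word order and repeated words, so an empty difference does not guarantee that the two texts are identical. Checking `is_noop_correction` separately covers that case.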


In [3]:
# Check every question against its (updated) template and report mismatches.
for index, row in tqdm(deep_dataset_df.iterrows(), total=len(deep_dataset_df)):
    question = row['question']
    question_template = row['updated template']
    # The validator returns a consistency flag and the template it would expect.
    template_consistent, corrected_template = validate_question_to_template(question, question_template)
    if not template_consistent:
        # index + 2 converts the 0-based row index to the 1-based CSV line number,
        # counting the header row.
        print(f"Line: {index + 2}, found template issues in the question")
        print(f"Original Question Template: {question_template}")
        print(f"Corrected Question Template: {corrected_template}")
        print(f"Difference: {get_list_of_difference(question_template, corrected_template)}")
        print()

 29%|██▉       | 50/170 [00:37<01:42,  1.17it/s]

Line: 51, found template issues in the question
Original Question Template: Which paper applies [evaluation method name] as a method in their evaluation?
Corrected Question Template: Which paper applies [placeholder] as a method in their evaluation?
Difference: ['[placeholder]']



 30%|███       | 51/170 [00:38<01:37,  1.21it/s]

Line: 52, found template issues in the question
Original Question Template: In which paper is the [property name] evaluated?
Corrected Question Template: In which paper is the [placeholder] evaluated?
Difference: ['[placeholder]']



 32%|███▏      | 54/170 [00:41<01:36,  1.20it/s]

Line: 55, found template issues in the question
Original Question Template: Which papers use [evaluation method name]s as a method in their evaluations?
Corrected Question Template: Which papers use [evaluation method name] as a method in their evaluations?
Difference: ['name]']



 33%|███▎      | 56/170 [00:42<01:35,  1.20it/s]

Line: 57, found template issues in the question
Original Question Template: Which publications assess the [property name] of a [research object name] and make their input data available?
Corrected Question Template: Which publications assess the [property name] of a [research object name] and make their input data available?
Difference: []



 35%|███▍      | 59/170 [00:45<01:46,  1.04it/s]

Line: 60, found template issues in the question
Original Question Template: How many papers that discuss [threat to validity] as a validity threat use [evaluation method name]s in their evaluation?
Corrected Question Template: How many papers that discuss [threat to validity] as a validity threat use [evaluation method name] in their evaluation?
Difference: ['name]']



 36%|███▋      | 62/170 [00:47<01:23,  1.29it/s]

Line: 63, found template issues in the question
Original Question Template: Which studies evaluate [property name] in their work sorted from the most recent publication to the earliest?
Corrected Question Template: Which studies evaluate [placeholder] in their work sorted from the most recent publication to the earliest?
Difference: ['[placeholder]']



 38%|███▊      | 64/170 [00:48<01:28,  1.20it/s]

Line: 65, found template issues in the question
Original Question Template: Give me all the papers that apply a [evaluation method name] in their evaluation ranked from the most recent to the oldest.
Corrected Question Template: Give me all the papers that apply a [placeholder] in their evaluation ranked from the most recent to the oldest.
Difference: ['[placeholder]']



 39%|███▉      | 66/170 [00:50<01:30,  1.15it/s]

Line: 67, found template issues in the question
Original Question Template: Among publications that use [evaluation method name] for evaluation, what is the number of papers with publicly available input data versus those without?
Corrected Question Template: Among publications that use [evaluation method name] for evaluation, what is the number of papers with publicly available input data versus those without?
Difference: []



 39%|███▉      | 67/170 [00:51<01:19,  1.29it/s]

Line: 69, found template issues in the question
Original Question Template: Among papers that investigate a [research object name], how many focus on [property name] compared to [property name]?
Corrected Question Template: Among papers that investigate a [research object name], how many focus on [property name] compared to [property name]?
Difference: []



 41%|████      | 70/170 [00:52<00:51,  1.94it/s]

Line: 71, found template issues in the question
Original Question Template: Which papers that investigate [research object name] have not used input data?
Corrected Question Template: Which papers that investigate architecture evolution have not used input data?
Difference: ['architecture', 'evolution']



 42%|████▏     | 72/170 [00:53<00:57,  1.70it/s]

Line: 73, found template issues in the question
Original Question Template: Which papers apply a [evaluation method name] and do not have any input data available?
Corrected Question Template: Which papers apply a [placeholder] and do not have any input data available?
Difference: ['[placeholder]']



 44%|████▍     | 75/170 [00:56<01:24,  1.12it/s]

Line: 76, found template issues in the question
Original Question Template: Among the publications that investigate [research object name] and apply case studies in their evaluation, which is the most used paper class?
Corrected Question Template: Among the publications that investigate [placeholder] and apply case studies in their evaluation, which is the most used paper class?
Difference: ['[placeholder]']



 45%|████▍     | 76/170 [00:58<01:44,  1.11s/it]

Line: 77, found template issues in the question
Original Question Template: Among the publications that investigate [research object name] and apply [evaluation method name] in their evaluation, which is the most used paper class?
Corrected Question Template: Among the publications that investigate [research object name] and apply [evaluation method name] in their evaluation, which is the most used paper class?
Difference: []



 46%|████▌     | 78/170 [00:59<01:22,  1.11it/s]

Line: 79, found template issues in the question
Original Question Template: How are the papers that investigate [research object name] distributed per publication year?
Corrected Question Template: What is the frequency of papers that investigate [research object name] per year?
Difference: ['of', 'frequency', 'What', 'is']



 46%|████▋     | 79/170 [01:01<01:36,  1.06s/it]

Line: 80, found template issues in the question
Original Question Template: What is the distribution of papers that apply the evaluation method [evaluation method name] per publication year?
Corrected Question Template: What is the frequency of papers that apply the evaluation method [evaluation method name] per publication year?
Difference: ['frequency']



 49%|████▉     | 84/170 [01:03<01:02,  1.38it/s]

Line: 85, found template issues in the question
Original Question Template: Which research objects are investigated in papers that evaluate the [property name]?
Corrected Question Template: Which research objects are investigated in papers that evaluate the [placeholder]?
Difference: ['[placeholder]?']



 55%|█████▍    | 93/170 [01:09<01:01,  1.26it/s]

Line: 94, found template issues in the question
Original Question Template: Among those publications that evaluate [property name] in their investigation, what are the research objects the evaluation is applied on? Rank the objects in descending alphabetical order.
Corrected Question Template: Among those publications that evaluate [placeholder] in their investigation, what are the research objects the evaluation is applied on? Rank the objects in descending alphabetical order.
Difference: ['[placeholder]']



 56%|█████▌    | 95/170 [01:11<01:03,  1.19it/s]

Line: 96, found template issues in the question
Original Question Template: Among publications that investigate [research object name], how often is the [property name] evaluated in comparison to [property name]?
Corrected Question Template: Among publications that investigate [research object name], how often is the [property name 1] evaluated in comparison to [property name 2]?
Difference: ['2]?', 'name', '1]']



 57%|█████▋    | 97/170 [01:13<01:08,  1.06it/s]

Line: 98, found template issues in the question
Original Question Template: Among publications that apply a [evaluation method name] in their evaluation, how often is [property name] evaluated in comparison to [property name]?
Corrected Question Template: Among publications that apply a [evaluation method name] in their evaluation, how often is [property name] evaluated in comparison to [another property name]?
Difference: ['[another', 'property']



 58%|█████▊    | 98/170 [01:14<01:14,  1.03s/it]

Line: 99, found template issues in the question
Original Question Template: What are the properties that are evaluated in papers that investigate the research object [research object name] without adhering to an evaluation guideline?
Corrected Question Template: What are the properties that are evaluated in papers that investigate the research object [research object] without adhering to an evaluation guideline?
Difference: ['object]']



 58%|█████▊    | 99/170 [01:15<01:10,  1.01it/s]

Line: 100, found template issues in the question
Original Question Template: Among publications that investigate [research object name], which properties have been evaluated without following a guideline for their evaluation?
Corrected Question Template: Among publications that investigate [research object name], which properties have been evaluated without follwing a guideline for their evaluation?
Difference: ['follwing']



 59%|█████▉    | 101/170 [01:17<01:05,  1.05it/s]

Line: 102, found template issues in the question
Original Question Template: Among publications that apply [evaluation method name] in their evaluation, what are the research objects evaluated without using a guideline for the evaluation?
Corrected Question Template: Among publications that apply [evaluation method] in their evaluation, what are the research objects evaluated without using a guideline for the evaluation?
Difference: ['method]']



 61%|██████    | 104/170 [01:19<00:49,  1.34it/s]

Line: 105, found template issues in the question
Original Question Template: Among publications that investigate [research object name], what properties have been evaluated the most often?
Corrected Question Template: Among publications that investigate [placeholder], what properties have been evaluated the most often?
Difference: ['[placeholder],']



 63%|██████▎   | 107/170 [01:21<00:55,  1.13it/s]

Line: 108, found template issues in the question
Original Question Template: Among publications that investigate [research object name], how are they distributed per year?
Corrected Question Template: Among publications that investigate [placeholder], how are they distributed per year?
Difference: ['[placeholder],']



 64%|██████▍   | 109/170 [01:22<00:30,  2.01it/s]

Line: 110, found template issues in the question
Original Question Template: How often per year are [evaluation method name]s used for evaluation?
Corrected Question Template: How often per year are [placeholder] used for evaluation?
Difference: ['[placeholder]']



 65%|██████▌   | 111/170 [01:23<00:38,  1.52it/s]

Line: 112, found template issues in the question
Original Question Template: What property is evaluated in the [research object name] investigation of the paper '[paper title]'?
Corrected Question Template: What property is evaluated in the [investigation type] investigation of the paper '[paper title]'?
Difference: ['[investigation', 'type]']



 68%|██████▊   | 116/170 [01:27<00:38,  1.40it/s]

Line: 117, found template issues in the question
Original Question Template: Which methods have been applied to evaluate the [property name] of objects investigated in papers published by [author name]?
Corrected Question Template: Which methods have been applied to evaluate the usability of objects investigated in papers published by [author name]?
Difference: ['usability']



 73%|███████▎  | 124/170 [01:32<00:36,  1.26it/s]

Line: 125, found template issues in the question
Original Question Template: What properties have been evaluated with a [evaluation method name] in papers published by [author name]? Rank them in descending alphabetical order.
Corrected Question Template: What properties have been evaluated with a [evaluation method name] in papers published by [author name]? Rank them in descending alphabetical order.
Difference: []



 78%|███████▊  | 132/170 [01:38<00:31,  1.22it/s]

Line: 133, found template issues in the question
Original Question Template: What are the evaluation methods that were applied without a guideline in papers published by [author name]?
Corrected Question Template: What are the evaluation methods that were applied without a guideline in papers published by [placeholder]?
Difference: ['[placeholder]?']



 79%|███████▉  | 135/170 [01:40<00:30,  1.14it/s]

Line: 136, found template issues in the question
Original Question Template: What is the property that is evaluated the most often in papers that investigate [research object name]s published by [author name]?
Corrected Question Template: What is the property that is evaluated the most often in papers that investigate [research object name] published by [author name]?
Difference: ['name]']



 86%|████████▌ | 146/170 [01:48<00:19,  1.23it/s]

Line: 147, found template issues in the question
Original Question Template: In [year], which papers applied [evaluation method name]s for evaluation?
Corrected Question Template: In [year], which papers applied [evaluation method name] for evaluation?
Difference: ['name]']



 91%|█████████ | 154/170 [01:52<00:09,  1.72it/s]

Line: 157, found template issues in the question
Original Question Template: How many papers investigate [research object name] in [year] in comparison to [year]?
Corrected Question Template: How many papers investigate [research object name] in [year1] in comparison to [year2]?
Difference: ['[year1]', '[year2]?']



 98%|█████████▊| 166/170 [01:59<00:03,  1.16it/s]

Line: 167, found template issues in the question
Original Question Template: Which paper class was used the most for publications that investigate [research object name]s in [year]?
Corrected Question Template: Which paper class was used the most for publications that investigated [research object name]s in [year]?
Difference: ['investigated']



 99%|█████████▉| 168/170 [02:01<00:01,  1.17it/s]

Line: 169, found template issues in the question
Original Question Template: What is the distribution of papers that investigate the object [research object name] between [year] and [year]?
Corrected Question Template: What is the distribution of papers that investigate the object [architecture description] between [2018] and [2020]?
Difference: ['[architecture', '[2018]', 'description]', '[2020]?']



100%|██████████| 170/170 [02:02<00:00,  1.39it/s]
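
The `Difference` lists in the output above appear to hold the whitespace-separated tokens that occur in the corrected template but not in the original one. The cell that produces them is not shown here, so the following is only a minimal sketch under that assumption; the helper name `template_difference` is hypothetical and not part of `sqa_system`.

```python
def template_difference(original: str, corrected: str) -> list[str]:
    """Return tokens of `corrected` that do not occur in `original`.

    Assumption: tokens are obtained by splitting on whitespace, so
    punctuation stays attached to a token (e.g. '[year2]?').
    """
    original_tokens = set(original.split())
    seen = set()
    diff = []
    for token in corrected.split():
        # Keep first-occurrence order so the result is deterministic.
        if token not in original_tokens and token not in seen:
            seen.add(token)
            diff.append(token)
    return diff


original = "How many papers investigate [research object name] in [year] in comparison to [year]?"
corrected = "How many papers investigate [research object name] in [year1] in comparison to [year2]?"
print(template_difference(original, corrected))
```

Under this reading, an empty list (as reported for line 125) means the LLM returned the template unchanged even though it flagged an issue, so such rows can likely be skipped during manual review.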
