# Regression test case selection using quantum algorithms
The aim of this project is to exploit quantum algorithms and quantum computers to optimize test case selection for regression testing. Each test case is represented by a single qubit that can be in a superposition of the states 0 and 1 (meaning "not selected" and "selected", respectively). With this encoding, quantum algorithms can evaluate the subsets of the original test suite and look for those that preserve its coverage level while best optimizing execution cost and fault-revealing capability.
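To make the encoding concrete, the following is a minimal classical sketch, not the quantum algorithm itself: each candidate subset is a bitstring with one bit per test case, and the subsets that preserve the coverage of the full suite are compared by cost. The toy data (`coverage_by_test`, `cost_by_test`) and the helper `covered` are hypothetical and only serve the illustration.

```python
from itertools import product

# Hypothetical toy suite: statements covered by each test case, and its cost.
coverage_by_test = {0: {1, 2}, 1: {2, 3}, 2: {1, 2, 3}, 3: {4}}
cost_by_test = {0: 2, 1: 2, 2: 5, 3: 1}

n_tests = len(coverage_by_test)
# Coverage of the original (complete) test suite.
full_coverage = set().union(*coverage_by_test.values())


def covered(bits):
    """Return the statements covered by the tests whose bit is 1."""
    result = set()
    for test, bit in enumerate(bits):
        if bit:
            result |= coverage_by_test[test]
    return result


# Enumerate every bitstring (1 = "selected", 0 = "not selected") and keep
# only the subsets that preserve the coverage of the original suite.
feasible = []
for bits in product((0, 1), repeat=n_tests):
    if covered(bits) == full_coverage:
        cost = sum(cost_by_test[t] for t, bit in enumerate(bits) if bit)
        feasible.append((cost, bits))

# Cheapest coverage-preserving subset of the toy suite.
best_cost, best_bits = min(feasible)
print(best_cost, best_bits)
```

The exhaustive loop visits all 2^n bitstrings, which is only practical for a handful of test cases; this blow-up is the reason the project looks at quantum algorithms for the search.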

In [23]:
#this cell contains all the imports needed by the pipeline
import pandas as pd
from IPython.display import display
import json

In [24]:
#this cell contains all variable definitions that will be useful throughout the entire project
sir_programs=["flex","grep","gzip","sed"]
sir_programs_tests_number={"flex":567,"grep":807,"gzip":214,"sed":360}

## The pipeline dataset
To evaluate the performance of the solution proposed in this work and to compare its results with those of state-of-the-art solutions, four public programs were downloaded from the SIR website. SIR (Software-artifact Infrastructure Repository) is a repository of software-related artifacts meant to support rigorous controlled experimentation with program analysis and software testing techniques, and education in controlled experimentation.

### Chosen SIR Programs
The programs used for experimentation are all written in C:
- flex (a program that generates a lexical analysis program, based on regular expressions and C statements contained in one or more input files);
- grep (a program that searches a file for matching patterns);
- gzip (a program that replaces files, generally text files or web pages, with their compressed versions);
- sed (a powerful program for stream text editing).

### Needed information
For each of the four programs, the quantum algorithm needs the following information:
- a fault matrix: it indicates whether a given test case revealed a fault in the source code during previous executions;
- execution cost: it indicates the sum of the execution costs of all the test cases of the suite;
- statement coverage: it indicates the statement coverage information of every test case.

All this information has been gathered through previous experimentation on the four programs mentioned above and written to files organized in the SIR_Programs folder. The first goal of the project is therefore to read the data from these files for computational purposes.

In [25]:
#let's make a function to read the fault matrices
#IMPORTANT: every fault-matrix file must be renamed "fault-matrix.txt" and must follow the same format used by the files of this project
def get_fault_matrix(program_name:str):
    #open the fault-matrix file of the desired SIR program (closed automatically by "with")
    with open("SIR_Programs/"+program_name+"/fault-matrix.txt") as program_file:
        lines = program_file.readlines()
    
    #the length of each line (without the trailing newline) = number of versions used for historical info
    versions = len(lines[0].rstrip("\n"))
    
    #build a dictionary like this: dict={'v1':[0,1,0],'v2':[0,0,1],...} where keys
    #are the columns and indicate the versions, values are the 0/1 flags read
    #from the file (stored as characters) and ROWS ARE TEST CASES
    faults_dictionary = {"v"+str(version+1):[] for version in range(versions)}
    for line in lines:
        for version in range(versions):
            faults_dictionary['v'+str(version+1)].append(line[version])
            
    #now let's make the pandas dataframe
    faults_matrix = pd.DataFrame(data = faults_dictionary)
    print(program_name.capitalize()+" fault matrix.\nRows are the test cases, columns are the versions.")
    display(faults_matrix)
    
    return faults_matrix
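
The file layout that `get_fault_matrix` expects can be illustrated on an in-memory sample. This is a minimal sketch under the assumption, taken from the indexing in the function above, that every line holds one 0/1 character per version with no separators; the sample content and the helper `parse_fault_matrix` are hypothetical, and plain dictionaries are used instead of pandas to keep the example dependency-free.

```python
import io

# Hypothetical file content: three test cases (rows), four versions (columns).
sample_text = "0100\n0110\n0000\n"


def parse_fault_matrix(text):
    """Parse 0/1 rows into a {version: [flag per test case]} dictionary."""
    rows = [line.strip() for line in io.StringIO(text) if line.strip()]
    n_versions = len(rows[0])
    return {
        "v" + str(v + 1): [int(row[v]) for row in rows]
        for v in range(n_versions)
    }


sample_matrix = parse_fault_matrix(sample_text)

# Number of versions in which each test case revealed a fault.
n_tests = len(sample_matrix["v1"])
faults_per_test = [
    sum(column[test] for column in sample_matrix.values())
    for test in range(n_tests)
]
print(sample_matrix)
print(faults_per_test)
```

Unlike the notebook function, this sketch converts the flags to integers so that they can be summed directly.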

In [26]:
#the next function reads the json coverage information file of each test case
#of a sir program to gather information about test case costs and coverage
def cost_and_coverage_information_gathering(program_name:str):
    test_suite_execution_cost = 0
    executed_lines_test_by_test = dict()
    
    #the gzip json files are named "allfile<N>", the others "<program_name><N>"
    if program_name == "gzip":
        json_name = "allfile"
    else:
        json_name = program_name
    
    for test_case in range(sir_programs_tests_number[program_name]):
        #to open the correct file, we must remember that the folders and the json files are
        #numbered from 1 and not from 0
        json_path = "SIR_Programs/"+program_name+"/json_"+program_name+"/t"+str(test_case+1)+"/"+json_name+str(test_case+1)+".gcov.json"
        
        #read the JSON object as a dictionary
        with open(json_path) as test_case_json:
            json_data = json.load(test_case_json)
        
        #for programs whose coverage spans more than one file, the line numbering of a file
        #continues from the last line of the preceding file
        i = 0
        
        for file in json_data["files"]:
            line_count_start = i
            for line in file["lines"]:
                #if a line is executed, we want to remember FOR THAT LINE which tests
                #executed it, and we want to increment the execution cost
                if not line["unexecuted_block"]:
                    #the test suite exec cost = sum of the exec freq. of each executed basic block
                    #by each test case
                    test_suite_execution_cost += line["count"]
                    
                    #append the test case to the list of tests that executed this line
                    executed_lines_test_by_test.setdefault(line_count_start + line["line_number"], []).append(test_case)
                i = line["line_number"]
                        
    return test_suite_execution_cost, executed_lines_test_by_test
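
The gathering logic above can be tried on a tiny in-memory report. This is a minimal sketch: the two JSON strings are hypothetical and heavily shortened, keeping only the keys the function reads (`files`, `lines`, `line_number`, `count`, `unexecuted_block`), and it assumes a single source file per report, so no line offset between files is needed.

```python
import json

# Hypothetical, heavily shortened gcov JSON reports, one per test case.
reports = [
    '{"files": [{"file": "a.c", "lines": ['
    '{"line_number": 1, "count": 3, "unexecuted_block": false},'
    '{"line_number": 2, "count": 0, "unexecuted_block": true}]}]}',
    '{"files": [{"file": "a.c", "lines": ['
    '{"line_number": 1, "count": 1, "unexecuted_block": false},'
    '{"line_number": 2, "count": 4, "unexecuted_block": false}]}]}',
]

total_cost = 0
tests_by_line = {}

for test_case, raw_report in enumerate(reports):
    report = json.loads(raw_report)
    for source_file in report["files"]:
        for line in source_file["lines"]:
            # Same filter as the notebook function: skip lines flagged
            # as containing an unexecuted block.
            if not line["unexecuted_block"]:
                total_cost += line["count"]
                tests_by_line.setdefault(line["line_number"], []).append(test_case)

print(total_cost)
print(tests_by_line)
```

Here line 1 is executed by both test cases and line 2 only by the second one, so the resulting dictionary maps each line number to the list of test cases that executed it, which is the same shape as the `coverage` values built by the notebook.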
                    

In [27]:
#we can now gather all the historical information about past fault detection
faults_matrices = []
for sir_program in sir_programs:
    faults_matrices.append(get_fault_matrix(sir_program))

#faults_matrices will contain all the matrices we will need
print(faults_matrices)

Flex fault matrix.
Rows are the test cases, columns are the versions.


test case,v1,v2,v3,v4,v5
0,0,0,0,0,0
1,0,1,1,1,0
2,0,0,0,0,0
3,0,1,1,1,0
4,0,1,1,1,0
...,...,...,...,...,...
562,0,1,1,1,0
563,0,1,1,1,0
564,0,1,1,1,0
565,0,1,1,1,0


Grep fault matrix.
Rows are the test cases, columns are the versions.


test case,v1,v2,v3,v4,v5
0,0,0,0,0,0
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0
...,...,...,...,...,...
804,0,0,1,0,0
805,0,0,0,0,0
806,0,0,0,0,0
807,0,0,0,0,0


Gzip fault matrix.
Rows are the test cases, columns are the versions.


test case,v1,v2,v3,v4,v5
0,0,0,0,0,0
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0
...,...,...,...,...,...
209,1,0,0,0,0
210,1,0,0,0,0
211,1,0,0,0,0
212,1,0,0,0,0


Sed fault matrix.
Rows are the test cases, columns are the versions.


test case,v1,v2,v3,v4,v5,v6
0,1,0,0,0,0,0
1,1,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,1,0,0,0,0,0
...,...,...,...,...,...,...
355,0,0,0,0,1,0
356,0,0,0,0,0,0
357,0,0,0,0,0,0
358,0,0,0,0,0,0


[    v1 v2 v3 v4 v5
0    0  0  0  0  0
1    0  1  1  1  0
2    0  0  0  0  0
3    0  1  1  1  0
4    0  1  1  1  0
..  .. .. .. .. ..
562  0  1  1  1  0
563  0  1  1  1  0
564  0  1  1  1  0
565  0  1  1  1  0
566  0  1  1  1  0

[567 rows x 5 columns],     v1 v2 v3 v4 v5
0    0  0  0  0  0
1    0  0  0  0  0
2    0  0  0  0  0
3    0  0  0  0  0
4    0  0  0  0  0
..  .. .. .. .. ..
804  0  0  1  0  0
805  0  0  0  0  0
806  0  0  0  0  0
807  0  0  0  0  0
808  0  0  0  0  0

[809 rows x 5 columns],     v1 v2 v3 v4 v5
0    0  0  0  0  0
1    0  0  0  0  0
2    0  0  0  0  0
3    0  0  0  0  0
4    0  0  0  0  0
..  .. .. .. .. ..
209  1  0  0  0  0
210  1  0  0  0  0
211  1  0  0  0  0
212  1  0  0  0  0
213  1  0  0  0  0

[214 rows x 5 columns],     v1 v2 v3 v4 v5 v6
0    1  0  0  0  0  0
1    1  0  0  0  0  0
2    0  0  0  0  0  0
3    0  0  0  0  0  0
4    1  0  0  0  0  0
..  .. .. .. .. .. ..
355  0  0  0  0  1  0
356  0  0  0  0  0  0
357  0  0  0  0  0  0
358  0  0  0  0  0  

In [30]:
#we can now gather cost and coverage information
costs = {"flex":None,"grep":None,"gzip":None,"sed":None}
coverage = {"flex":None,"grep":None,"gzip":None,"sed":None}

for sir_program in sir_programs:
    costs_and_coverage = cost_and_coverage_information_gathering(sir_program)
    costs[sir_program] = costs_and_coverage[0]
    coverage[sir_program] = costs_and_coverage[1]

print(costs)
print(coverage["sed"])

{'flex': 210649552, 'grep': 238953909, 'gzip': 350943086, 'sed': 20774515}
{146: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205