Define imports

In [1]:
import pandas as pd
import random

Read dataset file

In [2]:
dataset_file_path = 'input/pr-data.csv'

In [3]:
df = pd.read_csv(dataset_file_path)

In [4]:
df.shape

(6387, 8)

Randomly obtain 10 projects

In [5]:
# Define a seed to make this replicable
seed_value = 321

In [6]:
# Keep only projects (Project URL + SHA Detected) with at least 10 OD tests, so that 10 sampled projects yield at least 100 OD tests
projects_with_od_count = df.groupby(["Project URL","SHA Detected"])['Category'].apply(lambda x: (x.isin(['OD', 'OD-Brit', 'OD-Vic'])).sum() >= 10).reset_index(name='Min 10 OD Tests')
filtered_projects = projects_with_od_count[projects_with_od_count['Min 10 OD Tests']]
filtered_projects = filtered_projects.drop(columns=['Min 10 OD Tests'])
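
The filter above can be illustrated on a small example. This is a minimal sketch, not the notebook's actual data: the `toy` frame, the `min_od_tests` threshold of 3 (the notebook uses 10), and the helper column `is_od` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: project A has 3 OD-type tests, project B has 1.
toy = pd.DataFrame({
    "Project URL": ["A", "A", "A", "A", "B", "B"],
    "SHA Detected": ["a1", "a1", "a1", "a1", "b1", "b1"],
    "Category": ["OD", "OD-Brit", "OD-Vic", "ID", "OD", "NOD"],
})

od_categories = ["OD", "OD-Brit", "OD-Vic"]
min_od_tests = 3  # the notebook uses 10; lowered here to fit the toy data

# Count OD-type tests per (project, SHA) pair.
od_counts = (
    toy.assign(is_od=toy["Category"].isin(od_categories))
       .groupby(["Project URL", "SHA Detected"])["is_od"]
       .sum()
       .reset_index(name="OD Tests")
)

# Keep only the pairs that reach the threshold.
toy_filtered = od_counts.loc[
    od_counts["OD Tests"] >= min_od_tests, ["Project URL", "SHA Detected"]
]
print(toy_filtered)
```

Keeping the count as a number rather than a boolean makes it easy to inspect how far each project is from the threshold before filtering.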


In [7]:
# Randomly sample 10 projects
sampled_projects = filtered_projects.sample(n=10, random_state=seed_value)
sampled_projects

Unnamed: 0,Project URL,SHA Detected
508,https://github.com/spring-cloud/spring-cloud-n...,dc67500445d3ce7382771e39b64ca93bbebc04c7
399,https://github.com/json-iterator/java,6925cf4c19d313504b416f58a349a36bf563e0e1
141,https://github.com/apache/archiva,292dbe1bb4323dd299d36b78f37d9c1d55c889f8
229,https://github.com/apache/nifi,12015a17dd93a1d42c9d6ddab6cc5ce606fef16a
213,https://github.com/apache/incubator-ratis,bc9d7615d8ffa30e79a36b9fd1950af38f0f6a49
119,https://github.com/alibaba/fastjson,e05e9c5e4be580691cc55a59f3256595393203a1
134,https://github.com/alibaba/wasp,b2593d8e4b31ca6da0cd2f3e18356338d9b6dace
1,https://github.com/Activiti/Activiti,b11f757a48600e53aaf3fcb7a3ba1ece6c463cb4
201,https://github.com/apache/hive,54e43339dd671018fc70ebb5d9f0b292d70391a6
507,https://github.com/spring-cloud/spring-cloud-k...,3351926041a630aee0961ba0e1be8f035e4ba2ca
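
As a small check of the replicability claim, the sketch below samples twice from a hypothetical `pool` frame with the same `random_state` and compares the results. The frame and its size are illustrative assumptions. It also shows that sampled rows keep their original index labels, which is why the first column of the output above is not sequential.

```python
import pandas as pd

# Hypothetical pool of 20 candidate projects.
pool = pd.DataFrame({"project": [f"p{i}" for i in range(20)]})

# The same seed on the same frame selects the same rows in the same order.
first = pool.sample(n=5, random_state=321)
second = pool.sample(n=5, random_state=321)

print(first.equals(second))
print(first.index.tolist())  # original index labels are preserved
```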


Experimenting with the dataset revealed that 3 of the selected projects have artifacts that can no longer be resolved at the detected SHAs. Therefore, 3 more projects are sampled from the remaining candidates.

In [8]:
filtered_projects = filtered_projects[~filtered_projects.index.isin(sampled_projects.index)]
additional_projects = filtered_projects.sample(n=3, random_state=seed_value)
additional_projects

Unnamed: 0,Project URL,SHA Detected
96,https://github.com/Thomas-S-B/visualee,88732d9dbe5031dad9c9f85a4c4b35e5f1551f95
379,https://github.com/j256/ormlite-core,632b87c2a455b8eab4a6c09324e1f166273588d8
260,https://github.com/apache/shardingsphere-elast...,bdfcaff0c1a702c3ecb44adf46d609a3f0e86c5e
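
The resampling step relies on removing the already-sampled rows by index label before drawing again, so a project cannot be picked twice. A minimal sketch on a hypothetical `pool` frame (the names and sizes are assumptions, not the notebook's data):

```python
import pandas as pd

# Hypothetical pool of 20 candidate projects.
pool = pd.DataFrame({"project": [f"p{i}" for i in range(20)]})

first = pool.sample(n=5, random_state=321)

# Exclude rows that were already drawn, then sample from what is left.
remaining = pool[~pool.index.isin(first.index)]
extra = remaining.sample(n=3, random_state=321)

print(len(remaining))
print(set(first.index).isdisjoint(extra.index))
```

Reusing the same seed is safe here because the second draw runs on a different (smaller) frame.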


Further experimentation revealed that 1 of the additional projects also has artifacts that can no longer be resolved at the detected SHA. Therefore, 1 more project is sampled.

In [9]:
filtered_projects = filtered_projects[~filtered_projects.index.isin(additional_projects.index)]
additional_projects2 = filtered_projects.sample(n=1, random_state=seed_value)
additional_projects2

Unnamed: 0,Project URL,SHA Detected
389,https://github.com/jenkinsci/remoting,abf0455a68ad6c52a57e912bb89d51f883f77542


Remove all projects with unresolvable artifacts

In [13]:
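# Index labels refer to the sample outputs above:
# 1 = Activiti/Activiti, 141 = apache/archiva, 213 = apache/incubator-ratis
sampled_projects = sampled_projects[~sampled_projects.index.isin([1, 141, 213])]
# 260 = apache/shardingsphere-elast...
additional_projects = additional_projects[~additional_projects.index.isin([260])]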
final_projects = pd.concat([sampled_projects, additional_projects, additional_projects2], axis=0)
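
The removal and concatenation can be sketched on toy frames. The frames, project names, and the `unresolvable` list below are hypothetical; only the pattern (filter by index label, then `pd.concat`) mirrors the cell above.

```python
import pandas as pd

# Hypothetical samples that keep their original index labels.
sampled = pd.DataFrame({"project": ["a", "b", "c"]}, index=[1, 141, 508])
extra = pd.DataFrame({"project": ["d", "e"]}, index=[96, 260])

# Hypothetical labels of projects whose artifacts cannot be resolved.
unresolvable = [1, 141, 260]

kept = pd.concat(
    [
        sampled[~sampled.index.isin(unresolvable)],
        extra[~extra.index.isin(unresolvable)],
    ],
    axis=0,
)
print(kept)
```

In the notebook itself, the same bookkeeping gives 7 + 2 + 1 = 10 projects in `final_projects`.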

Export to Projects.csv

In [14]:
final_projects.to_csv('output/Projects.csv', index=False)

Obtain a dataframe with all the selected projects' tests

In [15]:
selected_projects = final_projects[['Project URL', 'SHA Detected']]
selected_tests = df[(df['SHA Detected'].isin(selected_projects['SHA Detected'])) &
                    (df['Project URL'].isin(selected_projects['Project URL']))]

# Reassign instead of using inplace=True, since selected_tests is a filtered slice of df
selected_tests = selected_tests.reset_index(drop=True)
# Report the total number of tests and the share of OD tests
print("Total tests amount:")
print(selected_tests.shape[0])

od_tests_count = selected_tests[selected_tests['Category'].isin(['OD', 'OD-Brit', 'OD-Vic'])].shape[0]
print("OD tests amount:")
print(od_tests_count)
print("OD proportion:")
print(round(od_tests_count/selected_tests.shape[0]*100,2),"%")

Total tests amount:
444
OD tests amount:
287
OD proportion:
64.64 %
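
Note that the selection checks `Project URL` and `SHA Detected` independently, so it would also keep a row whose URL belongs to one selected project and whose SHA belongs to another. Whether this affects the actual dataset is not shown here, and it is unlikely for commit SHAs. A stricter alternative is to match on the pair with an inner merge. The sketch below uses hypothetical toy frames to show the difference.

```python
import pandas as pd

# Hypothetical tests table with every URL/SHA combination.
tests = pd.DataFrame({
    "Project URL": ["u1", "u1", "u2", "u2"],
    "SHA Detected": ["s1", "s2", "s1", "s2"],
    "Test": ["t1", "t2", "t3", "t4"],
})

# Hypothetical selection: only the pairs (u1, s1) and (u2, s2).
selected = pd.DataFrame({
    "Project URL": ["u1", "u2"],
    "SHA Detected": ["s1", "s2"],
})

# Independent membership checks keep the cross combinations as well.
independent = tests[
    tests["SHA Detected"].isin(selected["SHA Detected"])
    & tests["Project URL"].isin(selected["Project URL"])
]

# An inner merge on both columns keeps only exact (URL, SHA) pairs.
paired = tests.merge(selected, on=["Project URL", "SHA Detected"], how="inner")

print(len(independent), len(paired))
```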


Export to Tests.csv

In [16]:
selected_tests

Unnamed: 0,Project URL,SHA Detected,Module Path,Fully-Qualified Test Name (packageName.ClassName.methodName),Category,Status,PR Link,Notes
0,https://github.com/alibaba/fastjson,e05e9c5e4be580691cc55a59f3256595393203a1,.,com.alibaba.json.bvt.asm.SortFieldTest.test_1,ID,Opened,https://github.com/alibaba/fastjson/pull/3525,https://github.com/TestingResearchIllinois/ido...
1,https://github.com/alibaba/fastjson,e05e9c5e4be580691cc55a59f3256595393203a1,.,com.alibaba.json.bvt.bug.Bug_for_smoothrat6.te...,ID,Accepted,https://github.com/alibaba/fastjson/pull/3117,
2,https://github.com/alibaba/fastjson,e05e9c5e4be580691cc55a59f3256595393203a1,.,com.alibaba.json.bvt.bug.Issue_717.test_for_issue,OD,,,
3,https://github.com/alibaba/fastjson,e05e9c5e4be580691cc55a59f3256595393203a1,.,com.alibaba.json.bvt.date.DateTest.test_date,OD,Accepted,https://github.com/alibaba/fastjson/pull/2148,
4,https://github.com/alibaba/fastjson,e05e9c5e4be580691cc55a59f3256595393203a1,.,com.alibaba.json.bvt.date.DateTest_tz.test_codec,OD,Accepted,https://github.com/alibaba/fastjson/pull/2148,
...,...,...,...,...,...,...,...,...
439,https://github.com/Thomas-S-B/visualee,88732d9dbe5031dad9c9f85a4c4b35e5f1551f95,visualee,de.strullerbaumann.visualee.ui.graph.boundary....,OD,Opened,https://github.com/Thomas-S-B/visualee/pull/8,https://github.com/TestingResearchIllinois/fla...
440,https://github.com/Thomas-S-B/visualee,88732d9dbe5031dad9c9f85a4c4b35e5f1551f95,visualee,de.strullerbaumann.visualee.ui.graph.boundary....,OD,Opened,https://github.com/Thomas-S-B/visualee/pull/8,https://github.com/TestingResearchIllinois/fla...
441,https://github.com/Thomas-S-B/visualee,88732d9dbe5031dad9c9f85a4c4b35e5f1551f95,visualee,de.strullerbaumann.visualee.ui.graph.boundary....,OD,Opened,https://github.com/Thomas-S-B/visualee/pull/8,https://github.com/TestingResearchIllinois/fla...
442,https://github.com/Thomas-S-B/visualee,88732d9dbe5031dad9c9f85a4c4b35e5f1551f95,visualee,de.strullerbaumann.visualee.ui.graph.control.D...,OD,Opened,https://github.com/Thomas-S-B/visualee/pull/8,https://github.com/TestingResearchIllinois/fla...


In [17]:
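# Remove extra commas from all values in the DataFrame
# Note: df_without_commas is built from the full dataset and is not used by the export below;
# to_csv quotes any field that contains a comma, so selected_tests is written unchanged.
df_without_commas = df.apply(lambda x: x.map(lambda y: str(y).replace(',', '')))
selected_tests.to_csv('output/Tests.csv', index=False, encoding='utf-8')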
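
To show why commas inside values do not need to be stripped before export, the sketch below writes a hypothetical frame to an in-memory buffer and reads it back. The `notes` frame is an illustrative assumption. With its default settings, `to_csv` quotes a field that contains the delimiter.

```python
import io

import pandas as pd

# Hypothetical frame with a comma inside one value.
notes = pd.DataFrame({
    "Test": ["t1", "t2"],
    "Notes": ["flaky, order-dependent", "ok"],
})

buffer = io.StringIO()
notes.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original values, comma included.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip["Notes"].tolist())
```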