# Find all usable projects from 2019 and 2021
All projects in both database versions fulfill the requirements set by the paper [The Technical Debt Dataset](https://doi.org/10.1145/3345629.3345630). However, the release notes in the Git repository state that version 2 adds some new projects and removes others ([Technical Debt Dataset GitHub: Release Notes](https://github.com/clowee/The-Technical-Debt-Dataset/releases)). To maximise the amount of data, it may be worth using the projects that were removed after version 1 alongside the new and updated ones from version 2.

In [1]:
import pandas as pd
import os

In [4]:
current_dir = os.getcwd()

# construct path to the project data folder
data_dir = os.path.join(current_dir, '..', 'Data','Projects_V1_V2')

# load project data
tdv1 = pd.read_csv(os.path.join(data_dir, 'PROJECTS_TD_V1.csv'))
tdv2 = pd.read_csv(os.path.join(data_dir, 'PROJECTS_TD_V2.csv'))

# add database source
tdv1['database'] = 'Version1'
tdv2['database'] = 'Version2'

print('--------------- VERSION 1 ---------------')
print(tdv1.head())
print('------------------------------------------------------------------------------------')
print('--------------- VERSION 2 ---------------')
print(tdv2.head())

--------------- VERSION 1 ---------------
  projectID                             gitLink  \
0  accumulo  https://github.com/apache/accumulo   
1    ambari    https://github.com/apache/ambari   
2     atlas     https://github.com/apache/atlas   
3    aurora    https://github.com/apache/aurora   
4     batik     https://github.com/apache/batik   

                                           jiraLink      sonarProjectKey  \
0  https://issues.apache.org/jira/projects/ACCUMULO  org:apache:accumulo   
1    https://issues.apache.org/jira/projects/AMBARI    org.apache:ambari   
2     https://issues.apache.org/jira/projects/ATLAS     org.apache:atlas   
3    https://issues.apache.org/jira/projects/AURORA    org.apache:aurora   
4     https://issues.apache.org/jira/projects/BATIK     org.apache:batik   

   database  
0  Version1  
1  Version1  
2  Version1  
3  Version1  
4  Version1  
------------------------------------------------------------------------------------
--------------- VERSION 2

In [13]:
print(f'Version1 variable names: {tdv1.columns.tolist()}')
print(f'Version2 variable names: {tdv2.columns.tolist()}')

Version1 variable names: ['projectID', 'gitLink', 'jiraLink', 'sonarProjectKey', 'database']
Version2 variable names: ['PROJECT_KEY', 'GIT_LINK', 'JIRA_LINK', 'SONAR_PROJECT_KEY', 'PROJECT_ID', 'database']


In [18]:
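# rename the version 1 columns to the version 2 names (projectID holds the short project name, so it maps to PROJECT_KEY)
v1_to_v2 = {'projectID': 'PROJECT_KEY', 'gitLink': 'GIT_LINK', 'jiraLink': 'JIRA_LINK', 'sonarProjectKey': 'SONAR_PROJECT_KEY'}
tdv1 = tdv1.rename(columns=v1_to_v2)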
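
After the renaming step, version 1 still has no `PROJECT_ID` column, so rows taken from it will presumably carry missing values there once the two tables are stacked. The following is a minimal sketch of how that could be checked. It uses one-row toy frames (`v1`, `v2`) built from values in the output above, not the real CSV files, and the variable names are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for the two project tables (values copied from the printed output;
# the real tables are much larger).
v1 = pd.DataFrame({
    'PROJECT_KEY': ['accumulo'],
    'GIT_LINK': ['https://github.com/apache/accumulo'],
    'JIRA_LINK': ['https://issues.apache.org/jira/projects/ACCUMULO'],
    'SONAR_PROJECT_KEY': ['org:apache:accumulo'],
    'database': ['Version1'],
})
v2 = pd.DataFrame({
    'PROJECT_KEY': ['batik'],
    'GIT_LINK': ['https://github.com/apache/batik'],
    'JIRA_LINK': ['https://issues.apache.org/jira/projects/BATIK'],
    'SONAR_PROJECT_KEY': ['org.apache:batik'],
    'PROJECT_ID': ['org.apache:batik'],
    'database': ['Version2'],
})

# Columns present in one table but not in the other.
missing_in_v1 = sorted(set(v2.columns) - set(v1.columns))
missing_in_v2 = sorted(set(v1.columns) - set(v2.columns))
print('missing in v1:', missing_in_v1)
print('missing in v2:', missing_in_v2)

# pd.concat aligns on column names and fills the gaps with NaN.
stacked = pd.concat([v2, v1], ignore_index=True)
print(stacked[['PROJECT_KEY', 'PROJECT_ID', 'database']])
```

If downstream steps rely on `PROJECT_ID`, the version 1 rows would need that value filled in or handled separately.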

In [15]:
# Get the project keys already present in tdv2
existing_project_keys = set(tdv2['PROJECT_KEY'])

# Filter tdv1 to include only rows with project keys not in tdv2
rows_to_add = tdv1[~tdv1['PROJECT_KEY'].isin(existing_project_keys)]

# Concatenate tdv2 and the filtered rows from tdv1
combined_df = pd.concat([tdv2, rows_to_add], ignore_index=True)
combined_df

|   | PROJECT_KEY | GIT_LINK | JIRA_LINK | SONAR_PROJECT_KEY | PROJECT_ID | database |
|---|---|---|---|---|---|---|
| 0 | batik | https://github.com/apache/batik | https://issues.apache.org/jira/projects/BATIK | org.apache:batik | org.apache:batik | Version2 |
| 1 | commons-bcel | https://github.com/apache/commons-bcel | https://issues.apache.org/jira/projects/BCEL | org.apache:bcel | org.apache:bcel | Version2 |
| 2 | commons-beanutils | https://github.com/apache/commons-beanutils | https://issues.apache.org/jira/projects/BEANUTILS | org.apache:beanutils | org.apache:beanutils | Version2 |
| 3 | cocoon | https://github.com/apache/cocoon | https://issues.apache.org/jira/projects/COCOON | org.apache:cocoon | org.apache:cocoon | Version2 |
| 4 | commons-codec | https://github.com/apache/commons-codec | https://issues.apache.org/jira/projects/CODEC | org.apache:codec | org.apache:codec | Version2 |
| 5 | commons-collections | https://github.com/apache/commons-collections | https://issues.apache.org/jira/projects/COLLEC... | org.apache:collections | org.apache:collections | Version2 |
| 6 | commons-cli | https://github.com/apache/commons-cli | https://issues.apache.org/jira/projects/CLI | org.apache:commons-cli | org.apache:commons-cli | Version2 |
| 7 | commons-exec | https://github.com/apache/commons-exec | https://issues.apache.org/jira/projects/EXEC | org.apache:commons-exec | org.apache:commons-exec | Version2 |
| 8 | commons-fileupload | https://github.com/apache/commons-fileupload | https://issues.apache.org/jira/projects/FILEUP... | org.apache:commons-fileupload | org.apache:commons-fileupload | Version2 |
| 9 | commons-io | https://github.com/apache/commons-io | https://issues.apache.org/jira/projects/IO/ | org.apache:commons-io | org.apache:commons-io | Version2 |
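
To see how many projects come from each source before trusting the combined table, one option is an outer merge on `PROJECT_KEY` with pandas' `indicator=True` flag. This is only a sketch on toy key lists: `batik` appears in both versions in the output above, but the remaining entries of `v1_keys` and `v2_keys` are illustrative and do not reflect verified membership in either release.

```python
import pandas as pd

# Toy key lists; the names of the frames and their contents are illustrative only.
v1_keys = pd.DataFrame({'PROJECT_KEY': ['accumulo', 'ambari', 'batik']})
v2_keys = pd.DataFrame({'PROJECT_KEY': ['batik', 'commons-io']})

# indicator=True adds a '_merge' column with the values
# 'left_only' (only in v1), 'right_only' (only in v2) or 'both'.
overlap = v1_keys.merge(v2_keys, on='PROJECT_KEY', how='outer', indicator=True)
counts = overlap['_merge'].value_counts().to_dict()
print(counts)

# Projects that exist only in version 1, i.e. the candidates to add back.
only_v1 = sorted(overlap.loc[overlap['_merge'] == 'left_only', 'PROJECT_KEY'])
print(only_v1)
```

Applied to the real tables, the `left_only` count should equal the number of rows appended from version 1, provided the project keys are spelled identically in both versions.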


In [17]:
#combined_df.to_csv(path_or_buf = os.path.join(data_dir, 'potential_projects.csv'), index = False)
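
Before the export is enabled, it may be worth confirming that no project key occurs twice and that the file reads back with the same rows. The sketch below does this on a two-row toy frame and writes to a temporary directory rather than to the real data folder; the frame `combined` and its contents are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy combined table: one version 2 row and one version 1 row without PROJECT_ID.
combined = pd.DataFrame({
    'PROJECT_KEY': ['batik', 'accumulo'],
    'PROJECT_ID': ['org.apache:batik', None],
    'database': ['Version2', 'Version1'],
})

# A repeated key would mean the same project was taken from both versions.
has_duplicates = bool(combined['PROJECT_KEY'].duplicated().any())
print('duplicate keys:', has_duplicates)

# Write to a throwaway location and read the file back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'potential_projects.csv')
    combined.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_keys = reloaded['PROJECT_KEY'].tolist() == combined['PROJECT_KEY'].tolist()
print('round trip keeps keys:', same_keys)
```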