# JSS'19 - Gkortzis et al. - Data Collection

This notebook describes and performs the following steps necessary to collect all data used in the study:

1. [Requirements](#requirements)
2. [Download projects from GitHub](#download)
3. [Detect Maven root directories](#detect_root_dirs)
4. [Compile and install Maven projects](#install)
5. [Retrieve projects' dependencies](#dependencies)
6. [Run SpotBugs](#spotbugs)
7. [Extract metrics and create analysis dataset](#metrics)

<a id="requirements"></a>
## Requirements

The external tools required to run the analysis can be obtained automatically by executing the `download-vendor-tools.sh` script in the __analysis/tooling__ directory. The following runtime environments and tools should also be available on your system:
1. OpenJDK 8 and OpenJDK 11 (some projects can be built only with version 1.8)
2. Python3
3. Unzip
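
Before running the notebook, it can help to confirm that these tools are reachable from the `PATH`. The following is a minimal sketch: the executable names (`java`, `python3`, `unzip`, `git`, `mvn`) are assumptions about how the tools are installed, `git` and `mvn` are inferred from the later steps rather than listed above, and the helper `missing_tools` is illustrative and not part of the tooling.

```python
import shutil

# Executable names are assumptions; adjust them to your installation.
REQUIRED_TOOLS = ["java", "python3", "unzip", "git", "mvn"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

This check only verifies that an executable exists; it does not verify that both JDK versions are installed.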


<a id="download"></a>
## Download Projects from GitHub

Given a list of GitHub Maven repositories, the first step is to clone these repositories locally. This is the task that the `github_downloader` script performs.

The `execute_downloader` function requires three parameters:
1. `credentials` : `String`. The GitHub credentials in a specific format (`github_username:github_token`)
2. `repository_list` : `Filepath`. The list of repositories to clone locally
3. `download_directory` : `Directory`. The full path of the directory where the repositories will be cloned

Additionally, there is an optional parameter:
4. `update_existing` : `Boolean`. Can be used to update (perform a `git pull` on) an already existing repository.
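
To make the role of each parameter concrete, the sketch below builds the `git` commands that a downloader of this kind might run. It is not the implementation of `github_downloader`: the `owner/name` repository format, the `owner_name` target directory naming, and the helper names are assumptions made for illustration.

```python
import os


def build_clone_command(credentials, repository, download_directory):
    """Build the `git clone` command for one GitHub repository.

    `credentials` uses the `github_username:github_token` format.
    `repository` is an `owner/name` string (an assumed input format).
    """
    owner, name = repository.split("/", 1)
    url = f"https://{credentials}@github.com/{owner}/{name}.git"
    # Target directory naming (`owner_name`) is an assumption of this sketch.
    target = os.path.join(download_directory, f"{owner}_{name}")
    return ["git", "clone", url, target]


def build_update_command(repository_directory):
    """Build the `git pull` command for an already cloned repository."""
    return ["git", "-C", repository_directory, "pull"]
```

Each returned list can be passed to `subprocess.run(command, check=True)` to execute it.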

In [None]:
import github_downloader

credentials = "github_username:github_token" # replace this value with your personal credentials
repository_list = "../maven_starred_sorted_all.csv"
download_directory = "/media/agkortzis/Data/maven_repos" # replace this value

github_downloader.execute_downloader(credentials, repository_list, download_directory)
# To also update repositories that already exist locally:
# github_downloader.execute_downloader(credentials, repository_list, download_directory, update_existing=True)


<a id="detect_root_dirs"></a>
## Detect Maven root directories
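
This step has no description of its own, so the following is a minimal sketch of one plausible approach, not the logic of `repository_path_retriever`: walk each repository and keep the shallowest directories that contain a `pom.xml`. A repository with exactly one such directory has an unambiguous Maven root; one with several would correspond to the "alternative directories" case. The function name `find_maven_roots` is hypothetical.

```python
import os


def find_maven_roots(repository_directory):
    """Return the shallowest directories under `repository_directory` holding a pom.xml."""
    roots = []
    for current_dir, subdirs, files in os.walk(repository_directory):
        # Skip git metadata and visit subdirectories in a stable order.
        subdirs[:] = sorted(d for d in subdirs if d != ".git")
        if "pom.xml" in files:
            roots.append(current_dir)
            # Nested poms belong to this parent project, so do not descend further.
            subdirs[:] = []
    return roots
```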

In [None]:
import repository_path_retriever

# File that will store each repository with its detected Maven root directory
repositories_root_directories_file = "../repositories_root_directories.csv"
# File that stores the paths of repositories with more than one Maven parent project
repositories_alternative_directories_file = "../repositories_alternative_directories_file.csv"
repository_path_retriever.detect_configuration_file(download_directory, repositories_root_directories_file, repositories_alternative_directories_file)

<a id="install"></a>
## Compile and install Maven projects

The next step after downloading the projects is to perform a Maven install. This process collects each project's dependencies and generates its `.jar` file. Both the `.jar` file and the dependencies are stored in the `.m2` directory, under each project's directory.
This `.m2` root directory is, by default, located under the user's home folder (`/home/user/.m2`).

The `execute_mvn_install` requires one argument:
1. `root_directory` : `Directory`. The full path of the directory in which all repositories (downloaded by the previous step) are located.

Users can also define the following two optional parameters:
2. `Boolean`. Perform a `mvn clean` on each repository before compiling it again, and
3. `Datetime`. Check out each repository at a specific date. The date should be formatted as `YYYY-MM-DD` (for example, `2018-12-25`).
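
The options above map naturally onto standard `mvn` and `git` invocations. The sketch below builds such commands; it is an illustration under stated assumptions, not the code of `project_installer`. In particular, skipping tests with `-DskipTests` and the helper names are assumptions of this example.

```python
import datetime


def build_install_command(pom_directory, clean=False, skip_tests=True):
    """Build the Maven command that installs the project rooted at `pom_directory`."""
    goals = ["clean", "install"] if clean else ["install"]
    command = ["mvn", "-f", pom_directory] + goals
    if skip_tests:
        # Skipping tests is an assumption of this sketch, not a documented option.
        command.append("-DskipTests")
    return command


def build_commit_lookup_command(repository_directory, date):
    """Build the git command that prints the last commit made before `date`."""
    # Raises ValueError if `date` cannot be parsed as YYYY-MM-DD.
    datetime.datetime.strptime(date, "%Y-%m-%d")
    return ["git", "-C", repository_directory, "rev-list", "-n", "1", f"--before={date}", "HEAD"]
```

The commit printed by the second command can then be passed to `git checkout` before running the install command.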

In [None]:
import project_installer

root_directory = download_directory # replace this value
repositories_sucessfully_build_list = "../sucessfully_built_repositories.csv" # file that will list the repositories that were built successfully
project_installer.install_all_repositories(root_directory,repositories_root_directories_file,repositories_sucessfully_build_list)
#project_installer.install_all_repositories(root_directory, repository_list, build_list_file, clean_repository_before_install, skip_maven, skip_gradle)

<a id="dependencies"></a>
## Retrieve project dependencies

Having a local copy of each Maven repository, we proceed with retrieving its dependency tree. Each tree will be stored in a separate file (with the `.trees` suffix) for further analysis, as we describe in the next [step](#spotbugs). If a project consists of more than one module, a separate tree for each module will be stored in the `.trees` file.

This step requires two parameters:
1. `root_directory` : `Directory`. The full path of the directory that stores the repositories 
2. `output_directory` : `Directory`. The full path of the directory that will store the `.trees` files. 
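
For orientation, a dependency tree can be written to a file with the `dependency:tree` goal of the Maven Dependency Plugin. The sketch below builds such a command; whether `dependency_extractor` uses exactly these options is an assumption, and the helper names are hypothetical.

```python
import os


def trees_file_for(output_directory, project_name):
    """Return the path of the `.trees` file for one project."""
    return os.path.join(output_directory, project_name + ".trees")


def build_dependency_tree_command(pom_directory, trees_file):
    """Build the Maven command that writes the dependency tree(s) to `trees_file`."""
    return [
        "mvn", "-f", pom_directory, "dependency:tree",
        # Use an absolute path so that every module writes to the same file.
        "-DoutputFile=" + trees_file,
        # Append instead of overwriting, so each module's tree is kept.
        "-DappendOutput=true",
    ]
```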

In [None]:
import os

import dependency_extractor

os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling')
os.getcwd()

root_directory = "/media/agkortzis/Data/maven_repos/" # replace this value
output_directory = "/home/agkortzis/git_repos/ICSR19/analysis/data/" # replace this value
#repositories_sucessfully_build_list = repositories_sucessfull_build_list # from step "Compile and install Maven projects"
repositories_sucessfully_build_list = "../successfuly_built_maven_repos_part2.txt" # from step "Compile and install Maven projects"
repositories_root_directories_file = "../downloaded_repos_with_maven_rootpaths.txt" # from step "Detect Maven root directories"

dependency_extractor.execute(root_directory, output_directory, repositories_sucessfully_build_list, repositories_root_directories_file)

<a id="spotbugs"></a>
## Run SpotBugs

With the installed projects, the next step is to run SpotBugs.
For that, we use the `.trees` files, which contain the dependency tree for each module built for the project.

Thus, for each project (i.e., each `.trees` file), the next sub-steps are:
1. Parse the `.trees` file
2. Ignore modules that do not contain Java code (packaging is neither `.jar` nor `.war`)
3. For each remaining tree (i.e., for each `.jar`/`.war` module):
   1. Select relevant dependencies (i.e., compile dependencies)
   2. Verify that the main package and its dependencies are installed in the `.m2` local repository
   3. Run SpotBugs for the set **[module] + [dependencies]**
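
The sub-steps above can be sketched on plain Maven coordinates. This is a simplified illustration and not the `maven` module used below: it assumes coordinates of the form `groupId:artifactId:packaging:version[:scope]` (no classifier) and the default local repository layout under `<m2>/repository`. The `Artifact` tuple and all function names are hypothetical.

```python
import os
from collections import namedtuple

Artifact = namedtuple("Artifact", "group_id artifact_id packaging version scope")


def parse_coordinates(coordinates):
    """Parse `groupId:artifactId:packaging:version[:scope]` into an Artifact."""
    parts = coordinates.strip().split(":")
    if len(parts) == 4:  # a module root carries no scope
        parts.append(None)
    if len(parts) != 5:
        raise ValueError(f"Unsupported coordinates: {coordinates}")
    return Artifact(*parts)


def is_java_module(artifact):
    """Keep only modules packaged as .jar or .war."""
    return artifact.packaging in ("jar", "war")


def compile_dependencies(artifacts):
    """Select the dependencies with compile scope."""
    return [a for a in artifacts if a.scope == "compile"]


def m2_path(artifact, m2_directory):
    """Return the expected location of the artifact in the local repository."""
    file_name = f"{artifact.artifact_id}-{artifact.version}.{artifact.packaging}"
    return os.path.join(m2_directory, "repository", *artifact.group_id.split("."),
                        artifact.artifact_id, artifact.version, file_name)


def is_installed(artifact, m2_directory):
    """Check that the artifact file exists in the local repository."""
    return os.path.isfile(m2_path(artifact, m2_directory))
```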


In [None]:
import os
import logging
import datetime

import maven
import spotbugs

logging.basicConfig(level=logging.INFO)

os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling')
os.getcwd()
path_to_data = os.path.abspath('../repositories_data') #TODO: replace with relative
path_to_m2_directory = '/media/agkortzis/Data/m2'

def run_spotbugs(file_project_trees, output_file, path_to_m2_directory=os.path.expanduser('~/.m2')):

    trees = maven.get_compiled_modules(file_project_trees)
    
    if not trees:
        logging.info(f'No modules to analyze: {file_project_trees}.')
        return

    pkg_paths = []
    for t in trees:
        pkg_paths.extend([a.artifact.get_m2_path(path_to_m2_directory) for a in t])
        
    pkg_paths = list(set(pkg_paths))
    spotbugs.analyze_project(pkg_paths, output_file)
    
    
started_at = datetime.datetime.now()
print("Started at :: {}".format(started_at))

project_trees = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]

total = len(project_trees)
for counter, f in enumerate(project_trees, start=1):
    filepath = os.path.join(path_to_data, f)
    output_file = f'{os.path.splitext(filepath)[0]}.xml'
    logging.info("{}/{}".format(counter, total))
    run_spotbugs(filepath, output_file, path_to_m2_directory)

finished_at = datetime.datetime.now()
print("Finished at :: {}".format(finished_at))

<a id="metrics"></a>
## Extract metrics and create analysis dataset

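This step combines each project's `.trees` file with its SpotBugs XML report and writes one row per project to three CSV datasets:

1. `general`: project classes (`u`) vs. dependency classes (`d`)
2. `enterprise`: project classes vs. enterprise (`de`) and non-enterprise (`dne`) dependency classes
3. `wellknown`: project classes vs. well-known (`dw`) and non-well-known (`dnw`) dependency classes

For each class set, a row reports the number of classes, the number of classes with reported vulnerabilities, the corresponding SLOC, and the vulnerability counts in the `p1`..`p3` by `r1`..`r4` columns.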
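
As an illustration of the kind of counting involved, the sketch below tallies `BugInstance` entries of a SpotBugs XML report by priority and rank bucket, producing twelve counts in `p1_r1 ... p3_r4` order. The `SECURITY` category filter and the four rank buckets (1-4, 5-9, 10-14, 15-20) are assumptions made for this example; the actual logic lives in the `spotbugs` module used below and may differ.

```python
import xml.etree.ElementTree as ET


def rank_bucket(rank):
    """Map a SpotBugs rank (1-20) to one of four buckets (assumed boundaries)."""
    if rank <= 4:
        return 1
    if rank <= 9:
        return 2
    if rank <= 14:
        return 3
    return 4


def count_bugs(spotbugs_xml_text, category="SECURITY"):
    """Count bugs of `category` per (priority, rank bucket), flattened to 12 values."""
    counts = {(p, r): 0 for p in range(1, 4) for r in range(1, 5)}
    root = ET.fromstring(spotbugs_xml_text)
    for bug in root.iter("BugInstance"):
        if bug.get("category") != category:
            continue
        key = (int(bug.get("priority")), rank_bucket(int(bug.get("rank"))))
        if key in counts:  # ignore priorities outside 1-3
            counts[key] += 1
    return [counts[(p, r)] for p in range(1, 4) for r in range(1, 5)]
```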

In [None]:
import csv
import os
import itertools
import logging
import datetime

import maven as mvn
import spotbugs as sb
import sloc

logging.basicConfig(level=logging.INFO)

def count_vulnerabilities(spotbugs_xml, classes_sloc, class_sets, class_dict): # {'uv': project_classes, 'dv': dep_classes}
    vdict = sb.collect_vulnerabilities(spotbugs_xml, class_dict)
    
    dataset_info = {}
    for k in class_dict.keys():
        dataset_info[k] = {}
        dataset_info[k]["classes"] = [sb.get_main_classname(b) for c in vdict[k] for r in c for b in r]
        dataset_info[k]["classes"] = list(set(dataset_info[k]["classes"]))
        dataset_info[k]["count"] = [len(r) for c in vdict[k] for r in c]
        dataset_info[k]["sloc"] = 0
        dataset_info[k]["classes_sloc"] = 0

        try:
            dataset_info[k]["sloc"] = sum([int(classes_sloc.get(c,0)) for c in sloc.get_roots(class_dict[k])])
            dataset_info[k]["classes_sloc"] = sum([int(classes_sloc.get(c,0)) for c in sloc.get_roots(dataset_info[k]["classes"])])
        except Exception as e:
            # Log first: a statement placed after `raise` would never run
            logging.error("Error while calculating metrics\n{}".format(e))
            with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_dep_classes.txt', 'w') as log:
                for entry in sloc.get_roots(class_dict[k]):
                    log.write("{}\n".format(entry))
                
            with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_dep_original.txt', 'w') as  log:
                for entry in class_dict[k]:
                    log.write("{}\n".format(entry))
                
            raise

    
    # Row layout: class counts, vulnerable-class counts, SLOC, vulnerable-class SLOC,
    # then the vulnerability counts (e.g., #uv_p1_r1 | #uv_p1_r2 ... | #dv_p3_r3 | #dv_p3_r4)
    return (
        [len(class_dict[k]) for k in class_sets]
        + [len(dataset_info[k]["classes"]) for k in class_sets]
        + [dataset_info[k]["sloc"] for k in class_sets]
        + [dataset_info[k]["classes_sloc"] for k in class_sets]
        + [count for k in class_sets for count in dataset_info[k]["count"]]
    )


def project_level_metrics(trees, spotbugs_xml, path_to_m2_directory=os.path.expanduser('~/.m2')):
    modules = [m.artifact for m in trees]
    dep_modules = [m.artifact for t in trees for m in t.deps if m.artifact not in modules]
    dep_modules = list(set(dep_modules)) # remove duplicates
    metrics = {}

    # Collect SLOC info
    classes_sloc = {}
    for m in (modules + dep_modules):
        classes_sloc.update(sloc.retrieve_SLOC(m.get_m2_path(path_to_m2_directory))[0])
    
    # Collect classes from user code
    project_classes = [c for m in modules for c in m.get_class_list(path_to_m2_directory)]
    
    # Collect classes from dependencies
    ## Original dataset
    try:
        dep_classes = [c for m in dep_modules for c in m.get_class_list(path_to_m2_directory)]
    except Exception as e:#angor
        with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_modules.txt', 'w') as log:
            log.write("{}\n".format(spotbugs_xml))
            for entry in dep_modules:
                log.write("{}\n".format(entry.get_m2_path(path_to_m2_directory)))
        raise e
    metrics['general'] = count_vulnerabilities(spotbugs_xml, classes_sloc, ['uv', 'dv'], {'uv': project_classes, 'dv': dep_classes})

    ## Enterprise dataset (compare enterprise vs. non-enterprise dependencies)
    dm_enterprise = [m for m in dep_modules if m.groupId in enterprise_group_ids]
    dm_not_enterprise = [m for m in dep_modules if m.groupId not in enterprise_group_ids]
    try:
        dc_enterprise = [c for m in dm_enterprise for c in m.get_class_list(path_to_m2_directory)]
        dc_not_enterprise = [c for m in dm_not_enterprise for c in m.get_class_list(path_to_m2_directory)]
    except Exception as e:#angor
        with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_modules-enterprise.txt', 'w') as log:
            log.write("{}\n".format(spotbugs_xml))
            for entry in dep_modules:
                log.write("{}\n".format(entry.get_m2_path(path_to_m2_directory)))
        raise e
    metrics['enterprise'] = count_vulnerabilities(spotbugs_xml, classes_sloc, ['uv', 'dve', 'dvne'], {'uv': project_classes, 'dve': dc_enterprise, 'dvne': dc_not_enterprise})

    ## Well-known projects (compare well-known community projects vs. non-well-known projects dependencies)
    dm_known = [m for m in dep_modules if m.groupId in wellknown_group_ids]
    dm_not_known = [m for m in dep_modules if m.groupId not in wellknown_group_ids]
    try:
        dc_known = [c for m in dm_known for c in m.get_class_list(path_to_m2_directory)]
        dc_not_known = [c for m in dm_not_known for c in m.get_class_list(path_to_m2_directory)]
    except Exception as e:#angor
        with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_modules-wellknown.txt', 'w') as log:
            log.write("{}\n".format(spotbugs_xml))
            for entry in dep_modules:
                log.write("{}\n".format(entry.get_m2_path(path_to_m2_directory)))
        raise e
    metrics['wellknown'] = count_vulnerabilities(spotbugs_xml, classes_sloc, ['uv', 'dvw', 'dvnw'], {'uv': project_classes, 'dvw': dc_known, 'dvnw': dc_not_known})

    return metrics


def collect_sp_metrics(file_project_trees, output_file, append_to_file=True, path_to_m2_directory=os.path.expanduser('~/.m2')):

    trees = mvn.get_compiled_modules(file_project_trees)
    spotbugs_xml = f'{os.path.splitext(file_project_trees)[0]}.xml'
    proj_name = os.path.basename(os.path.splitext(file_project_trees)[0])
    logging.info("Project :: {}".format(proj_name))
    
    if not trees:
        logging.warning(f'No modules to analyze: {file_project_trees}.')
        return
    
    if not os.path.exists(spotbugs_xml):
        logging.warning(f'SpotBugs XML not found: {spotbugs_xml}.')
        return
    
    metrics = project_level_metrics(trees, spotbugs_xml, path_to_m2_directory)

    for dataset in metrics.keys():
        row = ','.join([proj_name] + [str(m) for m in metrics[dataset]]) + os.linesep
        mode = 'a' if append_to_file else 'w'
        with open(output_file+dataset+'.csv', mode) as f:
            f.write(row)

        logging.debug(f'{dataset}||' + row)


def create_headers(class_sets):
    proj_info = [f'#{k}{h}' for h in ['_classes', 'v_classes', '_sloc', 'v_classes_sloc'] for k in class_sets]
    vcount = [f'#{k}v_p{p}_r{r}' for k in class_sets for p in range(1,4) for r in range(1,5)]
    return ['project'] + proj_info + vcount



currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))


os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling')
os.getcwd()
path_to_m2_directory = '/media/agkortzis/2TB_EX_STEREO/m2'
path_to_data = os.path.abspath('../repositories_data')
projects_dataset = os.path.abspath('../jss_revised_dataset.csv')

projects_escope_dataset = os.path.abspath('../dependencies_groupids_enterprise_info.csv')
with open(projects_escope_dataset) as escope_csv:
    enterprise_group_ids = set()
    wellknown_group_ids = set()
    rows = csv.reader(escope_csv, delimiter=';')
    for r in rows:
        if r[2] == '1':
            enterprise_group_ids.add(r[1])
#             logging.info("Entreprise id = {}".format(r[1]))
        if r[6] == '1':
            wellknown_group_ids.add(r[1])
#             logging.info("Well known id = {}".format(r[1]))
# 

with open(projects_dataset+'general.csv', 'w') as f:
    class_sets = ['u', 'd']
    f.write(','.join(create_headers(class_sets)) + os.linesep)

with open(projects_dataset+'enterprise.csv', 'w') as f:
    class_sets = ['u', 'de', 'dne']
    f.write(','.join(create_headers(class_sets)) + os.linesep)

with open(projects_dataset+'wellknown.csv', 'w') as f:
    class_sets = ['u', 'dw', 'dnw']
    f.write(','.join(create_headers(class_sets)) + os.linesep)
    
projects_tress = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]

number_of_projects = len(projects_tress)
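for index, f in enumerate(projects_tress, start=1):
    logging.info("{}/{} --> {}".format(index,number_of_projects,f))
    filepath = path_to_data + os.path.sep + f
    # Pass the .m2 path by keyword: the third positional parameter is append_to_file
    collect_sp_metrics(filepath, projects_dataset, path_to_m2_directory=path_to_m2_directory)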

currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))

In [None]:
import os
import itertools
import logging
import datetime

import maven as mvn
import spotbugs as sb
import sloc

logging.basicConfig(level=logging.INFO)

os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling')
os.getcwd()
path_to_m2_directory = '/media/agkortzis/2TB_EX_STEREO/m2'
path_to_data = os.path.abspath('../repositories_data')
projects_dataset = os.path.abspath('../jss_revised_dataset.csv')


def project_level_metrics(trees, spotbugs_xml, path_to_m2_directory=os.path.expanduser('~/.m2')):
    modules = [m.artifact for m in trees]
    dep_modules = [m.artifact for t in trees for m in t.deps if m.artifact not in modules]
    dep_modules = list(set(dep_modules)) # remove duplicates
    
    # Collect classes from user code
    project_classes = [c for m in modules for c in m.get_class_list(path_to_m2_directory)]
    
    try:
    # Collect classes from dependencies
        dep_classes = [c for m in dep_modules for c in m.get_class_list(path_to_m2_directory)]
    except Exception as e:#angor
        with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_modules.txt', 'w') as log:
            log.write("{}\n".format(spotbugs_xml))
            for entry in dep_modules:
                log.write("{}\n".format(entry.get_m2_path(path_to_m2_directory)))
        raise e
                
    # Collect SLOC info
    classes_sloc = {}
    for m in (modules + dep_modules):
        classes_sloc.update(sloc.retrieve_SLOC(m.get_m2_path(path_to_m2_directory))[0])
            
    vdict = sb.collect_vulnerabilities(spotbugs_xml, {'uv': project_classes, 'dv': dep_classes})
    
    uv_classes = [sb.get_main_classname(b) for c in vdict['uv'] for r in c for b in r]
    uv_classes = list(set(uv_classes))
    
    dv_classes = [sb.get_main_classname(b) for c in vdict['dv'] for r in c for b in r]
    dv_classes = list(set(dv_classes))
    
    uv_count = [len(r) for c in vdict['uv'] for r in c]
    dv_count = [len(r) for c in vdict['dv'] for r in c]
    
    u_sloc, d_sloc, uv_classes_sloc, dv_classes_sloc = 0, 0, 0, 0 #angor
    
    try:#angor
        u_sloc = sum([int(classes_sloc.get(c,0)) for c in sloc.get_roots(project_classes)])
        d_sloc = sum([int(classes_sloc.get(c,0)) for c in sloc.get_roots(dep_classes)])
        
        uv_classes_sloc = sum([int(classes_sloc.get(c,0)) for c in sloc.get_roots(uv_classes)])
        dv_classes_sloc = sum([int(classes_sloc.get(c,0)) for c in sloc.get_roots(dv_classes)])
    except Exception as e:
        # Log first: a statement placed after `raise` would never run
        logging.error("Error while calculating metrics\n{}".format(e))
        with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_dep_classes.txt', 'w') as log:
            for entry in sloc.get_roots(dep_classes):
                log.write("{}\n".format(entry))
            
        with open ('/home/agkortzis/git_repos/ICSR19/analysis/log_dep_original.txt', 'w') as  log:
            for entry in dep_classes:
                log.write("{}\n".format(entry))
            
        raise
    
    return [
        len(project_classes),  # #u_classes   
        len(dep_classes),      # #d_classes 
        len(uv_classes),       # #uv_classes 
        len(dv_classes),       # #dv_classes 
        u_sloc,                # #u_sloc 
        d_sloc,                # #d_sloc  
        uv_classes_sloc ,      # #uv_classes_sloc 
        dv_classes_sloc        # #dv_classes_sloc 
    ] + uv_count + dv_count    # #uv_p1_r1 | #uv_p1_r2 ... | #dv_p3_r3 | #dv_p3_r4

    
def collect_sp_metrics(file_project_trees, output_file, append_to_file=True, path_to_m2_directory=os.path.expanduser('~/.m2')):

    trees = mvn.get_compiled_modules(file_project_trees)
    spotbugs_xml = f'{os.path.splitext(file_project_trees)[0]}.xml'
    proj_name = os.path.basename(os.path.splitext(file_project_trees)[0])
    logging.info("Project :: {}".format(proj_name))
    
    if not trees:
        logging.warning(f'No modules to analyze: {file_project_trees}.')
        return
    
    if not os.path.exists(spotbugs_xml):
        logging.warning(f'SpotBugs XML not found: {spotbugs_xml}.')
        return
    
    metrics = project_level_metrics(trees, spotbugs_xml, path_to_m2_directory)
    
    if append_to_file:
        with open(output_file, 'a') as f:
            f.write(','.join([proj_name] + [str(m) for m in metrics]) + os.linesep)
    else:
        with open(output_file, 'w') as f:
            f.write(','.join([proj_name] + [str(m) for m in metrics]) + os.linesep)
            
    logging.debug(','.join([proj_name] + [str(m) for m in metrics]) + os.linesep)

currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))

metrics_header = ['#u_classes', '#d_classes', '#uv_classes', '#dv_classes', 
           '#u_sloc', '#d_sloc', '#uv_classes_sloc', '#dv_classes_sloc', 
           '#uv_p1_r1', '#uv_p1_r2', '#uv_p1_r3', '#uv_p1_r4', 
           '#uv_p2_r1', '#uv_p2_r2', '#uv_p2_r3', '#uv_p2_r4', 
           '#uv_p3_r1', '#uv_p3_r2', '#uv_p3_r3', '#uv_p3_r4', 
           '#dv_p1_r1', '#dv_p1_r2', '#dv_p1_r3', '#dv_p1_r4', 
           '#dv_p2_r1', '#dv_p2_r2', '#dv_p2_r3', '#dv_p2_r4', 
           '#dv_p3_r1', '#dv_p3_r2', '#dv_p3_r3', '#dv_p3_r4', ]

with open(projects_dataset, 'w') as f:
    f.write(','.join((['project'] + metrics_header)) + os.linesep)
    
projects_tress = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]

number_of_projects = len(projects_tress)
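for index, f in enumerate(projects_tress, start=1):
    logging.info("{}/{} --> {}".format(index,number_of_projects,f))
    filepath = path_to_data + os.path.sep + f
    # Pass the .m2 path by keyword: the third positional parameter is append_to_file
    collect_sp_metrics(filepath, projects_dataset, path_to_m2_directory=path_to_m2_directory)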

currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))