# ICSR '19 - Gkortzis et al. - Data Collection

This notebook describes and performs the following steps necessary to collect all data used in the study:

1. [Download projects from GitHub](#download)
2. [Compile and install Maven projects](#install)
3. [Retrieve projects' dependencies](#dependencies)
4. [Run Spotbugs](#spotbugs)
5. [Extract metrics and create analysis dataset](#metrics)

<a id="download"></a>
## Download Projects from GitHub

Given a list of GitHub Maven repositories, the first step is to clone these repositories locally. This is the task that the `github_downloader` script performs.

The `execute_downloader` function requires three parameters:
1. `credentials` : `String`. The GitHub credentials in the format `github_username:github_token`.
2. `repository_list` : `Filepath`. The list of repositories to clone locally.
3. `download_directory` : `Directory`. The full path of the directory where the repositories will be cloned.

Additionally, there is one optional parameter:

4. `update_existing` : `Boolean`. Updates (performs a `git pull` on) a repository that already exists locally.
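
As a rough illustration of what these parameters control, the following sketch builds the `git` command that a downloader of this kind might issue for one repository. It is not the `github_downloader` implementation: the helper name `plan_git_command`, the `owner/name` format of a repository entry, and the token-in-URL authentication are all assumptions made for this example.

```python
import os


def plan_git_command(credentials, repository, download_directory,
                     update_existing=False):
    """Return the git command for one repository, or None to skip it.

    Assumes `repository` is an "owner/name" string (hypothetical format).
    """
    target = os.path.join(download_directory, repository.replace("/", "_"))
    if os.path.isdir(target):
        # Repository already cloned: pull only when an update is requested.
        return ["git", "-C", target, "pull"] if update_existing else None
    url = f"https://{credentials}@github.com/{repository}.git"
    return ["git", "clone", url, target]


# Example with a directory that does not exist yet, so a clone is planned.
cmd = plan_git_command("user:token", "apache/maven", "/tmp/no_such_dir_repos")
print(cmd)
```

Returning the command instead of running it keeps the decision logic (clone, pull, or skip) easy to inspect before anything touches the disk.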

In [None]:
import github_downloader

credentials = "github_username:github_token" # replace this value with your personal credentials
repository_list = "/home/agkortzis/git_repos/ICSR19/analysis/starred_github_maven_repositories_top300.txt"
download_directory = "/media/agkortzis/Data/test_repos" # replace this value

github_downloader.execute_downloader(credentials, repository_list, download_directory)
# github_downloader.execute_downloader(credentials, repository_list, download_directory, update_existing=True)


<a id="install"></a>
## Compile and install Maven projects

The next step after downloading the projects is to perform a Maven install. This process collects each project's dependencies and generates its `.jar` file. Both the `.jar` file and the dependencies are stored in the `.m2` directory, under each project's directory. This `.m2` root directory is, by default, located in the user's home folder (`/home/user/.m2`).

The `execute_mvn_install` function requires one argument:
1. `root_directory` : `Directory`. The full path of the directory in which all repositories (downloaded in the previous step) are located.

Users can also define the following two optional parameters:

2. `Boolean`. Perform a `mvn clean` on each repository before compiling it again.
3. `Datetime`. Check out each repository at a specific date. The date should be formatted as `YYYY-MM-DD` (for example, `2018-12-25`).
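
To make the two optional parameters concrete, the sketch below lists the commands that a clean install at a past date could translate to. This is an assumed command sequence, not the actual `maven_installer` code; the helper `plan_install_commands` and its parameter names `clean` and `checkout_date` are hypothetical.

```python
import datetime


def plan_install_commands(clean=False, checkout_date=None):
    """Return the shell commands (as argument lists) for one repository."""
    commands = []
    if checkout_date is not None:
        # Validate the YYYY-MM-DD format; raises ValueError if malformed.
        datetime.datetime.strptime(checkout_date, "%Y-%m-%d")
        # Find the last commit before the date; its hash would then be
        # passed to `git checkout`.
        commands.append(["git", "rev-list", "-n", "1",
                         f"--before={checkout_date}", "HEAD"])
    goals = ["clean", "install"] if clean else ["install"]
    commands.append(["mvn"] + goals)
    return commands


for command in plan_install_commands(clean=True, checkout_date="2018-01-01"):
    print(" ".join(command))
```

Validating the date up front means a malformed value fails before any repository is modified.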

In [None]:
import maven_installer

root_directory = '/media/agkortzis/Data/maven_projects_old_versions/' # replace this value
maven_installer.execute_mvn_install(root_directory)
# maven_installer.execute_mvn_install(root_directory,True,'2018-01-01')

<a id="dependencies"></a>
## Retrieve project dependencies

Having a local copy of each Maven repository, we proceed with retrieving their dependency trees. Each tree will be stored in a separate file (with the `.trees` suffix) for further analysis, as we describe in the next [step](#spotbugs). If a project consists of more than one module, a separate tree for each module will be stored in the `.trees` file.

This step requires two parameters:
1. `root_directory` : `Directory`. The full path of the directory that stores the repositories 
2. `output_directory` : `Directory`. The full path of the directory that will store the `.trees` files. 
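
One way to obtain such trees is Maven's `dependency:tree` goal, whose `outputFile` and `appendOutput` parameters can write the tree of every module into a single file. Whether `dependency_extractor` uses exactly this invocation is an assumption; the helper `tree_command` and the one-file-per-repository naming are hypothetical.

```python
import os


def tree_command(repository_path, output_directory):
    """Build a `mvn dependency:tree` call for one repository."""
    project = os.path.basename(os.path.normpath(repository_path))
    trees_file = os.path.join(output_directory, project + ".trees")
    return [
        "mvn", "dependency:tree",
        f"-DoutputFile={trees_file}",  # one file shared by all modules
        "-DappendOutput=true",         # append each module's tree
    ]


print(tree_command("/data/repos/commons-lang/", "/data/trees"))
```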

In [None]:
import dependency_extractor

root_directory = "/media/agkortzis/Data/maven_projects_old_versions/" # replace this value
output_directory = "/home/agkortzis/git_repos/ICSR19/analysis/data/test" # replace this value

dependency_extractor.execute(root_directory, output_directory)

<a id="spotbugs"></a>
## Run SpotBugs

With the installed projects, the next step is to run SpotBugs.
For that, we use the `.trees` files, which contain the dependency tree for each module built for the project.

Thus, for each project (i.e., each `.trees` file), the next sub-steps are:
1. Parse the `.trees` file
2. Ignore modules that are not packaged Java code (neither `.jar` nor `.war`)
3. For each remaining tree (i.e., for each `.jar`/`.war` module):
   1. Select relevant dependencies (i.e., compile dependencies)
   2. Verify that the main package and its dependencies are installed in the `.m2` local repository
   3. Run SpotBugs for the set **[module] + [dependencies]**
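
The filtering in sub-steps 2 and 3 can be sketched as follows, assuming the colon-separated coordinates that `mvn dependency:tree` prints (`groupId:artifactId:type:version[:scope]`) and the default layout of the local Maven repository. The functions below are illustrative stand-ins, not the API of the `maven` module used in this step; coordinates that carry a classifier are ignored for brevity.

```python
import os


def parse_coordinate(text):
    """Split 'group:artifact:type:version[:scope]' into a dict."""
    parts = text.strip().split(":")
    keys = ["group", "artifact", "type", "version", "scope"]
    return dict(zip(keys, parts))


def m2_path(coord, m2_root=os.path.expanduser("~/.m2/repository")):
    """Path of the artifact file in the local Maven repository."""
    filename = f"{coord['artifact']}-{coord['version']}.{coord['type']}"
    return os.path.join(m2_root, *coord["group"].split("."),
                        coord["artifact"], coord["version"], filename)


module = parse_coordinate("org.example:app:jar:1.0")
deps = [parse_coordinate(d) for d in (
    "org.slf4j:slf4j-api:jar:1.7.25:compile",
    "junit:junit:jar:4.12:test",
)]

# Sub-step 2: skip modules that are not packaged as .jar or .war.
if module["type"] in ("jar", "war"):
    # Sub-step 3.1: keep compile-scope dependencies only.
    compile_deps = [d for d in deps if d.get("scope") == "compile"]
    # Sub-step 3.2: check that each artifact exists in the local repository.
    for coord in [module] + compile_deps:
        path = m2_path(coord)
        print(path, "(installed)" if os.path.exists(path) else "(missing)")
```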


In [None]:
import os
import logging
import datetime

import maven
import spotbugs


logging.basicConfig(level=logging.INFO)

currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))


def run_spotbugs(file_project_trees, output_file):

    trees = maven.get_compiled_modules(file_project_trees)
    
    if not trees:
        logging.info(f'No modules to analyze: {file_project_trees}.')
        return

    pkg_paths = []
    for t in trees:
        pkg_paths.extend([a.artifact.get_m2_path() for a in t])
        
    pkg_paths = list(set(pkg_paths))
    spotbugs.analyze_project(pkg_paths, output_file)


path_to_data = os.path.abspath('../data')

projects_tress = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]

for f in projects_tress:
    filepath = os.path.join(path_to_data, f)
    output_file = f'{os.path.splitext(filepath)[0]}.xml'
    run_spotbugs(filepath, output_file)

    
currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))

<a id="metrics"></a>
## Extract metrics and create analysis dataset

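For each project that has both a `.trees` file and a SpotBugs XML report, this step:
1. Collects the classes of the project's modules (user code) and of their dependencies
2. Retrieves the SLOC of each class
3. Collects the vulnerabilities reported by SpotBugs, separated into those found in user code (`uv`) and those found in dependencies (`dv`)
4. Appends one row with the resulting metrics to `projects-dataset.csv`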
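
The layout of the resulting dataset can be summarised with a small sketch. Each row holds the project name, 8 size metrics, and 24 counts, one per combination of code origin (`uv`, `dv`), `p1` to `p3`, and `r1` to `r4`, as listed in `metrics_header` in the code of this step. The sketch rebuilds that header with `itertools.product`; it only illustrates the column order and makes no assumption about what the `p` and `r` indices denote.

```python
import itertools

size_columns = ['#u_classes', '#d_classes', '#uv_classes', '#dv_classes',
                '#u_sloc', '#d_sloc', '#uv_classes_sloc', '#dv_classes_sloc']

# product() varies the last argument fastest: r1..r4 within each p,
# p1..p3 within each origin, 'uv' before 'dv'.
count_columns = [f'#{origin}_p{p}_r{r}'
                 for origin, p, r in itertools.product(('uv', 'dv'),
                                                       (1, 2, 3),
                                                       (1, 2, 3, 4))]

header = ['project'] + size_columns + count_columns
print(len(header))        # 33 columns, including the project name
print(count_columns[:5])
```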

In [None]:
import os
import itertools
import logging
import datetime

import maven as mvn
import spotbugs as sb
import sloc


logging.basicConfig(level=logging.INFO)

currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))


def project_level_metrics(trees, spotbugs_xml):
    modules = [m.artifact for m in trees]
    dep_modules = [m.artifact for t in trees for m in t.deps if m.artifact not in modules]
    dep_modules = list(set(dep_modules)) # remove duplicates
    
    # Collect classes from user code
    project_classes = [c for m in modules for c in m.get_class_list()]
    
    # Collect classes from dependencies
    dep_classes = [c for m in dep_modules for c in m.get_class_list()]
    
    # Collect SLOC info
    classes_sloc = {}
    for m in (modules + dep_modules):
        classes_sloc.update(sloc.retrieve_SLOC(m.get_m2_path())[0])
            
    vdict = sb.collect_vulnerabilities(spotbugs_xml, {'uv': project_classes, 'dv': dep_classes})
    
    uv_classes = [sb.get_main_classname(b) for c in vdict['uv'] for r in c for b in r]
    uv_classes = list(set(uv_classes))
    
    dv_classes = [sb.get_main_classname(b) for c in vdict['dv'] for r in c for b in r]
    dv_classes = list(set(dv_classes))
    
    uv_count = [len(r) for c in vdict['uv'] for r in c]
    dv_count = [len(r) for c in vdict['dv'] for r in c]
    
    u_sloc = sum([int(classes_sloc[c]) for c in sloc.get_roots(project_classes)])
    d_sloc = sum([int(classes_sloc[c]) for c in sloc.get_roots(dep_classes)])
    
    uv_classes_sloc = sum([int(classes_sloc[c]) for c in sloc.get_roots(uv_classes)])
    dv_classes_sloc = sum([int(classes_sloc[c]) for c in sloc.get_roots(dv_classes)])
    
    return [
        len(project_classes),  # #u_classes   
        len(dep_classes),      # #d_classes 
        len(uv_classes),       # #uv_classes 
        len(dv_classes),       # #dv_classes 
        u_sloc,                # #u_sloc 
        d_sloc,                # #d_sloc  
        uv_classes_sloc ,      # #uv_classes_sloc 
        dv_classes_sloc        # #dv_classes_sloc 
    ] + uv_count + dv_count    # #uv_p1_r1 | #uv_p1_r2 ... | #dv_p3_r3 | #dv_p3_r4

    
def collect_sp_metrics(file_project_trees, output_file, append_to_file=True):

    trees = mvn.get_compiled_modules(file_project_trees)
    spotbugs_xml = f'{os.path.splitext(file_project_trees)[0]}.xml'
    proj_name = os.path.basename(os.path.splitext(file_project_trees)[0])
    
    if not trees:
        logging.warning(f'No modules to analyze: {file_project_trees}.')
        return
    
    if not os.path.exists(spotbugs_xml):
        logging.warning(f'SpotBugs XML not found: {spotbugs_xml}.')
        return
    
    metrics = project_level_metrics(trees, spotbugs_xml)
    
    # Format the row once; mode 'a' appends to the dataset, 'w' overwrites it.
    row = ','.join([proj_name] + [str(m) for m in metrics]) + os.linesep
    with open(output_file, 'a' if append_to_file else 'w') as f:
        f.write(row)

    logging.debug(row)

            
path_to_data = os.path.abspath('../data')
projects_dataset = os.path.abspath('../projects-dataset.csv')
metrics_header = ['#u_classes', '#d_classes', '#uv_classes', '#dv_classes', 
           '#u_sloc', '#d_sloc', '#uv_classes_sloc', '#dv_classes_sloc', 
           '#uv_p1_r1', '#uv_p1_r2', '#uv_p1_r3', '#uv_p1_r4', 
           '#uv_p2_r1', '#uv_p2_r2', '#uv_p2_r3', '#uv_p2_r4', 
           '#uv_p3_r1', '#uv_p3_r2', '#uv_p3_r3', '#uv_p3_r4', 
           '#dv_p1_r1', '#dv_p1_r2', '#dv_p1_r3', '#dv_p1_r4', 
           '#dv_p2_r1', '#dv_p2_r2', '#dv_p2_r3', '#dv_p2_r4', 
           '#dv_p3_r1', '#dv_p3_r2', '#dv_p3_r3', '#dv_p3_r4', ]

with open(projects_dataset, 'w') as f:
    f.write(','.join((['project'] + metrics_header)) + os.linesep)
    
for f in projects_tress:
    filepath = os.path.join(path_to_data, f)
    collect_sp_metrics(filepath, projects_dataset)


currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))