# ICSR '19 - Gkortzis et al. - Data Collection

This notebook describes and performs the following steps necessary to collect all data used in the study:

1. [Download projects from GitHub](#download)
2. [Compile and install Maven projects](#install)
3. [Retrieve projects' dependencies](#dependencies)
4. [Run SpotBugs](#spotbugs)
5. [Extract metrics and create analysis dataset](#metrics)

<a id="download"></a>
## Download Projects from GitHub

Given a list of GitHub Maven repositories, the first step is to clone these repositories locally. The `github_downloader` script performs this task.

`execute_downloader` requires three parameters:
1. `credentials` : `String`. The GitHub credentials in a specific format (`github_username:github_token`)
2. `repository_list` : `Filepath`. The list of repositories to clone locally
3. `download_directory` : `Directory`. The full path of the directory where the repositories will be cloned

Additionally, there is an optional parameter:
4. `update_existing` : `Boolean`. Updates (performs a `git pull` on) a repository that already exists locally.
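
The internals of `github_downloader` are not shown in this notebook. As a rough sketch of what cloning or updating a list of repositories can look like, the hypothetical helper below builds one `git` command per repository. The function name, the `owner/name` entry format, and the omission of credentials are assumptions made for illustration, not the script's actual behaviour.

```python
import os


def build_git_commands(repositories, download_directory, update_existing=False):
    """Return one git command (as an argument list) per repository.

    Hypothetical helper: `repositories` holds "owner/name" strings and each
    repository is stored under <download_directory>/<owner>/<name>.
    """
    commands = []
    for full_name in repositories:
        target = os.path.join(download_directory, *full_name.split("/"))
        if os.path.isdir(os.path.join(target, ".git")):
            # Already cloned: pull only when an update was requested.
            if update_existing:
                commands.append(["git", "-C", target, "pull"])
        else:
            url = "https://github.com/{}.git".format(full_name)
            commands.append(["git", "clone", url, target])
    return commands


# Print the command that would be run for one repository.
for command in build_git_commands(["addthis/hydra"], "/tmp/test_repos"):
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.check_call` to execute it.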

In [4]:
import github_downloader

credentials = "github_username:github_token" # replace this value with your personal credentials
repository_list = "/home/agkortzis/git_repos/ICSR19/analysis/starred_github_maven_repositories_top300.txt"
download_directory = "/media/agkortzis/Data/test_repos" # replace this value

github_downloader.execute_downloader(credentials, repository_list, download_directory)
# github_downloader.execute_downloader(credentials, repository_list, download_directory, update_existing=True)


ModuleNotFoundError: No module named 'github_downloader'

<a id="install"></a>
## Compile and install Maven projects

After downloading the projects, the next step is to run `mvn install` on each of them. This collects each project's dependencies and generates its `.jar` file. Both the `.jar` file and the dependencies are stored in the local `.m2` repository, each in its own directory.
By default, the `.m2` root directory is located in the user's home folder (`/home/user/.m2`).

`execute_mvn_install` requires one argument:
1. `root_directory` : `Directory`. The full path of the directory in which all repositories (downloaded by the previous step) are located.

Users can also pass the following two optional parameters:
2. `Boolean`. Runs `mvn clean` on each repository before compiling it again.
3. `Datetime`. Checks out each repository as of a specific date. The date should be formatted as `YYYY-MM-DD` (for example, `2018-12-25`).
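
How `maven_installer` implements these options is not shown here. The sketch below illustrates one possible approach: `git rev-list` finds the last commit made before the given date, and `mvn install` (optionally preceded by `clean`) builds the project. All function names are hypothetical.

```python
import datetime
import os
import subprocess


def rev_list_command(repository, date):
    """Command that prints the last commit made before `date` (YYYY-MM-DD)."""
    datetime.datetime.strptime(date, "%Y-%m-%d")  # raises ValueError if malformed
    return ["git", "-C", repository, "rev-list", "-n", "1", "--before=" + date, "HEAD"]


def mvn_command(repository, clean=False):
    """Command that installs the project, optionally cleaning it first."""
    goals = ["clean", "install"] if clean else ["install"]
    return ["mvn", "-f", os.path.join(repository, "pom.xml")] + goals


def install(repository, clean=False, date=None):
    """Optionally check out the repository as of `date`, then build it."""
    if date is not None:
        sha = subprocess.check_output(rev_list_command(repository, date), text=True).strip()
        if not sha:
            raise ValueError("no commit before " + date)
        subprocess.check_call(["git", "-C", repository, "checkout", sha])
    subprocess.check_call(mvn_command(repository, clean))
```

Calling `install(path, True, '2018-01-01')` for every repository under `root_directory` would then mirror the three-argument call in the cell below.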

In [None]:
import maven_installer

root_directory = '/media/agkortzis/Data/maven_projects_old_versions/' # replace this value
maven_installer.execute_mvn_install(root_directory)
# maven_installer.execute_mvn_install(root_directory, True, '2018-01-01')

<a id="dependencies"></a>
## Retrieve project dependencies

With a local copy of each Maven repository, we proceed to retrieve its dependency tree. Each tree is stored in a separate file (with the `.trees` suffix) for further analysis, as described in the [SpotBugs step](#spotbugs). If a project consists of more than one module, a separate tree for each module is stored in the same `.trees` file.

This step requires two parameters:
1. `root_directory` : `Directory`. The full path of the directory that stores the repositories 
2. `output_directory` : `Directory`. The full path of the directory that will store the `.trees` files. 
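
The implementation of `dependency_extractor` is not part of this notebook. As a sketch, dependency trees can be obtained with the Maven Dependency Plugin's `dependency:tree` goal; its `outputFile` and `appendOutput` parameters write the trees of all modules into a single file. The helper names and the `<owner>.<name>.trees` naming scheme below are assumptions.

```python
import os


def trees_filename(root_directory, repository):
    """Map <root>/<owner>/<name> to "<owner>.<name>.trees" (assumed naming)."""
    relative = os.path.relpath(repository, root_directory)
    return relative.replace(os.path.sep, ".") + ".trees"


def dependency_tree_command(repository, output_file):
    """Build the Maven command that writes all module trees to one file."""
    return [
        "mvn", "-f", os.path.join(repository, "pom.xml"),
        "dependency:tree",
        "-DoutputFile=" + output_file,  # use an absolute path
        "-DappendOutput=true",          # append the tree of each module
    ]
```

An absolute `output_file` keeps all modules of a multi-module build writing to the same file.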

In [3]:
import dependency_extractor

root_directory = "/media/agkortzis/Data/maven_projects_old_versions/" # replace this value
output_directory = "/home/agkortzis/git_repos/ICSR19/analysis/data/test" # replace this value

dependency_extractor.execute(root_directory, output_directory)

Detected 300 repositories in root directory /media/agkortzis/Data/maven_projects_old_versions/
1/300. Scanning repository :: /media/agkortzis/Data/maven_projects_old_versions/4pr0n/ripme
	Writting file :: /home/agkortzis/git_repos/ICSR19/analysis/data/test/pr0n.ripme.trees
2/300. Scanning repository :: /media/agkortzis/Data/maven_projects_old_versions/abel533/Mybatis-Spring
3/300. Scanning repository :: /media/agkortzis/Data/maven_projects_old_versions/adamfisk/LittleProxy
	Writting file :: /home/agkortzis/git_repos/ICSR19/analysis/data/test/damfisk.LittleProxy.trees
4/300. Scanning repository :: /media/agkortzis/Data/maven_projects_old_versions/addthis/hydra
	Writting file :: /home/agkortzis/git_repos/ICSR19/analysis/data/test/ddthis.hydra.trees
5/300. Scanning repository :: /media/agkortzis/Data/maven_projects_old_versions/addthis/stream-lib
	Writting file :: /home/agkortzis/git_repos/ICSR19/analysis/data/test/ddthis.stream-lib.trees
6/300. Scanning repository :: /media/agkortzis/Dat

KeyboardInterrupt: 

<a id="spotbugs"></a>
## Run SpotBugs

With the installed projects, the next step is to run SpotBugs.
For that, we use the `.trees` files, which contain the dependency tree for each module built for the project.

Thus, for each project (i.e., each `.trees` file), the sub-steps are:
1. Parse the `.trees` file
2. Ignore modules that are not packaged Java code (neither `.jar` nor `.war`)
3. For each remaining tree (i.e., for each `.jar`/`.war` module):
   1. Select relevant dependencies (i.e., compile dependencies)
   2. Verify that the main package and its dependencies are installed in the local `.m2` repository
   3. Run SpotBugs for the set **[module] + [dependencies]**
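
To make the sub-steps above concrete, the sketch below parses one line of a dependency tree and maps it to its location in the local `.m2` repository. It is not the `maven` module used in the next cell. The function names are hypothetical, coordinates with a classifier are deliberately unsupported, and the `com.example` dependency line is made up for illustration.

```python
import os
import re

# Optional "[INFO] " marker followed by tree-drawing characters such as "|  +- ".
TREE_PREFIX = re.compile(r"^(?:\[INFO\] )?[|+\\\- ]*")


def parse_coordinates(line):
    """Parse "group:artifact:type:version[:scope]" from one tree line."""
    parts = TREE_PREFIX.sub("", line, count=1).strip().split(":")
    if len(parts) not in (4, 5):
        raise ValueError("unsupported coordinates: " + line)
    keys = ("group", "artifact", "type", "version", "scope")
    return dict(zip(keys, parts))


def m2_path(coords, m2_repository=None):
    """Return where Maven stores the artifact in the local repository."""
    if m2_repository is None:
        m2_repository = os.path.join(os.path.expanduser("~"), ".m2", "repository")
    filename = "{artifact}-{version}.{type}".format(**coords)
    return os.path.join(m2_repository, *coords["group"].split("."),
                        coords["artifact"], coords["version"], filename)


root = parse_coordinates("org.littleshoot:littleproxy:jar:1.1.3-SNAPSHOT")
deps = [parse_coordinates("+- com.example:demo-lib:jar:1.0:compile")]
# Keep only compile-scope dependencies.
compile_deps = [d for d in deps if d.get("scope") == "compile"]
print(m2_path(root))
```

The root line of a tree has no scope, which is why `scope` is read with `dict.get`.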


In [None]:
import datetime
import logging
import os

import maven
import spotbugs

logging.basicConfig(level=logging.INFO)

print("Started at :: {}".format(datetime.datetime.now()))


def run_spotbugs(file_project_trees):
    """Run SpotBugs on every jar/war module listed in a .trees file."""
    with open(file_project_trees) as f:
        trees = maven.split_trees([line.rstrip() for line in f])

    for tree_lines in trees:
        tree = maven.ArtifactTree.parse_tree_str('\n'.join(tree_lines))
        # Skip modules that are not packaged as jar or war.
        if tree.artifact.type in ['jar', 'war']:
            # Keep only compile-scope dependencies.
            tree.filter_deps(lambda d: d.artifact.dep_type == 'compile')
            # The first path is the module itself; the rest are its dependencies.
            paths = [d.artifact.get_m2_path() for d in tree]
            spotbugs.analyze_project(paths[0], list(set(paths[1:])))


path_to_data = os.path.abspath('../data')

project_trees = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]

for f in project_trees:
    run_spotbugs(os.path.join(path_to_data, f))

print("Finished at :: {}".format(datetime.datetime.now()))

INFO:spotbugs:Already analyzed. Skipping /home/agkortzis/.m2/repository/org/littleshoot/littleproxy/1.1.3-SNAPSHOT/littleproxy-1.1.3-SNAPSHOT.jar.
ERROR:spotbugs:Main package not found: /home/agkortzis/.m2/repository/net/librec/librec-core/3.0.0/librec-core-3.0.0.jar.
ERROR:spotbugs:Skipping /home/agkortzis/.m2/repository/net/librec/librec-core/3.0.0/librec-core-3.0.0.jar.
INFO:spotbugs:Already analyzed. Skipping /home/agkortzis/.m2/repository/com/ebay/pulsar/collector/1.0.2/collector-1.0.2.jar.
INFO:spotbugs:Already analyzed. Skipping /home/agkortzis/.m2/repository/com/ebay/pulsar/replay/replay/1.0.2/replay-1.0.2.jar.
INFO:spotbugs:Already analyzed. Skipping /home/agkortzis/.m2/repository/com/ebay/pulsar/sessionizer/1.0.2/sessionizer-1.0.2.jar.
INFO:spotbugs:Already analyzed. Skipping /home/agkortzis/.m2/repository/com/ebay/pulsar/distributor/1.0.2/distributor-1.0.2.jar.
INFO:spotbugs:Analyzing /home/agkortzis/.m2/repository/com/ebay/pulsar/metriccalculator/1.0.2/metriccalculator-1.0.

Started at :: 2019-02-01 19:56:53.707617


<a id="metrics"></a>
## Extract metrics and create analysis dataset

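This step is illustrated on a single artifact. The cell below lists the classes packaged in the artifact, parses the SpotBugs XML report generated for it, and counts the reported bug instances that belong to the artifact's own classes, in total and per priority level (1 to 3).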

In [None]:
import os
import xmltodict
import zipfile

xmlpath='/home/vagrant/.m2/repository/org/apache/empire-db/empire-db-example-jsf2/2.4.7/empire-db-example-jsf2-2.4.7.xml'
artpath='/home/vagrant/.m2/repository/org/apache/empire-db/empire-db-example-jsf2/2.4.7/empire-db-example-jsf2-2.4.7.war'


def get_class_list(pkg):
    """Return the fully qualified names of the classes packaged in a jar/war."""
    # In a .war, the application's own classes live under WEB-INF/classes/.
    prefix = 'WEB-INF/classes/' if pkg.endswith('.war') else ''
    with zipfile.ZipFile(pkg) as container:
        # Zip entry names always use '/', regardless of the operating system.
        return [name[len(prefix):-len('.class')].replace('/', '.')
                for name in container.namelist()
                if name.startswith(prefix) and name.endswith('.class')]


def primary_class(bug):
    """Return the class name of a BugInstance (the first one if several are listed)."""
    cls = bug['Class']
    if isinstance(cls, list):
        cls = cls[0]
    return cls['@classname']


clst = get_class_list(artpath)

print('# classes: ', len(clst))
print('Sample classes', clst[:3])

with open(xmlpath) as fd:
    # force_list keeps 'BugInstance' a list even when the report has a single entry.
    doc = xmltodict.parse(fd.read(), force_list=('BugInstance',))

bugs = doc['BugCollection'].get('BugInstance', [])
print('# vulnerabilities: ', len(bugs))

# Keep only the bugs reported in the artifact's own classes.
classes = set(clst)
own_bugs = [b for b in bugs if primary_class(b) in classes]
print(len(own_bugs))
for priority in ('1', '2', '3'):
    print(len([b for b in own_bugs if b['@priority'] == priority and int(b['@rank']) <= 20]))

for i in clst:
    print(i)
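
The per-priority counts can also be computed in a single pass. The sketch below does so with the standard library's `xml.etree.ElementTree` and a `collections.Counter`. The sample report and its class names are made up for illustration, and treating the first `<Class>` element of a `BugInstance` as the class the bug belongs to is an assumption.

```python
import collections
import xml.etree.ElementTree as ET

# Made-up report with the element and attribute names used in the cell above.
SAMPLE_REPORT = """
<BugCollection>
  <BugInstance type="A" priority="1" rank="5">
    <Class classname="com.example.Foo"/>
  </BugInstance>
  <BugInstance type="B" priority="2" rank="12">
    <Class classname="com.example.Foo"/>
    <Class classname="com.example.Base"/>
  </BugInstance>
  <BugInstance type="C" priority="2" rank="18">
    <Class classname="org.thirdparty.Bar"/>
  </BugInstance>
</BugCollection>
"""


def count_by_priority(report_xml, classes, max_rank=20):
    """Count bug instances per priority, restricted to the given classes."""
    counts = collections.Counter()
    for bug in ET.fromstring(report_xml.strip()).iter("BugInstance"):
        first_class = bug.find("Class")  # first <Class> child, if any
        if first_class is None or first_class.get("classname") not in classes:
            continue
        if int(bug.get("rank")) <= max_rank:
            counts[bug.get("priority")] += 1
    return counts


print(count_by_priority(SAMPLE_REPORT, {"com.example.Foo"}))
```

Passing the class list as a `set` keeps the membership test constant-time for large artifacts.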
