<a id="dc-index"></a>
# Data Collection for RQ3-RQ4, Gkortzis et al. - JSS'19 Special Issue on Software and Systems Reuse in the Big Data Era 

This notebook performs the following steps to answer RQ3 and RQ4:

1. [Run OWASP dependency checker tool](#run_owasp)
2. [Libraries reuse frequencies](#frequency)
3. [Parse OWASP vulnerabilities](#parse_owasp)
4. [Spotbugs on Dependencies](#spotbugs_dependencies)
5. [Detect 'dependency' projects](#dependencies_projects)

<a id='run_owasp'></a>
## Run OWASP dependency checker tool

This script runs the __OWASP dependency checker tool__ on a list of jar/war files. If the list is not provided, it is generated automatically from the given _root directory_ variable. To provide the list yourself, create it with the following command:
```
find path_to_root_directory -name '*.jar' -or -name '*.war' 
```
__Important__: 
1. The resulting list file must store each jar/war that was found with its absolute path.
2. This analysis might take hours or days to complete. Consider parallelizing it by splitting the jar_list into smaller lists and analyzing them in parallel.
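
The two notes above can be sketched in Python as follows. This is a minimal sketch and not part of the original tooling: `collect_archives` and `split_into_chunks` are hypothetical helper names, and the round-robin split is only one possible way to divide the work.

```python
import os


def collect_archives(root_directory):
    """Return the absolute paths of all .jar and .war files under root_directory."""
    archives = []
    for dirpath, _dirnames, filenames in os.walk(os.path.abspath(root_directory)):
        for name in filenames:
            if name.endswith(('.jar', '.war')):
                archives.append(os.path.join(dirpath, name))
    return sorted(archives)


def split_into_chunks(paths, number_of_chunks):
    """Distribute paths round-robin over number_of_chunks lists (hypothetical helper)."""
    chunks = [[] for _ in range(number_of_chunks)]
    for index, path in enumerate(paths):
        chunks[index % number_of_chunks].append(path)
    return chunks
```

Each chunk could then be written to its own list file and passed as `jar_list` to a separate run. Whether several runs can safely share one output directory is not established here, so a separate output directory per run is the more cautious choice.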

You can download the previously generated OWASP reports from [here](https://send.firefox.com/download/5cd259cabe1ad54a/#FtNsrxrTFGDA9g1Xzik7Zg).

[Back to table of contents](#dc-index)

In [None]:
import datetime
import logging
import os
import owasp_dependency_analyzer as owasp

logging.basicConfig(level=logging.DEBUG)

os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling')
os.getcwd()
root_directory = '/media/agkortzis/Data/m2/repository'
owasp_executable = './vendor/dependency-check/bin/dependency-check.sh'
output_dir = '/media/agkortzis/Data/owasp_reports'
jar_list = '../local-jars_clean.txt'
report_format = 'JSON'

currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))    

owasp.execute(root_directory, owasp_executable, jar_list, output_dir, report_format)

currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))


<a id="frequency"></a>
## Libraries reuse frequencies

The following script counts the occurrences of each dependency in a set of given Maven .trees files, as produced by step '4. Retrieve project dependencies' in the DataCollection notebook. It produces a file with one entry per library in the following format:

_library;number of uses;projects used in_
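
Because the script joins the project names with the same `;` separator, each line has a variable number of fields. A minimal sketch of reading such a line back is shown here; `parse_usage_line` is a hypothetical helper, and the sample values are made up for illustration rather than taken from the dataset.

```python
def parse_usage_line(line):
    """Split 'library;number of uses;project1;project2;...' into its parts."""
    library, count, *projects = line.rstrip('\n').split(';')
    return library, int(count), projects


# Illustrative values only, not taken from the actual dataset.
sample = 'org_example_lib_1.0_lib-1.0.jar;2;owner1.projA;owner2.projB\n'
library, uses, projects = parse_usage_line(sample)
```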

[Back to table of contents](#dc-index)

In [None]:
import os
import logging
import datetime

import maven

logging.basicConfig(level=logging.INFO)

def parse_trees(path_to_data, path_to_m2, dependency_usages_output_file):
    dependency_usage = {}
    projects_tress = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]
    total = len(projects_tress)
    for index, f in enumerate(projects_tress):
        file_project_trees = os.path.join(path_to_data, f)
        project_name = f.replace('.trees','')
        logging.info("{}/{} :: {} (tree={})".format(index,total,project_name,file_project_trees))

        trees = maven.get_compiled_modules(file_project_trees)

        if not trees:
            logging.info(f'No modules to analyze: {file_project_trees}.')
            continue

        modules = [m.artifact for m in trees]
        dep_modules = [m.artifact for t in trees for m in t.deps if m.artifact not in modules]
        dep_modules = list(set(dep_modules)) # remove duplicates
#         pkg_paths = []
#         for t in trees:
            
#             pkg_paths.extend([a.artifact.get_m2_path().replace(path_to_m2, '').replace('/','_') for a in t])

#         pkg_paths = list(set(pkg_paths))
        for module in dep_modules:
            dependency = module.get_m2_path(path_to_m2).replace(path_to_m2, '').replace('/','_')
            if dependency in dependency_usage:
                dependency_usage[dependency].append(project_name)
            else:
                dependency_usage[dependency] = [project_name]

    # write results to file
    with open(dependency_usages_output_file, 'w') as f:
        seperator = ';'
        for dependency in dependency_usage:
            f.write("{}{}{}{}{}\n".format(dependency,seperator,len(dependency_usage[dependency]),seperator,seperator.join(dependency_usage[dependency])))
    
    return dependency_usage
            
os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling')
os.getcwd()
path_to_data = os.path.abspath('../repositories_data') #TODO: replace with relative
path_to_m2 = '/media/agkortzis/Data/m2/repository/'
dependency_usages_output_file = '../depependencies_usages_temp.csv'

currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))            

dependency_usages = parse_trees(path_to_data, path_to_m2, dependency_usages_output_file)

currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))

for dependency in dependency_usages:
    logging.info("Dependency :: {} Usages :: {}".format(dependency,len(dependency_usages[dependency])))

<a id="parse_owasp"></a>
## Parse OWASP vulnerabilities

This script parses the reports that the OWASP dependency checker tool created for each library in an earlier step of the analysis. It produces a file with one entry per report in the following format:

_jar or war;number of vulnerabilities;number of vulnerabilities in .jar files;CVE identifiers;CVE identifiers in .jar files_

__Important__: The output list also contains the jars/wars that are produced when the Maven projects are built. It should therefore be filtered to include only the 'dependencies' jars/wars.
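
One way to apply this filter is sketched below. It assumes that the set of dependency names is already available (for example from the reuse-frequency file) and that those names are spelled exactly like the first field of each output line. Neither assumption is verified here, and `filter_dependency_rows` is a hypothetical helper.

```python
def filter_dependency_rows(lines, dependency_names):
    """Keep only the ';'-separated rows whose first field is a known dependency."""
    kept = []
    for line in lines:
        name = line.split(';', 1)[0]
        if name in dependency_names:
            kept.append(line)
    return kept
```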

[Back to table of contents](#dc-index)

In [14]:
import datetime
import json
import logging
import os

logging.basicConfig(level=logging.INFO)


os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling')
os.getcwd()
path_to_owasp_reports = '/media/agkortzis/Data/owasp_reports/owasp_reports'
owasp_vulnerabilities = '../owasp_vulnerabilities_enhanced.csv'

def parse_owasp_reports(path_to_owasp_reports, owasp_vulnerabilities):
    logging.info("Parsing owasp reports.. This might take a while.")
    deps_vulnerabilities = {}
    deps_java_vulnerabilities = {}
    
    entries = os.listdir(path_to_owasp_reports)
    total = len(entries)
    for index, entry in enumerate(entries):
#         logging.debug("{}/{} :: Analyzing {}".format(index,total,entry))
        report = os.path.join(path_to_owasp_reports, entry, "dependency-check-report.json")
        if not os.path.exists(report):
            logging.error("{} does not exist".format(report))
            continue
        
        CVEs = set()
        java_cves = set()
        # Open the JSON
        with open(report) as json_file:
            data = json.load(json_file)
            
            # Retrieve vulnerabilities
            dependencies = data['dependencies']
            for dependency in dependencies:
#                 logging.info("Dependency filepath = {}".format(dependency['filePath']))
                if 'vulnerabilities' in dependency:
                    for cve in dependency['vulnerabilities']:
                        CVEs.add(cve['name'])
                        # count only cves in java dependencies
                        if dependency['filePath'].endswith('.jar'):
                            java_cves.add(cve['name']) 
        
        # Make entry library;number_of_vulnerabilities;number_of_java_vulnerabilities;cves_entries;java_cves_entries
        deps_vulnerabilities[entry] = [len(CVEs),len(java_cves),','.join(CVEs),','.join(java_cves)]
    
        # logging.debug("{};{};{}".format(entry,deps_vulnerabilities[entry][0],deps_vulnerabilities[entry][1]))
            
    logging.info("Writing results to file {}".format(owasp_vulnerabilities))
    with open(owasp_vulnerabilities, 'w') as f:
        for entry in deps_vulnerabilities:
            f.write("{};{};{};{};{}\n".format(entry,deps_vulnerabilities[entry][0],deps_vulnerabilities[entry][1],deps_vulnerabilities[entry][2],deps_vulnerabilities[entry][3]))
        
currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))            

parse_owasp_reports(path_to_owasp_reports, owasp_vulnerabilities)

currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))


INFO:root:Pasing owasp reports.. This might take a while.


Started at :: 2019-10-18 09:55:37.627346


ERROR:root:/media/agkortzis/Data/owasp_reports/owasp_reports/owasp_reports/dependency-check-report.json does not exist
INFO:root:Writting results to file ../owasp_vulnerabilities_enhanced.csv


Finished at :: 2019-10-18 10:41:21.156258


<a id="spotbugs_dependencies"></a>
## Spotbugs on Dependencies

In this step we parse the .trees file of each project and analyze the project's dependencies with SpotBugs. 
The output of this step is a file that contains the number of potential vulnerabilities (SpotBugs `BugInstance` entries) for each dependency. 
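
The notebook counts the findings with a `grep | wc -l` pipeline. A portable alternative is sketched here, under the assumption that the report is well-formed XML and that its `BugInstance` elements carry no XML namespace; `count_bug_instances` is a hypothetical name. Unlike `grep`, it counts elements rather than matching lines, so the two approaches can differ if one line holds several elements.

```python
import xml.etree.ElementTree as ET


def count_bug_instances(path_to_file):
    """Count <BugInstance> elements in an XML report without loading it whole."""
    count = 0
    for _event, element in ET.iterparse(path_to_file, events=('end',)):
        if element.tag == 'BugInstance':
            count += 1
        element.clear()  # release the content of elements already processed
    return count
```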

[Back to table of contents](#dc-index)

In [None]:
import os
import logging
import datetime
import subprocess
from utility import Utility

import maven
import spotbugs

logging.basicConfig(level=logging.INFO)

def parse_spotbugs_xml(path_to_file):
    """Count the lines of a Spotbugs XML report that contain '<BugInstance'."""
    # Equivalent to the shell pipeline: grep "<BugInstance" path_to_file | wc -l
    p1 = subprocess.Popen(["grep","<BugInstance", path_to_file], stdout=subprocess.PIPE)
    p2 = subprocess.Popen(["wc", "-l"], stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits.
    output = p2.communicate()[0].decode('utf-8')

    # wc prints the count followed by a newline; convert it to an integer so
    # that no stray newline ends up in the log message or in the output file.
    return int(output.strip())

    
def parse_trees_with_spotbugs(path_to_data, path_to_m2, dependency_vulnerabilities_output_file):
    dependency_vulnerabilities = {}
    projects_tress = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]
    total = len(projects_tress)
    for index, f in enumerate(projects_tress):
        file_project_trees = os.path.join(path_to_data, f)
        project_name = f.replace('.trees','')
        logging.info("{}/{} :: {}".format(index,total,project_name))

        trees = maven.get_compiled_modules(file_project_trees)

        if not trees:
            logging.info(f'No modules to analyze: {file_project_trees}.')
            continue

        modules = [m.artifact for m in trees]
        dep_modules = [m.artifact for t in trees for m in t.deps if m.artifact not in modules]
        dep_modules = list(set(dep_modules)) # remove duplicates

        for module in dep_modules:
            dependency = module.get_m2_path(path_to_m2)
            #logging.info("Dependency with path={}".format(dependency))
            spotbugs_file = "{}.spotbugs.xml".format(os.path.splitext(dependency)[0])
            
            overwrite = False
            spotbugs.analyze_project([dependency], spotbugs_file, overwrite)

            # parse spotbugs file here
            vulnerabilities = parse_spotbugs_xml(spotbugs_file)
       
            dependency_name = dependency.replace(path_to_m2, '').replace('/','_')
            logging.info("Result = {}::{}".format(dependency_name,vulnerabilities))
            dependency_vulnerabilities[dependency_name] = vulnerabilities
                   
    # write results to file
    with open(dependency_vulnerabilities_output_file, 'w') as f:
        seperator = ';'
        for dependency in dependency_vulnerabilities:
            f.write("{}{}{}\n".format(dependency,seperator,dependency_vulnerabilities[dependency]))
    
            
os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling/')
os.getcwd()
path_to_data = os.path.abspath('../repositories_data') #TODO: replace with relative
path_to_m2 = '/media/agkortzis/Data/m2/repository/'
dependency_vulnerabilities_output_file = '../depependencies_spotbugs.csv'

currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))            

parse_trees_with_spotbugs(path_to_data, path_to_m2, dependency_vulnerabilities_output_file)

currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))


<a id="dependencies_projects"></a>
## Detect 'dependency' projects

In this step we identify projects that are used as dependencies in other projects. We want to exclude those projects from the analysis. 
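
The idea can be sketched with plain sets, independent of the `maven` module used in the notebook. The function name and the two input dictionaries are hypothetical, and the sketch assumes that each project's dependency set already excludes the project's own modules, as the notebook code does.

```python
def find_projects_used_as_dependencies(project_modules, project_dependencies):
    """Return the projects that own a module listed as a dependency of some project.

    project_modules: dict mapping project name -> set of its own module ids
    project_dependencies: dict mapping project name -> set of external dependency ids
    """
    all_dependencies = set()
    for deps in project_dependencies.values():
        all_dependencies.update(deps)
    return {
        project
        for project, modules in project_modules.items()
        if modules & all_dependencies
    }
```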

[Back to table of contents](#dc-index)

In [13]:
import datetime
import logging
import os

import maven

logging.basicConfig(level=logging.INFO)

def detect_dependencies(path_to_data, path_to_m2, projects_as_dependencies_output_file):
    dependencies = []
    main_modules = {}
    projects_as_dependencies = set()
    projects_tress = [f for f in os.listdir(path_to_data) if f.endswith('.trees')]
    total = len(projects_tress)
    logging.info("Found {} tree files".format(total))
    
    for index, f in enumerate(projects_tress):
        file_project_trees = os.path.join(path_to_data, f)
        project_name = f.replace('.trees','')
        logging.info("{}/{} :: {}".format(index,total,project_name))

        trees = maven.get_compiled_modules(file_project_trees)

        if not trees:
            logging.info(f'No modules to analyze: {file_project_trees}.')
            continue

        modules = [m.artifact for m in trees]
        dep_modules = [m.artifact for t in trees for m in t.deps if m.artifact not in modules]
        # add deps to total one
        dependencies.extend([module.get_m2_path(path_to_m2) for module in dep_modules])
        for main_mod in modules:
            main_modules[main_mod.get_m2_path(path_to_m2)] = project_name
        
    
    dependencies = set(dependencies) # remove duplicates; a set also gives fast membership tests
    # detect main_modules in dependencies
    for main_mod in main_modules:
        if main_mod in dependencies:
            projects_as_dependencies.add(main_modules[main_mod])
                
    # write results to file
    with open(projects_as_dependencies_output_file, 'w') as f:
        for project in projects_as_dependencies:
            f.write("{}\n".format(project))
                    
    return projects_as_dependencies

os.chdir('/home/agkortzis/git_repos/ICSR19/analysis/tooling/')
os.getcwd()
path_to_data = os.path.abspath('../repositories_data') #TODO: replace with relative
path_to_m2 = "/media/agkortzis/Elements/maven_m2"
projects_as_dependencies_output_file = '../projects_as_dependencies.csv'

currentDT = datetime.datetime.now()
print ("Started at :: {}".format(str(currentDT)))            

projects_as_dependencies = detect_dependencies(path_to_data, path_to_m2, projects_as_dependencies_output_file)
print("Found {} projects as dependencies".format(len(projects_as_dependencies)))

currentDT = datetime.datetime.now()
print ("Finished at :: {}".format(str(currentDT)))


INFO:root:Found 2279 tree files
INFO:root:0/2279 :: wcrasta.TwitterBoost-Tools)
INFO:root:1/2279 :: jackyhung.consumer-dispatcher)
INFO:root:2/2279 :: mdeverdelhan.ta4j)
INFO:root:3/2279 :: lviggiano.owner)
INFO:root:4/2279 :: gwtd3.gwt-d3)
INFO:root:5/2279 :: Hive2Hive.Hive2Hive)
INFO:root:6/2279 :: MutabilityDetector.MutabilityDetector)
INFO:root:7/2279 :: bfwg.angular-spring-starter)
INFO:root:8/2279 :: httl.httl)
INFO:root:9/2279 :: hortonworks-spark.shc)
INFO:root:10/2279 :: tranchis.xsd2thrift)
INFO:root:11/2279 :: rsertelon.android-keystore-recovery)
INFO:root:12/2279 :: linkedin.ml-ease)
INFO:root:13/2279 :: mysecureshell.mysecureshell)
INFO:root:14/2279 :: kloiasoft.eventapis)


Started at :: 2019-10-22 22:40:41.597849


INFO:root:15/2279 :: oxo42.stateless4j)
INFO:root:No modules to analyze: /home/agkortzis/git_repos/ICSR19/analysis/repositories_data/oxo42.stateless4j.trees.
INFO:root:16/2279 :: jingwei.krati)
INFO:root:No modules to analyze: /home/agkortzis/git_repos/ICSR19/analysis/repositories_data/jingwei.krati.trees.
INFO:root:17/2279 :: atomiqio.atomiq)
INFO:root:18/2279 :: RepreZen.KaiZen-OpenAPI-Editor)
INFO:root:No modules to analyze: /home/agkortzis/git_repos/ICSR19/analysis/repositories_data/RepreZen.KaiZen-OpenAPI-Editor.trees.
INFO:root:19/2279 :: streamsets.datacollector)
ERROR:maven:File is malformed: /home/agkortzis/git_repos/ICSR19/analysis/repositories_data/streamsets.datacollector.trees
INFO:root:No modules to analyze: /home/agkortzis/git_repos/ICSR19/analysis/repositories_data/streamsets.datacollector.trees.
INFO:root:20/2279 :: tbroyer.gwt-maven-archetypes)
INFO:root:No modules to analyze: /home/agkortzis/git_repos/ICSR19/analysis/repositories_data/tbroyer.gwt-maven-archetypes.tre

Found 36 projects as dependencies
Finished at :: 2019-10-22 22:42:07.169318
