# Examining the effects of ownership on software quality
Written by: Michel Kraaijeveld (mkraaijeveld, 4244311), Tom den Braber (tdenbraber, 4223780)

## The Case Of Lucene

We want to replicate the [study](http://dl.acm.org/citation.cfm?doid=2025113.2025119 "Examining the effects of ownership on software quality") done by Bird et al. and published in FSE'11. The idea is to see the results of a similar investigation on an OSS system. We select [Lucene](https://lucene.apache.org/core/), a search engine written in Java.

## Data collection

First we need to get the data to create our **table**, in other words we do what is called *data collection*.

In our case, we are interested in the relation between ownership-related metrics and post-release bugs. We investigate this relation at *file level*, because we focus on Java, where the building blocks are classes, which most of the time correspond 1-to-1 to files.

This means that our table will have one row per source code file and as many columns as the metrics we want to compute for that file, plus one column with the number of post-release bugs.

### Collecting git data

To compute most of the metrics we want to investigate (e.g., how many people changed a file in its entire history), we need to know the history of the files. We can obtain it by analyzing the *versioning system*. Lucene has a Subversion repository, but a [git mirror](https://github.com/apache/lucene-solr.git) is also available. We use the git repository because it allows us to keep the entire history locally, which makes the computations faster.

We run git from Python with the ```sh``` library. We use the ```json``` package to decode the issue files and the ```os``` library to list them. The ```time``` package converts timestamps so that we can select the issues in the right period. Finally, the ```csv``` package is used to write the results to a ```.csv``` file.

In [None]:
import sh
from json import JSONDecoder
from os import listdir, path
from time import strptime, mktime
import csv

We start by cloning the repository, unless a local copy already exists:

In [None]:
if not path.isdir("lucene-solr"):
    sh.git.clone("https://github.com/apache/lucene-solr.git")

We make sure that ```git``` runs in the correct working directory. Because git output that spans multiple "pages" would otherwise be sent through a pager, we pass the ```--no-pager``` parameter:

In [None]:
git = sh.git.bake("--no-pager", _cwd='lucene-solr')

To perform the replication, we inspect the ```trunk``` in the versioning system and focus on a 6-month period in which we look at the bugs occurring to the files existing at that moment. Since we have bug data (see discussion later) until half of July 2015, we consider a time window from Jan 01, 2015 to Jul 01, 2015. Therefore the next step is to go through all bug reports and check whether they are between these two dates.

To do this, we define the following two functions, which check for the right bug status and the correct time period:

In [None]:
def isClosedResolved(issue):
    return issue['fields']['status']['name'] == "Closed" and issue['fields']['resolution']['name'] == "Fixed"

def isCorrectTimePeriod(issue):
    t = mktime(strptime(issue['fields']['resolutiondate'], TIMEFORMAT))
    return t >= START and t < END

We then read all issue files and save the keys of the matching issues in a list:

In [None]:
# load all files
decoder = JSONDecoder()
PATH = "issue_LUCENE"
TIMEFORMAT = "%Y-%m-%dT%H:%M:%S.000+0000"
START = mktime(strptime("2015-01-01T00:00:00.000+0000", TIMEFORMAT))
END = mktime(strptime("2015-07-01T00:00:00.000+0000", TIMEFORMAT))

issues = []

for f in listdir(PATH):
    # close each issue file as soon as it has been decoded
    with open(path.join(PATH, f)) as jsonF:
        issue = decoder.decode(jsonF.read())
    if isClosedResolved(issue) and isCorrectTimePeriod(issue):
        issues.append(issue['key'])
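
To make the selection rule concrete, here is a minimal, self-contained sketch that applies the same two checks to a few hypothetical issue records. The issue keys and dates are made up, and the helper name `is_fixed_in_window` is ours. It compares `datetime` objects instead of `mktime` timestamps, which is a substitution rather than what the notebook does.

```python
from datetime import datetime

TIMEFORMAT = "%Y-%m-%dT%H:%M:%S.000+0000"
START = datetime.strptime("2015-01-01T00:00:00.000+0000", TIMEFORMAT)
END = datetime.strptime("2015-07-01T00:00:00.000+0000", TIMEFORMAT)


def is_fixed_in_window(issue):
    """True if the issue is Closed/Fixed and resolved in [START, END)."""
    fields = issue["fields"]
    # Unresolved issues may carry no resolution object at all.
    resolution = fields.get("resolution") or {}
    if fields["status"]["name"] != "Closed" or resolution.get("name") != "Fixed":
        return False
    resolved = datetime.strptime(fields["resolutiondate"], TIMEFORMAT)
    return START <= resolved < END


def make_issue(key, status, resolution, resolved):
    """Builds a hypothetical record with only the fields the checks read."""
    return {"key": key,
            "fields": {"status": {"name": status},
                       "resolution": {"name": resolution} if resolution else None,
                       "resolutiondate": resolved}}


# Hypothetical records, not taken from the real data set.
sample_issues = [
    make_issue("LUCENE-A", "Closed", "Fixed", "2015-03-14T10:15:00.000+0000"),
    make_issue("LUCENE-B", "Closed", "Fixed", "2015-07-01T00:00:00.000+0000"),
    make_issue("LUCENE-C", "Open", None, None),
    make_issue("LUCENE-D", "Closed", "Won't Fix", "2015-02-01T08:00:00.000+0000"),
]

selected = [issue["key"] for issue in sample_issues if is_fixed_in_window(issue)]
print(selected)
```

Only the first record is selected: the second falls exactly on the exclusive end of the window, the third is still open, and the fourth was closed without a fix.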

Now that we have the list of bugs in the defined period, it is time to create the actual table. To do this, we first need to find the commit hashes associated with a specific issue. We therefore create a function that retrieves the hashes for a given bugfix number, which has the format ```LUCENE-####```:

In [None]:
def linkBugFixNrToCommit(git, bugFixNr):
    """Given a bugfix nr (in the format: LUCENE-#NR#), this function returns the
    commit hash of this bugfix"""
    commits = git.log("--no-merges", "--pretty=%s//::://%H", "--grep", bugFixNr + ":").strip("\n").split("\n")
    bugfixCommits = []
    if len(commits) > 0 and commits[0] != "":
        for commit in commits:
            bugfixCommits.append(commit.split("//::://")[1])
    return bugfixCommits
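
The parsing step can be tried without a repository. The sketch below runs the same split on a hard-coded string that mimics the `--pretty=%s//::://%H` output. The commit subjects and hashes are made up, and `hashes_from_log` is a hypothetical helper name.

```python
SEPARATOR = "//::://"


def hashes_from_log(log_output):
    """Extracts the commit hashes from 'subject//::://hash' lines."""
    hashes = []
    for line in log_output.strip("\n").split("\n"):
        if SEPARATOR not in line:
            continue  # skips the empty line produced when nothing matched
        # Split on the last separator, in case a subject contains it too.
        _subject, commit_hash = line.rsplit(SEPARATOR, 1)
        hashes.append(commit_hash)
    return hashes


# Made-up log output for illustration only.
sample_log = (
    "LUCENE-1234: fix hypothetical bug//::://1111aaa\n"
    "LUCENE-1234: add regression test//::://2222bbb\n"
)

print(hashes_from_log(sample_log))
print(hashes_from_log(""))
```

The trailing colon in the `--grep` pattern matters: searching for `LUCENE-123:` does not match a commit subject that starts with `LUCENE-1234:`.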

Now that we have one or more commits associated with a specific bug, we need to check which files were altered in these commits. We assume that all files changed in such a commit were actually related to the bugfix. We identify a version of a file by its filename and the commit hash, which means that each file changed in a specific commit needs to be added to the table as having 1 bug for that version of the file.
But first things first, so let's find the files that were changed in the commits:

In [None]:
def linkCommitToFiles(commitHash):
    """Given a commit hash (the abbreviated form also works), this function
    finds the .java files that were changed in that specific commit"""
    changedFiles = git('diff-tree', '--no-commit-id', '--name-only', '-r', commitHash)
    changedFiles = changedFiles.split("\n")
    #Remove all files that do not end in .java
    filtered = [ f for f in changedFiles if f.endswith('.java') ]
    return filtered

The function above only returns files that end in '.java', since those are the ones we're interested in. Everything else, such as .txt files, is skipped.

At this point, we have access to the files that were changed. The next step is to calculate the metrics for each of these files, where a contribution is measured as a share of the commits to the file:
* minor: the number of people who made less than 5% of the commits to the file
* major: the number of people who made 5% or more of the commits to the file
* total: the total number of minor and major contributors
* ownership: the highest percentage of commits made by a single person
* number of bugs: the number of bugs for the version of the file

In order to do this correctly, we first need to know how many commits each of the found files had, and who authored them:

In [None]:
def getListOfCommitsUptoCommit(git, commitHash, filePath): 
    """Given a git repository, a commithash and a filePath, this function returns a list 
    of commits upto (but excluding) this commit for this file"""
    # get commit hash in which this file was added
    addedCommitHash = git.log("--pretty=%H", "--diff-filter", "A", "--", filePath).split("\n")[0]
    
    # get a list of commits
    commitRange = addedCommitHash + "..." + commitHash
    # 'rev-list' is passed as a string: 'git rl' is not a standard git command
    commitList = git("rev-list", "--pretty=%an", "--reverse", commitRange, 
            "--boundary", "--", filePath).split("\ncommit ")
    
    # the following piece of code does 3 things:
    # 1.    it checks if the commit is a 'boundary commit': if yes, it is only accepted
    #       if it is the 'addedCommitHash'
    # 2.    it removes the last commit, as this is the commit that is not wanted
    # 3.    it splits each hash\nauthorname in hash and authorname and selects the author
    if len(commitList) > 0 and commitList[0] != "":
        commits = [item.strip("\n").split("\n")[1] for item in commitList 
        if ((item.find("-") == -1 or item.find(addedCommitHash) != -1) 
            and item.find(commitHash) == -1)]
    else:
        # return a list (not the raw output) so callers can iterate over authors
        commits = git.log("--pretty=%an", "-n", 1, commitHash, "--", filePath).strip("\n").split("\n")
    return commits

The above function retrieves all commits for a specific file using ```git rev-list```. With the pretty-printing option ```--pretty=%an```, each commit is printed with just its author name. The result of this function is thus a list of authors for the given file. Note that an author appears in the list once for every commit they made.
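
As an illustration of the filtering, the sketch below parses a hard-coded string shaped like `git rev-list --pretty=%an --boundary` output, in which every commit is printed as a `commit <hash>` line followed by the author name, and boundary commits carry a `-` before the hash. The hashes and names are made up, and `parse_authors` is a hypothetical variant that checks only the header line for the boundary marker, not the whole entry as the function above does.

```python
def parse_authors(rev_list_output, added_hash, target_hash):
    """Returns one author name per commit, skipping unwanted commits.

    Boundary commits are kept only if they are the commit that added the
    file; the target commit itself is always excluded.
    """
    authors = []
    for entry in rev_list_output.strip("\n").split("\ncommit "):
        if not entry:
            continue
        header, _, author = entry.partition("\n")
        # The first entry still carries the leading "commit " prefix.
        commit_id = header.replace("commit ", "", 1).strip()
        is_boundary = commit_id.startswith("-")
        commit_id = commit_id.lstrip("-")
        if is_boundary and commit_id != added_hash:
            continue
        if commit_id == target_hash:
            continue
        authors.append(author.strip())
    return authors


# Made-up output: aaa111 added the file, ddd444 is the bugfix commit.
SAMPLE = (
    "commit -aaa111\nAlice\n"
    "commit bbb222\nBob\n"
    "commit ccc333\nAlice\n"
    "commit ddd444\nCarol\n"
)

print(parse_authors(SAMPLE, "aaa111", "ddd444"))
```

Checking only the header line keeps an author whose name contains a hyphen from being mistaken for a boundary commit.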

Next, we need to check how many different commit authors there are and how many commits each of the authors has:

In [None]:
def getAuthorsForFile(authors):
    """Returns a tuple (contributors, total), where contributors is a
    dictionary with the authors as keys and the number of commits per
    author as values, and total is the total number of commits
    authors: a list of authors, one entry per commit"""
    contributors = dict()
    for author in authors:
        if author in contributors:
            contributors[author] += 1
        else:
            contributors[author] = 1
    return (contributors, len(authors))
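
The same counting can also be written with `collections.Counter` from the standard library. This is a sketch of an equivalent alternative, not the notebook's code; the function name and the author names are made up.

```python
from collections import Counter


def count_commits_per_author(authors):
    """Returns (commits per author, total number of commits)."""
    counts = Counter(authors)
    return dict(counts), len(authors)


# One list entry per commit, as produced by the history lookup.
contributors, total = count_commits_per_author(["Alice", "Bob", "Alice"])
print(contributors, total)
```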

These per-author commit counts can then be used to calculate our metrics:

In [None]:
def computeStatsOnFile(contribTuple):
    "Computes the following tuple: (#minor, #major, #total, %ownership)"
    (contrib, total) = contribTuple
    maxPercentage = 0.0
    minors = 0
    majors = 0
    for author, commits in contrib.items():
        currentPercentage = commits/total
        if currentPercentage > maxPercentage:
            maxPercentage = currentPercentage
        if currentPercentage < 0.05:
            minors += 1
        else:
            majors += 1
    
    return (minors, majors, len(contrib.keys()), maxPercentage*100)
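
A small worked example shows how the 5% threshold behaves. The sketch below reimplements the same rule under the hypothetical name `ownership_metrics` and applies it to made-up commit counts. The early return for an empty history is our addition.

```python
def ownership_metrics(commits_per_author):
    """Returns (#minor, #major, #total, %ownership) for one file."""
    total = sum(commits_per_author.values())
    if total == 0:
        return (0, 0, 0, 0.0)  # assumption: an empty history has no owners
    minors = majors = 0
    ownership = 0.0
    for commits in commits_per_author.values():
        if commits / total < 0.05:
            minors += 1
        else:
            majors += 1  # exactly 5% counts as a major contributor
        ownership = max(ownership, 100.0 * commits / total)
    return (minors, majors, len(commits_per_author), ownership)


# 25 commits: Alice 80%, Bob 16%, Carol 4% (below the 5% threshold).
print(ownership_metrics({"Alice": 20, "Bob": 4, "Carol": 1}))
# 20 commits: Erin has exactly 5% and is therefore a major contributor.
print(ownership_metrics({"Dave": 19, "Erin": 1}))
```

In the first case Carol is the only minor contributor and Alice owns 80% of the commits. The second case shows that a share of exactly 5% is counted as major, because the test for minor is a strict "less than".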

This results in a tuple that contains the number of minor contributors, the number of major contributors, the total number of contributors and the maximum ownership as a percentage. Now that we have calculated our metrics, the only thing left is to add them to the table. We therefore first need to create a version identifier for the file, which consists of the filename and the commit hash:

In [None]:
fileName = javaFile + "_" + commitHash

Then we need to add this to our table and add a 1 to the number of bugs for this entry (which should result in at most 1 bug per file version):

In [None]:
def addTupleToTable(filename, metrics, table, nrOfBugs):
    """Adds the given tuple to the table, based on the filename. If the 
    filename was already present, its number of bugs is increased by nrOfBugs"""
    BUG_INDEX = 1
    if filename in table:
        table[filename][BUG_INDEX] += nrOfBugs
    else:
        table[filename] = [metrics, nrOfBugs]

    return table 
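
To show what the table looks like after a few updates, here is a self-contained sketch with a made-up file path, hash and metrics. `add_to_table` is a hypothetical stand-in that follows the same logic as the function above.

```python
def add_to_table(table, file_version, metrics, nr_of_bugs):
    """Adds a row, or increases the bug count of an existing row."""
    if file_version in table:
        table[file_version][1] += nr_of_bugs
    else:
        table[file_version] = [metrics, nr_of_bugs]
    return table


table = {}
# Hypothetical file version: path + "_" + commit hash.
version = "core/src/Example.java" + "_" + "1111aaa"
metrics = (1, 2, 3, 80.0)

add_to_table(table, version, metrics, 1)  # first bug report for this version
add_to_table(table, version, metrics, 1)  # second issue linked to the same commit
add_to_table(table, "core/src/Other.java_2222bbb", (0, 1, 1, 100.0), 0)

print(table)
```

Note that a file version can end up with more than one bug if several issues are linked to the same commit.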

At this point we can create a table with entries for each of the files associated with bug reports, but we still need to add the files that have had 0 bugs. We decided to get all commits in the previously defined period:

In [None]:
allCommitsInPeriod = git.log("--no-merges", "--pretty=%s//::://%H", '--since={2015-01-01}', '--until={2015-07-01}').strip("\n").split("\n")

These commits also include the bugfixes, so we had to filter these out. We treat every commit whose message references a Lucene issue key as issue-related:

In [None]:
def isBugFixCommit(commitMsg):
    return commitMsg.find("LUCENE-") != -1 or commitMsg.find("LUCENE_") != -1
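
The effect of this filter can be checked on a few made-up commit messages. The sketch below uses the hypothetical name `mentions_lucene_issue` for the same test, written with the `in` operator.

```python
def mentions_lucene_issue(commit_msg):
    """True if the message references a Lucene issue key."""
    return "LUCENE-" in commit_msg or "LUCENE_" in commit_msg


# Made-up commit messages for illustration only.
samples = [
    "LUCENE-0000: fix hypothetical bug",
    "SOLR-0000: hypothetical Solr-only change",
    "Update build scripts",
]

flags = [mentions_lucene_issue(msg) for msg in samples]
print(flags)
```

Only the first message is filtered out. The check does not distinguish bug reports from other issue types, so any commit that mentions a Lucene issue key is excluded from the '0 bugs' set.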

For the commits that were not related to bugfixes, we wanted to see which files were changed, as we assume those versions did not contain bugs. The files changed in these commits are therefore added to the table with '0' as the number of bugs. We only consider files that were committed in the period because it would otherwise be very hard to tell whether a given file version contained a bug. When a file was changed several times and a bug was found at some point, it is very hard to determine which specific versions already contained this bug, which could lead to wrongly assigning '0 bugs' to a file version.
The files are added to the table in the same way as for the bugfixes, since both cases only depend on commit hashes.

When we combine all the functions we talked about earlier in this notebook, we can create the table (in csv format) as follows:

In [None]:
results = {}

for issue in issues:
    commitHashes = linkBugFixNrToCommit(git, issue)
    for commitHash in commitHashes:
        for javaFile in linkCommitToFiles(commitHash):
            commits = getListOfCommitsUptoCommit(git, commitHash, javaFile)
            contribTuple = getAuthorsForFile(commits)
            fileStats = computeStatsOnFile(contribTuple)
            fileName = javaFile + "_" + commitHash
            results = addTupleToTable(fileName, fileStats, results, 1)

In [None]:
allCommitsInPeriod = git.log("--no-merges", "--pretty=%s//::://%H", '--since={2015-01-01}', '--until={2015-07-01}').strip("\n").split("\n")
for commit in allCommitsInPeriod:
    print(commit)
    (msg,commitHash) = commit.split("//::://")
    if not isBugFixCommit(msg):
        for javaFile in linkCommitToFiles(commitHash):
            commits = getListOfCommitsUptoCommit(git, commitHash, javaFile)
            contribTuple = getAuthorsForFile(commits)
            fileStats = computeStatsOnFile(contribTuple)
            fileName = javaFile + "_" + commitHash
            results = addTupleToTable(fileName, fileStats, results, 0)


writeResultsToFile(results)
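
The helper `writeResultsToFile` called above is not shown in this notebook. A minimal sketch of what it could look like is given below; it is one possible implementation, not the original. The column names are assumptions, and the default filename is the one mentioned in the results section.

```python
import csv


def writeResultsToFile(results, path="result-final-6-months.csv"):
    """Writes one row per file version: the four metrics plus the bug count.

    results maps a file version to [(minor, major, total, ownership), bugs].
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Header names are an assumption; the original layout is not shown.
        writer.writerow(["file", "minor", "major", "total", "ownership", "bugs"])
        for file_version in sorted(results):
            (minors, majors, total, ownership), bugs = results[file_version]
            writer.writerow([file_version, minors, majors, total, ownership, bugs])
```

Sorting the keys only makes the output deterministic; the row order carries no meaning.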

### Discussion
Several decisions had to be made in order to complete this assignment. This section contains a short explanation of those decisions.

#### Period
We chose a 6-month period. Although the assignment stated that all issues of the Lucene project until August 2015 had been collected, we found that the issues of August were missing. We therefore used all issues resolved from January 1st, 2015 up to but not including July 1st, 2015.

#### Data collection
When we found that a bug was fixed in our period, we computed the metrics for that file at one version earlier. We did not make a distinction between bugs that were introduced and fixed in our time period of interest, and bugs that were introduced before our period and fixed in our period.
For all the other files that did not have bugs, we selected all commits that were made in our time period. For each commit, we computed all the metrics on the previous commit, as we do not know whether a bug was introduced in the most recent commit. When a file had just one commit, the metrics were computed on this version.

### Results
The results of the code above can be found in ```result-final-6-months.csv```. 