# Fundamentals of Software Systems (FSS)
**Software Evolution – Part 02 Assignment**

## Submission Guidelines

To correctly complete this assignment you must:
* Carry out the assignment in a team of 2 to 4 students.
* Carry out the assignment with your team only. You are allowed to discuss solutions with other teams, but each team should come up with its own solution. A strict plagiarism policy will be applied to all artifacts submitted for evaluation.
* As your submission, upload the completed Jupyter Notebook (including outputs) together with the d3 visualization web pages (i.e. upload everything you downloaded, including the completed Jupyter Notebook, plus your `output.json`).
* The files must be uploaded to OLAT as a single ZIP (`.zip`) file by Dec 12, 2022 @ 23:55.


## Group Members
* Firstname, Lastname, Immatrikulation Number
* Lucius, Bachmann, 11-060-274
* Andreas, Wiemeyer, 15-728-405

## Task Context

In this assignment we will be analyzing the _elasticsearch_ project. All following tasks should be done with the subset of commits from tag `v1.0.0` to tag `v1.1.0`.

Website: https://github.com/elastic/elasticsearch
Repository: https://github.com/elastic/elasticsearch.git

In [None]:
from enum import Enum 

class Modification(Enum):
    ADDED = "Lines added"
    REMOVED = "Lines removed"
    TOTAL = "Lines added + lines removed"
    DIFF = "Lines added - lines removed"

In [None]:
# noinspection PyUnresolvedReferences
import os.path
from datetime import datetime
from os import path, mkdir

# noinspection PyUnresolvedReferences
import pandas as pd
# noinspection PyUnresolvedReferences
import plotly.express as px
from pydriller import Repository, Git

repo_remote_path = 'https://github.com/elastic/elasticsearch.git'
repo_owner = 'elastic'
repo_name = 'elasticsearch'
repo_checkout_path = f'{repo_owner}/{repo_name}'

if not path.exists(repo_owner):
    mkdir(repo_owner)

from_tag = 'v1.0.0'
to_tag = 'v1.1.0'
repo = Repository(repo_remote_path, clone_repo_to=repo_owner, from_tag=from_tag, to_tag=to_tag)
# clone repo if necessary
for commit in repo.traverse_commits():
    break
git = Git(repo_checkout_path)

## Task 1: Author analysis

In the following, please consider only `java` files.

The first task is to get an overview of the author ownership of the _elasticsearch_ project. In particular, we want to understand who the main authors in the system are between the two considered tags, how authors are distributed among files, and how files are distributed among authors. To this end, perform the following:
- create a dictionary (or a list of tuples) with the pairs author => number of modified files
- create a dictionary (or a list of tuples) with the pairs file => number of authors who modified the file
- visualize the distribution of authors among files: the visualization should have on the x axis the number of authors per file (from 1 to max), and on the y axis the number of files with the given number of authors (so for example the first bar represents the number of files with a single author)
- visualize the distribution of files among authors: the visualization should have on the x axis the number of files per author (from 1 to max), and on the y axis the number of authors who modified the given number of files (so for example the first bar represents the minor contributors, i.e., the number of authors who changed only 1 file)

Comment on the two distribution visualizations.
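
As an illustration of the two distributions described above, here is a minimal sketch on made-up data. The dictionary contents and the helper name `distribution` are assumptions for the example only; they are not part of the assignment or of PyDriller.

```python
from collections import Counter

# Hypothetical toy data: file path -> set of authors who modified it.
files_to_authors = {
    "A.java": {"alice"},
    "B.java": {"alice", "bob"},
    "C.java": {"bob"},
    "D.java": {"alice", "bob", "carol"},
}


def distribution(sets_by_key):
    """Map each set size k (1..max) to the number of keys whose set has size k."""
    counts = Counter(len(members) for members in sets_by_key.values())
    max_size = max(counts, default=0)
    return {k: counts.get(k, 0) for k in range(1, max_size + 1)}


# Invert the mapping to get author -> set of files.
authors_to_files = {}
for file_path, authors in files_to_authors.items():
    for author in authors:
        authors_to_files.setdefault(author, set()).add(file_path)

authors_per_file = distribution(files_to_authors)
files_per_author = distribution(authors_to_files)
print(authors_per_file)  # {1: 2, 2: 1, 3: 1}
print(files_per_author)  # {1: 1, 2: 0, 3: 2}
```

Filling every value from 1 to the maximum keeps empty buckets (such as "2 files" here) visible on the x axis instead of silently dropping them.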



Now, let's look at the following 3 packages in more detail:
1. `src/main/java/org/elasticsearch/search`
2. `src/main/java/org/elasticsearch/index`
3. `src/main/java/org/elasticsearch/action`

Create a function that, given the path of a package and a modification type (see class Modification above), returns a dictionary of authors => number, where the number counts the total lines added or removed or added+removed or added-removed (depending on the given Modification parameter), for the given package. To compute the value at the package level, you should aggregate the data per file.
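
One possible shape for such a function is sketched below on plain tuples rather than on PyDriller objects. The record layout `(author, file path, lines added, lines removed)`, the helper `lines_for`, and the sample paths are assumptions made for illustration; the sketch also sums directly over the change records instead of building a per-file intermediate.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class Modification(Enum):
    ADDED = "Lines added"
    REMOVED = "Lines removed"
    TOTAL = "Lines added + lines removed"
    DIFF = "Lines added - lines removed"


# Hypothetical record: (author, file path, lines added, lines removed).
Change = Tuple[str, str, int, int]


def lines_for(mod_type: Modification, added: int, removed: int) -> int:
    """Return the line count that the given modification type asks for."""
    if mod_type is Modification.ADDED:
        return added
    if mod_type is Modification.REMOVED:
        return removed
    if mod_type is Modification.TOTAL:
        return added + removed
    return added - removed  # Modification.DIFF


def package_contributions(changes: Iterable[Change], package: str,
                          mod_type: Modification) -> Dict[str, int]:
    """Sum the requested line count per author over all files inside `package`."""
    package_prefix = package.rstrip("/") + "/"
    totals: Dict[str, int] = {}
    for author, file_path, added, removed in changes:
        if not file_path.startswith(package_prefix):
            continue
        totals[author] = totals.get(author, 0) + lines_for(mod_type, added, removed)
    return totals


changes = [
    ("alice", "src/search/Query.java", 10, 2),
    ("bob", "src/search/Hit.java", 3, 3),
    ("alice", "src/index/Shard.java", 100, 0),
    ("alice", "src/search/Hit.java", 0, 5),
]
total = package_contributions(changes, "src/search", Modification.TOTAL)
diff = package_contributions(changes, "src/search", Modification.DIFF)
print(total)  # {'alice': 17, 'bob': 6}
print(diff)   # {'alice': 3, 'bob': 0}
```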

Using the function defined above, visualize the author contributions (lines added + lines removed). The visualization should have the author on the x axis, and the total lines on the y axis. Sort the visualization in decreasing amount of contributions, i.e., the main author should be the first.

Compare the visualization for the 3 packages and comment.

Count the number of authors per file and the number of files per author.

In [None]:
from pandas import DataFrame
from typing import Dict, Set

repo = Repository(repo_checkout_path, from_tag=from_tag, to_tag=to_tag)

files_to_authors: Dict[str, Set[str]] = {}
authors_to_files: Dict[str, Set[str]] = {}

for commit in repo.traverse_commits():
    author_identifier = commit.author.name
    authors_to_files.setdefault(author_identifier, set())
    for file in commit.modified_files:
        if file.new_path is not None:
            files_to_authors.setdefault(file.new_path, set())
            files_to_authors[file.new_path].add(author_identifier)
            authors_to_files[author_identifier].add(file.new_path)

Visualize Nr of authors per file:

In [None]:
# Use a list of tuples: pandas refuses to build a DataFrame from an unordered set.
file_author_count = [(k, len(authors)) for k, authors in files_to_authors.items()]
col_name_file = 'File'
col_name_nr_of_authors = 'Nr of authors'
df_file_author_count = DataFrame(file_author_count, columns=[col_name_file, col_name_nr_of_authors])
df_file_author_count_sorted = df_file_author_count.sort_values(by=[col_name_nr_of_authors], ascending=False)

The 20 files with the most authors:

In [None]:
display(df_file_author_count_sorted[:20])

Distribution of author count per file:

In [None]:
hist_author_count = px.histogram(df_file_author_count_sorted, x=col_name_nr_of_authors)
hist_author_count.show()
hist_author_log_count = px.histogram(df_file_author_count_sorted, x=col_name_nr_of_authors, log_y=True)
hist_author_log_count.show()

As in many projects, a few files are touched by many authors, while most files are touched by a single author.
The file with the most authors is `pom.xml`, where the project configuration is stored.
Every developer has to change this file when, for example, a dependency is added, removed or updated.
Next come some test files with 7 or 8 authors each, which is a moderate number, although it is reached within the span of only two minor releases.

Visualize Nr of files per author:

In [None]:
# Use a list of tuples: pandas refuses to build a DataFrame from an unordered set.
author_file_count = [(k, len(files)) for k, files in authors_to_files.items()]
col_name_author = 'Author'
col_name_nr_of_files = 'Nr of files'
df_author_file_count = DataFrame(author_file_count, columns=[col_name_author, col_name_nr_of_files])
df_author_file_count_sorted = df_author_file_count.sort_values(by=[col_name_nr_of_files], ascending=False)

The 20 authors who changed the most files:

In [None]:
display(df_author_file_count_sorted[:20])

Distribution of files modified per author:

In [None]:
hist_author_count = px.histogram(df_author_file_count_sorted, x=col_name_nr_of_files)
hist_author_count.show()

Most people change only a few files, as has been observed in other open source projects.
There are some main contributors who changed more than 20 files.



In [None]:
from typing import Dict


# Per author: one line counter for every Modification type.
AuthorChurnDict = Dict[str, Dict[Modification, int]]

# noinspection PyShadowingNames
def author_churn_of_path(path: str, from_tag: str = from_tag, to_tag: str = to_tag) -> AuthorChurnDict:
    """Aggregate the line churn per author over all files whose path contains `path`."""
    modifications: AuthorChurnDict = {}
    repo = Repository(path_to_repo=repo_checkout_path, from_tag=from_tag, to_tag=to_tag)
    for commit in repo.traverse_commits():
        author_identifier = commit.author.name
        # Every author gets an entry; authors without changes in `path` keep zero counters.
        modifications.setdefault(author_identifier, {mod_type: 0 for mod_type in Modification})
        for file in commit.modified_files:
            # Skip deleted files and files outside the requested package.
            if file.new_path is None or path not in file.new_path:
                continue
            churn = modifications[author_identifier]
            churn[Modification.ADDED] += file.added_lines
            churn[Modification.REMOVED] += file.deleted_lines
            churn[Modification.TOTAL] += file.added_lines + file.deleted_lines
            churn[Modification.DIFF] += file.added_lines - file.deleted_lines

    return modifications

display([(k, v) for k, v in author_churn_of_path('src/main/java/org/elasticsearch/search').items()][:20])


def sorted_df_of_dict(
        churn_by_author: AuthorChurnDict,
        col_name_x: str,
        col_name_y: str,
        mod_type=Modification.TOTAL,
) -> DataFrame:
    """Build a DataFrame of (author, churn for `mod_type`), sorted by churn in descending order."""
    churn_table = [(author, churn[mod_type]) for author, churn in churn_by_author.items()]
    df_churn_table = DataFrame(churn_table, columns=[col_name_x, col_name_y])
    return df_churn_table.sort_values(by=[col_name_y], ascending=False)

Most active authors for path: `src/main/java/org/elasticsearch/search`

In [None]:
search_author_churn_of_path = author_churn_of_path('src/main/java/org/elasticsearch/search')
col_name_total = 'Lines added + removed'
data_frame = sorted_df_of_dict(search_author_churn_of_path, col_name_x=col_name_author, col_name_y=col_name_total)
hist_author_count = px.histogram(data_frame, x=col_name_author, y=col_name_total)
hist_author_count.show()

Most active authors for path: `src/main/java/org/elasticsearch/index`

In [None]:
index_author_churn_of_path = author_churn_of_path('src/main/java/org/elasticsearch/index')
col_name_total = 'Lines added + removed'
data_frame = sorted_df_of_dict(index_author_churn_of_path, col_name_x=col_name_author, col_name_y=col_name_total)
hist_author_count = px.histogram(data_frame, x=col_name_author, y=col_name_total)
hist_author_count.show()

Most active authors for path: `src/main/java/org/elasticsearch/action`

In [None]:
action_author_churn_of_path = author_churn_of_path('src/main/java/org/elasticsearch/action')
col_name_total = 'Lines added + removed'
data_frame = sorted_df_of_dict(action_author_churn_of_path, col_name_x=col_name_author, col_name_y=col_name_total)
hist_author_count = px.histogram(data_frame, x=col_name_author, y=col_name_total)
hist_author_count.show()

## Task 2: Knowledge loss

We now want to analyze the knowledge loss that would occur if the main contributor of the analyzed project were to leave. For this we will use the circle packing layout introduced in the "Code as a Crime Scene" book. It should show how much of each file was written by the main contributor of _elasticsearch_ (according to the analysis above using `Modification.TOTAL`) and indicate which areas would be affected most if this contributor left the project. This assignment includes the necessary `knowledge_loss.html` file as well as the `d3` folder with the d3 dependencies. Your task is to create the `output.json` file according to the specification below. This file can then be visualized with the files provided.

To show the visualization once you have the output as `output.json`, you should
* make sure to have the `knowledge_loss.html` file in the same folder
* start a local HTTP server in the same folder (e.g. with Python: `python3 -m http.server`; d3 needs the files to be served over HTTP)
* open the served `knowledge_loss.html` and look at the visualization

For testing, you can use the provided `output.json`; you should see a circle packing layout with two circles: one big red circle and one small white-red circle.

For the package you identify as the worst in terms of knowledge loss, investigate the author contributions using the function defined in the previous exercise and comment on the situation, e.g. how big the gap between the main author and the second biggest contributor is for the selected package.

In [None]:
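def handle_paths(old_path, new_path):
    """Make sure `knowledge_loss` has an entry for `new_path`, carrying churn over on renames."""
    # new file, or a file not seen so far: start counting from zero
    if (not old_path) or (old_path not in knowledge_loss):
        knowledge_loss[new_path] = {'total churn': 0, 'author churn': 0}
    # renamed or moved file: keep its accumulated churn under the new path
    elif new_path != old_path:
        knowledge_loss[new_path] = knowledge_loss.pop(old_path)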

In [None]:
main_author = 'javanna'
knowledge_loss = {}

# Walk the commits oldest-first (PyDriller's default order) so that renames and
# deletions are applied after the earlier changes to the same file.
for commit in repo.traverse_commits():
    author = commit.author.name

    for file in commit.modified_files:
        # handle deleted files
        if not file.new_path:
            if file.old_path in knowledge_loss:
                del knowledge_loss[file.old_path]

        # increment churn for files that continue to exist
        elif file.new_path.endswith(".java"):
            handle_paths(file.old_path, file.new_path)
            churn = file.added_lines + file.deleted_lines
            knowledge_loss[file.new_path]['total churn'] += churn
            if author == main_author:
                knowledge_loss[file.new_path]['author churn'] += churn

In [None]:
knowledge_loss.keys()

In [None]:
def build_tree(prefix, consumed_dict, output_dict):
    """Recursively append the files and folders below `prefix` to `output_dict`."""
    subtrees = {}

    for key in consumed_dict:
        path = prefix + key
        folder_list = key.split("/")
        next_folder = folder_list[0]

        # BASE CASE: the key is a file name directly inside the current folder
        if len(folder_list) == 1:
            total_churn = knowledge_loss[path]["total churn"]
            author_churn = knowledge_loss[path]["author churn"]
            leaf = {
                "author_color": "red",
                "size": total_churn,
                "name": key,
                "weight": total_churn,
                # guard against files without changed lines (e.g. pure renames)
                "ownership": author_churn / total_churn if total_churn else 0,
                "children": []
            }
            output_dict["children"].append(leaf)

        else:
            # group the remaining relative paths by their first folder
            new_prefix = prefix + next_folder + '/'
            new_path = "/".join(folder_list[1:])
            subtrees.setdefault(new_prefix, {})[new_path] = knowledge_loss[path]

    # RECURSION: one branch per sub-folder
    for sub_prefix, sub_dict in subtrees.items():
        branch = {
            "name": sub_prefix,
            "children": []
        }
        output_dict["children"].append(branch)
        build_tree(sub_prefix, sub_dict, branch)

In [None]:
base = {"name": "root",
        "children": []
        }
build_tree("", knowledge_loss, base)

### Output Format for Visualization

Example:

* `root` is always the root of the tree
* `size` should be the total number of lines of contribution
* `weight` can be set to the same as `size`
* `ownership` should be set to the fraction of contributions from the main author (e.g. 0.98 if 98% of the contributions come from the main author)

```
{
  "name": "root",
  "children": [
    {
      "name": "test",
      "children": [
        {
          "name": "benchmarking",
          "children": [
            {
              "author_color": "red",
              "size": "4005",
              "name": "t6726-patmat-analysis.scala",
              "weight": 1.0,
              "ownership": 0.9,
              "children": []
            },
            {
              "author_color": "red",
              "size": "55",
              "name": "TreeSetIterator.scala",
              "weight": 0.88,
              "ownership": 0.2,
              "children": []
            }
          ]
        }
      ]
    }
  ]
}
```
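
A nested structure of this shape can be derived from a flat mapping of file paths to churn values. The following is a minimal sketch under stated assumptions: the input dictionary `churn`, its `(total churn, main-author churn)` tuples, and the sample paths are invented for illustration, and `size` is emitted as a number, whereas the example above uses strings.

```python
import json

# Hypothetical flat input: file path -> (total churn, main-author churn).
churn = {
    "src/search/Query.java": (40, 30),
    "src/search/Hit.java": (10, 0),
    "src/index/Shard.java": (50, 50),
}


def build_tree(churn_by_path):
    """Turn a flat path -> churn mapping into the nested name/children format."""
    root = {"name": "root", "children": []}
    for path, (total, by_author) in sorted(churn_by_path.items()):
        node = root
        *folders, file_name = path.split("/")
        for folder in folders:
            # Reuse the folder node if it already exists, otherwise create it.
            child = next(
                (c for c in node["children"] if c["name"] == folder and "size" not in c),
                None,
            )
            if child is None:
                child = {"name": folder, "children": []}
                node["children"].append(child)
            node = child
        node["children"].append({
            "author_color": "red",
            "size": total,
            "name": file_name,
            "weight": total,
            # Fraction of the churn that comes from the main author.
            "ownership": by_author / total if total else 0.0,
            "children": [],
        })
    return root


tree = build_tree(churn)
print(json.dumps(tree, indent=2))
```

Walking the folders one path at a time avoids recursion, at the cost of a linear search through the children at each level, which is acceptable for a few thousand files.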

### JSON Export

For exporting the data to JSON you can use the following snippet:

```
import json

with open("output.json", "w") as file:
    json.dump(tree, file, indent=4)
```

In [None]:
import json
from IPython.display import HTML, Javascript


In [None]:
try:
    # noinspection PyUnresolvedReferences,PyUnboundLocalVariable
    json_output = base
except NameError:
    # fall back to the provided example tree if `base` has not been built
    json_output = {
        "name": "root",
        "children": [
            {
                "name": "test",
                "children": [
                    {
                        "name": "benchmarking",
                        "children": [
                            {
                                "author_color": "red",
                                "size": "4005",
                                "name": "t6726-patmat-analysis.scala",
                                "weight": 1.0,
                                "ownership": 0.9,
                                "children": []
                            },
                            {
                                "author_color": "red",
                                "size": "55",
                                "name": "TreeSetIterator.scala",
                                "weight": 0.88,
                                "ownership": 0.2,
                                "children": []
                            }
                        ]
                    }
                ]
            }
        ]
    }


# noinspection PyUnboundLocalVariable
Javascript(
    f"""
    window.json = {json.dumps(json_output)}
    """
)

Add target div to render to.

This shows the ownership of the main contributor javanna.

In [None]:
HTML('<div class="knowledge-loss-map"></div>')

In [None]:
Javascript(filename='knowledge-map.js', css='knowledge-map.css')

## Task 3: Code Churn Analysis

The third and last task is to analyze the code churn of the _elasticsearch_ project. For this analysis we look at the code churn, meaning the daily change in the total number of lines of the project. Visualize the code churn over time, bucketing the data by day. Remember that you'll need to fill the gaps for days when there are no commits. Choose a filling strategy and justify it.

Look at the churn trend over time and identify two outliers. For each of them:
- identify if it was caused by a single or multiple commits (since you are bucketing the data by day)
- find the hash of the involved commit(s)
- find the involved files
- look at the actual diff

Based on the above, discuss if the outlier is a false positive or should be a reason for concern.
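
One simple way to shortlist candidate outliers, besides reading them off the plot, is to rank the days by churn. The sketch below uses invented numbers, and the helper `top_outliers` is a hypothetical name; the selection rule (the n largest days) is an assumption rather than something the assignment prescribes.

```python
from datetime import date

# Hypothetical daily churn (lines added + lines deleted), already gap-filled.
daily_churn = {
    date(2000, 1, 1): 120,
    date(2000, 1, 2): 0,
    date(2000, 1, 3): 95,
    date(2000, 1, 4): 4800,
    date(2000, 1, 5): 150,
    date(2000, 1, 6): 2600,
}


def top_outliers(churn_by_day, n=2):
    """Return the n days with the highest churn, largest first."""
    ranked = sorted(churn_by_day.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


outliers = top_outliers(daily_churn)
for day, churn in outliers:
    print(day.isoformat(), churn)
```

The ranking only nominates days for inspection; whether a day is a false positive still has to be decided by looking at its commits and diffs.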

In [None]:
date_to_churn = {}

for commit in repo.traverse_commits():
    date = commit.committer_date.date()
    if date in date_to_churn:
        date_to_churn[date]["lines added"] += commit.insertions
        date_to_churn[date]["lines deleted"] += commit.deletions
        date_to_churn[date]["hashes"].append(commit.hash)
        date_to_churn[date]["messages"].append(commit.msg)
    else:
        date_to_churn[date] = {"lines added":commit.insertions,
                               "lines deleted":commit.deletions,
                               "hashes":[commit.hash],
                               "messages":[commit.msg]}



I chose to fill the gaps with zeros: on a day without commits, the number of lines cannot have changed.
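
The zero-filling strategy can be sketched with the standard library alone. The sample values and the helper name `fill_missing_days` are assumptions for illustration.

```python
from datetime import date, timedelta

# Hypothetical sparse data: only days with commits are present.
lines_changed = {date(2000, 1, 1): 40, date(2000, 1, 4): 15}


def fill_missing_days(values_by_day, fill_value=0):
    """Return a dict with one entry per day between the first and last day."""
    first, last = min(values_by_day), max(values_by_day)
    filled = {}
    day = first
    while day <= last:
        filled[day] = values_by_day.get(day, fill_value)
        day += timedelta(days=1)
    return filled


filled = fill_missing_days(lines_changed)
print(list(filled.values()))  # [40, 0, 0, 15]
```

Zero is the natural fill value for a per-day delta; a cumulative line count would instead carry the previous day's value forward.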

In [None]:
min_date = min(date_to_churn.keys())
max_date = max(date_to_churn.keys())
date_list = [entry.date() for entry in pd.date_range(min_date, max_date)]
for date in date_list:
    if date not in date_to_churn:
        date_to_churn[date] = {"lines added":0,
                               "lines deleted":0,
                               "hashes":[],
                               "messages":[]}

In [None]:
from bokeh.io.output import output_notebook
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, RangeTool
from bokeh.models.tools import HoverTool
from bokeh.plotting import figure, show
import numpy as np

output_notebook()

items = list(date_to_churn.items())
items.sort(key=lambda t: t[0])
dates = [datetime.combine(item[0], datetime.min.time()) for item in items]

source = ColumnDataSource(data=dict(
    date=dates,
    added=[item[1]['lines added'] for item in items],
    deleted=[item[1]['lines deleted'] for item in items],
    hashes=[len(item[1]['hashes']) for item in items],  # number of commits per day
    ))

p = figure(height=300, width=800, tools="xpan", toolbar_location=None,
           x_axis_type="datetime", x_axis_location="above",
           x_range=(dates[0], dates[int(len(dates)/4)]))

p.line('date', 'added', source=source, color='green', legend_label='added lines')
p.line('date', 'deleted', source=source, color='red', legend_label='deleted lines')
p.yaxis.axis_label = 'lines of change'
p.legend.location = "top_right"
p.add_tools(HoverTool(
    tooltips=[
        ('date', '@date{%F}'),
        ('added', '@added'),
        ('deleted', '@deleted'),
        ('number of commits', '@hashes'),
    ],
    # use the 'datetime' formatter for the '@date' field
    formatters={'@date': 'datetime'},
))

select = figure(height=130, width=800, y_range=p.y_range,
                x_axis_type="datetime", y_axis_type=None,
                tools="", toolbar_location=None)

range_tool = RangeTool(x_range=p.x_range)

select.line('date', 'added', source=source, color='green')
select.line('date', 'deleted', source=source, color='red')

select.add_tools(range_tool)
select.toolbar.active_multi = range_tool

show(column(p, select))

In [None]:
def commits_on(date):
    hashes = date_to_churn[date]['hashes']

    commits = []
    for commit in repo.traverse_commits():
        if commit.hash in hashes:
            commits.append(commit)
    return commits

def date_overview(date):
    print(date)
    for commit in commits_on(date):
        print_commit_overview(commit)

def print_commit_overview(commit):
    print("\nhash:", commit.hash)
    print("msg:", commit.msg)
    for file in commit.modified_files:
        print(f"+{file.added_lines}   \t-{file.deleted_lines}   \t{file.new_path if file.new_path else 'del '+file.old_path}")

def print_diff(diff):
    for code_line in diff.split('\n'):
        print(code_line)

In [None]:
analysed_dates_tuples = [(2014, 3, 13), (2014, 2, 26)]
# avoid shadowing the built-in `tuple`
analysed_dates = [datetime(*date_tuple).date() for date_tuple in analysed_dates_tuples]

In [None]:
date_overview(analysed_dates[0])

The first code churn outlier, on 2014-03-13, contains a large number of non-trivial changes, so it is a genuine outlier and a reason for concern.

In [None]:
date_overview(analysed_dates[1])

The second code churn outlier, on 2014-02-26, also contains a large number of non-trivial changes, so it too is a genuine outlier rather than a false positive.