# Fundamentals of Software Systems - SE Part I Assignment

By Andy Wiemeyer and Lucius Bachmann

### Setup tools
* checkout repo
* initialize Git utility
* You can recreate the Repository object with other parameters to analyze different time periods.
  Only the last year is analyzed here so that the setup runs quickly.

In [None]:
import os.path
from datetime import datetime
from pydriller import Repository, Git
from os import path, mkdir
import pandas as pd
import plotly.express as px

repo_remote_path = 'https://github.com/mastodon/mastodon.git'
repo_path = 'mastodon'
repo_checkout_path = f'{repo_path}/{repo_path}'
filepath = 'app'

since = datetime.fromisoformat('2021-11-08')
to = datetime.fromisoformat('2022-11-08')

if not path.exists(repo_path):
    mkdir(repo_path)

repo = Repository(repo_remote_path, clone_repo_to=repo_path, since=since, to=to, filepath=filepath)
# clone repo if necessary
for commit in repo.traverse_commits():
    break
git = Git(repo_checkout_path)

### Checkout repo at tag v3.5.3

In [None]:
tag = git.get_commit_from_tag('v3.5.3')
git.checkout(tag.hash)

## 1 Complexity Hotspots

1. You must consider only the app folder from the Mastodon repository
(i.e., https://github.com/mastodon/mastodon).

-> nothing to do

2. Decide on the granularity of your analysis of software entities (e.g., source code
files); describe why you selected this specific granularity.

We decided to look at single files TODO: motivation

3. Create a list of all these entities, as they appear in the latest stable release of
Mastodon (i.e., tag v3.5.3).

In [None]:
full_file_paths_all_dirs = git.files() #full path of all files in the analyzed commit
subdirectory_start = "/" + repo_checkout_path + "/"
full_file_paths = [path for path in full_file_paths_all_dirs if subdirectory_start+"app/" in path]
subdirectory_start_index = full_file_paths[0].find(subdirectory_start) + len(subdirectory_start)
subdirectory_prefix = full_file_paths[0][:subdirectory_start_index]#used later
file_paths = [path[subdirectory_start_index:] for path in full_file_paths] #paths relative to analyzed subdirectory

In [None]:
file_names = [os.path.basename(p) for p in full_file_paths] #file name without directories
print(f"Amount of different files with equal name: {len(file_names)-len(set(file_names))}")
counted_file_names = {}
for name in file_names:
    if name not in counted_file_names:
        counted_file_names[name] = 1
    else:
        counted_file_names[name] += 1
print([f"{name}: {count}" for name, count in counted_file_names.items() if count>1])

A lot of different files have the same name (see above) so we use the paths to identify files:

In [None]:
print(file_paths)
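
The duplicate count above can also be expressed with `collections.Counter`. The following is a minimal sketch on hypothetical example paths; the function name `duplicated_basenames` and the path list are illustrative and not part of the analysis itself.

```python
from collections import Counter
from os.path import basename


def duplicated_basenames(paths):
    """Return {basename: count} for every basename shared by more than one path."""
    counts = Counter(basename(p) for p in paths)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical example paths, not taken from the repository listing.
example_paths = [
    "app/models/status.rb",
    "app/serializers/status.rb",
    "app/models/account.rb",
]
duplicates = duplicated_basenames(example_paths)
print(duplicates)
```

Because two different files can share a basename, only the full relative path is a safe identifier.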

4. Decide on the type of complexity you want to measure for your software entities
and explain why you selected this type.

We decided to look at the lines of code in a file TODO: explain
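
As a rough illustration of this measure, the sketch below counts the lines of a text file and returns `None` for files that cannot be decoded (for example images). The helper name `count_lines`, the UTF-8 assumption, and the temporary example files are assumptions made for this sketch.

```python
import os
import tempfile


def count_lines(file_path):
    """Return the number of lines in a text file, or None if it cannot be read as UTF-8."""
    try:
        with open(file_path, "r", encoding="utf-8") as handle:
            return sum(1 for _ in handle)
    except (UnicodeDecodeError, OSError):
        return None


# Hypothetical files created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    text_file = os.path.join(tmp, "example.rb")
    with open(text_file, "w", encoding="utf-8") as handle:
        handle.write("class Example\nend\n")
    binary_file = os.path.join(tmp, "example.png")
    with open(binary_file, "wb") as handle:
        handle.write(b"\x89PNG\xff\xfe")  # not valid UTF-8
    text_loc = count_lines(text_file)
    binary_loc = count_lines(binary_file)

print(text_loc, binary_loc)
```

Note that this counts every line, including blank lines and comments, so it is only a coarse proxy for complexity.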

5. Decide on a timeframe on which you want to base your analysis and explain the
rationale of your choice.

TODO

6. For each entity in the system, measure its complexity and the number of changes
(in the given timeframe). Merge these two pieces of information together to create
a candidate list of problematic hotspots in the app part of Mastodon.

In [None]:
path_to_results = './analysis_data.csv'
if not os.path.isfile(path_to_results): #check if we have already saved the results
    analysis_df = pd.DataFrame( #create dataframe with
        {'Name':file_names, 'Full Path':full_file_paths}, #data columns
        index=file_paths) #and index

    #Compute complexities
    complexity_comp_exceptions = {} #to analyse what caused error (mostly images)
    for idx in analysis_df.index:
        try:
            with open(analysis_df.loc[idx, 'Full Path'], 'r') as file:
                analysis_df.loc[idx, 'Complexity'] = len(file.readlines())
        except Exception as e:
            complexity_comp_exceptions[idx]=e
    analysis_df.to_csv(path_to_results)

    # Count amount of times changed
    analysis_df['CommitPath'] = analysis_df.index #For tracking the path that a file has in the current commit if it was moved
    analysis_df['Amount of changes'] = 0 #set change counter to 0
    for commit in reversed(list(repo.traverse_commits())):
        for file in commit.modified_files:
            idx = analysis_df[analysis_df['CommitPath']==file.new_path].index
            analysis_df.loc[idx, 'Amount of changes'] += 1
            analysis_df.loc[idx, 'CommitPath'] = file.old_path #update commit path
    analysis_df.drop(columns='CommitPath', inplace=True) #get rid of temporary indexing
    analysis_df.to_csv(path_to_results)

In [None]:
analysis_df = pd.read_csv(path_to_results, index_col=[0])
# show results
#analysis_df['Complexity'].sort_values(ascending=False)[:10] #Print top 10 entries
#analysis_df['Amount of changes'].sort_values(ascending=False)[:10] #Print top 10 entries
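
One possible way to merge the two measurements into a single ranking is to multiply the complexity and the change count after scaling each by its maximum. This is only a sketch of one scoring choice, not the method used in this analysis; the function `rank_hotspots` and the toy numbers are hypothetical. Plain dictionaries are used so the example is self-contained; a DataFrame column can be converted with `.to_dict()`.

```python
def rank_hotspots(complexity, changes):
    """Rank paths by (complexity / max complexity) * (changes / max changes)."""
    # Skip entries without a complexity value (e.g. files that could not be read).
    paths = [p for p in complexity if complexity[p] is not None and p in changes]
    if not paths:
        return []
    max_complexity = max(complexity[p] for p in paths) or 1
    max_changes = max(changes[p] for p in paths) or 1
    scores = {
        p: (complexity[p] / max_complexity) * (changes[p] / max_changes)
        for p in paths
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy values, not measured data.
toy_complexity = {"a.rb": 100, "font.svg": 1000, "b.scss": 500, "image.png": None}
toy_changes = {"a.rb": 50, "font.svg": 0, "b.scss": 25, "image.png": 3}
ranked = rank_hotspots(toy_complexity, toy_changes)
print(ranked)
```

With a product score, a large file that never changes ends up at the bottom of the list, which matches the intuition that a hotspot must be both complex and frequently changed.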

7. Visualize the hotspots with a visualization of your choice.

In [None]:
complexity_hist = px.histogram(analysis_df, x='Complexity', hover_data={"path": analysis_df.index})
complexity_hist.show() #exponential dist in complexity -> use log axis
change_amount_hist = px.histogram(analysis_df, x='Amount of changes', hover_data={"path": analysis_df.index})
change_amount_hist.show() #exponential dist in amount of changes -> use log axis

comparison_scatter = px.scatter(analysis_df, x='Complexity', y='Amount of changes',
                 hover_data={"path": analysis_df.index})
comparison_scatter.show()

comparison_scatter_log = px.scatter(analysis_df, x='Complexity', y='Amount of changes',
                 log_x=True, log_y=True,
                 hover_data={"path": analysis_df.index})
comparison_scatter_log.show()

8. Analyze six candidate hotspots (not necessarily the top ones) through:

In [None]:
def compute_complexity_trends(file_paths):
    """
    Computes the complexity trends for a list of file paths.
    return: dictionary with paths as keys and, as values, lists of complexity measurements ordered from oldest to newest commit
    """
    complexity_trends = {path: [] for path in file_paths}
    commit_paths = {path:path for path in file_paths} #For tracking the path that a file has in the current commit if it was moved

    for commit in reversed(list(repo.traverse_commits())):#TODO: use whole list
        git.checkout(commit.hash)

        #add trend values for current commit
        for key_path, value_path in commit_paths.items():
            try:
                full_path = subdirectory_prefix+value_path
                with open(full_path, 'r') as file:
                    complexity_trends[key_path].insert(0,len(file.readlines()))
            except Exception as e:
                print("Issue with complexity trend computation: ", e)

        #update commit paths
        for file in commit.modified_files:
            for key_path, value_path in commit_paths.items():
                if value_path==file.new_path:
                    commit_paths[key_path]=file.old_path

    git.checkout(tag.hash) #return to initial checkout
    return complexity_trends
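
The path tracking used above (walking from the newest to the oldest commit and following renames backwards) can be illustrated in isolation. The sketch below is a simplified model under stated assumptions: each commit is represented as a list of `(old_path, new_path)` pairs, and the function name `trace_path_backwards` as well as the example paths are hypothetical.

```python
def trace_path_backwards(current_path, commits_newest_first):
    """Return the path the file had before each commit, newest commit first.

    Each commit is a list of (old_path, new_path) pairs. Equal entries mean a
    plain modification, different entries a rename, and old_path None an addition.
    """
    history = []
    path_now = current_path
    for modified in commits_newest_first:
        if path_now is not None:
            for old_path, new_path in modified:
                if new_path == path_now:
                    path_now = old_path  # None means the file was added here
                    break
        history.append(path_now)
    return history


# Hypothetical history: modified, then (going back) renamed, then added.
toy_commits = [
    [("app/models/status.rb", "app/models/status.rb")],
    [("app/models/toot.rb", "app/models/status.rb")],
    [(None, "app/models/toot.rb")],
]
path_history = trace_path_backwards("app/models/status.rb", toy_commits)
print(path_history)
```

Once the tracked path becomes `None`, the file did not exist in older commits, so no further complexity values can be read for it.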

In [None]:
hotspot_candidates = [
    'app/javascript/fonts/roboto/roboto-medium-webfont.svg', #Complexity outlier, never changes
    'app/models/status.rb', #Very high change rate, somewhat high complexity
    'app/javascript/styles/mastodon/components.scss', #Very high complexity, very high amount of changes
    'app/javascript/mastodon/locales/ca.json',
    'app/helpers/application_helper.rb',
    'app/services/activitypub/process_status_update_service.rb']
complexity_trends = compute_complexity_trends(hotspot_candidates)

In [None]:
def plot_complexity_trend(index):
    label, complexity_list = list(complexity_trends.items())[index]
    commits_num = list(range(len(complexity_list)))
    complexity_trend_line = px.line(x=commits_num, y=complexity_list, title=f"Complexity trend for {label}",
                                    labels={'x': 'number of commits', 'y': 'complexity [LOC]'})
    complexity_trend_line.show()
    print(f"Final complexity: {int(analysis_df['Complexity'][label])} LOC" )
    print(f"Amount of changes: {analysis_df['Amount of changes'][label]}" )

Candidate 1: app/javascript/fonts/roboto/roboto-medium-webfont.svg

The 'roboto-medium-webfont.svg' file is an extreme outlier: it has very high complexity and never changes.
The file name alone suggests that it is not a troublesome hotspot.
The file has many lines because it defines the shapes of the characters of a font.
None of these lines ever need to change, though. It is not a hotspot and we do not need to take any action.

In [None]:
plot_complexity_trend(0)


Candidate 2: app/models/status.rb

From the trend we see that the complexity increased gradually, with some decreases in between.
Presumably these are refactorings, indicating that the file does indeed require some maintenance work.

In [None]:
plot_complexity_trend(1)

Candidate 3: app/javascript/styles/mastodon/components.scss

In [None]:
plot_complexity_trend(2)

Candidate 4: app/javascript/mastodon/locales/ca.json

In [None]:
plot_complexity_trend(3)

Candidate 5: app/helpers/application_helper.rb

In [None]:
plot_complexity_trend(4)

Candidate 6: app/services/activitypub/process_status_update_service.rb

In [None]:
plot_complexity_trend(5)

## 2 Temporal/Logical Coupling

1. Determine what could be cases of temporal/logical coupling and generate a list
of candidates with a set of coupled entities.

In this analysis we consider logical grouping by commits.
According to the [guidelines for pull requests](https://github.com/mastodon/mastodon/blob/97f657f8181dc24f6c30b6e9f0ce52df827ac90f/CONTRIBUTING.md#pull-requests),
small pull requests should be preferred, thus it is assumed that they contain only the necessary changes for one "Add", "Change", "Deprecate", "Remove" or "Fix",
and can be viewed as logically grouped. To merge the pull request, a squash commit is added to the main branch. So for this analysis we can use the commits
on the main branch to detect logical coupling between files.
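
The idea of counting files that change in the same commit can be sketched on toy data. This minimal example uses `itertools.combinations` so that each unordered pair is counted once per commit; the function name `count_co_changes` and the file names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_co_changes(commits):
    """Count how often each unordered pair of files appears in the same commit."""
    pair_counts = Counter()
    for changed_files in commits:
        # Drop missing paths and duplicates, then sort so (a, b) and (b, a) coincide.
        unique_files = sorted({f for f in changed_files if f is not None})
        pair_counts.update(combinations(unique_files, 2))
    return pair_counts


# Hypothetical commits, each given as the list of files it touched.
toy_commits = [
    ["a.rb", "b.rb", "c.js"],
    ["a.rb", "b.rb"],
    ["c.js"],
]
co_changes = count_co_changes(toy_commits)
print(co_changes.most_common(1))
```

Pairs with a high count relative to the number of commits touching either file are the candidates for logical coupling.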

Here the sets of logically coupled files are created.

In [None]:
from pandas import DataFrame
from collections import Counter
from typing import List
from dataclasses import dataclass


@dataclass
class ChangeSet:
    files: List[str]

    def __init__(self, files: List[str]):
        self.files=sorted(files)

    def __hash__(self) -> int:
        return hash(", ".join(self.files))

    def __eq__(self, o: object) -> bool:
        if not isinstance(o, ChangeSet):
            return False
        # noinspection PyUnresolvedReferences
        return self.files == o.files

    def to_tuple(self, nr_of_changes: int):
        return self.files[0], self.files[1], nr_of_changes

changed_together_filename = './changed_together.csv'

col_name_nr_changed_together = 'nr_changed_together'
changed_together_columns = ['file1', 'file2', col_name_nr_changed_together]
dataframe_changed_together = DataFrame([], columns=changed_together_columns)

if not path.exists(changed_together_filename):
    files_changed_together = []
    for commit in reversed(list(repo.traverse_commits())):
        for file in commit.modified_files:
            for changed_together in commit.modified_files:
                if file.new_path is None or changed_together.new_path is None:
                    continue
                if not file.new_path == changed_together.new_path:
                    files_changed_together.append(ChangeSet([file.new_path, changed_together.new_path]))
    counter = Counter(files_changed_together)
    changed_together_tuples = [change_set.to_tuple(nr_of_changes / 2) for change_set, nr_of_changes in counter.items()]
    dataframe_changed_together = DataFrame(changed_together_tuples, columns=changed_together_columns)
    dataframe_changed_together = dataframe_changed_together.sort_values(by=col_name_nr_changed_together, ascending=False)
    dataframe_changed_together.to_csv(changed_together_filename)

dataframe_changed_together = pd.read_csv(changed_together_filename, index_col=[0,1])



print('most common changed together:')
display(dataframe_changed_together[:10])

As expected, the locales often change together. Thus we need to remove the locales from the analysis.

In [None]:
allowed_file_endings = ['.rb', '.js', '.haml', '.erb']

def contains_allowed_file_type(change_set: List) -> bool:
    # True if at least one file in the change set has an allowed ending
    return any(path.splitext(file)[1] in allowed_file_endings for file in change_set)

changed_together_filtered = [(row[1], row[2], row[3]) for row in dataframe_changed_together.to_records(index=True) if contains_allowed_file_type([row[1], row[2]])]
dataframe_changed_together_filtered = DataFrame(changed_together_filtered, columns=changed_together_columns)

display(dataframe_changed_together_filtered[:10])
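
The filter above keeps a pair as soon as one of its two files has an allowed ending, so a pair made of a Ruby file and a locale file still passes. If such mixed pairs should be dropped as well, a stricter variant could require every file in the pair to match. The sketch below shows this alternative; the names `ALLOWED_ENDINGS` and `all_files_allowed` and the counts in the toy pairs are hypothetical.

```python
from os.path import splitext

ALLOWED_ENDINGS = {".rb", ".js", ".haml", ".erb"}


def all_files_allowed(change_set):
    """Return True only if every file in the change set has an allowed ending."""
    return all(splitext(f)[1] in ALLOWED_ENDINGS for f in change_set)


# Hypothetical pairs with made-up co-change counts.
toy_pairs = [
    ("app/models/status.rb", "app/javascript/mastodon/locales/ca.json", 4),
    ("app/models/status.rb", "app/helpers/application_helper.rb", 3),
]
strict_pairs = [
    (file1, file2, count)
    for file1, file2, count in toy_pairs
    if all_files_allowed([file1, file2])
]
print(strict_pairs)
```

Which variant is appropriate depends on whether coupling between source files and locale files is itself of interest.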

2. Visualize these candidate sets of coupled entities with a visualization of your
choice.

In [None]:
coupling_hist = px.histogram(dataframe_changed_together_filtered, x='nr_changed_together', log_y=True)
coupling_hist.update_layout(font_size=18)
coupling_hist.show()

In [None]:
list_for_bar_plot = [(f'{row[1]} +  {row[2]}', row[3]) for row in dataframe_changed_together_filtered.to_records(index=True)]
dataframe_for_bar_plot = DataFrame(list_for_bar_plot, columns=['files', col_name_nr_changed_together])

bar = px.bar(dataframe_for_bar_plot[:20], x=col_name_nr_changed_together, y='files')
bar.update_layout(font_size=18)
bar.show()

3. For three set candidates in the list:
   * analyze and explain why these entities are coupled;
   * describe how important it would be to fix them, and any ideas for their
     improvement.

Candidate set 1:

Candidate set 2:

Candidate set 3:

## 3 Defective Hotspots

1. Decide on how you want to detect entities that had defects in the past (e.g.,
commit message analysis vs. issue tracking system analysis) and motivate your
choice.
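
For the commit message option, one simple heuristic is to flag commits whose message contains typical bug-fix keywords. The sketch below is an assumption about how such a check could look, not a decision already made for this analysis; the keyword list, the function name `looks_like_bugfix`, and the example messages are hypothetical.

```python
import re

# Hypothetical keyword list; a real analysis would tune it to the project's conventions.
BUGFIX_PATTERN = re.compile(
    r"\b(fix(es|ed|ing)?|bug|defect|regression)\b", re.IGNORECASE
)


def looks_like_bugfix(message):
    """Return True if the commit message contains a bug-fix keyword as a whole word."""
    return BUGFIX_PATTERN.search(message) is not None


# Made-up commit messages for demonstration.
example_messages = [
    "Fix crash when rendering status",
    "Add new locale",
    "Prefix table names",
]
flags = [looks_like_bugfix(m) for m in example_messages]
print(flags)
```

The word boundaries keep words such as "Prefix" from matching, but keyword matching still produces false positives and misses, so the result should be spot-checked manually.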

2. Determine defective hotspots among the entities in the timeframe that you previously
selected (i.e., consider only defects in the selected timeframe). What
conclusions can you draw from this?

3. Determine complexity hotspots at the beginning of your timeframe, then correlate
them with the defects they have presented throughout the entire timeframe.
Is there a correlation? Why do you think this is the case?

4. What conclusions can you draw from the relationship between defective hotspots
and complexity hotspots in Mastodon? And on these two metrics in general?
