# Fundamentals of Software Systems - SE Part I Assignment

By Andy Wiemeyer and Lucius Bachmann

### Setup tools
* checkout repo
* initialize Git utility
* You can recreate the Repository object with other parameters to analyze different time periods.
  Only the last year is used here so that the setup is fast.

In [120]:
import os.path
from datetime import datetime
from os import path, mkdir

import pandas as pd
import plotly.express as px
from pydriller import Repository, Git

repo_remote_path = 'https://github.com/mastodon/mastodon.git'
repo_path = 'mastodon'
repo_checkout_path = f'{repo_path}/{repo_path}'
filepath = 'app'

if not path.exists(repo_path):
    mkdir(repo_path)

repo = Repository(repo_remote_path, clone_repo_to=repo_path, since=datetime.fromisoformat('2022-10-01'), filepath=filepath)
# clone repo if necessary
for commit in repo.traverse_commits():
    break
git = Git(repo_checkout_path)

### Checkout repo at tag v3.5.3

In [121]:
tag = git.get_commit_from_tag('v3.5.3')
git.checkout(tag.hash)

## 1 Complexity Hotspots

1. You must consider only the app folder from the Mastodon repository
(i.e., https://github.com/mastodon/mastodon).

-> nothing to do

2. Decide on the granularity of your analysis of software entities (e.g., source code
files); describe why you selected this specific granularity.

For this analysis, source code files are used as the granularity. A file is an easy unit to perform measurements on. The Mastodon repository contains a Ruby on Rails application with some JavaScript for the frontend. In both languages it is possible to define multiple classes in one file, so measuring smaller units would require a language-specific analysis.


3. Create a list of all these entities, as they appear in the latest stable release of
Mastodon (i.e., tag v3.5.3).

In [122]:
full_file_paths_all_dirs = git.files() #full path of all files in the analyzed commit
subdirectory_start = "/" + repo_checkout_path + "/"
full_file_paths = [path for path in full_file_paths_all_dirs if subdirectory_start+"app/" in path]
subdirectory_start_index = full_file_paths[0].find(subdirectory_start) + len(subdirectory_start)
subdirectory_prefix = full_file_paths[0][:subdirectory_start_index]#used later
file_paths = [path[subdirectory_start_index:] for path in full_file_paths] #paths relative to analyzed subdirectory

In [123]:
file_names = [path[max(0,path.rfind("/"))+1:] for path in full_file_paths]
print(f"Amount of different files with equal name: {len(file_names)-len(set(file_names))}")
counted_file_names = {}
for name in file_names:
    if name not in counted_file_names:
        counted_file_names[name] = 1
    else:
        counted_file_names[name] += 1
print([f"{name}: {count}" for name, count in counted_file_names.items() if count>1])

Amount of different files with equal name: 291


A lot of different files have the same name (see above) so we use the paths to identify files:

In [124]:
print(file_paths)



List the file endings of the paths used in this analysis

In [125]:
from pandas import DataFrame
from collections import Counter

extensions = Counter([path.splitext(entity)[1] for entity in file_paths])

colname_nr_of_occurences = 'nr of occurrences'
df_extensions = DataFrame(extensions.items(), columns=['file ending', colname_nr_of_occurences])
display(df_extensions.sort_values(by=colname_nr_of_occurences, ascending=False))

Unnamed: 0,file ending,nr of occurrences
0,.rb,815
11,.js,421
1,.haml,202
12,.json,172
7,.scss,35
3,.erb,33
4,.svg,27
6,.png,15
10,.ttf,7
8,.woff,6


4. Decide on the type of complexity you want to measure for your software entities
and explain why you selected this type.

To decide which metric would be a good indicator for complexity, a file was chosen to show the metrics the lizard library provides.

In [126]:
import lizard

filename = 'mastodon/mastodon/app/workers/scheduler/indexing_scheduler.rb'
# Use a context manager so the file handle is closed after reading
with open(filename, mode='r') as file:
    analysis = lizard.analyze_file.analyze_source_code(filename, file.read())
print(f'of file {filename}')
print(f'nr of functions: {len(analysis.function_list)}')
print(f'cyclomatic complexity: {analysis.CCN}')
print(f'lines of code: {analysis.nloc}')
print(f'token_count: {analysis.token_count}')
try:
    print(f'deepest nesting level: {analysis.ND}')
except AttributeError:
    print(f'deepest nesting level threw an error')



of file mastodon/mastodon/app/workers/scheduler/indexing_scheduler.rb
nr of functions: 2
cyclomatic complexity: 2
lines of code: 19
token_count: 89
deepest nesting level threw an error


In the lecture we learned that the number of lines of code is a proven measure of complexity.
PyDriller already measures this, so the number of lines of code is used as the complexity measure.
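As a minimal sketch of this measurement, the helper below counts the lines of a text file and returns `None` for files that cannot be read as text (for example images or fonts). The name `count_lines` and the sample file are hypothetical and not part of PyDriller or lizard.

```python
import os
import tempfile
from typing import Optional


def count_lines(file_path: str) -> Optional[int]:
    """Return the number of lines in a text file, or None if it cannot be read as UTF-8 text."""
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            return len(file.readlines())
    except (UnicodeDecodeError, OSError):
        # Binary files such as images or fonts, and missing files, end up here
        return None


# Usage with a throwaway file (illustrative content only)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.rb')
    with open(sample_path, 'w', encoding='utf-8') as sample:
        sample.write("class Foo\n  def bar\n  end\nend\n")
    sample_loc = count_lines(sample_path)
    print(sample_loc)
```

Returning `None` instead of raising keeps unreadable files in the result set, where they can be filtered out later.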

5. Decide on a timeframe on which you want to base your analysis and explain the
rationale of your choice.

In a first step, the activity on the repository is analysed.

In [164]:
repo = Repository(repo_remote_path, clone_repo_to=repo_path)
commit_dates = []

for commit in repo.traverse_commits():
    commit_dates.append(commit.committer_date.strftime('%Y-%m-%d'))

df_commit_dates = DataFrame(Counter(commit_dates).items(), columns=['date', 'nr_of_commits'])
line = px.line(df_commit_dates, x='date', y='nr_of_commits')
line.update_layout(font_size=18)
line.show()

Here we see that the development activity peaked in 2017 and was slightly higher again in 2020.
Because a timeframe between 2019 and 2022 contains a significant number of defects,
we chose the timeframe of the last 3 years.

If a complexity hotspot was created a long time ago and has not been changed for a long time,
analysing it would not add value.
This analysis therefore focuses on the last 3 years.
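To compare candidate timeframes numerically rather than only by reading the plot, the commit dates can be aggregated per year. The following is a minimal sketch; the function name `commits_per_year` is hypothetical and the dates are illustrative, not taken from the Mastodon history.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def commits_per_year(commit_dates: Iterable[datetime]) -> Dict[int, int]:
    """Count commits per calendar year, sorted by year."""
    counts = Counter(date.year for date in commit_dates)
    return dict(sorted(counts.items()))


# Illustrative dates only, not taken from the Mastodon history
example_dates = [
    datetime(2017, 4, 1), datetime(2017, 4, 2), datetime(2017, 5, 9),
    datetime(2020, 1, 15), datetime(2020, 7, 3),
    datetime(2022, 10, 30),
]
yearly_counts = commits_per_year(example_dates)
print(yearly_counts)
```

With real data, `commit.committer_date` from `traverse_commits()` could be passed in directly.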

In [128]:
since = datetime.fromisoformat('2019-11-08')
to = datetime.fromisoformat('2022-11-08')

repo = Repository(repo_remote_path, clone_repo_to=repo_path, since=since, to=to, filepath=filepath)
git = Git(repo_checkout_path)

6. For each entity in the system, measure its complexity and the number of changes
(in the given timeframe). Merge these two pieces of information together to create
a candidate list of problematic hotspots in the app part of Mastodon.
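The merge step can be sketched with plain dictionaries: one maps each file to its complexity, the other maps each file to its number of changes. The function name `rank_hotspots` and all values are hypothetical; files without recorded changes are assumed to have zero changes.

```python
from typing import Dict, List, Tuple


def rank_hotspots(complexity: Dict[str, int], changes: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Join both measurements per file and sort by changes, then by complexity (both descending)."""
    merged = [
        (file_path, complexity[file_path], changes.get(file_path, 0))
        for file_path in complexity
    ]
    return sorted(merged, key=lambda row: (row[2], row[1]), reverse=True)


# Illustrative values only, not measured from the repository
example_complexity = {'app/a.rb': 600, 'app/b.js': 120, 'app/c.scss': 7000}
example_changes = {'app/a.rb': 45, 'app/c.scss': 102}
candidates = rank_hotspots(example_complexity, example_changes)
for row in candidates:
    print(row)
```

Files that are both large and frequently changed end up at the top of the list, which is the candidate order used for the hotspot inspection.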

In [129]:
path_to_results = './analysis_data.csv'
if not os.path.isfile(path_to_results): #check if we have already saved the results
    analysis_df = pd.DataFrame( #create dataframe with
        {'Name':file_names, 'Full Path':full_file_paths}, #data columns
        index=file_paths) #and index

    #Compute complexities
    complexity_comp_exceptions = {} #to analyse what caused error (mostly images)
    for idx in analysis_df.index:
        try:
            with open(analysis_df.loc[idx, 'Full Path'], 'r') as file:
                analysis_df.loc[idx, 'Complexity'] = len(file.readlines())
        except Exception as e:
            complexity_comp_exceptions[idx]=e
    # Results are written once below, after the change counts are computed

    # Count amount of times changed
    analysis_df['CommitPath'] = analysis_df.index #For tracking the path that a file has in the current commit if it was moved
    analysis_df['Amount of changes'] = 0 #set change counter to 0
    for commit in reversed(list(repo.traverse_commits())):
        for file in commit.modified_files:
            idx = analysis_df[analysis_df['CommitPath']==file.new_path].index
            analysis_df.loc[idx, 'Amount of changes'] += 1
            analysis_df.loc[idx, 'CommitPath'] = file.old_path #update commit path
    analysis_df['Initial Path'] = analysis_df['CommitPath']
    analysis_df.drop(columns='CommitPath', inplace=True) #get rid of temporary indexing
    analysis_df.to_csv(path_to_results)

In [130]:
analysis_df = pd.read_csv(path_to_results, index_col=[0])
# show results
#analysis_df['Complexity'].sort_values(ascending=False)[:10] #Print top 10 entries
#analysis_df['Amount of changes'].sort_values(ascending=False)[:10] #Print top 10 entries

7. Visualize the hotspots with a visualization of your choice.

In [131]:
complexity_hist = px.histogram(analysis_df, x='Complexity', hover_data={"path": analysis_df.index})
complexity_hist.show() #exponential dist in complexity -> use log axis
change_amount_hist = px.histogram(analysis_df, x='Amount of changes', hover_data={"path": analysis_df.index})
change_amount_hist.show() #exponential dist in amount of changes -> use log axis

comparison_scatter = px.scatter(analysis_df, x='Complexity', y='Amount of changes',
                 hover_data={"path": analysis_df.index})
comparison_scatter.show()

comparison_scatter_log = px.scatter(analysis_df, x='Complexity', y='Amount of changes',
                 log_x=True, log_y=True,
                 hover_data={"path": analysis_df.index})
comparison_scatter_log.show()

In [132]:
display(analysis_df.sort_values(by=['Amount of changes','Complexity'], ascending=False)[:20])

Unnamed: 0,Name,Full Path,Complexity,Amount of changes,Initial Path,Defects,Initial Complexity
app/javascript/styles/mastodon/components.scss,components.scss,/home/lucius/projects/uzh/fund-software-system...,7790.0,102,app/javascript/styles/mastodon/components.scss,62,6563.0
app/javascript/mastodon/locales/en.json,en.json,/home/lucius/projects/uzh/fund-software-system...,549.0,54,app/javascript/mastodon/locales/en.json,8,425.0
app/javascript/mastodon/locales/defaultMessages.json,defaultMessages.json,/home/lucius/projects/uzh/fund-software-system...,3714.0,51,app/javascript/mastodon/locales/defaultMessage...,5,2780.0
app/models/account.rb,account.rb,/home/lucius/projects/uzh/fund-software-system...,621.0,45,app/models/account.rb,26,535.0
app/javascript/mastodon/locales/ja.json,ja.json,/home/lucius/projects/uzh/fund-software-system...,549.0,43,app/javascript/mastodon/locales/ja.json,5,423.0
app/javascript/mastodon/locales/th.json,th.json,/home/lucius/projects/uzh/fund-software-system...,549.0,42,app/javascript/mastodon/locales/th.json,5,423.0
app/javascript/mastodon/locales/gl.json,gl.json,/home/lucius/projects/uzh/fund-software-system...,549.0,42,app/javascript/mastodon/locales/gl.json,3,423.0
app/javascript/mastodon/locales/vi.json,vi.json,/home/lucius/projects/uzh/fund-software-system...,549.0,42,,5,0.0
app/javascript/mastodon/locales/zh-CN.json,zh-CN.json,/home/lucius/projects/uzh/fund-software-system...,549.0,42,app/javascript/mastodon/locales/zh-CN.json,7,423.0
app/javascript/mastodon/locales/ko.json,ko.json,/home/lucius/projects/uzh/fund-software-system...,549.0,41,app/javascript/mastodon/locales/ko.json,4,423.0


8. Analyze six candidate hotspots (not necessarily the top ones) through:

In [133]:
def compute_complexity_trends(file_paths):
    """Compute the complexity trends for a list of file paths.

    Returns a dictionary with the paths as keys and, as values, lists of
    complexity measurements (LOC) ordered from the oldest to the newest commit.
    """
    complexity_trends = {path: [] for path in file_paths}
    # Track the path each file has in the current commit, in case it was moved
    commit_paths = {path: path for path in file_paths}

    # Walk the history from the newest to the oldest commit
    for commit in reversed(list(repo.traverse_commits())):
        git.checkout(commit.hash)

        # Add trend values for the current commit
        for key_path, value_path in commit_paths.items():
            if value_path is None:
                # The file does not exist yet at this point in the history
                complexity_trends[key_path].insert(0, 0)
                continue
            try:
                with open(subdirectory_prefix + value_path, 'r') as file:
                    complexity_trends[key_path].insert(0, len(file.readlines()))
            except Exception as e:
                print("Issue with complexity trend computation: ", e)

        # Update commit paths
        for file in commit.modified_files:
            for key_path, value_path in commit_paths.items():
                if file.new_path is not None and value_path == file.new_path:
                    commit_paths[key_path] = file.old_path

    git.checkout(tag.hash)  # return to initial checkout
    return complexity_trends


In [134]:
hotspot_candidates = [
    'app/javascript/fonts/roboto/roboto-medium-webfont.svg', #Complexity outlier, never changes
    'app/models/status.rb', #Very high change rate somewhat high complexity
    'app/javascript/styles/mastodon/components.scss', #Very high complexity, very high amount of changes
    'app/javascript/mastodon/locales/ca.json',
    'app/helpers/application_helper.rb',
    'app/services/activitypub/process_status_update_service.rb']
complexity_trends = compute_complexity_trends(hotspot_candidates)

In [135]:
def plot_complexity_trend(index):
    label, complexity_list = list(complexity_trends.items())[index]
    commits_num = list(range(len(complexity_list)))
    complexity_trend_line = px.line(x=commits_num, y=complexity_list, title=f"Complexity trend for {label}",
                                    labels={'x': 'number of commits', 'y': 'complexity [LOC]'})
    complexity_trend_line.show()
    print(f"Final complexity: {int(analysis_df['Complexity'][label])} LOC" )
    print(f"Amount of changes: {analysis_df['Amount of changes'][label]}" )

In [136]:
def get_git_log(filepath: str) -> DataFrame:
    """Return hash, message, insertions and deletions of every commit that touched filepath."""
    repo_file = Repository(path_to_repo=repo_checkout_path, since=since, to=to,
                           filepath=filepath)
    git_log_file = []
    for commit in repo_file.traverse_commits():
        insertions = 0
        deletions = 0
        for file in commit.modified_files:
            if file.new_path == filepath:
                insertions = file.added_lines
                deletions = file.deleted_lines
        git_log_file.append((commit.hash[:7], commit.msg, insertions, deletions))
    return DataFrame(git_log_file, columns=['hash', 'message', 'insertions', 'deletions'])

In [137]:
from glob import glob

def show_stats_for_glob(glob_str: str):
    def create_file_row(file_path: str):
        analysis = lizard.analyze_file(file_path)
        return file_path.replace(repo_checkout_path, ''), analysis.nloc
    files = [create_file_row(filepath) for filepath in glob(f'{repo_checkout_path}/{glob_str}', recursive=True)]
    display(DataFrame(files, columns=['filename', 'nloc']).sort_values(by=['nloc'], ascending=False))

Candidate 1: app/javascript/fonts/roboto/roboto-medium-webfont.svg

The 'roboto-medium-webfont.svg' file is an extreme outlier. It has very high complexity and never changes.
From the file name we can easily see that it is not a troublesome hotspot.
The file needs to have a lot of lines of code because it defines the shape of characters from a font.
None of these lines ever need to change though. It is not a hotspot and we do not need to take any action.

In [138]:
plot_complexity_trend(0)


Final complexity: 16273 LOC
Amount of changes: 0


Candidate 2: app/models/status.rb

From the trend we see that the complexity increased gradually, with some decreases in between.
Presumably these are refactorings, indicating that the file does indeed require some maintenance work.
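Such decreases can also be located programmatically instead of by reading the plot. The sketch below is an assumption about how one might do this; `find_decreases` and the trend values are hypothetical, not the measured trend of status.rb.

```python
from typing import List, Tuple


def find_decreases(trend: List[int], min_drop: int = 1) -> List[Tuple[int, int]]:
    """Return (commit index, lines removed) for every step where the complexity drops by at least min_drop."""
    decreases = []
    for index in range(1, len(trend)):
        drop = trend[index - 1] - trend[index]
        if drop >= min_drop:
            decreases.append((index, drop))
    return decreases


# Illustrative trend only, ordered from oldest to newest commit
example_trend = [400, 420, 450, 430, 470, 440, 500]
drops = find_decreases(example_trend, min_drop=10)
print(drops)
```

The returned indices could then be matched against the git log to check whether the corresponding commits are really refactorings.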

In [139]:
plot_complexity_trend(1)

Final complexity: 523 LOC
Amount of changes: 41


Candidate 3: app/javascript/styles/mastodon/components.scss

In [140]:
plot_complexity_trend(2)

Final complexity: 7790 LOC
Amount of changes: 102


In [141]:
git_log = get_git_log('app/javascript/styles/mastodon/components.scss')
display(git_log[-20:])

Unnamed: 0,hash,message,insertions,deletions
82,07341e7,Add graphs and retention metrics to admin dash...,52,1
83,1630807,Fix color of hashtag column settings inputs (#...,2,1
84,40f202c,Change list title input styling (#17092),8,5
85,1060666,Add support for editing for published statuses...,11,0
86,fd3a45e,Add edit history to web UI (#17390)\n\n* Add e...,164,64
87,a9a43de,Change report modal to include category select...,214,44
88,d4592bb,Add explore page to web UI (#17123)\n\n* Add e...,123,0
89,255748d,Fix media modal footer's “external link” not b...,5,0
90,b5329e0,Spelling (#17705)\n\n* spelling: account\r\n\r...,1,1
91,dba4be1,Change appearance of account cards in web UI (...,14,139


It seems that the file is changed for unrelated features. This suggests that the file contains a lot of CSS
for different responsibilities.
This raises the question of whether the SCSS files are separated at all.
To answer it, the NLOC of the SCSS files needs to be analysed.

In [142]:
show_stats_for_glob('**/*.scss')


Unnamed: 0,filename,nloc
26,/app/javascript/styles/mastodon/components.scss,6560
24,/app/javascript/styles/mastodon/admin.scss,1366
29,/app/javascript/styles/mastodon/forms.scss,901
28,/app/javascript/styles/mastodon/about.scss,753
30,/app/javascript/styles/mastodon/containers.scss,744
7,/app/javascript/styles/mastodon-light/diff.scss,671
27,/app/javascript/styles/mastodon/widgets.scss,530
1,/app/javascript/styles/mailer.scss,476
20,/app/javascript/styles/mastodon/rtl.scss,383
33,/app/javascript/styles/mastodon/accounts.scss,318


Now one can see that the SCSS is separated into several files, but components.scss is still
much larger than all the other SCSS files.
The commits 2de44d3 and 2de5128 show that 2 pull requests were needed to fix the same regression.
The function `application_helper.rb::react_component` indicates that there are React components
in this repository, which would allow the SCSS to be split and scoped per component. This was not done, and thus
the components.scss file is large and handles the styling concerns of more than one component.
This has already led to regressions. components.scss is a complexity hotspot.


Candidate 4: app/javascript/mastodon/locales/ca.json


In [143]:
plot_complexity_trend(3)

Final complexity: 549 LOC
Amount of changes: 40


ca.json and the other locale files contain the translations for the languages Mastodon supports.
These files therefore contain many lines and change on three occasions:

1. If a new translation key is added
2. If the translation for a translation key is changed
3. If the key for a translation needs to be changed


In [144]:
git_log = get_git_log('app/javascript/mastodon/locales/ca.json')
display(git_log[:30])

Unnamed: 0,hash,message,insertions,deletions
0,a369d1c,New Crowdin translations (#12378)\n\n* New tra...,56,41
1,3a6f986,New Crowdin translations (#12830)\n\n* New tra...,12,12
2,105f83f,New Crowdin translations (#12859)\n\n* New tra...,3,2
3,fcd79a5,New Crowdin translations (#12936)\n\n* New tra...,7,7
4,62f0b30,New Crowdin translations (#12953)\n\n* New tra...,3,0
5,4599518,New Crowdin translations (#13036)\n\n* New tra...,2,2
6,af53cfd,New Crowdin translations (#13398)\n\n* New tra...,3,0
7,3e9dc40,Add hints about incomplete remote content to w...,6,0
8,7f1143a,New Crowdin translations (#13749)\n\n* New tra...,16,10
9,c158dda,New Crowdin updates (#14197)\n\n* New translat...,12,8


From the commit messages we can see that Mastodon uses Crowdin to manage its translations.
It's good that the translations can be changed independently. On the other hand, commits
1392741 and db04dfc show that the authors sometimes forget to add the translation keys to the
locale files. This is not really a hotspot, but it shows that a high number of lines of code and a
high change rate do not always indicate a hotspot.
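Forgotten translation keys could be detected by comparing the keys of a locale file against a reference locale. This is a minimal sketch, assuming each locale file is a flat JSON object that maps keys to strings; `missing_translation_keys` and the example entries are hypothetical.

```python
from typing import Dict, List


def missing_translation_keys(reference: Dict[str, str], locale: Dict[str, str]) -> List[str]:
    """Return the keys that exist in the reference locale but not in the given locale, sorted."""
    return sorted(set(reference) - set(locale))


# Illustrative key/value pairs only, not copied from en.json or ca.json
example_reference = {'account.follow': 'Follow', 'account.block': 'Block', 'status.edit': 'Edit'}
example_locale = {'account.follow': 'Segueix', 'account.block': 'Bloqueja'}
missing = missing_translation_keys(example_reference, example_locale)
print(missing)
```

With real files, both dictionaries would come from `json.load` on the respective locale files.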

Candidate 5: app/helpers/application_helper.rb


In [145]:
plot_complexity_trend(4)

Final complexity: 249 LOC
Amount of changes: 17


From the name one does not really know what this module should do.
The amount of changes is not large, but the file is in the upper right corner of the scatter plot above.
Because the name does not really describe the responsibility of the module, an analysis of the method names was made.


In [146]:
filename = 'mastodon/mastodon/app/helpers/application_helper.rb'
# Use a context manager so the file handle is closed after reading
with open(filename, mode='r') as file:
    analysis = lizard.analyze_file.analyze_source_code(filename, file.read())
functions = [function_info.name for function_info in analysis.function_list]
display(functions)

['friendly_number_to_human',
 'active_nav_class',
 'active_link_to',
 'show_landing_strip?',
 'open_registrations?',
 'approved_registrations?',
 'closed_registrations?',
 'available_sign_up_path',
 'omniauth_only?',
 'link_to_login',
 'provider_sign_in_link',
 'open_deletion?',
 'locale_direction',
 'favicon_path',
 'title',
 'class_for_scope',
 'can?',
 'fa_icon',
 'visibility_icon',
 'interrelationships_icon',
 'custom_emoji_tag',
 'opengraph',
 'react_component',
 'react_admin_component',
 'body_classes',
 'cdn_host',
 'cdn_host?',
 'storage_host',
 'storage_host?',
 'quote_wrap',
 'render_initial_state',
 'grouped_scopes',
 'prerender_custom_emojis']

Some functions seem to be helpers for rendering the UI, like `active_nav_class`, `active_link_to`, `friendly_number_to_human`, `favicon_path`, and `react_component`.
Others seem to contain business logic, like `grouped_scopes` and `open_registrations?`.
It's not the most important hotspot, but it would be good to split this module into separate modules.

Candidate 6: app/services/activitypub/process_status_update_service.rb

In [147]:
plot_complexity_trend(5)

Final complexity: 300 LOC
Amount of changes: 15


This file started with a high complexity, and its complexity stayed at the same level.
The changes made to the file did not change the complexity drastically.

In [148]:
display(get_git_log('app/services/activitypub/process_status_update_service.rb'))

Unnamed: 0,hash,message,insertions,deletions
0,1060666,Add support for editing for published statuses...,275,0
1,d412a8d,Fix error when processing poll updates (#17333...,1,1
2,6505b39,Fix poll updates being saved as status edits (...,15,7
3,03d5934,Fix Sidekiq warnings about JSON serialization ...,1,1
4,b6d7726,Remove language detection through cld3 (#17478...,1,5
5,63002cd,Add editing for published statuses (#17320)\n\...,4,10
6,63854be,Fix poll votes not being properly reset on pol...,2,4
7,d17fb70,Change how changes to media attachments are st...,7,12
8,b2cd344,Add rate limit for editing (#17728),2,2
9,d3aa9cf,Fix Updates being forwarded even when not proc...,12,0


The git log shows a lot of fixes in this file. This makes it a complexity hotspot.
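The share of fixes can be quantified from the commit messages. The sketch below assumes that fix commits start with the word "Fix", which matches the convention visible in the log above; `fix_ratio` and the example messages are hypothetical.

```python
from typing import Iterable


def fix_ratio(messages: Iterable[str]) -> float:
    """Return the share of commit messages that start with 'fix' (case-insensitive)."""
    messages = list(messages)
    if not messages:
        return 0.0
    fixes = sum(1 for message in messages if message.strip().lower().startswith('fix'))
    return fixes / len(messages)


# Illustrative messages only, not the real commit messages
example_messages = [
    'Fix error in poll updates',
    'Add rate limit',
    'Fix warnings',
    'Change storage format',
]
ratio = fix_ratio(example_messages)
print(ratio)
```

Applied to the `message` column returned by `get_git_log`, this gives a rough defect indicator per file.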


## 2 Temporal/Logical Coupling

1. Determine what could be cases of temporal/logical coupling and generate a list
of candidates with a set of coupled entities.

In this analysis we consider logical grouping by commits.
According to the [guidelines for pull requests](https://github.com/mastodon/mastodon/blob/97f657f8181dc24f6c30b6e9f0ce52df827ac90f/CONTRIBUTING.md#pull-requests),
small pull requests should be preferred, so it is assumed that each one contains only the changes necessary for one "Add", "Change", "Deprecate", "Remove" or "Fix"
and can be viewed as logically grouped. To merge a pull request, a squash commit is added to the main branch, so for this analysis we can use the commits
on the main branch to detect logical coupling between files.
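The core idea, counting how often two files are modified in the same commit, can be sketched independently of PyDriller. The function `count_co_changes` and the example commits are hypothetical; each commit is represented only by the list of files it changed.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def count_co_changes(commits: Iterable[List[str]]) -> Counter:
    """Count, for every unordered pair of files, the number of commits that modify both."""
    pair_counts: Counter = Counter()
    for changed_files in commits:
        # sorted(set(...)) gives each pair a canonical order and ignores duplicates
        for pair in combinations(sorted(set(changed_files)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Illustrative commits only; each inner list holds the files changed by one commit
example_commits = [
    ['app/a.rb', 'app/b.rb'],
    ['app/a.rb', 'app/b.rb', 'app/c.js'],
    ['app/c.js'],
]
co_changes = count_co_changes(example_commits)
print(co_changes.most_common(1))
```

Using sorted pairs as keys means each pair is counted once per commit, so no later division by two is needed.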

Here the sets of logically coupled files are created.

In [149]:
from pandas import DataFrame
from typing import List, Dict
from dataclasses import dataclass


@dataclass
class ChangeSet:
    files: Dict[str,int]
    counter: int

    def __init__(self, files: List[str]):
        self.files={f:1 for f in files}
        self.counter = 0

    def __hash__(self) -> int:
        return hash(", ".join(self.get_files()))


    def __eq__(self, o: object) -> bool:
        if not isinstance(o, ChangeSet):
            return False
        # noinspection PyUnresolvedReferences
        return self.get_files() == o.get_files()

    def to_tuple(self, total_commits_file1 : int, total_commits_file2 : int):
        real_number_of_coupled_commits = self.counter / 2
        file1_coupling = real_number_of_coupled_commits / total_commits_file1
        file2_coupling = real_number_of_coupled_commits / total_commits_file2
        max_coupling = max(file1_coupling, file2_coupling)
        return self.get_files()[0], total_commits_file1, file1_coupling, self.get_files()[1], total_commits_file2, file2_coupling, real_number_of_coupled_commits, max_coupling

    def get_files(self):
        return sorted(self.files.keys())

    def changeset_changed(self):
        self.counter += 1


changed_together_filename = './changed_together.csv'

col_name_nr_changed_together = 'nr_changed_together'
col_name_coupling_file1 = 'file1_coupling'
col_name_coupling_file2 = 'file2_coupling'
col_name_max_coupling = 'max_coupling'
changed_together_columns = ['file1', 'file1_total_commits', col_name_coupling_file1 , 'file2', 'file2_total_commits', col_name_coupling_file2, col_name_nr_changed_together, col_name_max_coupling]
dataframe_changed_together = DataFrame([], columns=changed_together_columns)

if not path.exists(changed_together_filename):
    files_changed_together : Dict[ChangeSet,ChangeSet] = {}
    commits_on_file_counter: Dict[str,int] = {}
    for commit in repo.traverse_commits():
        for file in commit.modified_files:

            if not file.new_path in commits_on_file_counter:
                commits_on_file_counter[file.new_path] = 0
            commits_on_file_counter[file.new_path] += 1

            for changed_together in commit.modified_files:
                if file.new_path is None or changed_together.new_path is None:
                    continue
                # Only consider pairs where both files are inside the app folder
                if 'app/' not in file.new_path or 'app/' not in changed_together.new_path:
                    continue
                if not file.new_path == changed_together.new_path:
                    change_set = ChangeSet([file.new_path, changed_together.new_path])
                    if change_set in files_changed_together:
                        change_set = files_changed_together[change_set]
                    else:
                        files_changed_together[change_set] = change_set
                    change_set.changeset_changed()

    changed_together_tuples = []
    for change_set in files_changed_together.values():
        total_commits_file1 = commits_on_file_counter[change_set.get_files()[0]]
        total_commits_file2 = commits_on_file_counter[change_set.get_files()[1]]
        changed_together_tuples.append(change_set.to_tuple(total_commits_file1, total_commits_file2))

    dataframe_changed_together = DataFrame(changed_together_tuples, columns=changed_together_columns)
    dataframe_changed_together = dataframe_changed_together.sort_values(by=[col_name_nr_changed_together], ascending=False)
    dataframe_changed_together.to_csv(changed_together_filename)

dataframe_changed_together = pd.read_csv(changed_together_filename, index_col=[0,1])



print(f'most common changed together:')
display(dataframe_changed_together[:10])

most common changed together:


Unnamed: 0_level_0,Unnamed: 1_level_0,file1_total_commits,file1_coupling,file2,file2_total_commits,file2_coupling,nr_changed_together,max_coupling
Unnamed: 0_level_1,file1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1292,app/javascript/mastodon/locales/defaultMessages.json,51,0.843137,app/javascript/mastodon/locales/en.json,54,0.796296,43.0,0.843137
2877,app/javascript/mastodon/locales/ja.json,43,0.860465,app/javascript/mastodon/locales/ru.json,40,0.925,37.0,0.925
901,app/javascript/mastodon/locales/ko.json,41,0.902439,app/javascript/mastodon/locales/th.json,42,0.880952,37.0,0.902439
2542,app/javascript/mastodon/locales/gl.json,42,0.880952,app/javascript/mastodon/locales/ja.json,43,0.860465,37.0,0.880952
2547,app/javascript/mastodon/locales/gl.json,42,0.857143,app/javascript/mastodon/locales/ko.json,41,0.878049,36.0,0.878049
2562,app/javascript/mastodon/locales/gl.json,42,0.857143,app/javascript/mastodon/locales/ru.json,40,0.9,36.0,0.9
909,app/javascript/mastodon/locales/ko.json,41,0.878049,app/javascript/mastodon/locales/zh-CN.json,42,0.857143,36.0,0.878049
684,app/javascript/mastodon/locales/fr.json,40,0.9,app/javascript/mastodon/locales/ru.json,40,0.9,36.0,0.9
896,app/javascript/mastodon/locales/ko.json,41,0.878049,app/javascript/mastodon/locales/ru.json,40,0.9,36.0,0.9
3266,app/javascript/mastodon/locales/ru.json,40,0.9,app/javascript/mastodon/locales/vi.json,42,0.857143,36.0,0.9


As expected, the locales often change together, so we remove them from the analysis.
The data is then restricted to file pairs that were changed together more than 5 times and sorted by the max_coupling value.
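The coupling values are the share of a file's commits that also touch the other file of the pair, and max_coupling is the larger of the two shares. As a worked sketch (the function name `coupling` is hypothetical), the numbers of the defaultMessages.json / en.json pair from the table above can be reproduced as follows:

```python
from typing import Tuple


def coupling(changed_together: int, total_commits_file1: int, total_commits_file2: int) -> Tuple[float, float, float]:
    """Return (file1_coupling, file2_coupling, max_coupling) for one file pair."""
    file1_coupling = changed_together / total_commits_file1
    file2_coupling = changed_together / total_commits_file2
    return file1_coupling, file2_coupling, max(file1_coupling, file2_coupling)


# 43 shared commits, 51 commits on defaultMessages.json, 54 commits on en.json
file1_coupling, file2_coupling, max_coupling = coupling(43, 51, 54)
print(round(file1_coupling, 6), round(file2_coupling, 6), round(max_coupling, 6))
```

Sorting by the maximum means a pair ranks high as soon as one of its files rarely changes without the other.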

In [150]:
from tabulate import tabulate
allowed_file_endings = ['.rb', '.js', '.haml', '.erb']

def contains_allowed_file_type(change_set: List) -> bool:
    """Return True if at least one file of the pair has an allowed file ending."""
    for file in change_set:
        ending = path.splitext(file)[1]
        if ending in allowed_file_endings:
            return True
    return False

changed_together_filtered = [(row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[8]) for row in dataframe_changed_together.to_records(index=True) if contains_allowed_file_type([row[1], row[4]])]
dataframe_changed_together_filtered = DataFrame(changed_together_filtered, columns=changed_together_columns)
dataframe_changed_together_filtered = dataframe_changed_together_filtered[dataframe_changed_together_filtered['nr_changed_together'] > 5].sort_values(by=[col_name_max_coupling], ascending=False)

tabulate(dataframe_changed_together_filtered, headers=changed_together_columns, tablefmt="html")

Unnamed: 0,file1,file1_total_commits,file1_coupling,file2,file2_total_commits,file2_coupling,nr_changed_together,max_coupling
7,app/javascript/mastodon/features/followers/index.js,8,1.0,app/javascript/mastodon/features/following/index.js,8,1.0,8,1.0
13,app/javascript/mastodon/features/account/components/header.js,23,0.304348,app/javascript/mastodon/features/account_timeline/components/header.js,7,1.0,7,1.0
12,app/controllers/follower_accounts_controller.rb,7,1.0,app/controllers/following_accounts_controller.rb,8,0.875,7,1.0
35,app/controllers/api/v1/accounts/follower_accounts_controller.rb,6,1.0,app/controllers/api/v1/accounts/following_accounts_controller.rb,6,1.0,6,1.0
2,app/javascript/mastodon/components/status_action_bar.js,19,0.736842,app/javascript/mastodon/features/status/components/action_bar.js,15,0.933333,14,0.933333
0,app/views/statuses/_detailed_status.html.haml,20,0.9,app/views/statuses/_simple_status.html.haml,24,0.75,18,0.9
8,app/services/suspend_account_service.rb,9,0.888889,app/services/unsuspend_account_service.rb,10,0.8,8,0.888889
10,app/javascript/mastodon/features/notifications/components/column_settings.js,9,0.777778,app/javascript/mastodon/reducers/settings.js,8,0.875,7,0.875
28,app/javascript/mastodon/actions/compose.js,13,0.461538,app/javascript/mastodon/reducers/compose.js,7,0.857143,6,0.857143
22,app/javascript/mastodon/features/getting_started/index.js,11,0.545455,app/javascript/mastodon/features/ui/components/navigation_panel.js,7,0.857143,6,0.857143


2. Visualize these candidate sets of couple entities with a visualization of your
choice.

In [151]:
complexity_hist = px.histogram(dataframe_changed_together_filtered, x=col_name_nr_changed_together, log_y=True)
complexity_hist.update_layout(font_size=18)
complexity_hist.show()

The histogram shows that, among the analyzed file types (which exclude the locales), only one file pair was changed together frequently.
Most of the other pairs were changed together only 6 times in the last 3 years.

In [152]:
list_for_bar_plot = [(f'{row[1]} +  {row[4]}', row[8]) for row in dataframe_changed_together_filtered.to_records(index=True)]
dataframe_for_bar_plot = DataFrame(list_for_bar_plot, columns=['files', col_name_max_coupling])

bar = px.bar(dataframe_for_bar_plot[:20], x=col_name_max_coupling, y='files')
bar.update_layout(font_size=18)
bar.show()

In this horizontal bar chart, the pairs are sorted by the maximal coupling value of the two files involved.
There are the following groups of pairs:
* Coupled frontend files (JavaScript or Haml)
* Coupled backend controllers
* Coupled frontend state management files (for Redux, which we can see from the actions/compose.js and reducers/compose.js files)
* Coupled backend helpers with the frontend they render

3. For three set candidates in the list:
• analyze and explain why these entities are coupled;
• describe how important it would be to fix them, and any ideas for their
improvement.


Candidate set 1 (app/controllers/follower_accounts_controller.rb, app/controllers/following_accounts_controller.rb):

The values retrieved for the set:

In [153]:
tabulate(dataframe_changed_together_filtered[dataframe_changed_together_filtered['file1'] == 'app/controllers/follower_accounts_controller.rb'], headers=changed_together_columns, tablefmt="html")

Unnamed: 0,file1,file1_total_commits,file1_coupling,file2,file2_total_commits,file2_coupling,nr_changed_together,max_coupling
12,app/controllers/follower_accounts_controller.rb,7,1,app/controllers/following_accounts_controller.rb,8,0.875,7,1
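
The coupling values in this row are consistent with dividing the number of joint commits by each file's total number of commits and taking the larger of the two ratios. A minimal sketch of that arithmetic follows. The formula is inferred from the table values, and the function name `coupling` is hypothetical, not taken from the notebook.

```python
def coupling(nr_changed_together: int, total_commits: int) -> float:
    """Share of a file's commits that also touched the other file."""
    return nr_changed_together / total_commits


# Values from the row above: 7 joint commits, 7 and 8 total commits.
file1_coupling = coupling(7, 7)
file2_coupling = coupling(7, 8)
max_coupling = max(file1_coupling, file2_coupling)

print(file1_coupling, file2_coupling, max_coupling)
```

Under this reading, every commit to follower_accounts_controller.rb also touched following_accounts_controller.rb, which is why the pair reaches the maximal coupling of 1.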


The joined git history:

In [154]:

def join_git_logs(git_log_1: DataFrame, git_log_2: DataFrame) -> DataFrame:
    return git_log_1.set_index('hash').join(git_log_2.set_index('hash'), how='outer', rsuffix='_right')

git_log_1 = get_git_log('app/controllers/follower_accounts_controller.rb')
git_log_2 = get_git_log('app/controllers/following_accounts_controller.rb')
display(join_git_logs(git_log_1, git_log_2))

Unnamed: 0_level_0,message,insertions,deletions,message_right,insertions_right,deletions_right
hash,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
3134691,Add support for reversible suspensions through...,10.0,2.0,Add support for reversible suspensions through...,10,2
33f3818,,,,Fix double render error when authorizing inter...,4,1
3b3bdc7,Hide blocked users from more places (#12733)\n...,5.0,1.0,Hide blocked users from more places (#12733)\n...,5,1
63b807c,Fix serialization of followers/following count...,1.0,1.0,Fix serialization of followers/following count...,1,1
ac8a788,Fix functional user requirements in whitelist ...,1.0,1.0,Fix functional user requirements in whitelist ...,1,1
b154428,"Add federation support for the ""hide network"" ...",10.0,1.0,"Add federation support for the ""hide network"" ...",10,1
b2f8106,Remove unused AccountRelationshipsPresenter ca...,0.0,1.0,Remove unused AccountRelationshipsPresenter ca...,0,1
edf09ec,Add `/api/v1/accounts/familiar_followers` to R...,3.0,3.0,Add `/api/v1/accounts/familiar_followers` to R...,3,3


This git log suggests that changes are copy-pasted between the two files: every commit that touches both files has identical insertion and deletion counts in each.

In [155]:
print(os.system('git diff --no-index mastodon/mastodon/app/controllers/follower_accounts_controller.rb mastodon/mastodon/app/controllers/following_accounts_controller.rb'))

diff --git a/mastodon/mastodon/app/controllers/follower_accounts_controller.rb b/mastodon/mastodon/app/controllers/following_accounts_controller.rb
index f3f8336..69f0321 100644
--- a/mastodon/mastodon/app/controllers/follower_accounts_controller.rb
+++ b/mastodon/mastodon/app/controllers/following_accounts_controller.rb
@@ -1,6 +1,6 @@
 # frozen_string_literal: true
 
-class FollowerAccountsController < ApplicationController
+class FollowingAccountsController < ApplicationController
   include AccountControllerConcern
   include SignatureVerification
 
@@ -21,7 +21,10 @@ class FollowerAccountsController < ApplicationController
       end
 
       format.json do
-        raise Mastodon::NotPermittedError if page_requested? && @account.hide_collections?
+        if page_requested? && @account.hide_collections?
+          forbidden
+          next
+        end
 
         expires_in(page_requested? ? 0 : 3.minutes, public: public_fetch_mode?)
 
@@ -39,9 +42,9 @@ class FollowerAccountsCont

The two files are coupled because they fulfil the same responsibility in different contexts (followers vs. following).
The coupling could be avoided by extracting the common logic into a shared third class.
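
How much two files overlap can also be expressed as a number instead of reading the diff by eye. The following is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. The helper name `line_similarity` is hypothetical, and the two line lists are short stand-ins built from the diff header above, not the full controller files.

```python
from difflib import SequenceMatcher


def line_similarity(lines_a: list[str], lines_b: list[str]) -> float:
    """Ratio in [0, 1]: 2 * matching lines / total lines in both inputs."""
    return SequenceMatcher(None, lines_a, lines_b).ratio()


# Stand-in excerpts (first lines of each controller as shown in the diff).
follower = [
    "class FollowerAccountsController < ApplicationController",
    "  include AccountControllerConcern",
    "  include SignatureVerification",
]
following = [
    "class FollowingAccountsController < ApplicationController",
    "  include AccountControllerConcern",
    "  include SignatureVerification",
]

similarity = line_similarity(follower, following)
print(f"line similarity: {similarity:.2f}")
```

To apply it to the real files, one could read both controllers with `open(...).readlines()` and pass the resulting lists. A ratio close to 1 would support the copy-paste hypothesis.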

Candidate set 2 (app/javascript/mastodon/actions/compose.js, app/javascript/mastodon/reducers/compose.js):

In [156]:
tabulate(dataframe_changed_together_filtered[dataframe_changed_together_filtered['file1'] == 'app/javascript/mastodon/actions/compose.js'], headers=changed_together_columns, tablefmt="html")

Unnamed: 0,file1,file1_total_commits,file1_coupling,file2,file2_total_commits,file2_coupling,nr_changed_together,max_coupling
28,app/javascript/mastodon/actions/compose.js,13,0.461538,app/javascript/mastodon/reducers/compose.js,7,0.857143,6,0.857143


In [157]:
git_log_1 = get_git_log('app/javascript/mastodon/actions/compose.js')
git_log_2 = get_git_log('app/javascript/mastodon/reducers/compose.js')
display(join_git_logs(git_log_1, git_log_2))

Unnamed: 0_level_0,message,insertions,deletions,message_right,insertions_right,deletions_right
hash,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
06fc6a9,Add ability to choose media thumbnail in web U...,49.0,0.0,Add ability to choose media thumbnail in web U...,22.0,0.0
0cdb077,Add language dropdown to compose in web UI (#1...,11.0,5.0,Add language dropdown to compose in web UI (#1...,8.0,1.0
11502ae,Add aliases for WebUI routes that were renamed...,1.0,1.0,,,
2aafa5b,ignore hashtag suggestions if they vary only i...,20.0,7.0,ignore hashtag suggestions if they vary only i...,14.0,0.0
2ada2ae,Fix/14021 behaviour on add or remove toots (#1...,1.0,1.0,,,
52e5c07,Change routing paths to use usernames in web U...,2.0,2.0,,,
63002cd,Add editing for published statuses (#17320)\n\...,28.0,13.0,Add editing for published statuses (#17320)\n\...,34.0,11.0
667708f,Fix pending upload count not being decremented...,2.0,3.0,Fix pending upload count not being decremented...,1.0,1.0
90f3a00,Fix regression in “Edit media” modal in web UI...,1.0,1.0,,,
9660aa4,Change local media attachments to perform heav...,21.0,2.0,,,


The actions and the reducers are tightly coupled in the Redux state management, because the reducers need to calculate the new
state given the old state and the action.
Another state management system might allow related concerns to be placed in the same file.
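
The reason for this coupling can be illustrated with a small analogue of the action/reducer pattern. This is a sketch in Python (the notebook's language), not Mastodon's JavaScript code. The action name, the action creator and the state fields are hypothetical.

```python
# Hypothetical action type; in Redux this constant lives in the actions file.
COMPOSE_LANGUAGE_CHANGE = "COMPOSE_LANGUAGE_CHANGE"


def change_compose_language(language: str) -> dict:
    """Action creator: describes what happened, does not change any state."""
    return {"type": COMPOSE_LANGUAGE_CHANGE, "language": language}


def compose_reducer(state: dict, action: dict) -> dict:
    """Reducer: computes the new state from the old state and the action."""
    if action["type"] == COMPOSE_LANGUAGE_CHANGE:
        # Return a new dict instead of mutating the old state.
        return {**state, "language": action["language"]}
    return state  # unknown actions leave the state unchanged


old_state = {"text": "hello", "language": "en"}
new_state = compose_reducer(old_state, change_compose_language("de"))
print(new_state)
```

Adding a new feature means adding both a new action and a new branch in the reducer, so the two files tend to appear in the same commits.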

Candidate set 3 (app/helpers/statuses_helper.rb, app/views/statuses/_simple_status.html.haml):

In [158]:
tabulate(dataframe_changed_together_filtered[dataframe_changed_together_filtered['file1'] == 'app/helpers/statuses_helper.rb'], headers=changed_together_columns, tablefmt="html")

Unnamed: 0,file1,file1_total_commits,file1_coupling,file2,file2_total_commits,file2_coupling,nr_changed_together,max_coupling
14,app/helpers/statuses_helper.rb,10,0.7,app/views/statuses/_simple_status.html.haml,24,0.291667,7,0.7
38,app/helpers/statuses_helper.rb,10,0.6,app/views/statuses/_detailed_status.html.haml,20,0.3,6,0.6


In [159]:
git_log_1 = get_git_log('app/helpers/statuses_helper.rb')
git_log_2 = get_git_log('app/views/statuses/_simple_status.html.haml')
display(join_git_logs(git_log_1, git_log_2))

Unnamed: 0_level_0,message,insertions,deletions,message_right,insertions_right,deletions_right
hash,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
04582e3,Remove some duplicate methods from StatusHelpe...,0.0,68.0,,,
0e362b7,,,,Fix end-user-facing uses of inline CSS (#13438...,6.0,6.0
0f38f97,,,,Fix hardcoded non-breaking space in public vie...,1.0,1.0
1d07f51,,,,Make visibility icon clickable as part of the ...,2.0,2.0
1ded3bb,,,,Change reported media attachments to always be...,2.0,2.0
1f56405,Change RTL detection to rely on unicode-bidi p...,0.0,20.0,Change RTL detection to rely on unicode-bidi p...,1.0,1.0
351c744,Fix error when trying to render component for ...,80.0,0.0,Fix error when trying to render component for ...,9.0,15.0
418f0a3,,,,Add a visibility icon to status (#14123)\n\n* ...,3.0,1.0
4850338,,,,Fix some account avatars on public pages havin...,2.0,2.0
50cd73e,,,,"Add ""Show thread"" button to public profiles (#...",4.0,0.0


Here the statuses_helper provides logic for the haml template.
Because the statuses_helper is only shared among 2 related templates, this is not an issue.

## 3 Defective Hotspots

1. Decide on how you want to detect entities that had defects in the past (e.g.,
commit message analysis vs. issue tracking system analysis) and motivate your
choice.

We decided to look at commit messages. Whenever a file was changed in a commit whose message contains a string like "bug" or "fix" (see "indicator_list" below), we increment a counter for that file.
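
A plain substring check also matches words such as "prefix" or "debugging". A stricter variant could match whole words only. The following is a sketch under that assumption. The pattern and the helper name `looks_like_defect_fix` are hypothetical and are not what this analysis uses.

```python
import re

# Whole-word match for "fix", "fixes", "fixed", "fixing", "bug", "bugs"
# and "bugfix" / "bug fix" (case-insensitive).
DEFECT_PATTERN = re.compile(
    r"\b(?:bug\s?fix(?:es|ed)?|fix(?:es|ed|ing)?|bugs?)\b",
    re.IGNORECASE,
)


def looks_like_defect_fix(message: str) -> bool:
    """True if the commit message contains a defect-related word."""
    return DEFECT_PATTERN.search(message) is not None


for msg in ["Fix double render error", "Add prefix option", "Add debugging output"]:
    print(msg, "->", looks_like_defect_fix(msg))
```

The trade-off is that a stricter pattern misses unusual spellings, while the substring check over-counts. Either way the result is a heuristic, not a ground truth of defects.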

2. Determine defective hotspots among the entities in the timeframe that you
previously selected (i.e., consider only defects in the selected timeframe). What
conclusions can you draw from this?


In [160]:
indicator_list = ["fix", "bug"] #substrings that mark a commit as a defect fix (also matches e.g. "prefix")
if 'Defects' not in analysis_df: #check if we have already saved the results

    #Compute defect count
    analysis_df['CommitPath'] = analysis_df.index #For tracking the path that a file has in the current commit if it was moved
    analysis_df['Defects'] = 0 #set defect counter to 0

    #Walk from the newest to the oldest commit so that renames can be followed backwards
    for commit in reversed(list(repo.traverse_commits())):
        fixes_bug = any(indicator in commit.msg.lower() for indicator in indicator_list)
        for file in commit.modified_files:
            idx = analysis_df[analysis_df['CommitPath']==file.new_path].index
            analysis_df.loc[idx, 'CommitPath'] = file.old_path #update commit path
            if fixes_bug:
                analysis_df.loc[idx, 'Defects'] += 1

    analysis_df.drop(columns='CommitPath', inplace=True) #get rid of temporary indexing
    analysis_df.to_csv(path_to_results)

analysis_df['Defects'].sort_values(ascending=False)[:10] #Print top 10 entries

app/javascript/styles/mastodon/components.scss    62
app/models/status.rb                              27
app/models/user.rb                                27
app/models/account.rb                             26
app/javascript/mastodon/components/status.js      20
app/models/media_attachment.rb                    19
app/lib/activitypub/activity/create.rb            18
app/views/statuses/_simple_status.html.haml       18
app/lib/feed_manager.rb                           17
app/views/statuses/_detailed_status.html.haml     16
Name: Defects, dtype: int64

3. Determine complexity hotspots at the beginning of your timeframe, then
correlate them with the defects they have presented throughout the entire timeframe.
Is there a correlation? Why do you think this is the case?

In [161]:
#Find complexity hotspots at the start
initial_commit = list(repo.traverse_commits())[0]
git.checkout(initial_commit.hash)# go to first commit

complexity_comp_exceptions = {} #to analyse what caused error (mostly images)
for idx in analysis_df.index:
    try:
        if type(analysis_df.loc[idx, 'Initial Path']) != str: #if the file is deleted, initial path is a float, nan
            analysis_df.loc[idx, 'Initial Complexity'] = 0
        else:
            full_path = subdirectory_prefix+analysis_df.loc[idx, 'Initial Path']
            with open(full_path, 'r') as file:
                analysis_df.loc[idx, 'Initial Complexity'] = len(file.readlines())
    except Exception as e:
        complexity_comp_exceptions[idx]=e
#print("Exceptions: \n", complexity_comp_exceptions)
analysis_df.to_csv(path_to_results)

git.checkout(tag.hash)# return to the analyzed release (tag v3.5.3)
display(analysis_df.sort_values(by=['Initial Complexity', 'Defects'], ascending=False)[:20])

Unnamed: 0,Name,Full Path,Complexity,Amount of changes,Initial Path,Defects,Initial Complexity
app/javascript/fonts/roboto/roboto-medium-webfont.svg,roboto-medium-webfont.svg,/home/lucius/projects/uzh/fund-software-system...,16273.0,0,app/javascript/fonts/roboto/roboto-medium-webf...,0,16273.0
app/javascript/fonts/roboto/roboto-bold-webfont.svg,roboto-bold-webfont.svg,/home/lucius/projects/uzh/fund-software-system...,16273.0,0,app/javascript/fonts/roboto/roboto-bold-webfon...,0,16273.0
app/javascript/fonts/roboto/roboto-regular-webfont.svg,roboto-regular-webfont.svg,/home/lucius/projects/uzh/fund-software-system...,15513.0,0,app/javascript/fonts/roboto/roboto-regular-web...,0,15513.0
app/javascript/fonts/roboto/roboto-italic-webfont.svg,roboto-italic-webfont.svg,/home/lucius/projects/uzh/fund-software-system...,15513.0,0,app/javascript/fonts/roboto/roboto-italic-webf...,0,15513.0
app/javascript/styles/mastodon/components.scss,components.scss,/home/lucius/projects/uzh/fund-software-system...,7790.0,102,app/javascript/styles/mastodon/components.scss,62,6563.0
app/javascript/mastodon/locales/defaultMessages.json,defaultMessages.json,/home/lucius/projects/uzh/fund-software-system...,3714.0,51,app/javascript/mastodon/locales/defaultMessage...,5,2780.0
app/javascript/fonts/roboto-mono/robotomono-regular-webfont.svg,robotomono-regular-webfont.svg,/home/lucius/projects/uzh/fund-software-system...,1051.0,0,app/javascript/fonts/roboto-mono/robotomono-re...,0,1051.0
app/javascript/styles/mastodon/forms.scss,forms.scss,/home/lucius/projects/uzh/fund-software-system...,1075.0,15,app/javascript/styles/mastodon/forms.scss,6,946.0
app/javascript/styles/mastodon/containers.scss,containers.scss,/home/lucius/projects/uzh/fund-software-system...,894.0,5,app/javascript/styles/mastodon/containers.scss,4,899.0
app/javascript/styles/mastodon/about.scss,about.scss,/home/lucius/projects/uzh/fund-software-system...,906.0,7,app/javascript/styles/mastodon/about.scss,3,892.0


In [162]:
comparison_scatter = px.scatter(analysis_df, x='Initial Complexity', y='Defects',
                 hover_data={"path": analysis_df.index})
comparison_scatter.show()

print('Scatter plot zoomed in so that the outliers (fonts and components.scss) do not distort the visualization.')
comparison_scatter_zoomed = px.scatter(analysis_df, x='Initial Complexity', y='Defects',
                 hover_data={"path": analysis_df.index})
comparison_scatter_zoomed.update_yaxes(range = [0,30])
comparison_scatter_zoomed.update_xaxes(range = [0,1100])
comparison_scatter_zoomed.show()

print('Scatter plot zoomed in without the scss and json files')
reduced_df = analysis_df.dropna(subset=['Initial Path'])
reduced_df = reduced_df[~((reduced_df['Initial Path'].str.endswith('.scss'))|(reduced_df['Initial Path'].str.endswith('.json')))]
comparison_scatter_zoomed_filtered = px.scatter(reduced_df, x='Initial Complexity', y='Defects',
                 hover_data={"path": reduced_df.index})
comparison_scatter_zoomed_filtered.update_yaxes(range = [0,30])
comparison_scatter_zoomed_filtered.update_xaxes(range = [0,1100])
comparison_scatter_zoomed_filtered.show()


Scatter plot zoomed in so that the outliers (fonts and components.scss) do not distort the visualization.


Scatter plot zoomed in without the scss and json files


Judging from the scatter plots, there is only a slight correlation between the initial complexity and later defects.
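
The visual impression could be backed by a correlation coefficient. The following is a minimal, dependency-free sketch of the Pearson and Spearman coefficients. The function names are hypothetical, and the six sample pairs are copied from the table above only to show the call, so the printed values say nothing about the full dataset.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# (Initial Complexity, Defects) for six rows of the table above.
initial_complexity = [16273, 6563, 2780, 946, 899, 892]
defects = [0, 62, 5, 6, 4, 3]
print("pearson:", pearson(initial_complexity, defects))
print("spearman:", spearman(initial_complexity, defects))
```

With the notebook's dataframe, `analysis_df['Initial Complexity'].corr(analysis_df['Defects'])` would give the Pearson coefficient directly. The rank-based Spearman variant is less sensitive to outliers such as the font files and components.scss.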

4. What conclusions can you draw from the relationship between defective hotspots
and complexity hotspots in Mastodon? And on these two metrics in general?


The initial complexity cannot predict future bugs; the correlation is not strong enough for that.
But if the Mastodon team wants to find the hotspots where too much complexity
led to a lot of defects, then this analysis is more objective than the gut feeling of the developers.