# Fundamentals of Software Systems - SE Part I Assignment

By Andy Wiemeyer and Lucius Bachmann

### Setup tools
* check out the repo
* initialize the Git utility
* You can recreate the `Repository` object with other parameters to analyze different time periods.
  A limited time window is used so that the setup stays fast.

In [41]:
import os.path
from datetime import datetime
from pydriller import Repository, Git
from os import path, mkdir
import pandas as pd
import plotly.express as px

repo_remote_path = 'https://github.com/mastodon/mastodon.git'
repo_path = 'mastodon'
repo_checkout_path = f'{repo_path}/{repo_path}'
filepath = 'app'

since = datetime.fromisoformat('2019-11-08')
to = datetime.fromisoformat('2022-11-08')

if not path.exists(repo_path):
    mkdir(repo_path)

repo = Repository(repo_remote_path, clone_repo_to=repo_path, since=since, to=to, filepath=filepath)
# clone repo if necessary
for commit in repo.traverse_commits():
    break
git = Git(repo_checkout_path)

### Checkout repo at tag v3.5.3

In [42]:
tag = git.get_commit_from_tag('v3.5.3')
git.checkout(tag.hash)

## 1 Complexity Hotspots

1. You must consider only the app folder from the Mastodon repository
(i.e., https://github.com/mastodon/mastodon).

-> nothing to do

2. Decide on the granularity of your analysis of software entities (e.g., source code
files); describe why you selected this specific granularity.

We decided to analyze individual files. TODO: motivation

3. Create a list of all these entities, as they appear in the latest stable release of
Mastodon (i.e., tag v3.5.3).

In [43]:
full_file_paths_all_dirs = git.files() #full path of all files in the analyzed commit
subdirectory_start = "/" + repo_checkout_path + "/"
full_file_paths = [path for path in full_file_paths_all_dirs if subdirectory_start+"app/" in path]
subdirectory_start_index = full_file_paths[0].find(subdirectory_start) + len(subdirectory_start)
subdirectory_prefix = full_file_paths[0][:subdirectory_start_index]#used later
file_paths = [path[subdirectory_start_index:] for path in full_file_paths] #paths relative to analyzed subdirectory

In [44]:
file_names = [os.path.basename(p) for p in full_file_paths] #file name without its directories
print(f"Amount of different files with equal name: {len(file_names)-len(set(file_names))}")
counted_file_names = {} #file name -> number of files with that name
for name in file_names:
    counted_file_names[name] = counted_file_names.get(name, 0) + 1
print([f"{name}: {count}" for name, count in counted_file_names.items() if count>1])

Amount of different files with equal name: 291


Many different files share the same name (see above), so we use their paths to identify them:

In [45]:
print(file_paths)



4. Decide on the type of complexity you want to measure for your software entities
and explain why you selected this type.

We decided to look at the lines of code in a file. TODO: explain

5. Decide on a timeframe on which you want to base your analysis and explain the
rationale of your choice.

TODO

6. For each entity in the system, measure its complexity and the number of changes
(in the given timeframe). Merge these two pieces of information together to create
a candidate list of problematic hotspots in the app part of Mastodon.

In [46]:
path_to_results = './analysis_data.csv'
if not os.path.isfile(path_to_results): #check if we have already saved the results
    analysis_df = pd.DataFrame( #create dataframe with
        {'Name':file_names, 'Full Path':full_file_paths}, #data columns
        index=file_paths) #and index

    #Compute complexities
    complexity_comp_exceptions = {} #to analyse what caused error (mostly images)
    for idx in analysis_df.index:
        try:
            with open(analysis_df.loc[idx, 'Full Path'], 'r') as file:
                analysis_df.loc[idx, 'Complexity'] = len(file.readlines())
        except Exception as e:
            complexity_comp_exceptions[idx]=e
    analysis_df.to_csv(path_to_results)

    # Count amount of times changed
    analysis_df['CommitPath'] = analysis_df.index #For tracking the path that a file has in the current commit if it was moved
    analysis_df['Amount of changes'] = 0 #set change counter to 0
    for commit in reversed(list(repo.traverse_commits())):
        for file in commit.modified_files:
            idx = analysis_df[analysis_df['CommitPath']==file.new_path].index
            analysis_df.loc[idx, 'Amount of changes'] += 1
            analysis_df.loc[idx, 'CommitPath'] = file.old_path #update commit path
    analysis_df.drop(columns='CommitPath', inplace=True) #get rid of temporary indexing
    analysis_df.to_csv(path_to_results)
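
The backward walk with the temporary `CommitPath` column can be hard to follow, so the following is a minimal, self-contained sketch of the same bookkeeping on plain tuples. The function name `count_changes_with_renames` and the toy history are hypothetical and only illustrate the idea; in the notebook the pairs would come from pydriller's `old_path` and `new_path`.

```python
def count_changes_with_renames(commits, current_path):
    """Count how often a file changed, following renames backwards.

    commits: list ordered oldest -> newest; each commit is a list of
             (old_path, new_path) pairs, with None meaning "did not exist".
    current_path: path of the file in the newest state.
    """
    tracked = current_path  # path of the file at the commit being visited
    changes = 0
    for commit in reversed(commits):  # newest commit first
        if tracked is None:  # the file did not exist before this point
            break
        for old_path, new_path in commit:
            if new_path == tracked:
                changes += 1
                tracked = old_path  # name the file had before this commit
                break
    return changes


# Hypothetical toy history (not taken from the Mastodon repository)
history = [
    [(None, "lib/a.rb")],        # file is added
    [("lib/a.rb", "lib/a.rb")],  # file is modified
    [("lib/a.rb", "app/a.rb")],  # file is renamed
    [("app/b.rb", "app/b.rb")],  # unrelated change
]
print(count_changes_with_renames(history, "app/a.rb"))
```

Walking from the newest commit to the oldest means the path known from the analyzed tag can be used as the starting point, and every rename simply swaps the tracked path for the older one.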

In [47]:
analysis_df = pd.read_csv(path_to_results, index_col=[0])
# show results
#analysis_df['Complexity'].sort_values(ascending=False)[:10] #Print top 10 entries
#analysis_df['Amount of changes'].sort_values(ascending=False)[:10] #Print top 10 entries
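
One possible way to merge the two measurements into a single candidate list is a combined score. The sketch below is an assumption, not the method used in this notebook: it scales both metrics to the range 0 to 1 and multiplies them, so only files that are both large and frequently changed score high. The function name `rank_hotspots` is hypothetical; the three sample rows use the complexity and change counts reported for these files in this notebook.

```python
def rank_hotspots(metrics):
    """Rank files by a combined hotspot score.

    metrics: dict mapping path -> (complexity, number_of_changes).
    Returns a list of (path, score) sorted from highest to lowest score.
    """
    if not metrics:
        return []
    # Guard against division by zero when every value is 0
    max_complexity = max(c for c, _ in metrics.values()) or 1
    max_changes = max(n for _, n in metrics.values()) or 1
    scores = {
        path: (complexity / max_complexity) * (changes / max_changes)
        for path, (complexity, changes) in metrics.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = {
    "app/javascript/styles/mastodon/components.scss": (7790, 102),
    "app/javascript/mastodon/locales/en.json": (549, 54),
    "app/models/account.rb": (621, 45),
}
for path, score in rank_hotspots(sample):
    print(f"{score:.3f}  {path}")
```

A product is only one choice. A rank-based combination would be less sensitive to extreme outliers such as generated or binary-like files.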

7. Visualize the hotspots with a visualization of your choice.

In [48]:
complexity_hist = px.histogram(analysis_df, x='Complexity', hover_data={"path": analysis_df.index})
complexity_hist.show() #exponential dist in complexity -> use log axis
change_amount_hist = px.histogram(analysis_df, x='Amount of changes', hover_data={"path": analysis_df.index})
change_amount_hist.show() #exponential dist in amount of changes -> use log axis

comparison_scatter = px.scatter(analysis_df, x='Complexity', y='Amount of changes',
                 hover_data={"path": analysis_df.index})
comparison_scatter.show()

comparison_scatter_log = px.scatter(analysis_df, x='Complexity', y='Amount of changes',
                 log_x=True, log_y=True,
                 hover_data={"path": analysis_df.index})
comparison_scatter_log.show()

In [49]:
display(analysis_df.sort_values(by=['Amount of changes','Complexity'], ascending=False)[:20])

| Path (index) | Name | Full Path | Complexity | Amount of changes | Defects |
|---|---|---|---|---|---|
| app/javascript/styles/mastodon/components.scss | components.scss | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 7790.0 | 102 | 10 |
| app/javascript/mastodon/locales/en.json | en.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 549.0 | 54 | 1 |
| app/javascript/mastodon/locales/defaultMessages.json | defaultMessages.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 3714.0 | 51 | 1 |
| app/models/account.rb | account.rb | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 621.0 | 45 | 8 |
| app/javascript/mastodon/locales/ja.json | ja.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 549.0 | 43 | 1 |
| app/javascript/mastodon/locales/zh-CN.json | zh-CN.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 549.0 | 42 | 1 |
| app/javascript/mastodon/locales/gl.json | gl.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 549.0 | 42 | 1 |
| app/javascript/mastodon/locales/vi.json | vi.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 549.0 | 42 | 1 |
| app/javascript/mastodon/locales/th.json | th.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 549.0 | 42 | 1 |
| app/javascript/mastodon/locales/ko.json | ko.json | /Users/uzh/Library/CloudStorage/OneDrive-Unive... | 549.0 | 41 | 1 |


9. Analyze six candidate hotspots (not necessarily the top ones) through:

In [50]:
def compute_complexity_trends(file_paths):
    """
    Computes the complexity trends for a list of file paths.
    return: dictionary with paths as keys and, as values, lists of complexity measurements (LOC) ordered from oldest to newest commit
    """
    complexity_trends = {path: [] for path in file_paths}
    commit_paths = {path:path for path in file_paths} #For tracking the path that a file has in the current commit if it was moved

    for commit in reversed(list(repo.traverse_commits())):#TODO: use whole list
        git.checkout(commit.hash)

        #add trend values for current commit
        for key_path, value_path in commit_paths.items():
            if value_path is None: #file did not exist yet at this point in history
                continue
            try:
                full_path = subdirectory_prefix+value_path
                with open(full_path, 'r') as file:
                    complexity_trends[key_path].insert(0,len(file.readlines())) #prepend because we walk from newest to oldest
            except Exception as e:
                print("Issue with complexity trend computation: ", e)

        #update commit paths
        for file in commit.modified_files:
            for key_path, value_path in commit_paths.items():
                if value_path is not None and value_path==file.new_path:
                    commit_paths[key_path]=file.old_path #None if the file was added in this commit

    git.checkout(tag.hash) #return to initial checkout
    return complexity_trends

In [51]:
hotspot_candidates = [
    'app/javascript/fonts/roboto/roboto-medium-webfont.svg', #Complexity outlier, never changes
    'app/models/status.rb', #Very high change rate somewhat high complexity
    'app/javascript/styles/mastodon/components.scss', #Very high complexity, very high amount of changes
    'app/javascript/mastodon/locales/ca.json',
    'app/helpers/application_helper.rb',
    'app/services/activitypub/process_status_update_service.rb']
complexity_trends = compute_complexity_trends(hotspot_candidates)


In [None]:
def plot_complexity_trend(index):
    label, complexity_list = list(complexity_trends.items())[index]
    commits_num = list(range(len(complexity_list)))
    complexity_trend_line = px.line(x=commits_num, y=complexity_list, title=f"Complexity trend for {label}",
                                    labels={'x': 'number of commits', 'y': 'complexity [LOC]'})
    complexity_trend_line.show()
    print(f"Final complexity: {int(analysis_df['Complexity'][label])} LOC" )
    print(f"Amount of changes: {analysis_df['Amount of changes'][label]}" )

Candidate 1: app/javascript/fonts/roboto/roboto-medium-webfont.svg

The 'roboto-medium-webfont.svg' file is an extreme outlier: it has a very high line count and never changes.
The file name already indicates that it is not a troublesome hotspot.
The file needs many lines because it defines the shapes of the characters of a font,
and none of these lines ever need to change. It is not a hotspot and we do not need to take any action.

In [None]:
plot_complexity_trend(0)


Candidate 2: app/models/status.rb

From the trend we see that the complexity increased gradually, with some decreases in between.
Presumably these decreases are refactorings, indicating that the file does indeed require some maintenance work.

In [None]:
plot_complexity_trend(1)
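
To back a visual reading of a trend plot with numbers, a trend list can be condensed into a few figures. The following is a minimal sketch under the assumption that a trend is a list of LOC values ordered from oldest to newest commit, as returned by `compute_complexity_trends`. The helper name `summarize_trend` and the sample values are hypothetical and are not measurements from `status.rb`.

```python
def summarize_trend(loc_values):
    """Summarize a complexity trend given as LOC values, oldest first."""
    if not loc_values:
        return {"start": None, "end": None, "net_change": 0, "decreases": 0}
    # Count the steps where the file got smaller (possible refactorings)
    decreases = sum(
        1
        for previous, current in zip(loc_values, loc_values[1:])
        if current < previous
    )
    return {
        "start": loc_values[0],
        "end": loc_values[-1],
        "net_change": loc_values[-1] - loc_values[0],
        "decreases": decreases,
    }


# Hypothetical trend, for illustration only
trend = [300, 310, 325, 318, 330, 329, 340]
print(summarize_trend(trend))
```

A decrease in line count is only a hint at a refactoring; removed features or moved code produce the same signal.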

Candidate 3: app/javascript/styles/mastodon/components.scss

In [None]:
plot_complexity_trend(2)

Candidate 4: app/javascript/mastodon/locales/ca.json

In [None]:
plot_complexity_trend(3)

Candidate 5: app/helpers/application_helper.rb

In [None]:
plot_complexity_trend(4)

Candidate 6: app/services/activitypub/process_status_update_service.rb

In [None]:
plot_complexity_trend(5)

## 2 Temporal/Logical Coupling

1. Determine what could be cases of temporal/logical coupling and generate a list
of candidates with a set of coupled entities.
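
A common starting point for finding temporal coupling is to count how often two files change in the same commit. The sketch below is an assumption about how this step could be approached, not a result of this notebook. The names `count_co_changes`, `coupling_degree` and `max_files` are hypothetical, and the toy commits are not taken from Mastodon. In the notebook, each inner list would hold the `new_path` values of one commit's `modified_files`.

```python
from collections import Counter
from itertools import combinations


def count_co_changes(commits, max_files=50):
    """Count co-changes of file pairs.

    commits: iterable of lists of file paths changed together.
    max_files: commits touching more files are skipped, because very
               large commits (e.g. mass renames) couple everything.
    """
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        unique = sorted(set(files))  # sorted so each pair has one canonical order
        if len(unique) > max_files:
            continue
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, file_counts


def coupling_degree(pair, pair_counts, file_counts):
    """Share of the rarer file's changes that also touched the other file."""
    first, second = pair
    return pair_counts[pair] / min(file_counts[first], file_counts[second])


# Hypothetical toy history, for illustration only
toy_commits = [
    ["a.rb", "b.rb"],
    ["a.rb", "b.rb", "c.rb"],
    ["a.rb"],
    ["c.rb"],
]
pairs, files = count_co_changes(toy_commits)
for pair, count in pairs.most_common():
    print(pair, count, round(coupling_degree(pair, pairs, files), 2))
```

`coupling_degree` should only be called for pairs that appear in `pair_counts`, since both files are then guaranteed to have at least one counted change.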

2. Visualize these candidate sets of coupled entities with a visualization of your
choice.

3. For three set candidates in the list:
   * analyze and explain why these entities are coupled;
   * describe how important it would be to fix them, and any ideas for their
     improvement.

Candidate set 1:

Candidate set 2:

Candidate set 3:

## 3 Defective Hotspots

1. Decide on how you want to detect entities that had defects in the past (e.g.,
commit message analysis vs. issue tracking system analysis) and motivate your
choice.

We decided to look at commit messages. Whenever a file is changed in a commit whose message contains a string like "bug" or "fix" (see "indicator_list" below), we increment a defect counter for that file.
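
A plain substring test also matches words such as "prefix" or "suffix". As a possible refinement, the sketch below only matches words that start with "bug" or "fix". This is an assumption, not the check used in this notebook, and the names `BUGFIX_PATTERN` and `looks_like_bugfix` are hypothetical.

```python
import re

# Words that START with "bug" or "fix": fix, fixes, fixed, bugfix, bugs, ...
BUGFIX_PATTERN = re.compile(r"\b(?:bug|fix)\w*", re.IGNORECASE)


def looks_like_bugfix(message):
    """Return True if the commit message contains a bug-fix keyword."""
    return BUGFIX_PATTERN.search(message) is not None


# Hypothetical commit messages, for illustration only
for msg in ["Fix crash in streaming", "Add prefix option to CLI", "Bugfix: login"]:
    print(looks_like_bugfix(msg), msg)
```

This still produces false positives (for example "fixture") and misses fixes that are described without either keyword, so the resulting defect counts remain an approximation.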

2. Determine defective hotspots among the entities in the timeframe that you previously
selected (i.e., consider only defects in the selected timeframe). What
conclusions can you draw from this?


In [55]:
indicator_list = ["fix", "bug"] #substrings that mark a commit message as a bug fix
if True:#not 'Defects' in analysis_df: #check if we have already saved the results

    #Compute defect count
    analysis_df['CommitPath'] = analysis_df.index #For tracking the path that a file has in the current commit if it was moved
    analysis_df['Defects'] = 0 #set defect counter to 0

    for commit in reversed(list(repo.traverse_commits())):
        fixes_bug = any(indicator in commit.msg.lower() for indicator in indicator_list)
        for file in commit.modified_files:
            idx = analysis_df[analysis_df['CommitPath']==file.new_path].index
            analysis_df.loc[idx, 'CommitPath'] = file.old_path #update commit path
            if fixes_bug:
                analysis_df.loc[idx, 'Defects'] += 1

    analysis_df.drop(columns='CommitPath', inplace=True) #get rid of temporary indexing
    analysis_df.to_csv(path_to_results)

analysis_df['Defects'].sort_values(ascending=False)[:10] #Print top 10 entries


3. Determine complexity hotspots at the beginning of your timeframe, then correlate
them with the defects they have presented throughout the entire timeframe.
Is there a correlation? Why do you think this is the case?
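
One way to quantify such a relationship is a rank correlation, which is less sensitive to the strongly skewed distributions of complexity and defect counts than a linear correlation. The following is a minimal sketch of Spearman's rank correlation in plain Python. It is an assumption about how the question could be approached, and the function names and sample values are hypothetical rather than results of this analysis.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN if one input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values: LOC at the start of the timeframe and defect counts
initial_complexity = [120, 800, 45, 300, 2500]
defects = [1, 6, 0, 2, 4]
print(round(spearman(initial_complexity, defects), 3))
```

With the notebook's data, the two lists would be the per-file complexity at the first commit of the timeframe and the `Defects` column, aligned on the same file order.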

4. What conclusions can you draw from the relationship between defective hotspots
and complexity hotspots in Mastodon? And on these two metrics in general?