<a id='title'></a>
# Moth - Mura
MUNI Omniscient Tutor Helper - Masaryk University Repository Analyzer

This tool was created as part of the thesis *"Measuring Software Development Contributions using Git"* at Masaryk University.
The goal of this tool is to analyze students' git repositories and provide tutors with useful information about their work.

The implementation is originally written in Python 3.9 and uses the following libraries without which the tool would not be possible:
- [levenshtein](https://pypi.org/project/python-Levenshtein/) - for fuzzy matching in syntactic analysis
- [GitPython](https://gitpython.readthedocs.io/en/stable/) - for git operations
- [python-gitlab](https://python-gitlab.readthedocs.io/en/stable/) - for interfacing with GitLab
- [PyGithub](https://pygithub.readthedocs.io/en/latest/) - for interfacing with GitHub
- [matplotlib](https://matplotlib.org/) - for plotting various graphs
- [notebook](https://jupyter.org/) - for the front-end you are currently using
- [python-sonarqube-api](https://python-sonarqube-api.readthedocs.io/) - for interfacing with SonarQube Community Edition
- [docker](https://www.docker.com/) - for managing docker containers for SonarQube

Below are the necessary imports for the tool to work.

In [87]:
import fs_access as file_system
import lib
import mura
import configuration
import semantic_analysis

from uni_chars import *  # shortcut for unicode characters used throughout the tool
from history_analyzer import CommitRange

from IPython.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))  # wide screen support

# macros for automagically reloading the modules when they are changed
%load_ext autoreload
%autoreload 2

print(f"{SUCCESS} Imports loaded successfully.")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
✅ Imports loaded successfully.


<a id='toc'></a>
# Table of contents [↩](#title)
## Setup
- [Configuration](#configuration)
- [Repository setup](#repository)
- [Contributors](#contributors)
- [Config Overrides](#overrides)
- [Analysis](#analysis)
## Results
- [Commits](#commits)
- [Commit Graph](#commit-graph)
- [File statistics](#file-stats)
- [File ownership](#percentage-ownership)
- [File ownership dir-tree](#ownership-dir-tree)
- [Line distribution](#lines)
- [Unmerged commits](#unmerged-commits)
- [Rules](#rules)
- [Syntax + Semantics](#syntax-semantics)
- [Constructs](#constructs)
- [Hour estimation](#hour-estimation)
- [Remote repository](#remote-repo)
- [Summary](#summary)

<a id='configuration'></a>
# Configuration [↩](#toc)

The following code block is a shortcut for opening the configuration folder. The configuration folder contains configuration.txt with general variables and a rules.txt which holds rules to use during ownership analysis.


In [None]:
# configuration.open_configuration_folder()


Apart from the two files mentioned before, separate folders exist for `lang-syntax` and `lang-semantics`. Each definition of a language is stored in a file/folder matching the language's file extension.
The `lang-syntax` folder contains weight definitions for the language's syntax.
The `lang-semantics` folder contains weight definitions for semantics and semantic analyzers themselves as an executable and a launch command for interfacing with the driver python code.
A separate folder `remote-repo-weights` contains weight definitions for remote repository objects: Issues and Pull Requests.

## Configuration file
- `configuration_data/configuration.txt` - contains general configuration of the tool
- `lang-syntax/*` - contains weight definitions for general syntax of a language - This is currently not used as the general idea and configuration approach needs more polish.
- `lang-semantics/*` - contains weight definitions for semantic constructs
- `remote-repo-weights/weights.txt` - contains weight definitions for remote repository objects

## Rules file

- `configuration_data/rules.txt` - contains rules for the tool

Once the configuration is set, run the code block below to load the configurations into the tool.

In [None]:
config = configuration.validate()

# All properties of the config can be edited here. An editor with code completion support is recommended.
# config.ignore_remote_repo = True

# Use SonarQube Community Edition for syntactic/semantic analysis - requires Docker to be installed.
config.use_sonarqube = True
config.sonarqube_persistent = True
config.sonarqube_login = "admin"
config.sonarqube_password = "admin"
config.sonarqube_port = 8080

# Add extensions to this list to ignore them during analysis; the provided extensions will not be analyzed even if an analyzer is present.
# config.ignored_extensions = ['.cs']

config.post_validate()

Semantic analyzers for each extension are executed only if the extension is present in the repository. Therefore, it is not required to have all the prerequisites installed if the project does not include those extensions.

<a id='repository'></a>
# Repository [↩](#toc)

Put the path to the repository you want to analyze into the `repository_path` variable and run the code block below.

In [None]:
repository_path = r"/path/to/the/repository/directory"

repository = file_system.validate_repository(repository_path, config)

<a id='commit-range'></a>
# Commit range [↩](#toc)
The commit range is defined by the `start` and `end` variables. The variables can be either a commit hash or a tag/branch name.
Additionally, the `end` variable can be set to `ROOT` and `start` to `HEAD` to analyze the repository from the beginning to the current state.

In [None]:
start = "HEAD"
end = "ROOT"

commit_range = CommitRange(repository, start, end, verbose=True)

# The number of hours each student is expected to spend on the project, used for hour weight estimation.
hour_estimate_per_contributor = 24

<a id='contributors'></a>
## Contributors [↩](#toc)
Displays a table of all contributors in the repository. Each entry contains the following information:
- Name
- Email
- Aliases

Contributors often do not have a synchronized git configuration across all of their development devices. This can prevent the tool from grouping contributions under the correct contributor. The tool attempts to match contributors by their name and email.

Aliases are also used to match contributors. If an alias matches a commit author, the primary contributor name is used instead. Primary in this case means the name encountered first.

In [None]:
raw_contributors = mura.display_contributor_info(commit_range, config)
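The matching described above can be illustrated with a minimal sketch. The `Contributor` record and the `group_authors` helper are hypothetical names invented for this example and are not part of Mura's API; the sketch only assumes that authors sharing a name, an alias or an email belong to the same person, and that the first name encountered becomes the primary one.

```python
from dataclasses import dataclass, field


@dataclass
class Contributor:
    """Hypothetical record used only in this sketch (not Mura's class)."""
    name: str
    email: str
    aliases: set = field(default_factory=set)


def group_authors(authors):
    """Group (name, email) pairs; the first occurrence becomes the primary."""
    contributors = []
    for name, email in authors:
        match = next(
            (c for c in contributors
             if name == c.name or name in c.aliases or email == c.email),
            None,
        )
        if match is None:
            contributors.append(Contributor(name, email))
        elif name != match.name:
            match.aliases.add(name)  # same person under a different name
    return contributors


authors = [
    ("Jiri Stastny", "jiri@example.com"),
    ("jstastny", "jiri@example.com"),    # same email, different name
    ("Jane Doe", "jane@example.com"),
    ("jstastny", "laptop@example.com"),  # known alias, different email
]
grouped = group_authors(authors)
for contributor in grouped:
    print(contributor.name, sorted(contributor.aliases))
```

When neither the name nor the email overlaps, a heuristic like this cannot connect the two identities, which is where an explicit mapping becomes necessary.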

### Contributors
If matching by name and email is not enough, an explicit name-to-name mapping can be provided in the `contributor_map` variable. Afterwards, executing the block will show the new contributor identifiers.


In [None]:
contributor_map = \
    [
        # ('Jiří Šťastný', 'Jiri Stastny'),
    ]

config.contributor_map = contributor_map

contributors = mura.display_contributor_info(commit_range, config)

<a id='overrides'></a>
## Overrides [↩](#toc)
In case a commit was created on behalf of another person, the file ownership is attributed to the author of the commit. This can lead to incorrect ownership attribution. To fix this, commit ownership can be overridden in the following code block. The key is the complete commit hash and the value is the contributor name. The contributor name must match the name of a contributor in the `contributors` variable.

📝 This is a last resort solution. Ideally students should commit on their own behalf.


In [None]:
commit_range.ownership_overrides["commit_hexsha"] = "Jiri Stastny"

# Uncomment the following line to enable anonymous mode, which will replace the names of the contributors with "Contributor #n"
# config.anonymous_mode = True

# Uncomment the following line to force skip analysis of a specific extension, e.g. ".cs".
# config.ignored_extensions = [".cs"]
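Conceptually, an ownership override is a lookup from commit hash to contributor name that takes precedence over the git author. The sketch below shows that idea only; `resolve_owner` is a hypothetical helper, the hash is a placeholder, and how Mura consults `ownership_overrides` internally is not shown in this notebook.

```python
def resolve_owner(hexsha, author, overrides):
    """Return the overriding contributor for a commit, or its git author."""
    return overrides.get(hexsha, author)


# The key must be the complete 40-character hash; this one is a placeholder.
overrides = {"a" * 40: "Jiri Stastny"}

print(resolve_owner("a" * 40, "Jane Doe", overrides))  # overridden commit
print(resolve_owner("b" * 40, "Jane Doe", overrides))  # falls back to the author
```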


<a id='analysis'></a>
# Analysis [↩](#toc)
The analysis is a time-consuming process that takes longer the larger the repository is. For a single project from the PA165 course for Milestone 1 (120 commits), the analysis took about 10 seconds on an Intel i7-12700H CPU. If SonarQube is used, its analysis is started in the background. SonarQube analysis usually takes longer but is more extensive. The results are later retrieved using the project key. The SonarQube analysis can be skipped by setting `config.use_sonarqube` to `False`.

In [None]:
project_key = mura.start_sonar_analysis(config, repository_path)

tracked_files = lib.get_tracked_files(repository, verbose=True)
history_analysis_result = commit_range.analyze(verbose=True)

semantic_analysis_grouped_result = semantic_analysis.compute_semantic_weight_result(config, tracked_files, verbose=True)

# Results

The analysis part is finished. The tool provides multiple outputs to help the tutor analyze the students' work. Each output is produced by a separate code block. Each section links back to the Table of Contents to make it easier to navigate. Apart from the output here in the notebook, SonarQube provides a web interface to view the results of its analysis. If SonarQube is enabled in the configuration, the web interface can be accessed at `http://localhost:{port}`, where `{port}` is the configured `config.sonarqube_port`.

<a id='commits'></a>
## Commits [↩](#toc)

The following code block displays a table of all commits in the repository. For each commit, the full commit hash, the first line of the commit message and the author are displayed.

The commits are then grouped by contributors and additional information about total inserted and deleted lines over the commits are displayed in textual and graphical form.

In [None]:
commit_distribution, insertions_deletions = mura.commit_info(commit_range, repository, contributors)

In [None]:
mura.insertions_deletions_info(insertions_deletions)

<a id='commit-graph'></a>
## Commit graph [↩](#toc)
Displays a graph of the commits in the repository.

The x axis is the time axis. The y axis is the number of commits. Each dot in the graph represents a commit. The color of the dot represents the author of the commit.

The range of the x-axis is computed from the starting commit date and the ending commit date.
To display only a section of the graph, the list can be sliced. This is generally useful to filter out commits at the boundaries. Taking a section in the middle does not make much sense.

In [None]:
commits = [commit for commit in commit_range]

commits = commits[1:]  # remove the first commit
# commits = commits[:-10]  # remove the last 10 commits

mura.plot_commits(commits, commit_range, contributors, repository, force_x_axis_dense_labels=False)

<a id='file-stats'></a>
## File statistics [↩](#toc)
The first part of the output is the combined statistics of all file changes in the repository.

- A: Files Added
- D: Files Deleted
- M: Files Modified
- R: Files Renamed

The statistics are cumulative, meaning that if a file is added and then deleted in any subsequent commit, the file is counted towards both statistics.

In [None]:
flagged_files = mura.file_statistics_info(commit_range, contributors)
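To make the cumulative counting concrete, here is a small sketch with made-up data. The `count_changes` helper and the list-of-tuples input are assumptions for illustration; Mura derives the change types from the git history itself.

```python
from collections import Counter


def count_changes(commits):
    """Count change types (A, D, M, R) cumulatively over all commits."""
    totals = Counter()
    for changes in commits:
        for change_type, _path in changes:
            totals[change_type] += 1
    return totals


# Each commit is a list of (change type, path) pairs; the data is invented.
commits = [
    [("A", "src/main.py"), ("A", "README.md")],
    [("M", "src/main.py")],
    [("D", "README.md")],  # README.md now counts as both added and deleted
]
totals = count_changes(commits)
print(dict(totals))
```

Because nothing is cancelled out, `README.md` contributes one addition and one deletion even though it no longer exists in the final state.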

<a id='percentage-ownership'></a>
## Percentages and ownership [↩](#toc)

The following code block displays the percentage of ownership of each contributor. The percentage is computed based on the number of lines of code contributed by the contributor. The percentage is computed for each file and then summed up for each contributor.

The first output shows the total share of code across the project.
Then the individual files are listed.

In [None]:
percentage, ownership = mura.percentage_info(history_analysis_result, contributors, config)
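As a rough illustration of line-based ownership, the sketch below computes a per-file percentage and a project-wide share from line counts. The helper names and the input shape (`{path: {contributor: lines}}`) are assumptions made for this example, and Mura's exact aggregation may differ.

```python
def file_percentages(line_owners):
    """Per-file share of lines for each contributor."""
    result = {}
    for path, owners in line_owners.items():
        total = sum(owners.values())
        if total:  # skip empty files to avoid dividing by zero
            result[path] = {who: n / total for who, n in owners.items()}
    return result


def total_share(line_owners):
    """Share of all lines in the project for each contributor."""
    totals = {}
    for owners in line_owners.values():
        for who, n in owners.items():
            totals[who] = totals.get(who, 0) + n
    grand_total = sum(totals.values())
    if not grand_total:
        return {who: 0.0 for who in totals}
    return {who: n / grand_total for who, n in totals.items()}


# Invented line counts per file and contributor.
line_owners = {
    "src/a.py": {"Alice": 30, "Bob": 10},
    "src/b.py": {"Bob": 60},
}
per_file = file_percentages(line_owners)
share = total_share(line_owners)
print(per_file)
print(share)
```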

<a id='ownership-dir-tree'></a>
## Ownership as a directory tree [↩](#toc)

The above output is not very readable. To offer a nicer view and also show ownership of directories based on their contents, the following code block displays the ownership as a directory tree.

In [None]:
mura.display_dir_tree(config, percentage, repository)
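One way to derive directory ownership from file ownership is to add each file's line counts to every parent directory and then normalize. The sketch below assumes this approach; `directory_ownership` is a hypothetical helper and not necessarily how `display_dir_tree` computes its values.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def directory_ownership(line_owners):
    """Aggregate per-file line counts into every parent directory."""
    dir_lines = defaultdict(lambda: defaultdict(int))
    for path, owners in line_owners.items():
        for parent in PurePosixPath(path).parents:
            for who, n in owners.items():
                dir_lines[str(parent)][who] += n
    result = {}
    for directory, owners in dir_lines.items():
        total = sum(owners.values())
        if total:  # ignore directories that contain only empty files
            result[directory] = {who: n / total for who, n in owners.items()}
    return result


# Invented line counts; "." stands for the repository root.
tree_input = {
    "src/app/a.py": {"Alice": 30, "Bob": 10},
    "src/b.py": {"Bob": 60},
}
tree = directory_ownership(tree_input)
for directory in sorted(tree):
    print(directory, tree[directory])
```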

<a id='lines'></a>
## Lines, Blanks and Comments [↩](#toc)

In this section, apart from the number of lines, the top 5 largest and smallest files are shown. The information about comments is taken from the final state of the project.

In [None]:
mura.lines_blanks_comments_info(repository, ownership, semantic_analysis_grouped_result, tracked_files, contributors)
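For intuition, counting lines, blanks and comments in a single file could look like the sketch below. It is a simplification that assumes single-line comments with a known prefix; `count_lines` and `largest_files` are hypothetical helpers, and Mura obtains its figures from its own analysis results.

```python
import heapq


def count_lines(text, comment_prefix="#"):
    """Count total, blank and single-line comment lines in a source text."""
    total = blank = comment = 0
    for line in text.splitlines():
        total += 1
        stripped = line.strip()
        if not stripped:
            blank += 1
        elif stripped.startswith(comment_prefix):
            comment += 1
    return {"lines": total, "blanks": blank, "comments": comment}


def largest_files(line_counts, n=5):
    """Return the n files with the most lines, largest first."""
    return heapq.nlargest(n, line_counts.items(), key=lambda item: item[1])


source = "# header\n\nx = 1\n  # indented comment\ny = 2\n"
stats = count_lines(source)
top = largest_files({"a.py": 10, "b.py": 300, "c.py": 42}, n=2)
print(stats)
print(top)
```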

<a id='unmerged-commits'></a>
## Unmerged commits [↩](#toc)

This section analyzes branches that exist in the repository between the `start` and `end` commit but were not merged into the main branch.

In [None]:
commit_range.unmerged_commits_info(repository, config, contributors)

<a id='rules'></a>
## Rules [↩](#toc)

Rules are an easy way to assert file ownership. Rules are defined in the configuration file mentioned above.

In [None]:
rule_violation_weight_multipliers = mura.rule_info(config, repository, ownership, contributors)

<a id='syntax-semantics'></a>
## Syntax using SonarQube + Semantics [↩](#toc)

Apart from inspection performed by the tool itself, outputs from SonarQube are also available. The analysis is done in a Docker container in the steps above. The code block below will wait until the analysis is done and then query SonarQube for the results. The web interface can be accessed as well to view further results not used by Mura.

The semantic info call uses the built-in analyzers to obtain information about code constructs for each analyzed file. The results are grouped by folders.

In [None]:
syntactic_weights = mura.syntax_info(config, project_key)

semantic_weights = mura.semantic_info(tracked_files, ownership, semantic_analysis_grouped_result)

<a id='constructs'></a>
## Constructs and ownership [↩](#toc)

The following code block summarizes the data obtained in the previous step into a more readable format, discarding details and presenting counts of constructs owned by each contributor.

In [None]:
mura.constructs_info(tracked_files, ownership, semantic_analysis_grouped_result)

<a id='hour-estimation'></a>
## Hour estimation [↩](#toc)

Hour estimation is based on the tool git-hours, available [here](https://github.com/kimmobrunfeldt/git-hours). The algorithm was rewritten in Python to remove the need to install npm and complete all the other setup steps. The estimated hours are then used to compute weights for each contributor based on a normal distribution.

In [None]:
hour_estimates = mura.hour_estimates(contributors, repository)

hour_weights = mura.gaussian_weights(config, hour_estimate_per_contributor, hour_estimates)
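The session-based heuristic behind this kind of estimate can be sketched as follows. This is an illustration, not Mura's port: the two 120-minute thresholds, the handling of fewer than two commits, the unnormalized Gaussian and the `sigma` value are all assumptions, and `estimate_hours` and `gaussian_weight` are hypothetical names.

```python
import math
from datetime import datetime, timedelta


def estimate_hours(timestamps,
                   max_commit_diff=timedelta(minutes=120),
                   first_commit_addition=timedelta(minutes=120)):
    """Estimate hours worked from commit timestamps of one contributor."""
    if len(timestamps) < 2:
        return 0.0  # assumption: too little data for an estimate
    ordered = sorted(timestamps)
    total = timedelta()
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        # Small gaps belong to one coding session; a large gap starts a new
        # session whose unseen lead-up work gets a fixed allowance.
        total += gap if gap < max_commit_diff else first_commit_addition
    return total.total_seconds() / 3600


def gaussian_weight(hours, expected, sigma):
    """Unnormalized Gaussian: 1.0 at the expected hours, lower further away."""
    return math.exp(-((hours - expected) ** 2) / (2 * sigma ** 2))


stamps = [
    datetime(2023, 3, 1, 10, 0),
    datetime(2023, 3, 1, 10, 30),
    datetime(2023, 3, 1, 11, 15),
    datetime(2023, 3, 2, 9, 0),  # next day, so a new session starts
]
hours = estimate_hours(stamps)
print(hours)
print(gaussian_weight(18, expected=24, sigma=6))
```

In this example the two short gaps (30 and 45 minutes) are counted as they are, while the overnight gap is replaced by the fixed 120-minute allowance.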

<a id='remote-repo'></a>
## Remote repository [↩](#toc)

Apart from code, Mura also analyzes the remote repository if instructed to do so. The following code block displays the number of issues and pull requests. Each "remote object" participates in the weight computation. Complex pull requests and stale pull requests are penalized. Merging one's own pull requests without code review is also penalized.

In [None]:
repo_management_weights = mura.remote_info(commit_range, repository, config, contributors)

<a id='summary'></a>
## Summary [↩](#toc)

Summarizes the data collected in the steps above and displays the final weights for each contributor. Higher weights mean the contributor was more active during the development process. When comparing two projects, the absolute weights can be used to compare their complexity. Within a project, the relative distribution between the contributors can be used to identify stronger and weaker members of the team.

In [None]:
mura.summary_info(contributors, syntactic_weights, semantic_weights, repo_management_weights,
                  rule_violation_weight_multipliers, hour_weights)
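The difference between absolute weights and the relative distribution can be shown with a short sketch. The weights below are invented and `relative_distribution` is a hypothetical helper; the actual numbers come from `mura.summary_info`.

```python
def relative_distribution(weights):
    """Convert absolute weights into each contributor's share of the total."""
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: weight / total for name, weight in weights.items()}


# Invented absolute weights for three contributors.
final_weights = {"Alice": 120.0, "Bob": 60.0, "Carol": 20.0}
shares = relative_distribution(final_weights)
for name, value in shares.items():
    print(f"{name}: {value:.0%}")
```

The sum of the absolute weights (200 here) is what would be compared across projects, while the shares describe the balance within one team.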