# Welcome to Advanced Software Engineering
## Analyzing Version Control System Data

![](https://comidoc.net/static/assets/thumbs/480/webp/2267906_505c.webp)

## Learning Objectives

After this session (+ exercises) the student will be able to:

  * Understand how research analyzing VCS histories collects the underlying data.
  * Export the history from Git repositories.
  * Apply scripts and programs in various languages to clean and pre-process the exported data.
  * Apply scripts and programs in various languages to analyze Git VCS data with respect to certain hypotheses/research questions.
  * Interpret the analysis results to either better understand current practices or to suggest actionable changes of current practices in software engineering teams.
  
-------------------------------------

# What are we doing today?

## Motivation

You have all read the paper by C. Bird et al., [_"Don't Touch My Code! Examining the Effects of Ownership on Software Quality"_](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bird2011dtm.pdf).

In [None]:
from IPython.display import IFrame


IFrame('./bird2011dtm.pdf', width='100%', height=800)

### What did you take away from the paper?

<!--
  * Relationship between single/shared person size of commits vs. failure rates
    - The more minor contributors, the higher the failure rate
  * Interesting: how ownership is defined
    - Correlation of low-knowledge work and failure rates
  * Main cause of failures how people commit but not who owns the most
  * Ownership by commits in single files?
    - No mapping between authors and specific files?
    - ???
-->

  * 
  * 
  * 



### Some paragraphs that I think are important...

  > "How much does ownership affect quality?"
    
  > Ownership is a general term used to describe whether one person has responsibility for a software component, or if there is no one clearly responsible developer. Within Microsoft, we have found that when more people work on a binary, it has more failures.
  
  > Interestingly, unlike some aspects of software which are known to be related to defects such as dependency complexity, or size, ownership is something that can be deliberately changed by modifying processes and policies. Thus, the answer to the question: “How much does ownership affect quality?” is important as it is actionable. Managers and team leads can make better decisions about how to govern a project by knowing the answer. 
  
   




  > We require several types of data. The most important data are the commit histories and software failures. Software repositories record the contents of every change made to a piece of software, along with the change author, the time of change, and an associated log message that may be indicative of the type of change (e.g. introducing a feature, or fixing a bug). We collected the number of changes made by each developer to each source file and used a mapping of source files to binaries in order to determine the number of changes made by each developer to each binary. Although the source code management system uses branches heavily, we only recorded changes from developers that were edits to the source code. Branching operations (e.g. branching and merging) were not counted as changes.

### After reading the paper, do you know how the researchers collected the data on which they base their analysis?

The big issue is now:
<p style="padding:6px; color: grey; background-color: white; border: orange 2px solid">"Alright, I can read the article and understand the experiment setup and results, but how would I run a similar experiment in my organization?"</p>

# What are we doing today?

## What are we really doing now?

Using a running example, we will perform a set of analyses on data from Git, a particular version control system (VCS). With minor adaptations, everything in this lecture should be applicable to other VCSs, such as Mercurial or Microsoft TFS.

You will learn how to:

  * Export the history from Git repositories.
  * Apply scripts and programs in various languages to clean and pre-process the exported data.
  * Apply scripts and programs in various languages to analyze Git VCS data with respect to certain hypotheses/research questions.
  * Interpret the analysis results to either better understand current practices or to suggest actionable changes of current practices in software engineering teams.
  
## Working together now

As you have various operating systems installed on your computers and since I cannot support all of these in such a short time, we will work together on a remotely hosted virtual machine Linux environment.

Navigate to https://github.com/HelgeCPH/2019_ase_behavioural_analysis and press the button labeled `launch binder`. If you have Linux or another Unix installed on your computer, you can run everything on your local machine right away.

After a short wait, click on the file called `Lecture notes.ipynb`. You should now see a screen similar to mine.

--------------------------------------------------------


# Getting the VCS Data

In the lecture we use [Apache Airflow](https://airflow.apache.org) as an example software system.

  > Airflow is a platform to programmatically author, schedule and monitor workflows.

The project is accessible on GitHub: https://github.com/apache/airflow. That is, we make use of the Git VCS during this lecture.

In [None]:
%%bash
git clone https://github.com/apache/airflow.git

In [None]:
%%bash
ls -ltrha airflow/

## Moving back in time 

We all want to work on the same "view" of the software, so let's go back to the state of the repository on Friday, Sept. 30th, 2022.

In [None]:
%%bash
git -C airflow/ checkout $(git -C airflow/ rev-list -n 1 --before="2022-09-30" main)

If you want to restrict an analysis to a certain period, you can add the `--after` switch too.

## The revisions of a software system, organizational history

```bash
git log
```

Note that you have to switch into the directory of the repository that you want to study, or pass that directory via `git -C <dir>` as the cells in this notebook do.

In [None]:
%%bash
git -C airflow/ log | head -30

### What do we see and how to read it?

```
commit af368243f87dfb5a4bc98a571d7b4775186d214c
Author: Ephraim Anierobi <splendidzigy24@gmail.com>
Date:   Fri Sep 30 00:28:12 2022 +0100

    Add restarting state to TaskState Enum in REST API (#26776)

commit ce071172e22fba018889db7dcfac4a4d0fc41cda
Author: Ephraim Anierobi <splendidzigy24@gmail.com>
Date:   Thu Sep 29 22:51:46 2022 +0100

    Remove DAG parsing from StandardTaskRunner (#26750)
    
    This makes the starting of StandardTaskRunner faster as the parsing of DAG will now be done once at task_run.
    Also removed parsing of example dags when running a task

commit 2e66d2d89e1e4a3c7b31a43b62d0b0ec97165dd4
Author: Brent Bovenzi <brent@astronomer.io>
Date:   Thu Sep 29 14:14:45 2022 -0400

    add icon legend to datasets graph (#26781)

commit bec80af0718e44212d02a969a65d3201648735f4
Author: pierrejeambrun <pierrejbrun@gmail.com>
Date:   Thu Sep 29 16:06:52 2022 +0200

    Allow retrieving error message from data.detail (#26762)

commit b6c5189dadb9c09967ec53c8bca1832852c5500e
Author: HTErik <89977373+hterik@users.noreply.github.com>
Date:   Thu Sep 29 05:14:25 2022 +0200
```

Note that we are looking at the master/main branch only at the moment. That is, you see only the commit messages on that single branch (we will talk more about branches in three weeks). To include other branches, use one of the following:

```bash
git log --all
```

```bash
git log --branches --remotes --tags --graph --oneline --decorate
```


### What are all these switches?

You can read the help of `git log` via:

```bash
git help log
```

All the `git log` outputs we have looked at so far are meant for humans to read and interpret.

However, we want to analyze the logs automatically to extract information that is hidden in them.

---------------------------------------------

# Collecting, Cleansing, and Pre-processing the Data

## Exporting the `git log` into a machine readable format

Read the `PRETTY FORMATS` section of the `git log` help for the placeholders that can be used in the format string.

In [None]:
%%bash
git -C airflow/ log --all --pretty=format:'%s' > data/all_commit_msgs.txt

In [None]:
%%bash
head data/all_commit_msgs.txt

In [None]:
%%bash
git -C airflow/ log --all --pretty=format:'"%h","%s"' > data/all_commit_msgs.csv

In [None]:
%%bash
head data/all_commit_msgs.csv

In [None]:
%%bash
git -C airflow/ log --pretty=format:'"%h","%an","%ad"' \
    --date=short \
    --numstat > \
    data/airflow_evo.log

In [None]:
%%bash
ls -lh data/airflow_evo.log

In [None]:
%%bash
head -30 data/airflow_evo.log

This is not completely suited to our needs yet. Let's convert it into a CSV file with one changed file per line.

In [None]:
%%bash
python evo_log_to_csv.py data/airflow_evo.log
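
The script `evo_log_to_csv.py` is not shown in these notes. As a rough idea of what such a conversion involves, here is a minimal sketch that assumes the log was produced with the `--pretty=format:'"%h","%an","%ad"' --numstat` call above. The function name `parse_numstat_log` and the output keys `removed` and `fname` are made up for illustration, and author names containing double quotes are not handled.

```python
import csv


def parse_numstat_log(text):
    """Flatten `git log --numstat` output into one record per changed file."""
    rows = []
    commit = None  # fields of the most recent commit header line
    for line in text.splitlines():
        if line.startswith('"'):
            # Header line, e.g. "a1b2c3d","Jane Doe","2022-09-30"
            commit = next(csv.reader([line]))
        elif commit is not None and line.count("\t") >= 2:
            # numstat line: <added> TAB <removed> TAB <path>
            added, removed, fname = line.split("\t", 2)
            rows.append({
                "hash": commit[0],
                "author": commit[1],
                "date": commit[2],
                # Binary files are reported as "-"; count them as 0 lines.
                "added": 0 if added == "-" else int(added),
                "removed": 0 if removed == "-" else int(removed),
                "fname": fname,
            })
    return rows


# Hypothetical sample input, not taken from the Airflow history.
sample = (
    '"a1b2c3d","Jane Doe","2022-09-30"\n'
    "3\t1\tairflow/models/dag.py\n"
    "-\t-\tdocs/img/logo.png\n"
    "\n"
    '"e4f5a6b","John Roe","2022-09-29"\n'
    "10\t0\tREADME.md\n"
)
for row in parse_numstat_log(sample):
    print(row)
```

Each commit header is repeated on every file row, which is what makes later per-file and per-author counting a matter of simple column operations.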

In [None]:
%%bash
ls -ltrh data

In [None]:
%%bash
head data/airflow_evo.log.csv

This is not completely nice either. What are the following "weird" file names?

In [None]:
%%bash
grep "=>" data/airflow_evo.log.csv | head

We have to "clean up" commits in which files have been moved so that we have a view of the current repository contents and do not lose old commit information.


In [None]:
%%bash
python repair_git_move.py data/airflow_evo.log.csv
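
The "weird" names come from how Git reports renames in `--numstat` output: either as `old => new` or, when only part of the path changes, as `prefix/{old => new}/suffix`. The following sketch shows one way to split such an entry into the old and the new path. It is not the code of `repair_git_move.py`; the helper name `split_rename` is hypothetical, and paths that themselves contain braces or ` => ` are not handled.

```python
import re

# Matches the abbreviated form, e.g. airflow/{contrib => providers}/hooks/x.py
_BRACE_RENAME = re.compile(r"^(?P<pre>.*)\{(?P<old>.*) => (?P<new>.*)\}(?P<post>.*)$")


def split_rename(path):
    """Return (old_path, new_path) for a numstat path entry."""
    match = _BRACE_RENAME.match(path)
    if match:
        # An empty side, as in a/{ => b}/c.py, would leave a double slash behind.
        old = (match["pre"] + match["old"] + match["post"]).replace("//", "/")
        new = (match["pre"] + match["new"] + match["post"]).replace("//", "/")
        return old, new
    if " => " in path:
        old, new = path.split(" => ", 1)
        return old, new
    return path, path  # not a rename


print(split_rename("airflow/{contrib => providers}/hooks/x.py"))
print(split_rename("a/{ => b}/c.py"))
print(split_rename("old_name.py => new_name.py"))
```

With the old and new path of every rename at hand, earlier rows can be rewritten to carry the current path of the file, so that the history of a moved file is not split across two names.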

In [None]:
%%bash
ls -ltrh data

In [None]:
%%bash
head data/airflow_evo.log_repaired.csv

In [None]:
%%bash
grep "=>" data/airflow_evo.log_repaired.csv | head

------------------------------------------------



# Shell-based Analysis

## How many revisions do we have in the repository?

That is: How much does the software change?

### Why does it matter?

  > In general, process measures based on the change history are more useful in predicting fault rates than product metrics of the code: For instance, the number of times code has been changed is a better indication of how many faults it will contain than is its length
  >
  > [Graves et al. _Predicting fault incidence using software change history_](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.9414&rep=rep1&type=pdf)

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | head

In [None]:
%%bash
cut -d, -f 1 data/airflow_evo.log_repaired.csv | head

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 1 | head

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 1 | sort | uniq | wc -l

#### How many revisions of each file do we have?

For the thirty most often changed files, we could formulate a query as in the following:

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 7 | sort | uniq -c | sort -nr | tail -n +2 | head -30
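
The same two questions can be answered without the shell. The sketch below uses only the Python standard library on a tiny in-memory stand-in for `data/airflow_evo.log_repaired.csv`. The column names `hash` and `current` are the ones used by the pandas cells later in these notes; the sample rows are invented.

```python
import csv
import io
from collections import Counter

# Invented stand-in for the repaired CSV; the real file has more columns.
sample_csv = io.StringIO(
    "hash,author,current\n"
    "a1,Jane,airflow/models/dag.py\n"
    "a1,Jane,tests/models/test_dag.py\n"
    "b2,John,airflow/models/dag.py\n"
    "c3,Jane,README.md\n"
)
rows = list(csv.DictReader(sample_csv))

# Equivalent of: cut -d, -f 1 | sort | uniq | wc -l
n_revisions = len({row["hash"] for row in rows})

# Equivalent of: cut -d, -f 7 | sort | uniq -c | sort -nr
changes_per_file = Counter(row["current"] for row in rows)

print(n_revisions)
print(changes_per_file.most_common(2))
```

Unlike `cut -d,`, `csv.DictReader` also copes with quoted fields that contain commas, which can matter for author names or paths.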

# Shell-based Analysis

## Who are the ten persons that contribute the most?

### Why does it matter?

  > Our conclusion: adding more programmers improves the chances that a FOSS project will be successful.
  >
  > Schweik et al. _Brooks' Versus Linus' Law: An Empirical Test of Open Source Projects_


  > ...files with changes from nine or more developers were 16 times more likely to have a vulnerability than files changed by fewer than nine developers, indicating that many developers changing code may have a detrimental effect on the system's security
  >
  > [_Secure Open Source Collaboration: An Empirical Study of Linus’ Law_](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.965.7992&rep=rep1&type=pdf)

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | sort | uniq -c | sort -n | tail -n10 | tac

Note that macOS and some other Unixes do not ship the `tac` command by default. There, you can let `tail` reverse its output directly with the `-r` switch:

```bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | sort | uniq -c | sort -n | tail -r -n10
```

### Your turn!

  * How many contributors are there in total?
 
<!--tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | sort | uniq | wc -l -->

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | ???

  * Who is contributing the least?

<!-- tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | sort | uniq -c | sort -n -->

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | ???

  * Google-fu question: What is the median contribution value?
  
<!-- tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | sort | uniq -c | cut -d '"' -f 1 | python -c"import sys,statistics;print(statistics.median([int(i) for i in sys.stdin.readlines()]))" -->

In [None]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | ???

# Scripted Analysis 1

## What are people doing?

We can analyze the commit messages to get a feeling for what people are doing.

### Why does it matter?

  > If we look at the commit cloud, we see that certain terms dominate. What you'll learn right now is by no means scientific, but it's a useful heuristic: the words that stand out tell you where you spend your time.
  >
  > A. Tornhill _Your Code as a Crime Scene_

In [None]:
%%bash
cat data/all_commit_msgs.txt | head

In [None]:
%%bash
grep "[Ff]ix" data/all_commit_msgs.txt | head

In [None]:
%%bash
grep "[Ff]ix" data/all_commit_msgs.txt | wc -l

In [None]:
%%bash
grep "[Ee]rror" data/all_commit_msgs.txt | wc -l

In [None]:
%%bash
grep "[Bb]ug" data/all_commit_msgs.txt | wc -l

With slightly enhanced word extraction based on natural-language processing (NLP), we can compute the following word frequencies.

In [None]:
%%bash
python word_freq.py data/all_commit_msgs.txt nlp
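
The script `word_freq.py` is not listed in these notes. To make the idea concrete, here is a deliberately simple sketch of word-frequency counting with the standard library. It does no real NLP (no stemming or lemmatization); the stop-word list is a small hand-picked assumption, and the function name `word_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Small, hand-picked stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def word_frequencies(messages):
    """Count lower-cased alphabetic words across all commit messages."""
    counts = Counter()
    for msg in messages:
        # Alphabetic tokens only, so references such as "#26776" are dropped.
        words = re.findall(r"[a-z]+", msg.lower())
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts


msgs = [
    "Add restarting state to TaskState Enum in REST API (#26776)",
    "Remove DAG parsing from StandardTaskRunner (#26750)",
    "add icon legend to datasets graph (#26781)",
]
print(word_frequencies(msgs).most_common(3))
```

A frequency table like this is also the kind of input a word-cloud generator needs: the more often a word occurs, the larger it is drawn.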

Let's create a word cloud from the data.

In [None]:
%%bash
python wordcloud_gen.py data/all_commit_msgs.txt nlp

In [None]:
%%bash
ls -ltrh out

![](out/all_commit_msgs.png)

# Scripted Analysis 2 

## Sentiment Analysis of Commit Messages

### Why does it matter?

Commit messages can give insight into the state of your team and likely also into the state of your software.

For example they can shed light on when your team members are "working best" and in which setup:

  > Our results show that projects developed in Java tend to have more negative commit comments, and that projects that have more distributed teams tend to have a higher positive polarity in their emotional content. Additionally, we found that commit comments written on Mondays tend to a more negative emotion.
  >
  > [E. Guzman et al. _"Sentiment Analysis of Commit Comments in GitHub: An Empirical Study"_](https://www.researchgate.net/profile/Emitza_Guzman/publication/266657943_Sentiment_analysis_of_commit_comments_in_GitHub_An_empirical_study/links/5b8305ba4585151fd134f10c/Sentiment-analysis-of-commit-comments-in-GitHub-An-empirical-study.pdf)

In [None]:
%%bash
python commit_sentiments.py data/all_commit_msgs.csv > data/all_commit_sentiment_msgs.csv
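
How `commit_sentiments.py` computes its scores is not shown here. The following toy scorer only illustrates what a polarity value between -1 and 1 means. It is a naive lexicon lookup with a tiny invented word list, not the method of the script and not suitable for real analysis.

```python
import re

# Tiny invented lexicons, purely for illustration.
POSITIVE = {"good", "great", "nice", "clean", "improve"}
NEGATIVE = {"bad", "ugly", "broken", "hack", "horrible"}


def polarity(message):
    """Return a score in [-1, 1]: -1 all negative words, 1 all positive."""
    words = re.findall(r"[a-z]+", message.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing word found
    return (pos - neg) / (pos + neg)


for msg in ["Fix ugly broken hack", "Nice clean refactoring", "Bump version"]:
    print(msg, polarity(msg))
```

Word lists like these miss domain vocabulary: in commit messages, words such as "kill" or "abort" are usually technical rather than emotional, which is one reason the scores deserve a critical look.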

In [None]:
%%bash
ls -ltrh data

In [None]:
%%bash
head data/all_commit_sentiment_msgs.csv

In [None]:
import pandas as pd


df = pd.read_csv('data/all_commit_sentiment_msgs.csv')
df.head()

#### Which are the commits containing negative commit messages?

In [None]:
df[df.polarity < -0.5]

#### Which are the commits containing positive commit messages?

In [None]:
df[df.polarity > 0.5]

**Possible Project:**

Train an NLP sentiment analysis model on a properly annotated commit history to get better results.

#### Which is the commit with the most negative commit message?

In [None]:
most_negative = df[df.polarity == df.polarity.min()]
print(most_negative.msg.values)
most_negative

## What is the corresponding commit?

Now we are crossing information from two datasets.

In [None]:
%%bash
git -C airflow/ log d4dfe2654

In [None]:
complete_df = pd.read_csv('data/airflow_evo.log_repaired.csv')
complete_df[complete_df.hash == most_negative.index[-1]]

# Scripted Analysis 3

## A Project's Bus-factor


One of the early mentions of the term in a real project is [_"If Guido was hit by a bus?"_](https://legacy.python.org/search/hypermail/python-1994q2/1040.html).


### Why does it matter?

  > ...the number of people on your team that have to be hit by a truck (or quit) before the project is in serious trouble...
  >
  > L. Williams and R. Kessler, _Pair Programming Illuminated_. Addison-Wesley, 2003.

In [None]:
complete_df.head()

Let's start with computing code ownership. There are different possibilities to define and measure code ownership:

a)
  > One measure of ownership is how much of the development activity for a component comes from one developer. If one developer makes 80% of the changes to a component, then we say that the component has high ownership. The other way that we measure ownership is by determining how many low-expertise developers are working on a component. If many developers are all making few changes to a component, then there are many non-experts working on the component and we label the component as having low ownership
  >
  > C. Bird et al. [_"Don't Touch My Code! Examining the Effects of Ownership on Software Quality"_](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bird2011dtm.pdf)


  > Proportion of Ownership – The proportion of ownership (or simply ownership) of a contributor for a particular component is the ratio of number of commits that the contributor has made relative to the total number of commits for that component.
  >
  > C. Bird et al. [_"Don't Touch My Code! Examining the Effects of Ownership on Software Quality"_](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bird2011dtm.pdf)
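
The proportion of ownership quoted above is a plain ratio of commit counts. A minimal sketch with invented author names, where the function name `ownership` is hypothetical:

```python
from collections import Counter


def ownership(commit_authors):
    """Map each author to their share of the commits touching one component."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    return {author: n / total for author, n in counts.items()}


# One entry per commit to the component: 8 by ann, 1 each by bob and cyd.
authors = ["ann"] * 8 + ["bob", "cyd"]
shares = ownership(authors)
top_owner = max(shares, key=shares.get)
print(top_owner, shares[top_owner])
```

In this example one developer makes 80% of the changes, so the component would count as having high ownership in the sense of the first quote.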

b)

There are more elaborate measures for code ownership in the literature, for example:

  > $DOA = 3.293 + 1.098 \cdot FA + 0.164 \cdot DL - 0.321 \cdot \ln(1 + AC)$
  >
  > The degree of authorship of a developer d in a file f depends on three factors: first authorship (FA), number of deliveries (DL), and number of acceptances (AC). If d is the author of f, FA is 1; otherwise it is 0; DL is the number of changes in f made by d; and AC is the number of changes in f made by other developers.
  > 
  > [G. Avelino et al. _What is the Truck Factor of Popular GitHub Applications? A First Assessment_](https://peerj.com/preprints/1233.pdf)
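
To get a feeling for the DOA formula, it helps to plug in numbers. The sketch below transcribes the formula as quoted; the function name and the two example developers are invented.

```python
import math


def degree_of_authorship(fa, dl, ac):
    """DOA as quoted above: fa is 1 for the file's creator and 0 otherwise,
    dl is the developer's own number of changes to the file,
    ac is the number of changes made by all other developers."""
    return 3.293 + 1.098 * fa + 0.164 * dl - 0.321 * math.log(1 + ac)


# The creator of a file who made 5 changes, with nobody else touching it.
creator = degree_of_authorship(fa=1, dl=5, ac=0)
# A later contributor with 2 changes, while others made 20 changes.
latecomer = degree_of_authorship(fa=0, dl=2, ac=20)
print(creator, latecomer)
```

The logarithm dampens the effect of other people's changes: going from 0 to 20 foreign changes costs less than one DOA point, while every own change adds a fixed 0.164.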
  
c) 

In *Your Code as a Crime Scene*, A. Tornhill suggests a simple measure, which he calls *knowledge ownership*: count the total number of lines added over the history of a file and, per author, the number of lines added by that author. From these two values he computes the _knowledge ownership_ rate.



For the following example, we go for the latter definition and in our computation we just say that the author with the highest _knowledge ownership_ rate on a file "owns" that file.

In [None]:
complete_df.head()

In [None]:
fname = 'INTHEWILD.md'
complete_df[complete_df.current == fname]

In [None]:
new_rows = []

for fname in set(complete_df.current.values):
    view = complete_df[complete_df.current == fname]
    sum_series = view.groupby(['author']).added.sum()
    view_df = sum_series.reset_index(name='sum_added')
    total_added = view_df.sum_added.sum()
    
    if total_added > 0:  # For binaries there are no lines counted
        view_df['owning_percent'] = view_df.sum_added / total_added
        owning_author = view_df.loc[view_df.owning_percent.idxmax()]
        new_rows.append((fname, owning_author.author, owning_author.sum_added,
                         total_added, owning_author.owning_percent))
    # All binaries are silently skipped in this report...

In [None]:
owner_df = pd.DataFrame(new_rows, columns=['artifact', 'main_dev', 'added', 'total_added', 'owner_rate'])
owner_df.head()

In [None]:
owner_freq_series = owner_df.groupby(['main_dev']).artifact.count()
owner_freq_series

In [None]:
owner_freq_df = owner_freq_series.reset_index(name='owns_no_artifacts')
owner_freq_df.sort_values(by='owns_no_artifacts', inplace=True)
owner_freq_df

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


fig = plt.figure(figsize=(12, 3))
plt.xticks(rotation=45)
plt.xticks(size=3)
plt.plot(owner_freq_df.main_dev, owner_freq_df.owns_no_artifacts)

Now, similar to
<a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7503718">G. Avelino et al. *A novel approach for estimating Truck Factors*</a>, we can start "removing" low-contributing authors from the dataset as long as more than half of the files still have an owner. The number of remaining owners is the bus-factor of that project.

In [None]:
no_artifacts = len(owner_df.artifact)
half_no_artifacts = no_artifacts // 2
count = 0

for owner, freq in owner_freq_df.values:
    no_artifacts -= freq
    if no_artifacts < half_no_artifacts:
        break
    else:
        count += 1

busfactor = len(owner_freq_df.main_dev) - count
print(f"The bus factor of Apache Airflow is: {busfactor}")
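
To check the removal loop by hand, it can be restated as a small function on plain numbers. This sketch mirrors the logic of the cell above; the function name `bus_factor` and the example counts are invented.

```python
def bus_factor(files_owned):
    """files_owned: number of files each owner 'owns', in any order."""
    remaining = sum(files_owned)  # files that still have an owner
    half = remaining // 2
    removed = 0
    # Remove owners starting with those who own the fewest files.
    for freq in sorted(files_owned):
        remaining -= freq
        if remaining < half:
            break  # removing this owner would orphan more than half the files
        removed += 1
    return len(files_owned) - removed


# Four owners: one owns 6 of the 10 files, the others own 2, 1, and 1.
print(bus_factor([6, 2, 1, 1]))
# Four owners with one file each.
print(bus_factor([1, 1, 1, 1]))
```

In the first case the three small owners can all be removed while 6 of 10 files keep their owner, so only one person remains. In the second case ownership is spread evenly and two owners remain.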

# Git Analysis Beyond Shell-Scripts

## Co-changing Files Analysis

### Why does it matter?


  > Our software evolution analysis approach enabled us to detect shortcomings of [software] such as architectural weaknesses, poorly designed inheritance hierarchies, or blurred interfaces of modules.
  >
  > H. Gall et al. _CVS release history data for detecting logical couplings_


Basic usage example of `PyDriller`:

  > PyDriller is a Python framework that helps developers on mining software repositories. With PyDriller you can easily extract information from any Git repository, such as commits, developers, modifications, diffs, and source codes, and quickly export CSV files.
  >
  > https://pydriller.readthedocs.io/

In [66]:
from pydriller import Repository


path_to_repo = "./airflow"
for commit in Repository(path_to_repo).traverse_commits():
    if "[bot]" in commit.author.name:
        print(commit.hash, commit.author_date, commit.author.name)

7dfb1f7f4847a4d316159882127cf09f6aead552 2019-08-26 20:54:17-07:00 dependabot[bot]
5bfd7f481cbb9a997d33281eab945e2074f47c52 2019-08-28 20:04:12-07:00 dependabot[bot]
8107651f8f578cc98801f37ae50ebc6be9e9d035 2019-12-14 17:25:41+00:00 dependabot[bot]
7dd7be31646a344b2d79763c3ce70319635f8a4d 2020-09-12 00:10:28+01:00 dependabot[bot]
1a56a58a0b70d02840aa53e659a5dc0c38bfd3f5 2020-12-12 09:22:45+00:00 dependabot[bot]
1be20c614f22d6413d18d492d78d9886cdb9f5c8 2020-12-18 12:45:03-05:00 dependabot[bot]
cb6914509f65f0041308288e7c6614b6c1cc735b 2020-12-21 11:49:52+00:00 dependabot[bot]
6851677a89294698cbdf9fa559bf9d12983c88e0 2021-03-10 15:06:42-05:00 dependabot[bot]
edbc89c64033517fd6ff156067bc572811bfe3ac 2021-05-03 20:07:33-07:00 dependabot[bot]
b09d9235ef99430e3e896f08069a48628f190908 2022-04-07 00:23:53+01:00 dependabot[bot]
0592bfd85631ed3109d68c8ec9aa57f0465d90b3 2022-04-07 00:24:27+01:00 dependabot[bot]
bb27d45c43834a58624a73a72bdedbedbe8d7eb4 2022-04-07 08:48:43-05:00 dependabot[bot]
309e

In [68]:
import pandas as pd
from tqdm import tqdm
from datetime import datetime



def commits_to_df(path_to_repo, from_dt=None, to_dt=None):
    commits = []
    if from_dt and to_dt:
        repo = Repository(path_to_repo, since=from_dt, to=to_dt)
    else:
        repo = Repository(path_to_repo)
    for commit in tqdm(repo.traverse_commits()):
        for f in commit.modified_files:
            commits.append((commit.hash, commit.author_date, commit.author.name, f.old_path, f.new_path))
    df = pd.DataFrame(commits, columns=["sha", "date", "author", "old_path", "new_path"])
    return df


df = commits_to_df(path_to_repo, from_dt=datetime(2021, 1, 1, 0, 0, 0), to_dt=datetime(2021, 2, 1, 0, 0, 0))

286it [00:57,  4.98it/s]


In [69]:
df

|  | sha | date | author | old_path | new_path |
|---|---|---|---|---|---|
| 0 | f6a3c822a37265e1b8a35691e00342e715af648c | 2021-01-01 06:06:02+01:00 | Kamil Breguła | airflow/www/ask_for_recompile_assets_if_needed.sh | airflow/www/ask_for_recompile_assets_if_needed.sh |
| 1 | 181d8b66a982c813836968e325692e754ddd848c | 2021-01-01 10:26:59+05:18 | Vivek Bhojawala | CONTRIBUTING.rst | CONTRIBUTING.rst |
| 2 | 181d8b66a982c813836968e325692e754ddd848c | 2021-01-01 10:26:59+05:18 | Vivek Bhojawala |  | CONTRIBUTORS_QUICK_START.rst |
| 3 | 181d8b66a982c813836968e325692e754ddd848c | 2021-01-01 10:26:59+05:18 | Vivek Bhojawala | images/quick_start/add Interpreter.png | images/quick_start/add Interpreter.png |
| 4 | 181d8b66a982c813836968e325692e754ddd848c | 2021-01-01 10:26:59+05:18 | Vivek Bhojawala | images/quick_start/add_configuration.png | images/quick_start/add_configuration.png |
| ... | ... | ... | ... | ... | ... |
| 1676 | 7c9cc41f85a909645840c7c307954b7e7420b916 | 2021-01-31 15:03:07+07:00 | Imam Digmi | INTHEWILD.md | INTHEWILD.md |
| 1677 | 9840e406fd20b91959d603a42f381b014e86d2b3 | 2021-01-31 09:11:52+00:00 | Kaxil Naik | tests/www/test_views.py | tests/www/test_views.py |
| 1678 | 8eddc8b5019890a712810b8e5b1185997adb9bf4 | 2021-01-31 17:39:55+01:00 | Ephraim Anierobi | scripts/in_container/entrypoint_exec.sh | scripts/in_container/entrypoint_exec.sh |
| 1679 | ba54afe58b7cbd3711aca23252027fbd034cca41 | 2021-01-31 20:23:45+01:00 | Danilo Trombino | airflow/providers/docker/operators/docker.py | airflow/providers/docker/operators/docker.py |


In [70]:
# Keep only Python files that were modified in place (not added, deleted, or renamed).
# .copy() gives us an independent DataFrame, which avoids pandas' SettingWithCopyWarning.
df = df[(df.old_path == df.new_path) & (df.new_path.str.endswith(".py", na=False))].copy()
df = df.drop(columns="old_path").rename(columns={"new_path": "path"})


In [71]:
df.head()

|  | sha | date | author | path |
|---|---|---|---|---|
| 23 | abf34b8aba33db4e751db44194429aeb2e8791c0 | 2020-12-31 23:59:00-08:00 | Kuba Tyszko | airflow/providers/openfaas/hooks/openfaas.py |
| 24 | abf34b8aba33db4e751db44194429aeb2e8791c0 | 2020-12-31 23:59:00-08:00 | Kuba Tyszko | tests/providers/openfaas/hooks/test_openfaas.py |
| 31 | c674f81cb7ff3453dd2ce693dc047688581c4edd | 2021-01-02 02:17:14+01:00 | Kamil Breguła | docs/build_docs.py |
| 32 | c674f81cb7ff3453dd2ce693dc047688581c4edd | 2021-01-02 02:17:14+01:00 | Kamil Breguła | docs/exts/docs_build/docs_builder.py |
| 41 | f7a1334abe4417409498daad52c97d3f0eb95137 | 2021-01-02 11:32:07+01:00 | Adrián Matellanes | airflow/providers/amazon/aws/transfers/mongo_t... |


The following code is adapted from https://www.feststelltaste.de/spotting-co-changing-files/

In [72]:
commits_df = pd.merge(df[["sha", "path"]], df[["sha", "path"]], left_on='sha', right_on='sha', suffixes=['','_other'], how='outer')
commits_df.head()

|  | sha | path | path_other |
|---|---|---|---|
| 0 | abf34b8aba33db4e751db44194429aeb2e8791c0 | airflow/providers/openfaas/hooks/openfaas.py | airflow/providers/openfaas/hooks/openfaas.py |
| 1 | abf34b8aba33db4e751db44194429aeb2e8791c0 | airflow/providers/openfaas/hooks/openfaas.py | tests/providers/openfaas/hooks/test_openfaas.py |
| 2 | abf34b8aba33db4e751db44194429aeb2e8791c0 | tests/providers/openfaas/hooks/test_openfaas.py | airflow/providers/openfaas/hooks/openfaas.py |
| 3 | abf34b8aba33db4e751db44194429aeb2e8791c0 | tests/providers/openfaas/hooks/test_openfaas.py | tests/providers/openfaas/hooks/test_openfaas.py |
| 4 | c674f81cb7ff3453dd2ce693dc047688581c4edd | docs/build_docs.py | docs/build_docs.py |


In [73]:
commits_df = commits_df[commits_df["path"] != commits_df["path_other"]]
commits_df

|  | sha | path | path_other |
|---|---|---|---|
| 1 | abf34b8aba33db4e751db44194429aeb2e8791c0 | airflow/providers/openfaas/hooks/openfaas.py | tests/providers/openfaas/hooks/test_openfaas.py |
| 2 | abf34b8aba33db4e751db44194429aeb2e8791c0 | tests/providers/openfaas/hooks/test_openfaas.py | airflow/providers/openfaas/hooks/openfaas.py |
| 5 | c674f81cb7ff3453dd2ce693dc047688581c4edd | docs/build_docs.py | docs/exts/docs_build/docs_builder.py |
| 6 | c674f81cb7ff3453dd2ce693dc047688581c4edd | docs/exts/docs_build/docs_builder.py | docs/build_docs.py |
| 9 | f7a1334abe4417409498daad52c97d3f0eb95137 | airflow/providers/amazon/aws/transfers/mongo_t... | tests/providers/amazon/aws/transfers/test_mong... |
| ... | ... | ... | ... |
| 300625 | 70345293031b56a6ce4019efe66ea9762d96c316 | tests/jobs/test_scheduler_job.py | airflow/models/dagbag.py |
| 300626 | 70345293031b56a6ce4019efe66ea9762d96c316 | tests/jobs/test_scheduler_job.py | airflow/models/serialized_dag.py |
| 300627 | 70345293031b56a6ce4019efe66ea9762d96c316 | tests/jobs/test_scheduler_job.py | airflow/serialization/serialized_objects.py |
| 300631 | ba54afe58b7cbd3711aca23252027fbd034cca41 | airflow/providers/docker/operators/docker.py | tests/providers/docker/operators/test_docker.py |


In [74]:
commit_coupling = commits_df.groupby(["path", "path_other"]).count()
commit_coupling

| path | path_other | sha |
|---|---|---|
| airflow/api/common/experimental/get_code.py | airflow/api/common/experimental/pool.py | 1 |
| airflow/api/common/experimental/get_code.py | airflow/api_connexion/endpoints/connection_endpoint.py | 1 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/dag_command.py | 1 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/task_command.py | 1 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/user_command.py | 1 |
| ... | ... | ... |
| tests/www/test_views.py | tests/www/test_app.py | 1 |
| tests/www/test_views.py | tests/www/test_init_views.py | 1 |
| tests/www/test_views.py | tests/www/test_security.py | 2 |
| tests/www/test_views.py | tests/www/test_utils.py | 1 |


In [75]:
commit_coupling['all_changes'] = commit_coupling.groupby(["path"]).sha.transform('sum')
commit_coupling

| path | path_other | sha | all_changes |
|---|---|---|---|
| airflow/api/common/experimental/get_code.py | airflow/api/common/experimental/pool.py | 1 | 128 |
| airflow/api/common/experimental/get_code.py | airflow/api_connexion/endpoints/connection_endpoint.py | 1 | 128 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/dag_command.py | 1 | 128 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/task_command.py | 1 | 128 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/user_command.py | 1 | 128 |
| ... | ... | ... | ... |
| tests/www/test_views.py | tests/www/test_app.py | 1 | 662 |
| tests/www/test_views.py | tests/www/test_init_views.py | 1 | 662 |
| tests/www/test_views.py | tests/www/test_security.py | 2 | 662 |
| tests/www/test_views.py | tests/www/test_utils.py | 1 | 662 |


In [76]:
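commit_coupling["ratio"] = commit_coupling["sha"] / commit_coupling["all_changes"]
commit_coupling.rename(columns={"sha": "cochanging"}, inplace=True)
# sort_values(..., inplace=True) on the temporary returned by reset_index() would be
# discarded, so keep the ranked view of the couplings in its own variable.
ranked_coupling = commit_coupling.reset_index().sort_values(by=["ratio", "path"], ascending=False)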
commit_coupling

| path | path_other | cochanging | all_changes | ratio |
|---|---|---|---|---|
| airflow/api/common/experimental/get_code.py | airflow/api/common/experimental/pool.py | 1 | 128 | 0.007812 |
| airflow/api/common/experimental/get_code.py | airflow/api_connexion/endpoints/connection_endpoint.py | 1 | 128 | 0.007812 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/dag_command.py | 1 | 128 | 0.007812 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/task_command.py | 1 | 128 | 0.007812 |
| airflow/api/common/experimental/get_code.py | airflow/cli/commands/user_command.py | 1 | 128 | 0.007812 |
| ... | ... | ... | ... | ... |
| tests/www/test_views.py | tests/www/test_app.py | 1 | 662 | 0.001511 |
| tests/www/test_views.py | tests/www/test_init_views.py | 1 | 662 | 0.001511 |
| tests/www/test_views.py | tests/www/test_security.py | 2 | 662 | 0.003021 |
| tests/www/test_views.py | tests/www/test_utils.py | 1 | 662 | 0.001511 |


In [77]:
commit_coupling.cochanging.max()

4
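
The self-merge above produces every pair twice, once as (a, b) and once as (b, a). As an alternative view of the same idea, the sketch below counts each unordered pair once, using only the standard library. The commit-to-files mapping is invented sample data, and the function name `co_change_counts` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """commits: mapping of commit id -> list of files changed in that commit."""
    pairs = Counter()
    for files in commits.values():
        # Sorting makes (a, b) and (b, a) the same key; set() drops duplicates.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


commits = {
    "c1": ["dag.py", "test_dag.py"],
    "c2": ["dag.py", "test_dag.py", "utils.py"],
    "c3": ["utils.py"],
}
for pair, count in co_change_counts(commits).most_common():
    print(pair, count)
```

Note that a commit touching n files contributes n * (n - 1) / 2 pairs, so very large commits (for example mass renames or reformatting) can dominate the counts and are often filtered out before such an analysis.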

----------------------------------------------------------------


# Your turn!

![](http://giphygifs.s3.amazonaws.com/media/11M1k4fIwVqPF6/giphy.gif)

Choose one or more software systems that are under version control with Git.

Use the version history as done during class (in combination with the provided Python scripts, if you want) to analyze the VCS history according to one of the small projects below.

Base your analysis on the file `<your_system>_evo.log.csv`, which you can create similarly to the examples above:

```bash
git clone <url to>/<your_system>.git

git -C <your_system>/ log --pretty=format:'"%h","%an","%ad"' \
    --date=short \
    --numstat > \
    data/<your_system>_evo.log

python evo_log_to_csv.py data/<your_system>_evo.log
python repair_git_move.py data/<your_system>_evo.log.csv
```
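For orientation: the log written by the `git log` call above consists of one quoted header line per commit (`"<hash>","<author>","<date>"`), followed by tab-separated `--numstat` lines (added lines, removed lines, path). The following is a minimal sketch of how such a log could be flattened into rows. It is not the provided `evo_log_to_csv.py`; the function name `parse_evo_log`, the row layout, and the sample data are assumptions made for illustration, and author names that themselves contain double quotes are not handled.

```python
import csv
import io


def parse_evo_log(text):
    """Flatten `git log --pretty=format:'"%h","%an","%ad"' --numstat` output.

    Returns one (sha, author, date, added, removed, path) tuple per changed file.
    """
    rows = []
    sha = author = date = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate commits
        if line.startswith('"'):
            # Commit header: three quoted, comma-separated fields.
            sha, author, date = next(csv.reader(io.StringIO(line)))
        else:
            # numstat line: added<TAB>removed<TAB>path ("-" marks binary files).
            added, removed, path = line.split("\t", 2)
            added = int(added) if added.isdigit() else 0
            removed = int(removed) if removed.isdigit() else 0
            rows.append((sha, author, date, added, removed, path))
    return rows


# Invented sample log with two commits.
sample = (
    '"a1b2c3d","Ada Lovelace","2021-03-01"\n'
    "10\t2\tsrc/app.py\n"
    "-\t-\tdocs/logo.png\n"
    "\n"
    '"e4f5a6b","Grace Hopper","2021-03-02"\n'
    "3\t1\tsrc/app.py\n"
)
for row in parse_evo_log(sample):
    print(row)
```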

You can implement your analysis in the languages and with technologies of your choice. (Likely your favorite scripting language comes in handy here.)

If you do not like the shell-based analysis that relies on Git logs, you may want to consider tools such as [PyDriller](https://github.com/ishepard/pydriller), which provide a higher-level API for many of the actions presented in this class.



I suggest that you work on one of the following problems:


### A) Can you suggest a team structure for a project?

Persons who often work on the same artifacts should perhaps work physically together and not be distributed around the globe, see e.g., M. Di Penta et al. [_The Effect of Communication Overhead on Software Maintenance Project Staffing: a Search-Based Approach_](http://crest.cs.ucl.ac.uk/fileadmin/crest/sebasepaper/DiPentaHAQ07.pdf) and N. Nagappan et al. [_The Influence of Organizational Structure on Software Quality_](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-11.pdf).

Write a script/program that suggests teams of five to ten persons.

Find a suitable categorization for all persons that rarely contribute to a project.

Chapters 11-13 in A. Tornhill _Your Code as a Crime Scene_ will likely form a good basis for the task.
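One possible starting point for this task is to count, for every pair of authors, how many files both of them have changed. The sketch below does only that; the grouping of strongly connected authors into teams is left open. It assumes that the changes are available as `(author, path)` tuples; the function name `shared_file_counts` and the example data are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def shared_file_counts(changes):
    """Count, per author pair, the number of files that both authors changed.

    `changes` is an iterable of (author, path) tuples.
    """
    authors_per_file = defaultdict(set)
    for author, path in changes:
        authors_per_file[path].add(author)

    pair_counts = defaultdict(int)
    for authors in authors_per_file.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(authors), 2):
            pair_counts[pair] += 1
    return dict(pair_counts)


# Invented example data.
changes = [
    ("alice", "core/db.py"), ("bob", "core/db.py"),
    ("alice", "core/api.py"), ("bob", "core/api.py"), ("carol", "core/api.py"),
    ("carol", "ui/view.py"),
]
pairs = shared_file_counts(changes)
for pair, count in sorted(pairs.items(), key=lambda item: (-item[1], item[0])):
    print(pair, count)
```

Pairs with high counts are candidates for the same team; authors who appear in few or no pairs are candidates for the "rarely contributing" category.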


### B) Hotspot Analysis

Based on A. Tornhill _Your Code as a Crime Scene_ chapters 3-5, write a script/program that identifies and visualizes hotspots (artifacts that change often and are complex).

You can make use of [`cloc`](https://github.com/AlDanial/cloc) (installed in this VM) to compute the size (a proxy for complexity) of a file:

```bash
./cloc --csv --by-file airflow
```
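To combine change frequency and size, something along the following lines might serve as a starting point. It is a sketch under assumptions: the changes are given as `(sha, path)` tuples, the sizes as a dictionary from path to lines of code (for example, filled from the `cloc` output), and the score `revisions * lines` is merely one simple way to rank candidates, not a formula taken from the book.

```python
from collections import Counter


def rank_hotspots(changes, lines_of_code):
    """Rank files by revisions * lines of code (an illustrative score only).

    `changes` is an iterable of (sha, path) tuples; `lines_of_code` maps path -> size.
    """
    # Count each file at most once per commit.
    revisions = Counter(path for _, path in set(changes))
    scored = [
        (path, revs, lines_of_code[path], revs * lines_of_code[path])
        for path, revs in revisions.items()
        if path in lines_of_code  # skip files without a known size, e.g. deleted ones
    ]
    return sorted(scored, key=lambda row: (-row[3], row[0]))


# Invented example data.
changes = [("c1", "a.py"), ("c2", "a.py"), ("c3", "a.py"), ("c1", "b.py"), ("c2", "old.py")]
lines_of_code = {"a.py": 100, "b.py": 900}
hotspots = rank_hotspots(changes, lines_of_code)
for path, revs, size, score in hotspots:
    print(path, revs, size, score)
```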

Alternatively, you can write another small script that computes the whitespace complexity of a file, as described in chapter 6.
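A minimal sketch of such a whitespace-complexity script could look as follows. It assumes that four spaces or one tab make up one indentation level (adjust `spaces_per_level` to the conventions of your system); the function name is hypothetical.

```python
def whitespace_complexity(source, spaces_per_level=4):
    """Return (total, mean, max) indentation levels over all non-blank lines."""
    levels = []
    for line in source.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        stripped = line.lstrip(" \t")
        indent = line[: len(line) - len(stripped)]
        # A tab counts as one level; spaces are grouped into levels.
        levels.append(indent.count("\t") + indent.count(" ") // spaces_per_level)
    if not levels:
        return 0, 0.0, 0
    return sum(levels), sum(levels) / len(levels), max(levels)


code = "def f(x):\n    if x:\n        return 1\n\n    return 0\n"
result = whitespace_complexity(code)
print(result)  # (4, 1.0, 2)
```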

### C) Change TF Code Ownership Metric

Change the above implementation of the Truck Factor metric so that it uses an ownership metric as suggested by Bird et al. That is, do not rely on the number of added lines but instead on the number of changes (commits) to files.

How do the resulting Truck Factor values change? Should we prefer one ownership metric over the other?
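As a hedged starting point, commit-based ownership per file could be computed as below: the owner of a file is the author with the most commits to it, and the ownership value is that author's share of all commits to the file. The input format (`(sha, author, path)` tuples) and the function name are assumptions; how the result is plugged into the Truck Factor computation is left to you.

```python
from collections import Counter, defaultdict


def ownership_by_commits(changes):
    """For each file, return (top_author, share of commits by that author).

    `changes` is an iterable of (sha, author, path) tuples.
    """
    commits_per_file = defaultdict(Counter)
    for sha, author, path in set(changes):  # one count per commit and file
        commits_per_file[path][author] += 1

    owners = {}
    for path, counts in commits_per_file.items():
        # Break ties alphabetically so that the result is deterministic.
        author, n = min(counts.items(), key=lambda item: (-item[1], item[0]))
        owners[path] = (author, n / sum(counts.values()))
    return owners


# Invented example data.
changes = [
    ("c1", "alice", "a.py"), ("c2", "alice", "a.py"),
    ("c3", "bob", "a.py"), ("c3", "bob", "b.py"),
]
owners = ownership_by_commits(changes)
for path, (author, share) in sorted(owners.items()):
    print(path, author, round(share, 2))
```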



<!--
### C) Architectural Decay

Investigate potential architectural decay based on measures for temporal coupling chapters 7-8 (and 9-10) in A. Tornhill _Your Code as a Crime Scene_.

Write a script/program that suggests artifacts that seem to be prone to architectural decay when their temporal couples increase drastically over an analysis period.
-->



### D) Reproduce Paper Results

Take the initial paper (preprint) from https://peerj.com/preprints/1233.pdf and the final version, G. Avelino et al. [_A novel approach for estimating Truck Factors_](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7503718). Download the given repositories, implement a computation of ownership as described in the papers, and finally compare how the Truck Factors have developed from the time the papers were published until now.

You will find their tool implementation on GitHub: [https://github.com/aserg-ufmg/Truck-Factor](https://github.com/aserg-ufmg/Truck-Factor)

### E) Relationships of Contributors and Components

Bird et al. hypothesize that:

  > Minor contributors to components will be Major contributors to other components that are related through dependency relationships
  
Does that hold for open-source projects too? Think about a simple definition of "component" that allows you to generate results quickly.
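One simple definition of "component" is the top-level directory of a file. The sketch below uses that definition and classifies each contributor per component as major or minor by their share of commits, using the 5% threshold from Bird et al. The input format (`(sha, author, path)` tuples) and both function names are assumptions, and the dependency relationships between components are not covered here.

```python
from collections import Counter, defaultdict


def component_of(path, depth=1):
    """Use the first `depth` directory levels of a path as its component."""
    parts = path.split("/")
    return "/".join(parts[:depth]) if len(parts) > depth else "(root)"


def classify_contributors(changes, threshold=0.05, depth=1):
    """Map component -> {author: "major" | "minor"} based on commit share.

    `changes` is an iterable of (sha, author, path) tuples.
    """
    commits = defaultdict(Counter)
    # Count a commit once per component, even if it touches several files there.
    for sha, author, comp in {(s, a, component_of(p, depth)) for s, a, p in changes}:
        commits[comp][author] += 1

    roles = {}
    for comp, counts in commits.items():
        total = sum(counts.values())
        roles[comp] = {
            author: "major" if n / total >= threshold else "minor"
            for author, n in counts.items()
        }
    return roles


# Invented example: bob has 1 of 31 commits in "core" but all commits in "ui".
changes = (
    [(f"a{i}", "alice", "core/db.py") for i in range(30)]
    + [("b0", "bob", "core/api.py")]
    + [(f"u{i}", "bob", "ui/view.py") for i in range(5)]
)
roles = classify_contributors(changes)
print(roles)
```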

### F) Freestyle

Formulate a problem that you want to investigate based on the history of a VCS. An example of such a problem formulation could be:
  
  * Can we find a relation between sentiments in commit messages and complexity of the files in the commit?

Write a script/program that implements the analysis for the given problem and generates suitable result data or plots.



<!--
## Formalities

  * Chose a project now and work on it for the rest of today's session.
-->
<!--  * I will be there for the session on Thursday to help with practicalities. -->
<!--
* Prepare a small presentation of your results and findings for next Monday. We will start the session with a small presentation per group.
  -->


# I do not have Linux on my computer...

All of the scripts and advice in this material assume that you are working in a Linux/Unix environment. If you are on Windows, there are many different ways of setting up such an environment. You can find a guide on setting up virtual machines with, amongst others, Ubuntu Linux here: http://itu.dk/people/ropf/blog/vagrant_install.html. I recommend that you set up such an environment, since as software engineers you should be comfortable working in various environments, depending on the needs of your future companies/clients.

For small analyses that do not require a lot of resources, you might want to fork and adapt this repository so that you can run your code on mybinder.org.