# What are we doing today?

## Motivation

C. Bird et al. [_"Don't Touch My Code! Examining the Effects of Ownership on Software Quality"_](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/bird2011dtm.pdf)

In [6]:
from IPython.display import IFrame

url = './bird2011dtm.pdf'
IFrame(url, width='100%', height=500)

  > "How much does ownership affect quality?"
  
  
  > Ownership is a general term used to describe whether one person has responsibility for a software component, or if there is no one clearly responsible developer.
  >
  > Within Microsoft, we have found that when more people work on a binary, it has more failures
  
  
  > Interestingly, unlike some aspects of software which are known to be related to defects such as dependency complexity, or size, ownership is something that can be deliberately changed by modifying processes and policies. Thus, the answer to the question: “How much does ownership affect quality?” is important as it is actionable. Managers and team leads can make better decisions about how to govern a project by knowing the answer. 

# Getting the Data

In the lecture we use [Apache Airflow](https://airflow.apache.org) as an example software system.

  > Airflow is a platform to programmatically author, schedule and monitor workflows.

The project is accessible on GitHub: https://github.com/apache/airflow

In [None]:
%%bash
git clone https://github.com/apache/airflow.git

In [1]:
%%bash
ls -ltrh airflow/

total 1120
-rw-r--r--   1 ropf  staff    43K Sep 26 11:11 BREEZE.rst
-rw-r--r--   1 ropf  staff   176K Sep 26 11:11 CHANGELOG.txt
-rw-r--r--   1 ropf  staff    24K Sep 26 11:11 CONTRIBUTING.md
-rw-r--r--   1 ropf  staff    13K Sep 26 11:11 Dockerfile
-rw-r--r--   1 ropf  staff   3.5K Sep 26 11:11 Dockerfile-checklicence
-rw-r--r--   1 ropf  staff   1.1K Sep 26 11:11 Dockerfile-context
-rw-r--r--   1 ropf  staff   750B Sep 26 11:11 INSTALL
-rw-r--r--   1 ropf  staff    14K Sep 26 11:11 LICENSE
-rw-r--r--   1 ropf  staff   5.7K Sep 26 11:11 LOCAL_VIRTUALENV.rst
-rw-r--r--   1 ropf  staff   1.1K Sep 26 11:11 MANIFEST.in
-rw-r--r--   1 ropf  staff   769B Sep 26 11:11 NOTICE
-rw-r--r--   1 ropf  staff    37K Sep 26 11:11 README.md
-rw-r--r--   1 ropf  staff    96K Sep 26 11:11 UPDATING.md
drwxr-xr-x  36 ropf  staff   1.1K Sep 26 11:11 airflow
-rwxr-xr-x   1 ropf  staff    37K Sep 26 11:11 breeze
-rw-r--r--   1 ropf  staff   4.7K Sep 26 11:11 breeze-complete
drwxr-xr-x   4 ropf  staff   128B

## Moving back in time 

We all want to work on the same "view" of the software, so let's go back to the state of the repository on Monday, September 23, 2019.

In [None]:
%%bash
cd airflow
git checkout $(git rev-list -n 1 --before="2019-09-23" master)

## The revisions of a software system, organizational history

```bash
git log
```

Note that you first have to change into the directory of the repository that you want to study.

### What do we see and how to read it?

```
commit 30c442c9b8f4f98774841308a98f0e5ad1bce6a6 (HEAD)
Author: Jarek Potiuk <jarek.potiuk@polidea.com>
Date:   Sun Sep 22 18:54:03 2019 +0100

    [AIRFLOW-5537] Yamllint is not needed as dependency on host

    It used to be needed for pre-commits but is not needed any more
    as it is automatically installed as dependency in the virtualenv
    created by pre-commit

commit f63e4e37d00e52165d7a241626e207e192aae6f2
Author: Jarek Potiuk <jarek.potiuk@polidea.com>
Date:   Sun Sep 22 18:51:54 2019 +0100

    [AIRFLOW-5536] Better handling of temporary output files

commit 511615c884a09cd95d1e74a748cc10d6d9e9013d
Author: Jarek Potiuk <jarek.potiuk@polidea.com>
Date:   Sun Sep 22 18:47:34 2019 +0100

    [AIRFLOW-5535] Fix name of VERBOSE parameter
```

Note that at the moment we are looking at the master branch only. That is, you only see the commits on that single branch (we will talk more about branches in three weeks).

```bash
git log --all
```

```bash
git log --branches --remotes --tags --graph --oneline --decorate
```


### What are all these switches?

You can read the help of `git log` via:

```bash
git help log
```

All the `git log` outputs we have looked at so far are meant for humans to read and interpret.

However, we want to analyze the logs automatically to extract information that is hidden in them.

---------------------------------------------

# Collecting and Cleaning the Data

## Exporting the `git log` into a machine readable format

Read the `PRETTY FORMATS` section of the `git log` help for the placeholders that can be used in the format string.

In [75]:
%%bash
cd airflow/
git log --all --pretty=format:'%s' > ../data/all_commit_msgs.txt

In [76]:
%%bash
head data/all_commit_msgs.txt

[AIRFLOW-5555] Remove Hipchat integration (#6184)
[AIRFLOW-5528] end_of_log_mark should not be a log record (#6159)
[AIRFLOW-4858] Deprecate "Historical convenience functions" in airflow.configuration (#6144)
[AIRFLOW-3871] Operators template fields can now render fields inside objects (#4743)
[AIRFLOW-4858] Deprecate "Historical convenience functions" in airflow.configuration (#5495)
[AIRFLOW-4864] Remove calls to load_test_config (#5502)
[AIRFLOW-XXX] Don't trust python-requests.org to run a valid HTTPS server (#6179)
[AIRFLOW-XXX] Don't trust python-requests.org to run a valid HTTPS server (#6179)
[AIRFLOW-5522] BQ list dataset tables operator (#6151)
[AIRFLOW-4068] Add GoogleCloudStorageFileTransformOperator (#6177)


In [77]:
%%bash
cd airflow/
git log --all --pretty=format:'"%h","%s"' > ../data/all_commit_msgs.csv
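
Note that this format string does not escape double quotes inside commit messages, so a subject that itself contains quotes (such as the `[AIRFLOW-4858]` message in the text export) does not yield a valid CSV row. A minimal sketch of one way around this, assuming the log is instead exported with a separator that does not occur in commit subjects (here the ASCII unit separator, via `--pretty=format:'%h%x1f%s'`) and quoted afterwards with Python's `csv` module. The function name `log_to_csv` is hypothetical:

```python
import csv
import io

SEP = "\x1f"  # ASCII unit separator; assumed never to appear in a commit subject


def log_to_csv(lines):
    """Turn 'hash<SEP>subject' lines into properly quoted CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        commit_hash, subject = line.rstrip("\n").split(SEP, 1)
        writer.writerow([commit_hash, subject])  # embedded quotes get doubled
    return out.getvalue()


# One line as it would come out of the assumed git export.
sample = [SEP.join(["d28cf63ca", '[AIRFLOW-4858] Deprecate "Historical convenience functions"'])]
csv_text = log_to_csv(sample)
print(csv_text)

# Reading the text back recovers the original subject, quotes included.
rows = list(csv.reader(io.StringIO(csv_text)))
```

The doubled quotes (`""`) in the printed row are the standard CSV way of escaping a quote inside a quoted field.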

In [78]:
%%bash
head data/all_commit_msgs.csv

"fd8de3e48","[AIRFLOW-5555] Remove Hipchat integration (#6184)"
"d06a95611","[AIRFLOW-5528] end_of_log_mark should not be a log record (#6159)"
"d28cf63ca","[AIRFLOW-4858] Deprecate "Historical convenience functions" in airflow.configuration (#6144)"
"fe3926c10","[AIRFLOW-3871] Operators template fields can now render fields inside objects (#4743)"
"9e6a582bb","[AIRFLOW-4858] Deprecate "Historical convenience functions" in airflow.configuration (#5495)"
"735ac4d40","[AIRFLOW-4864] Remove calls to load_test_config (#5502)"
"d6285e20d","[AIRFLOW-XXX] Don't trust python-requests.org to run a valid HTTPS server (#6179)"
"66a139d73","[AIRFLOW-XXX] Don't trust python-requests.org to run a valid HTTPS server (#6179)"
"65ff16fa1","[AIRFLOW-5522] BQ list dataset tables operator (#6151)"
"04f955977","[AIRFLOW-4068] Add GoogleCloudStorageFileTransformOperator (#6177)"


In [14]:
%%bash
cd airflow/
git log --pretty=format:'"%h","%an","%ad"' \
    --date=short \
    --numstat > \
    ../data/airflow_evo.log

In [15]:
%%bash
head data/airflow_evo.log

"30c442c9b","Jarek Potiuk","2019-09-22"
1	1	scripts/ci/ci_before_install.sh

"f63e4e37d","Jarek Potiuk","2019-09-22"
9	4	scripts/ci/_utils.sh

"511615c88","Jarek Potiuk","2019-09-22"
1	1	scripts/ci/_utils.sh

"1815ef32d","Jarek Potiuk","2019-09-22"


This is not quite suited to our needs yet. Let's convert it into a CSV file with one changed file per line.

In [17]:
%%bash
python evo_log_to_csv.py data/airflow_evo.log
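
The script `evo_log_to_csv.py` is not shown here. A minimal sketch of what such a conversion could look like, assuming the log layout produced by the `git log` call above (a quoted header line per commit, followed by tab-separated `--numstat` lines). The function name `evo_log_rows` and the binary file path in the sample are hypothetical:

```python
import csv
import io


def evo_log_rows(lines):
    """Yield (hash, author, date, added, removed, fname) for each changed file."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines separate commits
        if line.startswith('"'):
            # Commit header written by the format string '"%h","%an","%ad"'
            header = next(csv.reader([line]))
            continue
        added, removed, fname = line.split("\t", 2)
        # git prints '-' instead of line counts for binary files
        added = None if added == "-" else int(added)
        removed = None if removed == "-" else int(removed)
        yield (*header, added, removed, fname)


sample_log = io.StringIO(
    '"30c442c9b","Jarek Potiuk","2019-09-22"\n'
    "1\t1\tscripts/ci/ci_before_install.sh\n"
    "\n"
    '"f63e4e37d","Jarek Potiuk","2019-09-22"\n'
    "9\t4\tscripts/ci/_utils.sh\n"
    "-\t-\tdocs/img/example.png\n"  # hypothetical binary file, for illustration only
)
rows = list(evo_log_rows(sample_log))
for row in rows:
    print(row)
```

Commits without any `--numstat` lines simply produce no rows in this sketch.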

In [18]:
%%bash
ls -ltrh data

total 9264
-rw-r--r--  1 ropf  staff   465K Sep 26 13:47 all_commit_msgs.txt
-rw-r--r--  1 ropf  staff   1.0M Sep 26 13:48 airflow_evo.log
-rw-r--r--  1 ropf  staff   1.7M Sep 26 13:59 airflow_evo.log.csv


In [19]:
%%bash
head data/airflow_evo.log.csv

hash,author,date,added,removed,fname
"30c442c9b","Jarek Potiuk","2019-09-22",1,1,"scripts/ci/ci_before_install.sh"
"f63e4e37d","Jarek Potiuk","2019-09-22",9,4,"scripts/ci/_utils.sh"
"511615c88","Jarek Potiuk","2019-09-22",1,1,"scripts/ci/_utils.sh"
"1815ef32d","Jarek Potiuk","2019-09-22",51,2,"scripts/ci/_utils.sh"
"1815ef32d","Jarek Potiuk","2019-09-22",1,16,"scripts/ci/ci_before_install.sh"
"1815ef32d","Jarek Potiuk","2019-09-22",3,0,"scripts/ci/ci_run_all_static_tests.sh"
"1815ef32d","Jarek Potiuk","2019-09-22",3,0,"scripts/ci/ci_run_all_static_tests_except_pylint.sh"
"1815ef32d","Jarek Potiuk","2019-09-22",3,0,"scripts/ci/ci_run_all_static_tests_except_pylint_licence.sh"
"1815ef32d","Jarek Potiuk","2019-09-22",3,0,"scripts/ci/ci_run_all_static_tests_pylint.sh"


This is not quite right either. What are the following "weird" file names?

In [21]:
%%bash
grep "=>" data/airflow_evo.log.csv | head

"8f04ebe66","Tomek","2019-09-21",0,0,"tests/{contrib => gcp}/utils/base_gcp_mock.py"
"8f04ebe66","Tomek","2019-09-21",1,1,"tests/{contrib => gcp}/utils/base_gcp_system_test_case.py"
"8f04ebe66","Tomek","2019-09-21",0,0,"tests/{contrib => gcp}/utils/gcp_authenticator.py"
"857788e30","Jarek Potiuk","2019-09-18",4,16,"scripts/ci/{ci_build.sh => pre_commit_check_license.sh}"
"686fac044","Tomek","2019-09-17",0,0,"airflow/{contrib/hooks/google_discovery_api_hook.py => gcp/hooks/discovery_api.py}"
"686fac044","Tomek","2019-09-17",1,1,"airflow/{contrib => }/operators/google_api_to_s3_transfer.py"
"686fac044","Tomek","2019-09-17",7,7,"tests/{contrib/hooks/test_google_discovery_api_hook.py => gcp/hooks/test_google_discovery_api.py}"
"686fac044","Tomek","2019-09-17",11,11,"tests/{contrib => }/operators/test_google_api_to_s3_transfer.py"
"fe469932c","Kamil Breguła","2019-09-15",0,0,"docs/howto/operator/gcp/{bigquery.rst => bigquery_dts.rst}"
"fe469932c","Kamil Breguła","2019-09-15",0,0,"docs/howto

We have to "clean up" commits in which files have been moved so that we have a view of the current repository contents and do not lose old commit information.


In [None]:
%%bash
python repair_git_move.py data/airflow_evo.log.csv
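
The script `repair_git_move.py` is not shown here. As a rough sketch of the core step, the `{old => new}` notation that `git log --numstat` uses for moved files can be split into the old and the new path as follows. The helper name `split_rename` is hypothetical, and the sketch does not follow chains of renames to the current file name:

```python
import re

# Matches the braced part of a rename, e.g. '{contrib => gcp}' or '{contrib => }'
_BRACED = re.compile(r"\{(.*?) => (.*?)\}")


def split_rename(path):
    """Return (old_path, new_path); both equal `path` if it is not a rename."""
    match = _BRACED.search(path)
    if match:
        prefix, suffix = path[:match.start()], path[match.end():]
        old = prefix + match.group(1) + suffix
        new = prefix + match.group(2) + suffix
    elif " => " in path:
        # No common prefix or suffix: git prints 'old/path => new/path'
        old, new = path.split(" => ", 1)
    else:
        return path, path
    # An empty side such as '{contrib => }' leaves a doubled slash behind
    return old.replace("//", "/"), new.replace("//", "/")


print(split_rename("tests/{contrib => gcp}/utils/base_gcp_mock.py"))
print(split_rename("airflow/{contrib => }/operators/google_api_to_s3_transfer.py"))
```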

In [24]:
%%bash
ls -ltrh data

total 15416
-rw-r--r--  1 ropf  staff   465K Sep 26 13:47 all_commit_msgs.txt
-rw-r--r--  1 ropf  staff   1.0M Sep 26 13:48 airflow_evo.log
-rw-r--r--  1 ropf  staff   1.7M Sep 26 13:59 airflow_evo.log.csv
-rw-r--r--  1 ropf  staff   2.7M Sep 26 14:21 airflow_evo.log_repaired.csv


In [25]:
%%bash
head data/airflow_evo.log_repaired.csv

"hash","author","date","added","removed","fname","current","old","new"
"30c442c9b","Jarek Potiuk","2019-09-22","1.0","1.0","scripts/ci/ci_before_install.sh","scripts/ci/ci_before_install.sh","",""
"f63e4e37d","Jarek Potiuk","2019-09-22","9.0","4.0","scripts/ci/_utils.sh","scripts/ci/_utils.sh","",""
"511615c88","Jarek Potiuk","2019-09-22","1.0","1.0","scripts/ci/_utils.sh","scripts/ci/_utils.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","51.0","2.0","scripts/ci/_utils.sh","scripts/ci/_utils.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","1.0","16.0","scripts/ci/ci_before_install.sh","scripts/ci/ci_before_install.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","3.0","0.0","scripts/ci/ci_run_all_static_tests.sh","scripts/ci/ci_run_all_static_tests.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","3.0","0.0","scripts/ci/ci_run_all_static_tests_except_pylint.sh","scripts/ci/ci_run_all_static_tests_except_pylint.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","3.0","0.0","script

In [27]:
%%bash
grep "=>" data/airflow_evo.log_repaired.csv | head

"8f04ebe66","Tomek","2019-09-21","0.0","0.0","tests/{contrib => gcp}/utils/base_gcp_mock.py","tests/gcp/utils/base_gcp_mock.py","tests/contrib/utils/base_gcp_mock.py","tests/gcp/utils/base_gcp_mock.py"
"8f04ebe66","Tomek","2019-09-21","1.0","1.0","tests/{contrib => gcp}/utils/base_gcp_system_test_case.py","tests/gcp/utils/base_gcp_system_test_case.py","tests/contrib/utils/base_gcp_system_test_case.py","tests/gcp/utils/base_gcp_system_test_case.py"
"8f04ebe66","Tomek","2019-09-21","0.0","0.0","tests/{contrib => gcp}/utils/gcp_authenticator.py","tests/gcp/utils/gcp_authenticator.py","tests/contrib/utils/gcp_authenticator.py","tests/gcp/utils/gcp_authenticator.py"
"857788e30","Jarek Potiuk","2019-09-18","4.0","16.0","scripts/ci/{ci_build.sh => pre_commit_check_license.sh}","scripts/ci/pre_commit_check_license.sh","scripts/ci/ci_build.sh","scripts/ci/pre_commit_check_license.sh"
"686fac044","Tomek","2019-09-17","0.0","0.0","airflow/{contrib/hooks/google_discovery_api_hook.py => gcp/hooks/d

# First Shell-based Analysis

## How many revisions do we have in the repository?

That is, how much does the software change?

In [31]:
%%bash
cut -d, -f 1 data/airflow_evo.log_repaired.csv | head

"hash"
"30c442c9b"
"f63e4e37d"
"511615c88"
"1815ef32d"
"1815ef32d"
"1815ef32d"
"1815ef32d"
"1815ef32d"
"1815ef32d"


In [35]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | head

"30c442c9b","Jarek Potiuk","2019-09-22","1.0","1.0","scripts/ci/ci_before_install.sh","scripts/ci/ci_before_install.sh","",""
"f63e4e37d","Jarek Potiuk","2019-09-22","9.0","4.0","scripts/ci/_utils.sh","scripts/ci/_utils.sh","",""
"511615c88","Jarek Potiuk","2019-09-22","1.0","1.0","scripts/ci/_utils.sh","scripts/ci/_utils.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","51.0","2.0","scripts/ci/_utils.sh","scripts/ci/_utils.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","1.0","16.0","scripts/ci/ci_before_install.sh","scripts/ci/ci_before_install.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","3.0","0.0","scripts/ci/ci_run_all_static_tests.sh","scripts/ci/ci_run_all_static_tests.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","3.0","0.0","scripts/ci/ci_run_all_static_tests_except_pylint.sh","scripts/ci/ci_run_all_static_tests_except_pylint.sh","",""
"1815ef32d","Jarek Potiuk","2019-09-22","3.0","0.0","scripts/ci/ci_run_all_static_tests_except_pylint_licence.sh","scripts/ci/ci_r

In [36]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 1 | head

"30c442c9b"
"f63e4e37d"
"511615c88"
"1815ef32d"
"1815ef32d"
"1815ef32d"
"1815ef32d"
"1815ef32d"
"1815ef32d"
"1815ef32d"


In [39]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 1 | sort | uniq | wc -l

    7069


## Who are the ten persons that contribute the most?

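The pipeline below counts rows, i.e. changed files, per author. It ends with the BSD/macOS `tail -r -n10`, which prints the last ten lines in reverse order; on Linux the equivalent is likely `tail -n10 | tac`.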

In [45]:
%%bash
tail -n +2 data/airflow_evo.log_repaired.csv | cut -d, -f 2 | sort | uniq -c | sort | tail -r -n10

2406 "Maxime Beauchemin"
1725 "Bolke de Bruin"
1061 "Kamil Breguła"
1008 "Jarek Potiuk"
1002 "Fokko Driesprong"
 930 "Maxime"
 750 "Kaxil Naik"
 681 "Tomek"
 585 "Chao-Han Tsai"
 487 "Ash Berlin-Taylor"
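
Splitting on commas with `cut` works here, but it would break on fields that contain a comma. A minimal sketch of the same two counts using Python's `csv` module instead, assuming a CSV with `hash` and `author` columns as in the repaired log. The function name `summarize` is hypothetical, and like the shell pipeline it counts changed-file rows per author, not commits:

```python
import csv
import io
from collections import Counter


def summarize(csv_file, top=10):
    """Return (number of distinct commits, top authors by changed-file rows)."""
    hashes = set()
    rows_per_author = Counter()
    for row in csv.DictReader(csv_file):
        hashes.add(row["hash"])
        rows_per_author[row["author"]] += 1
    return len(hashes), rows_per_author.most_common(top)


# A few rows in the layout of the log CSV, shortened to six columns.
sample_csv = io.StringIO(
    '"hash","author","date","added","removed","fname"\n'
    '"30c442c9b","Jarek Potiuk","2019-09-22",1,1,"scripts/ci/ci_before_install.sh"\n'
    '"f63e4e37d","Jarek Potiuk","2019-09-22",9,4,"scripts/ci/_utils.sh"\n'
    '"1815ef32d","Jarek Potiuk","2019-09-22",51,2,"scripts/ci/_utils.sh"\n'
    '"1815ef32d","Jarek Potiuk","2019-09-22",1,16,"scripts/ci/ci_before_install.sh"\n'
    '"8f04ebe66","Tomek","2019-09-21",0,0,"tests/{contrib => gcp}/utils/base_gcp_mock.py"\n'
)
n_commits, top_authors = summarize(sample_csv, top=2)
print(n_commits, top_authors)
```

For the real data, the file would be opened with `open("data/airflow_evo.log_repaired.csv", newline="")` and passed to `summarize`.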


### Your turn!

  * Who is contributing the least?
  * What is the median contribution value?

# Analysis 1: What are people doing?

We can analyze the commit messages to get a feeling for what people are doing.

In [48]:
%%bash
grep "fix" data/all_commit_msgs.txt | head

[AIRFLOW-XXX] fix backticks in new file (#6164)
AIRFLOW-5484: fix PigCliHook has incorrect named parameter (#6112)
[AIRFLOW-5441] Ownership of package*.json file group write is fixed (#6061)
fixup! [AIRFLOW-5424] Type annotations for GCP hooks
fix postrgres bug
fix postrgres bug
fixup! [AIRFLOW-4964] Add BigQuery Data Transfer Hook and Operator (#5769)
[AIRFLOW-5209] Bump Sphinx version to fix doc build (#5814)
[AIRFLOW-5248] Pylint fixes related to source constructor param removal
[AIRFLOW-5140] fix all missing type annotation errors from dmypy (#5664)


In [47]:
%%bash
grep "fix" data/all_commit_msgs.txt | wc -l

     487


In [49]:
%%bash
grep "error" data/all_commit_msgs.txt | wc -l

     164


In [50]:
%%bash
grep "bug" data/all_commit_msgs.txt | wc -l

     199


In [51]:
%%bash
python word_freq.py data/all_commit_msgs.txt nlp

Reading data...
Lowering cases...
Counting frequencies...
Removing stopwords...
5370 ]
5365 [
1481 add
1241 fix
1130 merge
984 pull
983 request
705 airflow-xxx
418 use
403 operator
366 test
341 remove
305 airflow
275 update
268 task
259 log
257 list
253 hook
253 dag
250 doc
236 support
220 adding
218 file
215 make
213 '
201 user
173 run
172 branch
156 version
155 view
155 error
152 allow
144 set
142 connection
139 default
133 bug
133 change
131 close
129 typo
129 gcp
125 import
124 documentation
123 config
122 miss
116 apache/incubator-airflow
115 ui
112 move
111 option
110 check
106 google



In [None]:
%%bash
pythonw wordcloud_gen.py data/all_commit_msgs.txt nlp

In [53]:
%%bash
ls -ltrh out

total 624
-rw-r--r--  1 ropf  staff   308K Sep 26 15:19 all_commit_msgs.png


![](out/all_commit_msgs.png)

## What is the sentiment of the commit messages?

In [None]:
%%bash
python commit_sentiments.py data/all_commit_msgs.csv > data/all_commit_sentiment_msgs.csv

In [79]:
%%bash
ls -ltrh data

total 18008
-rw-r--r--  1 ropf  staff   1.0M Sep 26 13:48 airflow_evo.log
-rw-r--r--  1 ropf  staff   1.7M Sep 26 13:59 airflow_evo.log.csv
-rw-r--r--  1 ropf  staff   2.7M Sep 26 14:30 airflow_evo.log_repaired.csv
-rw-r--r--  1 ropf  staff   465K Sep 26 15:47 all_commit_msgs.txt
-rw-r--r--  1 ropf  staff   585K Sep 26 15:47 all_commit_msgs.csv
-rw-r--r--  1 ropf  staff   707K Sep 26 15:54 all_commit_sentiment_msgs.csv


In [80]:
import pandas as pd


df = pd.read_csv('data/all_commit_sentiment_msgs.csv')
df.head()

Unnamed: 0,msg,polarity,subjectivity
fd8de3e48,[AIRFLOW-5555] Remove Hipchat integration (#6184),0.0,0.0
d06a95611,[AIRFLOW-5528] end_of_log_mark should not be a...,0.0,0.0
d28cf63ca,[AIRFLOW-4858] Deprecate Historical convenienc...,0.0,0.0
fe3926c10,[AIRFLOW-3871] Operators template fields can n...,0.0,0.0
9e6a582bb,[AIRFLOW-4858] Deprecate Historical convenienc...,0.0,0.0


In [81]:
df[df.polarity < -0.5]

Unnamed: 0,msg,polarity,subjectivity
a9d0c2e2f,[AIRFLOW-5287] Base image for chekclicence can...,-0.8,1.0
e515072ce,[AIRFLOW-5287] Base image for chekclicence can...,-0.8,1.0
108208add,[AIRFLOW-3272] Add base grpc hook (#4101),-0.8,1.0
8d5d46022,[AIRFLOW-3272] Add base grpc hook (#4101),-0.8,1.0
d45a4f351,[AIRFLOW-3679] Added Google Cloud Base Hook to...,-0.8,1.0
49ba1aeb6,[AIRFLOW-3679] Added Google Cloud Base Hook to...,-0.8,1.0
5d50e9b56,[AIRFLOW-XXX] Don't spam test logs with bad cr...,-0.7,0.666667
7a6f4b013,[AIRFLOW-XXX] Don't spam test logs with bad cr...,-0.7,0.666667
d4dfe2654,"[AIRFLOW-2059] taskinstance query is awful, un...",-1.0,1.0
a27bd620d,[AIRFLOW-2160] Fix bad rowid deserialization,-0.7,0.666667


In [82]:
df[df.polarity > 0.5]

Unnamed: 0,msg,polarity,subjectivity
5af870716,[AIRFLOW-3982] Update DagRun state based on it...,0.6,1.0
bbcaf29e6,[AIRFLOW-3982] Update DagRun state based on it...,0.6,1.0
fbc3b9f90,[AIRFLOW-XXX] Added Jeitto as one of happy Air...,1.0,1.0
a6d5ee9ce,[AIRFLOW-2859] Implement own UtcDateTime (#3708),0.6,1.0
6fd4e6055,[AIRFLOW-2859] Implement own UtcDateTime (#3708),0.6,1.0
76d11f24c,[AIRFLOW-102] Fix test_complex_template always...,0.7,0.1
404bee8d8,[AIRFLOW-1436][AIRFLOW-1475] EmrJobFlowSensor ...,0.75,0.95
15600e42c,[AIRFLOW-989] Do not mark dag run successful i...,0.75,0.95
3d6095ff5,[AIRFLOW-989] Do not mark dag run successful i...,0.75,0.95
3b84bcb3e,[AIRFLOW-280] clean up tmp druid table no matt...,0.533333,0.4


Possible Project: Train an NLP sentiment analysis model on a properly annotated commit history to get better results.

In [86]:
most_negative = df[df.polarity == df.polarity.min()]
print(most_negative.msg.values)
most_negative

['[AIRFLOW-2059] taskinstance query is awful, un-indexed, and does not scale']


Unnamed: 0,msg,polarity,subjectivity
d4dfe2654,"[AIRFLOW-2059] taskinstance query is awful, un...",-1.0,1.0


## What is the corresponding commit?



In [90]:
complete_df = pd.read_csv('data/airflow_evo.log_repaired.csv')
complete_df[complete_df.hash == most_negative.index[0]]

Unnamed: 0,hash,author,date,added,removed,fname,current,old,new
12491,d4dfe2654,Tao feng,2018-03-02,1.0,1.0,airflow/models.py,airflow/models/__init__.py,,


# Analysis 2: A project's bus factor

One of the early mentions of the bus factor in a real project: [_"If Guido was hit by a bus?"_](https://legacy.python.org/search/hypermail/python-1994q2/1040.html)


  > DOA = 3.293 + 1.098 ∗ FA + 0.164 ∗ DL − 0.321 ∗ ln(1 + AC)
  >
  > The degree of authorship of a developer d in a file f depends on three factors: first authorship (FA), number of deliveries (DL), and number of acceptances (AC). If d is the author of f, FA is 1; otherwise it is 0; DL is the number of changes in f made by D; and AC is the number of changes in f made by other developers.
  >
  > https://peerj.com/preprints/1233.pdf

  https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7503718
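
As a small worked example, the DOA formula quoted above can be written as a function. This is only a sketch of the formula itself; the function and parameter names are chosen here for illustration:

```python
import math


def degree_of_authorship(first_author, deliveries, acceptances):
    """DOA = 3.293 + 1.098 * FA + 0.164 * DL - 0.321 * ln(1 + AC).

    first_author: True if the developer is the author of the file (FA = 1)
    deliveries:   number of changes to the file made by the developer (DL)
    acceptances:  number of changes to the file made by other developers (AC)
    """
    fa = 1 if first_author else 0
    return 3.293 + 1.098 * fa + 0.164 * deliveries - 0.321 * math.log(1 + acceptances)


# The author of a file that nobody has changed since
print(degree_of_authorship(True, 0, 0))
# A later contributor with 5 changes, while others made 3 changes
print(degree_of_authorship(False, 5, 3))
```

Changes by other developers lower the value only logarithmically, so a first author keeps a comparatively high DOA for a long time.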

In [92]:
complete_df.head()

Unnamed: 0,hash,author,date,added,removed,fname,current,old,new
0,30c442c9b,Jarek Potiuk,2019-09-22,1.0,1.0,scripts/ci/ci_before_install.sh,scripts/ci/ci_before_install.sh,,
1,f63e4e37d,Jarek Potiuk,2019-09-22,9.0,4.0,scripts/ci/_utils.sh,scripts/ci/_utils.sh,,
2,511615c88,Jarek Potiuk,2019-09-22,1.0,1.0,scripts/ci/_utils.sh,scripts/ci/_utils.sh,,
3,1815ef32d,Jarek Potiuk,2019-09-22,51.0,2.0,scripts/ci/_utils.sh,scripts/ci/_utils.sh,,
4,1815ef32d,Jarek Potiuk,2019-09-22,1.0,16.0,scripts/ci/ci_before_install.sh,scripts/ci/ci_before_install.sh,,


In [120]:
new_rows = []

for fname in set(complete_df.current.values):
    view = complete_df[complete_df.current == fname]
    sum_series = view.groupby(['author']).added.sum()
    view_df = sum_series.reset_index(name='sum_added')
    total_added = view_df.sum_added.sum()
    
    if total_added > 0:  # For binaries there are no lines counted
        view_df['owning_percent'] = view_df.sum_added / total_added
        owning_author = view_df.loc[view_df.owning_percent.idxmax()]
        new_rows.append((fname, owning_author.author, owning_author.sum_added,
                         total_added, owning_author.owning_percent))
    # All binaries are silently skipped in this report...

In [121]:
owner_df = pd.DataFrame(new_rows, columns=['artifact', 'main_dev', 'added', 'total_added', 'owner_rate'])
owner_df

Unnamed: 0,artifact,main_dev,added,total_added,owner_rate
0,airflow/contrib/hooks/gcp_pubsub_hook.py,Jason Prodonovich,211.0,405.0,0.520988
1,airflow/kubernetes/k8s_model.py,davlum,57.0,57.0,1.000000
2,tests/operators/test_sagemaker_create_transfor...,Keliang Chen,140.0,140.0,1.000000
3,airflow/operators/dummy_operator.py,Bolke de Bruin,16.0,46.0,0.347826
4,airflow/ti_deps/deps/trigger_rule_dep.py,Dan Davydov,179.0,288.0,0.621528
5,tests/dags_with_system_exit/b_test_scheduler_d...,Paul Yang,29.0,47.0,0.617021
6,tests/www/__init__.py,Bolke de Bruin,16.0,32.0,0.500000
7,airflow/config_templates/airflow_local_setting...,AllisonWang,167.0,438.0,0.381279
8,airflow/contrib/sensors/__init__.py,Bolke de Bruin,16.0,31.0,0.516129
9,tests/gcp/utils/base_gcp_mock.py,Jarek Potiuk,47.0,70.0,0.671429


In [128]:
owner_freq_series = owner_df.groupby(['main_dev']).artifact.count()
owner_freq_series

main_dev
Aaron Keys                 3
Ace Haidrey                1
Adam Boscarino             2
Agraj Mangal               2
Aishwarya Mohan            1
Aizhamal Nurmamat kyzy     2
Ajay Yadav                13
Akshesh Doshi              4
Alan Ma                    1
Alex                       1
Alex Guziel               10
Alex Van Boxel             2
Alexander Bij              1
Alexander Petrovsky        2
AllisonWang                6
Ananya Mishra              1
Andre F de Miranda         4
Andrew Chen                4
Andrew Harmon              1
Andrii Soldatenko          2
Andy Cooper                2
Andy Hadjigeorgiou         5
Angel Gao                  1
Antoni Smoliński          23
Arthur Wiedmer            21
Ash Berlin-Taylor         13
Bartosz Ługowski           1
Bas Harenslak              3
BasPH                     25
Bob De Schutter            1
                          ..
gtoonstra                  5
inytar                     1
jgao54                     2
jj-ia

In [133]:
owner_freq_df = owner_freq_series.reset_index(name='owns_no_artifacts')
owner_freq_df.sort_values(by='owns_no_artifacts', inplace=True)
owner_freq_df

Unnamed: 0,main_dev,owns_no_artifacts
222,zhongjiajie,1
132,Raphael Lopez Kaufman,1
136,Rodrigo Chaparro Plata Hernandez,1
61,Giovanni Briggs,1
141,Shintaro Murakami,1
58,GRANT NICHOLAS,1
57,Frank Maritato,1
143,Sid Anand,1
131,Qingping Hou,1
55,Feng Lu,1


In [138]:
no_artifacts = len(owner_df.artifact)
half_no_artifacts = no_artifacts // 2
count = 0

for owner, freq in owner_freq_df.values:
    no_artifacts -= freq
    if no_artifacts < half_no_artifacts:
        break
    else:
        count += 1

busfactor = len(owner_freq_df.main_dev) - count
print(busfactor)

12
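
One way to read the loop above: owners are dropped starting with those who own the fewest artifacts, until fewer than half of the artifacts would remain; the bus factor is the number of owners left just before that point. Equivalently, it is the smallest number of top owners who together own at least half of the artifacts. A minimal standalone sketch of that reading, with a hypothetical helper `bus_factor` that takes the per-owner artifact counts:

```python
def bus_factor(owned_counts):
    """Smallest number of top owners who together own at least half of all artifacts."""
    half = sum(owned_counts) // 2  # same integer division as in the loop above
    covered = 0
    owners_needed = 0
    # Walk from the biggest owner down until half of the artifacts are covered
    for count in sorted(owned_counts, reverse=True):
        if covered >= half:
            break
        covered += count
        owners_needed += 1
    return owners_needed


# Four owners with 5, 3, 1 and 1 artifacts: the top owner alone covers half
print(bus_factor([5, 3, 1, 1]))
# Four owners with 2 artifacts each: two owners are needed
print(bus_factor([2, 2, 2, 2]))
```

With the data frames from the cells above, the input would be `owner_freq_df.owns_no_artifacts.tolist()`.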


# Your turn!

![](http://giphygifs.s3.amazonaws.com/media/11M1k4fIwVqPF6/giphy.gif)


Choose one or more software systems that are under version control with Git.

Use the version history to analy