Your Code As A Crime Scene

Here are my notes about the book written by Adam Tornhill.
I am not the author of most of the content available in this repository; I've only gathered in a single repository every resource I needed to follow the book.

Note: every command described here must be executed in a bash command line.


Chapter 1 : Welcome!

Tools needed for the examples (more details here):

The scripts folder contains Python scripts for log analysis.
The sample folder contains the various samples analyzed throughout the book.


Chapter 2 : Code as a Crime Scene

Tool: Code city
Each block is a package, each building a class. The height of a building is defined by the number of methods, and the base by the number of attributes.

Tool: MetricsTreeMap
Size and color of blocks describe how frequently a piece of code is modified.

By correlating those two dimensions (size of a class and how often it changes), we can identify hotspots that need our attention.


Chapter 3 : Creating an Offender Profile

In this chapter, we run our first analysis on Code Maat's own source code.

Extracting data from repository

  • Clone repo: git clone https://github.com/adamtornhill/code-maat.git
  • Move to the expected point in time: git checkout `git rev-list -n 1 --before="2013-11-01" master`. This command should place you on commit d804759.
  • Getting the log with stats: git log --numstat (sample output below)
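
The raw log interleaves commit metadata with one added/deleted/path line per modified file. An illustrative (not actual) excerpt:

commit <hash>
Author: <name> <email>
Date: <date>

    <commit message>

10      2       src/code_maat/core.clj
3       0       test/code_maat/core_test.clj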

Automated mining with Code Maat

  • Mining logs: git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --before=2013-11-01 > maat_evo.log
  • First inspection of the data (evaluating the number of changes): maat -l maat_evo.log -c git -a summary
statistic,value
number-of-commits,88
number-of-entities,45
number-of-entities-changed,283
number-of-authors,2
  • Change frequency analysis: maat -l maat_evo.log -c git -a revisions > maat_freqs.csv
entity,n-revs
src/code_maat/analysis/logical_coupling.clj,26
src/code_maat/app/app.clj,25
src/code_maat/core.clj,21
test/code_maat/end_to_end/scenario_tests.clj,20
project.clj,19
src/code_maat/parsers/svn.clj,19
src/code_maat/parsers/git.clj,14
...

Add the Complexity Dimension

Here, we're about to extract the number of lines of code as a complexity metric. As highlighted by the author, this metric is just as bad as any other. He chose it for now because it's language agnostic and easy to extract.

Tool used for counting lines of code: Cloc.
Extracting number of lines of code: cloc ./ --by-file --csv --quiet --report-file=maat_lines.csv

language,filename,blank,comment,code
Clojure,.\src\code_maat\analysis\logical_coupling.clj,23,14,145
Clojure,.\test\code_maat\end_to_end\scenario_tests.clj,18,19,91
Clojure,.\test\code_maat\analysis\logical_coupling_test.clj,15,5,89
Clojure,.\src\code_maat\app\app.clj,13,6,85
Clojure,.\test\code_maat\parsers\svn_test.clj,7,5,79
...

Merge Complexity and Effort

Merging: python scripts/merge_comp_freqs.py maat_freqs.csv maat_lines.csv

module,revisions,code
src\code_maat\analysis\logical_coupling.clj,26,145
src\code_maat\app\app.clj,25,85
src\code_maat\core.clj,21,35
test\code_maat\end_to_end\scenario_tests.clj,20,91
project.clj,19,17
src\code_maat\parsers\svn.clj,19,53
src\code_maat\parsers\git.clj,14,31
...

Chapter 4 : Analyze Hotspots in Large-Scale Systems

Clone Hibernate Repository

  • Clone the repo: git clone https://github.com/hibernate/hibernate-orm.git
  • Move to the expected point in time: git checkout `git rev-list -n 1 --before="2013-09-05" main`. This command should place you on commit a5705e011e.

Generate a Version-Control log

  • Mining logs (note this time we also set a beginning date to limit the scope we want to analyze): git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --before=2013-09-05 --after=2012-01-01 > hib_evo.log
  • Generate the evolution summary: maat -l hib_evo.log -c git -a summary

Choose a Timespan for your analyses

  • Between releases
  • Over iterations
  • Around significant events

Mining Hibernate

Proceed as we did for Code Maat (Chapter 3): extract change frequencies and the number of lines of code, then merge both:

  • Change frequency analysis: maat -l hib_evo.log -c git -a revisions > hib_freqs.csv
  • Extracting number of lines of code: cloc ./ --by-file --csv --quiet --report-file=hib_lines.csv
  • Merging: python scripts/merge_comp_freqs.py hib_freqs.csv hib_lines.csv

Explore the Visualization

The circle-packing visualization comes from D3.js; it uses the Zoomable Circle Packing algorithm.
Note: this is just one tool among others; the author also highlights:

  • Spreadsheets: an easy way to exploit CSV outputs.
  • The R programming language: a language designed for statistical computation and data visualization.

Here we'll be looking at a prepared example. To launch the hotspot visualization:

  • Go into the sample\hibernate directory
  • To avoid some security restriction issues in your browser, run python -m SimpleHTTPServer 8888 or python -m http.server 8888, depending on your Python version.
  • Then open http://localhost:8888/hibzoomable.html

To transform Code Maat's CSV output into JSON for D3.js, use: python csv_as_enclosure_json.py -h
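
For example, to build the hotspot JSON for the Hibernate analysis above (the --structure and --weights option names are what the script's help lists; double-check them with -h):

python scripts/csv_as_enclosure_json.py --structure hib_lines.csv --weights hib_freqs.csv > hib_hotspots.json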


Chapter 5 : Judge Hotspots with the Power of Names

For cognitive reasons, we put names on things to reduce the cognitive load while still expressing complex concepts: this is called chunking.
Names can be good indicators to identify hotspots: are they descriptive (e.g., TcpListener) or clumsy (e.g., StateManager)?
Combined with change frequency and number of lines of code, names can help us reduce the number of potential offenders.
The author highlights this as a heuristic: it's not perfect and you can still have false positives.


Chapter 6 : Calculate Complexity Trends from Your Code's Shape

Finding hotspots and acting on them may require several passes, so we need to look at the evolution of the code.
Here we'll be using the code's shape, via indentation, to measure hotspot complexity and trends over time. Heavy indentation might highlight complex conditional flows.

Whitespace analysis of complexity

On the Hibernate folder used on the previous chapter: python scripts/complexity_analysis.py hibernate-core/src/main/java/org/hibernate/cfg/Configuration.java

n,total,mean,sd,max
3335,8204.25,2.46,1.6,14.25
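
The metric itself is simple enough to sketch. Below is a minimal Python version of the idea (my own sketch, not the repository's script; it assumes one logical indent per four spaces or per tab, and uses the sample standard deviation):

import statistics
import sys

def indentation_complexity(path, spaces_per_indent=4):
    # Measure each non-blank line's leading whitespace in logical indents.
    depths = []
    with open(path) as source:
        for line in source:
            expanded = line.expandtabs(spaces_per_indent)
            if expanded.strip():
                leading = len(expanded) - len(expanded.lstrip(' '))
                depths.append(leading / spaces_per_indent)
    return (len(depths), sum(depths), statistics.mean(depths),
            statistics.stdev(depths), max(depths))

if __name__ == '__main__':
    n, total, mean, sd, peak = indentation_complexity(sys.argv[1])
    print('n,total,mean,sd,max')
    print(f'{n},{total:.2f},{mean:.2f},{sd:.2f},{peak:.2f}')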

Analyze Complexity Trends in Hotspots

Manny Lehman's laws of software evolution: the more you change the code and add features, the more the code's complexity increases, unless specific work is done to reduce it.

Now, to analyze complexity trends over a period: python scripts/git_complexity_trend.py --start ccc087b --end 46c962e --file hibernate-core/src/main/java/org/hibernate/cfg/Configuration.java

rev,n,total,mean,sd
e75b8a77b1,3080,7735.75,2.51,1.73
23a62802c8,3092,7774.75,2.51,1.73
89911003e3,3100,7783.75,2.51,1.73
8373871c30,3101,7783.75,2.51,1.73
fa1183f3f9,3101,7783.75,2.51,1.73
...

Then you can use the spreadsheet of your choice to visualize trends with graphs.
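
If you'd rather script it, a small matplotlib sketch can do the same job (matplotlib and the hib_trend.csv file name are my assumptions; save the trend script's output to that file first):

import csv
import matplotlib.pyplot as plt

# Read the complexity trend produced by git_complexity_trend.py.
with open('hib_trend.csv') as f:
    rows = list(csv.DictReader(f))

total = [float(row['total']) for row in rows]
mean = [float(row['mean']) for row in rows]

# Two stacked plots: total and mean complexity per revision.
fig, (top, bottom) = plt.subplots(2, sharex=True)
top.plot(total)
top.set_ylabel('total complexity')
bottom.plot(mean)
bottom.set_ylabel('mean complexity')
bottom.set_xlabel('revision (oldest to newest)')
plt.show()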

When total complexity increases, it can mean more indentation (and complexity) or more lines of code.
The standard deviation (sd column) describes code consistency: the lower, the better.
For a high total complexity, a low standard deviation means a lot of code with low overall complexity (lines of code, not nesting, drive the total). The mean should show a similar trend.

Complexity trend can be:

  • Increasing: that's a warning sign
  • Decreasing: some refactoring has been done to reduce complexity
  • Stable: few modifications over time

Chapter 7 : Treat Your Code As a Cooperative Witness

Now we're looking at the high-level design of the system: we're chasing hidden dependencies and learning the concept of temporal coupling.

The human brain suffers from a lot of biases: the way you're asked a question may result in different answers, as the question influences memory access and associations, and can even create false memories.
That's why we need to extract the real evolution of the code from the code base and our version control tools.

You have temporal coupling when modules change together; note that those modules may not have any static dependency visible to the compiler.


Chapter 8 : Detect Architectural Decay

Analyze Temporal Coupling

Sum of Coupling: looking at how many times a module has been coupled to another in a commit, and summing it up.
Measure the Sum of Coupling: maat -l maat_evo.log -c git -a soc

entity,soc
src/code_maat/app/app.clj,105
test/code_maat/end_to_end/scenario_tests.clj,97
src/code_maat/core.clj,93
project.clj,74
src/code_maat/analysis/authors.clj,72
src/code_maat/parsers/svn.clj,67
...

Measure the Temporal Coupling: maat -l maat_evo.log -c git -a coupling

entity,coupled,degree,average-revs
src/code_maat/parsers/git.clj,test/code_maat/parsers/git_test.clj,83,12
src/code_maat/analysis/entities.clj,test/code_maat/analysis/entities_test.clj,76,7
src/code_maat/analysis/authors.clj,test/code_maat/analysis/authors_test.clj,72,11
src/code_maat/analysis/logical_coupling.clj,test/code_maat/analysis/logical_coupling_test.clj,66,20
test/code_maat/analysis/authors_test.clj,test/code_maat/analysis/test_data.clj,66,8
src/code_maat/parsers/svn.clj,test/code_maat/parsers/svn_test.clj,64,14
src/code_maat/app/app.clj,src/code_maat/core.clj,60,23
...

The output is composed of:

  • entity: the coupled module
  • coupled: the counterpart module
  • degree: the percentage of shared commits where these modules are coupled
  • average-revs: a weighted number of total revisions for these modules

Note: you can have a high average-revs and a low degree; it means there are a lot of revisions, but only a few are shared by the two modules. Conversely, a high degree and low average-revs means strong coupling between otherwise stable modules. For instance, app.clj and core.clj above share 60% of their commits over a weighted average of 23 revisions.

Suggested tool: Evolution Radar

Yet, temporal coupling suffers from limitations and biases. The code base may be maintained by several teams, so including a timespan to measure coupling might be necessary (see Chapter 12). Also, some important coupling occurs between commits, and then you have to dig into the code. Finally, renaming a module resets its counters in Code Maat; it can sound problematic, but it's also a good signal that some refactoring happened.

Catch Architectural Decay

Manny Lehman also expressed another law: a program that is used undergoes continual change or becomes progressively less useful.

We're going to use a new repo for the analysis.
Clone the repo: git clone https://github.com/SirCmpwn/Craft.Net.git
Extract logs: git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --before=2014-08-08 > craft_evo_complete.log
Measure the Sum of Coupling: maat -l craft_evo_complete.log -c git -a soc

The first module observed, MinecraftServer, seems to have a lot of coupling. We're going to run a trend analysis on this module.

The first activity for this module is in 2012, so we extract logs for the first year as an initial period: git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --before=2013-01-01 > craft_evo_130101.log
Then we run a temporal coupling analysis: maat -l craft_evo_130101.log -c git -a coupling > craft_coupling_130101.csv

Then we repeat the process for the period from 2013 to 2014:

  • git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --after=2013-01-01 --before=2014-08-08 > craft_evo_140808.log
  • maat -l craft_evo_140808.log -c git -a coupling > craft_coupling_140808.csv

We can now open both CSV files in a spreadsheet app and remove everything not coupled to MinecraftServer (or use the grep shortcut below).
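
As a quicker alternative to manual filtering in a spreadsheet (standard head and grep, keeping the CSV header):

  • head -1 craft_coupling_130101.csv > minecraft_coupling_130101.csv
  • grep MinecraftServer craft_coupling_130101.csv >> minecraft_coupling_130101.csv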

React to Structural Trends

Now we can use the enclosure diagram (see Chapter 4) and compare the results visually.
Dependencies between modules that are close to each other aren't an issue; we're looking for temporal coupling across (package/project) boundaries, between modules in distant parts of the system.
You can run such an analysis as a routine on your project (each iteration, for example) and spot decay early.


Chapter 9 : Build a Safety Net for Your Architecture

Know What's in an Architecture

Automated tests act as a safety net for the software: they're supposed to allow developers to modify the software while ensuring there are no regressions.
But tests create dependencies on the tested code, so we have to choose those dependencies carefully. The author argues that automated tests should be considered like any other module of the system and designed with the same attention.

Analyze the Evolution on a System Level

We've looked at temporal coupling between individual modules; now we're focusing on the system-level boundary between production code and automated tests.

To tell Code Maat what is production code and what is test code, we have to specify a transformation.
Create a file maat_src_test_boundaries.txt in the repository root, then type in:

src/code_maat => Code
test/code_maat => Test

Note: this file must use LF line endings instead of CRLF.

Finally, perform the analysis: maat -l maat_evo.log -c git -a coupling -g maat_src_test_boundaries.txt

We can observe a high level of coupling between code and tests, but the tests group together different kinds of tests. We must make a distinction.

Differentiate Between the Level of Tests

Now, put in maat_src_test_boundaries.txt file:

src/code_maat => Code
test/code_maat/analysis => Analysis Test
test/code_maat/dataset => Dataset Test
test/code_maat/end_to_end => End to end Test
test/code_maat/parsers => Parsers Test

Then re-run the analysis.

In this example, the Analysis and Parsers tests are unit tests; they change approximately 40% of the time with the code. That sounds reasonable. Dataset isn't displayed, as its numbers are below the default threshold.
End to end tests also change 40% of the time, but they're supposed to be more stable. That's a bad signal.
In this specific case, the author explains he changed the output format several times but didn't encapsulate it as an implementation detail, so every time he made a change, he had to modify the end to end tests.

Create a Safety Net for Your Automated Tests

You can monitor revisions for each boundary: maat -l maat_evo.log -c git -a revisions -g maat_src_test_boundaries.txt
By collecting several sample points, we can start to see trends. We can also observe the evolution of code/test change ratio.
Every time the ratio shifts toward more test changes, run coupling and hotspot analyses to help you understand the problem.

We can also spot clusters of tests that change together: it's a sign that some test refactoring is needed.


Chapter 10 : Use Beauty as a Guiding Principle

Learn Why Attractiveness Matters

Here, beauty should be interpreted as "the absence of ugliness".

Write Beautiful Code

Translated to code, the absence of ugliness means the absence of special cases.
Beautiful code is consistent code, in terms of code style, conventions, level of expression, etc.
Every time the code base diverges from that consistency, it breaks the reader's expectations and introduces cognitive cost. The result is a code base that is harder to understand and riskier to modify.
This principle also applies at the architecture level, where it is even more important than in local coding constructs.

Avoid Surprises in Your Architecture

Code Maat is built following the Pipes and Filters pattern, so we should expect low temporal coupling between filters.

In a file maat_pipes_filters_boundaries.txt:

src/code_maat/parsers => Parse
src/code_maat/analysis => Analyze
src/code_maat/output => Output
src/code_maat/app => Application

Then perform the temporal coupling analysis: maat -l maat_evo.log -c git -a coupling -g maat_pipes_filters_boundaries.txt

The result doesn't highlight any strong coupling between filters, but two of them are coupled to the Application component.
The reason is that Application contains conditional logic to choose the parser and the analysis to execute. This could become a problem as the software grows with more options.

Analyze Layered Architectures

The transformation file doesn't have to mirror the code structure. You can ignore minor utility modules and focus on what can break your target architecture.

A new case study: the NopCommerce open source product. It's built following the MVC pattern.

Find Surprising Change Patterns

Clone the new repo: git clone https://github.com/nopSolutions/nopCommerce.git
Then extract logs: git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --after=2014-01-01 --before=2014-09-25 > nop_evo_2014.log

Note: You may encounter the following error:

warning: inexact rename detection was skipped due to too many files.
warning: you may want to set your diff.renameLimit variable to at least 1954 and retry the command.

Use git config diff.renameLimit 1954 to solve it.

Use the following transformation file arch_boundaries.txt:

src/Presentation/Nop.Web/Administration/Models      => Admin Models
src/Presentation/Nop.Web/Administration/Views       => Admin Views
src/Presentation/Nop.Web/Administration/Controllers => Admin Controllers
src/Libraries/Nop.Services                          => Services
src/Libraries/Nop.Code                              => Core
src/Libraries/Nop.Data                              => Data Access
src/Presentation/Nop.Web/Models                     => Models
src/Presentation/Nop.Web/Views                      => Views
src/Presentation/Nop.Web/Controllers                => Controllers

And run coupling analysis: maat -l nop_evo_2014.log -c git -a coupling -g arch_boundaries.txt

Note: a few commit messages spanned several lines; I had to fix them manually in nop_evo_2014.log in order to execute the analysis without failure.

The result highlights several couplings, many of them involving the admin modules.

Run a Hotspots analysis: maat -l nop_evo_2014.log -c git -a revisions -g arch_boundaries.txt

It shows us that the Service layer is the most volatile, and temporal coupling tells us that 35% of those revisions also modify the Admin Controllers. But with this data, we can't know whether Service changes drive Admin Controllers modifications or the other way around.

Expand Your Analyses

You can now use temporal coupling as an early warning system. Define the rules you want to protect, then run the analysis on a regular basis.
If the trend evolves in an unexpected way, run a hotspot analysis to investigate.


Chapter 11 : Norms, Groups, and False Serial Killers

Social interactions and team organization are as influential as software architecture in producing bugs and decay.

Learn Why the Right People Don't Speak Up

Process loss is the theory that groups can't operate at 100% efficiency: there are losses from coordination and motivation.
In software development we must accept some losses, as systems are too big to be developed by a single person.

Social biases make you influenced by others: their attitude, how confident they look, etc.

Understand Pluralistic Ignorance

Pluralistic ignorance is when everyone privately rejects a rule but thinks that others support it. It can lead to teams following rules that no one wants to follow.
An individual can also influence decisions just by repeating their opinion: just by hearing it more often, we tend to find it more valuable.

Social biases are hard to avoid; the best ways to do so are:

  • ask questions
  • talk to people
  • use data to support decisions

If you're in a leadership position, there are additionnal solutions:

  • use an outside expert to review decisions
  • let subgroups work independently on the same problem
  • avoid advocating a specific solution early in the discussions
  • discuss worst-case scenarios and build team risk awareness
  • plan a second meeting to reconsider the decisions made in the first one

All those strategies are useful to avoid groupthink. Groupthink is when a group has suppressed all internal forms of dissent; it leads to a false sense of consensus, ignoring alternatives and risks.

Witness Groupthink in Action

Some workshop formats like brainstorming are supposed to promote high group creativity. In reality, they generate a lot of social biases and tend to make groups less creative than expected.

Discover Your Team's Modus Operandi

Every team has its own way of working. Even if it can't be observed directly, commit logs can provide some useful information.

To extract only commit logs: git log --pretty='%s'
You can then use tools to visualize them as word clouds; here are some examples.
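
If you just want rough word frequencies before reaching for a cloud tool, a classic shell pipeline is enough (no stop-word filtering here):

git log --pretty='%s' | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -20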

The more a word is present, the more the team is doing that activity. You can double-check your discoveries with a temporal coupling analysis.

Friendly reminder: this kind of analysis is a tool made to understand and support decisions; it doesn't replace real discussions and team interactions.


Chapter 12 : Discover Organizational Metrics in Your Codebase

Let's Work in the Communication Business

"Adding manpower to a later software project makes it later" - The Mythical Man-Month: Essays on Software Engineering

Software development is intellectual work that is hard to parallelize: adding more people to the team won't accelerate development. Even worse, the more people you add to a team, the more you increase coordination costs, and those tend to increase exponentially.
Additionally, a large group suffers from process loss and from a diluted sense of responsibility for the overall goal.

Find the Social Problems of Scale

Adding developers isn't necessarily a bad thing, as long as the architecture allows them to work on separate pieces of code. Troubles start when some hotspots accumulate responsibilities and force developers to edit the same code for different reasons.

We can run an author analysis: maat -l hib_evo.log -c git -a authors

entity,n-authors,n-revs
hibernate-core/src/main/java/org/hibernate/persister/entity/AbstractEntityPersister.java,14,44
libraries.gradle,11,28
hibernate-core/src/main/java/org/hibernate/internal/SessionImpl.java,10,39
hibernate-core/src/main/java/org/hibernate/loader/Loader.java,10,23
hibernate-core/src/main/java/org/hibernate/mapping/Table.java,9,28
build.gradle,8,79
...

The result lists the modules with their number of authors and revisions.

A study of Windows Vista's code showed that organizational structure has a huge impact on overall code quality.
The number of authors was one of the social metrics, and it outperformed technical metrics such as code complexity or coverage in predicting defects.

Measure Temporal Coupling over Organizational Boundaries

If a module is highlighted by both hotspot and authors analysis, then we're facing a piece of code that is probably tricky and affects most developers: a good candidate for some rework!

Conway's law: "Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization's communication structure."
This famous law can be interpreted in two distinct ways:

  • "If you have four groups working on a compiler, you'll get a 4-pass compiler"
  • The reverse one: how to structure organization to match a specific architecture?

To analyze temporal coupling over organizational boundaries, we need to consider commits on the same day as part of a logical change set.
To do so: maat -l hib_evo.log -c git -a coupling --temporal-period 1

entity,coupled,degree,average-revs
hibernate-core/src/main/java/org/hibernate/persister/entity/EntityPersister.java,hibernate-core/src/test/java/org/hibernate/test/legacy/CustomPersister.java,100,6
hibernate-core/src/main/java/org/hibernate/persister/entity/EntityPersister.java,hibernate-core/src/test/java/org/hibernate/test/cfg/persister/GoofyPersisterClassProvider.java,92,7
hibernate-core/src/test/java/org/hibernate/test/cfg/persister/GoofyPersisterClassProvider.java,hibernate-core/src/test/java/org/hibernate/test/legacy/CustomPersister.java,92,7
...

With this analysis, we measure the probability that coupled modules change within the same day.

The next step is to identify the main developers of the coupled modules. We can then compare this to the formal organization and reason about communication.

Evaluate Communication Costs

To identify the main developer of a module, we could look at the number of lines added, but that could promote "copy-paste cowboys".
Code Maat offers the refactoring-main-dev analysis to identify the developer who removed the most lines of code, as that is probably someone who takes an active part in the module's maintenance and refactoring.

To identify the main developers: maat -l hib_evo.log -c git -a main-dev > main_devs.csv

entity,main-dev,added,total-added,ownership
.gitignore,Galder Zamarreño,7,8,0.88
CONTRIBUTING.md,Steve Ebersole,55,56,0.98
README.md,Hardy Ferentschik,37,46,0.8
build.gradle,Steve Ebersole,399,633,0.63
buildSrc/Readme.md,Steve Ebersole,178,178,1.0
buildSrc/build.gradle,Strong Liu,8,15,0.53
...

We can now identify the main developer and their degree of ownership of both the hotspot and the modules it's coupled to.
If those modules are all "owned" by the same person with strong ownership, then it's fine from the organizational point of view. If they're "owned" by different people, we should consider several things: Are they on the same team? At the office or remote? In the same time zone?

To calculate individual contributions: maat -l hib_evo.log -c git -a entity-ownership

entity,author,added,deleted
.gitignore,Galder Zamarreño,7,0
.gitignore,Strong Liu,1,0
CONTRIBUTING.md,Steve Ebersole,55,7
CONTRIBUTING.md,Strong Liu,1,1
README.md,Strong Liu,6,2
README.md,Hardy Ferentschik,37,17
README.md,Steve Ebersole,3,3
build.gradle,Steve Ebersole,399,291
build.gradle,Brett Meyer,69,12
build.gradle,Strong Liu,117,158
build.gradle,Gunnar Morling,24,2
build.gradle,brmeyer,7,5
...

Author note: the same developer can appear under two different committer names in Git (see Brett Meyer and brmeyer above), which skews the results; always look at the logs and take time to clean them up before running an analysis.

Take It Step by Step

To recap:

  1. Identify parallel work
  2. Compare against hotspots
  3. Identify temporal coupling
  4. Find the main developers
  5. Check organizational distance
  6. Optimize for communication

The last step can be achieved either by changing the organizational structure or the software architecture.


Chapter 13 : Build a Knowledge Map of Your System

Know Your Knowledge Distribution

In the previous chapter we identified the authors of a module and measured ownership metrics to find who may hold the most knowledge of that module.
But this metric doesn't tell us whether there is one main contributor or several who maintain the overall consistency of the module.
To find out, we have to measure individual contributions: maat -l hib_evo.log -c git -a entity-effort

entity,author,author-revs,total-revs
.gitignore,Galder Zamarreño,1,2
.gitignore,Strong Liu,1,2
CONTRIBUTING.md,Steve Ebersole,4,5
CONTRIBUTING.md,Strong Liu,1,5
README.md,Strong Liu,1,6
README.md,Hardy Ferentschik,4,6
README.md,Steve Ebersole,1,6
build.gradle,Steve Ebersole,48,79
build.gradle,Brett Meyer,8,79
build.gradle,Strong Liu,16,79
...

To improve the visualization of the result, we can use fractal figures.

You can then observe three different patterns:

  • Single developer: the easiest pattern; quality depends only on the expertise of that developer.
  • Multiple, balanced developers: a few developers, one of them with clear ownership. The stronger the ownership, the fewer defects in the code and the better the quality.
  • Collective chaos: a lot of minor contributors; this is a strong predictor of defects.

Here we ran the analysis on individual modules; we can run the same analysis at the architectural level by specifying boundaries (the -g parameter).

Grow Your Mental Maps

We can build a map with modules and associated main contributors.
In the scala sample directory:

  • Run python -m http.server 8888
  • Then open http://localhost:8888/scala_knowledge.html

We can now easily see whom to ask if we want to work on a specific module.

Investigate Knowledge in the Scala Repository

To build such a map:

  • Clone the Scala repository: git clone https://github.com/scala/scala.git
  • Check the branch: git status; in my case I am on branch 2.13.x
  • Go back in time for predictable results: git checkout `git rev-list -n 1 --before="2013-12-31" origin/2.13.x`
  • Extract logs: git log --pretty=format:'[%h] %an %ad %s' --date=short --numstat --before=2013-11-01 --after=2011-12-31 > scala_evo.log
  • Extract main devs: maat -l scala_evo.log -c git -a main-dev > scala_main_devs.csv
  • Count the number of lines: cloc ./ --by-file --csv --quiet --report-file=scala_lines.csv

At this point we have everything we need to build the map. Note we've limited the period of time, as knowledge decays when a module is no longer edited.

We can use a tool like ColorBrewer to generate a good color scheme.
Colors should be specified as HTML5 color names.
We can then build an author/color mapping like the one in the sample directory; the format is shown below.
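
The mapping is a simple two-column CSV; an illustrative scala_author_colors.csv (names and colors picked arbitrarily here) starts like:

author,color
Martin Odersky,darkred
Paul Phillips,orange
...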

Finally, we can build our knowledge map: python scripts/csv_main_dev_as_knowledge_json.py --structure scala_lines.csv --owners scala_main_devs.csv --authors scala_author_colors.csv > scala_knowledge_131231.json

We can now use D3.js to visualize the result (in scala_knowledge.html, replace the reference to the JSON file with the one you've generated).

Visualize Knowledge Loss

Documentation, reviews, etc. can't replace the intricate knowledge gained from working on a piece of code. That's why the number of ex-developers who worked on a component is a good predictor of the number of post-release defects.

In the Scala repository, we know that Paul Phillips was a main contributor who chose to leave.
Let's identify the abandoned code:

  • Create a new scala_ex_programmers.csv file with the content:
author,color
Paul Phillips,green
  • Generate the json visualization: python scripts/csv_main_dev_as_knowledge_json.py --structure scala_lines.csv --owners scala_main_devs.csv --authors scala_ex_programmers.csv > scala_knowledge_loss.json

Now we can visualize every piece of code "abandoned" as a result of Paul's departure: it appears in green.


Chapter 14 : Dive Deeper with Code Churn

Cure the Disease, Not the Symptoms

Parallel work is an issue if it occurs on the same modules: merging can become more time consuming than the initial work.
Code churn refers to a family of measures indicating the rate at which code evolves.

Discover Your Process Loss from Code

By analyzing churn, we can detect problems in our development process.

To measure code churn trend: maat -c git -l maat_evo.log -a abs-churn

date,added,deleted
2013-08-09,259,20
2013-08-11,146,70
2013-08-12,213,79
2013-08-13,126,23
2013-08-15,334,118
...

It returns the added and deleted lines of code per commit date. Values alone aren't interesting; trends are.
If we observe unexplained peaks, we must dig into the source control logs to understand why.

Code churn patterns:

  • On a stabilizing code base, we expect decreasing churn over time.
  • Abnormal peaks mean an event occurred; we can also check whether a time pattern matches organizational events (like the end of a sprint).
  • Increasing code churn is a bad signal. It means code quality is at risk and a high probability of defects is to be expected.

Investigate the Disposal Sites of Killers and Code

We have to make a lot of decisions early in a project's life. The tools we have let us spot things when they start moving in the wrong direction; for example, temporal coupling may highlight unexpected modification patterns.

To measure code churn per module: maat -c git -l craft_evo_140808.log -a entity-churn

entity,added,deleted
source/Craft.Net.Networking/Packets.cs,4263,3911
source/Craft.Net.Server/MinecraftServer.cs,727,786
Craft.Net/Packets.cs,676,186
source/Craft.Net.Client/Session.cs,638,499
Craft.Net.Server/MinecraftServer.cs,635,612
...

Then we can combine it with our previous temporal coupling analysis (from Chapter 8). By doing so, we can observe churn in the dependencies of a module.
A growing, highly coupled dependency is a warning sign and should be addressed quickly, before further architectural decay.

Predict Defects

Code churn isn't a problem in itself; it's a symptom of changing code. But as changing code can result in defects, code churn can be a good predictor.

So far we used the number of revisions as a metric for hotspot analysis, but it can suffer from some biases:

  • different commit styles
  • long-lived branches
  • squash

To avoid those biases, code churn is a good alternative.
To use it, combine the results of an entity-churn analysis and a complexity analysis; the overlapping modules can be identified as hotspots (see the example below).
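
With the commands already used in these notes, that could look like (the exact workflow is my reading of the chapter, file names illustrative):

  • maat -l hib_evo.log -c git -a entity-churn > hib_churn.csv
  • python scripts/complexity_analysis.py hibernate-core/src/main/java/org/hibernate/cfg/Configuration.java

Then cross-reference the results: modules that rank high on both churn and complexity are the hotspot candidates.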

Code churn also has its limitations:

  • Generated code: can be filtered out (see the example after this list)
  • Refactoring: code churn doesn't make a distinction between adding new features and refactoring existing ones
  • Superficial changes: renaming, rearranging code, etc.
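
For example, if generated code lives under a known path (a hypothetical one here), you can strip it from the log before running Code Maat; grep -v drops the matching numstat lines:

grep -v "src/generated/" hib_evo.log > hib_evo_filtered.log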

Chapter 15 : Toward the Future

Let Your Questions Guide Your Analysis

Hotspot analysis is the simplest tool we can use; combined with temporal coupling, it lets us spot architectural decay.
If we know the software well, we can also define boundaries to improve our analyses.
If we need more data, we can add code churn analysis.
And finally, we can analyze the social environment of the software.

Building a knowledge map is a powerful tool: it keeps track of who holds the knowledge and works as a communication aid.
Also, keep an eye on parallel work, as it's a good predictor of defects; if needed, act on it at the organizational level.

Take Other Approaches

We can investigate more than just source code: we can use everything that is under version control. For example, we can track temporal coupling between a document (maybe containing requirements) and the code.

The analysis can also go beyond file-level granularity: Michael Feathers, for example, uses source control data to spot violations of the single responsibility principle.

A developer network map can be built from code revisions: each time we edit a piece of code, we get linked to the developers who also worked on it; the more edits, the stronger the link.
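
A minimal Python sketch of that idea, built on the entity-ownership output from Chapter 12 (the input file name and the pairwise weighting scheme are my assumptions):

import csv
from collections import defaultdict
from itertools import combinations

# Map each module to the set of developers who touched it,
# e.g. from: maat -l hib_evo.log -c git -a entity-ownership > hib_ownership.csv
developers_per_entity = defaultdict(set)
with open('hib_ownership.csv') as f:
    for row in csv.DictReader(f):
        developers_per_entity[row['entity']].add(row['author'])

# Link every pair of developers who edited the same module;
# the more modules they share, the stronger the link.
link_strength = defaultdict(int)
for developers in developers_per_entity.values():
    for pair in combinations(sorted(developers), 2):
        link_strength[pair] += 1

# Print the ten strongest links.
for (dev_a, dev_b), weight in sorted(link_strength.items(), key=lambda kv: -kv[1])[:10]:
    print(f'{dev_a} -- {dev_b}: {weight} shared modules')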

Don't hesitate to build your own tools to match your specific needs. We can also check out Moose to build analyses.

Adapt your practices and/or tools to support pair programming.


Appendix : Refactoring Hotspots

Small increments are the safest way to improve a design; they allow experimentation and rollbacks.
Grouping functions by task will improve cohesion and readability.

Wishful thinking: defer data representation and imagine you have all the functions you need to solve the problem in the simplest possible way. Then it's all about experimentation.

Do not hesitate to turn off syntax highlighting to avoid distraction during such experimentations.
